To appear in: Neurocomputing. Received 3 September 2015; revised 9 July 2016; accepted 28 August 2016. DOI: http://dx.doi.org/10.1016/j.neucom.2016.08.089

Gene selection using information gain and improved simplified swarm optimization

Chyh-Ming Lai1*, Wei-Chang Yeh2 and Chung-Yi Chang2
1 The Institute of Resources Management and Decision Science, Management College, National Defense University, Taipei 112, Taiwan
2 Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu 300, Taiwan
* Corresponding author: Chyh-Ming Lai. E-mail: [email protected]

Abstract
Gene selection (also called feature selection in data mining) has recently played an important role in the development of efficient cancer diagnosis and classification, because gene expression data are described by a huge number of measured variables (genes) while only a small number of them present distinct profiles for the different classes of samples. The gene selection problem involves removing irrelevant, redundant and noisy genes and identifying the most discriminative genes so as to improve classification accuracy. In this paper, a hybrid filter/wrapper method, called IG-ISSO, is proposed for the gene selection problem. In this method, information gain (IG) is applied as a filter to select the most informative genes, and an improved simplified swarm optimization (ISSO) is proposed as a gene search engine to guide the search for an optimal gene subset. A support vector machine (SVM) with a linear kernel serves as the classifier of IG-ISSO. To evaluate the performance of the proposed method empirically, experiments are carried out on ten gene expression datasets, and the corresponding results are compared with those of up-to-date works. The results of the statistical analysis indicate that the proposed method is better than its competitors.

Keywords: gene selection; feature selection; simplified swarm optimization; information gain

1. Introduction

Microarray technology helps scientists to investigate and measure the expression levels of thousands of genes simultaneously and produces gene expression data. These data contain useful information with great potential for cancer diagnosis and classification and have given rise to a new area of research in bioinformatics. However, gene expression data cause difficulties for the development of an efficient classifier due to their high dimensionality, large number of irrelevant genes, and small sample size. To overcome this challenge, gene selection, also called feature selection, which is a data preprocessing step in data mining, can be considered an effective and efficient remedy. The aim of gene selection is to remove irrelevant, redundant and noisy genes and to find an optimal subset of genes that maximizes the accuracy of a classifier [1-3].

Gene selection methods are mainly divided into two categories: filter and wrapper methods [2-4]. Filter methods evaluate the relevance of each gene to the target class by taking into account only the intrinsic properties of the dataset [5-11]. A relevance score is usually calculated for each gene, lower-scoring genes are removed from the dataset, and the remaining genes are adopted as the input of the classification algorithm. Filter methods are independent of the classifier and can easily be incorporated with different classifiers. In addition, they are more easily implemented and more efficient than wrapper methods, but the lack of interaction with the classifier results in worse classification performance [2, 3]. On the contrary, wrapper methods integrate a search algorithm with a classifier into the gene selection process to find an optimal gene subset according to the prediction accuracy [4, 12-20]. Compared to filter methods, wrapper methods often yield better classification performance but require a higher computational cost [2, 3]. To mitigate this issue, some works that combine the filter technique with the wrapper technique into a hybrid filter/wrapper gene selection method have obtained promising results more efficiently than pure wrapper methods [21-24]. However, the hybrid technique is still in its infancy, and substantial future investigation is necessary to develop hybrid methods for gene selection.

Simplified swarm optimization (SSO) is an emerging, population-based optimization algorithm proposed by Yeh in 2009 [25] to compensate for the deficiencies of particle swarm optimization (PSO) in solving discrete problems. This algorithm is particularly suitable for solving multi-variable optimization problems because of its simplicity, efficiency, and flexibility. Empirical results reveal that SSO has better convergence to quality solutions than PSO and the genetic algorithm (GA). SSO has therefore attracted considerable attention and has been widely used in a range of applications [26-33].

This paper proposes a novel hybrid filter/wrapper method based on SSO, offering an alternative for solving the gene selection problem. In the proposed method, called IG-ISSO, information gain (IG) is used as a filter to select the most informative genes based on their IG values, and the gene space of the search engine is defined by those selected genes. Then, an improved simplified swarm optimization (ISSO) is proposed as a gene search engine to guide the search for an optimal gene subset. Finally, a support vector machine (SVM) with a linear kernel, evaluated using leave-one-out cross-validation (LOOCV), serves as the classifier to assess the performance of IG-ISSO.

The rest of this paper is organized as follows. Overviews of IG, SSO and SVM are given in Section 2. The proposed IG-ISSO and its overall procedure are detailed in Section 3. The two experiments and the statistical analysis implemented for validating IG-ISSO are illustrated in Section 4. Finally, the conclusions are presented in Section 5.

2. Related Work

2.1 Information gain (IG)

Information gain is a measure based on the entropy of a system [10, 11, 34-36]. It is mainly applied in gene selection as a filter to rank genes by their IG values [23, 37, 38]. The IG value of a gene represents its relevance to the dataset; that is, a higher IG value means that the gene contributes more information. Assume that a dataset has N = {1, 2, ..., n} instances with k classes. Let P(C_i, N) denote the proportion of C_i in N, where C_i, i = 1, 2, ..., k, is the set of instances that belong to the ith class. The entropy of the dataset can be calculated by:

Entropy(N) = -\sum_{i=1}^{k} P(C_i, N) \log P(C_i, N)    (1)

If a gene g has V = {v_1, v_2, ..., v_m} distinct values, and N_j denotes the subset of N in which g = v_j, the entropy of the dataset with respect to gene g is given by:

Entropy_g(N) = \sum_{j=1}^{m} \frac{|N_j|}{|N|} \, Entropy(N_j)    (2)

Finally, the IG value of gene g can be derived by:

IG(g) = Entropy(N) - Entropy_g(N)    (3)
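For illustration, Eqs. (1)-(3) can be computed as in the following minimal Python sketch (the paper's own implementation is in MATLAB, see Section 4). It assumes both inputs are NumPy arrays and that the gene values have already been discretized into a small number of distinct values, as Eq. (2) requires; the function names and the use of log base 2 are assumptions for this example.

```python
import numpy as np

def entropy(labels):
    """Entropy of a set of class labels, Eq. (1) (log base 2 assumed)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(gene_values, labels):
    """IG of one discretized gene, Eqs. (2)-(3)."""
    n = len(labels)
    conditional = 0.0
    for v in np.unique(gene_values):
        subset = labels[gene_values == v]          # N_j: instances with g = v_j
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional
```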

2.2 Simplified swarm optimization (SSO)

Similar to PSO, each solution in SSO is encoded as a finite-length string and is characterized by a variable vector with a fitness value. SSO is also initialized with a population of random solutions within the search space, and the search for optimal solutions is then guided by the update mechanism shown in Eq. (4).


x_{ij}^{t} =
\begin{cases}
x_{ij}^{t-1} & \text{if } r \in [0, C_w) \\
p_{ij}       & \text{if } r \in [C_w, C_p) \\
g_{j}        & \text{if } r \in [C_p, C_g) \\
x            & \text{if } r \in [C_g, 1]
\end{cases}    (4)

where x_{ij}^{t} is the value of the jth variable in the ith solution at iteration t. p_i = (p_{i1}, p_{i2}, ..., p_{id}), where d is the total number of variables in the problem domain, represents the best solution found in the history of solution i (known as pBest). The best solution among all solutions is called gBest and is denoted by g = (g_1, g_2, ..., g_d), where g_j is the jth variable of gBest. x is a new randomly generated feasible value for the jth variable, and r is a uniform random number in [0, 1]. C_w, C_p and C_g are three predetermined parameters that construct four interval probabilities; that is, C_w, C_p - C_w, C_g - C_p and 1 - C_g are the probabilities that the updated variable is taken from the current solution, pBest, gBest, and a random movement, respectively.
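The update mechanism of Eq. (4) for a single solution can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation; `random_value` stands for whatever feasibility-preserving sampler the problem at hand uses and all names are assumptions.

```python
import random

def sso_update(x, pbest, gbest, cw, cp, cg, random_value):
    """Update one solution variable-by-variable according to Eq. (4)."""
    new_x = []
    for j in range(len(x)):
        r = random.random()
        if r < cw:
            new_x.append(x[j])             # keep the current value
        elif r < cp:
            new_x.append(pbest[j])         # take the value from pBest
        elif r < cg:
            new_x.append(gbest[j])         # take the value from gBest
        else:
            new_x.append(random_value(j))  # random feasible value for variable j
    return new_x
```

With the settings used later in Section 4 (Cw = 0.15, Cp = 0.45, Cg = 0.95), each variable keeps its current value with probability 0.15, copies pBest with probability 0.30, copies gBest with probability 0.50, and is randomly regenerated with probability 0.05.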

2.3 Support vector machine (SVM)

The support vector machine (SVM), originally introduced by Vapnik [39, 40], has been proven to perform well in a number of applications [6, 24, 41-43]. When used for classification, the aim of SVM is to construct a model from a given set of labeled training data in order to predict the classes of an unlabeled test dataset [24, 41]. Given a set of training data X_i with labels y_i, where i = 1, 2, ..., n, X_i ∈ R^d and y_i ∈ {1, -1}, SVM can be modeled as the following optimization problem:

\min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \xi_i
\text{s.t.} \; y_i (w^{T} X_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0    (5)

where ξ_i are slack variables that allow an instance to lie within the margin (0 ≤ ξ_i ≤ 1) or to be misclassified (ξ_i > 1), and the constant C defines the relative importance of maximizing the margin and minimizing the amount of slack. SVM projects the training vectors X_i from the gene space into a higher-dimensional space and linearly separates X_i by a hyperplane defined as w^T X_i + b = 0, where w determines the orientation of the hyperplane and b is the bias, i.e., the distance between the hyperplane and the origin. The classification decision function is given as follows:

f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i y_i K(X_i, X_j) + b \right)    (6)

where α_i is a Lagrange multiplier, sign(·) is the sign function, and K(X_i, X_j) is the kernel function. There are four basic kernels: the linear, polynomial, radial basis and sigmoid functions. In this paper, the SVM implemented in the LIBSVM package [44] is integrated with ISSO as a wrapper method to search for the optimal gene subset; that is, the accuracy of each gene subset found by ISSO is evaluated by SVM using LOOCV. The linear kernel function is adopted because only one parameter, C, needs to be tuned for this kernel, and it is efficient for large-scale dataset classification with comparable accuracy [45].
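As an illustration of this wrapper evaluation (the paper uses LIBSVM; scikit-learn is assumed here purely for the sketch), the LOOCV accuracy of a candidate gene subset could be computed as follows. The function name and arguments are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loocv_accuracy(X, y, selected, C):
    """LOOCV accuracy (%) of a linear-kernel SVM on the selected gene columns."""
    clf = SVC(kernel="linear", C=C)
    scores = cross_val_score(clf, X[:, selected], y, cv=LeaveOneOut())
    return 100.0 * float(np.mean(scores))
```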

3. Proposed IG-ISSO for Gene Selection

3.1 Gene subset size determination

In [23], a threshold was defined for IG (i.e., IG(g) > 0) to select genes and thereby determine the gene subset size: if the IG value of a gene is greater than this threshold, the gene is selected; otherwise, it is discarded. However, such a threshold is hard to define across different datasets and may produce a subset that is either too large or too small. If the subset is too large, the search space of the search engine grows and consumes more computation time; if it is too small, some informative genes may be discarded, resulting in lower classification accuracy. Thus, this work determines the subset size by selecting the Ntop (obtained in Section 4) most informative genes of the dataset according to their IG values [24], and the dataset restricted to the selected genes is passed to the wrapper method constructed from ISSO and SVM to find the optimal gene subset, as sketched below.
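A sketch of this filter step, reusing the hypothetical `information_gain` helper from the Section 2.1 example: rank all genes by IG and keep the Ntop highest-scoring ones.

```python
import numpy as np

def select_top_genes(X_discrete, y, n_top):
    """Indices of the n_top genes with the largest IG values (descending order)."""
    ig = np.array([information_gain(X_discrete[:, j], y)
                   for j in range(X_discrete.shape[1])])
    return np.argsort(ig)[::-1][:n_top]
```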

3.2 Solution representation

In this paper, ISSO is developed to determine both the parameter C of the SVM [46] and the gene selection from the gene expression data. Thus, each solution x in ISSO is encoded with length Ntop + 1, as shown in Fig. 1. The first Ntop positions of each solution are encoded as a binary string, where 0 and 1 correspond to a non-selected and a selected gene, respectively. According to [24, 45], the parameter C is recommended to be set within [2^-5, 2^15]; thus, the last position is encoded as an integer generated in [-5, 15] that represents the exponent of the parameter C.

<Fig. 1 here>
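A small sketch of how such a solution can be decoded into the inputs the SVM wrapper needs; the helper name is hypothetical.

```python
def decode_solution(x):
    """Split a length-(Ntop + 1) solution into selected gene positions and the SVM parameter C.

    x[:-1] is the 0/1 mask over the Ntop IG-filtered genes and x[-1] is an
    integer exponent in [-5, 15], so that C = 2 ** x[-1].
    """
    selected = [j for j, bit in enumerate(x[:-1]) if bit == 1]
    C = 2.0 ** x[-1]
    return selected, C
```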

3.3 Fitness function

The objective of gene selection involves both maximizing the classification accuracy and minimizing the number of selected genes [24]. Hence, the fitness value of a solution x is calculated as follows:

F(x) = \alpha \times A(x) + (1 - \alpha) \times \frac{d - d(x)}{d}    (7)

where A(x) is the accuracy (in percent) of x evaluated by SVM with LOOCV, and d and d(x) are the total number of genes in the original dataset and the number of genes selected in solution x, respectively. α ∈ [0, 1] is a predefined weight expressing the importance of A(x) relative to d(x). In this paper, α is set to 0.8 [24, 47].
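Combining the pieces, Eq. (7) can be sketched as below, with α = 0.8 as in the paper. `decode_solution` and `loocv_accuracy` are the hypothetical helpers introduced earlier, and the second term is scaled to a 0-100 range so that it is commensurate with A(x) in percent, which is consistent with the fitness values reported in Tables 4 and 5 (an assumption about the scaling, since Eq. (7) leaves it implicit).

```python
def fitness(x, X, y, d_total, alpha=0.8):
    """Fitness of a solution x following Eq. (7), with both terms on a 0-100 scale."""
    selected, C = decode_solution(x)          # hypothetical helper, Section 3.2 sketch
    acc = loocv_accuracy(X, y, selected, C)   # A(x) in percent, Section 2.3 sketch
    return alpha * acc + (1.0 - alpha) * 100.0 * (d_total - len(selected)) / d_total
```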

3.4 Gene pruning strategy (GPS)

SSO has a satisfactory exploration capability but may take a long time to converge to an optimal solution. Thus, a GPS inspired by sequential backward selection [48, 49] is proposed to enhance the exploitation capability of SSO, yielding an improved SSO (ISSO) for gene selection. After the initialization phase, GPS is activated for each updated solution when r ≤ Ngps, in order to balance the global and local search, where r is a random number in [0, 1] and Ngps is a predefined parameter discussed in Section 4. GPS generates new solutions by repeatedly pruning, in a random order, one gene at a time from those selected in the solution.

Let x = {x_1, x_2, ..., x_{Ntop+1}} represent a solution in ISSO, let F(x) denote its fitness value, and let S = {s_1, s_2, ..., s_d} denote a random sequential list of all selected genes in x (i.e., the positions whose value is 1), where s_j ∈ [1, Ntop] and d is the total number of selected genes in x. The GPS procedure is given below:

Step 0. If r ≤ Ngps, construct S, let i = 1 and go to Step 1. Otherwise, halt.
Step 1. Let x* = x and F(x*) = F(x).
Step 2. Let x*_{s_i} = 0 and calculate F(x*).
Step 3. If F(x*) > F(x), let x = x* and F(x) = F(x*).
Step 4. If i < d, let i = i + 1 and go to Step 1. Otherwise, halt.

As an example, suppose that Ntop = 10 and a solution x = {0, 1, 1, 0, 0, 0, 0, 0, 1, 0, -2}, in which x_11 is the exponent of the parameter C for the SVM. The fitness value is F(x) = 91.18 and, according to the solution x, the total number of selected genes is d = 3 (positions with x_i = 1). If r ≤ Ngps is satisfied, the procedure proceeds as follows:

Step 0. Randomly construct the sequential list S = {3, 2, 9} and let i = 1.
Step 1. Let x* = x = {0, 1, 1, 0, 0, 0, 0, 0, 1, 0, -2} and F(x*) = F(x) = 91.18.
Step 2. Let x*_3 = 0 and calculate F(x*) (assume F(x*) = 92.78).
Step 3. Since F(x*) > F(x), let x = x* = {0, 1, 0, 0, 0, 0, 0, 0, 1, 0, -2} and F(x) = F(x*) = 92.78.
Step 4. Let i = 2 and go to Step 1.
Step 1. Let x* = x = {0, 1, 0, 0, 0, 0, 0, 0, 1, 0, -2} and F(x*) = F(x) = 92.78.
Step 2. Let x*_2 = 0 and calculate F(x*) (assume F(x*) = 91.79).
Step 3. Since F(x*) < F(x), x and F(x) remain unchanged.
Step 4. Let i = 3 and go to Step 1.
Step 1. Let x* = x = {0, 1, 0, 0, 0, 0, 0, 0, 1, 0, -2} and F(x*) = F(x) = 92.78.
Step 2. Let x*_9 = 0 and calculate F(x*) (assume F(x*) = 94.59).
Step 3. Since F(x*) > F(x), let x = x* = {0, 1, 0, 0, 0, 0, 0, 0, 0, 0, -2} and F(x) = F(x*) = 94.59.
Step 4. Since i = d = 3, halt.

According to the discussion in Sections 2 and 3, the flowchart of the overall proposed IG-ISSO is illustrated in Fig. 2, where Niter and Nsol are the number of iterations and the population size, respectively.

<Fig. 2 here>
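A compact Python sketch of the GPS procedure above, assuming the solution is a Python list as in the worked example and `fitness_fn` is the fitness evaluation of Section 3.3; names are illustrative only.

```python
import random

def gene_pruning_strategy(x, fx, fitness_fn, n_gps):
    """GPS: try removing each selected gene in random order, keeping improvements."""
    if random.random() > n_gps:                   # Step 0: activate with probability Ngps
        return x, fx
    s_list = [j for j, bit in enumerate(x[:-1]) if bit == 1]
    random.shuffle(s_list)                        # random sequential list S
    for s in s_list:
        x_star = list(x)                          # Step 1: x* = x
        x_star[s] = 0                             # Step 2: prune gene s_i
        f_star = fitness_fn(x_star)
        if f_star > fx:                           # Step 3: keep the prune only if it improves F
            x, fx = x_star, f_star
    return x, fx                                  # Step 4: iterate over all of S, then halt
```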

4. Experimental Results and Discussion

In this work, ten gene expression datasets, which can be downloaded from http://www.gems-system.org, are considered and listed in Table 1, where n, d and k denote the number of instances, genes and classes, respectively. Two experiments, Ex-1 and Ex-2, are implemented to validate the performance of the proposed IG-ISSO. The purpose of Ex-1 is to verify the effect of Ntop and Ngps and to find their best settings using 12 designed treatments. To validate the performance of IG-ISSO, Ex-2 compares IG-ISSO with SVM-based methods and state-of-the-art algorithms that applied the same cross-validation method, LOOCV, to evaluate their performance on each dataset.

Table 1. Characteristics of the considered gene expression datasets.

No.  Dataset          n    d      k
1    11_Tumors        174  12533  11
2    9_Tumors         60   5726   9
3    Brain_Tumor1     90   5920   5
4    Brain_Tumor2     50   10367  4
5    Leukemia1        72   5327   3
6    Leukemia2        72   11225  3
7    Lung_Cancer      203  12600  5
8    SRBCT            83   2308   4
9    Prostate_Tumor   102  10509  2
10   DLBCL            77   5469   2

In all experiments, all SSO-based methods adopt the same parameter settings (Cw = 0.15, Cp = 0.45 and Cg = 0.95) for the update mechanism in Eq. (4) and are coded in MATLAB R2013a on a computer equipped with an Intel 4-GHz CPU and 16 GB of memory. The runtime of the results is measured in CPU seconds and denoted by T.

4.1 Ex-1: Parameter setting for Ntop and Ngps

Twelve designed treatments, combining three levels (50, 100 and 150) for Ntop with four levels (0.0, 0.1, 0.4 and 0.7) for Ngps, are tested. To observe the effects of Ntop and Ngps more clearly, five datasets (9_Tumors, Brain_Tumor1, Brain_Tumor2, Lung_Cancer and Prostate_Tumor), which do not easily reach 100% classification accuracy in the literature [15, 23, 24, 50], are selected from the considered datasets for this experiment to optimize the settings of Ntop and Ngps. IG-ISSO performs 10 independent runs with 50 iterations and a population size of 30 for each treatment on the above five datasets. The experimental results are summarized in Table 2. Aavg, davg and Tavg represent the averages of the solution accuracy, the number of selected genes in the solution and the CPU time over the 10 runs; Astd, dstd and Tstd are the corresponding standard deviations. The results lead to the following interpretation:

1. The effect of Ntop: the value of Ntop negatively correlates with Astd and positively correlates with Aavg, davg, Tavg, dstd and Tstd. This reveals that ISSO with more genes selected by IG yields higher classification accuracy but consumes more time and selects more genes in the solution.

2. The effect of Ngps: the results also show that the value of Ngps negatively correlates with davg and positively correlates with Aavg and Tavg, which demonstrates that the proposed GPS enhances the ability of ISSO to remove irrelevant, redundant and noisy genes and attain higher classification accuracy at the cost of CPU time.

3. The effect of Ntop and Ngps: the ANOVA shown in Table 3 indicates that both Ntop and Ngps significantly affect the accuracy, the number of selected genes and the CPU time. In addition, Table 2 shows that the values of Ntop and Ngps negatively correlate with davg and positively correlate with Aavg and Tavg; that is, higher Ntop and Ngps settings increase the chance of IG-ISSO yielding higher classification accuracy and selecting fewer genes, at the cost of more CPU time, compared to the other settings of Ntop and Ngps. This can also be observed in Table 4 (Favg represents the average fitness value introduced in Eq. (7), and the best values among the different settings are shown in bold), which shows that the results of the four settings (Ntop, Ngps) = (100, 0.4), (150, 0.1), (150, 0.4) and (150, 0.7) on each selected dataset are better than the best-known solutions in [24] in terms of the averages of the accuracy and the number of selected genes. The best average fitness value, 98.0042, is obtained at (Ntop, Ngps) = (100, 0.4). Therefore, the proposed IG-ISSO is applied with Ntop = 100 and Ngps = 0.4 in the next experiment.


Table 2. The main effects of different settings for Ntop and Ngps.

Statistic  Ntop      Ngps = 0     0.1          0.4          0.7          Average
Aavg       50        95.1176      95.2496      95.5567a     95.4321      95.3390
           100       96.0947      96.7625      97.5398ab    97.1334      96.8826
           150       96.5968b     97.3672b     97.4770a     97.3770b     97.2045
           Average   95.9363      96.4598      96.8579      96.6475
davg       50        14.0000b     9.1200b      9.0800b      8.4400ab     10.1600
           100       27.0600      11.4400      10.6400      9.5200a      14.6650
           150       40.1200      12.4400      11.0000a     11.4200      18.7450
           Average   27.0600      11.0000      10.2400      9.7933
Tavg       50        50.0273ab    82.0815b     195.0231b    311.1162b    159.5620
           100       69.0925a     121.9731     302.7383     406.4107     225.0537
           150       77.5122a     144.7246     324.8069     505.4797     263.1309
           Average   65.5440      116.2597     274.1894     407.6689
Astd       50        5.2715       4.8662       4.5448a      4.6093       4.8230
           100       4.8220       3.8555       3.3524ab     3.5708b      3.9002
           150       4.0784b      3.6222b      3.4986a      3.6289       3.7070
           Average   4.7240       4.1146       3.7986       3.9363
dstd       50        4.3128b      2.9304b      3.3041       2.8431ab     3.3476
           100       5.5671       3.7799       2.9636ab     3.3026       3.9033
           150       7.3656       4.2899       3.0668a      3.5989       4.5803
           Average   5.7485       3.6667       3.1115       3.2482
Tstd       50        49.2266ab    81.7759b     204.6372b    347.8435b    170.8708
           100       71.8363a     112.4177     307.4019     378.8398     217.6239
           150       82.1210a     144.6744     324.9516     516.3446     267.0229
           Average   67.7280      112.9560     278.9969     414.3427

a, b: the best value among the values of the same row and column, respectively.

Table 3. The ANOVA for Ex-1.

Group      Source      DF   SS         MS        F value   P value
A          Ntop        2    79.5506    39.7753   81.07     0.000
           Ngps        3    14.0016    4.6672    9.51      0.000
           Ntop*Ngps   6    3.4285     0.5714    1.16      0.331
           Error       108  52.9888    0.4906
           Total       119  149.9694
           S = 0.700455, R2 = 64.67%, R2(adj) = 61.07%
d          Ntop        2    1475.25    737.62    459.63    0.000
           Ngps        3    6309.05    2103.02   1310.44   0.000
           Ntop*Ngps   6    2060.39    343.40    213.98    0.000
           Error       108  173.32     1.60
           Total       119  10018.01
           S = 1.26681, R2 = 98.27%, R2(adj) = 98.09%
CPU Time   Ntop        2    219540     109770    161.81    0.000
           Ngps        3    2181242    727081    1071.76   0.000
           Ntop*Ngps   6    89890      14982     22.08     0.000
           Error       108  73267      678
           Total       119  2563940
           S = 26.0461, R2 = 97.14%, R2(adj) = 96.85%

Table 4. The performance of each designed treatment on five datasets.

Ntop = 50
Dataset         Criteria  Ngps = 0   0.1        0.4        0.7
9_Tumors        Aavg      85.8333    86.6667    87.5000    87.3333
                davg      20.60      12.80      13.60      12.30
                Favg      88.5947    89.2886    89.9525    89.8237
Brain_Tumor1    Aavg      95.8889    96.1111    96.6667    96.1111
                davg      15.40      10.30      10.40      8.20
                Favg      96.6591    96.8541    97.2982    96.8612
Brain_Tumor2    Aavg      97.6000    97.4000    97.4000    97.4000
                davg      11.80      8.30       7.20       7.10
                Favg      98.0572    97.9040    97.9063    97.9061
Lung_Cancer     Aavg      98.2266    98.3251    98.2759    98.4729
                davg      13.00      9.40       9.40       9.80
                Favg      98.5606    98.6452    98.6058    98.7628
Prostate_Tumor  Aavg      98.0392    97.7451    97.9412    97.8431
                davg      9.20       4.80       4.90       4.70
                Favg      98.4139    98.1869    98.3436    98.2656
Average         Aavg      95.1176    95.2496    95.5567    95.4321
                davg      14.00      9.12       9.08       8.44
                Favg      96.0571    96.1758    96.4213    96.3239

Ntop = 100
Dataset         Criteria  Ngps = 0   0.1        0.4        0.7
9_Tumors        Aavg      87.8333    90.0000    91.6667    90.8333
                davg      35.90      17.60      15.70      15.00
                Favg      90.1413    91.9385    93.2785    92.6143
Brain_Tumor1    Aavg      95.8889    97.2222    98.0000    97.8889
                davg      29.10      12.00      10.10      9.50
                Favg      96.6128    97.7372    98.3659    98.2790
Brain_Tumor2    Aavg      99.6000    98.8000    99.8000    99.4000
                davg      22.40      8.40       8.60       7.80
                Favg      99.6368    99.0238    99.8234    99.5050
Lung_Cancer     Aavg      98.8177    99.2611    99.4089    99.1133
                davg      24.50      10.80      10.40      9.00
                Favg      99.0153    99.3917    99.5106    99.2764
Prostate_Tumor  Aavg      98.3333    98.5294    98.8235    98.4314
                davg      23.40      8.40       8.40       6.30
                Favg      98.6221    98.8075    99.0428    98.7331
Average         Aavg      96.0947    96.7625    97.5398    97.1334
                davg      27.06      11.44      10.64      9.52
                Favg      96.8057    97.3798    98.0042    97.6815

Ntop = 150
Dataset         Criteria  Ngps = 0   0.1        0.4        0.7
9_Tumors        Aavg      89.6667    91.0000    91.3333    91.0000
                davg      51.00      19.70      16.20      17.70
                Favg      91.5552    92.7312    93.0101    92.7382
Brain_Tumor1    Aavg      96.2222    98.0000    98.0000    98.0000
                davg      44.20      11.20      11.10      11.10
                Favg      96.8285    98.3622    98.3625    98.3625
Brain_Tumor2    Aavg      99.6000    99.8000    99.8000    99.8000
                davg      34.80      9.70       9.00       8.90
                Favg      99.6129    99.8213    99.8226    99.8228
Lung_Cancer     Aavg      98.9655    99.3103    99.3596    99.3596
                davg      37.20      12.60      10.10      9.70
                Favg      99.1134    99.4283    99.4717    99.4723
Prostate_Tumor  Aavg      98.5294    98.7255    98.8922    98.7255
                davg      33.40      9.00       8.60       9.70
                Favg      98.7600    98.9633    99.0974    98.9619
Average         Aavg      96.5968    97.3672    97.4770    97.3770
                davg      40.12      12.44      11.00      11.42
                Favg      97.1740    97.8612    97.9528    97.8715

4.2 Ex-2: Comparing IG-ISSO with existing algorithms

The experimental results of IG-ISSO on the ten considered datasets, using Ntop = 100 and Ngps = 0.4 over 10 independent runs with 50 iterations and a population size of 30, are listed in Table 5, where the best value on each dataset is shown in bold. As the results show, the proposed algorithm obtains more than 95% average classification accuracy on 9 out of the 10 selected gene expression datasets, the exception being the 9_Tumors dataset (91.667%). For the Leukemia1, Leukemia2, SRBCT and DLBCL datasets, IG-ISSO yields 100% average accuracy with fewer than 5 selected genes on average. For the Brain_Tumor1, Brain_Tumor2, Lung_Cancer and Prostate_Tumor datasets, the average classification accuracy is equal to or greater than 98% with fewer than 11 selected genes on average.


Table 5. The results for each run using IG-ISSO for ten datasets.

      11_Tumors                9_Tumors                 Brain_Tumor1             Brain_Tumor2           Leukemia1
Run   A        d    F          A        d    F          A        d    F          A      d    F          A    d    F
1     95.4023  15   96.2979    93.3333  16   94.6108    97.7778  9    98.1918    100    9    99.9826    100  5    99.9812
2     95.4023  21   96.2883    91.6667  17   93.2740    98.8889  13   99.0672    100    9    99.9826    100  5    99.9812
3     95.4023  22   96.2867    93.3333  14   94.6178    97.7778  11   98.1851    100    8    99.9846    100  5    99.9812
4     97.1264  22   97.6660    93.3333  14   94.6178    98.8889  9    99.0807    98     8    98.3846    100  4    99.9850
5     96.5517  16   97.2158    88.3333  11   90.6282    96.6667  8    97.3063    100    10   99.9807    100  5    99.9812
6     95.4023  20   96.2899    88.3333  15   90.6143    96.6667  8    97.3063    100    7    99.9865    100  5    99.9812
7     97.1264  18   97.6724    95.0000  18   95.9371    97.7778  9    98.1918    100    9    99.9826    100  4    99.9850
8     95.9770  22   96.7465    88.3333  18   90.6038    97.7778  11   98.1851    100    9    99.9826    100  4    99.9850
9     97.7011  21   98.1274    93.3333  17   94.6073    98.8889  12   99.0706    100    10   99.9807    100  5    99.9812
10    93.1034  21   94.4492    91.6667  17   93.2740    98.8889  11   99.0739    100    7    99.9865    100  4    99.9850
Avg   95.9195  19.8 96.7040    91.6667  15.7 93.2785    98.0000  10.1 98.3659    99.8   8.6  99.8234    100  4.6  99.9827
Std   1.3119   2.5734 1.0499   2.4845   2.2136 1.9853   0.8765   1.7288 0.6970   0.6325 1.0750 0.5056   0    0.5164 0.0019

      Leukemia2               Lung_Cancer              SRBCT                   Prostate_Tumor           DLBCL
Run   A    d    F             A        d    F          A    d    F             A        d    F          A    d    F
1     100  5    99.9911       99.0148  10   99.1959    100  4    99.9653       98.0392  4    98.4238    100  4    99.9854
2     100  4    99.9929       100      12   99.9810    100  4    99.9653       98.0392  9    98.4142    100  4    99.9854
3     100  4    99.9929       99.5074  10   99.5900    100  4    99.9653       99.0196  8    99.2005    100  4    99.9854
4     100  4    99.9929       99.5074  11   99.5885    100  4    99.9653       99.0196  9    99.1986    100  4    99.9854
5     100  4    99.9929       98.5222  9    98.8034    100  4    99.9653       99.0196  7    99.2024    100  4    99.9854
6     100  4    99.9929       99.5074  12   99.5869    100  5    99.9567       99.0196  9    99.1986    100  4    99.9854
7     100  4    99.9929       99.5074  10   99.5900    100  4    99.9653       99.0196  10   99.1967    100  3    99.9890
8     100  5    99.9911       100      11   99.9825    100  5    99.9567       99.0196  10   99.1967    100  4    99.9854
9     100  4    99.9929       99.5074  10   99.5900    100  4    99.9653       99.0196  9    99.1986    100  4    99.9854
10    100  4    99.9929       99.0148  9    99.1975    100  5    99.9567       99.0196  9    99.1986    100  4    99.9854
Avg   100  4.2  99.9925       99.4089  10.4 99.5106    100  4.3  99.9627       98.8235  8.4  99.0428    100  3.9  99.9857
Std   0    0.4216 0.0008      0.4527   1.0750 0.3608   0    0.4830 0.0042      0.4134   1.7764 0.3288   0    0.3162 0.0012

For an objective comparison, we compare the average accuracy obtained by the proposed IG-ISSO with that of SVM-based methods. The results of the SVM-based methods were taken from [24, 51]; those methods include grid-search SVM (GS) [24, 45], one-versus-rest SVM (OVR), one-versus-one SVM (OVO) [52], directed acyclic graph SVM (DAG) [53], WW SVM [54] and CS SVM [55]. In Table 6, the best results among the methods are highlighted in boldface. It is easy to see that IG-ISSO obtains nine of the highest classification accuracies over the 10 selected datasets and the highest average classification accuracy compared to the previous SVM-based methods. This also demonstrates that IG-ISSO is an effective method to remove irrelevant, redundant and noisy genes and to find an optimal subset of genes that enhances the accuracy of SVM.

Table 6. The accuracy comparison between our method and SVM-based methods.

                  SVM-based methods
Dataset           GS      OVR     OVO     DAG     WW      CS      IG-ISSO
11_Tumors         89.98   94.68   90.36   90.36   94.68   95.30   95.92
9_Tumors          51.67   65.10   58.57   60.24   62.24   65.33   91.67
Brain_Tumor1      90.00   91.67   90.56   90.56   60.56   90.56   98.00
Brain_Tumor2      90.00   77.00   77.83   77.83   73.33   72.83   99.80
Leukemia1         97.22   97.50   91.32   96.07   97.50   97.50   100.00
Leukemia2         94.44   97.32   95.89   95.89   95.89   95.89   100.00
Lung_Cancer       95.07   96.05   95.59   95.59   95.55   96.55   99.41
SRBCT             98.80   100.00  100.00  100.00  100.00  100.00  100.00
Prostate_Tumor    93.14   92.00   92.00   92.00   92.00   92.00   98.82
DLBCL             96.10   97.50   97.50   97.50   97.50   97.50   100.00
Average           89.64   90.88   88.96   89.60   86.93   90.35   98.36

To further verify the effectiveness of the proposed method, we compare IG-ISSO with state-of-the-art approaches, including IG-GA [23], NSGA-II [24, 56], IBPSO [15], MBPSO [50], MOBBBO [24] and IG-SSO with Ntop = 100. The population sizes and numbers of iterations of these approaches are listed in Table 7. Table 8 shows the average criteria over ten runs for each approach, including the average accuracy, the average number of selected genes and the average fitness value. As illustrated in Table 8, IG-ISSO produces better results than IG-SSO on all of the considered datasets for all criteria, which empirically shows that the proposed GPS scheme helps IG-ISSO prune more irrelevant genes and achieve higher classification accuracy. Furthermore, IG-ISSO obtains the highest or equivalent accuracy with the lowest number of selected genes compared to its competitors on most of the datasets, except for the Leukemia1 dataset. The average fitness value of each approach, calculated from its Aavg and davg, also confirms these findings.

Table 7. Parameter settings of the competitors.

Approach    Population size   Iterations
IG-GA       30                100
NSGA-II     50                100
IBPSO       100               100
MBPSO       100               300
MOBBBO      50                100
IG-SSO      30                100


Table 8. Comparison of the gene selection algorithms on ten selected datasets.

Dataset          Criteria  IG-GA    NSGA-II  IBPSO    MBPSO   MOBBBO  IG-SSO  IG-ISSO
11_Tumors        Aavg      92.53    86.44    93.10    95.06   92.41   95.86   95.92
                 davg      479.00   32.30    2948.00  240.90  25.10   37.50   19.80
                 Favg      93.26    89.10    89.78    95.66   93.89   96.63   96.70
9_Tumors         Aavg      85.00    71.17    78.33    75.50   80.50   89.50   91.67
                 davg      52.00    27.30    1280.00  240.60  24.10   28.80   15.70
                 Favg      87.82    76.84    78.19    79.56   84.32   91.50   93.28
Brain_Tumor1     Aavg      93.33    90.74    94.44    92.56   96.67   96.11   98.00
                 davg      244.00   15.44    754.00   11.20   13.20   19.10   10.10
                 Favg      93.84    92.54    93.00    94.01   97.29   96.82   98.37
Brain_Tumor2     Aavg      88.00    93.75    94.00    92.00   99.80   99.80   99.80
                 davg      489.00   12.25    1197.00  9.10    10.20   14.70   8.60
                 Favg      89.46    94.98    92.89    93.58   99.82   99.81   99.82
Leukemia1        Aavg      100.00   99.03    100.00   100.00  100.00  100.00  100.00
                 davg      82.00    7.90     1034.00  6.50    7.70    4.60    3.50
                 Favg      99.69    99.19    96.12    99.98   99.97   99.98   99.99
Leukemia2        Aavg      98.61    99.72    100.00   100.00  100.00  100.00  100.00
                 davg      782.00   10.90    1292.00  6.70    4.80    7.50    4.20
                 Favg      97.49    99.76    97.70    99.99   99.99   99.99   99.99
Lung_Cancer      Aavg      95.57    95.68    96.55    95.86   98.47   99.26   99.41
                 davg      2101.00  24.88    1897.00  14.90   16.20   14.30   10.40
                 Favg      93.12    96.50    94.23    96.66   98.75   99.39   99.51
SRBCT            Aavg      100.00   100.00   100.00   100.00  100.00  100.00  100.00
                 davg      56.00    15.90    431.00   17.50   6.40    7.20    4.30
                 Favg      99.51    99.86    96.27    99.85   99.94   99.94   99.96
Prostate_Tumor   Aavg      96.08    96.18    92.16    97.94   98.33   98.82   98.82
                 davg      343.00   21.80    1294.00  13.60   11.90   18.30   8.40
                 Favg      96.21    96.90    91.27    98.33   98.64   99.02   99.04
DLBCL            Aavg      100.00   99.22    100.00   100.00  100.00  100.00  100.00
                 davg      107.00   9.50     1042.00  6.00    5.70    6.30    3.90
                 Favg      99.61    99.34    96.19    99.98   99.98   99.98   99.99

4.3 The statistical analysis

Table 9 presents additional comparison measures, namely the averages of the classification accuracy, the number of selected genes and the fitness value across all considered datasets (the best value is shown in bold). These measures indicate that the proposed IG-ISSO is the most promising alternative for gene selection problems and outperforms all its competitors.

To further compare the aforementioned gene selection methods, the sign test, a popular way to compare the overall performance of methods, was conducted with a significance level of α = 0.05 [57, 58]. Because the fitness value in Eq. (7) combines the classification accuracy and the number of selected genes, we use it as the target value for the sign test. The results of the sign test in Table 9 are presented as win-loss (W-L) records, in which the two values are the numbers of datasets on which IG-ISSO obtains better or worse fitness values than the other method, and the p-value (p) indicates whether the difference between the two algorithms is significant based on W-L (boldface entries indicate statistically significant differences). According to the sign test, IG-ISSO achieves overwhelming win counts against the other algorithms. As a result, IG-ISSO is significantly more effective than its competitors based on all p-values.

Table 9. The results of the sign test.

            IG-GA    NSGA-II  IBPSO    MBPSO    MOBBBO   IG-SSO   IG-ISSO
Average  A  94.9120  93.1931  94.8580  94.8920  96.6180  97.9358  98.3620
         d  473.50   17.82    1316.90  56.40    12.41    16.14    9.00
         F  95.0017  94.5020  92.5629  95.7608  97.2598  98.3048  98.6650
W-L         10-0     10-0     10-0     9-1      10-0     10-0
p           0.0020   0.0020   0.0020   0.0215   0.0020   0.0020
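For reference, the two-sided sign-test p-values reported in Table 9 can be reproduced from the win-loss counts with a binomial test; this SciPy snippet is an illustration only and is not part of the original analysis.

```python
from scipy.stats import binomtest

def sign_test_p(wins, losses):
    """Two-sided sign-test p-value for a win-loss record over the datasets (ties excluded)."""
    return binomtest(wins, wins + losses, p=0.5, alternative="two-sided").pvalue

print(round(sign_test_p(10, 0), 4))  # 0.002
print(round(sign_test_p(9, 1), 4))   # 0.0215
```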

5. Conclusions

In this study, a new hybrid filter/wrapper approach that combines information gain (IG), improved simplified swarm optimization (ISSO) and the support vector machine (SVM), called IG-ISSO, is proposed for gene selection problems. IG is used as a filter to select the 100 most informative genes for ISSO, and ISSO integrated with SVM is employed as a wrapper to find the optimal gene subset. The experimental results show that the proposed GPS scheme helps IG-ISSO find a smaller set of reliable genes and achieve higher classification accuracy than IG-SSO. Furthermore, the comparisons with six state-of-the-art approaches on the ten considered gene expression datasets show that the proposed IG-ISSO achieves the best results on 9 out of 10 datasets in terms of classification accuracy and the number of selected genes. The tests of statistical significance confirm the significance of the results obtained by the proposed algorithm in comparison to its competitors on all considered datasets. In future work, we plan to apply IG-ISSO to other high-dimensional datasets, including image and text data.

Vitae

Chyh-Ming Lai is a Ph.D. student in the Department of Industrial Engineering and Engineering Management at National Tsing Hua University (NTHU), Hsinchu, Taiwan. He received his M.S. degree from the Management College at National Defense University. His research interests are Evolutionary Computation, Data Mining and network reliability theory.

Wei-Chang Yeh is a professor of the Department of Industrial Engineering and Engineering Management at NTHU, Hsinchu, Taiwan. He received his M.S. and Ph.D. from the Department of Industrial Engineering at the University of Texas at Arlington. His research interests include network reliability theory, graph theory, deadlock problem, and scheduling. Dr. Yeh is a member of IEEE and has received awards for his research achievement from the National Science Council.

Chung-Yi Chang completed his M.S. degree in the Department of Industrial Engineering and Engineering Management at NTHU, Hsinchu, Taiwan. He received his B. S. degree from National Kaohsiung University of Applied Sciences. His research interests are Evolutionary Computation and Data Mining.


References

[1] H. Feilotter, A Biologist's Guide to Analysis of DNA Microarray Data, American Journal of Human Genetics, 71 (2002) 1483.
[2] Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics, 23 (2007) 2507-2517.
[3] W. Awada, T.M. Khoshgoftaar, D. Dittman, R. Wald, A. Napolitano, A review of the stability of feature selection techniques for bioinformatics data, in: Information Reuse and Integration (IRI), 2012 IEEE 13th International Conference on, IEEE, 2012, pp. 356-363.
[4] A. Özçift, A. Gülten, Genetic algorithm wrapped Bayesian network feature selection applied to differential diagnosis of erythemato-squamous diseases, Digital Signal Processing, 23 (2013) 230-237.
[5] I. Kononenko, Estimating attributes: analysis and extensions of RELIEF, in: Machine Learning: ECML-94, Springer, 1994, pp. 171-182.
[6] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, D. Haussler, Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics, 16 (2000) 906-914.
[7] M. Xiong, W. Li, J. Zhao, L. Jin, E. Boerwinkle, Feature (gene) selection in gene expression-based tumor classification, Molecular Genetics and Metabolism, 73 (2001) 239-247.
[8] M. Hall, G. Holmes, Benchmarking attribute selection techniques for discrete class data mining, IEEE Transactions on Knowledge and Data Engineering, 15 (2003) 1437-1447.
[9] D. Chen, Z. Liu, X. Ma, D. Hua, Selecting genes by test statistics, BioMed Research International, 2005 (2005) 132-138.
[10] M.T. Martín-Valdivia, M.C. Díaz-Galiano, A. Montejo-Raez, L. Urena-Lopez, Using information gain to improve multi-modal information retrieval systems, Information Processing & Management, 44 (2008) 1146-1158.
[11] D. Huang, T.W. Chow, Effective feature selection scheme using mutual information, Neurocomputing, 63 (2005) 325-343.
[12] M. Xiong, X. Fang, J. Zhao, Biomarker identification by feature wrappers, Genome Research, 11 (2001) 1878-1887.
[13] B. Samanta, K. Al-Balushi, S. Al-Araimi, Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection, Engineering Applications of Artificial Intelligence, 16 (2003) 657-665.
[14] M.A. Tahir, A. Bouridane, F. Kurugollu, Simultaneous feature selection and feature weighting using Hybrid Tabu Search/K-nearest neighbor classifier, Pattern Recognition Letters, 28 (2007) 438-446.
[15] L.Y. Chuang, H.W. Chang, C.J. Tu, C.H. Yang, Improved binary PSO for feature selection using gene expression data, Computational Biology and Chemistry, 32 (2008) 29-38.
[16] S.W. Lin, K.C. Ying, S.C. Chen, Z.J. Lee, Particle swarm optimization for parameter determination and feature selection of support vector machines, Expert Systems with Applications, 35 (2008) 1817-1824.
[17] K.H. Chen, K.J. Wang, M.L. Tsai, K.M. Wang, A.M. Adrian, W.C. Cheng, T.S. Yang, N.C. Teng, K.P. Tan, K.-S. Chang, Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm, BMC Bioinformatics, 15 (2014) 49.
[18] M.M. Kabir, M.M. Islam, K. Murase, A new wrapper feature selection approach using neural network, Neurocomputing, 73 (2010) 3273-3283.
[19] S. Kashef, H. Nezamabadi-pour, An advanced ACO algorithm for feature subset selection, Neurocomputing, 147 (2015) 271-279.
[20] S. Li, X. Wu, M. Tan, Gene selection using hybrid particle swarm optimization and genetic algorithm, Soft Computing, 12 (2008) 1039-1048.
[21] Z. Zhu, Y.S. Ong, M. Dash, Markov blanket-embedded genetic algorithm for gene selection, Pattern Recognition, 40 (2007) 3236-3248.
[22] S.S. Kannan, N. Ramaraj, A novel hybrid feature selection via Symmetrical Uncertainty ranking based local memetic search algorithm, Knowledge-Based Systems, 23 (2010) 580-585.
[23] C.H. Yang, L.Y. Chuang, C.H. Yang, IG-GA: a hybrid filter/wrapper method for feature selection of microarray data, Journal of Medical and Biological Engineering, 30 (2010) 23-28.
[24] X. Li, M. Yin, Multiobjective binary biogeography based optimization for feature selection using gene expression data, IEEE Transactions on NanoBioscience, 12 (2013) 343-353.
[25] W.C. Yeh, A two-stage discrete particle swarm optimization for the problem of multiple multi-level redundancy allocation in series systems, Expert Systems with Applications, 36 (2009) 9192-9200.
[26] W.C. Yeh, Novel swarm optimization for mining classification rules on thyroid gland data, Information Sciences, 197 (2012) 65-76.
[27] W.C. Yeh, Simplified swarm optimization in disassembly sequencing problems with learning effects, Computers & Operations Research, 39 (2012) 2168-2177.
[28] W.C. Yeh, New parameter-free simplified swarm optimization for artificial neural network training and its application in the prediction of time series, IEEE Transactions on Neural Networks and Learning Systems, 24 (2013) 661-665.
[29] R. Azizipanah-Abarghooee, T. Niknam, M. Gharibzadeh, F. Golestaneh, Robust, fast and optimal solution of practical economic dispatch by a new enhanced gradient-based simplified swarm optimization algorithm, IET Generation, Transmission & Distribution, 7 (2013) 620-635.
[30] R. Azizipanah-Abarghooee, A new hybrid bacterial foraging and simplified swarm optimization algorithm for practical optimal dynamic load dispatch, International Journal of Electrical Power & Energy Systems, 49 (2013) 414-429.
[31] P.C. Chang, X. He, Macroscopic Indeterminacy Swarm Optimization (MISO) algorithm for real-parameter search, in: Evolutionary Computation (CEC), 2014 IEEE Congress on, IEEE, 2014, pp. 1571-1578.
[32] W.C. Yeh, C.M. Lai, K.H. Chang, A novel hybrid clustering approach based on K-harmonic means using robust design, Neurocomputing, 173 (2016) 1720-1732.
[33] W.C. Yeh, C.M. Lai, Accelerated Simplified Swarm Optimization with Exploitation Search Scheme for Data Clustering, PLoS ONE, 10 (2015) e0137246.
[34] H. Uğuz, A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm, Knowledge-Based Systems, 24 (2011) 1024-1032.
[35] J.R. Quinlan, Induction of decision trees, Machine Learning, 1 (1986) 81-106.
[36] C.E. Shannon, A mathematical theory of communication, ACM SIGMOBILE Mobile Computing and Communications Review, 5 (2001) 3-55.
[37] J. Dai, Q. Xu, Attribute selection based on information gain ratio in fuzzy rough set theory with application to tumor classification, Applied Soft Computing, 13 (2013) 211-221.
[38] B. Frénay, G. Doquire, M. Verleysen, Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification, Neurocomputing, 112 (2013) 64-78.
[39] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992, pp. 144-152.
[40] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[41] M.P. Brown, W.N. Grundy, D. Lin, N. Cristianini, C.W. Sugnet, T.S. Furey, M. Ares, D. Haussler, Knowledge-based analysis of microarray gene expression data by using support vector machines, Proceedings of the National Academy of Sciences, 97 (2000) 262-267.
[42] T. Jaakkola, M. Diekhans, D. Haussler, Using the Fisher kernel method to detect remote protein homologies, in: ISMB, 1999, pp. 149-158.
[43] A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, K.-R. Müller, Engineering support vector machine kernels that recognize translation initiation sites, Bioinformatics, 16 (2000) 799-807.
[44] C.C. Chang, C.J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2 (2011) 27.
[45] C.W. Hsu, C.C. Chang, C.J. Lin, A practical guide to support vector classification, 2003.
[46] X. Guo, J. Yang, C. Wu, C. Wang, Y. Liang, A novel LS-SVMs hyper-parameter selection based on particle swarm optimization, Neurocomputing, 71 (2008) 3211-3215.
[47] M.S. Mohamad, S. Omatu, S. Deris, M.F. Misman, M. Yoshioka, A multi-objective strategy in genetic algorithms for gene selection of gene expression data, Artificial Life and Robotics, 13 (2009) 410-413.
[48] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, London, 1982.
[49] M. Dash, H. Liu, Feature selection for classification, Intelligent Data Analysis, 1 (1997) 131-156.
[50] M.S. Mohamad, S. Omatu, S. Deris, M. Yoshioka, A modified binary particle swarm optimization for selecting the small subset of informative genes from gene expression data, IEEE Transactions on Information Technology in Biomedicine, 15 (2011) 813-822.
[51] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, S. Levy, A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis, Bioinformatics, 21 (2005) 631-643.
[52] U.H.G. Kreßel, Pairwise classification and support vector machines, in: Advances in Kernel Methods, MIT Press, 1999, pp. 255-268.
[53] J.C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification, in: NIPS, 1999, pp. 547-553.
[54] J. Weston, C. Watkins, Support vector machines for multi-class pattern recognition, in: ESANN, 1999, pp. 219-224.
[55] K. Crammer, Y. Singer, On the learnability and design of output codes for multiclass problems, Machine Learning, 47 (2002) 201-233.
[56] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation, 6 (2002) 182-197.
[57] A. Al-Ani, A. Alsukker, R.N. Khushaba, Feature subset selection using differential evolution and a wheel based search strategy, Swarm and Evolutionary Computation, 9 (2013) 15-26.
[58] J. Derrac, S. García, D. Molina, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm and Evolutionary Computation, 1 (2011) 3-18.

Figures

Fig. 1. The solution representation.


Fig. 2. The overall flowchart of IG-ISSO.

Highlights

• The first work to apply simplified swarm optimization to a gene selection problem.
• The GPS helps IG-ISSO to identify a smaller gene set with a higher accuracy.
• Statistical results indicate IG-ISSO is better than other algorithms.

GPS: gene pruning strategy.