C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods

Computer Methods and Programs in Biomedicine 178 (2019) 219–235 Contents lists available at ScienceDirect Computer Methods and Programs in Biomedici...


Computer Methods and Programs in Biomedicine 178 (2019) 219–235

Contents lists available at ScienceDirect

Computer Methods and Programs in Biomedicine journal homepage: www.elsevier.com/locate/cmpb

C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods

Aman Sharma∗, Rinkle Rani

Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India

Article info

Article history: Received 10 February 2019; Revised 24 June 2019; Accepted 27 June 2019

Keywords: Multi-objective optimization; Spotted hyena optimizer; Salp swarm algorithm; Cancer classification; Gene expression; Machine learning; Gene selection; Feature selection

Abstract

Background and objective: Over the last two decades, DNA microarray technology has emerged as a powerful tool for early cancer detection and prevention. It provides a detailed overview of the complex microenvironment of a disease. Moreover, the online availability of thousands of gene expression assays has made microarray data classification an active research area. A common goal is to find a minimum subset of genes while maximizing the classification accuracy.

Methods: In pursuit of this objective, we propose a framework (C-HMOSHSSA) for gene selection using the multi-objective spotted hyena optimizer (MOSHO) and the salp swarm algorithm (SSA). Real-life optimization problems with more than one objective usually face the challenge of maintaining both convergence and diversity. SSA maintains diversity but suffers from the overhead of maintaining the necessary information. MOSHO, on the other hand, requires low computational effort and is therefore used for maintaining the necessary information. The proposed algorithm is thus a hybrid that utilizes the features of both SSA and MOSHO to strengthen its exploration and exploitation capability.

Results: Four different classifiers are trained on seven high-dimensional datasets using a subset of features (genes) obtained by applying the proposed hybrid gene selection algorithm. The results show that the proposed technique significantly outperforms existing state-of-the-art techniques.

Conclusion: It is also shown that new sets of informative and biologically relevant genes are successfully identified by the proposed technique. The proposed approach can also be applied to other problem domains that involve feature selection.

© 2019 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail addresses: [email protected] (A. Sharma), [email protected] (R. Rani). https://doi.org/10.1016/j.cmpb.2019.06.029 0169-2607/© 2019 Elsevier B.V. All rights reserved.

1. Introduction

The rise of DNA microarray technology has significantly impacted research in the biological domain. Scientists can screen thousands of genes in parallel to analyze their biological behavior and their significance in cellular functioning. Moreover, parallel screening of genomic profiles using microarray technology provides deep insights into genetic variations and alterations. It helps in the early detection of serious diseases such as cancer. In the last two decades, many researchers have contributed benchmark microarray datasets for different kinds of tumors [1–3] to promote oncological research. These cancer datasets consist of thousands of genes measured across different samples, and they serve as benchmarks for the wide range of techniques proposed for cancer classification. Various computational approaches have been proposed in the literature for cancer classification using machine learning [4–7]. Machine learning is extensively used to analyze biological datasets [8–12] and has become one of the most popular data analysis tools among researchers with diverse interests.

The key idea in cancer classification problems is to achieve two goals with high accuracy. The first goal is to differentiate among patients with different types of cancer based on their genomic profiles. The second goal is to identify the set of genes (biomarkers) that helps to differentiate the several types of cancer. Together, these goals aim at a better understanding of the intricacies of cancerous tissues so that effective treatment can be delivered. However, prediction modeling from cancerous profiles raises several issues. Selecting relevant genes for accurate diagnosis and prognosis of cancer is a crucial task, and cancer classification is inherently dependent on feature selection methods for gene selection. Gene selection helps to select the most relevant and informative genes for cancer classification problems. Feature selection methods can be categorized into three groups: filter, wrapper, and hybrid methods. Filter methods evaluate the statistics of the available data without considering any learning approach. These


methods either consider each gene individually (univariate filter) or groups of genes (multivariate filter). Wrapper methods, on the other hand, use a learning algorithm to guide the search for the optimal feature subset [13]. Filter methods are fast and computationally inexpensive compared to wrapper methods. However, wrapper methods are popular among researchers for classification tasks because of their higher accuracy [14,15]. Hybrid methods leverage the merits of both: they first apply a filter technique to reduce the feature space and then use a wrapper method for the final gene selection [16–18].

The available microarray datasets of cancerous profiles suffer from the curse of high dimensionality. This makes cancer classification an NP-hard (non-deterministic polynomial-time hard) problem, for which meta-heuristic algorithms are considered a well-suited choice. In recent years, various meta-heuristic algorithms have been proposed in the literature to solve a variety of real-life problems [19]. These algorithms are also applied in several problem domains of computational biology because they are computationally inexpensive and easy to understand. At the heart of these algorithms are the multiple objective functions that guide the search towards the globally optimal solution. The critical step in multi-objective optimization is to resolve the conflicts between the objectives. Over the past two decades, various wrapper methods based on bio-inspired optimization algorithms, such as particle swarm optimization and the genetic algorithm, have been proposed for optimal gene selection using microarray data [17,20].

Approaches to multi-objective problems can be broadly classified as prior-based [21] and posterior-based [22]. In prior approaches, the multi-objective problem is treated as a single-objective problem with weighted objective functions, whereas posterior approaches find the optimal solution based on the performance of the problem under consideration. Unlike single-objective problems, multi-objective optimization problems have many possible solutions. Obtaining the Pareto front is one of the most challenging tasks in multi-objective problems, because many points must be produced to approximate it well. Obtaining Pareto fronts for high-dimensional datasets such as gene expression data is an extremely complex task. Recently, various bio-inspired optimization algorithms for solving the gene selection problem have been proposed in the literature [23,24].

In this paper, our objective is to develop a cancer classification framework that predicts relevant and new cancer biomarkers using a bio-inspired meta-heuristic approach and machine learning. The proposed framework uses the multi-objective spotted hyena optimizer [25] and the salp swarm algorithm [26] for optimal gene selection. We chose these algorithms because of their simplicity and fast convergence towards the global optimum. Moreover, MOSHO has been tested on 24 benchmark test functions and validated against six recently developed meta-heuristic algorithms. Most existing cancer classification algorithms struggle with high dimensionality and optimal gene selection. Hence, we have attempted to design an optimal framework for cancer classification.

2. Related work

One of the main issues in cancer classification is designing a generic pipeline for classification using existing microarray data. In the past few years, various attempts [33,39] have been made to design optimal prediction algorithms using machine learning and statistical approaches. These prediction algorithms fall into two broad categories: supervised and unsupervised learning.

The key objective of unsupervised learning is to cluster the gene expression data so as to identify the subset of genes that serve as tumor predictors. Perou et al. performed hierarchical clustering to identify the relationship between gene expression patterns and biological variation in breast cancer tissues [40]. Golub et al. proposed a gene clustering approach using self-organizing maps and further used the cluster labels to classify leukemia tissue samples [1]. In supervised learning, various gene classification approaches have been proposed using statistical methods such as Gaussian and logistic predictors, discriminant analysis, and many more [41,42]. Apart from these techniques, machine learning approaches based on Support Vector Machines (SVM) [43,44], Neural Networks (NN) [45,46] and k-nearest neighbor (KNN) [1,47] have also been proposed for cancer classification.

Various nature-inspired algorithms are also used for gene selection and cancer classification. Alshamlan et al. proposed a hybrid gene selection algorithm combining a Genetic Algorithm (GA) and the Artificial Bee Colony (ABC) algorithm [48]. Their goal was to integrate the benefits of both algorithms so as to select the most relevant and predictive genes. Subhajit et al. proposed a gene selection technique using particle swarm optimization (PSO) and k-nearest neighbor (KNN) to identify the most relevant and predictive gene subset for cancer classification [28]. Huerta et al. proposed an embedded approach for cancer gene selection using the genetic algorithm and Fisher's linear discriminant analysis (LDA) [16]. Their approach used LDA's discriminant coefficients as the fitness function, along with an LDA classifier, in the genetic algorithm. PSO is also widely used for solving the problem of feature selection for classification [49–51]. Zhang et al. [49,52–54] proposed various feature selection approaches for classification using multi-objective particle swarm optimization and other optimization algorithms. Shen et al. developed a hybrid approach (HPSOTS) for finding the optimal subset of genes using particle swarm optimization (PSO) and tabu search [17]. The tabu search component helps the algorithm overleap local optima efficiently. They evaluated the technique on three tumor datasets, and their results show that HPSOTS has the potential to deal with high-dimensional gene expression data for gene selection.

Dashtban et al. proposed an evolutionary method for cancer gene selection using artificial intelligence and a genetic algorithm [55]. Their approach first reduces the dimensionality of the data using filter methods (Laplacian and Fisher score) and then applies an integer-coded GA with modified operators. They examined the convergence trends and the complexity of the technique on five benchmark tumor datasets, and further assessed its statistical significance against other available techniques. Lee et al. proposed a hybrid gene selection method using a GA with dynamic parameter setting [56]. Gene subsets are created iteratively, and genes are ranked by their frequency of occurrence across the subsets. They further employed the χ²-test to select the top-ranked genes and then built a classifier using SVM. More recently, researchers have used ensemble learning to distinguish between good and poor prognosis in breast cancer samples [57]. Support Vector Machine and Naive Bayes classifiers are used as base learners in the ensemble classifier. The performance of that approach is evaluated against established clinical criteria (Nottingham Prognostic Index, Adjuvant Online, St. Gallen) and the Mammaprint approach. Chen et al. presented a gene selection approach for clustering relevant genes using kernel functions [27]. Optimal weights are identified for the genes iteratively, and an adaptive distance is used for weighted learning. The approach was tested on eight publicly available datasets using two classifiers (SVM, KNN).


Table 1
Comparison of existing cancer classification techniques.

Author(s) | Pros | Cons
[27] | Does not require parameter optimization for each dataset | Kernel method is used hence model interpretability and incomprehensibility is a major issue
[28] | A PSO adaptive KNN based gene selection method is proposed to select useful genes. A heuristic for selecting the optimal values of K efficiently is also proposed. | Initial parameters for PSO are difficult to design. PSO can converge prematurely and be trapped in local minimum especially with high-dimensional data.
[17] | The incorporation of tabu search (TS) as a local improvement procedure enables the algorithm HPSOTS to overleap local optima and show satisfactory performance. | Using gene-ranking methods, some genes among the selected genes may come out to be redundant.
[29] | The proposed method has integrated with the nonlinear search capability of PSO and linearly separable advantage of DT | Wrapper method has high time complexity
[30] | Supports both array-based and NGS-based miRNA expression data. | Sensitivity analysis for the number of feature subsets is required.
[31] | A newly developed multiobjective simulated annealing-based optimization technique is used to improve cluster quality. | Feature selection is not used for initial gene selection.
[32] | Different combinations of objective functions are used to improve classification accuracy. | Dataset dependency is there on the proposed technique.
[33] | It uses sparse representation with feature selection. | It only uses breast cancer dataset.
[34] | DNA microarrays appearances empowered the simultaneous observing of expression levels of a large number of genes. | The feature selection technique used in this paper selects the features that rarely appear in the specified categories but frequently appear in other categories.
[35] | The average ranking position for different classifying and selection techniques, varying the number of selected attributes for choosing the best algorithms for different attribute intervals. | Finding biological interpretations of the relations between gene expression values and diseases is difficult.
[36] | The samples with the subset of genes are mapped into a dissimilarity space. The dissimilarity-based classifiers outperform the feature-based models. | The proposed approach does not deal with imbalanced class distributions, missing values in genes, data sparsity.
[37] | Deep classification performance is improved by the feature wise pre-processing | Application of proposed technique is only limited to classify breast cancer.
[38] | Developing a clinical decision support system to aid practitioners | The training time performance of the proposed model in the study is worse than the other methods.

The advantage of their technique is that it does not require any parameter optimization. Zhang et al. developed a novel feature selection method for cancer microarray datasets [58]. Their technique uses relevance analysis and a discernibility matrix to select relevant genes. Guoli et al. presented a novel gene selection and tumor identification algorithm using the partial least squares (PLS) method [59]. Their technique integrates a linear-kernel support vector machine and is validated on different tumor datasets.

Although various hybrid meta-heuristic algorithms [60–62] have been proposed in the literature for gene selection, these methods still fall short in solving the cancer classification problem. Most of them suffer from several shortcomings, such as high computational time, unsatisfactory results because the minimization of the selected gene subset size is not treated as an objective, and the need for a large number of iterations and tuning parameters. Therefore, there is a need to develop an effective gene selection algorithm with an efficient global search mechanism. The main objective of this work is to propose a new wrapper approach for gene selection using two powerful meta-heuristic approaches: SSA and MOSHO. The proposed approach efficiently handles high-dimensional gene expression data in order to significantly reduce the number of genes and increase the classification performance. Table 1 compares existing cancer classification techniques.

2.1. Our contribution

1. A hybrid approach is proposed for gene selection using two powerful meta-heuristic approaches: SSA and MOSHO.
2. The issues arising from high-dimensional data are addressed.
3. Salp Swarm Algorithm (SSA) is used to maintain diversity and MOSHO to maintain faster convergence.
4. The proposed approach is not restricted to particular datasets. It can be applied to different types of diseases.
5. Four popularly used classifiers are trained on seven different high-dimensional datasets.

3. Background

In this section, some basic terminology and the background of multi-objective optimization techniques are presented. The multi-objective spotted hyena optimizer, which is used in the proposed framework, is then discussed.

3.1. Basic concepts of multi-objective optimization

An optimization problem is referred to as multi-objective if it involves multiple objective functions that must be satisfied simultaneously in the problem under consideration. It can be formulated as [63,64]:

Minimize: $G(z) = [g_1(z), g_2(z), \ldots, g_n(z)]$  (1)

Subject to: $f_i(z) \ge 0, \quad i = 1, 2, \ldots, k$  (2)

$h_i(z) = 0, \quad i = 1, 2, \ldots, l$  (3)

where $z = [z_1, z_2, \ldots, z_k]^T$ is the vector of decision variables, k is the number of inequality constraints, l is the number of equality constraints, $f_i$ is the ith inequality constraint, $h_i$ is the ith equality constraint, and obj is the number of objective functions used in the given problem (written n in Eq. (1)), with $g_i : \mathbb{R}^{obj} \to \mathbb{R}$. Solutions of multi-objective problems cannot be compared using relational operators alone, because several objectives have to be considered at once. To evaluate the performance comparison of two


approaches, Edgeworth [65] proposed a performance metric, which was further extended by Pareto [66]. This led to Pareto dominance and the related Pareto concepts, defined as follows [63]:

Definition 1. Pareto dominance. Given two vectors $\vec{a} = (a_1, a_2, \ldots, a_r)$ and $\vec{b} = (b_1, b_2, \ldots, b_r)$, vector $\vec{a}$ is said to dominate vector $\vec{b}$ (written $\vec{a} \prec \vec{b}$) if and only if:

$\forall k \in \{1, 2, \ldots, r\} : g_k(\vec{a}) \le g_k(\vec{b}) \;\wedge\; \exists k \in \{1, 2, \ldots, r\} : g_k(\vec{a}) < g_k(\vec{b})$  (4)

Definition 2. Pareto optimality. A multi-objective solution $\vec{a} \in A$ is said to be Pareto optimal if and only if:

$\nexists\, \vec{b} \in A \mid \vec{b} \prec \vec{a}$  (5)

Definition 3. Pareto optimal set. The set of all Pareto optimal (non-dominated) solutions of a given problem is called the Pareto optimal set:

$Q_s = \{\vec{a} \in A \mid \nexists\, \vec{b} \in A : \vec{b} \prec \vec{a}\}$  (6)

Definition 4. Pareto optimal front. The set of objective values of all Pareto optimal solutions in the Pareto optimal set is called the Pareto optimal front. It is defined as:

$Q_g = \{ g(\vec{a}) \mid \vec{a} \in Q_s \}$  (7)

3.2. Spotted hyena optimizer (SHO)

Multi-objective SHO is inspired by SHO [19]. Its main inspiration is the social behavior and communal relationships of spotted hyenas. It focuses on the encircling, hunting, attacking and searching behavior of a trusted group of spotted hyenas. The SHO algorithm mimics this behavior to obtain the best possible solution. The encircling behavior in SHO is captured by the following equations [19]:

$\vec{X}_h = |\vec{A} \cdot \vec{C}_p(x) - \vec{C}(x)|$  (8)

$\vec{C}(x+1) = \vec{C}_p(x) - \vec{B} \cdot \vec{X}_h$  (9)

where $\vec{X}_h$ represents the distance the spotted hyena has to cover to reach the prey, x indicates the current iteration, and $\vec{C}_p$ and $\vec{C}$ are the position vectors of the prey and the spotted hyena, respectively. The symbols || and · denote the absolute value and the multiplication of vectors, respectively. Two coefficient vectors, $\vec{A}$ and $\vec{B}$, are used in SHO. They are computed as:

$\vec{A} = 2 \cdot \vec{xd}_1$  (10)

$\vec{B} = 2\vec{h} \cdot \vec{xd}_2 - \vec{h}$  (11)

$\vec{h} = 5 - \left( Itr \times \frac{5}{MaxItr} \right)$, where $Itr = 0, 1, 2, \ldots, MaxItr$  (12)

Here, h decreases linearly from 5 to 0 over the course of the iterations, and the random vectors $\vec{xd}_1$ and $\vec{xd}_2$ take values in the range [0, 1]. Different positions can be reached by changing the vectors $\vec{A}$ and $\vec{B}$. The algorithm saves the best solution obtained so far and prompts the other agents to update their positions.

The hunting behavior of spotted hyenas is modeled, and potential hunting regions are identified, using the following equations:

$\vec{X}_h = |\vec{A} \cdot \vec{C}_h - \vec{C}_k|$  (13)

$\vec{C}_k = \vec{C}_h - \vec{B} \cdot \vec{X}_h$  (14)

$\vec{O}_h = \vec{C}_k + \vec{C}_{k+1} + \cdots + \vec{C}_{k+N}$  (15)

where $\vec{C}_h$ is the position of the best spotted hyena, $\vec{C}_k$ denotes the positions of the other spotted hyenas, and N is the number of spotted hyenas in the cluster, which is computed as:

$N = count_{nos}(\vec{C}_h, \vec{C}_{h+1}, \vec{C}_{h+2}, \ldots, (\vec{C}_h + \vec{M}))$  (16)

$\vec{C}(x+1) = \frac{\vec{O}_h}{N}$  (17)

Here, nos counts the solutions that are similar to the best solution in the search space under consideration, $\vec{O}_h$ is the cluster of optimal solutions, and $\vec{M}$ is a randomly initialized vector within [0.5, 1]. $\vec{C}(x+1)$ stores the best solution and guides the update of the other search agents.

Exploration is ensured by the vector $\vec{B}$ with random values: if these values are > 1 or < −1, the search agent moves away from the prey. $\vec{A}$ also helps exploration; it is assigned random values in the range [0, 5], which act as weights of the prey. Exploitation in SHO starts when $|\vec{B}| < 1$, with $\vec{B}$ randomly initialized in [−1, 1].

Optimization in SHO starts with the generation of a random population of solutions. Initially, the search agents form clusters by identifying the positions of the best search agents and then update their positions using Eqs. (15)–(17). The values of the parameters h and E (the name used for $\vec{B}$ in Algorithm 2) are decremented in each iteration. After each iteration, the best positions of the search agents are retrieved. Gaurav and Vijay [19] validated their algorithm on various benchmarks and confirmed its optimization efficiency. Their results indicate that SHO has a high exploration capability and performs better than other meta-heuristic algorithms. The reported solutions are globally optimal, which further motivates us to consider the multi-objective version of SHO for our proposed framework.

3.3. Multi-objective spotted hyena optimizer (MOSHO)

In this section, the multi-objective spotted hyena optimizer (MOSHO), which is inspired by the spotted hyena optimizer (SHO) [19], is discussed. Its authors modeled the hunting behavior and social relationships of spotted hyenas to design a multi-objective optimization algorithm. They introduced two new elements into the existing SHO: an archive and a group selection mechanism. The main advantages of MOSHO over other techniques are its high convergence behavior and local optima avoidance. The new components included in MOSHO are discussed below:

3.3.1. Archive

All the best-obtained Pareto optimal solutions are stored in a storage space known as the archive. The stored solutions are spread evenly over the Pareto front, which may be concave, convex or disconnected. The archive has two components: a controller and a grid.
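Because the archive only admits non-dominated solutions, the dominance test of Definition 1 and the Pareto optimal set of Definition 3 can be illustrated with a minimal Python sketch. It assumes that both objectives are minimized; the function names and the example objective vectors (classification error, number of selected genes) are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if objective vector a Pareto-dominates b (minimization)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that no other point dominates (Pareto optimal set)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical objective vectors: (classification error, number of genes).
candidates = [(0.10, 30), (0.08, 45), (0.12, 25), (0.10, 40), (0.20, 50)]
front = non_dominated(candidates)
print(front)  # [(0.1, 30), (0.08, 45), (0.12, 25)]
```

The quadratic scan is enough for small archives; larger populations would usually call for a faster non-dominated sorting routine.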


3.3.1.1. The archive controller: The controller decides whether a particular solution is included in the archive. The update rules for the archive are as follows:

• The current solution is accepted if the archive is empty.
• The solution is automatically discarded if it is dominated by an individual within the archive.
• The solution is stored in the archive if none of the elements contained in the external population dominates it.
• Solutions are removed from the archive if they are dominated by the new element.
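The four controller rules can be sketched in Python as follows. This is a minimal illustration that assumes minimized objectives and ignores the archive capacity limit (which the grid mechanism handles); `update_archive` and the sample objective vectors are hypothetical names and values.

```python
from typing import List, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """Pareto dominance for minimization, as in Definition 1."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def update_archive(archive: List[Objectives],
                   candidate: Objectives) -> List[Objectives]:
    """Apply the controller rules and return the updated archive."""
    # Rule 1: an empty archive accepts the candidate.
    if not archive:
        return [candidate]
    # Rule 2: discard the candidate if any archive member dominates it.
    if any(dominates(member, candidate) for member in archive):
        return archive
    # Rule 4: remove archive members that the candidate dominates.
    survivors = [m for m in archive if not dominates(candidate, m)]
    # Rule 3: the candidate is non-dominated, so it is stored.
    survivors.append(candidate)
    return survivors


archive: List[Objectives] = []
for solution in [(0.10, 30.0), (0.20, 50.0), (0.08, 45.0), (0.09, 28.0)]:
    archive = update_archive(archive, solution)
print(archive)  # [(0.08, 45.0), (0.09, 28.0)]
```

In this run the second solution is rejected by rule 2, and the last one evicts (0.10, 30.0) by rule 4.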

3.3.1.2. Grid: Well-distributed Pareto fronts are obtained using an adaptive grid mechanism [67]. There are four linearly separable regions for the defined objective function. The grid is recalculated, and each individual relocated, whenever an individual from the population lies outside the current grid area [68]. The grid space is formed by a uniform distribution of hypercubes.
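One possible way to realize such a grid is sketched below in Python. It assumes that each objective range spanned by the archive is split into the same number of equal-width divisions, and it also computes the segment weights $U_k = g / N_k$ of Eq. (18) that drive the roulette-wheel choice in the group selection mechanism. The helper names, the number of divisions and the value of g are illustrative assumptions, not values from the paper.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, ...]


def grid_index(point: Sequence[float], lows: Sequence[float],
               highs: Sequence[float], divisions: int) -> Cell:
    """Map an objective vector to the hypercube (cell) that contains it."""
    index = []
    for value, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / divisions or 1.0  # guard against a zero range
        cell = int((value - lo) / width)
        index.append(min(cell, divisions - 1))  # upper edge joins last cell
    return tuple(index)


def segment_weights(archive: List[Sequence[float]], divisions: int = 4,
                    g: float = 2.0) -> Dict[Cell, float]:
    """Weight U_k = g / N_k for every occupied cell, as in Eq. (18)."""
    lows = [min(col) for col in zip(*archive)]
    highs = [max(col) for col in zip(*archive)]
    counts = Counter(grid_index(p, lows, highs, divisions) for p in archive)
    return {cell: g / n for cell, n in counts.items()}


def roulette_pick(weights: Dict[Cell, float], rng: random.Random) -> Cell:
    """Pick a cell with probability proportional to its weight."""
    cells = list(weights)
    return rng.choices(cells, weights=[weights[c] for c in cells], k=1)[0]


# Hypothetical archive of (error, gene fraction) objective vectors.
archive = [(0.0, 1.0), (0.1, 0.9), (0.5, 0.5), (1.0, 0.0)]
weights = segment_weights(archive, divisions=2, g=2.0)
print(weights)  # {(0, 1): 1.0, (1, 1): 2.0, (1, 0): 2.0}
print(roulette_pick(weights, random.Random(0)))
```

Sparsely populated cells receive larger weights, so the roulette wheel favors less crowded regions of the front.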

3.3.1.3. Group selection mechanism: The major issue in a multi-objective search space is comparing new solutions with the existing solutions. This issue is resolved using the group selection mechanism. In this strategy, a less crowded region of the search space is chosen and one of the best solutions from its group of nearby solutions is offered, using the roulette-wheel technique defined below:

$U_k = \frac{g}{N_k}$  (18)

where g is a constant with value > 1 and $N_k$ is the number of Pareto optimal solutions in the kth segment. This is a classical, widely used method that defines the contribution of each individual through its roulette-wheel proportion. The proposed MOSHO algorithm is an extension of the SHO algorithm, differing in its multi-objectivity and its search space. MOSHO uses the archive as its search space, whereas SHO has the additional task of saving the optimal solutions.

3.4. Salp swarm algorithm (SSA)

The salp swarm algorithm is a bio-inspired meta-heuristic optimization algorithm developed by Mirjalili et al. in 2017 [26]. It is based on the swarming behavior of salps when navigating and foraging in the deep sea. This swarming behavior is modeled mathematically as a salp chain. The chain is divided into two groups: the leader and the followers. The leader guides the whole chain from the front, while the followers follow each other. The updated position of the leader in an n-dimensional search environment is described as follows:

$x_i^1 = \begin{cases} F_i + c_1((ub_i - lb_i)c_2 + lb_i), & c_3 \ge 1 \\ F_i - c_1((ub_i - lb_i)c_2 + lb_i), & c_3 < 1 \end{cases}$  (19)

where $x_i^1$ represents the position of the first salp (the leader) in the ith dimension, $F_i$ is the position of the food source, and $ub_i$ and $lb_i$ are the upper and lower bounds of the ith dimension, respectively. $c_1$, $c_2$ and $c_3$ are random numbers. The coefficient $c_1$ is responsible for balancing exploration and exploitation and is defined as follows:

$c_1 = 2e^{-\left(\frac{4l}{L}\right)^2}$  (20)

where l is the current iteration and L is the maximum number of iterations. The parameters $c_2$ and $c_3$ are random numbers in the range [0, 1]. The positions of the followers are updated as follows:

$x_i^j = \frac{1}{2} A T^2 + V_0 T, \quad j \ge 2$  (21)

where $x_i^j$ is the position of the jth follower in the ith dimension, T represents the time and $V_0$ the initial speed. The parameter A is calculated as follows:

$A = \frac{V_{final}}{V_0}, \quad V = \frac{x - x_0}{T}$  (22)

Considering $V_0 = 0$, the update can be expressed as:

$x_i^j = \frac{1}{2}\left(x_i^j + x_i^{j-1}\right)$  (23)

The proposed SSA algorithm is able to solve high-dimensional problems using low computational efforts. In this work, the behaviors of SSA algorithm are applied to solve large scale economic dispatch and micro grid power dispatch problems. The pseudo-code of the SSA algorithm is shown in Algorithm 1 [26].

Algorithm 1 Salp Swarm Algorithm (SSA).

Initialize the salp population x_i (i = 1, 2, ..., n) considering upper bound (ub) and lower bound (lb)
while (maximum number of iterations is not reached) do
    Calculate the fitness of each salp (search agent)
    F = the leading (best) search agent
    Update c_1 using Eq. (20)
    for each salp (x_i) do
        if i == 1 then
            Update the position of the leading salp using Eq. (19)
        else
            Update the position of the follower salp using Eq. (23)
        end if
    end for
    Adjust the salps according to the upper and lower bounds of the variables
end while
return F
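Algorithm 1 can be turned into a small runnable Python sketch for continuous minimization. Two points are assumptions rather than statements from the paper: the leader switches between the two branches of Eq. (19) at $c_3 = 0.5$ (with $c_3$ drawn from [0, 1], the printed threshold of 1 would almost never select the first branch), and a fresh pair $c_2$, $c_3$ is drawn for every dimension. The function name `salp_swarm`, its parameters and the sphere test function are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def salp_swarm(objective: Callable[[Sequence[float]], float],
               lb: Sequence[float], ub: Sequence[float],
               n_salps: int = 20, max_iter: int = 100,
               seed: int = 0) -> Tuple[List[float], float]:
    """Minimize `objective` inside the box [lb, ub] with a basic SSA."""
    rng = random.Random(seed)
    dim = len(lb)
    salps = [[rng.uniform(lb[d], ub[d]) for d in range(dim)]
             for _ in range(n_salps)]
    food: List[float] = []
    food_fit = math.inf

    def refresh_food() -> None:
        """Keep the best position found so far as the food source F."""
        nonlocal food, food_fit
        for salp in salps:
            fit = objective(salp)
            if fit < food_fit:
                food, food_fit = list(salp), fit

    for l in range(1, max_iter + 1):
        refresh_food()
        c1 = 2.0 * math.exp(-((4.0 * l / max_iter) ** 2))  # Eq. (20)
        for i, salp in enumerate(salps):
            for d in range(dim):
                if i == 0:
                    # Leader update, Eq. (19); the 0.5 threshold is assumed.
                    c2, c3 = rng.random(), rng.random()
                    step = c1 * ((ub[d] - lb[d]) * c2 + lb[d])
                    salp[d] = food[d] + step if c3 >= 0.5 else food[d] - step
                else:
                    # Follower update, Eq. (23).
                    salp[d] = 0.5 * (salp[d] + salps[i - 1][d])
                # Bring the salp back inside the bounds.
                salp[d] = min(max(salp[d], lb[d]), ub[d])
    refresh_food()
    return food, food_fit


def sphere(x: Sequence[float]) -> float:
    """Simple convex test function with its minimum at the origin."""
    return sum(v * v for v in x)


best, best_fit = salp_swarm(sphere, lb=[-5.0] * 3, ub=[5.0] * 3, seed=1)
print(best, best_fit)
```

For gene selection, the continuous positions would additionally have to be mapped to a binary gene mask, for instance by thresholding; that mapping is not shown here.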

4. Proposed framework

The proposed bio-inspired framework consists of two major steps. In the first step, a filter method is used to remove irrelevant genes and to select the most relevant subset of genes for further processing. In the next step, a hybrid bio-inspired algorithm named HMOSHSSA (Hybrid Multi-objective Spotted Hyena Optimizer and Salp Swarm Algorithm) is proposed, which uses SSA (Salp Swarm Algorithm) and MOSHO to explore the most promising subset of genes. The two objectives of the proposed algorithm are a minimum number of genes and better prediction accuracy. The complete flowchart of the proposed framework is shown in Fig. 1.

Fig. 1. Flowchart of proposed framework.

4.1. Initial gene selection

Feature selection is one of the critical steps in various machine learning and data mining applications. It helps to reduce the search space and the complexity of the algorithm. Most hybrid algorithms for cancer classification use filter methods as a preliminary step for selecting the relevant subset of genes. Such techniques remove irrelevant and redundant data and thereby address the high-dimensionality problem, which is one of the main concerns with microarray datasets. As a result, the efficiency of microarray data analysis improves once irrelevant genes are removed. In the proposed framework, the Fisher score method, which has shown promising results in finding informative genes [55,69–71], is used for initial gene selection. Apart from the Fisher score, any other filter method can be used in the proposed framework, such as the Laplacian score [72], minimum redundancy maximum relevance (mRMR) [73] and many other statistical methods [74]. In this study, the top 500 genes were selected in the initial filtering step using the Fisher score method, as suggested by Lee and Leu [55,56].
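The filtering step can be illustrated with a short, dependency-free Python sketch. The paper does not spell out the Fisher score formula in this section, so the sketch assumes the common definition: the class-size-weighted squared distance between class means and the overall mean, divided by the class-size-weighted within-class variance. The function names and the toy expression matrix are hypothetical.

```python
import math
from typing import List, Sequence


def fisher_scores(X: Sequence[Sequence[float]],
                  y: Sequence[int]) -> List[float]:
    """Fisher score of every gene (column of X) for class labels y."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between, within = 0.0, 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            mean_c = sum(values) / len(values)
            var_c = sum((v - mean_c) ** 2 for v in values) / len(values)
            between += len(values) * (mean_c - overall_mean) ** 2
            within += len(values) * var_c
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class variance: perfectly separated or constant.
            scores.append(math.inf if between > 0 else 0.0)
    return scores


def top_k_genes(X: Sequence[Sequence[float]], y: Sequence[int],
                k: int) -> List[int]:
    """Indices of the k genes with the highest Fisher score."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data: 4 samples x 3 genes; gene 0 separates the two classes well.
X = [[1.0, 5.0, 0.2],
     [1.2, 4.0, 0.9],
     [3.0, 5.5, 0.1],
     [3.2, 4.5, 1.0]]
y = [0, 0, 1, 1]
print(top_k_genes(X, y, k=2))  # [0, 1]
```

With real microarray data, k would be set to 500 to match the filtering described above.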

Algorithm 2 Hybrid Multi-objective Spotted Hyena Optimizer and Salp Swarm Algorithm. Input: the spotted hyenas population Psh(x ) (x ← 1, 2, . . . , n) Output: the best obtained search agent P 1: 2: 3:

4.2. Proposed algorithm: HMOSHSSA (Hybrid multi-objective spotted hyena optimizer and Salp swarm algorithm)

After the initial gene selection, the selected subset of genes from each dataset is given as input to the proposed algorithm. The first step is to initialize the spotted hyena population and the initial parameters of the HMOSHSSA algorithm, as listed in Table 2. After initialization, the objective value of each search agent is calculated using the FITNESS function defined in line 31 of Algorithm 2. The archive is initialized with the set of explored non-dominated solutions, and the best search agent from the archive is used to explore the given search space. Next, the group of optimal solutions is defined from the archive using Eqs. (15) and (16) until a suitable result is found for each search agent. In line 12 of Algorithm 2, the position of each search agent is updated using Eq. (17). Up to this point the steps come from the MOSHO algorithm, whereas lines 11–12 of Algorithm 2 represent the steps taken from the SSA algorithm. The leader and follower selection approach is applied to update the positions of the search agents using Eqs. (19)–(23). The objective value of each search agent is then calculated again to find the non-dominated solutions, and the archive is updated with them. If the archive is full, the grid mechanism is run to eliminate one of the most crowded segments and fit the new solution into the archive. Any search agent that goes beyond the boundary of the search space is adjusted. The objective value of the updated search agent is calculated, and the vector Psh is updated if there is a better solution in the archive. The group of spotted hyenas Oh is then updated with respect to the updated search agent from the archive. The algorithm stops when the stopping criteria are satisfied; otherwise, the group of optimal solutions is defined again using Eqs. (15) and (16) from the archive until a suitable result is found for each search agent. Finally, once the stopping criteria are satisfied, the optimal solutions of the archive are returned. The objective function is defined as follows:

$$\mathrm{obj} = \alpha \times \left(\frac{\mathrm{Accuracy}}{100}\right) + \beta \times \left(\frac{T - n}{T}\right) \tag{24}$$

where α > β, α + β = 1, T is the total number of genes, and n is the count of selected genes. The steps of HMOSHSSA are summarized as follows:

Step 1: Initialize the spotted hyena population and the initial parameters of the MOSHO algorithm.
Step 2: Calculate the objective value of each search agent.
Step 3: Initialize the archive with the explored non-dominated solutions.
Step 4: Use the best search agent from the archive to explore the given search space.
Step 5: Define the group of optimal solutions using Eqs. (15) and (16) from the archive until a suitable result is found for each search agent.
Step 6: Update the position of each search agent using Eq. (17).
Step 7: Apply the leader and follower selection approach to the updated positions of the search agents using Eqs. (19)–(23).
Step 8: Calculate the objective value of each search agent again and find the non-dominated solutions.
Step 9: Update the archive with the obtained non-dominated solutions.


 1: procedure HMOSHSSA
 2:     Initialize the parameters B, h, E and MaxIteration
 3:     Oh = group of found optimal solutions with respect to Ch (archive)
 4:     while (x < MaxIteration) do
 5:         FITNESS(Psh)    /* Compute the fitness of each search agent using FITNESS function */
 6:         B ← 2 × Rand()
 7:         h ← 5 − x × (5 / MaxIteration)    /* Compute the temperature profile around the huddle */
 8:         E ← 2 × h × Rand() − h
 9:         for i ← 1 to n do
10:             for j ← 1 to n do
11:                 Compute the group of optimal solutions using Eqs. (15) and (16)
12:                 Update the position of current agent using Eq. (17)
13:             end for
14:         end for
15:         Update parameters B, h, and E
16:         Apply the leader and follower selection approach to the updated positions of search agents using Eqs. (19)–(23)
17:         FITNESS(Psh)    /* Again compute the fitness value of updated search agents using FITNESS function */
18:         Again find the non-dominated solutions from updated pool of agents
19:         Update archive with new solutions
20:         if archive is full then
21:             Grid method helps to exclude archive members that are responsible for overcrowding
22:             New solutions are added to archive
23:         end if
24:         Amend search agent which goes beyond the region of search space
25:         Update Ch if better solutions are found
26:         Group updation is performed for Oh with respect to archive
27:         x ← x + 1
28:     end while
29:     return P
30: end procedure
31: procedure FITNESS(Psh)
32:     for i ← 1 to n do
33:         FIT[i] ← FITNESS_FUNCTION(Psh)
34:     end for
35:     FITbest ← BEST(FIT[])    /* Compute the best fitness value using BEST function */
36:     return FITbest
37: end procedure
38: procedure BEST(FIT[])
39:     best ← FIT[0]
40:     for i ← 1 to n do
41:         if (FIT[i] < best) then
42:             best ← FIT[i]
43:         end if
44:     end for
45:     return best
46: end procedure
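The control-parameter updates in lines 6–8 of Algorithm 2 and the weighted objective of Eq. (24) can be sketched in a few lines of Python. This is only an illustrative reading, not the authors' Matlab implementation: the function names `control_parameters` and `objective` are hypothetical, and the default weights `alpha=0.8`, `beta=0.2` are assumed values chosen to satisfy α > β and α + β = 1 (the paper does not state the weights it used).

```python
import random


def control_parameters(x, max_iteration, rng=random):
    """Lines 6-8 of Algorithm 2 for iteration counter x (illustrative sketch)."""
    b = 2.0 * rng.random()                 # B <- 2 * Rand()
    h = 5.0 - x * (5.0 / max_iteration)    # h decreases linearly from 5 to 0
    e = 2.0 * h * rng.random() - h         # E <- 2 * h * Rand() - h, so E lies in [-h, h]
    return b, h, e


def objective(accuracy_pct, n_selected, n_total, alpha=0.8, beta=0.2):
    """Eq. (24): alpha * (Accuracy / 100) + beta * ((T - n) / T).

    alpha and beta are assumed example weights, not values from the paper.
    """
    if not (alpha > beta and abs(alpha + beta - 1.0) < 1e-9):
        raise ValueError("Eq. (24) requires alpha > beta and alpha + beta = 1")
    if not 0 < n_selected <= n_total:
        raise ValueError("n must satisfy 0 < n <= T")
    accuracy_term = accuracy_pct / 100.0
    reduction_term = (n_total - n_selected) / n_total
    return alpha * accuracy_term + beta * reduction_term


if __name__ == "__main__":
    # Four genes kept out of the 500 that survive the initial selection.
    print(objective(100.0, 4, 500))
    print(control_parameters(0, 1000, random.Random(0)))
```

Under this reading the objective grows with higher accuracy and with fewer selected genes. The BEST procedure in Algorithm 2 keeps the smallest fitness value, so a minimizing implementation would presumably work with the negated objective; that pairing is an assumption, since the listing does not spell it out.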


Step 10: If the archive is full, run the grid mechanism to eliminate one of the most crowded segments and fit the new solution into the archive.
Step 11: Check whether any search agent goes beyond the boundary of the search space and, if so, adjust it.
Step 12: Calculate the objective value of the updated search agent and update the vector Psh if there is a better solution in the archive.
Step 13: Update the group of spotted hyenas Oh with respect to the updated search agent from the archive.
Step 14: Stop the algorithm if the stopping criteria are satisfied; otherwise, return to Step 5.
Step 15: Return the optimal solutions of the archive once the stopping criteria are satisfied.

4.3. Computational complexity

This section discusses the computational complexity of the proposed algorithm. The time and space complexities of HMOSHSSA are given below.

4.3.1. Time complexity

1. Initialization of the HMOSHSSA population requires O(n_o × n_p) time, where n_o denotes the number of objectives and n_p denotes the population size.
2. The calculation of each search agent's fitness needs O(Maxiterations × n_o × n_p) time, where Maxiterations is the maximum number of iterations used to simulate the proposed HMOSHSSA algorithm.
3. O(M) time is required to define the group of spotted hyenas, where M indicates the number of spotted hyenas.
4. The follower and leader selection approach requires O(S) time.
5. Updating the archive of non-dominated solutions requires O(n_o × (n_ns + n_p)) time.
6. Steps 2–5 are repeated until the termination criterion is satisfied.

Therefore, the overall time complexity of the proposed HMOSHSSA algorithm is O(Maxiterations × n_o × (n_p + n_ns) × M × S).

4.3.2. Space complexity

In the HMOSHSSA algorithm, space is required only during program initialization, which is a one-time process. Hence, the total space complexity of the HMOSHSSA algorithm is O(n_o × n_p).

4.4. Proposed framework for gene classification using the proposed approach

This section explains the overall workflow of the proposed framework for cancer classification using HMOSHSSA. The framework is described in Algorithm 3. The complete algorithm is divided into three steps. In the first step, high-dimensional gene expression data is given as input to the Fisher score statistical method to reduce the feature set, and the top 500 genes are selected as informative genes from the dataset. In the second step, the reduced dataset is given as input to the proposed bio-inspired feature selection method, and the fitness of the feature subsets is evaluated using the fitness function defined in the algorithm. Finally, the optimal subset of genes is identified and a classifier is trained using a machine learning algorithm to develop the gene classification prediction model.

5. Experimental results

The performance of the proposed approach is compared against four standard bio-inspired algorithms: Multi-objective Differential Evolution (MODE), Nondominated Sorting Genetic Algorithm

Algorithm 3 C-HMOSHSSA.
Input: Dataset D, Labels of Dataset, Maximum number of iterations Maxiter
Output: Predictive model γ
 1: procedure
 2:     Choose the initial parameters and normalize the data D
 3:     α ← Initial gene selection using Fisher score
 4:     while (iteration < Maxiter) do
 5:         β ← Select the best genes subset using Algorithm 2 HMOSHSSA(α)
 6:         iteration ← iteration + 1
 7:     end while
 8:     γ ← Train classifier model (β)
 9:     return γ
10: end procedure
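Line 3 of Algorithm 3 relies on a Fisher score filter, which the paper does not define in this section. The sketch below uses one common textbook form of the score (between-class scatter of a gene divided by its within-class scatter); whether the authors used exactly this variant is an assumption, and the names `fisher_scores` and `top_k_genes` are hypothetical. The default `k=500` mirrors the "top 500 genes" mentioned in Section 4.4.

```python
from statistics import fmean, pvariance


def fisher_scores(samples, labels):
    """Fisher score per gene: sum_c n_c (mu_c - mu)^2 / sum_c n_c var_c.

    samples: list of rows, one list of expression values per sample.
    labels:  class label of each sample.
    """
    n_genes = len(samples[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_genes):
        column = [row[j] for row in samples]
        overall_mean = fmean(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, lab in zip(column, labels) if lab == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        if within > 0.0:
            scores.append(between / within)
        else:
            # Zero within-class spread: perfectly separating or constant gene.
            scores.append(float("inf") if between > 0.0 else 0.0)
    return scores


def top_k_genes(samples, labels, k=500):
    """Zero-based indices of the k genes with the highest Fisher score."""
    scores = fisher_scores(samples, labels)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy data: gene 0 separates the two classes, gene 1 does not.
    X = [[1.0, 5.0], [1.2, 4.0], [3.0, 5.1], [3.2, 4.1]]
    y = [0, 0, 1, 1]
    print(fisher_scores(X, y))
    print(top_k_genes(X, y, k=1))
```

The reduced index list returned by `top_k_genes` plays the role of α in Algorithm 3, that is, the candidate pool handed to HMOSHSSA.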

Table 2 Parameter setting for proposed algorithm.

Parameter                      Value       Parameter                  Value
Grid inflation parameter α     0.1         Population                 80
Group selection parameter β    4           Number of grids            10
Vector h                       [5, 0]      Vector M                   [0.5, 1]
x_d^1                          [0, 1]      x_d^2                      [0, 1]
Number of iterations           1000        Number of search agents    30

(NSGA-II), Multi-objective Particle Swarm Optimization (MOPSO), and Multi-objective Binary Bat Algorithm (MOBBA-LS) [75]. Further, different machine learning algorithms are used to train the classifiers using four benchmark datasets. The classifiers are compared on the classification accuracy and on the number of predictive genes used for cancer classification. The accuracy of a classifier is defined as the percentage of correctly classified samples out of the total number of samples.

$$\mathrm{Accuracy} = \frac{C}{N} \times 100 \tag{25}$$

where C is the number of correctly classified samples and N is the total number of samples considered. The prediction accuracy of each subset is evaluated using four widely used classifiers, namely Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naive Bayes (NBY), and Decision Tree (DT). Each classifier is assessed on the training samples and on the testing samples. In addition, the proposed framework uses Leave-One-Out Cross Validation (LOOCV), which gives every sample a fair chance during training. Given k samples, LOOCV uses k−1 samples for training and holds out the remaining sample as the test case. The process is then repeated by returning the previous test sample to the training set and holding out a different sample as the test case. The whole process is repeated until every sample has been tested.
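The LOOCV loop described above, scored with Eq. (25), can be sketched as follows. A 1-nearest-neighbor rule stands in for the classifier only to keep the example self-contained; the paper uses SVM, KNN, NBY, and DT in Matlab, and the names `nearest_neighbor_predict` and `loocv_accuracy` are hypothetical.

```python
import math


def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training sample closest to x (1-NN, Euclidean distance)."""
    closest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[closest]


def loocv_accuracy(samples, labels, predict=nearest_neighbor_predict):
    """Leave-one-out accuracy in percent: (C / N) * 100 as in Eq. (25)."""
    correct = 0
    for i in range(len(samples)):
        # Hold out sample i and train on the remaining k - 1 samples.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / len(samples)


if __name__ == "__main__":
    # Toy one-gene data with two well-separated classes.
    X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
    y = [0, 0, 0, 1, 1, 1]
    print(loocv_accuracy(X, y))
```

Any classifier can be plugged in through the `predict` argument, as long as it takes the training samples, the training labels, and one held-out sample.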

5.1. Experimental setup

The parameter settings of the proposed approach are tabulated in Table 2. The parameter values of the competing algorithms are set as recommended in their original papers. The experiments were run in Matlab R2014a (8.3.0.532) on Microsoft Windows 8.1, using a 64-bit Core i5 processor at 2.40 GHz with 4 GB of main memory. Accuracy is used as the performance metric for the proposed approach. The proposed approach is evaluated over 30 independent runs, each of which uses 1000 iterations.

Table 3 Datasets.

Dataset                   Genes     Classes   # samples   # train   # test
SRBCT                     2308      4         83          63        20
Leukemia                  7129      2         72          38        34
Prostate                  12,600    2         136         102       34
Breast                    3226      2         22          11        11
Breast2                   24,481    2         97          72        25
Central Nervous System    7129      2         60          45        15
GSE25136                  22,283    2         79          59        20

5.2. Datasets

A microarray dataset is generally represented as an N by M matrix, where N is the number of subjects (samples) and M is the number of features (gene expression levels). In the proposed framework, four different tumor microarray datasets were used to evaluate the performance of the proposed gene selection and tumor classification method. Table 3 contains a brief description of the datasets used in the proposed framework. The first dataset consists of small round blue cell tumor (SRBCT) samples [76]. It has four tumor classes, with 83 samples of 2308 genes each. The second dataset is the leukemia microarray dataset with two distinct classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) [1]. It consists of 72 samples, of which 47 are ALL and 25 are AML, with 5147 genes. The third dataset is the Prostate cancer dataset [77], with 136 samples corresponding to 12,600 genes. The fourth dataset is the Breast cancer dataset, containing 22 samples, of which 21 are from females and one is from a male [78]. Fifteen females had a hereditary tumor, seven with a BRCA-1 tumor and eight with a BRCA-2 tumor. Breast2 and central nervous system cancer are two more benchmark datasets, obtained from http://csse.szu.edu.cn/staff/zhuzx/Datasets.html. Further, we have also used one clinical dataset freely available from GEO (GSE25136) [79]. This dataset consists of gene expression profile data associated with prostate cancer, obtained from 79 cases, 39 of which were classified as having cancer.

5.3. Investigating the influence of hyperparameters on HMOSHSSA

The problem of optimizing the hyperparameters λ ∈ Λ of a given learning algorithm A is conceptually similar to that of model selection. Some key differences are that hyperparameters are often continuous, that hyperparameter spaces are often high-dimensional, and that we can exploit correlation structure between different hyperparameter settings λ1, λ2 ∈ Λ. The hyperparameter tuning of meta-heuristics is one of the major challenges [80]. Different hyperparameter values may yield different results, as reported in the existing literature. The proposed algorithm requires two parameters: the maximum number of iterations and the number of search agents. The sensitivity to these hyperparameters is analyzed by altering their values while keeping the other parameter values fixed.

1. Maximum number of iterations: The HMOSHSSA algorithm is iterated multiple times. The values of Maxiteration used in the experiments are 100, 500, 800, and 1000. Table 4 shows the optimal values obtained on the SRBCT, Leukemia, Prostate, and Breast datasets for these settings. Fig. 2(a, c, e, g) shows the effect of the number of iterations on all four datasets (SRBCT, Leukemia, Prostate, Breast). Increasing the number of iterations improves the convergence of HMOSHSSA towards the optimal results.
2. Number of search agents: To analyze the effect of the number of search agents on the benchmark datasets, the HMOSHSSA algorithm was executed with 30, 50, 80, and 100 search agents. Table 5 shows the optimal values obtained on the SRBCT, Leukemia, Prostate, and Breast datasets for these settings. Fig. 2(b, d, f, h) shows the effect of the number of search agents (population) on all four datasets (SRBCT, Leukemia, Prostate, Breast). The results show that HMOSHSSA provides the best optimal solutions when the number of search agents is set to 80.

Table 4 The obtained optimal values on SRBCT, Leukemia, Prostate, and Breast datasets using different simulation runs (i.e., 100, 500, 800, and 1000 iterations).

Iterations   SRBCT       Leukemia    Prostate    Breast
100          1.71E-17    6.07E+00    2.21E-03    8.07E-01
500          4.44E-20    5.01E+00    1.10E-03    6.97E-01
800          3.00E-25    4.52E+00    6.22E-04    5.12E-01
1000         1.42E-29    4.04E+00    3.20E-05    2.88E-01

Table 5 The obtained optimal values on SRBCT, Leukemia, Prostate, and Breast datasets where the number of iterations is fixed at 1000. The population size is varied from 30 to 100.

Population   SRBCT       Leukemia    Prostate    Breast
30           1.41E-16    6.01E+00    2.11E-03    4.27E-01
50           4.43E-19    5.31E+00    1.30E-03    6.20E-01
80           1.22E-28    4.00E+00    3.24E-05    2.22E-01
100          4.97E-23    4.42E+00    7.32E-04    5.18E-01

5.4. Evaluation on leukemia dataset

This is the first dataset used in our experiments to check the performance of the proposed algorithm. The proposed algorithm identifies a subset of four genes (Table 6) that are quite informative for classifying the samples. The genes in this subset are consistent with literature evidence. Table 6 shows the performance of the proposed approach when used with different classifiers, namely DT, KNN, NBY, and SVM. When these classifiers are trained on the leukemia dataset, no misclassification is found for KNN, SVM, and DT, as shown in Table 6. On testing, there is one misclassification for NBY and none for KNN and SVM. The decision tree has two misclassifications, one for AML and another for ALL. The biological relevance of these genes was also stated in previous studies [81,82]. The gene with database-id 2354 is also reported in a study presented by Ai-Jun and Xin-Yuan [83]. LOOCV is performed to check the robustness of the proposed framework.

5.5. Evaluation on prostate dataset

The second dataset used for experimental analysis is of prostate cancer patients. The original dataset is divided into training and testing sets, with 102 samples in the training set and 34 samples in the testing set. The proposed approach identifies a subset of four genes that are highly informative for tumor classification. Table 7 gives the complete information about the genes that are identified as potential biomarkers for prostate cancer.
These genes are also identified in previous studies [27,84] as potential biomarkers. The dataset is trained on four different classifiers. The SVM classifier shows the most promising results, with zero misclassifications on the training data, as shown in Table 7. KNN and NBY both have more misclassifications among the tumor samples of the training dataset. On the testing dataset, SVM and KNN perform similarly, with only one misclassification each, whereas DT has the highest number of misclassifications (23). This suggests that SVM and KNN outperform other algorithms in tumor classification on the prostate dataset. SVM performed best on the prostate dataset, with no misclassification in the training data, one misclassification in the testing data, and fewer misclassifications for LOOCV.

Fig. 2. Sensitivity analysis of the MOSHO algorithm for (a, c, e, g) the number of iterations and (b, d, f, h) the population size.

Table 6 The number of misclassified samples and identified genes in Leukemia cancer dataset.

Genes, accession code:    M92287_at, J04615_at, X95735_at, K01383
Genes, database index:    2354, 1120, 4847, 1156

Data     SVM   KNN   NBY   DT
Train    0     0     1     0
Test     0     0     1     2
LOOCV    1     1     0     1

Table 7 The number of misclassified samples and identified genes in Prostate cancer dataset.

Genes, accession code:    36836_at, 39732_at, 41504_s_at, 38044_at
Genes, database index:    6145, 7623, 10234, 9050

Data     SVM   KNN   NBY   DT
Train    0     3     7     2
Test     1     1     0     23
LOOCV    5     7     9     8

Table 8 The number of misclassified samples and identified genes in SRBCT cancer dataset.

Genes, image-id:          812,105; 755,239; 1,435,862; 244,618
Genes, database index:    742, 801, 545, 2046

Data     SVM   KNN   NBY   DT
Train    0     0     2     0
Test     0     0     0     4
LOOCV    5     6     2     7

5.6. Evaluation on SRBCT dataset

Thirdly, we have used the SRBCT dataset to evaluate the performance of the proposed approach. This dataset contains childhood tumor samples from four different classes: neuroblastoma, Ewing's sarcoma, Burkitt's lymphoma, and rhabdomyosarcoma. The proposed approach has identified a subset of four genes, two of which are already identified in the literature. The gene with image-id 1,435,862 was identified as an SRBCT biomarker in [48], and the gene with image-id 244,618 was identified as a potentially informative gene in [28]. Table 8 shows the classification results of the proposed framework on the SRBCT dataset. The results show that SVM, KNN, and DT perform well on this dataset. According to Table 8, NBY has 2 misclassifications on the training data and none on the testing data. Overall, KNN and SVM outperform the other algorithms in classification, even though NBY has the lowest number of misclassifications in LOOCV.

5.7. Evaluation on Breast dataset

For the Breast cancer dataset, the proposed approach yields two subsets of genes that could be potential biomarkers for breast cancer classification. The first subset has three genes and the second has six genes. We also found literature evidence of these genes for breast cancer classification [85]. The results are shown for the minimal subset, i.e., the three genes with database ids 985, 549, and 132. Table 9 shows the experimental results of the proposed technique on the breast cancer dataset. Using these three genes we have trained SVM, KNN, NBY, and DT.

Table 9 The number of misclassified samples and identified genes in Breast cancer dataset.

Genes, Fisher index:      193, 302, 459
Genes, database index:    985, 549, 132

Data     SVM   KNN   NBY   DT
Train    1     2     0     1
Test     0     2     0     1
LOOCV    0     1     0     0

KNN shows lower performance than the other competing algorithms, with 2 misclassifications in each of the training and testing datasets. SVM, NBY, and DT achieved 100% accuracy for LOOCV. Overall, NBY and SVM perform well on the breast cancer dataset.

5.8. Evaluation on central nervous system (CNS) dataset

For the CNS dataset, the proposed gene selection algorithm has identified a minimum set of 5 genes with potential for tumor classification. Among the identified genes, J02611_at corresponds to apolipoprotein D (APOD), which is a known biomarker for CNS, breast, colorectal, and prostate cancer. Table 10 shows the performance of the proposed feature selection algorithm using four different classifiers. On training, one misclassification each is found for SVM, KNN, and NBY, and none for DT. On testing, no misclassification is found for SVM and KNN, but there is one misclassification for NBY and two for DT.

5.9. Evaluation on Breast2 dataset

For the Breast2 dataset, the proposed approach has identified a minimum set of four genes that gives the highest prediction accuracy. The details of the genes and their IDs are given in Table 11.

Table 10 The number of misclassified samples and identified genes in Central Nervous System (CNS) dataset.

Genes, accession code:    J02611_at, U41737_at, M14328_s_at, U18018_at, HG2994HT4850_s_at
Genes, database index:    1054, 3185, 6179, 2854, 5812

Data     SVM   KNN   NBY   DT
Train    1     1     1     0
Test     0     0     1     2
LOOCV    1     1     0     1

Table 11 The number of misclassified samples and identified genes in Breast2 dataset.

Genes, accession code:    AL080059, AL110282, AF055033, NM018964
Genes, database index:    10889, 12078, 24338

Data     SVM   KNN   NBY   DT
Train    1     1     0     0
Test     0     2     0     1
LOOCV    1     2     1     0

Table 12 The number of misclassified samples and identified genes in GSE25136 dataset.

Genes, accession code:    CRHR1, ACTB, ATP6V0A2, SGF29
Genes, database index:    214619_at, AFFXHSAC07, 205704_S_at, 48117_at

Data     SVM   KNN   NBY   DT
Train    1     0     1     1
Test     1     2     2     3
LOOCV    2     2     1     4

Table 13 Performance of proposed C-HMOSHSSA using four different classifiers and seven datasets. Entries are accuracy in % (number of selected genes).

Methods         Leukemia    Prostate    SRBCT     Breast      CNS         Breast2   GSE25136
Proposed+SVM    100(4)      97.05(4)    100(4)    100(3)      100(5)      100(4)    95(4)
Proposed+KNN    100(4)      97.05(4)    100(4)    81(3)       100(5)      92(4)     90(4)
Proposed+NBY    97.05(4)    100(4)      100(4)    100(3)      93.33(5)    100(4)    90(4)
Proposed+DT     94.11(4)    32.35(4)    80(4)     90.90(3)    86.66(5)    96(4)     85(4)
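The accuracies in Table 13 appear to follow from Eq. (25) applied to the test-set sizes in Table 3 and the test-set misclassification counts in Tables 6–12. A small check, with hypothetical helper names `accuracy_percent` and `accuracy_from_errors`:

```python
def accuracy_percent(n_correct, n_total):
    """Eq. (25): percentage of correctly classified samples."""
    return 100.0 * n_correct / n_total


def accuracy_from_errors(n_test, n_misclassified):
    """Accuracy implied by a test-set size and a misclassification count."""
    return accuracy_percent(n_test - n_misclassified, n_test)


# (dataset, classifier) -> (test samples from Table 3, test errors from Tables 6-12)
examples = {
    ("Leukemia", "NBY"): (34, 1),
    ("Prostate", "DT"): (34, 23),
    ("Breast", "KNN"): (11, 2),
    ("GSE25136", "SVM"): (20, 1),
}

if __name__ == "__main__":
    for (dataset, clf), (n_test, errors) in examples.items():
        print(f"{dataset:9s} {clf:3s} {accuracy_from_errors(n_test, errors):6.2f}")
```

These four cases evaluate to about 97.06, 32.35, 81.82, and 95.00, against the reported 97.05, 32.35, 81, and 95. The small differences suggest that some table entries were truncated rather than rounded, although that is only an inference from the numbers.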

When the classifiers are trained on samples from this dataset, SVM and KNN each make one misclassification, while NBY and DT make none. On the testing samples, two misclassifications are found for KNN and one for DT. Among the identified genes, AL080059 is an already known biomarker for breast cancer, stomach cancer, and glioma. Table 11 shows the experimental results of the proposed technique on the Breast2 dataset.

5.10. Evaluation on GSE25136 dataset

For the GSE25136 dataset, the proposed gene selection algorithm has identified a minimum set of four genes that could serve as potential biomarkers for prostate cancer. The experimental results on this dataset are given in Table 12. When the machine learning models are trained on the minimum set of features obtained from this dataset, SVM, NBY, and DT each give one misclassification, while KNN has none. On testing, according to Table 12, SVM gives the fewest misclassifications (one) and DT the most (three).

5.11. Performance comparison with existing techniques

The performance of the proposed technique is compared with a few existing state-of-the-art techniques. Classification accuracy and the number of genes are taken as the performance metrics for evaluating the proposed algorithm. Tables 13 and 14 show the prediction accuracy of the different algorithms together with the number of genes used for training each classifier, in the form Accuracy(Number of selected genes). The comparison shows that the proposed approach compares favorably with the competing techniques. The proposed technique achieves 100% accuracy with four identified genes using the SVM and KNN classifiers on the leukemia dataset, whereas several existing techniques show lower prediction accuracy on this dataset. On the prostate dataset, the proposed technique reaches 100% accuracy (with NBY), which is comparable with the existing techniques. Moreover, it has identified a smaller subset of four genes in the prostate dataset. On the SRBCT dataset, the proposed technique identifies four informative genes and achieves 100% accuracy using SVM, KNN, and NBY. The proposed technique does not perform well on the breast cancer dataset using KNN. However, no misclassification occurred when SVM and NBY were used as prediction classifiers on the breast cancer dataset. The genes identified by the proposed approach are largely consistent with existing literature studies. Tables 6 and 7 contain the optimal subsets of genes identified with the proposed approach and also show the performance of the various classifiers using the identified gene subsets. Fig. 3 shows the performance comparison of the proposed feature selection approach using different machine learning algorithms and cancer datasets. Three performance parameters, accuracy (ACC), sensitivity (SEN), and specificity (SPE), are used for evaluating the performance of the proposed approach with the different machine learning algorithms.

Table 14 Performance comparison of C-HMOSHSSA with existing techniques. Entries are accuracy in % (number of selected genes); "-" means not reported.

Methods                   Leukemia      Prostate    SRBCT       Breast
Filter Method [58]        100(30)       95.2(30)    -           -
Hybrid Approach [17]      97.06(3)      -           100(6)      -
LOOCV                     95.88(3)      -           96.04(6)    -
Hybrid Algorithm [56]     100(5)        -           100(8)      100(5)
PLSVEG [28]               100(8)        -           100(15)     -
SVM                       83.3(3)       -           96.4(6)     -
KNN                       97.2(3)       -           97.6(6)     -
NBY                       97.2(3)       -           97.6(6)     -
IDGA-F+SVM [55]           100(15)       96.3(14)    100(18)     100(6)
IDGA-F+NBY                97.7(8)       93.4        97.9(29)    95.5(6)
IDGA-F+KNN                98.1(13.7)    95.6        97.8(19)    100(6)
MOBBA-LS+SVM [75]         97.1(3)       94.1(6)     85(6)       -
MOBBA-LS+KNN              100(3)        97.1(6)     100(6)      -
MOBBA-LS+NBY              100(3)        97.1(6)     100(6)      -
Proposed+SVM              100(4)        97.05(4)    100(4)      100(3)
Proposed+KNN              100(4)        97.05(4)    100(4)      81(3)
Proposed+NBY              97.05(4)      100(4)      100(4)      100(3)

Fig. 3. Performance comparison of the proposed feature selection approach using different machine learning algorithms and cancer datasets: (3.a) Leukemia dataset, (3.b) Prostate dataset, (3.c) Breast cancer dataset, (3.d) SRBCT dataset.

6. Biological relevance

The experimental results of the proposed technique and the competing methods are discussed in Section 5. This section discusses the importance of the proposed technique, the significance of the obtained results, and their biological relevance. For all population-based metaheuristic optimization algorithms, the balance between exploitation and exploration is the determining factor for success. If that balance is not achieved, the algorithm can be prematurely trapped in local optima (too much exploitation), as in SSA, or it can fail to converge (too much exploration), as in the MOSHO algorithm. Hybridization is performed to combine the advantages of both of these bio-inspired metaheuristic algorithms. It is also performed to eliminate disadvantages such as premature convergence and high computational time, as suggested by various researchers. In the proposed algorithm, SSA is embedded in the MOSHO algorithm to make MOSHO more effective and to improve its exploration and exploitation, and thereby its performance. Researchers have pointed out convergence and diversity difficulties for real-life optimization problems with more than one objective. Hence, there is a need for an algorithm that maintains both convergence and diversity. In this paper, the Salp Swarm Algorithm is used to maintain diversity. However, SSA suffers from the overhead of maintaining the necessary information, whereas MOSHO requires low computational effort. For this reason, a multi-objective strategy taken from MOSHO is used for maintaining the information. Therefore, a novel hybrid algorithm is proposed that utilizes the features of both SSA and MOSHO for solving the problem of gene selection. The performance of the proposed feature selection algorithm is evaluated using seven microarray datasets. For each dataset, the proposed algorithm has generated a minimum subset of genes that could serve as potential biomarkers. The proposed technique is also compared with existing techniques, as shown in Table 14. The results indicate that in the majority of cases the proposed algorithm outperforms the competing methods.

To further investigate the biological significance of the identified genes, we consulted the literature in PubMed and the disease-gene association database http://www.disgenet.org/. For the leukemia dataset, the proposed approach identifies a subset of four genes (M92287_at, J04615_at, X95735_at, K01383). M92287_at and X95735_at are also found as key biomarkers in leukemia when searched in PubMed. When the leukemia data is analyzed using the R package epiR, M92287_at and X95735_at are among the top 10 genes. These results support the biological relevance of the identified gene subset. For the prostate dataset, four genes are identified by the proposed approach. Literature evidence for these genes is also found in [27,84]. One of the identified genes, 39732_at, is also known as a biomarker in lung cancer. In the SRBCT dataset, four genes are identified as biomarkers for SRBCT cancer. The genes with image-ids 812,105 and 244,618 are already known to be related to SRBCT cancer [48]. Moreover, the biological relevance of the identified subset of genes is also confirmed by http://www.biolab.si/supp/bicancer/projections/info/SRBCT.html. In the breast cancer dataset, the proposed approach has identified a set of three genes with Fisher indices 193, 302, and 459. We also found literature evidence of these genes for breast cancer classification [85]. In the GSE25136 dataset, four genes are identified as markers for prostate cancer using the proposed gene selection algorithm. Of these, three (ACTB, CRHR1, SGF29) are already reported in the literature [86–88] as biomarkers for prostate cancer. The fourth identified gene, ATP6V0A2, is a known biomarker for lung, colon, and breast cancer [86,89,90]. In the CNS dataset, a set of five genes is identified as potential biomarkers for central nervous system disorder. The biological relevance of the identified genes can be interpreted using the disgenet web tool [91–93]. J02611_at is also known as a biomarker in glioma, prostate cancer, and colorectal cancer. In the Breast2 dataset, a set of four genes is identified as biomarkers for breast cancer. Of these, AL080059 is also known for glioma and stomach cancer [94,95]. The extensive search on the biological significance of the identified genes leads to the conclusion that the proposed framework holds considerable potential for cancer classification. The literature evidence also supports the results obtained with the proposed technique.

Table 15 The resulting p-values using Kruskal-Wallis test for Breast2 dataset.

Methods     HMOSHSSA    MOBBA-LS    MODE        NSGA-II     MOPSO
HMOSHSSA    -           1.02E-09    5.61E-12    3.00E-03    9.11E-14
MOBBA-LS    3.03E-07    -           4.42E-01    6.11E+01    5.00E-01
MODE        9.81E-06    7.33E-05    -           4.49E-16    3.52E-03
NSGA-II     1.11E+03    5.56E-04    8.97E-07    -           6.61E-03
MOPSO       1.11E-10    4.81E-02    5.34E-16    3.03E-01    -

Table 16 The resulting p-values using Kruskal-Wallis test for Central Nervous System (CNS) dataset.

Methods     HMOSHSSA    MOBBA-LS    MODE        NSGA-II     MOPSO
HMOSHSSA    -           2.32E-04    6.60E-18    5.40E-08    1.01E-12
MOBBA-LS    2.01E-07    -           5.24E-05    6.01E-02    3.23E+02
MODE        1.11E-08    7.30E-05    -           8.26E-14    3.12E-04
NSGA-II     2.11E-00    3.16E-06    1.91E+02    -           6.11E-04
MOPSO       9.00E-02    4.33E-09    5.31E-13    4.53E-03    -

Table 17 The resulting p-values using Kruskal-Wallis test for GSE25136 dataset.

Methods     HMOSHSSA    MOBBA-LS    MODE        NSGA-II     MOPSO
HMOSHSSA    -           7.78E-10    1.41E-21    6.07E-23    4.99E-07
MOBBA-LS    2.21E-09    -           4.41E-02    8.03E-02    4.04E-02
MODE        9.80E-07    4.00E-09    -           1.41E-08    9.59E-11
NSGA-II     2.12E-02    2.56E-08    6.44E-02    -           8.88E-07
MOPSO       3.41E-06    3.21E-01    5.31E-09    4.43E-01    -

Table 18 The resulting p-values using Kruskal-Wallis test for SRBCT dataset.

Methods     HMOSHSSA    MOBBA-LS    MODE        NSGA-II     MOPSO
HMOSHSSA    -           6.12E-19    7.67E-26    4.44E-30    3.41E-04
MOBBA-LS    4.07E-01    -           1.41E-01    9.00E-04    6.60E-02
MODE        1.81E-03    3.33E-04    -           3.41E-06    8.34E-04
NSGA-II     2.10E-01    9.16E+01    5.17E+03    -           7.73E+01
MOPSO       9.10E-05    4.32E-03    8.32E+00    9.23E-02    -

Table 19 The resulting p-values using Kruskal-Wallis test for Leukemia dataset.

Methods     HMOSHSSA    MOBBA-LS    MODE        NSGA-II     MOPSO
HMOSHSSA    -           7.00E-19    5.90E-39    5.05E-26    3.21E-09
MOBBA-LS    2.23E-02    -           4.41E-04    9.00E-07    5.50E-04
MODE        4.44E-03    6.98E-01    -           4.55E-05    2.51E-02
NSGA-II     9.19E-04    4.54E-02    5.36E-02    -           5.11E-06
MOPSO       7.76E-09    5.80E-04    6.31E-08    3.00E-08    -

Table 20 The resulting p-values using the Kruskal-Wallis test for the Prostate dataset.

Methods    HMOSHSSA   MOBBA-LS   MODE       NSGA-II    MOPSO
HMOSHSSA   −          8.00E-17   7.00E-15   3.33E-09   5.68E-11
MOBBA-LS   2.46E-02   −          5.45E-06   9.69E-01   8.46E-07
MODE       5.80E-04   6.31E-06   −          4.41E-11   5.51E-04
NSGA-II    2.71E-02   7.58E-08   5.91E-14   −          7.77E-06
MOPSO      1.10E-05   2.80E-04   5.99E-12   4.07E-07   −

Table 21 The resulting p-values using the Kruskal-Wallis test for the Breast dataset.

Methods    HMOSHSSA   MOBBA-LS   MODE       NSGA-II    MOPSO
HMOSHSSA   −          9.91E-07   8.69E-17   8.58E-13   4.15E-14
MOBBA-LS   5.12E-03   −          9.41E-07   1.10E+02   6.60E+01
MODE       7.37E+02   8.31E+01   −          9.19E-06   6.56E-06
NSGA-II    7.77E-01   8.55E-07   4.00E-01   −          5.38E-09
MOPSO      1.00E-01   5.48E-05   2.22E-01   5.78E+02   −

Fig. 4. Boxplot analysis of the proposed and competitor approaches on (a) SRBCT; (b) Leukemia; (c) Prostate; and (d) Breast datasets using SVM, KNN, NBY and DT (x-axis: Competing Methods, y-axis: Ratio of variation in gene expression level of selected genes and total number of genes before feature selection).

7. Statistical analysis

The performance of the proposed algorithm is further evaluated by comparing it with NSGA-II, MOPSO, MOBBA-LS, and MODE. NSGA-II and MOPSO are among the most efficient multi-objective algorithms and are based on the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), respectively. To compare the proposed algorithm with the competing algorithms, a statistical hypothesis test (the Kruskal-Wallis test) was employed to determine, at a significance level of α = 0.05, whether a significant difference exists between them. The Kruskal-Wallis test [96] is a non-parametric statistical test suitable for analyzing the ranks of different methods over multiple datasets. Seven high-dimensional cancer datasets are employed, namely Breast2, Central Nervous System (CNS), GSE25136, SRBCT, Leukemia, Prostate, and Breast. The p-values in Tables 15–21 reveal that the proposed method is very competitive compared with NSGA-II, MOPSO, MOBBA-LS, and MODE. A p-value below 0.05 indicates that HMOSHSSA performs well compared with the other algorithm. For most pairs of algorithms, the p-value obtained is greater than 0.05, which indicates that the competing algorithms do not perform well.
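As a rough illustration of how a pairwise p-value of the kind reported in the Kruskal-Wallis tables can be computed, the following minimal sketch implements the test with the Python standard library only. It is not the authors' code: the function names (`average_ranks`, `chi2_sf`, `kruskal_wallis`) are hypothetical, the per-run accuracy values are made up for illustration, and the p-value relies on the usual chi-square approximation of the H statistic.

```python
import math
from collections import Counter
from itertools import chain, combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        return math.exp(-half) * sum(
            half ** i / math.factorial(i) for i in range(df // 2)
        )
    series = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df + 1) // 2)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series


def kruskal_wallis(*groups):
    """Return (H, p) for two or more samples, with the standard tie correction."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


# Made-up classification accuracies over five runs; not taken from the paper.
runs = {
    "HMOSHSSA": [0.97, 0.98, 0.96, 0.99, 0.95],
    "NSGA-II": [0.91, 0.93, 0.90, 0.92, 0.94],
    "MOPSO": [0.89, 0.94, 0.91, 0.95, 0.90],
}
ALPHA = 0.05  # significance level used in the text

for a, b in combinations(runs, 2):
    h, p = kruskal_wallis(runs[a], runs[b])
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{a} vs {b}: H={h:.3f}, p={p:.2E} ({verdict})")
```

Each pair of methods yields one p-value, which would fill one off-diagonal cell of a pairwise table; the diagonal stays empty because a method is not compared with itself. Note that a p-value is a probability, so values produced this way always lie between 0 and 1.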

7.1. Boxplot analysis

Fig. 4 shows the boxplot analysis of the proposed and competitor approaches for gene selection on the SRBCT, Leukemia, Prostate, and Breast datasets. The boxplots of the proposed HMOSHSSA reach better optimal values and are narrower than those of the MOBBA-LS, MODE, NSGA-II, and MOPSO techniques. For the SRBCT dataset, HMOSHSSA obtains a better optimal value than the other algorithms using SVM; with DT, its boxplot also obtains the best optimum value and is narrower than those of the other methods. For the Leukemia dataset, it obtains optimal values for almost all the classifiers, although the SVM boxplot shows better and narrower results than the other classifiers. For the Prostate dataset, the DT boxplot is narrow and obtains a better optimal value than the other classifiers. For the Breast dataset, the SVM boxplot is narrow and shows the best optimal value in comparison with the other classifiers.
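To make the notion of a "narrower" boxplot concrete, the sketch below computes the quantities a boxplot draws (median, quartiles, and interquartile range) for each method and picks the method with the smallest interquartile range. It is only an illustration under stated assumptions: `box_summary` is a hypothetical helper, and the per-run values are made up rather than read from Fig. 4.

```python
import statistics


def box_summary(values):
    """Five-number summary plus interquartile range (the height of the box)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }


# Made-up per-run values for two methods; not taken from Fig. 4.
runs = {
    "HMOSHSSA": [0.20, 0.21, 0.22, 0.21, 0.20, 0.22, 0.21],
    "MOPSO": [0.25, 0.31, 0.28, 0.35, 0.24, 0.30, 0.33],
}

summaries = {name: box_summary(vals) for name, vals in runs.items()}
for name, s in summaries.items():
    print(f"{name}: median={s['median']:.3f}, IQR={s['iqr']:.3f}")

# A narrower box means a smaller interquartile range, i.e. less run-to-run spread.
narrowest = min(summaries, key=lambda name: summaries[name]["iqr"])
print("Narrowest box:", narrowest)
```

Comparing medians addresses which method reaches the better typical value, while comparing interquartile ranges addresses how stable each method is across runs; the two readings of a boxplot are independent of each other.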

8. Conclusion and future work

The key contribution of the proposed framework is a hybrid approach for gene selection that combines two powerful metaheuristic approaches, SSA and MOSHO, and addresses the issues related to high-dimensional data. The proposed algorithm exploits the merits of both existing algorithms: the grid-searching concept suggested by MOSHO and the leader and follower selection approach of SSA. Moreover, the proposed approach is not restricted to particular datasets; it can be applied to different types of diseases. The proposed framework performs three main tasks. First, a smaller subset of relevant genes is obtained using the filter method to deal with high-dimensional gene expression data. Second, the proposed gene selection algorithm is employed to find the most relevant and informative gene subset from the preprocessed gene expression data. Third, machine learning algorithms are trained using the identified gene subsets. Experiments are performed using seven different microarray datasets to evaluate the performance of the proposed technique. The proposed approach is compared with existing state-of-the-art techniques, and the results show that the proposed algorithm outperforms the existing approaches. Further, the experimental analysis revealed that the proposed gene selection algorithm identifies a new combination of genes which could serve as a potential disease biomarker. We have also discussed the biological significance of the identified genes using literature evidence and PubMed. In the future, the identified genes can be further explored for better identification of disease-specific biomarkers and their interactions with other genes. The binary version of the proposed algorithm can be compared with binary GA and PSO. The proposed feature selection algorithm can be extended to handle large-scale problems with more challenging healthcare datasets.

Conflicts of Interest

The authors declare no conflict of interest.

References

[1] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (5439) (1999) 531–537.
[2] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. 96 (12) (1999) 6745–6750.
[3] L.J. Van’t Veer, H. Dai, M.J. Van De Vijver, Y.D. He, A.A. Hart, M. Mao, H.L. Peterse, K. Van Der Kooy, M.J. Marton, A.T. Witteveen, et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (6871) (2002) 530.
[4] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1–3) (2002) 389–422.
[5] S.K. Shevade, S.S. Keerthi, A simple and efficient algorithm for gene selection using sparse logistic regression, Bioinformatics 19 (17) (2003) 2246–2253.
[6] C. Furlanello, M. Serafini, S. Merler, G. Jurman, Gene selection and classification by entropy-based recursive feature elimination, in: Proceedings of the International Joint Conference on Neural Networks, 4, IEEE, 2003, pp. 3077–3082.
[7] W. Chu, Z. Ghahramani, F. Falciani, D.L. Wild, Biomarker discovery in microarray gene expression data with gaussian processes, Bioinformatics 21 (16) (2005) 3385–3393.
[8] A. Sharma, R. Rani, KSRMF: Kernelized similarity based regularized matrix factorization framework for predicting anti-cancer drug responses, J. Intell. Fuzzy Syst. (Preprint) (2018) 1–12.
[9] A. Sharma, R. Rani, Classification of cancerous profiles using machine learning, in: Proceedings of the International Conference on Machine Learning and Data Science (MLDS), IEEE, 2017, pp. 31–36.
[10] A. Sharma, R. Rani, An optimized framework for cancer classification using deep learning and genetic algorithm, J. Med. Imaging Health Inf. 7 (8) (2017) 1851–1856.
[11] A. Sharma, R. Rani, An integrated framework for identification of effective and synergistic anti-cancer drug combinations, J. Bioinf. Comput. Biol. 16 (05) (2018) 1850017.
[12] A. Sharma, R. Rani, BE-DTI’: ensemble framework for drug target interaction prediction using dimensionality reduction and active learning, Comput. Methods Progr. Biomed. 165 (2018) 151–162.
[13] V. Bolón-Canedo, N. Sánchez-Marono, A. Alonso-Betanzos, J.M. Benítez, F. Herrera, A review of microarray datasets and applied feature selection methods, Inf. Sci. (Ny) 282 (2014) 111–135.
[14] I. Inza, P. Larrañaga, R. Blanco, A.J. Cerrolaza, Filter versus wrapper gene selection approaches in dna microarray domains, Artif. Intell. Med. 31 (2) (2004) 91–103.
[15] Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (19) (2007) 2507–2517.
[16] E.B. Huerta, B. Duval, J.-K. Hao, A hybrid LDA and genetic algorithm for gene selection and classification of microarray data, Neurocomputing 73 (13–15) (2010) 2375–2383.
[17] Q. Shen, W.-M. Shi, W. Kong, Hybrid particle swarm optimization and tabu search approach for selecting genes for tumor classification using gene expression data, Comput. Biol. Chem. 32 (1) (2008) 53–60.
[18] L. Li, W. Jiang, X. Li, K.L. Moser, Z. Guo, L. Du, Q. Wang, E.J. Topol, Q. Wang, S. Rao, A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset, Genomics 85 (1) (2005) 16–23.
[19] G. Dhiman, V. Kumar, Spotted hyena optimizer: a novel bio-inspired based metaheuristic technique for engineering applications, Adv. Eng. Softw. 114 (2017) 48–70.
[20] S. Li, X. Wu, M. Tan, Gene selection using hybrid particle swarm optimization and genetic algorithm, Soft Comput. 12 (11) (2008) 1039–1048.
[21] J. Branke, K. Deb, H. Dierolf, M. Osswald, Finding knees in multi-objective optimization, in: Proceedings of the International conference on parallel problem solving from nature, Springer, 2004, pp. 722–731.
[22] R.T. Marler, J.S. Arora, Survey of multi-objective optimization methods for engineering, Struct. Multidiscip. Optim. 26 (6) (2004) 369–395.

[23] I. Boussaïd, J. Lepagnot, P. Siarry, A survey on optimization metaheuristics, Inf. Sci. (Ny) 237 (2013) 82–117.
[24] A. Chakraborty, A.K. Kar, Swarm intelligence: a review of algorithms, in: Nature-Inspired Computing and Optimization, Springer, 2017, pp. 475–494.
[25] G. Dhiman, V. Kumar, Multi-objective spotted hyena optimizer: a multi-objective optimization algorithm for engineering problems, Knowl. Based Syst. (2018).
[26] S. Mirjalili, A.H. Gandomi, S.Z. Mirjalili, S. Saremi, H. Faris, S.M. Mirjalili, Salp swarm algorithm: a bio-inspired optimizer for engineering design problems, Adv. Eng. Softw. 114 (2017) 163–191.
[27] H. Chen, Y. Zhang, I. Gutman, A kernel-based clustering method for gene selection with gene expression data, J. Biomed. Inf. 62 (2016) 12–20.
[28] S. Kar, K.D. Sharma, M. Maitra, Gene selection from microarray gene expression data for classification of cancer subgroups employing pso and adaptive k-nearest neighborhood technique, Expert Syst. Appl. 42 (1) (2015) 612–627.
[29] K.-H. Chen, K.-J. Wang, M.-L. Tsai, K.-M. Wang, A.M. Adrian, W.-C. Cheng, T.-S. Yang, N.-C. Teng, K.-P. Tan, K.-S. Chang, Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm, BMC Bioinf. 15 (1) (2014) 49.
[30] Y. Piao, M. Piao, K.H. Ryu, Multiclass cancer classification using a feature subset-based ensemble from microrna expression profiles, Comput. Biol. Med. 80 (2017) 39–44.
[31] S. Acharya, S. Saha, Y. Thadisina, Multiobjective simulated annealing-based clustering of tissue samples for cancer diagnosis, IEEE J. Biomed. Health Inf. 20 (2) (2016) 691–698.
[32] S. Saha, K. Kaushik, A.K. Alok, S. Acharya, Multi-objective semi-supervised clustering of tissue samples for cancer diagnosis, Soft Comput. 20 (9) (2016) 3381–3392.
[33] L. Dora, S. Agrawal, R. Panda, A. Abraham, Optimal breast cancer classification using gauss–newton representation based algorithm, Expert Syst. Appl. 85 (2017) 134–145.
[34] H. Salem, G. Attiya, N. El-Fishawy, Classification of human cancer diseases by gene expression profiles, Appl. Soft Comput. 50 (2017) 124–134.
[35] C.J. Alonso-González, Q.I. Moro-Sancho, A. Simon-Hurtado, R. Varela-Arrabal, Microarray gene expression classification with few genes: criteria to combine attribute selection and classification methods, Expert Syst. Appl. 39 (8) (2012) 7270–7280.
[36] V. García, J.S. Sánchez, Mapping microarray gene expression data into dissimilarity spaces for tumor classification, Inf. Sci. (Ny) 294 (2015) 362–375.
[37] F.F. Ting, Y.J. Tan, K.S. Sim, Convolutional neural network improvement for breast cancer classification, Expert Syst. Appl. 120 (2019) 103–115.
[38] K. Adem, S. Kiliçarslan, O. Cömert, Classification and diagnosis of cervical cancer with softmax classification with stacked autoencoder, Expert Syst. Appl. 115 (2019) 557–564.
[39] A.M. Abdel-Zaher, A.M. Eldeib, Breast cancer classification using deep belief networks, Expert Syst. Appl. 46 (2016) 139–144.
[40] C.M. Perou, S.S. Jeffrey, M. Van De Rijn, C.A. Rees, M.B. Eisen, D.T. Ross, A. Pergamenschikov, C.F. Williams, S.X. Zhu, J.C. Lee, et al., Distinctive gene expression patterns in human mammary epithelial cells and breast cancers, Proc. Natl. Acad. Sci. 96 (16) (1999) 9212–9217.
[41] Y.-J. Li, L. Zhang, M.C. Speer, E.R. Martin, Evaluation of current methods of testing differential gene expression and beyond, in: Methods of Microarray Data Analysis II, Springer, 2002, pp. 185–194.
[42] E.P. Xing, M.I. Jordan, R.M. Karp, et al., Feature selection for high-dimensional genomic microarray data, in: Proceedings of the ICML, 1, Citeseer, 2001, pp. 601–608.
[43] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, D. Haussler, Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics 16 (10) (2000) 906–914.
[44] Y. Tang, Y.-Q. Zhang, Z. Huang, Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis, IEEE/ACM Trans. Comput. Biol. Bioinf. 4 (3) (2007) 365–381.
[45] K.-B. Hwang, D.-Y. Cho, S.-W. Park, S.-D. Kim, B.-T. Zhang, Applying Machine Learning Techniques to Analysis of Gene Expression Data: Cancer Diagnosis, in: Methods of Microarray Data Analysis, Springer, 2002, pp. 167–182.
[46] F. Fernández-Navarro, C. Hervás-Martínez, R. Ruiz, J.C. Riquelme, Evolutionary generalized radial basis function neural networks for improving prediction accuracy in gene classification using feature selection, Appl. Soft Comput. 12 (6) (2012) 1787–1800.
[47] L. Li, L.G. Pedersen, T.A. Darden, C.R. Weinberg, Computational analysis of leukemia microarray expression data using the GA/KNN Method, in: Methods of Microarray Data Analysis, Springer, 2002, pp. 81–95.
[48] H.M. Alshamlan, G.H. Badr, Y.A. Alohali, Genetic bee colony (GBC) algorithm: a new gene selection method for microarray cancer classification, Comput. Biol. Chem. 56 (2015) 49–60.
[49] Y. Zhang, D.-w. Gong, J. Cheng, Multi-objective particle swarm optimization approach for cost-based feature selection in classification, IEEE/ACM Trans. Comput. Biol. Bioinf. (TCBB) 14 (1) (2017) 64–75.
[50] B. Xue, L. Cervante, L. Shang, M. Zhang, A particle swarm optimisation based multi-objective filter approach to feature selection for classification, in: Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Springer, 2012, pp. 673–685.
[51] M.A. Tawhid, K.B. Dsouza, Hybrid binary bat enhanced particle swarm optimization algorithm for solving feature selection problems, Appl. Comput. Inf. (2018).

[52] Y. Zhang, D. Gong, Y. Hu, W. Zhang, Feature selection algorithm based on bare bones particle swarm optimization, Neurocomputing 148 (2015) 150–157.
[53] Z. Yong, G. Dun-wei, Z. Wan-qiu, Feature selection of unreliable data using an improved multi-objective PSO algorithm, Neurocomputing 171 (2016) 1281–1290.
[54] Y. Zhang, X.-f. Song, D.-w. Gong, A return-cost-based binary firefly algorithm for feature selection, Inf. Sci. (Ny) 418 (2017) 561–574.
[55] M. Dashtban, M. Balafar, Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts, Genomics 109 (2) (2017) 91–107.
[56] C.-P. Lee, Y. Leu, A novel hybrid feature selection method for microarray data analysis, Appl. Soft Comput. 11 (1) (2011) 208–213.
[57] R. Nagarajan, M. Upreti, An approach for deciphering patient-specific variations with application to breast cancer molecular expression profiles, J. Biomed. Inf. 63 (2016) 120–130.
[58] L.-J. Zhang, Z.-J. Li, H.-W. Chen, An effective gene selection method based on relevance analysis and discernibility matrix, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2007, pp. 1088–1095.
[59] G. Ji, Z. Yang, W. You, Pls-based gene selection and identification of tumor-specific genes, IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 41 (6) (2011) 830–841.
[60] S.S. Shreem, S. Abdullah, M.Z.A. Nazri, Hybridising harmony search with a Markov blanket for gene selection problems, Inf. Sci. (Ny) 258 (2014) 108–121.
[61] L.-Y. Chuang, C.-H. Yang, J.-C. Li, C.-H. Yang, A hybrid BPSO-CGA approach for gene selection and classification of microarray data, J. Comput. Biol. 19 (1) (2012) 68–82.
[62] F.V. Sharbaf, S. Mosafer, M.H. Moattar, A hybrid gene selection approach for microarray data classification using cellular learning automata and ant colony optimization, Genomics 107 (6) (2016) 231–238.
[63] C.A. Coello Coello, Evolutionary multi-objective optimization: some current research trends and topics that remain to be explored, Front. Comput. Sci. China 3 (1) (2009) 18–30, doi:10.1007/s11704-009-0005-7.
[64] O. Schütze, M. Laumanns, C.A. Coello Coello, M. Dellnitz, E.-G. Talbi, Convergence of stochastic search algorithms to finite size pareto set approximations, J. Global Optim. 41 (4) (2008) 559–577, doi:10.1007/s10898-007-9265-7.
[65] F. Edgeworth, An essay on the application of mathematics to the moral sciences, Mathematical Physics, P. London, England: Keagan (1881).
[66] V. Pareto, Cours d’economie politique: Librairie Droz (1964).
[67] J.D. Knowles, D.W. Corne, Approximating the nondominated front using the pareto archived evolution strategy, Evol. Comput. 8 (2) (2000) 149–172, doi:10.1162/106365600568167.
[68] C.A. Coello, G.T. Pulido, M.S. Lechuga, Handling multiple objectives with particle swarm optimization, Trans. Evol. Comp 8 (3) (2004) 256–279, doi:10.1109/TEVC.2004.826067.
[69] J. Xuan, Y. Wang, Y. Dong, Y. Feng, B. Wang, J. Khan, M. Bakay, Z. Wang, L. Pachman, S. Winokur, et al., Gene selection for multiclass prediction by weighted fisher criterion, EURASIP J. Bioinf. Syst. Biol. 2007 (1) (2007) 64628.
[70] S. Olyaee, Z. Dashtban, M.H. Dashtban, Design and implementation of super-heterodyne nano-metrology circuits, Front. Optoelect. 6 (3) (2013) 318–326.
[71] J. Yang, Y. Liu, C. Feng, G. Zhu, Applying the fisher score to identify Alzheimers disease-related genes, Genet. Molecular Res. GMR 15 (2) (2016).
[72] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: Proceedings of the Advances in Neural Information Processing Systems, 2006, pp. 507–514.
[73] M. Radovic, M. Ghalwash, N. Filipovic, Z. Obradovic, Minimum redundancy maximum relevance feature selection approach for temporal gene expression data, BMC Bioinf. 18 (1) (2017) 9.
[74] N. Sánchez-Maroño, A. Alonso-Betanzos, M. Tombilla-Sanromán, Filter methods for feature selection–a comparative study, in: Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2007, pp. 178–187.
[75] M. Dashtban, M. Balafar, P. Suravajhala, Gene selection for tumor classification using a novel bio-inspired multi-objective approach, Genomics 110 (1) (2018) 10–17.


[76] J. Khan, J.S. Wei, M. Ringner, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nat. Med. 7 (6) (2001) 673.
[77] D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D’Amico, J.P. Richie, et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell 1 (2) (2002) 203–209.
[78] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, M. Raffeld, et al., Gene-expression profiles in hereditary breast cancer, N. Engl. J. Med. 344 (8) (2001) 539–548.
[79] Y. Sun, S. Goodison, Optimizing molecular signatures for predicting prostate cancer recurrence, Prostate 69 (10) (2009) 1119–1127.
[80] C. Thornton, F. Hutter, H.H. Hoos, K. Leyton-Brown, Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013, pp. 847–855.
[81] Y. Ai-Jun, S. Xin-Yuan, Bayesian variable selection for disease classification using gene expression data, Bioinformatics 26 (2) (2010) 215–222, doi:10.1093/bioinformatics/btp638.
[82] K.E. Lee, N. Sha, E.R. Dougherty, M. Vannucci, B.K. Mallick, Gene selection: a Bayesian variable selection approach, Bioinformatics 19 (1) (2003) 90–97.
[83] Y. Ai-Jun, S. Xin-Yuan, Bayesian variable selection for disease classification using gene expression data, Bioinformatics 26 (2) (2009) 215–222.
[84] O. Dagliyan, F. Uney-Yuksektepe, I.H. Kavakli, M. Turkay, Optimization based tumor classification from microarray gene expression data, PLoS ONE 6 (2) (2011) e14579.
[85] C. Sotiriou, S.-Y. Neo, L.M. McShane, E.L. Korn, P.M. Long, A. Jazaeri, P. Martiat, S.B. Fox, A.L. Harris, E.T. Liu, Breast cancer classification and prognosis based on gene expression profiles from a population-based study, Proc. Natl. Acad. Sci. 100 (18) (2003) 10393–10398.
[86] B. Liu, J. Li, M. Zheng, J. Ge, J. Li, P. Yu, MiR-542-3p exerts tumor suppressive functions in non-small cell lung cancer cells by upregulating FTSJ2, Life Sci. 188 (2017) 87–95.
[87] X. Fang, Y. Hong, L. Dai, Y. Qian, C. Zhu, B. Wu, S. Li, CRH promotes human colon cancer cell proliferation via il-6/jak2/stat3 signaling pathway and vegf-induced tumor angiogenesis, Mol. Carcinog. 56 (11) (2017) 2434–2445.
[88] C. Guo, S. Liu, J. Wang, M.-Z. Sun, F.T. Greenaway, Actb in cancer, Clin. Chim. Acta 417 (2013) 39–44.
[89] L. Jin, C. Li, R. Li, Z. Sun, X. Fang, S. Li, Corticotropin-releasing hormone receptors mediate apoptosis via cytosolic calcium-dependent phospholipase A2 and migration in prostate cancer cell RM-1, J. Mol. Endocrinol. 52 (3) (2014) 255–267.
[90] S. Pamarthy, M.K. Jaiswal, A. Kulshreshtha, G.K. Katara, A. Gilman-Sachs, K.D. Beaman, The vacuolar atpase a2-subunit regulates notch signaling in triple-negative breast cancer cells, Oncotarget 6 (33) (2015) 34206.
[91] S. Liddelow, D. Hoyer, Astrocytes: adhesion molecules and immunomodulation, Curr. Drug. Targets 17 (16) (2016) 1871–1881.
[92] D.A. Elliott, C.S. Weickert, B. Garner, Apolipoproteins in the brain: implications for neurological and psychiatric disorders, Clin. Lipidol. 5 (4) (2010) 555–573.
[93] B. Omalu, U. Nnebe-Agunadu, Occurence of anaplastic oligodendroglioma in a patient with williams syndrome: a case report with analysis of mutational profile of tumor, Niger. J. Clin. Pract. 12 (2) (2009).
[94] M.H. Madden, G.M. Anic, R.C. Thompson, L.B. Nabors, J.J. Olson, J.E. Browning, A.N. Monteiro, K.M. Egan, Circadian pathway genes in relation to glioma risk and outcome, Cancer Causes Control 25 (1) (2014) 25–32.
[95] J.D. Yang, S.-Y. Seol, S.-H. Leem, Y.H. Kim, Z. Sun, J.-S. Lee, S.S. Thorgeirsson, I.-S. Chu, L.R. Roberts, K.J. Kang, Genes associated with recurrence of hepatocellular carcinoma: integrated analysis by gene expression and methylation profiling, J. Korean Med. Sci. 26 (11) (2011) 1428–1438.
[96] J. McDonald, Kruskal-Wallis Test, Handbook of Biological Statistics, Sparky House, Maryland, U. S. A., 2014.
