Using multiobjective optimization for biclustering microarray data


Applied Soft Computing 33 (2015) 239–249


Using multiobjective optimization for biclustering microarray data Khedidja Seridi a,b,∗ , Laetitia Jourdan a,b , El-Ghazali Talbi a,b a Université Lille 1, Laboratoire d’Informatique Fondamentale de Lille, UMR CNRS 8022, Cité Scientifique, Bâtiment M3, 59655 Villeneuve d’Ascq cedex, France b INRIA Lille-Nord Europe, Parc Scientifique de la Haute Borne, 40 Avenue Halley, 59650 Villeneuve d’Ascq, France

Article info

Article history: Received 1 February 2013; Received in revised form 14 March 2015; Accepted 16 March 2015; Available online 28 April 2015

Keywords: Biclustering problem; Gene expression data; Evolutionary algorithm; Multiobjective combinatorial optimization

Abstract: Microarray data analysis is a challenging problem in the data mining field. A microarray records the expression levels of thousands of genes under several conditions. Analyzing this data consists in discovering genes that share similar expression patterns across a subset of conditions. The extracted information takes the form of submatrices of the microarray data that satisfy a coherence constraint. These submatrices are called biclusters, and the process of extracting them is called biclustering. Since its first application to the analysis of microarray data [1], many models and algorithms have been proposed for this problem. In this work, we propose a new multiobjective model and a new metaheuristic, HMOBI_ibea, for the biclustering problem. Results of the proposed method are compared to those of other existing algorithms, and the biological relevance of the extracted information is validated. The experimental results show that our method extracts very relevant biclusters, larger than those of existing methods. © 2015 Elsevier B.V. All rights reserved.

1. Introduction

Using microarray technologies, the expression levels of thousands of genes under several conditions may be measured in a single experiment. Microarray data may be represented in matrix form A = (X, Y), where X = {i1, i2, . . ., iN} is a set of N genes and Y = {j1, . . ., jM} a set of M conditions. aij ∈ A (i ∈ X, j ∈ Y) represents the expression level of gene i under condition j (see Table 1). Functionally related genes are likely to behave similarly across distinct conditions; however, identifying such genes and conditions is not obvious. In data mining, the problem is known as biclustering or co-clustering, and it arises in many fields such as marketing, psychology and bioinformatics. Biclustering is the extraction of submatrices B = (I, J) (I ⊂ X, J ⊂ Y), called biclusters, of maximal size that meet some coherence constraint. Commonly, the coherence is measured by the Mean Squared Residue (MSR) [1]. The computational complexity of biclustering depends mainly on its formulation, and the most interesting variants of biclustering are NP-complete [2].

∗ Corresponding author at: Université Lille 1, Laboratoire d’Informatique Fondamentale de Lille, UMR CNRS 8022 Cité Scientifique, Bâtiment M3, 59655 Villeneuve d’Ascq cedex, France. Tel.: +33 359358637. E-mail addresses: [email protected] (K. Seridi), Laetitia.Jourdan@lifl.fr (L. Jourdan), talbi@lifl.fr (E.-G. Talbi). http://dx.doi.org/10.1016/j.asoc.2015.03.060 1568-4946/© 2015 Elsevier B.V. All rights reserved.

Many heuristics have been proposed for solving the biclustering problem, such as greedy algorithms [1,9–12], divide-and-conquer algorithms [15,16], enumeration algorithms [13,14], and metaheuristics (mainly evolutionary and bioinspired) [3–8,31]. Mitra et al. proposed the first multiobjective algorithm for the biclustering problem, NSGA2B [8]: an evolutionary metaheuristic based on NSGA-II [19] and the CC heuristic. In this approach, the CC heuristic (see Section 3.1 for more details) is applied first to all the initial solutions and then again after the application of the variation operators. Solutions (biclusters) are encoded as fixed-size binary strings, with a bit string for the genes followed by another bit string for the conditions. A bit is set to one if the corresponding gene/condition is part of the bicluster, and to zero otherwise. Liu et al. proposed the MODPSFLB algorithm (multiobjective dynamic population shuffled frog-leaping biclustering) [31]. In MODPSFLB, the feasible solutions are regarded as frogs, and pareto optimal solutions are preserved in a frog population updated through an ε-dominance relation and the computation of the crowding distance. The next generation of the frog population is then adjusted according to a dynamic population strategy. Each solution is represented as a binary vector. However, most approaches do not consider the biological relevance of the extracted biclusters in the modeling of the problem. Here, we introduce an improved version of our previous biclustering model [17,18] and a powerful hybrid metaheuristic that, together, outperform state-of-the-art heuristics.


Table 1 Gene expression data matrix.

        con1   ...   conj   ...   conM
gene1   a11    ...   a1j    ...   a1M
...     ...    ...   ...    ...   ...
genei   ai1    ...   aij    ...   aiM
...     ...    ...   ...    ...   ...
geneN   aN1    ...   aNj    ...   aNM

The paper is organized as follows: Section 2 presents the main concepts of multiobjective optimization and then discusses a multiobjective model for the biclustering problem. Section 3 presents state-of-the-art multiobjective optimization algorithms and then elaborates on the different proposed metaheuristics. The experimental protocol and results are detailed in Section 4. Finally, Section 5 concludes the paper.

2. A multiobjectivization of biclustering for microarray data

In practice, many real-life problems (bioinformatics [33], data mining [34,35], agriculture [36,37], transportation [38,39], etc.) require the simultaneous optimization of several conflicting objectives. These problems are called Multiobjective Optimization Problems (MOPs). In a MOP, the optimum is not a single solution but a set of solutions known as pareto optimal solutions. A solution is pareto optimal if it is not possible to improve a given objective without deteriorating at least another objective. Multiobjective problems are generally NP-hard, and their exact resolution cannot be performed within a reasonable time. In this section, some multiobjective optimization concepts are defined. Then, we present our multiobjective biclustering modeling strategy.

2.1. Multiobjective optimization

Definition 2.1 (Multiobjective optimization problem). A Multiobjective Optimization Problem (MOP) may be defined as:



MOP: min F(x) = (f1(x), f2(x), . . ., fn(x))   s.t. x ∈ S

where n (n ≥ 2) is the number of objectives, x = (x1, . . ., xk) is the vector of decision variables, S is the set of feasible solutions associated with equality and inequality constraints and explicit bounds, and F(x) = (f1(x), f2(x), . . ., fn(x)) is the vector of objectives to be optimized. The search space S represents the decision space of the MOP; the space to which the objective vectors belong is called the objective space. The vector F can be seen as a cost function from the decision space to the objective space that evaluates the quality of each solution (x1, . . ., xk) by assigning it an objective vector (y1, . . ., yn) representing its quality (fitness). As the criteria are usually in conflict, there is in general no solution x* that is simultaneously optimal for all the objectives. Optimality is therefore captured through a partial order relation known as the dominance relation. Definition 2.2 (Pareto dominance). For a minimization multiobjective problem, an objective vector u = (u1, . . ., un) is said to dominate v = (v1, . . ., vn) (denoted by u ≺ v) if and only if no component of v is smaller than the corresponding component of u and at least one component of u is strictly smaller, that is: u ≺ v ⇔ ∀i ∈ [1, . . ., n]: ui ≤ vi ∧ ∃i ∈ [1, . . ., n]: ui < vi.
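Definition 2.2 can be checked mechanically. A minimal sketch for minimization (the function name `dominates` is ours):

```python
def dominates(u, v):
    """Pareto dominance for minimization: u dominates v iff u is no worse than v
    on every objective and strictly better on at least one."""
    return (all(ui <= vi for ui, vi in zip(u, v))
            and any(ui < vi for ui, vi in zip(u, v)))

# Improving one objective without worsening the other yields dominance:
print(dominates((1.0, 2.0), (1.0, 3.0)))  # True
# Trading one objective for the other makes the vectors incomparable:
print(dominates((1.0, 2.0), (2.0, 1.0)))  # False
```

Note that a vector never dominates itself, since the strict-inequality clause fails.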

Definition 2.3 (Pareto optimality). A solution x* ∈ S is pareto optimal if for every x ∈ S, F(x) does not dominate F(x*), that is, ∀x ∈ S, F(x) ⊀ F(x*).

Definition 2.4 (Pareto optimal set). For a given MOP(F, S), the pareto optimal set is defined as P* = {x ∈ S | ¬∃ x′ ∈ S, F(x′) ≺ F(x)}.

Definition 2.5 (Pareto front). For a given MOP(F, S) and its pareto optimal set P*, the pareto front is defined as PF* = {F(x), x ∈ P*}. The pareto front is the image of the pareto optimal set in the objective space. Obtaining the pareto front of a MOP is the main goal of multiobjective optimization. However, given that the pareto front can contain a large number of points, a good approximation of the pareto optimal set may contain a limited number of pareto solutions that satisfy two properties [21]:

• Closeness to the pareto optimal set.
• Diversification of the solutions.

Indeed, closeness to the pareto optimal set ensures the quality of the selected solutions, while diversification avoids redundancy and allows a better representation of the pareto optimal set. The diversification of the solutions may be measured using distance measures such as the crowding distance.

Definition 2.6 (Crowding distance). The crowding distance dist(x) of a solution x is a measure of the portion of the search space around x that is not occupied by any other solution in the population [21]. For each objective function fj, the solutions are sorted, and an infinite crowding distance is assigned to the solutions with the smallest and largest values in that dimension. The crowding distance of a solution is the sum of its per-objective crowding distance values, each calculated according to:

dist(xi)_j = (fj(xi+1) − fj(xi−1)) / |fj,max − fj,min|

fj,max and fj,min are the population maximum and minimum values of the jth objective. Solutions with a high crowding distance are considered better, as they introduce more diversity into the population.

2.2. Multiobjective modeling for biclustering

Extracting biclusters from a data matrix can be seen as a combinatorial optimization problem where each bicluster optimizes a set of quality criteria [2]. In gene expression data, the quality of a bicluster is defined by its size, its coherence and its mean rows variance. These criteria are independent and notably conflicting: an imperfect coherence can always be improved by removing a row or a column, i.e. by reducing the size, while a flat bicluster (a bicluster with constant rows) is perfectly coherent. We can therefore regard biclustering as a multiobjective optimization problem. In what follows, we define the different objective functions, present a literature review, and finally discuss our modeling strategy.

2.2.1. The coherence
In order to optimize the coherence of a bicluster, the Mean Squared Residue (MSR) dissimilarity measure is generally considered [1]. Given a bicluster B = (I, J), the residue r(aij) of an element aij ∈ B is given by: r(aij) = aij − aiJ − aIj + aIJ.


where aiJ is the mean of the ith row of B, aIj the mean of the jth column of B, and aIJ the mean of all the elements in B. The mean squared residue of the bicluster is:

MSR(I, J) = (1 / (|I||J|)) Σ_{i∈I, j∈J} r(aij)²   (1)

The lower the MSR value of a bicluster, the better its coherence. A bicluster with an MSR value less than δ is called a δ-bicluster.

2.2.2. The fluctuation
A bicluster with a low MSR value indicates that the expression levels fluctuate in unison [1]. This includes flat biclusters (biclusters with the same expression value for each gene under all the conditions). For gene expression data, we are interested in biclusters whose genes exhibit significant fluctuations across the different conditions, i.e. biclusters with significant rows variances. To evaluate the variance of a bicluster's rows, we define the mean rows variance, denoted Rvar:

Rvar(I, J) = (1 / (|I||J|)) Σ_{i∈I, j∈J} (aij − aiJ)²   (2)
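Both measures are cheap to evaluate on a submatrix. A sketch with NumPy (function names are ours); a flat bicluster shows why MSR alone is not enough: it is perfectly coherent (MSR = 0) but has zero fluctuation, while rows shifted by a constant are equally coherent yet do fluctuate:

```python
import numpy as np

def msr(B):
    """Mean squared residue (Eq. 1) of a bicluster submatrix B."""
    row_means = B.mean(axis=1, keepdims=True)   # a_iJ
    col_means = B.mean(axis=0, keepdims=True)   # a_Ij
    overall = B.mean()                          # a_IJ
    residue = B - row_means - col_means + overall
    return float((residue ** 2).mean())

def rvar(B):
    """Mean rows variance (Eq. 2): fluctuation of rows around their means."""
    return float(((B - B.mean(axis=1, keepdims=True)) ** 2).mean())

flat = np.array([[5.0, 5.0], [5.0, 5.0]])     # constant rows: coherent, no fluctuation
shifted = np.array([[1.0, 4.0], [2.0, 5.0]])  # rows differ by a constant shift
print(msr(flat), rvar(flat))        # 0.0 0.0
print(msr(shifted), rvar(shifted))  # 0.0 2.25: coherent and fluctuating
```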

Fig. 1. Classification of multiobjective optimization algorithms.

2.2.3. Literature review
Different multiobjective models for the biclustering problem have been proposed [3–8]. Commonly, these models comprise one or more functions to optimize the biclusters' sizes, a function that optimizes the coherence, and a function to optimize the rows variances. In these models, a solution represents one bicluster. Regarding the size, most models maximize the ratio between the number of elements of the bicluster and that of the microarray data, i.e. maximize |I|×|J| / (|X|×|Y|). However, as the number of rows is generally much larger than the number of columns, such functions are biased toward maximizing the number of rows. Thereby, in [3], the authors proposed to maximize the numbers of rows and columns separately through two objective functions, |I|/|X| and |J|/|Y|. Concerning the coherence, all the proposed models consider the Mean Squared Residue (MSR) [1] dissimilarity measure. The coherence is maximized by minimizing the ratio MSR(I, J)/δ between the MSR and a threshold δ (i.e. maximizing δ/MSR(I, J)), where δ refers to the maximum allowed dissimilarity (measured by MSR) in a bicluster. In [8], the MSR value is allowed to increase as long as it does not exceed the threshold δ. Regarding the rows fluctuations, all the existing models maximize the mean rows variance. In [5], the coherence and fluctuation objectives are merged into one function: the ratio between the MSR of the bicluster and its mean rows variance. In [22], Maulik et al. proposed a fuzzy multiobjective model where each solution represents a set of biclusters; the fuzzy model optimizes the same three objectives with fuzzy row and column membership variables.
2.2.4. Biclustering modeling
Given a data matrix A = (X, Y), the size of a bicluster B = (I, J) (I ⊆ X, J ⊆ Y) is defined by its number of elements: |I||J|. Using this definition as an objective, the number of rows is more likely to be maximized, since the total number of rows is higher than the total number of columns. However, in the case of gene expression data, biclusters that comprise a small number of conditions (columns) are less likely to be biologically relevant. To overcome this problem, we introduce a single objective function f1 that evaluates the size of the bicluster as the weighted sum of the normalized number of rows and the normalized number of columns:

f1(I, J) = α × |I|/|X| + β × |J|/|Y|   (3)
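Equation (3) is straightforward to compute; a minimal sketch (function name ours) with the equal weights α = β = 1/2 chosen in the paper, illustrating how normalization stops the row count from dominating:

```python
def f1(n_rows, n_cols, N, M, alpha=0.5, beta=0.5):
    """Size objective (Eq. 3): weighted sum of normalized row and column counts."""
    return alpha * n_rows / N + beta * n_cols / M

# A 100 x 10 bicluster in a 2000 x 20 microarray: the 10 columns, being half of
# all columns, contribute far more than the 100 rows once normalized.
print(f1(100, 10, 2000, 20))  # ~0.275 (0.5 * 0.05 + 0.5 * 0.5)
```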

So that the numbers of rows and columns have the same probability of being maximized, α and β are set to 1/2. When β > α, the number of columns is more likely to be maximized than the number of rows. Concerning the MSR objective, there exist three modeling strategies: minimizing the MSR value, maximizing the MSR value as long as the residue does not exceed a threshold δ, and keeping the MSR value below the threshold δ. In our case, we seek biclusters of maximal size whose MSR value does not exceed the threshold δ. Therefore, we performed a comparative experiment where the same algorithm is applied with the two latter models. The experiment and the chosen model are presented in the experiments section.

3. Using multiobjective algorithms to solve multiobjective biclustering

The existing algorithms for solving multiobjective problems may be classified into exact and approximate algorithms (Fig. 1). Exact methods, such as branch and bound algorithms, dynamic programming, etc., have been widely used to solve biobjective problems. They are known to be effective for problems of small size: they are able to find optimal solutions, but they are time consuming for difficult problems. Alternatively, approximate methods do not guarantee optimal solutions (the optimal pareto set), but they seek to obtain high-quality solutions in a reasonable time. Approximate methods can be divided into two classes: approximation algorithms and heuristics. Unlike heuristics, approximation algorithms guarantee the quality of the found solution with respect to the optimum. Finally, there exist two subclasses of heuristics: on the one hand, heuristics that are specific to a given problem and exploit some knowledge about it; on the other hand, metaheuristics, which are general-purpose algorithms applicable to a large variety of problems and can be used to solve almost any optimization problem.
They can be seen as high-level methodologies used to guide the design of heuristics dedicated to solving a specific problem. They include generic elements as well as specific


elements for the considered problem, such as the representation or the evaluation of a solution. Metaheuristics can be classified into two classes: single-solution-based metaheuristics (S-META) and population-based metaheuristics (P-META). S-META, such as multiobjective pareto local searches (DMLS) and tabu search, handle and transform one solution during the search process, while in P-META, such as evolutionary algorithms and particle swarm algorithms, a set of solutions, called a population, evolves in parallel. All iterative metaheuristics share two common components in addition to metaheuristic-specific ones: the representation of the solutions handled by the algorithm and the definition of the objective function that guides the search. In order to solve the proposed model for the biclustering problem, we propose to use multiobjective evolutionary algorithms (MOBI_nsga and MOBI_ibea), multiobjective pareto local searches (DMLS), and the hybridization of the MOBI algorithm with DMLS: HMOBI.

3.1. Multiobjective evolutionary algorithms

Multiobjective evolutionary algorithms (MOEAs) are population-based metaheuristics where a set of solutions (population) is iteratively handled in order to obtain a final set of solutions. In addition to the components common to other metaheuristics, MOEAs comprise three main search components:

• Fitness assignment: The main role of this procedure is to guide the search toward pareto optimal solutions for a better convergence. It assigns a scalar-valued fitness to a vector objective function. There exist four approaches to fitness assignment:
– Scalar approaches: based on transforming the MOP into a monoobjective problem.
– Criterion-based approaches: the search is performed by treating the various noncommensurable objectives separately.
– Dominance-based approaches: use the concepts of dominance and pareto optimality to guide the search process; the objective vectors of the solutions are scalarized using the dominance relation. NSGA-II is a well-known dominance-based multiobjective algorithm.
– Indicator-based approaches: the metaheuristics use performance quality indicators to drive the search toward the pareto front. The optimization goal is given in terms of a binary quality indicator I that can be viewed as an extension of the pareto dominance relation. A value I(A, B) quantifies the difference in quality between two approximated efficient sets A and B. IBEA is a well-known indicator-based multiobjective algorithm.
• Diversity preservation: The emphasis here is on generating a diverse set of pareto solutions in the objective and/or the decision space.
• Elitism: Preserving and exploiting elite solutions (e.g., pareto optimal solutions) allows a robust, fast, and monotonically improving performance of a metaheuristic.

In the following, we propose two hybrid multiobjective evolutionary algorithms called MOBI_nsga and MOBI_ibea, based on NSGA-II [19] and IBEA [20], respectively. We start by presenting the main components of the MOBI algorithm, which can be instantiated using any MOEA. Then NSGA-II and IBEA are described. Fig. 2 illustrates the main steps of the MOBI algorithm.

• Encoding

Fig. 2. Main steps of MOBI algorithm.

In the existing metaheuristics for the biclustering problem [3,8], a solution (bicluster) is usually represented by a fixed-size binary string, with a bit string for the genes followed by another bit string for the conditions. A bit is set to one if the corresponding gene/condition is present in the bicluster, and to zero otherwise. When using the CC local search (Cheng and Church's algorithm), we need to explore all the rows and all the columns of each bicluster, which is very time consuming with the classical binary representation, since microarray data have large numbers of rows and columns. In our algorithm, we choose to represent a bicluster as a list of four parts: the first part is an ordered list of row indexes, the second part an ordered list of column indexes, the third part the number of rows, and the fourth part the number of columns. With this representation we aim to reduce time and memory consumption, especially for the local search. A similar representation was used in [23]. Given the data matrix presented in Table 2, the string {1 3 2 3 2 2} represents the following bicluster (in bold), composed of two rows (1 and 3) and two columns (2 and 3):



{1 3 2 3 2 2} ⇒ [ 3  11
                  0   4 ]

Table 2 Example of data matrix.

6    3    11
5    17   2
1    0    4
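A minimal sketch of how the four-part representation maps back to a submatrix (the function name `decode` is ours; indexes are 1-based, as in the example above):

```python
def decode(solution, A):
    """Decode the 4-part encoding [rows..., cols..., n_rows, n_cols]
    into the bicluster submatrix of data matrix A (1-based indexes)."""
    n_rows, n_cols = solution[-2], solution[-1]
    rows = solution[:n_rows]
    cols = solution[n_rows:n_rows + n_cols]
    return [[A[i - 1][j - 1] for j in cols] for i in rows]

A = [[6, 3, 11],
     [5, 17, 2],
     [1, 0, 4]]
print(decode([1, 3, 2, 3, 2, 2], A))  # [[3, 11], [0, 4]]
```

Storing only the member indexes (plus their counts) keeps the CC local search from scanning the full gene and condition bit strings of the classical encoding.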


• Variation operators

For the proposed encoding, we define the following crossover and mutation operators, which are used to generate new offspring in each generation:

1 Crossover operator
A single-point crossover is used in each part of the solution (rows part and columns part); each part undergoes crossover separately. Let the parents be the chromosomes P1 = {r1 . . . rn c1 . . . cm rnb cnb} and P2 = {r′1 . . . r′l c′1 . . . c′k r′nb c′nb}, where rn ≤ r′l. The crossover in the rows part is performed as follows: the crossover point λ1 in P1 is generated as a random integer in the range 2 ≤ λ1 ≤ rn; the crossover point in P2 is λ2 = r′j, where r′j ≥ λ1 and r′j−1 < λ1. The crossover in the columns part is performed in the same way. For example, consider the following parents:

P1: 1 4 15 18 22 28 | 2 5 7 | 6 | 3
P2: 3 5 8 13 16 27 35 | 1 2 4 6 8 | 7 | 5

Suppose the 3rd gene index and the 2nd condition index of P1 are selected, so λ1 = 15 and γ1 = 5; then λ2 = 16 and γ2 = 6. Hence, after crossover we get:

C1: 1 4 15 16 27 35 | 2 5 6 8 | 6 | 4
C2: 3 5 8 13 18 22 28 | 1 2 4 7 | 7 | 4

2 Mutation operator
In our algorithms, we replace the mutation operation by the CC algorithm [1]. The CC algorithm was the first biclustering algorithm applied to gene expression problems; it is based on the concept of δ-biclusters. The goal of CC is to obtain K δ-biclusters for a given dataset. The CC algorithm starts with a single bicluster representing the whole dataset and iteratively removes rows and columns from it until the residue is equal to or less than δ. The node deletion is performed by choosing nodes with a high squared residue (higher than α × MSR(I, J), where α is a threshold greater than 1). Then it sequentially inserts rows and columns that are not yet in the bicluster, until the insertion of any further row or column would increase the residue above δ. Once the first bicluster is constructed, the rows and columns already present in it are replaced by random values in the original dataset, and the whole process is restarted until a predefined number K of biclusters is created. In our mutation operator, the algorithm starts from the current solution. Irrelevant genes or conditions, i.e. those whose mean squared residue is above (or below) a certain threshold, are removed (or added) using the following conditions, where a 'node' refers to a gene or a condition:

(a) Multiple node deletion
Compute aiJ, aIj, aIJ and MSR(I, J).
if (1/|J|) Σ_{j∈J} (aij − aiJ − aIj + aIJ)² > α × MSR(I, J): remove the rows i ∈ I
Recompute aiJ, aIj, aIJ and MSR(I, J).
if (1/|I|) Σ_{i∈I} (aij − aiJ − aIj + aIJ)² > α × MSR(I, J): remove the columns j ∈ J

(b) Single node deletion
while MSR(I, J) > δ:
  Recompute aiJ, aIj, aIJ and MSR(I, J).
  Find the node d (row or column) with the largest squared residue.
  Delete d.

(c) Multiple node addition
Recompute aiJ, aIj, aIJ and MSR(I, J).
if (1/|I|) Σ_{i∈I} (aij − aiJ − aIj + aIJ)² ≤ MSR(I, J): add the columns j ∉ J
Recompute aiJ, aIj, aIJ and MSR(I, J).
if (1/|J|) Σ_{j∈J} (aij − aiJ − aIj + aIJ)² ≤ MSR(I, J): add the rows i ∉ I
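Step (a) can be sketched as follows. This is a hedged NumPy sketch, not the authors' implementation; α = 1.2 is an assumed value, the paper only requiring α > 1:

```python
import numpy as np

def multiple_node_deletion(A, rows, cols, delta, alpha=1.2):
    """One CC-style multiple node deletion pass (sketch): drop every row whose
    mean squared residue exceeds alpha * MSR, then every such column."""
    def sq_residue(B):
        # r(a_ij)^2 with r = a_ij - a_iJ - a_Ij + a_IJ
        r = B - B.mean(axis=1, keepdims=True) - B.mean(axis=0, keepdims=True) + B.mean()
        return r ** 2
    B = A[np.ix_(rows, cols)]
    R2 = sq_residue(B)
    if R2.mean() <= delta:          # already a delta-bicluster: nothing to delete
        return rows, cols
    keep_r = R2.mean(axis=1) <= alpha * R2.mean()   # (1/|J|) sum_j r^2 per row
    rows = [r for r, k in zip(rows, keep_r) if k]
    R2 = sq_residue(A[np.ix_(rows, cols)])
    keep_c = R2.mean(axis=0) <= alpha * R2.mean()   # (1/|I|) sum_i r^2 per column
    cols = [c for c, k in zip(cols, keep_c) if k]
    return rows, cols

A = np.array([[1.0, 2, 3], [2, 3, 4], [3, 4, 5], [100, 0, 50]])
print(multiple_node_deletion(A, [0, 1, 2, 3], [0, 1, 2], delta=0.01))
# the incoherent last row is dropped, all columns kept
```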

3.1.1. NSGA-II
NSGA-II (Non-dominated Sorting Genetic Algorithm II) is probably the most widely used multiobjective resolution method [19]. At each NSGA-II generation, the solutions of the current population are ranked into several classes: individuals mapping to vectors of the first non-dominated set belong to the best efficient set; individuals mapping to vectors of the second non-dominated set belong to the second-best efficient set; and so on. Two values are then assigned to each population member. The first is the rank of the class to which the solution belongs, and represents the quality of the solution in terms of convergence. The second, the crowding distance, estimates the density of the solutions surrounding a particular point of the objective space, and represents the quality of the solution in terms of diversity. A solution is said to be better than another if it has a better rank or, in case of equality, a larger crowding distance. The selection strategy is a deterministic tournament between two random solutions. At the replacement step, only the best individuals are kept, with respect to a predefined population size. Furthermore, in addition to the original NSGA-II, an external population, the so-called archive, is added in order to store the whole set of potentially efficient solutions found during the search. Algorithm 1 gives the main steps of NSGA-II.

Algorithm 1 (NSGA-II).

Input: α (pop. size), Ng (max. nb. of generations)
Initialization: Generate an initial population P of size α; set the generation counter m to 0.
Mating selection: Perform binary tournament selection with replacement from P to build the mating pool P′.
Variation: Apply recombination and mutation operators to the mating pool P′; add the resulting offspring to P.
Non-dominated sort: Sort the population based on domination, i.e. divide the population into nf fronts {Fi / i ∈ nf}, such that the solutions of Fi are preferred to those of Fk whenever i < k.
Crowding distance: Assign crowding distance values to the individuals of each front Fi separately:
1) For each objective fj (j ≤ n), sort the individuals of the front Fi according to fj.
2) Set the distance values as follows:
dist(x0) = dist(x|Fi|) = ∞
dist(xl) = Σ_{j=1..n} (fj(xl+1) − fj(xl−1)) / |fj,max − fj,min|, for l ≠ 0 and l ≠ |Fi|
Selection: Perform a selection using the crowded-comparison operator ≺n: for x ∈ Fi and y ∈ Fj, x ≺n y if i < j, or if i = j and dist(x) > dist(y).
Increment the generation counter (m = m + 1).

244

K. Seridi et al. / Applied Soft Computing 33 (2015) 239–249

Termination test: If m ≥ Ng or another stopping criterion is satisfied, then stop; else return to the mating selection step.
Output: A (pareto set approximation).
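The non-dominated sort and crowding distance steps of Algorithm 1 can be sketched as follows. This is a compact reference sketch (function names ours), not the optimized bookkeeping of the original NSGA-II paper:

```python
def nondominated_sort(objs):
    """Split objective vectors (minimization) into fronts F1, F2, ..."""
    def dom(u, v):  # u dominates v
        return (all(a <= b for a, b in zip(u, v))
                and any(a < b for a, b in zip(u, v)))
    remaining = list(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dom(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def crowding_distances(front):
    """Crowding distance of each vector in one front; boundary points get inf."""
    n, n_obj = len(front), len(front[0])
    dist = [0.0] * n
    for j in range(n_obj):
        order = sorted(range(n), key=lambda i: front[i][j])
        fmin, fmax = front[order[0]][j], front[order[-1]][j]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if fmax > fmin:
            for k in range(1, n - 1):
                dist[order[k]] += (front[order[k + 1]][j]
                                   - front[order[k - 1]][j]) / (fmax - fmin)
    return dist

print(nondominated_sort([(1, 1), (2, 2), (1, 2), (2, 1)]))  # [[0], [2, 3], [1]]
print(crowding_distances([(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]))  # [inf, 2.0, inf]
```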

3.2. IBEA
IBEA (Indicator-Based multiobjective Evolutionary Algorithm) is an evolutionary multiobjective algorithm proposed by Zitzler and Künzli [20]. The purpose of this algorithm is to establish a total order between the solutions of the current population by generalizing the dominance relation through a binary quality indicator I : Ω × Ω → R. This indicator represents the goal of the optimization; the fitness assignment scheme is thus based on a pairwise comparison of the solutions using the binary quality indicator. Depending on the indicator being used, no diversity preservation technique is required. The selection scheme for reproduction is a binary tournament between randomly chosen individuals. The replacement strategy is an environmental one: the worst individuals are deleted successively, and the fitness values of the remaining solutions are updated after each deletion; this continues until the required population size is reached. Moreover, an archive stores the solutions that map to potentially non-dominated points, in order to prevent their loss during the stochastic search process [40]. The main steps of IBEA are given in Algorithm 2.

Algorithm 2 (IBEA).

Input: α (pop. size), Ng (max. nb. of generations), κ (fitness scaling factor).
Initialization: Generate an initial population P of size α; set the generation counter m to 0.
Fitness assignment: Calculate the fitness values of the individuals in P, i.e. for all x1 ∈ P set:
F(x1) = Σ_{x2 ∈ P\{x1}} −e^{−I(x2, x1)/κ}
Environmental selection: Iterate the following 3 steps until the size of P does not exceed α:
1 Choose an individual x* ∈ P with the smallest fitness value, i.e. F(x*) ≤ F(x), ∀x ∈ P;
2 Remove x* from the population;
3 Update the fitness values of the remaining individuals, i.e. F(x) = F(x) + e^{−I(x*, x)/κ}, ∀x ∈ P.
Termination test: If m ≥ Ng or another stopping criterion is satisfied, then stop.
Mating selection: Perform binary tournament selection with replacement from P to build the mating pool P′.
Variation: Apply recombination and mutation operators to the mating pool P′; add the resulting offspring to P. Increment the generation counter (m = m + 1); go to the fitness assignment step.
Output: A (Pareto set approximation).
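The fitness assignment step can be sketched as follows (function names ours; the additive epsilon indicator and κ = 0.05 are illustrative assumptions, not the paper's settings):

```python
import math

def ibea_fitness(objs, indicator, kappa=0.05):
    """IBEA fitness assignment: F(x1) = sum over x2 != x1 of -exp(-I(x2, x1)/kappa).
    Larger (less negative) fitness is better; IBEA deletes the smallest first."""
    return [-sum(math.exp(-indicator(o2, o1) / kappa)
                 for j, o2 in enumerate(objs) if j != i)
            for i, o1 in enumerate(objs)]

def additive_eps(u, v):
    """Additive epsilon indicator: smallest shift making u weakly dominate v."""
    return max(a - b for a, b in zip(u, v))

f = ibea_fitness([(0.0, 0.0), (1.0, 1.0)], additive_eps)
print(f[0] > f[1])  # True: the dominating point receives the better fitness
```

Because a dominated solution is reachable from a dominating one with a negative epsilon shift, its fitness term explodes toward a large negative value, so environmental selection removes it first.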

3.3. Multiobjective local searches
A classical local search algorithm iteratively improves a solution with respect to its neighborhood until a local optimum is reached. In a multiobjective context, optimality is defined in terms of pareto optimality, as in DMLS (Dominance-based Multiobjective Local Search). At each generation, DMLS selects one or more non-visited solutions (solutions with a non-explored neighborhood) from the archive and explores their neighborhoods. Once they are explored, these solutions are marked as visited. The main steps are illustrated in Fig. 3.

Algorithm 3 (DMLS).
1 Start with a given or randomly generated solution.
2 Select a set Pcurrent of non-visited solutions from the archive.
3 For each x in Pcurrent, explore its neighborhood and add the non-dominated neighbors to the set of candidate solutions Pcandidate; mark x as visited.
4 Update the archive with the candidate solutions.
5 If all the solutions are visited or another stopping criterion is satisfied, stop; else return to 2.
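The move used by DMLS in this work adds or removes a single row or column, giving a neighborhood of exactly N + M solutions. A set-based sketch (the paper's actual encoding is the four-part list of Section 3.1; the generator below is ours):

```python
def neighborhood(rows, cols, N, M):
    """All N + M neighbors of a bicluster (rows, cols): each move flips the
    membership of exactly one row or one column (add if absent, drop if present)."""
    for i in range(N):
        yield rows ^ {i}, cols   # symmetric difference toggles row i
    for j in range(M):
        yield rows, cols ^ {j}   # toggles column j

moves = list(neighborhood({0, 2}, {1}, N=4, M=3))
print(len(moves))  # 7, i.e. N + M
```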

Fig. 3. Main steps of DMLS: non-visited solutions are selected from the archive into the current set, their neighborhoods are explored, and the archive is updated.

Different variants of DMLS exist, depending on the number of selected solutions (step 2 of Algorithm 3) and on the exploration strategy (step 3 of Algorithm 3) [41]. For the selection phase, two strategies may be considered: random selection of one solution, and exhaustive selection of all the non-visited solutions. Concerning the exploration phase, two strategies may likewise be considered, namely exhaustive exploration and partial exploration. There exist three approaches to partial exploration: exploring a random neighbor, the first non-dominated neighbor, or the first non-dominating neighbor. In our case, we are interested in three variants: DMLS(1,1), DMLS(1,*) and DMLS(*,1). In DMLS(1,*), only one solution is selected in step 2 (1) and the exploration of its neighborhood is carried out exhaustively (*); this variant corresponds to the well-known PLS-1 dominance-based local search presented by Paquete et al. in [24]. Conversely, in DMLS(*,1) the selection is exhaustive (*) and the exploration of a neighborhood stops when the first improving solution (a solution that dominates the selected solution) is found (1). Finally, in DMLS(1,1), one solution is randomly selected in step 2 (1) and the exploration of its neighborhood stops when the first improving solution is found (1). In our case, the general components of DMLS, i.e. the solution encoding and evaluation, are the same as in the MOBI algorithm. Moreover, the neighborhood operator (move) is defined by adding/removing a row/column to/from the current solution, so the neighborhood size of a given solution is N + M, where N is the number of genes and M is the number of conditions in the microarray data.

3.4. Hybrid multiobjective metaheuristic HMOBI
The proposed hybrid algorithm HMOBI combines a multiobjective evolutionary metaheuristic (MOBI) and a dominance-based multiobjective local search (DMLS).
Evolutionary and local search metaheuristics indeed present very different properties, aptitudes and qualities. An evolutionary algorithm generally offers a good ability to explore the search space, whereas local search algorithms favor the intensification of the search. Hence, combining them allows taking advantage of both behaviors and obtaining better results. The hybrid algorithm HMOBI starts by executing MOBI over a random initial population. Then, DMLS is applied to each solution of MOBI's archive (the Pareto approximation). Fig. 4 illustrates the steps of the HMOBI algorithm.

4. Experiments and results

In this section, the results obtained using the MOBI, DMLS and HMOBI algorithms over well-known data sets are presented, and a comparative study is provided. The algorithms are compared

Table 4. Modeling choice.

Fig. 4. General scheme of the HMOBI algorithm: an initial population is evolved by MOBI (mutation = CC) into a Pareto set, which DMLS then refines into the approximation of the Pareto optimal set.

Table 3. Information about the used data sets.

Data set                   Genes number   Conditions number   % missing data
Yeast cell cycle               2884             17                0.07%
Human B-cell expression        4026             96               12.3%
Saccharomyces cerevisiae       2993            173                0%
Colon cancer                   2000             62                0%

with respect to multiobjective indicators and to the biological relevance of the extracted biclusters. All the implementations have been done with the ParadisEO framework, in particular with ParadisEO-MOEO [42].

4.1. Data

In our experiments we use the well-known Yeast Cell Cycle [25], Human B-cell expression [25], Colon Cancer [26] and Saccharomyces cerevisiae1 data sets. Table 3 gives details about the different data sets. In the Yeast Cell Cycle data, the missing values are replaced by random numbers between 0 and 800, as in [8]. In the Human B-cell expression data, the missing values are replaced by random numbers between −800 and 800, as in [8].
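This preprocessing of missing values can be sketched with NumPy as follows. The helper name `impute_missing` and the fixed seed are our own illustrative choices, not part of the original pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

def impute_missing(data, low, high):
    """Replace missing entries (NaN) with uniform random integers in
    [low, high], mirroring the preprocessing above: [0, 800] for the
    Yeast Cell Cycle data and [-800, 800] for Human B-cell expression."""
    data = np.array(data, dtype=float)   # work on a copy
    mask = np.isnan(data)
    data[mask] = rng.integers(low, high + 1, size=int(mask.sum()))
    return data

# e.g. Yeast Cell Cycle-style imputation
expr = np.array([[120.0, np.nan], [np.nan, 455.0]])
filled = impute_missing(expr, 0, 800)
```

Observed entries are left untouched; only the NaN positions receive random values from the stated range.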

Model                  Average MSR   Average rows nb   Average columns nb
Maximizing MSR            298.65         1108.88             12.54
Keeping MSR below δ       233.6            11.87              9.12

4.2.2. Model parameters

The threshold δ is used to control the coherence of the extracted biclusters. Based on the coherences of 30 pre-selected clusters [27], Cheng and Church [1] set δ to 300 for the Yeast Cell Cycle data. Actually, the 30 clusters have scores ranging from 261 to 966, with a median value of 630; δ = 300 was chosen as it is close to the lower end of this range. However, some clusters with a higher MSR value have important biological significance, e.g. clusters 8 and 14, with an MSR value of 630. For Human B-cell expression and Colon Cancer, δ is set to 1200 [1] and to 500 [26], respectively. In general, the δ parameter for a given data set can be set experimentally by evaluating the biclusters extracted with different δ values. Actually, the lower δ is, the better the extracted biclusters are. However, other issues have to be considered when choosing the appropriate value of δ, such as the size and the biological relevance of the biclusters. In our study, we do not focus only on the coherence of the extracted biclusters; we also consider the number of conditions (columns) over which the genes (rows) share the same behavior. The numbers of rows and columns are controlled using the α and β parameters. Concerning the choice of the model, we have performed a comparative study of the bicluster sizes found by MOBIibea when applied to the two models. In this experiment, we set the α and β parameters to 0.7 and 0.3, respectively. Table 4 presents the biclusters' average MSR value, average number of rows and average number of columns for the two studied models. The results show that maximizing MSR yields biclusters of larger sizes than the strategy of keeping the MSR value below δ. This can be explained by the execution time allotted to the algorithm, which seems insufficient for the second model to converge.
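Since the δ threshold is applied to the mean squared residue, a minimal sketch of the MSR computation may help; it follows the standard Cheng and Church definition, and the helper name `msr` is ours:

```python
import numpy as np

def msr(data, rows, cols):
    """Mean squared residue of bicluster (I, J), after Cheng and Church:
    MSR(I, J) = mean over i in I, j in J of (a_ij - a_iJ - a_Ij + a_IJ)^2,
    where a_iJ, a_Ij and a_IJ are the row, column and overall means of the
    sub-matrix. (I, J) is a delta-bicluster iff msr(data, I, J) <= delta."""
    sub = data[np.ix_(rows, cols)]
    residue = (sub
               - sub.mean(axis=1, keepdims=True)   # row means a_iJ
               - sub.mean(axis=0, keepdims=True)   # column means a_Ij
               + sub.mean())                       # overall mean a_IJ
    return float((residue ** 2).mean())
```

A perfectly coherent (additive) bicluster, where each entry is the sum of a row offset and a column offset, has an MSR of exactly 0; noise raises the score.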
Therefore, the maximizing-MSR model is chosen, as it obtains good results in a shorter time. Thereby, the considered model is:

4.2. Parameters

In this section, we explain the approach adopted to set the different parameters.

4.2.1. Algorithm parameters

For all the data sets (Yeast Cell Cycle, Human B-cell expression, Colon Cancer and Saccharomyces cerevisiae), the initial population size is set to 200 and the stopping criterion is a fixed running time for each data set. The time is set experimentally by monitoring the stabilization of the hypervolume indicator for the MOBI algorithms over the MSR-maximizing model: 700 s for the Yeast Cell Cycle data, 1000 s for the Saccharomyces cerevisiae data, 3000 s for the Human B-cell expression data and 2000 s for the Colon Cancer data. The crossover and mutation probabilities have been experimentally set to 0.5 and 0.4, respectively. For the CC local search, its threshold parameter is set to 1.8.

1 Available at http://www.tik.ethz.ch/sop/bimax/.


f1(I, J) = α × |I|/|X| + β × |J|/|Y|

f2(I, J) = MSR(I, J)/δ if MSR(I, J) ≤ δ, and 0 otherwise

f3(I, J) = 1/(Rvar(I, J) + 1)

such that MSR(I, J) ≤ δ.

Note that f1 and f2 are to be maximized, while f3 is to be minimized. δ is a user-defined threshold that represents the maximum dissimilarity allowed within a bicluster. As the size and similarity criteria are conflicting, we allow the function f2 to be maximized as long as the residue does not exceed the threshold δ. In order to set the α and β parameters, we performed a comparative study of the biological relevance of the biclusters found by MOBIibea for different α and β values. The tested settings ((α = 0.3, β = 0.7), (α = 0.5, β = 0.5), (α = 0.7, β = 0.3)) represent the three possibilities of weighting the bicluster dimensions. This study aims to



Table 5. Model parameter settings.

(α, β)       Rate of significant biclusters (process terms)   Average rows nb   Average columns nb
(0.7, 0.3)                      96.5%                             2205.37              6
(0.3, 0.7)                     100%                               1108.88             12.54
(0.5, 0.5)                      80.6%                             1231.24             11.3

determine, for each case, whether the set of genes discovered by MOBIibea shows significant enrichment with respect to a specific Gene Ontology (GO) annotation. For that aim, the web tool GoTermFinder2 is used. GoTermFinder searches for significant shared GO terms, or parents of those GO terms, used to describe the genes of a given bicluster, in order to help discover what the genes may have in common. Here, genes are assigned to three structured, controlled vocabularies (ontologies) that describe gene products in terms of associated biological processes, components and molecular functions, in a species-independent manner. In this study, we compare the relevance of the biclusters in terms of associated biological processes. Table 5 presents the rate of significant biclusters, the average number of rows and the average number of columns for each case. It can be noticed that favoring the maximization of columns allows extracting relevant biclusters. We also observe that the rate of relevant biclusters is higher in the first case, (α, β) = (0.7, 0.3) (rows more likely to be maximized), than in the third case. This can be explained by the fact that, in the first case, the biclusters have a very large number of rows but only a small fraction of them are involved in a significant process: the average rate of genes involved in the most significant process is 20.63% in the first case and 34.48% in the third case.

4.3. Experimental protocol

In order to evaluate and compare the Pareto front approximations generated by the different algorithms, we follow the experimental protocol described in [28]. In this protocol, two multiobjective indicators are used: the unary hypervolume indicator (IH−) [29] and the additive ε-indicator (Iε+). For the unary hypervolume indicator, we first compute a reference set ZN of non-dominated points extracted from the union of all the obtained fronts.
Second, we define zmax = (z1max, z2max, z3max), where z1max, z2max and z3max denote the upper bounds of the first, second and third objectives over all the non-dominated front approximations. Then, the quality of an output set A in comparison to ZN is measured by computing the difference between these two sets using the unary hypervolume indicator, with (1.05 × z1max, 1.05 × z2max, 1.05 × z3max) as the reference point [43]. Furthermore, the unary additive ε-indicator (I1ε+) gives the minimum factor by which an approximation A has to be translated in the criterion space to weakly dominate the reference set ZN. I1ε+ can be defined as follows: I1ε+(A) = Iε+(A, ZN), where Iε+(A, B) = min{ε | ∀z2 ∈ B, ∃z1 ∈ A such that z1i − ε ≤ z2i, ∀1 ≤ i ≤ n}.
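The additive ε-indicator can be sketched in a few lines. This is an illustrative helper (not the PISA implementation used in the paper) and assumes every objective is to be minimized:

```python
def additive_epsilon(approx, reference):
    """Unary additive epsilon indicator I_eps+(A, Z_N) for minimization:
    the smallest eps such that translating every point of A by eps makes
    A weakly dominate the reference set Z_N. Computed as
    max over z in Z_N of (min over a in A of (max_i a_i - z_i))."""
    return max(
        min(max(a_i - z_i for a_i, z_i in zip(a, z)) for a in approx)
        for z in reference
    )
```

By construction, an approximation compared against itself yields 0, and shifting every point of the approximation by +1 along each objective yields 1.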

As a consequence, for each test instance, 20 hypervolume differences and 20 epsilon measures, corresponding to the 20 runs, per

2 http://go.princeton.edu/cgi-bin/GOTermFinder.

algorithm were obtained. As suggested by Knowles et al. [28], once all these values are computed, we perform a statistical analysis on pairs of optimization methods in order to compare them on a specific test instance. To this end, we use the Mann–Whitney statistical test, as described in [28]. Note that all the performance assessment procedures have been achieved using the performance assessment tool suite provided in PISA [30]. For each metaheuristic class, i.e. evolutionary and local search metaheuristics, we perform a comparative study between the presented algorithms. These studies allow selecting the best components (MOBI, DMLS) for the HMOBI algorithm. After that, a comparative study between HMOBI and its components MOBI and DMLS is performed in order to show the interest of combining them.

4.4. Results

Table 6 presents the performance of MOBIibea compared to that of MOBInsga with respect to the hypervolume indicator (IH−) and the epsilon indicator (Iε+). The results show that MOBIibea outperforms MOBInsga for all the data sets with respect to both indicators. The same comparison is carried out for the DMLS(1,1), DMLS(1,*) and DMLS(*,1) algorithms; the results are presented in Table 7. It is observed that all the variants give equivalent results for all the data sets and both indicators. In fact, solutions have a large neighborhood in the search space, so the allocated time is not sufficient to perform an exhaustive exploration of the whole neighborhood. Similarly, exhaustive selection is very time consuming. Hence, DMLS(1,*) and DMLS(*,1) are equivalent to DMLS(1,1) when insufficient time is allocated. Accordingly, we choose DMLS(1,1) for designing the HMOBI metaheuristic. Table 8 presents the comparative results of HMOBIibea, MOBIibea and DMLS(1,1) with regard to the hypervolume indicator (IH−) and the epsilon indicator (Iε+).
The results show that HMOBIibea outperforms the other algorithms for the three data sets with respect to both indicators.

4.5. HMOBIibea performance

In this section we study the performance of the HMOBIibea algorithm and compare it to different heuristics (BicFinder [15], OPSM [9] and CC [1]) and metaheuristics (MODPSFLB [31] and DMOPSOB [7]). For the heuristics, we use the Biclustering Analysis Toolbox (BicAT) to obtain the experimental results; BicAT is a software platform that includes many biclustering heuristics, among which the OPSM and CC heuristics are used in our study. For BicFinder we use the software provided by its authors.3 Concerning the metaheuristics, we report the published results.

4.5.1. Solutions quality

Table 9 displays information about the biclusters extracted from the Yeast Cell Cycle and Human B-cell expression data sets using different metaheuristics, namely DMOPSOB [7] and MODPSFLB [31]. The reported information comprises the average size of the found biclusters, the average number of genes, the average number of conditions and the average mean squared residue. The average values of HMOBIibea are computed over the Pareto front δ-biclusters. Note that the DMOPSOB and MODPSFLB algorithms have been developed to solve models that maximize the biclusters' sizes and variances and minimize their MSR values. Concerning the Human B-cell expression data set, the results show that, compared to DMOPSOB and MODPSFLB, HMOBIibea extracts

3 http://www.info.univ-angers.fr/pub/hao/BicFinder.html.


Table 6. Comparison of different metaheuristics for the IH− and Iε+ metrics, using the Mann–Whitney statistical test with a p-value of 1%. According to the metric under consideration, the results of the algorithm located at a specific row are either significantly better than those of the algorithm located at a specific column (≻), worse (≺), or not significantly different (≡).

                          IH−                       Iε+
Instance   Algorithm   MOBIibea   MOBInsga      MOBIibea   MOBInsga
Yeast      MOBIibea       –          ≻             –          ≻
           MOBInsga       ≺          –             ≺          –
Human      MOBIibea       –          ≻             –          ≻
           MOBInsga       ≺          –             ≺          –
Colon      MOBIibea       –          ≻             –          ≻
           MOBInsga       ≺          –             ≺          –

Table 7. Comparison of the DMLS variants for the IH− and Iε+ metrics, using the Mann–Whitney statistical test with a p-value of 1%. According to the metric under consideration, the results of the algorithm located at a specific row are either significantly better than those of the algorithm located at a specific column (≻), worse (≺), or not significantly different (≡).

                           IH−                                Iε+
Instance   Algorithm   DMLS(1,1)  DMLS(1,*)  DMLS(*,1)    DMLS(1,1)  DMLS(1,*)  DMLS(*,1)
Yeast      DMLS(1,1)       –          ≡          ≡            –          ≡          ≡
           DMLS(1,*)       ≡          –          ≡            ≡          –          ≡
           DMLS(*,1)       ≡          ≡          –            ≡          ≡          –
Human      DMLS(1,1)       –          ≡          ≡            –          ≡          ≡
           DMLS(1,*)       ≡          –          ≡            ≡          –          ≡
           DMLS(*,1)       ≡          ≡          –            ≡          ≡          –
Colon      DMLS(1,1)       –          ≡          ≡            –          ≡          ≡
           DMLS(1,*)       ≡          –          ≡            ≡          –          ≡
           DMLS(*,1)       ≡          ≡          –            ≡          ≡          –

biclusters of larger sizes. Actually, the extracted biclusters have an average number of conditions (62.6) that is remarkably higher than those found by DMOPSOB (42.78) and MODPSFLB (43.029). For the Yeast Cell Cycle data, HMOBIibea extracts biclusters with a high number of columns (10.31); however, the average size is relatively low (7665.59) compared to that of the other algorithms (11,213.5 and 11,220.7). The average MSR values of the biclusters found by HMOBIibea are higher than those found by the other algorithms for both data sets. In fact, our aim is to extract δ-biclusters of large size and good biological relevance; hence, as long as the MSR value does not exceed δ, the biclusters are considered to be of good coherence.

4.5.2. Biological relevance

In this section, we study the biological relevance of the extracted biclusters, which depends on their sets of genes. Based on the GO annotation, the idea is to determine the rate of genes that share significant GO terms, using the web tools GoTermFinder and FuncAssociate over the Yeast Cell Cycle and Saccharomyces cerevisiae data sets.

• Yeast Cell Cycle data set

The biological relevance of the biclusters extracted from the Yeast Cell Cycle data set is tested using GoTermFinder. In this study, we want to know whether or not the genes of the extracted biclusters

Table 8. Comparison of HMOBIibea, DMLS(1,1) and MOBIibea for the IH− and Iε+ metrics, using the Mann–Whitney statistical test with a p-value of 1%. According to the metric under consideration, the results of the algorithm located at a specific row are either significantly better than those of the algorithm located at a specific column (≻), worse (≺), or not significantly different (≡).

                            IH−                                  Iε+
Instance   Algorithm   HMOBIibea  DMLS(1,1)  MOBIibea     HMOBIibea  DMLS(1,1)  MOBIibea
Yeast      HMOBIibea       –          ≻          ≻            –          ≻          ≻
           DMLS(1,1)       ≺          –          ≺            ≺          –          ≺
           MOBIibea        ≺          ≻          –            ≺          ≻          –
Human      HMOBIibea       –          ≻          ≻            –          ≻          ≡
           DMLS(1,1)       ≺          –          ≺            ≺          –          ≺
           MOBIibea        ≺          ≻          –            ≡          ≻          –
Colon      HMOBIibea       –          ≻          ≻            –          ≻          ≻
           DMLS(1,1)       ≺          –          ≺            ≺          –          ≺
           MOBIibea        ≺          ≻          –            ≺          ≻          –

Table 9. Information about the results given by the different algorithms.

                     HMOBIibea               DMOPSOB                MODPSFLB
                  Yeast      Human       Yeast      Human       Yeast      Human
Avg. MSR          299.6     1199.9      216.13     905.23      212.8      904.9
Avg. size       7665.59   47,442.87   11,213.5   35,442.98   11,220.7   35,601.8
Avg. genes       759.25     759.37     1151.25     932.57     1154.21     933.9
Avg. conditions   10.31       62.6        9.59      42.78        9.81     43.029


Table 10. Significant shared GO terms of two selected biclusters obtained by HMOBIibea on the Yeast Cell Cycle data.

Bic's size   Process                                Function                              Component
708 × 9      Cellular process (82.2%, 8.39e−07)     Catalytic activity (41.1%, 0.00049)   Intracellular (84.5%, 5.57e−10)
1746 × 13    Response to stimulus (20.0%, 0.00024)  Ion binding (25.9%, 0.00137)          Chromosomal part (7.0%, 0.00172)

1746 × 13

share significant GO terms. FuncAssociate computes adjusted significance scores for the biclusters: these scores assess the genes in each bicluster through adjusted p-values, which show how well they match the different GO categories. If most of the genes in a bicluster have the same biological function, then it is unlikely that this happens randomly, and the p-value of the category will be close to 0. The results show that the genes of all the extracted biclusters (100%) share significant GO terms with a p-value less than 1%. Table 10 shows the significant shared GO terms (or parents of GO terms) used to describe the sets of genes (708 and 1746 genes) of two extracted biclusters, for the process, function and component ontologies. The values within parentheses after each GO term in Table 10, such as (82.2%, 8.39e−07) for the first bicluster, indicate the bicluster frequency and the statistical significance. The bicluster frequency (82.2%) shows that, out of the 708 genes in the first bicluster, 582 belong to this process, and the statistical significance is given by a p-value of 8.39e−07 (highly significant). Note that the genes in the biclusters also share other GO terms, but with a lower significance (i.e. a higher p-value). For example, the most significant process in the first bicluster is cellular process. Some of the involved genes are: YMR202W, YMR032W, YLL016W, YGL127C, YLR126C, YKL222C, YDR294C, YFL030W, YER016W, YEL015W, YDL127W, YMR072W, YGR238C, YJL111W, YBR176W, YCR060W, YOL080C, YDR045C, YER060W, YGR023W, YDR058C, YGL249W, YDR331W, YJL144W, YDR206W, YLR218C, YJL136C, YBL024W. In the second bicluster, the most significant process is response to stimulus; some of the involved genes are YDL035C, YLL016W, YNL242W, YOL090W, YPL248C, YPR095C, YLR320W, YKL166C, YBR289W, YDR294C, YDL234C, YNR044W, YBR279W, YER174C, YBR066C, YPL203W, YLR032W, YDR461W, YNL215W, YGR023W, YOR178C, YHL023C, YAL015C, YGL032C, YBL024W, YDR501W.
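Such enrichment p-values are typically one-sided hypergeometric tail probabilities, computed before multiple-testing adjustment. A minimal sketch under this standard formulation (it is not FuncAssociate's or GoTermFinder's exact procedure; the helper name is ours):

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """One-sided hypergeometric tail P(X >= k): the probability that a
    random set of n genes, drawn from N genes of which K carry a given GO
    term, contains at least k annotated genes. A small value means the
    shared annotation is unlikely to occur by chance. Multiple-testing
    adjustment is deliberately omitted from this sketch."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)
```

For instance, with 2 of 4 genes annotated, the chance that a random pair consists of both annotated genes is 1/6, so observing that pair is moderately surprising; requiring at least zero annotated genes always gives probability 1.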
• Saccharomyces cerevisiae data set

Using the FuncAssociate [32] web tool, we test the biological relevance of the HMOBIibea biclusters extracted from the Saccharomyces cerevisiae data set. For a given gene list, FuncAssociate computes an adjusted significance score (p-value) which indicates how well the genes match the different GO categories. The results are compared to those of the BicFinder [15], OPSM [9] and CC [1] heuristics. For BicFinder, the results concern the 100 largest biclusters out of the 1981 resulting biclusters. Table 11 presents the average number of rows, the average number of columns and the rate of significant biclusters with a p-value ≤ 0.1%. The results show that both HMOBIibea and BicFinder are able to extract relevant biclusters, those of HMOBIibea being larger. OPSM extracts small biclusters of good quality.

Table 11. Biological relevance of the biclusters extracted by different algorithms from the Saccharomyces cerevisiae data.

Algorithm    Genes number   Columns number   Rate
HMOBIibea       1714.64        109.305       100%
BicFinder        421           173           100%
OPSM              10            51.69         87%
CC               511.6          59.9          24%

In spite of

that, the relevance of the CC biclusters is very poor, which can be explained by the algorithm's lack of robustness against noise. In fact, after the extraction of a bicluster, the CC algorithm replaces the elements of this bicluster in the original data matrix with random values; hence, the subsequently extracted biclusters may be influenced by these random values.

5. Conclusion

In this paper we addressed the biclustering problem applied to the analysis of gene expression data. We proposed for this problem a new multiobjective model which aims to extract biclusters of maximal size and biological relevance. To solve it, we proposed a hybrid multiobjective metaheuristic, HMOBIibea, based on IBEA and a dominance-based multiobjective local search. HMOBIibea takes advantage of the behaviors of both evolutionary and local search metaheuristics; it therefore offers good exploration and good intensification at the same time. The performance of HMOBIibea has been biologically assessed in terms of Gene Ontology (GO) annotations using real data sets from the literature. The comparative study with several heuristics and metaheuristics from the literature shows that HMOBIibea gives better results.

Biclustering methods can also be considered for the analysis of other biological data, such as data from GWAS (genome-wide association studies). GWAS aim to identify genetic variations (polymorphisms) associated with specific traits, and are mainly analyzed using supervised methods such as logistic regression and discriminant analysis. Using biclustering, an unsupervised study of these data can be performed. To do so, a new similarity/dissimilarity measure adapted to these data has to be defined. Furthermore, HMOBIibea would have to be parallelized in order to deal with the large size of GWAS data; it could be improved by adopting a parallel island model, which would allow partitioning the search space into several subspaces and performing an intensive search in each subspace.

References

[1] Y. Cheng, G.M. Church, Biclustering of expression data, in: Proc. of the 8th ISMB, AAAI Press, 2000, pp. 93–103.
[2] S.C. Madeira, A.L. Oliveira, Biclustering algorithms for biological data analysis: a survey, IEEE/ACM Trans. Comput. Biol. Bioinf. 1 (2004) 24–45.
[3] M. Lashkargir, S.A. Monadjemi, A.B. Dastjerdi, A new biclustering method for gene expression data based on adaptive multi objective particle swarm optimization, in: ICCEE '09, Vol. 01, IEEE Computer Society, Washington, DC, USA, 2009, pp. 559–563.
[4] J. Liu, Z. Li, X. Hu, Y. Chen, Biclustering of microarray data with MOSPO based on crowding distance, BMC Bioinf. 10 (Suppl. 4) (2009) S9.
[5] J. Liu, Z. Li, X. Hu, Y. Chen, Multi-objective ant colony optimization biclustering of microarray data, in: GrC '09, 2009, pp. 424–429.
[6] J. Liu, Z. Li, F. Liu, Y. Chen, Multi-objective particle swarm optimization biclustering of microarray data, in: IEEE Int. Conf. Bioinf. Biomed., 2008, pp. 363–366.
[7] J. Liu, Y. Chen, Dynamic biclustering of microarray data with MOPSO, in: 2010 IEEE International Conference on Granular Computing (GrC 2010), San Jose, California, USA, IEEE Computer Society, 2010, pp. 330–334.
[8] S. Mitra, H. Banka, Multi-objective evolutionary biclustering of gene expression data, Pattern Recognit. 39 (12) (2006) 2464–2477.
[9] A. Ben-Dor, B. Chor, R.M. Karp, Z. Yakhini, Discovering local structure in gene expression data: the order-preserving submatrix problem, in: RECOMB, 2002, pp. 49–57.
[10] K.O. Cheng, N.F. Law, W.C. Siu, A.W. Liew, Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization, BMC Bioinf. 9 (2008) 210.
[11] L. Xiaowen, W. Lusheng, Computing the maximum similarity bi-clusters of gene expression data, Bioinformatics 23 (1) (2007) 50–56.
[12] T. Li, C. Laiwan, Discovering biclusters by iteratively sorting with weighted correlation coefficient in gene expression data, J. Signal Process. Syst. 50 (3) (2008) 267–280.
[13] J.A. Hartigan, Direct clustering of a data matrix, J. Am. Stat. Assoc. 67 (337) (1972) 123–129.
[14] A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, E. Zitzler, A systematic comparison and evaluation of biclustering methods for gene expression data, Bioinformatics 22 (9) (2006) 1122–1129.

[15] A. Wassim, E. Mourad, H. Jin-Kao, BicFinder: a biclustering algorithm for microarray data analysis, Knowl. Inf. Syst. 30 (2) (2012) 341–358.
[16] A. Wassim, E. Mourad, H. Jin-Kao, A biclustering algorithm based on a Bicluster Enumeration Tree: application to DNA microarray data, BioData Min. 2 (2009) 9.
[17] K. Seridi, L. Jourdan, E.-G. Talbi, Multi-objective evolutionary algorithm for biclustering in microarrays data, in: IEEE Congress on Evolutionary Computation, IEEE, 2011, pp. 2593–2599.
[18] K. Seridi, L. Jourdan, E.-G. Talbi, Hybrid metaheuristic for multiobjective biclustering in microarray data, in: IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 2012.
[19] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evolut. Comput. 6 (2) (2002) 182–197.
[20] E. Zitzler, S. Künzli, Indicator-based selection in multiobjective search, in: PPSN VIII, LNCS 3242, 2004, pp. 832–842.
[21] K. Deb, Multi-Objective Optimization using Evolutionary Algorithms, Wiley-Interscience Series in Systems and Optimization, John Wiley & Sons, Chichester, 2001.
[22] U. Maulik, A. Mukhopadhyay, S. Bandyopadhyay, M. Zhang, X. Zhang, Multiobjective fuzzy biclustering in microarray data: method and a new performance measure, in: IEEE Congress on Evolutionary Computation (CEC 2008), 2008, pp. 1536–1543.
[23] P.A.D. de Castro, F.O. de França, H.M. Ferreira, F.J. Von Zuben, Applying biclustering to text mining: an immune-inspired approach, in: Proceedings of the 6th International Conference on Artificial Immune Systems, ICARIS '07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 83–94.
[24] L. Paquete, M. Chiarandini, T. Stützle, Pareto local optimum sets in the biobjective traveling salesman problem: an experimental study, Metaheuristics for Multiobjective Optimisation 535 (2004) 177–200.
[25] Y. Cheng, G.M. Church, Biclustering of Expression Data (Supplementary Information), 2006.
[26] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. U. S. A. 96 (12) (1999) 6745–6750.
[27] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, G.M. Church, Systematic determination of genetic network architecture, Nat. Genet. 22 (1999) 281–285.
[28] J. Knowles, L. Thiele, E. Zitzler, A tutorial on the performance assessment of stochastic multiobjective optimizers, Tech. Rep., Computer Engineering and Networks Laboratory (TIK), ETH Zurich, Switzerland (revised version), 2006.
[29] E. Zitzler, L. Thiele, Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach, IEEE Trans. Evolut. Comput. 3 (4) (1999) 257–271.


[30] S. Bleuler, M. Laumanns, L. Thiele, E. Zitzler, PISA – a platform and programming language independent interface for search algorithms, in: C.M. Fonseca, P.J. Fleming, E. Zitzler, K. Deb, L. Thiele (Eds.), Conference on Evolutionary Multi-Criterion Optimization (EMO 2003), Vol. 2632 of Lecture Notes in Computer Science (LNCS), Springer-Verlag, Faro, Portugal, 2003, pp. 494–508.
[31] J. Liu, Z. Li, X. Hu, Y. Chen, F. Liu, Multi-objective dynamic population shuffled frog-leaping biclustering of microarray data, BMC Genomics 13 (Suppl. 3) (2012) S6.
[32] G.F. Berriz, Characterizing gene sets with FuncAssociate, Bioinformatics 19 (2003) 2502–2504.
[33] J. Handl, D.B. Kell, J. Knowles, Multiobjective optimization in bioinformatics and computational biology, IEEE/ACM Trans. Comput. Biol. Bioinf. 4 (2) (2007) 279–292.
[34] S. Saha, S. Bandyopadhyay, A generalized automatic clustering algorithm in a multiobjective framework, Appl. Soft Comput. 13 (1) (2013) 89–108.
[35] D. Corne, C. Dhaenens, L. Jourdan, Synergies between operations research and data mining: the emerging use of multi-objective approaches, Eur. J. Oper. Res. 221 (2012) 469–479.
[36] Y.J. Zheng, Q. Song, S.Y. Chen, Multiobjective fireworks optimization for variable-rate fertilization in oil crop production, Appl. Soft Comput. 13 (11) (2013) 4253–4263.
[37] D.K. Sharma, A. Gaur, D. Ghosh, Goal programming model for agricultural land allocation problems, Int. J. Model. Simul. 28 (1) (2008) 43–48.
[38] M.R. Khouadjia, B. Sarasola, E. Alba, L. Jourdan, E.-G. Talbi, A comparative study between dynamic adapted PSO and VNS for the vehicle routing problem with dynamic requests, Appl. Soft Comput. 12 (4) (2012) 1426–1439.
[39] B. Samanta, T.K. Roy, Multi-objective entropy transportation model with trapezoidal fuzzy number penalties, sources and destination, J. Transp. Eng. 131 (6) (2005) 419–428.
[40] A. Liefooghe, L. Jourdan, E.-G. Talbi, Metaheuristics and cooperative approaches for the bi-objective ring star problem, Comput. Oper. Res. 37 (6) (2010) 1033–1044.
[41] A. Liefooghe, J. Humeau, S. Mesmoudi, L. Jourdan, E.-G. Talbi, On dominance-based multiobjective local search: design, implementation and experimental analysis on scheduling and traveling salesman problems, J. Heurist. 18 (2) (2012) 317–352.
[42] A. Liefooghe, L. Jourdan, E.-G. Talbi, A software framework based on a conceptual unified model for evolutionary multiobjective optimization: ParadisEO-MOEO, Eur. J. Oper. Res. 209 (2) (2011) 104–112.
[43] R. Chevrier, A. Liefooghe, L. Jourdan, C. Dhaenens, Solving a dial-a-ride problem with a hybrid evolutionary multi-objective approach, Appl. Soft Comput. 12 (2012) 1247–1258.