Engineering Applications of Artificial Intelligence 26 (2013) 314–326
Contents lists available at SciVerse ScienceDirect
Engineering Applications of Artificial Intelligence journal homepage: www.elsevier.com/locate/engappai
Comparing multiobjective swarm intelligence metaheuristics for DNA motif discovery David L. Gonza´lez-A´lvarez n, Miguel A. Vega-Rodrı´guez, Juan A. Go´mez-Pulido, Juan M. Sa´nchez-Pe´rez University of Extremadura, Department of Technologies of Computers and Communications, ARCO Research Group, Escuela Polite´cnica, Campus Universitario s/n, ´ceres, Spain 10003 Ca
a r t i c l e i n f o
a b s t r a c t
Article history: Received 21 December 2011 Received in revised form 5 June 2012 Accepted 18 June 2012 Available online 17 July 2012
In recent years, a huge number of biological problems have been successfully addressed through computational techniques, among all these computational techniques we highlight metaheuristics. Also, most of these biological problems are directly related to genomic, studying the microorganisms, plants, and animals genomes. In this work, we solve a DNA sequence analysis problem called Motif Discovery Problem (MDP) by using two novel algorithms based on swarm intelligence: Artificial Bee Colony (ABC) and Gravitational Search Algorithm (GSA). To guide the pattern search to solutions that have a better biological relevance, we have redefined the problem formulation and incorporated several biological constraints that should be satisfied by each solution. One of the most important characteristics of the problem definition is the application of multiobjective optimization (MOO), maximizing three conflicting objectives: motif length, support, and similarity. So, we have adapted our algorithms to the multiobjective context. This paper presents an exhaustive comparison of both multiobjective proposals on instances of different nature: real instances, generic instances, and instances generated according to a Markov chain. To analyze their operations we have used several indicators and statistics, comparing their results with those obtained by standard algorithms in multiobjective computation, and by 14 well-known biological methods. & 2012 Elsevier Ltd. All rights reserved.
Keywords: Swarm intelligence Artificial bee colony Gravitational search algorithm Deoxyribonucleic acid (DNA) Motif discovery Multiobjective optimization
1. Introduction Nowadays, we can easily find many optimization problems that require a huge computational time. There are even problems that can not be solved optimally with the existing computers. Such problems are called NP-hard problems. Currently, all known algorithms for solving NP-hard problems require an exponential time with respect to the input size. It is unknown if there will be so fast algorithms, therefore, to solve an NP-hard problem of an arbitrary size it is common to use techniques such as metaheuristics (Glover and Kochenberger, 2003). In computer science, we can define a metaheuristics as a problem optimization method that applies an iterative process to improve the quality of possible solutions, taking into account a given fitness function. Also, we can easily adapt metaheuristics to several problems. These techniques do not guarantee finding the optimal solution, they find quasi-optimal solutions in a reasonable time. Within the vast world of metaheuristics it is defined the concept of swarm intelligence. This concept is taking a lot of strength in recent years. Swarm intelligence is the discipline that deals with systems composed of a set of decentralized and self-organized
n
Corresponding author. Tel.: þ34 927 257000; fax: þ 34 927 257187. E-mail addresses:
[email protected] (D.L. Gonza´lez-A´lvarez),
[email protected] (M.A. Vega-Rodrı´guez),
[email protected] (J.A. Go´mez-Pulido),
[email protected] (J.M. Sa´nchez-Pe´rez). 0952-1976/$ - see front matter & 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.engappai.2012.06.014
individuals, and which are normally based on natural phenomenons. In particular, this discipline takes advantage of the collective behavior of individuals who relate to each other and the environment. These algorithms could be divided into two groups: those based on the animal behaviors and those based on physics or nature behaviors. In recent years, many algorithms based on these collective behaviors are being successfully applied in different problems of several fields, due to this, we have decided to analyze the behavior of swarm intelligence algorithms in this work, comparing two novel algorithms such as the Artificial Bee Colony (ABC) algorithm (Karaboga, 2005), which is an optimization algorithm based on the intelligent foraging behavior of honey bee swarm; and the Gravitational Search Algorithm (GSA) (Rashedi et al., 2009), a new optimization algorithm based on the law of gravity and mass interactions. In this way, we can compare an algorithm from each group: one based on the animal behaviors (ABC), and other based on physics and nature behaviors (GSA). The main objective of this work is to analyze which kind of swarm intelligence algorithm is able to solve better the Motif Discovery Problem (MDP). MDP is an NP-hard optimization problem as defined in literature (Maier, 1978; Rivie re et al., 2008), applied to the specific task of discovering novel Transcription Factor Binding Sites (TFBS) in DNA sequences (D’haeseleer, 2006). Predicting common patterns, motifs, is one of the most important sequence analysis problem, and it has not yet been resolved in an efficient manner. In addition, in this work we have expanded the formulation of this problem with
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
several constraints to adapt the solution search to the real biology. In biology, finding and decoding the true meaning of these DNA patterns can help us to explain the complexity and development of living organisms. MDP maximizes three conflicting objectives: motif length, support, and similarity. Due to this, we have to use multiobjective techniques for its resolution. Moreover, we have to adapt the operation of our algorithms to the multiobjective context. Therefore, the swarm intelligence based algorithms compared in this paper are adapted to this context: the Multiobjective ABC (MOABC) algorithm (Gonza´lez-A´lvarez et al., 2011a), and the Multiobjective GSA (MOGSA) (Gonza´lez-A´lvarez et al., 2011b), based on the single-objective ABC and GSA algorithms, respectively. For the results presentation, we have used typical multiobjective indicators such as hypervolume (Zitzler and Thiele, 1999) or coverage relation (Zitzler et al., 2000), and thus, we facilitate future comparisons. We also want to emphasize that to ensure that the solutions found by our proposals are biologically relevant, we have made several analysis by using biological indicators such as Sensitivity, the Positive Predictive Value, the Performance Coefficient, and the Correlation Coefficient. On whole, our main objectives in this work are: compare novel swarm intelligence based techniques to solve a well-known bioinformatics problem that still has not been resolved in an efficient manner, incorporate new rules and constraints to adapt the MDP to the real biological world, obtaining good and relevant results. In the remainder of the paper, we briefly mention a number of existing works dedicated to the motif discovery in Section 2. Thereafter, in Section 3, we describe the MDP in detail. Section 4 presents the metaheuristics compared, explaining their performances. In Section 5, we include the experimental methodology, the instances used, and the best configuration of each algorithm. In Section 6, we analyze the behavior of our proposals, making comparisons between them and with two standard multiobjective evolutionary algorithms. We also compare the two proposed swarm-based algorithms with other previously proposed metaheuristics in Section 7. Section 8 compares the motifs discovered by our algorithms with those predicted by other 14 well-known biological methods. Finally, we outline the conclusions of this paper.
2. Related work In this section we present some of the research literature related to the MDP. First, we will describe some of the latest research that apply evolutionary computation to discover motifs in DNA sequences. Next, we will organize and analyze the biological methods most commonly used to solve this problem. There are many proposals based on evolutionary techniques for finding DNA motifs, an example is the algorithm FMGA (Liu et al., 2004), a genetic algorithm based on the SAGA operators (Notredame and Higgins, 1996). Another algorithms are St-GA (Stine et al., 2003) and MDGA (Che et al., 2005). Although there are other proposals such as the algorithm TS-BFO (Shao and Chen, 2009) which integrates Bacterial Foraging Optimization (BFO) and Tabu Search (TS), or the EDA/DE algorithm proposed in Shao et al. (2009), almost all the evolutionary algorithm proposals are based on genetic algorithms. Furthermore, all of them employ a single objective to discover motifs, and the motif length is given beforehand, assuming only one motif per sequence. Moreover, almost all of the algorithms tries to find motifs in all of the given sequences. Fogel et al. (2004, 2008) propose a new multi-term fitness function, the objective of this process was to maximize the motif similarities while avoiding saturation of low complexity solutions. However, the best way to address some of these problems is using a multiobjective approach. Kaya (2009) proposed a multiobjective GA based method named MOGAMOD for discovering motifs, demonstrating the advantages of multiobjective approach over single objective ones to discover motifs. Due to the
315
advantages of the multiobjective optimization, we also use it in our problem definition. Our first tests were performed by using DEPT (Differential Evolution with Pareto Tournaments, Gonza´lez-A´lvarez et al., 2010a) and MO-VNS (Multiobjective Variable Neighborhood Search, Gonza´lez-A´lvarez et al., 2010b), two new multiobjective adaptations based on Differential Evolution (DE, Storn and Price, 1997) and Variable Neighborhood Search (VNS, Mladenovic and Hansen, 1997). When we adapted the MDP formulation to the multiobjective optimization context, we could make comparisons between our proposals and the MOGAMOD algorithm in Gonza´lezA´lvarez et al. (2011c). At this moment, after better understanding the biological aspects of the MDP, we have incorporated several constraints and improvements which allow us to discover solutions with a high biological relevance. The improvements are detailed in Section 3, among them we highlight the complexity concept, which discards the biologically irrelevant solutions. In this work, our multiobjective proposals are based on ABC (Karaboga, 2005) and GSA (Rashedi et al., 2009). Unfortunately, we have not been able to compare the results obtained by them with those obtained by MOGAMOD, due to the changes that we have made in the problem formulation. Although evolutionary computation has produced very good results in almost all areas where it has been applied, in this particular problem (MDP), the methods or techniques that are getting better results are the biological ones. Most of the literature categorizes motif discovery methods into two major groups based on the combinatorial approach used in their design: string-based methods and probabilistic methods. The string-based methods guarantee global optimality and they are appropriate for finding totally constrained motifs (all instances are identical). The probabilistic methods imply representation of the motifs by a position weight matrix. Among the stringbased methods there are many proposals as Oligo/Dyad-Analysis (van Helden et al., 1998, 2000), MITRA (Mismatch Tree Algorithm) (Eskin and Pevzner, 2002), YMF (Sinha and Tompa, 2003), and QuickScore (Regnier and Denise, 2004). However, one of the most popular stringbased methods for discovering motifs is Weeder (Pavesi et al., 2001). Regarding to the probabilistic methods, Hertz et al. (1990) presented one of the first implementations for discovering motifs by using a matrix representation of TFBS. This algorithm has been improved over the years resulting in Consensus (Hertz and Stormo, 1999). However, the probabilistic methods that best predictions are given are those that apply powerful statistical techniques, two known alternatives are Expectation Maximization (EM) algorithm and Gibbs Sampling (GS). Algorithms like MEME (Multiple EM for Motif Elicitation) and MEME3 (it differs from MEME mainly in using a correction factor) (Bailey and Elkan, 1995) extended the EM algorithm for identifying motifs. Another important statistical method is Improbizer (Ao et al., 2004). GS is another algorithm that has been successfully applied in many other methods. Based on this algorithm, one that stands out is AlignACE (Aligns Nucleic Acid Conserved Elements) (Roth et al., 1998). It uses a GS similar to that described by Neuwald et al. (1995). There are other many proposals based on GS, for instance, ANN_Spec (Workman and Stormo, 2000), MotifSampler (Thijs et al., 2001), GLAM (Frith et al., 2004), or the recently proposed SeSiMCMC (Sequence Similarities by Markov Chain Monte-Carlo) (Favorov et al., 2005). Thanks to the work of Tompa et al. (2005), we can compare the results obtained by our algorithms with those obtained by the 14 biological methods described in this section. As we will see in the following sections, our algorithms obtain results with an important biological relevance.
3. Motif discovery problem Gene expression is the process by which a gene is transcribed to form an RNA sequence. Then this sequence is used to produce the corresponding protein sequence. This process starts when a
316
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
macro-molecule called Transcription Factor (TF) has been bounded to a short subsequence in the promoter region of the gene, called TFBS (Zare-Mirakabad et al., 2009). Finding TFBSs in DNA sequences (problem known as MDP) is important for uncovering the underlying regulatory relationship and understanding the evolutionary mechanism of living organisms. In this paper we solve the MDP finding biologically relevant DNA patterns in sequences of different beings. 3.1. Problem formulation Given a set of sequences S ¼ fSi 9i ¼ 1; 2, . . . ,Dg of nucleotides defined on the alphabet B ¼ fA,C,G,Tg. Si ¼ fSji 9j ¼ 1; 2, . . . ,wi g is a sequence of nucleotides, where wi is the sequence width. The set of j all the subsequences contained in S is fsii 9i ¼ 1; 2, . . . ,D,ji ¼ 1; 2, . . . ,wi l þ 1g, where ji is the binding site of a possible motif instance sji on sequence Si, and l is the motif length, the first objective to be maximized. To obtain the values of the other two objectives we have to build the Position Indicator Matrix (PIM) A ¼ fAi 9i ¼ 1; 2, . . . ,Dg of the motif, where Ai ¼ fAji 9j ¼ 1; 2, . . . ,wi g is the indicator row vector with respect to a sequence Si. Aji is 1 if the position j in Si is a binding site, and 0 otherwise. We refer to the number of motif P Pwi j instances as 9A9 ¼ D i¼1 j ¼ 1 Ai . We also require to find the consensus motif, which is a string abstraction of the motif instances. In this work, we consider a single motif instance per sequence. Only those sequences that achieve a motif instance of certain quality with respect to the consensus motif were taken into account when we perform the final motif. This is indicated by the second objective, the support. Furthermore, SðAÞ ¼ fSðAÞ1 ,SðAÞ2 , . . . ,SðAÞ9A9 g is a set of 9A9 motif instances, where SðAÞi ¼ SðAÞ1i SðAÞ2i . . . SðAÞli is the ith motif instance in 9A9. S(A) can also be expanded as ðSðAÞ1 ,SðAÞ2 , . . . ,SðAÞl Þ, where SðAÞj ¼ SðAÞj1 SðAÞj2 . . . SðAÞj9A9 is the list of nucleotides on the jth position in the motif instances. Then, we build the Position Count Matrix (PCM) N(A) with the numbers of different nucleotide bases on each position of the candidate motifs (A) which have passed the
(C3) We have also applied the complexity concept proposed in Fogel et al. (2004). This concept should be considered to avoid low-complexity solutions, for example, two candidate motifs: ‘‘AAAAAA’’ and ‘‘AAACAA’’ are very similar, but it is not a meaningful motif. The average complexity for a final motif represents the total complexity score for each candidate motif. We calculate the complexity of a candidate motif by using the following expression: l! Complexity ¼ logN Q ðni Þ!
ð2Þ
where N¼4 for DNA sequences, l is the motif length, and ni is the number of nucleotides of type iA fA,C,G,Tg. For example, if we consider the motif ‘AAAA’ (nA ¼4, nT ¼0, nG ¼0, and nC ¼0) we will obtain a minimum complexity since we get the highest Q value in ðni Þ!. Otherwise, if we have, for example, the ‘ACGT’ motif (nA ¼1, nT ¼1, nG ¼1, and nC ¼1) we will obtain a higher complexity. As we can see in Eq. (2), if we do not normalize the complexities when we compare motifs, the maximum complexity was highly dependent of the motif length. The compositional complexity calculation was revised such that the maximum possible complexity score is calculated for each possible motif length prior to the evolutionary computation. During evolution, each complexity score was rescaled between [0,1] where the maximum possible complexity score was 1. This removed any potential bias in complexity relative to the motif length, as detailed in (Fogel et al., 2008). In our algorithms we have established a minimum complexity of 0.5 (50%). These three constraints must be met by all the generated solutions, i.e., if a solution does not meet the three defined constraints, it will be discarded and it will not be part of the population. 3.2. Example
threshold marked by the support. NðAÞ ¼ fNðAÞ1 ,NðAÞ2 , . . . ,NðAÞl g, and NðAÞj ¼ fNðAÞjb 9b A Bg, where NðAÞjb ¼ 9fSðAÞji 9SðAÞji ¼ bg9. The dominant nucleotides of each position are normalized in the Position b ¼ NðAÞ=9A9. Finally, we calculate the third Frequency Matrix (PFM) N objective, the similarity, averaging all the dominance values of each PFM column, as is indicated in the following expression: Pl maxb ff ðb,iÞg ð1Þ SimilarityðMotif Þ ¼ i ¼ 1 l where f ðb,iÞ is the score of nucleotide b in column i in the PFM and maxb ff ðb,iÞg is the dominance value of the dominant nucleotide in column i. To guide the pattern search to solutions that have biological relevance, we have imposed to the objective functions several constraints that should be satisfied by each solution: (C1) Motifs are usually short (D’haeseleer, 2006), so that, finding long motifs can mean to lose a great computational time. To avoid this, we have restricted the motif length to the range [7,64] nucleotides, where the minimum is seven and the maximum is 64. (C2) We have also set a minimum support value of two for the motifs of the sequence data sets composed of four or less sequences, and of three for the other ones (more than four sequences). Normally, the binding sites are composed of motifs of all or nearly all sequences, and without this constraint is very easy to predict motifs with a high similarity (even 100%) formed, for example, by candidates of only one DNA sequence.
As an example, Fig. 1 illustrates an artificial MDP with motif length¼10. By using the motif instances shown in the first column of Fig. 1, we first obtain the consensus motif GGGATTACAG. With this consensus motif we can calculate the value of the second objective. Those sequences which candidate motif exceeds or equals a threshold value of concordance of 50%ð Z 5=10Þ, will be taken into account in the support, in this example we have support¼5 (see second column of Fig. 1). By using the candidate motifs which have surpassed the threshold value of Support, we can build the final motif. In the second column, we include this final motif and the selected candidate motifs. The last step is to build the PCM and the PFM by using the nucleotides of the motif instances that have passed the concordance threshold (see last column of Fig. 1). Having done that, we can obtain the value of the similarity, applying Eq. (1). In this example we obtain similarity¼0.94. With the naked eye we see the similarity differences between the consensus and the final motifs. If we compare the similarities obtained by both solutions we notice as if we do not apply the threshold value of Support, we would have obtained a solution with a similarity of 72%. However, taking into account this objective, we have eliminated the candidates that distorted the final solution, obtaining a solution with a similarity of 94%.
4. Description of the algorithms In this section we detail the representation of the individuals, and we include a brief description of the algorithms compared in this work.
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
317
Fig. 1. An artificial motif discovery problem. From left to right it shows the candidate motifs, the consensus motif and the candidates which have surpassed the threshold value of Support, and finally, the final motif, the position count matrix, and the position frequency matrix.
Motif Length
Seq. 0 S0
Seq. 1 S1
Seq. 2 S2
Seq. n Sn
solutions of the conflicting Pareto front and choosing the sources of the best crowding distances. For further information about the MOABC algorithm see Gonza´lez-A´lvarez et al. (2011a).
Fig. 2. Representation of an individual.
4.3. Multiobjective gravitational search algorithm (MO-GSA) 4.1. Representation of the individuals Each individual includes the necessary information to form a possible motif. An individual in our algorithms represents the starting location (si) of the potential motif on all the sequences. As our data sets have different number of sequences, the definition of the individuals should be adapted to each data set. Our definition of the individual also includes the motif length. In Fig. 2, we show our individual representation. 4.2. Multiobjective artificial bee colony (MOABC) ABC is an evolutionary algorithm defined by Karaboga (2005) motivated by the intelligent behavior of honey bee swarms. In ABC, the colony of artificial bees contains three groups of bees: employed, onlookers and scouts bees. Employed bees go to their food source and come back to hive and dance on this area. The employed bee whose food source has been abandoned becomes a scout and starts to search for finding a new food source. Onlookers watch the dances of employed bees and choose food sources depending on them. As the MDP is formulated as a multiobjective problem, we defined a new multiobjective algorithm in Gonza´lez-A´lvarez et al. (2011a) named MOABC. This algorithm incorporates several changes that allow its application in multiobjective optimization problems. The first modification applies the dominance concept (Deb, 2001) in the greedy selection functions used for improving the food sources of the employed and onlooker bees. However, the most important adaptation is made in the sorting process performed after each generation. This function selects the best food sources for the next generation. It applies two features of the standard NSGA-II (Deb et al., 2002) algorithm, the nondominated sort and the crowding distance. First, it ranks the colony by using the dominance concept previously mentioned (nondominated sorting function), situating on the top of the colony, the solutions that are not dominated by any other solution, behind them it will be the solutions dominated by only one solution, and so on. With the food sources sorted, we have to select the best ones. These selected food sources will be exploited by the employed bees in the next generation. However, it may happen that not all the solutions of the last selected Pareto front can be chosen for the next generation. In this case, we have to apply the crowding distance concept, calculating the corresponding value for all the
GSA is an heuristics recently introduced by Rashedi et al. (2009). It uses the Newtonian physics theory and its searcher agents are a collection of masses. In GSA, we have an isolated system of masses. Using the gravitational force, every mass in the system can see the situation of the other masses. All these objects attract each other by the gravity force, and this force causes a global movement of all objects towards the objects with heavier masses (better solutions). Hence, masses cooperate using a direct form of communication, through gravitational force. As the MDP is formulated as a multiobjective problem, we defined a new multiobjective algorithm in Gonza´lez-A´lvarez et al. (2011b) named MO-GSA. In this algorithm the most important modification is applied to assign the fitness value to the agents. In GSA, each individual must have a single fitness value so, if we want to design a multiobjective implementation, we can not consider the dominance concept. This new algorithm classifies the population into different Pareto fronts, ranking the individuals by using the Pareto front and the crowding distance concept from the NSGA-II algorithm. Then, it applies a linear bias br to the rth ranked element by using the expression: br ¼ 1=r, obtaining values from 1 to 1/N. Thus, we obtain a sorted population with a single fitness value. For further information about the MO-GSA consult (Gonza´lez-A´lvarez et al., 2011b). 4.4. Nondominated sorting genetic algorithm (NSGA-II) Srinivas and Deb (1995) developed the Nondominated Sorting Genetic Algorithm (NSGA) based on the classification of the population at various levels. In this algorithm, before the selection, we rank the population by using the dominance concept. All nondominated individuals are classified into a category with a dummy fitness proportional to the population size. In order to maintain the diversity of the population, these individuals are distributed according to their fitness, subject to a distribution parameter (sharing parameter). This classified group is removed of the population and the remaining individuals are reclassified by the same procedure. This process continues until all individuals in the population are classified. Since the first individuals are of best quality, they always get more copies than the rest of the population, allowing the search in nondominated regions. NSGA got promising results, however, was criticized for three main reasons: its computational complexity O(MN3), where M is
318
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
the number of objectives and N is the population size; its nonelitist operation, and the need to specify a distribution parameter. To repair these limitations, Deb et al. (2002) proposed an enhanced version of this method, called NSGA-II. It solves the previously described limitations through a fast nondominated sorting mechanism with a low computational complexity O(MN2), a selection operator used to combine the parent and child populations, and selecting the N best solutions taking into account their quality and their distribution in the Pareto front. Thanks to the good results achieved by this algorithm in a lot of multiobjective problems, it has become a standard multiobjective algorithm. In this work we have also developed it to have a point of reference on the results of our two proposals, MOABC and MOGSA. For further information about the NSGA/NSGA-II algorithms see Srinivas and Deb (1995) and Deb et al. (2002). 4.5. Strength Pareto evolutionary algorithm (SPEA2) Zitzler and Thiele (1999) presented another important algorithm, the SPEA. Zitzler and Thiele (1999) introduce a new evolutionary approach to multicriteria optimization that combines several features of previous multiobjective evolutionary algorithms in a unique manner. It is characterized by storing nondominated solutions externally in a second, continuously updated population, evaluating an individual fitness dependent on the number of external nondominated points that dominate it, preserving population diversity using the Pareto dominance relationship, and incorporating a clustering procedure in order to reduce the nondominated set without destroying its characteristics. Two years later the same authors proposed a second algorithm called SPEA2 (Zitzler et al., 2001). It eliminated the potential weaknesses of its predecessor and incorporated most recent results in order to design a powerful and up-to-date evolutionary multiobjective optimization algorithm. The main differences of SPEA2 in comparison to SPEA are an improved fitness assignment scheme is used which takes into account, for each individual, how many individuals it dominates and it is dominated by, a nearest neighbor density estimation technique is incorporated which allows a more precise guidance of the search process, and a new archive truncation method guarantees the preservation of boundary solutions. Thanks to these improvements, as with the NSGA-II algorithm, SPEA2 has become a key heuristic to propose new multiobjective techniques. To show the quality of the results obtained by our heuristics, we have also implemented this algorithm. For further information about the SPEA/SPEA2 algorithms see Zitzler and Thiele (1999) and Zitzler et al. (2001).
comparison of performances. The results are displayed using the average values of the hypervolumes among the 30 runs. The reference volume is calculated using the maximum values of each objective in each data set, for example, a data set with five sequences will have: Support¼5, Motif Length¼64, and Similarity¼1. The values of the objectives are normalized before calculating the corresponding hypervolumes. The experiments are organized taking into account the influence of each parameter in each algorithm. To compare the algorithms we have also used the coverage relation (Zitzler et al., 2000) that is useful to analyze which algorithms get the best Pareto fronts. All experiments were performed using a Pentium 4 (2.8 GHz) with 1 GB of RAM, and the algorithms were compiled using gcc with no optimization options. In our experimentation we have grouped the results into three benchmarks. In each benchmark, we have used six sequence data sets corresponding to alive beings, more concretely, from fly (those beginning by ‘dm’), from human (‘hm’), from mouse (‘mus’), and from yeast (‘yst’). In the first benchmark, we compare the behavior of the algorithms on ‘‘Generic’’ instances (those ending in ‘g’). This class of instances has the binding sites planted in randomly chosen genomic promoter sequences. In the second benchmark, we use ‘‘Real’’ instances (ending in ‘r’), where the binding sites are in their real genomic promoter sequences. Finally, the last benchmark is composed of ‘‘Markov’’ instances (ending in ‘m’), where the binding sites are planted in sequences randomly generated according to a Markov chain of order 3. These 18 sequence data sets are available in Tompa et al. (2005), and are selected from the TRANSFAC database (Wingender et al., 1996). Table 1 shows the properties of each data set. We have selected data sets with different number of sequences, with different sizes (nucleotides per sequence), and of different types (Generic, Real, and Markov), to ensure that our algorithms work well with several types of instances. By using data sets from different species and with different characteristics, we get strong algorithms that discover motifs in all types of biological data. The established runtimes are also shown (in seconds) in the last column of Table 1. These runtimes are directly related to the complexity of each instance. After carrying out a large number of experiments, we observed how our algorithms required around 20 s to find good solutions in the instances composed of less than 10 sequences, while they needed around 30 s to solve the instances composed of 10 or more sequences. In all our algorithms we have adjusted the value of each parameter to obtain the best configurations and so, the best Table 1 Data set properties. Data set
#Sequences
Size
#Nucleotides
Time (s)
dm04g hm16g mus03g mus07g yst01g yst09g
4 7 5 4 9 16
2000 3000 500 1500 1000 1000
8000 21 000 2500 6000 9000 16 000
20 20 20 20 20 30
hm03r hm15r mus02r mus05r yst04r yst05r
10 4 9 4 7 3
1500 2000 1000 500 1000 500
15 000 8000 9000 2000 7000 1500
30 20 20 20 20 20
dm03m hm04m hm26m mus11m yst03m yst10m
3 13 9 12 8 5
2000 2000 1000 500 500 1000
6000 26 000 9000 6000 4000 5000
20 30 20 30 20 20
5. Methodology In this section we explain the methodology followed to configure each algorithm, we detail the data sets used in our experiments, and we show the results obtained by our algorithms. In the following sections we will compare the results obtained by our algorithms with those obtained by several standard multiobjective algorithms, and with those obtained by other 14 well-known biological methods. We will also include a powerful statistical analysis of the results to ensure their statistical relevance. 5.1. Experimentation, data sets, and parameter settings For each experiment we have performed 30 independent runs to assure its statistical relevance. The results are measured using the hypervolume indicator (While et al., 2006) to facilitate the
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
results. The methodology followed to configure a parameter in our heuristics is always the same: first we establish, within the range of each parameter value, a minimum of five more or less equidistant values, for example, if we want to configure a parameter which value can be between 0% and 100%, we can test the following values: 10%, 25%, 50%, 75%, and 90%. After this, we run the experiments. As we can notice, we have not processed the values: 0% and 100%. If we use these values we are considering, for example, that there is no mutation or that it is always obligatory, so we believe that we should not try them. In the first algorithm (MOABC) we have configured four important parameters: the population size (PS), the mutation probability (F), the mutation shift (M), and the number of scout bees (SN) that we add in each generation. To obtain the best value for the first parameter (PS) we have performed experiments with the values: 50, 75, 100, 125, and 150. After analyzing the results, we obtain the best hypervolumes by using 150 individuals. So, in order to obtain more accurate conclusions, we performed more runs with 175, 200, and 225 individuals, concluding that with a PS¼200 the MOABC achieves better motifs. The same happened with the second parameter, the F, when we confirm that with a probability of 10% the algorithm got the best solutions, we also test the
319
values: 8%, 5%, 3%, and 1%, achieving a better performance with a F¼8%. Regarding M, we were testing values from a 10% increasing by 5% each time. We could conclude that with a 30% the algorithm performed better than with any other value. Finally, we analyzed the value of SN that is necessary to get the best discoveries. We have performed experiments with 1, 2, 3, 4, and 5 bees, concluding that with a single scout bee the algorithm performs better. To sum up, after performing and analyzing all these experiments we can conclude that the best parameter values for the MOABC algorithm are PS¼200, F ¼8%, M ¼30%, and SN¼1. In the second algorithm proposed in this work (MO-GSA), we have only adjusted the population size, performing the same tests as in the previous algorithm. Again, we have concluded that the algorithm discovers motifs of better quality with a population size of 200 individuals. The other algorithm parameters (G0, a, and e) were established with the values used by Rashedi et al. (2009). Finally, it is important to say that we have carried out the entire configuration process on the two standard multiobjective algorithms implemented: the NSGA-II and SPEA2 algorithms. In Table 2, we include the parameters configured and the best values used for each configured parameter of each algorithm.
6. Algorithm comparisons Table 2 Best found configuration for each algorithm. ´ lvarez et al., 2011a) ´lez-A Parameterization used in MOABC (Gonza Population size 200 Mutation probability 8% Mutation shift 30% of the maximum value Scout bees 1 Population ranking nonDominated sort þ CW ´ lvarez et al., 2011b) ´lez-A Parameterization used in MO-GSA (Gonza Population size 200 G0 100 a 20 e 0.01 Population ranking nonDominated sort þ Linear Bias Parameterization used in NSGA-II (Deb et al., 2002) Population size 200 Crossover SPX with probability 60% Mutation probability 50% Mutation shift 30% of the maximum value Parent choice Binary tournament Generation Selection Elitist Parameterization used in SPEA2 (Zitzler et al., 2001) Population size 200 Crossover SPX with probability 90% Mutation probability 90% Mutation shift 30% of the maximum value
As mentioned above, we have structured the comparisons into three groups, in the first one, we use ‘‘Generic’’ instances, the second one uses ‘‘Real’’ instances, and the last one applies ‘‘Markov’’ instances. Now we proceed to analyze the behavior of the algorithms with ‘‘Generic’’ instances. The first analysis was made by using the hypervolume indicator. The hypervolume indicator (While et al., 2006) is a measure of quality of a set P ¼ fpð1Þ ,pð2Þ , . . . ,pðnÞ g of n nondominated objective vectors produced in a run of a multiobjective optimizer. Assuming a minimization problem involving d objectives, this indicator consists of the measure of the region which is simultaneously dominated by P and bounded above by a reference point r A Rd such that r Zðmaxp p1, . . . ,maxp pd Þ, where p ¼ ðp1 , . . . ,pd Þ A P Rd , and the relation Z applies componentwise. In Fig. 3, we include the results (median and interquartile range) of the first group of instances. These instances have been sorted taking into account its complexity, thus the easier instances stay on the top of the table (higher hypervolumes), and the more complicated ones on the bottom (lower hypervolumes). Analyzing the numerical data of Fig. 3, we can see how the best performing algorithm in five of the six instances tested is MOABC. We also note how our second proposal (MO-GSA) does not perform well in the first two instances (the simplest ones). However, after the third instance,
Fig. 3. Numerical and graphical representation of the algorithm average hypervolumes on generic instances.
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
320
the MO-GSA improves significantly its results, obtaining the second best hypervolume value in the more complicated instance. In Fig. 3, we also include a graphical representation of the hypervolumes to better observe the algorithm behaviors. In this first set of instances, we can conclude that the algorithms that best results obtain in the first instances are SPEA2 and MOABC, and those which behave best in the more complex instances are MOABC and MO-GSA. To ensure that the differences among our algorithms are relevant, we have performed an exhaustive statistical analysis of the results obtained by each of them. The statistical study that we have carried out is the same as described by Sheskin (2007). First, we apply a Kolmogorov–Smirnov test for checking what type of distributions follow the results. If distributions are not Gaussian, we use the non-parametric Kruskal–Wallis test to compare the medians of the results. On the contrary, if the distributions are Gaussian, we perform a Levene test to check the homogeneity of variances. If the results present homogeneity of variances, we apply an ANOVA test, otherwise, the Kruskal–Wallis test is used. The results that overcome this exhaustive statistical analysis present significant differences. In all these tests we have considered a confidence level of 95%, i. e., p-value¼0.05. It is important to emphasize that we have applied the tests for each pair of algorithms, comparing the results obtained by each one with the results obtained by the other three algorithms separately. This whole process has been also repeated for each data set. The successful tests are indicated with a ‘ þ ’ in the result tables, and the non-successful tests are indicated with a ‘ ’. In the results shown in Fig. 3, we only find three non-successful cases of the 36 possible combinations. The first one occurs in the simplest instance, and the rest are given between the standard multiobjective evolutionary algorithms SPEA2 and NSGA-II.
After this first hypervolume analysis, we have also studied the motifs discovered by the algorithms (the nondominated solutions) using a second indicator, the coverage relation. This concept was presented by Zitzler et al. (2000) in their study in multiobjective optimization using evolutionary algorithms. Applying the dominance concept (Deb, 2001), the coverage relation considers that x1 covers x2 iff x1 gx2 or x1 ¼x2. The coverage relation is applied to all nondominated solutions obtained by the algorithms and it is used as a comparison criterion. The results of this last comparison are presented in Table 3. Analyzing the results, we see how the nondominated solutions discovered by our first proposal, the MOABC algorithm, cover the 94.53% of the NSGA-II solutions, and the 97.24% of the SPEA2 solutions. However, the nondominated solutions discovered by these two algorithms only cover the 14.83% and the 14.68% of the MOABC solutions, respectively. The same happens for the MO-GSA algorithm, while it is able to cover the 71.37% and the 76.40% of the solutions of NSGA-II and SPEA2, these two standard algorithms only cover the 23.46% and 23.84% of its nondominated solutions. If we compare the solutions discovered by our two proposals, we can conclude that the solutions found by MOABC are better than those discovered by the MO-GSA algorithm. Next, we compare the results of the algorithms with ‘‘Real’’ instances. In Fig. 4 we include the hypervolume comparison. In this second benchmark, we have also organized the instances by complexity, being hm15r the simplest instance, and hm03r the most complex one. As happened in the first benchmark, the algorithms that perform better in the first instances are MOABC and SPEA2. However, we see how this standard multiobjective algorithm (SPEA2), together with NSGA-II, suffers an important quality drop in its results when the instances increase in
Table 3 Coverage relation ðAkBÞ for generic instances. A
B
dm04g (%)
hm16g (%)
mus03g (%)
mus07g (%)
yst01g (%)
yst09g (%)
Mean (%)
MOABC
MO-GSA NSGA-II SPEA2 MOABC NSGA-II SPEA2 MOABC MO-GSA NSGA-II MOABC MO-GSA SPEA2
100.00 85.37 93.62 4.35 21.95 36.17 36.96 72.22 48.78 36.96 63.89 78.72
85.48 97.67 100.00 18.99 91.86 96.30 0.00 3.23 36.05 3.80 6.45 72.84
87.84 98.67 100.00 17.24 85.33 85.51 1.15 9.46 33.33 0.00 13.51 66.67
96.30 100.00 100.00 3.45 89.19 91.67 0.00 0.00 21.62 0.00 0.00 64.58
65.59 96.77 100.00 35.45 94.62 97.75 0.00 0.00 8.60 0.00 1.08 93.26
100.00 88.68 89.80 12.50 45.28 51.02 50.00 58.14 67.92 48.21 55.81 75.51
89.20% 94.53 97.24 15.33 71.37 76.40 14.68 23.84 36.05 14.83 23.46 75.26
MO-GSA
SPEA2
NSGA-II
Fig. 4. Numerical and graphical representation of the algorithm average hypervolumes on real instances.
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
321
Table 4 Coverage relation ðAkBÞ for real instances. A
B
hm03r (%)
hm15r (%)
mus02r (%)
mus05r (%)
yst04r (%)
yst05r (%)
Mean (%)
MOABC
MO-GSA NSGA-II SPEA2 MOABC NSGA-II SPEA2 MOABC MO-GSA NSGA-II MOABC MO-GSA SPEA2
16.88 98.31 96.72 100.00 71.19 65.57 31.17 43.55 67.80 24.68 45.16 68.85
88.33 100.00 100.00 22.22 100.00 98.25 0.00 0.00 52.50 0.00 0.00 50.88
87.18 100.00 98.28 14.75 80.65 98.28 0.00 5.13 9.68 3.28 15.38 89.66
95.00 96.30 87.88 8.33 48.15 36.36 33.33 55.00 66.67 22.22 60.00 42.42
92.31 100.00 91.30 0.00 74.19 47.83 11.43 34.62 87.10 5.71 30.77 26.09
100.00 87.88 90.91 19.44 21.21 15.15 63.89 100.00 72.73 69.44 92.00 72.73
93.80 97.08 94.18 13.61 65.90 60.24 23.30 39.72 59.41 20.89 40.55 58.44
MO-GSA
SPEA2
NSGA-II
Fig. 5. Numerical and graphical representation of the algorithm average hypervolumes on Markov instances.
complexity. Being the MO-GSA which best results achieves in the most complicated instance, surpassing by 9% and 14% the hypervolumes of SPEA2 and NSGA-II, respectively. In this second benchmark, we only have one non-successful statistical test of the 36 cases, presented between SPEA2 and NSGA-II in the mus05r instance. In Fig. 4 we also include a graphical representation of the algorithm behaviors. The coverage relation obtained by the algorithms in this benchmark is included in Table 4. We see how our two proposals cover a large percentage of the NSGA-II and SPEA2 non-dominated solutions, while they are only able to cover a 25.22% of the solutions at the most. In this second analysis, we can conclude that our proposals are the algorithms that behave better when the instance complexity is increased. The last benchmark analyzes the algorithm behaviors by using ‘‘Markov’’ instances. In Fig. 5, we show a numerical and graphical representation of the results of this last set of instances. If we focus on data, we obtain the same conclusions as in the first two benchmarks. We see how SPEA2 and MOABC are the best algorithms when we use the simplest instances, while MO-GSA is the worse one. However, if we calculate the hypervolume decrease obtained from the first instance to the last one, we can conclude that the algorithm that achieves a lower difference is MO-GSA, with a value of 31.243%. Second, the MOABC algorithm with a 34.811%. And finally, the SPEA2 algorithm (with a 40.430%) and NSGA-II (with 45.670%), respectively. Furthermore, MOABC and MO-GSA are again the algorithms with a better behavior in the more complex instances. In this benchmark, we have only one case without statistically relevant differences, between MOABC and SPEA2 in the dm03m instance. Finally, we include the coverage relation achieved by the algorithms in Table 5. In this table, we can note how, again, our two proposed swarm algorithms are able to cover a large percentage of the nondominated solutions of the standard multiobjective evolutionary algorithms SPEA2 and
NSGA-II. Also, if we compare the results of our two algorithms, MOABC presents a better average performance. In summary, from this complete comparative study we can obtain several conclusions. First of all, we can say that the MOABC algorithm achieves better results than the other algorithms in almost all instances. Second, we can note how the MO-GSA is an algorithm that presents a good scaling capability, i.e., as the instances are complicated, its results are better in relation to those achieved by the other algorithms. In this sense, in the result tables we see how, in the first instances, the standard multiobjective evolutionary algorithms (SPEA2 and NSGA-II) present higher hypervolumes than MO-GSA. However, in the last sequence data sets (the most complex ones) MO-GSA overcomes the results obtained by the SPEA2 and NSGA-II algorithms. For all these reasons, we can conclude that we have compared two swarm algorithms which allow us to obtain relevant results.
7. Comparison with previous works In this section we compare the results obtained by our two swarm-based algorithms with those achieved by our previously proposed metaheuristics. The first techniques that we applied to solve the MDP as a multiobjective optimization problem were Differential Evolution with Pareto Tournaments (DEPT, Gonza´lezA´lvarez et al., 2010a) and Multiobjective Variable Neighbourhood Search (MO-VNS, Gonza´lez-A´lvarez et al., 2010b). At present, after better understanding the biological aspects of the addressed problem (MDP), we have incorporated several improvements and constraints that allow us to better adapt its mathematical formulation to the real-world biological requirements, improving the quality of the discovered solutions (all these improvements are detailed in Section 3). As the two previously proposed
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
322
Table 5 Coverage relation ðAkBÞ for Markov instances. A
B
dm03m (%)
hm04m (%)
hm26m (%)
mus11m (%)
yst03m (%)
yst10m (%)
Mean (%)
MOABC
MO-GSA NSGA-II SPEA2 MOABC NSGA-II SPEA2 MOABC MO-GSA NSGA-II MOABC MO-GSA SPEA2
100.00 96.77 83.33 0.00 12.90 5.56 73.68 100.00 90.32 18.42 92.59 22.22
66.00 100.00 100.00 29.41 98.77 98.90 0.00 0.00 37.04 0.00 0.00 49.45
84.42 100.00 100.00 15.38 94.44 94.62 1.10 1.30 52.22 0.00 2.60 67.74
70.93 100.00 100.00 20.24 100.00 98.63 0.00 0.00 52.63 0.00 0.00 42.47
82.35 100.00 96.77 13.59 93.06 87.10 2.91 20.00 43.06 0.00 3.53 58.06
100.00 94.00 100.00 11.86 40.00 53.85 27.12 40.48 52.00 27.12 61.90 86.54
83.95 98.46 96.68 15.08 73.19 73.11 17.47 26.96 54.54 7.59 26.77 54.41
MO-GSA
SPEA2
NSGA-II
Table 6 Best found configuration parameters for our previous algorithms.
Table 8 Mean coverage relation ðAkBÞ among all our algorithms.
Parameterization used in DEPT (Gonza´lez-A´lvarez et al., 2010a) Population size 200 Crossover probability 25% Mutation factor 3% Selection scheme Rand/1/binomial Parameterization used in MO-VNS (Gonza´lez-A´lvarez et al., 2010b) Neighborhoods 5 Mutation probability 30%
Table 7 Hypervolume comparison between the two swarm-based algorithms (MOABC and MO-GSA) and our previous works (DEPT and MO-VNS). Instance
MOABC ~ IQR HV
dm04g hm16g mus03g mus07g yst01g yst09g
83:768% 81:554% 79:841% 89:207% 65:156% 47:920%
hm03r hm15r mus02r mus05r yst04r yst05r
60:184% 88:587% 63:583% 81:686% 75:545% 86:889%
0:026
dm03m mus11m hm04m hm26m yst03m yst10m
86:059% 51:248% 53:202% 64:434% 69:675% 66:674%
0:011
0:014 0:055 0:008 0:029 0:025 0:022
0:027 0:024 0:008 0:012 0:007
0:020 0:024 0:025 0:024 0:007
MO-GSA ~ IQR HV
DEPT ~ IQR HV
81:709% 0:010 77:205% 0:047 76:983% 0:028 82:844% 0:013 959:371% 0:024 46:497% 0:029
79:779% 78:421% 77:608% 80:014% 63:719% 48:705%
61:981% 80:235% 61:905% 78:075% 71:140% 81:294%
58:032% 80:287% 63:017% 77:559% 73:500% 80:759%
80:749% 49:506% 53:031% 58:888% 63:250% 63:395%
0:033 0:025 0:032 0:013 0:032 0:020 0:021 0:031 0:035 0:031 0:042 0:010
81:376% 51:854% 53:896% 63:685% 65:258% 63:495%
MO-VNS ~ IQR HV 0:018 0:044 0:016 0:034 0:014 0:056 0:030 0:022 0:019 0:019 0:020 0:021 0:017 0:039 0:030 0:016 0:012 0:014
82:897% 70:355% 77:394% 89:081% 61:478% 45:741% 58:751% 85:225% 59:340% 81:446% 72:878% 86:739% 85:560% 49:542% 52:205% 60:978% 65:031% 66:080%
0:008 0:026 0:018 0:015 0:021 0:037 0:014 0:016 0:025 0:006 0:028 0:005 0:007 0:031 0:022 0:025 0:026 0:005
metaheuristics (DEPT and MO-VNS) did not include these improvements in their corresponding references, we have incorporated them to both algorithms in order to compare their results with those achieved by the two swarm-based algorithms proposed in this work. Their configuration settings are described in Table 6, and the experimental methodology followed is the same as described in Section 5, conducting 30 independent runs and measuring the quality of the results with the same multiobjective indicators: hypervolume (While et al., 2006) and coverage relation (Zitzler et al., 2000). The results of the first comparison are shown in Table 7. As we can see, the results achieved by our best swarm-based algorithm (MOABC) are better than those obtained by our best previously proposed algorithm (DEPT), achieving the highest hypervolumes in 14 of the 18 solved instances. Regarding our
A
B
Generic (%)
Real (%)
Markov (%)
Global (%)
MOABC
DEPT MO-GSA MO-VNS
63.12 90.40 91.16
58.84 88.69 94.75
53.99 83.95 89.27
58.65 87.68 91.73
MO-GSA
DEPT MOABC MO-VNS
14.82 14.58 53.98
12.55 12.40 52.63
17.95 15.08 54.27
15.11 14.02 53.62
DEPT
MOABC MO-GSA MO-VNS
55.26 83.28 78.59
48.66 83.34 86.09
53.50 81.09 74.72
52.47 82.57 79.80
MO-VNS
DEPT MOABC MO-GSA
29.44 26.16 47.55
18.81 12.80 55.69
30.84 23.49 51.07
26.36 20.82 51.44
second swarm-based algorithm (MO-GSA), although it achieves results slightly lower than DEPT, it achieves similar hypervolumes to MO-VNS. In fact, if we compare the results of these two metaheuristics (MO-GSA and MO-VNS) by using the second multiobjective indicator: coverage relation (see Table 8); we note how the MO-GSA algorithm covers more MO-VNS nondominated solutions than vice versa, covering an average percentage of 53.62% of the solutions of the MO-VNS, while this latter covers a 51.44% of the MOGSA nondominated solutions. In addition, if we compare the coverages obtained by both MOABC and DEPT, we see how the MOABC algorithm covers the 58.65% of the DEPT nondominated solutions, while DEPT achieves an average coverage of 52.47% of MOABC solutions, confirming the conclusions drawn from hypervolumes. Thus, we have proposed two novel swarm-based algorithms that achieve quality solutions, overcoming those obtained by two previously proposed algorithms based on two classical algorithms such as Differential Evolution and Variable Neighbourhood Search.
8. Comparison with other biological methods In this section we analyze the motifs obtained by our two proposals, MOABC and MO-GSA algorithms. To that end we have compared the best motifs (nondominated solutions) discovered by both heuristics with the best solutions predicted by 14 well-known biological methods in motif discovery. Thus, we demonstrate that the motifs predicted by MOABC and MO-GSA have an important biological relevance. The biological methods compared in this section are AlignACE (Roth et al., 1998), ANN_Spec (Workman and Stormo, 2000), Consensus (Hertz and Stormo, 1999), GLAM (Frith et al., 2004), Improbizer (Ao et al., 2004), MEME (Bailey and Elkan, 1995), MEME3 (Bailey and Elkan, 1995), MITRA (Eskin and Pevzner, 2002), MotifSampler (Thijs et al., 2001), oligo/dyad-analysis (van Helden et al.,
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
323
Table 9 Comparison of sensitivity (nSn), positive predictive value (nPPV), Performance (nPC), and correlation (nCC) coefficients. Data set
Best method
Second best method
MOABC
MO-GSA
(a) dm04g hm16g mus03g mus07g yst01g yst09g
nSn results MotifSampler – AlignACE ANN_Spec Improbizer Improbizer
0.022222 0.000000 0.281690 0.040000 0.139344 0.497674
ANN_Spec – MEME – Weeder AlignACE
0.022222 0.000000 0.133803 0.000000 0.098360 0.400000
0.370370 0.390244 0.500000 0.56000 0.393443 0.158140
0.473118 0.335366 0.457746 0.550000 0.352459 0.213953
hm03r hm15r mus02r mus05r yst04r yst05r
MEME ANN_Spec MEME MEME3 Consensus SeSiMCMC
0.063726 0.055555 0.094828 0.386363 0.335878 0.986111
SeSiMCMC – ANN_Spec MEME MotifSampler MotifSampler
0.039216 0.000000 0.030172 0.181818 0.320611 0.694444
0.274510 0.577778 0.323276 0.500000 0.551402 0.888889
0.289216 0.455556 0.275862 0.454545 0.62617 0.805556
dm03m hm04m hm26m mus11m yst03m yst10m
SeSiMCMC AlignACE SeSiMCMC Improbizer Improbizer –
0.105769 0.005952 0.085020 0.369668 0.340136 0.000000
Weeder QuickScore MITRA YMF ANN_Spec –
0.076923 0.005952 0.052631 0.298578 0.312925 0.000000
0.461538 0.321429 0.368421 0.440758 0.272109 0.000000
0.346154 0.291667 0.291498 0.298578 0.272109 0.000000
(b) dm04g hm16g mus03g mus07g yst01g yst09g
nPPV results MotifSampler – AlignACE ANN_Spec Weeder Consensus
0.032967 0.000000 0.256410 0.020942 0.200000 0.857142
ANN_Spec – oligodyad – Improbizer YMF
0.023438 0.000000 0.250000 0.000000 0.094444 0.857142
0.600000 0.585859 0.770115 0.978261 0.609375 0.428571
0.333333 0.517544 0.846154 1.000000 0.552941 0.414286
hm03r hm15r mus02r mus05r yst04r yst05r
MEME ANN_Spec MEME MEME3 MITRA Weeder
0.108333 0.049019 0.142857 0.222222 0.357143 1.000000
SeSiMCMC – YMF QuickScore Consensus MotifSampler
0.051613 0.000000 0.059524 0.206060 0.338462 0.943396
0.450000 0.826923 0.722222 0.921053 0.510000 0.950000
0.332046 0.533333 0.580952 0.812500 0.638889 0.960784
dm03m hm04m hm26m mus11m yst03m yst10m
Weeder AlignACE MITRA Weeder YMF –
0.200000 0.006061 0.108333 0.900000 0.700000 0.000000
SeSiMCMC QuickScore MotifSampler Consensus oligodyad –
0.061111 0.001053 0.101694 0.767857 0.676923 0.000000
0.605263 0.318182 0.720588 0.550000 0.785714 0.000000
0.562500 0.325397 0.739583 0.437908 0.520000 0.000000
(c) dm04g hm16g mus03g mus07g yst01g yst09g
nPC results MotifSampler – AlignACE ANN_Spec Weeder Consensus
0.013453 0.000000 0.155039 0.013937 0.070588 0.316176
ANN_Spec – MEME – Improbizer MEME3
0.011539 0.000000 0.087156 0.000000 0.059649 0.2873563
0.295181 0.290909 0.408284 0.465517 0.231638 0.110749
0.191083 0.243363 0.352601 0.361702 0.229885 0.142415
hm03r hm15r mus02r mus05r yst04r yst05r
MEME ANN_Spec MEME MEME3 Consensus MotifSampler
0.041801 0.026738 0.060440 0.155251 0.202765 0.666667
SeSiMCMC – YMF QuickScore MotifSampler SeSiMCMC
0.022792 0.000000 0.016077 0.092307 0.189189 0.489655
0.208955 0.407407 0.214058 0.416667 0.390244 0.776316
0.22264 0.285714 0.195122 0.350877 0.343590 0.602740
dm03m hm04m hm26m mus11m yst03m yst10m
Weeder AlignACE MITRA Weeder oligodyad –
0.058823 0.003012 0.038181 0.227436 0.261905 0.000000
SeSiMCMC QuickScore MITRA MEME3 YMF/QuickScore –
0.040293 0.000895 0.036723 0.223484 0.254546 0.000000
0.323077 0.183099 0.255245 0.329167 0.198925 0.000000
0.248120 0.160656 0.223642 0.212454 0.186047 0.000000
(d) dm04g hm16g mus03g mus07g yst01g yst09g
nCC results MotifSampler MEME AlignACE ANN_Spec Weeder AlignACE
0.013401 -0.005204 0.222480 0.006056 0.132087 0.484936
ANN_Spec MITRA MEME MotifSampler Improbizer MEME3
0.006497 0.005755 0.122962 0.009229 0.099930 0.459744
0.468291 0.452594 0.578040 0.640559 0.377776 0.198555
0.351703 0.393088 0.512006 0.524117 0.370198 0.246916
hm03r hm15r mus02r mus05r yst04r yst05r
MEME ANN_Spec MEME MEME3 Consensus MotifSampler
0.063601 0.040697 0.097480 0.236944 0.322430 0.801641
SeSiMCMC MITRA & Weeder YMF QuickScore MotifSampler SeSiMCMC
0.021802 0.00756 0.020669 0.144754 0.302854 0.678464
0.344511 0.585253 0.37074 0.613003 0.575254 0.870302
0.36411 0.484942 0.317711 0.506296 0.511442 0.764935
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
324
Table 9 (continued ) Data set
Best method
dm03m hm04m hm26m mus11m yst03m yst10m
Weeder AlignACE MITRA Weeder oligodyad –
Second best method 0.114662 0.000400 0.057555 0.430395 0.437304 0.000000
SeSiMCMC MotifSampler SeSiMCMC MEME YMF/QuickScore –
0.058990 0.004508 0.044203 0.386085 0.435017 0.000000
MOABC
MO-GSA
0.492483 0.305036 0.428998 0.511817 0.356815 0.000000
0.403162 0.272252 0.385856 0.347611 0.335172 0.000000
Table 10 Biological information of some of our best predictions. Data set
Value
Sup.
Leng.
Sim.
Comp.
s1
s2
s3
s4
s5
s6
s7
s8
s9
s10
mus07g hm03r mus02r yst04r yst05r yst03m
0.560 0.223 0.371 0.626 0.776 0.905
3 6 5 5 3 3
27 40 15 31 21 7
0.593 0.621 0.600 0.613 0.746 0.905
0.904 0.713 0.987 0.937 0.915 1.000
43 47 141 466 220 276
169 49 N/A N/A 342 123
N/A 51 164 65 204 N/A
1171 903 467 691
N/A N/A 490
1142 312 723
N/A 292 N/A
N/A N/A
798 N/A
N/A
N/A
N/A
50
N/A
N/A
Sup. is Support, Leng. is Length, Sim. is similarity, and Comp. is complexity.
1998, 2000), QuickScore (Regnier and Denise, 2004), SeSiMCMC (Favorov et al., 2005), Weeder (Pavesi et al., 2001), and YMF (Sinha and Tompa, 2003). Short descriptions of them are provided in Tompa et al. (2005). All the listed biological methods need similar runtimes to make good predictions, these times range from 20 to 60 s (Kaya, 2009). As we included in Table 1, we have established runtimes within that range for our algorithms to make possible a fair comparison of the discovered results. Therefore, we can conclude that the computational cost of our algorithms and the biological methods is similar. For each method M and each data set D, we have the set of known binding sites and the set of predicted binding sites. The correctness of M and D can be assessed at the nucleotide level, defining four statistical/biological indicators such as: the Sensitivity (nSn) also called recall rate in some fields, which measures the proportion of actual positives which are correctly identified as such: nSn ¼
nTP ðnTP þ nFNÞ
ð3Þ
The Positive Predictive Value (nPPV) also called precision rate is the proportion of real positives which are correctly identified: nPPV ¼
nTP ðnTP þ nFPÞ
ð4Þ
The Performance Coefficient (nPC): nPC ¼
nTP ðnTP þ nFN þ nFPÞ
ð5Þ
And finally, the Correlation Coefficient (nCC) which is the Pearson product-moment coefficient of correlation in the particular case of two binary variables. The value of nCC ranges from 1 (indicating perfect anticorrelation) to þ1 (indicating perfect correlation): nTP nTNnFN nFP nCC ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi PN NP PP NN
ð6Þ
where PN ¼ nTP þ nFN, NP ¼ nTN þnFP, PP ¼ nTP þ nFP, and NN ¼ nTN þ nFN. The values of these four biological indicators range from 1 (indicating the worst prediction) to þ1 (indicating the best prediction). Thus, if the predicted motifs exactly coincide with the known
binding sites, all indicators will be þ1. This means that the higher the values of the biological indicators the better predictions. For further information about these coefficients see Tompa et al. (2005). Table 9 shows the results of comparing our proposals with the previously defined biological methods by using these four statistical/ biological indicators: nSn, nPPV, nPC, and nCC. To facilitate comparisons between the best predictions of our algorithms and the best predictions of the 14 biological methods we have selected, for each data set, the two tools that achieve the best results for each indicator. For example, the methods that better Sensitivities (nSn) obtain for the dm04g data set are MotifSampler and ANN_Spec with 0.022. Thus, we facilitate the understanding of all this amount of biological information. By using this methodology, we can make statements like: ‘as our heuristics gets better results than the best biological method, we can say that it achieves better results than the 14 tools defined in this section’. In Table 9(a)–(d) we include in the second and the third columns the best method, and the corresponding indicator value. In the fourth and the fifth columns, we show the second best method and its result for each instance, and finally, we include the results of our algorithms in the last columns. That being said, we can proceed to analyze the results obtained. In Table 9(a) we see how only in yst09g, yst05r and yst03m (3 data sets out of 18), our heuristics fail to overcome the results of the best biological method. The same happens for the nPC and nCC indicators. If we compare Table 9(a)–(d) with Figs. 3–5, we can notice how in the data set hm03r, where MO-GSA got the best hypervolumes, it also gets the best predictions. It is worth noting that the yeast data sets (‘yst’) are the most worked on a biological level, and thus, our algorithm obtains a minor difference with respect to the other methods. These results demonstrate that, in addition to obtain good results at computer science level (through indicators such as hypervolume or coverage relation), we get very relevant biological motifs. It is also important to notice that the best results are usually obtained by the same biological method in each data set, e.g., for dm04g the best results are always obtained by MotifSampler, or for hm03r the best results are always achieved by MEME. However, our algorithms achieve very good results in all data sets, regardless of the species (fly, human, mouse, or yeast) and the nature of the instance (Generic, Real, or Markov). It makes possible to expect that our multiobjective versions of ABC and GSA can also obtain good results for genomes of other different beings.
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
Finally, in Table 10 we include all the necessary information to compose some of the better predictions discovered by our algorithms, so any researcher can verify that the information displayed in this work is true.
9. Conclusions and future work In this paper, we have compared two novel swarm intelligence based algorithms: ABC and GSA to solve the MDP. Moreover we have adapted these algorithms to the multiobjective context, resulting in two new algorithms named MOABC and MO-GSA. This work differs from previous approaches to MDP, because our new constraints focuses on real-world aspects of biology, e.g., the motif complexity concept. In this work, we have also combined computational and biological aspects, demonstrating through several indicators and statistics that the results obtained are relevant in both fields. To summarize, some of the contributions of this work are the formulation presented and applied to solve the MDP, and the comparison made among our algorithms (MOABC and MO-GSA) and other standard multiobjective algorithms (SPEA2 and NSGA-II) and also 14 well-known biological methods, where we analyze the behavior of each of them in three benchmarks composed of six instances of different nature. As we have seen, MOABC is the algorithm that best results have obtained in most instances. Evaluating new algorithms for this problem is a matter of future work. In particular, we have planned comparisons with other known swarm-based algorithms such as Particle Swarm Optimization (PSO), since this is a clear example of swarm algorithm which has received a lot of attention in the literature. Furthermore, we will also investigate the application of parallel or distributed techniques for solving the MDP.
Acknowledgments This work was partially funded by the Spanish Ministry of Science and Innovation and ERDF (the European Regional Development Fund), under the contract TIN2008-06491-C04-04 (the Mn project). Thanks also to the Fundacio´n Valhondo, for the economic support offered to David L. Gonza´lez-A´lvarez to make this research. References Ao, W., Gaudet, J., Kent, W.J., Muttumu, S., Mango, S.E., 2004. Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science 305 (5691), 1743–1746. Bailey, T.L., Elkan, C., 1995. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach. Learn. 21 (1–2), 51–80. Che, Y., Song, D., Rashedd, K., 2005. MDGA: Motif discovery using a genetic algorithm. In: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO’05), pp. 447–452. Deb, K., 2001. Multi-objective Optimization Using Evolutionary Algorithms. John Wiley & Sons. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T., 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evolut. Comput. 6, 182–197. D’haeseleer, P., 2006. What are DNA sequence motifs? Nat. Biotechnol. 24 (4), 423–425. Eskin, E., Pevzner, P.A., 2002. Finding composite regulatory patterns in DNA sequences. Bioinformatics 18 (Suppl 1), S354–S363. Favorov, A.V., Gelfand, M.S., Gerasimova, A.V., Ravcheev, D.A., Mironov, A.A., Makeev, V.J., 2005. A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics 21 (10), 2240–2245. Fogel, G.B., Weekes, D.G., Varga, G., Dow, E.R., Harlow, H.B., Onyia, J.E., Su, C., 2004. Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucl. Acids Res. 32 (13), 3826–3835. Fogel, G.B., Porto, V.W., Varga, G., Dow, E.R., Craven, A.M., Powers, D.M., Harlow, H.B., Su, E.W., Onyia, J.E., Su, C., 2008. Evolutionary computation for discovery of composite transcription factor binding sites. Nucl. Acids Res. 36 (21), e142. Frith, M.C., Hansen, U., Spouge, J.L., Weng, Z., 2004. Finding functional sequence elements by multiple local alignment. Nucl. Acids Res. 32 (1), 189–200. Glover, F., Kochenberger, G., 2003. Handbook of Metaheuristics. Kluwer Academic Publishers.
325
Gonza´lez-A´lvarez, D.L., Vega-Rodrı´guez, M.A., Go´mez-Pulido, J.A., Sa´nchez-Pe´rez, J.M., 2010a. Solving the motif discovery problem by using differential evolution with Pareto tournaments. In: Proceedings of the 2010 IEEE Congress on Evolutionary Computation (CEC’10), pp. 4140–4147. Gonza´lez-A´lvarez, D.L., Vega-Rodrı´guez, M.A., Go´mez-Pulido, J.A., Sa´nchez-Pe´rez, J.M., 2010b. A multiobjective variable neighborhood search for solving the motif discovery problem. In: International Workshop on Soft Computing Models in Industrial Applications (SOCO’10), vol. 73, pp. 39–46. Gonza´lez-A´lvarez, D.L., Vega-Rodrı´guez, M.A., Go´mez-Pulido, J.A., Sa´nchez-Pe´rez, J.M., 2011a. Finding motifs in DNA sequences applying a multiobjective artificial bee colony (MOABC) algorithm. In: Proceedings of 8th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EVOBIO’11), Lecture Notes in Computer Science, vol. 6623/ 2011, pp. 89–100. Gonza´lez-A´lvarez, D.L., Vega-Rodrı´guez, M.A., Go´mez-Pulido, J.A., Sa´nchez-Pe´rez, J.M., 2011b. Applying a multiobjective gravitational search algorithm (MO-GSA) to discover motifs. In: International Work Conference on Artificial Neural Networks (IWANN’11), Lecture Notes in Computer Science, vol. 6692/2011, pp. 372–379. Gonza´lez-A´lvarez, D.L., Vega-Rodrı´guez, M.A., Go´mez-Pulido, J.A., Sa´nchez-Pe´rez, J.M., 2011c. Predicting DNA motifs by using evolutionary multiobjective optimization. IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev., 1–13. (on-line available since December 2011). Hertz, G.Z., Stormo, G.D., 1999. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15 (7-8), 563–577. Hertz, G.Z., Hartzell, G.W., Stormo, G.D., 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6 (2), 81–92. Karaboga, D., 2005. An Idea Based on Honey Bee Swarm for Numerical Optimization. Technical Report-tr06, Erciyes University, Turkey. Kaya, M., 2009. MOGAMOD: Multi-objective genetic algorithm for motif discovery. Expert Syst. Appl. 36 (2), 1039–1047. Liu, F.F.M., Tsai, J.J.P., Chen, R.M., Chen, S.N., Shih, S.H., 2004. FMGA: finding motifs by genetic algorithm. In: Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE’04), pp. 459–466. Maier, D., 1978. The complexity of some problems on subsequences and supersequences. J. ACM 25 (2), 322–336. Mladenovic, N., Hansen, P., 1997. Variable neighborhood search. Comput. Oper. Res. 24, 1097–1100. Neuwald, A.F., Liu, J.S., Lawrence, C.E., 1995. Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci. 4 (8), 1618–1632. Notredame, C., Higgins, D.G., 1996. SAGA: sequence alignment by genetic algorithm. Nucl. Acids Res. 24 (8), 1515–1524. Pavesi, G., Mauri, G., Pesole, G., 2001. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17 (Suppl 1), S207–S214. Rashedi, E., Nezamabadi-pour, H., Saryazdi, S., 2009. GSA: a gravitational search algorithm. Inf. Sci. 179 (13), 2232–2248. Regnier, M., Denise, A., 2004. Rare events and conditional events on random strings. Discrete Math. Theor. Comput. Sci. 6, 191–214. Rivie re, R., Barth, D., Cohen, J., Denise, A., 2008. Shuffling biological sequences with motif constraints. J. Discrete Algorithms 6 (2), 192–204. Roth, F.P., Hughes, J.D., Estep, P.W., Church, G.M., 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16 (10), 939–945. Shao, L., Chen, Y., 2009. Bacterial foraging optimization algorithm integrating tabu search for motif discovery. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM’09), pp. 415–418. Shao, L., Chen, Y., Abraham, A., 2009. Motif discovery using evolutionary algorithms. In: International Conference of Soft Computing and Pattern Recognition (SOCPAR’09), pp. 420–425. Sheskin, D.J., 2007. Handbook of Parametric and Nonparametric Statistical Procedures, 4th ed. Chapman & Hall/CRC Press, New York. Sinha, S., Tompa, M., 2003. YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucl. Acids Res. 31 (13), 3586–3588. Srinivas, N., Deb, K., 1995. Multi-objective function optimization using nondominated sorting genetic algorithms. Evolut. Comput. 2 (3), 221–248. Stine, M., Dasgupta, D., Mukatira, S., 2003. Motif discovery in upstream sequences of coordinately expressed genes. In: The 2003 Congress on Evolutionary Computation (CEC’03), vol. 3, pp. 1596–1603. Storn, R., Price, K., 1997. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11 (4), 341–359. Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B., Rouze´, P., Moreau, Y., 2001. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17 (12), 1113–1122. Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Regnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, C., Workman, Z., Ye, C., Zhu, Z., 2005. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23 (1), 137–144. van Helden, J., Andre, B., Collado-Vides, J., 1998. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281 (5), 827–842. van Helden, J., Rios, A.F., Collado-Vides, J., 2000. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucl. Acids Res. 28 (8), 1808–1818.
326
´ lvarez et al. / Engineering Applications of Artificial Intelligence 26 (2013) 314–326 ´lez-A D.L. Gonza
While, L., Hingston, P., Barone, L., Huband, S., 2006. A faster algorithm for calculating hypervolume. IEEE Trans. Evolut. Comput. 10 (1), 29–38. Wingender, E., Dietze, P., Karas, H., Knuppel, R., 1996. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucl. Acids Res. 24 (1), 238–241. Workman, C.T., Stormo, G.D., 2000. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. In: Pacific Symposium on Biocomputing, pp. 467–478. Zare-Mirakabad, F., Ahrabian, H., Sadeghi, M., Hashemifar, S., Nowzari-Dalini, A., Goliaei, B., 2009. Genetic algorithm for dyad pattern finding in DNA sequences. Genes Genet. Syst. 84 (1), 81–93.
Zitzler, E., Thiele, L., 1999. Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans. Evolut. Comput. 3 (4), 257–271. Zitzler, E., Deb, K., Thiele, L., 2000. Comparison of multiobjective evolutionary algorithms: empirical results. Evolut. Comput. 8 (2), 173–195. Zitzler, E., Laumanns, M., Thiele, L., 2001. SPEA2: Improving the Strength Pareto Evolutionary Algorithm. Technical Report tik-report 103, Swiss Federal Institute of Technology Zurich, Switzerland.