Hybrid multiobjective artificial bee colony for multiple sequence alignment

Hybrid multiobjective artificial bee colony for multiple sequence alignment

Accepted Manuscript Title: Hybrid Multiobjective Artificial Bee Colony for Multiple Sequence Alignment ´ Author: Alvaro Rubio-Largo Miguel A. Vega-Rod...

1MB Sizes 0 Downloads 95 Views

Accepted Manuscript Title: Hybrid Multiobjective Artificial Bee Colony for Multiple Sequence Alignment ´ Author: Alvaro Rubio-Largo Miguel A. Vega-Rodr´ıguez ´ David L. Gonz´alez-Alvarez PII: DOI: Reference:

S1568-4946(15)00813-3 http://dx.doi.org/doi:10.1016/j.asoc.2015.12.034 ASOC 3391

To appear in:

Applied Soft Computing

Received date: Revised date: Accepted date:

28-7-2015 11-12-2015 21-12-2015

´ Please cite this article as: Alvaro Rubio-Largo, Miguel A. Vega-Rodr´iguez, David ´ L. Gonz´alez-Alvarez, Hybrid Multiobjective Artificial Bee Colony for Multiple Sequence Alignment, (2016), http://dx.doi.org/10.1016/j.asoc.2015.12.034 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights: We propose the use of multiobjective optimization and evolutionary computation jointly for solving the Multiple Sequence Alignment problem.

-

We consider the optimization of two of the most widely-used objective functions in the literature: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of Totally Conserved (TC) columns score.

-

A Hybrid swarm intelligence algorithm and a traditional aligner to obtain more accurate alignments.

-

A comparative study between our proposal and 13 well-known aligners in the bioinformatics field.

Ac

ce pt

ed

M

an

us

cr

ip t

-

Page 1 of 28

Hybrid Multiobjective Artificial Bee Colony for Multiple Sequence Alignment b ´ ´ Alvaro Rubio-Largob, Miguel A. Vega-Rodr´ıguezb, David L. Gonz´ alez-Alvarez a NOVA

Information Management School, University Nova of Lisbon (Portugal) of Computer and Communications Technologies, University of Extremadura (Spain)

ip t

b Department

cr

Abstract

us

In the bioinformatics community, it is really important to find an accurate and simultaneous alignment among diverse biological sequences which are assumed to have an evolutionary relationship. From the alignment, the sequences homology is inferred and the shared evolutionary origins among the sequences

an

are extracted by using phylogenetic analysis. This problem is known as the Multiple Sequence Alignment (MSA) problem. In the literature, several approaches have been proposed to solve the MSA problem, such as progressive alignments methods, consistency-based algorithms, or genetic algorithms (GAs). In this work, we

M

propose a Hybrid Multiobjective Evolutionary Algorithm based on the behaviour of honey bees for solving the MSA problem, the Hybrid Multiobjective Artificial Bee Colony (HMOABC) algorithm. HMOABC considers two objective functions with the aim of preserving the quality and consistency of the alignment:

d

the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of Totally Conserved

Ac ce pt e

(TC) columns score. In order to assess the accuracy of HMOABC, we have used the BAliBASE benchmark (version 3.0), which according to the developers presents more challenging test cases representing the real problems encountered when aligning large sets of complex sequences. Our multiobjective approach has been compared with 13 well-known methods in Bioinformatics field and with other 6 evolutionary algorithms published in the literature.

Keywords: Artificial bee colony, multiobjective optimization, multiple sequence alignment, evolutionary computation, swarm intelligence, bioinformatics

1. Introduction

Any living species is represented by its biological sequence and; therefore, an accurate alignment among several biological sequences is critical for finding an evolutionary relationship among different species [1] [2]. This problem is known in the literature as the Multiple Sequence Alignment (MSA) problem [3]. However, ´ Email addresses: [email protected] (Alvaro Rubio-Largo), [email protected] (Miguel A. Vega-Rodr´ıguez), [email protected] (David ´ L. Gonz´ alez-Alvarez) Preprint submitted to Elsevier

December 11, 2015

Page 2 of 28

MSAs not only allow us to infer phylogenetic relationships among living species, but also they can provide biological facts about proteins - most conserved regions are biologically significant [4]. Furthermore, an accurate MSA is highly valuable in the formulation and test hypotheses about protein 3-D structure and function, that is to say, it helps us to detect which regions of a gene are susceptible to mutation and which

ip t

can have one residue replaced by another without changing the function. The natural formulation of the MSA problem, in computational terms, is to define a model of sequence evolution that assigns probabilities to all possible elementary sequence edits and then to seek an optimal

cr

directed graph in which edges represents edits and terminal nodes represents the observed sequences [5]. Unfortunately, in biologically realistic models it is not possible to determining an optimal directed graph;

alignment score (SP score) between each pair of sequence.

us

therefore, we need to turn to approximate heuristics. A well-known heuristic is to optimize the sum of

The MSA problem may be defined as an NP-hard optimization problem [6] which can be solved by

an

using dynamic programming with a time and space complexity of O(k2k Lk ) [7] when aligning k sequences of length L. Although the use of dynamic programming guarantees mathematically optimal alignments, the problem space increases significantly with the number of sequences and with the length. In order to

M

overcome this drawback, several heuristics have been proposed in the literature. We can classify them into two main categories: progressive and iterative alignments.

Progressive alignment is the most widely used technique for multiple sequence alignment in the literature.

d

It basically starts aligning the closest evolutionary sequences and after that, continues with the more distant

Ac ce pt e

ones until all the sequences are aligned. This method presents the advantage of being simple and very fast; however, a certain level of accuracy is not guaranteed. In this way, we can highlight that the main disadvantage of this method is that it can be trapped in suboptimal alignments. Among the main multiple sequence aligners published in the literature that make use of progressive alignment are Clustal W [8], or Clustal Ω [9], Tree-based Consistency Objective Function For alignment Evaluation (T-Coffee) [10], PRANK [11], Fast Statistical Alignment (FSA) [12], or Kalign [13]. The iterative alignment techniques make use of one method to produce an initial alignment (such a progressive method) and then refine this initial alignment by performing diverse iterations until a given stopping criterion. The main idea behind this technique is therefore to consider the initial alignment as suboptimal and then refine it until no further improvements can be achieved. In the literature we find several approaches that takes the advantage of performing an iterative refinement in order to obtain more accurate alignments, among the main ones are MUltiple Sequence Comparison by Log-Expectation (MUSCLE) [5], Multiple Alignment using Fast Fourier Transform (MAFFT) [14], PROBabilistic CONSistency-based multiple sequence alignment (ProbCons) [15], MSAProbs [16], or MUMMALS [17]. Genetic algorithms and evolutionary computation have also been considered for solving the multiple sequence alignment problem, we find diverse genetic algorithms (GA) in the literature: Sequence Alignment by Genetic Algorithm (SAGA) 2

Page 3 of 28

[18], Multiple Sequence Alignment Genetic Algorithm (MSA-GA) [19], Rubber Band Technique Genetic Algorithm (RBT-GA) [20], Vertical Decomposition Genetic Algorithm (VDGA) [21], Genetic Algorithm for multiple sequence alignment using Progressive Alignment Method (GAPAM) [22], Multiobjective Optimizer for Sequence Alignments based on Structural Evaluations (MO-SAStrE) [23]. In addition, we find other

ip t

single-objective approaches based on swarm intelligence, such as Artificial Bee Colony (ABC) [24] [25], Ant Colony Optimization algorithm (ACO) [26] [27], or Immune Artificial System Algorithm (IMSA) [28].

In the last years some efforts were done on incorporating structural information for obtaining more

cr

accurate alignments. Basically, these methods use Protein Data Bank (PDB) structures as template in order to guide the alignment of a given set of unaligned sequences using structure-based sequence alignment

us

methods, two examples of structural-based methods are 3D-COFFEE [29] and MO-SAStrE. The main drawback of these methods is the limited availability of PDB structures.

One of the main contributions of this work is to use multiobjective evolutionary computation to solve the

an

MSA problem. In the literature, we find evolutionary approaches that optimize the sum-of-pairs function (SAGA [18], MSA-GA [19], RBT-GA [20], VDGA [21], GAPAM [22], ABC [24] [25], ACO [27], or IMSA [28]) or the column score (RBT-GA [20], ACO [26]). In [30], a multiobjective evolutionary algorithm was

M

implemented with the aim of assembling previously aligned sequences, trying to optimize jointly the sumof-pairs function and the column score.

In this work, we also optimize at the same time two of the most widely-used objective functions in the

d

literature: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of Totally

Ac ce pt e

Conserved (TC) columns score. Therefore, each objective function focuses on either preserving the quality of the alignment and consistency; respectively.

In addition, we apply a well-known swarm intelligence approach, the Artificial Bee Colony (ABC) algorithm [31]; but adapted to handle multiobjective problems, we refer to it as MOABC. The ABC algorithm was developed by D. Karaboga, inspired by the foraging and dance of honey bee colonies [31]. The swarm algorithms, such as ABC, have been successfully applied to solve real-world problems in different domains, such as the design and manufacturing problem [32], selection of cutting parameters in machining operations [33], the structural damage detection problem [34], image segmentation problems [35], image classification [36], the abnormal brain detection [37], or in the path planning problem [38]. As we have mentioned, several Genetic Algorithms (GAs) have been proposed in the literature for solving the MSA problem (SAGA [18], MSA-GA [19], RBT-GA [20], VDGA [21], GAPAM [22], or MO-SAStrE [23]). Whereas GAs take the information from 2-3 parents to generate a new solution; the algorithms based on swarm intelligence produce new individuals taking into account information not only from their parents, but also from the rest of the population. The effectiveness and goodness of the ABC against traditional GAs has been widely studied in the literature [39] [40]. In the ABC algorithm, we find three types of bees: employed bees, onlooker bees, and scout bees. In 3

Page 4 of 28

the canonical ABC, an employed bee becomes scout if it reaches a certain number of iterations with no improvements, which means that this bee is replaced by a new random solution. In our proposal, when an employed bee becomes scout, its stagnated solution (alignment) will be processed by the fast and accurate Kalign [13], avoiding the stagnation of the algorithm and promoting the diversity of the population as a

ip t

result. In this way, the multiobjective ABC algorithm proposed in this paper was hybridized with the progressive, fast, and accurate Kalign to boost the accuracy and effectiveness of the algorithm, we refer to it as Hybrid Multiobjective Artificial Bee Colony (HMOABC). In [41], a hybrid multi-objective artificial bee

cr

colony is proposed for burdening optimization of copper strip production. The main difference between the approach proposed in [41] and ours relies on the use of a deterministic heuristics (Kalign) in the scout phase

us

of the ABC algorithm.

The remainder of this paper is organized as follows. Section 2 describes the Multiple Sequence Alignment problem. A detailed description of HMOABC is presented in Section 3. Section 4 is devoted to analysis

an

of the experiments carried out and also a comparison with other approaches published in the literature. Finally, Section 5 summarizes the conclusions of the paper and discusses possible lines of future work.

M

2. Multiple Sequence Alignment

Multiple Sequence Alignment (MSA) is simply an alignment of more than two sequences and is considered as an NP-hard optimization problem [42]. The MSA problem can be defined as follows:

d

Given a set of sequences S: {s1 , s2 , . . ., sk } of lengths |s1 |, |s2 |, . . ., |sk | defined over an alphabet Σ, for

Ac ce pt e

example ΣDN A = {A, C, G, T } or Σprotein = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y }. A multiple sequence alignment of S is defined as S ′ : {s′1 , s′2 , . . ., s′k }, where the length of the all the k sequences is exactly the same. Note that, S ′ is defined over the same alphabet as S (Σ) with an additional gap symbol (−); so, S ′ is defined over the alphabet Σ ∪ {−}. The gap symbol refers to indels, that is to say, the insertion or deletion of bases in the unaligned sequences. In this way, a multiple alignment is obtained by adding gaps to the sequences of S so that their lengths become the same. It can be seen as a matrix representation where the rows are sequences and the columns represent aligned symbols. Each column of an alignment must contain at least one symbol of Σ, in other words, a column with all gaps is not allowed. According to [42], the complexity of finding an optimal alignment is O(k2k Lk ), where k is the number of sequences and L is the max(|s1 |, |s2 |, . . ., |sk |). In this work, we propose the use of multiobjective optimization for solving this NP-complete optimization problem. Therefore, we search the best solution that simultaneously maximizes the weighted sum-of-pairs function with affine gap penalties (WSP, f1 ) and the number of Totally Conserved (TC, f2 ) columns score. The weighted sum of pairs with affine gaps (WSP, f1 ) needs to maximize the following equation:

4

Page 5 of 28

W SP (S) =

L X

SP (l) −

k X

AGP (si )

(1)

i=1

l=1

where L is the alignment length, SP (l) is the sum-of-pairs score of the lth column, which is defined as: k−1 k XX

Wi,j × SM (si,l , sj,l )

ip t

SP (l) =

i=1 j=i

cr

and AGP (si ) is the affine gap penalty score of sequence si :

AGP (si ) = (gopen × #gaps) + (gextend × #spaces)

(2)

(3)

us

In equation 2, Wi,j refers to the sequence weight between sequence si and sj , and SM is the substitution matrix used, either Point Accepted Mutation (PAM) or Block Substitution Matrix (BLOSUM). In addition,

an

in equation 3, gopen is the weight to open the gap and gextend is the weight to extend the gap with one more space. In this work we have used the BLOSUM62 substitution matrix, gopen =6, and gextend =0.85. The number of Totally Conserved (TC, f2 ) columns score refers to the number of columns that are

M

completely aligned with exactly the same compound. This objective function needs to be maximized to ensure more conserved regions within the alignment.

This Multiobjective Optimization problem with 2 objective functions for maximization can be formally

d

defined as: (4)

Ac ce pt e

maximize F (x) = (f1 (x), f2 (x))

where x is the vector of variables in the set X,F : X → Y ⊂ R2 is a vector of 2 objective functions, and Y is the objective space. For any two objective vectors a = (a1 , a2 ) and b = (b1 , b2 ), the vector a is said to dominate b if and only if ((a1 > b1 ) and (a2 ≥ b2 )) or ((a1 ≥ b1 ) and (a2 > b2 )). Therefore, for any two solutions x1 , x2 ∈ X, x1 is said to dominate x2 (x1  x2 ) if F (x1 ) dominates F (x2 ). We denominate a solution x∗ as non-dominated solution if no solution in X dominates x∗ . The set of all non-dominated solutions in X is known as Pareto-optimal set or simply Pareto set [43]. The representation of the individual determines how the MSA problem is structured in the proposed algorithm. It gives us the necessary knowledge to understand the behaviour of the proposal. In Figure 1, we illustrate a novel representation of the individual. As we can see, for each sequence, we store the number of gaps followed by a list with the initial position of the gap (positive value) and its length (negative value). In case of finding a gap of length equal to one, we only store the initial position. 3. Hybrid Multiobjective Artificial Bee Colony The Artificial Bee Colony (ABC) algorithm is a swarm-based evolutionary algorithm created by Dervis Karaboga [31] and it is inspired by the intelligent behavior of honey bees. In this algorithm, the position of 5

Page 6 of 28

a food source represents a feasible solution, that is to say, a feasible alignment; and the amount of nectar indicates the quality of the food source (fitness value). In ABC, the population of individuals is known as colony of bees, in which we find three types of bees: employed, onlooker, and scout. The first type of bees (employed) are related to a specific food source, the

ip t

onlooker bees remain in the colony watching the dance of the employed bees with the objective of exploiting those food sources with higher amount of nectar, and finally, the scout bees are in charge of searching new food sources when one of the food sources becomes exhausted.

cr

In Figure 2, we present a diagram that summarizes the flow of our proposal, the Hybrid Multiobjective Artificial Bee Colony Algorithm (HMOABC). As we can see, there are six steps in the HMOABC algorithm:

of food for each of the N employed bees of the colony.

us

• Step 1. Given a set of unaligned sequences, generate N random alignments, that is to say, one source

an

• Step 2. For each employed bee Ei in the colony (E), select a random employed bee (Erandom ) from E, and perform a one-point crossover between the two solutions (alignments) in order to obtain a new one (Ei′ ). Once we have obtained N new employed bees in E ′ , perform a Fast Non-Dominated

M

Sorting procedure [44] over E ∪ E ′ with the aim of preserving in E the best N bees, that is to say, those bees with lower rank and higher crowding distance. For further information about the Fast NonDominated Sorting procedure (see Algorithm 2), including the rank and crowding distance assignments

d

(see Algorithm 3 and 4), please refer to Deb et al. [44].

Ac ce pt e

• Step 3. Compute the amount of nectar of each source of food associated with each employed bee in E. In this work, the amount of nectar (M OF itness) is computed by using the rank and crowding distance (CD) (line 16 in Algorithm 1). After that, for each employed bee, we calculate a probability of being selected by an onlooker bee in the next step (line 14 in Algorithm 1). • Step 4. Generate N new onlooker bees by selecting those sources of food with higher probability of being selected. As we can see in Figure 2, when an employed bee is selected, its related source of food goes through a mutation process. This mutation process consists of four steps: (1) Move a block : select a random block of gaps/compounds and move it towards the left or right; (2) Merge two groups: for each sequence, if a mutation probability is satisfied (F ), select a random group of gaps/compounds and merge it with its adjacent group; (3) Divide a group: for each sequence, in case of mutation, select a random group of gaps/compounds and divide it into two groups; (4) Compact : delete those columns with all gaps. In Figure 2, an illustrative example of each step is presented. Finally, check if the new mutated alignment (Oj ) dominates () the previous one (Ei ), in such case, we store in O the new bee; otherwise, store the previous one with an increment in its stagnation limit.

6

Page 7 of 28

• Step 5. Check the stagnation limit of every bee in E and O. If the limit of a bee reaches a maximum number of cycles with no improvements (maxLimit), then this bee becomes scout, which means that its source of food will be strongly modified by using a Local Search. In this work, our multiobjective version of the ABC algorithm has been hybridized (HMOABC) with the fast and accurate Kalign2

ip t

[13]. In the local search, a random portion of the alignment is extracted. The size of this portion is chosen randomly in the range [5-25%] of the alignment length. Then, we delete all gaps in the portion and re-align it by using the Kalign2 method. Once we obtain the new alignment, we replace the old

cr

portion by the new one.

• Step 6. If the maximum number of evaluations is reached, then stop and output the set of non-

us

dominated solutions, otherwise, use the Fast Non-Dominated Sorting procedure [44] over E ∪ O to preserve in E the best N bees, then, go to Step 2.

an

As we have seen, to calculate the M OF itness of each employed bee, we first need to rank the employed colony by Pareto dominance, see Algorithm 3. In this way, we need to assign rank 1 to the non-dominated individuals and removing them from contention, then finding a new set of non-dominated individuals and

M

assign rank 2, and so forth (see Figure 3(a)). The second step is to compute the crowding distance (see Algorithm 4) of the bees with the same rank. As we can see in Algorithm 4, to calculate the crowdingdistance, we first sort the bees in each objective domain. Then, the first and last solutions (bees) are assigned

d

a distance equal to infinity. For the other solutions, the crowding distance is calculated by the difference of

Ac ce pt e

the objective value of the two closest neighbours. In Figure 3(b), we present an example where the crowding distance of the non-dominated solution B is higher than the crowding distance of the non-dominated solution A because A1 + A2 < B1 + B2. 4. Experimental Results

In our experiments, we have chosen the well-known Benchmark ALignment dataBASE (BALiBASE), which was developed to evaluate and compare multiple alignment methods. In particular, we have used the version 3.0 [45], which contains 218 alignments. It is freely available to download 1 . BALiBASE3.0 benchmark is divided into six different groups or families: RV11, RV12, RV20, RV30, RV40, and RV50; each group presents different biological characteristics, for further information about them, please refer to [45]. We have two assessment measures: Q and TC [5]. The Q (quality) indicates the number of correctly aligned residue pairs divided by the number of residue pairs in the reference alignment, it is also known as Sum-of-Pairs (SP) score. The total column score (TC) is the number of correctly aligned columns divided by the number of columns in the reference alignment, it is also known as Column Score (CS). 1 http://www.lbgi.fr/balibase/

7

Page 8 of 28

In the configuration of HMOABC there are four main parameters: number of employed/onlooker bees (N ), number of fitness evaluations (stopping criterion), mutation probability (F ), and the stagnation limit (maxLimit). In the GAs published in the literature, the population size is 100 individuals; therefore, in this work, we use a total of 50 employed bees in E and 50 onlooker bees in O, that is to say, a population size

ip t

equal to 100 individuals. In order to determine the parameters F and maxLimit, we check the accuracy of HMOABC with the following values of F : 0.1, 0.25, 0.5, 0.75, and 0.9; and the following values of maxLimit: 10, 25, 50, 75, and 100. In Figure 4, we present the average value of Q-score and TC-score obtained by

cr

different configurations of HMOABC on the family RV11. As we can see, the best configuration of HMOABC obtained for the RV11 subset was: F =0.75 and maxLimit=25 (a bee will become scout after reaching 25

us

tries with no improvements). The stopping criterion, as in other GAs published in the literature, is based on the number of fitness evaluations. The number of fitness evaluations in this work was established to 50000 evaluations. Note that, 30 independent runs were performed per experiment by HMOABC, and the

an

average was used, in order to extract some useful conclusions with a certain level of statistical confidence. HMOABC was run using g++ (GCC) 4.4.5 on a 2.3GHz Intel PC with 1GB RAM. In [46], the author presents the Q and TC scores obtained by several well-known methods in BALiBASE

M

3.0. The results of the approaches included in [46] have been used in our comparative study, and they are: Clustal W [8], Clustal Ω [9] (v1.2.1), DIALIGN-TX [47] (v1.0.2), FSA and FSA with the -maxn option [12] (v1.14.5), Kalign2 [13] (v2.03), MAFFT [14] (v6.603, three versions with the accurate flags: L-INS-i,

d

E-INS-i, and G-INS-i), MSAProbs [16] (v0.9.7), MUMMALS [17] (version dated on 08/02/2008), MUSCLE

Ac ce pt e

[5] (v4.0), ProbCons [15] (v1.12), PRANK [11] (with accurate option flag +F and using Clustal W tree), and ProbAlign [48] (v1.0). Furthermore, in our comparative study, we have included the latest version of T-Coffee [10] (v11.0). All the methods involved in the comparative study were run by using the same computer machine as HMOABC, a 2.3GHz Intel PC with 1GB RAM. In Table 1, we compare the results obtained by the aforementioned well-known approaches and the HMOABC algorithm. In Table 1, we have highlighted the best and second best results for each family in both measures (Q and TC). As we can see, HMOABC obtains the highest results in 5 out of the 6 families in terms of Q, and in 6 out of the 6 families in terms of TC. We observe the best performance of HMOABC in the first group/family (RV11), where the differences in terms of Q and TC between our approach and the ProbAlign (second best approach) are around 5%±0.82% in terms of Q and near to 12%±1.57% in terms of TC. In RV50 HMOABC obtains better results (in terms of Q) than most of the approaches and it is able to obtain the most competitive results in terms of TC. Finally, if we focus on the last column of Table 1 (Overall), the best and second best approaches are HMOABC and MSAProbs, respectively; if we compare their overall results on Q and TC, we find a difference higher than 1.5%±0.11% in terms of Q, and higher than 4.5%±0.44% in terms of TC. In order to easily observe the accuracy of HMOABC against the other approaches, a bar plot is presented in Figure 5. 8

Page 9 of 28

In Table 2, we present the total runtime required by the sixteen methods for aligning the six subfamilies contained in BAliBASE v3.0. As we can see, the runtime required by HMOABC (02h 52m 01s) is reasonable, it is the sixth fastest method. In addition, in Figure 6, diverse box-plots are shown in order to provide a visual comparison of the running time of each algorithm in the six sub-families of BALiBASE, avoiding the

ip t

slowest approach (PRANK) in order to make the box-plot readable. In order to ensure that the differences among the approaches shown in Table 1 are statistically significant. As the data do not follow a normal distribution, we have used a non-parametric test. In Table 3, we present

cr

the results of a Wilcoxon signed-rank test between each pair of methods by using a confidence level of 1% (p-value < 0.01). As we can see, the differences between HMOABC and any other approach are always

us

statistically significant.

Since HMOABC makes use of Kalign2 when a bee becomes scout, we need to study in depth their results in the BALiBASE3.0. In Figure 7, we compare both approaches (HMOABC and Kalign2) in the six different

an

families of BALiBASE3.0. As we can see, the values of Q and TC obtained by HMOABC are higher than those obtained by Kalign2 in all the groups. In average, we observe a difference around 7% and 15% in terms of Q and TC.

M

In the following comparison, we study the behaviour of HMOABC and other GAs published in the literature, such as MO-SAStrE [23], GAPAM [22], MSA-GA [19], RBT-GA [20], SAGA [18], and VDGA [21]. These GAs were tested by using a set of 26 datasets of the previous version of BALiBASE3.0 (BAL-

d

iBASE2.0). However, in our comparison, we have selected the 18 out of the 26 subsets included in both

Ac ce pt e

versions (BALiBASE2.0 and BALiBASE3.0). For VDGA the authors present results with 3 different configurations, depending on the number of decompositions, we refer to them as VDGA D2, VDGA D3, and VDGA D4; in addition, a version of MSA-GA with a pre-aligned procedure (MSA-GA w/p) is also included. In Table 4, we compare the quality of the alignments (Q score) obtained by the HMOABC algorithm and the aforementioned GAs published in the literature. As we can see, those approaches that clearly obtain the most accurate alignments are multiobjective approaches: HMOABC and MO-SAStrE. However, we can observe that HMOABC obtains the best results of Q-score in 10 out of the 18 datasets; whereas, MOSAStrE obtains the best results in only 7. In Figure 8, we present an illustrative view of this comparison in which we can see the good performance of HMOABC in terms of Q-score. Furthermore, our swarm approach (HMOABC) presents a strong advantage against MO-SAStrE: HMOABC requires no structure for constructing an accurate alignment. 5. Conclusions and Future work In this paper, we have introduced a novel multiobjective evolutionary algorithm based on swarm intelligence for the Multiple Sequence Alignment problem. By using our approach (HMOABC), we are able to obtain a set of accurate and consistent solutions by simultaneously maximizing two well-known objective 9

Page 10 of 28

functions: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of Totally Conserved (TC) columns score. The proposed approach is inspired by the foraging behaviour of honey bees and it incorporates a local search procedure based on the fast and accurate Kalign. HMOABC has been compared with the most

ip t

relevant multiple sequence alignment methods published in the literature. In this comparison, we have used the BALiBASE3.0 benchmark and two accuracy measures: Q and TC. In addition, we have compared our proposal with other GAs published in the literature for solving this bioinformatics problem. From these

cr

comparisons, we conclude that the HMOABC algorithm is a very promising approach for solving the MSA problem.

us

In future work, we intend to apply other multiobjective variants of evolutionary algorithms based on swarm intelligence to the MSA problem. Furthermore, it would be a good future challenge to test the parameters and performance of HMOABC in other benchmarks, such as PREFAB, SABMARK, or OXBENCH.

an

This work is mainly focused on proving the effectiveness of HMOABC in experimental data; therefore, an in-depth mathematical analysis will be performed in order to provide a strong theoretical support of

M

HMOABC. Acknowledgment

´ Alvaro Rubio-Largo is supported by the post-doctoral fellowship SFRH/BPD/100872/2014 granted by

d

Funda¸c˜ ao para a Ciˆencia e a Tecnologia (FCT), Portugal. Furthermore, this work has been partially funded

Ac ce pt e

by the Spanish Ministry of Science and Innovation and ERDF (the European Regional Development Fund), under contract TIN2012-30685 (BIO project). References

[1] R. Doolittle, Similar amino acid sequences: chance or common ancestry?, Science 214 (1981) 149–159. [2] D. Feng, R. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evolut. 25 (1987) 351–360. [3] D. J. Bacon, W. F. Anderson, Multiple sequence alignment, J. Mol. Biol. 191 (1986) 153–161. [4] C. Notredame, Recent progresses in multiple sequence alignment: a survey, Pharmacogenomics 3 (1) (2002) 131–144. [5] R. C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32 (2004) 1792–1797. [6] L. Wang, T. Jiang, On the complexity of multiple sequence alignment., Journal of Computational Biology 1 (4) (1994) 337–348. [7] M. Waterman, T. Smith, W. Beyer, Some biological sequence metrics, Advances in Mathematics 20 (3) (1976) 367–387. [8] J. D. Thompson, D. G. Higgins, T. J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research 22 (1994) 4673–4680. doi:10.1093/nar/22.22.4673. [9] F. Sievers, A. Wilm, D. Dineen, T. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Soding, J. Thompson, D. Higgins, Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega (2011). [10] C. Notredame, D. G. Higgins, J. Heringa, T-Coffee: a novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology 302 (1) (2000) 205 – 217. [11] A. Loytynoja, N. Goldman, An algorithm for progressive multiple alignment of sequences with insertions, Proceedings of the National Academy of Sciences of the United States of America 102 (30) (2005) 10557–10562. [12] R. K. Bradley, A. Roberts, M. Smoot, S. Juvekar, J. Do, C. Dewey, I. Holmes, L. Pachter, Fast statistical alignment, PLoS Computational Biology 5 (5) (2009) e1000392.

10

Page 11 of 28

Ac ce pt e

d

M

an

us

cr

ip t

[13] T. Lassmann, O. Frings, E. L. L. Sonnhammer, Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features, Nucleic Acids Research 37 (3) (2009) 858–865. [14] K. Katoh, K. Misawa, K. Kuma, T. Miyata, MAFFT: a novel method for rapid multiple sequence alignment based on fast fourier transform, Nucleic Acids Research 30 (14) (2002) 3059–3066. [15] C. B. Do, M. S. Mahabhashyam, M. Brudno, S. Batzoglou, ProbCons: Probabilistic consistency-based multiple sequence alignment, Genome Research 15 (2) (2005) 330–340. [16] Y. Liu, B. Schmidt, D. L. Maskell, MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities, Bioinformatics 26 (16) (2010) 1958–1964. [17] J. Pei, N. V. Grishin, MUMMALS: multiple sequence alignment improved by using hidden markov models with local structural information, Nucleic Acids Research 34 (16) (2006) 4364–4374. [18] C. Notredame, D. G. Higgins, Saga: Sequence alignment by genetic algorithm, Nucleic Acids Research 24 (8) (1996) 1515–1524. [19] C. Gondro, B. P. Kinghorn, A simple genetic algorithm for multiple sequence alignment, Genetics and Molecular Research 6 (4) (2007) 964–982. [20] J. Taheri, A. Y. Zomaya, RBT-GA: a novel metaheuristic for solving the multiple sequence alignment problem, BMC Genomics 10 (Suppl 1) (2009) S10. [21] F. Naznin, R. Sarker, D. Essam, Vertical decomposition with genetic algorithm for multiple sequence alignment, BMC Bioinformatics 12 (2011) 353. [22] F. Naznin, R. Sarker, D. Essam, Progressive Alignment Method Using Genetic Algorithm for Multiple Sequence Alignment, IEEE Transactions on Evolutionary Computation 16 (5) (2012) 615–631. [23] F. M. Ortu˜ no, O. Valenzuela, F. Rojas, H. Pomares, J. P. Florido, J. M. Urquiza, I. Rojas, Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: structural information, non-gaps percentage and totally conserved columns, Bioinformatics 29 (17) (2013) 2112–2121. [24] L. Xiujuan, S. Jingjing, X. Xiaojun, G. Ling, Artificial bee colony algorithm for solving multiple sequence alignment, IEEE Fifth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA) (2010) 337–342. [25] S. Aslan, C. Ozturk, Alignment of biological sequences by discrete artificial bee colony algorithm, 23th Signal Processing and Communications Applications Conference (SIU) (2015) 678–681. [26] J. Moss, C. G. Johnson, An ant colony algorithm for multiple sequence alignment in bioinformatics, Artificial Neural Nets and Genetic Algorithms (Springer) (2003) 182–186. [27] C. Ling, L. Wei, Y. Chen, Ant colony optimization method for multiple sequence alignment, International Conference on Machine Learning and Cybernetics 2 (2007) 914–919. [28] A. Layeb, A. Deneche, Multiple sequence alignment by immune artificial system, IEEE/ACS International Conference on Computer Systems and Applications (2007) 336–342. [29] O. O’Sullivan, K. Suhre, C. Abergel, D. G. Higgins, C. Notredame, 3dcoffee: Combining protein sequences and structures within multiple sequence alignments, Journal of Molecular Biology 340 (2004) 385–395. [30] F. Ortuno, J. Florido, J. Urquiza, H. Pomares, A. Prieto, I. Rojas, Optimization of multiple sequence alignment methodologies using a multiobjective evolutionary algorithm based on NSGA-II, IEEE Congress on Evolutionary Computation (CEC) (2012) 1–8. [31] D. Karaboga, An Idea Based on Honey Bee Swarm for Numerical Optimization, in: Technical report-tr06, Erciyes University, Engineering Faculty, Computer Engineering Department, 2005. [32] A. R. Yildiz, A new hybrid artificial bee colony algorithm for robust optimal design and manufacturing, Applied Soft Computing 13 (5) (2013) 2906–2912. [33] A. R. Yildiz, Optimization of cutting parameters in multi-pass turning using artificial bee colony-based approach, Information Sciences 220 (2013) 399 – 407. [34] H. Gokdag, A. R. Yildiz, Structural damage detection using modal parameters and particle swarm optimization, Materials Testing 54 (6) (2012) 416 – 420. [35] Y. Zhang, L. Wu, Optimal Multi-Level Thresholding Based on Maximum Tsallis Entropy via an Artificial Bee Colony Approach, Entropy 13 (4) (2011) 841–859. [36] Y. Zhang, L. Wu, S. Wang, Magnetic Resonance Brain Image Classification by an Improved Artificial Bee Colony Algorithm, Progress in Electromagnetics Research 116 (2011) 65–79. [37] S. Wang, Y. Zhang, Z. Dong, S. Du, G. Ji, J. Yan, J. Yang, Q. Wang, C. Feng, P. Phillips, Feed-forward neural network optimized by hybridization of pso and abc for abnormal brain detection, International Journal of Imaging Systems and Technology 25 (2) (2015) 153–164. [38] Y. Zhang, L. Wu, S. Wang, UCAV Path Planning by Fitness-Scaling Adaptive Chaotic Particle Swarm Optimization, Mathematical Problems in Engineering 14 (3) (2011) 687–692. [39] D. Karaboga, B. Basturk, On the performance of artificial bee colony (abc) algorithm, Applied Soft Computing 8 (1) (2008) 687 – 697. [40] D. Karaboga, B. Akay, A comparative study of artificial bee colony algorithm, Applied Mathematics and Computation 214 (1) (2009) 108 – 132. [41] H. Zhang, Y. Zhu, Z. Wenping, X. Yan, A hybrid multi-objective artificial bee colony algorithm for burdening optimization of copper strip production, Applied Mathematical Modelling 36 (6) (2012) 2578 – 2591. [42] H. Dogan, H. H. Otu, Objective Functions, Multiple Sequence Alignment Methods, Methods in Molecular Biology (David J. Russell Ed.) 1079 (2014) 45–58. [43] C. A. Coello, C. Dhaenens, L. Jourdan (Eds.), Advances in Multi-Objective Nature Inspired Computing, Vol. 272 of Studies in Computational Intelligence, Springer, 2010.

11

Page 12 of 28

Ac ce pt e

d

M

an

us

cr

ip t

[44] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A Fast Elitist Multi-Objective Genetic Algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation 6 (2000) 182–197. [45] J. D. Thompson, P. Koehl, O. Poch, BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark, Proteins 61 (2005) 127–136. [46] R. C. Edgar, BENCH, http : //www.drive5.com/bench (2014). [47] A. R. Subramanian, M. Kaufmann, B. Morgenstern, DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment, Algorithms for Molecular Biology 3:6. [48] U. Roshan, D. R. Livesay, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics 22 (22) (2006) 2715–2721.

12

Page 13 of 28

Algorithm 1: Pseudocode of the HMOABC Input :

ip t

A set of k unaligned sequences (S), A number of Employed bees in E (N ), Mutation Probability (F ), Scout Limit (maxLimit), Stopping Criterion ; Output:

5 6 7 8 9 10 11 12 13 14

18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43

d

17

//where: //MOFitness(Ei ) = (2Ei .rank + 1+E1i .cd )−1 end ; //Onlooker Bees phase foreach Oi in O do Ei ← SelectEmployedBee(E, P robability); Oj ← Mutation-MoveABlock(Ei ); Oj ← Mutation-MegeTwoGroups(Oj ); Oj ← Mutation-DivideAGroup(Oj ); Oj ← Mutation-CompactAlignment(Oj ); if Oj  Ei then Oi ← Oj ; else IncreaseStaganationLimit(Ei ); Oi ← Ei ; end end ; //Scout Bees phase foreach Si in E ∪ O do if (StaganationLimit(Si ) ≥ maxLimit ) then Si ← Kalign2(Si , rand[5-25%]); ResetStaganationLimit(Si ); end end E ← FastNonDominatedSorting(E, O); UpdateSetOfNonDominatedSolutions(E); end

Ac ce pt e

16

j

M

j

15

cr

4

us

2 3

Set of non-dominated solutions; E ← GenerateRandomBees(N ); while not stopping criterion do //Employed Bees phase foreach Ei in E do Erandom ← RandomBee(E); Ei′ ← SPX(Ei , Erandom ); //One-point crossover end E ← FastNonDominatedSorting(E,E ′ ); ; //Generate a probability vector by using //the MOFitness of the employed bees for i ← 1 to N do M OF itness(Ei ) P robabilityi = PN ; M OF itness(E )

an

1

13

Page 14 of 28

Algorithm 2: Pseudocode of the FastNonDominatedSorting Input : Set of bees (E1 ); Set of bees (E2 );

8 9 10 11 12 13 14 15 16

cr

6 7

us

5

an

4

M

3

d

2

Best Employed bees (E); P ← E1 ∪ E2 ; // Obtain non-dominated fronts of P in F = {F 1 , F 2 , . . .} F ← Rank-Assignment(P ); E ← ∅; i ← 1; do E ← E ∪ F i; i ← i + 1; while |E| + |F i | ≤ N ; F i ← Crowding-Distance-Assignment(F i ); F i ← SortByCrowdingDistance(F i ); j ← 1; while |E| ≤ N do E ← E ∪ Fji ; j ← j + 1; end

Ac ce pt e

1

ip t

Output:

Figure 1: Representation of an individual.

14

Page 15 of 28

ip t

Algorithm 3: Pseudocode of the Rank-Assignment Input :

cr

Set of bees (P );

5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

an

4

M

3

d

2

Ac ce pt e

1

// F = {F 1 , F 2 , . . .} Non-dominated fronts of P (F ); foreach p in P do Sp ← ∅; np ← 0; foreach q in P do if p ≺ q then Sp ← Sp ∪ q; else if q ≺ p then np ← np + 1; end end if np = 0 then p.rank ← 1; F 1 ← F 1 ∪ p; end i ← 1; while F i 6= ∅ do Q ← ∅; foreach p in F i do foreach q in Sp do nq ← nq − 1; if nq = 0 then q.rank ← i + 1; Q ← Q ∪ q; end end i ← i + 1; F i ← Q; end

us

Output:

15

Page 16 of 28

ip t cr us

Algorithm 4: Pseudocode of the Crowding-Distance-Assignment

an

Input : Front (F i ); Output: Updated Front (F i );

4 5 6 7 8 9 10 11 12 13

M

3

n ← |F |; foreach p in Fi do p.cd ← 0 end foreach Objective m do F i ← SortByObjectiveFunction(F i , m); //Crowding-distance of the extreme solutions equal to infinity F1i .cd ← ∞; Fni .cd ← ∞; for j = 2 to (L − 1) do

d

2

Ac ce pt e

1

i

Fji .cd ← Fji .cd +

i i Fj+1 .fm −Fj−1 .fm max −f min fm m

end end

16

Page 17 of 28

ip t cr us an M d Ac ce pt e Figure 2: HMOABC flowchart

17

Page 18 of 28

ip t cr us an M d Ac ce pt e (a) Rank concept

(b) Crowding-distance concept

Figure 3: Rank and crowding-distance concepts

18

Page 19 of 28

ip t cr us an M Ac ce pt e

d

(a)

(b) Figure 4: Parameter configuration of HMOABC in terms of average Q-score (a) and TC-score (b) in RV11.

19

Page 20 of 28

ip t cr us an M Ac ce pt e

d

(a)

(b) Figure 5: Comparison among HMOABC and other well-known approaches published in the literature, in terms of average Q-score (a) and TC-score (b) score in the 218 test sets of BAliBASE3.0.

20

Page 21 of 28

ip t

(b) RV12

an

us

cr

(a) RV11

(d) RV30

Ac ce pt e

d

M

(c) RV20

(e) RV40

(f) RV50

Figure 6: Box-plots for the running time in aligning the six sub-families of BAliBASE3.0.

Figure 7: HMOABC vs Kalign2 [13] in average terms of Q-score and TC-score in the 218 test sets of BAliBASE 3.0

21

Page 22 of 28

ip t cr us an M d Ac ce pt e

Figure 8: Comparison among HMOABC and other GAs published in the literature in terms of average Q-score in the selected 18 test sets from BAliBASE 2.0 and 3.0.

22

Page 23 of 28

0.7473 0.5006 0.5901 0.5047 0.5027 0.6188 0.6053 0.6600 0.6071 0.6712 0.6818 0.6694 0.6826 0.6697 0.4618 0.6950 0.6572

TC

0.5762 0.2299 0.3622 0.2681 0.2723 0.3658 0.3687 0.4402 0.3466 0.4500 0.4440 0.4197 0.4409 0.4196 0.2162 0.4569 0.4141

RV11

Q

TC

0.8868 0.7170 0.7938 0.7569 0.8222 0.8479 0.7934 0.8390 0.8252 0.8423 0.8703 0.8447 0.8619 0.8605 0.6208 0.8669 0.8593

RV12

0.9506 0.8649 0.9060 0.8821 0.9238 0.9365 0.9121 0.9361 0.9270 0.9363 0.9463 0.9430 0.9447 0.9412 0.8377 0.9464 0.9447

Best and second best results are highlighted.

HMOABC Clustal W [8] Clustal Ω [9] DIALIGN-TX [47] FSA [12] FSA-maxsn [12] Kalign2 [13] MAFFT-EINSi [14] MAFFT-GINSi [14] MAFFT-LINSi [14] MSAProbs [16] MUMMALS [17] MUSCLE [5] ProbCons [15] PRANK [11] ProbAlign [48] T-Coffee [10]

Q

d

TC 0.5127 0.2216 0.4529 0.3078 0.1899 0.3434 0.3625 0.4515 0.3913 0.4574 0.4694 0.4313 0.4794 0.4116 0.1242 0.4444 0.4060

RV20

0.9355 0.8520 0.9116 0.8781 0.8650 0.9010 0.9008 0.9264 0.9050 0.9262 0.9283 0.9062 0.9284 0.9168 0.8014 0.9259 0.9157

Q

Ac ce pt e us

TC 0.6549 0.3982 0.5826 0.4517 0.4780 0.5846 0.5078 0.5751 0.5156 0.6009 0.6104 0.4884 0.6039 0.5361 0.3422 0.6123 0.5538

RV40 0.9384 0.7894 0.9010 0.8340 0.8609 0.9161 0.8833 0.9143 0.8864 0.9191 0.9232 0.8714 0.9254 0.9003 0.7477 0.9221 0.8964

Q

an

M

TC 0.6343 0.2759 0.5791 0.3890 0.2629 0.4826 0.4799 0.5920 0.5322 0.5733 0.6115 0.4981 0.6227 0.5473 0.0638 0.5695 0.4776

RV30 0.8868 0.7250 0.8624 0.7614 0.6896 0.8137 0.8126 0.8612 0.8531 0.8555 0.8646 0.8479 0.8755 0.8453 0.5784 0.8530 0.8369

Q

cr

TC

Q

0.6459 0.3590 0.5513 0.4407 0.4076 0.5317 0.4920 0.5827 0.5268 0.5816 0.6026 0.5359 0.6003 0.5590 0.2628 0.5842 0.5503

Overall TC

0.8917 0.7457 0.8389 0.7803 0.7719 0.8430 0.8224 0.8662 0.8438 0.8680 0.8753 0.8528 0.8752 0.8612 0.6834 0.8718 0.8576

ip t

0.6103 0.3116 0.5374 0.4705 0.4206 0.5657 0.4396 0.5985 0.5499 0.5661 0.6101 0.5329 0.5931 0.5789 0.2099 0.5552 0.5911

RV50 0.8919 0.7424 0.8620 0.8215 0.7895 0.8719 0.8201 0.8991 0.8839 0.8999 0.9076 0.8791 0.8944 0.8941 0.6735 0.8886 0.8949

Q

Table 1: Average Q and TC scores on each BAliBASE3.0 subset. The last column comprises the average Q and TC scores in the 218 test sets.

23

Page 24 of 28

ip t cr us

Table 2: Required runtime for aligning the six sub-families of BAliBASE3.0. Runtime RV12

RV20

RV30

RV40

an

RV11

RV50

Total

00:00:02

00:00:04

00:00:12

00:00:11

00:00:24

00:00:09

00:01:03

Clustal Ω [9] Clustal W [8]

00:00:14 00:00:08

00:00:35 00:00:29

00:03:51 00:06:26

00:02:57 00:07:56

00:05:45 00:04:44

00:01:43 00:01:43

00:15:05 00:21:27

MUSCLE [5] DIALIGN-TX [47]

00:00:27 00:00:27

00:00:37 00:01:13

00:04:52 00:27:54

00:05:03 00:36:55

00:11:56 00:16:02

00:02:42 00:05:41

00:25:37 01:28:12

HMOABC MAFFT-GINSi [14]

00:04:57 00:02:42

00:08:48 00:05:35

00:28:30 00:32:27

00:36:51 00:48:53

01:14:49 01:44:19

00:18:07 00:34:50

02:52:01 03:48:47

MAFFT-EINSi [14] MAFFT-LINSi [14]

00:03:02 00:02:39

00:05:43 00:04:51

00:33:24 00:30:29

00:44:30 00:40:08

02:01:33 02:18:08

00:34:45 00:36:14

04:02:57 04:12:29

ProbAlign [48] MSAProbs [16]

00:00:57 00:01:19

00:03:53 00:04:42

01:23:13 01:41:34

01:59:18 02:33:49

00:52:50 01:08:58

00:20:31 00:26:45

04:40:41 05:57:07

ProbCons [15]

00:01:44

00:06:03

02:07:02

03:00:52

01:29:13

00:33:54

07:18:47

MUMMALS [17] FSA-maxsn [12]

00:05:19 00:05:54

00:16:34 00:19:21

01:53:12 03:04:59

03:33:25 03:08:34

03:05:41 02:52:35

01:09:23 01:02:35

10:03:35 10:33:59

T-Coffee [10] FSA [12]

00:04:11 00:05:51

00:13:30 00:20:45

04:37:16 07:20:35

07:07:06 10:49:38

03:14:19 05:07:21

01:15:56 01:50:05

16:32:19 25:34:16

PRANK [11]

01:28:50

03:55:45

24:17:37

18:28:40

64:15:41

13:44:17

126:10:51

Ac ce pt e

d

M

Kalign2 [13]

The runtime is expressed in hours, minutes, and seconds (hh:mm:ss)

24

Page 25 of 28

HMOABC

Kalign2 [13]

FSA-maxsn [12]

FSA [12]

DIALIGN-TX [47]

−8×10−08

+< 10−10

+7×10−04 −2×10−07

−< 10−10

−< 10−10

−< 10−10

FSA [12]

d

MAFFT-EINSi [14] +< 10−10

MAFFT-GINSi [14] +5.E-09

MSAProbs [16] +2×10−04

MUMMALS [17] +7.E-10

PRANK [11]

ProbCons [15]

MUSCLE [5]

+1×10−04 +< 10−10 +< 10−10 +6×10−05 +< 10−10

ProbAlign [48]

+< 10−10 +2×10−07 +< 10−10 +< 10−10

−< 10−10

M

0.40

0.96

−7×10−06 0.97

+< 10−10 −1×10−05

0.03

+< 10−10 +< 10−10 +< 10−10 +< 10−10 +4×10−04 +< 10−10 0.21

+3×10−05

0.19

0.84 −9×10−06

us

+< 10−10 +< 10−10 +< 10−10 +< 10−10 +< 10−10 +< 10−10 +6×10−03 +< 10−10 +2×10−03

−2.E-10

−< 10−10

0.22

+3×10−03

−7×10−06

0.64

−1×10−04

ProbAlign [48]

−2×10−04

0.93

T-Coffee [10]

0.71

−< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10

0.72

+< 10−10 +< 10−10 +< 10−10 +< 10−10 +2×10−04 +< 10−10

+2×10−06

+< 10−10

0.91

0.21

0.93 0.96

+7×10−06

+1×10−05

ip t

+< 10−10 −1×10−06

cr

0.36 +6×10−06

−< 10−10 −< 10−10

+< 10−10 −6×10−06 +2×10−03 +< 10−10 0.44

0.01 0.71 +< 10−10 −7×10−03 +1×10−06 +< 10−10

−< 10−10

−3×10−03

−< 10−10

0.65

+1×10−03

0.10

ProbCons [15] 0.72

+< 10−10 −4×10−03

+1×10−05 +< 10−10

PRANK [11]

−< 10−10

0.76

−4×10−05

+< 10−10 +< 10−10 +< 10−10 +< 10−10 +< 10−10 +< 10−10 +3×10−05 +< 10−10 +1×10−05

0.48

0.83

−< 10−10

+< 10−10

+3×10−04

−5×10−05

−< 10−10 0.29

0.18

−1×10−07

+2.E-09

−1×10−05

MUMMALS [17]

+< 10−10 +< 10−10 +< 10−10 +< 10−10 +< 10−10 +< 10−10 +4×10−05 +< 10−10 +1×10−04

0.38

−< 10−10 −1×10−04 −< 10−10 −8.8E-08 +< 10−10 −< 10−10 −2×10−07

−1×10−05

MUSCLE [5]

+< 10−10 +< 10−10 +< 10−10 +< 10−10 +4×10−06 +< 10−10 +< 10−10 +< 10−10 +4×10−03

−< 10−10 −1×10−04 −< 10−10 −3×10−08 +< 10−10 −< 10−10 −2×10−08

an

0.27

−6.E-09

−< 10−10

+3×10−06

+6.E-09

−3.E-09

−1×10−06

+3.E-09

−5×10−06

−< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 +< 10−10 −< 10−10 −< 10−10

MAFFT-LINSi [14] 0.62

0.69

−8.E-09

MSAProbs [16]

0.65

−< 10−10 +3×10−07 −9×10−06

+< 10−10 +< 10−10 +< 10−10 +< 10−10 +1×10−05 +< 10−10

−< 10−10

MAFFT-EINSi [14]

MAFFT-GINSi [14]

+2×10−07 +8×10−08 −2×10−06

+< 10−10

−< 10−10

+2×10−07 −5×10−08

Kalign2 [13] 0.03

+2×10−07

−< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 +< 10−10 −< 10−10 −< 10−10

+< 10−10 +2×10−06 +< 10−10 +< 10−10

0.09

0.25

−< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 +< 10−10 −< 10−10 −< 10−10

−< 10−10

0.97

+< 10−10 +< 10−10 +2×10−08 +< 10−10

FSA-maxsn [12]

0.04

MAFFT-LINSi [14] +3.E-10

T-Coffee [10]

−< 10−10 −7×10−03 −7×10−03 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 −< 10−10 +< 10−10 −< 10−10 −< 10−10

+< 10−10 +< 10−10 +< 10−10 +< 10−10 +< 10−10 +< 10−10

Clustal W [8]

Clustal Ω [9]

−< 10−10

Clustal Ω [9]

DIALIGN-TX [47]

Clustal W [8]

HMOABC

Ac ce pt e

Table 3: Each entry in the table contains the p-value assigned by a Wilcoxon signed-rank test to the difference between a pair of methods. The upper-right half of the matrix is obtained from Q-score, the lower-left half from TC-score. If the method on the left is ranked higher than the method above, the p-value is preceded by ’+’. If the method on the left is ranked lower, the p-value is preceded by ’−’. If the p-value is ≥ 0.01, the difference is not considered statistically significant and is shown in bold.

25

Page 26 of 28

ip t cr MSA-GA [19]

Ref.3

Ref.4

SAGA [18]

0.443 0.416 0.347 0.531 0.803 0.857 0.732 0.875 0.856 0.847 0.717 0.890 0.383 0.398 0.469 0.836 0.870 0.330

MSA-GA w/p [19]

VDGA D2 [21]

0.451 0.464 0.282 0.548 0.816 0.853 0.794 0.774 0.742 0.839 0.781 0.899 0.408 0.410 0.526 0.866 0.887 0.478

0.498 0.402 0.317 0.487 0.781 0.860 0.767 0.851 0.796 0.826 0.746 0.896 0.311 0.386 0.468 0.800 0.825 0.384

n/a n/a n/a n/a 0.567 0.660 0.795 0.825 0.745 0.730 0.755 0.812 0.180 0.310 0.350 0.680 n/a n/a

0.687 0.405 0.302 0.488 n/a 0.758 n/a n/a 0.768 n/a n/a n/a n/a n/a n/a n/a 0.619 0.635

n/a n/a n/a n/a 0.726 0.623 0.492 0.694 0.498 0.763 0.282 0.739 0.186 0.585 0.269 0.672 n/a n/a

0.501 0.443 0.212 0.295 n/a 0.755 n/a n/a 0.761 n/a n/a n/a n/a n/a n/a n/a 0.580 0.710

us

VDGA D4 [21]

0.482 0.459 0.359 0.545 0.819 0.863 0.778 0.815 0.829 0.850 0.751 0.889 0.453 0.414 0.481 0.866 0.890 0.542

RBT-GA [20]

VDGA D3 [21]

0.716 0.403 0.544 0.808 0.825 0.913 0.911 0.917 0.855 0.879 0.864 0.912 0.586 0.590 0.673 0.862 0.918 0.865

GAPAM [22]

MO-SAStrE [23]

0.732 0.559 0.466 0.783 0.912 0.880 0.925 0.855 0.898 0.893 0.863 0.930 0.539 0.651 0.692 0.832 0.874 0.928

d

M

an

HMOABC 1ped 1uky 2myr Kinase 1lvl 1pamA 1ubi 1wit 2hsdA 2pia 3grs 4enl 1ajsA 1ubi 1uky 4enl Kinase Kinase

Ac ce pt e

Ref.2

Ref.1v.1

Table 4: Average Q score on the 18 test sets from BAliBASE 2.0 and 3.0. The last row comprises the average Q score in the 18 test sets.

Overall 0.790 0.780 0.671 0.657 Best and second best results are highlighted.

0.644 0.633 0.617 0.583 0.544 0.532

26

Page 27 of 28

Page 28 of 28

ed

pt

Ac ce us

an

M

cr

ip t