An improved voting algorithm for planted (l, d) motif search



Information Sciences 237 (2013) 305–312

Contents lists available at SciVerse ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins

Yun Xu a,b,⇑, Jiaoyun Yang a,b, Yuzhong Zhao a,b, Yi Shang c

a Department of Computer Science, University of Science and Technology of China, Hefei, Anhui 230026, China
b Key Laboratory on High Performance Computing, Hefei, Anhui 230026, China
c Department of Computer Science, University of Missouri–Columbia, Columbia, MO, USA

Article info

Article history: Received 13 November 2011; received in revised form 7 March 2013; accepted 14 March 2013; available online 21 March 2013

Keywords: Planted motif search; Exact algorithm; Time and space complexity; Challenging instances

Abstract

The planted motif search problem is a classical problem in bioinformatics that seeks to identify meaningful patterns in biological sequences. Because the problem is NP-complete, current algorithms focus on improving the average time complexity and on solving challenging instances within an acceptable time. In this paper, we propose a new exact algorithm, CVoting, that improves the state-of-the-art Voting algorithm. CVoting uses a new hash technique to reduce the space complexity to O(mn + N(l, d)) and a new pruning technique to reduce the average time complexity to $O\left(m^2 n N(l, d) \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l\right)$. Experimental results show that CVoting outperforms competing algorithms, including PMS1, RISOTTO, Voting and Pmsprune, in both space and time: it is up to an order of magnitude faster and uses less memory in solving challenging instances. The software of the proposed algorithm is publicly available at http://staff.ustc.edu.cn/xuyun/motif.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

The planted (l, d)-motif search is a classical problem of motif discovery in bioinformatics because of its importance in identifying meaningful patterns in biological sequences. Patterns such as transcription factor binding sites (TFBSs) and splice sites are called motifs: recurring, conserved regions in biological sequences [8]. Since motifs have molecular structural or functional features related to the behavior of DNA, RNA, or proteins [4,13,16,22,27], identifying them helps us better understand the mechanisms of life. The formal description of the planted (l, d)-motif search problem is as follows [6,7,24]:

Definition 1. Given n sequences $s_1, s_2, \ldots, s_n$ of length m over a finite alphabet $\Sigma = \{A, C, G, T\}$ and two integers l and d with $0 \le d < l < m$, the planted (l, d)-motif search problem is to find all strings of length l, called "motifs", such that each sequence contains a length-l substring whose Hamming distance from the motif is at most d.

Planted (l, d)-motif search algorithms can be divided into two categories: approximate and exact algorithms. A motif search algorithm is approximate if it does not guarantee finding all solutions (or the optimal solutions). Some approximate algorithms, such as WEEDER [19], VINE [15], Pattern Branching [23] and Random Projection [25], apply greedy or heuristic search techniques to reduce the execution time. Others apply statistical techniques, including Expectation Maximization (EM) [2,18], Gibbs Sampling [17] and hidden Markov models [31].

⇑ Corresponding author at: Department of Computer Science, University of Science and Technology of China, Hefei, Anhui 230026, China. Tel.: 86 551 3602441.
0020-0255/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ins.2013.03.023


Y. Xu et al. / Information Sciences 237 (2013) 305–312

In this paper, we focus on exact algorithms, which guarantee solution quality. Since the planted (l, d)-motif search problem has been proved to be NP-complete [11], no polynomial-time algorithm exists unless P = NP. In practice, efficient exact algorithms are developed to achieve lower average time complexity and to increase the size of the instances that can be solved. In the community, some relatively large instances have been proposed as so-called challenging instances. The first challenging instance, proposed by Pevzner and Sze [20], was to find a (15, 4)-motif in 20 random sequences of length 600. When solving this instance, Buhler and Tompa [25] found that if the sequences were selected randomly, the expected number of (l, d)-motifs would be more than 1 for some (l, d) with n = 20 and m = 600, where n is the number of sequences and m is the length of each sequence. Davila et al. [7] proposed to include the (11, 3), (13, 4), (15, 5), (17, 6) and (19, 7) random problems as challenging instances.

Exact algorithms adopt various techniques to enumerate potential motifs by analyzing patterns in the sequences and using different pruning strategies to reduce the search space. The state-of-the-art algorithms include MITRA [10], PMS1 [24], CENSUS [12], RISOTTO [21], SPELLER [26], Voting [6] and Pmsprune [7]. They organize the search space into a graph, tree, or hash table.

Graph-based methods construct a graph in which each length-l substring of each sequence is a vertex, and two vertices are linked by an edge if and only if the two substrings are from different sequences and their Hamming distance is at most 2d. In this graph, a motif corresponds to a clique of size n. RecMotif [29] applies the concepts of reference sequence and reference vertex to find such cliques. Pevzner and Sze proposed the WINNOWER algorithm [20] to eliminate spurious edges; its time complexity is $O((mn)^{2d+1})$.
MITRA [10] improves the efficiency of edge pruning in WINNOWER by using mismatch trees. However, neither WINNOWER nor MITRA can solve (15, 5)-motif instances in a reasonable amount of time.

Tree-based methods, including SPELLER [26] and RISOTTO [21], are based on suffix trees and apply efficient techniques to prune prefixes. The time and space complexities of both SPELLER and RISOTTO are $O(n^2 m N(l, d))$ and $O(n^2 m)$, respectively, where $N(l, d) = \sum_{i=0}^{d} \binom{l}{i} 3^i$, although RISOTTO is shown to run faster in simulations. CENSUS [12] uses lexicographic trees to prune spurious prefixes; its time and space complexities are O(lmnN(l, d)) and O(lmn), respectively. TreeMotif [30] uses a new deterministic tree structure to discover motifs with time complexity $O(nm^4p^2)$, where $p = \sum_{i=0}^{2d} \binom{l}{i} (3/4)^i (1/4)^{l-i}$. Rajasekaran et al. proposed the PMS1 algorithm [24], based on a binomial-tree-like data structure. The time and space complexities of PMS1 are O(mnN(l, d)) and O(mN(l, d)), respectively. Recently, Davila et al. proposed the Pmsprune algorithm [7] by extending PMS1. Despite its worst-case time complexity of $O(m^2 n N(l, d))$, Pmsprune works well in practical applications because many impossible prefixes can be pruned. The space complexity of Pmsprune is $O(m^2 n)$. As far as we know, Pmsprune is the first exact motif search algorithm that is capable of solving (19, 7) instances.

Hash-table-based methods use hashing techniques to store all potential motifs instead of pruning them. Ho et al. presented the iTriplet algorithm [14], which generates triplets and keeps the putative motifs of each triplet in a hash table; its time and space complexities are $O(m^3 n p l^3 d^2)$ and O(N(l, d)), respectively. The Voting algorithm proposed by Chin and Leung [6] has time complexity O(mnN(l, d)), which appears to be the best among all existing algorithms. However, its space complexity is O(mN(l, d)), which becomes too high when l is larger than 15.
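To give a feel for why the O(mN(l, d)) space term becomes a problem, the following minimal sketch evaluates N(l, d) from its definition for the challenging instances. The function name `neighborhood_size` and the choice m = 600 are illustrative assumptions, and the printed products count candidate strings before duplicates are merged, not bytes of any particular implementation.

```python
from math import comb


def neighborhood_size(l, d):
    """N(l, d) = sum_{i=0}^{d} C(l, i) * 3^i over the DNA alphabet."""
    return sum(comb(l, i) * 3 ** i for i in range(d + 1))


m = 600  # sequence length used in the challenging instances
for l, d in [(13, 4), (15, 5), (17, 6), (19, 7)]:
    n_ld = neighborhood_size(l, d)
    # (m - l + 1) substrings of the first sequence, each with N(l, d) neighbors
    print(f"({l}, {d}): N(l, d) = {n_ld}, (m - l + 1) * N(l, d) = {(m - l + 1) * n_ld}")
```

The growth with l and d is what separates storing the neighbors of every substring from storing the neighbors of one substring at a time.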
Other motif discovery algorithms can be found in [1,3,5,9,32].

In this paper, we propose an exact algorithm, CVoting, that improves Voting in both time and space complexity for large problems. CVoting uses a new hash technique that has no collisions when storing the d-neighbors of a single length-l string, so that the space complexity is reduced to O(mn + N(l, d)). In addition, it applies a new pruning technique to prune candidate length-l strings. Its average time complexity is $O\left(m^2 n N(l, d) \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l\right)$, which is lower than that of the Voting algorithm when $m \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l < 1$. Empirically, CVoting is faster than the competing algorithms on challenging instances. It solves (19, 7) challenging instances in about 1.3 h, whereas Pmsprune needs more than 11 h.

2. CVoting – an improved algorithm based on voting

First, a neighborhood definition between two strings is introduced:

Definition 2. Given two strings x and y of length l, if the Hamming distance between x and y is no larger than d, then y is called a d-neighbor of x (and x is also a d-neighbor of y). Let $N_d(x) = \{y \mid y \text{ is a } d\text{-neighbor of } x\}$; then $|N_d(x)| = N(l, d) = \sum_{i=0}^{d} \binom{l}{i} 3^i$.

The d-neighbor relationship is reflexive and symmetric, but not transitive.

CVoting is developed based on the Voting algorithm [6], which works as follows. Each sequence $s_i$ ($1 \le i \le n$) of length m contains m − l + 1 substrings of length l, and the d-neighbors of each of these substrings are regarded as candidate strings that may be a real motif. The motif is determined through voting on these candidate strings: a candidate string x gains a vote from a sequence $s_i$ if and only if some length-l substring of $s_i$ is a d-neighbor of x. Voting builds a hash table V to record the votes and a hash table R to avoid duplicate votes from the same sequence. Thus, a real motif obtains n votes, one from each of $s_1, s_2, \ldots, s_n$, while other strings do not. The final result is found through a simple scan over the hash table V.
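The voting scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: Python dictionaries stand in for the hash tables V and R, and the names `d_neighbors`, `voting`, `votes` and `last_voter` are hypothetical, not taken from the Voting implementation of [6].

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def d_neighbors(x, d):
    """Return N_d(x): every string within Hamming distance d of x."""
    neighbors = set()
    for r in range(d + 1):                        # exactly r mismatches
        for positions in combinations(range(len(x)), r):
            # At each chosen position, substitute one of the 3 other letters.
            options = [[c for c in ALPHABET if c != x[p]] for p in positions]
            for letters in product(*options):
                z = list(x)
                for p, c in zip(positions, letters):
                    z[p] = c
                neighbors.add("".join(z))
    return neighbors


def voting(seqs, l, d):
    """Return all (l, d)-motifs of seqs, sorted."""
    votes = {}        # plays the role of V: candidate -> number of votes
    last_voter = {}   # plays the role of R: candidate -> last sequence that voted
    for i, s in enumerate(seqs):
        for j in range(len(s) - l + 1):
            for z in d_neighbors(s[j:j + l], d):
                if last_voter.get(z) != i:        # one vote per sequence
                    votes[z] = votes.get(z, 0) + 1
                    last_voter[z] = i
    return sorted(z for z, v in votes.items() if v == len(seqs))


seqs = ["AACGT", "TACGA", "CACGG"]
print(voting(seqs, l=3, d=0))
print(len(d_neighbors("ACGTA", 2)))
```

Because every neighbor of every substring is kept in `votes`, the memory footprint of this sketch grows in the same way as the space bound discussed for Voting.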


The time complexity of Voting is O(mnN(l, d)). Its space complexity is O(mN(l, d)) because all the d-neighbors of the m − l + 1 length-l substrings in the first sequence need to be stored. Voting works very well for short motifs, such as l ≤ 15. However, for longer ones, such as (17, 6), the space usage of Voting increases drastically and becomes too large for current computers.

CVoting reduces the space complexity of Voting by restricting the number of candidate strings. Let the jth length-l substring ($1 \le j \le m - l + 1$) of sequence $s_i$ be $s_i^j$. First, only the d-neighbors of $s_1^1$ are voted on. Then the d-neighbors of $s_1^2$ are voted on, and so on. In this way, the space needed for candidate strings is reduced to O(N(l, d)), which is small enough to design a hash function without collisions. Although this increases the worst-case time complexity, the time cost can be greatly reduced according to the following observations.

Observation 1. Consider two strings x and y of length l with Hamming distance k. The size of $N_d(x) \cap N_d(y)$ is much smaller than N(l, d) when k is large. If k > 2d, the size is zero.
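Observation 1 can be checked by brute force on a small case. The sketch below is an illustration only: it enumerates all 4^l strings, so it is usable just for tiny l, and the helper names and the choice l = 6, d = 2 are assumptions made for the example.

```python
from itertools import product

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of positions at which a and b differ."""
    return sum(p != q for p, q in zip(a, b))


def d_neighbors(x, d):
    """N_d(x) by exhaustive enumeration of all strings of length len(x)."""
    return {"".join(z) for z in product(ALPHABET, repeat=len(x))
            if hamming(z, x) <= d}


def intersection_sizes(l, d):
    """|N_d(x) ∩ N_d(y)| for pairs (x, y) at Hamming distance k = 0..l."""
    x = "A" * l
    nx = d_neighbors(x, d)
    sizes = []
    for k in range(l + 1):
        y = "C" * k + "A" * (l - k)   # differs from x in exactly k positions
        sizes.append(len(nx & d_neighbors(y, d)))
    return sizes


print(intersection_sizes(l=6, d=2))
```

The first entry is N(6, 2), and every entry with k > 2d = 4 is zero, in line with the observation.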

Observation 2. Consider two strings x and y of length l. If the characters of the alphabet Σ follow a uniform distribution, the probability that k mismatches occur between x and y is $\binom{l}{k} \left(\frac{1}{4}\right)^{l-k} \left(\frac{3}{4}\right)^k$. This probability decreases sharply as k increases.

Definition 3. Given three strings x, y and z of length l, if z is a d-neighbor of both x and y, then z is consistent with x and y.

Definition 4. For two strings x and y of length l, let $N_d(x, y) = N_d(x) \cap N_d(y)$ be the set of consistent strings of x and y.

When the d-neighbors of a single length-l substring $s_1^j$ ($1 \le j \le m - l + 1$) are considered as the candidate strings, only the consistent strings can receive votes. The number of consistent strings is much smaller than N(l, d). Satya and Mukherjee [28] use the same observations to prune the graph in the WINNOWER algorithm. Here, we propose a different algorithm to generate the consistent strings of two length-l substrings, which significantly improves the efficiency of Voting.

Consider two length-l substrings x and y with Hamming distance k ≤ 2d. Let $x_i$ (or $y_i$) be the ith character of x (or y) and $d_i$ be the Hamming distance between the suffixes $x_{i+1}x_{i+2} \cdots x_l$ and $y_{i+1}y_{i+2} \cdots y_l$. Therefore, $d_0 = k$. We start from the empty string and generate the consistent string character by character. Two variables, $d_x$ and $d_y$, record the Hamming distance between the string generated so far and the corresponding prefixes of x and y, respectively. Suppose that we have generated i characters $z_1 z_2 \cdots z_i$ of the consistent string z; then $d_x$ (or $d_y$) equals the number of mismatches between the prefix $z_1 z_2 \cdots z_i$ and $x_1 x_2 \cdots x_i$ (or $y_1 y_2 \cdots y_i$). We can prune the impossible prefixes according to the following constraints:

1. $d_x \le d$
2. $d_y \le d$
3. $2d - d_x - d_y \ge d_i$

The third constraint holds because each of the $d_i$ remaining positions at which x and y differ adds at least one mismatch to $d_x$ or $d_y$, while the combined remaining budget is $2d - d_x - d_y$.

This idea is illustrated in Fig. 1, which shows an example of a pruning tree with l = 4, d = 1 and k = 1. Each node of the tree has four children A, C, G and T, and the depth of the tree is 4.
For the nodes at depth 1, A, G and T are pruned because they violate the third constraint. For the nodes at depths 3 and 4, only A and T, respectively, are retained because the other nodes do not satisfy the first two constraints. Therefore, there are only four consistent strings: 'CAAT', 'CCAT', 'CGAT' and 'CTAT'.

Fig. 1. The pruning tree of two length-l substrings ‘CAAT’ and ‘CGAT’ with d = 1.


More specifically, the pseudocode of this algorithm, ConsistentPat, is presented in Algorithm 1.

Algorithm 1. ConsistentPat
Input: Two length-l substrings x and y, parameters dx, dy and lv
{As described previously, dx and dy are two variables with values initialized to 0. lv denotes the current level in the pruning tree, with initial value 1. $x_{lv}$ and $y_{lv}$ denote the lv-th characters of x and y, respectively, and $d_{lv-1}$ is the Hamming distance between the remaining suffixes $x_{lv} \cdots x_l$ and $y_{lv} \cdots y_l$. A comparison such as ($x_{lv} \ne c$) counts as 1 if it holds and 0 otherwise.}
Output: $N_d(x, y)$
  if dx > d or dy > d or 2d − dx − dy < $d_{lv-1}$ then
    return {no consistent string in this subtree}
  else if lv = l + 1 then
    output the consistent string
  else
    for all characters c in Σ do
      ConsistentPat(x, y, dx + ($x_{lv} \ne c$), dy + ($y_{lv} \ne c$), lv + 1)
    end for
  end if
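A minimal Python sketch of the pruned enumeration follows. It is a reading of the three constraints rather than the authors' implementation: the function name `consistent_strings`, the 0-based `suffix_dist` array (where `suffix_dist[i]` is the distance between the parts of x and y not yet covered by a prefix of length i) and the use of a Python list for the output are all assumptions.

```python
ALPHABET = "ACGT"


def consistent_strings(x, y, d):
    """Enumerate N_d(x, y): strings within Hamming distance d of both x and y."""
    l = len(x)
    # suffix_dist[i] = Hamming distance between x[i:] and y[i:]
    suffix_dist = [0] * (l + 1)
    for i in range(l - 1, -1, -1):
        suffix_dist[i] = suffix_dist[i + 1] + (x[i] != y[i])

    results = []

    def extend(prefix, dx, dy):
        i = len(prefix)
        # Constraints 1-3: prune as soon as no completion can succeed.
        if dx > d or dy > d or 2 * d - dx - dy < suffix_dist[i]:
            return
        if i == l:
            results.append("".join(prefix))
            return
        for c in ALPHABET:
            prefix.append(c)
            extend(prefix, dx + (c != x[i]), dy + (c != y[i]))
            prefix.pop()

    extend([], 0, 0)
    return results


print(consistent_strings("CAAT", "CGAT", 1))
```

Run on the two strings of Fig. 1, the sketch visits only the retained branches of the pruning tree and returns the four consistent strings listed in the example above.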

Theorem 1. Given two length-l substrings x and y with Hamming distance k, the algorithm ConsistentPat enumerates all consistent strings of x and y in $O(|N_d(x, y)|)$ time.

Proof. The enumeration of consistent strings can be regarded as a quadtree T, as illustrated in Fig. 1. Each edge of T is labeled with one character from Σ. A consistent string is spelled out by concatenating the edge labels on the path from the root to a leaf. A node in T is extended if and only if the three constraints are satisfied; thus, if all three constraints hold at a node v, there must exist a consistent string in the subtree rooted at v. Therefore, the number of nodes in T is $O(|N_d(x, y)|)$. □

CVoting combines ConsistentPat and Voting; its pseudocode is presented in Algorithm 2.

Algorithm 2. CVoting
Input: n sequences of length m over the alphabet Σ, parameters l and d
Output: All motifs
  Create two hash tables V and R
  for j = 1 to m − l + 1 do
    Set the value of each entry of V and R to 1
    for i = 2 to n do
      for t = 1 to m − l + 1 do
        for all consistent strings x of $s_1^j$ and $s_i^t$ do
          if R[H(x)] ≠ i then
            V[H(x)] ← V[H(x)] + 1
            R[H(x)] ← i
          end if
        end for
      end for
    end for
    for all x ∈ $N_d(s_1^j)$ do
      if V[H(x)] = n then
        output x
      end if
    end for
  end for
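The structure of Algorithm 2 can be sketched as follows, under clearly stated simplifications: Python dictionaries replace the collision-free hash H and the tables V and R, entries that are never touched are treated as holding the initial value 1, and at least two sequences are assumed. The names `cvoting`, `consistent_strings` and `brute_force` are hypothetical; `brute_force` is included only as a reference for tiny inputs.

```python
from itertools import product

ALPHABET = "ACGT"


def hamming(a, b):
    return sum(p != q for p, q in zip(a, b))


def consistent_strings(x, y, d):
    """Yield every string within distance d of both x and y (pruned search)."""
    l = len(x)
    suffix = [0] * (l + 1)                 # suffix[i] = Hamming(x[i:], y[i:])
    for i in range(l - 1, -1, -1):
        suffix[i] = suffix[i + 1] + (x[i] != y[i])

    def extend(prefix, dx, dy):
        i = len(prefix)
        if dx > d or dy > d or 2 * d - dx - dy < suffix[i]:
            return
        if i == l:
            yield prefix
            return
        for c in ALPHABET:
            yield from extend(prefix + c, dx + (c != x[i]), dy + (c != y[i]))

    yield from extend("", 0, 0)


def cvoting(seqs, l, d):
    """Return all (l, d)-motifs of seqs (n >= 2 assumed), sorted."""
    n = len(seqs)
    motifs = set()
    first = seqs[0]
    for j in range(len(first) - l + 1):
        x = first[j:j + l]
        votes, last_voter = {}, {}          # V and R, reset for every j
        for i in range(1, n):
            s = seqs[i]
            for t in range(len(s) - l + 1):
                for z in consistent_strings(x, s[t:t + l], d):
                    if last_voter.get(z) != i:
                        votes[z] = votes.get(z, 1) + 1   # untouched entries hold 1
                        last_voter[z] = i
        motifs.update(z for z, v in votes.items() if v == n)
    return sorted(motifs)


def brute_force(seqs, l, d):
    """Reference answer by testing all 4^l strings; only for tiny l."""
    def occurs(z, s):
        return any(hamming(z, s[t:t + l]) <= d for t in range(len(s) - l + 1))
    candidates = ("".join(z) for z in product(ALPHABET, repeat=l))
    return sorted(z for z in candidates if all(occurs(z, s) for s in seqs))


seqs = ["AACGT", "TACGA", "CACGG"]
print(cvoting(seqs, l=3, d=1) == brute_force(seqs, l=3, d=1))
```

At any moment only the votes for neighbors of one substring of the first sequence are held, which is the source of the O(N(l, d)) term in the space bound.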


Fig. 2. (a) The regions M and H denote the positions where x matches and mismatches y, respectively. The sizes of M and H are l − k and k, respectively. (b) $P_1$ denotes the region of size i in M where z differs from x and y. In H, $P_2$ and $P_3$ are the regions where z mismatches x, and $P_3$ and $P_4$ are the regions where z mismatches y. The sizes of $P_2$, $P_3$ and $P_4$ are $k - j_2$, $j_1 + j_2 - k$ and $k - j_1$, respectively.

Lemma 2. $|N_d(x, y)| = O\left((3l)^d \left(\frac{4}{\sqrt{3l}}\right)^k\right)$.

Proof. Consider two strings x and y of length l with Hamming distance k. The positions 1, 2, ..., l can be partitioned into two disjoint sets $M = \{t \mid x_t = y_t\}$ and $H = \{t \mid x_t \ne y_t\}$, as illustrated in Fig. 2a. For a given consistent string z of x and y, suppose z differs from x and y at i positions in M, and let $j_1$ and $j_2$ denote the number of positions in H at which z differs from x and from y, respectively, as illustrated in Fig. 2b. Since $x_t \ne y_t$ for any position t in H and the size of H is k, $j_1 + j_2 \ge k$.

Let us first consider the positions in M. The i positions can be selected from the l − k positions in $\binom{l-k}{i}$ ways. At each of these positions there are |Σ| − 1 = 3 symbols to choose from, so there are $\binom{l-k}{i} 3^i$ combinations. Because $i + j_1 \le d$, $i + j_2 \le d$ and $j_1 + j_2 \ge k$, we have $2i + k \le 2d$.

For the positions in H, the $j_1$ positions can be chosen from H in $\binom{k}{j_1}$ ways. Among the $j_1$ positions, there are $j_1 + j_2 - k$ positions where z differs from both x and y and $k - j_2$ positions where z is the same as y. Therefore, the number of combinations in the region H is $\binom{k}{j_1} \binom{j_1}{j_1 + j_2 - k} 2^{j_1 + j_2 - k}$. Since $j_2 \le d - i$ and $j_1 + j_2 \ge k$, $j_1 \ge k - d + i$. Summing the number of all the combinations, we obtain

$$|N_d(x, y)| = \sum_{i=0}^{\lfloor (2d-k)/2 \rfloor} \binom{l-k}{i} 3^i \sum_{j_1=k-d+i}^{d-i} \; \sum_{j_2=k-j_1}^{d-i} \binom{k}{j_1} \binom{j_1}{j_1+j_2-k} 2^{j_1+j_2-k} \qquad (1)$$

This equation has also been derived by Satya and Mukherjee [28]. Note that

$$\sum_{j_1=k-d+i}^{d-i} \; \sum_{j_2=k-j_1}^{d-i} \binom{k}{j_1} \binom{j_1}{j_1+j_2-k} 2^{j_1+j_2-k} = \sum_{j_1=k-d+i}^{d-i} \binom{k}{j_1} \sum_{j_1+j_2-k=0}^{d-i+j_1-k} \binom{j_1}{j_1+j_2-k} 2^{j_1+j_2-k} \le \sum_{j_1=k-d+i}^{d-i} \binom{k}{j_1} \sum_{t=0}^{j_1} \binom{j_1}{t} 2^t 1^{j_1-t} = \sum_{j_1=k-d+i}^{d-i} \binom{k}{j_1} 3^{j_1}$$

Simplifying Eq. (1) yields:

$$|N_d(x, y)| \le \sum_{i=0}^{\lfloor (2d-k)/2 \rfloor} \binom{l-k}{i} 3^i \sum_{j_1=k-d+i}^{d-i} \binom{k}{j_1} 3^{j_1} = \sum_{i=0}^{\lfloor (2d-k)/2 \rfloor} \binom{l-k}{i} 3^i \, O(4^k) = O\left((3l)^{d-k/2} 4^k\right) = O\left((3l)^d \left(\frac{4}{\sqrt{3l}}\right)^k\right).$$

Lemma 3. The expected value of $|N_d(x, y)|$ is $E(|N_d(x, y)|) = O\left((3l)^d \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l\right)$.

Proof. According to Observation 2 and Lemma 2,

$$E(|N_d(x, y)|) = \sum_{k=0}^{\min(2d,l)} \binom{l}{k} \left(\frac{1}{4}\right)^{l-k} \left(\frac{3}{4}\right)^k O\left((3l)^d \left(\frac{4}{\sqrt{3l}}\right)^k\right) = O\left((3l)^d\right) \sum_{k=0}^{\min(2d,l)} \binom{l}{k} \left(\frac{1}{4}\right)^{l-k} \left(\sqrt{\frac{3}{l}}\right)^k$$

$$\le O\left((3l)^d\right) \sum_{k=0}^{l} \binom{l}{k} \left(\frac{1}{4}\right)^{l-k} \left(\sqrt{\frac{3}{l}}\right)^k = O\left((3l)^d \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l\right)$$

From Observation 1, k ≤ 2d. Since k ≤ l, it follows that k ≤ min(2d, l).
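Eq. (1) and the quantities in Lemma 3 are easy to evaluate numerically, which gives a quick sanity check of the derivation. The sketch below is illustrative only: the function names are hypothetical, index ranges are clipped at 0 so that every binomial coefficient is well defined, and the last function evaluates the factor from Lemma 3 without any hidden big-O constant, so the printed products with m = 600 are indicative rather than exact thresholds.

```python
from math import comb, sqrt


def consistent_count(l, d, k):
    """|N_d(x, y)| from Eq. (1) for two strings at Hamming distance k <= l."""
    total = 0
    for i in range((2 * d - k) // 2 + 1):          # empty range when k > 2d
        inner = 0
        for j1 in range(max(0, k - d + i), d - i + 1):
            for j2 in range(max(0, k - j1), d - i + 1):
                t = j1 + j2 - k                    # positions differing from both
                inner += comb(k, j1) * comb(j1, t) * 2 ** t
        total += comb(l - k, i) * 3 ** i * inner
    return total


def neighborhood_size(l, d):
    """N(l, d), the size of a full d-neighborhood."""
    return sum(comb(l, i) * 3 ** i for i in range(d + 1))


def expected_consistent(l, d):
    """E|N_d(x, y)| for uniformly random x and y (Observation 2 weights)."""
    return sum(comb(l, k) * 0.25 ** (l - k) * 0.75 ** k * consistent_count(l, d, k)
               for k in range(min(2 * d, l) + 1))


def pruning_factor(l):
    """The factor (1/4 + sqrt(3/l))^l appearing in Lemma 3."""
    return (0.25 + sqrt(3 / l)) ** l


for l, d in [(13, 4), (15, 5), (17, 6), (19, 7)]:
    ratio = expected_consistent(l, d) / neighborhood_size(l, d)
    print(f"({l}, {d}): E|N_d(x,y)| / N(l,d) = {ratio:.2e}, "
          f"600 * factor = {600 * pruning_factor(l):.3f}")
```

For k = 0 the count reduces to N(l, d), and for the strings of Fig. 1 (l = 4, d = 1, k = 1) it gives 4, matching the example.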


Theorem 4. The worst-case time and space complexities of the CVoting algorithm are $O(m^2 n (3l)^d)$ and $O(mn + (3l)^d)$, respectively. The average time complexity of the CVoting algorithm is $O\left(m^2 n (3l)^d \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l\right)$.

Proof. CVoting needs to store the n input sequences and the d-neighbors of one length-l substring, so the space complexity of CVoting is $O(mn + N(l, d)) = O(mn + (3l)^d)$. In the worst case, ConsistentPat cannot prune any strings, so the worst-case time complexity of ConsistentPat is $O(N(l, d)) = O((3l)^d)$. Thus the worst-case time complexity of CVoting is $O(m^2 n (3l)^d)$. According to Lemma 3, the average time complexity of ConsistentPat is $O\left((3l)^d \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l\right)$, so the average time complexity of CVoting is $O\left(m^2 n (3l)^d \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l\right)$. □

When l ≥ 6, we have $\frac{4}{\sqrt{3l}} \le 1$, which means that the bound on $|N_d(x, y)|$ decreases as k increases. In the worst case, the n sequences are the same, i.e. k = 0; thus $|N_d(x, y)| = N(l, d)$ and the pruning strategy is ineffective, resulting in the worst performance of CVoting. This worst-case time complexity is the same as that of Pmsprune and larger than that of Voting, as the time complexities of Pmsprune and Voting are $O(m^2 n N(l, d))$ and O(mnN(l, d)), respectively. However, the average time complexity of CVoting is lower than Voting's if $m \left(\frac{1}{4} + \sqrt{\frac{3}{l}}\right)^l < 1$, because the worst-case and average time complexities of Voting are both $O(mn(3l)^d)$.

3. Experimental results

In this section, CVoting is compared with four state-of-the-art exact algorithms (Voting, Pmsprune, PMS1 and RISOTTO) in solving challenging instances. All experiments were conducted on a Linux server with a Core i5-2400 3.1 GHz CPU and 6 GB RAM. For each experiment, we generated 10 random datasets. The results were obtained through 10 runs.
Similar to the simulations presented in [6,7], we first tested instances of 20 DNA sequences of length 600 bp in three different cases: randomly generated sequences with at most d variations in each planted motif; randomly generated sequences with exactly d variations in each planted motif; and sequences generated with occurrence probabilities of A, T, G and C in the ratio 3:3:2:2. To make the datasets more realistic, we obtained the DNA data of yeast and drosophila from the NCBI database and measured the frequency of each character over the whole genome. The occurrence probabilities of A, T, G and C in these two species are 0.31:0.31:0.19:0.19 and 0.29:0.29:0.21:0.21, respectively. We therefore chose similar occurrence probabilities, i.e. 3:3:2:2, to test the performance of the algorithms on datasets with a non-uniform distribution of characters.

Table 1 shows the CPU times of the five algorithms with respect to different l and d for the three cases, where '–' indicates that the program could not finish within 12 h or used more than the available memory. The results show that CVoting is much faster than PMS1, RISOTTO and Pmsprune in all cases. CVoting is also faster than Voting in the first two cases, and in the third case except on (13, 4) instances. Note that the running times in case 1 and case 2 are similar for each algorithm. For Voting, the running times for all three cases are almost the same because it enumerates all potential motifs rather than applying a pruning strategy. The other four algorithms run slower in the third case because the occurrence probabilities of A, C, G and T affect the efficiency of each algorithm's pruning strategy.
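For readers who want to reproduce this kind of benchmark, a planted instance for the three cases could be generated along the following lines. This is a hedged sketch, not the generator used in the experiments: the function names, the seeding, and in particular the rule that "at most d variations" draws the number of mutated positions uniformly from 0..d are assumptions.

```python
import random

ALPHABET = "ACGT"


def mutate(motif, d, rng, exact=False):
    """Return a copy of motif with exactly d (or, by assumption, 0..d) substitutions."""
    r = d if exact else rng.randint(0, d)
    out = list(motif)
    for p in rng.sample(range(len(motif)), r):
        out[p] = rng.choice([c for c in ALPHABET if c != motif[p]])
    return "".join(out)


def planted_instance(n, m, l, d, weights=(1, 1, 1, 1), exact=False, seed=0):
    """Generate n sequences of length m, each containing one mutated motif copy.

    weights gives the relative frequencies of A, C, G, T, in that order.
    """
    rng = random.Random(seed)
    motif = "".join(rng.choices(ALPHABET, weights=weights, k=l))
    seqs = []
    for _ in range(n):
        s = rng.choices(ALPHABET, weights=weights, k=m)
        pos = rng.randrange(m - l + 1)             # planting position
        s[pos:pos + l] = mutate(motif, d, rng, exact)
        seqs.append("".join(s))
    return motif, seqs


# Case 1: uniform letters, at most d variations.
motif, seqs = planted_instance(n=20, m=600, l=15, d=4)
# Case 3: A:T:G:C = 3:3:2:2, i.e. weights (A, C, G, T) = (3, 2, 2, 3).
motif3, seqs3 = planted_instance(n=20, m=600, l=15, d=4, weights=(3, 2, 2, 3))
print(motif, len(seqs), len(seqs[0]))
```

Setting `exact=True` gives the second case, in which every planted copy carries exactly d substitutions.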

Table 1
CPU time comparison of five algorithms on solving various planted (l, d)-motif search instances of 20 sequences with length 600 bp, averaged over 10 independent runs.

| (l, d) | Case | PMS1 Mean | PMS1 Stdv | RISOTTO Mean | RISOTTO Stdv | Voting Mean | Voting Stdv | Pmsprune Mean | Pmsprune Stdv | CVoting Mean | CVoting Stdv |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (13, 4) | 1 | 32.4 s | 0.89 s | 3.89 m | 0.01 m | 12 s | 0.52 s | 22 s | 0.56 s | 7.51 s | 0.12 s |
| (13, 4) | 2 | 31.6 s | 0.5 s | 3.9 m | 0.02 m | 12 s | 0.32 s | 22 s | 0.63 s | 7.57 s | 0.16 s |
| (13, 4) | 3 | 48 s | 0.45 s | 4.06 m | 0.03 m | 12 s | 0.42 s | 28 s | 1.08 s | 13.7 s | 0.88 s |
| (15, 5) | 1 | 10.45 m | 0.05 m | 41.12 m | 0.47 m | 2.57 m | 0.02 m | 4.95 m | 0.06 m | 56 s | 1.13 s |
| (15, 5) | 2 | 10.45 m | 0.04 m | 39.91 m | 0.19 m | 2.55 m | 0.02 m | 4.96 m | 0.06 m | 55.5 s | 0.72 s |
| (15, 5) | 3 | 15.7 m | 0.09 m | 42.41 m | 0.7 m | 2.57 m | 0.02 m | 6.89 m | 0.4 m | 1.66 m | 0.82 m |
| (17, 6) | 1 | – | – | 6.62 h | 0.07 h | – | – | 59.78 m | 0.69 m | 8.89 m | 0.12 m |
| (17, 6) | 2 | – | – | 6.48 h | 0.03 h | – | – | 59.36 m | 0.62 m | 8.73 m | 0.19 m |
| (17, 6) | 3 | – | – | 7 h | 0.15 h | – | – | 1.54 h | 0.12 h | 14.5 m | 0.82 m |
| (19, 7) | 1 | – | – | – | – | – | – | 11.74 h | 0.16 h | 1.27 h | 0.01 h |
| (19, 7) | 2 | – | – | – | – | – | – | 11.68 h | 0.23 h | 1.26 h | 0.01 h |
| (19, 7) | 3 | – | – | – | – | – | – | – | – | 2.06 h | 0.24 h |

Case 1: Randomly generated sequences with at most d variations in each planted motif.
Case 2: Randomly generated sequences with exactly d variations in each planted motif.
Case 3: The occurrence probabilities of A, T, G, C are 3:3:2:2 and each planted motif has at most d variations.

Table 2
Space usage comparison of five algorithms on solving various planted (l, d)-motif search instances.

| (l, d) | PMS1 | RISOTTO | Voting | Pmsprune | CVoting |
|---|---|---|---|---|---|
| (13, 4) | 1.67 GB | 2.3 MB | 257 MB | 28 MB | 0.24 MB |
| (15, 5) | 1.67 GB | 2.3 MB | 4 GB | 29 MB | 2 MB |
| (17, 6) | – | 2.3 MB | – | 30 MB | 21 MB |
| (19, 7) | – | – | – | 31 MB | 254 MB |

Table 3
Execution time comparison of five algorithms on solving planted (l, d)-motif search instances of different l and d, averaged over 10 independent runs.

| l | d | PMS1 Mean | PMS1 Stdv | RISOTTO Mean | RISOTTO Stdv | Voting Mean | Voting Stdv | Pmsprune Mean | Pmsprune Stdv | CVoting Mean | CVoting Stdv |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 5 | 10.45 m | 0.05 m | 41.12 m | 0.47 m | 2.57 m | 0.02 m | 4.95 m | 0.06 m | 56 s | 1.13 s |
| 15 | 4 | 25 s | 0.42 s | 2.6 m | 0.03 m | 51 s | 0.52 s | 1.2 s | 0.07 s | 1.04 s | 0.02 s |
| 15 | 3 | 2 s | 0.48 s | 9.04 s | 0.1 s | 30 s | 0.48 s | 0.6 s | 0.03 s | 0.04 s | 0.002 s |
| 17 | 6 | – | – | 6.62 h | 0.07 h | – | – | 59.78 m | 0.69 m | 8.89 m | 0.12 m |
| 17 | 5 | – | – | 26.3 m | 0.21 m | – | – | 13.6 s | 0.55 s | 10.04 s | 0.17 s |
| 17 | 4 | – | – | 1.7 m | 0.02 m | – | – | 0.4 s | 0.01 s | 0.34 s | 0.007 s |
| 19 | 7 | – | – | – | – | – | – | 11.74 h | 0.16 h | 1.27 h | 0.01 h |
| 19 | 6 | – | – | 4.2 h | 0.024 h | – | – | 3.48 m | 0.08 m | 2.07 m | 0.04 m |
| 19 | 5 | – | – | 17.2 m | 0.09 m | – | – | 1 s | 0.06 s | 4.59 s | 0.14 s |

Table 4
Comparison of the running times of the five algorithms on (15, 5) motif search instances with different numbers of sequences, n, and different lengths, m, averaged over 10 independent runs.

| n | m | PMS1 Mean | PMS1 Stdv | RISOTTO Mean | RISOTTO Stdv | Voting Mean | Voting Stdv | Pmsprune Mean | Pmsprune Stdv | CVoting Mean | CVoting Stdv |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 600 | 10.45 m | 0.05 m | 41.12 m | 0.47 m | 2.57 m | 0.02 m | 4.95 m | 0.06 m | 56 s | 1.13 s |
| 20 | 800 | 19.07 m | 0.2 m | 1.04 h | 0.006 h | 3.37 m | 0.01 m | 12.87 m | 0.17 m | 2.07 m | 0.02 m |
| 20 | 1000 | 37.97 m | 0.8 m | 1.45 h | 0.006 h | 4.1 m | 0.01 m | 23.55 m | 0.31 m | 3.9 m | 0.07 m |
| 40 | 600 | 9.98 m | 0.17 m | 1.18 h | 0.016 h | 4.87 m | 0.03 m | 6.25 m | 0.09 m | 56.9 s | 1.16 s |
| 40 | 800 | 18.23 m | 0.15 m | 1.93 h | 0.02 h | 6.32 m | 0.03 m | 17.8 m | 0.26 m | 2.08 m | 0.02 m |
| 40 | 1000 | 37.87 m | 0.34 m | 2.72 h | 0.014 h | 7.5 m | 0.03 m | 38.08 m | 0.43 m | 3.98 m | 0.09 m |
| 60 | 600 | 9.93 m | 0.21 m | 1.65 h | 0.005 h | 7.02 m | 0.03 m | 7.22 m | 0.09 m | 56 s | 1.35 s |
| 60 | 800 | 18.25 m | 0.11 m | 2.78 h | 0.013 h | 9.28 m | 0.06 m | 21.65 m | 0.44 m | 2.08 m | 0.03 m |
| 60 | 1000 | 38.01 m | 0.33 m | 4.03 h | 0.08 h | 11.05 m | 0.07 m | 50.3 m | 0.64 m | 3.98 m | 0.03 m |

Table 2 shows the space usage of the five algorithms. Since the space usages for all three cases are similar, we only show the usage in the first case. The space usages of RISOTTO and Pmsprune are almost constant across the different instance sizes. RISOTTO uses about 1/10 of the space of Pmsprune; their space complexities are $O(n^2 m)$ and $O(nm^2)$, respectively. The space usage of PMS1 is related to 4l, thus PMS1 uses the same space for (13, 4) and (15, 5). Consistent with our space complexity analysis, the space requirement of Voting becomes very large when l ≥ 15, whereas CVoting uses roughly 0.1% or less of the space of Voting (0.24 MB versus 257 MB for (13, 4), and 2 MB versus 4 GB for (15, 5)). Although the space usage of Pmsprune is almost constant for fixed n and m, CVoting uses less space than Pmsprune in all cases except (19, 7).

Table 3 shows the execution times of the five algorithms for solving instances of different l and d. For a fixed l, the execution times of all algorithms increase sharply as d increases. Compared with the other algorithms, CVoting is the fastest in every instance except (19, 5), where Pmsprune takes 1 s and CVoting 4.59 s.

Finally, Table 4 shows the results on (15, 5) instances with different numbers of sequences, n, and sequence lengths, m. CVoting achieves a significant improvement in running time over the four other algorithms; for example, it is roughly 5 to 13 times faster than Pmsprune. For a fixed n, their execution times all increase superlinearly with m. For a fixed m, the execution times of PMS1 and CVoting are about the same for different n, whereas the execution times of the other three algorithms increase proportionally with n at different rates.

4. Summary

In this paper, a new exact algorithm, CVoting, is proposed for solving planted (l, d)-motif search problems. The algorithm has attractive theoretical properties in terms of both low space complexity and low average time complexity. In simulations, when compared with four of the best existing algorithms (PMS1, RISOTTO, Voting and Pmsprune), CVoting outperforms them all significantly on challenging benchmark instances. The software implementing the proposed algorithm is publicly available at http://staff.ustc.edu.cn/xuyun/motif.

Acknowledgements

We thank Linbin Yu, Yiming Lei and Mingzhi Shao for their helpful suggestions for our article. This work is supported in part by the National Natural Science Foundation of China (Nos. 61033009 and 60970085).

References

[1] M.M. Abbas, M. Abouelhoda, H.M. Bahig, A hybrid method for the exact planted (l,d) motif finding problem and its parallelization, BMC Bioinformatics 13 (Suppl 17) (2012) S10.
[2] T.L. Bailey, C. Elkan, Unsupervised learning of multiple motifs in biopolymers using expectation maximization, Machine Learning 21 (1995) 51–80.
[3] S. Bandyopadhyay, S. Sahni, S. Rajasekaran, PMS6: a fast algorithm for motif discovery, in: Proceedings of the 2nd International Conference on Computational Advances in Bio and Medical Sciences, 2012, pp. 1–6.
[4] A. Ben-Hur, D. Brutlag, Sequence motifs: highly predictive features of protein function, Feature Extraction series 207 (2006) 625–645.
[5] Z.-Z. Chen, L. Wang, Fast exact algorithms for the closest string and substring problems with application to the planted (l,d)-motif model, IEEE/ACM Transactions on Computational Biology and Bioinformatics 8 (5) (2011) 1400–1410.
[6] F.Y.L. Chin, H.C.M. Leung, Voting algorithms for discovering long motifs, in: Proceedings of Third Asia–Pacific Bioinformatics Conference (APBC'05), 2005, pp. 261–271.
[7] J. Davila, S. Balla, S. Rajasekaran, Fast and practical algorithms for planted (l,d) motif search, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (4) (2007) 544–552.
[8] P. D'haeseleer, What are DNA sequence motifs?, Nature Biotechnology 24 (2006) 423–425.
[9] H. Dinh, S. Rajasekaran, V.K. Kundeti, PMS5: an efficient exact algorithm for the (l,d)-motif finding problem, BMC Bioinformatics 12 (2011) 410.
[10] E. Eskin, P.A. Pevzner, Finding composite regulatory patterns in DNA sequences, Bioinformatics 18 (1) (2002) 353–363.
[11] P.A. Evans, A.D. Smith, H.T. Wareham, On the complexity of finding common approximate substrings, Theoretical Computer Science 306 (2003) 407–430.
[12] P.A. Evans, A.D. Smith, Toward optimal motif enumeration, Algorithms and Data Structures 2748 (2003) 47–58.
[13] R. Hassan, R.M. Othman, P. Saad, S. Kasim, A compact hybrid feature vector for an accurate secondary structure prediction, Information Sciences 181 (23) (2011) 5267–5277.
[14] E.S. Ho, C.D. Jakubowski, S.I. Gunderson, iTriplet, a rule-based nucleic acid sequence motif finder, Algorithms for Molecular Biology 4 (2009) 14.
[15] C.-W. Huang, W.-S. Lee, S.-Y. Hsieh, An improved heuristic algorithm for finding motif signals in DNA sequences, IEEE/ACM Transactions on Computational Biology and Bioinformatics 8 (4) (2011) 959–975.
[16] M. Kaytoue, S.O. Kuznetsov, A. Napoli, S. Duplessis, Mining gene expression data with pattern structures in formal concept analysis, Information Sciences 181 (10) (2011) 1989–2001.
[17] C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, J.C. Wootton, Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, Science 262 (1993) 208–214.
[18] C.E. Lawrence, A.A. Reilly, An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, Proteins: Structure, Function and Bioinformatics 7 (1990) 41–51.
[19] G. Pavesi, G. Mauri, G. Pesole, An algorithm for finding signals of unknown length in DNA sequences, Bioinformatics 17 (Suppl 1) (2001) S207–S214.
[20] P.A. Pevzner, S.H. Sze, Combinatorial approaches to finding subtle signals in DNA sequences, in: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 2000, pp. 269–278.
[21] N. Pisanti, A.M. Carvalho, L. Marsan, M.-F. Sagot, RISOTTO: fast extraction of motifs with mismatches, LATIN 2006: Theoretical Informatics 3887 (2006) 757–768.
[22] S. Polo, S. Sigismund, M. Faretta, M. Guidi, M.R. Capua, G. Bossi, H. Chen, P. De Camilli, P.P. Di Fiore, A single motif responsible for ubiquitin recognition and monoubiquitination in endocytic proteins, Nature 416 (2002) 451–455.
[23] A. Price, S. Ramabhadran, P.A. Pevzner, Finding subtle motifs by branching from sample strings, Bioinformatics 19 (Suppl 2) (2003) ii149–ii155.
[24] S. Rajasekaran, S. Balla, C.-H. Huang, Exact algorithms for planted motif challenge problems, Journal of Computational Biology 12 (8) (2005) 1117–1128.
[25] J. Buhler, M. Tompa, Finding motifs using random projections, Journal of Computational Biology 9 (2) (2002) 225–242.
[26] M.-F. Sagot, Spelling approximate repeated or common motifs using a suffix tree, Latin'98: Theoretical Informatics 1380 (1998) 111–127.
[27] M. Sassanfar, J.W. Szostak, An RNA motif that binds ATP, Nature 364 (1993) 550–553.
[28] R.V. Satya, A. Mukherjee, New algorithms for finding monad patterns in DNA sequences, String Processing and Information Retrieval 3246 (2004) 273–285.
[29] H.Q. Sun, M.Y.H. Low, W.J. Hsu, J.C. Rajapakse, RecMotif: a novel fast algorithm for weak motif discovery, BMC Bioinformatics 11 (Suppl. 11) (2010) S8.
[30] H.Q. Sun, M.Y.H. Low, W.J. Hsu, C.W. Tan, J.C. Rajapakse, Tree-structured algorithm for long weak motif discovery, Bioinformatics 27 (19) (2011) 2641–2647.
[31] M.M. Yin, J.T.L. Wang, Effective hidden Markov models for detecting splicing junction sites in DNA sequences, Information Sciences 139 (1–2) (2001) 139–163.
[32] Q. Yu, H. Huo, Y. Zhang, H. Guo, PairMotif: a new pattern-driven algorithm for planted (l,d) DNA motif search, PLoS ONE 7 (10) (2012) e48442.