A BIT-STRING LONGEST-COMMON-SUBSEQUENCE ALGORITHM

Lloyd ALLISON and Trevor I. DIX
Department of Computer Science, University of Western Australia, Nedlands, Western Australia 6000

Information Processing Letters 23 (1986) 305-310, North-Holland, 3 December 1986

Communicated by M.A. Harrison
Received 7 October 1985; revised 17 December 1985

A longest-common-subsequence algorithm is described which operates in terms of bit or bit-string operations. It offers a speedup of the order of the word-length on a conventional computer.

Keywords: Longest common subsequence, edit distance, bit string

1. Introduction

The longest-common-subsequence (LCS) problem is to find the maximum possible length of a common subsequence of two strings, "a" of length |a| and "b" of length |b|. Usually, an actual LCS is also required. For example, using the alphabet A, C, G, and T of genetic bases, an LCS of "GCTAT" and "CGATTA" is "GTT", of length three. Here, an algorithm is presented which requires O(|a| x |b|) operations on single bits, or O(⌈|a|/w⌉ x |b|) operations on w-bit computer words, or O(|b|) operations on |a|-bit bit-strings, for a fixed finite alphabet. Although falling into the same complexity class as simple LCS algorithms, if w is greater than any additional multiplicative cost, this algorithm will be faster. If |a| ≤ w, the algorithm is effectively linear in |b|. (An alphabet larger than |b| can effectively be reduced to |b| by sorting "b" in O(|b| x log |b|) time and using index positions in the sorted string.)

The LCS problem is related to the edit-distance [11] or evolutionary-distance problem [9,10], where the minimum cost of editing string "a" to string "b" must be found. The elementary edit operations are to insert or delete one character or to change one character. There is a cost function, d(·, ·), which can be extended to strings in the obvious way. A common choice for the elementary edit costs is

    d(α, α) = 0,
    d(α, β) = 2 if α ≠ β    (cost of changing character α to β),
    d(α, -) = 1             (cost of deleting α),
    d(-, α) = 1             (cost of inserting α),

and then d(a, b) = |a| + |b| - 2 x |LCS(a, b)|. For the example above, d("GCTAT", "CGATTA") = 5 + 6 - 2 x 3 = 5. The LCS problem and the edit-distance problem find applications in computer science and molecular biology [8,9,10,11,12]. An LCS algorithm can compare two files and, by finding what they have in common, compute their differences. It can be used to correct spelling errors and to compare two genetic sequences for homology (similarity). Under plausible assumptions [1,13], the worst-case complexity of an LCS or edit-distance algorithm on strings over an infinite alphabet must be O(|a| x |b|) time. Masek and Paterson [5] give an O(|a| x |b| / log(min{|a|, |b|})) edit-distance algorithm for strings over a finite alphabet, with modest restrictions on d(·, ·). However, this is faster than simple algorithms only for strings longer than about 200000 characters. For special cases there are faster algorithms: for similar strings [3,6], and for strings with few matching characters [4], but their worst-case running time is at least O(|a| x |b|). The algorithm presented here certainly has complexity O(|a| x |b|) for a finite alphabet, but asymptotically the improvement in speed is close to w. The algorithm is insensitive to the number of matches between the strings and to their similarity.

1.1. Matrix L

Conventionally, a matrix L is defined, with row i corresponding to b_i and column j to a_j; in the figures below, column a_1 is at the right and row b_1 at the bottom. L(i, j) equals the length of an LCS of a_1,...,a_j and b_1,...,b_i:

    L(i, j) = 1 + L(i-1, j-1)                if a_j = b_i,
    L(i, j) = max{L(i-1, j), L(i, j-1)}      otherwise.

This leads to the simple dynamic-programming algorithms [8] upon which this one is based. A reflected representation of matrix L is typically used; the representation above is chosen to make the reading of certain bit-strings easier. L has many interesting properties:

    L(i-1, j-1) ≤ L(i, j-1),    L(i-1, j) ≤ L(i, j),    |L(i, j) - L(i-1, j-1)| ≤ 1.

The rows and columns of L contain ascending values for increasing i and j. This prompted Hunt and Szymanski [4] and Hirschberg [3] to 'contour' L and to develop fast algorithms for special cases.

2. Bit-string algorithm

The values in the rows of L increase by at most one. This makes it possible to represent the information in L by a bit-matrix M, as shown in Fig. 1:

    L(i, j) = sum over k = 1,...,j of M(i, k).

Row i has either the same number of bits set, or one more bit set, than the previous row, row i-1. New bits are set towards the left of a row. The length of an LCS of "a" and "b" is the number of bits set in the top row. Note that 1's in a particular row tend to drift to the right as we look (up) at the next row. This is because when more of "b" is used, an LCS of a given length can be found using no more of, and possibly less of, "a". The 1's mark the contour lines of matrix L.

    #bits                         string b
     10    0101100010111111        T
     10    1001100010111111        G
      9    0001100010111111        T
      9    0100100010111111        T
      8    0010001000111111        C
      7    0000100010011111        T    <- row 11
      7    1000000100011111        A    <- row 10
      7    1000010000011111        G
      6    0000010100001111        A
      5    0000000100001111        A
      5    0000100000001111        T
      4    0000000010000111        T
      3    0000000000100011        C
      3    1000000000010001        G
      2    0000000100000100        A
      1    0000000000000100        T

           GTCTTACATCCGTTCG   string a

Fig. 1. Matrix M(i, j); row 16 at the top, row 1 at the bottom.

2.1. Alphabet-strings

Each letter in the alphabet defines a bit-string when it is compared with the elements of "a":

    a:        G T C T T A C A T C C G T T C G
    A-string: 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
    C-string: 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0
    G-string: 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
    T-string: 0 1 0 1 1 0 0 0 1 0 0 0 1 1 0 0

Precomputing these alphabet-strings contributes O(|alphabet| x ⌈|a|/w⌉ + |a|) to the time complexity. For a fixed alphabet, this is O(|a|); for a nonfixed alphabet, this could be O(|a| x |b|) at worst. If the alphabet is small (|alphabet| << |b|), the contribution to the total time can be ignored.

2.2. Matrix M

To calculate row i, we use the ith character of string "b", b_i, to select the b_i-string from the set of alphabet-strings. The 1's in row i-1 'cut' the b_i-string into segments:

    row10:    1000000 1000 1 1 1 1 1
    T-string: 0101100 0100 0 1 1 0 0

Each segment extends from the position of a 1 in row i-1 rightward to the position to the left of the next 1. If the left-hand bit of row i-1 is zero, the leftmost segment extends from the left of the b_i-string to the position left of the first 1. Row i is formed by setting only the rightmost 1 bit in each segment of the b_i-string. If a segment is all zero, the bit defining the left end of the segment should be set in row i (that is, the segmenting bit in row i-1). This can be ensured by first or-ing row i-1 into the b_i-string:

    T-string v row10: 1101100 1100 1 1 1 1 1
                          *      *  * * * * *
    row11:            0000100 0100 1 1 1 1 1

    (* indicates a rightmost 1.)

It may be convenient to imagine an invisible 1 at position |a|+1 for a leftmost segment which is all zero, but this is unnecessary for the algorithm. A 1 in row i-1 marks the shortest piece of "a" that can be used to make an LCS of a certain length with b_1,...,b_{i-1}. Bringing b_i into use, the best that can be done to extend that LCS is to use the first b_i in "a" to the left of the 1, if possible. This is marked by the rightmost 1 bit in the next segment to the left. In more detail, let x = row i-1 v b_i-string; for example, x = row10 v T-string:

    x: 1101100 1100 1 1 1 1 1

Each segment of x can be read as a binary number. There is a 'folk technique': decrementing a binary number changes the low-order bits [7], up to and including the least-significant 1. The correct decrement for each segment can be found by a logical left-shift of row i-1 with a 1 carry-in:

    x:              1101100 1100 1 1 1 1 1
    - (row10 << 1): 0000001 0001 1 1 1 1 1
                    -----------------------
                    1101011 1011 0 0 0 0 0

A nonequivalence (exclusive-or) of the result and x sets the changed bits:

    0000111 0111 1 1 1 1 1

And-ing this with x gives the rightmost 1 bits in each segment:

    0000100 0100 1 1 1 1 1

This is row i, row11 in this example. In summary:

    row_i = x ∧ ((x - (row_{i-1} << 1)) ⊕ x),    where x = row_{i-1} ∨ b_i-string,

with ∨, ∧, ⊕ bit-string operations, << a logical left-shift with the right-hand bit set, and - an |a|-bit integer subtraction.

The bit-string operations can be done in units of a word-length. The shift and subtraction can also be done one word at a time, taking care that the carry-out bit and the borrow bit are propagated. The number of bits set in a word can be counted in O(log2 w) time; this is attributed to D. Muller in 1954 in [7]. The number of 1's in the final row can therefore be calculated in O(⌈|a|/w⌉ x log2 w) time.

2.3. An LCS

An LCS, as opposed to just its length, can be obtained in at least two ways. Hirschberg's recursive technique [2] can be used to find an LCS in linear space at the cost of slowing the algorithm by a factor of two.

[Fig. 2. Matrix M(i, j) for the same strings as Fig. 1, a = "GTCTTACATCCGTTCG" and string b, with the 'corners' of the contours of L marked; the characters of an LCS are emboldened.]

Alternatively, all rows of M can be kept; the space required is |a| x |b| bits, or ⌈|a|/w⌉ x |b| words. This is quite practical for strings of the order of 1000 characters. An LCS can then be recovered by finding the 'corners' of the contours in L, which stand out as patterns in M (see Fig. 2). This takes O(|a| + |b|) time. The emboldened characters in Fig. 2 indicate an LCS.

3. Simulation results

Both the simple O(|a| x |b|) algorithm [2,8] and the bit-string algorithm to calculate the length of an LCS have been implemented in C and run on a VAX 11/750 with a 32-bit word. For random strings and an alphabet of size 4, the resulting speedups were:

    |a| x |b|       Time ratio (simple/bit)
    32 x 32         6
    64 x 64         10
    100 x 100       11
    500 x 500       25
    1000 x 1000     26
    4000 x 4000     27

For comparison, an alphabet of size 256 resulted in:

    |a| x |b|       Time ratio (simple/bit)
    32 x 32         2
    64 x 64         5
    100 x 100       6
    500 x 500       19
    1000 x 1000     21
    4000 x 4000     27

A similar number of operations is performed in the inner loop of each algorithm. The bit-string algorithm is well suited to optimization, since it works across rows of matrix M and across the alphabet-strings. For small alphabets, the bit-string algorithm clearly provides a considerable speedup.

4. Conclusions

The time complexity of the bit-string LCS algorithm on a computer with w-bit words, for a fixed finite alphabet, is O(⌈|a|/w⌉ x |b|). This gives a speedup over simple O(|a| x |b|) algorithms. The time does not depend on properties of "a" and "b". For random strings and moderate alphabets, the bit-string LCS algorithm will be faster than the special-case algorithms. The algorithm might also be programmed on a vector computer, if carry-out and borrow bits can be propagated from word to word in a vector-shift and a vector-subtract operation. It could be 'pipelined' in a diagonal fashion on a vector or parallel machine, because the least significant bit (or word) of row i+1 can be calculated as soon as the least significant bit (or word) of row i has been calculated.

References

[1] A.V. Aho, D.S. Hirschberg and J.D. Ullman, Bounds on the complexity of the longest common subsequence problem, J. ACM 23 (1) (1976) 1-12.
[2] D.S. Hirschberg, A linear space algorithm for computing maximal common subsequences, Comm. ACM 18 (6) (1975) 341-343.
[3] D.S. Hirschberg, Algorithms for the longest common subsequence problem, J. ACM 24 (4) (1977) 664-675.
[4] J.W. Hunt and T.G. Szymanski, A fast algorithm for computing longest common subsequences, Comm. ACM 20 (5) (1977) 350-353.
[5] W.J. Masek and M.S. Paterson, How to compute string-edit distances quickly, in: D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison (Addison-Wesley, Reading, MA, 1983).
[6] N. Nakatsu, Y. Kambayashi and S. Yajima, A longest common subsequence algorithm suitable for similar text strings, Acta Informatica 18 (1982) 171-179.
[7] E.M. Reingold, J. Nievergelt and N. Deo, Combinatorial Algorithms: Theory and Practice (Prentice-Hall, Englewood Cliffs, NJ, 1977).
[8] D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison (Addison-Wesley, Reading, MA, 1983).
[9] P.H. Sellers, On the theory and computation of evolutionary distances, SIAM J. Appl. Math. 26 (4) (1974) 787-793.
[10] P.H. Sellers, The theory and computation of evolutionary distances: Pattern recognition, J. Algorithms 1 (4) (1980) 359-373.
[11] R.A. Wagner and M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1) (1974) 168-173.
[12] M.S. Waterman, General methods of sequence comparison, Bull. Math. Biology 46 (4) (1984) 473-500.
[13] C.K. Wong and A.K. Chandra, Bounds for the string editing problem, J. ACM 23 (1) (1976) 13-16.