Systolic-based parallel architecture for the longest common subsequences problem

INTEGRATION, the VLSI journal 25 (1998) 53—70

Guillaume Luce, Jean Frédéric Myoupo*

LaRIA: Laboratoire de Recherche en Informatique d'Amiens, Faculté de Mathématiques et d'Informatique, Université de Picardie Jules Verne, 33, rue Saint Leu, 80039 Amiens Cedex, France

Abstract

In this paper we design a new and efficient systolic architecture for the longest common subsequences problem, which is, given two finite strings over any alphabet, to recover a subsequence of maximal length of both strings. A natural extension of this problem is to determine the set of all longest common subsequences of the two given strings. First, we present a modular linear-time algorithm on an input/output bounded and fault-tolerant semi-mesh systolic structure for the longest common subsequence problem. Then, we extend this algorithm to the set of all longest common subsequences problem. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Systolic algorithms; Modularity; Fault tolerance; Dynamic programming; Longest common subsequences

1. Introduction

A string is a finite sequence over a given alphabet. A subsequence of a string is a sequence obtained by deleting zero or more symbols from the initial string. A common subsequence of two strings is a subsequence of both. A longest common subsequence of two strings (LCS) is a common subsequence of these two which is of maximal length. LCS and related string matching problems play a role in the biosciences, in which similarities of sequences of DNA or proteins are studied [1]. These problems are also of great importance for data compression, for syntactic pattern recognition, for the management of computer files, and even for bird song analysis [2]. There are many papers in the literature dealing with the computation of the length of a longest common subsequence (LLCS), but there are few results concerning the recovery of more than one LCS. The determination of the set of all LCSs is a problem which we will call the SLCS problem.

* Corresponding author. Tel.: +33 22 82 75 55; fax: +33 22 82 75 02; e-mail: [email protected].
0167-9260/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved.
PII: S0167-9260(98)00003-0

G. Luce, J.F. Myoupo / INTEGRATION, the VLSI journal 25 (1998) 53—70

Aho et al. [3] studied complexity bounds of the LCS problem. For an unrestricted alphabet, straightforward dynamic programming requires O(mn) sequential time for strings of length m and n. Among others, [4-9] gave efficient sequential algorithms, but one must use divide-and-conquer techniques, assume an expected length of the longest common subsequence, or design restricted models such as a small alphabet size [10,11] to derive algorithms which, in some cases, can run in less than O(mn) time. Many other variations are known, such as the more general string editing problem [12] or the k-LCS problem (finding an LCS of k strings [13,14]), which is NP-hard. We refer the reader to [15] for an elegant presentation of the myriad sequential algorithms for the LCS problem. Several parallel algorithms have also been designed [16-23].

Systolic arrays as defined by Kung [24] have been proposed as a simple and effective means of employing Very Large Scale Integration (VLSI) technology to handle compute-bound problems. These array processors are typically made up of simple and identical processing elements (PEs or cells) that operate synchronously for specific applications. Such arrays are attached as coprocessors to a host computer which inserts input values into them and extracts output values from them. Wafer-Scale Integration (WSI) technologies [25,26] require systolic algorithms that are both modular and fault tolerant [27]. An array is modularly extensible if the number of cells and the time of execution are its only characteristics which depend on the size of the problem. A WSI circuit is fault tolerant if wafers with defective components can still be used (by defining redundant circuitry). In [16-20] systolic designs for the LCS problem are proposed (for which known automatic methods cannot be applied [28]).
The LLCS problem is optimally solved (on a unidirectional one-dimensional modular and fault-tolerant systolic array) in [17], in the sense that the product of the execution time and the number of cells matches the complexity bound of the problem. But, for the LCS problem, none of these arrays is really efficient with regard to current VLSI/WSI technologies.

In this paper we use a unidirectional, fault tolerant and modular semi-mesh architecture of m(m+1)/2 processors, where m is the length of the shorter of the two strings. This systolic structure, adapted to the LCS problem, enables us to derive an algorithm that runs in time n+3m+p, where m and n are the lengths of the two input strings (m ≤ n) and p is the length of an LCS. This systolic algorithm is faster than the ones described in [16,17,19], and also requires fewer processors than the two-dimensional systolic arrays known for the LCS problem [17,18]. Moreover many LCSs can be recovered, and specifically the SLCS problem can be solved in such a way as to be suited for WSI technologies.

The remainder of the paper is organized as follows. In Section 2 we formally define the LLCS, LCS and SLCS problems, and we present the semi-mesh architecture. In Section 3 we give our LCS algorithm and the proof of its correctness. In Section 4 we generalize this algorithm to the SLCS problem. In Section 5 we summarize our results and make concluding remarks.

2. The LLCS, LCS and SLCS problems First we formally define the problems, then recall previous systolic arrays, and, finally, present the new array requirements.


2.1. Basic approach

Let Σ be the alphabet (any set of symbols is allowed). A string A is a finite sequence over Σ (a concatenation of symbols). A = A(1)A(2)…A(n) is also called a word over Σ. Ã = A(n)…A(2)A(1) is the reverse string of A. |A| denotes the length of string A (its number of elements), while ε denotes the empty string (|ε| = 0).

A subsequence of A is any sequence C = A(i_1)A(i_2)…A(i_p) where 1 ≤ i_1 < i_2 < … < i_p ≤ n. If this subsequence is obtained by deleting the a last symbols, then it is a prefix of length l = n - a. A[1..l] denotes the string A(1)A(2)…A(l).

Given two strings A and B over Σ, where |A| = n and |B| = m (m ≤ n), the string C over Σ is a common subsequence of A and B if it is a subsequence of both A and B. C is a longest common subsequence (LCS) of A and B if it is a common subsequence of maximal length. An LCS of A and B is therefore any string C = A(i_1)A(i_2)…A(i_p) = B(j_1)B(j_2)…B(j_p) where 1 ≤ i_1 < i_2 < … < i_p ≤ n and 1 ≤ j_1 < j_2 < … < j_p ≤ m.

For example, if A = bcabcb and B = abccb (n = |A| = 6, m = |B| = 5, Σ = {a, b, c}) then C_1 = bccb and C_2 = abcb are the two LCSs, of length p = 4 (see Appendix A).

The LLCS problem is to determine p.¹ The LCS problem is to recover an LCS of the two strings. The SLCS problem consists in finding all LCSs of the two strings.

Throughout this paper i, j, k (i ∈ [0..n], j ∈ [0..m], k ∈ [0..m]) are indices on strings A, B and C, respectively. LCS(i, j) denotes a longest common subsequence of the prefixes of length i and j of A and B, respectively (i.e., LCS(i, j) = LCS(A[1..i], B[1..j])), and LLCS(i, j) denotes its length (LLCS(i, j) = |LCS(i, j)|). Note also that LLCS(n, m) = |LCS(n, m)| = p.

Dynamic programming is a general method often used in computer science to solve problems whose solutions depend on the solutions of their own sub-problems.
By considering successive increasing values of a prefix length, we construct an (n+1) × (m+1) matrix M in which M[i, j], i ∈ [0..n], j ∈ [0..m], checks whether A(i) = B(j) and stores LLCS(i, j) according to the following recurrence:

The LLCS recurrence scheme (see e.g. [29] for a proof).
• ∀i ∈ [1..n], j ∈ [1..m], LLCS(i, 0) = LLCS(0, j) = LLCS(0, 0) = 0, and
• LLCS(i, j) = Max{LLCS(i-1, j-1) + E(i, j), LLCS(i-1, j), LLCS(i, j-1)}, where E(i, j) is 1 if A(i) = B(j) and 0 otherwise.

Clearly, M[n, m] is p, which solves the LLCS problem. As for the LLCS problem, the LCS problem is solved according to the following recurrence:

The LCS scheme [20].
• ∀i ∈ [1..n], j ∈ [1..m], LCS(i, 0) = LCS(0, j) = LCS(0, 0) = ε, and
• if E(i, j) = 1 then LCS(i, j) = LCS(i-1, j-1)A(i)
  else if LLCS(i-1, j) < LLCS(i, j-1) then LCS(i, j) = LCS(i, j-1)
  else LCS(i, j) = LCS(i-1, j).

¹ Note that if |C| = p, then p ≤ Min(n, m) ≤ m.
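As a point of reference for the systolic design below, the two recurrences can be checked with a short sequential sketch (plain dynamic programming, not the systolic algorithm itself; the function names are ours):

```python
def llcs_matrix(A, B):
    """Fill the (n+1) x (m+1) matrix M of the LLCS recurrence scheme."""
    n, m = len(A), len(B)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e = 1 if A[i - 1] == B[j - 1] else 0          # E(i, j)
            M[i][j] = max(M[i - 1][j - 1] + e, M[i - 1][j], M[i][j - 1])
    return M

def one_lcs(A, B):
    """Read one LCS back from M, following the LCS scheme of [20]."""
    M = llcs_matrix(A, B)
    i, j, out = len(A), len(B), []
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1] and M[i][j] == M[i - 1][j - 1] + 1:
            out.append(A[i - 1]); i -= 1; j -= 1          # LCS(i-1, j-1)A(i)
        elif M[i - 1][j] < M[i][j - 1]:
            j -= 1                                        # keep LCS(i, j-1)
        else:
            i -= 1                                        # keep LCS(i-1, j)
    return "".join(reversed(out))
```

On the running example A = bcabcb, B = abccb, M[n][m] = 4 = p, and this backward pass happens to recover the LCS bccb.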


Hirschberg [4], introducing the k-candidate and minimal k-candidate definitions, greatly reduces the storage requirements of the LCS problem. A k-candidate is a pair² (i, j) such that A(i) = B(j) and LLCS(i, j) ≥ k. For all k, the set of minimal k-candidates (i, j) (by the usual vector ordering) is obtained by ruling out all k-candidates (i', j') that do not respect the rule i < i' and j > j' (i ≤ i' is assumed without loss of generality). The sets X_k of minimal k-candidates are also defined by

Definition 1. Let X_0 = {(i, j) | LLCS(i, j) = 0, i ∈ [0..n], j ∈ [0..m]}. ∀k ∈ [1..m], X_k = {(i, j) | LLCS(i, j) = k > LLCS(i-1, j) = LLCS(i, j-1) = LLCS(i-1, j-1) = k-1, i ∈ [0..n], j ∈ [0..m]}.

Here, we extend Definition 1 as follows:

Definition 2. Let W_0 = X_0. ∀k ∈ [1..m], W_k = X_k ∪ {(i, j) | A(i) = B(j) ∧ LLCS(i, j) = LLCS(i, j-1) = k > LLCS(i-1, j) = k-1}.

W_k is a necessary extension of X_k to recover more than one LCS in some cases (see the characteristic example in Appendix A), and indeed will enable us to solve the SLCS problem, as the sets are constructed with respect to the following properties:
1. ∀(i, j) ∈ W_k, k ∈ [1..p], ∃(i', j') ∈ W_{k-1} such that i' < i ∧ j' < j, i' ∈ [0..n], j' ∈ [0..m],
2. (i, j) ∈ W_k ∧ (i', j) ∈ W_k ⇒ i' = i.

Proof of (i) is straightforward from the LLCS and LCS recurrence schemes. Proof of (ii): Assume i ≤ i' without loss of generality. (i', j) ∈ W_k ⇒ LLCS(i'-1, j) < k (Definition 2). But (i, j) ∈ W_k gives LLCS(i, j) = k, so, i ≤ i' being assumed, i < i' would yield LLCS(i'-1, j) ≥ LLCS(i, j) = k, a contradiction; hence i = i'. □

2.2. Previous results

For parallel designs, efficient algorithms for the LCS problem can be found for many models: on a PRAM-CRCW model [22], on a PRAM-CREW model [21] (which is optimal), on the Bus Automaton theoretical model [23], and many others. Effective implementation needs led us to systolic designs for their cost effectiveness and WSI design capabilities. Concerning systolic arrays, the LLCS and LCS problems have been studied in e.g. [16-20].
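A direct transcription of Definitions 1 and 2 makes the sets X_k and W_k concrete (a host-side sketch under our own naming, reusing the LLCS matrix of Section 2.1):

```python
def candidate_sets(A, B):
    """Compute the minimal k-candidate sets X_k and their extensions W_k."""
    n, m = len(A), len(B)
    # LLCS matrix, as in the recurrence scheme of Section 2.1.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e = 1 if A[i - 1] == B[j - 1] else 0
            M[i][j] = max(M[i - 1][j - 1] + e, M[i - 1][j], M[i][j - 1])
    X = {k: set() for k in range(m + 1)}
    W = {k: set() for k in range(m + 1)}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = M[i][j]
            if k == 0:
                continue
            # Definition 1: all three neighbours hold k - 1.
            if M[i - 1][j] == M[i][j - 1] == M[i - 1][j - 1] == k - 1:
                X[k].add((i, j)); W[k].add((i, j))
            # Definition 2: a match with LLCS(i, j-1) = k > LLCS(i-1, j).
            elif (A[i - 1] == B[j - 1] and M[i][j - 1] == k
                  and M[i - 1][j] == k - 1):
                W[k].add((i, j))
    return X, W
```

On the example of Appendix A this yields, e.g., W_1 = {(1,2), (1,5), (3,1)} and W_3 = {(4,5), (5,3), (5,4)}, matching the sets listed there.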
Yang and Lee [16] were the first to exhibit a linear-time systolic algorithm for the LLCS problem, but many registers are used and the PE operations are quite complex. An important paper on systolic design for the LLCS and LCS problems is [17]. Robert and Tchuente present an optimal modular algorithm that uses n+2m operation steps on a pipeline of m processors to compute the length of an LCS. PEs are made of three registers and three I/O channels. But this design does not enable the retrieval of an LCS. The authors proposed

² The LCS problem can also be seen in terms of posets (partially ordered sets) [30] on the ranks of the A(i)'s and B(j)'s, or by finding a maximum-cost path in the associated grid DAG (directed acyclic graph) [15,21].


a 2D array of m PEs, each one being equipped with a systolic stack of size bounded by 2m processors (indeed a stack has to be of size 2k on cell k, thus the array needs about m(m+1) cells). Under the assumption of constant-time stack operations, they recover one LCS in time n+5m, but the algorithm is not modular and the stack architecture does not suit WSI technologies. The Chang et al. [18] LCS algorithm is presented as an instance of the string editing problem. It is linear in time: 4n+2m clock cycles are required for recovering one LCS on a 2D modularly extensible array of mn cells. The algorithm of Lin [19] is the Robert-Tchuente LCS algorithm reworked with a content-addressable memory in each PE instead of the systolic stacks. In [20] we presented a faster systolic algorithm (only n+3m+p clock cycles are required) on an array of m processors which can recover more than one LCS in some situations. It is a VLSI-implementable array but it is not modular, as each cell needs O(m) memory locations.

2.3. A new systolic architecture

Our architecture's topology is a semi-mesh, or tri-array, T consisting of m + (m-1) + … + 2 + 1 = m(m+1)/2 cells. Its structure is shown in Fig. 1. Column j relates to the jth character of string B, and row k, k ∈ [1..j], is associated with W_k (or X_k for the single LCS problem, see Sections 2.1 and 3.2). Each cell T_{k,j}, at the intersection of column j and row k, has two registers: a register B to store B(j), and a register R whose contents is i if and only if (i, j) belongs to W_k (R is set to 0 otherwise). The data, input from T_{1,1}, travel from cell T_{k,j} to cell T_{k,j+1} through the internal interconnection channels A1 and I1, and to cell T_{k+1,j+1} through channels A2 and I2 (see Fig. 1). Moreover a channel X, which is used for the recovering phase, interconnects cells T_{k,j} and T_{k-1,j}.

Fig. 1. The semi-mesh T.


Due to advances in technology, in particular increasing integration density and die sizes, systolic arrays for WSI need to be fault tolerant [25-27]. As this architecture is equivalent to the right part of Kung and Lam's hexagonal array in [27], it is fault tolerant.

Our algorithm has three steps: step 1 initializes the B registers (storing string B), step 2 introduces A and computes M, and step 3 extracts one LCS.

In step 1 of the LCS algorithm, channels A1 and A2 of T_{1,1} are used to input the sequence B(m)B(m-1)…B(1), and T_{1,1}'s channels I1 and I2 are used to input the sequence m, m-1, …, 1. Channels A1_out, A2_out, I1_out and I2_out of T_{k,j} are used to propagate B(j') from cell T_{k,j} to both T_{k+1,j+1} (via A2 and I2) and T_{k,j+1} (via A1 and I1) until it reaches column j'.

In step 2 of the algorithm, channels A1, A2 of T_{1,1} are used to input the sequence A(1)A(2)…A(n), and T_{1,1}'s channels I1 and I2 are used to input the sequence of corresponding indices i. Channels A1_out, A2_out, I1_out and I2_out of T_{k,j} are used to propagate (A(i), i) to T_{k,j+1} (via A1 and I1) if the PE evaluation function yields R = 0, or to T_{k+1,j+1} (via A2 and I2) if (R, j) belongs to W_k, i.e., if a match has previously been stored.

Step 3 (the recovering step) mainly proceeds along the mth-column cells T_{k,m}, k ∈ [1..p], in which some information that then enables the output of an LCS is moved. For that purpose, we need the upward edges X depicted by dashed vectors in Fig. 1. Even with these additional features, our array remains fault tolerant (Fig. 3.2(a), p. 40, in [27]).

As only cells T_{1,1} and T_{1,m} require external communication channels, our model is clearly I/O bounded. Moreover, as only the number of processors and the time of execution of our algorithm depend on the problem size, our design is modularly extensible.

3. A modular algorithm for the LCS problem

In this section we present the three phases of the LCS algorithm. Recall that the first step initializes the B registers (storing the sequence B), the second step computes M, and the third outputs one LCS.

3.1. Initialization phase

Input data items: Input B̃, the reversed B sequence, through the channel A1 of cell T_{1,1}. Input the sequence of corresponding indices m, m-1, …, 1 through the channel I1 of cell T_{1,1}. Then propagate the B(j) to their respective destination cells by executing program P1 below in each cell.

P1: if I_in > 1 then
      A1_out = A2_out = A_in
      I1_out = I2_out = I_in - 1
    else
      B = A_in
      R = 0
    endif


Each symbol of B enters the array through T_{1,1} and is propagated along the columns of T. A_in refers to the same input³ value A1_in and A2_in that are respectively read on channels A1 and A2. So does I_in.

Proof of correctness of this algorithm is straightforward as, according to the input data flow, at time t, t ∈ [1..m-1], symbol B(m-t+j), accompanied by the value m-t+j, is considered by all the processors of column j. Thus, the initialization phase requires m time-steps to simultaneously initialize all the B registers of T if the evaluation of P1 takes one time-step.

3.2. Construction phase

Input data items: Input string A through the channel A1 of cell T_{1,1}. Input the sequence of corresponding indices 1, 2, …, n through the channel I1 of cell T_{1,1}. Then propagate the A(i) toward the last column by executing program P2 below in each cell. One time-step being assumed for the execution of P2, this phase of the computation requires n time-steps to input A plus m-1 time-steps for A(n) to be processed by a cell of the last column.

P2: l1:  if R = 0 then
    l2:    if A_in = B then
    l3:      R = I_in
    l4:    endif
    l5:    A1_out = A_in
    l6:    I1_out = I_in
    l7:  else
    l8:    A2_out = A_in
    l9:    I2_out = I_in
    l10: endif
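The effect of P2 on the R registers can be checked with a small host-side emulation (a sequential sketch under our own naming, not cycle-accurate; items are processed one at a time, which is consistent with the pipelined order of the array):

```python
def semi_mesh_registers(A, B):
    """Emulate program P2 on the semi-mesh: after the run, R[k][j] = i
    exactly when the pair (i, j) belongs to W_k (Section 2.1)."""
    m = len(B)
    R = [[0] * (m + 1) for _ in range(m + 1)]   # R[k][j], 1-based, k <= j
    for i, a in enumerate(A, start=1):          # item (A(i), i) enters T(1,1)
        k, j = 1, 1
        while k <= m and j <= m:
            if R[k][j] == 0:
                if a == B[j - 1]:               # match: save the index (l3)
                    R[k][j] = i
                j += 1                          # forward on row k (l5, l6)
            else:
                k += 1                          # divert to T(k+1, j+1)
                j += 1                          # (l8 and l9)
    return R
```

On the example of Appendix A (A = bcabcb, B = abccb) this stores, e.g., R[1][2] = R[1][5] = 1 and R[4][5] = 6, i.e., exactly the W_k sets listed there.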

Lemma 1. After m executions of P1, m+n-1 executions of P2 determine all the (W_k)_{k ∈ [1..m]} information in T (the register R of T_{k,j} equals i if and only if (i, j) ∈ W_k, k ∈ [1..p]).

Proof of Lemma 1. Basically, when running P2, processor T_{k,j} considers the input item (A(i), i), i ∈ [1..n], and, according to the contents of R, tests whether (i, j) ∈ W_k (l2) and sends the item either toward T_{k,j+1} (l5 and l6) after saving it if necessary (l3), or toward T_{k+1,j+1} (l8 and l9).

The first input item (A(1), 1) is successively processed by T_{1,1}, T_{1,2}, …, T_{1,m}. By running P2 in each cell, (1, j) ∈ W_1, j ∈ [1..m] ⟺ R = 1 on T_{1,j} is obvious.

Let us assume: (i, j) ∈ W_1, j ∈ [1..m] ⟺ R = i on T_{1,j}.

Clearly, the input item (A(i+1), i+1) travels along the first row of T, and all the (i+1, j) ∈ W_1 are stored in T_{1,j} (by l3), until one gets a register R ≠ 0 on a cell T_{1,j'}, where j' > j, and the item is sent onto the next row (by l8 and l9).

³ Similarly, A1_out and A2_out refer to the values output on the A1 and A2 channels. So for I1_out and I2_out.


By Definition 2, there is no (i+1, j'') ∈ W_1 with j'' ≥ j', so all the (i+1, j) ∈ W_1, j ∈ [1..m], are stored in T. Therefore, W_1 is in T. Similar arguments for any W_k complete the proof. □

Note that to retrieve a single LCS, T only needs the X_k information (cf. Section 2.1). In order to do so, just replace l4 by 'else' and add 'endif' between l6 and l7 in P2. The proof of this new lemma is similar to that of Lemma 1, and the proof of correctness of this algorithm is related to Definition 1.

Property (i) (see Section 2.1) enables the recovery of an LCS as, by construction, we have in T: (1) ∀(i, j) ∈ W_k, R = i on T_{k,j}, and (2) ∀(i, j) ∈ W_k, ∃i' < i, ∃j' < j such that R = i' on T_{k-1,j'}. Thus, we just have to retrieve an element (i_p, j_p) from W_p, and, according to (i), determine (i_{k-1}, j_{k-1}) ∈ W_{k-1} for a given (i_k, j_k) ∈ W_k to reveal C(k) and C(k-1), and therefore C = LCS(n, m).

3.3. Extraction phase

This step consists in moving the information in the T_{k,j} cells to their right neighbours T_{k,j+1}, j ∈ [1..m-1], in order to fill the mth column with C. First, run P31 once, then P32 m-1 times:

P31: I1_out = R
     X_out = R

P32: if (0 < I1_in) and ((I1_in < X_in) or (X_in = 0)) then
       R = I1_in
     endif
     I1_out = I1_in
     X_out = R

Lemma 2. This algorithm sets in the registers R of T_{1,m}, T_{2,m}, …, T_{p,m} the sequence i_1, i_2, …, i_p, with 1 ≤ i_1 < i_2 < … < i_p ≤ n, that forms the LCS C = A(i_1)A(i_2)…A(i_p).

Proof of Lemma 2. Exhibit C(p): Let us consider the smallest j_p such that (i_p, j_p) ∈ W_p. After the execution of P31, T_{p,j_p+1} reads the input i_p on channel I1, and 0 on channel X (LLCS(n, m) < p+1 implies that no pair belongs to W_{p+1}). So, by P32, the register R of T_{p,j_p+1} is assigned i_p. And so on, by P32, i_p is propagated on the pth row until it reaches the mth column. Therefore, when running P32 m-1 times, we obtain R = i_p on T_{p,m}.

C(k) exhibited, exhibit C(k-1).
Basis hypothesis: R = i_k on T_{k,j_k} (the smallest j_k such that (i_k, j_k) ∈ W_k) and A(i_k) = C(k) (i.e., R = i_k on T_{k,m} after m-1 executions of P32).

Let us consider the smallest j_{k'} such that (i_{k'}, j_{k'}) ∈ W_{k-1}.

As for C(p), after the execution of P31, T_{k-1,j_{k'}+1} reads the input i_{k'} on I1, and either X_in equals 0 (if, and only if, j_{k'}+1 < j_k) and the statement is correct, or X_in = i_k, which, by P32, sets R to i_{k'}. And so forth until R = i_{k'} on T_{k-1,m}. □
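Off the array, the index sequence that Lemma 2 asserts in column m can be computed from the W_k sets alone, using property (i) of Section 2.1 (a sketch with our own naming; W denotes the family of sets W_k):

```python
def recover_one_lcs(A, W, p):
    """Pick, for k = p down to 1, the pair of W[k] with the smallest j that
    is componentwise below the pair chosen for k + 1 (property (i) of
    Section 2.1 guarantees such a pair exists); return indices and LCS."""
    i_next = j_next = float("inf")
    indices = []
    for k in range(p, 0, -1):
        i_k, j_k = min(((i, j) for (i, j) in W[k]
                        if i < i_next and j < j_next),
                       key=lambda pair: pair[1])
        indices.append(i_k)
        i_next, j_next = i_k, j_k
    indices.reverse()
    return indices, "".join(A[i - 1] for i in indices)
```

With the W_k sets of Appendix A this returns the index sequence 3, 4, 5, 6, i.e., the LCS abcb, exactly the column-m contents shown in Appendix C.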


The following achieves the recovering phase: run P33 p times to output from T_{1,m} the sequence of indices i_k.

P33: X_out = X_in

This recovering phase requires m time-steps to run P31 and P32 plus p+1 (≤ m) time-steps to output the solution sequence (the sequence ends with 0 if p ≠ m), leading to the following:

Theorem 1. This LCS algorithm runs in time n+3m+p on an I/O bounded, modular and fault tolerant systolic array of m(m+1)/2 cells.

Two additional channels are required: CT1, interconnecting T_{k,j} to T_{k,j+1}, and CT2, interconnecting T_{k,j} to T_{k+1,j+1}. They are used to propagate a control value (CT_in ∈ [1..5]) that enables the switching between the above programs P1, P2, P31, P32 and P33 in each cell. The data flow follows. When a B (resp. A) symbol is input, 1 (resp. 2) is input on channel CT1 of T_{1,1}, in order to run P1 (resp. P2). Then, the sequence m, m-1, …, 1 is input on channel I1 of T_{1,1} when CT_in = 3, in order to run P31 simultaneously in each cell according to the contents I_in of both the I1 and I2 channels (i.e. when I_in = 1; the output control value is 4 in this case). Then, on channel CT1 of T_{1,1}, the control value 4 is input m-1 times to run P32, and at least p+1 times the control value 5 is input in order to run P33. This algorithm is given in Appendix B.

In order to precisely locate the matching symbols, the third phase has to be slightly reworked. Two additional channels, copies of I1 and X, would, without any loss of time, enable the output of the indices from B, while outputting the indices from A, if each cell T_{k,j} has a register which stores j (a copy of I2 obviously enables its initialization).

Another modular and fault tolerant variation of this new systolic array, whose PE combinational logic remains very simple, solves the SLCS problem in terms of a graph structure.

4. An extension to the SLCS problem

First, note that even with more channels and registers, the semi-mesh architecture remains fault tolerant if we keep the communication links going in the same directions as before.

While constructing T we notice that, ∀k ∈ [2..p], ∀(i, j) ∈ W_k, we have enough information to count the number of (i', j') ∈ W_{k-1} such that i' < i and j' < j. In order to store this number of (i', j'), which are C(k-1) candidates for a given C(k), we add one more integer register N, initially assigned 0. This addition is helpful for understanding the following SLCS graph structure and for the proof of Theorem 2.

Define the graph structure G = (U, V) by U = {(i, j, k) | the register R of T_{k,j} equals i, with i > 0}, and V = {((i, j, k), (i', j', k+1)) ∈ U² | i < i' ∧ j < j'}.

Clearly, this directed acyclic graph schedules all the W_k information, k ∈ [1..p], and solves the SLCS problem by following all the different paths of length p in G.
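The graph G can be built and traversed off the array; every path of length p spells one LCS. A sketch under our own naming (in the actual design the host would rebuild G from the (i, j, k, N) output stream):

```python
def all_lcs_words(A, W, p):
    """Enumerate the words spelled by all paths of length p in G = (U, V),
    where W[k] lists the pairs (i, j) stored on row k of the semi-mesh."""
    def extend(path):
        k = len(path)
        if k == p:
            yield path
            return
        i, j = path[-1]
        for (i2, j2) in W[k + 1]:
            if i < i2 and j < j2:               # an edge of V
                yield from extend(path + [(i2, j2)])
    words = set()
    for start in sorted(W[1]):
        for path in extend([start]):
            words.add("".join(A[i - 1] for (i, _) in path))
    return words
```

For A = bcabcb and B = abccb this yields {abcb, bccb}, the SLCS solution of Appendix A.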


Note that two paths of length p in G can represent the same word over Σ, but they are differentiated by their respective indices in A and B. Note also that, resulting from property (ii) of Section 2.1, some LCS localizations in the SLCS solution we provide are not given by the graph, i.e., the set t = {(i, j) | A(i) = B(j) = C(k) ∧ (i, j) ∉ W_k}. For example, G for the strings A = acbb and B = abc reveals the two strings ac and ab, the solution of the SLCS problem, but, when considering their indices, G provides the two strings A(1)A(2) = ac and A(1)A(3) = ab but not the string A(1)A(4) = ab.

4.1. Construction phase

When running the program P'2 (cf. below) instead of P2, the register N of T_{k,j}, if R ≠ 0 (i.e., if k = LLCS(R, j) holds), stores the number of possible C(k-1) for a considered C(k). The input data items are the ones required by P2. Channels A1, A2, I1 and I2 are involved as above. We also take a tracing of I1 and I2 on boolean channels, I3 and I4 respectively, to indicate that a match has to be taken into account in further cells. More precisely, if a first match happens in cell T_{k,j}, a token, say 1, is sent on I4_out, thus toward T_{k+1,j+1}, which will increment its counter N. Then propagate the token (using I3) while R ≠ 0, i.e., toward T_{k',j}, k' > k, until a match has happened in that cell.

P'2: l1:  if R = 0 then
     l2:    if A_in = B then
     l3:      R = I_in
     l4:      I4_out = 1
     l5:    else
     l6:      if I3_in = 1 or I4_in = 1 then
     l7:        N = N + 1
     l8:        I3_out = 1
     l9:      endif
     l10:   endif
     l11:   A1_out = A_in
     l12:   I1_out = I_in
     l13: else
     l14:   A2_out = A_in
     l15:   I2_out = I_in
     l16: endif

Correctness of P'2: Assume (i, j) ∈ W_k. When T_{k,j} processes A(i), R is set to i (by l3, cf. Lemma 1) and a token is sent to T_{k+1,j+1} (by l4). The token then travels from cell T_{k+1,j'} to T_{k+1,j'+1} (by l8), incrementing the contents of N (by l7), until one of the following two cases appears: either R = i' is positive (l13), which correctly ends the token's travel, as i' < i results from the input data flow, or (i, j') ∈ W_{k+1} (l2).

If (i', j') ∈ W_{k+1} with i' > i and j' > j then, when T_{k+1,j'} processes A(i') (after A(i), by the input data flow), the register N of T_{k+1,j'} obviously memorizes all the tokens resulting from the hypothesis (i, j) ∈ W_k, in particular those with i < i' and j < j'.


So, in T_{k,j}, after the m executions of P'2, if R = i and i > 0 then we have N = #{(i', j') ∈ W_{k-1} | i' < i ∧ j' < j}.⁴ □

4.2. Extraction phase

First, the process consists in recovering the first LCS, i.e., using the algorithm of Section 3.3, extracting the sequence (R = i_k, j_k, k, N), k ∈ [1..p], where N and R denote the registers of cell T_{k,j_k}, which gives both an LCS C_1 = C_1(1)…C_1(p) = A(i_1)…A(i_p) = B(j_1)…B(j_p) and p in time n+3m+p (Theorem 1). Note that an auxiliary register R', a copy of R, is used to avoid the loss of the W_k information, because the contents of R are modified during the extraction phase of the LCS algorithm (adding R' = R to P31 and using R' instead of R in P32 overcomes the problem).

Then, we recover the other W_k information by sending a wave through T as follows: input the items (C_l(p), 1), …, (C_l(1), 1), where C_l denotes an input string associated with the (l-1)th output sequence, through T_{1,1}, for each processor T_{k,k}, k ∈ [1..p], to consider the input item (C_l(k), 1) on its input channels A_in (to wit A1_in or A2_in) and I_in (to wit I3_in or I4_in), which enables marking and pushing some information not yet extracted toward the mth column of the array, until all the W_k information is output.⁵

More precisely, the input item (C_l(k), 1) travels on row k until it detects the previously output information, which has been marked (note that we always output the leftmost existing information of a row, i.e., the first one the input item encounters). Then, (C_l(k), 2) travels on the A2 and I3 channels of row k until either we find R = i_{l,k} with i_{l,k} ≠ 0, which is marked, and (R = i_{l,k}, j_{l,k}, k, N) is sent to T_{k,m} (in this case, the contents of I3 becomes 3), or (C_l(k), 2) reaches T_{k,m} and there is no more information to output from row k.
This SLCS algorithm clearly ends when an output sequence does not deliver any information; thus, it requires n+3m+p time-steps for recovering the first LCS plus l times m+2p-1 time-steps (p data items are input, plus m-1 time-steps are required for the last input item to reach the last column, plus p to be output), where l ≤ m is the number of input sequences C_l (|C_l| = p) needed to output a sequence empty of W_k information.

The remaining work of buffering all the (k, i, j, N) output information and restoring the edges of G being assumed to be achieved by the host device in acceptable time, our systolic design provides the following:

Theorem 2. This SLCS algorithm runs in time n+3m+p+l(m+2p-1) on an I/O bounded, modular and fault tolerant systolic array of m(m+1)/2 cells.

Unfortunately, the time of execution of our SLCS algorithm depends on the size of the solution, and so is difficult to evaluate. However, to the best of our knowledge, such an algorithm is presented here for the first time.
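The value that register N holds at the end of the construction phase can be checked directly against Definition 2's sets (a host-side sketch under our own naming; W denotes the family of sets W_k):

```python
def candidate_counts(W, p):
    """For each (i, j) in W[k], k >= 2, count its predecessors in W[k-1],
    i.e., the value held by register N after the m executions of P'2."""
    N = {}
    for k in range(2, p + 1):
        for (i, j) in W[k]:
            N[(i, j, k)] = sum(1 for (i2, j2) in W[k - 1]
                               if i2 < i and j2 < j)
    return N
```

On the example of Appendix A, e.g., (6, 5) ∈ W_4 has the two predecessors (5, 3) and (5, 4) in W_3, so its N is 2.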

⁴ # denotes set cardinality.
⁵ Clearly, we repeat a process very similar to the one described in Section 3.3, and therefore only an outline of the method is given in this section.


5. Summary and concluding remarks

This paper presented a new systolic approach for the LCS problem and proposed a solution for the SLCS problem. The semi-mesh structure T and its special semantics enabled us to derive simple algorithms that are linear in time on a modular, fault tolerant architecture.

Our method can be summarized as follows: first, initialize T with B (all the PEs of column j memorize B(j)); next, compute the matrix M of the LLCS values (input A and compute LLCS(i, j) on a processor of column j according to the LLCS recurrence scheme) in order to place on T, at the intersection of row k and column j, the unique i such that (i, j) ∈ W_k if it exists (i.e., A(i) = B(j) = C(k) is involved in an LCS solution, as W_k is a tracing of the LCS scheme that fits the SLCS problem). This systolic semi-mesh, since it contains all and exactly all the W_k information, solves both the LCS and SLCS problems. One step to empty a longest common subsequence of the two input strings yields the n+3m+p time algorithm for the LCS problem. Running further W_k extractions (one extraction, requiring m+2p clock cycles, supplies p additional pieces of information to build the graph solution) solves the second problem.

For the recovering of one LCS, our design uses fewer cells and runs faster compared with the 2D designs in [17,18] (see Table 1). Unlike the 1D arrays in [19,20], this array is modularly extensible. Moreover, we can precisely locate the elements of the two strings which match each other. A first systolic algorithm is given for the SLCS problem. This algorithm remains linear in time for outputting one LCS; however, to build the graph solution, it also requires an additional time of less than l(m+2p) time-steps, where l ≤ m is a value depending on the size of the SLCS solution. We still need an array of m(m+1)/2 cells.

Acknowledgements The authors would like to thank the referee for his many helpful comments and remarks, and recognize their indebtedness to their in-house British philosopher Howard Maynard.

Table 1
Comparison of 2D systolic arrays for recovering one LCS

             Time of execution   Number of cells   Particularities
[17]         n+5m                m(m+1)            Non modular; bidirectional; one LCS only
[18]         4n+2m               mn                Modular; unidirectional; one LCS only
This paper   n+3m+p              m(m+1)/2          Modular; unidirectional; one or many LCSs


Fig. 2. The matrix M for the strings A = bcabcb and B = abccb.

Appendix A. Example

The two strings A = bcabcb (n = 6) and B = abccb (m = 5) admit two LCSs of length p = 4: C_1 = abcb and C_2 = bccb. The associated matrix M is shown in Fig. 2, where matches are bolded, W_k elements are underlined, and X_k elements are underlined twice. By Definitions 1 and 2, we obtain the following sets:

X_0 = ∅,                 W_0 = ∅,
X_1 = {(1,2); (3,1)},     W_1 = {(1,2); (1,5); (3,1)},
X_2 = {(2,3); (4,2)},     W_2 = {(2,3); (2,4); (4,2)},
X_3 = {(4,5); (5,3)},     W_3 = {(4,5); (5,3); (5,4)},
X_4 = {(6,5)},            W_4 = {(6,5)},
X_5 = ∅,                 W_5 = ∅.

So, we find C(4) = B(5) = A(6) = b for the single (6,5) ∈ X_4, and (5,3) ∈ X_3, (4,2) ∈ X_2 and (3,1) ∈ X_1, which form an LCS C = abcb, as 1 ≤ 3 < 4 < 5 < 6 ≤ n and 1 ≤ 1 < 2 < 3 < 5 ≤ m. But, to recover the second LCS, one needs the sets W_k, k ∈ [1..p]: we have (5,4) ∈ W_3 (but (5,4) ∉ X_3!), (2,3) ∈ X_2 ⇒ (2,3) ∈ W_2 and (1,2) ∈ X_1 ⇒ (1,2) ∈ W_1; thus the LCS bccb is found.

Appendix B. LCS algorithm

Input data flow:

[Figure: input data flow of the LCS algorithm. The symbols A[1], ..., A[n] and B[1], ..., B[m], each paired with its index, enter the array; the diagram itself is not recoverable from the text.]


Algorithm (cell program, dispatched on the control tag CT_in):

case CT_in of
  (1): if I_in > 1 then
         A1_out := A_in;  A2_out := A_in;
         I1_out := I_in - 1;  I2_out := I_in - 1
       else
         B := A_in;  R := 0
       endif;
       CT1_out := CT_in;  CT2_out := CT_in
  (2): if R = 0 then
         if A_in = B then R := I_in endif;
         A1_out := A_in;  I1_out := I_in
       else
         A2_out := A_in;  I2_out := I_in
       endif;
       CT1_out := CT_in;  CT2_out := CT_in
  (3): if I_in = 1 then
         I1_out := R;  X_out := R;
         CT1_out := 4;  CT2_out := 4
       else
         I1_out := I_in - 1;  I2_out := I_in - 1;
         CT1_out := CT_in;  CT2_out := CT_in
       endif
  (4): if (0 < I1_in) and ((I1_in < X_in) or (X_in = 0)) then
         R := I1_in
       endif;
       I1_out := I1_in;  X_out := R;
       CT1_out := CT_in;  CT2_out := CT_in
  (5): X_out := X_in;
       CT1_out := CT_in;  CT2_out := CT_in
endcase
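The W_k and X_k information that this cell program places on the array is given by Definitions 1 and 2, which are not reproduced in this excerpt. The following sequential sketch (repeating the llcs_matrix helper so it is self-contained) uses predicates that are our reading of the example sets in Appendix A, not the paper's own definitions:

```python
def llcs_matrix(A, B):
    """M[i][j] = length of a longest common subsequence of A[:i] and B[:j]."""
    n, m = len(A), len(B)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    return M

def w_and_x_sets(A, B):
    """Group the matches (i, j) (1-based) by rank k = M[i][j].
    W_k keeps, per column j, the first row at which rank k is reached;
    X_k keeps those matches that are additionally first in their row
    (dominant matches).  This reproduces the sets listed in Appendix A."""
    M = llcs_matrix(A, B)
    n, m = len(A), len(B)
    p = M[n][m]
    W = {k: set() for k in range(1, p + 1)}
    X = {k: set() for k in range(1, p + 1)}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                k = M[i][j]
                if M[i - 1][j] == k - 1:       # first row of rank k in column j
                    W[k].add((i, j))
                    if M[i][j - 1] == k - 1:   # also first in its row: dominant
                        X[k].add((i, j))
    return W, X
```

On A = bcabcb and B = abccb this yields, e.g., W_3 = {(4,5), (5,3), (5,4)} and X_3 = {(4,5), (5,3)}, in agreement with the appendix; in particular, at most one row i appears per (k, j) pair, matching the "unique i" stored at row k, column j of the array.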

Appendix C. The LCS algorithm: an example

The three phases of the LCS systolic algorithm for the above example are depicted in Figs. 3-5, respectively.

Step 1 (Initialization phase (Fig. 3)): t_l, l ∈ [2, ..., 5], presents T after the l-th execution of P1. Pairs (B(j), j), where B(j) is a symbol of the alphabet and j ∈ [1, ..., m], written on the vectors, denote the contents of the related channels A1 or A2, and I1 or I2.


Fig. 3. Initialization phase.

Fig. 4. Construction phase.


Fig. 5. Recovering one LCS for A = bcabcb and B = abccb.

t_5 shows T after the last execution of P1. Obviously, all the B registers of T are initialized, and R equals 0.

Step 2 (Construction phase (Fig. 4)): t_l, l ∈ [1, ..., 6], presents T after the l-th execution of P2. The semantics remains as in Fig. 3.

Step 3 (Extract phase (Fig. 5)): t_0 displays T after P31, and t_l, l ∈ [1, ..., 4], presents T after the l-th execution of P32. The contents of R and B are provided, as are the contents of X (the only channel used in the recovering phase). The last picture depicts the fifth column of T after one run of P33 and four of P34. Clearly, we get, from T(1, m), the sequence 3, 4, 5 and 6, which reveals the LCS C = A(3)A(4)A(5)A(6) = abcb. In this example, 5 time-steps are needed for P1, 10 for P2 and 10 for the third phase, which confirms the theoretical result n + 3m + p = 6 + 15 + 4 = 25.
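As a sequential cross-check of this example (a brute-force enumeration over all subsequences, not the systolic procedure), one can confirm that A = bcabcb and B = abccb admit exactly the two LCSs abcb and bccb of length p = 4:

```python
from itertools import combinations

def all_lcs(A, B):
    """Enumerate the set of all longest common subsequences (SLCS)
    of two short strings by brute force; exponential, so only
    suitable as a check on tiny examples like this one."""
    def subsequences(S):
        return {"".join(S[i] for i in idx)
                for r in range(len(S) + 1)
                for idx in combinations(range(len(S)), r)}
    common = subsequences(A) & subsequences(B)
    p = max(map(len, common))  # common always contains "", so max is safe
    return {s for s in common if len(s) == p}
```

Running all_lcs("bcabcb", "abccb") returns exactly the two subsequences recovered through the X_k and W_k sets of Appendix A.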


References

[1] T.F. Smith, M.S. Waterman, Mathematical Methods for DNA Sequences, CRC Press, Boca Raton, FL, 1989.
[2] D. Sankoff, J.B. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, 1983.
[3] A.V. Aho, D.S. Hirschberg, J.D. Ullman, Bounds on the complexity of the longest common subsequence problem, J. ACM 23 (1) (1976) 1-12.
[4] D.S. Hirschberg, Algorithms for the longest common subsequence problem, J. ACM 24 (4) (1977) 664-675.
[5] N. Nakatsu, Y. Kambayashi, S. Yajima, A longest common subsequence algorithm suitable for similar text strings, Acta Inform. 18 (1982) 171-179.
[6] W.J. Masek, M.S. Paterson, A faster algorithm for computing string edit distances, J. Comput. System Sci. 20 (1980) 18-31.
[7] D.S. Hirschberg, A linear space algorithm for computing maximal common subsequences, Comm. ACM 18 (6) (1975) 341-343.
[8] S.K. Kumar, C.P. Rangan, A linear space algorithm for the longest common subsequence problem, Acta Inform. 24 (1987) 353-362.
[9] A. Apostolico, S. Browne, C. Guerra, Fast linear-space computations of longest common subsequences, Theoret. Comput. Sci. 92 (1992) 3-17.
[10] L. Allison, T.I. Dix, A bit string longest common subsequence algorithm, Inform. Process. Lett. 23 (1986) 305-310.
[11] F.Y.L. Chin, C.K. Poon, A fast algorithm for computing longest common subsequences of small alphabet size, J. Inform. Process. 13 (4) (1990) 463-469.
[12] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1) (1974) 168-173.
[13] R.W. Irving, C.B. Fraser, Two algorithms for the longest common subsequence of three (or more) strings, in: Proc. 3rd Ann. Symp. on Combinatorial Pattern Matching (CPM), Lecture Notes in Computer Science, vol. 644, Springer, Berlin, 1992, pp. 214-229.
[14] K. Hakata, H. Imai, The longest common subsequence problem for small alphabet size between many strings, in: Proc. 3rd Int. Symp. on Algorithms and Computation, Lecture Notes in Computer Science, vol. 650, Springer, Berlin, 1992, pp. 469-478.
[15] V. Dančík, M.S. Paterson, Longest common subsequences, in: Proc. 11th Annual Symp. on Theoretical Aspects of Computer Science, Lecture Notes in Computer Science, vol. 775, Springer, Berlin, 1994, pp. 127-142.
[16] C.B. Yang, R.C.T. Lee, Systolic algorithms for the longest common subsequence problem, in: Proc. Int. Comput. Symp., 1984, pp. 895-901.
[17] Y. Robert, M. Tchuente, A systolic array for the longest common subsequence problem, Inform. Process. Lett. 21 (1985) 191-198.
[18] J.H. Chang, O.H. Ibarra, M.A. Palis, Parallel parsing on a one-way array of finite-state machines, IEEE Trans. Computers C-36 (1987) 64-75.
[19] Y.C. Lin, New systolic arrays for the longest common subsequence problem, Parallel Comput. 20 (1994) 1323-1334.
[20] G. Luce, J.F. Myoupo, An efficient linear systolic array for recovering longest common subsequences, in: Proc. 1st IEEE Int. Conf. on Algorithms and Architectures for Parallel Processing, 1995, pp. 20-29.
[21] M. Lu, H. Lin, Parallel algorithms for the longest common subsequence problem, IEEE Trans. Parallel Distrib. Systems 5 (8) (1994) 835-848.
[22] A. Apostolico, M.J. Atallah, L. Larmore, S. McFaddin, Efficient parallel algorithms for string editing and related problems, SIAM J. Comput. 19 (1990) 968-988.
[23] D.M. Champion, J. Rothstein, Immediate parallel solution of the longest common subsequence problem, in: Proc. Int. Conf. on Parallel Processing, 1987, pp. 70-77.
[24] H.T. Kung, Why systolic architectures?, IEEE Comput. 15 (1) (1982) 37-46.
[25] R.C. Aubusson, I. Catt, Wafer-scale integration: a fault-tolerant procedure, IEEE J. Solid-State Circuits SC-13 (3) (1978) 339-344.
[26] F.T. Leighton, C.E. Leiserson, Wafer-scale integration of systolic arrays, in: Proc. 23rd Symp. on Foundations of Computer Science, 1982, pp. 297-311.
[27] H.T. Kung, M.S. Lam, Wafer-scale integration and two-level pipelined implementation of systolic arrays, J. Parallel Distributed Computing 1 (1984) 32-63.


[28] P. Lee, Z.M. Kedem, Synthesizing linear array algorithms from nested for loop algorithms, IEEE Trans. Computers 37 (12) (1988) 1578-1598.
[29] P.A.V. Hall, G.R. Dowling, Approximate string matching, Comput. Surveys 12 (4) (1980) 381-402.
[30] P.A. Pevzner, M.S. Waterman, Matrix longest common subsequence problem, duality and Hilbert bases, in: Proc. 3rd Ann. Symp. on Combinatorial Pattern Matching (CPM), Lecture Notes in Computer Science, vol. 644, Springer, Berlin, 1992, pp. 79-89.

Jean Frédéric Myoupo is Professor of Computer Science in the Faculté de Mathématiques et d'Informatique of the Université de Picardie Jules Verne in Amiens, France, where he heads the Parallel Computing Group in the computer science laboratory LaRIA. He received his Ph.D. degree in applied mathematics from the University of Toulouse III, France, in 1983. He held several positions as Associate Professor in mathematics and computer science departments of different universities: the University of Sherbrooke, Québec, Canada, from 1983 to 1985; the University of Yaoundé, Cameroon, 1985-1990; the Université de Paris XI, Orsay, France, 1990-1993; and the University of Rouen, France, from 1993. Professor Myoupo has served as a member of the program committee of the EUROMICRO Workshop on Parallel and Distributed Systems. He is also a member of the Editorial Board of the International Journal of Computers and their Applications. His current research interests include parallel algorithms and architectures, VLSI design and mobile computing. Professor Myoupo is a member of the IEEE Computer Society and of the Society for Industrial and Applied Mathematics (SIAM).

Guillaume Luce is an Assistant Professor in the Faculté de Mathématiques et d'Informatique of the Université de Picardie Jules Verne in Amiens, France. He received the DEA (Diplôme d'Études Approfondies) in computer science from the Université de Rouen in 1994, and the Ph.D. in computer science from the Université de Picardie Jules Verne in 1997. His current research interests include parallel algorithms and architectures, and VLSI design.