Information Processing Letters 61 (1997) 253-258
Improved linear systolic algorithms for substring statistics *
Jean-Frédéric Myoupo *, Ahmad Wabbi 1
LARIA, CURI, Université de Picardie Jules Verne, 5 rue du Moulin Neuf, 80000 Amiens, France
Received 15 January 1996; revised 15 January 1997
Communicated by S.G. Akl
Abstract

Improved linear and square systolic arrays are presented that support the detection of repetitions in a string and the computation of substring statistics with and without overlap. The running times are 5n/4 - 1 and n for the first and the second problem, respectively, where n is the length of the string, and the numbers of processors are n/4 and $n^2/2$, respectively. © 1997 Elsevier Science B.V.

Keywords: Parallel algorithms; Parallel computation; Parallel processing; Pattern matching; Repetition in a string; Statistics of a string; Systolic architectures

* This work was supported by the “pôle Modélisation” of the “région Picardie”.
* Corresponding author. Email: [email protected].
1 Email: [email protected].
0020-0190/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0020-0190(97)00025-2

1. Introduction

The detection of repetitions in a string has been the subject of many studies because of its important applications in many domains. For instance, it may be used in genealogy to study DNA sequences. It may also be applied in data compression schemes. A variant of the problem, verifying that a string does not contain any repetition, has also attracted the attention of researchers in diverse fields for a long time [4,6,7]. Optimal algorithms that solve the problem of repetition detection in a string x in time $O(|x| \log_2 |x|)$ are introduced in [3,5]. These algorithms perform serial off-line computations on a RAM.

Another problem related to pattern matching is that of computing substring statistics for a given input string. It has many applications in the fields of text processing, data compression, computational linguistics, pattern recognition, etc. [1]. Given a textstring x, the problem consists of calculating how many times each distinct substring (pattern) of x occurs within it. Recently introduced pattern matching tools can be easily adapted to solve this problem in linear time [2]. However, it is sometimes desirable to calculate statistics for nonoverlapping instances of substrings. This problem is harder than the first one and is not linearly solvable in general.

As is well known, systolic arrays play an important role in parallel computation because they are very regular and easy to manufacture. A systolic array is a network of locally connected simple processors. Each processor performs a specific task synchronously with all the other processors. In this paper we present improved systolic array algorithms for the detection of repetitions and for substring statistics with and without overlap. These algorithms are based on the
algorithms presented in [1], which solve these problems in time 3n for the first and 2n for the second, and need n and $n^2$ processors, respectively. Our algorithms reduce the times to 5n/4 - 1 and n and the numbers of processors to n/4 and $n^2/2$, respectively.
2. Statement of the problems

In this section we formally present the problems of repetition detection and substring statistics, together with the systolic algorithms proposed in [1] to solve them. In the next section, we show that these algorithms can be improved in several ways to yield highly optimized systolic configurations.

2.1. Detection of repetitions in textstrings
2.1.1. Definitions

Let I be a finite alphabet and $I^*$ the free semigroup generated by I. A string $x \in I^*$ is fully specified by writing $x = a_1 a_2 \ldots a_n$, where $a_i \in I$ for $i \in \{1, 2, \ldots, n\}$, and $|x| = n$ is the length of x. A substring of x is a string w such that $w = a_i a_{i+1} \ldots a_j$ with $1 \le i \le j \le n$. A factor of x is a substring of x together with its starting position in $\{1, 2, \ldots, n\}$ (that is, a positioned substring). The notation x(i : j) denotes the factor x(i) x(i + 1) ... x(j) of x. A left (right) factor of x is a prefix (suffix) of x. Two factors x(i : j) and x(m : n) are equivalent if their associated substrings are identical.

A string x is primitive if setting $x = u^k$ implies u = x and k = 1. A square in a string x is any non-empty substring of x of the form uu. A string x is square-free if no substring of x is a square. Assume that x is not square-free. A repetition in x is a factor x(i : m) for which there are indices d and j such that i < d < j < m and:
(i) x(i : j) is equivalent to x(d : m),
(ii) x(i : d - 1) corresponds to a primitive string,
(iii) $x(j + 1) \ne x(m + 1)$.

Recall that p is a period of w if w(i) = w(i + p) for i = 1, 2, ..., |w| - p. On the other hand, w is periodic if it has a period $p \le |w|/2$. It is easily seen [2] that a repetition is a positioned periodic substring of the form $u^k u'$, with k > 1, $u \in I^*$, and $u'$ a prefix of u.* A repetition is completely identified by the triple (i, p = d - i, L = m - i + 1) of its starting position i, its period p, and its length L, respectively. Hereafter, a repetition is denoted by R(i, p, L). We remark that only the periodic substrings of x contribute to the set of its repetitions. A repetition R(i, p, L) is maximal if i - p < 0 or if x(i - p : i + L) is not a repetition.

2.1.2. A linear configuration for the detection of repetition

Given a string x of length n and a period p, the following straightforward sequential algorithm detects all repetitions of x of period p in linear time.
count := p
for i = 1 to (n - p) do
begin
    if x(i) ≠ x(i + p) then
        count := p
    else
    begin
        count := count + 1
        if count ≥ 2p then output R(i + p - count + 1, p, count)
    end
end

Iterating through the values of p yields an $O(n^2)$ algorithm. We remark that each iteration of this outer loop is independent of the others. Thus, the authors of [1] propose to use a two-way one-dimensional systolic array of n processors, each devoted to a specific period, to improve the overall performance by a factor of n. Fig. 1 shows this configuration in four clock cycles. The entrance and the exit of the array are both at its leftmost side. Basically, each cell p in the array, except cell 0, is a character comparator that compares character i with character i + p, where $i \in \{1, 2, \ldots, n - p\}$. The role of cell 0 is to form a loop at the rightmost side of the array.

Fig. 1. Four successive configurations of a systolic repetition finder.

* This is an extension of the original definition of repetition found in the literature: there, the definition refers to strings of the form $u^k$.

The algorithm executed by cell p is the following, where Lcounter and icounter are local counters initialized to p and 1, respectively:
begin
    if (x ≠ y) then
    begin
        icounter := icounter + Lcounter - p + 1
        Lcounter := p
    end
    else
    begin
        Lcounter := Lcounter + 1
        if Lcounter ≥ 2p then output R(icounter, p, Lcounter)
    end
end

Finally, note that a blank character is inserted between every two consecutive characters of the string so that each cell can compare characters properly. For example, if no blanks are used, the character x(1) can never meet the character x(2) nor x(4).

2.2. Substring statistics

Calculating substring statistics of a string x with overlap consists of calculating, for each substring w of x, the number of distinct factors of x whose substring is equal to w. On the other hand, substring statistics without overlap consists of calculating, for each substring w of x, the maximum number of distinct factors $x(i_1 : j_1), x(i_2 : j_2), \ldots, x(i_p : j_p)$ corresponding to w such that it is possible to write $x = w_1 w w_2 w \ldots w w_{p+1}$ with $w_d \in I^*$, $d \in \{1, 2, \ldots, p + 1\}$.
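As a small illustration of the two definitions of Section 2.2, the following Python sketch counts the occurrences of every distinct substring by brute force. It is not the systolic algorithm of [1]: the function name `substring_statistics` is hypothetical, the cost is far from linear, and the non-overlapping count assumes that a greedy left-to-right scan yields the maximum number of pairwise non-overlapping occurrences, which holds because all occurrences of a given w have the same length.

```python
from collections import defaultdict


def substring_statistics(x):
    """Return {w: (with_overlap, without_overlap)} for every substring w of x."""
    n = len(x)
    starts = defaultdict(list)  # substring -> increasing list of 0-based starts
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts[x[i:j]].append(i)

    stats = {}
    for w, positions in starts.items():
        # Greedy scan: keep an occurrence only if it begins at or after the
        # end of the previously kept occurrence.
        kept, next_free = 0, 0
        for s in positions:
            if s >= next_free:
                kept += 1
                next_free = s + len(w)
        stats[w] = (len(positions), kept)
    return stats


if __name__ == "__main__":
    stats = substring_statistics("aaaa")
    print(stats["aa"])  # (3, 2): three factors, at most two without overlap
```

For x = aaaa the substring aa has three equivalent factors, starting at positions 1, 2 and 3, but only two of them can be chosen without overlap.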
Fig. 2. Two-dimensional systolic architecture for substring statistics.
A systolic array that calculates substring statistics with and without overlap in linear time is presented in [1] (Fig. 2). It is a two-dimensional array that consists of n vector arrays of the type described in the previous section. Cells in a column are connected by enable lines that carry values used in the algorithm executed by each cell. Data enters each line of the array from its leftmost side in a special feeding pattern. Fig. 3 shows the status of the first two lines of the array at four successive clock cycles according to this feeding pattern. It is shown in [1] that this configuration calculates substring statistics with and without overlap of a string x of length n in 2n clock cycles.
3. Improved linear systolic arrays for detection of repetition

In this section we present two improvements of the linear systolic array presented in Section 2.1.2. Then, in Section 3.3, we show that these two approaches can be applied together to give a highly improved systolic configuration for repetition detection.

3.1. First approach

From the definition of a repetition, it can easily be seen that the value of the period p cannot exceed n/2. Indeed, if we suppose that p > n/2, then $|x| \ge |u^k u'|$ would be greater than n, because k > 1 and $|u| = p > n/2$. As a result, we can keep only those cells that detect repetitions of periods 1 to n/2, which are the n/2 rightmost cells. In this case, the time of calculation becomes 5n/2.
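The effect of the bound p ≤ n/2 can be sketched sequentially in Python by running the counter scheme of Section 2.1.2 once per period and stopping at n/2. This is only an illustration under stated assumptions, not the systolic array itself: the function names are hypothetical, the threshold `count >= 2 * p` is chosen so that a plain square uu is reported, and the starting position `i + p - count + 1` is derived from the meaning of the counter (p plus the number of consecutive matches). Like the counter scheme, the sketch checks neither primitivity nor maximality.

```python
def repetitions_with_period(x, p):
    """Scan x once and collect every triple (i, p, L) the counter scheme emits."""
    n = len(x)
    found = []
    count = p
    for i in range(1, n - p + 1):      # i is a 1-based position in x
        if x[i - 1] != x[i + p - 1]:
            count = p                  # mismatch: the current run is broken
        else:
            count += 1                 # match: the periodic run grows by one
            if count >= 2 * p:         # at least two full copies of the period
                found.append((i + p - count + 1, p, count))
    return found


def all_repetitions(x):
    """Try only the periods 1 .. n/2, following the first approach."""
    out = []
    for p in range(1, len(x) // 2 + 1):
        out.extend(repetitions_with_period(x, p))
    return out


if __name__ == "__main__":
    print(all_repetitions("aabab"))  # [(1, 1, 2), (2, 2, 4)]
```

For x = aabab the sketch reports the square aa at position 1 (period 1) and the square abab at position 2 (period 2); periods 3 and 4 are never tried.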
Fig. 3. Four successive configurations of the first two lines.

Fig. 5. Six successive configurations of the new array.

3.2. Second approach
We saw in Section 2.1.2 that blanks must be inserted between the characters of the string x to ensure the detection of repetitions of odd periods. We remark that this leaves only half of the processors active in each clock cycle (see Fig. 1). To use the processors more efficiently, we suggest removing the blanks and making special arrangements in the architecture of the array so that all periods are still checked. The idea is to divide the transition of data between cells at each clock cycle into two smaller transitions: the first moves data from left to right (upper lines) and the second from right to left (lower lines). The first transition is performed when the clock becomes high and the second when the clock becomes low (Fig. 4). With this modification, each processor of the new configuration performs the work of two processors of the original one, and all processors are busy in every clock cycle (Fig. 5). The number of processors and the calculation time therefore become n/2 and 3n/2 - 1, respectively, for this configuration.
The algorithm executed by each processor p must be slightly modified because it now detects repetitions of periods p and p + 1. The solution is to use four local counters (icounter1, Lcounter1, icounter2, Lcounter2) rather than two, initialized to (0, p, 0, p + 1), respectively, and to duplicate the original algorithm in the following manner:

if clock becomes high then
begin
    if (x ≠ y) then
    begin
        icounter1 := icounter1 + Lcounter1 - p + 1
        Lcounter1 := p
    end
    else
    begin
        Lcounter1 := Lcounter1 + 1
        if Lcounter1 ≥ 2p then output R(icounter1, p, Lcounter1)
    end
end
if clock becomes low then
begin
    if (x ≠ y) then
    begin
        icounter2 := icounter2 + Lcounter2 - (p + 1) + 1
        Lcounter2 := p + 1
    end
    else
    begin
        Lcounter2 := Lcounter2 + 1
        if Lcounter2 ≥ 2p + 2 then output R(icounter2, p + 1, Lcounter2)
    end
end

3.3. A combined approach

In the previous sections we presented two approaches that improve the speed and the space of the original linear systolic array. We remark that the two approaches are totally independent and can be applied together to give a more efficient configuration. According to the first approach, the new array only needs to detect repetitions that have a period $p \in \{1, 2, \ldots, n/2\}$. By applying the second, we need only n/4 processors to perform this work. The new configuration has, therefore, n/4 processors and performs the detection of repetitions in time 5n/4 - 1.
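To compare the linear configurations numerically, the processor counts and running times quoted in Sections 1 and 3 can be tabulated for a concrete string length. The Python sketch below only evaluates those quoted expressions; the function name `linear_array_costs` is hypothetical, and exact fractions are used because the text does not say how n/2 and n/4 are rounded when n is not a multiple of 4.

```python
from fractions import Fraction


def linear_array_costs(n):
    """Map each configuration to (processors, time) for a string of length n."""
    n = Fraction(n)
    return {
        "original [1]": (n, 3 * n),
        "first approach": (n / 2, 5 * n / 2),
        "second approach": (n / 2, 3 * n / 2 - 1),
        "combined": (n / 4, 5 * n / 4 - 1),
    }


if __name__ == "__main__":
    for name, (processors, time) in linear_array_costs(16).items():
        print(f"{name}: {processors} processors, {time} clock cycles")
```

For n = 16 this gives 16 processors and 48 cycles for the original array, against 4 processors and 19 cycles for the combined configuration.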
4. An improved systolic array for substring statistics

We saw in Section 2.2 that the proposed two-dimensional systolic array for substring statistics consists of n repetition finders that are connected by enable lines. As a first intuition, we can use the improved systolic vector presented in Section 3 to build a similar two-dimensional array. The problem is that the first approach does not apply in this case, because we need to compare characters i and i + p for p > n/2 to calculate substring statistics [1]. On the other hand, the second approach applies perfectly with a slight rearrangement of the configuration.

In the original configuration (Fig. 3), we remark that each processor p of the first line compares the character x(1) with x(1 + p) and sends a value along the enable line to the processor just below, which compares x(2) with x(2 + p), and so on. For example, the first processor (the rightmost one) compares x(1) with x(2) and sends a value to the processor just below it. In the next clock cycle, this processor compares x(2) with x(3) and sends a value to the one just below. When we apply the second approach, the values delivered by the enable lines correspond to two periods p and p + 1. To synchronize the work of the processors, the clock of the even lines must be advanced by half a cycle so that when odd lines perform data transition in one direction, even lines perform data transition in the other direction. Fig. 6 shows four consecutive clock cycles of the first two lines of this configuration. Note that each processor treats two periods p and p + 1 and that the values of the enable lines are synchronized between processors. As a result of removing blanks between characters, the calculation time of our configuration equals n rather than 2n for the original one.

Fig. 6. Four successive configurations of the first two lines.
5. Conclusions

We have presented in this paper an improved linear systolic array for the detection of repetitions in a textstring. We have shown that this improvement can also be applied to a two-dimensional array that calculates substring statistics. Whether any of the existing repetition finders with time O(n log n) can be translated into a suitable parallel scheme is an interesting direction for further research.
References

[1] A. Apostolico and A. Negro, Systolic algorithms for string manipulations, IEEE Trans. Comput. 33 (1984) 361-364.
[2] A. Apostolico and F.P. Preparata, A structure for the statistics of all substrings of a textstring with or without overlap, in: Proc. 2nd World Conf. on Mathematics at Service of Man, Las Palmas (1982).
[3] A. Apostolico and F.P. Preparata, Optimal off-line detection of repetitions in a string, Theoret. Comput. Sci. 22 (1983) 297-315.
[4] C.H. Braunholtz, Solution to problem 5030, Ann. Math. 70 (1980) 558-567.
[5] M. Crochemore, An optimal algorithm for computing repetitions in a word, Inform. Process. Lett. 12 (1981) 244-250.
[6] M.A. Harrison, Introduction to Formal Language Theory (Addison-Wesley, Reading, MA, 1978) 36-40.
[7] G.A. Hedlund, Remarks on the work of Axel Thue on sequences, Nord. Mat. Tidskr. 15 (1967) 148-150.
[8] E. Thomas, Detection of repetitions in a word in parallel, internal report, Université de Picardie Jules Verne, Amiens (1995).