Information Processing Letters 93 (2005) 307–313 www.elsevier.com/locate/ipl
A Coarse-Grained Multicomputer algorithm for the detection of repetitions Thierry Garcia, David Semé ∗ LaRIA: Laboratoire de Recherche en Informatique d’Amiens, Université de Picardie Jules Verne, CURI, 5, rue du Moulin Neuf, 80000 Amiens, France Received 17 October 2003 Available online 11 January 2005 Communicated by F. Dehne
Abstract The paper presents a Coarse-Grained Multicomputer algorithm that solves the problem of detection of repetitions. This algorithm can be implemented in the CGM model with P processors in O(N 2 /P ) in time and O(P ) communication steps. It is the first CGM algorithm for this problem. We present also experimental results showing that the CGM algorithm is very efficient. 2004 Elsevier B.V. All rights reserved. Keywords: Parallel algorithms; Coarse-Grained Multicomputers; Dynamic programming; String matching
1. Introduction This paper describes a Coarse-Grained Multicomputer solution for the detection of repetitions. Strings of symbols containing no consecutive occurrences of the same pattern have attracted the attention of researchers in diverse fields for a long time. For example, repetitions play a crucial role in many domains [3,19] such as formal language theory, term rewriting, data compression and computational molecular biol* Corresponding author.
E-mail addresses:
[email protected] (T. Garcia),
[email protected] (D. Semé). 0020-0190/$ – see front matter 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.ipl.2004.12.004
ogy. Perhaps their first appearance dates back to the work of Thue [23], who is generally credited with the discovery of arbitrary long streams of symbols from a finite alphabet that do not contain any “square” substrings, i.e., subpatterns formed by the concatenation of some substring with itself. In recent years, the study of such “square-free” strings has been found relevant to automata and formal language theory, algebraic coding and more generally in systems theory and combinatorics, and we shall make no attempt to refer to the existing copious literature. Suffice it to mention that papers have been devoted to the construction of arbitrarily long squarefree (as well as other related repetition-constrained)
308
T. Garcia, D. Semé / Information Processing Letters 93 (2005) 307–313
strings [18] over alphabets of fixed cardinality. In a related endeavor, the complementary notions of periodicity and overlaps of strings have been extensively investigated and still are an active research subject (see Duval [11] for an extensive bibliography). In the framework of pattern matching [2], some classic results on string periodicity [13] have been used to develop clever techniques for the detection of assigned patterns in text-strings in time linear in the string length [7]. The problem of the efficient recognition of the occurrence of substring squares in a string X stems quite naturally from the preceding remarks, and it is certainly relevant to a variety of practical applications as well [1]. O(N 2 ) time algorithms can be quickly developed on the basis of existing pattern matching techniques and tools (N is the length of the string X). Recently, an O(N log N) algorithm has been proposed to determine whether a given text-string over a finite alphabet contains a repetition [21]. Crochemore [5] developed an O(N log N) algorithm to determine all repetitions in a text-string x. Crochemore’s approach essentially relies on the well-known minimization algorithm for finite state automata [2] and exploits the theoretical bound of N log N repetitions in a string [20] as a terminating condition. Apostolico and Negro [4] presented a systolic algorithm for the detection of repetitions running in time O(N) (with 4N − 3 steps) and using N processors, whereas the linear systolic array for the same problem presented in [22] requires (5N/4) − 1 steps on N/4 processors. The only known algorithm for the PRAM model can be obtained by simulating the linear systolic array on the PRAM. A constant time parallel detection of repetitions with a BSR (Broadcasting with Selective Reduction) solution has been developed by [10]. All known parallel algorithms require a work of O(N 2 ). In this paper we show an algorithm which solve this problem in the CGM model but this algorithm is not work optimal. Indeed, the optimality in work to solve this problem is O(N log N). Here, our algorithm has a work of O(N 2 ) but our goal is to obtain a CGM algorithm from a systolic solution having a work of O(N 2 ). This is a contribution to a more larger work on the adaptation of linear systolic solutions in CGM model. In recent years several efforts have been made to define models of parallel computation that are more realistic than the classical PRAM models. In contrast of
the PRAM, these new models are coarse grained, i.e., they assume that the number of processors P and the size of the input N of an algorithm are orders of magnitudes apart, P N . By the precedent assumption these models map much better on existing architectures where in general the number of processors is at most some thousands and the size of the data that are to be handled goes into millions and billions. This branch of research got its kick-off with Valiant [24] introducing the so-called Bulk Synchronous Parallel (BSP) machine, and was refined in different directions, for example, by Culler et al. [6], LogP, and Dehne et al. [9], CGM extensively studied in [8,12, 14–16]. CGM seems to be the best suited for a design of algorithms that are not too dependent on an individual architecture. We summarize the assumptions of this model: • all algorithms perform in so-called supersteps, that consist of one phase of interprocessor communication and one phase of local computation, • all processors have the same size M = O(N/P ) of memory (M > P ), • the communication network between the processors can be arbitrary. The goal when designing an algorithm in this model is to keep the individual workload, time for communication and idle time of each processor within T /s(P ), where T is the runtime of the best sequential algorithm on the same data and s(P ), the speedup, is a function that should be as close to P as possible. To be able to do so, it is considered as a good idea the fact of keeping the number of supersteps of such an algorithm as low as possible, preferably O(M). As a legacy from the PRAM model it is usually assumed that the number of supersteps should be polylogarithmic in P , but there seems to be no real world rationale for that. In fact, algorithms that simply ensure a number of supersteps that are a function of P (and not of N ) perform quite well in practice, see Goudreau et al. [17]. The paper is organized as follows. In Section 2, we present the detection of repetition problem. The CGM solution of the problem is described in Section 3. Section 4 presents some experimental results and the conclusion ends the paper.
T. Garcia, D. Semé / Information Processing Letters 93 (2005) 307–313
309
(2) X[k : d − 1] corresponds to a primitive word, (3) Xl+1 = Xm+1 .
2. The detection of repetitions problem 2.1. Basic definitions and notations
Definition 6. p is a period of a repetition X[k : m] of X if xi = xi+p (∀i = k, k + 1, . . . , |X[k : m]| − p). Therefore, 1 p |X[k : m]|/2.
Let A be a finite alphabet and A∗ be the free monoid generated by A. Let A+ the free semigroup generated by A. A string X ∈ A+ is fully specified by writing X = x1 . . . xn , where xi ∈ A (1 i n).
2.2. Dynamic programming approach
Definition 1. A factor of X is a substring of X and its starting position in {1, 2, . . . , n}. The notation X[k : l] is used to denote the factor of X : xk xk+1 . . . xl . A left (right) factor of X is a prefix (suffix) of X.
Let X be a word of length N and p be a period, our goal is to detect all repetitions in X for a period p using Definitions 5 and 6. Table 1 gives an example of all periods of the word X of length N = 16. The idea is to cut this table in P columns of N/P elements, with P N and P = 0. Table 2 gives an example of this cut using P = 4 processors. Fig. 2 is composed of square parts of two equal parts of X. Each square part is composed of a low triangular part and a high triangular part. The dynamic programming approach is based on different systolic solutions [25]. The following section describes two sequential algorithms based on the previous definitions, which computes respectively low triangular parts and high triangular parts.
Definition 2. A string X ∈ A+ is primitive if setting X = uk implies u = X and k = 1. Definition 3. A square in a string X is any nonempty substring of X in the form uu. Definition 4. String X is square-free if no substring of X is a square. Equivalently, X is square-free if each substring of X is primitive. Definition 5. A repetition in X is a factor X[k : m] for which there are indices l, d (k < d l m) such that:
2.3. Sequential algorithms Algorithm 1 detects all repetitions for periods p (with 1 p N − 1) from the first character to the
(1) X[k : l] is equivalent to X[d : m], Table 1 Values of all periods p for X = abbbbaacacad 0 a 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
a b b b b a a c a c a d d a c a
1 2 3 4 5 6 7 8
1 b
1 2 3 4 5 6 7 8
2 b
1 2 3 4 5 6 7 8
3 b
1 2 3 4 5 6 7 8
4 b
1 2 3 4 5 6 7 8
5 a
1 2 3 4 5 6 7 8
6 a
1 2 3 4 5 6 7 8
7 c
8 a
9 c
10 a
11 d
12 d
13 a
14 c
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7
1 2 3 4 5 6
1 2 3 4 5
1 2 3 4
1 2 3
1 2
1
15 a
310
T. Garcia, D. Semé / Information Processing Letters 93 (2005) 307–313
Table 2 Cut-table for X = abbbbaacacad
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
a b b b b a a c a c
0 a
1 b
2 b
1 2 3
1 2
1
..
4 .
3
2 ..
4 .
5 6
8 .
3
2
4 .
6 ..
8 .
a d d a c
3
2
4 .
6 ..
5 a
6 a
7 c
8 a
9 c
10 a
11 d
12 d
13 a
14 c
15 a
1
..
5
7
4 b
1
..
5
7
3 b
8 .
2
4 .
6 ..
3 ..
5
7
1
8 .
2
4 .
6 ..
3 ..
5
7
1
8 .
2
4 .
6 ..
3 ..
5
7
1
8 .
8 .
a
(N − p)th character of the word X. At the end, two tables Ic and Lc are created or modified. These tables represent respectively starting and ending positions of a repetition in the word X for a period p. The different repetitions are given dynamically during the execution. So, the function R(Ic, p, Lc) is used to write the repetition starting at the position Ic with a period p of length Lc. Algorithm 2 detects all repetitions for periods 1 p N computed from the (N − p + 1)th character to the N th character of the word X. At the end, two tables Ic and Lc are created or modified. These tables represent respectively starting and ending positions of a repetition in the word X for a period p. The different repetitions are given in real time during the execution.
Algorithm REPET_CGM presents the CGM solution of the Detection of Repetition problem. Each processor num (1 num P ) has the numth partition of N/P elements of the input sequence X. The fol-
3
2
4 .
6 ..
1
..
5
7
2.4. CGM algorithm
2
4 .
6 ..
3 ..
5
7
1
5
7
6 ..
8 .
7
1
3
2 ..
4 . 5 6
1
3
2 ..
4 . 5
1
3 ..
4 .
2
1
3
2
1
for (p = 1) to (p = N − 1) for (i = 1) to (i = N − p) if (X[i] = X[i + p]) then Ic[p] = Ic[p] + Lc[p] − p + 1 Lc[p] = p else Lc[p] = Lc[p] + 1 if (Lc[p] 2p) then R(Ic[p], p, Lc[p]) endif endif endfor endfor Algorithm 1.
lowing CGM algorithm presents the program of each processor num. This CGM algorithm uses sequential Algorithms 1 and 2 described before as local computation parts. Two functions are used for communication rounds: send (X, num): a vector X is sent to the processor num,
T. Garcia, D. Semé / Information Processing Letters 93 (2005) 307–313
311
for (p = 1) to (p = N) for (i = N − p + 1) to (i = N) if (X[i] = X[i + p]) then Ic[p] = Ic[p] + Lc[p] − p + 1 Lc[p] = p else Lc[p] = Lc[p] + 1 if (Lc[p] 2p) then R(Ic[p], p, Lc[p]) endif endif endfor endfor Algorithm 2.
for (ii = 0) to (ii (P − 1) − num) if (ii = 0) and (num = 0) then send (X, 0) if (ii = 0) and (num = 0) then receive (Y, ii) if (num = 0) then receive (Y, num − 1) if (ii = 0) then Local computation using Algorithm 2 send (Y, num + 1) send (Lc, num + 1) send (Ic, num + 1) endif if (num = 0) then receive (Lc, num − 1) receive (Ic, num − 1) endif Local computation using Algorithm 1 endfor Algorithm REPET_CGM.
receive (Y, num): a vector Y is received from the processor num.
Fig. 1. Communication round (for 4 processors).
2.6. Complexity The complexity of the CGM algorithm is O(N 2 /P ). Proof. The complexity of sequential algorithms used as local computation in our CGM algorithm is O((N/P )2 ). A processor must execute N/2 periods. Each processor has N/P characters and compares (N/2 × P /N) = P /2 characters. When a processor has finished the execution of its local computations and as the first period begins to the second character, the total complexity is (P /2 × 2 + 1) × N 2 /P = (P + 1) × N 2 /P ). This implies a complexity of O(N 2 /P ). This ends the proof. 2 Then, the work of our CGM solution is O(N 2 ) as for the systolic solution.
2.5. Communication rounds The communication strategy is described in Fig. 1. The number of communication rounds is 2(P − 1). Proof. Each processor sends one message to the processor 0. So we have (P − 1) communication rounds. After that, the message from the last processor received by the processor zero should go through all other processors. This implies (P − 1) communication rounds mere. Thus, the total number of communication rounds is 2(P − 1). This ends the proof. 2
3. Experimental results We have implemented these programs in C language using MPI communication library and tested them on a multiprocessor Celeron 466 MHz platform running LINUX. The communication between processors is performed through an Ethernet switch. Tables 3 and 4 present respectively the total running times and the communication times (in seconds), for each configuration of 1, 2, 4, 8 and 16 processors.
312
T. Garcia, D. Semé / Information Processing Letters 93 (2005) 307–313
Fig. 2. Total running times (in seconds) for different numbers of data items. The curves represent configurations of 2, 4, 8 and 16 processors, respectively. Table 3 Total running times (in seconds) for each configuration of 1, 2, 4, 8 and 16 processors, respectively N
P =1
P =2
P =4
P =8
P = 16
4096 8192 16384 32768 65536 131072 262144 524288 1048576
4.15 16.55 66.10 266.09 1086.78 4376.22 17536.28 81061.04 324395.75
3.11 12.41 49.57 199.57 815.09 3282.16 13152.21 60795.78 243296.81
3.14 13.36 53.27 213.87 868.44 3512.33 14162.64 45294.67 181163.14
1.90 7.68 31.10 124.46 572.94 2307.40 10259.82 42242.09 169284.73
1.19 5.13 20.81 83.85 335.52 1624.47 6569.23 37367.71 150588.20
Table 4 Communication times (in seconds) for each configuration of 2, 4, 8 and 16 processors, respectively N
P =2
P =4
P =8
P = 16
4096 8192 16384 32768 65536 131072 262144 524288 1048576
0.004 0.007 0.013 0.025 0.053 0.104 0.203 0.404 0.809
0.010 0.015 0.061 0.088 0.144 0.221 0.429 0.847 1.686
0.037 0.037 0.039 0.097 0.177 0.508 1.005 2.014 4.064
0.038 0.042 0.055 0.085 0.179 0.611 2.128 4.367 8.852
have N = 2k where k is an integer such that 12 k 20. Table 4 shows that communication times increase proportionally of the number of processors. But Table 3 shows that total running times, for a fixed number of data items, decrease when the number of processors increase. Then, the communication times are much more lower than the computation times. This leads that there is more important to improve local computations than communication rounds. 4. Concluding remarks
The communication times correspond to send and receive N/P characters by a processor to another using the communication rounds described before. Here, we
We presented a Coarse-Grained Multicomputer algorithm that solves the detection of repetitions prob-
T. Garcia, D. Semé / Information Processing Letters 93 (2005) 307–313
lem. This algorithm can be implemented in the CGM model with P processors with a time complexity of O(N 2 /P ) and O(P ) communication rounds. This is the first CGM algorithm for this problem. The paper presents also experimental results showing that the CGM algorithm is very efficient and that it is interesting to increase the number of processors when the input sequence X is very long. We should now implement this algorithm on another platform in order to show its portability. Our goal was to develop a CGM algorithm from a systolic solution in order to show that a linear systolic solution can be implemented in the CGM model with the same work. It will be interesting to develop another CGM algorithm for the same problem using the optimal sequential solution as local computation.
References [1] D.G. Lainiotis, A. Apostolico, Fast applications of suffix trees, in: N.S. Tzannes (Ed.), Advances in Control, D. Reidel, Dordrecht, 1974. [2] A.V. Aho, J.E. Hopcroft, J.D. Ullman, The Design and Analysis of Computer Algorithms, Addison–Wesley, Reading, MA, 1974. [3] A.V. Aho, Algorithms for finding patterns in strings, in: J. Van Leeuwen (Ed.), Handbook of Theoretical Computer Science, Elsevier, Amsterdam, 1990. [4] A. Apostolico, A. Negro, Systolic algorithm for string manipulations, IEEE Trans. Comput. c-33 (4) (1984) 515–520. [5] M. Crochemore, An optimal algorithm for computing the repetitions in a word, Inform. Process. Lett. 12 (5) (1981) 244–250. [6] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, T. Von Eicken, Log “p”: towards a realistic model of parallel computation, in: Proc. 4th ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming, 1996, pp. 1–12. [7] J.H. Morris, D.E. Knuth, V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1) (1977) 323–350. [8] F. Dehne, X. Deng, P. Dymond, A. Fabri, A. Khokhar, A randomized parallel 3d convex hull algorithm for coarse grained multicomputers, in: Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, 1995, pp. 27–33. [9] F. Dehne, A. Fabri, A. Rau-Chaplin, Scalable parallel computational geometry for coarse grained multicomputers, Int. J. Comput. Geom. 6 (3) (1996) 379–400.
313
[10] E. Delacourt, J.F. Myoupo, D. Semé, A constant time parallel detection of repetitions, Parallel Process. Lett. 9 (1) (1999) 81– 92. [11] J.P. Duval, Sur la périodicité des mots, Thèse de Docteur du 3me cycle, Faculty of Sciences, University of Rouen, 1978. [12] A. Ferreira, N. Schabanel, A randomized bsp/cgm algorithm for the maximal independent set problem, Parallel Process. Lett. 9 (3) (1999) 411–422. [13] N.J. Fine, H.S. Wilf, Uniqueness theorems for periodic functions, Proc. Amer. Math. Soc. 16 (1965) 109–114. [14] T. Garcia, J.F. Myoupo, D. Semé, A work-optimal CGM algorithm for the longest increasing subsequence problem, in: Proc. Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA’01), 2001. [15] T. Garcia, J.F. Myoupo, D. Semé, A coarse-grained multicomputer algorithm for the longest common subsequence problem, in: Proc. 11th Euromicro Conference on Parallel Distributed and Network based Processing (PDP’03), 2003. [16] T. Garcia, D. Semé, A coarse-grained multicomputer algorithm for the longest repeated suffix ending at each point in a word, in: Proc. 2003 Internat. Conf. on Computational Science and its Application (ICCSA’03), 2003. [17] M. Goudreau, S. Rao, K. Lang, T. Suel, T. Tsantilas, Towards efficiency and portability: Programming with the bsp model, in: Proc. 8th Annual ACM Symp. on Parallel Algorithms and Architectures (SPAA’96), 1996, pp. 1–12. [18] M.A. Harrison, in: Introduction to Formal Language Theory, Addison–Wesley, Reading, MA, 1978, pp. 36–40. [19] J. Van Leeuwen (Ed.), Handbook of Theoretical Computer Science, Elsevier, Amsterdam, 1993. [20] A. Lentin, M.P. Schützenberger, A combinatorial problem in the theory of free monoids, in: Proc. University of North Carolina, 1967, pp. 128–144. [21] R. Lorentz, M. Main, An o(n log n) algorithm for finding repetition in a string, T.R. 79-056 Computer Science Department, Washington State University, Pullman, 1979. [22] J.-F. Myoupo, A. Wabbi, Improved linear systolic algorithms for substrings statistics, Inform. Process. Lett. 61 (1997) 253– 258. [23] A. Thue, Über die gegenseitige Lage gleicher Teile gewisser Zeichenreichen, Skr. Vid.-Kristiana I. Mat. Naturv. Klasse 1 (1912) 1–67. [24] L.G. Valiant, A bridging model for parallel computation, Comm. ACM 33 (8) (1990) 103–111. [25] A. Wabbi, Architectures Paralléles systoliques pour la compression de données et la manipulation des sous-chaines, PhD thesis, Université de Picardie Jules Verne, Faculté de Mathématiques et d’Informatique, Amiens, France, 1998.