Information Processing Letters 24 (1987) 325-329 North-Holland
16 March 1987
REMARKS ON STRING-MATCHING AND ONE-WAY MULTIHEAD AUTOMATA
Marek CHROBAK and Wojciech RYTTER
Institute of Informatics, Warsaw University, PKiN VIII p., 00-901 Warsaw, Poland
Communicated by W.M. Turski Received 10 February 1986 Revised 12 May 1986
Keywords: String matching, multihead automata, on-line computation, complexity
The complexity of string-matching has been investigated in depth in both sequential and parallel models of computation [1,6,13]. Since string-matching can be done in real time, there is little point in asking for lower bounds on its time complexity. However, one can ask what is the simplest possible device capable of doing string-matching.

In [4], Galil and Seiferas proved that string-matching can be done by a 6-head deterministic two-way finite automaton (2dfa(k) with k = 6) in linear time. In the same paper, they raised the question whether string-matching can be done by a k-head one-way deterministic finite automaton (1dfa(k), for short), for any k (see also [3]). Note that a 1dfa(k) necessarily works in linear time, unless it loops.

The best result related to this problem was obtained by Li [8,10]. He proved that a 1dfa(3) cannot do string-matching. He also showed some lower bounds for 2dfa(k)'s with k - 1 heads blind.

In this paper we generalize some ideas of his proof in [8] and the one for two heads in [9]. We prove that for each k:
- 1dfa(k)'s cannot do on-line multi-pattern string-matching (even if all patterns are prefixes of the same string),
- 1dfa(k)'s cannot report on-line all occurrences of the nonempty prefixes of a given pattern.

More formally, let I be a finite alphabet with at least two letters, not containing the symbols * and #.
We define the languages

$$L_{sm} = \{\, w \# y w : w, y \in I^* \,\},$$

$$L_1 = \{\, w_1 * w_2 * \cdots * w_n \# y * w_j : w_1, \ldots, w_n \in I^*,\ y \in (I \cup \{*\})^*,\ 1 \le j \le n, \text{ and for } 1 \le i < n,\ w_i \text{ is a prefix of } w_{i+1} \,\},$$

$$L_2 = \{\, w \# y x : w, y, x \in I^*,\ x \text{ is a nonempty prefix of } w \,\}.$$

We consider sensing 1dfa(k)'s only (that is, 1dfa(k)'s that are able to detect coincidences of their heads). Then we can also assume that the order of the heads on the input is fixed. By the first head we shall always mean the rightmost head on the input. We say that a 1dfa(k) A recognizes a language L on-line if for each word w the following statements are equivalent:
(1) w belongs to L,
(2) when, in the computation of A on w, the first head falls off the last symbol of w, A is in an accepting state.

The reason for considering the languages $L_1$ and $L_2$ is that they are similar to $L_{sm}$, that is, to string-matching with one pattern. For example, $L_{sm}$ and $L_1$ can be recognized on-line by 2dfa(2)'s and $L_2$ by a 2dfa(3). One of the difficulties in proving that 1dfa(k)'s cannot recognize $L_{sm}$ is that during the computation an automaton discovers many occurrences of the prefixes of the given pattern in the text and can use this information in further computation. We shall show that this problem can be overcome if we require it to report some (or all) prefixes of the pattern.

0020-0190/87/$3.50 © 1987, Elsevier Science Publishers B.V. (North-Holland)
Volume 24, Number 5, INFORMATION PROCESSING LETTERS

The main result of this paper is the following theorem.

1. Theorem. $L_1$ and $L_2$ cannot be recognized on-line by a 1dfa(k), for any k.

Proof. First we consider $L_1$. We use the notion of Kolmogorov complexity, introduced in [2,7] and recently used to prove some tight lower bounds [10,12]. Let us fix some standard enumeration of Turing machines. The Kolmogorov complexity of a string $x$ with respect to $y$, denoted by $K(x \mid y)$, is the size of the smallest TM which, given $y$, prints only $x$. We also denote $K(x \mid \varepsilon)$ by $K(x)$, where $\varepsilon$ is the empty string. We say that $x$ is random if $K(x) \ge |x|$. It is well known that random strings of each length exist, and that if $x$ is random and $x = uyv$, then

$$K(y \mid x - y) \ge |y| - O(\log |x|), \quad \text{where } x - y = uv.$$

The proof of the theorem is by contradiction. Suppose that there is a 1dfa(k) $M$ recognizing $L_1$ on-line. Let $x$ be a random string such that $|x| = dk^2$, where $d$ is a sufficiently large integer. First we divide $x$ into $k$ segments $x_i$:

$$x = x_1 x_2 \cdots x_k,$$

where $|x_i| = dk$, for $i = 1, 2, \ldots, k$. Then each $x_i$ is divided into $k$ blocks $x_{i,j}$:

$$x_i = x_{i,1} x_{i,2} \cdots x_{i,k},$$

and $|x_{i,j}| = d$, for $i, j = 1, 2, \ldots, k$. Consider all inputs of the form

$$w_1 * w_2 * \cdots * w_k \# y_1 * y_2 * \cdots * y_p, \qquad (\star)$$

where $w_i = x_1 x_2 \cdots x_i$ for $i = 1, 2, \ldots, k$, and, for each $r = 1, 2, \ldots, p$, $y_r = w_i$ for some $i$. Clearly, after reading each segment of $y_p$, $M$ must be in an accepting state. Later we shall use, without explicit reference, the fact that if $d$ is large enough, then $x$ cannot contain either two equal segments or two equal blocks. This follows easily from the randomness of $x$.
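As an illustration only, the inputs of the form displayed above and an off-line membership test for $L_1$ can be sketched in Python. The function names `star_input` and `in_L1` and the sample segments are assumptions made for this sketch and do not come from the paper. The test applies the definition of $L_1$ directly and says nothing about what a 1dfa(k) can do.

```python
from itertools import accumulate


def star_input(segments, choices):
    """Build w_1*...*w_k#y_1*...*y_p with w_i = x_1...x_i.

    segments: the strings x_1, ..., x_k (hypothetical sample data).
    choices:  1-based indices i_1, ..., i_p, meaning y_r = w_{i_r}.
    """
    prefixes = list(accumulate(segments))      # w_1, w_2, ..., w_k
    patterns = "*".join(prefixes)
    text = "*".join(prefixes[i - 1] for i in choices)
    return patterns + "#" + text


def in_L1(s):
    """Off-line check of the definition of L_1 (letters: anything but * and #)."""
    if s.count("#") != 1:
        return False
    left, right = s.split("#")
    patterns = left.split("*")
    # every w_i must be a prefix of w_{i+1}
    if any(not b.startswith(a) for a, b in zip(patterns, patterns[1:])):
        return False
    # the part after # must have the form y * w_j, so it contains a '*'
    if "*" not in right:
        return False
    last_field = right.rsplit("*", 1)[1]
    return last_field in patterns
```

Truncating such an input right after a complete prefix $w_i$ inside the last field keeps it in $L_1$, which is the on-line behaviour required of $M$; truncating in the middle of a segment does not.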
We say that a segment $x_i$ of $y_p$ is matched if there is a time in the computation of $M$ on $(\star)$ when the first head of $M$ is on this $x_i$ and some other head is on another occurrence of segment $x_i$. We say that a block $x_{i,j}$ of $y_p$ is matched if there is a time in the computation of $M$ on $(\star)$ when the first head of $M$ is on the segment $x_i$ (not necessarily on the block $x_{i,j}$) in $y_p$ and some other head is on some other occurrence of $x_{i,j}$.

2. Lemma. Each segment $x_i$ in $y_p$ and each of its blocks must be matched.

Proof. We generalize the proof of [9, Lemma 3.1]. Suppose that $x_i$ in $y_p$ is not matched by $M$. This means that while the first head reads this occurrence of $x_i$, no other head is on any other occurrence of segment $x_i$ on the input. Let $ID_1$ (respectively $ID_2$) be the configuration of $M$ when the first head enters (respectively leaves) this occurrence of $x_i$. We shall construct a small TM $M'$ which, given $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k$, will print $x_i$. Without loss of generality we can assume that $M'$ has $k$ heads on its working tape.

In the first phase, $M'$ prints on its working tape a word of the form $(\star)$, except that the places containing occurrences of $x_i$ are left empty. To do this, $M'$ must store in its finite control the structure of $(\star)$. This can be done with $O(p \log k + \log d)$ states, since it suffices to store for each $r = 1, 2, \ldots, p$ the number $s$ such that $y_r = w_s$.

Let $R$ be the empty region corresponding to this occurrence of $x_i$. $M'$ generates in $R$ all strings over $I$ of length $kd$ in lexicographic order. For each such string $v$, $M'$ simulates $M$ starting from $ID_1$. If $M'$ reaches $ID_2$, it means that $M$ would accept $(\star)$ with $v$ in $R$. This implies that $v = x_i$. Now $M'$ prints $x_i$. The second phase of the computation requires $O(k \log d(k + p))$ states to store $ID_1$ and $ID_2$ and a similar number of states to perform the simulation.

Therefore, the size of $M'$ is $O(\log d)$, since $k$ and $p$ are fixed. This implies that $K(x_i \mid x - x_i) = O(\log d)$. But from the randomness of $x$ we have that

$$K(x_i \mid x - x_i) \ge |x_i| - O(\log |x|) = kd - O(\log d),$$
a contradiction. Thus, each segment must be matched.

It is easy to modify this proof to show that each block $x_{i,j}$ in $y_p$ must be matched. Suppose that $x_{i,j}$ is not matched. Let $R$ be the region in $y_p$ containing $x_{i,j}$. Now we generate in $R$ all strings $v$ of length $d$ and simulate $M$ starting from $ID_1$. If $M'$ reaches $ID_2$, then $v$ must be equal to $x_{i,j}$. With similar reasoning as above we obtain a contradiction. □

The next lemma states that if many segments $x_i$ appear on the input, then at least one of the matching heads must move forward to another copy of $x_i$. To state this more formally, let $MP_r^i$ be the set of all pairs $(h, s)$ such that there is some configuration of $M$ in which its first head is on the $r$th occurrence of $x_i$ in the text and the $h$th head is on the $s$th occurrence of $x_i$. We shall call $MP_r^i$ the matching party for the $r$th occurrence of $x_i$.

3. Lemma. At most $k - 1$ occurrences of segment $x_i$ can have the same matching party.

Proof. We use Lemma 2. Consider $k$ occurrences of $x_i$. We can assume that they are consecutive. Also suppose that

$$MP_r^i = MP_{r+1}^i = \cdots = MP_{r+k-1}^i = \{(h_1, s_1), \ldots, (h_m, s_m)\}$$

for some $m$. Since the last block $x_{i,k}$ in the $r$th occurrence of $x_i$ must be matched, after the first head reads the $r$th occurrence of $x_i$, one head in the matching party, say $h_1$, is on the last block in the $s_1$th occurrence of $x_i$. Similarly, after the first head reads the $(r + 1)$st occurrence of $x_i$, one of the heads, say $h_2$, must be on (or after) the $(k - 1)$st block in the $s_2$th occurrence of $x_i$. Proceeding in this way we obtain that when the first head enters the $(r + m)$th occurrence of $x_i$, the heads $h_1, \ldots, h_m$ are on (or after) some of the blocks numbered $2, \ldots, k$. To match the first block in the $(r + m)$th occurrence of $x_i$, one of the heads must move to another copy of $x_i$. □

To complete the proof of the theorem we must still show that there exist strings $(\star)$ for which $M$ is not able to satisfy the requirements of Lemmas 2 and 3. This will follow directly from Lemma 4 presented below. For the sake of exposition we state it in more abstract terms.

First we define a two-player game, called a matching game. The players are called Matcher and Requester (Mat and Req for short). Mat has $k - 1$ markers. Let $I$ be a finite alphabet, $I = \{a_1, a_2, \ldots, a_k\}$. By $u_0$ we denote the initial string. Initially, all markers are on the first symbol of $u_0$. In each move, Req issues a request $a_i$, for some $1 \le i \le k$. If the last request was $a_i$, $1 \le i < k$, then the next one can be either $a_{i+1}$ or $a_1$. The goal of Mat is to match the request, that is, to place one of its markers on some other occurrence of $a_i$ in $u$, if it is not matched yet. Then the symbol $a_i$ is appended to the end of $u$ ($u := u a_i$).

There are some restrictions concerning the moves of Mat. First, Mat can move its markers forward only. Second, after at most $k - 1$ requests of the same symbol $a_i$ (even if they are separated by other requests), at least one marker used to match these symbols must be moved. Req wins if at some moment Mat cannot satisfy its requests.

4. Lemma. Req has a winning strategy.

Proof. The lemma will follow from the following claim.
5. Claim. For each $1 \le m \le k - 1$ and $u \in I^*$ there exists a string $v_{m,u} = b_1 b_2 \cdots b_r$, where $b_1, b_2, \ldots, b_r \in \{a_1, a_2, \ldots, a_m\}$, such that if Mat has all $k - 1$ markers in $u$, then after the requests $b_1, b_2, \ldots, b_r$ at least $m$ of Mat's markers will be driven into the suffix $v_{m,u}$ of $u' = u v_{m,u}$. Also, the last symbol in $v_{m,u}$ is $a_m$.

Proof of Claim 5. The proof is by induction on $m$.
(1) $m = 1$. Let $c$ be the number of symbols $a_1$ in $u$. After $c(k - 1)^2 + 1$ requests $a_1$, Mat must place one of its markers outside the initial text $u$ because of the rules of the game.
(2) $m = q + 1$, $m \ge 2$. Let $c$ be the number of occurrences of $a_m$ in $u$. We construct the sequence of requests $v$ as follows. Set initially $v$ to be the empty word and $b = c(k - 1)^2 + 1$. Then do the following:

repeat $b$ times $v := v\, v_{q,uv}\, a_m$.

Hence the structure of $v$ is

$$v = v_1 a_m v_2 a_m v_3 a_m \cdots v_b a_m.$$

We know from the inductive hypothesis that, after each sequence of requests $v_i$, $q$ markers are inside $v_i$ and none of them is matching $a_m$ (because $v_i \in \{a_1, \ldots, a_q\}^*$). If Req now requests $a_m$, then his request can only be satisfied by one of the remaining markers which are still before $v$. But it is impossible that $c(k - 1)^2 + 1$ such requests are satisfied inside the initial text $u$. Thus, one of these markers must eventually be placed at some occurrence of $a_m$ in $v$. Naturally, we take $v_{m,u} = v$. This completes the proof of the claim. □

Proof of Lemma 4 (continued). To complete the proof of the lemma let $u = u_0$. We have a $v = v_{k-1,u}$ such that after the sequence $v$ of requests all $k - 1$ markers will be outside $u_0$, and $v$ does not contain $a_k$. Now, making an additional request $a_k$, Req wins the game.

To clarify the proof, consider an example. For $k = 2$ and $u_0 = a_1 a_2$ the winning strategy for Req is $v = a_1^2 a_2$. If $k = 3$ and $u_0 = a_1 a_2 a_3$, the strategy is

$$a_1^{5} a_2\, a_1^{25} a_2\, a_1^{125} a_2\, a_1^{625} a_2\, a_1^{3125} a_2\, a_1^{15625} a_3.$$
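The recursive construction of the request sequence in the proof of Claim 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `count_symbol`, `drive` and `req_strategy` are hypothetical, strings are run-length encoded as lists of (symbol index, run length) pairs to keep them small, and the bound is read as $c(k-1)^2 + 1$. It is one literal reading of the recursion and is not guaranteed to reproduce the printed example symbol for symbol.

```python
def count_symbol(runs, symbol):
    """Number of occurrences of a_symbol in a run-length encoded string."""
    return sum(n for s, n in runs if s == symbol)


def drive(m, u, k):
    """Run-length encoded request sequence v_{m,u} from Claim 5.

    u is a list of (symbol index, run length) pairs; k - 1 is the
    number of markers held by Mat.
    """
    if m == 1:
        c = count_symbol(u, 1)
        return [(1, c * (k - 1) ** 2 + 1)]
    c = count_symbol(u, m)
    b = c * (k - 1) ** 2 + 1
    v = []
    for _ in range(b):
        # v := v v_{q,uv} a_m, with q = m - 1
        v = v + drive(m - 1, u + v, k) + [(m, 1)]
    return v


def req_strategy(u0, k):
    """Requests v_{k-1,u0} followed by one final request a_k."""
    return drive(k - 1, u0, k) + [(k, 1)]


# k = 2, u_0 = a_1 a_2: two requests a_1, then one request a_2
print(req_strategy([(1, 1), (2, 1)], 2))
```

For $k = 3$ and $u_0 = a_1 a_2 a_3$ this reading yields runs of $a_1$ of lengths 5, 25, 125, 625 and 3125, each followed by a single $a_2$, and then the final request $a_3$. The sequence depends only on $u_0$ and $k$, never on Mat's moves, and its length grows very quickly with $k$.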
Note that Req's strategy does not really depend on the moves of Mat. □

Proof of Theorem 1 (continued). Now we complete the proof of the theorem for $L_1$. Suppose that $M$ recognizes $L_1$. Then we would be able to construct a winning strategy for Mat (or rather a nonlosing one, since it makes the game go on forever) by mimicking the moves of $M$ on inputs of the form $(\star)$ as follows. We take $u_0 = a_1\, a_1 a_2 \cdots a_1 a_2 \cdots a_k$. If Req requests some $a_i$, then we feed $M$ with $x_i$ (or $* x_1$ if $i = 1$). When the first head reads $x_i$, Mat watches the positions of the $k - 1$ remaining heads and places his markers at the corresponding positions in the currently generated string (if some of the heads made several matches he can choose any of them). By Lemmas 2 and 3 this would give a winning strategy for Mat, contradicting Lemma 4. This completes the proof of Theorem 1 for $L_1$.

The proof for $L_2$ is a simple modification of the above argument. We take $\$x$ as a pattern, where $x$ is a random string, $x = x_1 x_2 \cdots x_k$. Then we consider inputs

$$\$x \# y_{i_1} y_{i_2} \cdots y_{i_p}, \qquad (\star\star)$$

where $y_i = \$ x_1 x_2 \cdots x_i$, $i = 1, 2, \ldots, k$.
The rest of the proof is similar. The details are left to the reader. □

If we wished to prove the first part of Theorem 1 without the assumption that all patterns are prefixes of the same string, then the proof would be much simpler. It suffices to consider inputs

$$x_1 * x_2 * \cdots * x_k \# x_k * x_{k-1} * \cdots * x_1,$$

where $x_1 x_2 \cdots x_k$ is a random string.
Acknowledgment

The proof of Lemma 4 has independently been found by Juraj Hromkovič. We would also like to thank the anonymous referee for many remarks and for finding some bugs in the previous version of the paper.
References

[1] R.S. Boyer and J.S. Moore, A fast string searching algorithm, Comm. ACM 20 (1977) 762-772.
[2] G. Chaitin, Algorithmic information theory, IBM J. Res. Develop. 21 (1977) 350-359.
[3] Z. Galil, Open problems in stringology, in: A. Apostolico and Z. Galil, eds., Combinatorial Algorithms on Words, NATO ASI Series (Springer, Berlin, 1974) 350-359.
[4] Z. Galil and J. Seiferas, Time-space optimal string-matching, Proc. 13th ACM STOC (1981) 106-113.
[5] J.E. Hopcroft and J.D. Ullman, Introduction to Automata Theory, Languages and Computation (Addison-Wesley, Reading, MA, 1979).
[6] D.E. Knuth, J.M. Morris, Jr. and V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323-350.
[7] A. Kolmogorov, Three approaches to the quantitative definition of information, Problems of Information Transmission 1 (1965) 1-7.
[8] M. Li, Lower bounds on string-matching, Rept. TR 84-636, Dept. of Computer Science, Cornell Univ., 1984.
[9] M. Li, Unpublished manuscript; private communication.
[10] M. Li, Lower bounds by Kolmogorov complexity, Proc. ICALP'85, Lecture Notes in Computer Science, Vol. 194 (Springer, Berlin, 1985) 383-393.
[11] M. Li and Y. Yesha, String-matching cannot be done by a two-head one-way deterministic finite automaton, Rept. TR 83-579, Dept. of Computer Science, Cornell Univ., 1983.
[12] W.J. Paul, On-line simulation of k + 1 tapes by k tapes requires nonlinear time, Inform. and Control (1982) 1-8.
[13] U. Vishkin, Optimal parallel algorithms for string-matching, Proc. ICALP'85, Lecture Notes in Computer Science, Vol. 194 (Springer, Berlin, 1985) 497-508.