An on-line pattern matching algorithm


Information Processing Letters 22 (1986) 329-330
North-Holland, 30 May 1986

Tadao TAKAOKA

Department of Information Science, Faculty of Engineering, University of Ibaraki, Hitachi, Ibaraki 316, Japan

Communicated by W.L. Van der Poel
Received 8 December 1984

Keywords: Pattern matching, on-line algorithm, text editor

1. Introduction

Two fast pattern matching algorithms have appeared in the past, one by Boyer and Moore [1] and the other by Knuth, Morris and Pratt [2]. Both run in linear time O(m + n), where m is the length of the pattern and n is the length of the text. Both algorithms are off-line in the sense that the actual matching runs only after the whole pattern has been input. Since pattern matching algorithms are mainly used in text editors, the response time must be minimized. Therefore, we have to take into account the time for our finger motion, that is, the time necessary to input the pattern. If we can do the job of matching as the new input symbols are given, we should be close to the position of the pattern in the text when the last symbol of the pattern is given. This consideration is especially important in a virtual memory environment, where the memory access time is rather slow. The present short article gives an algorithm for on-line pattern matching, which performs the matching in parallel with the reading of input symbols.

2. Review of the Knuth-Morris-Pratt algorithm

Let tx be the array for text and pt be that for pattern. The notation a[1..n] means the concatenation of a[1]..a[n]. Let array h be defined by

    h[p] = max{ s | (s = 0) or (pt[1..s-1] = pt[p-s+1..p-1] and pt[s] ≠ pt[p]) },  h[1] = 0.

The value h[p] gives the next position where matching has to start on the pattern after a mismatch occurs. The pattern matching algorithm is as follows (declaration parts are omitted).

Algorithm A
 1 begin
 2   p := 1; q := 1;
 3   while (p ≤ m) and (q ≤ n) do
 4   begin
 5     while (p > 0) and (pt[p] ≠ tx[q]) do p := h[p];
 6     p := p + 1; q := q + 1
 7   end;
 8   if p ≤ m then writeln('not found')
 9   else writeln('found at', q - p + 1)
10 end.

In the above, p and q are pointers for pattern and text respectively.

0020-0190/86/$3.50 © 1986, Elsevier Science Publishers B.V. (North-Holland)

Let the function f(p) be defined by

    f(p) = max{ i | (1 ≤ i < p) and (pt[1..i-1] = pt[p-i+1..p-1]) },  f(1) = 0.

The computation of h[p] is like pattern matching of the pattern with itself. The following algorithm can be understood with the aid of the comments for the function f(p).

Algorithm B
begin
  t := 0; h[1] := 0;
  for p := 2 to m do
  begin
    {t = f(p - 1)}
    while (t > 0) and (pt[p - 1] ≠ pt[t]) do t := h[t];
    t := t + 1;
    {t = f(p)}
    if pt[p] ≠ pt[t] then h[p] := t
    else h[p] := h[t]
  end
end.

The detailed explanation of these algorithms is given in [2].

3. On-line algorithm

In a real text-editing situation, the mismatch pt[p] ≠ tx[q] often occurs and p is decreased by p := h[p] before it gets larger by p := p + 1 in line 6 of Algorithm A. That is, the value of the pointer does not get close to m until the pattern is actually found in the text. This observation leads to the idea of computing the value of h[p] only when it is needed. That is to say, the next symbol of the pattern is input only when it is needed. Hence the following algorithm.

Algorithm C
 1 begin {tx: given}
 2   p := 1; q := 1; t := 0; h[1] := 0;
 3   r := 2; read(pt[1]);
 4   while (pt[p] ≠ '#') and (tx[q] ≠ '$') do
 5   begin
 6     while (p > 0) and (pt[p] ≠ tx[q])
 7       do p := h[p];
 8     p := p + 1; q := q + 1;
 9     if (p = r) and (pt[r] ≠ '#') then
10     begin
11       read(pt[r]);
12       while (t > 0) and (pt[r - 1] ≠ pt[t]) do t := h[t];
13       t := t + 1;
14       if pt[r] ≠ pt[t] then h[r] := t
15       else h[r] := h[t];
16       r := r + 1
17     end
18   end;
19   if pt[p] = '#' then writeln('found at', q - p + 1) else writeln('not found')
20 end.


In the above, Algorithm B is superimposed on Algorithm A with the change of the variable p to r. The endmarkers of pattern and text are # and $ respectively. Pointer r indicates the position in array h at which the next component of h should be computed. After Algorithm C computes the first three or four components of h, it most frequently executes lines 7 and 8 and skips lines 10-16, since p < r is likely. Only when the pattern is placed at the right position on the text are the remaining components of h computed. Experimental results show that the response time of Algorithm C after the last symbol is given is much shorter than that of the Knuth-Morris-Pratt algorithm on average. It is clear that the running time of Algorithm C is O(l + n), where l is the time for the input operation. Note that the Boyer-Moore algorithm [1] cannot be modified into an on-line one, since its pattern matching takes place at the end of the pattern. It is not clear whether Algorithm C can be further simplified.
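As a rough illustration, Algorithms A, B and C might be transcribed into Python as below. This is a sketch under assumptions, not the author's program. The names failure_table, offline_search, online_search and typed are hypothetical, an iterator stands in for the input device, and the arrays are padded at index 0 to keep the 1-based indices used in the text. The sketch further assumes that the endmarkers # and $ never occur inside the pattern or the text. It tests only p = r before reading the next symbol, on the assumption that p cannot step past the pattern endmarker.

```python
from typing import Iterable, List, Optional


def failure_table(pattern: str) -> List[int]:
    """Algorithm B: h[1..m] for the pattern, stored 1-based (index 0 unused)."""
    pt = [None] + list(pattern)
    m = len(pattern)
    h = [0] * (m + 1)                      # h[1] = 0
    t = 0
    for p in range(2, m + 1):
        # here t = f(p - 1)
        while t > 0 and pt[p - 1] != pt[t]:
            t = h[t]
        t += 1
        # here t = f(p)
        h[p] = t if pt[p] != pt[t] else h[t]
    return h


def offline_search(pattern: str, text: str) -> Optional[int]:
    """Algorithm A: 1-based start of the first occurrence, or None."""
    pt = [None] + list(pattern)
    tx = [None] + list(text)
    m, n = len(pattern), len(text)
    h = failure_table(pattern)             # needs the whole pattern up front
    p = q = 1
    while p <= m and q <= n:
        while p > 0 and pt[p] != tx[q]:
            p = h[p]
        p += 1
        q += 1
    return q - p + 1 if p > m else None


def online_search(symbols: Iterable[str], text: str,
                  end_pattern: str = "#", end_text: str = "$") -> Optional[int]:
    """Algorithm C: pattern symbols are pulled from `symbols` only when needed."""
    source = iter(symbols)
    tx = [None] + list(text) + [end_text]
    pt = [None, next(source)]              # read(pt[1])
    h = [0, 0]                             # h[1] = 0
    p, q, t, r = 1, 1, 0, 2
    while pt[p] != end_pattern and tx[q] != end_text:
        while p > 0 and pt[p] != tx[q]:
            p = h[p]
        p += 1
        q += 1
        if p == r:                         # the next pattern symbol is needed now
            pt.append(next(source))        # read(pt[r])
            while t > 0 and pt[r - 1] != pt[t]:
                t = h[t]
            t += 1
            h.append(t if pt[r] != pt[t] else h[t])
            r += 1
    return q - p + 1 if pt[p] == end_pattern else None


requested = []                             # symbols asked for so far


def typed(pattern: str):
    """Hypothetical keyboard: records each symbol at the moment it is requested."""
    for ch in pattern + "#":
        requested.append(ch)
        yield ch


print(online_search(typed("abc"), "xxxx"), requested)   # None ['a']
```

Feeding the pattern through an iterator is what allows the matching loop to run between keystrokes. In the usage line, the text xxxx never matches the first symbol, so the second and third pattern symbols are never requested, which mirrors the remark that p < r is likely.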

References

[1] R.S. Boyer and J.S. Moore, A fast string searching algorithm, Comm. ACM 20 (10) (1977) 762-772.
[2] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (2) (1977) 323-350.