Information Processing Letters 22 (1986) 329-330
North-Holland
30 May 1986
AN ON-LINE PATTERN MATCHING ALGORITHM
Tadao TAKAOKA
Department of Information Science, Faculty of Engineering, University of Ibaraki, Hitachi, Ibaraki 316, Japan
Communicated by W.L. Van der Poel
Received 8 December 1984
Keywords: Pattern matching, on-line algorithm, text editor
1. Introduction
Two fast pattern matching algorithms have appeared in the past, one by Boyer and Moore [1] and the other by Knuth, Morris and Pratt [2]. Both run in linear time O(m + n), where m is the length of the pattern and n is the length of the text. Both are off-line algorithms in the sense that the actual matching starts only after the whole pattern has been input. Since pattern matching algorithms are mainly used in text editors, the response time must be minimized. We therefore have to take into account the time for our finger motion, that is, the time needed to type in the pattern. If the matching can proceed as each new input symbol is given, we should already be close to the position of the pattern in the text when the last symbol of the pattern is given. This consideration is especially important in a virtual memory environment, where memory access is rather slow. The present short article gives an on-line pattern matching algorithm, which performs the matching in parallel with the reading of input symbols.
0020-0190/86/$3.50 © 1986, Elsevier Science Publishers B.V. (North-Holland)

2. Review of the Knuth-Morris-Pratt algorithm

Let tx be the array for the text and pt be that for the pattern. The notation a[1..n] means the concatenation of a[1], ..., a[n]. Let the array h be defined by

    h[p] = max{ s | (s = 0) or (pt[1..s-1] = pt[p-s+1..p-1] and pt[s] ≠ pt[p]) },   h[1] = 0.

The value h[p] gives the next position where matching has to start on the pattern after a mismatch occurs. The pattern matching algorithm is as follows (declaration parts are omitted).

Algorithm A

     1  begin
     2    p := 1; q := 1;
     3    while (p ≤ m) and (q ≤ n) do
     4    begin
     5      while (p > 0) and (pt[p] ≠ tx[q]) do p := h[p];
     6      p := p + 1; q := q + 1
     7    end;
     8    if p ≤ m then writeln('not found')
     9    else writeln('found at', q - p + 1)
    10  end.

In the above, p and q are pointers for the pattern and the text respectively. Let the function f(p) be defined by

    f(p) = max{ s | (1 ≤ s < p) and pt[1..s-1] = pt[p-s+1..p-1] },   f(1) = 0.

The computation of h[p] is like pattern matching of the pattern with itself. The following algorithm can be understood with the aid of the comments for the function f(p).

Algorithm B

    begin
      t := 0; h[1] := 0;
      for p := 2 to m do
      begin
        {t = f(p - 1)}
        while (t > 0) and (pt[p - 1] ≠ pt[t]) do t := h[t];
        t := t + 1;
        {t = f(p)}
        if pt[p] ≠ pt[t] then h[p] := t
        else h[p] := h[t]
      end
    end.

The detailed explanation of these algorithms is given in [2].

3. On-line algorithm

In a real text-editing situation, the mismatch pt[p] ≠ tx[q] often occurs and p is decreased by p := h[p] before it gets larger by p := p + 1 in line 6 of Algorithm A. That is, the value of the pointer p does not get close to m until the pattern is actually found in the text. This observation leads to the idea of computing the value of h[p] only when it is needed. That is to say, the next symbol of the pattern is input only when it is needed. Hence the following algorithm.

Algorithm C

     1  begin {tx: given}
     2    p := 1; q := 1; t := 0; h[1] := 0;
     3    r := 2; read(pt[1]);
     4    while (pt[p] ≠ '#') and (tx[q] ≠ '$') do
     5    begin
     6      while (p > 0) and (pt[p] ≠ tx[q])
     7        do p := h[p];
     8      p := p + 1; q := q + 1;
     9      if (p = r) and (pt[r] ≠ '#') then
    10      begin
    11        read(pt[r]);
    12        while (t > 0) and (pt[r - 1] ≠ pt[t])
    13          do t := h[t];
    14        t := t + 1;
    15        if pt[r] ≠ pt[t] then h[r] := t else h[r] := h[t];
    16        r := r + 1
    17      end
    18    end;
    19    if pt[p] = '#' then writeln('found at', q - p + 1) else writeln('not found')
    20  end.
In the above, Algorithm B is superimposed on Algorithm A, with the variable p of Algorithm B renamed to r. The endmarkers of the pattern and the text are # and $ respectively. Pointer r indicates the position in array h at which the next component of h should be computed. After Algorithm C has computed the first three or four components of h, it most frequently executes lines 7 and 8 and skips lines 10-16, since p < r is likely. The remaining components of h are computed only when the pattern is placed at the right position on the text. Experimental results show that the response time of Algorithm C after the last symbol is given is, on average, much shorter than that of the Knuth-Morris-Pratt algorithm. It is clear that the running time of Algorithm C is O(l + n), where l is the time for the input operation. Note that the Boyer-Moore algorithm [1] cannot be modified into an on-line one, since its matching starts at the end of the pattern. It is not clear whether Algorithm C can be further simplified.
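As an illustration only, the idea of Algorithm C can be sketched in Python as follows. This is not the author's code. The function name online_match, the use of a Python iterator as the source of pattern symbols, and the END sentinel standing in for the endmarker # are assumptions made for the sketch, and the end of the text is detected by its length rather than by $. Arrays are kept 1-based (index 0 unused) so that p, q, t, r and h play the same roles as in the listing. The test pt[r] ≠ '#' of line 9 is left out, since in this sketch p = r can only hold before the endmarker has been read.

```python
from typing import Iterable, List, Optional

END = None  # assumption: stands in for the pattern endmarker '#'


def online_match(pattern_source: Iterable[str], text: str) -> Optional[int]:
    """Return the 1-based position of the first occurrence, or None.

    Pattern symbols are pulled from pattern_source one at a time, and only
    when the matching pointer p reaches the first unread position r.
    """
    src = iter(pattern_source)
    pt: List[Optional[str]] = [None, next(src, END)]  # pt[1] is read up front
    h: List[int] = [0, 0]                             # h[1] = 0
    n = len(text)
    p, q, t, r = 1, 1, 0, 2

    while pt[p] is not END and q <= n:
        # Matching step (Algorithm A): fall back through h on a mismatch.
        while p > 0 and pt[p] != text[q - 1]:
            p = h[p]
        p += 1
        q += 1
        if p == r:
            # Table step (Algorithm B, with r in place of p): read one more
            # pattern symbol and compute the next component h[r].
            pt.append(next(src, END))
            while t > 0 and pt[r - 1] != pt[t]:
                t = h[t]
            t += 1
            h.append(t if pt[r] != pt[t] else h[t])
            r += 1

    return q - p + 1 if pt[p] is END else None


if __name__ == "__main__":
    print(online_match("abc", "xxabcxx"))  # 3
    print(online_match("abd", "abcabc"))   # None
```

Because the pattern is consumed through an iterator, a caller can observe how far the pattern has been read. With the pattern abc and the text xxxx, for instance, only the first pattern symbol is ever requested, which corresponds to the case p < r described above.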
References
[1] R.S. Boyer and J.S. Moore, A fast string searching algorithm, Comm. ACM 20 (10) (1977) 762-772.
[2] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (2) (1977) 323-350.