Information Processing Letters 18 (1984) 249-256
North-Holland

AN ANALYTICAL COMPARISON OF TWO STRING SEARCHING ALGORITHMS

Gerhard BARTH
Fachbereich Informatik, Universität Kaiserslautern, D-6750 Kaiserslautern, Fed. Rep. Germany

Communicated by K. Mehlhorn
Received 15 October 1982
Revised 2 January 1984
Average case analyses of two algorithms to locate the leftmost occurrence of a string PATTERN in a string TEXT are conducted in this paper. One algorithm is based on a straightforward trial-and-error approach, the other one uses a sophisticated strategy discovered by Knuth, Morris and Pratt (1977). Costs measured are the expected number of comparisons between individual characters. Let NAIVE and KMP denote the average case complexities of the two algorithms, respectively. We show that 1 - (1/c) + (1/c^2) is an accurate approximation for the ratio KMP/NAIVE, provided both PATTERN and TEXT are random strings over an alphabet of size c. In both cases, the application of Markov chain theory is expedient for performing the analysis. However, in order to get rid of complex conditioning, the Markov chain model for the KMP algorithm is based on some heuristics. This approach is believed to be practically sound. Some indication of the complexity that might be involved in an exact average case analysis of the KMP algorithm can be found in the work by Guibas and Odlyzko (1981).
Keywords: Substring tests, pattern matching algorithms, Knuth-Morris-Pratt algorithm, analysis of combinatorial algorithms, Markov chain theory
1. Two algorithms for substring searching

A problem frequently encountered in text processing is to search for occurrences of a string as a substring in another one. To be more specific, let PATTERN = p_1 p_2 ... p_m and TEXT = t_1 t_2 ... t_n denote two strings. PATTERN is said to be a (contiguous) substring of TEXT if

    p_1 p_2 ... p_m = t_k t_{k+1} ... t_{k+m-1}                                (1.1)

holds for some k, 1 ≤ k ≤ n - m + 1. Various forms of the substring searching problem aim at finding the smallest, the largest, any, or all indices k meeting the requirements of (1.1). Here we will concentrate on finding the smallest of these indices k. This amounts to locating the leftmost occurrence of PATTERN in TEXT. A straightforward solution to this problem aligns TEXT and PATTERN side by side and compares corresponding characters. As soon as a mismatch is detected, PATTERN is shifted one position towards the right end of TEXT and the search is resumed at the first character of PATTERN. Fig. 1 pictorially describes this simple strategy.
Fig. 1. (a) Test if t_{j+i-1} = p_i. (b) New alignment after mismatch.
Algorithm STRING_NAIVE
pp := 1;                                    {initialize pattern pointer}
tp := 1;                                    {initialize text pointer}
while (pp ≤ m) and (tp ≤ n) do
    if PATTERN[pp] = TEXT[tp]
        then pp := pp + 1; tp := tp + 1     {advance both pointers}
        else tp := tp - pp + 2; pp := 1     {retract both pointers}
    endif
endwhile;
if pp > m then write ("leftmost occurrence at", tp - m)
          else write ("no occurrence found")
endif
end STRING_NAIVE
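For readers who want to experiment, the following transcription into Python (not part of the original paper; the function name and 0-based indexing are ours) behaves exactly like STRING_NAIVE:

def string_naive(pattern, text):
    # Return the 0-based index of the leftmost occurrence of pattern
    # in text, or None if no occurrence exists.
    m, n = len(pattern), len(text)
    pp, tp = 0, 0                   # pattern pointer, text pointer
    while pp < m and tp < n:
        if pattern[pp] == text[tp]:
            pp += 1                 # advance both pointers
            tp += 1
        else:
            tp = tp - pp + 1        # retract: restart one position further right
            pp = 0                  # resume at the first pattern character
    return tp - m if pp == m else None

assert string_naive("aba", "cabab") == 1

Here string_naive("aba", "cabab") returns 1, the 0-based counterpart of the 1-based index reported by the pseudocode.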
Algorithm STRING_NAIVE implements this strategy. It is easy to see that STRING_NAIVE may require as many as O(n · m) comparisons of characters in the worst possible case. A clever algorithm for substring searching with an O(n + m) worst case complexity has been developed by Knuth, Morris and Pratt [7]. The basic idea of this method is never to retract pointer tp to the left. Instead, after a mismatch between PATTERN[pp] and TEXT[tp] has been detected, the search is resumed by comparing PATTERN[next(pp)] and TEXT[tp]. Thereby, next is a function defined as

    next(i) = max{ k | 0 ≤ k < i, p_1 ... p_{k-1} = p_{i-k+1} ... p_{i-1} and p_k ≠ p_i }

for 1 ≤ i ≤ m. The rationale behind function next and a detailed discussion of how to compute it are not given here; the reader is referred to [7]. Fig. 2 depicts the critical step involved in this method. Algorithm STRING_KMP given below implements the strategy proposed by Knuth, Morris and Pratt.
Fig. 2. (a) Test if t_{j+i-1} = p_i. (b) New alignment after mismatch, k = next(i).
Algorithm STRING_KMP
pp := 1; tp := 1;                           {initialize pointers}
while (pp ≤ m) and (tp ≤ n) do
    while (PATTERN[pp] ≠ TEXT[tp]) and (pp > 0) do
        pp := next(pp)
    endwhile;
    pp := pp + 1; tp := tp + 1              {advance pointers}
endwhile;
if pp > m then write ("leftmost occurrence at", tp - m)
          else write ("no occurrence found")
endif
end STRING_KMP
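Again purely as an illustration (Python, our identifiers, not from the paper), the sketch below computes next by brute force, directly from the definition in Section 1, and then mirrors STRING_KMP with 1-based pointers; the O(m) computation of next is described in [7]:

def next_table(pattern):
    # next_[i] = max{ k | 0 <= k < i, p_1..p_{k-1} = p_{i-k+1}..p_{i-1}
    #                 and p_k != p_i }, computed naively from the definition.
    m = len(pattern)
    p = ' ' + pattern                   # pad so that p[i] is the i-th character
    next_ = [0] * (m + 1)
    for i in range(1, m + 1):
        for k in range(i - 1, 0, -1):   # the largest qualifying k wins
            if p[1:k] == p[i-k+1:i] and p[k] != p[i]:
                next_[i] = k
                break
    return next_

def string_kmp(pattern, text):
    # Leftmost occurrence of pattern in text (0-based index), or None.
    m, n = len(pattern), len(text)
    nxt = next_table(pattern)
    pp, tp = 1, 1                       # 1-based pointers as in the paper
    while pp <= m and tp <= n:
        while pp > 0 and pattern[pp-1] != text[tp-1]:
            pp = nxt[pp]                # slide the pattern; tp never moves left
        pp += 1
        tp += 1
    return tp - m - 1 if pp > m else None

assert string_kmp("aab", "aaab") == 1

The two conjuncts of the inner loop are swapped relative to the pseudocode so that pattern[pp-1] is never evaluated with pp = 0.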
In the remainder of this paper the average case performances of both STRING_NAIVE and STRING_KMP will be investigated. The analyses center around Markov chain techniques. The required concepts will be briefly introduced in the subsequent section.
2. Markov chains

An extensive treatment of Markov chains will not be given here; the reader is referred to the literature, like [9]. Instead, an intuitively appealing characterization shall suffice. A finite Markov chain is a stochastic process that at any given time is in one of a finite number of states s_1, s_2, ..., s_r, say. The probability that the process moves from s_i to s_j depends only on state s_i and not on the history of how s_i was reached. A transition probability p_{ij} is given for any pair of states. A state is absorbing if it is impossible to leave it. A Markov chain is called absorbing if (i) it has at least one absorbing state, and (ii) from every state it is possible to reach an absorbing state. The transition probabilities p_{ij}, 1 ≤ i, j ≤ r, can be stored in an r × r matrix P.
Fig. 3. (a) A Markov chain. (b) Transition probability matrix P. (c) Fundamental matrix F.
For an absorbing Markov chain with absorbing states s_{k+1}, ..., s_r and non-absorbing states s_1, ..., s_k, matrix P can be arranged as

    P = | Q_{k×k}          A_{k×(r-k)}     |
        | 0_{(r-k)×k}      I_{(r-k)×(r-k)} |

where Q is the submatrix for transitions among non-absorbing states, A is the submatrix for transitions from non-absorbing into absorbing states, and 0 and I are the null and identity submatrices, respectively. Matrix F := (I - Q)^{-1} is called the fundamental matrix. In essence we need only one result from the theory of absorbing Markov chains to assist us in analyzing algorithms STRING_NAIVE and STRING_KMP.

Theorem 2.1. Entry f_{ij} of the fundamental matrix F equals the expected number of visits of state s_j the process makes before absorption, provided it was started in state s_i. (For a proof see [9].)

Corollary 2.2. Σ_{j=1}^{k} f_{ij} equals the expected number of steps the process makes from a start in state s_i until absorption.
Example. Fig. 3(a) contains the description of an absorbing Markov chain with states 1, 2, 3, 4 and 5. States 4 and 5 are absorbing, as follows from the transition probabilities p_{4,4} and p_{5,5} being 1. Fig. 3(b) and (c) show the transition probability matrix P of this chain and its fundamental matrix F, respectively. The dashed lines in Fig. 3(b) indicate the aforementioned partition of P into four submatrices. Matrix F tells us, for example, that when starting in state 2, the process may be expected to return to state 2 once, which results in a total number of two visits to state 2. Furthermore, we may expect that the process will be absorbed after 4 steps (either in state 4 or state 5), if released in state 2.

Of course, the theory of absorbing Markov chains encompasses many more results than just Theorem 2.1 and Corollary 2.2; the interested reader may wish to consult [9]. Yet, the two facts cited above will suffice to analyse the average case performances of STRING_NAIVE and STRING_KMP.
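The arithmetic behind such an example is easy to reproduce. The following sketch (Python with NumPy; the two-state chain is invented for illustration and is not the five-state chain of Fig. 3, whose transition values are not reproduced here) builds Q, forms F = (I - Q)^{-1}, and reads off the quantities of Theorem 2.1 and Corollary 2.2:

import numpy as np

# Invented absorbing chain: states 1 and 2 are non-absorbing, state 3 absorbs.
# From state 1: stay in 1 w.p. 2/3, move to 2 w.p. 1/3.
# From state 2: back to 1 w.p. 1/2, absorbed in 3 w.p. 1/2.
Q = np.array([[2/3, 1/3],
              [1/2, 0.0]])

F = np.linalg.inv(np.eye(2) - Q)    # fundamental matrix, here [[6, 2], [3, 2]]
print(F)                            # F[i, j] = expected visits to state j+1 from state i+1
print(F.sum(axis=1))                # expected steps until absorption: [8, 5] here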
3. Analysis of STRING_NAIVE and STRING_KMP

From Fig. 4 it is straightforward to see that the execution of Algorithm STRING_NAIVE can conveniently be modeled as the activation of a deterministic finite automaton with m + 1 states.
Fig. 4. Automaton modeling STRING_NAIVE.
At any moment, the current value of pp determines the state this automaton acts in. The characters t_1, t_2, ..., t_n of TEXT serve as successive input signals. When reading character t_j in state i, the automaton either advances to state i + 1 or returns to state 1, depending on whether t_j matches p_i or not. As soon as state m + 1 is entered the automaton terminates. In order to apply Theorem 2.1 and Corollary 2.2, the automaton in Fig. 4 is turned into an absorbing Markov chain by the following method:
(1) Assign probabilities a_i (for 'advance') to transitions from state i to state i + 1, for 1 ≤ i ≤ m.
(2) Assign probabilities b_i (for 'go back') to transitions from state i to state 1, for 1 ≤ i ≤ m.
(3) Add a transition from state m + 1 to state m + 1 and assign probability 1 to it.
Thus, we can construct an absorbing Markov chain with the single absorbing state m + 1. Submatrix Q_{m×m} of the chain's transition probability matrix P_{(m+1)×(m+1)} contains the entries

    q_{ij} = b_i    for j = 1, 1 ≤ i ≤ m,
             a_i    for j = i + 1, 1 ≤ i ≤ m,
             0      otherwise.

From there, the fundamental matrix F_{m×m} = (I_{m×m} - Q_{m×m})^{-1} can be computed, whose coefficients are

    f_{ij} = 1/(a_j a_{j+1} ... a_m)                              for i = 1, 1 ≤ j ≤ m,
             (1 - a_i a_{i+1} ... a_m)/(a_j a_{j+1} ... a_m)      for i ≥ 2, 1 ≤ j < i,
             1/(a_j a_{j+1} ... a_m)                              for i ≥ 2, i ≤ j ≤ m.

This can easily be verified by multiplying F with (I - Q) to obtain the identity matrix I. Let us elaborate on the results derived so far by substituting values for the parameters a_i and b_i. If both PATTERN and TEXT are random strings of characters drawn from an alphabet with c elements, the transition probabilities are a_i = 1/c and b_i = (c - 1)/c for 1 ≤ i ≤ m. Entries in the first row of the fundamental matrix F then become f_{1j} = c^{m-j+1}, for 1 ≤ j ≤ m.

Result 3.1. For random strings over an alphabet with c characters, Algorithm STRING_NAIVE performs an average number of

    c^{m+1}/(c - 1) - c/(c - 1)

steps to locate the leftmost occurrence of PATTERN in TEXT.

Proof. From Corollary 2.2 and the fact that STRING_NAIVE always starts in state 1, it follows that we have to evaluate Σ_{j=1}^{m} f_{1j}. We get

    Σ_{j=1}^{m} f_{1j} = Σ_{j=1}^{m} c^{m-j+1} = Σ_{j=1}^{m} c^j = c^{m+1}/(c - 1) - c/(c - 1).
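As a numerical cross-check of Result 3.1 (our sketch, assuming NumPy; not from the paper), one can build Q with a_i = 1/c and b_i = (c - 1)/c, invert I - Q, and compare the first row of F against c^{m-j+1} and its sum against c^{m+1}/(c - 1) - c/(c - 1):

import numpy as np

def naive_chain_check(c, m):
    a, b = 1.0 / c, (c - 1.0) / c
    Q = np.zeros((m, m))
    Q[:, 0] = b                          # mismatch: go back to state 1
    for i in range(m - 1):
        Q[i, i + 1] = a                  # match: advance to the next state
    F = np.linalg.inv(np.eye(m) - Q)
    predicted_row = [c ** (m - j + 1) for j in range(1, m + 1)]
    predicted_sum = (c ** (m + 1) - c) / (c - 1)
    return F[0], predicted_row, F[0].sum(), predicted_sum

print(naive_chain_check(2, 4))           # first row (16, 8, 4, 2), total 30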
As to non-random strings, it is not obvious at all which transition probabilities a_i and b_i should be substituted.
Fig. 5. Automaton modeling STRING_KMP.
Note that the technique employed in the above analysis is independent of any particular choice of values for the parameters a_i and b_i. The reader is invited to pick his favorite transition probabilities for non-random strings and to evaluate Σ_{j=1}^{m} f_{1j} with coefficients f_{1j} = 1/(a_j a_{j+1} ... a_m) as calculated above. □

Let us now turn our attention to the analysis of STRING_KMP. First of all, an automaton has to be tailored to the given string PATTERN = p_1 p_2 ... p_m. Fig. 5 presents the general picture. The specific values of next(i), 1 ≤ i ≤ m, depend on the actual string PATTERN. For an average case analysis of STRING_KMP, the expected values E[X_i] of the random variables X_i defined as

    X_i = j    iff control returns to state j after a mismatch in state i

have to be computed for 3 ≤ i ≤ m. It holds that

    E[X_i] = 1 · (P{next(i) = 1} + P{next(i) = 0}) + 2 · P{next(i) = 2} + ... + (i - 1) · P{next(i) = i - 1}.

From the definition of function next (see Section 1) it follows that

    P{next(i) = j} = P{p_1 ... p_{j-1} = p_{i-j+1} ... p_{i-1} and p_j ≠ p_i}
                     · P{no k > j exists with p_1 ... p_{k-1} = p_{i-k+1} ... p_{i-1} and p_k ≠ p_i}
                   ≤ P{p_1 ... p_{j-1} = p_{i-j+1} ... p_{i-1} and p_j ≠ p_i}.

For random strings drawn from an alphabet with c symbols the probability of any two symbols to match is a = 1/c. Let b denote 1 - a, i.e., the probability of a mismatch between any two symbols. Hence we have P{next(i) = j} ≤ a^{j-1} b and can continue as follows:

    E[X_i] ≤ 1 + Σ_{j=1}^{i-1} j · a^{j-1} b
           ≤ 1 + b/(1 - a)^2        (for 0 < a < 1)
           = 1 + 1/b = 2 + 1/(c - 1).

So we may conclude that for c ≥ 2 (note that for unary alphabets the substring search problem is trivial!) the expected values E[X_i] are bounded by 2 ≤ E[X_i] ≤ 3. The sizes of actual computer codes (ASCII, EBCDIC, BCDIC, etc.) are large enough to warrant the usage of 2 as a very accurate approximation for every E[X_i]. It has to be admitted that this constitutes a heuristic approach, but since it gets rid of some complex conditioning it is believed to be practically sound. Consequently, the state diagram shown in Fig. 6 may serve as a heuristic Markov chain model for the execution of STRING_KMP.

The foregoing discussion has been based on the assumption that random strings are involved in a search implemented by STRING_KMP.
Fig. 6. A heuristic Markov chain model for STRING_KMP.
Here, again, the same remarks as made before about STRING_NAIVE apply, viz. that the applicability of Markov chain techniques to an average case analysis is not restricted to this situation. Each reader may choose a probability distribution he believes to be a good representation for the likelihood of matches and mismatches for a sample of non-random strings. Thereafter, he can proceed in the same way as described here to analyze STRING_KMP.

From Fig. 6 it follows that we have to compute the fundamental matrix F_{m×m} = (I_{m×m} - Q_{m×m})^{-1}, where Q_{m×m} contains the following coefficients:

    q_{ij} = b    for (j = 1; i = 1, 2) or (j = 2; 3 ≤ i ≤ m),
             a    for j = i + 1, 1 ≤ i ≤ m,
             0    otherwise.
It turns out that F_{m×m} has the form

    f_{ij} = (a^{m-1} + b)/a^m                  for i = 1, j = 1,
             b/a^m                              for i = 2, j = 1,
             ((1 - a^{m-i+1}) b)/a^m            for 3 ≤ i ≤ m, j = 1,
             (1 - a^{m-i+1})/a^{m-j+1}          for 3 ≤ i ≤ m, 2 ≤ j < i,
             1/a^{m-j+1}                        otherwise.

The correctness of this statement may be proved by multiplying F and (I - Q), which yields the identity matrix as product. Now we are ready to state our second result.

Result 3.2. For random strings over an alphabet with c characters, Algorithm STRING_KMP performs an average number of

    c^m + (1/(c - 1)) c^{m-1} + c - c/(c - 1)

steps to locate the leftmost occurrence of PATTERN in TEXT.

Proof. By Corollary 2.2, the average number of steps performed by STRING_KMP is Σ_{j=1}^{m} f_{1j} with coefficients f_{1j} as given above. Hence we get the sum
    Σ_{j=1}^{m} f_{1j} = (a^{m-1} + b)/a^m + Σ_{j=2}^{m} 1/a^{m-j+1}
                       = 1/a + (1/a)^m - (1/a)^{m-1} + Σ_{j=1}^{m-1} (1/a)^j
                       = c + c^m - c^{m-1} + (c^m - c)/(c - 1)
                       = c^m + c^{m-1} (c/(c - 1) - 1) + c - c/(c - 1)
                       = c^m + (1/(c - 1)) c^{m-1} + c - c/(c - 1).    □
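Result 3.2 admits the same kind of numerical cross-check as Result 3.1 (our sketch, assuming NumPy; not from the paper): build Q from the coefficients q_{ij} above for some m ≥ 2, invert I - Q, and compare the first-row sum of F with the closed form:

import numpy as np

def kmp_chain_check(c, m):
    a, b = 1.0 / c, (c - 1.0) / c
    Q = np.zeros((m, m))
    Q[0, 0] = b                          # states 1 and 2 fall back to state 1
    Q[1, 0] = b
    for i in range(2, m):
        Q[i, 1] = b                      # states 3..m fall back to state 2
    for i in range(m - 1):
        Q[i, i + 1] = a                  # a match advances to the next state
    F = np.linalg.inv(np.eye(m) - Q)
    closed_form = c**m + c**(m - 1) / (c - 1) + c - c / (c - 1)
    return F[0].sum(), closed_form       # the two values agree up to rounding

print(kmp_chain_check(2, 3))             # both are 12.0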
Having derived quantities KMP and NAIVE, say, for the average case complexities of STRING_KMP and STRING_NAIVE, respectively, where

    KMP = c^m + (1/(c - 1)) c^{m-1} + c - c/(c - 1)    and    NAIVE = c^{m+1}/(c - 1) - c/(c - 1),

it is very informative to study the ratio KMP/NAIVE. We get

    KMP/NAIVE = [c^m + (1/(c - 1)) c^{m-1} + c - c/(c - 1)] / [c^{m+1}/(c - 1) - c/(c - 1)]
              = [c^{m+1} - c^m + c^{m-1} + c(c - 2)] / [c^{m+1} - c]
              ≈ 1 - (1/c) + (1/c^2) + (c - 1)/c^m.

For m = 2 this ratio equals 1, which has to be the case, indeed, since then the Markov chains for STRING_NAIVE and STRING_KMP coincide (compare Figs. 4 and 6). For m exceeding 2 the ratio KMP/NAIVE rapidly approaches 1 - 1/c + 1/c^2, a quantity which is always smaller than 1, yet for computer alphabets of sizes c = 64, 128, ... by a negligible amount only. We summarize the outcome of the foregoing analyses in stating that for random strings the difference in the run-times of both methods is on the average very close to zero. Taking into account the additional effort of computing function next, which takes O(m) time (see [7]) and requires m additional storage locations, we may even go a step further and expect that on the average STRING_NAIVE uses less resources than STRING_KMP.
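This behavior of the ratio is easy to reproduce from the two closed forms (a small Python sketch of ours, not from the paper):

def ratio(c, m):
    kmp = c**m + c**(m - 1) / (c - 1) + c - c / (c - 1)
    naive = c**(m + 1) / (c - 1) - c / (c - 1)
    return kmp / naive

for c in (2, 64, 128):
    limit = 1 - 1/c + 1/c**2             # the limiting value 1 - 1/c + 1/c^2
    print(c, [round(ratio(c, m), 5) for m in (2, 3, 5, 10)], round(limit, 5))

For c = 2 the ratio falls from 1 (at m = 2) towards 0.75; for c = 128 it stays within about one percent of 1, matching the claim that the two methods hardly differ on random strings over realistic alphabets.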
4. Conclusion Average case analyses of two substring searching algorithms have been conducted. In both cases methods from Markov chain theory have been employed. This has been expedient since both algorithms lend themselves to modeling their execution as the activation of finite automata. The latter statement is true for a large class of algorithms, whose average case analyses are therefore amenable to the application of Markov chain theory. Surprisingly enough, none of the standard textbooks on the design and analysis of computer algorithms, like [1,5,6,7], contains (to the best of my knowledge) an analysis performed along the lines outlined in this paper. In [2], a more detailed discussion of the average case analysis of finite state algorithms by means of Markov chains is presented.
References

[1] A.V. Aho, J.E. Hopcroft and J.D. Ullman, The Design and Analysis of Computer Algorithms (Addison-Wesley, Reading, MA, 1974).
[2] G. Barth, Analyzing algorithms by Markov chains, in: Methods of Operations Research, Vol. 45 (Athenäum Press, 1982) pp. 405-418.
[3] R. Boyer and J. Moore, A fast string searching algorithm, Comm. ACM 20 (1977) 762-772.
[4] L.J. Guibas and A.M. Odlyzko, String overlaps, pattern matching and nontransitive games, J. Combin. Theory Ser. A 30 (1981) 183-208.
[5] E. Horowitz and S. Sahni, Fundamentals of Computer Algorithms (Computer Science Press, Rockville, MD, 1978).
[6] D. Knuth, The Art of Computer Programming, Vols. 1, 3 (Addison-Wesley, Reading, MA, 1973).
[7] D. Knuth, J. Morris and V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323-350.
[8] E. Reingold, J. Nievergelt and N. Deo, Combinatorial Algorithms (Prentice-Hall, Englewood Cliffs, NJ, 1977).
[9] J. Snell, Introduction to Probability Theory with Computing (Prentice-Hall, Englewood Cliffs, NJ, 1975).