Information Processing Letters 18 (1984) 249-256
North-Holland

AN ANALYTICAL COMPARISON OF TWO STRING SEARCHING ALGORITHMS

Gerhard BARTH
Fachbereich Informatik, Universität Kaiserslautern, D-6750 Kaiserslautern, Fed. Rep. Germany

Communicated by K. Mehlhorn
Received 15 October 1982
Revised 2 January 1984
Average case analyses of two algorithms to locate the leftmost occurrence of a string PATTERN in a string TEXT are conducted in this paper. One algorithm is based on a straightforward trial-and-error approach, the other one uses a sophisticated strategy discovered by Knuth, Morris and Pratt (1977). Costs measured are the expected number of comparisons between individual characters. Let NAIVE and KMP denote the average case complexities of the two algorithms, respectively. We show that 1 - (1/c) + (1/c^2) is an accurate approximation for the ratio KMP/NAIVE, provided both PATTERN and TEXT are random strings over an alphabet of size c. In both cases, the application of Markov chain theory is expedient for performing the analysis. However, in order to get rid of complex conditioning, the Markov chain model for the KMP algorithm is based on some heuristics. This approach is believed to be practically sound. Some indication of the complexity that might be involved in an exact average case analysis of the KMP algorithm can be found in the work by Guibas and Odlyzko (1981).
Keywords: Substring tests, pattern matching algorithms, Knuth-Morris-Pratt algorithm, analysis of combinatorial algorithms, Markov chain theory
1. Two algorithms for substring searching

A problem frequently encountered in text processing is to search for occurrences of a string as a substring in another one. To be more specific, let PATTERN = p_1 p_2 ... p_m and TEXT = t_1 t_2 ... t_n denote two strings. PATTERN is said to be a (contiguous) substring of TEXT if

    p_1 p_2 ... p_m = t_k t_{k+1} ... t_{k+m-1}                                (1.1)

holds for some k, 1 ≤ k ≤ n - m + 1. Various forms of the substring searching problem aim at finding the smallest, the largest, any, or all indices k meeting the requirements of (1.1). Here we will concentrate on finding the smallest of these indices k. This amounts to locating the leftmost occurrence of PATTERN in TEXT. A straightforward solution to this problem aligns TEXT and PATTERN side by side and compares corresponding characters. As soon as a mismatch is detected, PATTERN is shifted one position towards the right end of TEXT and the search is resumed at the first character of PATTERN. Fig. 1 pictorially describes this simple strategy.
Fig. 1. (a) Test if t_{j+i-1} = p_i. (b) New alignment after mismatch.
Algorithm STRING_NAIVE
pp := 1;                                    {initialize pattern pointer}
tp := 1;                                    {initialize text pointer}
while (pp ≤ m) and (tp ≤ n) do
    if PATTERN[pp] = TEXT[tp]
        then pp := pp + 1; tp := tp + 1     {advance both pointers}
        else tp := tp - pp + 2; pp := 1     {retract both pointers}
    endif
endwhile;
if pp > m then write ("leftmost occurrence at", tp - m)
          else write ("no occurrence found")
endif
end STRING_NAIVE
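For readers who want to experiment, the following transcription into Python (not part of the original paper; the function name and 0-based indexing are ours) behaves exactly like STRING_NAIVE:

def string_naive(pattern, text):
    # Return the 0-based index of the leftmost occurrence of pattern
    # in text, or None if no occurrence exists.
    m, n = len(pattern), len(text)
    pp, tp = 0, 0                   # pattern pointer, text pointer
    while pp < m and tp < n:
        if pattern[pp] == text[tp]:
            pp += 1                 # advance both pointers
            tp += 1
        else:
            tp = tp - pp + 1        # retract: restart one position further right
            pp = 0                  # resume at the first pattern character
    return tp - m if pp == m else None

assert string_naive("aba", "cabab") == 1

Here string_naive("aba", "cabab") returns 1, the 0-based counterpart of the 1-based index reported by the pseudocode.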
Algorithm STRING_NAIVE implements this strategy. It is easy to see that STRING_NAIVE may require as many as O(n · m) comparisons of characters in the worst possible case. A clever algorithm for substring searching with an O(n + m) worst case complexity has been developed by Knuth, Morris and Pratt [7]. The basic idea of this method is never to retract pointer tp to the left. Instead, after a mismatch between PATTERN[pp] and TEXT[tp] has been detected, the search is resumed by comparing PATTERN[next(pp)] and TEXT[tp]. Thereby, next is a function defined as

    next(i) = max{ k | 0 ≤ k < i, p_1 ... p_{k-1} = p_{i-k+1} ... p_{i-1} and p_k ≠ p_i }

for 1 ≤ i ≤ m. The rationale behind function next and a detailed discussion of how to compute it are not given here; the reader is referred to [7]. Fig. 2 depicts the critical step involved in this method. Algorithm STRING_KMP given below implements the strategy proposed by Knuth, Morris and Pratt.
Fig. 2. (a) Test if t_{j+i-1} = p_i. (b) New alignment after mismatch, k = next(i).
Algorithm STRING_KMP
pp := 1; tp := 1;                           {initialize pointers}
while (pp ≤ m) and (tp ≤ n) do
    while (PATTERN[pp] ≠ TEXT[tp]) and (pp > 0) do
        pp := next(pp)
    endwhile;
    pp := pp + 1; tp := tp + 1              {advance pointers}
endwhile;
if pp > m then write ("leftmost occurrence at", tp - m)
          else write ("no occurrence found")
endif
end STRING_KMP
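Again purely as an illustration (Python, our identifiers, not from the paper), the sketch below computes next by brute force, directly from the definition in Section 1, and then mirrors STRING_KMP with 1-based pointers; the O(m) computation of next is described in [7]:

def next_table(pattern):
    # next_[i] = max{ k | 0 <= k < i, p_1..p_{k-1} = p_{i-k+1}..p_{i-1}
    #                 and p_k != p_i }, computed naively from the definition.
    m = len(pattern)
    p = ' ' + pattern                   # pad so that p[i] is the i-th character
    next_ = [0] * (m + 1)
    for i in range(1, m + 1):
        for k in range(i - 1, 0, -1):   # the largest qualifying k wins
            if p[1:k] == p[i-k+1:i] and p[k] != p[i]:
                next_[i] = k
                break
    return next_

def string_kmp(pattern, text):
    # Leftmost occurrence of pattern in text (0-based index), or None.
    m, n = len(pattern), len(text)
    nxt = next_table(pattern)
    pp, tp = 1, 1                       # 1-based pointers as in the paper
    while pp <= m and tp <= n:
        while pp > 0 and pattern[pp-1] != text[tp-1]:
            pp = nxt[pp]                # slide the pattern; tp never moves left
        pp += 1
        tp += 1
    return tp - m - 1 if pp > m else None

assert string_kmp("aab", "aaab") == 1

The two conjuncts of the inner loop are swapped relative to the pseudocode so that pattern[pp-1] is never evaluated with pp = 0.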
In the remainder of this paper the average case performances of both STRING_NAIVE and STRING_KMP will be investigated. The analyses center around Markov chain techniques. The required concepts will be briefly introduced in the subsequent section.
2. Markov chains

An extensive treatment of Markov chains will not be given here; the reader is referred to the literature, like [9]. Instead, an intuitively appealing characterization shall suffice. A finite Markov chain is a stochastic process that at any given time is in one of a finite number of states s_1, s_2, ..., s_r, say. The probability that the process moves from s_i to s_j depends only on state s_i and not on the history of how s_i was reached. A transition probability p_{ij} is given for any pair of states. A state is absorbing if it is impossible to leave it. A Markov chain is called absorbing if (i) it has at least one absorbing state, and (ii) from every state it is possible to reach an absorbing state. The transition probabilities p_{ij}, 1 ≤ i, j ≤ r, can be stored in an r × r matrix P.
Fig. 3. (a) A Markov chain. (b) Transition probability matrix P. (c) Fundamental matrix F.
For an absorbing Markov chain with absorbing states s_{k+1}, ..., s_r and non-absorbing states s_1, ..., s_k, matrix P can be arranged as

    P = | Q_{k×k}          A_{k×(r-k)}     |
        | 0_{(r-k)×k}      I_{(r-k)×(r-k)} |

where Q is the submatrix for transitions among non-absorbing states, A is the submatrix for transitions from non-absorbing into absorbing states, and 0 and I are the null and identity submatrices, respectively. Matrix F := (I - Q)^{-1} is called the fundamental matrix. In essence we need only one result from the theory of absorbing Markov chains to assist us in analyzing algorithms STRING_NAIVE and STRING_KMP.

Theorem 2.1. Entry f_{ij} of the fundamental matrix F equals the expected number of visits of state s_j the process makes before absorption, provided it was started in state s_i. (For a proof see [9].)

Corollary 2.2. Σ_{j=1}^{k} f_{ij} equals the expected number of steps the process makes from a start in state s_i until absorption.
Example. Fig. 3(a) contains the description of an absorbing Markov chain with states 1, 2, 3, 4 and 5. States 4 and 5 are absorbing, as follows from the transition probabilities p_{4,4} and p_{5,5} being 1. Fig. 3(b) and (c) show the transition probability matrix P of this chain and its fundamental matrix F, respectively. The dashed lines in Fig. 3(b) indicate the aforementioned partition of P into four submatrices. Matrix F tells us, for example, that when starting in state 2, the process may be expected to return to state 2 once, which results in a total number of two visits to state 2. Furthermore, we may expect that the process will be absorbed after 4 steps (either in state 4 or state 5), if released in state 2.

Of course, the theory of absorbing Markov chains encompasses many more results than just Theorem 2.1 and Corollary 2.2; the interested reader may wish to consult [9]. Yet, the two facts cited above will suffice to analyse the average case performances of STRING_NAIVE and STRING_KMP.
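The arithmetic behind such an example is easy to reproduce. The following sketch (Python with NumPy; the two-state chain is invented for illustration and is not the five-state chain of Fig. 3, whose transition values are not reproduced here) builds Q, forms F = (I - Q)^{-1}, and reads off the quantities of Theorem 2.1 and Corollary 2.2:

import numpy as np

# Invented absorbing chain: states 1 and 2 are non-absorbing, state 3 absorbs.
# From state 1: stay in 1 w.p. 2/3, move to 2 w.p. 1/3.
# From state 2: back to 1 w.p. 1/2, absorbed in 3 w.p. 1/2.
Q = np.array([[2/3, 1/3],
              [1/2, 0.0]])

F = np.linalg.inv(np.eye(2) - Q)    # fundamental matrix, here [[6, 2], [3, 2]]
print(F)                            # F[i, j] = expected visits to state j+1 from state i+1
print(F.sum(axis=1))                # expected steps until absorption: [8, 5] here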
3. Analysis of STRING_NAIVE and STRING_KMP

From Fig. 4 it is straightforward to see that the execution of Algorithm STRING_NAIVE can conveniently be modeled as the activation of a deterministic finite automaton with m + 1 states.
Fig. 4. Automaton modeling STRING_NAIVE.
At any moment, the current value of pp determines the state this automaton acts in. The characters t_1, t_2, ..., t_n of TEXT serve as successive input signals. When reading character t_j in state i, the automaton either advances to state i + 1 or returns to state 1, depending on whether t_j matches p_i or not. As soon as state m + 1 is entered the automaton terminates. In order to apply Theorem 2.1 and Corollary 2.2, the automaton in Fig. 4 is turned into an absorbing Markov chain by the following method:
(1) Assign probabilities a_i (for 'advance') to transitions from state i to state i + 1, for 1 ≤ i ≤ m.
(2) Assign probabilities b_i (for 'go back') to transitions from state i to state 1, for 1 ≤ i ≤ m.
(3) Add a transition from state m + 1 to state m + 1 and assign probability 1 to it.
Thus, we can construct an absorbing Markov chain with the single absorbing state m + 1. Submatrix Q_{m×m} of the chain's transition probability matrix P_{(m+1)×(m+1)} contains the entries

    q_{ij} = b_i    for j = 1, 1 ≤ i ≤ m,
             a_i    for j = i + 1, 1 ≤ i ≤ m,
             0      otherwise.

From there, the fundamental matrix F_{m×m} = (I_{m×m} - Q_{m×m})^{-1} can be computed, whose coefficients are

    f_{ij} = 1/(a_j a_{j+1} ... a_m)                              for i = 1, 1 ≤ j ≤ m,
             (1 - a_i a_{i+1} ... a_m)/(a_j a_{j+1} ... a_m)      for i ≥ 2, 1 ≤ j < i,
             1/(a_j a_{j+1} ... a_m)                              for i ≥ 2, i ≤ j ≤ m.

This can easily be verified by multiplying F with (I - Q) to obtain the identity matrix I. Let us elaborate on the results derived so far by substituting values for the parameters a_i and b_i. If both PATTERN and TEXT are random strings of characters drawn from an alphabet with c elements, the transition probabilities are a_i = 1/c and b_i = (c - 1)/c for 1 ≤ i ≤ m. Entries in the first row of the fundamental matrix F then become f_{1j} = c^{m-j+1}, for 1 ≤ j ≤ m.

Result 3.1. For random strings over an alphabet with c characters, Algorithm STRING_NAIVE performs an average number of

    c^{m+1}/(c - 1) - c/(c - 1)

steps to locate the leftmost occurrence of PATTERN in TEXT.

Proof. From Corollary 2.2 and the fact that STRING_NAIVE always starts in state 1, it follows that we have to evaluate Σ_{j=1}^{m} f_{1j}. We get

    Σ_{j=1}^{m} f_{1j} = Σ_{j=1}^{m} c^{m-j+1} = Σ_{j=1}^{m} c^j = c^{m+1}/(c - 1) - c/(c - 1).
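As a numerical cross-check of Result 3.1 (our sketch, assuming NumPy; not from the paper), one can build Q with a_i = 1/c and b_i = (c - 1)/c, invert I - Q, and compare the first row of F against c^{m-j+1} and its sum against c^{m+1}/(c - 1) - c/(c - 1):

import numpy as np

def naive_chain_check(c, m):
    a, b = 1.0 / c, (c - 1.0) / c
    Q = np.zeros((m, m))
    Q[:, 0] = b                          # mismatch: go back to state 1
    for i in range(m - 1):
        Q[i, i + 1] = a                  # match: advance to the next state
    F = np.linalg.inv(np.eye(m) - Q)
    predicted_row = [c ** (m - j + 1) for j in range(1, m + 1)]
    predicted_sum = (c ** (m + 1) - c) / (c - 1)
    return F[0], predicted_row, F[0].sum(), predicted_sum

print(naive_chain_check(2, 4))           # first row (16, 8, 4, 2), total 30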
As to non-random strings, it is not obvious at all which transition probabilities a_i and b_i should be substituted.
Fig. 5. Automaton modeling STRING_KMP.
Note that the technique employed in the above analysis is independent of any particular choice of values for the parameters a_i and b_i. The reader is invited to pick his favorite transition probabilities for non-random strings and to evaluate Σ_{j=1}^{m} f_{1j} with coefficients f_{1j} = 1/(a_j a_{j+1} ... a_m) as calculated above. □

Let us now turn our attention to the analysis of STRING_KMP. First of all, an automaton has to be tailored to the given string PATTERN = p_1 p_2 ... p_m. Fig. 5 presents the general picture. The specific values of next(i), 1 ≤ i ≤ m, depend on the actual string PATTERN. For an average case analysis of STRING_KMP, the expected values E[X_i] of the random variables X_i defined as

    X_i = j    iff control returns to state j after a mismatch in state i

have to be computed for 3 ≤ i ≤ m. It holds that

    E[X_i] = 1 · (P{next(i) = 1} + P{next(i) = 0}) + 2 · P{next(i) = 2} + ... + (i - 1) · P{next(i) = i - 1}.

From the definition of function next (see Section 1) it follows that

    P{next(i) = j} = P{p_1 ... p_{j-1} = p_{i-j+1} ... p_{i-1} and p_j ≠ p_i}
                     · P{no k > j exists with p_1 ... p_{k-1} = p_{i-k+1} ... p_{i-1} and p_k ≠ p_i}
                   ≤ P{p_1 ... p_{j-1} = p_{i-j+1} ... p_{i-1} and p_j ≠ p_i}.

For random strings drawn from an alphabet with c symbols the probability of any two symbols to match is a = 1/c. Let b denote 1 - a, i.e., the probability of a mismatch between any two symbols. Hence we have P{next(i) = j} ≤ a^{j-1} b and can continue as follows:

    E[X_i] ≤ 1 + Σ_{j=1}^{i-1} j · a^{j-1} b
           ≤ 1 + b/(1 - a)^2        (for 0 < a < 1)
           = 1 + 1/b = 2 + 1/(c - 1).

So we may conclude that for c ≥ 2 (note that for unary alphabets the substring search problem is trivial!) the expected values E[X_i] are bounded by 2 ≤ E[X_i] ≤ 3. The sizes of actual computer codes (ASCII, EBCDIC, BCDIC, etc.) are large enough to warrant the usage of 2 as a very accurate approximation for every E[X_i]. It has to be admitted that this constitutes a heuristic approach, but since it gets rid of some complex conditioning it is believed to be practically sound. Consequently, the state diagram shown in Fig. 6 may serve as a heuristic Markov chain model for the execution of STRING_KMP.

The foregoing discussion has been based on the assumption that random strings are involved in a search implemented by STRING_KMP.
Fig. 6. A heuristic Markov chain model for STRING_KMP.
Here, again, the same remarks as made before about STRING_NAIVE apply, viz. that the applicability of Markov chain techniques to an average case analysis is not restricted to this situation. Each reader may choose a probability distribution he believes to be a good representation for the likelihood of matches and mismatches for a sample of non-random strings. Thereafter, he can proceed in the same way as described here to analyze STRING_KMP.

From Fig. 6 it follows that we have to compute the fundamental matrix F_{m×m} = (I_{m×m} - Q_{m×m})^{-1}, where Q_{m×m} contains the following coefficients:

    q_{ij} = b    for (j = 1; i = 1, 2) or (j = 2; 3 ≤ i ≤ m),
             a    for j = i + 1, 1 ≤ i ≤ m,
             0    otherwise.
It turns out that F_{m×m} has the form

    f_{ij} = (a^{m-1} + b)/a^m                  for i = 1, j = 1,
             b/a^m                              for i = 2, j = 1,
             ((1 - a^{m-i+1}) b)/a^m            for 3 ≤ i ≤ m, j = 1,
             (1 - a^{m-i+1})/a^{m-j+1}          for 3 ≤ i ≤ m, 2 ≤ j < i,
             1/a^{m-j+1}                        otherwise.

The correctness of this statement may be proved by multiplying F and (I - Q), which yields the identity matrix as product. Now we are ready to state our second result.

Result 3.2. For random strings over an alphabet with c characters, Algorithm STRING_KMP performs an average number of

    c^m + (1/(c - 1)) c^{m-1} + c - c/(c - 1)

steps to locate the leftmost occurrence of PATTERN in TEXT.

Proof. By Corollary 2.2, the average number of steps performed by STRING_KMP is Σ_{j=1}^{m} f_{1j} with coefficients f_{1j} as given above. Hence we get the sum
    Σ_{j=1}^{m} f_{1j} = (a^{m-1} + b)/a^m + Σ_{j=2}^{m} 1/a^{m-j+1}
                       = 1/a + (1/a)^m - (1/a)^{m-1} + Σ_{j=1}^{m-1} (1/a)^j
                       = c + c^m - c^{m-1} + (c^m - c)/(c - 1)
                       = c^m + c^{m-1} (c/(c - 1) - 1) + c - c/(c - 1)
                       = c^m + (1/(c - 1)) c^{m-1} + c - c/(c - 1).    □
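Result 3.2 admits the same kind of numerical cross-check as Result 3.1 (our sketch, assuming NumPy; not from the paper): build Q from the coefficients q_{ij} above for some m ≥ 2, invert I - Q, and compare the first-row sum of F with the closed form:

import numpy as np

def kmp_chain_check(c, m):
    a, b = 1.0 / c, (c - 1.0) / c
    Q = np.zeros((m, m))
    Q[0, 0] = b                          # states 1 and 2 fall back to state 1
    Q[1, 0] = b
    for i in range(2, m):
        Q[i, 1] = b                      # states 3..m fall back to state 2
    for i in range(m - 1):
        Q[i, i + 1] = a                  # a match advances to the next state
    F = np.linalg.inv(np.eye(m) - Q)
    closed_form = c**m + c**(m - 1) / (c - 1) + c - c / (c - 1)
    return F[0].sum(), closed_form       # the two values agree up to rounding

print(kmp_chain_check(2, 3))             # both are 12.0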
Having derived quantities KMP and NAIVE, say, for the average case complexities of STRING_KMP and STRING_NAIVE, respectively, where

    KMP = c^m + (1/(c - 1)) c^{m-1} + c - c/(c - 1)    and    NAIVE = c^{m+1}/(c - 1) - c/(c - 1),

it is very informative to study the ratio KMP/NAIVE. We get

    KMP/NAIVE = [c^m + (1/(c - 1)) c^{m-1} + c - c/(c - 1)] / [c^{m+1}/(c - 1) - c/(c - 1)]
              = [c^{m+1} - c^m + c^{m-1} + c(c - 2)] / [c^{m+1} - c]
              ≈ 1 - (1/c) + (1/c^2) + (c - 1)/c^m.

For m = 2 this ratio equals 1, which has to be the case, indeed, since then the Markov chains for STRING_NAIVE and STRING_KMP coincide (compare Figs. 4 and 6). For m exceeding 2 the ratio KMP/NAIVE rapidly approaches 1 - 1/c + 1/c^2, a quantity which is always smaller than 1, yet for computer alphabets of sizes c = 64, 128, ... by a negligible amount only. We summarize the outcome of the foregoing analyses in stating that for random strings the difference in the run-times of both methods is on the average very close to zero. Taking into account the additional effort of computing function next, which takes O(m) time (see [7]) and requires m additional storage locations, we may even go a step further and expect that on the average STRING_NAIVE uses less resources than STRING_KMP.
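This behavior of the ratio is easy to reproduce from the two closed forms (a small Python sketch of ours, not from the paper):

def ratio(c, m):
    kmp = c**m + c**(m - 1) / (c - 1) + c - c / (c - 1)
    naive = c**(m + 1) / (c - 1) - c / (c - 1)
    return kmp / naive

for c in (2, 64, 128):
    limit = 1 - 1/c + 1/c**2             # the limiting value 1 - 1/c + 1/c^2
    print(c, [round(ratio(c, m), 5) for m in (2, 3, 5, 10)], round(limit, 5))

For c = 2 the ratio falls from 1 (at m = 2) towards 0.75; for c = 128 it stays within about one percent of 1, matching the claim that the two methods hardly differ on random strings over realistic alphabets.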
4. Conclusion Average case analyses of two substring searching algorithms have been conducted. In both cases methods from Markov chain theory have been employed. This has been expedient since both algorithms lend themselves to modeling their execution as the activation of finite automata. The latter statement is true for a large class of algorithms, whose average case analyses are therefore amenable to the application of Markov chain theory. Surprisingly enough, none of the standard textbooks on the design and analysis of computer algorithms, like [1,5,6,7], contains (to the best of my knowledge) an analysis performed along the lines outlined in this paper. In [2], a more detailed discussion of the average case analysis of finite state algorithms by means of Markov chains is presented.
References

[1] A.V. Aho, J.E. Hopcroft and J.D. Ullman, The Design and Analysis of Computer Algorithms (Addison-Wesley, Reading, MA, 1974).
[2] G. Barth, Analyzing algorithms by Markov chains, in: Methods of Operations Research, Vol. 45 (Athenäum Press, 1982) pp. 405-418.
[3] R. Boyer and J. Moore, A fast string searching algorithm, Comm. ACM 20 (1977) 762-772.
[4] L.J. Guibas and A.M. Odlyzko, String overlaps, pattern matching and nontransitive games, J. Combin. Theory Ser. A 30 (1981) 183-208.
[5] E. Horowitz and S. Sahni, Fundamentals of Computer Algorithms (Computer Science Press, Rockville, MD, 1978).
[6] D. Knuth, The Art of Computer Programming, Vols. 1, 3 (Addison-Wesley, Reading, MA, 1973).
[7] D. Knuth, J. Morris and V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323-350.
[8] E. Reingold, J. Nievergelt and N. Deo, Combinatorial Algorithms (Prentice-Hall, Englewood Cliffs, NJ, 1977).
[9] J. Snell, Introduction to Probability Theory with Computing (Prentice-Hall, Englewood Cliffs, NJ, 1975).