Worst-case behavior of string-searching algorithms



Received 3 June 1938; revised manuscript received 3 [...]. Recommended by R.C. Bose and G.O.H. Katona

Abstract: If some conditions hold for the string pattern, then in the worst case every string-searching algorithm has to examine all the characters of the text. These conditions hold for more than half of the patterns consisting of 0's and 1's.

AMS Subject Classification: Primary: 68A10, 02E10; Secondary: 05-04.

Key words: String-searching; Worst-case behavior.

Strings are sequences over a finite alphabet. The blocks (subsequences consisting of consecutive characters of the string) are called substrings. If $\mathbf{s}'$ is a substring of $\mathbf{s}$, we denote it by $\mathbf{s}' \subset \mathbf{s}$. Otherwise we write $\mathbf{s}' \not\subset \mathbf{s}$. Let $\mathbf{s}$ be a given pattern of length $|\mathbf{s}| = s$, and let $\mathbf{n}$ be a text of length $|\mathbf{n}| = n$. String-searching algorithms are to answer the question whether $\mathbf{s} \subset \mathbf{n}$ or not. Suppose that the algorithm $A$ has to examine $e(A, \mathbf{n})$ characters of $\mathbf{n}$ in order to answer this question. Then

$$w(\mathbf{s}, n) = \min_{A} \max_{|\mathbf{n}| = n} e(A, \mathbf{n})$$

is the number of characters examined by the best algorithm in the worst case. Rivest proved that $w(\mathbf{s}, n) \ge n - s + 1$ and he showed that this lower bound is the best possible, since $w(\mathbf{s}, n) = n - s + 1$ for the string $\mathbf{s} = 00\ldots0$ whenever $n \equiv s - 1 \pmod{s + 1}$. [...] and the Boyer-Moore algorithm (1976) suggested that one should prove [...] of this upper bound for every pattern $\mathbf{s}$. We give a negative answer to this problem by showing that for more than half of the patterns $w(\mathbf{s}, n) = n$ if the alphabet consists of two symbols 0 and 1. We give some sufficient conditions for [...]
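For very small cases the quantity $w(\mathbf{s}, n)$ can be evaluated directly as a game between the algorithm, which picks the next position to examine, and an adversary, who picks the character revealed there. The following Python sketch is only an illustration under that reading; the function name `worst_case_probes` and the state encoding are assumptions made here, not part of the paper, and the search is exponential in $n$.

```python
from functools import lru_cache
from typing import Optional, Tuple

# A state records what the algorithm knows: '0', '1', or None (not yet examined).
State = Tuple[Optional[str], ...]


def worst_case_probes(pattern: str, n: int) -> int:
    """Brute-force w(pattern, n) over the alphabet {0, 1}; feasible for tiny n only."""
    s = len(pattern)

    def answer_known(state: State) -> bool:
        some_window_open = False
        for start in range(n - s + 1):
            window = state[start:start + s]
            if all(c == p for c, p in zip(window, pattern)):
                return True  # an occurrence is already certain
            if all(c is None or c == p for c, p in zip(window, pattern)):
                some_window_open = True  # this window could still match
        return not some_window_open  # every window is already ruled out

    @lru_cache(maxsize=None)
    def value(state: State) -> int:
        if answer_known(state):
            return 0
        best = n
        for i, c in enumerate(state):
            if c is None:
                # The adversary answers with the symbol that costs the most.
                worst = max(
                    1 + value(state[:i] + (sym,) + state[i + 1:]) for sym in "01"
                )
                best = min(best, worst)
        return best

    return value((None,) * n)
```

For instance, `worst_case_probes("00", 4)` evaluates to 3, which equals $n - s + 1$ for $s = 2$ and $n = 4$, while `worst_case_probes("00", 3)` evaluates to 3, that is, to $n$.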

We use the following definitions and notation. Letters denoting strings are printed in bold; the same letters not in bold denote their lengths, so that $x$ is the length of $\mathbf{x}$. Let $BE(b)$ [...] let $B_1$ be the set of patterns $\mathbf{s}$ for which $\mathbf{s}(b)$ has a substring $\mathbf{s}'$ different from $P_1$ and $P_2$ having the property that $\mathbf{s}'$ and $\mathbf{s}$ differ from each other in at most two characters. Also, let $B_2$ be the set of patterns $\mathbf{s}$ for which the string $\mathbf{s}^2 = S_1 S_2$ with $S_1 = S_2 = \mathbf{s}$ has a substring $\mathbf{s}'$ different from $S_1$ and $S_2$ having the property that $\mathbf{s}'$ and $\mathbf{s}$ differ from each other in at most four characters. Let $B_1(s)$ and $B_2(s)$ be the $s$-character long strings in $B_1$ and $B_2$, respectively. Finally, $(a, b)$ denotes the greatest common divisor of the natural numbers $a$ and $b$. We assume that the alphabet consists of 0 and 1.

Proposition 1. We have $w(\mathbf{s}, n) \le w(\mathbf{s}, n + 1)$.

Proof. This follows easily by considering the texts which differ from $\mathbf{s}$ in the very last character.

Proposition 2. If $\mathbf{s} \in BE(t)$ and $\mathbf{t} \in BE(u)$, then $\mathbf{s} \in BE(u)$.

Proposition 3. If $\mathbf{s} \in BE(b)$ and $b > \tfrac{1}{2}s$, then $\mathbf{s} \in BE(b')$ for some $b'$ where $b' \le s - b$.

Proof. Clearly $\mathbf{s} \in BE(b)$ implies that for every $i \le s - b$, the $i$-th and $(i + s - b)$-th characters of $\mathbf{s}$ are equal. Suppose that $k(s - b)$ [...]
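Propositions 2 and 3 can be checked exhaustively on short binary strings. The Python sketch below assumes that $BE(b)$ is the set of strings whose first $b$ characters coincide with their last $b$ characters, which is how the proof of Proposition 3 uses it; the helper names `in_BE`, `check_proposition_2` and `check_proposition_3` are illustrative and do not come from the paper. The search for $b'$ is restricted to $b' \ge 1$ so that the check is not vacuous.

```python
from itertools import product


def in_BE(s: str, b: int) -> bool:
    """Assumed reading of BE(b): the first b characters equal the last b (0 < b < len(s))."""
    return 0 < b < len(s) and s[:b] == s[-b:]


def check_proposition_2(max_len: int = 10) -> bool:
    """If s is in BE(t) and its length-t prefix t is in BE(u), then s is in BE(u)."""
    for length in range(2, max_len + 1):
        for chars in product("01", repeat=length):
            s = "".join(chars)
            for t_len in range(1, length):
                if not in_BE(s, t_len):
                    continue
                t = s[:t_len]
                for u_len in range(1, t_len):
                    if in_BE(t, u_len) and not in_BE(s, u_len):
                        return False
    return True


def check_proposition_3(max_len: int = 10) -> bool:
    """If s is in BE(b) with b > s/2, some b' with 1 <= b' <= s - b has s in BE(b')."""
    for length in range(2, max_len + 1):
        for chars in product("01", repeat=length):
            s = "".join(chars)
            for b in range(length // 2 + 1, length):
                if in_BE(s, b) and not any(
                    in_BE(s, b2) for b2 in range(1, length - b + 1)
                ):
                    return False
    return True
```

Both functions return `True` when no counterexample exists among the binary strings of length at most `max_len`.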
Theorem 4. Suppose $\mathbf{s} \in BE(b)$, $\mathbf{s} \notin B_1 \cup B_2$, $n \ge s(2s - b)/(s, b)$ [...]

Proof. Suppose that $n'$ is a multiple of $(s, b)$ and $n' \ge s(2s - b)/(s, b)$. It is known that $n'$ can be written in the form

$$n' = is + j(2s - b),$$

where $i$ and $j$ are non-negative integers. Thus it is enough to show $w(\mathbf{s}, n') = n'$. To this end, for every algorithm $A$ we construct a text $\mathbf{n}_A$ of length $n'$ such that $e(A, \mathbf{n}_A) = n'$. We divide the $n'$-character text into blocks; $i$ of them will have length $s$, the other $j$ blocks will have length $2s - b$. Every block of $\mathbf{n}_A$ will be equal to $\mathbf{s}$ or $\mathbf{s}(b)$ with the exception of one and two characters respectively, as follows: [...] have the opposite symbols 0 and 1 in the $m$-th character if either

(i) the algorithm $A$ has examined all the other characters of the block by the time it examines the $m$-th one; or

[...] examination of every character. Therefore if $A$ stopped before the $(n')$-th step, it [...] $\mathbf{s} \subset \mathbf{n}_A$. But it is impossible because $\mathbf{s} \subset \mathbf{n}_A$ contradicts $\mathbf{s} \notin B_1$ (if $\mathbf{s}$ occurs [...]) or $\mathbf{s} \notin B_2$ (if $\mathbf{s}$ is formed by the end and the beginning of two consecutive blocks). The proof is complete.
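The decomposition $n' = is + j(2s - b)$ used at the start of the proof can be explored numerically. The following Python sketch searches for non-negative $i$ and $j$ and checks, for small parameters, that every multiple of $(s, b)$ from $s(2s - b)/(s, b)$ upwards admits such a decomposition. The names `block_counts` and `threshold_holds` are assumptions introduced for this illustration, and the check covers only a finite range above the threshold.

```python
from math import gcd
from typing import Optional, Tuple


def block_counts(n_prime: int, s: int, b: int) -> Optional[Tuple[int, int]]:
    """Return (i, j) with n_prime == i*s + j*(2*s - b) and i, j >= 0, or None."""
    long_block = 2 * s - b
    for j in range(n_prime // long_block + 1):
        rest = n_prime - j * long_block
        if rest % s == 0:
            return rest // s, j
    return None


def threshold_holds(s: int, b: int, extra: int = 50) -> bool:
    """Check `extra` consecutive multiples of gcd(s, b) starting at s*(2s - b)/gcd(s, b)."""
    d = gcd(s, b)
    start = s * (2 * s - b) // d  # itself a multiple of d, since d divides 2s - b
    return all(
        block_counts(start + k * d, s, b) is not None for k in range(extra)
    )
```

For example, with $s = 5$ and $b = 2$ the long blocks have length 8, and `block_counts(13, 5, 2)` returns `(1, 1)`, while `block_counts(17, 5, 2)` returns `None` because 17 lies below the threshold 40 and happens not to be representable.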

If $\mathbf{s} \in BE(b)$, $(s, b) = 1$, $\mathbf{s} \notin B_1 \cup B_2$ and $n \ge s(2s - b)$, then $w(\mathbf{s}, n) = n$.

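Corollary 7. If $\mathbf{s} \in BE$ and $\mathbf{s} \notin B_1 \cup B_2$, then $w(\mathbf{s}, n) > n - s + 1$.

Proof. The blocks of $\mathbf{n}_A$ will have length $s$ (with the exception of the last one which [...] characters if $n + 1$ is a multiple of $s$). Proposition 1 completes the proof.

Remark 8. If $\mathbf{s} \in BE(b_1)$, $\mathbf{s} \in BE(b_2)$, ..., $\mathbf{s} \in BE(b_m)$, then in the construction of $\mathbf{n}_A$ the following blocks can be used: $\mathbf{s}$, $\mathbf{s}(b_1)$, $\mathbf{s}(b_2)$, ..., $\mathbf{s}(b_m)$. Therefore $w(\mathbf{s}, n) = n$ whenever $n$ is large enough and $n$ is a multiple of the greatest common divisor of $b_1, b_2, \ldots, b_m$ and $s$, if $\mathbf{s} \notin B_1 \cup B_2$.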

Proposition 9. [...]

Proof. The statement can be seen easily for $s \le 5$. For $s \ge 6$ we have $\tfrac{1}{2}s - 1 \ge \tfrac{1}{3}s$, therefore [...]

Proposition 10. The following equation holds: [...]

Proof. This follows immediately by induction.

Lemma 11. We have [...]

The statements of Lemmas 11 and 12 are obvious for small $s$ (for example if $s \le 6$) since the given upper bounds are greater than $2^s$. Therefore in the following we suppose $s \ge 7$.

Proof of Lemma 11. If $\mathbf{s} \in B_1$, then $\mathbf{s} \in BE(b)$ for some $b$ and $\mathbf{s}(b)$ has a substring $\mathbf{s}'$ that differs from $\mathbf{s}$ in at most two characters. These characters can be chosen in $\tfrac{1}{2}s(s - 1) + s + 1$ different ways. If we fix these at most two characters in which $\mathbf{s}$ and $\mathbf{s}'$ are different and suppose that $\mathbf{s}'$ begins at the $(a + 1)$-th character of $\mathbf{s}(b)$, then the $i$-th character uniquely determines the $(i + a)$-th, $(i - a)$-th, $(i + (s - b))$-th and $(i - (s - b))$-th characters (since $\mathbf{s} \in BE(b)$). This implies that the first $(s - b, a)$ characters determine $\mathbf{s}$. Therefore (using Propositions 3, 9 and 10) it follows that

$$[\ldots] \le \tfrac{1}{2}(s^2 + s + 2) \sum_{b = 1}^{[\ldots]} (s - b)\, 2^{[\ldots]} < [\ldots]$$
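The number of ways to choose the at most two differing positions, which appears in the proof both as $\tfrac{1}{2}s(s - 1) + s + 1$ and as the factor $\tfrac{1}{2}(s^2 + s + 2)$, can be written out as a short worked equation (choosing zero, one or two of the $s$ positions):

```latex
\[
\binom{s}{0} + \binom{s}{1} + \binom{s}{2}
  = 1 + s + \tfrac{1}{2}s(s-1)
  = \tfrac{1}{2}\left(s^{2} + s + 2\right).
\]
```

For $s = 7$, the smallest case treated in the proof, this gives $1 + 7 + 21 = 29$.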

Proof of Lemma 12. The proof is similar to the proof of Lemma 11. Now $\mathbf{s}$ and $\mathbf{s}'$ may be different in at most four places and the $i$-th character determines the $(i + a)$-th (mod $s$) character of $\mathbf{s}$. Therefore $\mathbf{s}$ is uniquely determined by its first $(s, a)$ characters. Using Proposition 9, we have

$$|B_2(s)| \le \left( \frac{s(s-1)(s-2)(s-3)}{24} + \frac{s(s-1)(s-2)}{6} + \frac{s(s-1)}{2} + s + 1 \right) \sum_{a} 2^{(s, a)} \; [\ldots]$$
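The role of $(s, a)$ in this count can be illustrated in the simplest situation, where no position is allowed to differ: a string whose $i$-th character always equals its $(i + a)$-th (mod $s$) character is fixed by its first $(s, a)$ characters, so there are $2^{(s, a)}$ such binary strings. The Python sketch below verifies this by enumeration; the function name `count_shift_invariant` is an assumption of this illustration, and the case with up to four differing places is not modelled.

```python
from itertools import product
from math import gcd


def count_shift_invariant(s: int, a: int) -> int:
    """Count binary strings x of length s with x[i] == x[(i + a) % s] for all i."""
    return sum(
        all(x[i] == x[(i + a) % s] for i in range(s))
        for x in product("01", repeat=s)
    )


def matches_gcd_formula(max_len: int = 10) -> bool:
    """Compare the enumeration with 2 ** gcd(s, a) for all 1 <= a < s <= max_len."""
    return all(
        count_shift_invariant(s, a) == 2 ** gcd(s, a)
        for s in range(2, max_len + 1)
        for a in range(1, s)
    )
```

For $s = 6$ and $a = 4$ the greatest common divisor is 2, and `count_shift_invariant(6, 4)` accordingly returns 4.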

Corollary 13. For every $s$, $|B_1(s) \cup B_2(s)| = o(2^s)$.

Corollary 14. There exist more than $c_1 2^s - o(2^s)$ patterns $\mathbf{s}$ of length $s$ for which $w(\mathbf{s}, n) = n$ if $n$ is large enough. Here $c_1 > 0.5061$.

Sketch of proof. Using Corollary 6, Remark 8 and Corollary 13, it is enough [...] or 1) form a class of such strings. Another class (for $\mathbf{s}$ beginning and ending in the same [...]-element substring [...]

Corollary 15. In Corollary 14, we can write $c_1 \ge \tfrac{1}{2} + 1/2^{p+1}$ instead of $c_1 > 0.5061$, where $p$ denotes the smallest prime number not dividing $s$.

Proof. One class of patterns will contain the strings beginning and ending in [...]. We construct another class of patterns by fixing the first $p - 1$ characters. Now, let the $p$-th character be different from the first one and let the first and the last $p$ characters of the pattern form the same substring. The number of patterns (of the second class) is $2^{s-p-1}$. Corollaries 6 and 13 [...]
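The constant $\tfrac{1}{2} + 1/2^{p+1}$ and the size $2^{s-p-1}$ of the second class can be checked on small examples. The Python sketch below assumes that the first class consists of the strings whose first and last characters agree, and that the second class consists of the strings whose first and last $p$ characters coincide while the $p$-th character differs from the first; both readings are inferred from the fragmentary proof. All function names are hypothetical, and the count is compared with $2^{s-1} + 2^{s-p-1}$ only when $2p \le s$, so that the two ends of the pattern do not overlap.

```python
from itertools import product


def smallest_prime_not_dividing(s: int) -> int:
    """Smallest prime p with s % p != 0 (trial division; fine for small inputs)."""
    p = 2
    while True:
        is_prime = all(p % q for q in range(2, int(p ** 0.5) + 1))
        if is_prime and s % p != 0:
            return p
        p += 1


def c1_lower_bound(s: int) -> float:
    """The bound 1/2 + 1/2**(p + 1) for the pattern length s."""
    p = smallest_prime_not_dividing(s)
    return 0.5 + 1.0 / 2 ** (p + 1)


def count_two_classes(s: int) -> int:
    """Count binary strings of length s lying in the first or the second class."""
    p = smallest_prime_not_dividing(s)
    count = 0
    for chars in product("01", repeat=s):
        x = "".join(chars)
        first_class = x[0] == x[-1]
        second_class = x[:p] == x[-p:] and x[p - 1] != x[0]
        count += first_class or second_class
    return count
```

For $s = 7$ we have $p = 2$, so `c1_lower_bound(7)` is 0.625 and `count_two_classes(7)` returns $2^6 + 2^4 = 80$.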

Corollary 16. There exist more than $c_2 2^s - o(2^s)$ patterns $\mathbf{s}'$ of length $s$ for which [...] enough (if $n \ge s(2s - 1)$). Here $c_2 > 0.732$.

Sketch of proof. Using Corollaries 5 and 13, it is enough to prove that $|\{\mathbf{s}' \in BE : |\mathbf{s}'| = s\}| > c_2 2^s$ if $s$ is large enough. It can be done in a similar way as in the proof of Corollary 14. We have obtained $c_2 > 0.732$ for the strings $\mathbf{s}' \in BE(b)$ with $b \le 11$.

Corollary 17. For any fixed pattern $\mathbf{s}_0$, there exist $c_1 2^s - o(2^s)$ and $c_2 2^s - o(2^s)$ strings $\mathbf{s}'$ with $\mathbf{s}_0 \subset \mathbf{s}'$ such that $w(\mathbf{s}', n) = n$ and $w(\mathbf{s}', n) \ge n - [\ldots]s + 1$ respectively whenever $n$ is large enough. Here $c_1$ and $c_2$ are the same as in Corollaries 14, 15 and 16.

Proof. It is known that the number of strings $\mathbf{s}'$ with $|\mathbf{s}'| = s$ and $\mathbf{s}_0 \not\subset \mathbf{s}'$ is $o(2^s)$. Therefore the statement follows immediately from the previous three corollaries.
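The proportion of strings belonging to some $BE(b)$ with $b \le 11$ can be estimated by direct enumeration for moderate lengths. The Python sketch below again assumes that $BE(b)$ means that the first and last $b$ characters coincide; the names `has_short_border` and `count_with_short_border` are illustrative only. Dividing the count by $2^s$ gives a fraction that can be compared with the constant 0.732 quoted in the sketch of proof.

```python
from itertools import product


def has_short_border(x: str, max_border: int = 11) -> bool:
    """True if x[:b] == x[-b:] for some b with 1 <= b <= max_border and b < len(x)."""
    upper = min(max_border, len(x) - 1)
    return any(x[:b] == x[-b:] for b in range(1, upper + 1))


def count_with_short_border(s: int, max_border: int = 11) -> int:
    """Number of binary strings of length s lying in BE(b) for some b <= max_border."""
    return sum(
        has_short_border("".join(chars), max_border)
        for chars in product("01", repeat=s)
    )


def short_border_fraction(s: int, max_border: int = 11) -> float:
    """The same count as a fraction of all 2**s binary strings."""
    return count_with_short_border(s, max_border) / 2 ** s
```

A call such as `short_border_fraction(14)` enumerates all 16384 strings of length 14; larger lengths quickly become expensive with this brute-force approach.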

References

Boyer, R.S. and J. Strother Moore (1976). A fast string searching algorithm. Stanford Research Institute Technical Report 3 (March 76).

Rivest, R.L. (1976). On the worst-case behavior of string-searching algorithms. Technical Memorandum 71, Lab. for Comp. Sci., MIT (April 76).