Information Processing Letters 30 (1989) 79-86 North-Holland
30 January 1989
PARTIAL EVALUATION OF PATTERN MATCHING IN STRINGS

Charles CONSEL *
LITP, University of Paris VI, 4 Place Jussieu, 75252 Paris Cedex 05, France

Olivier DANVY **
DIKU - Computer Science Department, University of Copenhagen, Universitetsparken 1, 2100 Copenhagen Ø, Denmark
Communicated by A. Ershov Received 6 July 1988
This article describes how automatically specializing a fairly naive pattern matcher by partial evaluation leads to the Knuth, Morris & Pratt algorithm. Interestingly enough, no theorem proving is needed to achieve the partial evaluation, as was previously argued, and it is sufficient to identify a static component in the computation to get the result: a deterministic finite automaton. This experiment illustrates how a small insight and partial evaluation can achieve a nontrivial result.
Keywords: Partial evaluation, program specialization, Knuth, Morris & Pratt algorithm
Introduction

Finding whether a given string occurs within another is a typical problem to which partial evaluation can be applied. Given a general pattern matching program PM whose semantics is a function

    Pattern-String × Subject-String → Boolean,

one can specialize PM with respect to a given pattern string, and by partial evaluation obtain a program whose semantics has the functionality

    Subject-String → Boolean

where, for some alphabet Σ, Pattern-String = Subject-String = Σ*. We take this approach here and, more precisely, demonstrate that specializing a fairly simple-minded pattern matcher yields the efficient Knuth, Morris & Pratt algorithm [7].

Partial evaluation is a semantics-preserving program transformation based on propagating constants, unfolding and specializing procedures. Presently, partially evaluating the program PM with respect to a pattern string pat:

    Mix PM pat

gives a so-called residual program whose functionality is

    Subject-String → Boolean.
* Electronic mailbox: . . . !inria!litp!chac.
** Electronic mailbox: [email protected].

0020-0190/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)
Volume 30, Number 2
INFORMATION PROCESSING LETTERS
30 January 1989
The residual program performs all the operations of PM on the subject string; all those depending only on the pattern have been carried out by the partial evaluator. Further, self-applying the partial evaluator [3], that is, specializing the partial evaluator with respect to the pattern matching program:

    Mix Mix PM

makes it generate a program compiling a pattern into the program above. ¹ The semantics of the residual program is a function

    Pattern-String → (Subject-String → Boolean)

which illustrates that a self-applicable partial evaluator implements the Curry function.

Self-applicable partial evaluators have existed since 1985 [5]. All the programs described here have been run with the partial evaluator of [6] and with the Schism partial evaluator of [2], both of which are self-applicable. We will use Mix generically for naming any partial evaluator. Below, we reproduce the parts processed with Schism because they are written in Scheme [8] and are thus readable without further description.

Section 1 presents the naive algorithm and Sections 2 and 3 propose one single transformation and its partial evaluation, respectively. Section 4 optimizes the construction further. Section 5 compares the present approach with related work and puts it into perspective.
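The functionality Pattern-String → (Subject-String → Boolean) can be made concrete with a minimal Scheme sketch. It only illustrates the curried type. The names prefix?, match? and make-matcher are hypothetical and introduced here for illustration; the closure returned by make-matcher performs no specialization whatsoever, whereas a matcher generator obtained by self-application would return a residual program in which the pattern-dependent operations have already been carried out.

```scheme
;; Does the list p occur as a prefix of the list d?
(define (prefix? p d)
  (or (null? p)
      (and (not (null? d))
           (equal? (car p) (car d))
           (prefix? (cdr p) (cdr d)))))

;; Quadratic matcher: try every suffix of d in turn.
(define (match? p d)
  (or (prefix? p d)
      (and (not (null? d))
           (match? p (cdr d)))))

;; Hand-written currying (an assumption of this sketch, not Mix output):
;; Pattern-String -> (Subject-String -> Boolean)
(define (make-matcher pat)
  (lambda (subject) (match? pat subject)))

(define aba? (make-matcher '(a b a)))

(aba? '(b a b a b))   ; => #t
(aba? '(a b b a))     ; => #f
```

The point of partial evaluation is that the procedure playing the role of aba? would contain no reference to the pattern as data any more, only tests against the constants a and b.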
1. A naive approach
A fairly simple-minded approach to the problem of finding whether a given pattern occurs within another string is to iteratively compare it with every suffix of the second string.

    (define (kmp p d)
      (or (null? p)
          (start p d)))

    (define (start p d)
      (and (not (null? d))
           (restart p d)))

    (define (loop p d pp dd)
      (or (null? p)
          (and (not (null? d))
               (if (equal? (car p) (car d))
                   (loop (cdr p) (cdr d) pp dd)
                   (restart pp (cdr dd))))))   ; naive!

    (define (restart p d)
      (if (equal? (car p) (car d))
          (loop (cdr p) (cdr d) p d)
          (start p (cdr d))))

This strategy is not optimal. The starting point of [7] is the remark that where to restart after a mismatch can be deduced from the pattern alone. The result is a much better algorithm: a preprocessing time of order O(m) and a matching time of order O(n) rather than O(m × n), m and n being the length of the pattern string and the subject string, respectively.

¹ In Mix Mix PM, the Mix being applied designates its meaning, that is, an input-output function, and the Mix passed as argument designates a program text. The result is a (residual) program text.
Let us illustrate this by partially evaluating that program with respect to a pattern string (a b a). By Mix's binding time analysis [5, 6], variables p and pp are identified as static, that is, computable from the pattern alone. The residual program's functions are versions of the original functions specialized to values of p and pp, renamed to be distinguished from the original versions. They are of the form kmp_pp(d), loop_p,pp(d, dd), start_pp(d) and restart_pp(d), where pp = (a b a) and p ranges over suffixes of (a b a).

    (define (kmp-0 d)
      (start-1 d))

    (define (start-1 d)
      (and (not (null? d))
           (restart-2 d)))

    (define (restart-2 d)
      (if (equal? 'a (car d))
          (loop-3 (cdr d) d)
          (start-1 (cdr d))))

    (define (loop-3 d dd)
      (and (not (null? d))
           (if (equal? 'b (car d))
               (loop-4 (cdr d) dd)
               (restart-2 (cdr dd)))))

    (define (loop-4 d dd)
      (and (not (null? d))
           (if (equal? 'a (car d))
               (loop-5 (cdr d) dd)
               (restart-2 (cdr dd)))))

    (define (loop-5 d dd)
      '#t)
The resulting code is poor. Up to the occurrence of a mismatch, a substring of the subject string is known (since it has iteratively been proven to be equal to the pattern so far, and the pattern is known), but this knowledge is thrown away and the subject string is re-scanned at the state start-1.

2. A still naive approach

Noticing that, up to the point of mismatch, the pattern string is identical to the subject string, we propose to iterate first on two instances of the pattern string up to the point of mismatch and then on the pattern and the subject strings, rather than only iterating along the pattern and the subject strings. This is exactly equivalent since the pattern and the subject string are identical up to that mismatch point. To express it graphically:

[Diagram: the pattern string aligned above the subject string, with the point of mismatch marked.]

On the left side of the mismatch, the pattern string and the subject string are equal. What we propose is, rather than shifting the pattern one place to the right and re-scanning the pattern and the subject strings up to the mismatch point and further:

[Diagram: the pattern string shifted one place to the right and scanned against the subject string.]

instead we match two instances of the pattern string up to the point of mismatch and then scan the pattern and the subject strings further.

[Diagram: the shifted pattern string scanned against a second instance of the pattern string up to the point of mismatch, then against the subject string.]
To match the pattern against a part of itself, any algorithm will do. We use the same strategy of going iteratively along the pattern string. The number of iterations, however, is already given (the length of the substring that matched before the mismatch occurred).

    (define (kmp p d)
      (or (null? p)
          (start p d)))

    (define (start p d)
      (and (not (null? d))
           (restart p d)))

    (define (restart p d)
      (if (equal? (car p) (car d))
          (or (null? (cdr p))
              (and (not (null? (cdr d)))
                   (loop (cdr p) (cdr d) p)))
          (start p (cdr d))))

    (define (loop p d pp)
      (if (equal? (car p) (car d))
          (or (null? (cdr p))
              (and (not (null? (cdr d)))
                   (loop (cdr p) (cdr d) pp)))
          (let ((np (static-kmp pp
                                (cdr pp)
                                (- (length (cdr pp)) (length p)))))
            (if (equal? np pp)                  ; any match at all?
                (if (equal? (car pp) (car p))   ; where to continue?
                    (start pp (cdr d))
                    (restart pp d))
                (loop np d pp)))))

    (define (static-kmp p d n)   ; n records the number of elements to match
      (static-loop p d n p d n))

    (define (static-loop p d n pp dd nn)
      (if (zero? n)
          p
          (if (equal? (car p) (car d))
              (static-loop (cdr p) (cdr d) (sub1 n) pp dd nn)
              (static-kmp pp (cdr dd) (sub1 nn)))))

This meaning-preserving transformation has the effect that in the first part of the algorithm (loop), control is determined by the subject string and in the second (static-loop), it is determined by the pattern string, that is, intensionally, by the "static" variables p, pp, n and nn. That situation is ideal for a partial evaluator.
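That static-kmp depends on the pattern alone can be checked by calling it without any subject string. The sketch below repeats static-kmp and static-loop, writing (- n 1) where the listing uses sub1 so that it loads in any standard Scheme. The helper restart-point is a hypothetical name introduced here; it packages the call made in loop, where pp is the whole pattern and p is its suffix starting at the mismatching character.

```scheme
(define (static-kmp p d n)   ; n records the number of elements to match
  (static-loop p d n p d n))

(define (static-loop p d n pp dd nn)
  (if (zero? n)
      p
      (if (equal? (car p) (car d))
          (static-loop (cdr p) (cdr d) (- n 1) pp dd nn)
          (static-kmp pp (cdr dd) (- nn 1)))))

;; Hypothetical helper (not in the original program): the suffix of the
;; pattern with which matching resumes after a mismatch at p.
;; p must be a proper, non-empty suffix of pp, as it always is in loop.
(define (restart-point pp p)
  (static-kmp pp (cdr pp) (- (length (cdr pp)) (length p))))

(restart-point '(a b a b c) '(c))      ; => (a b c)
(restart-point '(a b a b c) '(b c))    ; => (b a b c)
(restart-point '(a b a b c) '(a b c))  ; => (a b a b c)
```

In the first call, a b a b has been matched and c fails: the prefix a b is still known to match, so scanning resumes with (a b c). In the last call nothing can be reused, the result is the whole pattern, and loop then takes its "any match at all?" branch.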
3. Partial evaluation

Partially evaluating the code given at the end of Section 2 with respect to a pattern has the effect of specializing it and produces surprisingly optimized code. In particular, it does not "back up" the input string.
Here is the new residual program for the string (a b a):

    (define (kmp-0 d)
      (start-1 d))

    (define (start-1 d)
      (and (not (null? d))
           (restart-2 d)))

    (define (restart-2 d)
      (if (equal? 'a (car d))
          (and (not (null? (cdr d)))
               (loop-3 (cdr d)))
          (start-1 (cdr d))))

    (define (loop-3 d)
      (if (equal? 'b (car d))
          (and (not (null? (cdr d)))
               (loop-4 (cdr d)))
          (restart-2 d)))

    (define (loop-4 d)
      (or (equal? 'a (car d))
          (start-1 (cdr d))))
The partial evaluator has processed everything that depends on the pattern string, leaving only the actions depending on the subject string. The resulting program expects a string and scans it linearly. What we obtain is a deterministic finite automaton: it is deterministic because the original program is deterministic, and it is finite because the pattern is finite.

At this point we can state a conclusion: the effect of the Knuth, Morris & Pratt algorithm has been automatically achieved by simply identifying a part of the program that depends solely on the pattern string, which is static, that is, available at partial evaluation time. In particular, we did not need to compute any next or failure table to drive the pattern matching. More generally, this has been done quite naively and automatically rather than cleverly and by hand, as in the original algorithm. Further, any pattern will be compiled into a non-backtracking matcher, running in time O(length(subject-string)).

Still, one piece of information has not been exploited: the character causing the mismatch.

4. Further optimization

Because the character causing the mismatch is ignored, we can expect some redundancy to remain in the residual program, and actually this redundancy can be pointed out. Specializing the program with respect to (a b a b c) yields

    (define (kmp-0 d)
      (start-1 d))

    (define (start-1 d)
      (and (not (null? d))
           (restart-2 d)))

    (define (restart-2 d)
      (if (equal? 'a (car d))
          (and (not (null? (cdr d)))
               (loop-3 (cdr d)))
          (start-1 (cdr d))))

    (define (loop-3 d)
      (if (equal? 'b (car d))
          (and (not (null? (cdr d)))
               (loop-4 (cdr d)))
          (restart-2 d)))

    (define (loop-4 d)
      (if (equal? 'a (car d))
          (and (not (null? (cdr d)))
               (loop-5 (cdr d)))
          (start-1 (cdr d))))

    (define (loop-5 d)
      (if (equal? 'b (car d))
          (and (not (null? (cdr d)))
               (loop-6 (cdr d)))
          (loop-3 d)))

    (define (loop-6 d)
      (or (equal? 'c (car d))
          (loop-4 d)))

where, in state loop-5, we go to state loop-3 and repeat the same test, to see whether (car d) is b.
We can use the information about the last mismatch, taking it into account while statically matching the pattern against itself:

    (define (static-loop p d n pp dd nn)
      (if (zero? n)                         ; possible to continue?
          (if (and (> nn 0)                 ; mismatch again?
                   (equal? (car p) (car d)))
              (static-kmp pp (cdr dd) (sub1 nn))
              p)
          (if (equal? (car p) (car d))
              (static-loop (cdr p) (cdr d) (sub1 n) pp dd nn)
              (static-kmp pp (cdr dd) (sub1 nn)))))

Specializing this new algorithm with respect to (a b a b c), we obtain a residual program identical to the one above except at the state loop-5, where the redundancy has vanished:

    (define (loop-5 d)
      (if (equal? 'b (car d))
          (and (not (null? (cdr d)))
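The effect of the refined static-loop can be checked by hand on the patterns used in this paper. The sketch below assumes (- n 1) in place of sub1 and a hypothetical helper restart-point, introduced here, that packages the call to static-kmp made in loop (pp is the whole pattern, p its suffix starting at the mismatching character).

```scheme
(define (static-kmp p d n)
  (static-loop p d n p d n))

;; static-loop with the extra test on the character causing the mismatch
(define (static-loop p d n pp dd nn)
  (if (zero? n)                          ; possible to continue?
      (if (and (> nn 0)                  ; mismatch again?
               (equal? (car p) (car d)))
          (static-kmp pp (cdr dd) (- nn 1))
          p)
      (if (equal? (car p) (car d))
          (static-loop (cdr p) (cdr d) (- n 1) pp dd nn)
          (static-kmp pp (cdr dd) (- nn 1)))))

;; Hypothetical helper: p must be a proper, non-empty suffix of pp.
(define (restart-point pp p)
  (static-kmp pp (cdr pp) (- (length (cdr pp)) (length p))))

;; State loop-5 of (a b a b c): a b a matched, b expected.
(restart-point '(a b a b c) '(b c))                 ; => (a b a b c)
;; State loop-6 of (a b a b c): a b a b matched, c expected.
(restart-point '(a b a b c) '(c))                   ; => (a b c)
;; State loop-9 of (a b c a b c a c a b): c expected.
(restart-point '(a b c a b c a c a b) '(c a b))     ; => (b c a c a b)
```

With the static-loop of Section 2, the first call would return (b a b c), that is, the state that tests b again. With the refinement it returns the whole pattern, so loop takes its "any match at all?" branch; since (car pp) is a and (car p) is b, that branch calls (restart pp d). The third call agrees with the transition from loop-9 to loop-6 in the Appendix.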
5. Comparison with related work

The paper by Knuth, Morris and Pratt [7] is the seminal paper on fast pattern matching in strings. Because partial evaluation is essentially program specialization, we get algorithmic ("compiled") versions of the original next table that determines which character in the pattern should be tested after a mismatch. Further, by partially evaluating the original Knuth, Morris & Pratt algorithm with respect to a next table, we have obtained a residual program structurally equivalent to the one above.

During the first workshop on Partial Evaluation and Mixed Computation [1], the question was asked recurrently whether a partial evaluator could treat the Knuth, Morris & Pratt algorithm. It is argued in [4] that it needs a theorem prover. This paper shows that theorem proving is not necessary for this, and that separating the portion of the algorithm that will be statically repeated is sufficient to obtain straightforwardly the Knuth, Morris & Pratt algorithm.

Conclusion

This paper illustrates how partial evaluation can be used for obtaining the Knuth, Morris & Pratt pattern matching algorithm from a fairly naive method. Interestingly enough, this has been done automatically whereas it was done by hand originally.

To conclude, let us underline that the Knuth, Morris & Pratt algorithm is two-fold: it offers both efficient preprocessing in time O(m) and efficient matching in time O(n), m and n being the length of the pattern string and the subject string, respectively. Presently, we get the latter by partial evaluation, so the generated matcher also runs in time O(n). Generating a matcher generator can be done at an earlier stage
by self-applying Mix. Generating the matcher ² takes more time than O(m) (apparently O(m²)), which is less efficient than the sophisticated Knuth, Morris & Pratt construction of the next table. This is not unreasonable since a partial evaluator can specialize any program at all. It will not matter in the common situation where the same (short) pattern is to be matched against many (long) subject strings.

Finally, let us point out that a pattern matching program could be specialized as well with respect to the subject string. However, the residual program can be huge. This would need some better insight into suffix trees [9].

It is our hope that this work contributes to presenting partial evaluation as an active help for creating and designing algorithms and programs.

Appendix
This appendix presents the residual program obtained by specializing our pattern matching program with respect to the string (a b c a b c a c a b). This example is interesting because it is isomorphic to the one in Section 3 of [7].

    (define (kmp-0 d)
      (start-1 d))

    (define (start-1 d)
      (and (not (null? d))
           (restart-2 d)))

    (define (restart-2 d)
      (if (equal? 'a (car d))
          (and (not (null? (cdr d)))
               (loop-3 (cdr d)))
          (start-1 (cdr d))))

    (define (loop-3 d)
      (if (equal? 'b (car d))
          (and (not (null? (cdr d)))
               (loop-4 (cdr d)))
          (restart-2 d)))

    (define (loop-4 d)
      (if (equal? 'c (car d))
          (and (not (null? (cdr d)))
               (loop-5 (cdr d)))
          (restart-2 d)))

    (define (loop-5 d)
      (if (equal? 'a (car d))
          (and (not (null? (cdr d)))
               (loop-6 (cdr d)))
          (start-1 (cdr d))))

    (define (loop-6 d)
      (if (equal? 'b (car d))
          (and (not (null? (cdr d)))
               (loop-7 (cdr d)))
          (restart-2 d)))

    (define (loop-7 d)
      (if (equal? 'c (car d))
          (and (not (null? (cdr d)))
               (loop-8 (cdr d)))
          (restart-2 d)))

    (define (loop-8 d)
      (if (equal? 'a (car d))
          (and (not (null? (cdr d)))
               (loop-9 (cdr d)))
          (start-1 (cdr d))))

    (define (loop-9 d)
      (if (equal? 'c (car d))
          (and (not (null? (cdr d)))
               (loop-10 (cdr d)))
          (loop-6 d)))

    (define (loop-10 d)
      (if (equal? 'a (car d))
          (and (not (null? (cdr d)))
               (loop-11 (cdr d)))
          (start-1 (cdr d))))

    (define (loop-11 d)
      (or (equal? 'b (car d))
          (restart-2 d)))
² Either by applying the matcher generator to a pattern string or by partially evaluating the general pattern matcher with respect to a pattern string.
Acknowledgment

We express our gratitude to Neil D. Jones for his thoughtful interaction and support, and to Peter Sestoft and Andrzej Filinski for their re-reading. This work was carried out while the first author was visiting DIKU.

References

[1] D. Bjørner, A.P. Ershov and N.D. Jones, eds., Partial Evaluation and Mixed Computation (North-Holland, Amsterdam, 1988).
[2] C. Consel, New insights into partial evaluation: the Schism experiment, In: H. Ganzinger, ed., Proc. 2nd European Symp. on Programming '88, Nancy, France (March 1988), Lecture Notes in Computer Science, Vol. 300 (Springer, Berlin, 1988) 236-246.
[3] Y. Futamura, Partial evaluation of computation process: an approach to a compiler-compiler, Systems, Computers, Controls 2 (5) (1971) 45-50.
[4] Y. Futamura and K. Nogi, Generalized partial computation, In: D. Bjørner, A.P. Ershov and N.D. Jones, eds., Partial Evaluation and Mixed Computation (North-Holland, Amsterdam, 1988) 133-151.
[5] N.D. Jones, P. Sestoft and H. Søndergaard, An experiment in partial evaluation: the generation of a compiler generator, In: J.P. Jouannaud, ed., Proc. 1st Internat. Conf. on Rewriting Techniques and Applications, Dijon, France (June 1985), Lecture Notes in Computer Science, Vol. 202 (Springer, Berlin, 1985) 124-140.
[6] N.D. Jones, P. Sestoft and H. Søndergaard, MIX: a self-applicable partial evaluator for experiments in compiler generation, Internat. J. LISP Symbolic Comput.
[7] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (2) (1977) 323-350.
[8] J. Rees and W. Clinger, eds., Revised³ report on the algorithmic language Scheme, SIGPLAN Notices 21 (12) (1986) 37-79.
[9] P. Weiner, Linear pattern matching algorithms, In: Proc. IEEE Symp. on Switching and Automata Theory, Vol. 14 (IEEE, New York, 1972) 1-11.