A note on parsing pattern languages



February 1995

Pattern Recognition Letters, Elsevier

Pattern Recognition Letters 16 (1995) 179-182

Oscar H. Ibarra a, Ting-Chuen Pong b,c,*, Stephen M. Sohn d

a Department of Computer Science, University of California at Santa Barbara, Santa Barbara, CA 93106, USA
b Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
c Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
d Unisys Corporation, St. Paul, MN 55164, USA

Received 4 November 1993; revised 14 September 1994

Abstract

We give an algorithm for parsing pattern languages. We show that a string of length $n$ can be parsed with respect to a $k$-variable pattern in $O(n^k)$ time using $O(n)$ space. Moreover, the time bound can be reduced by accounting for the number of occurrences of variables in the pattern. We also present evidence that employing some results of elementary number theory and integer linear programming leads to a faster algorithm.

1. Introduction

Pattern languages were first considered by Angluin (1980) in the context of inferring a pattern from a given finite set of strings. Formally, let $\Sigma$ be a finite alphabet of constant symbols, and let $V = \{v_1, v_2, \ldots\}$ be a countable set of variable symbols. A pattern $p$ is an element of the set $(\Sigma \cup V \cup V^r)^+$, where $V^r$ denotes the set of reversed variables, namely $\{v_1^r, v_2^r, \ldots\}$. The language generated by $p$, $L(p)$, is the set of strings of $\Sigma^*$ obtained by consistently substituting strings of constants (possibly the empty string, $\varepsilon$) for the variables in $p$, where each reversed variable $v_i^r$ receives the reversal of the string substituted for $v_i$. For example, consider the pattern $v_1 v_1^r$ over $\Sigma^*$: $L(v_1 v_1^r)$ is the set of all even-length palindromes. The problem of determining the initial palindromes of a string, or parsing with the pattern $v_1 v_1^r v_2$, was considered by Breslauer and Galil (1992). The language $L(v_1 010 v_2)$ over $\{0, 1\}^*$ is the set of all binary strings that have 010 as a substring. (We define a substring as a consecutive sequence of symbols, as opposed to a subsequence, where the symbols are not necessarily consecutive.)

The following interesting property of pattern languages is an important result in the study of inductive inference, which is a branch of learning theory (see (Angluin and Smith, 1983) for a survey of inductive inference). Here, we disallow substituting the empty string $\varepsilon$ for any variable. Let $S$ be an $\varepsilon$-free finite set of strings of constants. A pattern $p$ is said to be descriptive of $S$ if $S \subseteq L(p)$ and furthermore, for every pattern $q$ such that $S \subseteq L(q)$, $L(p) \subseteq L(q)$. I.e., $p$ is in a strong sense the best-fitting pattern to $S$. In (Angluin, 1980) it was shown that a descriptive one-variable pattern can be found in polynomial time with respect to the sum of the lengths of the strings in $S$. Some evidence has been presented indicating that inferring two-variable patterns may be an NP-complete problem (Ko and Hua, 1987).

The class of pattern languages is surprisingly expressive, yet is incomparable to either the class of regular languages or the class of context-free languages (Angluin, 1980). E.g., the regular language $\{0, 1\}$ is not a pattern language and the pattern language $L(v_1 v_1)$ is not context-free, although it is context-sensitive. Many texts in language and automata theory contain languages that are presented essentially as pattern languages due to their concise representations. It is worth mentioning that other investigators (Apostolico, 1992; Crochemore, 1981) have studied similar problems. Their methods handle special cases of the problem discussed in this paper with better time bounds.

In this paper we consider the problem of recognizing and parsing pattern languages. Given a pattern $p$ and string $s$, the recognition problem asks whether $s \in L(p)$. The parsing problem is to find a substitution of substrings of $s$ for the variables of $p$, provided that $s$ is in the language. In general, a string may have more than one parse, or derivation, with respect to a pattern. The algorithm we present yields all the parses of $s$ as a direct consequence of the recognition process.

All time bounds given below assume a unit-cost RAM model of computation. Let $k$ be the number of distinct variables ($v_i$ and $v_i^r$ are counted as the same variable) in $p$ and let $n = |s|$. (The length of a string $s$, or number of symbols in $s$, is denoted $|s|$.) If $k$ is arbitrary then the recognition problem has been shown to be NP-complete (Angluin, 1980). For fixed $k$, it was further noted that an $O(n^{2k+1})$ time algorithm exists. Here we show that the problem can be solved in $O(n^k)$ time. This bound, moreover, can be reduced by conditioning on the number of occurrences of the variables in the pattern (i.e., by observing that there are $a_1$ occurrences of $v_1$ or $v_1^r$, $a_2$ occurrences of $v_2$ or $v_2^r$, and so on).

This research was supported in part by NSF Grants DCR8420935, 8604603 and ECS-8505662.
* Corresponding author. Email: [email protected]

0167-8655/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved
SSDI 0167-8655(94)00091-3
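The definition of $L(p)$ given at the start of this section can be made concrete with a short Python sketch that applies a substitution to a pattern. The token encoding (a string for a constant, a pair `(i, reversed)` for $v_i$ or $v_i^r$) and the names `substitute` and `words` are illustrative assumptions and do not come from the paper.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple, Union

# Assumed encoding: a str is a constant symbol; the pair (i, rev) stands for
# the variable v_i when rev is False and for the reversed variable v_i^r otherwise.
Token = Union[str, Tuple[int, bool]]


def substitute(pattern: List[Token], subst: Dict[int, str]) -> str:
    """Replace every variable consistently; v_i^r receives the reversal of subst[i]."""
    out = []
    for tok in pattern:
        if isinstance(tok, str):
            out.append(tok)
        else:
            i, rev = tok
            out.append(subst[i][::-1] if rev else subst[i])
    return "".join(out)


def words(alphabet: str, max_len: int) -> Iterator[str]:
    """All strings over `alphabet` of length at most max_len, shortest first."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


# The pattern v_1 v_1^r generates exactly the even-length palindromes.
PALINDROME = [(1, False), (1, True)]
short_palindromes = sorted({substitute(PALINDROME, {1: w}) for w in words("01", 2)})
```

Under these assumptions, substituting 011 for $v_1$ in $v_1 v_1^r$ gives 011110, and `short_palindromes` collects the even-length binary palindromes of length at most 4.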

2. The parsing algorithm

For the following discussion suppose that $|s| = n$, $|p| = m$, $c_0$ is the number of constant symbols in $p$, and $k$ is the number of distinct variables in $p$. For example, when $p = v_1 v_2 1 v_1$ and $s = 0111101101$ with $\Sigma = \{0, 1\}$, we have $n = 10$, $m = 4$, $c_0 = 1$ and $k = 2$. The parsing algorithm as briefly noted in (Angluin, 1980) considers all $O(n^{2k})$ $k$-tuples of substrings of $s$ and tests whether or not these substrings may be substituted for the variables of $p$ yielding $s$. Verification for a given tuple can be done in $O(n)$ time, which results in $O(n^{2k+1})$ overall time. This bound can be improved by observing that only certain tuples of substrings need be checked, as constrained by length and position with respect to $p$.

We define the characteristic pattern equation of $p$ to be $\sum_{i=1}^{k} a_i x_i = n - c_0$, where $a_i$ is the number of occurrences of variable $v_i$ or $v_i^r$. Each $a_i$ is an element of the Parikh vector of $p$ with respect to $V \cup V^r$, treating $v_i$ and $v_i^r$ equivalently. (Parikh vectors and the commutative image of a language are discussed in (Harrison, 1978).) In the example mentioned above, the characteristic pattern equation is $2x_1 + x_2 = 9$. The nonnegative integer solutions to this equation are the lengths of the feasible substrings, which may then be tested.

As a special case, if the variables are all distinct and reversed variables are disallowed, a string can be parsed in $O(m + n)$ time, which is independent of $k$. In general, a string in the language will be composed of a sequence of constant strings separated by variables. Since the variables are not repeated, anything can be substituted for the variables (i.e., "don't care"). The problem thus becomes one of locating the leftmost occurrence of each constant string in the string under examination. Knuth et al. (1977) have shown that a substring of length $m_0$ can be located in a string of length $n_0$ in $O(m_0 + n_0)$ time. A generalization of this approach to multiple substrings leads to an overall time bound of $O(m + n)$ for this problem.

The generation of all solutions to the characteristic pattern equation will be discussed below. For the moment, assume that

$A = \{(l_1^{(1)}, l_2^{(1)}, \ldots, l_k^{(1)}), (l_1^{(2)}, l_2^{(2)}, \ldots, l_k^{(2)}), \ldots\}$

is an enumeration without repetition of the nonnegative solutions to the characteristic pattern equation. In practice, the parsing algorithm will interleave the generation of $A$ with the verification of each solution.

The verification procedure may be stated simply: Scan $p$ and $s$ from left to right. If a constant symbol is encountered in $p$, verify that the corresponding symbol is present in $s$ at the correct position. If a variable symbol is encountered in $p$ for the first time, say $v_i$, record the correspondence of the substring of length $l_i$ at the current position in $s$ with $v_i$. (Reversed variables encountered for the first time are handled similarly.) If $v_i$ or $v_i^r$ is seen subsequently, verify the recorded correspondence. It is not difficult to show that the verification can be done in $O(n)$ time and space. The time for verifying every solution to the characteristic pattern equation is the time to generate $A$ plus $O(|A| \cdot n)$. In the following section, techniques will be given for finding all solutions to the characteristic pattern equation.
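The characteristic pattern equation and the verification procedure of this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the token encoding (a one-character string for a constant, a pair `(i, reversed)` for $v_i$ or $v_i^r$) and the names `characteristic_equation` and `verify` are assumptions, and the worked example takes the pattern to be $v_1 v_2 1 v_1$.

```python
from typing import Dict, List, Optional, Tuple, Union

# Assumed encoding: a constant is a one-character str; the pair (i, rev) stands
# for v_i when rev is False and for the reversed variable v_i^r otherwise.
Token = Union[str, Tuple[int, bool]]


def characteristic_equation(pattern: List[Token], n: int) -> Tuple[Dict[int, int], int]:
    """Return ({i: a_i}, n - c_0), the two sides of the characteristic pattern equation."""
    a: Dict[int, int] = {}
    c0 = 0
    for tok in pattern:
        if isinstance(tok, str):
            c0 += 1                              # one more constant symbol
        else:
            a[tok[0]] = a.get(tok[0], 0) + 1     # v_i and v_i^r count together
    return a, n - c0


def verify(pattern: List[Token], s: str, lengths: Dict[int, int]) -> Optional[Dict[int, str]]:
    """Scan pattern and s left to right; return {i: string for v_i} or None."""
    bound: Dict[int, str] = {}
    pos = 0
    for tok in pattern:
        if isinstance(tok, str):
            if s[pos:pos + 1] != tok:            # constant must appear at this position
                return None
            pos += 1
        else:
            i, rev = tok
            piece = s[pos:pos + lengths[i]]
            if len(piece) != lengths[i]:         # candidate lengths run past the end of s
                return None
            value = piece[::-1] if rev else piece    # string assigned to v_i itself
            if bound.setdefault(i, value) != value:  # record on first sight, check afterwards
                return None
            pos += lengths[i]
    return bound if pos == len(s) else None
```

For the example pattern and $s = 0111101101$, `characteristic_equation` yields the coefficients of $2x_1 + x_2 = 9$, and the candidate lengths $(x_1, x_2) = (2, 5)$ verify with $v_1 = 01$ and $v_2 = 11101$. Each call slices $s$ at most once per pattern symbol, so the work per candidate stays linear in $n$.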

3. Solving the characteristic pattern equation

Equations like $\sum_{i=1}^{k} a_i x_i = b$ with integer solutions are more generally known as linear Diophantine equations. Here we restrict our attention to the class of equations where $a_i, b > 0$. Without loss of generality, assume that $a_i \geq a_{i+1}$. We write $a \mid b$ to mean $a$ divides $b$. The following simple yet inefficient procedure generates all nonnegative solutions to the equation $\sum_{i=1}^{k} a_i x_i = b$ and places them in $A$.

Procedure solve()
begin
    $A := \emptyset$;
    for $x_1 := 0$ to $\lfloor b / a_1 \rfloor$ do
        for $x_2 := 0$ to $\lfloor (b - a_1 x_1) / a_2 \rfloor$ do
            ...
                for $x_{k-1} := 0$ to $\lfloor (b - \sum_{i=1}^{k-2} a_i x_i) / a_{k-1} \rfloor$ do
                    if $a_k \mid (b - \sum_{i=1}^{k-1} a_i x_i)$ then
                        $A := A \cup \{(x_1, x_2, \ldots, (b - \sum_{i=1}^{k-1} a_i x_i) / a_k)\}$
end

The time complexity of procedure solve() is $O\left(b^{k-1} / \prod_{i=1}^{k-1} a_i\right)$. It is not difficult to see that a suitable combination of procedures parse() and solve() yields a parsing algorithm taking $O\left(n^k / \prod_{i=1}^{k-1} a_i\right)$ time and using $O(n)$ space.

It appears that an asymptotically more efficient algorithm for generating solutions may be obtained by using some results of elementary number theory and integer linear programming. In practice, however, the algorithm we briefly sketch below may not outperform the simple algorithm given above. It is divided into two phases.

Phase 1. First, a parametric representation of all solutions (possibly negative) is found. Using a variant of the Euclidean algorithm, a particular solution $x^{(0)}$ and a set of $k - 1$ linearly independent auxiliary vectors $x^{(1)}, x^{(2)}, \ldots, x^{(k-1)}$ can be computed. All solutions are given by adding any linear combination of the auxiliary vectors to the particular solution. I.e.,

$x^{(0)} + t_1 x^{(1)} + t_2 x^{(2)} + \cdots + t_{k-1} x^{(k-1)}$

is a solution for any integers $t_1, t_2, \ldots, t_{k-1}$. These vectors can be constructed using $O(k^2 + k \log_{10} a_k)$ arithmetic operations (Blankinship, 1966; Bradley, 1970). Note that at least one solution exists iff $\gcd(a_1, a_2, \ldots, a_k) \mid b$ (Niven and Zuckerman, 1980).

Phase 2. We now need to restrict the $t_i$'s so that each component of the resulting solution is nonnegative. Let $X = (x^{(1)}, x^{(2)}, \ldots, x^{(k-1)})$. Hence, any integer solution $t$ to $Xt + x^{(0)} \geq 0$ will generate a nonnegative solution to the original equation. This is an integer linear programming (ILP) problem in $k - 1$ unknowns and $k$ constraints.

When $k = 2$, the ILP problem is trivially solvable. A result of elementary number theory states that there are at most $\lfloor \gcd(a_1, a_2)\, b / (a_1 a_2) \rfloor + 1$ solutions (Niven and Zuckerman, 1980, p. 135, Exercise 8). Therefore, the overall time for both phases is $O(\log_{10} a_2 + \lfloor \gcd(a_1, a_2)\, b / (a_1 a_2) \rfloor)$, which is asymptotically better than before. For example, the equation $21x_1 + 20x_2 = 1000$ has exactly 3 nonnegative solutions. The technique mentioned earlier would test 48 candidate solutions. The overall time bound for parsing pattern languages of two variables is $O(\log_{10} a_2 + n^2 \gcd(a_1, a_2) / (a_1 a_2))$.

We have been unable to establish a good bound when $k > 2$. Some recent work in ILP with a fixed number of variables shows that finding a single solution, or determining that none exists, can be done in $k^{O(k)}$ arithmetic operations (ignoring the sizes of the numbers) (Lenstra, 1983; Kannan, 1987). A bound on the number of solutions was given in (Lambe, 1977), namely,

[Displayed bound illegible in this copy. It involves the binomial coefficient $\binom{b + a^*}{k-1}$, $f_k$ and the coefficients $a_i$, where $a^*$ is defined by an expression in $a_1$, $a_2$ and a sum over $i = 3, \ldots, k$ of floor terms with denominator $2 f_i$.]

where $f_i = \gcd(a_1, a_2, \ldots, a_i)$. Taken together, these results seem to indicate that this technique is again asymptotically better than before.
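The two ways of generating solutions discussed in this section can be compared on a small case with the following Python sketch. It is illustrative only: `solve` follows the nested loops of procedure solve() but uses recursion so that $k$ need not be fixed, while `extended_gcd` and `solve_two` are hypothetical names for a two-variable instance of Phases 1 and 2 built on the standard extended Euclidean algorithm, which is an assumed stand-in for the cited variants.

```python
from typing import Iterator, List, Sequence, Tuple


def solve(a: Sequence[int], b: int) -> Iterator[Tuple[int, ...]]:
    """Enumerate all nonnegative integer solutions of sum(a[i] * x[i]) == b."""
    def rec(i: int, remaining: int, prefix: List[int]) -> Iterator[Tuple[int, ...]]:
        if i == len(a) - 1:
            # The last unknown is forced; keep it only if a_k divides what is left.
            if remaining % a[i] == 0:
                yield tuple(prefix + [remaining // a[i]])
            return
        for x in range(remaining // a[i] + 1):
            yield from rec(i + 1, remaining - a[i] * x, prefix + [x])
    yield from rec(0, b, [])


def extended_gcd(a: int, b: int) -> Tuple[int, int, int]:
    """Return (g, u, v) with a*u + b*v == g == gcd(a, b)."""
    old_r, r = a, b
    old_u, u = 1, 0
    old_v, v = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, old_u, old_v


def solve_two(a1: int, a2: int, b: int) -> List[Tuple[int, int]]:
    """Nonnegative solutions of a1*x1 + a2*x2 == b via a parametric representation."""
    g, u, v = extended_gcd(a1, a2)
    if b % g:
        return []                                 # solvable iff gcd(a1, a2) divides b
    x1, x2 = u * (b // g), v * (b // g)           # Phase 1: one particular solution
    s1, s2 = a2 // g, a1 // g                     # general: (x1 + t*s1, x2 - t*s2)
    t_lo = -(x1 // s1)                            # Phase 2: smallest t with x1 + t*s1 >= 0
    t_hi = x2 // s2                               # largest t with x2 - t*s2 >= 0
    return [(x1 + t * s1, x2 - t * s2) for t in range(t_lo, t_hi + 1)]
```

For $21x_1 + 20x_2 = 1000$ both functions return the three solutions $(0, 50)$, $(20, 29)$ and $(40, 8)$. `solve` reaches them after trying all 48 values of $x_1$, whereas `solve_two` steps directly from one solution to the next.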

4. Summary

We have given an algorithm for parsing pattern languages. There is evidence that employing some results of elementary number theory and integer linear programming leads to an asymptotically faster algorithm. We note that the algorithm is easily parallelized by partitioning the generation of the solutions and subsequent verification in a straightforward way among the processors.
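One way to picture the complete algorithm, including the partitioning among processors mentioned above, is the following Python sketch. It interleaves the generation of length tuples with their verification and lets each worker handle only the tuples whose first component falls in its residue class. The function name `parses`, the token encoding (a one-character string for a constant, a pair `(i, reversed)` for $v_i$ or $v_i^r$) and this particular partitioning rule are assumptions made for illustration; the paper does not prescribe them.

```python
from typing import Dict, Iterator, List, Optional, Tuple, Union

Token = Union[str, Tuple[int, bool]]  # constant symbol, or (i, reversed) for v_i / v_i^r


def parses(pattern: List[Token], s: str,
           worker: int = 0, num_workers: int = 1) -> Iterator[Dict[int, str]]:
    """Yield each substitution {i: string for v_i} under which pattern derives s.

    Only length tuples whose first component is congruent to `worker`
    modulo `num_workers` are examined, so the workers split the search.
    """
    counts: Dict[int, int] = {}
    for tok in pattern:
        if not isinstance(tok, str):
            counts[tok[0]] = counts.get(tok[0], 0) + 1
    c0 = sum(1 for tok in pattern if isinstance(tok, str))
    order = sorted(counts, key=lambda i: -counts[i])    # a_1 >= a_2 >= ... >= a_k
    b = len(s) - c0                                     # right-hand side n - c_0

    def lengths(j: int, remaining: int) -> Iterator[Dict[int, int]]:
        i = order[j]
        last = j == len(order) - 1
        if last:    # the last length is forced by the equation
            xs = [remaining // counts[i]] if remaining % counts[i] == 0 else []
        else:
            xs = range(remaining // counts[i] + 1)
        for x in xs:
            if j == 0 and x % num_workers != worker:
                continue                                # another worker's share
            if last:
                yield {i: x}
            else:
                for rest in lengths(j + 1, remaining - counts[i] * x):
                    yield {i: x, **rest}

    def check(length: Dict[int, int]) -> Optional[Dict[int, str]]:
        bound: Dict[int, str] = {}
        pos = 0
        for tok in pattern:
            if isinstance(tok, str):
                if s[pos:pos + 1] != tok:
                    return None
                pos += 1
            else:
                i, rev = tok
                piece = s[pos:pos + length[i]]
                value = piece[::-1] if rev else piece
                if bound.setdefault(i, value) != value:
                    return None
                pos += length[i]
        return bound

    if b < 0:
        return
    if not order:                                       # pattern without variables
        if worker == 0 and "".join(pattern) == s:
            yield {}
        return
    for length in lengths(0, b):
        subst = check(length)
        if subst is not None:
            yield subst
```

Because the function is a generator, only the current candidate is held in memory, in keeping with the $O(n)$ space bound. Calling it once per worker index and merging the outputs gives the same parses as a single call with the default arguments.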

Acknowledgement

We would like to thank Professor Greg W. Anderson for his helpful comments. We would also like to thank the referee for his detailed and thoughtful comments. The referee discovered the $O(m + n)$ time bound for the special case of this problem with distinct and nonreversed variables.

References

Apostolico, A. (1992). Optimal parallel detection of squares in strings. Algorithmica 8, 285-319.
Angluin, D. (1980). Finding patterns common to a set of strings. J. Comput. Syst. Sci. 21, 46-62.
Angluin, D. and C.H. Smith (1983). Inductive inference: theory and methods. Comput. Surveys 15 (3), 237-269.
Blankinship, W.A. (1966). Algorithm 288: solution of simultaneous linear diophantine equations [F4]. Comm. ACM 9 (7), 514.
Bradley, G.H. (1970). Algorithm and bound for the greatest common divisor of n integers. Comm. ACM 13 (7), 433-436.
Breslauer, D. and Z. Galil (1992). Finding all periods and initial palindromes of a string in parallel. Technical Report CUCS-017-92, Computer Science Department, Columbia University.
Crochemore, M. (1981). An optimal algorithm for computing the repetitions in a word. Inform. Process. Lett. 12, 244-250.
Harrison, M.A. (1978). Introduction to Formal Language Theory. Addison-Wesley, Reading, MA.
Kannan, R. (1987). Minkowski's convex body theorem and integer programming. Math. Oper. Res. 12 (3), 415-440.
Ko, K.-I. and C.-M. Hua (1987). A note on the two-variable pattern-finding problem. J. Comput. Syst. Sci. 34, 75-86.
Knuth, D.E., J.H. Morris and V.R. Pratt (1977). Fast pattern matching in strings. SIAM J. Comput. 6 (2), 323-350.
Lambe, T.A. (1977). Upper bound on the number of nonnegative integer solutions to a linear equation. SIAM J. Appl. Math. 32 (1), 215-219.
Lenstra, H.W. Jr. (1983). Integer programming with a fixed number of variables. Math. Oper. Res. 8 (4), 538-548.
Niven, I. and H.S. Zuckerman (1980). An Introduction to the Theory of Numbers, 4th edition. Wiley, New York.