Pattern expressions and pattern automata



Information Processing Letters 92 (2004) 267–274 www.elsevier.com/locate/ipl

Cezar Câmpeanu a, Sheng Yu b,*

a Department of Mathematics and Computer Science, UPEI, Charlottetown, PE, Canada C1A 4P3

b Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7

Received 12 November 2003; received in revised form 20 September 2004 Available online 14 October 2004 Communicated by L. Boasson

Abstract

We define pattern expressions as an extension of both regular expressions and patterns. We prove several properties of the new family of languages, similar to those of extended regex languages [Câmpeanu et al., Int. J. Found. Comput. Sci. 14 (6) (2003) 1007–1018]. We also define an automata system that recognizes these languages. Differences between regex and pattern expressions are also discussed.

© 2004 Published by Elsevier B.V.

Keywords: Formal languages; Pattern expressions; Regex

* Corresponding author. E-mail addresses: [email protected] (C. Câmpeanu), [email protected] (S. Yu).

0020-0190/$ – see front matter © 2004 Published by Elsevier B.V. doi:10.1016/j.ipl.2004.09.007

1. Introduction

Regular expressions and their variations have been used in many practical software systems and programming languages. For example, regex, a variation of regular expressions, has become an important part of egrep, vi, emacs, as well as part of the programming languages Perl and Python [2,5]. Regex is a convenient tool to use in practice. A detailed description of regex can be found in [3]. The differences between regex and formal regular expressions, as well as the languages that can or cannot be expressed by regex, are studied in [1].

However, there are many simple practical patterns that cannot be expressed by regex in an intuitive and straightforward way. We give two simplified examples in the following.

We know that there are two names appearing in the text: one starts with a ‘C’, and the other starts with a ‘W’. One of the names appears in the text first, then both of them appear with the word ‘and’ in between, the name starting with a ‘C’ appearing before ‘and’. We naturally write a regex for the pattern as follows:

((C[a-z]*)|(W[a-z]*)).*\2\s+and\s+\3

but the regex is incorrect and does not work: whichever branch of the alternation is not taken leaves \2 or \3 undefined.


Consider another example. We want to write a regex that matches a string of letters which is a repetition (including zero times) of a string starting with an ‘h’ and ending with an ‘a’. We write (h[A-Za-z]*a)*. Again, the regex is incorrect: it may match strings that are not what we want, since nothing forces the iterations of the star to match the same string.

Both of the above examples can be represented correctly by regex. However, the solutions are not as intuitive and straightforward as we expect. A quite serious general problem about regex is that a back-reference to any sub-expression in an alternation (“|”) is not well defined. More detailed discussions on regex can be found in Section 3.

In this paper, we introduce pattern expressions, which we consider to be more intuitive and easier to use than regex. We will show that the above two examples can be expressed by pattern expressions intuitively and easily.

Patterns and pattern languages have been studied recently due to their roles in learning, biocomputing, voice, and optical character recognition [6]. In this paper we generalize both regular expressions and patterns by introducing pattern expressions. We define a pattern expression as a finite ordered set of regular expressions over an alphabet and a finite set of variables, each variable corresponding to a regular expression, such that each regular expression depends only on previously defined variables. Therefore, circular definitions cannot occur in any pattern expression. Each regular expression of this type is called a pattern expression. Of course, there is at least one expression defined without using any variables.

Using this new definition, any regular expression is a pattern expression with no variables. Also, patterns become pattern expressions, having for each variable a corresponding pattern expression or a regular expression, denoting all possible words in the language defined by the expression.

As simple examples of pattern expressions, we consider again the two examples which we just mentioned.
For the first example, we first define two variables

u = C[a-z]*;

v = W[a-z]*;

and then the expression

(\u + \v).*\u\s+and\s+\v

Note that \x denotes a reference to the variable x, whose pattern expression has been defined earlier. For the second example, we first define

u = h[A-Za-z]*a;

and then the expression

\u*

Since regular languages are incomparable with pattern languages, pattern expression languages are a non-trivial extension of both families of languages. We prove that pattern expression languages are incomparable with context-free languages and included in the context-sensitive languages. Other properties of pattern expression languages are proved in Section 6 of the paper.
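
The failure of the first regex can be observed directly in an engine with back-references. The following is a minimal sketch using Python's `re` module; the paper does not prescribe an engine, and the sample sentence is made up for illustration. In Python, a back-reference to a group that did not take part in the match fails, so the naive regex finds nothing. The second regex is one possible hand expansion of the alternation, in the spirit of the workarounds discussed in Section 3, and is not taken from the paper.

```python
import re

# Naive regex from the introduction: groups 2 and 3 sit in different
# branches of an alternation, so one of them never takes part in a match.
naive = re.compile(r"((C[a-z]*)|(W[a-z]*)).*\2\s+and\s+\3")

# Hand-expanded alternative (illustration only): one branch per choice of
# which name appears first in the text.
expanded = re.compile(
    r"(C[a-z]*).*\1\s+and\s+(W[a-z]*)"
    r"|(W[a-z]*).*(C[a-z]*)\s+and\s+\3"
)

# Hypothetical sample text.
text = "Wendy wrote first; later Cezar and Wendy wrote together"

naive_hit = naive.search(text)
expanded_hit = expanded.search(text)

print(naive_hit)               # the naive regex finds no match
print(expanded_hit.group(0))   # span from the first name to the pair
```

The expansion needs one branch per ordering of the names, which is the kind of bookkeeping that the pattern expression (\u + \v).*\u\s+and\s+\v avoids.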

2. Preliminaries and pattern expressions

In this section we give some basic definitions and notations that are used throughout the paper; for all other definitions and notations not presented here, we refer the reader to [4,7,8]. We also give a formal definition of pattern expressions. Several examples of pattern expressions are given to illustrate the basic definition.

An alphabet $\Sigma$ is a finite non-empty set. A word $w = w_1 \ldots w_n$, $w_i \in \Sigma$, is an element of the free monoid $\Sigma^*$ generated by $\Sigma$; the length of the word is denoted by $|w|$. The word with no letters, i.e., the empty word, is denoted by $\lambda$.

A regular expression over $\Sigma$ is described recursively as follows:

(1) the empty set $\emptyset$ or any letter $a$ of the alphabet $\Sigma$ is a regular expression, denoting the languages $\emptyset$ and $\{a\}$, respectively;

(2) if $\alpha$ and $\beta$ are regular expressions, then $\alpha\beta$, $\alpha + \beta$, $\alpha^*$, and $(\alpha)$ are regular expressions, denoting the languages $L(\alpha)L(\beta)$, $L(\alpha) \cup L(\beta)$, $L(\alpha)^*$, and $L(\alpha)$, respectively;

(3) any regular expression is obtained by applying the above rules a finite number of times.

It is clear that $L(\emptyset^*) = \{\lambda\}$; therefore, we denote the regular expression $\emptyset^*$ by $\lambda$.

In the following, we first define pattern expressions formally and then explain the definition by examples.

Let $\Sigma$ be an alphabet, and $V = \{v_0, \ldots, v_{n-1}\}$ a finite set of variables such that $V \cap \Sigma = \emptyset$. We define a regular pattern as a regular expression over $\Sigma \cup V$. A pattern expression $p$ is an ordered set of regular patterns $r_0, r_1, \ldots, r_n$, where $r_i$ defines the variable $v_i$ for $0 \le i < n$, with the following properties:

(1) $r_0$ is a regular expression over $\Sigma$;

(2) $r_i$, for each $i$, $0 < i \le n$, is a regular expression over $\Sigma \cup \{v_0, \ldots, v_{i-1}\}$.

The language $L(p)$ generated by a pattern expression $p = (r_0, r_1, \ldots, r_n)$ is defined as follows:

$L_0 = L(r_0)$, as defined for a regular expression,

$L_i = \{ (u_0/v_0) \ldots (u_{i-1}/v_{i-1}) u_i \mid u_i \in L(r_i),\ u_j \in L_j,\ j = 0, \ldots, i-1 \}$, $0 < i \le n$,

and $L(p) = L_n$, where by $(u/v)\alpha$ we mean the substitution by the word $u$ of all occurrences of the variable $v$ in $\alpha$.

Note that for pattern expressions used in practice, e.g., in programming languages, software packages, etc., we suggest that \x is used when the variable $x \in \{v_0, \ldots, v_{n-1}\}$ is referenced. However, in the following examples, we simply use x rather than \x when the variable x is referenced, for convenience of reading.

Example 1.

• p = (w = (a + b)*, ww) is a pattern expression generating all double words over the alphabet {a, b};

• For p = (u = ab*aa, w = buua*, uwawbu), $L_0 = \{ab^n aa \mid n \ge 0\}$, $L_1 = \{b\, ab^n aa\, ab^n aa\, a^m \mid m \ge 0,\ n \ge 0\}$, $L_2 = \{ab^l aa\, b\, ab^n aa\, ab^n aa\, a^m\, a\, b\, ab^n aa\, ab^n aa\, a^m\, b\, ab^l aa \mid l \ge 0,\ m \ge 0,\ n \ge 0\}$;

• For p = (u = ab*, v = baa*, (u + v)(u + v)), $L(p) = \{ab^n ab^n \mid n \ge 0\} \cup \{ba^n ba^n \mid n \ge 1\} \cup \{ab^n ba^m \mid n \ge 0,\ m \ge 1\} \cup \{ba^m ab^n \mid n \ge 0,\ m \ge 1\}$;

• For p = (u = ab*, u*cu), $L(p) = \{(ab^n)^m c\, ab^n \mid n \ge 0,\ m \ge 0\}$.

Example 2. For the expression p = (u = a*b, u*u), aaabaaabaaab ∈ L(p), while aabaaabaaab ∉ L(p) and aabaaabaab ∉ L(p).

Example 3. Let p = (v = (a + b)*, u = vcv, u*). Then $L(p) = L_1^*$, where $L_1 = \{wcw \mid w \in \{a, b\}^*\}$.

3. Pattern expression versus regex

Pattern expressions appear to be similar to regex. Clearly, they are different in how they are defined. However, there are more differences between them than merely their definitions. In the following, we analyze the similarities and differences between pattern expressions and regex. Some of the pattern expressions can be directly converted into regex and conversely, using a “blind” translation.

(1) The regex ((a*b)|(b*a))aa\1 can be directly translated into the equivalent pattern expression (u = a*b + b*a, uaau), i.e., we replace all the parentheses that are back-referenced by variables, and each regular pattern is the regular expression defined between the corresponding pair of parentheses.

(2) The pattern expression (u = ab*, v = bba*, uabvuv) can be translated directly into the regex (ab*)ab(bba*)\1\2, i.e., the first occurrence of each variable is replaced by the corresponding regular pattern between a pair of parentheses, and the following occurrences are replaced by back-references.

Some expressions cannot be translated directly, but we can still translate each extended regex into a pattern expression and each pattern expression into an extended regex, according to their special cases.

(1) Consider the pattern expression (u = ab*, v = baa*, (u + v)(u + v)). If we substitute the first occurrence of u and v by its corresponding regular pattern surrounded by parentheses and we replace the rest of the variables by back-references, we get the regex ((ab*)|(baa*))(\2|\3). This regex will not recognize abbbbbaaa because the parenthesis referenced by \3 is not defined in this case. A workaround for this case is to rewrite the initial expression as uu + uv + vv + vu, to replace the first occurrence of u and v in every term by its corresponding regular pattern surrounded by parentheses, and then to replace the other variables by back-references: (ab*)\1|(ab*)(baa*)|(baa*)\4|(baa*)(ab*).

(2) The regex ((a*b)|(b*a))aa\2\3 cannot be directly translated into the pattern expression (u = a*b, v = b*a, (u + v)aauv), since at least one of \2 or \3 will be undefined. The corresponding pattern expression is (u = a*b, v = b*a, (uaau + vaav)).

(3) Let p be the following pattern expression: (u = ab*, u*cu). If we substitute the first occurrence of u by its corresponding regular pattern surrounded by parentheses, and then replace the next u by a back-reference, we get the regex (ab*)*c\1, which matches abbbabbbcabbb, as well as abbabbbcabbb. However, $L(p) = \{(ab^n)^m c\,(ab^n) \mid m, n \ge 0\}$. This happens because the back-reference will match the last occurrence, while for the pattern expression, the first match sets the value of the variable. To avoid this case, we can rewrite the regex as ((ab*)(\2)*c\2)|cab*.

(4) A similar case is when we get an expression like (u + v)*, but we can rewrite this as (u*v*)*, and only afterwards transform it into a complicated regex.

For the transformation of a regex into a pattern expression in general, we can replace each pair of parentheses that is back-referenced by a variable, and define the variable by the corresponding regular pattern. Of course, in the case where the referenced expression is under the star operator, we have to rewrite the pattern expression in order to accommodate the late binding of the parenthesis: (α)* must be replaced with (α)*u + ε, where u = α.

We should also notice that for regex, the match for a pair of parentheses is replaced with the same word in every back-reference, while for pattern expressions this is not always true.
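
Case (3) above, the late binding of a back-reference under a star, can be reproduced in a few lines. This is a minimal sketch using Python's `re` module, chosen only for illustration; the variable names are hypothetical.

```python
import re

# "Blind" translation of (u = ab*, u*cu): the group sits under a star,
# so \1 refers to the LAST iteration of the group.
blind = re.compile(r"(ab*)*c\1")

# Rewritten regex from case (3): the first match fixes group 2, and every
# later repetition must equal it.
rewritten = re.compile(r"((ab*)(\2)*c\2)|cab*")

same_blocks = "abbbabbbcabbb"   # (ab^3)^2 c ab^3, a word of L(p)
mixed_blocks = "abbabbbcabbb"   # ab^2 ab^3 c ab^3, not a word of L(p)

results = {
    "blind": [bool(blind.fullmatch(w)) for w in (same_blocks, mixed_blocks)],
    "rewritten": [bool(rewritten.fullmatch(w)) for w in (same_blocks, mixed_blocks)],
}
print(results)
```

The blind translation accepts both words, while the rewritten regex accepts only the word whose blocks are all equal.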

Example 4. For the regex r = ((ab*)ab\2)a\1\2, we may “blindly” translate it into a pattern expression p = (v = ab*, u = vabv, uauv). But the translation is incorrect. We have $L(p) = \{ab^n\, ab\, ab^n\, a\, ab^n\, ab\, ab^n\, ab^m \mid n \ge 0,\ m \ge 0\}$, while $L(r) = \{ab^n\, ab\, ab^n\, a\, ab^n\, ab\, ab^n\, ab^n \mid n \ge 0\}$. A workaround for this case is to “replace” the variable u by its content, i.e., p = (v = ab*, vabvavabvv).

Some cases are even more complex:

Example 5. The regex r = ((ab*)a*b\2)a\1\2 can be “blindly” translated into a pattern expression p = (v = ab*, u = va*bv, uauv). But $L(p) = \{ab^n a^k b\, ab^n\, a\, ab^n a^k b\, ab^n\, ab^m \mid n \ge 0,\ m, k \ge 0\}$, while $L(r) = \{ab^n a^k b\, ab^n\, a\, ab^n a^k b\, ab^n\, ab^n \mid n, k \ge 0\}$. The workaround from the previous example yields p = (v = ab*, va*bvava*bvv), which corresponds to the language $\{ab^n a^k b\, ab^n\, a\, ab^n a^h b\, ab^n\, ab^n \mid n \ge 0,\ k, h \ge 0\}$. This still differs from L(r), because the two occurrences of a* are matched independently.

Summarizing the above examples, pattern expressions and regex have at least the following differences:

(1) In regex, a branch of an alternation (“|”) which has not been chosen in a match cannot be back-referenced properly. For example, the regex ((aa*)|(bb*))\2\3 will not work as expected, since either \2 or \3 will not be back-referenced properly. However, there is no such problem in a pattern expression. The definition of the pattern expression (u = aa*, v = bb*, (u + v)uv) is totally valid.

(2) In regex, there is no mechanism to denote directly that a string which matches an expression repeats zero or more times, so indirect and nonintuitive expressions have to be used. Pattern expressions do not have such a problem. For example, let P be the pattern expression (u = ba*, u*cu). Then an equivalent regex would be ((ba*)(\2)*c\2)|cba*, which is much more complex than the corresponding pattern expression.

(3) In regex, a sub-expression (between a pair of parentheses) and all its back-references have to match the same string. However, in pattern expressions the same variable may match different strings in different variable definitions and in the final expression.

Clearly, a serious general problem about regex is that any back-reference to a sub-expression in an alternation (“|”) operation is not well defined. For example, in the regex ((ab)|(cd))\2, the value of \2 is not defined if cd is matched rather than ab. Pattern expressions do not have such problems.
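
Difference (3) can be checked on Example 4 with a brute-force membership test that follows the definition of $L_i$ from Section 2 literally. The following is a minimal sketch, not an efficient algorithm and not the automaton of Section 5. The encoding is an assumption made here for illustration: each regular pattern is written in Python `re` syntax (so union is `|` rather than `+`), a variable reference is written `{name}`, every variable language is assumed to be non-empty, and the helper name `make_matcher` is hypothetical.

```python
import re
from functools import lru_cache
from itertools import product

NEVER = r"[^\s\S]"  # a regex that matches no character at all


def make_matcher(defs):
    """Return a membership test for the pattern expression `defs`.

    `defs` is a list of (name, body) pairs; the last pair is the final
    expression. `body` may refer to earlier variables as "{name}".
    """
    names = [name for name, _ in defs]

    @lru_cache(maxsize=None)
    def member(i, w):
        body = defs[i][1]
        used = [j for j in range(i) if "{" + names[j] + "}" in body]
        subs = {w[a:b] for a in range(len(w) + 1) for b in range(a, len(w) + 1)}
        # A variable is either absent from the chosen word (None) or bound
        # to one substring of w that belongs to its own language L_j.
        pools = [[None] + [s for s in subs if member(j, s)] for j in used]
        for values in product(*pools):
            concrete = body
            for j, s in zip(used, values):
                piece = NEVER if s is None else "(?:" + re.escape(s) + ")"
                concrete = concrete.replace("{" + names[j] + "}", piece)
            if re.fullmatch(concrete, w):
                return True
        return False

    return lambda w: member(len(defs) - 1, w)


# Example 4: the regex r and its "blind" translation p.
r = re.compile(r"((ab*)ab\2)a\1\2")
p = make_matcher([("v", "ab*"), ("u", "{v}ab{v}"), ("p", "{u}a{u}{v}")])

u = "ab" + "ab" + "ab"          # v = ab inside the definition of u
w_same = u + "a" + u + "ab"     # final v = ab   (n = m = 1)
w_diff = u + "a" + u + "abb"    # final v = abb  (n = 1, m = 2)

for w in (w_same, w_diff):
    print(w, bool(r.fullmatch(w)), p(w))
```

Both tests accept the word with n = m, but only the pattern expression accepts the word with n ≠ m, because v is chosen again, independently, in the final expression. The search tries every substring of the input as a candidate value, so it is only practical for short words.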

4. A pumping lemma for pattern expressions

There are several pumping lemmas for regular languages [4,8], which are useful tools for showing that certain languages are not regular languages. In the following, we prove a pumping lemma for pattern expression languages and give several examples showing how to use the lemma to prove that certain languages are not pattern expression languages.

Lemma 1. Let $p = (r_0, \ldots, r_n)$ be a pattern expression. Then there is a constant $N$ such that if $w \in L(p)$ and $|w| > N$, there is a decomposition $w = x_0 y x_1 y x_2 \cdots x_m$ for some $m \ge 1$, such that:

(1) $|x_0 y| \le N$,

(2) $|y| \ge 1$,

(3) $x_0 y^j x_1 y^j x_2 \cdots x_m \in L(p)$ for all $j > 0$.

Proof. We shall prove the lemma by induction on $n$. If $n = 0$, the pattern expression is a regular expression. Let $N_0 = |r_0| + 1$. (Note that $|r|$ is the length of the expression $r$, i.e., the number of all symbols in $r$ including operators.) Then the lemma holds for $m = 1$. Assume the lemma is true for $k < n$. We shall prove it next for $k = n$.

Let us consider the pattern expression $p_n = (r_0, r_1, \ldots, r_n)$. Using the inductive hypothesis, it follows that, for each $k$, $0 \le k < n$, there exists a constant $N_k$ for $p_k = (r_0, r_1, \ldots, r_k)$ such that for all words $w_k \in L(p_k)$ with $|w_k| > N_k$, there is a decomposition $w_k = x_{0,k}\, y_k\, x_{1,k}\, y_k \ldots x_{m_k-1,k}\, y_k\, x_{m_k,k}$ for some $m_k \ge 1$ such that $|x_{0,k}\, y_k| < N_k$, $y_k \ne \varepsilon$, and $x_{0,k}\, y_k^j\, x_{1,k}\, y_k^j \ldots x_{m_k-1,k}\, y_k^j\, x_{m_k,k} \in L(p_k)$ for all $j > 0$.

Let $N_n = \max\{N_0, N_1, \ldots, N_{n-1}\} \cdot |r_n| + 1$. Let $w \in L_n$, $|w| > N_n$. There are several cases.

(1) $r_n = s_0 s_1^* s_2$, where $s_0$ does not contain any $*$, and $w = w_0 w_1 \cdots w_{l-1} w_l$ such that $w_0 \in L(s_0)$, $w_1, \ldots, w_{l-1} \in L(s_1)$, $w_l \in L(s_2)$ and $w_1 \ne \lambda$.

(a) If there is no variable $v_i$, $0 \le i < n$, in $s_0$ such that a substring $x$ of $w_0$ matches $v_i$ and $|x| > N_i$, then clearly $w$ satisfies the pumping lemma for $m = 1$, where $x_0 = w_0$ and $y = w_1$.

(b) Otherwise, there is a variable $v_i$, $0 \le i < n$, in $s_0$ such that a substring $x$ of $w_0$ matches $v_i$ and $|x| > N_i$. Let $x$ be the first such string in $w_0$ and $w_0 = uxv$. By the induction hypothesis, $x = x_{0,i}\, y_i\, x_{1,i}\, y_i \ldots x_{m_i-1,i}\, y_i\, x_{m_i,i}$, $m_i \ge 1$ and $y_i \ne \lambda$, and for any $j \ge 0$, $x^{(j)} = x_{0,i}\, y_i^j\, x_{1,i}\, y_i^j \ldots x_{m_i-1,i}\, y_i^j\, x_{m_i,i} \in L(v_i)$. Clearly $|u\, x_{0,i}\, y_i| \le N_n$, and by replacing every $x$ which matches $v_i$ in $w$ by $x^{(j)}$ for any $j \ge 0$, we obtain a word in $L$. Therefore, the lemma holds.

(2) If $r_n$ does not contain any $*$, then clearly there exists a variable $v_i$ in $r_n$, $0 \le i < n$, such that the string $x$ that matches $r_i$ in $w$ has the property $|x| > N_i$. Let $x$ be the first (from left to right) such substring in $w$. Then, similar to (1)(b), $x$ can be decomposed into $x = x_{0,i}\, y_i\, x_{1,i}\, y_i \ldots x_{m_i-1,i}\, y_i\, x_{m_i,i}$, $m_i \ge 1$ and $y_i \ne \lambda$, by the induction hypothesis. The remainder of the proof follows (1)(b).

Now we have considered all the cases. The lemma holds in each case. Therefore, the proof is completed. □

Example 6. Consider some special cases of pattern expressions for the pumping lemma.

• Let $p_1$ = (u = ab + ba, uu*). Then the constant $N_0 = 6$ and $N_1 = 6 \cdot 3 + 1 = 19$. Any word $w$ that matches $p_1$ with $|w| > N_1$ can be decomposed into $xyz$ such that $|xy| \le N_1$, $|y| \ge 1$, and $x y^j z \in L(p_1)$ for all $j > 0$. For example, let $w = (ba)^{13}$. Then $x = ba$, $y = ba$, and $z = (ba)^{11}$.

• Let $p_2$ = (u = a + b, bab(ucu)*). Then $N_0 = 4$ and $N_1 = 4 \cdot 9 + 1 = 37$. For example, let w = babacabcbbcbacabcbacaacaacaacaacaacaaca. Clearly, $|w| > N_1$. Then w = xyz such that x = bab, y = aca, and z = bcbbcbacabcbacaacaacaacaacaacaaca.

• Let $p_3$ = (u = a*, ububbubbb). Then $N_0 = 3$ and $N_1 = 3 \cdot 9 + 1 = 28$. In this case, any word $w \in L(p_3)$ with $|w| > N_1$ can be decomposed into $x_0 y x_1 y x_2 y x_3$ such that $|x_0 y| \le N_1$, $|y| \ge 1$, and $x_0 y^j x_1 y^j x_2 y^j x_3 \in L(p_3)$ for all $j > 0$.

Next we show how the pumping lemma can be used.

Example 7. The language $L = \{a^{2n} b^n \mid n > 0\}$ cannot be expressed by a pattern expression.

Proof. Assume that $L$ is expressed by a pattern expression. Let $N$ be the constant of Lemma 1 and consider the word $a^{2k} b^k$ for some integer $k$ such that $2k > N$. By Lemma 1, $a^{2k} b^k$ has a decomposition $x_0 y x_1 y x_2 \cdots x_m$, $m \ge 1$, such that (1) $|x_0 y| \le N$, (2) $|y| \ge 1$, and (3) $x_0 y^j x_1 y^j x_2 \cdots x_m \in L$ for all $j > 0$. According to (1) and (2), $y = a^i$ for some $i > 0$. Every occurrence of $y$ then consists only of a's, so pumping changes the number of a's but not the number of b's. Hence the word $x_0 y^2 x_1 y^2 \cdots x_m$ is clearly not in $L$, which is a contradiction. Therefore, $L$ does not satisfy Lemma 1. Thus, $L$ cannot be expressed by any pattern expression. □

Similarly, we can prove that the languages $\{a^n b^n \mid n \ge 1\}$ and $\{a^n b^n c^n \mid n \ge 1\}$ are not pattern expression languages. As an application of the pumping lemma, we also get the following example.

Example 8. The language $L = \{\{a, b\}^n c^n \mid n \ge 0\}$ is not a pattern expression language.

Proof. Suppose that $L$ is a pattern expression language and let $N$ be the constant of the pumping lemma. Consider the word $a^N c^N$ in $L$. Then, by the pumping lemma, $a^N c^N = x_0 y x_1 y \cdots x_m$ for some $m \ge 1$ and $|y| > 0$, and $x_0 y^i x_1 y^i \cdots x_m \in L$ for all $i > 0$. It is clear that $y$ cannot contain $c$, and we have $y = a^k$ for some $k > 0$. Then we have $x_0 y^2 x_1 y^2 \cdots x_m \notin L$, which is a contradiction. So $L$ is not a pattern expression language. □

Similarly, we can prove that $\{\{a, b\}^n c \{a, b\}^n \mid n \ge 0\}$ is not a pattern expression language. The following result is similar to that for regex in [1].

Lemma 2. The family of pattern expression languages is not closed under complementation.

Proof. The language $L_1 = \{a^m \mid m > 1 \text{ is not prime}\}$ is expressed by the pattern expression (u = aaa*, uuu*). Assume that the complement of $L_1$, $L_1^c$, is a pattern expression language, and apply Lemma 1 to $L_1^c$. This implies that there exist $n_1 \ge 0$, $n_2 \ge 1$ such that for all $j > 0$, $a^{n_1 + j \cdot n_2} \in L_1^c$. This is a contradiction, since it is not possible for $n_1 + j \cdot n_2$ to be prime for all $j > 0$. □

We can also note that the language $L_1$ in the proof of Lemma 2 is a pattern expression language over a one-letter alphabet that is not context-free. In particular, this means that there are pattern expression languages that do not belong to the Boolean closure of context-free languages.
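
The language $L_1$ from the proof of Lemma 2 is easy to experiment with, because over a one-letter alphabet the pattern expression translates directly into a regex with a single back-reference. The sketch below uses Python's `re` module as an illustration; the helper names `in_L1` and `first_composite_step` are hypothetical. It also illustrates the last step of the proof on a sample progression: the values $n_1 + j \cdot n_2$ leave the complement of $L_1$ after finitely many steps.

```python
import re

# u = aaa* is "aa+", and uuu* repeats the SAME u at least twice: (aa+)\1+
COMPOSITE = re.compile(r"(aa+)\1+")


def in_L1(m):
    """True if a^m belongs to L1, i.e. m > 1 and m is not prime."""
    return COMPOSITE.fullmatch("a" * m) is not None


def first_composite_step(n1, n2, limit=1000):
    """Smallest j in 1..limit-1 with a^(n1 + j*n2) in L1, or None."""
    for j in range(1, limit):
        if in_L1(n1 + j * n2):
            return j
    return None


print([m for m in range(2, 20) if in_L1(m)])
print(first_composite_step(1, 2))
```

The first line lists the exponents below 20 that belong to $L_1$; the second reports how many steps the progression 3, 5, 7, ... needs before it reaches a composite number.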

5. Pattern automata

In what follows, we design a device, called a pattern automaton, to recognize pattern expression languages. Let $p = (r_0, r_1, \ldots, r_n)$ be a pattern expression. A pattern automaton is an automata system $P = (A_0, A_1, \ldots, A_n)$ where

$A_0 = (Q_0, \Sigma, \delta_0, q_{0,0}, F_0)$, and $A_i = (Q_i, \Sigma \cup \{v_0, \ldots, v_{i-1}\}, \delta_i, q_{i,0}, F_i)$, $0 < i \le n$,

are finite automata. For $A_0$, $Q_0$ is the finite set of states, $\Sigma$ is the finite alphabet of input symbols, $\delta_0$ is the transition relation, $q_{0,0}$ is the initial state, and $F_0 \subseteq Q_0$ is the set of final states. For each $i$, $0 < i \le n$, $A_i$ is the same as $A_0$, except that the labels of the transitions may include the variables $v_0, \ldots, v_{i-1}$. Assume that $Q_i \cap Q_j = \emptyset$ for $0 \le i, j \le n$ and $i \ne j$. Let $Q = \bigcup_{i=0}^{n} Q_i$.

If $n = 0$, then the pattern automaton consists of only one automaton and it works just as a normal finite automaton. If $n > 0$, then the pattern automaton works with a state stack $S$ and two arrays of stacks $U$ and $V$, where the elements of $S$ are from $Q$, each element of $U_i$, $0 \le i < n$, is from $\{0, 1\}$, and each element of $V_i$, $0 \le i < n$, is from $\Sigma^*$. Note that when a substring of the input is considered matching the expression $r_i$, $0 < i \le n$, each top element of $U_j$, $0 \le j < i$, shows whether the variable $v_j$ is defined for the variable $v_i$, by setting $top(U_j) = 0$ ($v_j$ is not defined) or $top(U_j) = 1$ ($v_j$ is defined), and the top of $V_j$ stores the string defined for $v_j$ if it is defined. All stacks are bounded stacks, each containing at most $n - 1$ elements.

The current configuration of the pattern automaton can be described by its current state $q \in Q$, the remainder of the input string $w \in \Sigma^*$, the status of the state stack $S$, and the current status of the arrays of stacks $U$ and $V$. Initially, the pattern automaton is at the state $q_{n,0}$ with the input string $w \in \Sigma^*$; $S$ is empty and all the stacks in both $U$ and $V$ are empty. An accepting configuration is one with the current state being in $F_n$. The first step of the pattern automaton $P$ is $push(U_i, 0)$ for all $0 \le i \le n - 1$. In the following, we use $s^t$ to denote the current state and $x^t$ to denote the remaining input, $s^{t+1}$ the next state and $x^{t+1}$ the remaining input at the next step; thus, the initial configuration is

$(s^0, x^0, S^0, U^0, V^0) = (q_{n,0}, w, \lambda, \lambda, \lambda)$.

The transitions between configurations are defined by one of the following rules:

(1) If $s^t = p \in Q_n$, $x^t = ay \in \Sigma^*$ where $a \in \Sigma$, and $q \in \delta_n(p, a)$, then $s^{t+1} = q$ and $x^{t+1} = y$. If $s^t = p \in Q_i$ for some $i < n$, $x^t = ay \in \Sigma^*$ where $a \in \Sigma$, and $q \in \delta_i(p, a)$, then $s^{t+1} = q$, $x^{t+1} = y$, and $top(V_i) = top(V_i)a$.

(2) If $s^t = p \in Q_i$, $i > 0$, $q \in \delta_i(p, v_j)$ for $0 \le j < i$, $top(U_j) = 0$, then $push(S, q)$, $s^{t+1} = q_{j,0}$, $x^{t+1} = x^t$, $push(V_j, \lambda)$, and $push(U_k, 0)$ for all $k$, $0 \le k < j$.

(3) If $s^t = p \in F_i$ for $0 \le i < n$ and $top(U_i) = 0$, then $s^{t+1} = top(S)$, $pop(S)$, $pop(V_j)$ for $0 \le j < i$, $pop(U_j)$ for all $0 \le j < i$, $top(U_i) = 1$.

(4) If $s^t = p \in Q_i$, $0 < i \le n$, $q \in \delta_i(p, v_j)$, $0 \le j < i$, $top(U_j) = 1$, and $x^t = top(V_j)y$, then $s^{t+1} = q$, $x^{t+1} = y$, and $top(V_i) = top(V_i)top(V_j)$.

(5) If $s^t \in F_n$ and $x^t = \lambda$, then accept.

If a configuration of the pattern automaton is denoted $(q, x, S, U, V)$, the language recognized by a pattern automaton $P$ is:

$L(P) = \{ w \mid (q_{n,0}, w, \lambda, \lambda, \lambda) \vdash^* (f, \lambda, S, U, V),\ f \in F_n \}$.

For each of the automata $A_i$, $0 \le i \le n$, we consider the languages $R_i = L(A_i) \subseteq (\Sigma \cup \{v_0, \ldots, v_{i-1}\})^*$. The language recognized by the pattern automaton $P = (A_0, A_1, \ldots, A_n)$ is

$W_n = \{ (u_0/v_0) \ldots (u_{n-1}/v_{n-1}) u_n \mid u_n \in R_n,\ u_i \in W_i,\ 0 \le i \le n - 1 \}$,

where $W_0 = R_0$, and

$W_i = \{ (u_0/v_0) \ldots (u_{i-1}/v_{i-1}) u_i \mid u_i \in R_i,\ u_j \in W_j,\ 0 \le j \le i - 1 \}$.

Since $R_i = L(r_i)$, it follows that $W_i = L_i$, hence $L(P) = L(p)$, i.e., the automata system recognizes the same language as the language generated by the pattern expression $p$. The following theorem holds clearly.

Theorem 1. Languages recognized by pattern automata systems (PAS) are the same as pattern expression languages.
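
The operational idea behind rules (2) to (4) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the stack machine defined above: recursion stands in for the state stack S, a tuple of bindings stands in for the flags U and the buffers V, and nondeterminism is resolved by exhaustive search. The encoding of the automata (a dict of transitions, integer labels for variables) and the names `runs` and `accepts` are choices made here for the example.

```python
def runs(automata, i, w, start):
    """Return every end position e such that A_i accepts w[start:e].

    automata[i] = (delta, q0, finals); delta maps (state, label) to a set of
    states. A one-letter string label reads that letter; an int label j < i
    stands for the variable v_j.
    """
    delta, q0, finals = automata[i]
    ends, seen = set(), set()
    todo = [(q0, start, ())]          # (state, position, bindings of A_i)
    while todo:
        config = todo.pop()
        if config in seen:
            continue
        seen.add(config)
        q, pos, env = config
        if q in finals:
            ends.add(pos)
        bound = dict(env)
        for (p, label), targets in delta.items():
            if p != q:
                continue
            if isinstance(label, int) and label in bound:
                # v_j already defined: its stored value must come next.
                value = bound[label]
                steps = [(pos + len(value), env)] if w.startswith(value, pos) else []
            elif isinstance(label, int):
                # v_j not yet defined: run A_j here and record what it read.
                steps = []
                for e in runs(automata, label, w, pos):
                    new = dict(bound)
                    new[label] = w[pos:e]
                    steps.append((e, tuple(sorted(new.items()))))
            else:
                steps = [(pos + 1, env)] if w.startswith(label, pos) else []
            for new_pos, new_env in steps:
                todo.extend((t, new_pos, new_env) for t in targets)
    return ends


def accepts(automata, w):
    """Accept w if the last automaton A_n can consume the whole input."""
    return len(w) in runs(automata, len(automata) - 1, w, 0)


# Hand-built automata for p = (u = ab*, u*cu).
A0 = ({(0, "a"): {1}, (1, "b"): {1}}, 0, {1})               # ab*
A1 = ({(0, 0): {0}, (0, "c"): {1}, (1, 0): {2}}, 0, {2})    # v0* c v0
P = [A0, A1]

for w in ("abbabbcabb", "abbabbbcabbb", "cab"):
    print(w, accepts(P, w))
```

Bindings made while running a sub-automaton are discarded when it returns, which mirrors the pops in rule (3); the `seen` set keeps the search finite even when a variable is bound to the empty word.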


6. Other properties of pattern expression languages

In this section we prove that every pattern expression language is a context-sensitive language. We also establish the relationship between pattern expression languages and context-free languages.

Theorem 2. Pattern expression languages are context-sensitive languages.

Proof. It suffices to show that each pattern expression language is accepted by a linear-bounded automaton (LBA), i.e., a nondeterministic Turing machine in linear space. We have already proved that pattern expression languages are the same as the languages recognized by pattern automata (PA). For any PA there is a constant number of buffers, and each buffer needs at most the space of the input word. Thus, a pattern expression language is accepted by an LBA and is a context-sensitive language. □

Theorem 3. The family of pattern expression languages is incomparable with the family of context-free languages.

Proof. The language $L = \{a^n b a^n b a^n \mid n \ge 1\}$ is clearly a pattern expression language. It can be expressed as (u = aa*, ububu). However, $L$ is not a context-free language. We know that $\{a^{2n} b^n \mid n \ge 0\}$ is a context-free language, but it is not a pattern expression language, as we proved in Section 4. □

There are many other properties of pattern expression languages. We state some of them in the following.

Theorem 4. The family of pattern expression languages

• is a proper subset of the family of context-sensitive languages,
• is closed under homomorphism,
• is not closed under inverse homomorphisms, and
• is not closed under finite substitutions.

7. Conclusions

In this paper, we show that regex in general have serious problems and are inconvenient to use in some cases. We introduced pattern expressions and compared pattern expressions with regex. We proved a pumping lemma for pattern expression languages and introduced pattern automata, which recognize pattern expression languages. We also studied properties of pattern expressions and the languages they define. Our future work is to develop an efficient deterministic algorithm that, given a pattern expression, accepts the language denoted by the expression.

References

[1] C. Câmpeanu, K. Salomaa, S. Yu, A formal study of practical regular expressions, Int. J. Found. Comput. Sci. 14 (6) (2003) 1007–1018.
[2] N. Chapman, Perl: The Programmer’s Companion, Wiley, Chichester, 1997.
[3] J.E.F. Friedl, Mastering Regular Expressions, O’Reilly & Associates, Inc., Cambridge, 1997.
[4] J.E. Hopcroft, J.D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, MA, 1979.
[5] M.E. Lesk, Lex: a lexical analyzer generator, Computer Science Technical Report 39, AT&T Bell Laboratories, Murray Hill, NJ, 1975.
[6] M. Mohri, Minimization algorithms for sequential transducers, Theoret. Comput. Sci. 234 (2000) 177–201.
[7] A. Salomaa, Formal Languages, Academic Press, New York, London, 1973.
[8] S. Yu, Regular languages, in: G. Rozenberg, A. Salomaa (Eds.), Handbook of Formal Languages, Springer, Berlin, 1998, pp. 41–110.
