Information Processing Letters 95 (2005) 396–400 www.elsevier.com/locate/ipl
On the language equivalence of NE star-patterns Benedek Nagy a,b,1 a Research Group on Mathematical Linguistics, Rovira i Virgili University, Pl. Imperial Tárraco 1, 43005 Tarragona, Spain b Faculty of Informatics, University of Debrecen, H-4010 Debrecen, P.O. Box 12, Hungary
Received 30 July 2004; received in revised form 31 March 2005 Available online 8 June 2005 Communicated by M. Yamashita
Abstract
A pattern is a finite string of constant and variable symbols. The language generated by a pattern is the set of all strings of constant symbols that can be obtained from the pattern by substituting (non-empty) strings for variables. The pattern languages form a language family that is orthogonal to the Chomsky hierarchy. They have many applications, for instance in the extended regular expressions of languages and tools such as Perl and awk, and they are also well suited to machine learning. Both erasing and non-erasing patterns are used. For non-erasing pattern languages the equivalence problem is decidable, but the inclusion problem is undecidable. In extended regular expressions one can use union, concatenation and Kleene star to build more complex patterns. It is also known that the equivalence problem for extended regular expressions is undecidable. However, whether equivalence is decidable for patterns using only concatenation and star is still open. This paper presents results on inclusion properties and equivalences of some kinds of erasing and non-erasing pattern languages. We show that, in some cases, the equivalence problem for non-erasing star-patterns can be reduced to the decidability of some very special inclusion problems. These results may be useful for deciding whether the language equivalence of non-erasing star-patterns is decidable or not. © 2005 Elsevier B.V. All rights reserved.
Keywords: Formal languages; Pattern languages; Star-pattern; Non-erasing pattern; Extended regex
E-mail address: [email protected] (B. Nagy). URL: http://www.inf.unideb.hu/~nbenedek.
1 This work is supported by grants of the Hungarian National Foundation for Scientific Research (OTKA T049409) and the International Visegrad Fund (S-061-2004).
0020-0190/$ – see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.ipl.2005.05.003

1. Introduction
In formal language theory the pattern languages (introduced in [1]) are one of the alternatives to the most widely used Chomsky-type language families. A pattern is a finite string of terminal letters (constants) and variable symbols. The language generated by a non-erasing pattern is the set of all strings of terminal letters that can be obtained from the pattern by substituting non-empty strings for variables. Similarly, if substitution by the empty string is also allowed, the result is the language generated by an erasing pattern. Nowadays these languages are of great importance in inference problems and in algorithmic learning theory. The pattern languages are applicable in machine learning procedures, because they are learnable in polynomial time [5]. A typical use of patterns occurs in extended regular expressions, as in Perl, Python, awk and egrep. The non-erasing pattern languages are one of the rare examples of a class of languages for which equivalence is decidable but inclusion is undecidable. In extended regular expressions (introduced in [2]) one can use union, concatenation and Kleene star to build more complex patterns. It is also known that the equivalence problem for extended regular expressions is undecidable. However, whether equivalence is decidable for patterns using only concatenation and star is still open. This paper contains some results that may be useful for deciding whether the language equivalence of non-erasing star-patterns is decidable or not.

2. Basic definitions
In this section we give the basic notions and definitions of pattern languages. We use $\lambda$ for the empty word.

Definition 1. Let $\Sigma$ be a finite set of terminals $(a_1, \ldots, a_n)$ and let $V = \{x_1, x_2, \ldots\}$ be an infinite set of variables $(\Sigma \cap V = \emptyset)$. Then a pattern is a non-null finite string over $\Sigma \cup V$.

We use the terms erasing (E-) and non-erasing (NE-) patterns in the following sense. Let $H_{\Sigma,V}$ be the set of morphisms $h : (\Sigma \cup V)^* \to (\Sigma \cup V)^*$. The language generated by an E-pattern $\alpha$ is defined as
$$L_E(\alpha) = \{\, w \in \Sigma^* \mid \exists h \in H_{\Sigma,V} : \forall a \in \Sigma : h(a) = a \,\wedge\, w = h(\alpha) \,\}.$$
The language generated by an NE-pattern $\alpha$ is defined as
$$L_{NE}(\alpha) = \{\, w \in \Sigma^* \mid \exists h \in H_{\Sigma,V} : \forall a \in \Sigma : h(a) = a \,\wedge\, \neg\exists v \in V : h(v) = \lambda \,\wedge\, w = h(\alpha) \,\}.$$
We use the notation $|w|$ for the length of a word. (We use the same notation for the length of a pattern, that is, the number of occurrences of terminals and variables in it.)
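The two language definitions above can be made concrete with a small membership test. The following is a minimal Python sketch, not taken from the paper: it assumes a hypothetical encoding in which a pattern is a list whose string entries are terminal letters and whose integer entries are variable indices, and it searches for a suitable morphism $h$ by plain backtracking (exponential in the worst case).

```python
from typing import Dict, List, Union

# Assumed encoding (not from the paper): str = terminal letter, int = variable index.
Symbol = Union[str, int]
Pattern = List[Symbol]


def matches(pattern: Pattern, word: str, erasing: bool = False) -> bool:
    """Return True if word is in L_E(pattern) (erasing=True) or in L_NE(pattern)."""
    min_len = 0 if erasing else 1  # NE-patterns forbid h(v) = empty word

    def go(i: int, pos: int, h: Dict[int, str]) -> bool:
        if i == len(pattern):
            return pos == len(word)
        sym = pattern[i]
        if isinstance(sym, str):  # terminal: h(a) = a
            return word.startswith(sym, pos) and go(i + 1, pos + len(sym), h)
        if sym in h:  # variable already bound: must repeat its image
            img = h[sym]
            return word.startswith(img, pos) and go(i + 1, pos + len(img), h)
        for end in range(pos + min_len, len(word) + 1):  # try every image
            h[sym] = word[pos:end]
            if go(i + 1, end, h):
                return True
            del h[sym]  # undo the binding before trying a longer image
        return False

    return go(0, 0, {})
```

For the pattern $x_1 a x_1$, encoded as `[1, 'a', 1]`, the word `bab` belongs to both languages, while the word `a` belongs only to the erasing language (take $h(x_1) = \lambda$).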
The following definition was very useful for deciding the equivalence of two NE-patterns.

Definition 2. We assume that the alphabet contains at least two terminal letters $(a, b)$. For a given pattern $\alpha$ we define the sample $F(\alpha)$ in the following way:
$$F(\alpha) = \{f_a(\alpha), f_b(\alpha)\} \cup \{g_i(\alpha) \mid i \ge 1\},$$
where $f_a(x_i) = a$, $f_b(x_i) = b$ $(i \in \mathbb{N})$, and $g_i(x_i) = a$, $g_i(x_j) = b$ $(i \ne j)$.
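The sample of Definition 2 is finite in practice: if $x_i$ does not occur in $\alpha$, then $g_i(\alpha) = f_b(\alpha)$. A minimal Python sketch of its computation follows, under the assumed encoding (not from the paper) that a pattern is a list with strings for terminal letters and integers for variable indices; the function name `sample` is hypothetical.

```python
def sample(pattern, a='a', b='b'):
    """Return the sample F(pattern) of Definition 2 as a set of words.

    Assumed encoding: str entries are terminals, int entries are variables.
    """
    variables = sorted({s for s in pattern if isinstance(s, int)})

    def apply(image):
        # Terminals are kept, each variable v is replaced by image(v).
        return ''.join(s if isinstance(s, str) else image(s) for s in pattern)

    words = {apply(lambda v: a), apply(lambda v: b)}       # f_a and f_b
    for i in variables:                                     # g_i for occurring x_i
        words.add(apply(lambda v, i=i: a if v == i else b))
    return words
```

For $\alpha = x_1 a x_2 x_1$, encoded as `[1, 'a', 2, 1]`, this yields the four words `aaaa`, `babb`, `aaba` and `baab`.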
It is easy to check that every word of the sample has the same length as the pattern.

Now we define some extensions of patterns.

Definition 3. Each pattern is an extended regular expression. If $\gamma$ and $\delta$ are extended regular expressions, then
$\gamma \vee \delta$ is also an extended regular expression (using the operation union),
$\gamma \cdot \delta$ is also an extended regular expression (using the operation concatenation),
$\gamma^*$ is also an extended regular expression (using the operation Kleene star).
All extended regular expressions are obtained from patterns by using finitely many operators. The extended regular expressions that can be obtained without using union ($\vee$) are called star-patterns.

Definition 4. The language defined by an extended regular expression is obtained from the languages defined by patterns in the following way. (The same definitions work for E-patterns and NE-patterns, but in this paper we analyze NE-patterns, so we need the definition only for them.)
$L_{NE}(\gamma \vee \delta) = L_{NE}(\gamma) \cup L_{NE}(\delta)$ (using the operation union),
$L_{NE}(\gamma \cdot \delta) = L_{NE}(\gamma) \cdot L_{NE}(\delta)$ (using the operation concatenation for languages),
$L_{NE}(\gamma^*) = L_{NE}(\gamma)^*$ (using the operation Kleene star).
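To illustrate the Kleene-star clause of Definition 4, the sketch below decides membership in $L_{NE}(\alpha^*)$ for a single pattern $\alpha$ by dynamic programming over the split points of the word. It is only an illustration under assumptions that are not from the paper: patterns are lists with strings for terminals and integers for variables, and the helper names `ne_match` and `in_star` are hypothetical.

```python
def ne_match(pattern, word, pos=0, h=None):
    """Backtracking test for word[pos:] in L_NE(pattern) under bindings h."""
    h = {} if h is None else h
    if not pattern:
        return pos == len(word)
    sym, rest = pattern[0], pattern[1:]
    if isinstance(sym, str) or sym in h:
        img = sym if isinstance(sym, str) else h[sym]
        return word.startswith(img, pos) and ne_match(rest, word, pos + len(img), h)
    # Unbound variable: try every non-empty image.
    for end in range(pos + 1, len(word) + 1):
        if ne_match(rest, word, end, {**h, sym: word[pos:end]}):
            return True
    return False


def in_star(pattern, word):
    """Return True if word is in L_NE(pattern)^* = L_NE(pattern^*)."""
    n = len(word)
    ok = [True] + [False] * n  # ok[k]: word[:k] is a product of words of L_NE(pattern)
    for k in range(1, n + 1):
        ok[k] = any(ok[j] and ne_match(pattern, word[j:k]) for j in range(k))
    return ok[n]
```

With $\alpha = x_1 x_1$ (non-empty squares), the word `aabb` is in $L_{NE}(\alpha^*)$ because it splits as `aa` followed by `bb`, whereas `abba` is not.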
For star-patterns we get the (NE) star-pattern languages.

3. Preliminaries
In this part we recall some lemmas and theorems that are needed to prove our results.

Lemma 5. Two NE-patterns $\alpha$ and $\beta$ define the same language if and only if $\alpha$ results from $\beta$ by renaming of variables (and vice versa).

This result, which is from [1], means that the equivalence of NE-patterns is easily decidable. The inclusion problem for NE-patterns, in contrast, is undecidable, as Theorem 6 states.
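Lemma 5 turns the equivalence test for NE-patterns into a purely syntactic check. A minimal Python sketch follows, assuming (this encoding is not from the paper) that a pattern is a list in which strings are terminal letters and integers are variable indices; the names `canonical` and `ne_equivalent` are hypothetical. Variables are renamed in order of first occurrence, so two patterns agree up to renaming exactly when their canonical forms coincide.

```python
def canonical(pattern):
    """Rename variables to 1, 2, 3, ... in order of first occurrence."""
    names = {}
    out = []
    for s in pattern:
        if isinstance(s, int):
            # len(names) is evaluated before a new key is inserted.
            s = names.setdefault(s, len(names) + 1)
        out.append(s)
    return out


def ne_equivalent(alpha, beta):
    """Equivalence of NE-patterns in the sense of Lemma 5 (equality up to renaming)."""
    return canonical(alpha) == canonical(beta)
```

For example, $x_3 a x_7 x_3$ and $x_1 a x_2 x_1$ are equivalent, while $x_1 x_2$ and $x_1 x_1$ are not.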
Theorem 6. Let $\alpha$ and $\beta$ be two patterns. It is undecidable in general whether $L_{NE}(\alpha) \subseteq L_{NE}(\beta)$.

This is one of the main results of [4]. To prove the theorem above, the authors showed that the following problems are equivalent.

Lemma 7. Let $\alpha, \beta_1, \ldots, \beta_n$ be NE-patterns. Then one can effectively construct NE-patterns $\alpha'$, $\beta'$ such that
$$L_{NE}(\alpha) \subseteq \bigcup_{j=1}^{n} L_{NE}(\beta_j)$$
if and only if $L_{NE}(\alpha') \subseteq L_{NE}(\beta')$.

As a consequence of the previous facts the inclusion problem
$$L_{NE}(\alpha) \subseteq \bigcup_{j=1}^{n} L_{NE}(\beta_j)$$
is undecidable.

About the extended regular expressions the following fact is known.

Lemma 8. The language equivalence of extended regular expressions is undecidable.

Proof. It follows immediately from the previous theorem, because $L_{NE}(\alpha \vee \beta) = L_{NE}(\alpha)$ if and only if $L_{NE}(\beta) \subseteq L_{NE}(\alpha)$ for any two patterns $\alpha$ and $\beta$, and the inclusion problem is undecidable, as stated above. □

The following lemma is from [7].

Lemma 9. Let $\alpha$ and $\beta$ be patterns such that $|\alpha| = |\beta|$ and $F(\alpha) \subseteq L_{NE}(\beta)$. Then $L_{NE}(\alpha) \subseteq L_{NE}(\beta)$.

4. Results
The following lemma states a simple fact.

Lemma 10. The shortest words in the language $L_{NE}(\alpha)$ have the same length as $\alpha$.

Proof. Trivial: apply a non-erasing morphism that substitutes a word of length one for each variable. □
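Lemmas 9 and 10 settle the inclusion question in two special cases, which the following Python sketch illustrates. It rests on assumptions that are not from the paper: patterns are lists with strings for terminals and integers for variables, the alphabet contains the letters `a` and `b`, and all function names are hypothetical. If $|\alpha| = |\beta|$, inclusion holds exactly when the sample of $\alpha$ lies in $L_{NE}(\beta)$ (Lemma 9 gives one direction; the other holds because every sample word is itself in $L_{NE}(\alpha)$). If $|\alpha| < |\beta|$, inclusion fails by Lemma 10. For $|\alpha| > |\beta|$ the sketch gives no answer.

```python
def ne_match(pattern, word, pos=0, h=None):
    """Backtracking test for word[pos:] in L_NE(pattern) under bindings h."""
    h = {} if h is None else h
    if not pattern:
        return pos == len(word)
    sym, rest = pattern[0], pattern[1:]
    if isinstance(sym, str) or sym in h:
        img = sym if isinstance(sym, str) else h[sym]
        return word.startswith(img, pos) and ne_match(rest, word, pos + len(img), h)
    for end in range(pos + 1, len(word) + 1):
        if ne_match(rest, word, end, {**h, sym: word[pos:end]}):
            return True
    return False


def sample(pattern, a='a', b='b'):
    """The sample F(pattern) of Definition 2, restricted to occurring variables."""
    def apply(image):
        return ''.join(s if isinstance(s, str) else image(s) for s in pattern)
    words = {apply(lambda v: a), apply(lambda v: b)}
    for i in {s for s in pattern if isinstance(s, int)}:
        words.add(apply(lambda v, i=i: a if v == i else b))
    return words


def inclusion_by_sample(alpha, beta):
    """True/False if L_NE(alpha) <= L_NE(beta) is settled, None otherwise."""
    if len(alpha) < len(beta):
        return False          # Lemma 10: a word of length |alpha| is too short
    if len(alpha) > len(beta):
        return None           # Lemma 9 does not apply
    return all(ne_match(beta, w) for w in sample(alpha))
```

For instance, the test reports that $L_{NE}(x_1 a x_1) \subseteq L_{NE}(x_1 a x_2)$ holds and that the converse inclusion fails, witnessed by the sample word `aab`.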
Lemma 11. For a given NE-pattern $\alpha$ the language $L_{NE}(\alpha^*) \setminus \{\lambda\}$ has the same shortest words as the language $L_{NE}(\alpha)$.

Proof. Trivial, by the definition of the Kleene star. □

Lemma 12. Let $\alpha$, $\beta$ be two patterns. Then $L_{NE}(\alpha^*) = L_{NE}(\beta^*)$ if and only if $L_{NE}(\alpha) = L_{NE}(\beta)$.

Proof. That $L_{NE}(\alpha) = L_{NE}(\beta)$ implies $L_{NE}(\alpha^*) = L_{NE}(\beta^*)$ is evident. Let us prove the other direction. Assume that $L_{NE}(\alpha^*) = L_{NE}(\beta^*)$, and let $|\alpha| = n$ and $|\beta| = m$. If $n < m$, then $L_{NE}(\alpha)$, and so $L_{NE}(\alpha^*)$, contains a word of length $n$, but neither $L_{NE}(\beta)$ nor $L_{NE}(\beta^*)$ does. This is a contradiction. The same argument applies in the symmetric case $n > m$. So in the remaining case we have $n = m$. We argue indirectly: assume that $L_{NE}(\alpha) \ne L_{NE}(\beta)$. Let $w$ be a shortest word in the symmetric difference of $L_{NE}(\alpha)$ and $L_{NE}(\beta)$. Without loss of generality we can assume that $w \in L_{NE}(\alpha)$ (and $w \notin L_{NE}(\beta)$). Then, since $w \in L_{NE}(\alpha^*) = L_{NE}(\beta^*)$, we can write $w$ in the form $w = v_1 v_2 \ldots v_i$ for some $i \in \mathbb{N}$, $i > 1$,
in such a way that $v_j \in L_{NE}(\beta)$ for every $j = 1, \ldots, i$. Since $w$ is a shortest word in the symmetric difference of $L_{NE}(\alpha)$ and $L_{NE}(\beta)$, all the $v_j$ are also in $L_{NE}(\alpha)$. Now let $F(\alpha)$ be the sample for the NE-pattern $\alpha$ defined in Definition 2. Then $F(\alpha)$ contains only words of length $n$. But we know that the shortest word $w$ which is not in $L_{NE}(\beta)$ (but in $L_{NE}(\alpha)$) has length at least $2n$ (since $i$ is at least 2). All shorter words of $L_{NE}(\alpha)$ and $L_{NE}(\beta)$ are the same. Hence $F(\alpha) \subseteq L_{NE}(\beta)$, so the conditions of Lemma 9 hold, and therefore $L_{NE}(\alpha) \subseteq L_{NE}(\beta)$. This contradicts our assumption that $w$ is in $L_{NE}(\alpha)$ but not in $L_{NE}(\beta)$. □

Lemma 13. Let $\alpha$ be an NE-pattern such that there is a variable $x$ in $\alpha$ which has exactly one occurrence, and there is no (other) variable which occurs both in front of and behind the variable $x$. Then $L_{NE}(\alpha) = L_{NE}(\alpha^*) \setminus \{\lambda\}$.

Proof. Let $\alpha$ and $x$ be as defined above. Then $x$ cuts the pattern $\alpha$ into two independent parts: $\alpha = \alpha_1 x \alpha_2$ (where $x$ occurs neither in $\alpha_1$ nor in $\alpha_2$, and there is no variable $y$ occurring in both $\alpha_1$ and $\alpha_2$). Assume that the non-empty word $w$ is in $L_{NE}(\alpha^*)$; we will show that $w \in L_{NE}(\alpha)$ also holds. Since $w \ne \lambda$, there are an $i \in \mathbb{N}$ and a set of morphisms $H = \{h_1, \ldots, h_i\}$ such that $w = h_1(\alpha_1) h_1(x) h_1(\alpha_2) \ldots h_i(\alpha_1) h_i(x) h_i(\alpha_2)$. We construct a morphism $h'$ to complete the proof of this part. Since $\alpha_1$ and $\alpha_2$ have no common variable, let $h'$ map the variables of $\alpha_1$ in the same way as $h_1$ and the variables of $\alpha_2$ in the same way as $h_i$. Then $h'(\alpha_1) = h_1(\alpha_1)$ and $h'(\alpha_2) = h_i(\alpha_2)$. Further, let $h'(x) = h_1(x) h_1(\alpha_2) \ldots h_i(\alpha_1) h_i(x)$ (this is possible because $x$ is independent of $\alpha_1$ and $\alpha_2$). Now $h'(\alpha) = w$. The other direction of the inclusion, $L_{NE}(\alpha) \subseteq L_{NE}(\alpha^*) \setminus \{\lambda\}$, is trivial. □

Note that $x$ can be the first or the last symbol of the pattern $\alpha$. In these cases the proof works without $\alpha_1$ or $\alpha_2$, respectively.
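The morphism $h'$ built in the proof of Lemma 13 can be written out directly. The following Python sketch is an illustration only: it assumes (not from the paper) that patterns are lists with strings for terminals and integers for variables, that each morphism $h_k$ is given as a dictionary from variable indices to non-empty strings, and the names `apply` and `merge` are hypothetical.

```python
def apply(h, pattern):
    """Image of a pattern under the substitution h (terminals are fixed)."""
    return ''.join(s if isinstance(s, str) else h[s] for s in pattern)


def merge(alpha1, x, alpha2, hs):
    """Build h' with h'(alpha1 x alpha2) = h_1(alpha) ... h_i(alpha).

    Requires the conditions of Lemma 13: x occurs in neither part and the
    two parts share no variable.
    """
    vars1 = {s for s in alpha1 if isinstance(s, int)}
    vars2 = {s for s in alpha2 if isinstance(s, int)}
    assert x not in vars1 | vars2 and not (vars1 & vars2)

    alpha = alpha1 + [x] + alpha2
    first, last = hs[0], hs[-1]
    w = ''.join(apply(h, alpha) for h in hs)

    h_new = {v: first[v] for v in vars1}        # alpha1 is mapped as by h_1
    h_new.update({v: last[v] for v in vars2})   # alpha2 is mapped as by h_i
    prefix, suffix = apply(first, alpha1), apply(last, alpha2)
    # h'(x) is everything between h_1(alpha1) and h_i(alpha2).
    h_new[x] = w[len(prefix):len(w) - len(suffix)]
    return h_new, w
```

For $\alpha = a\,x_1\,x_2\,x_3\,b\,x_3$ with $x = x_2$, two substitutions are merged into one whose image of $\alpha$ is the concatenation of the two original images; the image of $x_2$ absorbs the middle part of the word and is non-empty because $h_1(x)$ is.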
The following lemma is used in [6] to define a normal form for regular expressions.

Lemma 14. The expression $(x \vee y)^*$ can be written in the form $(x^* y^*)^*$.

Proof. It is easy to show by induction. □
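The identity of Lemma 14 can be checked empirically for constant words with Python's standard `re` module. The sketch below is not a proof and covers only the special case in which $x$ and $y$ are fixed strings; the function name `agree` and the bound `max_len` are hypothetical choices. It compares the two expressions on every word over a small alphabet up to a given length.

```python
import re
from itertools import product


def agree(x, y, alphabet='ab', max_len=6):
    """Check that (x|y)* and (x*y*)* accept the same words up to max_len."""
    ex, ey = re.escape(x), re.escape(y)
    union_star = re.compile(f'(?:{ex}|{ey})*')            # (x v y)*
    star_concat = re.compile(f'(?:(?:{ex})*(?:{ey})*)*')  # (x* y*)*
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = ''.join(letters)
            left = union_star.fullmatch(w) is not None
            right = star_concat.fullmatch(w) is not None
            if left != right:
                return False
    return True
```

A call such as `agree('ab', 'b')` exercises the identity on all 127 words over $\{a, b\}$ of length at most 6.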
Lemma 15. Let $\alpha$ and $\beta$ be two NE-patterns. Then the following problems are equivalent.
$L_{NE}(\alpha^*) = L_{NE}((\alpha^* \beta^*)^*)$,
$L_{NE}(\alpha^*) = L_{NE}((\alpha \vee \beta)^*)$,
$L_{NE}(\alpha^*) \supseteq L_{NE}(\beta^*)$,
$L_{NE}(\alpha^*) \supseteq L_{NE}(\beta)$.

Proof. The first and second statements are equivalent by Lemma 14. Assume that the third one is false. Then there is a word $w$ which is in $L_{NE}(\beta^*)$ but not in $L_{NE}(\alpha^*)$; this word is also in $L_{NE}((\alpha \vee \beta)^*)$, which contradicts the second statement. To prove the other direction, assume that the third statement is true. Then for each word $w \in L_{NE}(\beta^*)$ there is an $i \in \mathbb{N}$ such that $w \in L_{NE}(\alpha^i)$. If a word $w_0$ is in $L_{NE}((\alpha \vee \beta)^*)$, then it can be written in the form $w_0 = w_{a_1} w_{b_1} w_{a_2} w_{b_2} \ldots w_{a_n} w_{b_n}$ for some $n \in \mathbb{N}$, where each $w_{a_k} \in L_{NE}(\alpha)^*$ and each $w_{b_k} \in L_{NE}(\beta)^*$ $(1 \le k \le n)$. But each $w_{b_k}$ is also in $L_{NE}(\alpha^i)$ for some $i \in \mathbb{N}$, therefore $L_{NE}(\alpha^*) \supseteq L_{NE}((\alpha \vee \beta)^*)$. The form of the expression in the second statement implies that $L_{NE}(\alpha^*) \subseteq L_{NE}((\alpha \vee \beta)^*)$. So the third statement is equivalent to the first two. The fourth statement is an immediate consequence of the third one. To prove the other direction, assume that the fourth statement is true but the third is not. Then there is a word $w \in L_{NE}(\beta^*)$ which is not in $L_{NE}(\alpha^*)$. This word is a concatenation of words of $L_{NE}(\beta)$, each of which is in $L_{NE}(\alpha^*)$ by the fourth statement, so $w \in L_{NE}(\alpha^*)$, which is a contradiction. This finishes the proof. □

Theorem 16. Over an alphabet $\Sigma$ there are star-patterns $\gamma$ and $\delta$ such that the equality problem $L_{NE}(\gamma) = L_{NE}(\delta)$ is undecidable if any of the following three inclusion problems is undecidable for NE-patterns $\alpha_1$, $\alpha_2$, $\beta$:
$L_{NE}(\beta) \subseteq L_{NE}(\alpha_1) \cdot \Sigma^+$,
$L_{NE}(\beta) \subseteq \Sigma^+ \cdot L_{NE}(\alpha_2)$,
$L_{NE}(\beta) \subseteq L_{NE}(\alpha_1) \cdot \Sigma^+ \cdot L_{NE}(\alpha_2)$.

Proof. Let $\alpha$ be an NE-pattern satisfying the conditions of Lemma 13, and let $\beta$ be an arbitrary NE-pattern. Let $\gamma = \alpha^*$ and $\delta = (\alpha^* \beta^*)^*$. The question is whether $L_{NE}(\gamma) = L_{NE}(\delta)$,
which is the same as $L_{NE}(\alpha^*) = L_{NE}((\alpha^* \beta^*)^*)$. By Lemma 14 the right-hand side can be written in the form $L_{NE}((\alpha \vee \beta)^*)$, and by Lemma 15 the decision problem is equivalent to the following problem:
$$L_{NE}(\alpha^*) \supseteq L_{NE}(\beta).$$
But the conditions of Lemma 13 hold for $\alpha$, therefore $L_{NE}(\alpha^*) = L_{NE}(\alpha) \cup \{\lambda\}$, and so our equivalence is decidable if and only if
$$L_{NE}(\alpha) \supseteq L_{NE}(\beta)$$
is decidable. To prove the third case we write $\alpha$ in the form $\alpha = \alpha_1 x \alpha_2$ (as in the proof of Lemma 13), where $\alpha_1$, $\alpha_2$ are independent arbitrary NE-patterns. Since our patterns are non-erasing, $L_{NE}(x)$ is the language $\Sigma^+$. Note that this theorem holds for every $\alpha$ satisfying the conditions of Lemma 13, so, as special cases, one of the patterns $\alpha_1$, $\alpha_2$ can be removed (as the first two inclusion problems state). □

In the following theorem we use the backreference form of extended regular expressions, as used in [3].

Theorem 17. Over any finite alphabet $\Sigma = \{a_1, \ldots, a_n\}$, for each E-pattern $\alpha$ there is an NE star-pattern $\gamma$ such that $L_{NE}(\gamma) = L_E(\alpha)$.

Proof. We give a construction. We replace each variable of the E-pattern $\alpha$ by a star-pattern (including backreferences). Substitute each variable at its first occurrence by $((a_1^* \ldots a_n^*)^*)$. We use two pairs of brackets for each variable, and we refer to the $i$th variable as \2i. (The $(2i)$th pair of brackets contains exactly the part of the star-pattern corresponding to the $i$th variable.) At all other occurrences of the variable we use that backreference. It is easy to show that the resulting NE star-pattern generates exactly the same language as the given erasing pattern. □

Corollary 18. If the equivalence problem for E-patterns turns out to be undecidable, then the equivalence problem for star-patterns is also undecidable.

5. Conclusion
In this paper we showed some facts about inclusion properties and equivalences of star-pattern languages. Although the equivalence of NE-patterns is easily decidable, the equivalence of star-patterns is more complicated. We proved that this equivalence is related to very special inclusion problems. We hope that this result will help to solve this decidability problem. Some other questions arise, such as:
Is it true that the language families generated by E star-patterns and NE star-patterns are the same?
Let $\alpha$ and $\beta$ be two patterns. Does $L_{NE}(\alpha) \subseteq L_{NE}(\beta)$ imply $L_E(\alpha) \subseteq L_E(\beta)$?
Is Lemma 13 reversible, i.e., can the “if” be replaced by “if and only if”?

Acknowledgements
Useful discussions with Kai Salomaa are gratefully acknowledged.
References
[1] D. Angluin, Finding patterns common to a set of strings, J. Comput. System Sci. 21 (1980) 46–62.
[2] C. Campeanu, K. Salomaa, S. Yu, A formal study of practical regular expressions, Internat. J. Found. Comput. Sci. 14 (2003) 1007–1018.
[3] J.E.F. Friedl, Mastering Regular Expressions (Powerful Techniques for Perl and Other Tools), O’Reilly, 1997.
[4] T. Jiang, A. Salomaa, K. Salomaa, S. Yu, Decision problems for patterns, J. Comput. System Sci. 50 (1995) 53–63.
[5] K.P. Jantke, Polynomial time inference of general pattern languages, in: STACS 84 (Paris), in: Lecture Notes in Comput. Sci., vol. 166, Springer, Berlin, 1984, pp. 314–325.
[6] B. Nagy, A normal form for regular expressions, in: Supplemental Papers for DLT’04, Auckland, New Zealand, CDMTCS-252, 2004, pp. 51–60.
[7] K. Salomaa, Patterns, in: C. Martin-Vide, V. Mitrana, Gh. Paun (Eds.), Formal Languages and Applications, Springer, Berlin, 2004, pp. 367–380.