A note on inferring uniquely terminating code languages




Information Processing Letters 70 (1999) 217–222

J.D. Emerald, K.G. Subramanian ∗, D.G. Thomas

Department of Mathematics, Madras Christian College, Tambaram, Madras 600 059, India

Received 3 May 1999; received in revised form 28 May 1999

Communicated by L. Boasson

Abstract

A subclass of linear languages, called uniquely terminating code linear languages (utCLL), is introduced, motivated by the study of Mäkinen (1997), and an inference algorithm for utCLL is provided. The algorithm is extended to a corresponding subclass of equal matrix languages. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Formal languages; Grammatical inference; Identification in the limit

∗ Corresponding author. Email: [email protected].

0020-0190/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0020-0190(99)00065-4

1. Introduction

The problem of inferring formal languages has been extensively studied in the literature; see [9] for an excellent survey. It is known [4] that the class of regular languages is not inferable in the limit from positive data, but several inference algorithms for incomparable subclasses of regular languages have been introduced [3,5,7,8,10,12]. Mäkinen [6] has recently introduced another subclass of regular languages, called uniquely terminating regular languages, and has inferred this class from positive data using an elegant technique of associating a trie structure with sample words.

In this note, we consider code linear grammars introduced in [3]. A set C of words over an alphabet Σ is a code if every word over Σ has at most one factorization over C [2]. There is a linear time algorithm, due to Sardinas and Patterson [2], to check whether a set of words is a code or not. A code C is called a bifix code if no word in C is either a proper prefix or a proper suffix of another word in C. In fact, in [3], motivated by Yokomori [14], the notion of a code deterministic automaton is introduced by requiring the labels of the edges of the automaton to be words constituting a code set. The problem of learning regular languages accepted by these automata has been considered in [3].

A code linear grammar (CLG) is a construct G = (N, Σ, C, P, S), where N is the set of nonterminals, Σ is the set of terminals, C is a code set, S ∈ N is the start symbol and P is the set of rules of the form A → uBv, A → w, where u, v, w ∈ C. A code linear grammar is deterministic if (i) whenever A → uBv and A → uDv are in P, then B = D, and (ii) each x ∈ C occurs in the right side of at most one terminal rule in P. The language generated by a code linear grammar is called a code linear language (CLL). A code linear language L is deterministic if there exists a deterministic CLG generating L, and L is a bifix language if there exists a CLG with a bifix code set generating L.

Extending the notion of a transformation introduced in [10], the problem of identification in the limit from positive data of linear languages generated by deterministic bifix code linear grammars is reduced in [3] to the problem of inferring in the limit a corresponding code automaton. We give an outline of the idea here. The transformation σ_e is defined as follows: if x = uyv is a word over a set C of code words, where u, v ∈ C, then σ_e(x) = uvσ_e(y), and σ_e(x) = x if x ∈ C. The inverse transformation σ_e⁻¹ is defined by σ_e⁻¹(x) = x if x ∈ C, and σ_e⁻¹(uvx) = uσ_e⁻¹(x)v, where u, v ∈ C and x is a word over C. For a language L, σ_e(L) = {σ_e(x) | x ∈ L}.

Given a linear language L generated by a deterministic bifix code linear grammar G = (N, Σ, C, P, S), a code deterministic automaton (CDA) A to accept σ_e(L) is formed as follows: A = (Q, C, δ, q0, F), where Q = N ∪ {qf}, qf ∉ N, q0 = S, F = {qf}, and δ is defined by the following steps: (i) B ∈ δ(A, uv) if A → uBv ∈ P, u, v ∈ C, and (ii) δ(A, w) = {qf} if A → w ∈ P, w ∈ C. Conversely, if a CDA A of the type constructed above is given, then a deterministic code linear grammar can be formed to generate σ_e⁻¹(L(A)), where L(A) is the language accepted by A.

Here we introduce a subclass of linear languages called uniquely terminating code linear languages, motivated by the study of Mäkinen [6]. It is of interest to note that the technique of trie structure considered in [6] can be generalized by associating words with the nodes of the trie structure. This enables us to provide an inference procedure for uniquely terminating code linear languages in the framework of identification in the limit from positive data. This class of linear languages is incomparable with the class of deterministic bifix code linear languages.

In addition, we exhibit the power of the trie structure by considering analogous inference procedures for code equal matrix languages, a subclass of equal matrix languages of Siromoney [11]. In fact, the class of equal matrix languages includes all regular languages and intersects the context-free and context-sensitive language families. Inference algorithms for equal matrix grammars have also been considered in [13], but the techniques are different here.
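The transformation σ_e and its inverse, as outlined in the Introduction, can be illustrated with a minimal Python sketch. It assumes that a word is already given as its list of code-word factors (so the factorization over C is taken for granted); the function names `sigma_e` and `sigma_e_inverse` are illustrative and do not come from the paper.

```python
def sigma_e(factors):
    """Apply sigma_e to a word given as its list of code-word factors."""
    if len(factors) % 2 == 0:
        raise ValueError("expected an odd number of factors: pairs plus a middle word")
    if len(factors) == 1:            # x in C: sigma_e(x) = x
        return list(factors)
    u, *y, v = factors               # x = u y v with u, v in C
    return [u, v] + sigma_e(y)       # sigma_e(x) = u v sigma_e(y)


def sigma_e_inverse(factors):
    """Undo sigma_e on a word given as its list of code-word factors."""
    if len(factors) % 2 == 0:
        raise ValueError("expected an odd number of factors: pairs plus a middle word")
    if len(factors) == 1:            # x in C: inverse is the identity
        return list(factors)
    u, v, *x = factors               # input has the form u v x
    return [u] + sigma_e_inverse(x) + [v]


# The word abb.bba.cd.cb.ba over the code set of Example 2.
word = ["abb", "bba", "cd", "cb", "ba"]
image = sigma_e(word)
print(image)
print(sigma_e_inverse(image) == word)
```

Working on lists of factors rather than raw strings keeps the sketch independent of how the factorization is obtained.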

2. Uniquely terminating code linear languages

We define uniquely terminating code linear grammars and languages. The rules of a code linear grammar are of the form A → uBv and A → w, where u, v, w ∈ C.

Definition 1. A code linear grammar G = (V, Σ, C, P, S) is uniquely terminating if the productions in P fulfill the following conditions for each nonterminal A in G:
(i) A → u1Bv1 and A → u2Dv2, with u1, u2, v1, v2 ∈ C, imply B = D whenever u1 = u2 and v1 = v2.
(ii) Each nonterminal has exactly one terminating production, and the terminal words appearing in the right hand sides of terminating productions are all different.

Remark. Conditions (i) and (ii) in Definition 1 ensure that a terminal word derived by the grammar has a unique derivation.

A code linear language L is said to be uniquely terminating (utCLL) if there is a uniquely terminating code linear grammar generating L.

Example 2. The code linear grammar with code set {abb, ba, bba, cb, ca, aab, db, cd} and productions S → abbSba, S → bbaAcb, A → caAaab, S → db, A → cd is uniquely terminating. It generates the language

{(abb)^n bba (ca)^m cd (aab)^m cb (ba)^n | n, m ≥ 0} ∪ {(abb)^n db (ba)^n | n ≥ 0}.

Proposition 3. The class of deterministic bifix code linear languages is incomparable with the class of uniquely terminating code linear languages.

Proof. The proposition is a consequence of the following observations:
(1) The language L2 = {(ru)^n r (dr)^n | n ≥ 0}, generated by the rules S → ru S dr, S → r, is a uniquely terminating code linear language but not a deterministic bifix code linear language.
(2) The language L3 = {(ru)^n ll (dr)^n | n ≥ 1} ∪ {(ru)^n rr (dr)^n | n ≥ 1}, generated by the rules S → ru A dr, A → ru A dr, A → rr, A → ll, is a deterministic bifix code linear language but not a uniquely terminating code linear language. □

We give an inference procedure for uniquely terminating code linear languages.

2.1. The inference algorithm for utCLL

A set of sample words {w1, w2, ..., wn} from a uniquely terminating code linear language and the code set are given as input to the inference algorithm. The construction of the sample words is as in [6]. These are words in the language derived from the start symbol by applying the rules not more than twice.

The idea of the algorithm is as follows. The sample words are factorized using the code set. In the factorized word, the factors from the beginning and the end are paired up and stored as labels of nodes in a trie structure. Each node of the trie (excluding the leaves) has an associated nonterminal. If the pair labeling a node is (u, v), the node has associated nonterminal B and its parent node has nonterminal A, then the rule A → uBv is added to the resulting grammar. If a node associated with A has a child labeled by the terminal code word w, then we add the production A → w to the resulting grammar, which is given as output. We now state the algorithm.

Algorithm LC
1. For each word in the sample do
   1a. Factorize the word using the code set C.
   1b. If the factorized word is of the form u1 u2 ... um w vm vm−1 ... v1, with ui, vi (1 ≤ i ≤ m) and w ∈ C, the words ui, vi (1 ≤ i ≤ m) are paired and stored as labels of nodes in the extended trie structure, with the middle word w being the label of a terminal node; insert associated nonterminals to the new nodes.
   1c. If there are nodes having children which are equally labeled final nodes, then merge the associated nonterminals of these nodes.
2. The productions of the resulting grammar are given as output from the extended trie:
   2a. If a node labeled by the pair (u, v) has the nonterminal B associated with it, and A is the nonterminal associated with its parent, then include A → uBv in the resulting grammar.
   2b. If a node with associated nonterminal A has a child which is a final node labeled with w, then add the production A → w to the resulting grammar.

Consider the sample set S = {db, bbacdcb, bbacacdaabcb, abbdbba, abbbbacdcbba, abbbbacacdaabcbba} for the uniquely terminating code linear grammar in Example 2, with code set C = {abb, bba, aab, ba, cb, ca, db, cd}. The extended trie structures before and after merging the nonterminals are shown in Figs. 1a and 1b, respectively. The conjectured rules are S → abbSba, S → bbaAcb, A → caAaab, S → db, A → cd.

Fig. 1a.

Fig. 1b.

Remark 1. Just as in the case of uniquely terminating code linear grammars, we can define uniquely terminating code regular grammars (utcr-grammars) by requiring the terminal words in the rules of the form A → uB, A → v of a uniquely terminating regular grammar to constitute a code set. The inference procedure using a trie structure can be given similarly. This class includes the uniquely terminating regular languages of Mäkinen [6].
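The steps of Algorithm LC can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the trie is kept in two dictionaries, step 1c is done with a small union-find, and the names `factorize`, `infer_lc` and the tuple encoding of rules are assumptions made for this sketch. The factorization is done by plain backtracking, which is adequate for small samples only.

```python
import string


def factorize(word, code):
    """All factorizations of `word` over `code` (non-empty code words), by backtracking."""
    if not word:
        return [[]]
    return [[c] + rest
            for c in code if word.startswith(c)
            for rest in factorize(word[len(c):], code)]


def infer_lc(sample, code, start="S"):
    # Extended trie: children[node] maps a pair (u, v) to a child node,
    # finals[node] holds the middle words w of its final-node children.
    children, finals = {0: {}}, {0: set()}
    for word in sample:                                   # step 1
        facts = factorize(word, code)                     # step 1a
        if len(facts) != 1 or len(facts[0]) % 2 == 0:
            raise ValueError(f"{word!r} is not of the form u1..um w vm..v1")
        f = facts[0]
        m = len(f) // 2
        node = 0
        for i in range(m):                                # step 1b
            pair = (f[i], f[-1 - i])
            if pair not in children[node]:
                new = len(children)
                children[new], finals[new] = {}, set()
                children[node][pair] = new
            node = children[node][pair]
        finals[node].add(f[m])

    # Step 1c: merge nodes whose final-node children carry the same word.
    parent = list(range(len(children)))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    owner = {}
    for node in range(len(children)):
        for w in finals[node]:
            parent[find(node)] = find(owner.setdefault(w, node))

    # Name the merged classes: the root class gets `start`, the others A, B, ...
    # (at most 25 further classes in this sketch).
    letters = iter(c for c in string.ascii_uppercase if c != start)
    names = {find(0): start}

    def nt(node):
        root = find(node)
        if root not in names:
            names[root] = next(letters)
        return names[root]

    rules = set()                                         # step 2
    for node in range(len(children)):
        for (u, v), child in children[node].items():      # step 2a: A -> u B v
            rules.add((nt(node), u, nt(child), v))
        for w in finals[node]:                            # step 2b: A -> w
            rules.add((nt(node), w))
    return rules


sample = ["db", "bbacdcb", "bbacacdaabcb", "abbdbba", "abbbbacdcbba", "abbbbacacdaabcbba"]
code = ["abb", "bba", "aab", "ba", "cb", "ca", "db", "cd"]
for rule in sorted(infer_lc(sample, code)):
    print(rule)
```

On the sample set and code set given above for Example 2, the sketch returns the five conjectured rules, encoded as `(A, u, B, v)` for A → uBv and `(A, w)` for A → w.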

Consider the sample set S = {aab, cabbc, cabbccb, cababbc, cabbccbbbc} and code set C = {aab, cab, bc, cb, ab, cbb} for the utcr-grammar with the following productions: S → cabA, A → abA, A → bcB, B → cbbA, A → bc, B → cb, S → aab. The sample words are factored and stored in an extended trie structure. Fig. 2 shows the ultimate extended trie structure. The required production rules of the grammar, eventually found in the sequence of conjectures, are S → cabA, A → abA, A → bcB, B → cbbA, A → bc, B → cb, S → aab.

Fig. 2.

Remark 2. The code set is given as input to the inference algorithm for utCLL. This amounts to giving some additional information on the grammatical structure [9] for the inference algorithm. This can be avoided if we take the code set to be a bifix code, in which case the factorization of the sample words can be obtained by comparing their prefixes and suffixes, as done in the identification of code regular languages discussed in [3].
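Since the algorithm takes the code set as input, it can be useful to check the code and bifix properties of a candidate set. The following Python sketch does this under stated assumptions: `is_code` is a straightforward set-based version of the Sardinas–Patterson test (not tuned for linear time), and `is_bifix` checks the prefix and suffix condition directly. The function names are illustrative, and the word sets {ru, r, dr} and {ru, dr, rr, ll} are read off the rules given for L2 and L3 in the proof of Proposition 3.

```python
def dangling(left, right):
    """Non-empty words w with a + w = b for some a in `left`, b in `right`."""
    return {b[len(a):] for a in left for b in right
            if len(b) > len(a) and b.startswith(a)}


def is_code(words):
    """Sardinas-Patterson test: True if every word has at most one factorization."""
    code = set(words)
    if "" in code:
        return False
    seen = set()
    frontier = dangling(code, code)
    while frontier:
        if frontier & code:          # a dangling suffix is itself a code word
            return False
        seen |= frontier
        frontier = (dangling(code, frontier) | dangling(frontier, code)) - seen
    return True


def is_bifix(words):
    """True if no word is a proper prefix or a proper suffix of another word."""
    code = set(words)
    return not any(a != b and (b.startswith(a) or b.endswith(a))
                   for a in code for b in code)


example2 = ["abb", "ba", "bba", "cb", "ca", "aab", "db", "cd"]
print(is_code(example2), is_bifix(example2))
print(is_code(["ru", "r", "dr"]), is_bifix(["ru", "r", "dr"]))
print(is_code(["ru", "dr", "rr", "ll"]), is_bifix(["ru", "dr", "rr", "ll"]))
```

The loop terminates because every dangling suffix is a suffix of some code word, so only finitely many can ever be added to `seen`.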

3. Uniquely terminating code k-equal matrix languages

We now define code equal matrix grammars, which form a subclass of equal matrix grammars (EMG) [11]. We allow only a single initial matrix rule in normal form, namely S → A1 ... Ak, for a fixed k. We consider nonterminal equal matrix rules of the form [A1 → f1 A1, ..., Ak → fk Ak], where f1, ..., fk are nonempty strings of terminals only. Let Ri,j denote the ith right-linear rule of the jth matrix rule (1 ≤ i ≤ k, 1 ≤ j ≤ m), where m is the number of rules, and let T(Ri,j) denote the terminal string on the right side of the ith right-linear rule of the jth matrix. That is, if Ai → fij Bi is a rule, then T(Ri,j) = fij.


Definition 4. A code k-equal matrix grammar (Ck-EMG) G = (V, C, S, P) is a k-equal matrix grammar [11] satisfying the following conditions on P:
(i) The initial matrix rule is of the form S → A1 ... Ak.
(ii) The nonterminal and terminal matrix rules together satisfy the code property. That is, if there are m matrix rules (nonterminal and terminal), then:
   (a) For a fixed i (1 ≤ i ≤ k), T(Ri,r) ≠ T(Ri,s) for any r, s with r ≠ s, 1 ≤ r ≤ m and 1 ≤ s ≤ m, where T(Ri,r), T(Ri,s) ∈ C.
   (b) For a fixed j (1 ≤ j ≤ m), T(Ri,j) need not be distinct for different i, 1 ≤ i ≤ k, where T(Ri,j) ∈ C.

Note that in the above definition, C is a finite code set over Σ. A code k-equal matrix language (Ck-EML) is a language generated by a code k-equal matrix grammar (Ck-EMG).

Definition 5. A code k-equal matrix grammar G = (V, C, S, P) is uniquely terminating if the productions in P satisfy the following conditions for each k-tuple ⟨A1, A2, ..., Ak⟩ of nonterminals in G:
(i) There are no two distinct nonterminal matrix rules [A1 → u1 B1, ..., Ak → uk Bk] and [A1 → u1 C1, ..., Ak → uk Ck], where ui ∈ C* and Bi ≠ Ci for each i.
(ii) Each k-tuple ⟨A1, A2, ..., Ak⟩ of nonterminals has exactly one terminal matrix rule, and there are no two distinct terminal matrix rules [B1 → u1, ..., Bk → uk] and [C1 → u1, ..., Ck → uk], where ui ∈ C and Bi ≠ Ci for each i.

A code k-equal matrix language L is said to be uniquely terminating (utCk-EML) if there is a uniquely terminating code k-equal matrix grammar (utCk-EMG) generating L.

3.1. The inference algorithm for utCk-EML

A set of sample words {w1, w2, ..., wn} from a uniquely terminating code k-equal matrix language and the code set C are given as input to the inference algorithm. The construction of the sample set is similar to the construction given for utCLL.
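One possible reading of the conditions in Definitions 4(ii)(a) and 5 can be written as a small Python check. This is a sketch under assumptions: the encoding of a matrix rule as tuples (left-hand k-tuple, k-tuple of terminal strings, right-hand k-tuple) and the name `is_uniquely_terminating` are choices made here, and condition 5(i) is applied in the slightly stronger form "same left-hand tuple and same strings imply the same right-hand tuple". The data are the matrix rules of the grammar in Example 6.

```python
from collections import Counter


def is_uniquely_terminating(nonterminal_rules, terminal_rules):
    """Check one reading of Definitions 4(ii)(a) and 5.

    nonterminal_rules: list of (lhs, strings, rhs), each entry a k-tuple
    terminal_rules:    list of (lhs, strings), each entry a k-tuple
    """
    # Definition 4(ii)(a): in each component i, the terminal strings of
    # different matrix rules are pairwise different.
    all_strings = [r[1] for r in nonterminal_rules] + [r[1] for r in terminal_rules]
    for column in zip(*all_strings):
        if len(set(column)) != len(column):
            return False
    # Definition 5(i): the same left-hand tuple with the same strings must
    # lead to the same tuple of nonterminals (already implied by the check
    # above for distinct rules; kept to mirror the definition).
    target = {}
    for lhs, strings, rhs in nonterminal_rules:
        if target.setdefault((lhs, strings), rhs) != rhs:
            return False
    # Definition 5(ii): exactly one terminal matrix rule per k-tuple ...
    tuples = {r[0] for r in nonterminal_rules} | {r[2] for r in nonterminal_rules}
    tuples |= {r[0] for r in terminal_rules}
    count = Counter(lhs for lhs, _ in terminal_rules)
    if any(count[t] != 1 for t in tuples):
        return False
    # ... and no two terminal matrix rules with the same strings.
    strings = [s for _, s in terminal_rules]
    return len(set(strings)) == len(strings)


A, B = ("A1", "A2"), ("B1", "B2")
nonterminal = [(A, ("ab", "ba"), B), (B, ("ac", "ca"), B)]
terminal = [(A, ("cba", "cab")), (B, ("aab", "bcc"))]
print(is_uniquely_terminating(nonterminal, terminal))
```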


Algorithm UTCk-EMG
1. For each word w in the sample do
   1a. Factorize the word using the code set C and partition it into k equal parts as w = α1 α2 ... αk, where each αi ∈ C+.
   1b. If the αi's are of the form α1 = u1 u2 ... un; α2 = v1 v2 ... vn; ...; αk = w1 w2 ... wn; then the k-tuples (u1, v1, ..., w1), (u2, v2, ..., w2), ..., (un, vn, ..., wn) are formed and stored as labels of nodes in the trie structure, with the root node labeled by some k-tuple of nonterminals, i.e., (S1, S2, ..., Sk), and (un, vn, ..., wn) being the label of the terminal node. Insert associated k-tuples of nonterminals to the new nodes.
   1c. If there are nodes having children which are equally labeled final nodes, then merge the associated nonterminals of these nodes.
2. The productions of the resulting grammar are given as output from the extended trie:
   2a. If the root node is labeled by the k-tuple (A1, A2, ..., Ak), then the initial matrix rule [S → A1 A2 ... Ak] is included in the grammar.
   2b. If a node labeled by the k-tuple (uj, vj, ..., wj) has the nonterminal tuple (B1, B2, ..., Bk) associated with it and (A1, A2, ..., Ak) is the nonterminal tuple associated with its parent, then we include the matrix rule [A1 → uj B1, A2 → vj B2, ..., Ak → wj Bk] in the resulting grammar.
   2c. If the child of a node associated with (C1, C2, ..., Ck) is a final node labeled with (un, vn, ..., wn), then we add the production [C1 → un, C2 → vn, ..., Ck → wn] to the resulting grammar.

Remark. The inference algorithm for a uniquely terminating code k-equal matrix language differs from that for a utCLL by requiring the sample words to be partitioned into k equal parts. Each part is factorized over the code set, and k-tuples of words are obtained by choosing the leftmost factor from each part. These k-tuples label the nodes of the trie structure, whereas in the inference algorithm for the linear case the factors from the beginning and the end are paired in the factorized words, and these pairs serve as the labels of the nodes.
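A minimal Python sketch of Algorithm UTCk-EMG follows, run on the sample words of Example 6. It rests on assumptions that are not spelled out in the text: "k equal parts" is taken to mean parts with the same number of code-word factors, the root tuple is named (A1, ..., Ak) and further merged classes B, C, ..., and the names `factorize`, `infer_utck_emg` and the dictionary layout of the result are illustrative only.

```python
import string


def factorize(word, code):
    """All factorizations of `word` over `code` (non-empty code words), by backtracking."""
    if not word:
        return [[]]
    return [[c] + rest
            for c in code if word.startswith(c)
            for rest in factorize(word[len(c):], code)]


def infer_utck_emg(sample, code, k):
    # children[node] maps a k-tuple label to a child node,
    # finals[node] holds the k-tuples labeling its final-node children.
    children, finals = {0: {}}, {0: set()}
    for word in sample:                                        # step 1
        facts = factorize(word, code)                          # step 1a
        if len(facts) != 1 or not facts[0] or len(facts[0]) % k:
            raise ValueError(f"cannot split {word!r} into {k} equal parts")
        f = facts[0]
        n = len(f) // k
        # Step 1b: the j-th tuple takes the j-th factor of every part.
        tuples = list(zip(*(f[i * n:(i + 1) * n] for i in range(k))))
        node = 0
        for label in tuples[:-1]:
            if label not in children[node]:
                new = len(children)
                children[new], finals[new] = {}, set()
                children[node][label] = new
            node = children[node][label]
        finals[node].add(tuples[-1])

    # Step 1c: merge nodes whose final-node children carry the same k-tuple.
    parent = list(range(len(children)))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    owner = {}
    for node in range(len(children)):
        for label in finals[node]:
            parent[find(node)] = find(owner.setdefault(label, node))

    letters = iter(string.ascii_uppercase.replace("S", ""))   # A, B, ... without S
    names = {}

    def nt(node):
        root = find(node)
        if root not in names:
            names[root] = next(letters)
        return tuple(f"{names[root]}{i}" for i in range(1, k + 1))

    grammar = {"initial": ("S", nt(0)), "nonterminal": set(), "terminal": set()}  # 2a
    for node in range(len(children)):
        for label, child in children[node].items():            # step 2b
            grammar["nonterminal"].add((nt(node), label, nt(child)))
        for label in finals[node]:                             # step 2c
            grammar["terminal"].add((nt(node), label))
    return grammar


sample = ["cbacab", "abaabbabcc", "abacaabbacabcc"]
code = ["ab", "ba", "ac", "ca", "cba", "cab", "aab", "bcc"]
print(infer_utck_emg(sample, code, k=2))
```

For these three sample words the sketch yields the initial rule, two nonterminal matrix rules and two terminal matrix rules listed in Example 6.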


Example 6. Consider the uniquely terminating code 2-equal matrix grammar G = (V, C, S, P), where the code set C = {ab, ba, ac, ca, cba, cab, aab, bcc}, V = {a, b, c, d, S, (A1, A2), (B1, B2)}, and P consists of the rules [S → A1 A2], [A1 → abB1, A2 → baB2], [B1 → acB1, B2 → caB2], [A1 → cba, A2 → cab] and [B1 → aab, B2 → bcc]. The language generated is

L = {cbacab} ∪ {ab (ac)^n aab ba (ca)^n bcc | n ≥ 0}.

Here the sample words are {cbacab, abaabbabcc, abacaabbacabcc}. The extended trie structure which infers this grammar using these sample words is given in Fig. 3.

Fig. 3.

Acknowledgements

The authors are grateful to the anonymous referee for valuable and detailed comments, which helped to improve the paper. The authors thank the Indo-French Centre for Promotion of Advanced Research (Centre Franco-Indien pour la Promotion de la Recherche Avancée), for financial support for the work done under the project No. 1602-1.

References

[1] D. Angluin, Inductive inference of formal languages from positive data, Inform. Control 45 (1980) 117–135.
[2] J. Berstel, D. Perrin, Theory of Codes, Academic Press, New York, 1985.
[3] J.D. Emerald, K.G. Subramanian, D.G. Thomas, Learning code regular and code linear languages, in: Proceedings of ICGI-96, Lecture Notes in Artificial Intelligence, Vol. 1147, Springer, Berlin, 1996, pp. 211–221.
[4] E.M. Gold, Language identification in the limit, Inform. Control 10 (1967) 447–474.
[5] E. Mäkinen, The grammatical inference problem for the Szilard languages of linear grammars, Inform. Process. Lett. 36 (1990) 203–206.
[6] E. Mäkinen, Inferring uniquely terminating regular languages from positive data, Inform. Process. Lett. 62 (1997) 57–60.
[7] E. Mäkinen, Inferring regular languages by merging nonterminals, Internat. J. Comput. Math. 70 (1999) 601–616.
[8] V. Radhakrishnan, G. Nagaraja, Inference of even linear grammars and its application to picture description languages, Pattern Recogn. 21 (1) (1988) 55–62.
[9] Y. Sakakibara, Recent advances of grammatical inference, Theoret. Comput. Sci. 185 (1997) 15–45.
[10] J.M. Sempere, P. Garcia, A characterization of even linear languages and its application to the learning problem, in: Lecture Notes in Artificial Intelligence, Vol. 862, Springer, Berlin, 1994, pp. 38–44.
[11] R. Siromoney, On equal matrix languages, Inform. Control 14 (1969) 135–151.
[12] Y. Takada, Grammatical inference of even linear languages based on control sets, Inform. Process. Lett. 28 (4) (1988) 193–199.
[13] Y. Takada, Learning equal matrix grammars and multitape automata with structural information, in: Proceedings of the 1st Workshop on Algorithmic Learning Theory, 1990.
[14] T. Yokomori, On polynomial-time learnability in the limit of strictly deterministic automata, Machine Learning 19 (1995) 153–179.