Applied Soft Computing 10 (2010) 44–52
Learning context-free grammar using improved tabular representation
Olgierd Unold*, Marcin Jaworski
The Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Poland
Article history: Received 27 December 2007; Received in revised form 29 May 2009; Accepted 16 June 2009; Available online 25 June 2009.
Abstract

This paper describes an improved version of the TBL algorithm [Y. Sakakibara, Learning context-free grammars using tabular representations, Pattern Recognition 38 (2005) 1372–1383; Y. Sakakibara, M. Kondo, GA-based learning of context-free grammars using tabular representations, in: Proceedings of 16th International Conference in Machine Learning (ICML-99), Morgan-Kaufmann, Los Altos, CA, 1999] for the inference of context-free grammars in Chomsky Normal Form. The TBL algorithm is a novel approach to overcome the hardness of learning context-free grammars from examples without structural information available. The algorithm represents the grammars by parsing tables, and thanks to this tabular representation the problem of grammar learning is reduced to the problem of partitioning the set of nonterminals. A genetic algorithm is used to solve the NP-hard partitioning problem. In the improved version a modified fitness function and a new specialized delete operator are applied. Computer simulations have been performed to determine the efficiency of the improved tabular representation. The set of experiments has been divided into two groups: in the first one, learning of the unknown context-free grammar proceeds without any extra information about the grammatical structure; in the second one, learning is supported by partial knowledge of the structure. In each of the performed experiments the influence of the partition block size in the initial population and of the population size on grammar induction has been tested. The new version of the TBL algorithm has been experimentally shown to be less vulnerable to block size and population size than the standard one, and to find solutions faster.
© 2009 Elsevier B.V. All rights reserved.
Keywords: Grammatical inference; Context-free grammar; Partitioning problem; Genetic algorithm; CYK algorithm
1. Introduction

Grammatical inference (GI, grammar induction or language learning) is the process of learning a grammar from training data. Machine learning of grammars (especially context-free grammars) finds many applications in speech recognition [5,38], computational biology [33,1,18,36,43], XML technology [14,4,8], compression [28], and computational linguistics [2,3,7,44]. GI is the gradual construction of a model based on a finite set of sample expressions. In general, the training set may contain both positive and negative examples from the language under study. If only positive examples are available, no language class other than the finite cardinality languages is learnable [16]. It has been proved that deterministic finite automata are the largest class that can be efficiently learned by provably converging algorithms. There is no context-free grammatical inference theory which provably converges if the language defined by the grammar is infinite [41]. Building algorithms that learn context-free grammars (CFGs) is one of the open and crucial problems in grammatical inference [11].
* Corresponding author. E-mail addresses: [email protected] (O. Unold), [email protected] (M. Jaworski).
doi:10.1016/j.asoc.2009.06.006
The approaches taken have been to provide learning algorithms with more helpful information, such as negative examples or structural information; to formulate alternative representations of CFGs; to restrict attention to subclasses of context-free languages that do not contain all finite languages; and to use Bayesian methods [23]. Many researchers have attacked the problem of grammar induction by using evolutionary methods to evolve (stochastic) CFGs or equivalent pushdown automata (for references see [42]), but mostly for artificial languages such as bracket languages and palindromes. In this paper we introduce some improvements to the TBL algorithm dedicated to the inference of CFGs in Chomsky Normal Form (CNF). The TBL (tabular representation) algorithm was proposed by Sakakibara and Kondo [34]; in [32] Sakakibara analyzed some theoretical foundations of the algorithm. The TBL algorithm is a novel approach to overcome the difficulties of learning context-free grammars from examples without structural information available. The algorithm uses a new representation method, called tabular representation, which consists of a table-like data structure similar to the parse table used in the Cocke–Younger–Kasami parsing algorithm. This tabular representation efficiently represents a hypothesis space of context-free grammars. More precisely, the tabular representations use the dynamic programming technique to efficiently store the exponential number of all possible grammatical structures in a table of
polynomial size. By employing this representation method, the problem of learning CFGs from examples is reduced to the problem of partitioning the set of nonterminals. The TBL algorithm uses a genetic algorithm to solve the NP-hard partitioning problem. The proposed improved version of the TBL algorithm outperforms the standard one: it is less vulnerable to block size and population size, and it is able to find the solution faster.

The remainder of this paper is organised as follows. In Section 2 we review CFG learning methods, especially heuristics based on genetic algorithms. Next we give a brief introduction to context-free grammar induction in Section 3, followed by a presentation of the tabular representation for CFG induction in Section 4. Section 5 describes the improved version of the TBL algorithm used in the experimental studies. In Section 6 we present the results of experiments which show the distinct improvements in the performance of the algorithm. A summary of the empirical results, followed by a short discussion of possible future research directions, concludes this paper.

2. Related works

The state of the art of learning context-free grammars is mainly made of negative results. Conversely, the practical implications of context-free languages are important. Research has therefore concentrated essentially in three directions: studying sub-classes of linear languages, learning from structured data, and developing heuristics based on genetic algorithms. Many researchers consider linearity to be a necessary condition for learning to be possible. Successful results were obtained for even linear grammars [39,40,37,25,21] and deterministic linear grammars [12]. Another approach used in CFG induction is to learn from structured data. Learning of CFGs is supported by bracketed data [30], queries [31], regular distributions [22,6,29], and negative information [15]. Learning from tree grammars is worth mentioning as well [20,17]. Grammatical inference can be considered a difficult optimization task. Evolutionary approaches are probabilistic search techniques especially suited for search and optimization problems where the problem space is large, complex and contains possible difficulties such as high dimensionality, discontinuity and noise. The impact of different representations of grammars was explored in [47], where experimental results showed that an evolutionary algorithm using standard context-free grammars (BNF) outperforms those using Greibach Normal Form, Chomsky Normal Form (CNF) or bit-string representations [24]. In [13] a genetic approach was used for inferring grammars of regular languages only, but the experiments proved that the genetic approach is comparable to other grammatical inference approaches. Notably, Dupont [13] applied a GA to solve the partitioning problem for the set of states in the prefix-tree automaton. The TBL algorithm is motivated by Dupont's work and extends his method to learn CFGs by introducing the tabular representation. The Synapse system [27] is motivated by the tabular method and is based on an inductive CYK algorithm, incremental learning, and search for rule sets. Another genetic-based system that uses CYK to parse examples from the learned context-free language is the Grammar-based Classifier System (GCS) [42]. GCS is a Michigan-style learning classifier system, in which a single classifier represents one production rule, and the population of classifiers represents the whole CFG.
GCS was successfully applied to learn formal languages, natural languages [44], and biosequences [43]. In [10] genetic programming was applied to infer grammars from programs written in simple domain-specific languages.
3. Context-free grammar parsing

3.1. Context-free grammar

A context-free grammar is a quadruple G = (Σ, V, R, S), where Σ is a finite set of terminal symbols called the alphabet, and V is a finite set of nonterminal symbols such that Σ ∩ V = ∅. R is the set of production rules of the form W → β, where W ∈ V and β ∈ (V ∪ Σ)⁺. S ∈ V is a special nonterminal symbol called the start symbol. All derivations start from the start symbol S. A derivation is a sequence of rewritings of a string over (V ∪ Σ)⁺ using the production rules of the CFG G. In each step of a derivation a nonterminal symbol of the string is selected and replaced by the string on the right-hand side of a production rule. For example, for the CFG G = (Σ, V, R, S) with Σ = {a, b}, V = {A, B}, R = {A → a, A → BA, A → AB, A → AA, B → b}, S = A, a derivation can have the form A → AB → ABB → AABB → aABB → aABb → aAbb → aabb. The derivation ends when no nonterminal symbols are left in the string. The language generated by the CFG is denoted L(G) = {x ∈ Σ* | S ⇒*_G x}, where x ranges over the words of the language L. A word x is a string of terminal symbols derived using the production rules of the CFG G.

A CFG G = (Σ, V, R, S) is in Chomsky Normal Form if every production rule in R has the form A → BC or A → a, where A, B, C ∈ V and a ∈ Σ. A CFG describes a specific language L(G) and can be used to recognize the words (strings) that are members of the language L(G). To decide which words are members of the language L(G) and which are not, a parsing algorithm is needed. This problem, called the membership problem, can be solved using the CYK algorithm.

Algorithm 1. CYK
Input: a₁ … aₙ is the input word w, and G is a CFG in CNF. P[n, n] is an array of sets of nonterminals; these are initially empty.
for each i = 1 … n
  for every nonterminal A such that A → aᵢ, add A to P[i, 1]
for l = 2 … n
  for o = 1 … n − l + 1
    for p = 1 … l − 1
      for all A → T₁T₂ where T₁ is a member of P[o, p] and T₂ is a member of P[o + p, l − p], add A to P[o, l]
Output: if P[1, n] contains the start symbol S then parsing succeeds

3.2. Cocke–Younger–Kasami algorithm

The Cocke–Younger–Kasami (CYK) algorithm (discovered independently by Cocke [9], Younger [48], and Kasami [19]) uses a bottom-up dynamic programming technique to solve the membership problem in polynomial time. The parsing algorithm requires the CFG to be in CNF, that is, the right-hand side of each rule must expand either to two nonterminals or to a single terminal. Restricting a grammar to CNF does not lead to any loss of expressiveness, since any CFG can be converted into a corresponding CNF grammar that accepts exactly the same set of strings as the original grammar. It is also possible to extend the CYK algorithm to handle some context-free grammars which are not in CNF [27].
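To make Algorithm 1 concrete, the following Python sketch implements the CYK procedure and runs it on the example grammar of Section 3.1 and the word ababb analysed in Fig. 1; the helper names and data structures (rule sets, a table of Python sets) are illustrative choices of ours, not the authors' implementation.

```python
# Example CFG in CNF from Section 3.1: A -> a, B -> b, A -> BA | AB | AA; start symbol A.
UNIT_RULES = {("A", "a"), ("B", "b")}                                # rules of the form A -> a
BINARY_RULES = {("A", "B", "A"), ("A", "A", "B"), ("A", "A", "A")}   # rules of the form A -> BC

def cyk(word, unit_rules, binary_rules, start):
    n = len(word)
    # table[o][l] = nonterminals deriving the substring of length l starting at position o (0-based)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, a in enumerate(word):                       # substrings of length 1
        table[i][1] = {A for (A, t) in unit_rules if t == a}
    for l in range(2, n + 1):                          # substring length
        for o in range(n - l + 1):                     # start position
            for p in range(1, l):                      # split point
                for (A, B, C) in binary_rules:
                    if B in table[o][p] and C in table[o + p][l - p]:
                        table[o][l].add(A)
    return start in table[0][n], table

accepted, table = cyk("ababb", UNIT_RULES, BINARY_RULES, start="A")
print(accepted)   # True: ababb is derivable from A, cf. the parse table of Fig. 1
```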
Fig. 1. CYK parse table for the input word ababb.
The CYK parser considers which nonterminals can be used to derive substrings of the input, beginning with shorter strings and moving up to longer strings. The algorithm starts with strings of length one, matching the single characters in the input string against unit productions in the grammar. It then considers all substrings of length two, looking for productions with right-hand side elements that match the two characters of the substring. This process continues up to longer strings. At each substring length, it searches through all the levels below, that is, all the matches of the substrings of the substring (see Algorithm 1).

The following example shows how the CYK algorithm works. Let us assume we have the input word w = ababb and the CFG G = (Σ, V, R, S) from Section 3.1, where Σ = {a, b}, V = {A, B}, R = {A → a, A → BA, A → AB, A → AA, B → b}. In the first step the algorithm fills all the entries p_{i,1} = {A | A → aᵢ is in R} of the table P, which refer to substrings of length 1 (such substrings can be derived only by production rules of the form A → a). In the following steps the algorithm uses the previously computed p_{i,j′} for 1 ≤ i ≤ n and 1 ≤ j′ < j to compute p_{i,j} by examining the pairs of entries (p_{i,1}, p_{i+1,j−1}), (p_{i,2}, p_{i+2,j−2}), …, (p_{i,j−1}, p_{i+j−1,1}). Here CYK uses rules of the type A → BC: if B is in p_{i,k} and C is in p_{i+k,j−k} for some 1 ≤ k < j and the production A → BC is in R, then A is put in p_{i,j}. Following this procedure the CYK algorithm creates the parse table P. Fig. 1 shows how CYK fills the table P for w = ababb.

4. Tabular representation for CFG induction

The target of context-free grammar induction is to find a grammar, i.e. a set of rules and a set of nonterminals, using the learning set. It is a very complex problem; finding the grammar structure is especially hard, since the number of possible grammatical structures grows exponentially with the length of the examples. For example, all possible grammatical structures of the input word aabb and its tabular representation are shown in Fig. 2. The tabular representation (TBL) introduced by Sakakibara [32,35,34] is a new approach to evolving CFGs. This tabular method uses a parse table, similar to the one used by the CYK algorithm, to represent hypotheses of CFGs. The simple CYK table stores information about every possible derivation tree for a given string with a given grammar. Sakakibara's tabular representation also collects information about all possible grammatical structures for the given string (Fig. 2b). This gives an opportunity to gather potential CFGs for every sentence in the learning set.
In the next step the TBL algorithm combines all the grammars from the created tabular representations by reducing the number of nonterminals and merging distinct ones. Finding the smallest grammar that is consistent with all the examples in the learning set is surely the most difficult part of the learning task. Since the problem is NP-hard, Sakakibara uses a genetic algorithm with roulette-wheel selection, structural crossover, structural mutation, and a special delete mutation that removes a randomly selected element with a small probability. The fitness function prefers CFGs with a minimal number of nonterminals that are consistent with a maximal number of sentences in the learning set.

4.1. TBL algorithm

The algorithm proposed by Sakakibara uses the dynamic programming method to store the exponential number of all possible grammatical structures in a CYK-like table of polynomial size. More precisely, for the input string w = a₁…aₙ the tabular representation is a triangular table T(w) in which each element t_{i,j}, for 1 ≤ i ≤ n and 2 ≤ j ≤ n − i + 1, contains a set {X_{i,j,1}, …, X_{i,j,j−1}} of j − 1 nonterminals. The tabular representation T(w) is used to create a primitive CFG G = (Σ, V, R, S) in CNF, where:

V = {X_{i,j,k} | 1 ≤ i ≤ n, 1 ≤ j ≤ n − i + 1, 1 ≤ k < j} ∪ {X_{i,1,1} | 1 ≤ i ≤ n},
R = {X_{i,j,k} → X_{i,k,l₁} X_{i+k,j−k,l₂} | 1 ≤ i ≤ n, 1 ≤ j ≤ n − i + 1, 1 ≤ k < j, 1 ≤ l₁ < k, 1 ≤ l₂ < j − k}
    ∪ {X_{i,1,1} → aᵢ | 1 ≤ i ≤ n} ∪ {S → X_{1,n,k} | 1 ≤ k ≤ n − 1}.

Each primitive G can generate the whole grammar topology (i.e. all possible grammatical structures) on the input word w. Learning a CFG relies on constructing the tabular representation T(w) for every positive example w in the given sample, merging the nonterminals in G(T(w)) to be consistent with the learning set, and reducing the number of nonterminals in G(T(w)) (Algorithm 2). In order to take the union of CFGs (step 3 of Algorithm 2), each nonterminal of a primitive G is renamed, so that every G^w(T(w)) has distinct nonterminal symbols (step 2). Thanks to the use of tabular representations, in the TBL algorithm the grammar induction problem is reduced to partitioning the set of nonterminals in the primitive CFGs (step 4).
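As an illustration of the construction above, the sketch below enumerates the cells of T(w) and the rules of the primitive grammar for a given word; the cell dictionary and the X_i_j_k string names are assumptions made only for this example, and the nonterminals are enumerated cell by cell rather than through the index bounds of the formal definition.

```python
def primitive_grammar(word):
    """Sketch of the primitive CFG G(T(w)) derived from the tabular representation T(w).

    Cell (i, j) of T(w) holds the nonterminals for the substring of length j starting
    at position i (1-based, as in the text)."""
    n = len(word)
    cell = {(i, 1): [f"X_{i}_1_1"] for i in range(1, n + 1)}
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            cell[(i, j)] = [f"X_{i}_{j}_{k}" for k in range(1, j)]     # j - 1 nonterminals

    rules = [(cell[(i, 1)][0], word[i - 1]) for i in range(1, n + 1)]  # X_{i,1,1} -> a_i
    for (i, j), names in cell.items():
        if j == 1:
            continue
        for k in range(1, j):              # X_{i,j,k} encodes a split after k symbols
            for left in cell[(i, k)]:
                for right in cell[(i + k, j - k)]:
                    rules.append((names[k - 1], (left, right)))
    rules += [("S", (x,)) for x in cell.get((1, n), [])]               # S -> X_{1,n,k}
    return cell, rules

cells, rules = primitive_grammar("aabb")
print(len(cells), len(rules))   # number of table cells and productions for the word aabb
```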
Algorithm 2. TBL algorithm
Input: U is the learning set, U = U⁺ ∪ U⁻. U⁺ is the set of positive examples, whereas U⁻ is the set of negative examples.
1. Construct T(w) for each w in U⁺
Fig. 2. All possible grammatical structures of the word aabb (a) and its tabular representation (b).
2. Derive the primitive G(T(w)) and G^w(T(w)) for each w in U⁺
3. Create the union of all primitive CFGs G(T(U⁺)) = ∪_{w ∈ U⁺} G^w(T(w))
4. Find a smallest partition π such that G(T(U⁺))/π is consistent with U
Output: the resulting CFG G(T(U⁺))/π

4.2. Partitioning the set of nonterminals

Partitioning the set of nonterminals is an example of the well-known set partition problem, which is concerned with partitioning a collection of sets into P sub-collections such that the maximum number of elements among all the sub-collections is minimized. The problem is NP-hard, even when P = 2 and the cardinality of each set is 2. If π is a partition of V, then for any element x ∈ V there is a unique element of π containing x, denoted K(x, π) and called the block of π containing x. Let π be a partition of the set V of nonterminals of a CFG G = (Σ, V, R, S). The CFG G/π = (Σ, V′, R′, S′) induced by π from G is defined as follows:

V′ = π,
R′ = {K(A, π) → K(B, π) K(C, π) | A → BC ∈ R} ∪ {K(A, π) → a | A → a ∈ R},
S′ = K(S, π).

TBL uses a genetic algorithm to solve the set partition problem. The use of a GA for the minimal partition problem is thoroughly studied in [26].
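The induced grammar G/π is mechanical to compute once a partition is fixed; the following sketch (an illustrative helper of ours, not the paper's code) maps each nonterminal to its block K(x, π) and rewrites the rules accordingly.

```python
def quotient_grammar(rules, start, partition):
    """Build G/pi from the definition above: every nonterminal x is replaced by the
    index of its block K(x, pi).

    `rules` is a list of (lhs, rhs) pairs where rhs is either a terminal string or a
    tuple of nonterminals; `partition` is a list of sets of nonterminals that covers
    every nonterminal, including the start symbol."""
    block_of = {x: idx for idx, block in enumerate(partition) for x in block}
    merged = set()
    for lhs, rhs in rules:
        if isinstance(rhs, tuple):         # A -> B C   becomes   K(A) -> K(B) K(C)
            merged.add((block_of[lhs], tuple(block_of[x] for x in rhs)))
        else:                              # A -> a     becomes   K(A) -> a
            merged.add((block_of[lhs], rhs))
    return merged, block_of[start]         # rule set R' and start block S' = K(S, pi)
```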
4.3. Genetic algorithm

The TBL algorithm employs so-called group-number encoding [46] to encode partitions of nonterminals. Partitions are represented as strings of integers p = (i₁, i₂, …, iₙ), where i_j ∈ {1, 2, …, k} indicates the block number assigned to element j. For example, the string p = (1 3 4 3 6 3 3 4 5 2 6 4) represents the partition

π_p : {1}, {10}, {2 4 6 7}, {3 8 12}, {9}, {5 11}, with |π_p| = 6.

Group-number encoding has its own genetic operators: structural crossover and structural mutation.

Structural crossover. For a randomly selected block, this operator takes the union of the corresponding blocks of both parent partitions. For example, after interchanging the parent chromosomes p₁ = (1 1 2 3 4 1 2 3 2 2 5 3), encoding the partition π_{p₁} : {1 2 6}, {3 7 9 10}, {4 8 12}, {5}, {11}, and p₂ = (1 1 2 4 2 3 1 2 2 3 3 5), encoding the partition π_{p₂} : {1 2 7}, {3 5 8 9}, {6 10 11}, {4}, {12}, at the second block, the following offspring is produced: p₂′ = (1 1 2 4 2 3 2 2 2 2 3 5), encoding the partition π_{p₂′} : {1 2}, {3 5 7 8 9 10}, {6 11}, {4}, {12}.

Structural mutation. This operator replaces one integer in the string with another integer. For example, after mutating the chromosome p₂′ = (1 1 2 4 2 3 2 2 2 2 3 5) at position 9 we obtain p₂″ = (1 1 2 4 2 3 2 2 5 2 3 5), representing the partition π_{p₂″} : {1 2}, {3 5 7 8 10}, {6 11}, {4}, {9 12}.
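The encoding and the two structural operators can be expressed compactly in Python; the sketch below reproduces the example chromosomes above (the random choices are fixed to the values used in the text, and the helper names are ours, not the original implementation).

```python
import random

def decode(chromosome):
    """Group-number encoding: element j belongs to the block labelled chromosome[j-1]."""
    blocks = {}
    for element, block in enumerate(chromosome, start=1):
        blocks.setdefault(block, set()).add(element)
    return list(blocks.values())

def structural_crossover(p1, p2, block=None):
    """Copy a randomly selected block of p1 into p2; the chosen block of the offspring
    becomes the union of the corresponding blocks of both parents."""
    block = block if block is not None else random.choice(sorted(set(p1)))
    child = list(p2)
    for pos, b in enumerate(p1):
        if b == block:
            child[pos] = block
    return child

def structural_mutation(chromosome, position=None, new_block=None):
    """Replace one integer (block label) of the chromosome with another integer."""
    position = position if position is not None else random.randrange(len(chromosome))
    new_block = new_block if new_block is not None else random.randint(1, max(chromosome))
    child = list(chromosome)
    child[position] = new_block
    return child

p1 = [1, 1, 2, 3, 4, 1, 2, 3, 2, 2, 5, 3]
p2 = [1, 1, 2, 4, 2, 3, 1, 2, 2, 3, 3, 5]
child = structural_crossover(p1, p2, block=2)                 # (1 1 2 4 2 3 2 2 2 2 3 5)
mutant = structural_mutation(child, position=8, new_block=5)  # position 9 in 1-based terms
print(decode(child), decode(mutant))
```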
The TBL algorithm also introduces a special delete operator. This operator replaces a single integer in the string with a special value (−1), meaning that the corresponding nonterminal is no longer used. The delete operator reduces the number of nonterminals in the learned grammar.

Fitness function. The fitness function is a number from the interval [0, 1]. It was designed to both maximize the number of accepted positive examples (function f₁) and minimize the number of nonterminals (function f₂). If a CFG accepts any negative example, its fitness is set to 0 so that it is immediately discarded. Let p be an individual and π_p the partition represented by p. U⁺ is the set of positive examples, U⁻ is the set of negative examples. C₁ and C₂ are constants that can be selected experimentally. The fitness function is defined as follows:

f₁(p) = |{w ∈ U⁺ : w ∈ L(G(T(U⁺))/π_p)}| / |U⁺|        (1)
f₂(p) = 1 / |π_p|        (2)
f(p) = 0, if any negative example is generated;
f(p) = (C₁·f₁(p) + C₂·f₂(p)) / (C₁ + C₂), otherwise.        (3)
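Written out as code, the standard fitness of formulas (1)–(3) might look as follows; the membership test `accepts` (for example, CYK run with the merged grammar G(T(U⁺))/π) is assumed rather than implemented here.

```python
def standard_fitness(partition, u_plus, u_minus, accepts, c1=0.5, c2=0.5):
    """Formulas (1)-(3): reward accepted positive examples and small partitions,
    but return 0 as soon as any negative example is generated."""
    if any(accepts(partition, w) for w in u_minus):
        return 0.0                                                     # Eq. (3), negative case
    f1 = sum(accepts(partition, w) for w in u_plus) / len(u_plus)     # Eq. (1)
    f2 = 1.0 / len(partition)                                         # Eq. (2): |pi_p| blocks
    return (c1 * f1 + c2 * f2) / (c1 + c2)                            # Eq. (3), otherwise
```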
5. Improved TBL algorithm

Although the TBL algorithm is effective, we have encountered some difficulties using it during experiments. The main problem was the fitness function, which was not always precise enough to describe the differences between individuals. When partitions with small blocks were used (average size 1–5) the algorithm usually could find a solution, but when partitions with larger blocks were used, the TBL algorithm often could not find solutions, because all individuals in the population had fitness 0 and it started to work completely randomly. To solve this problem we propose some modifications. The proposed improvements affect only the genetic algorithm solving the partitioning problem.

First, we want to discuss the concept of the partition block size in the initial population. Partitions used by the GA group the integers that represent nonterminals into blocks. For example, the partition π_p above consists of 6 blocks of size 1–4:

π_p : {1}, {10}, {2 4 6 7}, {3 8 12}, {9}, {5 11}.

The range of partition block sizes in the initial population has a great influence on the evolution process. If blocks are small (size 1–4), it is easier to find solutions (partitions) π generating a CFG G(T(U⁺))/π which does not parse any negative examples. On the other hand, creating partitions that contain bigger blocks should ensure that the reduction of the number of nonterminals in G(T(U⁺))/π (the number of blocks in π) to the optimal level proceeds quicker (this process is slower than finding a solution that does not accept any negative example), because partitions in the initial generation contain fewer blocks. The main problem with using bigger blocks is the low percentage of successful runs if the standard fitness function is used. Usually in such a case all individuals (partitions) create CFGs that accept at least one negative example, and then the whole population has the same fitness, equal to 0. To overcome these difficulties, we have used the modifications to the GA listed below.

Initial population block size manipulation. The influence of block size on the evolution process was tested during the experiments. Block size is a range of possible sizes (for example 1–5), not a single number as in Sakakibara's work.

Block delete specialized operator. This operator is similar to the special delete operator described earlier, but instead of removing one nonterminal from the partition it deletes all nonterminals that are placed in some randomly selected block.
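Both GA-level changes are simple to state in code. The sketch below shows one plausible way to draw an initial partition with block sizes from a given range, together with the block delete operator; these helpers reflect our reading of the description above, not the original implementation.

```python
import random

def random_partition(n_nonterminals, min_block, max_block):
    """Initial individual whose block sizes are drawn from [min_block, max_block]:
    fill blocks up to a random target size, then shuffle the block labels over
    positions (the last block may come out smaller)."""
    chromosome, block, filled = [], 1, 0
    target = random.randint(min_block, max_block)
    for _ in range(n_nonterminals):
        if filled == target:
            block, filled = block + 1, 0
            target = random.randint(min_block, max_block)
        chromosome.append(block)
        filled += 1
    random.shuffle(chromosome)        # block sizes are preserved, membership is randomized
    return chromosome

def block_delete(chromosome):
    """Block delete operator: mark every nonterminal of one randomly selected block
    as unused (-1), instead of deleting a single nonterminal."""
    candidates = [b for b in set(chromosome) if b != -1]
    if not candidates:
        return list(chromosome)
    victim = random.choice(candidates)
    return [-1 if b == victim else b for b in chromosome]
```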
Modified fitness function. In order to solve the problems with the standard fitness function when block sizes are bigger (usually all individuals have fitness 0), we propose a modified fitness function that does not give all individuals accepting negative examples the same fitness value. To be able to distinguish between better and worse solutions that still accept some negative example, we give such solutions a negative fitness value (a value in the range [−1, 0)). The fitness range [−1, 1] is used only to make the comparison of the modified fitness function with the standard one easier; it can of course be normalized to the range [0, 1].

f₁(p) = |{w ∈ U⁺ : w ∈ L(G(T(U⁺))/π_p)}| / |U⁺|        (4)
f₂(p) = 1 / |π_p|        (5)
f₃(p) = |{w ∈ U⁻ : w ∈ L(G(T(U⁺))/π_p)}| / |U⁻|        (6)
f(p) = −(C₃·f₃(p) + C₄·(1 − f₁(p))), if any negative example is generated;
f(p) = (C₁·f₁(p) + C₂·f₂(p)) / (C₁ + C₂), otherwise.        (7)
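In code, the modified fitness of formulas (4)–(7) differs from the standard one only in the negative branch; the same assumed `accepts` predicate is used as in the earlier sketch.

```python
def modified_fitness(partition, u_plus, u_minus, accepts,
                     c1=0.5, c2=0.5, c3=0.5, c4=0.5):
    """Formulas (4)-(7): individuals that generate negative examples get a graded
    negative score instead of a flat 0."""
    f1 = sum(accepts(partition, w) for w in u_plus) / len(u_plus)      # Eq. (4)
    f3 = sum(accepts(partition, w) for w in u_minus) / len(u_minus)    # Eq. (6)
    if f3 > 0:                                     # some negative example is generated
        return -(c3 * f3 + c4 * (1.0 - f1))        # Eq. (7), negative branch, in [-1, 0)
    f2 = 1.0 / len(partition)                                          # Eq. (5)
    return (c1 * f1 + c2 * f2) / (c1 + c2)                             # Eq. (7), positive branch
```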
The aim of the modified fitness function f(p) is to find a context-free grammar consistent with the whole learning set, i.e. with both positive and negative sentences. The proposed formula extends Eq. (3) by the function f₃(p), used to estimate the fitness when any negative example is generated by G(T(U⁺))/π_p. In contrast to the standard fitness function, formula (7) does not punish the inferred grammar with a flat zero for generating even one negative sentence, but sums the normalized number of generated negative examples (term C₃·f₃(p)) and the normalized number of non-accepted positive examples (term C₄·(1 − f₁(p))), taken with a negative sign. The introduced constants C₃ and C₄ (set to 0.5) allow the influence of both normalized numbers f₁(p) and f₃(p) on the fitness to be calibrated in the case of a negative example. Note that the new fitness function also uses Occam's razor, as Sakakibara's formula does, to minimize the number of nonterminals (term f₂(p)). As a result, for C₁ = C₂ = 0.5 the fitness function for positive examples becomes 0.5·(f₁(p) + f₂(p)). Note that the maximal value of f₁(p) equals 1 (an induced grammar which accepts all positive examples and rejects all negative examples). It means that the fitness function is able to reach the value of 1 only if f₂(p) = 1, i.e. there is only one nonterminal (!). In a realistic case the fitness of a CFG consistent with the whole learning set equals 0.5 plus half the inverse of the number of nonterminals.

6. Experimental results

In the experiments with the standard and the improved TBL algorithm the same languages were learned as used by Sakakibara in [32]. Both target languages, i.e. L₁ = {aⁿbⁿcᵐ | n, m ≥ 1} and L₂ = {acⁿ ∪ bcⁿ | n ≥ 1}, contain all the typical grammatical structures characteristic of CFGs: the branch structure A → BC, the parenthesis structure A → aBb, and the iteration structure A → aA. In every experiment (and for each block size range) induction was repeated until 10 successful runs were obtained. We consider a run successful when the algorithm finds a grammar G(T(U⁺))/π which accepts all positive examples and no negative example within 1100 generations. The successful runs value (Succ. runs) is defined as (number of successful runs / overall number of runs) × 100%. Average generations (Avg. gen.) indicates the average number of generations needed to reach 100% fitness.
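For concreteness, the positive part of such a learning set can be enumerated directly; the sketch below lists all words of L₁ = {aⁿbⁿcᵐ | n, m ≥ 1} up to a given length (the negative examples were chosen by hand in the paper, so only their number is known and they are not reproduced here; L₂ can be enumerated analogously).

```python
def l1_words(max_len):
    """All words a^n b^n c^m (n, m >= 1) of length at most max_len."""
    words = []
    for n in range(1, max_len // 2 + 1):
        for m in range(1, max_len - 2 * n + 1):
            words.append("a" * n + "b" * n + "c" * m)
    return sorted(words, key=len)

print(l1_words(6))   # ['abc', 'abcc', 'abccc', 'aabbc', 'abcccc', 'aabbcc']
```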
Fig. 3. Learning over all examples from language L = {aⁿbⁿcᵐ | n, m ≥ 1} up to length 6. Initial population block size: 1–5 (a) and 3–7 (b).
The set of experiments was divided into two groups: in the first one, learning of the unknown context-free grammar proceeds without any extra information about the grammatical structure; in the second one, learning is supported by partial knowledge of the structure.

6.1. Learning without prior knowledge of structure

We start by comparing the standard and improved TBL algorithms in the task of learning an unknown CFG over examples without any structural information.

Experiment 1. As positive examples, the words from the language L = {aⁿbⁿcᵐ | n, m ≥ 1} up to length 6 were used, together with a set of 15 negative examples of length up to 6. The population size was 100 and C₁ = C₂ = C₃ = C₄ = 0.5. The results are shown in Fig. 3 and summarised in Table 1. The standard version of the fitness function is depicted as standard, the improved version as modified, and the new block delete operator as BDO.

Experiment 2. As positive examples, the words from the language L = {acⁿ ∪ bcⁿ | n ≥ 1} up to length 5 were used, together with a set of 25 negative examples of length up to 5. The population size was 20 and C₁ = C₂ = C₃ = C₄ = 0.5. The results are shown in Fig. 4 and Table 2.

Comparing the experiment results in Tables 1 and 2, we can notice that the block size has a big influence on the learning algorithm's ability to find a solution within 1100 generations in the case of the standard TBL. As shown in the results of experiment 1, if the block size in the initial population is 1–5, then the TBL algorithm finds a solution within 1100 generations in 100% of cases, but if the block size is 3–7 it
Table 1
Learning over examples from language L = {aⁿbⁿcᵐ | n, m ≥ 1}.

Block size   Standard TBL               Improved TBL
             Succ. runs   Avg. gen.     Succ. runs   Avg. gen.
1–5          100          950           100          950
3–7          91           800           100          600
finds a solution in only 91% of cases. In experiment 2 the difference is even greater: if the block size is 1–5 the algorithm finds a solution in 77% of cases, and if the block size is 2–6, it reaches 100% fitness in only 9.5% of cases. Block size has no influence on the number of successful runs when the modified fitness function is used, and the improved TBL always finds a solution. It means that the standard fitness is really effective only if very small blocks (size 5 or smaller) are used to build the initial population. If the modified fitness function is used instead, the initial population can be built using much larger blocks.

The experiment results also show that the population size has a big influence on the number of successful runs when the standard fitness function is used. In both experiments the size of the problem was similar, but the results (the number of successful runs for the standard TBL) were much better in experiment 1, where the population size was bigger (100), than in experiment 2 (20). The experiment results also proved that the population size has no major influence on algorithm effectiveness if the modified fitness function is used.

The use of larger blocks also improves the quality of solutions and the algorithm speed. Fig. 4 shows more detailed results of experiment 2. If we use smaller blocks the results are
Fig. 4. Learning over all examples from language L = {acⁿ ∪ bcⁿ | n ≥ 1} up to length 5. Initial population block size: 1–5 (a) and 2–6 (b).
Table 2
Learning over examples from language L = {acⁿ ∪ bcⁿ | n ≥ 1}.

Block size   Standard TBL               Improved TBL
             Succ. runs   Avg. gen.     Succ. runs   Avg. gen.
1–5          77           1000          100          1000
2–6          9.5          900           100          900
similar (Fig. 4a), but when blocks are bigger, the average fitness rises quicker for the modified fitness function (Fig. 4b), and the solutions achieved after 1100 generations reach greater fitness. The solutions are not only better than those obtained with the standard fitness using larger blocks (2–6), but also better than the solutions obtained when smaller blocks were used (1–5). Using a larger block size also reduces the average number of generations needed to find the best solution. This is possible because each partition πᵢ in the initial population contains a smaller number of blocks, and the CFG G(T(U⁺))/πᵢ based on πᵢ contains fewer nonterminals. The algorithm finds a solution faster because the process of reducing the number of nonterminals is quicker (smaller initial number of nonterminals).

6.2. Learning with partially folded examples

Sakakibara extends TBL by introducing structured examples to help the algorithm evolve the correct CFG [32]. A structured example contains brackets that show the shape of the derivation tree. A completely structured example includes brackets describing the whole derivation tree, while a partially structured sentence has only some of the pairs of left and right brackets. For examples of a partially structured sentence and its derivation trees, see Fig. 5. The results show that the more information we provide (by giving examples with more structural information), the fewer steps it takes for the GA to evolve the correct grammar. In order to incorporate structural information, the TBL algorithm is supplemented with two extra steps (Algorithm 3).

Experiment 3. As positive examples, the words from the language L = {aⁿbⁿcᵐ | n, m ≥ 1} up to length 6 were used, together with a set of 15 negative examples of length up to 6. The folded example (a(ab)bc)c was added to the learning set. The population size was 100 and C₁ = C₂ = C₃ = C₄ = 0.5. The results are shown in Fig. 6 and summarized in Table 3.

Experiment 4. As positive examples, the words from the language L = {acⁿ ∪ bcⁿ | n ≥ 1} up to length 5 were used, together with a set of 25 negative examples of length up to 5 and the partially folded examples (a(cc)c)c and (b(cc)c)c. The population size was 20 and C₁ = C₂ = C₃ = C₄ = 0.5. The results are shown in Fig. 7 and Table 4.
Fig. 5. Examples of consistent (a) and inconsistent (b) derivation trees for the partially structured sentence a(ab)b.
Fig. 6. Learning over all examples from language L = {aⁿbⁿcᵐ | n, m ≥ 1} up to length 6 with the folded example (a(ab)bc)c. Initial population block size: 1–5 (a) and 3–7 (b).
The results of experiments 3 and 4 show that the use of partially folded examples speeds up the induction and has no influence on the quality of the results. In experiments 1 and 3, as well as 2 and 4, the results were very similar. Tables 3 and 4 prove that even the use of one or two partially folded examples reduces the number of generations needed to find the smallest partition. Tables 3 and 4 also show that the use of the modified fitness function ensures that the algorithm always finds the smallest partition and the searched-for grammar (for the tested population sizes and block sizes).
Fig. 7. Learning over all examples from language L = {acⁿ ∪ bcⁿ | n ≥ 1} up to length 5 with the folded examples (a(cc)c)c and (b(cc)c)c. Initial population block size: 1–5 (a) and 2–6 (b).
Algorithm 3. TBL algorithm to incorporate partially structured examples
Input: U is the learning set, U = U⁺ ∪ U⁻. U⁺ is the set of positive examples, whereas U⁻ is the set of negative examples.
1. Construct T(w) for each w in U⁺
2. Derive the primitive G(T(w)) and G^w(T(w)) for each w in U⁺
3. Eliminate unnecessary nonterminals and production rules based on the folded examples
4. Eliminate nonterminals that become useless as a result of step 3
5. Create the union of all primitive CFGs G(T(U⁺)) = ∪_{w ∈ U⁺} G^w(T(w))
6. Find a smallest partition π such that G(T(U⁺))/π is consistent with U
Output: the resulting CFG G(T(U⁺))/π
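One way to realize step 3 is to keep only those table nonterminals whose span is compatible with every bracket pair of a folded example; the sketch below shows such a compatibility test (our formulation of the step, under the usual reading of partially bracketed strings, not the authors' code).

```python
def bracket_spans(folded):
    """Return the (start, length) spans, over terminal positions only, marked by the
    brackets of a partially structured example such as '(a(ab)b)'."""
    spans, stack, pos = [], [], 0
    for ch in folded:
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            start = stack.pop()
            spans.append((start + 1, pos - start))    # 1-based start position, length
        else:
            pos += 1
    return spans

def consistent(i, j, spans):
    """A nonterminal X_{i,j,*} (substring of length j starting at i) is kept only if
    its span does not partially overlap (cross) any bracketed span."""
    a, b = i, i + j - 1
    for s, l in spans:
        c, d = s, s + l - 1
        overlap = not (b < c or d < a)
        nested = (c <= a and b <= d) or (a <= c and d <= b)
        if overlap and not nested:
            return False
    return True
```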
6.3. An analysis of the influence of the modified fitness function and the initial population block size
The previous experiments (1–4) showed that the initial block size has a significant impact on both the TBL algorithm and the improved TBL algorithm. For smaller blocks the obtained results are similar, but if bigger blocks are used the standard TBL algorithm gets stuck. On
Table 3
Learning over examples from language L = {aⁿbⁿcᵐ | n, m ≥ 1} with a partially folded example.

Block size   Standard TBL               Improved TBL
             Succ. runs   Avg. gen.     Succ. runs   Avg. gen.
1–5          100          550           100          550
3–7          24           500           100          500
Table 4
Learning over examples from language L = {acⁿ ∪ bcⁿ | n ≥ 1} with partially folded examples.

Block size   Standard TBL               Improved TBL
             Succ. runs   Avg. gen.     Succ. runs   Avg. gen.
1–5          100          650           100          450
2–6          83           600           100          350
Fig. 8. The average performance time for different fitness functions and average block sizes.
Fig. 9. Average best fitness for different fitness functions and average block sizes.
the other hand, bigger blocks should reduce the number of generations needed to induce the grammar, because of the smaller initial number of nonterminals. Note that the improved version of the TBL algorithm introduces two modifications: an extended fitness function and a new block delete operator. It is worth knowing which modification improves the learning performance, and in what way. The answer to both of the above issues can be found by performing another series of experiments. We tested separately the standard fitness function without the block delete operator (standard), the modified fitness function without the block delete operator (modified), the standard fitness function with the block delete operator (standard + BDO), and the modified fitness function with the block delete operator (modified + BDO). The performance times for several average block sizes were collected. An average block size of n means that the block sizes vary from n − 1 to n + 1; for example, an average block size of 2 means that the initial block sizes were 1–3. Each variant (standard, modified, standard + BDO, and modified + BDO) was tested using several average block sizes, from 2 to 5. As positive examples, the words from the language L = {acⁿ ∪ bcⁿ | n ≥ 1} up to length 5 were used, together with a set of 25 negative examples of length up to 5 and the partially folded examples (a(cc)c)c and (b(cc)c)c. The population size was 20 and C₁ = C₂ = C₃ = C₄ = 0.5. The results are shown in Fig. 8. The results of these experiments allow three important conclusions to be drawn. First of all, the bigger the block size,
Fig. 10. The average performance time of modified function with BDO for different block size ranges and minimal block sizes.
the faster the algorithm, regardless of the type of algorithm. Next, the standard function (with or without BDO) loses the ability to infer the grammar for an average block size larger than 3. And the last conclusion: the new block delete operator speeds up the algorithm (with the standard or the modified fitness function) almost 3 times. Moreover, the BDO operator has a great influence not only on the time, but also on the quality of the results (see Fig. 9). BDO bumps up the fitness both for the standard fitness function and for the modified one. The main reason why the new delete operator improves the execution time and the grammar fitness lies in a faster reduction of nonterminal symbols than with the standard delete operator used by Sakakibara. The number of nonterminals is a key value for the (improved) TBL algorithm and affects both the parsing time and the fitness function.

Looking for the optimal block size for the modified function, we tested the dependence between the minimal block size, different block size ranges, and the average performance time. The experiment was performed for the modified function with BDO. The results are shown in Fig. 10. The optimal value of the minimal block size, regardless of the block size range, is 4. For block sizes bigger than 4, the performance time rises. Note that the curve for the block size range of 6 goes down, which could suggest that the optimal block size is greater than 4. However, when we take into consideration not only the time needed for successful runs (when a grammar consistent with the learning set is found), but also the time wasted on suboptimal results (i.e., when the inferred grammar does not parse any negative sentence, but also does not parse every positive sentence), the real performance time is much longer. For block sizes between 5 and 11 (a range of 6), unsuccessful runs appear in more than 50% of cases! It means that block size has its limit, beyond which the algorithm appears to be ineffective.

7. Conclusions

The TBL algorithm is a promising approach to deal with the problem of context-free grammar induction by using evolutionary methods. The tabular representations efficiently store the exponential number of all possible grammatical structures in a table of polynomial size. Furthermore, partially structured examples essentially improve the efficiency of the TBL algorithm, so the learning algorithm can also identify a grammar having the intended structure, that is, structurally equivalent to the unknown grammar. However, the main problem of the standard TBL algorithm is its susceptibility to the partition block size in the initial population and also to the population size. The experiments we have performed showed that for larger blocks the algorithm is not effective, and for small
populations (size 20) it is even unable to find the unknown grammar. On the other hand, it is precisely the partition block size and the population size that determine the number of generations needed to find a solution. In this paper we have introduced some modifications to the TBL algorithm which concentrate on the genetic algorithm. We have proposed a modified fitness function that also distinguishes individuals accepting negative examples, in contrast to the standard one, which gives them all the same fitness value (0). The proposed fitness function ranges from −1 to 1 in order to allow comparison with the standard one, but it can be normalized to the range [0, 1]. Next, we have introduced a specialized block delete operator, which works similarly to the special delete operator described in [32], but instead of removing one nonterminal from the partition it deletes all nonterminals that are placed in some randomly selected block. The aim of this operator is to speed up the process of searching for the minimal partition.

A set of experiments has been performed to determine the efficiency of the improved tabular representation. In the first group of experiments learning proceeded without any prior information about the grammatical structure; in the second one learning was supported by partial knowledge of the structure. In each of the experiments the influence of the partition block size in the initial population and of the population size on grammar induction has been tested. The proposed new TBL algorithm has been experimentally shown to be less prone to the block and population sizes than the standard one, and to find the target grammar in a smaller number of generations. We have extended Sakakibara's analysis and experimentally tested the influence of the initial block size on the performance time and fitness of both the standard algorithm and the improved one. The optimal value of the block size has been found.

Our future work, following a direction indicated in [32], will investigate the relation between the degree of structural information and the number of necessary unstructured examples. Although the standard TBL algorithm has been compared with another CFG learning method, i.e. Unold's GCS, in [45], the improved version requires its own comparisons. There are many partially bracketed sentences available in real databases dedicated to natural language processing, so in future work we want to apply our modified learning algorithm to such databases.

References

[1] N. Abe, H. Mamitsuka, Predicting protein secondary structure using stochastic tree grammars, Machine Learning Journal 29 (1997) 275–301.
[2] P. Adriaans, Language Learning from a Categorical Perspective, Ph.D. thesis, Universiteit van Amsterdam, 1992.
[3] P. Adriaans, H. Fernau, M. van Zaanen (Eds.), Grammatical Inference: Algorithms and Applications, Proceedings of ICGI 00, LNAI 2484, Springer-Verlag, Berlin, Heidelberg, 2002.
[4] H. Arimura, H. Sakamoto, S. Arikawa, Efficient Learning of Semistructured Data from Queries, in: N. Abe, R. Khardon, T. Zeugmann (Eds.), Proceedings of ALT 2001, LNCS 2225, 2001.
[5] J.K. Baker, Trainable Grammars for Speech Recognition, in: D.H. Klatt, J.J. Wolf (Eds.), Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, 1979.
[6] R.C. Carrasco, J. Oncina, J. Calera-Rubio, Stochastic inference of regular tree languages, Machine Learning Journal 44 (1) (2001) 185–197.
[7] E. Charniak, Statistical Language Learning, MIT Press, Cambridge, 1993.
[8] B. Chidlovskii, Schema Extraction from XML: A Grammatical Inference Approach, in: M. Lenzerini, D. Nardi, W. Nutt, D. Suciu (Eds.), Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB 2001), Vol. 45 of CEUR Workshop Proceedings, 2001.
[9] J. Cocke, Programming Languages and Their Compilers: Preliminary Notes, Tech. Rep., Courant Institute of Mathematical Sciences, New York University, 1966.
[10] M. Crepinsek, M. Mernik, F. Javed, B. Bryant, A. Sprague, Extracting grammar from programs: evolutionary approach, ACM SIGPLAN Notices 40 (2005) 39–46.
[11] C. de la Higuera, Current Trends in Grammatical Inference, in: F. Ferri, J. Inesta, A. Amin, P. Pudil (Eds.), Advances in Pattern Recognition, Joint IAPR International Workshops SSPR+SPR'2000, LNCS 1876, Springer-Verlag, Berlin, 2000.
[12] C. de la Higuera, J. Oncina, Learning Deterministic Linear Languages, in: J. Kivinen, R.H. Sloan (Eds.), Proceedings of COLT 2002, LNAI 2375, Springer-Verlag, Berlin, Heidelberg, 2002.
[13] P. Dupont, Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: the GIG Method, Springer-Verlag, 1994.
[14] H. Fernau, Learning XML grammars, in: P. Perner et al. (Eds.), Machine Learning and Data Mining in Pattern Recognition MLDM01, Springer-Verlag, 2001.
[15] P. García, J. Oncina, Inference of recognizable tree sets, Tech. Rep. DSIC-II/47/93, Departamento de Lenguajes y Sistemas Informáticos, Universidad Politécnica de Valencia, 1993.
[16] E. Gold, Language identification in the limit, Information and Control 10 (5) (1967) 447–474.
[17] A. Habrard, M. Bernard, F. Jacquenet, Generalized Stochastic Tree Automata for Multi-relational Data Mining, in: Adriaans et al. [3], pp. 125–133.
[18] A. Jagota, R.B. Lyngso, C.N.S. Pedersen, Comparing a Hidden Markov Model and a Stochastic Context-Free Grammar, in: Proceedings of WABI 01, Springer-Verlag, Berlin, Heidelberg, 2001.
[19] T. Kasami, An Efficient Recognition and Syntax-analysis Algorithm for Context-free Languages, Tech. Rep. AFCRL-65-558, Air Force Cambridge Research Laboratory, Bedford, Massachusetts, 1965.
[20] T. Knuutila, M. Steinby, Inference of tree languages from a finite sample: an algebraic approach, Theoretical Computer Science 129 (1994) 337–367.
[21] T. Koshiba, E. Mäkinen, Y. Takada, Inferring pure context-free languages from positive data, Acta Cybernetica 14 (3) (2000) 469–477.
[22] S.C. Kremer, Parallel Stochastic Grammar Induction, in: Proceedings of the 1997 International Conference on Neural Networks (ICNN 97), vol. I, 1997.
[23] L. Lee, Learning of Context-free Languages: A Survey of the Literature, Tech. Rep. TR-12-96, Harvard University, Cambridge, Massachusetts, 1996.
[24] S. Lucas, Structuring Chromosomes for Context-Free Grammar Evolution, in: 1st International Conference on Evolutionary Computing, 1994.
[25] E. Mäkinen, A note on the grammatical inference problem for even linear languages, Fundamenta Informaticae 25 (2) (1996) 175–182.
[26] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer, Berlin, 1996.
[27] K. Nakamura, Incremental Learning of Context Free Grammars by Extended Inductive CYK Algorithm, in: C. de la Higuera, P. Adriaans, M. van Zaanen, J. Oncina (Eds.), Proceedings of the Workshop and Tutorial on Learning Context-Free Grammars, ECML 2003, Ruder Boskovic Institute, Cavtat-Dubrovnik, Croatia, 2003.
[28] C. Nevill-Manning, I. Witten, Identifying hierarchical structure in sequences: a linear-time algorithm, Journal of Artificial Intelligence Research 7 (1997) 67–82.
[29] J.R. Rico-Juan, J. Calera-Rubio, R.C. Carrasco, Stochastic k-testable Tree Languages and Applications, in: Adriaans et al. [3], pp. 199–212.
[30] Y. Sakakibara, Learning context-free grammars from structural data in polynomial time, Theoretical Computer Science 76 (1990) 223–242.
[31] Y. Sakakibara, Efficient learning of context-free grammars from positive structural examples, Information and Computation 97 (1992) 23–60.
[32] Y. Sakakibara, Learning context-free grammars using tabular representations, Pattern Recognition 38 (2005) 1372–1383.
[33] Y. Sakakibara, M. Brown, R. Hughley, I. Mian, K. Sjolander, R. Underwood, D. Haussler, Stochastic context-free grammars for tRNA modeling, Nucleic Acids Research 22 (1994) 5112–5120.
[34] Y. Sakakibara, M. Kondo, GA-based learning of context-free grammars using tabular representations, in: Proceedings of 16th International Conference in Machine Learning (ICML-99), Morgan-Kaufmann, Los Altos, CA, 1999.
[35] Y. Sakakibara, H. Muramatsu, Learning context-free grammars from partially structured examples, in: Proceedings of the 5th International Colloquium on Grammatical Inference (ICGI-2000), LNAI 1891, Springer, Berlin, 2000.
[36] I. Salvador, J.M. Benedí, RNA modeling by combining stochastic context-free grammars and n-gram models, International Journal of Pattern Recognition and Artificial Intelligence 16 (3) (2002) 309–316.
[37] J.M. Sempere, P. García, A Characterisation of Even Linear Languages and its Application to the Learning Problem, Springer-Verlag, 1994.
[38] A. Stolcke, An efficient probabilistic context-free parsing algorithm that computes prefix probabilities, Computational Linguistics 21 (2) (1995) 165–201.
[39] Y. Takada, Grammatical inference for even linear languages based on control sets, Information Processing Letters 28 (1988) 193–199.
[40] Y. Takada, A Hierarchy of Language Families Learnable by Regular Language Learners, Springer-Verlag, 1994.
[41] E. Tanaka, Theoretical aspects of syntactic pattern recognition, Pattern Recognition 28 (7) (1995) 1053–1061.
[42] O. Unold, Context-free grammar induction with grammar-based classifier system, Archives of Control Science 15 (LI) (4) (2005) 681–690.
[43] O. Unold, Grammar-based classifier system for recognition of promoter regions, in: B. Beliczynski et al. (Eds.), Proceedings of ICANNGA07, LNCS 4431, Part I, 2007.
[44] O. Unold, Learning classifier system approach to natural language grammar induction, in: Y. Shi et al. (Eds.), Proceedings of ICCS 2007, LNCS 4488, Part II, 2007.
[45] O. Unold, L. Cielecki, Learning context-free grammars from partially structured examples: juxtaposition of GCS with TBL, in: 7th International Conference on Hybrid Intelligent Systems, IEEE Computer Society Press, Germany, 2007.
[46] G. von Laszewski, Intelligent structural operators for the k-way graph partitioning problem, in: Proceedings of Fourth International Conference on Genetic Algorithms, Morgan Kaufmann, Los Altos, CA, 1991.
[47] P. Wyard, Representational Issues for Context Free Grammar Induction Using Genetic Algorithms, Springer-Verlag, 1994.
[48] D.H. Younger, Recognition and parsing of context-free languages in time n³, Information and Control 10 (2) (1967) 189–208.