Computer Speech and Language 20 (2006) 644–658
Unsupervised grammar induction using history based approach

Heshaam Feili, Gholamreza Ghassem-Sani *

Department of Computer Engineering, Sharif University of Technology, Azadi Avenue, Tehran, Iran

Received 6 January 2005; received in revised form 5 November 2005; accepted 10 November 2005. Available online 5 December 2005.
Abstract

Grammar induction, also known as grammar inference, is one of the most important research areas in the domain of natural language processing. The availability of large corpora has encouraged many researchers to use statistical methods for grammar induction. This problem can be divided into three different categories, supervised, semi-supervised, and unsupervised, based on the type of data set required for the training phase. Most current inductive methods are supervised and need a bracketed data set for their training phase; the lack of this kind of data set in many languages encouraged us to focus on unsupervised approaches. Here, we introduce a novel approach, which we call history-based inside-outside (HIO), for unsupervised grammar inference, using part-of-speech tag sequences as the only source of lexical information. HIO is an extension of the inside-outside algorithm, enriched with some notions of history-based approaches. Our experiments on English and Persian show that by adding some conditions to the rule assumptions of the induced grammar, one can achieve an acceptable improvement in the quality of the output grammar.
© 2005 Elsevier Ltd. All rights reserved.
1. Introduction

With the recent increasing interest in statistical approaches to natural language processing, corpus-based linguistics has become a hot topic (Brill, 1993). This is due to the fact that more computer-readable text is available than ever before, and it is easier to use for various tasks. The success of part-of-speech tagging using the Hidden Markov Model (HMM) (Charniak, 1997b; Church, 1988) also attracted the attention of computational linguists to lexical analysis, language modeling, and machine translation with various statistical methods (Feili and Ghassem-Sani, 2004; Charniak, 1996).
Manual design and refinement of a natural language grammar is a difficult and time-consuming task, and requires a large amount of skilled effort. A hand-crafted grammar is usually not completely satisfactory, and frequently fails to cover many unseen sentences. Automatic acquisition of grammars is a solution to this problem. With the increasing availability of large, machine-readable, parsed corpora such as the Penn Treebank
This research has been partially funded by the Iranian Telecommunication Research Center (ITRC).
* Corresponding author. Tel./fax: +98 21 66164624.
E-mail addresses: [email protected] (H. Feili), [email protected] (G. Ghassem-Sani).

0885-2308/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.csl.2005.11.001
(Marcus et al., 1993), there have been numerous attempts to automatically derive a context-free grammar from such corpora (Pereira and Schabes, 1992). Based on the level of supervision used by different algorithms, grammar induction methods are divided into three main categories: supervised, semi-supervised, and unsupervised. Here, we present a novel unsupervised algorithm named history-based inside-outside (HIO), an extension of the well-known unsupervised inside-outside algorithm (Lari and Young, 1990). The experimental results of applying HIO to English and Persian are also presented and compared with those of several other unsupervised approaches.

2. Grammar induction methods

Inductive methods can be classified into three categories based on the type of required data (Thanaruk and Okumaru, 1995). The first category, called supervised grammar induction, uses fully parsed and tagged corpora such as the Penn Treebank (Marcus et al., 1993). Some of the most promising methods of this category were presented by Charniak (1997a,b), Collins (1996, 1997), Charniak (2000) and Magerman (1995), all of which used the Penn Treebank. Experiments on the grammars induced by these methods showed about 85-88% accuracy with respect to the precision/recall metrics (Black et al., 1991).
The methods in the second category, called semi-supervised grammar induction, use less supervision than the previous category. Some semi-supervised methods were introduced by Pereira and Schabes (1992) and Schabes et al. (1993); they improved the accuracy to the level of 90% bracketing accuracy for sentences with an average length of 10 words. Other works that can be regarded as semi-supervised were reported by Sanchez et al. (1996) and Sanchez and Benedi (1998, 1999), where the K most probable derivations of input sentences were used in an algorithm based on the Viterbi score (Manning and Schutze, 1999). Briscoe and Waegner incorporated a supervised explicit grammar in an unsupervised algorithm (i.e., IO) and gained some improvements (Briscoe and Waegner, 1992).
In the third category, called unsupervised grammar induction, only tagged sentences, without any bracketing or any other type of supervised information, are used. Based on the Expectation-Maximization (EM) algorithm, Baker proposed an unsupervised algorithm, called inside-outside (IO), which constructs a grammar from an unbracketed corpus (Baker, 1979). The algorithm converges towards a local optimum by iteratively re-estimating the probabilities in a manner that maximizes the likelihood of the training corpus given the grammar (Homes, 1988). This method is regarded as one of the basic algorithms of unsupervised automatic grammar learning (Amaya et al., 1999; Pereira and Schabes, 1992; Casacuberta, 1996). The IO algorithm requires a great deal of computation, and produces a grammar in Chomsky normal form. Stolcke and Omohundro induced a small, artificial context-free grammar with chunk-merge systems (Stolcke and Omohundro, 1994). The results of these approaches showed that completely unsupervised acquisition is not very effective. Furthermore, Charniak demonstrated that running an EM algorithm from random starting points produces widely varying grammars (Charniak, 1993). Using IO, Chen introduced a Bayesian grammar induction method followed by a post-pass (Chen, 1995).
There have also been other efforts to improve the quality of unsupervised induction methods by imposing some limitations or using additional information. Magerman and Marcus used a distituent grammar to eliminate undesirable rules (Magerman and Marcus, 1990). Carroll and Charniak restricted the set of non-terminals that may appear on the right-hand side of rules with a given left-hand side (Carroll and Charniak, 1992). Alignment-based learning (ABL) is a learning paradigm based on the principle of substitutability, whereby two constituents of the same type can be substituted for each other (Van Zaanen, 2000, 2002; Van Zaanen and Adriaans, 2001). Also, Adriaans presented EMILE, which initially relied on some level of supervision, but was later modified to be a completely unsupervised system (Adriaans and Haas, 1999). One of the most promising classes of unsupervised induction algorithms is based on the particular distribution of words in sentences, and uses distributional evidence to identify constituent structure (Klein and Manning, 2001b). Here, the main idea is that sequences of words (or tags) generated by the same non-terminal normally appear in similar contexts (Clark, 2001a,b; Klein and Manning, 2005). Klein and Manning have introduced a new distributional method for inducing a bracketed tree structure of the sentence, together with a dependency model that induces a word-to-word dependency structure (Klein and Manning, 2001a,b, 2002, 2004). By
combining these two models, they have achieved the best result in unsupervised grammar induction so far. Other dependency models, with weaker results, were presented by Carroll and Charniak (1992), Yuret (1998) and Paskin (2002).
Although supervised methods outperform current unsupervised induction algorithms by a relatively large margin, there are still compelling motivations for working on unsupervised methods (Marcken, 1995; Marcken, 1996). Building supervised training data requires considerable resources, including time and linguistic expertise. This problem is more complicated for languages other than English, which usually lack the necessary resources. Furthermore, the resulting hand-crafted treebank may be too restricted to a particular domain, application, or genre (Kehler and Stolcke, 1999).
Applying probabilistic models to natural languages is not a new experiment; however, most previous models of parsing assume the independence of the input sentence and its surrounding context. In fact, previous works were often based on an even stronger independence assumption. For instance, the probabilistic context-free grammar (PCFG) model assumes that the probability of each constituent is independent of its neighboring constituents (Charniak, 1997a,b). But this assumption is not valid in some cases. For example, in English, noun phrases in the subject position are more likely to be pronouns than noun phrases not in the subject position (Allen, 1995). In other words, the probability P(NP → pronoun | parent = S) is higher than P(NP → pronoun | parent = VP). This is also the case in Persian, where, because of its pro-drop feature, the pronoun can be dropped when it is used as the subject (Bateni, 1995; Bijankhan, 2003).
On the other hand, there are richer models of context that incorporate additional information into the probability of each constituent, and present a way of calculating the probability model more accurately. For instance, the probabilistic parser Pearl, introduced in Magerman and Marcus (1991) and Magerman and Weir (1992), is more sensitive to context than ordinary PCFGs. In an ordinary PCFG, each rule A → β has the probability P(A → β | A), whereas in Pearl the parameters are extended by conditioning on the rule C → αAγ that may have generated A, and on the part-of-speech trigram a0a1a2 in the sentence such that a1 is the left-most word of the corresponding constituent. Using a supervised learning method, Pearl achieved 88% bracketing accuracy.
Another important model, which increased the dependencies on the context, is the history-based model that was originally developed by researchers at IBM (Black et al., 1992a,b; Jelinek et al., 1994). The parse tree representation in this model is richer than an ordinary parse tree in a couple of ways: non-terminal labels are extended to carry extra pieces of information, such as lexical items and head words, and the conditioning context is extended to look at potentially all previously built structures, rather than just the non-terminal being expanded (as in PCFGs). An improvement from 59.8% to 74.6% in parsing accuracy was reported with this model (Black et al., 1992b). The idea of adding the parent of each non-terminal as conditioning information to the rules was also employed by Johnson (1998).
It was shown that by replacing P(α → β | α) with P(α → β | α, Parent(α)), where Parent(α) is the non-terminal dominating α, a noticeable improvement in precision/recall can be achieved.
All the history-based models mentioned so far are supervised. In this paper, we introduce a new unsupervised approach, named history-based inside-outside (HIO), which extends the IO algorithm with the history-based notion. We use the idea of adding more conditions to the probability model of the inside-outside algorithm by estimating P(α → β | α, Parent(α)). The new approach takes the output of the traditional IO algorithm (i.e., P(α → β | α)) as its input, and tries to distribute this probability over different events such as (α → β | α, C = Parent(α)), using the inner and outer probability values. The inner and outer probabilities are calculated by the ordinary IO estimation process.
The HIO model differs from other unsupervised grammar induction methods with respect to the form of the output grammar. The final output grammar of unsupervised models is often a probabilistic context-free grammar (in the general form, or in a more specific form such as probabilistic Chomsky normal form) (Johnson, 1998; Lari and Young, 1990; Pereira and Schabes, 1992). In HIO, however, a general rule is defined as a pair ⟨R, C⟩, where R is a rule in Chomsky normal form (e.g., i → jk), and C is the parent non-terminal of R, i.e., the non-terminal that immediately dominates R's left-hand side in the derivation process. Unlike a PCFG, in which the probabilities of rules with the same left-hand non-terminal sum to one, here the probabilities of all rules with the same left-hand side and the same parent sum to one.
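To make the form of this extended grammar concrete, the following sketch (ours, not the authors' code; all identifiers are illustrative) stores each rule together with the parent of its left-hand side and normalizes probabilities per (left-hand side, parent) pair, mirroring the constraint stated above.

from collections import defaultdict

class ParentAnnotatedGrammar:
    def __init__(self):
        # prob[(C, i)][(j, k)] = P(i -> j k | i, Parent(i) = C)
        self.prob = defaultdict(dict)

    def set_rule(self, parent, lhs, rhs, p):
        self.prob[(parent, lhs)][rhs] = p

    def normalise(self):
        # enforce: probabilities of rules with the same LHS *and* the same parent
        # sum to one (a plain PCFG would normalise per LHS alone)
        for key, rules in self.prob.items():
            total = sum(rules.values())
            if total > 0:
                for rhs in rules:
                    rules[rhs] /= total

# toy usage: the same rule NP -> pronoun gets different mass under parent S vs. VP
g = ParentAnnotatedGrammar()
g.set_rule("S",  "NP", ("pronoun",), 0.7)
g.set_rule("S",  "NP", ("DT", "NN"), 0.3)
g.set_rule("VP", "NP", ("pronoun",), 0.2)
g.set_rule("VP", "NP", ("DT", "NN"), 0.8)
g.normalise()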
We use an extension of probabilistic CYK (PCYK) (Jurafsky and Martin, 2000; Kasami, 1965; Younger, 1967) as the parsing algorithm. In the extended PCYK algorithm, the parent non-terminal of each CNF rule is also considered during parsing. We evaluated our estimation model on both English and Persian. For English, the experiments on ATIS and WSJ show that HIO outperforms most current unsupervised methods, including the baselines. For Persian, we compared HIO only against IO, and the results show a significant improvement in the quality of the induced grammar with respect to the precision/recall criteria.

3. History-based inside-outside (HIO)

In this section, we explain HIO, a new estimation model that induces an extended form of PCFG using the traditional IO algorithm. The output of the IO algorithm and other partial information, such as the inner and outer probabilities, are used as the input data; this information is necessary to calculate the history-based model. The main idea of HIO is to decompose the probabilities of the rules extracted by IO over their parent non-terminals. This decomposition can guide the parser toward better decisions in analyzing its input sentences.
In order to incorporate the parent non-terminal into the Chomsky normal form, we use the following notation to state the probability of using rule i → jk (see footnote 1):

A[C, i, j, k] = p(i → jk | i-used, C = Parent(i), C-used)    (1)
Considering the conditional probability:

p(i → jk | i-used, C = Parent(i), C-used) = p(i → jk, C = Parent(i) | i-used, C-used) / p(C = Parent(i))    (2)
and noting that

C = Parent(i) ⟺ ∃U (C → Ui  or  C → iU)    (3)

we can infer:

p(C = Parent(i)) = Σ_U P(C → Ui) + Σ_U P(C → iU) − P(C → ii)    (4)
In the original IO algorithm (Lari and Young, 1990), the matrix a is defined as follows:

a[i, j, k] = p(i → jk | i-used)    (5)
From (3) and (4), we can derive:

∀C, ∀i:  p(C = Parent(i)) = Σ_u a[C, u, i] + Σ_u a[C, i, u] − a[C, i, i]    (6)
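Eq. (6) is a direct marginalization over the IO rule matrix. A small sketch of it (ours; it only assumes the IO rule probabilities are available as a NumPy array with a[c, j, k] = p(c → jk | c-used)):

import numpy as np

def parent_prob(a, C, i):
    """Eq. (6): p(C = Parent(i)) = sum_u a[C, u, i] + sum_u a[C, i, u] - a[C, i, i]."""
    return a[C, :, i].sum() + a[C, i, :].sum() - a[C, i, i]

# toy check with a random (unnormalised) rule matrix over 5 non-terminals
a = np.random.rand(5, 5, 5)
print(parent_prob(a, C=0, i=2))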
Given the observation O, we obtain:

p(C ⇒* O(s) ... O(t) | S ⇒* O, G) = e(s, t, C) · f(s, t, C) / P    (7)

where P = p(S ⇒* O | G),
and e(s, t, C) and f(s, t, C) are the inner and outer probabilities, respectively. Moreover,

e(s, t, i) = Σ_{j,k} a[i, j, k] [ Σ_{r=s}^{t−1} e(s, r, j) · e(r+1, t, k) ]    (8)
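The inner probability of Eq. (8) is the standard inside score of Lari and Young (1990), computed bottom-up over spans. The following sketch (ours; `a` is the CNF rule matrix of Eq. (5), `b[i, w]` an assumed lexical-rule table for i → O(w), and O a tag-index sequence) shows the recursion:

import numpy as np

def inside(a, b, O):
    N, T = a.shape[0], len(O)
    e = np.zeros((T, T, N))                     # e[s, t, i] = e(s, t, i)
    for s in range(T):
        e[s, s, :] = b[:, O[s]]                 # spans of length one
    for length in range(2, T + 1):
        for s in range(T - length + 1):
            t = s + length - 1
            for r in range(s, t):               # split point r = s .. t-1
                # e(s,t,i) += sum_{j,k} a[i,j,k] * e(s,r,j) * e(r+1,t,k)
                e[s, t, :] += np.einsum('ijk,j,k->i', a, e[s, r, :], e[r + 1, t, :])
    return e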
Now, if rule C → iU is used at the first step of a derivation starting from non-terminal C, then by the definitions of the inner probability (Eq. (8)) and of matrix a (Eq. (5)), we can derive:

p(C → iU ⇒* O(s) ... O(t) | S ⇒* O, G) = (1/P) Σ_{r=s}^{t−1} a[C, i, U] · e(s, r, i) · e(r+1, t, U) · f(s, t, C),  ∀i, U, t > s    (9)
Supposing that rule i → jk is used in the next step of the derivation, the following equation holds:

(Footnote 1: For the sake of simplicity, we omit the conditioning on grammar G in all probabilities. Also, in the remainder of the paper, we use "i-used" to indicate that non-terminal i has been used in the derivation process.)
p(C → iU → (jk)U ⇒* O(s) ... O(t) | S ⇒* O, G) = (1/P) Σ_{r=s}^{t−1} Σ_{v=s}^{r−1} a[C, i, U] · a[i, j, k] · e(s, v, j) · e(v+1, r, k) · e(r+1, t, U) · f(s, t, C),  ∀i, j, U, t > s    (10)
Denoting by T_L the left-branching tree in which C expands to iU and i expands to jk, i.e., T_L = (C (i j k) U), we can rewrite Eq. (10) as follows:

p(T_L, i-used, C-used) = Eq. (10)    (11)
Therefore:

P(T_L | i-used, C-used) = Eq. (10) / P(i-used, C-used)    (12)
By assuming the independence of using non-terminals i and C, we have:

p(i-used, C-used) = p(i-used) · p(C-used)    (13)
And from (12) and (13), we get the following equation:

p(T_L | i-used, C-used) = Eq. (10) / (p(i-used) · p(C-used))    (14)
Eq. (15) can be derived in a way similar to that of (10):

p(C → Ui → U(jk) ⇒* O(s) ... O(t) | S ⇒* O, G) = (1/P) Σ_{r=s}^{t−1} Σ_{v=r+1}^{t−1} a[C, U, i] · a[i, j, k] · e(r+1, v, j) · e(v+1, t, k) · e(s, r, U) · f(s, t, C),  ∀i, j, U, t > s    (15)
And by defining T_R as the right-branching tree in which C expands to Ui and i expands to jk, i.e., T_R = (C U (i j k)):

p(T_R | i-used, C-used) = Eq. (15) / (P(i-used) · P(C-used))    (16)
Now, we can evaluate the main equation:

p(i → jk, C = Parent(i) | i-used, C-used)
  = Σ_U p(i → jk, (C → Ui  or  C → iU) | i-used, C-used)
  = Σ_U [ p(i → jk, C → Ui | i-used, C-used) + p(i → jk, C → iU | i-used, C-used) ] − p(i → jk, C → ii | i-used, C-used)    (17)
T_L and T_R are equal to:

T_L = {C → iU, i → jk},    T_R = {C → Ui, i → jk}    (18)
By substituting T_L and T_R in (17), one gets:

p(i → jk, C = Parent(i) | i-used, C-used) = Eq. (14) + Eq. (16) − p(C → ii → (jk)i → (jk)(jk) | i-used, C-used)    (19)
The last term of (19) can be computed by using (10), where U is instantiated to i and later expanded by i → jk. Thus:

p(C → ii → (jk)i → (jk)(jk) ⇒* O(s) ... O(t) | S ⇒* O, G) = (1/P) Σ_{r=s}^{t−1} a[C, i, i] · f(s, t, C) · [ Σ_{v=s}^{r−1} a[i, j, k] · e(s, v, j) · e(v+1, r, k) ] · [ Σ_{w=r+1}^{t−1} a[i, j, k] · e(r+1, w, j) · e(w+1, t, k) ],  ∀i, j, t > s    (20)
The first bracketed sum in Eq. (20) corresponds to the inner probability e(s, r, i), and the second to e(r + 1, t, i) (see Lari and Young, 1990). The probability of using any non-terminal i in a derivation is computed as in Lari and Young (1990):

p(i-used) = (1/P) Σ_{s=1}^{T} Σ_{t=1}^{T} e(s, t, i) · f(s, t, i)    (21)
Considering Eqs. (2), (5), (9), (20) and (21), we can compute the probability stated in (1), which is the main parameter of the history-based model:

p(i → jk | i-used, C = Parent(i), C-used) = (Eq. (14) + Eq. (16) − Eq. (20)) / Eq. (21)    (22)
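The sketch below (ours; not the authors' implementation) illustrates how the quantities above could be assembled into the HIO parameter of Eq. (22). It assumes the IO by-products are available: the rule matrix a of Eq. (5), inner probabilities e(s, t, x), outer probabilities f(s, t, x), and the sentence probability P. The left- and right-branching terms of Eqs. (10) and (15) are written out; the double-use correction of Eq. (20) is left as an analogous omission, and all names are illustrative.

def p_used(e, f, x, P):
    """Eq. (21): probability that non-terminal x is used in the derivation."""
    return (e[:, :, x] * f[:, :, x]).sum() / P

def left_branching_term(a, e, f, P, C, i, U, j, k):
    """Eq. (10), accumulated over spans (s, t)."""
    T = e.shape[0]
    total = 0.0
    for s in range(T):
        for t in range(s + 1, T):
            for r in range(s, t):
                inner_i = sum(a[i, j, k] * e[s, v, j] * e[v + 1, r, k]
                              for v in range(s, r))
                total += a[C, i, U] * inner_i * e[r + 1, t, U] * f[s, t, C]
    return total / P

def right_branching_term(a, e, f, P, C, i, U, j, k):
    """Eq. (15), accumulated over spans (s, t)."""
    T = e.shape[0]
    total = 0.0
    for s in range(T):
        for t in range(s + 1, T):
            for r in range(s, t):
                inner_i = sum(a[i, j, k] * e[r + 1, v, j] * e[v + 1, t, k]
                              for v in range(r + 1, t))
                total += a[C, U, i] * e[s, r, U] * inner_i * f[s, t, C]
    return total / P

def hio_parameter(a, e, f, P, C, i, j, k):
    """Eq. (22); the correction term of Eq. (20) is omitted here for brevity."""
    N = a.shape[0]
    denom = p_used(e, f, i, P) * p_used(e, f, C, P)          # Eq. (13)
    eq14 = sum(left_branching_term(a, e, f, P, C, i, U, j, k) for U in range(N)) / denom
    eq16 = sum(right_branching_term(a, e, f, P, C, i, U, j, k) for U in range(N)) / denom
    eq20 = 0.0   # double-use correction of Eq. (20), assembled analogously from a, e, f
    return (eq14 + eq16 - eq20) / p_used(e, f, i, P)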
Eq. (22), denoted by A[C, i, j, k], is the main estimation formula used by HIO to support the history-based model. In a similar manner, the second form of Chomsky rules (i → m) can be extended so that the probability of using such rules is also conditioned on the parent non-terminal. After evaluating the parameter matrices a and b of the traditional IO algorithm, we evaluate matrices A and B. The main iteration of the inside-outside algorithm terminates when the change in the overall probability of the observations is less than a pre-defined threshold.

4. Extended PCYK algorithm

The HIO model discussed in the previous section considers parent non-terminals during parsing. Therefore, the PCYK algorithm has been extended to take parent non-terminals into account while filling the parsing table. The probabilistic CYK algorithm is itself an extended version of the traditional CYK algorithm (see Grune and Jacobs, 1990; Kasami, 1965), tailored to calculate the probabilities of the different generated parses. Traditional CYK receives a CNF grammar as its modeling formalism and, using dynamic programming, generates all possible parse trees of any given sentence; as in any other dynamic programming algorithm, a table is used to store the partial results of parsing. Similar to PCYK, the extended PCYK uses a four-dimensional table W of size M × M × N × N, where N is the number of non-terminals and M is the number of words in the input sentence. Any entry W_ijAC of this table is defined as:

W_ijAC = P(A ⇒* O(i) ... O(j) | C = Parent(A))    (23)
where O(i) refers to the ith word of the input sentence O. The entries of this table can be calculated recursively. If rule A → XY is used in the first step of deriving A ⇒* O(i) ... O(j), we can infer:

W_ijAC = MAX_k ( P(A → XY | C = Parent(A)) · W_ikXA · W_(k+1)jYA )    (24)
The base case of this recursion is given by the second type of CNF rules (i.e., A → m):

W_iiAC = P(A → O(i) | C = Parent(A))    (25)
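A compact sketch of this extended recursion (ours; back-pointers for recovering the parse tree are omitted) is given below. It assumes the HIO parameters are available as A_[C, A, X, Y] = P(A → XY | Parent(A) = C) for binary rules and B_[C, A, w] = P(A → w | Parent(A) = C) for lexical rules; the maximization is taken over the split point and over the rule's right-hand side.

import numpy as np

def extended_pcyk(A_, B_, sent):
    N, M = A_.shape[0], len(sent)
    W = np.zeros((M, M, N, N))                    # W[i, j, A, C] of Eq. (23)
    for i, w in enumerate(sent):                  # Eq. (25): W[i,i,A,C] = P(A -> O(i) | C)
        W[i, i, :, :] = B_[:, :, w].T
    for length in range(2, M + 1):
        for i in range(M - length + 1):
            j = i + length - 1
            for C in range(N):
                for A in range(N):
                    best = 0.0
                    for k in range(i, j):         # Eq. (24): maximise over the split point k
                        # best over X, Y of P(A -> X Y | C) * W[i,k,X,A] * W[k+1,j,Y,A]
                        cand = A_[C, A] * np.outer(W[i, k, :, A], W[k + 1, j, :, A])
                        best = max(best, cand.max())
                    W[i, j, A, C] = best
    return W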
5. Experimental results

Two kinds of experiments are presented in this section. First, the results of applying HIO to two different English data sets are demonstrated and compared with related approaches. Then, the results of applying HIO to Persian, which is essentially very different from English, are discussed.
5.1. Experiments on the English language

HIO was tested on both ATIS (Hemphill et al., 1990) and the Wall Street Journal (Schabes et al., 1993). We selected these data sets for testing the induced grammar in order to be able to compare others' work with ours. There are different metrics for evaluating the output grammars; the most popular of them was introduced by Black et al. (1991), who proposed a measure called PARSEVAL to compare the quality of output grammars. We use only POS tag sequences as the lexical information of the training and testing data sets. The following is an example of a training sentence:

I/PRP need/VBP a/DT flight/NN to/TO Seattle/NNP leaving/VBG from/IN Baltimore/NNP making/VBG a/DT stop/NN in/IN Minneapolis/NNP

The POS tag sequence used in the IO training process is:

PRP VBP DT NN TO NNP VBG IN NNP VBG DT NN IN NNP

This corresponds to the parsed sentence:

((S (NP-SBJ I/PRP) (VP need/VBP (NP (NP a/DT flight/NN) (PP-DIR to/TO (NP Seattle/NNP)) (VP leaving/VBG (PP-DIR from/IN (NP Baltimore/NNP))) (VP making/VBG (NP (NP a/DT stop/NN) (PP-LOC in/IN (NP Minneapolis/NNP))))))))

We performed two different experiments on English sentences. In the first, as in other works, ATIS was divided into two distinct sets: a training set with approximately 90% of the data and a test set with the remaining 10%. Note that although our approach is unsupervised and does not need a bracketed data set, we do need the tree-style syntactic annotation of the test data for evaluation purposes. The results were computed using the tenfold cross-validation method. In the second experiment, 7400 sentences of the Wall Street Journal shorter than 11 words (WSJ-10) were selected. This data set was the main corpus used in training HIO, and the induced grammar was tested on both ATIS and WSJ sentences. We added one dummy non-terminal, named NULL, as the parent of the starting symbol used in the training process. For HIO, we used the same 48 POS labels as the grammar terminal symbols. As mentioned before, HIO takes the grammar induced by IO as an initial grammar, and then tries to distribute the rule probabilities over the parents of the left-hand non-terminals.
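The unlabeled PARSEVAL measures (UP, UR, F1) used throughout the evaluation can be computed roughly as in the following sketch (ours; exact bracketing conventions, e.g. whether single-word or whole-sentence spans are counted, differ between papers, so this is only illustrative):

def unlabeled_parseval(candidate_spans, gold_spans):
    # both arguments are collections of (start, end) word-index spans
    cand, gold = set(candidate_spans), set(gold_spans)
    matched = len(cand & gold)
    up = matched / len(cand) if cand else 0.0
    ur = matched / len(gold) if gold else 0.0
    f1 = 2 * up * ur / (up + ur) if up + ur else 0.0
    return up, ur, f1

# e.g. unlabeled_parseval({(0, 1), (2, 4)}, {(0, 1), (2, 3), (2, 4)})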
Table 1. The main characteristics of the ATIS corpus
No. of Sent.: 577
Max. Len.: 35
Min. Len.: 2
Ave. Len.: 8
No. of Words: 4609
Table 2. The results of IO and HIO on the ATIS data set (standard deviations in parentheses)

No. of non-terminals = 5
  IO:   UP 30.03          UR 25.27          F1 27.45
  HIO:  UP 42.54 (3.15)   UR 38.95 (5.8)    F1 40.67 (6.8)

No. of non-terminals = 15
  IO:   UP 42.19          UR 35.51          F1 38.56
  HIO:  UP 46.85 (3.2)    UR 40.9 (5.9)     F1 43.67 (3.9)
The initial grammar of IO is a full grammar, which contains all possible CNF rules, with random probabilities assigned to each rule. To reduce any possible bias, the process was repeated 100 times with different initial probabilities. The output of every experiment was evaluated on a test corpus using the extended PCYK parsing algorithm. This algorithm takes the extended PCFG model and a test sentence, generates all possible parses of the input sentence in a dynamic programming manner, and selects the most probable parse. The parsed sentences were then evaluated against the corresponding treebank of the same corpus.
In the first experiment, we selected the spoken-language transcription of the Texas Instruments subset of the Air Travel Information System (ATIS) corpus (Hemphill et al., 1990). This corpus, which has been automatically labeled, analyzed, and manually checked, is included in Penn Treebank II (Marcus et al., 1993). There are two different pieces of labeling information in the Treebank: part-of-speech tags and syntactic labels. We used 577 sentences of the corpus, with 4609 words. The main characteristics of this corpus are summarized in Table 1.
We ran the experiments with both 5 and 15 non-terminals (footnote 2), and every experiment was repeated 100 times. The means and standard deviations of every experiment for both IO and HIO are shown in Table 2. HIO was applied to ATIS using the tenfold cross-validation method; the results of IO's experiments have been collected from Clark (2001b). As shown, HIO significantly outperforms IO on this corpus. Moreover, the accuracy of IO is very sensitive to the number of non-terminals: the number of non-terminals used in the IO algorithm has a marked impact on the accuracy of the induced grammar, whereas HIO's accuracy is much less dependent on this factor. This is due to the richer model used by HIO (footnote 3). In the next section, we also discuss the impact of this factor on a very divergent language (i.e., Persian).
To show why HIO works better than IO, we selected an example to compare the output grammars. Fig. 1 shows the output of parsing the sentence "What flights do you have from Milwaukee to Tampa" with the grammars induced by HIO (Fig. 1(a)) and IO (Fig. 1(b)), compared to the gold treebank (Fig. 1(c)). Fig. 2 shows the parsing measures for these two parses. As shown, HIO detects the WH-NP ("What flights") correctly, while IO does not. The leaves of the parse trees are the POS tag sequences of the sentence; because unsupervised induction methods are used, the non-terminal labels of the parse trees are not shown. In fact, at the end of IO's training phase with 5 non-terminals, the POS tag sequence "WDT NNS" can only be reduced by the induced rule (NNT5 → WDT NNS), which is linguistically plausible. However, the parse tree of Fig. 1(b) shows that this rule does not participate in the derivation because of its low probability (i.e., 0.0056).
Footnote 2: By considering the NULL non-terminal, the numbers of non-terminals used in these experiments are 6 and 16, respectively.
Footnote 3: Note that HIO uses an extended PCFG model in which each rule is associated with the parent of its left-hand side.
[Fig. 1. Parse trees of the sentence "What flights do you have from Milwaukee to Tampa" produced by the grammars induced by HIO (a) and IO (b); the gold treebank parse of the same sentence is shown in (c).]
[Fig. 2. The PARSEVAL measures (UP, UR, F1, in percent) of the sentence "What flights do you have from Milwaukee to Tampa" under IO and HIO.]
After parsing the whole ATIS corpus using the grammar induced by IO, this rule is used on only 15 occasions. By distributing this probability over the parent non-terminals in HIO, we induce the following rules in the output grammar (the non-terminal to the left of ':' is the parent of the rule in the derivation sequence, and the number in parentheses is the corresponding probability):

1. NNT1: NNT5 → WDT NNS (0.0005)
2. NNT2: NNT5 → WDT NNS (0.0017)
3. NNT4: NNT5 → WDT NNS (0.1789)

Here, rule 3 has a high chance of being selected while parsing the test data set, due to its high probability.
Table 3. The results of different approaches on the ATIS data set

Approach    UP            UR            F1            NCB           <2CB          CB
EMILE       51.59 (2.70)  16.81 (0.69)  25.35 (1.00)  47.36 (4.77)  93.41 (2.2)   0.84 (0.11)
ABL         43.64 (0.02)  35.56 (0.02)  39.19 (0.02)  29.05 (0.00)  65.00 (0.07)  2.12 (0.00)
CDC-40      53.4          34.6          42.0          45.3          80.5          1.46
CCM         55.4          47.6          51.2          -             -             1.45
LEFT        19.89         16.74         18.18         7.16          33.45         4.70
RIGHT       39.9          46.4          42.9          24.89         62.61         2.18
IO          42.19         35.51         38.56         22.90         60.47         2.48
HIO-WSJ     45.2 (2.6)    41.6 (1.5)    43.32 (1.9)   18.21 (5.2)   58.27 (8.6)   2.37 (0.28)
HIO-ATIS    46.85 (3.2)   40.9 (5.9)    43.67 (4.15)  19.55 (6.4)   57.55 (4.5)   2.21 (0.31)
Having parsed the whole ATIS corpus with the extended induced grammar, the number of usages of the linguistically plausible rule (NNT5 → WDT NNS) increased to 54 (49 occasions for rule 3 and 5 occasions for rule 2).
To compare HIO with other unsupervised methods, we collected in Table 3 the results of testing several different approaches on the ATIS data set: EMILE (Adriaans and Haas, 1999), ABL (Van Zaanen, 2000), CDC with 40 iterations (Clark, 2001b), and CCM (Klein and Manning, 2005). LEFT and RIGHT are left- and right-branching baselines applied to ATIS; their results have been taken from Klein and Manning (2005). HIO was trained on two different data sets: it was first trained on WSJ-10 and tested on the ATIS data set, and then trained on 90% of ATIS and tested on the remaining 10%, the latter evaluated with the tenfold cross-validation method. HIO-WSJ and HIO-ATIS, respectively, correspond to these two experiments. As shown, HIO's output is better than all of the mentioned works except CCM, which seems to be the current state of the art in unsupervised grammar induction. Also, the results of HIO-ATIS are better than those of HIO-WSJ because the test and training data were selected from the same corpus.
We also tested HIO on WSJ in order to compare it with other published works, especially CCM (Klein and Manning, 2002). The mean F1 scores of HIO and other methods on WSJ-10 are shown in Fig. 3 (footnote 4). RANDOM selects a tree uniformly at random from the set of binary trees. DEP-PCFG is the result of duplicating the experiments of Carroll and Charniak (1992), using EM to train a dependency-structured PCFG. SUP-PCFG is a supervised PCFG parser trained on a 90-10 split of the WSJ-10 data set, using the treebank grammar with the Viterbi parse right-binarized. UBOUND is the upper bound on how well a binary system can cope with treebank sentences, which are generally flatter than binary, limiting the maximum precision. As shown, although HIO outperforms the right-branching baseline, CCM still ranks best among unsupervised methods. As we will explain in the next section, HIO also works on free-word-order languages (e.g., Persian).

5.2. Experiments on the Persian language

We have also tested HIO on Persian, which is linguistically very divergent from English. Persian, also known as Farsi, is the official language of Iran and Tajikistan, and one of the two main languages used in Afghanistan. It has been influenced by other languages such as Arabic (in Iran) and Russian (in Tajikistan); here, we use the Persian that is the official language of Iran. Arabic has heavily influenced Persian, but has not changed its structure: Persian has only borrowed a large number of words from Arabic, and in spite of this influence its syntactic and morphological structure has not been affected (Feili and Ghassem-Sani, 2004; Bateni, 1995). Persian is an SOV language with a large potential to be free-word-order, especially in prepositional adjuncts and complements. For example, adverbs can be placed at the beginning, in the middle, or at the end of a sentence, and this does not often change the meaning.
Footnote 4: The higher results on WSJ, compared with those on ATIS, are due to the tendency of ATIS to have shorter constituents, especially single-word ones. HIO and IO do not bracket single words, and that is why the F1 measure (especially recall) decreases.
[Fig. 3. F1 score for various models on WSJ-10: LEFT, RANDOM, DEP-PCFG, RIGHT, HIO, CCM, SUP-PCFG, and UBOUND.]
Table 4. The main features of the Peykareh corpus
No. of Sent.: 2200
Max. Len.: 10
Min. Len.: 2
Ave. Len.: 7
No. of POS tags: 18
No. of Words: 15421
Table 5. Baselines for the Persian treebank

Baseline          UP      UR      F1
Left branching    25.07   16.45   19.87
Right branching   17.63   11.57   13.97
Upper bound       94.35   100     97.09
The written style of Persian is right-to-left, using the Arabic alphabet (footnote 5). Vowels, generally known as short vowels (a, e, o), are usually not written, which causes some ambiguities in the pronunciation of Persian words (Khanlari, 1995).
In order to test HIO on Persian, we manually produced a treebank adopting various notions of the Penn Treebank. We chose a data set called Peykareh as the initial corpus for our experiments (Bijankhan, 2005, 2003). This corpus has more than 7000 sentences collected from formal newsletters in Persian. The tag set used in the corpus is the same as the one used by Amtrup et al. (2000) and Megerdoomian (2000); however, some of the tags were merged in order to be usable for syntactic analysis (footnote 6), leaving only 18 POS tags. As in the previous experiments on English, POS tag sequences were used as the only source of lexical information. We selected 2200 sentences shorter than 11 words from Peykareh for both HIO and IO. These sentences were then syntactically analyzed by linguistic experts, and their tree structures were manually produced. Table 4 shows the features of this corpus.
We tested the left- and right-branching baselines, and the upper bound of binarized parsing performance, on this treebank. Table 5 shows the main baselines of the Persian treebank. By comparing the results with those for English, one can observe that Persian, unlike English, is neither a left- nor a right-branching language (see Fig. 4). However, the high upper-bound F1 score implies that, in contrast with English, Persian is more likely to have a binary structure.
Footnote 5: By Persian, we mean the language used in Iran; in Tajikistan, Persian is written with a different script.
Footnote 6: The tag set presented by Megerdoomian (2000) and Bijankhan (2003, 2005) was designed only for morphological analysis tasks; the set needed to be re-arranged and merged from a syntactic analysis point of view.
[Fig. 4. Baseline F1 scores (LEFT, RIGHT, UBOUND) of Persian and English.]
Table 6. The results of IO and HIO on the Persian corpus with sentences shorter than 11 words (standard deviations in parentheses)

No. of non-terminals = 5
  IO:   UP 33.25 (0.64)   UR 31.93 (0.87)   F1 32.58 (0.74)
  HIO:  UP 41.26 (1.16)   UR 38.59 (2.3)    F1 39.88 (1.54)

No. of non-terminals = 15
  IO:   UP 44.35 (0.5)    UR 40.1 (0.68)    F1 42.19 (0.58)
  HIO:  UP 52.5 (1.29)    UR 50.28 (1.7)    F1 51.37 (1.5)
[Fig. 5. The effect of the number of non-terminals (5-20) on the F1 measure (%) in English and Persian.]
All other experiments with IO and HIO on the Persian treebank were evaluated using the tenfold cross-validation method, and were run one hundred times in order to alleviate possible unfavorable bias from the initial probabilities. In every run, we chose 10% of the corpus as test data and the rest as training data. Table 6 shows the results of these experiments. As shown, in Persian too, HIO outperforms IO by more than 20% with respect to the F1 measure.
Fig. 5 compares the effect of the number of non-terminals on the quality of the induced grammar (again based on F1) in Persian with that in English. This comparison implies that the size of the grammar (i.e., the number of non-terminals) affects Persian more than English; in other words, a larger grammar is required to model Persian. That is mainly because Persian is a free-word-order language, and thus harder to model. Furthermore, HIO can achieve even better results on free-word-order languages by using larger grammars: as shown in Fig. 5, the effect of increasing the number of non-terminals (from 5 to 20) is stronger for Persian than for English. So far, distributional induction methods such as CDC or CCM have not considered free-word-order languages (footnote 7).
Footnote 7: Clark (2001b) refers to this challenge too. His distributional induction technique relies on the fact that similar words occur in similar contexts. This is not the case in languages such as Persian, which is, to a large extent, a free-word-order language.
Such languages seem to be a challenge for those methods (Clark, 2001b). In other words, we think that although HIO's performance is not as good as that of distributional methods on English, it may achieve better results on free-word-order languages.
We used the source code of the IO algorithm published by Dr. Mark Johnson (footnote 8) as the base for our method. The code is written in ANSI C, was compiled with g++ under the Linux operating system, and ran on a 2.4 GHz PC with 256 MB of RAM. While running the system, any rule with probability less than 10⁻⁹ was omitted. The training phase of the traditional IO ended when the reduction of the negative log-probability of the observations became less than 10⁻⁷. In the experiments on ATIS, every run of the main IO needed on average about 208 iterations to converge, and it took about 42 h to reach a stable state; HIO needed, on average, about 50 more hours to estimate its extended model. While running IO on ATIS, the number of grammar rules decreased from 4095 to 178, whereas the final grammar induced by HIO had 1040 rules.

6. Conclusion

One of the weaknesses of a PCFG is its insensitivity to non-local relationships among nodes. Where these relationships are crucial, a PCFG is a poor modeling tool. Indeed, the sense in which the set of trees generated by a CFG is "context free" is precisely that the label on a node completely characterizes the relationships between the sub-tree dominated by that node and the set of nodes that properly dominate it (Johnson, 1998). Therefore, the results of any experiment with a PCFG would be less accurate in terms of labeled precision and recall metrics. Our idea for relaxing the independence assumptions implicit in a PCFG model is to systematically encode more information in each node of the generated parse trees (i.e., enhancing the node information with the parent non-terminal label); this helps the parser to detect a more accurate label for each node. Johnson's experiments show similar results: copying the label of a parent node onto the labels of its children dramatically improves the performance of a PCFG model in terms of labeled precision and recall (Johnson, 1998).
As was shown in this paper, adding the parent non-terminal label to the rule assumptions increases the quality of the output grammar. In other words, by relaxing the flawed independence assumption, we can construct a richer model for unsupervised grammar induction. It might be the case that augmenting the model with some other conditions further improves the accuracy of the results; however, adding extra conditions to the model may also cause an exponential increase in the size of the output grammar, which hampers the parsing process.
We described HIO, a new approach based on the well-known IO algorithm for inferring stochastic context-free grammars. We relaxed the independence assumptions usually made for PCFG rules in the parsing process by adding some extra contextual information to the derivation process; here, the parent non-terminal, which dominates the PCFG rule, was chosen as the contextual information. A novel unsupervised estimation model based on the IO algorithm was introduced and evaluated on English and Persian. The results showed a considerable improvement over almost all other unsupervised methods. For evaluating the output grammar, a modified version of the probabilistic CYK parsing method was used.
Our method outperforms most current works except CCM, presented by Klein and Manning (2005). CCM is a distributional method, which uses the idea that similar words occur in similar contexts; in other words, all contexts are assumed to follow some general patterns. This assumption does not hold in free-word-order languages such as Persian. Therefore, CCM might not be applicable to free-word-order languages, whereas our results on Persian show that HIO can achieve even more accurate results on such languages. Working on semi-supervised methods like the one described by Pereira and Schabes (1992) is one of our future plans; using a bracketed data set for inducing the extended PCFG can improve the speed and accuracy of the algorithm.
Footnote 8: Email: [email protected].
Choosing more information besides the parent non-terminal as contextual information may also be useful for further enhancing the accuracy of parsing methods; however, estimating the required parameters for the extra information is the main obstacle here.

References

Adriaans, P., Haas, E., 1999. Grammar induction as substructural inductive logic programming. In: Cussens, J. (Ed.), Proceedings of the 1st Workshop on Learning Language in Logic. Bled, Slovenia, pp. 117-127.
Allen, J., 1995. Natural Language Understanding. Benjamin/Cummings Pub.
Amaya, F., Benedi, J.M., Sanchez, J.A., 1999. Learning of stochastic context-free grammars from bracketed corpora by means of reestimation algorithms. In: The VIII Symposium on Pattern Recognition and Image Analysis, vol. 1, pp. 19-126, Bilbao.
Amtrup, J.W., Rad, H.R., Megerdoomian, K., Zajac, R., 2000. Persian-English Machine Translation: An Overview of the Shiraz Project. NMSU, CRL, Memoranda in Computer and Cognitive Science MCCS-00-319.
Baker, J.K., 1979. Trainable grammars for speech recognition. In: Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pp. 547-550.
Bateni, M., 1995. Tosif-e Sakhtari Zaban-e Farsi (Describing the Persian Structure). Amir-Kabir Press, Tehran, Iran (in Persian).
Bijankhan, M., 2003. Emkansanji baraye Tarhe Modelsaziye Zabane Farsi (The feasibility study for Persian language modeling). The Journal of Literature 162-163 (50-51), 81-96 (in Persian).
Bijankhan, M., 2005. Naghshe Peykarehaye Zabani dar Neveshtane Dasture Zaban: Mo'arrefiye yek Narmafzare Rayane'i (The role of corpus in generating grammar: presenting a computational software and corpus). Iranian Linguistic Journal 2 (19), 48-67 (in Persian).
Black, E., Abney, S., Flickinger, D., et al., 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In: DARPA Speech and Natural Language Workshop, pp. 306-311.
Black, E., Lafferty, J., Roukos, S., 1992a. Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals. In: Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pp. 185-192.
Black, E., Jelinek, F., Lafferty, J., Magerman, D., Mercer, R., Roukos, S., 1992b. Towards history-based grammars: using richer models for probabilistic parsing. In: Proceedings of the 5th DARPA Speech and Natural Language Workshop, Harriman, NY.
Brill, E., 1993. A corpus-based approach to language learning. Ph.D. Thesis, Department of Computer and Information Science, University of Pennsylvania.
Briscoe, T., Waegner, N., 1992. Robust stochastic parsing using the inside-outside algorithm. In: AAAI-92 Workshop on Statistically Based NLP Techniques.
Carroll, G., Charniak, E., 1992. Two experiments on learning probabilistic dependency grammars from corpora. Technical Report CS92-16, Department of Computer Science, Brown University, March.
Casacuberta, F., 1996. Growth transformations for probabilistic functions of stochastic grammars. IJPRAI 10 (3), 183-201.
Charniak, E., 1993. Statistical Language Learning. MIT Press, Cambridge, MA.
Charniak, E., 1996. Statistical Language Learning. MIT Press, Cambridge, London, UK.
Charniak, E., 1997a. Statistical parsing with a context-free grammar and word statistics. In: Proceedings of the 14th National Conference on Artificial Intelligence. AAAI Press/MIT Press, Menlo Park, pp. 598-603.
Charniak, E., 1997b. Statistical techniques for natural language parsing. AI Magazine 18 (4), 33-44.
Charniak, E., 2000. A maximum-entropy-inspired parser. In: NAACL 1, pp. 132-139.
Chen, S.F., 1995. Bayesian grammar induction for language modeling. In: Proceedings of the Association for Computational Linguistics, pp. 228-235.
Church, K., 1988. A stochastic parts program and noun phrase parser for unrestricted text. In: Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136-143.
Clark, A., 2001a. Unsupervised induction of stochastic context-free grammars using distributional clustering. In: The Fifth Conference on Natural Language Learning.
Clark, A., 2001b. Unsupervised language acquisition: theory and practice. Ph.D. Thesis, University of Sussex.
Collins, M., 1996. A new statistical parser based on bigram lexical dependencies. In: Proceedings of the 34th Annual Meeting of the ACL, Santa Cruz.
Collins, M.J., 1997. Three generative, lexicalized models for statistical parsing. In: ACL 35/EACL 8, pp. 16-23.
Feili, H., Ghassem-Sani, G., 2004. An application of lexicalized grammars in English-Persian translation. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004). Universidad Politecnica de Valencia, Spain, pp. 596-600.
Grune, D., Jacobs, C.J.H., 1990. Parsing Techniques, a Practical Guide. Ellis Horwood, Chichester, England.
Hemphill, C.T., Godfrey, J., Doddington, G., 1990. The ATIS spoken language systems pilot corpus. In: DARPA Speech and Natural Language Workshop, Hidden Valley, Pennsylvania, June.
Homes, J.N., 1988. Speech Synthesis and Recognition. Van Nostrand Reinhold.
Jelinek, F., Lafferty, J.D., Magerman, D., Mercer, R., Ratnaparkhi, A., Roukos, S., 1994. Decision-tree parsing using a hidden derivation model. In: Proceedings of the 1994 Human Language Technology Workshop, pp. 272-277.
Johnson, M., 1998. The effect of alternative tree representations on tree bank grammars. In: Powers, D.M.W. (Ed.), NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp. 39-48.
Jurafsky, D., Martin, J.H., 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice-Hall Inc., New Jersey.
Kasami, T., 1965. An efficient recognition and syntax algorithm for context-free languages. Scientific Report AFCRL-65-758, Air Force Cambridge Research Laboratory, Bedford, MA.
Kehler, A., Stolcke, A., 1999. Preface. In: Kehler, A., Stolcke, A. (Eds.), Unsupervised Learning in Natural Language Processing: Proceedings of the Workshop. Association for Computational Linguistics.
Khanlari, P., 1995. Tarikh-e Zaban-e Farsi (History of the Persian Language). Simorgh Press, Tehran, Iran (in Persian).
Klein, D., Manning, C.D., 2001a. Distributional phrase structure induction. In: Proceedings of the Fifth Conference on Natural Language Learning (CoNLL 2001), pp. 113-120.
Klein, D., Manning, C.D., 2001b. Natural language grammar induction using a constituent-context model. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (Eds.), Advances in Neural Information Processing Systems 14 (NIPS 2001), vol. 1. MIT Press, pp. 35-42.
Klein, D., Manning, C.D., 2002. A generative constituent-context model for improved grammar induction. In: ACL 40, pp. 128-135.
Klein, D., Manning, C.D., 2004. Corpus-based induction of syntactic structure: models of dependency and constituency. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 04).
Klein, D., Manning, C.D., 2005. The Unsupervised Learning of Natural Language Structure. Ph.D. Thesis, Department of Computer Science, Stanford University.
Lari, K., Young, S.J., 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language 4, 35-56.
Magerman, D.M., Marcus, M.P., 1990. Parsing a natural language using mutual information statistics. In: Proceedings of the Eighth National Conference on Artificial Intelligence, August.
Magerman, D., Marcus, M., 1991. Pearl: a probabilistic chart parser. In: Proceedings of the 1991 European ACL Conference, Berlin, Germany.
Magerman, D., Weir, D., 1992. Efficiency, robustness and accuracy in Picky chart parsing. In: Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pp. 40-47.
Magerman, D.M., 1995. Statistical decision-tree models for parsing. In: Proceedings of the ACL Conference, June 1995.
Manning, C., Schutze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Marcken, C., 1995. On the unsupervised induction of phrase-structure grammars. In: Proceedings of the 3rd Workshop on Very Large Corpora.
Marcken, C., 1996. Unsupervised language acquisition. Ph.D. Thesis, Department of Electrical Engineering and Computer Science, MIT.
Marcus, M.P., Santorini, B., Marcinkiewicz, M.A., 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, 313-330.
Megerdoomian, K., 2000. Persian Computational Morphology: A Unification-Based Approach. NMSU, CRL, Memoranda in Computer and Cognitive Science MCCS-00-320.
Paskin, M.A., 2002. Grammatical bigrams. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (Eds.), Advances in Neural Information Processing Systems, 14. MIT Press, Cambridge, MA.
Pereira, F., Schabes, Y., 1992. Inside-outside re-estimation from partially bracketed corpora. In: Proceedings of the 30th Annual Meeting of the ACL, pp. 128-135.
Sanchez, J.A., Benedi, J.M., Casacuberta, F., 1996. Comparison between the inside-outside algorithm and the Viterbi algorithm for stochastic context-free grammars. In: Proceedings of the 6th International Workshop on Advances in Structural and Syntactical Pattern Recognition, pp.
50-59.
Sanchez, J.A., Benedi, J.M., 1998. Estimation of the probability distributions of stochastic context-free grammars from the K-best derivations. In: The 5th International Conference on Spoken Language Processing.
Sanchez, J.A., Benedi, J.M., 1999. Probabilistic estimation of stochastic context-free grammars from the k-best derivations. In: The VIII Symposium on Pattern Recognition and Image Analysis, vol. 2, pp. 7-8, Bilbao.
Schabes, Y., Roth, M., Osborne, R., 1993. Parsing the Wall Street Journal with the inside-outside algorithm. In: Proceedings of the 6th Conference of the European Chapter of the ACL, pp. 341-347.
Stolcke, A., Omohundro, S.M., 1994. Inducing probabilistic grammars by Bayesian model merging. In: Grammatical Inference and Applications: Proceedings of the Second International Colloquium on Grammatical Inference. Springer Verlag.
Thanaruk, T., Okumaru, M., 1995. Grammar acquisition and statistical parsing. Journal of Natural Language Processing 2 (3).
Van Zaanen, M., 2000. ABL: alignment-based learning. In: COLING 2000, pp. 961-967.
Van Zaanen, M., 2002. Bootstrapping structure into language: alignment-based learning. Ph.D. Thesis, School of Computing, University of Leeds.
Van Zaanen, M., Adriaans, P.W., 2001. Comparing two unsupervised grammar induction systems: alignment-based learning vs. EMILE. Technical Report TR2001.05, School of Computing, University of Leeds.
Younger, D., 1967. Recognition and parsing of context-free languages in time O(n3). Information and Control 10, 189-208.
Yuret, D., 1998. Discovery of linguistic relations using lexical attraction. Ph.D. Thesis, MIT.