TSINGHUA SCIENCE AND TECHNOLOGY  ISSN 1007-0214 06/16  pp295-306  Volume 11, Number 3, June 2006

Machine Translation Using Constraint-Based Synchronous Grammar*

WONG Fai (黄 辉), DONG Mingchui (董名垂)**, HU Dongcheng (胡东成)

Department of Automation, Tsinghua University, Beijing 100084, China

Abstract: A synchronous grammar based on the formalism of context-free grammar was developed by generalizing the first component of production that models the source text. Unlike other synchronous grammars, this grammar allows multiple target productions to be associated with a single production rule, which can be used to guide a parser to infer different possible translational equivalences for a recognized input string according to the feature constraints of symbols in the pattern. An extended generalized LR algorithm was adapted to the parsing of the proposed formalism to analyze the syntactic structure of a language. The grammar was used as the basis for building a machine translation system for Portuguese to Chinese translation. The empirical results show that the grammar is more expressive when modeling the translational equivalences of parallel texts for machine translation and grammar rewriting applications.

Key words: constraint-based synchronous grammar; machine translation; Portuguese to Chinese translation

Received: 2004-11-10

Introduction

In machine translation, analysis of the structural deviations of the language pairs is key to transforming one language into another. This analysis requires a large number of structural transformations, both grammatically and conceptually. The problems of syntactic complexity and word sense ambiguity have been the major obstacles to high-quality translations. Several approaches have been proposed to overcome the obstacles and, hence, improve the quality of translation systems. Ranbow and Satta[1] stated that much of theoretical linguistics can be formulated in a very natural manner as correspondences between layers of representations.

* Supported by the Center of Scientific and Technological Research of University of Macao (CATIVO: 3678)

** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-10-82902115

Similarly, many problems in natural language processing, in particular language translation and grammar rewriting systems, can be expressed as transduction through the use of synchronous formalisms[2-6]. Synchronous grammars are becoming more and more popular for the formal description of parallel texts representing translations. The underlying idea of such formalisms is to combine two generative devices through pairing their productions in such a way that links the right-hand side non-terminal symbols in the paired productions. However, such formalisms are less expressive and unable to express mutual translations that have different lengths and crossing dependencies. Moreover, synchronous formalisms do not deal with unification and feature structures, as in unification-based formalisms, that give patterns additional power for describing constraints on features. For example, multiple context-free grammar[4] uses functions for the non-terminal symbols in the productions to further interpret the symbols in the target generation. Inversion transduction grammar (ITG)[7,8] has been proposed for


simultaneously bracketing parallel corpora as a variant of syntax directed translation schema[9]. But these formalisms lack the expressive ability to describe discontinuous constituents in linguistic expressions. The generalized multitext grammar (GMTG)[5,10] maintains two sets of productions as components, one for each language, for modeling parallel texts. Although GMTG is more expressive and can be used to express independent rewriting, the lack of flexibility in describing constraints on the features associated with a non-terminal makes it difficult to develop a practical machine translation system. Lexical-semantic formalisms and unification-based grammar formalisms such as lexical functional grammar (LFG)[11-13] use constituent structures (c-structure) and functional structures (f-structure) for semantic interpretation of words. Head-driven phrase structure grammar (HPSG)[14] expresses units of linguistic information such as the syntactic, semantic, and pragmatic information of utterances in typed feature structures[15] to facilitate computationally precise descriptions of natural language syntax and semantics. The descriptive power of these grammars may be defined specifically enough to give correct translations of individual usages of words and phrases. However, translation systems based on these formalisms are not possible without much more efficient parsing and disambiguation algorithms for these formalisms. Moreover, the construction of these formalisms needs deeper grammatical knowledge to develop grammatical rules. The integration of syntactic and semantic information requires the use of many features, making the task of writing specific constraint grammars too difficult for a non-specialist. Another limitation for translation applications is that these formalisms are not ready to describe the relations between the two languages of a pair that researchers need for translation systems. This paper describes a variation of synchronous grammar, the constraint-based synchronous grammar (CSG), based on the formalism of context-free grammar. The CSG uses feature structures, as in unification-based grammars, to generalize the sentential patterns in the source text while grouping the corresponding target rewriting rules for each production in a vector representing the possible translation patterns for the source production. The rule choice for target


generation is based on the feature constraints of the non-terminal symbols in the pattern. The motivation is three-fold. First, synchronous formalisms have been proposed to model parallel texts, since such formalisms can infer the synchronous structures of texts for two different languages through the grammatical representation of their syntactic deviations. These synchronous structures are quite useful for the analysis of language pairs in machine translation systems. Secondly, augmenting the synchronous models with feature structures gives the pattern recognition additional power to describe gender, number, agreement, etc., since the descriptive power of unification-based grammars is considerably greater than that of the classical CFG[11]. Finally, combining the notational and intuitive simplicity of the CFG with the power of feature constraints yields a grammar formalism with better descriptive power than the CFG, in which parsing and generation are controlled by the feature constraints of symbols to provide better word sense and syntax disambiguation. The paper describes the formalism of the CSG with a parsing algorithm for the formalism. The CSG is then used to construct a machine translation system for Portuguese to Chinese translation.

1  Constraint-Based Synchronous Grammars

The CSGs are defined using the syntax of context-free grammar for synchronous language pairs. The formalism consists of a set of generative productions, each constructed from a pair of context-free grammar rules, with or without syntactic head and link constraints for the non-terminal symbols in the patterns. The first component (on the right hand side of the productions) represents the sentential patterns of the source language, the source component, while the second component represents the translation patterns in the target language, the target component. Unlike in other synchronous formalisms, the target component of a production consists of one or more generative rules, each associated with zero or more control conditions based on the features of the non-terminal symbols of the source rule, describing the possible generation correspondences in the target translation. The source


components in the CSG are then generalized by handling the feature constraints in the target component, which also helps reduce the grammar size. For example, the following is one of the productions used in the Portuguese to Chinese translation system:

S → NP1 VP* NP2 PP NP3 {[NP1 VP1 NP3 VP2 NP2 ; VPcate=vb1, VPs:sem=NP1sem, VPio:sem=NP2sem, VPo:sem=NP3sem], [NP1 VP NP3 NP2 ; VPcate=vb0, VPs:sem=NP1sem, VPio:sem=NP2sem]}   (1)

The production has two components besides the reduced syntactic symbol on the left hand side, the first modeling Portuguese and the second Chinese. The target component in this production consists of two generative rules maintained in a vector, each of which involves control conditions based on the symbol features from the source component, which are used as the selectional preferences during parsing. These constraints in the parsing/generation algorithm are used for inferring not only the input structure, to decide which structures are possible or probable, but also the output text structure for the target translation. For example, the condition expression VPcate=vb1, VPs:sem=NP1sem, VPio:sem=NP2sem, VPo:sem=NP3sem specifies that the senses of the first, second, and third nouns (NPs) in the input string match the sense of the subject and the direct and indirect objects governed by the verb, VP, with category type vb1. When the condition is satisfied, the source structure is successfully recognized and the corresponding structure of the target language, NP1 VP1 NP3 VP2 NP2, is also determined. Non-terminal symbols in the source and target rules are linked if they are given the same index "subscripts" for multiple occurrences, such as NP in the production S → NP1 VP NP2 PP NP3 [NP1 VP* NP3 NP2]. Symbols that appear only once in both the source and target rules, such as VP, are implicitly linked to give the synchronous rewriting. Linked non-terminal symbols must be derived from a sequence of synchronized pairs. Consider the production S → NP1 VP NP2 PP NP3 [NP1 VP* NP3 NP2]: the second NP (NP2) in the source rule corresponds to the third NP (NP2) in the target rule and the third NP (NP3) in the source rule corresponds to the second NP (NP3) in the target pattern, while the first NP (NP1) and VP correspond to their counterparts in the source and target rules. The symbol marked by an "*" is designated as the head


element in the pattern, with the features of the designated head symbol propagated to the reduced non-terminal symbol on the left side of the production rule to achieve the feature inheritance in the CSG formalism. The use of feature structures associated with non-terminal symbols will be explained further in Section 2.2. When modeling natural languages, especially when processing language pairs, non-standard linguistic phenomena such as crossing dependencies and discontinuous constituents must be carefully analyzed due to the structural differences between the two languages, in particular for languages from different families such as Portuguese and Chinese[16,17]. Linguistic expressions can vanish or appear in translation. For example, the preposition (PP) in the source rule does not show up in any of the target rules in Production (1). In contrast, Production (2) allows the Chinese characters "本" and "辆" to appear in the target rules as modifiers of the noun (NP) together with the quantifier (num) as the proper translation for the source text. This explicitly relaxes the synchronization constraint, so that the two components can be rewritten independently.

NP → num NP* {[num 本 NP ; NPsem=SEM_book], [num 辆 NP ; NPsem=SEM_automobile]}   (2)

An important strength of the CSG is its expressive power to describe discontinuous constituents. Chinese frequently uses discontinuously distributed combination words. For example, take the sentence pair ["Ele vendeu-me todas as uvas. (He sold me all the grapes.)", "他把所有的葡萄卖给了我"]. The Chinese preposition "把" and the verb "卖给了" should be paired with the Portuguese verb "vendeu", which causes a fan-out with the discontinuous constituent in the Chinese component. (The term fan-out describes a word whose translation is paired with discontinuous words in the target language, e.g., "vendeu[-pro] [NP]" in Portuguese gives the similar English translation "sell [NP] to [pro]", so "vendeu", in this case, corresponds to both "sell" and "to".) The following fragment of CSG productions represents such relationships.

S → NP1 VP* NP2 NP3 {[NP1 VP1 NP3 VP2 NP2 ; VPcate=vb0, …], ...}
VP → vendeu* {[把, 卖给了; ∅]}   (3)

In Production (3), the discontinuous constituents of


VP from the source rule are represented by VP1 and VP2 in the target rule, where the “superscripts” are added to indicate the pairing to VP in the target component. The corresponding translation constituents in the lexicalized production are separated by commas representing the discontinuity between constituents “把” and “卖给了” in the target translation. During the rewriting phase, the corresponding constituents will be used to replace the syntactic symbols in the pattern rule.
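To make the shape of a CSG production concrete, the following is a minimal sketch in Python (the class and field names are illustrative assumptions, not the data structures of the system described here) of how a production such as (1) or (3) could be encoded: a source rule with its head mark, and an ordered vector of target rules, each carrying a target pattern and a feature constraint.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TargetRule:
        # One tuple of the target component: a target pattern plus its constraint.
        pattern: List[str]        # e.g. ["NP1", "VP1", "NP3", "VP2", "NP2"]
        constraint: List[str]     # conjunct feature tests, e.g. ["VP.cate=vb1", "VP.s:sem=NP1.sem"]

        def specificity(self) -> int:
            # Number of conjunct features in the constraint (psi in Section 2.1).
            return len(self.constraint)

    @dataclass
    class Production:
        # A CSG production: LHS symbol, source rule (head marked with "*"),
        # and the target component as an ordered vector of target rules.
        lhs: str                  # e.g. "S"
        source: List[str]         # e.g. ["NP1", "VP*", "NP2", "PP", "NP3"]
        targets: List[TargetRule] = field(default_factory=list)

        def ordered_targets(self) -> List[TargetRule]:
            # Most specific constraint first.
            return sorted(self.targets, key=lambda t: -t.specificity())

    # Production (1), with feature names abbreviated:
    p1 = Production(
        lhs="S",
        source=["NP1", "VP*", "NP2", "PP", "NP3"],
        targets=[
            TargetRule(["NP1", "VP1", "NP3", "VP2", "NP2"],
                       ["VP.cate=vb1", "VP.s:sem=NP1.sem",
                        "VP.io:sem=NP2.sem", "VP.o:sem=NP3.sem"]),
            TargetRule(["NP1", "VP", "NP3", "NP2"],
                       ["VP.cate=vb0", "VP.s:sem=NP1.sem", "VP.io:sem=NP2.sem"]),
        ],
    )

    # The lexical rule of Production (3): two discontinuous target constituents, no constraint.
    p_vendeu = Production(lhs="VP", source=["vendeu*"],
                          targets=[TargetRule(["把", "卖给了"], [])])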

2  Definitions and Features Representation

2.1  Definitions

Let L be a context-free language defined over the terminal symbols VT, generated by the context-free grammar G using the non-terminal symbols VN disjoint from VT, the starting symbol S, and productions of the form A → w where A is in VN and w is in (VN ∪ VT)*. Let Z be a set of integers, with each non-terminal symbol in VN assigned an integer, Γ(VN) = {Wω | W∈VN, ω∈Z}. The elements of Γ(VN) are indexed non-terminal symbols. The rules are then extended to include the set of terminal symbols VT′ as the translation in the target language, disjoint from VT (VT∩VT′=∅). Let R = {r1, …, rn | ri∈(Γ(VN)∪VT′), 1≤i≤n} be a finite set of rules and C = {c1, …, cm} be a finite set of constraints over the associated features of (Γ(VN)∪VT), where the features of the non-terminals Γ(VN), the syntactic symbols, are inherited from the designated head element during rule reduction. A target rule is defined as a pair [r∈R*, c∈C*] in γ, where γ = R*×C* has the form [r, c]. Define ψ(γi) to denote the number of conjunct features considered in the associated constraint, which determines the degree of generalization of the constraint. Two rules γi and γj are then orderable, γi ≺ γj, if ψ(γi) ≥ ψ(γj) (or γj ≺ γi if ψ(γi) < ψ(γj)). For γi ≺ γj (ψ(γi) ≥ ψ(γj)), the constraint of the rule γi is said to be more specific, while the constraint of γj is more general. In the following, consider a set of related target rules working over the symbols w′ on the right hand side (RHS) of a production A → w′, the source rule, where w′∈Γ(VN)∪VT. All of these non-terminals are co-indexed as links. Definition 1  A target component is defined as an ordered vector of target rules in γ having the form σ =


{γ1, …, γq}, where 1≤i≤q denotes the i-th tuple of σ. The rules are arranged in the order of γ1 ≺ γ2 ≺ … ≺ γq. In the rule reduction, the association conditions of the target rules are used to investigate the features of the corresponding symbols in the source rules, similar to that of feature unification, to determine if the active reduction is successful or not. This also helps determine the proper structure for the target correspondence. Definition 2 A CSG is defined to be a 5-tuple G= (VN, VT, P, CT, S) which satisfies the following conditions: •  VN is a finite set of non-terminal symbols. •  VT is a finite set of terminal symbols which is disjoint with VN. •  CT is a finite set of target components. •  P is a finite set of productions of the form A→αβ, where α∈(Γ(VN) ∪ VT)*, β∈CT, and the nonterminal symbols that occur in both the source and target rules are linked using the index given by Γ(VN). (Link constraints determined by the symbol indices represent a trivial connection between corresponding symbols in the source and target rules. Hence, we assume without loss of generality, that the index is only given to the nonterminal symbols that have multiple occurrences in the production rules. The analysis assumes that “S→NP1 VP2 PP3 NP4 {NP1 VP21 NP4 VP22}” implies “S→NP1 VP PP NP2 {NP1 VP1 NP2 VP2}”). •  S∈VN is the initial symbol. For example, the following CSG productions (Table 1) can generate both of the parallel texts [“Ele deu um livro ao José. (He gave a book to José)”, “他给了若泽 一本书”] and [“Ele comprou um livro ao José. (He bought a book from José)”, “他向若泽买了一本书”]. A set of productions P is said to accept an input string s iff there is a derivation sequence Q for s using the source rules of P and any of the constraints associated with every target component in Q are satisfied. If there is no constraint associated to a target rule, during the parsing phase, the reduction of the source rule is assumed to be valid at all times. Similarly, P is said to translate s iff there is a synchronized derivation sequence Q for s such that P accepts s, and the link constraints of the associated target rules in Q are satisfied. The derivation Q then produces a translation t as the resulting sequence of terminal symbols included in the



Table 1  CSG productions

S → NP1 VP* NP2 PP NP3  {[NP1 VP1 NP3 VP2 NP2 ; VPcate=vb1, VPs:sem=NP1sem, VPio:sem=NP2sem, VPo:sem=NP3sem], [NP1 VP NP3 NP2 ; VPcate=vb0, VPs:sem=NP1sem, VPio:sem=NP2sem, VPo:sem=NP3sem]}   (4)
VP → v  {[v ; ∅]}   (5)
NP → det NP*  {[NP ; ∅]}   (6)
NP → num NP*  {[num 本 NP ; NPsem=SEM_book]}   (7)
NP → n  {[n ; ∅]}   (8)
NP → pro  {[pro ; ∅]}   (9)
PP → p  {[p ; ∅]}   (10)
n → José  {[若泽 ; ∅]} | livro  {[书 ; ∅]}   (11)
pro → ele  {[他 ; ∅]}   (12)
v → deu  {[给了 ; ∅]} | comprou  {[向, 买了 ; ∅]}   (13)
num → um  {[一 ; ∅]}   (14)
p → a  {∅}   (15)
det → o  {∅}   (16)

Note: As with the head element designation in productions, when the RHS of a production contains only one symbol, that symbol is the head element for the reduction to the left hand side (LHS). Thus, no head mark "*" is given for such rules and we assume that "VP → v*" implies "VP → v".

determined target rules in Q. The translation of an input string s essentially consists of three steps. First, the input string is parsed using the source rules from the productions. Secondly, the link constraints are propagated from the source rule to the target component to determine and build a target derivation sequence. Finally, the translation of the input string is generated from the target derivation sequence.
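As an illustration of how a reduction is accepted and paired with a target rule, the following sketch (reusing the Production and TargetRule classes from the sketch in Section 1; the helper names and the feature encoding are assumptions, not the system's API) tries the constraints in order, from the most specific to the most general, and returns the first satisfiable target rule.

    def satisfied(constraint, feats):
        # Hypothetical evaluator: "VP.cate=vb1" holds iff feats["VP.cate"] == "vb1";
        # "VP.s:sem=NP1.sem" holds iff the two features carry the same value.
        left, _, right = (part.strip() for part in constraint.partition("="))
        return feats.get(left) == feats.get(right, right)

    def select_target_rule(production, feats):
        # Try target rules from most specific to most general; an empty
        # constraint always succeeds, so such a reduction is valid at all times.
        for rule in production.ordered_targets():
            if all(satisfied(c, feats) for c in rule.constraint):
                return rule     # fixes the target pattern used in the derivation Q
        return None             # no constraint satisfied: the reduction is rejected

For Production (4), a vb1 verb whose subject and object senses match selects the crossed pattern NP1 VP1 NP3 VP2 NP2, while a vb0 verb selects NP1 VP NP3 NP2, which is how the derivation sequence Q and the translation t are determined.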

2.2  Features representation

In the CSG, linguistic entities are modeled as feature structures which give patterns additional power for describing gender, number, semantics, attributes, and number of arguments required by a verb, and so on. This information is encoded in the commonly used attribute value matrices (AVMs) attached to each of the lexical and syntactic symbols in the CSG. The attribute or feature names are typically written in upper case in the AVMs with the values written to the right of the feature name in lower case:

⎡ CAT   vb0  ⎤
⎢ SUBJ  hum  ⎥
⎢ DOBJ  obj  ⎥
⎢ IOBJ  hum  ⎥        (17)
⎢ TEN   pp   ⎥
⎢ PER   3rd  ⎥
⎣ NUM   sg   ⎦

This structure allows specification of syntactic dependencies such as agreement and subcategorization in the patterns. Unlike other unification-based grammars[12,15], the unification is not completed but only the conditions that are explicitly expressed in the rule constraints are tested and unified. The unification process can be performed in a constant time. The use of feature constraints has to be restricted to maintain the parsing and generation algorithm efficiency, and especially to prevent generation of a large number of ambiguous structure candidates. The word selection in the target language can also be achieved by checking features. The parsing and generation algorithm propagates the features information to the reduced symbol from the



designated head element in the pattern to realize the feature inheritance. Features can either be put in the lexical dictionary, isolated from the formalism to simplify construction of the analytical grammar, or can be explicitly encoded in the preterminal rules as:

pro→José:[CAT:pro,NUM:sg,GEN:masc,SEM:hum] {[若泽; ∅]};
n→livro:[CAT:n,NUM:sg,GEN:masc,SEM:artifact+book] {[书; ∅]},

where the feature set is bracketed and its elements are separated by a semi-colon, and the name and the value of a feature are delimited by a colon to represent the feature pair. Another way to enhance the CSG formalism is to apply soft preferences instead of hard constraints in the feature unification process. Two problems motivate this. First, a single lexical item often has more than one combination of feature values in natural languages, i.e., one word may have several translations according to the different senses and pragmatic uses of the word, which raises the problem of word sense disambiguation[18]. Secondly, the conventional feature unification method can only tell whether the process succeeds or fails. For example, if relatively minor conditions fail during unification, all the related candidates are rejected without any flexibility in choosing the next preferable or most probable candidate. To express the importance of conditions, a weight is therefore associated with each feature structure. The matching features are then ranked according to their weights rather than the order of the lexical items expressed in grammars or dictionaries. In the current system, each symbol has an original weight; when the feature constraints are checked, the preference measurements create a penalty that reduces this weight to the effective weight of the associated features in a particular context. The features with the largest weight are chosen as the most preferable content.
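As an illustration of this weighting scheme only (the weight values, the penalty, and the function name below are assumptions for exposition, not the figures used in the system), a candidate's effective weight could be computed as follows.

    def rank_candidates(candidates, constraints, penalty=0.3):
        # Soft preferences instead of hard unification: a failed feature test
        # lowers the candidate's effective weight rather than rejecting it.
        scored = []
        for feats, weight in candidates:          # feats: mapping of feature name to value
            effective = weight
            for name, wanted in constraints.items():
                if feats.get(name) != wanted:     # a (possibly minor) condition fails
                    effective -= penalty          # apply a penalty instead of discarding
            scored.append((effective, feats))
        scored.sort(key=lambda pair: -pair[0])    # most preferable candidate first
        return scored

    # Two senses of one word, checked against the constraints CAT=n, SEM=artifact+book:
    candidates = [({"CAT": "n", "SEM": "artifact+book"}, 1.0),
                  ({"CAT": "n", "SEM": "hum"}, 1.0)]
    best_weight, best_feats = rank_candidates(
        candidates, {"CAT": "n", "SEM": "artifact+book"})[0]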

3  Parsing Algorithm

The CSG formalism can be parsed by any CFG parsing algorithm, including the Earley[19,20] and generalized LR algorithms[21,22], augmented to take into account the feature constraints and the inference of the target structure. The parsing algorithm used for the formalism described here is based on the generalized LR algorithm developed for translation systems. Since the method uses a parsing table, it is considerably more efficient than Earley's non-compiled method, which has

to compute a set of LR items at each parsing stage[23]. The generalized LR algorithm was first introduced by Tomita for parsing augmented context-free grammars; it can handle non-determinism and ambiguity through the use of a graph-structured stack while retaining most of the advantages of standard LR parsing, especially when the grammar is close to the LR grammars. The method is quite suitable for parsing natural languages with grammatical features such as prepositional phrase attachments, which are inherently ambiguous. The generalized LR algorithm is a variation of standard LR parsing that uses a shift-reduce approach with an extended LR parsing table to guide its actions, allowing multiple action entries such as shift/reduce and reduce/reduce to handle non-deterministic parsing with pseudo-parallelism. For this formalism, the parsing table was further extended by including the feature constraints and target rules in the action table. The strategy is thus to parse the source rules of the CSG productions through the normal shift actions proposed by the parsing table, while checking the associated conditions to determine whether the active reduction is valid, depending on whether the working symbols in the patterns fulfill the feature constraints.

3.1  CSG parsing table

Figure 1 shows an extended LR(1) parsing table for Productions (4)-(16) as constructed by using the LR table construction method[24] extended to consider the rule components of productions by associating the corresponding target rules and constraints. The parsing table includes a compact ACTION-GOTO table and a CONSTRAINT-RULE table. The ACTION-GOTO table is indexed by a state symbol s (row) and a symbol x ∈VN∪VT, including the end marker “⊥”. The entry ACTION[s, x] can be one of sn, rm, acc, or blank. sn denotes a shift action representing GOTO[s, x]=n which defines the next state for the parser; rm denotes a reduction using the m-th production located in the entry of CONSTRAINT-RULE in state s; acc denotes the accept action and blank indicates a parsing error. The CONSTRAINT-RULE table is indexed by the state symbol s (row) and the number of productions m that may be applied for reduction in state s. The entry CONSTRAINT-RULE[s, m] consists of a set of productions together with the target rules and features

Fig. 1 Extended LR(1) parsing table
(For each parser state 0-22, the figure lists the compact ACTION/GOTO entries over the grammar symbols and terminals, and the CONSTRAINT-RULE entries giving the productions reducible in that state together with their constraints/target rules, e.g., v → deu {[给了; ∅]} and S → NP1 VP* NP2 PP NP3 with the target component of Production (4).)

constraints that are used for validating whether the active parsing node can be reduced and for identifying the corresponding target generative rule for the reduced production. For simplicity, the productions used for building the parsing table are deterministic, so no conflicting actions, such as shift/reduce and reduce/reduce, appear in the constructed parsing table.
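For illustration only, the two tables can be pictured as the following lookup structure (the dictionary layout and the sample entries are assumptions for exposition, not the encoding used by the system).

    # ACTION-GOTO: (state, symbol) -> ("shift", n), ("reduce", m) or ("accept",);
    # an absent entry corresponds to a blank cell, i.e., a parsing error.
    ACTION = {
        (0, "pro"): ("shift", 8),
        (6, "⊥"): ("accept",),
        # ...
    }
    GOTO = {(0, "NP"): 1}      # next state after reducing to a non-terminal

    # CONSTRAINT-RULE: (state, m) -> the m-th reducible production in that state,
    # together with its target rules and feature constraints.
    CONSTRAINT_RULE = {
        (8, 1): ("NP → pro", [("pro", None)]),     # no constraint: reduction always valid
        (22, 1): ("S → NP1 VP* NP2 PP NP3", "target component of Production (4)"),
    }

    def action(state, symbol):
        return ACTION.get((state, symbol))         # None signals a parsing error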

3.2  CSG parser

In the parsing process, the algorithm maintains a number of parsing processes in parallel, each of which represents an individual parsed result to handle nondeterministic cases. Each process has a stack and scans the input string from left-to-right. Actually, each process does not maintain its own separate stack, but the multiple stacks are represented by using a single directed acyclic graph, a graph-structured stack, for parsing multiple possible structures. The process has two

major components, shift(i) and reduce(i), which are called at each position i=0, 1, …, n in an input string I = x1x2…xn. The shift(i) process shifts the top stack vertex v on xi from its current state s to some successor state s’ by creating a new leaf v’; establishes an edge from v’ to the top stack v; and makes v’ the new top stack vertex. The reduce(i) process executes a reduction action on production p by following the chain of parent links down from the top stack vertex v to the ancestor vertex from which the process began scanning for p, while removing intervening vertices off the stack. Every reduction action in reduce(i) has a set C of ordered constraints, c1 ≺ … ≺ cm, in the production, each of which is associated with a target rule that may be the probable corresponding target structure for the production, depending on whether the paired constraint is satisfied according to the features of the parsed string p. Before reduction, the constraints cj (1≤j≤m) are tested in order starting from the most specific one with the evaluation process stopping once a positive result



is obtained from the evaluation. The corresponding target rule for the parsed string is determined and attached to the reduced syntactic symbol which will be used to rewrite the target translation in the generation phase. At the same time, the features information is inherited from the designated head production element. The parsing algorithm for the CSG formalism is given in Fig. 2.

PARSE(grammar, x1 … xn)
    xn+1 ⇐ ⊥
    Ui ⇐ ∅   (0 ≤ i ≤ n)
    U0 ⇐ v0
    for each terminal symbol xi   (1 ≤ i ≤ n)
        P ⇐ ∅
        for each node v ∈ Ui−1
            P ⇐ P ∪ v
            if ACTION[STATE(v), xi] = "shift s'", SHIFT(v, s')
            for each "reduce p" ∈ ACTION[STATE(v), xi], REDUCE(v, p)
            if "acc" ∈ ACTION[STATE(v), xi], accept
        if Ui = ∅, reject

SHIFT(v, s)
    if v' ∈ Ui s.t. STATE(v') = s and ANCESTOR(v', 1) = v and state transition δ(v, x) = v'
        do nothing
    else
        create a new node v'
            s.t. STATE(v') = s and ANCESTOR(v', 1) = v and state transition δ(v, x) = v'
        Ui ⇐ Ui ∪ v'

REDUCE(v, p)
    for each possible reduced parent v1' ∈ ANCESTOR(v, RHS(p))
        if UNIFY(v, p) = "success"
            s" ⇐ GOTO(v1', LHS(p))
            if node v" ∈ Ui−1 s.t. STATE(v") = s"
                if δ(v1', LHS(p)) = v"
                    do nothing
                else if node v2' ∈ ANCESTOR(v", 1)
                    let vc" s.t. ANCESTOR(vc", 1) = v1' and STATE(vc") = s"
                    for each "reduce p" ∈ ACTION[STATE(vc"), xi], REDUCE(vc", p)
                else if v" ∈ P
                    let vc" s.t. ANCESTOR(vc", 1) = v1' and STATE(vc") = s"
                    for each "reduce p" ∈ ACTION[STATE(vc"), xi], REDUCE(vc", p)
            else
                create a new node vn
                    s.t. STATE(vn) = s" and ANCESTOR(vn, 1) = v1' and state transition δ(vn, x) = v1'
                Ui−1 ⇐ Ui−1 ∪ vn
        else
            current reduction failed

UNIFY(v, p)
    for "constraint cj" ∈ CONSTRAINT(STATE(v))   (1 ≤ j ≤ m, c1 ≺ … ≺ cm)
        if ξ(cj, p) = "true"   (ξ(∅, p) = "true")
            TARGET(v) ⇐ j
            return "success"

Fig. 2 Modified generalized LR parsing algorithm

The parser is a function of two arguments, PARSE(grammar, x1 … xn), where the grammar is provided as a parsing table. The parser calls the functions SHIFT(v, s) and REDUCE(v, p) to process the shifting and rule reductions as described. The UNIFY(v, p) function is called for every possible reduction in REDUCE(v, p) to verify that the reduction is legal and to select the target rule for the source structure for synchronization. The function TARGET(v) is called after the unification has passed to determine the j-th target rule for correspondence.
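The control flow of Fig. 2 can be condensed into the sketch below (Python, simplified to a single stack instead of the graph-structured stack, so only one alternative is followed; table.action, table.goto, the node layout, and collect_features are assumptions, and select_target_rule is the helper sketched in Section 2.1).

    def collect_features(children, source):
        # Hypothetical: expose each child's features under its source-pattern symbol,
        # e.g. the child matched by "NP2" contributes "NP2.sem", the head "VP*" contributes "VP.cate".
        feats = {}
        for child, sym in zip(children, source):
            for name, value in child["feats"].items():
                feats[sym.rstrip("*") + "." + name] = value
        return feats

    def parse(table, tokens):
        # tokens: list of {"sym": ..., "feats": {...}} produced by the lexical stage.
        states, nodes = [0], []
        i = 0
        while True:
            lookahead = tokens[i]["sym"] if i < len(tokens) else "⊥"
            act = table.action(states[-1], lookahead)
            if act is None:
                return None                              # blank entry: parsing error
            if act[0] == "accept":
                return nodes[-1]
            if act[0] == "shift":
                states.append(act[1]); nodes.append(tokens[i]); i += 1
            else:                                        # ("reduce", production)
                prod = act[1]
                k = len(prod.source)
                children = nodes[-k:]
                target = select_target_rule(prod, collect_features(children, prod.source))
                if target is None:
                    return None                          # UNIFY failed: reduction rejected
                del states[-k:]; del nodes[-k:]
                head = next((c for c, s in zip(children, prod.source) if s.endswith("*")),
                            children[0])
                nodes.append({"sym": prod.lhs, "feats": dict(head["feats"]),   # feature inheritance
                              "children": children, "target": target})        # attach target rule
                states.append(table.goto(states[-1], prod.lhs))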

4  Application to Portuguese to Chinese Machine Translation

4.1  Portuguese to Chinese translation system

The formalism described in Section 3 was used to develop a Portuguese to Chinese translation system. The system is a transfer-based translation system using the formalism of the CSG as its analytical grammar. Unlike other transfer-based translation systems


that individually complete the major components of analysis, transfer, and generation in the pipeline using different sets of representation rules for structure analysis and transformation[25], the present system uses only a single CSG set for the translation task. Since the structures of parallel languages are synchronized in the formalism, their structural differences are also captured and described by the grammar. Hence, translation of an input text essentially involves three steps. First, the structure of an input sentence s is analyzed using the source component's rules from the CSG productions by the augmented generalized LR parsing algorithm. Secondly, the link constraints determined during the rule reduction process are propagated to the corresponding target rules R (as a selection of target rules) to construct a target derivation sequence Q. Finally, the derivation sequence Q is used to generate a translation of the input sentence s by referencing the set of generative rules R attached to the corresponding constituent nodes in the parsed tree.

Figure 3 shows the architecture of the Portuguese to Chinese translation system, with the data flow in the translation process illustrated by the arrows. The input sentence is first analyzed by a morphological analyzer to restore the lexical item to its canonical form. Compared to English, Portuguese has a highly developed morphology and is relatively richer in inflectional variation. For example, each verb has at least 50 different conjugations according to distinctions such as number, person, voice, mood, and tense. The Portuguese morphological analyzer developed by Wong et al.[26] was used to analyze the input word and to retrieve a list of possible candidates that have different feature values, including the part of speech. Ambiguous words are handled by the sense disambiguation module and the parser.

Fig. 3 Portuguese to Chinese translation system architecture

After analyzing the form of the lexical items and assigning a set of possible features to each word in the sentence, the next step is to determine the proper grammatical categories for the words using the Portuguese tagger[27] , which is a probabilistic tagging model smoothed by interpolating the probabilities of lexical features into the word probabilities to evaluate unknown words. As described in Section 3, soft preferences were used instead of hard constraints by associating a weight to each feature structure. During the translation process, no feature candidates are removed from the candidates list, but their weights are adjusted if the specific candidate cannot be hit in each analytical process. For example, in the tagging process, if the

grammatical category carried by a feature candidate does not match the tagged result, a penalty is applied to reduce its weight so that the feature is given a lower priority. The search space during parsing is narrowed by a word sense disambiguation module based on context collocation, which selects the most probable features for each word and puts them at the top of the feature list. The parsing process then uses these as the default features for testing the constraint conditions defined in the CSG. The input sentence is parsed to produce a parse tree representing the source sentence structure, where each constituent unit of the representation structure is attached to the corresponding structure of the target language. For example, the


parsed structure of the sentence "Ele deu um livro ao José. (He gave a book to José)" together with the target correspondences is illustrated in Fig. 4. Figure 5 gives the final translation "他给了若泽一本书" for the sentence after the generation process, which is based on the synchronous structures produced by the parser.

Fig. 4 Parsed structure of the sentence “Ele deu um livro ao José (He gave a book to José)” with the target correspondences

Fig. 5 Target translation after processing by the generation module
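For illustration, the rewriting that produces Fig. 5 can be reduced to a very small sketch: a hand-built synchronous tree for this sentence (the node layout and variable names are assumptions for the example, not the system's internal representation), where each internal node stores the children recognized by the source rule and the chosen target pattern, integers in the pattern link back to children by position, and strings stand for literal target text such as the inserted classifier 本.

    ele, deu, um, livro, jose = "他", "给了", "一", "书", "若泽"

    np2 = ([um, livro], [0, "本", 1])        # NP → num NP*, target "num 本 NP" (Production (7))
    s = ([ele, deu, np2, None, jose],        # S → NP1 VP* NP2 PP NP3 (Production (4));
         [0, 1, 4, 2])                       # target "NP1 VP NP3 NP2": the PP (index 3) vanishes

    def generate(node):
        # Flatten a node by rewriting its attached target pattern.
        if isinstance(node, str):
            return node
        children, pattern = node
        return "".join(generate(children[item]) if isinstance(item, int) else item
                       for item in pattern)

    print(generate(s))                       # -> 他给了若泽一本书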

4.2  Other applications

The CSG can be used to simulate other synchronous models such as ITG[8] and independent rewriting systems by maintaining only one rule in the target component and relaxing the associated target rule constraints. In this case, the CSG model can be used to extract linguistic information from parallel texts, including the discovery of hidden structures in texts from different languages, by learning from parallel corpora. Collocation extraction and expression, for example, are quite important for developing translation systems where collocation data can contribute to improved performance in almost every processing component. Collocations, like idioms, belong to the set of phrases in a language that describe habitual word combinations which have particular meanings relative to the base. While


they are often realized idiomatically, collocational meanings can also be found in regular compositional phrases. There is a clear consensus in the literature that collocation data are very useful for parsing, word sense disambiguation, text generation, and other applications, especially for machine translation. Collocational data expressions can be easily achieved through the CSG formalism, since they can be incorporated as part of the analytical grammars for parsing during analysis and translation. Recently, there has been an increasing interest in controlled languages (an offspring of machine translation), culminating in a series of CLAW workshops on controlled language applications[28] . These have given impetus to both monolingual and multilingual guidelines and applications using controlled language for many different languages. Controlled languages are subsets of natural languages whose grammars and dictionaries have been restricted to reduce or eliminate both ambiguity and complexity. In a transfer-based system, controlled translation involves three processing phases that control the input text, the transfer routines, and the generation component on the target side. An alternative way to facilitate the controlled translation process is to limit the language grammars through the modeling of parallel (synchronous) structures for the language pairs, e.g., by using the CSG formalism. The generalization of source rules in the CSG allows flexible description of the common (universal) structure of natural languages, which greatly reduces the grammar size. With the CSG formalism, the output translation can be easily controlled with the constraints of the corresponding target component according to the input text features and the contextual information.

5  Conclusions

This paper describes a synchronous grammar based on the syntax of context-free grammar, called the CSG. The CSG source components are generalized to represent the common structures of languages. Unlike in other synchronous grammars, each source rule is associated with a set of target productions, where each target rule is connected with a constraint over the source pattern features. The set of target rules is grouped into


a vector ordered by the specificity of the constraints. The objective of this formalism is to allow parsing and generating algorithms to infer different possible translation equivalences for an input sentence according to the linguistic features. A modified generalized LR parsing algorithm was adapted to the parsing of this formalism to analyze the syntactic structure of Portuguese in a machine translation system. The CSG was then used to construct a machine translation system for Portuguese to Chinese translation. Other potential applications of the CSG include parallel text modeling, identification of collocation data expressions, and controlled translation.

References

[1] Ranbow O, Satta G. Synchronous models of language. In: Proceedings of 34th Annual Meeting of the Association for Computational Linguistics. University of California, Santa Cruz, California, USA, 1996: 116-123.
[2] Lewis P, Stearns R. Syntax-directed transduction. Journal of the Association for Computing Machinery, 1968, 15(3): 465-488.
[3] Shieber S, Schabes Y. Synchronous tree adjoining grammar. In: Proceedings of the 13th International Conference on Computational Linguistics. Helsinki, 1990.
[4] Seki H, Matsumura T, Fujii M, Kasami T. On multiple context-free grammars. Theoretical Computer Science, 1991, 88(2): 191-229.
[5] Melamed I D. Multitext grammars and synchronous parsers. In: Proceedings of NAACL/HLT. Edmonton, 2003: 79-86.
[6] Wong F, Hu D C, Mao Y H, Dong M C. A flexible example annotation schema: Translation corresponding tree representation. In: Proceedings of the 20th International Conference on Computational Linguistics. Geneva, Switzerland, 2004: 1079-1085.
[7] Wu D. Grammarless extraction of phrasal translation examples from parallel texts. In: Proceedings of TMI-95, Sixth International Conference on Theoretical and Methodological Issues in Machine Translation, vol. 2. Leuven, Belgium, 1995: 354-372.
[8] Wu D. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 1997, 23(3): 377-404.
[9] Aho A, Ullman J. Syntax directed translations and the pushdown assembler. Journal of Computer and System Sciences, 1969, 3: 37-56.
[10] Melamed I D, Satta G, Wellington B. Generalized multitext grammars. In: Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, 2004: 661-668.
[11] Kaplan R, Bresnan J. Lexical-functional grammar: A formal system for grammatical representation. In: Bresnan J, ed. The Mental Representation of Grammatical Relations. Cambridge, Mass.: MIT Press, 1982: 173-281.
[12] Kaplan R. The formal architecture of lexical-functional grammar. Information Science and Engineering, 1989, 5: 305-322.
[13] Bresnan J. Lexical-Functional Syntax. Oxford: Blackwell, 2001.
[14] Pollard C, Sag I. Information-Based Syntax and Semantics. Stanford: CSLI, 1989.
[15] Pollard C, Sag I. Head-Driven Phrase Structure Grammar. Chicago: University of Chicago Press, 1994.
[16] Wong F, Mao Y H, Dong Q F, Qi Y H. Automatic translation: Overcome the barriers between European and Chinese languages. In: Proceedings (CD Version) of First International UNL Open Conference. Suzhou, China, 2001.
[17] Wong F, Mao Y H. Framework of electronic dictionary system for Chinese and Romance languages. Traitement Automatique des Langues (TAL): Les Dictionnaires Électroniques, 2003, 44(2): 225-245.
[18] Ide N, Veronis J. Word sense disambiguation: The state of the art. Computational Linguistics, 1998, 24(1): 1-41.
[19] Earley J. An efficient context-free parsing algorithm [Dissertation]. Computer Science Department, Carnegie-Mellon University, 1968.
[20] Earley J. An efficient context-free parsing algorithm. CACM, 1970, 13(2): 94-102.
[21] Tomita M. An efficient augmented-context-free parsing algorithm. Computational Linguistics, 1987, 13(1-2): 31-46.
[22] Rekers J. Parser generation for interactive environments [Dissertation]. University of Amsterdam, 1992.
[23] Tomita M. Generalized LR Parsing. Carnegie-Mellon University: Kluwer Academic Publishers, 1991.
[24] Aho A V, Sethi R, Ullman J D. Compilers: Principles, Techniques and Tools. Massachusetts: Addison-Wesley, 1986.
[25] Hutchins W J, Somers H L. An Introduction to Machine Translation. New York: Academic Press, 1992.
[26] Wong F, Dong M C, Li Y P, Mao Y H. Semantic and morphological analysis and mouse tracking technology in translation tool. In: Proceedings of Computational Methods in Engineering and Science (EPMESC IX). Macao, 2003: 525-532.
[27] Wong F, Chao S, Hu D C, Mao Y H. Interpolated probabilistic tagging model optimized with genetic algorithm. In: Proceedings of the 3rd International Conference on Machine Learning and Cybernetics. Shanghai, China, 2004: 2569-2574.
[28] Carl M. Data-assisted controlled translation. In: Joint Conference Combining the 8th International Workshop of the European Association for Machine Translation and the 4th Controlled Language Applications Workshop (EAMT-CLAW 03). Dublin City, 2003: 16-24.
