Meaning Preservation in Example-based Machine Translation with Structural Semantics
Chong Chai Chua, Tek Yong Lim, Lay-Ki Soon, Enya Kong Tang, Bali Ranaivo-Malançon

To appear in: Expert Systems With Applications
PII: S0957-4174(17)30103-3; DOI: 10.1016/j.eswa.2017.02.021; Reference: ESWA 11128
Received 6 November 2016; Revised 7 February 2017; Accepted 9 February 2017


Highlights

• Introduce Structural Semantic Annotation to improve an English to Malay EBMT system.

• Emphasize meaning structure preservation in the automated translation process.

• Extend the translation example representation with Structural Semantic Annotation.

• Resolve the fragmentation and inconsistency issues in the EBMT system.

• Improve the results of English to Malay automated translation.

Meaning Preservation in Example-based Machine Translation with Structural Semantics

Chong Chai Chua (a,*), Tek Yong Lim (a), Lay-Ki Soon (a), Enya Kong Tang (b), Bali Ranaivo-Malançon (c)

(a) Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
(b) Universiti Sains Malaysia, Gelugor, 11800, Pulau Pinang, Malaysia
(c) Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia

Abstract

The main tasks in Example-based Machine Translation (EBMT) comprise source text decomposition, followed by translation example matching and selection, and finally adaptation and recombination of the target translation. Because natural language is inherently ambiguous, preserving the source text's meaning throughout these processes is complex and challenging. A structural semantics is introduced as a step towards a meaning-based approach to improving the EBMT system. The structural semantics is used to support deeper semantic similarity measurement and to impose structural constraints on translation example selection. A semantic compositional structure is derived from the structural semantics of the selected translation examples. This semantic compositional structure serves as a representation that preserves the consistency and integrity of the input sentence's meaning structure throughout the recombination process. In this paper, an English to Malay EBMT system is presented to demonstrate the practical application of this structural semantics. Evaluation of the translation test results shows that the new translation framework based on the structural semantics outperforms the previous EBMT framework.

Keywords: Example-based Machine Translation, Structured String-Tree Correspondence, Synchronous Structured String-Tree Correspondence, Structural Semantics, Semantic Roles

* Corresponding author
Email addresses: [email protected] (Chong Chai Chua), [email protected] (Tek Yong Lim), [email protected] (Lay-Ki Soon), [email protected] (Enya Kong Tang), [email protected] (Bali Ranaivo-Malançon)
URL: fci.mmu.edu.my/v3/ (Chong Chai Chua)

1. Introduction

Machine Translation (MT) uses computers to model the process of translating one human language into another. In general, machine translation involves decoding the meaning of the source language (SL) text and then re-encoding that meaning in the target language (TL). The decoding and re-encoding processes require a certain level of in-depth knowledge about the languages involved in the translation. MT researchers have implemented many strategies and methods to preserve the original meaning of the SL in the TL, e.g. grammatical rules, transfer rules, translation templates and statistical models.

In this study, example-based approaches are used to resolve the meaning preservation problems in MT. Example-based MT (EBMT) originated from the idea of mechanical translation by analogy proposed by Nagao (1984). The fundamental concepts of EBMT were introduced by Nagao and defined by Hutchins (2005b), Somers (1999) and Carl (2005). The machine translation process of EBMT basically involves decomposing the input source sentence into segments, matching these segments against the examples database, identifying the corresponding translations from the matched examples, and finally adapting and recombining the translation examples to construct the target sentence.

As pointed out by Somers (1999) and Hutchins (2005a), recombination is the most difficult task in EBMT. This is easily anticipated, as the original context and the relationships between segments of the input sentence are lost during the decomposition process. Without any explicit semantic information referring back to the original input sentence, each segment stands on its own in a new context. This opens the door to other interpretations and introduces new ambiguities, so the original input sentence's meaning cannot be fully recovered from the successfully recombined target translation.

In the study of linguistic semantics, it is commonly agreed that the main determinant of a sentence's meaning is the verb of the main predicate (Healy & Miller, 1970). According to projectionist approaches, many aspects of the syntactic structure of a sentence are assumed to be projected from the lexical properties of the verb, in particular the morphosyntactic realization of the verb's arguments (Rappaport Hovav & Levin, 1998). Goldberg's studies (Goldberg, 1995, 1999) suggested that the basic meaning of clausal expressions results from the interaction between verb meaning and the semantics of the argument construction. This is in accordance with Frege's (Szabó, 2013) idea that semantics must be compositional, such that the meaning of every expression in a language is a function of the meaning of its immediate constituents and the syntactic rule used to combine them (Goldberg, 2016). This idea of the semantic argument construction of verbs contributed to the core techniques for structural semantics representation in Natural Language Processing related research fields such as Question Answering, Information Extraction and Information Retrieval.

The research presented in this paper adheres to the principle that verb meaning and argument structure construction are both important and must co-exist in order to form the meaning structure of a sentence. To preserve the organization of the arguments and the co-occurrence information relative to the verb, this meaning structure is represented using semantic roles as a layer of structural semantics directly corresponding to the translation example. This structural semantics serves as a means to evaluate the semantic similarity between the input sentence and the stored source examples. A semantic compositional structure is derived from the structural semantics of the selected translation examples and is used throughout the recombination process.

Resolution of the meaning fragmentation and integrity issues in EBMT using the structural semantics is one of the main contributions of this paper. The structural semantics then contributes to the design of a new translation framework by enabling the incorporation of semantic information into the existing EBMT system at various levels. The combined strength of structural semantics and synchronized recombination in this new EBMT framework produces better translation results while maintaining the efficiency of the translation mechanisms. Besides this, a modified semantic-based evaluation based on the precision, recall and f-score measurements in Lo & Wu (2011) is used as an alternative to human evaluation.

In the following sections, the discussion begins with related research on the treatment of meaning in MT systems and on semantic similarity in Section 2. An overview of the current status and problems of the English to Malay EBMT system is given in Section 3. Section 4 presents the complete details of the structural semantics. The discussion continues with the proposal of a new EBMT framework based on the structural semantics in Section 5 and the translation evaluation results in Section 6. Finally, the research work in this paper is concluded in Section 8 with suggestions for future work.

2. Related Work

2.1. Intermediate Representation Structures in Interlingua MT Systems

The study of semantics is a main topic in traditional Interlingua Machine Translation systems (Dorr et al., 2006). The main idea of Interlingua translation is to analyze, extract and represent the meaning of the SL in a language-independent structure, or general meaning representation, for later generation of the TL. Although there is no agreed consensus on the form, primitives and levels of the meaning representation in Interlingua MT systems, the importance of semantic relations (Dorr & Habash, 2002) between concepts is clearly outlined as part of the construction of the conceptual structure, especially for the structural consistency of the meaning representation. Semantic roles originating from linguistic theories, such as case roles, theta roles and thematic roles, were used in a number of Interlingua MT studies (Teruko et al., 2004; Dorr et al., 2010) to represent these semantic relations. Semantic roles are mainly centered on the predicate-argument structure (Levin & Rappaport Hovav, 2005), where the arguments of a predicate are classified into general or specific roles that express their role with respect to the situation (such as an event, action or state) described by the verb, as well as the semantic relations among all the participating arguments in the sentence.

2.2. Semantic-based Reordering in Statistical MT (SMT) Systems

In contrast, there is no explicit semantic handling in early SMT systems (Brown et al., 1990; Vogel et al., 2003; Lü et al., 2007). The focus is more on predicting the best translation target based on a trained statistical translation model and language model. The search, selection and merging of translation segments are based on maximized weight scoring using stochastic value estimates learned from an aligned bilingual corpus. In recent years, there have been active research efforts to use linguistic knowledge, both syntactic (Dlougach & Galinskaya, 2012; Zhang & Zong, 2013; Li et al., 2013) and semantic (Aziz et al., 2011; Feng et al., 2012; Bazrafshan & Gildea, 2013), to assist phrase reordering and improve the overall translation results. Semantic structure is claimed to produce better results than syntactic structure, as it provides a better skeleton of a sentence's meaning (Liu & Gildea, 2010; Feng et al., 2012). The semantic structure in most of these SMT systems is modeled according to the semantic roles of the predicate-argument structure. Overall, a parallel aligned bilingual corpus is automatically annotated with a semantic role labeler and then utilized for learning phrase reordering rules and the translation model. The learned reordering rules are applied either during pre-translation (Zhai et al., 2012), embedded in the decoder (Liu & Gildea, 2010; Gao & Vogel, 2011; Feng et al., 2012), or during post-translation (Wu & Fung, 2009). As opposed to manual crafting of parsing rules, semantic parsing is eased by the availability of semantically annotated resources (e.g. PropBank, NomBank, VerbNet) and the advancement of machine learning for automatic semantic role labeling (Hajič et al., 2009; Björkelund et al., 2009; Zhao et al., 2013).

2.3. Semantic Disambiguation in EBMT Systems

On the other hand, semantic handling in EBMT systems focuses mainly on similarity measurement and semantic disambiguation. The similarity measurement relies heavily on thesauri and concentrates on two aspects: one is to assist the selection of suitable translation examples (Way, 2010), and the other is cross-lingual corpus alignment (Sumita, 2001). In the early stage, Nagao (1984) suggested selecting a suitable translation example based on the criterion of whether words from the SL are replaceable with the corresponding words from the input sentence. These corresponding words are checked for semantic similarity using a thesaurus. The similarity measurement approaches were gradually enhanced from simple semantic distance measurements over sub-sentence segments towards incorporating substitution costs using string edit distance (Doi et al., 2005; Vertan & Martin, 2005). The sub-sentence segments can be words (Nagao, 1984), substrings/chunks (Nirenburg et al., 1993), content words (Aramaki et al., 2003), phrasal contexts (Aramaki & Kurohashi, 2004), and head words in tree structures (Liu et al., 2003; Imamura et al., 2004). The use of thesauri is also significant for semantic disambiguation in EBMT, especially in the generalization of translations. For instance, Matsumoto & Kitamura (1997) acquired generalized word selection rules and translation templates by replacing semantically similar elements (words or phrases) in sentences with semantic classes from a thesaurus. Kaji et al. (1992) performed a very similar disambiguation approach by refining their syntactic generalization with semantic categories to resolve template selection problems (ambiguous verbs). Brown (1999) also reported successful template generalization that improved the coverage and accuracy of EBMT by using manually generated equivalence classes. Brown continued to improve the approach by automatically generating equivalence classes (Brown, 2000, 2001; Gangadharaiah et al., 2006) based on word clustering techniques without using any thesaurus.

2.4. Semantic Similarity

Semantic similarity determines how conceptually similar two non-identical entities are (Petrakis & Varelas, 2006), for various textual units such as words, phrases and sentences. The approaches range from simple lexical overlap (Gomaa & Fahmy, 2013) to complex similarity measurements based on concepts and semantic networks in a thesaurus (Mihalcea et al., 2006; Matar et al., 2008). With knowledge sources such as ontologies, the semantic similarity between terms/concepts can be estimated by defining a topological similarity, which generally covers approaches (Albacete et al., 2012; Slimani, 2013) such as structure-based measures, information content measures, and feature-based measures.

In order to support full sentence similarity estimation, a measurement based on an abstract semantic level is needed. In fact, matching over an abstraction level is an important technique in the fields of Information Extraction and Information Retrieval. As proposed in Malik & Rizvi (2011), Kaptein et al. (2013) and Zuccon et al. (2014), the semantic annotation serves as an abstract-level indicator of the concepts in the text content, and the structure illustrates the organization of such concepts. Recent development focuses on annotating the predicate-argument structure with semantic roles, formulating the meaning structure (Blanco & Moldovan, 2013) or a concept map (Trandabăț, 2011) with semantic role relations.

Semantic Role Labeling is also very useful for question answering (Shen & Lapata, 2007; Moreda et al., 2008). Semantic roles are assigned to both the query and the candidate answers, and the semantic similarity is estimated based on aligned semantic roles of verbs that evoke the same semantic frame. Besides this, in paraphrase similarity measurement (Bertero & Fung, 2015), semantic roles are used as semantic features for paraphrase classification.

2.5. Structural Semantics as an Extended Annotation Layer on the Translation Example Representation

As opposed to Interlingua MT (Dorr et al., 2010), the semantic representation proposed in this paper is added as an annotation layer on the existing translation examples. The semantic annotation is language specific: there are source and target language semantic annotations for the aligned bilingual translation pairs. Cross-lingual correspondences of the semantic representations between the language pair are established, such that variations of semantic knowledge between the source and target languages are explicitly specified. The disambiguation focuses on the relationships of the verb(s) with their arguments in the sentence, based on the complete semantic structure instead of lexical meaning. As a result, the similarity measurements for translation example selection emphasize overall sentence-level semantic similarity. Furthermore, the sentence structure and phrasal ordering depend on the semantic structure instead of reordering rules as in phrase-based SMT (Feng et al., 2012; Zhai et al., 2012).

The structural representation inherits the original representation of the translation examples, such that linguistic phenomena and exceptional cases can be specified directly on the representation structure for special handling. SMT and Neural MT rely on generalized statistical modeling, so many linguistic aspects cannot be catered for on a case by case basis. Hence, the training datasets for SMT (Lü et al., 2007) and Neural MT (Wu et al., 2016) are normally very large compared to EBMT.

The similarity measure for the selection of the best translation example, presented later in Section 5.2, is constructed by aggregating the linguistic information represented in a multi-level synchronized structure. This aggregated linguistic information consists of the lexical surface form, POS pattern, syntactic dependency, and semantic role labeled predicate-argument structure. The similarity measurements are derived using the idea of the Levenshtein edit distance.

3. English to Malay EBMT System

As a real case study, we present the English to Malay EBMT system SiSTeC. SiSTeC stands for SiStem Terjemahan berasaskan SSTC (SSTC-based Translation System). In SiSTeC, a translation example is represented as a Structured String-Tree Correspondence (SSTC), a general structure that associates an arbitrary tree structure (the interpretation structure) with a string in a language. This SSTC representation scheme was extended by Al-Adhaileh & Tang (1999) to the Synchronous Structured String-Tree Correspondence (S-SSTC), a representation of the synchronization between a natural language sentence and its equivalent translation in another natural language. As shown in Fig. 1, the S-SSTC describes a synchronized structure between the English source sentence ("he knelt on the floor") and the corresponding Malay translation ("dia berlutut di atas lantai itu").

SiSTeC performs translation by simulating the process of synchronous parsing (Al-Adhaileh et al., 2002) as in synchronous grammar formalisms (Büchse et al., 2011). The segmentation of the text is based on the longest match of source strings with the translation examples in the Bilingual Knowledge Bank (BKB). SiSTeC performs structural matching (Ye, 2006) of the segmented input language (IL) sentence against the stored SL translation examples based on structural patterns constructed from lexical and syntactic features. The dependency structure of the input sentence is reconstructed based on the matching structural patterns.

The structural matching between the SL sentence and the IL sentence depends on the similarity of the linear form (continuous strings with lexical and syntactic annotation) and the syntactic structures (partially or fully generalized dependency trees). It is not able to examine whether the matched sentences are semantically equivalent or approximately close in meaning. This surface form matching with lexical and syntactic patterns leads to the mismatching of translation examples and subsequently causes adaptation and recombination errors.

Figure 1: S-SSTC for English to Malay
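To make the SSTC and S-SSTC notions concrete, the following is a minimal, illustrative sketch in Python. The class and field names (TreeNode, SSTC, SynchronousSSTC) and the interval-based encoding of the correspondences are our own simplification of the definitions above, not SiSTeC's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """A node of the interpretation (dependency) tree tr."""
    label: str                      # lexical head of the node
    snode: tuple                    # interval into the string st, e.g. (1, 2)
    children: list = field(default_factory=list)

@dataclass
class SSTC:
    """Structured String-Tree Correspondence (st, tr, co). The correspondence
    co is encoded here by the snode intervals on each tree node; intervals
    may be non-contiguous in the general (non-projective) case."""
    st: list                        # the sentence as a token list
    tr: TreeNode                    # its interpretation structure

@dataclass
class SynchronousSSTC:
    """S-SSTC: a source SSTC, a target SSTC and links phi(S, T) between
    sub-SSTCs, encoded as pairs of string intervals."""
    source: SSTC
    target: SSTC
    phi: list                       # [((src_from, src_to), (tgt_from, tgt_to)), ...]

# The S-SSTC of Fig. 1 (tree structure abbreviated; intervals illustrative):
example = SynchronousSSTC(
    source=SSTC(st="he knelt on the floor".split(),
                tr=TreeNode("knelt", (1, 2))),
    target=SSTC(st="dia berlutut di atas lantai itu".split(),
                tr=TreeNode("berlutut", (1, 2))),
    phi=[((0, 1), (0, 1)),          # "he" <-> "dia"
         ((1, 2), (1, 2)),          # "knelt" <-> "berlutut"
         ((2, 5), (2, 6))])         # "on the floor" <-> "di atas lantai itu"
```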

The translation result of SiSTeC is good when the degree of matching between the input sentence's segments and a translation example is high (a longer match with very similar lexical and syntactic features). This is clearly observed especially when an exact match of the main verb is found (same tense and voice form). This type of best match is illustrated by Example 1 below. The recombination of input segments based on rule matching for Example 1 is elaborated in Fig. 2. The segment "was sent" from the input sentence is matched with "was sent" from the translation example "relief was sent to victim". The syntactic pattern of the input sentence, "DET N V EN PREP N PREP N", overlaps with this example's syntactic pattern at "N V EN PREP N", where the main verb is bound within the matched pattern. Hence, the lexical verb pattern is considered an exact match. A base structure "N V EN PREP PREP" is identified and provides the template to combine all the input segments (nodes or subtrees) into a complete dependency structure. This base structure provides complete structural and ordering information for the corresponding target sentence.
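The longest-match lookup over syntactic patterns can be pictured as finding the longest contiguous run of POS tags shared by the input pattern and a stored example pattern, with the constraint that it covers the main verb. A minimal sketch, using Python's difflib rather than SiSTeC's actual matcher, which we do not have access to:

```python
from difflib import SequenceMatcher

def longest_pos_overlap(input_pattern, example_pattern):
    """Return the longest contiguous run of POS tags shared by the
    input sentence pattern and a stored example pattern."""
    a = input_pattern.split()
    b = example_pattern.split()
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[m.a:m.a + m.size])

# The overlap from Example 1; the match contains the main verb ("V EN"):
print(longest_pos_overlap("DET N V EN PREP N PREP N", "N V EN PREP N"))
# -> "N V EN PREP N"
```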

Example 1. English to Malay translation: "An email was sent to Ryan by John."

• SiSTeC Output: "E-mel telah dihantar kepada Ryan oleh John." (Equivalent to: "Email was sent to Ryan by John.")

• Reference: "Satu e-mel telah dihantar kepada Ryan oleh John."

• Best match: "was sent" (node), "N V EN PREP PREP" (rule)

On the contrary, the translation accuracy decreases when an exact match of the verb and the best template are not found. In Example 2 below, an exact match for "is eating" is not found in the knowledge bank, so the segment "were eaten" is selected as an alternative because both share the similar lemma form "be eat". However, the tense and voice form of "were eaten" (past tense, passive voice) differ from those of "is eating" (present progressive tense, active voice). Direct use of "were eaten" without reordering the sentence's segments subsequently causes a structural error. As elaborated in Fig. 3, the selected alternative segment "were eaten" is successfully merged with the other segments using the base structure "N V EN N". This is a simple example of changes at the source side that cannot be propagated to the other elements in the target sentence, as the mapping between the source and target in the translation template does not signal a transformation request. Hence, a paraphrase that involves changing the main verb's form together with restructuring the elements in the target sentence cannot be performed.

Example 2. English to Malay translation: "The cat is eating the rat."

• SiSTeC Output: "Kucing itu dimakan tikus itu." (Equivalent to: "The cat was eaten by the rat.")

Figure 2: Recombination rule matching for English to Malay translation of "An email was sent to Ryan by John."

Figure 3: Recombination rule matching for English to Malay translation of "The cat is eating the rat."

• Reference: "Tikus itu dimakan oleh kucing itu." or "Kucing itu sedang makan tikus itu."

• Best match: "were eaten" (node, lemmatized verb pattern matching, as no exact match is found), "N V EN N" (rule)

• Problem: structural errors; unable to determine the reordering of phrases based on the verb's meaning structure

4. Structural Semantics Correspondence

As demonstrated in Interlingua and Statistical MT research, the semantic relations between concepts conveyed by words or phrases can be explicitly described by semantic roles. Semantic roles also provide an organized and consistent representation structure, which has proven to be more effective than syntactic structure. Many linguistic views suggest that the lexical meaning and properties of the verb are the key to predicting and determining sentence meaning. According to the semantic role centred approach to lexical semantic representation (Levin & Rappaport Hovav, 2005), the verb's meaning can be represented by a list of semantic role labels (also known as a "Case Frame" by Fillmore (1968) and "Thematic Relations" by Gruber (1965) and Jackendoff (1976)), and each of these roles is assigned to an argument bearing the semantic relation to the verb. With recent natural language shallow semantic parsing techniques (Semantic Role Labeling), which associate the surface arguments of a predicate, especially a verb, with discrete semantic roles, an abstract meaning structure (or skeleton structure) of a sentence can be explicitly represented.

In this section, the theoretical aspects of the structural semantics for SiSTeC are presented. The existing SSTCs in SiSTeC are annotated with semantic roles (Section 4.1). This semantic annotation is added to the SL SSTC, from which it is projected to the TL (Section 4.2) based on the correspondence relationships in the S-SSTC. The semantic compositional structure is derived from the structural semantics annotation and is used to facilitate the transformation and adaptation in the recombination process of a new EBMT framework (further details in Section 5.3).

4.1. SSTC with Semantics (SSTC+SEM)

The meaning structure constructed from the semantically labeled predicate-argument structure is aggregated to the SSTC as a new semantic layer. This semantic layer acts as an abstract semantic descriptor for the SSTC. The nodes in this semantic layer correspond directly to the predicate or arguments in the SSTC, and they are connected and organized according to the co-occurrence and dependencies of the predicate and arguments in the SSTC. The semantic roles in this semantic layer are denoted as numbered semantic arguments (i.e. A0, A1, etc.) following the annotation approach in PropBank (Palmer et al., 2005). These numbered semantic arguments are defined on a verb-by-verb basis. For different verbs, arguments with the same tag have different semantic roles; e.g. the A0 in Example 3-1 is an argument with the semantic role Consumer/Eater, whereas the A0 in Example 3-2 is an argument with the semantic role Borrower. Hence, the semantic roles of the numbered arguments are verb-specific, and at the same time the co-occurrence of these arguments defines the meaning construction for the verb.

1. [T he eggs]A1 were [eaten]P red [by the benef icial]A0 (with verb eat, the A0 has semantic function Consumer or Eater, A1 has semantic function M eal)

M

2. [He]A0 [borrowed]P red [a book]A1 [f rom the library]A2 (with verb borrow, the A0 has semantic function Borrower, A1 has semantic function T hing Borrowed, A2 has semantic function Loaner)

ED

320

These semantically labeled arguments at the structural level together with the predicate form a Structural Semantics (SEM) for the SSTC. This combined

PT

structure is defined as a triple (SEM, SST C, γ (SEM, SST C)), where: 1. SEM is a tree representation of structural semantics constructed from the predicate and argument(s) labeled with semantic role. It is organized into a

CE

325

dependency-based structure, such that:

AC

(a) The predicate (verb) as the root node.

330

(b) The root node is connected to leave node(s), constituted of argument(s) labeled with the semantic role. (c) The dependency relations between the root node and leave nodes are reflected directly by the semantic role.

16

ACCEPTED MANUSCRIPT

2. An SST C is a general structure defined as a triple (st, tr, co), where st is a string in one language, tr is its associated arbitrary tree structure (i.e. its interpretation structure), and co is the correspondence between st and tr,

(2002)).

CR IP T

which can be non-projective (detail definitions can refer Al-Adhaileh et al.

335

3. γ (SEM, SST C) defined a link lRel ∈ γ (SEM, SST C), corresponding from a node in SEM to a sub-SST C, such that:

SST C ⊆ SST C.

340

AN US

(a) A node in SEM is associated to sub-SST C of SST C, where sub-

(b) A sub-SST C is consisted of sub-string (partial of st) and sub-tree (partial of tr) from the SST C.

(c) This sub-string and sub-tree are linked with correspondence defined by the corresponding function co (st, tr) in SST C.

(d) For the root node of SEM , the lRel will record the correspondence to

M

345

sub-SST C constructed from predicate (i.e. verb) in the st; and for the leave node of SEM , the lRel will record the correspondence to the

ED

sub-SST C constructed from the predicate’s argument of st. (e) lRel will only need to record the correspondences from the node in SEM directly to st of the SST C, the correspondence from the SEM to tr can

PT

350

be achieve via the correspondence between st and tr defined by co, which can be referred as indirect linking, SEM ⇒ st ⇒ tr; in terms of function

CE

composition, let α be the correspondence function from SEM to tr , then α = co ◦ γ.

AC

355

(f) lRel is represented by sets of intervals, which encode the index for sequence of words in the st.

Fig. 4 illustrates the SSTC+SEM representation structure for the sentence

“The moths have eaten holes in his coat”. The sentence consisted of a main verb “eaten” and two arguments “the moths” and “holes in his coat”. Respectively,

17

ACCEPTED MANUSCRIPT

360

the arguments “the moths” is assigned with argument A0 (Consumer or Eater) and “holes in his coat” is assigned with argument A1 (Meal). These annotations are represented as a distinct SEM tree representation: with the predicate “eat”

CR IP T

(lemma of eaten) as root node; the A0 node is connected to the root node as child node to the left; and the A1 node is connected to the root node as 365

child node to the right. The SEM structure is then associated to the SSTC (both string and tree representation) such that: the root node from the SEM

tree is corresponding to the verb “eaten”; the child node A0 is corresponding

to the argument “the moths”; and the child node A0 is corresponding to the

370

AN US

argument “holes in his coat”. For the case when there is more than one verb in the sentence, each verb with its arguments will be represented as different SEM representation tree.

4.2. S-SSTC with Semantics (S-SSTC+SEM)

The SEM representation can be applied as the structural semantics for both

375

M

of the SL SSTC and TL SSTC in the S-SSTC. The source language SSTC+SEM will be referred as SL SSTC+SEM and the target language SSTC+SEM as

ED

TL SSTC+SEM. The structural semantics of the predicate-argument structure in the SL SSTC+SEM can be projected to the TL SSTC via the correspondences established in S-SSTC. From the definitions in Al-Adhaileh et al.

380

PT

(2002), a S-SSTC is defined as a triple (S, T, ϕ (S, T )), such that: S (i.e. SL) and T (i.e.

T L) respectively is represented as SST C; ϕ (S, T ) is a set of

links defining the synchronization correspondence between S and T at differ-

CE

ent internal levels of the two SST C structures. Thus, the correspondences of

SEM ⇒ SL SST C ⇒ T L SST C can be achieved via the compositional func-

AC

tion ϕ ◦ γ. On top of this indirect linking, the semantic annotations of the

385

SL predicate-argument structure can be projected to the TL SSTC, derived the abstract semantic layer and constructed the TL SSTC+SEM representation. For multiple verbs sentence, the main predicate and sub predicate are determined based on the syntactic dependencies in the tr representation. The mapping from SL SSTC+SEM to TL SSTC+SEM provides structural seman18

AC

CE

PT

ED

M

AN US

CR IP T

ACCEPTED MANUSCRIPT

Figure 4: Synchronization of the S-SSTC+SEM for sentence “The moths have eaten holes in his coat”
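A minimal sketch of the SEM layer and of the role projection via the composed correspondence ϕ ◦ γ described above. The SemNode class, the phi dictionary and the intervals are our own illustration of the definitions, not the system's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class SemNode:
    """SEM tree: the predicate is the root, role-labeled arguments are leaves.
    lrel holds the word-index interval into the string st of the SSTC."""
    role: str                       # "Pred", "A0", "A1", ...
    lrel: tuple                     # e.g. (0, 2) for tokens 0..1
    children: list = field(default_factory=list)

# SL SEM for "The moths have eaten holes in his coat" (Fig. 4):
sl_sem = SemNode("Pred", (3, 4), children=[     # "eaten" (lemma: eat)
    SemNode("A0", (0, 2)),                      # "the moths" (Consumer/Eater)
    SemNode("A1", (4, 8)),                      # "holes in his coat" (Meal)
])

# phi(S, T): source-interval -> target-interval links of the S-SSTC
# for "kotnya berlubang-lubang dimakan rama-rama" (intervals illustrative).
phi = {(3, 4): (2, 3),   # "eaten"             -> "dimakan"
       (0, 2): (3, 4),   # "the moths"         -> "rama-rama"
       (4, 8): (0, 2)}   # "holes in his coat" -> "kotnya berlubang-lubang"

def project(sem: SemNode) -> SemNode:
    """Project SL semantic roles onto the TL side via phi: each SEM node
    keeps its role label but takes the corresponding TL interval."""
    return SemNode(sem.role, phi[sem.lrel],
                   [project(c) for c in sem.children])

tl_sem = project(sl_sem)   # TL SEM: Pred with A0 "rama-rama", A1 "kotnya ..."
```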

The S-SSTC+SEM for the sentence "The moths have eaten holes in his coat" in Fig. 4 can be synchronized with the Malay translation "kotnya berlubang-lubang dimakan rama-rama" at various levels of correspondence. With reference to the ϕ(S, T) correspondences, the argument "the moths" corresponds to the target translated phrase "rama-rama" and "holes in his coat" corresponds to the phrase "kotnya berlubang-lubang". The semantic annotation of each argument at the SL side can be projected to the TL side according to these correspondences, such that the argument "rama-rama" is annotated with semantic role A0 (Consumer) and the argument "kotnya berlubang-lubang" is annotated with semantic role A1 (Meal). Based on the semantic correspondences between the SEM structures in Fig. 4, a position switch between the arguments A1 and A0 is required in the transformation from the SL to the TL.

4.3. Semantic Compositional Structure

The structural semantics of SEM exhibits semantic dependencies between the predicate and arguments that reflect a basic meaning structure for the clause(s) or phrase(s) of a sentence. With the aggregation of linguistic information between the SEM semantic dependencies and the tr syntactic dependencies, a semantic specification of the corresponding natural language text is encoded in the representation structures. A semantic compositional structure can be obtained by simple derivation. For a simple sentence with a single verb, the semantic compositional structure is equivalent to the SEM structure. For a sentence with multiple verbs, the semantic compositional structure involves the combination of multiple SEM structures.

ACCEPTED MANUSCRIPT

Figure 5: Compositional structural semantics for the sentence “he refused to do it because he felt it was not ethical”

420

it because he felt it was not ethical” is constructed from the predicate-argument

M

structures of the verbs “refused”, “do”, “felt” and “was”. The sentence’s base meaning is constructed from the main predicate verb “refused”, with three semantic arguments A0, A1 and AM -CAU . Respectively, the base meaning for

425

ED

the arguments A1 and AM -CAU of the verb “refused” is composed from the predicate-argument structure of the verb “do” and “felt”. In addition, the

PT

meaning of argument A1 for the verb “felt” is contributed by the predicateargument structure of the verb “was”. Such compositional characteristics allow the predicate-argument structures to be jointly combined and organized into

CE

a single compositional structure as depicted in Fig. 5. This unique semantic

430

specification supports deeper analysis and interpretation in order to preserve the meaning structure of a sentence during the matching and transformation

AC

process in the EBMT system. The process of abstract matching towards the semantic transformation for

the sentence “the cat likes to eat fish” with two separate Sem structures match-

435

ing is demonstrated in Fig. 6. The two SEM structures in the sentence “the cat likes to eat fish” can be combined to form a single semantic compositional

21

AC

CE

PT

ED

M

AN US

CR IP T

ACCEPTED MANUSCRIPT

Figure 6: From abstract meaning matching to semantic-based transformation for the sentence “the cat likes to eat fish”

22

ACCEPTED MANUSCRIPT

structure. The SEM structures of the verb “likes” and “eat” in this sentence is matched with the SL SEM structures of the stored examples “he likes to affect the great philosopher” and “who will want to eat this poison” respectively. The transformation of the abstract meaning structure from the SL to the TL for

CR IP T

440

both SEM structures of the verb “likes” and “eat” can be performed separately

according to their matching SL SSTC+SEM and TL SSTC+SEM. Hence, there will be two target SEM structures constructed. These two TL SEM structures can be combined into one semantic compositional structure with reference to 445

the input sentence’s semantic compositional structure, such that the main pred-

AN US

icate will be the verb “suka” with two arguments, where the argument A1 is constructed from the predicate-argument structure of the verb “makan”.
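A sketch of how several per-verb SEM trees can be folded into one semantic compositional structure, reusing the SemNode class from the earlier sketch. The containment test (nesting a sub-predicate under the argument whose span covers its predicate position) and the deeper-first placement are our own reading of the derivation described above.

```python
def compose(sems):
    """Merge per-verb SemNode trees into one compositional structure:
    a sub-predicate is nested under the argument whose span contains
    the sub-predicate's position in the sentence."""
    def span(n):
        los = [n.lrel[0]] + [span(c)[0] for c in n.children]
        his = [n.lrel[1]] + [span(c)[1] for c in n.children]
        return min(los), max(his)

    def place(root, sub):
        lo, hi = sub.lrel                       # the sub-predicate's interval
        for arg in root.children:
            if arg.role != "Pred" and arg.lrel[0] <= lo and hi <= arg.lrel[1]:
                for nested in arg.children:     # try deeper predicates first
                    if place(nested, sub):
                        return True
                arg.children.append(sub)        # e.g. A1 of "refuse" hosts "do"
                return True
        return False

    ordered = sorted(sems, key=lambda s: span(s)[1] - span(s)[0], reverse=True)
    main = ordered[0]                           # widest span = main predicate
    for sub in ordered[1:]:
        place(main, sub)
    return main
```

On the Fig. 5 sentence, this nests "do" under A1 of "refused", "felt" under AM-CAU of "refused", and "was" under A1 of "felt", reproducing the hierarchy depicted in the figure.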

5. New EBMT Framework with Incorporation of Structural Semantics

A new translation framework is required to incorporate the structural semantics into the existing EBMT system. This new translation framework with structural semantics is referred to as SiSTeC+SEM. As highlighted in Fig. 7, the IL text is preprocessed with a dependency parser and a semantic parser to construct the IL SSTC+SEM. The selection of translation examples is based on structural matching between the IL SSTC+SEM and the stored SL SSTC+SEM. The matching of these SSTC+SEM structures is simplified by converting the structures into linear semantic patterns, as described in Section 5.1. Semantic similarity between the IL SSTC+SEM and the SL SSTC+SEM is measured using the distance measurement formulated in Section 5.2, based on the semantic patterns. The semantic compositional structure of the target sentence is derived based on the structural semantic correspondences of the matching translation example. Finally, this semantic compositional structure provides full semantic information to guide the adaptation and recombination process (Section 5.3).

Figure 7: Translation Phase


465

5.1. Structural Semantic Pattern

As described in Section 4, the constructed SSTC+SEM is a multi-level structure with associated syntactic and semantic knowledge. Instead of interpreting meaning through logical rules, the similarity between two structural semantics can be examined through pattern matching. The SSTC+SEM is formulated into a linear string pattern to support semantic similarity measurement based on edit distance. From the basic definitions discussed in the previous section, the characteristics of the SSTC+SEM representation can be elaborated from the perspective of the requirements for performing this pattern matching task, such that:

1. Two sentences can be distinguished based on the specification of a shallow semantic layer, via:

   (a) The constitution of the type and number of semantic arguments in the structure of the semantic layer.
   (b) The semantic relations and semantic dependencies between the arguments: each argument is assigned a distinct semantic role, which is specific to a predicate (verb); i.e. the semantic roles assigned to the arguments of the predicate "see" differ from the semantic roles of the predicate "eat".
   (c) The semantic structure, i.e. the organization of the arguments and the predicate, e.g. Meal Pred[eat] Consumer, where the argument with semantic role "Meal" precedes the predicate "eat" (the left child in the tree representation) and the "Consumer" succeeds the predicate (the right child).
   (d) Semantic constraints based on the semantic relations, dependencies and structure, i.e.:
       • Meal Pred[eat] Consumer ≈ Consumer Pred[eat] Meal (approximately similar but not equivalent); and
       • Meal Pred[eat] Consumer ≠ Viewer Pred[see] Thing Viewed (totally different).

2. Multiple levels of linguistic information can be matched and compared, from the abstract level (semantic) down to the syntactic and context-specific (surface form) levels:

   (a) The abstract level via the shallow semantic layer; the syntactic layer via the dependency structure of the SSTC and POS tagging; and the content-specific layer via the lexical string of the source sentence in the SSTC.
   (b) For example, in Fig. 4, the SSTC+SEM of the sentence "The moths have eaten holes in his coat" consists of:
       • the semantic role labeled predicate-argument structure Consumer Pred[eat] Meal (equivalent to A0 Pred[eat] A1);
       • the Consumer argument, constructed from the lexical string "the moths", with POS pattern "DET N" and the dependency structure rooted at moths/N with dependent the/DET;
       • the Meal argument, constructed from the string "holes in his coat", with POS pattern "N PREP PRON N" and the dependency chain holes/N → in/PREP → coat/N → his/PRON.
   (c) A semantic pattern with multiple levels of linguistic information can be formed, where the lexical string, POS pattern and dependency structure serve as extended linguistic features of each semantic argument, i.e.:
       A0 [the moths] [DET N] (moths/N ← the/DET) Pred [eat] A1 [holes in his coat] [N PREP PRON N] (holes/N ← in/PREP ← coat/N ← his/PRON)
   (d) By combining the linguistic information of the shallow semantics (semantic role labeled predicate-argument structure), the syntax (part of speech and dependency structure) and the surface form (lexical or string pattern), a linear structural semantic pattern for a sentence can be generated.
   (e) A string index is added to this pattern as a reference to the original source sentence. For ease of processing, the dependency structure can be simplified and transformed into a linear form with only the root node and its direct child node(s). The structural semantic pattern of the previous example is thus refined to:
       A0 [0 2] [the moths] [DET N] [root:N DET] Pred [3 4] [eat] A1 [4 8] ...
   (f) The structural semantic pattern is hence generalized to (a code sketch of this linearization follows the list below):
       Argument_i [index] [lexical string] [POS] [dependency structure] Pred [index] [verb lemma] Argument_i+1 ...

Figure 8: Semantic pattern information table

3. A compound sentence with multiple verbs can be analyzed according to its semantic compositional structure:

   (a) Through the semantic-syntactic integration via the correspondence mapping from the semantic layer to the tr (dependency tree) of the SSTC: the tr describes the syntactic dependencies of the words in the natural language text st of the SSTC and encodes their syntactic dependency hierarchy. This is mostly useful in compound sentences with multiple predicate-argument structures, as it allows analysis of the semantic hierarchy.
   (b) Based on the compositional structure of the verbs, each verb can be compared separately as an independent SSTC+SEM structure.
   (c) For example, in Fig. 5, the sentence "he refused to do it because he felt it was not ethical" forms a semantic compositional structure constructed from four SSTC+SEM structures, respectively:
       • A0 [0 1] Pred [refuse] [1 2] A1 [2 5] AM-CAU [5 12]
       • A0 [0 1] Pred [do] [3 4] A1 [4 5]
       • A0 [6 7] Pred [feel] [7 8] A1 [8 12]
       • A1 [8 9] Pred [be] [9 10] AM-NEG [10 11] A3 [11 12]
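A sketch of flattening one predicate-argument structure into the linear pattern generalized in point 2(f). The feature extraction is simplified (the dependency feature is omitted); function and variable names are our own illustration.

```python
def linearize(pred_lemma, pred_span, args, tokens, pos):
    """args: list of (role, (lo, hi)) pairs; pred_span: (lo, hi).
    Produces a linear structural semantic pattern as in point 2(f)."""
    items = sorted(args + [("Pred", pred_span)], key=lambda a: a[1][0])
    out = []
    for role, (lo, hi) in items:
        if role == "Pred":
            out.append(f"Pred [{lo} {hi}] [{pred_lemma}]")
        else:
            words = " ".join(tokens[lo:hi])
            tags = " ".join(pos[lo:hi])
            out.append(f"{role} [{lo} {hi}] [{words}] [{tags}]")
    return " ".join(out)

tokens = "the moths have eaten holes in his coat".split()
pos = ["DET", "N", "AUX", "V EN", "N", "PREP", "PRON", "N"]
print(linearize("eat", (3, 4), [("A0", (0, 2)), ("A1", (4, 8))], tokens, pos))
# -> A0 [0 2] [the moths] [DET N] Pred [3 4] [eat]
#    A1 [4 8] [holes in his coat] [N PREP PRON N]
```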

5.2. Distance Measurement for Structural Semantic Patterns

The information encapsulated in the structural semantic pattern can be visualized as a table of linguistic information organized into multiple levels (rows), where correspondences are established between these levels (as shown in Fig. 8).

Figure 9: Comparison of two structural patterns
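The comparison illustrated in Fig. 9 can be rendered directly in code. The following is a minimal, illustrative Python version of the distance formulated in the numbered list below; it assumes patterns are given as a predicate lemma, a role sequence and per-argument feature strings, with all Levenshtein distances normalized to [0, 1] so that the similarities 1 − lev are well defined. The weight values and the left/right difference counting are one possible reading of the definitions, not the system's exact implementation.

```python
def lev(a, b):
    """Levenshtein edit distance between two sequences, normalized to [0, 1]."""
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 0.0
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n] / max(m, n)

def arg_position_distance(x_roles, y_roles):
    """Jaccard-style penalty for arguments that switch sides of the predicate."""
    def sides(roles):
        p = roles.index("Pred")
        return {r: ("L" if i < p else "R") for i, r in enumerate(roles) if r != "Pred"}
    sx, sy = sides(x_roles), sides(y_roles)
    t = len(set(sx) | set(sy))
    diff = sum(sx[r] != sy[r] for r in set(sx) & set(sy))
    return diff / t if t else 0.0

def predicate_arguments_distance(x_roles, y_roles):
    return (lev(x_roles, y_roles) + arg_position_distance(x_roles, y_roles)) / 2

def linguistic_features_distance(x_feats, y_feats):
    """x_feats, y_feats: per-argument dicts with 'dep'/'syn'/'lex' strings,
    paired here in sentence order (a simplification of argument matching)."""
    sims = [sum(1 - lev(fx[k].split(), fy[k].split())
                for k in ("dep", "syn", "lex")) / 3
            for fx, fy in zip(x_feats, y_feats)]
    return 1 - sum(sims) / len(sims) if sims else 1.0

def structural_semantic_distance(x, y, alpha=0.4, beta=0.6):
    if x["pred"] != y["pred"]:
        return 1.0        # compare only patterns sharing the main predicate
    return (alpha * predicate_arguments_distance(x["roles"], y["roles"])
            + beta * linguistic_features_distance(x["feats"], y["feats"]))
```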

It is possible to derive an aggregated distance measurement that simulates the comparison of multi-level linguistic information. One important criterion in the distance measurement between two structural semantic patterns is to take into account the overall structure of a pattern, which is constrained by the order and position of the arguments with reference to the predicate as the root (or central) object of the pattern. As illustrated in Fig. 9, the comparison of two patterns is guided by the semantic predicate-argument structure, followed by the additional multi-level linguistic descriptions of the argument structure. As the linguistic information at each level is formed as a string-based pattern, the order of the elements is important. With this in consideration, the Levenshtein edit distance can be used to perform the similarity measurements. The edit distance for structural semantic pattern similarity measurement is formulated as follows:

1. The similarity measurement of two structural semantic patterns for the selection of the best translation example is based on the minimum distance between the input source pattern x and a stored translation example pattern y:

   min d_structural_semantic(x, y)

2. The distance measurement consists of two parts:

   (a) the structural semantics distance, i.e. how the two structural semantic patterns differ in terms of the overall pattern structure: d_predicate_arguments(x, y);
   (b) the linguistic features distance of each matching argument, i.e. the differences between each feature's elements: d_linguistic_features(x, y).

3. The structural distance is defined as the predicate-arguments distance between two semantic patterns x and y:

   d_predicate_arguments(x, y) = ( lev_semx,semy(|semx|, |semy|) + d_arguments_position(x, y) ) / 2

   where:

   (a) lev_semx,semy(|semx|, |semy|) is the Levenshtein distance between the two predicate-argument patterns;
   (b) d_arguments_position(x, y) is the argument distance with reference to the predicate as the root node of the semantic dependency tree:

       d_arguments_position(x, y) = ( count_left_diff(x, y) + count_right_diff(x, y) ) / t, with t = total number of arguments.

   The argument distance is derived from the idea of the Jaccard measure. It imposes an additional distance when arguments switch position with reference to the predicate. For example, the distance between "A0 PRED A1 A2" and "A1 PRED A0 A2" is greater than the distance between "A0 PRED A1 A2" and "A0 PRED A2 A1", since in the pattern "A1 PRED A0 A2" the positions of "A1" and "A0" are switched with reference to the predicate "PRED". The effect is that "A0 PRED A2 A1" is preferred as a matching pattern for "A0 PRED A1 A2".

4. The linguistic features distance between two semantic patterns x and y with a total of n distinct arguments is defined as:

   d_linguistic_features(x, y) = 1 − (1/n) Σ_{i=1..n} sim_linguistic_features(x_i, y_i)

   where:

   (a) the aggregated linguistic feature similarity between two arguments (aggregating the dependency structure, lexical string patterns and syntactic patterns) is defined as:

       sim_linguistic_features(x, y) = ( (1 − lev_depx,depy(|depx|, |depy|)) + (1 − lev_synx,syny(|synx|, |syny|)) + (1 − lev_lexx,lexy(|lexx|, |lexy|)) ) / total number of features

       (for our case, total number of features = 3);

   (b) lev_depx,depy(|depx|, |depy|) is the Levenshtein distance between the two dependency structure patterns;
   (c) lev_synx,syny(|synx|, |syny|) is the Levenshtein distance between the two syntactic (part of speech) patterns;
   (d) lev_lexx,lexy(|lexx|, |lexy|) is the Levenshtein distance between the two lexical string patterns.

5. The structural semantic pattern distance between two semantic patterns x and y is then:

   d_structural_semantic(x, y) = d_predicate_arguments(x, y) · α + d_linguistic_features(x, y) · β

   where α + β = 1 and β > α, so that the distance of the linguistic features has a more significant influence on the overall measurement.

6. The matching and computation of the structural semantic pattern distance is performed only if the main predicates of the two semantic patterns match:

   d(x, y) = d_structural_semantic(x, y) if same_pred(x, y) = true, and 1 otherwise.

With this distance measurement over structural semantic patterns, two SSTC+SEM structures can be compared without performing a direct structural mapping. The comparison first examines the matching arguments between the two patterns, then checks the structural resemblance of the patterns, and further assesses the degree of similarity of two matching arguments based on the additional linguistic features. The characteristics of these distance measurements can be examined using the examples in Table 1, where three input examples are compared with two stored examples. From the distance measurement results in Table 2, the best match for each input example is obtained:

Input Example 1 with Stored Example 1, Input Example 2 with Stored Example 2, and Input Example 3 with Stored Example 2. The matching of Input Example 1 and Stored Example 1 is based on the lowest distance acquired, due to the equivalence of the structural semantic pattern "A0 AM-TMP Pred A1 A2". Input Examples 2 and 3 are basically the same sentence with minor phrasal reordering. Due to the shallow semantic matching, the comparison of the patterns of Input Example 2 and Stored Example 2 results in a lower edit distance than the comparison of the patterns of Input Example 3 and Stored Example 2.

5.3. Target Translation Sentence Reordering, Recombination and Generation

The edit distance measurement evaluates the similarity of two SSTC+SEM structures based on multi-level linguistic information. It is used to select the best translation example to support the derivation of the target translation's SEM structure. The TL SEM corresponding to the SL SEM in the matched translation example serves as the base template for the target translation's SEM structure. A correspondence mapping between the input sentence and the SL example is performed based on the structural semantic similarities. The transformation from the input sentence to the target sentence is described as the mapping from the input sentence to the SL example, followed by a transformation mirroring the transformation from the SL example to the TL example. These mappings and transformations during the derivation process can be elaborated as structural correspondence relationships between SSTC+SEM structures.

The derivation of the target SEM structure with reference to the structural correspondences can be described by the following procedures:

1. Construct a semantic compositional structure from the source sentence's structural semantics:

   (a) Construct it by merging the source sentence's structural semantics.
   (b) The structural semantics are merged according to the parent-child relationships of the predicates (verbs) with reference to the dependency structure (as shown in Fig. 10).

Table 1: Examples of Structural Semantic Patterns

Input Example 1
Source sentence: "The Kerpan farm currently sells fresh shrimps to third party processors."
Structural semantic pattern: A0([0 3][DET root:N N][the kerpan farm][DET N N]) AM-TMP([3 4][ADV][currently][ADV]) Pred([4 5][sell]) A1([5 7][A root:N][fresh shrimps][A N]) A2([7 11][AU INF root:N N][to third party processors][AU INF NUM ORD N N])

Input Example 2
Source sentence: "Fresh shrimps were sold by the Kerpan farm to third party processors."
Structural semantic pattern: A1([0 2][A root:N][fresh shrimps][A N]) Pred([3 4][sell]) A0([4 8][PREP root:N DET N][by the Kerpan farm][PREP DET N N]) A2([8 12][AU INF root:N NUM ORD N][to third party processors][AU INF NUM ORD N N])

Input Example 3
Source sentence: "Fresh shrimps were sold to third party processors by the kerpan farm."
Structural semantic pattern: A1([0 2][A root:N][fresh shrimps][A N]) Pred([3 4][sell]) A2([4 8][AU INF root:N NUM ORD N][to third party processors][AU INF NUM ORD N N]) A0([8 12][PREP root:N DET N][by the Kerpan farm][PREP DET N N])

Stored Example 1
Source sentence: "Relationship marketing then is selling your product to broker."
Structural semantic pattern: A0([0 2][N root:N][relationship marketing][N N]) AM-TMP([2 3][ADV][then][ADV]) Pred([4 5][sell]) A1([5 7][GEN PRON root:N][your product][GEN PRON N]) A2([7 9][AU INF root:N][to broker][AU INF N])

Stored Example 2
Source sentence: "The officer is selling information to the enemy."
Structural semantic pattern: A0([0 2][DET root:N][the officer][DET N]) Pred([3 4][sell]) A1([4 5][root:N][information][N]) A2([5 8][AU INF DET root:N][to the enemy][AU INF DET N])

Table 2: Examples of Structural Semantic Pattern Distance Measurements

                                  Stored Example 1    Stored Example 2
Input Example 1                        0.3646              0.5181
Input Example 2                        0.6792              0.5778
Input Example 3                        0.7167              0.6278

(Input Examples 1-3 and Stored Examples 1-2 are the sentences listed in Table 1.)

Figure 10: Semantic Compositional Structure Building

2. The target predicate-argument structures are merged based on the following simple algorithm (sketched in code after this list):

   (a) Search for the first predicate by traversing the structural semantics dependency tree (represented as directed graph objects).
   (b) The corresponding target predicate-argument structure of the first predicate is used as the target structure's root, as shown in Iteration 1 of Fig. 11.
   (c) Traverse the target predicate-argument structure.
   (d) If a node is an argument, search the structural dependency tree to find any predicate-argument structure bound to the scope of this argument; e.g. the structure "A0[2](2 3) Pred[2](3 4) A1[2](5 7)" is bound to the "A1" argument of the root predicate.

Figure 11: Target Sentence’s Structure Construction

   (e) The target argument node is replaced with the corresponding target predicate-argument structure, as shown in Iteration 2 of Fig. 11.
   (f) The iteration is repeated, traversing the target structure and replacing argument nodes.
   (g) The iteration ends when no argument replacement is required.
   (h) A redundancy check is performed to eliminate repeated argument node(s); e.g. the target argument node T:A0[3] is removed, as it repeats the argument node T:A0[2] in the structure.

3. Mapping from structure to text:

   (a) The translation process continues with the mapping from the target structure to the target string, as shown in Fig. 12.
   (b) The string of the target predicate(s) is directly mapped as the translated string for the verb(s) (as it is directly matched and selected).

Figure 12: Target Text Recombination and Generation

(c) As for the string segments of the target arguments, the original source

M

string will be mapped and highlighted as string segments require further translation.

(d) All these string segments will be translated using the baseline EBMT

ED

system.

650

(e) The translated text is output as the final result.

PT

(f) Based on the example: “He thought I like to eat fish.” the translated

CE

text is: “Dia fikir saya suka makan ikan.”.

6. Evaluation and Results

AC

655

In this section, experiments are conducted to evaluate the translation results

of the new SiSTeC+SEM framework. The dataset for the experiments is briefly discussed in 6.1. The first experiment in 6.2 is to evaluate the performance of the SiSTeC+SEM against the SiSTeC baseline EBMT system. The evaluation results of SiSTeC and SiSTeC+SEM are compared to SMT and Neural MT

38

ACCEPTED MANUSCRIPT

660

in 6.3. Semantic-based evaluation with human justification is carried out in 6.4 as a complementary test for the automatic evaluation metrics.

CR IP T

6.1. Preparation and Test Examples Selection Twenty thousand English SSTCs are selected from the existing BKB to train a dependency tree parser (G´ omez-Rodr´ıguez & Nivre, 2013). The dependency 665

structure of all the SSTCs in the BKB are replaced with new parsed result using this trained English dependency tree parser. All the S-SSTCs are processed with

semantic parser (Punyakanok et al., 2008) and annotated with the structural

AN US

semantics to form the S-SSTC+SEM. The new input sentence will be parsed

using the dependency tree parser and semantic parser later in the translation 670

phase such that the produced semantic structure will be consistent with the stored examples in the BKB.

One thousand examples are selected from the BKB. The selection is performed based on criteria such as: short and long sentences (3 to 30 words); sim-

675

M

ple and complex sentences, from sentences with single verb (single predicate) to multiple verbs (with main predicate and sub-predicates, complex arguments

ED

structures); with passive and active form sentences; the lexicons in the sentences should have corresponding target translation within the scope of the BKB. All these one thousand instances are removed from the BKB and translated us-

680

PT

ing the remaining stored translation examples. Based on manual examination of these translation results, one hundred examples with translation errors are selected: i.e. errors caused by boundary frictions, verb selection, local words

CE

ordering and global phrasal ordering. These test examples consisted of a total of 163 predicates, hence 163 semantic structures. Instead of performing general

AC

test, these filtered examples are used to evaluate the new translation framework

685

specifically targeting on the translation errors identified. 6.2. Evaluation of SiSTeC and SiSTeC+SEM Automated evaluations of the translation results are performed using the BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Denkowski

39

ACCEPTED MANUSCRIPT

& Lavie, 2014), LEPOR (Han et al., 2012) and TER (Snover et al., 2006). Be690

sides of measurements based on precision (BLEU, NIST) and recall (METEOR), the LEPOR considers more aspects such as sentence length penalty and n-gram

CR IP T

position penalty. In a different respect than the n-gram based metrics, the TER is used to estimate the post-editing efforts required in order to modify the translation results such that it can match with the reference translation.

The overall comparisons of the translation results are shown in Table 3. In

695

the test with the 100 samples, the evaluation scores of the translation results

from SiSTeC+SEM is higher than the SiSTeC, respectively with percent points

AN US

of: 22.95 (BLEU), 51.13 (NIST), 53.63 (METEOR), 68.93 (LEPOR), and 63.19

(TER). Among these scores, the translation results from the new translation 700

framework are contributed to an improvement of 8.05 percent points based on the TER score and 6.13 percent points with the NIST metric. With careful examination of the translation results, there are examples with very similar translation results, i.e. with similar target verb(s) and predicate-argument structure.

705

M

As the main purpose of the evaluation is to compare the differences between the translation results of SiSTeC and SiSTeC+SEM, 60 test examples with very sim-

ED

ilar translation results are filtered. The second round of the evaluation is scoped down to these 40 examples and comparisons obtained are shown in Table 3. The SiSTeC+SEM contributed 12.54 percent points of improvement according to the

710

PT

TER metric and 10.37 percent points with LEPOR metric as compare to the results from SiSTeC.

CE

The statistical significance test is performed using the paired bootstrap resampling approach proposed by Koehn (2004) for small set of test data. The virtual test sets are created based the selected samples. The bootstrap resam-

AC

pling process is repeated in 1000 iterations. As shown in Table 4, the translation

715

results of SiSTeC+SEM is significantly better than the SiSTeC at p < 0.05 based

on the BLEU and NIST metrics.

40

ACCEPTED MANUSCRIPT

Table 3: Evaluation of Translation Results for SiSTeC and SiSTeC+SEM

NIST(%) METEOR LEPOR

(%)

TER

(%)

(%)

(%)

100 Samples

CR IP T

BLEU

16.67

44.99

51.49

64.99

71.24

SiSTeC+SEM

21.26

51.13

53.63

68.93

63.19

Difference

4.59

6.13

2.14

3.94

8.05

SiSTeC

16.26

39.67

50.43

62.71

68.95

SiSTeC+SEM

22.95

48.09

57.66

73.08

56.41

Difference

6.69

8.42

7.21

10.37

12.54

AN US

SiSTeC

ED

M

40 Samples

Table 4: Medians and confidence intervals for SiSTeC and SiSTeC +SEM using Paired Bootstrapping Resampling

PT

SiSTeC

SiSTeC+SEM

P-value

BLEU

0.1990 ± 0.0394

0.2286 ± 0.0491

0.04

40 Samples

0.1666 ± 0.0551

0.2578 ± 0.0924

0.03

100 Samples

4.7764 ± 0.3627

5.2569 ± 0.3716

0.02

40 Samples

3.9077 ± 0.4248

4.9397 ± 0.5608

0.01

CE

100 Samples

AC

NIST

41

ACCEPTED MANUSCRIPT

Table 5: Comparison with other MT systems

NIST

METEOR LEPOR

TER

(%)

(%)

(%)

(%)

(%)

SiSTeC

16.67

44.99

51.49

64.99

71.24

SiSTeC+SEM

21.26

51.13

53.63

68.93

63.19

Moses

25.20

51.94

55.22

67.25

75.13

OpenNMT

22.75

48.34

50.14

63.17

74.28

SiSTeC

16.26

39.67

50.43

62.71

68.95

SiSTeC+SEM

22.95

48.09

57.66

73.08

56.41

Moses

27.74

47.19

57.56

68.34

74.78

OpenNMT

26.31

53.66

65.72

72.70

AN US

100 Samples

CR IP T

BLEU

M

40 Samples

46.25

NMT

ED

6.3. Comparative Evaluation of SiSTeC, SiSTeC+SEM, MOSES and Open-

720

PT

As comparisons, two different MT systems are trained using the 100,000 parallel aligned sentence pairs extracted from the BKB (same dataset for SiSTeC and SiSTeC+SEM). One is phrase-based SMT based on the Moses (Koehn et al.,

CE

2007) and another one is a Neural MT using OpenNMT (Klein et al., 2017). For Moses training, the translation examples are automatically aligned using

AC

GIZA++1 and a language model is trained using the open source IRSTLM2

725

toolkit up to 5 gram. The translation test for both Moses and OpenNMT is conducted using the 1 http://www.statmt.org/moses/giza/GIZA++.html 2 http://hlt-mt.fbk.eu/technologies/irstlm

42

ACCEPTED MANUSCRIPT

same set of samples in the previous test for SiSTeC and SiSTeC+SEM. The results are combined with the previous test results and illustrated in Table 5. In the test using the 100 samples, Moses obtained best scores with the BLEU (25.20 points), NIST (51.94 points) and METEOR (55.22 points). The performance

CR IP T

730

of SiSTeC+SEM is comparable with Moses with reference to the NIST (51.13

points) and METEOR (53.63 points). Furthermore, it achieved best scores

with LEPOR (68.93 points) and TER (63.19 points) evaluation metrics. For the testing using the filtered 40 samples, the SiSTeC+SEM performed better 735

than all other MT systems in all of the metrics except BLEU score.

AN US

As shown in Table 6, Moses can select better verb than the SiSTeC. However,

Moses is not able to determine the correct morphological form for some of the selected verbs, as shown in Table 7, “dilantik” and “diproses” are selected for the sentence “we have not yet appointed a place for the meeting” and “the bank 740

quickly processed the loan requested by the company” respectively. Both verbs are not in proper morphological form. The verb “dilantik” is suitable unless the

M

target sentence’s structure is modified and changed to “tempat untuk mesyuarat masih belum dilantik oleh kami”. For the verb “diproses”, the target sentence

745

ED

will need to restructure and change to “pinjaman yang diminta oleh company itu diproses dengan cepat oleh bank itu”. As the SiSTeC+SEM will select verb with minimum adaptation to the sentence’s structure, the target verbs selected

PT

for these two input sentences are “melantik” and “memproses” respectively. As shown in Table 8, the limitation in selection of verbs with suitable morphological

CE

form is also observed in the OpenNMT translation results. 750

6.4. Semantic-based Evaluation

AC

The requirements of the automatic evaluations in Section 6.2 are simple

and depending only on the reference translation, no additional language specific data or tools and training are required. Consequently, the automatic evaluation is influenced by how well the translation results will correlate with the refer-

755

ence translations. The automatic scores are unable to reflect the consistency of meaning structure between the input sentence and target sentence. As a sim43

ACCEPTED MANUSCRIPT

Table 6: Examples of translation results: SiSTeC vs. Moses

English Source Sen-

Human Translation

SiSTeC Translation

tence

Transla-

tion dapat akses bilik .

akan

room.

mendapat

keuntungan akses kepada

se

bilik . accompany

someone

on

a

mengiringi orang

sese-

untuk

dalam

perjalanan.

to adjust to life in

menyesuaikan

a foreign country.

diri

dengan

perjalanan.

ke-

hidupan di negara asing.

akses kepada bilik.

dalam

mengiringi orang

sese-

dalam

perjalanan.

AN US

journey.

mendapat

buah

mengikut

someone

untuk

CR IP T

to gain access to a

to

MOSES

mengubah dengan

menyesuaikan

kehidupan di ne-

diri

gara asing.

hidupan

dengan di

keluar

negara.

M

ple solution to this limitation, the evaluation using semantic frames based on the suggestion in Lo et al. (2012) and Lo & Wu (2011) is performed. In this

760

ED

semantic-based evaluation, the automated translation results and the reference translations are manually annotated with semantic roles. The precision, recall and f-score is calculated for the semantics similarities between the automated

PT

translation results and the reference translations. The proposed equations in Lo & Wu (2011) are slightly modified to prioritize on the correct and selection

CE

of more suitable target verb. The modified equations are list as below (with k 765

total number of test sentences):

AC

1. General definition of variables:

770

(a) Ci,j = number of correct fillers of argument j for predicate i in the machine translation result (b) Pi,j = number of partial fillers of argument j for predicate i in the machine translation result (c) Mi,j = total number of fillers of argument j for predicate i in the 44

CR IP T

ACCEPTED MANUSCRIPT

Table 7: Examples of translation results: SiSTeC+SEM vs. Moses

Englsh Source Sen-

Human Translation

SiSTeC+SEM

tence

Translation kami

belum

pointed a place for

melantik

the meeting.

mesyuarat.

lagi

kami

tempat

tidak

melantik untuk

Transla-

tion

lagi

kami masih belum

AN US

we have not yet ap-

MOSES

tempat

mesyuarat

dilantik untuk

itu.

itu.

tempat mesyuarat

diterima masuk ke

diterima masuk ke

untuk

a university.

universiti.

universiti.

sukkan ke univer-

M

to be admitted to

dima-

siti.

dia dapat menen-

dia dapat menen-

dia berlawan dari

shark and swam

tang ikan yu dan

tang yu dan ber-

shark dan bere-

berenang kembali

enang kembali ke

nang ke pantai.

ke pantai.

pantai.

the bank quickly

bank memproses

bank memproses

bank

processed the loan

pinjaman

yang

pinjaman itu dim-

cepat

requested by the

dipohon

oleh

inta oleh syarikat

pinjaman

yang

company.

syarikat itu dengan

itu cepat.

diminta

oleh

ED

he fought off the

itu

dengan diproses

syarikat itu.

segera.

AC

CE

PT

back to the beach.

45

CR IP T

ACCEPTED MANUSCRIPT

Table 8: Examples of translation results: SiSTeC+SEM vs. OpenNMT

Englsh Source Sen-

Human Translation

SiSTeC+SEM

tence

Translation kami

belum

pointed a place for

melantik

the meeting.

mesyuarat.

lagi

kami

lation

tidak

lagi

belum

AN US

we have not yet ap-

OpenNMT Trans-

tempat

melantik untuk

tempat

mesyuarat

lagi

dilantik

kami

sebagai

tempat

untuk

itu.

mesyuarat itu.

diterima masuk ke

diterima masuk ke

dilantik ke univer-

a university.

universiti.

universiti.

siti.

the

pihak

management

M

to be admitted to

pengurusan

mengadili

close down the old

baik

factory.

kilang

penguru-

pihak

pengurusan

lebih

san mengadili ia

mengambil masa

menutup

lebih baik kepada

yang

menutup

untuk

ED

judged it better to

pihak

lama

itu

kilang

lebih

baik

menutup

lama.

kilang lama itu.

the bank quickly

bank memproses

bank memproses

bank

processed the loan

pinjaman

yang

pinjaman itu dim-

dengan

requested by the

dipohon

oleh

inta oleh syarikat

yang diminta oleh

company.

syarikat itu dengan

itu cepat.

syarikat tersebut .

segera.

AC

CE

PT

sahaja.

46

itu

cepat kemas

ACCEPTED MANUSCRIPT

machine translation result (d) Ri,j = total number of fillers of argument j for predicate i in the

775

(e) wp = predicate weight (f) wa = argument weight, which is 1 − wp

CR IP T

reference translation

2. For n predicates in a test sentence, the precision of correct predicate P redi

AN US

and complete argument(s) Ci,j matching for each of the predicate P redi : Pm n X wp P redi + j wa Ci,j Pm Cprecision = wp P redi + j wa Mi,j i

3. For n predicates in a test sentence, the recall of correct predicate P redi

M

and complete argument(s) Ci,j matching for each of the predicate P redi : Pm n X wp P redi + j wa Ci,j Pm Crecall = wp P redi + j wa Ri,j i

4. For n predicates in a test sentence, the precision of partial argument(s)

ED

Ci,j matching for each of the predicate P redi : Pm n X j wa Pi,j Pm Pprecision = w P red p i+ j wa Mi,j i

PT

5. For n predicates in a test sentence, the recall of partial argument matching

AC

CE

Pi,j for each of the predicate P redi : Precall =

n X i

Pm j

wa Pi,j Pm j wa Ri,j

wp P redi +

6. The precision of the overall test set with k sentences: P k (Cprecision + Pprecision ) precision = total number of predicates in M T

7. The recall of the overall test set with k sentences: P k (Crecall + Precall ) recall = total number of predicates in REF 47

ACCEPTED MANUSCRIPT

8. the f-score of the overall test set with k sentences: f -score =

2 ∗ precision ∗ recall precision + recall

CR IP T

The weights of the predicate (wp ) and argument (wa ) in the above equations

are separated so that it can be set at different scale in the calculation. This

is based on the assumption that when the target verb selection is inaccurate, 780

the whole interpretation of the sentence will be affected. The filtered 40 test examples in Section 6.2 are used to perform the semantic-based evaluation. As

shown in Fig. 13, the translation results from the SiSTeC+SEM achieved higher

AN US

precision, recall and f-score scores as compare to the translation results from

SiSTeC. This indicates that in order to improve the overall translation results 785

of the SiSTeC, the new translation framework will need to select better verb and at the same time suggest correct predicate-argument structure. Table 9 shows one example from the SiSTeC translation results where the selection of

M

verbs with incorrect morphological form is causing predicate-argument structure errors to the target sentence. The errors lead to misinterpretation of the overall 790

sentence’s meaning. Thus, the semantic-based evaluation results are aligned

ED

with the results in Section 6.3 where the SiSTeC+SEM is capable of selecting more suitable verb and also with more accurate semantic arguments structure.

PT

7. Discussion

Based on examination of the test results, apparently the n-gram based evaluation metrics are susceptible to structural alteration of sentences. In order

CE 795

to obtain high score in n-gram based evaluation, the translation results need

AC

to reveal resembling grammatical structure and high lexical similarity with the reference translation. The selected test samples for the evaluations were mainly examples with boundary friction, structural and reordering errors when trans-

800

lating with SiSTeC. From observation, the translation results from SiSTeC have more structural variations than the reference translations as compare to the translation results from the SSTC+SEM. Thus, the scores of the n-gram based 48

AN US

CR IP T

ACCEPTED MANUSCRIPT

M

Figure 13: Semantic-base Evaluation for SiSTeC+SEM vs. SiSTeC

Table 9: Examples of Semantic-based evaluation results: SiSTeC+SEM vs. SiSTeC EBMT

SiSTeC

A0pred1 (pihak pen-

A0pred1 (pihak pen-

A0pred1 (ia

gurusan)

gurusan)

baik

ED

SiSTeC+SEM

PT

Reference

CE

kilang lama itu di-

baik

tutup sahaja)

menutup

AC

A1pred1 (ia

Result

(SiSTeC+SEM)(SiSTeC) lebih

match

not match

pred1(dinilai)

match

match

A1pred1 (pihak pen-

match

not match

match

not match

match

match

untuk

di-

tutup golongan tua kilang)

pred1(berpendapat) pred1(mengadili) A1pred1 (lebih baik

Result

lebih kepada

gurusan)

kilang

lama)

A1pred2 (kilang

A1pred2 (kilang

A0pred2 (golongan

lama itu)

lama)

tua kilang)

pred2(ditutup)

pred2(menutup)

pred2(ditutup)

49

ACCEPTED MANUSCRIPT

metrics for SiSTeC+SEM is better than SiSTeC. As the focus of this research is more on the adequacy of meaning structure of the target translation, the seman805

tic frames approach suggested by Lo & Wu (2011) is probably more suitable to

CR IP T

compare the translation results of SiSTeC and SSTC+SEM. As to complement the automatic evaluation metrics, the semantic-based evaluations is conducted and presented in Section 6.4.

The translation examples selection in SiSTeC+SEM can be separated into 810

two stages. The first stage is focusing on selection by meaning structure, which required full matching between the input sentence and the stored examples.

AN US

The second stage is partial selection, to retrieve the equivalent TL text for the input sentence, which involved segmentation during the translation phase. The segmentation of the input sentence is guided by the boundaries of arguments 815

defined by the meaning structure of the input sentence. Hence, the partial translation example selection in the second stage is governed by the first stage example selection. With the partial example selection results, further recom-

M

bination will be facilitated by the target sentence’s meaning structure. A few characteristics of the new translation framework based on the structural semantics can be highlighted:

ED

820

1. The selection of the target verb is not based on direct matching of input

PT

sentence’s verb. Instead, it is based on the combination of the co-occurring arguments and its structure, which provides complete definition of the verb predicate. This leads to the disambiguation of polysemy verb and also determining the proper morphological form of the target verb. As shown in

CE

825

Table 10, in most of the cases, the SiSTeC+SEM can select better verb(s)

AC

as compare to the SiSTeC. The verb selection with correct morphological

830

form is important for Malay language. For example, for the input sentence “the bank quickly processed the loan requested by the company.”, the SiSTeC+SEM selected the verb “diminta” as target translation for the verb “requested”, which is more suitable than the verb “meminta” selected by the SiSTeC. Although both “diminta” and “meminta” have the same base form

50

ACCEPTED MANUSCRIPT

“minta”, but they lead to different meaning interpretation with reference to the constructed sentence’s structure. 835

2. The selection of translation example will prioritize example with minimal

CR IP T

adaptation, minimizing the reordering and transformation steps during the recombination. With minimal adaptation, the cost of processing and error

rates will be reduced. The translated result will be generated as natural

as possible. As shown in Table 11, the example “she begrudged the time her husband spent with his friends.” requires swapping of the arguments’

840

position for A0 and A1 in order to adapt to the input sentence. As for

AN US

the example “she spent her vacation swanning around Europe visiting old friends.”, there is no restructuring of arguments will be required. Thus, it

serves as best matched example in this case and this is indirectly reflected with the lower semantics distance.

845

The overall evaluation results in Section 6 provides sufficient evidence that

M

the structural semantics is a feasible approach to preserve the meaning structure of the source language towards the final translation. This is clearly re-

850

ED

viewed in the process of translation example matching to determine the best semantic structure and selection of the most suitable verb with correct morphological form. The transformation of the target sentence’s structure is eased

PT

by the multi-level syntactic and semantic structural correspondences in the SSSTC+SEM.

CE

Semantic annotation at the structural level for the S-SSTC provides a flexi855

ble and efficient way for integration of multi-layers linguistics knowledge, at the same time maintain the specification of the correspondence between bi-lingual

AC

translation examples in a natural and simplest representation scheme. This multi-layers correspondence approach allows future extension of the representation to cater for other aspects of linguistics knowledge, without the need to

860

modify the current representation scheme of S-SSTC. The integration of structural annotation with the S-SSTC allows indirect projection of the source language linguistics knowledge to the target language via structural correspondence 51

ACCEPTED MANUSCRIPT

relationships. Together, the structural semantics annotation and multi-layers correspondence relationships enhance the meaning preservation of the transfer 865

from the source language to the target language throughout the whole process of

CR IP T

translation. These integrated and synchronized structural semantics annotation and multi-layers correspondences provide the foundation to derive and design meaning-based approaches to resolve problems in the EBMT system.

8. Conclusion and Future Work

Solutions for better means of meaning preservation are designed, by incor-

870

AN US

porating of the guided analysis based on the structural semantics information

as opposed to the non-guided analysis approach in the previous SiSTeC EBMT. This structural semantics contributes as a reference basic meaning structure for translation example selection and for facilitating the meaning-based construc875

tion of the target sentence structure. This meaning structure allows imposition

M

of semantic constraints throughout the translation process such that the semantic consistency and integrity of the input sentence can be preserved and transferred to the target sentence.

880

ED

As elaborated in Section 5, no specific cross language handling is explicitly defined in the new EBMT framework except the language pair alignment. The

PT

corresponding of syntactic and semantic structure in the Bilingual Knowledge Bank is learned solely from the alignment information. Hence, the proposed approach will be applicable for language pairs with predicate argument structure

CE

which can be explicitly matched and the semantic roles variation can be clearly

885

specified. However, the degree of accuracy could be varied for language pairs

AC

which are very different in nature. As continuous efforts to improve the English to Malay EBMT system, we

are keen to conduct further research into following aspects: 1. Segment translation improvement in SiSTeC EBMT - Words reordering error

890

of the segment translation using the baseline SiSTeC is detected based on the translation test in Section 6. As the structural index matching is not sensitive 52

ACCEPTED MANUSCRIPT

Table 10: Examples of translation results: SiSTeC+SEM vs SiSTeC EBMT

Englsh Source Sen-

Human Translation

SiSTeC Translation

tence

SiSTeC+SEM Translation

menuntut

to a building

ses

ke

ak-

untuk

sesebuah

memer-

menuntut

akses

CR IP T

to demand access

lukan pintu masuk

pada bangunan.

bangunan.

pada bangunan.

was

mata pelajaran itu

perkara itu diajar

perkara itu diajar

se-

diajar di sekolah-

di dipilih sekolah

dalam

sekolah-

lected schools as

sekolah terpilih se-

sebagai uji kaji.

sekolah

terpilih

an experiment.

bagai satu eksperi-

subject

taught

in

men. these

considera-

pertimbangan-

tions induce me to

pertimbangan

believe that.

membuat

sebagai uji kaji.

AN US

the

pertimbangan-

ini

pertimbangan-

pertimbangan

saya

membuat

percaya bahawa.

untuk

ini

saya

memper-

pertimbangan membuat

ini saya

percaya bahawa.

the Kerpan farm

ladang

currently

pada

to

shrimps third

party

management

CE

the

menjual

ini

udang

Kerpan pada

menjual

ini den-

gan

pemproses ketiga.

pemproses

udang

kepada

pihak

pihak

harga

segar

ladang pada

Kerpan masa

menjual

ini

udang

segar kepada pihak ketiga pemproses.

ketiga. pengurusan

pihak dinilai

judged it better to

mengadili

close down the old

baik

menutup

baik

factory.

kilang

lama

tutup

AC

ladang masa

segar kepada pihak

PT

processors.

Kerpan

masa

ED

fresh

sells

M

cayainya itu.

lebih

itu

pengurusan ia untuk

pihak

penguru-

lebih

san mengadili ia

di-

lebih baik kepada

golongan

menutup

kilang

sahaja.

tua kilang.

the bank quickly

bank memproses

bank

cepat

bank memproses

processed the loan

pinjaman

yang

memproses pinja-

pinjaman itu dim-

requested by the

dipohon

oleh

man itu meminta

inta oleh syarikat

company.

syarikat itu dengan

oleh syarikat itu.

itu cepat .

segera.

53

itu

lama.

ACCEPTED MANUSCRIPT

Input

“During the daytime houseflies spend their time outdoors or in covered areas near the open air.”

Pattern

CR IP T

Table 11: Example 1 of Verb Selection Based On Structural Semantic Patterns

AM-TMP([1 3][root:PREP][during N

DET])

the

daytime][PREP

A0([4 4][root:PRON][houseflies][PRON])

Pred([5 5][spend])

A1([6 7][GEN PRON

root:N][their

AN US

time][GEN PRON N]) A2([8 16][N root:CC PREP][outdoors or in covered areas near the open air][N CC PREP V EN N PREP DET N])

Source

Matched Example 1

Matched Example 2

She spent her vacation swan-

She begrudged the time her hus-

ning around Europe visiting old

band spent with his friends.

Dia menghabiskan percutian

Dia menyesali masa yang di-

nya mengembara sekitar Eropah

habiskan

ED

Target

M

friends.

sambil

melawat

kawan-kawan

oleh

suami

nya

bersama kawan-kawan nya.

lama nya.

A0([1 1][root:PRON][she]

A1([3 4][DET

[PRON])

[the

AC

CE

PT

Pattern

Distance

Pred([2 2][spend])

time]

root:N] [DET

N])

A1([3 4] [GEN PRON root:N]

A0([5 6][GEN PRON

[her vacation] [GEN PRON N])

[her husband] [GEN PRON N])

A2([5 10][root:N ; root:PREP

Pred([7 7][spend])

N] [swan around europe visit old

[root:PREP N] [with his friend]

friend] [N PREP N ING A N])

[PREP GEN PRON N])

0.60625

0.80417

54

root:N]

A2([8 10]

ACCEPTED MANUSCRIPT

to syntactic phrases, under certain circumstances during the recombination process, a single phrase could be segmented into smaller chunks. When this occurred, the words ordering information from the source to target is lost.

CR IP T

Hence, the words reordering of the phrase is not able to perform at the

895

target side. One of the possible solutions to resolve this local reordering

issue is based on simple syntactic transfer rules to perform post translation words reordering.

2. Adding in more types of linguistic annotations - There are other aspects of linguistics information such as named entities, which involve classification

900

AN US

of Proper Noun, i.e. Person, Organisation, Place, Time, etc. Annotation of SSTC with named entities will provide context sensitive information to aid the semantic argument annotation and argument level abstraction. This is meaningful especially when the EBMT is targeting for domain specific text translation by generalizing the translation example pairs and at the

905

M

same time supplying domain specific glossary or dictionary for specific terms translation. This allows reuse of the BKB and minimizes the requirements of additional resources when implementing automated translation solution for

References

PT

910

ED

specific domain document translation.

Al-Adhaileh, M. H., Kong, T. E., & Yusoff, Z. (2002). A synchronization struc-

CE

ture of SSTC and its applications in machine translation. In Proceedings of the 2002 COLING Workshop on Machine Translation in Asia - Volume 16 COLING-MTIA ’02 (pp. 1–8). Stroudsburg, PA, USA: Association for Computational Linguistics.

AC

915

Al-Adhaileh, M. H., & Tang, E. K. (1999). Example-based machine translation based on the synchronous SSTC annotation schema. In Machine Translation Summit VII (p. 244249).

Albacete, E., Calle, J., Castro, E., & Cuadra, D. (2012). Semantic similarity 55

ACCEPTED MANUSCRIPT

measures applied to an ontology for human-like interaction. J. Artif. Int.

920

Res., 44 , 397–421. Aramaki, E., & Kurohashi, S. (2004). Example-based machine translation using

on spoken language translation (IWSLT-04) (p. 9194). 925

CR IP T

structural translation examples. In Proceedings of the international workshop

Aramaki, E., Kurohashi, S., Kashioka, H., & Tanaka, H. (2003). Word selec-

tion for EBMT based on monolingual similarity and translation confidence. In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using

AN US

Parallel Texts: Data Driven Machine Translation and Beyond - Volume 3

HLT-NAACL-PARALLEL ’03 (pp. 57–64). Stroudsburg, PA, USA: Association for Computational Linguistics.

930

Aziz, W., Rios, M., & Specia, L. (2011). Shallow semantic trees for SMT. In Proceedings of the Sixth Workshop on Statistical Machine Translation WMT

M

’11 (pp. 316–322). Stroudsburg, PA, USA: Association for Computational Linguistics.

Bazrafshan, M., & Gildea, D. (2013). Semantic roles for string to tree machine

ED

935

translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 419–423). Sofia,

PT

Bulgaria: Association for Computational Linguistics. Bertero, D., & Fung, P. (2015). HLTC-HKUST: A neural network paraphrase classifier using translation metrics, semantic roles and lexical similarity fea-

CE

940

tures. In Proceedings of the 9th International Workshop on Semantic Evalua-

AC

tion, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, June 4-5, 2015 (pp. 23–28).

Bj¨ orkelund, A., Hafdell, L., & Nugues, P. (2009). Multilingual semantic role

945

labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task CoNLL ’09 (pp. 43–48). Stroudsburg, PA, USA: Association for Computational Linguistics.

56

ACCEPTED MANUSCRIPT

Blanco, E., & Moldovan, D. (2013). Composition of semantic relations: Theoretical framework and case study. ACM Transactions on Speech and Language Processing (TSLP), 10 , 17.

950

CR IP T

Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., & Roossin, P. S. (1990). A statistical approach to machine translation. Comput. Linguist., 16 , 79–85.

Brown, R. D. (1999). Adding linguistic knowledge to a lexical example-based

translation system. In In Proceedings of the Eighth International Conference

955

AN US

on Theoretical and Methodological Issues in Machine Translation (TMI-99 (pp. 22–32).

Brown, R. D. (2000). Automated generalization of translation examples. In Proceedings of the 18th Conference on Computational Linguistics - Volume 1 COLING ’00 (pp. 125–131). Stroudsburg, PA, USA: Association for Compu-

960

M

tational Linguistics.

Brown, R. D. (2001). Transfer-rule induction for example-based translation. In M. Carl, & A. Way (Eds.), Recent Advances in Example-Based Machine

965

ED

Translation (pp. 1–11). Kluwer Academic. B¨ uchse, M., Nederhof, M.-J., & Vogler, H. (2011). Tree parsing with syn-

PT

chronous tree-adjoining grammars. In Proceedings of the 12th International Conference on Parsing Technologies IWPT ’11 (pp. 14–25). Stroudsburg, PA,

CE

USA: Association for Computational Linguistics. Carl, M. (2005). A system-theoretical view of EBMT. Machine Translation, 19 , 229–249.

AC

970

Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Dlougach, J., & Galinskaya, I. (2012). Building a reordering system using tree975

to-string hierarchical model. In Proceedings of the Workshop on Reordering 57

ACCEPTED MANUSCRIPT

for Statistical Machine Translation (pp. 27–36). Mumbai, India: The COLING 2012 Organizing Committee. Doddington, G. (2002). Automatic evaluation of machine translation quality

CR IP T

using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research HLT ’02 (pp.

980

138–145). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Doi, T., Yamamoto, H., & Sumita, E. (2005). Example-based machine translation using efficient sentence retrieval based on edit-distance. ACM Transac-

985

AN US

tions on Asian Language Information Processing, 4 , 377–399.

Dorr, B., & Habash, N. (2002). Interlingua approximation: A generation-heavy approach. In Proceedings of Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas, AMTA2002 (pp. 1–6). University of Chicago Press.

M

Dorr, B. J., Hovy, E., & Levin, L. (2006). Machine translation: Interlingual methods. In E. in Chief: Keith Brown (Ed.), Encyclopedia of Language &

990

ED

Linguistics (Second Edition)Encyclopedia of Language & Linguistics (Second Edition) (pp. 383 – 394). Oxford: Elsevier. Dorr, B. J., Passonneau, R. J., Farwell, D., Green, R., Habash, N., Helmreich, S.,

PT

Hovy, E., Levin, L., Miller, K. J., Mitamura, T., Rambow, O., & Siddharthan, A. (2010). Interlingual annotation of parallel text corpora: A new framework

995

CE

for annotation and evaluation. Natural Language Engineering, 16 , 197–243. Feng, M., Sun, W., & Ney, H. (2012). Semantic cohesion model for phrase-based

AC

SMT. In COLING 2012, 24th International Conference on Computational

1000

Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India (pp. 867–878).

Fillmore, C. J. (1968). The case for case. In E. Bach, & R. T. Harms (Eds.), Universals in Linguistic Theory (pp. 0–88). New York: Holt, Rinehart and Winston. 58

ACCEPTED MANUSCRIPT

Gangadharaiah, R., Brown, R. D., & Carbonell, J. G. (2006). Spectral clustering for example based machine translation. In R. C. Moore, J. A. Bilmes,

1005

J. Chu-Carroll, & M. Sanderson (Eds.), HLT-NAACL. The Association for

CR IP T

Computational Linguistics. Gao, Q., & Vogel, S. (2011). Utilizing target-side semantic role labels to assist hierarchical phrase-based machine translation. In Proceedings of the Fifth Work-

shop on Syntax, Semantics and Structure in Statistical Translation SSST-5

1010

(pp. 107–115). Stroudsburg, PA, USA: Association for Computational Lin-

AN US

guistics.

Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press. 1015

Goldberg, A. E. (1999). The emergence of the semantics of argument structure constructions. In B. MacWhinney (Ed.), Emergence of Language. Hillsdale,

M

NJ: Lawrence Earlbaum Associates.

Goldberg, A. E. (2016). The routledge handbook of semantics. chapter Com-

1020

ED

positionality. (pp. 419–430). Routledge. Gomaa, W. H., & Fahmy, A. A. (2013). A survey of text similarity approaches.

PT

International Journal of Computer Applications, 68 , 13–18. G´ omez-Rodr´ıguez, C., & Nivre, J. (2013). Divisible transition systems and

CE

multiplanar dependency parsing. Computational Linguistics, 39 , 799–845. Gruber, J. S. (1965). Studies in Lexical Relations. Ph.D. thesis MIT Cambridge, MA.

AC

1025

Hajiˇc, J., Ciaramita, M., Johansson, R., Kawahara, D., Mart´ı, M. A., M` arquez, ˇ ep´ L., Meyers, A., Nivre, J., Pad´ o, S., Stˇ anek, J., Straˇ n´ ak, P., Surdeanu, M., Xue, N., & Zhang, Y. (2009). The CoNLL-2009 Shared Task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the

1030

Thirteenth Conference on Computational Natural Language Learning: Shared

59

ACCEPTED MANUSCRIPT

Task CoNLL ’09 (pp. 1–18). Stroudsburg, PA, USA: Association for Computational Linguistics. Han, A. L. F., Wong, D. F., & Chao, L. S. (2012). LEPOR: A robust evaluation

CR IP T

metric for machine translation with augmented factors. In COLING 2012,

24th International Conference on Computational Linguistics, Proceedings of

1035

the Conference: Posters, 8-15 December 2012, Mumbai, India (pp. 441–450).

Healy, A. F., & Miller, G. A. (1970). Psychonomic science. chapter The verb

as the main determinant of sentence meaning. (p. 372). Psychonomic Society

1040

AN US

volume 20.

Hutchins, J. (2005a). Example-based machine translation: A review and commentary. Machine Translation, 19 , 197–211.

Hutchins, J. (2005b). Towards a definition of example-based machine translation. In Proceedings of Second Workshop on Example-Based Machine Trans-

1045

M

lation (pp. 63–70). Phuket, Thailand: MT Summit X.

Imamura, K., Okuma, H., Watanabe, T., & Sumita, E. (2004). Example-based

ED

machine translation based on syntactic transfer with statistical models. In Proceedings of the 20th International Conference on Computational Linguistics COLING ’04. Stroudsburg, PA, USA: Association for Computational

1050

PT

Linguistics.

Jackendoff, R. (1976). Toward an explanatory semantic representation. Lin-

CE

guistic Inquiry, 7 , 89–150.

AC

Kaji, H., Kida, Y., & Morimoto, Y. (1992). Learning translation templates

1055

from bilingual text. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 COLING ’92 (pp. 672–678). Stroudsburg, PA, USA: Association for Computational Linguistics.

Kaptein, R., Van den Broek, E. L., Koot, G. et al. (2013). Recall oriented search on the web using semantic annotations. In Proceedings of the sixth

60

ACCEPTED MANUSCRIPT

international workshop on Exploiting semantic annotations in information retrieval (pp. 45–48). ACM. 1060

Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). OpenNMT:

CR IP T

Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints, .

Koehn, P. (2004). Statistical significance tests for machine translation evalua-

tion. In D. Lin, & D. Wu (Eds.), Proceedings of EMNLP 2004 (pp. 388–395). Barcelona, Spain: Association for Computational Linguistics. 1065

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N.,

AN US

Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In ACL. The Association for Computer Linguistics.

Levin, B., & Rappaport Hovav, M. (2005). Argument Realization. Research surveys in linguistics. Cambridge, New York (N.Y.), Melbourne: Cambridge

1070

M

University Press. Autres tirages : 2006, 2007, 2008.

Li, J., Resnik, P., & Daum´e III, H. (2013). Modeling syntactic and semantic

ED

structures in hierarchical phrase-based translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 540–549). Atlanta,

1075

PT

Georgia: Association for Computational Linguistics. Liu, D., & Gildea, D. (2010). Semantic role features for machine translation.

CE

In Proceedings of the 23rd International Conference on Computational Linguistics COLING ’10 (pp. 716–724). Stroudsburg, PA, USA: Association for Computational Linguistics.

AC

1080

Liu, Z., Wang, H., & Wu, H. (2003). Example-based machine translation based on tree-string correspondence and statistical generation. Machine Translation, Volume 20, Issue 1 , 25–41.

Lo, C.-k., Tumuluru, A. K., & Wu, D. (2012). Fully automatic semantic mt 1085

evaluation. In Proceedings of the Seventh Workshop on Statistical Machine 61

ACCEPTED MANUSCRIPT

Translation WMT ’12 (pp. 243–252). Stroudsburg, PA, USA: Association for Computational Linguistics. Lo, C.-k., & Wu, D. (2011). Meant: An inexpensive, high-accuracy, semi-

CR IP T

automatic metric for evaluating translation utility via semantic frames. In Proceedings of the 49th Annual Meeting of the Association for Computational

1090

Linguistics: Human Language Technologies - Volume 1 HLT ’11 (pp. 220– 229). Stroudsburg, PA, USA: Association for Computational Linguistics.

L¨ u, Y., Huang, J., & Liu, Q. (2007). Improving statistical machine translation

AN US

performance by training data selection and optimization. In Proceedings of the

2007 Joint Conference on Empirical Methods in Natural Language Processing

1095

and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 343– 350).

Malik, S. K., & Rizvi, S. (2011). Information extraction using web usage mining,

M

web scrapping and semantic annotation. In Computational Intelligence and Communication Networks (CICN), 2011 International Conference on (pp.

1100

ED

465–469). IEEE.

Matar, Y., Egyed-Zsigmond, E., & Lajmi, S. (2008). KWSim: Concept Similarity Measure. In ARIA (Ed.), CORIA 2008, COnfrence en Recherche

1105

PT

d’Information et Applications (pp. 475–482). Matsumoto, Y., & Kitamura, M. (1997). Acquisition of translation rules from

CE

parallel corpora. In R. Mitkov, & N. Nicolov (Eds.), Recent Advances in Natural Language Processing: Selected Papers from RANLP 95 . John Benjamins.

AC

Mihalcea, R., Corley, C., & Strapparava, C. (2006).

1110

Corpus-based and

knowledge-based measures of text semantic similarity. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 AAAI’06 (pp. 775–780). AAAI Press.

Moreda, P., Llorens, H., Saquete, E., & Palomar, M. (2008). Two proposals of a QA answer extraction module based on semantic roles. In Proceedings of the 62

ACCEPTED MANUSCRIPT

7th Mexican International Conference on Artificial Intelligence: Advances in Artificial Intelligence MICAI ’08 (pp. 174–184). Berlin, Heidelberg: Springer-

1115

Verlag.

CR IP T

Nagao, M. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In A. Elithorn, & R. Banerji (Eds.), Artificial and human intelligence (pp. 173–180). 1120

Nirenburg, S., Domashnev, C., & Grannes, D. J. (1993). Two approaches to

matching in example-based machine translation. IEEE Transactions on Med-

AN US

ical Imaging, .

Palmer, M., Gildea, D., & Kingsbury, P. (2005). The proposition bank: An annotated corpus of semantic roles. Computational linguistics, 31 , 71–106. 1125

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th

M

Annual Meeting on Association for Computational Linguistics ACL ’02 (pp. 311–318). Stroudsburg, PA, USA: Association for Computational Linguistics.

ED

Petrakis, E., & Varelas, G. (2006). Design and evaluation of semantic similarity measures for concepts stemming from the same or different ontologies.

1130

PT

Multimedia Semantics, .

Punyakanok, V., Roth, D., & Yih, W. T. (2008). The importance of syntactic parsing and inference in semantic role labeling. Comput. Linguist., 34 , 257–

CE

287.

Rappaport Hovav, M., & Levin, B. (1998). Building verb meanings. In The

AC

1135

Projection of Arguments: Lexical and Compositional Factors (pp. 97–134).

CSLI Publications, Stanford.

Shen, D., & Lapata, M. (2007). Using semantic roles to improve question answering. In J. Eisner (Ed.), EMNLP-CoNLL (pp. 12–21). ACL.

63

ACCEPTED MANUSCRIPT

1140

Slimani, T. (2013). Description and evaluation of semantic similarity measures approaches. International Journal of Computer Applications, 80 , 25–33. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A Study

CR IP T

of Translation Edit Rate with Targeted Human Annotation. In Proceedings of Association for Machine Translation in the Americas, (pp. 223–231). 1145

Somers, H. (1999). Review article: Example-based machine translation. Machine Translation, .

Sumita, E. (2001). Example-based machine translation using dp-matching be-

AN US

tween word sequences. In Proceedings of the Workshop on Data-driven Meth-

ods in Machine Translation - Volume 14 DMMT ’01 (pp. 1–8). Stroudsburg, PA, USA: Association for Computational Linguistics.

1150

Szab´ o, Z. G. (2013). Compositionality. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy. (Fall 2013 edition ed.). URL: https://plato.

M

stanford.edu/archives/fall2013/entries/compositionality/. Teruko, M., Miller, K. J., Dorr, B. J., Farwell, D., Habash, N., Levin, L., Helmreich, S., Hovy, E., Rambow, O., Florence, R., & Siddharthan, A. (2004).

ED

1155

Semantic annotation for interlingual representation of multilingual texts. In Proceedings of the Workshop on Beyond Named Entity Recognition: Semantic

PT

Labelling for NLP Tasks, LREC . Trandab˘ a¸t, D. M. (2011). Semantic role labeling for structured information extraction. In Proceedings of the fourth workshop on Exploiting semantic

CE

1160

annotations in information retrieval (pp. 25–26). ACM.

AC

Vertan, C., & Martin, V. E. (2005). Experiments with matching algorithms

1165

in example-based machine translation. In Proceedings of the International Workshop Modern approaches in Translation Technologies,.

Vogel, S., Zhang, Y., Huang, F., Tribble, A., Venugopal, A., Zhao, B., & Waibel, A. (2003). The CMU statistical machine translation system. In IN PROCEEDINGS OF MT SUMMIT IX (pp. 110–117). 64

ACCEPTED MANUSCRIPT

Way, A. (2010). Panning for ebmt gold, or ”remembering not to forget”. Machine Translation, 24 , 177–208. 1170

Wu, D., & Fung, P. (2009). Semantic roles for SMT: A hybrid two-pass model. In

CR IP T

Proceedings of Human Language Technologies: The 2009 Annual Conference

of the North American Chapter of the Association for Computational Lin-

guistics, Companion Volume: Short Papers NAACL-Short ’09 (pp. 13–16). Stroudsburg, PA, USA: Association for Computational Linguistics. 1175

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun,

AN US

M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick,

A., Vinyals, O., Corrado, G., Hughes, M., & Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine

1180

M

translation. CoRR, abs/1609.08144 .

Ye, H. H. (2006). Indexing of bilingual knowledge bank based on the synchronous

ED

SSTC structure. Master’s thesis Universiti Sains Malaysia. Zhai, F., Zhang, J., Zhou, Y., & Zong, C. (2012). Machine translation by modeling predicate-argument structure transformation. In COLING 2012,

1185

PT

24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India (pp.

CE

3019–3036).

Zhang, J., & Zong, C. (2013). A unified approach for effectively integrating source-side syntactic reordering rules into phrase-based translation. Language

AC

1190

Resources and Evaluation, 47 , 449–474.

Zhao, H., Zhang, X., & Kit, C. (2013). Integrative semantic dependency parsing via efficient large-scale feature selection. J. Artif. Int. Res., 46 , 203–233.

Zuccon, G., Koopman, B., & Bruza, P. (2014). Exploiting inference from se1195

mantic annotations for information retrieval: Reflections from medical IR. In 65

ACCEPTED MANUSCRIPT

Proceedings of the 7th International Workshop on Exploiting Semantic Anno-

AC

CE

PT

ED

M

AN US

CR IP T

tations in Information Retrieval (pp. 43–45). ACM.

66