Accepted Manuscript
Meaning Preservation in Example-based Machine Translation with Structural Semantics

Chong Chai Chua, Tek Yong Lim, Lay-Ki Soon, Enya Kong Tang, Bali Ranaivo-Malançon

PII: S0957-4174(17)30103-3
DOI: 10.1016/j.eswa.2017.02.021
Reference: ESWA 11128

To appear in: Expert Systems With Applications

Received date: 6 November 2016
Revised date: 7 February 2017
Accepted date: 9 February 2017

Please cite this article as: Chong Chai Chua, Tek Yong Lim, Lay-Ki Soon, Enya Kong Tang, Bali Ranaivo-Malançon, Meaning Preservation in Example-based Machine Translation with Structural Semantics, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.02.021
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights

• Introduce structural semantic annotation to improve an English to Malay EBMT system.
• Emphasize meaning-structure preservation in the automated translation process.
• The translation example representation is extended with structural semantic annotation.
• The fragmentation and inconsistency issues in the EBMT system are resolved.
• The result of English to Malay automated translation is improved.
Meaning Preservation in Example-based Machine Translation with Structural Semantics

Chong Chai Chua (a,*), Tek Yong Lim (a), Lay-Ki Soon (a), Enya Kong Tang (b), Bali Ranaivo-Malançon (c)

(a) Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
(b) Universiti Sains Malaysia, Gelugor, 11800, Pulau Pinang, Malaysia
(c) Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Abstract
The main tasks in Example-based Machine Translation (EBMT) comprise source text decomposition, followed by translation example matching and selection, and finally adaptation and recombination of the target translation. As natural language is inherently ambiguous, preserving the source text's meaning throughout these processes is complex and challenging. A structural semantics is introduced as a step towards a meaning-based approach to improve the EBMT system. The structural semantics is used to support deeper semantic similarity measurement and to impose structural constraints on translation example selection. A semantic compositional structure is derived from the structural semantics of the selected translation examples. This semantic compositional structure serves as a representation structure to preserve the consistency and integrity of the input sentence's meaning structure throughout the recombination process. In this paper, an English to Malay EBMT system is presented to demonstrate the practical application of this structural semantics. Evaluation of the translation test results shows that the new translation framework based on the structural semantics outperforms the previous EBMT framework.

Keywords: Example-based Machine Translation, Structured String-Tree Correspondence, Synchronous Structured String-Tree Correspondence, Structural Semantics, Semantic Roles

* Corresponding author
Email addresses: [email protected] (Chong Chai Chua), [email protected] (Tek Yong Lim), [email protected] (Lay-Ki Soon), [email protected] (Enya Kong Tang), [email protected] (Bali Ranaivo-Malançon)
URL: fci.mmu.edu.my/v3/ (Chong Chai Chua)

Preprint submitted to Journal of Expert Systems With Applications, February 10, 2017
1. Introduction
Machine Translation (MT) uses computers to model the process of translating one human language into another. In general, the procedure of machine translation involves decoding the meaning of the source language (SL) and then re-encoding that meaning into the target language (TL). The decoding and re-encoding processes require a certain level of in-depth knowledge about the languages involved in the translation. MT researchers have implemented many strategies and methods to preserve the original meaning of the SL in the TL, e.g. grammatical rules, transfer rules, translation templates, statistical models, etc.

In this study, example-based approaches are used to resolve the meaning preservation problems in MT. Example-based MT (EBMT) originated from the idea of mechanical translation by analogy proposed by Nagao (1984). The fundamental concepts of EBMT were introduced by Nagao and defined by Hutchins (2005b), Somers (1999), and Carl (2005). The machine translation process of EBMT basically involves decomposition of the input source sentence into segments, matching of these segments against the examples database, identification of the corresponding translations from the matched examples, and finally adaptation and recombination of the translation examples to construct the target sentence.

As pointed out by Somers (1999) and Hutchins (2005a), recombination is the most difficult task in EBMT. This is easily anticipated, as the original context and the relationships between segments of the input sentence are lost during the decomposition process. Without any explicit semantic information referring back to the original input sentence, each segment is interpretable within its own new context. This leads to the possibility of other interpretations and introduces new ambiguities. The original input sentence's meaning cannot be fully determined from the successfully recombined target translation.

In the study of linguistic semantics, it is commonly agreed that the main determinant of a sentence's meaning is the verb of the main predicate (Healy & Miller, 1970). According to projectionist approaches, many aspects of the syntactic structure of a sentence are assumed to be projected from the lexical properties of the verb, in particular the morphosyntactic realization of the verb's arguments (Rappaport Hovav & Levin, 1998). Goldberg's studies (Goldberg, 1995, 1999) suggested that the basic meaning of clausal expressions is the result of the interaction between verb meaning and the semantics of the argument construction. This is in accordance with Frege's idea (Szabó, 2013) that semantics must be compositional, such that the meaning of every expression in a language must be a function of the meanings of its immediate constituents and of the syntactic rule used to combine them (Goldberg, 2016). This idea of semantic argument construction of verbs has contributed to the core techniques for structural semantics representation in Natural Language Processing research fields such as Question Answering, Information Extraction, and Information Retrieval.

The research presented in this paper adheres to the principle that both the verb meaning and the argument structure construction are important and must co-exist in order to form the meaning structure of a sentence. To preserve the organization of the arguments and the co-occurrence information relative to the verb, this meaning structure is represented using semantic roles, as a layer of structural semantics directly corresponding to the translation example. This structural semantics serves as a means to evaluate the semantic similarity between the input sentence and the stored source examples. A semantic compositional structure is derived from the structural semantics of the selected translation examples and is used throughout the recombination process.
Resolution of the meaning fragmentation and integrity issues in EBMT using the structural semantics is one of the main contributions of this paper. This structural semantics then contributes to the design of a new translation framework by enabling the incorporation of semantic information into the existing EBMT system at various levels. The combined strength of structural semantics and synchronized recombination in this new EBMT framework produces better translation results while maintaining the efficiency of the translation mechanisms. Besides this, a modified semantic-based evaluation based on the precision, recall, and f-score measurements in (Lo & Wu, 2011) is used as an alternative to human evaluation.
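The semantic-based evaluation idea can be illustrated with a minimal sketch in the spirit of the precision/recall/f-score measurements of Lo & Wu (2011). The role inventory, tokenization, and micro-averaged token overlap below are illustrative assumptions, not the paper's exact metric.

```python
# A hypothetical sketch of semantic-role-based MT evaluation: precision,
# recall, and f-score computed over the role fillers of the MT output
# aligned with those of the reference translation.

def semantic_f_score(mt_srl, ref_srl):
    """Each argument: dict mapping a role label (e.g. 'Pred', 'A1') to a
    list of filler tokens. Returns micro-averaged (precision, recall, f)."""
    matched = total_mt = total_ref = 0
    for role in set(mt_srl) | set(ref_srl):
        mt = set(mt_srl.get(role, ()))
        ref = set(ref_srl.get(role, ()))
        matched += len(mt & ref)
        total_mt += len(mt)
        total_ref += len(ref)
    p = matched / total_mt if total_mt else 0.0
    r = matched / total_ref if total_ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# A toy Malay output versus its reference, role-labeled by hand (assumed labels).
mt = {"Pred": ["dihantar"], "A1": ["e-mel"], "A2": ["kepada", "Ryan"]}
ref = {"Pred": ["dihantar"], "A1": ["satu", "e-mel"], "A2": ["kepada", "Ryan"]}
p, r, f = semantic_f_score(mt, ref)
```

The output is penalized on recall for the missing determiner in the A1 filler, while its precision stays perfect, which is the behavior one wants from a meaning-oriented score.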
In the following sections, the discussion begins with related research on meaning treatment in MT systems and on semantic similarity in Section 2. An overview of the current status and problems of the English to Malay EBMT system is given in Section 3. Section 4 presents the complete details of the structural semantics. The discussion continues with the proposal of a new EBMT framework based on the structural semantics in Section 5 and the translation evaluation results in Section 6. Finally, the research work in this paper is concluded in Section 8 with suggestions for future work.

2. Related Work
2.1. Intermediate Representation Structures in Interlingua MT Systems

The study of semantics is a main topic in traditional Interlingua Machine Translation systems (Dorr et al., 2006). The main idea of Interlingua translation is to analyze, extract, and represent the meaning of the SL in a language-independent structure, or general meaning representation, for later generation of the TL. Although there is no commonly agreed consensus on the form, primitives, and levels of the meaning representation in Interlingua MT systems, there is a clear outline of the importance of semantic relations (Dorr & Habash, 2002) between concepts, as part of the construction of the conceptual structure, especially for the structural consistency of the meaning representation. Semantic roles originating from linguistic theories, such as case roles, theta roles, and thematic roles, were used in a number of Interlingua MT studies (Teruko et al., 2004; Dorr et al., 2010) to represent these semantic relations. Semantic roles are mainly centered on predicate-argument structure (Levin & Rappaport Hovav, 2005), where the arguments of a predicate are classified into general or specific roles that express each argument's role with respect to the situation (such as an event, action, or state) described by the verb, as well as the semantic relations among all the participating arguments in the sentence.
2.2. Semantic-based Reordering in Statistical MT (SMT) Systems

In contrast, there is no explicit semantic handling in early SMT systems (Brown et al., 1990; Vogel et al., 2003; Lü et al., 2007). The focus is more on predicting the best translation target based on a trained statistical translation model and language model. Translation segment search, selection, and merging are based on maximized weight scoring using stochastic value estimates learned from an aligned bilingual corpus. In recent years, there have been active research efforts to use linguistic knowledge such as syntactic (Dlougach & Galinskaya, 2012; Zhang & Zong, 2013; Li et al., 2013) and semantic information (Aziz et al., 2011; Feng et al., 2012; Bazrafshan & Gildea, 2013) to assist phrase reordering and improve the overall translation results. Semantic structure is claimed to produce better results than syntactic structure, as it can provide a better skeleton structure of a sentence's meaning (Liu & Gildea, 2010; Feng et al., 2012). The semantic structure in most of these SMT systems is modeled according to the semantic roles of the predicate-argument structure. Overall, a parallel aligned bilingual corpus is automatically annotated with a semantic role labeler and then utilized for learning phrase reordering rules and the translation model. The learned reordering rules are applied either during pre-translation (Zhai et al., 2012), embedded into the decoder (Liu & Gildea, 2010; Gao & Vogel, 2011; Feng et al., 2012), or during post-translation (Wu & Fung, 2009). As opposed to the manual crafting of parsing rules, semantic parsing is eased by the availability of semantically annotated resources (i.e. PropBank, NomBank, VerbNet) and by advances in machine learning for automatic semantic role labeling (Hajič et al., 2009; Björkelund et al., 2009; Zhao et al., 2013).
2.3. Semantic Disambiguation in EBMT Systems

On the other hand, semantic handling in EBMT systems mainly focuses on similarity measurement and semantic disambiguation. The similarity measurement relies heavily on a thesaurus and focuses on two aspects: one is to assist the selection of suitable translation examples (Way, 2010), and the other is cross-lingual corpus alignment (Sumita, 2001). In the early stage, Nagao (1984) suggested selecting a suitable translation example based on the criterion of whether words from the SL are replaceable with the corresponding words from the input sentence. These corresponding words are checked for semantic similarity based on a thesaurus. The similarity measurement approaches gradually evolved from simple semantic distance measurements over sub-sentence segments towards the incorporation of substitution costs using string edit distance (Doi et al., 2005; Vertan & Martin, 2005). The sub-sentence segments can be words (Nagao, 1984), substrings/chunks (Nirenburg et al., 1993), content words (Aramaki et al., 2003), phrasal contexts (Aramaki & Kurohashi, 2004), or head words in tree structures (Liu et al., 2003; Imamura et al., 2004). The usage of a thesaurus is also significant for semantic disambiguation in EBMT, especially in the generalization of translations. For instance, Matsumoto & Kitamura (1997) acquired generalized word selection rules and translation templates by replacing semantically similar elements (words or phrases) in sentences with semantic classes from a thesaurus. Kaji et al. (1992) performed a very similar disambiguation approach by refining their syntactic generalization with semantic categories to resolve template selection problems (ambiguous verbs). Brown (1999) also reported successful template generalization that improved the coverage and accuracy of EBMT by using manually generated equivalence classes. Brown continued to improve the approach by automatically generating equivalence classes (Brown, 2000, 2001; Gangadharaiah et al., 2006) based on word clustering techniques, without using any thesaurus.
2.4. Semantic Similarity

Semantic similarity determines how conceptually similar two non-identical entities are (Petrakis & Varelas, 2006), for various textual units such as words, phrases, and sentences. The approaches range from simple lexical overlap (Gomaa & Fahmy, 2013) to complex similarity measurements based on concepts and semantic networks in a thesaurus (Mihalcea et al., 2006; Matar et al., 2008). With knowledge sources such as ontologies, the semantic similarity between terms/concepts can be estimated by defining a topological similarity, which generally covers approaches (Albacete et al., 2012; Slimani, 2013) such as structure-based measures, information content measures, and feature-based measures.

In order to support full-sentence similarity estimation, measurement at an abstract semantic level is needed. In fact, matching over an abstraction level is an important technique in the fields of Information Extraction and Information Retrieval. As proposed in Malik & Rizvi (2011), Kaptein et al. (2013), and Zuccon et al. (2014), the semantic annotation serves as an abstract-level indicator of the concepts in the text content, and the structure illustrates the organization of those concepts. Recent development focuses on annotating the predicate-argument structure with semantic roles, formulating the meaning structure (Blanco & Moldovan, 2013) or a concept map (Trandabăț, 2011) with semantic role relations.

Semantic Role Labeling is also very useful for question answering (Shen & Lapata, 2007; Moreda et al., 2008). Semantic roles are assigned to both the query and the candidate answers, and the semantic similarity is estimated based on the aligned semantic roles of verbs that evoke the same semantic frame. Besides this, in paraphrase similarity measurement (Bertero & Fung, 2015), semantic roles are used as semantic features for paraphrase classification.
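The contrast between surface lexical overlap and role-aligned similarity discussed above can be made concrete with a small sketch; the hand-made role assignments below are assumptions for illustration only.

```python
# Two sentences with identical bags of words but opposite who-did-what:
# surface overlap scores them as identical, while a role-by-role comparison
# of the fillers does not.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def role_similarity(roles_a, roles_b):
    """Average Jaccard overlap of role fillers, compared role by role."""
    labels = set(roles_a) | set(roles_b)
    return sum(jaccard(roles_a.get(l, ()), roles_b.get(l, ()))
               for l in labels) / len(labels)

s1 = "the cat is eating the rat".split()
s2 = "the rat is eating the cat".split()
# hand-assigned, PropBank-style numbered roles (illustrative only)
roles1 = {"A0": ["cat"], "Pred": ["eat"], "A1": ["rat"]}
roles2 = {"A0": ["rat"], "Pred": ["eat"], "A1": ["cat"]}
```

Here `jaccard(s1, s2)` is 1.0 even though the sentences mean opposite things, while the role-aligned score credits only the shared predicate.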
2.5. Structural Semantics as an Extended Annotation Layer to the Translation Example Representation

As opposed to Interlingua MT (Dorr et al., 2010), the semantic representation proposed in this paper is added as an annotation layer to the existing translation examples. The semantic annotation is language-specific: there are source- and target-language semantic annotations for the aligned bilingual translation pairs. Cross-lingual correspondences of the semantic representations between the language pairs are established, such that the variation of semantic knowledge between the source and target languages is explicitly specified. The disambiguation focuses on the relationships of the verb(s) with their arguments in the sentence, based on the complete semantic structure instead of lexical meaning. As a result, the similarity measurement in translation example selection places more emphasis on the overall sentence-level semantic similarity. Furthermore, the sentence structure and phrasal ordering depend on the semantic structure instead of on reordering rules as in phrase-based SMT (Feng et al., 2012; Zhai et al., 2012).

The structural representation inherits the original representation of the translation examples, such that linguistic phenomena and exceptional cases can be specified directly on the representation structure for special handling. SMT and Neural MT rely on generalized statistical modeling, so many linguistic aspects cannot be catered for on a case-by-case basis. Hence, the training dataset for SMT (Lü et al., 2007) and Neural MT (Wu et al., 2016) is normally very large compared to that of EBMT.

The similarity measure for the selection of the best translation example, presented later in Section 5.2, is constructed from an aggregation of linguistic information represented in a multi-level synchronized structure. This aggregated linguistic information consists of: the lexical surface form, the POS pattern, the syntactic dependency, and the semantic-role-labeled predicate-argument structure. The similarity measurements are derived using the idea of the Levenshtein edit distance.
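An edit distance over such multi-level annotated tokens can be sketched as follows. The feature tuple (surface, POS, dependency, role) and the cost weights are illustrative assumptions, not the actual Section 5.2 measure.

```python
# A hedged sketch of Levenshtein edit distance where the substitution cost
# aggregates mismatches across annotation levels instead of comparing only
# surface forms. Weights are arbitrary for illustration.

def sub_cost(a, b, weights=(0.4, 0.2, 0.2, 0.2)):
    """a, b: (surface, pos, dep, role) tuples; cost 0 when identical."""
    return sum(w for w, x, y in zip(weights, a, b) if x != y)

def annotated_edit_distance(seq1, seq2):
    """Standard dynamic-programming Levenshtein with weighted substitution."""
    m, n = len(seq1), len(seq2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub_cost(seq1[i - 1], seq2[j - 1]))
    return d[m][n]

x = [("was", "V", "aux", "Pred"), ("sent", "VEN", "root", "Pred")]
y = [("was", "V", "aux", "Pred"), ("eaten", "VEN", "root", "Pred")]
dist = annotated_edit_distance(x, y)
```

Two segments that differ only in the surface form of the verb incur a small cost, while segments differing at every level approach the cost of a full substitution, which is the graded behavior the aggregated measure aims for.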
3. English to Malay EBMT System

As a real case study, we present the English to Malay EBMT system, SiSTeC. SiSTeC stands for SiStem Terjemahan berasaskan SSTC (SSTC-based Translation System). In SiSTeC, the translation example is represented as a Structured String-Tree Correspondence (SSTC), a general structure that associates an arbitrary tree structure (the interpretation structure) with a string in a language. This SSTC representation scheme was extended by Al-Adhaileh & Tang (1999) to the Synchronous Structured String-Tree Correspondence (S-SSTC), a representation for the synchronization between a natural language sentence and its equivalent translation in another natural language. As shown in Fig. 1, the S-SSTC describes a synchronized structure between the English source sentence ("he knelt on the floor") and the corresponding Malay translation ("dia berlutut di atas lantai itu").

SiSTeC performs translation by simulating the process of synchronous parsing (Al-Adhaileh et al., 2002) as in synchronous grammar formalisms (Büchse et al., 2011). The segmentation of the text is based on the longest matching of source strings with the translation examples in the Bilingual Knowledge Bank (BKB). SiSTeC performs structural matching (Ye, 2006) of the segmented input language (IL) sentence against the stored SL translation examples based on structural patterns constructed from lexical and syntactic features. The dependency structure of the input sentence is reconstructed based on the matching structural patterns.

The structural matching between the SL sentence and the IL sentence depends on the similarity of the linear form (continuous strings with lexical and syntactic annotation) and of the syntactic structures (partially or fully generalized dependency trees). It is not able to examine whether the matched sentences are semantically equivalent or approximately close in meaning. This surface-form matching with lexical and syntactic patterns leads to mismatching of translation examples and subsequently causes adaptation and recombination errors.
Figure 1: S-SSTC for English to Malay
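The longest-match segmentation step described above can be sketched as a greedy search over a toy example bank; the bank contents and the greedy strategy are illustrative assumptions, not the actual BKB lookup.

```python
# A hedged sketch of longest-match segmentation: cover the input with the
# longest token spans found among the stored source-side examples.

def segment(tokens, example_bank):
    """Greedily emit the longest span present in the bank at each position."""
    segments, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):     # try longest span first
            candidate = tuple(tokens[i:j])
            if candidate in example_bank:
                match = candidate
                break
        if match is None:                       # unknown word: pass through
            match = (tokens[i],)
        segments.append(" ".join(match))
        i += len(match)
    return segments

bank = {("was", "sent"), ("an", "email"), ("to", "ryan"), ("by", "john")}
parts = segment("an email was sent to ryan by john".split(), bank)
```

With this bank the input decomposes into four stored segments, mirroring how the input sentence in Example 1 below is split before rule matching.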
The translation result of SiSTeC is good when the degree of matching between the input sentence's segments and a translation example is high (a longer match with very similar lexical and syntactic features). This is clearly observed especially when an exact match with the main verb is found (same tense and voice form). This type of best match is illustrated by the following Example 1. The recombination of input segments based on rule matching for Example 1 is elaborated in Fig. 2. The segment "was sent" from the input sentence is matched with "was sent" from the translation example "relief was sent to victim". The syntactic pattern of the input sentence, "DET N V EN PREP N PREP N", overlaps with this example's syntactic pattern at "N V EN PREP N", where the main verb is bound within the matched pattern. Hence, the lexical verb pattern is considered an exact match. A base structure "N V EN PREP PREP" is identified and provides the template to combine all the input segments (nodes or subtrees) into a complete dependency structure. This base structure provides complete structural and ordering information for the corresponding target sentence.
Example 1. English to Malay translation: "An email was sent to Ryan by John."

• SiSTeC Output: "E-mel telah dihantar kepada Ryan oleh John." (Equivalent to: "Email was sent to Ryan by John.")
• Reference: "Satu e-mel telah dihantar kepada Ryan oleh John."
• Best match: "was sent" (node), "N V EN PREP PREP" (rule)

Figure 2: Recombination rule matching for English to Malay translation of "An email was sent to Ryan by John."

On the contrary, the translation accuracy decreases when an exact match of the verb and a best template are not found. In Example 2 below, an exact match of "is eating" is not found in the knowledge bank, so the segment "were eaten" is selected as an alternative, as both have the similar lemma form "be eat". However, the tense and voice form of "were eaten" (past tense, passive voice) differ from those of "is eating" (present progressive tense, active voice). Direct usage of "were eaten" without reordering of the sentence's segments subsequently causes a structural error. As elaborated in Fig. 3, the selected alternative segment "were eaten" is successfully merged with the other segments using the base structure "N V EN N". This is a simple example of a change at the source side that cannot be propagated to the other elements in the target sentence, as the mapping between source and target in the translation template does not signal a transformation request. Hence, a paraphrase that involves changing the main verb's form together with restructuring of the elements in the target sentence cannot be performed.

Example 2. English to Malay translation: "The cat is eating the rat."

• SiSTeC Output: "Kucing itu dimakan tikus itu." (Equivalent to: "The cat was eaten by the rat.")
• Reference: "Tikus itu dimakan oleh kucing itu." or "Kucing itu sedang makan tikus itu."
• Best match: "were eaten" (node, lemmatized verb pattern matching, as no exact match is found), "N V EN N" (rule)
• Problem: structural errors; unable to determine the reordering of phrases based on the verb meaning structure

Figure 3: Recombination rule matching for English to Malay translation of "The cat is eating the rat."
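The failure mode in Example 2 suggests a simple guard: compare the verb's tense and voice features before accepting a lemma-level match. The sketch below is our own illustration of that check; the feature names and values are assumptions, not SiSTeC internals.

```python
# Classify a candidate example verb against the input verb: a bare lemma
# match is only safe for direct reuse when tense and voice also agree;
# otherwise a transformation of the target structure would be required.

def verb_match(input_verb, example_verb):
    """Each verb is a dict with 'lemma', 'tense', 'voice'."""
    if input_verb == example_verb:
        return "exact"
    if input_verb["lemma"] == example_verb["lemma"]:
        mismatched = [f for f in ("tense", "voice")
                      if input_verb[f] != example_verb[f]]
        return "needs-transformation" if mismatched else "exact"
    return "none"

is_eating = {"lemma": "be eat", "tense": "present progressive", "voice": "active"}
were_eaten = {"lemma": "be eat", "tense": "past", "voice": "passive"}
verdict = verb_match(is_eating, were_eaten)
```

Flagging the pair as needing transformation is exactly the signal the original template mapping lacked.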
4. Structural Semantics Correspondence

As demonstrated in Interlingua and Statistical MT research, the semantic relations between concepts conveyed by words or phrases can be explicitly described by semantic roles. Semantic roles also provide an organized and consistent representation structure, which has proven to be more effective than syntactic structure. Many linguistic views suggest that the lexical meaning and properties of the verb are the key to predicting and determining sentence meaning. According to the semantic-role-centred approach to lexical semantic representation (Levin & Rappaport Hovav, 2005), the verb's meaning can be represented by a list of semantic role labels (also known as a "Case Frame" by Fillmore (1968) and as "Thematic Relations" by Gruber (1965) and Jackendoff (1976)), and each of these roles is assigned to an argument bearing the corresponding semantic relation to the verb. With recent shallow semantic parsing techniques (or Semantic Role Labeling), which associate the surface arguments of a predicate, especially a verb, with discrete semantic roles, an abstract meaning structure (or skeleton structure) of a sentence can be explicitly represented.

In this section, the theoretical aspects of the structural semantics for SiSTeC are presented. The existing SSTC in SiSTeC is annotated with semantic roles (Section 4.1). This semantic annotation is added to the SL side of the SSTC, from which it is projected to the TL (Section 4.2) based on the correspondence relationships in the S-SSTC. The semantic compositional structure is derived from the structural semantics annotation; it is used to facilitate the transformation and adaptation in the recombination process of a new EBMT framework (further details in Section 5.3).

4.1. SSTC with Semantics (SSTC+SEM)

The meaning structure constructed from the semantically labeled predicate-argument structure is aggregated into the SSTC as a new semantic layer. This semantic layer acts as an abstract semantic descriptor for the SSTC. The nodes in this semantic layer correspond directly to the predicate or arguments in the SSTC. The nodes are connected and organized according to the co-occurrence and the dependencies of the predicate and arguments in the SSTC. The semantic roles in this semantic layer are denoted as numbered semantic arguments (i.e. A0, A1, etc.) following the annotation approach in PropBank (Palmer et al., 2005). These numbered semantic arguments are assigned on a verb-by-verb basis. For different verbs, an argument with the same tag can have a different semantic role; e.g. the A0 in Example 3-1 is an argument with the semantic role Consumer/Eater, whereas the A0 in Example 3-2 is an argument with the semantic role Borrower. Hence, the semantic roles of the numbered arguments are verb-specific, and at the same time the co-occurrence of these arguments defines the meaning construction for the verb.

Example 3. Same argument tag but different role

1. [The eggs]A1 were [eaten]Pred [by the beneficial]A0 (with the verb eat, A0 has the semantic function Consumer or Eater, and A1 has the semantic function Meal)

2. [He]A0 [borrowed]Pred [a book]A1 [from the library]A2 (with the verb borrow, A0 has the semantic function Borrower, A1 the semantic function Thing Borrowed, and A2 the semantic function Loaner)

These semantically labeled arguments at the structural level, together with the predicate, form the Structural Semantics (SEM) for the SSTC. This combined structure is defined as a triple (SEM, SSTC, γ(SEM, SSTC)), where:

1. SEM is a tree representation of the structural semantics constructed from the predicate and the argument(s) labeled with semantic roles. It is organized into a dependency-based structure, such that:

(a) The predicate (verb) is the root node.
(b) The root node is connected to the leaf node(s), constituted of the argument(s) labeled with their semantic roles.
(c) The dependency relations between the root node and the leaf nodes are reflected directly by the semantic roles.

2. An SSTC is a general structure defined as a triple (st, tr, co), where st is a string in one language, tr is its associated arbitrary tree structure (i.e. its interpretation structure), and co is the correspondence between st and tr, which can be non-projective (for detailed definitions refer to Al-Adhaileh et al. (2002)).

3. γ(SEM, SSTC) defines a link lRel ∈ γ(SEM, SSTC), corresponding from a node in SEM to a sub-SSTC, such that:

(a) A node in SEM is associated with a sub-SSTC of the SSTC, where sub-SSTC ⊆ SSTC.
(b) A sub-SSTC consists of a sub-string (part of st) and a sub-tree (part of tr) from the SSTC.
(c) This sub-string and sub-tree are linked by the correspondence defined by the corresponding function co(st, tr) in the SSTC.
(d) For the root node of SEM, lRel records the correspondence to the sub-SSTC constructed from the predicate (i.e. the verb) in st; and for a leaf node of SEM, lRel records the correspondence to the sub-SSTC constructed from the predicate's argument in st.
(e) lRel only needs to record the correspondences from the nodes in SEM directly to the st of the SSTC; the correspondence from SEM to tr can be achieved via the correspondence between st and tr defined by co, which can be referred to as indirect linking, SEM ⇒ st ⇒ tr. In terms of function composition, let α be the correspondence function from SEM to tr; then α = co ∘ γ.
(f) lRel is represented by sets of intervals, which encode the indices of the sequences of words in st.
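A minimal Python sketch of this triple may help fix ideas. The class and field names below are our own assumptions; the lRel links are stored as word-index intervals into st as in definition 3(f), and the indirect linking of 3(e) is realized by matching a SEM node's interval against the co intervals.

```python
from dataclasses import dataclass, field

@dataclass
class SSTC:
    st: list     # the string, as a list of words
    tr: dict     # interpretation tree: head word -> list of dependents
    co: dict     # tree node -> (start, end) word interval into st

@dataclass
class SemNode:
    label: str            # 'Pred' for the root; 'A0', 'A1', ... for leaves
    interval: tuple       # lRel: word-index interval into st
    children: list = field(default_factory=list)

def sem_to_tree_nodes(sem, sstc):
    """Indirect linking SEM => st => tr (alpha = co o gamma): find the tree
    nodes whose co interval equals each SEM node's lRel interval."""
    out = {}
    for node in [sem] + sem.children:
        out[node.label] = [n for n, span in sstc.co.items()
                           if span == node.interval]
    return out

words = "the moths have eaten holes in his coat".split()
sstc = SSTC(st=words,
            tr={"eaten": ["moths", "holes"]},
            co={"eaten": (3, 4), "moths": (0, 2), "holes": (4, 8)})
sem = SemNode("Pred", (3, 4),
              [SemNode("A0", (0, 2)), SemNode("A1", (4, 8))])
links = sem_to_tree_nodes(sem, sstc)
```

Resolving each SEM node through the shared intervals recovers the corresponding tree nodes without the SEM layer ever pointing at tr directly, which is the point of the indirect linking.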
Fig. 4 illustrates the SSTC+SEM representation structure for the sentence "The moths have eaten holes in his coat". The sentence consists of a main verb, "eaten", and two arguments, "the moths" and "holes in his coat". Respectively, the argument "the moths" is assigned the argument label A0 (Consumer or Eater) and "holes in his coat" is assigned the argument label A1 (Meal). These annotations are represented as a distinct SEM tree: with the predicate "eat" (the lemma of "eaten") as the root node, the A0 node is connected to the root node as the left child, and the A1 node is connected to the root node as the right child. The SEM structure is then associated with the SSTC (both the string and tree representations) such that: the root node of the SEM tree corresponds to the verb "eaten"; the child node A0 corresponds to the argument "the moths"; and the child node A1 corresponds to the argument "holes in his coat". For the case where there is more than one verb in the sentence, each verb with its arguments is represented as a separate SEM representation tree.
4.2. S-SSTC with Semantics (S-SSTC+SEM)

The SEM representation can be applied as the structural semantics for both the SL SSTC and the TL SSTC in the S-SSTC. The source language SSTC+SEM will be referred to as SL SSTC+SEM and the target language SSTC+SEM as TL SSTC+SEM. The structural semantics of the predicate-argument structure in the SL SSTC+SEM can be projected to the TL SSTC via the correspondences established in the S-SSTC. From the definitions in Al-Adhaileh et al. (2002), an S-SSTC is defined as a triple (S, T, ϕ(S, T)), such that: S (i.e. the SL) and T (i.e. the TL) are each represented as an SSTC; ϕ(S, T) is a set of links defining the synchronization correspondence between S and T at different internal levels of the two SSTC structures. Thus, the correspondence SEM ⇒ SL SSTC ⇒ TL SSTC can be achieved via the compositional function ϕ ∘ γ. On top of this indirect linking, the semantic annotations of the SL predicate-argument structure can be projected to the TL SSTC, deriving the abstract semantic layer and constructing the TL SSTC+SEM representation. For a sentence with multiple verbs, the main predicate and the sub-predicates are determined based on the syntactic dependencies in the tr representation. The mapping from SL SSTC+SEM to TL SSTC+SEM provides structural semantic correspondences that encode the information for cross-lingual transformation (semantic-based structural transfer and reordering). This representation structure, with synchronization between SL SSTC+SEM and TL SSTC+SEM, is denoted as S-SSTC+SEM (Synchronous Structured String-Tree Correspondence with Semantics).

Figure 4: Synchronization of the S-SSTC+SEM for sentence "The moths have eaten holes in his coat"
The S-SSTC+SEM for the sentence “The moths have eaten holes in his coat” in Fig. 4 can be synchronized with its Malay translation “kotnya berlubang-lubang dimakan rama-rama” at various levels of correspondence. With reference to the ϕ(S, T) correspondences, the argument “the moths” corresponds to the target phrase “rama-rama” and “holes in his coat” corresponds to the phrase “kotnya berlubang-lubang”. The semantic annotation of each argument on the SL side can be projected to the TL side according to these correspondences, such that the argument “rama-rama” is annotated with semantic role A0 (Consumer) and the argument “kotnya berlubang-lubang” with semantic role A1 (Meal). Based on the semantic correspondences between the SEM structures in Fig. 4, a position switch between the arguments A1 and A0 is required for the transformation from SL to TL.
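The role projection can be sketched in a few lines of Python; here the phrase-level links of ϕ(S, T) are simplified into a plain dictionary keyed by SL phrase strings, which is an assumption for illustration only.

```python
# SL semantic role annotation for "The moths have eaten holes in his coat".
sl_roles = {"the moths": "A0", "holes in his coat": "A1"}

# Simplified stand-in for the phrase correspondences phi(S, T).
phi = {
    "the moths": "rama-rama",
    "holes in his coat": "kotnya berlubang-lubang",
}

# Project each SL role onto the corresponding TL phrase.
tl_roles = {phi[phrase]: role for phrase, role in sl_roles.items()}
print(tl_roles)  # {'rama-rama': 'A0', 'kotnya berlubang-lubang': 'A1'}
```

The projected dictionary is exactly the TL-side annotation described in the text, obtained without analyzing the TL sentence itself.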
4.3. Semantic Compositional Structure
The structural semantics of SEM exhibits semantic dependencies between the predicate and its arguments, reflecting a basic meaning structure for the clause(s) or phrase(s) of a sentence. With the aggregation of linguistic information between the SEM semantic dependencies and the tr syntactic dependencies, a semantic specification of the corresponding natural language text is encoded in the representation structures. A semantic compositional structure can be obtained by simple derivation. For a simple sentence with a single verb, the semantic compositional structure is equivalent to the SEM structure. For a sentence with multiple verbs, the semantic compositional structure involves the combination of multiple SEM structures.

As depicted in Fig. 5, the meaning structure of the sentence “he refused to do it because he felt it was not ethical” is constructed from the predicate-argument structures of the verbs “refused”, “do”, “felt” and “was”. The sentence's base meaning is constructed from the main predicate verb “refused”, with three semantic arguments A0, A1 and AM-CAU. The base meanings of the arguments A1 and AM-CAU of the verb “refused” are composed from the predicate-argument structures of the verbs “do” and “felt” respectively. In addition, the meaning of argument A1 of the verb “felt” is contributed by the predicate-argument structure of the verb “was”. Such compositional characteristics allow the predicate-argument structures to be jointly combined and organized into a single compositional structure, as depicted in Fig. 5. This semantic specification supports deeper analysis and interpretation in order to preserve the meaning structure of a sentence during the matching and transformation process in the EBMT system.

Figure 5: Compositional structural semantics for the sentence “he refused to do it because he felt it was not ethical”

The process of abstract matching leading to semantic transformation for the sentence “the cat likes to eat fish”, with two separate SEM structures matched, is demonstrated in Fig. 6. The two SEM structures in the sentence “the cat likes to eat fish” can be combined to form a single semantic compositional
Figure 6: From abstract meaning matching to semantic-based transformation for the sentence “the cat likes to eat fish”
structure. The SEM structures of the verbs “likes” and “eat” in this sentence are matched with the SL SEM structures of the stored examples “he likes to affect the great philosopher” and “who will want to eat this poison” respectively. The transformation of the abstract meaning structure from the SL to the TL for the SEM structures of the verbs “likes” and “eat” can be performed separately according to their matching SL SSTC+SEM and TL SSTC+SEM. Hence, two target SEM structures are constructed. These two TL SEM structures can then be combined into one semantic compositional structure with reference to the input sentence's semantic compositional structure, such that the main predicate is the verb “suka” with two arguments, where the argument A1 is constructed from the predicate-argument structure of the verb “makan”.
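The combination step can be sketched as follows. The dict-based layout is purely illustrative (it is not the SiSTeC-internal representation), and only the predicate and the A1 slot of the “likes” structure are shown.

```python
# Two TL SEM structures derived from the two matched translation examples.
likes_tl = {"pred": "suka", "A1": None}    # from the matched "likes" example
eat_tl = {"pred": "makan", "A1": "ikan"}   # from the matched "eat" example

def compose(main, sub, slot):
    """Embed a sub predicate-arguments structure into an argument slot
    of the main structure, following the input sentence's compositional
    structure."""
    combined = dict(main)
    combined[slot] = sub
    return combined

target = compose(likes_tl, eat_tl, "A1")
print(target)  # {'pred': 'suka', 'A1': {'pred': 'makan', 'A1': 'ikan'}}
```

The nested result mirrors the compositional structure of the input sentence: the “makan” structure fills the A1 argument of “suka”.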
5. New EBMT Framework with Incorporation of Structural Semantics
A new translation framework is required to incorporate the structural semantics into the existing EBMT system. This new translation framework with structural semantics is referred to as SiSTeC+SEM. As highlighted in Fig. 7, the IL text is preprocessed with the dependency parser and semantic parser to construct the IL SSTC+SEM. The selection of a translation example is based on structural matching between the IL SSTC+SEM and the stored SL SSTC+SEM. The matching of these SSTC+SEM structures is simplified by converting the structures into linear semantic patterns, as described in Section 5.1. Semantic similarity between the IL SSTC+SEM and SL SSTC+SEM is measured using the distance measurement formulated in Section 5.2, based on the semantic patterns. The semantic compositional structure of the target sentence is derived based on the structural semantic correspondences of the matching translation example. Finally, this semantic compositional structure provides full semantic information to guide the adaptation and recombination process described in Section 5.3.
Figure 7: Translation Phase
5.1. Structural Semantics Pattern

As described in Section 4, the constructed SSTC+SEM is a multi-level structure with associated syntactic and semantic knowledge. Instead of meaning interpretation based on logical rules, the similarity between two structural semantics can be examined based on pattern matching. The SSTC+SEM is formulated into a linear string pattern to support semantic similarity measurement based on edit distance. From the basic definitions discussed in the previous section, the characteristics of the SSTC+SEM representation can be elaborated from the perspective of the requirements for performing the pattern matching task, such that:
1. Two sentences can be distinguished based on the specifications of a shallow semantic layer, via:
(a) The constitution of the type and number of semantic arguments in the structure of the semantic layer.
(b) Semantic relations and semantic dependencies between the arguments: each argument is assigned a distinct semantic role, which is specific to a predicate (verb), i.e. the semantic roles assigned to the arguments of the predicate “see” are different from the semantic roles of the predicate “eat”.
(c) Semantic structure, the organization of the arguments and predicate, i.e.:
• Meal Pred[eat] Consumer, where the argument with semantic role “Meal” precedes the predicate “eat” (as the left child node in the tree representation) and the “Consumer” succeeds the predicate (as the right child node).
(d) Semantic constraints based on semantic relations, semantic dependencies and semantic structure, i.e.:
• Meal Pred[eat] Consumer ≈ Consumer Pred[eat] Meal (approximately similar but not equivalent); and
• Meal Pred[eat] Consumer ≠ Viewer Pred[see] ThingViewed (totally different).
2. Multi-level matching and comparison of linguistic information can be performed from the abstract level (semantic) down to the syntactic and context-specific (surface form) levels:
(a) The abstract level via the shallow semantic layer; the syntactic layer via the dependency structure of the SSTC and POS tagging; and the content-specific layer via the lexical string of the source sentence in the SSTC.
(b) For example, in Fig. 4, the SSTC+SEM of the sentence “The moths have eaten holes in his coat” consists of:
• Consumer Pred[eat] Meal (equivalent to A0 Pred[eat] A1), the semantic role labeled predicate-argument structure.
• The Consumer argument, constructed from the lexical string “the moths”, with POS pattern “DET N” and dependency structure moths/N → the/DET.
• The Meal argument, constructed from the string “holes in his coat”, with POS pattern “N PREP PRON N” and dependency structure holes/N → in/PREP → coat/N → his/PRON.
(c) A semantic pattern with multiple levels of linguistic information can be formed, where the lexical string, POS pattern and dependency structure serve as extended linguistic features for each semantic argument, i.e.:
• A0 [the moths] [DET N] [moths/N → the/DET]
• A1 [holes in his coat] [N PREP PRON N] [holes/N → in/PREP → coat/N → his/PRON]
(d) By combining the linguistic information of the shallow semantic layer (semantic role labeled predicate-argument structure), the syntactic layer (part of speech and dependency structure) and the surface form (lexical or string pattern), a linear structural semantic pattern for a sentence can be generated, i.e.:
A0 [the moths] [DET N] [moths/N → the/DET] Pred [eat] A1 [..] ..
(e) A string index is added to this pattern for reference to the original source sentence. For ease of processing, the dependency structure can be simplified and transformed into a linear form with only the root node and its direct child(ren) node(s). Hence, the structural semantic pattern in the previous example can be refined to:
A0 [0 2] [the moths] [DET N] [root:N DET] Pred [3 4] [eat] A1 [4 8] ...
(f) The structural semantic pattern is hence generalized to:
Argument_i [index] [lexical string] [POS] [dependency structure] Pred [index] [verb lemma] Argument_i+1 ...
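The generalized pattern in (f) can be generated mechanically from an annotated sentence. The sketch below assembles the linear string from a list of elements; the field names and the simplified dependency value for A1 (“root:N PREP”) are our own illustrative assumptions, since the paper elides that cell.

```python
def format_pattern(elements):
    """Serialize predicate and argument elements into the linear
    structural semantic pattern string."""
    parts = []
    for e in elements:
        if e["type"] == "pred":
            parts.append(f"Pred [{e['start']} {e['end']}] [{e['lemma']}]")
        else:
            parts.append(f"{e['role']} [{e['start']} {e['end']}] "
                         f"[{e['text']}] [{e['pos']}] [{e['dep']}]")
    return " ".join(parts)

elements = [
    {"type": "arg", "role": "A0", "start": 0, "end": 2,
     "text": "the moths", "pos": "DET N", "dep": "root:N DET"},
    {"type": "pred", "start": 3, "end": 4, "lemma": "eat"},
    {"type": "arg", "role": "A1", "start": 4, "end": 8,
     "text": "holes in his coat", "pos": "N PREP PRON N",
     "dep": "root:N PREP"},  # hypothetical simplified dependency
]
print(format_pattern(elements))
# A0 [0 2] [the moths] [DET N] [root:N DET] Pred [3 4] [eat]
#   A1 [4 8] [holes in his coat] [N PREP PRON N] [root:N PREP]
```

Because the pattern is a plain string of ordered tokens, it is directly usable with edit-distance comparison in the next section.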
Figure 8: Semantic pattern information table
3. A compound sentence with multiple verbs can be analyzed according to its semantic compositional structure:
(a) With the semantic-syntactic integration via the correspondence mapping from the semantic layer to the tr (the dependency tree) of the SSTC, the tr describes the syntactic dependency of the words in the natural language text st of the SSTC; it encodes the syntactic dependency hierarchy of the words. This is most useful in compound sentences with multiple predicate-argument structures, allowing analysis of the semantic hierarchy.
(b) Based on the compositional structure of the verbs, each verb can be compared separately as an independent SSTC+SEM structure.
(c) For example, in Fig. 5, the sentence “he refused to do it because he felt it was not ethical” forms a semantic compositional structure constructed from four SSTC+SEM structures, respectively:
• A0 [0 1] Pred [1 2] [refuse] A1 [2 5] AM-CAU [5 12]
• A0 [0 1] Pred [3 4] [do] A1 [4 5]
• A0 [6 7] Pred [7 8] [feel] A1 [8 12]
• A1 [8 9] Pred [9 10] [be] AM-NEG [10 11] A3 [11 12]
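Using the word-index spans above, the predicate hierarchy of the compound sentence can be recovered by span containment: a verb is a sub-predicate of whichever argument span encloses its predicate span. This is our assumed reading of the compositional analysis, sketched below.

```python
# Per-verb predicate-argument spans for
# "he refused to do it because he felt it was not ethical".
structures = {
    "refuse": {"pred": (1, 2), "A0": (0, 1), "A1": (2, 5), "AM-CAU": (5, 12)},
    "do":     {"pred": (3, 4), "A0": (0, 1), "A1": (4, 5)},
    "feel":   {"pred": (7, 8), "A0": (6, 7), "A1": (8, 12)},
    "be":     {"pred": (9, 10), "A1": (8, 9), "AM-NEG": (10, 11), "A3": (11, 12)},
}

def parent_of(verb):
    """Return (parent verb, argument label) of the innermost argument span
    containing this verb's predicate span, or None for the main predicate."""
    start, end = structures[verb]["pred"]
    best = None
    for other, args in structures.items():
        if other == verb:
            continue
        for label, (s, e) in args.items():
            if label != "pred" and s <= start and end <= e:
                if best is None or (e - s) < best[2]:
                    best = (other, label, e - s)   # keep the tightest span
    return (best[0], best[1]) if best else None

print(parent_of("do"))      # ('refuse', 'A1')
print(parent_of("feel"))    # ('refuse', 'AM-CAU')
print(parent_of("be"))      # ('feel', 'A1')
print(parent_of("refuse"))  # None, i.e. the main predicate
```

The recovered hierarchy matches Fig. 5: “refuse” is the root, “do” and “feel” fill its A1 and AM-CAU arguments, and “be” fills the A1 argument of “feel”.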
5.2. Distance Measurement for Structural Semantics Patterns

The information encapsulated in the structural semantic pattern can be visualized as a table of linguistic information organized into multiple levels (rows), where correspondences are established between these levels (as shown in Fig. 8).
Figure 9: Comparison of two structural patterns
It is possible to derive an aggregated distance measurement which simulates the comparison of multi-level linguistic information. One important criterion in the distance measurement between two structural semantic patterns is to take into account the overall structure of a pattern, which is constrained by the order and position of the arguments with reference to the predicate as the root (or central) object of the pattern. As illustrated in Fig. 9, the comparison of two patterns is guided by the semantic predicate-argument structure, followed by the additional multi-level linguistic descriptions of the argument structure. As the linguistic information at each level is formed as a string-based pattern, the order of the elements is important. With this in consideration, the Levenshtein edit distance can be used to perform the similarity measurements. The edit distance for structural semantic pattern similarity measurement is formulated as below:
1. The similarity measurement of two structural semantic patterns for the selection of the best translation example is based on the minimum distance between the input source pattern x and a stored translation example pattern y, i.e. $\min d_{structural\_semantic}(x, y)$.
2. The distance measurement consists of two parts:
(a) Structural semantics distance $d_{predicate\_arguments}(x, y)$: how two structural semantic patterns differ in terms of the overall pattern structure.
(b) Linguistic features distance of each matching argument $d_{linguistic\_features}(x, y)$: the differences between each feature's elements.
3. The structural distance is defined as the Predicate Arguments Distance measurement for two semantic patterns x and y:

$$d_{predicate\_arguments}(x, y) = \frac{lev_{sem_x, sem_y}(|sem_x|, |sem_y|) + d_{arguments\_position}(x, y)}{2}$$

Where:
(a) $lev_{sem_x, sem_y}(|sem_x|, |sem_y|)$ is the Levenshtein distance between the two predicate-arguments patterns.
(b) $d_{arguments\_position}(x, y)$ is the argument distance with reference to the predicate as the root node of the semantic dependency tree:

$$d_{arguments\_position}(x, y) = \frac{count\_left\_diff(x, y) + count\_right\_diff(x, y)}{t}$$

where t = total number of arguments. The argument distance is derived from the idea of the Jaccard measure. It
imposes an additional distance measure when there is a position switch of arguments with reference to the predicate. For example, the distance between “A0 PRED A1 A2” and “A1 PRED A0 A2” is greater than the distance between “A0 PRED A1 A2” and “A0 PRED A2 A1”. In the pattern “A1 PRED A0 A2”, the positions of “A1” and “A0” are switched with reference to the predicate “PRED”. This creates an effect such that “A0 PRED A2 A1” is preferred as a matching pattern for “A0 PRED A1 A2”.
4. The linguistic features distance between two semantic patterns x and y with a total of n distinct arguments can be defined as:

$$d_{linguistic\_features}(x, y) = 1 - \frac{1}{n} \sum_{i=1}^{n} sim_{linguistic\_features}(x_i, y_i)$$

Where:
(a) The aggregated linguistic features similarity between two arguments (aggregation of the dependency structure, lexical string pattern and syntactic pattern) is defined as:

$$sim_{linguistic\_features}(x, y) = \frac{(1 - lev_{dep_x, dep_y}(|dep_x|, |dep_y|)) + (1 - lev_{syn_x, syn_y}(|syn_x|, |syn_y|)) + (1 - lev_{lex_x, lex_y}(|lex_x|, |lex_y|))}{total\ number\ of\ features}$$

* for our case, total number of features = 3
(b) $lev_{dep_x, dep_y}(|dep_x|, |dep_y|)$ is the Levenshtein distance between the two dependency structure patterns.
(c) $lev_{syn_x, syn_y}(|syn_x|, |syn_y|)$ is the Levenshtein distance between the two syntactic (part-of-speech) patterns.
(d) $lev_{lex_x, lex_y}(|lex_x|, |lex_y|)$ is the Levenshtein distance between the two lexical string patterns.
5. The Structural Semantic Pattern Distance between two semantic patterns x and y is then:

$$d_{structural\_semantic}(x, y) = d_{predicate\_arguments}(x, y) \cdot \alpha + d_{linguistic\_features}(x, y) \cdot \beta$$

Where α + β = 1 and β > α.

* β > α, so that the distance of the linguistic features has a more significant influence on the overall measurement.
6. The matching and computation of the structural semantic pattern distance is performed only if the main predicates of the two semantic patterns match:

$$d(x, y) = \begin{cases} d_{structural\_semantic}(x, y), & same\_pred(x, y) = true \\ 1, & otherwise \end{cases}$$
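The distance components above can be sketched in Python. This is a simplified illustration under stated assumptions: the Levenshtein distance is normalized by the longer sequence length (the paper does not specify a normalization), the left/right difference counts are taken over role sets, and the weights alpha = 0.4, beta = 0.6 are illustrative values satisfying β > α.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def norm_lev(a, b):
    """Levenshtein distance normalized to [0, 1] (assumed normalization)."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def arguments_position_distance(x_roles, y_roles):
    """Jaccard-style penalty for arguments that switch sides of the predicate."""
    def sides(roles):
        i = roles.index("PRED")
        return set(roles[:i]), set(roles[i + 1:])
    lx, rx = sides(x_roles)
    ly, ry = sides(y_roles)
    t = len(lx | rx | ly | ry)          # total distinct arguments
    return (len(lx - ly) + len(rx - ry)) / t

def pattern_distance(x, y, alpha=0.4, beta=0.6):
    """Gated, weighted combination of structural and linguistic-feature
    distances between two semantic patterns."""
    if x["pred"] != y["pred"]:          # main predicates must match
        return 1.0
    d_struct = (norm_lev(x["roles"], y["roles"])
                + arguments_position_distance(x["roles"], y["roles"])) / 2
    sims = []
    for ax, ay in zip(x["args"], y["args"]):   # matched argument pairs
        feats = [1 - norm_lev(ax[f], ay[f]) for f in ("dep", "syn", "lex")]
        sims.append(sum(feats) / 3)
    d_feat = 1 - sum(sims) / max(len(sims), 1)
    return alpha * d_struct + beta * d_feat

base = {"pred": "sell", "roles": ["A0", "PRED", "A1", "A2"],
        "args": [{"dep": ["root:N", "DET"], "syn": ["DET", "N"],
                  "lex": ["the", "officer"]},
                 {"dep": ["root:N"], "syn": ["N"], "lex": ["information"]}]}
print(pattern_distance(base, base))  # 0.0 for an identical pattern
```

An A0/A1 swap across the predicate, as in the example above, yields a strictly positive distance, while a mismatched main predicate returns the maximum distance of 1.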
With this distance measurement of two structural semantic patterns, two SSTC+SEM structures can be compared without performing direct structural mapping. The comparison first examines the matching arguments between the two patterns, then checks the structural resemblance of the patterns, and further assesses the degree of similarity of two matching arguments based on the additional linguistic features. The characteristics of these distance measurements can be examined based on the examples in Table 1, where there are three input examples to compare with two stored examples. From the distance measurement results in Table 2, the best match for each of the examples is obtained:
Input Example 1 with Stored Example 1, Input Example 2 with Stored Example 2, and Input Example 3 with Stored Example 2. The matching of Input Example 1 and Stored Example 1 was based on the lowest distance acquired, due to the equivalence of the structural semantic pattern “A0 AM-TMP Pred A1 A2”. Input Examples 2 and 3 are basically the same sentence with minor phrasal reordering. Due to the shallow semantic matching, the comparison of patterns between Input Example 2 and Stored Example 2 results in a lower edit distance than the comparison between Input Example 3 and Stored Example 2.
5.3. Target Translation Sentence Reordering, Recombination and Generation

The edit distance measurement evaluates the similarity of two SSTC+SEM structures based on multi-level linguistic information. It is used to select the best translation example to support the derivation of the target translation's SEM structure. The TL SEM corresponding to the SL SEM in the matched translation example serves as the base template for the target translation's SEM structure. A correspondence mapping between the input sentence and the SL example is performed based on the structural semantic similarities. The transformation from the input sentence to the target sentence is described as a mapping from the input sentence to the SL example, followed by a transformation analogous to the transformation from the SL example to the TL example. These mappings and transformations during the derivation process can be elaborated as the structural correspondence relationships between SSTC+SEM structures.

The derivation of the target SEM structure with reference to the structural correspondences can be described by the following procedures:

1. Construct a semantic compositional structure from the source sentence's structural semantics:
(a) Construct by merging the source sentence's structural semantics.
(b) The structural semantics are merged according to the parent-child
Table 1: Examples of Structural Semantic Patterns

Input Example 1: “The Kerpan farm currently sells fresh shrimps to third party processors.”
Pattern: A0([0 3][DET root:N N][the kerpan farm][DET N N]) AM-TMP([3 4][ADV][currently][ADV]) Pred([4 5][sell]) A1([5 7][A root:N][fresh shrimps][A N]) A2([7 11][AU_INF root:N N][to third party processors][AU_INF NUM_ORD N N])

Input Example 2: “Fresh shrimps were sold by the Kerpan farm to third party processors.”
Pattern: A1([0 2][A root:N][fresh shrimps][A N]) Pred([3 4][sell]) A0([4 8][PREP root:N DET N][by the Kerpan farm][PREP DET N N]) A2([8 12][AU_INF root:N NUM_ORD N][to third party processors][AU_INF NUM_ORD N N])

Input Example 3: “Fresh shrimps were sold to third party processors by the kerpan farm.”
Pattern: A1([0 2][A root:N][fresh shrimps][A N]) Pred([3 4][sell]) A2([4 8][AU_INF root:N NUM_ORD N][to third party processors][AU_INF NUM_ORD N N]) A0([8 12][PREP root:N DET N][by the Kerpan farm][PREP DET N N])

Stored Example 1: “Relationship marketing then is selling your product to broker.”
Pattern: A0([0 2][N root:N][relationship marketing][N N]) AM-TMP([2 3][ADV][then][ADV]) Pred([4 5][sell]) A1([5 7][GEN_PRON root:N][your product][GEN_PRON N]) A2([7 9][AU_INF root:N][to broker][AU_INF N])

Stored Example 2: “The officer is selling information to the enemy.”
Pattern: A0([0 2][DET root:N][the officer][DET N]) Pred([3 4][sell]) A1([4 5][root:N][information][N]) A2([5 8][AU_INF DET root:N][to the enemy][AU_INF DET N])
ACCEPTED MANUSCRIPT
Table 2: Examples of Structural Semantic Pattern Distance Measurements
Stored Example 2
Relationship
market-
The officer is selling in-
ing then is selling your
formation to the en-
product to broker.
emy.
AN US
Stored Example 1
Input Example 1 The
Kerpan
currently
sells
farm
0.3646
fresh
shrimps to third party
Input Example 2 shrimps
were
0.6792
0.5778
0.7167
0.6278
ED
Fresh
M
processors.
0.5181
sold by the Kerpan farm to third party
PT
processors.
Input Example 3 shrimps
CE
Fresh
were
sold to third party processors
by
the
AC
kerpan farm.
35
Figure 10: Semantic Compositional Structure Building
relationship of the predicates (verbs) with reference to the dependency structure (as shown in Fig. 10).

2. The target predicate-arguments structure is merged based on the following simple algorithm:
(a) Search for the first predicate by traversing the structural semantics dependency tree (represented as directed graph objects).
(b) The corresponding target predicate-arguments structure of the first predicate is used as the target structure's root, as shown in Iteration 1 of Fig. 11.
(c) Traverse the target predicate-arguments structure.
(d) If a node is an argument, search the structural dependency tree to find whether any predicate-arguments structure is bound to the scope of this argument, i.e. the structure “A0[2](2 3) Pred[2](3 4) A1[2](5 7)” is bound to the “A1” argument of the root predicate.
Figure 11: Target Sentence’s Structure Construction
(e) The target argument node is replaced with the corresponding target predicate-arguments structure, as shown in Iteration 2 of Fig. 11.
(f) The iteration is repeated, traversing the target structure and replacing argument nodes.
(g) The iteration ends when no argument replacement is required.
(h) Redundancy checking is performed to eliminate repeated argument node(s), i.e. the target argument node T:A0[3] is removed as it repeats the argument node T:A0[2] in the structure.
lated string for the verb(s) (as it is directly matched and selected). 37
AN US
CR IP T
ACCEPTED MANUSCRIPT
Figure 12: Target Text Recombination and Generation
(c) As for the string segments of the target arguments, the original source
M
string will be mapped and highlighted as string segments require further translation.
(d) All these string segments will be translated using the baseline EBMT
ED
system.
650
(e) The translated text is output as the final result.
PT
(f) Based on the example: “He thought I like to eat fish.” the translated
CE
text is: “Dia fikir saya suka makan ikan.”.
6. Evaluation and Results
AC
655
In this section, experiments are conducted to evaluate the translation results
of the new SiSTeC+SEM framework. The dataset for the experiments is briefly discussed in 6.1. The first experiment in 6.2 is to evaluate the performance of the SiSTeC+SEM against the SiSTeC baseline EBMT system. The evaluation results of SiSTeC and SiSTeC+SEM are compared to SMT and Neural MT
38
ACCEPTED MANUSCRIPT
660
in 6.3. Semantic-based evaluation with human justification is carried out in 6.4 as a complementary test for the automatic evaluation metrics.
CR IP T
6.1. Preparation and Test Examples Selection Twenty thousand English SSTCs are selected from the existing BKB to train a dependency tree parser (G´ omez-Rodr´ıguez & Nivre, 2013). The dependency 665
structure of all the SSTCs in the BKB are replaced with new parsed result using this trained English dependency tree parser. All the S-SSTCs are processed with
semantic parser (Punyakanok et al., 2008) and annotated with the structural
AN US
semantics to form the S-SSTC+SEM. The new input sentence will be parsed
using the dependency tree parser and semantic parser later in the translation 670
phase such that the produced semantic structure will be consistent with the stored examples in the BKB.
One thousand examples are selected from the BKB. The selection is performed based on criteria such as: short and long sentences (3 to 30 words); sim-
675
M
ple and complex sentences, from sentences with single verb (single predicate) to multiple verbs (with main predicate and sub-predicates, complex arguments
ED
structures); with passive and active form sentences; the lexicons in the sentences should have corresponding target translation within the scope of the BKB. All these one thousand instances are removed from the BKB and translated us-
680
PT
ing the remaining stored translation examples. Based on manual examination of these translation results, one hundred examples with translation errors are selected, i.e. errors caused by boundary friction, verb selection, local word
CE
ordering and global phrasal ordering. These test examples consist of a total of 163 predicates, hence 163 semantic structures. Instead of performing a general
AC
test, these filtered examples are used to evaluate the new translation framework
685
specifically targeting the translation errors identified.

6.2. Evaluation of SiSTeC and SiSTeC+SEM

Automated evaluations of the translation results are performed using BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Denkowski
39
& Lavie, 2014), LEPOR (Han et al., 2012) and TER (Snover et al., 2006). Be-
sides measurements based on precision (BLEU, NIST) and recall (METEOR), LEPOR considers more aspects such as a sentence length penalty and an n-gram
position penalty. In a different respect from the n-gram based metrics, TER is used to estimate the post-editing effort required to modify the translation results so that they match the reference translation.
The overall comparisons of the translation results are shown in Table 3. In
695
the test with the 100 samples, the evaluation scores of the translation results
from SiSTeC+SEM are higher than those of SiSTeC, respectively with scores
of: 21.26 (BLEU), 51.13 (NIST), 53.63 (METEOR), 68.93 (LEPOR), and 63.19
(TER). Among these scores, the translation results from the new translation
framework contributed an improvement of 8.05 percentage points based on the TER score and 6.13 percentage points with the NIST metric. On careful examination of the translation results, there are examples with very similar translations, i.e. with similar target verb(s) and predicate-argument structure.
705
M
As the main purpose of the evaluation is to compare the differences between the translation results of SiSTeC and SiSTeC+SEM, the 60 test examples with very sim-
ED
ilar translation results are filtered out. The second round of the evaluation is scoped down to these 40 examples, and the comparisons obtained are shown in Table 3. SiSTeC+SEM contributed 12.54 percentage points of improvement according to the
710
PT
TER metric and 10.37 percentage points with the LEPOR metric as compared to the results from SiSTeC.
CE
The statistical significance test is performed using the paired bootstrap resampling approach proposed by Koehn (2004) for small sets of test data. The virtual test sets are created based on the selected samples. The bootstrap resam-
AC
pling process is repeated for 1000 iterations. As shown in Table 4, the translation
715
results of SiSTeC+SEM are significantly better than those of SiSTeC at p < 0.05 based
on the BLEU and NIST metrics.
Table 3: Evaluation of Translation Results for SiSTeC and SiSTeC+SEM

100 Samples     BLEU(%)   NIST(%)   METEOR(%)   LEPOR(%)   TER(%)
SiSTeC          16.67     44.99     51.49       64.99      71.24
SiSTeC+SEM      21.26     51.13     53.63       68.93      63.19
Difference       4.59      6.13      2.14        3.94       8.05

40 Samples
SiSTeC          16.26     39.67     50.43       62.71      68.95
SiSTeC+SEM      22.95     48.09     57.66       73.08      56.41
Difference       6.69      8.42      7.21       10.37      12.54
Table 4: Medians and confidence intervals for SiSTeC and SiSTeC+SEM using Paired Bootstrap Resampling

                      SiSTeC            SiSTeC+SEM        P-value
BLEU   100 Samples    0.1990 ± 0.0394   0.2286 ± 0.0491   0.04
       40 Samples     0.1666 ± 0.0551   0.2578 ± 0.0924   0.03
NIST   100 Samples    4.7764 ± 0.3627   5.2569 ± 0.3716   0.02
       40 Samples     3.9077 ± 0.4248   4.9397 ± 0.5608   0.01
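The paired bootstrap resampling behind Table 4 (Koehn, 2004) can be sketched as follows. As a simplification, per-sentence scores are summed to compare the two systems on each virtual test set; in practice a corpus-level metric such as BLEU would be recomputed on each resample.

```python
import random

def paired_bootstrap(scores_a, scores_b, iterations=1000, seed=0):
    """Fraction of resampled virtual test sets on which system B beats
    system A. scores_a / scores_b hold per-sentence scores of the two
    systems on the same test set."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / iterations

# B is consistently better, so it wins on every virtual test set.
a = [0.2, 0.3, 0.25, 0.4, 0.1]
b = [0.5, 0.6, 0.45, 0.7, 0.3]
print(paired_bootstrap(a, b))  # 1.0
```

A win fraction of at least 0.95 corresponds to significance at p < 0.05, which is the criterion reported for the BLEU and NIST comparisons above.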
Table 5: Comparison with other MT systems

100 Samples     BLEU(%)   NIST(%)   METEOR(%)   LEPOR(%)   TER(%)
SiSTeC          16.67     44.99     51.49       64.99      71.24
SiSTeC+SEM      21.26     51.13     53.63       68.93      63.19
Moses           25.20     51.94     55.22       67.25      75.13
OpenNMT         22.75     48.34     50.14       63.17      74.28

40 Samples
SiSTeC          16.26     39.67     50.43       62.71      68.95
SiSTeC+SEM      22.95     48.09     57.66       73.08      56.41
Moses           27.74     47.19     57.56       68.34      74.78
OpenNMT         26.31     46.25     53.66       65.72      72.70
6.3. Comparative Evaluation of SiSTeC, SiSTeC+SEM, MOSES and OpenNMT
As comparisons, two different MT systems are trained using the 100,000 parallel aligned sentence pairs extracted from the BKB (the same dataset as for SiSTeC and SiSTeC+SEM). One is a phrase-based SMT system based on Moses (Koehn et al., 2007) and the other is a Neural MT system using OpenNMT (Klein et al., 2017). For Moses training, the translation examples are automatically aligned using GIZA++ (http://www.statmt.org/moses/giza/GIZA++.html) and a language model is trained using the open source IRSTLM toolkit (http://hlt-mt.fbk.eu/technologies/irstlm) up to 5-grams. The translation test for both Moses and OpenNMT is conducted using the
same set of samples as in the previous test for SiSTeC and SiSTeC+SEM. The results are combined with the previous test results and presented in Table 5. In the test using the 100 samples, Moses obtained the best scores on BLEU (25.20 points), NIST (51.94 points) and METEOR (55.22 points). The performance of SiSTeC+SEM is comparable with Moses with reference to NIST (51.13 points) and METEOR (53.63 points). Furthermore, it achieved the best scores on the LEPOR (68.93 points) and TER (63.19 points) evaluation metrics. In the test using the filtered 40 samples, SiSTeC+SEM performed better
than all other MT systems in all of the metrics except BLEU score.
As shown in Table 6, Moses can select better verbs than SiSTeC. However, Moses is not able to determine the correct morphological form for some of the selected verbs. As shown in Table 7, “dilantik” and “diproses” are selected for the sentences “we have not yet appointed a place for the meeting” and “the bank quickly processed the loan requested by the company” respectively. Neither verb is in the proper morphological form. The verb “dilantik” would be suitable only if the target sentence's structure were modified and changed to “tempat untuk mesyuarat masih belum dilantik oleh kami”. For the verb “diproses”, the target sentence would need to be restructured and changed to “pinjaman yang diminta oleh company itu diproses dengan cepat oleh bank itu”. As SiSTeC+SEM selects the verb requiring minimum adaptation of the sentence's structure, the target verbs selected for these two input sentences are “melantik” and “memproses” respectively. As shown in Table 8, the limitation in selecting verbs with a suitable morphological
CE
form is also observed in the OpenNMT translation results. 750
6.4. Semantic-based Evaluation
The requirements of the automatic evaluations in Section 6.2 are simple and depend only on the reference translation; no additional language-specific data, tools or training are required. Consequently, the automatic evaluation is influenced by how well the translation results correlate with the reference translations. The automatic scores are unable to reflect the consistency of meaning structure between the input sentence and the target sentence.
Table 6: Examples of translation results: SiSTeC vs. Moses
[Table body garbled in extraction. Source sentences: "to gain access to a room."; "to accompany someone on a journey."; "to adjust to life in a foreign country." Each row gave the human, SiSTeC and Moses translations.]
As a simple solution to this limitation, an evaluation using semantic frames, based on the suggestions in Lo et al. (2012) and Lo & Wu (2011), is performed. In this semantic-based evaluation, the automated translation results and the reference translations are manually annotated with semantic roles. Precision, recall and f-score are calculated for the semantic similarity between the automated translation results and the reference translations. The equations proposed in Lo & Wu (2011) are slightly modified to prioritize the correct selection of a more suitable target verb. The modified equations are listed below (with k the total number of test sentences):
1. General definition of variables:
(a) $C_{i,j}$ = number of correct fillers of argument j for predicate i in the machine translation result
(b) $P_{i,j}$ = number of partial fillers of argument j for predicate i in the machine translation result
(c) $M_{i,j}$ = total number of fillers of argument j for predicate i in the machine translation result
Table 7: Examples of translation results: SiSTeC+SEM vs. Moses
[Table body garbled in extraction. Source sentences: "we have not yet appointed a place for the meeting."; "to be admitted to a university."; "he fought off the shark and swam back to the beach."; "the bank quickly processed the loan requested by the company." Each row gave the human, SiSTeC+SEM and Moses translations.]
Table 8: Examples of translation results: SiSTeC+SEM vs. OpenNMT
[Table body garbled in extraction. Source sentences: "we have not yet appointed a place for the meeting."; "to be admitted to a university."; "the management judged it better to close down the old factory."; "the bank quickly processed the loan requested by the company." Each row gave the human, SiSTeC+SEM and OpenNMT translations.]
(d) $R_{i,j}$ = total number of fillers of argument j for predicate i in the reference translation
(e) $w_p$ = predicate weight
(f) $w_a$ = argument weight, which is $1 - w_p$
2. For n predicates in a test sentence, the precision of the correct predicate $Pred_i$ and complete argument(s) $C_{i,j}$ matching for each predicate $Pred_i$:
$$C_{precision} = \sum_{i}^{n} \frac{w_p\,Pred_i + \sum_{j}^{m} w_a\,C_{i,j}}{w_p\,Pred_i + \sum_{j}^{m} w_a\,M_{i,j}}$$
3. For n predicates in a test sentence, the recall of the correct predicate $Pred_i$ and complete argument(s) $C_{i,j}$ matching for each predicate $Pred_i$:
$$C_{recall} = \sum_{i}^{n} \frac{w_p\,Pred_i + \sum_{j}^{m} w_a\,C_{i,j}}{w_p\,Pred_i + \sum_{j}^{m} w_a\,R_{i,j}}$$
4. For n predicates in a test sentence, the precision of partial argument(s) $P_{i,j}$ matching for each predicate $Pred_i$:
$$P_{precision} = \sum_{i}^{n} \frac{\sum_{j}^{m} w_a\,P_{i,j}}{w_p\,Pred_i + \sum_{j}^{m} w_a\,M_{i,j}}$$
5. For n predicates in a test sentence, the recall of partial argument matching $P_{i,j}$ for each predicate $Pred_i$:
$$P_{recall} = \sum_{i}^{n} \frac{\sum_{j}^{m} w_a\,P_{i,j}}{w_p\,Pred_i + \sum_{j}^{m} w_a\,R_{i,j}}$$
6. The precision of the overall test set with k sentences:
$$precision = \frac{\sum_{k} (C_{precision} + P_{precision})}{\text{total number of predicates in MT}}$$
7. The recall of the overall test set with k sentences:
$$recall = \frac{\sum_{k} (C_{recall} + P_{recall})}{\text{total number of predicates in REF}}$$
8. The f-score of the overall test set with k sentences:
$$f\text{-}score = \frac{2 \cdot precision \cdot recall}{precision + recall}$$
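As an illustration, the equations above can be implemented directly. The `Predicate` container and `semantic_scores` function below are illustrative names, not part of the published system; for simplicity, the sketch assumes the predicates in the MT output and the reference are aligned one-to-one, so the MT and REF predicate totals coincide.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Predicate:
    pred: int      # Pred_i: 1 if the target verb (predicate) is correctly selected, else 0
    c: List[int]   # C_{i,j}: correct fillers of argument j in the MT result
    p: List[int]   # P_{i,j}: partial fillers of argument j in the MT result
    m: List[int]   # M_{i,j}: total fillers of argument j in the MT result
    r: List[int]   # R_{i,j}: total fillers of argument j in the reference

def semantic_scores(sentences: List[List[Predicate]], wp: float = 0.5):
    """Overall precision, recall and f-score of a test set, following items 2-8 above.
    sentences: one inner list of annotated predicates per test sentence."""
    wa = 1.0 - wp
    c_prec = c_rec = p_prec = p_rec = 0.0
    total_preds = 0
    for preds in sentences:
        total_preds += len(preds)
        for pr in preds:
            correct = wp * pr.pred + wa * sum(pr.c)   # numerator of the C terms
            partial = wa * sum(pr.p)                  # numerator of the P terms
            den_mt = wp * pr.pred + wa * sum(pr.m)    # MT-side denominator
            den_ref = wp * pr.pred + wa * sum(pr.r)   # reference-side denominator
            if den_mt:
                c_prec += correct / den_mt
                p_prec += partial / den_mt
            if den_ref:
                c_rec += correct / den_ref
                p_rec += partial / den_ref
    precision = (c_prec + p_prec) / total_preds
    recall = (c_rec + p_rec) / total_preds
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

With this formulation, a predicate whose arguments are all at least partially matched contributes a full point to both sums, and the weight `wp` controls how strongly a wrong verb choice depresses the score.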
The weights of the predicate ($w_p$) and argument ($w_a$) in the above equations are separated so that they can be set at different scales in the calculation. This is based on the assumption that when the target verb selection is inaccurate, the interpretation of the whole sentence is affected. The filtered 40 test examples from Section 6.2 are used to perform the semantic-based evaluation. As shown in Fig. 13, the translation results from SiSTeC+SEM achieved higher precision, recall and f-score than the translation results from SiSTeC. This indicates that, in order to improve the overall translation results of SiSTeC, the new translation framework needs to select better verbs and at the same time suggest the correct predicate-argument structure. Table 9 shows one example from the SiSTeC translation results where the selection of verbs with an incorrect morphological form causes predicate-argument structure errors in the target sentence; these errors lead to misinterpretation of the overall sentence meaning. Thus, the semantic-based evaluation results are aligned with the results in Section 6.3, where SiSTeC+SEM is capable of selecting a more suitable verb together with a more accurate semantic argument structure.
7. Discussion
Based on examination of the test results, the n-gram based evaluation metrics are apparently susceptible to structural alteration of sentences. To obtain a high score in an n-gram based evaluation, a translation result needs to exhibit a grammatical structure resembling the reference translation and high lexical similarity with it. The selected test samples for the evaluations were mainly examples with boundary friction, structural and reordering errors when translated with SiSTeC. From observation, the translation results from SiSTeC show more structural variation from the reference translations than the translation results from SiSTeC+SEM. Thus, the scores of the n-gram based
Figure 13: Semantic-based Evaluation for SiSTeC+SEM vs. SiSTeC
Table 9: Examples of Semantic-based evaluation results: SiSTeC+SEM vs. SiSTeC EBMT
[Table body garbled in extraction. It compared the predicate-argument annotations (A0, A1, pred1, pred2) of the reference, SiSTeC+SEM and SiSTeC translations of "the management judged it better to close down the old factory." and marked each unit as match or not match for the two systems.]
metrics for SiSTeC+SEM are better than those for SiSTeC. As the focus of this research is on the adequacy of the meaning structure of the target translation, the semantic-frames approach suggested by Lo & Wu (2011) is probably more suitable for comparing the translation results of SiSTeC and SiSTeC+SEM. To complement the automatic evaluation metrics, the semantic-based evaluation is conducted and presented in Section 6.4.
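The order sensitivity of n-gram matching can be seen with a minimal clipped n-gram precision function; this is an illustrative sketch, not the exact BLEU/NIST implementation used in the tests, and the example sentences are invented.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of a candidate sentence against one reference."""
    cand, ref = candidate.split(), reference.split()
    cand_grams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(count, ref_grams[g]) for g, count in cand_grams.items())
    total = sum(cand_grams.values())
    return clipped / total if total else 0.0

# A reordered sentence keeps every word (unigram precision 1.0) yet loses most
# higher-order n-grams, so n-gram metrics penalize structural alteration.
ref = "the cat sat on the mat"
cand = "on the mat the cat sat"
print(ngram_precision(cand, ref, 1))  # 1.0
print(ngram_precision(cand, ref, 3))  # 0.5
```

This is why a meaning-preserving but restructured translation can score poorly under n-gram metrics while remaining adequate.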
The translation example selection in SiSTeC+SEM can be separated into two stages. The first stage focuses on selection by meaning structure, which requires full matching between the input sentence and the stored examples. The second stage is partial selection, which retrieves the equivalent TL text for the input sentence and involves segmentation during the translation phase. The segmentation of the input sentence is guided by the boundaries of the arguments defined by its meaning structure. Hence, the partial translation example selection in the second stage is governed by the first-stage example selection. With the partial example selection results, further recombination is facilitated by the target sentence's meaning structure. A few characteristics of the new translation framework based on structural semantics can be highlighted:
1. The selection of the target verb is not based on direct matching of the input sentence's verb. Instead, it is based on the combination of the co-occurring arguments and their structure, which provides a complete definition of the verb predicate. This leads to the disambiguation of polysemous verbs and determines the proper morphological form of the target verb. As shown in Table 10, in most cases SiSTeC+SEM selects better verb(s) than SiSTeC. Verb selection with the correct morphological form is important for the Malay language. For example, for the input sentence "the bank quickly processed the loan requested by the company.", SiSTeC+SEM selected the verb "diminta" as the target translation for "requested", which is more suitable than the verb "meminta" selected by SiSTeC. Although "diminta" and "meminta" share the base form "minta", they lead to different interpretations of the constructed sentence structure.
2. The selection of translation examples prioritizes the example requiring minimal adaptation, minimizing the reordering and transformation steps during recombination. With minimal adaptation, the processing cost and error rate are reduced, and the translated result is generated as naturally as possible. As shown in Table 11, the example "she begrudged the time her husband spent with his friends." requires swapping the positions of arguments A0 and A1 in order to adapt to the input sentence. For the example "she spent her vacation swanning around Europe visiting old friends.", no restructuring of arguments is required. Thus, it serves as the best matched example in this case, which is indirectly reflected in its lower semantic distance.
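The preference for minimal adaptation can be sketched as a comparison of role patterns. The `pattern_distance` function and its costs below are a hypothetical stand-in for the system's actual semantic distance (whose exact definition is not reproduced here): it penalizes only missing roles and role-order differences, and assumes each role occurs at most once, yet it ranks the two examples of Table 11 in the same order as their reported distances (0.60625 vs. 0.80417).

```python
def pattern_distance(input_roles, example_roles, miss_cost=0.5, swap_cost=0.2):
    """Illustrative proxy distance between two predicate-argument patterns:
    penalize roles present in only one pattern, plus order mismatches."""
    missing = len(set(input_roles) ^ set(example_roles))     # unmatched roles
    shared_in = [r for r in input_roles if r in example_roles]
    shared_ex = [r for r in example_roles if r in input_roles]
    swaps = sum(1 for a, b in zip(shared_in, shared_ex) if a != b)
    return miss_cost * missing + swap_cost * swaps

inp = ["AM-TMP", "A0", "Pred", "A1", "A2"]   # input pattern from Table 11
ex1 = ["A0", "Pred", "A1", "A2"]             # "she spent her vacation ..."
ex2 = ["A1", "A0", "Pred", "A2"]             # "she begrudged the time ..." (A0/A1 swapped)
assert pattern_distance(inp, ex1) < pattern_distance(inp, ex2)  # example 1 preferred
```

An example whose arguments already appear in the input's order needs no restructuring during recombination, so it is selected.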
The overall evaluation results in Section 6 provide sufficient evidence that structural semantics is a feasible approach for preserving the meaning structure of the source language through to the final translation. This is clearly reflected in the process of translation example matching, which determines the best semantic structure, and in the selection of the most suitable verb with the correct morphological form. The transformation of the target sentence's structure is eased by the multi-level syntactic and semantic structural correspondences in the S-SSTC+SEM.

Semantic annotation at the structural level for the S-SSTC provides a flexible and efficient way to integrate multiple layers of linguistic knowledge while maintaining the specification of the correspondence between bilingual translation examples in a natural and simple representation scheme. This multi-layer correspondence approach allows future extension of the representation to cater for other aspects of linguistic knowledge without modifying the current S-SSTC representation scheme. The integration of structural annotation with the S-SSTC allows indirect projection of the source language's linguistic knowledge onto the target language via structural correspondence relationships. Together, the structural semantic annotation and the multi-layer correspondence relationships enhance meaning preservation in the transfer from the source language to the target language throughout the whole process of translation. These integrated and synchronized structural semantic annotations and multi-layer correspondences provide the foundation for deriving and designing meaning-based approaches to resolve problems in the EBMT system.
8. Conclusion and Future Work
Solutions for better meaning preservation are designed by incorporating guided analysis based on structural semantic information, as opposed to the non-guided analysis approach in the previous SiSTeC EBMT. This structural semantics serves as a reference basic meaning structure for translation example selection and facilitates the meaning-based construction of the target sentence structure. The meaning structure allows semantic constraints to be imposed throughout the translation process, so that the semantic consistency and integrity of the input sentence can be preserved and transferred to the target sentence.

As elaborated in Section 5, no specific cross-language handling is explicitly defined in the new EBMT framework apart from the language pair alignment. The correspondence of syntactic and semantic structure in the Bilingual Knowledge Bank is learned solely from the alignment information. Hence, the proposed approach is applicable to language pairs whose predicate-argument structures can be explicitly matched and whose semantic role variations can be clearly specified. However, the degree of accuracy may vary for language pairs that are very different in nature.

As a continuous effort to improve the English to Malay EBMT system, we are keen to conduct further research into the following aspects:

1. Segment translation improvement in SiSTeC EBMT - A word reordering error in segment translation with the baseline SiSTeC was detected in the translation test in Section 6. As the structural index matching is not sensitive
Table 10: Examples of translation results: SiSTeC+SEM vs SiSTeC EBMT
[Table body garbled in extraction. Source sentences: "to demand access to a building."; "the subject was taught in selected schools as an experiment."; "these considerations induce me to believe that."; "the Kerpan farm currently sells fresh shrimps to third party processors."; "the management judged it better to close down the old factory."; "the bank quickly processed the loan requested by the company." Each row gave the human, SiSTeC and SiSTeC+SEM translations.]
Table 11: Example 1 of Verb Selection Based On Structural Semantic Patterns

Input: "During the daytime houseflies spend their time outdoors or in covered areas near the open air."
Input pattern: AM-TMP([1 3][root:PREP][during the daytime][PREP N DET]); A0([4 4][root:PRON][houseflies][PRON]); Pred([5 5][spend]); A1([6 7][GEN PRON root:N][their time][GEN PRON N]); A2([8 16][N root:CC PREP][outdoors or in covered areas near the open air][N CC PREP V EN N PREP DET N])

Matched Example 1
Source: She spent her vacation swanning around Europe visiting old friends.
Target: Dia menghabiskan percutian nya mengembara sekitar Eropah sambil melawat kawan-kawan lama nya.
Pattern: A0([1 1][root:PRON][she][PRON]); Pred([2 2][spend]); A1([3 4][GEN PRON root:N][her vacation][GEN PRON N]); A2([5 10][root:N ; root:PREP N][swan around europe visit old friend][N PREP N ING A N])
Distance: 0.60625

Matched Example 2
Source: She begrudged the time her husband spent with his friends.
Target: Dia menyesali masa yang dihabiskan oleh suami nya bersama kawan-kawan nya.
Pattern: A1([3 4][DET root:N][the time][DET N]); A0([5 6][GEN PRON root:N][her husband][GEN PRON N]); Pred([7 7][spend]); A2([8 10][root:PREP N][with his friend][PREP GEN PRON N])
Distance: 0.80417
to syntactic phrases, a single phrase can, under certain circumstances, be segmented into smaller chunks during the recombination process. When this occurs, the word-order information from the source to the target is lost, and the words of the phrase cannot be reordered on the target side. One possible solution to this local reordering issue is to apply simple syntactic transfer rules that perform post-translation word reordering.
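Such a post-translation pass could be driven by POS-based transfer rules. The rule set, tags and helper below are hypothetical: the single swap ADJ N -> N ADJ stands in for the English-to-Malay modifier order (producing, e.g., "kilang lama" for "old factory").

```python
def apply_reordering_rules(tagged, rules):
    """Swap adjacent (word, POS) pairs whenever their POS bigram is in rules."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if (out[i][1], out[i + 1][1]) in rules:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out

# Hypothetical rule: an adjective to the left of a noun moves to its right.
rules = {("ADJ", "N")}
segment = [("lama", "ADJ"), ("kilang", "N"), ("itu", "DET")]
fixed = apply_reordering_rules(segment, rules)
print([w for w, _ in fixed])  # ['kilang', 'lama', 'itu']
```

A real rule set would be induced from the aligned examples rather than written by hand, but the mechanism of a local, rule-triggered swap is the same.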
2. Adding more types of linguistic annotation - There are other aspects of linguistic information, such as named entities, which involve the classification of proper nouns into Person, Organisation, Place, Time, etc. Annotating the SSTC with named entities will provide context-sensitive information to aid semantic argument annotation and argument-level abstraction. This is especially meaningful when the EBMT targets domain-specific text translation, generalizing the translation example pairs and at the
of Proper Noun, i.e. Person, Organisation, Place, Time, etc. Annotation of SSTC with named entities will provide context sensitive information to aid the semantic argument annotation and argument level abstraction. This is meaningful especially when the EBMT is targeting for domain specific text translation by generalizing the translation example pairs and at the
905
M
same time supplying domain specific glossary or dictionary for specific terms translation. This allows reuse of the BKB and minimizes the requirements of additional resources when implementing automated translation solution for
References
specific domain document translation.
Al-Adhaileh, M. H., Kong, T. E., & Yusoff, Z. (2002). A synchronization struc-
CE
ture of SSTC and its applications in machine translation. In Proceedings of the 2002 COLING Workshop on Machine Translation in Asia - Volume 16 COLING-MTIA ’02 (pp. 1–8). Stroudsburg, PA, USA: Association for Computational Linguistics.
AC
915
Al-Adhaileh, M. H., & Tang, E. K. (1999). Example-based machine translation based on the synchronous SSTC annotation schema. In Machine Translation Summit VII (p. 244249).
Albacete, E., Calle, J., Castro, E., & Cuadra, D. (2012). Semantic similarity 55
ACCEPTED MANUSCRIPT
measures applied to an ontology for human-like interaction. J. Artif. Int.
920
Res., 44 , 397–421. Aramaki, E., & Kurohashi, S. (2004). Example-based machine translation using
on spoken language translation (IWSLT-04) (p. 9194). 925
CR IP T
structural translation examples. In Proceedings of the international workshop
Aramaki, E., Kurohashi, S., Kashioka, H., & Tanaka, H. (2003). Word selec-
tion for EBMT based on monolingual similarity and translation confidence. In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using
AN US
Parallel Texts: Data Driven Machine Translation and Beyond - Volume 3
HLT-NAACL-PARALLEL ’03 (pp. 57–64). Stroudsburg, PA, USA: Association for Computational Linguistics.
930
Aziz, W., Rios, M., & Specia, L. (2011). Shallow semantic trees for SMT. In Proceedings of the Sixth Workshop on Statistical Machine Translation WMT
M
’11 (pp. 316–322). Stroudsburg, PA, USA: Association for Computational Linguistics.
Bazrafshan, M., & Gildea, D. (2013). Semantic roles for string to tree machine
ED
935
translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 419–423). Sofia,
PT
Bulgaria: Association for Computational Linguistics. Bertero, D., & Fung, P. (2015). HLTC-HKUST: A neural network paraphrase classifier using translation metrics, semantic roles and lexical similarity fea-
CE
940
tures. In Proceedings of the 9th International Workshop on Semantic Evalua-
AC
tion, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, June 4-5, 2015 (pp. 23–28).
Bj¨ orkelund, A., Hafdell, L., & Nugues, P. (2009). Multilingual semantic role
945
labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task CoNLL ’09 (pp. 43–48). Stroudsburg, PA, USA: Association for Computational Linguistics.
56
ACCEPTED MANUSCRIPT
Blanco, E., & Moldovan, D. (2013). Composition of semantic relations: Theoretical framework and case study. ACM Transactions on Speech and Language Processing (TSLP), 10 , 17.
950
CR IP T
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., & Roossin, P. S. (1990). A statistical approach to machine translation. Comput. Linguist., 16 , 79–85.
Brown, R. D. (1999). Adding linguistic knowledge to a lexical example-based
translation system. In In Proceedings of the Eighth International Conference
955
AN US
on Theoretical and Methodological Issues in Machine Translation (TMI-99 (pp. 22–32).
Brown, R. D. (2000). Automated generalization of translation examples. In Proceedings of the 18th Conference on Computational Linguistics - Volume 1 COLING ’00 (pp. 125–131). Stroudsburg, PA, USA: Association for Compu-
960
M
tational Linguistics.
Brown, R. D. (2001). Transfer-rule induction for example-based translation. In M. Carl, & A. Way (Eds.), Recent Advances in Example-Based Machine
965
ED
Translation (pp. 1–11). Kluwer Academic. B¨ uchse, M., Nederhof, M.-J., & Vogler, H. (2011). Tree parsing with syn-
PT
chronous tree-adjoining grammars. In Proceedings of the 12th International Conference on Parsing Technologies IWPT ’11 (pp. 14–25). Stroudsburg, PA,
CE
USA: Association for Computational Linguistics. Carl, M. (2005). A system-theoretical view of EBMT. Machine Translation, 19 , 229–249.
AC
970
Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
Dlougach, J., & Galinskaya, I. (2012). Building a reordering system using tree975
to-string hierarchical model. In Proceedings of the Workshop on Reordering 57
ACCEPTED MANUSCRIPT
for Statistical Machine Translation (pp. 27–36). Mumbai, India: The COLING 2012 Organizing Committee. Doddington, G. (2002). Automatic evaluation of machine translation quality
CR IP T
using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research HLT ’02 (pp.
980
138–145). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Doi, T., Yamamoto, H., & Sumita, E. (2005). Example-based machine translation using efficient sentence retrieval based on edit-distance. ACM Transac-
985
AN US
tions on Asian Language Information Processing, 4 , 377–399.
Dorr, B., & Habash, N. (2002). Interlingua approximation: A generation-heavy approach. In Proceedings of Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas, AMTA2002 (pp. 1–6). University of Chicago Press.
M
Dorr, B. J., Hovy, E., & Levin, L. (2006). Machine translation: Interlingual methods. In E. in Chief: Keith Brown (Ed.), Encyclopedia of Language &
990
ED
Linguistics (Second Edition)Encyclopedia of Language & Linguistics (Second Edition) (pp. 383 – 394). Oxford: Elsevier. Dorr, B. J., Passonneau, R. J., Farwell, D., Green, R., Habash, N., Helmreich, S.,
PT
Hovy, E., Levin, L., Miller, K. J., Mitamura, T., Rambow, O., & Siddharthan, A. (2010). Interlingual annotation of parallel text corpora: A new framework
995
CE
for annotation and evaluation. Natural Language Engineering, 16 , 197–243. Feng, M., Sun, W., & Ney, H. (2012). Semantic cohesion model for phrase-based
AC
SMT. In COLING 2012, 24th International Conference on Computational
1000
Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India (pp. 867–878).
Fillmore, C. J. (1968). The case for case. In E. Bach, & R. T. Harms (Eds.), Universals in Linguistic Theory (pp. 0–88). New York: Holt, Rinehart and Winston. 58
ACCEPTED MANUSCRIPT
Gangadharaiah, R., Brown, R. D., & Carbonell, J. G. (2006). Spectral clustering for example based machine translation. In R. C. Moore, J. A. Bilmes,
1005
J. Chu-Carroll, & M. Sanderson (Eds.), HLT-NAACL. The Association for
CR IP T
Computational Linguistics. Gao, Q., & Vogel, S. (2011). Utilizing target-side semantic role labels to assist hierarchical phrase-based machine translation. In Proceedings of the Fifth Work-
shop on Syntax, Semantics and Structure in Statistical Translation SSST-5
1010
(pp. 107–115). Stroudsburg, PA, USA: Association for Computational Lin-
AN US
guistics.
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press. 1015
Goldberg, A. E. (1999). The emergence of the semantics of argument structure constructions. In B. MacWhinney (Ed.), Emergence of Language. Hillsdale,
M
NJ: Lawrence Earlbaum Associates.
Goldberg, A. E. (2016). The routledge handbook of semantics. chapter Com-
1020
ED
positionality. (pp. 419–430). Routledge. Gomaa, W. H., & Fahmy, A. A. (2013). A survey of text similarity approaches.
PT
International Journal of Computer Applications, 68 , 13–18. G´ omez-Rodr´ıguez, C., & Nivre, J. (2013). Divisible transition systems and
CE
multiplanar dependency parsing. Computational Linguistics, 39 , 799–845. Gruber, J. S. (1965). Studies in Lexical Relations. Ph.D. thesis MIT Cambridge, MA.
AC
1025
Hajiˇc, J., Ciaramita, M., Johansson, R., Kawahara, D., Mart´ı, M. A., M` arquez, ˇ ep´ L., Meyers, A., Nivre, J., Pad´ o, S., Stˇ anek, J., Straˇ n´ ak, P., Surdeanu, M., Xue, N., & Zhang, Y. (2009). The CoNLL-2009 Shared Task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the
1030
Thirteenth Conference on Computational Natural Language Learning: Shared
59
ACCEPTED MANUSCRIPT
Task CoNLL ’09 (pp. 1–18). Stroudsburg, PA, USA: Association for Computational Linguistics. Han, A. L. F., Wong, D. F., & Chao, L. S. (2012). LEPOR: A robust evaluation
CR IP T
metric for machine translation with augmented factors. In COLING 2012,
24th International Conference on Computational Linguistics, Proceedings of
1035
the Conference: Posters, 8-15 December 2012, Mumbai, India (pp. 441–450).
Healy, A. F., & Miller, G. A. (1970). Psychonomic science. chapter The verb
as the main determinant of sentence meaning. (p. 372). Psychonomic Society
1040
AN US
volume 20.
Hutchins, J. (2005a). Example-based machine translation: A review and commentary. Machine Translation, 19 , 197–211.
Hutchins, J. (2005b). Towards a definition of example-based machine translation. In Proceedings of Second Workshop on Example-Based Machine Trans-
1045
M
lation (pp. 63–70). Phuket, Thailand: MT Summit X.
Imamura, K., Okuma, H., Watanabe, T., & Sumita, E. (2004). Example-based
ED
machine translation based on syntactic transfer with statistical models. In Proceedings of the 20th International Conference on Computational Linguistics COLING ’04. Stroudsburg, PA, USA: Association for Computational
1050
PT
Linguistics.
Jackendoff, R. (1976). Toward an explanatory semantic representation. Lin-
CE
guistic Inquiry, 7 , 89–150.
AC
Kaji, H., Kida, Y., & Morimoto, Y. (1992). Learning translation templates
1055
from bilingual text. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 COLING ’92 (pp. 672–678). Stroudsburg, PA, USA: Association for Computational Linguistics.
Kaptein, R., Van den Broek, E. L., Koot, G. et al. (2013). Recall oriented search on the web using semantic annotations. In Proceedings of the sixth
60
ACCEPTED MANUSCRIPT
international workshop on Exploiting semantic annotations in information retrieval (pp. 45–48). ACM. 1060
Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). OpenNMT:
CR IP T
Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints, .
Koehn, P. (2004). Statistical significance tests for machine translation evalua-
tion. In D. Lin, & D. Wu (Eds.), Proceedings of EMNLP 2004 (pp. 388–395). Barcelona, Spain: Association for Computational Linguistics. 1065
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N.,
AN US
Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In ACL. The Association for Computer Linguistics.
Levin, B., & Rappaport Hovav, M. (2005). Argument Realization. Research surveys in linguistics. Cambridge, New York (N.Y.), Melbourne: Cambridge
1070
M
University Press. Autres tirages : 2006, 2007, 2008.
Li, J., Resnik, P., & Daum´e III, H. (2013). Modeling syntactic and semantic
ED
structures in hierarchical phrase-based translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 540–549). Atlanta,
1075
PT
Georgia: Association for Computational Linguistics. Liu, D., & Gildea, D. (2010). Semantic role features for machine translation.
CE
In Proceedings of the 23rd International Conference on Computational Linguistics COLING ’10 (pp. 716–724). Stroudsburg, PA, USA: Association for Computational Linguistics.
AC
1080
Liu, Z., Wang, H., & Wu, H. (2003). Example-based machine translation based on tree-string correspondence and statistical generation. Machine Translation, Volume 20, Issue 1 , 25–41.
Lo, C.-k., Tumuluru, A. K., & Wu, D. (2012). Fully automatic semantic MT evaluation. In Proceedings of the Seventh Workshop on Statistical Machine Translation WMT '12 (pp. 243–252). Stroudsburg, PA, USA: Association for Computational Linguistics.

Lo, C.-k., & Wu, D. (2011). MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 HLT '11 (pp. 220–229). Stroudsburg, PA, USA: Association for Computational Linguistics.

Lü, Y., Huang, J., & Liu, Q. (2007). Improving statistical machine translation performance by training data selection and optimization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 343–350).

Malik, S. K., & Rizvi, S. (2011). Information extraction using web usage mining, web scrapping and semantic annotation. In Computational Intelligence and Communication Networks (CICN), 2011 International Conference on (pp. 465–469). IEEE.

Matar, Y., Egyed-Zsigmond, E., & Lajmi, S. (2008). KWSim: Concept similarity measure. In ARIA (Ed.), CORIA 2008, COnférence en Recherche d'Information et Applications (pp. 475–482).

Matsumoto, Y., & Kitamura, M. (1997). Acquisition of translation rules from parallel corpora. In R. Mitkov, & N. Nicolov (Eds.), Recent Advances in Natural Language Processing: Selected Papers from RANLP 95. John Benjamins.

Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 AAAI'06 (pp. 775–780). AAAI Press.

Moreda, P., Llorens, H., Saquete, E., & Palomar, M. (2008). Two proposals of a QA answer extraction module based on semantic roles. In Proceedings of the 7th Mexican International Conference on Artificial Intelligence: Advances in Artificial Intelligence MICAI '08 (pp. 174–184). Berlin, Heidelberg: Springer-Verlag.

Nagao, M. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In A. Elithorn, & R. Banerji (Eds.), Artificial and Human Intelligence (pp. 173–180).
Nirenburg, S., Domashnev, C., & Grannes, D. J. (1993). Two approaches to matching in example-based machine translation. In Proceedings of the Fifth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-93). Kyoto, Japan.
Palmer, M., Gildea, D., & Kingsbury, P. (2005). The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31, 71–106.

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics ACL '02 (pp. 311–318). Stroudsburg, PA, USA: Association for Computational Linguistics.

Petrakis, E., & Varelas, G. (2006). Design and evaluation of semantic similarity measures for concepts stemming from the same or different ontologies. Multimedia Semantics.

Punyakanok, V., Roth, D., & Yih, W. T. (2008). The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34, 257–287.

Rappaport Hovav, M., & Levin, B. (1998). Building verb meanings. In The Projection of Arguments: Lexical and Compositional Factors (pp. 97–134). Stanford: CSLI Publications.

Shen, D., & Lapata, M. (2007). Using semantic roles to improve question answering. In J. Eisner (Ed.), EMNLP-CoNLL (pp. 12–21). ACL.

Slimani, T. (2013). Description and evaluation of semantic similarity measures approaches. International Journal of Computer Applications, 80, 25–33.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas (pp. 223–231).

Somers, H. (1999). Review article: Example-based machine translation. Machine Translation.

Sumita, E. (2001). Example-based machine translation using DP-matching between word sequences. In Proceedings of the Workshop on Data-driven Methods in Machine Translation - Volume 14 DMMT '01 (pp. 1–8). Stroudsburg, PA, USA: Association for Computational Linguistics.
Szabó, Z. G. (2013). Compositionality. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Fall 2013 ed.). URL: https://plato.stanford.edu/archives/fall2013/entries/compositionality/.

Teruko, M., Miller, K. J., Dorr, B. J., Farwell, D., Habash, N., Levin, L., Helmreich, S., Hovy, E., Rambow, O., Florence, R., & Siddharthan, A. (2004). Semantic annotation for interlingual representation of multilingual texts. In Proceedings of the Workshop on Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, LREC.

Trandabăț, D. M. (2011). Semantic role labeling for structured information extraction. In Proceedings of the Fourth Workshop on Exploiting Semantic Annotations in Information Retrieval (pp. 25–26). ACM.
Vertan, C., & Martin, V. E. (2005). Experiments with matching algorithms in example-based machine translation. In Proceedings of the International Workshop Modern Approaches in Translation Technologies.

Vogel, S., Zhang, Y., Huang, F., Tribble, A., Venugopal, A., Zhao, B., & Waibel, A. (2003). The CMU statistical machine translation system. In Proceedings of MT Summit IX (pp. 110–117).

Way, A. (2010). Panning for EBMT gold, or "remembering not to forget". Machine Translation, 24, 177–208.

Wu, D., & Fung, P. (2009). Semantic roles for SMT: A hybrid two-pass model. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers NAACL-Short '09 (pp. 13–16). Stroudsburg, PA, USA: Association for Computational Linguistics.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., & Dean, J. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Ye, H. H. (2006). Indexing of bilingual knowledge bank based on the synchronous SSTC structure. Master's thesis, Universiti Sains Malaysia.

Zhai, F., Zhang, J., Zhou, Y., & Zong, C. (2012). Machine translation by modeling predicate-argument structure transformation. In COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India (pp. 3019–3036).

Zhang, J., & Zong, C. (2013). A unified approach for effectively integrating source-side syntactic reordering rules into phrase-based translation. Language Resources and Evaluation, 47, 449–474.

Zhao, H., Zhang, X., & Kit, C. (2013). Integrative semantic dependency parsing via efficient large-scale feature selection. Journal of Artificial Intelligence Research, 46, 203–233.

Zuccon, G., Koopman, B., & Bruza, P. (2014). Exploiting inference from semantic annotations for information retrieval: Reflections from medical IR. In Proceedings of the 7th International Workshop on Exploiting Semantic Annotations in Information Retrieval (pp. 43–45). ACM.