
Accepted Manuscript

Bilingual Recursive Neural Network Based Data Selection for Statistical Machine Translation
Derek F. Wong, Yi Lu, Lidia S. Chao

PII: S0950-7051(16)30091-0
DOI: 10.1016/j.knosys.2016.05.003
Reference: KNOSYS 3500

To appear in: Knowledge-Based Systems

Received date: 26 October 2015
Revised date: 28 April 2016
Accepted date: 6 May 2016

Please cite this article as: Derek F. Wong, Yi Lu, Lidia S. Chao, Bilingual Recursive Neural Network Based Data Selection for Statistical Machine Translation, Knowledge-Based Systems (2016), doi: 10.1016/j.knosys.2016.05.003


Bilingual Recursive Neural Network Based Data Selection for Statistical Machine Translation

Derek F. Wong, Yi Lu, Lidia S. Chao*

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau, Macau, China

Abstract


Data selection is a widely used and effective solution to domain adaptation in statistical machine translation (SMT). The dominant methods are perplexity-based ones, which do not consider the mutual translations of sentence pairs and tend to select short sentences. In this paper, to address these problems, we propose bilingual semi-supervised recursive neural network data selection methods to differentiate domain-relevant data from out-domain data. The proposed methods are evaluated in the task of building domain-adapted SMT systems. We present extensive comparisons and show that the proposed methods outperform the state-of-the-art data selection approaches.


Keywords: Data Selection; Machine Translation; Domain Adaptation; Recursive Neural Network; Autoencoder

1. Introduction


As statistical machine translation (SMT) systems acquire their translation rules from training data, their performance relies heavily on that data. In general, the larger the training data, the better the translation system can be. This is true for general-purpose SMT systems. However, experiments (Axelrod et al., 2011; Duh et al., 2013; Wang et al., 2014) showed that smaller but more relevant training data yields better translation quality when it comes to domain-specific translation tasks. An ideal domain-specific SMT system should be trained on a well-maintained corpus drawn from the domain of interest. This leads to the scenario of using in-domain data to filter out redundant and irrelevant data, with the objective of regularizing the distributions of phrase pairs for domain translation (Yasuda et al., 2008). On the other hand, in practice, a domain-specific corpus is difficult to obtain and usually limited in size, while general-domain data is easier to harvest and construct. In this second application scenario, general-domain data can be utilized to supplement the in-domain data.


* Corresponding author. Email addresses: [email protected] (Derek F. Wong), [email protected] (Yi Lu), [email protected] (Lidia S. Chao)


The intention of this application is to broaden the content of the data so as to train a better model (Pecina et al., 2011; Lu et al., 2014).

Data selection is a complementary solution to these problems. Instead of using the large general-domain corpus, a subsample which is more relevant to the target domain is preferable for training a domain-specific system. Most data selection approaches use a model trained from a small domain-specific corpus to estimate the relevance of sentences (or sentence pairs) S_i in a general monolingual (bilingual) corpus G. The relevance is represented as a score and can be stated as follows (Wang et al., 2013):

\mathrm{Score}(S_i) \rightarrow \mathrm{Sim}(S_i, R),    (1)

where R is the abstract model representing the target domain; that is, we score the relevance by measuring the similarity, Sim(·, ·), between S_i and R. Sentences (or sentence pairs) that obtain better scores are extracted to compose the pseudo in-domain sub-corpus G_sub. Therefore, the main problem in data selection is finding an appropriate scoring function.

The current dominant approaches are perplexity-based models (Moore & Lewis, 2010; Axelrod et al., 2011). They use language models (LMs) trained on an in-domain corpus to measure the perplexity of sentences in the general-domain corpus. The sentences or sentence pairs which are assigned lower perplexity by the LMs are considered to be more domain-relevant. However, these approaches rely solely on the surface forms and word occurrences (word collocations) of sentences, which may be insufficient to represent domain-specific data without considering the linguistic properties and mutual translation features of a sentence pair. In addition, perplexity models tend to favor sentences containing fewer words, so that long but relevant sentences are filtered out.

In this work, we seek to address these issues by proposing a bilingual recursive neural network (biRNN) for the problem of data selection. The proposed model aims to learn a higher-level abstraction (vector representation) of sentence pairs by considering the syntactic and semantic information of the sentences in a bilingual context (Socher et al., 2011). The model integrates two recursive autoencoders (RAEs) in a bilingual setting, one for the source sentence and one for the target sentence. To explore different neural network architectures, in the single-layer model the representations of the source and target sentences are fed directly to a softmax layer, while in the multi-layer model we introduce a hidden layer between the representations of the source and target sentences and the output layer, leveraging the recursive merging mechanism. To evaluate the proposed approaches in the task of domain adaptation, we employ our methods as well as the previous methods to build SMT systems trained on the selected sub-corpora and examine their performance. Experimental results show that the proposed models yield better BLEU scores (Papineni et al., 2002) than perplexity-based data selection models.

The remainder of this paper is organized as follows. We first review related work in Section 2. Section 3 describes the proposed methods. Section 4 details the experimental setup and reports the end-to-end SMT evaluation results and analysis. Finally, we conclude in Section 5.


2. Related Work


Domain adaptation is an active field of research in both machine learning (ML) (Daumé III & Marcu, 2006; Duan et al., 2009) and natural language processing (NLP) (Xia et al., 2013; Cambria & White, 2014). In particular, it has attracted much attention in the field of machine translation (MT) (Koehn & Schroeder, 2007; Razmara et al., 2012; Wang et al., 2014). In the big data environment, training data for SMT has increased significantly over the past decades. The data, however, comes from a wide spectrum of texts encompassing different topics and genres. This is the reason why data selection (and cleaning) has become an essential step in building a quality domain-specific MT system.

The most commonly used selection criterion is the perplexity-based method (Lin et al., 1997; Gao et al., 2002). It uses an in-domain LM to score the general-domain text; the sentences that are assigned lower perplexity are considered to be domain data and are selected. The Moore-Lewis (ML) model (Moore & Lewis, 2010; Yasuda et al., 2008; Foster et al., 2010) is an augmented method which considers not only the perplexity with respect to the in-domain LM but also the perplexity with respect to the general-domain LM, in order to better differentiate domain-specific sentences. Axelrod et al. (2011) further extended the ML model to the bilingual setting. In general, the modified Moore-Lewis (MML) model performs better than the early information retrieval methods (Eck et al., 2004; Hildebrand et al., 2005; Falavigna & Gretter, 2012) and other perplexity-based variants, and it has been commonly used in the task of SMT domain adaptation. Formally, given a language model q, the perplexity of a string s with empirical n-gram distribution p is defined by:

2^{H(p,q)} = 2^{-\sum_{x} p(x) \log q(x)},    (2)

in which x ranges over the n-grams of s and H(p, q) is the cross-entropy between p and q. The relevance between a sentence pair (s, t) and the target domain is calculated by the bilingual cross-entropy difference (Axelrod et al., 2011):

[H_{I\text{-}src}(s) - H_{O\text{-}src}(s)] + [H_{I\text{-}tgt}(t) - H_{O\text{-}tgt}(t)],    (3)

where H_{I-*}(·) and H_{O-*}(·) are the cross-entropies of the source or target side of the sentence pair (s, t) under an in-domain language model LM_{I-*} and an out-domain language model LM_{O-*}, respectively, which are trained on in-domain and out-domain data on the source and target sides.
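To make the baseline criterion concrete, the following is a minimal sketch of how the bilingual cross-entropy difference of Eq. (3) can be computed and used to rank a corpus. The language models are abstracted here as callables that return the per-word cross-entropy of a tokenized sentence; the function names are ours and not taken from any particular toolkit.

```python
from typing import Callable, List, Sequence, Tuple

# H(sentence): per-word cross-entropy of a tokenized sentence under one LM.
CrossEntropy = Callable[[List[str]], float]

def ce_difference(src: List[str], tgt: List[str],
                  h_in_src: CrossEntropy, h_out_src: CrossEntropy,
                  h_in_tgt: CrossEntropy, h_out_tgt: CrossEntropy) -> float:
    """Bilingual cross-entropy difference (Eq. 3); lower means more in-domain."""
    return ((h_in_src(src) - h_out_src(src)) +
            (h_in_tgt(tgt) - h_out_tgt(tgt)))

def rank_corpus(pairs: Sequence[Tuple[List[str], List[str]]],
                lms: Tuple[CrossEntropy, CrossEntropy, CrossEntropy, CrossEntropy]):
    """Sort sentence pairs so that the most domain-relevant ones come first."""
    return sorted(pairs, key=lambda p: ce_difference(p[0], p[1], *lms))
```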


In line with this direction, there have been a number of studies. Duh et al. (2013) employed a more robust language model based on recurrent neural networks, replacing the traditional n-gram LM, to resolve the problem of unknown words. As stated in Bengio et al. (2006), neural language models perform well in providing smooth probability estimates for unseen but relevant contexts. To account for linguistic features, Toral et al. (2015) modeled various kinds of information, such as lemmas, part-of-speech tags, and named entities, instead of the surface form, so that sentences with these additional representations, previously missed by the naïve LM, can be selected. However, these methods do not model the bilingual data at the sentence level, which makes it difficult to capture parallel sentences whose translations are in line with the target domain. Besides, the models also suffer from a bias towards short sentences (Axelrod et al., 2011; Duh et al., 2013). In this study, the proposed models are designed to address these issues: they do not rely on the perplexity of domain data, but instead try to learn a deep representation of sentences to differentiate domain-specific contexts from general ones.

Recently, Liu et al. (2014) used the translation probabilities of bilingual phrases as additional scores in their data selection model. Their selection function is simply the sum of the perplexities (of the source and target sentences) and the bidirectional translation probabilities. The sentence pairs which receive higher scores are presumed to be domain-relevant data and are selected. However, the model favors sentence pairs that are similar to the training data; sentence pairs with unseen context are assigned lower scores. In contrast, our model based on RNNs learns an abstract representation of sentences, which is able to better smooth the probabilities of new contexts.

Although perplexity-based and neural network based data selection methods are the focus of this study, it is worth mentioning other approaches that have been proposed for SMT domain adaptation. The first attempt at domain adaptation via training data selection is the work of Eck et al. (2004), who adapted methods from the information retrieval (IR) realm, TF-IDF and the cosine similarity measure, to select sentences for adapting LMs in SMT systems. Similar techniques were later followed by Hildebrand et al. (2005) and Falavigna & Gretter (2012). The standard IR approach considers bags of words. An alternative selection criterion is the edit-distance based similarity measure (Levenshtein, 1966); it was recently applied to domain adaptation, and the empirical results showed that the edit-distance criterion works well when the general corpus contains sentences that are very close to the in-domain data (Wang et al., 2013).

3. Bilingual Recursive Neural Network Based Data Selection Model

In this section, the proposed data selection model is introduced. It is based on a bilingual recursive neural network, namely the recursive autoencoder (RAE) (Socher et al., 2011), which predicts domain-specific context by exploiting syntactic and semantic information from a neural language modeling point of view (Bengio et al., 2006). An RAE is a kind of recursive neural network (RNN) that aims to find vector representations for variably sized phrases or sentences and, to some degree, is able to capture the linguistic meaning of sentences. The framework for inducing vector space representations of sentences is described first, followed by an introduction of the bilingual setting of the network structure, the objective function, and parameter inference.


3.1. Recursive Autoencoders

To generate phrase or sentence embeddings by composition, each word is represented as a real-valued vector (Bengio et al., 2003; Collobert et al., 2011), which serves as the basic input to the neural network. These vectors are stacked into a word-embedding matrix L ∈ R^{n×|V|}, where |V| is the size of the vocabulary. This word-embedding matrix is a parameter to be learned and subsequently modified to capture the domain information. Given a sentence as an ordered list of m words, each word w_i has an associated vocabulary index k, and retrieving the word-embedding vector from matrix L can be seen as a projection layer:

x_i = L \cdot r_k \in \mathbb{R}^n,    (4)

where r_k is a binary vector which is zero in all positions except at the k-th index. The word vectors can be either pretrained using an unsupervised neural language model (Bengio et al., 2006; Mikolov et al., 2013; Pennington et al., 2014) or randomly initialized. For simplicity, each word vector x ∈ R^n is initialized by sampling it from a zero-mean Gaussian distribution, x ∼ N(0, σ²). The vector dimension n is usually set empirically; in this study, n = 50 is used in all experiments.¹ It is worth noting that, for bilingual sentences, the vectors for source and target words are retrieved from two separate embedding matrices.

¹ In this study, we empirically tried three settings, n = 30, 50, 100, in our experiments. We observed that translation performance does not consistently improve as the embedding size increases; the most significant improvement is achieved when using n = 50.

3.2. Representations for Sentences

The application of an RAE to a binary tree is illustrated in Figure 1, in which the red and gray (call-out) nodes indicate a parent node and its corresponding reconstructed nodes. The autoencoder aims at abstracting a representation of its children. Given a sentence and its corresponding tree, the phrase representations are computed bottom-up by multiplying a parameter matrix W ∈ R^{n×2n} with the concatenation of the two children [p_1; p_2] ∈ R^{2n×1}. With a bias term b ∈ R^n, an element-wise activation function such as tanh(·) is applied to the resulting vector as follows:

p = \tanh(W[p_1; p_2] + b),    (5)

where W and b are the learned parameters, and p_1 and p_2 are the representations attached to the left and right child nodes of p. The same process is applied recursively until the root of the tree is reached, which represents the embedding of the whole sentence. The learning of W and b is guided by assessing how well the phrase vector represents its children: a parent node is assumed to represent its children well if they can be recovered from it. As illustrated in Figure 1, the children are reconstructed by:

[p'_1; p'_2] = \tanh(W'p + b'),    (6)

where p'_1 and p'_2 are the reconstructed representations, and W' and b' are reconstruction parameters which are also learned during training. In order to obtain the optimal abstract representation of the inputs, the autoencoder minimizes the reconstruction error between the original inputs and their reconstructions, measured by the Euclidean distance:

E_{rec}([p_1; p_2]; \theta) = \frac{1}{2}\,\|[p_1; p_2] - [p'_1; p'_2]\|^2,    (7)

A standard autoencoder relies on a binary tree for the induction of the representation. Given a sentence s, there are many possible binary trees from which to obtain the vector space representation. In an unsupervised manner, a greedy algorithm (Socher et al., 2011) that minimizes the reconstruction error E_{rec}(·) of Eq. (7) is used to determine the best binary tree structure for an input sentence. The algorithm starts by computing the reconstruction error for each of the n−1 consecutive word-vector pairs and replaces the pair that gives the smallest error with the phrase representation induced by Eq. (5). The algorithm then repeats the evaluation on the remaining n−2 ordered vectors (including the new phrase vector) until only one vector remains. Under this unsupervised phrase embedding, the objective is to minimize the sum of reconstruction errors at all non-terminal nodes of the optimal binary tree:

RAE_\theta(s) = \arg\min_{A(s)} \sum_{p} E_{rec}([p_1; p_2]_p),    (8)

where A(s) denotes the set of all possible binary trees that can be built from s. According to the optimal binary tree, the parameters θ = (W, b, W', b') are optimized over all the phrases in the training data.
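To make the construction in Eqs. (5)–(8) concrete, the following is a minimal NumPy sketch of the greedy tree-building procedure for a single sentence. The parameter shapes follow the text (W: n×2n, W': 2n×n); the function names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def compose(p1, p2, W, b):
    """Eq. (5): parent representation from two children."""
    return np.tanh(W @ np.concatenate([p1, p2]) + b)

def reconstruction_error(p1, p2, W, b, W_rec, b_rec):
    """Eqs. (6)-(7): rebuild the children from the parent and measure the loss."""
    parent = compose(p1, p2, W, b)
    rec = np.tanh(W_rec @ parent + b_rec)          # [p1'; p2']
    diff = np.concatenate([p1, p2]) - rec
    return 0.5 * float(diff @ diff), parent

def greedy_rae_tree(word_vecs, W, b, W_rec, b_rec):
    """Greedy tree construction (Socher et al., 2011): repeatedly merge the adjacent
    pair with the lowest reconstruction error until one vector (the sentence) is left."""
    nodes = list(word_vecs)
    total_error = 0.0
    while len(nodes) > 1:
        errors = [reconstruction_error(nodes[i], nodes[i + 1], W, b, W_rec, b_rec)
                  for i in range(len(nodes) - 1)]
        best = min(range(len(errors)), key=lambda i: errors[i][0])
        err, parent = errors[best]
        total_error += err
        nodes[best:best + 2] = [parent]            # replace the pair by its parent
        # (the tree topology could also be recorded here for backpropagation)
    return nodes[0], total_error                   # sentence embedding and Eq. (8) objective
```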

3.3. Domain Prediction

The above RAE induces general vector representations that capture the semantics of sentences. The vector representation attached to each node can be seen as features describing that phrase or sentence. However, these features alone are not suitable for the task of data selection. In order to guide the training towards the domain distribution, this feature representation is leveraged by adding a softmax layer on top of each parent node:

d(p; \theta) = \mathrm{softmax}(W^{label} p + b^{label}),    (9)

where W^{label} ∈ R^{1×2n} is a parameter matrix and b^{label} ∈ R is a bias term. Given a pair of a sentence and a domain label (s, l), and assuming that there are two domains, l ∈ {in, out}, d_{in,out} ∈ R² is a two-dimensional multinomial label distribution with d_{in} + d_{out} = 1. Equation (9) is applied to each node of the optimal binary tree. The softmax layer's outputs can be interpreted as conditional probabilities d_k = p(l_k | [p_1; p_2]), and the cross-entropy error is:

E_{ce}(p, l; \theta) = -\sum_{l_k \in \{d_{in}, d_{out}\}} l_k \log d_k(p; \theta).    (10)

The error at each non-terminal node is the weighted sum of the reconstruction and cross-entropy errors:

E([p_1; p_2]_p, p, l; \theta) = \alpha E_{rec}([p_1; p_2]_p; \theta) + (1 - \alpha) E_{ce}(p, l; \theta).    (11)

Finally, the error of each sentence (or sentence pair) in the training corpus is the sum of the errors over all nodes of the tree constructed by the greedy algorithm:

E(s, l; \theta) = \sum_{p \in RAE_\theta(s)} E([p_1; p_2]_p, p, l; \theta).    (12)


Figure 1: Architecture of bilingual RAE model.

The hyper-parameter α weights the reconstruction and cross-entropy errors, controlling the preference of the learned model. For domain prediction in this study, a value of α = 0.2, chosen empirically, was used to focus more on the cross-entropy error.² In practice, this softmax layer is used for scoring an unseen sentence; it measures the relevance of an input sentence s to the target domain (Eq. 9) and is defined as:

f(s) = d_{in}(p; \theta),    (13)

where p = encode(s) is the vector representation attached to the root node of s.

² In our experiments, values of α from 0.1 to 0.9 in steps of 0.1 were tested. The results showed that the best discriminative result is obtained at α = 0.2.
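A rough sketch of the per-node error in Eqs. (9)–(11) and the sentence-level score of Eq. (13) is given below. For a runnable two-class softmax we assume a label matrix with one row per domain class applied to the node's n-dimensional representation, which differs superficially from the 1×2n notation in the text; this is an illustrative simplification, not the authors' exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def node_error(p1, p2, label_onehot, W, b, W_rec, b_rec, W_label, b_label, alpha=0.2):
    """Weighted per-node error of Eq. (11): reconstruction (Eq. 7) plus
    domain cross-entropy (Eqs. 9-10) computed at the parent node."""
    children = np.concatenate([p1, p2])
    parent = np.tanh(W @ children + b)                         # Eq. (5)
    rec = np.tanh(W_rec @ parent + b_rec)                      # Eq. (6)
    e_rec = 0.5 * float((children - rec) @ (children - rec))   # Eq. (7)
    d = softmax(W_label @ parent + b_label)                    # Eq. (9), [d_in, d_out]
    e_ce = -float(label_onehot @ np.log(d))                    # Eq. (10)
    return alpha * e_rec + (1.0 - alpha) * e_ce                # Eq. (11)

def score_sentence(root_vec, W_label, b_label):
    """Eq. (13): probability that an unseen sentence (its root representation) is in-domain."""
    return float(softmax(W_label @ root_vec + b_label)[0])
```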

3.4. Bilingual RAE: Model I


Once the vectors for sentences have been generated, it is straightforward to introduce a bilingual recursive neural network (biRNN) based data selection model. As shown in Figure 1, the network consists of an input layer and a softmax layer. The input layer is composed of two autoencoders, one for the source sentence and one for the target sentence. In order to use the autoencoders for bilingual sentences, the representations of the internal nodes of the binary trees should have the same dimensionality as the words throughout the network. As the word embeddings for the two languages are learned individually and lie in different vector spaces, our model does not impose a close interaction at the phrase level as in Zhang et al. (2014). Instead, a softmax layer is added on top of the source and target representations, which can be seen as features describing the pair of sentences. Therefore, for a given pair of sentences (s, t), it becomes possible to determine whether the sentences belong to a domain-specific context or not. The data selection task is now modeled by the neural network with a scoring function:

g(s, t) = d(p_s, p_t; \theta),    (14)

where p_s and p_t are the vector representations of the source and target sentences, and θ are the RAE parameters learned from the training data.

Figure 2: Architecture of multi-layer bilingual RAE model.

3.5. Bilingual RAE: Model II

The previous model couples the representations of the source and target sentences in a rather loose way. We consider it a single-layer interaction between the source and target sentences, although the learning of the model is carried out as a whole through the network. Given this architecture, we want to investigate how well the model performs if we strengthen the interaction of the source and target representations through a multi-layer network. To this end, we introduce a hidden layer between the representations induced by the individual autoencoders and the output layer. The architecture of the model is illustrated in Figure 2. The bilingual representation and the distribution over in-domain sentences are now defined as follows:

h(s, t) = d(p'_r; \theta),    (15)
p'_r = \tanh(W^h [p_s; p_t] + b^h),    (16)

where W^h ∈ R^{n×2n} is a parameter matrix and b^h ∈ R^n is a bias term. So far, we have described three RAE-based scoring functions: f, g and h, which will be evaluated in the experiments.
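As an illustration of how the two bilingual scoring functions differ, the following minimal sketch computes g (Eq. 14) and h (Eqs. 15–16) from the root vectors p_s and p_t produced by the two RAEs. As above, we assume a label matrix with one row per domain class so that the two-class softmax is runnable; this is an assumption of the sketch, not the paper's exact parameterization.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def score_model_I(p_s, p_t, W_label, b_label):
    """Model I (Eq. 14): softmax over the concatenated source/target sentence
    vectors; returns the probability of the in-domain class."""
    return float(softmax(W_label @ np.concatenate([p_s, p_t]) + b_label)[0])

def score_model_II(p_s, p_t, W_h, b_h, W_label, b_label):
    """Model II (Eqs. 15-16): a hidden layer first merges the two sentence vectors."""
    p_r = np.tanh(W_h @ np.concatenate([p_s, p_t]) + b_h)
    return float(softmax(W_label @ p_r + b_label)[0])
```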


3.6. Learning

In constructing the models, there are several sets of parameters to be trained in the proposed RAEs:

• θ_L: the word-embedding matrices L for the source and target languages, as described in Section 3.1;


• θ_rec: the RAE parameter matrices W and W' and bias terms b and b' for both the source and target languages, as described in Section 3.2, together with the domain prediction parameter matrix W^label ³ and the bias term b^label of Section 3.3. Note that the same set of parameters is used in the two models, Model I (Section 3.4) and Model II (Section 3.5), respectively;


• θ_sel: the neural network based selection model parameter matrix W^h and bias term b^h for Model II, as described in Section 3.5.

All these parameters are learned from the (bilingual) training data. Under the bilingual setting, the models jointly learn two RAEs for a sentence pair (s, t), one for the source language and the other for the target language. The reconstruction error for the bilingual model is now rewritten as:

E_{rec}(s, t; \theta) = E_{rec}(s; \theta) + E_{rec}(t; \theta),    (17)

and the cross-entropy error is:

E_{ce}(s, t, l; \theta) = E_{ce}(s, l; \theta) + E_{ce}(t, l; \theta),    (18)

where l ∈ D = {in, out}. Hence, for a pair of sentences and domain label (s, t, l), the joint error and the final objective function over the training data (S, T, D) are:

E(s, t, l; \theta) = \alpha E_{rec}(s, t; \theta) + (1 - \alpha) E_{ce}(s, t, l; \theta),    (19)

J = \frac{1}{N} \sum_{(s,t,l)} E(s, t, l; \theta) + \frac{\lambda}{2} \|\theta\|^2.    (20)

Then, the gradient is computed as follows:

\frac{\partial J}{\partial \theta} = \frac{1}{N} \sum_{(s,t,l)} \frac{\partial E(s, t, l; \theta)}{\partial \theta} + \lambda\theta.    (21)


To compute the gradient, all trees are constructed at the beginning of each iteration and the derivatives for the trees are then computed via backpropagation through structure (Goller & Kuchler, 1996). After that, the objective function can be optimized by applying a generic quasi-Newton gradient-based optimizer, L-BFGS-B (Zhu et al., 1997). For initialization, all parameters, including the word-embedding matrices, are assigned by sampling from a zero-mean Gaussian distribution. The training process is presented in pseudo-code in Algorithm 1.

³ It is worth mentioning that the dimensionality of W^label is R^{1×2n} for Model I, while for Model II the dimensionality of W^label is R^{1×n}.

Algorithm 1: Training of biRAE parameters
  Data: sentences (s_1, t_1), ..., (s_N, t_N) with corresponding labels l_1, ..., l_N
  Result: θ = (θ_L, θ_rec, θ_sel)
  Initialize W ∈ R^{n×2n}, W' ∈ R^{2n×n}, W^label ∈ R^{1×2n}, W^h ∈ R^{n×2n} randomly;
  Initialize L ∈ R^{n×|V|} from x ∼ N(0, σ²);
  Initialize b ∈ R^n, b' ∈ R^{2n}, b^label ∈ R, b^h ∈ R^n ← 0;
  while not converged and epoch < maxEpoch do
      ∇J ← 0;
      for i ← 1 to N do
          Initialize ∇J', J' ← 0;
          T_{s_i,t_i} ← compute RAE trees for sentences (s_i, t_i) as in Section 3.2;
          J' ← total error incurred by input (s_i, t_i);
          ∇J' ← ∂J'/∂θ computed by Eq. (21);
          ∇J ← ∇J + ∇J';
      end
      ∇J ← (1/N) ∇J + λθ;
      Update θ using L-BFGS with ∇J;
      epoch ← epoch + 1;
  end
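For illustration, the optimization loop of Algorithm 1 can be sketched on a single machine with SciPy's L-BFGS-B implementation. The paper's actual implementation runs on Apache Spark; the per-pair forward pass and backpropagation through structure are abstracted here as a user-supplied callable, so this is a minimal sketch under those assumptions rather than the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def make_objective(pair_error_and_grad, data, lam):
    """Wrap Eqs. (20)-(21): pair_error_and_grad(theta, src, tgt, label) must return
    the joint error of one sentence pair (Eq. 19) and its gradient w.r.t. theta."""
    def objective(theta):
        total_err, total_grad = 0.0, np.zeros_like(theta)
        for src, tgt, label in data:
            err, grad = pair_error_and_grad(theta, src, tgt, label)
            total_err += err
            total_grad += grad
        n = len(data)
        J = total_err / n + 0.5 * lam * float(theta @ theta)   # Eq. (20)
        dJ = total_grad / n + lam * theta                       # Eq. (21)
        return J, dJ
    return objective

def train(theta0, pair_error_and_grad, data, lam=1e-4, max_epochs=200):
    """Optimize the flattened parameter vector with L-BFGS-B, as in Algorithm 1."""
    obj = make_objective(pair_error_and_grad, data, lam)
    result = minimize(obj, theta0, jac=True, method="L-BFGS-B",
                      options={"maxiter": max_epochs})
    return result.x
```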

4. Experiments

4.1. Corpora

The task of data selection for SMT domain adaptation involves two corpora, namely an in-domain and a general-domain corpus. The in-domain corpus, dev set, and test set are the Chinese-English TED Talk data of IWSLT2014 (International Workshop on Spoken Language Translation), IWSLT dev2010, and IWSLT tst2010 (Cettolo et al., 2012). In this study, the UM-Corpus (Tian et al., 2014) was used as the general-domain data; it consists of 14 million sentences categorized into eight different domains, covering various topics and genres. The size of each domain in this corpus is listed in Table 1. All the data from these domains was mixed to form a large general-domain corpus. In the preprocessing, English texts were tokenized and truecased using the Moses preprocessing scripts,⁴ and Chinese texts were segmented using the open source toolkit AnsjSeg.⁵ Sentences of more than 80 words in length were discarded. For training the data selection models, the in-domain data was used as the positive instances, and the negative instances were randomly selected from the general-domain corpus.


⁴ http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining
⁵ https://github.com/NLPchina/ansj_seg


Both the positive and negative sets contain the same number of sentences. The English monolingual data from WMT14 was used to train the language model for all evaluated SMT systems. The statistical information of the data used after preprocessing is summarized in Table 2.

Domain (Topic)   Sentences   |   Domain (Genre)   Sentences
Education        1.9M        |   Spoken           1.2M
News             3.7M        |   Thesis           0.9M
Laws             8.6M        |   Patent           0.3M
Science          0.1M        |   Micro-blog       11K


Table 1: Statistical information of domain data in UM-Corpus.

Data set           Language   Sentence   Token             Vocabulary     Avg. Length
In-Domain          En/Zh      177.5K     3.5M/3.3M         58.5K/54.9K    19.9/18.7
Out-Domain         En/Zh      177.5K     4.2M/4.0M         109.9K/81.8K   24.0/22.5
General-Domain     En/Zh      14M        335.3M/313.9M     1.4M/0.7M      24.0/22.5
Dev Data           En/Zh      887        20.1K/21.3K       3.3K/3.89K     22.7/24.0
Test Data          En/Zh      1,570      32.0K/33.6K       3.8K/4.4K      20.4/21.4
Monolingual Data   En         85.3M      2033.1M           4.5M           23.85

Table 2: Statistical summary of the data used.


4.2. Settings

In this work, the standard log-linear phrase-based SMT model⁶ (Koehn et al., 2007) was used to build the SMT systems. The translation and re-ordering models were trained using Fast Align⁷ (Dyer et al., 2013) with default settings and the "grow-diag-final-and" symmetrization method to obtain the word alignments. A 5-gram language model was trained using the SRILM toolkit⁸ (Stolcke, 2002), smoothed with modified Kneser-Ney discounting (Chen & Goodman, 1996) and quantizing both probabilities and back-off weights. Minimum Error Rate Training (MERT) (Och, 2003) was used for tuning the parameters. The translation performance of all SMT systems was evaluated with the case-sensitive BLEU metric (Papineni et al., 2002).

We built one baseline SMT system trained on the in-domain data (Model_in) and another trained on the general-domain data (Model_gen). The data sizes (in sentences) used to train the systems and their translation performances are presented in Table 3. Model_gen outperforms Model_in dramatically, since its training data is about 79 times larger than that of the system trained on in-domain data. To provide a thorough analysis, we carried out end-to-end translation experiments to evaluate our proposed models against two dominant data selection methods:


⁶ http://www.statmt.org/moses/
⁷ https://github.com/clab/fast_align/
⁸ http://www.speech.sri.com/projects/srilm/


Systems     Train    Dev   Test    BLEU (Dev)   BLEU (Test)
Model_in    177.5K   887   1,570   10.95        12.58
Model_gen   14M      887   1,570   61.39        53.78


Table 3: Translation performances of in- and general-domain baseline systems trained on different sizes of in- and general-domain data respectively.

• MML-5gram: Modified Moore-Lewis model based on bilingual cross-entropy, where 5-gram LMs were trained using SRILM with modified Kneser-Ney smoothing (Axelrod et al., 2011).


• MML-rnnlm: Modified Moore-Lewis model using Recurrent Neural LMs. In this model, conventional n-gram models were replaced by the neural-based models, which were trained by the RNN-LM Toolkit (Duh et al., 2013).


• biRAE-I: The proposed bilingual RAE model based on the single-layer architecture as described in Section 3.4. • biRAE-II: The proposed bilingual RAE model based on the multi-layer architecture as described in Section 3.5. All RAE-based models were implemented9 using the Apache Spark framework,10 which is a fast and general engine for big data processing, and MLlib,11 in order to take advantage of a high-performance computing cluster with sixty 3.3-GHz Xeon E5-2670 cores (120 threads). It took about 10 hours to run 200 iterations for training biRAE-I and 11 hours to train biRAE-II models. We adopted the four data selection models for extracting domain relevant parallel data from the general corpus. For each selection model, the top N % = {1, 5, 10, 20, 30, 40, 50, 60, 70} (roughly 140K, 698K, 1.4M, 2.8M, 4.2M, 5.6M, 7M, 8.4M, 9M) of scored sentence pairs were selected to form pseudo in-domain sub-corpora. Moreover, various SMT systems were trained on these sub-corpora respectively; that is, for each (selection model, N ) pair, an SMT system was built. The same language model that had been trained on monolingual English text was used in all systems, so that the effectiveness of different data selection models could be evaluated fairly from the translation results.

CE

320

PT

315

ED

M

310

4.3. Data Selection Evaluation Before conducting the full extrinsic SMT evaluation, we wanted to observe and compare the effectiveness of various data selection methods in a way that did not rely on external application. That is usually very time consuming, such as the model training and parameter-tuning in SMT building. The main concern is that intrinsic evaluation allows us to focus on the quality of selection. The accuracy of selection of

AC

325

9 The

source code will be hosted on Github.

10 https://spark.apache.org.

11 https://spark.apache.org/mllib.

12

ACCEPTED MANUSCRIPT

340

CR IP T

AN US

335

in-domain sentence pairs can be treated as an indicator of the effectiveness of the proposed models, while end-to-end translation evaluation is considered an alternative way to evaluate the model performance indirectly. In the evaluation, the data selection models were used to score and retrieve the domain-specific sentences from the mixture of in- and out-domain test data. The data were composed of in-domain sentences of the test set (1,570 sentences) and the outdomain 1,570 sentences that were selected from the general corpus, making 3,140 sentences in total. All models were trained using the training data as presented in Table 2, with the settings as described in Section 4.2. We then used the models to score the sentences from the mixed data and sorted the sentences according to their scores. The top 1,570 sentences were selected as in-domain data as predicted by each model. The selection accuracy is evaluated by: |pred ∩ testin | , acc = |testin |

where pred is the set of sentence pairs predicted to be in-domain by the data selection models and testin is the in-domain test set. Among the models, biRAE-I achieved the best accuracy. It outperformed MML-5gram significantly, with an improvement of 2.83%. Both the proposed RAE-based models outperformed the perplexity-based models. Comparison results are shown in Table 4. Model Acc% Delta

MML-5gram 84.1 -

MML-rnnlm 84.24 0.14

M

330

biRAE-II 85.65 1.55

biRAE-I 86.93 2.83

ED

Table 4: The selection accuracy of different models. 4.4. End-to-End Translations

PT

CE

345

Table 5 shows the performances of the constructed SMT systems obtained by different selection methods. The overall results indicate that the two proposed models are more effective in selecting domain-relevant sentence pairs compared to the two baseline approaches and produce a sustained gain in BLEU scores in all N points. biRAE-II performs slightly better than biRAE-I even though it takes more parameters and time to train. We therefore believe that a method which models a deep interaction between bilingual contexts is much more effective than one which models a shallow interaction. The performance peaked when 40% of the general data was selected for training the SMT models. Both biRAE-II and biRAE-I achieved the same BLEU scores of 54.96, outperforming Modelgen by 1.18 BLEU points and MML-5gram (trained with 50% of the data) by 0.59 BLEU points. By using N = 40% of the data, MML-5gram achieved the same BLEU score as the Modelgen that was trained on the whole data set. This once again confirms the finding that compact but more relevant data yields comparable translation quality (Moore & Lewis, 2010). To obtain the confidence interval of the scores, we use bootstrap resampling described by Koehn (2004) to test our results against the testing methods. Both the proposed biRAE-II and

AC

350

355

13

ACCEPTED MANUSCRIPT

biRAE-I show significantly better improvement over the MML-5gram (and MMLrnnlm) results with a confidence of p < 0.05, where 1000 samples were used during the bootstrap resampling. MML-5gram 11.25 21.13 33.77 46.94 52.43 53.78 54.37 54.15 53.95

MML-rnnlm 10.71 19.91 33.91 45.79 50.09 51.74 51.74 53.11 53.94 53.78

biRAE-II 18.46 33.92 43.74 52.31 54.59 54.96 54.95 54.79 54.27

biRAE-I 18.24 32.51 42.85 52.72 54.58 54.96 54.82 54.38 53.96

CR IP T

N% 1 5 10 20 30 40 50 60 70 100

AN US

360

Table 5: BLEU scores of the SMT systems trained on subsets of the general-domain parallel corpus. The best results are bold-faced. The score of Modelgen (N = 100%) is also listed at the bottom for comparison. 4.5. Analysis

AC

M

CE

PT

370

ED

365

To further analyze the results, we took a closer look at the data selected by the selection models. The out-of-vocabulary (OOV) rate was first considered by counting the OOV words (types and tokens) when decoding the test data with each model. The results indicated that incorporating more data into the training corpus can lead to better word coverage and statistical estimation. Table 6 presents the OOV rate of different SMT models. The OOV rates of all models were found to be very close. An increase in N , the OOV rate decreases and the translation quality improves. This indicates that unknown word problem has a large impact on translation performance. Thus, the factor of out-of-vocabulary or word coverage should be taken into account in designing a data selection method. Systems MML-5gram

OOV%

MML-rnnlm

types tokens

biRAE-II biRAE-I

1 47.6 17.6 45.9 16.7 47.0 17.9 46.6 17.8

5 42.7 15.3 41.5 14.8 42.2 15.1 42.7 15.6

10 40.3 14.3 39.9 14.3 40.5 14.2 41.0 14.7

N% 20 38.1 13.5 38.3 13.6 38.8 13.7 38.6 13.7

30 37.2 13.2 37.5 13.4 37.6 13.3 37.6 13.4

40 36.6 13.1 37.2 13.3 37.1 13.2 37.0 13.2

Table 6: OOV rate of models trained on different sub-corpora.

14

50 36.2 13.0 36.8 13.2 36.7 13.1 36.7 13.1

ACCEPTED MANUSCRIPT

Systems MML-5gram Avg. Length En Zh

MML-rnnlm biRAE-II biRAE-I

1 10.0 9.6 11.0 10.7 21.9 21.2 22.1 21.2

N% 20 14.6 14.4 14.8 14.4 14.0 13.7 14.1 13.7

CR IP T

385

AN US

380

5 11.8 11.4 14.1 13.9 16.9 16.3 16.5 15.9

M

375

Furthermore, the average length of sentences in each of the selected corpora was examined to determine the impact by different selection models on the overall translation performance. Table 7 summarizes the length statistics of the corpora. It was found that both of the MML models (MML-5gram and MML-rnnlm) prefer shorter sentences in selection due to the property of LMs, where a lower perplexity is usually assigned to a shorter sentence than a longer one. This might be the cause for the poor translation quality, that was particularly obvious when a smaller training data was used (N = 1%, 5%, 10%). For example, biRAE-II can achieve the best improvement over MML-5gram (12.79 BLEU points) and MML-rnnlm (14.01 BLEU points) at N = 5%. In contrast, our models tend to select the sentences which are more context oriented. As the size of the data increases, the length difference diminishes. In all cases, the average length of each selected data is less than that of the general-domain corpus that were used in this study. 10 13.0 12.7 14.9 14.6 15.3 14.8 14.7 14.2

30 16.1 15.8 16.6 16.0 14.3 14.0 14.3 13.9

40 17.9 17.5 18.9 18.1 15.4 15.0 15.6 15.1

50 19.6 19.0 20.9 19.9 16.8 16.2 17.2 16.5

ED

Table 7: Average sentence length of sub-corpora.

PT

AC

CE

390

Table 8 presents the top 10 sentence pairs scored by the MML-5gram model. The sentences consists of generally shorter phrases, most of which are the names of people, place, and terminologies. We speculated that this makes it difficult to determine the genre of the selected text due to the insufficient context. On the other hand, the sentences selected by our biRAE-II model are more rational for the purposes of acquiring domain-relevant data. The top 10 ranked sentences for the spoken domain are listed in Table 9. It is obvious to see that the top ranked sentences are primarily the genre of spoken text, which makes sense given the content of the IWSLT spoken corpus.

15

2 3 4 5 6 7 8 9

M

10

WebSphere消息队列 WebSphere MQ 丁烷 Butane 马迪哈利勒马迪 Madi Khalil Madi 马蒙拉希德 Hashayka Mamoun Rasheed Hashayka 伊戈尔·罗戈夫 Igor Rogov 哈萨姆·哈拉德 Hussam Jaradat 阿德南·塔勒布 Adnan Taleb 罗萨里奥·马纳洛 Rosario Manalo 里亚德·巴德兰 Riad Badran 欧文·沃勒 Irvin Waller

AN US

1

CR IP T

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

Table 8: The top 10 scored sentences selected by MML-5gram.

16

ACCEPTED MANUSCRIPT

5

6

CE

PT

7

CR IP T

4

AN US

3

M

2

实际上,我们想要的不是针对犯罪的法律,而是针对疯狂的法律。 Really, what we want now is not laws against crime but a law against insanity. 并不是只有日本才这样,而是在一些国家的自然食物链都这样在加拿大北部,在美 国还有欧洲北部,海豹和鲸鱼的自然食物链导致了PCD 分子的富集从世界上的各个 地方聚集到妇女的身上。 It isn’t just there that this happens, but in the natural diet of some communities in the Canadian arctic, in the United States, and in the European arctic, where a natural diet of seals and whales leads to an accumulation of PCBs that have gathered from all parts of the world and ended up in these women. 我要说,在这里的每一个人都直接或间接地受到艾滋病毒/艾滋病的影响;我们这里 的所有人都无一例外。 Everybody here, I would say, is directly or indirectly affected by HIV/AIDS, all of us here. 我们所有人都同意,艾滋病毒/艾滋病目前已不仅仅是一个公共保健问题。 We have all come to terms with the fact that HIV / AIDS is now more than just a public health issue. 在培养优等生方面,美国落后于亚洲和北欧的一些国家。在这样的背景下,我认为 我们可以对美国的k - 12(幼儿园至12年级)教育制度做出最关键的改变是建立一个 有妥善资金来源、高质量而且受到老师们信任的教师反馈体系。 I think the most critical change we can make in U.S. K-12 education, with America lagging behind countries in Asia and Northern Europe when it comes to turning out top students, is to create teacher-feedback systems that are properly funded, high quality, and trusted by teachers. 我很乐意跟你们谈谈这个东西,但是我又没什么可多说的,因为─(Chris Anderson :我有感冒。)然而,“TED” 中的 “D” 代表的是设计(design) And I’d love to talk to you about this, but I don’t have much in the way of... things to say because – (Chris Anderson: I’ve got a cold.) However, the “D” in “TED” of course stands for design. 我们希望玩家可以经历不同的活动。我希望这个游戏能变成《地球停转之日》、 《2001太空漫游》《星际迷航》、《世界之战》。 But we basically want a diversity of activities the players can play through this; you know. basically, I want to be able to play, you know, “The Day the Earth Stood Still,” “2001 Space Odyssey,” “Star Trek,” ”War Of the Worlds.” 这是一个脑成像图,在大脑的表层上,我们用一个很精巧的实验,在每个区域都重 建了大脑的反应,这是对神经元反应所做的非常详细的测绘。 Now this is a map, down on the surface of the brain, in which, in a very elaborate experiment we’ve reconstructed the responses location by location, in a highly detailed response mapping the responses of its neurons. 当我们在等待核能发电的时间里,我们不得不维持常规电网的运行。不论在美国还 是世界的任何一个角落,这意味着煤炭的使用。 While you’re waiting around for your nuclear, you have to run the regular electric power grid, which is mostly coal in the United States and around the world. 各位成员知道,现在的情况是我们来到这里,我们之中许多人- -这包括我自己- -重复 我们去年关于各个项目的发言,没有什么新内容。 As members know, what happens now is that we come here and many of us, and I include myself in this, rehearse our speeches of last year on the various items and there is very little 17 new we can put in.

ED

1

AC

8

9

10

Table 9: The top 10 scored sentences selected by biRAE-II.

ACCEPTED MANUSCRIPT

MML-5gram MML-rnnlm biRAE-II

MML-5gram -

MML-rnnlm 76.96% -

CR IP T

400

Finally, the degree of uniqueness of sentences brought by each model was examined by measuring the percentage overlap between pairs of models. These allow us to assess how different the models are from each other. Table 10 shows the percentage sentence overlap between pairs of individual models when N = 40% of data were selected. The overlaps between the models are above 76% for MML5gram/MML-rnnlm and MML-rnnlm/biRAE-II, 77% for MML-rnnlm/biRAE-II, 78% for MML-5gram/biRAE-I and MML-5gram/biRAE-I, and 93% for biRAEI/biRAE-II. This shows the high percentage of unique data among the models based on perplexity and those based on recursive neural networks, while the data selected by our models is consistent in nature. biRAE-II 78.37% 77.03% -

AN US

395

biRAE-I 78.10% 76.61% 93.02%

Table 10: Selection overlaps between pairs of models.

5. Conclusion

We have presented a novel data selection model for domain adaptation in statistical machine translation. In particular, we used recursive neural networks to learn the vector space representations for bilingual sentences, that make it possible to predict in-domain sentences by exploiting syntactic and semantic information from the perspective of neural language modeling. Furthermore our models exhibit a strong adaptability for dealing with unseen but relevant contexts. Extensive comparisons and experiments were carried out on Chinese-English languages. Compared to conventional methods, we observed intrinsic data selection improvements ranging from 0.14 to 2.83%. For end-to-end translation, we found significant improvements of 1.18 to 3.22 BLEU points. We analyzed the data brought by different selection methods and observed that the data offered by our models is more relevant to the in-domain context.

415

PT

410

ED

M

405

CE

Acknowledgements

AC

The authors are grateful to the Science and Technology Development Fund, Macao S.A.R. (FDCT), and the Research Committee of the University of Macau for funding support for our research under grant numbers 057/2014/A, MYRG070(Y1-L2)-FST12CS, MYRG2015-00175-FST, and MYRG2015-00188-FST.

420

References Axelrod, A., He, X., & Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK (pp. 355–362).

18


425

Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155.

430

435

CR IP T

Bengio, Y., Schwenk, H., Sen´ecal, J.-S., Morin, F., & Gauvain, J.-L. (2006). Neural probabilistic language models. In Innovations in Machine Learning (pp. 137–186). Springer.

Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language processing research [review article]. IEEE Computational Intelligence Magazine, 9, 48–57. Cettolo, M., Girardi, C., & Federico, M. (2012). Wit3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), Trento, Italy (pp. 261–268).

440

AN US

Chen, S. F., & Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, California, USA (pp. 310–318).

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12, 2493–2537.

ED

450

Duan, L., Tsang, I. W., Xu, D., & Chua, T.-S. (2009). Domain adaptation from multiple sources via auxiliary classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Quebec, Canada (pp. 289–296). Duh, K., Neubig, G., Sudoh, K., & Tsukada, H. (2013). Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of the 51st Annual Meeting on Association for Computational Linguistics, Sofia, Bulgaria (pp. 678–683).

PT

445

M

Daum´e III, H., & Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26, 101–126.

CE

Dyer, C., Chahuneau, V., & Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Atlanta, USA (pp. 644–648). Eck, M., Vogel, S., & Waibel, A. (2004). Language model adaptation for statistical machine translation based on information retrieval. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, Lisbon, Portugal (pp. 327–330).

AC

455

460

Falavigna, D., & Gretter, R. (2012). Focusing language models for automatic speech recognition. In 2012 International Workshop on Spoken Language Translation, Hong Kong (pp. 171–178).

19

ACCEPTED MANUSCRIPT

465

Foster, G. F., Goutte, C., & Kuhn, R. (2010). Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Massachusetts, USA (pp. 451–459).

475

Goller, C., & Kuchler, A. (1996). Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks (pp. 347–352). Hildebrand, A. S., Eck, M., Vogel, S., & Waibel, A. (2005). Adaptation of the translation model for statistical machine translation based on information retrieval. In Proceedings of the 10th Annual Conference of the European Association for Machine Translation (pp. 133–142).

AN US

470

CR IP T

Gao, J., Goodman, J., Li, M., & Lee, K.-F. (2002). Toward a unified approach to statistical language modeling for chinese. ACM Transactions on Asian Language Information Processing (TALIP), 1, 3–33.

Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004, A meeting of SIGDAT, a Special Interest Group of the ACL, held in conjunction with ACL 2004, Barcelona, Spain (pp. 388–395). Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic (pp. 177–180).

485

ED

M

480

Koehn, P., & Schroeder, J. (2007). Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic (pp. 224–227).

Lin, S., Tsai, C., Chien, L., Chen, K., & Lee, L. (1997). Chinese language model adaptation based on document classification and multiple domain-specific language models. In Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes, Greece (pp. 1463–1466).

CE

490

PT

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, 707–710.

Liu, L., Hong, Y., Liu, H., Wang, X., & Yao, J. (2014). Effective selection of translation model training data. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland (pp. 569–573).

AC 495

Lu, Y., Wang, L., Wong, D. F., Chao, L. S., & Wang, Y. (2014). Domain adaptation for medical text translation using web resources. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA (pp. 233–238).

20

ACCEPTED MANUSCRIPT

505

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, Nevada, United States. (pp. 3111–3119). Moore, R. C., & Lewis, W. (2010). Intelligent selection of language model training data. In ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden (pp. 220–224).

CR IP T

500

Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan. (pp. 160–167).

520

AN US

515

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, USA. (pp. 311–318). Pecina, P., Toral, A., Way, A., Papavassiliou, V., Prokopidis, P., & Giagkou, M. (2011). Towards using web-crawled data for domain adaptation in statistical machine translation. In 15th International Conference of the European Association for Machine Translation, Leuven, Belgium (pp. 297–304). Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, A meeting of SIGDAT, A Special Interest Group of the ACL (pp. 1532–1543).

M

510

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., & Manning, C. D. (2011). Semisupervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL (pp. 151–161).

CE

PT

525

ED

Razmara, M., Foster, G., Sankaran, B., & Sarkar, A. (2012). Mixing multiple translation models in statistical machine translation. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Jeju Island, Korea (pp. 940–949).

Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. In 7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH, Denver, Colorado, USA (pp. 901–904). Tian, L., Wong, D. F., Chao, L. S., Quaresma, P., Oliveira, F., & Yi, L. (2014). Umcorpus: A large english-chinese parallel corpus for statistical machine translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland (pp. 1837–1842).

AC

530

535

Toral, A., Pecina, P., Wang, L., & van Genabith, J. (2015). Linguistically-augmented perplexity-based data selection for language models. Computer Speech & Language, 32, 11–26. Hybrid Machine Translation: integration of linguistics and statistics. 21

ACCEPTED MANUSCRIPT

545

Wang, L., Wong, D. F., Chao, L. S., Lu, Y., & Xing, J. (2014). A systematic comparison of data selection criteria for smt domain adaptation. The Scientific World Journal, (pp. 1–10). Wang, L., Wong, D. F., Chao, L. S., Xing, J., Lu, Y., & Trancoso, I. (2013). Edit distance: A new data selection criterion for domain adaptation in SMT. In Recent Advances in Natural Language Processing, Hissar, Bulgaria (pp. 727–732).

CR IP T

540

Xia, R., Zong, C., Hu, X., & Cambria, E. (2013). Feature ensemble plus sample selection: domain adaptation for sentiment classification. IEEE Intelligent Systems, 28, 10–18.

555

Yasuda, K., Zhang, R., Yamamoto, H., & Sumita, E. (2008). Method of selecting training data to build a compact and efficient translation model. In Third International Joint Conference on Natural Language Processing, Hyderabad, India (pp. 655–660).

AN US

550

Zhang, J., Liu, S., Li, M., Zhou, M., & Zong, C. (2014). Bilingually-constrained phrase embeddings for machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, USA (pp. 111–121).

AC

CE

PT

ED

M

Zhu, C., Byrd, R. H., Lu, P., & Nocedal, J. (1997). Algorithm 778: L-BFGS-B: fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23, 550–560.
