A simplification–translation–restoration framework for domain adaptation in statistical machine translation: A case study in medical record translation


ARTICLE IN PRESS Available online at www.sciencedirect.com

ScienceDirect Computer Speech and Language ■■ (2016) ■■–■■ www.elsevier.com/locate/csl


Han-Bin Chen, Hen-Hsen Huang, An-Chang Hsieh, Hsin-Hsi Chen *


Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan


Received 23 October 2015; received in revised form 12 August 2016; accepted 15 August 2016 Available online


Abstract


Integration of in-domain knowledge into an out-of-domain statistical machine translation (SMT) system poses challenges due to the lack of resources. Lack of in-domain bilingual corpora is one such issue. In this paper, we propose a simplification–translation–restoration (STR) framework for domain adaptation in SMT systems. An SMT system that translates medical records from English to Chinese is taken as a case study. We identify the critical segments in a medical sentence and simplify them to alleviate the data sparseness problem in the out-of-domain SMT system. After the simplified sentence is translated, the translations of these critical segments are restored to their proper positions. Besides the simplification pre-processing step and the restoration post-processing step, we also enhance the translation and language models in the STR framework by using pseudo bilingual corpora generated by the background MT system. In the experiments, we adapt an SMT system from a government document domain to a medical record domain. The results show the effectiveness of the STR framework. © 2016 Elsevier Ltd. All rights reserved.


Keywords: Cross-domain SMT; Domain adaptation; Statistical machine translation; Medical document processing


1. Introduction


Statistical machine translation (SMT), in which translation and language models are trained on large-scale corpora, is the mainstream approach in MT research. Numerous approaches have been proposed in the past decades. SMT systems are useful, but they fail to capture long-range contextual knowledge because of the limited horizon and sparse nature of lexical n-grams. These drawbacks reduce translation quality, especially when limited bilingual corpora are available for estimating translation models. Cross-domain issues further complicate the task. Language usage varies significantly across domains in linguistic aspects such as lexical choice, collocations, and writing style. The term distribution in the training corpus may differ greatly from that in the target domain. An intuitive solution to the domain-specific problem is to train the model with in-domain data. However, a domain-dependent corpus is not always available. Even given an in-domain parallel corpus, the number of training instances may be insufficient. This issue is more serious in cross-language cross-domain scenarios, particularly for applications in highly specific domains like biochemistry and medical science.


* Corresponding authors at: Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. E-mail addresses: [email protected] (H.-H. Huang), [email protected] (H.-H. Chen). http://dx.doi.org/10.1016/j.csl.2016.08.003 0885-2308/© 2016 Elsevier Ltd. All rights reserved.

Please cite this article in press as: Han-Bin Chen, Hen-Hsen Huang, An-Chang Hsieh, Hsin-Hsi Chen, A simplification–translation–restoration framework for domain adaptation in statistical machine translation: A case study in medical record translation, Computer Speech and Language (2016), doi: 10.1016/j.csl.2016.08.003


Fig. 1. Translation of domain-specific text generated with Google Translate. The bold target phrases are inappropriate translations.


The data sparseness problem is accentuated when building an SMT system for a specific domain with insufficient in-domain data. Translation performance degrades when the SMT system cannot gather the statistical evidence needed for a segment containing out-of-vocabulary (OOV) words. In some cases, a bilingual dictionary containing domain-specific term pairs is available. We can then force an SMT system to translate specific terms according to the dictionary. However, translation quality may still be unsatisfactory using this naive approach because in-domain terms are rare or unseen in the background SMT model, and the context of these in-domain terms cannot be captured. Fig. 1 shows an incorrect English–Chinese translation of a sentence in a medical record provided by the online Google Translate service (footnote 1). In this example, although it correctly translates the diagnosis “crystal induced arthritis”, it mistranslates the nearby phrasal verb “suffered from”. This example shows that an SMT system may recognize domain-specific terms with either a bilingual dictionary or the background SMT model, yet still produce improper translations of the surrounding words due to insufficient contextual knowledge of these in-domain terms.

In this work, to deal with the above problems, we propose a domain-adaptation framework called Simplification–Translation–Restoration (STR). We take English-to-Chinese medical record translation as a case study. The idea behind the STR framework is to modify in-domain segments in the source text in such a way that the out-of-domain background SMT system can recognize and translate the revised segments properly with their contextual words. Consider the example in Fig. 1 again. We replace the complicated diagnosis “crystal induced arthritis” with a more general term that occurs more frequently in the training data of the background SMT system, such as “cancer”, “pneumonia”, or “hypertension”. Since these terms are more common in a general corpus, the background SMT system handles the modified text better than the original text.

The STR framework works as follows. We first identify the in-domain segments and simplify them into general expressions. The modified text is then translated by an out-of-domain SMT system. After that, we manipulate the translation result by replacing the modified parts and corresponding translations with the correct bilingual segments. This framework is suitable for cross-domain SMT scenarios where an out-of-domain SMT system and bilingual in-domain dictionaries are available. In this study, we show the effectiveness of the framework through a case study in the medical domain. In addition to pre-processing (simplification) and post-processing (restoration), we also directly adapt the kernel part (translation) by using pseudo bilingual corpora generated by the basic STR-based SMT system.

The contribution of this paper is two-fold. First, a novel simplification–translation–restoration framework for SMT is proposed. Second, a novel medical translation system is constructed.

This paper is organized as follows. Section 2 reviews the related work on SMT research and domain adaptation. Section 3 presents our Simplification–Translation–Restoration (STR) framework and introduces the linguistic data used in the experiments. Section 4 specifies the details with a case study on English–Chinese medical record translation. We describe the construction of our STR framework MT system in Section 5. Section 6 evaluates the effectiveness of our STR framework and discusses the experimental results. In Section 7 we conclude and describe future work.


2. Related work

Building an SMT system with large amounts of bilingual data is now a practical option. One problem with the SMT model, however, is data sparseness, because it relies heavily on statistical evidence from the training corpus. This can bias the learned SMT model when it deals with the ambiguities of human language, and can lead to problems in tasks such as translation disambiguation (Carpuat and Wu, 2007; Chan et al., 2007). When a word has several senses, the corpus may be biased toward a particular subset of the senses. The SMT system trained from such a corpus is therefore prone to present the wrong translation due to the wrong lexical choice.

1 http://translate.google.com.

With a cross-domain SMT application, the data sparseness problem is exacerbated by the limited in-domain bilingual corpus. Domain adaptation techniques therefore play a key role in building an in-domain SMT system under a resource-poor environment. Depending on the type of in-domain resources at hand, various domain adaptation techniques have been proposed that leverage either bilingual or monolingual in-domain resources.

Foster and Kuhn (2007) propose a mixture-model approach for cases where only small amounts of bilingual in-domain text are available. A training corpus is divided into several components to train several models. Different models are then weighted by estimating the similarity between a model and the in-domain development data. Matsoukas et al. (2009) devise sentence-level features and weight the domain relevance of each sentence in the bilingual training corpus by optimizing an objective function. Foster et al. (2010) further refine the granularity by weighting at the level of phrase pairs. A mixture-model approach is similarly applied in a word-alignment task (Civera and Juan, 2007). In their work, Civera and Juan added domain-related parameters to the standard HMM training to derive an alignment model sensitive to the topic of each sentence.

In some scenarios, a bilingual in-domain corpus is unavailable although in-domain monolingual text (either on the source side or on the target side) is relatively easy to acquire. Hybrid approaches have been proposed for this situation (Costa-Jussà and Fonollosa, 2015). Data selection is one way to collect a small amount of good training data (Axelrod et al., 2015). In the work of Axelrod et al. (2011), pseudo in-domain instances are extracted from a general-domain corpus to train domain-adapted SMT. Zhao et al. (2004) apply information retrieval techniques to select in-domain documents from large monolingual text collections and enhance the baseline language model. They combine the baseline language model and the in-domain language model trained on the documents retrieved from large text collections using query models. In addition to the language model, Bertoldi and Federico (2009) exploit an in-domain monolingual corpus to train SMT models. They synthesize a pseudo bilingual corpus and train an in-domain translation model from the synthesized corpus. Bertoldi et al. (2014) and Lagarda et al. (2015) improve the post-editing module of a general-domain translation system with online learning. Transductive learning has also been introduced to SMT in a monolingual scenario (Ueffing, 2006; Ueffing et al., 2007).

In the medical translation task of the ninth workshop on statistical machine translation (WMT 2014), millions of in-domain parallel sentences in Czech, English, French, and German were made available for use in training SMT systems (Bojar et al., 2014). In contrast, our work is close to the monolingual scenario. Provided with a monolingual domain-specific corpus, we adapt a background MT system to the translation of medical documents (Chen et al., 2011).

There are major differences between our work and those proposed previously. First, related works exploit the entire in-domain training data to adapt the existing translation or language models using model mixtures and parameter adjustment. In contrast, we identify and translate significant patterns from large-scale in-domain source text and introduce them to our SMT system. Second, these significant patterns are translated using expert knowledge to reflect the large domain differences between the background training corpus and medical documents. That is, previous work has concentrated on model and parameter estimation to achieve domain adaptation when working on datasets with similar lexicons. Few studies deal with such a large domain gap, which is nevertheless a practical issue for a cross-domain SMT system. In the medical domain, for example, a term may not even appear in the training corpus, and therefore the SMT system can suggest no translation for it. This OOV problem is common when translating domain-specific terms such as diagnoses and surgical names in biomedical literature or medical records using a general-purpose SMT system. Different from previous work, we address cross-domain issues in SMT systems across two largely distinct domains by simplifying in-domain segments to those that can be recognized by the out-of-domain or background SMT system, and restoring the in-domain segments after receiving the SMT results.

Text simplification itself has some straightforward NLP applications (Woodsend and Lapata, 2011; Wubben et al., 2012; Zhu et al., 2010). For example, a simpler version of a text can be produced by modifying the lexical contents and shortening the grammatical structures without changing the original text at the semantic level. Such simplified contents are beneficial for language learners and people with lower levels of literacy. One real-world application is Simple English Wikipedia (footnote 2), which uses simple English words and grammar, thus benefitting English language learners. In this work, we apply sentence simplification techniques to improve machine translation quality. The source language input is simplified to facilitate translation by the SMT system. This simplification step serves as a pre-processing module of the out-of-domain SMT system, which has poor in-domain knowledge.

2 http://simple.wikipedia.org.

The simplification approach can be viewed as a variant of paraphrasing, which expresses the same meaning in different ways. Paraphrasing is employed for NLP tasks such as machine translation, natural language generation, and computer-assisted language learning. For SMT, paraphrasing is often used to alleviate data sparseness in translation models. For example, we paraphrase a source language text so that the paraphrased parts are easier for the background SMT system to translate. Callison-Burch et al. (2006) pioneered a pivoting approach using a parallel corpus to improve a phrase-based SMT model. Marton et al. (2009) propose a monolingual framework to select paraphrases of a term by comparing its context with those of candidate paraphrases. Aziz et al. (2010) propose a semi-automatic approach to mine paraphrases from hypernyms and hyponyms in an ontology. Resnik et al. (2010) conducted a pilot study of targeted paraphrasing in which monolingual speakers on both sides collaborate to improve SMT output by paraphrasing the critical segments of source text.

In contrast to previous studies, which apply paraphrasing to the SMT system in the general domain, our work focuses on cross-domain issues. We aim at introducing in-domain knowledge into the out-of-domain SMT system in a smoother fashion than the naive integration approach. Other works paraphrase general segments to provide more decoding options, whereas we concentrate on in-domain segments that account for the performance degradation of a cross-domain SMT system. We identify and simplify these in-domain segments to fit the background out-of-domain SMT system.

3. A simplification–translation–restoration framework

3.1. General principles

Fig. 2 is an overall picture of the proposed Simplification–Translation–Restoration framework. A resource is represented in terms of its linguality, domain, language, and type, where MO/BI denotes monolingual/bilingual, ID/OD denotes in-domain/out-of-domain, and SL/TL denotes source/target language. For example, an MO–ID–SL corpus and an MO–ID–TL corpus correspond to monolingual in-domain corpora in the source and target languages, respectively. Similarly, a BI–OD corpus and a BI–ID dictionary denote a bilingual out-of-domain corpus and a bilingual in-domain dictionary, respectively. In an extreme case, we are given only a bilingual out-of-domain corpus, monolingual in-domain corpora in the source and target languages, and a bilingual in-domain dictionary. The bilingual out-of-domain corpus is used to train translation and language models with Moses (Koehn et al., 2007), which form an initial background translation system.

Fig. 2. Basic STR-based translation system.

Fig. 3. Simplification–Translation–Restoration example.

The basic STR-based translation system is composed of three major modules, listed below. Module (1) is a pre-processing module preceding the background translation system, module (2) is the background translation system itself, and module (3) is a post-processing module after it. The example shown in Fig. 3 demonstrates how translations are generated.

(1) Identify and translate in-domain segments in the input sentence by using translation rules, and simplify the input sentence by replacing in-domain segments with simpler ones. For example, as illustrated in Fig. 3(a), the terminological unit “crystal induced arthritis”, a rare medical term in the class DIAGNOSIS, is simplified to a more common diagnosis term, “hypertension”.
(2) Translate the simplified source sentence using the background translation system. Fig. 3(b) shows the translation result, which gives not only the correct translation of the diagnosis term “高血壓” but also the nearby context “患有”. This is because the phrase “suffered from hypertension” is translated with a customized decoder, which regards “suffered from DIAGNOSIS” as a syntactic unit.
(3) Restore the bilingual in-domain segments translated in (1) into the translation result generated in (2). This restoration is based on the internal alignment between the source and the target sentences. To acquire the actual translation rather than a simplified one, the simplified parts are cut out and the original bilingual domain-specific terms are restored, as illustrated in Fig. 3(c). In contrast with the translation in Fig. 1, the final translation result is correct and fluent.

The translation results generated by the basic STR-based translation system may be regarded as pseudo bilingual in-domain corpora, which can be further used to enhance the background translation system, for example by training new translation and language models (Ueffing et al., 2007). Fig. 4 shows the framework of the complete refined system, which we term an advanced STR-based translation system.
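The three modules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionaries `IN_DOMAIN` and `STAND_IN`, the toy sentence, and the stub `toy_background_mt` are all hypothetical, the target string for “crystal induced arthritis” is a placeholder rather than a vetted medical translation, and restoration is done by plain string replacement, whereas the framework described above relies on the decoder's internal source–target alignment.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-domain dictionary: source term -> (class, target translation).
# The target string is a placeholder, not a vetted medical translation.
IN_DOMAIN: Dict[str, Tuple[str, str]] = {
    "crystal induced arthritis": ("DIAGNOSIS", "<ZH:crystal induced arthritis>"),
}

# Hypothetical common stand-in per class: (source word, its expected translation).
STAND_IN: Dict[str, Tuple[str, str]] = {
    "DIAGNOSIS": ("hypertension", "高血壓"),
}


def simplify(sentence: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Module (1): replace known in-domain terms with a common stand-in.

    Returns the simplified sentence plus, for each replacement, the pair
    (stand-in translation, in-domain translation) needed for restoration.
    """
    restorations: List[Tuple[str, str]] = []
    # Try longer terms first so they win over any shorter term they contain.
    for term in sorted(IN_DOMAIN, key=len, reverse=True):
        if term in sentence:
            cls, target = IN_DOMAIN[term]
            stand_in_src, stand_in_tgt = STAND_IN[cls]
            sentence = sentence.replace(term, stand_in_src, 1)
            restorations.append((stand_in_tgt, target))
    return sentence, restorations


def restore(translation: str, restorations: List[Tuple[str, str]]) -> str:
    """Module (3): put the in-domain translations back.

    Assumption: the stand-in's translation appears verbatim in the output,
    so string replacement suffices (the paper uses word alignment instead).
    """
    for stand_in_tgt, target in restorations:
        translation = translation.replace(stand_in_tgt, target, 1)
    return translation


def str_translate(sentence: str, background_mt: Callable[[str], str]) -> str:
    """Run simplification, background translation (module (2)), restoration."""
    simplified, restorations = simplify(sentence)
    return restore(background_mt(simplified), restorations)


def toy_background_mt(sentence: str) -> str:
    """Stub standing in for the out-of-domain SMT system."""
    table = {"he suffered from hypertension": "他患有高血壓"}
    return table[sentence]


result = str_translate("he suffered from crystal induced arthritis", toy_background_mt)
print(result)
```

The point of the sketch is the data flow: the background system only ever sees the simplified sentence, and the mapping recorded during simplification is what makes restoration possible.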
Listed below are the major challenges in the STR framework. Details are provided in Section 4.

(1) How to mine patterns from the in-domain corpus and use them to identify in-domain segments in the given sentences.
(2) How to create bilingual patterns and use them to translate in-domain segments.
(3) How to select terms to replace in-domain segments and derive simplified sentences.
(4) How to translate the simplified sentences with the background translation system.
(5) How to restore the translation result of the in-domain segment to the translation result of the simplified sentences.
(6) How to update the translation and language models trained using the out-of-domain corpus.


Fig. 4. Advanced STR-based translation system.


3.2. Linguistic resources used in the case study

To train the translation model, we use Hong Kong Parallel Text (LDC2004T08), which contains official records, law codes, and press releases of the Legislative Council, the Department of Justice, and the Information Services Department of the HKSAR, respectively, and the UN Chinese-English Parallel Text collection (LDC2004E12). These two corpora contain a total of 6.8M sentences. The Chinese counterpart of the above parallel corpus and the Central News Agency part of the Tagged Chinese Gigaword (LDC2007T03) are used to train the trigram language model. These two corpora contain a total of 18.8M sentences. The trained models are used in module (2) of the basic STR-based translation system specified in Section 3.1.

In addition to the out-of-domain corpora for the development of translation and language models, 60,448 English medical records (6.2M sentences) were selected from the National Taiwan University Hospital (NTUH) records to form an in-domain corpus. There are three main parts in an NTUH medical record: chief complaint, brief history, and course and treatment. The chief complaint includes the symptoms for which a patient seeks medical care. It is written in the patient’s own words. The brief history states the present illness and medical history of a patient. The course and treatment describes the disease status of a patient and the progress of treatment. A typical medical record is shown in Table 1. As illustrated in the table, the chief complaint and the course and treatment are briefly documented and short in length. In contrast, the brief history records how the patient has suffered in the past from the present illness. It constitutes the major part of a medical record and contains many specific writing styles. Table 1 highlights the examples of frequent patterns with the strings underlined.

4. Translation rule mining

4.1. Pattern mining

There are a number of frequent patterns in medical records. For instance, the sentence “Port-A implantation was performed on 2009/10/9” contains the pattern shown below:


Table 1
A sample medical record.

Chief Complaint
Hematemesis on 1/22.

Brief History
This 70-year-old male has history of (1) Hypertension, (2) Diabetes mellitus, and (3) Chronic kidney disease. He regularly followed up at our OPD. His long term prescription included Tapal 1 tab QD. According to his wife’s statement, he felt nausea since 1/22 morning, and then he had sudden onset of syncope. A lot of blood was vomited on 1/23 accompanied with general convulsion. Then he was sent to our emergency department by Ambulance. In our emergency department, laboratory data revealed low level hemoglobin (7.5 mg/dl) and normal range of cardiac enzyme and normal range of PT, PTT. Blood transfusion and Pantoloc were given for suspected upper gastrointestinal bleeding. Endoscopy was performed on 1/23 with results of a 0.4 cm A1 duodenal ulcer and a gastric erosion. Bosmin injection and heater probe were done during endoscopy procedure. Under the impression of duodenal ulcer with bleeding, he was admitted to our ward for further management.

Course and Treatment
After admission, we continued intravenous fluid supply and PPI. We switched PPI to oral form since 2010/01/26. Following hemoglobin was stationary (1/27 Hb:10.3). He was discharged on 2010/01/28 with further GI OPD follow up.

paracentesis was performed on 2010-01-08
repositioning was performed on 2008/04/03
incision and drainage was performed on 2010-01-15
tracheostomy was performed on 2010/1/11

The following pattern specifies a particular surgery that was performed on a particular date.

(S1) SURGERY was performed on DATE

Here SURGERY represents a class of medical terms denoting surgeries, and DATE is a class of date expressions. Because highly specific terms in the target domain are rare or even unseen in the bilingual training corpus, general SMT systems yield poor-quality translations with inconsistent styles for these terms.

Given a medical record, identifying domain-specific terms and classifying them into suitable classes such as SURGERY is the first step toward the extraction of significant patterns. We follow the Unified Medical Language System (UMLS), which covers a wide range of terms in the medical domain and the relations between these medical terms. The Metathesaurus organizes medical terms into groups of concepts. Every concept is assigned to at least one Semantic Type. In this study, we map 133 UMLS Semantic Types to the four NTUH medical classes: DIAGNOSIS, DRUG, SURGERY, and TEST. Terms in unmapped Semantic Types, such as animals and plants, are not classified. Terms in Semantic Types related to organs and body parts are mapped to the additional medical class BODY.

Provided with a large English medical record corpus and terminological databases, we aim to (1) extract significant patterns to capture domain-specific writing styles, and (2) reduce the size of the pattern set to minimize the cost of involving experts in reviewing and translating the patterns. The overall steps are summarized as follows. The details are described in the subsections.

(a) Medical entity classification: Recognize medical named entities such as surgeries, diseases, and drugs; transform them into the corresponding medical classes; and derive a new corpus.
(b) Frequent pattern extraction: Employ n-gram models to extract a set of frequent patterns from the new corpus.
(c) Linguistic pattern extraction: Randomly sample sentences for each pattern, parse the selected sentences, and keep the pattern if there exists at least one parse tree for it.
(d) Pattern pruning: Check coverage relations among higher- and lower-order patterns, and remove lower-order patterns that are covered.
(e) Pattern clustering: Cluster the remaining patterns of the same order, and output the representative patterns from each cluster for pattern translation.

4.1.1. Medical entity classification

Named entities such as medical terms, hospital names, and date/time expressions are our targets. As recognition of traditional named entities like organization names and date/time expressions has been discussed intensively before, we omit it here and focus instead on the classification of medical terms. To identify and classify medical terms in our domain-specific corpus, we examine each sentence from left to right and adopt a longest-first strategy to replace medical terms with classes.

In this study, we employ the n-gram to represent patterns. In natural language processing, the n-gram’s primary limitation is its length. Lower-order n-grams capture only local cues in a restricted scope. Moreover, reliable statistics on higher-order n-grams call for more training data. One way to resolve the n-gram’s locality issue is to recognize word strings with specific semantics and replace them with classes. Thus, patterns are represented in terms of combinations of words and classes rather than words only. In this manner we enlarge the scope of patterns.

4.1.2. Frequent pattern extraction

We extract patterns from an in-domain medical record corpus to capture domain-specific writing styles. These patterns are translated for later application in run-time translation. Accordingly, we prefer a pattern format that is easily integrated into an SMT system for the target application. The phrase-based model (Koehn, 2004; Koehn et al., 2003) is a state-of-the-art translation model in terms of both accuracy and speed. The phrase-based SMT model translates source phrases into target phrases with a phrase table which consists of bilingual phrase pairs and feature scores estimated from word-to-word alignments.
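As an illustration of steps (a) and (b), the sketch below applies left-to-right, longest-first class replacement and then counts n-grams over the class-replaced corpus. It is a minimal sketch under stated assumptions: the `LEXICON` entries are hypothetical stand-ins for UMLS-derived term lists, and the `DATE_RE` regular expression is our own simplification, since date recognition is not detailed in this paper.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical term lexicon (token tuple -> medical class); a real system
# would draw these entries from UMLS-derived term lists.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("incision", "and", "drainage"): "SURGERY",
    ("tracheostomy",): "SURGERY",
    ("paracentesis",): "SURGERY",
}
MAX_TERM_LEN = max(len(term) for term in LEXICON)

# Assumed date format; the paper does not describe its date recognizer.
DATE_RE = re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}")


def classify(tokens: List[str]) -> List[str]:
    """Scan left to right, replacing the longest matching term with its class."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for span in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            cls = LEXICON.get(tuple(tokens[i:i + span]))
            if cls is not None:
                out.append(cls)
                i += span
                break
        else:
            # No medical term starts here: keep the word, or map a date to DATE.
            out.append("DATE" if DATE_RE.fullmatch(tokens[i]) else tokens[i])
            i += 1
    return out


def count_ngrams(sentences: List[List[str]], n: int) -> Counter:
    """Count every n-gram (as a tuple of words and classes) in the corpus."""
    counts: Counter = Counter()
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts


sentences = [
    "paracentesis was performed on 2010-01-08",
    "incision and drainage was performed on 2010-01-15",
    "tracheostomy was performed on 2010/1/11",
]
corpus = [classify(s.split()) for s in sentences]
patterns = count_ngrams(corpus, 5)
print(patterns.most_common(1))
```

With this toy lexicon, all three example sentences collapse to the same 5-gram, which corresponds to pattern (S1). Ranking the counter by frequency then gives the candidate pattern list.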
Since a phrase (a sequence of consecutive words) serves as the basic translation unit, integrating n-gram based patterns into the background phrase-based SMT system is a natural choice. We enumerate all the n-grams from those sentences of our in-domain corpus that contain words and medical classes. Two kinds of patterns are extracted: (1) class patterns that contain at least one medical class, and (2) lexical patterns that contain only words. Note that both kinds of patterns are easily integrated into a phrase-based SMT system: lexical patterns are embedded into the phrase table, and class patterns are used as translation rules that are applied when the pattern occurs in a medical record. After all the patterns are extracted from the medical records, they are ranked by frequency.

4.1.3. Linguistic pattern extraction

The linguistic meaningfulness of patterns is considered as a metric to judge their significance. For example, “SURGERY was performed” is a linguistic constituent, unlike “SURGERY was performed on”, which is incomplete. We filter out those patterns that do not meet the requirements of complete linguistic constituency. A parser is adopted to determine the linguistic completeness.

A cross-domain problem arises when applying a general-purpose parser to the domain-specific corpus. The parser may not work well when parsing text with domain-specific terms such as diagnosis and drug names, especially when a term spans multiple words. Here we take advantage of the medical entity classification introduced in Section 4.1.1: to allow use of the general-purpose parser, we replace each named entity in a sentence with a common word. In this way, we reduce not only the out-of-vocabulary (OOV) words, but also the sentence length. For example, as illustrated in Fig. 5, we successfully determine the linguistic completeness of pattern (S1) by replacing the complicated surgery name “percutaneous transluminal coronary angioplasty” with the simpler “surgery”.
As the original sentence is parsed incorrectly, the parse tree on the left side does not meet the requirement of linguistic completeness. In contrast, the revised sentence on the right side, “surgery was performed on 2007-12-25”, is parsed properly: its parse tree forms a complete linguistic constituent. For each extracted pattern, we select m distinct sentences in which it occurs. A parser then analyzes these sentences and produces m parse trees. The pattern is considered a significant candidate if it is a syntactic constituent in at least one of the m parse trees. In this study, we apply the Stanford parser (Klein and Manning, 2003) and set m to 10 in consideration of parsing speed.

4.1.4. Pattern pruning

A higher-order pattern A may be composed of two lower-order patterns B and C. We say that A covers B and C if all three are linguistically complete. The 5-gram pattern (S1) in Section 4.1 is the concatenation of the 3-gram pattern


Fig. 5. Linguistic completeness of pattern (S1).

Table 2
Examples of coverage relations.

Coverage Relation   Higher Order Pattern A                 Lower Order Pattern B   Lower Order Pattern C
4=2+2               BODY SURGERY on DATE                   BODY SURGERY            on DATE
5=2+3               local finding showed left DIAGNOSIS    local finding           showed left DIAGNOSIS
5=3+2               Elevated TEST level was noted          Elevated TEST level     was noted

“SURGERY was performed” and the 2-gram pattern “on DATE”. After pattern (S1) is translated, we derive the translations of its lower-order components. Using this coverage relation (“5 = 3 + 2”), we retain only pattern (S1), and omit its 3-gram and 2-gram components. Other examples of coverage relations are given in Table 2. Translating the higher-order pattern not only extends the translations of its components, but also yields the correct ordering of their combination. Thus, keeping the covering patterns and ruling out those that are covered reduces the number of extracted patterns and at the same time preserves their integrity.

4.1.5. Pattern clustering

Given a cluster of similar patterns, the translation for the most representative pattern may also work as a translation for the other patterns in the cluster. An example of a cluster of similar patterns is illustrated as follows.

  he received SURGERY on DATE
  he received TEST on DATE
  he underwent SURGERY on DATE
  he underwent TEST on DATE

If the first pattern is translated by an expert, the translations for the others are easily inferred without physician involvement. Consequently, physicians need not translate similar patterns, and their effort can be spent on a more diverse set of patterns. We adopt single-link clustering (Sibson, 1973). We define the similarity between two n-gram patterns as the number of identical words in identical positions. Two n-grams are placed into the same cluster if their similarity is no less than n − 1.

4.2. Pattern translation

With the medical entity classification described in Section 4.1.1, domain-specific terms are identified and transformed into medical classes. As a result, the lexical patterns extracted from the corpus contain only common words, and the MT system translates these OOV-free lexical patterns. In this study, we use the online tool Google Translate to translate the lexical patterns. Translation quality can be compromised due to domain-specific usage.
As illustrated in Table 3, incorrect word sense disambiguation (WSD) often leads to translation errors in a general-purpose translation system. Physicians review every lexical pattern and, if necessary, revise its translation to reflect domain-specific usage. Compared with starting from scratch with only monolingual patterns, building bilingual patterns from MT output saves much of the experts’ time.


Table 3
WSD problems in medical record domain. Bold text denotes words with ambiguous senses.

English Patterns                          Wrong Translation by MT               Correct Translation by Physicians
the bulging mass progressively enlarged   質量 (quantity of matter)             腫塊 (tumor)
a total of six courses                    課程 (a series of lessons; class)     療程 (treatment plan)
visited our hospital for help             參觀 (look around)                    到 (go to see a doctor)
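Before class patterns are shown to physicians, they are grouped by the single-link criterion of Section 4.1.5. The following is a minimal sketch of that criterion, not the authors' implementation: it assumes patterns are whitespace-tokenized strings and treats single-link clustering with a fixed threshold as finding connected components. The function names and the fifth example pattern's grouping are illustrative only.

```python
from collections import defaultdict
from typing import List, Sequence


def similarity(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of identical words in identical positions."""
    return sum(x == y for x, y in zip(a, b))


def single_link_clusters(patterns: List[str]) -> List[List[str]]:
    """Group same-length n-grams whose similarity is at least n - 1.

    With a fixed threshold, single-link clustering reduces to connected
    components, computed here with a small union-find structure.
    """
    tokens = [tuple(p.split()) for p in patterns]
    parent = list(range(len(tokens)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            same_order = len(tokens[i]) == len(tokens[j])
            if same_order and similarity(tokens[i], tokens[j]) >= len(tokens[i]) - 1:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i, pattern in enumerate(patterns):
        groups[find(i)].append(pattern)
    return list(groups.values())


example_patterns = [
    "he received SURGERY on DATE",
    "he received TEST on DATE",
    "he underwent SURGERY on DATE",
    "he underwent TEST on DATE",
    "DRUG and DRUG was given",
]
for cluster in single_link_clusters(example_patterns):
    print(cluster)
```

Note that the first and fourth patterns differ in two positions, yet they end up in the same cluster because each is within one word of an intermediate pattern; this chaining is the defining property of single-link clustering.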

We deploy physicians to translate the class patterns, which contain medical classes and require in-domain knowledge from experts. A web user interface is designed for the physicians. The physicians inspect class patterns and correct the translations as needed. The translation is skipped if the patterns are considered unimportant by the physicians. Some patterns are relatively hard to understand and translate, and therefore we provide the physicians with 10 instances for reference.

4.3. Pattern integration

In this section, we extract and translate significant patterns from the medical record corpus and attempt domain adaptation by integrating these bilingual patterns into a general-domain SMT system. Since we use n-gram patterns, which are consecutive sequences of tokens, the integration can be carried out without major changes to the background SMT system. We set up a phrase-based SMT system using Moses. As proposed in the work of Bertoldi and Federico (2009), lexical patterns can serve as a separate phrase table to provide domain-specific translation options during the decoding stage. As class patterns appear frequently in the medical domain, and as their translations by physicians are unlikely to be ambiguous, we adopt the translations of these class patterns in each input medical record, and start decoding from the partial hypothesis. This is feasible with the support of advanced functions in the current version of Moses such as XML markup and continuing partial translation.

5. Implementation of STR framework

The central issues in the STR framework include (1) collecting bilingual in-domain knowledge and identifying in-domain segments in a source text, (2) replacing the in-domain segments with the proper simplified forms, (3) translating the modified text with a background SMT system, and (4) restoring the original in-domain segments after receiving the translation results from the background SMT system.
We are not the first to rephrase source language text in order to improve SMT output. As a pilot study, Resnik et al. (2010) propose a targeted paraphrasing approach, which identifies the critical source segments difficult for the background SMT system to translate. These segments are then manually paraphrased in many ways in order to provide the SMT system with more choices of decoding paths. In contrast to their work, we automatically identify the critical segments with in-domain knowledge, simplify them with linguistic information, and restore these critical segments after receiving the SMT results. The STR framework is composed of the following four steps:

(1) Identify in-domain segments s1, s2, . . ., sn from input sentence S.
(2) Simplify s1, s2, . . ., sn in S and derive new source sentence S’.
(3) Translate source sentence S’ into target sentence T’.
(4) Restore bilingual in-domain segments s1-t1, s2-t2, . . ., sn-tn back to S’-T’ and derive final translation result T.
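The four steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the system built in this work: the knowledge base, the `toy_background_mt` stand-in for the background SMT system, and the translation "他接受" are hypothetical, and restoration here uses only phrase-level alignment. The pair "abdominal tapping-腹部穿刺" and the simplified pair "surgery-手術" are taken from the example in Section 5.4.

```python
from typing import Callable, Dict, List, Tuple

PhrasePair = Tuple[str, str]                # (source phrase, target phrase)
KnowledgeBase = Dict[str, Tuple[str, str]]  # segment -> (simplified form, translation)


def identify(sentence: str, kb: KnowledgeBase) -> List[str]:
    """Step 1: list the in-domain segments found in the sentence, longest first."""
    return [seg for seg in sorted(kb, key=len, reverse=True) if seg in sentence]


def simplify(sentence: str, segments: List[str],
             kb: KnowledgeBase) -> Tuple[str, Dict[str, PhrasePair]]:
    """Step 2: replace each segment by its simplified form and remember the original."""
    restore_map: Dict[str, PhrasePair] = {}
    for seg in segments:
        simple, translation = kb[seg]
        sentence = sentence.replace(seg, simple)
        # Toy limitation: two segments sharing one simplified form would collide.
        restore_map[simple] = (seg, translation)
    return sentence, restore_map


def restore(pairs: List[PhrasePair],
            restore_map: Dict[str, PhrasePair]) -> List[PhrasePair]:
    """Step 4: swap simplified phrase pairs back to the original bilingual segments."""
    return [restore_map.get(src, (src, tgt)) for src, tgt in pairs]


def str_translate(sentence: str, kb: KnowledgeBase,
                  background_mt: Callable[[str], List[PhrasePair]]) -> str:
    segments = identify(sentence, kb)
    simplified, restore_map = simplify(sentence, segments, kb)
    pairs = background_mt(simplified)       # Step 3: translate S' with phrase alignments
    return "".join(tgt for _, tgt in restore(pairs, restore_map))


def toy_background_mt(sentence: str) -> List[PhrasePair]:
    """Hypothetical stand-in for the background SMT system (greedy phrase lookup)."""
    table = {"he received": "他接受", "surgery": "手術"}
    pairs, rest = [], sentence
    while rest:
        for src, tgt in table.items():
            if rest.startswith(src):
                pairs.append((src, tgt))
                rest = rest[len(src):].lstrip()
                break
        else:
            raise ValueError(f"no phrase covers: {rest!r}")
    return pairs


TOY_KB: KnowledgeBase = {"abdominal tapping": ("surgery", "腹部穿刺")}
print(str_translate("he received abdominal tapping", TOY_KB, toy_background_mt))
```

The sketch keeps the background translator behind a plain function argument, which mirrors the framework's premise that the SMT system is left unchanged while pre- and post-processing carry the domain knowledge.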

In the following sections, we describe each step in detail.

5.1. Identification

An SMT system performs poorly when translating segments which are rare in the background model. In the first step of our framework, we identify in-domain segments in an input source text for the next simplification step. To this end, we collect bilingual in-domain resources consisting of source–target string pairs. Although a parallel corpus may not be available in a special domain, there are various ways to collect bilingual in-domain knowledge. For example, bilingual dictionaries are available in specific areas such as medicine, physics,


and economics. They provide high-quality in-domain terminology. In addition to handcrafted dictionaries, bilingual in-domain knowledge may also include phrase pairs or synchronous grammar rules, depending on the translation model and the decoding style of the background SMT system. Such bilingual knowledge can be collected using automatic (Haghighi et al., 2008; Wu and Chang, 2007) or semi-automatic (Morin et al., 2007) approaches. As introduced in Section 4, we propose an approach to collect bilingual medical terminology. In the STR framework, we also apply the bilingual medical terminology as an in-domain knowledge base.

5.2. Simplification

In the simplification step, the identified in-domain segments of a text are transformed into more general expressions. The modified text is then ready to be translated by the background SMT system. This simplification step serves as a pre-processing step before translation. We simplify an in-domain segment according to its type; in this study, the types are terminological units and syntactic units.

Terminological units, based on the ontology provided by the Unified Medical Language System (UMLS), refer to terms that appear in domain-specific dictionaries and glossaries without specifiers and modifiers. We simplify these in-domain terms by finding, in the background SMT model, related terms such as hypernyms or synonyms that occur more frequently, as well as contextual information. For a medical term, we enumerate its hypernyms and find in the background translation model the most frequent one, which we designate as the simplified form.

Syntactic units are linguistically meaningful segments which constitute special writing styles of the target domain. These units usually bear syntactic categories at clausal or phrase levels such as S, NP, VP, or PP. They contain heads along with their modifiers. These syntactic categories can be derived from the parsed or the chunked results. In this work, the Stanford parser is employed.
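The selection of a simplified form for a terminological unit can be sketched as follows. This is only an illustration of the "most frequent hypernym" idea described above: the hypernym list and the frequency counts are hypothetical placeholders, not values drawn from UMLS or from the background translation model used in this work.

```python
from typing import Dict, List, Optional


def simplest_hypernym(term: str,
                      hypernyms: Dict[str, List[str]],
                      background_freq: Dict[str, int]) -> Optional[str]:
    """Return the hypernym of `term` that is most frequent in the background model.

    Hypernyms that never occur in the background model are skipped, since
    they would not reduce data sparseness. Returns None if no candidate remains.
    """
    candidates = [h for h in hypernyms.get(term, []) if background_freq.get(h, 0) > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda h: background_freq[h])


# Hypothetical hypernym chain and background frequencies, for illustration only.
toy_hypernyms = {
    "percutaneous transluminal coronary angioplasty": ["angioplasty", "surgery", "procedure"],
}
toy_background_freq = {"angioplasty": 0, "surgery": 5200, "procedure": 3100}

print(simplest_hypernym("percutaneous transluminal coronary angioplasty",
                        toy_hypernyms, toy_background_freq))
```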
As shown below, we further simplify each syntactic unit based on the rule corresponding to its syntactic category.

(a) NP (noun phrase)
We keep the head of an NP and remove its specifier and modifier. If the head noun is a domain-specific term, it is further transformed by the simplification rule for terminological units. Fig. 6 shows the parse tree of a string as an example. The string is labeled NP at the root node, which contains two sub-trees with categories NP and PP, respectively. According to this simplification rule, we therefore remove its PP modifier. As a result, the string “a patient of skin rash with multiple erythematous papules” is simplified to only its head “a patient”.

(b) VP (verb phrase)
VP → V + NP: We leave V untouched and simplify NP per simplification rule (a). For example, we simplify “had underlying diseases of ventricular tachycardia and dyslipidemia” to “had diseases”.
VP → V + PP: We leave V untouched and remove PP if PP is a modifier. If PP is mandatory, it is further simplified per simplification rule (c). Whether PP is optional or mandatory is determined by V’s subcategory. For example, we simplify the sentence “he was discharged on the morning of 6/30” to “he was discharged”.


Fig. 6. Parse tree of a syntactic unit. A bracketed string at each non-terminal node is the head of the corresponding syntactic category.


(c) PP (prepositional phrase)
PP → P + NP: We keep P and simplify NP per simplification rule (a). For example, we simplify “with underlying diseases of ventricular tachycardia and dyslipidemia” to “with diseases”.

(d) S (clause)
We simplify a clause by simplifying its children recursively according to the above simplification rules.

The rule-based simplification approach is straightforward, but it can be effectively applied to most of the syntactic units discussed in Section 6. Applying transformation or rewriting rules on source sentences based on their syntactic structures has been adopted in other work. Wang et al. (2007) list a set of prominent syntactic reordering rules that systematically describe the word order difference between source and target languages. Based on these rules, they parse a source language input and reorder its structure to match the target language grammar to train a better translation model and improve a phrase-based SMT system. In their work, source-side syntactic reordering also serves as a pre-processing module of the SMT system. In contrast to their work, we simplify the source language input in favor of the background SMT system instead of changing the order of its structure.

5.3. Translation

In the translation step, the background SMT system translates the simplified in-domain text and produces its translation result. Since the input source text is simplified in favor of the translation system, the contextual distributions of the phrases can be estimated better than those of the original text, as demonstrated in Fig. 3(b). Our STR framework performs domain adaptation under the scenario where bilingual in-domain segments are available. This approach can be combined with other domain adaptation approaches that exploit monolingual or bilingual in-domain corpora to further improve translation quality. In our STR framework, the semi-automatic approach proposed in Section 4 is applied to collect monolingual and bilingual medical terms.
For instance, if a parallel in-domain corpus is available, we use the learning-based domain adaptation approaches described in Section 2, and tune the background translation model toward the specific domain. In this way, we hope to receive better translation results from the background SMT system and facilitate the next restoration step.

For a phrase-based SMT system and its variations (Chiang, 2005; Huang and Chiang, 2007; Xiong et al., 2006), we can further customize the decoder to produce higher-quality translation and facilitate the next restoration step of our framework. For example, as in-domain segments in a text are either terminological or syntactic units in our experiments, their translations are continuous without other interleaving translations. However, an out-of-domain SMT system may generate incorrect ordering due to lack of domain-specific knowledge. For instance, as illustrated in Fig. 7, the translation of the phrase “bone lesion”, a terminological unit consisting of two words, is separated. In our work, we set up an SMT system based on Moses (Koehn et al., 2007) and apply its advanced feature of specifying reordering constraints to each of the simplified phrases. Under this constraint, a simplified phrase denoting a multi-word unit is translated as a block, and its translation is preserved on the target side.

5.4. Restoration

In the restoration step, we receive the translation result from the background SMT system and perform post-processing steps. We locate the simplified phrase pairs and replace them with their corresponding bilingual


Fig. 7. Incorrect reordering of the in-domain phrase “bone lesion”.


Fig. 8. Three restoration methods: phrase alignment, word alignment and probability-based extraction. Thick lines are phrase alignments and thin lines are word alignments.

in-domain segments as illustrated in Fig. 3(c). The resulting parallel text is the final output of our framework, and its target side is the translation of the in-domain text. To accurately restore bilingual in-domain segments, we require the internal alignment information between the source and the target sides. Depending on the difficulty involved in extracting the simplified phrase pairs, different granularity levels are applied, including phrase alignment, word alignment, and word alignment score. The restoration methods under different situations are summarized in Fig. 8. We apply these restoration approaches one-by-one in order of increasing granularity: phrase alignment, word alignment, and probability-based extraction. As discussed in the next section, empirical results show that these approaches, although simple, effectively handle most cases. If a simplified source phrase is translated alone without local context, phrase-level alignment information is sufficient to perform the restoration: we replace the simplified phrase pairs with the original bilingual in-domain segments. Fig. 8(a) illustrates the phrase alignment method. The shaded blocks and thick lines denote simplified phrase pairs. By checking the phrase alignments provided by the decoder, the simplified phrase pairs can be replaced without further processing. The method is demonstrated in Fig. 9(a). In this example, the simplified term “surgery” is translated as a single phrase. Based on the phrase alignments, the simplified phrase pair “surgery-手術” is replaced with its original form “abdominal tapping-腹部穿刺”. In some cases, a simplified phrase is translated together with its local context, and therefore we must determine the translation of the simplified phrase before we can restore its original form, such as for the term “hypertension” in Fig. 9(b). 
In this case phrase-level alignments are insufficient, and the finer granularity of word-level alignments is needed to separate the simplified phrase from its context. With word alignment information, we can extract the simplified phrase pair without violating the consistency judgment of a phrase pair (Och et al., 1999). We then replace the simplified phrase pair with its original bilingual in-domain segment. Our word alignment method is illustrated in Fig. 8(b). The hollow blocks denote the context f2f3, which is translated together with its local simplified term f1. The thin lines and the thick lines are word alignments and phrase alignments, respectively. Compliant with the consistency of phrase extraction, we separate the phrase pair f1–e1 from the phrase pair f1f2f3–e1e2e3 and perform the restoration. The method is demonstrated in Fig. 9(b). Based on the word alignments, the simplified term “hypertension” can be separated from its context “suffered from” without violating the consistency. As a result, we successfully restore the original form “crystal induced arthritis”.

There are still cases in which the word alignment method fails due to the fertility between source and target languages. For a simplified term, which is usually a content word, its translation may be aligned to non-content words on the source side. In this case, extracting simplified phrase pairs would violate the consistency of phrase extraction. Here we apply a probability-based extraction of simplified phrases. This approach deletes weak word alignments based on alignment probabilities. For a source simplified phrase fi,j spanning words fi to fj which is aligned to its target translation ei,j, we examine each word in ei,j. If there exists an ek (i ≤ k ≤ j) which is aligned to two source words, one within fi,j and one outside fi,j, we attempt to delete one of the alignments by comparing their word alignment probabilities. Fig. 8(c) illustrates this probability-based extraction method. The word e1 is aligned to both f1 and f2. The fuzzy alignment causes the unsuccessful separation of f1–e1 from its context f2f3–e2e3. If the word alignment probabilities show P(f1|e1) > P(f2|e1), we delete the weak word alignment f2–e1, and meet the consistency judgment. The method is demonstrated in Fig. 9(c), where the word alignment method is unsuccessful. In this case, we fail to determine the translation of the simplified syntactic unit “he was admitted” because the target word “承認” is aligned to both “admitted” and “in”. With the probability-based extraction approach, we delete the alignment (the dashed line)


Fig. 9. Three restoration methods applied with different levels of alignment information. Thick lines are phrase alignments and thin lines are word alignments.

“in-承認”, because P(in | 承認) < P(admitted | 承認). After that, we are able to extract the phrase pair “admitted-承認”. By determining the translation of “he was admitted” with two consecutive phrase pairs, the restoration of the original form “he was admitted for scheduled Trabeculectomy” is easily done.
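The word-alignment and probability-based extraction methods of this section can be sketched as follows. This is a minimal illustration under assumptions, not the implementation used in this work: alignments are modeled as a set of (source index, target index) links, the probabilities are hypothetical, and a tie between inside and outside links is resolved arbitrarily by dropping the inside link. The example mirrors the abstract case of Fig. 8(c), where e1 is aligned to both f1 and f2.

```python
from typing import Dict, Optional, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)


def extract_target_span(links: Set[Link], i: int, j: int) -> Optional[Tuple[int, int]]:
    """Return the target span aligned to source words i..j, or None if the
    resulting phrase pair would violate phrase-extraction consistency."""
    targets = [t for s, t in links if i <= s <= j]
    if not targets:
        return None
    lo, hi = min(targets), max(targets)
    # Consistency: no target word inside [lo, hi] may align to a source word outside [i, j].
    if any(lo <= t <= hi and not (i <= s <= j) for s, t in links):
        return None
    return lo, hi


def prune_weak_links(links: Set[Link], prob: Dict[Link, float], i: int, j: int) -> Set[Link]:
    """For each target word aligned both inside and outside source span i..j,
    delete the side with the lower alignment probability."""
    kept = set(links)
    for t in {t for _, t in links}:
        inside = [(s, tt) for s, tt in links if tt == t and i <= s <= j]
        outside = [(s, tt) for s, tt in links if tt == t and not (i <= s <= j)]
        if inside and outside:
            best_in = max(prob[link] for link in inside)
            best_out = max(prob[link] for link in outside)
            kept -= set(outside) if best_in > best_out else set(inside)
    return kept


# Fig. 8(c)-style toy case: source f1 f2 f3 (indices 0..2), target e1 e2 e3.
toy_links = {(0, 0), (1, 0), (1, 1), (2, 2)}           # e1 is aligned to both f1 and f2
toy_prob = {(0, 0): 0.7, (1, 0): 0.2, (1, 1): 0.6, (2, 2): 0.9}  # hypothetical values

print(extract_target_span(toy_links, 0, 0))             # inconsistent before pruning
pruned = prune_weak_links(toy_links, toy_prob, 0, 0)
print(extract_target_span(pruned, 0, 0))                # f1-e1 can now be extracted
```

Applying the consistency check first and pruning only on failure corresponds to trying the word alignment method before falling back to probability-based extraction.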

5.5. Model updating

There are several ways to update the basic STR-based translation system. The first consideration is the in-domain translation rules. They are formulated semi-automatically by domain experts. The high cost of domain experts means that only a small portion of n-gram patterns, along with the corresponding translations, are generated. The post-editing results suggest more translation rules, which can be fed back to revise the basic translation system.

The second consideration is the translation and language models in Moses. Transductive learning (Ueffing et al., 2007) and data selection (Axelrod et al., 2011) are employed. In an ideal case, the complete monolingual in-domain corpus in the source language is translated by the basic STR-based translation system, after which the results are post-edited by domain experts, and finally the complete post-edited bilingual corpus is fed back to revise both the translation and language models. However, the high post-editing cost by domain experts means that only some samples of the initial translation are edited by domain experts. Thus, the sampled post-edited in-domain corpus in the target language is used to revise the language model, and the in-domain bilingual translation result before post-editing is used to revise the translation and language models. We must take into account size and translation quality. We explore the effect of different sizes of imperfect in-domain translation results on refining the basic STR-based translation system. Moreover, a selection strategy, which takes into account only those translation results that are completely in the target language, is introduced to sample “relatively more accurate” bilingual translation results.

In the above model updating, the translation rules, translation model, and language model are revised individually. The third consideration is to merge the refinements together and examine their effects on translation performance.
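The selection strategy mentioned above can be sketched as a simple filter over pseudo bilingual pairs. The sketch assumes that "completely in the target language" can be approximated, for English-to-Chinese output, by the absence of Latin letters on the target side; this test and the example sentence pairs are illustrative assumptions rather than the criterion actually implemented in this work.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: any ASCII letter on the target side signals an untranslated source word.
_LATIN_LETTER = re.compile(r"[A-Za-z]")


def select_fully_translated(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (source, target) pairs whose target side is non-empty and
    contains no Latin letters. Digits and punctuation are allowed."""
    return [(src, tgt) for src, tgt in pairs
            if tgt.strip() and not _LATIN_LETTER.search(tgt)]


# Hypothetical machine-translated pairs, for illustration only.
toy_pairs = [
    ("surgery was performed on 2007-12-25", "手術於 2007-12-25 進行"),
    ("he was admitted for scheduled Trabeculectomy", "他因預定的 Trabeculectomy 入院"),
]
for src, tgt in select_fully_translated(toy_pairs):
    print(src, "=>", tgt)
```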


6. Results and discussion

In Section 6.1 we analyze the experimental results of bilingual pattern acquisition, in Section 6.2 we show the experimental results of the simplification and restoration in the Simplification–Translation–Restoration (STR) framework, and in Section 6.3 we discuss the translation performance using the complete STR framework.

6.1. Bilingual pattern acquisition

Acquiring the bilingual patterns constitutes one of the most expensive parts of the overall framework due to the involvement of physicians. The cost of expert translation depends on the algorithms for mining significant patterns. We evaluate the performance of our pattern miner from two aspects: significance (Section 6.1.1) and diversity (Section 6.1.2), which are addressed in steps (a)–(c) and steps (d)–(e), respectively, in Section 4.1, then discuss the quality of the translated lexical patterns (Section 6.1.3), and finally conduct experiments on NTUH medical records.

Table 4 shows the number of n-grams after each step. The patterns after the parser step are less than 1% of those extracted by the Ngram Statistics Package (NSP) (Banerjee and Pedersen, 2003). Most of the patterns filtered by the Stanford Parser end with conjunctions, prepositions, or adjectives. The patterns are further reduced after the pruning step. Note that 5-gram patterns remain unchanged in this step because the length of 5 is the highest order among the extracted n-gram patterns. After the coverage step, we translate the class patterns and lexical patterns to obtain bilingual patterns. For class patterns, we perform pattern clustering and present the result to the physicians as described in step (e). The statistics of class patterns and derived clusters are shown in Table 5. We sample 5-grams and 4-grams for translations. Each cluster contains on average 2.17 and 3.18 patterns for 5-grams and 4-grams, respectively.
For lower-order patterns, average cluster sizes are much larger and patterns in each cluster are less similar. We asked 32 NTUH residents to translate class patterns in larger clusters to achieve diversity. Using an on-site tutorial, we instructed the domain experts how to use the Web annotation system. Then, they examined each pattern in the order presented. Based on their expertise, they translated common patterns in medical records, ignoring those considered unimportant. For lexical patterns, we experimented on 5-grams, which were first translated using Google Translate. These bilingual patterns were then reviewed by one member of the NTUH staff. The translations of these patterns were either accepted or modified based on the physicians’ knowledge.

6.1.1. Significance of class patterns

As illustrated in Section 4.2, the physicians used an annotation system to translate class patterns by either accepting or ignoring them. They considered the former as significant patterns and translated them into Chinese. In contrast, the latter were deemed non-significant. Table 6 shows the overall results of the translation. Except for a few

Table 4
Pattern counts after each step.

Length of n-gram   Ngram Statistics Package   Parser   Pruning
5-gram             2.6M                       7.6K     7.6K
4-gram             2.3M                       14.7K    10.8K
3-gram             1.6M                       19.1K    12.5K
2-gram             0.7M                       15.8K    9.2K
Total              7.2M                       57.2K    40.1K

Table 5
Results of pattern clustering for class patterns.

Length of n-gram   Number of patterns   Number of clusters   Average cluster size
5-gram             4634                 2149                 2.17
4-gram             6229                 1957                 3.18
3-gram             5099                 753                  6.77
2-gram             2097                 14                   149.79


Table 6
Results of class pattern translations.

Length of n-gram   Accepted   Wrong   Ignored   Accuracy
5-gram             642        6       432       59.22%
4-gram             348        3       152       69.00%
Total              990        9       584       62.33%

errors with incorrect translations, most of the translated patterns were suitable for integration into the background SMT. The accuracy of the presented patterns is defined as follows:

\[
\text{Accuracy} = \frac{\text{Accepted} - \text{Wrong}}{\text{Accepted} + \text{Ignored}}
\]

which is 59.22% and 69.00% for the 5-gram and 4-gram patterns, respectively. This demonstrates the effectiveness of our strategy to select linguistic patterns. Among the ignored patterns, some contained incorrect medical classes due to misclassified terms in our terminological databases. Parsing errors also produced some non-linguistic patterns.
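As a quick arithmetic check, the accuracy definition can be applied to the counts reported in Table 6. The function name below is illustrative; the counts are copied from the table.

```python
def pattern_accuracy(accepted: int, wrong: int, ignored: int) -> float:
    """Accuracy = (Accepted - Wrong) / (Accepted + Ignored)."""
    return (accepted - wrong) / (accepted + ignored)


# (Accepted, Wrong, Ignored) counts as reported in Table 6.
table6 = {
    "5-gram": (642, 6, 432),
    "4-gram": (348, 3, 152),
    "Total": (990, 9, 584),
}
for length, (accepted, wrong, ignored) in table6.items():
    print(f"{length}: {pattern_accuracy(accepted, wrong, ignored):.2%}")
```

The computed values agree with the Accuracy column of Table 6 (59.22%, 69.00%, and 62.33%).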


6.1.2. Diversity of class patterns

As shown in Table 6, only 990 patterns were translated by the physicians. We further exploited the diversity of these patterns by extending them based on coverage relation and pattern clustering in steps (d) and (e) of Section 4.1. For each translated 5-gram pattern, we produced a new 4-gram bilingual pattern if such a coverage relation existed. For each cluster with at least one pattern translated by a physician, we manually uncovered the translations of other similar patterns. During this manual extension, we discarded improper patterns containing errors such as typos or incorrect grammar. Table 7 shows examples of the discarded patterns. These common errors could be corrected by a post-editing system such as a grammar checker. Table 8 reports the results of our extension methods, showing the newly discovered patterns after the Pruning and Clustering steps. The 5-gram and 4-gram patterns after the extensions are 3.05 and 9.38 times more numerous than the patterns translated by physicians, respectively. Compared to the average cluster sizes in Table 5, we more efficiently allocated our expert efforts and achieved high diversity among the translated patterns.

6.1.3. Error analysis of online translator

After the NTUH staff reviewed the lexical patterns, we analyzed the data to examine the effectiveness of the general online translator applied to the specific domain. Among the 1174 reviewed bilingual 5-gram lexical patterns, 354 of them remained unchanged; the others were modified. In other words, the accuracy of Google Translate is 30.15%. We performed manual analysis on these 820 corrected translations, and show the results in Table 9. Half of the translations

Table 7
Common error patterns. Error parts are underlined.

English Patterns                        Error Types
under the impresison of DIAGNOSIS       typo
was admitted for schedualed SURGERY     typo
DIAGNOSIS and DIAGNOSIS was noted       grammar
DRUG and DRUG was given                 grammar

bs_bs_query

bs_bs_query

bs_bs_query

bs_bs_query

bs_bs_query

bs_bs_query

bs_bs_query

120 121 122 123 124 125

bs_bs_query

bs_bs_query

bs_bs_query

bs_bs_query

bs_bs_query

bs_bs_query

Table 8
Extension of bilingual class patterns.

| Length of n-gram | Physicians | Coverage | Clustering | Total |
|---|---|---|---|---|
| 5-gram | 636 | +0 | +1303 | 1939 |
| 4-gram | 345 | +1141 | +1750 | 3236 |
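The extension factors quoted in Section 6.1.2 follow directly from the counts in Table 8. The short check below recomputes the totals and the ratios to the physician-translated patterns; the dictionary layout and the function name are illustrative choices, not part of the original work.

```python
# Counts copied from Table 8: patterns translated by physicians, plus the
# patterns added through the coverage relation and through clustering.
table8 = {
    "5-gram": {"physicians": 636, "coverage": 0, "clustering": 1303},
    "4-gram": {"physicians": 345, "coverage": 1141, "clustering": 1750},
}


def extension_summary(row):
    """Return (total patterns, total divided by physician-translated patterns)."""
    total = row["physicians"] + row["coverage"] + row["clustering"]
    return total, total / row["physicians"]


for length, row in table8.items():
    total, factor = extension_summary(row)
    print(f"{length}: total={total}, factor={factor:.2f}")
```

The recomputed totals (1939 and 3236) and factors (3.05 and 9.38) agree with the figures reported in the text.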


Table 9
Error analysis of bilingual lexical patterns. Note that a translation may have multiple errors.

| Error type | Count | Percent |
|---|---|---|
| WSD | 411 | 50.12% |
| Style | 205 | 25.00% |
| Reordering | 163 | 19.88% |
| Others | 232 | 28.29% |
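The accuracy figure and the percentages in Table 9 can be reproduced from the raw counts reported in Section 6.1.3. The check below is only a worked calculation; the variable names are illustrative.

```python
reviewed = 1174   # bilingual 5-gram lexical patterns reviewed by the NTUH staff
unchanged = 354   # translations the reviewers left unmodified
corrected = reviewed - unchanged

# Error counts copied from Table 9; one translation may carry several errors.
error_counts = {"WSD": 411, "Style": 205, "Reordering": 163, "Others": 232}

accuracy = 100 * unchanged / reviewed
shares = {name: 100 * count / corrected for name, count in error_counts.items()}

print(f"corrected translations: {corrected}")
print(f"accuracy of the online translator: {accuracy:.2f}%")
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

Each percentage is taken over the 820 corrected translations. The error counts sum to 1011, which exceeds 820 because a single translation can have more than one error type, so the percentages do not add up to 100%.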

have WSD errors, as illustrated in Section 4.2. This suggests that the domain gap is a practical issue in SMT applications. Disagreements in writing style mainly come from regional differences in language use between mainland China and Taiwan. For example, both “烟鬼” and “老煙槍” are translations of “heavy smoker”, but Taiwanese writers prefer the latter. In addition, the lack of training corpora in the medical domain caused reordering errors (incorrect word orders between source and target languages). Other translation errors include missing words and extra words. Evaluation and analysis of MT output is itself an important research issue; readers can refer to the work of Vilar et al. (2006) for a gentle introduction.

6.2. Simplification and restoration in STR framework


Simplification and restoration are important pre-processing and post-processing steps in the STR framework. We sampled 1077 sentences in 18 medical records chosen from the NTUH corpus to examine their operations in depth. In our SMT application, the gap between in-domain and out-of-domain corpora is very large in terms of vocabulary and writing styles. The average length of a sentence in a medical record is short (12.58 words) compared to the background general corpus (29 words). On average, the in-domain segments (terminological and syntactic units) constituted over 36% of the dataset. Nearly 21% of the words in a sentence were OOVs, including surgical, diagnosis and drug terms; most of them were parts of terminological and syntactic units. This justifies our desire to alleviate the OOV problem in our cross-domain SMT system by simplifying the text in this special domain before sending it to the background SMT system.

The mined class patterns were organized into bilingual syntactic units according to their syntactic roles. Table 10 gives some samples of these syntactic units in the medical domain. Both source and target class patterns are listed for each bilingual syntactic unit. The capitalized words denote medical classes, which represent diseases or symptoms (DIAGNOSIS), medical tests (TEST), surgical or non-surgical treatments (TREATMENT), and so on.

Table 10
Samples of bilingual syntactic units.

| Source Syntactic Unit | Target Syntactic Unit | Syntactic Label |
|---|---|---|
| underwent TEST on DATE | 於 DATE 接受 TEST 檢查 (yu DATE jie-shou TEST jian-cha) | VP |
| DRUG was given for DIAGNOSIS | 使用 DRUG 用於治療 DIAGNOSIS (shi-yong DRUG yong-yu-zhi-liao DIAGNOSIS) | S |
| received TREATMENT with DRUG | 接受 TREATMENT 及 DRUG 治療 (jie-shou TREATMENT ji DRUG zhi-liao) | VP |
| DIAGNOSIS at the right REGION | 在右側 REGION 之 DIAGNOSIS (zai-you-ce REGION zhi DIAGNOSIS) | NP |
| TREATMENT of the right REGION | 右側 REGION 的 TREATMENT (you-ce REGION de TREATMENT) | NP |
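To make the role of the class slots in Table 10 concrete, the sketch below applies one bilingual syntactic unit to a toy input. It is a minimal illustration, not the system described in this paper: the regular-expression matching, the helper names `compile_source` and `translate_unit`, and the one-entry term lexicon are all assumptions introduced here.

```python
import re

# Medical class names that act as slots in the patterns of Table 10.
CLASS_NAMES = ("TEST", "DATE", "DRUG", "DIAGNOSIS", "TREATMENT", "REGION")


def compile_source(pattern):
    """Turn a source pattern into a regex with one named group per class slot.

    Assumes each class occurs at most once in the pattern, which holds for
    the samples in Table 10.
    """
    parts = []
    for token in pattern.split():
        if token in CLASS_NAMES:
            parts.append(f"(?P<{token}>.+?)")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts))


def translate_unit(text, source_pattern, target_pattern, lexicon):
    """Fill the target pattern with the (translated) slot values of `text`."""
    match = compile_source(source_pattern).fullmatch(text)
    if match is None:
        return None  # the pattern does not cover this text
    output = []
    for token in target_pattern.split():
        if token in CLASS_NAMES:
            term = match.group(token)
            # Terms missing from the lexicon (dates, for instance) pass through.
            output.append(lexicon.get(term, term))
        else:
            output.append(token)
    return " ".join(output)


toy_lexicon = {"chest X-ray": "胸部X光"}  # hypothetical entry for illustration
result = translate_unit(
    "underwent chest X-ray on 1/11",
    "underwent TEST on DATE",
    "於 DATE 接受 TEST 檢查",
    toy_lexicon,
)
print(result)
```

The point of the example is that the target pattern fixes the word order around the slots, so the reordering of DATE and TEST comes from the pattern itself rather than from the background SMT system.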


Table 11
Parsing and simplification performance on syntactic units. We manually simplify the 12 syntactic units with minority labels.

| Syntactic Unit | Count | Parsing Error | Simplification Error |
|---|---|---|---|
| NP | 697 | 4.85% | 28.15% |
| VP | 287 | 2.94% | 27.12% |
| PP | 228 | 1.85% | 8.89% |
| S | 342 | 1.23% | 9.59% |
| Other | 12 | 33.33% | N/A |
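Table 11 groups the syntactic units by label, and the accompanying discussion attributes most simplification errors to labels that no rule covers, citing the ADJP in “was admitted due to DIAGNOSIS”. The sketch below illustrates that failure mode with a label-keyed rule table. It is a toy under stated assumptions: the rule table, the `drop_unit` rule and the token spans are hypothetical and do not reproduce the simplification rules actually used in this work.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[List[str]], List[str]]


def drop_unit(unit_tokens: List[str]) -> List[str]:
    """Hypothetical rule: remove the whole unit from the sentence."""
    return []


def simplify(tokens: List[str], span: Tuple[int, int], label: str,
             rules: Dict[str, Rule]) -> List[str]:
    """Apply the rule registered for `label` to the half-open token span."""
    start, end = span
    rule = rules.get(label)
    if rule is None:
        # Unexpected syntactic label: the unit is left unchanged.
        return list(tokens)
    return tokens[:start] + rule(tokens[start:end]) + tokens[end:]


tokens = "was admitted due to DIAGNOSIS".split()
rules: Dict[str, Rule] = {"PP": drop_unit}  # no rule for ADJP yet

before = simplify(tokens, (2, 5), "ADJP", rules)
rules["ADJP"] = drop_unit                   # add the missing rule
after = simplify(tokens, (2, 5), "ADJP", rules)

print(" ".join(before))
print(" ".join(after))
```

Without an ADJP entry the unit passes through untouched; once a rule is registered for that label the phrase reduces to “was admitted”, which is the kind of fix by new simplification rules that the discussion suggests.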


Table 12
Various restoration methods.

| Restoration Method | Total Count (Percentage) | Sentence Count (Percentage) |
|---|---|---|
| Phrase Alignment | 1767 (60.93%) | 865 (86.16%) |
| Word Alignment | 981 (33.83%) | 657 (65.44%) |
| Probability-based Extraction | 152 (5.24%) | 137 (13.65%) |
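The percentages in Table 12 can be recomputed from the counts. The check below does so; the choice of the 1004 successfully restored sentences as the denominator of the sentence percentages is inferred here because it reproduces the reported values, and the variable names are illustrative.

```python
# Counts copied from Table 12.
applications = {
    "Phrase Alignment": 1767,
    "Word Alignment": 981,
    "Probability-based Extraction": 152,
}
sentences = {
    "Phrase Alignment": 865,
    "Word Alignment": 657,
    "Probability-based Extraction": 137,
}
sampled_sentences = 1077    # sentences sampled for this analysis
restored_sentences = 1004   # sentences restored successfully

total_applications = sum(applications.values())
application_share = {
    method: 100 * count / total_applications
    for method, count in applications.items()
}
# Inferred denominator: the restored sentences, not all sampled sentences.
sentence_share = {
    method: 100 * count / restored_sentences
    for method, count in sentences.items()
}
restoration_rate = 100 * restored_sentences / sampled_sentences

print(total_applications, f"{restoration_rate:.2f}%")
for method in applications:
    print(f"{method}: {application_share[method]:.2f}% of applications, "
          f"{sentence_share[method]:.2f}% of sentences")
```

The sentence counts sum to more than 1004 because one sentence can require several restoration methods, which is why the third column does not add up to 100%.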

Table 11 shows the number of syntactic units for each syntactic label. Parsing errors for a few n-grams were manually corrected. More than 70% of the n-grams were correctly simplified. Most of the errors came from unexpected syntactic labels, which can be resolved with new simplification rules. For example, the common syntactic unit “was admitted due to DIAGNOSIS” can be simplified to “was admitted”. However, the syntactic label ADJP, which covers “due to DIAGNOSIS”, was not considered, and the unit therefore remained unchanged after the simplification step.

In the restoration step, we received and post-processed the SMT results as described in Section 5.4. In our experiments, phrase-level alignments were provided by the Moses decoder, and word-level alignments were obtained as intermediate results after training the phrase table. Our method successfully performed the restorations on most of the test data. A total of 1004 (93.22%) of the 1077 sentences were successfully restored with the three proposed restoration methods. For the other 73 sentences, most of the failures resulted from translation errors made by Moses on the simplified phrases, which in turn affected the word alignments and made restoration difficult.

Table 12 shows the performance of each restoration method. The second column counts the number of applications of each restoration method on the test set, and the third column counts how many sentences included at least one application of each method. A total of 2900 applications of the restoration methods were made during testing. With phrase alignments, we handled over 60% of the simplified phrases. For the remaining simplified phrases, the background SMT model captured their contexts with higher confidence, and therefore they were translated together with the surrounding words. For these cases, we used word alignment information and applied the word alignment and probability-based extraction methods.
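As a rough illustration of how word alignments can locate the translation of a simplified phrase, the sketch below projects a source span onto the target side and rejects inconsistent projections. It is a generic span-projection routine written for this illustration, not the procedure of Section 5.4: the link probabilities, the `min_prob` threshold and the function name are assumptions.

```python
from typing import List, Optional, Tuple

# (source index, target index, link probability); the probability field and
# the threshold used below are assumptions made for this illustration.
Link = Tuple[int, int, float]


def project_span(links: List[Link], src_start: int, src_end: int,
                 min_prob: float = 0.0) -> Optional[Tuple[int, int]]:
    """Project the inclusive source span onto the target side.

    Returns the inclusive target span, or None when nothing is aligned or
    when the projected span also covers words aligned outside the source span.
    """
    kept = [(s, t) for s, t, p in links if p >= min_prob]
    targets = [t for s, t in kept if src_start <= s <= src_end]
    if not targets:
        return None
    lo, hi = min(targets), max(targets)
    for s, t in kept:
        if lo <= t <= hi and not (src_start <= s <= src_end):
            return None  # inconsistent: a target word belongs to another span
    return lo, hi


# Toy alignment with one weak link (3 -> 1) that crosses the span of interest.
links = [(0, 0, 0.9), (1, 2, 0.8), (2, 3, 0.9), (3, 1, 0.1), (3, 4, 0.9)]

strict = project_span(links, 0, 1)
relaxed = project_span(links, 0, 1, min_prob=0.5)
print(strict, relaxed)
```

With every link kept, the weak link makes the projection inconsistent and no span is returned; once links below the threshold are discarded, the span (0, 2) is recovered. This mirrors, in simplified form, the idea of deleting weak word alignments before extraction.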
Although the probability-based extraction accounted for only 5.24% of the restorations, it was applied to 13.65% of the sentences, as shown in the third column. This confirms that deleting weak word alignments can handle some inconsistencies during phrase extraction.

6.3. Complete simplification–translation–restoration framework

The framework of the advanced STR-based translation system is shown in Fig. 4. Various strategies to update the translation and language models in the basic STR-based MT system were evaluated. A total of 1004 sentences post-edited by the domain experts were used to train the advanced STR-based translation systems. Furthermore, we sampled 2.1M and 1.1M sentences from the NTUH medical record datasets and translated them with our STR-based translation system to yield 2.1M- and 1.1M-sentence pseudo bilingual in-domain corpora, which allowed us to analyze the effect of corpus size. In addition, we applied the data selection strategy specified in Section 5.5 to select 0.95M “good” translations from the 2.1M pseudo bilingual in-domain corpus. To evaluate the basic and advanced STR-based translation systems, we sampled 1000 sentences disjoint from the above corpora as the test data, and translated them manually as the ground truth. We adopted BLEU (Papineni et al., 2002) to evaluate the various MT methods.

Table 13 summarizes the methods along with the resources they used and the experimental results. The difference between successive methods is highlighted in bold. G is a general translation system that does not use simplification and


Table 13
Resources used in each translation system and experimental results.

| Method | Translation Rules | Translation Model | Language Model | Tuning Data | BLEU |
|---|---|---|---|---|---|
| G | | 6.8M government domain bilingual sentences | 18.8M government/news domain Chinese sentences | 1000 government domain bilingual sentences | 15.24 |
| F | 981 bilingual translation rules | 6.8M government domain bilingual sentences | 18.8M government/news domain Chinese sentences | 1000 government domain bilingual sentences | 28.04 |
| Post-Editing | | | | | |
| M1 | 981 bilingual translation rules | 6.8M government domain bilingual sentences | 18.8M government/news domain Chinese sentences | 200 post-edited medical domain sentences | 39.22 |
| M2 | 981 bilingual translation rules + 422 mined rules from post-editing | 6.8M government domain bilingual sentences | 18.8M government/news domain Chinese sentences | 200 post-edited medical domain sentences | 39.72 |
| M3 | 981 bilingual translation rules + 422 mined rules from post-editing | 6.8M government domain bilingual sentences | 804 post-edited Chinese sentences | 200 post-edited medical domain sentences | 39.72 |
| Transductive Learning (with pseudo in-domain bilingual data) | | | | | |
| M4 | 981 bilingual translation rules + 422 mined rules from post-editing | 1.1M pseudo medical domain bilingual sentences generated by M2 | 1.1M pseudo medical domain Chinese sentences generated by M2 | 200 post-edited medical domain sentences | 35.11 |
| M5 | 981 bilingual translation rules + 422 mined rules from post-editing | 2.1M pseudo medical domain bilingual sentences generated by M2 | 2.1M pseudo medical domain Chinese sentences generated by M2 | 200 post-edited medical domain sentences | 35.52 |
| Data Selection | | | | | |
| M6 | 981 bilingual translation rules + 422 mined rules from post-editing | 0.95M selected pseudo medical domain bilingual sentences generated by M2 | 0.95M selected pseudo medical domain Chinese sentences generated by M2 | 200 post-edited medical domain sentences | 40.71 |
| M7 | 981 bilingual translation rules | 0.95M selected pseudo medical domain bilingual sentences generated by M2 | 0.95M selected pseudo medical domain Chinese sentences generated by M2 | 200 post-edited medical domain sentences | 40.48 |

restoration. The BLEU of G is only 15.24. This shows that a general-domain translation system does not properly handle sentences in the medical domain. F, the basic STR-based translation system, achieves a BLEU of 28.04, which is much better than the general translation system.

M1–M7 are advanced STR-based translation systems refined by using different resources. In M1, a total of 200 of the 1004 post-edited sentences were selected to tune the parameters of Moses. M1 achieved a BLEU of 39.22. Comparing M1 and F, we find that tuning with post-edited sentences has a great effect, although the number of post-edited sentences is small. In M2, an additional 422 patterns appearing at least twice in the post-edited results were integrated into the translation rules. Introducing more patterns mined from post-edited data further increased the BLEU from 39.22 to 39.72. In M3, 804 post-edited sentences were used to train a new language model, without changing the translation model. However, this did not yield improved performance.

M4 and M5 used the 1.1M and 2.1M pseudo medical domain bilingual sentences generated by M2, respectively. Comparing M4 and M5, we find that a larger pseudo corpus helps slightly (35.52 versus 35.11). However, neither of them competes with M1–M3, due to the noise in the pseudo bilingual/monolingual corpora. Proper selection of pseudo corpora is necessary. M6 used 0.95M sentences chosen by data selection and achieved a BLEU of 40.71. Thus, the selected pseudo subset yielded much improved performance. To confirm the helpfulness of the 422 translation rules mined from post-editing, M7 excluded these rules from M6 and achieved a BLEU of 40.48, which is lower than that of M6.

All the advanced STR-based translation systems (M1–M7) were significantly better than the basic STR-based translation system F by a t-test (p < 0.05). The best method, M6, was significantly better than M1–M5 (p < 0.05), but was not significantly different from M7 (p = 0.1662).
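For readers unfamiliar with the metric behind these comparisons, the sketch below computes BLEU in the form given by Papineni et al. (2002): a geometric mean of clipped n-gram precisions up to 4-grams, multiplied by a brevity penalty. It is a minimal single-reference version without smoothing, written for illustration; the tokenization, the toy sentences and the function names are assumptions, and it is not the evaluation script used in the experiments.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0 to 100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, one empty precision zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


reference = "the patient was admitted for scheduled surgery".split()
exact = corpus_bleu([reference], [reference])
shorter = corpus_bleu(["the patient was admitted for surgery".split()],
                      [reference])
print(f"{exact:.2f} {shorter:.2f}")
```

An exact match scores 100, while dropping a single word lowers both the higher-order precisions and the brevity penalty. Because the score is computed over pooled n-gram counts, differences of a fraction of a point, such as the one between M6 and M7, are best read together with the significance test reported above.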
In summary, the STR-based framework is quite useful for domain adaptation, and the proper selection of translation results generated by the translation system to update the translation and language models is important in advanced STR-based translation systems.

We manually labeled the error types in the results of the basic STR-based system (F) and of the best-performing advanced method at each level of Table 13. M2 was the best-performing method with post-editing (PE) information, M5 was the best-performing method with post-editing information and transductive learning (TL), and M6 was the best-performing method with post-editing information, transductive learning, and data selection (DS). Fig. 10 shows the occurrences of word ordering errors and word-sense-disambiguation errors in the outcome of each method. In terms


Fig. 10. Ordering errors and WSD errors in different methods.

of word ordering errors, the performance of M2, M5, and M6 is similar, while F tends to generate more ordering errors. This shows that even the small amount of post-editing data reduces word ordering errors. The WSD errors of M2, M5, and M6 are also fewer than those of F. Comparing the WSD errors of M2 and M6, the pseudo bilingual training data introduced more WSD errors. With data selection, however, M6 performed better with respect to WSD.

We further analyzed the translation results of the best methods, M6 and M7, from two perspectives. First, we show how the mined rules improved the translation, although the difference between M6 and M7 was not significant. The following examples are given for reference. The underlined parts were translated correctly by the new patterns in M6.

(1) Example: Stenting was done from distal IVC through left common iliac vein to external iliac vein.
M7: 支架置入術 是 從 遠端 下腔靜脈 通過 從 左髂總靜脈 到 髂外靜脈 。 (Stenting is from distal IVC through left common iliac vein to external iliac vein. [wrong verb])
M6: 完成 支架置入術 從 遠端 下腔靜脈 通過 從 左髂總靜脈 到 髂外靜脈 。

(2) Example: We shifted the antibiotic to cefazolin.
M7: 我們 把 抗生素 頭孢唑啉 。 (We the antibiotic cefazolin [missing verb])
M6: 我們 把 抗生素 更換 為 頭孢唑啉 。

(3) Example: Enhancement of right side pleural, and mild pericardial effusion was underlined.
M7: 增強 方面 的 權利 胸腔 、 和 發現 有 輕微 的 心包積液 。 (“right” [side of body] is mistranslated to “permission”)
M6: 增強 的 右 胸腔 、 輕微 心包積液 被 注意到 。

In addition, we examine the factors that affected the translation performance of M6. Three factors, namely word ordering errors, word-sense-disambiguation errors and OOV (out-of-vocabulary) errors, are addressed as follows. The erroneous parts are underlined.

(1) Ordering errors. The correct translation result is “8天 的 治療 後 抗生素 中斷.” The current patterns are 2–5 grams, hence longer patterns cannot be captured.
Example: Antibiotics were discontinued after 8 days of treatment.
M6: 抗生素 中斷 後 8天 的 治療 。


(2) Word sense disambiguation errors. The correct translation of “post operation care” should be “術後照護”. However, the 1004 post-edited sentences are still not numerous enough to cover all possible patterns. Incremental updates will introduce more patterns and may decrease the number of translation errors.
Example: After tracheostomy, he was transferred to our ward for post operation care.
M6: 氣管切開術 後, 他 被 轉送到 我們 病房 為 員額 關懷 行動 。

(3) OOV errors. The word “transcatheter” is an OOV. Its translation should be “導管”.
Example: Transcatheter intravenous urokinase therapy was started on 1/11 for 24 hours infusion.
M6: transcatheter 靜脈 尿激酶 在 1/11 開始 進行 治療 24 小時 輸液 。

7. Conclusion and future work

This paper proposes a Simplification–Translation–Restoration framework for cross-domain applications in SMT. We integrate bilingual in-domain knowledge into a background out-of-domain SMT system, and thus simultaneously target the cross-domain and data sparseness problems. The in-domain text goes through identification, simplification, translation, and restoration steps. Important issues are addressed and discussed for each step, including (1) preparing bilingual in-domain knowledge, (2) creating a bilingual medical record corpus using a semi-automatic approach, (3) simplification with syntactic information, and (4) different restoration strategies for extracting simplified phrases from the SMT results. We conduct a number of experiments and evaluate the performance of our STR-based translation system through a case study involving medical record translation. The empirical results show the effectiveness of our approach at each step of the framework. Introducing a feedback mechanism to improve an STR-based translation system is another open research issue. It is critical to identify which parts (pre-processing, translation, and post-processing) must be modified through the feedback and how they are to be modified.

Several methods are proposed to integrate the mined translation rules, and to revise the translation model as well as the language model. The adaptation experiments show that the rules mined from the monolingual in-domain corpus are useful, and the effect of using the selected pseudo bilingual corpus is significant. Word ordering errors, word sense disambiguation errors, and OOV errors remain for future investigation.

Acknowledgments

This work was partially supported by the Ministry of Science and Technology (Taiwan) under contract MOST1012221-E-002-195-MY3. We are very thankful to National Taiwan University Hospital for providing the NTUH medical record dataset.

References

Axelrod, A., He, X., Gao, J., 2011. Domain adaptation via pseudo in-domain data selection. In: EMNLP 2011, pp. 355–362.
Axelrod, A., Resnik, P., He, X., Ostendorf, M., 2015. Data selection with fewer words. In: The Tenth Workshop on Statistical Machine Translation, pp. 58–65.
Aziz, W., Dymetman, M., Mirkin, S., Specia, L., Cancedda, N., Dagan, I., 2010. Learning an expert from human annotations in statistical machine translation: the case of out-of-vocabulary words. In: EAMT 2010.
Banerjee, S., Pedersen, T., 2003. The design, implementation, and use of the Ngram Statistics Package. In: The Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pp. 370–381.
Bertoldi, N., Federico, M., 2009. Domain adaptation for statistical machine translation with monolingual resources. In: The Fourth Workshop on Statistical Machine Translation, pp. 182–189.
Bertoldi, N., Simianer, P., Cettolo, M., Wäschle, K., Federico, M., Riezler, S., 2014. Online adaptation to post-edits for phrase-based statistical machine translation. Mach. Trans. 28 (3–4), 309–339.
Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., et al., 2014. Findings of the 2014 workshop on statistical machine translation. In: The Ninth Workshop on Statistical Machine Translation, pp. 12–58.
Callison-Burch, C., Koehn, P., Osborne, M., 2006. Improved statistical machine translation using paraphrases. In: NAACL 2006, pp. 17–24.
Carpuat, M., Wu, D., 2007. Improving statistical machine translation using word sense disambiguation. In: EMNLP 2007, pp. 61–72.
Chan, Y.S., Ng, H.T., Chiang, D., 2007. Word sense disambiguation improves statistical machine translation. In: ACL 2007, pp. 33–40.
Chen, H.B., Huang, H.H., Tjiu, J., Tan, C., Chen, H.H., 2011. Identification and translation of significant patterns for cross-domain SMT applications. In: Machine Translation Summit XIII, pp. 277–284.
Chiang, D., 2005. A hierarchical phrase-based model for statistical machine translation. In: ACL 2005, pp. 263–270.


Civera, J., Juan, A., 2007. Domain adaptation in statistical machine translation with mixture modeling. In: The Second Workshop on Statistical Machine Translation, pp. 177–180.
Costa-Jussà, M.R., Fonollosa, J.A.R., 2015. Latest trends in hybrid machine translation and its applications. Comput. Speech Lang. 32, 3–10.
Foster, G., Kuhn, R., 2007. Mixture model adaptation for SMT. In: The Second Workshop on Statistical Machine Translation, pp. 128–135.
Foster, G., Goutte, C., Kuhn, R., 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In: EMNLP 2010, pp. 451–459.
Haghighi, A., Liang, P., Berg-Kirkpatrick, T., Klein, D., 2008. Learning bilingual lexicons from monolingual corpora. In: ACL 2008, pp. 771–779.
Huang, L., Chiang, D., 2007. Forest rescoring: Faster decoding with integrated language models. In: ACL 2007, pp. 144–151.
Klein, D., Manning, C., 2003. Accurate unlexicalized parsing. In: ACL 2003, pp. 423–430.
Koehn, P., 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In: AMTA 2004, pp. 115–124.
Koehn, P., Och, F.J., Marcu, D., 2003. Statistical phrase-based translation. In: HLT/NAACL 2003, pp. 127–133.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al., 2007. Moses: Open source toolkit for statistical machine translation. In: ACL 2007 (Demonstration), pp. 177–180.
Lagarda, A.L., Ortiz-Martínez, D., Alabau, V., Casacuberta, F., 2015. Translating without in-domain corpus: machine translation post-editing with online learning techniques. Comput. Speech Lang. 32, 109–134.
Marton, Y., Callison-Burch, C., Resnik, P., 2009. Improved statistical machine translation using monolingually-derived paraphrases. In: EMNLP 2009, pp. 381–390.
Matsoukas, S., Rosti, A.I., Zhang, B., 2009. Discriminative corpus weight estimation for machine translation. In: EMNLP 2009, pp. 708–717.
Morin, E., Daille, B., Takeuchi, K., Kageura, K., 2007. Bilingual terminology mining – using brain, not brawn comparable corpora. In: ACL 2007, pp. 664–671.
Och, F.J., Tillmann, C., Ney, H., 1999. Improved alignment models for statistical machine translation. In: EMNLP 1999, pp. 20–28.
Papineni, K., Roukos, S., Ward, T., Zhu, W., 2002. BLEU: a method for automatic evaluation of machine translation. In: ACL 2002, pp. 311–318.
Resnik, P., Buzek, O., Hu, C., Kronrod, Y., Quinn, A., Bederson, B., 2010. Improving translation via targeted paraphrasing. In: EMNLP 2010, pp. 127–137.
Sibson, R., 1973. SLINK: an optimally efficient algorithm for the single-link cluster method. Comput. J. 16 (1), 30–34.
Ueffing, N., 2006. Using monolingual source-language data to improve MT performance. In: IWSLT 2006.
Ueffing, N., Haffari, G., Sarkar, A., 2007. Transductive learning for statistical machine translation. In: ACL 2007, pp. 25–32.
Vilar, D., Xu, J., D’Haro, L.F., Ney, H., 2006. Error analysis of statistical machine translation output. In: The 5th International Conference on Language Resources and Evaluation, pp. 697–702.
Wang, C., Collins, M., Koehn, P., 2007. Chinese syntactic reordering for statistical machine translation. In: EMNLP 2007, pp. 737–745.
Woodsend, K., Lapata, M., 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In: EMNLP 2011, pp. 409–420.
Wu, J., Chang, J.S., 2007. Learning to find English to Chinese transliterations on the web. In: EMNLP-CoNLL 2007, pp. 996–1004.
Wubben, S., Bosch, A., Krahmer, E., 2012. Sentence simplification by monolingual machine translation. In: ACL 2012, pp. 1015–1024.
Xiong, D., Liu, Q., Lin, S., 2006. Maximum entropy based phrase reordering model for statistical machine translation. In: ACL-COLING 2006, pp. 521–528.
Zhao, B., Eck, M., Vogel, S., 2004. Language model adaptation for statistical machine translation via structured query models. In: COLING 2004, pp. 411–417.
Zhu, Z., Bernhard, D., Gurevych, D.I., 2010. A monolingual tree-based translation model for sentence simplification. In: COLING 2010, pp. 1353–1361.
