Integration of speech recognition and machine translation: Speech recognition word lattice translation

Speech Communication 48 (2006) 321–334

Ruiqiang Zhang *, Genichiro Kikui

ATR Spoken Language Translation Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan

Received 30 December 2004; received in revised form 14 June 2005; accepted 20 June 2005

The research reported here was supported in part by a contract with the National Institute of Information and Communications Technology of Japan, entitled "A study of speech dialogue translation technology based on a large corpus."

* Corresponding author. E-mail address: [email protected] (R. Zhang).

Abstract

An important issue in speech translation is to minimize the negative effect of speech recognition errors on machine translation. We propose a novel statistical machine translation decoding algorithm for speech translation to improve speech translation quality. The algorithm translates the speech recognition word lattice, in which additional hypotheses are available to bypass a misrecognized single-best hypothesis. The decoding converts the recognition word lattice into a translation word graph by a graph-based search, followed by fine rescoring with an A* search. We show that a speech recognition confidence measure implemented by posterior probability is effective in improving speech translation. The proposed techniques were tested in a Japanese-to-English speech translation task, in which we measured the translation results in terms of a number of automatic evaluation metrics. The experimental results demonstrate a consistent and significant improvement in speech translation achieved by the proposed techniques.

© 2005 Elsevier B.V. All rights reserved.

1. Introduction

Speech recognizers play an important role in spoken language understanding. Although speech recognition research has been underway for decades, current state-of-the-art technology still cannot guarantee an error-free speech recognition system. A spoken language understanding system must therefore be designed to be robust to speech recognition errors: when speech recognition makes errors, the other components in the system should be immune or resistant to such misrecognition. In this paper we present an approach that alleviates the adverse effects of speech recognition errors in speech translation: a specialized machine translation algorithm that translates the word lattice output of speech recognition rather than the single-best result.



Most current speech translation systems have a cascaded structure: a speech recognition component followed by a machine translation component. Usually only the single-best outcome of speech recognition is used in the machine translation component. Speech translation therefore cannot achieve the same level of translation performance as perfect text input, due to the inevitable errors of speech recognition. To improve speech translation performance, several architectures have been proposed based on the principle of tightly integrating speech recognition and machine translation. Boitet and Seligman (1994) proposed using a speech recognition word lattice to create effective communication between the speech recognition and machine translation components; however, that translation system is rule-based and less optimized than statistical machine translation. Ney (1999) proposed a coupling structure to combine automatic speech recognition and statistical machine translation. Casacuberta et al. (2002) used a finite-state transducer that conveys acoustic and translation features in one framework; strictly speaking, this approach does not use the statistical speech translation structure of (Ney, 1999). Gao (2003) used a unified structure in which a maximum entropy approach is proposed to build the models of the entire speech translation system. Though the above architectures were proposed some years ago, no experimental results have been reported except in (Casacuberta et al., 2002), where the integration was implemented by a finite-state transducer and shown to be a promising alternative to the cascaded structure. Our approach is closely related to the ideas in (Ney, 1999) but with a different implementation. We used the speech recognition word lattice as the output of speech recognition and the input of machine translation. While in this structure the speech recognition component and the machine translation component are sequentially connected, more hypotheses are stored in the word lattice than in the single-best structure, and complementary information such as the acoustic model and language model scores can be forwarded to the machine translation component to enhance translation performance. Hence, this structure can be seen as an approximate solution

to the unified structure proposed above (Ney, 1999).

We use a statistical log-linear model as the translation model. In the field of statistical machine translation, the famous early models are the IBM models (Brown et al., 1993), which use Bayes' rule to convert P(e|f) into P(e)P(f|e). IBM Models 1–5 introduce various models for P(f|e) in order of increasing complexity. For example, there is only a lexical model in IBM Model 1, but there is a set of models in Model 4, including a fertility model, a dummy-word NULL model, a distortion model, and a lexical model. Another popular model is the "HMM" model (Vogel et al., 1996), which, unlike IBM Model 1, can capture word-order distortions. Recently, direct modeling of P(e|f) in the maximum entropy framework, the log-linear model, has been proven effective (Och, 2003; Och et al., 2004). It can integrate a number of features log-linearly and is widely used in phrase-based machine translation (Koehn, 2004). It has also been proven effective for speech translation tasks (Zhang et al., 2004). In this work we implemented a new lattice translation decoding algorithm specialized for speech translation that outperforms the N-best hypothesis translation proposed in (Zhang et al., 2004) in speed and efficiency. In the decoding we used a two-pass search strategy, a graph search followed by an A* search. In the first, graph search, we integrated IBM Model 1 into the log-linear model, while IBM Model 4 was integrated into the second, A* search. The lattice generated by the speech recognizer was downsized to speed up the decoding. We found that these techniques effectively improved speech translation quality. We also found that the sentence posterior probability of speech recognition was very useful to further improve speech translation: a significant translation improvement was achieved by filtering low-confidence hypotheses in the word lattice.

The remaining sections are organized as follows. Section 2 introduces our speech translation models and structure, and Section 3 provides detailed descriptions of the decoding algorithm for word lattice translation. In Section 4 we describe a lattice reduction method to reduce the computation. We describe the sentence posterior probability approach for choosing hypotheses in Section 5. Section 6 presents our experimental results and a detailed analysis, and Section 7 gives our discussion and conclusions.

2. Proposed speech translation structure

The proposed speech translation system is illustrated in Fig. 1. It consists of two major components: an automatic speech recognition (ASR) module and a speech recognition word lattice translation (WLT) module. The interface between the two modules is a recognition word lattice.

Fig. 1. Speech translation framework. (The ASR module converts the source utterance X into a word lattice, which the WLT module translates into the target translation E.)

The task of speech translation, in the case of Japanese-to-English translation, can be modeled as finding the target English sentence Ê that maximizes the probability P(E|X), where X is an acoustic observation of the source Japanese spoken utterance. If the intermediate output of ASR is defined as J, then

\hat{E} = \arg\max_E P(E|X) = \arg\max_E P(E) P(X|E)
        = \arg\max_E \{ P(E) \sum_J P(X, J|E) \}
        = \arg\max_E \{ P(E) \sum_J P(X|J) P(J|E) \},                (1)

where P(X|J) is the ASR acoustic model; P(J|E), the translation model; and P(E), the target language model. As indicated by Eq. (1), the best outcome Ê must be the one that maximizes the summation of the product of probabilities over all source language word sequences. To reduce the search space, we approximate the speech translation model in Eq. (1) by separating it into two decoding phases.

First, the ASR component generates a word lattice G containing the most likely source language hypotheses. Only the top-ranked hypotheses with an ASR score higher than TH remain in the word lattice. That is,

G = \{ J \mid P(J) P(X|J) > \mathrm{TH} \}.                (2)

In this step, we assume that J and E are independent, i.e., that E does not interfere with the generation of J in the machine translation component, though this assumption may risk pruning some promising hypotheses that could produce a higher translation score despite low acoustic scores in the ASR component. A bi-directional decoding approach without this assumption is discussed in (Casacuberta et al., 2002). However, it is worth making the assumption because the computation is reduced and the two modules, speech recognition and machine translation, can be optimized separately. By setting a lower threshold TH, more hypotheses can be included in the word lattice to compensate for the assumption.

Second, the WLT component outputs the target language sentence Ê that maximizes

\langle \hat{E}, \hat{J} \rangle = \arg\max_{E, J} \{ P(E) P(X|J) P(J|E) \}, \quad J \in G.                (3)

Meanwhile, a Ĵ producing the best Ê is also obtained in WLT, although the goal of speech translation is only Ê. In this step, we use maximization to approximate the summation in Eq. (1), a well-used technique in speech recognition, that is,

\sum_J P(X|J) P(J|E) P(E) \approx \max_J P(X|J) P(J|E) P(E).

Therefore, our speech translation structure as shown in Fig. 1 is an approximate implementation of the speech translation model of Eq. (1).
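As an illustration of the two-phase decoding just derived, the following sketch applies the Eq. (2) threshold and then the Eq. (3) joint argmax. It is a minimal sketch, not the authors' implementation; the candidate generator and the model scorers are hypothetical placeholders.

```python
# A minimal sketch (our illustration, not the authors' implementation) of the
# two-phase approximation in Eqs. (1)-(3): keep only lattice hypotheses whose
# ASR score clears the threshold TH (Eq. (2)), then take the joint argmax over
# the surviving hypotheses J and their candidate translations E (Eq. (3)).
# `translate_candidates`, `lm_logprob` and `tm_logprob` are hypothetical
# stand-ins for an SMT decoder, the target LM P(E) and the translation model P(J|E).

import math

def prune_lattice(asr_logscores, TH):
    """Eq. (2): asr_logscores maps each hypothesis J to log P(J)P(X|J)."""
    return {J: s for J, s in asr_logscores.items() if s > math.log(TH)}

def decode(asr_logscores, TH, translate_candidates, lm_logprob, tm_logprob):
    """Eq. (3): joint argmax over J in G and candidate translations E of J."""
    G = prune_lattice(asr_logscores, TH)
    best_pair, best_score = None, float("-inf")
    for J, asr_score in G.items():
        for E in translate_candidates(J):     # candidate target sentences for J
            score = lm_logprob(E) + asr_score + tm_logprob(J, E)
            if score > best_score:
                best_pair, best_score = (E, J), score
    return best_pair
```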


Although the WLT translation model is derived in the form of Eq. (3), we actually used a more advanced model, a feature-based log-linear model formalized as

\hat{E} = \arg\max_E \{ \lambda_0 \log P_{pp}(J|X) + \lambda_1 \log P_{lm}(E) + \lambda_2 \log P_{lm}(\mathrm{POS}(E)) + \lambda_3 \log P(\phi_0|E)
        + \lambda_4 \log N(\Phi|E) + \lambda_5 \log T(J|E) + \lambda_6 \log D(E, J) \},                (4)

where we define seven features as follows.

(i) ASR hypothesis posterior probability, P_pp(J|X). In Eq. (3) this role is played by the acoustic model P(X|J), representing the ASR contribution. However, we did not use P(X|J) but the posterior probability, P_pp, since P(X|J) has a large dynamic range and is difficult to normalize. The posterior probability is calculated as

P_{pp}(J|X) = \frac{P(X|J) P(J)}{\sum_{J_i} P(X|J_i) P(J_i)},                (5)

where the summation is made over all the hypotheses in the word lattice G, and J_i is a hypothesis in the word lattice.

(ii) Language model for target language word sequences, P_lm(E). We used a trigram language model in the experiments. For a target word sequence E = e_0 e_1 ... e_l, we define

P_{lm}(E) = P(e_0) P(e_1|e_0) \prod_{i=2}^{l} P(e_i | e_{i-2} e_{i-1}).                (6)

The trigram language model was trained with the CMU language model toolkit.

(iii) Language model for target language part-of-speech (POS) sequences, P_lm(POS(E)). The POS model is the same as the word language model except that every word is replaced by its corresponding syntactic class, the part-of-speech.

(iv) NULL model, P(\phi_0|E). The probability of \phi_0 Japanese words connecting to the empty cell, NULL, of the English sequence. This probability is computed by Eq. (31) in (Brown et al., 1993); we skip the description for simplicity, and readers can find the details in (Brown et al., 1993).

(v) Fertility model, N(\Phi|E), defined as

N(\Phi|E) = \prod_{i=1}^{l} \phi_i! \, n(\phi_i|e_i),

where n(\phi_i|e_i) is the probability that the ith English word, e_i, is connected to \phi_i Japanese words.

(vi) Lexical model, T(J|E). For a (J, E) bilingual sentence pair, j_1 ... j_m and e_0 ... e_l, we define

T(J|E) = \prod_{k=1}^{m} t(j_k | e_{a_k}),

where the series a_1^m = a_1 a_2 ... a_m is the alignment. If the kth Japanese word is connected to the ith English word, then a_k = i; if it is not connected to any English word, then a_k = 0. t(j_k|e_{a_k}) is the probability of the kth Japanese word being translated to the corresponding English word, e_{a_k}.

(vii) Distortion model, D(E, J). This is the alignment probability of the source and target sentence pair (J, E). We used Eqs. (45) and (46) in (Brown et al., 1993) to calculate the distortion probability.

Eq. (4) is a logarithmic extension of Eq. (1), except that the translation model P(J|E) is factorized by means of IBM Model 4 (Brown et al., 1993) and the acoustic feature is replaced by the posterior probability. The \lambda's, weighting coefficients for the features, can be obtained by a number of methods (Och, 2003; Zhang et al., 2004).

The WLT decoding algorithm employs two-pass decoding. The log-linear model in Eq. (4) is used in the second pass, the A* search. In the first pass, a simple log-linear model that integrates only a lexical model is used, where the lexical model is trained by IBM Model 1. The simple log-linear model is as follows:

\hat{E} = \arg\max_E \{ \lambda_0 \log P_{pp}(J|X) + \lambda_1 \log P_{lm}(E) + \lambda_2 \log P_{lm}(\mathrm{POS}(E)) + \lambda_5 \log T(J|E) \}.                (7)
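The posterior feature of Eq. (5) and the first-pass model of Eq. (7) can be written down compactly. The sketch below is our own illustration with placeholder feature functions and weights, not the authors' code.

```python
# A short sketch, not the authors' code: the lattice posterior of Eq. (5)
# computed from log-domain ASR scores, and the first-pass log-linear score of
# Eq. (7). The feature callables and the weights `lam` are illustrative
# placeholders for the models described above.

import math

def log_posteriors(asr_logscores):
    """Eq. (5) in the log domain:
    log P_pp(J|X) = log P(X|J)P(J) - log sum_i P(X|J_i)P(J_i).
    Uses the log-sum-exp trick for numerical stability."""
    m = max(asr_logscores.values())
    log_denom = m + math.log(sum(math.exp(s - m) for s in asr_logscores.values()))
    return {J: s - log_denom for J, s in asr_logscores.items()}

def first_pass_score(J, E, log_ppp, log_plm, log_ppos, log_t, lam):
    """Eq. (7): weighted sum of the ASR posterior, word LM, POS LM and
    IBM Model 1 lexical features; lam = (lambda0, lambda1, lambda2, lambda5)."""
    l0, l1, l2, l5 = lam
    return l0 * log_ppp(J) + l1 * log_plm(E) + l2 * log_ppos(E) + l5 * log_t(J, E)
```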

3. Speech recognition word lattice translation—WLT

In this section we describe our decoding algorithm for speech recognition word lattice translation, which is much more complicated than text translation. In contrast to text translation, where a single source sentence is known, there is no single source sentence for word lattice translation but a lattice containing multiple hypotheses. The right hypothesis to be translated is unknown until decoding is completed.

An example of a word lattice output by automatic speech recognition (ASR) is shown at the top of Fig. 2. Each edge, indicating a word in the lexicon, is associated with two nodes; a node marks the entry and exit states of its edges. A hypothesis is a full path traversing the lattice from the start node to the end node. Since statistical machine translation decoding is not time-synchronous, the rear of the lattice may be visited earlier than the front. As the decoding proceeds, both the source and the target sentence hypotheses are updated according to the model of Eq. (4). After the decoding is completed, the target sentence is found and aligned to a recognition hypothesis. This hypothesis is equivalent to the source sentence in text translation.

Fig. 2. Source language word lattice (top) and target language word graph (bottom). (In the figure, the source lattice consists of edges j0–j7; the target word graph runs from a start node s to an end node /s, and each of its nodes carries a coverage bit vector such as "11010010" or "10000010".)

Based on the IBM statistical machine translation approach (Brown et al., 1993), a number of decoding approaches have been proposed for text translation. The Candide system (Berger et al., 1994) used a stack decoder. Wang and Waibel (1997) implemented the A* algorithm. Tillmann et al. (1997) described a dynamic programming algorithm. Ueffing et al. (2002) adopted the graph + A* algorithm to generate N-best translation hypotheses. In our work, we also adopted graph + A* because this approach can keep more translation hypotheses in a compact structure. Especially for lattice translation, where

multiple ASR hypotheses must be considered, the graph + A* algorithm appears more appropriate.

Graph + A* decoding is two-pass decoding. The first pass uses a simple log-linear model that integrates IBM Model 1 and generates a word graph to save the most likely hypotheses. It basically converts a source language word lattice into a target language word graph. Information loss can be prevented because multiple hypotheses in the source language word lattice can be saved in the target language word graph. The second pass uses a more complicated model that integrates IBM Model 4 to output the best hypothesis by traversing the target word graph. This pass uses the A* algorithm. We describe the two-pass WLT algorithm in the following two sections.

3.1. First pass—from source word lattice to target word graph

The bottom of Fig. 2 shows an example of a translation word graph that corresponds to the recognition word lattice at the top. For simplicity we denote the bottom graph as TWG, meaning target word graph, and the top as SWL, meaning source word lattice. Each edge in the TWG is a target language word that is a translation of a source word in the SWL, produced either from word translations provided by lexical models or by fertility expansion using fertility models. For example, e0 is a translation of j0. Edges may be combined into the same node's structure, and new edges are


extended from the node. A node's structure consists of the following elements:

Target N-gram word sequence represents the latest N consecutive words in the hypothesis. For example, the node saves the last three target words if a trigram model is used.

Target N-gram part-of-speech sequence represents the part-of-speech sequence of the latest N consecutive target words of the hypothesis.

Target word e and source word index i indicate that this node was made by translating the ith source word into the target word e. If the target word is empty, the node was produced by increasing the previous edge's fertility.

Coverage represents the source words covered by the current node. It is a binary vector with a size equal to the number of edges in the SWL, initially set to 0, and it indicates the source edges already translated. For instance, if the jth source word is translated, the jth dimension is set to 1. If the node covers all the edges of a full path in the SWL, the node connects to the last node, the end, in the TWG. "Coverage" is an important element in the node's structure, by which we can judge the end of a node expansion and find uncovered source edges. For instance, "11010010" is the coverage of a node in Fig. 2, where the values of the 1st, 2nd, 4th and 7th dimensions are 1. Hence, this node has translated the SWL's edges j0, j1, j3 and j6, which compose a full path of the SWL. Likewise, "10000010" covers j0 and j6 only; all edges except j0 and j6 are uncovered source edges, candidates for extending the node.

There are two main operations for extending a node into new edges: DIRECT and ALIGN (a sketch of these operations is given after Algorithm 1 below).

DIRECT extends the hypothesis with a target word by translating an uncovered source word. The target word is determined by looking up the current target N-gram context and the possible translations of the uncovered source word. The N-gram language model provides the next possible word, which may be chosen if it is a translation of an uncovered source word according to the lexical models. The number of possible words is restricted by the scores of both the language and lexical models.

ALIGN extends the hypothesis by aligning one more uncovered source word to the current node, i.e., increasing the fertility of the last target word. Increasing the fertility of the NULL target word is conducted at the beginning of the TWG. An edge of the SWL is skipped if the resulting coverage does not lie on any full path in the SWL. If the node has covered a full path in the SWL, as indicated by the node's coverage, it is connected to the end node. When there are no nodes available for possible extension, the conversion is completed. The conversion algorithm is illustrated in Algorithm 1.

The entire process amounts to growing a graph that can be indexed in time slices, because new nodes are created from the old nodes of the last time slice. New nodes are created by DIRECT or ALIGN to cover the uncovered source edges and are connected to the old nodes. The newly generated nodes are then sorted and merged in the graph if they share the same node structure: the same coverage, the same translations, and the same N-gram sequence. If a node covers a full hypothesis in the SWL, it connects to the end. If no nodes need to be expanded, the conversion is finished.

While the TWG can contain many more hypotheses, pruning is still needed to prevent graph explosion. For this, we used two traditional pruning methods: threshold and histogram pruning. When a new node is generated, we calculate the score of the hypothesis and compare it with the current highest hypothesis score; if the score is higher than a threshold derived from the top hypothesis score, the node is kept. The score is calculated by Eq. (7), where all scores, including the ASR posterior probability P_pp, are calculated on the partial hypothesis from the start to the current node; P_pp uses the highest value over all the ASR hypotheses consistent with the current context. For histogram pruning, we set a limit on the number of new nodes added. If adding a new node does not exceed this limit, the node is kept in the graph; otherwise all hypotheses are sorted, and those with low scores are discarded.

The first decoding pass generates the TWG. Each path from the start to the end node is a translation hypothesis. In the first pass we incorporate a simpler translation model into the log-linear model, as


shown in Eq. (7), where we used IBM Model 1, the lexical model only. Other models derived from IBM Model 4, such as the fertility model, the distortion model, and the NULL model, are used in the second pass. The first pass serves to retain the most likely hypotheses in the translation word graph, and the second pass is a refined search that uses advanced models. This hierarchical search strategy is frequently used in speech recognition (Zhang et al., 2000). Algorithm 1 (Conversion Algorithm from SWL to TWG).

[1]:  Initialize graph buffer G[0] = 0; t = 0
[2]:  DO
[3]:    FOR EACH node n = 0, 1, ..., #(G[t]) DO
[4]:      IF (n covers A FULL PATH) NEXT
[5]:      FOR EACH edge l = 0, 1, ..., #(EDGES) DO
[6]:        IF (n covers l) NEXT
[7]:        IF (coverage of n plus l not on ANY SWL PATH) NEXT
[8]:        generate new node and push to G[t + 1]
[9]:    merge and prune nodes in G[t + 1]
[10]:   t = t + 1
[11]: WHILE (G[t] is not empty)
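To make Algorithm 1 and the node structure above concrete, here is a minimal Python sketch under simplifying assumptions: coverage is a bit vector, full SWL paths are given as bit masks, and lexicon access and Eq. (7) pruning are hidden behind hypothetical helpers. It is our own illustration, not the authors' implementation.

```python
# Illustrative sketch of the SWL-to-TWG conversion (Algorithm 1).
# Not the authors' code: `translations(j)` (lexical-model lookup for source
# edge j) and `prune` (threshold/histogram pruning by Eq. (7)) are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    ngram: tuple          # latest N-1 target words (trigram context)
    coverage: int         # bit vector over SWL edges, bit j = edge j translated
    last_target: str      # target word that created this node

def covers_full_path(coverage, full_paths):
    """A full path is itself a bit mask; the node is complete if it equals one."""
    return any(coverage == p for p in full_paths)

def on_some_path(coverage, full_paths):
    """Partial coverage must be a subset of at least one full path."""
    return any((coverage & ~p) == 0 for p in full_paths)

def expand(node, n_edges, full_paths, translations, ngram_order=3):
    """Generate successors by DIRECT (translate an uncovered edge) and
    ALIGN (attach an uncovered edge to the last target word)."""
    for j in range(n_edges):
        if node.coverage >> j & 1:
            continue                       # edge j already covered
        new_cov = node.coverage | (1 << j)
        if not on_some_path(new_cov, full_paths):
            continue                       # extension would leave every SWL path
        # DIRECT: emit a target word translating source edge j
        for e in translations(j):
            yield Node((node.ngram + (e,))[-(ngram_order - 1):], new_cov, e)
        # ALIGN: increase the fertility of the last target word
        yield Node(node.ngram, new_cov, node.last_target)

def convert(n_edges, full_paths, translations, prune):
    slices = [[Node((), 0, "<s>")]]        # G[0]
    t = 0
    while slices[t]:
        merged = {}
        for node in slices[t]:
            if covers_full_path(node.coverage, full_paths):
                continue                   # already connects to the end node
            for new in expand(node, n_edges, full_paths, translations):
                merged[new] = new          # merge nodes with identical structure
        slices.append(prune(list(merged.values())))   # threshold/histogram pruning
        t += 1
    return slices
```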

3.2. Second pass—an A* search to find the best outcome through the TWG

An A* search traverses the TWG generated in the last section; this is a best-first approach. All partial hypotheses generated are pushed into a priority queue, and the top hypothesis pops out of the queue first for the next extension. To execute the A* search, the hypothesis score D(h, n) of a node n is evaluated in two parts, the forward score F(h, n) and the heuristic estimate H(h, n): D(h, n) = F(h, n) + H(h, n). The calculation of F(h, n) begins from the start node and accumulates all node scores belonging to the hypothesis up to the current node, n. Because all nodes in the TWG save the N-gram context and the alignment of source and target sentences, F(h, n) can be


obtained by calculating all the models' scores at each node and integrating them by Eq. (4). For the calculation of H(h, n), we trace back through the TWG from the end node to the current node, n. H(h, n) is defined as the accumulated maximum probability of the models from the end node to the current node n. All the model scores reuse the scores already calculated in the first, graph search. The ASR posterior probability is not included in the backward score because it has already been counted in F(h, n). The backward probability is integrated by the model in Eq. (7).

In the second pass we used the model in Eq. (4), integrating IBM Model 4, to calculate F(h, n). However, we cannot use IBM Model 4 directly without approximation, because the calculation of two of the component models in Model 4 requires the source sentence to be known in advance. Of all the features in Eq. (4), the number of source words aligned to the target word NULL, P(\phi_0|E), depends on the number of words in the source sentence, and the computation of the distortion probability, D(E, J), is a function of the alignment between the source sentence and the target hypothesis. However, as mentioned before, the source sentence for word lattice translation is unknown until the decoding is completed. Hence, the probabilities P(\phi_0|E) and D(E, J) cannot be calculated precisely in the middle of decoding. Our method to fix this problem is to use the maximum over all possible hypotheses: for these two models, we calculated the scores for all the possible ASR hypotheses under the current context, and the maximum value was used as the model's probability.
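The best-first traversal above can be sketched with a standard priority-queue A* loop. This is a minimal illustration, not the authors' implementation; the successor function, the Eq. (4) forward increment and the precomputed backward heuristic are hypothetical callables.

```python
# A minimal A* sketch of the second pass over the TWG: best-first expansion
# with D(h, n) = F(h, n) + H(h, n). Not the authors' code; `forward_delta`
# (Eq. (4) log-score added by a node) and `heuristic` (backward estimate H
# precomputed in the first pass) are hypothetical callables.

import heapq
import itertools

def astar_second_pass(start, end, successors, forward_delta, heuristic):
    counter = itertools.count()        # tie-breaker so heapq never compares nodes
    # Queue entries: (-D, tie, F, hypothesis-as-tuple, current node)
    queue = [(-heuristic(start), next(counter), 0.0, (start,), start)]
    while queue:
        neg_d, _, f, hyp, node = heapq.heappop(queue)
        if node == end:
            return hyp, f              # best-scoring full hypothesis pops first
        for nxt in successors(node):
            f_new = f + forward_delta(hyp, nxt)
            d_new = f_new + heuristic(nxt)
            heapq.heappush(queue, (-d_new, next(counter), f_new, hyp + (nxt,), nxt))
    return None, float("-inf")
```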

4. Downsizing the source word lattice

Because ASR uses a time-synchronous decoding algorithm to generate the raw SWL, the same word identity can be repeatedly recognized in slightly different frames. As a result, the same word identity may appear in more than one edge. Direct conversion from SWL to TWG therefore incurs duplicated computation and an explosion of the TWG space. On the other hand, while the raw SWL contains hundreds of hypotheses, the top N-best hypotheses, which are the most significant, are only a small


portion of all hypotheses. We can reduce the size of the raw SWL by cutting off all hypotheses except the top N-best without degrading the translations. In reducing the size of the SWL, we follow one rule: the TWG is the translation counterpart of the SWL if and only if any full path in the TWG is a translation of a full path in the SWL.

We use the following steps to downsize the raw SWL. From the raw SWL we generate the N-best hypotheses as sequences of edge numbers. We list the word IDs of all the edges in the hypotheses, remove the duplicate words, and index the remainder with new edge IDs; the number of new edges is fewer than in the raw SWL. Next, we replace the edge sequence in each hypothesis with a sequence of new edge IDs. If more than one edge shares the same word ID within one hypothesis, we add another new edge ID for that word and replace the edge with it. Finally, we generate a new word lattice, with the new word list as its edges, that consists of the N-best hypotheses only. The raw SWL becomes the downsized SWL, which is much smaller than the raw SWL; in fact, in our experiments the word lattice is reduced by 50% on average.

Fig. 3 shows an example of lattice downsizing. The word IDs are shown in parentheses. There are five different word IDs for the eight edges in the raw SWL. Because j1 and j2 share the same word ID, Wb, they are merged in the reduced SWL, as are j6 and j7. In the reduction, the N-best hypotheses are extracted first. Then transfer rules are created to map them into new hypotheses with new edge IDs. After downsizing, one hypothesis is removed. In fact, the downsized SWL is just the N-best ASR hypotheses with newly assigned edge IDs. However, the hypothesis probability is still inherited by the downsized lattice.

Fig. 3. Example of lattice reduction.

In this paper we use lattice-hypothesis¹ to indicate the quantity of hypotheses in the lattice, defined as the number of hypotheses used to construct the TWG from the downsized SWL.

¹ Using lattice density is not suitable in this work because it is defined as the average number of edges per node.
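The downsizing steps above can be sketched as follows. This is our own illustration under the stated rule: only the N-best hypotheses survive, edges sharing a word ID are merged, and a repeated word inside one hypothesis gets an extra edge ID. The data layout is hypothetical.

```python
# A sketch of the lattice downsizing described above (not the authors' code):
# rebuild a small lattice containing only the N-best hypotheses, merging edges
# that carry the same word ID while duplicating an edge ID when the same word
# occurs more than once inside a single hypothesis.

def downsize(nbest, word_id_of):
    """nbest: list of hypotheses, each a list of raw edge indices.
    word_id_of: maps a raw edge index to its word ID.
    Returns (new_edges, new_hyps): new_edges[i] is a word ID and each new
    hypothesis is a list of indices into new_edges."""
    new_edges = []                 # new edge ID -> word ID
    edges_for_word = {}            # word ID -> new edge IDs already created for it
    new_hyps = []
    for hyp in nbest:
        seen = {}                  # word ID -> occurrences so far in this hypothesis
        new_hyp = []
        for raw_edge in hyp:
            w = word_id_of(raw_edge)
            k = seen.get(w, 0)     # k-th occurrence of w inside this hypothesis
            ids = edges_for_word.setdefault(w, [])
            if k == len(ids):      # need one more edge carrying this word ID
                ids.append(len(new_edges))
                new_edges.append(w)
            new_hyp.append(ids[k])
            seen[w] = k + 1
        new_hyps.append(new_hyp)
    return new_edges, new_hyps
```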

5. Selection of hypotheses by confidence measure (CM) filtering

As described above, the downsized SWL stores N ASR hypotheses. All of these hypotheses can find a counterpart in the TWG after the conversion if they are not removed by histogram and threshold pruning. Because we integrated the posterior probability of the hypotheses into the log-linear model, the hypotheses with the lowest posterior probability are the least likely to be used in the WLT module; they are most likely pruned in an earlier stage of the decoding process. However, hypotheses with extremely low posterior probability can still lead to a worse result if there is no hard decision to remove them. In the experiments we found that using the posterior probability as a confidence measure to filter low-confidence hypotheses could improve translations.

For all the hypotheses in the SWL, we used Eq. (5) to compute each hypothesis' posterior probability. We compared each one's posterior probability with that of the single-best hypothesis, P0, the highest posterior probability. If it exceeds a threshold, P0/T, where T is a confidence factor, the hypothesis is used in WLT; otherwise it is removed. By applying the confidence measure, WLT automatically selects the hypotheses that are converted into the TWG. Hence, for a given SWL, the number of hypotheses used for translation in WLT is determined by the confidence measure (CM) filtering.
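A minimal sketch of the CM filtering rule follows, assuming the posteriors of Eq. (5) have already been computed; the default confidence factor T = 10 below is the value used in Section 6.

```python
# A minimal sketch of the CM filtering above: keep a lattice hypothesis only
# if its posterior probability exceeds P0/T, where P0 is the posterior of the
# single-best hypothesis and T is the confidence factor.

def cm_filter(posteriors, T=10.0):
    """posteriors: dict mapping each hypothesis J to P_pp(J|X) from Eq. (5)."""
    p0 = max(posteriors.values())       # posterior of the single-best hypothesis
    return {J: p for J, p in posteriors.items() if p > p0 / T}
```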

6. Experiments

6.1. BTEC database and model training

The data used in this study were from the Basic Travel Expression Corpus (BTEC) (Kikui et al., 2003; Takezawa et al., 2002). BTEC consists of commonly used sentences published in travel guidebooks and tour conversations. This spoken language corpus, used for spoken language translation research, has been under joint development by the C-STAR members, including CMU, ATR, and CAS, among others. BTEC collects travel-related phrases, sentences, and dialogs, and currently it covers four languages: English, Chinese, Korean and Italian. Each utterance has corresponding translations in multiple languages. In our experiments we used the BTEC training data to train our models, the BTEC1 test data #1 as the development data for parameter optimization, and the BTEC1 test data #2 for evaluation. The statistics of the corpus are shown in Table 1.

Table 1
Training, development and test data from the Basic Travel Expression Corpus (BTEC)

                              Japanese      English
  Training      Sentences           468,595
                Words           3,647,240    3,132,305
  Development   Sentences               510
                Words               4,015        2,983
  Test          Sentences               508
                Words               4,112        2,951

The speech recognition engine used in the experiments was an HMM-based, large-vocabulary continuous speech recognizer developed at ATR. The acoustic HMMs were context-dependent triphone models with 2100 states in total, using 25-dimensional, short-time spectral features. In the first and second passes of decoding, we used a multi-class word bigram with a lexicon of 37,000 words plus 10,000 compound words. Furthermore, a word trigram was applied to rescore the recognition results. All the LMs and a number of translation models, including the lexicon, fertility, and distortion models, were trained on the same training data. The GIZA++ toolkit (Och and Ney, 2003) was applied to train the IBM models used in the log-linear model.

The log-linear parameter optimization minimizes the translation "distortion" between the reference translations, R, and the translated sentences, Ê:

\hat{\lambda}_1^M = \mathrm{optimize}\; D(\hat{E}, R),                (8)

where Ê is the set of translations of all utterances and R is the set of translation references for all utterances. We used 16 reference sentences for each utterance, created by human translators. Here, D is one of the translation evaluation criteria. To optimize the \lambda's, the parameters of the log-linear model in Eq. (4), we used the development data of 510 speech utterances. We used the 100-best recognition hypothesis translation approach described in (Zhang et al., 2004) to train the \lambda's. First, for each of the 100 hypotheses in the SWL we generated 1000 translations. Then, a number of features were extracted from each translation. Powell's algorithm (Press et al., 2000) was used to optimize these parameters. The criterion used in the parameter optimization was BLEU for all experiments. A discussion of the effect of other metrics on speech translation can be found in (Zhang et al., 2004).

To evaluate the translation quality, we used five automatic evaluation criteria: BLEU, NIST, mWER, mPER and GTM.

(i) BLEU (Papineni et al., 2002): a weighted geometric mean of the n-gram matches between test and reference sentences, multiplied by a brevity penalty that penalizes short translation sentences.

(ii) NIST (Doddington, 2002): an arithmetic mean of the n-gram matches between test and reference sentences, multiplied by a length factor that again penalizes short translation sentences.

(iii) mWER (Niessen et al., 2000): multiple-reference word error rate, which computes the edit distance (minimum number of insertions, deletions, and substitutions) between test and reference sentences.

(iv) mPER (Och, 2003): multiple-reference position-independent word error rate, which computes the edit distance without considering the word order.

(v) GTM (Turian et al., 2003): a unigram-based F-measure used to measure the similarity between test and reference sentences.

The BLEU and NIST scores were calculated by the downloadable tool, version v11a (http://www.nist.gov/speech/tests/mt/). GTM used version v1.2 (http://nlp.cs.nyu.edu/GTM/).

6.2. Results for word lattice translation

The WLT speech translation system can accept both text input and lattice input in the defined format. In the experiments the ASR system output the raw lattice. The speech recognition performance on our test data (BTEC test #2) is shown in Fig. 4, where the word and sentence accuracy for the single-best (lattice-hypothesis = 1) recognition are around 93% and 79%, respectively. As the lattice-hypothesis increases, the word and sentence recognition accuracy increase accordingly. The increase, however, almost stops at lattice-hypothesis = 20, where word and sentence accuracy are 96% and 87%, respectively. The figure shows the best accuracy among all the hypotheses for a given lattice-hypothesis, and we observe that the sentence accuracy increases by 8% with increasing lattice-hypothesis.

Fig. 4. Word and sentence accuracy in speech recognition (accuracy plotted against lattice-hypothesis from 1 to 100).

The lattice required for WLT was generated by the lattice reduction approach described in Section 4. We set the number of ASR hypotheses to 100 when we created the downsized SWL. The downsized SWL was then translated by the WLT translation module. While the downsized SWL contained 100 ASR hypotheses, the actual number of hypotheses used in WLT can be designated by changing the lattice-hypothesis.

For a given lattice-hypothesis, we carried out the translation experiments under two conditions: with and without CM filtering. With CM filtering, the number of hypotheses used decreased further. The following experiments compared the results with and without CM filtering for a given lattice-hypothesis.

Fig. 5 presents the translation results of WLT, showing the change of BLEU with increasing lattice-hypothesis. To calculate the BLEU score of the translations for the test data, we used 16 references for each test utterance. The bottom curve represents the translation without CM filtering, while the top curve is the one with it. The worst BLEU score occurs when only the single-best, lattice-hypothesis = 1, was used in the translation. Although the BLEU scores of the two types of translations increase as the lattice-hypothesis increases, and the results at lattice-hypothesis = 20 appear to be the best for both, the difference between the two is obvious. The translation improvement with CM filtering is more stable with increasing lattice-hypothesis, and the improvement in amplitude is greater than that without CM filtering. On the contrary, the translation improvement without CM filtering is unstable and fluctuating. The results of Fig. 5 were obtained when the confidence factor explained in Section 5 was set to T = 10; we chose this value because it produced the best translation results.

Fig. 5. Comparison of translation results with CM filtering and without CM filtering (BLEU, roughly 0.54–0.56, plotted against lattice-hypothesis from 1 to 100).

Fig. 5 shows the overall translation improvements brought by the lattice translation. However, the overall improvement is not large because the ratio of wrongly recognized sentences to the whole test data is small. We are able to demonstrate the significant effect of lattice translation from a second point of view in the following analysis.

In the word lattice translation, some translations were produced by the single-best hypotheses and others by the non-single-best hypotheses. To compare the single-best translation with the word lattice translation in detail, we divided the test data into two classes based on the results of word lattice translation: utterances whose single-best hypothesis was used in WLT, and utterances whose non-single-best hypothesis was used. For the utterances divided in this way, we also generated the single-best translations using the text translator (equivalent to setting lattice-hypothesis = 1 in WLT). Therefore, we could obtain four sets of translations defined as follows:

LT1: translations of WLT produced by the utterances whose single-best hypotheses were used.
LT2: translations of WLT produced by the utterances whose non-single-best hypotheses were used.
ST1: translations by the text translator of the utterances in LT1.
ST2: translations by the text translator of the utterances in LT2.

The classes "LT1" and "ST1" used the same ASR hypotheses for translation, with the difference that "LT1" was created by the WLT (lattice-hypothesis > 1), while "ST1" was produced by the text translator (lattice-hypothesis = 1). The classes "LT2" and "ST2" used the same utterances but different ASR hypotheses and different translators. For each lattice-hypothesis, the classification of the utterances was different because the classification was conducted after the WLT translation.

For the results to be convincing, we applied multiple translation evaluation metrics; this time we used NIST to measure translations of the four classes. The results are shown in Figs. 6 and 7, where for each lattice-hypothesis there are two vertical bars. The first bar indicates the result for "ST2" in Fig. 6 and "ST1" in Fig. 7, while the second bar presents the result for "LT2" in Fig. 6 and "LT1" in Fig. 7.

In Fig. 6, we found that the results for "LT2" were consistently much better than those for "ST2" at every lattice-hypothesis; in fact, "LT2" surpassed "ST2" by 30% on average. Remember that "ST2" was made from the single-best hypotheses whereas "LT2" was created from the non-single-best hypotheses. This proves that the WLT algorithm implemented in this work can choose a non-single-best hypothesis for translation if the single-best one cannot produce quality translations; most likely, the speech recognition system made a recognition error and the better recognition result was not the single-best.

Fig. 6. Translation comparison of ST2 (in the text translator using single-best hypotheses) and LT2 (in the lattice translator using non-single-best hypotheses); NIST scores for lattice-hypothesis values from 2 to 100.

However, we found that in Fig. 7 the results for "LT1" were a little worse than those for "ST1". This is because the word lattice translation contains multiple recognition hypotheses, while the single-best translation contains only one hypothesis, although the decoding beam size in the lattice translation was the same as in the single-best translation. Hence, each hypothesis in the lattice translation is in effect allocated a smaller beam than in the single-best translation. In principle, decoding results using a narrow beam are inferior to those using a wide beam. Consequently, it is not strange that in Fig. 7 the results for "LT1" are worse than those for "ST1".

Fig. 7. Translation comparison of ST1 (in the text translator) and LT1 (in the lattice translator) using the same single-best hypotheses; NIST scores, roughly 7.0–7.35, for lattice-hypothesis values from 2 to 100.

Finally, in Table 2 we present the detailed results of one of our experiments. The translations were evaluated by the five criteria: BLEU, NIST, mWER, mPER and GTM.

Table 2
Translation results at different lattice-hypothesis values and test data divisions

             BLEU     NIST    GTM      mWER     mPER
  LH = 1     0.545    6.04    0.659    0.417    0.382
  LH = 20    0.557    6.18    0.663    0.410    0.377
  LT1        0.594    7.32    0.698    0.376    0.346
  ST1        0.591    7.32    0.699    0.375    0.345
  LT2        0.327    4.08    0.566    0.577    0.512
  ST2        0.247    3.28    0.527    0.651    0.564

In the first column, "LH = 1" indicates the overall single-best translations; "LH = 20" the overall lattice translation when the lattice-hypothesis = 20; "LT1" the lattice translations aligned to the single-best hypotheses when the lattice-hypothesis = 20; "LT2" the lattice translations produced from the non-single-best hypotheses when the lattice-hypothesis = 20; "ST1" the single-best translations of the utterances of LT1; and "ST2" the single-best translations of the utterances of LT2. BLEU, NIST and GTM measure accuracy: a larger value means a better translation. mWER and mPER measure error rate instead. The results in Table 2 were obtained with CM filtering.

The following conclusions can be derived from Table 2:

• Overall translation improvements over the single-best translation are obtained by lattice translation, because the translations at LH = 20 are better than those at LH = 1 in terms of all metrics.

• The significant translation improvements on the non-single-best ASR hypotheses prove that lattice translation can make use of more appropriate hypotheses for translation than the single-best hypotheses, because the results of "LT2" are much better than those of "ST2".
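Before turning to the discussion, the λ tuning described in Section 6.1 (Eq. (8)), rescoring fixed n-best lists and maximizing BLEU with Powell's method, can be sketched as follows. This is our own minimal illustration under simplifying assumptions; the `corpus_bleu` function, the n-best data layout and the initial weights are hypothetical placeholders rather than the authors' tools.

```python
# A minimal sketch of the lambda tuning of Eq. (8): rescore fixed n-best
# translation lists with candidate weights and let Powell's method maximize
# corpus BLEU. `corpus_bleu` and the data layout are hypothetical.

import numpy as np
from scipy.optimize import minimize

def tune_lambdas(nbest_features, nbest_texts, references, corpus_bleu, lam0):
    """nbest_features[u]: (n_candidates x n_features) array of log-feature values
    for utterance u; nbest_texts[u]: the candidate strings; references[u]: the
    16 reference translations for utterance u."""

    def neg_bleu(lam):
        # For every utterance, pick the candidate with the highest weighted score.
        hyps = [texts[int(np.argmax(feats @ lam))]
                for feats, texts in zip(nbest_features, nbest_texts)]
        return -corpus_bleu(hyps, references)      # minimize the negative BLEU

    result = minimize(neg_bleu, np.asarray(lam0, dtype=float), method="Powell")
    return result.x
```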

7. Discussion and conclusions

This paper describes our work on improving speech translation performance. We proposed new techniques to achieve this purpose and implemented a new speech translation system, WLT, that can directly translate a speech recognition word lattice. Furthermore, we implemented an effective graph + A* decoding algorithm in WLT and used a word lattice size reduction method to reduce the computational load and improve translation quality. While all these techniques improved speech translation, we found that the largest gain was achieved by applying the confidence measure in WLT: by choosing ASR hypotheses based on posterior probability, speech translation quality was raised to a new, higher level. We measured translation quality using multiple automatic evaluation criteria, and the proposed techniques were consistently proved effective and efficient.

In addition to the translation improvements observed above, it is worthwhile to note other issues not covered in the previous sections. One significant advantage of our approach is the reduction of computation costs in decoding. We compared the approach with the baseline text translation for speed. These results are shown in Table 3, where t is the average decoding time of text translation for one sentence, in seconds per sentence. Running on a 2.8 GHz CPU, t is around 19 s on average for the test set of 508 utterances, whose average utterance length is 9.8 words. We found that the total decoding time of our approach, a reduced lattice with CM filtering, is 1.5 times slower than text translation. The raw lattice's translation is, however, the slowest, about 5.3 times slower than text translation. In contrast, the lattice reduction approach improves the speed over the raw lattice by about 3.5 times. The speed of a reduced lattice without CM filtering is also improved, but less than that with CM filtering. Regarding total decoding time, the first pass, the graph search, accounts for 98% of the time consumed, while the A* search only takes 2%. All the results were measured under consistent conditions: beam width, thresholds, and models.

Table 3
Decoding speed comparison

            Text     With CM    Without CM    Raw lattice
  Graph     0.98t    1.45t      2.19t         5.26t
  A*        0.02t    0.02t      0.02t         0.04t

The size of the training data significantly affected the translations; performance fell dramatically when fewer training data were used. As Table 4 shows, the BLEU score when using 468K training sentences is much better than when using 125K. Here, "t-table" and "N-gram" are respectively the size of the translation table in the lexical model and the number of trigrams in the target language model. Improvements from lattice translation were observed under all conditions. We show the results when the lattice-hypothesis = 1 and when it is 20.

Table 4
Effect of the size of training data on translations

  Training data    t-Table    N-gram    LH = 1    LH = 20
  125K             150K       180K      0.364     0.378
  310K             360K       420K      0.466     0.482
  468K             530K       660K      0.545     0.557

Section 6 mentioned log-linear model parameter optimization. The effect of optimizing the lambda parameters is as significant in lattice translation as in other work (Och, 2003; Zhang et al., 2004). Table 5 shows the results: the numbers before and after the "/" indicate the translation results without and with parameter optimization, respectively. The parameters were tuned to optimize the BLEU metric by Powell's method. The translation improvement after parameter optimization is obvious.

Table 5
Effect of parameter optimization

            BLEU         NIST         GTM          mWER         mPER
  LH = 1    0.51/0.54    5.52/6.04    0.55/0.66    0.48/0.42    0.38/0.36
  LH = 20   0.52/0.56    5.73/6.18    0.57/0.66    0.46/0.41    0.36/0.34

Speech translation is attracting more attention from researchers in both the ASR and machine translation communities. The coupling of ASR and SMT has been presented in (Ney, 1999; Gao, 2003; Casacuberta et al., 2002), approaches that involve simultaneous decoding of speech recognition and machine translation. In our work, ASR and SMT are separate but still tightly related by the word lattice. The same idea was used in (Boitet and Seligman, 1994), but we implemented a speech translation system in the framework of statistical machine translation. A recent paper (Saleem et al., 2004) shares ideas similar to ours; however, we further demonstrated that using a confidence measure makes speech translation more robust to changes in the lattice-hypothesis.

The WLT developed in this work for speech translation is an extension of a word-based statistical machine translation system written for text translation. This SMT system achieved performance comparable to the phrase-based translation approach in a recent evaluation (Akiba et al., 2004). While the main features in the translation


module come from IBM Model 4, our translation model is not pure IBM Model 4 but a log-linear model incorporating features derived from IBM Model 4. Currently, we use only word-to-word pairs in the log-linear model, though we expect that the system can be improved further if we integrate phrase-to-phrase translation pairs into the log-linear model.

Acknowledgements

We would like to thank Dr. Seiichi Yamamoto, Dr. Yoshinori Sagisaka, Dr. Hirofumi Yamamoto, Dr. W.K. Lo, and Dr. Taro Watanabe for their assistance with this work. We also thank the anonymous reviewers for their comments and suggestions.

References

Akiba, Y., Federico, M., Kando, N., Nakaiwa, H., Paul, M., Tsujii, J., 2004. Overview of the IWSLT04 evaluation campaign. In: Proc. IWSLT 2004, ATR, Kyoto, Japan.

Berger, A., Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Lafferty, J., Printz, H., Ures, L., 1994. The Candide system for machine translation. In: Proc. ARPA Workshop on Human Language Technology.

Boitet, C., Seligman, M., 1994. The "whiteboard" architecture: a way to integrate heterogeneous components of NLP systems. In: Proc. COLING 1994.

Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L., 1993. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19 (2), 263–311.

Casacuberta, F., Vidal, E., Vilar, J.M., 2002. Architectures for speech-to-speech translation using finite-state models. In: Proc. Speech-to-Speech Translation Workshop, Philadelphia, PA, pp. 39–44.

Doddington, G., 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proc. ARPA Workshop on Human Language Technology.

Gao, Y., 2003. Coupling vs. unifying: modeling techniques for speech-to-speech translation. In: Proc. EuroSpeech 2003, Geneva, pp. 365–368.

Kikui, G., Sumita, E., Takezawa, T., Yamamoto, S., 2003. Creating corpora for speech-to-speech translation. In: Proc. EUROSPEECH 2003, Geneva, pp. 381–384.

Koehn, P., 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In: Proc. AMTA 2004, Washington, DC.

Ney, H., 1999. Speech translation: coupling of recognition and translation. In: Proc. ICASSP 1999, Vol. 1, Phoenix, AZ, pp. 517–520.

Niessen, S., Och, F.J., Leusch, G., Ney, H., 2000. An evaluation tool for machine translation: fast evaluation for machine translation research. In: Proc. LREC 2000, Athens, Greece, pp. 39–45.

Och, F.J., 2003. Minimum error rate training in statistical machine translation. In: Proc. ACL 2003, pp. 160–167.

Och, F.J., Ney, H., 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. 29 (1), 19–51.

Och, F.J., Gildea, D., Khudanpur, S., Sarkar, A., Yamada, K., Fraser, A., Kumar, S., Shen, L., Smith, D., Eng, K., Jain, V., Jin, Z., Radev, D., 2004. A smorgasbord of features for statistical machine translation. In: Proc. HLT-NAACL 2004, Boston, USA.

Papineni, K.A., Roukos, S., Ward, T., Zhu, W.-J., 2002. BLEU: a method for automatic evaluation of machine translation. In: Proc. ACL 2002, Philadelphia, PA, pp. 311–318.

Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., 2000. Numerical Recipes in C++. Cambridge University Press, Cambridge, UK.

Saleem, S., Chen Jou, S., Vogel, S., Schultz, T., 2004. Using word lattice information for a tighter coupling in speech translation systems. In: Proc. ICSLP 2004, Jeju, Korea.

Takezawa, T., Sumita, E., Sugaya, F., Yamamoto, H., Yamamoto, S., 2002. Toward a broad-coverage bilingual corpus for speech translation of travel conversations in the real world. In: Proc. LREC 2002, Las Palmas, Canary Islands, Spain, pp. 147–152.

Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., 1997. A DP-based search using monotone alignments in statistical translation. In: Proc. ACL/EACL 1997, Madrid, Spain, pp. 313–320.

Turian, J.P., Shen, L., Melamed, I.D., 2003. Evaluation of machine translation. In: Proc. MT Summit IX, New Orleans, USA, pp. 386–393.

Ueffing, N., Och, F.J., Ney, H., 2002. Generation of word graphs in statistical machine translation. In: Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP 2002), Philadelphia, PA, pp. 156–163.

Vogel, S., Ney, H., Tillmann, C., 1996. HMM-based word alignment in statistical translation. In: Proc. COLING 1996, Copenhagen, pp. 836–841.

Wang, Y., Waibel, A., 1997. Decoding algorithm in statistical machine translation. In: Proc. ACL/EACL 1997, Madrid, Spain, pp. 366–372.

Zhang, R., Black, E., Finch, A., Sagisaka, Y., 2000. Integrating detailed information into a language model. In: Proc. ICASSP 2000, Istanbul, pp. 1595–1598.

Zhang, R., Kikui, G., Yamamoto, H., Watanabe, T., Soong, F., Lo, W.K., 2004. A unified approach in speech-to-speech translation: integrating features of speech recognition and machine translation. In: Proc. COLING 2004, Geneva.

Ney, H., 1999. Speech translation: coupling of recognition and translation. In: Proc. ICASSP 1999, Vol. 1, Phoenix, AR, pp. 517–520. Niessen, S., Och, F.J., Leusch, G., Ney, H., 2000. An evaluation tool for machine translation: fast evaluation for machine translation research. In: Proc. LREC, 2000, Athens, Greece, pp. 39–45. Och, F.J., 2003. Minimum error rate training in statistical machine translation. In: Proc. ACLÕ2003, pp. 160–167. Och, F.J., Ney, H., 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. 29 (1), 19– 51. Och, F.J., Gildea, D., Khudanpur, S., Sarkar, A., Yamada, K., Fraser, A., Kumar, S., Shen, L., Smith, D., Eng, K., Jain, V., Jin, Z., Radev, D., 2004. A smorgasbord of features for statistical machine translation. In: Proc. HLT-NAACL, Boston, USA. Papineni, K.A., Roukos, S., Ward, T., Zhu, W.-J., 2002. Bleu: a method for automatic evaluation of machine translation. In: Proc. ACL 2002, Philadelphia, PA, pp. 311–318. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., 2000. Numerical Recipes in C++. Cambridge University Press, Cambridge, UK. Saleem, S., Chen Jou, S., Vogel, S., Schultz, T., 2004. Using word lattice information for a tighter coupling in speech translation systems. In: Proc. ICSLP 2004, Jeju, Korea. Takezawa, T., Sumita, E., Sugaya, F., Yamamoto, H., Yamamoto, S., 2000. Toward a broad-coverage bilingual corpus for speech translation of travel conversations in the real world. In: Proc. LREC 2002, Las Palmas, Canary Islands, Spain, pp. 147–152. Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., 1997. A DPbased search using monotone alignments in statistical translation. In: Proc. ACL/EACL 1997, Madrid, Spain, pp. 313–320. Turian, J.P., Shen, L., Melamed, I.D., 2003. Evaluation of machine translation. In: Proc. MT Summit IX, New Orleans, USA, pp. 386–393. Ueffing, N., Och, F.J., Ney, H., 2002. Generation of word graphs in statistical machine translation. In: Proc. Conf. on Empirical Methods for Natural Language Processing (EMNLP02), Philadelphia, PA, pp. 156–163. Vogel, S., Ney, H., Tillman, C., 1996. HMM word-based alignment in statistical machine translation. In: Proc. COLING. Copenhagen, pp. 836–841. Wang, Y., Waibel, A., 1997. Decoding algorithm in statistical machine translation. In: Proc. ACL/EACL 1997, Madrid, Spain, pp. 366–372. Zhang, R., Black, E., Finch, A., Sagisaka, Y., 2000. Integrating detailed information into a language model. In: Proc. ICASSPÕ2000, Istanbul, pp. 1595–1598. Zhang, R., Kikui, G., Yamamoto, H., Watanabe, T., Soong, F., Lo, W.K., 2004. A unified approach in speech-to-speech translation: integrating features of speech recognition and machine translation. In: Proc. Coling 2004, Geneva.