Rapid bootstrapping of statistical spoken dialogue systems

Ruhi Sarikaya
IBM T.J. Watson Research Center, 1101 Kitchawan Road/Route 134, Yorktown Heights, NY 10598, United States
Speech Communication 50 (2008) 580–593
Received 6 January 2006; received in revised form 11 February 2008; accepted 25 March 2008

Abstract

Rapid deployment of statistical spoken dialogue systems poses portability challenges for building new applications. We discuss the challenges that arise and focus on two main problems: (i) fast semantic annotation for statistical speech understanding and (ii) reliable and efficient statistical language modeling using limited in-domain resources. We address the first problem by presenting a new bootstrapping framework that uses a majority-voting based combination of three methods for the semantic annotation of a "mini-corpus" that is usually manually annotated. The three methods are a statistical decision tree based parser, a similarity measure and a support vector machine classifier. The bootstrapping framework results in an overall cost reduction of about a factor of two in the annotation effort compared to the baseline method. We address the second problem by devising a method to efficiently build reliable statistical language models for new spoken dialog systems, given limited in-domain data. This method exploits external text resources that were collected for other speech recognition tasks, as well as dynamic text resources acquired from the World Wide Web. The proposed method is applied to a spoken dialog system in a financial transaction domain and to a natural language call-routing task in a package shipment domain. The experiments demonstrate that language models built using external resources, when used jointly with the limited in-domain language model, result in relative word error rate reductions of 9–18%. Alternatively, the proposed method can be used to produce a 3-to-10 fold reduction in the in-domain data requirement to achieve a given performance level.

© 2008 Elsevier B.V. All rights reserved.

Keywords: Spoken dialog systems; Semantic annotation; Web-based language modeling; Rapid deployment

1. Introduction

Spoken language technology research over the past two decades has led to numerous spoken dialogue systems (SDSs) being deployed each year in the travel, automotive and financial services industries. However, there are still barriers that hamper the scalability of such applications, including the lack of sufficient domain specific data, rapid data annotation and adequate pronunciation dictionary generation. Even though data-driven approaches have been investigated (Levin et al., 2000; Hardy et al., 2004), typically little data, if any, is available during the development of SDS applications. Automatic design paradigms have yet to match the level of accuracy that is required to guarantee high quality human–machine interaction. Consequently, in deployed systems, dialogue strategies, understanding components and language models are often handcrafted.

A typical spoken dialog system architecture consists of an automatic speech recognition (ASR) component, a parser performing natural language understanding (NLU), a dialog manager, a language generation component and a speech synthesis component. In a human–machine interaction, the ASR engine transcribes the incoming speech waveform and passes the transcribed text to the NLU agent. The NLU agent extracts the relevant semantic information from the text. The dialogue manager uses the semantic information in its appropriate context to issue database queries and to formulate responses to the user. The dialogue manager maintains a dialogue state, which is updated in response to each spoken utterance.
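To make this division of labor concrete, the following minimal Python sketch traces one user turn through the components just described. It is only a schematic: the component objects (asr, nlu, dialog_manager, generator, tts) and their single methods are hypothetical placeholders, not part of any system described in this paper.

```python
from dataclasses import dataclass, field

@dataclass
class DialogState:
    """Minimal dialogue state, updated after every spoken utterance (schematic only)."""
    history: list = field(default_factory=list)

def handle_turn(audio, asr, nlu, dialog_manager, generator, tts, state):
    """One pass through a typical SDS pipeline; every component is a hypothetical
    object exposing a single method."""
    text = asr.transcribe(audio)                            # ASR: speech waveform -> text
    semantics = nlu.parse(text)                             # NLU: text -> semantic representation
    state.history.append(semantics)                         # dialogue state update
    action = dialog_manager.next_action(semantics, state)   # issue DB queries, choose response
    response = generator.realize(action)                    # language generation: action -> text
    return tts.synthesize(response)                         # speech synthesis: text -> audio
```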


A typical SDS deployment cycle requires acquiring domain specific data, performing semantic annotation of the data to train a parser, building the router and forms, designing the user interface, constructing the dialogue, and building the vocabulary, pronunciation lexicon, and language and acoustic models for the ASR component. It is well known that SDSs perform better when trained on large amounts of domain specific real user data. However, acquiring domain specific data for language modeling and annotating it to train the parser is a time-consuming task that slows SDS deployment for new applications. Some research has centered on minimizing the effort needed to obtain training data (Fabbrizio et al., 2004; Fosler-Lussier and Kuo, 2001; Bechet et al., 2004; Bertoldi et al., 2001). However, in-domain utterances remain very important in every step of dialogue system development for ensuring adequate coverage and alleviating data sparseness issues in the statistical language model and understanding components. Typically, an initial in-domain mini-corpus is acquired via a Wizard-of-Oz data collection or by soliciting input from real users. This mini-corpus serves to build a pilot system that is used for successive data collections.

In practice, we face the constant challenge of how to deploy one or more SDSs within a short time frame. Given the time constraints, rapid annotation of the mini-corpus becomes essential. We believe that minimizing the initial cost of a reasonably performing pilot dialogue system is a crucial step towards wider and faster deployment of such systems. To achieve this objective, we propose two methods directed at semantic annotation and language modeling. The first method reduces the human effort required for annotating the domain specific parser training data. The second method leverages limited in-domain data to retrieve additional data from the World Wide Web (WWW) and other external sources to train language models. We furthermore investigate the impact of the proposed language models on speech understanding performance using a natural language call-routing system. Since each of the proposed methods came out of a real-life necessity for different applications at different times, we had to use different data sets for the semantic annotation and language modeling experiments, rather than testing both sets of methods on the same application or data. Nevertheless, this gave us the opportunity to test the proposed methods on several applications and data sets.

The rest of the paper is split into two main parts: Sections 2–5 focus on semantic annotation and Sections 6 and 7 focus on language modeling for SDSs. In particular, Section 2 presents the previous research and motivates the need for semantic annotation of the mini-corpus; it also presents a brief description of the decision tree based statistical parser that is used as the baseline. Section 3 introduces the proposed classification based schemes for semantic annotation. Section 4 describes the proposed bootstrapping framework, followed by the semantic annotation experimental results in Section 5.


Section 6 discusses the data sparseness issue for language modeling and presents a framework for collecting relevant data from the Web and other available resources for language modeling. Section 7 contains the experimental results for language modeling and call-routing. Section 8 presents our conclusions.

2. Semantic annotation for statistical speech understanding

There are two main approaches to NLU: grammar based and corpus-driven. Grammar based methods require handcrafted rules written by a grammarian and/or someone with domain expertise (Jurafsky et al., 1994; Seneff et al., 1992). As the grammar rules need to capture domain specific knowledge, pragmatics, syntax and semantics altogether, it is difficult to write a rule-set that has good coverage of real data without becoming intractable. The alternative is a corpus based approach requiring less developer expertise than traditional grammar based systems. Corpus based approaches employ statistical methods to model the syntactic and semantic structure of sentences (Davies et al., 1999). The task of writing grammars is replaced by the simpler task of annotating the meaning of sentences. The corpus-driven approach is desirable in that the induced grammar can model real data closely.

Within corpus-driven approaches, there are two main methods that address semantic annotation of data: supervised and unsupervised (Meng and Siu, 2002). Unsupervised methods make no assumptions about the data; for example, they do not require semantically annotated data as a prerequisite for the algorithm. Supervised methods, on the other hand, assume the availability of annotated data for training. Even though manual annotation of utterances is cumbersome, supervised methods remain preferred over unsupervised annotation methods due to superior overall system performance (Meng and Siu, 2002; Fosler-Lussier and Kuo, 2001; McCandless and Glass, 1993).

The collection of a mini-corpus of 10–15K sentences is a necessary step for building a conversational system with both corpus-driven and grammar based methods. Corpus-driven methods use the mini-corpus to train the models, while grammar based methods need the mini-corpus to write and validate the grammar rules. The mini-corpus is typically manually annotated. Manual annotation may take a long time depending on the complexity of the semantic analysis; in this study, an experienced human annotator is able to annotate 250–300 sentences per day. However, semi-automatic methods can be used for the annotation of the mini-corpus to expedite SDS deployment. For example, a subset of the mini-corpus can be manually annotated to train a parser, and the parser can then be used to generate the most likely complete semantic annotation of the remaining sentences. Even though there are partial parsing strategies (Abney, 1996), complete sentence parsing is commonly used for NLU systems due to the assumption of the availability of sufficiently large hand-labeled data to learn sentence structures (Davies et al., 1999). Likewise, the statistical decision tree based parser used in this study generates annotations for the entire sentence.


In the past, we have built NLU applications and trained parsers for a number of tasks (Magerman, 1994; Davies et al., 1999). Based on our experience with these tasks, it is essential to have at least 10–15K sentences for parser training to achieve reasonable performance. The parser can then be used either in a pilot system to acquire additional real in-domain data or to bootstrap the annotation process for a larger set of training data.

Thus far, little attention, if any, has been given to the rapid annotation of the mini-corpus. Our work is an attempt to expedite the mini-corpus annotation. In this study, we devise a semi-automatic methodology to capture language structures given a limited annotated corpus. Since mini-corpus annotation should be error-free, a complete manual correction step is necessary to eliminate errors in the automatic semantic annotation. We formulate the automatic annotation problem as a classification problem and use a similarity measure and a support vector machine (SVM) based classifier to perform the classification. In the classification framework, each word is guaranteed to be assigned a tag and a label. In (Sarikaya et al., 2004), we presented some initial experiments using the proposed methods. In the present study, we describe a mechanism for combining the proposed methods with the parser based baseline annotation method in a sequential voting scheme to generate the annotation of a sentence. Our motivation for using these three inherently different methods is that each uses a different context and type of information, which makes their errors largely independent; this helps to minimize sentence level annotation errors. While the similarity measure relies only on the local word context, the SVM classifier uses the local tag and label context in addition to the word context, and the parser uses the entire parse tree context available at that point. The proposed methods present an alternative to active learning, which has been used for sentence level annotation of data (Tur et al., 2005). Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples and then selecting the most informative ones with respect to a given cost function.

3. Semi-automatic semantic annotation schemes

We employ three annotation schemes: a decision tree based statistical parser, which is used as the baseline, a similarity measure and an SVM based classifier. The similarity-measure based method is founded upon example based learning; as such, it does not need a model, since it works directly with the training data.

3.1. The decision tree based statistical parser

Our baseline decision tree based statistical parser is trained using manually annotated data, and its performance depends heavily on the amount of annotated data. The objective of the parser is to generate a complete parse tree for a given sentence. The parser works in a left-to-right, bottom-up fashion. For example, the parser would first attempt to predict the tag for the word "I" in Fig. 1, then predict the label corresponding to the predicted tag, and so on. Each of these decisions corresponds to a parser action. At any given step, the parser performs a feature value assignment corresponding to a parser action, and each parser action is assigned a probability given the current context. For the sake of simplicity, we consider only four main feature values. Let N^k = [N_l^k N_w^k N_t^k N_e^k] refer to the 4-tuple of feature values at the kth node in a parse state. These feature values are the "label" (N_l^k), "word" (N_w^k), "tag" (N_t^k), and "extension" (N_e^k). The probability distribution for each feature value is estimated using conditional models. For example, the tag feature is modeled as:

p(N_t^k \mid \mathrm{context}(t)) \approx p(N_t^k \mid w_i\, w_{i-1}\, w_{i-2}\, w_{i+1}\, w_{i+2}\, N^{k-1}\, N^{k-2}\, N^{k+1}\, N^{k+2})   (1)

We recognize that using a large feature set to constrain the prediction of a tag or label has advantages if there is enough training data to learn the dependencies. Not surprisingly, training data sparseness adversely affects parser robustness. For example, if we use only 1K sentences for parser training and parse the next 1K sentences from the corpus, the parser fails on 36% of the sentences, since most of these sentences contain words that were not covered in the 1K training set (Sarikaya et al., 2004). The corresponding failure rates for 2K, 3K, and 9K training sets are 23.5%, 14.7%, and 5.4%, respectively. As expected, the parse failure rate decreases as the size of the training set increases.
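As an illustration only, the sketch below assembles the Eq. (1) context (the current word, two words on each side, and the neighboring node assignments) and feeds it to an off-the-shelf decision tree classifier from scikit-learn. This is a hedged stand-in for the parser's conditional tag model, not the author's implementation; the `nodes` dictionary of already-assigned 4-tuples is a hypothetical data structure introduced here.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def tag_context(words, i, nodes):
    """Context of Eq. (1) for predicting the tag at position i: the word itself,
    two words on each side, and the neighboring node assignments. `nodes` is a
    hypothetical dict mapping positions to already-predicted 4-tuples."""
    ctx = {"w0": words[i]}
    for d in (-2, -1, 1, 2):
        j = i + d
        ctx["w%+d" % d] = words[j] if 0 <= j < len(words) else "<pad>"
        ctx["N%+d" % d] = str(nodes.get(j, "<none>"))
    return ctx

# A generic decision tree over these features stands in for the parser's
# conditional model p(N_t^k | context); it must be fit on annotated parse states.
tag_model = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
```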

Fig. 1. An example of a semantically annotated sentence (or a parse tree) from the medical domain: the sentence "I need to x-ray your chest", with word-level tags (pron-sub, intend, intend0, verb, pron-pos, body-part) grouped under the labels SUBJECT, INTEND, VERB and BODY-PART.


3.2. Similarity measure for semantic annotation

The first classification based method is founded on the premise that, given two instances of a word, if the contexts in which they are used are similar, then they should be annotated with the same tag and label. The key question is: what is the appropriate similarity measure? Noting the resemblance between the annotation problem and the machine translation (MT) evaluation problem, we adopted the BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002) as the similarity measure for annotation. In the annotation problem, a word with its context is compared to other instances of the same word in the training data; in the MT evaluation problem, a translated sentence is compared to a set of reference sentences to find the most likely translation in the target language. The BLEU metric is defined as follows:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big)   (2)

where N is the maximum n-gram length, w_n and p_n are the corresponding weight and precision, respectively, and BP is the brevity penalty:

\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{if } c \le r \end{cases}   (3)

where r and c are the lengths of the reference and translated sentences, respectively. Experimentally, the best correlation between BLEU and mono-lingual human judgments was obtained using a maximum n-gram order of 4, although 3-grams and 5-grams gave comparable results (Papineni et al., 2002). Therefore, we set N = 4 and w_n = 1/N to weigh the logarithms of the precisions equally.

Our goal is to annotate the words in a sentence rather than to determine how close two sentences are, so we tailored the way BLEU is applied to fulfill that goal. Based on the analogy between MT evaluation and sentence annotation, the sentence to be annotated is treated as the translated sentence, and all the sentences in the training data containing the word to be annotated are considered as possible "reference" sentences. The task is to find the best reference sentence using the BLEU score, and the annotation is performed sequentially for each word. Once the best reference sentence containing the word to be annotated is found, the tag and label of that word in the reference are used as the tag and label of the current word.
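The following sketch implements Eqs. (2) and (3) with N = 4 and uniform weights, and uses the score to transfer the tag and label of a word from the best-matching training sentence. The `training_sentences` format (parallel token, tag and label lists) is an assumption made for illustration; it is not the paper's data structure.

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU of `candidate` against a single `reference` (Eqs. (2)-(3)),
    with w_n = 1/N and N = 4; a small floor avoids log(0) for short sentences."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = _ngrams(candidate, n), _ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        log_prec += math.log(max(overlap, 1e-9) / max(sum(cand.values()), 1))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_prec / max_n)

def annotate_word(word, sentence, training_sentences):
    """Copy the tag/label of `word` from the highest-scoring training sentence
    that contains it; `training_sentences` is a list of (tokens, tags, labels)."""
    best, best_score = None, -1.0
    for tokens, tags, labels in training_sentences:
        if word in tokens:
            score = bleu(sentence, tokens)
            if score > best_score:
                best, best_score = (tokens, tags, labels), score
    if best is None:
        return None  # unseen word: fall back to another engine or a human
    tokens, tags, labels = best
    i = tokens.index(word)
    return tags[i], labels[i]
```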


3.3. The SVM based semantic annotation

SVMs are derived from the theory of structural risk minimization (Vapnik, 1995). They learn the boundary between samples of two classes by mapping the sample points into a higher dimensional space, where a separating hyperplane is found by maximizing the margin between the closest sample points belonging to competing classes. Although SVMs build binary classifiers, multi-class classification can be performed using pairwise binary classifiers; namely, one can train M(M-1)/2 pairwise binary classifiers, where M is the number of classes. Much of the flexibility and classification power of SVMs resides in the choice of the kernel (Cristianini and Shawe-Taylor, 2000). We compared linear, Gaussian and polynomial kernels on a development data set but did not observe significant performance differences among them. We therefore based our decision on computational speed and chose the linear kernel, since it provides faster model training and testing. SVMs have also been shown to provide competitive results for syntactic constituent chunking (Kudo and Matsumato, 2000) and semantic role labeling (Hacioglu and Ward, 2003), which are in some respects similar to the semantic annotation problem.

The most important step in SVM-based classification is relevant feature selection. As shown by the dashed box in Fig. 2, we derive the features from the context surrounding the word being annotated ("x-ray"). The word context contains previous and future words, whereas the tag and label context is obtained only from the tag and label predictions for the previous words. The classification scheme is sequential.

Fig. 2. SVM based annotation and the features used for classification of the word "x-ray": the tag and label feature vectors are drawn from the surrounding words and from the previously predicted tags and labels.


Fig. 3. Flow-diagram of the ROVER based semantic annotation framework: a human annotator labels the first chunk of sentences (1–1000), which trains the parser, similarity and SVM engines; new sentences (1001–2000) go through triple engine annotation and ROVER combination, after which confidently annotated sentences are accepted and the remainder are sent to a human annotation corrector; the corrected data yields new model parameters for the next round.

First, the tag of word w_i is determined using a tag SVM classifier built with the following tag feature vector f_{tag}^i:

f_{tag}^i = [w_{i-2}\, w_{i-1}\, w_i\, w_{i+1}\, w_{i+2}\, t_{i-2}\, t_{i-1}\, l_{i-2}\, l_{i-1}]   (4)

where w_i is the word to be tagged, and t_{i-1} and l_{i-1} are the tag and label of the previous word w_{i-1}, respectively. In addition to the word context, the tags and labels of the previous words are also used. Next, given the predicted tag \hat{t}_i, we predict the label for w_i using a separate label SVM model with the following label feature vector:

f_{label}^i = [w_{i-2}\, w_{i-1}\, w_i\, w_{i+1}\, w_{i+2}\, t_{i-2}\, t_{i-1}\, \hat{t}_i\, l_{i-2}\, l_{i-1}]   (5)

Once the label l_i for w_i is determined, t_{i+1} and l_{i+1} are predicted. This process is applied sequentially to all words in the sentence.
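A minimal sketch of this sequential scheme is given below, assuming the feature vectors of Eqs. (4) and (5) and using scikit-learn's LinearSVC as a stand-in for the paper's SVM trainer (LinearSVC performs one-vs-rest multi-class classification rather than the pairwise binary classifiers described above). The feature names and padding symbols are illustrative choices.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def features(words, i, tags, labels, pred_tag=None):
    """Context features of Eqs. (4)-(5): the word, two words on each side, the two
    previously predicted tags and labels, and (for the label model) the predicted
    tag of the current word."""
    def w(j): return words[j] if 0 <= j < len(words) else "<pad>"
    def prev(seq, j): return seq[j] if j >= 0 else "<pad>"
    f = {"w-2": w(i - 2), "w-1": w(i - 1), "w0": w(i), "w+1": w(i + 1), "w+2": w(i + 2),
         "t-2": prev(tags, i - 2), "t-1": prev(tags, i - 1),
         "l-2": prev(labels, i - 2), "l-1": prev(labels, i - 1)}
    if pred_tag is not None:
        f["t0"] = pred_tag
    return f

# Two linear-kernel SVM classifiers, one for tags and one for labels;
# both must be fit on manually annotated sentences before use.
tag_svm = make_pipeline(DictVectorizer(), LinearSVC())
label_svm = make_pipeline(DictVectorizer(), LinearSVC())

def annotate(words, tag_svm, label_svm):
    """Left-to-right sequential annotation: tag w_i, then label w_i, then move on."""
    tags, labels = [], []
    for i in range(len(words)):
        t = tag_svm.predict([features(words, i, tags, labels)])[0]
        l = label_svm.predict([features(words, i, tags, labels, pred_tag=t)])[0]
        tags.append(t)
        labels.append(l)
    return tags, labels
```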

4. A bootstrapping approach for semantic annotation

The proposed bootstrapping framework relies on the incremental annotation scheme presented in Fig. 3. First, the corpus is split into smaller chunks (e.g., 1000 utterances). The first chunk is given to a human annotator for complete annotation. This chunk is used as the seed data to train the statistical parser and the SVM classifier, and to estimate the similarity measure. The newly trained models are then used in the corresponding engines to annotate the next chunk of unlabeled sentences. Sentences for which the annotation engines generate the same parse tree are tagged as reliable and are directly added to the training data to be used for the next round of annotation. Otherwise, they are tagged as unreliable and forwarded to a human annotator for inspection and possible correction. As the training data increases, the accuracy of each annotation engine increases, thus increasing the number of reliable sentences.

We adopt a voting scheme that is commonly called "ROVER" to combine the tag and label hypotheses from all three annotation engines. The name ROVER refers to a specific method developed by NIST for ASR output combination based on voting and word-level confidence scores, which has been shown to be an effective mechanism for combining multiple speech recognition hypotheses (Fiscus, 1997); here we use it to describe a simple majority voting scheme. The combination of the tag/label hypotheses is based on majority voting between the three engine hypotheses and works in left-to-right order. Based on some preliminary experiments (detailed results are presented in Section 5), we observed that the SVM classifier is more accurate than the statistical parser and the similarity measure. The ROVER output is therefore set to the output of the statistical parser (or, equivalently, the similarity measure) if the statistical parser and similarity measure outputs agree with each other but differ from the SVM output; otherwise, the ROVER output is set to the SVM output. After generating the ROVER based tags and labels, a post-processing step collapses consecutive identical labels into a single label to create the final parse tree.
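The per-word combination rule and the label post-processing just described reduce to a few lines. The sketch below is only an illustration of the stated rule, with (tag, label) pairs assumed as the per-word hypotheses.

```python
def rover_combine(parser_hyp, similarity_hyp, svm_hyp):
    """Majority vote over the three engines' (tag, label) hypotheses for one word:
    if the parser and similarity measure agree but differ from the SVM, take their
    output; otherwise take the SVM output (the most accurate single engine)."""
    if parser_hyp == similarity_hyp and parser_hyp != svm_hyp:
        return parser_hyp
    return svm_hyp

def collapse_labels(labels):
    """Post-processing: merge runs of identical consecutive labels into one."""
    collapsed = []
    for lab in labels:
        if not collapsed or lab != collapsed[-1]:
            collapsed.append(lab)
    return collapsed
```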


5. Semantic annotation experimental results

5.1. Experimental setup for semantic annotation

The mini-corpus used in this study consists of dialogs between doctors and patients, and was collected for a speech-to-speech translation project in the medical domain (Gao et al., 2006). The mini-corpus has 10K manually annotated sentences that are randomly split into 10 equal chunks. To simulate incremental, interactive training, each chunk is automatically annotated before human inspection and correction, and is then folded into the training data to retrain the annotation models for the next round. The manual annotations of the first chunk are used to train the annotation models, which are then used to automatically annotate the next chunk. Then, the first two chunks are used as training data to annotate the third chunk. The process is repeated until 9K sentences are annotated, and the final model is used to annotate the last chunk, which serves as the test set. The mini-corpus has 158 unique semantic tags and 69 unique semantic labels.

The objective of a semantic annotation performance measure should be to minimize the work left to a human annotator for correcting erroneously annotated examples. Therefore, the Annotation Error Rate (AER), which measures the percentage of tags and labels (with respect to the total number of tags and labels in the reference) that need to be corrected by a human annotator, is an appropriate measure. Since this corpus is fully annotated, the AER can be approximated by computing the string edit distance between the reference and the hypothesized annotations. Savings in annotation effort come at both the word and sentence levels.
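The incremental protocol above can be summarized by the loop below. The callables (train_engines, annotate, correct) are hypothetical placeholders for training the three engines, running the triple-engine ROVER annotation, and the human inspection/correction step; this is a sketch of the experimental protocol, not the actual tooling used.

```python
def incremental_annotation(chunks, train_engines, annotate, correct):
    """Sketch of the Section 5.1 protocol: the first chunk is annotated manually,
    each following chunk is annotated automatically with models trained on all
    previously corrected chunks and then corrected by a human, and the final
    models annotate the last chunk, which serves as the test set."""
    annotated = [correct(chunks[0])]
    for chunk in chunks[1:-1]:
        engines = train_engines(annotated)       # parser, similarity measure, SVM
        hypotheses = annotate(engines, chunk)    # triple-engine + ROVER annotation
        annotated.append(correct(hypotheses))    # human inspection and correction
    final_engines = train_engines(annotated)     # trained on 9K sentences
    return annotate(final_engines, chunks[-1])   # held-out 1K test chunk
```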

5.2. Word level savings

Word level savings measure the number of tags and labels that do not need to be corrected. In Fig. 4, we compare the word level AERs of the parser, similarity measure and SVM based annotation schemes. These curves are obtained on the "unreliable" sentences, which are explained in more detail in the next section. The parser based scheme outperforms the similarity-measure based method when the training data size is larger than 7K. The SVM based scheme consistently outperforms both the parser and the similarity-measure based schemes, although the parser based scheme catches up with the SVM based classification scheme when the training data size reaches 9K. However, the ROVER scheme consistently outperforms all of the individual schemes; the absolute improvements over the SVM based scheme are about 3% across all training data sizes.

Even though the AER is our main concern, it is worth reporting the F-measure. Precision, recall and F-measure are widely used for performance evaluation. In our case, recall (r) measures the relevant bracketed annotations generated by a method as a percentage of the total bracketed reference annotations, and precision (p) measures the relevant bracketed annotations as a percentage of all bracketed annotations in the automatically annotated sentence. The F-measure is defined as:

F_\beta = \frac{(1+\beta^2)\, p\, r}{\beta^2 p + r}   (6)

where β is a weighting factor that determines the relative weighting of precision and recall. If β = 1, precision and recall are equally weighted; if β = 0, then F_0 = p; and as β → ∞, F_∞ → r.

In Table 1, F-measure figures are provided for the individual methods and for ROVER. The results confirm the effectiveness of the proposed methods. The SVM based classification scheme has a higher F-measure than the parser based method up to 4–5K training sentences; they have similar scores for 6–7K training sentences, and then the parser based scheme starts to outperform the SVM based scheme. The similarity-measure based scheme outperformed the parser based scheme up to 3K training sentences. The ROVER scheme again significantly outperforms all the individual methods. In particular, the F-measure improves by about 13.6% absolute for 1K training sentences compared to the parser based scheme; for 9K training sentences, the corresponding absolute improvement is 1%.

Fig. 4. Comparison of the annotation schemes (parser, similarity measure, SVM, ROVER) for the unreliable sentences: word level AER (%) versus the amount of training data (x1000).


5.3. Sentence level savings

Sentence level savings come as a consequence of word level annotation and involve the reliable sentences that are not directed to a human annotator; they are achieved by accepting, without manual correction, the automatic annotations on which all annotation engines agreed. Fig. 5 shows the sentence level savings, i.e., the percentage of unique sentences that all three annotation engines agree upon but that do not already exist in the previously annotated data. The savings start at around 10% for 1K annotated training sentences and improve to about 28% for 9K training sentences. The primary reason for starting at such a low level is the parser's sensitivity to small training sets: 36% of the sentences were not parsed when the parser was trained on the 1K training set.

The sentence level savings, however, come at the expense of introducing some errors into the training data for the next round of incremental annotation. The upper panel of Fig. 6 presents the word-level Annotation Error Rate (AER) and the sentence-level Annotation Error Rate (S-AER) for the "reliable" and "unreliable" sentences. In the figure, REL-S-AER denotes the S-AER for the "reliable" sentences; likewise, UNREL-S-AER stands for the S-AER for the "unreliable" sentences. If at least one tag or label in a sentence is annotated incorrectly, it counts as a sentence-level error. As expected, REL-AER is very low, at about 1%, and is independent of the training data size, which indicates that the annotation methods make fairly independent errors. REL-S-AER is around 5% and fairly insensitive to the amount of training data. Unlike the "reliable" sentences, both word and sentence level AERs are fairly high for the "unreliable" sentences: UNREL-AER starts at around 30% for 1K training sentences and decreases to about 20% for 9K training sentences, and UNREL-S-AER is in the range of 65–80%, which confirms our decision to inspect the "unreliable" sentences. The lower panel of Fig. 6 shows that the sentence level savings have no impact on the performance of the incremental annotation; in the figure, "MANUALLY CORRECT-ALL" denotes the case where every automatic annotation is inspected and corrected by a human annotator.

So far, we have presented three individual annotation methods and a ROVER based combination for rapid semi-automatic semantic annotation of spoken dialog data. The proposed ROVER based bootstrapping method reduces the annotation cost by a factor of two in terms of the number of tags and labels directed to a human annotator.

6. Data sparseness problem in statistical language modeling

Data sparseness remains a challenge for building robust statistical language models (LMs), despite the widespread use of SDSs in different application areas. Unlike acoustic modeling, which is fairly insensitive to the application domain as long as there is no significant environmental and/or channel mismatch between training and test conditions, domain specific language modeling has a significant impact on speech recognition performance. Generic LMs for tasks such as dictation, Broadcast News or Switchboard (Kingsbury et al., 2003) do not provide satisfactory performance for domain specific target tasks. The overlap between the domain specific target task data and the generic model data largely determines the level of success. For example, in Lefevre et al. (2005), generic models built through model adaptation and multi-source training were shown to perform comparably to task specific models. However, when there is not sufficient overlap, using generic models may not improve speech recognition performance, and it becomes essential to improve LM performance through other means.

The bulk of LM research has concentrated on improving language model probability estimation (Katz, 1987; Kneser and Ney, 1995; Chen and Goodman, 1996), while obtaining additional training material from external resources has received little, but growing, attention (Rosenfeld, 2001; Zhu and Rosenfeld, 2001; Chung et al., 2005). As the largest data source, the World Wide Web (WWW) has previously been utilized for numerous natural language processing (NLP) applications (Lapata and Keller, 2004), and it has also been used to find relevant text for training language models (Zhu and Rosenfeld, 2001; Berger and Miller, 2001; Bulyko et al., 2003).

The purpose of our study is to propose a framework for using the Web to build LMs for limited domain SDSs. To this end, we formulated a query generation scheme for data retrieval from the Web and developed a data filtering and selection mechanism to extract the "useful" utterances from the retrieved pages (Sarikaya et al., 2005a).

Table 1
Comparison of annotation schemes using F-measure (F1, %) as a function of the amount of training data

Annotation method     1K     2K     3K     4K     5K     6K     7K     8K     9K
Parser                63.4   74.5   80.3   81.9   84.1   84.8   86.2   87.8   88.5
Similarity measure    71.9   78.0   80.5   81.7   82.7   83.1   84.2   85.3   86.2
SVM                   73.3   78.6   81.6   83.1   84.2   85.1   85.9   86.7   87.2
ROVER                 76.6   81.8   84.7   85.8   86.6   87.4   88.0   89.0   89.5


The proposed framework is applied to two target tasks, one in the financial domain and one in the package shipping domain. The proposed mechanism makes good use of the limited in-domain data to sift through the large external text inventory and identify "similar" sentences. In much of the previous work, documents were used as the unit in accepting or rejecting training material. We believe that going one step further and sifting for relevant information within a document is essential.

Fig. 5. Sentence level savings: sentences that are folded into training data for the next round of incremental annotation without human inspection (percentage of automatically annotated sentences versus the amount of training data, x1000).

Fig. 6. Upper panel: word-level Annotation Error Rate (AER) and sentence-level Annotation Error Rate (S-AER) for the RELIABLE (REL) and UNRELIABLE (UNREL) sentences; REL-AER and UNREL-AER denote the AERs, and REL-S-AER and UNREL-S-AER the S-AERs, for the two sentence groups. Lower panel: impact of the annotation errors due to the sentence level savings on the annotation performance, compared to manually correcting all automatic annotations (MANUALLY CORRECT-ALL).

6.1. Exploiting external resources

We categorize external resources as static or dynamic. Corpora collected for other tasks are examples of static resources. The Web is a dynamic resource because its content is changing constantly; it is the largest data set available, not only for language modeling but also for other NLP tasks, and currently consists of more than 11 billion pages (Gulli and Signorini, 2005). So far, the Web has been used mainly for domain-independent speech recognition tasks, including AP newswire transcription (Berger and Miller, 2001), Switchboard and ICSI Meeting transcription (Bulyko et al., 2003), and spoken document retrieval (Zhu and Rosenfeld, 2001). To the best of our knowledge, we made the first successful attempt at exploiting the Web for speech recognition in a limited domain spoken dialogue system (Sarikaya et al., 2005a).

In addition to Web-based data, we also considered using previously collected domain specific as well as domain-independent data. In Table 2, we list the static corpora used for the experiments in this paper. The domain specific sources were collected for different SDS applications. In the table, "Call Center1" refers to a telecommunication company's call center data, and "Call Center2" refers to data from one of IBM's internal call centers for customers having trouble with their computers. The medical data was collected for our speech-to-speech translation project (Gao et al., 2006), which involves dialogs between doctors and patients. None of the domain specific corpora are related to the target domains used in this study.

Our approach is summarized in Fig. 7. We assume that we are given a limited amount of target domain data, which can also be generated manually after one becomes familiar with the domain. We generate queries from these sentences and search the Web. The retrieved documents are filtered and processed to extract relevant utterances using the limited in-domain data. We employ a similarity measure (see Section 3.2) to identify sentences that are likely to belong to the target domain. The same process is applied to the static data sources, where the in-domain data is used directly with the similarity measure to identify relevant utterances, without query generation and retrieval. Finally, we build a domain-specific language model using the selected sentences obtained from the static and dynamic sources. An effective method for combining a small amount of in-domain data and a large amount of out-of-domain data is to build separate language models and interpolate them (Rudnicky, 1995).
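A minimal sketch of combining such models is given below, assuming each LM object exposes a hypothetical prob(word, history) method. Linear interpolation is shown for simplicity; the experiments in Section 7 combine the models log-linearly, with the weights tuned on a small development set in both cases.

```python
def interpolated_prob(word, history, lms, weights):
    """Linearly interpolate several n-gram LMs: P(w | h) = sum_i weight_i * P_i(w | h).
    `lms` are hypothetical objects exposing prob(word, history); `weights` sum to
    one and are tuned on held-out data."""
    return sum(w * lm.prob(word, history) for lm, w in zip(lms, weights))

# Example with hypothetical model objects: a small in-domain LM dominates the
# mixture, with Web and static LMs filling coverage gaps.
# p = interpolated_prob("balance", ("the",), [base_lm, web_lm, stat_lm], [0.6, 0.3, 0.1])
```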


6.2. Search query generation and sentence selection

Our initial evaluations of several search engines demonstrated that Google was the most useful engine for our application. Google indexes web pages (including URLs that it has not fully indexed) and many additional file types in its web database. Due to the computational cost of document conversion, we only downloaded those files from which text can be retrieved efficiently.

Search query generation from a sentence is a key issue. The queries should be sufficiently specific, since the more specific the query, the more relevant the retrieved pages. On the other hand, if the query is too specific, there may not be enough retrieved pages, if any. In reality, we do not have infinite resources, so one needs to avoid sending too many failed requests to a server just to obtain documents for a single sentence. Our query generation approach takes these concerns into account by generating a list of queries that starts from the most relevant query and gracefully degrades to the least relevant one. We define the most relevant query as the one in which all content-word n-grams, together with some surrounding context (clarified below), are combined via AND; the least relevant query consists of unigrams combined via OR. An example of query generation is given in Table 3.

The first step in forming queries is to define a set of frequently occurring words as stop words (e.g., the, a, is). The remaining text is chunked into n-gram islands consisting only of content words. Then, a certain amount of context is added to these islands by including their left and right neighbors; the purpose of adding context around the content words is to incorporate conversational style into the queries to some degree. In the example "what is the balance of my stock fund portfolio", the stop words (is, the, of, my) are identified first. The remaining word or phrase islands form the basis of the queries, and the islands are then expanded by adding context from the neighboring words. The amount of context can be increased by adding more neighboring words on each side of the content words, at the expense of an increased likelihood of failed requests. Next, queries are formed starting with the most relevant one (Q1), which combines the n-gram chunks using AND. The next best query (Q2) is formed by splitting the trigram content word island [stock fund portfolio] into two bigram islands, [stock fund] and [fund portfolio], and then adding context again. This is repeated until unigram islands are obtained. The queries [Q1, Q2, ..., QN] are then repeated with AND replaced by OR and appended to the end of the query list. During retrieval, queries from this list are submitted to a server in order until a pre-specified number of documents is retrieved; in this paper, the stopping point was 100 pages per in-domain utterance. The retrieved documents are filtered by stripping off the HTML tags, punctuation marks and HTML specific information that is not part of the page content. A sketch of this query ladder is given below.

We adopted BLEU (Papineni et al., 2002) (see Section 3.2) as the similarity measure for utterance selection. For each sentence in the in-domain data, we select all the sentences in the retrieved Web data, as well as in the static corpora, whose similarity score is above an empirically determined threshold.
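The sketch below is a simplified version of the query ladder, not the exact scheme of Table 3: the stop list is illustrative, only one word of context is added on each side of a chunk, and the relaxation simply splits islands into progressively shorter overlapping n-grams before repeating the ladder with OR.

```python
STOP_WORDS = {"is", "the", "of", "my", "a", "an", "to"}  # illustrative stop list only

def content_islands(tokens):
    """Chunk a sentence into maximal runs of content (non-stop) word positions."""
    islands, current = [], []
    for i, w in enumerate(tokens):
        if w.lower() in STOP_WORDS:
            if current:
                islands.append(current)
                current = []
        else:
            current.append(i)
    if current:
        islands.append(current)
    return islands

def with_context(tokens, chunk):
    """Expand a chunk of positions by one neighboring word on each side."""
    lo, hi = max(0, chunk[0] - 1), min(len(tokens) - 1, chunk[-1] + 1)
    return '"%s"' % " ".join(tokens[lo:hi + 1])

def query_ladder(sentence):
    """Queries from most to least relevant: full content-word islands joined with
    AND, islands progressively split into shorter n-grams, then the same ladder
    repeated with OR."""
    tokens = sentence.split()
    islands = content_islands(tokens)
    if not islands:
        return ['"%s"' % sentence]
    queries = []
    for n in range(max(len(isl) for isl in islands), 0, -1):
        chunks = []
        for isl in islands:
            if len(isl) <= n:
                chunks.append(isl)
            else:  # split a long island into overlapping n-grams
                chunks.extend(isl[i:i + n] for i in range(len(isl) - n + 1))
        queries.append(" AND ".join(with_context(tokens, c) for c in chunks))
    return queries + [q.replace(" AND ", " OR ") for q in queries]

# query_ladder("what is the balance of my stock fund portfolio")[0]
# -> '"what is" AND "the balance of" AND "my stock fund portfolio"'
```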

Table 2
Static data sources

Corpus                      Size (million words)
Domain independent
  CTRAN                     1.00
  Cellular                  0.24
  FN-CMV                    0.69
  Web-meetings              30
  SWB-Fisher                107
  UW web data               191
  Broadcast News            204
Domain specific
  Medical                   1.4
  Call Center1              0.33
  Call Center2               1.9
  IBM Darpa Communicator    0.7

7. Language modeling experimental results and discussion

7.1. Task 1: Financial transaction domain

Task 1 is a financial transaction application of a very large financial company (Sarikaya et al., 2005a). We assume that we are given a reasonable vocabulary for the task, along with the dialogue states assigned to the limited in-domain training data. The query generation data has 1.7K utterances (6.8K words) randomly selected from a larger set. The test data consists of 3148 utterances. The training data vocabulary has 3228 items.

The acoustic models are first trained using more than 1000 h of generic telephony acoustic data and are later MAP-adapted to this application using 42 h of in-domain acoustic training data; the in-domain acoustic training data is a subset of the in-domain language model training data. Words are represented using an alphabet of 54 phones, and phones are represented as three-state, left-to-right HMMs. With the exception of silence, the HMM states are context-dependent and conditioned on quinphone context. The context-dependent HMM states are clustered into equivalence classes using decision trees and are modeled using mixtures of diagonal-covariance Gaussians. There are 2198 context-dependent states with 222K Gaussians in the acoustic model. All language models are dialogue state based trigrams with smoothed deleted interpolation. In all cases, the language model data is randomly split into 90% and 10% chunks, which are used as training and held-out sets, respectively.

For this task, domain independent large vocabulary continuous speech recognition (LVCSR) systems, which use large amounts of training data for language modeling, resulted in fairly high error rates (>45%). We believe that, for limited domain applications, selecting the relevant part of the entire data and using it for language modeling is better than using all of it. In Fig. 8, we plot the Word Error Rate (WER) against the natural logarithm of the amount of data retrieved from the static sources. As the amount of data is increased by changing the similarity threshold, the WER improves.


Fig. 7. Flow diagram for collecting relevant data: the limited in-domain text data drives query generation and a Web (WWW) search; retrieved documents are filtered, and similarity based sentence selection over the Web data and a large static text inventory yields a domain-specific LM that is interpolated with the in-domain LM.
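The similarity based selection step in Fig. 7 can be sketched as follows, with any sentence-pair scorer (for example, the BLEU sketch from Section 3.2) passed in as `similarity`. The default threshold is the empirical value that worked best for Task 1 below; it is an assumption for illustration, not a universal setting.

```python
def select_relevant(in_domain, candidates, similarity, threshold=0.08):
    """Keep each candidate (Web or static-corpus) sentence whose similarity to at
    least one in-domain sentence exceeds an empirically determined threshold."""
    return [cand for cand in candidates
            if any(similarity(cand, ref) > threshold for ref in in_domain)]
```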

However, beyond a certain point, further reduction of the threshold allows increasingly less relevant data into the training data, and the WER starts to increase. The best performance is achieved by setting the similarity threshold to 0.08, which results in the selection of 2.8M words (391K utterances) out of the 538M words listed in Table 2.

For reference, the WER for different amounts of in-domain language model data is plotted in Fig. 9. The first point in the graph, 23.4%, corresponds to 1.7K utterances. As the data size increases, the WER steadily improves, reaching 18.9% at 17K sentences. Note that this is a small-to-medium size vocabulary task, which largely consists of fund and plan names; even using only 1.7K sentences provides a fairly low WER compared to that of the domain independent LVCSR systems.

Next, the language models built using the external resources and the in-domain data are log-linearly interpolated, and the results are given in Table 4. The interpolation weights are optimized on a small development set; typically, the in-domain LM weights were between 0.5 and 0.6. The in-domain LM built using the small set of 1.7K utterances is taken as the baseline (BaseLM). Using only the static corpora for language modeling (StatLM) resulted in 26.8% WER, and combining the BaseLM and the StatLM reduced the WER to 21.7%. For Web data collection, we investigated setting the pre-defined page limit to 20 pages/sentence versus 100 pages/sentence. When the page limit is set to 100 pages/sentence, the actual number of retrieved pages is on average around 60, due to disregarded file formats, inactive web sites and downloading problems.

Table 3
Query generation

(S)   what is the balance of my stock fund portfolio
      + STOP-WORDS (is, the, of, my): what is the balance of my stock fund portfolio
      + N-GRAM ISLANDS: [what] [balance] [stock fund portfolio]
      + ADD CONTEXT:
(Q1)  [what is the] [the balance of] [my stock fund portfolio]
      + RELAX N-GRAMS: [what] [balance] [stock fund] [fund portfolio]
      + ADD CONTEXT:
(Q2)  [what is the] [the balance of] [my stock fund] [fund portfolio]
      ...
(QN)  [what] [balance] [stock] [fund] [portfolio]

In fact, we also tried 150 pages/sentence without obtaining additional improvement over 100 pages/sentence. Using 20 pages/sentence alone (WebLM20) gave 24.1% WER (versus 26.8% with the static corpora), and increasing the number of retrieved pages to 100 reduced the WER to 21.2% (WebLM100). Combining WebLM20 with the BaseLM reduced the WER to 20.2%, and combining WebLM100 with the BaseLM resulted in 19.2%. Three-way interpolation of StatLM, WebLM100 and BaseLM resulted in the lowest WER, 19.1%. Overall, we achieved a 4.3% absolute reduction in WER compared to the baseline. This figure is very close to the 18.9% obtained with the 17K in-domain training sentences (68K words). An alternative interpretation of these results is that, by leveraging the external resources, we can reduce the in-domain data requirement by a factor of ten while still reaching a given performance target.



Fig. 8. Task 1: data size versus WER using external static corpora (data selection from corpora; WER (%) against log(#words)).

Fig. 9. Task 1: data size versus WER using in-domain data (WER (%) against number of utterances, x1000).

7.2. Package shipment domain

Task 2 is a call-routing application of a technical support hotline for a Fortune-500 package shipping company (Goel et al., 2005). The entire training data has 27K utterances, amounting to 177K words. The test data consists of 5644 utterances and is also used for the call-routing experiments in Section 7.3. The training data is split into ten equal chunks by randomly sampling from the full set, and the first chunk is further split into two 1.3K sentence chunks (8.9K words and 9.1K words). The training data vocabulary has 3667 words.

In Fig. 10, the WERs of five language models are plotted with respect to the amount of in-domain data. In the figure, "BaseLM" stands for the baseline LM built using only in-domain data, "WebLM100" stands for the LM built using the data obtained with 100 pages/sentence, and "StatLM" stands for the LM built using the static resources. (In the rest of the paper, we drop the subscript from WebLM100 and simply use WebLM.) We did not plot the curves for WebLM, StatLM and their interpolations with BaseLM beyond 8.1K in-domain utterances, due to the increased computational requirements for data retrieval, storage and sentence level data selection; note that for 8.1K in-domain sentences, the Web data size prior to cleaning and selection totaled over 12 GB. The first points on the curves correspond to using only 1.3K in-domain sentences. The WER for BaseLM is 30.9% and the corresponding figure for WebLM is 35.1%. Again, the LMs are log-linearly interpolated and the interpolation weights are optimized on a small development set; typically, the in-domain BaseLM weights were between 0.6 and 0.8. Interpolating WebLM with StatLM reduces the WER to 32.5%, interpolating WebLM with BaseLM results in 28.9%, and three-way interpolation of BaseLM, WebLM and StatLM reduces the figure to 28.2%, a 2.7% absolute reduction compared to the BaseLM WER (30.9%).

Similar improvements are observed over the explored range of up to 8.1K (53K words) in-domain utterances. Fig. 10 can also be interpreted in terms of the in-domain data reduction needed to achieve a given performance level: we observe about a 3-to-4 fold reduction in in-domain data to match a given performance when using the external data in conjunction with the in-domain data. For example, the performance of the in-domain LM that uses the entire 27K set is matched by the interpolation of the 8.1K in-domain LM, WebLM and StatLM.

Although we did not plot the curves for WebLM, StatLM and their interpolations with BaseLM beyond 8.1K, it is worth reporting the impact of interpolating WebLM and StatLM with the entire 27K-sentence BaseLM. Table 5 shows that there are additional improvements over the 27K-sentence in-domain LM even if only a subset (1.3K, 2.7K, 5.4K or 8.1K) of the in-domain data is used to search the Web. When these WebLMs and StatLMs are interpolated with the 27K-BaseLM, the WER reduces from 25.7% to 25.0%, 24.8%, 24.7% and 24.6% for 1.3K, 2.7K, 5.4K and 8.1K in-domain utterances, respectively. With 8.1K in-domain sentences used to retrieve the external data, the absolute improvement over the 27K-BaseLM is 1.1%.

Table 4
Task 1: word error rates (WER) for various language model combinations

LM                                              WER (%)
BaseLM (1.7K in-domain)                         23.4
StatLM                                          26.8
BaseLM (0.6) + StatLM (0.4)                     21.7
WebLM20                                         24.1
BaseLM (0.6) + WebLM20 (0.4)                    20.2
WebLM100                                        21.2
BaseLM (0.5) + WebLM100 (0.5)                   19.2
BaseLM (0.5) + WebLM100 (0.3) + StatLM (0.2)    19.1
17K in-domain                                   18.9

The numbers in parentheses in the first column are the interpolation weights found by tuning on a development set.



Fig. 10. Task 2: in-domain data size versus WER for BaseLM, WebLM, WebLM + StatLM, BaseLM + WebLM and BaseLM + WebLM + StatLM.

Table 5
Task 2: WER (%) when interpolating the Web and static LMs with the full 27K in-domain LM, for different amounts of in-domain data used to retrieve the external data

LM                              1.3K   2.7K   5.4K   8.1K
27K-BaseLM                      25.7   25.7   25.7   25.7
WebLM                           35.1   33.5   32.0   31.3
WebLM + StatLM                  32.5   31.1   30.6   30.1
27K-BaseLM + WebLM + StatLM     25.0   24.8   24.7   24.6

7.3. Natural language call-routing experiments

In order to evaluate the impact of the ASR improvements due to WebLM on speech understanding, a natural language call-routing system (Goel et al., 2005; Sarikaya et al., 2005b) is built using the in-domain data of the package shipping domain (described in Section 7.2). The aim of call-routing is to understand the speaker's request and take the appropriate action. Actions are the categories into which each request a caller makes can be mapped; here, we have 35 predetermined actions or call-types. Typically, natural language call-routing requires two statistical models: the first performs speech recognition and transcribes the spoken utterance, and the second is the action classifier that takes the recognized utterance and predicts the correct action to fulfill the speaker's request. We use the Maximum Entropy (MaxEnt) method to build the action classification model (Goel et al., 2005; Sarikaya et al., 2005c). The MaxEnt method is a flexible and effective modeling framework that allows for the combination of multiple overlapping information sources (Pietra et al., 1997; Chen and Rosenfeld, 2001). A schematic example of such a classifier is given below.
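As a rough illustration of an action classifier of this kind, the sketch below trains a multinomial logistic regression (equivalent to a conditional MaxEnt model) over unigram features with scikit-learn. The two training utterances and call-type names are invented for the example; the paper's own MaxEnt trainer and feature set are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented example data: ASR transcripts paired with call types.
train_texts = ["i need to track a package", "i want to schedule a pickup"]
train_actions = ["TrackPackage", "SchedulePickup"]

# Unigram (single-word) features feeding a multinomial logistic regression,
# which plays the role of the MaxEnt action classification model.
action_classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    LogisticRegression(max_iter=1000),
)
action_classifier.fit(train_texts, train_actions)
print(action_classifier.predict(["where is my package"]))  # likely -> ['TrackPackage']
```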


Fig. 11. Task 2: action classification (AC) accuracy for ASR outputs obtained with the various in-domain, Web and static language model combinations, as a function of the amount of in-domain data (number of utterances, x1000).

We use unigram (single word) features to train the MaxEnt model. The action classification model is trained using only in-domain data; in fact, we experimented with using subsets of the retrieved external LM data as part of the action classification training data, without success. Fig. 11 shows the Action Classification (AC) performance as a function of the amount of in-domain data that is used to build the AC model and the in-domain LMs and to collect data from the WWW. The difference between these curves lies only in the LMs used to generate the speech recognition hypotheses. Even though only in-domain data is used to build the AC model, using only WebLM or "WebLM + StatLM" for speech recognition results in fairly high AC accuracy; however, they do not match the performance obtained when in-domain data is used for speech recognition. Interpolating the BaseLM with the WebLM and the StatLM improves AC accuracy modestly starting with the second sampling point: the absolute improvements are 0.3%, 0.6%, 0.8% and 0.7% for 1.3K, 2.7K, 5.4K and 8.1K utterances, respectively. We believe the reason that the call-routing performance was not improved significantly lies in the nature of the task: this domain is very specific, and despite good lexical coverage it is difficult to find relevant Web data covering the content words.

8. Conclusion

We presented our solutions to semantic annotation for statistical speech understanding and to language modeling using limited resources, for the rapid deployment of statistical spoken dialog systems.


First, we proposed a new semi-automatic semantic annotation framework that uses a parser, a similarity measure and an SVM based method to annotate a mini-corpus, with the individual annotation methods combined in a ROVER scheme. Annotation experiments on the medical domain corpus demonstrated that the proposed framework can lead to a cost reduction of about a factor of two in the annotation effort. Second, we presented methods for query generation, data retrieval and data selection from the WWW to build statistical language models. The proposed methods were tested on two tasks, where we obtained significant improvements in WER over the baseline in-domain language models built using subsets of the entire in-domain corpora. More importantly, we achieved virtually the same level of performance as could be obtained with much larger training corpora, which is particularly important for building pilot SDSs that collect real data for the domain of interest. We further evaluated the impact of the speech recognition accuracy improvements on call-routing performance; the resulting improvements in call-routing accuracy were modest, ranging from 0.3% to 0.8%. Collectively, the proposed methods for fast semantic annotation and for building reliable language models by exploiting external resources improve the rapid deployment of statistical SDSs.

Acknowledgements

The author would like to thank the reviewers for their constructive comments and suggestions, and Paola Virga, Agustin Gravano, Kishore Papineni and Yuqing Gao for fruitful discussions.

References

Abney, S., 1996. Part-of-speech tagging and partial parsing. In: Church, K. et al. (Eds.), Corpus Based Methods in Language and Speech. Kluwer Academic Publishers, Dordrecht.
Bechet, F. et al., 2004. Mining spoken dialogue corpora for system evaluation and modeling. In: Proc. Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain.
Berger, A., Miller, R., 2001. Just-in-time language modeling. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Seattle, WA, pp. II:705–708.
Bertoldi, N., Brugnara, F., Cettolo, M., Federico, M., 2001. From broadcast news to spontaneous dialogue transcription: portability issues. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Salt Lake City, UT, pp. I:37–40.
Bulyko, I., Ostendorf, M., Stolcke, A., 2003. Getting more mileage from web text sources for conversational speech language modeling using class-dependent mixtures. In: Proc. Human Language Technology Conf. (HLT), Edmonton, Canada.
Chen, S., Goodman, J., 1996. An empirical study of smoothing techniques for language modeling. In: Proc. Assoc. for Computational Linguistics (ACL).
Chen, S., Rosenfeld, R., 2001. A survey of smoothing techniques for ME models. IEEE Trans. Speech Audio Process. 8 (1), 37–50.
Chung, G., Seneff, S., Wang, C., 2005. Automatic induction of language model data for a spoken dialogue system. In: Proc. Special Interest Group on Discourse and Dialogue (SIGDIAL).
Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

Davies, K. et al., 1999. The IBM conversational telephony system for financial applications. In: Proc. European Conf. on Speech Technology (Eurospeech), Budapest, Hungary.
Fabbrizio, G.D., Tur, G., Tur, D.H., 2004. Bootstrapping spoken dialogue systems with data reuse. In: Proc. SIGDIAL, Cambridge, MA.
Fiscus, J.G., 1997. A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). In: Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Santa Barbara, CA, pp. 347–354.
Fosler-Lussier, E., Kuo, H.-K.J., 2001. Using semantic class information for rapid development of language models within ASR dialogue systems. In: Proc. ICASSP, Salt Lake City, UT.
Gao, Y. et al., 2006. IBM MASTOR: multilingual automatic speech-to-speech translator. In: Proc. ICASSP, Toulouse, France.
Goel, V., Kuo, H.-K.J., Deligne, S., Wu, C., 2005. Language model estimation for optimizing end-to-end performance of a natural language call routing system. In: Proc. ICASSP, Philadelphia, PA.
Gulli, A., Signorini, A., 2005. The indexable Web is more than 11.5 billion pages. In: Proc. WWW-2005, Chiba, Japan, May.
Hacioglu, K., Ward, W., 2003. Target word detection and semantic role chunking using support vector machines. In: Proc. HLT, Edmonton, Canada.
Hardy, H. et al., 2004. Data-driven strategies for automated dialogue systems. In: Proc. ACL, Barcelona, Spain.
Jurafsky, D., Wooters, C., Tajchman, G., Segal, J., Stolcke, A., Fosler, E., Morgan, N., 1994. The Berkeley Restaurant Project. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP), pp. 2139–2142.
Katz, S.M., 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust. Speech Signal Process. 35 (3), 400–401.
Kingsbury, B. et al., 2003. Toward domain-independent conversational speech recognition. In: Proc. Eurospeech, Geneva, Switzerland.
Kneser, R., Ney, H., 1995. Improved backing-off for m-gram language modeling. In: Proc. ICASSP, pp. 181–184.
Kudo, T., Matsumato, Y., 2000. Use of support vector learning for chunk identification. In: Proc. 4th Conf. on Very Large Corpora, pp. 142–144.
Lapata, M., Keller, F., 2004. The Web as a baseline: evaluating the performance of unsupervised web-based models for a range of NLP tasks. In: Proc. HLT/NAACL, Boston, MA, pp. 121–128.
Lefevre, L., Gauvain, J.-L., Lamel, L., 2005. Genericity and portability for task-independent speech recognition. Comput. Speech Lang. 19 (3), 345–363.
Levin, E., Pieraccini, R., Eckert, W., 2000. A stochastic model of human–machine interaction for learning dialogue strategies. IEEE Trans. Speech Audio Process. 8 (1).
Magerman, D.M., 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. Thesis, Stanford University.
McCandless, M., Glass, J., 1993. Empirical acquisition of word and phrase classes in the ATIS domain. In: Proc. Eurospeech, Berlin, Germany.
Meng, H.M., Siu, K.-C., 2002. Semiautomatic acquisition of semantic structures for understanding domain-specific natural language queries. IEEE Trans. Knowledge Data Eng. 14 (1), 172–181.
Papineni, K., Roukos, S., Ward, T., Zhu, W., 2002. Bleu: a method for automatic evaluation of machine translation. In: Proc. ACL, Philadelphia, PA.
Pietra, S.D., Pietra, V.D., Lafferty, J., 1997. Inducing features of random fields. IEEE Trans. Pattern Anal. Machine Intell. 19 (4), 380–393.
Rosenfeld, R., 2001. Two decades of statistical language modeling: where do we go from here? Proc. IEEE 88 (8).
Rudnicky, A., 1995. Language modeling with limited domain data. In: Proc. ARPA Spoken Language Technology Workshop, pp. 66–69.
Sarikaya, R., Gao, Y., Virga, P., 2004. Fast semi-automatic semantic annotation for spoken dialogue systems. In: Proc. ICSLP, Jeju Island, South Korea, October.
Sarikaya, R., Gravano, A., Gao, Y., 2005a. Rapid language model development using external resources for new spoken dialog domains. In: Proc. ICASSP, Philadelphia, PA.
Sarikaya, R., Kuo, H.-K.J., Gao, Y., 2005b. Impact of using web based language modeling on speech understanding. In: Proc. IEEE ASRU Workshop, San Juan, Puerto Rico.
Sarikaya, R., Kuo, H.-K.J., Goel, V., Gao, Y., 2005c. Exploiting unlabeled data using multiple classifiers for improved natural language call-routing. In: Proc. Interspeech 2005, Lisbon, Portugal.
Seneff, S., Meng, H., Zue, V., 1992. Language modeling for recognition and understanding using layered bigrams. In: Proc. ICASSP, pp. 317–320.
Tur, G., Hakkani-Tur, D., Schapire, R.E., 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Commun. 45 (2), 175–186.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, NY, USA.
Zhu, X., Rosenfeld, R., 2001. Improving trigram language modeling with the world wide web. In: Proc. ICASSP, Salt Lake City, UT, pp. I:533–536.