Computer Speech and Language (1997) 11, 207–224
Unconstrained keyword spotting using phone lattices with application to spoken document retrieval

J. T. Foote,∗ S. J. Young,∗ G. J. F. Jones† and K. Spärck Jones†

∗Cambridge University Engineering Department, Cambridge, CB2 1PZ, UK
†Cambridge University Computer Laboratory, Cambridge, CB2 3QG, UK
Abstract

Traditional hidden Markov model (HMM) word spotting requires both explicit HMM models of each desired keyword and a computationally expensive decoding pass. For certain applications, such as audio indexing or information retrieval, conventional word spotting may be too constrained or impractically slow. This paper presents an alternative technique, where a phone lattice (representing multiple phone hypotheses) is pre-computed prior to need. Given a phone decomposition of any desired keyword, the lattice may be rapidly searched to find putative occurrences of the keyword. Though somewhat less accurate, this can be substantially faster (orders of magnitude) and more flexible (any keyword may be detected) than previous approaches. This paper presents algorithms for lattice generation and scanning, and experimental results, including comparison with conventional keyword-HMM approaches. Finally, word spotting based on phone lattice scanning is demonstrated to be effective for spoken document retrieval.

© 1997 Academic Press Limited
1. Introduction

Locating particular words in unconstrained speech is termed word spotting. Traditional word-spotting approaches require considerable computational effort to be expended for every keyword that is to be detected. Typically, a hidden Markov model (HMM) must be constructed or trained, and an expensive decoding operation performed, for every keyword (Rose, 1991). This has limited word spotting to applications such as "topic spotting", where appropriate keywords may be selected well in advance and the necessary weighting factors and thresholds can be pre-trained (McDonough, Ng, Jeanrenaud, Gish & Rohlicek, 1994; Wright, Carey & Parris, 1995). Other, arguably more useful, applications (such as audio indexing or on-demand information retrieval) need a word spotter that is both extremely rapid and not limited to a set of predefined keywords.
Figure 1. Phone lattice for word “manage” (m ae n ih jh).
This paper reports on an alternative keyword-spotting technique satisfying the above demands. In this approach, a phone lattice representing multiple phone hypotheses is computed for each utterance before search time. Though this takes substantial computation, it need only be done once, and then any arbitrary keyword may be rapidly located by searching the lattice (James & Young, 1994). In contrast, conventional word spotters require a Viterbi decoding pass for every keyword or group of keywords.

The paper is organized as follows. Section 2 presents the mechanisms of phone lattice generation, pruning, and scanning to detect keywords. Section 3 presents experimental spotting results, including a description of the speech corpus used for evaluation and comparisons with conventional keyword-HMM spotting. Finally, Section 4 demonstrates how the lattice-based word spotter can be used to generate acoustic indexes for spoken document retrieval. Though somewhat less accurate than a keyword-HMM spotter, lattice-based word spotting is shown to be more effective for spoken document retrieval because it allows arbitrary keywords to be located at search time.

2. Lattice-based word spotting

Given a perfect phonetic transcription of a spoken utterance, it would be simple to find a phone sequence corresponding to a given word. Unfortunately, real-world phone recognition is subject to errors: phones will be missed (deletion errors), misrecognized (substitution errors) or spuriously recognized (insertion errors). For example, even on clean read speech, the best phone recognition accuracy achieved to date is little better than 70% (Robinson, 1991).

Phone lattice scanning attempts to compensate for this inaccuracy by retaining multiple hypotheses from the phone recognizer, in the expectation that if the maximum-likelihood phone sequence does not contain the desired phone string, then one of the "next-best" sequences will. These multiple hypotheses can be stored as a phone lattice: a directed acyclic graph whose edges represent hypothesized phone occurrences and whose nodes represent the corresponding start and end times. Edges are labelled with an acoustic score indicating the log likelihood that the acoustic data over that interval was generated by that particular phone model. Lattices are constrained such that all paths start at a particular start node and end at another special node. Figure 1 shows a lattice generated for the single utterance "manage"; the correct path is shown in grey. For clarity, acoustic scores and start/end times are not shown, though nodes are arranged in roughly chronological order.
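Concretely, a lattice of this kind needs very little machinery to represent. The following Python sketch (the record layout and names are ours, not from the paper) stores each lattice edge as a phone label, start and end times, and an acoustic score; the nodes are implicit in the times.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhoneEdge:
    phone: str    # phone label, e.g. "ae"
    start: int    # start time (frame index) -- a node in the lattice
    end: int      # end time (frame index)   -- the following node
    score: float  # acoustic log likelihood of this phone over [start, end]

# A lattice is then simply a collection of edges; paths run from the
# earliest node to the latest. Values below are purely illustrative.
lattice = [
    PhoneEdge("sil", 0, 12, -310.2),
    PhoneEdge("m",  12, 20, -195.7),
    PhoneEdge("ae", 20, 31, -240.1),
    # ... further hypotheses, several per time region in a deep lattice
]
```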
2.1. Phone lattice generation

For the work described here, phone lattices were generated using a simple extension to the token-passing implementation of the basic Viterbi algorithm (Young et al., 1989). In this scheme, each partial state/frame alignment path is represented by a token which is propagated from state to state at each time step. Paths are iteratively extended by examining the tokens for all connecting states, adding the transition and output log probabilities, and then propagating only the best token. The token propagated into a phone q is selected by choosing the most likely token exiting from all connecting phones p_i. Each token records its history in a chained list of phone transition records: every time a token t transits from phone p_i to another phone q, the identity of p_i (and the current time) is appended to t's history list.

A simple way to generate multiple phone hypotheses is to allow each state to hold multiple tokens and then to record at each phone boundary not just the phone p_i holding the best token, but the set of phones {p_i} holding the N-best tokens (Odell, 1995). To do this efficiently, it is necessary to discard the least likely tokens in a set of tokens with equivalent histories. In the implementation used here, two histories are regarded as equivalent if they end in the same phone.
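The multiple-token selection step can be sketched in Python as follows. This is an illustration of the idea, not the authors' HTK-based implementation; the structures and names are ours.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Token:
    score: float    # accumulated log likelihood of the partial path
    history: tuple  # chain of (phone, exit_time) phone-transition records

def propagate_into_phone(exiting_tokens, n_best):
    """Choose the tokens that enter a phone q from the tokens exiting all
    connecting phones p_i. Tokens with equivalent histories (here: histories
    ending in the same phone) are merged by keeping only the most likely,
    and the N best survivors are recorded at the phone boundary."""
    best_per_class = {}
    for tok in exiting_tokens:
        last_phone = tok.history[-1][0] if tok.history else None
        cur = best_per_class.get(last_phone)
        if cur is None or tok.score > cur.score:
            best_per_class[last_phone] = tok
    return heapq.nlargest(n_best, best_per_class.values(),
                          key=lambda t: t.score)
```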
2.2. Lattice depth

The "depth" of a phone lattice is the average number of phone hypotheses active at a given time. For example, the depth of the lattice shown in Fig. 1 is 5, because 35 edges were generated for the seven phones actually uttered (m ae n ih jh, plus beginning and ending silences). The depth is critical to the performance of a lattice-based word-spotting system. If the lattice is too shallow, performance will be poor because phones of wanted keywords will have been deleted. On the other hand, if the lattice is too deep, too many phone sequences become possible, most of which will be incorrect. Furthermore, the storage requirements and search time increase substantially with lattice depth.

In a real system, lattice depth would be controlled by the normal beam-search pruning mechanism. However, lattice computation is expensive, and for the experiments reported here the effect of varying lattice depth was investigated by first generating large lattices and then pruning them back. This allowed performance to be measured over a range of lattice depths.
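One reasonable way to compute the depth (an assumption on our part; the paper does not give a formula) is the time-averaged number of edges spanning each instant, i.e. the summed duration of all edges divided by the utterance duration, assuming the PhoneEdge records sketched earlier:

```python
def lattice_depth(edges, total_frames):
    """Time-averaged number of simultaneously active phone hypotheses:
    the summed duration of all edges over the utterance duration."""
    return sum(e.end - e.start for e in edges) / float(total_frames)
```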
2.3. Phone lattice scanning

Once computed, a phone lattice may be rapidly searched to find keywords. This requires a phonetic decomposition of the desired words, but such decompositions are easily found from a dictionary or by a rule-based algorithm (Coker, Church & Liberman, 1990). To simplify searching and storage, an assumption is made that any phone starting at time t+1 may follow a phone ending at time t. Ignoring the detailed lattice connectivity in this way results in large storage and I/O savings, because just the start and stop times, acoustic score, and phone identity need be preserved for each lattice edge.

The actual search strategy used to find all examples of a keyword in a lattice is as follows. A list is kept of candidate phone sequences in the lattice which match (i.e. are a prefix of) the keyword being sought. For each candidate c, a record is kept of the
total acoustic score, the current end time of the sequence and the next phone p_c required to match the keyword. At each time t the following steps are taken.

1. For each complete candidate sequence c ending at time t, calculate a normalized acoustic score for the sequence, record the keyword instance and delete c.
2. For each incomplete candidate sequence c ending at time t, if the lattice contains an instance of p_c starting at time t then extend the candidate sequence; otherwise delete it.
3. For each instance of p_s starting at time t, where p_s is the first phone in the keyword, create a new candidate sequence.

The scores are normalized by scaling the log likelihood of the phone sequence corresponding to the keyword instance by the log likelihood of the best possible phone sequence spanning the duration of the hypothesized keyword. Deep lattices will result in many hypotheses for a given word, so overlapping word hypotheses are eliminated by discarding all but the highest-scoring one. For example, Fig. 1 shows that there are two possible hypotheses for the phone sequence m ae n ih jh, because of the two instances of the phone jh following ih.

The detailed implementation of the above scanning procedure involves a speed/memory trade-off. To determine whether to extend a path, it is necessary to know whether a given phone starts at a certain time. One strategy is to create a 2-D lookup table indexed by time on one axis and phone on the other. Each non-null element in the lookup table points to a linked list of edges of that phone starting at that time (James, 1995). This is very fast but also memory-intensive; indexing a large number of lattices in this manner can quickly exhaust memory resources.

A more conservative approach was used both for these experiments and in the actual voice mail retrieval application. Here, a 1-D lookup table was used in which each element points to a linked list of phones starting at that time, in effect hashing on the phone's start time. The linked list may then be traversed to determine whether the desired phone starts at the given time. Although this is slower than using the 2-D table, it is much more memory-efficient and still very fast, producing hypotheses of the order of a thousand times faster than would be obtained by word-spotting directly on the source audio waveforms.
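Putting the pieces together, the scanning procedure and the 1-D start-time table can be sketched in Python as follows. This is an illustration of the published algorithm rather than the authors' code; score normalization and the removal of overlapping hypotheses are omitted for brevity.

```python
from collections import defaultdict

def scan_lattice(edges, keyword_phones):
    """Find putative keyword occurrences in a phone lattice.
    `edges` is a list of PhoneEdge records as sketched earlier;
    `keyword_phones` is the keyword's phonetic decomposition,
    e.g. ["m", "ae", "n", "ih", "jh"]. Returns (start, end, raw_score)."""
    # The memory-conservative 1-D lookup table: hash edges on start time.
    starting_at = defaultdict(list)
    for e in edges:
        starting_at[e.start].append(e)

    hits = []
    candidates = []  # (keyword_start_time, next_phone_index, end_time, score)
    for t in sorted(starting_at):
        survivors = []
        for cand in candidates:
            kw_start, idx, end, score = cand
            if end > t:
                survivors.append(cand)  # ends later; revisit at time `end`
            elif end == t:
                # Step 2: extend with any instance of the next required
                # phone starting at t; unextendable candidates are dropped.
                for e in starting_at[t]:
                    if e.phone == keyword_phones[idx]:
                        survivors.append(
                            (kw_start, idx + 1, e.end, score + e.score))
            # Candidates with end < t can never be extended: drop them.
        # Step 3: start a new candidate at each instance of the first phone.
        for e in starting_at[t]:
            if e.phone == keyword_phones[0]:
                survivors.append((e.start, 1, e.end, e.score))
        # Step 1: candidates that have matched every phone become hits.
        candidates = []
        for cand in survivors:
            kw_start, idx, end, score = cand
            if idx == len(keyword_phones):
                hits.append((kw_start, end, score))
            else:
                candidates.append(cand)
    return hits
```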
3. Experimental evaluation

This section presents experiments comparing word-spotting performance using the phone lattice scanner with that of more conventional HMM-based systems. The application for our work on word spotting is an audio indexing engine to retrieve voice messages by content (the VMR system; Brown, Foote, Jones, Spärck Jones & Young, 1996). To support this, an audio database was designed and collected specifically to evaluate word spotting and spoken document retrieval.

3.1. The VMR message corpus

The VMR message corpus is a structured collection of audio training data and information-bearing audio messages. A fixed set of 35 keywords (see Table I) was chosen, including 11 difficult monosyllabic words (e.g. "date" and "mail"), as well as overlapping words (e.g. "word" and "keyword") and word variants (e.g. "locate" and "location"). Fifteen speakers (11 men and four women) each provided about 45 minutes of speech data, for a total of 5 hours of read training data and 5 hours of spontaneous speech messages. Speakers spoke primarily British English, except for one North American and one European-accented speaker. Data were recorded at 16 kHz from a Sennheiser HMD 414 close-talking microphone in an acoustically quiet room.

Table I. VMR keywords

active     find        manage       plan       spotting
assess     indigo      meeting      project    staff
badge      interface   message      rank       time
camera     keyword     microphone   retrieve   video
date       locate      network      score      window
display    location    output       search     word
document   mail        pandora      sensor     workstation

Each speaker provided the following:

• 77 read sentences ("r" data) consisting of keywords spoken in a sentence context. Sentences were constructed such that each keyword occurred a minimum of five times.
• 150 read sentences ("z" data) consisting of phonetically-rich sentences from the TIMIT corpus. Unlike the 77 common "r" sentences, the "z" sentences are different for each speaker.
• 175 isolated keywords ("i" data): five occurrences of each of the 35 keywords spoken in isolation.
• 20 prompted messages ("p" data) consisting of natural speech responses to a prompt requesting a video mail message. The average duration of each message is one minute.

The "r", "i" and "z" sentences are intended for use as training data; the "p" messages, along with their text transcriptions, serve as a test corpus for both keyword spotting and retrieval experiments. Messages are fully spontaneous, and contain a large number of disfluencies such as "um" and "er", partially uttered words and false starts, laughter, sentence fragments, and informalities and slang. All files were transcribed at the word level, and spontaneous features such as hesitations and filled pauses were marked. Phonetic transcriptions were then derived automatically from a large dictionary of British English word pronunciations (BEEP). The standard reduced TIMIT phone set was augmented with additional vowels specific to British English pronunciation. A full description of the VMR corpus may be found in Jones, Foote, Spärck Jones and Young (1994).

Speaker-independent HMM training data was obtained from the training portion of the WSJCAM0 British English corpus (Robinson, Fransen, Pye, Foote & Renals, 1995). This corpus serves as a British English adjunct to the American Wall Street Journal corpus. The training portion consists of 90 sentence-length utterances from each of 92 native British English speakers. Data were recorded at 16 kHz from the same close-talking microphone used to record the VMR corpus.

3.2. Model training

To support the word-spotting experiments described below in Section 3.3, a variety of whole-word and sub-word HMM sets were generated. All acoustic data were
parameterized into 12 mel-cepstral coefficients at a 100 Hz frame rate, and difference and acceleration coefficients were appended. The HTK toolkit was used for all training and recognition, including phone lattice generation (Young et al., 1995). The whole-word HMMs had two Gaussian mixture components per state, and all sub-word HMMs had eight mixture components per state (12 for the states of the silence models).

SD Whole-word models. A whole-word HMM was estimated for each VMR keyword using approximately 10 training examples per model, of which half were from the isolated-word "i" data and the remainder were extracted from the continuously spoken "r" data. These models had two mixture components per state.

SD Monophone models. Three-state speaker-dependent (SD) monophone models were trained on the read sentences ("r" and "z" data) from the VMR corpus. Once single-Gaussian monophone models had been initialized, the number of mixture components was increased using mixture-splitting. This is a process whereby component means are copied, then perturbed, and then re-estimated using Baum-Welch. This was repeated until every state had eight mixture components.

SI Monophone models. Speaker-independent (SI) monophone models were trained in a similar manner using the 92-speaker training set of the WSJCAM0 corpus.

SI Biphone models. Tied-state biphone HMMs were generated using an incremental training procedure (Young, Odell & Woodland, 1994). Single-Gaussian versions of the SI monophone models were "cloned" such that every possible biphone combination was represented by a copy of the appropriate monophone. Transition matrices were tied across all instances of each phone. After several iterations of Baum-Welch embedded re-estimation, corresponding states from each subset of biphones derived from the same base monophone were clustered using phonetic decision trees. This resulted in 2025 logical biphone models represented by 1119 physical models. From these sets of tied-state single-Gaussian HMMs, mixture-Gaussian HMMs were generated by mixture-splitting as described above. The final model sets had eight mixture components for all speech states and 12 for all silence states.

SI Triphone models. A set of state-tied SI triphones was generated in the same manner as for the biphones. This set consisted of 89 102 logical context-dependent triphone models represented by 6390 physical models.
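The mixture-splitting step described above admits a compact sketch. This is illustrative only: the perturbation heuristic below is an assumption on our part, and the Baum-Welch re-estimation performed after each split is not shown.

```python
import numpy as np

def split_mixtures(means, weights, perturbation=0.2):
    """One round of mixture-splitting: copy each component mean, perturb
    the copies in opposite directions, and halve the weights. The split
    components are then re-estimated with Baum-Welch (not shown)."""
    new_means, new_weights = [], []
    for mu, w in zip(means, weights):
        offset = perturbation * np.ones_like(mu)  # stand-in perturbation
        new_means.extend([mu + offset, mu - offset])
        new_weights.extend([w / 2.0, w / 2.0])
    return np.array(new_means), np.array(new_weights)

# Repeated splitting 1 -> 2 -> 4 -> 8 components, with re-estimation after
# each split, yields the eight-component states used in these experiments.
```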
3.3. Conventional word spotting

Figure 2. Subword-model keyword recognition network.

To provide a comparison for the lattice-based word spotting, baseline experiments using SD whole-word and SI sub-word models were performed using a conventional
keyword-HMM approach similar to that of Rose (1991). Speaker adaptation was also investigated as a means of improving performance (especially for speakers with non-British accents).

The subword-based keywords were constructed by concatenating the appropriate sequence of subword models using the BEEP phonetic dictionary (see Fig. 2). Biphones are used at the beginning and end of the keyword and triphone models are used internally. For example, the keyword "find" is represented by the model sequence f+ay f−ay+n ay−n+d n−d.[1] For both the whole-word and subword cases, non-keyword speech is modelled by an unconstrained network of monophones (denoted "filler models"), as shown in Fig. 2.

The VMR corpus is realistic in that it contains speakers with varied backgrounds and accents. In contrast, the acoustic models were trained exclusively on British English speakers. In an attempt to ameliorate the mismatch that this causes, and to increase spotting performance in general, speaker adaptation was investigated. A maximum-likelihood linear regression (MLLR) approach was chosen because it has been shown previously to improve recognition with a comparatively small amount of enrolment data (Leggetter & Woodland, 1995). Varying amounts of the VMR corpus training data were used as enrolment data for the speaker-adaptation experiments. Preliminary experiments showed that using the "z" training data for adaptation resulted in only modest performance improvements, whereas, perhaps not surprisingly, much better results could be obtained using the keyword-rich "r" data.

To perform MLLR adaptation, all HMM states were clustered into "classes" and a separate adaptation transformation was estimated for each class. Only the Gaussian means were adapted. A threshold was used to select how many classes were used for a given set of models and training data. A range of thresholds was investigated; the best-performing threshold resulted in an average of 127 classes across all speakers.
[1] The notation x−y+z means that phone y has a left context of x and a right context of z.
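The construction of the context-dependent model sequence for a keyword can be sketched as follows; this is a hypothetical helper that reproduces the "find" example above.

```python
def keyword_models(phones):
    """Build the context-dependent model sequence for a keyword from its
    phone decomposition: biphones at the word boundaries and triphones
    internally. For example:
        keyword_models(["f", "ay", "n", "d"])
        -> ["f+ay", "f-ay+n", "ay-n+d", "n-d"]
    """
    if len(phones) < 2:
        return list(phones)
    models = [f"{phones[0]}+{phones[1]}"]          # right-context biphone
    for i in range(1, len(phones) - 1):            # internal triphones
        models.append(f"{phones[i-1]}-{phones[i]}+{phones[i+1]}")
    models.append(f"{phones[-2]}-{phones[-1]}")    # left-context biphone
    return models
```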
An unfortunate consequence of speaker adaptation is that it increases the number of false alarms, because the average model likelihood is increased by the adaptation process. For the speaker-dependent models, the operating point was satisfactorily determined by an appropriate choice of filler models. However, finer control was needed for the speaker-adaptation experiments, because the false alarm rate changes significantly with the degree of adaptation. The solution adopted here was to introduce a separate transition penalty into the keyword models, as in Fig. 2. Adjusting this global penalty value changes the likelihood of the keywords relative to the filler models, and thus allows the false alarm rate to be controlled.

3.4. Word spotting results

Keyword spotting involves a two-pass recognition procedure. First, Viterbi decoding is performed on a network of just the filler models, yielding a time-aligned sequence of the maximum-likelihood filler monophones and their associated log-likelihood scores. Secondly, another Viterbi decoding is carried out using a network of the keywords, silence, and filler models in parallel. In a manner similar to Rose (1991), keywords are rescored by normalizing each hypothesis score by the average filler model score over the keyword interval. Unlike Rose (1991), however, the average log likelihoods are divided rather than subtracted, which results in somewhat better performance (Knill & Young, 1994).

For scoring purposes, "true" keyword hits were determined by scanning the Viterbi-aligned phones for sequences of keyword phones. Note that a consequence of using the phone transcriptions to determine keyword locations is that, for example, an occurrence of "locate" in the utterance "Hello Kate" is recorded as a correct hit and not a false alarm.

An accepted figure of merit (FOM) for word spotting is defined as the average percentage of correctly detected keywords as the threshold is varied from one to 10 false alarms per keyword per hour. This measure is computed by ranking all word hypotheses in score order and then lowering the threshold until there is just one false alarm, then just two false alarms, and so on until the 10-false-alarm limit is reached.

The FOM results for the various model sets on the VMR message corpus are shown in Table II. The speaker-dependent whole-keyword models are shown in column SD and the speaker-independent subword-model results in column SI, while the remaining columns show the effects of speaker adaptation on the SI models using different amounts of adaptation data. Though adaptation does not uniformly improve all speakers, the average increase is substantial, and is particularly dramatic for speaker 64 (who has an American accent ill-suited to the British English models). Using just 13 utterances for enrolment improved the FOM performance substantially (column SI-r13), and increasing the enrolment data to 77 keyword-rich utterances (column SI-r77) yielded FOM scores nearly matching the SD whole-keyword results.
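For concreteness, the FOM can be computed along the following lines. This sketch pools hypotheses over all keywords and omits the interpolation between false-alarm levels used in formal scoring tools; all names are ours.

```python
def figure_of_merit(hypotheses, n_true, n_keywords, hours):
    """Illustrative Figure-of-Merit computation: `hypotheses` is a list of
    (score, is_true_hit) pairs over all keywords; `n_true` is the number of
    keywords actually uttered. The detection percentage is averaged over
    thresholds allowing 1..10 false alarms per keyword per hour."""
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)
    limits = [max(1, round(i * n_keywords * hours)) for i in range(1, 11)]
    rates, hits, fas, level = [], 0, 0, 0
    for score, is_hit in ranked:
        if is_hit:
            hits += 1
            continue
        fas += 1
        while level < 10 and fas > limits[level]:
            rates.append(100.0 * hits / n_true)  # rate before exceeding limit
            level += 1
        if level == 10:
            break
    rates.extend([100.0 * hits / n_true] * (10 - len(rates)))
    return sum(rates) / 10.0
```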
Table II. Figures of Merit (%) for speaker-dependent whole-word HMMs (SD), speaker-independent subword HMMs (SI), and adapted subword HMMs with 13 adaptation utterances (SI-r13) and 77 adaptation utterances (SI-r77), using conventional Viterbi word-spotting and optimal a posteriori transition penalties. VMR message corpus.

Speaker   SD      SI      SI-r13   SI-r77
51        66·67   64·52   69·18    68·97
52        75·72   56·15   63·89    70·34
53        70·45   72·82   77·14    80·90
54        83·81   63·23   69·72    80·16
55        84·55   84·55   92·58    91·77
56        77·62   65·71   72·65    72·33
57        76·63   71·06   70·50    77·46
58        88·79   69·87   78·83    85·89
59        88·73   80·34   89·75    90·36
60        84·85   80·42   84·04    86·47
61        65·41   53·99   59·11    63·04
62        88·49   80·75   87·06    88·41
63        95·29   85·92   87·34    90·16
64        87·22   41·98   80·07    81·99
65        82·90   77·07   73·80    78·67
Mean      81·14   69·89   77·13    80·46
3.5. Lattice spotting results

Table III presents the comparable results using phone lattice-based word spotting to locate the same 35 keywords in the VMR corpus messages. Three types of models were used to build the lattices: speaker-dependent (SD-m) and speaker-independent (SI-m) monophone models, and speaker-independent biphone models (SI-b).

Table III. Figure of Merit (%) for phone lattice-based keyword spotting using lattices constructed with speaker-independent monophones (SI-m), speaker-dependent monophones (SD-m), and speaker-independent biphones (SI-b). VMR message corpus.

Speaker   SI-m    SD-m    SI-b
51        32·52   58·44   51·62
52        25·38   67·67   44·21
53        42·80   61·81   59·26
54        48·13   70·64   60·78
55        48·84   75·33   71·71
56        47·73   67·93   52·25
57        64·67   74·31   57·96
58        50·96   76·86   62·58
59        63·92   80·83   73·80
60        53·85   81·32   58·83
61        37·32   56·71   49·79
62        60·12   83·06   71·57
63        67·60   94·83   84·04
64        25·47   76·21   39·93
65        50·78   78·03   68·24
Mean      48·00   73·60   60·44
Biphone transitions were restricted to those models allowed by context; for example, the only transitions permitted into the biphone th+ax were from other biphones whose right context was th. In addition, phone-to-phone transitions were weighted by bigram transition probabilities trained on the phone transcriptions of the WSJCAM0 training set. Results are presented for the deepest lattices generated (approximately 85 deep for the biphone-generated lattices). Preliminary experiments using triphones for phone recognition yielded a negligible increase in accuracy whilst incurring a very much larger computational burden; hence, triphone models were not used further in the phone lattice experiments.

Table IV. Phone recognition accuracy for the three model sets used in the phone lattice scanning experiments.

SI-m     SD-m     SI-b
41·1%    55·4%    51·7%

Table IV shows the corresponding "1-best" phone accuracy for each of the model sets used in Table III, where accuracy is defined as the ratio of correctly recognized phones, minus deletions and insertions, to the actual number of phones, for the best path through the lattice. Experiments on the WSJCAM0 read-speech corpus using the monophone models resulted in phone accuracies of more than 60%, suggesting that the spontaneous speech in the VMR corpus is more difficult to recognize than read speech. It should be noted that the speaker-dependent models were trained on data containing many occurrences of the 35 keywords, and this might improve recognition performance on those keywords (as opposed to arbitrary words). The speaker-independent models, though, are free of this effect.

Comparing Tables II and III, it can be seen that the best SI performance using phone lattices is 60·4% FOM, which is somewhat worse than the 69·9% FOM obtained using conventional keyword spotting. However, as noted already, the phone lattice approach allows open-keyword spotting and, as demonstrated in Section 4, this leads to significantly improved document retrieval performance compared with a fixed-keyword scheme.

3.5.1. The effect of lattice depth

Phone accuracies are not affected by pruning, as they are computed for the best phone path, which is never pruned. Pruning does, however, affect the lattice depth and thus the operating point of the keyword scanner. Figure 3 shows how lattice depth affects the Figure of Merit for the speaker-independent biphone models. The deeper the lattice, the better the Figure of Merit, although with diminishing returns as the number of correctly identified keywords reaches an upper limit. Figure 4 shows how lattice depth affects the numbers of correctly detected keywords and false alarms. The number of false alarms increases dramatically with lattice depth, while the number of true hits approaches the total number of keywords actually uttered (the horizontal dotted line in Fig. 4). The large numbers of false alarms found with deeper lattices do not adversely affect the FOM because they typically have comparatively low acoustic scores. In practice, these low-probability false alarms are removed before the information retrieval stage by ignoring any putative keyword with an acoustic score below a given threshold. Thus, the only consequence of using a lattice which is too deep is to incur unnecessary computation.
Figure 3. Figure of Merit (%) vs. lattice depth (SI biphones).
Figure 4. Keyword hits and false alarms vs. lattice depth (SI biphones).
4. Spoken document retrieval

The particular application of the keyword-spotting work described in this paper is to locate instances of desired words in a large collection of voice messages, in order to identify particular messages matching a user's request (Wilcox & Bush, 1991; Glavitsch & Schäuble, 1992; Spärck Jones, Jones, Foote & Young, 1996). In its simplest form, this can serve as an "audio grep" that returns a list of messages containing a particular keyword. However, such a simple implementation is vulnerable to spotting errors. This section shows that well-established techniques developed for text information
retrieval can be straightforwardly adapted to the speech domain, where they can compensate for much of the uncertainty caused by acoustic matching errors.

A spoken document retrieval system functions much like a conventional text retrieval system in that a specific user request is used to locate promising audio documents. Most of the time-consuming speech recognition must be done off-line, as messages are added to the archive, so that user requests may be processed quickly (of the order of seconds rather than hours). This requirement is consistent with the phone lattice-based approach. In operation, a user generates a text request consisting of one or more search keys, and each message in the archive is scored for relevance based on the keyword hits. The result is a list of potentially relevant messages, ranked in order of estimated relevance to the request.

It should be noted that this type of spoken document retrieval is rather different from much of the related research on audio "topic" identification, where much broader subject classification is typical and the classes are predefined (Rose, 1991; McDonough et al., 1994; Wright et al., 1995). This is appropriate for document routing but not for interactive user-initiated retrieval. Our system, in contrast, seeks to identify those documents in the archive that are relevant to an individual request submitted at a particular time, without any prior definition of topic classes. Further, the contents of a voice mail archive are dynamic over time, and new messages may introduce new subject areas or additional topics within existing areas. Thus, as there is no opportunity to select and weight keywords in advance for specific topics, messages must be scored relative to the user's request based on their internal keyword composition and their relationship to the overall contents of the archive.

4.1. Retrieval methodology

For the work described here, standard indexing and matching techniques were applied both to the text transcription files and to the quasi-transcriptions generated by the speech recognition engines. Performance on the text transcriptions could then be used as a reference for the various speech retrieval strategies. (More detailed descriptions of IR performance may be found in Spärck Jones et al. (1996) and Jones, Foote, Spärck Jones & Young (1996a).)

Information retrieval is performed on the word-spotting outputs, which are treated as "pseudo-transcriptions" of the acoustic messages. Requests are entered as written text in natural language, and common function words (such as "and", "a", "the") having little information content are removed. Once processed, a request is referred to as a search query and the words that it contains are called terms. Note that in text-based systems, the endings of query terms are usually removed using a suffix-stripping algorithm (Porter, 1980). For speech-based retrieval, short keywords yield higher false alarm rates; hence, suffix-stripping is less useful and was not used here.

Given a query consisting of a set of terms, the score for each message is based on the frequency of term occurrence. However, in text retrieval it has been shown that weighting term occurrences according to the global statistics of term distribution can lead to improved performance (Salton, 1983; Robertson & Spärck Jones, 1996).
We have investigated the behaviour, for the speech case, of unweighted matching and of two different types of weighted matching; the results confirmed that, as with text, the best performance is obtained with the following combined weight scheme:
$$w(i,j)=\frac{c(i)\,f(i,j)\,(K+1)}{K\,l(j)+f(i,j)},\qquad(1)$$

where w(i,j) is the weight of term i in message j, f(i,j) is the number of occurrences of term i in message j, and l(j) is the normalized message length (Spärck Jones et al., 1996). c(i) is the collection frequency weight

$$c(i)=\log\frac{N}{n(i)},$$

where N is the total number of messages and n(i) is the number of messages that contain term i. The main ideas of this weighting scheme are that terms will occur frequently in relevant messages, that terms occurring in a small number of messages should be favoured since they are more discriminating, and that messages should be normalized for length since long messages will naturally have more term occurrences, independently of their relevance. Given the above weighting scheme, the score for each message j in the corpus is computed by summing over all the query term weights, i.e.

$$\mathrm{score}(j)=\sum_i w(i,j).$$
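In implementation terms, Equation (1) amounts to a few lines of code. The sketch below assumes a hypothetical inverted index mapping each term to its per-message hit counts f(i,j), built from the thresholded word-spotting output; the names are ours, not the authors'.

```python
import math

def score_messages(query_terms, postings, lengths, K=1.0):
    """Combined-weight scoring of Equation (1). `postings` maps
    term -> {message_id: f(i, j)}; `lengths` maps message_id -> l(j),
    the normalized message length. Returns messages ranked by score."""
    N = len(lengths)                      # total number of messages
    scores = {j: 0.0 for j in lengths}
    for term in query_terms:
        hits = postings.get(term, {})
        if not hits:
            continue
        c = math.log(N / len(hits))       # collection frequency weight c(i)
        for j, f in hits.items():
            scores[j] += c * f * (K + 1) / (K * lengths[j] + f)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```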
Messages are then ranked in score order.

The message length is ordinarily measured as the number of term occurrences in the message. However, this is not appropriate for spoken document retrieval, where the message is explicitly represented only by the search terms found for the current query. We therefore use the number of phones in the best matching path through the message as an estimate of message length, and normalize using the average length of all messages in the corpus. This has been found to be a better estimate than the time-duration of the message, possibly because it is independent of speaking rate (Jones et al., 1996a).

The combined weight constant K must be determined empirically. After informal testing, a value of K=1 was used for the experiments reported here.

4.2. Measuring retrieval performance

Since we did not have a natural user population submitting requests to the VMR corpus, we had to obtain retrieval requests, and the relevance assessments needed to evaluate performance, by alternative means. For the experiments reported here, requests were constructed from the message prompts used in the database recording, and processed as described in the previous section to obtain the actual search queries. There were thus 50 test queries, with an average of 17·6 distinct terms per query. When the terms were filtered to include only the fixed 35 keywords, the number of distinct terms per query was reduced to 4·6. Messages recorded in response to a prompt were deemed relevant to the query derived from that prompt.

We measured retrieval performance using one of the standard ratios, namely precision: the proportion of retrieved messages that are relevant to a particular query. A single-number performance figure, average precision, is derived as follows: the precision values obtained at each new relevant message in the ranked output for an individual query are averaged, and the results are then averaged across the query set. Note that all messages in the collection are considered by this procedure. It is therefore similar in spirit to the FOM measure, in that it effectively averages precision across the sequence of document cut-off levels chosen to extract 1, 2, 3, . . . relevant documents, and it subsumes the need for a separate recall measure.

Because of spotting errors, acoustic retrieval performance will be poorer than that available from text messages. The extent of the degradation can be measured by comparing retrieval performance on word-spotting results with that on text. We used our transcribed corpus to provide this text performance standard, applying suffix-stripping, matching and scoring as described in the previous section.
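The average precision computation for a single query can be sketched as follows; the reported figure is the mean of this value over all 50 queries.

```python
def average_precision(ranked_ids, relevant):
    """Average precision for one query: the precision at the rank of each
    relevant message, averaged over all the relevant messages (all of which
    appear somewhere in the ranking, since every message is considered)."""
    hits, total = 0, 0.0
    for rank, msg_id in enumerate(ranked_ids, start=1):
        if msg_id in relevant:
            hits += 1
            total += hits / rank   # precision at this relevant message
    return total / len(relevant) if relevant else 0.0
```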
Figure 5. Average retrieval precision vs. keyword score threshold.
4.3. Operating point selection

The word spotter outputs a list of putative keyword hits and associated acoustic scores. Because the message retrieval scheme uses only the presence or absence of a keyword in a message, the acoustic scores are thresholded such that only hits with a score above the threshold are counted. Because false alarms typically score lower than true hits, high threshold values will remove a greater proportion of the false alarms. Choosing the appropriate threshold is a trade-off between the numbers of Type I errors (missed keywords) and Type II errors (false alarms), with the usual problem that reducing one increases the other.

Figure 5 shows how the IR performance varies with the choice of score threshold and transition penalty, for the lattice scanning output of all terms. At low threshold values, retrieval performance is somewhat impaired by a high proportion of false alarms (Type II errors); conversely, higher thresholds (towards the right) remove a significant number of true hits (Type I errors), also degrading performance.
Table V. Relative retrieval performance (average precision) and Figures of Merit. VMR message corpus.

                                Absolute   Relative (%)   FOM (%)
Text (All terms)                 0·718       100·0          —
Text (35 KWs)                    0·358        49·9          —
SD Whole-keyword models          0·316        44·0         81·46
SI Subword models                0·300        41·8         69·89
Adapted SI subword (r13)         0·324        45·1         77·13
Adapted SI subword (r77)         0·338        47·1         79·26
SD Phone lattice (35 KWs)        0·302        42·1         73·60
SI Phone lattice (35 KWs)        0·247        34·4         60·44
SD Phone lattice (all terms)     0·495        68·9          —
SI Phone lattice (all terms)     0·470        65·5          —
The optimal threshold, in the central region, represents the best trade-off between the numbers of true hits and false alarms. The FOM is an average across a wide range of a posteriori thresholds, and so gives no useful information about choosing an appropriate threshold. Because the goal of the word spotting is to produce a successful retrieval application, IR results are presented as a more relevant measure of word-spotting performance.
4.4. Information retrieval results

Table V compares retrieval performance on the VMR message set for each of the word-spotting systems discussed in Section 3, using the combined weighting scheme of Equation (1) and the best a posteriori acoustic thresholds. Also shown for comparison are the retrieval scores obtained using the reference text transcriptions, and those for the case when the requests are limited to just the 35 keywords. The column "Absolute" is the absolute average precision and the column "Relative" is the average precision relative to that obtainable from the full text transcriptions. The final column shows the word-spotting Figures of Merit for comparison.

These results show that, for a given set of search terms, average retrieval precision for the various model sets follows the same general trend as the FOM scores. However, the comparison between the fixed keyword set and the open set shows that artificially limiting the number of terms in a query severely degrades performance. As shown by the second line of Table V, using text transcriptions, retrieval performance is halved by restricting the available search terms to the 35 keywords; as shown in the lower part of the table, the same effect is observed for the speech-based retrieval. (This would be even more dramatic if the message collection had not been designed in a way that makes the 35 keywords useful search keys for it.) Hence, even though the phone lattice spotting is less accurate than the fixed-keyword spotting, it substantially outperforms the fixed-keyword retrieval because of the availability of additional search terms.
5. Conclusions
This paper has described a word-spotting scheme based on scanning pre-computed phone lattices, and has presented experimental results comparing this scheme with a conventional word spotter. The results show that using a phone lattice as an intermediate processing step incurs a degradation in Figure of Merit of 10% for speaker-dependent recognition and 14% for speaker-independent recognition. When message retrieval is limited to just the 35 predefined keywords, the corresponding drops in retrieval precision are similar, being 4·7% and 17%, respectively.

However, message retrieval can benefit greatly from multiple search terms and statistical term weighting. When full open-keyword search is allowed, the phone lattice-based systems give an increase in retrieval precision of 57% for both the speaker-dependent and speaker-independent cases. Hence, when open search terms are allowed, good retrieval performance can be obtained from word-spotting systems even if they have rather modest accuracy. Though individual words may often be misrecognized, the chance that all are missed (or appear as false alarms in an unrelated message) becomes small as the number of words in the query increases. This effect, termed "semantic co-occurrence filtering" by Kupiec, Kimber and Balasubramanian (1994), follows conventional IR wisdom that longer queries result in much better retrieval performance.

From a retrieval point of view, these results are subject to two caveats. Firstly, in an operational environment a fixed a priori keyword set would be much larger than the one used here, so even if the set was less tailored than ours, reasonable performance might still be expected. However, this does not alter the fact that requests will often contain unknown yet significant words which, if included in the search, will improve performance. Secondly, by text retrieval standards the message set used here is very small, and hence the specific retrieval performance results cannot be taken as anything more than indicative.

Whichever word-spotting scheme is used, speaker-dependent models consistently outperform speaker-independent models. In an attempt to counteract this effect, we have investigated MLLR speaker adaptation. When the adaptation data contains examples of the keywords, good results have been obtained: for example, using just 13 keyword-rich utterances increased the FOM for the SI word spotter from 69·9% to 77·1%. However, using keyword-independent adaptation data was generally not successful, and this requires further study.

In the spirit of the FOM and average precision performance metrics, all experimental results have been computed using a posteriori optimized thresholds. For practical applications, the selection of appropriate operating points (acoustic threshold, lattice depth, etc.) is clearly important. However, for message retrieval this appears not to be critical, since we have obtained good performance over a fairly wide range of thresholds. Once again, the statistical term weighting can compensate to some extent for a non-optimal threshold (Jones et al., 1995).

An alternative approach to reducing the run-time burden of open-vocabulary word-spotting is to transcribe the messages beforehand using a large vocabulary recognition (LVR) system. However, LVR systems are extremely complex, have limited accuracy, and are still not truly open-vocabulary. In particular, new names and places are often outside the vocabulary of an LVR system.
Thus, phone-based lattice scanning is an attractive alternative.
Furthermore, we have previously shown that word spotting can be successfully combined with large-vocabulary recognition for information retrieval. Hence, there may also be value in combining LVR with the phone-lattice approach, so that the phone-lattice scanner can be used to find out-of-vocabulary terms such as proper names (Jones et al., 1996a; Jones et al., 1996b).

In conclusion, the experiments presented here show the potential value of the phone lattice-based approach to word spotting. Although not as accurate as conventional methods, the increased flexibility and speed of this approach make it very effective for information retrieval applications. In addition to the experimental work presented here, we have also developed a working real-time retrieval application which further demonstrates the viability of the approach (Brown et al., 1996). Overall, this work indicates that state-of-the-art information retrieval and word-spotting techniques can be combined successfully to provide a useful spoken document retrieval system.

This project is supported by the UK DTI Grant IED4/1/5804 and SERC Grant GR/H87629. ORL, formerly Olivetti Research Limited, is an industrial partner of the VMR project. The authors are grateful to David James for useful discussions, Julian Odell for decoder improvements, Tony Robinson for compiling the BEEP British English pronunciation dictionary, Kate Knill for speaker-dependent monophone models, and Chris Leggetter and Phil Woodland for speaker-adaptation software.
References

Brown, M. G., Foote, J. T., Jones, G. J. F., Spärck Jones, K. & Young, S. J. (1996). Open-vocabulary speech indexing for voice and video mail retrieval. In Proc. ACM Multimedia 96, Boston, November, ACM.
Coker, C., Church, K. & Liberman, M. (1990). Morphology and rhyming: two powerful alternatives to letter-to-sound rules for speech synthesis. In ESCA Workshop on Speech Synthesis, pp. 83–86, Autrans, France, September, ESCA.
Glavitsch, U. & Schäuble, P. (1992). A system for retrieving speech documents. In Proc. SIGIR 92, pp. 168–176, ACM.
James, D. A. (1995). The Application of Classical Information Retrieval Techniques to Spoken Documents. PhD thesis, Cambridge University, February.
James, D. A. & Young, S. J. (1994). A fast lattice-based approach to vocabulary independent wordspotting. In Proc. ICASSP 94, Volume 1, pp. 377–380, Adelaide, IEEE.
Jones, G. J. F., Foote, J. T., Spärck Jones, K. & Young, S. J. (1994). VMR report on keyword definition and data collection. Technical Report 335, Cambridge University Computer Laboratory, May.
Jones, G. J. F., Foote, J. T., Spärck Jones, K. & Young, S. J. (1995). Video Mail Retrieval: the effect of word spotting accuracy on precision. In Proc. ICASSP 95, Volume I, pp. 309–312, Detroit, May, IEEE.
Jones, G. J. F., Foote, J. T., Spärck Jones, K. & Young, S. J. (1996a). Retrieving spoken documents by combining multiple index sources. In SIGIR 96: Proc. 19th Annu. Int. SIGIR Conf. on Res. and Dev. in Information Retrieval, pp. 30–38, ACM.
Jones, G. J. F., Foote, J. T., Spärck Jones, K. & Young, S. J. (1996b). Robust talker-independent audio document retrieval. In Proc. ICASSP 96, Volume I, pp. 311–314, Atlanta, GA, April, IEEE.
Knill, K. M. & Young, S. J. (1994). Speaker dependent keyword spotting for hand-held devices. Technical Report 193, Cambridge University Engineering Department, July.
Kupiec, J., Kimber, D. & Balasubramanian, V. (1994). Speech-based retrieval using semantic co-occurrence filtering. In Proc. HLT 94, pp. 350–354, ARPA.
Leggetter, C. J. & Woodland, P. C. (1995). Flexible speaker adaptation for large vocabulary speech recognition. In Proc. Eurospeech 95, ESCA.
McDonough, J., Ng, K., Jeanrenaud, P., Gish, H. & Rohlicek, J. R. (1994). Approaches to topic identification on the switchboard corpus. In Proc. ICASSP 94, Volume I, pp. 385–388, Adelaide, IEEE.
Odell, J. J. (1995). The Use of Context in Large Vocabulary Speech Recognition. PhD thesis, Cambridge University, March.
Porter, M. F. (1980). An algorithm for suffix stripping. Program 14, 130–137.
Robertson, S. E. & Spärck Jones, K. (1996). Simple proven approaches to text retrieval. Technical Report 356, Computer Laboratory, University of Cambridge.
Robinson, A. J. (1991). Several improvements to a Recurrent Error Propagation Network phone recognition system. Technical Report CUED/F-INFENG/TR.82, Cambridge University Engineering Department.
Robinson, T., Fransen, J., Pye, D., Foote, J. & Renals, S. (1995). WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition. In Proc. ICASSP 95, pp. 81–84, Detroit, May, IEEE.
Rose, R. C. (1991). Techniques for information retrieval from speech messages. Lincoln Laboratory Journal 4, 45–60.
Salton, G. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, September.
Spärck Jones, K., Jones, G. J. F., Foote, J. T. & Young, S. J. (1996). Experiments in spoken document retrieval. Information Processing and Management 32, 399–417.
Wilcox, L. D. & Bush, M. A. (1991). HMM word spotting for voice editing and indexing. In Proc. Eurospeech 91, pp. 25–28, Genoa, Italy, September.
Wright, J. H., Carey, M. J. & Parris, E. S. (1995). Improved topic spotting through statistical modelling of keyword dependencies. In Proc. ICASSP 95, pp. 313–316, Detroit, MI, May, IEEE.
Young, S. J. (1995). The HTK Book. Entropic Cambridge Research Laboratory Ltd, Sheraton House, Castle Park, Cambridge, England.
Young, S. J., Russell, N. H. & Thornton, J. H. S. (1989). Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
Young, S. J., Odell, J. J. & Woodland, P. C. (1994). Tree-based state tying for high accuracy acoustic modelling. In Proc. ARPA Spoken Language Technology Workshop, Plainsboro, NJ.

(Received 7 October 1996 and accepted for publication 21 April 1997)