+Model YCSLA-645; No. of Pages 17
ARTICLE IN PRESS Available online at www.sciencedirect.com
ScienceDirect Computer Speech and Language xxx (2014) xxx–xxx
Web-based possibilistic language models for automatic speech recognition
Stanislas Oger, Georges Linarès ∗
Université d'Avignon, 339 Chemin des Meinajaries, Agroparc BP 1228, 84911 Avignon cedex 9, France
Received 1 June 2013; received in revised form 14 February 2014; accepted 15 February 2014
Abstract
This paper describes a new kind of language model based on possibility theory. The purpose of these new models is to make better use of the data available on the Web for language modeling; in particular, they aim to integrate information about impossible word sequences. We address the two main problems raised by this kind of model: how to estimate the measures for word sequences, and how to integrate such a model into the ASR system. We propose a word-sequence possibilistic measure and a practical estimation method based on word-sequence statistics, which is particularly suited to estimation from Web data. We develop several strategies and formulations for using these models in a classical automatic speech recognition engine, which relies on a probabilistic modeling of the speech recognition process. This work is evaluated on two typical usage scenarios: broadcast news transcription, with very large training sets, and transcription of medical videos in a specialized domain, with only very limited training data. The results show that the possibilistic models provide a significantly lower word error rate on the specialized-domain task, where classical n-gram models fail due to the lack of training material. For broadcast news, the probabilistic models remain better than the possibilistic ones. However, a log-linear combination of the two kinds of models outperforms all the models used individually, which indicates that possibilistic models bring information that is not modeled by probabilistic ones.
© 2014 Elsevier Ltd. All rights reserved.

Keywords: Speech processing; Language modeling; Theory of possibilities
1. Introduction
State-of-the-art large vocabulary continuous speech recognition (LVCSR) systems rely on n-gram language models that are estimated on text collections composed of billions of words. These models have demonstrated their efficiency in a wide range of applications, but their accuracy depends on the availability of huge and relevant training corpora, which may not exist, for instance, for low-resource languages or specific domains. One of the most popular approaches for dealing with this lack of training data consists in collecting text material on the Internet and estimating classical n-gram statistics on these automatically collected datasets (Kemp and Waibel,
This paper has been recommended for acceptance by Haizhou Li.
∗ Corresponding author. Tel.: +33 625194850.
E-mail addresses: [email protected] (S. Oger), [email protected] (G. Linarès).
http://dx.doi.org/10.1016/j.csl.2014.02.003 0885-2308/© 2014 Elsevier Ltd. All rights reserved.
Please cite this article in press as: Oger, S., Linarès, G., Web-based possibilistic language models for automatic speech recognition. Comput. Speech Lang. (2014), http://dx.doi.org/10.1016/j.csl.2014.02.003
1998; Bulyko et al., 2003). This approach benefits from two interesting characteristics of the Internet: large coverage and continuous updating. Coverage relies on the fact that the Web may be viewed as a close-to-infinite corpus where most linguistic realizations may be found; the Internet provides a linguistic coverage significantly larger than the text corpora usually involved in LM training (Keller and Lapata, 2003). Updating is provided by users who continuously add documents containing new words and new idiomatic forms. This last point has been widely exploited for various aspects of statistical language modeling, typically for new-word discovery (Asadi et al., 1990; Bertoldi and Federico, 2001; Allauzen and Gauvain, 2005), n-gram model adaptation, unseen n-gram scoring (Keller and Lapata, 2003), etc.

Nevertheless, exploiting this large coverage and updating for statistical language modeling is limited by technical issues related to the size and instability of Internet contents. The standard approach would be to regularly collect all the data available on the Internet and to estimate n-gram models on the resulting corpus. Such a technique is clearly unfeasible; some authors have proposed solutions that are supposed to enable the estimation of huge LMs: Guthrie and Hepple (2010) tackled the memory-footprint reduction of sparse n-gram models; fast smoothing techniques were proposed in Brants et al. (2007); technological solutions based on distributed data storage and processing are presented in Ghemawat et al. (2003) and Chang et al. (2006). Finally, even if software and hardware technologies are continuously progressing, training up-to-date LMs on the whole Web content is still a challenging problem. Another issue is related to word-sequence distributions on the Web, which are unreliable due to the diversity of the document sources, the variability of production and usage contexts, etc.
Distributions are not only unreliable; they may also not match a targeted application context, which determines the potential topics, speech styles, language levels, etc. Considering these practical and theoretical limits of using the whole Web, most previous studies consisted in extracting relevant and tractable Web subsets, which are used as classical corpora for estimating n-gram statistics. Such corpora are obtained by automatically querying search engines (Monroe et al., 2002; Wan and Hain, 2006; Lecorve et al., 2008). The query-composing technique determines the corpus accuracy in terms of coverage, language styles, etc. Unfortunately, querying is based on prior knowledge or on an automatic extraction of domain-related descriptors that are potentially incomplete or inaccurate (Sethy et al., 2005). Moreover, independently of the query-composing technique, the collected corpora depend on the search strategies implemented in commercial engines, which may be totally or partially confidential. Even though these methods were successfully applied in various application contexts, some authors tried to get further benefit from the Web's specificities by using dynamic approaches to n-gram estimation. In Berger and Miller (1998), a just-in-time adaptation process, based on an on-line analysis of the document topic and fast LM updating, is proposed. In Zhu and Rosenfeld (2001), the authors proposed a back-off technique that estimates a word-sequence probability by counting the number of Web documents that contain it, i.e. the number of hits returned by a search engine queried with the targeted word sequence. That paper focused on LM adaptation to a specialized domain, but it introduced the idea of using a search engine for the ad hoc estimation of linguistic scores. We developed this idea in Oger et al. (2009a), where we proposed an efficient way of using Web search engine hit ratios as probabilities in an ASR system.
Ad hoc n-gram estimation provides updated statistics but does not address the Web-statistic reliability issue. In order to tackle this problem, we proposed language models that take into account the existence or non-existence of word sequences rather than their frequencies (Oger et al., 2009b). These models are based on possibility theory, which provides a theoretical framework for dealing with uncertainty. We proposed a way to quantify the possibility of word sequences by querying the Web, and to integrate this possibilistic measure into a probabilistic ASR system. Probability-based language models perform well in most situations, especially on high- and medium-frequency events. Probability estimation for low-frequency events generally relies on a back-off or smoothing strategy, which leads to less reliable probabilities. The proposed possibilistic language models operate only on these low-frequency events, by measuring their plausibility, which is not actually measured by the smoothing and back-off techniques used to estimate probabilities on these events. Therefore, the proposed possibility-based language models do not replace probability-based language models, but rather complement them in the situations where the latter are not reliable, that is, mainly, on low-frequency events. The goal of the possibility-based language models is thus to estimate the plausibility of these low-frequency events, in order to filter them out when the main language model wrongly assigns them a higher probability than it should.
This paper presents an in-depth study of possibilistic language models. We state the motivations and theoretical foundations of these models, present a method for empirically estimating possibilities, and propose new ways to integrate them into an ASR system. Possibilistic models are compared and combined with classical n-gram probabilities estimated both on the Web and on classical text corpora. Experiments are conducted on two tasks: broadcast news transcription, for which large training materials are available, and transcription of medical videos dedicated to training surgeons. The latter application context corresponds to a very specialized domain with only low resources available. The rest of the paper is organized as follows. The next section proposes a step-by-step description of possibilistic Web models, starting from classical corpus probabilistic models. Section 3 presents various strategies for the integration of possibilistic language models into a statistical ASR system. Section 4 describes the experimental setup and the comparative experiments that were conducted. Finally, Section 5 concludes and proposes some perspectives.

2. From corpus probabilities to Web possibilities
In this section, we present new approaches for improving language modeling by using a new data source, the Web, and a new theoretical framework, possibility theory. We first describe the classical corpus-based probabilistic language models used in most state-of-the-art speech recognition systems. Then, we introduce a new approach for estimating these probabilities from the Web. Finally, we propose to use concepts from possibility theory for building a new measure that can be estimated on the Web as well as on classical closed corpora: the possibility measure.

2.1. Corpus-based probabilities
In the ASR domain, language models are mainly designed with the purpose of estimating the prior probability P(W) of a word sequence W = (w1, w2, ..., wn), wi ∈ V. This probability may be decomposed as a product of conditional probabilities:

P(W) = ∏_{i=1}^{n} P(wi | w1, w2, ..., wi−1)    (1)
This formula assumes that a word wi can be predicted from the preceding word sequence alone. Globally, n-gram models consist of a collection of conditional probabilities that are used, in the ASR engine, for predicting a word given a partially transcribed hypothesis. As expressed in Eq. (1), a word probability depends on the whole linguistic history. In practice, such long-term dependencies cannot be estimated, due to complexity and to the limits of the corpus: the amount of training data required for estimating such long sequences would be huge, and it is usually impossible to perform a direct estimate of high-order n-gram statistics (n > 6). Therefore, most state-of-the-art ASR systems use only 4- or 5-gram models. Some alternative approaches to linguistic scoring have been proposed to enable the estimation of long-sequence probabilities, mainly with neural networks that offer efficient (but implicit) inference and smoothing mechanisms (Bengio et al., 2006; Mnih and Hinton, 2007). Nevertheless, the ideal situation would be to estimate accurate probabilities directly on an exhaustive corpus where all possible sentences would be found. This would suggest that the problem of ASR could be viewed as the search for the correct transcription in a closed collection of text documents (Borges, 1944). Since such an infinite corpus will never be available, n-gram language models were introduced in speech recognition systems by Jelinek (1976), especially to deal with the problem of long word-sequence modeling. The global approach consists in limiting the size of the history, in order to be able to perform a good estimation of the conditional probabilities. Speech is then viewed as a Markovian source of words, of order n − 1:

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−n+1, ..., wi−1)
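As an illustration of this Markov approximation, the sketch below estimates a conditional n-gram probability by maximum likelihood from raw counts. The toy corpus, function names, and counts are ours, not the paper's:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, word, history):
    """Maximum-likelihood estimate of P(word | history):
    count(history + word) / count(history)."""
    n = len(history) + 1
    num = ngram_counts(tokens, n)[tuple(history) + (word,)]
    den = ngram_counts(tokens, n - 1)[tuple(history)]
    return num / den if den else 0.0

corpus = "the cat sat on the mat the cat ran".split()
# Bigram model (n = 2): the history is the single preceding word.
p = mle_prob(corpus, "cat", ["the"])  # "the" occurs 3 times, "the cat" twice
```

Such raw estimates are exactly what breaks down for rare or unseen histories, which is the gap the possibilistic models discussed later are meant to fill.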
Table 1
Constants involved in the estimate of n-gram frequencies from n-gram document frequencies.

n    α        β
1    2.427    1.019
2    1.209    1.014
3    1.174    1.025
By using this approximation, the probability of the word sequence W becomes:

P(W) ≈ ∏_{i=1}^{N} P(wi | wi−n+1, ..., wi−1)    (2)
where the parameter n is the order of the model and N the size of the sequence W. A high value of n is supposed to improve the model accuracy, but requires a larger corpus to estimate the model. As for most parametric models involved in speech processing, the choice of n results from a trade-off between accuracy and estimation requirements in terms of training data and CPU consumption. The cut-off presented in Eq. (2) impacts the language modeling process by limiting the complexity of the estimation process (this was its main purpose). Nevertheless, it also causes a loss of precision, as reported in many papers whose empirical results confirm this theoretical observation: increasing n generally yields some improvement in performance, provided that a sufficient amount of training data is available.

2.2. Web-based probabilities

2.2.1. Estimating n-gram probabilities from the Web
As previously described, n-gram probabilities are usually estimated by counting word sequences in corpora. Using this approach to estimate the probability of a given word sequence on the Web requires knowing the frequency of the word sequence in at least a part of the documents that can be found on the Web. To this end, we can rely on statistics obtained from a Web search engine: most of them provide the number of documents that satisfy a given query, and this query can be an n-gram word sequence. From the number of documents that contain a specified word sequence, we can deduce the number of n-grams. This approach is presented in Zhu and Rosenfeld (2001), where the authors propose to estimate the n-gram frequency from the document frequency with the formula:

fWeb(wi−n+1, ..., wi) ≈ α × dfWeb(wi−n+1, ..., wi)^β    (3)

where fWeb(W) is the frequency of the word sequence W on the Web and dfWeb(W) the document frequency of W. α and β are constants for a given n-gram order n. Zhu and Rosenfeld (2001) estimated the values of α and β for n-gram orders from 1 to 3; their results are reported in Table 1. The first point to note is that the value of β is always close to 1, which indicates a proportional relationship between the document frequency and the word frequency. Moreover, as expected, the value of the proportionality factor α approaches 1 when the n-gram order increases; for example, for n = 2, the value of α is about 1.2. Given this information, we can consider the β coefficient to be about equal to 1, and we would only need to estimate the proportionality factor α for each n-gram order. However, given that the purpose of the measure is the estimation of probabilities, and that in our case such a probability is a frequency ratio, the proportionality factors cancel out, leaving a document frequency ratio. We thus propose to estimate Web n-gram probabilities by relying directly on the number of documents that contain a given n-gram. For a given word wi, we denote by ψi^n its history of size n − 1 in an n-gram: ψi^n = wi−n+1, ..., wi−1. Thus, in order to obtain the probability of a Web n-gram, we use Eq. (4):

PWeb(wi | ψi^n) = H(ψi^n, wi) / H(ψi^n)    (4)
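The hit-count ratio of Eq. (4) can be sketched as follows. Since we cannot query a live search engine here, a small dictionary of hypothetical hit counts stands in for the engine's responses (all values are illustrative assumptions):

```python
# Hypothetical hit counts H(S): the number of Web documents a search
# engine would report for each query (toy values, for illustration only).
HITS = {
    ("language",): 4_500_000,
    ("language", "model"): 120_000,
}

def p_web(word, history, hits):
    """Eq. (4): P_Web(w_i | psi_i^n) = H(psi_i^n, w_i) / H(psi_i^n),
    a ratio of document hit counts standing in for an n-gram frequency ratio."""
    den = hits.get(tuple(history), 0)
    return hits.get(tuple(history) + (word,), 0) / den if den else 0.0

p = p_web("model", ["language"], HITS)  # 120000 / 4500000
```

Note that any n-gram absent from the hit table gets probability zero, which is precisely the problem Eq. (5) addresses next.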
where H(S) is the number of documents containing the word sequence S, as retrieved by the search engine, and n is the order of the n-gram model. However, Eq. (4) is not easy to use, because it assigns a zero probability to word sequences that are not on the Web. To tackle this issue, one usually redistributes part of the probability mass assigned to the events seen during training to unseen events. Given that the statistics necessary for a state-of-the-art back-off technique, such as the modified Kneser-Ney method (Goodman, 2006), are not available when estimating from the Web in this manner, we interpolate our distribution with the lower-order distributions, which has proven to work well. Probabilities are therefore computed by using Eq. (5):

P*Web(wi | ψi^n) = λ1 · PWeb(wi | ψi^n) + λ2 · PWeb(wi | ψi^{n−1}) + · · · + λn · PWeb(wi)    (5)

where the λi are positive real numbers such that Σ_{i=1}^{n} λi = 1. However, a difficulty remains in this formulation: the estimation of the unigram probability. In the Web context, the frequency of a word is computed as the number of Web documents that contain this word, and the size of the corpus corresponds to the total number of documents indexed by the search engine. To estimate the latter value, we use the number of documents that contain the most frequent word of the natural language of interest (for English, the word the), thus hoping to cover most of the documents in this language that are indexed by the search engine. The probability computed this way will never be zero, provided that every word in the vocabulary is present in at least one Web document. We therefore obtain a Web n-gram probability estimation method that does not yield zero probabilities, even for unseen word sequences. With Web n-gram probabilities thus defined, there are several ways of using them for computing word sequence probabilities.
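The interpolation of Eq. (5) can be sketched on top of the hit-count estimator of Eq. (4). The hit counts and weights below are illustrative assumptions; the empty tuple plays the role of the whole indexed collection, approximated in the paper by the documents containing the word "the":

```python
def p_web(word, history, hits):
    # Eq. (4): ratio of document hit counts.
    den = hits.get(tuple(history), 0)
    return hits.get(tuple(history) + (word,), 0) / den if den else 0.0

def p_web_interp(word, history, hits, lambdas):
    """Eq. (5): interpolate the order-n Web probability with every
    lower-order distribution; lambdas[0] weights the full history,
    lambdas[-1] the unigram, and the weights must sum to 1."""
    total, h = 0.0, list(history)
    for lam in lambdas:
        total += lam * p_web(word, h, hits)
        h = h[1:]  # back off one order: drop the oldest history word
    return total

HITS = {
    (): 10_000_000,           # proxy for the total number of indexed documents
    ("model",): 2_000_000,
    ("language",): 4_000_000,
    ("language", "model"): 100_000,
}
p = p_web_interp("model", ["language"], HITS, [0.7, 0.3])
```

With these toy values, the bigram term contributes 0.7 × 0.025 and the unigram term 0.3 × 0.2, so the result stays strictly positive even when the full n-gram is unseen.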
2.3. Background on possibilities
Possibility theory is a mathematical framework devoted to handling uncertainty resulting from incomplete knowledge (Dubois, 2006). Originally designed to formalize the notion of linguistic uncertainty (Dubois, 2006), possibility theory has recently been given a formal status akin to that of probability theory. This advance relies on measure-theoretic concepts, thus transforming it into a quantitative framework for reasoning with incomplete knowledge (de Cooman, 1997). A possibility measure therefore reflects uncertainty rather than imprecision, two concepts that are merged in probability measures. Possibility theory is based on a pair of dual functions, possibility and necessity. The possibility function, denoted π(e), represents the knowledge that distinguishes what is plausible from what is less plausible, and what is atypical from what is "normal". It is a mapping from a set E of events to the unit interval [0 ; 1]: (i) if π(e) = 0, then event e is impossible; (ii) if π(e) = 1, then event e is totally possible (plausible). In a manner akin to probability theory, a possibility measure can be computed from the possibility distribution on a bounded set of events (de Cooman, 1997). Considering a set E of events, a possibility measure Π on E can be defined as:

Π(E) = max_{e∈E} π(e)
Therefore, the possibility of the set E is the possibility of the most plausible event belonging to E. Globally, Π(E) evaluates the extent to which the set E of events is consistent with the knowledge π. For any two subsets A and B of E, the joint possibility measure of A and B is constrained by:

Π(A ∩ B) ≤ min(Π(A), Π(B))    (6)
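These two definitions are simple enough to check numerically. The sketch below uses a toy possibility distribution (the events and values are ours, purely illustrative) to compute the max-based measure and verify the bound of Eq. (6):

```python
def Pi(events, pi):
    """Possibility measure of a set of events: the maximum of the
    possibility distribution pi over its elements."""
    return max(pi[e] for e in events)

# Toy possibility distribution over weather events (illustrative values).
pi = {"rain": 1.0, "snow": 0.3, "hail": 0.1}

A, B = {"rain", "snow"}, {"snow", "hail"}
joint = Pi(A & B, pi)               # possibility of the intersection, {"snow"}
bound = min(Pi(A, pi), Pi(B, pi))   # upper bound from Eq. (6)
```

Here the bound happens to hold with equality (both sides are 0.3); in general Eq. (6) only guarantees the inequality.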
The application of possibility theory to language modeling relies on the fact that empirically estimated probabilities are dramatically imprecise on very low-frequency events: unobserved (or rare) word sequences are evaluated by smoothing functions that provide very coarse approximations of low probabilities. Rather than trying to infer linguistic scores from such partial knowledge, we aim to perform a reliable estimate of the plausibility of infrequent linguistic events. This approach faces two major problems, due to this specific interest of possibility in the low-probability domain and to the estimation of possibility-based linguistic scores. Indeed, the literature lacks a practical formula for automatically estimating a possibilistic measure on a sequence of words, given a training corpus. We fill this gap in the next sections.
2.4. Web-based possibilities
Numerous studies report how one can take advantage of the statistics of word sequences on the Web, and in the previous section we proposed another, similar approach. Nonetheless, the absence of an n-gram from the Web could itself represent relevant information, which could be integrated into an LM. To the best of our knowledge, this information has never been studied in the literature; possibility theory provides a theoretical framework for modeling it (Zadeh, 1978; Dubois, 2006). In this section, we propose practical formulas for estimating a possibility measure for word sequences by using statistics from the Web. The possibility measure has to represent the possibility that a word sequence exists. For this, we rely on the existence of this sequence and of its sub-sequences on the Web. By existence on the Web, we mean that there exists at least one Web document that contains the word sequence under discussion. The idea is that the more of the long sub-sequences of a word sequence exist on the Web, the more possible the word sequence. However, one needs to limit the length of the sub-sequences searched for, in order to obtain a reliable measure. Indeed, the smaller the corpus considered for computing the possibilistic measure (here, the Web), the less significant the non-existence of long sequences. First of all, for each desired LM order n, we recursively construct a distinct set of possibility distributions πn to π1, according to the equation:

πn(W) = (|Wn ∩ Webn| + γ · |Wn\Webn| · πn−1(W)) / |Wn|    (7)
where W is a sequence of n or more words, Wn is the set of word sequences of size n in W, Webn is the set of word sequences of size n on the Web, \ is the set subtraction operator, and 0 ≤ γ ≤ 1 is the back-off coefficient. The terminal condition for the recursion is π0(W) = 0. For a given word sequence W, this distribution expresses the number of its sub-sequences of length n that exist on the Web, with respect to the total number of its sub-sequences of length n. The possibility mass that is lost because of the absence of sub-sequences of length n on the Web is redistributed to the possibility measure of lower order. In our experiments, Web-based statistics are estimated by querying the Web search engine. For instance, |Wn\Webn| is obtained by counting all the sub-sequences of size n of W that cannot be found by the search engine. The set of possibility distributions previously defined allows us to construct a corresponding set of possibility measures Πn, according to Eq. (8):

Πn(Θ) = max_{W∈Θ} πn(W)    (8)
where Θ is a set of sequences of n or more words; if Θ has only one element W, then Πn({W}) = πn(W).

2.5. Corpus-based possibilities
In the previous section, we proposed a Web-based possibility measure that relies on the existence or non-existence of a word sequence on the Web. Here, we generalize the formula of the Web-based possibility distribution to enable the computation of a possibility measure on any arbitrary corpus. The principle of the corpus-based estimator is similar to that of the Web-based one: possibilities can be deduced from the presence or absence of the word sequences in the observation source, for instance the training corpus. We therefore use the same back-off strategy, which interpolates high-order possibilities from sub-sequence possibilistic scores. For each desired language model order n, possibility distributions are computed recursively, from πn^c to π1^c, according to Eq. (9):

πn^c(W) = (|Wn ∩ Cn| + γ · |Wn\Cn| · π^c_{n−1}(W)) / |Wn|    (9)
where W is a sequence of n or more words, Wn is the set of word sequences of size n in W, Cn is the set of word sequences of size n in the corpus, and 0 ≤ γ ≤ 1 is the back-off coefficient. The terminal condition for the recursion is π0^c(W) = 0. The more sub-sequences of W are found in the corpus, the higher the possibility of W.
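The shared recursion behind Eqs. (7) and (9) can be sketched directly. A fixed toy set of "known" word sequences stands in for Webn (or Cn); in practice this membership test would be a search engine query or a corpus lookup (the set and values below are our assumptions):

```python
def subseqs(words, n):
    """All contiguous sub-sequences of length n in a word sequence."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def possibility(words, n, known, gamma=0.5):
    """Recursion of Eqs. (7)/(9):
    pi_n(W) = (|Wn ∩ known_n| + gamma * |Wn \\ known_n| * pi_{n-1}(W)) / |Wn|,
    with terminal condition pi_0(W) = 0. `known` stands in for the set of
    word sequences found on the Web (Eq. (7)) or in a corpus (Eq. (9))."""
    if n == 0:
        return 0.0
    wn = subseqs(words, n)
    seen = len(wn & known)          # sub-sequences that exist in the source
    unseen = len(wn) - seen         # mass redistributed to the lower order
    return (seen + gamma * unseen * possibility(words, n - 1, known, gamma)) / len(wn)

# All three unigrams "exist"; of the two bigrams, only ("a", "b") does.
KNOWN = {("a",), ("b",), ("c",), ("a", "b")}
score = possibility(["a", "b", "c"], 2, KNOWN)
```

With γ = 0.5 the missing bigram ("b", "c") is backed off to the fully-possible unigram level, giving (1 + 0.5 × 1 × 1)/2 = 0.75 rather than a hard zero.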
The possibility distribution πn^c defined above in Eq. (9) allows us to derive the possibility measure Πn^c, according to Eq. (10):

Πn^c(A) = max_{W∈A} πn^c(W)    (10)

where A is a set of sequences of n or more words; if A has only one element W, then Πn^c({W}) = πn^c(W). We have thus proposed a formula that allows us to estimate a possibilistic distribution on the Web as well as on a classical text corpus. Unlike classical n-gram language models, the possibility measures can be estimated directly on complete word utterances, without computing individual possibility measures for sub-sequences.

3. Integration of possibilistic measures into a probabilistic speech recognition system
The possibilistic paradigm of the proposed measures differs greatly from the traditional probabilistic paradigm used in modern ASR systems. In this section, we propose different ways of integrating such measures into a state-of-the-art probabilistic ASR system.

3.1. Probabilities and possibilities as stand-alone metrics
The most obvious way of using the proposed measures in an ASR system is to use them as full linguistic scores.

3.1.1. Possibility
The possibility measure can be seen as a stand-alone linguistic measure and can be used as such in ASR tasks, for example in combination with the acoustic score. Starting from Eq. (7), where the Web-based possibility of a word sequence is defined, the size and nature of the word sequence to evaluate in the context of an ASR hypothesis have to be defined. Such a word sequence of m words wi, for i ∈ {1, ..., m}, is denoted Sm and can be expressed as a set of n-sized word sequences Sm^n such that:

Sm^n = {(ψi^n, wi), for i ∈ {n, ..., m}}    (11)
Given that Sm is the hypothesis to evaluate, in a manner akin to classical n-gram models, we can measure the possibility of its sub-sequences Sm^n and combine them according to the inequality given in Eq. (6). When we do not have complete information on an event, possibility theory compels us to choose the maximal estimate for the value of the possibility measure of this event. Thus, we can use the equality case of Eq. (6) to assign a possibility to Sm (given by Eq. (11)):

Πn(Sm^n) = min(Πn({(ψn^n, wn)}), Πn(Sm^n\{(ψn^n, wn)}))
         = min(Πn({(ψn^n, wn)}), ..., Πn({(ψm^n, wm)}))
         = min(πn(ψn^n, wn), ..., πn(ψm^n, wm))    (12)
The shortcoming of this first approach is that it reduces the possibility of a hypothesis to that of its least possible element. As a consequence, if several hypotheses share a common low-possibility sub-sequence, all of them would have the same possibility, whatever their other sub-sequences. With the aim of ranking these hypotheses, it would be better to assign them different possibility values. We therefore propose to measure the possibility of the whole word sequence Sm, while still verifying Eq. (6), by directly applying Eq. (7) to Sm:

Πn(Sm) = πn(Sm)    (13)
This last equation yields a smoothed possibility of the whole word sequence Sm, rather than the lowest possibility of its sub-sequences. The results of these approaches are presented in Section 4.4.
3.1.2. Probability
A way of using the Web for building LMs is to consider that the probabilities estimated from the Web are reliable, and thus not to interpolate them with the LM learned from the corpus; this is shown in Eq. (14):

P̂(wi | ψi^n) = P*Web(wi | ψi^n)    (14)
This approach is justified when the corpus used for learning the LM is too small or too poorly adapted to the task. In Section 4.3, we present several experiments with these two approaches in an ASR task.

3.2. Combination of possibilities and probabilities
In Section 2, we proposed possibilistic and probabilistic measures that can be computed on the Web as well as on classical text corpora. To take advantage of all these measures, we need a way of combining them that respects both the possibilistic and probabilistic theories. In this section, we propose several strategies in this direction.

3.2.1. Possibilities as probability upper-bounds
There are several definitions of the relation between possibilities and probabilities. Here, we use the definition provided in Dubois and Prade (1988), where it is stated that, for a probability measure P, the possibility measure Π that corresponds to P satisfies Eq. (15):

∀A ⊆ S,  P(A) ≤ Π(A)    (15)
where S is the set of events. In practice, this theoretical inequality is violated because of smoothing techniques: on rare events, classical n-gram models interpolate probabilities from the scores of their sub-sequences, so impossible word sequences will have a positive probability computed from the probabilities of their sub-sequences. In the most general case, the probability assigned to a rare event by the general LM is thus likely to be sometimes greater than the possibility assigned to this event by the possibilistic LM. We can use this property for improving a probabilistic LM on the low-probability domain, by redistributing this excess probability mass among the well-learned events of the general LM. Eq. (16) formalizes this idea:

P̂(wi | ψi^n) = Πn(ψi^n, wi)^f        if Πn(ψi^n, wi)^f < PLM(wi | ψi^n)
P̂(wi | ψi^n) = β · PLM(wi | ψi^n)    otherwise    (16)

where Πn is a possibility measure and f is a scaling factor that controls the fraction of the probabilities affected by the cut; with f = 0, no probabilities are modified. β is a normalization factor, defined in Eq. (17):

β = (1 − Σ_{u∈Uψi^n} P̂(u | ψi^n)) / (1 − Σ_{u∈Uψi^n} PLM(u | ψi^n))    (17)
where U_{ψ_i^n} is the set of words w_i with the history ψ_i^n of size n − 1, for which the probability is higher than the possibility. Following this idea, the Web-based possibilities can be seen as upper bounds of corpus-based probabilities, and the corpus-based possibilities as upper bounds of Web-based probabilities.

3.2.2. The Web probability as a better back-off

Starting from the idea that the back-off probabilities of probabilistic n-gram language models are poorly estimated, we can use another measure to improve them. An approach used in Zhu and Rosenfeld (2001) consists in using the Web probability to improve the baseline LM back-off. This boils down to giving the n-gram probabilities estimated on the corpus a higher confidence level than the Web-based probabilities.

Please cite this article in press as: Oger, S., Linarès, G., Web-based possibilistic language models for automatic speech recognition. Comput. Speech Lang. (2014), http://dx.doi.org/10.1016/j.csl.2014.02.003

Let U_{ψ_i^n} now be the set of words w_i with the history ψ_i^n of size n − 1, for which the baseline LM has to back off. In formal terms, this can be written as in Eq. (18):

\hat{P}(w_i \mid \psi_i^n) =
\begin{cases}
\rho \cdot P_{LM}(w_i \mid \psi_i^n) + (1 - \rho) \cdot P^{*}_{Web}(w_i \mid \psi_i^n), & \text{if } w_i \in U_{\psi_i^n} \\
\beta \cdot P_{LM}(w_i \mid \psi_i^n), & \text{otherwise}
\end{cases}   (18)

where ρ is a positive, empirically chosen weighting factor, and β is the normalization factor defined in Eq. (17).

3.2.3. Possibilities as corpus probability back-off

Starting from the same idea, we can combine the possibility and probability measures only when the probability measure is not reliable. The possibility measures previously introduced inform us on the confidence that we can have in the existence of a word sequence. If we have a higher confidence in the training corpus than in the Web, then all the n-grams seen in this corpus are totally possible (π_n(ψ_i^n, w_i) = 1). On the contrary, the n-grams composed by back-off strategies are subject to controversy. We thus propose to weight the probability that the language model assigns to the n-grams unseen in the training corpus with the possibility estimated from the Web. This idea is formalized in Eq. (19):

\hat{P}(w_i \mid \psi_i^n) =
\begin{cases}
\Pi_n(\{\psi_i^n, w_i\}) \cdot bo(\psi_i^n) \cdot P(w_i \mid \psi_i^{n-1}), & \text{if } w_i \in U_{\psi_i^n} \\
\beta \cdot P_{LM}(w_i \mid \psi_i^n), & \text{otherwise}
\end{cases}   (19)

where bo(ψ_i^n) is the baseline language model back-off factor. We thus redistribute, through the β factor defined in Eq. (17), the probability mass wrongly assigned to events that are impossible according to the Web, to the events that were seen in the training corpus.

3.2.4. Log-linear combinations

The probabilistic and possibilistic LMs can also be considered as complementary linguistic scores. In a typical ASR system, each hypothesis is assigned a score, computed as a log-linear combination of an acoustic score S(X|W) and a weighted linguistic probability P(W)^a, with 0 ≤ a ≤ 1.
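As an illustration, the possibility-weighted back-off of Eq. (19), together with the renormalization of Eq. (17), might be sketched as follows; the dictionary interface and function name are assumptions of this sketch.

```python
def possibilistic_backoff(p_lm, backed_off, bo_weight, p_shorter, poss):
    """Eq. 19: weight the back-off probability of unseen n-grams by a
    Web-estimated possibility, then renormalize via Eq. 17.

    p_lm       : dict word -> baseline P_LM(w | h)
    backed_off : set of words for which the baseline LM backs off (U)
    bo_weight  : back-off factor bo(h) of the baseline LM for history h
    p_shorter  : dict word -> P(w | shortened history)
    poss       : dict word -> Web possibility of the full n-gram
    """
    # First case of Eq. 19: possibility * back-off weight * lower-order probability
    new_u = {w: poss[w] * bo_weight * p_shorter[w] for w in backed_off}
    # Eq. 17: redistribute the freed probability mass over the seen n-grams
    beta = (1.0 - sum(new_u.values())) / (1.0 - sum(p_lm[w] for w in backed_off))
    return {w: new_u[w] if w in backed_off else beta * p_lm[w] for w in p_lm}
```

An impossible n-gram (possibility 0) thus gets zero probability, and its former back-off mass is moved, through β, to the n-grams observed in the training corpus.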
To improve this score, we can add other linguistic information by adding terms to the log-linear combination. For instance, linguistic possibility measures can be integrated into the ASR framework as in Eq. (20), where W is a hypothesis and X is the sequence of acoustic observations for this hypothesis:

S(W|X) = S(X|W) × P(W)^a × Π(W)^b   (20)

where S(X|W) is the acoustic score, P(W) is the linguistic probability, Π(W) is the linguistic possibility, and a and b are positive, empirically chosen weighting factors. According to this approach, we can combine in all possible ways the four measures that we proposed for estimating the global score of the hypotheses: Web- and corpus-based possibility measures, and Web- and corpus-based probability measures.

4. Experiments

We have proposed a way of estimating a possibilistic measure on a closed corpus or on the Web; we have also proposed several hybrid LMs that allow one to combine probabilistic and possibilistic measures. We thus have four measures, determined by two binary parameters: the nature of the measure (possibility or probability) and the corpus on which the measure is estimated (the Web or a closed corpus). In the remainder of the paper, we evaluate these models on two ASR tasks: broadcast news transcription and specialized spoken discourse transcription. These evaluations have been performed on models of orders 3 to 6, because the nature of the measure, the size of the corpus where it is estimated, and the order of the LM are all related.
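The hypothesis scoring of Eq. (20) can be sketched in the log domain, where the products become weighted sums; the N-best tuple layout and the weight values used below are assumptions of this sketch, not the tuned values of the paper.

```python
def hypothesis_score(log_acoustic, log_prob, log_poss, a, b):
    # Eq. (20) in the log domain:
    # log S(W|X) = log S(X|W) + a * log P(W) + b * log Pi(W)
    return log_acoustic + a * log_prob + b * log_poss

def rescore(nbest, a, b):
    """Pick the best hypothesis from an N-best list.

    nbest: list of (text, log_acoustic, log_prob, log_poss) tuples;
    a and b are empirically chosen weighting factors.
    """
    return max(nbest, key=lambda h: hypothesis_score(h[1], h[2], h[3], a, b))
```

Combining all four measures (Web- and corpus-based probabilities and possibilities) amounts to adding two more weighted log terms to `hypothesis_score`.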
Table 2
WER (%) of the proposed LMs according to the order n of the LMs, on the HUB4 and AVISON corpora.

                          AVISON          HUB4
                          n = 3   n = 6   n = 3   n = 6
Pc                        27.9    28.0    27.0    26.9
Pw                        27.2    25.5    27.0    26.0
Πw                        28.5    25.2    27.7    26.9
Πc                        28.0    28.1    27.9    27.9

Pw BO Pc                  27.6    26.1    26.7    26.6
Πw BO Pc                  27.5    26.5    26.8    26.7
Πc BO Pc                  27.6    27.4    26.8    26.9

Pw ≤ Πc                   27.8    25.6    27.2    26.1
Pc ≤ Πw                   28.1    26.0    27.0    26.7
Pw ≤ Πw                   27.6    25.3    27.0    26.1

Pw + Pc                   27.1    25.4    26.6    25.9
Pw + Πw                   27.1    24.7    27.0    26.1
Pc + Πw                   27.7    24.8    26.8    26.6
Πw + Πc + Pw + Pc         26.8    24.5    26.4    25.9

Bold values indicate key figures.
4.1. Experimental setup

To assess the proposed methods on the two transcription tasks, we used the Avignon Computer Science Laboratory's (LIA) broadcast news transcription system, SPEERAL (Nocéra et al., 2004). This system is an A* decoder based on state-dependent hidden Markov models for acoustic modeling, and on an n-gram LM.

For the broadcast news transcription task, we used the test corpus of the HUB4'98 campaign (Stern, 1997), of about 3 h of English broadcast news. The baseline LM is a 65k-word classical 3-gram, estimated on 2.7G words from the Gigaword, North American News and HUB4 corpora, with modified Kneser-Ney smoothing. The transcription word error rate (WER) of the test corpus with this configuration, without speaker adaptation, is 27.0%.

For the specialized domain transcription task, we used 4 h from the English AVISON corpus, which contains recorded surgery-related discourse. A combined 65k-word LM is used, obtained by interpolating general 3-grams learned on the HUB4 English corpus with 3-grams estimated on all the reference transcriptions available in the AVISON training corpus, relying here as well on modified Kneser-Ney smoothing. A baseline WER of 27.9% without speaker adaptation was obtained on this specialized domain corpus.

The direct use of the proposed Web-based LMs in the search algorithm of the ASR system would lead us to submit too many queries to the Web search engines. This is why an N-best decoding is done instead, with the baseline 3-gram LM, which produces the top N recognition hypotheses; the proposed Web-based LMs are then used for rescoring these hypotheses, in combination with the acoustic score of each hypothesis. The Google search engine is used for processing Web queries.

Regarding the number of hypotheses to consider, we conducted measurements from the 10-best to the 1000-best, and observed that the oracle WER improvement decreases following an inverse logarithmic curve with the number of hypotheses. We concluded that the best trade-off between the oracle WER and the number of hypotheses to evaluate against the Web LMs was about 100. The following experiments are therefore conducted over a 100-best decoding process.

The optimization of the weight factors for the log-linear combinations and of the smoothing and back-off coefficients is performed within the K-fold cross-validation framework, with K = 10: first, the test corpus is partitioned into ten sub-corpora; then, the coefficients are optimized on nine partitions and tested on the tenth. This last step is repeated ten times, with each of the ten sub-corpora used exactly once as test data. The global error is the sum of the errors of all the test partitions. The results of these experiments are reported in Table 2 and discussed in the next sections.

Given that the results presented are from a reordering of the 100 best hypotheses generated by the baseline system, the best score reachable by a model is the one associated with the best of the 100 hypotheses. Table 3 contains the score
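The K-fold tuning procedure can be sketched as follows; the segment representation, the grid of candidate weights, and the `evaluate` interface are assumptions of this sketch, standing in for the actual 100-best rescoring pipeline.

```python
import random

def wer_of(weights, segments, evaluate):
    """Helper: WER (as a fraction) of a weight vector on a set of segments."""
    errors, words = evaluate(weights, segments)
    return errors / max(words, 1)

def cross_validate(segments, candidate_weights, evaluate, K=10, seed=0):
    """K-fold tuning of the combination weights: optimize on K-1 folds,
    score on the held-out fold; the global error is the sum of the errors
    over all held-out folds.

    evaluate(weights, segs) -> (n_errors, n_words) is assumed to run the
    100-best rescoring and score it against the reference transcriptions.
    """
    rng = random.Random(seed)
    segs = list(segments)
    rng.shuffle(segs)
    folds = [segs[i::K] for i in range(K)]
    total_errors = total_words = 0
    for k, held_out in enumerate(folds):
        train = [s for i, f in enumerate(folds) if i != k for s in f]
        # Pick the weights that minimize WER on the K-1 training folds
        best = min(candidate_weights, key=lambda w: wer_of(w, train, evaluate))
        errors, words = evaluate(best, held_out)
        total_errors += errors
        total_words += words
    return 100.0 * total_errors / total_words  # global WER, in %
```

A grid search over `candidate_weights` is only one possible optimizer; the paper does not specify the search method.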
Table 3
Oracle WER (%) of the 100-best hypotheses on the HUB4 and AVISON corpora.

        AVISON   HUB4
WER     22.0     25.9
of the best hypothesis found in the 100-best for the AVISON and HUB4 tasks. We can see that the best improvement any model can offer is a WER reduction of 5.9% absolute on the AVISON task (from 27.9% to 22.0%) and of 1.1% absolute on the HUB4 task (from 27.0% to 25.9%).

We observed that the proposed Web-based measures perform better with high n-gram orders. In order to compare these measures with the baseline LM, we estimated corpus-based n-gram models of orders 4 to 6, on the same data as the 3-gram models. The evolution of the figures according to the n-gram order is linear; therefore only the results obtained with the 3- and 6-gram models are shown, since the intermediate orders do not bring much information. Row Pc of Table 2 contains the WER obtained by the baseline corpus-based n-gram models of orders 3 and 6 in the 100-best rescoring task.

4.2. Confidence test

In order to assess the significance of the reported results, we estimated the confidence interval of the baseline results according to the following equation (Chollet, 1995):

wer_f - u_{\alpha/2} \sqrt{\frac{wer_f (1 - wer_f)}{k}} < wer_p < wer_f + u_{\alpha/2} \sqrt{\frac{wer_f (1 - wer_f)}{k}}   (21)

where k is the number of words in the evaluation corpus and wer_f the WER measured on it. α defines the confidence level: for α = 95%, u_{α/2} is given by the Student value u_{0.025} = 1.96. The confidence intervals of the baseline results are reported in Table 4.

4.3. Individual probabilities

The rows Pw and Pc of Table 2 contain respectively the results of the Web-based and corpus-based probabilistic LMs alone. It is noticeable that the performance of the Web-based probabilistic LM increases with the order of the model, thanks to the huge size of the Web corpus, which allows the reliable estimation of long-range n-gram probabilities. The probabilistic Web-based 3-gram LM performs as well as the corpus-based probabilistic LM on the HUB4 task, and slightly better on the AVISON task.
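Eq. (21) is straightforward to compute. In the usage note below, the corpus size is an assumed round value for illustration, since the exact number of words k is not stated in this excerpt.

```python
import math

def wer_confidence_interval(wer_f, k, u=1.96):
    """Eq. (21): binomial confidence interval around a measured WER.

    wer_f : WER on the evaluation corpus, as a fraction (e.g. 0.279)
    k     : number of words in the evaluation corpus
    u     : normal/Student quantile (1.96 for a two-sided 95% interval)
    """
    half = u * math.sqrt(wer_f * (1.0 - wer_f) / k)
    return wer_f - half, wer_f + half
```

For instance, with wer_f = 0.279 and an assumed corpus of about 9,000 words, the half-width is roughly 0.009, i.e. about ±0.93 points, which is consistent with the intervals reported in Table 4.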
The 6-gram Web-based probabilistic models are always better than the corpus-based probabilistic LMs, and provide an absolute WER decrease of 0.9% on the HUB4 task and of 2.5% on the AVISON task. These results could deceptively lead us to believe that the Web-based probabilistic model performs better on the AVISON task; however, considering the oracle results presented in Table 3, we can see that on the HUB4 task the optimal WER is almost reached, whereas on the AVISON task only half of the potential WER reduction is obtained.

Table 4
WER (%) and confidence intervals (%) of the baseline corpus-based LMs, according to the order n of the LMs, on the HUB4 and AVISON corpora.

              AVISON          HUB4
              n = 3   n = 6   n = 3   n = 6
Pc            27.9    28.0    27.0    26.9
Conf. int.    0.93    0.93    0.91    0.91
Fig. 1. Correlation between Web and corpus 3-gram probabilities observed in the test corpus of the HUB4 task.
Figs. 1 and 3 highlight the relation between the corpus and Web n-gram probabilities of orders 3 and 6 found in the test corpus of the HUB4 task. The n-grams were separated into two categories: those found in the training corpus of the classical corpus-based LM, and the others. The two language models providing the corpus-based and Web-based probabilities use a back-off strategy, as described in the experimental setup; therefore
Fig. 2. Distribution of divergences between the corpus and Web 3-gram probabilities, measured on the test corpus of the HUB4 task.
Fig. 3. Correlation between Web and corpus 6-gram probabilities observed in the test corpus of the HUB4 task.
both of them provide a positive probability even for events not seen in the training corpora. Figs. 2 and 4 show the distribution of the n-grams according to the divergence between the Web and corpus probabilities.

Figs. 1 and 3 show that there is a strong correlation between Web and corpus n-gram probabilities. The linear correlation coefficient is 80.3% for the 3-grams, which confirms the visual impression. We observe that the points are more scattered when the probability is low, for both the seen and unseen n-grams; this suggests that the low probabilities are less well trained. Histograms 2 and 4 show that the divergences are larger for unseen n-grams than for seen ones. Given that the Web-based LM performs better than the corpus-based LM, one might think that the strength of the Web-based LM lies in a better modeling of the low-frequency n-grams, probably thanks to the huge size of the Web.

By comparing the correlation clouds of the 3-gram and 6-gram probabilities, shown in Figs. 1 and 3, respectively, we observe that the 6-gram cloud is more scattered than the 3-gram cloud. Given that the 6-gram Web-based LM performs better than the corpus-based LM, the scattering of the points suggests that the corpus-based probabilities are less reliable than the Web-based ones.

These experiments show that the Web-based LMs are more reliable than the corpus-based LMs, especially with high-order n-grams and low-frequency events, on both tasks. This indicates that the proposed document-frequency estimation is relevant for taking advantage of the information present on the Web.

4.4. Individual possibilities

Rows Πc and Πw of Table 2 contain respectively the WER results of the corpus-based and Web-based possibilistic LMs alone. The results obtained by the Web- and corpus-based 3-gram possibilistic LMs are all poor, and none stands out.
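The linear correlation coefficient reported above is the standard Pearson correlation between the two probability series, which can be computed as:

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length
    series, e.g. corpus-based vs Web-based n-gram probabilities."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 (like the 80.3% observed for 3-grams) indicates that the two models rank and scale the n-grams similarly, while the scatter at low probabilities lowers the coefficient.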
On the one hand, the Web is so large a corpus that almost all 3-grams occur in it, which removes much of the information a possibilistic measure can bring. On the other hand, the corpus is too small to conclude from the absence of a 3-gram that it is impossible. This last statement is even more true for models of higher order, while the first one, concerning the Web-based LMs, no longer holds when the order of the model increases.

Therefore, as expected, the 6-gram Web-based possibilistic LM provides the best performance, which indicates that the size of the Web is suited to this kind of measure with a high order, and that the proposed document-frequency
Fig. 4. Distribution of divergences between the corpus and Web 6-gram probabilities, measured on the test corpus of the HUB4 task.
estimation is relevant for taking advantage of the information present on the Web. Moreover, the Web-based possibility model performs well on the specialized AVISON corpus, whereas no improvement is obtained on the broadcast news corpus. This shows that this measure is relevant in the under-resourced domain covered by the AVISON corpus. It is important to note that this evaluation is done on a 100-best rescoring task, so the worst hypotheses have already been discarded by the ASR system, which partly explains the not-so-bad results obtained with the 3-gram possibilistic measures.

To conclude, the possibility measure is effective on the Web and in the specialized domain, which shows that, as expected, it is an alternative to probability measures when no sufficiently large relevant corpora are available.

4.5. Possibilities and Web-based probabilities as corpus-based probability back-off

Rows Pw BO Pc, Πw BO Pc, and Πc BO Pc of Table 2 contain respectively the results of the Web-based probability, the Web-based possibility, and the corpus-based possibility used as a corpus-based back-off coefficient. All the modified back-off approaches improve on the baseline corpus-based probabilities, which indicates that the corpus-based back-off probabilities become more accurate when they are combined with other kinds of information.

Adding corpus-based possibility measures in the back-off slightly improves performance: we obtain a WER reduction of 0.6 point on the AVISON corpus and of 0.2 point on HUB4. These results show that the possibility measures bring information that classical probabilistic modeling does not. The results obtained with the Web-based probabilistic and possibilistic back-offs are similar.
However, the Web-based probabilistic back-off behaves slightly better: a WER reduction of 1.9 points is obtained on the AVISON corpus, and of 0.3 point on HUB4, with respect to the corpus-based probability measure. To conclude, the results show that the Web-based back-off is better than the corpus-based one, irrespective of the measures used.

4.6. Possibilities as probabilities' upper-bound

Rows Pw ≤ Πc, Pc ≤ Πw, and Pw ≤ Πw of Table 2 contain respectively the results obtained with the corpus-based possibilities used as upper bounds of Web-based probabilities, with the Web-based possibilities used as upper
Table 5
Normalized weights of the possibilistic and probabilistic corpus- and Web-based 6-gram LMs in the log-linear combination reported in Table 2, on the AVISON and HUB4 corpora.

          Πw     Πc     Pw     Pc
AVISON    0.16   0.08   0.57   0.19
HUB4      0.04   0.01   0.67   0.28
bounds of corpus-based probabilities, and with the Web-based possibilities used as upper bounds of Web-based probabilities.

The corpus-based possibilities do not bring any improvement when used as an upper bound for the Web-based probabilities. However, using the Web-based possibilities as an upper bound for the corpus-based probabilities yields an absolute WER improvement of 0.2 point on HUB4 and of 2 points on the AVISON corpus. As expected, combining the best probabilistic and possibilistic measures (both Web-based) yields the best results: an absolute WER improvement of 0.8 point on HUB4 and of 2.7 points on the AVISON corpus, with respect to the corpus-based probability measure. These results confirm that the corpus-based probabilistic LM assigns too high a probability mass to certain events, and that the Web-based possibilistic measure allows one to identify these events.

4.7. Log-linear combination

Starting from the four measures that we proposed (corpus- and Web-based probabilities and possibilities), eleven combinations of two, three and four measures are possible. We present here the most interesting ones. Rows Pw + Pc, Pw + Πw, and Pc + Πw of Table 2 contain respectively the results of the log-linear combination of the Web-based probabilities with the corpus-based probabilities, of the Web-based probabilities with the Web-based possibilities, and of the corpus-based probabilities with the Web-based possibilities.

The results are very different on HUB4 and on the AVISON task. On HUB4, none of the combinations is significantly better than the Web-based probabilities alone. On the AVISON corpus, the best combination, that of the Web-based probability and possibility measures, yields a WER improvement of 3.3 points. This result confirms that, for the same "corpus" (here, the Web), the possibilistic measure adds information to the probabilistic measure.
Row Πw + Πc + Pw + Pc of Table 2 contains the log-linear combination of the four proposed measures: corpus- and Web-based probabilities and possibilities. The combination of the four measures yields, globally, the best performance. We obtain a WER improvement of 3.5 points on the AVISON corpus, which represents an improvement of 0.7 point with respect to the best estimator alone (the Web-based possibility measure). On HUB4, the four combined measures yield a marginal improvement of 0.1 point with respect to the best measure alone (the Web-based probability measure). Hence, despite the finer ways of combining these measures presented previously, the log-linear combination of the four measures seems to be the best way of using them jointly.

The normalized weights of the four LMs in the log-linear combination are reported in Table 5. On both corpora, the Web-based LMs contribute the most, the Web-based probabilistic LM being by far the largest contributor. On the AVISON corpus, the joint weight of the two possibilistic LMs is higher than that of the classical corpus-based probabilistic LM. This observation indicates that the possibilistic LMs are particularly useful in such a highly specialized and poorly resourced linguistic domain. On both corpora, the possibilistic LMs have significant weights, especially on the AVISON corpus, which confirms that the information they capture is useful and not already modeled by the probabilistic LMs.

5. Conclusion

This paper presents language models based on the possibility theory, designed especially for using a Web search engine as an n-gram statistics provider. We address the two main problems caused by the use of possibilistic language models in ASR systems: the estimation of word-sequence possibilities and the integration of possibilistic models into the probabilistic framework of ASR systems.
Our proposal for empirically estimating the Web possibilities relies on querying word sequences and on simple Web search engine hit ratios. In order to evaluate the scores of long word sequences, we propose an estimation rule that computes long-sequence scores by combining sub-sequence frequencies. These frequencies are approximated by counting the number of Web documents that contain them, using a Web search engine.

We proposed various techniques for the integration of possibility scores into a probabilistic ASR system: possibilistic bounding of probabilities, a possibility-based back-off strategy, and a fully possibilistic scoring of word sequences. Finally, we combined possibilistic and probabilistic measures.

Experiments were conducted on two typical usage scenarios: broadcast news transcription with very large training sets, and transcription of medical videos in a specialized domain with only very limited training sets. The results demonstrate that possibilistic language models yield significant WER improvements on the specialized domain, where classical n-gram models fail due to the lack of training material. On broadcast news, probabilistic models remain better than possibility-based models, but the log-linear combination of both outperforms all other configurations. Even if this mixed approach is not as theoretically well-founded as the purely probabilistic or possibilistic ones, these results show the complementarity of the two paradigms, in both specialized (medical domain) and generic (broadcast news) contexts.

Some aspects of the proposed method could be further explored. The first point is related to the idea of a word-sequence possibility: possibility may depend on the context of the discourse. This dependency could be modeled as conditional possibilities, according to various factors (domain, topics, etc.).
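The Web-based estimation summarized above (long-sequence scores combined from sub-sequence document frequencies obtained via a search engine) might be sketched as follows. This is a hypothetical illustration: the `doc_count` interface, the sub-sequence length, and the min-combination of binary scores are assumptions of this sketch, not the exact estimation rule defined in Section 2 of the paper.

```python
def possibility(ngram, doc_count, span=3):
    """Hypothetical sketch of a Web-based possibility estimate: a long
    sequence is scored by combining the document frequencies of its
    overlapping sub-sequences of length `span`.

    doc_count(phrase) stands in for a Web search engine hit count.
    """
    words = ngram.split()
    if len(words) <= span:
        # short sequence: possible iff at least one Web document contains it
        return 1.0 if doc_count(ngram) > 0 else 0.0
    # long sequence: every overlapping sub-sequence must itself be attested
    subs = [" ".join(words[i:i + span]) for i in range(len(words) - span + 1)]
    return min(1.0 if doc_count(s) > 0 else 0.0 for s in subs)
```

In this sketch a single unattested sub-sequence makes the whole sequence impossible, which is the kind of hard constraint a possibilistic LM can exploit.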
Another critical point is related to the integration of possibility measures into ASR systems, which are basically built on a probabilistic formulation of the speech recognition problem. We now plan to evaluate a fully possibilistic ASR system.

References

Allauzen, A., Gauvain, J., 2005. Diachronic vocabulary adaptation for broadcast news transcription. In: European Conference on Speech Communication and Technology, INTERSPEECH'05, Lisbon, Portugal, pp. 1305–1308.
Asadi, A., Schwartz, R., Makhoul, J., 1990. Automatic detection of new words in a large vocabulary continuous speech recognition system. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP'90, Albuquerque, NM, USA, pp. 125–128.
Bengio, Y., Schwenk, H., Senécal, J.S., Morin, F., Gauvain, J.L., 2006. Neural probabilistic language models. In: Innovations in Machine Learning. Springer, Berlin Heidelberg, pp. 137–186.
Berger, A., Miller, R., 1998. Just-in-time language modelling. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP'98, Seattle, WA, USA, pp. 705–708.
Bertoldi, N., Federico, M., 2001. Lexicon adaptation for broadcast news transcription. In: ITRW on Adaptation Methods for Speech Recognition, Sophia-Antipolis, France, pp. 187–190.
Borges, J.L., 1944. La biblioteca de Babel. Editorial Sur.
Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J., 2007. Large language models in machine translation. In: Conference on Empirical Methods in Natural Language Processing, EMNLP'07, Prague, Czech Republic, pp. 858–867.
Bulyko, I., Ostendorf, M., Stolcke, A., 2003. Getting more mileage from web text sources for conversational speech language modeling using class-dependent mixtures. In: Human Language Technology, HLT-NAACL'03, vol. 2, Edmonton, Canada.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E., 2006. Bigtable: a distributed storage system for structured data. In: Conference on Usenix Symposium on Operating Systems Design and Implementation, USENIX'06, vol. 7, Seattle, WA, USA, pp. 205–218.
Chollet, G., 1995. Evaluation of ASR systems, algorithms and databases. In: New Advances and Trends in Speech Recognition and Coding, pp. 32–40.
de Cooman, G., 1997. Possibility theory I: the measure- and integral-theoretic groundwork. International Journal of General Systems 25, 291–323.
Dubois, D., Prade, H., 1988. Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Press.
Dubois, D., 2006. Possibility theory and statistical reasoning. Computational Statistics and Data Analysis 21, 47–69.
Ghemawat, S., Gobioff, H., Leung, S.-T., 2003. The Google file system. In: ACM Symposium on Operating Systems Principles, New York, USA, pp. 20–43.
Goodman, J., 2006. A bit of progress in language modeling, extended version. Tech. rep., Microsoft Research.
Guthrie, D., Hepple, M., 2010. Storing the web in memory: space efficient language models with constant time retrieval. In: Conference on Empirical Methods in Natural Language Processing, EMNLP'10, Cambridge, MA, USA, pp. 262–272.
Jelinek, F., 1976. Continuous speech recognition by statistical methods. IEEE Proceedings 64, 532–556.
Keller, F., Lapata, M., 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29, 459–484.
Kemp, T., Waibel, A., 1998. Reducing the OOV rate in broadcast news speech recognition. In: International Conference on Spoken Language Processing.
Lecorve, G., Gravier, G., Sebillot, P., 2008. An unsupervised web-based topic language model adaptation method. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP'08, Las Vegas, USA, pp. 5081–5084.
Mnih, A., Hinton, G., 2007. Three new graphical models for statistical language modelling. In: Proceedings of the 24th International Conference on Machine Learning, ACM, Seoul, Korea, pp. 641–648.
Monroe, G.A., French, J.C., Powell, A.L., 2002. Obtaining language models of Web collections using query-based sampling techniques. In: Hawaii International Conference on System Sciences, HICSS'02, Hawaii, USA, pp. 1241–1247.
Nocéra, P., Fredouille, C., Linarès, G., Matrouf, D., Meignier, S., Bonastre, J., Massonié, D., Béchet, F., 2004. The LIA's French broadcast news transcription system. In: SWIM: Lectures by Masters in Speech Processing, Maui, HI, USA.
Oger, S., Popescu, V., Linarès, G., 2009a. Probabilistic and possibilistic language models based on the world wide web. In: International Conference on Speech Communication and Technology, INTERSPEECH'09, Tokyo, Japan.
Oger, S., Popescu, V., Linarès, G., 2009b. Using the world wide web for learning new words in continuous speech recognition tasks: two case studies. In: International Conference on Speech and Computer, SPECOM'2009, St. Petersburg, Russia.
Sethy, A., Georgiou, P., Narayanan, S., 2005. Building topic specific language models from webdata using competitive models. In: International Conference on Speech Communication and Technology, INTERSPEECH'05, Lisbon, Portugal, pp. 1293–1296.
Stern, R., 1997. Specifications of the 1996 HUB-4 broadcast news evaluation. In: Proc. DARPA Speech Recognition Workshop, pp. 7–14.
Wan, V., Hain, T., 2006. Strategies for language model web-data collection. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP'06, vol. 6, Toulouse, France.
Zadeh, L., 1978. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1 (1), 3–28.
Zhu, X., Rosenfeld, R., 2001. Improving trigram language modeling with the world wide web. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP'01, vol. 1, Salt Lake City, UT, USA, pp. 533–536.