Subspace Gaussian mixture based language modeling for large vocabulary continuous speech recognition


Speech Communication 117 (2020) 21–27


Ri Hyon Sun, Ri Jong Chol
College of Information Science, Kim Il Sung University, Taesong District, Pyongyang, Democratic People's Republic of Korea


Keywords: Language modeling, Speech recognition, Recurrent neural network, Subspace Gaussian mixture model

Abstract

This paper focuses on an adaptable continuous space language modeling approach that combines the longer context information of a recurrent neural network (RNN) with the adaptation ability of the subspace Gaussian mixture model (SGMM), which has been widely used in acoustic modeling for automatic speech recognition (ASR). In large vocabulary continuous speech recognition (LVCSR) it is a challenging problem to construct language models that capture the longer context information of words while ensuring generalization and adaptation ability. Recently, language modeling based on the RNN and its variants has been studied broadly in this field. The goal of our approach is to obtain history feature vectors of a word that carry longer context information and to model every word with a subspace Gaussian mixture model, analogous to the Tandem system used in acoustic modeling for ASR. We also apply the fMLLR adaptation method, which is widely used in SGMM based acoustic modeling, to adapt the subspace Gaussian mixture based language model (SGMLM). After fMLLR adaptation, the SGMLMs based on Top-Down and Bottom-Up clustering obtain WERs of 4.15% and 4.61%, which are better than the 5.70% and 6.01% obtained without adaptation, respectively. Also, with fMLLR adaptation, the Top-Down and Bottom-Up based SGMLMs yield absolute word error rate reductions of 1.48% and 1.02% and relative perplexity reductions of 10.02% and 6.46% compared to the RNNLM without adaptation, respectively.

1. Introduction

Recently, interest in deep neural networks has grown in pattern recognition, and recurrent neural network based language models (RNNLMs) have been studied intensively for speech recognition. The RNNLM (Mikolov et al., 2010) is a powerful language model that can capture longer context information. It has been shown to outperform conventional n-gram LMs as well as many other language modeling techniques because it uses the full history information instead of a word history of limited length. However, in the field of RNN language modeling, adaptation methods have not been studied sufficiently. Usually, language model adaptation, which improves the recognition performance of automatic speech recognition systems on a specific domain such as law or medicine, is performed on a small amount of domain-specific data, because it is difficult to obtain large amounts of domain-specific data. In particular, language modeling of conversational speech suffers from a lack of training data, because collecting large amounts of conversational data and producing detailed transcriptions is very costly.

Meanwhile, Gaussian mixture models (GMMs), which have been widely used for acoustic modeling in ASR, have various adaptation methods such as maximum a posteriori (MAP), maximum likelihood linear regression (MLLR) (Young et al., 2009; Povey and Saon, 2006) and feature-space MLLR (fMLLR) (Young et al., 2009; Ghoshal et al., 2010; Povey and Saon, 2006). If we use a word instead of a phone and a topic instead of a speaker, all of these can also be applied to language modeling. There was indeed an attempt to estimate a Gaussian mixture language model (GMLM) (Afify et al., 2007). In the GMLM, the authors used the singular vectors obtained by singular value decomposition (SVD) of the word co-occurrence matrix as the feature vectors of words, but they did not study the adaptation of the GMLM in detail. The word feature vectors used by Afify et al. (2007) did not capture context information and did not give superior performance over n-gram language models because of the SVD error of the large, sparse co-occurrence matrix; such feature vectors are also unlikely to exploit longer context information. For modeling complex and sparse data with GMMs, parameter tying should be used for robust estimation of the model parameters (for example, in acoustic modeling for speech recognition the acoustic states are clustered by a phone decision tree, which improves recognition performance), but parameter tying was not discussed in Afify et al. (2007).



When efficient parameter tying approaches are introduced and the SGMM, which is known to be superior to the GMM (Povey, 2009; Povey et al., 2010), is used, we can improve the modeling accuracy over the GMLM and exploit the adaptation methods known for SGMMs without modification. Therefore, in this work we aim to capture the longer context information of words with an RNN and to model the distribution of this context information with subspace Gaussian mixture models. The proposed system then has the advantages of RNNs, which are known to model longer context information, and of SGMMs, which have shown superior performance within the class of Gaussian mixture models. Our framework is similar to the Tandem system (Hermansky et al., 2000; Vinyals and Ravuri, 2011) used in acoustic modeling for ASR, which combines artificial neural network (ANN) discriminative feature processing with a GMM.

We describe our framework for Korean continuous speech recognition below. Firstly, after the RNNLM has been trained on the training corpus, the outputs of the hidden layer of the RNN are selected as the context features of a word in continuous space. The features obtained in this manner capture longer context information and are used as training data for the subspace Gaussian mixture based language model (SGMLM). Secondly, we construct a language model based on the subspace Gaussian mixture model (Povey, 2009; Povey et al., 2010; Burget et al., 2010) with a training criterion such as maximum likelihood or discriminative training. It is known that the subspace Gaussian mixture model is superior to the Gaussian mixture model because the SGMM can model the feature space compactly with fewer parameters. Parameter tying is very important because many words occur only a few times in the training corpus for language modeling. So, in order to make the SGMLM more robust, we perform parameter tying using Top-Down or Bottom-Up clustering, as in acoustic modeling (Young et al., 2009; Reichl and Chou, 2000). We use decision trees in the Top-Down method and an agglomerative clustering approach (Babich et al., 1996) in the Bottom-Up method. Both methods exploit the Part-of-Speeches suitable for Korean speech recognition.

Language models based on decision trees have been reported in the literature. Bahl et al. (1989) constructed decision tree language models using a 20-word history to predict future words. Xu and Jelinek (2007) developed a smoothing technique based on randomly grown decision trees. In decision tree language models, decision trees were used to classify histories into equivalence classes; we instead use decision trees to classify the words of the vocabulary for parameter tying. In our work, the word clusters obtained by clustering approaches such as Top-Down or Bottom-Up are similar to conventional word classes in class-based language models. Class-based language models have been broadly studied to address the problem of data sparseness: since the number of classes is much smaller than the number of words in the vocabulary, there are significantly fewer parameters to estimate, which alleviates the data sparseness problem and reduces the size of the language model. Brown et al. (1992) developed a class-based n-gram model, which uses probabilities of sequences of word classes instead of sequences of individual words; to determine the word classes, they used a bottom-up word clustering algorithm that finds the classes giving high mutual information between the classes of adjacent words. Martin et al. (1998) employed an exchange algorithm with a class trigram perplexity criterion to obtain word classes. Many other class-based language models have been proposed, but the idea behind these approaches is to group words with similar context into one class. The word clustering approaches reported above are similar to ours in that they use word context for clustering, but our clustering is done in a continuous space, so the clustering criteria are clearly different. Finally, in order to improve recognition performance on a specific domain, we adapt the SGMLM with fMLLR.

The rest of this paper is organized as follows. In Section 2 we briefly review the GMM language model and the RNN language model related to our work, and in Section 3 our proposal is described in detail. In Section 4 we discuss parameter clustering for robust training and in Section 5 the fMLLR adaptation of SGMLMs. The experimental setup and results are presented in Section 6, and we conclude in Section 7.
2. Review on continuous space language models

2.1. Gaussian mixture language model (GMLM)

GMMs are widely known in data modeling and pattern classification because of their ability to represent complex, multi-modal data distributions when they have enough components. Afify et al. (2007) proposed a continuous space language model called GMLM for ASR, which uses GMMs to model the distribution of word histories in the space of word histories. Because the GMLM can use the training and adaptation methods developed in acoustic modeling for ASR, it is promising for language modeling in ASR. In acoustic modeling there are various methods to improve the accuracy of GMMs and to avoid over-fitting in model estimation (Young et al., 2009; Povey, 2009; Povey et al., 2010). It is also well known that first training GMM parameters with EM and then fine-tuning them discriminatively can significantly improve the accuracy of a GMM-HMM based ASR system (Povey, 2004; Povey et al., 2008). Furthermore, there are various efficient adaptation methods such as MLLR, MAP and fMLLR that can adapt a GMM to a specific target domain.

The GMLM needs an appropriate order of model parameters because of the model size and the cost of computation. Therefore, in the previous GMLM, the authors used the singular vectors corresponding to the largest singular values of the word co-occurrence matrix, obtained by singular value decomposition (SVD), as feature vectors, and then projected the concatenated n−1 (context size) feature vectors to a lower order using LDA (linear discriminant analysis). However, good feature vectors could not be obtained because of numerical errors in the SVD of the large, sparse word co-occurrence matrix. Therefore, in Afify et al. (2007) the GMLM was used in combination with a conventional n-gram language model as follows:

$$P'(w|h) = \begin{cases} \alpha(h)\,P_{gmm}(w|h), & w \in \mathrm{GMLM} \\ P(w|h), & w \notin \mathrm{GMLM} \end{cases} \qquad (1)$$

where P_gmm(w|h) is the GMLM probability of word w given history h (see Eq. (6) in Afify et al., 2007 for details), P(w|h) is the n-gram probability of word w given history h, and $\alpha(h) = \sum_{w' \in \mathrm{GMLM}} P(w'|h)$ is the normalization factor of P_gmm(w|h).

In order to model correctly the probability density function of a history h given a word w, p(y|w) (y is a real-valued vector representing the history h; we refer to it as the history vector, and p(y|w) is an approximation of P(h|w)), sufficient histories of the word must be available for the GMM. So, in Afify et al. (2007) only words occurring 100 times or more were modeled with GMMs, and the remaining words were grouped into 200 classes using the SRILM toolkit. We consider this one of the reasons that degrade the performance of the GMM based language model. We therefore want to cluster the words in the continuous space, just like state clustering in acoustic modeling, and model all the words of the vocabulary with an SGMM.
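To make the combination in Eq. (1) concrete, the following is a minimal sketch, not the authors' implementation; `gmlm_prob`, `ngram_prob` and `gmlm_vocab` are hypothetical stand-ins for the GMLM, the back-off n-gram model and the set of words covered by the GMLM.

```python
# Minimal sketch of the GMLM / n-gram combination in Eq. (1).
# gmlm_prob(w, h), ngram_prob(w, h) and gmlm_vocab are hypothetical
# stand-ins for the two component models described by Afify et al. (2007).

def combined_prob(w, h, gmlm_prob, ngram_prob, gmlm_vocab):
    """P'(w|h): GMLM score for words covered by the GMLM, n-gram otherwise."""
    # alpha(h) renormalizes the GMLM so that its probability mass over the
    # GMLM vocabulary matches what the n-gram assigns to those words.
    alpha = sum(ngram_prob(v, h) for v in gmlm_vocab)
    if w in gmlm_vocab:
        return alpha * gmlm_prob(w, h)
    return ngram_prob(w, h)
```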


2.2. Recurrent neural network based language model (RNNLM)

It is well known that, in theory, a recurrent neural network can model word context of unlimited length through its recurrently connected hidden layer (Mikolov et al., 2010). The structure of the recurrent neural network is shown in Fig. 1. The network has an input layer x, a hidden layer s, an output layer y, weight matrices U and R between the input layer and the hidden layer, and a weight matrix V between the hidden layer and the output layer.

Fig. 1. The structure of recurrent neural network.

Let x(t) be the input to the network at time t, y(t) the output of the network at time t and s(t) the state of the hidden layer. The input vector x(t) is formed by concatenating the vector w(t), which represents the current word at time t encoded with 1-of-N coding, and the previous context layer vector s(t−1), which is the output of the hidden layer at time t−1. The size of the input vector x(t) is therefore equal to the size of the vocabulary plus the size of the hidden layer. The hidden layer s usually had 30–500 units in previous work (Mikolov et al., 2010), and its size should be chosen according to the amount of training data; large hidden layers need large amounts of training data, and learning a large network takes a long time. The hidden and output layers of the recurrent neural network are computed as follows:

$$s_j(t) = f\Big(\sum_i w_i(t)\,u_{ji} + \sum_l s_l(t-1)\,r_{jl}\Big) \qquad (2)$$

$$y_k(t) = g\Big(\sum_j s_j(t)\,v_{kj}\Big) \qquad (3)$$

where f(z) is the sigmoid activation function and g(z) is the softmax function. The output layer y(t) represents the probability distribution of the next word w(t+1) given the current word w(t) and the context s(t−1). Recurrent neural network based language models capture longer context and have higher generalization ability than n-gram LMs, but we did not find any effective adaptation methods for them.
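As an illustration of Eqs. (2)–(3), here is a minimal NumPy sketch of one recurrent forward step; the layer sizes, random initialization and variable names are illustrative assumptions rather than the paper's configuration (Section 6.2 reports a 70-unit hidden layer, which is used here only as an example).

```python
import numpy as np

# Minimal NumPy sketch of the recurrent forward step in Eqs. (2)-(3).
# Sizes and initialization are illustrative; the paper uses a 1-of-N input
# over the vocabulary and (in Section 6.2) a 70-unit hidden layer.

V, H = 65000, 70                            # vocabulary size, hidden size (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(H, V))     # input -> hidden weights (u_ji)
R = rng.normal(scale=0.01, size=(H, H))     # hidden -> hidden weights (r_jl)
Vout = rng.normal(scale=0.01, size=(V, H))  # hidden -> output weights (v_kj)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(word_id, s_prev):
    """One step of Eqs. (2)-(3): returns the new hidden state s(t) and P(w(t+1)|.)."""
    s_t = sigmoid(U[:, word_id] + R @ s_prev)   # Eq. (2) with a 1-of-N input vector
    y_t = softmax(Vout @ s_t)                   # Eq. (3)
    return s_t, y_t

s_prev = np.zeros(H)
s_prev, probs = rnn_step(word_id=42, s_prev=s_prev)
```

The vector `s_prev` carried between steps is exactly the hidden-layer output s(t−1) that Section 3 reuses as the history vector, i.e., as the SGMLM input feature.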

3. Subspace Gaussian mixture based language modeling

It is known that the SGMM is superior to the Gaussian mixture model (GMM) in acoustic modeling for ASR. SGMMs represent the feature space compactly with few parameters and can be trained well on small amounts of training data. So, we assume that if we apply SGMMs to language modeling, the data sparseness problem in language modeling can be alleviated to some extent and the performance can be superior to the GMLM. We verify this hypothesis experimentally in Section 6.2 (Table 5). In the SGMLM, a kind of continuous space language model, every word (corresponding to a state in acoustic modeling) is modeled so that it lies in a subspace spanned by a universal background model of the vocabulary space. We regard a word as an HMM with one state, in analogy to acoustic models, or as a speaker modeled with a GMM, in analogy to speaker recognition. In Povey et al. (2010), the authors improved model accuracy by using states with multiple sub-states, so we likewise assume that the single state of the HMM corresponding to a word has multiple sub-states. This hypothesis is verified by the experiments in Section 6.2 (Tables 2 and 3).

Recently, RNNs are widely used for language modeling. RNNs do not limit the size of the context: context information is retained inside the network by the recurrent connections, and the hidden states of an RNN depend on the entire input history. Therefore, we use the output of the hidden layer of the RNN at time t−1 as the history vector of the word w at time t (s(t−1) in Fig. 1). After the RNNLM has been trained, we use the history vectors obtained from the hidden layer of the RNN as input features for the SGMLM. The set of history vectors corresponding to a word forms the feature space of that word for the SGMLM. The probability density function of a history vector y given a word w is then computed as follows:

$$p(y|w) = \sum_{c=1}^{C_w} \rho_{wc} \sum_{i=1}^{I} \omega_{wci}\, \mathcal{N}(y;\, \mu_{wci},\, \Sigma_i) \qquad (4)$$

where C_w is the number of sub-states of the word w, ρ_wc (≥ 0) is the weight of the c-th sub-state of the word, I is the number of Gaussians in a GMM, and ω_wci, μ_wci, Σ_i are the weight, mean vector and covariance matrix of the i-th Gaussian component in the GMM for the c-th sub-state of the word w, respectively. These parameters are defined as follows:

$$\mu_{wci} = \mathbf{M}_i \mathbf{v}_{wc} \qquad (5)$$

$$\omega_{wci} = \frac{\exp\big(\mathbf{w}_i^T \mathbf{v}_{wc}\big)}{\sum_{k=1}^{I} \exp\big(\mathbf{w}_k^T \mathbf{v}_{wc}\big)} \qquad (6)$$

$$\sum_{c=1}^{C_w} \rho_{wc} = 1 \qquad (7)$$

where v_wc is the c-th sub-state vector of the word w, and M_i and w_i are the mean projection and the weight projection onto the i-th Gaussian component, respectively. We compute v_wc, M_i and w_i with the same approach as Povey et al. (2010). As above, the probability density function of a word w is a GMM with I Gaussians, but the covariance matrices Σ_i are shared over all words, and the mixture weight ω_wci and mean μ_wci are derived from the sub-state vector v_wc and the projections M_i, w_i. The GMM parameters of a word are therefore confined to a subspace of the total parameter space. Because parameter training of the subspace Gaussian mixture model is described in detail in the acoustic modeling literature (Povey et al., 2011), we do not repeat it here; when a word is treated as an HMM with one state, the training framework is the same as in acoustic modeling.

Actually, we are interested in the probability P(w|h), which can be calculated using Bayes' rule as follows:

$$P(w|h) = \frac{P(w)\,P(h|w)}{P(h)} \qquad (8)$$

As for the GMLM in Afify et al. (2007), using the approximations P(h) ≈ p(y), P(h|w) ≈ p(y|w) and P(w|h) ≈ p(w|y), Eq. (8) is rewritten as follows:

$$p(w|y) = \frac{P(w)\,p(y|w)}{p(y)} = \frac{P(w)\,p(y|w)}{\sum_{v \in V} P(v)\,p(y|v)} \qquad (9)$$

where P(w) can be taken as the usual unigram probability of the word w and p(y|w) is the SGMM probability of the history vector y given the word w.
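The following sketch shows, under simplifying assumptions, how Eqs. (4)–(7) and the posterior of Eq. (9) could be evaluated with NumPy/SciPy; the containers `v_sub`, `rho`, `M`, `w_proj`, `Sigma` and `unigram` are hypothetical names for the sub-state vectors, sub-state weights, shared projections, shared covariances and unigram probabilities introduced above, and the loops are written for clarity rather than speed.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the SGMLM density of Eqs. (4)-(7) and the posterior of Eq. (9).
# All parameter names (M, w_proj, Sigma, v_sub, rho, unigram) are illustrative
# placeholders for the shared projections, shared covariances, sub-state vectors,
# sub-state weights and unigram probabilities described in the text.

def word_density(y, v_sub, rho, M, w_proj, Sigma):
    """p(y|w) from Eqs. (4)-(7) for one word with C_w sub-states."""
    I = len(M)                                   # number of shared Gaussians
    p = 0.0
    for c in range(len(v_sub)):                  # sum over sub-states
        logits = np.array([w_proj[i] @ v_sub[c] for i in range(I)])
        omega = np.exp(logits - logits.max())
        omega /= omega.sum()                     # Eq. (6): softmax mixture weights
        mix = 0.0
        for i in range(I):
            mu = M[i] @ v_sub[c]                 # Eq. (5): mean from the subspace
            mix += omega[i] * multivariate_normal.pdf(y, mean=mu, cov=Sigma[i])
        p += rho[c] * mix                        # Eq. (4); rho obeys Eq. (7)
    return p

def word_posterior(y, words, params, unigram):
    """p(w|y) from Eq. (9): unigram-weighted densities, normalized over the vocabulary."""
    scores = {w: unigram[w] * word_density(y, *params[w]) for w in words}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}
```

Note that Eq. (9) requires one density evaluation per vocabulary word for every history vector; this is the expensive denominator that Section 7 identifies as the main cost of the approach.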

4. Parameter tying

Parameter tying is important because many words occur only a few times in the training corpus, in language modeling just as in acoustic modeling. Parameter tying based on Top-Down or Bottom-Up clustering is widely used in the training of acoustic models (Young et al., 2009; Reichl and Chou, 2000). Usually Top-Down clustering is used for state tying with decision trees and Bottom-Up clustering for distribution tying.


4.1. Word clustering by Top-Down

Let us briefly recall the procedure of decision-tree state tying in acoustic modeling. Until some stopping criteria are satisfied, triphones are repeatedly split by question sequences about which phone stands to the left and/or right, and finally the triphones belonging to the same class are tied. Because this approach can predict triphones unseen in the training corpus, it is commonly used for state tying. Similarly, for SGMM language modeling we want to classify words by question sequences about what kind of word occurs in the left and/or right context. By repeatedly classifying words with their context information, we obtain a binary tree whose nodes have a 'yes' child and a 'no' child according to the answers to the questions. We call this tree a word decision tree (WDT). The leaves of a WDT contain words with similar context, and we tie the parameters of the words in a leaf.

The basic modeling units for acoustic modeling are several tens of phones, but for language modeling there are tens of thousands of words. Many of them occur only a few times in the training data and their combinations are very sparse, so we cannot build a WDT for each word. Instead, we build one WDT per Part-of-Speech. Linguistic questions are made from prior knowledge about the combination of words. The procedure for building the word decision trees is as follows.

Firstly, words are separated according to Part-of-Speech. We use 96 Part-of-Speeches suitable for Korean ASR in this work, so in total there are 96 WDTs; Fig. 2 shows these trees. The root of the noun tree T1 contains all nouns and the pronoun tree T96 contains all pronouns.

Secondly, the linguistic questions are applied to all leaves of the WDTs (see Step 2 in Fig. 2), and in each WDT the question that provides the largest increase of the log likelihood of the training data is selected (see Reichl and Chou, 2000, for details; a simplified sketch of this selection criterion is given at the end of this subsection). In Fig. 2 the selected question is "Is the following word a postposition (Korean 'TOO') indicating direction?". According to the selected question, the leaves are split. This procedure is iterated until stopping criteria are satisfied; we use the occupation counts of the child nodes and the increase of log likelihood before and after splitting as stopping criteria. In this work we used 384 questions that consider a context of length 5, from the preceding word and the word before it to the following word and the word after it. Some of the questions are listed below:

- Is the preceding word a noun?
- Is the following word a noun?
- Is the preceding word a Korean 'TOO'?
- Is the preceding word a noun and the word before it a person's name?
- Is the following word a Korean 'TOO' indicating direction and the word after it a verb?
- Is the following word 'd' or 'd' in Korean 'TOO' and the word after it a verb?
- Is the following word 'd' or 'd' in Korean 'TOO' and the word after it a verb?

Fig. 2. Word decision trees of noun and pronoun.

Thirdly, when the growth of the WDTs is finished, we perform parameter tying over the words belonging to the final leaves of the WDTs. We call a leaf of a WDT a "word-state" below. The equations of parameter tying are the same as in acoustic modeling (Reichl and Chou, 2000).
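As referenced above, here is a simplified sketch of the split-selection criterion used to grow a WDT: each node is scored by the log-likelihood of a single diagonal Gaussian fitted to the pooled history vectors of its words, and the question with the largest likelihood gain is chosen. The node statistics and the question representation are illustrative simplifications, not the exact statistics of Reichl and Chou (2000) or of the paper.

```python
import numpy as np

# Simplified sketch of greedy question selection for a word decision tree
# (Section 4.1). Each node is scored by the log-likelihood of one diagonal
# Gaussian fitted to the history vectors it covers; the question whose split
# yields the largest gain is kept. Illustrative only.

def node_loglik(vectors):
    """Log-likelihood of the pooled vectors under one ML diagonal Gaussian."""
    X = np.asarray(vectors)
    n, d = X.shape
    var = X.var(axis=0) + 1e-6
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def best_question(words, history_vecs, questions):
    """Pick the question with the largest log-likelihood gain at this node.

    words:        list of words at the node
    history_vecs: dict word -> (N_w, d) array of its history vectors
    questions:    dict name -> predicate over a word (e.g., derived from its
                  typical left/right Part-of-Speech context; a simplification)
    """
    pooled = np.vstack([history_vecs[w] for w in words])
    base = node_loglik(pooled)
    best, best_gain = None, 0.0
    for name, asks in questions.items():
        yes = [w for w in words if asks(w)]
        no = [w for w in words if not asks(w)]
        if not yes or not no:
            continue
        gain = (node_loglik(np.vstack([history_vecs[w] for w in yes]))
                + node_loglik(np.vstack([history_vecs[w] for w in no]))
                - base)
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain
```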

4.2. Word clustering by Bottom-Up

Fig. 3 shows the procedure of word clustering by Bottom-Up. The procedure is as follows.

Firstly, words are separated according to the 96 Part-of-Speeches suitable for Korean ASR, so the words are first divided into 96 subsets and Bottom-Up clustering is performed separately for each subset. Initially, each word in a subset starts as its own cluster, i.e., the cluster is initialized with the centroid of the feature space consisting of the history vectors of that word.

Secondly, the similarity between each pair of clusters is measured with the Euclidean distance. The most similar pair of clusters is found and merged into a new cluster. Fig. 3 shows word clustering within the noun class in detail; the same procedure is applied to the pronoun class and the others. The new cluster centers are computed as a weighted average over the old clusters. This process is repeated until the number of top-level clusters reaches a predefined number of classes (a minimal sketch of this merging loop is given at the end of this subsection).

Thirdly, when the word clustering is finished, we perform parameter tying over the words belonging to a cluster. As in the Top-Down method, we call the top-level clusters in the final hierarchy word-states.

Fig. 3. Word clustering by Bottom-Up.
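The merging loop referenced above can be sketched as follows, assuming each word is summarized by the centroid of its history vectors and a count; the quadratic search over cluster pairs is written for clarity, not efficiency, and the function name is illustrative.

```python
import numpy as np

# Minimal sketch of the Bottom-Up (agglomerative) word clustering of Section 4.2.
# Each word starts as its own cluster, represented by the centroid of its history
# vectors; the closest pair under Euclidean distance is merged, the new center is
# the count-weighted average, and merging stops at the target number of clusters.

def bottom_up_clusters(word_centroids, counts, target):
    """word_centroids: dict word -> centroid vector; counts: dict word -> #vectors."""
    clusters = [([w], np.asarray(c, dtype=float), counts[w])
                for w, c in word_centroids.items()]
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i][1] - clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        wi, ci, ni = clusters[i]
        wj, cj, nj = clusters[j]
        merged = (wi + wj, (ni * ci + nj * cj) / (ni + nj), ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Because the new center is a count-weighted average, frequent words dominate the merged centroid, which mirrors the weighted averaging described above.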

5. fMLLR based SGMLM adaptation

One problem in language modeling is how to handle words unseen in the training corpus, namely out-of-vocabulary (OOV) words. Once the word clustering described in Section 4 has been done, we can map all OOV words into clusters. In the case of Top-Down clustering, an OOV word takes the cluster determined by the decision tree. In the case of Bottom-Up clustering, mapping OOV words into clusters is a little more complex. We first compute the center of the history vectors of an OOV word in the adaptation data by averaging them. (OOV words are replaced in the RNN by the UNK symbol, as the input layer of the RNN has a pre-defined UNK unit.) Then we compute the Euclidean distance between this center vector and the cluster centers, find the closest cluster and assign the OOV word to it.

Another problem in language modeling is the mismatch between the training domain and the testing domain. To overcome this problem, we perform fMLLR based language model adaptation. fMLLR adaptation was proposed for acoustic modeling and is described in Povey et al. (2011), so only a short overview is given here. When a word is considered as an HMM with one state, we can reuse the estimation equations of Povey et al. (2009, 2010, 2011) with little modification. In this case the feature vector y, the history vector of a word, is transformed into a specific topic domain by a transform matrix A and a bias vector b as follows:

$$y' = A^{(t)} y + b^{(t)} \qquad (10)$$


where y and y′ denote the history vectors of a word, and A^(t) and b^(t) are the transform matrix and the bias vector of the t-th topic. The likelihood of a history vector given a word is then computed as follows (Povey et al., 2009, 2010, 2011):

$$p(y|w,t) = \big|\det A^{(t)}\big| \sum_{c=1}^{C_w} \rho_{wc} \sum_{i=1}^{I} \omega_{wci}\, \mathcal{N}\big(y';\, \mu_{wci},\, \Sigma_i\big) \qquad (11)$$

The transform matrix and bias are estimated iteratively so as to maximize the auxiliary function (Eq. (B.12) in Povey et al., 2011) on the adaptation data. In an extension of the fMLLR framework, the transform matrix is represented as a sum of basis matrices; our work, however, uses the non-basis version of the fMLLR framework, i.e., Eq. (B.19) instead of Eq. (B.20) in Povey et al. (2011).
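To illustrate how an estimated topic transform is applied at scoring time, here is a short sketch of Eqs. (10)–(11); `A_t` and `b_t` are assumed to have already been estimated on adaptation data as in Povey et al. (2011) (the statistics accumulation and row-wise update steps are not reproduced), and `word_density` refers to the hypothetical SGMLM density sketched in Section 3.

```python
import numpy as np

# Sketch of applying a topic-dependent fMLLR transform (Eq. (10)) and scoring
# the adapted history vector (Eq. (11)). A_t and b_t are assumed to have been
# estimated on adaptation data; the estimation itself is not shown here.
# word_density is the hypothetical SGMLM density sketched in Section 3.

def adapted_likelihood(y, A_t, b_t, word_params, word_density):
    """p(y|w, t) = |det A_t| * p(A_t y + b_t | w) for one word."""
    y_prime = A_t @ y + b_t                  # Eq. (10): transform into the topic domain
    jacobian = abs(np.linalg.det(A_t))       # |det A^(t)| term of Eq. (11)
    return jacobian * word_density(y_prime, *word_params)
```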

6. Experimental results

6.1. Experimental set-up

We used 120-dimensional Mel filter-bank features, including first and second derivatives, as speech features. The 120-dimensional features are concatenated over 5 frames to the left and right (11 frames in all), and the resulting 120 × 11 dimensional feature vectors are fed into a deep neural network (DNN). For acoustic modeling we used a conventional hybrid DNN-HMM structure, where the DNN has 5 hidden layers (2048 sigmoid units per layer) and the HMM has 3 emitting states with a left-to-right topology. The numbers of physical HMMs and shared states are 31,650 and 4097, respectively. The distribution of the shared states (i.e., senones) is modeled by the DNN; in other words, the size of the DNN output layer equals the number of senones, 4097. The acoustic model is trained on about 500 hours of Korean read speech using the Kaldi toolkit (Povey et al., 2011).

We used tagged news corpora of about 44M words to train the RNNLM, GMLM, n-gram LM and SGMLM. The words in the training corpus were split into sub-word units (morpheme-like units), and several frequent sub-words were joined into single tokens to improve the language model. The most frequent 65K words were selected as the vocabulary. The data for SGMLM adaptation consist of 15K sentences containing about 450K words selected from the law domain, outside the SGMLM training corpus. Test data 1 consists of 170K sentences containing about 4.2M words from the same domain as the SGMLM training corpus, so test data 1 is in-domain data. Test data 2 consists of 130K sentences selected from the same law domain as the SGMLM adaptation data, so test data 2 is out-of-domain data. Table 1 gives the details of the corpora used in our experiments.

Table 1. Corpora for training, testing and adapting language models.

Corpus        Sentences   Words    OOV
Train data    1600K       44M      91K
Adapt. data   15K         0.45M    2K
Test data 1   170K        4.2M     5K
Test data 2   130K        4.1M     10K

We used two test sets of text data (test data 1 and test data 2) and two read-speech sets (read-speech 1 and read-speech 2) to compare the performance of the SGMLMs based on Top-Down and Bottom-Up clustering on in-domain and out-of-domain data. Each of the two read-speech sets contains 20 speakers (10 male and 10 female) with 30 utterances per speaker, i.e., 600 utterances. Read-speech data 1 was recorded from sentences randomly selected from test data 1 (in-domain) and read-speech data 2 from test data 2 (out-of-domain). The two text test sets and the two read-speech sets are also used to compare the SGMLMs with GMLM, on in-domain and out-of-domain data. Because the adaptation of the SGMLMs is applied to out-of-domain data, the adaptation performance is measured only on test data 2 and read-data 2.

We use GMLM and RNNLM as baseline LMs. We built the GMLM following Afify et al. (2007) and the RNNLM with the RNNLM toolkit (Mikolov et al., 2011). We used the Korean continuous speech recognition system "RyongNamSan" as the Viterbi decoder, which generates 80 alternatives per sentence on average. Because of the computational complexity, the lattices generated with the 3-gram Kneser-Ney LM are rescored with GMLM, RNNLM and the SGMLMs in a second pass of the Viterbi decoder. The 3-gram LM was built with Kneser-Ney smoothing using the SRILM toolkit (Stolcke, 2002).
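The second-pass rescoring described above can be pictured with the following sketch, in which each first-pass alternative carries an acoustic score and is re-scored with a new language model; the score fields, LM scale and word insertion penalty are illustrative assumptions, not the settings of the "RyongNamSan" system.

```python
import math

# Sketch of second-pass N-best rescoring as described in Section 6.1: first-pass
# alternatives produced with the 3-gram LM are re-scored with another LM (GMLM,
# RNNLM or SGMLM) and the best hypothesis is kept. The score fields, LM scale and
# word insertion penalty are illustrative values, not the system's settings.

def rescore_nbest(nbest, lm_logprob, lm_scale=12.0, word_penalty=0.0):
    """nbest: list of dicts with 'words' (list of tokens) and 'ac_score' (log)."""
    best_hyp, best_score = None, -math.inf
    for hyp in nbest:
        lm_score = sum(lm_logprob(w, hyp['words'][:i])      # log P(w | history)
                       for i, w in enumerate(hyp['words']))
        total = (hyp['ac_score'] + lm_scale * lm_score
                 + word_penalty * len(hyp['words']))
        if total > best_score:
            best_hyp, best_score = hyp, total
    return best_hyp, best_score
```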

6.2. Procedure and results

The steps for constructing the SGMLM are as follows. The first step is to train the recurrent neural network based language model with the cross-entropy criterion on the training corpus. We set the size of the hidden layer of the recurrent neural network to 70. The RNNLM is trained with the speed-up technique based on factorizing the output layer of the recurrent neural network into word classes, as proposed by Mikolov et al. (2010, 2011).

The second step is to initialize the SGMLM with the outputs of the hidden layer of the recurrent neural network trained in the first step. The output of the hidden layer of the RNN at time t−1 is taken as the history vector of the word w at time t. We use these history vectors from the hidden layer of the RNN as input features for the SGMLM; the set of history vectors corresponding to a word w forms the feature space of w for the SGMLM. In acoustic modeling the UBM is obtained by clustering all Gaussians in the system; for language modeling, it is obtained by clustering all history vectors in the training corpus with the well-known K-means algorithm (a minimal sketch of this initialization is given after the step description below). Since we want to investigate the effect of the word clustering approaches, we fix the number of mixtures of the UBM to 1000. The initialization and training of the SGMLM are the same as those of the SGMM in acoustic modeling. In this step, words are clustered using Top-Down or Bottom-Up clustering. Note that the number of sub-states of the SGMM representing a word is 1, i.e., each word is represented by v_w1.

The third step is to iteratively increase the number of sub-states of the words and to train their distributions. Here the term "epoch" means one period of increasing the number of sub-states of the words and updating their distribution parameters. Fig. 4 shows in detail the procedure of building the SGMLM from the second step to the third step, after RNNLM training.

Fig. 4. Block diagram for building SGMLM.

To find the optimal settings of the SGMLM based on Top-Down clustering, we perform recognition experiments on read-data 1 with respect to the number of word-states and sub-states. The results are shown in Table 2 in terms of word error rate (WER). In Table 2 the best result is obtained with 5 sub-states and 5000 word-states, so the subsequent experiments with the Top-Down based SGMLM use these settings.
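Before turning to the results, here is a minimal sketch of the UBM initialization used in the second step above: the pooled history vectors are grouped into 1000 clusters with K-means and each cluster is converted into one UBM Gaussian. The paper only states that "well-known K-Means" is used, so the choice of scikit-learn and of diagonal covariances here is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the UBM initialization in the second step of Section 6.2: the pooled
# history vectors are clustered with K-means into 1000 groups and each group is
# turned into one UBM Gaussian (weight, mean, diagonal covariance). Using
# scikit-learn and diagonal covariances is an illustrative choice.

def init_ubm(history_vectors, num_mix=1000, seed=0):
    X = np.asarray(history_vectors)              # (N, d) pooled history vectors
    km = KMeans(n_clusters=num_mix, random_state=seed, n_init=4).fit(X)
    weights, means, covs = [], [], []
    for k in range(num_mix):
        Xk = X[km.labels_ == k]
        if len(Xk) == 0:                         # rare empty cluster: skip in this sketch
            continue
        weights.append(len(Xk) / len(X))
        means.append(Xk.mean(axis=0))
        covs.append(np.diag(Xk.var(axis=0) + 1e-6))   # variance floor on the diagonal
    return np.array(weights), np.array(means), covs
```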


Table 2. WERs of SGMLMs based on Top-Down with respect to the number of word-states and sub-states (%).

Word-states   Average sub-states per word-state
              3       4       5
3000          3.40    3.30    3.18
4000          3.36    3.07    3.00
5000          3.02    2.97    2.90
6000          3.01    2.95    2.90
7000          2.95    2.96    2.92
8000          2.96    3.18    3.08

Table 3. WERs of SGMLMs based on Bottom-Up with respect to the number of word clusters and sub-states (%).

Word clusters   Average sub-states per word-state
                3       4       5
5000            2.93    2.86    2.82
5500            2.90    2.84    2.81
6000            2.83    2.85    2.82
6500            2.82    2.86    2.84
7000            2.82    2.85    2.83

Similarly, we perform recognition experiments on read-data 1 with respect to the number of word clusters for the SGMLM based on Bottom-Up clustering. The results are presented in Table 3 in terms of WER. In Table 3 the best result is obtained with 5 sub-states and 5500 word clusters, so the subsequent experiments with the Bottom-Up based SGMLM use these settings. Tables 2 and 3 show that Bottom-Up is slightly better than Top-Down on read-data 1, which was recorded in the same domain as the SGMLM training corpus. This suggests that the Bottom-Up based SGMLM models the words seen in the training corpus better than the Top-Down based one.

Next, using out-of-domain data, we evaluated the performance of the SGMLMs based on Top-Down and Bottom-Up clustering with their best settings in terms of perplexity (PPL) and WER (Table 4). The results in Table 4 show that the SGMLM based on Top-Down clustering is slightly better than the one based on Bottom-Up clustering, which suggests that the Top-Down based SGMLM handles words unseen in the training corpus better than the Bottom-Up based one.

Table 4. Performance of SGMLMs in out-of-domain (PPL on test data 2, WER on read-data 2).

SGMLM       PPL     WER (%)
Top-Down    125.7   5.70
Bottom-Up   130.8   6.01

Next, we compared our proposed SGMLMs with GMLM on in-domain and out-of-domain data to verify the hypothesis that the performance of SGMLMs can be superior to GMLM (Table 5). Table 5 shows that the SGMLMs are consistently better than GMLM, both in-domain and out-of-domain.

Table 5. The performance of SGMLMs and GMLM.

Models            PPL (test data 1)   PPL (test data 2)   WER % (read-data 1)   WER % (read-data 2)
GMLM              156.8               176.1               3.18                  7.09
SGMLM Top-Down    110.6               125.7               2.90                  5.70
SGMLM Bottom-Up   102.7               130.8               2.81                  6.01

Next, we performed the adaptation experiment for our proposed SGMLMs. We used the RNNLM (Mikolov et al., 2011) as the baseline LM. Table 6 summarizes the performance of fMLLR adaptation of the SGMLMs on out-of-domain data (test data 2 and read-data 2).

Table 6. Performance of fMLLR adaptation of SGMLMs (PPL on test data 2, WER on read-data 2).

Models            Before adaptation PPL (WER %)   After adaptation PPL (WER %)
RNNLM             123.7 (5.63)                    –
SGMLM Top-Down    125.7 (5.70)                    111.3 (4.15)
SGMLM Bottom-Up   130.8 (6.01)                    115.7 (4.61)

In Table 6 the SGMLMs were adapted with fMLLR. After fMLLR adaptation, the SGMLMs based on Top-Down and Bottom-Up clustering obtain WERs of 4.15% and 4.61%, which are better than the 5.70% and 6.01% obtained without adaptation, respectively. The adapted SGMLMs also outperform the RNNLM without adaptation: they achieve relative perplexity reductions of 10.02% and 6.46% and absolute word error rate reductions of 1.48% and 1.02%, respectively.
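For concreteness, these reductions follow directly from Table 6, taking the unadapted RNNLM (PPL 123.7, WER 5.63%) as the reference; the small difference from the reported 6.46% comes from rounding of the table entries.

$$\frac{123.7 - 111.3}{123.7} \approx 10.02\%, \qquad \frac{123.7 - 115.7}{123.7} \approx 6.5\%$$

$$5.63\% - 4.15\% = 1.48\%, \qquad 5.63\% - 4.61\% = 1.02\%$$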

7. Conclusion and future work

In this paper we proposed an adaptable continuous space language modeling approach that combines the longer context information of an RNN with the adaptation ability of the subspace Gaussian mixture model widely used in acoustic modeling for ASR. We also proposed parameter tying by Top-Down or Bottom-Up clustering to estimate the SGMLM efficiently. As a result, the SGMLM can exploit longer context information between words and allows task adaptation to a specific domain.

SGMLMs are considerably different from the conventional n-gram LM and from GMLM. Of course, the context features of the RNN could also be used to construct a GMLM, but we did not evaluate this because the SGMM is known to be much better than the GMM for acoustic modeling. We verified that the proposed SGMLMs are superior to GMLM. After fMLLR adaptation, the SGMLMs based on Top-Down and Bottom-Up clustering obtain WERs of 4.15% and 4.61%, which are better than the 5.70% and 6.01% obtained without adaptation, respectively. Also, with fMLLR adaptation, the Top-Down and Bottom-Up based SGMLMs yield absolute word error rate reductions of 1.48% and 1.02% and relative perplexity reductions of 10.02% and 6.46% compared to the RNNLM without adaptation, respectively. The two parameter tying approaches show opposite tendencies on test data 1 and test data 2.

A disadvantage of SGMLMs is that they take a long time for likelihood computation during decoding, because conditional probabilities must be computed over all words given a history (that is, the denominator of Eq. (9)). We therefore plan to investigate ways to reduce this computational cost in future work.

Declaration of Competing Interest

We have no conflicts of interest.

Acknowledgment

The authors would like to thank the anonymous reviewers for substantive and constructive comments which led to the improvement of this paper.

References

Afify, M., Siohan, O., Sarikaya, R., 2007. Gaussian mixture language models for speech recognition. In: Proc. IEEE Internat. Conf. on Acoustic, Speech and Signal Processing (ICASSP), Hawaii, pp. 29–32.
Babich, G.A., et al., 1996. Weighted Parzen windows for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 18 (5), 567–570.
Bahl, L.R., et al., 1989. A tree-based statistical language model for natural language speech recognition. IEEE Trans. Acoust. Speech Signal Process. 37 (7), 1001–1008.


Brown, P.F., et al., 1992. Class-based n-gram models of natural language. Comput. Linguist. 18 (4), 467–479.
Burget, L., Schwarz, P., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., Povey, D., 2010. Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models. In: Proc. IEEE Internat. Conf. on Acoustic, Speech and Signal Processing (ICASSP), Dallas, TX.
Ghoshal, A., Povey, D., et al., 2010. A novel estimation of feature-space MLLR for full covariance models. In: Proc. IEEE Internat. Conf. on Acoustic, Speech and Signal Processing (ICASSP), Dallas, TX.
Hermansky, H., Ellis, D.P.W., Sharma, S., 2000. Tandem connectionist feature extraction for conventional HMM systems. In: Proc. IEEE Internat. Conf. on Acoustic, Speech and Signal Processing (ICASSP), Istanbul, Turkey, vol. 3, pp. 1635–1638.
Martin, S., et al., 1998. Algorithms for bigram and trigram word clustering. Speech Commun. 24, 19–37.
Mikolov, T., et al., 2010. Recurrent neural network based language model. In: Proceedings of Interspeech, Makuhari, Chiba, Japan.
Mikolov, T., Kombrink, S., Deoras, A., et al., 2011. RNNLM – recurrent neural network language modeling toolkit. In: Proceedings of ASRU Workshop, pp. 196–201.
Povey, D., 2004. Discriminative Training for Large Vocabulary Speech Recognition. Ph.D. Thesis, Cambridge University.
Povey, D., 2009. A Tutorial Introduction to Subspace Gaussian Mixture Models for Speech Recognition. Tech. Rep. MSR-TR-2009-111, Microsoft Research.
Povey, D., Burget, L., et al., 2010. Subspace Gaussian mixture models for speech recognition. In: Proc. IEEE Internat. Conf. on Acoustic, Speech and Signal Processing (ICASSP), Dallas, TX.

Povey, D., Burget, L., et al., 2011. The subspace Gaussian mixture model – a structured model for speech. Comput. Speech Lang. 25, 404–439.
Povey, D., Ghoshal, A., et al., 2011. The Kaldi speech recognition toolkit. In: Proc. of the ASRU, Hawaii, USA.
Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., Visweswariah, K., 2008. Boosted MMI for model and feature-space discriminative training. In: Proc. IEEE Internat. Conf. on Acoustic, Speech and Signal Processing (ICASSP), Las Vegas, NV, pp. 4057–4060.
Povey, D., Saon, G., 2006. Feature and model space speaker adaptation with full covariance Gaussians. In: Proceedings of Interspeech/ICSLP.
Reichl, W., Chou, W., 2000. Robust decision tree state tying for continuous speech recognition. IEEE Trans. Speech Audio Process. 8 (5), 555–566.
Stolcke, A., 2002. SRILM – an extensible language modeling toolkit. In: Proceedings of ICSLP, Denver, Colorado.
Vinyals, O., Ravuri, S.V., 2011. Comparing multilayer perceptron to deep belief network tandem features for robust ASR. In: Proc. IEEE Internat. Conf. on Acoustic, Speech and Signal Processing (ICASSP), Prague, Czech Republic, pp. 4596–4599.
Xu, P., Jelinek, F., 2007. Random forests and the data sparseness problem in language modeling. Comput. Speech Lang. 21, 105–152.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2009. The HTK Book (for Version 3.4). Cambridge University Engineering Department.
