Acoustic modelling for speech recognition in Indian languages in an agricultural commodities task domain
Aanchan Mohan a,*, Richard Rose a, Sina Hamidi Ghalehjegh a, S. Umesh b
a McGill University, Montreal, QC, Canada
b Indian Institute of Technology – Madras, Chennai, India

* Corresponding author. Tel.: +1 5142246996.
E-mail addresses: [email protected] (A. Mohan), [email protected] (R. Rose), [email protected] (S.H. Ghalehjegh), [email protected] (S. Umesh).
Abstract
In developing speech recognition based services for any task domain, it is necessary to account for the support of an increasing number of languages over the life of the service. This paper considers a small vocabulary speech recognition task in Indian languages. To configure a multi-lingual system in this task domain, an experimental study is presented using data from two linguistically similar languages, Hindi and Marathi. We do so by training a subspace Gaussian mixture model (SGMM) (Povey et al., 2011; Rose et al., 2011) under a multi-lingual scenario (Burget et al., 2010; Mohan et al., 2012a). “Real-world” speech data, collected to develop spoken dialogue systems, is used in this experimental study. Acoustic, channel and environmental mismatch between data sets from multiple languages is an issue while building multi-lingual systems of this nature. As a result, we use a cross-corpus acoustic normalization procedure which is a simple variant of speaker adaptive training (SAT) (Mohan et al., 2012a). The resulting multi-lingual system provides the best speech recognition performance for both languages. Further, the effect of sharing “similar” context-dependent states from the Marathi language on Hindi speech recognition performance is presented.
© 2013 Published by Elsevier B.V.
Keywords: Automatic speech recognition; Multi-lingual speech recognition; Subspace modelling; Under-resourced languages; Acoustic-normalization
1. Introduction
With the proliferation and penetration of cellular telephone networks in remote regions, a larger proportion of the planet's population has inexpensive access to communication. As a result, a wider audience, especially among under-served populations, has access to telephony-based information and services. Spoken dialog (SD) systems provide a natural method for information access, especially for users who have little or no formal education. Among the challenges facing the development of such systems is the need to configure them in languages that are under-resourced.
Agriculture provides a means of livelihood for over 50% of India's population. Most of India's farming is small scale; 78% of farms are five acres or less (Patel et al., 2010; Singh et al., 1999). Indian farmers, a proportion of whom are illiterate, face a host of challenges such as water shortages, the increasing cost of farm supplies and inequitable distribution systems that govern the sale of their produce. Access to information through information and communication technologies (ICT) is often seen as a solution to enable and empower rural Indian farmers. There have been many noted efforts in this direction to develop ICTs by members of the private sector, non-governmental organizations (NGOs) and the government. E-Choupal (Bowonder et al., 2003), Avaaj Otalo (Patel et al., 2010), and the Mandi Information System (Umesh, 2013; Mantena et al., 2011; Shrishrimal et al., 2012) are examples of such efforts that have been undertaken in the past or are currently ongoing. In addition, it is noteworthy to mention
the evaluation done by Plauché and Nallasamy (2007) to assess the factors involved in setting up a low-cost, small-scale spoken dialogue system (SDS) to disseminate agricultural information to farmers in rural Tamil Nadu, India.

The work reported on the Mandi Information System is part of a larger effort called “Speech-based Access for Agricultural Commodity Prices in Six Indian Languages” initiated by the Government of India (Umesh, 2013; Mantena et al., 2011; Shrishrimal et al., 2012). The project involves the development of spoken dialog systems to access prices of agricultural commodities in various Indian districts. This information, updated daily, is also made available online through an Indian government web-portal, www.agmarknet.nic.in. Needless to say, configuring SD systems in any or most of these languages is hardly a trivial task. A large investment in time, money and effort is needed to collect, annotate and transcribe the speech data required to develop the automatic speech recognition (ASR) engine that the SD systems depend upon. Our goal in this paper is not to describe the development of the dialog system or the data collection effort itself, but to describe acoustic modelling configurations that would make the process of developing such systems more efficient.

We use a subset of the data provided to us by the teams involved in the project “Speech-based Access for Agricultural Commodity Prices in Six Indian Languages” (Umesh, 2013) for our experimental study. In terms of the target population for the service, the recording conditions of the speech data, the nature of the task (small-vocabulary) and the service itself, this work is similar to that described in Plauché and Nallasamy (2007). Section 2 provides a description of the data, which was provided to us at a time when the development of the dialog systems was still in progress. We restrict our multi-lingual experimental study to two of the six languages, Hindi and Marathi. With limited available training data in Hindi, and the existing sources of acoustic, background and channel variability associated with the data collected for each language, the data set provided to us from this task domain poses some unique challenges. The speech data is “real-world”, in that it was collected in conditions that will be experienced in the actual use of the service. It is also worth noting that the speech data was collected from a population whose education levels were representative of the target population who would actually use the service.

Our work here is motivated by previous work in developing the subspace Gaussian mixture model (SGMM) (Povey et al., 2011) and the use of the SGMM for multi-lingual speech recognition reported by Burget et al. (2010) and Lu et al. (2011, 2012). Other approaches to training multi-lingual acoustic models involve the use of a so-called common phone set as a means of sharing data between languages (Schultz and Waibel, 2001; Schultz and Kirchhoff, 2006; Vu et al., 2011). Though the use of a common phone set is a popular approach to training multi-lingual acoustic models, it is also quite cumbersome. Since the parametrization of the SGMM naturally allows
for the sharing of data between multiple languages, the use of a common phone set is unnecessary, which makes this approach attractive. The results from previous work in Burget et al. (2010) imply that, when there is a limited amount of data in a certain target language, the SGMM parametrization facilitates the use of “shared” parameters that have been estimated reliably from other well-resourced languages for speech recognition in the target language. The authors in Lu et al. (2011) examine the effect that limited training data can have on the estimation of the state-dependent parameters of the SGMM and suggest adding a regularization term for this purpose. Further, the authors in Lu et al. (2012) present a Maximum-A-Posteriori (MAP) adaptation method to “adapt” an SGMM acoustic model trained in a related but well-resourced language to a target language in which limited training data is available. The speech data used in all of these studies were collected in noise-free environments and over close-talking microphones. Given our task domain, it seems appropriate to perform an experimental study detailing some of the issues involved in building both mono-lingual and multi-lingual SGMM based systems for speech data intended for a practical application, and in languages that are of interest to a large part of the developing world.

In this paper we demonstrate the effects of multi-lingual SGMM acoustic model training carried out after cross-corpus acoustic normalization. A study is performed using the Marathi and Hindi language pair, since they are linguistically related languages. Our work shows the importance of compensating for the sources of acoustic variability between speech data collected from multiple languages while training multi-lingual SGMM models. The issue of handling speaker and environmental “factors” that cause variation in the speech signal has been addressed in Gales (2001a). Further, in Seltzer and Acero (2001) the authors propose the factoring of speaker and environment variability into a pair of cascaded constrained maximum-likelihood linear regression (CMLLR) transforms, one for each source of variability, together with an adaptive training framework to train the acoustic model. Our approach to cross-corpus acoustic normalization is similar in spirit to the approach presented in Seltzer and Acero (2001) in using a pair of “factored transforms” to compensate for speaker and environmental variability. Further, our work addresses the issue of not having enough well-trained context-dependent states in the Hindi language. To address this issue, context-dependent states in the multi-lingual SGMM are borrowed from the better-resourced Marathi language. This is complementary to the work presented by Lu et al. (2011), where a regularization term is introduced in the optimization criterion for the state-dependent parameters as a means of dealing with limited data in the target language.

To provide the context for this study, we present baseline results for mono-lingual continuous density hidden
Markov model (CDHMM) speech recognition performance for Hindi and Marathi in this task domain. We compare the ASR performance of the SGMM based mono-lingual systems with respect to the CDHMM baselines (Mohan et al., 2012b). Our experimental study considers the impact of each of the stages of multi-lingual training on the final ASR system performance. Interesting anecdotal results are presented to show that the SGMM's state-level parameters are able to capture phonetically similar and meaningful information across the two languages. Further, recognition errors made by the final multi-lingual SGMM system on the Hindi test set that are attributed to a lack of adequate context-dependent states are analysed. To this effect, an experimental study that illustrates the impact of borrowing context-dependent states from the Marathi language is presented. The main contributions of this work include the development of the multi-lingual SGMM system for Hindi and Marathi, cross-corpus normalization for multi-lingual training, an analysis of the linguistic similarity between the two languages, and the cross-lingual borrowing of contexts from the Marathi (non-target, well-resourced) language.

This paper is organized as follows. Section 2 presents a detailed description of the agricultural commodities task domain. Next, Section 3 briefly describes the SGMM in the context of the experimental study. In Section 4 we describe our experimental setup for both the CDHMM and SGMM systems. The comparative performance of the baseline mono-lingual CDHMM systems and the mono-lingual SGMM systems is described in Section 5. In Section 6 we provide a description of the training and performance of the multi-lingual SGMM system for Hindi and Marathi, highlighting the effects of acoustic mismatch between the two languages. After describing an algorithm to obtain features normalized for cross-speaker and cross-corpus acoustic variation, we consider the impact of using these features for multi-lingual SGMM training, and summarize the performance of the multi-lingual Hindi and Marathi ASR system obtained with them. Further, an anecdotal experiment is presented in Section 6.4, where it is shown that errors arising due to poor context modelling in Hindi can be mitigated by borrowing contexts from the Marathi language. Next, an analysis of cross-lingual similarity between the languages, based on the cosine distance measure between the individual state-dependent parameters, is presented in Section 6.5. Finally, in Section 6.6, a method to borrow context-dependent states based on the cosine distance measure is discussed, and the effect of appropriately weighting these Marathi language states on Hindi language recognition performance is studied. We conclude this paper by summarizing our findings in Section 7.
2. Agricultural commodities task domain
This section provides a brief description of the data used in this experimental study. As mentioned in Section 1, we
use a subset of the data collected for the project titled “Speech-based Access for Agricultural Commodity Prices in Six Indian Languages” (Umesh, 2013), sponsored by the Government of India. The goal of this project is to implement and deploy a speech based system for farmers to access prices of agricultural commodities in various districts across India from inexpensive mobile phones. The dialog systems are being collectively developed by a consortium of Indian institutions, referred to in this paper as the Indian ASR Consortium. The data collection process involved consultants actually travelling to the regions of interest and soliciting the participation of local residents, who would call a central location from their mobile phones and respond by voice to prompts requesting the names of agricultural commodities and local districts.

One of the main issues associated with this speech data in each language is the fact that it was collected in environments that will be experienced in the actual use of the service. These include quiet indoor home environments, noisy indoor home environments with competing speech and broadcast audio, as well as outdoor environments with background vehicle and machine noise. In short, the environments encountered in the data collection effort for each individual language are varied and difficult to characterize precisely.

Table 1 shows the subset of the above-mentioned data set used in this experimental study, along with the number of hours of data available in each language for both training and testing. The data is narrow-band 8 kHz speech collected over mobile telephones. In Table 1, Hindi has just over one hour of training data while Marathi has about eight hours. The speech data in each language largely consists of names of agricultural commodities and local districts. For Marathi, the collected data was divided into training and test sets by the organization (IIT-Bombay, Mumbai, India) involved in the development of the agricultural commodities spoken dialogue system. For Hindi, the available data was divided equally by us into training and test sets. Table 2 gives the distribution of the number of speakers in the final training and test sets for each of the languages.

In each language, there was not enough data for an independent development set. The data collected by the Indian ASR Consortium only had an evaluation set to gauge the initial performance of the spoken dialogue systems, and we used this evaluation set as our test set. Therefore a few parameters in this study (for the CDHMM systems, though not for the SGMM systems) are tuned to optimize ASR performance on the given evaluation sets. Discussions in Sections 4.1, 4.2 and 5 detail how the choices of model and test parameters were arrived at for the various systems, and clarifications have been provided where necessary.

This is a small vocabulary task. Table 3 shows the composition of the words and phones for this task.
Table 1
Amounts of language-specific speech data used for the experimental study.

Data set   Language   Hrs. of data   No. of utterances   No. of speech frames   No. of words
Train      Marathi    7.94           8865                2721924                10688
Train      Hindi      1.22           1200                401968                 1407
Test       Marathi    2.5            2875                877574                 3476
Test       Hindi      1.24           1200                408280                 1428
Table 2
Number of speakers in the train and test sets.

Language   Train   Test
Marathi    18      11
Hindi      91      96
Table 3
Characteristics of the speech data in each language.

Language   Total words in lexicon   Phones   Filler phones
Marathi    551                      59       1
Hindi      119                      48       6

Column 2 of Table 3 lists the number of words in the lexicon for both languages. The lexicons in each language were prepared by the individual institution in charge, to capture as many regional and dialectal variations of the pronunciation of a word as was deemed necessary. The Indian organizations involved in the development of the spoken dialogue systems for Hindi and Marathi did not consider it necessary to include pronunciation variants while developing the lexicons. The mapping of the consonants and vowels from native Indian scripts to their ASCII equivalents is achieved using the ITRANS transliteration scheme (Chopde, 2006); Hindi, Marathi and Sanskrit use the Devanagari alphabet (Scharf and Hyman, 2009; Chopde, 2006; Central Hindi Directorate, 1977). Most of the words that appear in the vocabulary correspond to names of commodities in each of the languages, while the rest are district names, “yes” and “no”, and spoken digits. To allow for unpredictable background events, special symbols appear in the transcriptions to mark events such as a sudden noise or a telephone ringing. The count of unique words in Table 3 does not include these symbols. These symbols map to unique phonetic units in the lexicon, and dedicated acoustic models are associated with them. The phonetic units associated with these symbols are listed as “filler phones” in the last column of Table 3; their count includes the mandatory sil phone.

3. The subspace Gaussian mixture model

This section provides a brief description of the subspace Gaussian mixture model (SGMM), recently proposed by Povey et al. (2011); the description follows our implementation (Rose et al., 2011).

For an automatic speech recognition (ASR) system configured with J states, the observation density for a given D-dimensional feature vector x for a state j ∈ {1, ..., J} can be written as

p(\mathbf{x} \mid j) = \sum_{i=1}^{I} w_{ji} \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_{ji}, \boldsymbol{\Sigma}_i),    (1)

where I full-covariance Gaussians are shared between the J states. The state-dependent mean vector μ_ji for state j is a projection into the ith subspace defined by a linear subspace projection matrix M_i,

\boldsymbol{\mu}_{ji} = \mathbf{m}_i + \mathbf{M}_i \mathbf{v}_j.    (2)

In (2), v_j is the state projection vector for state j. The subspace projection matrix M_i is of dimension D × S, where S is the dimension of the state projection vector v_j. In this work, S = D. This choice of the subspace dimension was motivated primarily by the ease of system configuration with this setting in our SGMM implementation (Rose et al., 2011); this parameter setting was consistent across our SGMM systems in Hindi, Marathi, Bengali and Assamese (Mohan et al., 2012b). The vector m_i, with dimension D, denotes the UBM mean of the Gaussian mixture component i ∈ {1, ..., I}. It is used as an optional offset to the term M_i v_j in the expression for μ_ji in (2). The state-specific weights in (1) are obtained from the state projection vector v_j using a log-linear model,

w_{ji} = \frac{\exp(\mathbf{w}_i^T \mathbf{v}_j)}{\sum_{i'=1}^{I} \exp(\mathbf{w}_{i'}^T \mathbf{v}_j)}.    (3)
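As a concrete companion to Eqs. (1)-(3), the following sketch (Python/NumPy) scores one feature vector against one SGMM state. It is a minimal illustration under assumed array shapes, not the HTK-extension implementation used in this work; all names are ours.

```python
import numpy as np

# Assumed shapes: M (I, D, S) projection matrices, m (I, D) UBM mean offsets,
# Sigma (I, D, D) shared full covariances, w (I, S) weight-projection vectors,
# v_j (S,) state projection vector, x (D,) feature vector.
def sgmm_log_likelihood(x, v_j, M, m, Sigma, w):
    """log p(x | j) for a single state j, following Eqs. (1)-(3)."""
    I, D = M.shape[0], x.shape[0]
    a = w @ v_j                            # w_i^T v_j for all i
    log_w = a - np.logaddexp.reduce(a)     # Eq. (3): log-domain softmax
    mu = m + M @ v_j                       # Eq. (2): state-dependent means
    log_gauss = np.empty(I)
    for i in range(I):
        d = x - mu[i]
        _, logdet = np.linalg.slogdet(Sigma[i])
        maha = d @ np.linalg.solve(Sigma[i], d)
        log_gauss[i] = -0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    return np.logaddexp.reduce(log_w + log_gauss)   # Eq. (1)
```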
It is apparent that this acoustic modelling formalism has a large number of “shared” parameters and a small number of state-specific parameters. For multi-lingual acoustic modelling, the shared parameters, namely M_i, w_i and Σ_i, are trained by pooling data from multiple languages. Multi-lingual SGMM training involves maintaining separate phone sets for each language. This is made possible by adding a language-specific tag to each phone, and thereby to each clustered HMM state. Each state projection vector v_j, which is attached to a clustered state, is then trained only with data specific to its language. The parameterization for multi-lingual SGMM training is depicted in Fig. 1.
Fig. 1. Parameterization of the Multi-lingual SGMM.
In addition, to add more flexibility to the SGMM parametrization at the state level, the concept of substates is adopted, where the distribution of a state can be represented by more than one vector v_{jm}, with m the substate index. This “substate” distribution is again a mixture of Gaussians, and the state distribution is then a mixture of substate distributions, defined as follows:

p(\mathbf{x} \mid j) = \sum_{m=1}^{M_j} c_{jm} \sum_{i=1}^{I} w_{jmi} \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_{jmi}, \boldsymbol{\Sigma}_i),    (4)

where c_{jm} is the relative weight of substate m in state j, and the means and mixture weights are obtained from the substate projection vectors v_{jm}:

\boldsymbol{\mu}_{jmi} = \mathbf{m}_i + \mathbf{M}_i \mathbf{v}_{jm},    (5)

w_{jmi} = \frac{\exp(\mathbf{w}_i^T \mathbf{v}_{jm})}{\sum_{i'=1}^{I} \exp(\mathbf{w}_{i'}^T \mathbf{v}_{jm})}.    (6)
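Under the same assumptions as the sketch above, the substate mixture of Eq. (4) can be scored by reusing that single-substate function; this too is illustrative only.

```python
import numpy as np

def sgmm_substate_log_likelihood(x, v_jm, c_jm, M, m, Sigma, w):
    """log p(x | j) for a state with substates (Eq. 4): a c_jm-weighted
    mixture of per-substate SGMM densities."""
    terms = [np.log(c) + sgmm_log_likelihood(x, v, M, m, Sigma, w)
             for c, v in zip(c_jm, v_jm)]
    return np.logaddexp.reduce(np.array(terms))
```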
4. Experimental setup for the mono-lingual systems
This section gives a summary of the experimental setup for the mono-lingual systems. Section 4.1 describes the setup of the mono-lingual CDHMM system and Section 4.2 describes the setup of the mono-lingual SGMM systems.
4.1. Experimental setup for the mono-lingual CDHMM systems

This section gives an overview of the experimental setup and a description of the baseline CDHMM systems that we set up for this task. Baseline systems in each language were configured using the HTK Speech Recognition Toolkit (Young, 2006). Our CDHMM HTK ASR system setup in each language closely follows the setup of each regional institution of the Indian ASR Consortium; each site configured its systems using the SPHINX (Lee, 1989) speech recognition system.

The baseline system was based on conventional three-state left-to-right HMM triphone models. Clustered states
were obtained after decision tree clustering for the systems in each language. Due to the lack of a development set, we used a fixed decision-tree splitting threshold for the context-clustering procedure across the CDHMM systems that were built for Hindi, Marathi, Bengali and Assamese (Mohan et al., 2012b). The number of clustered states thus obtained in each language-specific CDHMM system is listed in Table 5. We used 16 Gaussians per state for Marathi and 8 Gaussians per state for Hindi. The choice of the number of Gaussians per state was made by observing the performance of the CDHMM systems on the test set; we did not experiment with a variable number of Gaussians in each state. The CDHMM systems for both languages used diagonal-covariance Gaussians. The features used are 13 MFCC coefficients concatenated with first and second difference cepstrum coefficients. The features were extracted with an analysis window of 25 ms, with a hop size of 10 ms between successive frames of speech. Cepstrum mean normalization was performed for each utterance.

Table 5
Number of clustered CDHMM states in each language.

Language   Clustered states
Marathi    476
Hindi      161

One issue with setting up ASR systems in new languages is the definition of subword phonetic contexts. When these contexts are defined using decision tree clustering, it is necessary to define a list of phonetic contexts that make up the question set for the decision tree. Question sets prepared by expert phoneticians for decision tree clustering in under-resourced languages like the ones mentioned in this experimental study are difficult to obtain. Another approach, popular in grapheme-based speech recognition (Killer et al., 2003), is the use of so-called singleton question sets, in which the questions used in the decision tree clustering procedure directly ask for the identity of the grapheme that might occur in the left or the right context, and not its linguistic class. Since Hindi and Marathi, like most Indian languages, are phonetic in nature, with nearly a one-to-one mapping between grapheme and phoneme, it is natural to consider grapheme based approaches to speech recognition. We did not take this approach while building our system. Instead, we relied on an automated procedure that clusters CDHMM state distributions of context-independent phones to obtain a set of
linguistic questions for clustering triphone contexts (Shrishrimal et al., 2012). The final linguistic question set obtained for each language was made compatible for use with HTK. We chose the latter method to design our question set primarily so that our baseline HTK systems would give performance equivalent to the SPHINX based systems being configured by the Indian ASR Consortium (Umesh, 2013). A comparison of ASR performance using singleton question sets, though certainly interesting, is not the focus of this experimental study, and we leave this issue to future work.

Development of language models for spoken dialogue systems in a new task domain is difficult, and language models are often bootstrapped (Weilhammer et al., 2011; Hakkani-Tur and Rahim, 2006). For this experimental study, language models were built with the best available resources at hand. The language models (LM) used for Marathi and Hindi were tri-gram language models. Training text consisted of transcriptions of the utterances used in acoustic model training. This is a closed vocabulary task. The training and test utterances consist of single or multiple words, with sentence start and end markers, describing the names of commodities or districts. The training text used for obtaining the N-gram LM covered all of the words in the vocabulary. The number of words used to train the language models, along with the perplexity on the test set, are included in Table 4. The word counts and the perplexity values in Table 4 are with sentence-start and sentence-end markers included; we would like to clarify that the word count for the training corpus in Table 1 refers to the number of words without these markers. The evaluation we provide for our Hindi system is on recognition of the commodities, since these span the majority of the words in the vocabulary. For Hindi, a separate language model is trained for recognizing either commodities or districts, as determined by the organization developing the Hindi spoken dialog system. For Marathi, on the other hand, a single language model is trained for recognizing both commodities and districts.

Table 4
Language model characteristics.

Language   No. of words in training text for LM training   Perplexity   No. of 3-grams
Marathi    28445                                            199.6        1624
Hindi      3807                                             49.8         361

4.2. Experimental setup for the mono-lingual SGMM systems

For the SGMM, we use our implementation, which is an extension to HTK with added libraries (Rose et al., 2011). The system is initialized with the clustered triphone states obtained from the CDHMM. The basic setup is identical to that used for the CDHMM systems. The training of the SGMM in this work follows the procedure described in Rose et al. (2011).

Each language-specific SGMM system is initialized with Gaussians from a language-specific universal background model (UBM). Each language-specific UBM is trained on speech-only segments from the training portion of the data set of each language. The limited quantity of speech data available in this task forced
us to experiment with empirical procedures to obtain the initial set of I_l Gaussians for each language. Here we use I_l to denote the total number of Gaussians in each language-specific system, l ∈ {Hindi, Marathi}. For example, with reference to Table 9, I_Hindi = 256. The number of Gaussians in each language-specific UBM was determined by looking at the distribution of the mixture-dependent frame occupancy counts; we chose the final number of I_l Gaussians to ensure an even distribution of speech frames across all mixtures. We trained each language-specific UBM with all I_l Gaussians in that UBM sharing a single grand full-covariance matrix. This allows for a robust estimation of the single full-covariance matrix, instead of estimating I_l individual full-covariance UBM matrices as is done traditionally (Povey et al., 2011; Rose et al., 2011). After obtaining the UBM Gaussians, each of the I_l individual SGMM Gaussians is initialized with the same full-covariance matrix that was shared between the I_l UBM Gaussians.

Training of the SGMM is then carried out using a joint posterior initialization (JPI) (Rose et al., 2011). In the JPI process, the initial posterior probability of a speech frame x_t at time t belonging to a mixture component i in state j, γ⁰_ji(t), is approximated as

\gamma^0_{ji}(t) = p^0(s_t = j, m_t = i \mid \mathbf{x}_t)    (7)

\approx p_h(s_t = j \mid \mathbf{x}_t) \, p_u(m_t = i \mid \mathbf{x}_t).    (8)
In Eq. (8), the posteriors p_h(s_t = j | x_t) are obtained from forward-backward decoding of the training utterances with respect to the prototype CDHMM. The posteriors p_u(m_t = i | x_t) are obtained from the training utterances using the initial Gaussian mixture models (i.e. the language-specific UBM in this case). For this task, where the vocabulary is very small, a JPI of the SGMM is critical for the final system to give good performance. Updates for the full-covariance matrices were disabled for the first 8 iterations, during which initial estimates for the other parameters of the SGMM are obtained through the JPI process. For the 10 subsequent training iterations, updates were allowed for all SGMM parameters. The performance of the SGMM system in each language, with the number of mixtures I (I_l in this section) used in each system, is presented in Table 9.
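A minimal sketch of the JPI approximation of Eq. (8), assuming the CDHMM state posteriors and UBM mixture posteriors have already been computed for an utterance; the array layout is our own assumption.

```python
import numpy as np

def joint_posterior_init(gamma_state, gamma_ubm):
    """Eq. (8): gamma0[t, j, i] ~= p_h(s_t = j | x_t) * p_u(m_t = i | x_t).
    gamma_state: (T, J) CDHMM state posteriors from forward-backward decoding.
    gamma_ubm:   (T, I) UBM mixture posteriors.
    Returns a (T, J, I) array of initial joint posteriors."""
    return gamma_state[:, :, None] * gamma_ubm[:, None, :]
```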
Tables 6 and 7 give the counts of the number of parameters in the CDHMM and the SGMM systems for the Hindi language.

Table 6
SGMM example system size, J = 161, D = 39.

Parameter name              Parameter count   Example count
Shared parameters
  M_i                       IDS               389376
  Σ_i                       ID(D+1)/2         199680
  w_i                       IS                9984
State-specific parameters
  v_j                       JS                6279
Total                                         605319

Table 7
CDHMM example system size, J = 161, D = 39.

Parameter name   Parameter count   Example count
μ_ji             DJ × #mixtures    50232
Σ_ji             DJ × #mixtures    50232
w_ji             J × #mixtures     1288
Total                              101752

In a small-vocabulary ASR system, the parameter count of the SGMM surpasses the number of parameters in the equivalent CDHMM system, with the shared parameters dominating the total. This is true for the Hindi mono-lingual SGMM system described here. Referring to Table 6, the Hindi SGMM system configured with I = 256 full-covariance Gaussians and J = 161 tied states has a total of 605319 parameters. A feature dimension of D = 39 was used with a subspace dimension S = 39.
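The example counts in Tables 6 and 7 follow directly from the formulas in their second columns; the following check reproduces the totals for the Hindi configuration.

```python
# Parameter counts for the Hindi example systems (Tables 6 and 7):
# I = 256 Gaussians, J = 161 tied states, D = S = 39, 8 mixtures per state.
I, J, D, S, n_mix = 256, 161, 39, 39, 8
sgmm_total = I*D*S + I*D*(D + 1)//2 + I*S + J*S   # M_i, Sigma_i, w_i, v_j
cdhmm_total = 2*D*J*n_mix + J*n_mix               # diagonal-cov means, variances, weights
print(sgmm_total, cdhmm_total)                    # 605319 101752
```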
The counterpart CDHMM system, configured with J = 161 tied states and 8 Gaussian mixtures per state, has a total of 101752 parameters. In large-vocabulary continuous speech recognition (LVCSR) systems, on the other hand, due to the presence of a much larger number of states J, CDHMM systems tend to have far more parameters than equivalent SGMM systems (Povey et al., 2011). The only way to increase the number of parameters in a CDHMM system is to increase the number of state-dependent Gaussian densities. In our experiments, when we increased the number of state-dependent mixtures in the CDHMM beyond 8 per state, the resulting ASR performance degraded significantly. This is to be expected, as the amount of data per state-dependent Gaussian mixture in the CDHMM decreases as the number of state-dependent mixtures is increased. In a single-substate-per-state SGMM system, on the other hand, the state-dependent parameter is only a D-dimensional vector. With limited amounts of data one can therefore reliably estimate the state-dependent parameters, while all the data in the corpus is used to train the shared parameters. The SGMM's parametrization, with its larger number of shared parameters (shared between the states, as depicted in Fig. 1), allows for a more expressive model with the same amount of data used to train a CDHMM system.
5. Comparative performance of the mono-lingual CDHMM and the mono-lingual SGMM systems

This section summarizes the comparative performance of the baseline mono-lingual CDHMM systems and the mono-lingual SGMM systems. We use Word Accuracy (WAc.) as the performance measure for recognition, and also provide the percentage correct scores (%Corr.) for comparison. The percentage correct scores perhaps carry more meaning in the context of this task, since for the spoken dialogue system it is important that the transcription of the words in the utterance be retrieved correctly by the ASR engine; the ASR result can then be post-processed to make it suitable for use with other components of the spoken dialogue system.

With reference to Table 8, the Marathi CDHMM system with a vocabulary of 551 words has a baseline performance of 74.2%. The lower performance for Hindi, at 65.8%, is perhaps due to a lack of data. The insertion penalty and language scale factors for the CDHMM decoder were tuned on the test set to give the best possible performance.

Table 8
Baseline CDHMM performance on the commodities task.

Language   %WAc (%Corr)
Marathi    74.2 (82.1)
Hindi      65.8 (68.7)

The results for the mono-lingual SGMM systems are listed in Table 9. For the Marathi SGMM system, a 3.5% absolute increase in WAc. is observed with respect to the baseline system. For Hindi, an increase of 3.4% WAc. is seen in the performance of the SGMM with respect to the baseline. The insertion penalty and language scale factors for the SGMM were not tuned on the test set; the same values for these parameters were maintained for the recognition experiments in all four languages: Hindi, Marathi, Assamese and Bengali (Mohan et al., 2012b). From these results we conclude that the SGMM provides better performance, possibly due to its ability to model phonetic variability effectively.

Table 9
SGMM performance on the commodities task.

Language   No. of mixtures I   %WAc (%Corr)
Marathi    400                 77.7 (84.7)
Hindi      256                 69.2 (71.0)

6. Hindi and Marathi multi-lingual SGMM system

This section summarizes the building of a multi-lingual SGMM system for the closely related language pair, Hindi and Marathi. Both Hindi and Marathi belong to the Indo-Aryan (Cardona, 2003) group of languages. Both languages share a large degree of similarity in terms of syntactic structure, the written Devanagari script (Central Hindi Directorate, 1977), and to some extent the vocabulary. The Devanagari script is phonetic in nature: there is more or less a one-to-one correspondence between what is written and what is spoken. As a result, to a large extent both languages are in fact “phonetically similar”. Our interest here is to understand whether this similarity between languages can be exploited to build a multi-lingual speech recognition system whose performance is perhaps
better than that of their mono-lingual counterparts. A further question we seek to answer is: having built such a system, is the “phonetic similarity” between Hindi and Marathi apparent in the acoustic realizations captured by the statistical models?

Multi-lingual SGMM systems in Burget et al. (2010) and Povey et al. (2011) have been built without explicitly accounting for possible mismatch of acoustic, background and environmental conditions between data sets from multiple languages. From our previous work (Mohan et al., 2012a), it has been found that, in the absence of cross-corpus normalization, these sources of mismatch between data sets are a significant issue in most practical scenarios for training systems from multiple data sets. Without accounting for this mismatch, the performance of the multi-lingual system is no better than that of its mono-lingual SGMM counterpart; after accounting for cross-corpus mismatch, the multi-lingual SGMM system is seen to perform better than the mono-lingual SGMM. The method for dealing with cross-corpus mismatch is a simple modification of speaker adaptive training (SAT) and is described in greater detail in Section 6.2.
6.1. Multi-lingual SGMM training
First, SGMM training was carried out in a multi-lingual fashion by using the Marathi and Hindi data to collectively train the “shared” parameters M_i, w_i and Σ_i, as was first done in Burget et al. (2010). Here i ∈ {1, ..., I} is the index over the Gaussian mixture components. As mentioned in Section 3, separate phone sets are maintained for each language. Maintaining separate phone sets allows the state-specific parameters, i.e. v_j, j ∈ {1, ..., J}, to be trained with data from each of the two languages individually. Our multi-lingual SGMM has a total of J = 637 clustered states, with 161 states coming from the Hindi and 476 from the Marathi CDHMM systems, respectively. The system was initialized with a UBM with I = 400 Gaussians trained on speech-only segments of speakers from both corpora, using a joint posterior initialization procedure similar to that of the mono-lingual SGMM systems trained for this task. No feature normalization was applied prior to multi-lingual training to account for cross-corpus mismatch. The results for the multi-lingual system trained in this manner are reported in Table 10.

It can clearly be seen from the results of Table 10 that the multi-lingual SGMM system for Hindi performs no better than the mono-lingual Hindi SGMM system. This is consistent with our earlier observations in Mohan et al. (2012a): building a multi-lingual system without accounting for the systematic acoustic differences between sources of multi-lingual data can hurt rather than help performance. Since the data used in this experimental study is “real-world”, with acoustic environments that are difficult to characterize, the impact of this mismatch
appears to be severe. In comparing this multi-lingual training result on the Hindi language test set, a decrease in WAc. of 8% absolute is seen with respect to the CDHMM baseline. Further, with respect to the mono-lingual Hindi SGMM system, a decrease in WAc. of 19% is seen. For the Marathi system, on the other hand, the performance is about the same as that of the Marathi mono-lingual SGMM system, with an improvement of about 1% seen for multi-lingual training.

Table 10
Multi-lingual SGMM performance without normalization.

Language   %WAc (%Corr)
Marathi    78.7 (85.2)
Hindi      50.5 (60.1)
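As a rough illustration of how the separate per-language phone sets described above can be maintained in practice, each phone, and hence each clustered state, can carry a language tag; the tag format below is our own assumption, not the consortium's convention.

```python
def tag_phones(phones, lang):
    """Make phone identities language-specific so that the state projection
    vectors v_j are trained only on data from their own language."""
    return [f"{p}#{lang}" for p in phones]

hindi_phones = tag_phones(["k", "kh", "aa"], "hi")    # ['k#hi', 'kh#hi', 'aa#hi']
marathi_phones = tag_phones(["k", "kh", "aa"], "mr")  # kept distinct from Hindi
```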
6.2. Cross-corpus feature normalization
This section presents an acoustic normalization procedure for dealing with this cross-corpus mismatch (Mohan et al., 2012a). The procedure is applied as part of SGMM training and corresponds to a straightforward variant of speaker adaptive training (SAT) (Gales, 2001b) and constrained maximum likelihood linear regression (CMLLR) (Gales, 1998). Referring to Fig. 2, the acoustic normalization procedure is performed in two major steps, described in Sections 6.2.1 and 6.2.2 below.
6.2.1. Compensating for speaker variation
Given our training feature sets $X = \{X_r^H, X_s^M\}$ and the speaker-independent (SI) CDHMM models $\Lambda = \{\Lambda^H, \Lambda^M\}$, we first estimate speaker-dependent CMLLR matrices $A_r^H$ and $A_s^M$, as is done in standard SAT, for both languages Hindi (H) and Marathi (M). Here we use the index r to denote a speaker in the Hindi training set and s to denote a speaker in the Marathi training set. Interpreting the CMLLR transformation as a feature-space transformation (Gales, 1998), we obtain a transformed feature set $\hat{X} = \{A_r^H X_r^H, A_s^M X_s^M\}$ by transforming our original feature set $X$ using the speaker-dependent transformations $A_r^H$ and $A_s^M$. Speaker-normalized mono-lingual CDHMM models $\hat{\Lambda} = \{\hat{\Lambda}^H, \hat{\Lambda}^M\}$ are trained using the new feature set $\hat{X}$.

Using the mono-lingual speaker-normalized models $\hat{\Lambda}$ and the test features $Y = \{Y_r^H, Y_s^M\}$, language-specific speaker-dependent CMLLR transforms are generated in an unsupervised fashion for each test speaker. Here r is used to denote a speaker in the Hindi test set, and s a speaker in the Marathi test set. The test features $Y = \{Y_r^H, Y_s^M\}$ are then transformed with the corresponding test-speaker CMLLR transforms to obtain a new set of test features $\hat{Y}$. Recognition results for this SAT procedure for both Marathi and Hindi are listed in row 2 (CDHMM+SAT) of Tables 11 and 12, respectively.

Fig. 2. Cross-speaker and cross-corpus acoustic environment normalization.

Table 11
ASR WAc for Marathi. I represents the number of SGMM mixtures and AN indicates acoustic normalization.

System                    I     WAc [%] (%Corr.)
CDHMM                     n/a   74.3 (82.2)
CDHMM+SAT                 n/a   85.3 (90.2)
Mono-lingual SGMM         400   77.7 (84.7)
Multi-lingual SGMM        400   78.7 (85.2)
Mono-lingual SGMM + AN    400   86.4 (92.0)
Multi-lingual SGMM + AN   400   87.5 (92.0)

Table 12
ASR WAc for Hindi. I represents the number of SGMM mixtures and AN indicates acoustic normalization.

System                    I     WAc [%] (%Corr.)
CDHMM                     n/a   65.8 (68.7)
CDHMM+SAT                 n/a   67.7 (71.0)
Mono-lingual SGMM         256   69.1 (71.0)
Multi-lingual SGMM        256   50.5 (60.1)
Mono-lingual SGMM + AN    256   72.4 (73.7)
Multi-lingual SGMM + AN   400   74.4 (76.2)
Since this first step normalizes only inter-speaker variation and not inter-corpus acoustic mismatch, we perform a modified version of the SAT procedure in a second step.
6.2.2. Compensating for inter-corpus acoustic mismatch
Using the models $\hat{\Lambda}$ and the feature sets $\hat{X}$, two language-specific CMLLR transformation matrices $B^H$ and $B^M$ are trained. The transforms are generated by a modified version of SAT, where we treat the data from each language as if it were data coming from two “speakers”, Hindi and Marathi. Two language-independent regression classes are used, namely “speech” and “silence”. The “speech” regression class is assigned all the non-silence Gaussian components coming from the CDHMM states of both languages. We believe that large amounts of data are required to characterize the acoustic environments, and therefore all the training data from each language is used to train the transformation matrices $B^H$ and $B^M$.

We use $B^H$ and $B^M$ to obtain an updated set of training features $\hat{\hat{X}}$ by transforming the features $\hat{X}$. These features are then used to obtain an updated CDHMM model $\hat{\hat{\Lambda}}$. A new set of test features $\hat{\hat{Y}}$ is obtained by transforming $\hat{Y}$ using the language-specific CMLLR transforms $B^H$ and $B^M$.

The new training features $\hat{\hat{X}}$ are then used for multi-lingual SGMM training. This involves training the UBM, generating the JPI (joint posterior initialization, cf. Section 4.2) estimates, and finally the SGMM training. The updated CDHMM model $\hat{\hat{\Lambda}}$ is used for generating the new JPI estimates before multi-lingual SGMM training.
Clearly there are many ways of compensating for variability across speakers and across corpora. The advantage of transforming features from both data sets as described in this section is that it serves to reduce mismatch across all stages of SGMM training.
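A minimal sketch of the two-stage feature normalization described above, assuming the CMLLR transforms have already been estimated and are applied to features as x -> A x + b; function and variable names are illustrative.

```python
import numpy as np

def apply_cmllr(X, A, b):
    """Apply an affine CMLLR feature-space transform to a (T, D) feature matrix."""
    return X @ A.T + b

def normalize_features(X_by_speaker, speaker_transforms, corpus_transform):
    """Step 1 (Section 6.2.1): per-speaker CMLLR, as in standard SAT.
    Step 2 (Section 6.2.2): a single per-corpus CMLLR, estimated by treating
    the whole corpus as one 'speaker'."""
    B, c = corpus_transform
    out = {}
    for spk, X in X_by_speaker.items():
        A, b = speaker_transforms[spk]
        X_hat = apply_cmllr(X, A, b)          # speaker-normalized features
        out[spk] = apply_cmllr(X_hat, B, c)   # corpus-normalized features
    return out
```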
6.3. Impact of speaker variability and the cross-corpus feature normalization procedure
In this section we provide some experimental results to assess the effectiveness of the cross-corpus normalization procedure presented in Section 6.2, and its impact on
multi-lingual training. The results for Marathi language ASR performance are summarized in Table 11, and those for Hindi in Table 12.

In this corpus, speaker variability appears to be a significant issue affecting speech recognition performance. Referring to the second row of Tables 11 and 12, it can be seen that a large gain in performance with respect to the CDHMM baseline for both Marathi and Hindi is obtained by using speaker adaptive training with the CDHMM models. Row 3 of Tables 11 and 12 lists the mono-lingual SGMM performance in both Marathi and Hindi. Row 4 of Tables 11 and 12 lists the results of multi-lingual training without any cross-speaker or cross-corpus normalization. This result, as mentioned in Section 6.1, highlights the issues of training a multi-lingual system without accounting for cross-corpus mismatch.

The label AN in the first column of Tables 11 and 12 denotes the use of the acoustic normalization procedure described in Section 6.2. In row 5 of both tables, results are shown for language-specific SGMM models trained with the normalized features, that is, after speaker normalization and cross-corpus acoustic normalization. A drastic improvement is seen with respect to the mono-lingual SGMM trained without the normalization procedure (row 3 of Tables 11 and 12). These results show the impact of this normalization procedure on the final ASR system performance in both languages. As noted in Section 2, a great deal of environmental variability exists in the speech data within a language. The improvement can therefore be attributed to the fact that the CMLLR transform $B^H$, trained from all the features in the Hindi training set, is able to compensate for the intra-corpus environmental variability, yielding the improvement seen for the mono-lingual+AN Hindi system.

Further, row 6 of Tables 11 and 12 shows the performance of the multi-lingual SGMM models trained with the normalized acoustic features. With respect to row 5 of Table 12, a 2% absolute improvement in the Hindi ASR system can be observed; this improvement can therefore be attributed to multi-lingual training. We applied the matched-pairs significance test described in Gillick and Cox (1989), and the improvement in performance for the multi-lingual system (row 6 of Table 12) versus the Hindi mono-lingual+AN SGMM (row 5 of Table 12) was statistically significant at the chosen confidence level of 99.5%. Section 6.4 analyzes the recognition errors made by the multi-lingual Hindi system.

6.4. Analysis of recognition errors made by the multi-lingual Hindi system

While the SGMM is able to provide additional gains by allowing a structured and shared parameterization between multiple languages, useful insights for SGMM-ASR system
construction can be obtained by analyzing the errors made on the Hindi utterances by the multi-lingual SGMM system. It was found that the recognizer consistently made errors of the kind that are listed below:
- TARBOOJ = tarbooj, meaning watermelon in Hindi, is mis-recognized as musk-melon, KARBUJA = kharbuujaa.
- BADSHAH = baadshah, a variety of potato, is mis-recognized as BAJRA = baajraa, or millet.
- ANANNAS = anannas, meaning pineapple, is mis-recognized as pomegranate, or ANAR = anaar.
- SUKHIMIRCH, appearing as two separate words with the composite lexical expansion sukhiimirc, meaning dried red-chilli peppers, is mis-recognized as SUPERIOR = supirior, a variety of cardamom.
Errors of the kind listed above seem reasonable, since the hypothesized word bears some likeness in pronunciation to the reference string. However, the repetitive nature of the errors for each of the examples listed here makes it clear that certain tri-phone contexts that are seen more often in training are favoured over more rarely occurring tri-phone contexts during recognition. For example, it appears that the context sil-kh+a is favoured over the context sil-t+a in the first example, and the context aa-r+sil is favoured over a-nn+a in the third example. Further, the number of context-dependent states in Hindi, J = 161, is only slightly higher than the number of context-independent states, K = 147, for this task.

One possible approach for reducing these Hindi acoustic confusions would be to borrow context-dependent state projection vectors from the better-trained Marathi states. These borrowed Marathi state projection vectors could then be combined with the more poorly trained Hindi state projection vectors. To do this, it is necessary to first determine which Marathi states should be associated with which Hindi states, and then how the two state projection vectors should be combined.

The issue of sharing state-level SGMM parameters for multi-lingual ASR has received limited attention in the literature, except for the work presented in Qian et al. (2011). An algorithm is presented in that work to “borrow” data from segments of speech data of the well-resourced language that are acoustically “similar” to that in the target language. The acoustic similarity was determined using an appropriate distance measure between the state-dependent parameters of the low-resource target-language SGMM and the non-target-language SGMM. This “borrowed” data was then used to update the state-dependent parameters of states that are then shared between the target and non-target languages. For the low-resource acoustic model training condition that was studied, a limited gain in ASR performance over the baseline SGMM was reported using this strategy.

Here an anecdotal “cheating” experiment is presented, where the states of the context-dependent HMM of the
Hindi language directly use state-dependent parameters of “similar” context-dependent HMMs of the Marathi language. The goal of this experiment is to determine the potential performance gains that might be achieved by combining state-level SGMM parameters across languages. The choice of which state-level parameters should be combined is made by exploiting the actual mis-recognitions observed on the test data; this is, of course, the cheating aspect of the experiment. The method for combining the state-level parameters is described below. The experiment provides a way to decouple the issue of which acoustic contexts to combine from the choice of the method used for combining them.

All the SGMM based systems described in Sections 6.1, 6.2 and 6.3 rely on single-substate-per-state models. In this experiment, the concept of the sub-state in the SGMM is used to allow Hindi context-dependent HMMs to use the state-dependent parameters from context-dependent HMMs of the Marathi language. This allows well-trained “similar” phonetic context-dependent HMMs in the well-resourced language to “augment” existing context-dependent phones in the target language. A technique for sharing Gaussians at the state level between context-dependent models was successfully applied in Saraclar and Khudanpur (2000) to model pronunciation variation in conversational speech. The experiment proposed here allows us to investigate the potential for sharing sub-state projection vectors at the state level between context-dependent models from a non-target language and context-dependent models in the target language. This corpus offers a unique opportunity of this kind because a large proportion of the tri-phone contexts appearing in the confusable Hindi words listed above also appear in the Marathi language. This degree of cross-language context sharing is not at all typical. However, it provides a chance to observe the potential effects of cross-language parameter sharing when the determination of similar contexts is not an issue.

After looking up the equivalent context-dependent models in Marathi, the state projection vector from each state of the Marathi model is “augmented” as a sub-state in the corresponding state of the Hindi model. The sub-state weights are set equally between the existing sub-state vector and the newly introduced sub-state vector from the Marathi model. The issue of estimating these weights is not considered here and is a topic of future research. The state-dependent likelihoods for these “augmented” states are computed as given by Eq. (4). The multi-lingual SGMM model thus obtained is referred to as the “augmented” model in Table 13. A similar procedure was carried out to modify the target-language silence model.

N-best lists were generated for all the utterances corresponding to the four mis-recognized examples, with a value of N = 2. This generated the possible hypotheses as being either the correct term or the closest incorrect term. The first row of Table 13 displays the percentage of these utterances that the multi-lingual SGMM model recognizes
Table 13
Recognition accuracy for frequently mis-recognized utterances.

ML-Hindi SGMM                 56%
"Augmented" ML-Hindi SGMM     76%
These N-best lists were then re-scored with the "augmented" models. The proportion of utterances that were correctly recognized after this re-scoring procedure appears in the second row of Table 13. It can be seen from Table 13 that the so-called "augmented" models provide a 20% absolute improvement over the multi-lingual SGMM Hindi models in correctly recognizing the originally mis-recognized test utterances. While this result must clearly be considered anecdotal, it suggests that there may be considerable potential for state-level cross-language parameter sharing in multi-lingual SGMMs.
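As a rough illustration of this augmentation, the sketch below computes a state-level likelihood as a weighted sum over sub-states, following the general form of the SGMM sub-state likelihood (cf. Eq. 4 and Povey et al., 2011). This is a minimal sketch, not the authors' implementation: all names are ours, and practical details such as log-domain accumulation and the precomputed Gaussian normalizers of real SGMM systems are omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sgmm_state_loglik(x, substate_vecs, substate_weights, M, w, Sigma):
    """Log-likelihood of frame x under one SGMM state.

    substate_vecs    : list of state projection vectors v_jm (each of dim S)
    substate_weights : sub-state weights c_jm, assumed to sum to one
    M                : (I, D, S) phonetic subspace matrices M_i
    w                : (I, S) weight projection vectors w_i
    Sigma            : (I, D, D) globally shared covariances Sigma_i
    """
    lik = 0.0
    for c_jm, v_jm in zip(substate_weights, substate_vecs):
        means = M @ v_jm                      # (I, D): mu_jmi = M_i v_jm
        logits = w @ v_jm
        w_jmi = np.exp(logits - logits.max())
        w_jmi /= w_jmi.sum()                  # softmax Gaussian weights
        for i in range(M.shape[0]):
            lik += c_jm * w_jmi[i] * multivariate_normal.pdf(x, means[i], Sigma[i])
    return np.log(lik)

# "Augmenting" a Hindi state with the projection vector of the equivalent
# Marathi state, with the two sub-states weighted equally as in the text:
#   loglik = sgmm_state_loglik(x, [v_hindi, v_marathi], [0.5, 0.5], M, w, Sigma)
```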
6.5. Analysis of cross-lingual phonetic similarity captured by the SGMM model parameters
As mentioned at the beginning of this section, Marathi and Hindi are "phonetically similar" languages. This is because both languages share the Devanagari script (Central Hindi Directorate, 1977), in which there is almost a one-to-one correspondence between what is written and what is spoken. This section demonstrates that this cross-lingual similarity is in fact reflected in the state-dependent parameters of the multi-lingual SGMM. Fig. 3 shows a heatmap that visualizes the similarities "discovered" between the phones of Hindi and the phones of Marathi. The plot was obtained as follows. We first extracted the normalized (see Appendix K of Povey (2009)) state projection vectors v_j from the centre state of the clustered cross-word tri-phone models of both Hindi and Marathi. Then, a similarity s(h, m) was calculated as the cosine similarity between state h from the Hindi language model and state m from the Marathi language model. In Fig. 3, the vertical axis bears the labels of the phones that correspond to the clustered states of the Hindi language, and the horizontal axis bears the labels of the phones that correspond to the clustered states of the Marathi language. It should be noted that, in this task domain, Hindi and Marathi do not have identical phone sets, even though both are derived from the same script. For readability, not all of the centre-context labels are displayed along the axes of Fig. 3. The brightness of a point on the heatmap is proportional to the "closeness" of the two corresponding states. The labels are ordered alphabetically by the ITRANS ASCII equivalent of the Devanagari alphabet.
[Fig. 3. Cross-lingual similarity heatmap visualizing the similarity between the phones of Hindi (vertical axis) and the phones of Marathi (horizontal axis).]
Since the ASCII equivalents of the phones in both languages correspond to the same Devanagari alphabet, some of the similarities are immediately apparent from the plot. For example, aa (ARPABET AA) in Marathi is seen to be similar to aa and a (ARPABET AH) in Hindi. Other common confusable pairs are also apparent: o (ARPABET AO) is seen to be confused with uu (ARPABET UW) in both Marathi and Hindi, and s (ARPABET S) in Hindi is seen to be confused with cc (ARPABET CH) in Marathi. The heatmap in Fig. 3 serves to show that the state projection vectors v_j are able to capture information that is phonetically and linguistically meaningful.
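The similarity computation behind Fig. 3 is simple enough to sketch directly. The following is a minimal illustration, assuming the normalized state projection vectors have already been extracted into one matrix per language (one row per clustered centre state, ordered alphabetically by ITRANS label); the names and plotting choices are ours, not the authors'.

```python
import numpy as np
import matplotlib.pyplot as plt

def cosine_similarity_matrix(V_hindi, V_marathi):
    """Pairwise cosine similarity s(h, m) between rows of two matrices
    of (already extracted) state projection vectors."""
    Vh = V_hindi / np.linalg.norm(V_hindi, axis=1, keepdims=True)
    Vm = V_marathi / np.linalg.norm(V_marathi, axis=1, keepdims=True)
    return Vh @ Vm.T

# Hypothetical usage, with V_hindi of shape (H, S) and V_marathi of (M, S):
#   S_hm = cosine_similarity_matrix(V_hindi, V_marathi)
#   plt.imshow(S_hm, aspect='auto')   # brighter = more similar
#   plt.ylabel('Hindi phones'); plt.xlabel('Marathi phones')
#   plt.colorbar(); plt.show()
```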
6.6. Cross-lingual context-dependent state borrowing
This section describes an experimental study of the effect of borrowing context-dependent states from the Marathi language to improve Hindi ASR performance. This involves two steps. The first step is to automatically select an appropriate Marathi-language context-dependent state for each Hindi-language state. The second step is to weight these context-dependent states appropriately, as sub-states, to yield the final state likelihood. The first step is the subject of Section 6.6.1: having observed cross-lingual similarities under a simple cosine similarity measure in Section 6.5, we use this measure to select context-dependent states from the Marathi language. To understand the effect of weighting the sub-states, the sub-state weights are varied globally over a range of values and the effect on the percentage of Hindi words correctly decoded is reported in Section 6.6.2.
6.6.1. Algorithm for selecting states from the non-target language

As a first step, for each target-language (Hindi) state, a shortlist of 10 similar states is selected using the cosine similarity measure. The evaluation of similarity is restricted to those non-target-language (Marathi) states that
appear in the same state position in a left-to-right HMM topology. As a second refinement step, a log-likelihood based re-ranking of this shortlist is carried out for each target-language state. The re-ranking procedure involves a forced-alignment pass over the entire set of Hindi training utterances. During alignment, whenever a given target-language state is encountered at a speech frame, the frame log-likelihood under each candidate non-target-language state is recorded and accumulated over all such occurrences. Finally, the average log-likelihood of each candidate non-target-language state is computed over all occurrences and the candidate list is sorted. The top-ranking non-target-language state is then selected as the non-target-language state of choice.
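Under the assumption that the frame log-likelihoods have already been collected during the forced-alignment pass, the two-stage selection for a single Hindi state might be sketched as follows; all identifiers are hypothetical.

```python
import numpy as np

def select_marathi_state(v_hindi, marathi_vecs, marathi_positions,
                         hindi_position, frame_logliks, top_k=10):
    """Two-stage selection of a Marathi state for one Hindi state.

    v_hindi           : projection vector of the target (Hindi) state
    marathi_vecs      : (M, S) unit-normalized Marathi projection vectors
    marathi_positions : (M,) HMM state position (e.g. 0, 1, 2) of each state
    frame_logliks     : dict: Marathi state index -> list of frame
                        log-likelihoods accumulated during alignment
    """
    # Stage 1: cosine shortlist, restricted to the same HMM state position.
    sims = marathi_vecs @ (v_hindi / np.linalg.norm(v_hindi))
    sims[marathi_positions != hindi_position] = -np.inf
    shortlist = np.argsort(sims)[::-1][:top_k]

    # Stage 2: re-rank the shortlist by average frame log-likelihood.
    avg_ll = {m: np.mean(frame_logliks.get(m, [-np.inf])) for m in shortlist}
    return max(avg_ll, key=avg_ll.get)
```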
6.6.2. Experimental analysis of language weighting

Fig. 4 summarizes the Hindi language ASR performance with cross-lingual state sharing. The performance curve is displayed as the static target-language sub-state weight is varied between 0 and 1. At one extreme, when the target-language sub-state weight is set to 0.0, only the "closest" Marathi-language states are used to evaluate ASR performance. This point is not shown in the figure but, as expected, the performance of 57.01% is well below the baseline multi-lingual SGMM performance on the Hindi test set. Also as expected, as the weighting towards the Hindi language states increases, the performance increases. The best-performing system, at 77.77%, is obtained when the Hindi language states are weighted at 0.8. At this point, an improvement of 1.57% absolute is seen with respect to the SGMM baseline of 76.2%. The matched-pairs significance test described in Gillick and Cox (1989) was run, and the improvement in performance with respect to the SGMM baseline was statistically significant at the chosen confidence level of 99.99%.
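A minimal sketch of such a matched-pairs test is given below. It follows the spirit of Gillick and Cox (1989) using a normal approximation over per-utterance error-count differences, rather than the segment-based formulation of the original paper; the function is ours, not the authors' test harness.

```python
import numpy as np
from scipy.stats import norm

def matched_pairs_test(errors_a, errors_b):
    """Matched-pairs significance test on per-utterance error counts
    of two systems evaluated on the same test utterances.
    Returns the z statistic and a two-tailed p-value."""
    d = np.asarray(errors_a, float) - np.asarray(errors_b, float)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return z, 2.0 * norm.sf(abs(z))

# A p-value below 1e-4 would correspond to the 99.99% confidence
# level quoted above.
```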
[Fig. 4. Variation in ASR performance (% words correct) as a function of the static sub-state weight on the Hindi language states, comparing cross-lingual state sharing against the baseline.]
7. Conclusions
An experimental study was presented that investigated acoustic modelling configurations for speech recognition in two Indian languages, Hindi and Marathi. The study was performed using data from a small-vocabulary agricultural commodities task domain that was collected for configuring spoken dialogue systems. Two acoustic modelling techniques for mono-lingual ASR were compared, namely the conventional CDHMM and the SGMM. The mono-lingual SGMM models were seen to outperform their CDHMM counterparts when acoustic training data are insufficient. For the Hindi and Marathi language pair, a multi-lingual SGMM training scenario was presented. It was concluded that cross-corpus mismatch is an important issue that needs to be addressed when building systems of this nature. Not accounting for cross-corpus mismatch was seen to decrease performance in the target language, Hindi, which has limited amounts of training data. After accounting for this mismatch, a gain in multi-lingual SGMM performance was observed. Further, interesting anecdotal results were obtained showing that the parameters of the multi-lingual SGMM system are able to capture phonetic characteristics in a structured and meaningful manner, especially across the "similar" languages used in this study. Finally, it was found beneficial to share "similar" context-dependent states from Marathi in order to improve Hindi speech recognition performance.
Acknowledgements
The authors would like to thank all of the members involved in the data collection effort and the development of
the dialogue system for the project “Speech-based Access for Agricultural Commodity Prices in Six Indian Languages” sponsored by the Government of India. We would also like to thank M.S. Research Scholar Raghavengra Bilgi at IIT Madras for his timely help with providing resources for this experimental study.
References
Bowonder, B., Gupta, V., Singh, A., 2003. Developing a rural market e-hub: the case study of e-Choupal experience of ITC. Planning Commission of India.
Burget, L., Schwarz, P., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., Glembek, O., Goel, N.K., Karafiát, M., Povey, D., Rastrow, A., Rose, R.C., Thomas, S., 2010. Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 4334–4337.
Cardona, G., 2003. In: Indo-Aryan Languages, vol. 2. Routledge Curzon.
Central Hindi Directorate, 1977. Devanagari: Development, Amplification, and Standardisation. Central Hindi Directorate, Ministry of Education and Social Welfare, Government of India.
Chopde, A., 2006. ITRANS: Indian language transliteration package.
Gales, M., 1998. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language 12 (2).
Gales, M., 2001a. Acoustic factorisation. Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, pp. 77–80.
Gales, M., 2001b. Multiple-cluster adaptive training schemes. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing.
Gillick, L., Cox, S., 1989. Some statistical issues in the comparison of speech recognition algorithms. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, pp. 532–535.
Hakkani-Tur, D., Rahim, M., 2006. Bootstrapping language models for spoken dialog systems from the world wide web. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1. IEEE, pp. I–I.
Killer, M., Stuker, S., Schultz, T., 2003. Grapheme based speech recognition. In: Eighth European Conference on Speech Communication and Technology.
Lee, K., 1989. Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic Publishers.
Lu, L., Ghoshal, A., Renals, S., 2011. Regularized subspace Gaussian mixture models for cross-lingual speech recognition. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, pp. 365–370.
Lu, L., Ghoshal, A., Renals, S., 2012. Maximum a posteriori adaptation of subspace Gaussian mixture models for cross-lingual speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 4877–4880.
Mantena, G., Rajendran, S., Rambabu, B., Gangashetty, S., Yegnanarayana, B., Prahallad, K., 2011. A speech-based conversation system for accessing agriculture commodity prices in Indian languages. Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA). IEEE, pp. 153–154.
Mohan, A., Ghalehjegh, S.H., Rose, R.C., 2012a. Dealing with acoustic mismatch for training multilingual subspace Gaussian mixture models for speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE.
Mohan, A., Umesh, S., Rose, R.C., 2012b. Subspace based acoustic modelling in Indian languages. IEEE Conference on Information Science, Signal Processing and their Applications, Montreal, Canada. IEEE.
Patel, N., Chittamuru, D., Jain, A., Dave, P., Parikh, T., 2010. Avaaj Otalo: a field study of an interactive voice forum for small farmers in rural India. Proceedings of the 28th International Conference on Human Factors in Computing Systems. ACM, pp. 733–742.
Plauché, M., Nallasamy, U., 2007. Speech interfaces for equitable access to information technology. Information Technologies and International Development 4 (1), 69–86.
Povey, D., 2009. A Tutorial-Style Introduction to Subspace Gaussian Mixture Models for Speech Recognition. Microsoft Research, Redmond, WA.
Povey, D., Burget, L., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., Glembek, O., Goel, N., Karafiát, M., Rastrow, A., et al., 2011. The subspace Gaussian mixture model: a structured model for speech recognition. Computer Speech & Language 25 (2), 404–439.
Qian, Y., Povey, D., Liu, J., 2011. State-level data borrowing for low-resource speech recognition based on subspace GMMs. In: Twelfth Annual Conference of the International Speech Communication Association.
Rose, R.C., Yin, S.-C., Tang, Y., 2011. An investigation of subspace modeling for phonetic and speaker variability in automatic speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE.
Saraclar, M., Khudanpur, S., 2000. Pronunciation ambiguity vs. pronunciation variability in speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3. IEEE, pp. 1679–1682.
Scharf, P., Hyman, M., 2009. Linguistic Issues in Encoding Sanskrit. Motilal Banarsidass, Delhi.
Schultz, T., Kirchhoff, K., 2006. Multilingual Speech Processing. Elsevier Academic Press.
Schultz, T., Waibel, A., 2001. Language independent and language adaptive acoustic modeling for speech recognition. Speech Communication 35, 31–51.
Seltzer, M., Acero, A., 2011. Separating speaker and environmental variability using factored transforms. In: Proceedings of Interspeech, pp. 1097–1100.
Shrishrimal, P., Deshmukh, R., Waghmare, V., 2012. Indian language speech database: a review. International Journal of Computer Applications 47 (5), 17–21.
Singh, R., Raj, B., Stern, R., 1999. Automatic clustering and generation of contextual questions for tied states in hidden Markov models. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1. IEEE, pp. 117–120.
Sulaiman, R., 2003. Innovations in agricultural extension in India. Sustainable Development Department, FAO: http://www.fao.org/sd/2003/KN0603aen.htm.
Umesh, S., et al., 2013. Speech-based access of agricultural commodity prices in six Indian languages. Special Issue on Processing Under-resourced Languages, Speech Communication (submitted for publication).
Vu, N.T., Kraus, F., Schultz, T., 2011. Cross-language bootstrapping based on completely unsupervised training using multilingual A-stabil. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, pp. 5000–5003.
Weilhammer, K., Stuttle, M., Young, S., 2006. Bootstrapping language models for dialogue systems. In: Ninth International Conference on Spoken Language Processing.
Young, S., et al., 2006. The HTK book (for HTK version 3.4).