Speech Communication 119 (2020) 12–23
Multilingual and multimode phone recognition system for Indian languages

Kumud Tripathi∗, M. Kiran Reddy, K. Sreenivasa Rao
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India


Keywords: Mode classification, Multilingual phone recognition, Vocal tract features, Source features, Feed forward neural networks

Abstract: The aim of this paper is to develop a flexible framework capable of automatically recognizing the phonetic units present in a speech utterance of any language spoken in any mode. In this study, we consider two modes of speech, conversation and read, in four Indian languages, namely, Telugu, Kannada, Odia, and Bengali. The proposed approach consists of two stages: (i) automatic speech mode classification (SMC) and (ii) automatic phoneme recognition using a mode-specific multilingual phone recognition system (MMPRS). The vocal tract and excitation source features are considered for classifying speech modes using feed forward neural networks (FFNNs). The vocal tract, excitation source, and tandem features are used in training deep neural network (DNN)-based multilingual phone recognition systems (MPRSs). The performance of the proposed approach is compared with baseline mode-dependent and mode-independent MPRSs. Experimental results show that the proposed approach, which combines the SMC system and the MMPRSs into a single system, outperforms the baseline phone recognition systems.

1. Introduction

The objective of a phone recognition system (PRS) is to convert a speech signal into a sequence of phones. A PRS can be developed using data from a single language (Manjunath and Rao, 2015; Pradeep and Rao, 2016) or from multiple languages (Scanzio et al., 2008; Burget et al., 2010; Manjunath et al., 2019). A PRS trained with data from more than one language is known as a multilingual PRS (MPRS). In the literature, there are very few studies on multilingual phone recognition. Schultz et al. proposed multilingual acoustic models for the recognition of read-mode speech (Scanzio et al., 2008; Burget et al., 2010), where large vocabulary speech recognition systems were investigated for 15 languages. In Kumar et al. (2005), a unified approach for the development of a hidden Markov model (HMM) based multilingual speech recognizer is proposed. The study considered two acoustically similar languages, Tamil and Hindi, along with an acoustically very different language, American English. Here, the Bhattacharyya distance measure is used to group the acoustically similar phones across the considered languages. In Gangashetty et al. (2005), a syllable-based multilingual speech recognizer is developed using three Indian languages: Telugu, Tamil, and Hindi. Here, vowel onset points are used as anchor points to derive the syllable-like units, and similar consonant-vowel units across the considered languages are merged to train a multilingual speech recognizer. Mohan et al. (2014) developed a small-vocabulary multilingual speech recognizer using two linguistically similar Indian languages: Marathi




and Hindi. Here, multilingual speech data collected over mobile telephones is used for training a subspace Gaussian mixture model, and speaker variations are handled by employing a cross-corpus acoustic normalization technique. In Manjunath et al. (2019), a deep neural network (DNN)-based MPRS is developed using MFCC and tandem features for read speech. The study considered speech utterances from four Indian languages, namely, Telugu, Kannada, Odia, and Bengali. Here, the acoustically similar units from the various languages are grouped using the international phonetic alphabet (IPA) transcription for training the MPRS. In general, speech can be broadly classified into two modes, namely, read and conversation (Batliner et al., 1995; Blaauw, 1991; Dellwo et al., 2015). Read mode is a formal mode of speech, where a person speaks in a constrained environment, for example, news broadcasts on television. On the other hand, conversation speech is spontaneous, informal, unstructured, and unorganized. Generally, an MPRS is trained and tested using data from the same speech mode (read or conversation); such a system is viewed as a mode-dependent or mode-specific MPRS (MMPRS). The performance of an MMPRS is affected when the test utterance belongs to a different mode of speech, mainly due to the mismatch in the acoustic characteristics of speech signals across modes. In several realistic scenarios, both read and conversation modes occur together. For example, in an Indian TV news channel such as Zee News, read speech is produced by the newsreader, while conversation speech is produced by a group of people during formal or informal discussions on a specific topic. Hence, mode-specific MPRSs are not suitable for accurate phone recognition in multi-modal environments.

Corresponding author. E-mail address: [email protected] (K. Tripathi). Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India

https://doi.org/10.1016/j.specom.2020.02.006 Received 19 July 2019; Received in revised form 5 January 2020; Accepted 25 February 2020 Available online 26 February 2020 0167-6393/© 2020 Elsevier B.V. All rights reserved.


Therefore, there is a need to develop a multilingual phone recognition system which can accurately recognize the phonetic units irrespective of the mode of the input signal. Further, it is observed on the popular video-sharing website YouTube that speech transcriptions are auto-generated for English news channels, whereas auto-generated transcriptions do not exist for news videos in Indian languages. In this study, we aim at developing a multilingual phone recognition framework for Indian languages, which accepts speech utterances from multiple modes at its input and generates a sequence of phonetic units at its output. The proposed approach combines the speech mode classification (SMC) system and the MMPRSs into a single framework. Here, the SMC system acts as a front-end that recognizes the mode of the input speech utterance, which is then given as input to the corresponding MMPRS to decode the phonetic units.

In the literature, some studies have analyzed the importance of various features for different speaking styles. Barbosa et al. (2013) performed experiments on a Brazilian-Portuguese speech corpus and observed that duration, the standard deviation of F0, and spectral emphasis values are useful for discriminating three speech styles: informal interview (conversation), phrase reading, and word-list reading. In a similar way, Eriksson and Heldner (2015) analyzed Brazilian-Portuguese, English (U.K.), Estonian, German, French, Italian, and Swedish speech samples and observed that F0 level, F0 variation, duration, and spectral emphasis are significant for differentiating speech styles such as phrase reading and spontaneous speech. In Arantes et al. (2017), the authors have shown the impact of language, speaking style, and speaker on long-term fundamental frequency computation. From these studies, it is evident that the excitation source features contain discriminative information related to speaking style. Therefore, in addition to vocal tract features, we have also explored the source features for speech mode discrimination. While Mel-frequency cepstral coefficients (MFCCs) are used to represent the vocal tract information, the supra-segmental level epoch strength contour (ESC) and pitch contour (PC) represent the excitation source information. Further, the complementary nature of the source features is investigated by combining the scores of the vocal tract-based SMC system and the source feature-based SMC system. In this study, the SMC systems are developed using feed forward neural networks (FFNNs). After identifying the mode, the speech utterance is given as input to a mode-specific DNN-based MPRS to decode the phonetic units. The MPRSs are trained using a combination of MFCCs, tandem features, and excitation source features extracted from a multilingual speech database. The significance of the proposed framework is demonstrated by comparing its performance with mode-dependent and mode-independent MPRSs.

The rest of this paper is organized as follows. Section 2 provides the details of the speech corpora. Feature extraction techniques are described in Section 3. The significance of excitation source features for speech mode classification is discussed in Section 4. The development of the baseline MPRSs is elaborated in Section 5. The proposed framework and the details of speech mode classification are discussed in Section 6. The evaluation of the proposed SMC and MPRS is carried out in Sections 7 and 8, respectively. Finally, the conclusions and future scope of this paper are presented in Section 9.

2. Speech corpora

In this work, speech corpora of four Indian languages, Telugu, Kannada, Odia, and Bengali, are considered for developing both the speech mode classification models and the MPRSs. The speech corpora were collected as part of the consortium project titled "Prosodically guided phonetic engine for searching speech databases in Indian languages", supported by DIT, Govt. of India. The corpora contain speech data in read and conversation modes. In this study, read speech is collected from news reading, and conversational speech is collected from news interviews. Complete details of the speech corpora are given in Kumar et al. (2013) and Shridhara et al. (2013). The speech signals are sampled at a rate of 16 kHz with 16 bits per sample. The duration of each wave file varies between 6 and 8 s. For all the speech signals, a phonetically and prosodically rich transcription is derived using the International Phonetic Alphabet (IPA) chart; an IPA transcription provides one symbol for every unique sound unit, independent of the language. As a pre-processing step for SMC, we have removed silence regions from the speech signals, and each speech file is chopped to a fixed duration of 5 s. Table 1 details the number of male and female speakers used for training and testing. The first column contains the names of the languages, and the corresponding mode is listed in the second column. The next two columns show the number of male and female speakers considered for training the models, and the last two columns show the number of male and female speakers used for testing the models. Altogether, there are 12 distinct speakers for each mode of a language for training and 6 distinct speakers for each mode of a language for testing. 150 sentences per speaker are used for training and 50 sentences per speaker are used for testing. Note that there is no overlap between the speakers used for training and testing.

Table 1
The number of male and female speakers (for training and testing) for the read and conversation modes of the Bengali, Odia, Telugu, and Kannada languages. (Abbreviations - SM: speech mode, M: male, F: female.)

Language      SM    Training M   Training F   Testing M   Testing F
Bengali (B)   Conv  6            6            3           3
Bengali (B)   Read  6            6            4           2
Odia (O)      Conv  5            7            3           3
Odia (O)      Read  6            6            3           3
Telugu (T)    Conv  6            6            2           4
Telugu (T)    Read  6            6            3           3
Kannada (K)   Conv  6            6            3           3
Kannada (K)   Read  5            7            3           3

3. Feature extraction

This section describes the extraction of various features for developing the speech mode classifiers and the multilingual phone recognition systems. The vocal tract and excitation source features considered for developing the speech mode classification model are discussed in Section 3.1. The vocal tract, excitation source, and tandem features considered for training the multilingual PRS are explained in Section 3.2.

3.1. Features for speech mode classification

In this work, the SMC models are developed using vocal tract and excitation source features. The 13-dimensional MFCCs along with their first (Δ) and second (ΔΔ) order derivatives are considered as vocal tract features for frame-level classification. The 39-dimensional MFCC feature vectors are extracted at the frame level using a frame size of 25 ms and a frame shift of 10 ms with a Hamming window. The excitation source features describe the various time-varying characteristics of the vocal folds while producing the voiced segments of speech. In this work, the speech signal is divided into a sequence of 100 ms segments to capture the mode-specific excitation source information. The pitch contour (PC) and epoch strength contour (ESC) represent the excitation source information at the supra-segmental level. Pitch represents the vibration frequency of the vocal folds during the production of voiced speech. Speakers utilize pitch to signal the salience of words; for example, a higher pitch implies that a word is more important than the other words in an utterance. In this work, the pitch and epoch strength contours are calculated using the zero frequency filtering (ZFF) approach (Murty and Yegnanarayana, 2008). The ZFF method estimates the pitch and epoch strength by finding the epoch locations in the speech signal.
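Both feature streams described in this subsection can be extracted with standard tools. As one concrete illustration of the 39-dimensional MFCC+Δ+ΔΔ vectors, the following sketch uses librosa with the 25 ms frame / 10 ms shift configuration described above; the library choice and function name are ours, not prescribed by the paper.

```python
import librosa
import numpy as np

def vocal_tract_features(wav_path, sr=16000):
    """39-dim MFCC + delta + delta-delta per frame (25 ms window, 10 ms shift)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame shift
        window="hamming",
    )
    delta = librosa.feature.delta(mfcc)            # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivatives
    return np.vstack([mfcc, delta, delta2]).T      # shape: (num_frames, 39)
```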


Fig. 1. (a) and (b) show the pitch contours of Bengali utterances in read and conversation modes, respectively; (c) and (d) show the pitch contours of Odia utterances in read and conversation modes, respectively. F1, F2, F3, and F4 are distinct female speakers, and M1, M2, M3, and M4 are distinct male speakers.

The time instants corresponding to the positive zero-crossings of the ZFF signal are considered as epochs, or instants of glottal closure. The interval between two successive epochs represents the pitch period (t_0), and its reciprocal gives the pitch, p_0 = 1/t_0 (Yegnanarayana and Murty, 2009). The epoch strength is calculated for each epoch from the ZFF signal as the slope of the ZFF signal at the positive zero-crossing; the slope is estimated as the difference between successive samples of the ZFF signal around the epoch location (Murty and Yegnanarayana, 2008). The excitation source features are extracted at the frame level but processed at the utterance level because, from informal studies, we observed that the information provided by these features at the frame level is not very significant in discriminating the modes. However, on examining these features at the sentence level, a clear discrimination is observed across the modes. The discriminative capability of pitch contours extracted at the sentence level can be seen from Fig. 1. In the case of read speech, the pitch contours follow a smooth and steady trend, with few fluctuations occurring at some random instants of time (see Figs. 1(a) and (c)). In the case of conversation speech, the contours are relatively more dynamic than those obtained for read speech (see Figs. 1(b) and (d)). This indicates that the source features processed at the sentence level carry more discriminative mode information, which is suitable for developing an efficient SMC model.
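As an illustration of how the supra-segmental source contours can be derived once the ZFF signal is available, the sketch below computes a pitch contour and an epoch strength contour from the positive zero-crossings. It is a simplified approximation under our own assumptions (the ZFF signal is taken as given; the function and variable names are ours), not the authors' exact implementation.

```python
import numpy as np

def source_contours(zff_signal, sr=16000):
    """Pitch and epoch-strength contours from a zero-frequency-filtered signal."""
    # Epochs: positive zero-crossings of the ZFF signal.
    sign = np.sign(zff_signal)
    epochs = np.where((sign[:-1] <= 0) & (sign[1:] > 0))[0]

    # Pitch: reciprocal of the interval between successive epochs (p0 = 1/t0).
    t0 = np.diff(epochs) / sr          # pitch periods in seconds
    pitch_contour = 1.0 / t0           # Hz, one value per epoch interval

    # Epoch strength: slope of the ZFF signal at each positive zero-crossing,
    # approximated by the difference of neighbouring samples.
    strength_contour = zff_signal[epochs + 1] - zff_signal[epochs]
    return pitch_contour, strength_contour
```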

3.2. Features for multilingual phone recognition system

In this work, the proposed multilingual phone recognition system is trained using vocal tract, excitation source, and tandem features. To capture the vocal tract information for the MPRS, 13-dimensional MFCCs along with their Δ and ΔΔ coefficients are extracted with a frame size of 25 ms and a frame shift of 10 ms. The excitation source features are extracted at the frame level, similar to the vocal tract features. Here, Mel power differences of spectrum in sub-bands (MPDSS) and residual Mel-frequency cepstral coefficients (RMFCCs) are used as source features (Nandi et al., 2017); for details about the extraction of these features, please refer to Nandi et al. (2017). In this work, 13-dimensional RMFCCs and 25-dimensional MPDSS are extracted for every frame, and the Δ and ΔΔ coefficients are appended to the RMFCCs. In the PRS, the source features are processed at the frame level, unlike the utterance-level processing in SMC: each utterance contains different sound units (phones), so the features must be processed at the frame level to recognize these phones, whereas speech modes are associated with utterances rather than frames, so in SMC the source features are processed at the utterance level. Tandem features are the phone posteriors derived from a classifier built on spectral features; they are also referred to as discriminative features since they are produced by a discriminative classification model. In this work, we have followed the same procedure as in Manjunath et al. (2019) for extracting the tandem features, where the MFCCs are used to train a DNN-based discriminative classifier that provides the phone posteriors. The dimension of a tandem feature vector is 44, which is the total number of unique phones present in the considered speech corpora.

4. Significance of excitation source features for speech mode classification

The importance of the pitch and epoch strength contours for the speech mode classification task is shown by their respective correlation coefficients (CCs) (Benesty et al., 2009) for within and between modes across the languages, in Table 2. Correlation coefficients quantify the strength of the relationship between speech signals. Assume R and Q are two finite-duration signals. The CC between R and Q is calculated as

C(R, Q) = \frac{1}{L} \sum_{i=1}^{L} \left( \frac{R_i - \mu_R}{\sigma_R} \right) \left( \frac{Q_i - \mu_Q}{\sigma_Q} \right)    (1)

where \mu_R and \sigma_R are the mean and standard deviation of R, \mu_Q and \sigma_Q are the mean and standard deviation of Q, and L is the number of samples in each signal. The value of the correlation coefficient is maximum when the signals are similar and minimum when the signals are orthogonal; if the signals have no relation, the coefficient is 0. For each language, data of four speakers (two speakers per mode) is considered for computing the correlation coefficients. Note that the speakers are distinct for each language and each speaker has uttered 50 distinct utterances. To normalize the speaker variability across languages, mean subtraction is applied to the feature vectors across all languages. The correlation coefficients among the modes of monolingual speech signals are depicted using IDs 1-4, whereas those among the modes of multilingual speech signals are shown with ID 5. The values in the fourth and fifth columns of Table 2 denote the CCs within the modes (WM) and between the modes (BM) using the pitch contour; similarly, the values in the sixth and seventh columns denote the CCs within and between the modes using the epoch strength contour. For computing the average CC within a mode of a language, 50 distinct utterances spoken by speaker 1 of that mode are correlated with 50 distinct utterances spoken by speaker 2 of the same mode, whereas the average CC between the modes is computed using 50 distinct utterances from each mode of a language. In ID 1 of Table 2, the first value in the fourth and sixth columns presents the average CC within the conversation (abbreviated as conv) mode of Bengali using PC and ESC, respectively, while the first value in the fifth and seventh columns represents the average CC of the conversation mode with respect to the read mode of Bengali using PC and ESC, respectively. A low average CC value indicates high dissimilarity and a high average CC value indicates high similarity between pitch or epoch strength contours. From the table, it can be seen that the average CC values within a mode of any language are much higher than the CC values with respect to the other mode, which shows that the pitch and epoch strength contours have significant mode discrimination capability. In Table 2, ID 5 (K-T-B-O) shows the correlation coefficients within and between modes across the languages. In ID 5, the last value in the fourth and sixth columns represents the average CC within the read mode of multilingual speech using PC and ESC, respectively, while the last value in the fifth and seventh columns shows the average CC of the read mode with respect to the conversation mode of multilingual speech using PC and ESC, respectively. The average CC value of the read mode with respect to the conversation mode (0.12) is low compared to the CC value within the same mode (0.35) using PC, and a similar trend is observed for the conversation mode. Hence, both the PCs and ESCs are highly similar within modes and highly dissimilar between modes, across the languages. This analysis confirms that the pitch and epoch strength contours contain mode discriminating information.
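Numerically, Eq. (1) reduces to the mean product of the z-scored contours. The helper below is our own illustration of that computation (population statistics, as implied by the 1/L normalization in Eq. (1)); it assumes the two contours have equal length.

```python
import numpy as np

def correlation_coefficient(r, q):
    """Correlation coefficient of Eq. (1) between two equal-length contours."""
    r, q = np.asarray(r, dtype=float), np.asarray(q, dtype=float)
    assert r.shape == q.shape, "contours must have the same length L"
    zr = (r - r.mean()) / r.std()   # z-score of R (population std)
    zq = (q - q.mean()) / q.std()   # z-score of Q
    return float(np.mean(zr * zq))  # average of the products over the L samples

# Example: similar contours give a value near 1, unrelated ones a value near 0.
# c_within = correlation_coefficient(pc_speaker1_read, pc_speaker2_read)
```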


Table 2
Correlation coefficients across the speech modes (SM) of multiple languages using the pitch contour (PC) and epoch strength contour (ESC). (WM: within mode, BM: between modes.)

ID  Language     SM    PC WM   PC BM   ESC WM   ESC BM
1   Bengali (B)  Conv  0.37    0.14    0.63     0.06
                 Read  0.53    0.14    0.67     0.06
2   Odia (O)     Conv  0.42    0.15    0.64     0.02
                 Read  0.56    0.15    0.70     0.02
3   Telugu (T)   Conv  0.32    0.12    0.52     0.04
                 Read  0.44    0.12    0.68     0.04
4   Kannada (K)  Conv  0.21    0.13    0.47     0.07
                 Read  0.26    0.13    0.61     0.07
5   K-T-B-O      Conv  0.21    0.12    0.59     0.03
                 Read  0.35    0.12    0.67     0.03

5. Development of baseline MPRSs

A PRS trained with data from multiple languages is called a multilingual PRS (MPRS). The open-source Kaldi speech recognition toolkit (Povey et al., 2011) is used for building the speech recognizers. The important components in building a phone recognition system with the Kaldi toolkit are feature extraction, acoustic and language modeling, and decoding of the phone sequence. DNNs are trained as in Zhang et al. (2014) for acoustic modeling. For language modeling (LM), the prior probability of a phone sequence (also known as the LM score) is estimated by learning the relation between phones from the training data. The language model provides prior information about the occurrences of phone sequences, and hence including an LM improves the performance of the speech task (Yu and Deng, 2015). In this work, the effect of the LM on the performance of the MPRS is also analyzed. The speech corpus described in Section 2 is considered for training the MPRSs. The objective of an MPRS is to find the sequence of phones \hat{H} whose likelihood given a sequence of feature vectors A is maximum. Eq. (2) shows the mathematical formulation for estimating the decoded phone sequence:

\hat{H} = \arg\max_{H} P(H|A)    (2)

The term P(H|A) can be interpreted by applying Bayes' rule as

P(H|A) = \frac{P(A|H) P(H)}{P(A)}    (3)

Acoustic modeling is used for calculating the likelihood P(A|H), whereas the prior probability P(H) of the phone sequence is computed using the language model. P(A) is the prior probability of a feature vector; it can safely be ignored in Eq. (3) because it is independent of both the acoustic and language models. In the current work, a bi-gram language model (Brown et al., 1992) is used for determining P(H). The final decoded sequence is computed as

\hat{H} = \log P(A|H) + \alpha \log P(H)    (4)

where α is the LM scaling factor, which balances the acoustic and language model scores. The performance of the MPRSs is reported in terms of the phone error rate (PER). The PER (E) is computed as

E = \frac{S + D + I}{N}    (5)

where N is the total number of phones in the reference transcription, and S, D, and I are the numbers of substitution, deletion, and insertion errors in the decoded output, respectively.

The MPRS developed for Indian languages in Manjunath et al. (2019) is considered as our baseline. There, the authors used MFCC and tandem features extracted from read speech utterances for training the speech recognizer; hence, this system is referred to as the read-baseline MPRS in this work. Besides the read-baseline MPRS, we have developed two more systems, named the conv-baseline and RC-baseline MPRSs. While the conv-baseline MPRS is trained using conversational speech utterances, the RC-baseline MPRS is trained with speech data from both the read and conversation modes. Except for the training data, the training and feature extraction steps are the same in all the aforementioned systems. Each baseline MPRS is trained using data of the respective modes from the four Indian languages: Telugu, Kannada, Odia, and Bengali. After training, each MPRS is tested with speech utterances from all the considered modes and languages, as shown in Fig. 2. To simplify the figure, all three baseline MPRSs are shown in a single block; however, they are tested separately. The output of each MPRS is a sequence of phonemes.

Fig. 2. Illustration of testing the baseline MPRSs using the read and conversation speech signals of multiple languages.

Table 3
Phone error rate (%) of the read-baseline, conv-baseline, and RC-baseline MPRSs for read and conversation modes of speech.

Model                 Read    Conversation
Read-Baseline MPRS    35.55   66.50
Conv-Baseline MPRS    69.70   35.77
RC-Baseline MPRS      49.65   51.61

Table 3 shows the phone error rates (PERs) of the read-baseline, conv-baseline, and RC-baseline MPRSs. From Table 3, we can see that the read-baseline MPRS gives a lower PER when tested with read mode (35.55%) than with conversation mode (66.50%). Similarly, the conv-baseline MPRS gives a lower PER when tested with conversation mode (35.77%) than with read mode (69.70%). This indicates that the performance of mode-specific MPRSs degrades when there is a mismatch between the modes of the training and test utterances. Therefore, it is essential to develop MPRSs that are robust to the mode of the input speech. The straightforward way to develop a mode-independent MPRS is to use data from both modes during training, as in the RC-baseline MPRS. From Table 3, it can be seen that the RC-baseline system performs equally well for both modes. However, its PERs increase significantly, by about 14% and 16% for the read and conversation modes, respectively, compared with the optimal performances of the read-baseline and conv-baseline systems. Alternatively, in this work, we propose an approach that combines a speech mode classification (SMC) system with mode-specific MPRSs to handle speech utterances from various modes and further improve the performance. The details of the proposed approach are given in the following sections.
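Before moving to the proposed framework, note that the PER of Eq. (5) is the Levenshtein (edit) distance between the decoded and reference phone sequences, normalized by the reference length. The helper below is our own dynamic-programming sketch of that computation; Kaldi's scoring tools compute an equivalent edit-distance-based error rate.

```python
def phone_error_rate(reference, decoded):
    """PER = (substitutions + deletions + insertions) / len(reference)."""
    n, m = len(reference), len(decoded)
    # dp[i][j] = minimum edit cost between reference[:i] and decoded[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                           # i deletions
    for j in range(m + 1):
        dp[0][j] = j                           # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != decoded[j - 1])
            dp[i][j] = min(sub,                # substitution or match
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[n][m] / n

# Example: per = phone_error_rate("k a m a l".split(), "k a m l a".split())
```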

6. Proposed framework

The block diagram of the proposed two-stage phone recognition system is shown in Fig. 3. The first stage performs speech mode classification, and the second stage performs phone recognition. The purpose of the speech mode classifier is to automatically detect the mode of the input test speech utterance, i.e., read mode or conversation mode. Once the mode is identified, the speech utterance is processed through the corresponding mode-specific MPRS to derive the phoneme sequence. Thus, the proposed approach is capable of deriving phonetic transcriptions of speech signals belonging to different modes and different languages. In the rest of this paper, for simplicity, we use the term COMB-MPRS to refer to the proposed framework.

Fig. 3. Block diagram of the proposed framework of the 2-stage MPRS.

6.1. Development of multilingual speech mode classification system

The objective of the multilingual speech mode classification (SMC) system is to identify the mode (read or conversation) of speech spoken in any language. In this study, the SMC model is developed using the vocal tract and excitation source features discussed in Section 3. The proposed approach consists of three stages, as shown in Fig. 4. Here, a feed forward neural network (FFNN) with a single hidden layer is used as the classifier. The main reason for choosing the FFNN for building the SMC models is that it has been shown to perform well for various speech processing tasks (Lopez-Moreno et al., 2016; Momo et al., 2017). To exploit the complementary evidence provided by the vocal tract and excitation source components, score-level fusion is explored at different stages of the proposed framework; Fig. 4 shows the score-level fusion at the 2nd and 3rd stages of the proposed SMC system. The following sub-sections provide the details of the FFNN structure, the classification at different stages, and the score-level fusion.

Fig. 4. Block diagram of the multilingual SMC system using vocal tract and excitation source features.

6.1.1. Feed forward neural network

In this work, an FFNN with a single hidden layer is used for mapping the feature vectors to labels. According to Trenn (2008), a single-hidden-layer FFNN with a sufficient number of neurons in the hidden layer can deliver optimal classification performance. The network is initialized with a learning rate of 0.005. The tanh non-linear function and the softmax activation function are used at the hidden and output layers, respectively. For finding the optimal number of neurons in the hidden layer, we experimented with a subset of the database discussed in Section 2. This subset includes 500 randomly selected utterances from male and female speakers for each mode of a language; 70% of it is used for training and the remaining 30% for testing. This experiment is performed for the SMC model developed using each feature individually. The number of neurons at the input layer p of each model depends on the size of the feature vector; for example, in the case of PC-SMC the feature dimension is 500, so the size of the input layer is p = 500, whereas the number of nodes at the output layer r is constant, because there are two modes. The optimum number of nodes at the hidden layer q is derived by varying q from p to p/10 in 10 steps, i.e., multiplying p by 1/m, where m varies from 1 to 10 in increments of one. The SMC performance for the various values of q is shown in Table 4. The first column of the table shows the values of q, and the second to fourth columns report the classification accuracy of the PC-SMC, ESC-SMC, and VT-SMC models, respectively (these three models are described in the next subsections). It is observed that the optimal performance for the PC-SMC and ESC-SMC models is obtained at q = p/8 and for the VT-SMC model at q = p/2 (see Table 4). Therefore, these optimal values of q are used for training and testing the three SMC models. The final network structures for the excitation source and vocal tract based SMC models are specified in stages 1 and 2 (see Sections 6.1.2 and 6.1.3), respectively. Further, the standard back-propagation algorithm (stochastic gradient descent) (Bottou, 2010) is used for training the FFNN to minimize the root mean squared error between the actual and predicted outputs.

Table 4
Classification accuracy (%) of the PC-SMC, ESC-SMC, and VT-SMC models for various numbers of nodes in the hidden layer (q).

#Nodes (q)   PC-SMC (p = 500)   ESC-SMC (p = 500)   VT-SMC (p = 39)
p            28.24              39.35               75.43
p/2          31.44              42.50               83.12
p/3          35.89              46.03               82.05
p/4          36.32              51.56               78.10
p/5          38.91              56.23               75.19
p/6          40.31              57.34               71.58
p/7          42.17              60.18               65.97
p/8          46.85              62.42               61.45
p/9          44.56              58.75               40.67
p/10         40.23              53.82               33.48

6.1.2. Development of SMC models at Stage-1

At stage-1, two separate SMC models are developed using the proposed pitch contour and epoch strength contour based excitation source features. Each feature contains distinct mode-related information. Therefore, two different FFNN-based SMC models are developed at stage-1: (i) the SMC model using pitch contour features, denoted as the PC-SMC model, and (ii) the SMC model using epoch strength contour features, denoted as the ESC-SMC model. The number of nodes at the input and hidden layers of the PC-SMC model are p1 = 500 and q1 = 62, respectively. The network structure of the ESC-SMC model is exactly the same as that of the PC-SMC model. These two FFNN models are trained and tested at the sentence level. The training of the PC-SMC and ESC-SMC models is terminated after 200 epochs, as there is no substantial decrease in the error when the number of epochs is increased further.
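A minimal sketch of the single-hidden-layer FFNN classifier described in Section 6.1.1, written with tf.keras purely for illustration (the paper does not name the toolkit used for the SMC models). The layer sizes follow the PC-SMC configuration, p = 500 inputs and q = 62 tanh hidden units with a 2-way softmax output, trained with SGD at a learning rate of 0.005 to minimize the squared error between target and predicted outputs.

```python
import tensorflow as tf

def build_smc_ffnn(p=500, q=62, r=2):
    """Single-hidden-layer FFNN: tanh hidden layer, softmax output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(p,)),
        tf.keras.layers.Dense(q, activation="tanh"),
        tf.keras.layers.Dense(r, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.005),
        loss="mean_squared_error",   # squared error between target and predicted outputs
        metrics=["accuracy"],
    )
    return model

# Usage sketch: X is (num_utterances, 500) pitch contours, y is one-hot (num_utterances, 2).
# model = build_smc_ffnn(); model.fit(X, y, epochs=200, batch_size=32)
```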


Table 5
Average classification performance (%) of the stage-1, stage-2, and stage-3 SMC models developed using monolingual and multilingual speech corpora. (Src-SMC = PC-SMC + ESC-SMC and Int-SMC = Src-SMC + VT-SMC.)

ID  Training     Testing   PC-SMC     ESC-SMC    Src-SMC    VT-SMC     Int-SMC
                           (Stage-1)  (Stage-1)  (Stage-2)  (Stage-2)  (Stage-3)
1   Bengali (B)  Bengali   51.12      64.37      72.24      87.67      93.16
2   Odia (O)     Odia      52.12      66.42      75.15      88.64      94.29
3   Telugu (T)   Telugu    46.45      64.44      72.21      84.78      91.67
4   Kannada (K)  Kannada   44.19      60.21      71.43      82.61      90.87
5   K-T-B-O      Bengali   51.06      61.45      70.36      85.12      91.95
6   K-T-B-O      Odia      51.23      63.27      71.52      86.35      92.59
7   K-T-B-O      Telugu    45.17      63.34      69.12      81.43      90.54
8   K-T-B-O      Kannada   42.23      59.56      68.75      80.51      89.43
9   K-T-B-O      Average   47.42      61.90      69.93      83.35      91.10

6.1.3. Development of SMC models at Stage-2

At stage-2, two more SMC models are developed: (i) one model is developed by merging the two SMC models prepared at stage-1 (referred to in this work as the Src-SMC model), and (ii) the other model is developed using the vocal tract features (referred to as the VT-SMC model). In the Src-SMC model, mode classification is performed by fusing the scores (discussed later in Section 6.1.5) obtained from the PC-SMC and ESC-SMC models. Further, at this stage, the VT-SMC model is trained and tested using the MFCC+Δ+ΔΔ features (discussed in Section 3) extracted from every frame. Hence, the model generates the class probabilities at the frame level, i.e., two probability scores are obtained for each frame. From the frame-level scores, the sentence-level scores are derived as follows:

1. Derive the 39-dimensional MFCC, Δ, and ΔΔ feature vector for each frame of a sentence.
2. Detect the speech mode of each frame using the VT-SMC model.
3. Find the total number of frames detected as read and as conversation mode. For instance, a sentence may contain 1000 frames, of which 800 frames are detected as read and the remaining 200 frames as conversation mode.
4. For each sentence, calculate the score S_c corresponding to each mode as

S_c = n_c / N    (6)

where S_c represents the score corresponding to class c (c ∈ {read, conversation}), n_c is the total number of frames classified as class c, and N is the total number of frames in the sentence. For the above instance, S_read = 800/1000 = 0.8 and S_conv = 200/1000 = 0.2; since the majority of frames belong to the read mode, the utterance is classified as read speech.

6.1.4. Development of SMC models at Stage-3

Stage-3 consists of a single integrated SMC model developed by fusing the scores obtained from the SMC models of stage-2. As described in the preceding section, stage-2 contains two SMCs: Src-SMC and VT-SMC. The integrated model is denoted as the Int-SMC model.

6.1.5. Score level fusion

Score-level fusion is usually explored for combining multiple classification systems (each developed using specific features) into a single system, in order to exploit the complementary and non-overlapping discriminative information provided by each system. The score is an estimate of the relationship between the feature vectors of the training and testing utterances, and is represented in terms of the posterior probability corresponding to each class. The evidence generated by a classifier depends on the features used during training and testing. In order to optimize the performance of the combined system, the scores provided by the individual systems need to be weighted based on the discriminating capability of the individual features; this is known as weighted score-level fusion. An adaptive weighted combination scheme (Reddy et al., 2013) has been used for combining the scores of the individual speech mode classification models. The idea behind weighted score fusion is to reduce the total error on the development set. The fused score is calculated as

\hat{S} = \sum_{i=1}^{c} w_i S_i    (7)

where i = {1, 2, ..., c} indexes the models used for fusing the scores, S_i and w_i are the confidence score and weight factor of the i-th model, respectively, and \hat{S} is the weighted fusion score. In this study, each of the weights w_i is varied in steps of 0.01 from 0 to 1, subject to the following constraints:

\sum_{i=1}^{c} w_i = 1 \quad \text{and} \quad w_i \ge 0    (8)

In this work, we have analyzed 98 distinct sets of weights for Src-SMC (at stage-2) and Int-SMC (at stage-3). SMC has been carried out in monolingual and multilingual scenarios: four monolingual SMCs and one multilingual SMC have been developed, and their performance at the different stages is given in Table 5. The models at the score-level fusion stages (i.e., stage-2 and stage-3) are evaluated using all possible (98 in this work) sets of weights on the development data set, and the optimal set of weights, which yields the best performance among the possible combinations, is finally chosen. These optimal weights are then used for evaluating the performance of the SMC models on the test dataset. The excitation source models (PC-SMC and ESC-SMC) are combined at stage-2 and referred to as the Src-SMC (PC-SMC + ESC-SMC) model. Thereafter, the vocal tract model (VT-SMC) and the Src-SMC model are combined at stage-3 and referred to as the Int-SMC (Src-SMC + VT-SMC) model. The weight factors associated with the PC-SMC, ESC-SMC, Src-SMC, and VT-SMC models are denoted by w1, w2, w3, and w4, respectively, where w2 and w4 are computed as (1 - w1) and (1 - w3), respectively. The performance of each SMC model has been evaluated on 4 monolingual (Bengali, Odia, Telugu, and Kannada) and 1 multilingual (combination of all 4 languages) datasets. Hence, each SMC model is studied in 5 different ways, where four of them are monolingual (trained and tested with the same language) and the fifth one is multilingual (trained with all four languages together and tested with individual languages). The classification accuracy of the five Src-SMC models (4 monolingual and 1 multilingual) for various combinations of weights (w1, w2) is shown in Fig. 5; for each model, score-level fusion has been used to optimize the performance. From Fig. 5, it is noted that, for all five models, weights around w1 = 0.4 and w2 = 0.6 are found to be optimal for achieving the best performance on the development data set. This result indirectly indicates that the considered excitation source features (i.e., the pitch contour and the epoch strength contour) carry mode-discriminative information that is observed to be language independent, since approximately the same set of weights is found to be optimal for all five SMC models. Similarly, Fig. 6 shows the classification accuracy of the five Int-SMC models (4 monolingual and 1 multilingual) at stage-3 for various combinations of weights w3 and w4. From Fig. 6, it is observed that for all five models approximately the same set of weights, around w3 = 0.3 and w4 = 0.7, is found to be optimal on the development dataset. From Figs. 5 and 6, it is observed that the optimal weights w1, w2, w3, and w4 vary between 0.4-0.5, 0.5-0.6, 0.3-0.4, and 0.6-0.7, respectively, for all five SMC models at stage-2 and stage-3. Therefore, we have chosen the average of the weight values obtained from all five models as the optimal weights, which are (w1 = 0.45, w2 = 0.55) in the case of stage-2 and (w3 = 0.35, w4 = 0.65) in the case of stage-3, for the final evaluation of the SMC models on the test dataset.
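The frame-vote aggregation of Eq. (6) and the two-model weighted fusion of Eqs. (7)-(8) are simple to express in code. The sketch below is our own illustration of that logic (the function names and the development-set grid search are ours), with w2 = 1 - w1 as in the paper, so that a single sweep of w from 0 to 1 in steps of 0.01 covers the admissible weight pairs.

```python
import numpy as np

def sentence_scores_from_frames(frame_posteriors):
    """Eq. (6): fraction of frames voting for each class.
    frame_posteriors: (num_frames, 2) array of [P(read), P(conv)] per frame."""
    votes = np.argmax(frame_posteriors, axis=1)      # per-frame decision
    n = len(votes)
    return np.array([np.sum(votes == 0) / n,         # S_read
                     np.sum(votes == 1) / n])        # S_conv

def fuse_scores(scores_a, scores_b, w):
    """Eqs. (7)-(8) for two models: S_hat = w * S_a + (1 - w) * S_b."""
    return w * np.asarray(scores_a) + (1.0 - w) * np.asarray(scores_b)

def best_weight(dev_scores_a, dev_scores_b, dev_labels):
    """Grid-search w in steps of 0.01 to maximize development-set accuracy."""
    best_w, best_acc = 0.0, -1.0
    for w in np.arange(0.0, 1.01, 0.01):
        fused = [fuse_scores(a, b, w) for a, b in zip(dev_scores_a, dev_scores_b)]
        acc = np.mean([np.argmax(f) == y for f, y in zip(fused, dev_labels)])
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w
```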

7. Evaluation of SMC models

In this work, excitation source and vocal tract features are explored for developing the speech mode classification models using FFNNs. Read and conversation modes of speech from four Indian languages, Telugu, Kannada, Odia, and Bengali, are considered for evaluating the proposed SMC models. We have developed monolingual and multilingual SMC models. The monolingual SMC models are developed using the speech utterances of a specific language; accordingly, there are four monolingual SMC models corresponding to the four languages Bengali, Odia, Kannada, and Telugu, and each is evaluated with speech utterances of the same language. These monolingual SMC models are developed separately at each stage using different features: at stage-1, two sets of monolingual SMCs are developed corresponding to the PC and ESC features, and at stage-2, another set of monolingual SMCs is developed using the VT features. The multilingual SMC models are developed using the speech utterances of all four considered languages together, and they are referred to as K-T-B-O SMC models. The developed multilingual SMC models are evaluated with speech utterances of each of the languages considered for building the models. At stage-1, two multilingual SMC models are developed, corresponding to the PC and ESC features, and at stage-2, one more multilingual SMC is developed using the VT features. Table 5 shows the average classification performance of the monolingual and multilingual SMC models developed at stage-1, stage-2, and stage-3. The first column of Table 5 indicates the unique identities (IDs) corresponding to the various testing strategies: IDs 1-4 refer to the evaluation of the monolingual SMC models using speech utterances of their respective languages; IDs 5-8 refer to the evaluation of the multilingual SMC model using speech utterances of the four languages, separately; and ID 9 refers to the average performance of the multilingual SMC model using speech utterances of all four languages together. The second and third columns indicate the languages involved in training and testing the models, and columns 4-8 show the classification accuracy of the developed SMC models at stages 1, 2, and 3. A detailed explanation of the performance of the SMC models at each stage is given in Sections 7.1-7.3.

Fig. 5. Performance of the five Src-SMC (PC-SMC + ESC-SMC) models (four monolingual and one multilingual) with different sets of weights during score-level fusion.

Fig. 6. Performance of the five Int-SMC (Src-SMC + VT-SMC) models (four monolingual and one multilingual) with different sets of weights during score-level fusion.

7.1. Evaluation of SMC models at Stage-1

At stage-1, four monolingual and one multilingual SMC models were developed separately using the pitch contour (PC) and epoch strength contour (ESC) features extracted at the sentence level. The fourth and fifth columns of Table 5 show the mode classification performance achieved using the PC and ESC features. From the results, it is observed that the classification accuracy is better for the ESC features than for the PC features, across all nine test strategies. A similar observation was noted while analyzing the correlation coefficients for mode discrimination. This demonstrates that the epoch strength contour has better mode discrimination power than the pitch contour. Examining the SMC results using the monolingual (rows 1-4) and multilingual (rows 5-8) models, it is observed that the performance of the multilingual models is almost on par with that of the monolingual models. This result indicates that the multilingual SMC model is effectively language independent. To justify this statement, we performed some informal studies where SMC models were built incrementally by increasing the number of languages and tested with speech utterances of an unseen language. When a model is trained with only one language and tested with unseen-language speech utterances, low classification performance is observed. It is also noted that the mode discrimination performance for unseen languages gradually improves as the number of languages considered for building the models is increased. This is due to the increased diversity of the training data obtained by including more languages, which helps the model infer a generic, language-independent mode discrimination capability.

7.2. Evaluation of SMC models at Stage-2

The aim of stage-2 is to develop two different speech mode classification models by processing the complete excitation source and vocal tract features. The complete excitation source based model at stage-2 is designed by fusing the scores from the pitch contour and epoch strength contour based models and is termed Src-SMC. The vocal tract based model at stage-2 is designed using the 39-dimensional MFCC+Δ+ΔΔ features. For analysis purposes, we explored MFCCs alone and MFCCs with dynamic features (MFCC+Δ+ΔΔ), and observed that the MFCC+Δ+ΔΔ features capture better speech discrimination information than MFCCs alone. MFCCs provide the instantaneous shape of the spectral envelope, but not the details of how it changes with respect to the neighbouring frames, whereas Δ+ΔΔ provide this information. Therefore, by concatenating MFCC and Δ+ΔΔ, we preserve the current spectral envelope as well as its dynamics with respect to the preceding and following frames. Hence, we considered MFCC+Δ+ΔΔ for performing the experiments in this work.

In Table 5, the average classification performance of the SMC model developed using the complete excitation source features is lower than that of the models developed using MFCC+Δ+ΔΔ. This indicates that the vocal tract features contain significantly more mode-specific information than the excitation source features.

7.3. Evaluation of SMC models at Stage-3

At this stage, an integrated speech mode classification (Int-SMC) model is developed by combining the Src-SMC and MFCC+Δ+ΔΔ-based (VT-SMC) models using score-level fusion. Column eight of Table 5 shows that the classification performance of the integrated model is better than that of the models based on the individual features; for example, the accuracy of the Int-SMC model is higher than the accuracies of the Src-SMC and VT-SMC models. The results show that the Int-SMC model has a better average classification accuracy across the languages. Hence, this integrated SMC model can be used as the front-end of the multilingual PRS for optimally recognizing the phonetic units present in an input speech signal of any mode spoken in any language.

7.4. Performance analysis of integrated model

The improved performance of the integrated model (Int-SMC) can be better understood from Fig. 7, which shows the mode classification of 25 random speech utterances using the VT-SMC, Src-SMC, and Int-SMC models. In Fig. 7, a red circle represents a correctly detected mode, and a black circle indicates a falsely detected mode of an utterance for the model shown on the y-axis. The performance of the Int-SMC model depends on the classification accuracy of the VT-SMC and Src-SMC models: as can be seen from Fig. 7, if both models fail, then the Int-SMC model also fails to classify the mode accurately. For example, the mode of utterance 15 is falsely detected by the VT-SMC and Src-SMC models as well as by the Int-SMC model. Further analysis shows that, for some utterances, one of the two models (either VT-SMC or Src-SMC) is correct. For example, the modes of utterances 2 and 4 are recognized correctly by the Src-SMC model whereas the VT-SMC model fails, and the modes of utterances 5 and 6 are recognized correctly by the VT-SMC model whereas the Src-SMC model fails. In the integrated model (Int-SMC), however, the modes of utterances 2, 5, and 6 are recognized correctly. This shows that the vocal tract and excitation source features capture some complementary mode-specific information, which results in better classification accuracy with the Int-SMC model.

Fig. 7. Illustration of mode classification for a subset of speech utterances from a development set using the VT-SMC, Src-SMC, and Int-SMC models. A correctly detected mode of an utterance is marked with a red circle, and a falsely detected mode is marked with a black circle.

7.5. Performance comparison of two speech modes

In this section, we analyze the classification accuracies of the Int-SMC models developed using monolingual and multilingual speech corpora with respect to speech utterances from the read and conversation modes. Table 6 presents the classification accuracies of the four monolingual and one multilingual Int-SMC models for the conversation and read modes of speech, separately, in columns 4 and 5, respectively. The process of testing these models is the same as in stage-1 and stage-2, as discussed in the previous subsections.

Table 6
Classification accuracy (%) of the conversation and read modes for the monolingual and multilingual Int-SMC models.

ID  Training     Testing   Conversation   Read
1   Bengali (B)  Bengali   92.47          93.85
2   Odia (O)     Odia      93.37          95.20
3   Telugu (T)   Telugu    89.76          93.58
4   Kannada (K)  Kannada   89.10          92.63
5   K-T-B-O      Bengali   91.42          92.48
6   K-T-B-O      Odia      91.35          93.83
7   K-T-B-O      Telugu    89.32          91.76
8   K-T-B-O      Kannada   87.81          91.05
9   K-T-B-O      Average   89.97          92.28

From the results, it can be observed that the classification performance for the read mode is slightly better than for the conversation mode across all the models. The reason for this misclassification may be the nature of the speech data considered. In this work, the speech data related to news read by a newsreader is treated as read-mode speech, and the speech data related to news interviews and discussions managed by a TV anchor is treated as conversation-mode speech. In this scenario, the labeled conversation speech may contain some neutral-style speech at the initial stage of the interview or discussion, and hence these conversation-mode speech samples may lead to misclassification. Similarly, some read-mode speech samples may be classified as conversation mode due to variations in the speaking rates and intonation patterns of the newsreaders. However, the misclassification rate is not significant. From Table 6, it can also be observed that the performance of the multilingual SMC model (labeled K-T-B-O) is mostly close to the performance of the monolingual SMC models (labeled Bengali, Odia, Telugu, and Kannada) in classifying the speech modes. This result indicates that, by using the multilingual SMC system, we can detect the mode of a speech utterance irrespective of its language, and further, by using the detected mode identity, accurate phone recognition can be achieved with the help of the mode-specific phone recognizer. Otherwise, if monolingual SMC models were used for speech mode classification, an accurate language identification (LID) system would be required prior to the SMC system: the LID system would first identify the language of the speech utterance and then feed it to the specific monolingual SMC based on the detected language identity. In that case, the overall SMC performance depends on the performance of both the LID and SMC systems. The advantage of using a multilingual SMC system over monolingual SMC systems is that only one SMC needs to be developed and maintained, trained with speech data from all languages together; otherwise, the number of monolingual SMCs to be developed equals the number of languages considered, and an efficient LID system is required at the front-end to accurately detect the language of a given speech utterance. Realizing an LID system with 100% accuracy is not practical, whereas the proposed multilingual SMC model, which achieves almost the same performance as the monolingual SMC models, does not need any LID system at the front-end. Therefore, in this work, we have used the multilingual SMC system for detecting the mode of a speech utterance for further processing.

K. Tripathi, M.K. Reddy and K.S. Rao

Speech Communication 119 (2020) 12–23

Table 7 Average phone error rates (%) using baseline and proposed MPRSs. Feature

MPRS

Average PER (%)

MFCC+Tandem (Baseline)

Read-Baseline Conv-Baseline RC-Baseline

51.03 52.73 50.55

Read-Proposed Conv-Proposed RC-Proposed

46.77 49.31 47.17

MFCC+Tandem+ RMFCC+MPDSS

Table 8 Error rates (%) of phones from read and conversation modes of speech using proposed COMB-MPRS in the presence of SMC with 91% and 100% accuracy. MPRS

COMB-MPRS COMB-MPRS (Ideal)

PER (%) Read

Conversation

Average

38.98 31.85

40.63 33.47

39.81 32.66

8.2. Evaluation of proposed COMB-MPRS monolingual SMC models, doesn’t needs any LID system at the frontend. Therefore, in this work, we have used the multilingual SMC system for detecting the mode of a speech utterance for further processing.

From the Table 7, it is clear that the mode-independent MPRSs (RCbaseline and RC-proposed) are also not capable enough to significantly reduce the PER. Therefore, for development of efficient and accurate MPRS, which is robust to any mode of speech, we have proposed a 2-stage MPRS, where stage-1 consists of an efficient SMC system and stage-2 consists of two separate mode-specific MPRSs: Read-proposed and Conv-proposed. In this work, we have used the proposed 3-stage SMC system for identifying the mode of the speech utterance. The details of the proposed 3-stage SMC system are discussed in section 6. After, identifying the mode of test speech utterance by the proposed SMC model, the speech utterance is passed to the respective MMPRS for efficient recognition of phone sequence from the given speech utterance. The PER of the proposed 2-stage MPRS (or COMB-MPRS) is given in Table 8. From the results presented in Table 8 (row 1), it is observed that the overall PER is around 39.81%, out of which PER of read speech is around 38.98% and for conversation speech it is around 40.63%. By comparing Tables 7 and 8, it is observed that COMB-MPRS has significantly reduced the PER as compared to mode-dependent and mode-independent MPRSs. We have also analyzed the performance of COMB-MPRS for the ideal case in Table 8 (see row 2). Since, we had the ground-truth labels also, so it was possible to evaluate the performance at the ideal case. From the Table 8, it can be seen that COMB-MPRS with 100% MSMC accuracy is performing significantly better for both the modes as compared to the COMB-MPRS with maximum (91.10%) MSMC accuracy. The reason for lower performance in COMB-MPRS with 91.10% MSMC accuracy is that the current MSMC model is not accurately classifying the modes of some utterances. Therefore, we will ensure that the MSMC model should be highly accurate. This shows the scope for improving the MSMC model.

8. Evaluation of multilingual phone recognition systems In this section first we have analyzed the impact of proposed excitation source features on the PERs of baseline multilingual phone recognition systems (MPRSs). Afterward, we have analyzed the performance of proposed COMB-MPRS as discussed in Section 6. At the end of this section, pros and cons of phone recognition using monolingual and multilingual models is discussed.

8.1. Improvement of Baseline MPRSs In this section, we have improved the performance of baseline MPRSs discussed in Section 5. From Table 3 it is observed that the mode-specific MPRSs (read-baseline and conv-baseline) are providing lower PERs in case of their corresponding modes. Hence, it is required to develop MPRS which is robust to the mode of input speech. As we have mentioned already in Section 5 that one possible way for this is to use combined data from both the modes for training the MPRS, as RC-baseline MPRS. But, from the results presented in Table 3, it can be observed that even with RC-baseline the performance is not significantly improved. Therefore, first we try to improve the baseline systems itself by adding the proposed feature which is excitation source features (RMFCC and MPDSS). Since, the source features also contain significant information, therefore we are adding them with MFCCs and Tandem features. The 3 baseline MPRSs (read-baseline, conv-baseline, and RC-baseline) in addition with excitation source features are named as read-proposed, conv-proposed and RC-proposed MPRSs. The experimental setup and database used for developing these 3 proposed MPRSs is same as 3 baseline MPRSs. The performance of the proposed and their corresponding baseline MPRS are given in Table 7. The results from Table 7 indicate that the PERs are lower in case of the proposed MPRSs, compared to their baseline MPRSs. The proposed MPRSs have approximately 3-4% less PERs, compared to their respective baseline MPRSs. From these results, it is indicating that the proposed excitation source features consisting of MPDSSs and RMFCCs have some non-overlapping or complementary phone specific knowledge, when compared to the combination of conventional vocal tract features such as MFCCs and Tandem features. Because of these complementary features, we have observed performance improvement across all three proposed MPRSs, compared to their respective baseline MPRSs. But the improvement in PER is not satisfactory for real-time scenarios. It shows that by adding extra features to baseline MPRSs will not provide much improvement, in spite it will increase the system complexity. Hence, the proposed COMB-MPRS will be useful to deal with speech utterances from multiple modes. In Section 8.2, the performance of proposed 2-stage MPRS is analyzed and its significance is demonstrated.

8.3. Analysis of monolingual and multilingual PRSs

In this section, we analyze the monolingual and multilingual PRSs developed using the mode-specific MPRSs. For this analysis, we consider four monolingual PRSs and one multilingual PRS. Note that each PRS (monolingual or multilingual) consists of a pair of systems corresponding to Read-proposed and Conv-proposed. For the analysis, we examine the PERs of the Read-proposed and Conv-proposed PRSs when they are tested with speech utterances of the corresponding mode. For the multilingual PRS, the PERs are examined with respect to both language and mode. Table 9 gives the PERs of the developed monolingual and multilingual PRSs with respect to speech mode. The first column indicates the identities of the PRS models considered in this study. IDs (rows) 1-4 show the PERs of the monolingual PRSs, IDs (rows) 5-8 show the PERs of the multilingual PRS when tested with speech utterances of the four individual languages, and ID (row) 9 indicates the average PER offered by the multilingual PRS when tested with speech utterances of all four languages together. The results in Table 9 indicate that the monolingual PRSs perform better than the multilingual PRS; a difference of about 5-7% in PER is observed between them. Even though the monolingual PRSs offer the lowest PERs, they have certain disadvantages: (i) for each new language, a separate PRS needs to be developed; (ii) to route a given speech utterance to the corresponding language-specific PRS, an accurate language identification (LID) system is required as a front-end to the monolingual PRSs; (iii) developing an LID system with 100% accuracy is not practically possible due to structural and lexical similarity among the languages; (iv) for low-resource languages, developing a language-specific PRS is very complex due to the unavailability of linguistic knowledge such as a pronunciation dictionary and linguistic and phonetic rules; and (v) precise modeling of all unique phones is difficult due to the unavailability of some rare phones. Based on these shortcomings of monolingual PRSs, we suggest the multilingual PRS, which is free from the difficulties mentioned above.
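The deployment difference behind points (i)-(iii) above can be summarized as follows: the monolingual setup needs an LID front-end to select one of the four language-specific PRSs, whereas the multilingual setup decodes every utterance with a single model. The sketch below contrasts the two pipelines; all interfaces are hypothetical.

def decode_with_monolingual_prs(utterance, lid_model, prs_by_language):
    """Monolingual setup: an LID front-end routes to a language-specific PRS;
    an LID error sends the utterance to the wrong recognizer."""
    language = lid_model.identify(utterance)        # e.g., "Bengali", "Telugu", ...
    return prs_by_language[language].decode(utterance)

def decode_with_multilingual_prs(utterance, multilingual_prs):
    """Multilingual setup: a single model covers all four languages, so no LID is needed."""
    return multilingual_prs.decode(utterance)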


Table 9
Phone error rates (%) of the Read-proposed and Conv-proposed MPRSs on monolingual and multilingual speech datasets.

ID   Training language   Testing language   Read-Proposed PER (%)   Conv-Proposed PER (%)
1    Bengali (B)          Bengali            25.01                   26.59
2    Odia (O)             Odia               23.71                   25.82
3    Telugu (T)           Telugu             25.20                   27.91
4    Kannada (K)          Kannada            24.73                   28.76
5    KTBO                 Bengali            30.03                   32.98
6    KTBO                 Odia               33.50                   34.37
7    KTBO                 Telugu             26.33                   30.58
8    KTBO                 Kannada            37.55                   35.94
9    KTBO                 Average            31.85                   33.47
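The roughly 5-7% gap between the monolingual and multilingual PRSs discussed in Section 8.3 can be checked directly from the values in Table 9. The small script below is only an illustrative sanity check; the numbers are copied from the table.

# PERs copied from Table 9 (rows 1-4: monolingual PRSs; row 9: multilingual average).
mono_read = [25.01, 23.71, 25.20, 24.73]
mono_conv = [26.59, 25.82, 27.91, 28.76]
multi_read, multi_conv = 31.85, 33.47

avg = lambda xs: sum(xs) / len(xs)
print(round(multi_read - avg(mono_read), 2))   # 7.19 (read mode)
print(round(multi_conv - avg(mono_conv), 2))   # 6.2 (conversation mode)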


9. Conclusion

In this work, a two-stage system has been proposed for automatically recognizing the phonetic units present in speech utterances from multiple languages spoken in multiple modes. In the first stage, the mode of the input speech is identified using a multilingual SMC system developed using vocal tract and suprasegmental-level excitation source features. In this study, we have considered two modes of speech, namely, conversation and read modes. The SMC systems are evaluated using data from four Indian languages: Telugu, Odia, Kannada, and Bengali. It is observed that the vocal tract features provide better average SMC accuracy (83%) than the excitation source information (69%). The complementary nature of the excitation source and vocal tract features is also investigated in this study: the combined vocal tract and excitation source features improve the average classification accuracy to 91%, compared to the individual feature sets. In the second stage, phone recognition is carried out by processing the speech through the MPRS specific to the mode of the speech signal. Initially, we improved the baseline mode-specific MPRSs by means of the proposed excitation source features; the improved MPRSs are then used in the second stage of COMB-MPRS. The evaluation results show that the proposed two-stage approach (COMB-MPRS) handles speech utterances from both modes (read and conversation) better than the mode-dependent and mode-independent MPRSs. The overall recognition performance of COMB-MPRS depends on (a) the performance of the MSMC system and (b) the performance of the mode-specific MPRSs. From the results, it can be seen that there is an overall improvement in PER of about 8% in the ideal case (100% MSMC accuracy) compared to the non-ideal case (91% MSMC accuracy). In this work, we have utilized the labels present in the database to realize the ideal case; in practice, however, the mode labels for the test data will not be available. Therefore, it is essential to have an MSMC system with very high accuracy (close to 100%) at the front-end for proper routing of speech signals. Hence, our future focus will be on improving the accuracy of the MSMC by exploring other source features. Furthermore, we plan to investigate articulatory features for decreasing the phone error rates of the MPRSs. In this work, we have focused only on Indian languages; however, the effectiveness of the proposed approach can be evaluated using speech data from other languages.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.specom.2020.02.006.

CRediT authorship contribution statement

Kumud Tripathi: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing - original draft. M. Kiran Reddy: Validation, Resources, Writing - review & editing, Visualization. K. Sreenivasa Rao: Writing - review & editing, Visualization.
