Speech Communication 86 (2017) 107–120
An automated technique to generate phone-to-articulatory label mapping

Basil Abraham, S. Umesh

Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, Tamil Nadu 600036, India
Article info

Article history: Received 1 February 2016; Revised 16 November 2016; Accepted 23 November 2016; Available online 28 November 2016.

Keywords: Articulatory features; Mapping; Phone-CAT; Under-resourced languages; Cross-lingual techniques; Multilayer perceptrons
Abstract

Recent studies have shown that, in the case of under-resourced languages, the use of articulatory features (AF) emerging from an articulatory model results in improved automatic speech recognition (ASR) compared to conventional mel frequency cepstral coefficient (MFCC) features. Articulatory features are more robust to noise and pronunciation variability than conventional acoustic features. One method to extract articulatory features is to take conventional acoustic features such as MFCC and build an articulatory classifier that outputs articulatory features (known as pseudo-AF). However, these classifiers require a mapping from phones to the different articulatory labels (AL) (e.g., place of articulation and manner of articulation), which is not readily available for many under-resourced languages. In this article, we propose an automated technique to generate a phone-to-articulatory label (phone-to-AL) mapping for a new target language based on the knowledge of the phone-to-AL mapping of a well-resourced language. The proposed mapping technique is based on the center-phone capturing property of the interpolation vectors emerging from the recently proposed phone cluster adaptive training (Phone-CAT) method. Phone-CAT is an acoustic modeling technique that belongs to the broad category of canonical state models (CSM), which also includes the subspace Gaussian mixture model (SGMM). In Phone-CAT, the interpolation vector belonging to a particular context-dependent state has maximum weight for the center phone in the case of monophone clusters, or for the AL of the center phone in the case of AL clusters. These relationships from the various context-dependent states are used to generate a phone-to-AL mapping. The Phone-CAT technique makes use of all the speech data belonging to a particular context-dependent state; therefore, multiple segments of speech are used to generate the mapping, which makes it more robust to noise and other variations. In this study, we obtain phone-to-AL mappings for three under-resourced Indian languages, namely Assamese, Hindi and Tamil, based on the phone-to-AL mapping available for English. With the generated mappings, articulatory features are extracted for these languages using varying amounts of data in order to build articulatory classifiers. Experiments were also performed in a cross-lingual scenario assuming a small training data set (approximately 2 h) from each of the Indian languages, with articulatory classifiers built using larger amounts of training data (approximately 22 h) from other languages including English (Switchboard task). Interestingly, the cross-lingual performance is comparable to that of an articulatory classifier built with large amounts of native training data. Using articulatory features, more than 30% relative improvement was observed over conventional MFCC features for all three languages in a DNN framework.
1. Introduction

The performance of automatic speech recognition (ASR) systems has significantly improved with the application of better acoustic modeling techniques based on deep neural networks (DNN). However, many factors such as speaker variability and noisy environments affect the performance of ASR. In the literature, the use of
articulatory features has been shown to improve the performance of ASR systems in noisy environments in Kirchhoff et al. (2002) and on conversational speech in Frankel et al. (2007). Articulatory features represent a speech signal in terms of the underlying articulatory attributes of speech production. These features were used by Schmidbauer (1989a), Schmidbauer (1989b), Kirchhoff et al. (2002), King et al. (2007), Cetin et al. (2007) and Frankel et al. (2007) in their work on automatic speech recognition. Articulatory features can be obtained in the following three ways (Kirchhoff et al., 2002):
1. Direct measurement of articulatory parameters, for example, those obtained by cine-radiography (Papcun et al., 1992).
2. Articulatory parameters recovered from the acoustic signal by inverse filtering (Schroeter and Sondhi, 1994).
3. Posterior probabilities extracted from conventional acoustic features by means of a statistical classifier (Kirchhoff et al., 2002). This approach generates the so-called pseudo-articulatory features.

Only pseudo-articulatory features are considered in this article. To extract pseudo-articulatory features in any language, articulatory classifiers are constructed for different AL groups, such as the group of labels referring to the manner of articulation, the group referring to the place of articulation, and so on. Building a robust articulatory classifier requires large amounts of data transcribed in terms of AL (e.g., alveolar and dental) in that language. Since it is very difficult to manually transcribe data at frame level in terms of AL, the usual practice is to obtain a phone-level alignment and convert it into AL using a phone-to-AL mapping. The two broad approaches for building articulatory classifiers are as follows:

1. Phone-to-AL mapping is available in that language: This is the direct way of building articulatory classifiers. Using the phone-to-AL mapping, articulatory classifiers are built in that language and are used to extract articulatory features in that language.
2. Phone-to-AL mapping is not available in that language: Here, the articulatory classifiers have to be constructed from another language where the phone-to-AL mapping is available, and the articulatory features are then extracted using these classifiers. This approach was used in Sivadas and Hermansky (2004), Tóth et al. (2008), Thomas et al. (2010), Lal and King (2013) and Çetin et al. (2007).

The articulatory features extracted by the first approach showed consistent improvements over conventional acoustic features for ASR (Kirchhoff et al., 2002). However, failures were noted in some cases when the second approach was used. For example, in Çetin et al. (2007), articulatory classifiers trained on English continuous telephone speech were used to extract articulatory features for a Mandarin broadcast news task, but they failed to improve the recognition performance of the Mandarin task. Similarly, in Tóth et al. (2008), articulatory classifiers trained on English speech were used to generate articulatory features for Hungarian telephone speech; improved recognition performance was reported, but the features failed to perform at par with features generated by articulatory classifiers trained only on Hungarian data. The performance degradation observed when using data from a different language was due to differences in domain and channel variations between the databases. A similar effect was also noted in Thomas et al. (2010). In Lal and King (2013), German, Portuguese and Spanish data were combined to build the articulatory classifiers, which were then used to extract articulatory features for each of these languages. The extracted articulatory features showed improved recognition performance for all three languages, but not as much as articulatory features extracted from classifiers built with the corresponding language data alone. Hence, the conventional wisdom is to train articulatory classifiers with data from the same language, or from other languages collected in similar conditions.
To build articulatory classifiers in any language, a phone-to-AL mapping is required along with the data, and for many languages such a mapping is not readily available. In most previous works that used AL-based features, a manually generated phone-to-AL mapping was used. For many languages, manually obtaining a phone-to-AL mapping is difficult, since it requires the assistance of a phonetic expert, and the transcription may require phone labels that are not listed among the International Phonetic Alphabet (IPA) symbols. For rare languages, finding a phonetic expert is often difficult, which makes the situation worse; even when phonetic experts are available, they may not reach a consensus on the AL. To overcome these problems, we propose an automated way to generate the phone-to-AL mapping of a particular language based on the knowledge of the phone-to-AL mapping of some well-resourced language. To our knowledge, there are no previous reports on techniques to automatically generate a phone-to-AL mapping for a language. The proposed technique uses the interpolation vectors of the recently proposed phone cluster adaptive training (Phone-CAT) acoustic modeling technique (Manohar et al., 2013) to generate the phone-to-AL mapping.

This paper is organized as follows. In Section 2, a brief review of ASR using articulatory features is given. In Section 3, the proposed phone-to-AL mapping technique is described. In Section 4, the proposed technique is compared with alternative techniques that we have used to generate phone-to-AL mappings. In Section 5, a detailed description of the experimental setup is given. In Section 6, the results and analysis of the various experiments performed with articulatory features extracted using the proposed technique are given. In Section 7, the application of the proposed technique to languages with limited training data is described. Finally, Section 8 concludes the paper.

2. Review of articulatory features for ASR

Initial attempts to use articulatory features in ASR are reported in Schmidbauer (1989a, 1989b), Elenius and Takács (1991), Eide et al. (1993), Deng and Sun (1994) and Erler and Freeman (1996). More recent studies reported that articulatory features extracted from neural networks and fed into a tandem HMM (Ellis et al., 2001) improved recognition performance (Cetin et al., 2007; Frankel et al., 2007; Kirchhoff et al., 2002). Kirchhoff et al. (2002) also showed that these features are robust to noise. Articulatory features in the tandem HMM framework were further explored in the Johns Hopkins 2006 summer workshop (Cetin et al., 2007; Frankel et al., 2007). In this paper, we follow the articulatory feature set and feature extraction methodology outlined in that workshop.

2.1. Articulatory label set

In Frankel et al. (2007), a discrete multilevel label set with eight AL groups, each having specific AL, was introduced. This label set is given in Table 1. Each AL group has a 'none' class; for example, in the case of Degree & Manner this class covers the non-speech sounds (silences), whereas in the case of Place (associated with consonants), it covers all vowels and non-speech. For diphthongs (e.g., /aw/), the begin state is denoted by /aw1/ and the end state by /aw2/.

2.2. Review of articulatory feature extraction technique

Articulatory features are extracted from articulatory classifiers built for each of the eight AL groups listed in Table 1. Articulatory feature extraction is performed as per Algorithm 1 (Cetin et al., 2007). Consider the specific example of building an articulatory classifier for the AL group "Degree & Manner", as shown in Fig. 1. Given an acoustic feature vector as input, a multilayer perceptron (MLP) is trained with the six AL of the "Degree & Manner" group as output targets. This requires the input acoustic features to be aligned at frame level with the six AL. As noted in Frankel et al. (2007), manual transcription of data at frame level in terms of AL is laborious.
Hence, the usual practice is to obtain a phone-level alignment (from an efficient acoustic model built in terms of phones) and convert it into AL using a phone-to-AL mapping.
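The conversion of a phone-level alignment into frame-level AL targets is a simple table lookup, as the following minimal sketch illustrates; the example phones and the phone_to_manner dictionary entries are hypothetical.

```python
# Hypothetical phone-to-AL mapping for the "Degree & Manner" group.
phone_to_manner = {"iy": "VOW", "aa": "VOW", "s": "FRIC", "t": "CLO", "sil": "none"}

def phones_to_al_targets(frame_phones, phone_to_al):
    """Map a frame-level phone alignment to frame-level AL training targets."""
    return [phone_to_al[ph] for ph in frame_phones]

# A phone-level forced alignment (one phone label per frame) becomes AL targets:
alignment = ["sil", "s", "s", "iy", "iy", "t", "sil"]
print(phones_to_al_targets(alignment, phone_to_manner))
# ['none', 'FRIC', 'FRIC', 'VOW', 'VOW', 'CLO', 'none']
```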
Table 1
Articulatory label groups and their articulatory labels.

AL group          Cardinality   Articulatory labels (AL)
Place             10            Alveolar (ALV), dental (DEN), labial (LAB), labio-dental (L-D), lateral (LAT), none, post-alveolar (PA), rhotic (RHO), velar (VEL)
Degree & Manner   6             Approximant (APP), closure (CLO), FLAP, fricative (FRIC), vowel (VOW), none
Nasality          3             −, +, none
Rounding          3             −, +, none
Glottal state     4             Aspirated (ASP), voiceless (VL), voiced (VOI), none
Vowel shape       23            aa, ae, ah, ao, aw1, aw2, ax, ay1, ay2, eh, er, ey, ey1, ey2, ih, iy, ow1, ow2, oy1, oy2, uh, uw, nil
Height            8             HIGH, LOW, MID, mid-high (MID-H), mid-low (MID-L), very-high (VI), nil
Frontness         7             back (BK), front (FRT), MID, mid-back (MID-B), mid-front (MID-F), nil
Fig. 1. Steps involved in training an AL-MLP for AL group “Degree & Manner of articulation (Degree & Manner)”.
Algorithm 1: Steps involved in extracting articulatory features.
1. Training: Build an articulatory label MLP (AL-MLP) for each AL group using the phone-to-AL mapping in the language, as shown in Fig. 1.
2. Usage: Concatenate the MLP outputs (posterior probabilities) and convert them to log-probabilities, as shown in Fig. 2, to obtain the final articulatory features.
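As a concrete illustration of the usage step of Algorithm 1, the sketch below forward-passes an acoustic frame through the AL-MLPs of all AL groups and stacks the log-posteriors. It is a minimal sketch; the classifier objects and their predict_proba interface are assumptions rather than the toolkit used in the paper.

```python
import numpy as np

def extract_articulatory_features(frame, al_mlps, eps=1e-10):
    """Concatenate the posteriors of all AL-MLPs and convert to log-probabilities.

    frame   : acoustic feature vector (with any context splicing already applied)
    al_mlps : dict mapping an AL-group name to a trained classifier that
              returns posterior probabilities over the labels of that group
    """
    posteriors = []
    # One classifier per AL group (Place, Degree & Manner, Nasality, ...)
    for group in sorted(al_mlps):
        p = al_mlps[group].predict_proba(frame[np.newaxis, :])[0]
        posteriors.append(p)
    stacked = np.concatenate(posteriors)   # pseudo-AF: concatenated posteriors
    return np.log(stacked + eps)           # log-probabilities, as in Algorithm 1
```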
Articulatory features are extracted for each AL group separately and then combined to get the final features, as shown in Fig. 2. The efficacy of these articulatory features is highly dependent on the amount of speech data available to train the articulatory classifiers (Frankel et al., 2007).

3. Proposed phone-to-AL mapping technique

Phone-to-AL mapping is a many-to-one mapping from the phones to the AL of an AL group. For example, in the AL group "Degree & Manner", the phones /AO/, /AW/, /EY/, /OW/, etc., get mapped to the AL "VOW". In this article, we propose an automated technique to generate a phone-to-AL mapping in a target language based on the phone-to-AL mapping of a well-resourced source language. The proposed technique is based on the recently proposed
phone cluster adaptive training (Phone-CAT) method (Manohar et al., 2013). A brief overview of Phone-CAT, its properties, and how it is modified to include articulatory information are given in the subsequent sections.

3.1. Overview of Phone-CAT

Phone-CAT (Manohar et al., 2013) was inspired by a speaker adaptation technique called cluster adaptive training (CAT) (Gales, 1999). In CAT, a speaker-dependent HMM is formed by interpolating several speaker-cluster HMMs; the interpolation vector thus characterizes a speaker and its speaker-adapted HMM. In Phone-CAT, a context-dependent state GMM is formed by interpolating phone-cluster GMMs, so the interpolation vector represents a context-dependent state in terms of the phones of the language.

Consider a Phone-CAT model built for a language with P (mono)phones and J context-dependent states, as shown in Fig. 3. The P phone clusters are formed by a full-covariance maximum likelihood linear regression (MLLR) adaptation of a Gaussian mixture model-universal background model (GMM-UBM). Initially, this I-mixture GMM-UBM can be built from the entire speech data as a single-state GMM, or from a well-trained CDHMM model by clustering the mean parameters of the context-dependent states to the required number of Gaussian mixtures. The MLLR transform matrices W_p are initialized to identity matrices.
Fig. 2. Block schematic of extracting articulatory features from acoustic features.
Fig. 3. Block schematic of Phone-CAT acoustic model with “P” monophones, “J” context-dependent states and “I” mixtures in UBM.
The mean of the $i$th mixture component in the $p$th phone cluster can be written as

$$\boldsymbol{\mu}_{ip} = \mathbf{W}_p \begin{bmatrix} \boldsymbol{\mu}_i \\ 1 \end{bmatrix} = \mathbf{W}_p \boldsymbol{\xi}_i \tag{1}$$

where $\boldsymbol{\mu}_i$ is the canonical mean of the $i$th mixture of the GMM-UBM and $\boldsymbol{\xi}_i$ is its extended form, which takes advantage of the bias in the adaptation. The model complexity can be increased by using multiple MLLR transform matrices (transform classes) for different mixtures in the UBM, as described in Manohar et al. (2013).

All the mean parameters of the $P$ phone clusters are weighted by the interpolation vector $\mathbf{v}_j$ to obtain the mean parameters of the $j$th context-dependent state GMM:

$$\boldsymbol{\mu}_{ji} = \boldsymbol{\mu}_{1i}\, v_j^{(1)} + \boldsymbol{\mu}_{2i}\, v_j^{(2)} + \cdots + \boldsymbol{\mu}_{Pi}\, v_j^{(P)} = \sum_{p=1}^{P} \boldsymbol{\mu}_{ip}\, v_j^{(p)} = \mathbf{M}_i \mathbf{v}_j \tag{2}$$
The $j$th interpolation vector is initialized as a unit vector with 1 in the dimension corresponding to the center phone of the $j$th context-dependent state.
Fig. 4. Block diagram of Phone-CAT parameter estimation procedure.
The weight subspace concept was borrowed from the subspace Gaussian mixture model (SGMM) (Povey et al., 2011a). The weight subspace $\mathbf{W}$ is of dimension $I \times P$ and is initialized with all zeros. The row vector $\mathbf{w}_i$ of the weight subspace $\mathbf{W}$ is used, along with $\mathbf{v}_j$, to obtain the Gaussian weights $\omega_{ji}$ of the $j$th context-dependent state as follows:

$$\omega_{ji} = \frac{\exp\left(\mathbf{w}_i^T \mathbf{v}_j\right)}{\sum_{i'=1}^{I} \exp\left(\mathbf{w}_{i'}^T \mathbf{v}_j\right)} \tag{3}$$
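For illustration, the construction of a context-dependent state GMM from the phone clusters, Eqs. (1)-(3), can be written as a short numpy sketch; this is not the KALDI/Phone-CAT implementation, and the array names and shapes are our own assumptions.

```python
import numpy as np

def state_gmm_from_clusters(W, mu_ubm, v_j, weight_subspace):
    """Build the j-th context-dependent state GMM from phone clusters.

    W               : (P, D, D+1) MLLR transforms, one per phone cluster
    mu_ubm          : (I, D) canonical GMM-UBM means
    v_j             : (P,)  interpolation vector of the state
    weight_subspace : (I, P) rows w_i used for the mixture weights
    """
    I, D = mu_ubm.shape

    # Eq. (1): cluster means mu_ip = W_p [mu_i ; 1]
    xi = np.hstack([mu_ubm, np.ones((I, 1))])            # extended means (I, D+1)
    mu_clusters = np.einsum('pde,ie->pid', W, xi)        # (P, I, D)

    # Eq. (2): state means mu_ji = sum_p mu_ip v_j(p)  (= M_i v_j for each i)
    mu_state = np.einsum('pid,p->id', mu_clusters, v_j)  # (I, D)

    # Eq. (3): mixture weights from a softmax over w_i^T v_j
    logits = weight_subspace @ v_j
    omega = np.exp(logits - logits.max())
    omega /= omega.sum()
    return mu_state, omega

# Example with assumed sizes: P phones, I UBM mixtures, D-dimensional features.
P, I, D = 40, 512, 40
rng = np.random.default_rng(0)
W = np.tile(np.hstack([np.eye(D), np.zeros((D, 1))]), (P, 1, 1))  # identity init
mu_ubm = rng.standard_normal((I, D))
weight_subspace = np.zeros((I, P))        # initialized with zeros
v_j = np.zeros(P); v_j[12] = 1.0          # unit vector at the center phone
means, weights = state_gmm_from_clusters(W, mu_ubm, v_j, weight_subspace)
```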
The covariances of all the context-dependent states are tied to the GMM-UBM covariances ($\boldsymbol{\Sigma}_{ji} = \boldsymbol{\Sigma}_i^{\mathrm{UBM}}$). The estimation process starts with estimating $\mathbf{v}_j$, followed by $\mathbf{w}_i$, $\mathbf{W}_p$ and the UBM parameters, in that order, as described in Fig. 4. The alignments are taken from the CDHMM model for the first five iterations and from the previous iteration thereafter. The update equation for $\mathbf{v}_j$ is $\mathbf{v}_j = \mathbf{G}_j^{-1}\mathbf{k}_j$, where the accumulators $\mathbf{k}_j$ and $\mathbf{G}_j$ are given by

$$\mathbf{k}_j = \mathbf{y}_j + \sum_{i=1}^{I} \mathbf{w}_i \left[ \gamma_{ji} - \gamma_j\,\omega_{ji} + \max\!\left(\gamma_{ji}, \gamma_j\,\omega_{ji}\right) \mathbf{w}_i^T \mathbf{v}_j \right], \qquad \mathbf{G}_j = \sum_{i=1}^{I} \left[ \gamma_{ji}\,\mathbf{H}_i + \max\!\left(\gamma_{ji}, \gamma_j\,\omega_{ji}\right) \mathbf{w}_i \mathbf{w}_i^T \right] \tag{4}$$

where $\mathbf{H}_i = \mathbf{M}_i^T \boldsymbol{\Sigma}_i^{-1} \mathbf{M}_i$, and $\mathbf{y}_j$, $\gamma_{ji}$, $\gamma_j$ are sufficient statistics defined as

$$\mathbf{y}_j = \sum_{t,i} \gamma_{ji}(t)\, \mathbf{M}_i^T \boldsymbol{\Sigma}_i^{-1} \mathbf{o}_t \tag{5}$$

$$\gamma_{ji} = \sum_{t} \gamma_{ji}(\mathbf{o}_t) \tag{6}$$

$$\gamma_j = \sum_{i} \gamma_{ji} \tag{7}$$

A detailed description of the Phone-CAT acoustic modeling technique and the steps involved in estimating its parameters is given in Manohar et al. (2013).

3.2. Center-phone capturing property of Phone-CAT

This section discusses the properties of the interpolation vectors in Phone-CAT. Consider a context-dependent state formed by tying the center states of the triphones /l/−/ii/+/b/, /r/−/ii/+/b/, /l/−/ii/+/tx/, /l/−/ii/+/ch/, etc. Fig. 5a shows the interpolation vector of this context-dependent state as a function of its constituent phones. The center phone /ii/ has the maximum weight, and the phones occurring in the left or right context (/b/, /l/) also have significant weights. This characteristic of the interpolation vector of a context-dependent state, namely picking out the center phone, is referred to as the center-phone capturing property.

Table 2 shows a statistical analysis of the center-phone capturing property on Switchboard, Aurora-4 and the three Indian languages from the MANDI database. For each Indian language, the analysis was performed on two data sets, Train-high and Train-low, with approximately 22 h and 2 h of training data respectively; these data sets were later used in the low-resource experiments. The analysis used the interpolation vectors of all the context-dependent states in the model. As seen from Table 2, the three dominant peaks of the interpolation vectors almost always pick up the center phone and the left- and right-context phones. The significant differences in percentages between the Train-high and Train-low cases of the Indian languages arise because the Train-low sets were selected from the Train-high sets in such a way that each phone of the language is covered at least a minimum number of times. This characteristic of the interpolation vector, namely indicating the phone content of a state, is exploited in the next section to obtain the phone-to-AL mapping. The weights of the interpolation vector can be thought of as analogous to posterior probabilities of phone classes (e.g., obtained using a phone decoder). The main difference is that posterior probabilities are usually obtained at frame level, whereas interpolation vectors are estimated at state level and are therefore more robust.

3.3. Articulatory label based CAT

Articulatory label based CAT (AL-CAT) extends conventional Phone-CAT to include articulatory information. The AL-CAT model for a particular AL group has the corresponding AL as clusters, whereas Phone-CAT has phones as clusters. The articulatory characteristics of the center phone of each context-dependent state are captured by the interpolation vector in AL-CAT.

Consider the AL-CAT model for the AL group "Degree & Manner" shown in Fig. 6. There are six clusters representing the six AL in this group. A phone-to-AL mapping is needed to train this model, since it defines the identity of the context-dependent states in terms of AL. The data belonging to a particular articulatory label are thus used to estimate the MLLR transform of that AL cluster, and the AL clusters are then interpolated to form each context-dependent state.

The interpolation vector of AL-CAT provides the link between AL and phones, as shown in Fig. 5. The weights of the interpolation vector indicate the dominant articulatory label, while the interpolation vector itself corresponds to a particular phone (the center phone of the context-dependent state).
Fig. 5. Plot of the interpolation vector of the center state of the triphone "l-ii+b" from Phone-CAT and from AL-CAT for the AL groups "Place of articulation" and "Degree & Manner of articulation".

Table 2
Analysis of interpolation vectors of Phone-CAT: percentage of context-dependent states for which the first, second and third dominant peaks of the interpolation vector pick up the center phone or a left/right context phone, for the MANDI Assamese, Hindi and Tamil corpora (Train-low and Train-high sets), Aurora-4 and Switchboard (SWBD).
The interpolation vector of Phone-CAT picks up /ii/ as the center phone, as shown in Fig. 5a. /ii/ belongs to the vowel (VOW) category of the Degree & Manner AL group and to the NONE category of the Place of articulation AL group. The interpolation vectors from the AL-CAT models built for these AL groups correctly pick up these labels, as shown in Fig. 5b and c. Hence, the interpolation vectors of AL-CAT can be used to map a phone to a particular AL of an AL group, and this property is used to obtain a phone-to-AL mapping for a new language.

Therefore, to obtain the phone-to-AL mapping for a new language, we need the interpolation vectors of its context-dependent states under a well-trained AL-CAT. However, to train an AL-CAT in a new language, we need a phone-to-AL mapping in that language. This chicken-and-egg problem is overcome by training the AL-CAT on a well-resourced language, such as English, whose phone-to-AL mapping is readily available. This AL-CAT is then used to obtain the phone-to-AL mapping for the new language, as described in the next section.

Fig. 6. Block schematic of AL-CAT for the AL group "Degree & Manner".

3.4. Generating phone-to-AL mapping using class-based Phone-CAT
In this section, we describe how the existing knowledge of phone-to-AL mapping of a well-resourced language (e.g., English) can be used to obtain phone-to-AL mapping for a new language. The various steps involved in this method are given in Algorithm 2.
Algorithm 2: Steps involved in phone-to-articulatory feature mapping.
Consider the gth AL group (1 ≤ g ≤ 8) with R_g labels.
(A) Part 1
1. Train an AL-CAT model for the gth AL group on the source language, using the existing phone-to-AL mapping.
2. Initialize the gth AL group AL-CAT model in the target language by borrowing the UBM, W_rg and Cluster_rg (1 ≤ r ≤ R_g) of the AL-CAT model built in step 1. The context-dependent states 1 ≤ j ≤ J_T are taken from the target language itself, as shown in Fig. 7.
3. Initialize v_jg of the AL-CAT model in step 2 with equal weights (1/R_g) for all components, giving equal weight to each label in the AL group.
4. Estimate v_jg of the AL-CAT model in step 3 using the target language data.
(B) Part 2
5. Determine the center phone k (1 ≤ k ≤ Q) of each context-dependent state (1 ≤ j ≤ J_T) of the target language.
6. For each v_jg (1 ≤ j ≤ J_T), find the dominant component m (1 ≤ m ≤ R_g), corresponding to the dominant AL in the gth AL group, as shown in Fig. 8.
7. From steps 5 and 6, obtain J_T pairs (k, m), 1 ≤ k ≤ Q, 1 ≤ m ≤ R_g.
8. Generate a confusion matrix from the J_T pairs and obtain a mapping from each phone in the target language to an AL in the gth AL group, based on the combination that occurred most often.
9. Repeat steps 1-8 for each AL group given in Table 1.
Fig. 7. AF-based Phone-CAT model for performing phone-to-AL mapping in AL group “Degree & Manner”.
Fig. 8. Phone-to-AL mapping procedure using interpolation vectors (vj ) of AL-CAT.
The following notations are used in Algorithm 2:

Target language: the language for which the phone-to-AL mapping is to be determined.
Source language: the language for which the phone-to-AL mapping is known (e.g., English).
Q: number of phones in the target language.
J_T: total number of context-dependent states in the target language.
R_g: number of AL in the gth AL group.
W_rg: MLLR transform of the rth AL of the gth AL group AL-CAT from the source language.
Cluster_rg: mean cluster of the rth AL of the gth AL group AL-CAT from the source language.
v_jg: interpolation vector of the jth context-dependent state of the gth AL group AL-CAT in the target language.

The mapping procedure can be divided into two parts. In the first part, an AL-CAT is built for each AL group from a well-resourced source language, and its AL clusters are transferred to the AL-CAT of the target language; the interpolation vectors of the target language's AL-CAT are then trained over these clusters. In the second part, by virtue of the center-phone capturing property, the articulatory characteristics of the phones in the target language are obtained, which gives the phone-to-AL mapping in the target language. A sketch of this second part is given below.

In the case of diphthongs, we consider in step 5 of Algorithm 2 the specific state of the triphones belonging to a context-dependent state rather than the center phone alone, so that the transition from one articulatory position to another can also be captured. Many Indian languages contain vowels of shapes that have no direct counterpart in English, and several target-language vowel phones therefore get mapped to the same English vowel. Hence, we identify the target-language phones that were mapped to vowels and retain their original vowel labels from the target language.
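The second part of Algorithm 2 (steps 5-8) amounts to accumulating (center phone, dominant AL) pairs into a confusion matrix and taking a per-phone majority vote. The sketch below is a minimal illustration of that step; the inputs and label names are hypothetical.

```python
from collections import Counter, defaultdict
import numpy as np

def map_phones_to_al(center_phones, interp_vectors, al_names):
    """Steps 5-8 of Algorithm 2 for one AL group.

    center_phones  : list of length J_T, center phone of each context-dependent state
    interp_vectors : array (J_T, R_g), estimated interpolation vectors v_jg
    al_names       : list of length R_g, the AL of each cluster/component
    """
    counts = defaultdict(Counter)
    for phone, v in zip(center_phones, interp_vectors):
        dominant_al = al_names[int(np.argmax(v))]   # step 6: dominant component
        counts[phone][dominant_al] += 1             # step 7: accumulate (k, m) pairs
    # Step 8: for each phone, pick the AL that occurred most often
    return {phone: c.most_common(1)[0][0] for phone, c in counts.items()}

# Hypothetical example for the "Degree & Manner" group (R_g = 6 labels).
al_names = ["APP", "CLO", "FLAP", "FRIC", "VOW", "none"]
center_phones = ["iy", "iy", "s"]
interp_vectors = np.array([[0.1, 0.0, 0.0, 0.1, 0.7, 0.1],
                           [0.0, 0.1, 0.0, 0.2, 0.6, 0.1],
                           [0.0, 0.1, 0.0, 0.8, 0.0, 0.1]])
print(map_phones_to_al(center_phones, interp_vectors, al_names))
# {'iy': 'VOW', 's': 'FRIC'}
```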
4. Comparison with alternate approaches

In this section, we briefly describe two other phone-to-AL mapping techniques that we tried, both of which were outperformed by the proposed method.

4.1. Using AF extracted from the source language acoustic model

In this technique, the articulatory features for the target language data are extracted directly from AL-MLPs built using the source language data, and these frame-level features are then used to generate the final mapping. The steps involved are given in Algorithm 3. Since the articulatory features are obtained at frame level from articulatory classifiers built on another language, this technique is affected by environmental differences between the databases.

Algorithm 3: Steps involved in generating phone-to-AL mapping using articulatory features extracted from a high-resource acoustic model.
1. Build the AL-MLPs with source language data using the phone-to-AL mapping in that language.
2. For the target language, obtain the frame-level posterior probability (AF) of each AL using the AL-MLP built for that AL group.
3. For each AL group, find the most likely AL for each frame from the AF obtained in step 2.
4. For each AL group, compare the most likely AL of each frame with the original forced frame-level phone alignment.
5. For each phone in the target language, choose the most frequently co-occurring AL as its phone-to-AL mapping in the corresponding AL group.

4.2. Phone mapping techniques

We have recently proposed a technique based on the interpolation vectors of Phone-CAT to automatically map the phones of a target language to the phones of a source language (Basil Abraham et al., 2014). This automatic phone-to-phone mapping between the two languages can be combined with the phone-to-AL mapping of the source language to obtain the corresponding phone-to-AL mapping of the target language. However, this approach performed worse than the AL-CAT approach, especially when there was a mismatch in environmental conditions between the source and target language data. In summary, the advantages of the AL-CAT approach over the phone-to-phone-to-AL approach are as follows:
• Phones are specific to a language, and phone mapping techniques require a good phonetic overlap between the two languages to obtain a good mapping.
• The number of phones in a language is about 40, whereas the number of AL in an AL group is about 6-10. Therefore, the articulatory classifiers are more robustly estimated, since more data are available during training for each AL.

4.3. Discussion

The proposed phone-to-AL mapping technique is less affected by noise than AF extracted directly from articulatory classifiers trained on a different language. The proposed technique uses the speech data of all the segments belonging to a context-dependent state, drawn from many training examples, to perform the mapping, whereas direct articulatory feature extraction uses only a few neighboring frames to estimate the AF of a frame. The state-level mapping is therefore more robust to noise.

5. Experimental setup

This section describes the speech databases used in this study and the acoustic models built on them.

5.1. Speech databases

In this study, three Indian languages, Assamese, Hindi and Tamil, from the MANDI database are considered as target languages. The English databases Switchboard and Aurora-4 represent the data of our well-resourced language.

5.1.1. MANDI database

The MANDI database is a multilingual database consisting of six Indian languages (Basil Abraham et al., 2014).
Table 3
Details of MANDI database.

Language   Data set     Train (h)   Test (h)   # Phones   # Words
Assamese   Train-high   22.25       3.68       37         300
           Train-low    1.41
Hindi      Train-high   21.47       5.23       41         256
           Train-low    1.69
Tamil      Train-high   22.6        4.28       39         8187
           Train-low    2.04

# Phones: Number of phones in a language; # Words: Number of distinct words in the vocabulary.
The database was created for "Speech-based access to agricultural commodity prices," a Government of India project to build ASR systems in Indian languages that provide farmers with information about the prices of agricultural commodities in different markets. The database contains Assamese, Bengali, Hindi, Marathi, Tamil, and Telugu. In each language corpus, speech was collected from end users in their native language, and each corpus mainly contains names of markets and commodities in the state where that language is commonly spoken. The data were mostly collected outdoors and vary from quiet to very noisy environments. In our work, the Assamese, Hindi, and Tamil corpora were used. Assamese and Hindi are of Indo-Aryan Eastern and Indo-Aryan Central origin, while Tamil is of Dravidian origin. The details of these corpora are given in Table 3.

To perform low-resource experiments, each corpus was divided into two data sets: the Train-high set with approximately 22 h of training data and the Train-low set with approximately 2 h of training data. A common test set was used for both data sets. Each Train-low data set is a random subset of the corresponding Train-high data set. A dictionary was built based on the corresponding pronunciations in each language, with the phoneme set written in ARPAbet symbols. Among the three languages considered, the Assamese and Hindi databases consist of short phrases, while the Tamil database consists of short sentences. The Assamese and Hindi databases contain words from the recognizer's vocabulary, whereas the Tamil database also contains words outside the Tamil recognizer's vocabulary (i.e., words that are not names of commodities or districts). The number of distinct words in each database is given in Table 3. Listening tests showed that the Tamil database was the noisiest, followed by Hindi and Assamese.

5.1.2. English databases

The Switchboard-1 (Godfrey and Holliman, 1993) and Aurora-4 (Hirsch, 2002) databases were used as the English language databases. The phone-to-AL mapping for the phones in these two databases is given in Frankel et al. (2007). Switchboard-1 Release 2 (LDC97S62) consists of 2400 two-sided telephone conversations among 543 speakers from all over the United States. The Aurora-4 database was built by artificially adding noise to the Wall Street Journal database (Garofolo et al., 1993). A multi-condition training set was formed by adding random noise samples from one of six noise conditions to 3569 utterances recorded with a Sennheiser microphone and another 3569 utterances recorded with different microphones; no noise was added to a portion of the multi-condition set.

The SVitchboard-1 task (King et al., 2005) is a small-vocabulary task (10-500 words) defined over subsets of the Switchboard-1 corpus (Godfrey and Holliman, 1993) and partitioned into five subsets, denoted A-E. This data set is commonly used for research on articulatory features. Three subsets together constitute the training set, and the remaining two serve as the development and test sets respectively. The SVitchboard-1 task can be downloaded from http://www.cstr.ed.ac.uk/research/projects/svitchboard.

5.2. Baseline acoustic models

In this section, we provide details of the various acoustic models built for the Train-high and Train-low data sets of the Assamese, Hindi and Tamil corpora: continuous density hidden Markov model (CDHMM), SGMM, Phone-CAT and deep neural network (DNN) models. The configurations of the various acoustic models are listed in Table 4. All the experiments were performed using the open-source KALDI toolkit (Povey et al., 2011b). The speech waveform was parameterized into 13-dimensional MFCC features with delta and acceleration coefficients, and the baseline CDHMM monophone and triphone models were built using these features. The 13-dimensional MFCC features were then stacked over 7 frames and reduced to 40 dimensions by an LDA transform; we refer to these as LDA features. A CDHMM triphone model was built using these features, and using its alignments, the SGMM, Phone-CAT and DNN acoustic models were built on top of the LDA features. The recognition results of all the baseline acoustic models built using MFCC features are given in Table 5.

5.2.1. CDHMM
• Monophone Model. A hidden Markov model (HMM) with 3 states was built for each monophone representing a speech sound, and one more HMM with 8 states was added to represent non-speech frames. Each state was modeled with 14 Gaussian mixtures for the Train-high data sets and 4 Gaussian mixtures for the Train-low data sets.
• Triphone Model. The triphone models were built to incorporate co-articulation effects in speech. The triphone model uses state tying to reduce the number of context-dependent (tied) states; the numbers of context-dependent states are given in Table 4.
• LDA-MLLT Model. Using heteroscedastic linear discriminant analysis (HLDA) (Kumar and Andreou, 1998), the stack of frames t − 3 to t + 3 is transformed to a 40-dimensional LDA feature vector characterizing frame t. These more discriminative features are used to model the HMMs of the triphones. A sketch of this splicing and projection is given below.
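The following is a minimal illustration of the splicing and LDA projection used to form the 40-dimensional LDA features. It uses scikit-learn's standard LDA rather than the KALDI LDA+MLLT estimation applied in the paper, and the random data and state_labels are placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def splice(feats, left=3, right=3):
    """Stack each frame with its +/- context (edge frames are replicated)."""
    T, D = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(left + right + 1)])  # (T, 7*D)

# Placeholder data: T frames of 13-dim MFCCs with frame-level state labels.
T = 2000
mfcc = np.random.randn(T, 13)
state_labels = np.random.randint(0, 200, size=T)

spliced = splice(mfcc)                                 # (T, 91)
lda = LinearDiscriminantAnalysis(n_components=40)      # project to 40 dimensions
lda_feats = lda.fit_transform(spliced, state_labels)   # "LDA features"
```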
Table 4
Configurations of baseline acoustic models.

                           CDHMM              SGMM                      Phone-CAT
Language   Data set        # TS   # Mix/TS    # Mix   # TS    # SS      # Mix   # TS    # SS    # TC
Assamese   Train-22 h      847    14          512     1.6 k   3 k       512     1.6 k   3 k     2
           Train-2 h       221    4           64      0.3 k   0.5 k     184     0.3 k   0.5 k   1
Hindi      Train-22 h      1072   14          512     2.4 k   3 k       512     2 k     3.1 k   2
           Train-2 h       262    4           64      0.5 k   0.6 k     184     0.5 k   0.7 k   1
Tamil      Train-22 h      1004   14          512     0.3 k   0.3 k     512     1.8 k   2.4 k   2
           Train-2 h       199    4           64      0.3 k   1.2 k     80      1.8 k   3.8 k   2

# TS: Number of context-dependent states; # Mix/TS: Number of Gaussian mixtures per context-dependent state; # Mix: Number of Gaussian mixtures in the UBM of SGMM/Phone-CAT; # SS: Number of substates in SGMM/Phone-CAT; # TC: Number of transform classes in Phone-CAT.
Table 5
Recognition performances of acoustic models built with articulatory features extracted from articulatory classifiers built with the same data set.

Assamese
                                  Train-high                        Train-low
Feature type   Acoustic model     MFCC    AF alone   AF append      MFCC    AF alone   AF append
Plain          Mono               30.88   38.21      23.94          45.45   63.71      43.38
               Triphone           15.63   26.25      15.40          39.85   55.05      36.95
LDA            Triphone           14.30   14.58      14.46          37.26   38.87      32.17
               SGMM               13.75   13.52      13.28          33.27   33.80      30.25
               Phone-CAT          13.91   13.95      13.52          33.74   32.68      29.94
               DNN                13.09   12.03      12.15          36.56   32.45      31.78

Hindi
                                  Train-high                        Train-low
Feature type   Acoustic model     MFCC    AF alone   AF append      MFCC    AF alone   AF append
Plain          Mono               18.34   14.55      9.98           24.55   27.41      17.67
               Triphone           5.89    9.61       5.42           16.28   24.61      13.58
LDA            Triphone           5.55    5.07       4.68           14.59   12.28      11.03
               SGMM               4.83    3.84       3.76           13.76   10.65      9.85
               Phone-CAT          4.84    4.19       3.97           13.50   9.67       9.92
               DNN                4.12    3.65       3.05           13.86   10.04      10.11

Tamil
                                  Train-high                        Train-low
Feature type   Acoustic model     MFCC    AF alone   AF append      MFCC    AF alone   AF append
Plain          Mono               37.36   33.79      25.81          48.00   47.09      37.91
               Triphone           21.96   27.74      20.29          36.60   45.41      31.12
LDA            Triphone           19.77   19.52      17.55          33.81   31.86      30.82
               SGMM               19.08   17.89      17.52          33.42   30.48      28.06
               Phone-CAT          19.87   18.95      17.89          33.59   30.65      28.36
               DNN                20.26   18.68      17.15          34.18   31.86      31.81
5.2.2. Parsimonious acoustic models
• SGMM. The SGMM training was performed following the recipe files in the KALDI toolkit (Povey et al., 2011b). The SGMM was built over the LDA-MLLT model using LDA features. The number of Gaussian mixtures in the UBM as well as the numbers of context-dependent states and substates are given in Table 4.
• Phone-CAT. The Phone-CAT model described in Section 3.1 is also a subspace model, similar to SGMM. Phone-CAT differs from SGMM in the way the subspace is defined: the Phone-CAT model uses a subspace spanned by the mean vectors of the monophone GMMs of that language, whereas SGMM uses a lower-dimensional subspace that does not have any specific meaning. Phone-CAT training was done as in Manohar et al. (2013). The Phone-CAT model was also built over the LDA-MLLT model. The number of Gaussian mixtures in the UBM, the number of transform classes, and the numbers of context-dependent states and substates are given in Table 4.

5.2.3. DNN
The neural network was trained using the DNN training recipe files in the KALDI toolkit (Povey et al., 2011b). It was trained using LDA features, with labels obtained from the LDA-MLLT model. The features were stacked over a context window of +/− 5 frames to form an 11 × 40 dimensional input for the DNN.
• Pretraining. The RBM pretraining was performed as in Hinton (2010). The RBM had Gaussian-Bernoulli units in the first layer followed by Bernoulli-Bernoulli units in the higher layers. All the DNN models have 6 hidden layers with 2048 nodes in each layer. A learning rate of 0.01 was used for the Gaussian units and 0.4 for the Bernoulli units. The momentum parameter was set to 0.9, and the number of iterations was limited to 20 for Train-low and 4 for Train-high; for the first layer these numbers were doubled.
• Training. The DNN training was performed as in Hinton et al. (2012). The neural network was initialized layer by layer with the pretrained RBMs, and the pretrained network weights were updated during 20 iterations of back-propagation. The initial learning rate was 0.008 for the first three iterations and was halved every other iteration thereafter. The minibatch size was 256 and no momentum was used. The training data were divided into training (90%) and cross-validation (10%) sets; at each iteration, the frame classification accuracy computed on the validation set was used to decide whether to accept or reject the model.

5.2.4. Parameter count comparison
Consider a CDHMM model trained with D = 40 dimensional input features (LDA features), J = 1000 context-dependent states and M_j = 14 mixtures per context-dependent state. There are J × D × M_j = 560,000 mean parameters and as many diagonal covariance parameters. It also has J × M_j = 14,000 Gaussian weight parameters, giving a total of about 1.2 million parameters. The corresponding SGMM and Phone-CAT models have around 75% and 60% of the CDHMM parameters respectively. The DNN model described above, with 6 hidden layers of 2048 nodes each, has around 20 million parameters; we used L2 regularization and dropout training to avoid overfitting.
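The CDHMM parameter counts quoted above follow from a short calculation; the sketch below simply reproduces that arithmetic for the stated configuration.

```python
# CDHMM parameter count for D = 40, J = 1000 context-dependent states,
# M = 14 diagonal-covariance Gaussian mixtures per state.
D, J, M = 40, 1000, 14
means     = J * M * D   # 560,000 mean parameters
variances = J * M * D   # 560,000 diagonal covariance parameters
weights   = J * M       # 14,000 Gaussian weight parameters
total = means + variances + weights
print(total)            # 1,134,000  (about 1.2 million parameters)
```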
5.3. Articulatory feature extraction setup

In this experiment, the articulatory classifiers were built using all the training data available in a language (data set). The time-aligned labels obtained from the respective DNN model alignments and mapped to AL were used to train the articulatory classifiers. Each articulatory classifier consisted of 3 layers and was supplied with a stack of 11 filterbank output vectors, each comprising 23 log-energies. The classifiers for the AL groups Nasality, Rounding and Glottal state had a hidden layer of 1024 units, whereas the others had a hidden layer of 2048 units. The same configurations were used for both Train-low and Train-high, because reducing the complexity for Train-low led to poor performance. The articulatory features of all the classifiers were then stacked and processed as given in Algorithm 1.

6. Experiments and analysis

In this section, we describe the experiments performed with articulatory features obtained through the proposed phone-to-AL mapping technique. First, the proposed phone-to-AL mapping is validated for English, as described in Section 6.1. Then, phone-to-AL mappings were obtained with the proposed technique for Assamese, Hindi and Tamil using English as the source language. Using these mappings, articulatory classifiers were built for the Train-high and Train-low data sets of each language.

6.1. Validation of proposed technique

A manually obtained phone-to-AL mapping for English is given in Frankel et al. (2007). Using our proposed automated phone-to-AL mapping technique, we were able to replicate this mapping. An AL-CAT was trained on the 8 kHz Aurora-4 (Hirsch, 2002) multi-condition database with the phone-to-AL mapping of Frankel et al. (2007). A 33 h subset of the Switchboard database was considered as the target. We followed Algorithm 2 to obtain the phone-to-AL mapping for Switchboard, as shown in Fig. 7.

6.2. Phone-to-AL mapping for Indian languages

Algorithm 2 was followed to generate phone-to-AL mappings for Assamese, Hindi and Tamil, with either Aurora-4 or 110 h of Switchboard representing the source language. The mappings obtained from Aurora-4 and Switchboard were similar, and the mappings obtained from Train-high and Train-low were the same for each language. This is because the mapping procedure in Algorithm 2 estimates only the v_j parameters of AL-CAT; as the number of v_j parameters to be estimated is very small, even small amounts of training data give a good estimate.

6.3. Experiments with articulatory features

A phone-to-AL mapping was obtained as described in Section 6.2 for the Train-high and Train-low sets of Assamese, Hindi and Tamil and was then employed for training articulatory classifiers in each case. The articulatory features extracted from these classifiers were later used for training various acoustic models. Two sets of experiments were conducted using these features.

6.3.1. Articulatory features alone for acoustic modeling (AF alone)
The initial monophone and triphone models were trained with articulatory features and their delta and acceleration coefficients. The recognition performance of these models was inferior to that of the MFCC models due to the high dimensionality (approximately 132 dimensions). The articulatory features were then stacked over 7 frames and projected down to 40 dimensions by an LDA-MLLT transform; these LDA features gave superior recognition accuracy compared to MFCC features, as shown in Table 5. Articulatory features were also used directly for training DNNs with alignments from acoustic models built with MFCC features. However, systems obtained in this way
could not compete with systems trained on alignments generated by acoustic models working with articulatory features.

6.3.2. Articulatory features appended to MFCC for acoustic modeling (AF append)
The articulatory features, after a dimensionality reduction by principal component analysis (PCA) (as in Cetin et al., 2007; Frankel et al., 2007), were appended to the MFCC features with delta and acceleration coefficients in order to exploit the complementary information in the two feature types. By appending 26 articulatory dimensions (explaining 95% of the variance) to the 39 MFCC dimensions, 65-dimensional feature vectors were created. Experiments were performed both with these 65-dimensional plain features and with LDA features obtained by stacking the appended features over 3 frames and projecting to 40 dimensions with an LDA-MLLT transform. Stacking 3 frames gave better results than the conventional 7 frames for the appended features because of their larger dimensionality.

Table 5 gives the recognition performance of acoustic models built with articulatory features, for both the AF alone and AF append cases. In all cases, articulatory features were significantly superior to MFCC features, and acoustic models built with the appended features (AF append) always gave slightly better results than AF alone. Henceforth, we consider only articulatory features augmented with MFCC features in our subsequent experiments.

7. Extracting articulatory features for low-resource languages

In the previous sections, we described an approach to obtain the phone-to-AL mapping for a new language based on the phone-to-AL mapping of a well-resourced language such as English, so that good articulatory classifiers can then be built for the new language. In this section, we consider the specific case where sufficient training data are unavailable for the new language, so that efficient articulatory classifiers cannot be built even when the phone-to-AL mapping is generated. In this study, a language with a large amount of resources, such as transcribed speech and a good language model, is referred to as the high-resource/source language, and a language with a limited amount of resources as the low-resource/target language.

We propose that the articulatory features of a low-resource language can be obtained from the articulatory classifiers of closely related high-resource language(s) whose databases were collected in similar conditions. The steps involved in extracting articulatory features for a low-resource language are given in Algorithm 4. If a phone-to-AL mapping is available in the high-resource language, the articulatory classifiers are built directly; otherwise, the proposed phone-to-AL mapping technique is used to generate that mapping. Algorithm 4 is similar to Algorithm 1, except that the articulatory classifiers are built from a high-resource language and the AF are extracted for the target low-resource language. This approach is similar to the cross-lingual AF extraction technique discussed in Lal and King (2013).

From Table 5, it is evident that the relative improvements were much larger in the Train-low case than in the Train-high case, which shows that articulatory features are particularly useful in low-resource scenarios. A cross-lingual approach to articulatory feature extraction was therefore employed, with the Train-low data set of each Indian language used as the low-resource data set in our experiments.
Since many of the articulatory features overlap across languages, the articulatory classifiers can be trained using data from other languages for which a large amount of training data is available.
Algorithm 4: Steps involved in articulatory feature extraction for low-resource languages.
1. Training: Construct the articulatory classifier MLP for each of the AL groups using transcribed speech data and the phone-to-AL mapping from the high-resource language(s).
2. Usage: Concatenate the MLP outputs (posterior probabilities) obtained by forward-passing the low-resource language data and convert them to log-probabilities to obtain the final articulatory features.
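As a minimal illustration of the training step of Algorithm 4 when data from several high-resource languages are pooled (as in the experiments of Section 7.3), the sketch below trains a single AL-MLP on pooled data. The scikit-learn MLPClassifier, the corpora list and its shapes are our own stand-ins, not the MLP setup of Section 5.3.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_pooled_al_mlp(corpora, hidden_units=2048):
    """Train one AL-MLP for a single AL group on data pooled from
    several high-resource corpora.

    corpora : list of (features, al_labels) pairs, one per language,
              where features is (T, D) and al_labels is a length-T list
              of AL strings for that group.
    """
    X = np.vstack([feats for feats, _ in corpora])
    y = np.concatenate([labels for _, labels in corpora])
    clf = MLPClassifier(hidden_layer_sizes=(hidden_units,), max_iter=20)
    clf.fit(X, y)
    return clf

# Usage on a low-resource language: forward-pass its frames through the
# classifier trained above and keep the posteriors as articulatory features:
# af = clf.predict_proba(low_resource_feats)
```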
Cross-lingual experiments were conducted to improve the recognition performance of the low-resource data sets (Train-low) of the three languages, namely Assamese, Hindi and Tamil, using articulatory classifiers built from the high-resource data sets (Train-high) of Assamese, Hindi, Tamil and English (110 h of data from the Switchboard task). The AL-MLP configurations are the same as those in Section 5.3.

The proposed approach was first validated on SVitchboard-1. For this experiment, articulatory classifiers trained on either SVitchboard itself or 110 h of the Switchboard task were used for articulatory feature extraction on the SVitchboard 500-word set. The recognition results in Table 6 show that the articulatory features help when they are extracted from well-trained articulatory classifiers built from a large amount of data (Frankel et al., 2007).

Table 6
Recognition results (%WER) for the SVitchboard-1 500-word E set with articulatory classifiers from SVitchboard and Switchboard-110 h.

                                                  CDHMM                                DNN
Feature type                                      Monophone   Triphone   LDA-MLLT
MFCC                                              70.92       56.46      53.87         44.66
Articulatory classifiers from SVitchboard         63.19       48.76      48.31         45.73
Articulatory classifiers from Switchboard-110 h   56.68       43.83      41.07         36.75

Table 7 gives the recognition performance of the experiments performed on the low-resource Indian languages. The first column shows the data sets (and the corresponding amounts of data) used to build the articulatory classifiers. The other columns give the recognition word error rates (WER) of the CDHMM, Phone-CAT and DNN acoustic models built using the articulatory features extracted for the low-resource language with the corresponding articulatory classifiers. In the low-resource scenario, Phone-CAT gave better results than DNN in most cases. In the following sections, the results are compared with respect to the Phone-CAT acoustic model, and the results of the low-resource language experiments are explained.

7.1. Effect of "Amount of Training Data"

The second and third rows of Table 7 show the recognition performance obtained for the Train-low data set of each language using articulatory classifiers trained on the Train-low and Train-high data sets, respectively, of the same language. The results show that the efficacy of the articulatory classifiers depends on the amount of training data, which eventually affects the recognition performance of the extracted articulatory features. Similarly, in Table 6, the articulatory features extracted for the SVitchboard task with articulatory classifiers trained on 110 h of Switchboard gave superior performance compared to articulatory classifiers built from the SVitchboard data alone. In this work, we therefore try to approach the recognition performance of articulatory classifiers trained on the Train-high data set of the same language by using data from other high-resource languages.
7.2. Effect of "Varying Environmental Conditions"

The articulatory features extracted from articulatory classifiers built with the high-resource data sets of different languages gave varied recognition performance. This effect can be attributed to the differences in the environmental conditions under which the data were collected. This is evident from the fact that articulatory features extracted from classifiers built with the high-resource data set of the same language gave the best results, even after the removal of the low-resource data. Moreover, the articulatory features extracted from classifiers built from the 110 h Switchboard data set gave poor performance compared to using a high-resource data set from the same domain of application. The reverse was also true for articulatory features extracted for the SVitchboard data set using classifiers built with the other languages from the MANDI database.

7.3. Articulatory classifiers built with data pooling

In each cross-lingual experiment, articulatory features were extracted for a low-resource language with the other languages acting as high-resource languages. The articulatory classifiers were built with different combinations of high-resource languages, with and without the inclusion of the low-resource language. For example, with Hindi as the low-resource language, the articulatory classifiers were built with the Train-high data sets of Assamese and Tamil and their combinations, with and without the Train-low data set of Hindi. The major observations are as follows:
• The articulatory features extracted from articulatory classifiers built with larger amounts of data gave better performance.
• In some cases, when data were pooled from two languages to build the articulatory classifiers, the recognition accuracy lay midway between the accuracies obtained when they were used separately. This is likely due to the mismatch in the environmental noise present in the databases; the recognition rate was nevertheless better than that of the poorer-performing language of the pair. For example, for the Tamil Train-low data set, articulatory classifiers built from Hindi-22 h gave the best recognition results and Assamese-22 h the worst, while their combination gave a result in between the two. This indicates that, under matched environmental conditions, pooling data to build articulatory classifiers is beneficial, as shown in Lal and King (2013).
• Performance comparable to the high-resource scenario can be achieved even for the Train-low condition using articulatory classifiers built from other languages, as long as they have similar environmental conditions and a similar domain/task.

8. Conclusion

In this paper, we have proposed an automated technique to generate a phone-to-AL mapping for a target language using an existing phone-to-AL mapping from a well-resourced language. The proposed mapping technique uses the center-phone capturing property of the interpolation vectors of Phone-CAT. The mapping is performed at state level rather than at frame level, which makes the mapping more robust.
Table 7
Recognition performance for low-resource data sets with articulatory classifiers trained with various high-resource databases and their combinations. ∗ represents the baseline results with MFCC features; a bold symbol (in the original table) marks the low-resource language that was pooled along with the high-resource languages to build the articulatory classifiers.

Assamese
Data set for building articulatory classifiers    Mono-phone   Tri-phone   LDA-MLLT   Phone-CAT   DNN
MFCC baseline∗                                    45.45        39.85       37.26      33.74       42.99
Assamese-2 h                                      43.38        36.95       32.17      30.25       31.78
Assamese-22 h                                     32.33        29.31       25.31      21.90       24.80
Hindi-22 h                                        36.21        33.19       27.66      22.12       25.12
Hindi-22 h + Assamese-2 h                         33.70        31.03       27.63      21.36       24.69
Tamil-22 h                                        36.09        35.50       29.43      24.18       26.76
Tamil-22 h + Assamese-2 h                         35.93        31.90       27.63      24.41       28.76
Hindi-22 h + Tamil-22 h                           33.54        31.62       26.37      22.18       24.33
Hindi-22 h + Tamil-22 h + Assamese-2 h            33.35        31.03       26.45      21.28       24.92
Switchboard-110 h                                 38.68        35.85       33.35      26.92       35.50

Hindi
Data set for building articulatory classifiers    Mono-phone   Tri-phone   LDA-MLLT   Phone-CAT   DNN
MFCC baseline∗                                    24.55        16.28       14.59      13.50       14.17
Hindi-2 h                                         17.67        13.58       11.03      9.85        10.11
Hindi-22 h                                        12.42        9.78        7.93       7.11        7.25
Assamese-22 h                                     17.26        13.32       10.41      9.59        9.42
Assamese-22 h + Hindi-2 h                         14.93        12.84       10.00      8.84        8.70
Tamil-22 h                                        17.87        13.92       10.89      9.86        9.05
Tamil-22 h + Hindi-2 h                            16.83        12.69       10.04      8.71        8.52
Assamese-22 h + Tamil-22 h                        17.00        12.70       10.55      9.30        9.12
Assamese-22 h + Tamil-22 h + Hindi-2 h            14.99        11.78       9.11       8.21        8.17
Switchboard-110 h                                 18.69        15.6        12.52      10.66       11.10

Tamil
Data set for building articulatory classifiers    Mono-phone   Tri-phone   LDA-MLLT   Phone-CAT   DNN
MFCC baseline∗                                    48.00        38.60       33.81      33.59       35.29
Tamil-2 h                                         37.91        31.12       30.82      28.06       31.81
Tamil-22 h                                        28.97        25.07       22.75      22.68       24.93
Assamese-22 h                                     37.44        32.13       29.47      28.53       29.47
Assamese-22 h + Tamil-2 h                         36.99        30.21       27.57      27.17       30.01
Hindi-22 h                                        35.02        29.15       26.53      25.20       27.39
Hindi-22 h + Tamil-2 h                            33.27        28.92       26.09      25.47       28.31
Assamese-22 h + Hindi-22 h                        35.44        28.04       27.39      25.94       28.31
Assamese-22 h + Hindi-22 h + Tamil-2 h            34.50        29.64       27.34      26.16       30.06
Switchboard-110 h                                 38.06        32.82       28.60      27.47       32.11
8. Conclusion

In this paper, we have proposed an automated technique to generate a phone-to-AL mapping for a target language using an existing phone-to-AL mapping from a well-resourced language. The proposed mapping technique uses the center-phone capturing property of the interpolation vectors of Phone-CAT. The mapping is performed at the state level rather than at the frame level, which makes the mapping more robust. The proposed technique was applied to three Indian languages from the MANDI database, namely Assamese, Hindi and Tamil, using the phone-to-AL mapping available for English. The phone-to-AL mappings obtained from the proposed technique were used to build articulatory classifiers for extracting articulatory features. The articulatory features gave more than 30% relative improvement over the conventional MFCC features for all the languages. Articulatory features were also extracted in the low-resource scenario for the Train-low data sets, using AL-MLPs built from the Train-high data sets of the other languages and from 110 h of the Switchboard task. In all cases, the AL-MLPs built from the MANDI database outperformed those built from the Switchboard task. This might be due to differences in domain and channel, which limit the portability of articulatory classifiers under mismatched conditions. Under matched conditions, performance comparable to the well-resourced scenario was achieved with data from other languages.
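As an aid to implementation, the following is a minimal sketch of the state-level mapping step summarized above. It assumes that, for each context-dependent state of the target language, the Phone-CAT interpolation vector over the AL clusters of the well-resourced language and the state's center phone are available; the simple majority vote across states is an illustrative aggregation choice, and all names are hypothetical.

# Minimal sketch (illustrative, not the exact recipe): for every
# context-dependent state, read off the AL cluster with the largest
# interpolation weight, then aggregate these choices over all states
# that share the same center phone.
from collections import Counter, defaultdict
import numpy as np

def derive_phone_to_al(states, al_cluster_names):
    """states: iterable of (center_phone, interp_vector) pairs, one per
    context-dependent state; al_cluster_names[i] is the AL label of the
    i-th cluster of the well-resourced language."""
    votes = defaultdict(Counter)
    for phone, v in states:
        best_al = al_cluster_names[int(np.argmax(v))]   # dominant AL cluster
        votes[phone][best_al] += 1
    # majority vote over all states with the same center phone
    return {phone: counts.most_common(1)[0][0] for phone, counts in votes.items()}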
Acknowledgments

This work was supported in part by the consortium project titled "Speech-based access to commodity price in six Indian languages", funded by the TDIL program of DeitY, Government of India. The DIT-ASR consortium includes IIT Madras, IIT Bombay, IIT Guwahati, IIT Kanpur, IIIT Hyderabad, Tata Institute of Fundamental Research (TIFR) Mumbai, and Centre for Development of Advanced Computing (C-DAC) Kolkata. The authors would like to thank the members who helped in collecting the Assamese, Hindi and Tamil corpora.
References

Abraham, B., Joy, N.M., Navneeth, K., Umesh, S., 2014. A data-driven phoneme mapping technique using interpolation vectors of phone-cluster adaptive training. In: Proc. SLT, pp. 36–41.
Cetin, O., Kantor, A., King, S., Bartels, C., Magimai-Doss, M., Frankel, J., Livescu, K., 2007. An articulatory feature-based tandem approach and factored observation modeling. In: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 4. IEEE, pp. IV-645.
Çetin, Ö., Magimai-Doss, M., Livescu, K., Kantor, A., King, S., Bartels, C., Frankel, J., 2007. Monolingual and crosslingual comparison of tandem features derived from articulatory and phone MLPs. In: Automatic Speech Recognition & Understanding, 2007. ASRU. IEEE Workshop on. IEEE, pp. 36–41.
Deng, L., Sun, D., 1994. Phonetic classification and recognition using HMM representation of overlapping articulatory features for all classes of English sounds. In: Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, vol. 1. IEEE, pp. I-45.
Eide, E., Rohlicek, J.R., Gish, H., Mitter, S., 1993. A linguistic feature representation of the speech waveform. In: Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on, vol. 2. IEEE, pp. 483–486.
Elenius, K., Takács, G., 1991. Phoneme recognition with an artificial neural network. In: EUROSPEECH.
Ellis, D.P., Singh, R., Sivadas, S., 2001. Tandem acoustic modeling in large-vocabulary recognition. In: Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP'01). 2001 IEEE International Conference on, vol. 1. IEEE, pp. 517–520.
Erler, K., Freeman, G.H., 1996. An HMM-based speech recognizer using overlapping articulatory features. J. Acoust. Soc. Am. 100 (4), 2500–2513.
Frankel, J., Magimai-Doss, M., King, S., Livescu, K., Çetin, Ö., 2007. Articulatory feature classifiers trained on 2000 hours of telephone speech. In: Proc. Interspeech.
Gales, M., 1999. Cluster adaptive training of hidden Markov models. IEEE Trans. Speech Audio Process. 8, 417–428.
Garofolo, J., Graff, D., Paul, D., Pallett, D., 1993. CSR-I (WSJ0) Complete LDC93S6A. Linguistic Data Consortium.
Godfrey, J., Holliman, E., 1993. Switchboard-1 Release 2 LDC97S62. Linguistic Data Consortium.
Hinton, G., 2010. A practical guide to training restricted Boltzmann machines. Momentum 9 (1), 926.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al., 2012. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29 (6), 82–97.
Hirsch, G., 2002. Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends on a Large Vocabulary Task. Technical Report. Ericsson.
King, S., Bartels, C., Bilmes, J., 2005. SVitchboard 1: small vocabulary tasks from Switchboard. In: Annual Conference of the International Speech Communication Association, pp. 3385–3388.
King, S., Frankel, J., Livescu, K., McDermott, E., Richmond, K., Wester, M., 2007. Speech production knowledge in automatic speech recognition. J. Acoust. Soc. Am. 121 (2), 723–742.
Kirchhoff, K., Fink, G.A., Sagerer, G., 2002. Combining acoustic and articulatory feature information for robust speech recognition. Speech Commun. 37 (3), 303–319.
Kumar, N., Andreou, A.G., 1998. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26 (4), 283–297.
Lal, P., King, S., 2013. Cross-lingual automatic speech recognition using tandem features. IEEE Trans. Audio Speech Lang. Process. 21 (12), 2506–2515.
Manohar, V., Chinnari, B.S., Umesh, S., 2013. Acoustic modeling using transform-based phone-cluster adaptive training. In: Proc. ASRU, pp. 49–54.
Papcun, G., Hochberg, J., Thomas, T.R., Laroche, F., Zacks, J., Levy, S., 1992. Inferring articulation and recognizing gestures from acoustics with a neural network trained on x-ray microbeam data. J. Acoust. Soc. Am. 92 (2), 688–700.
Povey, D., Burget, L., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., Glembek, O., Goel, N.K., Karafiát, M., Rastrow, A., Rose, R.C., Schwarz, P., Thomas, S., 2011. The subspace Gaussian mixture model - a structured model for speech recognition. Comput. Speech Lang. 25 (2), 404–439.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K., 2011. The Kaldi speech recognition toolkit. In: Proc. ASRU.
Schmidbauer, O., 1989. Robust statistic modelling of systematic variabilities in continuous speech incorporating acoustic-articulatory relations. In: Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on, vol. 1. IEEE, pp. 616–619.
Schroeter, J., Sondhi, M.M., 1994. Techniques for estimating vocal-tract shapes from the speech signal. IEEE Trans. Speech Audio Process. 2 (1), 133–150.
Sivadas, S., Hermansky, H., 2004. On use of task independent training data in tandem feature extraction. In: Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP'04). IEEE International Conference on, vol. 1. IEEE, pp. I-541.
Thomas, S., Ganapathy, S., Hermansky, H., 2010. Cross-lingual and multi-stream posterior features for low resource LVCSR systems. In: INTERSPEECH, pp. 877–880.
Tóth, L., Frankel, J., Gosztolya, G., King, S., 2008. Cross-lingual portability of MLP-based tandem features - a case study for English and Hungarian. In: Proceedings of the 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22–26, 2008, pp. 2695–2698.