Engineering Applications of Artificial Intelligence 59 (2017) 15–22
Whispered speech recognition using deep denoising autoencoder

Đorđe T. Grozdić a,⁎, Slobodan T. Jovičić a,b, Miško Subotić b

a School of Electrical Engineering, University of Belgrade, Bulevar Kralja Aleksandra 73, 11000 Belgrade, Serbia
b Life Activities Advancement Center, Laboratory for Forensic Acoustics and Phonetics, Gospodar Jovanova 35, 11000 Belgrade, Serbia

⁎ Corresponding author. E-mail addresses: [email protected] (Đ.T. Grozdić), [email protected] (S.T. Jovičić), [email protected] (M. Subotić).

http://dx.doi.org/10.1016/j.engappai.2016.12.012
Received 23 May 2016; received in revised form 29 September 2016; accepted 12 December 2016
Keywords: Automatic speech recognition (ASR); Whispered speech; Deep denoising autoencoder (DDAE); Deep learning

Abstract
Recently, deep denoising autoencoders (DDAE) have shown state-of-the-art performance on various machine learning tasks. In this paper, the authors extend this approach to whispered speech recognition, one of the most challenging problems in automatic speech recognition (ASR). Owing to the profound differences between the acoustic characteristics of neutral and whispered speech, the performance of traditional ASR systems trained on neutral speech degrades significantly when whisper is applied. This mismatch between training and testing is successfully alleviated with the newly proposed system based on deep learning, in which a DDAE generates whisper-robust cepstral features. The system was tested and compared, in terms of word recognition accuracy, with a conventional hidden Markov model (HMM) speech recognizer in an isolated word recognition task on a real database of whispered speech (Whi-Spe). Three types of cepstral coefficients were used in the experiments: MFCC (mel-frequency cepstral coefficients), TECC (Teager-energy cepstral coefficients), and TEMFCC (Teager-energy-based mel-frequency cepstral coefficients). The experimental results showed that the proposed system significantly improves whisper recognition accuracy and outperforms the traditional HMM-MFCC baseline, resulting in an absolute 31% improvement of whisper recognition accuracy. The highest word recognition rate in whispered speech, 92.81%, was achieved with the TECC features.
1. Introduction

Modern automatic speech recognition (ASR) systems show good performance and evident commercial utilization, yet at the same time they display a considerable number of weaknesses and problems in practical use. These problems have to be solved in order to achieve more efficient, higher-quality human-computer interaction. Despite much research on improving ASR systems in recent years, in some situations these systems remain significantly sensitive and unreliable. ASR performance may be affected by various factors, including the type and quality of the input speech; individual speaker characteristics, such as speech rate and style, dialect, vocal tract anatomy, and the speaker's psycho-physical state; and influences originating from the surrounding environment, such as ambient noise, reverberation, and loudness. Apart from the neutral mode, speech also occurs in other modalities, such as emotional speech (Jovičić et al., 2004; Shahin, 2013), Lombard-effect speech (Bořil and Hansen, 2010), and whispered speech (Jovičić, 1998). All of these atypical speech modes present challenging problems for ASR.
Whisper is a specific form of verbal communication that is
frequently used in different situations. Firstly, it is employed to create a discreet and intimate atmosphere in conversation; secondly, it is used to protect confidential and private information from uninvolved parties. Speakers also whisper when they do not want to disturb other people, for example in a library or during a business meeting, and whisper appears in criminal activities as well, e.g. when criminals try to disguise their identity. Besides such conscious production, whispering may also occur due to health problems following rhinitis or laryngitis, or as a chronic disease of the larynx structures (Jovičić and Šarić, 2008). Whisper has become a research topic of interest, essentially important for speech technologies, mainly because of its substantial difference from normally phonated (neutral) speech, primarily due to the absence of glottal vibrations, its noisy structure, and its lower SNR (signal-to-noise ratio) (Morris, 2003). For real-world applications, an ASR system must operate in situations where it is not possible to control the speaker's speech mode, which may result in a serious mismatch between training and test conditions. Most current neutral-speech-oriented ASR interfaces are not capable of handling such an acoustic mismatch, so the performance of neutral-trained ASR systems degrades significantly when whisper is applied. Nonetheless, research on automatic
whisper recognition is still in its early stages, and there have been only a few studies so far. The literature shows several approaches proposed to alleviate this acoustic mismatch: model adaptation (Ito et al., 2005; Lim, 2011; Mathur et al., 2012; Yang et al., 2012), feature transformations (Yang et al., 2012), and alternative sensing technologies such as a throat microphone (Jou et al., 2004). There is also an audio-visual approach to isolated word recognition under whispered speech conditions (Tao and Busso, 2014). The vast majority of studies on whisper recognition use a GMM-HMM (Gaussian mixture density – hidden Markov model) system with traditional MFCC (mel-frequency cepstral coefficients) or PLP (perceptual linear prediction) features.
In this paper it is demonstrated that by using a state-of-the-art artificial intelligence technique, the deep denoising autoencoder (DDAE) (Mimura et al., 2015; Deng and Yu, 2014), a significant increase in whisper recognition performance can be attained without model adaptation. This new, efficient approach to whisper recognition applies a DDAE that transforms cepstral feature vectors of whispered speech into clean cepstral feature vectors of neutral speech. More specifically, feature extraction by the DDAE is achieved by training the deep neural network to predict original neutral-speech features from pseudo-whisper features that are artificially generated by inverse filtering of neutral speech data. The acquired cepstral features are concatenated with the original cepstral features and then processed by a conventional GMM-HMM recognizer in an isolated word recognition task. The main advantage of our approach is that whisper-robust features are easily acquired by a rather simple mechanism. Regarding the input feature vectors, three different types of cepstral coefficients were tested: the traditional MFCC and two more recent types that had not yet been tested on whispered speech recognition – the TECC (Teager-energy cepstral coefficients) and the TEMFCC (Teager-energy-based mel-frequency cepstral coefficients) (Dimitriadis et al., 2005).
The remainder of this study is organized as follows. Section 2 briefly reviews related work on whisper recognition. Section 3 introduces the speech corpus that was specially constructed for this study and describes the nature of whispered speech; the essential differences between neutral and whispered speech are presented and explained there. Section 4 introduces the general framework of the proposed ASR system and describes the feature extraction, the deep denoising autoencoder, and the GMM-HMM recognizer. Section 5 presents isolated word recognition experiments that evaluate the application of the DDAE and of different cepstral features in mismatched train/test conditions. Finally, conclusions are given in Section 6.
2. Related work

Automatic recognition of whispered speech is an ongoing and active field of research that is hindered by the lack of suitable and systematically collected corpora. Currently there are only a few databases of parallel neutral and whispered speech, collected for English (Ghaffarzadegan et al., 2014b; Tran et al., 2013; Zhang and Hansen, 2011), Japanese (Ito et al., 2005), Serbian (Marković et al., 2013), and Mandarin (Lee et al., 2014). Most of them have small or medium-sized vocabularies, and only a few are transcribed and phonetically balanced.
One of the first experiments with automatic whisper recognition was conducted by researchers at Nagoya University (Ito et al., 2005), who intended to develop a speech recognizer capable of handling whisper on cell phones in noisy conditions. Using the HMM technique and MFCC features, they analyzed different mismatched train/test scenarios with three speech modes: whisper, low-voice speech, and neutral speech. The results of these experiments show severe ASR degradation on mismatched data. The authors further confirmed that covering the mouth and the cell phone with a hand can increase the SNR in a noisy
environment, and may improve whisper and low-voice speech recognition to some extent. It was also shown that ASR systems trained on neutral speech can be adapted to whisper recognition using a small amount of whispered speech data; such a speaking-style-independent model yields a whisper recognition accuracy of about 66%.
Later studies of whisper recognition attempted to reduce this acoustic mismatch in different ways, for example through model adaptation (Galić et al., 2014b; Lim, 2011; Mathur et al., 2012; Yang et al., 2012) and feature transformations (Yang et al., 2012). Other studies focused on front-end feature extraction strategies (Ghaffarzadegan et al., 2014b; Zhang and Hansen, 2010) and on front-end filterbank redistribution based on sub-band relevance (Ghaffarzadegan et al., 2014b). The efficiency of vocal tract length normalization (VTLN) and of the shift transform (Bořil and Hansen, 2010) for whisper recognition was investigated in (Ghaffarzadegan et al., 2014a). Several studies deal with adapting ASR models to whisper recognition when small amounts of whisper are available (Ghaffarzadegan et al., 2014a, 2015); in (Ghaffarzadegan et al., 2014a) a vector Taylor series (VTS) based approach to generating pseudo-whisper adaptation samples was investigated. All of the above-mentioned papers use HMM-based ASR systems. Only two studies investigate neural-network-based approaches to whisper recognition (Grozdić et al., 2012; Lee et al., 2014), and there is an audio-visual approach to isolated digit recognition under whisper and neutral speech (Tao and Busso, 2014). In terms of feature extraction, only basic MFCC and PLP features have been tested for whisper recognition. Although each of these studies achieved some improvement, commercially applicable whisper recognition performance has not yet been demonstrated.
3. Whisper

3.1. Corpus of neutral/whispered speech

Given the lack of extensive, appropriate, and publicly available databases in this area, a special speech corpus for the Serbian language, named "Whi-Spe" (abbreviated from Whispered Speech), was developed for the purposes of this study. The corpus consists of two parts: 5000 audio recordings of isolated words spoken in neutral speech and 5000 recordings of the same words in whisper. It contains 50 different words, taken from the GEES speech database (Jovičić et al., 2004) and selected to balance their linguistic features. Both whispered and neutral speech were collected from 5 male and 5 female speakers with typical pronunciation of speech and whisper and correct hearing. Each speaker read all 50 words ten times in both speech modes, so the Whi-Spe corpus contains 10,000 recorded words. Recording was carried out under quiet laboratory conditions in a sound booth with a high-quality omni-directional microphone. Speech data were digitized at a sampling frequency of 22,050 Hz with 16 bits per sample in Windows PCM WAV format. More information about the Whi-Spe corpus can be found in (Marković et al., 2013).
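For illustration, a corpus with this structure (10 speakers × 50 words × 10 repetitions × 2 speech modes) could be loaded as sketched below. The directory layout and file naming are hypothetical (the actual Whi-Spe distribution may be organized differently); only the sample format (22,050 Hz, 16-bit PCM WAV) is taken from the text.

```python
import os
import wave
import numpy as np

# Hypothetical layout: whi_spe/<mode>/<speaker>/<word>_<take>.wav
CORPUS_ROOT = "whi_spe"

def load_wav(path):
    """Read a 16-bit PCM mono WAV file and return (signal, sample_rate)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2           # 16 bits per sample
        rate = w.getframerate()                # expected: 22050 Hz
        raw = w.readframes(w.getnframes())
    return np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0, rate

def iter_corpus(mode):
    """Yield (speaker, word, take, signal) for mode 'neutral' or 'whisper'."""
    mode_dir = os.path.join(CORPUS_ROOT, mode)
    for speaker in sorted(os.listdir(mode_dir)):      # 10 speakers
        spk_dir = os.path.join(mode_dir, speaker)
        for fname in sorted(os.listdir(spk_dir)):     # 50 words x 10 takes
            word, take = os.path.splitext(fname)[0].rsplit("_", 1)
            signal, _ = load_wav(os.path.join(spk_dir, fname))
            yield speaker, word, int(take), signal
```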
3.2. Characteristics of whispered speech

Whisper represents an alternative speech mode that differs considerably from neutral speech in its characteristics, nature, and generation mechanism. The main peculiarity of whisper's generation mechanism is the absence of vocal cord vibration. The vocal tract shape also differs from that of neutral speech, with a narrowed pharyngeal cavity and different placement of the tongue, lips, and mouth opening (Jovičić, 1998). As a result of the different vocal tract shape and the specific behavior of the articulation organs, whisper has distinctive acoustic characteristics. To describe more closely the acoustic mismatch that arises in ASR systems, several time-domain and spectral-domain examples are presented.
Fig. 1. Waveforms, spectrograms and spectra of the word /pijat͡sa/ ("market" in English) in normal (left) and whispered speech (right).
Fig. 1 compares the waveforms, spectrograms, and long-term average spectra of a word in neutral and whispered speech. The time waveforms show a clear difference in amplitude: owing to the lack of sonority, the amplitudes of voiced phonemes (chiefly vowels) are considerably lower in whisper, while the amplitudes of unvoiced phonemes show similar intensity in both modes (Ito et al., 2005). Whisper has a completely noisy structure and conspicuously lower energy. Whispered speech is also slightly longer in duration (more noticeably so in longer utterances), which is another of its characteristics (Fan and Hansen, 2011; Jovičić, 1998; Zhang and Hansen, 2007). The spectrograms reveal further differences between neutral and whispered speech. Although the vocal tract shape is different in whisper, the spectrograms preserve the most important spectral speech characteristics; despite the noisy structure, the spectral concentrations of some phonemes can still be observed. The most important spectral changes are seen in vowels: in whisper, the locations of the lower formants are shifted to higher frequencies (Ito et al., 2005; Jovičić, 1998; Kallail and Emanuel, 1984). In contrast to vowels, unvoiced consonants are not significantly changed in the spectral domain (Ito et al., 2005). The lack of sonority in whisper is also observed in the
long-term average spectra, in which whisper, in contrast to normal speech, has a much flatter spectral slope (Ito et al., 2005; Jovičić, 1998; Zhang and Hansen, 2007). All of these specific differences make whisper a particular challenge for traditional ASR systems, which have been designed primarily for normally phonated speech. Some of the differences can be reduced: differences in energy level, for example, can be handled simply by a closer microphone position while whispering, or by simple signal amplification under good SNR conditions. However, spectral differences (and thus cepstral differences), e.g. the differences in spectral slope, remain and require proper modification to be adjusted for efficient automatic recognition.
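As an aside, the flatter spectral slope discussed above can be quantified from the long-term average spectrum. The following is a minimal numpy sketch under simple assumptions (Hann-windowed frames, a linear fit of level in dB against log2-frequency); the frame sizes and the fmin cutoff are illustrative choices, not values from the paper.

```python
import numpy as np

def long_term_average_spectrum(signal, rate, frame_len=512, hop=256):
    """Average the magnitude spectra of all frames; returns (freqs, level_dB)."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))
    ltas_db = 20.0 * np.log10(np.mean(mags, axis=0) + 1e-10)
    return np.fft.rfftfreq(frame_len, d=1.0 / rate), ltas_db

def spectral_slope(freqs, ltas_db, fmin=100.0):
    """Spectral slope in dB/octave; whisper is expected to be flatter
    (slope closer to zero) than neutral speech."""
    keep = freqs >= fmin
    slope, _ = np.polyfit(np.log2(freqs[keep]), ltas_db[keep], 1)
    return slope
```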
4. Proposed ASR model for whisper recognition

A schematic diagram of the proposed ASR system is shown in Fig. 2. The framework consists of two feature extractors that process the speech signal and one GMM-HMM recognizer. The first feature extractor serves for MFCC, TECC, and TEMFCC cepstral feature extraction. The second feature extractor is a deep denoising autoencoder (DDAE), which filters out the effects of whispered speech and reconstructs clean neutral-speech features.
Fig. 2. Architecture of the proposed ASR system. The system is composed of: (a) Feature extractor, (b) Deep denoising autoencoder (DDAE) and (c) GMM-HMM recognizer.
Accordingly, the DDAE outputs are extracted to serve as whisper-robust cepstral features for mismatched whispered speech. Finally, the GMM-HMM recognizes isolated words from the combined cepstral feature sequences of both feature extractors.

4.1. Feature extraction

Three types of cepstral features are tested in this study: mel-frequency cepstral coefficients (MFCC), Teager-energy cepstral coefficients (TECC), and Teager-energy-based mel-frequency cepstral coefficients (TEMFCC). All features are extracted from 25 ms frames with a step size of 10 ms. The MFCCs are the traditional cepstral features used in ASR; their standard extraction procedure consists of the following steps: (1) the discrete Fourier transform (DFT) of a speech frame is computed, (2) the power spectrum is calculated, (3) a mel filterbank of 30 triangular filters is applied, (4) the log energy of each filter output is estimated, (5) the discrete cosine transform (DCT) is taken, giving the cepstral coefficients, and (6) only the first 12 MFCCs are retained. Finally, the feature vectors are channel-normalized using cepstral mean subtraction (CMS) to remove convolutional distortions caused by the characteristics of communication channels or recording devices; CMS is, however, only partially effective against additive environmental noise (Hahm et al., 2013).
The other two types of cepstral coefficients are relatively new and more robust to noise interference, and they had not previously been tested on whispered speech recognition. Their extraction algorithm is similar to that of the MFCCs, but differs in one significant respect: the energy estimation. For TECC and TEMFCC, the nonlinear Teager-Kaiser operator (Teager energy operator, TEO) (Kaiser, 1990, 1993; Teager, 1980) estimates the Teager energy in place of the standard energy calculation (standard energy operator, SEO – x(t)²). In addition to using the TEO, the TECC features differ from MFCC and TEMFCC in one more respect: a Gammatone filterbank (Qi et al., 2013) is used instead of the mel filterbank. So far, nonlinear TEO-based cepstral features (TECC and TEMFCC) have shown very promising results in ASR of quiet and unvoiced murmured speech, as well as in speech classification under stress and in noisy conditions (Dimitriadis et al., 2005; Heracleous, 2009; Zhou et al., 1998). Since whispered speech can be considered similar to non-audible murmur, noise-corrupted speech, and speech under stress, it was expected that the TECC and TEMFCC could be good descriptors of whispered speech. At the end, in order to describe speech signal dynamics more thoroughly, the MFCC, TECC, and TEMFCC feature vectors were concatenated with their first (delta) and second (delta-delta) time derivatives, so that every frame is represented by a 36-dimensional vector (12 static features + 12 delta + 12 delta-delta).
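To make steps (1)-(6) concrete, here is a minimal sketch of this kind of pipeline, assuming only numpy and scipy. It is an illustration rather than the authors' implementation: the filterbank construction and the delta computation (via np.gradient rather than the usual regression formula) are simplified, the Gammatone filterbank used by TECC is not shown, and the teager_energy helper shows only the TEO itself.

```python
import numpy as np
from scipy.fftpack import dct

def teager_energy(x):
    """Teager-Kaiser energy operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1).
    TECC/TEMFCC use this in place of the standard energy x(t)^2."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def mel_filterbank(n_filters=30, n_fft=1024, rate=22050):
    """30 triangular filters equally spaced on the mel scale (step 3)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(rate / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fb[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def mfcc(signal, rate=22050, n_coef=12, frame=0.025, step=0.010, n_fft=1024):
    """Steps (1)-(6) above, followed by cepstral mean subtraction (CMS)."""
    flen, fstep = int(frame * rate), int(step * rate)
    frames = [signal[i:i + flen] * np.hamming(flen)
              for i in range(0, len(signal) - flen, fstep)]
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2             # (1)-(2)
    logmel = np.log(power @ mel_filterbank(30, n_fft, rate).T + 1e-10)  # (3)-(4)
    ceps = dct(logmel, type=2, axis=1, norm="ortho")[:, :n_coef]        # (5)-(6)
    return ceps - ceps.mean(axis=0)                                     # CMS

def add_deltas(ceps):
    """Append first (delta) and second (delta-delta) time derivatives,
    giving the 36-dimensional vectors used in the paper."""
    d = np.gradient(ceps, axis=0)
    return np.hstack([ceps, d, np.gradient(d, axis=0)])
```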
4.2. Deep denoising autoencoder (DDAE)
The deep autoencoder is a special type of DNN (deep neural network) whose output is a reconstruction of the input, and it is commonly utilized for dimensionality compression and feature extraction (Bengio, 2009). Typically, an autoencoder has an input layer representing the original feature vectors, one or more hidden layers representing the transformed features, and an output layer matching the input layer for reconstruction. When the number of hidden layers is greater than one, as in our case, the autoencoder is considered deep (Deng and Yu, 2014). The dimension of the hidden layers can be either smaller than the input dimension, when the goal is feature compression, or larger, when the goal is mapping the features to a higher-dimensional space. An autoencoder tries to find a deterministic mapping between the input units x and the hidden nodes by means of a nonlinear function f_θ(x):
y = f_θ(x) = f_1(Wx + b),  (1)
where W is a d′×d weight matrix, b is a bias vector, and f_1(·) is a nonlinear function such as the sigmoid or tanh. This mapping is called the encoder. The latent representation y is then mapped back to reconstruct the input signal with:
z = f_θ′(y) = f_2(W′y + b′),  (2)
where W′ is a d×d′ weight matrix (W′ = Wᵀ), b′ is a bias vector, and f_2(·) is either a nonlinear function, such as the sigmoid or tanh, or a linear function. This mapping is called the decoder. The goal of training is to minimize the squared error function:
L(x, z) = ‖x − z‖².  (3)
To prevent the autoencoder from learning the trivial identity mapping, some constraints are usually applied during training, for example adding Gaussian noise to the input signal (as is done in our study) or using the "dropout" trick of randomly forcing certain input values to zero. Such an autoencoder is known as a denoising autoencoder (DAE) (Vincent et al., 2008; Mimura et al., 2015). A DAE shares the same structure as an autoencoder, but its input data is a deteriorated version of its output data. In other words, a DAE uses feature mapping to convert corrupted input data (x̂, the input signal) into clean output data (x, the teacher signal).
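Eqs. (1)-(3) and the denoising setup can be written out directly. The following is a minimal numpy sketch, assuming tied weights (W′ = Wᵀ), a sigmoid encoder, a linear decoder, and toy dimensions; it shows only the forward pass and the quantity a trainer would minimize, not the training itself.

```python
import numpy as np

# Toy dimensions: d-dimensional input, d'-dimensional hidden layer.
d, d_hidden = 36, 500
rng = np.random.default_rng(0)

W = rng.normal(0.0, 0.01, size=(d_hidden, d))   # encoder weights (d' x d)
b = np.zeros(d_hidden)                          # encoder bias
b_prime = np.zeros(d)                           # decoder bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x):            # Eq. (1): y = f_theta(x) = f1(Wx + b)
    return sigmoid(W @ x + b)

def decode(y):            # Eq. (2): z = f2(W'y + b'), tied W' = W.T, linear f2
    return W.T @ y + b_prime

def loss(x, z):           # Eq. (3): squared reconstruction error
    return np.sum((x - z) ** 2)

# Denoising setup: corrupt the input, reconstruct the clean target.
x_clean = rng.normal(size=d)
x_noisy = x_clean + rng.normal(0.0, 0.1, size=d)   # Gaussian corruption
print(loss(x_clean, decode(encode(x_noisy))))      # quantity minimized in training
```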
In our case, we use a deep DAE (DDAE) to transform cepstral feature vectors of whispered speech into clean cepstral feature vectors of neutral speech. The proposed DDAE architecture is illustrated in Fig. 2 and consists of two encoder layers (with 1000 and 500 nodes, respectively) with sigmoid functions and one decoder layer with a linear function. The input layer has 11×36 nodes, while the output layer has 36 nodes; that is, we use 11 contiguous frames of 36 cepstral coefficients as x̂ to encode, and only the corresponding middle frame of 36 clean cepstral features as x to fine-tune. The output 36-dimensional feature vector is concatenated with the original 36-dimensional feature vector from the feature extractor, and together they are fed to the GMM-HMM, as shown in Fig. 2. In terms of training, with parallel deteriorated and clean speech data (Fig. 3), the DDAE is pre-trained on pseudo-whisper cepstral features and fine-tuned on neutral-speech cepstral features. To be precise, in our experiments the pseudo-whisper samples are obtained by inverse filtering (Grozdić et al., 2014) of neutral speech samples and adding random Gaussian noise at 10 dB SNR. This pseudo-whisper data is used in the pre-training process, while the neutral speech data is applied for fine-tuning of the DDAE; this is the standard way to train a DDAE. There are several reasons for adding random noise. First, with added Gaussian noise the inverse-filtered signal becomes more similar to whisper in terms of its noisy nature, and a model learned this way is robust to the same kind of distortions in the test data. Second, since the noise is added randomly, the DDAE avoids learning the trivial identity solution. Third, each distorted input sample is different, which greatly increases the training set size and thus further alleviates the overfitting problem. In this way, the rich nonlinear structure of the DDAE can be used to learn an efficient transfer function that suppresses whisper characteristics in speech while keeping enough phonetically discriminative information to generate well-reconstructed neutral-speech features.
Fig. 3. Deep denoising autoencoder (DDAE) during training phase.
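For orientation, the topology and data preparation described above might look roughly as follows in Keras (assuming tensorflow is available). This is a sketch, not the authors' implementation: plain gradient-based training stands in for the RBM pre-training described next, and the helper names (add_noise_snr, context_windows) as well as the optimizer and epoch settings are illustrative.

```python
import numpy as np
from tensorflow.keras import layers, models

CONTEXT, DIM = 11, 36   # 11 contiguous frames of 36 cepstral coefficients

# The 396-1000-500-36 topology from the text.
ddae = models.Sequential([
    layers.Dense(1000, activation="sigmoid", input_shape=(CONTEXT * DIM,)),
    layers.Dense(500, activation="sigmoid"),     # second encoder layer
    layers.Dense(DIM, activation="linear"),      # linear decoder
])
ddae.compile(optimizer="adam", loss="mse")       # squared-error objective, Eq. (3)

def add_noise_snr(x, snr_db=10.0, rng=np.random.default_rng(0)):
    """Add Gaussian noise at a given SNR, as in the pseudo-whisper generation."""
    p_noise = np.mean(x ** 2) / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)

def context_windows(feats):
    """Stack 11 contiguous frames as the DDAE input x-hat."""
    half = CONTEXT // 2
    return np.array([feats[t - half:t + half + 1].ravel()
                     for t in range(half, len(feats) - half)])

# pseudo_whisper: inverse-filtered neutral features plus noise (input);
# neutral: the corresponding clean features (middle-frame targets):
# ddae.fit(context_windows(pseudo_whisper), neutral[5:-5], epochs=20, batch_size=128)
```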
Pre-training consists of learning a stack of restricted Boltzmann machines (RBMs), each having only one layer of feature detectors. After learning one RBM, the states of its hidden units given the training data can be used as the feature vectors for the second RBM, which can be learned in the same fashion with the contrastive divergence (CD) method (Hinton, 2002). The states of the hidden units of the second RBM can then be used as the feature vectors for the third RBM, and so on; this layer-by-layer learning can be repeated many times. After the pre-training, the RBMs are unrolled to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives. Backpropagation modifies the weights of the network to reduce the error between the network output and the teacher signal when pairs of signals (the input signal and the ideal teacher signal) are given.

4.3. GMM-HMM

We used a standard GMM-HMM system trained and tested with the hidden Markov model toolkit (HTK) (Young et al., 2002). The acoustic model contains 5 states in total (3 of which are emitting) with a strictly left-right structure and 16 mixture components. The number of training cycles in embedded re-estimation was restricted to 5, and the variance floor for the Gaussian probability density functions was set to 1%. The initial model parameters were estimated using the flat-start method, in which the models are initialized with the global mean and variance. In the testing phase, the Viterbi algorithm was applied to determine the model that best matched each test utterance.
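There is no one-line equivalent of this HTK setup, but a rough stand-in can be sketched with the hmmlearn package (our substitution; the paper itself used HTK). The diagonal covariances and the helper names are illustrative assumptions, and hmmlearn has no HTK-style non-emitting entry/exit states, so only the 3 emitting states are modeled.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def make_word_hmm(n_states=3, n_mix=16):
    """Left-right GMM-HMM with 3 emitting states and 16 mixtures per state."""
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=5,       # 5 re-estimation cycles
                   init_params="mcw", params="stmcw")      # keep the topology below
    model.startprob_ = np.array([1.0, 0.0, 0.0])           # always start in state 0
    model.transmat_ = np.array([[0.5, 0.5, 0.0],           # strictly left-right:
                                [0.0, 0.5, 0.5],           # only self-loops and
                                [0.0, 0.0, 1.0]])          # next-state transitions
    return model

# One model per vocabulary word; X_w stacks the 36-dim feature vectors of all
# training utterances of word w, lengths_w holds their frame counts:
# models = {w: make_word_hmm().fit(X_w, lengths_w) for w in vocabulary}

def recognize(models, feats):
    """Viterbi-score an utterance against every word model, pick the best."""
    return max(models, key=lambda w: models[w].decode(feats)[0])
```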
5. Experimental results

The following subsections present the average results of isolated word recognition obtained through 10-fold cross-validation, with an estimated SEM (standard error of the mean) that does not exceed 0.18%.

5.1. ASR performance evaluation – matched train/test scenarios

This section presents the results of speaker-dependent word recognition in the matched train/test scenarios – neutral/neutral and whisper/whisper – performed on the Whi-Spe database using the GMM-HMM speech recognizer without denoising-autoencoder features. The average speaker-dependent word recognition accuracies, along with their standard deviations, are presented in Table 1. As expected for matched data, recognition in both speech modes is very accurate, with a low inter-speaker standard deviation. Although a "ceiling effect" has been reached, the recognition accuracies show two slight tendencies. Firstly, the MFCC-based features show a somewhat lower recognition rate in both speech modes than the other two feature types. Secondly, in whisper recognition the TECC-based features give the best results (TECC: 99.01%; TECC+Δ: 99.85%; TECC+Δ+ΔΔ: 99.94%), suggesting that the Teager energy operator and the Gammatone filterbank better describe whisper characteristics. To verify these observations, the two-tailed Wilcoxon signed-rank test was used to analyze the statistical significance of the small differences in average word recognition accuracy between the MFCC features and the other cepstral features. The Z- and p-values of the Wilcoxon test are presented in Table 2. The results confirm, across all speakers, that whisper recognition based on Teager features shows a statistically significant (p < 0.05) improvement; the TECC+Δ+ΔΔ features show the highest statistical significance among the features (Z=−2.675; p=0.007). In the neutral/neutral scenario, however, the choice of features does not show any significant effect.

Table 1
Word recognition rate (%) in matched train/test scenarios achieved by the GMM-HMM system using different cepstral features.

Feature         Neutral/Neutral   σ      Whisper/Whisper   σ
MFCC            98.52             0.26   97.80             0.72
TECC            99.41             0.18   99.01             0.23
TEMFCC          99.27             0.22   98.13             0.49
MFCC+Δ          99.65             0.19   99.24             0.54
TECC+Δ          99.92             0.16   99.85             0.28
TEMFCC+Δ        99.86             0.28   99.28             0.43
MFCC+Δ+ΔΔ       99.77             0.15   99.53             0.21
TECC+Δ+ΔΔ       99.95             0.13   99.94             0.20
TEMFCC+Δ+ΔΔ     99.91             0.16   99.60             0.17
Table 2
Two-tailed Wilcoxon signed-rank tests comparing the word recognition rates of each feature set against MFCC in the matched train/test scenarios.

Feature         Neutral speech (neutral/neutral)   Whispered speech (whisper/whisper)
MFCC            –                                  –
TECC            Z=−0.604; p=0.546                  Z=−2.371; p=0.018
TEMFCC          Z=−1.298; p=0.194                  Z=−2.207; p=0.027
MFCC+Δ          Z=−1.604; p=0.109                  Z=−1.486; p=0.137
TECC+Δ          Z=−0.216; p=0.829                  Z=−2.151; p=0.031
TEMFCC+Δ        Z=−1.510; p=0.131                  Z=−2.207; p=0.027
MFCC+Δ+ΔΔ       Z=−1.134; p=0.257                  Z=−0.352; p=0.725
TECC+Δ+ΔΔ       Z=−0.677; p=0.498                  Z=−2.675; p=0.007
TEMFCC+Δ+ΔΔ     Z=−1.841; p=0.066                  Z=−2.536; p=0.011

(Confidence interval = 95%.)
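A comparison of this kind is straightforward to reproduce with scipy on paired per-speaker accuracies. A small sketch follows; the accuracy values are illustrative placeholders, not the paper's data, and note that scipy reports the Wilcoxon W statistic, whereas the Z-values in Table 2 come from the normal approximation of the same test.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-speaker whisper/whisper accuracies (%) for two feature sets --
# illustrative numbers for 10 speakers, not the paper's data.
acc_mfcc = np.array([97.2, 98.0, 96.8, 98.4, 97.5, 98.1, 97.0, 98.6, 97.9, 98.3])
acc_tecc = np.array([98.8, 99.2, 98.5, 99.4, 99.0, 99.3, 98.7, 99.5, 99.1, 99.2])

# Paired two-tailed test, as in Table 2 (two-sided is scipy's default).
stat, p = wilcoxon(acc_mfcc, acc_tecc)
print(f"W={stat:.1f}, p={p:.4f}")   # p < 0.05 -> significant difference
```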
Fig. 4. Word recognition rate (%) for different train/test scenarios in the case of the expanded feature sets (feature+Δ+ΔΔ).
5.2. ASR performance evaluation – mismatched train/test scenarios

The usual problem for ASR systems occurs when the speaker switches from neutral speech to whisper, or vice versa; for GMM-HMM systems this corresponds to mismatched train/test scenarios, and this experiment investigates exactly that situation. The analysis was carried out for whispered speech recognition with an ASR system trained to recognize neutral speech. The results achieved by the GMM-HMM without denoising-autoencoder features, averaged over all speakers, are presented in Table 3. Compared to the matched train/test scenarios, the mismatched scenario shows significantly lower word recognition accuracies. However, there are two important observations. Firstly, adding the dynamic features to the static features significantly improves word recognition. Secondly, the TECC-based features show a much higher word recognition rate than the other two feature types. With static features alone, TECC achieves 54.42% word recognition accuracy, compared to 47.85% with TEMFCC and 45.19% with MFCC. When the delta features are added, TECC+Δ again performs best, with 65.29% correct word recognition, in contrast to 59.94% (TEMFCC+Δ) and 59.67% (MFCC+Δ). The best performance is achieved when both delta and delta-delta features are used; in that case TECC+Δ+ΔΔ once again gives the best result, with a word recognition accuracy of 71.33%, compared to 63.71% (TEMFCC+Δ+ΔΔ) and 61.81% (MFCC+Δ+ΔΔ). In all three cases, therefore, the MFCC-based features provide the worst performance. These observations are statistically confirmed by the two-tailed Wilcoxon test (p-values are marked with asterisks in Table 3), which highlights once again that the TECC features best characterize whisper. It is also interesting to note that the inter-speaker standard deviation of word recognition accuracy is higher than in the matched scenarios. This phenomenon was also reported in (Galić et al., 2014a) and was later explained as a consequence of differences in the speakers' SNR during whispering; accordingly, the Teager-based features, which are more robust in low-SNR conditions, show lower standard deviations. A better insight into the relations between all train/test scenarios is given in Fig. 4, which presents the best results, obtained when both static and dynamic features are applied.
Table 3
Word recognition rate (%) in the mismatched (neutral/whisper) train/test scenario achieved by the GMM-HMM system using different cepstral features.

Feature         Neutral/Whisper   σ
MFCC            45.19             5.35
TECC            54.42***          4.46
TEMFCC          47.85*            3.95
MFCC+Δ          59.67             4.14
TECC+Δ          65.29***          3.56
TEMFCC+Δ        59.94*            3.33
MFCC+Δ+ΔΔ       61.81             5.32
TECC+Δ+ΔΔ       71.33**           2.13
TEMFCC+Δ+ΔΔ     63.71             2.20

(p < 0.05*; p < 0.01**; p < 0.006***; confidence interval = 95%.)
5.3. ASR performance evaluation – application of the deep denoising autoencoder

This section presents the results of the newly proposed ASR framework (illustrated in Fig. 2), which enables better whisper recognition in mismatched scenarios. For simplicity, only the best results, obtained with delta and delta-delta features, are presented and discussed in Table 4. The results demonstrate three things. Firstly, the fusion of standard cepstral features and DDAE-based cepstral features considerably improves whisper recognition in the mismatched condition while maintaining high performance in the matched neutral/neutral scenario. The word recognition rates obtained in the neutral/whisper scenario are 85.97%, 92.81%, and 87.73% for the MFCC, TECC, and TEMFCC features, respectively; the improvement over the results in Table 3 is statistically confirmed by the Wilcoxon signed-rank test. Secondly, the TECC features once again achieve the highest score, 92.81%, proving to be the best features for whisper recognition. Thirdly, the DDAE significantly reduces the inter-speaker standard deviation of word recognition accuracy in the mismatched scenario.
Table 4
Word recognition rate (%) achieved by the proposed ASR system with DDAE-based cepstral features.

Feature              Neutral/Neutral   σ      Neutral/Whisper   σ
MFCC+Δ+ΔΔ+DDAE       99.83             0.16   85.97**           1.2
TECC+Δ+ΔΔ+DDAE       99.94             0.14   92.81**           0.6
TEMFCC+Δ+ΔΔ+DDAE     99.87             0.13   87.73***          0.8

(p < 0.05*; p < 0.01**; p < 0.006***; confidence interval = 95%.)
6. Conclusion

Lately, deep learning, with its strong modeling power, has quickly spread the success of the DDAE across various industrial and machine learning tasks, such as image recognition, optical character recognition, computer vision, natural language and text processing, information retrieval, and automatic speech recognition. This paper extends that trend and demonstrates how a DDAE can be applied to a complex ASR problem such as whispered speech recognition. We propose a new, more advanced DDAE-based approach for isolated word recognition, which can also be applied to continuous speech recognition. Using a database of neutral and whispered speech (Whi-Spe), the article evaluates whispered speech recognition performance in mismatched train/test conditions and demonstrates how the DDAE can effectively filter out the effects of whisper and reconstruct neutral cepstral features, leading to a significant performance gain in whisper recognition. The efficiency of this approach was demonstrated through a comparative study with a conventional GMM-HMM speech recognizer and three types of cepstral features: MFCC, TECC, and TEMFCC. The experimental results confirmed that the proposed model has several advantageous characteristics: (i) a significantly lowered word error rate in mismatched train/test conditions together with high performance in matched train/test scenarios, (ii) easily acquirable whisper-robust features, and (iii) no need for real whisper data in the training process or for model adaptation. Furthermore, the experimental results demonstrate that the Teager-based cepstral features outperform the traditional MFCC features in whisper recognition accuracy by nearly 10%. Combining the TECC features with the DDAE approach gives the best results and shows that the proposed framework can considerably reduce recognition errors, improving whisper recognition by an absolute 31% over the traditional HTK-MFCC baseline and thereby achieving a commercially applicable word recognition accuracy of 92.81%.

Acknowledgment

This research was supported by the Serbian Ministry of Education, Science and Technological Development, within research projects TR 32032 and OI 178027.
References

Bengio, Y., 2009. Learning deep architectures for AI. Found. Trends Mach. Learn. 2, 1–127. http://dx.doi.org/10.1561/2200000006.
Bořil, H., Hansen, J.H.L., 2010. Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments. IEEE Trans. Audio Speech Lang. Process. 18, 1379–1393. http://dx.doi.org/10.1109/TASL.2009.2034770.
Deng, L., Yu, D., 2014. Deep learning: methods and applications. Found. Trends Signal Process. 7, 197–387.
Dimitriadis, D., Maragos, P., Potamianos, A., 2005. Auditory Teager energy cepstrum coefficients for robust speech recognition. In: Proceedings of the European Speech Processing Conference. Lisbon, Portugal, pp. 3013–3016.
Fan, X., Hansen, J.H.L., 2011. Speaker identification within whispered speech audio streams. IEEE Trans. Audio Speech Lang. Process. 19, 1408–1421. http://dx.doi.org/10.1109/TASL.2010.2091631.
Galić, J., Jovičić, S.T., Grozdić, Đ., Marković, B., 2014a. HTK-based recognition of whispered speech. In: Proceedings of the 16th International Conference on Speech and Computer, SPECOM 2014. Novi Sad, Serbia, pp. 251–258. https://dx.doi.org/10.1007/978-3-319-11581-8_31.
Galić, J., Jovičić, S.T., Grozdić, Đ., Marković, B., 2014b. Constrained lexicon speaker dependent recognition of whispered speech. In: Proceedings of the 10th International Symposium on Industrial Electronics INDEL. Banja Luka, BIH, pp. 180–184.
Ghaffarzadegan, S., Bořil, H., Hansen, J.H.L., 2014a. Model and feature based compensation for whispered speech recognition. In: Proceedings of INTERSPEECH 2014. Singapore, pp. 2420–2424.
Ghaffarzadegan, S., Bořil, H., Hansen, J.H.L., 2014b. UT-Vocal Effort II: analysis and constrained-lexicon recognition of whispered speech. In: Proceedings of ICASSP 2014. Florence, Italy, pp. 2544–2548. https://dx.doi.org/10.1109/ICASSP.2014.6854059.
Ghaffarzadegan, S., Bořil, H., Hansen, J.H.L., 2015. Generative modeling of pseudo-target domain adaptation samples for whispered speech recognition. In: Proceedings of ICASSP 2015. Brisbane, Australia, pp. 5024–5024.
Grozdić, Đ.T., Marković, B., Galić, J., Jovičić, S.T., 2012. Application of neural networks in whispered speech recognition. In: Proceedings of the 20th IEEE Telecommunications Forum (TELFOR). Belgrade, Serbia, pp. 728–731. https://dx.doi.org/10.1109/TELFOR.2012.6419311.
Grozdić, Đ.T., Jovičić, S.T., Galić, J., Marković, B., 2014. Application of inverse filtering in enhancement of whisper recognition. In: Proceedings of the 12th IEEE Symposium on Neural Network Applications in Electrical Engineering (NEUREL). Belgrade, Serbia, pp. 157–162. https://dx.doi.org/10.1109/NEUREL.2014.7011492.
Hahm, S., Bořil, H., Angkititrakul, P., Hansen, J.H.L., 2013. Advanced feature normalization and rapid model adaptation for robust in-vehicle speech recognition. In: Proceedings of the 6th Biennial Workshop on Digital Signal Processing for In-Vehicle Systems. Seoul, Korea, pp. 14–17.
Heracleous, P., 2009. Using Teager energy cepstrum and HMM distances in automatic speech recognition and analysis of unvoiced speech. Int. J. Inf. Commun. Eng. 5 (1), 31–37.
Hinton, G.E., 2002. Training products of experts by minimizing contrastive divergence. Neural Comput. 14, 1771–1800. http://dx.doi.org/10.1162/089976602760128018.
Ito, T., Takeda, K., Itakura, F., 2005. Analysis and recognition of whispered speech. Speech Commun. 45, 139–152. http://dx.doi.org/10.1016/j.specom.2003.10.005.
Jou, S., Schultz, T., Waibel, A., 2004. Adaptation for soft whisper recognition using a throat microphone. In: Proceedings of INTERSPEECH 2004. Jeju Island, Korea, pp. 5–8.
Jovičić, S.T., 1998. Formant feature differences between whispered and voiced sustained vowels. Acustica 84, 739–743.
Jovičić, S.T., Šarić, Z., 2008. Acoustic analysis of consonants in whispered speech. J. Voice 22, 263–274. http://dx.doi.org/10.1016/j.jvoice.2006.08.012.
Jovičić, S.T., Kašić, Z., Đorđević, M., Rajković, M., 2004. Serbian emotional speech database: design, processing and evaluation. In: Proceedings of the 9th International Conference on Speech and Computer, SPECOM 2004. St. Petersburg, Russia, pp. 77–81.
Kaiser, J.F., 1990. On a simple algorithm to calculate the 'energy' of a signal. In: Proceedings of ICASSP 1990. Albuquerque, USA, pp. 381–384.
Kaiser, J.F., 1993. Some useful properties of Teager's energy operators. In: Proceedings of ICASSP 1993. Minneapolis, USA, pp. 149–152. https://dx.doi.org/10.1109/ICASSP.1993.319457.
Kallail, K.J., Emanuel, F.W., 1984. Formant-frequency differences between isolated whispered and phonated vowel samples produced by adult female subjects. J. Speech Hear. Res. 27, 245–251. http://dx.doi.org/10.1044/jshr.2702.251.
Lee, P.X., Wee, D., Si, H., Toh, Y., Lim, B.P., Chen, N., Ma, B., 2014. A whispered Mandarin corpus for speech technology applications. In: Proceedings of INTERSPEECH 2014. Singapore, pp. 1598–1602.
Lim, B.P., 2011. Computational Differences between Whispered and Non-whispered Speech (Ph.D. thesis). University of Illinois at Urbana-Champaign, USA.
Marković, B., Jovičić, S.T., Galić, J., Grozdić, Đ., 2013. Whispered speech database: design, processing and application. In: Proceedings of the 16th International Conference on Text, Speech and Dialogue, TSD 2013. Pilsen, Czech Republic, pp. 591–598. https://dx.doi.org/10.1007/978-3-642-40585-3_74.
Mathur, A., Reddy, S.M., Hegde, R.M., 2012. Significance of parametric spectral ratio methods in detection and recognition of whispered speech. EURASIP J. Adv. Signal Process. 2012, 157. http://dx.doi.org/10.1186/1687-6180-2012-157.
Mimura, M., Sakai, S., Kawahara, T., 2015. Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature. EURASIP J. Adv. Signal Process. 2015, 62. http://dx.doi.org/10.1186/s13634-015-0246-6.
Morris, R., 2003. Enhancement and Recognition of Whispered Speech (Ph.D. thesis). School of Electrical and Computer Engineering, Georgia Institute of Technology, USA.
Qi, J., Wang, D., Jiang, Y., Liu, R., 2013. Auditory features based on Gammatone filters for robust speech recognition. In: Proceedings of the IEEE International Symposium on Circuits and Systems. Beijing, China, pp. 305–308. https://dx.doi.org/10.1109/ISCAS.2013.6571843.
Shahin, I., 2013. Speaker identification in emotional talking environments based on CSPHMM2s. Eng. Appl. Artif. Intell. 26, 1652–1659. http://dx.doi.org/10.1016/j.engappai.2013.03.013.
Tao, F., Busso, C., 2014. Lipreading approach for isolated digits recognition under whisper and neutral speech. In: Proceedings of INTERSPEECH 2014. Singapore, pp. 1154–1158.
Teager, H.M., 1980. Some observations on oral air flow during phonation. IEEE Trans. Acoust. Speech Signal Process. 28, 599–601. http://dx.doi.org/10.1109/TASSP.1980.1163453.
Tran, T., Mariooryad, S., Busso, C., 2013. Audiovisual corpus to analyze whisper speech. In: Proceedings of ICASSP 2013. Vancouver, Canada, pp. 8101–8105. https://dx.doi.org/10.1109/ICASSP.2013.6639243.
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P., 2008. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, ICML 2008. Helsinki, Finland, pp. 1096–1103.
Yang, C.Y., Brown, G., Lu, L., Yamagishi, J., King, S., 2012. Noise-robust whispered speech recognition using a non-audible-murmur microphone with VTS compensation. In: Proceedings of the 8th International Symposium on Chinese Spoken Language Processing, ISCSLP 2012. Hong Kong, China, pp. 220–223. https://dx.doi.org/10.1109/ISCSLP.2012.6423522.
Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2002. The HTK Book (for HTK Version 3.2), Technical Report. Cambridge University Engineering Department, England.
Zhang, C., Hansen, J.H.L., 2007. Analysis and classification of speech mode: whispered through shouted. In: Proceedings of INTERSPEECH 2007. Antwerp, Belgium, pp. 2396–2399.
Zhang, C., Hansen, J.H.L., 2010. Advancements in whisper-island detection using the linear predictive residual. In: Proceedings of ICASSP 2010. Dallas, USA, pp. 5170–5173. https://dx.doi.org/10.1109/ICASSP.2010.5495022.
Zhang, C., Hansen, J.H.L., 2011. Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing. IEEE Trans. Audio Speech Lang. Process. 19, 883–894. http://dx.doi.org/10.1109/TASL.2010.2066967.
Zhou, G., Hansen, J., Kaiser, J., 1998. Classification of speech under stress based on features derived from the nonlinear Teager energy operator. In: Proceedings of ICASSP 1998. Seattle, USA, pp. 2–5. https://dx.doi.org/10.1109/ICASSP.1998.674489.