Computer Speech and Language 27 (2013) 288–300
Analysis of the visual Lombard effect and automatic recognition experiments

Panikos Heracleous, Carlos T. Ishi, Miki Sato, Hiroshi Ishiguro, Norihiro Hagita
ATR, Intelligent Robotics and Communication Laboratories, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Kyoto-fu 619-0288, Japan
ATR, Hiroshi Ishiguro Laboratory, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Kyoto-fu 619-0288, Japan

Received 4 April 2011; received in revised form 16 April 2012; accepted 11 June 2012; available online 19 June 2012
Abstract

This study focuses on automatic visual speech recognition in the presence of noise. The authors show that, when speech is produced in noisy environments, articulatory changes occur because of the Lombard effect; these changes are both audible and visible. The authors analyze the visual Lombard effect and its role in automatic visual and audiovisual speech recognition. Experimental results using both English and Japanese data demonstrate the negative impact of the Lombard effect in the visual speech domain: if this factor is not considered when designing a lip-reading system, the performance of the system decreases. This is particularly important for audiovisual speech recognition in real noisy environments, where the recognition rates decrease both because of the acoustic noise and because of the Lombard effect. The authors also show that the performance of an audiovisual speech recognizer also depends on the visual Lombard effect and can be further improved when this effect is considered in the design of such a system.

Keywords: Lip-reading; Automatic speech recognition; Hidden Markov models (HMMs); Fusion; Noise robustness
1. Introduction

Speech is bi-modal in nature and includes the audio and visual modalities. Speech can be perceived using not only audio information but also the information provided by the mouth/face movements. Automatic visual speech recognition (i.e., automatic lip-reading) attempts to recognize the speech conveyed by the mouth/lips. However, since many sounds look similar on the mouth/lips (i.e., visemes), speech cannot be fully recognized using visual information alone. Instead, automatic visual speech recognition finds its main application in audiovisual speech recognition, where visual information complements the audio information to increase robustness against noise. In noisy environments, the talker increases the intelligibility of his/her speech (Lombard, 1911), and, during this process, several characteristics of speech change (the Lombard effect) (Bond and Moore, 1990; Castellanos and Casacuberta, 1996). As a result, the performance of an automatic speech recognizer operating in a noisy environment decreases not only because of the noise contamination but also because of these modifications (Junqua, 1993; Wakao et al., 1996; Hansen, 1996).
Fig. 1. Power spectrum of a normal, clean and a Lombard word.
Previously, in Heracleous et al. (2007), the role of the Lombard effect in non-audible murmur (NAM) recognition using a NAM microphone was investigated. A NAM microphone is a special acoustic sensor that is attached behind the talker's ear and can capture very quietly uttered speech (i.e., non-audible murmur). The results showed that, although a NAM microphone is very robust against noise, the recognition performance of a NAM recognizer decreases in noisy environments because of the Lombard effect.

Although many studies have addressed the problem of the Lombard effect in audio-only automatic speech recognition, only a few studies have addressed this issue with reference to automatic visual speech recognition. In Huang and Chen (2001), audiovisual speech recognition experiments using noisy and Lombard data were presented. That study also briefly mentioned that the Lombard effect is present not only in the audio channel but also in the visual channel, and a few results were presented. In Davis et al. (2006) and Garnier et al. (2006), the changes that occur in the visual correlates of speech articulation when speech is produced in noisy environments were considered. These studies presented results showing visual differences in the lip/mouth region when speech was produced in a noisy environment or when Lombard speech was used; however, they did not report analysis or experimental results related to visual speech recognition.

In this study, the authors comprehensively analyze the visual Lombard effect with respect to automatic visual and audiovisual speech recognition using real noisy data and Lombard data, going significantly beyond the previously limited studies. Specifically, several isolated word and continuous phoneme recognition experiments were conducted in both the Japanese and the English languages, using data from several speakers. In addition, two fusion methods were used to integrate the audio and visual streams in the audiovisual recognition experiments. The authors also show that, when designing an audiovisual speech recognition system, further improvements in the recognition rates can be achieved by considering the visual Lombard effect in the statistical model training.

2. Acoustic Lombard effect

When speech is produced in noisy environments, the speech production process is modified by a set of apparently preconscious behaviors called the Lombard effect. Specifically, because of the reduced auditory feedback, the talker attempts to increase the intelligibility of his/her speech. During this process, several characteristics of speech change: the intensity of speech increases, the fundamental frequency (F0) and the formants shift, the durations of vowels increase, and the spectral tilt changes. Because of these modifications, the performance of a speech recognizer decreases. The analysis of the Lombard effect is not trivial because it depends on the speaker, on the noise type, and on the noise level (Wakao et al., 1996; Chi and Oh, 1996). One way to investigate the Lombard effect is to analyze clean speech uttered while the speaker is listening to noise through headphones or earphones (i.e., Lombard speech). Even though Lombard speech does not contain any noise components, modifications in the speech characteristics can still be observed.
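The paper does not provide analysis code, but a minimal sketch of how two recordings of the same word might be compared along the dimensions mentioned above (intensity, F0, and spectral tilt) is given below. The file names and frequency ranges are hypothetical, and librosa is assumed to be available; this is an illustrative approximation, not the authors' procedure.

```python
# Sketch: compare intensity, mean F0, and spectral tilt of a clean vs. a Lombard
# recording of the same word. File names and parameter values are hypothetical.
import numpy as np
import librosa

def lombard_stats(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    # Overall intensity as mean RMS energy in dB.
    rms_db = 20 * np.log10(np.mean(librosa.feature.rms(y=y)) + 1e-10)
    # Fundamental frequency via the pYIN tracker (NaN for unvoiced frames).
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    mean_f0 = np.nanmean(f0)
    # Crude spectral tilt: slope of the log-magnitude spectrum between 50 Hz and 5 kHz.
    spec = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    band = (freqs > 50) & (freqs < 5000)
    tilt = np.polyfit(freqs[band], 20 * np.log10(spec[band] + 1e-10), 1)[0]
    return rms_db, mean_f0, tilt

for label, path in [("clean", "clean_word.wav"), ("lombard", "lombard_word.wav")]:
    rms, f0, tilt = lombard_stats(path)
    print(f"{label}: RMS {rms:.1f} dB, mean F0 {f0:.1f} Hz, tilt {tilt:.5f} dB/Hz")
```

Under the Lombard effect one would expect the RMS energy and mean F0 to increase and the spectral tilt to flatten relative to the clean recording.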
Fig. 1 shows an example of the power spectrum of a normal, clean word and of a Lombard word recorded while the speaker was listening to office noise through headphones at 75 dB(A). The example clearly illustrates the modifications caused by the Lombard effect: the power is increased, the formants are shifted, and the spectral tilt is changed. These differences in the spectra cause feature distortions (e.g., distortions in the Mel-frequency cepstral coefficients [MFCC]; Furui, 1986; Muralishankar and O'Shaughnessy, 2008), and, therefore, acoustic models trained without considering the Lombard effect might fail to correctly match speech affected by it.

In the current study, however, it is shown that not only acoustic changes but also changes in the lips/mouth occur when the Lombard effect is experienced. This phenomenon is called the "visual Lombard effect", and the visual data recorded in the presence of masking noise or real noise are called "visual Lombard data". These data are different from visual data recorded in a clean environment. Therefore, even if only acoustic noise is used in the experiments, both the audio and the visual modalities change. It would also be interesting to investigate whether visual noise results in phenomena similar to the Lombard effect.

3. Methods

3.1. Corpus and statistical modeling

In the current study, data from two languages (English and Japanese), two fusion methods, and both isolated word recognition and continuous phoneme recognition experiments were used, with the aim of demonstrating that the visual Lombard effect does not depend on these conditions. The audio and visual speech materials were recorded in a relatively quiet room [35 dB(A) noise level]. The audio signal was recorded at a sampling frequency of 16 kHz, in synchrony with the visual signal. A headset was used for the audio recording, and a Web camera was used for the video recording. To make the experimental conditions more realistic, auditory feedback was ensured (i.e., the talker could hear his/her own voice), because in real noisy situations the talker can still hear and control his/her voice. The recordings for each experiment were done in one session on the same day. The speakers were instructed to read the utterances displayed on a PC screen. The speakers did not practice before the recordings, and no specific constraints on the speakers' movements were applied during the recording procedure. Although the speakers read the utterances from a PC screen, they were instructed to pretend (i.e., imagine, or assume) that they were talking to a real person located in front of them. This procedure was selected because previous studies have reported that the Lombard effect is stronger when communication is present.

Several samples of speech recorded with masking noise were examined. It was observed that the speech characteristics were different from those of normal speech: the power was increased, the spectral tilt was changed, and the formants were shifted. These changes indicated that the speakers experienced the Lombard effect.

In the current study, hidden Markov model (HMM)-based speech recognition systems were used. HMMs are statistical models that have become the most popular speech models in automatic speech recognition, mainly because of their ability to characterize speech signals in a mathematically tractable way. An HMM consists of a finite set of states, in which each state is associated with a statistical distribution.
The states are connected, and these connections are characterized by their transition probabilities (Rabiner, 1989).

3.1.1. English database

For the English database, two male and two female speakers (M01, M02, F01, and F02) were instructed to read 50 English isolated words aloud. The task was chosen to include words used in a robot-based guidance system in a shopping center; such words include entrance, exit, left, right, water, milk, and tea. Each word was repeated six times under clean conditions, two times under babble noise at 70 dB(A) played back through a loudspeaker, two times while the speaker was listening to babble noise at 75 dB(A) through headphones, and two times while the speaker was listening to babble noise at 80 dB(A) through headphones. In total, each speaker uttered 600 words. For each speaker, the data were split as follows: (a) 150 clean words, 50 real noisy words at the 70 dB(A) noise level, and 50 Lombard words at the 80 dB(A) noise level for development; (b) 100 clean words and 100 Lombard words at 75 dB(A) for training; and (c) 50 clean words, 50 Lombard words at the 80 dB(A) noise level, and 50 real noisy words at the 70 dB(A) level for testing. The data from the four speakers were used to train speaker-dependent HMM sets and also a single HMM set (i.e., a multi-speaker set trained with the data from all the speakers). The acoustic parameter vectors had a length of 36 (12 MFCC, 12 ΔMFCC, and 12 ΔΔMFCC). The acoustic models were seven-state whole-word HMMs. Each HMM state was modeled with a mixture of four Gaussian components; the number of Gaussian components was selected experimentally to obtain the highest word accuracy.

3.1.2. Japanese database

The Japanese corpus differed from the English one in order to generalize the experiments to additional noise levels of real and Lombard speech. Moreover, in the Japanese case, continuous phoneme recognition experiments were conducted. Three speakers (one male and two females) were instructed to read aloud continuous sentences from the JNAS database (Ito et al., 1999). To obtain Lombard speech, the speakers listened to babble noise through headphones at 70, 75, and 80 dB(A) while uttering the sentences (i.e., Lombard data). In addition to the Lombard data, noisy data were also recorded to test the performance of an audiovisual speech recognition system. In this case, babble noise at 70 and 80 dB(A) was played back through loudspeakers while the talkers were uttering the sentences (i.e., real noisy data). Furthermore, 100 continuous sentences were uttered in a clean environment; thus, for each condition, each speaker read 100 sentences. Forty-three context-independent monophone HMMs were trained using data from each speaker. Each HMM state was modeled with a mixture of 16 Gaussian components; the number of Gaussians was selected experimentally to obtain the highest accuracy. For the development set, 5000 phone segments from each speaker were used; for training the clean HMMs (i.e., HMMs trained with speech recorded in the clean environment), 10,706 phonemes were used; and 2806 phone segments at each noise level from each speaker were used for testing. The acoustic parameter vectors had a length of 36 (12 MFCC, 12 ΔMFCC, and 12 ΔΔMFCC).

In both the English and the Japanese experiments, the HTK 3.4 toolkit was used for training and testing (Young et al., 2001; Young, 1994). For training, a flat-start scheme was applied, with all initial acoustic models having the same parameters. The number of Gaussian components was adjusted using incremental mixture splitting, and each mixture splitting was followed by five iterations of parameter re-estimation. For continuous phoneme recognition, a bi-gram language model constructed from the transcriptions was used. For the isolated word recognition experiments, it was assumed that all the words were equally likely.

Fig. 2. Face detection and the points estimated by the OKAO Vision system.

3.2. Visual parameter extraction

For lip-parameter extraction, the OKAO Vision commercial tool from the OMRON Corporation was used. Details concerning the methods applied in this tool can be found in Su et al. (2008). The OKAO Vision system carries out real-time detection and tracking of the face, mouth, and eyes, and at each time frame it provides the x–y coordinates of 38 points, as shown in Fig. 2. Using the 38 points provided, six lip parameters were computed: width (W), outer perimeter (C1), inner perimeter (C2), area (A), outer height (h1), and inner height (h2). Fig. 3 shows the lip parameters extracted in this study. To compensate for the speaker–camera distance and the pose of the head, the lip features were normalized by dividing them by the Euclidean distance between the midpoint of the eyes and the upper lip, which does not move much during speech production. The visual signal was recorded at a rate of 30 Hz, in synchrony with the audio signal. A 25-ms window shifted every 10 ms was used for the extraction of the acoustic parameters.
To obtain the same number of visual and audio samples, the visual samples were also interpolated using linear interpolation before multi-stream HMM decision fusion was carried out.
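A minimal sketch of this feature preparation step is given below, under the stated assumptions: 12 MFCCs with first- and second-order derivatives (36 dimensions) computed from 25-ms windows shifted by 10 ms, and linear interpolation of the 30-Hz visual features to the 100-Hz acoustic frame rate. File names are hypothetical, librosa is assumed to be available, and this is not the authors' implementation.

```python
# Sketch: 36-dimensional acoustic features and upsampling of visual features.
import numpy as np
import librosa

def acoustic_features(wav_path, sr=16000):
    """12 MFCC + delta + delta-delta from 25-ms windows shifted by 10 ms."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                    # shape: (num_frames, 36)

def upsample_visual(visual, n_audio_frames, visual_rate=30.0, audio_rate=100.0):
    """Linearly interpolate (num_visual_frames, dim) features to the audio rate."""
    t_vis = np.arange(visual.shape[0]) / visual_rate
    t_aud = np.arange(n_audio_frames) / audio_rate
    return np.column_stack([np.interp(t_aud, t_vis, visual[:, d])
                            for d in range(visual.shape[1])])

audio = acoustic_features("word.wav")                 # hypothetical recording
visual = np.load("word_lip_features.npy")             # hypothetical (frames, 6) array
visual_100hz = upsample_visual(visual, audio.shape[0])
assert audio.shape[0] == visual_100hz.shape[0]
```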
Fig. 3. Lip parameters used as features in the statistical modeling.
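For illustration, a sketch of how the six parameters shown in Fig. 3 and the distance-based normalization could be computed from tracked 2-D points follows. The point names used here are hypothetical; the actual OKAO Vision 38-point layout is not reproduced, so this is an assumption-laden approximation of the described procedure.

```python
# Sketch: six geometric lip parameters and normalization by the eye-to-upper-lip
# distance. The keys of the `frame` dictionary are hypothetical point names.
import numpy as np

def perimeter(points):
    """Length of the closed polygon through the given (N, 2) points."""
    return np.sum(np.linalg.norm(np.roll(points, -1, axis=0) - points, axis=1))

def polygon_area(points):
    """Shoelace formula for the area enclosed by a closed polygon."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * np.abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def lip_features(frame):
    """frame: dict mapping point names to (x, y) numpy arrays (hypothetical layout)."""
    outer, inner = frame["outer_lip"], frame["inner_lip"]       # contour point arrays
    w  = np.linalg.norm(frame["mouth_left"] - frame["mouth_right"])   # width W
    h1 = np.linalg.norm(frame["outer_top"] - frame["outer_bottom"])   # outer height
    h2 = np.linalg.norm(frame["inner_top"] - frame["inner_bottom"])   # inner height
    feats = np.array([h1, w, h2, perimeter(outer), perimeter(inner),
                      polygon_area(outer)])
    # Normalize by the eye-midpoint to upper-lip distance to compensate for
    # speaker-camera distance and head pose.
    eye_mid = 0.5 * (frame["left_eye"] + frame["right_eye"])
    ref = np.linalg.norm(eye_mid - frame["outer_top"])
    return feats / ref
```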
3.3. Fusion methods

In this section, the fusion methods used for the integration of the visual and audio streams are introduced. In audiovisual speech recognition, several fusion methods have been proposed (Nefian et al., 2002; Nakamura et al., 2002; Hennecke et al., 1996; Adjoudani and Benoît, 1996; Chen, 2001). Experiments are reported using multi-stream HMM decision fusion (state-synchronous) and late fusion (state-asynchronous). Concerning the synchronization of the visual and audio streams, previous studies have shown that there can be asynchrony between the two modalities of up to 120 ms (close to the duration of a phone). It has also been observed that speech intelligibility does not suffer when the visual signal artificially precedes the audio by up to 200 ms. In practice, both state-synchronous and state-asynchronous fusion methods are used in audiovisual speech recognition, and the reported results differ from case to case: in some studies state-synchronous methods are superior to state-asynchronous methods, and in other studies the situation is the opposite.

3.3.1. Multi-stream HMM decision fusion

For the integration of the audio and visual features in both the Japanese and the English experiments, multi-stream HMM decision fusion was used. Multi-stream HMM fusion is a state-synchronous decision fusion, which captures the reliability of each stream by combining the likelihoods of single-stream HMM classifiers (Potamianos et al., 2003). The emission likelihood of a multi-stream HMM is the product of the emission likelihoods of the single-stream components, weighted appropriately by stream weights. Given the combined observation vector O_t (i.e., the audio and visual components), the emission score of the multi-stream HMM is given by

b_j(O_t) = \prod_{s=1}^{S} \Big[ \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}(O_{st}; \mu_{jsm}, \Sigma_{jsm}) \Big]^{\lambda_s}    (1)
where \mathcal{N}(O; \mu, \Sigma) is the value at O of a multivariate Gaussian with mean \mu and covariance matrix \Sigma, and S is the number of streams. For each stream s, a mixture of M_s Gaussians is used, each weighted by c_{jsm}. The contribution of each stream is weighted by \lambda_s. In this study, it is assumed that the stream weights do not depend on the state j and the time t. In addition, the following constraint was applied:

\lambda_a = 1 - \lambda_v, \quad \forall \lambda_v \in (0, 1)    (2)

where \lambda_a is the audio stream weight and \lambda_v is the video stream weight.
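A minimal sketch of Eqs. (1) and (2) in the log domain is given below: the per-stream GMM log-likelihoods are weighted by the stream exponents and summed. The data structure holding the state parameters is hypothetical, and this is an illustration of the scoring rule rather than HTK's implementation.

```python
# Sketch: multi-stream emission score of Eq. (1), computed in the log domain.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(x, weights, means, covs):
    """log sum_m c_m N(x; mu_m, Sigma_m) for one stream of one HMM state."""
    logs = [np.log(c) + multivariate_normal.logpdf(x, mean=m, cov=S)
            for c, m, S in zip(weights, means, covs)]
    return np.logaddexp.reduce(logs)

def multistream_logscore(x_audio, x_visual, state, lambda_v):
    """state: dict with per-stream GMM parameters, e.g. state["audio"] = (weights, means, covs)."""
    lambda_a = 1.0 - lambda_v                     # constraint of Eq. (2)
    log_b_audio = gmm_loglik(x_audio, *state["audio"])
    log_b_visual = gmm_loglik(x_visual, *state["visual"])
    return lambda_a * log_b_audio + lambda_v * log_b_visual
```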
In these experiments, the weights were adjusted experimentally by maximizing the accuracy over several runs on the development data set. For clean speech, the audio stream weight was 0.8 and the video stream weight was 0.2. For noisy speech, the weights were optimized to 0.3 and 0.7 for the audio and visual streams, respectively.

3.3.2. Late fusion

Late fusion is a state-asynchronous integration method, and it was used in the English audiovisual experiments to integrate the audio and visual features. In the late fusion method, two single-modality HMM-based classifiers were used, one for the audio speech and one for the visual speech. For each test utterance (i.e., isolated word), the two classifiers provided an output list, which included all the word hypotheses along with their likelihoods.
Table 1
Word accuracy of automatic visual speech recognition [%].

Parameter set                                                                M01   M02   F01   F02   Average
Outer height, width                                                           36    37    53    22    37.0
Outer height, width, inner height                                             28    40    60    34    40.5
Outer height, width, inner height, outer perimeter                            35    42    53    31    40.3
Outer height, width, inner height, outer perimeter, inner perimeter           34    41    57    32    41.0
Outer height, width, inner height, outer perimeter, inner perimeter, area     31    44    61    35    42.8
Following that, all the separate unimodal hypotheses were combined into bi-modal hypotheses using the weighted likelihoods, as given by

\log P_{AV}(h, O) = \lambda_a \log P_A(h, O_A) + \lambda_v \log P_V(h, O_V)    (3)

where \log P_{AV}(h, O) is the score of the combined bi-modal hypothesis h, \log P_A(h, O_A) is the score of h provided by the audio classifier, and \log P_V(h, O_V) is the score of h provided by the visual classifier. \lambda_a and \lambda_v are the stream weights, with the constraint

\lambda_a = 1 - \lambda_v, \quad \forall \lambda_v \in (0, 1)    (4)
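A minimal sketch of this rescoring step is shown below: for each word hypothesis in the two output lists, the audio and visual log-likelihoods are combined as in Eqs. (3) and (4) and the best combined hypothesis is selected. The score values in the usage example are hypothetical.

```python
# Sketch: late fusion of per-word log-likelihoods from the audio and visual classifiers.
def late_fusion(audio_scores, visual_scores, lambda_v):
    """audio_scores, visual_scores: dicts mapping word hypotheses to log-likelihoods."""
    lambda_a = 1.0 - lambda_v                               # constraint of Eq. (4)
    combined = {h: lambda_a * audio_scores[h] + lambda_v * visual_scores[h]
                for h in audio_scores if h in visual_scores}
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical example with three word hypotheses from the 50-word task:
best, scores = late_fusion({"left": -210.0, "right": -215.0, "exit": -230.0},
                           {"left": -95.0, "right": -88.0, "exit": -120.0},
                           lambda_v=0.7)
print(best)
```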
4. Results

This section describes the results of the distribution analysis, as well as of the automatic visual and audiovisual recognition experiments.

4.1. Selection of visual parameters used in the statistical modeling

To select the most appropriate visual parameter set for the statistical modeling, several speaker-dependent experiments were conducted using different sets of visual parameters. Table 1 shows the word accuracy of the automatic visual recognition system for each speaker as a function of the visual parameter set in the case of the English data. The results show that, for speakers M02, F01, and F02, the set of all six visual parameters (i.e., outer height, inner height, width, outer perimeter, inner perimeter, and area) appears to be the most effective. For M01, different results were observed. Considering that a similar tendency was obtained in three of the four cases, it was decided to use all six visual parameters, along with their first- and second-order derivatives. It should be noted, however, that a different combination of the visual parameters might provide higher performance; because of the large number of combinations, it was not practical to try them all.
Fig. 4. Normalized outer height in the case of a clean and a Lombard word.

Table 2
Mean values of the visual parameters in the case of clean and Lombard speech.

Parameter          Clean    Lombard 70 dB(A)    Lombard 80 dB(A)
Outer height       0.474    0.488               0.492
Inner height       0.076    0.093               0.100
Width              0.753    0.773               0.777
Area               0.207    0.224               0.230
Outer perimeter    1.699    1.759               1.779
Inner perimeter    1.528    1.582               1.596
Fig. 5. Density functions in the clean and Lombard cases.
4.2. Analysis of the visual parameters

Fig. 4 shows the normalized outer height for a Japanese Lombard utterance and for the same utterance produced under clean conditions. As shown, larger values are observed for the Lombard sentence. Since these values are used as features in the statistical modeling, differences in recognition accuracy are to be expected. Table 2 shows the mean values of the normalized lip features over all the test data for Japanese clean speech and Lombard speech. In all cases, the parameters increase for Lombard speech, and the differences become larger as the noise level increases. Fig. 5 shows the density function of the outer height, computed by kernel density estimation using the fast Fourier transform (Silverman, 1982; Vlassis and Motomura, 2001), for the Japanese male speaker. The figure clearly shows the differences between clean speech, Lombard speech at the 70 dB(A) noise level, and Lombard speech at the 80 dB(A) noise level: as the noise level increases, the probability of observing higher values of the outer height increases further.
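A minimal sketch of this kind of density comparison is given below. The paper uses an FFT-based kernel density estimator (Silverman, 1982); a standard Gaussian KDE from scipy is used here as a simple stand-in, and the input file names are hypothetical.

```python
# Sketch: density functions of the normalized outer height for the three conditions,
# in the spirit of Fig. 5 (scipy's Gaussian KDE as a stand-in for the FFT-based estimator).
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

grid = np.linspace(0.3, 0.7, 400)             # plausible range of the normalized outer height
for label, path in [("clean", "outer_clean.npy"),
                    ("Lombard 70 dB(A)", "outer_lomb70.npy"),
                    ("Lombard 80 dB(A)", "outer_lomb80.npy")]:
    samples = np.load(path)                    # per-frame normalized outer-height values
    plt.plot(grid, gaussian_kde(samples)(grid), label=label)
plt.xlabel("Normalized outer height")
plt.ylabel("Density")
plt.legend()
plt.show()
```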
Table 3
Word accuracies [%] for visual speaker-dependent recognition experiments using English data.

           Clean HMMs                          Clean + Lombard 75 dB(A) HMMs
Speaker    Clean test   Lombard 80 dB(A)       Clean test   Lombard 80 dB(A)
M01        31           20                     36           37
M02        44           23                     47           31
F01        60           43                     64           45
F02        35           15                     33           29
Average    42.5         25.3                   45.0         35.5
4.3. Visual speech automatic recognition

Table 3 shows the results for visual speech recognition using the English data. The first two columns show the results obtained with clean HMMs. When Lombard visual speech at 80 dB(A) is tested with clean HMMs, significant decreases in word accuracy are obtained compared with the clean test data. Specifically, the word accuracy decreased from 31% to 20% for M01, from 44% to 23% for M02, from 60% to 43% for F01, and from 35% to 15% for F02. These results clearly show the negative impact of the Lombard effect on the automatic recognition of visual speech. The statistical significance of the results was tested using a paired t-test (Box, 1987). When clean HMMs were used, the two-tailed p-value comparing the clean test data and the Lombard test data was 0.0048, indicating that the difference is statistically significant.

To show the advantage of considering the visual Lombard effect in the statistical modeling, a multi-style training scheme similar to those of Lippmann et al. (1987) and Paul (1987), but with a different objective, was applied: fifty Lombard visual words from each speaker, recorded at the 75 dB(A) noise level (i.e., a noise level different from that of the Lombard test set), were added to the training data, and another experiment was conducted. The last two columns of Table 3 show the results obtained with the clean + Lombard HMMs. By adding Lombard data to the training data, the word accuracies increased drastically, especially for the Lombard test data. This solution might be an effective and simple way of dealing with the visual Lombard effect: the recognition rates can be improved by adding Lombard training data to the clean training data. The method could be further improved by including Lombard visual data recorded at different noise levels or with different kinds of noise. The two-tailed p-value for the Lombard test data when using clean HMMs versus clean + Lombard HMMs was 0.0327, showing that adding Lombard data to the training set significantly increased the accuracy on the Lombard test data. For the clean test data, the two-tailed p-value between the clean and the clean + Lombard HMMs was 0.1697; this difference is considered statistically not significant.

Fig. 6 shows the results obtained when all the data were used to train and test a common HMM set. With clean HMMs, the word accuracy was 40.1% for clean test data and 20.3% for Lombard test data. With HMMs trained on clean and Lombard data at 75 dB(A), the accuracy increased to 42.7% for clean test data and to 33.6% for Lombard test data at 80 dB(A). The word accuracies achieved in the multi-speaker experiment were lower than the mean word accuracies obtained in the speaker-dependent experiments. Although the number of speakers is not very large, the results indicate that speaker-independent visual and audiovisual speech recognition based on the proposed methods might be possible. Comparing the mean accuracies obtained in the speaker-dependent cases with the multi-speaker case, no significant differences were found.
An unpaired t-test was conducted to test the differences in accuracy between the multi-speaker experiments and the speaker-dependent experiments (i.e., their mean values). The results showed that the differences are statistically not significant (two-tailed p-value of 0.68). This result also indicates that speaker-independent experiments using a larger number of subjects might be possible.
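For illustration, the sketch below reproduces these two tests with scipy, using the per-speaker accuracies from Table 3 and the multi-speaker figures reported for Fig. 6. The exact pairing of values used by the authors is an assumption, though it yields p-values close to those reported.

```python
# Sketch: paired and unpaired t-tests over the reported word accuracies.
import numpy as np
from scipy import stats

clean_test   = np.array([31, 44, 60, 35])        # clean HMMs, clean test data (Table 3)
lombard_test = np.array([20, 23, 43, 15])        # clean HMMs, Lombard 80 dB(A) test data
t, p = stats.ttest_rel(clean_test, lombard_test)
print(f"paired t-test: p = {p:.4f}")             # approximately 0.005, as reported

multi_speaker = [40.1, 20.3, 42.7, 33.6]         # multi-speaker accuracies (Fig. 6)
speaker_dep   = [42.5, 25.3, 45.0, 35.5]         # speaker-dependent means (Table 3)
t, p = stats.ttest_ind(multi_speaker, speaker_dep)
print(f"unpaired t-test: p = {p:.2f}")           # approximately 0.68, as reported
```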
Fig. 6. Word accuracy for visual speech recognition using English data in a multi-speaker scheme.
Fig. 7. Phoneme accuracies for visual speech recognition using Japanese data in a multi-speaker scheme.
Fig. 7 shows the results of the Japanese multi-speaker automatic visual recognition experiments, illustrating the impact of the Lombard effect on the visual recognition of Japanese continuous monophones. Using clean test data, the phoneme accuracy was 43.7%. When the test data consisted of Lombard speech at 70 dB(A), the phoneme accuracy decreased to 38.6%, and it decreased further as the noise level increased: with Lombard test speech at 80 dB(A), the phoneme accuracy was only 32.3%. Using the multi-style training scheme, the same experiment was conducted with the inclusion of 300 visual Lombard utterances (i.e., 100 for each speaker) recorded at 75 dB(A). As in the English case, the noise level of the additional Lombard training data was different from that of the test data. As shown in Fig. 7, including Lombard visual training data increased the phoneme recognition accuracy.

4.4. Audiovisual speech automatic recognition

Fig. 8 shows the results of the multi-speaker automatic recognition experiments on audio (A), audiovisual (AV), and audiovisual Lombard (AV + LOMB) data for the English speakers when multi-stream HMM decision fusion was used to integrate the audio and visual streams. For clean audio test data, the word accuracies were very high in all cases because of the small vocabulary (i.e., 50 isolated words). With noisy test data, however, the word accuracies decreased drastically. After the visual samples were linearly interpolated, the audio and video streams were fused using multi-stream HMM decision fusion, and clean audiovisual HMMs were trained. In this case, the word accuracies for noisy test data increased markedly, whereas no differences were observed for clean test data. As in the previous experiments, 50 Lombard words from each speaker, recorded at the different noise level of 75 dB(A), were added to the clean training data. By adding Lombard training data to the clean data in the multi-style training scheme, the word accuracies of the audiovisual system increased further.
Fig. 8. Word accuracies for audiovisual multi-speaker speech recognition using English data and multi-stream fusion.
Fig. 9. Word accuracies for audiovisual multi-speaker speech recognition using English data and late fusion.
Fig. 10. Phoneme accuracies for audiovisual multi-speaker speech recognition using Japanese data and multi-stream fusion.
Fig. 9 shows the results of the English multi-speaker automatic recognition experiments using late fusion. As shown, the word accuracies for clean test data were the same as with multi-stream fusion. For noisy test data, however, the word accuracies were higher than those obtained with multi-stream fusion. An unpaired t-test gave a two-tailed p-value of 0.85, so this difference is considered statistically not significant. Fig. 10 shows the results of the automatic recognition of audio (A), audiovisual (AV), and audiovisual Lombard (AV + LOMB) data for the Japanese multi-speaker experiments. Using clean test data, a phoneme accuracy of 85.3% was obtained. When the clean models were tested with noisy speech, the phoneme accuracy decreased drastically: accuracies of 23.1% and 17.8% were obtained for noisy data at 70 dB(A) and 80 dB(A), respectively.
After the visual samples were linearly interpolated, the audio and video streams were fused using multi-stream HMM decision fusion, and clean audiovisual HMMs were trained. For clean speech, the phoneme accuracy increased moderately. For noisy speech, the phoneme accuracy increased from 23.1% to 40.5% at 70 dB(A) and from 17.8% to 38.6% at 80 dB(A). In addition, 50 Lombard sentences from each speaker recorded at 75 dB(A) were added to the clean training data, and Lombard audiovisual HMMs were trained. For the noisy data at 70 dB(A), the phoneme accuracy increased to 46.2%, and for the test data at 80 dB(A), it increased to 43%.

Late fusion can easily be applied in isolated word recognition tasks, and most studies that used late fusion reported isolated word recognition experiments (Adjoudani and Benoît, 1996; Su and Silsbee, 1996; Cox et al., 1997). In contrast, continuous phoneme recognition using late fusion faces many difficulties, and the results depend strongly on several approximations (e.g., the number of hypotheses that should be considered), because the number of possible hypotheses becomes prohibitively large (Potamianos et al., 2003). In this study, therefore, only multi-stream HMM decision fusion was used for Japanese continuous phoneme recognition, and late fusion was used as an additional fusion method in the English isolated word recognition experiments. The results obtained in the current study are comparable to those reported in Adjoudani and Benoît (1996) and Chu and Huang (2000); it should be noted, however, that the tasks and the languages differ, so the comparison might not be entirely fair.

5. Discussion

The current study focuses on the Lombard effect with respect to automatic speech recognition. When speech is produced in a real noisy environment, the speaker experiences the Lombard effect, which results in modifications of the speech characteristics. In automatic speech recognition in an adverse environment, the recognition rates therefore decrease not only because of the presence of noise but also because of these modifications of the speech. In audio-only automatic speech recognition, many studies have addressed the problem of the Lombard effect, and several solutions have been suggested (Chen, 1987; Hansen and Cairns, 1995; Suzuki and Nakajima, 1994). However, the problem of the Lombard effect in the visual modality has not been experimentally investigated in detail so far. Although speech cannot be perceived completely using visual information from the mouth/lips alone, automatic visual speech recognition has applications in audiovisual speech recognition and in lip synthesis. It is therefore important to analyze the behavior of automatic lip-reading in adverse environments as well.

In many audiovisual systems, audio speech is recorded in laboratory environments under relatively clean conditions. To better match noisy testing conditions, artificially noisy training data are created by superimposing noise onto the clean training data and re-training the system. This is reasonable when artificial data are used for testing. In real applications, however, the Lombard effect also appears and, therefore, some studies (Paul, 1987) also add audio Lombard data to the training data. The present study experimentally shows that the visual Lombard effect is also experienced by the talkers, resulting in decreased performance.
In the current study, the authors show that, because of the acoustic Lombard effect, the accuracy of a visual speech recognition system also decreases. This factor should therefore be considered when designing a lip-reading system, and it should not be assumed that the accuracy of such a system is unaffected by the presence of acoustic noise (as is assumed in almost all studies). In addition, an audiovisual system can be made more robust against environmental noise when not only the acoustic but also the visual Lombard effect is considered. A possible solution is to add visual Lombard data to the clean visual training data, or to record the noisy training data in real noisy conditions (artificially adding noise to clean training data is not sufficient). Multi-style training for audio speech recognition has already been reported in several studies; in this work, however, multi-style training for visual and audiovisual speech recognition using visual Lombard data is suggested. Although the results in two languages presented in the current study are promising, many problems remain. Issues that should be further investigated include speaker independence using a larger number of speakers, as well as the role of the noise, its type, and its level. Additional methods might also exist that offer a more efficient solution to the problem of the visual Lombard effect.

6. Conclusions

In this study, the role of the Lombard effect in visual and audiovisual automatic speech recognition is presented.
Speaker-dependent and multi-speaker isolated word and continuous phoneme recognition experiments conducted in both the English and the Japanese languages showed that the Lombard effect also has a negative impact in the visual domain. By considering the visual Lombard effect in the statistical model training, that is, by including Lombard training data in the training set, further improvements in the recognition rates of visual and audiovisual speech recognition systems could be achieved.

Acknowledgements

The authors thank Dr. Jani Even for helpful suggestions on density estimation. This work was supported by KAKENHI (21118003).

References

Adjoudani, A., Benoît, C., 1996. On the integration of auditory and visual parameters in an HMM-based ASR. In: Stork, D.G., Hennecke, M.E. (Eds.), Speechreading by Humans and Machines. Springer, Berlin, Germany, pp. 461–471.
Bond, Z.S., Moore, T.J., 1990. A note on loud and Lombard speech. In: Proc. of the International Conference on Speech and Language Processing, pp. 969–972.
Box, J.F., 1987. Guinness, Gosset, Fisher, and small samples. Statistical Science 2 (1), 45–52.
Castellanos, A., Casacuberta, J.M.B., 1996. An analysis of general acoustic-phonetic features for Spanish speech produced with Lombard effect. Speech Communication 20, 23–36.
Chen, T., 2001. Audiovisual speech processing: lip reading and lip synchronization. IEEE Signal Processing Magazine 18 (1), 9–21.
Chen, Y., 1987. Cepstral domain stress compensation for robust speech recognition. In: Proc. of ICASSP, pp. 717–720.
Chi, S.-M., Oh, Y.-H., 1996. Lombard effect compensation and noise suppression for noisy Lombard speech recognition. In: Proc. of ICSLP 4, pp. 2013–2016.
Chu, S., Huang, T., 2000. Bimodal speech recognition using coupled hidden Markov models. In: Proc. of the International Conference on Spoken Language Processing II, pp. 747–750.
Cox, S., Matthews, I., Bangham, A., 1997. Combining noise compensation with visual information in speech recognition. In: Proc. of the European Tutorial Workshop on Audio-Visual Speech Processing, pp. 53–56.
Davis, C., Kim, J., Grauwinkel, K., Mixdorff, H., 2006. Lombard speech: auditory (A), visual (V) and AV effects. In: Proc. of Speech Prosody.
Furui, S., 1986. Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech, and Signal Processing 34 (1), 52–59.
Garnier, M., Bailly, L., Dohen, M., Welby, P., Levenbruck, H., 2006. An acoustic and articulatory study of Lombard speech: global effects on the utterance. In: Proc. of Interspeech 2006-ICSLP, pp. 2246–2249.
Hansen, J., 1996. Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Communication, Special Issue on Speech Under Stress 20 (2), 151–170.
Hansen, J., Cairns, D., 1995. ICARUS: source generator based real-time recognition of speech in noisy stressful and Lombard effect environments. Speech Communication 16, 391–422.
Hennecke, M.E., Stork, D.G., Prasad, K.V., 1996. Visionary speech: looking ahead to practical speechreading systems. In: Stork, D.G., Hennecke, M.E. (Eds.), Speechreading by Humans and Machines. Springer, Berlin, Germany, pp. 331–349.
Heracleous, P., Kaino, T., Saruwatari, H., Shikano, K., 2007. Unvoiced speech recognition using tissue-conductive acoustic sensor. EURASIP Journal on Advances in Signal Processing, 2007.
Huang, F.J., Chen, T., 2001. Consideration of Lombard effect for speechreading. In: IEEE Fourth Workshop on Multimedia Signal Processing, pp. 613–618.
Ito, K., Yamamoto, M., Takeda, K., Takezawa, T., Matsuoka, T., Kobayashi, T., Shikano, K., Itahashi, S., 1999. JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research. The Journal of the Acoustical Society of Japan 20, 196–206.
Junqua, J.-C., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. Journal of the Acoustical Society of America 1, 510–524.
Lippmann, R., Martin, E., Paul, D., 1987. Multi-style training for robust isolated-word speech recognition. In: Proc. of ICASSP87, pp. 705–708.
Lombard, A.E., 1911. Le signe de l'élévation de la voix. Annales des Maladies de l'Oreille, Larynx, Nez, Pharynx 37, 101–119.
Muralishankar, R., O'Shaughnessy, D., 2008. A comparative analysis of noise robust speech features extracted from all-pass based warping with MFCC in a noisy phoneme recognition. In: The Third International Conference on Digital Telecommunications, pp. 180–185.
Nakamura, S., Kumatani, K., Tamura, S., 2002. Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition. In: Proc. of the Fourth IEEE International Conference on Multimodal Interfaces (ICMI'02), pp. 305–309.
Nefian, A.V., Liang, L., Pi, X., Xiaoxiang, L., Mao, C., Murphy, K., 2002. A coupled HMM for audio-visual speech recognition. In: Proc. of ICASSP, pp. 2013–2016.
Paul, D.B., 1987. A speaker-stress resistant HMM isolated word recognition. In: Proc. of ICASSP87, pp. 713–716.
Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A., 2003. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91 (9), 1306–1326.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), 257–286.
Silverman, B., 1982. Kernel density estimation using the fast Fourier transform. Journal of the Royal Statistical Society, Series C: Applied Statistics 31 (1), 93–99.
Su, Q., Silsbee, P., 1996. Robust audiovisual integration using semicontinuous hidden Markov models. In: Proc. of the International Conference on Spoken Language Processing, pp. 42–45.
Su, Y., Ai, H., Lao, S., 2008. Real-time face alignment with tracking in video. In: Proc. of ICIP, pp. 1632–1635.
Suzuki, T., Nakajima, K., 1994. Isolated word recognition using models for acoustic phonetic variability by Lombard effect. In: Proc. of ICSLP, pp. 999–1002.
Vlassis, N., Motomura, Y., 2001. Efficient source adaptivity in independent component analysis. IEEE Transactions on Neural Networks 12 (3), 559–566.
Wakao, A., Takeda, K., Itakura, F., 1996. Variability of Lombard effects under different noise conditions. In: Proc. of ICSLP96 4, pp. 2009–2012.
Young, S., 1994. The HTK Hidden Markov Model Toolkit: Design and Philosophy, vol. 2. Entropic Cambridge Research Laboratory, Ltd., pp. 2–44.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2001. The HTK Book. Cambridge University Engineering Department.