Speech Communication 43 (2004) 1–16 www.elsevier.com/locate/specom
Lip geometric features for human–computer interaction using bimodal speech recognition: comparison and analysis

Mustafa N. Kaynak a,*, Qi Zhi b, Adrian David Cheok b,*, Zhang Jian b, Ko Chi Chung b, Kuntal Sengupta b

a Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USA
b Department of Electrical and Computer Engineering, National University of Singapore, 117576, Singapore

* Corresponding authors. Tel.: +1-480-9651319; fax: +1-480-9658325 (M.N. Kaynak); tel.: +65-874-6850; fax: +65-779-1103 (A.D. Cheok). E-mail addresses: [email protected] (M.N. Kaynak), [email protected] (A.D. Cheok).
Received 9 January 2004
Abstract

Bimodal speech recognition is a novel extension of acoustic speech recognition in which both acoustic and visual speech information are used to improve the recognition accuracy in noisy environments. Although various bimodal speech systems have been developed, a rigorous and detailed comparison of the possible geometric visual features from speakers' faces has not yet been given. Thus, in this paper, geometric visual features are compared and analyzed rigorously for their importance in bimodal speech recognition. The relevant information of each possible single visual feature is used to determine the best combination of geometric visual features for both visual-only and bimodal speech recognition. Of the geometric visual features analyzed, the lip vertical aperture is the most relevant, and the set formed by the vertical and horizontal lip apertures and the first-order derivative of the lip corner angle gives the best results among the reduced sets of geometric features that were analyzed. The effect of the modelling parameters of hidden Markov models (HMMs) on the recognition accuracy of each single geometric lip feature is also analyzed. Finally, the accuracies of acoustic-only, visual-only, and bimodal speech recognition are experimentally determined and compared using the optimized HMMs and geometric visual features. Compared to acoustic-only and visual-only speech recognition, the bimodal speech recognition scheme achieves a much improved recognition accuracy using the geometric visual features, especially in the presence of noise. The results show that a set of as few as three labial geometric features is sufficient to improve the recognition rate by as much as 20% (from 62% with acoustic-only information to 82% with audio-visual information at a signal-to-noise ratio (SNR) of 0 dB).
© 2004 Elsevier B.V. All rights reserved.
1. Introduction
Speech recognition is an important area of human–computer interaction. The field is part of the ongoing research effort to develop computers that can hear and understand spoken information. However, the recognition rates of current automatic speech recognizers decrease significantly in common environments
where the ambient noise level is high. This is a problem because speech recognition is desirable in realistic settings such as store fronts, offices, airports, train stations, vehicles, and mobile outdoor settings. Humans, on the other hand, often compensate for noise degradation and uncertainty in speech information by integrating multiple sources of speech information, such as the visible gestures of the speaker's face and body (Yuhas et al., 1990). Hence, humans use bimodal (visual and acoustic) information to recognize speech. This is demonstrated by the ''McGurk effect'' (McGurk and MacDonald, 1976): when a subject is presented with contradicting acoustic and visual signals, the perceived signal can be completely different from both. For example, in psychological tests, subjects were shown a video in which a speaker mouthed ''gah'' but the video was dubbed with a voice saying ''bah''. Most people heard the sound ''dah''. However, when viewers turned their backs to the video, they heard the sound ''bah'' correctly. McGurk also found that viewers could not force themselves to hear the correct vocal sounds even when told that they were being fed the wrong visual information.

Modelling the bimodal characteristics of the human speech perception and production systems is one possible approach to the noise problem of automatic speech recognition. The most important source of additional speech information is the visual information from a speaker's face (Silcbee, 1993). Speech perception can be improved significantly by watching the face of the speaker (Yuhas et al., 1990). For example, hearing-impaired people use only the visual speech information from the visible speech articulators (spread over the whole face (Vatikiotis-Bateson et al., 1998; Yehia et al., 1998)) to recognize the speech signal. The primary advantage of the visual information is that it is not affected by acoustic noise or cross-talk among speakers. Thus, in noisy environments a bimodal speech recognition system can take advantage of the visual speech information to improve the recognition accuracy.
The motivation for audio-visual speech recognition can be summarized as follows: the more speech information we have, the higher the potential recognition accuracy. Furthermore, studies of human perception have shown that visual speech information allows people to tolerate an extra 4 dB of noise in the acoustic signal (Movellan, 1995). However, one of the fundamental problems in bimodal speech recognition, also known as audio-visual speech recognition, is to determine which visual features are the most advantageous to use.

In previous research, different types of visual features have been used. Petajan developed one of the first audio-visual speech recognition systems (Petajan, 1984; Petajan et al., 1988); in his system, the mouth area, perimeter, height, and width derived from a binary mouth image were used as the visual features. In (Yuhas et al., 1990), the pixel values of a reduced area of interest centered around the mouth were used as the visual features. Temporal variations of the mouth parameters, rather than static images, were proposed as the visual features in (Stork et al., 1992). In (Luettin et al., 1996), both shape information and intensity information from the lip contours were used as the visual features, and active shape models were used to extract them. In (Potamianos et al., 1997), acoustic features were combined with either geometric visual parameters, such as the mouth height and width, or non-geometric visual parameters, such as the wavelet transform of the mouth images. In (Basu et al., 1999), gray-scale parameters associated with the mouth region of the image were considered, with principal component analysis (PCA) first used to reduce the dimension of the feature space. In (Yu et al., 1999), the intensity of each pixel in an image sequence was considered as a function of time, and one-dimensional wavelet and Fourier transforms were applied to this intensity-versus-time function to model the lip movements. In (Dupont and Luettin, 2000), point distribution models were used to track the lips, and the shape parameters obtained from the tracking results were used as the visual features.
In (Gurbuz et al., 2001), the lip contour was transformed into the Fourier domain and normalized to eliminate dependencies on affine transformations (translation, rotation, scaling, and shear), and this information was then used for audio-visual speech recognition. In (Potamianos and Graf, 1998), discriminative training by means of the generalized probabilistic descent algorithm was used to estimate the HMM stream exponents for audio-visual speech recognition. In (Adjoudani and Benoit, 1996), two architectures for combining automatic speech reading and acoustic speech recognition were described, and a model that can improve the performance of an audio-visual speech recognizer in an isolated-word, speaker-dependent setting was proposed. In (Goldschen, 1993), the design and implementation of a novel continuous speech recognizer that uses optical information from the oral-cavity shadow of a speaker was presented; the system used HMMs trained to discriminate the optical information.

It can thus be summarized that previous bimodal recognition studies have used two major types of information:

• geometric features (such as measures of height, width, and area);
• image and shape features (such as pixel colors and lip outlines).

Geometric features, which are less complex and easier to extract, are an alternative to image/pixel-based features, and they have been shown to improve the recognition rate when used together with the acoustic signal (Petajan, 1984; Petajan et al., 1988). Nevertheless, a rigorous comparison of the suitable geometric features, both singly and in combination, has not yet been given in previous work. Hence the research contributions and the motivation of the work described in this paper are as follows:

• to rigorously analyze each important geometric lip feature singly, and determine its importance in bimodal speech recognition;
• to determine the effect of the modelling parameters of HMMs on the recognition accuracy of each individual geometric lip feature (based on an HMM recognizer);
• to determine the best combination of geometric visual features for both visual-only and bimodal speech recognition;
• to experimentally determine and compare the accuracy of acoustic-only, visual-only, and bimodal speech recognition using the optimized HMMs and geometric visual features.

The experimental results are obtained on a digit recognition task, in order to provide a simple example for comparison purposes.

This paper is organized as follows. The generated audio-visual database and the segmentation software are described in Section 2. In Section 3, hidden Markov modelling for bimodal speech recognition is briefly introduced, and the details of the acoustic modelling are explained. In Section 4, the geometric visual features are rigorously analyzed and the visual hidden Markov modelling is explained for both single and combined geometric visual features. In Section 5, the audio-visual speech recognition experiments and the data fusion are explained, and the results are reported. Section 6 concludes the paper.
2. Audio-visual database and automatic database processing toolkit

An important prerequisite for conducting research on bimodal speech recognition is an audio-visual database. Development of such a database requires a large amount of time for data collection, segmentation and labelling. Furthermore, there is no commercial or widely available software to process the collected data for segmentation, labelling and visual feature extraction. Thus, in this section the audio-visual database generated for this research is first discussed, and then an automatic audio-visual database processing toolkit is introduced.

2.1. Audio-visual database

An audio-visual database was generated for our research on audio-visual speech recognition. In the
Fig. 1. Speaker with six dots (left) and their centers on his face (right).
database, fiducial markers are placed on the speakers' lips (using make-up), in order to obtain highly accurate geometric features free from noise. Hence, the results of this research can ignore the effect of inaccurate lip tracking and provide rigorous conclusions about the usefulness of geometric visual features in bimodal speech recognition. The blue fiducials are shown in Fig. 1. Four of the markers are on the four corners of the outer lip, one is on the nose and the other is on the chin. Blue is used for highlighting because humans seldom have naturally blue elements on their skin, so blue is easy to detect on the face. The markers on the nose and chin were used for normalization when extracting the geometric visual features from the lip, so that the extracted features would be
invariant to the distance between the camera and the speaker. The present database requires the speaker to face the camera; if the speaker rotates his face, algorithms must be used to correct for the rotation. One possible approach is to use pose-independent visual features (Lincoln and Clark, 2000). The generated English audio-visual database contains continuous and isolated utterances from 22 non-native English speakers with European and Asian accents. In this research, 77 samples from nine speakers were used to train the HMM of each digit; 24 samples from 10 speakers were used for testing the digit models in the speaker independent case; and 19 samples from nine speakers were used for testing the digit models in the speaker dependent case. The recordings were made with a digital camera in a silent room.
Fig. 2. Blue color intensity plots of the six dots on the speaker's face.
Fig. 3. Input–output file structure of the segmentation software (*.avi = input video file; *.wav = audio file, sampled at 8 kHz, mono; *.bmp = uncompressed true-color (RGB) Windows bitmap picture file; *.phn = phoneme label text file and *.wrd = word transcription text file, which hold the audio-domain segmentation results at the phoneme and digit level respectively; *.mat = matrix file containing the visual features).
2.2. Automatic audio-video segmentation software toolkit

A software package has been developed to process the audio-visual database automatically (Kaynak et al., 2000). The software is available as a Matlab script with a user-friendly graphical user interface (GUI). Using the software, the video data are processed to obtain fully segmented acoustic and visual signals, tracking results, geometric visual features, and blue color intensity plots showing the tracking results of the six feature points of the input avi video file (Fig. 2). From these avi files, the quality of the tracking process can also be inspected visually. In this research the avi files contain sequences of bitmap images. The input–output file structure of the software is shown in Fig. 3.

To segment the audio-visual data, the bitmap (bmp) and sound wave files are first extracted from the avi file. Then tcl scripts from the CSLU Toolkit (Cosi et al., 1998; Schalkwyk et al., 1995) are called to do phoneme and word partitioning in the audio domain. Using the audio-domain segmentation results, the software performs the segmentation in the video domain, and the segmented audio and video are saved into folders corresponding to each phoneme of the digit. Then, the markers are detected in each frame of the corresponding bitmap files by convolving the frames with a blue mask; the local maxima of the convolution result correspond to the blue markers. The centers of these markers are then found automatically and saved into a matrix showing the movements of the centers of these points in relation to the corresponding phonemes and digits. Finally, the tracking results of the centers of each point are used to calculate the basic geometric visual features. The block diagram of the software is shown in Fig. 4.

Fig. 4. Block diagram of the automatic audio-video segmentation software.

After running the software, the following outputs are obtained: a wave file containing the uttered words; bitmap files sampled at 25 frames/second; a word label text file; a phoneme label text file (which contains the phonemes of the uttered word); an audio-domain segmentation file; a matrix file containing the locations of the tracking results of the six markers for every video frame; and another matrix file containing the basic geometric visual features. All these results are saved into separate folders, according to a folder structure based on the results of the segmentation at the phoneme level. The user can listen to the fully segmented audio signal and view the bmp files corresponding to each phoneme using the GUI shown in Fig. 5.
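As an illustration of the marker detection step just described, the sketch below (Python rather than the authors' Matlab toolkit; the array layout, kernel size and neighbourhood size are assumptions) emphasises blue in each RGB frame, convolves the result with a small uniform mask, and keeps the six strongest local maxima as marker centres.

```python
# Hedged sketch of blue-marker detection in one video frame; not the authors' code.
import numpy as np
from scipy import ndimage

def marker_centres(frame_rgb, n_markers=6, kernel_size=9, neighbourhood=15):
    frame = frame_rgb.astype(float)
    # "Blueness" map: blue channel minus the average of the red and green channels.
    blueness = frame[..., 2] - 0.5 * (frame[..., 0] + frame[..., 1])
    # Convolve with a uniform blue mask to smooth out pixel noise.
    kernel = np.ones((kernel_size, kernel_size)) / kernel_size**2
    response = ndimage.convolve(blueness, kernel, mode="nearest")
    # Local maxima of the convolution response correspond to the blue markers.
    local_max = response == ndimage.maximum_filter(response, size=neighbourhood)
    local_max &= response > 0
    coords = np.argwhere(local_max)
    strengths = response[coords[:, 0], coords[:, 1]]
    keep = np.argsort(strengths)[::-1][:n_markers]
    return coords[keep]          # (row, col) centres of the six markers
```

Applying such a detector to every frame of a segmented utterance yields the per-frame marker trajectories from which the basic geometric visual features are later computed.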
Fig. 6. A typical left–right HMM (aij is the state transition probability from state i to state j; Ot is the observation vector at time t and bi(Ot) is the probability that Ot is generated by state i).
Fig. 5. Displaying the frames for selected phonemes.
Once the geometric features are extracted using the segmentation software, we can begin our experiments on bimodal recognition. In the next part of the paper, the background and the application of HMMs to acoustic, visual, and audio-visual signals are discussed for bimodal speech recognition.
3. Hidden Markov modelling for bimodal speech recognition

Since the first bimodal speech recognizers were introduced, neural networks (NN), fuzzy logic and HMMs have been used as the recognition algorithm. However, HMMs have been the most widely used for speech recognition in recent years, because they often work extremely well in practice and enable important practical systems to be realized. HMMs, as explained in (Rabiner and Juang, 1993), are stochastic signal models that are widely used for characterizing the spectral properties of frames of patterns. The underlying assumptions of using HMMs in speech recognition are that speech signals can be well characterized as a parametric random process, and that the parameters of the stochastic process can be determined (estimation) in a precise, well-defined manner (training) (Rabiner and Juang, 1993).

Fig. 6 shows the structure of a typical left–right HMM. For speech recognition a left–right HMM is usually used, because the underlying state sequence associated with the model has the property that as time increases, the state index increases or stays the same. Left–right HMMs are used in this research as well. We developed an HMM code in Matlab, and in order to assess the performance of the developed code, the Microsoft HTK version 3.0 HMM software (http://htk.eng.cam.ac.uk/) was used as a benchmark. For the developed HMM code, the Baum–Welch algorithm was used for training, taking into account the practical implementation issues such as scaling, multiple observation sequences and initial parameter estimates explained in (Rabiner, 1989). Training terminated automatically when the highest accuracy was obtained on the validation data; this termination also prevents over-training of the HMMs.

3.1. Acoustic hidden Markov modelling

In this section the hidden Markov modelling of the acoustic speech signal is discussed and a comparative study is carried out between the developed and HTK HMM codes. In order to present rigorous comparisons of the recognition results of acoustic-only, visual-only, and bimodal speech recognition, the optimum HMM is selected
for each type of signal (using the same data). For the acoustic speech recognition experiments, 13-dimensional Mel frequency cepstral coefficients (MFCC) were used as the standard audio features (Rabiner, 1989; Becchetti and Ricotti, 1999). The original dimension of the acoustic feature vector was 14; however, since the zero-order coefficient is approximately equivalent to the log energy of the frame, and energy is computed directly on the time signal, the zero-order cepstral coefficient was discarded and a 13-dimensional acoustic feature vector was obtained.

In order to find the best HMM for acoustic-only speech recognition, different models were examined experimentally. For each model, the recognition results were plotted as a function of SNR. In the experiments, white noise was used to obtain the noisy speech signals, and the SNR was calculated as the logarithm of the ratio between the average powers of the speech and white noise signals. For each digit model, the state number was fixed to four or five, and 8, 16 and 32 Gaussian mixtures were used in turn in order to find the optimum model. For training each digit model, 77 samples from nine speakers were used; for evaluating the performance of the digit models in the speaker independent and dependent cases, 24 samples from 10 speakers and 19 samples from nine speakers were used respectively.

Table 1 shows the average recognition results of the different HMMs for the speaker independent case and Table 2 shows the results for the speaker dependent case. As expected, the speaker dependent results are better than the speaker independent results, especially at low noise levels. From the tables, it can be seen that as the number of Gaussian mixture components increases, the performance of the HMMs improves at the expense of increased computational complexity. Among the six different digit models tested, the HMMs with four states and 32 Gaussian mixtures are the best for both the speaker dependent and independent cases, especially at low SNR. Therefore, HMMs with four states and 32 Gaussian mixture components are used for the digit models.
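For concreteness, the acoustic front end and the noise-addition procedure can be sketched as follows. This is a minimal illustration, not the authors' HTK/Matlab front end: the use of librosa, the filter-bank size and the exact parameter names are assumptions, but the framing (25 ms windows, 10 ms shift, 13 MFCCs with the zero-order coefficient discarded) follows the description above.

```python
# Hedged sketch: white noise at a prescribed SNR plus 13-dimensional MFCCs.
import numpy as np
import librosa

def add_white_noise(y, snr_db):
    noise = np.random.randn(len(y))
    p_signal = np.mean(y**2)
    p_noise = np.mean(noise**2)
    # Scale the noise so that 10*log10(p_signal / p_noise_scaled) equals snr_db.
    noise *= np.sqrt(p_signal / (p_noise * 10.0**(snr_db / 10.0)))
    return y + noise

def acoustic_features(y, sr=8000):
    # 25 ms window (200 samples at 8 kHz) and 10 ms hop (80 samples).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14,
                                n_fft=200, hop_length=80, n_mels=26)
    return mfcc[1:].T            # drop the zero-order coefficient -> (frames, 13)
```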
Table 1
Acoustic hidden Markov modelling: acoustic speech recognition – speaker independent. Recognition rate (%).

SNR        N=4, M=8   N=4, M=16   N=4, M=32   N=5, M=8   N=5, M=16   N=5, M=32
Clean      96.7       96.3        96.3        90.4       95.4        96.7
30 dB      96.7       96.3        96.3        92.5       95.4        97.5
25 dB      95.8       95.8        96.7        92.9       96.3        97.1
20 dB      94.6       96.7        95.8        92.5       95.8        96.7
15 dB      91.7       95.0        95.0        92.9       93.3        93.8
10 dB      86.3       91.3        92.9        89.2       93.3        90.0
5 dB       79.6       81.3        83.3        77.5       80.8        79.2
0 dB       55.4       60.8        62.1        52.5       57.1        57.9

N is the number of states and M the number of Gaussian mixture components.
Table 2
Acoustic hidden Markov modelling: acoustic speech recognition – speaker dependent. Recognition rate (%).

SNR        N=4, M=8   N=4, M=16   N=4, M=32   N=5, M=8   N=5, M=16   N=5, M=32
Clean      99.5       99.5        99.5        92.6       99.5        100
30 dB      99.5       99.5        99.5        94.7       99.0        100
25 dB      98.4       99.0        100         95.3       99.0        99.5
20 dB      96.3       98.4        99.0        94.7       97.9        99.0
15 dB      95.3       96.8        98.4        94.7       96.3        97.4
10 dB      89.5       93.7        94.7        90.5       95.3        94.2
5 dB       78.4       80.0        84.7        76.3       81.1        82.6
0 dB       53.7       56.8        60.0        52.1       55.8        57.9

N is the number of states and M the number of Gaussian mixture components.
However, since the difference between the recognition rates of the 16- and 32-mixture models is not statistically significant, HMMs with four states and 16 Gaussian mixtures could be used as well. An important conclusion from Tables 1 and 2 is that, even though the vocabulary used in this research is limited to ten digits, the recognition rate decreases significantly as the noise level increases. Hence, in noisy environments the visual speech signal can possibly be used together with the acoustic speech signal to increase the recognition performance of the system.

For the acoustic modelling, both the HTK and the developed Matlab HMM codes were used; Tables 1 and 2 show the results obtained from the developed Matlab-based HMM, which is available for download at http://speech.ece.nus.edu.sg.
Table 3
The recognition accuracy comparison of the HTK and Matlab HMMs. Recognition rate (%).

HMM source    Clean   30 dB   25 dB   20 dB   15 dB   10 dB   5 dB   0 dB
HTK HMM       100     100     97.9    94.2    84.2    75      60.4   47.1
Matlab HMM    94.2    93.8    93.8    91.7    85      77.9    60.1   47.9
Table 3 compares the recognition accuracy of the developed and HTK HMM codes for the speaker independent case, for four-state, four-Gaussian-component digit models under the same experimental conditions. From Table 3, it can be seen that the HTK HMM used as the reference outperforms the developed HMM code by approximately 6% at high SNR, while at low SNR both codes have almost the same performance. The reason for the difference in performance is the type of trainer used: a simple discrete word trainer for the developed HMM code and an embedded trainer for HTK. The embedded trainer needs more data for training, and the amount of training data required grows with the number of Gaussian components. For the rest of the experiments, the developed HMM code is used, so that the recognition results of acoustic-only, visual-only, and audio-visual speech recognition obtained with the same code can be rigorously compared and the developed code can be further improved.
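For readers who want to reproduce a comparable baseline, the sketch below builds one left–right GMM–HMM per digit and classifies an utterance by the highest forward-algorithm log-likelihood. It uses the hmmlearn library as a stand-in for the authors' Matlab and HTK code; the helper names, topology details and training settings are assumptions, not the original implementation.

```python
# Hedged sketch of whole-word digit models with left-right GMM-HMMs (hmmlearn).
import numpy as np
from hmmlearn import hmm

def make_left_right_hmm(n_states=4, n_mix=32):
    model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=20,
                       init_params="mcw", params="stmcw")
    # Left-right topology: start in state 0, allow only self-loops and moves to
    # the next state. Baum-Welch re-estimation preserves the zero entries.
    model.startprob_ = np.r_[1.0, np.zeros(n_states - 1)]
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5
    model.transmat_ = trans
    return model

def train_digit_models(train_seqs, n_states=4, n_mix=32):
    # train_seqs: dict mapping each digit to a list of (T_i, D) feature arrays.
    models = {}
    for digit, seqs in train_seqs.items():
        X, lengths = np.vstack(seqs), [len(s) for s in seqs]
        m = make_left_right_hmm(n_states, n_mix)
        m.fit(X, lengths)                       # Baum-Welch training
        models[digit] = m
    return models

def classify(models, seq):
    # Pick the digit whose HMM assigns the highest log-likelihood to seq.
    return max(models, key=lambda d: models[d].score(seq))
```

A configuration such as n_states=4 and n_mix=32 corresponds to the best acoustic digit models reported in Tables 1 and 2.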
4. Analysis of geometric visual features and visual hidden Markov modelling

As mentioned previously, the use of geometric visual features for speech recognition improves the recognition rate. However, which geometric features are best has not been clearly established in previous work. Hence, in this research a rigorous, systematic analysis of important geometric visual features is carried out experimentally, and the effects of the HMM modelling parameters on the recognition results of each single visual feature are determined. After the single visual feature analysis, combinations of geometric visual features are examined to find the best performing visual feature combination for bimodal speech recognition.

For visual hidden Markov modelling, left–right HMMs with continuous observation densities and diagonal covariance matrices (implying statistically independent features) were used. Another reason for using diagonal covariance matrices is that more training data would be needed to train HMMs with full covariance matrices. The best performing HMM was found for both the single and the combined visual features for each digit through a series of experiments on different combinations of the number of states and Gaussian mixture components. In Section 4.1, single geometric visual features are analyzed for their importance in bimodal speech recognition. This analysis gives valuable insight into the best features that can possibly be used for bimodal speech recognition.

4.1. Single geometric visual feature analysis
In order to have accurate and noise-free training data, six blue markers were placed on the speaker's face in the audio-visual database. After automatically detecting the centers of the markers using a convolution-based approach, four basic geometric lip parameters were extracted: the outer-lip horizontal width (X), the outer-lip vertical width (Y), the outer-lip area (area), defined as the area of the ellipse shown in Fig. 7, and the angle of the outer-lip corner (angle), also shown in Fig. 7.
Fig. 7. The basic geometric lip features: the horizontal width X, the vertical width Y, and the outer-lip corner angle.
The vertical distance between the points on the chin and nose was used to normalize X and Y, so that the features become invariant to the distance between the speaker and the camera. It should be noted that not only the shape of the outer-lip contour but also the movements of the lip contour are important for distinguishing the digits. If the four geometric lip parameters are considered as functions of time, their first-order derivatives with respect to time characterize the changes in these parameters over time.

A systematic study was carried out to determine the best features to use for speech recognition. In these experiments ten different visual features were analyzed: the four basic geometric lip parameters X, Y, area and angle, the ratio parameter Y/X, and their first-order derivatives. The analysis of the geometric visual features was carried out in two stages.
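Before turning to the two analysis stages, the feature definitions above can be made concrete with the following sketch, which computes the normalized geometric parameters and their time derivatives from the tracked marker centres. The marker ordering, the elliptical area formula and the corner-angle definition are illustrative assumptions rather than the authors' exact formulas.

```python
# Hedged sketch of the ten geometric visual features per frame.
import numpy as np

def geometric_features(markers, fps=25.0):
    # Assumed marker order per frame: left corner, right corner, upper lip,
    # lower lip, nose, chin; markers has shape (n_frames, 6, 2).
    left, right, top, bottom, nose, chin = (markers[:, i, :] for i in range(6))
    scale = np.linalg.norm(nose - chin, axis=1)         # nose-chin reference length
    X = np.linalg.norm(right - left, axis=1) / scale    # horizontal aperture
    Y = np.linalg.norm(top - bottom, axis=1) / scale    # vertical aperture
    area = np.pi * (X / 2.0) * (Y / 2.0)                # ellipse through the corners
    # Assumed corner-angle definition: angle between the horizontal axis and the
    # line from a lip corner to the upper-lip midpoint.
    angle = np.arctan2(Y / 2.0, X / 2.0)
    static = np.column_stack([X, Y, Y / X, area, angle])
    dynamic = np.gradient(static, 1.0 / fps, axis=0)    # first-order derivatives
    return np.hstack([static, dynamic])                 # (n_frames, 10)
```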
At the first stage of visual hidden Markov modelling, the best single visual features were determined (i.e., using only each single visual feature alone, with no other visual or acoustic features, for recognition), and at the second stage the best visual feature combination was determined. To determine the best HMM for each digit and for each single visual feature, six HMMs were trained: four states with four, eight and 16 Gaussian mixtures, and five states with four, eight and 16 Gaussian mixtures. The recognition performance of the trained models was evaluated on the validation data using the forward–backward algorithm. The results in this section show the recognition rate for each single visual feature.

From the experimental results shown in Table 4 for the speaker independent case, the average recognition rates are determined for every single visual feature over all the digits. Note that in the table, the recognition results are shown for each single visual feature alone (with no other visual or audio features being used). From these results, it can be concluded that the geometric visual features along the vertical direction, such as Y and d(Y)/dt, are more important than those along the horizontal direction, such as X; this is reflected in the higher recognition rates of the vertical-direction features. Also, the single visual feature d(angle)/dt represents not only the coordinated lip movements along the vertical and horizontal directions but also the velocities of these movements, so it is an important feature as well.
Table 4
The recognition rate for the single visual feature experiment: single visual feature analysis, recognition rates (%).

Feature        One    Two    Three  Four   Five   Six    Seven  Eight  Nine   Zero   Average
X              35.4   30.6   37.5   38.9   52.8   48.6   29.9   34.7   38.2    6.9   35.4
Y              37.5   31.3   19.4   51.4   72.2   22.9   58.3   34.0   36.1   15.3   37.9
Y/X            31.3   34.7   27.8   20.8   33.3   28.5   49.3   16.7   24.3   12.5   27.9
Area           23.6   14.6   16.0   34.0   36.8   16.7   22.2   10.4    7.6   33.3   21.5
Angle          29.2   35.4   29.2   21.5   39.6   29.9   45.8   28.5   20.8   20.8   30.1
d(X)/dt        22.9   27.8   34.0   38.2   59.0   13.9   20.1   27.1    9.7   16.7   26.9
d(Y)/dt        29.9   21.2   16.7   50.7   68.8    6.9   68.1   45.8   31.3   25.7   37.3
d(Y/X)/dt      36.1   25.0   19.4   34.0   66.7    8.3   56.9   26.4   27.8   31.3   33.2
d(Area)/dt     35.4   17.4   25.7   44.4   61.8   17.4   50.7   22.2   22.9   26.4   32.4
d(Angle)/dt    23.6   27.1   22.2   35.4   66.0    6.3   60.4   22.2   25.7   34.7   32.4
From Table 4, it can also be concluded, by taking into account the best average recognition rates, that X, Y, angle, d(Y)/dt, d(angle)/dt, and d(Y/X)/dt performed well for almost all the digits; they can be regarded as the best geometric visual features for optimizing the recognition rate of bimodal speech recognition.

4.2. Combined visual feature analysis

Now that the best single visual features have been determined, the question is which combination of these six geometric visual features should be used for bimodal speech recognition to obtain an acceptable recognition rate. In order to determine the best visual feature combination, further experiments were conducted on seven different combinations of visual features:

• X, Y;
• Y, d(Y)/dt;
• X, Y, angle;
• X, Y, d(Y)/dt;
• X, Y, d(angle)/dt;
• X, Y, angle, d(angle)/dt;
• X, Y, d(Y)/dt, d(Y/X)/dt.
These visual feature combinations were selected empirically from among the single geometric visual features with the higher average recognition rates in Table 4. For selecting the visual feature combinations, factor analysis or principal component analysis (PCA) could also be used to obtain optimal combinations from a statistical point of view.

The visual feature combinations were obtained by increasing the dimension of the feature vector. For example, for the visual feature combination [X, Y], the one-dimensional feature vectors X and Y were concatenated to form a two-dimensional visual feature vector. For these visual feature combinations, seven different HMMs per digit, with four states and four, eight, 16 and 32 Gaussian mixture components, and five states with eight, 16 and 32 mixture components, were trained using 77 samples from nine speakers. The performance of these HMMs was then evaluated on the validation data for both the speaker dependent and independent cases. For the speaker independent case 24 samples from 10 speakers, and for the speaker dependent case 19 samples from nine speakers, were used as the validation data.

The results for the speaker independent and dependent cases are shown in Tables 5 and 6. According to both tables, the best visual feature combination is [X, Y, d(angle)/dt], owing to its high recognition rate, and the best model for this combination is the HMM with five states and 32 mixtures. The difference between the recognition accuracies of the speaker dependent and independent cases for the optimum combination and HMM is 3.8% (78.4 − 74.6 = 3.8), so even in the speaker independent case the performance of this visual feature combination is acceptable.
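The selection protocol just described can be summarized by the sketch below, which stacks a chosen set of single-feature trajectories into a combined visual feature vector and scans the (number of states, number of mixtures) grid, keeping the configuration with the highest average validation recognition rate. It reuses the hypothetical train_digit_models and classify helpers sketched in Section 3 and illustrates the procedure rather than reproducing the authors' code; Tables 5 and 6 report the measured results.

```python
# Hedged sketch of combined-feature construction and (N, M) model selection.
# train_digit_models / classify: hypothetical helpers from the Section 3 sketch.
import numpy as np

def stack_features(traj, names):
    # traj: dict mapping a feature name to its (T,) trajectory.
    return np.column_stack([traj[n] for n in names])     # (T, len(names))

def select_hmm(train_set, val_set, names,
               grid=((4, 4), (4, 8), (4, 16), (4, 32), (5, 8), (5, 16), (5, 32))):
    best = None
    for n_states, n_mix in grid:
        feats = {d: [stack_features(t, utt_list) for t in utt_list] if False else
                    [stack_features(t, names) for t in utt_list]
                 for d, utt_list in train_set.items()}
        models = train_digit_models(feats, n_states, n_mix)
        correct = total = 0
        for digit, utt_list in val_set.items():
            for t in utt_list:
                correct += classify(models, stack_features(t, names)) == digit
                total += 1
        accuracy = correct / total
        if best is None or accuracy > best[0]:
            best = (accuracy, n_states, n_mix, models)
    return best   # (best accuracy, states, mixtures, trained models)
```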
Table 5
Evaluation of HMMs for combined visual features: combined visual feature analysis – speaker independent. Average recognition rate (%).

Visual feature               N=4,M=4  N=4,M=8  N=4,M=16  N=4,M=32  N=5,M=8  N=5,M=16  N=5,M=32
X–Y                          55.4     57.9     62.5      63.3      63.3     62.9      62.9
Y–d(Y)/dt                    47.5     46.7     47.1      42.9      46.7     51.3      52.1
X–Y–d(Y)/dt                  67.1     65.0     70.8      70.8      70.4     72.5      70.8
X–Y–angle                    50.1     51.7     50.4      51.7      52.5     50.8      50.8
X–Y–d(angle)/dt              63.3     68.3     72.1      72.5      67.9     73.3      74.6
X–Y–d(Y)/dt–d(Y/X)/dt        55.4     60.8     60.0      59.2      60.0     62.5      60.8
X–Y–angle–d(angle)/dt        60.4     60.0     60.8      62.1      59.2     60.0      60.8

N is the number of states and M the number of Gaussian mixture components.
Table 6
Evaluation of HMMs for combined visual features: combined visual feature analysis – speaker dependent. Average recognition rate (%).

Visual feature               N=4,M=4  N=4,M=8  N=4,M=16  N=4,M=32  N=5,M=8  N=5,M=16  N=5,M=32
X–Y                          57.9     60.5     64.7      67.4      68.4     65.3      63.2
Y–d(Y)/dt                    47.9     48.4     47.9      45.3      49.5     52.1      53.7
X–Y–d(Y)/dt                  68.4     67.4     74.7      74.2      74.7     74.7      75.8
X–Y–angle                    50.0     62.1     61.1      60.5      64.2     62.6      61.1
X–Y–d(angle)/dt              63.7     72.6     75.3      76.3      71.6     76.8      78.4
X–Y–d(Y)/dt–d(Y/X)/dt        59.5     64.7     64.7      63.2      64.7     67.4      65.3
X–Y–angle–d(angle)/dt        61.6     73.2     75.3      73.2      70.0     72.1      73.2

N is the number of states and M the number of Gaussian mixture components.
Another important point is that, since all the visual features are normalized with respect to the lip size, even the absolute values of the geometric lip parameters produced good results. A second, more important conclusion is that, although the single visual features do not perform very well when used alone (as expected, given the relatively small information content of each single visual feature), when they are combined into a joint visual feature the recognition performance improves dramatically: the recognition rate almost doubles and the system becomes more robust to noise.

After a thorough and rigorous analysis of the single geometric visual features and HMMs, it can be concluded that geometric visual features carry relevant speech information and can therefore be used for bimodal speech recognition. In Section 5 the audio-visual speech recognition experiments are explained; the hidden Markov modelling of the audio-visual speech signal and the fusion of the data streams from the acoustic and visual channels are also discussed.
5. Audio-visual hidden Markov modelling

Since the acoustic speech signal is susceptible to acoustic noise, in noisy environments recognition based on the acoustic speech signal alone may not be accurate enough; our experiments on acoustic speech recognition confirmed this. The visual speech recognition experiments showed that the visual signal carries relevant information for speech recognition. Thus, one can take advantage of both acoustic and visual sources of speech information. In this section the audio-visual speech recognition experiments and the fusion of the acoustic and visual data streams are detailed. The combination [X, Y, d(angle)/dt] and the 13-dimensional MFCC are used as the visual and acoustic features respectively for the experiments in this section.

An important problem in bimodal speech recognition is how to fuse the acoustic and visual speech information. Some previous studies on audio-visual speech recognition have focused on the optimal integration of the speech information from the acoustic and visual channels. There are two widespread beliefs about how humans integrate acoustic and visual speech information:

• early integration (feature fusion or direct identification) (Rogozan and Deleglise, 1998);
• late integration (decision fusion or separate identification) (Teissier et al., 1999).
The former approach uses one recognition engine for both the visual and acoustic features, while the latter uses two recognition engines, one for the visual signal and one for the acoustic signal, and then integrates the recognition results of the two engines. In this research, an early integration system based on the direct identification (DI) strategy was used (Fig. 8).
Fig. 8. Early integration (feature fusion) system: the acoustic and visual feature vectors are concatenated into an integrated audio-visual feature vector, which is passed to an HMM-based isolated digit recognizer that outputs the recognized digit.
With this approach, the features from the acoustic and visual channels are concatenated to form a joint feature vector. However, the acoustic and visual feature vectors are not synchronous, because the acoustic features are extracted from 25 ms speech signal windows with 10 ms overlap, whereas the visual features are extracted from 40 ms frames without any overlap. Hence, for the same speech signal, the visual and acoustic feature sequences have different numbers of frames, and merging the acoustic and visual signals raises the problem of non-synchronism between the two channels. To solve this problem, each single visual feature signal was first re-sampled at a higher rate using low-pass interpolation. Then, from these up-sampled visual signals, new visual features were obtained from 25 ms windows with 10 ms overlap by averaging the samples inside each window. Thus, the visual and acoustic feature sequences have the same number of samples. The fusion procedure for the acoustic and visual features is shown graphically in Fig. 9; only a one-dimensional visual feature is shown in the figure.
Fig. 9. Feature fusion: the top graph displays the original visual feature, extracted from 40 ms frames with no overlap. The second graph displays the up-sampled feature vector. The bottom graph displays the audio frames of 25 ms with 10 ms overlap.
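A minimal sketch of this synchronization and fusion step is given below, assuming 25 Hz visual trajectories, 25 ms / 10 ms acoustic frames and a polyphase low-pass interpolator from scipy; the intermediate rate, helper names and padding policy are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of visual-feature up-sampling, window averaging and fusion.
import numpy as np
from scipy.signal import resample_poly

def resample_visual(feature_25hz, n_acoustic_frames, dense_rate=1000):
    # Low-pass interpolation from 25 Hz to dense_rate Hz (here 40x up-sampling).
    dense = resample_poly(feature_25hz, dense_rate, 25)
    win, hop = int(0.025 * dense_rate), int(0.010 * dense_rate)  # 25 ms / 10 ms
    means = [dense[i * hop:i * hop + win].mean()
             for i in range(n_acoustic_frames) if i * hop + win <= len(dense)]
    # Pad with the last value if the visual stream is slightly shorter.
    out = np.full(n_acoustic_frames, means[-1] if means else 0.0)
    out[:len(means)] = means
    return out

def fuse(acoustic, visual_trajectories):
    # acoustic: (T, 13) MFCC frames; visual_trajectories: three 25 Hz trajectories
    # (e.g. X, Y and d(angle)/dt). Returns a (T, 16) joint feature matrix.
    T = len(acoustic)
    visual = np.column_stack([resample_visual(v, T) for v in visual_trajectories])
    return np.hstack([acoustic, visual])
```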
This up-sampling and averaging procedure was repeated for each single visual feature so that the acoustic and visual streams contain the same number of samples. After this procedure, the three-dimensional visual features and the 13-dimensional acoustic features were concatenated to form a 16-dimensional joint audio-visual feature vector.

After obtaining the audio-visual feature vector, the next step is training the digit models. For this purpose, six different HMMs, with four or five states and eight, 16 or 32 Gaussian mixtures, were trained; 77 samples from nine speakers were used to train the HMM of each digit. Their performance was then evaluated on the validation data, which included clean and noise-corrupted
test data. For testing, 24 samples from 10 speakers were used for the speaker independent case and 19 samples from nine speakers for the speaker dependent case. The evaluation results are shown in Tables 7 and 8 for the speaker independent and dependent cases respectively. The results for the speaker dependent case are slightly better than for the speaker independent case, but in terms of overall performance the speaker independent results are also quite satisfactory. From these results, the best HMM was selected for each digit.
Table 7
Audio-visual speech recognition using different digit models under various acoustic SNRs: speaker independent. Average recognition rate (%).

SNR        N=4, M=8   N=4, M=16   N=4, M=32   N=5, M=8   N=5, M=16   N=5, M=32
Clean      96.3       97.9        98.8        96.7       97.5        98.3
30 dB      96.3       97.9        98.8        97.1       97.5        97.9
25 dB      96.7       97.9        98.8        97.1       97.5        97.9
20 dB      96.3       97.5        98.8        96.7       96.7        97.9
15 dB      96.3       96.3        98.3        94.6       96.7        97.9
10 dB      95.8       95.4        97.9        93.8       95.8        96.3
5 dB       88.3       94.2        94.2        90.0       92.1        92.9
0 dB       80.4       83.3        82.1        82.5       81.7        81.7

N is the number of states and M the number of Gaussian mixture components.

Table 8
Audio-visual speech recognition using different digit models under various acoustic SNRs: speaker dependent. Average recognition rate (%).

SNR        N=4, M=8   N=4, M=16   N=4, M=32   N=5, M=8   N=5, M=16   N=5, M=32
Clean      97.9       99.5        99.5        98.4       99.0        100
30 dB      97.9       99.5        99.5        99.0       99.0        100
25 dB      97.9       99.5        99.5        99.0       99.0        100
20 dB      97.4       99.5        99.5        98.4       99.0        100
15 dB      94.4       98.4        99.0        96.3       98.4        99.5
10 dB      95.8       97.4        99.0        95.3       96.8        97.4
5 dB       87.9       95.3        94.2        90.5       92.6        93.7
0 dB       79.5       84.2        80.0        82.6       81.6        82.1

N is the number of states and M the number of Gaussian mixture components.
Among the different digit models, the HMM with four states and 32 Gaussian mixture components yielded the best results for the speaker independent case, and the HMM with five states and 32 Gaussian mixture components yielded the best results for the speaker dependent case. These best-performing HMMs were used to test the performance of the bimodal speech recognition system.

For comparison, the average recognition accuracies of acoustic-only, visual-only and audio-visual speech recognition are shown in Fig. 10 for the speaker independent case and in Fig. 11 for the speaker dependent case. As the noise increases, audio-visual speech recognition produces much better results than acoustic-only speech recognition for both the speaker dependent and independent cases. The audio-visual speech recognition system with the early-fused signal yields a significant performance gain for the noisy speech signal: the higher the noise, the larger the improvement in recognition rate compared to acoustic-only speech recognition.

From the results of the audio-visual speech recognition experiments, the most important conclusion is that when visual speech information is used together with acoustic speech information, speech recognition becomes more robust to noise.
Fig. 10. Recognition rate using different speech information sources under various acoustic SNRs for speaker independent case.
Fig. 11. Recognition rate using different speech information sources under various acoustic SNRs for speaker dependent case.
Furthermore, the bimodal speech recognition system outperforms the acoustic-only and visual-only speech recognition systems by as much as 20% and 8%, respectively, at SNR = 0 dB.
6. Conclusion

Bimodal speech recognition is felt to be an important up-coming area of human–computer interaction. For this purpose, geometric lip features are advantageous for developing practical, real-time bimodal recognition systems. However, a rigorous comparison of the suitable geometric features, both singly and in combination, had not been given in previous work. Hence the contribution of this paper is a rigorous analysis and comparison of the geometric features that can be used in an audio-visual speech recognition system.

To this end, single geometric visual features were first examined, and their importance for bimodal speech recognition was determined through experiments on a visual speech recognition system. During these experiments the effects of the modelling parameters of hidden Markov models on the recognition results of the single visual features were also examined; thus, optimal HMMs were obtained for each single visual feature. The best single visual features were then selected to find the best visual feature combination. During this selection, optimized HMMs were used to reduce the
effect of the HMM modelling parameters. For the combined visual features, optimized HMMs were first found and the different combinations were then compared to find the optimum visual feature combination. These features were then used in the bimodal speech recognition experiments. For the fusion of the acoustic and visual speech information, an early integration algorithm, in which the acoustic and visual features are concatenated to form a joint feature vector, was presented. Finally, a comparative study was carried out between acoustic-only, visual-only and audio-visual speech recognition.

For limited-vocabulary digit recognition, the audio-visual speech recognition system significantly improved the recognition rate, especially at high noise levels, and outperformed both the acoustic-only and visual-only speech recognition systems. The results showed that a set of as few as three labial geometric features is sufficient to improve the recognition rate by as much as 20% (from 62% with acoustic-only information to 82% with audio-visual information at SNR = 0 dB). This is a promising result because, for most real-time applications, the recognizer must deal with the noise present in the environment.

For acoustic-only, visual-only and audio-visual speech recognition, we studied both the speaker dependent and independent cases. The results showed that, although the recognition accuracy is higher for the speaker dependent case, the results for the speaker independent case are also satisfactory, and that by using geometric lip features the speech recognition accuracy can be improved in both cases.

Although markers were used in this research to extract the geometric visual features, such features can be extracted accurately in real time without any markers (Zhang et al., 2001; Chan et al., 1998). Since geometric visual features are easy to extract and compute in real time, and since bimodal speech recognition outperforms acoustic-only and visual-only speech recognition at high noise levels, the use of basic geometric visual features for audio-visual speech recognition enables the development of more practical, real-time and robust bimodal speech recognition systems.
References

Adjoudani, A., Benoit, C., 1996. On the integration of auditory and visual parameters in an HMM-based ASR. In: Speechreading by Humans and Machines. NATO ASI Series. Springer-Verlag, New York, NY, pp. 461–471.

Basu, S., Neti, C., Rajput, N., Senior, A., Subramaniam, L., Verma, A., 1999. Audio-visual large-vocabulary continuous speech recognition in the broadcast domain. In: Proc. IEEE Third Workshop on Multimedia Signal Process., pp. 475–481.

Becchetti, C., Ricotti, L.P., 1999. Speech Recognition, Theory and C++ Implementation. John Wiley and Sons Ltd.

Chan, M., Zhang, Y., Huang, T., 1998. Real-time lip tracking and bimodal continuous speech recognition. In: Proc. IEEE Second Workshop on Multimedia Signal Process., pp. 65–70.

Cosi, P., Hosom, J.-P., Shalkwyk, J., Sutton, S., Cole, R., 1998. Connected digit recognition experiments with the OGI toolkit's neural network and HMM-based recognizers. In: Proc. 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, pp. 135–140.

Dupont, S., Luettin, J., 2000. Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia 2 (3), 141–151.

Goldschen, A., 1993. Continuous automatic speech recognition by lipreading. Ph.D. thesis, George Washington University, Washington, DC, September 1993.

Gurbuz, S., Tufekci, Z., Patterson, E., Gowdy, J., 2001. Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In: Proc. Internat. Conf. on Acoust., Speech, Signal Process., vol. 1, pp. 177–180.

Kaynak, M.N., Zhi, Q., Cheok, A.D., Ko, C.C., Cavalier, S., Ng, K.K., 2000. Database generation for bimodal speech recognition research. In: Second JSPS-NUS Seminar on Integrated Engineering, pp. 220–227.

Lincoln, M., Clark, A., 2000. Towards pose-independent face recognition. IEE Colloquium on Visual Biometrics 5, 1–5.

Luettin, J., Thacker, N., Beet, S., 1996. Speechreading using shape and intensity information. In: Proc. Fourth Internat. Conf. on Spoken Language Process., vol. 1, pp. 58–61.

McGurk, H., MacDonald, J., 1976. Hearing lips and seeing voices. Nature 264 (5588), 746–748.

Movellan, J., 1995. Visual speech recognition with stochastic networks. In: Proc. Internat. Conf. on Neural Information Process. Systems: Natural and Synthetic, pp. 851–858.

Petajan, E., 1984. Automatic lipreading to enhance speech recognition. In: Proc. IEEE Global Telecommunications Conf., vol. 1, pp. 205–272.

Petajan, E., Brooke, N., Bischoff, B., Bodoff, D., 1988. An improved automatic lipreading system to enhance speech recognition. In: Proc. CHI '88, pp. 19–25.

Potamianos, G., Cosatto, E., Graf, H., Roe, D., 1997. Speaker independent audio-visual database for bimodal ASR. In: Proc. European Tutorial Workshop on Audio-Visual Speech Process., Rhodes, Greece.
Potamianos, G., Graf, H.P., 1998. Discriminative training of HMM stream exponents for audio-visual speech recognition. In: Proc. Internat. Conf. on Acoust., Speech, Signal Process., vol. 6, 12–15 May 1998, pp. 3733–3736.

Rabiner, L., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 (2), 257–286.

Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series.

Rogozan, A., Deleglise, P., 1998. Adaptive fusion of acoustic and visual sources for automatic speech recognition. Speech Comm. 26 (1–2), 149–161.

Schalkwyk, J., Colton, D., Fanty, M., 1995. The CSLU toolkit for automatic speech recognition. Tech. rep., Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology.

Silcbee, P., 1993. Automatic lipreading. Biomedical Science Instrumentation 20, 415–422.

Stork, D., Wolff, G., Levine, E., 1992. Neural network lipreading system for improved speech recognition. In: Proc. Internat. Joint Conf. on Neural Networks, vol. 2, pp. 289–295.

Teissier, P., Robert-Ribes, J., Schwartz, J.-L., Guerin-Dugue, A., 1999. Comparing models for audio-visual fusion in a noisy-vowel recognition task. IEEE Trans. Speech and Audio Process. 7 (6), 629–642.

Vatikiotis-Bateson, E., Eigsti, I.-M., Yano, S., Munhall, K., 1998. Eye movement of perceivers during audio-visual speech perception. Perception and Psychophysics 60 (6), 926–940.

Yehia, H., Rubin, P., Vatikiotis-Bateson, E., 1998. Quantitative association of vocal-tract and facial behavior. Speech Comm. 26 (1–2), 23–43.

Yu, K., Jiang, X., Bunke, H., 1999. Lipreading using signal analysis over time. Signal Process. 77 (2), 195–208.

Yuhas, B.P., Goldstein, M.H., Sejnowski, T.J., Jenkins, R.E., 1990. Neural network models of sensory integration for improved vowel recognition. Proc. IEEE 78 (10), 1658–1668.

Zhang, J., Kaynak, M., Cheok, A., Ko, C., 2001. Real-time lip tracking for virtual lip implementation in virtual environments and computer games. In: Proc. IEEE Internat. Conf. on Fuzzy Systems, vol. 3, pp. 1359–1362.