Digital Signal Processing 11, 169–186 (2001) doi:10.1006/dspr.2001.0397, available online at http://www.idealibrary.com
Adaptive Fusion of Speech and Lip Information for Robust Speaker Identification T. Wark and S. Sridharan Speech Research Laboratory, Research Concentration in Speech, Audio and Video Technology, School of Electrical and Electronic Systems Engineering, Queensland University of Technology, GPO Box 2434, Brisbane, 4001, Queensland, Australia E-mail:
[email protected],
[email protected]

Wark, T., and Sridharan, S., Adaptive Fusion of Speech and Lip Information for Robust Speaker Identification, Digital Signal Processing 11 (2001) 169–186. This paper compares techniques for asynchronous fusion of speech and lip information for robust speaker identification. In any fusion system, the ultimate challenge is to determine the optimal way to combine all information sources under varying conditions. We propose a new method for estimating confidence levels to allow intelligent fusion of the audio and visual data. We describe a secondary classification system, where secondary classifiers are used to give approximations for the estimation errors of output likelihoods from primary classifiers. The error estimates are combined with a dispersion measure technique, allowing an adaptive fusion strategy based on the level of data degradation at the time of testing. We compare the performance of this fusion system with two other approaches to linear fusion and show that the use of secondary classifiers is an effective technique for improving classification performance. Identification experiments are performed on the M2VTS multimodal database [26], with encouraging results. © 2001 Academic Press
1. INTRODUCTION There is an increasing focus on the combination of audio and visual lip information in automatic speaker recognition (ASR) tasks, the major benefit being an increased system robustness in the presence of audio or visual noise [25]. In ideal or clean conditions, ASR systems perform very well using speech characteristics alone; however, considerable decreases in performance are observed as a result of adverse variables such as background noise, channel distortion, or reverberation. Static, as well as dynamic, facial information can suffer as a result of changing lighting conditions or poor picture quality.
Considerable research has gone into both of these disparate areas in an attempt to increase the robustness of person authentication systems [3, 21]. Techniques such as RASTA processing [13, 14], cepstral mean subtraction [21, 29], and pole filtering [24] reduce the effects of additive noise and channel distortion in the audio domain. The reduction of distortion effects such as reverberation has not been fully solved where the room response is unknown [5]. Likewise, image enhancement techniques such as intensity equalization or smoothing can be used to enhance the performance of facial recognition systems in adverse conditions [3]. In this paper we describe a speaker identification system where lip information is fused with corresponding speech information from each speaker. Intuitively we would expect lip information to be somewhat complementary to speech information due to the range of lip movements associated with the production of the corresponding phonemes in speech. Feature extraction from speech is now a well established area, and we apply standard techniques for this task. In contrast, the tracking and extraction of features from lips is by no means a simple task, with much recent research focused on lip tracking itself. We describe a new technique for tracking lips [38] which overcomes many of the problems of past tracking systems. An integral part of any fusion system is the ability to determine the optimum way to combine parameters given changing conditions for either input source. The fusion is known as catastrophic if the identification performance of the fused system is worse than the performance of either one of the input systems [23]. Ideally a fusion system will associate a low confidence with incorrectly classified data and a high confidence with correctly classified data; however, this is by no means a straightforward task. Consequently the underlying challenge is to be able to detect those attributes which are consistently associated with high-quality or low-quality data and apply this to the weighting process of the fusion system.
2. RECENT BIMODAL DEVELOPMENTS The majority of past audio-visual fusion work has been performed in the area of bimodal speech recognition, although some work has considered biometric person identification [8, 17] and bimodal noisy speech enhancement [19]. We present a brief review of recent work in bimodal fusion.
2.1. Hidden Markov Model Systems The hidden Markov model (HMM) has been used extensively in speech recognition systems, due to its ability to model the underlying temporal characteristics of features within words or continuous speech. The majority of work in bimodal speech recognition has also made use of the HMM for combining audio and visual information [15, 33]. The use of HMMs in audio-visual fusion evolves from the fact that the maximum likelihood criterion [16], used to estimate the means and the variances of both the audio and the visual features, also gives a rough indication of the reliability of each data source.
Word-dependent acoustic-labial weights in HMM-based speech recognition have been determined using discriminative training [16]. In discriminative training we consider the probability of an observation being produced by a given model when we also know the characteristics of the other models. In the discriminative training stage we maximize the ratio of the given model likelihood to the product of the other model likelihoods. Recent work [27] has also used discriminative training to form multistream audio-visual HMMs. The generalized probabilistic descent (GPD) algorithm is used to minimize a loss function for a sentence, where the GPD algorithm updates the parameters of the multistream HMM. These HMMs are shown to have superior performance over individual single-stream audio and visual HMMs. The use of semicontinuous HMMs has also been investigated for audio-visual integration for the classification of isolated words [34]. A combination of both early and late fusion techniques has been trialed in conjunction with the HMM [1, 35], by forming composite features prior to HMM training and testing (early fusion) or by combining HMM likelihood outputs via a linear function (late fusion).
2.2. Artificial Neural Networks Artificial neural networks (ANNs) have also been used as classifiers for audio-visual fusion. One of the main classes of ANNs investigated for audio-visual fusion is multistate time-delay neural networks [7, 22]. Audio and visual weights $\alpha_{aud}$ and $\alpha_{vis}$ are determined by evaluating the entropy weights $S_A$ and $S_V$ over all acoustic and visual activations. A high entropy, and hence a low weight, is found when activations are evenly spread over a layer, implying high ambiguity for that particular modality. The other main class of ANNs investigated in the past is layered feed-forward (LFF) neural networks [40]. When the network is presented with audio and visual data, error gradients are computed using the back-propagation algorithm to form the classification system for each mode. When fusing audio and visual data together, however, the weighting is determined as a linear function of the audio SNR, rather than from the LFF neural network itself. Synergetic computers have also been investigated as a classifier for use in multisensor person identification [37].
2.3. Weight Adaptation via Noise Levels Another commonly used class of fusion systems determines the weighting adaptively based on the level of audio noise present in the test data [31]. This requires the determination of a mapping function at training time to relate the appropriate weighting value $\alpha \in [0, 1]$ to the noise level. The disadvantage of this type of system is that it requires noisy data, at various levels, at the time of training. It also implies that the mapping function will be highly dependent on the type of noise present at training time.
2.4. Statistical Methods The final major area of work in linear audio-visual fusion is the statistical evaluation of output likelihoods from audio and visual classifiers [11]. The most common technique used is a dispersion measure, where a low dispersion implies a high ambiguity in the output scores, resulting in a low weight [31]. Similar techniques have been used for fusion of decisions in speech-only identification and verification problems [6, 10, 28].
2.5. Proposed Method We propose an extension to the current statistical methods in which secondary classifiers are used to model the distribution of output log-likelihoods from the primary classifiers. By training the secondary classifiers on the log-likelihoods produced when the primary classifiers are presented with high-quality data, a priori "knowledge" is built into the system, allowing estimates of the quality of incoming data.
3. SYSTEM FEATURE EXTRACTION 3.1. Audio Subsystem The audio subsystem feature extraction is quite standard, with mel-cepstral features [30] being extracted from the speech. Silence is first removed from the speech via low-energy thresholding. This is followed by the calculation of the magnitude spectrum of 32 ms speech segments. The spectrum is then pre-emphasised and processed by a mel-scale filterbank. Finally, the filterbank coefficients are cosine transformed to produce the cepstral coefficients. Mel-cepstral features have been shown in the past to be well suited for speaker identification purposes [21], hence their use in this application.
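As a rough illustration of this front end, the following Python sketch (using only numpy and scipy) frames the speech, drops low-energy frames, and computes mel-cepstral coefficients. The frame overlap, filterbank size, silence threshold, and time-domain pre-emphasis are illustrative assumptions, not the exact settings of our system.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filterbank (rows: filters, cols: FFT bins)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mel_cepstrum(signal, sr, frame_ms=32, n_filters=24, n_ceps=12,
                 energy_floor=0.01):
    """Frame the speech, drop low-energy (silence) frames, and return
    mel-cepstral coefficients for the remaining frames."""
    frame_len = int(sr * frame_ms / 1000)
    n_fft = 2 ** int(np.ceil(np.log2(frame_len)))
    # Pre-emphasis to lift the high-frequency part of the spectrum.
    emphasised = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frames = [emphasised[i:i + frame_len]
              for i in range(0, len(emphasised) - frame_len, frame_len // 2)]
    fb = mel_filterbank(n_filters, n_fft, sr)
    feats = []
    for frame in frames:
        if np.mean(frame ** 2) < energy_floor:      # silence removal
            continue
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(frame_len), n_fft))
        logmel = np.log(fb @ spectrum + 1e-10)      # mel filterbank energies
        feats.append(dct(logmel, norm='ortho')[1:n_ceps + 1])  # cepstra
    return np.array(feats)
```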
3.2. Visual Subsystem There are many areas of the face which provide person dependent information. Features such as eyes, nose, mouth, and overall facial shape have unique variations from person to person. Of these features we would intuitively expect the mouth or lip region to provide information which is most closely related to speech information. While other more uncorrelated facial features such as retinal information may seem more useful for audio-visual fusion purposes, we chose to use lips as a visual feature for speaker identification to allow the system to form part of a larger scale project in multimodal speech processing. Future work will exploit the dynamic features contained within lips, for temporal speaker and speech recognition. Based on this, we chose to extract visual features from the mouth region via lip tracking and use this information to enhance the speaker recognition process. Lip tracking in itself is a challenging task, particularly where systems must cope with facial movement or poor lighting conditions. Gradient-based techniques for edge detection are often not successful due to the poor contrast of lips to the surrounding skin region. Successful lip tracking via intensity
information has been presented using active shape models [20]; however, this technique requires a priori knowledge of the lip shape via manual labeling. Recent work [32] has used B-splines to track the outer lip contour using chromatic information around the lips. Other similar techniques [4] also use color information to build a parametric deformable model for the lip contour. These techniques require optimization or iterative techniques to refine estimates of the contour model to the lips. We have presented in detail [39] a new method for lip tracking using a combined chromatic-parametric approach, where the parametric lip contour polynomial model is derived directly from chromatic information. This technique provides computational advantages, as no minimization procedure is required to fit the contour model. An example of the tracking performance of the system is shown over a number of frames in Fig. 1. It can be seen that the polynomial model can cater to a wide range of lip poses, providing a good representation of the outer lip contour. Features are extracted via color profiles taken around the lip contour. As the contour model follows the moving lips, the chromatic features will be consistent with respect to the lip position. This is illustrated in Fig. 2. Features are reduced via the use of principal component analysis (PCA), followed by linear discriminant analysis (LDA). In this way, lip features are chosen which provide the greatest discrimination between speakers. We cannot apply LDA directly to the original chromatic features because the within-class scatter matrix $S_w \in \mathbb{R}^{k \times k}$ is singular when the number of image samples $K$ from each training class is less than the dimension $k$ of the feature vectors. However, after PCA, the feature vector dimensions are reduced to well below $K$.
FIG. 1. Lip tracking performance.
FIG. 2. Color profile vectors.
A complete description of these feature reduction steps may be found in [39].
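A compact sketch of this two-stage reduction using scikit-learn is given below. The number of principal components retained is an illustrative choice; the dimensions actually used are given in [39].

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_lip_features(X_train, y_train, n_pca=40, n_lda=None):
    """PCA first (so the within-class scatter is no longer singular),
    then LDA to pick directions that best separate the speaker classes."""
    pca = PCA(n_components=n_pca).fit(X_train)
    Z = pca.transform(X_train)
    lda = LinearDiscriminantAnalysis(n_components=n_lda).fit(Z, y_train)
    return pca, lda, lda.transform(Z)
```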
4. PRIMARY CLASSIFIERS Classification of both audio and visual data is achieved via the use of the Gaussian mixture model (GMM). These models have been used extensively in the past for the modeling of the output probability distribution of speech features for a particular speaker [30]. The multimodal nature of the model allows it to cater for a wide range of voice characteristics for each speaker. The Gaussian mixture density for a given model $\lambda_i$ is given by

$$p(\vec{x}\,|\lambda_i) = \sum_{m=1}^{M} p_{im}\,\mathcal{N}(\vec{x}, \vec{\mu}_{im}, \Sigma_{im}), \qquad (1)$$
where $p_{im}$ is the mixture weight for mixture $m$ of speaker $i$, $M$ is the total number of mixtures, and $\mathcal{N}(\vec{x}, \vec{\mu}, \Sigma)$ is a multivariate Gaussian function with mean $\vec{\mu}$ and covariance matrix $\Sigma$. The decision rule for identifying a speaker, as defined by Bayes' rule [30], is

$$\hat{s} = \arg\max_{i} \frac{1}{T}\sum_{t=1}^{T} \log p(x_t|\lambda_i), \qquad (2)$$

where $T$ is the total number of frames available for testing and $x_t$ is a frame, $t \in [1, 2, \ldots, T]$.
FIG. 3. Speaker models for lip-feature distributions.
Experiments also showed that the distribution patterns of features from a speaker's moving lips, over a period of time, held speaker-dependent qualities. Based on this, we chose also to use the multimodal nature of the GMM to allow it to model the wide variation in features from a speaker's moving mouth. For each speaker, the best two features after PCA and LDA, for two different recording sessions, are shown in Fig. 3. In each case the distribution patterns of lip features for each speaker can be seen to be very similar, despite the fact that the recordings were taken some time apart. Contour representations of the GMMs for each speaker can be seen to be significantly different, thus providing a good basis for identification based on lip movements only.
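For concreteness, a minimal sketch of this classifier stage in Python using scikit-learn's GaussianMixture follows. The mixture count and diagonal covariances are illustrative assumptions; the paper does not specify these details.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(features_per_speaker, n_mixtures=16):
    """One GMM per speaker, trained on that speaker's feature frames
    (Eq. (1)); features_per_speaker maps speaker id -> frame array."""
    return {spk: GaussianMixture(n_components=n_mixtures,
                                 covariance_type='diag').fit(frames)
            for spk, frames in features_per_speaker.items()}

def identify(gmms, test_frames):
    """Average per-frame log-likelihood under each model and pick the
    best-scoring speaker (Eq. (2))."""
    scores = {spk: gmm.score(test_frames) for spk, gmm in gmms.items()}
    return max(scores, key=scores.get), scores
```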
5. AUDIO–VISUAL FUSION SYSTEM 5.1. System Structure The aim of any fusion system is to combine information from various sources so that, in the case of identification, the resulting performance is greater than or equal to the performance of the best individual source. Anything less than this is termed catastrophic fusion [23] and is of course undesirable for a speaker identification problem. Two main approaches can be taken to fusion: direct fusion and output fusion [9, 36]. In direct fusion, features from each source are combined prior to classification, whereas in output fusion, features from each source are separately classified, with the classifier outputs then being combined. Past research [1] has shown that output fusion is in general superior for audio and visual fusion.
FIG. 4. Asynchronous linear output fusion.
The basic structure of our fusion system is that of asynchronous linear output fusion, outlined in Fig. 4. In asynchronous fusion, the output likelihoods for each model are calculated given whole segments of speech and lip information, before combining the outputs in a linear fashion. In our system, each segment was around 6 s in length. The aim of output fusion is to form an accurate estimate of the a posteriori probability $P(\lambda_s|x_{aud}, x_{vis})$, where

$$P(\lambda_s|x_{aud}, x_{vis}) = \frac{p(x_{aud}, x_{vis}|\lambda_s)\,P(\lambda_s)}{p(x_{aud}, x_{vis})}. \qquad (3)$$
Given that $p(x_{aud}, x_{vis})$ is constant over all speaker classes, and assuming that the a priori likelihoods for each speaker are equal, we can rewrite Eq. (3) as

$$P(\lambda_s|x_{aud}, x_{vis}) \propto p(x_{aud}, x_{vis}|\lambda_s). \qquad (4)$$
It has been shown [18] that we can estimate the a posteriori probabilities $P(\lambda_s|x_{aud}, x_{vis})$ as a sum of the individual a posteriori probabilities for each mode. Given that we can assign a confidence $\alpha$ to each mode, and taking logs for numerical convenience, we determine the final output likelihood as

$$\log P(\lambda_s|x_{aud}, x_{vis}) = \alpha \log P(\lambda_s^{aud}|x_{aud}) + (1-\alpha)\log P(\lambda_s^{vis}|x_{vis}), \qquad (5)$$

where $\log P(\lambda_s|x_{aud}, x_{vis})$ is the log-probability of speaker $s$ having generated the audio and visual feature vectors $x_{aud}$ and $x_{vis}$, given the audio and visual primary classifiers $\lambda_s^{aud}$ and $\lambda_s^{vis}$ and a weighting factor $\alpha \in [0, 1]$. Given this fusion structure, the main challenge is to determine an appropriate weighting to assign to the audio and visual classifier outputs. As the level of speech degradation increases due to increasing noise, we would wish to place more and more emphasis on visual information. Hence we need some way to assign a measure of confidence to the incoming audio and visual data, drawn from the data itself. The following sections present three methods for allocating weights to audio and visual data.
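A small sketch of the weighted combination in Eq. (5), assuming the per-speaker average log-likelihoods have already been computed for each modality (the function and variable names here are ours, not from the original system):

```python
def fuse_scores(audio_scores, visual_scores, alpha):
    """Weighted sum of audio and visual average log-likelihoods (Eq. (5));
    both inputs map speaker id -> average log-likelihood."""
    fused = {spk: alpha * audio_scores[spk] + (1 - alpha) * visual_scores[spk]
             for spk in audio_scores}
    return max(fused, key=fused.get), fused
```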
5.2. Equal Prior Weights For an artificial test set, where the identity of the target speaker is known, the optimum value of $\alpha$ can be determined empirically by varying the level of $\alpha$, according to Eq. (5), between 0 and 1. This, however, is not possible in a real-life speaker identification application, where the identity of the speaker is of course not known. Without making any prior assumptions about the quality of each data source, a reasonable compromise is to set $\alpha = 0.5$. In other words, we weight the contributions of the audio and visual data equally for the identification problem. The results for using this technique are presented in Section 6.
5.3. Dispersion Confidence Measure The technique presented in Section 5.2 is not capable of adapting to the surrounding environment, in that the system weighting is fixed regardless of the quality of either data source. The technique presented in this section adapts the weighting factor based upon the quality of the data at the time of testing. The system achieves this by considering the dispersion of scores, or average output log-likelihoods, from the audio and visual primary classifiers. In general, we would expect that for high-quality information, the GMM score assigned to the correct speaker model would be significantly higher than the scores assigned to the other speaker models. As the input information degrades, the score assigned to the correct model would merge with the other scores, or indeed other speaker models could surpass the correct model's score. Based on this, the confidence measure we used was taken as the difference between the top two speaker model scores, normalized by the mean of all speaker model scores. This can be expressed for $S$ speakers as

$$u_{best} = \max_{1 \le i \le S} \log P(\lambda_i|x), \qquad (6)$$

$$u_{nextbest} = \max_{1 \le i \le S,\; i \ne i_{best}} \log P(\lambda_i|x), \qquad (7)$$

$$u_{mean} = \frac{1}{S}\sum_{i=1}^{S} \log P(\lambda_i|x), \qquad (8)$$

$$\kappa = \frac{|u_{best} - u_{nextbest}|}{|u_{mean}|}, \qquad (9)$$

where $i_{best}$ is the index attaining the maximum in Eq. (6) and $\kappa$ is the confidence measure, which is evaluated over both the audio and visual classifier outputs and notated $\kappa_{aud}$ and $\kappa_{vis}$. We then evaluate the weighting factor $\alpha \in [0, 1]$ as

$$\alpha = \frac{\kappa_{aud}}{\kappa_{aud} + \kappa_{vis}}. \qquad (10)$$

The results from this technique are also presented in Section 6.
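The dispersion measure and resulting weight of Eqs. (6)–(10) reduce to a few lines; the sketch below assumes the scores are held in a speaker-to-score dictionary, as in the earlier sketches.

```python
import numpy as np

def dispersion(scores):
    """Eqs. (6)-(9): gap between the two best model scores, normalised
    by the magnitude of the mean score."""
    vals = np.sort(np.array(list(scores.values())))[::-1]
    return abs(vals[0] - vals[1]) / abs(vals.mean())

def dispersion_weight(audio_scores, visual_scores):
    """Eq. (10): audio weight alpha from the two dispersion measures."""
    k_aud, k_vis = dispersion(audio_scores), dispersion(visual_scores)
    return k_aud / (k_aud + k_vis)
```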
5.4. Secondary Classifiers The dispersion measure is in itself a reasonable confidence measure for audio-visual fusion; however, the technique breaks down when an incorrectly classified score stands out well above the other classifier scores. In this situation the confidence measure will assign a high level of confidence to the particular data while the classifier outputs are quite incorrect. We propose an alternative strategy for obtaining a posteriori classifier confidences, where secondary classifiers are included to obtain additional information for estimating the error that is present in each modality. The technique is described in more detail in the following sections.
5.4.1. Motivation for use of secondary classifiers. In a practical scenario, due to mismatches between training and testing features, classifiers will never output the true a posteriori log-probabilities [18] $\log P(\lambda_i|x_{mode})$, but rather their estimates $\log \hat{P}(\lambda_i|x_{mode})$. We define the true a posteriori probabilities as the output probabilities arising from an ideal classification scenario where there are no associated errors on top of the standard Bayes error for the classifier. We can relate the true and estimated log-probabilities via an additional estimation error $\varepsilon(x_{mode})$ as

$$\log \hat{P}(\lambda_i|x_{mode}) = \log P(\lambda_i|x_{mode}) + \varepsilon(x_{mode}). \qquad (11)$$
As the case of a classifier having no additional error, i.e., $\varepsilon(x_{mode}) = 0$, can never exist in reality, we can best approximate this situation when a classifier is tested with clean data, with a minimal mismatch between training and evaluation conditions. In this case the additional error $\varepsilon(x_{mode})$ is virtually negligible, thus the actual a posteriori probabilities will be very close to the "true" a posteriori probabilities for that classifier. This process is discussed in more detail in Section 5.4.2. Given Eq. (11), we can determine the variance of the error $\sigma^2_{\varepsilon_i(x_{mode})}$ as

$$\sigma^2_{\varepsilon_i(x_{mode})} = \mathrm{var}\big[\log \hat{P}(\lambda_i|x_{mode}) - \log P(\lambda_i|x_{mode})\big]. \qquad (12)$$

As $\log \hat{P}(\lambda_i|x_{mode})$ and $\log P(\lambda_i|x_{mode})$ are two random variables, which we reasonably assume to be independent since the output probabilities are obtained from entirely separate data, we can calculate the variance as the sum of the individual variances

$$\sigma^2_{\varepsilon_i(x_{mode})} = \mathrm{var}\big[\log \hat{P}(\lambda_i|x_{mode})\big] + \mathrm{var}\big[\log P(\lambda_i|x_{mode})\big], \qquad (13)$$
where var[log Pˆ (λi |xmode )] and var[log P (λi |xmode )] are calculated as the variances of the frame output log-probabilities over the input segments for the noisy test and clean evaluation data, respectively. As the additional contributing error should equally affect all S speaker classifiers within each modality, we can obtain a better estimate of the error
variance by averaging over all classifier probability outputs as

$$\sigma^2_{\varepsilon(x_{mode})} = \frac{1}{S}\sum_{i=1}^{S} \mathrm{var}\big[\log \hat{P}(\lambda_i|x_{mode})\big] + \frac{1}{S}\sum_{i=1}^{S} \mathrm{var}\big[\log P(\lambda_i|x_{mode})\big]. \qquad (14)$$
Given that the additional error may also be biased, i.e., $E\{\varepsilon(x_{mode})\} \ne 0$, another important measure of the error is the magnitude of the difference between the true and estimated a posteriori log-probabilities. We determine this as an average over all $S$ model outputs within each mode as

$$\mu_{\varepsilon(x_{mode})} = \frac{1}{S}\sum_{i=1}^{S} \mathrm{mean}\big[\log \hat{P}(\lambda_i|x_{mode})\big] - \frac{1}{S}\sum_{i=1}^{S} \mathrm{mean}\big[\log P(\lambda_i|x_{mode})\big], \qquad (15)$$
where $\mathrm{mean}[\log \hat{P}(\lambda_i|x_{mode})]$ and $\mathrm{mean}[\log P(\lambda_i|x_{mode})]$ are calculated as the means of the frame output log-probabilities over the input segments for the noisy test and clean evaluation data, respectively.

5.4.2. Training of secondary classifiers. In order to obtain estimates of the classifier errors we must know the true output a posteriori log-probabilities. While this is inherently not possible, we can obtain very close estimates by testing the primary classifiers with features obtained in identical circumstances to the training data, yet distinct from the original training features. In these cases we expect the additional estimation error to be negligible; hence the estimated a posteriori log-probabilities will be very close to the true a posteriori log-probabilities, as discussed in Section 5.4.1. The purpose of the secondary classifiers is to form a model for the true a posteriori log-probabilities during this supervised testing stage. We propose that the probability distributions of the primary classifier a posteriori log-probabilities, or scores $u$, can be characterized by unimodal Gaussian distributions, in this way fitting within the framework of the equations given in the previous section. Hence we model the output scores of each mode as

$$\varphi_i^{aud}(u^{aud}) = \frac{1}{\sqrt{2\pi\sigma_i^{2\,aud}}} \exp\left(-\frac{1}{2}\frac{(u^{aud} - \mu_i^{aud})^2}{\sigma_i^{2\,aud}}\right), \qquad (16)$$

$$\varphi_i^{vis}(u^{vis}) = \frac{1}{\sqrt{2\pi\sigma_i^{2\,vis}}} \exp\left(-\frac{1}{2}\frac{(u^{vis} - \mu_i^{vis})^2}{\sigma_i^{2\,vis}}\right), \qquad (17)$$

where
- $u$ is the output a posteriori log-probabilities from the audio or visual primary GMM $\lambda_i^{aud}$ and $\lambda_i^{vis}$, where $u = u_t$ for $t \in [1, 2, \ldots, N_{frames}]$;
- $i \in [1, S]$ for $S$ speakers;
- $\mu_i^{mode}$ and $\sigma_i^{2\,mode}$, $mode \in [aud, vis]$, are the mean and variance of the output a posteriori log-probabilities from the audio and visual primary models $\lambda_i^{aud}$ and $\lambda_i^{vis}$.

The basic concept of the secondary classifier stage is outlined in Fig. 5.
FIG. 5. Secondary output classifiers.
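As a rough sketch of this supervised testing stage, the following Python fragment scores clean evaluation frames against each primary GMM and stores the mean and variance of the per-frame log-likelihoods as the secondary-classifier parameters of Eqs. (16)–(17). The exact pairing of evaluation segments with models and the dictionary layout are our assumptions, not details given in the paper.

```python
import numpy as np

def train_secondary(primary_gmms, clean_eval_frames):
    """For each speaker model, score the clean (matched) evaluation data and
    store the mean and variance of the per-frame log-likelihoods; these act
    as the unimodal Gaussian secondary classifiers of Eqs. (16)-(17)."""
    stats = {}
    for spk, gmm in primary_gmms.items():
        u = gmm.score_samples(clean_eval_frames)   # per-frame log-likelihoods
        stats[spk] = {'mean': float(np.mean(u)), 'var': float(np.var(u))}
    return stats
```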
5.4.3. Determining a priori confidences. In order to allow the final calculated confidences to reflect the a priori qualities of each mode, we make use of the supervised testing stage to determine the identification accuracy of each mode given closely matched train/test conditions. Given the identification accuracies $\theta_{aud}$ and $\theta_{vis}$ during this stage, we calculate the relative a priori weightings $\rho_{mode}$ as

$$\rho_{mode} = \frac{\theta_{mode}}{\sum_{mode=1}^{N_{mode}} \theta_{mode}}. \qquad (18)$$
5.4.4. Calculating test confidences. During testing of noisy data, the primary classifiers will output their respective estimated a posteriori log-probabilities. We calculate the mean and variance parameters of the output test a posteriori log-probabilities over all models, giving

$$\frac{1}{S}\sum_{i=1}^{S} \mathrm{var}\big[\log \hat{P}(\lambda_i|x_{mode})\big] \quad \text{and} \quad \frac{1}{S}\sum_{i=1}^{S} \mathrm{mean}\big[\log \hat{P}(\lambda_i|x_{mode})\big].$$

Given the parameters stored by the secondary classifiers as well as these test parameters, we can calculate $\sigma^2_{\varepsilon(x_{mode})}$ and $\mu_{\varepsilon(x_{mode})}$ as shown in Eqs. (14) and (15). Based on the parameters of the estimated error we derive a confidence measure as

$$\nu_{mode} = \frac{\rho_{mode}\,\sigma^2_{mode}}{\sigma^2_{\varepsilon(x_{mode})} + \mu_{\varepsilon(x_{mode})}}, \qquad (19)$$

where $\sigma^2_{mode}$ is the average variance of the true a posteriori scores for each mode, as obtained from the secondary classifier. This measure is incorporated as a normalization factor to bias most strongly against a system whose output scores exhibit an estimated error with high mean or variance, relative to the variance of the true a posteriori scores for each mode. The dispersion measure of Section 5.3 also provides valuable confidence information not contained within the secondary classifiers.
In order to exploit this, we determine the final weights for each modality by combining both the secondary classifier and the dispersion measure scores as

$$\alpha_{mode} = \frac{\nu_{mode} + \kappa_{mode}}{\sum_{mode=1}^{N_{mode}} (\nu_{mode} + \kappa_{mode})}, \qquad (20)$$

where $\kappa_{mode}$ is determined as shown in Eq. (9).
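The test-time confidence and final weights of Eqs. (14), (15), and (18)–(20) can be sketched as follows, assuming per-model mean/variance statistics are stored in the same dictionary format as the secondary-classifier sketch above. The absolute value on the bias term is our reading of the "magnitude of difference" in Eq. (15).

```python
import numpy as np

def error_stats(test_stats, clean_stats):
    """Eqs. (14)-(15): average over models of the summed variances and the
    difference of mean scores between noisy test and clean evaluation data."""
    var_e = np.mean([test_stats[s]['var'] + clean_stats[s]['var']
                     for s in clean_stats])
    mu_e = abs(np.mean([test_stats[s]['mean'] - clean_stats[s]['mean']
                        for s in clean_stats]))
    return var_e, mu_e

def modality_confidence(test_stats, clean_stats, rho):
    """Eq. (19): confidence for one modality; rho is the a priori weighting
    of Eq. (18), and the numerator uses the average clean-score variance."""
    var_e, mu_e = error_stats(test_stats, clean_stats)
    sigma2_mode = np.mean([clean_stats[s]['var'] for s in clean_stats])
    return rho * sigma2_mode / (var_e + mu_e)

def fusion_weights(nu, kappa):
    """Eq. (20): combine secondary-classifier and dispersion scores,
    e.g. nu = {'aud': ..., 'vis': ...}, kappa likewise."""
    total = sum(nu[m] + kappa[m] for m in nu)
    return {m: (nu[m] + kappa[m]) / total for m in nu}
```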
6. EXPERIMENTS We trained and tested the audio and visual identification systems using the M2VTS multimodal database, the largest multimodal database publicly available at the time of writing. (A larger database, XM2VTS, has since been released.) The M2VTS database consists of over 27,000 color images of 37 subjects counting from "zéro" to "neuf" (zero to nine, in French) over a number of different sessions, with a week between each session. We used the first two recording sessions as training data, the third session for supervised testing, and the fourth session as test data.
6.1. Additive White Noise A common source of speech degradation is additive background noise. We define additive noise as that class of noise which is additive in the time domain. We can simulate the effects of additive background noise by adding a random white signal of varying amplitude to the original speech signal. 6.1.1. Audio and visual subsystems. Clean speech data was corrupted with additive white noise of varying levels for the purposes of our experiments. No artificial degradation was imposed upon the visual data, because the database already contains natural challenges such as changes in facial pose and facial hair from session to session. Thus during the testing process the visual data was held constant while the level of audio degradation was varied. 6.1.2. Fused system. The speaker identification results for increasing white noise are shown in Fig. 6. At SNR levels where the speech-only recognition accuracy is above 50%, the secondary classifier fusion system can be seen to improve upon the performance of the dispersion measure fusion system. Once speech-only recognition performance falls very low, however, the secondary system begins to degrade performance to a small degree. This indicates that when noise starts to dominate the input audio features, the secondary classifiers obtain a much poorer estimate of the a posteriori probability error parameters. This problem is discussed further in Section 7.
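A short sketch of the white-noise corruption described in Section 6.1.1, scaling the noise to a target SNR; defining the SNR over the whole segment is an illustrative choice.

```python
import numpy as np

def add_white_noise(speech, snr_db):
    """Corrupt a clean speech signal with white Gaussian noise scaled so the
    segment-level SNR equals snr_db (in dB)."""
    speech_power = np.mean(speech ** 2)
    noise = np.random.randn(len(speech))
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    noise *= np.sqrt(target_noise_power / np.mean(noise ** 2))
    return speech + noise
```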
6.2. Reverberated Speech Room reverberation of speech occurs to some extent in almost any enclosed area. As sound is reflected off walls back to the source, the resulting speech spectrum is smeared, reducing both speech intelligibility and speaker-dependent qualities.
FIG. 6. Speaker identification results for additive white noise.
We can mathematically express a reverberated signal $r(n)$ as the convolution of the original signal $s(n)$ with the room impulse response $h(n)$:

$$r(n) = s(n) * h(n). \qquad (21)$$
The effects of speech reverberation on ASR have not been studied extensively in the past; however, the work that has been done demonstrates a considerable drop in recognition performance. The case of automatic speaker verification (ASV) [2] has been considered under varying reverberant conditions, where ASV performance degrades sharply as reverberation time is increased and/or the enclosure size is decreased. Other researchers [12] have considered the use of acoustic array processing and spectral normalization to develop a more robust ASR system in reverberant conditions, with some performance improvement resulting from these steps. As visual lip information is unaffected by reverberant conditions, we are interested in particular in evaluating the extent to which lip information can improve identification performance as reverberation of the speech increases. 6.2.1. Audio and visual subsystems. The speech data was artificially reverberated using an image method [2], where the level of reverberation was increased by increasing the reflection coefficients of the simulated room. We found that the addition of reverberation to the training data dramatically improved speech-only identification results for testing with reverberated speech data, the results of which are shown in Fig. 7. To implement this, we reverberated the training speech in a room different from any of the conditions to which the test speech was subjected, this being a realizable step for a real-life problem.
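A sketch of Eq. (21) in Python follows; here a synthetic exponentially decaying noise tail stands in for the image-method room response used in the experiments, and the decay constant and RT60 value are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, sr, rt60=0.5):
    """Eq. (21): r(n) = s(n) * h(n). Here h(n) is a simple exponentially
    decaying noise tail standing in for an image-method room response."""
    n = int(rt60 * sr)
    t = np.arange(n) / sr
    h = np.random.randn(n) * np.exp(-6.9 * t / rt60)  # ~60 dB decay over rt60
    h[0] = 1.0                                         # direct path
    return fftconvolve(speech, h)[:len(speech)]
```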
FIG. 7. Speaker identification results for reverberated speech.
Once again, the visual data was held constant, while the level of audio reverberation was varied. 6.2.2. Fused system. The speaker identification results for increasing speech reverberation are shown in Fig. 7. We observe that the secondary classification fusion system either equals or outperforms the fusion performance of both the fixed weight and the dispersion measure systems.
7. CONCLUSIONS We have considered the performance improvement of speaker identification in various adverse conditions using lip information. We have compared techniques for estimating confidence measures for incoming audio and visual information to allow adaptive fusion of audio and visual primary classifier outputs. The secondary classifiers are found to be effective for estimating the quality of incoming audio data under both additive and convolutional noise, provided the audio stream is still predominantly characterized by the speech signal rather than the noise. Once noise starts to dominate the signal, however, the secondary system forms poorer estimates of the estimation error for each mode. This can be explained by the fact that when a signal consists predominantly of noise, features representing the noise start to dominate the output p.d.f. Given that a very high level of noise in a signal is likely to produce very similar scores across all models, we would expect the estimate of the error variance $\sigma^2_{\varepsilon(x_{mode})}$ to become very small. In the case where $\sigma^2_{\varepsilon(x_{mode})} \gg \mu_{\varepsilon(x_{mode})}$, this has the effect of reducing the denominator term of Eq. (19), giving a higher level of confidence than is optimal for that level of noise.
Despite this shortfall in extreme cases, the current system would be useful in a wide range of audio-visual applications, considering that in most speech applications we would expect speech to be at least the predominant part of the signal. The results of the experiments are encouraging and show the importance of lip information for speaker identification when speech is highly degraded. While we used the largest multimodal database publicly available at the time of writing, the amount of data per speaker is limited; nevertheless, the results are promising for future larger-scale work in the multimodal area. Adaptive multimodal fusion is still a relatively new area that is finding increasing application. One of the main aims of this paper was to highlight the potential that exists for performance improvement and increased robustness when visual information can be adaptively combined with speech information.
ACKNOWLEDGMENTS This work was carried out in support of the European Commission ACTS Project M2VTS and was supported by a CSIRO research contract.
REFERENCES
1. Alissali, M., Deleglise, P., and Rogozan, A., Asynchronous integration of visual information in an automatic speech recognition system. In Int. Conf. on Spoken Language Processing. 1996.
2. Castellano, P. J., Sridharan, S., and Cole, D., Speaker recognition in reverberant enclosures. In Int. Conf. on Acoustics Speech and Signal Processing. 1996.
3. Chellappa, R., Wilson, C., and Sirohey, S., Human and machine recognition of faces: A survey. In Proceedings of the IEEE. May 1995, Vol. 83, pp. 705–740.
4. Coianiz, T., Torresani, L., and Caprile, B., 2D deformable models for visual speech analysis. In Speechreading by Humans and Machines. Springer-Verlag, Berlin/New York, 1995.
5. Cole, D. R., Intelligibility Enhancement of Severely Reverberant Speech. Ph.D. thesis, Queensland University of Technology, 1997.
6. Demirekler, M. and Saranl, A., A study on improving decisions in closed set speaker identification. In Int. Conf. on Acoustics Speech and Signal Processing. 1997, pp. 1127–1130.
7. Duchnowski, P., Meier, U., and Waibel, A., See me, hear me: Integrating automatic speech recognition and lip-reading. In Int. Conf. on Spoken Language Processing. 1994, pp. 547–550.
8. Falavigna, D. and Brunelli, R., Person recognition using acoustic and visual cues. In ESCA Workshop on Automatic Speaker Recognition, Identification and Verification. 1994, pp. 71–74.
9. Farrell, K. R. and Mammone, R. J., Data fusion techniques for speaker recognition. In Modern Methods of Speech Processing. Kluwer Academic, Dordrecht/Norwell, MA, 1995, pp. 279–297.
10. Genoud, D., Bimbot, F., Gravier, G., and Chollet, G., Combining methods to improve speaker verification decision. In Int. Conf. on Spoken Language Processing. 1996.
11. Golfarelli, M., Maio, D., and Maltoni, D., On the error-reject trade-off in biometric verification systems. IEEE Trans. Pattern Anal. Mach. Intell. 19, No. 7 (1997), 786–796.
12. Gonzalez-Rodriguez, J. and Ortega-Garcia, J., Robust speaker recognition through acoustic array processing and spectral normalization. In Int. Conf. on Acoustics Speech and Signal Processing. 1997.
13. Hanson, B., Applebaum, T., and Junqua, J.-C., Spectral dynamics for speech recognition under adverse conditions. In Automatic Speech and Speaker Recognition—Advanced Topics. Kluwer Academic, Dordrecht/Norwell, MA, 1996, Chap. 14, pp. 342–343.
14. Hermansky, H. and Morgan, N., RASTA processing of speech. IEEE Trans. Speech Audio Process. 2 (1994), 578–589.
15. Igawa, S., Ogihara, A., Shintani, A., and Takamatsu, S., Speech recognition based on fusion of visual and auditory information using full-frame color image. IEICE Trans. Fundamentals E79-A, No. 11 (1996), 1836–1840.
16. Jourlin, P., Word-dependent acoustic-labial weights in HMM-based speech recognition. In Proc. of AVSP Workshop, September 1997.
17. Jourlin, P., Luettin, J., Genoud, D., and Wassner, H., Acoustic-labial speaker verification. In Audio and Video-Based Biometric Person Authentication. Springer-Verlag, Berlin, 1997, pp. 319–326.
18. Kittler, J., Combining classifiers: A theoretical framework. Pattern Anal. Appl. 1 (1998), 18–27.
19. Girin, L., Feng, G., and Schwartz, J. L., Fusion of auditory and visual information for noisy speech enhancement: A preliminary study of vowel transitions. In Int. Conf. on Acoustics Speech and Signal Processing. May 1998.
20. Luettin, J., Thacker, N. A., and Beet, S. W., Locating and tracking facial speech features. In Proc. Int. Conf. on Pattern Recognition. 1996, Vol. I, pp. 652–656.
21. Mammone, R. J., Zhang, X., and Ramachandran, R. P., Robust speaker recognition—A feature based approach. IEEE Signal Process. Magazine, September 1996, 58–71.
22. Meier, U., Hurst, W., and Duchnowski, P., Adaptive bimodal sensor fusion for automatic speechreading. In Int. Conf. on Acoustics Speech and Signal Processing. 1996, pp. 833–836.
23. Movellan, J. R. and Mineiro, P., Modularity and Catastrophic Fusion: A Bayesian Approach with Applications to Audiovisual Speech Recognition. Technical report, University of California, January 1996.
24. Naik, D., Assaleh, K., and Mammone, R., Robust speaker identification using pole filtering. In ESCA Workshop on Automatic Speaker Recognition. April 1994, pp. 225–230.
25. Technical Committee on Multimedia Signal Processing, The past, present, and future of multimedia signal processing. IEEE Signal Process. Magazine, July 1997, 28–51.
26. Pigeon, S., The M2VTS database. Technical report, Laboratoire de Telecommunications et Teledetection, Place du Levant, 2-B-1348 Louvain-La-Neuve, Belgium, 1996. [http://www.tele.ucl.ac.be/M2VTS]
27. Potamianos, G. and Graf, H. P., Discriminative training of HMM stream exponents. In Int. Conf. on Acoustics Speech and Signal Processing. 1998, Vol. 6, pp. 3733–3736.
28. Radova, V. and Psutka, J., An approach to speaker identification using multiple classifiers. In Proc. Int. Conf. on Acoustics Speech and Signal Processing. 1997, pp. 1135–1138.
29. Rahim, M. G. and Juang, B. H., Signal bias removal for robust telephone speech in adverse conditions. In Proc. Int. Conf. on Acoustics Speech and Signal Processing, Adelaide. April 1994.
30. Reynolds, D. A., Speaker identification and verification using Gaussian mixture speaker models. Speech Comm. (1995), 91–108.
31. Rogozan, A., Deleglise, P., and Alissali, M., Adaptive determination of audio and visual weights for automatic speech recognition. In AVSP Workshop, ESCA. September 1997.
32. Ramos Sanchez, M. U., Matas, J., and Kittler, J., Statistical chromacity models for lip-tracking with B-splines. In Int. Conf. on Audio and Video-Based Biometric Person Authentication. 1997.
33. Shintani, A., Ogihara, A., Yamaguchi, Y., Hayashi, Y., and Fukunaga, K., Speech recognition using HMM based on fusion of visual and auditory information. IEICE Trans. Fundamentals E77, No. 11 (1994), 1875–1878.
34. Su, Q. and Silsbee, P. L., Robust audiovisual integration using semicontinuous hidden Markov models. In Int. Conf. on Spoken Language Processing. 1996.
35. Tomlinson, M. J., Russell, M. J., and Brooke, N. M., Integrating audio and visual information to provide highly robust speech recognition. In Int. Conf. on Acoustics Speech and Signal Processing. 1996.
36. Varshney, P. K., Multisensor data fusion. Electron. Comm. Eng. J., December 1997, 245–253.
37. Wagner, T. and Dieckmann, U., Multi-sensoral inputs for the identification of persons with synergetic computers. In Proc. Int. Conf. on Image Processing. 1994, Vol. 2, pp. 287–291.
38. Wark, T. J. and Sridharan, S., A syntactic approach to automatic lip feature extraction for speaker identification. In Int. Conf. on Acoustics Speech and Signal Processing. May 1998, Vol. 6, pp. 3693–3696.
39. Wark, T. J., Sridharan, S., and Chandran, V., An approach to statistical lip modelling for speaker identification via chromatic feature extraction. In Int. Conf. on Pattern Recognition. August 1998, Vol. 1, pp. 123–125.
40. Yuhas, B. P., Goldstein, M. H., and Sejnowski, T. J., Integration of acoustic and visual speech signals using neural networks. IEEE Comm. Magazine, November 1989, 65–71.
TIMOTHY J. WARK received the B.Eng. (Hons) from the University of Southern Queensland, Toowoomba, in 1997. In January 1997 he joined the Speech Research Laboratory at the Queensland University of Technology, Brisbane, where he is currently completing his Ph.D. His main research interests are in the field of multimodal speaker recognition, speech recognition, and audio-visual fusion techniques. Mr. Wark is a graduate member of the Institution of Engineers Australia and a student member of the Institute of Electrical and Electronics Engineers. S. SRIDHARAN obtained his BSc (Electrical Engineering) and MSc (Communication Engineering) from the University of Manchester Institute of Science and Technology, United Kingdom, and Ph.D. (Signal Processing) from the University of New South Wales, Australia. Dr. Sridharan is a Senior Member of the IEEE, USA, and a corporate member of the IEE, United Kingdom, and IEAust of Australia. He is currently an associate professor in the School of Electrical and Electronic Systems Engineering of the Queensland University of Technology (QUT) and is also Head of the Speech Research Laboratory at QUT.