Information Fusion 5 (2004) 103–117 www.elsevier.com/locate/inffus
The fusion of visual lip movements and mixed speech signals for robust speech separation

Parham Aarabi *, Bob Mungamuru

Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, Ont., Canada M5S 3G4

Received 2 September 2002; received in revised form 20 March 2003; accepted 21 March 2003

* Corresponding author. E-mail address: [email protected] (P. Aarabi).
Abstract

A technique for the early fusion of visual lip movements and a vector of mixed speech signals is proposed. This technique involves the initial recreation of speech signals entirely from the visual lip motions of each speaker. By using geometric parameters of the lips obtained from the Tulips1 database and the Audio–Visual Speech Processing dataset, a virtual speech signal is recreated by using audiovisual training segments as a basis for the recreation. It is shown that the visually created speech signal has an envelope that is directly related to the envelope of the original acoustic signal. This visual signal envelope reconstruction is then used to aid in the robust separation of the mixed speech signals by using the envelope information to identify the vocally active and silent periods of each speaker. It is shown that, unlike previous signal separation techniques, which required an ideal mixture of independent signals, the mixture coefficients can be very accurately estimated using the proposed technique in even non-ideal situations. For example, in the presence of speech noise, the mixing coefficients can be correctly estimated with signal-to-noise ratios (SNRs) as low as 0 dB, while in the presence of Gaussian noise, the estimation can be accurately done with SNRs as low as 10 dB.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Audiovisual information fusion; Blind speech separation; Independent component analysis; Audiovisual signal separation; Robust speech recognition
1. Introduction

Lip reading is a task that is often associated with robust speech recognition. The goal is to use the motion of the lips in order to improve the acoustic recognition of the words. Many different studies have shown improved speech recognition (both faster and more accurate) when visual cues are available [6–11,16–19,21,22]. Audiovisual speech recognition consists of three stages: visual feature extraction, audiovisual information fusion, and speech recognition. In the first stage, information from the video frames is processed in order to prepare it for integration with the acoustic signal [7]. One simplistic example of this is image-based data extraction, during which the image of the mouth is selected without any processing [7,19,20,22]. While all the information contained within that frame is automatically selected, it does not include any dimensionality
reduction and hence makes audiovisual information fusion extremely difficult. Other first stage techniques include visual motion based information extraction approaches [7,23], model based approaches [5,7,13–15], and geometric feature extraction techniques [7,18]. The latter techniques involve the automatic detection of pertinent features (such as the width and height of the outer corners of the mouth) that are then used for audiovisual information fusion. The advantage of this technique is that very few parameters are retained from each video frame, although obtaining these parameters under adverse conditions (camera rotations, image noise, etc.) is difficult. No matter what approach is employed in the first stage, the second stage is often far more important. Audiovisual information fusion can either be accomplished early in the speech recognition process [20,21] or late [18,19]. Early integration is often preferable since it will result in a more robust system [7]. Typical systems in the 'early integration' category often combine audiovisual cues before the phonetic classification stage.
There are applications, however, that require even earlier integration than that offered by typical audiovisual systems. One example is sound localization, wherein the unprocessed speech signals must be available for the accurate localization of the sound source [1–3,24]. In such applications, late integration limits the benefits of the integrated audiovisual system; hence 'very early' integration techniques are preferable. The goal here would be to use the visual information to remove or reduce the noise that may be present in the raw acoustic signals. Other applications that benefit from very early audiovisual fusion include speech separation. With speech separation, the goal is to linearly combine multiple mixed signals in order to separate the mixtures into their constituent components. Improved speech separation performance would subsequently lead to improved speech recognition accuracy, due to the increased signal-to-noise ratio at the input [31]. Many different speech separation techniques exist, including, most notably, Independent Component Analysis (ICA) [25–28]. ICA, however, suffers from several constraints which might be removed by the introduction of visual information. While the very early fusion of audio and video at the signal level has not been explored as much as late-stage fusion, a few such techniques have been proposed [30]. One such technique, proposed in [30], requires a single microphone and a single camera, and attempts to utilize camera pixels with high mutual information with the acoustic signal to modify the time-frequency representation of the acoustic signal. The hope here is that, because of the low likelihood of time-frequency overlap between a few speakers, the visual information can be used to effectively separate the different speech signals. This, of course, does break down as the number of speakers increases. The resulting separation using this approach ranges between a 9.2 dB SNR gain for a male speaker and a 5.6 dB SNR gain for a female speaker. While the work of [30] is a significant contribution to solving the audiovisual speech separation problem at hand, there are two problems that would limit the practicality of such an approach. First, in order for the noise removal to truly make a difference in speech recognition applications, a high SNR gain is required ([30] claimed an overall reduction of the speech recognition error rate from 49% preprocessed to 30% postprocessed, which is still too high for many applications). Second, many of the emerging applications for speech recognition (handheld computers, cars, etc.) have limited computational resources, and as a result, require a computationally simple audiovisual processing algorithm. The system of [30] requires computational resources that may not be available in today's mobile/handheld computers. The work in [4] provides some evidence that even the combination of incomplete acoustic signals and visual
information (video of the lips) can dramatically improve the recognition of speech by humans. For example, it was experimentally shown in [4] that by providing the amplitude-modulated fundamental frequency of the acoustic signal along with a video stream to the subjects, an average of 77.2% of the normal word recognition rate (i.e. the recognition rate with an ideal speech signal) would be obtained. Adequate recognition rates were also obtained when the visual information was combined with low-passed speech (77.7% of normal speech), single-tone amplitude-modulated speech (68%), a constant-amplitude fundamental frequency tone (67.9%), and even a constant tone and constant amplitude signal that was only on during the voiced speech sections (62.8%), thereby providing voicing duration cues. This tells us that in adverse acoustic conditions, or in cases where the acoustic signal is damaged, visual information can fill in the gaps where acoustic information is missing. In this paper, we present a solution to the speech separation problem, aided by a very early audiovisual integration technique. The basic idea was initially presented in [29] without the extensions and analyses that are presented here. The steps of the algorithm are, roughly, as follows. Using only the visual information and a training database of audiovisual signals, estimates of the unmixed speech signals are synthesized. Then, from these estimates, the unvoiced speech segments are isolated. Finally, using the time intervals where the speech signals are unvoiced, a histogram of estimates of the mixing matrix coefficients is built. The two simulations in this paper are performed on the smaller Tulips1 database and the relatively larger Audio–Visual Speech Processing (AVSP) dataset from Carnegie Mellon University (CMU). The former is a publicly available database of 96 recordings of 12 speakers, and the latter is a database of 100 recordings of 10 speakers, also publicly available. The advantage of these datasets is the availability of geometric features that were extracted from the video frames, thereby simplifying the task at hand and removing any errors that would result from inaccurate geometric lip features.
2. Visual signal re-substitution

The first step of the proposed audiovisual speech separation technique is visual feature extraction. In particular, we are interested in features of the video data stream that may be used to distinguish between the different sounds made by the speaker. The features used are usually geometric, such as the width and height of the mouth and the size of the opening of the mouth. More advanced visual features might describe the optical flow within a sequence of frames, or other such
motion analyses. In this paper, the extraction of the visual features is not the focus, and it will be assumed that an accurate set of geometric parameters describing the motion of the lips is available. Once the correct features are extracted from a recorded video stream, the second step is to integrate them with the corresponding acoustic signal. Unlike previous approaches, our goal is to fuse the audio and visual information very early, at the acoustic signal level. To do this, we require a smooth, high-resolution function describing the variation in time of each visual feature. The visual modality, however, is typically sampled at 30 frames-per-second or less, which is too low in comparison to typical acoustic sampling rates of 10 kHz or higher. Hence, we interpolate each visual parameter, as shown in Fig. 1. Suppose we have M available visual parameters (e.g. the height of the lips, the width of the opening of the mouth, etc.). The kth interpolated video parameter signal, for a given sample index i, can be defined as

y_k(i) = \sum_{j=1}^{N} v_k(j) \, \mathrm{sinc}\!\left( \frac{i - j f_v/f_o}{f_v/f_o} \right)    (1)

where f_v is the new up-sampled rate of the feature set, f_o is the original sampling rate of the video with a total of N video frames available, and v_k(j) is the value of the kth visual parameter at the jth video frame. With a total of M visual parameters available, then, we can define Y(i) to be the M-element vector whose elements are the up-sampled visual features at the sample index i:

Y(i) = \begin{bmatrix} y_1(i) \\ \vdots \\ y_M(i) \end{bmatrix}    (2)

Fig. 1. The interpolation of a visual feature to a 2 kHz sampling rate.
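To make Eq. (1) concrete, the following minimal NumPy sketch up-samples a single visual parameter track by sinc interpolation. The function name, the example values, and the default rates are illustrative assumptions rather than part of the original system.

```python
import numpy as np

def upsample_visual_feature(v, f_o=30.0, f_v=2000.0):
    """Sinc-interpolate one visual parameter track, as in Eq. (1).

    v   : per-frame feature values v_k(1..N), sampled at the video rate f_o.
    f_v : target (up-sampled) rate of the interpolated feature signal.
    Returns y_k(i) on a grid of round(N * f_v / f_o) samples.
    """
    v = np.asarray(v, dtype=float)
    N = len(v)
    ratio = f_v / f_o                            # f_v / f_o in Eq. (1)
    i = np.arange(int(round(N * ratio)))         # output sample indices
    j = np.arange(1, N + 1)                      # video frame indices 1..N
    # y_k(i) = sum_j v_k(j) * sinc((i - j * f_v/f_o) / (f_v/f_o));
    # np.sinc is the normalized sinc, sin(pi x) / (pi x).
    kernel = np.sinc((i[:, None] - j[None, :] * ratio) / ratio)
    return kernel @ v

# Example: a short (hypothetical) 30 fps lip-height track interpolated to 2 kHz.
lip_height = np.array([20.0, 21.0, 25.0, 30.0, 28.0, 22.0])
y = upsample_visual_feature(lip_height)
```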
Now, our goal is to recreate an estimate of the speech signal given the visual parameter vector of Eq. (2). One method of accomplishing this is by visual speech re-substitution. This approach requires a database of noise-free speech signals. Given a noisy speech segment where the visual parameters are known, we find the closest visually matching speech segment from the database and re-substitute it in the place of the noisy speech signal segment. In other words, if we define the re-substituted speech segment for the pth segment to be \hat{x}_p(t), the qth noise-free speech segment in the database to be s_q(t), and the corresponding visual parameter vector of the database segment to be

\Phi_q(i) = \begin{bmatrix} \phi_{1,q}(i) \\ \vdots \\ \phi_{M,q}(i) \end{bmatrix}    (3)

then we have

\hat{x}_p(t) = s_{\arg\min_q d_{Y,\Phi_q}}(t)    (4)

where d_{A,B} is a distance metric between the visual signal segment vectors A and B, and \phi_{k,q} is the kth visual parameter of the qth visual segment of the noise-free audiovisual database. Clearly, a suitable choice for a distance metric of the visual parameters is required. This is used to identify the (visually) closest segment from the noise-free database to the current noisy audiovisual signal segment (note that the noise is assumed to appear only in the acoustic signals, not the visual signals). One possible choice for the distance metric is the sum of the squared differences between the parameters of the signal and those of the database segment:

d_{Y,\Phi_q} = \sum_{k=1}^{M} \sum_{i} \left( y_k(i) - \phi_{k,q}(i) \right)^2    (5)

Another possibility is the inverse of the sum of correlation coefficients between each parameter, as shown below:

d_{Y,\Phi_q} = \left( \sum_{k=1}^{M} \rho_{y_k,\phi_{k,q}} \right)^{-1}    (6)

where \rho_{a,b} is the correlation coefficient between the signals a and b. While the correlation coefficient distance was used in earlier work [29], it was found that the Euclidean norm performs better. As such, in this paper, the distance measure of Eq. (5) is used. An important parameter here is the length of the re-substitution segments that are used. If these segments are too short, then the temporal variations of the visual parameters are not considered adequately. Similarly, if the segments are too long, the probability of a good match in the noise-free database becomes small. In practice, it was found that segment sizes between 180 and 200 ms produced good results.
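A rough sketch of the re-substitution step of Eqs. (4) and (5) is given below. The function and variable names are hypothetical, and the database is assumed to be pre-segmented into fixed-length segments with their interpolated visual parameters.

```python
import numpy as np

def resubstitute_segment(Y_seg, db_audio_segs, db_visual_segs):
    """Visual re-substitution of a single segment, following Eqs. (4) and (5).

    Y_seg          : (M, L) interpolated visual parameters of the current noisy segment.
    db_visual_segs : list of (M, L) arrays, visual parameters of the clean database segments.
    db_audio_segs  : list of 1-D arrays, the corresponding clean speech segments s_q(t).
    """
    # d_{Y,Phi_q} = sum_k sum_i (y_k(i) - phi_{k,q}(i))^2, Eq. (5)
    distances = [np.sum((Y_seg - Phi_q) ** 2) for Phi_q in db_visual_segs]
    q_star = int(np.argmin(distances))           # arg min_q in Eq. (4)
    return db_audio_segs[q_star]

# Usage sketch: split a noisy recording into roughly 180-200 ms segments, re-substitute
# each one, and concatenate the returned clean segments to form the estimate x_hat.
```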
While there may be a great deal of information captured in the re-substituted sound signal, one clear similarity between the actual and the re-substituted signals is in their sound envelopes. For example, many (but not all) of the peaks of the two signals seem to occur at the same times. In order to assess this point, an example re-substitution was done. Fig. 2 illustrates the results of a visual signal re-substitution using the Tulips1 database, which is a set of 8 recordings of each of 12 speakers. In each recording, the words "one" through "four" are uttered twice, for a total of 192 monosyllabic words in the entire dataset. Here, all of the signals corresponding to the Tulips1 database were concatenated in the following order: {Anthony, Ben, Cynthea, Don, George, Isaac, Jay, Jesse, Candace, Oliver, Regina, Simon}, with each subject's segment starting from the first utterance of the word 'one', followed by the second utterance of the same word, the first utterance of the word 'two', and so on until the second utterance of the word 'four'. The first 40,000 samples of the resulting audiovisual signal were used as the noise-free database. With a sampling rate of 11.1 kHz, this corresponded to approximately 3.6 s of noise-free speech. This initial 3.6 s corresponded to the first person, Anthony, repeating the words one to four twice and the second person, Ben, repeating the words one and two twice. While seemingly small, a 3.6 s segment was adequate since each word that was spoken later in the Tulips1 speech segments by other speakers was spoken at least twice by Anthony in the training database. The next 10 s of the audiovisual signals were then used for testing purposes. While it is difficult to see the similarity between the actual and re-substituted waveforms in Fig. 2, the signal envelopes have a closer resemblance, as shown in Fig. 3.

Fig. 2. A visually re-substituted segment (top) of an audio segment of the Tulips1 database (bottom).

Fig. 3. The signal envelopes corresponding to the signals of Fig. 2.

In order to analyze the relationship between the envelopes of the two signals, 10 s of the Tulips1 database were visually re-substituted and then several conditional probability distributions were empirically calculated. The results are shown in Fig. 4. Fig. 4 illustrates the relation between the visually re-substituted signal envelope and the actual speech signal envelope. As shown, the original speech signal has a Laplacian-like distribution over all of the data samples. Now, if we only consider the data samples at which the visually re-substituted signal has an envelope greater than 50% of the envelope maximum, then the distribution of the speech signal at those points is much flatter than the unconditional distribution. On the other hand, if we only consider points at which the visually re-substituted signal envelope is less than 5% of the envelope maximum, the distribution of the speech signal at those points is sharper than the unconditional distribution.
Fig. 4. The conditional distribution of the speech signal given information about the visually re-substituted signal envelope (curves show the conditional speech distribution given a visual envelope greater than 50% of the maximum, and given a visual envelope less than 5% of the maximum).
This dependence illustrates the simple fact that information about the envelope of the visually re-substituted speech signal affects the distribution of the actual speech signal. This dependence is essential for the algorithm that will be presented in the following section.
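The envelope dependence illustrated in Fig. 4 can be checked empirically along the following lines. The paper does not state how the envelope L(.) is computed, so this sketch assumes the magnitude of the analytic signal (via SciPy's Hilbert transform) as one reasonable choice; the thresholds match the 50% and 5% conditions above.

```python
import numpy as np
from scipy.signal import hilbert

def envelope(x):
    """Envelope L(x) taken as the magnitude of the analytic signal (one possible choice)."""
    return np.abs(hilbert(x))

def conditional_distributions(x_true, x_resub, bins=50, value_range=(-0.2, 0.2)):
    """Empirical distributions of the clean speech signal conditioned on the
    visually re-substituted envelope, in the spirit of Fig. 4."""
    env = envelope(x_resub)
    loud = env > 0.5 * env.max()       # re-substituted envelope > 50% of its maximum
    quiet = env < 0.05 * env.max()     # re-substituted envelope < 5% of its maximum
    p_loud, edges = np.histogram(x_true[loud], bins=bins, range=value_range, density=True)
    p_quiet, _ = np.histogram(x_true[quiet], bins=edges, density=True)
    return p_loud, p_quiet, edges
```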
3. A visually guided signal separation algorithm

One important application for the visually re-substituted signal envelope is robust speech separation. Traditional blind speech separation involves estimating an original noise-free set of signals from a set of mixed signals. In the case of 2 clean signals and 2 mixed signals, the mixed signals can be defined as

\begin{bmatrix} X_1(i) \\ X_2(i) \end{bmatrix} = \begin{bmatrix} a_1 & a_2 \\ b_1 & b_2 \end{bmatrix} \begin{bmatrix} x_1(i) \\ x_2(i) \end{bmatrix}    (7)

where x_1(i) and x_2(i) are two noise-free speech signals, a_1, a_2, b_1 and b_2 are mixing coefficients, and X_1(i) and X_2(i) are mixed signals. The goal of speech separation is to obtain the original noise-free signals given only the mixed signals. In other words, we must be able to estimate the mixing matrix. Since our goal is to estimate the original signals only up to a constant scale factor, about which we do not care, we can absorb the constants a_1 and a_2 into the original signals and estimate only the bottom row of the mixing matrix. In other words, with c_1 = b_1/a_1, c_2 = b_2/a_2, \bar{x}_1(i) = a_1 x_1(i) and \bar{x}_2(i) = a_2 x_2(i), we have

\begin{bmatrix} X_1(i) \\ X_2(i) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ c_1 & c_2 \end{bmatrix} \begin{bmatrix} a_1 x_1(i) \\ a_2 x_2(i) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ c_1 & c_2 \end{bmatrix} \begin{bmatrix} \bar{x}_1(i) \\ \bar{x}_2(i) \end{bmatrix}    (8)

Now we shall use the visually re-substituted signal envelope in order to estimate the mixing matrix constants of Eq. (8). The signal envelope is an indication of the amplitude of the signal at any given time. Let us assume that we select the sample indices where the estimated (using the visual information) envelope of the first signal is far greater than the envelope of the second signal. In other words, from the visually re-substituted signals we obtain samples such that

L(\hat{x}_1(i)) \gg L(\hat{x}_2(i))    (9)

where L(f(x)) corresponds to the envelope of the function f(x). From here, we make the assumption that at these points the envelope relation also holds for the actual noise-free speech signals:

L(x_1(i)) \gg L(x_2(i))    (10)

As a result, Eq. (8) can be approximated at these points by

\begin{bmatrix} X_1(i) \\ X_2(i) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ c_1 & c_2 \end{bmatrix} \begin{bmatrix} \bar{x}_1(i) \\ \bar{x}_2(i) \end{bmatrix} \approx \begin{bmatrix} 1 \\ c_1 \end{bmatrix} \bar{x}_1(i)    (11)

While we do not know the amplitude or the scaling that make up \bar{x}_1(i), we can still estimate c_1 as follows:

c_1 \approx \frac{X_2(i)}{X_1(i)}    (12)

However, as mentioned previously, we have multiple samples for which the inequality of Eq. (9) may hold. As a result, we need a procedure for fusing the different coefficient estimates. One successful technique is taking the coefficient value that occurs most often, i.e. the statistical mode of the coefficient estimates. Similarly, c_2 can be obtained by taking the subset of samples for which the following visually derived inequality holds:

L(\hat{x}_1(i)) \ll L(\hat{x}_2(i))    (13)

and then taking the mode of the coefficient estimates obtained by

c_2 \approx \frac{X_2(i)}{X_1(i)}    (14)

In order to implement such a signal estimation technique, the "much greater than" inequality of Eqs. (9) and (13) requires a more practical definition. In this paper, the following definition was used for Eq. (9):

L(\hat{x}_1(i)) > 5 L(\hat{x}_2(i))    (15)

with the following constraint:

L(\hat{x}_1(i)) > \max_j L(\hat{x}_1(j)) / 2    (16)

Similarly, the following definition was used for Eq. (13):

5 L(\hat{x}_1(i)) < L(\hat{x}_2(i))    (17)

with a constraint similar to that of Eq. (16):

L(\hat{x}_2(i)) > \max_j L(\hat{x}_2(j)) / 2    (18)

After estimating the mixing matrix, the original noise-free speech signals can be obtained by multiplying the mixed signals by the inverse of the mixing matrix. Fig. 5 illustrates two noise-free speech signals (corresponding to two concatenated 3.6 s segments of the Tulips1 database). Fig. 6 illustrates two mixed signals obtained with the following mixing matrix:

\begin{bmatrix} X_1(i) \\ X_2(i) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1.25 & 1.6 \end{bmatrix} \begin{bmatrix} x_1(i) \\ x_2(i) \end{bmatrix}    (19)

After performing the proposed coefficient estimation technique, a set of estimates was obtained for the first coefficient and another set for the second. The number of estimates obtained in each set is, in general, not the same, because the number of samples that satisfy the constraints of Eqs. (15) and (16) is unrelated to the number of samples that satisfy the constraints of Eqs. (17) and (18). The histograms of the resulting estimates are shown in Fig. 7. Here, the mode of the estimates for the first coefficient is exactly 1.25 and the mode of the estimates for the second coefficient is 1.6. Hence, perfect separation of the mixed signals becomes possible. In real environments, the coefficients of the mixing matrix would be re-estimated at fixed intervals in time, since they are time-variant: the mixing matrix varies due to the motion of the speakers and due to changes in speech loudness and noise in the environment.
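A minimal sketch of the coefficient estimation of Eqs. (12)-(18) follows. The factor of 5 and the half-maximum constraint mirror Eqs. (15)-(18); the histogram binning used to take the mode, and the small guard against division by zero, are illustrative choices rather than part of the original specification.

```python
import numpy as np

def estimate_mixing_coefficients(X1, X2, env1, env2, ratio=5.0, bins=200, rng=(-0.5, 2.0)):
    """Estimate c1 and c2 of Eq. (8) from the mixtures X1, X2 and the
    visually re-substituted envelopes env1, env2, following Eqs. (12)-(18)."""
    # Samples where speaker 1 dominates: Eqs. (15) and (16).
    s1 = (env1 > ratio * env2) & (env1 > env1.max() / 2)
    # Samples where speaker 2 dominates: Eqs. (17) and (18).
    s2 = (ratio * env1 < env2) & (env2 > env2.max() / 2)

    def mode_of_ratios(mask):
        valid = mask & (np.abs(X1) > 1e-8)        # guard against dividing by silence
        r = X2[valid] / X1[valid]                 # per-sample estimates, Eqs. (12) and (14)
        counts, edges = np.histogram(r, bins=bins, range=rng)
        k = int(np.argmax(counts))
        return 0.5 * (edges[k] + edges[k + 1])    # centre of the most populated bin

    return mode_of_ratios(s1), mode_of_ratios(s2)
```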
Fig. 5. Two 3.6 s sections of the Tulips1 database.

Fig. 6. Two mixed signals obtained from the original signals and using the mixing matrix of Eq. (19).

Fig. 7. The histogram of the estimates of the first mixing coefficient (top) and the histogram of the estimates of the second mixing coefficient (bottom).
4. Simulations on the Tulips1 database
While the proposed technique can successfully find the coefficients of the mixing matrix, so can other algorithms such as Independent Component Analysis. What ICA cannot do, however, is tolerate heavy noise [31]. For example, in a room with two dominant speakers and a third weaker, but still present, speech source, ICA would encounter difficulties in finding the mixing coefficients. In order to analyze how the proposed algorithm would function under noisy conditions, a set of simulations was performed on the Tulips1 database. As with the example in Section 2, the audiovisual database was formed by concatenating the recordings of the 12 speakers. The first 3.6 s of the database were used as training data. In these simulations, the entire remaining 30 s was used as testing data. The Tulips1 database has 6 manually extracted features for each video stream. These 6 features are:

1. width of the outer edge of the mouth
2. height of the outer edge of the mouth
3. width of the inner edge of the mouth
4. height of the inner edge of the mouth
5. height of the upper lip
6. height of the lower lip
Fig. 8 illustrates the position of these features on the mouth. Although these features were extracted manually in order to ensure their accuracy, automatic algorithms exist that can robustly estimate these features from a given image [6,12–15,20]. A variety of histograms were obtained at different SNRs for the case of the 1.25 mixing coefficient. The result, in the case that the noise source is another speaker, is shown in Fig. 9. In Figs. 9 and 10, each vertical cross-section of the image gives a histogram of coefficient estimates corresponding to a given SNR, with dark regions representing peaks. As shown in Fig. 9, the proposed technique can correctly detect the mixing coefficient at SNRs as low as 0 dB. This is indicated by a constant and sharp peak on the histograms. A similar simulation with a Gaussian noise signal was unable to duplicate the performance obtained with the speech noise.
Fig. 10. Normalized coefficient histogram with different Gaussian noise intensities. Each vertical section of the plotted surface gives a histogram of estimates.
As shown in Fig. 10, the Gaussian-noise-corrupted signals have coefficient histograms that are sharp down to approximately 20 dB, below which the histograms no longer convey any useful information. This problem arises because Gaussian signals corrupt all of the samples, and hence the ratios that are utilized are often inaccurate. In contrast, speech noise signals corrupt only certain segments, leaving other temporal segments intact so that a stable histogram can be produced. Figs. 9 and 10 utilized segment lengths of 181.8 ms for the speech signal reconstruction.
Fig. 8. The six geometric features extracted from each facial video image.

Fig. 9. Normalized coefficient histograms with different speech-based noise intensities. Each vertical section of the plotted surface gives a histogram of estimates.

5. Simulations on the AVSP dataset
In view of the encouraging results from simulations on the Tulips1 database, further experiments were performed on a larger audio–visual database, in order to verify the results of the initial simulations. For the second set of simulations, the Audio–Visual Speech Processing (AVSP) dataset was used. The dataset consisted of 10 audio–visual recordings of each of 10 speakers uttering 78 different words, for a total of 7800 spoken words over roughly 3 h of speech. In this paper, exactly one recording from each of the 10 speakers was used to form the noise-free audio–visual database. That is, 10% of the data was set aside as training data and the remaining 90% was used to test the system. The vocabulary included various words used to tell the date, such as numbers from "one" through "billion", days of the week, months of the year, and additional words such as "this", "that" and "ago". Using a large dataset allowed evaluation of the proposed technique using a wider variety of commonly used syllables and words in speech.
Fig. 11. The geometric features extracted from each facial video image.
The AVSP dataset has four automatically extracted features for each video stream. These four features are:

(x1, y1): location of the left extremity of the mouth
(x2, y2): location of the right extremity of the mouth
h1: height of the upper lip
h2: height of the lower lip
From (x1, y1) and (x2, y2), the total width, w, of the mouth can be computed directly. The values of w, h1 and h2 together form the 3-feature set that the data fusion process will be based upon. Fig. 11 illustrates the position of these features on the mouth. The features were extracted using an automatic lip-tracking program, also developed by CMU. After setting aside 10 recordings for the training database, 90 recordings remained, with exactly 9 for each speaker. This allowed the formation of every pair-wise combination of the 10 speakers, with no recording being used twice: 45 sets of two-speaker recordings (10 choose 2) were formed, with no repetitions. Using the recordings in the noise-free training set, the re-substitution process was performed on each of the recordings in the test set. Then, pair-wise mixing of each of the 45 pairs was done, using mixing coefficients of 2 and 4. To evaluate the system performance under noisy conditions, speech noise and Gaussian noise were added to the mixed recordings at various SNR levels. From these noisy signals, the histogram of mixing coefficient estimates was computed using the proposed technique.
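A sketch of how such mixtures and noisy test signals might be generated is shown below, assuming the coefficients 2 and 4 occupy the bottom row of the mixing matrix as in Eq. (8); the SNR scaling is the standard power-ratio definition, and the recording arrays are placeholders.

```python
import numpy as np

def mix_pair(x1, x2, c1=2.0, c2=4.0):
    """Form two mixtures from two clean recordings, with c1 and c2 taken as the
    bottom-row coefficients of Eq. (8) (2 and 4 in these simulations)."""
    n = min(len(x1), len(x2))
    x1, x2 = x1[:n], x2[:n]
    return x1 + x2, c1 * x1 + c2 * x2

def add_noise(x, noise, snr_db):
    """Add a noise signal (speech or Gaussian) to x at the requested SNR in dB."""
    noise = noise[:len(x)]
    p_x = np.mean(x ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))
    return x + gain * noise

# Usage sketch (rec_a and rec_b are placeholder arrays holding two clean recordings):
# X1, X2 = mix_pair(rec_a, rec_b)
# X1_noisy = add_noise(X1, np.random.randn(len(X1)), snr_db=0.0)
```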
Fig. 12. Histograms of estimates of the mixing coefficients for 20 dB speech noise.
Fig. 13. Histograms of estimates of the mixing coefficients for 10 dB speech noise.
Fig. 14. Histograms of estimates of the mixing coefficients for 0 dB speech noise.
The results of the simulations, for the case of additive speech noise, are shown in Figs. 12–14. It is seen that, even at SNRs as low as 0 dB, the proposed technique can extract the correct value for the mixing coefficients. In all of the figures, the tallest peak in the histogram always corresponds to the correct estimate. At most SNR values, however, secondary peaks begin to appear in the histogram. Consider the histogram for c = 2 in Fig. 14. The secondary peak at 4 represents instances where it was assumed that one of the speakers was silent and the other was speaking, whereas in reality it was the opposite: the former was actually speaking and the latter was silent. In such cases, the other mixing coefficient is (correctly) extracted instead. The secondary peak at 0 represents instances where complete silence was actually recorded at the microphone, but the re-substitution process incorrectly estimated that only one of the speakers was silent. These secondary peaks do not pose a serious problem, since at all SNR levels the dominant peak corresponds to the correct estimate of the mixing coefficient. For the case of additive Gaussian noise, as expected and observed in earlier simulations, there is greater
degradation in performance when compared to speech noise. The results of these simulations are shown in Figs. 15–17. In the presence of Gaussian noise, the mixing coefficient was correctly estimated at SNR values as low as 10 dB. At lower SNR values, the histogram of coefficient estimates no longer provided useful information. As before, the histogram in the Gaussian case exhibits degradation at relatively higher SNR values because the noise corruption affects all the samples, whereas the speech noise corruption affects only a fraction of the segments. The simulations on this larger database were performed using a set of testing parameters which were individually optimized to yield the best system performance. These parameters included the video interpolation frequency, the audio segment length, and the distance measure. To investigate the effect of varying the parameters, the conditional distribution of the test speech signal, given information about the resulting re-substituted signal, was examined. Conditional distributions similar to that shown in Fig. 4 were computed for various parameter values, and those that yielded the greatest probability gains were used to produce the results in this paper.
Fig. 15. Histograms of estimates of the mixing coefficients for 20 dB Gaussian noise.
Fig. 16. Histograms of estimates of the mixing coefficients for 10 dB Gaussian noise.
Fig. 17. Histograms of estimates of the mixing coefficients for 0 dB Gaussian noise.
The optimization of the testing parameters was based on a reduced database. Video feature interpolation frequencies of 500 Hz, 1 kHz and 2 kHz were tested. The resulting conditional histograms were similar at all three frequencies. As such, the video features were interpolated at 500 Hz, in order to reduce the total processing time. Segment lengths of 50, 100, 200 and 400 ms were tested. The peak values (the conditional probability of the signal value being zero) of the resulting conditional histograms are given in Table 1.
Table 1
The probabilities that the speech signal is small, conditioned on the estimated signal envelope values

Segment length (ms) | Peak pdf value given estimated envelope is more than 50% of maximum | Peak pdf value given estimated envelope is less than 5% of maximum
50  | 0.233 | 0.304
100 | 0.236 | 0.305
200 | 0.184 | 0.317
400 | 0.191 | 0.322
A segment length of 200 ms was chosen because it had the largest difference in conditional probabilities, and, as such, yields the largest gain in information through the re-substitution process.
6. Extension to mixing matrix estimation with delayed sources

The mixing matrix estimation technique of Section 3 can be easily extended to estimate the mixing matrix when the speakers have different mixing delays. Our model in this case is:

X_1(i) = a_1 x_1(i - \tau_1) + a_2 x_2(i - \tau_2)    (20)

X_2(i) = b_1 x_1(i) + b_2 x_2(i)    (21)

where \tau_1 and \tau_2 are known time delays of arrival (TDOAs), in samples, corresponding to the first and second speaker, respectively. This TDOA information can usually be obtained using a variety of TDOA estimation techniques [1–3].
Rewriting Eq. (20) as

X_1(i + \tau_1) = a_1 x_1(i) + a_2 x_2(i + \tau_1 - \tau_2)    (22)

and combining it with Eq. (21), for samples for which L(\hat{x}_1(i)) \gg L(\hat{x}_2(i)) and L(\hat{x}_1(i)) \gg L(\hat{x}_2(i + \tau_1 - \tau_2)) (which leads us to assume that L(x_1(i)) \gg L(x_2(i)) and L(x_1(i)) \gg L(x_2(i + \tau_1 - \tau_2))), we obtain

\frac{b_1}{a_1} \approx \frac{X_2(i)}{X_1(i + \tau_1)}    (23)

and similarly, for samples for which L(\hat{x}_1(i)) \ll L(\hat{x}_2(i)) and L(\hat{x}_1(i)) \ll L(\hat{x}_2(i + \tau_1 - \tau_2)) (which leads us to assume that L(x_1(i)) \ll L(x_2(i)) and L(x_1(i)) \ll L(x_2(i + \tau_1 - \tau_2))), we have

\frac{b_2}{a_2} \approx \frac{X_2(i)}{X_1(i + \tau_2)}    (24)

Note that, in practice, the recorded signals are discrete, while the time delays \tau_1 and \tau_2 may not be an integer number of samples. In such cases, in order to delay the recorded signals by a fractional sample delay, the signal can be reconstructed, delayed, and re-sampled, all of which can be expressed in the following single step:

X_1(i + \tau_2) = \sum_{j} X_1(j) \, \mathrm{sinc}(j - i - \tau_2)    (25)

Hence, in the time-delayed case, the ratio of the mixing coefficients can be obtained as was done in the case with no delays. As before, if we let \bar{x}_1(i) = a_1 x_1(i) and \bar{x}_2(i) = a_2 x_2(i), then, since we do not care about the scale of the separated speech signals, we can invert the mixed and delayed system of Eqs. (20) and (21) to obtain the separated speech signals. This inversion would be performed in the frequency domain, as follows. Eqs. (20) and (21) can be stated in the frequency domain as

Y_1(\omega) = a_1 e^{-j\omega\tau_1} y_1(\omega) + a_2 e^{-j\omega\tau_2} y_2(\omega)    (26)

Y_2(\omega) = b_1 y_1(\omega) + b_2 y_2(\omega)    (27)
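Eq. (25) can be implemented directly as a truncated sinc interpolation; the following sketch uses a finite kernel width as an illustrative approximation of the ideal (infinite) reconstruction sum.

```python
import numpy as np

def fractional_delay(x, tau, width=64):
    """Evaluate X(i + tau) for a non-integer delay tau (in samples), as in Eq. (25),
    using a sinc kernel truncated to +/- width samples around the target point."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        lo = max(0, int(np.floor(i + tau)) - width)
        hi = min(n, int(np.ceil(i + tau)) + width + 1)
        j = np.arange(lo, hi)
        # X(i + tau) = sum_j X(j) * sinc(j - i - tau)
        out[i] = np.dot(x[j], np.sinc(j - i - tau))
    return out
```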
Fig. 18. Experimental setup with two speakers and two microphones (speaker/microphone height 1.6 m; microphone SNR of approximately 20 dB; reverberation time of at most 0.1 s; relative speaker intensity of approximately 0 dB; sampling rate 44.1 kHz; total recording time 60 s).
where Y_k(\omega) and y_k(\omega) are the Fourier transforms of X_k(i) and x_k(i), respectively, for k = 1, 2. Given our assumed knowledge of the TDOAs, and by estimating the mixing coefficients using the scheme outlined in this paper, the mixed system of Eqs. (26) and (27) can be inverted as follows:

\begin{bmatrix} y_1(\omega) \\ y_2(\omega) \end{bmatrix} = \frac{1}{a_1 b_2 e^{-j\omega\tau_1} - a_2 b_1 e^{-j\omega\tau_2}} \begin{bmatrix} b_2 & -a_2 e^{-j\omega\tau_2} \\ -b_1 & a_1 e^{-j\omega\tau_1} \end{bmatrix} \begin{bmatrix} Y_1(\omega) \\ Y_2(\omega) \end{bmatrix}    (28)

7. Experimental example

In order to test the validity of the proposed technique, an experiment was set up with two speech sources. The details of the experimental setup are shown in Fig. 18. In order to test the proposed speech separation technique in a realistic environment, but without dealing with the visual feature extraction phase, a set of two Sony TRS-90 wireless speakers was used to simulate two real speakers in the environment. The sound for these speakers, however, was produced from the Jon 2 (speaker 1) and the Sue 2 (speaker 2) sound tracks of the AVSP dataset. In this way, a realistic experiment in real-world conditions could be performed without the need for complex visual feature extraction, which is not the focus of this paper. Furthermore, because enough data would have to be recorded for training as well as testing, the fact that the AVSP dataset contained enough segments for both real-world testing and offline training made it the ideal choice for this experiment. An image of the TRS-90 speakers along with the microphones is shown in Fig. 19. In order to better simulate the model of a human speaker, the reflectors of the TRS-90 speakers were removed for the 60 s experiment. The standard TRS-90 speaker with the reflector is shown in Fig. 20.
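As a sketch of the inversion in Eq. (28), used below to recover the separated signals from the recorded mixtures, the following FFT-based routine assumes a1 = a2 = 1 and b1 = c1, b2 = c2, which recovers the sources only up to the unknown scale factors, consistent with the scale-invariance assumption of Section 3. The regularization of near-zero determinant bins is an illustrative safeguard, not part of the original method.

```python
import numpy as np

def separate_delayed_mixture(X1, X2, c1, c2, tau1, tau2, eps=1e-6):
    """Invert the delayed mixing model of Eqs. (26)-(28) in the frequency domain,
    with a1 = a2 = 1 and b1 = c1, b2 = c2 (sources recovered only up to scale)."""
    n = len(X1)
    w = 2.0 * np.pi * np.fft.fftfreq(n)            # digital frequency in rad/sample
    Y1, Y2 = np.fft.fft(X1), np.fft.fft(X2)
    d1, d2 = np.exp(-1j * w * tau1), np.exp(-1j * w * tau2)
    det = c2 * d1 - c1 * d2                        # a1*b2*e^{-jw*tau1} - a2*b1*e^{-jw*tau2}
    det = np.where(np.abs(det) < eps, eps, det)    # crude guard against near-zero bins
    y1 = (c2 * Y1 - d2 * Y2) / det                 # rows of the adjugate in Eq. (28)
    y2 = (-c1 * Y1 + d1 * Y2) / det
    return np.real(np.fft.ifft(y1)), np.real(np.fft.ifft(y2))
```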
Fig. 19. Front view of the experimental setup of Fig. 18 with two SONY TRS-90 speakers and two microphones. Note that the two black inner microphones were used for this experiment.
Fig. 20. Front view of the SONY TRS-90 speaker with the reflector attached.
One might wonder about the proximity of the sources to the microphones, as well as the choices made regarding the setup of the speakers and the microphones. The main reason behind the proximal setup was to ensure disparate coefficient values that would clearly test the ability of the proposed technique to estimate those coefficients. Disparate coefficient values are also easier to verify than ones that are very close to each other. Figs. 21 and 22 show the mixed signals recorded by the first and second microphones, respectively. Due to the nature of the speech signals being played, the source signals overlapped during certain segments while not overlapping during others. Using the entire 60 s of data (as well as prior training on the Jon 1 and Sue 1 data segments), the coefficient histograms of Figs. 23 and 24 were generated. These corresponded to the coefficients 0.4 and 1.6, which, as far as could be verified, were correct. To better visualize the effects of different speech separation techniques, we zoom in on a region of the mixed signals where the two speakers are not overlapping, which makes a qualitative analysis of the attenuation and amplification of the different speakers possible.
Fig. 22. Mixed signal recorded by the second (right) microphone.
Fig. 23. Histogram for the b1 =a1 coefficient using the algorithm proposed in this paper.
Fig. 24. Histogram for the b2 =a2 coefficient using the algorithm proposed in this paper.
Fig. 21. Mixed signal recorded by the first (left) microphone.
Fig. 25. Zoomed in section of the mixed signal recorded by the first microphone.

Fig. 27. First ICA output for the zoomed in subsection of Figs. 25 and 26.
Fig. 28. Second ICA output for the zoomed in subsection of Figs. 25 and 26.
Fig. 26. Zoomed in section of the mixed signal recorded by the second microphone.
Figs. 25 and 26 illustrate the zoomed-in mixed signals for a non-overlapping section. In order to compare the simple audiovisual speech separation technique of this paper with standard speech separation algorithms, we compared the proposed technique with the time-delayed Independent Component Analysis technique of [26]. Figs. 27 and 28 show the resulting separated signals for the zoomed-in section of Figs. 25 and 26. As shown, due to the sensor noise (an SNR of about 20 dB) and reverberation, the ICA algorithm is unable to obtain the correct coefficients. It should be mentioned that for this ICA technique, the correct TDOAs were used
(and not changed), while the feedback and other weights of the algorithm were allowed to adapt. The results of Figs. 27 and 28 must be put into perspective, since they clearly look (and sound) far worse than the original mixed speech signals of Figs. 25 and 26. The technique applied was a standard ICA algorithm with time-delay compensation, as reported in [26]. ICA in general suffers from several restrictions, including the inability to deal with more than a single Gaussian source or with more sources than microphones (the reasons behind this can be attributed to the Darmois-Skitovich theorem [32]). This clearly presents a problem in real environments with multiple sources (computer fans, air conditioners, reverberations, background speakers, etc.), some of which are Gaussian. Figs. 29 and 30 illustrate the speech separation results of the proposed technique. As shown, the interfering speech signal is attenuated in each output (the second speaker is attenuated in the first separated output, and the first speaker is attenuated in the second separated output).
Fig. 29. First separated signal obtained using the coefficient estimation algorithm of this paper followed by inversion for the zoomed in subsection of Figs. 25 and 26.
Fig. 30. Second separated signal obtained using the coefficient estimation algorithm of this paper followed by inversion for the zoomed in subsection of Figs. 25 and 26.
This, of course, is a far better result than that of the previous ICA technique, although, in fairness, it should be noted that the ICA technique does not utilize visual information or prior training, while our algorithm does.
8. Conclusions

A technique for the fusion of visual lip motions and mixed speech signals was proposed, with the goal of separating the speech signal mixture. This involved the initial estimation of the clean speech signal envelope by means of a virtual speech signal reconstruction algorithm. The envelope information was then used to identify signal samples where one speaker is silent, where the ratio of the mixed signal samples identifies the correct mixing coefficient.

The real advantage of the proposed technique is that it is not an alternative to standard audiovisual integration strategies; it can instead be combined with such strategies, since both the visual data and the separated speech signals remain available for further processing. The main reason for employing such a technique was to apply knowledge of the visual information to separate the speech signals while preserving some important information, such as the timing information, which is essential in applications such as sound localization [1–3,24].

The features used to represent the current status of the mouth were manually obtained and included in the Tulips1 database. However, it is conceivable that using more features may improve performance. Similarly, with the much larger Audio–Visual Speech Processing dataset, only three features were derived, using automatic image feature extraction algorithms. More features, however, may have yielded more accurate results. Furthermore, the correlation-sum metric was used in the case of the Tulips1 database, whereas the Euclidean norm was used with the larger dataset. There may be other distance measures that provide improved performance in the re-substitution phase compared to both of these simple metrics.

The technique proposed in this paper was shown to be applicable not just to small-vocabulary systems, but also to conventional speech systems with a rich diversity of phonemes and visemes. Encouraging results were obtained on both the smaller and larger databases. Through simulations, it was found that the proposed algorithm is very robust to speech-based noise, where SNRs as low as 0 dB can be tolerated while still robustly estimating the mixing coefficients. However, the technique proposed here does not perform as well in the presence of Gaussian noise, where the corruption extends across the entire speech signal, whereas speech noise corrupts only selected fragments; the cohesion of the histograms was lost at SNRs below 10 dB in the Gaussian noise case. These results, nevertheless, are substantial improvements over other speech separation algorithms such as ICA.

References
[1] P. Aarabi, S. Zaky, Multi-modal sound localization using audiovisual information fusion, Information Fusion 3 (2) (2001) 209–223.
[2] P. Aarabi, Multi-sense artificial awareness, M.A.Sc. Thesis, Department of Electrical and Computer Engineering, University of Toronto, June 1999.
[3] P. Aarabi, S. Zaky, Integrated vision and sound localization, in: Proceedings of the 3rd International Conference on Information Fusion, Paris, France, July 2000.
[4] K.W. Grant, L.H. Ardell, P.K. Kuhl, D.W. Sparks, The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects, Journal of the Acoustical Society of America 77 (2) (1985) 671–677.
[5] J. Luettin, N.A. Thacker, Speechreading using probabilistic models, Computer Vision and Image Understanding 65 (2) (1997) 163–178.
[6] J. Luettin, N.A. Thacker, S.W. Beet, Visual speech recognition using active shape models and hidden Markov models, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, May 1996.
[7] J. Luettin, S. Dupont, Continuous audio–visual speech recognition, in: Proceedings of the 5th European Conference on Computer Vision, 1998.
[8] C. Neti, P. deCuetos, A. Senior, Audio–visual intent-to-speak detection for human–computer interaction, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 5–9 June 2000.
[9] A. Rogozan, Discriminative learning of visual data for audiovisual speech recognition, International Journal on Artificial Intelligence Tools 8 (1) (1999) 43–52.
[10] A. Rogozan, P. Deleglise, Adaptive fusion of acoustic and visual sources for automatic speech recognition, Speech Communication 26 (1–2) (1998) 149–161.
[11] J.R. Movellan, Visual speech recognition with stochastic networks, in: G. Tesauro, D. Toruetzky, T. Leen (Eds.), Advances in Neural Information Processing Systems, vol. 7, MIT Press, Cambridge, 1995.
[12] J. Baldwin, T. Martin, M. Saeed, Automatic computer lip-reading using fuzzy set theory, in: Proceedings of AVSP 99, Santa Cruz, CA, 1999.
[13] S. Basu, N. Oliver, A. Pentland, 3D modeling and tracking of human lip motion, in: IEEE International Conference on Computer Vision, 1998.
[14] T. Coianiz, L. Torresani, B. Capril, 2D deformable models for visual speech analysis, in: D.G. Stork, M.E. Hennecke (Eds.), Speechreading by Humans and Machines, NATO ASI Series, Series F: Computer and Systems Sciences, vol. 150, Springer Verlag, Berlin, 1996, pp. 391–398.
[15] M.U. Ramos Sanchez, J. Matas, J. Kittler, Statistical chromaticity models for lip tracking with B-splines, in: Proceedings of the First International Conference on Audio- and Video-based Biometric Person Authentication, Lecture Notes in Computer Science, Springer Verlag, 1997, pp. 69–76.
[16] N.P. Erber, C.L. De Flippo, Voice-mouth synthesis of tactual/visual perception of /pa, ba, ma/, Journal of the Acoustical Society of America 64 (1978) 1015–1019.
[17] K.P. Green, J.L. Miller, On the role of visual rate information in phonetic perception, Perception and Psychophysics 38 (3) (1985) 269–276.
[18] E.D. Petajan, Automatic lipreading to enhance speech recognition, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1985, pp. 40–47.
[19] P.L. Silsbee, A.C. Bovik, Computer lipreading for improved accuracy in automatic speech recognition, IEEE Transactions on Speech and Audio Processing 4 (5) (1996) 337–351.
[20] C. Breglar, S.M. Omohundro, Nonlinear manifold learning for visual speech recognition, in: IEEE International Conference on Computer Vision, 1995, pp. 494–499.
[21] M.J. Tomlinson, M.J. Russell, N.M. Brooke, Integrating audio and visual information to provide highly robust speech recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 1996, pp. 821–824.
[22] B.P. Yuhas, M.H. Goldstein, T.J. Sejnowski, R.E. Jenkins, Neural network models of sensory integration for improved vowel recognition, Proceedings of the IEEE 78 (10) (1990) 1658–1668.
[23] K. Mase, A. Pentland, Automatic lipreading by optical flow analysis, Systems and Computers in Japan 22 (6) (1991).
[24] M.S. Brandstein, H. Silverman, A robust method for speech signal time-delay estimation in reverberant rooms, in: Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, Atlanta, Georgia, May 1996.
[25] A. Bell, T. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Computation 7 (7) (1995) 1129–1159.
[26] K. Torkkola, Blind separation of delayed sources based on information maximization, in: Proceedings of ICASSP, May 1996.
[27] S. Amari, A. Cichocki, H. Yang, A new learning algorithm for blind signal separation, in: Advances in Neural Information Processing Systems, vol. 8, 1996.
[28] T.W. Lee, A.J. Bell, R. Orglmeister, Blind source separation of real world signals, in: Proceedings of ICNN, 1997.
[29] P. Aarabi, N.H. Khameneh, Robust speech separation using visually constructed speech signals, in: Proceedings of Sensor Fusion: Architectures, Algorithms, and Applications VI (AeroSense'01), Orlando, FL, April 2002.
[30] T. Darrell, J. Fisher, P. Viola, B. Freeman, Audio–visual segmentation and the cocktail party effect, in: Proceedings of the International Conference on Multimodal Interfaces, Beijing, October 2000.
[31] P. Aarabi, Genetic sensor selection enhanced independent component analysis and its applications to robust speech recognition, in: Proceedings of the 5th IEEE Workshop on Nonlinear Signal and Image Processing (NSIP '01), Baltimore, MD, June 2001.
[32] R.-W. Liu, H. Luo, Direct blind separation of independent non-Gaussian signals with dynamic channels, in: Proceedings of the 5th IEEE International Workshop on Cellular Neural Networks and their Applications, London, England, 1998, pp. 34–38.