Signal Processing: Image Communication 21 (2006) 1–12 www.elsevier.com/locate/image
Partial linear regression for speech-driven talking head application

Chao-Kuei Hsieh, Yung-Chang Chen

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan 30013, ROC

Received 15 September 2004; accepted 25 April 2005. doi:10.1016/j.image.2005.04.002

Corresponding author. Tel.: +886 3 5731153; fax: +886 3 5715971. E-mail addresses: [email protected] (C.-K. Hsieh), [email protected] (Y.-C. Chen).
Abstract

Avatars in many applications are constructed manually or by a single speech-driven model, which requires a large amount of training data and a long training time. It is therefore essential to build a user-dependent model more efficiently. In this paper, a new adaptation method, called partial linear regression (PLR), is proposed and applied in an audio-driven talking head application. The method allows users to adapt part of the model parameters from the available adaptation data while keeping the others unchanged. In our experiments, the PLR algorithm saves the hours otherwise spent on retraining a new user-dependent model and adjusts the user-independent model into a more personalized one. The animated results with adapted models are 36% closer to the user-dependent model than those obtained with the pre-trained user-independent model.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Partial linear regression (PLR); Speaker adaptation; Speech-driven face animation; Audio-to-visual conversion
1. Introduction

With the rapid development of multimedia technology, virtual avatars have been widely used in many areas, such as cartoon or computer game characters and news announcers. However, a huge amount of manpower is needed to adjust the avatar frame by frame in order to achieve vivid and precise synthetic facial animation, since asynchronism between mouth motion and voice pronunciation would be a fatal defect for realism. Therefore, a real-time speech-driven synthetic talking head, or so-called audio-to-visual synthesis system, is desired; it can provide an effective interface for many applications, e.g. image communication [1,24], video conferencing [12,7], video processing [8], talking head representation of agents [26], and telephone conversation for people with impaired hearing [22].
In an audio-to-visual synthesis system, a model must be established to describe the correspondence between the acoustic parameters and the mouth-shape parameters. In other words, the corresponding visual information is to be estimated from given acoustic parameters, such as phonemes, cepstral coefficients or line spectrum pairs. The visual information can be images or mouth movement parameters. Mouth images were used in the work of Bregler et al. [6] to provide a realistic representation; however, the stitching complexity and the limited view angle reduced its practicability. A number of algorithms have been proposed for the task of mapping between acoustic parameters and visual parameters, treating the conversion problem as one of finding the best approximation from given sets of training data. These approaches were briefly discussed by Chen and Rao [10], and include vector quantization [25], hidden Markov models (HMMs) [2,3,9,13,31], and neural networks [19,20,30]. However, speech-driven systems are generally made user-independent for satisfactory average performance, which means a decrease in accuracy for a specific user. To maintain high performance, a time-consuming retraining procedure for a new user-dependent model is unavoidable, since no adaptation method for this application has been reported in the literature.

On the other hand, speaker adaptation methods have been extensively studied in the speech recognition field. They fall into two main categories. The first is the eigenvector-based speaker adaptation method [4,5], which applies normalization at both the training end and the recognition end to deal with the variety of acoustic characteristics caused by different vocal tracts. The other is based on the acoustic model and is simpler than the former, since normalization of the training data is not necessary: a user-independent model is first established statistically with the training data of several speakers, and its parameters are then modified with a certain amount of adaptation data from a new user. The adaptation schemes include maximum a posteriori (MAP) estimation [11,17,27,28], maximum likelihood linear regression (MLLR) [18,21,32], vector field smoothing (VFS) [29], and nonlinear neural networks [16]. These methods adjust the model parameters to maximize the occurrence probability of the new observation data. Among them, MLLR is the most widely adopted for its simplicity and effectiveness when the set of adaptation data is small.

In this study, we integrate the MLLR adaptation approach with the audio-to-visual conversion based on a Gaussian mixture model (GMM), because MLLR was first used for speaker adaptation of continuous-density hidden Markov models and the GMM is the kernel distribution used in an HMM. If the adaptation of the audio-to-visual conversion model could be carried out with both audio and visual adaptation data, it would be exactly the same task as that in [21]. However, obtaining precise visual adaptation information for a new user is not feasible in a usual environment, since markers, infrared cameras, and post-processing (the same as in the training phase) are needed. This makes MLLR not fully adequate when only the audio parameters are to be adapted while the visual part is kept unchanged. In other words, we require another appropriate adaptation, by means of which the new model will map the new audio parameters of a new user to the original visual movement. Such a method, called partial linear regression (PLR), is proposed in this paper. It is derived from MLLR and put into practice in an audio-driven talking head system (Fig. 1). Rather than a time-consuming retraining procedure, a simple adaptation with a small amount of additional data is sufficient to adjust the model so as to be more applicable to the new user.

The rest of the paper is organized as follows. In Section 2, we describe the audio-driven talking head system, which uses a Gaussian mixture model to represent the relationship between audio and visual feature vectors; the audio-to-visual conversion is also described. Section 3 provides a review of MLLR and a detailed description of the proposed PLR model adaptation algorithm. Experimental results are presented in Section 4, and Section 5 concludes the paper.
Fig. 1. (a) Flowchart of training phase and (b) flowchart of testing and updating phases.
2. Audio-driven talking head system

2.1. System architecture

The flowcharts of the training and the testing phases of our audio-driven talking head system are shown in Fig. 1. In the audio signal processing, we extract 10th-order line spectrum pair (LSP) coefficients [14] from every audio frame of 240 samples. In the training phase, the frame rates of the audio and video signals generally differ from each other. After labeling the beginning and ending points of every training word manually, we use linear interpolation to align the audio and visual feature vectors and cascade them into a single vector. A Gaussian mixture model, derived by the EM algorithm [15], is then adopted to represent the distribution of the audio-visual vector. In the testing phase (Fig. 1(b)), obtaining the optimal estimate of the visual vector from an audio vector amounts to calculating the conditional expectation under the trained GMM. However, voice features of the current user may be distinct
from others'. An adaptation would be necessary to modify the pre-trained GMM to be more suitable to the new user.

2.2. Gaussian mixture model

The density function of a Gaussian mixture model is defined as
$$p(\mathbf{y}) = \sum_{i=1}^{m} \pi_i f_i(\mathbf{y}), \qquad \mathbf{y} = [\mathbf{v}^T \; \mathbf{a}^T]^T,$$
with
$$f_i(\mathbf{y}) = \frac{1}{(2\pi)^{d/2} D_i^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu}_i)^T \bar{\Sigma}_i^{-1} (\mathbf{y} - \boldsymbol{\mu}_i) \right\},$$
where $\mathbf{v}$ is the visual feature vector, $\mathbf{a}$ is the audio feature vector, $d$ is the dimension of the vector $\mathbf{y}$, $\pi_i$ is the weighting of the ith Gaussian kernel,
$$\boldsymbol{\mu}_i = [\boldsymbol{\mu}_{v,i}^T \; \boldsymbol{\mu}_{a,i}^T]^T$$
is the mean vector,
$$\bar{\Sigma}_i = \begin{bmatrix} \bar{\Sigma}_{vv} & \bar{\Sigma}_{va} \\ \bar{\Sigma}_{av} & \bar{\Sigma}_{aa} \end{bmatrix}$$
is the corresponding covariance matrix, and $D_i$ is the determinant of $\bar{\Sigma}_i$.¹

¹ In this paper, we use bold italic font to represent a vector, a matrix is denoted in bold italic font with a bar on it, and plain italic font is used for a single value or a function.
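To make the notation concrete, the following small sketch (an illustration only, not code from the paper) evaluates $p(\mathbf{y})$ for one cascaded audio-visual vector, assuming the mixture parameters are held in plain NumPy arrays `weights`, `means`, and `covs`, with the visual part stacked first:

```python
# Illustrative sketch (assumed array layout, not the authors' code):
# weights: (m,), means: (m, d), covs: (m, d, d), with y = [v^T a^T]^T.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(weights, means, covs, v, a):
    """p(y) = sum_i pi_i N(y; mu_i, Sigma_i) for a cascaded vector y."""
    y = np.concatenate([v, a])
    return sum(w * multivariate_normal.pdf(y, mean=mu, cov=S)
               for w, mu, S in zip(weights, means, covs))
```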
Given the training data set Y with n training vectors $\mathbf{y}_j$, $j = 1, 2, \ldots, n$, the likelihood function is
$$p(Y) = \prod_{j=1}^{n} \sum_{i=1}^{m} \pi_i f_i(\mathbf{y}_j).$$
We use the expectation-maximization (EM) algorithm [15] to find a mixture model, i.e., to determine the parameters $\pi_i$, $\boldsymbol{\mu}_i$, and $\bar{\Sigma}_i$, with maximum likelihood such that
$$\prod_{j=1}^{n} \sum_{i=1}^{m} \pi_{u,i} f_{u,i}(\mathbf{y}_j) \;\geq\; \prod_{j=1}^{n} \sum_{i=1}^{m} \pi_{v,i} f_{v,i}(\mathbf{y}_j),$$
where $u \neq v$.
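As a rough illustration of this training path (Fig. 1(a)), the sketch below aligns the audio frames of each word to its visual frames by linear interpolation, cascades them visual-first, and fits a full-covariance GMM with EM via scikit-learn. The helper names (`align_audio_to_visual`, `train_audio_visual_gmm`) and the per-word data layout are assumptions for illustration, not the authors' implementation:

```python
# Minimal sketch of the training phase, under assumed data structures:
# `words` is a list of (lsp_frames, visual_frames) pairs per training word.
import numpy as np
from sklearn.mixture import GaussianMixture

def align_audio_to_visual(lsp_frames, visual_frames):
    """Resample the LSP frames of one word to the visual frame count."""
    n_a, n_v = len(lsp_frames), len(visual_frames)
    t_a = np.linspace(0.0, 1.0, n_a)
    t_v = np.linspace(0.0, 1.0, n_v)
    # Linearly interpolate each LSP coefficient track onto the visual timeline.
    aligned = np.column_stack([np.interp(t_v, t_a, lsp_frames[:, k])
                               for k in range(lsp_frames.shape[1])])
    return np.hstack([visual_frames, aligned])   # cascade: y = [v^T a^T]^T per frame

def train_audio_visual_gmm(words, n_kernels=10, seed=0):
    """Fit a full-covariance GMM (EM under the hood) to the cascaded vectors."""
    Y = np.vstack([align_audio_to_visual(a, v) for a, v in words])
    gmm = GaussianMixture(n_components=n_kernels, covariance_type='full',
                          random_state=seed).fit(Y)
    # gmm.weights_, gmm.means_, gmm.covariances_ play the roles of pi_i, mu_i, Sigma_i.
    return gmm
```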
2.3. Audio-to-visual conversion

For two vectors $\mathbf{v}$, $\mathbf{a}$ modeled as jointly Gaussian, the joint probability function can be represented by $f_{av}(\mathbf{a}, \mathbf{v}) = N(\mathbf{a}, \mathbf{v}; \boldsymbol{\mu}, \bar{\Sigma})$, where
$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_v \\ \boldsymbol{\mu}_a \end{bmatrix}$$
is the mean vector and
$$\bar{\Sigma} = \begin{bmatrix} \bar{\Sigma}_{vv} & \bar{\Sigma}_{va} \\ \bar{\Sigma}_{av} & \bar{\Sigma}_{aa} \end{bmatrix}$$
is the covariance matrix. As stated in [23], the optimal estimator of $\mathbf{v}$ given the vector $\mathbf{a}$ in the mean-squared error sense is the conditional expectation of $\mathbf{v}$ given $\mathbf{a}$, that is,
$$E[\mathbf{v} \mid \mathbf{a}] = \int \mathbf{v}\, \frac{f_{av}(\mathbf{a}, \mathbf{v})}{f_a(\mathbf{a})}\, d\mathbf{v} = \int \mathbf{v}\, \frac{N(\mathbf{a}, \mathbf{v}; \boldsymbol{\mu}, \bar{\Sigma})}{N(\mathbf{a}; \boldsymbol{\mu}_a, \bar{\Sigma}_{aa})}\, d\mathbf{v} = \boldsymbol{\mu}_v + \bar{\Sigma}_{va} \bar{\Sigma}_{aa}^{-1} (\mathbf{a} - \boldsymbol{\mu}_a),$$
where $f_a(\mathbf{a})$ is the marginal probability of the signal $\mathbf{a}$. For a Gaussian mixture model, similarly, the marginal probability function of $\mathbf{a}$ can be obtained from
$$f_a(\mathbf{a}) = \int \sum_{i=1}^{m} \pi_i N(\mathbf{a}, \mathbf{v}; \boldsymbol{\mu}_i, \bar{\Sigma}_i)\, d\mathbf{v} = \sum_{i=1}^{m} \pi_i N(\mathbf{a}; \boldsymbol{\mu}_{a,i}, \bar{\Sigma}_{aa,i}),$$
and the conditional expectation of $\mathbf{v}$ given $\mathbf{a}$ can be derived as
$$E[\mathbf{v} \mid \mathbf{a}] = \int \mathbf{v}\, \frac{f_{av}(\mathbf{a}, \mathbf{v})}{f_a(\mathbf{a})}\, d\mathbf{v} = \sum_{i=1}^{m} \frac{\pi_i}{f_a(\mathbf{a})} \int \mathbf{v}\, N(\mathbf{a}, \mathbf{v}; \boldsymbol{\mu}_i, \bar{\Sigma}_i)\, d\mathbf{v} = \sum_{i=1}^{m} \frac{\pi_i N(\mathbf{a}; \boldsymbol{\mu}_{a,i}, \bar{\Sigma}_{aa,i})}{f_a(\mathbf{a})} \left( \boldsymbol{\mu}_{v,i} + \bar{\Sigma}_{va,i} \bar{\Sigma}_{aa,i}^{-1} (\mathbf{a} - \boldsymbol{\mu}_{a,i}) \right),$$
which is the optimal estimator for the GMM in the least mean-squared error sense.
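A hedged sketch of this conditional-expectation conversion is given below. It assumes the same visual-first array layout as the earlier sketches (e.g. `gmm.weights_`, `gmm.means_`, `gmm.covariances_` from the fitted model) and is meant only to illustrate the formula:

```python
# Illustrative E[v | a] under the assumed visual-first block layout.
import numpy as np
from scipy.stats import multivariate_normal

def audio_to_visual(weights, means, covs, dim_v, a):
    """Return E[v | a] for a single audio vector `a`."""
    mu_v, mu_a = means[:, :dim_v], means[:, dim_v:]
    S_aa = covs[:, dim_v:, dim_v:]
    S_va = covs[:, :dim_v, dim_v:]

    # Mixture responsibilities pi_i N(a; mu_a,i, S_aa,i) / f_a(a)
    lik = np.array([w * multivariate_normal.pdf(a, mu_a[i], S_aa[i])
                    for i, w in enumerate(weights)])
    post = lik / lik.sum()

    # Per-kernel linear predictors mu_v,i + S_va,i S_aa,i^{-1} (a - mu_a,i)
    preds = np.array([mu_v[i] + S_va[i] @ np.linalg.solve(S_aa[i], a - mu_a[i])
                      for i in range(len(weights))])
    return post @ preds   # responsibility-weighted sum over kernels
```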
3. Model adaptation

3.1. Maximum likelihood linear regression

In MLLR mean adaptation [21], the purpose is to maximize the likelihood of the new observation data such that
$$P(O \mid \pi_i, \boldsymbol{\mu}'_i, \bar{\Sigma}_i) \;\geq\; P(O \mid \pi_i, \boldsymbol{\mu}_i, \bar{\Sigma}_i),$$
by linear-regressively adjusting the mean vectors of every Gaussian kernel, i.e.
$$\boldsymbol{\mu}'_i = \bar{W} \begin{bmatrix} 1 \\ \boldsymbol{\mu}_i \end{bmatrix}.$$
With the definition of the auxiliary function Q as
$$Q(\boldsymbol{\mu}, \boldsymbol{\mu}') = P(O \mid \boldsymbol{\mu}) \log p(O \mid \boldsymbol{\mu}'),$$
we can find the optimal value of the matrix $\bar{W}$ by differentiating the auxiliary function Q with respect to $\bar{W}$ and setting the derivative to zero.

3.2. Partial linear regression

The MLLR method performs well even when the amount of adaptation data is small, and it modifies every single value in all the mean vectors of the Gaussian kernels. In other words, if we could gather both audio and visual adaptation data at the same time, MLLR would be qualified for the task of model adaptation in the audio-to-visual conversion. Unfortunately, obtaining the precise visual adaptation information, i.e. the three-dimensional movements of specific control points, of a new user is not feasible in an ordinary environment, since markers, infrared cameras, and post-processing (the same as in the training phase, Fig. 6(a)) are needed; only audio adaptation data is available. This makes MLLR unsuitable for our demand.

If model adaptation is executed directly with audio vectors only, without considering visual information, problems arise; Figs. 2 and 3 give illustrations. Both Fig. 2(a) and Fig. 3(a) show a pre-trained GMM with two Gaussian kernels and the adaptation data X and Y of a new user. On the one hand, if the samples can be classified into the correct kernels without the visual information, as in Fig. 2(a) (X and Y belong to kernels 2 and 1, respectively), the audio-only adaptation will do the right job (Fig. 2(b)). On the other hand, if the audio features involve a large difference, the new samples may be misclassified and the adaptation will be inappropriate. For example, samples X and Y are meant to be in kernel 2 and kernel 1 as in Fig. 3(a), but they are both falsely assigned; that is, kernel 1 should be moved toward the Y position and kernel 2 toward X during model adaptation (Fig. 3(c)), not kernel 1 toward X and kernel 2 toward Y (Fig. 3(b)). With such an issue, the relationship between the audio and visual vectors should of course be considered at the same time. But for the same reason, the true correspondence between the new user's audio vector and the original mouth motion can never be known; even the occurrence probability of each Gaussian kernel could be mis-estimated.
Fig. 2. Illustration of model adaptation with audio data only (I): (a) original trained GMM and new user’s adaptation data X, Y and (b) adapted result.
Therefore, we assume that every person's mouth motion is the same for a given single-syllable Chinese word, and that the alignment between different persons' audio vectors can be achieved linearly once the starting and ending points are labeled manually. In our application, we may merely want to adapt the audio mean vectors, $\boldsymbol{\mu}_{a,i}$, while keeping the correspondence between audio and visual vectors unchanged. Another appropriate adaptation is thus indispensable, by means of which the new model will map the audio parameters of a new user to the original visual movement. For example, consider Fig. 4. Suppose that the user-1-dependent GMM, GMM_ori, has been trained. When user 1 produces
the audio vector $\mathbf{a}$ by pronouncing /e/, we can obtain the correct corresponding visual vector $\mathbf{v}$. If the same pronunciation is made by another user, the produced audio vector $\mathbf{a}'$ will not be the same as $\mathbf{a}$, and consequently a wrong visual vector $\mathbf{v}'$ will be derived. Therefore, we have to adapt GMM_ori to GMM_new, which more precisely reflects the relationship between the audio vector of user 2 and the corresponding visual vector of user 1.

Fig. 3. Illustration of model adaptation with audio data only (II): (a) original trained GMM and new user's adaptation data X, Y; (b) bad adapted result; (c) good adapted result.

MLLR is then modified and integrated with the concept of conditional expectation used in the audio-to-visual conversion of Section 2.3. There, the corresponding visual information, $E[\mathbf{v} \mid \mathbf{a}]$, is estimated from given acoustic parameters. Conversely, we can evaluate the audio information from its corresponding visual parameters, $E[\mathbf{a} \mid \mathbf{v}]$, by the same token (Fig. 5(a)):
$$E[\mathbf{a} \mid \mathbf{v}] = \int \mathbf{a}\, \frac{f_{av}(\mathbf{a}, \mathbf{v})}{f_v(\mathbf{v})}\, d\mathbf{a} = \sum_{i=1}^{m} \frac{\pi_i}{f_v(\mathbf{v})} \int \mathbf{a}\, N(\mathbf{a}, \mathbf{v}; \boldsymbol{\mu}_i, \bar{\Sigma}_i)\, d\mathbf{a} = \sum_{i=1}^{m} \frac{\pi_i N(\mathbf{v}; \boldsymbol{\mu}_{v,i}, \bar{\Sigma}_{vv,i})}{f_v(\mathbf{v})} \left( \boldsymbol{\mu}_{a,i} + \bar{\Sigma}_{av,i} \bar{\Sigma}_{vv,i}^{-1} (\mathbf{v} - \boldsymbol{\mu}_{v,i}) \right).$$
The concept of the proposed PLR (Fig. 5(b)) is to adjust the audio mean vectors $\boldsymbol{\mu}_{a,i}$ linear-regressively, as in MLLR, so as to minimize the distance between the adaptation data $\mathbf{a}'$ and the optimal estimator of $\mathbf{a}'$ given the value of $\mathbf{v}$ corresponding to $\mathbf{a}$. To do so, the new user has to pronounce the words we have pre-defined. Suppose we have J adaptation data $\mathbf{a}'_j$, $j = 1, 2, \ldots, J$. Our final goal is to minimize
$$\sum_{j=1}^{J} \| \mathbf{a}'_j - \hat{\mathbf{a}}'_j \| = \sum_{j=1}^{J} \| \mathbf{a}'_j - E[\mathbf{a} \mid \mathbf{v}_j] \|,$$
where $\mathbf{v}_j$ is the visual vector of the original user corresponding to the identical pronunciation, i.e., to find the adjustments $\bar{A}$ and $\mathbf{b}$ that minimize
$$\sum_{j=1}^{J} \left\| \mathbf{a}'_j - \sum_{i=1}^{m} \frac{\pi_i N(\mathbf{v}_j; \boldsymbol{\mu}_{v,i}, \bar{\Sigma}_{vv,i})}{f_v(\mathbf{v}_j)} \left( \bar{A}\boldsymbol{\mu}_{a,i} + \mathbf{b} + \bar{\Sigma}_{av,i} \bar{\Sigma}_{vv,i}^{-1} (\mathbf{v}_j - \boldsymbol{\mu}_{v,i}) \right) \right\|, \qquad (1)$$
where
$$f_v(\mathbf{v}_j) = \int \sum_{i=1}^{m} \pi_i N(\mathbf{a}, \mathbf{v}_j; \boldsymbol{\mu}_i, \bar{\Sigma}_i)\, d\mathbf{a} = \sum_{i=1}^{m} \pi_i N(\mathbf{v}_j; \boldsymbol{\mu}_{v,i}, \bar{\Sigma}_{vv,i}).$$

Fig. 4. An illustration of the adaptation demand in our speech-driven talking head application.

Fig. 5. (a) The relationship between audio-to-visual and visual-to-audio conversion. (b) Illustration of the proposed partial linear regression (PLR) algorithm: to minimize the distance between $\mathbf{a}'$ and $\hat{\mathbf{a}}'$.

After defining $w_{i,j} = \pi_i N(\mathbf{v}_j; \boldsymbol{\mu}_{v,i}, \bar{\Sigma}_{vv,i}) / f_v(\mathbf{v}_j)$, Eq. (1) becomes
$$\arg\min_{\bar{A}, \mathbf{b}} \sum_{j=1}^{J} \left\| \mathbf{a}'_j - \sum_{i=1}^{m} w_{i,j} \left( \bar{\Sigma}_{av,i} \bar{\Sigma}_{vv,i}^{-1} (\mathbf{v}_j - \boldsymbol{\mu}_{v,i}) \right) - \sum_{i=1}^{m} w_{i,j} \left( \bar{A}\boldsymbol{\mu}_{a,i} + \mathbf{b} \right) \right\|. \qquad (2)$$
Letting $\mathbf{d}_j \triangleq \mathbf{a}'_j - \sum_{i=1}^{m} w_{i,j} \left( \bar{\Sigma}_{av,i} \bar{\Sigma}_{vv,i}^{-1} (\mathbf{v}_j - \boldsymbol{\mu}_{v,i}) \right)$ and $\mathbf{c}_j \triangleq \sum_{i=1}^{m} w_{i,j} \boldsymbol{\mu}_{a,i}$, we now have
$$\arg\min_{\bar{A}, \mathbf{b}} \sum_{j=1}^{J} \left\| (\mathbf{d}_j - \mathbf{b}) - \bar{A} \sum_{i=1}^{m} w_{i,j} \boldsymbol{\mu}_{a,i} \right\| = \arg\min_{\bar{A}, \mathbf{b}} \sum_{j=1}^{J} \left\| (\mathbf{d}_j - \mathbf{b}) - \bar{A}\mathbf{c}_j \right\|.$$
To solve this, we can simply treat it as a least-squares problem. That means, for the kth row,
$$[\bar{c}^T \; \mathbf{1}] \begin{bmatrix} \bar{A}(k,:)^T \\ b(k) \end{bmatrix} = \bar{d}^T(k),$$
where
$$\bar{c} = [\mathbf{c}_1 \; \cdots \; \mathbf{c}_J], \qquad \bar{d} = [\mathbf{d}_1 \; \cdots \; \mathbf{d}_J],$$
and $\mathbf{1}$ is the J-dimensional all-ones column vector. The solution is given by
$$\begin{bmatrix} \bar{A}(k,:)^T \\ b(k) \end{bmatrix} = \left( \begin{bmatrix} \bar{c} \\ \mathbf{1}^T \end{bmatrix} [\bar{c}^T \; \mathbf{1}] \right)^{-1} \begin{bmatrix} \bar{c} \\ \mathbf{1}^T \end{bmatrix} \bar{d}^T(k). \qquad (3)$$
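The following sketch pulls the pieces of the PLR update together: it computes the weights $w_{i,j}$, the residual terms $\mathbf{d}_j$, the combined means $\mathbf{c}_j$, solves the least-squares problem of Eq. (3) for $\bar{A}$ and $\mathbf{b}$, and replaces the audio means by $\bar{A}\boldsymbol{\mu}_{a,i} + \mathbf{b}$. The array layout and function name are again assumptions for illustration rather than the authors' code:

```python
# A minimal NumPy sketch of PLR mean adaptation under the assumed layout:
# weights (m,), means (m, d) visual-first, covs (m, d, d).
import numpy as np
from scipy.stats import multivariate_normal

def plr_adapt_audio_means(weights, means, covs, dim_v, V, A_new):
    """V: (J, dim_v) original user's visual vectors; A_new: (J, dim_a) new user's audio."""
    m = len(weights)
    J = V.shape[0]
    mu_v, mu_a = means[:, :dim_v], means[:, dim_v:]
    S_vv = covs[:, :dim_v, :dim_v]
    S_av = covs[:, dim_v:, :dim_v]

    # w_{i,j} = pi_i N(v_j; mu_v,i, S_vv,i) / f_v(v_j)
    lik = np.stack([weights[i] * multivariate_normal.pdf(V, mu_v[i], S_vv[i])
                    for i in range(m)])                 # (m, J)
    w = lik / lik.sum(axis=0, keepdims=True)

    # Regression terms S_av,i S_vv,i^{-1} (v_j - mu_v,i) from E[a | v_j]
    reg = np.stack([(S_av[i] @ np.linalg.solve(S_vv[i], (V - mu_v[i]).T)).T
                    for i in range(m)])                 # (m, J, dim_a)

    d = A_new - np.einsum('ij,ijk->jk', w, reg)         # d_j, shape (J, dim_a)
    c = w.T @ mu_a                                       # c_j = sum_i w_ij mu_a,i

    # Least squares of Eq. (3): [c_j^T 1] [A(k,:)^T; b(k)] ~ d_j(k), all rows at once.
    X = np.hstack([c, np.ones((J, 1))])                  # (J, dim_a + 1)
    sol, *_ = np.linalg.lstsq(X, d, rcond=None)          # (dim_a + 1, dim_a)
    A_mat, b = sol[:-1].T, sol[-1]

    adapted = means.copy()
    adapted[:, dim_v:] = mu_a @ A_mat.T + b              # mu'_a,i = A mu_a,i + b
    return adapted
```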
4. Experimental results

4.1. Experimental data

The audio ground-truth data was captured with a microphone at 8 kHz, 16-bit, mono, and the facial expression was captured by six infrared cameras at 120 fps, with 27 markers (one of them the root) attached to feature points on the user's face (Fig. 6). Even though six infrared cameras were used for motion capture, the recorded data was still noisy, so a time-consuming post-processing step was carried out to remove such artifacts. For each of the three male subjects in our experiment, we recorded 413 Chinese words. The start and end point of each word were labeled manually, 10 LSP coefficients were calculated from each voice segment of 240 samples, and the audio vectors were then cascaded with their corresponding visual feature vectors. We chose the horizontal and vertical moving distances of the 8 points around the mouth as
Fig. 6. (a) A snapshot of video grabbed from the digital video camera during motion capturing and (b) corresponding control points of talking head model.
the visual feature, because these points are the ones most closely related to the audio variation; the dimension of the visual vectors is therefore 16. In order to preserve the mouth
movement information, only the visual data of one person was adopted and used as a duplicate for the other subjects. In this way, there is a common standard for comparing the estimated result of the audio-to-visual conversion, no matter whether a user-dependent or a user-independent model is used. However, the number of audio frames differs from the number of visual frames because of the different sample rates and the varying word lengths between different people. Therefore, we aligned the audio vectors to the visual vectors in each word by linear interpolation, which resulted in 19,315 audio-visual vectors in total for each user. GMMs with 10 kernels were used to approximate the relationship between the audio features and the visual features.

4.2. Experiment I

The odd audio-visual vectors and the even vectors are used as the training data and the testing data, respectively. Four kinds of models were established:
- For each of the three users, a user-dependent speech-to-mouth-movement model (Model_i, i = 1, 2, 3) was established using their own training data.
- A user-independent speech-to-mouth-movement model (Model_all) was trained using mixed data from all the users (the (3k+i)th vectors from the training data of the ith user, where k ∈ Z and i = 1, 2, 3).
- Applying the proposed PLR method, we adjusted Model_all with the audio-visual vectors not used in the training of Model_all to obtain Model_all→i, i = 1, 2, 3, which is supposed to be more consistent with user i.
- Applying the proposed PLR method, we adjusted Model_i, i = 1, 2, 3, with the training data of user j as the adaptation data to obtain Model_i→j, i, j = 1, 2, 3, i ≠ j, which should be more consistent with user j.
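For concreteness, a hedged outline of this protocol, reusing the `train_audio_visual_gmm`, `audio_to_visual`, and `plr_adapt_audio_means` sketches above, might look as follows; the normalization of the error to a mouth width of 100 is our reading of the evaluation described in the next paragraph, not code from the paper:

```python
# Hedged outline of Experiment I (illustration only; splits, model names,
# and error normalization are assumptions about the protocol).
import numpy as np

def evaluate(weights, means, covs, dim_v, audio, visual_truth, mouth_width):
    """Mean/std of |estimated - true| visual vectors, mouth width scaled to 100."""
    est = np.array([audio_to_visual(weights, means, covs, dim_v, a) for a in audio])
    err = np.abs(est - visual_truth) * (100.0 / mouth_width)
    return err.mean(), err.std()

# Model_i    : train_audio_visual_gmm(odd vectors of user i)
# Model_all  : train_audio_visual_gmm(interleaved odd vectors of all users)
# Model_i->j : plr_adapt_audio_means(Model_i params, dim_v,
#                                    V=shared visual vectors, A_new=user j audio)
# Each model is then scored with evaluate(...) on the even vectors of the target user.
```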
After the GMMs are established and adapted, we can directly derive the corresponding mouth movement vector for a given audio vector. In the testing phase, the visual vector estimated by each model was compared with the ground-truth (training) data. The mean value and standard deviation of the difference, with the mouth width normalized to 100, are recorded in Table 1. As the results show, the relationship between the performances of the GMMs applied to the ith user is

Model_i > Model_all→i > Model_all,  i = 1, 2, 3,

and the animated results of the adapted models with sufficient adaptation data were 36% closer to the user-dependent model than those using the pre-trained user-independent model. Comparing the results of the user-independent GMM with those of a GMM adapted from another user-dependent model to the new user, we can verify that the adapted GMMs still outperform the user-independent GMM, that is,

Model_i→j > Model_all,  i, j = 1, 2, 3 and i ≠ j.
Table 1
Mean and standard deviation of the difference between the original data and the value obtained from each GMM; entries are mean (std.) for each user.

GMM            User 1        User 2        User 3
Model_all      4.45 (3.59)   4.57 (3.61)   4.60 (3.49)
Model_all→1    3.72 (3.57)   4.51 (3.88)   4.55 (4.14)
Model_all→2    4.73 (4.16)   4.22 (3.97)   4.56 (3.98)
Model_all→3    4.41 (4.04)   4.44 (3.99)   4.21 (3.85)
Model_1        2.95 (2.86)   4.28 (3.55)   4.34 (3.75)
Model_1→2      4.19 (3.72)   4.18 (3.80)   4.20 (3.49)
Model_1→3      3.92 (3.62)   4.39 (3.77)   3.94 (3.47)
Model_2        4.39 (3.59)   3.42 (3.21)   4.56 (3.91)
Model_2→1      3.94 (3.68)   4.71 (4.36)   4.71 (4.22)
Model_2→3      4.45 (4.17)   4.85 (4.55)   4.33 (3.91)
Model_3        4.67 (4.22)   4.35 (3.71)   3.30 (3.02)
Model_3→1      4.02 (3.77)   4.87 (4.02)   4.83 (4.34)
Model_3→2      4.96 (4.46)   4.50 (4.09)   4.57 (3.93)
Table 2
Difference between the original data and the values obtained from the user-dependent GMM, the user-independent GMM, the GMM adapted by PLR, and the GMM adapted with the maximum likelihood criterion

User 1:  Model_1     2.95    Model_2→1     3.94    Model_3→1     4.02
         Model_all   4.45    Model_2→1,ML  4.02    Model_3→1,ML  4.13
User 2:  Model_2     3.42    Model_1→2     4.18    Model_3→2     4.50
         Model_all   4.57    Model_1→2,ML  4.40    Model_3→2,ML  4.41
User 3:  Model_3     3.30    Model_1→3     3.94    Model_2→3     4.33
         Model_all   4.60    Model_1→3,ML  4.12    Model_2→3,ML  4.19
Another consideration is to adjust the audio mean vectors $\boldsymbol{\mu}_{a,i}$ with a maximum likelihood criterion on the adaptation audio data alone, without involving the visual model; these results are listed in Table 2. The adaptation again uses the sufficient set of odd audio vectors as adaptation data. As the results show, conversion with the models adapted in this way is on some occasions not as good as with the proposed PLR, even though the occurrence probability of the new audio vectors has been maximized. One explanation is that the relationship between the audio and visual information is not well maintained under this kind of adjustment, as discussed in Section 3.2, and thus it produces only a passable result.
4.3. Experiment II
Fig. 7. Some results of model adaptation with different numbers of adaptation words: (a) user 2 adapted to user 1, (b) user 2 adapted to user 3, (c) user 1 adapted to user 3. Each panel plots the difference (Var, in pixels) for three random trials and their average against the number of adapting words; 'odd' denotes adaptation with all odd vectors.
Instead of using all the odd audio-visual vectors as the adaptation data, we randomly chose 5, 10, 15, 20, 25, and 30 words from the recorded 413 Chinese words and used the audio-visual vectors of these selected words as the adaptation data. The user-dependent models trained in Experiment I, Model_i, were used as the reference. Each random selection was given three trials. The whole set of 413 words was used as the testing data, and the difference between the estimated visual vectors and the ground truth is shown in Fig. 7. The value '0' on the adapting-words axis means that no adaptation was performed, and the corresponding value is the result of the original user-dependent model, Model_i, applied to the new user. The label 'odd' stands for the results of applying the adapted model of Experiment I, Model_i→j, which was obtained with sufficient adaptation data (about 206 words). As the figure shows, the difference between the estimated value and the ground truth tends to decrease as the number of adaptation words increases. Trial 1 in Fig. 7(a) is an ideal illustration. However, not every experimental trial is so well behaved, since the adaptation words are chosen randomly. When the set of adaptation data is small, the selected words are critical to the result of the model adjustment.
With inappropriate extra data, the performance of the adapted model can be worse, even worse than that of the original user-dependent model applied directly. When the number of adaptation words grows to about 20, however, the effect of applying PLR for model adaptation becomes consistently positive.
5. Conclusion

We have proposed a new adaptation algorithm based on partial linear regression. The PLR method can update part of the mean vectors of a Gaussian mixture model while keeping the corresponding relationship unchanged. This matters because the precise visual data of a new user cannot be obtained easily, so we may only collect audio information in the adaptation procedure. As the experimental results in Table 1 show, we can derive a more adequate model for the new user via the PLR adaptation algorithm instead of a time-consuming retraining task. The set of adaptation data plays an important role when it is small and randomly selected: the adjusted model outperforms the original one only if the words are chosen appropriately. How to choose more effective adaptation data is therefore an important issue and is still under investigation, although it is clear that the more adaptation data is used, the better the performance will be.

In our audio-driven talking head system, it is much easier to obtain a new user's voice features than the facial expression. Therefore, the audio data of a new user is used as the adaptation data, so that the audio-to-visual conversion can estimate the corresponding original mouth movement for the new audio parameters. In other applications, if the user wants to modify the driven facial expression to a different style, the proposed method can also be adopted to modify the visual mean vectors (Fig. 8). In this way, when the same audio parameters are given, a new set of face motion parameters can be derived.
Fig. 8. Illustration of using PLR to modify the speech-driven facial expression to a new style.

Acknowledgement

This research was conducted in the project of ''Video-driven 3D Synthetic Facial Animation'',
supported by OES Laboratories, Industrial Technology Research Institute (ITRI), Taiwan. References [1] K. Aizawa, T.S. Huang, Model-based image coding, Proc. IEEE 83 (August 1995) 259–271. [2] P.S. Aleksic, A.K. Katsaggelos, Speech-to-video synthesis using facial animation parameters, Proceedings of 2003 International Conference on Image Processing, volume: 3, pp. III, 1–4 vol. 2, 14–17 September 2003. [3] P.S. Aleksic, A.K. Katsaggelos, Speech-to-video synthesis using MPEG-4 compliant visual features, IEEE. Trans. Circuits Systems Video Technol. 14 (5) (May 2004) 682–692. [4] T. Anastaskos, J. McDonough, R. Schwartz, J. Makhoul, A compact model for speaker adaptive training, ICSLP, 1996. [5] T. Anastaskos, J. McDonough, R. Schwartz, J. Makhoul, A compact model for speaker adaptive training a maximum likelihood approach to speaker normalization, ICASSP, 1997. [6] C. Bregler, M. Covell, M. Slaney, Video rewrite: driving visual speech with audio, in: Proceedings of the International Conference on Computer Graphics Interaction Techniques, Los Angeles, CA, 3–8 August 1997, pp. 353–360. [7] Y.J. Chang, C.K. Hsieh, P.W. Hsu, Y.C. Chen, Speechassisted facial expression analysis and synthesis for virtual conferencing systems, in: Proceedings of 2003 IEEE International Conference on Multimedia and Expo (ICME2003), vol. 3, Baltimore, MD, USA, 6–9 July, 2003, pp. 529–532.
[8] T. Chen, H.P. Graf, K. Wang, Speech-assisted video processing: interpolation and low-bitrate coding, 1994 Conference Record of the Twenty-eighth Asilomar Conference on Signals, Systems and Computers, vol. 2, 31 October–2 November 1994, pp. 975–979. [9] T. Chen, R.R. Rao, Audio-visual interaction in multimedia communication, in: Proceedings of the ICASSP, Munich, Germany, vol. 1, April 1997, pp. 179–182. [10] T. Chen, R.R. Rao, Audio-visual integration in multimodal communication, Proc. IEEE 86 (5) (May 1998) 837–852. [11] J.T. Chien, C.H. Lee, H.H. Wang, Improved Bayesian learning of hidden Markov models for speaker adaptation, in: Proceedings of the 1997 International Conference on Acoustics, Speech, Signal Processing, vol. 2, 1997, pp. 1027–1030. [12] C.S. Choi, F. Kiyoharu, H. Harashima, T. Takebe, Analysis and synthesis of facial image sequences in model-based image coding, IEEE Trans. Circuits Systems Video Technol. 4 (June 1994) 257–275. [13] K.H. Choi, J.H. Lee, Constrained optimization for a speech-driven talking head, Proceedings of the ISCAS, vol. 2, May 2003, pp. 560–563. [14] J.R. Deller, J.H.L. Hansen, J.G. Proakis, Discrete-time Processing of Speech Signals, Wiley-IEEE Press, September 1999. [15] A.P. Dempster, N.M. Laird, D.B. Rubbin, Maximumlikelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. B 39 (1977) 1–38. [16] M. Gales, P. Woodland, Variance compensation within the MLLR frame work, Cambridge University Engineering Department, Technical Report CUED/F-INFENG/ TR.242, February 1996. [17] J.L. Gauvain, C.H. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Trans. Speech Audio Process. 2 (2) (April 1994) 291–298. [18] J.E. Hamaker, MLLR: A speaker adaptation technique for LVCSR, Institute for Signal and Information Processing, Department of Electrical and Computer Engineering, November 1999. [19] P. Hong, Z. Wen, T.S. Huang, H.Y. Shum, Real-time speech-driven 3D face animation, in: Proceedings of First International Symposium on 3D Data Processing Visualization and Transmission, 19–21 June 2002, pp. 713–716.
[20] P. Hong, Z. Wen, T.S. Huang, Real-time speech-driven face animation with expressions using neural networks, IEEE Trans. Neural Networks 13 (4) (July 2002) 916–927. [21] C.J. Leggetter, P.C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech Language 9 (1995) 171–185. [22] R. Lippman, Speech recognition by machines and humans, Speech Commun. 22 (1) (July 1997) 1–15. [23] J.M. Mendel, Lessons in estimation theory for signal processing, communications, and control, Prentice-Hall, Englewood Cliffs, NJ, 1995. [24] S. Morishima, Real-time talking head driven by voice and it’s application to communication and entertainment, in: Proceedings of the International Conference on Auditory-Visual Speech Processing, Terrigal, Australia, 1998. [25] S. Morishima, K. Aizawa, H. Harashima, An intelligent facial image coding driven by speech and phoneme, Proceedings of the IEEE ICASSP, Glasgow, UK, 1989, p. 1795. [26] I. Pandzic, J. Ostermann, D. Millen, User evaluation: synthetic talking faces for interactive services, Visual Comput. 15 (7/8) (November 1999) 330–340. [27] B.M. Shahshahani, A Markov random field approach to Bayesian speaker adaptation, IBM Technical Report TR54.885, January 1995. [28] K. Shioda, C.H. Lee, Structural MAP speaker adaptation using hierarchical priors, in: Proceedings of the IEEE Workshop on Speech Recognition and Understanding, 1997. [29] M. Tonomura, T. Kosaka, S. Matunaga, Speaker adaptation based on transfer vector field smoothing using maximum a posteriori probability estimation, ICASSP95, vol. 1, 1995, pp. 688–691. [30] A. Verma, N. Rajput, L.V. Subramaniam, Using viseme based acoustic models for speech driven lip synthesis, in: Proceedings of the ICME2002, vol. 3, July 2003, pp. 533–536. [31] J.J. Williams, A.K. Katsaggelos, An HMM-based speechto-video synthesizer, IEEE Trans. Neural Networks 13 (4) (July 2002) 900–915. [32] X. Wu, Y. Yan, Speaker adaptation using constrained transform, IEEE Trans. Speech Audio Process. 12 (2) (March 2004) 168–175.