Signal Processing 86 (2006) 2932–2951 www.elsevier.com/locate/sigpro
Speaker-independent 3D face synthesis driven by speech and text
Arman Savran (a), Levent M. Arslan (a), Lale Akarun (b)
(a) Electrical & Electronic Engineering Department, Bogazici University, 34342 Bebek, Istanbul, Turkey
(b) Computer Engineering Department, Bogazici University, 34342 Bebek, Istanbul, Turkey
Received 21 November 2005; accepted 6 December 2005; available online 18 January 2006
Abstract
In this study, a complete system that generates visual speech by synthesizing 3D face points has been implemented. The estimated face points drive MPEG-4 facial animation. The system is speaker independent and can be driven by audio alone or by audio and text together. The synthesis of visual speech is realized by a codebook-based technique that is trained with audio-visual data from a single speaker. An audio-visual speech data set in the Turkish language was created using a 3D facial motion capture system that was developed for this study. The performance of this method was evaluated in three categories. First, audio-driven results were reported and compared with the time-delayed neural network (TDNN) and recurrent neural network (RNN) algorithms, which are popular in the speech-processing field. It was found that TDNN performs best and RNN performs worst for this data set. Second, results for the codebook-based method after incorporating text information were given. Text information together with the audio improves the synthesis performance significantly. For many applications, the donor speaker of the audio-visual data will not be available to provide audio for synthesis. Therefore, we designed a speaker-independent version of this codebook technique. The results of speaker-independent synthesis are important, because no comparative results have been reported for speech input from other speakers used to animate the face model. Although there is a small degradation in the trajectory correlation (0.71 to 0.67) with respect to speaker-dependent synthesis, the performance is quite satisfactory. Thus, the resulting system is capable of animating faces realistically from the input speech of any Turkish speaker.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Visual speech synthesis; 3D facial motion capture; Audio-visual codebook; Speaker independent; MPEG-4 facial animation
1. Introduction
The synthesis of visual speech animation in computers has an important role in human–machine interaction. For instance, the hearing-impaired can benefit from synthetically generated talking
faces by means of lip reading. Lip-reading tutors for the hearing-impaired, or language tutors for children who have difficulties in speaking, can be prepared. Talking faces driven by a speech synthesizer can also be employed as user-interface agents, e.g., in e-learning, in Web navigation, or as a virtual secretary. Moreover, automated lip synching for movie characters or computer games is another important application area. Unfortunately, realistically animating a face is a challenging problem, since we are all experts in perceiving facial cues and cannot tolerate unnatural-
looking details. Inaccurate synthesis of visual speech not only disturbs listeners but also degrades the perception of speech. McGurk and MacDonald [1] have shown that speech is perceived from the audio and visual signals together: they noticed that when the auditory /ga/ was combined with a visual /ba/, it was perceived as /da/. This is known as the McGurk effect, and it emphasizes the need for accurate synchronization between visual and acoustic speech. Various methods have been developed to animate a face using speech. Beskow [2] employed a rule-based approach to synthesize articulatory parameters that control a 3D face model. Cohen and Massaro [3] used dominance functions to model coarticulation, the influence of neighboring segments on a phoneme, in order to synthesize visual speech. In Video Rewrite [4], dynamic context-dependent units of visible speech are concatenated to create speech video of a person. These techniques are called phoneme-driven articulation, since the visible articulatory movements are produced from a sequence of time-labeled phonemes. In audio-driven articulation, on the other hand, the audio signal directly drives the articulation. Lewis and Parke [5] classified acoustic signals into visemes, the visible speech units, using the LPC spectrum. Yamamoto et al. [6] employed HMM-Viterbi and HMM-EM methods to synthesize visual parameters of lip movements from audio signals. In many previous works [7–9], time-delayed neural networks (TDNNs) have been used to map acoustic features to visual features, using neighboring audio frames as well as the current frame in order to take context information into account. In this study, an audio-driven method based on audio-visual codebooks (AVCs), as proposed by Arslan and Talkin [10], was realized to synthesize facial points. This method can also make use of text information to estimate co-articulation effects more accurately than the audio-driven methods mentioned above. It also demonstrates good synthesis performance when used with different speakers. This study was carried out for the Turkish language and, considering synthesis based on captured data, it is the first such work in Turkish. Recent work on speech-driven visual speech synthesis has concentrated on the synthesis of visual data from the same speaker's previous recordings. This type of system may be applicable only to TTS applications where the TTS speaker and the speaker selected for the audio-visual data collection are
actually the same person. However, in many applications (e.g., avatars where people drive their own models with their own voices), the visual synthesis system must be speaker independent. In the literature, this issue is mostly ignored, or it is merely mentioned that the system also supports different speakers; there are no experiments evaluating the speaker-independent performance. In this paper, we address this issue. We implemented a visual speech synthesis system that can be driven by any speaker's voice, and report experimental results which compare the speaker-dependent and speaker-independent modes of operation. For the subjective evaluation of the synthesis results, an MPEG-4 compliant facial animation tool was employed. The generated facial points are used to control a 3D face model. For the training of the system, these facial points are obtained by a 3D facial motion capture system that was developed for this study. This system employs ordinary color markers and a stereo-camera to track the 3D facial coordinates of the speakers. In this work, the codebook-based technique is first compared with the TDNN and recurrent neural network (RNN) algorithms, which are popular in the speech-processing field. These comparative experiments were performed to assess the codebook-based method in audio-driven mode. However, this codebook-based method can use text input in addition to speech in order to take the context into account. This improves the performance significantly, since co-articulation effects, the influences of neighboring phonemes on the current speech frames, are produced more realistically. The method was also adapted for speaker-independent synthesis; this is realized by extending the single-speaker codebook with audio data from different speakers. The synthesis performance after incorporating text information was evaluated experimentally for both speaker-dependent and speaker-independent synthesis. This paper is organized as follows. In Section 2, the facial motion capture system is described. The methods that synthesize the face points to control a face model are explained and the experimental evaluations are given in Section 3. In Section 4, the MPEG-4 facial animation is explained. Finally, conclusions are given in Section 5.
2. 3D facial motion capture
Professional 3D motion capture systems have been widely used for facial animation production
and research purposes. However, these systems are fairly expensive and many researchers may not have access to such equipment. In this section, we present a low-cost capture system that was developed for this study. It comprises a stereo-camera, ordinary color stickers, and a standard PC running our software. Our facial motion capture system is an optical system that tracks the 3D coordinates of markers (not made of any reflective material) placed on the face. A stereo-camera containing two CCD sensors is used. To stream the stereo video at 30 frames per second and the audio synchronously to disk, recorder software was developed. The video is composed of 24-bit color images with 640 × 480 resolution. Small circular color stickers are used as markers. The block diagram of our 3D facial motion capture system is depicted in Fig. 1. In this system, the facial markers are first located in the initial frames of the video. Next, the markers are tracked in 2D and the 3D coordinates are reconstructed from stereo. In the last step, the global head motion is estimated with the help of markers attached to the non-moving parts of the face. The final coordinates are obtained after removing this estimated head motion from the tracked data.
2.1. Detection of initial markers
In order to start tracking the markers, their positions in the initial frames must be located. We use color and shape information as the main cues for this task. The detection is performed only in the facial regions of the input
image to prevent confusion when background pixel colors are close to the marker colors. Therefore, as a preprocessing step, a face mask that represents the face region is extracted. The face mask extraction is based on skin-tone region detection. The hue component of the hue-saturation-intensity (HSI) color model [11] is used instead of red–green–blue (RGB); in the HSI model, the hue component is related to the intrinsic character of a surface, which provides a degree of normalization against lighting. There are three steps in the extraction of a face mask: skin-tone region detection, extraction of an initial face mask, and obtaining the final face mask. To find skin-tone regions, a new image, called the seed-hue image, is calculated. It is obtained by computing the radian distance between the hue value of each pixel and a seed-hue value, and then subtracting this distance from one. The resulting image is in the range [0.5, 1]. To obtain a binary image of the skin regions, thresholding is applied. In Fig. 2, detected skin-tone regions with the seed-hue value 0.04 and threshold 0.95 are shown. After obtaining the binary image, the face mask is found as follows. First, a connected component labeling algorithm is employed to obtain individual objects. An initial face mask is extracted by choosing the object with the maximum area. This face mask includes holes over the marker regions, as well as some unwanted joined components such as shoulder contours. Therefore, further processing is performed with morphological operations. To remove thin attached pieces such as edges from the face mask, opening with a small square-shaped (3 × 3) structuring element is applied.
Fig. 1. Block diagram of the 3D facial motion capture system.
Fig. 2. An example of face mask extraction.
A new mask is then obtained by choosing the maximum-area object after connected component labeling. Finally, a closing operation with a disk-shaped structuring element, whose radius is greater than the marker radius, is applied to close the holes. These morphological operations are performed without iteration. In the marker detection procedure, the total number of markers of each color is given in advance. We used blue and green stickers as markers, since these colors are sufficiently distant from skin color in the hue dimension. The block diagram of the facial marker detection algorithm is depicted in Fig. 3.
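A minimal sketch of the face mask extraction described above is given below. It assumes OpenCV and NumPy are available; the seed-hue value of 0.04 and the threshold of 0.95 follow the text, while the function names and the exact structuring-element sizes are illustrative assumptions.

```python
import cv2
import numpy as np

def face_mask(bgr, seed_hue=0.04, thresh=0.95, marker_radius=7):
    """Extract a binary face mask via seed-hue skin-tone detection (Section 2.1)."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0].astype(np.float32) / 180.0 * 2.0 * np.pi   # hue in radians
    seed = seed_hue * 2.0 * np.pi
    # Circular (radian) distance to the seed hue, mapped to a seed-hue image in [0.5, 1].
    d = np.abs(hue - seed)
    d = np.minimum(d, 2.0 * np.pi - d) / (2.0 * np.pi)
    seed_img = 1.0 - d
    skin = (seed_img > thresh).astype(np.uint8)
    # Initial mask: largest connected component of the thresholded skin-tone image.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(skin)
    biggest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    mask = (labels == biggest).astype(np.uint8)
    # Opening with a small square element removes thin attached pieces (e.g. edges).
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    biggest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    mask = (labels == biggest).astype(np.uint8)
    # Closing with a disk larger than the markers fills the holes they leave behind.
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                     (2 * marker_radius + 3, 2 * marker_radius + 3))
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, disk)
```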
In the first step, candidate markers are obtained by processing the image pixels only within the face mask extracted in the previous stage. The same method as in skin-tone region detection is used to find candidate markers: binary images are created by thresholding the seed-hue images of the marker colors. Then, connected component labeling is applied, and the labeled objects are eliminated according to minimum and maximum area thresholds. In the next step, actual markers are selected from the candidates by computing a circularity measure as defined by Shapiro and Stockman [12]. The markers are solid disks; in the images, however, they appear as ellipses when they are not parallel to the image plane. Although they lose their circular shape in these cases, the elimination of false candidates based on circularity still works, since their shapes remain close to circles. After computing the circularity values, the top N candidates according to the circularity measure are chosen as the actual markers, and their centroids are assigned as the marker positions. In the last step, the detected marker coordinates are corrected if necessary. In some cases, the pixel values at the object centers differ significantly from the actual marker colors for several reasons. Because the tracking algorithm reads these initial pixel values at the beginning, it fails when these pixel values are wrong. To overcome this problem, a marker position is re-computed by seeking the maximum-valued pixel in the vicinity of the centroid if the pixel value at the center is below the threshold. An example result of the marker detection algorithm is shown in Fig. 4. In order to reconstruct the 3D coordinates of the markers from their 2D coordinates, a correspondence between the left and right image markers must be established. This problem is solved simply by labeling the initially detected markers of the stereo images with the same labels. The method is based on ordering the markers in the horizontal (x) and vertical (y) directions of the xyz coordinate space, assuming that the z-axis rotation of the head is not too large. The algorithm uses the configuration of the markers on the face, which is entered by the user into a text file. The configuration consists of the number of clusters and their marker quantities in two layers. These marker clusters represent groups of markers that are close to each other along only one dimension (x or y). At first, all markers are sorted in the y-dimension and their indices according to this ordering are assigned as initial labels. Using the clustering information in the first layer, marker
Fig. 3. Block diagram of the marker detection process.
Fig. 4. Locations of 29 detected and labeled facial markers.
clusters, which are grouped according to the y-dimension, are obtained. Then, the markers inside the clusters are ordered along the x-dimension to update the labels (indices). Using the second level of clustering information, sub-clusters grouped according to the x-dimension are formed, and the markers inside these sub-clusters are ordered along the y-dimension to obtain the final labels. Consequently, the same labeling is provided for a
specific marker configuration. Fig. 4 shows an example result of this marker detection and labeling procedure.
2.2. Tracking of markers
Tracking of the markers in the video images is performed with an algorithm that inherently considers color and shape information. In this
algorithm, each marker's new position in the current frame is searched for inside a square search window centered at the previous coordinates. The algorithm is based on template matching: a template image is extracted from the previous frame and a larger search image from the current frame for each marker. At the beginning of the algorithm, hue values for the detected markers are read from their initial positions for use in the seed-hue image calculations explained in Section 2.1. The block diagram of the method is shown in Fig. 5. As a first step, the template and search images are computed from the previous and current frames, respectively. A square template window of width 2p + 1 and a search window of width 2q + 1 (q > p) are centered on the initial coordinates, and contrast-enhanced seed-hue images of these windows are then calculated by a simple powering operation:

I = (1 - d)^6,   (1)

where d is the radian distance from the seed-hue value of the marker to the current pixel's hue. In this
equation, the power of six was determined empirically to enhance contrast. This contrast enhancement is employed to improve the tracking performance of the markers, especially on the cheeks. Since marker hue values change abruptly in some cases, enhancing the contrast before the edge enhancement improves the template matching performance in these situations. Further processing is performed by enhancing edges before the template matching operation; thus, the emphasis on marker shapes is increased during matching. For this purpose, gradient magnitudes scaled between 0 and 255 are used. For template matching, the well-known cross-correlation technique is used, with the cross-correlation values normalized by the windowed image energies. The best match is taken as the pixel with the maximum response, considering only reliable pixels. That is, the best-match search is not made over all pixels; instead, only locations where the contrast-enhanced seed-hue values are above a
Fig. 5. Block diagram of the marker tracking algorithm.
threshold are considered for the template matching operation. In this way, possible wrong matches, which can occur in cases such as motion blur, are prevented, and the computation time is reduced. If all pixels are below this threshold, which is rare, the pixel with the maximum intensity is chosen as the marker's tracked position; this is equivalent to finding the pixel most similar to the marker's hue value. The maximum-intensity search is made in the template window of the current frame. On the other hand, if pixels are above the threshold and a best match is found, a further correction is applied. It was observed that tracking with template matching alone works well in the initial frames, but minor tracking errors accumulate, causing the center of the marker to drift and, in the worst case, the marker to be lost. To overcome this, a correction after template matching is performed by finding the centroid of the marker at the tracked location. The centroid is computed for the object found by thresholding (with the same threshold as for template matching) a window of the template's size and then choosing the largest object after connected component labeling. The described algorithm requires three global parameters, which are the same for all markers: the parameters that determine the widths of the square template and search windows (p and q, respectively) and the threshold value for selecting reliable pixels before template matching. In fact, using individual parameters for each marker would be a better solution, because marker size changes in the images according to location on the face; however, the same optimized parameters were used for all markers in this study. Choosing p greater than the marker radius, and q equal to at least twice p, gave good results in our recordings. The search window size should also be adjusted according to the movement speed. A sample tracking result is demonstrated in Fig. 6. In this figure, the size of the image is 126 × 111, the radius of the marker is 7, the size of the blue search window is 41 × 41 (q = 20), and the size of the yellow template window is 21 × 21 (p = 10); all values are in pixel units. The black window shows the tracked coordinate. In the next step, the 3D coordinates of the 2D tracked markers are reconstructed using stereo vision. Since the x-axes of our stereo-camera are collinear, and its y- and z-axes are parallel, 3D coordinates are obtained simply from disparity.
Fig. 6. Demonstration of marker tracking in a window.
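The per-marker tracking step and the disparity-based reconstruction can be sketched as follows. This is a simplified illustration assuming OpenCV, a rectified stereo pair with known focal length f, principal point (cx, cy) and baseline; the window sizes and the reliability threshold mirror the values quoted above, and the centroid correction is omitted for brevity.

```python
import cv2
import numpy as np

def edge_map(img):
    """Gradient magnitude used to emphasize marker shape before matching."""
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
    return cv2.magnitude(gx, gy)

def track_marker(prev_seed, cur_seed, xy, p=10, q=20, thresh=0.95):
    """One tracking step on contrast-enhanced seed-hue images (float32 in [0, 1])."""
    x, y = xy
    tmpl = prev_seed[y - p:y + p + 1, x - p:x + p + 1]
    search = cur_seed[y - q:y + q + 1, x - q:x + q + 1]
    # Normalized cross-correlation of the edge-enhanced windows.
    score = cv2.matchTemplate(edge_map(search), edge_map(tmpl), cv2.TM_CCORR_NORMED)
    # A match is accepted only where the seed-hue value itself is reliable.
    centers = search[p:2 * q - p + 1, p:2 * q - p + 1]
    score[centers < thresh] = -1.0
    if score.max() < 0:
        # Rare fallback: most marker-like pixel inside the current template window.
        win = cur_seed[y - p:y + p + 1, x - p:x + p + 1]
        dy, dx = np.unravel_index(np.argmax(win), win.shape)
        return x - p + dx, y - p + dy
    dy, dx = np.unravel_index(np.argmax(score), score.shape)
    return x - q + p + dx, y - q + p + dy  # convert top-left match offset to window center

def point_from_disparity(xl, yl, xr, f, cx, cy, baseline):
    """Rectified stereo: 3D point from the horizontal disparity d = xl - xr."""
    Z = f * baseline / float(xl - xr)
    return (xl - cx) * Z / f, (yl - cy) * Z / f, Z
```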
However, since people move their heads unconsciously while speaking, the resulting facial motion is due not only to speech production but also to head movement. In order to take only the speech-related motion of the facial points into account, the global head motion is estimated and removed. For head pose estimation, some markers are placed on rigid parts of the face so that the head can be treated as a rigid object. The movement of a rigid object is determined by a rotation matrix and a translation vector. With respect to a reference, the relative movement of the head is estimated with the method proposed by Arun et al. [13], which minimizes a least-squares error to estimate the head rotation and translation using singular value decomposition. After the motion of the head is determined, the 3D facial points are normalized by the inverse transformation.
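A minimal sketch of this rigid alignment and normalization step, assuming NumPy; `ref` and `cur` are assumed to hold the reference-marker coordinates in the neutral reference frame and in the current frame, respectively.

```python
import numpy as np

def estimate_head_motion(ref, cur):
    """Least-squares rigid motion (R, T) with cur ~ R @ ref + T, after Arun et al. [13].

    ref, cur : (M, 3) arrays of the rigid (reference) marker coordinates.
    """
    c_ref, c_cur = ref.mean(axis=0), cur.mean(axis=0)
    H = (ref - c_ref).T @ (cur - c_cur)          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against a reflection solution
        Vt[-1, :] *= -1.0
        R = Vt.T @ U.T
    T = c_cur - R @ c_ref
    return R, T

def remove_head_motion(points, R, T):
    """Map all tracked 3D points back into head-fixed (reference) coordinates."""
    return (points - T) @ R                      # equivalent to R^T (p - T) per point
```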
3. Face point trajectory synthesis
The articulation of visual speech is controlled by synthesized face point trajectories. In this section, the methods used to generate these point trajectories are described and the experimental evaluations are given. Our synthesis algorithm, which is codebook based, is designed to work in a speaker-independent manner using speech and text. However, instead of presenting this technique directly, we treat it in three stages, "speech-driven synthesis", "speech- and text-driven synthesis", and "speaker-independent speech- and text-driven synthesis", in order to make some informative comparisons. In Section 3.1, we evaluate our codebook-based method when the input is only acoustic speech, and compare it with the widely used TDNN and RNN neural network algorithms. In Section 3.2, we describe how phonetic information is incorporated into the synthesis process and give experimental results showing the gain in performance. Finally, in Section 3.3, the adaptation of our codebook-based technique to the speaker-independent case is explained and synthesis results are presented. Using our facial motion capture system, an audio-visual data set was prepared for these synthesis experiments. It consists of approximately 10 min of audio-visual data from a female Turkish speaker. The corpus is a phonetically balanced Turkish speech corpus [14] containing 200 utterances. The audio signals were digitized at 16 bits with a sampling frequency of 16 kHz, while the visual data were recorded at a 30 Hz sampling rate. The visual data consist of 29 3D face points; ten of these points, which lie on rigid parts of the face such as the forehead, were used to normalize the facial data.
3.1. Speech-driven synthesis
The estimation of face point trajectories from the acoustic speech signal is a multi-dimensional temporal regression problem. In this part, the TDNN, RNN, and AVC algorithms used to handle this regression problem are compared. We compare our AVC method with the neural network algorithms because they are widely used in visual speech synthesis research and therefore provide a useful reference for assessment. TDNNs have been employed many times [7–9]. RNNs have not been mentioned in the facial animation literature so far, but they are popular in speech recognition and have been used together with HMMs [15]. For all the algorithms, the outputs are the principal components of the facial point displacements, and the inputs are line spectral frequencies (LSFs).
3.1.1. Visual features
To represent facial dynamics, principal components (PCs) of the face point data are used. Principal component analysis (PCA) maps the data to a new space where the dimensions
(PCs) are uncorrelated and represent the directions of maximum variance in the original data. Applying PCA to facial point data is therefore very appropriate, because the motions of these points are highly correlated. In this work, a reference frame with a neutral facial expression was first selected from the data set, and the displacements of the 3D points relative to this reference were calculated for the whole data set. Then, PCA was performed on this displacement data. The data were obtained from the 3D positions of 29 facial markers for 200 utterances, yielding 17802 vectors of dimension 87 as input to the PCA. It was observed that the first five PCs explain 90% of the total variance in the data. Since there are only a few basic facial movements for speech articulation, such as mouth opening or lip protrusion, and the facial points are highly correlated, this PCA result is as expected. In Fig. 7, the contribution of the first PC to the facial motion is depicted by front and profile views of the face; it was obtained by reconstructing the 3D displacements using only the first PC. As seen from the figure, the first PC corresponds to the lower jaw movement, which is the most important facial movement during speech production and involves the most highly correlated facial points. This explains why its contribution to the variance is very high, above 60%. It was also observed that the second PC has the effect of lip protrusion (Fig. 8), the third and fourth PCs are related to mouth opening and closing without jaw movement, and the fifth PC has the influence of labiodental occlusion. These first five PCs explain the most crucial movements for speech articulation; it becomes difficult to interpret the facial motions captured by higher-order PCs.
3.1.2. Acoustic features
Using the speech signal directly makes the regression more difficult, since it contains redundant information. Hence, LSFs are employed to obtain useful vocal-tract data. LSFs are closely related to the speech formants, which are mostly determined by the vocal-tract shape, and are therefore convenient features for the estimation of face shapes. In this study, the audio signals were digitized at 16 bits with a sampling frequency of 16 kHz. Before feature extraction, a pre-emphasis filter (H(z) = 1 - a z^{-1}, a = 0.97) is applied to slightly boost the higher frequencies. Then, linear prediction (LP) analysis of order 18 is performed for each 25 ms audio frame after windowing with a Hamming window [16].
Fig. 7. The influence of modification of first principal component ((a) frontal and (b) profile views).
Fig. 8. The influence of modification of second principal component ((a) frontal and (b) profile views).
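The visual feature extraction of Section 3.1.1 amounts to a PCA of marker displacements relative to a neutral reference frame. A minimal sketch, assuming scikit-learn is available; the array shapes follow the numbers quoted above (29 markers, 87 dimensions, five PCs).

```python
import numpy as np
from sklearn.decomposition import PCA

# points: (n_frames, 29, 3) head-motion-normalized 3D marker coordinates,
# neutral: (29, 3) coordinates of the selected neutral reference frame.
def visual_features(points, neutral, n_pc=5):
    disp = (points - neutral).reshape(len(points), -1)   # (n_frames, 87) displacements
    pca = PCA(n_components=n_pc).fit(disp)
    coeffs = pca.transform(disp)                         # per-frame PC coefficients
    explained = pca.explained_variance_ratio_.sum()      # ~0.90 for 5 PCs in the paper
    return pca, coeffs, explained
```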
Finally, LSFs were computed from the LP coefficients with the methods described by Rothweiler [17,18].
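A sketch of this acoustic feature extraction (pre-emphasis, Hamming windowing, 18th-order LP analysis and LSF conversion). It assumes librosa is available for the LPC step; the 10 ms hop is an assumption, and the LP-to-LSF conversion via the roots of the symmetric and antisymmetric polynomials is a standard construction, not necessarily the exact routine of [17,18].

```python
import numpy as np
import librosa

def lp_to_lsf(a):
    """Line spectral frequencies (radians in (0, pi)) from LP coefficients a = [1, a1..ap]."""
    P = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a[::-1]))  # symmetric polynomial
    Q = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a[::-1]))  # antisymmetric polynomial
    angles = np.concatenate((np.angle(np.roots(P)), np.angle(np.roots(Q))))
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])  # 18 LSFs for order 18

def lsf_features(speech, sr=16000, order=18, frame_ms=25, hop_ms=10, pre=0.97):
    """Per-frame LSF vectors from a 16 kHz speech signal."""
    x = np.append(speech[0], speech[1:] - pre * speech[:-1])   # pre-emphasis H(z)=1-0.97z^-1
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    win = np.hamming(frame)
    feats = []
    for start in range(0, len(x) - frame, hop):
        a = librosa.lpc(x[start:start + frame] * win, order=order)
        feats.append(lp_to_lsf(a))
    return np.array(feats)                                     # (n_frames, 18)
```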
3.1.3. Speech-driven methods
3.1.3.1. Time-delayed neural networks. TDNNs provide a nonlinear mapping from an input sequence to an output sequence. They are formed by simply converting the inputs of multi-layer perceptrons from a temporal sequence to a spatial sequence [19]; the simplest way to do this is to delay previous inputs in time, so that temporal dependencies can be learned from the data. In this study, however, succeeding frames are used in addition to the preceding frames. This provides better context information from the acoustic signal, since the current face shape also depends on future speech segments. The inputs of the TDNN are whitened LSF features. Applying the whitening transformation to the LSFs improves the learning of the TDNN; this is expected because LSFs are highly correlated. By performing PCA it was observed that the first PC alone explains about 60%, and the first 10 PCs about 95%, of the total variation in the data. Hence, z-normalization is applied to the first 10 PCs and whitened LSF signals are obtained. Also, the first five visual PCs are used as the estimation targets, since they explain 90% of the total variance in the data. This dimension reduction decreases the experimentation period significantly. These audio and visual features are also used for the RNN. Training of the TDNN is performed with online learning. However, since speech frames cannot be considered independent of their neighbors, each speech segment is treated as a sample, and the random sampling at each epoch is applied accordingly. Also, the labeled silence regions are excluded during training, since there is no relationship between the acoustic and visual signals during silence and they introduce noise. The window length and the number of hidden units are the hyper-parameters of the TDNN. To determine their optimum values, 10-fold cross-validation was applied over 150 utterances, and the optimum parameters were found to be a window length of nine with 14 hidden units.
3.1.3.2. Recurrent neural networks. In RNNs, in addition to feedforward connections, there are self-connections or connections to units in previous layers. This recurrent architecture provides a short-term memory, so that temporal relations can be learned.
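Going back to the TDNN input preparation of Section 3.1.3.1, the sketch below illustrates how the whitened LSF context window can be assembled; it assumes NumPy and the nine-frame window found by cross-validation, and is only an illustration of the feature layout, not of the network itself.

```python
import numpy as np

def whiten_lsf(lsf, n_keep=10):
    """PCA-based whitening of the LSF features (keep 10 PCs, then z-normalize)."""
    mean = lsf.mean(axis=0)
    centered = lsf - mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    pcs = centered @ Vt[:n_keep].T                      # project onto the first 10 PCs
    return (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)   # z-normalization

def context_windows(white_lsf, window=9):
    """Stack +/-4 neighboring frames around each frame (nine frames in total)."""
    half = window // 2
    padded = np.pad(white_lsf, ((half, half), (0, 0)), mode='edge')
    return np.stack([padded[i:i + window].ravel()
                     for i in range(len(white_lsf))])   # (n_frames, 9*10) TDNN inputs
```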
There are three basic algorithms to train RNNs: backpropagation through time, real-time learning (RTL), and extended Kalman filtering (EKF). RTL and EKF allow the weights to be updated at the current frame and thus can handle sequences of arbitrary length, whereas backpropagation through time cannot. The EKF algorithm has been shown to be more robust than the RTL algorithm, though not dramatically so [20]. However, since EKF is computationally more expensive, the RTL algorithm [21] is employed in this study. As with the TDNN, the optimal hyper-parameters of the RNN were found by 10-fold cross-validation. These hyper-parameters are the type of connections between neurons and the number of hidden units. Three network architectures were tested with cross-validation: recurrent connections between hidden units, connections from the output units back to the hidden units, and both. Other types of connections, such as recurrent connections between output units, and other connection combinations are also possible; based on initial observations and the long computation times, these three structures were chosen for cross-validation. It was observed that including recurrent connections between output units gives significantly worse results than the other connection types, though exhaustive tests were not performed. According to the cross-validation results, using hidden recurrent connections with 12 hidden units performs best for this data set.
3.1.3.3. AVC method. In this method, an AVC is created from the training samples, and the regression is realized by a weighted-sum procedure over the
codebook data. The codebook stores acoustic and visual features together: 18 LSFs and five visual PCs, respectively. The first step in the codebook generation stage is to segment the speech signal into phonemes. For this task, an HMM speech recognizer running in forced alignment mode is employed; it was developed using the HTK Toolkit [22]. For the recognition task, Mel-frequency cepstrum coefficients [23] are extracted from the audio signal. There are 29 Turkish phonemes trained for HMM recognition. These phonemes were modeled with three states and a mixture of six Gaussians per state. In addition, one short pause model with one state and one silence model with three states were trained. Training was performed on a large data set composed of Turkish speakers. In the next step, for each context-phone, a codebook entry is created by adding audio and visual features extracted at uniformly spaced time locations within the phoneme duration; this process is depicted in Fig. 9. In this study, the number of uniform time positions is five. Visual features at these locations are obtained by linear interpolation. For the silence regions, however, the corresponding visual features are not taken from the data; instead, these features, which are the weights of the PCs, are set to zero in order to obtain neutral faces and prevent noise in the silence regions. For the regression, the visual PCs in the codebook are linearly combined. The issue is how to weight these PCs. The weights are calculated based on the distances between the acoustic features of the input
Fig. 9. Demonstration of the creation of a codebook entry (black dots: positions of video frames; Ak: audio feature vector; Vk: visual feature vector).
speech frame and the codebook features. The distances are calculated by taking the characteristics of LSFs and of human hearing into account. LSF pairs indicate formants, and the distance between the two frequencies of a pair depends on the formant bandwidth. It is known that human hearing is more sensitive to acoustic signals with formants that have narrow bandwidths. This fact is used in the distance calculation by assigning higher weights to closely spaced LSFs:

d_i = \sum_{k=1}^{P} h_k \, |w_k - L_{ik}|,   i = 1, \ldots, K \cdot N,   (2)

h_k = \frac{1}{\min(|w_k - w_{k-1}|, |w_k - w_{k+1}|)},   k = 1, \ldots, P.

In the above expressions, the w_k's and L_{ik}'s are the LSFs of the input and of the codebook, respectively, P is the LSF vector dimension, K the total number of audio feature vectors found in a codebook entry, and N the number of codebook entries. From these distances, a set of weights v is derived:

v_i = \frac{e^{-\gamma d_i}}{\sum_{l=1}^{K \cdot N} e^{-\gamma d_l}},   i = 1, \ldots, K \cdot N.   (3)

These normalized codebook weights are used to approximate the original LSF vector w by a linear combination of the codebook LSF vectors:

\hat{w}_k = \sum_{i=1}^{K \cdot N} v_i L_{ik}.   (4)

To determine the value of \gamma, an incremental search is employed. This search minimizes the perceptually weighted distance between the original and the approximated LSF vectors: the distance is evaluated for different values of \gamma, and the best one is chosen. The search range is [0.2, 2] and the step size is 0.1 in this study. Finally, the new PCs for the current speech frame are created with the weights v, which represent the acoustic similarity:

\hat{p}(t) = \sum_{i=1}^{K \cdot N} v_i P_i.   (5)
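A minimal NumPy sketch of this weighted-sum regression (Eqs. (2)–(5)). The handling of the first and last LSF in the computation of h_k, and the flattening of the selected codebook entries into one (K·N, P) matrix, are implementation assumptions.

```python
import numpy as np

def lsf_weights(w, cb_lsf, gammas=np.arange(0.2, 2.01, 0.1)):
    """Codebook weights v for one speech frame (Eqs. (2)-(4)).

    w      : (P,) LSFs of the input frame
    cb_lsf : (M, P) LSFs stored in the selected codebook entries (M = K*N)
    """
    # Perceptual weights h_k: closely spaced LSFs (narrow formants) get larger weight.
    left = np.abs(np.diff(w, prepend=w[0] - 1.0))    # |w_k - w_{k-1}|, padded at the edge
    right = np.abs(np.diff(w, append=w[-1] + 1.0))   # |w_k - w_{k+1}|, padded at the edge
    h = 1.0 / np.minimum(left, right)

    # Perceptually weighted L1 distances between the input and each codebook vector (Eq. (2)).
    d = np.sum(h * np.abs(w - cb_lsf), axis=1)

    best_v, best_err = None, np.inf
    for g in gammas:                                 # incremental search for gamma
        v = np.exp(-g * d)
        v /= v.sum()                                 # Eq. (3)
        w_hat = v @ cb_lsf                           # Eq. (4): approximated LSF vector
        err = np.sum(h * np.abs(w - w_hat))          # perceptually weighted approximation error
        if err < best_err:
            best_err, best_v = err, v
    return best_v

def synthesize_frame(w, cb_lsf, cb_pc):
    """Eq. (5): weighted sum of the visual PCs stored with the codebook entries."""
    v = lsf_weights(w, cb_lsf)
    return v @ cb_pc                                 # (n_pc,) synthesized PC vector
```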
3.1.4. Experimental results
To measure the synthesis performance of the algorithms, the correlation coefficient between the
original and the synthesized facial trajectories was used. A performance measure (PM) was defined to measure the overall system performance:

PM = \frac{\sum_{k=1}^{K} e_k P_k}{\sum_{k=1}^{K} e_k},   (6)
where P_k is the performance metric for the kth PC, i.e., the correlation between the synthesized and original PC coefficients; e_k is the corresponding eigenvalue; and K is the total number of PCs used. This formulation provides a reasonable PM by weighting the performance of the synthesized PCs according to their variance, since correct estimation of high-variance PC components is more valuable than that of the others.

Table 1. Performances of TDNN, RNN, and AVC methods
Algorithms            Average CC
TDNN (nine frames)    0.720
TDNN (one frame)      0.612
RNN                   0.650
AVC                   0.686

The three algorithms were compared by a 10-fold cross-validated paired t-test over 200 utterances using the PM outcomes. The paired t-test tests the hypothesis that two algorithms have the same performance at the 5% significance level. All of these hypotheses were rejected, meaning that the performances can be accepted as genuinely different. The results are given in Table 1. In this experiment, we also tested the TDNN with one frame (with 10 hidden units, which is optimal according to validation) in order to see the effect of context on the estimation. From the average correlation coefficients, we find that the TDNN (nine frames) is the best algorithm, while the RNN performs worst for this problem and this data set. However, it is important to note the reason behind the superiority of the TDNN over the AVC in these experiments. TDNNs use previous and succeeding speech frames for the estimation as well, and thus have the ability to acquire context information from the acoustic data. The AVC, on the other hand, only stores the acoustic features extracted within the phoneme durations, which is not sufficient to learn about context. Therefore, a fairer evaluation is to compare the AVC with a TDNN using only the current frame as input. It is seen from Table 1 that in this case the regression
performance is quite low, and consequently worse than that of the AVC. In the light of these results, it is clear that, for better performance, the AVC method could be developed further to store past and future acoustic frames, though this would bring a considerable memory and computational burden. However, we can manage this by incorporating text information as explained in Section 3.2, which does not add significant memory load regardless of the amount of training data, and at the same time reduces the computational load during synthesis. Performance comparisons aside, the AVC formulation has an important advantage over the others: it is a very flexible algorithm in which text information can easily be exploited, and its adaptation to speaker independence is also simpler. This is why it is preferred in the other synthesis categories.
3.2. Text- and speech-driven synthesis
Synthesizing visemes from the current speech frames can give satisfactory results. However, co-articulation effects, which can be considered as transitions between visemes, depend on the context of the speech. Therefore, incorporating this context information directly in the synthesis phase is critical for more realistic animations. The experiment performed in Section 3.1 also indicates the importance of context. To use this context information, a label, which is the associated context-phone, is assigned to each AVC entry. The assignment of a representative context-phone (label) /l2-l1-c+r1+r2/ to the codebook entry is depicted in Fig. 9, where c is the center phoneme and l_i and r_i are the ith left and right phonemes, respectively. For example, for the English word "everything", which expands into the phonemes /eh_v_r_iy_th_ih_ng/, when /iy/ is the center phoneme c, the context-phone label is v-r-iy+th+ih. The same procedure as in the AVC method is carried out when placing the audio and visual features into the corresponding entries. In these experiments, 20 PCs, which explain about 99% of the variance, are used as visual features. These higher-order PCs can encode the subtle details that are necessary for more natural animation. In the synthesis phase, only the codebook cells that correspond to the context-phones visually most similar to the input context are selected. This similarity is determined using a matrix called the visual
phoneme similarity matrix, which is created in the training phase. The synthesis is then performed with the selected codebook visual features as explained in the AVC method. Also, smoothing is applied after the initial estimation in order to suppress the inevitable synthesis noise, using a least-squares spline approximation technique. This improves the visual quality of the animation and also increases the synthesis performance. The block diagram of the text- and speech-driven synthesis system is shown in Fig. 10. In the following subsections, these two techniques are described, and finally the comparative results that show the significant improvements are given.
3.2.1. Visual phoneme similarity matrix
The visual phoneme similarity matrix is used to cope with incoming context-phones that are not available in the codebook, since having all context-phones in the training corpus is not possible. To create the matrix, visual similarity values of the phonemes are calculated from the Euclidean distances between the PCs of the face data. First, an average principal component coefficient vector is calculated for each phoneme. Then, the similarity measure is found by the following formula:

S_{ik} = e^{-v \, \|m_i - m_k\|},   i = 1, \ldots, K;  k = 1, \ldots, K.   (7)
According to this formula, the similarity value S_{ik} for the ith and kth phonemes lies between 0 and 1. Here, K is the number of phonemes, and the constant v controls the dynamic range of the similarity values; it was chosen as 10 in this study. Fig. 11 visualizes the similarity matrix for Turkish phonemes as a gray-scale image, where darker squares represent higher similarity values. In this figure, Z1 is the short pause and X1 is the silence. The estimated similarity values are consistent with expectations: for instance, it is seen from Fig. 11 that the phoneme closest to /o/ is /O1/ due to lip rounding, and those closest to /b/ are /p/ and /m/ due to lip closure. The information in the visual phoneme similarity matrix is used to determine the codebook entries that are employed for the synthesis. The context-phone of the incoming speech frame is compared to the available context-phones in the codebook by the following formulation:

n_j = S_{cj} + \sum_{i=1}^{C} w^{-i} S_{l_i j} + \sum_{i=1}^{C} w^{-i} S_{r_i j},   j = 1, \ldots, L,   (8)

where S_{cj}, S_{l_i j}, and S_{r_i j} are the visual similarity values found in the visual phoneme similarity matrix; C is the level of context information; L is the total number of context-phones in the codebook; w is the weighting constant; and n_j is the score for the jth codebook entry. The subscripts of the similarity values denote the phonemes of the context-phones. In this work, w is set to 10, so that the influence of the center phoneme in the decision procedure is always higher than that of the outer ones. After calculating the similarity scores, the top N most similar context-phones are selected from the codebook.
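A sketch of the two steps just described: building the visual phoneme similarity matrix (Eq. (7)) and scoring the codebook context-phones against an incoming context-phone (Eq. (8)). It assumes NumPy; phonemes are referred to by index, a context-phone is a tuple (l_C, ..., l_1, c, r_1, ..., r_C), and the outer-phoneme weighting w^{-i} follows the reconstruction of Eq. (8) given above.

```python
import numpy as np

def similarity_matrix(mean_pc, v=10.0):
    """S[i, k] = exp(-v * ||m_i - m_k||) from per-phoneme mean PC vectors (Eq. (7))."""
    diff = mean_pc[:, None, :] - mean_pc[None, :, :]
    return np.exp(-v * np.linalg.norm(diff, axis=-1))

def top_n_entries(query, cb_phones, S, C=3, w=10.0, n_top=6):
    """Score the codebook context-phones against the query and return the n_top best (Eq. (8))."""
    scores = []
    for entry in cb_phones:                       # entry and query: tuples of 2C+1 phoneme ids
        n_j = S[query[C], entry[C]]               # center phoneme term
        for i in range(1, C + 1):
            n_j += w ** (-i) * S[query[C - i], entry[C - i]]   # ith left neighbor
            n_j += w ** (-i) * S[query[C + i], entry[C + i]]   # ith right neighbor
        scores.append(n_j)
    return np.argsort(scores)[::-1][:n_top]       # indices of the selected codebook entries
```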
3.2.2. Smoothing of synthesized trajectories
The synthesized PC trajectories contain noise that makes the resulting animation unnatural, even though the trajectories pass through the proper regions; therefore, smoothing must be applied. In this study, only the speech regions are smoothed, because the silence regions always lie on the baseline due to the zero-magnitude PCs stored in the codebook. However, this causes disturbing jumps between the silence and speech regions. To provide smooth transitions at these locations, a Hamming function is used after smoothing the speech regions. The process is a simple extrapolation realized by appending the proper half of a Hamming window to the speech boundaries; by observation, the width of this half window was set to 200 ms. For smoothing the trajectories over the speech regions, the least-squares spline approximation technique [24] is employed. This technique achieves data compression, since the original signal is approximated with a smaller number of spline coefficients; here, however, it is used to reduce the noise present in the synthesized PC trajectories. The basic idea is the decimation of the spline coefficients while keeping the approximation error minimal. The expression for the spline approximation with an up-sampling factor m is

g_m^n(x) = \sum_{i=-\infty}^{\infty} y(i) \, b^n(x/m - i),   (9)

where the y(i)'s are the spline coefficients weighting the expanded and shifted basis functions b^n in order to obtain the approximated signal g_m^n(x).
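A sketch of this smoothing step using SciPy's least-squares spline fitting; placing one interior knot every m = 3 samples plays the role of the coefficient decimation described above, and the knot placement and boundary handling are implementation assumptions rather than the exact routine of [24].

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def smooth_trajectory(pc_track, m=3, k=3):
    """Least-squares cubic-spline approximation of one synthesized PC trajectory."""
    x = np.arange(len(pc_track), dtype=float)
    knots = x[m:-m:m]                         # interior knots every m samples
    spline = LSQUnivariateSpline(x, pc_track, knots, k=k)
    return spline(x)                          # smoothed trajectory at the original frames
```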
Fig. 10. Block diagram of the codebook-based face point trajectory synthesis algorithm.
Fig. 11. The derived visual phoneme similarity matrix for Turkish.
For our smoothing purposes, cubic splines (n = 3) with a sampling factor m = 3 are used. After the smoothing, the final operation is to obtain the displacement data by transforming the PCs back to the original space.
3.2.3. Experimental results
The experiments for text- and speech-driven synthesis were performed by training the system with 150 utterances and testing with the remaining 50 utterances. This yields 4.2 min of training data and 1.5 min of test data after removing the silence regions. Tests were performed by varying the number of top most similar context-phones at different context-levels. Training utterances were also synthesized and compared in order to see the theoretical upper limit of the performance. The results are shown in Fig. 12. It is seen that the effect of the
context-level is the same after three, though in fact there is no significant difference beyond a context-level of one. In light of this, we can conclude that triphones are sufficient to handle co-articulation effects; since the immediately neighboring phonemes have the greatest influence on co-articulation, this result is quite reasonable. According to the results, as the number of top most similar context-phones increases, the performance rises up to a point for the test set, while it falls for the training set, as can be expected. Of course, for a much larger training corpus, the prediction performance can be expected to increase with longer context lengths by capturing a larger amount of the co-articulation effects in the data. With the help of these tests, it was found that the optimal context-level is three and the optimal number of top most similar context-phones is six. Correlation values with these optimal parameters are listed in Table 2. In order to see the performance
Table 2. Synthesis performance for context-level of 3 using the top six visually most similar context-phones

Test condition              CC
Training data (raw)         0.958
Training data (smoothed)    0.960
Test data (raw)             0.706
Test data (smoothed)        0.754
Fig. 12. The effect of top most similar context-phones on the performance. Each column represents a different context-level.
gain obtained by making use of text information, tests with the same data and procedure as in Section 3.1.4 (cross-validation over 200 utterances and using five PCs) were performed. The average correlations are 0.742 for the raw case and 0.794 after smoothing (the AVC method without context information resulted in an average correlation of 0.686). Tests were also performed to see the effect of the size of the training corpus on the performance: the results for 75, 100 and 125 training utterances are compared in Table 3. As can be seen, increasing the size of the training corpus improves the estimation performance on the test set. In order to assess the synthesis performance of some facial movements that are important for speech articulation, average correlation values for the corresponding face points are listed in Table 4. These facial actions are the lower and upper lip movements in the y-dimension that determine mouth opening; the lip displacement in the z-dimension for lip protrusion; and the lip corner movements along the x-dimension, crucial for lip rounding. A test was also performed to see the effect of the number of PCs used in the synthesis. In Table 5, the results with five PCs are compared with the 20-PC case. We see that, on average, slightly better performance is obtained with 20 PCs; however, when the computational load is an issue, using five PCs may be preferable. In Fig. 13, an example synthesis result is displayed, where the original, synthesized, and smoothed trajectories are plotted. It belongs to the test utterance "rol yapmam gerekmiyordu", and the
Table 3. Synthesis performance for context-level of 3 using the top six visually most similar context-phones, with training over (a) 75, (b) 100 and (c) 125 utterances

Test condition           CC (a)    CC (b)    CC (c)
Test data (raw)          0.681     0.694     0.698
Test data (smoothed)     0.730     0.744     0.745
Table 4. Average correlation between synthetic and original face point trajectories for test and training data (20 PCs)

                     Raw                   Smoothed
Average CC           Test      Training    Test      Training
Lower lip (y)        0.680     0.929       0.737     0.953
Upper lip (y)        0.760     0.967       0.810     0.972
Whole lip (z)        0.748     0.961       0.798     0.959
Lip corners (x)      0.707     0.953       0.762     0.965
Table 5. Average correlation between synthetic and original face point trajectories for test and training data (5 PCs)

                     Raw                   Smoothed
Average CC           Test      Training    Test      Training
Lower lip (y)        0.696     0.923       0.748     0.942
Upper lip (y)        0.709     0.859       0.748     0.864
Whole lip (z)        0.756     0.935       0.801     0.934
Lip corners (x)      0.697     0.911       0.749     0.926
trajectories show the center upper and lower lip vertical displacements with respect to their neutral positions. It is seen that the smoothed synthesized results are sufficiently good: the trajectories pass through the proper locations, although the initial synthesized trajectories are noisy. For example, the phoneme /p/ is correctly synthesized by the closure of the lower and upper lips: while the upper lip point
Fig. 13. The comparison of original and synthesized face point displacement trajectories for test data. Top plot is speech signal of the utterance ‘‘rol yapmam gerekmiyordu’’. Middle and bottom plots are the center upper lip and center lower lip y-axis trajectories, respectively. Circle dotted curve: original; cross dotted curve: synthesized without smoothing; dashed curve: synthesized with smoothing.
moves downward, the lower one moves upward. On the other hand, a larger error is seen over the silence region at the beginning of the utterance. This is due to the inhaling of the speaker and can be ignored, since it has no importance for speech articulation.
3.3. Speaker-independent text- and speech-driven synthesis
In previous speech-driven face synthesis methods, the audio-visual features used in training typically belong to a single speaker. Hence, synthesis for other speakers, especially those whose vocal characteristics are quite different from the training data, is not expected to work well. In this study, a new multi-speaker codebook was created to increase the synthesis performance when the system is used by different speakers. The idea is to enrich the acoustic features of each codebook entry: a speaker-independent system is obtained by including different speaker characteristics in the codebook.
The process of extending the single-speaker codebook to a multi-speaker one is as follows. First, speech signals coming from various speakers are segmented into phonemes to obtain a labeled multi-speaker data set. Then, for each phoneme of each recording, a context-phone is formed as explained in Section 3.2. For each input context-phone, acoustic features are extracted as in the AVC method and then appended to the codebook entry of the same context-phone (Fig. 14). Thus, for the visual features of each codebook cell, we have audio features from the donor of the video data as well as audio-only features from different speakers. During synthesis from a speaker other than the training speaker, a better selection of the appropriate visual features for the current speech segment is possible than with the single-speaker codebook, because the visual weights are assigned based on the acoustic match against a pool of speakers rather than a single speaker. An extreme case might occur when the source
Fig. 14. Demonstration of the creation of the acoustic features of the multi-speaker codebook entry for the context-phone /de+m/ from N speakers. Red and black solid lines designate the acoustic and visual feature extraction locations, respectively. Visual features at these positions are obtained from original features (black dots) by interpolation.
speaker is female and the speech driving the animation belongs to a male speaker. In that case, the proposed multi-speaker method will have a significant advantage over the speaker-dependent system. Since the acoustic distances between the
input and the codebook data of speakers whose voice characteristics are very different from those of the input speaker will be relatively large, the corresponding weights will be fairly small, and thus the effect of the closest speaker is emphasized.
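A schematic sketch of this codebook extension, assuming each entry is keyed by its context-phone label and simply accumulates acoustic feature vectors; the data structure and helper names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
import numpy as np

# Each entry keeps the visual PCs from the video donor and a growing pool of
# acoustic LSF vectors contributed by the donor and by additional speakers.
codebook = defaultdict(lambda: {"lsf": [], "pc": None})

def add_donor_entry(context_phone, lsf_vectors, pc_vectors):
    codebook[context_phone]["lsf"].extend(lsf_vectors)   # K acoustic vectors per entry
    codebook[context_phone]["pc"] = np.asarray(pc_vectors)

def extend_with_speaker(context_phone, lsf_vectors):
    """Append audio-only features from another speaker to an existing entry."""
    if context_phone in codebook:                        # only context-phones seen in training
        codebook[context_phone]["lsf"].extend(lsf_vectors)
```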
Table 6. Speaker-independent synthesis performance for context-level of 3 using the top six visually most similar context-phones with the multi-speaker codebook

Test condition              CC
Training data (raw)         0.780
Training data (smoothed)    0.827
Test data (raw)             0.674
Test data (smoothed)        0.724
In order to evaluate the speaker-independent performance objectively, a multi-speaker codebook involving the audio data of 14 speakers (seven male and seven female) in addition to the original speaker's data was used for the estimation of the facial trajectories. The same 200 utterances recorded from the original speaker were used for the construction of the speaker-independent speech database. The acoustic features of the original speaker in the codebook were excluded during the synthesis, and the performance was measured by comparing the original and synthesized trajectories. The results are listed in Table 6. It is seen that there is a performance degradation with speaker-independent synthesis. While there is a considerable performance decrease for the training set, the degradation for the test set is much smaller. The results for the training set show the theoretical upper limit of the speaker-independent performance of the system. For more reliable speaker-independent performance figures, however, tests using more speakers should be performed; unfortunately, this is not possible with this data set, since the original trajectories are available for only one speaker.
4. MPEG-4 facial animation
Facial animation is generated by deforming a 3D polygonal mesh of the face model. The animation is driven by the 3D face point trajectories through the MPEG-4 Facial Animation standard. To view the synthesis results, an MPEG-4 compliant facial animation engine (FAE) was used in this study. According to the MPEG-4 Facial Animation standard, the human head is parameterized with two parameter groups: face definition parameters (FDPs) and facial animation parameters (FAPs). FDPs define the appearance of the face and include 84 feature points such as the tip of the nose and the mouth corners. The face animation in MPEG-4 is governed by the FAPs, which describe facial movements with respect to the neutral face. There are 68 FAPs divided into 10 groups. The first two groups are the high-level FAPs, which represent visemes and the most common facial expressions. The remaining FAPs allow more detailed animation by deforming local regions of the face; there are 66 low-level parameters categorized into groups corresponding to sub-facial regions such as the eyes and lips. The value of a low-level FAP determines the displacement of one Cartesian coordinate of the corresponding feature point due to translation or rotation. The FAP decoder is therefore responsible for calculating the movements of the other coordinates, as well as the movements of vertices in the proximity of that feature point. FAP values are expressed in terms of facial animation parameter units (FAPUs). These units are fractions of key facial distances, except for the FAPU used to measure rotations. The normalization of facial displacements by means of FAPUs allows any MPEG-4 face model to be animated in a consistent way. In this study, to animate MPEG-4 face models, a set of markers was positioned according to the MPEG-4 specifications, as shown in Fig. 4. Also, additional
Fig. 15. MPEG-4 feature points. Thick circles designate the MPEG-4 points used in this work.
Fig. 16. The smooth-shaded and wire-frame versions of the 3D MPEG-4 face model "Mike" used for animation.
Also, additional markers were employed for future use (for a different animation system) and for global head motion estimation. In total, the reconstructed coordinates of 15 markers were used for the extraction of the FAPs. These markers were attached at the MPEG-4 feature point positions on the outer lips, the cheeks, the jaw, and the nose (depicted in Fig. 15). The FAPUs were measured manually from the reference frame that corresponds to the face in its neutral state. According to the MPEG-4 FAP specifications, all of the related 20 FAPs are calculated for the animation.

To assess the performance of the synthesis results, the FAE developed at the University of Genova [25] was used. The FAE is a high-level interface for animating MPEG-4 compliant faces, and can provide high-frame-rate animation synchronous with audio. This software can animate different MPEG-4 face models and reshape existing ones with the FDPs. However, in this study a demo version of the FAE, which only allows animating a simple face model, was used. The smooth-shaded version of this 3D model, called "Mike", is shown in Fig. 16. It is a very simple model composed of 408 vertices and 750 polygons. Some example animation frames during the phonemes /o/, /b/, and /a/ are shown in Fig. 17.

Fig. 17. Animation frames during the phonemes /o/, /b/, and /a/.

5. Conclusion
In this paper, a system that generates visual speech by synthesizing 3D face points from speech and text input was described, and experimental results were reported. A 3D facial motion capture system, which employs ordinary color stickers and a stereo camera for 3D reconstruction, was also presented. First, the audio-driven performance of the codebook-based system was compared with the TDNN and RNN algorithms; it was found to be better than RNN but worse than TDNN. The reason for this is TDNN's ability to learn context from the acoustic data, as confirmed by running the TDNN without delays. Second, text information was incorporated into the codebook-based method, and the improvements in both the objective results and the animations were significant. Finally, a speaker-independent experiment was performed after expanding the audio data of the codebook with a small number of speakers. The results showed a small performance degradation, but the animations remained quite satisfactory according to informal tests.

Our current work is on developing an animation engine that uses control points to drive 3D face models. The face synthesis method presented in this paper, together with this animation engine, will be integrated with text-to-speech technologies to realize more natural human–machine interfaces. We also plan to develop algorithms that incorporate emotional expressions and facial gestures during visual speech, in order to create more realistic talking characters.
Acknowledgments

This work was supported in part by the EU Sixth Framework project SIMILAR Network of Excellence. We would like to thank the speakers in the database for their help and patience.
References

[1] H. McGurk, J. MacDonald, Hearing lips and seeing voices, Nature 264 (1976) 746–748.
[2] J. Beskow, Rule-based visual speech synthesis, in: Proceedings of the Fourth European Conference on Speech Communication and Technology (Eurospeech '95), Madrid, Spain, 1995, pp. 299–302.
[3] M.M. Cohen, D.W. Massaro, Modeling coarticulation in synthetic visual speech, in: N.M. Thalmann, D. Thalmann (Eds.), Models and Techniques in Computer Animation, Springer, Tokyo, 1993, pp. 139–156.
[4] C. Bregler, M. Covell, M. Slaney, Video rewrite: visual speech synthesis from video, in: Proceedings of the Workshop on Audio-Visual Speech Processing, Rhodes, Greece, 1997, pp. 153–156.
[5] J.P. Lewis, F.I. Parke, Automatic lip-synch and speech synthesis for character animation, in: Proceedings of Graphics Interface '86, Canadian Information Processing Society, Calgary, 1986, pp. 136–140.
[6] E. Yamamoto, S. Nakamura, K. Shikano, Lip movement synthesis from speech based on Hidden Markov Models, J. Speech Commun. 28 (1998) 105–115.
[7] F. Lavagetto, Converting speech into lip movements: a multimedia telephone for hard of hearing people, IEEE Trans. Rehabil. Eng. 3 (1) (1995) 90–102.
[8] D.W. Massaro, J. Beskow, M.M. Cohen, Picture my voice: audio to visual speech synthesis using artificial neural networks, in: Proceedings of AVSP '99, 1999.
[9] P. Hong, Z. Wen, T.S. Huang, Real-time speech-driven face animation with expressions using neural networks, IEEE Trans. Neural Networks 13 (1) (2002) 100–111.
[10] L.M. Arslan, D. Talkin, Codebook based face point trajectory synthesis algorithm using speech input, Elsevier Sci. 953 (1998) 01–13.
[11] R.C. Gonzalez, R.E. Woods, Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 2002 (pp. 295–302, Chapter 6).
[12] L.G. Shapiro, G.C. Stockman, Computer Vision, Prentice-Hall, Englewood Cliffs, NJ, 2001 (pp. 74–75, Chapter 3).
[13] K.S. Arun, T.S. Huang, S.D. Blostein, Least-squares fitting of two 3-D point sets, IEEE Trans. Pattern Anal. Mach. Intell. 9 (5) (1987) 698–700.
[14] H. Dutagaci, Statistical language models for large vocabulary Turkish speech recognition, M.S. Thesis, Bogazici University, 2002.
[15] T. Robinson, M. Hochberg, S. Renals, The use of recurrent neural networks in continuous speech recognition, 1995, svr-www.eng.cam.ac.uk/ajr/rnn4csr94/rnn4csr94.html.
[16] L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.
[17] J. Rothweiler, A root-finding algorithm for line spectral frequencies, in: Proceedings of the IEEE ICASSP 1999, Phoenix, AZ, USA, 1999, pp. II-661–II-664.
[18] J. Rothweiler, On polynomial reduction in the computation of LSP frequencies, IEEE Trans. Speech Audio Process. 7 (5) (1999) 592–594.
[19] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. Lang, Phoneme recognition using time-delay neural networks, IEEE Trans. Acoust. Speech Signal Process. 37 (1989) 328–339.
[20] M. Cernansky, L. Benuskova, Simple recurrent network trained by RTRL and extended Kalman filter algorithms, Neural Network World 13 (3) (2003) 223–234.
[21] R.J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1 (1989) 270–280.
[22] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book, Entropic Cambridge Engineering, 2002.
[23] X. Huang, A. Acero, H.W. Hon, Spoken Language Processing: a Guide to Theory, Algorithm, and System Development, Prentice-Hall PTR, 2001 (pp. 316–318, Chapter 6).
[24] M. Unser, A. Aldroubi, M. Eden, B-spline signal processing: Part I—theory, IEEE Trans. Signal Process. 41 (2) (1993) 821–833.
[25] F. Lavagetto, R. Pockaj, The facial animation engine: towards a high-level interface for the design of MPEG-4 compliant animated faces, IEEE Trans. Circuits Syst. Video Technol. 9 (2) (1999) 277–289.