Speech Communication 47 (2005) 182–193 www.elsevier.com/locate/specom
Data-driven multimodal synthesis

Rolf Carlson *, Björn Granström

CTT, Department of Speech, Music and Hearing, KTH, Lindstedsvägen 24, 5th Floor, SE-100 44 Stockholm, Sweden

Received 24 November 2004; received in revised form 2 February 2005; accepted 7 February 2005

* Corresponding author. Tel.: +46 8 790 75 68; fax: +46 8 790 78 54. E-mail address: [email protected] (R. Carlson).

doi:10.1016/j.specom.2005.02.015
Abstract

This paper reports on current efforts at the Department of Speech, Music and Hearing, KTH, on data-driven multimodal synthesis, including both visual speech synthesis and acoustic modeling. In this research we try to combine corpus-based methods with knowledge-based models and to exploit the best of the two approaches. The paper presents an attempt to build formant-synthesis systems based on both rule-generated and database-driven methods. A pilot experiment is also reported, showing that this approach can be a very interesting path to explore further. Two studies on visual speech synthesis are reported: one on data acquisition using a combination of motion capture techniques, and one concerned with coarticulation, comparing different models.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Speech synthesis; Multimodal synthesis; Data-driven synthesis
1. Introduction

Current speech synthesis efforts, both in research and in applications, are dominated by methods based on concatenation of spoken units. In some cases the original waveform is simply used as is, but often it is processed to some degree before use. Research on speech synthesis is to a large extent focused on how to model efficient unit selection and unit concatenation, and on how optimal databases should be created. The traditional research efforts on formant synthesis and articulatory synthesis have been significantly reduced due to the success of waveform-based methods. A new, very active field of speech research is multimodal synthesis, which again points to the need to understand speech articulation, but from a broader perspective. This paper is a report on current efforts at the Department of Speech, Music and Hearing, KTH, on data-driven multimodal synthesis, including both visual speech synthesis and acoustic modeling. In our research we try to combine corpus-based methods with knowledge-based
models and to explore the best of the two approaches.

1.1. A historical perspective on concatenative acoustic speech synthesis

The review by Klatt (1987) covers some of the early efforts on concatenative synthesis. As early as 1958, Peterson et al. suggested that unit concatenation might be a possible solution for speech synthesis. Dixon and Maxey (1968) made a special effort to create a unit library for diphone synthesis. Early synthesis research at AT&T based on "diadic units" (Olive, 1977) demonstrated an alternative to rule-based formant synthesis (Carlson and Granström, 1976; Carlson et al., 1982; Klatt, 1982). Charpentier and Stella (1986) opened a new path toward speech synthesis based on waveform concatenation by introducing the PSOLA model for manipulating pre-recorded waveforms. Current synthesis methods, using unit selection from large or optimized corpora rather than a fixed unit inventory, try to reduce the number of units in each utterance, thereby resolving context dependencies over a longer time frame. In some cases (e.g., Acero, 1999), formant tracking is used to optimize the selection of units. Recent workshops on speech synthesis show the dominance of techniques based on different kinds of waveform coding.

1.2. Visual speech synthesis

Visual speech synthesis can be accomplished either through manipulation of video images (Bregler et al., 1997; Brooke and Scott, 1998; Ezzat et al., 2002) or with two- or three-dimensional models of the human face and/or speech organs that are under the control of a set of deformation parameters (Beskow, 1995, 1997; Cohen and Massaro, 1993; Pelachaud et al., 1996; Pelachaud, 2002; Reveret et al., 2000). The geometrical talking head model we use in our work is a parametrically controlled deformable polygon surface. Articulatory deformations are implemented as weighted geometrical transforms (translations, rotations and scalings) that
are applied to the vertices of the polygon mesh, according to principles first introduced by Parke (1982). In the study reported in this paper, data-driven articulation is controlled by ten parameters: jaw rotation, lip rounding, upper lip retraction, upper lip raise, lower lip retraction, lower lip depression, left mouth corner stretch, right mouth corner stretch, jaw thrust and jaw shift (sideways).

1.3. Is formant synthesis still of interest?

Affective computing is currently a very active research topic in the speech communication field. The need to synthesize different voices and voice characteristics and to model emotive speech is a strong motivation to keep research on formant synthesis active (Carlson et al., 1991, 1992). The driving force is that rule-based formant synthesis has the necessary flexibility to model both linguistic and extra-linguistic processes. It is, for example, evident that large databases for corpus-based approaches cannot be recorded consistently in a selection of desired speaking styles. However, the flexibility of formant synthesis can also be a problem, since, for example, articulatory constraints are not directly included in the formant-based model. The underlying articulatory gestures are not easily transformed into the acoustic domain described by the formant model. Successful efforts to go "halfway", using higher-level, articulatory-based parameters, have been reported by Stevens and Bickley (1991) and Ogden et al. (2000). Alternative approaches that reduce the need for detailed formant synthesis rules, but still keep the flexibility of the formant model, extract formant synthesis parameters directly from a labeled corpus. Mannell (1998) reported a promising effort to create a diphone library for formant synthesis. The procedure included a speaker-specific extraction of formant frequencies from a labeled database. In a sequence of papers from Utsunomiya University, Japan, automatic formant tracking has been used to generate high-quality speech synthesis using a formant synthesizer and an elaborate voice source (e.g., Mori et al., 2002). Recently we have seen a renewed commercial interest in speech synthesis using the formant
model (e.g., Aurix TTS from 20/20 Speech). One motivation is the need to generate speech using a very small "footprint". Thus, one can predict that formant synthesis will again be an important research subject, both because of its flexibility and because the formant synthesis approach can be compressed into a limited application environment.
2. A combined approach for acoustic speech synthesis
Research efforts to combine data-driven and rule-based methods in the KTH text-to-speech system have been pursued in several projects. In a study by Högberg (1997), formant parameters were extracted from a database and structured with the help of classification and regression trees. The synthesis rules were adjusted according to predictions from the trees. In an evaluation experiment the synthesis was tested and judged to be more natural than the original rule-based synthesis. The approach takes advantage of the fact that a unit library can model detailed gestures better than the general rules can. Sjölander (2001) expanded the method to replace complete formant trajectories with manually extracted values, and also included consonants. According to a feasibility study, this synthesis was perceived as more natural sounding than the rule-only synthesis (Carlson et al., 2002). Sigvardson (2002) developed a generic and complete system for unit selection using regression trees, and applied it to data-driven formant synthesis.

In the current work the rule system and the unit library are more clearly separated compared to our earlier attempts. However, by keeping the rule-based model we also keep the flexibility to make modifications and the possibility to include both linguistic and extra-linguistic knowledge sources. Fig. 1 illustrates the current approach from a technical point of view (Öhlin and Carlson, 2004).

Fig. 1. Rule-based synthesis system using a data-driven unit library.

A database is used to create a unit library. Each unit is described by a selection of extracted synthesis parameters together with linguistic information about the unit's original context and linguistic features such as stress level. The parameters can be extracted automatically and/or edited manually. In our traditional text-to-speech system the synthesizer is controlled by rule-generated parameters from the text-to-parameter module (Carlson et al., 1982). The parameters are represented as time–value pairs, including labels and prosodic features such as duration and intonation. In the current approach some of the rule-generated parameter values are replaced by values from the unit library. The process is controlled by the unit selection module, which takes into account not only parameter information but also linguistic features supplied by the text-to-parameter module. Since only diphone-sized units are included in the database, the unit selection process is rather simple. The parameters are normalized and concatenated before being sent to the synthesizer (Carlson et al., 1991). The concatenation process for each parameter is a simple linear interpolation between the two joining units, starting at the 25% point and ending at the 75% point of the phoneme duration.
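As an illustration of the concatenation step just described, the sketch below cross-fades one parameter track into the next between the 25% and 75% points of the shared phoneme. The function name, the frame layout and the assumption that both diphone units supply values over the whole phoneme are ours, not taken from the actual system.

```python
import numpy as np

def concatenate_parameter(left_track, right_track):
    """Cross-fade one synthesis parameter across a phoneme shared by two
    adjoining diphone units: the left unit dominates up to the 25% point,
    the right unit from the 75% point, with a linear blend in between.
    Illustrative sketch only."""
    n = len(left_track)                       # frames covering the phoneme
    t = np.arange(n) / float(n - 1)           # normalized time 0..1
    w = np.clip((t - 0.25) / 0.5, 0.0, 1.0)   # 0 before 25%, 1 after 75%
    return (1.0 - w) * left_track + w * right_track

# Hypothetical use for one formant track (values in Hz) over one phoneme:
f2_left = np.linspace(1500.0, 1650.0, 12)     # from the unit ending in this phoneme
f2_right = np.linspace(1580.0, 1700.0, 12)    # from the unit starting in this phoneme
f2_joined = concatenate_parameter(f2_left, f2_right)
```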
2.1. Formant extraction for the unit library

When creating a unit library of formant frequencies, automatic methods of formant extraction are of course preferred, due to the amount of data that has to be processed. However, available methods do not always perform adequately. With this in mind, an improved formant extraction algorithm, using segmentation information to lower the error rate, was developed (Öhlin, 2004). It is akin to the algorithms described in (Lee et al., 1999; Talkin, 1989; Acero, 1999). Segmentation and alignment of the waveform were first performed automatically with nAlign (Sjölander, 2003). Manual correction was required, especially at vowel–vowel transitions. The waveform is divided into overlapping time frames of 10 ms. At each frame, an LPC model of order 30 is created; the poles are then searched with the Viterbi algorithm in order to find the path (i.e., the formant trajectory) with the lowest cost. The cost is defined as the weighted sum of a number of partial costs: the bandwidth cost, the frequency deviation cost and the frequency change cost. The bandwidth cost is equal to the bandwidth in Hertz. The frequency deviation cost is defined as the square of the distance to a given reference frequency, which is formant, speaker and phoneme dependent. This requires labeling of the input before the formant tracking is carried out. Finally, the frequency change cost penalizes rapid changes in formant frequencies, to make sure that the extracted trajectories are smooth. Although only the first four formants are used in the unit library, five formants are extracted; the fifth formant is then discarded. The justification for this is to ensure reasonable values for the fourth formant. Furthermore, one sub-F1 pole is extracted to avoid confusion with F1. The algorithm also uses eight-times oversampling before averaging, which reduces the variance of the estimated formant frequencies. After the extraction, the data is downsampled to 100 Hz. Special tools were constructed to automatically detect errors in the extracted data, for instance by comparing the data with formant data generated by rules, or by detecting rapid frequency changes. This information was used for manual corrections.
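The cost structure described above can be illustrated with the following single-formant sketch. The real tracker operates on the poles of the order-30 LPC analysis and searches all formants (plus the sub-F1 pole) jointly; the weights, variable names and the restriction to one formant here are illustrative assumptions.

```python
import numpy as np

def track_formant(cands, ref_freq, w_bandwidth=1.0, w_deviation=1e-4, w_change=0.01):
    """Dynamic-programming search for one formant. `cands` holds one
    (freqs, bandwidths) pair of arrays per 10 ms frame (candidate pole
    frequencies and bandwidths in Hz); `ref_freq` is the speaker- and
    phoneme-dependent reference frequency per frame."""
    n = len(cands)
    # Local cost: bandwidth plus squared deviation from the reference.
    local = [w_bandwidth * bw + w_deviation * (f - ref_freq[i]) ** 2
             for i, (f, bw) in enumerate(cands)]
    best = [local[0]]
    back = [np.zeros(len(cands[0][0]), dtype=int)]
    for i in range(1, n):
        f_prev, _ = cands[i - 1]
        f_cur, _ = cands[i]
        # Transition cost penalizes rapid frequency changes between frames.
        trans = w_change * (f_cur[:, None] - f_prev[None, :]) ** 2
        total = trans + best[i - 1][None, :]
        back.append(np.argmin(total, axis=1))
        best.append(local[i] + np.min(total, axis=1))
    # Backtrack the lowest-cost path and return its frequencies.
    path = [int(np.argmin(best[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    path.reverse()
    return np.array([cands[i][0][j] for i, j in enumerate(path)])
```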
In order to evaluate the unit library based on a reference speaker, a test database of 300 utterances spoken by the same speaker was selected. The utterances were re-synthesized using either the default rule-based synthesis or the new system. Formant trajectories in the test material were automatically calculated and manually corrected. The Euclidean distance between the measured formant frequencies and the synthesized material was calculated. Fig. 2 shows how the new system has been adapted to the reference speaker's vowel space.

Fig. 2. Adaptation to the reference speaker's vowel space, using the unit library from the reference speaker and evaluated on a new database (300 utterances from the same speaker) with manually corrected formant frequencies. (The figure compares the Euclidean distance over F1–F4 for rule-driven and data-driven formants, shown separately for stressed and unstressed vowels.)

2.2. Expansion of the model to include waveform concatenation of voiceless segments

Although reasonably high in intelligibility and clarity, the voiceless fricatives and plosives generated by the formant synthesizer lack naturalness. Therefore, the combined synthesis system has been expanded to include waveform-based synthesis. The unvoiced segments are taken from the database of recorded diphones used for formant extraction. Each unvoiced portion is synthesized using the corresponding waveform, time-scaled using TD-PSOLA (Charpentier and Moulines, 1990) and concatenated with the formant-synthesized output waveform. Since the affected segments are voiceless, no pitch needs to be considered when scaling and concatenating, which makes the task significantly easier. The phoneme /h/, however, is often voiced, and is not included in the concatenative synthesis. For the unvoiced plosives, only the release phase is included. Since it is just a short-term signal, it is not scaled to fit the synthesized speech.
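Because the replaced segments are voiceless, the time-scaling can be sketched with a plain overlap-add procedure; the fragment below is a stand-in for the TD-PSOLA step and not the actual implementation (frame size and overlap are arbitrary choices).

```python
import numpy as np

def ola_stretch(x, factor, frame=256):
    """Overlap-add time-scaling of a voiceless segment. Since there is no
    pitch to preserve, windowed frames are simply repositioned; suitable
    only for moderate scaling factors. Illustrative sketch."""
    hop_in = frame // 2                       # 50% input overlap
    hop_out = int(round(hop_in * factor))     # frame spacing in the output
    win = np.hanning(frame)
    n = max(1, 1 + (len(x) - frame) // hop_in)
    y = np.zeros(hop_out * (n - 1) + frame)
    w = np.zeros_like(y)
    for k in range(n):
        seg = x[k * hop_in:k * hop_in + frame]
        seg = np.pad(seg, (0, frame - len(seg)))
        y[k * hop_out:k * hop_out + frame] += win * seg
        w[k * hop_out:k * hop_out + frame] += win
    return y / np.maximum(w, 1e-6)

# The time-scaled natural fricative would then be written over the
# corresponding voiceless stretch of the formant-synthesized waveform,
# with a short cross-fade at the joins to avoid clicks.
```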
A listening test was carried out, which showed that this method, by itself, led to an increase in perceived naturalness, but also that the increase is even greater when combined with data-driven formant synthesis (Vinet, 2004). The work by Hertz (2002) also suggests that hybrid methods of this type are of great interest for further research. The author describes a promising synthesis approach in which formant synthesis is used to create waveform units to be combined with natural speech units in a concatenative synthesis system.

2.3. Evaluation

In addition to the evaluation referred to above, a preliminary listening test was carried out to evaluate the combined synthesis in its current stage. Twelve test subjects were asked to listen to 10 different sentences, all of which had been synthesized in two different ways: by rule and by the new combined system. For each such pair, the test subjects were asked to pick the one that sounded most natural. The data-driven synthesis was considered more natural 73% of the time. Averaged per sentence, no sentence was perceived as more natural when synthesized with rules only than with the new synthesis. The conclusion from this and the earlier listening tests is that the data-driven formant synthesis is indeed perceived as more natural sounding than the rule-driven synthesis. However, it is also apparent from the synthesis quality that a lot of work still needs to be put into the automatic building of a formant unit library.
3. Visual speech synthesis—how to obtain data?

In acoustic modeling the primary data is the speech signal. In our work on three-dimensional models for articulatory and visual speech synthesis at KTH, we have exploited several kinds of data sources. For the externally visible articulators and the facial surface, we have used optical motion tracking. For modeling of the tongue and the internal vocal tract, three-dimensional data from magnetic resonance imaging (MRI) and kinematic data from electropalatography (EPG) and electromagnetic articulography (EMA) have been used (Engwall, 2002a, 2004). While each of these methods in isolation can provide useful information, none yields complete 3D data with good temporal resolution, and they hence need to be combined. We have investigated simultaneous measurements of vocal tract and facial motion using EMA and optical motion tracking. The data is used to improve and extend the articulation of an animated talking head (Fig. 3).

Fig. 3. The talking head is modeled as a 3D mesh that can be parametrically deformed. Ten parameters control the articulation.

3.1. Previous studies

The present study differs from previous related studies in that it uses simultaneous recordings of a large set of sentences. Yehia et al. (1998) used non-simultaneous recordings with Optotrack and EMA of two English and six Japanese sentences to derive a quantitative association between the two data sets. Jiang et al. (2000) collected the data simultaneously, using Qualisys, an optical motion tracking system (http://www.qualisys.se), and EMA, but for CV syllables only and using 17 Qualisys markers. Bailly and Badin (2002) studied the correlation between facial and tongue movements using an articulatory model based on video and cineradiographic recordings. All three studies concluded that the face supplies information on the articulation of the speech organs, but Bailly and Badin (2002) warned that the information is insufficient to recover the lingual constriction. The most important difference
is that none of these studies aims directly at applying the results to articulatory speech synthesis of the face, jaw and the entire tongue.

3.2. Models and data

3.2.1. Face and tongue models

The face and tongue models (Beskow, 1997, 2003) are based on concepts first introduced by Parke (1982), defining a set of parameters that deform a static 3D wireframe mesh by applying weighted transformations to its vertices. The parameters for the face are jaw opening, jaw shift, jaw thrust, lip rounding, upper lip raise, lower lip depression, upper lip retraction and lower lip retraction. The 3D tongue model is based on a three-dimensional MRI database of one reference subject of Swedish (Engwall, 2002b). The corpus consisted of 13 Swedish vowels and 10 consonants in three symmetric VCV contexts. As the acquisition time of 43 s required the subject to artificially sustain the articulations, this database cannot be used to model speech dynamics. For this purpose a new database, with EPG and EMA data, has been collected and used to adjust the articulations to normal and to obtain information on articulatory dynamics. The tongue parameters include dorsum raise, body raise, tip raise and tip advance.
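A minimal sketch of the Parke-style deformation principle used by the face and tongue models is given below: each articulatory parameter moves the vertices toward a transformed position in proportion to a per-vertex weight, so that the parameter only affects its region of influence. The shapes, weights and the jaw-opening example are illustrative assumptions, not the actual model definition.

```python
import numpy as np

def apply_parameter(vertices, weights, rotation, translation, pivot):
    """Weighted geometric transform applied to the mesh vertices: every
    vertex is blended between its rest position and its rotated/translated
    position according to a per-vertex weight in [0, 1]."""
    moved = (vertices - pivot) @ rotation.T + pivot + translation
    w = weights[:, None]
    return (1.0 - w) * vertices + w * moved

# Hypothetical jaw-opening example: a small rotation about a pivot near the
# jaw joint, weighted so that only the lower part of the face follows it.
a = np.radians(8.0)
rot_x = np.array([[1, 0, 0],
                  [0, np.cos(a), -np.sin(a)],
                  [0, np.sin(a),  np.cos(a)]])
verts = np.random.rand(500, 3)                            # stand-in for the mesh
jaw_weights = np.clip(1.0 - 2.0 * verts[:, 1], 0.0, 1.0)  # stronger lower on the face
pivot = np.array([0.0, 0.4, -0.1])
opened = apply_parameter(verts, jaw_weights, rot_x, np.zeros(3), pivot)
```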
3.3. Measurement setup

The EMA data is collected with the Movetrack system (Branderud, 1985), using two transmitters on a light-weight head mount and six receiver coils (1.5 × 4 mm) positioned in the midsagittal plane as depicted in Fig. 4: three coils on the tongue (around 8 mm, 20 mm and 52 mm from the tip of the tongue) and two coils above and below the upper and lower incisors, respectively. One coil was placed on the upper lip for co-registration with the optical system.

Fig. 4. Marker placement for Movetrack (right) and Qualisys (left) measurements.

The optical motion tracking is done using a Qualisys system (http://www.qualisys.se) with four cameras. The system tracks 28 small reflectors (4 mm diameter) glued to the subject's jaw, cheeks, lips, nose and eyebrows and on a plate attached to the Movetrack head mount (to serve as reference for head movements), and calculates their 3D coordinates at a rate of 60 frames per second. The EMA coils on the upper lip and the jaw were equipped with a reflector (the latter during a special alignment recording) to allow for spatial alignment between the two data sets. The data was collected in sets of 1 min each, with a break between sets. In the analysis below, silent pauses between the speech sequences were removed.

3.4. Subject and corpora

The subject was a female native speaker of Swedish, who has received high intelligibility ratings in audio-visual tests.

3.4.1. Sentence corpus

The 270 Swedish everyday sentences, listed in (Öhman, 1998), have been developed specially for audio-visual speech perception tests by Öhngren, based on MacLeod and Summerfield (1990). The sentences are independent of each other and generally seven to nine syllables long (4–5 words).

3.4.2. Nonsense VCV word corpus

The 138 VCV and VCC{C}V words consisted of the consonants [p, t, k, ˇ, b, d, g, Î, m, n, ¯, N, l, Æ, f, s, , c¸, j, r, v, h] and the consonant clusters [jk, rk, pl, bl, kl, gl, pr, br, kr, gr, kt, nt, tr, dr, st, sp, str, spr, sk, fl, fr, sl, skl, skr] in symmetric vowel contexts V = [a, I, f].
3.4.3. Nonsense CVC word corpus

The corpus consisted of 41 asymmetric C1VC2 words: firstly, the long vowels V = [u:, o:, 2:, i:, e:, e:, ø:] in C1 = [k], C2 = [p] and C1 = [p], C2 = [k] contexts; secondly, the short vowels V = [f, , a, I, e, Y, ø] in C1 = [k], C2 = [p:] and C1 = [p], C2 = [k:] contexts. The [r] allophones V = [æ:, œ:, æ, œ] were collected with C1 = [k] and C2 = [r].
3.5. Data processing and use of the data

3.5.1. Pre-processing

The Qualisys data, consisting of 3D coordinates for all 28 points, was first normalized with respect to global movement, using the points on the Movetrack frame as reference. The facial model was scaled and the Qualisys data was roto-translated in such a way that an optimal fit between the facial surface and the measured points was achieved. The EMA data was down-sampled to the frame rate of the Qualisys data, 60 Hz, and inserted into the midsagittal plane of the model, where it was roto-translated to align the lip and jaw coils with the corresponding Qualisys markers, forming a coherent data set of extra- and intraoral movement data.

3.5.2. Correlation between the datasets

In an initial study using the data, we investigated the interrelation between the face (Qualisys) and tongue (Movetrack) datasets. Using linear regression, one data set is predicted from the other, and the correlation between the original and the predicted data can be calculated. Face data is arranged in an N-by-75 matrix X, where each row represents a time frame (N = number of frames in the given corpus) and the columns hold the x-, y- and z-coordinates of the 25 points, excluding the reference points on the Movetrack head mount. A similar N-by-2K matrix Y is constructed for the EMA data, containing the x- and y-coordinates of the K Movetrack coils. The analysis was carried out with K = 3 coils (tongue only), 4 coils (tongue and jaw) and 5 coils (tongue, jaw and upper lip). It was found that for 3 and 4 coils (tongue, tongue + jaw), prediction of EMA data from the face is better than prediction of the face from EMA, but with 5 coils (tongue + jaw + lip) the face is better recovered from the EMA data than the opposite. For a detailed account of this study, see Beskow et al. (2003).
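The cross-prediction analysis can be sketched as follows: fit a linear (affine) mapping from one data set to the other by least squares and report the mean correlation between measured and predicted channels. Function and variable names are ours, and the actual analysis in Beskow et al. (2003) may differ in detail.

```python
import numpy as np

def cross_prediction_correlation(X, Y):
    """Predict Y (e.g. EMA coil coordinates) from X (e.g. face marker
    coordinates) with ordinary least squares, and return the mean Pearson
    correlation between measured and predicted channels."""
    # Append a constant column so an affine offset is estimated too.
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    Y_hat = Xa @ W
    r = [np.corrcoef(Y[:, j], Y_hat[:, j])[0, 1] for j in range(Y.shape[1])]
    return float(np.mean(r))

# Hypothetical usage: X is N-by-75 (25 face points, x/y/z),
# Y is N-by-2K (K EMA coils, x/y in the midsagittal plane).
# corr_face_to_ema = cross_prediction_correlation(X, Y)
# corr_ema_to_face = cross_prediction_correlation(Y, X)
```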
4. Coarticulation in data-driven vs. rule-based visual synthesis

In most implementations we can make a distinction between the visual signal model and the control model. The visual signal model is responsible for producing the image, given a vector of control parameter values. The control model is responsible for driving the animation by providing these vectors to the signal model at each point in time, given some symbolic specification of the desired animation. In a general sense, the input to a control model could contain information about speech articulation, emotional state, discourse information, turn-taking, etc. In this section we will restrict ourselves to studying articulatory control models.

Early acoustic text-to-speech systems (Allen et al., 1987; Carlson et al., 1982) employed parametrically controlled models for speech generation. In these systems, control parameter trajectories were generated within a rule-based framework, where coarticulatory effects were modelled using explicit rules. Later, as concatenative speech synthesis techniques grew popular, the need for such rules diminished, since coarticulation was inherently present in the speech units used, for example diphones, demi-syllables or the arbitrarily sized units used in contemporary unit-selection based speech synthesizers.

Rule-based control schemes have been successfully employed for visual speech synthesis. Pelachaud et al. (1996) describe an implementation of the look-ahead model. Phonemes are clustered into visemes that are classified with different deformability ranks, which serve to indicate to what degree a viseme should be influenced by its context. Another rule-based look-ahead model is proposed by Beskow (1995). In this model, each phoneme is assigned a target vector of articulatory control parameters. To allow the targets to be influenced by coarticulation, the target vector
may be under-specified, i.e., some parameter values can be left undefined. If a target is left undefined, the value is inferred from context using interpolation, followed by smoothing of the resulting trajectory.

For visual speech synthesis, approaches based on concatenation of context-dependent units have been less dominant, although they have been used for video-based synthesis (Bregler et al., 1997) as well as in model-based systems (Hällgren and Lyberg, 1998).

The model described by Cohen and Massaro (1993) is based on Löfqvist's (1990) gestural theory of speech production. In this model, each segment is assigned a target vector. Overlapping temporal dominance functions are used to blend the target values over time. The dominance functions take the shape of a pair of negative exponential functions, one rising and one falling. The height of the peak and the rate with which the dominance rises and falls are free parameters that can be adjusted for each phoneme and articulatory control parameter. Because the rise and fall times are context-independent, the Cohen–Massaro model can essentially be regarded as a time-locked model. It should be noted, however, that due to the nature of negative exponential functions, all dominance functions extend to infinity, so in practice the onset of a gesture occurs gradually as the dominance of the gesture rises above the dominance of other segments. The free parameters were empirically determined through hand-tuning and repeated comparisons between synthesis and video recordings of a human speaker.
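The dominance-blending idea can be sketched as follows: each segment contributes its target in proportion to a dominance function that rises before and decays after the segment, and the trajectory is the dominance-weighted average of the targets. The exact functional form and parameter values used by Cohen and Massaro (1993) differ; everything below, including the names, is illustrative.

```python
import numpy as np

def dominance(t, center, peak, rate_on, rate_off):
    """Pair of negative exponentials: rising before the segment centre,
    falling after it (a sketch of the dominance shape)."""
    return np.where(t < center,
                    peak * np.exp(-rate_on * (center - t)),
                    peak * np.exp(-rate_off * (t - center)))

def blend_targets(t, segments):
    """Weighted average of segment targets, with dominance as the weight.
    `segments` is a list of dicts with keys center, target, peak,
    rate_on, rate_off (illustrative names, not from the paper)."""
    num = np.zeros_like(t, dtype=float)
    den = np.zeros_like(t, dtype=float)
    for s in segments:
        d = dominance(t, s["center"], s["peak"], s["rate_on"], s["rate_off"])
        num += d * s["target"]
        den += d
    return num / np.maximum(den, 1e-9)

# Illustrative use for one articulatory parameter (e.g. lip rounding):
t = np.linspace(0.0, 0.6, 61)   # 10 ms frames over a 600 ms stretch
segs = [dict(center=0.10, target=0.2, peak=1.0, rate_on=30.0, rate_off=30.0),
        dict(center=0.30, target=0.9, peak=0.7, rate_on=20.0, rate_off=20.0),
        dict(center=0.50, target=0.1, peak=1.0, rate_on=30.0, rate_off=30.0)]
trajectory = blend_targets(t, segs)
```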
Other investigators have proposed enhancements to the Cohen–Massaro model, as well as data-driven automatic estimation of its parameters. Le Goff (1997) modified the dominance functions to be n-continuous and used trajectories extracted from video recordings of a real speaker uttering V1CV2CV1 words to tune the model for French. The resulting model was perceptually evaluated using an audiovisual phoneme identification task with varying levels of background acoustic noise. Recently, Massaro et al. (2005) used optical motion tracking of 19 points on a real speaker's face to obtain a database of 100 sentences that were used to tune the free parameters of the original Cohen–Massaro model. Training was carried out both for a set of 39 monophones and for 509 context-dependent phones.

Reveret et al. (2000) adopt Öhman's model of coarticulation to drive a French-speaking talking head. Coarticulation coefficients and the temporal functions guiding the blending of consonants and the underlying vowel track were estimated from a corpus derived using video analysis of 24 VCV words.

4.1. Comparing different coarticulation schemes

A number of factors make it difficult to assess the relative merits of the different approaches mentioned above. The corpora on which the models are trained differ in size, type of content (VCV words, sentences, etc.) as well as language. Different control parameter sets are used, obtained using a variety of techniques, and different facial models are used to produce the resulting animations. The purpose of the present study is to seek an answer to the question of whether there is reason to prefer one of these models to another when building a data-driven visual speech synthesis system. The two main classes of models, look-ahead (represented by Öhman's model) and time-locked (represented by Cohen and Massaro's model), are trained on a corpus of phonetically rich sentences recorded with the Qualisys motion capture technique and then evaluated objectively as well as perceptually, on a test set not part of the original training data. In addition, they are compared against two variants of a novel control model based on recurrent time-delayed artificial neural networks, one of which is designed specifically for real-time applications.

In the case of look-ahead models, the amount of look-ahead time needed depends on the timing and identity of the particular segments involved. In practice, many implementations will require access to the full utterance before any trajectories can be computed. This is for example the case with an unconstrained implementation of the Cohen–Massaro model, where the negative exponential dominance functions extend to infinity (Beskow, 2004).
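For contrast with the time-locked blending sketched earlier, a minimal sketch of a look-ahead scheme in the spirit of Beskow (1995) is given below: undefined targets are inferred from context by interpolation and the resulting trajectory is smoothed. The moving-average smoother, frame rate and data layout are assumptions made for the example.

```python
import numpy as np

def lookahead_trajectory(targets, durations, frame_ms=10, smooth=5):
    """One parameter track from per-phoneme targets, where a target of
    None is left undefined and filled in from the defined neighbours."""
    centers, t0 = [], 0.0
    for dur in durations:                      # target placed at phoneme midpoint
        centers.append(t0 + dur / 2.0)
        t0 += dur
    t = np.arange(0.0, t0, frame_ms / 1000.0)

    # Fill undefined targets by interpolating between defined neighbours.
    defined = [i for i, v in enumerate(targets) if v is not None]
    filled = np.interp(centers,
                       [centers[i] for i in defined],
                       [targets[i] for i in defined])

    # Frame-rate trajectory through the (filled) targets, then smoothing.
    traj = np.interp(t, centers, filled)
    kernel = np.ones(smooth) / smooth
    return t, np.convolve(traj, kernel, mode="same")

# Hypothetical example: lip rounding over five phonemes, unspecified for the
# consonants so that the vowel context determines their rounding.
t, y = lookahead_trajectory([None, 0.1, None, 0.9, None],
                            [0.08, 0.12, 0.08, 0.15, 0.08])
```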
4.1.1. Real-time considerations

In certain real-world applications, the models mentioned above are impossible to use without modifications. In the Synface project (Siciliano et al., 2003) the goal is to develop a communication aid for hard-of-hearing people, consisting of a talking head that faithfully recreates the lip movements of a speaker at the other end of a telephone connection in (close to) real time, based only on the acoustic signal. The system will utilize a phoneme recognizer, outputting a stream of phonemes that should be instantly converted into facial motion. In this scenario there is no way to know what phonemes will arrive in the future, so any model attempting anticipatory coarticulation will fail. To allow for this kind of application, we need control models that can do a reasonable job given a very limited look-ahead window (typically less than 50 ms). In the present study, one such low-latency model is implemented and compared against the other models (referred to as the ANN model; see Beskow et al., 2004, for details on this experiment).

4.1.2. Evaluating the models

Here we restrict our discussion to how the models objectively and subjectively account for the coarticulation found in our database. In Table 1 the RMS error values can be seen for the four conditions. Pairwise comparisons using LSD (least significant difference) show that the Cohen–Massaro model performs significantly better than the other models, having the lowest average RMSE as well as the highest correlation coefficient. Furthermore, the symmetrical ANN model showed a significantly better correlation value than the
Öhman and low-latency ANN models. No other significant differences were found between models.

Table 1
RMS error and correlation between target and estimated trajectories, averaged over the 87 test sentences

Control model            RMSE (%)    Correlation
Cohen–Massaro            8.63        0.689
Öhman                    9.06        0.639
ANN 1                    9.08        0.666
ANN 2 (low latency)      9.14        0.632

4.1.3. Perceptual evaluation

While the objective measures are informative about how well the different control models predict the parameter trajectories, it is not obvious how they relate to the perceived quality of the resulting animations. In order to obtain a rating of this, a perceptual evaluation was carried out. A sentence intelligibility test was conducted with 25 normal-hearing native Swedish subjects. Each of the four control models was used to synthesize animations with the animated talking head for a corpus of 90 phonetically labeled sentences, not part of the training or test corpora, spoken by a male talker. In addition to the four data-driven control models, a rule-based control model (Beskow, 1995) and an audio-alone condition were included in the evaluation, yielding six presentation conditions. The frame rate of the animation was 30 Hz. The acoustic signal was processed using a noise-excited vocoder (Shannon et al., 1995) with three frequency bands in the range 100–5000 Hz. This form of audio degradation has been used in previous intelligibility studies (Siciliano et al., 2003) and has the advantage over additive noise of being robust to intensity perturbations in the speech signal. Results were scored by counting the percentage of correctly identified keywords, where three words in each sentence had been defined as keywords. The average proportion of correct keywords for each condition is given in Table 2. Pairwise comparisons using LSD (least significant difference) indicate that all face conditions give significantly higher intelligibility than the audio-alone condition, with p < 0.05. Furthermore, the rule-based control model provides higher intelligibility than the data-driven ones, but no significant difference could be found between the four data-driven models at that significance level.
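The audio degradation used in the test can be approximated with the following noise-excited vocoder sketch: the signal is split into a few bands, each band's amplitude envelope modulates band-limited noise, and the bands are summed. The band edges, filter orders and the Hilbert-envelope shortcut are illustrative choices, not the settings used in the evaluation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocoder(x, fs, edges=(100, 600, 1800, 5000)):
    """Minimal three-band noise-excited vocoder in the spirit of
    Shannon et al. (1995). `edges` gives illustrative band boundaries."""
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))        # amplitude envelope (not low-passed here)
        carrier = sosfiltfilt(sos, np.random.randn(len(x)))  # band-limited noise
        out += env * carrier
    # Rough level matching against the input signal.
    return out * (np.sqrt(np.mean(x ** 2)) / (np.sqrt(np.mean(out ** 2)) + 1e-12))
```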
Table 2
Average intelligibility scores for the 25 subjects for each condition

Audio-visual condition    Keywords correct (%)
Audio only                62.7
Cohen–Massaro             74.8
Öhman                     75.3
ANN 1                     72.1
ANN 2 (low latency)       72.8
Rule-based                81.1

4.1.4. Discussion

The aim of this study was to find out whether there was reason to claim that any one of the data-driven control models is in some way superior to the others. The results of the objective
evaluation show the Cohen–Massaro model to produce trajectories that best match the targets, but this advantage did not manifest itself in the intelligibility evaluation. Indeed, since all data-driven models performed equally well in the perceptual evaluation, we are free to choose a model based on other criteria. As stated before, real-time considerations can be one such criterion, which makes the low-latency ANN model a strong candidate. One question that demands an answer is why the data-driven models fall short of the rule-based model in the perceptual evaluation. This is not as surprising as one might first think. The rule-based model was developed with clear articulation and high intelligibility as the primary goal, and as such it almost tends to hyper-articulate. The data-driven models, on the other hand, are trained to mimic the speaking style of the target speaker, who could be characterized as having a rather relaxed pronunciation. It should also be noted that the speaker was not selected on the basis of maximal visual intelligibility; hence the data-driven models cannot be expected to provide optimal intelligibility. It is likely that re-training the models on a corpus with a highly intelligible speaker would improve this aspect, but that is a matter for further investigation.
5. Conclusion

We have in this paper presented a new approach to building formant-synthesis systems based on both rule-generated and database-driven methods. A pilot experiment was reported, showing that this approach can be a very interesting path to explore further. Despite a very simple implementation, the preliminary results showed an advantage in naturalness compared to the traditional reference system. Work is currently underway to create a generic platform to continue this research on formant synthesis methods, based on both rules and unit concatenation.

While improving the intelligibility of a rule-based articulation scheme for generating synthetic visual speech can be a very laborious process, a data-driven model can be automatically trained towards more intelligible articulation, or towards any particular style of speaking, which is one of the main attractions of employing data-driven techniques. We are presently expanding our experiments with data-driven techniques for training the visual synthesis system towards expressive speech, using the same kind of motion tracking data.
Acknowledgments

As evident from the list of references, many persons at KTH have contributed to the work presented. For the visual speech synthesis, Jonas Beskow and Olov Engwall especially deserve mention. Kåre Sjölander and the following undergraduate students contributed to building the combined synthesis system: Tor Sigvardson, Arvid Sjölander, Romain Vinet and David Öhlin. This research was carried out at the Centre for Speech Technology, a competence centre at KTH, supported by VINNOVA (The Swedish Agency for Innovation Systems), KTH and participating Swedish companies and organizations.
References

Acero, A., 1999. Formant analysis and synthesis using hidden Markov models. In: Proc. Eurospeech '99, pp. 1047–1050.
Allen, J., Hunnicut, M.S., Klatt, D., 1987. From Text to Speech: The MITalk System. Cambridge University Press, Cambridge, MA.
Bailly, G., Badin, P., 2002. Seeing tongue movements from outside. In: Proc. ICSLP 2002, pp. 1913–1916.
Beskow, J., 1995. Rule-based visual speech synthesis. In: Proc. 4th European Conf. on Speech Communication and Technology (Eurospeech '95), Madrid, Spain, pp. 299–302.
Beskow, J., 1997. Animation of talking agents. In: Proc. Internat. Conf. on Auditory-Visual Speech Processing (AVSP '97), Rhodos, Greece, pp. 149–152.
Beskow, J., 2003. Talking heads—models and applications for multimodal speech synthesis. Doctoral thesis, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.
Beskow, J., 2004. Trainable articulatory control models for visual speech synthesis. J. Speech Technol. 7 (4), 335–349.
Beskow, J., Engwall, O., Granström, B., 2003. Resynthesis of facial and intraoral articulation from simultaneous measurements. In: Proc. ICPhS 2003, Barcelona, Spain.
Beskow, J., Karlsson, I., Kewley, J., Salvi, G., 2004. SYNFACE—a talking head telephone for the hearing-impaired. In: Miesenberger, K., Klaus, J., Zagler, W., Burger, D. (Eds.), Computers Helping People with Special Needs, pp. 1178–1186.
Branderud, P., 1985. Movetrack—a movement tracking system. In: Proc. French–Swedish Symposium on Speech, Grenoble, France, pp. 113–122.
Bregler, C., Covell, M., Slaney, M., 1997. Video rewrite: Driving visual speech with audio. In: Proc. ACM SIGGRAPH '97, pp. 353–360.
Brooke, N.M., Scott, D.S., 1998. Two- and three-dimensional audio-visual speech synthesis. In: Proc. Internat. Conf. on Auditory-Visual Speech Processing (AVSP '98), Terrigal, Australia, pp. 213–218.
Carlson, R., Granström, B., 1976. A text-to-speech system based entirely on rules. In: Proc. ICASSP-76.
Carlson, R., Granström, B., Hunnicutt, S., 1982. A multi-language text-to-speech module. In: Proc. 7th Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '82), Paris, France, Vol. 3, pp. 1604–1607.
Carlson, R., Granström, B., Karlsson, I., 1991. Experiments with voice modelling in speech synthesis. Speech Comm. 10, 481–489.
Carlson, R., Granström, B., Nord, L., 1992. Experiments with emotive speech—acted utterances and synthesized replicas. In: Internat. Conf. on Spoken Language Processing, Banff, Canada, pp. 671–674.
Carlson, R., Sigvardson, T., Sjölander, A., 2002. Data-driven formant synthesis. In: Fonetik 2002.
Charpentier, F., Moulines, E., 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Comm. 9 (5/6), 435–467.
Charpentier, F., Stella, M., 1986. Diphone synthesis using an overlap-add technique for speech waveforms concatenation. In: Proc. ICASSP 86, Vol. 3, pp. 2015–2018.
Cohen, M.M., Massaro, D.W., 1993. Modelling coarticulation in synthetic visual speech. In: Magnenat Thalmann, N., Thalmann, D. (Eds.), Models and Techniques in Computer Animation. Springer Verlag, Tokyo, pp. 139–156.
Dixon, N.R., Maxey, H.D., 1968. Terminal analog synthesis of continuous speech using the diphone method of segment assembly. IEEE Trans. Audio Electroacoust. AU-16, 40–50.
Engwall, O., 2002a. Tongue talking—studies in intraoral speech synthesis. PhD thesis, KTH, Sweden.
Engwall, O., 2002b. Evaluation of a system for concatenative articulatory visual speech synthesis. In: Proc. ICSLP 2002.
Engwall, O., 2004. From real-time MRI to 3D tongue movements. In: Proc. ICSLP 2004.
Ezzat, T., Geiger, G., Poggio, T., 2002. Trainable videorealistic speech animation. In: Proc. ACM SIGGRAPH 2002, San Antonio, TX, pp. 388–398.
Hällgren, Å., Lyberg, B., 1998. Visual speech synthesis with concatenative speech. In: Proc. Internat. Conf. on Auditory-Visual Speech Processing (AVSP '98), Terrigal, Australia, pp. 181–183.
Hertz, S., 2002. Integration of rule-based formant synthesis and waveform concatenation: A hybrid approach to text-to-speech synthesis. In: Proc. IEEE 2002 Workshop on Speech Synthesis, Santa Monica, USA, 11–13 September 2002.
Högberg, J., 1997. Data driven formant synthesis. In: Proc. Eurospeech 97.
Jiang, J., Alwan, A., Bernstein, L., Keating, P., Auer, E., 2000. On the correlation between facial movements, tongue movements and speech acoustics. In: Proc. ICSLP 2000, Vol. 1, pp. 42–45.
Klatt, D., 1982. The Klattalk text-to-speech conversion system. In: Proc. ICASSP 82, pp. 1589–1592.
Klatt, D., 1987. Review of text-to-speech conversion for English. J. Acoust. Soc. Amer. 82 (3), 737–793.
Le Goff, B., 1997. Automatic modeling of coarticulation in text-to-visual speech synthesis. In: Proc. 5th European Conf. on Speech Communication and Technology (Eurospeech '97), Rhodos, Greece, pp. 1667–1670.
Lee, M., van Santen, J., Möbius, B., Olive, J., 1999. Formant tracking using segmental phonemic information. In: Proc. Eurospeech '99, Vol. 6, pp. 2789–2792.
Löfqvist, A., 1990. Speech as audible gestures. In: Hardcastle, W.J., Marchal, A. (Eds.), Speech Production and Speech Modelling. Kluwer Academic Publishers, Dordrecht, pp. 289–322.
MacLeod, A., Summerfield, Q., 1990. A procedure for measuring auditory and audio-visual speech-reception thresholds for sentences in noise: Rationale, evaluation, and recommendations for use. Br. J. Audiol. 24, 29–43.
Mannell, R.H., 1998. Formant diphone parameter extraction utilising a labeled single speaker database. In: Proc. ICSLP 98.
Massaro, D.W., Cohen, M.M., Tabain, M., Beskow, J., Clark, R., 2005. Animated speech: Research progress and applications. In: Vatikiotis-Bateson, E., Bailly, G., Perrier, P. (Eds.), Audiovisual Speech Processing. MIT Press.
Mori, H., Ohtsuka, T., Kasuya, H., 2002. A data-driven approach to source-formant type text-to-speech system. In: Proc. ICSLP 2002, pp. 2365–2368.
Ogden, R., Hawkins, S., House, J., Huckvale, M., Local, J., Carter, P., Dankovicova, J., Heid, S., 2000. ProSynth: An integrated prosodic approach to device-independent, natural-sounding speech synthesis. Comput. Speech Language 14, 177–210.
Öhlin, D., 2004. Formant extraction for data-driven formant synthesis. Master thesis, TMH, KTH, Stockholm (in Swedish).
Öhlin, D., Carlson, R., 2004. Data-driven formant synthesis. In: Proc. Fonetik, pp. 160–163.
Öhman, T., 1998. An audio-visual speech database and automatic measurements of visual speech. In: KTH TMH-QPSR, Vols. 1–2, pp. 61–76.
Olive, J.P., 1977. Rule synthesis of speech from diadic units. In: Proc. ICASSP-77, pp. 568–570.
Parke, F.I., 1982. Parameterized models for facial animation. IEEE Comput. Graphics 2 (9), 61–68.
Pelachaud, C., 2002. Visual text-to-speech. In: Pandzic, I., Forchheimer, R. (Eds.), MPEG-4 Facial Animation—The Standard, Implementation and Applications. John Wiley & Sons, pp. 125–140.
Pelachaud, C., Badler, N.I., Steedman, M., 1996. Generating facial expressions for speech. Cognitive Sci. 20 (1), 1–46.
Peterson, G., Wang, W., Sivertsen, E., 1958. Segmentation techniques in speech synthesis. J. Acoust. Soc. Amer. 32, 639–703.
Reveret, L., Bailly, G., Badin, P., 2000. Mother: A new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. In: Proc. 6th Internat. Conf. on Spoken Language Processing (ICSLP 2000), Beijing, China, pp. 755–758.
Shannon, R.V., Zeng, F.-G., Kamath, V., Wygonski, J., Ekelid, M., 1995. Speech recognition with primarily temporal cues. Science 270, 303–304.
Siciliano, C., Williams, G., Beskow, J., Faulkner, A., 2003. Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. In: Proc. 15th Internat. Congress of Phonetic Sciences, Barcelona, Spain.
Sigvardson, T., 2002. Data-driven methods for parameter synthesis—description of a system and experiments with CART analysis. Master thesis, TMH, KTH, Stockholm, Sweden (in Swedish).
Sjölander, A., 2001. Data-driven formant synthesis. Master thesis, TMH, KTH, Stockholm (in Swedish).
Sjölander, K., 2003. An HMM-based system for automatic segmentation and alignment of speech. In: Proc. Fonetik 2003, Umeå Universitet, Umeå, Sweden, pp. 93–96.
Stevens, K.N., Bickley, C.A., 1991. Constraints among parameters simplify control of Klatt formant synthesizer. J. Phonetics 19, 161–174.
Talkin, D., 1989. Looking at speech. Speech Technol. 4, 74–77.
Vinet, R., 2004. Enhancing rule-based synthesizer using concatenative synthesis. Master thesis, TMH, KTH, Stockholm, Sweden.
Yehia, H., Rubin, P., Vatikiotis-Bateson, E., 1998. Quantitative association of vocal-tract and facial behaviour. Speech Comm. 26, 23–43.