Speech Communication 47 (2005) 182–193 www.elsevier.com/locate/specom
Data-driven multimodal synthesis

Rolf Carlson *, Björn Granström

CTT, Department of Speech, Music and Hearing, KTH, Lindstedsvägen 24, 5th Floor, SE-100 44 Stockholm, Sweden

Received 24 November 2004; received in revised form 2 February 2005; accepted 7 February 2005

* Corresponding author. Tel.: +46 8 790 75 68; fax: +46 8 790 78 54. E-mail address: [email protected] (R. Carlson).

doi:10.1016/j.specom.2005.02.015
Abstract

This paper reports on current efforts at the Department of Speech, Music and Hearing, KTH, on data-driven multimodal synthesis, including both visual speech synthesis and acoustic modeling. In this research we try to combine corpus-based methods with knowledge-based models and to exploit the best of the two approaches. The paper presents an attempt to build formant-synthesis systems based on both rule-generated and database-driven methods. A pilot experiment is also reported, showing that this approach can be a very interesting path to explore further. Two studies on visual speech synthesis are reported: one on data acquisition using a combination of motion capture techniques, and one concerned with coarticulation, comparing different models.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Speech synthesis; Multimodal synthesis; Data-driven synthesis
1. Introduction

Current speech synthesis efforts, both in research and in applications, are dominated by methods based on concatenation of spoken units. In some cases the original waveform is simply used as is, but often it is processed to some degree before use. Research on speech synthesis is to a large extent focused on how to model efficient unit selection and unit concatenation, and on how optimal databases should be created. The traditional research efforts on formant synthesis and articulatory synthesis have been significantly reduced due to the success of waveform-based methods. A new, very active field of speech research is multimodal synthesis, which again points to the need to understand speech articulation, but from a broader perspective. This paper is a report on current efforts at the Department of Speech, Music and Hearing, KTH, on data-driven multimodal synthesis, including both visual speech synthesis and acoustic modeling. In our research we try to combine corpus-based methods with knowledge-based
models and to explore the best of the two approaches.

1.1. A historical perspective on concatenative acoustic speech synthesis

The review by Klatt (1987) covers some of the early efforts on concatenative synthesis. As early as 1958, Peterson et al. suggested that unit concatenation might be a possible solution for speech synthesis. Dixon and Maxey (1968) made a special effort to create a unit library for diphone synthesis. Early synthesis research at AT&T based on "diadic units" (Olive, 1977) demonstrated an alternative to rule-based formant synthesis (Carlson and Granström, 1976; Carlson et al., 1982; Klatt, 1982). Charpentier and Stella (1986) opened a new path toward speech synthesis based on waveform concatenation by introducing the PSOLA model for manipulating pre-recorded waveforms. Current synthesis methods, using unit selection from large or optimized corpora rather than a fixed unit inventory, try to reduce the number of units in each utterance, thereby resolving context dependencies over a longer time frame. In some cases (e.g., Acero, 1999), formant tracking is used to optimize the selection of units. Recent workshops on speech synthesis show the dominance of techniques based on different kinds of waveform coding.

1.2. Visual speech synthesis

Visual speech synthesis can be accomplished either through manipulation of video images (Bregler et al., 1997; Brooke and Scott, 1998; Ezzat et al., 2002) or with two- or three-dimensional models of the human face and/or speech organs that are under the control of a set of deformation parameters (Beskow, 1995, 1997; Cohen and Massaro, 1993; Pelachaud et al., 1996; Pelachaud, 2002; Reveret et al., 2000). The geometrical talking head model we use in our work is a parametrically controlled deformable polygon surface. Articulatory deformations are implemented as weighted geometrical transforms (translations, rotations and scalings) that
are applied to the vertices of the polygon mesh, according to principles first introduced by Parke (1982). In the study reported in this paper, data-driven articulation is controlled by ten parameters: jaw rotation, lip rounding, upper lip retraction, upper lip raise, lower lip retraction, lower lip depression, left mouth corner stretch, right mouth corner stretch, jaw thrust and jaw shift (sideways).

1.3. Is formant synthesis still of interest?

Affective computing is currently a very active research topic in the speech communication field. The need to synthesize different voices and voice characteristics and to model emotive speech is a strong motivation to keep research on formant synthesis active (Carlson et al., 1991, 1992). The driving force is that rule-based formant synthesis has the necessary flexibility to model both linguistic and extra-linguistic processes. It is, for example, evident that large databases for corpus-based approaches cannot be recorded consistently in a selection of desired speaking styles. However, the flexibility of formant synthesis can also be a problem, since, for example, articulatory constraints are not directly included in the formant-based model. The underlying articulatory gestures are not easily transformed into the acoustic domain described by the formant model. Successful efforts to go "halfway", using higher-level, articulatory-based parameters, have been reported by Stevens and Bickley (1991) and Ogden et al. (2000). Alternative approaches that reduce the need for detailed formant synthesis rules, but still keep the flexibility of the formant model, extract formant synthesis parameters directly from a labeled corpus. Mannell (1998) reported a promising effort to create a diphone library for formant synthesis. The procedure included a speaker-specific extraction of formant frequencies from a labeled database. In a sequence of papers from Utsunomiya University, Japan, automatic formant tracking has been used to generate high-quality speech synthesis using a formant synthesizer and an elaborate voice source (e.g., Mori et al., 2002). Recently we have seen a renewed commercial interest in speech synthesis using the formant
model (e.g., Aurix TTS from 20/20 Speech). One motivation is the need to generate speech using a very small "footprint". Thus, one can predict that formant synthesis will again be an important research subject, both because of its flexibility and because the formant synthesis approach can be compressed into a limited application environment.
2. A combined approach for acoustic speech synthesis
Research efforts to combine data-driven and rule-based methods in the KTH text-to-speech system have been pursued in several projects. In a study by Högberg (1997), formant parameters were extracted from a database and structured with the help of classification and regression trees. The synthesis rules were adjusted according to predictions from the trees. In an evaluation experiment the synthesis was tested and judged to be more natural than the original rule-based synthesis. The approach takes advantage of the fact that a unit library can model detailed gestures better than the general rules can. Sjölander (2001) expanded the method to replace complete formant trajectories with manually extracted values, and also included consonants. According to a feasibility study, this synthesis was perceived as more natural sounding than the rule-only synthesis (Carlson et al., 2002). Sigvardson (2002) developed a generic and complete system for unit selection using regression trees, and applied it to data-driven formant synthesis.

In the current work the rule system and the unit library are more clearly separated compared to our earlier attempts. However, by keeping the rule-based model we also keep the flexibility to make modifications and the possibility to include both linguistic and extra-linguistic knowledge sources. Fig. 1 illustrates the current approach from a technical point of view (Öhlin and Carlson, 2004).

Fig. 1. Rule-based synthesis system using a data-driven unit library.

A database is used to create a unit library. Each unit is described by a selection of extracted synthesis parameters together with linguistic information about the unit's original context and linguistic features such as stress level. The parameters can be extracted automatically and/or edited manually. In our traditional text-to-speech system the synthesizer is controlled by rule-generated parameters from the text-to-parameter module (Carlson et al., 1982). The parameters are represented as time–value pairs, including labels and prosodic features such as duration and intonation. In the current approach some of the rule-generated parameter values are replaced by values from the unit library. The process is controlled by the unit selection module, which takes into account not only parameter information but also linguistic features supplied by the text-to-parameter module. Since only diphone-sized units are included in the database, the unit selection process is rather simple. The parameters are normalized and concatenated before being sent to the synthesizer (Carlson et al., 1991). The concatenation process for each parameter is a simple linear interpolation between the two joining units, starting at the 25% point and ending at the 75% point of the phoneme duration.
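As an illustration of the concatenation step just described, the sketch below cross-fades one parameter track into the next between the 25% and 75% points of the shared phoneme. The function name, the frame layout and the assumption that both diphone units supply values over the whole phoneme are ours, not taken from the actual system.

```python
import numpy as np

def concatenate_parameter(left_track, right_track):
    """Cross-fade one synthesis parameter across a phoneme shared by two
    adjoining diphone units: the left unit dominates up to the 25% point,
    the right unit from the 75% point, with a linear blend in between.
    Illustrative sketch only."""
    n = len(left_track)                       # frames covering the phoneme
    t = np.arange(n) / float(n - 1)           # normalized time 0..1
    w = np.clip((t - 0.25) / 0.5, 0.0, 1.0)   # 0 before 25%, 1 after 75%
    return (1.0 - w) * left_track + w * right_track

# Hypothetical use for one formant track (values in Hz) over one phoneme:
f2_left = np.linspace(1500.0, 1650.0, 12)     # from the unit ending in this phoneme
f2_right = np.linspace(1580.0, 1700.0, 12)    # from the unit starting in this phoneme
f2_joined = concatenate_parameter(f2_left, f2_right)
```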
2.1. Formant extraction for the unit library

When creating a unit library of formant frequencies, automatic methods of formant extraction are of course preferred, due to the amount of data that has to be processed. However, available methods do not always perform adequately. With this in mind, an improved formant extraction algorithm, using segmentation information to lower the error rate, was developed (Öhlin, 2004). It is akin to the algorithms described in (Lee et al., 1999; Talkin, 1989; Acero, 1999). Segmentation and alignment of the waveform were first performed automatically with nAlign (Sjölander, 2003). Manual correction was required, especially at vowel–vowel transitions. The waveform is divided into overlapping time frames of 10 ms. At each frame, an LPC model of order 30 is created; the poles are then searched with the Viterbi algorithm in order to find the path (i.e., the formant trajectory) with the lowest cost. The cost is defined as the weighted sum of a number of partial costs: the bandwidth cost, the frequency deviation cost and the frequency change cost. The bandwidth cost is equal to the bandwidth in Hertz. The frequency deviation cost is defined as the square of the distance to a given reference frequency, which is formant, speaker and phoneme dependent. This requires labeling of the input before the formant tracking is carried out. Finally, the frequency change cost penalizes rapid changes in formant frequencies, to make sure that the extracted trajectories are smooth. Although only the first four formants are used in the unit library, five formants are extracted; the fifth formant is then discarded. The justification for this is to ensure reasonable values for the fourth formant. Furthermore, one sub-F1 pole is extracted to avoid confusion with F1. The algorithm also uses eight-times oversampling before averaging, which reduces the variance of the estimated formant frequencies. After the extraction, the data is downsampled to 100 Hz. Special tools were constructed to automatically detect errors in the extracted data, for instance by comparing the data with formant data generated by rules, or by detecting rapid frequency changes. This information was used for manual corrections.
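The cost structure described above can be illustrated with the following single-formant sketch. The real tracker operates on the poles of the order-30 LPC analysis and searches all formants (plus the sub-F1 pole) jointly; the weights, variable names and the restriction to one formant here are illustrative assumptions.

```python
import numpy as np

def track_formant(cands, ref_freq, w_bandwidth=1.0, w_deviation=1e-4, w_change=0.01):
    """Dynamic-programming search for one formant. `cands` holds one
    (freqs, bandwidths) pair of arrays per 10 ms frame (candidate pole
    frequencies and bandwidths in Hz); `ref_freq` is the speaker- and
    phoneme-dependent reference frequency per frame."""
    n = len(cands)
    # Local cost: bandwidth plus squared deviation from the reference.
    local = [w_bandwidth * bw + w_deviation * (f - ref_freq[i]) ** 2
             for i, (f, bw) in enumerate(cands)]
    best = [local[0]]
    back = [np.zeros(len(cands[0][0]), dtype=int)]
    for i in range(1, n):
        f_prev, _ = cands[i - 1]
        f_cur, _ = cands[i]
        # Transition cost penalizes rapid frequency changes between frames.
        trans = w_change * (f_cur[:, None] - f_prev[None, :]) ** 2
        total = trans + best[i - 1][None, :]
        back.append(np.argmin(total, axis=1))
        best.append(local[i] + np.min(total, axis=1))
    # Backtrack the lowest-cost path and return its frequencies.
    path = [int(np.argmin(best[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    path.reverse()
    return np.array([cands[i][0][j] for i, j in enumerate(path)])
```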
In order to evaluate the unit library based on a reference speaker, a test database of 300 utterances spoken by the same speaker was selected. The utterances were re-synthesized using either the default rule-based synthesis or the new system. Formant trajectories in the test material were automatically calculated and manually corrected. The Euclidean distance between the measured formant frequencies and the synthesized material was calculated. Fig. 2 shows how the new system has been adapted to the reference speaker's vowel space.

Fig. 2. Adaptation to the reference speaker's vowel space, using the unit library from the reference speaker and evaluated on a new database (300 utterances from the same speaker) with manually corrected formant frequencies. (The figure compares the Euclidean distance over F1–F4 for rule-driven and data-driven formants, shown separately for stressed and unstressed vowels.)

2.2. Expansion of the model to include waveform concatenation of voiceless segments

Although reasonably high in intelligibility and clarity, the voiceless fricatives and plosives generated by the formant synthesizer lack naturalness. Therefore, the combined synthesis system has been expanded to include waveform-based synthesis. The unvoiced segments are taken from the database of recorded diphones used for formant extraction. Each unvoiced portion is synthesized using the corresponding waveform, time-scaled using TD-PSOLA (Charpentier and Moulines, 1990) and concatenated with the formant-synthesized output waveform. Since the affected segments are voiceless, no pitch needs to be considered when scaling and concatenating, which makes the task significantly easier. The phoneme /h/, however, is often voiced, and is not included in the concatenative synthesis. For the unvoiced plosives, only the release phase is included. Since it is just a short-term signal, it is not scaled to fit the synthesized speech.
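Because the replaced segments are voiceless, the time-scaling can be sketched with a plain overlap-add procedure; the fragment below is a stand-in for the TD-PSOLA step and not the actual implementation (frame size and overlap are arbitrary choices).

```python
import numpy as np

def ola_stretch(x, factor, frame=256):
    """Overlap-add time-scaling of a voiceless segment. Since there is no
    pitch to preserve, windowed frames are simply repositioned; suitable
    only for moderate scaling factors. Illustrative sketch."""
    hop_in = frame // 2                       # 50% input overlap
    hop_out = int(round(hop_in * factor))     # frame spacing in the output
    win = np.hanning(frame)
    n = max(1, 1 + (len(x) - frame) // hop_in)
    y = np.zeros(hop_out * (n - 1) + frame)
    w = np.zeros_like(y)
    for k in range(n):
        seg = x[k * hop_in:k * hop_in + frame]
        seg = np.pad(seg, (0, frame - len(seg)))
        y[k * hop_out:k * hop_out + frame] += win * seg
        w[k * hop_out:k * hop_out + frame] += win
    return y / np.maximum(w, 1e-6)

# The time-scaled natural fricative would then be written over the
# corresponding voiceless stretch of the formant-synthesized waveform,
# with a short cross-fade at the joins to avoid clicks.
```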
A listening test was carried out, which showed that this method, by itself, led to an increase in perceived naturalness, but also that the increase is even greater when combined with data-driven formant synthesis (Vinet, 2004). The work by Hertz (2002) also suggests that hybrid methods of this type are of great interest for further research. The author describes a promising synthesis approach in which formant synthesis is used to create waveform units to be combined with natural speech units in a concatenative synthesis system.

2.3. Evaluation

In addition to the evaluation referred to above, a preliminary listening test was carried out to evaluate the combined synthesis in its current stage. Twelve test subjects were asked to listen to 10 different sentences, all of which had been synthesized in two different ways: by rule and by the new combined system. For each such pair, the test subjects were asked to pick the one that sounded most natural. The data-driven synthesis was considered more natural 73% of the time. Averaged per sentence, no sentence was perceived as more natural when synthesized with rules only than with the new synthesis. The conclusion from this and the earlier listening tests is that the data-driven formant synthesis is indeed perceived as more natural sounding than the rule-driven synthesis. However, it is also apparent from the synthesis quality that a lot of work still needs to be put into the automatic building of a formant unit library.
3. Visual speech synthesis—how to obtain data?

In acoustic modeling the primary data is the speech signal. In our work on three-dimensional models for articulatory and visual speech synthesis at KTH, we have exploited several kinds of data sources. For the externally visible articulators and the facial surface, we have used optical motion tracking. For modeling of the tongue and the internal vocal tract, three-dimensional data from magnetic resonance imaging (MRI) and kinematic data from electropalatography (EPG) and electromagnetic articulography (EMA) have been used (Engwall, 2002a, 2004). While each of these methods in isolation can provide useful information, none yields complete 3D data with good temporal resolution, and they hence need to be combined. We have investigated simultaneous measurements of vocal tract and facial motion using EMA and optical motion tracking. The data is used to improve and extend the articulation of an animated talking head (Fig. 3).

Fig. 3. The talking head is modeled as a 3D mesh that can be parametrically deformed. Ten parameters control the articulation.

3.1. Previous studies

The present study differs from previous related studies in that it uses simultaneous recordings of a large set of sentences. Yehia et al. (1998) used non-simultaneous recordings with Optotrack and EMA of two English and six Japanese sentences to derive a quantitative association between the two data sets. Jiang et al. (2000) collected the data simultaneously, using Qualisys, an optical motion tracking system (http://www.qualisys.se), and EMA, but for CV syllables only and using 17 Qualisys markers. Bailly and Badin (2002) studied the correlation between facial and tongue movements using an articulatory model based on video and cineradiographic recordings. All three studies concluded that the face supplies information on the articulation of the speech organs, but Bailly and Badin (2002) warned that the information is insufficient to recover the lingual constriction. The most important difference
is that none of these studies aims directly at applying the results to articulatory speech synthesis of the face, jaw and the entire tongue.

3.2. Models and data

3.2.1. Face and tongue models

The face and tongue models (Beskow, 1997, 2003) are based on concepts first introduced by Parke (1982), defining a set of parameters that deform a static 3D wireframe mesh by applying weighted transformations to its vertices. The parameters for the face are jaw opening, jaw shift, jaw thrust, lip rounding, upper lip raise, lower lip depression, upper lip retraction and lower lip retraction. The 3D tongue model is based on a three-dimensional MRI database of one reference subject of Swedish (Engwall, 2002b). The corpus consisted of 13 Swedish vowels and 10 consonants in three symmetric VCV contexts. As the acquisition time of 43 s required the subject to artificially sustain the articulations, this database cannot be used to model speech dynamics. For this purpose a new database, with EPG and EMA data, has been collected and used to adjust the articulations to normal and to obtain information on articulatory dynamics. The tongue parameters include dorsum raise, body raise, tip raise and tip advance.
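A minimal sketch of the Parke-style deformation principle used by the face and tongue models is given below: each articulatory parameter moves the vertices toward a transformed position in proportion to a per-vertex weight, so that the parameter only affects its region of influence. The shapes, weights and the jaw-opening example are illustrative assumptions, not the actual model definition.

```python
import numpy as np

def apply_parameter(vertices, weights, rotation, translation, pivot):
    """Weighted geometric transform applied to the mesh vertices: every
    vertex is blended between its rest position and its rotated/translated
    position according to a per-vertex weight in [0, 1]."""
    moved = (vertices - pivot) @ rotation.T + pivot + translation
    w = weights[:, None]
    return (1.0 - w) * vertices + w * moved

# Hypothetical jaw-opening example: a small rotation about a pivot near the
# jaw joint, weighted so that only the lower part of the face follows it.
a = np.radians(8.0)
rot_x = np.array([[1, 0, 0],
                  [0, np.cos(a), -np.sin(a)],
                  [0, np.sin(a),  np.cos(a)]])
verts = np.random.rand(500, 3)                            # stand-in for the mesh
jaw_weights = np.clip(1.0 - 2.0 * verts[:, 1], 0.0, 1.0)  # stronger lower on the face
pivot = np.array([0.0, 0.4, -0.1])
opened = apply_parameter(verts, jaw_weights, rot_x, np.zeros(3), pivot)
```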
3.3. Measurement setup

The EMA data is collected with the Movetrack system (Branderud, 1985), using two transmitters on a light-weight head mount and six receiver coils (1.5 × 4 mm) positioned in the midsagittal plane as depicted in Fig. 4: three coils on the tongue (around 8 mm, 20 mm and 52 mm from the tip of the tongue) and two coils above and below the upper and lower incisors, respectively. One coil was placed on the upper lip for co-registration with the optical system.

Fig. 4. Marker placement for Movetrack (right) and Qualisys (left) measurements.

The optical motion tracking is done using a Qualisys system (http://www.qualisys.se) with four cameras. The system tracks 28 small reflectors (4 mm diameter) glued to the subject's jaw, cheeks, lips, nose and eyebrows and on a plate attached to the Movetrack head mount (to serve as reference for head movements), and calculates their 3D coordinates at a rate of 60 frames per second. The EMA coils on the upper lip and the jaw were equipped with a reflector (the latter during a special alignment recording) to allow for spatial alignment between the two data sets. The data was collected in sets of 1 min each, with a break between sets. In the analysis below, silent pauses between the speech sequences were removed.

3.4. Subject and corpora

The subject was a female native speaker of Swedish, who has received high intelligibility ratings in audio-visual tests.

3.4.1. Sentence corpus

The 270 Swedish everyday sentences, listed in (Öhman, 1998), have been developed specially for audio-visual speech perception tests by Öhngren, based on MacLeod and Summerfield (1990). The sentences are independent of each other and generally seven to nine syllables long (4–5 words).

3.4.2. Nonsense VCV word corpus

The 138 VCV and VCC{C}V words consisted of the consonants [p, t, k, ˇ, b, d, g, Î, m, n, ¯, N, l, Æ, f, s, , c¸, j, r, v, h] and the consonant clusters [jk, rk, pl, bl, kl, gl, pr, br, kr, gr, kt, nt, tr, dr, st, sp, str, spr, sk, fl, fr, sl, skl, skr] in symmetric vowel contexts V = [a, I, f].
3.4.3. Nonsense CVC word corpus

The corpus consisted of 41 asymmetric C1VC2 words: firstly, the long vowels V = [u:, o:, 2:, i:, e:, e:, ø:] in C1 = [k], C2 = [p] and C1 = [p], C2 = [k] contexts; secondly, the short vowels V = [f, , a, I, e, Y, ø] in C1 = [k], C2 = [p:] and C1 = [p], C2 = [k:] contexts. The [r] allophones V = [æ:, œ:, æ, œ] were collected with C1 = [k] and C2 = [r].
3.5. Data processing and use of the data

3.5.1. Pre-processing

The Qualisys data, consisting of 3D coordinates for all 28 points, was first normalized with respect to global movement, using the points on the Movetrack frame as reference. The facial model was scaled and the Qualisys data was roto-translated in such a way that an optimal fit between the facial surface and the measured points was achieved. The EMA data was down-sampled to the frame rate of the Qualisys data, 60 Hz, and inserted into the midsagittal plane of the model, where it was roto-translated to align the lip and jaw coils with the corresponding Qualisys markers, forming a coherent data set of extra- and intraoral movement data.

3.5.2. Correlation between the datasets

In an initial study using the data, we investigated the interrelation between the face (Qualisys) and tongue (Movetrack) datasets. Using linear regression, one data set is predicted from the other, and the correlation between the original and the predicted data can be calculated. Face data is arranged in an N-by-75 matrix X, where each row represents a time frame (N = number of frames in the given corpus) and the columns hold the x-, y- and z-coordinates of the 25 points, excluding the reference points on the Movetrack head mount. A similar N-by-2K matrix Y is constructed for the EMA data, containing the x- and y-coordinates of the K Movetrack coils. The analysis was carried out with K = 3 coils (tongue only), 4 coils (tongue and jaw) and 5 coils (tongue, jaw and upper lip). It was found that for 3 and 4 coils (tongue, tongue + jaw), prediction of EMA data from the face is better than prediction of the face from EMA, but with 5 coils (tongue + jaw + lip) the face is better recovered from the EMA data than the opposite. For a detailed account of this study, see Beskow et al. (2003).
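The cross-prediction analysis can be sketched as follows: fit a linear (affine) mapping from one data set to the other by least squares and report the mean correlation between measured and predicted channels. Function and variable names are ours, and the actual analysis in Beskow et al. (2003) may differ in detail.

```python
import numpy as np

def cross_prediction_correlation(X, Y):
    """Predict Y (e.g. EMA coil coordinates) from X (e.g. face marker
    coordinates) with ordinary least squares, and return the mean Pearson
    correlation between measured and predicted channels."""
    # Append a constant column so an affine offset is estimated too.
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    Y_hat = Xa @ W
    r = [np.corrcoef(Y[:, j], Y_hat[:, j])[0, 1] for j in range(Y.shape[1])]
    return float(np.mean(r))

# Hypothetical usage: X is N-by-75 (25 face points, x/y/z),
# Y is N-by-2K (K EMA coils, x/y in the midsagittal plane).
# corr_face_to_ema = cross_prediction_correlation(X, Y)
# corr_ema_to_face = cross_prediction_correlation(Y, X)
```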
4. Coarticulation in data-driven vs. rule-based visual synthesis

In most implementations we can make a distinction between the visual signal model and the control model. The visual signal model is responsible for producing the image, given a vector of control parameter values. The control model is responsible for driving the animation by providing these vectors to the signal model at each point in time, given some symbolic specification of the desired animation. In a general sense, the input to a control model could contain information about speech articulation, emotional state, discourse information, turn-taking, etc. In this section we will restrict ourselves to studying articulatory control models.

Early acoustic text-to-speech systems (Allen et al., 1987; Carlson et al., 1982) employed parametrically controlled models for speech generation. In these systems, control parameter trajectories were generated within a rule-based framework, where coarticulatory effects were modelled using explicit rules. Later, as concatenative speech synthesis techniques grew popular, the need for such rules diminished, since coarticulation was inherently present in the speech units used, for example diphones, demi-syllables or the arbitrarily sized units used in contemporary unit-selection based speech synthesizers.

Rule-based control schemes have been successfully employed for visual speech synthesis. Pelachaud et al. (1996) describe an implementation of the look-ahead model. Phonemes are clustered into visemes that are classified with different deformability ranks, which serve to indicate to what degree a viseme should be influenced by its context. Another rule-based look-ahead model is proposed by Beskow (1995). In this model, each phoneme is assigned a target vector of articulatory control parameters. To allow the targets to be influenced by coarticulation, the target vector
may be under-specified, i.e., some parameter values can be left undefined. If a target is left undefined, the value is inferred from context using interpolation, followed by smoothing of the resulting trajectory.

For visual speech synthesis, approaches based on concatenation of context-dependent units have been less dominant, although they have been used for video-based synthesis (Bregler et al., 1997) as well as in model-based systems (Hällgren and Lyberg, 1998).

The model described by Cohen and Massaro (1993) is based on Löfqvist's (1990) gestural theory of speech production. In this model, each segment is assigned a target vector. Overlapping temporal dominance functions are used to blend the target values over time. The dominance functions take the shape of a pair of negative exponential functions, one rising and one falling. The height of the peak and the rate with which the dominance rises and falls are free parameters that can be adjusted for each phoneme and articulatory control parameter. Because the rise and fall times are context-independent, the Cohen–Massaro model can essentially be regarded as a time-locked model. It should be noted, however, that due to the nature of negative exponential functions, all dominance functions extend to infinity, so in practice the onset of a gesture occurs gradually as the dominance of the gesture rises above the dominance of other segments. The free parameters were empirically determined through hand-tuning and repeated comparisons between synthesis and video recordings of a human speaker.
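The dominance-blending idea can be sketched as follows: each segment contributes its target in proportion to a dominance function that rises before and decays after the segment, and the trajectory is the dominance-weighted average of the targets. The exact functional form and parameter values used by Cohen and Massaro (1993) differ; everything below, including the names, is illustrative.

```python
import numpy as np

def dominance(t, center, peak, rate_on, rate_off):
    """Pair of negative exponentials: rising before the segment centre,
    falling after it (a sketch of the dominance shape)."""
    return np.where(t < center,
                    peak * np.exp(-rate_on * (center - t)),
                    peak * np.exp(-rate_off * (t - center)))

def blend_targets(t, segments):
    """Weighted average of segment targets, with dominance as the weight.
    `segments` is a list of dicts with keys center, target, peak,
    rate_on, rate_off (illustrative names, not from the paper)."""
    num = np.zeros_like(t, dtype=float)
    den = np.zeros_like(t, dtype=float)
    for s in segments:
        d = dominance(t, s["center"], s["peak"], s["rate_on"], s["rate_off"])
        num += d * s["target"]
        den += d
    return num / np.maximum(den, 1e-9)

# Illustrative use for one articulatory parameter (e.g. lip rounding):
t = np.linspace(0.0, 0.6, 61)   # 10 ms frames over a 600 ms stretch
segs = [dict(center=0.10, target=0.2, peak=1.0, rate_on=30.0, rate_off=30.0),
        dict(center=0.30, target=0.9, peak=0.7, rate_on=20.0, rate_off=20.0),
        dict(center=0.50, target=0.1, peak=1.0, rate_on=30.0, rate_off=30.0)]
trajectory = blend_targets(t, segs)
```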
Other investigators have proposed enhancements to the Cohen–Massaro model, as well as data-driven automatic estimation of its parameters. Le Goff (1997) modified the dominance functions to be n-continuous and used trajectories extracted from video recordings of a real speaker uttering V1CV2CV1 words to tune the model for French. The resulting model was perceptually evaluated using an audiovisual phoneme identification task with varying levels of background acoustic noise. Recently, Massaro et al. (2005) used optical motion tracking of 19 points on a real speaker's face to obtain a database of 100 sentences that were used to tune the free parameters of the original Cohen–Massaro model. Training was carried out both for a set of 39 monophones and for 509 context-dependent phones.

Reveret et al. (2000) adopt Öhman's model of coarticulation to drive a French-speaking talking head. Coarticulation coefficients and the temporal functions guiding the blending of consonants and the underlying vowel track were estimated from a corpus derived using video analysis of 24 VCV words.

4.1. Comparing different coarticulation schemes

A number of factors make it difficult to assess the relative merits of the different approaches mentioned above. The corpora on which the models are trained differ in size, type of content (VCV words, sentences, etc.) as well as language. Different control parameter sets are used, obtained using a variety of techniques, and different facial models are used to produce the resulting animations. The purpose of the present study is to seek an answer to the question of whether there is reason to prefer one of these models to another when building a data-driven visual speech synthesis system. The two main classes of models, look-ahead (represented by Öhman's model) and time-locked (represented by Cohen and Massaro's model), are trained on a corpus of phonetically rich sentences recorded with the Qualisys motion capture technique and then evaluated objectively as well as perceptually, on a test set not part of the original training data. In addition, they are compared against two variants of a novel control model based on recurrent time-delayed artificial neural networks, one of which is designed specifically for real-time applications.

In the case of look-ahead models, the amount of look-ahead time needed depends on the timing and identity of the particular segments involved. In practice, many implementations will require access to the full utterance before any trajectories can be computed. This is for example the case with an unconstrained implementation of the Cohen–Massaro model, where the negative exponential dominance functions extend to infinity (Beskow, 2004).
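For contrast with the time-locked blending sketched earlier, a minimal sketch of a look-ahead scheme in the spirit of Beskow (1995) is given below: undefined targets are inferred from context by interpolation and the resulting trajectory is smoothed. The moving-average smoother, frame rate and data layout are assumptions made for the example.

```python
import numpy as np

def lookahead_trajectory(targets, durations, frame_ms=10, smooth=5):
    """One parameter track from per-phoneme targets, where a target of
    None is left undefined and filled in from the defined neighbours."""
    centers, t0 = [], 0.0
    for dur in durations:                      # target placed at phoneme midpoint
        centers.append(t0 + dur / 2.0)
        t0 += dur
    t = np.arange(0.0, t0, frame_ms / 1000.0)

    # Fill undefined targets by interpolating between defined neighbours.
    defined = [i for i, v in enumerate(targets) if v is not None]
    filled = np.interp(centers,
                       [centers[i] for i in defined],
                       [targets[i] for i in defined])

    # Frame-rate trajectory through the (filled) targets, then smoothing.
    traj = np.interp(t, centers, filled)
    kernel = np.ones(smooth) / smooth
    return t, np.convolve(traj, kernel, mode="same")

# Hypothetical example: lip rounding over five phonemes, unspecified for the
# consonants so that the vowel context determines their rounding.
t, y = lookahead_trajectory([None, 0.1, None, 0.9, None],
                            [0.08, 0.12, 0.08, 0.15, 0.08])
```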
4.1.1. Real-time considerations

In certain real-world applications, the models mentioned above are impossible to use without modifications. In the Synface project (Siciliano et al., 2003) the goal is to develop a communication aid for hard-of-hearing people, consisting of a talking head that faithfully recreates the lip movements of a speaker at the other end of a telephone connection in (close to) real time, based only on the acoustic signal. The system will utilize a phoneme recognizer, outputting a stream of phonemes that should be instantly converted into facial motion. In this scenario there is no way to know what phonemes will arrive in the future, so any model attempting anticipatory coarticulation will fail. To allow for this kind of application, we need control models that can do a reasonable job given a very limited look-ahead window (typically less than 50 ms). In the present study, one such low-latency model is implemented and compared against the other models (referred to as the ANN model; see Beskow et al., 2004, for details on this experiment).

4.1.2. Evaluating the models

Here we restrict our discussion to how the models objectively and subjectively account for the coarticulation found in our database. In Table 1 the RMS error values can be seen for the four conditions. Pairwise comparisons using LSD (least significant difference) show that the Cohen–Massaro model performs significantly better than the other models, having the lowest average RMSE as well as the highest correlation coefficient. Furthermore, the symmetrical ANN model showed a significantly better correlation value than the
Öhman and low-latency ANN models. No other significant differences were found between models.

Table 1
RMS error and correlation between target and estimated trajectories, averaged over the 87 test sentences

Control model            RMSE (%)    Correlation
Cohen–Massaro            8.63        0.689
Öhman                    9.06        0.639
ANN 1                    9.08        0.666
ANN 2 (low latency)      9.14        0.632

4.1.3. Perceptual evaluation

While the objective measures are informative about how well the different control models predict the parameter trajectories, it is not obvious how they relate to the perceived quality of the resulting animations. In order to obtain a rating of this, a perceptual evaluation was carried out. A sentence intelligibility test was conducted with 25 normal-hearing native Swedish subjects. Each of the four control models was used to synthesize animations with the animated talking head for a corpus of 90 phonetically labeled sentences, not part of the training or test corpora, spoken by a male talker. In addition to the four data-driven control models, a rule-based control model (Beskow, 1995) and an audio-alone condition were included in the evaluation, yielding six presentation conditions. The frame rate of the animation was 30 Hz. The acoustic signal was processed using a noise-excited vocoder (Shannon et al., 1995) with three frequency bands in the range 100–5000 Hz. This form of audio degradation has been used in previous intelligibility studies (Siciliano et al., 2003) and has the advantage over additive noise of being robust to intensity perturbations in the speech signal. Results were scored by counting the percentage of correctly identified keywords, where three words in each sentence had been defined as keywords. The average proportion of correct keywords for each condition is given in Table 2. Pairwise comparisons using LSD (least significant difference) indicate that all face conditions give significantly higher intelligibility than the audio-alone condition, with p < 0.05. Furthermore, the rule-based control model provides higher intelligibility than the data-driven ones, but no significant difference could be found between the four data-driven models at that significance level.
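The audio degradation used in the test can be approximated with the following noise-excited vocoder sketch: the signal is split into a few bands, each band's amplitude envelope modulates band-limited noise, and the bands are summed. The band edges, filter orders and the Hilbert-envelope shortcut are illustrative choices, not the settings used in the evaluation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocoder(x, fs, edges=(100, 600, 1800, 5000)):
    """Minimal three-band noise-excited vocoder in the spirit of
    Shannon et al. (1995). `edges` gives illustrative band boundaries."""
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))        # amplitude envelope (not low-passed here)
        carrier = sosfiltfilt(sos, np.random.randn(len(x)))  # band-limited noise
        out += env * carrier
    # Rough level matching against the input signal.
    return out * (np.sqrt(np.mean(x ** 2)) / (np.sqrt(np.mean(out ** 2)) + 1e-12))
```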
Table 2
Average intelligibility scores for the 25 subjects for each condition

Audio-visual condition    Keywords correct (%)
Audio only                62.7
Cohen–Massaro             74.8
Öhman                     75.3
ANN 1                     72.1
ANN 2 (low latency)       72.8
Rule-based                81.1

4.1.4. Discussion

The aim of this study was to find out whether there was reason to claim that any one of the data-driven control models is in some way superior to the others. The results of the objective
evaluation show the Cohen–Massaro model to produce trajectories that best match the targets, but this advantage did not manifest itself in the intelligibility evaluation. Indeed, since all data-driven models performed equally well in the perceptual evaluation, we are free to choose a model based on other criteria. As stated before, real-time considerations can be one such criterion, which makes the low-latency ANN model a strong candidate. One question that demands an answer is why the data-driven models fall short of the rule-based model in the perceptual evaluation. This is not as surprising as one might first think. The rule-based model was developed with clear articulation and high intelligibility as the primary goal, and as such it almost tends to hyper-articulate. The data-driven models, on the other hand, are trained to mimic the speaking style of the target speaker, who could be characterized as having a rather relaxed pronunciation. It should also be noted that the speaker was not selected on the basis of maximal visual intelligibility; hence the data-driven models cannot be expected to provide optimal intelligibility. It is likely that re-training the models on a corpus with a highly intelligible speaker would improve this aspect, but that is a matter for further investigation.
5. Conclusion

We have in this paper presented a new approach to building formant-synthesis systems based on both rule-generated and database-driven methods. A pilot experiment was reported, showing that this approach can be a very interesting path to explore further. Despite a very simple implementation, the preliminary results showed an advantage in naturalness compared to the traditional reference system. Work is currently underway to create a generic platform to continue this research on formant synthesis methods, based on both rules and unit concatenation.

While improving the intelligibility of a rule-based articulation scheme for generating synthetic visual speech can be a very laborious process, a data-driven model can be automatically trained towards more intelligible articulation, or towards any particular style of speaking, which is one of the main attractions of employing data-driven techniques. We are presently expanding our experiments with data-driven techniques for training the visual synthesis system towards expressive speech, using the same kind of motion tracking data.
Acknowledgments

As evident from the list of references, many persons at KTH have contributed to the work presented. For the visual speech synthesis, Jonas Beskow and Olov Engwall especially deserve mention. Kåre Sjölander and the following undergraduate students contributed to building the combined synthesis system: Tor Sigvardson, Arvid Sjölander, Romain Vinet and David Öhlin. This research was carried out at the Centre for Speech Technology, a competence centre at KTH, supported by VINNOVA (The Swedish Agency for Innovation Systems), KTH and participating Swedish companies and organizations.
References

Acero, A., 1999. Formant analysis and synthesis using hidden Markov models. In: Proc. Eurospeech '99, pp. 1047–1050.
Allen, J., Hunnicut, M.S., Klatt, D., 1987. From Text to Speech: The MITalk System. Cambridge University Press, Cambridge, MA.
Bailly, G., Badin, P., 2002. Seeing tongue movements from outside. In: Proc. ICSLP 2002, pp. 1913–1916.
Beskow, J., 1995. Rule-based visual speech synthesis. In: Proc. 4th European Conf. on Speech Communication and Technology (Eurospeech '95), Madrid, Spain, pp. 299–302.
Beskow, J., 1997. Animation of talking agents. In: Proc. Internat. Conf. on Auditory-Visual Speech Processing (AVSP '97), Rhodos, Greece, pp. 149–152.
Beskow, J., 2003. Talking heads—models and applications for multimodal speech synthesis. Doctoral thesis, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.
Beskow, J., 2004. Trainable articulatory control models for visual speech synthesis. J. Speech Technol. 7 (4), 335–349.
Beskow, J., Engwall, O., Granström, B., 2003. Resynthesis of facial and intraoral articulation from simultaneous measurements. In: Proc. ICPhS 2003, Barcelona, Spain.
Beskow, J., Karlsson, I., Kewley, J., Salvi, G., 2004. SYNFACE—a talking head telephone for the hearing-impaired. In: Miesenberger, K., Klaus, J., Zagler, W., Burger, D. (Eds.), Computers Helping People with Special Needs, pp. 1178–1186.
Branderud, P., 1985. Movetrack—a movement tracking system. In: Proc. French–Swedish Symposium on Speech, Grenoble, France, pp. 113–122.
Bregler, C., Covell, M., Slaney, M., 1997. Video rewrite: Driving visual speech with audio. In: Proc. ACM SIGGRAPH '97, pp. 353–360.
Brooke, N.M., Scott, D.S., 1998. Two- and three-dimensional audio-visual speech synthesis. In: Proc. Internat. Conf. on Auditory-Visual Speech Processing (AVSP '98), Terrigal, Australia, pp. 213–218.
Carlson, R., Granström, B., 1976. A text-to-speech system based entirely on rules. In: Proc. ICASSP-76.
Carlson, R., Granström, B., Hunnicutt, S., 1982. A multi-language text-to-speech module. In: Proc. 7th Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '82), Paris, France, Vol. 3, pp. 1604–1607.
Carlson, R., Granström, B., Karlsson, I., 1991. Experiments with voice modelling in speech synthesis. Speech Comm. 10, 481–489.
Carlson, R., Granström, B., Nord, L., 1992. Experiments with emotive speech—acted utterances and synthesized replicas. In: Internat. Conf. on Spoken Language Processing, Banff, Canada, pp. 671–674.
Carlson, R., Sigvardson, T., Sjölander, A., 2002. Data-driven formant synthesis. In: Fonetik 2002.
Charpentier, F., Moulines, E., 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Comm. 9 (5/6), 435–467.
Charpentier, F., Stella, M., 1986. Diphone synthesis using an overlap-add technique for speech waveforms concatenation. In: Proc. ICASSP 86, Vol. 3, pp. 2015–2018.
Cohen, M.M., Massaro, D.W., 1993. Modelling coarticulation in synthetic visual speech. In: Magnenat Thalmann, N., Thalmann, D. (Eds.), Models and Techniques in Computer Animation. Springer Verlag, Tokyo, pp. 139–156.
Dixon, N.R., Maxey, H.D., 1968. Terminal analog synthesis of continuous speech using the diphone method of segment assembly. IEEE Trans. Audio Electroacoust. AU-16, 40–50.
Engwall, O., 2002a. Tongue talking—studies in intraoral speech synthesis. PhD thesis, KTH, Sweden.
Engwall, O., 2002b. Evaluation of a system for concatenative articulatory visual speech synthesis. In: Proc. ICSLP 2002.
Engwall, O., 2004. From real-time MRI to 3D tongue movements. In: Proc. ICSLP 2004.
Ezzat, T., Geiger, G., Poggio, T., 2002. Trainable videorealistic speech animation. In: Proc. ACM SIGGRAPH 2002, San Antonio, TX, pp. 388–398.
Hällgren, Å., Lyberg, B., 1998. Visual speech synthesis with concatenative speech. In: Proc. Internat. Conf. on Auditory-Visual Speech Processing (AVSP '98), Terrigal, Australia, pp. 181–183.
Hertz, S., 2002. Integration of rule-based formant synthesis and waveform concatenation: A hybrid approach to text-to-speech synthesis. In: Proc. IEEE 2002 Workshop on Speech Synthesis, Santa Monica, USA, 11–13 September 2002.
Högberg, J., 1997. Data driven formant synthesis. In: Proc. Eurospeech 97.
Jiang, J., Alwan, A., Bernstein, L., Keating, P., Auer, E., 2000. On the correlation between facial movements, tongue movements and speech acoustics. In: Proc. ICSLP 2000, Vol. 1, pp. 42–45.
Klatt, D., 1982. The Klattalk text-to-speech conversion system. In: Proc. ICASSP 82, pp. 1589–1592.
Klatt, D., 1987. Review of text-to-speech conversion for English. J. Acoust. Soc. Amer. 82 (3), 737–793.
Le Goff, B., 1997. Automatic modeling of coarticulation in text-to-visual speech synthesis. In: Proc. 5th European Conf. on Speech Communication and Technology (Eurospeech '97), Rhodos, Greece, pp. 1667–1670.
Lee, M., van Santen, J., Möbius, B., Olive, J., 1999. Formant tracking using segmental phonemic information. In: Proc. Eurospeech '99, Vol. 6, pp. 2789–2792.
Löfqvist, A., 1990. Speech as audible gestures. In: Hardcastle, W.J., Marchal, A. (Eds.), Speech Production and Speech Modelling. Kluwer Academic Publishers, Dordrecht, pp. 289–322.
MacLeod, A., Summerfield, Q., 1990. A procedure for measuring auditory and audio-visual speech-reception thresholds for sentences in noise: Rationale, evaluation, and recommendations for use. Br. J. Audiol. 24, 29–43.
Mannell, R.H., 1998. Formant diphone parameter extraction utilising a labeled single speaker database. In: Proc. ICSLP 98.
Massaro, D.W., Cohen, M.M., Tabain, M., Beskow, J., Clark, R., 2005. Animated speech: Research progress and applications. In: Vatikiotis-Bateson, E., Bailly, G., Perrier, P. (Eds.), Audiovisual Speech Processing. MIT Press.
Mori, H., Ohtsuka, T., Kasuya, H., 2002. A data-driven approach to source-formant type text-to-speech system. In: Proc. ICSLP 2002, pp. 2365–2368.
Ogden, R., Hawkins, S., House, J., Huckvale, M., Local, J., Carter, P., Dankovicova, J., Heid, S., 2000. ProSynth: An integrated prosodic approach to device-independent, natural-sounding speech synthesis. Comput. Speech Language 14, 177–210.
Öhlin, D., 2004. Formant extraction for data-driven formant synthesis. Master thesis, TMH, KTH, Stockholm (in Swedish).
Öhlin, D., Carlson, R., 2004. Data-driven formant synthesis. In: Proc. Fonetik, pp. 160–163.
Öhman, T., 1998. An audio-visual speech database and automatic measurements of visual speech. In: KTH TMH-QPSR, Vols. 1–2, pp. 61–76.
Olive, J.P., 1977. Rule synthesis of speech from diadic units. In: Proc. ICASSP-77, pp. 568–570.
Parke, F.I., 1982. Parameterized models for facial animation. IEEE Comput. Graphics 2 (9), 61–68.
Pelachaud, C., 2002. Visual text-to-speech. In: Pandzic, I., Forchheimer, R. (Eds.), MPEG-4 Facial Animation—The Standard, Implementation and Applications. John Wiley & Sons, pp. 125–140.
Pelachaud, C., Badler, N.I., Steedman, M., 1996. Generating facial expressions for speech. Cognitive Sci. 20 (1), 1–46.
Peterson, G., Wang, W., Sivertsen, E., 1958. Segmentation techniques in speech synthesis. J. Acoust. Soc. Amer. 32, 639–703.
Reveret, L., Bailly, G., Badin, P., 2000. Mother: A new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. In: Proc. 6th Internat. Conf. on Spoken Language Processing (ICSLP 2000), Beijing, China, pp. 755–758.
Shannon, R.V., Zeng, F.-G., Kamath, V., Wygonski, J., Ekelid, M., 1995. Speech recognition with primarily temporal cues. Science 270, 303–304.
Siciliano, C., Williams, G., Beskow, J., Faulkner, A., 2003. Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. In: Proc. 15th Internat. Congress of Phonetic Sciences, Barcelona, Spain.
Sigvardson, T., 2002. Data-driven methods for parameter synthesis—description of a system and experiments with CART analysis. Master thesis, TMH, KTH, Stockholm, Sweden (in Swedish).
Sjölander, A., 2001. Data-driven formant synthesis. Master thesis, TMH, KTH, Stockholm (in Swedish).
Sjölander, K., 2003. An HMM-based system for automatic segmentation and alignment of speech. In: Proc. Fonetik 2003, Umeå Universitet, Umeå, Sweden, pp. 93–96.
Stevens, K.N., Bickley, C.A., 1991. Constraints among parameters simplify control of Klatt formant synthesizer. J. Phonetics 19, 161–174.
Talkin, D., 1989. Looking at speech. Speech Technol. 4, 74–77.
Vinet, R., 2004. Enhancing rule-based synthesizer using concatenative synthesis. Master thesis, TMH, KTH, Stockholm, Sweden.
Yehia, H., Rubin, P., Vatikiotis-Bateson, E., 1998. Quantitative association of vocal-tract and facial behaviour. Speech Comm. 26, 23–43.