Speech Communication 26 (1998) 233–244
A stochastic model of intonation for text-to-speech synthesis 1

Jean Véronis *, Philippe Di Cristo, Fabienne Courtois, Cédric Chaumette

Laboratoire Parole et Langage, Université de Provence & CNRS, 29 Av. Robert Schuman, 13621 Aix-en-Provence Cedex 1, France

To Fabienne, in memoriam

Received 6 May 1998; received in revised form 13 October 1998; accepted 19 October 1998
Abstract

This paper presents a stochastic model of intonation contours for use in text-to-speech synthesis. The model has two modules: a linguistic module that generates abstract prosodic labels from text, and a phonetic module that generates an F0 curve from the abstract prosodic labels. This model differs from previous work in the abstract prosodic labels used, which can be automatically derived from the training corpus. This feature makes it possible to use large corpora or several corpora of different speech styles, in addition to making it easy to adapt to new languages. The present paper focuses on the linguistic module, which does not require full syntactic analysis of the text but simply relies on part-of-speech tagging. The results were validated on French by means of a perception test. Listeners did not perceive a significant difference in quality between the sentences synthesised using the phonetic module only, with prosodic labels derived from original recordings as input, and those synthesised directly from the text using the linguistic module followed by the phonetic module. The proposed model thus appears to capture most of the grammatical information needed to generate F0. © 1998 Published by Elsevier Science B.V. All rights reserved.

Keywords: Text-to-speech synthesis; Prosody; Intonation; Stochastic model; Part-of-speech tagging; French
1. Introduction

Generating acceptable prosody is currently one of the most challenging tasks in the development of text-to-speech synthesis systems (Klatt, 1987; Collier, 1991). High-quality prosody is indeed essential, both for comprehension (Silverman, 1993) and for acceptable synthesis, especially when long read texts are involved. Many studies have shown that although semantic and pragmatic factors enter into play, syntax is a major determining factor of the prosody of utterances (e.g. Faure, 1974; Rossi, 1977; Selkirk, 1984; Di Cristo, 1985). However, while the existence of a relationship between prosody and syntax is undeniable, no one has ever been able to reduce that relationship to simple rules. Despite this fact, a number of authors have attempted to generate prosody based on syntactic parsing of sentences (Allen et al., 1979, 1987), but this approach is difficult to apply to text-to-speech synthesis because, in today's state of the art, no system is capable of automatically producing full syntactic parses of arbitrary texts.

This paper presents a probabilistic model that generates French intonation contours based on

* Corresponding author. Tel.: +33 4 42 95 31 37; fax: +33 4 42 59 50 96; e-mail: [email protected].
1 This paper is based on a communication presented at Eurospeech'97 (Véronis et al., 1997).

0167-6393/98/$ – see front matter © 1998 Published by Elsevier Science B.V. All rights reserved. PII: S0167-6393(98)00063-6
part-of-speech tagging, a step that can currently be achieved with an error rate of less than 5% (Bahl and Mercer, 1976; Debili, 1977; Leech et al., 1983; Kupiec, 1992; Merialdo, 1994). The use of grammatical classes alone has already been proposed, for example, by Sorin et al. (1987) for French, Quené and Kager (1992) for Dutch, and Ostendorf and Veilleux (1994) for English. Probabilistic models have already been applied to F0 synthesis (Ross, 1995), but unlike past approaches, the method presented here uses a system of prosodic labels that can be automatically derived from the training corpus. This saves time, so large corpora or multiple corpora of different speech styles can be used. In addition, as will be seen, the prosodic coding system as well as the other components of the model are language-independent, which makes adaptation to new languages straightforward.

The model is composed of two modules (Fig. 1):
· a linguistic module that predicts a set of abstract prosodic labels from the text;
· a phonetic module that predicts an F0 curve from the prosodic labels generated by the linguistic module.

This article focuses on the linguistic module. We shall show that the method proposed for this module produces excellent results, as validated by a perception test. These positive results suggest
that the linguistic module captures a large part of the grammatical information needed to generate intonation contours.

2. A probabilistic approach to prosody

Probabilistic models have been successfully applied to a variety of language processing tasks. Two advantages of these models are that they are ``neutral'' with respect to current theories and that they can be ``learned'' directly from the data. Generally, they do not attempt to furnish any linguistic explanation of the language phenomena involved, yet they can be used to develop tools capable of carrying out a number of linguistic engineering tasks such as speech recognition (Baker, 1975; Jelinek, 1976; Rabiner, 1989), grammatical labelling (Bahl and Mercer, 1976; Debili, 1977; Leech et al., 1983), lexical disambiguation (Choueka and Lusignan, 1985; Brown et al., 1991; Gale et al., 1993), lexical or terminological extraction (Choueka et al., 1983; Church and Hanks, 1990; Daille et al., 1994), and even – albeit arguably less successfully – automatic translation (Brown et al., 1990).

Most probabilistic models can be stated in information-theory terms (Shannon, 1948). More specifically, it is assumed that an input message I is supplied to a noisy channel that converts it into a deformed output message O.
Fig. 1. Model overview.
Automatically reconstructing input message I when only the output is known amounts to examining all possible input messages and then selecting the message Î that maximises the probability of getting output message O. Expressed in mathematical terms,

  Î = arg max_I Pr(I | O),   (1)

where `arg max' selects the argument with the best score, Pr(I) is the probability that message I will appear at the channel input, and Pr(I | O) the probability of having had I as input when O is the observed output. Bayes' theorem gives us

  Pr(I | O) = Pr(I) Pr(O | I) / Pr(O).   (2)

Given that the denominator is independent of I, finding Î amounts to maximising the numerator, that is,

  Î = arg max_I Pr(I | O) = arg max_I Pr(I) Pr(O | I).   (3)

Applied to prosodic tagging, this model assumes that the noisy channel outputs a sequence of words W, corresponding to an unknown input sequence of prosodic labels, P. Retrieving P thus means finding the most probable sequence P̂ capable of producing output sequence W, i.e.,

  P̂ = arg max_P Pr(P | W) = arg max_P Pr(P) Pr(W | P).   (4)

As usual in probabilistic methods, it is impossible to estimate the parameters of the model directly from the above statements, so several approximations will be used. We will assume first of all that we obtain the same result if, rather than observing words in the output, we observe the grammatical-class sequence C that corresponds to those words:

  P̂ = arg max_P Pr(P | C) = arg max_P Pr(P) Pr(C | P).   (5)

This simplification is of course a strong one, since it is clear that the lexicon – in particular, the syllabic structure of words – undoubtedly has an impact on prosodic contours. One could imagine trying to complete the above model to take the lexical effect into account, but the learning corpus would have to be much larger. Secondly, we will assume that it is possible to synchronously associate a prosodic label with each word without losing the ability to reconstruct high-quality F0 contours. Sequences P and C thus have the same number of elements, n:

  Pr(P) = Pr(P_1 P_2 … P_n),   (6)

  Pr(C | P) = Pr(C_1 C_2 … C_n | P_1 P_2 … P_n).   (7)

We will finally assume that the probability of the current prosodic label given the past depends solely on the two labels that precede it, and that the grammatical class depends solely on the current prosodic label and the preceding grammatical class: 2

  Pr(P) = ∏_{i=1}^{n} Pr(P_i | P_{i−1} P_{i−2}),   (8)

  Pr(C | P) = ∏_{i=1}^{n} Pr(C_i | C_{i−1} P_i).   (9)

These conditions can obviously be changed in other versions of the model, by taking a larger context into account, provided a sufficiently large learning corpus is available.

3. The INTSINT prosodic labelling system

Prosodic labelling systems can be categorised into two types: linguistic systems, such as ToBI (Beckman and Pierrehumbert, 1986; Silverman et al., 1992), which encode events of a linguistic nature, and phonetic systems, such as HLCB (Taylor, 1994) or INTSINT (Hirst et al., 1994, Forthcoming), which aim only at providing a purely configurational description of the macroprosodic curve without interpretation. It is

2 This type of n-gram simplification of the history is common in many probabilistic models, in particular in speech recognition (e.g. Jelinek, 1976).
obvious that the first category poses greater problems for automatisation, although work is underway in that area (Wightman and Ostendorf, 1992; Black and Hunt, 1996; Ostendorf and Ross, 1997). The second category, easier to implement, is of course less interesting in linguistic terms, but can still contribute to the development of labelled corpora useful for many applications. In addition, it can be seen as a first step towards automatic labelling in systems like ToBI, and can help in the development of such systems for languages other than American English.

In this study, the prosodic labels were taken from the INTSINT system (Hirst et al., 1994, Forthcoming). They constitute a simple formal encoding of intonation contours that can be automatically derived from the F0 curve (see Appendix B). This feature makes it easier to build a training corpus than in ToBI-based probabilistic approaches to F0 generation, where hand marking of the training corpus by one or more experts is necessary (see for example (Ross, 1995)). In addition, the INTSINT system is language-independent, so our model can be easily adapted to other languages. No full-scale implementation has been done yet for languages other than French, but the automatic coding and re-generation have been tested satisfactorily on five languages (Véronis and Campione, In press). 3

The labels fall into two categories (Fig. 2):
· absolute labels, which are defined relative to the speaker's voice range:
  – T (top),
  – B (bottom),
  – M (mid: initial label corresponding to a mean value);
· relative labels, which are defined relative to the context:
  – U (upstepped),
  – D (downstepped),
  – H (high: local maximum),
  – L (low: local minimum),
  – S (same as preceding).
3 INTSINT has recently been used for the manual transcription of intonation patterns in a variety of languages (see the different chapters in (Hirst and Di Cristo, In press)).
Fig. 2. INTSINT labelling system.
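To make the decoding problem of Section 2 concrete, the following sketch scores a candidate label sequence with the factorisations of equations (8) and (9) and performs the arg max of equation (5) by brute force over the INTSINT inventory above. This is an illustration only: the probability functions `pr_label` and `pr_class` are toy stand-ins for the corpus estimates described in Section 5, and a practical implementation would use Viterbi dynamic programming rather than exhaustive enumeration.

```python
import itertools

# INTSINT label inventory from Fig. 2 (T, B, M absolute; U, D, H, L, S relative).
LABELS = ["T", "B", "M", "U", "D", "H", "L", "S"]

def sequence_score(P, C, pr_label, pr_class):
    """Score Pr(P) * Pr(C | P) under the n-gram factorisations (8)-(9):
    pr_label(p, p1, p2) ~ Pr(P_i | P_{i-1} P_{i-2}),
    pr_class(c, c1, p)  ~ Pr(C_i | C_{i-1} P_i).
    None marks an empty history at the start of the sequence."""
    score = 1.0
    for i in range(len(P)):
        p1 = P[i - 1] if i >= 1 else None
        p2 = P[i - 2] if i >= 2 else None
        c1 = C[i - 1] if i >= 1 else None
        score *= pr_label(P[i], p1, p2) * pr_class(C[i], c1, P[i])
    return score

def decode(C, pr_label, pr_class, labels=LABELS):
    """Exhaustive arg max over label sequences (equation (5)).
    Exponential in sentence length; for illustration only."""
    best, best_score = None, -1.0
    for P in itertools.product(labels, repeat=len(C)):
        s = sequence_score(list(P), C, pr_label, pr_class)
        if s > best_score:
            best, best_score = list(P), s
    return best, best_score
```

Replacing the exhaustive search with Viterbi dynamic programming over (label, previous-label) states makes the cost linear in sentence length, which is what an HMM-style tagger would do in practice.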
4. Preparation of training corpus

4.1. Corpus description

The model parameters were estimated on 500 sentences taken from the EUROM 1 corpus (Chan et al., 1995), totalling 35 min of speech or 7022 words. The sentences in this corpus are grouped into five-sentence passages pronounced by 10 different speakers. 4 There are 40 different passages altogether (i.e. 200 different sentences), and a subset is read by each speaker. Example of a passage:

La semaine dernière, mon amie est allée chez le médecin se faire faire des piqûres. Elle doit partir en vacances en Extrême-Orient, et elle est obligée de se faire vacciner contre le choléra, la typhoïde, l'hépatite A, la polio et le tétanos. Je pense qu'après ça elle aura vraiment besoin d'un médecin! D'autant plus qu'elle veut se faire faire tout ça en une seule fois. Moi, je la plains pas. Tant pis pour elle!

(`Last week, my friend went to the doctor's to have some injections. She is going on holiday to the Far East, and she has to be vaccinated against cholera, typhoid, hepatitis A, polio and tetanus. I think after that she will really need a doctor! All the more so since she wants to have it all done in one go. Me, I don't feel sorry for her. Too bad for her!')

The corpus was prepared using the tools developed in the MULTEXT project (Véronis et al., 1994), as described in the sub-sections below (see overview in Fig. 3).
4 We thus model an ``average'' speaker. One of the reviewers of this paper suggested that, for the sake of naturalness, it could be better to use speaker-specific databases, if they were available. Whereas this is likely to improve the phonetic module if it were trained on such recorded data, it remains to be seen to what extent the relationships between grammatical category and prosodic labels are speaker-dependent. Further exploration is clearly needed.
Fig. 3. Preparation of training corpus: overview.
4.2. Grammatical tagging

After a word segmentation phase, grammatical tagging of the sentences was achieved using a Hidden Markov Model (HMM) tagger, 5 which performs at a correct tagging rate of approximately 95%. The categories used are the basic parts of speech (noun, verb, adjective, etc. – see Fig. 4) along with a few punctuation categories. The tagging was verified and corrected manually. Note that HMM part-of-speech tagging techniques are language-independent and have been successfully tested on a wide range of languages.

4.3. Word–signal alignment

The recordings of the 500 sentences in the training corpus were manually segmented into words using a fast, computer-assisted method. Segmentation accuracy is not critical, so this phase will be performed quasi-automatically in the future, using tools now being developed (Di Cristo and Hirst, 1997).

4.4. Prosodic coding

Prosodic coding was achieved in two steps (Fig. 5 – see Campione and Véronis, In press, for a more precise description):
· The recorded F0 was stylised with quadratic spline functions using the MOMEL algorithm (Hirst and Espesser, 1993; Hirst et al., Forthcoming – see Appendix A). Since the accuracy

5 A description of this technique, which does not fall within the scope of this paper, can be found in (Kupiec, 1992) or (Merialdo, 1994).
Fig. 4. List of part-of-speech tags.
of the method is around 95% (Campione, 1997; Campione and Véronis, Forthcoming), 6 the stylised F0 was checked and corrected by an expert using the PSOLA re-synthesis system (Hamon et al., 1989). Various trials showed that the errors corrected were in fact minor and that the manual validation phase could have been skipped without any drastic effect on parameter estimation.
· The INTSINT prosodic labels associated with each word, corresponding to the movements in the stylised curve, were automatically derived from the stylised F0, using the algorithm described in Appendix B.

5. Parameter estimation

The linguistic module probabilities were estimated from the relative frequencies of the various
6 The technique has recently been improved as a result of the studies cited (ca. 97% accuracy).
Fig. 5. Prosodic coding.
prosodic-label and grammatical-class sequences in the corpus, corrected by a distribution smoothing in order to take unobserved cases into account:

  Pr(P_i | P_{i−1} P_{i−2}) ≈ f(P_i | P_{i−1} P_{i−2}) = N(P_i P_{i−1} P_{i−2}) / N(P_{i−1} P_{i−2}),   (10)

  Pr(C_i | C_{i−1} P_i) ≈ f(C_i | C_{i−1} P_i) = N(C_i C_{i−1} P_i) / N(C_{i−1} P_i),   (11)

where the f's stand for the relative event frequencies and the N's for their absolute counts. These estimates assign a null probability to events unobserved in the corpus. However, events may be missing for two different reasons:
· they are impossible (e.g., the prosodic-label sequence HH is impossible, by definition);
· they are infrequent and did not occur in this particular corpus, but could be observed in some other corpus.
Because the second possibility exists, a distribution smoothing has to be applied, as is often the case in statistical models, by interpolating the observed distributions with lower-order distributions (for instance, using bigrams to estimate trigrams). As such,

  Pr(P_i | P_{i−1} P_{i−2}) = λ_1 f(P_i | P_{i−1} P_{i−2}) + (1 − λ_1) f(P_i | P_{i−1}),   (12)

  Pr(C_i | C_{i−1} P_i) = λ_2 f(C_i | C_{i−1} P_i) + (1 − λ_2) f(C_i | P_i),   (13)

with

  f(P_i | P_{i−1}) = N(P_i P_{i−1}) / N(P_{i−1}).   (14)

The λ_k parameters can increase with the amount of observed data, i.e., with the corpus size. In the experiment described here, the λ_k parameters were set at 0.9. Systematic trial and error showed that the exact value has little impact on the overall results.
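The count-based estimates of equations (10), (12) and (14) can be sketched as follows. `estimate_label_model` is a hypothetical helper name, and the handling of empty histories at sentence starts is our simplification (the paper does not specify it); λ is fixed at 0.9 as in the experiment.

```python
from collections import Counter

def estimate_label_model(label_sequences, lam=0.9):
    """Interpolated trigram estimate of Pr(P_i | P_{i-1} P_{i-2}):
    lam * f(P_i | P_{i-1} P_{i-2}) + (1 - lam) * f(P_i | P_{i-1}),
    with the relative frequencies f computed from absolute counts N."""
    tri, tri_hist = Counter(), Counter()   # N(P_i P_{i-1} P_{i-2}), N(P_{i-1} P_{i-2})
    bi, bi_hist = Counter(), Counter()     # N(P_i P_{i-1}),         N(P_{i-1})
    for seq in label_sequences:
        for i in range(1, len(seq)):
            bi[(seq[i - 1], seq[i])] += 1
            bi_hist[seq[i - 1]] += 1
            if i >= 2:
                tri[(seq[i - 2], seq[i - 1], seq[i])] += 1
                tri_hist[(seq[i - 2], seq[i - 1])] += 1

    def pr(p, p1, p2):
        # Unseen histories contribute 0, so the lower-order term takes over.
        f3 = tri[(p2, p1, p)] / tri_hist[(p2, p1)] if tri_hist[(p2, p1)] else 0.0
        f2 = bi[(p1, p)] / bi_hist[p1] if bi_hist[p1] else 0.0
        return lam * f3 + (1 - lam) * f2

    return pr
```

The class model of equations (11) and (13) is estimated the same way, with (C_{i−1}, P_i) histories in place of the label bigrams.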
6. Phonetic module

As stated in the introduction, this paper deals mainly with the linguistic module. Although it produces acceptable output, the phonetic module used here is very simple. It is now being improved, but even in its current state, it is sufficient for testing the linguistic module.

The F0 curve corresponding to the sequence of labels P_i is a quadratic spline curve that goes through the set of target points F_i, in one-to-one correspondence with the labels (P_i). Each target point is placed at 2/3 of the duration of the corresponding word. This point usually corresponds to the final stressed syllable in French. Systematic experiments showed that this point is optimal, although there is quite a lot of flexibility in its exact position. Of course, a better model could take into account the exact syllabic and phonemic structure of the word. The intent here is to show that a very crude heuristic can produce reasonably acceptable results.

Frequencies F_T, F_M and F_B, which correspond to the absolute prosodic labels T, M and B, respectively, are assumed to be fixed. Let F_i be the frequency of the current target point. The frequency of the next target point is calculated by the following linear laws:

  F_{i+1} = F_i + a (F_T − F_i)  if P_{i+1} = U or H,   (15)

  F_{i+1} = F_i + b (F_B − F_i)  if P_{i+1} = D or L.   (16)

An ascending sequence of target points thus converges towards F_T, and a descending sequence converges towards F_B, which is consistent with the tendency observed by Hirst et al. (1991). The spline curve that yields the final F0 is made up of parabola arcs whose extrema are the target points (F_i). The arcs are connected by a common tangent at the median point between two consecutive target points. Parameters a, b, F_T, F_M and F_B of the phonetic module were estimated by taking the mean of the observed values for a given speaker (the TL test set; see below).
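A minimal sketch of the linear laws (15) and (16), together with an obvious reading of the remaining labels (T and B snap to FT and FB; S repeats the previous value). The numeric values of FT, FM, FB, a and b below are purely illustrative; the paper estimates them from a given speaker's recordings.

```python
def target_points(labels, F_T=220.0, F_M=170.0, F_B=120.0, a=0.4, b=0.4):
    """Compute target frequencies (Hz) from a prosodic label sequence.
    Rising labels converge towards F_T, falling ones towards F_B,
    following equations (15)-(16). The first label is assumed to be M."""
    F = [F_M]                      # initial label M -> mean value
    for lab in labels[1:]:
        Fi = F[-1]
        if lab in ("U", "H"):      # equation (15): move a fraction a towards F_T
            F.append(Fi + a * (F_T - Fi))
        elif lab in ("D", "L"):    # equation (16): move a fraction b towards F_B
            F.append(Fi + b * (F_B - Fi))
        elif lab == "T":           # absolute labels snap to fixed values
            F.append(F_T)
        elif lab == "B":
            F.append(F_B)
        else:                      # S: same as preceding
            F.append(Fi)
    return F
```

Because each rising step covers only a fraction a of the remaining distance to F_T, an unbroken ascending sequence increases monotonically but never reaches F_T, which is the convergence behaviour described above.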
7. Evaluation

The model was evaluated using a quality test involving expert subjects, as described in the following sections.

7.1. Test corpus

An original set, TL, consisting of 20 sentences of comparable length and style to the ones in the training corpus, was recorded by a male speaker. Examples: Qu'est-ce que tu penses de la réception donnée à l'occasion des Victoires de la musique? – Le problème majeur de la jeunesse, c'est de ne pas trouver de travail. – La fête nationale qui a lieu en France le quatorze juillet est clôturée par un feu d'artifice. – Pourquoi les hommes ne comprennent-ils pas que les oiseaux ne doivent pas être mis en cage? (`What do you think of the reception given for the Victoires de la musique? – The major problem for young people is not finding work. – The national holiday held in France on the fourteenth of July closes with a fireworks display. – Why don't people understand that birds must not be put in cages?')

TL was segmented into words, and then grammatical tagging, F0 stylisation and INTSINT labelling were added. The test material included four test sets, TO, TP, TM and TA, each composed of the same 20 sentences as in TL, but synthesised by the MBROLA synthesiser (formerly MBR-PSOLA: Dutoit and Leich, 1993). 7 To avoid potential duration-related biases, the sentences were synthesised using for each phone an average duration observed in a corpus. Another option would have been to use the segmental durations of the original recordings. However, since it can reasonably be argued that segmental duration and intonation are correlated, this would have created a bias in favour of the test set using the original (stylised) F0. Using a common, neutral set of durations, although it is likely to degrade the overall quality, is more satisfactory for the purpose of the test, since it enables a comparison of the F0 curves only, all other things being equal.

The test sets were assigned the following F0 curves:
· TO: stylised F0 of original set TL, mapped onto the new durations.
· TP: F0 generated by the phonetic module from the INTSINT labelling of TL.
· TM: F0 generated by the full model from the grammatical classes of the words.
· TA: F0 generated from random target points drawn from a Gaussian distribution between FB and FT.
The phonetic module parameters were estimated from the original TL set.

7.2. Protocol

The sentences in the four test sets were presented in random order to 14 judges (graduate students in phonetics and laboratory researchers).
7 The material used in this test is slightly different from that presented at Eurospeech'97 (Véronis et al., 1997).
The subjects were asked to grade the prosodic quality of the passages on a scale ranging from 0 to 9. 8 The task proper was preceded by a five-sentence practice session. The test was run using the ASTEC test station developed in the laboratory (formerly EURAUD-ASTEC; see (Pavlovic et al., 1995)): the scale was displayed as numbered boxes on the computer screen and subjects had to click on the desired value.

7.3. Results

The ranking of the four test sets was TO > TP > TM > TA. The mean scores were as follows:

  Set:    TO      TP      TM      TA
  Score:  6.032   5.596   5.386   3.475
An analysis of variance revealed a significant difference between the four sets (F = 111.386, p < 0.0001). Pair-wise post hoc comparison using Fisher's PLSD test showed that the differences between TO and TP on the one hand (0.436), and between TM and TA on the other hand (1.911), are significant (p = 0.0041 and p < 0.0001, respectively). The difference between TP and TM (0.211) is not significant (p = 0.1646). The Bonferroni/Dunn test confirmed these conclusions.

7.4. Discussion

As expected, the TO, TM and TP scores were far better than that of the random set TA. Set TM, synthesised directly from the text using the full model, obtained a score very close to TP, which carried prosodic information directly derived from the original without using our linguistic module (non-significant difference). The proposed model thus seems to capture most of the grammatical information needed to generate F0. The relatively low score obtained by TP, synthesised from the INTSINT labelling of TO, can be explained by the overly simplified state of the phonetic module used in this experiment. It is likely that the overall results could be improved by training the association between the symbolic representation and the acoustic features of intonation on a large corpus (preferably speaker-specific – see note 4 supra).

Thus, the results are very encouraging and confirm that there is no need for complete syntactic parsing to generate at least some acceptable prosody (see previous work by (Monaghan, 1990; O'Shaughnessy, 1990; Quené and Kager, 1992; Ostendorf and Veilleux, 1994; Zellner, 1997)). This is consistent with studies of the eye–voice span, i.e. the distance the eye is ahead of the voice when reading aloud, which have shown that readers look ahead no further than two words (see Levin, 1979), and therefore cannot have at their disposal a complete parse of sentences when they speak; yet they are capable of producing correct prosodic structures. Of course, we do not claim that more elaborate syntactic information would not lead to quality improvement. It is now commonly accepted that prosody is organised in prosodic groups (also called prosodic domains) constituted by short, phrase-level segments of the text. Various psycholinguistic studies indicate the importance and stability of these small word groups in sentence production and perception (e.g. Grosjean, 1980; Gee and Grosjean, 1983; Grosjean and Dommergues, 1983; see Caelen-Haumont, Forthcoming). It is therefore likely that partial or shallow parsing techniques (Liberman and Church, 1992; Ejerhed, 1988; Abney, 1991; Hindle, 1994; Karlsson et al., 1995; etc. – see a survey in (Abney, 1997)) oriented towards the recognition of the small groups of words that constitute the non-recursive kernels of major phrases (called non-recursive phrases, core phrases or chunks, depending on the author) could improve the quality of intonation. Obviously, part-of-speech bigrams (as used in this study) capture only a part of the short-distance relations between words.

8 The choice of this precise scale is somewhat arbitrary, since there is no agreement on any scale being better than the others. A ten-point scale as used in this study is relatively common. Experience in the laboratory indicates that it is easy for the subjects to use and reasonably stable across repetitions of the experiment.
8. Conclusion

The probabilistic model proposed in this study generates realistic F0 contours. Clearly, the linguistic module captures a large part of the grammatical information that would ideally be needed to generate F0, without requiring a thorough or complex syntactic analysis of the text. This module can nevertheless be enhanced in various ways, since it currently supports only a small set of grammatical classes and a limited grammatical context (bigrams). In particular, it could benefit from the detection of syntactically coherent word groups through shallow parsing techniques.

The phonetic module, which was not the focus of this study, is already being improved (Campione et al., 1997, 1998; Véronis and Campione, In press). Training the association between the symbolic representation and the acoustic features on a large database is likely to result in improvement. However, despite its simplicity, the model described shows that probabilistic approaches to intonation (somewhat underestimated so far) can yield reasonable results at low cost, and for languages other than American English, for which a reference system such as ToBI does not necessarily exist.
Acknowledgements

The authors would like to thank Daniel Hirst and the anonymous reviewers for their helpful comments (remaining errors are of course ours). We would also like to thank Thierry Dutoit for the MBROLA synthesiser, Corine Astesano and Estelle Campione for their help with the corpus, Emmanuel Flachaire for the statistical processing, Benoît Lagrue and Martin Brousseau for their help with the test, and Robert Espesser for his technical assistance. Robert Espesser wrote the signal editing software and the re-synthesis program used in this experiment. This article is dedicated to Fabienne Courtois, who died in an automobile crash as she was coming to the university to run the final tests.
Appendix A. MOMEL melodic stylisation algorithm

We summarise here the more detailed description that can be found in (Hirst and Espesser, 1993) and (Hirst et al., Forthcoming). The automatic stylisation algorithm (MOMEL) represents the fundamental frequency as a sequence of target points, each made up of two values ⟨F0, t⟩. Target points occur wherever there are relevant local variations in the F0 curve. When interpolated using a quadratic spline function, they supply the suprasegmental shape that gives an overall picture of the intonation.

The algorithm works as follows: for a given observation window (300 ms) whose centre is located at instant x in the series of F0 values, a quadratic curve is fit by ``modal'' regression (a curve located less than a distance d from the largest possible number of items in the series). The peak of this parabola defines the candidate target point associated with instant x (Fig. 6). This operation is repeated for every instant x in the series; one target point ⟨t, h⟩ is thus obtained for each observation window.

A data reduction procedure is then used to select only the relevant target points, i.e. the target points that correspond to major changes in the intonation contour (Fig. 7). The sequence of target candidates is partitioned by means of another moving window (200 ms) which is divided into two halves, left and right. A partition boundary is
Fig. 6. Computation of candidate target point for the analysis window.
inserted when the difference between the average weighted values of t and h in the left and right halves of the window corresponds to a local maximum greater than a threshold, set to the overall mean distance between left and right halves for all windows. Within each segment of the partition, outlying candidates (more than one standard deviation from the mean values for the segment) are eliminated. The mean value of the remaining targets in each segment is then calculated as the final estimate of t and h for that segment. The target points are finally interpolated by a quadratic spline curve in order to produce a smooth F0 contour (Fig. 8). An analysis/re-synthesis tool based on the PSOLA technique (Hamon et al., 1989) re-synthesises the original curve from the modelled one. This device is used to perceptually validate the automatically stylised F0 curve.

Fig. 7. Partitioning of target points.

Appendix B. INTSINT label generation algorithm

INTSINT labels for each word are generated automatically from the stylised F0 curve in the following manner:
· initial label P0 is set at M;
· the value Fi of the stylised curve is taken at the 2/3 point of each word i;
· for all i ≥ 1:
  if |Fi − Fi−1| / Fi−1 ≤ s, then Pi = S;
  else if Fi ≥ (1 − s) FT, then Pi = T;
  else if Fi ≤ (1 + s) FB, then Pi = B;
  else if Fi ≥ Fi−1, then Pi = U or H, depending on the next label;
  else Pi = D or L, depending on the next label.
The threshold s was set empirically at 5%.
Fig. 8. Interpolation by a quadratic spline curve.
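The Appendix B labelling rules can be sketched as follows. The choice between U and H (and between D and L) is stated only as `depending on next'; treating a rise followed by a fall as a local maximum H (and symmetrically for L) is our reading, not a detail given in the paper, and the FT and FB values are illustrative.

```python
def intsint_labels(F, F_T=220.0, F_B=120.0, s=0.05):
    """Derive INTSINT labels from stylised target values F_i (Hz), one per
    word, following Appendix B with threshold s = 5%. F_T and F_B are the
    speaker's fixed top and bottom values (illustrative numbers here)."""
    P = ["M"]                                        # initial label P0 = M
    for i in range(1, len(F)):
        if abs(F[i] - F[i - 1]) / F[i - 1] <= s:     # small movement: same
            P.append("S")
        elif F[i] >= (1 - s) * F_T:                  # near the top value
            P.append("T")
        elif F[i] <= (1 + s) * F_B:                  # near the bottom value
            P.append("B")
        else:
            nxt = F[i + 1] if i + 1 < len(F) else F[i]
            if F[i] >= F[i - 1]:                     # rising movement
                P.append("H" if nxt < F[i] else "U")  # local maximum vs upstep
            else:                                    # falling movement
                P.append("L" if nxt > F[i] else "D")  # local minimum vs downstep
    return P
```

Round-tripping these labels through the phonetic module of Section 6 would regenerate an approximation of the stylised curve, which is the coding/re-generation test reported for five languages in (Véronis and Campione, In press).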
References

Abney, S., 1991. Parsing by chunks. In: Berwick, R., Abney, S., Tenny, C. (Eds.), Principle-based Parsing. Kluwer Academic Publishers, Dordrecht, pp. 257–278.
Abney, S., 1997. Part-of-speech tagging and partial parsing. In: Young, S., Bloothooft, G. (Eds.), Corpus-Based Methods in Language and Speech Processing. Kluwer Academic Publishers, Dordrecht, pp. 118–136.
Allen, J., Hunnicutt, S., Carlson, R., Granström, B., 1979. MITalk-79: The 1979 MIT text-to-speech system. In: Wolf, Klatt (Eds.), Speech Communications Papers Presented at the 97th Meeting of the ASA, pp. 507–510.
Allen, J., Hunnicutt, S., Klatt, D., 1987. From Text to Speech: The MITalk System. Cambridge University Press, Cambridge.
Bahl, L.R., Mercer, R.L., 1976. Part of speech assignment by a statistical decision algorithm. In: IEEE International Symposium on Information Theory, Ronneby, pp. 88–89.
Baker, J., 1975. The DRAGON system – An overview. IEEE Trans. on Acoustics, Speech and Signal Processing 23, 24–29.
Beckman, M., Pierrehumbert, J., 1986. Intonational structure in Japanese and English. Phonology Yearbook 3, 255–309.
Black, A., Hunt, A., 1996. Generating F0 contours from ToBI labels using linear regression. In: Proceedings of ICSLP'96, Philadelphia.
Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Lafferty, J., Mercer, R., 1990. A statistical approach to machine translation. Computational Linguistics 16 (2), 79–85.
Brown, P., Della Pietra, S., Della Pietra, V., Mercer, R., 1991. Word sense disambiguation using statistical methods. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California, pp. 264–270.
Caelen-Haumont, G., Forthcoming. Prosodie et Sens, une Approche Expérimentale. Éditions du CNRS, Collection Sciences du Langage, Paris.
Campione, E., 1997. Modélisation automatique de la prosodie: étude statistique. Mémoire de DEA, Université de Provence, Aix-en-Provence, France, September 1997.
Campione, E., Véronis, J., Forthcoming. Une évaluation de l'algorithme de stylisation mélodique MOMEL. Travaux de l'Institut de Phonétique d'Aix, Aix-en-Provence, France.
Campione, E., Véronis, J., In press. A multilingual prosodic database. In: Proceedings of ICSLP'98, Sydney, Australia.
Campione, E., Flachaire, E., Hirst, D., Véronis, J., 1997. Stylisation and symbolic coding of F0. In: ESCA Tutorial and Research Workshop on Intonation: Theory, Models and Applications, Athens, Greece, September 1997, pp. 71–74.
Campione, E., Flachaire, E., Hirst, D., Véronis, J., 1998. Évaluation de modèles d'étiquetage automatique de l'intonation. In: Actes des XXèmes Journées d'Étude sur la Parole, Martigny, Switzerland, pp. 99–102.
Chan, D., Fourcin, A., Gibbon, D., Granström, B., Huckvale, M., Kokkinakis, G., Kvale, K., Lamel, L., Lindberg, B., Moreno, A., Mouropoulos, J., Senia, F., Trancoso, I., Veld, C., Zeiliger, J., 1995. EUROM – A spoken language resource for the EU. In: Proceedings of the Fourth European Conference on Speech Communication and Speech Technology, Eurospeech'95, Madrid, Vol. 1, pp. 867–870.
Choueka, Y., Lusignan, S., 1985. Disambiguation by short contexts. Computers and the Humanities 19, 147–158.
Choueka, Y., Klein, S.T., Neuwitz, E., 1983. Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. ALLC Journal 4, 34–38.
Church, K.W., Hanks, P., 1990. Word association norms, mutual information and lexicography. Computational Linguistics 16 (1), 22–29.
Collier, R., 1991. Multi-language intonation synthesis. Journal of Phonetics 19, 61–73.
Daille, B., Gaussier, E., Langé, J.M., 1994. Towards automatic extraction of monolingual and bilingual terminology. In: Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan.
Debili, F., 1977. Traitements syntaxiques utilisant des matrices de précédence fréquentielles construites automatiquement par apprentissage. Thèse de Docteur-Ingénieur, Université de Paris VII, U.E.R. de Physique, 297 pp.
Di Cristo, A., 1985. De la Microprosodie à l'Intonosyntaxe. Publications de l'Université de Provence, Aix-en-Provence, France, 854 pp.
Di Cristo, P., Hirst, D., 1997. Un procédé d'alignement automatique de transcriptions phonétiques sans apprentissage préalable. In: 4ème Congrès Français d'Acoustique, Marseille, pp. 425–428.
Dutoit, T., Leich, H., 1993. MBR-PSOLA: Text-to-speech synthesis based on an MBE re-synthesis of the segments database. Speech Communication 13 (3–4), 435–440.
Ejerhed, E., 1988. Finding clauses in unrestricted text by finitary and stochastic methods. In: Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas, pp. 219–227.
Faure, G., 1974. Contribution à l'analyse fonctionnelle des structures intonologiques du français moderne. In: De Caluwe, J., D'Heur, J.M., Dumas, R. (Eds.), Mélanges offerts à Charles Rostaing, Liège, Belgium, pp. 283–300.
Gale, W., Church, K.W., Yarowsky, D., 1993. A method for disambiguating word senses in a large corpus. Computers and the Humanities 26, 415–439.
Gee, J.P., Grosjean, F., 1983. Performance structures: A psycholinguistic and linguistic appraisal. Cognitive Psychology 15, 411–458.
Grosjean, F., 1980. Comparative studies of temporal variables in spoken and sign languages: A short review. In: Dechert, W., Raupach, M. (Eds.), Temporal Variables in Speech. Mouton, The Hague, pp. 307–312.
Grosjean, F., Dommergues, J.Y., 1983. Les structures de performance en psycholinguistique. L'Année Psychologique 83, 513–536.
Hamon, C., Moulines, E., Charpentier, F., 1989. A diphone system based on time-domain prosodic modifications of speech. In: Proceedings of ICASSP'89, pp. 238–241.
Hindle, D., 1994. A parser for text corpora. In: Atkins, B., Zampolli, A. (Eds.), Computational Approaches to the Lexicon. Oxford University Press, Oxford, pp. 103–151.
Hirst, D.J., Di Cristo, A. (Eds.), In press. Intonation Systems: A Survey of Twenty Languages. Cambridge University Press, Cambridge.
Hirst, D.J., Espesser, R., 1993. Automatic modelling of fundamental frequency curves using a quadratic spline function. Travaux de l'Institut de Phonétique d'Aix 15, 71–85.
Hirst, D.J., Nicolas, P., Espesser, R., 1991. Coding the F0 of a continuous text in French: An experimental approach. In: Proceedings of the 12th International Congress of Phonetic Sciences, Aix-en-Provence, France, Vol. 5, pp. 234–237.
Hirst, D.J., Ide, N., Véronis, J., 1994. Coding fundamental frequency patterns for multi-lingual synthesis with INTSINT in the MULTEXT project. In: Proceedings of the 2nd ESCA/IEEE Workshop on Speech Synthesis, New Paltz, New York, September 1994, pp. 77–81.
Hirst, D.J., Di Cristo, A., Espesser, R., Forthcoming. Levels of representation and levels of analysis for the description of intonation systems. In: Horne, M. (Ed.), Prosody: Theory and Experiment. Kluwer Academic Publishers, Dordrecht.
Jelinek, F., 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE 64 (4), 532–556.
Karlsson, F., Voutilainen, A., Heikkilä, J., Anttila, A. (Eds.), 1995. Constraint Grammars. Mouton de Gruyter, Berlin.
Klatt, D.H., 1987. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82 (3), 737–793.
Kupiec, J., 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language 6, 225–242.
Leech, G., Garside, R., Atwell, E., 1983. The automatic grammatical tagging of the LOB corpus. Newsletter of the International Computer Archive of Modern English 7, 13–33.
Levin, H., 1979. The Eye–Voice Span. MIT Press, Cambridge, MA.
Liberman, M., Church, K., 1992. Text analysis and word pronunciation in text-to-speech synthesis. In: Furui, S.,
Sondhi, M.M. (Eds.), Advances in Speech Signal Processing. Dekker, New York, pp. 791–831.
Merialdo, B., 1994. Tagging English text with a probabilistic model. Computational Linguistics 20 (2), 155–171.
Monaghan, A.I.C., 1990. A multi-phrase parsing strategy for unrestricted text. In: Proceedings of the ESCA Tutorial Day on Speech Synthesis, Autrans, France, pp. 109–112.
O'Shaughnessy, D., 1990. Relationships between syntax and prosody for speech synthesis. In: Proceedings of the ESCA Tutorial Day on Speech Synthesis, Autrans, France, pp. 39–42.
Ostendorf, M., Ross, K.N., 1997. A multi-level model for recognition of intonation labels. In: Sagisaka, Y., Campbell, N., Higuchi, H. (Eds.), Computing Prosody. Springer, Berlin, pp. 291–308.
Ostendorf, M., Veilleux, N., 1994. A hierarchical stochastic model for automatic prediction of prosodic boundary location. Computational Linguistics 20 (1), 27–54.
Pavlovic, C., Brousseau, M., Howells, D., Miller, D., Hazan, V., Faulkner, A., Fourcin, A., 1995. Analytic assessment and training in speech and hearing using a poly-lingual workstation, EURAUD. In: Placencia Porrero, I., Puig de la Bellacasa, R. (Eds.), The European Context for Assistive Technology. IOS Press, Amsterdam, pp. 332–335.
Quené, H., Kager, R., 1992. The derivation of prosody for text-to-speech from prosodic sentence structure. Computer Speech and Language 6, 77–98.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286.
Ross, K.N., 1995. Modeling intonation for speech synthesis. Ph.D. thesis, Boston University.
Rossi, M., 1977. L'intonation et la troisième articulation. Bulletin de la Société de Linguistique de Paris LXII (1), 55–68.
Selkirk, E.O., 1984. Phonology and Syntax: The Relation Between Sound and Structure. MIT Press, Cambridge, MA.
Shannon, C., 1948. A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656.
Silverman, K., 1993. On customizing prosody in speech synthesis: Names and addresses as a case in point. In: Proceedings of the ARPA Workshop on Human Language Technology, pp. 317–322.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J., 1992. ToBI: A standard for labelling English prosody. In: Proceedings of ICSLP'92, Vol. 2, Banff, Canada, pp. 867–870.
Sorin, C., Larreur, D., Llorca, R., 1987. A rhythm-based prosodic parser for text-to-speech systems in French. In: Proceedings of the International Congress of Phonetic Sciences 1, 125–128.
Taylor, P., 1994. The rise/fall/connection model of intonation. Speech Communication 15 (1/2), 169–186.
Véronis, J., Campione, E., In press. Towards a reversible symbolic coding of intonation. In: Proceedings of ICSLP'98, Sydney, Australia.
Véronis, J., Hirst, D., Espesser, R., Ide, N., 1994. NL and speech in the MULTEXT project. In: AAAI'94 Workshop on Integration of Natural Language and Speech, Seattle, pp. 72–78.
Véronis, J., Di Cristo, Ph., Courtois, F., Lagrue, B., 1997. A stochastic model of intonation for text-to-speech synthesis. In: Fifth European Conference on Speech Communication and Technology, EUROSPEECH'97, Rhodes, Greece, September 1997, Vol. 5, pp. 2643–2646.
Wightman, C.W., Ostendorf, M., 1992. Automatic recognition of intonational features. In: Proceedings of ICASSP'92, Vol. I, pp. 221–224.
Zellner, B., 1997. La fluidité en synthèse de la parole. In: Keller, E., Zellner, B. (Eds.), Les défis actuels en synthèse de la parole. Études de Lettres, 3, Université de Lausanne, Lausanne, Switzerland, pp. 47–78.