Analysis of prominence in spoken Japanese sentences and application to text-to-speech synthesis

Speech Communication 14 (1994) 171-196, Elsevier


Shoichi Takeda *,1, Akira Ichikawa 2
Central Research Laboratory, Hitachi, Ltd., Kokubunji-shi, Tokyo 185, Japan
(Received 4 April 1993; revised 24 August 1993)

Abstract

This paper focuses on the partial emphasis or "prominence" of parts of Japanese sentences. Four sets of 43 read sentences uttered by two speakers and including various types of prominence (172 sentences in total) are analyzed. This analysis shows that in 88% of the sentences prominence is produced by raising F0 and increasing power. No lengthening of phoneme duration is observed in the emphasized parts of the sentences except in some special cases. The exceptions are lengthening that accompanies a pause inserted as a mark of prominence, and a slowing of the overall speech rate. The prosodic features of read natural speech are then used to develop rules that modify a reference sentence to produce prominence in rule-based speech synthesis. Listening tests with 10 subjects show no significant difference at the 5% level in expressibility between prominence synthesized by rule (rate of correct expression: 76.9%) and prominence in natural speech (79.9%). To further improve prominence expressibility, listening tests with 10 subjects are used to clarify the conditions under which prominence expressibility becomes optimal. These tests show that the prosodic control parameters increase the expressibility of prominence by about 20%. Finally, prosodic features of spontaneous conversational speech are analyzed and compared with those of read sentence speech. Speech-rate reduction in parts where prominence is placed is more conspicuous in spontaneous conversational speech than in read speech.


* Corresponding author. Present address: Department of Information Systems, Teikyo University of Technology, 2289-23 Uruido, Ichihara-shi, Chiba 290-01, Japan.
1 Shoichi Takeda is now with Teikyo University of Technology, 2289-23 Uruido, Ichihara-shi, Chiba 290-01, Japan.
2 Akira Ichikawa is now with Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba 263, Japan.
0167-6393/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0167-6393(93)E0084-9


Key words: Japanese; Text-to-speech; Fujisaki's model; Prominence; Prosody; Prominence-production rule

1. Introduction

The pitch, intensity and speed of the voice in natural speech are not constant from one word to another. These qualities are often emphasized or weakened: one speaks at a higher pitch, more loudly, and sometimes more slowly to emphasize part of a conversation, and at a lower pitch, more softly, faster, and sometimes indistinctly when one feels the words are unimportant. These tendencies are known to be language-independent (Vaissière, 1983). This partial emphasis produced by altering intonation, loudness and speech rate is called "prominence", and it may be one of the important factors in achieving high comprehensibility even when little attention is paid to the voice. Fowler and Housum (1987) and Terken and Nooteboom (1987) demonstrated this point in their psycholinguistic studies, and further showed that prominence must not only be placed on words that require it in the context, but must also be withheld from words that do not require it; otherwise comprehension is impaired. Comprehension of synthetic speech is therefore expected to improve if monotonous and mechanical features are eliminated by using rules that introduce prominence, and the burden on a user who listens to synthetic speech for a long time is expected to be reduced.

Research on prominence or, more widely, prosody has been conducted by many organizations for various languages. Examples for languages other than Japanese include Bell Laboratories for American English (Hirschberg, 1992), the University of Edinburgh for British English (Monaghan, 1989, 1992), IPO for Dutch (Terken, 1991) and for Dutch and British English ('t Hart, 1982, 1991), the Research Institute for Language and Speech for Dutch (Quené and Kager, 1992), Lund University for Swedish, Greek and French (Gårding, 1982), and the University of Copenhagen for Danish (Thorsen, 1980, 1982, 1985). Hirschberg developed a statistical approach to assigning pitch accent using discourse information on focus and topic. Monaghan and Quené and Kager took heuristic approaches using similar discourse information for their respective languages. These approaches belong to linguistic processing, by which accentual or intonational information is derived from text. In them, information on prominence was derived mainly from discourse information; syntactic analysis was not adopted because it was impracticably complicated, and statistical or heuristic approaches were used instead.

Accentuation rules have also been developed for Japanese. As a major topic in the accentuation of Japanese speech, Sagisaka and Sato (1984) developed rules for word concatenation using information on the accent type of each word. Kubozono (1993) proposed similar accentuation rules. Our aim in this paper, on the other hand, is not to propose a linguistic processing method for deriving accentual or intonational information, but to propose a method for synthesizing speech waveforms from such information. Specifically, this paper focuses on the prosodic parameters that characterize prominence.

As for research on prominence in Japanese, Hakoda and Sato (1980) proposed rules that use syntactic-structure information and the physiological conditions of utterance (i.e., the duration of the airflow from the lungs) to determine the magnitudes of accent components (or stress levels), the locations for inserting pauses, and the duration of pauses. These are essentially prominence-production rules based on unintended causes. To derive rules for producing intended prominence, on the other hand, Fujisaki and coworkers performed pioneering research on the relationships between the syntactic and prosodic structures of spoken Japanese sentences, with special focus on the prosodic features of discourse structures (Hirose et al., 1986; Fujisaki and Kawai, 1988; Hirose et al., 1988). In discourse structures, a "focus", which is closely related to prominence, plays an important role. Fujisaki and coworkers analyzed noun or verb phrases consisting of two or three prosodic words³ with or without a focus (Hirose et al., 1986; Fujisaki and Kawai, 1988) and derived rules for producing fundamental frequency (F0) contours, including prosodic words with a focus (Hirose et al., 1988). Shirai et al. (1986) also proposed prosodic rules for speech synthesis that represent word emphasis.

³ In the conventional grammar of Japanese, the minimal unit of syntax, called "bunsetsu" (see Footnote 4), does not necessarily correspond to the minimal unit of prosody, since a bunsetsu may contain more than one accent component. As the minimal prosodic unit of spoken Japanese, Fujisaki and coworkers introduced the "prosodic word", which was defined as a part or the whole of an utterance that forms an accent type. Under certain conditions, a string of prosodic words can form a larger prosodic word due to "accent sandhi" (see Section 3). A phrase component of the F0 contour of an utterance, on the other hand, defines a larger prosodic unit, i.e., a "prosodic phrase", which may contain one or more prosodic words. (After Fujisaki and Kawai, 1988.)

This paper reviews prominence and proposes production rules for it from the following new viewpoints:
(1) Without confining the object of prominence to "a word" or "a phrase", various types of prominence in spoken Japanese sentences (for example, prominence placed on "part" of a word or a phrase) are analyzed and rules for them are developed. Such prominence includes that produced by "unintended" causes (e.g., habit, rhythm, etc.) as well as "intended" causes (e.g., discourse focus).
(2) Without confining the features of prominence to "fundamental frequency" and "a pause", overall prosodic features are taken into consideration in analysis and rule production. These features include "speech signal power" and "phoneme duration".

It is widely recognized that, to synthesize natural-sounding conversational speech, it is not sufficient for the synthesis system to reflect only the features of read speech; those of spontaneous speech should also be introduced (Lindblom et al., 1991; Rischel, 1991). As for prosody, analyses of spontaneous speech have so far been conducted with examples from Swedish and French (Bruce and Touati, 1991) and British English (Hieronymus and Williams, 1991). This paper deals with a few examples of spontaneous Japanese conversational speech.

This paper first classifies prominence in spoken Japanese sentences, analyzes read Japanese sentences that include prominence, and clarifies the structure and the main factors of prominence in terms of prosody. These results are then used to calculate prosodic control parameters for establishing prominence-production rules, and listening tests are performed to evaluate the expressibility of prominence. We next discuss how prominence expressibility can be improved. Finally, we discuss a preliminary analysis of the prosodic features of spontaneous Japanese conversational speech and compare these features with those of read sentence speech.

It must be noted that the prominence-production rules proposed in this paper are not derived directly from the analysis results. The analyses are performed relative to reference sentences, so the quantities obtained from them are only modifications of a reference sentence. These quantities are therefore not sufficient for complete synthesis by rule; that is, neutral sentences must also be produced in a completely synthetic way. This problem can be solved by using a text-to-speech system we previously developed.
In this system, F0 contours are produced by rules developed from the results of accent analysis for isolated words, assuming that the accents of a neutral sentence consist only of those of isolated words. We have used very simple word-accent production rules, in which the quantities are regarded as a set of a few constants determined by the accent type. The prominence-production rules can therefore be obtained by modifying these word-accent production rules. In other words, the prosodic control parameters for prominence obtained from the analyses are converted to modification quantities relative to the isolated words. In this way, the prominence-production rules can be applied to text-to-speech synthesis in general, rather than serving merely as modifications of a limited set of sentences. F0 contours can therefore be obtained in a completely synthetic way in combination with the word-accent production rules.

2. Classification of prominence

2.1. Prominence in Japanese

Japanese sounds monotone to native speakers of English, and the lack of emphasis is one of the reasons. One reason for this lack of prosodic emphasis may be that Japanese has more grammatical means of marking prominence, and makes more use of word order, than European languages such as English, German and French. Prosody is not the only means of marking prominence, and altogether Japanese has more non-prosodic means than, for example, English or French. We now describe several points that are particular to Japanese.

(1) Not saying the obvious
In Japanese, unlike English, elements are left unsaid as long as the context allows the listener to understand what is meant. Nouns, verbs, pronouns and even particles (in informal spoken language) are frequently deleted. Mentioning unnecessary information is in fact a sign of clumsiness in Japanese conversation. In an answer to a question, what is already stated in the question is often not repeated.

(2) Identifying topics
The following famous sentence can be taken as an example of the use of the topic marker wa: Zoo wa hana ga nagai. (meaning "As for elephants, trunks are long.") Any element of the sentence can become the topic. Prosodic marking of the topic becomes redundant when the particle wa is used. English and French have no such grammatical means. Possessive pronouns in Japanese are deleted unless there is a need for emphasis. When they are present in the sentence, they are implicitly emphasized.

(3) Questions and interrogative words
In phrasing a formal question, the interrogative particle ka is added to the end of the sentence and is pronounced with a slightly rising intonation (Kyoo ikimasu ka.). The Japanese particle ka is equivalent to the English do (Do you come today?). In Japanese, unlike English, there is no need to start questions with interrogative words. In conversation, a question may also be asked with a rising intonation at the end of the utterance, without ka (Kyoo ikimasu. "You come today?"). This rising intonation (on the last mora only) is not due to emphasis marking; it is a substitute for the missing particle ka.


2.2. Prosodic aspects of prominence

In the previous subsection, we described mainly nonprosodic aspects of prominence in Japanese. In this subsection, we classify prominence with respect to its prosodic aspects only.

Conventional knowledge of linguistics and speech analysis can be used to classify the prominence in Japanese sentences in several different ways. In general, prominence may be divided into two types, default and intended. A typical example of default prominence is that inherent in a specific sentence type, such as emphasis on the first prosodic word of a declarative sentence. Intended prominence can be further classified according to where focus is placed. For example, focus might be placed on an entire bunsetsu⁴ (which generally corresponds to a phrase in English) (Hirose et al., 1986; Fujisaki and Kawai, 1988; Hirose et al., 1988), or on only part of a bunsetsu. The latter type can be further classified according to whether focus is placed on part of a compound word, on a syllable, or on an entire bunsetsu but with acoustical emphasis on one syllable. The intended prominence most commonly used in Japanese is that which focuses on a whole bunsetsu. The second type, on the other hand, is a rather special case, examples of which can sometimes be found in conversation over the telephone. Table 1 lists these classifications of prominence in Japanese, with examples (Takeda and Ichikawa, 1990).

The main factors of prominence in Japanese are prosodic: the fundamental frequency F0, the power or amplitude of the speech waveform, and the temporal structure, i.e., phoneme duration and pauses. Prominence in the above classifications is therefore analyzed here in terms of prosody.

Table 1. Classifications of prominence in spoken Japanese. In the example sentences, |…| denotes emphasis, ⌈…⌉ denotes weakening, and [P] denotes a pause.

Large group: Default prominence
  Small group: Prominence inherent in sentence type
    Weakening at end of declarative sentence: [kareno imo:towa oki⌈ru⌉] (meaning "His sister gets up.")
    Strengthening on the first prosodic word: [|ka|reno imo:towa okiru] (meaning "His sister gets up.")

Large group: Intended prominence
  Small group: Common
    Focus on whole bunsetsu: [kareno i|mo:to|wa okiru] (meaning "His SISTER gets up.", focus on "imo:towa" (sister))
  Small group: Special
    Focus on whole bunsetsu (pause is inserted): [wa|taʃino| [P] anidesu] (meaning "He is MY brother.", focus on "wataʃino" (my))
    Focus on part of compound word: [nakaŋawa |seŋsei| aratame nakaŋawa |ko:tʃo:|] (meaning "Mr. Nakagawa, formerly TEACHER, has become PRINCIPAL.", focus on "seŋsei" (teacher) and "ko:tʃo:" (principal))
    Focus on a syllable: [ʃida|re|janaŋika ʃida|ri|janaŋika] (meaning "ShidaREyanagi or shidaRIyanagi.", focus on "re" and "ri", where "shidare (ri) yanagi" means "weeping willow")
    Focus on whole bunsetsu, but acoustical emphasis on a syllable: [korewa semmeŋ|ki|] (meaning "This is a WASHBASIN.", focus on "semmeŋki" (washbasin))

⁴ A bunsetsu is a grammatical and phonological unit in Japanese. It consists of an independent word (such as a noun, verb or adverb) followed by a sequence of zero or more dependent words (such as auxiliary verbs, postpositional particles or sentence-final particles) (Kurematsu et al., 1991).

3. Model for generating F0 contours

One of the most important prosodic parameters in Japanese may be F0 (Kitahara et al., 1988). In this section, we describe a functional model used for analyzing as well as synthesizing F0 contours. This model is called "Fujisaki's model" (Fujisaki and Hirose, 1984).

The tendency of F0 contours to decline during the course of an utterance is one of the most controversial issues in research on intonation, in Japanese as well as in other languages. The idea of the "downstep model", in which the F0 trend is for the most part linguistically controlled, has been supported by some researchers (Kubozono, 1993). Other researchers, however, characterize F0 declination primarily as a phonetic phenomenon possibly caused by physiological factors. Fujisaki's model belongs to the latter case. As another approach, Sagisaka (1990) has pursued a method of predicting global shapes using statistical nonparametric optimization. In this paper, we do not go deeply into this modeling issue. We adopt Fujisaki's model mainly because of its accuracy: in our previous experiments, the F0 prediction error was about 4%, and it was further reduced to less than 1% when the micro-fluctuation component was introduced. This advantage is expected to bring good results in future analysis and synthesis of speech that includes emotional expressions and many other prosodic styles.

According to Fujisaki's model, the F0 contour of an utterance can be regarded as the response of vocal cord vibration to a set of commands that carry information about the lexical accent, syntactic structure and discourse structure of the utterance. Two kinds of command are known to be necessary:
(1) An impulse-like phrase command for the onset of a prosodic phrase.
(2) A stepwise accent command for the accented mora or morae⁵ of a prosodic word.
The consequences of these two types of command appear as phrase components and accent components, each approximated by the response of a second-order linear system to the respective command. The output fundamental frequency F0 as a function of time t is given by

\[
\ln F_0(t) = \ln F_{\min} + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i}) + \sum_{j=1}^{J} A_{aj} \left\{ G_{aj}(t - T_{1j}) - G_{aj}(t - T_{2j}) \right\}, \tag{1}
\]

⁵ A mora is the unit of metrical time equal to the short syllable. In Japanese, most syllables are consonant-vowel (CV) concatenations or single vowels (V), and they are usually equal to morae. Exceptions are a syllabic nasal (denoted as /N/), as in [sendai] (/seNdai/), and an assimilated sound (denoted as /Q/), as in [itta] (/iQta/). The duration of a syllabic nasal is approximately equal to that of a CV syllable, so a syllabic nasal is regarded as a one-mora syllable. The duration of a syllable that includes an assimilated sound, on the other hand, is approximately equal to twice that of a CV syllable, so a syllable that includes an assimilated sound is regarded as a two-mora syllable (i.e., /Q/ is counted as one mora).


where the phrase and accent response functions in Eq. (1) are

\[
G_{pi}(t) =
\begin{cases}
\alpha_i^2\, t \exp(-\alpha_i t), & \text{for } t \ge 0, \\
0, & \text{for } t < 0,
\end{cases}
\tag{2}
\]

\[
G_{aj}(t) =
\begin{cases}
\operatorname{Min}\left[ 1 - (1 + \beta_j t) \exp(-\beta_j t),\ \theta_j \right], & \text{for } t \ge 0, \\
0, & \text{for } t < 0.
\end{cases}
\tag{3}
\]

The symbols in Eqs. (1), (2) and (3) indicate
$F_{\min}$: bias level upon which all the phrase and accent components are superposed to form the F0 contour,
$I$: number of phrase commands,
$J$: number of accent commands,
$A_{pi}$: magnitude of the ith phrase command,
$A_{aj}$: magnitude of the jth accent command,
$T_{0i}$: time of occurrence of the ith phrase command,
$T_{1j}$: onset of the jth accent command,
$T_{2j}$: offset of the jth accent command,
$\alpha_i$: natural angular frequency of the phrase control mechanism for the ith phrase command,
$\beta_j$: natural angular frequency of the accent control mechanism for the jth accent command, and
$\theta_j$: ceiling level of the accent component for the jth accent command.

The parameters of this model are estimated by minimizing the mean squared error between the extracted F0 contour and the contour predicted by the model. This minimization uses the method of analysis-by-synthesis. We developed a graphic dialog system in which the initial values of the model parameters can easily be determined with sufficient accuracy. In this system, the initial parameters are adjusted mainly with a cursor moved by the mouse. Once the initial parameter adjustment is completed, the analysis-by-synthesis can be executed by the computer immediately. Since any number of trials can be made until the adjustment is satisfactory, the analysis-by-synthesis is successful in most cases. Even when it is unsuccessful, it is very easy to retry the procedure from the beginning. Adjustments and analysis-by-synthesis are performed for each sentence independently. Although this system has a model of micro-fluctuations and supports adjustments for them, we have not used this function in the analysis of prominence, in order to save processing time. To reduce the influence of micro-fluctuations, smoothing has instead been applied to the extracted F0 contours (pitch-extraction errors have of course been eliminated in advance by hand using this dialog system).

It must be noted that in Japanese two adjacent accent components can often be merged into a single accent component, or they can at least be closely connected to form a stepwise-varying accent ceiling without falling to the baseline. Such a connection of accent components is called an "accent sandhi". Linguistically, a prosodic accent sandhi is related to word concatenation, such as concatenation of an independent word and a dependent word (e.g., a particle), concatenation that forms a compound word, etc. (Sagisaka and Sato, 1984). The F0 contour of the former case of accent sandhi (a single merged accent component) is usually observed when the concatenated word is not emphasized. The F0 contour of the latter case (a stepwise-varying ceiling), on the other hand, is often observed when prominence is placed on either one of the components of the concatenated word (see Fig. 2: the second, left-hand rectangle pair (with and without dots) from the top in the box of "ACOUSTICAL EFFECTS" denotes an accent command of the former case, and the right-hand counterpart denotes that of the latter case). The F0 contour of the latter case of accent sandhi is obtained by applying stepwise-varying accent commands to Fujisaki's model. We adopt such sandhi rules when necessary (i.e., when a single accent command does not fit the data).
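As a rough illustration of Eqs. (1)-(3), the following Python sketch generates an F0 contour from a list of phrase commands and accent commands. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, NumPy is assumed to be available, and all numeric values (F_min = 80 Hz, alpha = 3.0, beta = 20.0, theta = 0.9, and the command magnitudes and timings) are illustrative assumptions rather than values reported in the paper.

```python
import numpy as np


def phrase_response(t, alpha):
    """Eq. (2): Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0."""
    t = np.asarray(t, dtype=float)
    tp = np.maximum(t, 0.0)  # clamp so exp() is only evaluated for t >= 0
    return np.where(t >= 0.0, alpha**2 * tp * np.exp(-alpha * tp), 0.0)


def accent_response(t, beta, theta):
    """Eq. (3): Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), theta) for t >= 0, else 0."""
    t = np.asarray(t, dtype=float)
    tp = np.maximum(t, 0.0)
    rise = 1.0 - (1.0 + beta * tp) * np.exp(-beta * tp)
    return np.where(t >= 0.0, np.minimum(rise, theta), 0.0)


def f0_contour(t, f_min, phrase_cmds, accent_cmds):
    """Eq. (1). phrase_cmds: (Ap, T0, alpha); accent_cmds: (Aa, T1, T2, beta, theta)."""
    t = np.asarray(t, dtype=float)
    ln_f0 = np.full_like(t, np.log(f_min))
    for ap, t0, alpha in phrase_cmds:
        ln_f0 += ap * phrase_response(t - t0, alpha)
    for aa, t1, t2, beta, theta in accent_cmds:
        ln_f0 += aa * (accent_response(t - t1, beta, theta)
                       - accent_response(t - t2, beta, theta))
    return np.exp(ln_f0)


# Illustrative values only (assumed, not taken from the paper).
t = np.linspace(-0.2, 2.0, 221)
phrase = [(0.5, 0.0, 3.0)]
reference = f0_contour(t, 80.0, phrase, [(0.4, 0.15, 0.45, 20.0, 0.9)])
emphasized = f0_contour(t, 80.0, phrase, [(0.7, 0.15, 0.45, 20.0, 0.9)])
print(f"peak F0: reference {reference.max():.1f} Hz, emphasized {emphasized.max():.1f} Hz")
```

In this sketch, a stepwise-varying accent ceiling of the kind described for accent sandhi could be approximated by passing two consecutive accent commands whose offset and onset coincide and whose magnitudes differ; this is one possible reading of the text, not a confirmed detail of the original system.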

4. Analysis of prosodic features of prominence

4.1. Measures for prosodic features

Several measures have been defined to quantify the prosodic features of prominence (Takeda and Ichikawa, 1990). The prosodic parameters used for this purpose are those for F0 (e.g., the magnitude of an accent command), those for power, and those for temporal structure (e.g., phoneme duration). The measures are defined as the differences between the prosodic parameter values for the emphasized part of speech and the values for the corresponding part of nonemphasized speech. These measures are functions of time, so they include information not only on the degree of emphasis but also on the timing of prominence. We call speech that does not include prominence "reference speech" (or "neutral speech") and speech that includes prominence "object speech" (or "marked speech") (for a similar study, see Thorsen, 1980).

Fig. 1. Analysis results for a Japanese sentence [kareno imo:towa okiru] (His sister gets up). [P]: pause. (a) Power and F0 patterns of nonemphasized speech (reference speech). (b) Power and F0 patterns of emphasized speech (object speech). (c) Time warp pattern (both axes: Time t (s)). (d) Various prosodic feature quantities (larger values denote "emphasis").


The measures are defined as follows.

(1) F0 ratio (FOR): Ratio of F0 for the object speech to F0 for the reference speech. The values approximated by Fujisaki's model are used for F0. FOR is given by

\[
\mathrm{FOR} = 20 \log (F_{0x} / F_{0r}) \quad \text{(dB)}, \tag{4}
\]

where $F_{0x}$ denotes F0 for the object speech and $F_{0r}$ denotes F0 for the reference speech.

(2) Accent command increment (DAa): Difference between the magnitude of the accent command of F0 for the object speech and the magnitude of the accent command for the reference speech. DAa is given by

\[
DA_a = A_{ax} - A_{ar}, \tag{5}
\]

where $A_{ax}$ denotes the magnitude of the accent command for the object speech and $A_{ar}$ that for the reference speech.

(3) Power ratio (POWR): Ratio of the power of the object speech to the power of the reference speech. POWR is given by

\[
\mathrm{POWR} = 10 \log (P_x / P_r) \quad \text{(dB)}, \tag{6}
\]

where $P_x$ denotes the power of the object speech waveform and $P_r$ that of the reference speech waveform.

(4) Time warp (TW): Rate of time warp of the object speech as a percentage of the corresponding phoneme duration in the reference speech. The TW for the ith phoneme is given by

\[
\mathrm{TW}(i) = \frac{T_x(i) - T_r(i)}{T_r(i)} \times 100 \quad (\%), \tag{7}
\]

where $T_x(i)$ denotes the duration of the ith phoneme in the object speech and $T_r(i)$ that in the reference speech.

Figs. 1(a)-1(d) show an example of analysis results for a Japanese sentence [káreno imo:tówa okíru] (which means "His sister gets up"; the mark ´ denotes accent) uttered by a male announcer. Fig. 1(a) shows the waveform, power pattern, and F0 pattern of speech that does not include prominence. Measured F0 values (symbols "x") and their best approximation obtained by Fujisaki's model (solid curve) are both shown. The dotted curve shows the phrase component, the arrows show phrase commands, and the stepwise waveforms show accent commands. Phoneme boundaries are denoted by the vertical dashed lines crossing the speech waveform and by the triangles along the time axes of the power and F0 plots. These boundaries were determined by observing the spectrogram, the enlarged speech waveform, or the spectrum variation rate curve.

Fig. 1(b) shows the same quantities for speech that includes prominence in the second prosodic word ([imo:towa], which means "sister"). The power and F0 values for [imo:towa] in Fig. 1(b) are much larger than those in Fig. 1(a). The largest rise (the highest F0 values) on the first prosodic word of the neutral speech and the maximum of F0 on the word with prominence are not peculiar to Japanese; these features are common to many other languages (Vaissière, 1983).

Fig. 1(c) shows a two-dimensional time warp pattern for the object speech versus the reference speech. The graph is plotted as a trajectory through the phoneme boundaries of the reference speech (x-coordinate) and the corresponding phoneme boundaries of the object speech (y-coordinate, denoted here as "INPUT SPEECH"). One can see that a pause occurred due to prominence: the vertical line segment at 0.5 s on the horizontal axis indicates a pause in the object speech. Such a pause before the word to be emphasized is known to be used often in other languages as well (a so-called rhetoric pause).

Fig. 1(d) shows the time-varying patterns of the prosodic features defined by Eqs. (4)-(7). The time axis here is that of the reference speech, so the time scale of the object speech is converted to that of the reference speech. This figure shows that the values of FOR, $DA_a$ and POWR are all larger where there is prominence (i.e., for [imo:towa]). These quantities are thus effective parameters that indicate the location and degree of prominence. In the time warp, on the other hand, there are no specific features where there is prominence; rather, there is lengthening of the vowel that precedes the pause. This has been called "pre-pausal lengthening" (Lehiste, 1980; Delattre, 1966).
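The four measures of Eqs. (4)-(7) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' analysis program: the function and argument names are hypothetical, and the inputs (F0 values from the Fujisaki-model fit, accent command magnitudes, powers, and phoneme durations) are assumed to have been measured and time-aligned beforehand.

```python
import math


def f0_ratio_db(f0_object_hz, f0_reference_hz):
    """Eq. (4): FOR = 20 log10(F0x / F0r), in dB."""
    return 20.0 * math.log10(f0_object_hz / f0_reference_hz)


def accent_command_increment(aa_object, aa_reference):
    """Eq. (5): DAa = Aax - Aar."""
    return aa_object - aa_reference


def power_ratio_db(power_object, power_reference):
    """Eq. (6): POWR = 10 log10(Px / Pr), in dB."""
    return 10.0 * math.log10(power_object / power_reference)


def time_warp_percent(duration_object_s, duration_reference_s):
    """Eq. (7): TW(i) = (Tx(i) - Tr(i)) / Tr(i) * 100, in percent."""
    return (duration_object_s - duration_reference_s) / duration_reference_s * 100.0
```

With these definitions, positive values of all four measures indicate that the object speech is higher, stronger, or longer than the reference speech at the corresponding point, which matches the reading of Fig. 1(d) that larger values denote emphasis.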


4.2. Qualitative analysis

While observing the prosodic feature quantities defined in the previous subsection, qualitative analysis was performed according to the classification of prominence listed in Table 1 (Takeda and Ichikawa, 1990). The aims of this qualitative analysis were to determine which feature or combination of features was dominant for prominence, how the F0 features changed with prominence, and what types of prosodic words occurred as a result. We analyzed 43 sentences including all of the prominence types listed in Table 1. A sentence without intended prominence was also included as a reference in each classification. A total of 172 speech samples were analyzed: two sets of the 43 utterances from each of two speakers, a male announcer ("speaker A") and a female announcer ("speaker B"). From the frequency with which the prosodic features occurred in these speech samples (Table 2), we determined the following:

(1) With few exceptions, enhancement of F0 and an increase in power were common features of prominence: about 88% of the samples with prominence increased both F0 and power.

(2) Lengthening of phoneme duration depended on the type of prominence as well as on the speaker. This lengthening was more conspicuous when the object of prominence was smaller. We observed only two types of lengthening: that of a specific phoneme, and that of all of the phonemes in the sentence (i.e., slowing of the speech rate). We could not observe any examples of lengthening of part of the sentence, which we could observe in spontaneous speech (see Section 7). The lengthening of a specific phoneme was usually accompanied by a pause immediately afterward, and this has been called "pre-pausal lengthening" (Lehiste, 1980; Delattre, 1966).

(3) Whether or not a pause was inserted immediately before and/or after the object unit of prominence varied between speakers. Prominence was often achieved without pause insertion. The percentage of pause insertion for speaker A was 50% (20%: pause before the object unit; 24%: pause after the object unit; 6%: both), and that for speaker B was 28% (4%: pause before the object unit; 24%: pause after the object unit).

(4) A typical feature of default prominence was emphasis on the first prosodic word of a declarative sentence. The main factors of this prominence were enhancement of F0 and an increase in power. This enhancement of F0 is known to be common to many languages (Vaissière, 1983). No specific features, however, could be seen by examining temporal structure.

(5) In addition to a known mark for interrogation in Japanese, i.e., an intonational rise in F0

Table 2. Occurrence rate (r (%)) of prosodic features
Column headings: Prosodic features; Default prominence (declarative sentence: head of sentence, end of sentence; end of interrogative sentence); Intended prominence (common case: whole bunsetsu; part of compound word)

F0

Enhancement

100

Inapplicable

100

50 ~
Power

Increase

100

Inapplicable

100

Temporal structure

Phoneme duration

0

0

Pause

0

0

Special case Syllable

Syllable Whole bunsetsu

50 ~ r < 100

50 ~
100

50 ~
50 ~
50 ~ r < 100

50 ~
100

0 < r < 50

Speakerdependent

100

50 ~
0

0 < r < 50

Speakerdependent

Speakerdependent

50 ~

at the end of the sentence, another conspicuous feature of default prominence (see footnote 6) was observed at the end of interrogative sentences. Prominence was formed by a peculiar combination of prosodic features: enhancement of F0 in the end prosodic word (which syllable was acoustically emphasized depended on the accent type), increased power at the emphasized syllable, and lengthening of the end vowel. This combination was seen in each of the interrogative sentences.

Fig. 2 summarizes the results of the qualitative analysis. This analysis confirmed that a syntactic unit (bunsetsu) or a part of it, as an object of prominence, could be mapped into one of several categories of prosodic words and variation types for F0 contours. The power of the object of prominence was also greater than the power of neighboring units. Furthermore, there was a greater tendency to lengthen phoneme duration when the object of prominence was a smaller unit, such as part of a word. The choice between lengthening a specific phoneme and lengthening the entire sentence, however, was arbitrary. When only a specific phoneme was lengthened, a pause usually followed. Another example of prominence induced by varying temporal structure is the lengthening of the vowel at the end of an interrogative sentence. Finally, pause insertion as a means of expressing prominence was also seen to be arbitrary.

6 The words at the end of the interrogative sentences used for this analysis were [okíru], [jowái] and [dʒimída], all of which had an accent on the penultimate syllable. In these words, the accent component and the mark for interrogation were clearly separated: once the F0 value had fallen in accordance with the lexical accent, it rose again as a mark for interrogation. Compared with the declarative sentences without prominence, the magnitudes of the accent commands of the above words in the interrogative sentences were significantly larger. We have therefore provisionally called this enlargement of the accent "default prominence". It must be noted, however, that this apparent emphasis might not be an actual emphasis, and further observation and discussion are needed to interpret this phenomenon more accurately.

[Fig. 2: diagram linking CAUSES AND SYNTACTIC EFFECTS to ACOUSTICAL EFFECTS through QUALITATIVE ANALYSIS. Causes and syntactic effects: 1. Intended prominence (theme, new/old information, discourse focus; location of focus: whole bunsetsu or part of bunsetsu); 2. Default prominence (habit, tone, rhythm, etc.). Acoustical effects: fundamental frequency (1. accent command enlargement; 2. accent command occurrence); power (relative increase in power); temporal structure (1. pause insertion; 2. lengthening of specific phoneme; 3. slowing total speech rate).]

Fig. 2. Summary of qualitative analysis. In the box of ACOUSTICAL EFFECTS, a rectangle pair or a single rectangle on the left of an arrow illustrates one or two accent command(s) in one or two prosodic word(s) when they were uttered without prominence: timings (both ends of rectangles) and magnitudes (heights of rectangles). Rectangles painted with small dots are accent commands or parts of them whose values are to be varied when prominence is placed. In the case of 2 (Accent Command Occurrence), the part where prominence is to be placed is a deaccented part in the neutral utterance. The rectangle-pair counterpart on the right of the arrow illustrates two accent commands in two adjacent prosodic words (1st, 3rd and 4th pairs from the top) or one accent-sandhi type prosodic word (2nd and 5th pairs). The relationships between the left and the right rectangles across the arrow therefore illustrate how the accent commands were varied when prominence was placed based on the neutral utterances.

4.3. Quantitative analysis

To make synthesis rules for producing prominence, it is necessary to quantify the prosodic features described in the previous subsection. In this subsection we therefore describe the results of quantitatively analyzing these features.

4.3.1. Fundamental frequency

For intended prominence, the relative increments in the magnitude of accent commands depended on such factors as the type of accent (e.g., sandhi type or not) and the location of the prosodic word. For example, the mean relative increment for nonsandhi accent in the middle of a sentence was 0.67, whereas the mean increment for sandhi accent was 0.35. For default prominence, on the other hand, this increment depended on the type of sentence. The mean value of this increment for the accent on the first word of a declarative sentence, for example, was 0.14, and that for the end accent of an interrogative sentence was 0.42. However, the increment for the accent on the first word of an interrogative sentence could be regarded as 0.

Figs. 3 and 4 show the actual values of F0 parameters. For these figures, $\Delta DA_a$ is defined as

$$\Delta DA_a = DA_{ap} - DA_{an}, \tag{8}$$

where $DA_{ap}$ denotes the accent command increment defined by Eq. (5) for the object prosodic word, and $DA_{an}$ denotes the increment for whichever of the neighboring prosodic words has the smaller value of $DA_a$. As a measure of default prominence, the difference between the magnitude of the accent command for the first prosodic word, $A_{a1}$, and the magnitude of the accent command for the second prosodic word, $A_{a2}$, was used (Fig. 4(b)). A statistical test showed that this difference was significant.

For the timing of accent commands, no conspicuous features due to prominence were seen. Even for the timing in the middle of an accent sandhi, it may be sufficient to use the timing rule for a prosodic word without prominence.
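The relative increment of Eq. (8) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and arguments are hypothetical, the increments are assumed to have been computed with Eq. (5) beforehand, and a word at the start or end of a sentence is assumed to have only one neighbor.

```python
def relative_accent_increment(da_object, da_left=None, da_right=None):
    """Eq. (8): increment of the object word minus the smaller neighboring increment.

    da_object: DAa (Eq. (5)) of the prosodic word carrying prominence.
    da_left, da_right: DAa of the neighboring prosodic words; pass None
    where no neighbor exists (assumption for sentence-initial/final words).
    """
    neighbors = [d for d in (da_left, da_right) if d is not None]
    if not neighbors:
        raise ValueError("at least one neighboring prosodic word is required")
    # DAan is the increment of whichever neighbor has the smaller DAa.
    return da_object - min(neighbors)
```

Because the smaller neighboring increment is subtracted, a uniform raising of all accent commands in the sentence cancels out, and only the enhancement specific to the object word remains.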

Fig. 3. Relative enhancement of accent command magnitude for prominence of prosodic word in the middle of a sentence (points show mean values and error bars show standard deviations). *N-N: "Nonpause to Nonpause" (no pause occurs in spite of prominence). **N-P: "Nonpause to Pause" change (pause occurs due to prominence). Speech materials: [kareno imo:towa okiru], etc. Number of sentences concerned: 23. Number of speakers concerned: 2. (a) Declarative sentences. (b) Interrogative sentences.

Fig. 4. Prominence of the first prosodic word (points show mean values and error bars show standard deviations). *Declarative sentence. **Interrogative sentence. Speech materials: [kareno imo:towa okiru], etc. Number of sentences concerned: 16. Number of speakers concerned: 2. (a) Intended prominence in declarative sentences. (b) Default prominence in declarative and interrogative sentences.

4.3.2. Power

Fig. 5 shows the relationship between the magnitude of an accent command and the power of the corresponding speech spoken by speaker A. From this figure, power was seen to be strongly correlated with the magnitude of an accent command. This correlation, however, was speaker-dependent, and such a strong correlation could not be seen for speech spoken by speaker B. Speaker B was probably able to control F0 and power independently. The correlation for speech spoken by speaker A can be explained as the compensation effect, i.e., F0 is the primary feature and the power increase reinforces the F0 increase (Vaissière, 1983). If we decide to adopt the characteristic seen in Fig. 5 as a synthesis rule, one of the simplest ways would be to use the natural power increase characteristic of a synthesizer as F0 increases. In other words, as F0 increased, the density of excitation pulses would increase, and this would increase power.

Fig. 5. Power versus the magnitude of an accent command. Black circles: data emphasized by prominence. White circles: nonemphasized data. Speech materials: [imo:towa] in [kareno imo:towa ...] spoken by Speaker A (male). Power was measured for [o] in [... towa]. Number of sentences concerned: 12. Number of speakers concerned: 1. [Horizontal axis: magnitude of accent command $A_a$. Plot annotations: correlation factor ρ = 0.93; root-mean-square error σ = 0.97 dB.]

4.3.3. Temporal structure

Pre-pausal lengthening of phoneme duration and lengthening of the end-phoneme duration in interrogative sentences were noted for both speakers. The mean value of the time warp for pre-pausal lengthening was 66% and the mean time warp for end-phoneme lengthening was 78%. For neither type of lengthening was the difference between speakers A and B significant at the 5% level.

5. Synthesis of speech that includes prominence

Based on the results of the analyses described in the previous section, several rules for producing default prominence and intended prominence were proposed. In this section, we first describe the rules, and then describe evaluation test results.

5.1. Prominence-production rules

Fig. 6 shows the proposed prominence-production rules classified according to the prosodic word types. There are 6 types of rules: 2 for producing default prominence and 4 for producing intended prominence. In this figure, TW denotes the time warp defined by Eq. (7).
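How such rules modify a neutral utterance can be sketched in Python. The sketch below is not the authors' synthesizer: the dictionary keys and function names are hypothetical, the numerical values are the mean increments and time warps reported in Section 4.3, and adding an increment directly to a neutral accent command magnitude is an assumption made for illustration.

```python
# Mean values reported in Section 4.3 (standard deviations are not used here).
MEAN_ACCENT_INCREMENT = {
    "nonsandhi_mid": 0.67,      # intended prominence, nonsandhi accent, mid-sentence
    "sandhi_mid": 0.35,         # intended prominence, sandhi accent, mid-sentence
    "declarative_first": 0.14,  # default prominence, first word, declarative
    "interrogative_end": 0.42,  # default prominence, end accent, interrogative
}
MEAN_TIME_WARP_PERCENT = {
    "pre_pausal": 66.0,         # vowel followed by a pause
    "interrogative_end": 78.0,  # vowel at the end of an interrogative sentence
}


def emphasized_accent_magnitude(neutral_magnitude, rule):
    """Add the mean increment for the chosen rule (assumed additive)."""
    return neutral_magnitude + MEAN_ACCENT_INCREMENT[rule]


def lengthened_duration(reference_duration_s, time_warp_percent):
    """Invert Eq. (7): Tx = Tr * (1 + TW / 100)."""
    return reference_duration_s * (1.0 + time_warp_percent / 100.0)
```

For example, a 0.10 s vowel before an inserted pause would become 0.10 × 1.66 = 0.166 s under the mean pre-pausal time warp of 66%.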

[Fig. 6: table of the six prominence-production rules. Columns: No.; type of prominence (declarative sentence, interrogative sentence, and accent command patterns drawn as rectangles); fundamental frequency ($DA_a$, $DT_1$ (s), $DT_2$ (s)); power P; time ($T_p$ (s), TW (%)). The power column specifies using the natural power increase as F0 increases. Values shown in the figure include $DA_a$ = 0.14 ± 0.13, 0.37 ± 0.16, 0.67 ± 0.16, 0.35 ± 0.19 and 0.42 ± 0.18; $A_a$ = 0.63 ± 0.14; TW = 78 ± 22 and, typically, 72 ± 25; and a pause duration of typically 0.1 to 0.3 s.]

Fig. 6. Prosodic control parameters for producing prominence. Rectangles denote accent commands, and rectangles painted with small dots are accent commands whose values were varied or which occurred due to prominence. In Rule No. 2, the magnitude of the accent command for the first two words can be set to be equal. The symbols denote the following. $DA_a$: difference between the magnitude of the accent command for an accent with prominence ($A_a$) and the magnitude of the accent command for an isolated word ($A_{an}$) (typically, $A_{an}$ is chosen to be 0.2-0.3). $T_1$: onset of the accent command. $T_2$: offset of the accent command. $DT_1$: difference between the onset of the accent command for an accent with prominence and the onset for an accent without prominence. $DT_2$: difference between the offset of the accent command for an accent with prominence and the offset for an accent without prominence. $T_{12}$: onset of the accent sandhi command for prominence (typically $T_{12} = T_{ph}$, where $T_{ph}$ is the corresponding phoneme boundary). $T_p$: duration of the following pause. TW: time warp for a vowel at the end of an interrogative sentence or time warp for a vowel followed by a pause (the values were measured at a speech rate of 8-9 mora/s for the former and 6-9 mora/s for the latter; within these ranges of speech rate, TW has almost no correlation with speech rate). -: inapplicable. Values written in the form μ ± σ denote a mean value μ and a standard deviation σ.

5.2. Listening-test evaluation of the expressibility of prominence

The expressibility of prominence was compared between listening tests using natural speech and tests using rule-synthetic speech.

The same 172 utterances used for the analysis were used as natural speech materials. These utterances consisted of four sets of 43 sentences including all of the prominence types listed in Table 1 and one neutral sentence for each type. The four sets consisted of two sets of utterances spoken by a male announcer (speaker "A") and two sets spoken by a female announcer (speaker "B"). Synthetic speech materials based on 43 sentences were newly synthesized (the same four speech samples were synthesized from one sentence to form 172 samples in total). These sentences were all different from those used for the analysis, but they belonged to the same classifications.

Tests were divided into two sessions: one using the natural speech, and one using the synthetic speech. Ten subjects with Tokyo accents were used for each test. The speech materials in each test were presented in random order. Subjects were asked to decide whether or not the presented speech had prominence; if they decided that it did, they were asked to decide which part of the sentence was emphasized. In addition, they were asked to circle the item "interrogative sentence" on the answer sheet if the speech sounded like an interrogative sentence. They were asked to write down their answers during the 3-second intervals between speech presentations.

The results were evaluated in terms of the "correct-answer rate", which is a measure of the expressibility of prominence. This rate is defined as the ratio of the number of correct answers to the total number of presentations of the text for all the subjects. The correct-answer rate r is thus calculated as

$$r = \frac{R}{N} \times 100 \ (\%), \tag{9}$$

where N is the total number of presentations for all subjects and R is the number of correct answers. A "correct answer" here meant that the answer coincided with the intended prominence.

The mean correct-answer rate for all the natural speech materials was 79.9% and that for synthetic speech was 76.9%. The difference between these values is not significant (p > 0.05). Correct-answer rates were also compared between natural and synthetic speech (Fig. 7). From this figure, we found that prominence placed on the whole bunsetsu (i.e., the common case) in most rule-synthetic speech had correct-answer rates nearly equivalent to those of the corresponding natural speech. On the other hand, prominence placed on part of the bunsetsu (i.e., the special case) in rule-synthetic speech was seen to require further improvement of the rules for expressing prominence.

Fig. 7. Comparison of listening test results between natural and synthetic speech. Black circles denote data for common-case prominence, and white circles denote data for special-case prominence. [Horizontal axis: correct-answer rate for natural speech (%).]

The roughly 20% of cases in which there was no coincidence had a few causes. One major cause was seen to be a mismatch between the realized form of the speaker's intention to place prominence and the listener's sensitivity to prominence. In other words, the speakers sometimes did not emphasize the word or the syllable strongly enough for the listener to feel that it was emphasized. Another cause, which might apply only to synthetic speech, was seen to be the incompleteness of the rules for expressing prominence, as described above. This latter problem leads to the optimization of the rules for improving prominence expressibility in the following section.

When we observed the reference sentences, we found that the first prosodic word was more likely to be perceived as emphasized than the other prosodic words: listeners sometimes perceived the first prosodic word as emphasized and sometimes as nonemphasized, depending on the magnitude of the accent command. This suggests that a reference sentence may have default prominence on the first prosodic word not only in the analytical sense, but also in the perceptual sense.

Intended prominence, however, was more likely to be perceived as emphasized than default prominence.

6. Improvement of prominence expressibility

Speech synthesized by the rules described so far has not always produced the clearest expression of prominence. Even though the common-case prominence (that placed on the whole bunsetsu) in synthetic speech shows an expressibility equivalent to that in natural speech, there is no guarantee that this expressibility is the best. This section therefore describes the use of listening tests to clarify the conditions under which the expressibility of prominence becomes optimal.

[Fig. 8: diagram of F0 contours in boxes connected by arrows. Box and arrow labels: "Without prominence"; PAUSE INSERTION (PRECEDING AND/OR FOLLOWING); SLOWING SPEECH RATE; F0 AND POWER ENHANCEMENT.]

Fig. 8. Forms of various types of prominence. Curves in each box illustrate F0 contours with phrase components, and the entire figure illustrates how the F0 contours are varied when prominence is placed, depending on the type of prominence. The curves in the lower-middle box with the phrase "Without prominence" illustrate an example of the F0 contours of a neutral sentence. The three boxes on the right, led by the black arrows, illustrate the cases of prominence production by pause insertion (from the top: focus between the pauses, focus preceding the pause and focus following the pause). The left box, led by the net-pattern arrow, illustrates the case of prominence production by slowing speech rate. The box immediately above, led by the dotted arrow, illustrates the case of prominence production by F0 and power enhancement (the latter part of the first accent component is emphasized), and the boxes to its right and left illustrate the cases of prominence production by the combination of F0-power enhancement and pause insertion, and of F0-power enhancement and slowing speech rate, respectively.

6.1. Evaluation methods

According to the analysis of prosodic features of prominence and the results of the listening tests that confirmed the effectiveness of the prominence-production rules, the forms of the various types of prominence produced can be summarized as shown in Fig. 8. In this subsection, we describe the listening tests that evaluated which of the forms was superior for prominence expressibility.

Speech for sixty-six sentences synthesized by rule was used as the speech material. These materials included various forms of prominence belonging to each classification of intended prominence. Four different magnitudes of accent commands were used to synthesize speech with different degrees of emphasis. Ten subjects with Tokyo accents were used for the test. Each speech material was presented twice in random order. Subjects were asked to decide whether or not the presented speech had prominence; if they decided that it did, they were asked to decide which part of the sentence was emphasized. In addition, they were asked to circle the item "unnatural" on the answer sheet if the speech sounded unnatural. They were asked to write down their answers during the 3-second intervals between speech presentations.

The results were evaluated in terms of the correct-answer rate as well as the "unnaturalness rate". The correct-answer rate was that defined in Eq. (9). The unnaturalness rate was defined as the ratio of the number of unnatural-sounding presentations to the total number of presentations for all the subjects. The unnaturalness rate u is thus calculated as

$$u = \frac{U}{N} \times 100 \ (\%), \tag{10}$$

where U is the number of unnatural-sounding presentations in N total presentations.
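The two evaluation measures, Eqs. (9) and (10), share the same form and can be sketched together. The function names below are hypothetical, and the counts R, U and N are assumed to have been tallied from the answer sheets of all subjects beforehand.

```python
def percentage_rate(count, total_presentations):
    """Shared form of Eqs. (9) and (10): count / N * 100, in percent."""
    if total_presentations <= 0:
        raise ValueError("total_presentations must be positive")
    if not 0 <= count <= total_presentations:
        raise ValueError("count must lie between 0 and total_presentations")
    return count / total_presentations * 100.0


def correct_answer_rate(n_correct, n_presentations):
    """Eq. (9): r = R / N * 100 (%)."""
    return percentage_rate(n_correct, n_presentations)


def unnaturalness_rate(n_unnatural, n_presentations):
    """Eq. (10): u = U / N * 100 (%)."""
    return percentage_rate(n_unnatural, n_presentations)
```

Both rates are computed over the same N, so a condition can be judged by reading r and u side by side: a high r is only useful for synthesis if u stays low.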

6.2. Test results

6.2.1. Effect of F0 and power enhancement

Only when prominence was placed on the whole bunsetsu could the correct-answer rate reach nearly 100% without losing much naturalness by increasing the magnitude of the accent command (Figs. 9(a) and 9(b)). In other cases, high correct-answer rates were achieved at the cost of naturalness (Figs. 9(c), 9(d) and 9(f)).

6.2.2. Effect of pause insertion

The effect of pause insertion on prominence expressibility became greater as the object of prominence became smaller, that is, in the order of whole bunsetsu, part of a compound word, and a syllable (Fig. 10). Furthermore, when pause insertion was combined with the enhancement of F0 and power, the expressibility of prominence became even greater. Except for the case shown in Fig. 10(d), however, naturalness was not lost as a cost of this combination.

6.2.3. Effect of reducing speech rate

Reducing speech rate did not improve prominence expressibility for any classification.

6.3. Summary of evaluation results

Table 3 summarizes the evaluation results by showing which of the forms of prominence shown in Fig. 8 was superior in terms of prominence expressibility. It also shows the $\Delta DA_a$ values (defined by Eq. (8)) that give the maximum correct-answer rates r (and their corresponding unnaturalness rates u). These results agree with the qualitative analyses described in Section 4. This means that prosodic features important for speech production are also important features in an auditory sense. Furthermore, the maximum values of r were 98% in the common case and 96% in the special case. The expressibility of prominence was thus enhanced by about 20% over the case in which the nonimproved prosodic control parameters were used (in that case, the value was 77%).
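As a quick check of the stated improvement, the gains can be written out from the rates quoted above. Reading "about 20%" as a difference in percentage points between correct-answer rates is an assumption; the source does not spell out how the figure was derived.

```latex
\begin{align*}
  \text{common case:}  \quad & 98\% - 77\% = 21 \text{ percentage points},\\
  \text{special case:} \quad & 96\% - 77\% = 19 \text{ percentage points}.
\end{align*}
```

Both differences lie close to 20 percentage points, which is consistent with the summary figure.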

7. Prosodic features of spontaneous Japanese conversational speech

In this paper, we have so far treated only natural speech uttered by professional announcers reading manuscripts in a sound-proof booth. These types of speech are called "READ speech" or "LABORATORY speech". The prominence-production rules based on read speech have been shown to be effective for synthesizing read sentences, but their effectiveness when applied to conversational speech has not yet been confirmed. This section preliminarily analyzes the prosodic features of a small set of NONREAD Japanese conversational speech and compares them with the features of read sentence speech. We will henceforth call nonread speech "spontaneous speech".

Fig. 9. Effect of F0 and power enhancement. $\Delta DA_a$: relative increment of the accent command. r: correct-answer rate. u: unnaturalness rate. (a) Sentence: [anino amaŋuwa furui] (A brother's raincoat is old). Prominence on the whole bunsetsu [amaŋuwa] (raincoat). Nonsandhi type. (b) Sentence: [aneno amaŋuwa furui] (A sister's raincoat is old). Prominence on the whole bunsetsu [amaŋuwa] (raincoat). Sandhi type (the accent of [aneno] and the accent of [amaŋuwa] form a sandhi). (c) Sentence: [endo:ʃunin aratame endo:katʃo:] (Mr. Endo, formerly assistant manager, has become a manager). Prominence on parts of the compound words, i.e., on [ʃunin] (assistant manager) in [endo:ʃunin] and on [katʃo:] (manager) in [endo:katʃo:]. Nonsandhi type. (d) Sentence: [tanakaʃunin aratame tanakakatʃo:] (Mr. Tanaka, formerly assistant manager, has become a manager). Prominence on parts of the compound words, i.e., on [ʃunin] (assistant manager) in [tanakaʃunin] and on [katʃo:] (manager) in [tanakakatʃo:]. Sandhi type (the accent of [tanaka] and the accent of [ʃunin]/[katʃo:] form a sandhi). (e) Sentence: [kuwabarasan tomo kuwaharasan tomo i:mas(u)] (We can call him either "Mr. Kuwabara" or "Mr. Kuwahara"). Prominence on the syllables [ba] in [kuwabarasan] (Mr. Kuwabara) and [ha] in [kuwaharasan] (Mr. Kuwahara). The originally low accent of the syllables [ba]/[ha] changed to a high accent to express prominence. (f) Sentence: [nakadasan ka nakatasan ka] ("Mr. Nakada" or "Mr. Nakata"). Prominence on the syllables [da] in [nakadasan] (Mr. Nakada) and [ta] in [nakatasan] (Mr. Nakata). The originally high accent of the syllables [da]/[ta] became still higher to express prominence. [In each panel the horizontal axis is $\Delta DA_a$ (0.0 to 1.0) and the vertical axis is a rate in % (0 to 100).]

7.1. Classification of speech according to utterance style

The acoustically produced form of prominence can depend greatly on the utterance style of the speaker, and this style might differ greatly depending on whether speech is obtained by reading a manuscript or by spontaneous speaking. Furthermore, even in read speech, utterance style might differ depending on the contents of the manuscript - for example, whether the manuscript is news, a novel or a drama. To define the types of speech clearly, we therefore first need to classify speech from the viewpoint of utterance style. From this viewpoint, speech might be classified as shown in Table 4.

It must be noted here that there can be speech intermediate between read and spontaneous. For example, a manuscript might initially be uttered awkwardly, but fluently after some training. In this sense, features of speech in a lecture or a drama might differ greatly depending on the speaker's degree of skill. In this paper, we compare only typical spontaneous speech with typical read speech.

One difference between read and spontaneous speech is that the former is uttered by reading a manuscript, whereas the latter is uttered while simultaneously composing sentences in the mind. Well-controlled speech can therefore be obtained easily with read speech. Spontaneous speech, on the other hand, can vary because of its casualness, since it is subject to various environmental influences.

7.2. Speech materials

Before determining speech materials for analysis, we investigated conspicuous features inherent in spontaneous conversational speech by listening to a discussion on coeducation. The discussion was held among a male announcer acting as chairman and one male and one female high-school student. As a result of listening, the following varying features were recognized:

(1) A speaker often spoke thoughtfully while he or she was composing the sentence on the spot. The speech rate was reduced during this part of the speech. (In addition to slower speech, such features as pause insertion, exclamation insertion and stammering were also observed.)

(2) Compared to read speech, a wider variety of forms of prominence was observed - e.g., partial reduction of speech rate in combination with enhancement of F0 and power.

(3) Pre-pausal lengthening and the tendency to increase speech rate at the end of a sentence both became more conspicuous. Such great lengthening had been known to be observed in simulated conversational sentences with four different speaking styles: normal, hurried, angry and polite (Sagisaka and Kaiki, 1991). Our corpus consisted of pure spontaneous speech, so this investigation suggested that greater lengthening at phrase-final or sentence-final positions might be a feature common to many styles of conversational speech.

(4) Speech tended to become coarse - i.e., syntactic omission, euphonic changes, erroneous syllable insertion, etc. were often observed.

Features (1)-(3) mainly concerned temporal structure, whereas the dominant characteristics of feature (4) did not. Furthermore, feature (4) included conspicuous speaker-dependent features: they could hardly be observed in speech uttered by the trained professional announcer, but they were conspicuous in speech uttered by the nonprofessional students. Feature (4) is not desirable when producing rules for synthesizing speech that is easily understood, so in this paper we analyzed features (1)-(3). Five sentences in which temporal-structural features were conspicuous were selected from the above discussion on coeducation and used as speech materials (Table 5).

[Fig. 10: six panels, (a)-(f), comparing r and u (vertical axis, 0-100) for three conditions labeled "Pause", "F0" and "F0 + pause". The panels are annotated with ΔDAa = 0.2 in (a) and (b), ΔDAa = 0.1 in (c) and (d), and ΔDAa = 0.3 in (e) and (f).]

Fig. 10. Effect of pause insertion (preceding + following pauses). ΔDAa: relative increment of the accent command. r: correct-answer rate. u: unnaturalness. Sentences in (a)-(f) correspond to those in Fig. 9.

Table 3. Listening test results (intended prominence). ΔDAa denotes the relative increment of the accent command for the optimal form.

| Classification | Focus | Accent type | Optimal form of prominence | Optimal value of r | Value of u when r is optimal |
|---|---|---|---|---|---|
| Common | Whole bunsetsu | Nonsandhi | F0 and power enhancement (ΔDAa = 0.8) | 100% | 10% |
| Common | Whole bunsetsu | Sandhi | F0 and power enhancement (ΔDAa = 0.6 or 0.8) | 95% | 25% |
| Special (focus on part of bunsetsu) | Part of compound word | Nonsandhi | F0 and power enhancement + preceding-and-following-pause insertion (ΔDAa = 0.1) | 95% | 40% |
| Special (focus on part of bunsetsu) | Part of compound word | Sandhi | F0 and power enhancement (ΔDAa = 0.6) | 95% | 85% |
| Special (focus on part of bunsetsu) | Syllable | Low → high | F0 and power enhancement + preceding-and-following-pause insertion (ΔDAa = 0.3) | 100% | 75% |
| Special (focus on part of bunsetsu) | Syllable | High → high | F0 and power enhancement + preceding-and-following-pause insertion (ΔDAa = 0.3) | 95% | 40% |

Table 4. Classifications of speech from the viewpoint of utterance style.

| Large group | Small group | Examples |
|---|---|---|
| Laboratory speech or read speech | Reading or recitation | News; weather forecast; novel |
| Laboratory speech or read speech | Read conversation | Drama; conversation based on manuscripts |
| Spontaneous speech | Conversation | Dialog; discussion; comic dialog (manzai) |
| Spontaneous speech | Telling | Presentation or lecture without manuscripts; preaching; comic story (rakugo) |

Table 5. Spontaneous speech sentences used for analysis (from discussion on coeducation).

| No. | Speaker | Sentence |
|---|---|---|
| 1 | TM (male high-school student) | [nanka tokitamako: danʃidakentokoe ittemitaina toiu kimo okirundes(u)kedone] |
| 2 | TM (male high-school student) | [jappari kjo:ŋakunotteiukane sonoho:ŋa dʒibuntoʃ(i)tewa iitoomoimas(u)ne] |
| 3 | TM (male high-school student) | [soreoi idʒjo:tekini kandʒirundes(u)jone] |
| 4 | Male announcer | [e:to itsumade kjo:ŋakudattandes(u)ka anatawa] |
| 5 | HS (female high-school student) | [u:n demo ko:ko:wa dʒoseidakenoho:ŋa iindʒanai kaʃira] |
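One way to read Table 3 is as a lookup from focus type and accent type to an optimal prominence form. The sketch below is an illustration only: the key names, the `ProminenceForm` type and the function `enhanced_accent_command` are hypothetical, and the scaling Aa × (1 + ΔDAa) is an assumed reading of "relative increment of the accent command", not a formula stated in this section. Where Table 3 lists two values (0.6 or 0.8), the smaller one is chosen arbitrarily.

```python
from typing import Dict, NamedTuple, Tuple


class ProminenceForm(NamedTuple):
    delta_da: float      # relative increment of the accent command (Table 3)
    insert_pauses: bool  # preceding-and-following-pause insertion


# Values transcribed from Table 3; the string keys are hypothetical labels.
OPTIMAL_FORMS: Dict[Tuple[str, str], ProminenceForm] = {
    ("whole_bunsetsu", "nonsandhi"): ProminenceForm(0.8, False),
    ("whole_bunsetsu", "sandhi"): ProminenceForm(0.6, False),  # table: 0.6 or 0.8
    ("compound_part", "nonsandhi"): ProminenceForm(0.1, True),
    ("compound_part", "sandhi"): ProminenceForm(0.6, False),
    ("syllable", "low_to_high"): ProminenceForm(0.3, True),
    ("syllable", "high_to_high"): ProminenceForm(0.3, True),
}


def enhanced_accent_command(aa: float, focus: str, accent_type: str) -> Tuple[float, bool]:
    """Return the enhanced accent command and whether pauses are inserted.

    Assumption: the relative increment scales the reference accent
    command multiplicatively, i.e. aa * (1 + delta_da).
    """
    form = OPTIMAL_FORMS[(focus, accent_type)]
    return aa * (1.0 + form.delta_da), form.insert_pauses


# Example: a reference accent command of 0.5 with focus on a whole
# nonsandhi bunsetsu.
new_aa, pauses = enhanced_accent_command(0.5, "whole_bunsetsu", "nonsandhi")
```

Keeping the table as data rather than as branching logic makes it easy to swap in other parameter values if further listening tests suggest different optima.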

7.3. Analysis results

In this subsection, we describe several features of actual spontaneous conversational speech, focusing on temporal structure. Specifically, we discuss the relationships between temporal structure and F0, and we identify the differences from read speech. The analysis methods used here were those described in Sections 3 and 4.

7.3.1. Time-varying patterns of phoneme duration in spontaneous conversational speech

Fig. 11 shows an example of time-varying patterns of phoneme duration for spontaneous sentence speech (uttered by a male high-school student (TM)). This figure also shows the mean mora duration Lm (s/mora) per prosodic phrase. To eliminate the effect of pre-pausal lengthening, however, mora duration values at the end of a sentence and for a syllable (or a mora) preceding a pause were not included in the calculation of Lm.

[Fig. 11: phoneme duration (s) plotted along the utterance [nanka tokitamako: danʃidakentokoe ittemitaina toiu kimo okirundeskedone], glossed as "Well, (I) sometimes come to feel like going (to a high school) where there're only boys." Pauses and stretches of "thinking" are marked. Annotations: "Reduction of speech rate where prominence is placed" and "Increase in speech rate at the end of a sentence". Annotated values of Lm: 0.082, 0.169, 0.091, 0.081 and 0.067 s/mora.]

Fig. 11. Time-varying patterns of phoneme duration in spontaneous conversational speech. Lm: mean mora duration per prosodic phrase. Sentence No. 1 (speaker: male high-school student TM).

The following features were seen in these data:

(1) The mean mora duration Lm of prosodic phrases where prominence was placed was longer than that of phrases without prominence - i.e., speech rate was reduced to produce prominence.

(2) The mean mora duration Lm of prosodic phrases at the end of sentences was usually, but not always, shorter than that of other phrases - i.e., speech rate usually increased at the end of sentences. An exception can be seen in a sentence where the subject was inverted and came at the end (Sentence No. 4).

(3) The following features were found in parts where the speaker seemed to speak while thinking:
a. Conspicuous pre-pausal lengthening was observed (e.g., [ko:], [toiu] and [kimo] in Fig. 11).
b. Pauses were inserted (e.g., the same as above).
c. Erroneous syllable insertion or stammering was observed (e.g., [kjo:gakunotteiukane] in Sentence No. 2 and [i idʒjo:tekini] in Sentence No. 3).
d. An exclamation was inserted (e.g., [e:to] in Sentence No. 4).
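The calculation of Lm described above can be sketched as follows. This is a minimal illustration, not the analysis program used for this paper: the `Mora` record, its field names and the example durations are all hypothetical. It only shows the exclusion of sentence-final and pre-pausal moras before averaging.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mora:
    duration: float               # mora duration in seconds
    pre_pausal: bool = False      # mora immediately precedes a pause
    sentence_final: bool = False  # mora is at the end of the sentence


def mean_mora_duration(phrase: List[Mora]) -> float:
    """Mean mora duration (s/mora) of one prosodic phrase.

    Sentence-final and pre-pausal moras are left out so that
    pre-pausal lengthening does not inflate the mean.
    """
    kept = [m.duration for m in phrase
            if not (m.pre_pausal or m.sentence_final)]
    if not kept:
        raise ValueError("no moras left after exclusion")
    return sum(kept) / len(kept)


# Hypothetical phrase: three ordinary moras and one lengthened
# pre-pausal mora, which is excluded from the mean.
example_phrase = [
    Mora(0.08),
    Mora(0.10),
    Mora(0.09),
    Mora(0.25, pre_pausal=True),
]
lm = mean_mora_duration(example_phrase)
```

Without the exclusion, the lengthened final mora would dominate the mean of a short phrase, which is why the text removes those positions before comparing phrases.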

[Fig. 12: two scatter plots, (a) and (b), of mean mora duration (vertical axis, 0.00-0.20) against the magnitude of the accent command Aa (horizontal axis, 0.0-1.0).]

Fig. 12. Mean mora duration versus the magnitude of an accent command. Black circles: data emphasized by prominence. White circles: nonemphasized data (data for prosodic phrases at the end of sentences were excluded). White triangles: nonemphasized data for prosodic phrases at the end of declarative sentences. (a) Data for spontaneous conversational speech (discussion on coeducation). (b) Data for read speech (scientific sentences).

7.3.2. Relationship between speech rate and the magnitude of accent commands

For the same five sentences analyzed in the preceding subsection, Fig. 12(a) shows the relationship between the mean mora duration Lm and the magnitude Aa of the accent command of F0. Data obtained from speech uttered while thinking and from sentences in which the subject was inverted and came at the end of the sentence are not included in this figure. The values of Lm and Aa tended to be larger for prosodic phrases where prominence was placed than for phrases without prominence. The Lm and Aa values for prosodic phrases at the end of sentences were indistinguishable from those of other prosodic phrases without prominence; taken as a whole, no conspicuous features could be observed there either in F0 or in temporal structure (although, as mentioned in the previous subsection, the speech rate tended to increase at the end within each sentence).

For read sentence speech (scientific sentences), Fig. 12(b) shows the relationship between the mean mora duration Lm and the magnitude Aa of the accent command of F0. The speaker was a female in her 20s, born and raised in Tokyo. In contrast to the spontaneous conversational speech shown in Fig. 12(a), conspicuous temporal-structural features were not observed either at the end of the sentences or even in parts where prominence was recognized to be placed.

The results presented in this subsection show that temporal-structural features in parts where prominence was recognized to be placed were more conspicuous in spontaneous conversational speech than in read sentence speech.

7.3.3. Comparison with read conversational speech

Here, we discuss several results relevant to whether or not the prosodic features of prominence observed in read conversational speech (and shown in Fig. 8) can also be observed in spontaneous conversational speech.

(1) Fundamental frequency: Enhancement of F0 seems to be one of the prosodic features of prominence common to read and spontaneous conversational speech. This is because the magnitudes of the accent command Aa for the object of prominence are usually larger than those for speech where prominence is not placed.

(2) Pause: Pauses are also likely to be prosodic features of prominence shared by read and spontaneous conversational speech. This is because pauses are often inserted immediately before or after the object of prominence. Because many of these pauses are inserted to gain thinking time when speaking, however, it is not clear whether or not they are inserted solely as a means of producing prominence. The pre-pausal lengthening observed in read speech was also observed in spontaneous conversational speech, and the degree of lengthening was much greater than that in read speech. This lengthening also seems to occur during thought (see Section 7.3.1).

(3) Speech rate: Unlike read speech, in which the overall speech rate was reduced, local reduction of speech rate (in parts where prominence was placed) seems to be a feature only of spontaneous conversational speech.
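The kind of comparison made in Section 7.3.2 can be sketched as a comparison of group means of Lm and Aa for phrases with and without prominence. The numbers below are hypothetical placeholders chosen only to make the example run; they are not measurements from Fig. 12, and the function name `group_means` is likewise an assumption.

```python
from statistics import mean
from typing import List, Tuple

# Each row: (Lm in s/mora, Aa, prominence placed?). Hypothetical values.
Row = Tuple[float, float, bool]

phrases: List[Row] = [
    (0.15, 0.8, True),
    (0.17, 0.9, True),
    (0.08, 0.4, False),
    (0.09, 0.3, False),
    (0.07, 0.5, False),
]


def group_means(rows: List[Row], prominent: bool) -> Tuple[float, float]:
    """Return (mean Lm, mean Aa) over phrases with the given prominence flag."""
    selected = [(lm, aa) for lm, aa, flag in rows if flag == prominent]
    if not selected:
        raise ValueError("no phrases in the requested group")
    return (mean(lm for lm, _ in selected),
            mean(aa for _, aa in selected))


lm_prom, aa_prom = group_means(phrases, prominent=True)
lm_plain, aa_plain = group_means(phrases, prominent=False)

# For spontaneous speech the text reports that both quantities tend to be
# larger where prominence is placed; with real data this check may fail
# for individual speakers or sentences.
both_larger = lm_prom > lm_plain and aa_prom > aa_plain
```

A scatter plot such as Fig. 12 carries more information than two means, so a summary like this is best treated as a quick check rather than a replacement for inspecting the distribution.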

8. Conclusions

Prosodic features of prominence have been analyzed to develop synthesis rules for unrestricted Japanese sentence speech that is easily understood. Before this analysis, several measures have been introduced to quantify prosodic features of prominence.

Through qualitative analysis of read speech, the relationships between intention and the acoustical production of prominence have been clarified. Concerning F0, it has been clarified that both the default and intended types of prominence are achieved by using several common prosodic word expressions. Furthermore, it has also been shown that a specific combination of prosodic features forms each type of prominence. As for the temporal structure, no examples of lengthening of phoneme duration have been observed in the emphasized parts of the sentences except for some special cases. One exception is lengthening accompanied by pause insertion as a mark of prominence, and another is a slowing of the overall speech rate. Whether these features are speaker-independent or not cannot be determined only from the analysis results within the range of the corpus described in this paper. We must further extend the range of the corpus to draw more general conclusions in this respect.

Qualitative and quantitative analysis results have shown that the realized forms of all types of prominence are common to many other languages - that is, enhancement of F0, increase of power, and lengthening of segmental duration found in some special cases (Vaissière, 1983; Terken, 1991). While F0 is likely to carry information primarily on emphasis or prominence in most stress-accent languages such as English (Hirschberg, 1992) and Dutch (Quené and Kager, 1992), F0 primarily carries lexical information in many dialects of Japanese. For example, "HAshi" means "chopsticks" and "haSHI" means "bridge" in the Tokyo dialect. The rise and fall of F0 carry no information on emphasis in this case. In most cases, prominence in pitch-accent languages (e.g., the Tokyo dialect of Japanese) may be achieved by emphasizing the lexical pitch accent.

In the quantitative analysis, the framework gained from this qualitative analysis has been used to calculate the control parameter values of each prosodic quantity needed to produce prominence. Furthermore, a set of the control parameters - that is, the prominence-production rules - has been introduced into a rule-based speech synthesis system, and listening tests have been performed to confirm the effectiveness of these rules. The test results have not shown any significant difference in expressibility between prominence synthesized by rule and prominence in natural speech. These results, however, do not guarantee that rule-synthetic speech produces the best expressibility of prominence. Listening tests have therefore been used to clarify the conditions under which prominence expressibility becomes optimal. These test results have shown that the expressibility of prominence is enhanced by about 20% in comparison with the case in which the nonimproved prosodic control parameters are used.

Finally, prosodic features associated with prominence in nonread (spontaneous) Japanese conversational speech have been preliminarily analyzed and compared with the prominence features of read sentence speech. Speech-rate reduction in parts where prominence is placed is more conspicuous in spontaneous conversational speech than in read speech. To develop prominence-production rules for spontaneous speech, more data and further analysis will be needed. To achieve more human-like speech, we also need to study those aspects of prosody that are not associated with prominence, and we need to learn more about how paralinguistic and nonlinguistic information is expressed.

9. Acknowledgments

We are indebted to Professor Hiroya Fujisaki at the Science University of Tokyo and Professor Keikichi Hirose at the University of Tokyo for their invaluable comments and discussions, and for their help in obtaining references. We are also grateful to Mr. Zenji Tsutsumi, Dr. Yoshito Tsunoda and Dr. Shigeo Nagashima, former managers of the 6th Department of Hitachi's Central Research Laboratory, for their help and encouragement, to Mr. Frank Wallerstein at Hitachi's Central Research Laboratory for his help in translating the abstract into German, and to the two reviewers for their helpful comments.

10. References

G. Bruce and P. Touati (1991), "On the analysis of prosody in spontaneous speech with exemplification from Swedish and French", Proc. ESCA Workshop, Barcelona, 13.
P. Delattre (1966), "A comparison of syllable length conditioning among languages", Internat. Rev. Appl. Linguist., Vol. 4, pp. 183-198.
C.A. Fowler and J. Housum (1987), "Talkers' signaling of "new" and "old" words in speech and listeners' perception and use of the distinction", J. Memory and Language, Vol. 26, pp. 489-504.
H. Fujisaki and K. Hirose (1984), "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", J. Acoust. Soc. Jpn. (E), Vol. 5, No. 4, pp. 233-242.
H. Fujisaki and H. Kawai (1988), "Realization of linguistic information in the voice fundamental frequency contour of the spoken Japanese", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process.-88, New York City, S14.3, pp. 663-666.
E. Gårding (1982), "A comparative study of intonation", Preprints of Papers for the Working Group on Intonation at The XIIIth International Congress of Linguists, Tokyo, 31 August 1982, organized by H. Fujisaki and E. Gårding, 9, pp. 85-94.
K. Hakoda and H. Sato (1980), "Prosodic rules in connected speech synthesis", Systems, Computers, Controls, Scripta Electronica Japonica 3, Vol. 11, pp. 28-37.
J.L. Hieronymus and B.J. Williams (1991), "A comparison of the prosody in read speech and directed monologue in British English", Proc. ESCA Workshop, Barcelona, 32.
K. Hirose, H. Fujisaki and H. Kawai (1986), "Generation of prosodic symbols for rule-synthesis of connected speech of Japanese", Proc. IEEE-IECEJ-ASJ ICASSP-86, Tokyo, 45.4, pp. 2415-2418.
K. Hirose, H. Kawai and H. Fujisaki (1988), "Synthesis of prosodic features of Japanese sentences", Proc. The Second Symposium on Advanced Man-Machine Interface Through Spoken Language, 3.
J. Hirschberg (1992), "Using discourse context to guide pitch accent decision in synthetic speech", in Talking Machines: Theories, Models, and Designs, ed. by G. Bailly, C. Benoit and T.R. Sawallis (Elsevier, Amsterdam), pp. 367-376.
Y. Kitahara, S. Takeda, A. Ichikawa and Y. Tohkura (1988), "Role of prosody in cognitive process of spoken language", Systems and Computers in Japan, Vol. 19, No. 11, pp. 53-61.
H. Kubozono (1993), The Organization of Japanese Prosody, Studies in Japanese Linguistics, Vol. 2, ed. by Masayoshi Shibatani, Series Editor (Kurosio Publishers, Tokyo).
A. Kurematsu, H. Iida, T. Morimoto and K. Shikano (1991), "Language processing in connection with speech translation at ATR Interpreting Telephony Research Laboratories", Speech Communication, Vol. 10, No. 1, pp. 1-9.
I. Lehiste (1980), "Phonetic manifestation of syntactic structure in English", Open Lecture at The University of Tokyo.
B. Lindblom, S. Brownlee, B. Davis and S.J. Moon (1991), "Speech transforms: On the extent, systematic nature and functional significance of phonetic variation", Proc. ESCA Workshop, Barcelona, 2.
A.I.C. Monaghan (1989), "Phonological domains for intonation in speech synthesis", Proc. Eurospeech 89, Paris, September 1989, Vol. 1, pp. 502-505.
A.I.C. Monaghan (1992), "Heuristic strategies for the higher-level analysis of unrestricted text", in Talking Machines: Theories, Models, and Designs, ed. by G. Bailly, C. Benoit and T.R. Sawallis (Elsevier, Amsterdam), pp. 143-161.
H. Quené and R. Kager (1992), "The derivation of prosody for text-to-speech from prosodic sentence structure", Computer Speech and Language (Academic Press, London), 6, pp. 77-98.
J. Rischel (1991), "Formal linguistics and real speech", Proc. ESCA Workshop, Barcelona, 6.
Y. Sagisaka (1990), "On the prediction of global F0 shape for Japanese text-to-speech", Proc. Internat. Conf. Acoust. Speech Signal Process. 90, Albuquerque, NM, S6a.9, pp. 325-328.
Y. Sagisaka and N. Kaiki (1991), "Prosody control for spontaneous speech synthesis", Internat. Cong. of Phonetics Sciences, Aix en Provence, Vol. 3, pp. 506-509.
Y. Sagisaka and H. Sato (1984), "Accentuation rules for Japanese text-to-speech conversion", Review of the Electrical Communication Laboratories, Vol. 32, No. 2, pp. 188-199.
K. Shirai, K. Iwata and T. Ohno (1986), "Pitch contour control in Japanese conversational speech", Proc. IEEE-IECEJ-ASJ ICASSP-86, Tokyo, 38.12, pp. 2043-2046.
S. Takeda and A. Ichikawa (1990), "Analysis of prosodic features of prominence in spoken Japanese sentences", Proc. ICSLP 90, Kobe, 12.3, pp. 493-496.
J. Terken (1991), "Fundamental frequency and perceived prominence of accented syllables", J. Acoust. Soc. Amer., Vol. 89, No. 4, Pt. 1, pp. 1768-1776.
J. Terken and S.G. Nooteboom (1987), "Opposite effects of accentuation and deaccentuation on verification latencies for given and new information", Language and Cognitive Processes, Vol. 2, Nos. 3/4, pp. 145-163.
J. 't Hart (1982), "The stylization method applied to British English intonation", Preprints of Papers for the Working Group on Intonation at The XIIIth International Congress of Linguists, Tokyo, 31 August 1982, organized by H. Fujisaki and E. Gårding, 3, pp. 23-33.
J. 't Hart (1991), "F0 stylization in speech: Straight lines versus parabolas", J. Acoust. Soc. Amer., Vol. 90, No. 6, pp. 3368-3370.
N.G. Thorsen (1980), "A study of the perception of sentence intonation - Evidence from Danish", J. Acoust. Soc. Amer., Vol. 67, No. 3, pp. 1014-1030.
N.G. Thorsen (1982), "Sentence intonation in Danish", Preprints of Papers for the Working Group on Intonation at The XIIIth International Congress of Linguists, Tokyo, 31 August 1982, organized by H. Fujisaki and E. Gårding, 6, pp. 47-56.
N.G. Thorsen (1985), "Intonation and text in Standard Danish", J. Acoust. Soc. Amer., Vol. 77, No. 3, pp. 1205-1216.
J. Vaissière (1983), "Language-independent prosodic features", in Prosody: Models and Measurements, ed. by A. Cutler and D.R. Ladd (Springer, Berlin), pp. 53-66.
