Computer Speech and Language (1996) 10, 155–185
Prediction of abstract prosodic labels for speech synthesis

K. Ross and M. Ostendorf
Electrical and Computer Engineering Department, Boston University, 44 Cummington St., Boston, MA 02215, U.S.A.
Abstract

Higher quality speech synthesis is required to make text-to-speech technologies useful in more applications, and prosody is one component of synthesis technology with the greatest need for improvement. This paper describes computational models for the prediction of abstract prosodic labels for synthesis—accent location, symbolic tones and relative prominence level—from text that is tagged with part-of-speech labels and marked for prosodic constituent structure. Specifically, the model uses multiple levels of a prosodic hierarchy and at each level combines decision tree probability functions with Markov sequence assumptions. An advantage of decision trees is the ability to incorporate linguistic knowledge in an automatic training framework, which is needed for building systems that reflect particular speaking styles. Studies of accent and tone variability across speakers are reported and used to motivate new evaluation metrics. Prediction experiments show an improvement in accuracy of prominence location prediction over simple decision trees, with accuracy similar to the level of variability observed across speakers.

© 1996 Academic Press Limited
1. Introduction

Text-to-speech technology is currently useful in only a few applications because of the lack of high quality speech synthesis, and the goal of this research is to improve synthesis quality by improving the prosody control mechanism. Prosody aids the listener in interpreting an utterance, by grouping words into larger information units and drawing attention to specific words via relative differences in prominence. Prosody is one of the least developed parts of existing systems for converting text into speech, according to Klatt (1987), Collier (1991) and others. Moreover, Silverman (1993) has shown that a domain-specific model of prosody can significantly improve the comprehension of synthesized speech.

Linguistic theory supports the practice of modeling prosody by predicting symbolic markers from text (for prosodic phrase boundaries, accents, and the ‘‘tones’’ or pitch events that mark them), and then using these markers to control duration, fundamental
[Figure 1. Illustration of inputs and outputs of the abstract intonation prediction system: text analysis supplies the lexical structure of words and the prosodic phrase structure; abstract intonation prediction outputs symbolic tones and abstract prominence, which drive acoustic signal generation to produce speech.]
frequency, energy and segmental characteristics that represent the acoustic realization of prosodic structure. Working within this framework, illustrated in Fig. 1, this paper addresses the problem of abstract accent and tone prediction, which requires various types of text analysis as input. We assume the existence of a text processing module that can provide lexical information (phonemic representations and lexical stress), part-of-speech tags, prosodic phrase structure (location of intermediate and full intonational phrase boundaries) and, optionally, additional linguistic annotation. Other research efforts have addressed these problems with relative success, e.g. Wang and Hirschberg (1992) and Ostendorf and Veilleux (1994) for prediction of prosodic phrases. We also assume that the predicted labels can be easily used in an existing speech synthesizer, which is the case for the Bell Laboratories synthesizer that we used in perceptual experiments.

Early approaches to automatic accent placement made accent decisions solely on the basis of the content/function word distinction, and consequently tended to accent too many words in longer sections of text. Function words, such as articles, prepositions, and pronouns, were designated as closed class and de-accented. Content words, such as nouns and verbs, were designated as open class and accented on their main-stress syllable. For the London–Lund corpus of spoken English, Altenberg (1987) found that accent could be predicted with 57% accuracy based on the content/function word classification alone. While higher accuracy is observed for our radio news corpus, it is still clear that a more sophisticated approach is needed. More recent systems by Quené and Kager (1989), Monaghan (1990), and Dirksen and Terken (1991) use rules to place default accents based upon word class, syntactic constituency, and surface position, allowing discourse information to overrule the defaults if it is available.
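The early content/function-word baseline just described can be sketched in a few lines. This is our own illustrative reconstruction, not a published algorithm: the part-of-speech tag set and the word representation are assumptions.

```python
# Hedged sketch of the content/function-word accenting baseline: function
# words are de-accented, content words are accented on their main-stress
# syllable. The function-word tag list (Penn Treebank style) and the input
# format are illustrative assumptions.
FUNCTION_POS = {"DT", "IN", "PRP", "CC", "TO", "MD"}  # determiners, prepositions, pronouns, ...

def baseline_accents(words):
    """words: list of (word, pos_tag, main_stress_syllable_index).
    Returns, per word, the accented syllable index or None for no accent."""
    return [stress if pos not in FUNCTION_POS else None
            for _, pos, stress in words]
```

For example, `baseline_accents([("the", "DT", 0), ("cat", "NN", 0), ("sat", "VBD", 0)])` de-accents only the article.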
Hirschberg (1993) proposes a relatively simple discourse model that is integrated into a classification tree accent prediction algorithm, achieving 77–85% accent prediction accuracy at the word level for a variety of speech styles. More sophisticated use of discourse information is possible in speech generation from concept (also called message-to-speech synthesis) where knowledge of focus structure is available (e.g. House & Youd, 1991; Monaghan, 1994).

Other issues related to accent placement include prediction of the level of prominence and specific pitch accent type, but little work has been reported on these problems. Early work on synthesis in the MITalk system assigned different fundamental frequency (F0) peak heights according to a hierarchy of part-of-speech classes (Allen et al., 1987). Assuming only simple text analysis, Pierrehumbert (1987) described a set of heuristics that used alternating values of 0·7 and 0·4 to scale F0 targets and a 1·0 scaling factor
for the nuclear accent. Silverman (1987) proposed a somewhat larger rule set that is a function of the number of accents in the intonational phrase and uses an exponential scaling factor. Silverman also provides some simple rules for tone label assignment, e.g. using only high tones for pitch accents, that we will use as a baseline for comparing the algorithms reported here. Again, if discourse information is available, more complex tone label assignment is possible (Prevost & Steedman, 1994; Black & Campbell, 1995).

Our approach is to put accent placement, tone label and prominence level prediction together in a multi-level hierarchy. Like Hirschberg (1993) and Black and Campbell (1995), we use decision trees because they can be automatically trained to learn different speaking styles and they can accommodate a range of inputs, from simple text analyses for the problem of synthesis from unrestricted text to more detailed discourse information that may be available as a by-product of text generation. However, we extend Hirschberg’s (1993) work by modeling accents at the syllable level to capture early and double accent phenomena, and we differ from previous decision tree work more generally by using the decision trees as probability mappings in combination with a Markov sequence assumption. In other words, rather than predict a prosodic label at each time step, we generate probabilities and delay the prediction decision until the end of a phrase, maximizing the likelihood of the sequence of labels as a whole. In experiments reported here, the models are trained on the Boston University radio news corpus (Ostendorf et al., 1995), using features such as part-of-speech labels, dictionary markings for lexical stress, prosodic phrase boundaries, and syllable position within words and phrases.
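The tree-plus-Markov idea can be sketched as a Viterbi search over a phrase: a decision tree supplies P(label | features) at each syllable, a bigram model supplies P(label_t | label_{t-1}), and the jointly most likely sequence is chosen at the end of the phrase. All probabilities below are made-up illustrations, not the paper's trained models.

```python
# Hedged sketch: Viterbi decoding over per-syllable decision-tree
# posteriors combined with a Markov (bigram) label model, as described
# in the text. Probabilities here are invented for illustration.
import math

LABELS = ("unaccented", "accented")

def best_label_sequence(tree_probs, trans):
    """tree_probs: per-syllable dicts {label: P(label | features)}.
    trans: {(prev, cur): P(cur | prev)}. Returns the most likely sequence.
    (For simplicity, the first syllable uses only its tree posterior.)"""
    scores = {lab: math.log(tree_probs[0][lab]) for lab in LABELS}
    backpointers = []
    for probs in tree_probs[1:]:
        new_scores, pointers = {}, {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: scores[p] + math.log(trans[(p, cur)]))
            new_scores[cur] = (scores[prev] + math.log(trans[(prev, cur)])
                               + math.log(probs[cur]))
            pointers[cur] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    seq = [max(scores, key=scores.get)]
    for pointers in reversed(backpointers):
        seq.append(pointers[seq[-1]])
    seq.reverse()
    return seq
```

With transition probabilities that penalize adjacent accents (an accent-clash-style constraint), the decoder can de-accent a syllable whose tree posterior alone would favour an accent, which is exactly what per-syllable prediction cannot do.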
The abstract prosodic label prediction algorithms are evaluated quantitatively by comparing the predicted labels to the observed labels for multiple versions of four test news stories and measuring differences between the predicted and observed versions. In addition, the algorithms are evaluated in perceptual experiments.

The body of this paper is organized as follows. Section 2 reviews some factors that affected our model design, including results from research in linguistics and our own corpus analysis. Section 3 discusses the algorithms used for predicting the location and values of the abstract prosodic labels. Section 4 presents the resulting trees and corresponding prediction performance, and discusses the results of perceptual experiments. Finally, Section 5 summarizes the results and discusses their implications for further research.

2. Factors affecting model design and evaluation

To support the particular choice of abstract prosodic labels and the selection of prediction variables, this section briefly reviews the linguistic theory of intonation and supporting empirical studies, particularly for American English. In addition, the radio news corpus used for this research is briefly described, followed by a summary of a study on variability that impacts the evaluation methodology.

2.1. Linguistic theory: symbolic intonation patterns

In this paper, we face the problem that the terms ‘‘stress’’ and ‘‘accent’’ are not used consistently in the linguistics literature, so we will begin by establishing some terminology. Here, lexical stress refers to the primary, secondary and unstressed designations given to syllables in isolated words in a dictionary. An accented syllable is any syllable associated with a pitch accent, i.e. a prominence-lending intonational pattern that is
usually lined up with a particular syllable. Prominence (or a prominent syllable) occurs when a syllable is perceived as stronger or more salient than other syllables in the utterance. The acoustic correlates of this perceived salience are currently a topic of investigation; here the focus will be largely on prominence as it is associated with pitch accents, both pre-nuclear and nuclear accents. A nuclear accent is usually the last and most perceptually prominent accent in a phrase.

2.1.1. Phonological representation of intonation

According to Selkirk (1984), Beckman (1986) and others, the levels of prominence in English can be organized into a hierarchical prosodic structure that arises from varying degrees of prominence imparted to the syllables by the lexical stress and accents. Within words and phrases, some syllables are more prominent than others. A hierarchy of relative prominence can be defined, where unstressed syllables are at the lowest level, nuclear accent is at the highest level, and each level builds on the level below it. For example, only syllables with lexical stress can receive a pitch accent. Some theories associate the different levels of prominence with different levels of constituent structure, and patterns of relative prominence with the strongest prominence at a given level being associated with the right-most constituent. Although it has been postulated that different levels of prominence are cued by different acoustic features (Vanderslice & Ladefoged, 1972; Gussenhoven, 1991), certainly at the phrase level several features combine to cue prominence, including energy, fundamental frequency, duration, and vowel quality.

Most theories of intonation provide for symbolic tonal markers of both accented syllables and phrase boundaries.
In the theory of Pierrehumbert and colleagues (Pierrehumbert, 1980; Beckman & Pierrehumbert, 1986) for American English, there are six types of tones that mark accented syllables: two simple accents (high and low tones) and four bi-tonal accents (made up of highs and lows and defined according to how they align to the syllable). In addition, there are two distinct types of phrase tones, each of which can take on high or low values: the phrase accent and the boundary tone, which mark the edge of intermediate and intonational phrases, respectively. Intonational phrases are formed from one or more intermediate phrases. Phrase tones differ from accents in their means of aligning to prosodic constituents; they belong to a whole phrase instead of a particular syllable and usually align to the right boundaries of the phrase. In theory, all intermediate phrases should have at least one pitch accent, and the intermediate phrase is the domain of downstep, the sequential reduction of high accent peaks.

Occasionally, accentuation seems to rearrange the natural stress pattern of a word. Lexically unstressed syllables cannot be accented, but many secondary stress syllables can be accented and thus can be made stronger than syllables later in the word that would ordinarily be the strongest. This phenomenon, known as ‘‘stress shift’’ or early accent placement, places a pitch accent on a syllable earlier than the main-stress syllable (as with the word ‘‘Massachusetts’’ in the phrase ‘‘MASSachusetts MIRacle’’ vs. ‘‘in MassaCHUsetts’’). Candidate words for this phenomenon are all multi-syllable words with main stress late in the word and an unreduced syllable earlier in the word. Such patterns were observed frequently in our radio news database and are discussed further in Shattuck-Hufnagel et al. (1995).

Most listeners have the intuition that there are variations in the relative level of
prominence of an accented syllable that are separate from the distinctions between accents associated with tones. Moreover, speech synthesized with multiple levels of accents produces a more natural sounding utterance. Terken (1996) argues that perceptual experiments show that listeners judge relative prominence primarily in terms of F0 peak and in the context of the local phrase pitch range. In addition, he provides evidence from spontaneous speech data that there is variation in prominence level within a phrase, i.e. separate from the variation across phrases that might be a consequence of phrase-level pitch range differences. This conclusion is supported by the usefulness of relative prominence scaling when range is accounted for (Silverman, 1987) and by our own accent clustering experiments (Ross, 1995). Whether this variation is continuous or associated with discrete categories, as suggested by Ladd (1993), remains an open question.

This theory has influenced our model structure primarily in the choice of units predicted (accents and tone labels), the syllable-level representation (to enable the prediction of phenomena such as early accent placement within the word and double accents), and the use of intermediate and full intonational phrase structure to define domains of dependent sequences of accents. A hierarchical structure is represented explicitly with three levels—syllable, accent and phrase, as described in Section 3.1—but also implicitly through the choice of features used in the decision tree.

2.1.2. ToBI labeling system

Prosodic labels for this work are based upon the ToBI (Tones and Break Indices) labeling system that was developed in a series of multi-site, collaborative workshops (Silverman et al., 1992; Beckman & Ayers, 1994; Pitrelli et al., 1994). The ToBI system consists of multiple tiers, with each tier containing time-aligned symbols representing the prosodic events in an utterance.
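The ToBI tone-tier label strings follow simple surface conventions (pitch accents contain a star, phrase accents end in ‘‘-’’, boundary tones end in ‘‘%’’), so the three kinds of events can be told apart mechanically. The sketch below is our own illustration, not part of any ToBI tool, and uses the ASCII ‘‘*’’ in place of the printed star.

```python
# Hedged sketch: classifying ToBI tone-tier label strings by their
# surface form. The function name and return values are our own.
def tone_kind(label):
    if label.endswith("%"):         # boundary tone, e.g. "H%", or combined "L-H%"
        return "boundary" if "-" not in label else "phrase+boundary"
    if label.endswith("-"):         # phrase accent, e.g. "H-", "!H-"
        return "phrase_accent"
    if "*" in label:                # pitch accent, e.g. "H*", "L+H*", "H+!H*"
        return "pitch_accent"
    raise ValueError(f"unrecognized ToBI tone: {label}")
```

For instance, `tone_kind("L-H%")` returns `"phrase+boundary"`, reflecting the fact that a full intonational phrase boundary carries both a phrase accent and a boundary tone.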
The break index tier specifies prosodic constituent structure using a subset of the markers in Price et al. (1991), and the tone tier contains a subset of Pierrehumbert’s tone labels (Pierrehumbert, 1980; Beckman & Pierrehumbert, 1986). Every intermediate phrase (break index 3 or 4) is assigned an ‘‘H-’’, ‘‘!H-’’ or ‘‘L-’’ marker corresponding to a final high, downstepped or low tone. An intonational phrase (break index 4) has a final boundary tone marked by either an ‘‘L%’’ or an ‘‘H%’’. Since intonational phrases are composed of one or more intermediate phrases plus a boundary tone, full intonational phrase boundaries have two final tones, e.g. ‘‘L-H%’’ for a continuation rise.

Pitch accent tones are marked at every accented syllable, or about one third of the syllables in this corpus. Accent types include: a peak accent ‘‘H∗’’, a low accent ‘‘L∗’’, a scooped accent ‘‘L∗+H’’, a rising peak accent ‘‘L+H∗’’, and a downstepped peak ‘‘H+!H∗’’. Downstepped high tones (a high tone after a higher tone in the same intermediate phrase) are marked by a ‘‘!’’. Uncertainty about a phrase tone or pitch accent can be marked with a ‘‘?’’, but this marker was used by the transcribers on less than 2% of the accents marked. The ToBI system does not include any markers to indicate relative differences in prominence of accented syllables.

2.1.3. Factors affecting accent placement and tone assignment

Numerous factors appear to be correlated with accent placement within an utterance, and therefore useful for accent prediction. Probably the most important factor is word part-of-speech category (cf. Ross et al., 1992; Hirschberg, 1993). Altenberg (1987) divides function words and content words into subclasses and examines each of these
subcategories for their potential to be made prominent by pitch accents and syllable loudness. Function words are subdivided into eight classes: predeterminers (e.g. all, such), discourse markers (e.g. well, now), relative pronouns, conjunctions, non-lexical (i.e. primary and modal) verbs, pronouns, prepositions, and determiners. Content words are subdivided into four classes: nouns, lexical verbs, adjectives, and adverbs. Altenberg further subdivided the word classes and found that several function word subclasses were very likely to be made prominent: ‘‘wh’’ adverbs, ordinals, and quantifying pronouns.

Discourse structure is useful in accent prediction because focused words and words introducing new information are often accented (cf. Jackendoff, 1972; Prince, 1981). Grosz and Sidner (1986) provide an architecture for modeling discourse, using a stack representation of attentional state as a function of discourse segmentation. Hirschberg uses a simple version of this discourse model together with word equivalence classes based on identical roots to determine whether words should be de-accented, with improved accent prediction results (Hirschberg, 1993).

Another factor that is believed to influence pitch accent placement is rhythm. Bolinger (1965) was the first to note rhythmic influences on pitch accent placement when he observed a tendency to place a pitch accent early and late in the phrase. Rhythm-oriented theories try to account for the tendency to spread out accents, and in particular for apparent ‘‘stress shift’’, by using a metrical grid and rules for avoiding stress clash (cf. Selkirk, 1984). Other work, which our approach will follow, posits a role for two factors influencing early accent placement: phrase onset marking and accent/stress clash avoidance (Bolinger, 1981; Gussenhoven, 1991; Shattuck-Hufnagel et al., 1994).

The selection of tone types is thought to be primarily influenced by discourse factors, although the details are not yet well understood.
Pierrehumbert and Hirschberg (1990) have laid out some of the groundwork for the theory of how intonation contributes to the overall meaning of a discourse. For accents, they argue that the choice of accent label depends on the speaker’s intentions with respect to instantiating a word in, or relating it to, the mutual belief space. (To the extent that this is true, we must rely on intentions being correlated with choice of wording to predict accent labels from unrestricted text.) They propose that the choice of phrase accents is influenced by whether or not (H- or L-) the intermediate phrase forms part of a larger interpretative unit. Similarly, boundary tones provide information about whether or not (H% or L%) an intonational phrase is to be interpreted with respect to a succeeding phrase. Taken together, the pitch accents, phrase accents, and boundary tones supply information about how the listener should interpret an utterance structurally with respect to other utterances and with respect to what is mutually believed in the discourse. A fairly well accepted example is the continuation rise boundary tone, L-H%, which typically indicates that the current speaker intends to continue speaking. Another example is the H-H% boundary tone that is used by speakers at the end of a yes/no question.

2.2. Corpus analysis

This section contains a description of the radio news corpus used for both corpus analysis and prediction experiments, followed by a summary of an accent variability study on this corpus. Part-of-speech, phrase structure and accent clash analyses on this corpus are reported in other work (Ross et al., 1992; Shattuck-Hufnagel et al., 1994; Ross, 1995).
2.2.1. Radio news corpus

The corpus used for this research was drawn from a collection of recorded FM public radio news broadcasts spoken by seven radio announcers (Ostendorf et al., 1995). FM radio news speech appears to be a good style for prosody synthesis research, since the announcers strive to sound natural while reading with communicative intent. For corpus analysis and model training, this research used speech from one female radio announcer, comprising 34 stories (approximately 48 min). These stories are studio recordings of actual radio broadcasts, which have been transcribed by a listener who did not have access to the original scripts. The 34 radio news stories contain 8841 words, 14 599 syllables, and 4799 pitch accents. The relative frequency of syllables with prominence was 0·33 (0·51 for words with prominence).

Four news stories originally collected as studio recordings were later re-recorded in the laboratory by multiple announcers, each reading the same four stories. The stories represent independent data, covering different topics and a different time period than the training data. Both training and test sets consisted mainly of stories about news events local to Massachusetts. The speaker represented in the training data was designated as the target speaker, and the laboratory-recorded version of each story from that speaker was designated as the target version. There were four versions of most sections, including two from the target speaker, but in some cases only three versions were available. The different versions of each story were used together in a study of variability across speakers for developing an evaluation metric, described later, and as an independent test set for evaluating the prediction algorithms. The target versions of the test set stories together contained 1904 words, 3087 syllables and 955 pitch accents.

The text was automatically annotated with part-of-speech labels, using a tagger developed by Meteer et al.
(1991) that used the University of Pennsylvania Treebank tag set (Marcus et al., 1993). The tags for the training data were not hand corrected, but an informal inspection suggested that the results are quite good. The tags for the test data were corrected, but the error rate was only 2% so the changes were minimal. The tagger was reported to have an error rate of 3–4% on known words in a news domain, and an 8–9% error rate when tagging a different domain (Meteer et al., 1991). Lexical stress assignments are available from a lexicon associated with the corpus. Phonetic labels and segment boundaries were obtained automatically using forced word alignment with the Boston University speech recognition system.

The radio news data was hand-labeled with the ToBI system of prosodic transcription by labelers at two sites. Three of the stories, containing a total of 1002 words, were labeled by both sites so that consistency of the labels could be assessed (Ross, 1995). For pitch accent types, there was agreement on presence vs. absence for 91% of the words. On those 487 words that were marked by both labeling groups with an accent, there was 60% agreement on accent type, with most of the disagreements occurring for the difficult ‘‘L+H∗’’ vs. ‘‘H∗’’ distinction. Grouping ‘‘H∗’’ tones with ‘‘L+H∗’’ tones as in Pitrelli et al. (1994), there was 81% agreement on pitch accent type. Boundary tone agreement was 93% for the 207 words marked by both labelers with an intonational phrase boundary, and similarly there was 91% agreement for 280 phrase accents. Similar levels of agreement were found for labeling the five ToBI break index levels. The consistency analysis was based on the same protocol used in Pitrelli et al. (1994), but the agreement obtained in our study was slightly better than that obtained in their
study (agreements of 81% for accent placement, 91% for boundary tone labels, and 85% for phrase accent labels, for the cases that are directly comparable). The lower rates observed in the Pitrelli study are probably due to the fact that they looked at a much broader range of types of speech, some of which were more difficult to label than the carefully articulated radio speech. Of course, it is possible that both labelers are biased by similar semantic and syntactic knowledge/understanding of the story, so these consistency measures on any corpus may be optimistically biased. However, experiments in prosodic labeling of filtered speech suggest that bias is not a serious problem (Sanderman & Collier, in press).

2.2.2. Variability study

An important part of developing any synthesis algorithm is the evaluation procedure used to assess advances, but designing one is non-trivial because of the amount of variability allowable in speech. To better understand this problem, we performed an empirical study of the differences among the pitch accents and phrase tones, comparing the different read versions of each test story. Two of the versions were from the target speaker (the speaker represented in the training data), and the other two were from different FM radio announcers (one male, one female). We compared the two versions from the same speaker, as well as the four read versions from the three speakers. Since accent placement is affected by prosodic phrase structure, e.g. the first and last content words in the phrase are often accented, we use only those versions where the intermediate phrase boundaries matched those from the target version (the laboratory recorded version from the target speaker). We also eliminated a few versions of phrases where words were different (misread) from those in the target version.
For the within-speaker comparison, used for the detailed results reported below, 333/597 (55%) of the phrases in the target version also appeared in the second version, which corresponded to 1192/3086 syllables that could be compared across versions. For the four-way multi-speaker comparison, there were on average 1·4/3 versions that had phrase boundaries that matched the target version, and 2469/3086 syllables fell into a phrase with at least one match from the other versions.

We conducted four studies of variability—accent placement, accent label, phrase accent label and boundary tone—comparing across the same-phrase versions in all cases. The results for each study, described below, show that different readings of the same text are likely to use different, yet acceptable, intonation patterns. For the within-speaker experiments, 10% of the syllables in the phrases where the second version matched the target version were accented in one version but not the other. For the multi-speaker experiments, for phrases where there was at least one additional matching version, 16% of the syllables were accented by some but not all speakers. One can postulate that some pitch accented syllables are required for semantic or structural reasons, whereas others are used optionally by speakers to give their speech extra expressiveness or add special emphasis, possibly because of their different interpretation of the discourse structure and/or differences in speaking style.

Tables I and II illustrate variability for boundary tone types¹ and phrase accent types, respectively. Both tables show the phrase tones for the target version vs. the corresponding phrase tone in the other version from the same speaker, for the subset
¹ Category L-H% also includes H-H%, since this ending is rare in our corpus, which has almost no questions.
T I. Comparison of boundary tones used in the target version to boundary tones used in the other version from the same speaker, for the subset of matching full intonational phrases in the two versions

                        Other versions
    Target version    L-L%    H-L%    L-H%
    L-L%                95       4      39
    H-L%                 2       1       7
    L-H%                 8       2      71
T II. Comparison of phrase accent types used in the target version to phrase accents used in the other version from the same speaker, for the subset of matching intermediate phrases in the two versions

                       Other versions
    Target version     L-     H-    !H-
    L-                 15     13      4
    H-                  2     13      5
    !H-                 1      1      3
of syllables in identical phrases in the two versions. The similarity rate for the type of phrase-accent/boundary-tone combination was 73% (with no consistency on the use of the H-L% tone), and only 54% for the type of phrase accent at an intermediate phrase boundary. (The multi-speaker similarity rates are only slightly lower, 71% and 51%, respectively.) Since the similarity rate for phrase accents is at the level of the chance prediction rate for one version (55%), we can see no meaningful way to evaluate prediction of phrase accents and have not included such experiments in Section 4. However, it may still be useful to predict phrase accents in order to give some variety to the speech.

Finally, Table III examines pitch accent variability between accent tones in the target version and the other same-phrase version read by the same speaker. For this analysis, the ToBI tone labels are grouped into four categories: no accent, high, downstepped, and low. The L+H∗ and L+!H∗ tones are grouped with H∗ and !H∗, respectively, because human labelers had some difficulties in consistently agreeing on these labels. In addition, because of limitations in automatic training, rare tones were grouped with tones that had similar targets: L∗+H and L∗+!H were grouped with L∗; H+!H∗ was grouped with H∗; and X∗? was grouped with !H∗. The tone class similarity rate was 76% for the cases where both versions accented a word, a three-class distinction that compares to 81% inter-transcriber agreement on a five-class distinction in our consistency study. For the equivalent multi-speaker variability test, the similarity rate was 72% for the cases where two or more versions accented a word.
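The tone groupings just described can be written down directly as a mapping. This is a minimal sketch using the ASCII ‘‘*’’ in place of the printed star; the function and dictionary names are ours.

```python
# Hedged sketch of the accent-tone groupings described in the text: rare or
# hard-to-distinguish ToBI accents are merged, then mapped to the accent
# classes used in the Table III analysis.
MERGE = {"L+H*": "H*", "L+!H*": "!H*",          # labeler-confusable pairs
         "L*+H": "L*", "L*+!H": "L*",           # rare tones -> similar targets
         "H+!H*": "H*", "X*?": "!H*"}
CLASS = {"H*": "high", "!H*": "downstepped", "L*": "low"}

def accent_class(tone):
    return CLASS[MERGE.get(tone, tone)]
```

For example, `accent_class("L+H*")` and `accent_class("H+!H*")` both map to `"high"`, collapsing the five-way ToBI distinction to the three accent classes.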
T III. Comparison of pitch accents used in the target version to pitch accents used in the other version, for the subset of syllables in matching phrases in the two versions. The different ToBI labels are grouped into three classes: high, downstepped and low

                          Other versions
    Target version    Unaccented    High    Downstepped    Low
    Unaccented              2970     115             57     29
    High                     258     791            178     73
    Downstepped               43      69            109     18
    Low                       18      13              7     10
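The similarity rates quoted in this section are diagonal fractions of confusion matrices like Tables I–III (rows: target version; columns: other version). A minimal sketch, with the function name and matrix layout as our own illustration; applied to the Table I counts it reproduces the 73% phrase-accent/boundary-tone combination rate quoted above (167/229).

```python
# Hedged sketch: similarity (agreement) rate from a square confusion
# matrix, optionally excluding the first row/column (unaccented) so the
# rate covers only cases where both versions accented the syllable.
def similarity_rate(matrix, skip_first=False):
    """Fraction of counts on the diagonal of a square confusion matrix."""
    start = 1 if skip_first else 0
    n = len(matrix)
    agree = sum(matrix[i][i] for i in range(start, n))
    total = sum(matrix[i][j] for i in range(start, n)
                             for j in range(start, n))
    return agree / total
```

For example, `similarity_rate([[95, 4, 39], [2, 1, 7], [8, 2, 71]])` gives 167/229, about 0·73.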
For both phrase and accent tones, the variability across versions was much larger than can be accounted for in terms of labeler inconsistency or disagreement. For the L% vs. H% boundary tone distinction, for example, the comparison is 76% vs. 93% (173/229 vs. 193/207, t>5·30, p<10⁻⁷). It is difficult to make the comparison for pitch accents, since the similarity is over a different number of classes, but the lower level of agreement for a smaller number of classes (73% vs. 81%) suggests a similar problem. If the choice of tone is dependent on the discourse, as proposed in Pierrehumbert and Hirschberg (1990), then this variability must be explained in terms of differences in the speakers’ commitment to the discourse, since the text is the same in all cases. However, there may also be other factors leading to this variability. Clearly, further research is needed to understand the factors affecting the choice of tune and how best to use tonal information in automatic speech processing.

3. Automatic prediction algorithms

In this work, the general approach for developing models for intonation prediction was statistical rather than the more typical rule-based approach. However, linguistic theory has been used to help define the structure for the models and to determine the factors that influence prominence placement. An advantage of statistical models is that they can be automatically trained and therefore easily modified to different speaking styles and possibly different languages. Statistical models are also attractive because they represent the natural variability in prosodic patterns and are able to predict prosodic behavior that is not yet understood linguistically, which can lead to a better understanding of the factors that influence intonation. The particular type of statistical model used here is a probabilistic decision tree optionally combined with a Markov sequence assumption.
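As a toy illustration of such a probabilistic decision tree, the sketch below asks binary questions about a syllable's features and returns a class distribution at each leaf rather than a hard decision, so its output can feed a sequence model. The questions, feature names and probabilities are invented for illustration and are not the paper's trained models.

```python
# Hedged sketch: a tiny probabilistic decision tree. Internal "nodes" are
# the if-statements; each leaf stores P(label | features). All questions
# and probabilities here are illustrative assumptions.
def accent_posterior(feats):
    """feats: dict with keys 'pos_class' ('content'/'function'),
    'lexical_stress' ('primary'/'secondary'/'none'), 'phrase_final' (bool).
    Returns a class distribution rather than a single predicted label."""
    if feats["pos_class"] == "function":
        return {"accented": 0.15, "unaccented": 0.85}
    if feats["lexical_stress"] != "primary":
        return {"accented": 0.10, "unaccented": 0.90}
    if feats["phrase_final"]:
        return {"accented": 0.90, "unaccented": 0.10}
    return {"accented": 0.65, "unaccented": 0.35}
```

Because the leaves return distributions, the tree never commits to a label on its own; the final decision is left to a sequence-level search over the whole phrase.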
This section begins with a description of the general approach based on this model, and then proceeds to discussions of the specific prediction variables for the prosodic labels investigated here: pitch accent location, pitch accent type, boundary tone type, and prominence level.

3.1. Symbolic label prediction

Our approach to predicting abstract prosodic labels is hierarchical, predicting first the presence vs. absence of pitch accents on syllables, then the type of tone labels on the accented syllables, and finally the boundary tones. At each level, we incorporate classification and regression trees, two types of decision trees that both make predictions based on a series of questions (typically binary) about input feature vectors. Decision trees have the advantages that the appropriate sequence of questions (from a prespecified set) is determined automatically from training data to minimize prediction error, and that the relative importance of the prediction features can be examined to gain insights into the phenomena being modeled. Additionally, decision trees have relatively low complexity and can easily be scaled in complexity by including a greater or lesser number of branches, given sufficient training data to design the trees. Of course, decision trees also have the disadvantage that the successively finer partitioning of the training data involved in tree design makes it difficult for trees to learn patterns of generalization across classes and questions based on features that occur rarely in the data. However, classification and regression trees have been used successfully in many speech synthesis applications, including pitch accent prediction (Ross et al., 1992; Hirschberg, 1993), duration modeling (Pitrelli & Zue, 1989; Riley, 1992; Iwahashi & Sagisaka, 1993), and phrase break prediction (Wang & Hirschberg, 1992; Ostendorf & Veilleux, 1994). Most of these examples, however, use trees to predict prosodic labels independently for each unit (e.g. word). Since some important features (such as accent clash for prominence placement) violate the independence assumption, it can be useful to model prosodic labels as a sequence. Thus, this work, like that of Ostendorf and Veilleux (1994), differs from most other decision tree work on prosody in that the trees are used to provide probability distributions for prosodic labels that are then used in conjunction with a sequence model.

3.1.1. Modeling assumptions

Taking a probabilistic approach to prosody label prediction, we need a stochastic model P(a_1^n | y_1^n) that represents the conditional dependence of a sequence of prosodic labels a_1^n = {a_1, \ldots, a_n} on a sequence of feature vectors y_1^n = {y_1, \ldots, y_n} based on text analysis. For example, in prominence prediction the time index i is over syllables, so a_i is the prosodic marker on the i-th syllable and y_i is a vector of features that are computed from the text sequence and are relevant to the accent on the i-th syllable. Using the chain rule,

    P(a_1^n | y_1^n) = p(a_1 | y_1^n) \prod_{i=2}^{n} p(a_i | a_{i-1}, \ldots, a_1, y_1^n).   (1)

In the first-order model, we assume that a_i is conditionally independent of earlier a_j given a_{i-1} and y_1^n:

    P(a_1^n | y_1^n) = p(a_1 | y_1^n) \prod_{i=2}^{n} p(a_i | a_{i-1}, y_1^n).   (2)
(For a second-order Markov assumption, we would also condition on a_{i-2}.) Further, we assume that the conditioning events can be represented by a set of equivalence classes T(a_{i-1}, y_1^n), which gives

    P(a_1^n | y_1^n) = p(a_1 | T(\emptyset, y_1)) \prod_{i=2}^{n} p(a_i | T(a_{i-1}, y_1^n)).   (3)
The equivalence classes are defined by the decision tree T, where each class corresponds to a terminal node t that is associated with a discrete distribution representing the conditional probability of each label given node t, p(a_i | T(a_{i-1}, y_i) = t). To simplify tree design, we make the approximation p(a_i | T(a_{i-1}, y_1^n)) ≈ p(a_i | T(a_{i-1}, y_i)). The order of the Markov assumption is determined by how far back in the sequence the tree finds it necessary to look. A second-order Markov model was sufficient for the trees for predicting pitch accent location, in the sense that the tree never chose to ask questions about more distant syllables. Similarly, a first-order Markov model was sufficient for the prediction of pitch accent type. If the tree did not choose any questions about previous labels, as was the case for boundary tones, then each accent could be assigned independently and the model would reduce to the standard decision tree approach used by Hirschberg (1993). Including previous labels in the set of questions results in a slightly more complex prediction algorithm, as discussed next.

3.1.2. Prediction algorithm

Once we have a probabilistic model, the optimal (minimum error rate) predicted label sequence is found by maximizing the probability P(a_1^n | y_1^n). Using the Markov assumption on the label sequence, the problem of finding the best sequence of prosodic labels is similar to the speech recognition problem of finding the best hidden Markov model state sequence; that is, it can be solved efficiently using a Viterbi search (dynamic programming) algorithm. Define A to be the set of all possible values that a_i could take on. In the accent placement model, for example, A has two members: accented and unaccented. Define the likelihood of the best label sequence ending with a_i = j as

    d_i(j) = \max_{a_1^{i-1}} \log p(a_1^{i-1}, a_i = j | y_1^n).   (4)
The prediction algorithm is then:

(1) Initially, find for all j \in A

    d_1(j) = \log p(a_1 = j | T(\emptyset, y_1)).   (5)

(2) For each time i = 2, \ldots, n and all j \in A compute

    d_i(j) = \max_{k \in A} [\log p(a_i = j | T(k, y_i)) + d_{i-1}(k)].   (6)
Save a pointer w_i(j) to the best previous label when a_i = j, for all j \in A.

(3) Find the best final label

    \hat{a}_n^* = \arg\max_{k \in A} d_n(k)   (7)

and trace back for each time i = n, \ldots, 2 to get the predicted sequence of labels

    \hat{a}_{i-1}^* = w_i(\hat{a}_i^*).   (8)
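The search in Equations (5)-(8), and the causal one-pass alternative discussed below, can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the trained tree's leaf distributions are replaced here by an invented toy lookup (`toy_tree_probs`), and the labels and probabilities are our own.

```python
import math

# Toy stand-in for the trained tree: given the previous label and the
# current feature (here, a lexical-stress flag), return a distribution
# over labels.  These probabilities are invented for illustration; a
# real system would read them off a CART leaf node.
def toy_tree_probs(prev, y):
    if prev == "accented":                      # discourage accent clash
        return {"accented": 0.1, "unaccented": 0.9}
    if y:                                       # stressed syllable
        return {"accented": 0.8, "unaccented": 0.2}
    return {"accented": 0.2, "unaccented": 0.8}

def viterbi_predict(labels, features, tree_probs):
    """Optimal label sequence via the dynamic programming recursion."""
    n = len(features)
    # initialization: no previous label for the first unit
    delta = [{j: math.log(tree_probs(None, features[0])[j]) for j in labels}]
    back = [{}]
    # recursion, saving back-pointers w_i(j)
    for i in range(1, n):
        delta.append({})
        back.append({})
        for j in labels:
            scores = {k: math.log(tree_probs(k, features[i])[j]) + delta[i - 1][k]
                      for k in labels}
            best = max(scores, key=scores.get)
            delta[i][j] = scores[best]
            back[i][j] = best
    # termination and traceback
    seq = [max(labels, key=lambda k: delta[n - 1][k])]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return seq[::-1]

def greedy_predict(labels, features, tree_probs):
    """Causal, suboptimal one-pass decoder."""
    seq, prev = [], None
    for y in features:
        p = tree_probs(prev, y)
        prev = max(labels, key=lambda j: p[j])
        seq.append(prev)
    return seq
```

On short inputs the two decoders often agree; the full search matters when an early greedy commitment (e.g. accenting a stressed syllable) blocks a globally better accent pattern later in the phrase.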
(The asterisk is used to indicate that this is the optimal prediction.) An alternative, suboptimal prediction algorithm that does not require waiting until the end of the sentence to predict the accent sequence uses the decision tree to make a decision at each time step based on the previous decisions, i.e.

    \hat{a}_i = \arg\max_{j \in A} \log p(j | T(\hat{a}_{i-1}, y_i))   (9)
for each time i = 1, \ldots, n. The suboptimal algorithm is needed if causality is required, in which case the text processing used to derive y_1^n would also have to be causal. Experimental results for both the optimal and suboptimal solutions to the sequence problem are given in Section 4.

3.1.3. Model training

The standard approach to model training is maximum likelihood (ML) parameter estimation. For the model proposed here, ML estimation is simply decision tree design with a minimum entropy criterion, assuming the availability of labeled data, i.e. (a_i, y_i) pairs in sequence. If the sequence dependence is unspecified, then the tree can choose questions about any previous labels,

    \max_T \sum_i \log p(a_i | T(a_{i-1}, \ldots, a_1, y_i)),   (10)

but if the dependence is specified, e.g. first-order Markov,

    \max_T \sum_i \log p(a_i | T(a_{i-1}, y_i)),   (11)
then the questions are restricted. The design of unrestricted trees is useful for determining the appropriate conditional independence assumptions, i.e. the order of the Markov model.

Standard decision tree design techniques apply to this problem, and we used a greedy growing algorithm that successively adds new branches (representing questions) to optimize an objective function. Specifically, the trees used in this work were generated by the commercially available CART program of Breiman et al. (1984). Although the tree design criterion should be minimum entropy (or maximum mutual information) to be consistent with the maximum likelihood objective, experience with decision tree design has suggested that other criteria can be more robust. Thus, the decision trees used here were grown using the Gini splitting criterion, where the objective is to minimize the impurity i(t) = 1 - \sum_j p^2(j|t), for p(j|t) the probability of class j at node t.

An important issue in model design is determining the size of the tree, to avoid overtraining. A variety of tree pruning algorithms have been proposed; we simply use a held-out portion of the training set to determine the right size tree. Then, because the tree is designed to provide probability estimates that must be non-zero, the leaf node distributions are smoothed using a simple back-off to the prior class probabilities.

Probably the most important aspect of tree design is the choice of the features
represented in y_i and the questions that can be asked about these features. We used linguistically motivated features, as described further in Section 3.2.

3.1.4. Label hierarchies

Conceptually, the simplest approach to predicting abstract prosodic markers would use a single tree for all the labels, but this is not a good choice for two main reasons. First, a single tree is likely to be more sensitive to the problem of data fragmentation. For example, if the tree first predicts ±accent, then the accented syllables will be partitioned across multiple leaves of the tree, so that more data is required to learn questions that might be common to all accented syllables. Second, we found that the accent tone label prediction tree chooses questions about the previous tone when that question is available, and incorporating these dependencies would require a high-order sequence model if the absence of accent were predicted at the same time as the tone labels. One might argue that the cost of an accent assignment error depends on the particular tone label, and that we should therefore design a single tree using a matrix of error costs, but in the absence of known error costs the single tree has more problems than benefits. Thus, we design one tree to predict presence vs. absence of accent on a sequence of syllables, and then a second tree to predict tone labels given a sequence of accented syllables.

A second issue is whether to make the predicted boundary tone labels and predicted pitch accents independent or to use one as an input feature for the other. Again, we looked at the unrestricted tree features to make this decision. Predictions of boundary tone type tended to use the feature of the last pitch accent type in the phrase when it was available, but trees for predicting pitch accents generally did not use the feature of the boundary tone. Therefore, we chose to predict pitch accent type first, and then use this information in the phrase tone predictions.
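The two-stage hierarchy (one tree for ±accent over syllables, a second tree for tone type over the accented syllables only) can be sketched as follows. The classifier arguments are placeholders standing in for the trained CART trees, the feature representation is our own, and phrase-level resets and the boundary tone stage are omitted for brevity.

```python
def predict_tones(syllables, accent_tree, tone_tree):
    """Bottom-up label hierarchy: decide +/- accent for every syllable
    first, then assign a tone type only to the accented ones.
    accent_tree maps a feature dict to 'accented'/'unaccented';
    tone_tree maps (previous tone, feature dict) to a tone label."""
    labels, prev_tone = [], None
    for feats in syllables:
        if accent_tree(feats) == "accented":
            tone = tone_tree(prev_tone, feats)
            prev_tone = tone
            labels.append(tone)
        else:
            labels.append(None)        # unaccented: no tone label
    return labels
```

Because the tone tree sees only accented syllables, its sequence dependence stays low-order even when accented syllables are far apart in the text.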
In summary, our approach is to first predict pitch accent location (two categories: ±accent) with a second-order Markov model; next predict pitch accent type (three categories: high, downstepped, low) with a different, first-order Markov model; and finally predict the boundary tones (three categories: H-L%, L-L%, L-H%) independently for each intonational phrase boundary with yet another decision tree. For both accent placement and tone prediction, we assume independence of phrases. Each model operates at a different time scale (syllables, accents and phrases, respectively), and together they represent a prosodic hierarchy. The algorithm described here takes a bottom-up approach to prediction, making decisions at the higher levels after the lower level decisions have been made. Taking the stochastic modeling approach further, it would be possible to make these choices jointly using a three-level recursive dynamic programming algorithm, as in Ostendorf and Veilleux (1994). This recursive algorithm was not implemented because of the increase in complexity and our desire to assess the model at separate stages.

3.2. Intonation label prediction variables

When designing decision trees, one needs only to provide a vector of reasonable prediction features, and the tree generation program will pick the best features for the prediction. The features used here were based on suggestions from the literature and our own corpus analyses, and were restricted to those that could be extracted with simple text processing. With the exception of the prosodic phrase boundaries and minor
corrections to the part-of-speech labels in the text data, all features are predicted automatically from the original text. The following paragraphs briefly summarize and motivate these features; more complete lists appear in Appendix 1. The full set of features was made available to all trees, whether or not they were motivated by that particular problem, with the exception that the pitch accent label on the current syllable can obviously only be used by the boundary tone prediction tree. In Section 4, we describe which features were actually chosen in tree design.

Dictionary information. Dictionary stress is important for predicting accent placement, because the main-stress syllable is almost always accented when the word receives nuclear accent. The stress categories must also account for the facts that secondary-stress syllables were likely to be accented only if they precede the main-stress syllable, and that single-syllable words are not marked for lexical stress in the dictionary. Lexical features related to the syllable structure of the word are also included, since longer words were more likely to receive two accents. Finally, vowels are labeled as either tense or lax,2 under the hypothesis that tense vowels were more likely to receive a pitch accent, and the tree for predicting accent location does use this feature for secondary-stress syllables.

Part-of-speech. The part-of-speech tags of the target word, as well as of the preceding and succeeding words, are candidate variables for accent prediction. The data is labeled with the full set of 36 Penn Treebank part-of-speech labels, but CART can handle only a maximum of eight categories. Therefore, the part-of-speech categories are grouped, based upon their having similar function and frequency of being accented (Ross, 1995), into: cardinal numbers, adjectives, nouns, adverbs and particles, more-accentable verbs, less-accentable verbs, and two groups of miscellaneous function words.
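Such a grouping reduces to a simple tag-to-class table. The paper's exact assignment follows Ross (1995) and is not reproduced here, so the mapping below, in particular the verb and function-word splits, is our own guess for the sake of the sketch.

```python
# Illustrative grouping of Penn Treebank tags into eight CART categories.
# The verb and function-word splits below are assumptions, not the
# assignment used in the paper (Ross, 1995).
POS_GROUP = {
    "CD": "cardinal",
    "JJ": "adjective", "JJR": "adjective", "JJS": "adjective",
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "RB": "adverb-particle", "RBR": "adverb-particle",
    "RBS": "adverb-particle", "RP": "adverb-particle",
    "VB": "verb-more", "VBG": "verb-more", "VBN": "verb-more",
    "VBD": "verb-less", "VBP": "verb-less", "VBZ": "verb-less",
    "MD": "function-1", "TO": "function-1",
    "DT": "function-2", "IN": "function-2", "CC": "function-2",
}

def pos_group(tag):
    # unlisted tags fall into the second miscellaneous function-word group
    return POS_GROUP.get(tag, "function-2")
```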
To effectively have more categories, a separate question distinguishes the four classes: more-accentable function words (quantifiers and negatives), less-accentable function words, content words, and proper nouns. Lastly, to capture some information about complex nominals with minimal text processing, we include a question about the part-of-speech class of neighboring words and word position in a complex nominal, where a complex nominal is defined as a sequence of adjectives and nouns ending in a noun and unbroken by prosodic phrase breaks of three or higher.

Prosodic phrase structure. There is good evidence for interaction between phrase structure and accent placement, and ideally one would predict both jointly. To simplify the problem for this study, the model assumes that prosodic phrase structure is predicted before accent placement, since phrase boundaries seem to be more reliably predicted without knowledge of accent location than accent locations are predicted without knowledge of phrase boundary location (Wang & Hirschberg, 1992; Hirschberg, 1993; Ostendorf & Veilleux, 1994). Several questions related to phrase structure can therefore be included, specifically prosodic break size and distance from phrase boundaries, motivated by observations that speakers often accent the first and last accentable syllables of an intonational phrase. Prosodic phrase length and syllable position within a prosodic phrase are also features, primarily motivated by accent label prediction (e.g. the first accent in a phrase is never downstepped). Unless otherwise noted, the experiments reported here are based on hand-labeled phrase boundaries, so as not to confound the effects of phrase prediction errors with accent prediction.

2 Tense vowels include ‘‘iy’’, ‘‘ey’’, ‘‘ae’’, ‘‘aa’’, ‘‘ao’’, ‘‘ow’’, ‘‘uw’’, ‘‘ay’’, ‘‘aw’’, and ‘‘oy’’; lax vowels include ‘‘ih’’, ‘‘eh’’, ‘‘ah’’, and ‘‘uh’’.
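The complex nominal definition (a run of adjectives and nouns ending in a noun, with no internal prosodic break of three or higher) can be sketched directly. The word representation, a (pos_class, break_after) pair per word, and the minimum span length of two words are our assumptions for this illustration.

```python
def _close(run, words):
    # a candidate span must end in a noun; drop trailing adjectives
    while run and words[run[-1]][0] != "noun":
        run.pop()
    return tuple(run) if len(run) >= 2 else None

def complex_nominals(words):
    """Return index spans of complex nominals.  Each word in `words`
    is a (pos_class, break_after_word) pair; a prosodic break of size
    3 or higher after a word terminates the current adjective/noun run."""
    spans, run = [], []
    for i, (pos, brk) in enumerate(words):
        if pos in ("adjective", "noun"):
            run.append(i)
            if brk >= 3:                    # break of 3+ ends the run
                span = _close(run, words)
                if span:
                    spans.append(span)
                run = []
            continue
        span = _close(run, words)           # any other word ends the run
        if span:
            spans.append(span)
        run = []
    span = _close(run, words)               # flush a run at end of input
    if span:
        spans.append(span)
    return spans
```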
New/given status. Borrowing from Hirschberg's (1993) work, we use a model of new vs. given that asks whether a content word or its root is new to the paragraph or the story. The root of the word is found by stripping off all prefixes and suffixes from the word and checking in a dictionary to see if the proposed root word is a legitimate word. Additionally, a list is used for some special-case word-to-root conversions needed in our corpus (e.g. bought/buy, feet/foot, etc.).

Paragraph structure. There is a series of questions related to sentence and paragraph structure aimed at boundary tone prediction, including: location of prosodic phrases within sentences and paragraphs, location of sentences within paragraphs, sentence length, and punctuation.

Label of other units. Questions about the label of the preceding unit (syllable, accent or boundary tone, depending on the model) are included to capture sequence dependencies. In addition, for the accent location model, we include a question about how many syllables have occurred since the last pitch accent, motivated by the theory of accent clash avoidance. The threshold chosen (two, in this case) determines the order of the Markov model.

This set of features is similar to those used in Hirschberg's experiments, with a few minor exceptions. Hirschberg did not use lexical stress information, because she predicted accent at the word level, choosing the main-stress syllable by default. The part-of-speech classifications differ somewhat; ours are motivated by our corpus analyses (Ross et al., 1992; Ross, 1995). We do not have access to the Sproat (1990) algorithm for predicting complex nominal main stress, and use a simple approximation instead. Finally, we include more questions about paragraph structure, but they are motivated by boundary tone prediction and not accent placement.
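The new/given bookkeeping can be sketched as follows. This is a simplified stand-in for the paper's procedure: only suffix stripping is attempted (the paper strips prefixes as well), the dictionary is a toy set, and the special-case table holds just the two examples given in the text.

```python
SPECIAL_ROOTS = {"bought": "buy", "feet": "foot"}   # special cases from the text

def find_root(word, dictionary):
    """Strip suffix characters until a legitimate dictionary word is
    found (prefix stripping, used in the paper, is omitted here)."""
    w = word.lower()
    if w in SPECIAL_ROOTS:
        return SPECIAL_ROOTS[w]
    for cut in range(1, min(4, len(w))):
        if w[:-cut] in dictionary:
            return w[:-cut]
    return w

def is_new(word, seen, dictionary):
    """True if neither the word nor its root has occurred yet in the
    paragraph (or story); updates `seen` so later occurrences are given."""
    r = find_root(word, dictionary)
    new = word.lower() not in seen and r not in seen
    seen.update({word.lower(), r})
    return new
```

Resetting `seen` at paragraph boundaries (but keeping a second, story-level set) gives the two scopes of the feature.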
The one substantial difference with respect to Hirschberg's features is the use of previous labels, which necessitates the development of the sequence model.

3.3. Prominence level prediction

After tone labels are predicted, prominence ‘‘level’’ is predicted for each accented syllable. Since there was a question of whether levels should be categorical or continuous, we first ran some analyses to determine whether the data could be labeled categorically. In pilot experiments, where human labelers were asked to mark two levels of accentual prominence, the level of consistency across labelers was too low to use the data. Clustering experiments were also conducted to determine whether the acoustic features examined (duration and peak normalized F0 and energy within the syllable) formed identifiable groupings, but none were found (Ross, 1995).

Having no basis on which to predict categorical differences, we chose to use regression trees to predict prominence level as a continuous-valued normalized F0 peak for each accented syllable, where the normalization is with respect to the peak F0 in the phrase, to account for the dependence of prominence level on local pitch range. (The F0 contours are smoothed before normalization and peak measurement, using a 5-point median filter to reduce segmental effects and spurious pitch tracking errors.) Although perceptual evidence suggests that the F0 peak is the primary factor associated with prominence level (Terken, 1996), we also chose to predict peak normalized log RMS energy for each accented syllable, since energy is sometimes said to be a correlate of prominence. Syllable F0 and energy peaks were predicted by regression trees designed with CART using a minimum mean squared error criterion. The models for prominence level prediction did not
include a sequence model, assuming that sequence information, if important, was captured by the abstract marker predictions already described (e.g. presence vs. absence of downstep). Instead, we allowed the use of the abstract prosodic markers as regression variables in the prediction, as well as many of the same features used during the prediction of accent location. Peak energy and F0 were predicted independently, because CART does not handle regression of vectors, though it is a straightforward extension.

4. Experiments

Decision trees were designed automatically with approximately 50 min of speech representing thirty-four radio news stories from a single female speaker (the target speaker). The trees were grown using 67% of the training data, and pruned using the remaining 33% of the training data. The following sections describe the results of these experiments, in terms of a quantitative comparison to the independent multi-version test set and in terms of perceptual experiments.

4.1. Quantitative prediction results

4.1.1. Evaluation measures

A good evaluation procedure is needed for algorithm development to assess improvements in tone and accent assignment patterns, ideally without the high cost of perceptual experiments. Previous work has used an error rate relative to a single target version. However, we believe that quantitative evaluation metrics should take into consideration ‘‘allowable’’ variability, given our lack of knowledge of speaker intentions. Therefore, we propose new metrics for evaluating the predicted labels, which will be reported together with the standard error measure. In particular, the assignment to syllables of required, optional, and unmarked accent status suggests a new method of evaluating pitch accent placement routines, where error rate is measured only over those syllables for which the different versions of a text were unanimous in either accenting or deaccenting.
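The unmarked/required scoring scheme can be stated compactly. Here the prediction and each spoken version are represented as binary accent sequences over the same syllables, a representation we assume for the sketch.

```python
def subset_accuracy(predicted, versions):
    """Score accent placement only where all versions agree: a syllable
    is 'required' if every version accented it, 'unmarked' if none did;
    syllables with optional accents (accented in some versions) are
    excluded from scoring."""
    correct = total = 0
    for i, p in enumerate(predicted):
        marks = [v[i] for v in versions]
        if all(marks):                 # required accent
            total += 1
            correct += (p == 1)
        elif not any(marks):           # unmarked syllable
            total += 1
            correct += (p == 0)
        # otherwise optional: not scored
    return correct / total
```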
Another measure generates predicted prosodic labels for each of the versions (which differ in the phrase boundaries given as input to the algorithm), and gives a sentence the score resulting from the closest match between the predicted and spoken accents over the four possible versions. [See Ostendorf and Veilleux (1994) for a discussion of a similar approach to evaluating phrase boundary predictions.] We also give results for accuracy as measured against the target version, to facilitate comparison of our results with other work.

Tonal markers can be evaluated using similar techniques, i.e. using accuracy relative to some target version, relative to the best matching version, or on the subset of syllables where versions agreed on the tone assignment. However, because there was so much variability in the tone labels across versions, the subset of tones where the versions agreed was too small and/or too trivial to give meaningful results. Thus, all accent and boundary tone prediction results are in comparison to the target version.

4.1.2. Pitch accent location

The decision tree for predicting the location of prominent syllables had twenty terminal nodes, as shown in Fig. 2. The most important features were dictionary stress, the number of syllables since the last pitch accent (threshold at 2), the syllable's position
Figure 2. Classification tree for predicting pitch accent location, with relative frequency of syllables with pitch accents given at the terminal nodes. See Appendix 2 for full description.
Table IV. Comparison of algorithms for predicting pitch accent location, measuring performance at the syllable level (percent correct)

                    Target version   Unmarked/required   Best match
Content word             85·2              91·3             87·1
Monaghan                 82·9              88·5             84·1
Tree-suboptimal          87·5              92·8             89·9
Tree-optimal             87·7              92·9             89·8
within the word, and the function/content word distinction. Other features that the tree used included: the number of content words in the phrase, the number of syllables in the word, the number of syllables since the minor phrase break, the number of content words to and from the phrase breaks, and the size of the prosodic phrase breaks before and after the word. The tree did not use any new/given features, probably because the radio news data is highly accented and ‘‘given’’ words are frequently accented. [Hirschberg (1993) finds only a small improvement through the use of new/given features, so our result was not so surprising.] The fact that the number of syllables from the last accent was the second question in the tree demonstrates the importance of the sequence model for pitch accent location prediction.

The optimal prediction algorithm with the sequence model involved a dynamic programming search (see Section 3.1), but we also experimented with the suboptimal solution (Equation (9)) to assess the performance trade-offs. Table IV summarizes the results for both accent prediction algorithms, including three different performance measures at the syllable level: (1) accuracy with respect to the target version, (2) accuracy measured only on the unmarked/required subset of syllables, and (3) accuracy based on the best match between the predicted prosodic markers and the test data from the four versions. The optimal search achieves slightly (but not significantly) better performance than the suboptimal search, with the main difference being that the suboptimal approach tends to predict fewer accents. For comparison with other results where accents are predicted on words, our word-level accuracy with respect to the target version was 82·5% for the optimal search, compared to 87·7% accuracy at the syllable level. (Comparisons across studies must be interpreted with caution, however. For example, the 76·5% accuracy reported by Hirschberg (1993) on FM radio news data was based on a smaller training set and a different test set, and the 85% accuracy reported on ATIS with the same approach reflects a different speaking style.) Note that the accuracy achieved here is higher than the similarity rate across the different versions (84%), which is also reflected in the high accuracy rates (92·9%) measured using the required/optional analysis.

Two simple rule-based algorithms for predicting pitch accent location were implemented for comparison with the tree-based approach, and the results for both are also included in Table IV. The first algorithm, referred to as the ‘‘Content Word’’ algorithm, puts an accent on the (last) main-stress syllable of every content word; the second is the Monaghan (1990) algorithm, which takes a rhythm-based approach to predicting accent location. Neither algorithm does as well as the tree-based approach, but surprisingly the content word algorithm outperforms the more sophisticated Monaghan
Figure 3. Text with required (R), optional (O), and unmarked (U) labels for each syllable and the predicted prosodic markers (pitch accents and phrase tones).
algorithm. A plausible explanation is simply that the radio news data is over-accented, and the Monaghan algorithm does not represent this style well.

A short section of text with predicted accents is shown in Fig. 3, with the associated accent designations from the multi-version analysis: ‘‘U’’ for unmarked, ‘‘O’’ for optional, and ‘‘R’’ for required. As shown in Table V, optional accents are allowed on 13% of the test syllables, with a prediction accuracy of 92% on the remaining syllables. The difficulty in quantitative evaluation is also illustrated in Fig. 3, in that one would expect the syllable ‘‘-zine’’ of ‘‘magazine’’ to have an optional accent, and it does not because there was no accent on this syllable in the target version and no exact phrase match in the other versions. Although multiple versions are better than a single version for assessing performance, more than four versions seem to be needed to characterize variability even for a single speaking style (radio news).
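The ‘‘Content Word’’ baseline is simple enough to state in a few lines. The word representation here, a content-word flag plus a per-syllable main-stress pattern with single-syllable words marked 1, is our assumption for the sketch (recall that the dictionary itself leaves single-syllable words unmarked for stress).

```python
def content_word_accents(words):
    """Accent the (last) main-stress syllable of every content word.
    words: list of (is_content, stress) pairs, where stress is a list
    over the word's syllables with 1 marking main stress."""
    accents = []
    for is_content, stress in words:
        marks = [0] * len(stress)
        if is_content and 1 in stress:
            # index of the last main-stress syllable
            marks[len(stress) - 1 - stress[::-1].index(1)] = 1
        accents.extend(marks)
    return accents
```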
Table V. Comparison of predicted accent status to a multi-version labeling of accents as unmarked, optional or required, depending on whether no, some or all versions (respectively) accented the syllable

Predictions   Unmarked syllable   Optional pitch accent   Required pitch accent
Unmarked            1755                  137                      88
Accented             133                  270                     708
This algorithm occasionally does predict early accent placement on secondary-stress syllables that precede the main-stress syllable, on words such as ‘‘Massachusetts’’ in the stress clash position of the phrase ‘‘that MASSachusetts HIGH court nomiNEES’’ and ‘‘nineteen’’ in the phrase onset position of ‘‘in NINEteen seventy SIX’’. The algorithm can also predict double accents, in principle, and in fact does for a number of long words (e.g. ‘‘Massachusetts’’ in the phrase ‘‘MASSaCHUsetts got TOUGH’’).

4.1.3. Pitch accent type

After pitch accent locations are predicted, a second tree predicts pitch accent type using three classes of tones: high, downstepped and low. The resulting tree had seven terminal nodes, as shown in Fig. 4. Because they are rare, low accents are never predicted by this model. The low frequency of occurrence makes it difficult for the trees to find meaningful features for predicting them, and they are less likely than the other classes at every leaf in the tree. The first and most important question in the tree splits off the cases where the accent is the first in the phrase, since it is almost always a high accent. The position of the syllable in the phrase is also an important feature.

Table VI gives the confusion matrix for pitch accent labels from the target version vs. labels from the optimal prediction algorithm. For the subset of syllables correctly predicted to have an accent, the optimal tone prediction algorithm correctly labels 72·4% of the syllables, compared to 72·1% for the suboptimal algorithm. These numbers can be compared to 71·8% accuracy for the case where all accents are labeled with a high tone, and 63·7% accuracy for the simple rule that the first accent receives a high tone and all others are downstepped high tones.

4.1.4. Boundary tone markers

Another tree is used to predict the boundary tones that occur at the end of each intonational phrase (ToBI break level 4).
Three boundary tones (L-L%, L-H%, or H-L%) are predicted; the fourth (H-H%) tends to occur at yes–no question boundaries, which were not represented in the radio data. The tree was relatively simple, as shown in Fig. 5. The most important features were punctuation, phrase position, and phrase length. Questions about previous boundary tones were not chosen, so the Viterbi search algorithm was not needed. An initial set of trees was designed that used the last pitch accent type in the phrase, but this hurt the overall performance of boundary tone prediction due to errors from the pitch accent model. Most notably, when low (actually L∗) accents occurred in nuclear position, ‘‘L-H%’’ was likely to be the predicted boundary tone. But since low accents were never predicted, ‘‘L-H%’’ was under-predicted
[Figure 4: tree diagram. The node questions and terminal-leaf label fractions are reproduced below.]

Node  Question                                  Left                                       Right
1     Previous accent in phrase                 None                                       Other
2     Position of minor phrase in sentence      Beginning or end                           Middle
3     Break index after word                    0, 2, 3                                    1, 4, 5, 6
4     Number of content words from major break  <5.5                                       >5.5
5     Number of syllables in sentence           <32.5                                      >32.5
6     Part-of-speech                            Adjectives, nouns, adverbs,                Numbers, verbs
                                                miscellaneous 1 and 2

Terminal-leaf fractions (high/downstepped/low): (0.95/0.02/0.04), (0.25/0.70/0.05), (0.56/0.41/0.03), (1.00/0.0/0.0), (0.35/0.59/0.06), (0.57/0.38/0.05), (0.31/0.69/0.0).

Figure 4. Classification tree for predicting pitch accent type, with the fraction of elements with ‘‘high’’, ‘‘downstepped’’, and ‘‘low’’ labels, respectively, indicated at terminal leaves.
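A tree of this kind is evaluated at prediction time by walking from the root to a leaf and reading off the leaf's label distribution. The following sketch illustrates the mechanism: the root split and its first-accent leaf distribution are taken from Fig. 4, but the right-hand subtree is collapsed to a single illustrative leaf rather than reproducing the full seven-leaf topology.

```python
# Sketch of classification-tree evaluation in the style of Fig. 4.
# Internal nodes ask a categorical question; leaves store the fraction of
# training tokens with each tone label (high, downstepped, low).

def make_leaf(p_high, p_down, p_low):
    return {"leaf": (p_high, p_down, p_low)}

def make_node(feature, left_values, left, right):
    return {"feature": feature, "left_values": left_values,
            "left": left, "right": right}

def classify(tree, features):
    """Walk from the root to a leaf and return its label distribution."""
    while "leaf" not in tree:
        take_left = features[tree["feature"]] in tree["left_values"]
        tree = tree["left"] if take_left else tree["right"]
    return tree["leaf"]

# Root question from Fig. 4: is this the first accent in the phrase?
# The first-accent leaf uses the figure's (0.95/0.02/0.04) distribution;
# the other leaf is a placeholder standing in for the deeper subtree.
accent_type_tree = make_node(
    "previous_accent_in_phrase", {"none"},
    make_leaf(0.95, 0.02, 0.04),
    make_leaf(0.35, 0.59, 0.06))

dist = classify(accent_type_tree, {"previous_accent_in_phrase": "none"})
# A phrase-initial accent is predicted to be a high tone (p = 0.95).
```

The same walk-to-a-leaf routine serves every tree in this section; only the questions and leaf statistics differ.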
Table VI. Comparison of observed pitch accents in the target version to predicted pitch accents using the sequence model

Observed      Predicted
              Unaccented    High        Downstepped    Low
Unaccented    88% (1866)    8% (164)    5% (102)       0% (0)
High          11% (75)      72% (491)   17% (113)      0% (0)
Downstepped   16% (32)      25% (49)    59% (118)      0% (0)
Low           9% (7)        44% (34)    47% (36)       0% (0)
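The tone accuracies quoted in the text can be recovered directly from the counts in Table VI. A short check, with the counts transcribed from the table (rows are observed labels, columns predicted labels):

```python
# Confusion counts from Table VI for the three observed accent classes.
counts = {
    "high":        {"unaccented": 75, "high": 491, "downstepped": 113, "low": 0},
    "downstepped": {"unaccented": 32, "high": 49,  "downstepped": 118, "low": 0},
    "low":         {"unaccented": 7,  "high": 34,  "downstepped": 36,  "low": 0},
}

# Restrict to syllables correctly predicted to carry an accent,
# i.e. drop the predicted-unaccented column.
accented = {obs: {pred: n for pred, n in row.items() if pred != "unaccented"}
            for obs, row in counts.items()}
total = sum(sum(row.values()) for row in accented.values())

# Tone accuracy of the tree: diagonal over total.
tree_correct = sum(accented[t].get(t, 0) for t in accented)
tree_acc = 100.0 * tree_correct / total        # 72.4%, as reported

# Baseline: label every accent with a high tone.
all_high_correct = sum(accented["high"].values())
all_high_acc = 100.0 * all_high_correct / total  # 71.8%, as reported
```

Both figures match the percentages given in the text to one decimal place.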
[Figure 5: tree diagram. The node questions are reproduced below; leaves are labeled with the most likely boundary tone (L-L% or L-H%).]

Node  Question                               Left                                       Right
1     Position of major phrase in sentence   End                                        Beginning or middle
2     Syllables since the last accent        >1                                         <1
3     Punctuation                            Comma, semi-colon, colon                   None
4     Part-of-speech                         Numbers, adverbs, less-accentable verbs,   Adjectives, nouns, more-accentable verbs,
                                             miscellaneous 2                            miscellaneous 1
5     Number of syllables in major phrase    <5.5                                       >5.5

Figure 5. Classification tree for predicting boundary tone type, with the most likely tone labels given for each leaf.
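For reference, the simple rule-based baselines that the tree is compared against in the text are easy to state in code. This is a sketch; the field names of the phrase description are illustrative, not taken from the paper.

```python
# The three boundary tone baselines described in the text.

def all_ll(phrase):
    # Every intonational phrase gets L-L%.
    return "L-L%"

def by_sentence_position(phrase):
    # L-L% at sentence-final phrases, L-H% sentence-internally.
    return "L-L%" if phrase["sentence_final"] else "L-H%"

def by_comma(phrase):
    # L-H% at boundaries marked with a comma, L-L% elsewhere.
    return "L-H%" if phrase["punctuation"] == "comma" else "L-L%"

phrase = {"sentence_final": False, "punctuation": "comma"}
# by_sentence_position(phrase) and by_comma(phrase) both return "L-H%" here;
# all_ll(phrase) returns "L-L%".
```

The decision tree of Fig. 5 can be viewed as a data-driven refinement of these rules: it uses the same cues (phrase position, punctuation) plus phrase length and part-of-speech.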
Table VII. Comparison of observed boundary tones in the target version to predicted boundary tones

Observed    Predicted
            L-L%         H-L%      L-H%
L-L%        92% (220)    0% (0)    8% (19)
H-L%        75% (12)     0% (0)    25% (4)
L-H%        71% (97)     0% (0)    29% (39)
when the last accent in the phrase was allowed to be a prediction feature. Removing this feature improved the test set boundary tone prediction results, although the training set performance (based on actual rather than predicted tone labels) saw an increase in errors from 25·2% to 29·4%. Boundary tone type prediction accuracy, assuming known intonational phrase boundary locations, is shown in Table VII, where predicted tones are compared to the target version. The overall rate of correct predictions for the three types of boundary tones was 66·9%, which can be compared to 61·1% for the
case where every intonational phrase is assigned a boundary tone of L-L%. Assigning every sentence-ending intonational phrase an L-L% and every sentence-internal intonational phrase an L-H% gives an accuracy of 58·3%; assigning L-H% to boundaries that are marked with a comma and L-L% to all other boundaries gives an accuracy of 53·9%. All of the errors shown in this table result from boundary tones within sentences as opposed to boundary tones at the end of sentences, which are consistently marked with L-L%. The tree does not predict H-L% boundary tones, which occur infrequently in our corpus, but the spoken versions did not agree on the use of this tone either. In fact, if we evaluate prediction only on the subset of 296/548 tones that the different versions agree on, the accuracy is 100%.

4.1.5. Prominence level

Once the pitch accent location and type have been established, a prominence level is predicted by regression trees. For this work, prominence level was captured in the peak normalized energy and F0 levels of pitch-accented syllables, as described in Section 3.3. Independent trees were grown for predicting energy and F0 peaks, assuming that the tone label is specified. The tree for F0 level is shown in Fig. 6, and its features include the predicted pitch accent type, the number of pitch-accented syllables in the minor phrase, the previous predicted pitch accent type, and the position of the minor phrase in the sentence. Tree prediction of normalized F0 reduced the squared error relative to using the training mean by 55%. For comparison, using the rules for predicting F0 peak specified in Silverman (1987) results in a 9% reduction of variance. Tree prediction of normalized log RMS energy did not give a significant improvement over using a constant mean from the training data.

4.2. Perceptual experiments

The models for predicting abstract prosodic markers were also assessed with a perceptual test using the Bell Laboratories TTS synthesizer with the male voice.
For this test, 12 native speakers of American English were asked to rate versions of 15 sentences from radio news stories. Three synthesized versions of each sentence were presented to each listener: (1) our predicted pitch accents and phrase tones, (2) hand-labeled pitch accents and phrase tones from the target version, and (3) the synthesizer's default pitch accents and phrase tones. The default pronunciations from the synthesizer were used except where they sounded unnatural, and in all cases the original spoken prosodic phrase boundaries were given to the synthesizer. A version with predicted accent locations and default tone types was not included, since the Markov tree accent location prediction algorithm was similar to the TTS default. Thus, the focus of the perceptual tests was on combined accent and tone prediction. The participants listened to the three versions of each sentence as many times as they wanted and were asked to rate the naturalness on a scale of 1 to 5, with 1 representing the most natural. They were instructed that this test was evaluating several methods of producing intonational contours and that they should pay close attention to the naturalness of the intonation. The presentation order was varied so that the participants could not tell which version they were listening to. Our model for predicting prosodic labels had a mean rating of 2·70, the version using the observed labels from the target version had a mean rating of 2·74, and the version using the default prosody from the TTS
[Figure 6: tree diagram. The node questions and leaf statistics are reproduced below.]

Node  Question                               Left         Right
1     Pitch accent type                      !H*, L*      H*
2     Number of accents in minor phrase      <1.5         >1.5
3     Position of minor phrase in sentence   Beginning    Middle, end
4     Pitch accent type                      L*           !H*
5     Previous pitch accent                  H*, !H*      L*

Leaf predictions (mean, SD): (1.00, 0.0), (0.97, 0.07), (0.87, 0.15), (0.82, 0.12), (0.71, 0.14), (0.61, 0.15).

Figure 6. Regression tree for predicting the peak fundamental frequency for accented syllables, with the predicted normalized F0 (mean) and training error (SD) associated with each leaf.
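At synthesis time the regression tree is used the same way as the classification trees, except that the selected leaf's training mean is emitted as the predicted peak normalized F0. The sketch below illustrates this: the root split (accent type) and the second split (number of accents in the minor phrase) come from the figure's question table, but the assignment of particular leaf means to particular paths is an assumption, since the tree topology is not reproduced here.

```python
# Sketch of regression-tree prediction of peak normalized F0, Fig. 6 style.
# Leaf values reuse means that appear in the figure, but their placement
# in this toy tree is illustrative only.

def predict_peak_f0(accent_type, n_accents_in_minor_phrase):
    if accent_type == "H*":
        # High accents sit near the top of the normalized F0 range; a lone
        # accent in its minor phrase gets the full range.
        return 1.00 if n_accents_in_minor_phrase <= 1 else 0.97
    # Downstepped (!H*) and low (L*) accents are predicted lower.
    return 0.71

peak = predict_peak_f0("H*", 1)   # 1.00 for a lone H* accent
```

The predicted value is a fraction of the speaker's normalized pitch range, so the synthesizer scales it into Hz using the speaker's baseline and topline.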
synthesizer had a mean rating of 2·73. The average difference between our model and the speech synthesized using the hand-labeled prosody was 0·04, with the prosody from the Markov tree model being slightly better (ns, t11=0·5). The closeness of the ratings for the three approaches suggests that a better mechanism for perceptual assessment is needed. Informal discussions with some of the subjects indicate that the overall quality was difficult to rate because it was hard to judge the effect of a defect in one version against a different defect in another. It may be that advances in other areas of synthesis technology are needed for a meaningful assessment of tone prediction in perceptual experiments.

5. Discussion

Our main objective was to develop automatically trainable algorithms for predicting abstract prosodic markers (pitch accents and boundary tones) that produce better results than simple rule-based models. Starting with decision trees for predicting pitch
accent location, as in Hirschberg (1993), we showed that sequences of accents are correlated, and then extended the decision tree framework to operate in conjunction with a Markov sequence model to represent this dependence. We also extended previous work by applying the Markov tree model to multiple levels of a prosodic hierarchy, i.e. for predicting (1) accents on syllables (rather than words) in order to capture early and double accenting, (2) tone labels on accented syllables, and (3) boundary tones for intonational phrases. In addition, we introduced the idea of using regression trees to predict multiple levels of prominence. The structure of the model is based on relatively broadly accepted aspects of prosodic theory, i.e. that there is a finite inventory of tone types that are used to mark pitch accent and boundary phenomena. Although the experiments build on the ToBI labeling system, the mathematical framework can in principle support other prosodic labeling systems that represent these two types of tones. The prediction algorithms were evaluated experimentally and compared to other reported algorithms and/or simple rules that we implemented for testing on the same corpus. The Markov tree model achieved higher accuracy relative to other approaches using several different measures of performance, some of which were introduced in this work as a means of quantitative performance assessment that allows for natural variability of intonation markers. The prediction results for pitch accents and boundary tones achieved a level of performance comparable to the variability among spoken versions of the same text, though only slightly better than the performance obtained by simple rules. However, we chose to predict the different levels of the hierarchy (accent location, tone labels, relative prominence and boundary tones) in a bottom-up algorithm, and it may be possible to further improve performance by modeling these processes jointly.
For example, a joint model might help overcome such problems as missing the L∗ L-H% accent and tone combination when low accents are not predicted. An advantage of a probabilistic approach is that it provides a clear framework for integrating phrase and accent prediction. The experiments reported here show dramatic differences in performance for algorithms on different corpora. For example, prediction of accent location from part-of-speech labels alone gives 85% accuracy on the radio news corpus vs. 57% accuracy reported by Altenberg (1987) for the London-Lund corpus. In addition, a well-motivated rhythm-based algorithm did poorly relative to a naive content/function word algorithm for predicting accents in the radio news style of speech. These findings highlight the importance of automatic training algorithms for prosody modeling: prosodic patterns reflect speaking style, and no single set of rules is appropriate for all styles. The results here are primarily based on features that were extracted automatically from text, and in principle all of the features could have been. There were minor corrections to the part-of-speech tags in the test set, but these were mainly particle/preposition distinctions which were too infrequent to be learned by the tree, and thus it is unlikely that they had any effect on the results. However, it is worth noting that the tagger used in this work was very accurate, and that some degradation would be expected with a lower accuracy tagger. Probably a bigger factor is the use of hand-labeled prosodic phrase boundaries. In a subsequent perceptual experiment with synthesized speech using automatically predicted boundaries, we find a small drop in subject ratings relative to the version with hand-labeled boundaries. However, the results do not necessarily indicate that accent prediction is sensitive to accurate phrase boundary location. Because of the minor differences in subject ratings for the correct vs.
predicted accent placement with hand-labeled prosodic boundaries, it is more likely
that the result is explained by the hypothesis that, for this style of speech, subjects are more sensitive to phrase boundary placement errors than to accent prediction errors. Some of the experimental results reported here are also of theoretical interest. In particular, the relative importance of neighboring accents in the decision tree for accent prediction provides further evidence for the accent clash theory. Although a hierarchical representation of prosodic structure is fairly well accepted in linguistics, it is interesting that we found it to be a good computational model as well, i.e. a more efficient representation of statistical dependence than a model operating entirely at the syllable level. Finally, the study of accent and tonal marker variability raises important questions about what types of tones can be predicted and how we should evaluate accuracy when there are differences in production within and across speakers. Such data may also be useful in the future for developing a better understanding of the factors influencing the choice of intonation markers: to what extent are they influenced by syntactic structure and discourse purpose vs. reflecting the style of a particular speaker? The main disadvantage of this general approach is that it requires a large amount of labeled data, and hand-labeling prosodic markers is very expensive. However, much progress is being made on the problem of automatic labeling (e.g. Wightman & Ostendorf, 1994). In addition, this work is itself useful in improving the accuracy of automatic labeling, in that the Markov tree model can serve as a ‘‘language model’’ that is more powerful than a simple tone-label bigram (Ostendorf & Ross, 1996). The models described here are also applicable more generally to prosody recognition for automatic speech understanding and have been used in prosody/parse scoring (Veilleux & Ostendorf, 1993a,b).
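The core decoding idea behind the Markov tree model can be sketched compactly: a decision tree supplies the conditional probability P(label_t | features_t, label_{t-1}), and a Viterbi search finds the most likely label sequence over a phrase. The tree probability function below is a hypothetical toy stand-in for a trained tree, encoding only that unstressed syllables are rarely accented and that an accent immediately after another accent (a clash) is disfavoured.

```python
import math

LABELS = ("unaccented", "accented")

def viterbi(observations, tree_prob):
    """observations: per-syllable feature dicts;
    tree_prob(features, prev_label) -> {label: probability}."""
    # Best log-probability of any partial sequence ending in each label.
    scores = {lab: math.log(tree_prob(observations[0], None)[lab])
              for lab in LABELS}
    backpointers = []
    for feats in observations[1:]:
        new_scores, pointers = {}, {}
        for lab in LABELS:
            best_prev = max(
                LABELS,
                key=lambda p: scores[p] + math.log(tree_prob(feats, p)[lab]))
            new_scores[lab] = (scores[best_prev]
                               + math.log(tree_prob(feats, best_prev)[lab]))
            pointers[lab] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path back from the highest-scoring final label.
    path = [max(LABELS, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

def toy_tree_prob(feats, prev):
    # Hypothetical leaf probabilities, not the paper's trained tree.
    if not feats["stressed"]:
        return {"unaccented": 0.95, "accented": 0.05}
    if prev == "accented":               # accent clash disfavoured
        return {"unaccented": 0.6, "accented": 0.4}
    return {"unaccented": 0.2, "accented": 0.8}

syllables = [{"stressed": True}, {"stressed": True}, {"stressed": False}]
best = viterbi(syllables, toy_tree_prob)
# -> ['accented', 'unaccented', 'unaccented']: the clash penalty suppresses
#    the second accent even though that syllable is stressed.
```

This is exactly why the sequence model outperforms an independent per-syllable tree: the per-syllable decision for the second stressed syllable would be ‘‘accented’’, but the joint search trades it off against the neighboring accent.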
Further, although we have focused here on text-to-speech, these same algorithms would be applicable to message-to-speech systems simply by using different prediction variables. With message-to-speech applications one would expect to achieve better results because of the availability of discourse information and, as mentioned above, automatic training can be useful both to accommodate the richer input and to handle application-dependent style differences. Although this work provides an advance in accent and tone prediction for speech synthesis, the problem still requires further work to obtain truly natural speech synthesis. For unrestricted text-to-speech synthesis, the infinitely large number of possible inputs and the lopsided distribution of different tonal elements together make the occurrence of some rare event a certainty. As evidenced by the problems in predicting low pitch accents, current modeling methods are not well suited to handling infrequently observed data types. To make significant advances, further work on mathematical modeling techniques as well as a better understanding of the linguistic dependence of these events is needed. For tone and accent location in particular, it is believed that factors such as semantics and discourse structure must be involved in prediction, but these areas are not themselves well understood, let alone how these factors combine with simpler cues such as lexical stress, punctuation and part-of-speech tags. Thus, much work remains before the problem of tone and accent prediction for synthesis is solved.

This research was funded in part by NSF under grant number IRI-8805680 and in part by a research grant from NYNEX, with additional support from the Department of Education under the Graduate Assistance in Areas of National Need Program, award number P200A90080. The radio news corpus was provided by WBUR, a Boston public radio station. We gratefully acknowledge Laura Dilley's help in labeling data.
References
Allen, J., Hunnicutt, M. S. & Klatt, D. (1987). From Text to Speech: The MITalk System. Cambridge: Cambridge University Press.
Altenberg, B. (1987). Prosodic Patterns in Spoken English. Lund, Sweden: Lund University Press.
Beckman, M. (1986). Stress and Non-stress Accent. Dordrecht: Foris.
Beckman, M. & Ayers, G. (1994). Guidelines for ToBI labeling, version 2.0. Manuscript and accompanying speech materials. [Obtain by writing to [email protected].]
Beckman, M. & Pierrehumbert, J. (1986). Intonational structure in Japanese and English. Phonology Yearbook, 3, 255–309.
Black, A. & Campbell, N. (1995). Predicting the intonation of discourse segments from examples in dialogue speech. Proceedings of the ESCA Workshop on Spoken Dialogue.
Bolinger, D. (1965). Pitch accent and sentence rhythm. In Forms of English: Accent, Morpheme, Order (I. Abe & T. Kanekiyo, eds). Tokyo: Hokuou.
Bolinger, D. (1981). Two kinds of vowels, two kinds of rhythm. Technical report, Indiana University Linguistics Club, Bloomington, IN, U.S.A.
Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth.
Collier, R. (1991). Multi-language intonation synthesis. Journal of Phonetics, 19, 61–73.
Dirksen, A. & Terken, J. (1991). Specifications of the procedure for prosodic marker assignment. Technical Report 789, Institute for Perceptual Research, Eindhoven, The Netherlands.
Grosz, B. & Sidner, C. (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12, 175–204.
Gussenhoven, C. (1991). The English rhythm rule as an accent deletion rule. Phonology, 8, 1–35.
Hirschberg, J. (1993). Pitch accent in context: predicting prominence from text. Artificial Intelligence, 63, 305–340.
House, J. & Youd, N. (1991). Contextually appropriate intonation in speech synthesis. Proceedings of Eurospeech, pp. 185–188.
Iwahashi, N. & Sagisaka, Y. (1993). Duration modelling with multiple split regression. Proceedings of Eurospeech, pp. 329–332, Berlin, Germany.
Jackendoff, R. S. (1972). Semantic Interpretation in Generative Grammar. Cambridge, MA: MIT Press.
Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82, 737–793.
Ladd, D. R. (1993). Notes on the phonology of prominence. ESCA Workshop on Prosody: Working Papers 41, pp. 90–95, Lund, Sweden.
Marcus, M. & Santorini, B. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19, 313–330.
Meteer, M., Schwartz, R. & Weischedel, R. (1991). POST: using probabilities in language processing. Proceedings of the International Joint Conference on Artificial Intelligence.
Monaghan, A. (1990). Rhythm and stress-shift in speech synthesis. Computer Speech and Language, 4, 71–78.
Monaghan, A. (1994). Intonation accent placement in a concept-to-dialogue system. ESCA/IEEE Workshop on Speech Synthesis, pp. 171–174, New Paltz, NY, U.S.A.
Ostendorf, M. & Veilleux, N. (1994). A hierarchical stochastic model for automatic prediction of prosodic boundary location. Computational Linguistics, 20, 27–54.
Ostendorf, M. & Ross, K. (1996). A multi-level model for recognition of intonation labels. In Computing Prosody (Y. Sagisaka, W. N. Campbell & N. Higuchi, eds). Berlin: Springer-Verlag.
Ostendorf, M., Price, P. & Shattuck-Hufnagel, S. (1995). The Boston University radio news corpus. Technical Report ECS-95-001, Boston University, Boston, MA, U.S.A.
Pierrehumbert, J. (1980). The phonology and phonetics of English intonation. PhD thesis, Massachusetts Institute of Technology, MA, U.S.A.
Pierrehumbert, J. (1981). Synthesizing intonation. Journal of the Acoustical Society of America, 70, 985–995.
Pierrehumbert, J. & Hirschberg, J. (1990). The meaning of intonational contours in the interpretation of discourse. In Intentions in Communication (P. R. Cohen, J. Morgan & M. Pollack, eds). Cambridge, MA: MIT Press.
Pitrelli, J. & Zue, V. (1989). A hierarchical model for phoneme duration in American English. Proceedings of the European Conference on Speech Communication and Technology, pp. 324–327, Edinburgh, Scotland.
Pitrelli, J., Beckman, M. & Hirschberg, J. (1994). Evaluation of prosodic transcription labeling reliability in the ToBI framework. Proceedings of the International Conference on Spoken Language Processing, 1, 123–126, Yokohama, Japan.
Prevost, S. & Steedman, M. (1994). Information based intonation synthesis. Proceedings of the ARPA Workshop on Human Language Technology, pp. 193–198, Plainsboro, NJ, U.S.A.
Price, P., Ostendorf, M., Shattuck-Hufnagel, S. & Fong, C. (1991). The use of prosody in syntactic disambiguation. Journal of the Acoustical Society of America, 90, 2956–2970.
Prince, E. (1981). Toward a taxonomy of given-new information. In Radical Pragmatics. New York: Academic Press.
Quené, H. & Kager, R. (1989). Automatic accentuation and prosodic phrasing for Dutch text-to-speech conversion. Proceedings of the European Conference on Speech Communication and Technology, I, pp. 214–217, Edinburgh, Scotland.
Riley, M. (1992). Tree-based modeling of segmental durations. In Talking Machines: Theories, Models, and Designs (G. Bailly, C. Benoît & T. R. Sawallis, eds), pp. 265–273. Elsevier Science Publishers B.V.
Ross, K. (1995). Modeling of intonation for speech synthesis. PhD thesis, Boston University, Boston, MA, U.S.A.
Ross, K., Ostendorf, M. & Shattuck-Hufnagel, S. (1992). Factors affecting pitch accent placement. Proceedings of the International Conference on Spoken Language Processing, 1, pp. 365–368, Banff, Canada.
Selkirk, E. (1984). Phonology and Syntax: The Relation Between Sound and Structure. Cambridge, MA: MIT Press.
Shattuck-Hufnagel, S., Ostendorf, M. & Ross, K. (1994). Stress shift and early pitch accent placement in lexical items in American English. Journal of Phonetics, 22, 357–388.
Silverman, K. (1987). The structure and processing of fundamental frequency contours. PhD thesis, University of Cambridge, Cambridge, U.K.
Silverman, K. (1993). On customizing prosody in speech synthesis: names and addresses as a case in point. Proceedings of the ARPA Workshop on Human Language Technology, pp. 317–322, Plainsboro, NJ, U.S.A.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J. & Hirschberg, J. (1992). ToBI: a standard for labeling English prosody. Proceedings of the International Conference on Spoken Language Processing, 2, 867–870, Banff, Canada.
Sproat, R. (1990). Stress assignment in complex nominals for English text-to-speech. Proceedings of the European Speech Communication Association Workshop on Speech Synthesis, pp. 129–132, Autrans, France.
Terken, J. (1996). Variation of accent prominence within the phrase: models and spontaneous speech data. In Computing Prosody (Y. Sagisaka, W. N. Campbell & N. Higuchi, eds). Berlin: Springer-Verlag.
Vanderslice, R. & Ladefoged, P. (1972). Binary suprasegmental features and transformational word-accentuation rules. Language, 48, 819–836.
Veilleux, N. & Ostendorf, M. (1993a). Probabilistic parse scoring with prosodic information. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, II, 51–55.
Veilleux, N. & Ostendorf, M. (1993b). Prosody/parse scoring and its application in ATIS. Proceedings of the ARPA Workshop on Human Language Technology, pp. 335–340, Plainsboro, NJ, U.S.A.
Wang, M. & Hirschberg, J. (1992). Automatic classification of intonational phrase boundaries. Computer Speech and Language, 6, 175–196.
Wightman, C. W. & Ostendorf, M. (1994). Automatic labeling of prosodic patterns. IEEE Transactions on Speech and Audio Processing, 2, 469–481.

(Received 6 December 1995 and accepted for publication 22 April 1996)
Appendix 1: Features available for prediction of abstract labels

1. Dictionary information
(a) Dictionary stress (eight categories):
(i) Reduced or unstressed vowel (0).
(ii) Primary accent with a lax vowel (1-lax).
(iii) Primary accent with a tense vowel (1-tense).
(iv) Secondary accent with a lax vowel that precedes the primary stress (2-early-lax).
(v) Secondary accent with a lax vowel that follows the primary accent (2-late-lax).
(vi) Secondary accent with a tense vowel that precedes the primary stress (2-early-tense).
(vii) Secondary accent with a tense vowel that follows the primary accent (2-late-tense).
(viii) Unmarked syllable in a one-syllable word (U).
(b) Number of syllables in the word (continuous).
(c) Syllable number from the word beginning (continuous).
(d) Syllable number from the word end (continuous).

2. Part-of-speech
(a) Part of speech (eight categories): University of Pennsylvania Treebank labels are given for the categories. The part-of-speech labels are grouped roughly according to function and to frequency of getting a pitch accent.
(i) Cardinal number (CD).
(ii) Adjectives (JJ, JJR, JJS).
(iii) Nouns (NN, NNS, NP, NPS).
(iv) Adverbs (RB, RBR, RBS, RP).
(v) More accentable verbs (VBG, VBN).
(vi) Less accentable verbs (VB, VBD, VBP, VBZ).
(vii) Miscellaneous 1 (PDT, PP, WP, WRB).
(viii) Miscellaneous 2 (CC, DT, EX, FW, IN, LS, MD, PP$, SYM, TO, UH, WDT, WP$).
(b) Previous word's part-of-speech (eight categories).
(c) Next word's part-of-speech (eight categories).
(d) Word class (four categories: less-accentable function word (FW-L), more-accentable function word (FW-M), content word (CW), or proper noun (PN)).
(e) Whether a word is contained in a complex nominal, its position, and whether it is a noun or an adjective. We defined a complex nominal as a sequence of adjectives and nouns ending in a noun and unbroken by prosodic phrase boundaries (breaks of 3 or higher) (seven categories).

3. Prosodic phrase structure
(a) Phrase break size before the word (seven categories).
(b) Phrase break size following the word (seven categories).
(c) Number of (syllables, words, content words) in (minor, major) phrase (continuous).
(d) Number of (syllables, words, content words) (to next, from last) (minor phrase break, major phrase break) (continuous).
4. New/given status
(a) Word is new/given to the paragraph (i.e. whether or not the exact same word occurred previously in the paragraph), or a function word (three categories).
(b) Word is new/given to the story, or a function word (three categories).
(c) Root of the word is new/given to the paragraph, or a function word (three categories).
(d) Root of the word is new/given to the story, or a function word (three categories).

5. Paragraph structure
(a) The position of the minor phrase within the sentence (three categories—first, middle, or last). For one-phrase sentences, there is only a first phrase, and for two-phrase sentences, there is a first and a last phrase.
(b) The position of the major phrase within the sentence (three categories—first, middle, or last).
(c) The position of the minor phrase within the paragraph (three categories—first, middle, or last).
(d) The position of the major phrase within the paragraph (three categories—first, middle, or last).
(e) The position of the sentence within the paragraph (three categories—first, middle, or last).
(f) The punctuation following the word (seven categories—none, period, question mark, exclamation point, comma, semi-colon, or colon).
(g) Sentence number in the paragraph (continuous).
(h) Phrase number in the paragraph (continuous).
(i) Number of words in the current sentence (continuous).
(j) Number of syllables in the current sentence (continuous).

6. Label of other units
(a) Pitch accent (four categories—none, high, downstepped, low).
(b) Boundary tone on word (four categories—none, L-L%, H-L%, L-H%).
(c) Previous boundary tone (four categories).
(d) Number of syllables from the last prominence (eight categories—0, 1, 2, 3, 4, 5, 6, and 7 or greater).
(e) Last preceding pitch accent type in the phrase (four categories—none, high, downstepped, low).
(f) Number of pitch accents since last minor break (continuous).
(g) Number of pitch accents since last major break (continuous).

Appendix 2: Questions used in accent prediction tree

Node  Question                               Left                  Right
1     Dictionary stress                      1, 2-early-lax        0, U, 2-late, 2-early-tense
2     Word class                             FW-M                  FW-L, CW, PN
3     Dictionary stress                      2-early-tense         Other
4     Syllables from minor break             >1·5                  <1·5
5     Syllable number in word                >0                    0
6     Syllables from last accent             0 or 1                >1
7     Content words in minor phrase          >2·5                  <2·5
8     Syllables from last accent             1                     0
9     Syllables from minor phrase break      >0                    0
10    Phrase break size after word           2, 3, 4, 6            0, 1, 5
11    Part-of-speech                         Other                 Miscellaneous 1 and 2
12    Phrase break size before word          ≥2                    0, 1
13    Syllables from last accent             1                     0
14    Part-of-speech                         Adjectives, nouns     Other
15    Syllables from minor phrase break      >1·5                  <1·5
16    Content words to minor phrase break    >3·5                  <3·5
17    Syllables from the end of word         >0                    0
18    Part-of-speech                         Other                 Less-accentable verbs, miscellaneous 1 and 2
19    Syllables from major phrase break      <8·5                  >8·5