Tone nucleus modeling for Chinese lexical tone recognition


Speech Communication 42 (2004) 447–466 www.elsevier.com/locate/specom

Jinsong Zhang *,1, Keikichi Hirose

Department of Information and Communication Engineering, School of Engineering, University of Tokyo, Bunkyo-ku, Tokyo 113-8656, Japan

Received 20 October 2003; received in revised form 20 October 2003; accepted 8 January 2004

Abstract

This paper presents a new scheme to deal with variations in fundamental frequency (F0) contours for lexical tone recognition in continuous Chinese speech. We divide the F0 contour of a syllable into a tone nucleus and adjacent articulatory transitions, and use only the acoustic features of the tone nucleus for tone recognition. The tone nucleus of a syllable is assumed to realize the target F0 of the associated lexical tone, and usually conforms to the standard tone pattern more closely than the articulatory transitions do. A tone nucleus can be detected from a syllable F0 contour by a two-step algorithm. First, the syllable F0 contour is segmented into several linear F0 loci that serve as candidates for the tone nucleus, using the segmental K-means segmentation algorithm. Then, the tone nucleus is chosen from the set of candidates by a predictor based on linear discriminant analysis. Speaker-dependent tone recognition experiments using tonal HMMs showed that our new approach achieved an improvement of up to 6% in tone recognition rate over a conventional one. This indicates not only that the tone nucleus keeps important discriminant information for the lexical tones, but also that our tone-nucleus based tone recognition algorithm works properly.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Tone recognition; Underlying-target; Articulatory transition; Tone-nucleus

1. Introduction Chinese (Mandarin or Standard Chinese) is a well-known tonal language in which pitch tones play important phonemic roles. Each syllable in Chinese corresponds to a morpheme (ideographic character) and is associated with a pitch tone

* Corresponding author. Present address: ATR Spoken Language Translation Laboratories, 2-2-2 Kansai Science City, Kyoto 619-0288, Japan. Tel.: +81-774-95-1314; fax: +81-774-95-1308. E-mail address: [email protected] (J. Zhang). 1 This work was done when the author was affiliated with the University of Tokyo.

(usually referred to as a lexical tone). Phonemically, a syllable is divided into two parts: an Initial and a Final. An Initial is a consonant, or may be absent. A Final may be a vowel, a diphthong, or a triphthong, with an optional nasal ending. There are four basic lexical tones (referred to as Tones 1, 2, 3 and 4, respectively) and a neutral tone. The four basic lexical tones are characterized by their perceptually distinctive pitch patterns, conventionally referred to by linguists as the high-level, high-rising, low-dipping and high-falling tones (Chao, 1968, pp. 25–26). The neutral tone, according to (Chao, 1968), does not have any specific pitch pattern; it is highly dependent on the preceding tone and is usually perceived to be

0167-6393/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2004.01.001


temporally short and with zero pitch range [p. 35]. There are only a very small number of morphemes, such as suffixes and particles, which are always uttered in the neutral tone [p. 36]. Fundamental frequency (henceforth F0) contours are the main acoustic manifestation of pitch tones, and there seem to be distinctive F0 patterns associated with the four basic lexical tones, as illustrated in Fig. 1. There are a number of reasons for recognizing lexical tones using F0 contours. First, while the total number of phonetically differentiated syllables in Chinese is about 1282 (Cheng and Wang, 1990, p. 45), the number of base syllables, which differ only in segmental features when lexical tones are ignored, is only 412. This means there exist a large number of homophone morphemes in Chinese, and tone recognition offers a method to discriminate homophone words. Second, automatic detection of intonation structure such as prosodic phrase boundaries, focus, stress locations, etc. has been attracting more and more attention recently (Wightman and Ostendorf, 1994; Niemann et al., 1998; Hirose and Iwano, 1999; Stolcke et al., 1999), because it probably offers an answer to problems residing in spontaneous speech such as ungrammatical utterances, disfluencies and syntactic ambiguities. Similar to the lexical tones, the intonation structure is also acoustically manifested mainly through F0 contours (Fujisaki, 1997). The interplay of the two results in sentential F0 variations so complex that the F0 cues for intonation structure cannot be clarified unless those due to the lexical tones are known

Fig. 1. Standard distinctive F0 patterns of the four basic lexical tones.

(Xu, 1997a). Therefore, the study of tone recognition from F0 contours is the first and unavoidable step in incorporating intonation information processing into Chinese spoken dialogue systems. Third, tone recognition is helpful for developing an automatic prosodic labeling system for Chinese speech. A prosodically labeled database is useful for prosody analyses and for the development of data-driven text-to-speech (TTS) systems (Wightman and Ostendorf, 1994). Since syllable tones often change from their original dictionary types due to various factors such as tone neutralization, morphophonemic tone sandhi (Chao, 1968), the speaker's dialect, grammatical structure (Shih, 1990), etc., it is necessary to recognize and label the syllable tones according to their real tonalities. An automatic tone recognizer is preferred because human labeling is time consuming and expensive. Fourth, tone recognition is crucial for setting up computer-aided language learning (CALL) systems for foreigners learning Chinese. Pronouncing lexical tones is rather difficult for speakers of non-tonal languages, in which pitch usually expresses stress or intonation patterns instead of differentiating lexical meanings. A tone recognizer may offer a diagnosis of a learner's pronunciation, and show the learner how to correct his or her pitch features. Finally, besides the above-mentioned reasons, there are many other motivations for tone recognition, such as improving digit recognition performance (Wang and Seneff, 1998). In this paper, we propose a new tone recognition method, which not only offers a systematic way of dealing with F0 variations for tone recognition, but may also help to detect sentential intonation structure. Previous studies have shown that it is easy to recognize lexical tones in isolated speech (Yang et al., 1988; Le et al., 1993), but rather difficult to recognize them from F0 contours of continuous speech (Wang et al., 1997; Liu et al., 1999).
This difference in performance can be ascribed to the fact that the lexical tones show consistent tonal F0 patterns when uttered in isolation, but complex variations in continuous speech. The complex F0 variations originate from the mechanico-physiological realization of the compound intonation functions, including the lexical tones


and the sentential intonation structure (Xu and Wang, 2001). On the one hand, the confounded nature of information in F0 contours may obscure the F0 variations cueing lexical tones. For example, focus is usually realized as F0-range expansion of focused words that are not in the final position of an utterance, and F0-range suppression of post-focus words. When a Tone 1, whose F0 contour usually shows a high and flat pattern, is related to a focus, it may change into a rising or falling shape (Xu, 1999). On the other hand, F0 contours, which reflect the periods of the successive vibrations of the human vocal cords, vary due to articulatory constraints. For example, voiced or unvoiced syllable Initial segments may lead to quite different syllable F0 contours even when the lexical tones are the same (Howie, 1974). The inertial characteristics of the bio-mechanical vibrations make neighboring tones interfere with each other extensively, so that the early portion of a syllable F0 contour tends to vary according to the carryover effect of the preceding tone (Xu, 1999). In order to cope with F0 variations in tone recognition, some researchers have attempted to develop prosody models for intonation functions and articulatory constraints. However, these attempts usually modeled only a few specific F0 variations, and thus lacked generality. Wang et al. (1990) proposed to use the accent command coefficients of a well-known F0 contour generation model (Fujisaki and Hirose, 1984) to recognize lexical tones, assuming prosodic phrasing effects could be modeled by the phrase command coefficients of the model. In (Wang and Chen, 1994), Wang and Chen utilized prosodic phrasing information extracted from text transcripts to aid tone recognition. A disadvantage of these approaches is that the assumed prosody information is usually unavailable in a speech recognition system.
It is rather difficult to decompose a Chinese sentential F0 contour into the phrase and accent components of the generation model without human intervention. And the text transcripts are usually the very goal of the recognition, available only in a database labeling system. In (Chen and Wang, 1995), Chen and Wang investigated modeling sentential intonation patterns, such as declarative or interrogative utterances, by five different states of a neural net.


Wang et al. (1997) adopted context-dependent HMMs to model coarticulation effects between neighboring lexical tones. A problem with these attempts is that intonation patterns and coarticulation effects in continuous speech are usually much more complex than the proposals can model. Thus the proposals did not prove very effective, and the performance improvement was limited: only about a 1–2% increase in tone recognition rates. Furthermore, due to a lack of systematic linguistic considerations, the above approaches proved difficult to improve further or to extend to the detection of sentential intonation structure. Our study reported in this paper is quite different from the conventional ones mentioned above. The basic difference lies in that we classify a syllable F0 contour into segments of underlying target and articulatory transitions, and use only the acoustic features of the underlying-target segment for tone recognition, whereas nearly all the conventional approaches used acoustic features of a whole syllable. The proposal also differs from the main-vowel based approaches of ours (Zhang and Hirose, 1998) and of Chen (Chen et al., 2001), in that it does not assume a consistent alignment between the segmental phones and tonalities. It thus offers a systematic approach to dealing with F0 variations resulting from various factors, such as voiced/unvoiced syllable Initials, coarticulation effects between neighboring tones, and sentential intonation structure. Although the extracted F0 segments are potentially important for the detection of sentential intonation structure, this paper only presents their application to tone recognition, leaving other issues for further studies. The rest of this paper is organized as follows.
Section 2 introduces the representation of syllable F0 variations by the proposed F0 segmental structure model (referred to as the tone-nucleus model), in which each syllable F0 contour is divided into a tone nucleus and articulatory transitions. The model originates in linguistic studies and is consistent with diverse linguistic findings. Section 3 gives a general description of the tone recognition system and the speech database used in the study. Section 4 addresses the statistical means


to extract the tone nucleus from a syllable F0 contour, based on the segmental K-means segmentation algorithm (Rabiner and Juang, 1993, p. 382) and linear discriminant analysis (LDA). Section 5 evaluates the proposals through tone recognition experiments on the continuous speech database. Finally, we conclude in Section 6.

2. Segmental representation of syllable F0 variations

A consistent account of the F0 variations of a syllable F0 contour in various phonetic contexts is desired for the development of a systematic method to deal with F0 variations in tone recognition. Based on a survey of linguistic findings, we proposed an F0 segmental structure model which divides a syllable F0 contour into an underlying target and articulatory transitions, and assumes no direct relation between the F0 segments and the internal structure of a syllable. This proposal results from the continuing convergence of observations on one basic issue:

• Lexical tone is not evenly distributed in a syllable.

2.1. Underlying target and articulatory transition

By observing and comparing isolated tonal F0 contours of syllables with different segmental structures, especially between those with voiced and unvoiced syllable Initials, Howie (1974) concluded that the domain of a syllable tone in Chinese does not spread to the entire voiced part or the entire vocalic part of the syllable, but rather is confined to the rhyming part of the syllable, including the vowel and its succeeding voiced segment [p. 147]. In (Rose, 1988), Rose suggested that the perceptually valid F0 contour should not include the period corresponding to the articulatory transition between two syllables. Shih (1988) proposed that different tones have different alignments with the syllable: some start from the rhyme onset (Tones 1 and 3), but others start later, even from the middle of the rhyme (Tone 2). In (Whalen and Xu, 1992), the authors tried to

examine the critical information for tone perception by extracting small segments from a syllable, and the experimental results led the authors to conclude that not every portion of the F0 contour indicates the original tone [p. 25]. Also, through perceptual experiments using stimuli sliced from a number of isolated syllables, Lin (1995) assumed that a Chinese tone in isolation is mainly related to the syllabic vowel and its adjacent transitions, whereas neither initial consonants and glides nor final nasals play any tone-carrying role. Based on statistical results, Xu and Wang (2001) revealed that there should be no direct relation between the segmental structure of a syllable and the tonal contour, and suggested that the early portion of an F0 contour always varies with the ending F0 of the preceding syllable, whereas the later portion converges to a contour that seems to conform to the purported underlying pitch values. Although there still exist some differences among the above-mentioned views, those studies, together with a number of others unmentioned here, seem to converge to one point: a portion of a syllable F0 contour conforms to the underlying pitch better than the other portions. If we ignore the issue of the alignment of the F0 contour with segmental units, on which the literature holds rather diverse views, we can reach a general agreement on the fact that tone information is not evenly distributed in a syllable F0 contour: some segment of a syllable F0 contour may carry more tone information than the others. Therefore, we can classify a syllable F0 contour into an underlying target and articulatory transitions:

• Underlying target: represents the F0 target and serves as the main acoustic cue for pitch perception.

• Articulatory transitions: the F0 variations occurring as the transitions to or from the pitch targets.

2.2.
Tone nucleus model

After arriving at the view that a syllable F0 contour can be divided into an underlying target and articulatory transitions, we proposed an F0 segmental structure model of Chinese syllable F0


contours (referred to as the tone nucleus model) which gives a systematic view of the variations in a syllable F0 contour. As illustrated in Fig. 2, a syllable F0 contour may be divided into three segments: onset course, tone nucleus and offset course. The tone nucleus represents the underlying pitch targets and is obligatory; the onset and offset courses are articulatory transitions and optional. Each F0 segment usually exhibits an asymptotically linear curve (Xu and Wang, 2001), and one syllable F0 contour should contain no more than three F0 segments with quite different slopes. Based on previous studies (Xu, 1998), we assume that there are no consistent relations between the F0 segmental structure and the syllable internal structure, but that the tone nucleus should reside in the Final portion of a Chinese syllable.

• Tone nucleus: the portion of the F0 contour that represents the pitch targets of the lexical tone. It is the segment containing the most critical information for tonality perception, and is thus also called the tone-critical segment.

• Onset course: the asymptotic F0 transition locus to the tone-onset target from a preceding vocal cords' vibration state.

• Offset course: the F0 transition locus from the tone-offset target to a succeeding vocal cords' vibration state.

The tone-onset target and tone-offset target indicate the pitch values, each taking either a low (L) or high (H) value, at the tone onset and offset, respectively. These pitch values serve as distinctive


Table 1
Pitch target features of the four lexical tones

Targets    Tone 1    Tone 2    Tone 3    Tone 4
Onset      H         L         L         H
Offset     H         H         L         L

‘‘H’’ and ‘‘L’’ depict high and low targets, respectively.
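The target inventory of Table 1 can be written down directly as a lookup table. The sketch below is purely illustrative and is not the recognizer used in this paper (which matches tone-nucleus features against tonal HMMs, Section 5); in particular, the fixed midline threshold is an invented simplification.

```python
# Illustrative only: encode the (onset, offset) pitch targets of Table 1
# and map a tone-nucleus F0 segment to the matching target pair by
# thresholding its endpoints against a reference midline. The midline is
# an assumed simplification; the paper's recognizer uses tonal HMMs.

TONE_TARGETS = {
    1: ("H", "H"),  # high-level
    2: ("L", "H"),  # high-rising
    3: ("L", "L"),  # low-dipping
    4: ("H", "L"),  # high-falling
}

def classify_nucleus(f0_nucleus, midline):
    """Return the basic tone whose targets match the nucleus endpoints."""
    onset = "H" if f0_nucleus[0] >= midline else "L"
    offset = "H" if f0_nucleus[-1] >= midline else "L"
    for tone, targets in TONE_TARGETS.items():
        if targets == (onset, offset):
            return tone

print(classify_nucleus([130, 160, 210], midline=170))  # low onset, high offset -> 2
```

The point of the table is that the four basic tones are fully distinguished by their two endpoint targets, which is exactly what confines the discriminant information to the tone nucleus.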

features characterizing the four basic lexical tones (Table 1) (Xu, 1997b). Fig. 3 illustrates some frequently observed tonal F0 variations in continuous speech. Although they show great deviations from the standard F0 patterns in Fig. 1, we can see that, from the viewpoint of the tone nucleus model, the tone nuclei delimited by the vertical sticks still show patterns consistent with the underlying pitch targets, while the F0 variations in the beginning and ending portions of a syllable belong to articulatory transitions. Among the four basic lexical tones, Tone 3 differs from the other three, in that the others are associated with rather stable F0 patterns, i.e. Tone 1 with a high and flat F0, Tone 2 with a rising F0 and Tone 4 with a falling F0 (hence their contouricity) (Rose, 1988), whereas Tone 3 is found with a rather wide variety of F0 patterns. Tone 3 was conventionally associated with a dipping F0 contour (Chao, 1968, p. 26), but other studies have shown that the final rise is usually absent in non-prepausal positions (Xu, 1997b). Therefore, the view of the tone nucleus of Tone 3 may be flexible: either the whole syllable F0 contour or the falling segment may be considered as the tone nucleus.

Fig. 2. Illustration of the proposed F0 segmental structure model of Chinese syllable F0 contours. F0 segments in parentheses (onset course, offset course) are optional; only the tone nucleus is obligatory.

Fig. 3. Illustration of tonal F0 contours with possible articulatory transitions for Tones 1, 2 and 4. The left and right vertical sticks in each contour correspond to the possible tone onset and offset locations, and the middle F0 segment delimited by the tone onset and offset in each contour represents the tone nucleus. c1, c2 and c3 depict the onset courses, the tone nuclei and the offset courses, respectively.

2.3. Systematic framework to deal with F0 variations

Through F0 observations it was found that an articulatory F0 transition may occupy a large portion of a syllable F0 contour: 30% in (Howie, 1974); 50–100 ms after the release of the initial consonant in (Rose, 1988, p. 76), accounting for about 40% in the given example [p. 65]; and more than 50% when the rising tone starts from the middle of the rhyme (Shih, 1988). Although the substantial F0 variations in articulatory transitions exert little influence on human pitch perception, they may confuse an automatic tone recognizer. The tone-nucleus model offers a possible systematic framework to deal with F0 variations that result from both articulatory constraints and confounded intonation functions, for tone recognition and intonation function detection. Here, we briefly illustrate possible ways to deal with the substantial F0 variations due to the following factors 2 for tone recognition: (1) voiced/unvoiced syllable Initial segments; (2) prosodic phrase boundary; (3) focus.

Fig. 4. Illustrations of F0 contours of continua of ‘‘Tone 2 Tone 2’’, with a contrast of a voiced Initial segment (a) and an unvoiced Initial segment (b) in the second syllables. The thin vertical lines indicate the syllable boundaries.

2 There are other factors, such as a fast speaking rate, that may lead to significant F0 variations.

2.3.1. Voiced/unvoiced syllable Initial segments

Fig. 4 illustrates F0 contours of two continua of ‘‘Tone 2 Tone 2’’. Due to the voicing (or not) of the Initial segment of the second syllables, the F0 contours of the second syllables in the two continua differ greatly. Since conventional tone recognizers based on hidden Markov models (HMMs) or neural nets usually take whole syllable F0 contours as important acoustic cues for the lexical tones (Wang et al., 1997; Chen and Wang, 1995), such a difference due to voiced and unvoiced syllable Initial segments may not only affect the training of tonal acoustic models, but also easily lead to recognition errors. The F0 contour of the second syllable in (a) of Fig. 4 has a dipping shape like that of Tone 3, and tends to be mis-recognized as Tone 3. However, according to the tone-nucleus model, the F0 loci ‘‘BC’’ in (a) and (b) of Fig. 4 are the tone nuclei, whereas the F0 locus ‘‘AB’’ in (a) is an articulatory transition. If the articulatory transition is ignored in tone recognition, the above problems may be avoided.

2.3.2. Prosodic phrase boundary

Prosodic phrasing refers to the perceived groupings of words in speech, and it was suggested (Zhang and Kawanami, 1999) to have influences on the laryngeal coarticulatory overlap during tone production. Although this issue still lacks a systematic study, it was found that when there is a prosodic boundary between two neighboring tones, the F0 contour of the second syllable is usually free from coarticulation effects from the first syllable. Given a continuum of Tones 4 and 1, the ‘‘H’’ targets of Tone 1 are usually rather lower than the ‘‘H’’ target of the preceding Tone 4, mainly


due to the carryover lowering effect from the ‘‘L’’ offset target of Tone 4 (Xu and Wang, 2001) when the two tones are in one word. But when there exists a prosodic boundary (such as a phrase boundary) between the two tones, the carryover lowering effect may be blocked and the ‘‘H’’ targets of Tone 1 are raised to a higher position, as illustrated in Fig. 5. This often results in a substantial rising transition F0 locus ‘‘CD’’ to the ‘‘H’’ targets of Tone 1. However, based on the tone-nucleus model, the rising ‘‘CD’’ F0 locus is an articulatory transition, and the tone information is mainly carried by the tone nucleus ‘‘DE’’, whose shape still conforms to the underlying pitch targets. Furthermore, if the tone nuclei ‘‘AB’’ and ‘‘DE’’ of the two tones can be detected, then the F0 range reset h of the Tone 1 will be an important cue for detecting the prosodic phrasing boundary, which is very useful in developing spontaneous speech recognition and understanding systems (Niemann et al., 1998).

Fig. 5. Illustration of an F0 contour of a continuum of ‘‘Tone 4 Tone 1’’, with a prosodic phrase boundary between them.

2.3.3. Focus

Sentential focus may also bring about substantial F0 variations to syllable F0 contours. According to Xu (1999), focus is related to three distinct pitch ranges: expanded in non-sentence-final focused words, suppressed in post-focus words, and neutral in all other words. Fig. 6 illustrates this idea through a continuum of three Tone 1s with a focus on the second Tone 1.

Fig. 6. Illustration of a typical F0 contour for a continuum ‘‘Tone 1 Tone 1 Tone 1’’, with a focus placed on the second Tone 1 syllable.

We can see that the focus leads to substantial rising and falling F0 loci in the early portions of the F0 contours of the second and the third syllables, respectively. These kinds of F0 loci are regarded as articulatory transitions according to the tone-nucleus model, whereas the segments ‘‘AB’’, ‘‘CD’’ and ‘‘EF’’ are

the tone nuclei of the three Tone 1s. Compared with the whole syllable F0 contours, the tone nuclei show patterns more consistent with the underlying pitch targets. Also, if the three tone nuclei can be detected, the range differences represented by h1 and h2 can be estimated as the gaps between the preceding tone offsets and succeeding tone onsets of neighboring tone nuclei. These range differences may serve as acoustic cues for focus detection, which is important for interpreting the speaker's intention in spontaneous dialogue systems.

2.3.4. Tone recognition based on tone nuclei

From the above discussion, we can see that tone recognition can be made free from the influence of articulatory transitions by taking only the tone nuclei into account. Obviously, segmenting a syllable F0 contour into a possible tone nucleus and articulatory transitions is the prerequisite for this tone recognition scheme. Since the tone nucleus is suggested to reside in a syllable Final, and the segmentation into Initial and Final is assumed available from a phoneme recognition process, we focus on the Final portion to detect the tone nucleus.

3. System overview and speech database

Having explored the linguistic foundations of our tone recognition algorithm, we now give an overview of the basic architecture of the tone recognition system, as illustrated in Fig. 7. With the aid of the phonetic segmentation into Initial consonants and Finals, which is assumed available from a phoneme recognition process, only the F0 contours corresponding to the Finals are kept from the original F0 contours for tone recognition. Panels (b) and (c) of Fig. 11 illustrate the original and the extracted F0 contours of Finals. Then, a tone nucleus is detected for each Final's F0 contour. Finally, tone recognition is carried out by matching the acoustic features of a tone nucleus against tonal HMMs.

Fig. 7. Block diagram for the tone recognition system: Final F0 extraction from F0 and energy features and the phonetic segmentation, tone nucleus detection using trained discrimination coefficients, and acoustic matching against tonal HMMs to output the recognized lexical tones.

Data of a female speaker (0f) in the speech corpus HKU96, published by Hong Kong University, were used in the following study. There are two reasons for selecting these data. The first is that the problem dealt with here concerns intra-speaker variation more than inter-speaker variation, so testing on a speaker-dependent database is necessary as a first step to show the validity of our approach. The second is that speaker 0f is a phonetician who was involved in developing the HKU96 corpus: her speech was much more natural than that of other speakers, whose speech often sounded more like ‘‘citation’’ style. We took 500 utterances (6419 syllables), labeled from cs0f0001 to cs0f0500, for tonal acoustic model training, and 200 utterances (2567 syllables), labeled from cs0f0501 to cs0f0700, as the testing set for tone recognition. Table 2 gives the details of the database. The average utterance length is 12.8 syllables in both the training and testing sets. The utterances have average speaking rates ranging from 3.8 to 4.9 syllables per second, and sometimes more than 5 syllables per second.

Table 2
Token numbers of each tone in the training and testing data sets

Data set    Tone 1    Tone 2    Tone 3    Tone 4    Neutral tone    Total
Training    1138      1137      1004      2954      186             6419
Testing     473       430       391       1115      158             2567

The corpus offers phoneme, syllable (in Pinyin) and lexical tone labels together with orthographic transcriptions. All the labels were manually checked to correct possible errors: spectrograms, sometimes together with F0 contours, were referred to for segmental alignment. A few tone labels were modified according to their real tonalities: for example, according to the Tone 3–Tone 3 tone sandhi rule, the first of two adjacent Tone 3s was changed to Tone 2. And according to some other morphophonemic tone sandhi rules (Chao, 1968), syllables like ‘‘yi’’ and ‘‘bu’’ have tonal alternations depending on their following tones. F0 was extracted using the integrated F0 tracking algorithm (IFTA) (Secrest and Doddington, 1983), and manual corrections were applied to any significant F0 tracking errors such as halved and doubled frequencies. The window size for log-energy analysis was 20 ms, and the frame shift for F0 and log-energy extraction was set at 10 ms.
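To make the front end concrete, the following sketch turns an F0 track sampled at the 10 ms frame shift into the (log F0, Δlog F0) observation vectors used for segmentation in Section 4. The F0 values below are invented for illustration; in the paper, F0 comes from the IFTA tracker with manual correction.

```python
# Sketch of the front-end observations: log F0 and its frame-to-frame
# delta at a 10 ms frame shift (as in the paper). The F0 values used in
# the example are invented; real values come from the IFTA tracker after
# manual correction of halving/doubling errors.
import math

def observations(f0_track_hz):
    """(log F0, delta log F0) per 10 ms frame; the first frame's delta is 0."""
    logs = [math.log(f) for f in f0_track_hz]
    return [(logs[j], logs[j] - logs[j - 1] if j > 0 else 0.0)
            for j in range(len(logs))]

for o in observations([200.0, 205.0, 210.0, 208.0, 190.0]):
    print("log F0 = %.3f  delta = %+.4f" % o)
```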

4. Tone-nucleus detection

As the relation between the tone nucleus and the syllable-internal segmental structure was not found to be tight (Xu, 1998), we developed a standard pattern recognition scheme to detect tone nuclei from the F0 contours of syllable Finals, instead of the heuristic approach of using vowel positions to segment syllable F0 contours as in (Zhang and Hirose, 1998; Chen et al., 2001). The scheme, as illustrated in Fig. 8, mainly includes two steps: extracting prosodic features of the candidate F0 loci for the tone nucleus, and then discriminating the tone nucleus from the other candidate F0 loci. However, the candidate F0 loci are not available beforehand; they need to be created through automatic segmentation of a Final F0 contour into a concatenation of asymptotically linear F0 loci.
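The discrimination step can be pictured with a generic two-class Fisher linear discriminant. This is only a hedged sketch: the paper does not spell out its exact LDA formulation or feature set, and the toy two-dimensional features and training data below are invented.

```python
# A generic two-class Fisher LDA sketch for the discrimination step.
# The paper's actual features (time-, energy- and F0-related variables)
# and coefficients are not reproduced; the toy data below is invented.
import numpy as np

def fisher_lda(X_pos, X_neg):
    """Weight vector maximizing between-class over within-class scatter."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = (np.cov(X_pos.T, bias=True) * len(X_pos)
          + np.cov(X_neg.T, bias=True) * len(X_neg))  # pooled within-class scatter
    return np.linalg.solve(Sw, m_pos - m_neg)

def pick_nucleus(candidate_features, w):
    """Index of the candidate F0 locus scored most nucleus-like."""
    return int(np.argmax([w @ np.asarray(c) for c in candidate_features]))

rng = np.random.default_rng(0)
nuclei = rng.normal([1.0, 1.0], 0.2, size=(50, 2))    # nucleus-like training loci
others = rng.normal([-1.0, -1.0], 0.2, size=(50, 2))  # transition-like training loci
w = fisher_lda(nuclei, others)
print(pick_nucleus([[-0.9, -1.1], [1.1, 0.9]], w))  # -> 1
```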

Fig. 8. The scheme for tone-nucleus detection: F0 contour segmentation (from F0 and energy features and the phonetic segmentation), a check on the number of segments, prosodic feature extraction, discrimination using the discrimination coefficients, and tone nucleus smoothing.

Some difficulties exist for the segmentation and discrimination. One is how to robustly segment the syllable F0 contours into successive linear F0 loci. A segmentation method based on detecting turning points such as peaks and valleys in F0 contours, as used in (Hsie et al., 1988; Garding and Zhang, 1997), is not a good choice, since there are possible extra local peak and valley points in F0 contours due to their temporal fluctuations. Also, it is difficult to find an appropriate threshold to segment an F0 contour like that of Tone 1 in Fig. 3, since the slope variation is gradual. Furthermore, the number of linear F0 loci in a syllable F0 contour is uncertain: it may be one, two or three. The second difficulty is that we do not have adequate knowledge of acoustic features to discriminate the tone nucleus from the other F0 loci. Hence, we proposed to segment F0 contours by the segmental K-means segmentation algorithm (Rabiner and Juang, 1993, p. 382), and to amalgamate neighboring segments according to whether they pass a hypothesis test on equal means of F0 slopes. For the F0 contours that have two segments, we collected a number of acoustic features, such as time-, energy- and F0-related variables, and conducted analyses of variance (ANOVA) with respect to the position of the tone nucleus, in order to get an image of the feature distributions of tone nuclei. Then we exploited linear discriminant analysis (LDA) to design a discriminant function to predict the position of the tone nucleus.

4.1. F0 contour segmentation via segmental K-means segmentation algorithm

The segmental K-means segmentation algorithm is a variant of the well-known K-means iterative procedure for data clustering. The variation lies in that the re-assignment of samples to the clusters is achieved by finding the optimum segmental state sequence via the Viterbi algorithm, and then backtracking along the optimal path. The procedure is illustrated in Fig. 9. Let the observation sequence $O = (o_1 o_2 \cdots o_N)$ represent a Final's F0 contour, divided into $I$, $1 \leq I \leq 3$, successive segments. The observation vector $o_j$ is a two-component vector $(\log F0_j, \Delta\log F0_j)$. The $i$th, $1 \leq i \leq I$, segment centroid is assumed to have the p.d.f. of the multivariate Gaussian $p(o \mid \Phi_i)$, where the parameter vector $\Phi_i$ includes the mean vector $\mu_i$ and covariance matrix $\Sigma_i$, which are obtained from the $n_i$ observation points of the $i$th segment by the maximum likelihood estimates:

$$\hat{\mu}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} o_k \qquad (1)$$

$$\hat{\Sigma}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} (o_k - \hat{\mu}_i)(o_k - \hat{\mu}_i)^t \qquad (2)$$

In the re-segmentation step, the likelihood

$$p(o_j \mid \Phi_i) = \frac{1}{2\pi\,|\hat{\Sigma}_i|^{1/2}} \exp\!\left( -\frac{1}{2} (o_j - \hat{\mu}_i)^t\, \hat{\Sigma}_i^{-1}\, (o_j - \hat{\mu}_i) \right) \qquad (3)$$

can be used in the Viterbi search to decide which segment the point $o_j$ belongs to. When a segmentation of a Final's F0 contour becomes available, a check is made as to whether two successive segments can be amalgamated, based on the following two principles:


[Fig. 9 flowchart, in outline: initialize equal segment lengths n1 = n2 = n3 (segmental lengths in frames); maximum-likelihood estimation of the centroid parameters; re-segmentation through Viterbi search, backtracking along the optimal path, until convergence; if I = 3 and n2 < 5 frames, set I = 2 and repeat; otherwise T-test on neighboring segments, setting I = I − 1 on each amalgamation, until no further amalgamation; output the segmentation result.]

Fig. 9. The segmental K-means segmentation procedure used to segment a syllable F0 contour into a possible number of linear F0 loci.
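The segment-and-merge loop of Fig. 9 can be sketched as follows. This is a simplified stand-in, not the authors' implementation: observations are (log F0, Δlog F0) pairs, each segment is modeled by a diagonal-covariance Gaussian, an exhaustive search over ordered boundary placements replaces the Viterbi pass (equivalent for these short contours), and the merge check uses a fixed normal-approximation critical value with a two-segment residual variance instead of the exact Student-t quantile over all I segments. All function and parameter names are illustrative.

```python
import math
from itertools import combinations

def gauss_loglik(x, mean, var):
    # Diagonal-covariance Gaussian log-likelihood of a 2-dim point (cf. Eq. (3)).
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))

def fit_segment(points):
    # Maximum-likelihood mean and (floored) diagonal variance (cf. Eqs. (1)-(2)).
    n = len(points)
    mean = [sum(p[d] for p in points) / n for d in range(2)]
    var = [max(sum((p[d] - mean[d]) ** 2 for p in points) / n, 1e-4)
           for d in range(2)]
    return mean, var

def segment_f0(obs, n_seg, n_iter=10):
    """Split obs, a list of (logF0, dlogF0) frames, into n_seg ordered
    segments; returns the boundary indices (0, ..., len(obs))."""
    n = len(obs)
    bounds = tuple(round(i * n / n_seg) for i in range(n_seg + 1))  # equal init
    for _ in range(n_iter):
        params = [fit_segment(obs[bounds[i]:bounds[i + 1]])
                  for i in range(n_seg)]
        # exhaustive search over ordered boundaries stands in for Viterbi
        best = max(
            ((sum(gauss_loglik(obs[j], *params[i])
                  for i in range(n_seg) for j in range(b[i], b[i + 1])), b)
             for b in ((0,) + cuts + (n,)
                       for cuts in combinations(range(1, n), n_seg - 1))),
            key=lambda t: t[0])[1]
        if best == bounds:   # converged
            break
        bounds = best
    return bounds

def should_merge(dlog1, dlog2, s0, crit=1.96):
    """T-test in the spirit of Eqs. (4)-(5) on two neighbouring loci's
    Delta-logF0 values: True means H0 (equal slope ratios) is not
    rejected, so the segments should be merged."""
    k1 = sum(dlog1) / len(dlog1)   # slope-ratio estimates K-hat
    k2 = sum(dlog2) / len(dlog2)
    n1, n2 = len(dlog1), len(dlog2)
    resid = (sum((d - k1) ** 2 for d in dlog1)
             + sum((d - k2) ** 2 for d in dlog2))
    s = max(resid / (n1 + n2 - 2), s0)            # variance floor S0
    cp = crit * math.sqrt(s * (1 / n1 + 1 / n2))  # critical point, cf. Eq. (5)
    return abs(k1 - k2) <= cp
```

A contour whose slope genuinely changes keeps its boundary, while two loci with statistically indistinguishable slopes are merged, mirroring the I = I − 1 step of Fig. 9.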

• Phonetic rule: a tone nucleus should be longer than 50 ms, since pitch perception studies have revealed a duration threshold of 40–60 ms for the perception of an F0 contour (Rose, 1988, p. 70).
• Statistical rule: if there is no significant statistical evidence for distributional differences between two neighboring F0 loci, they are merged.

When a Final's F0 contour is divided into 3 segments, the middle segment should correspond to the tone nucleus according to the tone nucleus model, and the phonetic rule requires its length to exceed 50 ms, i.e. n2 ≥ 5 frames. Otherwise the number of segments I is reduced to 2 and the segmentation is repeated once more. Next, two successive segments of the F0 contour are checked whether to amalgamate

into one or not, based on a T-test on the slope ratio K of the linear regression of log F0 on time t. For an observation point $(t_j, \log F0_j)$ in segment $c_i$:

$\log F0_j = K_i t_j + C_i$

where $C_i$ is a constant. $K_i$ may be estimated by averaging $\Delta\log F0_j$ over the $n_i$ points in $c_i$:

$\hat{K}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}\Delta\log F0_j$

We test the two hypotheses on whether any two neighboring segments have the same slope ratio or not,

$H_0: K_i = K_{i+1}$,  $H_1: K_i \neq K_{i+1}$

by using a test statistic $T_{i,i+1}$,

$T_{i,i+1} = \dfrac{\hat{K}_i - \hat{K}_{i+1}}{\sqrt{S\left(\frac{1}{n_i} + \frac{1}{n_{i+1}}\right)}}$   (4)

$S = \max\left\{\dfrac{\sum_{i=1}^{I}\sum_{j=1}^{n_i}\left(\Delta\log F0_{ij} - \hat{K}_i\right)^2}{N - I},\; S_0\right\}$

where $S_0$ is the prescribed minimum allowance for micro-fluctuations, calculated as the average variance of $\Delta\log F0$ over the whole utterance; $N$ is the number of voiced points in the Final's F0 contour; and $I$ is the number of F0 segments. $T_{i,i+1}$ has a Student t distribution with $N - I$ degrees of freedom. The critical point for the two-tailed test at significance level $\alpha$ is

$cp = t_{N-I,1-\alpha/2}\sqrt{S\left(\frac{1}{n_i} + \frac{1}{n_{i+1}}\right)}$   (5)

where the point $t_{N-I,1-\alpha/2}$ satisfies

$P\left[-t_{N-I,1-\alpha/2} \le T_{N-I} \le t_{N-I,1-\alpha/2}\right] = 1 - \alpha$

We reject $H_0$ whenever $|\hat{K}_i - \hat{K}_{i+1}|$ exceeds the computed critical point. If $H_0$ cannot be rejected at level $\alpha$, the two neighboring F0 loci have the same slope ratio, and the two segments are merged.

A Final's F0 contour may thus be divided into one, two or three segments, from which one is decided to be the tone nucleus. In the one-segment case, the whole Final's F0 contour is the tone nucleus. In the three-segment case, the middle segment is the tone nucleus. In the two-segment case, the tone nucleus is chosen by a linear-discriminant-function predictor, explained in the following section.

4.2. Tone-nucleus discrimination for two-segment F0 contours

For the two segments of a divided Final F0 contour, we assume that exactly one of them is the tone nucleus: either the first half or the second half. We use Groups −1 and 1 to represent these two cases. We collected a number of prosodic features, including duration- and energy-related ones, to study whether their distributions differ between the two groups. We then used linear discriminant analysis (LDA) to select a subset of the prosodic features and form a linear discriminant function for group classification.

4.2.1. Prosodic features

As illustrated in Fig. 10, the features are as follows.

4.2.1.1. Duration-related features
• Normalized temporal locations of points A and B with respect to the syllable onset:
  $t_{A0} = \frac{t_A - t_0}{t_2 - t_0}$,  $t_{B0} = \frac{t_B - t_0}{t_2 - t_0}$
• Normalized temporal locations of points A, B and C with respect to the Consonant–Vowel (C–V) boundary:
  $t_{A1} = \frac{t_A - t_1}{t_2 - t_1}$,  $t_{B1} = \frac{t_B - t_1}{t_2 - t_1}$,  $t_{C1} = \frac{t_C - t_1}{t_2 - t_1}$
• Duration of $c_1$ in frame numbers: $n_1$.
• Duration of $c_2$ in frame numbers: $n_2$.
• Normalized duration of $c_1$: $dur_1 = t_{B1} - t_{A1}$.
• Normalized duration of $c_2$: $dur_2 = t_{C1} - t_{B1}$.

4.2.1.2. Energy-related features
• Energy slope ratios of F0 segments $c_1$ and $c_2$:
  $\Delta P_{c_1} = \frac{P_{t_B} - P_{t_A}}{n_1 - 1}$,  $\Delta P_{c_2} = \frac{P_{t_C} - P_{t_B}}{n_2 - 1}$
  where $P_j$, $j = t_A, t_B, t_C$, denote the frame log energies.
• Segmental energy-sum ratio:
  $\eta = \frac{\text{energy sum of } c_2}{\text{energy sum of } c_1} = \frac{\sum_{j=t_B}^{t_C} P_j}{\sum_{k=t_A}^{t_B} P_k}$

Fig. 10. Illustration of prosodic feature extraction for syllable w2 in continuous speech. The upper and lower panels represent the energy and F0 contours of syllable w2 and its neighbors. Vertical lines indicate the phonetic segmentation and the F0 segmentation result: thin lines mark the syllable boundaries between syllables w1, w2 and w3, broken lines mark the C–V boundaries, and the dotted line marks the F0 segmentation of syllable w2. "A", "B" and "C" indicate the segmentation points which divide the F0 contour of syllable w2 into two segments c1 and c2. t0, t1 and t2 are the frames of the syllable onset, the C–V boundary and the syllable offset of syllable w2.

4.2.2. Statistical distributional analysis

Let $x$ represent any individual prosodic feature mentioned above, and $\mu_{-1}$, $\mu_1$ the mean values of $x$ for Groups −1 and 1. We use ANOVA (Milton and Arnold, 1992, pp. 467–524) to test the hypothesis

$H_0: \mu_{-1} = \mu_1$

Given samples $(x_{-1,1}, \ldots, x_{-1,N_{-1}})$ and $(x_{1,1}, \ldots, x_{1,N_1})$ for the two groups from the training data, the sample means are calculated as:

$\hat{\mu}_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{i,j}$, where $i = -1, 1$   (6)

$\hat{\mu} = \frac{1}{N}\sum_{i=-1,1}\sum_{j=1}^{N_i} x_{i,j}$, where $N = N_{-1} + N_1$   (7)

The treatment sum of squares $S_t$ and the residual sum of squares $S_e$ are calculated as:

$S_t = \sum_{i=-1,1} N_i\left(\hat{\mu}_i - \hat{\mu}\right)^2$   (8)

$S_e = \sum_{i=-1,1}\sum_{j=1}^{N_i}\left(x_{i,j} - \hat{\mu}_i\right)^2$   (9)

The test statistic

$f = \dfrac{S_t}{S_e/(N-2)}$

is known to have an F distribution with 1 and $N - 2$ degrees of freedom, denoted $F_{(1,N-2)}$. A larger $f$ value is stronger statistical evidence against the hypothesis $H_0$.

In the experiment, we took 250 training utterances for the statistical analysis of tone nuclei. After F0 contour segmentation, we obtained 664 two-segment Finals' F0 contours and manually labeled each of them as Group −1 or 1: 289 samples belong to Group −1, and the other 375 to Group 1. Table 3 gives the ANOVA statistics of each prosodic feature for the tone nuclei with respect to the two groups.

From Table 3, we can see that all the features except $t_{A0}$ and $t_{A1}$ have different distributions for Groups −1 and 1 above the 0.999 significance level. The temporal location of point B is of the highest importance for discriminating the two groups, and $t_{B1}$, the temporal location normalized to the C–V boundary, is a better discriminating feature than $t_{B0}$, which is normalized to the syllable onset. It is also interesting to note that when the former F0 segment ($c_1$) is the tone nucleus, it tends to have a descending energy slope ($\Delta P_{c_1} = -0.046$), whereas when the latter F0 segment ($c_2$) is the tone nucleus, the former segment tends to have a rising energy slope ($\Delta P_{c_1} = 0.069$).
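For two groups, the ANOVA test of Eqs. (6)–(9) for a single scalar prosodic feature reduces to the short computation below; a minimal sketch with toy data, not the paper's analysis pipeline.

```python
def anova_f(group_neg, group_pos):
    """One-way ANOVA F statistic for two groups of scalar feature values,
    following Eqs. (6)-(9): f = St / (Se / (N - 2))."""
    n1, n2 = len(group_neg), len(group_pos)
    n = n1 + n2
    m1 = sum(group_neg) / n1                       # group mean, Eq. (6)
    m2 = sum(group_pos) / n2
    m = (sum(group_neg) + sum(group_pos)) / n      # grand mean, Eq. (7)
    st = n1 * (m1 - m) ** 2 + n2 * (m2 - m) ** 2   # treatment SS, Eq. (8)
    se = (sum((x - m1) ** 2 for x in group_neg)    # residual SS, Eq. (9)
          + sum((x - m2) ** 2 for x in group_pos))
    return st / (se / (n - 2))
```

A large f, judged against the F(1, N−2) distribution, indicates that the feature separates the two groups well, which is how the features in Table 3 are ranked.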


Table 3
ANOVA statistics for Groups "−1" and "1" based on 664 samples in the analysis data

Feature | Group mean (−1) | Group mean (1) | Group s.d. (−1) | Group s.d. (1) | F(1,662) | Significance level
t_A0    | 0.350  | 0.315  | 0.121 | 0.132 | 12.0  | 0.001
t_A1    | 0.011  | 0.038  | 0.133 | 0.187 | 4.3   | 0.039
t_B0    | 0.756  | 0.487  | 0.092 | 0.150 | 723.3 | 0.000
t_B1    | 0.617  | 0.230  | 0.141 | 0.185 | 873.5 | 0.000
t_C1    | 0.834  | 0.765  | 0.120 | 0.207 | 25.2  | 0.000
n_1     | 10.2   | 5.0    | 3.354 | 1.813 | 647.0 | 0.000
n_2     | 3.0    | 8.5    | 1.670 | 3.168 | 718.9 | 0.000
dur_1   | 0.705  | 0.336  | 0.161 | 0.166 | 825.3 | 0.000
dur_2   | 0.291  | 0.602  | 0.118 | 0.200 | 549.9 | 0.000
ΔP_c1   | −0.046 | 0.069  | 0.077 | 0.134 | 171.0 | 0.000
ΔP_c2   | −0.304 | −0.085 | 0.185 | 0.101 | 373.5 | 0.000
η       | 0.272  | 2.365  | 0.241 | 1.547 | 519.4 | 0.000

4.2.3. Linear-discriminant-function based tone-nucleus predictor

We use a linear discriminant function as the automatic classifier to predict the tone nucleus for two-segment Final F0 contours. Let $x$ be the vector consisting of a number of the prosodic features mentioned above; the function is

$y = w^T x + w_0$   (10)

$x \in \text{Group} -1$ if $y < 0$;  $x \in \text{Group } 1$ if $y > 0$   (11)

With the training samples $(x_{-1,1}, \ldots, x_{-1,N_{-1}})$ of Group −1 and $(x_{1,1}, \ldots, x_{1,N_1})$ of Group 1, we use the Fisher ratio $J_F$ (Webb, 1999, pp. 104–105) to find the $w$ of Eq. (10):

$J_F = \dfrac{\left|w^T(\mu_{-1} - \mu_1)\right|^2}{w^T S_w w}$   (12)

where

$\mu_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{i,j}$, where $i = -1, 1$

$S_w = \frac{1}{N-2}\left(N_{-1}\hat{\Sigma}_{-1} + N_1\hat{\Sigma}_1\right)$, and $N = N_{-1} + N_1$   (13)

where $\hat{\Sigma}_{-1}$ and $\hat{\Sigma}_1$ represent the maximum likelihood estimates of the covariance matrices of Groups −1 and 1, respectively. Maximizing the ratio $J_F$ gives the optimal weight

$w = S_w^{-1}(\mu_{-1} - \mu_1)$

Furthermore, through feature selection algorithms such as the sub-optimal sequential forward selection (SFS) (Webb, 1999, pp. 222–223), a subset of prosodic features can be chosen to form a discriminant function with good discrimination performance. SFS works as follows:

(1) Initialization: let $X$, the objective feature set, be empty.
(2) $J_F$ estimation: for each $x_i$ in the candidate feature set, form a new feature set $(X, x_i)$ and calculate the Fisher ratio $J_F(X, x_i)$ of Eq. (12).
(3) Best candidate: $x^* = \arg\max_i J_F(X, x_i)$.
(4) Significance check: if the improvement $J_F(X, x^*) - J_F(X)$ is statistically significant, include $x^*$ in $X$, delete it from the candidate feature set, and go back to step (2); otherwise output $X$.
(5) End.

In our experiment, we used the 664 manually labeled samples (375 of Group −1 and 289 of Group 1) for the feature selection analysis and the design of the linear discriminant function. The final feature set includes only 5 of the 12 original prosodic features: $dur_1$, $dur_2$, $\Delta P_{c_1}$, $\Delta P_{c_2}$ and $\eta$. The estimated discriminant function


Table 4
Detection performance of tone nuclei for an open set of 50 utterances

                                  | 1-segment F0 | 2-segment F0 | 3-segment F0 | Total
Number of tones                   | 247          | 161          | 263          | 671
Number of correct tone nuclei     | 247          | 152          | 255          | 654
Correct rate of tone nucleus (%)  | 100          | 94.4         | 97.0         | 97.5
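The Fisher-criterion weight estimation of Section 4.2.3, $w = S_w^{-1}(\mu_{-1} - \mu_1)$ from Eqs. (12)–(13), can be sketched as follows. The sketch is restricted to two features so that the 2×2 matrix inverse can be written out explicitly; the data, the sign convention of the score, and the absence of the bias $w_0$ are illustrative simplifications, not the paper's trained predictor.

```python
def lda_weights(xs_neg, xs_pos):
    """Fisher discriminant direction for two groups of 2-dim feature
    vectors: w = Sw^{-1} (mu_neg - mu_pos), following Eqs. (12)-(13).
    With this sign convention Group -1 scores higher; Eq. (11)'s decision
    rule is recovered by choosing the bias w0 and inequality direction."""
    def mean(xs):
        n = len(xs)
        return [sum(x[d] for x in xs) / n for d in range(2)]

    def scatter(xs, mu):
        # N_i * Sigma-hat_i, i.e. the unnormalized scatter matrix.
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x in xs:
            for a in range(2):
                for b in range(2):
                    s[a][b] += (x[a] - mu[a]) * (x[b] - mu[b])
        return s

    mu_n, mu_p = mean(xs_neg), mean(xs_pos)
    n = len(xs_neg) + len(xs_pos)
    sn, sp = scatter(xs_neg, mu_n), scatter(xs_pos, mu_p)
    # Pooled within-class covariance Sw, Eq. (13).
    sw = [[(sn[a][b] + sp[a][b]) / (n - 2) for b in range(2)]
          for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_n[0] - mu_p[0], mu_n[1] - mu_p[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]
```

Projecting samples onto w separates the two groups along a single axis, which is what makes the scalar threshold test of Eq. (11) possible.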

Fig. 11. Illustration of the extraction of tone nuclei. Panel (a) depicts the speech waveform, with phonetic segmentations of Initials/Finals and syllables. Panel (b) depicts the original F0 contours, with the tonality and the Finals' boundaries labeled. Panel (c) illustrates the segmentation of the tone nuclei from the Finals' F0 contours, where "Seg1" depicts the results of segmental K-means segmentation, "Seg2" the results of segment merging, and "Seg3" the tone nuclei. Panel (d) depicts the extracted F0 contours of the tone nuclei.


had an accuracy of 95.2% for discriminating the two groups in the training data.

4.3. Tone nucleus detection performance

To evaluate the performance of the tone nucleus detection method, we manually checked the extracted tone nuclei of an open set of 50 utterances (cs0f0251–cs0f0300) containing 671 syllables. Extracted tone nuclei were regarded as errors only when they obviously missed the targets; thus, if the whole F0 contour of a Final is chosen as the tone nucleus, it is regarded as correct whether or not it includes transitory courses. Table 4 shows the number of samples in the different segmentation groups and the correct rate of the extracted tone nuclei.

4.4. F0 smoothing of the tone nucleus

After the tone nucleus is detected from a syllable F0 contour, the other F0 segments are set to zero for tone recognition. More precisely, in our approach small portions of F0 at the onset/offset neighboring the tone nucleus are kept, so that delta F0s can be computed. Fig. 11 illustrates the main steps of tone nucleus extraction; panel (d) of Fig. 11 shows the extracted tone nuclei.
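The masking described in Section 4.4 amounts to the small routine below. The `margin` parameter (how many onset/offset frames are kept for delta computation) is an assumption for illustration; the paper does not specify its exact size.

```python
def mask_to_nucleus(f0, start, end, margin=2):
    """Zero out F0 outside the detected tone nucleus [start, end),
    keeping `margin` extra frames on each side so that delta-F0 at the
    nucleus edges can still be computed (Section 4.4).
    `margin` is an assumed value, not taken from the paper."""
    lo = max(0, start - margin)
    hi = min(len(f0), end + margin)
    return [v if lo <= i < hi else 0.0 for i, v in enumerate(f0)]
```

Only the nucleus (plus its small guard band) then contributes acoustic features to the tonal HMMs of Section 5.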

5. Tone recognition

Hidden Markov models (HMMs) have been found to be an efficient method for lexical tone recognition in a number of previous studies (Ljolje and Fallside, 1987; Yang et al., 1988; Hirose and Hu, 1995; Wang et al., 1997). We also used HMMs as the tonal acoustic models in our tone recognition system. Our approach differs from the others in that we use only the acoustic features corresponding to the tone nucleus of a syllable in both the training and recognition stages, whereas the conventional approaches all used acoustic features of the whole syllable. Training of the tonal HMMs was done by the well-known Baum–Welch re-estimation procedure, and acoustic decoding in the recognition stage was done through the Viterbi search algorithm.
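For concreteness, scoring a frame sequence against one tonal HMM and classifying by the best-scoring model can be sketched as follows. This is a stand-in, not the authors' recognizer: single-Gaussian diagonal emissions replace the Gaussian mixtures of Section 5.1, the topology is strictly left-to-right with nonzero self/forward transitions, and all model parameters in the usage below are toy values.

```python
import math

def viterbi_loglik(obs, trans, means, vars_):
    """Best-path log-likelihood of obs (list of feature vectors) under a
    strictly left-to-right HMM with single-Gaussian diagonal emissions.
    trans[s][t] must be nonzero wherever it is used."""
    n_states = len(means)
    neg = float("-inf")

    def emit(s, x):
        return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                   for xi, m, v in zip(x, means[s], vars_[s]))

    delta = [emit(0, obs[0])] + [neg] * (n_states - 1)  # start in state 0
    for x in obs[1:]:
        new = []
        for s in range(n_states):
            stay = delta[s] + math.log(trans[s][s]) if delta[s] > neg else neg
            move = (delta[s - 1] + math.log(trans[s - 1][s])
                    if s > 0 and delta[s - 1] > neg else neg)
            new.append(max(stay, move) + emit(s, x))
        delta = new
    return delta[-1]  # path must end in the final state

def classify(obs, models):
    """Pick the tone whose HMM gives the highest Viterbi log-likelihood;
    models maps a tone label to (trans, means, vars)."""
    return max(models, key=lambda tone: viterbi_loglik(obs, *models[tone]))
```

In the real system each tone has its own trained model and the 6-element acoustic vector of Section 5.1 replaces the 1-dimensional toy observations.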


5.1. Tonal HMM structure

The HMM structure for the four basic lexical tones (Tones 1, 2, 3 and 4) has a left-to-right configuration with 5 states. The beginning and ending states have 2 Gaussian mixture components each, and the 3 middle states, assumed to represent tone nuclei, have 6 mixture components each. Since the neutral tone is typically of short duration, its HMM has fewer states (3) and fewer mixture components in the middle state than those of the basic tones. The frame acoustic vector consists of 6 elements:

• $\log F0_j$
• $\Delta\log F0_j = \log F0_j - \log F0_{j-1}$
• $\Delta\Delta\log F0_j = \log F0_{j+1} + \log F0_{j-1} - 2\log F0_j$
• $P_j - P_{ave}$, where $P_{ave}$ is the average log power at the utterance level
• $\Delta P_j = P_j - P_{j-1}$
• $\Delta\Delta P_j = P_{j+1} + P_{j-1} - 2P_j$

No speaker normalization was conducted, since the system was designed for a speaker-dependent task. To extend it to a speaker-independent task, some form of speaker normalization must be taken into account.

5.2. Tone recognition experiment I

The first tone recognition experiment used context-independent (CI) tonal HMMs: 5 HMMs in total, one for each of Tones 1, 2, 3, 4 and the neutral tone. The conventional method observing full-syllable features served as the baseline system. Recognition results of the two approaches on the test set are shown in Fig. 12, as tone recognition rates in percent for the four basic tones and as the overall average including the neutral tone.

Fig. 12. Tone recognition rates for the four basic tones and the overall average in the two approaches using CI tonal HMMs. "Full" indicates the conventional approach observing full-syllable features, and "Nucleus" indicates the proposed tone-nucleus based method.

From Fig. 12 we can see that the new approach increased the absolute average rate by about 6% compared with the conventional one. This indicates that observing features only of tone nuclei yields better performance than observing full-syllable features. Viewing the performance of individual tones, we found an interesting phenomenon: the recognition rate improved for Tones 1, 2 and 4, but for Tone 3 it even decreased slightly (by about 2%). We thus turn to the confusion matrices in Table 5 (for the conventional approach observing full-syllable features) and Table 6 (for the tone-nucleus approach) to find a reasonable explanation.

Comparing Tables 5 and 6 shows how the use of the tone nucleus improved tone recognition rates. Taking Tone 1 recognition as an example, the rate of mis-recognizing Tone 1 as Tone 4 dropped from 12.8% to 3.9%, and as Tone 2 from 16.2% to 10.5%, while confusions with Tone 3 and the neutral tone were nearly unchanged. Since both the onset and offset pitch targets of Tone 1 are H (high), in continuous speech a rising onset course and/or a falling offset course may appear in the F0 contour of a Tone 1 syllable, as shown in Fig. 3. A Tone 1 F0 contour with a rising onset course easily leads to a Tone 2 error, while one with a falling offset course easily leads to a Tone 4 error. In the tone-nucleus approach, these two kinds of mistakes can be avoided. Similar reasoning explains the improvements in the recognition rates of Tones 2 and 4.

The performance degradation of Tone 3 can also be clarified from the above tables. The mis-recognition of Tone 3 as the neutral tone increased significantly, from 2.4% to 11.6%. A close check of the data showed that when the original F0 contours of Tone 3 syllables have falling-rising dipping shapes, the extracted tone nuclei usually contained only either the falling or the rising F0 segment. The loss of the dipping shape, or contouricity, may cause more Tone 3 recognition errors in the tone-nucleus approach than in the original method. Furthermore, since both Tone 3 and the neutral tone are associated with low pitch values, confusion between them increased in the tone-nucleus approach. On the other hand, mis-recognition of Tone 3 as Tone 2 and Tone 4 was still reduced (from 10.4% to 4.7% as Tone 2, and from 17.2% to 15.7% as Tone 4). Future studies should be conducted to improve the

Table 5
Confusion matrix for the conventional approach based on CI tonal HMMs (recognition rates in %)

Input tone    | Tone 1 | Tone 2 | Tone 3 | Tone 4 | Neutral tone
Tone 1        | 69.2   | 16.2   | 1.4    | 12.8   | 0.5
Tone 2        | 8.5    | 76.4   | 8.7    | 4.2    | 2.1
Tone 3        | 0      | 10.4   | 70.0   | 17.2   | 2.4
Tone 4        | 4.5    | 3.5    | 5.7    | 85.3   | 1.0
Neutral tone  | 2.4    | 11.0   | 22.6   | 33.5   | 30.5


Table 6
Confusion matrix for the tone-nucleus approach based on CI tonal HMMs (recognition rates in %)

Input tone    | Tone 1 | Tone 2 | Tone 3 | Tone 4 | Neutral tone
Tone 1        | 83.6   | 10.5   | 1.6    | 3.9    | 0.4
Tone 2        | 7.7    | 84.5   | 3.8    | 1.8    | 2.2
Tone 3        | 0      | 4.7    | 68.0   | 15.7   | 11.6
Tone 4        | 4.1    | 0.5    | 4.2    | 90.7   | 0.4
Neutral tone  | 4.9    | 9.1    | 12.8   | 41.5   | 31.7

discrimination between Tone 3 and the neutral tone.

The overall recognition performance for the neutral tone is notably low: 31.7% in the tone-nucleus approach and 30.5% in the conventional approach. This may be ascribed to the fact that F0 and energy are not sufficient to characterize the neutral tone. Other studies have shown that the neutral tone can be better discriminated using duration features (Hirose and Hu, 1995; Chen and Wang, 1995). This problem should be dealt with in the future.

5.3. Tone recognition experiment II

The second tone recognition experiment used context-dependent tonal HMMs, which model the F0 variations due to coarticulation with neighboring tones in order to improve tone recognition performance (Chen and Wang, 1995; Wang et al., 1997). For instance, a tone sequence "T1 T2 T3 T4" is acoustically modeled by context-independent (CI) or context-dependent (CD) models like:

CI:  T1         T2              T3              T4
CD:  T1-(T2)    (T1)-T2-(T3)    (T2)-T3-(T4)    (T3)-T4

Coarticulation effects on each tone due to its neighboring tones can therefore be modeled by allotone models (CD models): bi-tone models for tones at the beginning and end of an utterance, and tri-tone models for tones in the middle of an utterance. In the recognition stage, since neighboring tones are not known in advance, tone recognition for all syllables of an input utterance must be realized through a dynamic search for a legal path with maximum probability across the whole candidate tone array, where a legal path is one in which every allotone candidate has the allotone's context tones as its preceding and succeeding candidates. The number of CD tonal HMMs is 175: 4 × 5 for tones at the beginning of an utterance (the neutral tone does not appear utterance-initially), 5 × 5 × 5 for tones in the middle of an utterance, 5 × 5 for tones at the end of an utterance, and another 5 for isolated tones. Owing to insufficient training data in the context-dependent case, the CD HMMs have tied transition matrices: they are first copied from the CI HMMs and then re-estimated according to the training data with tri-tone labels.

(Regarding the Tone 3 confusions discussed above: in our later study (Zhang and Hirose, 2000), Tone 3 was found to be characterized by relatively low pitch values, and its recognition rate could be improved significantly by an anchoring-based normalization method.)

Table 7 lists the results of the CD experiments based on the acoustic features of either full syllables or only tone nuclei, together with the results of the previous CI experiments. Although there is an insufficient-training-data problem, the slight increase in the recognition


Table 7
Average tone recognition rates (%) for the training and test data in the four experiments

Data set  | CI full | CI nucleus | CD full | CD nucleus
Training  | 76.2    | 80.8       | 86.5    | 91.8
Test      | 75.3    | 81.5       | 76.2    | 83.1

rate for the test data from the CI models to the CD models coincides with results reported in other studies (Wang et al., 1997), where CD tonal HMMs brought a 1.8% improvement over CI HMMs. We should note that the tone recognition rate of the tone-nucleus CI approach (81.5%) still outperformed that of the conventional CD approach (76.2%).
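The expansion of a tone sequence into the context-dependent model labels of Section 5.3 can be sketched as follows; the label-string format is an illustrative rendering of the CI/CD notation above, not a file format used by the authors.

```python
def expand_cd_labels(tones):
    """Expand a tone sequence into context-dependent labels: bi-tone
    models at the utterance edges, tri-tone models in the middle, and a
    plain label for an isolated tone."""
    n = len(tones)
    if n == 1:
        return [tones[0]]                        # isolated tone model
    labels = [f"{tones[0]}-({tones[1]})"]        # utterance-initial bi-tone
    for i in range(1, n - 1):                    # utterance-medial tri-tones
        labels.append(f"({tones[i - 1]})-{tones[i]}-({tones[i + 1]})")
    labels.append(f"({tones[-2]})-{tones[-1]}")  # utterance-final bi-tone
    return labels
```

For the example sequence "T1 T2 T3 T4" this reproduces the CD row shown above, which is how the 175-model inventory is indexed during training with tri-tone labels.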

6. Conclusions

In summary, we have described a new approach to dealing with F0 variations when recognizing lexical tones in continuous Chinese speech. The approach uses a tone-nucleus model to treat F0 variations systematically. A robust algorithm for detecting tone nuclei from speech signals was proposed and tested. The algorithm contains three components: (1) a segmental K-means segmentation algorithm, which offers a robust way to collect candidates for tone-nucleus detection, (2) ANOVA, which provides statistical knowledge about the features of tone nuclei, and (3) a tone-nucleus detector based on a linear discriminant function. An HMM-based tone recognizer was tested on natural speech data from one female speaker in the HKU96 database. The results show that tone recognition can be substantially improved compared with conventional methods; the improvement also confirms that the tone nucleus detection worked appropriately in the task. Since the tone nucleus has also been suggested to play a very important role in Chinese intonation structure, the developed tone nucleus detection algorithm is applicable to other tasks as well.

However, the best tone recognition rate is only 83.1%, even though the task has been simplified to speaker-dependent read speech. On the one hand, this indicates high variability in the F0 contours of continuous speech; on the other hand, there is much room to improve our tone recognition methods. We consider the following points for furthering this study. First, the approach assumes that a tone nucleus conforms to the underlying pitch values, whereas phonetic studies have revealed that this assumption may not be fully correct: the pitch values of a tone nucleus can differ from the underlying target pitch values when duration is very short, and a tone with a much-varied tone nucleus can still be perceived as the purported tonality as long as the tone context is present (Xu, 1994). Hence, the proposal in this paper has difficulty recognizing some highly variable tone nuclei. Second, the algorithms for tone nucleus detection and recognition can be further improved by incorporating more appropriate methods; decision-tree based prediction may be a promising choice. Third, the tonal acoustic models may be further improved by optimizing the acoustic features used and the HMM topology for each tone. Fourth, this study was only tested on a speaker-dependent task, and some problems arise when applying the proposed approach to a speaker-independent task. One problem is that the approach lacks a treatment of the F0 range variability across speakers. Another is that the tone nucleus detection rate may decrease significantly in a speaker-independent task compared with the speaker-dependent one, which in turn would significantly affect tone recognition. As future study directions, speaker F0 range normalization and soft tone nucleus detection, in which an F0 segment is assigned a probability of being the tone nucleus rather than the yes/no decision of this study, may be studied to deal with these problems.

Acknowledgements

The authors acknowledge helpful discussions with Dr. Jinfu Ni. The first author would also like to thank Dr. Yi Xu and Dr. Frank Soong for their advice on the earliest draft of this paper. The valuable comments of three anonymous reviewers, which considerably improved the quality of this paper, are appreciated as well.

References

Chao, Y.-R., 1968. A Grammar of Spoken Chinese. University of California Press, Berkeley.
Chen, S.-H., Wang, Y.-R., 1995. Tone recognition of continuous Mandarin speech based on neural networks. IEEE Trans. SAP 3 (2), 146–150.
Chen, C.-J. et al., 2001. Recognize tone languages using pitch information on the main vowel of each syllable. In: Proc. of ICASSP.
Cheng, Y.-B., Wang, R.-H., 1990. Speech Signal Processing. University of Science and Technology of China Press (in Chinese).
Fujisaki, H., 1997. Prosody, models, and spontaneous speech. In: Sagisaka, Y., Campbell, N., Higuchi, N. (Eds.), Computing Prosody: Computational Models for Processing Spontaneous Speech. Springer-Verlag, New York, pp. 27–42.
Fujisaki, H., Hirose, K., 1984. Analysis of voice fundamental frequency contours for declarative sentences of Japanese. J. Acoust. Soc. Jpn. (E) 5 (4), 233–242.
Garding, E., Zhang, J.-L., 1997. Tempo effects in Chinese prosodic patterns. In: ESCA Workshop on Intonation: Theory, Models and Applications, Athens, Greece, September 1997, pp. 145–148.
Hirose, K., Hu, X.-H., 1995. HMM-based tone recognition of Chinese trisyllables using double codebooks on fundamental frequency and waveform power. In: Proc. 4th European Conf. on Speech Communication and Technology, Vol. 1, Madrid, pp. 31–34.
Hirose, K., Iwano, K., 1999. Detecting prosodic word boundaries using statistical models of moraic transition and its use for continuous speech recognition. In: Proc. of ICASSP99.
Howie, J.M., 1974. On the domain of tone in Mandarin. Phonetica 30, 129–148.
Hsie, Ch.-T., Furuichi, Ch., Imai, S., 1988. Recognition of four tones in continuous Chinese speech. J. IEICE D J71-D (4), 661–668 (in Japanese).
Lee, L.-Sh., Tseng, Ch.-Y., Gu, H.-Y., Liu, F.-H., Chang, Ch.-H., Lin, Y.-H., Lee, Y.-M., Tu, Sh.-L., Hsieh, Sh.-H., Chen, Ch.-H., 1993. Golden Mandarin (I)––A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary. IEEE Trans. SAP 1 (2), 158–179.
Lin, M.-C., 1995. A perceptual study on the domain of tones in Beijing Mandarin. China Acta Acustica 20 (6), 437–445.
Liu, J., He, X.-D., Mo, F.-Y., Yu, T.-Ch., 1999. Study on tone classification of Chinese continuous speech in speech recognition system. In: Proc. Eurospeech 99, Vol. 2, Budapest, Hungary, pp. 891–894.
Ljolje, A., Fallside, F., 1987. Recognition of isolated prosodic patterns using hidden Markov models. Comput. Speech Lang. 2, 27–33.
Milton, J.S., Arnold, J.C., 1992. Introduction to Probability and Statistics. McGraw-Hill.
Niemann, H. et al., 1998. Using prosodic cues in spoken dialog systems. In: Kosarev, Y. (Ed.), Proc. of SPECOM'98 Workshop, St. Petersburg, pp. 17–28.
Rabiner, L., Juang, B.-H., 1993. Fundamentals of Speech Recognition. Prentice-Hall PTR.
Rose, P.J., 1988. On the non-equivalence of fundamental frequency and pitch in tonal description. In: Bradley, D., Henderson, E.J.A., Mazaudon, M. (Eds.), Prosodic Analysis and Asian Linguistics: To Honour R.K. Sprigg. Pacific Linguistics, C-104, pp. 55–82.
Secrest, B.G., Doddington, G.R., 1983. An integrated pitch tracking algorithm for speech systems. In: Proc. of ICASSP, pp. 1352–1355.
Shih, Ch.-L., 1988. Tone and Intonation in Mandarin. Working Papers, Vol. 3, Cornell Phonetics Laboratory, pp. 83–109.
Shih, Ch.-L., 1990. Mandarin third tone sandhi and prosodic structure. In: Hoekstra, T., van der Hulst, H. (Eds.), Studies in Chinese Phonology. Mouton de Gruyter Press, pp. 81–123.
Stolcke, A. et al., 1999. Modeling the prosody of hidden events for improved word recognition. In: Eurospeech 99, Budapest, Hungary, September 1999.
Wang, Y.-R., Chen, S.-H., 1994. Tone recognition of continuous Mandarin speech assisted with prosodic information. J. Acoust. Soc. Am. 96 (5), Pt. 1, 2637–2645.
Wang, Ch., Seneff, S., 1998. A study of tones and tempo in continuous Mandarin digit strings and their application in telephone quality speech recognition. In: Proc. of ICSLP98, Sydney, Australia, December 1998.
Wang, Ch.-F., Fujisaki, H., Hirose, K., 1990. Chinese four tone recognition based on the model for process of generating F0 contours of sentences. In: Proc. of ICSLP90, pp. 221–224.
Wang, H.-M., Ho, T.-H., Yang, R.-Ch., Shen, J.-L., Bai, B.-R., Hong, J.-Ch., Chen, W.-P., Yu, T.-L., Lee, L.-Sh., 1997. Complete recognition of continuous Mandarin speech for Chinese language with very large vocabulary using limited training data. IEEE Trans. Speech Audio Process. 5 (2), 195–200.
Webb, A., 1999. Statistical Pattern Recognition. Arnold Press, London.
Whalen, D.H., Xu, Y., 1992. Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica 49, 25–47.
Wightman, C.W., Ostendorf, M., 1994. Automatic labeling of prosodic patterns. IEEE Trans. SAP 2 (4), 469–481.
Xu, Y., 1994. Production and perception of coarticulated tones. J. Acoust. Soc. Am. 95 (4), 2240–2253.
Xu, Y., 1997a. What can tone studies tell us about intonation? In: Proc. ESCA Workshop on Intonation: Theory, Models and Applications, Athens, Greece, pp. 337–340.
Xu, Y., 1997b. Contextual tonal variations in Mandarin. J. Phonetics 25, 61–83.
Xu, Y., 1998. Consistency of tone-syllable alignment across different syllable structures and speaking rates. Phonetica 55, 179–203.
Xu, Y., 1999. Effects of tone and focus on the formation and alignment of F0 contours. J. Phonetics 27 (1), 55–105.
Xu, Y., Wang, Q.E., 2001. Pitch targets and their realization: evidence from Mandarin Chinese. Speech Commun. 33, 319–337.
Yang, W.-J., Lee, J.-Ch., Chang, Y.-Ch., Wang, H.-Ch., 1988. Hidden Markov model for Mandarin lexical tone recognition. IEEE Trans. ASSP 36 (7), 988–992.
Zhang, J.-S., Hirose, K., 1998. A robust tone recognition method of Chinese based on sub-syllabic F0 contours. In: Proc. of ICSLP98, Sydney, pp. 703–706.
Zhang, J.-S., Hirose, K., 2000. Anchoring hypothesis and its application to tone recognition of Chinese continuous speech. In: Proc. of ICASSP2000, Istanbul, Turkey, June 2000.
Zhang, J.-S., Kawanami, H., 1999. Modeling carryover and anticipation effects for Chinese tone recognition. In: Proc. of Eurospeech'99, Budapest, Hungary, September 1999, pp. 747–750.