Speech Communication 40 (2003) 493–501 www.elsevier.com/locate/specom

Perception of stress and speaking style for selected elements of the SUSAS database

Robert S. Bolia *, Raymond E. Slyh

Air Force Research Laboratory (AFRL/HECP), 2255 H Street, Wright-Patterson Air Force Base, OH 45433-7022, USA

Received 30 October 2000; received in revised form 30 January 2002; accepted 30 June 2002

Abstract

The Speech Under Simulated and Actual Stress (SUSAS) database is a collection of utterances recorded under conditions of simulated or actual stress, the purpose of which is to allow researchers to study the effects of stress and speaking style on the speech waveform. The aim of the present investigation was to assess the perceptual validity of the simulated portion of the database by determining the extent to which listeners classify its utterances according to their assigned labels. Seven listeners performed an eight-alternative, forced-choice task, judging whether monosyllabic or disyllabic words spoken by talkers from three different regional accent classes (Boston, General American, New York) were best classified as angry, clear, fast, loud, neutral, question, slow, or soft. Mean percentages of "correct" judgments were analysed using a 3 (regional accent class) × 2 (number of syllables) × 8 (speaking style) repeated measures analysis of variance. Results indicate that, overall, listeners correctly classify the utterances only 58% of the time, and that the percentage of correct classifications varies as a function of all three independent variables. © 2002 Elsevier Science B.V. All rights reserved.

* Corresponding author. Tel.: +1-937-255-8802; fax: +1-937-255-8752. E-mail address: [email protected] (R.S. Bolia).

doi:10.1016/S0167-6393(02)00129-2


Keywords: SUSAS; Stressed speech; Speech perception; Speech recognition

1. Introduction

The Speech Under Simulated and Actual Stress (SUSAS) database is a collection of utterances recorded under conditions of simulated or actual stress, designed to afford researchers the ability to study the effects of stress and speaking style on the speech waveform, and to apply the results of such investigations to the design of more robust systems for automatic speech recognition, as the performance of the latter tends to degrade when the operator is under stress¹ (Hansen et al., 2000).

The importance of vocal cues in conveying emotion and stress has been recognized, as Banse and Scherer (1996) point out, since antiquity, and a considerable amount of research has been conducted in the last three decades to determine not only the acoustic correlates of emotion, but also how emotive utterances are perceived by human listeners (Tolkmitt and Scherer, 1986; Banse and Scherer, 1996). During the course of some of this work, a number of vocal parameters affected by emotion have been identified, and, to some extent, the ability of human listeners to identify emotional states from context-free utterances has been quantified.

¹ The SUSAS database is available from the Linguistic Data Consortium. Details can be found at the following internet addresses: http://www.ldc.upenn.edu/Catalog/LDC99S78.html (speech samples) and http://www.ldc.upenn.edu/Catalog/LDC99T33.html (transcripts).

However, this has not been done with the single-word utterances comprising the simulated portion of the SUSAS database.

The "simulated" portion of the SUSAS database consists of single-word utterances from nine male speakers, spoken in each of 11 styles: angry, clear, cond50, cond70, fast, lombard, loud, neutral, question, slow, and soft. The cond50 and cond70 conditions comprise utterances recorded from subjects engaged in tracking tasks under different levels of workload. The lombard utterances were recorded from talkers listening to pink noise presented binaurally through headphones at 85 dB SPL. For the angry, clear, fast, loud, neutral, question, slow, and soft styles, speakers were simply asked to say the words in the appropriate style.

The simulated portion of the database was originally collected to aid in the development of robust speech recognition systems (Paul et al., 1986; Lippmann et al., 1987). However, in recent years, numerous studies have made use of the database in research on the analysis, classification, and synthesis of stressed speech and speaking styles (Hansen, 1988; Womack and Hansen, 1996; Hansen and Bou-Ghazale, 1997; Bou-Ghazale and Hansen, 1998; Slyh et al., 1999). These studies have included speaking-style discrimination tests involving segments of the database. Bou-Ghazale and Hansen (1995) conducted listening tests with samples from the angry, lombard, loud, and neutral styles, in which the ability of participants to judge the presence of stress in a speech sample was investigated.


In this study, listeners were presented with pairs of words from the database and asked whether one, both, or neither of the two words was spoken under stress. Listeners' ability to discriminate angry, lombard, and loud speech from neutral speech was 97%, 82%, and 85%, respectively. These researchers later replicated these results using a slightly different paradigm (Bou-Ghazale and Hansen, 1996).

One characteristic of these investigations is that they have restricted both the stimulus set and the set of style labels from which participants may select a response. For example, presenting stimuli from either the angry or the neutral class and asking participants whether they sound angry or neutral does not afford many choices. It may be that they sound loud rather than angry or neutral, but loud is not an option the listener is able to select. The objective of the present study was to determine how well human listeners are able to correctly classify utterances in the SUSAS database according to speaking style when given larger stimulus and label sets, and to what extent this ability is mediated by such factors as the talker's regional accent, the number of syllables in the utterance, and the style of the utterance as labeled.

2. Method

Participants. Three males and four females, between the ages of 20 and 54 years, served as listeners in this experiment. All had pure-tone thresholds ≤15 dB HL for frequencies ranging from 125 Hz to 8 kHz (American National Standards Institute, 1989), and all were native speakers of Midwestern American English. None of them had been previously exposed to the SUSAS database.

Stimuli. The stimuli consisted of five monosyllabic ("three," "eight," "help," "freeze," "steer") and five disyllabic words ("thirty," "eighty," "hello," "zero," "degree") from the simulated-stress portion of the SUSAS database, each spoken by nine male talkers (three from each of three accent classes). These particular words were selected because they were available for all talkers, and because the monosyllabic and disyllabic words were similar enough to each other that it might reasonably be conjectured that any differences in classification found between the two classes would be due to effects related to duration.


The original SUSAS files (8 kHz sample rate, 16-bit PCM) were converted to the Entropic ESPS FEASD file format and played through the audio output device of a Sun Ultra 2 workstation using the ESPS "s16play" program. The resulting audio signal was amplified to a comfortable listening level using a Crown SL-2 amplifier and presented to listeners via Sennheiser HD-540 headphones.

Experimental design. Three regional accent classes ("Boston," "General American," and "New York") were combined factorially with two syllable conditions (monosyllabic and disyllabic) and eight speaking-style conditions ("angry," "clear," "fast," "loud," "neutral," "question," "slow," and "soft"). The use of only eight of the eleven speaking styles was dictated by the assumption that naïve listeners are unfamiliar with the terms "lombard," "cond50," and "cond70" as they relate to the SUSAS database.

Procedures. Subjects listened to sequential presentations of stimuli and judged, via an eight-alternative, forced-choice response initiated by a keypress, whether each stimulus was "angry," "clear," "fast," "loud," "neutral," "question," "slow," or "soft." The categories were enumerated in alphabetical order, and participants were instructed that the order in which they were displayed implied no ranking in terms of importance or frequency. In addition, each of the styles was "defined" in a set of written instructions presented to the subjects prior to their participation in the experiment. In general, these definitions consisted of a comparison with the neutral form of the word ("louder than normal," "faster than normal," etc.), and as such subjects were instructed to classify as neutral those words which they did not perceive as being spoken in one of the other styles. A response was called "correct" if the subject's classification of the stimulus matched its label in the SUSAS database. No feedback as to the correctness of a response was given. In a given session, subjects were presented with 8 (speaking styles) × 9 (talkers) × 10 (words) = 720 different stimuli. The stimuli were presented such that talker (and hence accent class), speaking style, and word varied randomly from trial to trial. Each subject completed three experimental sessions.
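To make the session structure concrete, the following Python sketch (not part of the original experimental software; all names are illustrative) builds one randomized 720-trial session and scores responses against the SUSAS labels. Note that chance performance on this eight-alternative task is 1/8, or 12.5%.

    import random

    STYLES = ["angry", "clear", "fast", "loud", "neutral", "question", "slow", "soft"]
    TALKERS = ["talker%d" % i for i in range(1, 10)]          # nine talkers, three per accent class
    WORDS = ["three", "eight", "help", "freeze", "steer",     # monosyllables
             "thirty", "eighty", "hello", "zero", "degree"]   # disyllables

    def build_session():
        """Return the 8 styles x 9 talkers x 10 words = 720 stimuli in random order."""
        trials = [(style, talker, word)
                  for style in STYLES for talker in TALKERS for word in WORDS]
        random.shuffle(trials)
        return trials

    def proportion_correct(trials, responses):
        """A response counts as 'correct' when it matches the assigned SUSAS style label."""
        hits = sum(1 for (style, _, _), resp in zip(trials, responses) if resp == style)
        return hits / len(trials)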


3. Results

In general, listeners performed better than chance (12.5% for an eight-alternative task) at the classification task, but not as well as anticipated, given that humans routinely identify speaking styles in the real world and that acoustic inter-style differences were present in the speech samples (Cummings and Clements, 1990; Slyh et al., 1999). The only speaking style that was correctly classified more than 85% of the time was question; the remaining styles were classified correctly only 53% of the time. On average, listeners performed better when the talker belonged to the General American regional accent class than when the talker belonged to either of the other two classes, and better when the talker was from the Boston class than from the New York class, although this trend was reversed for words spoken in the question style. Specific differences and interactions are revealed by the statistical analyses.

Mean percentages of correct classifications were subjected to a 3 (regional accent class) × 2 (number of syllables) × 8 (speaking style) repeated measures analysis of variance, revealing significant main effects of accent class, F(2,12) = 11.09, p < 0.05, number of syllables, F(1,6) = 20.90, p < 0.05, and speaking style, F(7,42) = 25.96, p < 0.05. All of the two-way interactions were also significant (regional accent class × number of syllables, F(2,12) = 4.09, p < 0.05; regional accent class × speaking style, F(14,84) = 29.80, p < 0.05; number of syllables × speaking style, F(7,42) = 3.29, p < 0.05), as was the three-way interaction, F(14,84) = 3.89, p < 0.05. The three-way interaction is illustrated in Figs. 1–3, in which mean percentages of correct classifications are plotted as a function of speaking style (and, in Fig. 3, number of syllables) for each regional accent class.

The three-way interaction was explored by tests of the two-way interactions at each level of the regional accent class factor, i.e., 2 (number of syllables) × 8 (speaking style) analyses of variance were performed for each of the Boston, General American, and New York classes. Within the Boston and General American classes, only the main effects of speaking style were significant, F(7,42) = 24.67, p < 0.01, and F(7,42) = 37.73, p < 0.01, respectively.
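For readers who wish to reproduce this kind of analysis, a 3 × 2 × 8 repeated-measures ANOVA can be run in Python with statsmodels, as sketched below; the long-format table, its column names, and the file name are assumptions made for illustration, since the paper does not describe the analysis software.

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # One row per listener x accent class x syllable condition x speaking style,
    # holding that cell's mean percentage of correct classifications.
    cells = pd.read_csv("susas_perception_cells.csv")   # hypothetical file

    anova = AnovaRM(cells, depvar="pct_correct", subject="listener",
                    within=["accent", "syllables", "style"]).fit()
    print(anova)   # F and p values for the main effects and interactions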

Fig. 1. Mean percentage of correct style classifications as a function of speaking style under the Boston accent condition. The dashed line represents chance performance. Error bars indicate one standard error of the mean in each direction.

Fig. 2. Mean percentage of correct style classifications as a function of speaking style under the General American accent condition. The dashed line represents chance performance. Error bars indicate one standard error of the mean in each direction.

Within the New York class, however, both the main effect of speaking style, F(7,42) = 20.19, p < 0.01, and the interaction between speaking style and number of syllables, F(7,42) = 5.95, p < 0.01, were significant. These main effects of speaking style within the Boston and General American classes were further investigated by means of pairwise t-tests adjusted for family-wise α-error (an α-level of 0.01 was employed for these and all subsequent post hoc tests). Pairs of styles for which classification performance differed significantly are marked with an asterisk in Table 1.
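The post hoc comparisons could be carried out, for instance, as paired t-tests over all pairs of the eight styles while holding the stated family-wise alpha of 0.01; the sketch below uses a Bonferroni adjustment, which is one common choice but is an assumption here, as the paper does not name the correction used.

    from itertools import combinations
    from scipy.stats import ttest_rel

    FAMILY_ALPHA = 0.01

    def pairwise_style_tests(scores):
        """scores maps each style to a list of per-listener percent-correct values."""
        pairs = list(combinations(sorted(scores), 2))
        per_test_alpha = FAMILY_ALPHA / len(pairs)      # Bonferroni adjustment (assumed)
        for a, b in pairs:
            t, p = ttest_rel(scores[a], scores[b])      # paired over the same listeners
            if p < per_test_alpha:
                print("%s vs %s: t = %.2f, p = %.4f (significant)" % (a, b, t, p))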


Fig. 3. Mean percentage of correct style classifications as a function of speaking style and number of syllables under the New York accent condition. The dashed line represents chance performance. Error bars indicate one standard error of the mean in each direction.

Within the Boston class, the speaking styles labeled question and slow were correctly classified more often than any of the other styles. Within the General American class, angry and clear were classified correctly less often (indeed, classification of angry stimuli did not differ significantly from chance performance), and fast more often, than the remaining speaking styles.


The interaction between speaking style and number of syllables within the New York class was further examined by tests of the simple main effects of number of syllables at each level of speaking style, and vice versa. Pairwise t-tests were also performed, where appropriate. The only significant difference in classification performance as a function of the number of syllables occurred for the clear speaking style, for which listeners did better with disyllables than with monosyllables. However, the pattern of differences as a function of speaking style was different between one- and two-syllable words, as illustrated in Table 2.

It is also instructive to look at these results in terms of how listeners misclassify elements of the database. This can be accomplished by presenting the data in the form of a confusion matrix, in which percentages of classifications are displayed in a two-dimensional array as a function of listener response and SUSAS label. The confusion matrices for each of the three regional accent classes are presented in Tables 3–5.
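A confusion matrix of this kind (rows indexed by SUSAS label, columns by listener response, each row expressed as percentages of the responses given to that label) can be tabulated from raw trial data with pandas, as in the sketch below; the column names are illustrative rather than taken from the original analysis.

    import pandas as pd

    # One row per presentation: the SUSAS style label and the listener's response.
    trials = pd.DataFrame({"label":    ["angry", "angry", "loud", "loud"],
                           "response": ["loud",  "angry", "loud", "angry"]})   # toy data

    confusion = (pd.crosstab(trials["label"], trials["response"], normalize="index") * 100).round(0)
    print(confusion)   # each row sums to roughly 100%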

Table 1
Post hoc significance table for utterances spoken in the Boston and General American accents. For each accent, the table crosses the eight speaking styles (angry, clear, fast, loud, neutral, question, slow, soft); pairs of styles for which classification performance differed significantly are marked by an asterisk.


Table 2
Post hoc significance table for utterances spoken in the New York accent, as a function of number of syllables. Separate panels cross the eight speaking styles for monosyllables and for disyllables; pairs of styles for which classification performance differed significantly are marked with an asterisk.

Table 3
Confusion matrix for the Boston regional accent class (rows: SUSAS label; columns: response category); due to rounding, not all rows sum to 100%

SUSAS label    Angry   Clear   Fast   Loud   Neutral   Question   Slow   Soft
Angry            49%      1%     3%    45%        3%         0%     1%     0%
Clear             1%     42%     2%     3%       44%         0%     9%     0%
Fast              1%      3%    45%     1%       47%         0%     0%     3%
Loud             12%      5%     5%    55%       23%         0%     0%     0%
Neutral           0%     15%     2%     1%       63%        13%     2%     4%
Question          0%      1%     0%     1%        7%        89%     1%     0%
Slow              0%     16%     0%     0%        3%         0%    78%     2%
Soft              0%      7%     1%     0%       27%        15%     4%    44%
Total             8%     11%     7%    13%       27%        15%    12%     7%

While it is clear from the confusion matrices that words labeled question are almost always classified correctly, it is equally clear that this is not true for any of the other speaking styles, and that the pattern of misclassifications varies as a function of accent. For example, fast words spoken in the General American accent were perceived as fast 88% of the time, though in both of the other accents these words were called neutral almost as often as fast. One of the major sources of error seems to be the set of words labeled angry, which, across regional accent classes, were perceived as loud nearly twice as often as they were perceived as angry, though not vice versa; words labeled loud were called angry only 9% of the time. In fact, listeners judged elements of the database as angry less often than any other style. This is reflected in Fig. 4, in which the percentage of total classifications is plotted as a function of speaking style. From this figure it can be seen that listeners classified approximately 25% of the stimuli as neutral, roughly twice as many as bore that label, and less than 5% as loud, fewer than one-half the number called loud according to the database.


Table 4
Confusion matrix for the General American regional accent class (rows: SUSAS label; columns: response category); due to rounding, not all rows sum to 100%

SUSAS label    Angry   Clear   Fast   Loud   Neutral   Question   Slow   Soft
Angry            17%      2%    15%    60%        4%         0%     0%     1%
Clear             0%     29%     2%     1%       48%         0%    10%    10%
Fast              0%      0%    88%     0%        7%         0%     0%     4%
Loud              8%      1%    10%    65%       13%         1%     0%     2%
Neutral           0%      8%    19%     1%       57%         1%     1%    13%
Question          0%      1%     0%     0%       10%        86%     0%     1%
Slow              0%     12%     0%     0%       13%         0%    65%    10%
Soft              0%      3%     2%     0%       19%         0%     3%    73%
Total             3%      7%    17%    16%       22%        11%    10%    14%

Table 5
Confusion matrix for the New York regional accent class (rows: SUSAS label; columns: response category); due to rounding, not all rows sum to 100%

SUSAS label    Angry   Clear   Fast   Loud   Neutral   Question   Slow   Soft
Angry            20%      9%    15%    38%       17%         0%     1%     0%
Clear             1%     47%     0%     0%       23%         0%    27%     2%
Fast              1%      3%    54%     0%       39%         2%     0%     1%
Loud              7%     12%     2%    38%       39%         0%     2%     0%
Neutral           1%     10%     3%     0%       72%         8%     3%     3%
Question          0%      2%     0%     0%        4%        91%     1%     1%
Slow              0%     27%     0%     0%       12%         0%    55%     4%
Soft              0%      3%     0%     0%       21%         8%     7%    60%
Total             4%     14%     9%    10%       28%        14%    12%     9%

Fig. 4. Histogram showing mean percentage of total classifications as a function of speaking style.

4. Discussion

The results reported herein might lead one to conclude that there is a problem in the simulation of some of the styles for at least some of the regional accent classes. For example, the fact that listeners often classified angry samples as loud might suggest deficiencies in the simulation of the angry style. However, there are several reasons why such a conclusion would be premature at this point.


First, it may be the case that humans are able to identify some emotions or styles from purely vocal cues only if the utterance is of sufficient duration (indeed, much of the work done in this area has employed sentence-length stimuli; see, for example, Scherer et al., 1984; Ladd et al., 1985; Banse and Scherer, 1996; Mejvaldová, 1999; Abelin and Allwood, 2000). Second, the listeners used in this study were native to the Midwestern United States, and were thus accustomed to hearing English spoken with the "General American" accent; indeed, the trend suggests that these particular listeners performed better on the classification task with talkers belonging to this regional accent class. It may be that listeners from other regional accent classes, or non-native speakers of English, would yield different results. Another possibility worthy of consideration is that the vocal cues used to identify stress and emotion are weighted differently by different listeners, and that these cues are not uniformly represented in SUSAS; however, the statistical analysis of the data presented here does not support this hypothesis. Finally, it is possible that different results could be obtained if one selected a different set of words from the database, as it is unclear how representative the selected words are of the entire simulated portion of the database.

Further research is necessary to determine to what degree the results of the study are due to perceptual limitations, whether related to utterance duration, regional accent class, or native language of the listeners, or to other factors such as problems with the simulation of the various emotions or styles. One means of addressing some of these issues would be to collect a large corpus of continuous speech samples of varying duration. If a sufficient number of samples were collected under each of the simulated "stress" conditions, it should be possible to determine to what extent the classification of speaking style by human listeners is affected by duration, and perhaps to identify the average minimal duration required for accurate classification. If classification studies were conducted on such a corpus with listeners from a range of regional accent classes, they should indicate whether misclassifications are due to limitations in perception arising from the duration of the sample or the regional accent class of the listener, or to problems inherent in the simulation/replication of emotion in a context-free environment.

Until such time as these questions are answered, it seems prudent to suggest that researchers be cautious in interpreting results obtained in experiments using the simulated portion of the SUSAS database.

Acknowledgements

The authors would like to acknowledge the technical contributions of Michael L. Ward, of Sytronics, Inc., and Timothy R. Anderson, of the Air Force Research Laboratory. They also express their gratitude to Jean Rouat, of l'Université du Québec à Chicoutimi, for his assistance in the preparation of the French abstract; to Sabina Noll, of the Air Force Research Laboratory, for her collaboration on the German abstract; and to the editor, Dr. Abeer Alwan, and three anonymous reviewers for their helpful comments on an earlier draft of this manuscript.

References

Abelin, Å., Allwood, J., 2000. Cross linguistic interpretation of emotional prosody. In: Proc. ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research, 5–7 September 2000, Belfast.
American National Standards Institute, 1989. Specifications for Audiometers. ANSI, New York.
Banse, R., Scherer, K.R., 1996. Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psych. 70, 614–636.
Bou-Ghazale, S.E., Hansen, J.H.L., 1995. Source generator based stressed speech perturbation. In: Proc. EuroSpeech '95, 18–21 September 1995, Madrid, pp. 455–458.
Bou-Ghazale, S.E., Hansen, J.H.L., 1996. Generating stressed speech from neutral speech using a modified CELP vocoder. Speech Comm. 20, 93–110.
Bou-Ghazale, S.E., Hansen, J.H.L., 1998. HMM-based stressed speech modeling with application to improved synthesis and recognition of isolated speech under stress. IEEE Trans. Speech Audio Process. 6, 201–216.
Cummings, K.E., Clements, M.A., 1990. Analysis of glottal waveforms across stress styles. In: Proc. IEEE Intl. Conf. Acoust. Speech Signal Process., 3–6 April 1990, Albuquerque, Vol. 1, pp. 369–372.
Hansen, J.H.L., 1988. Analysis and Compensation of Stressed and Noisy Speech with Application to Automatic Recognition. Unpublished doctoral dissertation, Georgia Institute of Technology, Atlanta, GA.
Hansen, J.H.L., Bou-Ghazale, S.E., 1997. Getting started with SUSAS: A speech under simulated and actual stress database. In: Proc. EuroSpeech '97, 22–25 September 1997, Rhodes, Vol. 4, pp. 1743–1746.
Hansen, J.H.L., Swail, C., South, A.J., Moore, R.K., Steeneken, H., Cupples, E.J., Anderson, T., Vloeberghs, C.R.A., Trancoso, I., Verlinde, P., 2000. The impact of speech under 'stress' on military speech technology. NATO Research and Technology Organization Technical Report RTO-TR-10 AC/323(IST)TP/5 IST/TG-01. Neuilly-sur-Seine, France: NATO RTO.
Ladd, D.R., Silverman, K.E.A., Tolkmitt, F., Bergmann, G., Scherer, K.R., 1985. Evidence for the independent function of intonation contour type, voice quality, and F0 range in signaling speaker affect. J. Acoust. Soc. Amer. 78, 435–444.
Lippmann, R.P., Martin, E.A., Paul, D.B., 1987. Multi-style training for robust isolated word speech recognition. In: Proc. Intl. Conf. Acoust. Speech Signal Process., 31 March–3 April 1987, Dallas, pp. 705–708.
Mejvaldová, J., 1999. La communication prosodique des attitudes. In: RJC Parole '99 (Rencontres Jeunes Chercheurs en Parole), 18–19 November 1999, Avignon.
Paul, D.B., Lippmann, R.P., Chen, Y., Weinstein, C.J., 1986. Robust HMM-based techniques for recognition of speech produced under stress and in noise. In: Proc. DARPA Workshop Speech Recognition, 19–20 February 1986, Palo Alto, pp. 81–92.
Scherer, K.R., Ladd, D.R., Silverman, K.E.A., 1984. Vocal cues to speaker affect: Testing two models. J. Acoust. Soc. Amer. 76, 1346–1356.
Slyh, R.E., Nelson, W.T., Hansen, E.G., 1999. Analysis of mrate, shimmer, jitter, and F0 contour features across stress and speaking style in the SUSAS database. In: Proc. IEEE Intl. Conf. Acoust. Speech Signal Process., 15–19 March 1999, Phoenix, Vol. 4, paper 3036.
Tolkmitt, F.J., Scherer, K.R., 1986. Effect of experimentally induced stress on vocal parameters. J. Exp. Psychol. Hum. Percept. Perform. 12, 302–313.
Womack, B.D., Hansen, J.H.L., 1996. Classification of speech under stress using target driven features. Speech Comm. 20, 131–150.