Speech Communication 20 (1996) 85-91
Emotional stress in synthetic speech: Progress and future directions

Iain R. Murray *, John L. Arnott, Elizabeth A. Rohwer ¹

Applied Computer Studies Division, University of Dundee, Dundee DD1 4HN, United Kingdom

Received 15 April 1996; revised 15 June 1996
Abstract

Current text-to-speech systems have very good intelligibility, but most are still easily identified as artificial voices, and no commercial system incorporates prosodic variation resulting from emotion and related factors. This is largely due to the complexity of identifying and categorising the emotion factors in natural human speech, and of implementing these factors within synthetic speech. However, prosodic content in synthetic speech is seen as increasingly important, and there is presently renewed interest in the investigation of human vocal emotion and the expansion of synthesis models to allow greater prosodic variation. Such models could also be used as practical tools in the investigation and validation of models of emotion and other speech-altering stressors. This paper reviews progress to date in the investigation of human vocal emotions and their simulation in synthetic speech, and presents requirements for the future research needed to develop this area.
* Corresponding author.
¹ Present address: Linkon Corporation, 2 Greville Drive, Edgbaston, Birmingham B15 2UU, UK.

0167-6393/96/$15.00 Copyright © 1996 Elsevier Science B.V. All rights reserved. PII S0167-6393(96)00046-5
Keywords: Synthetic speech; Speech; Emotion; Stress
1. Introduction

Current text-to-speech systems are highly intelligible - indeed approaching the intelligibility of human speech. However, all systems sound unnatural (that is, the speech is readily identified as machine-generated), and the wealth of pragmatic information about the speaker and their state (such as emotion, mood and personality) which is present in all human speech utterances is missing from present synthesis systems. This is partly due to the simplicity of such systems, which precludes the incorporation of pragmatic effects, but is mostly due to our very scant knowledge of how physiological changes due to emotion contribute to specific acoustical changes in speech.
2. The difficulties of studying emotion

We are all constantly experiencing emotion, and a vast vocabulary of emotion-related terms is available; this vocabulary is used throughout artistic literature to describe the state of characters. However, as the experience of emotion and its various forms is different for every individual and, like taste or colour, intangible, these terms are very much open to subjective interpretation. Thus, despite often common usage, few of these terms have a commonly agreed meaning and even fewer have any formal definition, making it very difficult to produce rigorous descriptions of emotion and its associated features. A study of emotion terminology is described by Storm and Storm (1987).

Consequently, a wide range of emotions has been described in the scientific literature, and the analyses behind these descriptions have been conducted by a wide range of means. Our knowledge of the ways in which stimuli lead to emotion changes within our bodies is very limited (Davitz, 1964), and the ways in which the somatic emotion is carried through to outwardly perceptible changes are even less well known; in this respect, speech is less well researched than expression via the face (Scherer, 1986) and posture.

Another problem in marrying emotion effects to speech technology (both for input and output) is the distinct nature of the two research disciplines. Speech science has almost entirely avoided the emotion area as an unwanted complication in the study of speech (though there are exceptions, e.g. Scherer et al. (1984)); for speech recognition, emotion information is virtually redundant as it adds little if anything to the recognisability of words (indeed, it will often reduce it), and for speech synthesis, pragmatic effects have always been of lower importance than intelligibility. Studies of vocal emotion have thus tended to be entirely the domain of psychologists and have characteristically been isolated in nature, whereas mainstream speech research has predominantly been conducted by larger research groups over long periods of time.

The key to success in this area, then, appears to be formal definition of emotion terminology and of the effects corresponding to those emotions, as part of a wider definition of stress features in general (Murray et al., 1996). This process will be greatly facilitated by closer collaboration between psychologists and speech scientists.
3. Emotion in speech

From the human vocal emotion analyses reported in the literature, it is known that emotion causes changes in three groups of speech parameters:
- Voice quality. These effects define the "character" of the voice, and typically include hoarseness, whispering, creakiness and similar effects. Intensity can also be considered a voice quality effect.
- Pitch contour. The intonation of an utterance, especially the range of pitch and the nature of pitch inflections, contributes greatly to the emotion content.
- Timing. The overall speed of an utterance, together with changes in the duration of certain parts of the speech, also conveys some emotional affect.

It should be noted that these parameters are also used to convey other features during normal (that is, unemotional) speech; pitch and timing changes, for example, are used to indicate stressed words in an utterance. Thus, any emotion-related changes to these parameters must be considered in addition to the underlying changes present; any intonation changes must still be perceptible when emotion is present in the speech (see the sketch at the end of this section).

It is also worth considering a number of other pragmatic features alongside emotions (some are occasionally considered as emotions), as they affect the same sorts of voice parameters in often similar ways. Such features would include "pseudo-emotions" such as tiredness, non-emotive bodily changes such as having a cold, and the more conscious-level effects of "speaking style" which a speaker uses to impart particular intention into their speech. Thus any useful model of stress and the sub-set of emotions (see Murray et al. (1996)) would be able to define all of these features separately, and combine them correctly into one speech signal.

Partly as a result of the terminological difficulties, the various emotion analyses reported in the literature have rarely investigated the same set of emotions; in addition, virtually every paper in this field
reports differing analysis techniques. New studies using state-of-the-art equipment and a clearly defined study procedure are required; these would be greatly facilitated by the speech libraries now becoming available. Although use for emotion analysis has not been a deliberate goal of these recordings, some could be used for this purpose; in the long term, however, it would be better to have purpose-made recordings labelled with emotion features in addition to more conventional labelling (Arnfield et al., 1995).

It has been noted that vocal changes due to emotional stress may be cross-cultural in nature, i.e. that all peoples respond to stress stimuli in the same way (Van Bezooijen et al., 1983), although this is likely to be true only for some emotions (Ortony and Turner, 1990). However, somewhat inconveniently, this is only true up to the point where labels are applied to these emotions. Whilst dictionaries may translate emotion labels between languages, it is probable that the labels do not map directly - for instance, "sad" may translate to "triste" in French, but what a French speaker labels as "triste" may not map directly to what an English speaker would label as "sad". Clearly the same applies to related emotion terms within one language - there are certainly different categories of anger (Frick, 1986) - and attempts to compare the results of previous studies are hampered by this classification problem: are "glad" and "happy" the same? Are "neutral" and "indifferent" the same? The taxonomy of emotion terms presented by Storm and Storm (1987) shows how complex the situation is, even for only one language.

Thus, new emotion analyses using modern equipment and rigorous analysis procedures need to be conducted, and the results made widely available via databases of speech samples. From these, we need to derive detailed descriptions of how different emotions affect the voice quality, pitch contour and timing of utterances; these should include cross-cultural analyses to determine the applicability of the results.
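To make the layering requirement above concrete, the following minimal sketch (ours, for illustration only; the parameter names and overlay values are assumptions, not taken from any system cited here) applies multiplicative emotion overlays on top of a baseline prosodic specification, so that the pitch and duration contrasts marking linguistic stress remain perceptible after the emotional modification:

from dataclasses import dataclass

@dataclass
class Syllable:
    text: str
    f0_hz: float    # baseline pitch target
    dur_ms: float   # baseline duration
    stressed: bool  # linguistic (not emotional) stress

# Hypothetical overlays: multiplicative so baseline contrasts survive.
EMOTION_OVERLAYS = {
    "neutral": {"f0_scale": 1.00, "range_scale": 1.0, "rate_scale": 1.00},
    "anger":   {"f0_scale": 1.10, "range_scale": 1.6, "rate_scale": 1.15},
    "sadness": {"f0_scale": 0.95, "range_scale": 0.6, "rate_scale": 0.80},
}

def apply_emotion(syllables, emotion, baseline_f0=120.0):
    ov = EMOTION_OVERLAYS[emotion]
    out = []
    for s in syllables:
        # Widen or narrow pitch excursions about the speaker baseline,
        # then shift the whole contour; stressed syllables keep their
        # relative prominence because the mapping is monotonic.
        excursion = s.f0_hz - baseline_f0
        f0 = (baseline_f0 + excursion * ov["range_scale"]) * ov["f0_scale"]
        dur = s.dur_ms / ov["rate_scale"]  # faster rate shortens durations
        out.append(Syllable(s.text, f0, dur, s.stressed))
    return out

if __name__ == "__main__":
    utt = [Syllable("HEL", 140, 180, True), Syllable("lo", 110, 150, False)]
    for syl in apply_emotion(utt, "anger"):
        print(f"{syl.text}: f0={syl.f0_hz:.1f} Hz, dur={syl.dur_ms:.0f} ms")

Because the overlays are multiplicative rather than absolute, a stressed syllable remains more prominent than its neighbours whatever emotion is applied.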
4. Modelling emotion

Emotion modelling research has been conducted in two main directions:
- modelling the internal patterns of the emotional process (from stimulus through physiological changes to outwardly perceivable changes in the subject); and
- modelling inter-relationships between emotions, to define which emotions are similar to others, and to determine sub-groups or families of emotions.

4.1. Modelling emotion processes
To describe the emotion process fully, we must be aware of the stimuli acting on the subject, the way in which these stimuli affect the subject's physiology, and the way physiological changes then affect the vocal apparatus and speech. However, the exact way in which physiology is affected by external stimuli is not well understood, nor is the exact way in which this in turn affects the vocal tract. Stimulus Evaluation Checks (SECs) are one model (proposed by Scherer (1986)), using a hierarchy of effects leading from a stimulus through physiological changes to the subject's response, depending on various internal and external factors, such as the magnitude of the stimulus and the perceived threat from it. Other models related to stressed speech are presented by Murray et al. (1996).

Emotion research using voice analysis techniques (e.g. Abadjieva et al., 1993; Vroomen et al., 1993; other studies are reviewed by Murray and Arnott (1993)) has attempted to correlate different emotional states directly with specific changes within speech. The "generative theory of affect" proposed by Cahn (1990) attempts to describe emotional changes (and other features of speech, such as intonation) directly from the mental and physical emotional state of the speaker, and such a model would be a valuable tool for automating the addition of emotion effects to synthetic speech. The stressor-strain matrix proposed by Murray et al. (1996) offers a practical route to achieving this goal. Any such model would be a vital part of a unified theory of speech (Moore, 1994), which would have to include pragmatic effects within it.

Conversion of vocal tract dynamics at this level to speech sounds could be achieved using articulatory synthesis; while this technique is not new, it is comparatively little used in speech research at present due to its computational complexity compared
to other methods of speech synthesis. While this model seems an attractive one upon which to base a speech system, there are great practical difficulties in its implementation. In particular, the factors affecting the acoustic variables related to details of phonology, speaking style, emotion and other related factors, and the way they influence the perception of speech, are still very poorly understood. However, this is potentially a very important area, as a good understanding of the emotion processes, and particularly how they affect the speech apparatus and thus the audible speech, could be used as a convenient way to further develop an articulatory speech model to produce emotional speech: the system would be given a series of emotional stimuli and the emotional content of the speech would change accordingly. These models could also contribute to computer models for the generation of faces (Hara and Kobayashi, 1992; Magnenat-Thalmann and Kalra, 1992), and would facilitate production of artificial face graphics with simultaneous synthetic speech.
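The following sketch illustrates only the general shape of such a process model; it is not Scherer's actual formulation, and every check, weight and mapping in it is an invented assumption. It shows how a fixed sequence of appraisal checks on a stimulus could accumulate into a somatic state, which a later stage maps onto coarse voice settings:

def evaluate_stimulus(novelty, pleasantness, goal_relevance, coping_ability):
    """Run illustrative appraisal checks in order, accumulating a state."""
    arousal, valence = 0.0, 0.0
    # 1. Novelty check: unexpected events raise arousal.
    arousal += 0.5 * novelty
    # 2. Intrinsic pleasantness check: sets the sign of the response.
    valence += pleasantness
    # 3. Goal significance check: relevant events amplify everything.
    arousal *= (1.0 + goal_relevance)
    valence *= (1.0 + goal_relevance)
    # 4. Coping potential check: low coping turns arousal into distress.
    if coping_ability < 0.3:
        valence -= 0.5 * arousal
    return {"arousal": arousal, "valence": valence}

def physiology_to_voice(state):
    """Hypothetical mapping from somatic state to coarse voice settings."""
    return {
        "f0_mean_scale":  1.0 + 0.2 * state["arousal"],   # tension raises pitch
        "rate_scale":     1.0 + 0.15 * state["arousal"],  # and speeds speech
        "laryngeal_tension": max(0.0, state["arousal"] - state["valence"]),
    }

# A sudden, unpleasant, highly relevant stimulus with poor coping ability:
print(physiology_to_voice(evaluate_stimulus(0.8, -0.6, 0.9, 0.2)))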
4.2. Modelling inter-relations between emotions
Models of the inter-relationships between emotions have been divided between two main theories: "basic" (or "palette") emotion theories define a closed set of basic discrete emotions and define all others as variations and combinations of these, while "dimensional" theories define emotions as points within some form of dimensional space. These theories are not clearly defined by psychologists, and both have a number of variations; the basic emotion set (reviewed by Ortony and Turner (1990)) varies in size between two ("positive" and "negative" emotions) and eighteen, and the dimensional theory exists in two-, three- and even four-dimensional forms. However, it is generally the three-dimensional form (originally proposed by Schlosberg (1954)) which has received most support.

The palette and dimensional theories are not mutually exclusive, as discrete emotions can be considered as points within a dimensional model. Both types of model appear to have some validity, though the dimensional type is perhaps of greater application, as it offers the potential to create a wide range of emotions easily from a simple set of input parameters. It is known that some physiological changes correlate with parameters of speech (Davitz, 1964; Scherer, 1986), and these also correlate with some of the proposed emotional dimensions (e.g. the "tension" dimension in the Schlosberg model correlates strongly with both speech rate and heart rate).
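As an illustration of how a dimensional model can generate a wide range of emotions from few parameters, the sketch below places named emotions at points in a Schlosberg-style three-dimensional space and interpolates between them. The coordinates and mapping weights are invented, except that the tension axis is made to drive speech rate, following the correlation noted above:

SCHLOSBERG_AXES = ("pleasantness", "attention", "tension")

# Hypothetical coordinates in [-1, 1] for a few named emotions.
NAMED_POINTS = {
    "joy":     ( 0.8,  0.5,  0.4),
    "anger":   (-0.7,  0.6,  0.9),
    "sadness": (-0.6, -0.4, -0.5),
    "neutral": ( 0.0,  0.0,  0.0),
}

def blend(a, b, t):
    """Interpolate between two named emotions: dimensional models make
    'in-between' emotions trivial to derive from a few parameters."""
    pa, pb = NAMED_POINTS[a], NAMED_POINTS[b]
    return tuple((1 - t) * x + t * y for x, y in zip(pa, pb))

def to_prosody(point):
    pleasantness, attention, tension = point
    return {
        "rate_scale": 1.0 + 0.25 * tension,       # tension tracks speech rate
        "f0_range_scale": 1.0 + 0.5 * attention,  # attentive speech is livelier
        "f0_mean_scale": 1.0 + 0.1 * pleasantness,
    }

print(to_prosody(blend("neutral", "anger", 0.5)))  # a point halfway to anger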
5. Using synthetic speech to model emotion features
Despite our limited knowledge, several prototype synthetic speech-with-emotion systems have been developed and demonstrated. The HAMLET system of Murray (1989), further described by Murray and Arnott (1995), uses rules to alter voice, pitch and timing in speech via a commercial synthesiser. The Affect Editor system of Cahn (1990) operates in a similar way, although some manual processing of the input text is required. The SPRUCE text-to-speech system of Tatham and Lewis (1992) also includes a capability for including pragmatic effects.

Re-synthesis of coded speech allows certain parameters to be manipulated, and emotion-type effects can be added in this way. Whilst commercial systems have not exploited this potential, manipulation of emotion parameters in coded speech has been used in many experiments investigating emotion (e.g. Scherer et al., 1984; Vroomen et al., 1993).

Interest in this area of research is increasing as the number of potential applications grows. Use of model-generated emotional synthetic speech could also be a valuable tool in the development of the stress and emotion models mentioned above; parameters could be tried out and altered heuristically after listening to the speech produced.
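The sketch below illustrates the general emotion-by-rule idea: a table of per-emotion settings emitted as inline control codes around the text for a downstream synthesiser. The control-code syntax and parameter values are invented placeholders and do not reproduce the actual rules of HAMLET, the Affect Editor or any commercial device:

EMOTION_RULES = {
    # speaking rate (words/min), average pitch (Hz), pitch range (%)
    "anger":   {"rate": 210, "pitch": 140, "range": 180},
    "sadness": {"rate": 130, "pitch": 105, "range": 60},
    "fear":    {"rate": 220, "pitch": 160, "range": 150},
    "neutral": {"rate": 170, "pitch": 120, "range": 100},
}

def emotion_by_rule(text: str, emotion: str) -> str:
    """Wrap text in hypothetical synthesiser control codes for an emotion."""
    p = EMOTION_RULES[emotion]
    return (f"[:rate {p['rate']}][:pitch {p['pitch']}]"
            f"[:range {p['range']}] {text}")

print(emotion_by_rule("I have lost my wallet.", "sadness"))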
6. Evaluation of the content
It is difficult to measure the pragmatic content of human speech, and the problem is further compounded when attempting to measure synthetic speech. There are three features of synthetic speech systems which can be used (formally or informally) to measure their performance:
- Intelligibility. The most rigorously applied of synthetic speech measurements, usually implemented by some form of rhyme test, intelligibility measures how well a synthesiser can produce different speech sounds in a recognisable way. Many synthetic speech systems can score very highly in intelligibility tests, often scoring almost as well as natural speech.
- Variability. This is the capability of a synthesiser to change the characteristics of the voice with which it speaks, from simple alterations of the speech rate to more involved voice quality alterations which allow the "personality" of the voice to be changed.
- Naturalness. This parameter is not rigorously defined, but is intended to indicate "how human" a synthesiser sounds; thus a highly natural voice may often be mistaken for a human voice, while an unnatural voice is clearly machine-generated. The parameter is also often used to describe how pleasing or how "easy to listen to" the synthetic speech is.

Although today's synthesisers are often highly intelligible (Scherz and Beer, 1995), they are very often not very natural, and one goal of speech synthesis research has been to produce a synthetic voice which is highly natural. There is also some evidence (Cowley and Jones, 1993) that factors affecting intelligibility correlate negatively with those affecting naturalness.

Synthetic speech-with-emotion systems require a recognised testing strategy which includes measurement of all of these features. There has been wide acceptance of the need for a standard testing strategy for speech recognition systems, and also for measuring the intelligibility of speech synthesis systems using rhyme tests and similar techniques, but there is not as yet a standard strategy for such features as emotion and the broader naturalness. As in speech recognition, future developments would be more easily evaluated, and comparison of results from synthetic speech systems would be facilitated, if there were a common test procedure covering all aspects of the speech signal (intelligibility, naturalness, etc.). It is thus important that a formal testing procedure for speech synthesis systems, which includes coverage of emotion and pragmatic effects, is developed and widely adopted.
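As an indication of what one component of such a procedure might look like, the sketch below scores a forced-choice emotion-identification listening test with an overall recognition rate and a confusion matrix; the listener responses shown are invented:

from collections import Counter

EMOTIONS = ["anger", "sadness", "joy", "neutral"]

def confusion_matrix(trials):
    """trials: iterable of (intended_emotion, listener_response) pairs."""
    counts = Counter(trials)
    return {i: {r: counts[(i, r)] for r in EMOTIONS} for i in EMOTIONS}

def recognition_rate(trials):
    trials = list(trials)
    hits = sum(1 for intended, response in trials if intended == response)
    return hits / len(trials)

# Invented listener responses for eight stimuli:
data = [("anger", "anger"), ("anger", "joy"), ("sadness", "sadness"),
        ("sadness", "neutral"), ("joy", "joy"), ("joy", "anger"),
        ("neutral", "neutral"), ("neutral", "neutral")]

print(f"recognition rate: {recognition_rate(data):.0%}")
for intended, row in confusion_matrix(data).items():
    print(intended, row)

The off-diagonal cells of such a matrix show which emotions a synthesiser confuses, which is often more informative than the overall rate alone.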
7. Improving the realism of synthetic speech

There are three ways in which to improve the naturalness of synthetic speech:
- Voice quality. Improvements in the speech reproduction or synthesis process can lead to a more natural frequency distribution within the speech signal, leading to more natural-sounding speech (speech produced algorithmically tends to contain more clearly discernible frequency bands than are present in natural speech). One major contributor to this process in constructive synthesis (synthetic speech not based on human speech samples) has been found to be simulation of the voice source itself, and improving the vocal fold vibration model has been found to greatly enhance the output speech (a simple source model is sketched at the end of this section).
- Speaking style. Research (summarised by Eskénazi (1993)) has shown that humans speak in different ways depending on a number of factors related to their speaking environment, such as the type of material being read, whether the speaker is consciously trying to speak more intelligibly, the nature of the audience, and the speaker's social standing relative to the audience. These changes are in timing, pitch contour and stress placement (at both word and utterance level).
- Emotion and mood. Numerous internal factors within the speaker, commonly referred to as emotion or mood, and other pragmatic features can also lead to changes in the speech produced (summarised by Murray and Arnott (1993)).

Thus, to ultimately achieve a truly natural-sounding synthetic voice, a system must have a good underlying voice quality, speak with features appropriate to the style of dialogue and speaking context, and have the capability to express emotion and other pragmatic effects within the speech. The principal problem for researchers endeavouring to improve the naturalness of synthetic speech is the limited knowledge of the effect of speaking styles, emotion and other pragmatic effects upon the speech signal.
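To illustrate the voice-source point in the first item above, the sketch below generates a Rosenberg-type glottal pulse train, a classic simple voice-source model used here as a stand-in for more elaborate vocal fold models; varying the open quotient and speed quotient changes the voice quality of the resulting source signal. The default parameter values are arbitrary:

import numpy as np

def rosenberg_pulse_train(f0=120.0, fs=16000, dur=0.5,
                          open_quotient=0.6, speed_quotient=2.0):
    """Generate a Rosenberg-type glottal flow waveform.

    open_quotient: fraction of each period the glottis is open; raising it
    gives a softer, breathier quality, lowering it a tenser, brighter one.
    speed_quotient: ratio of opening time to closing time.
    """
    period = int(fs / f0)                 # samples per glottal cycle
    t_open = int(period * open_quotient)  # open-phase length
    t_p = int(t_open * speed_quotient / (1 + speed_quotient))  # opening
    t_n = t_open - t_p                    # closing
    pulse = np.zeros(period)
    n = np.arange(t_p)
    pulse[:t_p] = 0.5 * (1 - np.cos(np.pi * n / t_p))  # rising opening phase
    n = np.arange(t_n)
    pulse[t_p:t_open] = np.cos(0.5 * np.pi * n / t_n)  # falling closing phase
    n_cycles = int(dur * fs / period)
    return np.tile(pulse, n_cycles)

wave = rosenberg_pulse_train()
print(f"{len(wave)} samples, peak={wave.max():.2f}")

In a full synthesiser this source signal would then be filtered by a vocal tract model; here it stands alone purely to show how few parameters control the source's character.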
8. Conclusion

With increasing demand for speech technology systems, there is an increasing need for text-to-speech systems which sound natural, and thus include emotion and other pragmatic effects. Some research systems have been produced, but development is hindered by our poor knowledge of emotion processes and their correlates in speech.
Progress in this research area will be improved by:
- greater liaison between psychologists and speech scientists;
- formal definitions of emotion terminology;
- models relating acoustic vocal effects to the emotion definitions, and also to physiological emotion effects;
- modern analyses of human vocal emotion, and the availability of speech sample databases including emotions;
- enhanced models of speech production, to allow incorporation of the full range of emotion effects; and
- a formal testing strategy for synthetic speech systems, which includes measurement of pragmatic effects.
Acknowledgements

The authors would like to thank the SERC, who funded their initial synthetic speech-with-emotion research (Postgraduate Research Studentship No. 8631679X and Research Grant GR/F 63862), and all those who contributed comments relating to this work at the ESCA-NATO Workshop on Speech under Stress.
References

E. Abadjieva, I.R. Murray and J.L. Arnott (1993), "Applying analysis of human emotional speech to enhance synthetic speech", Proc. Eurospeech '93, Berlin, pp. 909-912.
S. Arnfield, P. Roach, J. Setter, P. Greasley and D. Horton (1995), "Emotional stress and speech tempo variation", Proc. ESCA-NATO Workshop on Speech under Stress, Lisbon, Portugal, pp. 13-15.
J.E. Cahn (1990), Generating expression in synthesized speech, MIT Media Laboratory Technical Report.
C.K. Cowley and D.M. Jones (1993), "Assessing the quality of synthetic speech", in: C. Baber and J.M. Noyes, Eds., Interactive Speech Technology (Taylor and Francis, London).
J.R. Davitz (1964), The Communication of Emotional Meaning (McGraw-Hill, New York).
M. Eskenazi (1993), "Trends in speaking styles research", Proc. Eurospeech '93, Berlin, pp. 501-509.
R.W. Frick (1986), "The prosodic expression of anger - Differentiating threat and frustration", Aggressive Behaviour, Vol. 12, No. 2, pp. 121-128.
F. Hara and H. Kobayashi (1992), "Computer graphics for expressing robot-artificial emotions", Proc. IEEE Internat. Workshop on Robot and Human Communication, Tokyo, Japan, pp. 155-160.
N. Magnenat-Thalmann and P. Kalra (1992), "A model for creating and visualizing speech and emotion", Proc. 6th Internat. Workshop on Aspects of Automated Natural Language Generation, Trento, Italy, pp. 1-12.
R.K. Moore (1994), "Speech pattern processing: From 'blue sky' ideas to a unified theory?", Proc. Inst. Acoust., Vol. 16, No. 5, pp. 1-13.
I.R. Murray (1989), Simulating emotion in synthetic speech, PhD Thesis, University of Dundee.
I.R. Murray and J.L. Arnott (1993), "Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion", J. Acoust. Soc. Amer., Vol. 93, No. 2, pp. 1097-1108.
I.R. Murray and J.L. Arnott (1995), "Implementation and testing of a system for producing emotion-by-rule in synthetic speech", Speech Communication, Vol. 16, No. 4, pp. 369-390.
I.R. Murray, C. Baber and A. South (1996), "Towards a definition and working model of stress and its effects on speech", Speech Communication, Vol. 20, Nos. 1-2, pp. 3-11.
A. Ortony and T.J. Turner (1990), "What's basic about basic emotions?", Psychological Review, Vol. 97, No. 3, pp. 315-331.
K.R. Scherer (1986), "Vocal affect expression - A review and a model for future research", Psychological Bull., Vol. 99, No. 2, pp. 143-165.
K.R. Scherer, D.R. Ladd and K.E.A. Silverman (1984), "Vocal cues to speaker affect: Testing two models", J. Acoust. Soc. Amer., Vol. 76, No. 5, pp. 1346-1356.
J.W. Scherz and M.M. Beer (1995), "Factors affecting the intelligibility of synthesized speech", AAC Augmentative and Alternative Communication, Vol. 11, No. 2, pp. 74-78.
H. Schlosberg (1954), "Three dimensions of emotion", Psychological Review, Vol. 61, No. 2, pp. 81-88.
C. Storm and T. Storm (1987), "A taxonomic study of the vocabulary of emotions", J. Personality and Social Psychology, Vol. 53, No. 4, pp. 805-816.
M.A.A. Tatham and E. Lewis (1992), "Prosodic assignment in SPRUCE text-to-speech synthesis", Proc. Inst. Acoust., Vol. 14, No. 6, pp. 447-454.
R. van Bezooijen, S.A. Otto and T.A. Heenan (1983), "Recognition of vocal expressions of emotion: A three-nation study to identify universal characteristics", J. Cross-Cultural Psychology, Vol. 14, No. 4, pp. 387-406.
J. Vroomen, R. Collier and S. Mozziconacci (1993), "Duration and intonation in emotional speech", Proc. Eurospeech '93, Berlin, pp. 577-580.