Anders, Ende, Junghöfer, Kissler & Wildgruber (Eds.) Progress in Brain Research, Vol. 156 ISSN 0079-6123 Copyright © 2006 Elsevier B.V. All rights reserved
CHAPTER 12
Intonation as an interface between language and affect

Didier Grandjean, Tanja Bänziger and Klaus R. Scherer

Swiss Center for Affective Sciences, University of Geneva, 7 rue des Battoirs, 1205 Geneva, Switzerland
Abstract: The vocal expression of human emotions is embedded within language, and the study of intonation has to take into account two interacting levels of information: emotional and semantic meaning. In addition to the discussion of this dual coding system, an extension of Brunswik's lens model is proposed. This model includes the influences of conventions, norms, and display rules (pull effects) and psychobiological mechanisms (push effects) on emotional vocalizations produced by the speaker (encoding) and the reciprocal influences of these two aspects on attributions made by the listener (decoding), allowing the dissociation and systematic study of the production and perception of intonation. Three empirical studies are described as examples of how these different phenomena can be dissociated at the behavioral and neurological levels in the study of intonation.

Keywords: prosody; intonation; emotion; attention; linguistic; affect; brain imagery

Emotions are defined as episodes of massive, synchronous recruitment of mental and somatic resources to adapt to or cope with stimulus events that are subjectively appraised as highly pertinent for an individual and that involve strong mobilization of the autonomic and somatic nervous systems (Scherer, 2001; Sander et al., 2005). The patterns of activation created in those systems will be reflected, generally, in expressive behavior and, specifically, will have a powerful impact on the production of vocal expressions. Darwin (1998/1872) observed that in many species the voice is exploited as an iconic affective signalling device. Vocalizations, as well as other emotional expressions, are used by conspecifics, and sometimes even by members of other species, to make inferences about the emotional/motivational state of the sender.
The coding of information in prosody

One of the most interesting issues in vocal affect signalling is the evolutionary continuity between animal vocalizations of motivational/emotional states and human prosody. Morton (1982) proposed universal "motivational-structural rules" in an attempt to understand the relationship between fear and aggression in animal vocalizations. Morton tried to systematize the role of fundamental frequency, energy, and quality (texture) of vocalization in the signalling of aggressive anger and fear. The major differences between the anger and fear signals lie in the contour trajectories of the respective sounds and in their roughness or "thickness." For instance, the "fear endpoint" (no aggression) is characterized by continuous high pitch, compared with low pitch and roughness at the "aggressive endpoint" (see Fig. 1). Between these two extremes, Morton described variations corresponding to mixed emotions, characterized by different combinations of the basic features.
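Purely as an illustration of how Morton's two endpoints could be operationalized, the following sketch maps two summary features of a vocalization (mean F0 and a roughness index) onto aggression and fear/appeasement coordinates. The feature names, the normalization range, and the linear weighting are assumptions made for illustration, not values given by Morton (1982).

```python
# Illustrative sketch only: places a vocalization on Morton's (1982)
# aggression-fear plane from two summary features. The thresholds and the
# linear mapping below are assumptions of this sketch, not Morton's rules.

def motivational_structural_position(mean_f0_hz, roughness,
                                     f0_low=100.0, f0_high=500.0):
    """Return (aggression, fear) coordinates in [0, 1].

    High, tonal (smooth) vocalizations sit near the fear/appeasement
    endpoint; low, rough ones near the aggression endpoint. The roughness
    argument is assumed to be a 0-1 index (e.g., derived from a
    harmonics-to-noise ratio), which is itself an assumption.
    """
    # Normalize F0 into [0, 1] over an assumed plausible range.
    f0_norm = min(max((mean_f0_hz - f0_low) / (f0_high - f0_low), 0.0), 1.0)
    fear = 0.5 * f0_norm + 0.5 * (1.0 - roughness)         # high pitch, smooth
    aggression = 0.5 * (1.0 - f0_norm) + 0.5 * roughness   # low pitch, rough
    return aggression, fear

# Example: a low, rough growl versus a high, tonal whine.
print(motivational_structural_position(120.0, 0.9))  # mostly "aggression"
print(motivational_structural_position(450.0, 0.1))  # mostly "fear/appeasement"
```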
Corresponding author. Tel.: +41223799213; Fax: +41223799844; E-mail: [email protected]
DOI: 10.1016/S0079-6123(06)56012-1
Fig. 1. Motivational-structural rules (Morton, 1982). Modelling the modifications of F0 and energy of vocalizations signalling aggression (horizontal axis) or fear/appeasement (vertical axis).
This kind of iconic affective signalling device, common to most mammals, changed in the course of human evolution in that it became the carrier signal for language. This communication system is discretely and arbitrarily coded and uses the human voice, and the acoustic variability it affords, to build the units of the language code. This code has been superimposed on the affect signalling system, which continues to be used in human communication, for example in affect bursts or interjections (Scherer, 1985). Many aspects of the primitive affect signalling system have been integrated into the language code via prosody. Not only do the two codes coexist, but speech prosody has also integrated some of the same sound features described in Morton's examples. These vocal productions are thus not pure: they are shaped by both physiological reactions and cultural conventions. Wundt (1900)
proposed the notion of a "domestication of affect sounds" to describe this mixture of natural and cultural aspects. It is particularly important to distinguish "push effects" from "pull effects" (Scherer, 1986). Push effects represent the influence of underlying psychobiological mechanisms; for instance, increased arousal raises muscle tension and thereby raises fundamental frequency (F0). The equivalent of "push" on the attribution side is some kind of schematic recognition, which is probably also largely innate. These psychobiological effects are complemented by "pull effects," that is, conventions, norms, and display rules that pull the voice in certain directions. The result is a "dual code": several dimensions are combined simultaneously in the production of speech. This complex patterning makes it difficult to determine which features in the acoustic signal are specific to particular emotions.
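The dual code can be caricatured in a toy simulation: a push component that scales F0 continuously with arousal, combined with a conventionalized pull component added as a discrete contour template (e.g., a final rise marking a question). The numerical values and the simple additive combination below are assumptions made purely for illustration, not a model proposed in the literature.

```python
import numpy as np

# Toy illustration of the "dual code": a physiologically driven (push)
# component that covaries with arousal, plus a conventionalized (pull)
# contour template. All parameters are illustrative assumptions.

def toy_f0_contour(arousal, pull_template, baseline_hz=120.0,
                   push_gain_hz=60.0, n_points=50):
    """Return a schematic F0 contour in Hz.

    arousal: scalar in [0, 1]; higher arousal -> higher overall F0 (push).
    pull_template: array of length n_points, in Hz, encoding a conventional
        contour shape (e.g., a final rise for a question).
    """
    t = np.linspace(0.0, 1.0, n_points)
    push = baseline_hz + push_gain_hz * arousal   # covariation with arousal
    declination = -10.0 * t                       # gentle overall fall
    return push + declination + np.asarray(pull_template)

# A conventional "question" template: flat, then a final rise (pull effect).
question = np.concatenate([np.zeros(40), np.linspace(0.0, 40.0, 10)])
calm_question = toy_f0_contour(0.1, question)
aroused_statement = toy_f0_contour(0.8, np.zeros(50))
```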
Scherer et al. (1984) proposed distinguishing coding via covariation (continuous affect signalling), determined mostly by biopsychological factors, from coding via configuration (discrete message types), shaped by linguistic and sociocultural factors. This distinction is very important but unfortunately often neglected in the literature. Given the complex interaction of different determinants in producing intonation patterns, it is of particular importance to clearly distinguish between production, or encoding, and perception, or decoding, of intonational messages. Brunswik's "lens model" allows one to combine the study of encoding and decoding of affect signals (Brunswik, 1956). This model (see Fig. 2) presumes that emotion is expressed by distal indicator cues (i.e., acoustic parameters), which can be extracted from the acoustic waveform. These objective distal cues are perceived by human listeners who, on the basis of their proximal percept, make a subjective attribution of the underlying emotion expressed by the speaker, often influenced by the context. This model allows one to clearly distinguish between the expression (or encoding) of emotion on the sender side, the transmission of the sound, and the impression (or decoding) on the receiver side,
resulting in emotion inference. The model encourages voice researchers to measure the complete communicative process, including (a) the emotional state expressed, (b) the acoustically measured voice cues, (c) the perceptual judgments of voice cues, and (d) the process that integrates all cues into a judgment of the encoded emotion. Figure 2 further illustrates that both psychobiological mechanisms (push effects) and social norms (pull effects) influence the expression (encoding) of vocal expressions and, conversely, are transposed into rules used for impression formation (decoding). The patterns of intonation produced during a social interaction are strongly influenced by conventions shaped by different social and cultural rules (Ishii et al., 2003). Moreover, in tone languages, different tones may convey specific semantic meaning (Salzmann, 1993). In these cases, tones with semantic functions interact in a complex manner with emotional influences. It is very important to keep these different aspects apart because they are governed by very different principles (Scherer et al., 2003). Given that the underlying biological processes are likely to depend on both the idiosyncratic
Fig. 2. Adaptation of Brunswik’s lens model, including the influences of conventions, norms, and display rules (pull effects) and psychobiological mechanisms (push effects) on emotional vocalizations produced by the speaker (encoding) and the reciprocal influences of these two aspects on attributions made by the listener (decoding).
nature of the individual and the specific nature of the situation, relatively strong interindividual differences in the expressive patterns will result from push effects. Conversely, for pull effects, a very high degree of symbolization and conventionalization, and thus comparatively few and small individual differences, are expected. With respect to cross-cultural comparison, one would expect the opposite: Very few differences between cultures for push effects and large differences for pull effects.
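A lens-model analysis of the kind encouraged above can be summarized, in its simplest correlational form, by a few quantities: how well an acoustic cue indexes the expressed state (ecological validity), how faithfully it survives transmission and perception, how strongly listeners rely on the perceived cue (cue utilization), and the overall achievement of the communication. The sketch below is a minimal illustration with made-up data; the variable names and the use of arousal as the criterion are assumptions, not the analyses reported in the studies cited here.

```python
import numpy as np

# Minimal sketch of a Brunswikian lens-model summary, assuming per-utterance
# measures of (a) the expressed arousal, (b) a distal acoustic cue (e.g.,
# mean F0), (c) the listener's proximal rating of that cue (perceived
# pitch), and (d) the listener's arousal attribution.

def corr(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def lens_model_summary(expressed, distal_cue, proximal_cue, attributed):
    return {
        # How well the acoustic cue indexes the expressed state (encoding).
        "ecological_validity": corr(expressed, distal_cue),
        # How faithfully the cue survives transmission and perception.
        "cue_transmission": corr(distal_cue, proximal_cue),
        # How strongly listeners rely on the perceived cue (decoding).
        "cue_utilization": corr(proximal_cue, attributed),
        # Overall communicative achievement.
        "achievement": corr(expressed, attributed),
    }

# Hypothetical data for six utterances (values are invented for illustration).
expressed  = np.array([0.2, 0.4, 0.5, 0.7, 0.8, 0.9])   # arousal encoded
mean_f0    = np.array([110, 130, 150, 180, 200, 230])    # distal cue (Hz)
pitch_rate = np.array([1.5, 2.0, 2.8, 3.5, 4.0, 4.6])    # proximal rating
judged     = np.array([0.3, 0.3, 0.6, 0.6, 0.9, 0.8])    # attributed arousal
print(lens_model_summary(expressed, mean_f0, pitch_rate, judged))
```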
What is the role of intonation in vocal affect communication?

First, we propose defining different terms related to intonation to clarify the concepts. We suggest that the term "prosody" refers to all suprasegmental changes in the course of a spoken utterance: intonation, amplitude envelope, tempo, rhythm, and voice quality. We use the term "intonation" for the contour of F0 over the utterance. The amplitude envelope is determined by the contour of acoustic energy variations over the utterance and is usually correlated with F0. Tempo is the number of phonemic segments per time unit. Rhythm corresponds to the structure of F0 accents, amplitude peaks, and the distribution of pauses in the utterance. Finally, voice quality is defined by the distribution of energy in the spectrum as produced by different phonation modes. Modifications of vocal tract characteristics, for example overall
muscle tension, influence not only the source (the vocal folds) but also the resonances produced by changes in vocal tract shape. Spectral energy distribution is influenced by the vocal tract of the speaker and contributes to the possibility of identifying an individual by the sound of the voice. Recently, Ghazanfar (2005) showed that monkeys can infer the body size of a conspecific from its vocal production. Spectral energy distribution is also modified by emotional processes, with the relative amount of energy in high- and low-frequency bands changing for different emotions. For example, voice quality can change in terms of roughness or sharpness during an emotional episode, information that listeners can use to infer the emotional state of the interlocutor. It is very important to take these aspects into account when we want to understand the organism's ability to attribute emotional states to conspecifics and even to other species. Different features of prosody might be coded differently in terms of the distinction made above between continuous and discrete types of coding. Thus, Scherer et al. (1984) showed that F0 is coded continuously, whereas many intonation contour shapes are coded configurationally. In the left panel of Fig. 3, the pitch range of an utterance is continuously varied using copy synthesis (following the covariation principle); with wider pitch range, there is a continuous increase in the amount of emotional content perceived in the utterance. In the right panel of
Fig. 3. Effects of F0 range manipulation (covariation principle, left panel) and different contour shapes (configuration principle) for ‘‘Wh’’ and ‘‘Y/N’’ questions (right panel) on the perception of emotion.
Fig. 3, configuration effects are demonstrated: whether the utterance is perceived as challenging or not depends on whether a final fall or a final rise is used. This also depends on the context: a "Wh" question ("Where are you going?") pronounced with a final fall is not particularly challenging, whereas the same question with a final rise is. The effect thus depends on configuration features rather than on continuous covariation. Several distinctions are crucial and should be addressed, or at least taken into account, in future research on vocal emotional communication. The distinction between coding based on an ancient affect signalling system and coding based on language should be systematically taken into account, particularly with respect to the distinction between covariation and configuration principles (see also Scherer, 2003). Because different coding principles may underlie different features of prosody, we also need to distinguish very carefully between intonation proper, as carried by F0, and tempo, amplitude, and voice quality; all of these aspects interact, but it is possible to pull them apart. We also need to distinguish between push and pull effects; in other words, to ask to what extent vocal effects are produced by physiological push versus produced essentially by the speaker modifying the command structure for the vocalizations on the basis of templates given by linguistic or cultural schemata (pull). Finally, we need to distinguish much more carefully between encoding and decoding mechanisms: how the sender encodes both push and pull effects for a certain affect, and what processes are used to decode a production influenced by emotional processes. The ideal way to understand these two processes is to study them jointly, as suggested by the Brunswikian lens model (Brunswik, 1956). In the following sections, we present three different approaches to studying and understanding the encoding and decoding processes at different levels, using the conceptual distinctions explained above. The first approach focuses on the question of the specificity of intonation contours for different emotions. The second exemplifies a study addressing the decoding
processes at the level of the central nervous system (CNS) using electroencephalography (EEG). Finally, the relationship between emotion and attention in prosodic decoding is addressed in a study using functional magnetic resonance imagery (fMRI).
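The prosodic descriptors defined above (F0 contour, amplitude envelope, a tempo proxy, and spectral energy distribution) can be roughly approximated from a recording. The sketch below is a generic illustration using the librosa library, not the measurement pipeline of the studies presented in the following sections; the file path, the pitch range, the onset-based tempo proxy, and the 1 kHz split for spectral balance are assumptions.

```python
import numpy as np
import librosa

# Rough approximations of basic prosodic descriptors from an audio file.
# Generic illustration only; parameter choices are assumptions.

def prosodic_sketch(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)

    # Intonation: F0 contour (NaN where unvoiced).
    f0, voiced, _ = librosa.pyin(y, fmin=75.0, fmax=500.0, sr=sr)

    # Amplitude envelope: frame-wise RMS energy.
    rms = librosa.feature.rms(y=y)[0]

    # Tempo proxy: acoustic onsets per second (a crude syllable-rate proxy).
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    onsets_per_s = len(onsets) / (len(y) / sr)

    # Spectral balance: proportion of energy below an assumed 1 kHz split.
    S = np.abs(librosa.stft(y)) ** 2
    freqs = librosa.fft_frequencies(sr=sr)
    low_ratio = S[freqs < 1000].sum() / S.sum()

    return {
        "f0_mean_hz": float(np.nanmean(f0)),
        "f0_range_hz": float(np.nanmax(f0) - np.nanmin(f0)),
        "rms_mean": float(rms.mean()),
        "onsets_per_s": onsets_per_s,
        "low_freq_energy_ratio": float(low_ratio),
    }
```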
Are there emotion-specific intonation contours?

The question of how specifically intonation codes emotion has been addressed in a study investigating the contribution of intonation to the vocal communication of emotions (Bänziger and Scherer, 2005). Intonation is defined in this context as pitch (or F0) fluctuations over time. Other prosodic aspects such as rhythm, tempo, or loudness fluctuations were not included in this study. Along the lines of the distinction outlined by Scherer et al. (1984), authors from different research backgrounds have independently postulated that (a) specific configurations of pitch patterns (pitch contours) reflect and communicate specific emotional states (e.g., Fonagy and Magdics, 1963) and (b) continuous variations of pitch features (such as pitch level or pitch range) reflect and communicate features of emotional reactions, such as emotional arousal (e.g., Pakosz, 1983). Evidence supporting the first claim (the existence of emotion-specific pitch contours) consists mostly of selected examples rather than of empirical examination of emotional speech recordings. On the other hand, efforts to describe and analyze the intonation of actual emotional expressions have been limited by the use of simplified descriptors, such as measures of overall pitch level, pitch range, or overall rise/fall of pitch contours. This line of research established that a number of acoustic features, such as F0 mean or range, intensity mean or range, and speech rate, vary continuously with emotional arousal (for a review, see Scherer, 2003). It is far less clear to what extent specific F0 contours can be associated with different emotions, especially independently of linguistic content. To examine this issue, quantifiable and comparable descriptions of F0 contours are needed. The study we describe used a simple procedure to stylize F0 contours for emotional expressions.
Fig. 4. Stylization example for an instance of (low aroused) "happiness" with sequence 1 ("hät san dig prong nju ven tsi").
The stylization (see Fig. 4) was applied to 144 emotional expressions (sampled from a larger set of emotional expressions described in detail by Banse and Scherer, 1996). The expressions were produced by nine actors who pronounced two sequences of seven syllables (1. "hät san dig prong nju ven tsi"; 2. "fi gött laich jean kill gos terr") and expressed eight emotions. Two instances each of "fear," "happiness," "anger," and "sadness" were included: one with "low arousal" (labeled "anxiety," "happiness," "cold anger," and "sadness") and one with "high arousal" (labeled "panic fear," "elation," "hot anger," and "despair"). Ten key points were identified for each F0 contour. The first point ("start") corresponds to the first F0 point detected for the first voiced section in each expression. This point is measured on the syllable "hät" in sequence 1 and on the syllable "fi" in sequence 2. The second ("1min1"), third ("1max"), and fourth ("1min2") points
correspond respectively to the minimum, maximum, and minimum of the F0 excursion for the first operationally defined "accent" of each sequence. These local minima and maxima are measured on the syllables "san dig" in sequence 1 and on the syllables "gött laich" in sequence 2. Points five ("2min1"), six ("2max"), and seven ("2min2") correspond respectively to the minimum, maximum, and minimum of the F0 excursion for the second operationally defined "accent" of each sequence. They are measured on the syllables "prong nju ven" and "jean kill gos." Points eight ("3min"), nine ("3max"), and ten ("final") correspond to the final "accent" of each sequence: the local minimum, maximum, and minimum for the syllables "tsi" and "terr." Fig. 4 shows an illustration of this stylization for a happy expression (first utterance). The pattern represented in Fig. 4, two "accents" (sequences of local F0 min1-max-min2) followed by a final fall, was the most frequent
pattern for the 144 expressions submitted to this analysis. The count of F0 "rises" (local "min1" followed by "max"), "falls" (local "max" followed by "min2"), and "accents" ("min1" followed by "max" followed by "min2") for the first accented part, the second accented part, and the final syllable was not affected by the expressed emotions, but it varied across speakers and across the two sequences of syllables that they pronounced. In order to control for differences in F0 level between speakers, a "baseline" value had to be defined for each speaker: an average F0 value was computed from 112 emotional expressions (including the 16 expressions used in this study) produced by each speaker. Fig. 5 shows the differences in hertz (averaged across speakers and sequences of syllables) between the observed F0 points in each expression and the speaker baseline value for each expressed emotion. F0 level is clearly affected by emotional arousal: the F0 points for emotions with low arousal (such as sadness, happiness, and anxiety) are generally lower than the F0 points for emotions with high arousal (despair, elation, panic fear, and hot anger). The description of the different points in the contour does not appear to add much information beyond an overall measure of F0,
such as F0 mean. Looking at the residual variance after regressing F0 mean (computed for each expression) on the points represented in Fig. 5, there remains only a slight effect of expressed emotion on the points "2max" and "final." The second maximum tends to be higher for recordings expressing elation, hot anger, and cold anger than for recordings expressing other emotions, and the final F0 value tends to be relatively lower for hot anger and cold anger than for other emotions. Fig. 5 further shows that the range of F0 fluctuations is also affected by emotional arousal: F0 range (expressed on a linear scale) is on average larger for portrayals with high arousal (and high F0 level) than for portrayals with low arousal (and low F0 level). It is likely that both the level and the range of F0 are enhanced in high-arousal portrayals as a consequence of increased vocal effort. The results reported above show that, in this study, only the overall level of the F0 contours was affected by the expressed emotions and determined the emotion inferences of the judges in a powerful and statistically significant fashion. As could be expected from frequently replicated results in the literature, the height of F0 is likely to be reliably interpreted as indicative of differential activation
Fig. 5. Average F0 values by portrayed emotion. Note: The number of observations varies from 18 (for ‘‘start’’ with hot anger, cold anger, and elation; for ‘‘1max’’ with cold anger and panic fear) to 7 (for ‘‘final’’ with sadness). It should be noted also that there is a sizable amount of variance around the average values shown for all measurement points.
or arousal. These results do not encourage the notion that there are emotion-specific intonation contours. However, some of the detailed results suggest that aspects of contour shape (such as the height of selected accents and the final F0 movement) may well differentially affect emotion inferences. Even so, it seems unlikely that such features have a discrete, iconic meaning with respect to emotional content. It seems reasonable to assume that, although the communicative value of F0 level follows a covariation model, the interpretation of various features of F0 contour shape is best described by a configuration model. Concretely, contour shape, or certain central features thereof, may acquire emotional meaning only in specific linguistic and pragmalinguistic contexts (including phonetic, syntactic, and semantic features, as well as normative expectations). Furthermore, the role of F0 contour may vary depending on the complexity of the respective emotion and its dependence on a sociocultural context. Thus, one would expect covariation effects for simple, universally shared emotions that are closely tied to biological needs, and configuration effects for complex emotions and affective attitudes that are determined by socioculturally variable values and symbolic meaning. To summarize, the results indicate that there are no specific contours or shapes related to the emotions studied (Bänziger and Scherer, 2005). There are strong differences in F0 level between the different emotions, related to the underlying arousal intensity, which is well known from past research (see Fig. 5). However, this study provides the first evidence that the accent structure, even with meaningless speech, a sizable corpus, and nine different actors, is very similar across emotions. The only isolated effect is related to elation, with a disproportionate rise on the second accent (see Fig. 5). These results indicate that the accent structure alone does not show much specificity, apart from slight effects on the height of the secondary accent. There could still be emotion-specific effects in time-critical aspects of the contour, such as lengthening or shortening of particular segments or the acceleration of upswings and downswings. To study these aspects systematically, we need further research using synthetic variation of the
relevant parameters. Preliminary work using this technique has shown the importance of voice quality in decoding intonation (Bänziger et al., 2004).
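A minimal sketch of the kind of key-point stylization described above is given below: a start value, a local minimum/maximum/minimum triplet within each operationally defined "accent" region, and a final point. The accent boundaries are assumed to be supplied (e.g., from the known syllable positions); the published operational definition of an accent and the exact labeling of the ten points are not reproduced here, so this is an approximation rather than the procedure used in the study.

```python
import numpy as np

# Approximate key-point stylization of an F0 contour, assuming accent
# regions are given. Labels follow the start / min1-max-min2 / final scheme
# only loosely; this is an illustrative sketch, not the original procedure.

def stylize_f0(f0, accent_regions):
    """f0: 1-D array of F0 values (Hz), NaN for unvoiced frames.
    accent_regions: list of (start_idx, end_idx) index pairs.
    Returns a dict of key points.
    """
    voiced = np.where(~np.isnan(f0))[0]
    points = {"start": float(f0[voiced[0]])}
    for i, (a, b) in enumerate(accent_regions, start=1):
        seg = f0[a:b]
        idx = int(np.nanargmax(seg))
        points[f"{i}max"] = float(seg[idx])
        points[f"{i}min1"] = float(np.nanmin(seg[: idx + 1]))
        points[f"{i}min2"] = float(np.nanmin(seg[idx:]))
    points["final"] = float(f0[voiced[-1]])
    return points
```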
The neural dynamics of intonation perception

The second study, which addresses the time course of decoding emotional prosody, was conducted by Grandjean et al. (in preparation). The main goal of this experiment was to investigate the timing of the perception of emotional prosody, linguistic pragmatic accents, and phonemic identification. Using EEG and spatio-temporal analyses of the electrical brain signals, Grandjean et al. (2002) showed that different patterns of activation are related to specific decoding processes in emotional prosody identification compared with pragmatic and phonemic identification. Three simple French words were used ("ballon," "talon," and "vallon"), with the F0 contour systematically manipulated using MBROLA synthesis (Dutoit et al., 1996) to produce happy, sad, and neutral emotional expressions, as well as affirmative and interrogative utterance types (see Fig. 6). During EEG recording, the participants had to identify emotional prosody, linguistic prosody, and phonemic differences in three counterbalanced blocks. The time windows in which specific topographical brain maps occurred, obtained by cluster analyses (Lehmann, 1987; see also Michel et al., 2004) of the grand-average event-related potentials (ERPs), differed for the three recognition tasks. The results highlight specific processes related to emotional and semantic prosody identification compared with phonemic identification (see Fig. 7). Specifically, the first three ERP electrical brain maps (the C1, C2, and C3 maps in Fig. 7) are common to the different experimental conditions. Between approximately 250-300 and 400 ms, specific processes occurred for emotional prosodic identification and semantic linguistic identification, demonstrating the involvement of different underlying neural networks subserving these mental processes. In fact, the statistical analyses show specificity of the maps for both the emotional prosody and the linguistic pragmatic conditions,
Fig. 6. Examples of pitch analyses and sonograms for the French utterance ‘‘ballon’’ for the different experimental conditions used in the EEG study.
Fig. 7. Occurrence of the different topographical brain maps over time (1000 ms after the onset of the stimulus) for the three experimental conditions (semantic and emotional prosody, and phonemic identifications). The different colors correspond to the different brain electrical maps obtained from the grand average ERPs. The maps are represented on the global field power (GFP). Note the specific map related to emotional identification characterized by a right anterior positivity (E map) and the specific map for the semantic prosody condition with a large central negativity (S map).
when compared with the two other conditions, respectively (Grandjean et al., 2002). These results indicate that specific neural circuits are involved in the recognition of emotional prosody compared with linguistic and phonemic identification tasks. A right anterior positivity was measured on the scalp; this result is compatible with a previous fMRI study demonstrating anterior activations in right dorsolateral and orbitofrontal regions during emotional identification compared with phonetic identification of the same stimuli (Wildgruber et al., 2005). The involvement of left frontal regions was highlighted in another fMRI study in which participants had to identify linguistic information as opposed to emotional prosodic information (Wildgruber et al., 2004). However, the temporal regions of the two hemispheres are differentially involved in different subprocesses that contribute to the recognition of the emotional content of a word or a sentence (see Schirmer and Kotz, 2006). For instance, different brain networks process temporal versus spectral information, in the left and right temporal regions of the brain, respectively (Zatorre and Belin, 2001). The specific electrical map related to the recognition of emotional prosody in this EEG experiment cannot be explained solely by the fact that the intonation contour was modified, because different F0 contours were also used in the linguistic pragmatic condition (interrogative and affirmative contours). Moreover, the same stimuli were used in the phonemic identification condition, demonstrating that this specific emotional prosody map is not related to differences in basic acoustic features but rather to the nature of the participant's recognition task. This study underlines the possibility of using speech synthesis to systematically modify acoustic features of emotional prosody, inducing different types of categorization processes depending on the participant's task. In the future, this type of paradigm could allow researchers interested in the perception of emotional prosody to study the integration of the different subprocesses contributing to the subjective perception of intonation. Further studies are needed to systematically manipulate the different acoustic dimensions
involved in different functions at the prosodic level, using vocal synthesis. In contrast to fMRI techniques, EEG methods allow the study not only of the interactions of different brain areas in prosodic perception, but also of the timing of the processes carried out by the brain structures involved in prosody perception.
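Two quantities used in this kind of spatio-temporal ERP analysis can be illustrated concretely: the global field power (GFP), on which the map occurrences in Fig. 7 are plotted, and a simple spatial similarity between two scalp maps. The sketch below assumes grand-average ERP arrays are already available and uses simulated data; it shows only these building blocks, not the map clustering procedure of Lehmann (1987) itself.

```python
import numpy as np

# GFP: standard deviation across electrodes at each time point of an
# (average-referenced) grand-average ERP. Map similarity: spatial
# correlation between two scalp topographies. Illustration only.

def global_field_power(erp):
    """erp: array of shape (n_electrodes, n_timepoints)."""
    return erp.std(axis=0)

def map_similarity(map_a, map_b):
    """Spatial correlation between two scalp maps (one value per electrode)."""
    a = map_a - map_a.mean()
    b = map_b - map_b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example with simulated grand averages: 64 electrodes, 512 time points.
rng = np.random.default_rng(0)
erp_emotional = rng.standard_normal((64, 512))
erp_phonemic = rng.standard_normal((64, 512))
gfp = global_field_power(erp_emotional)
t = 300  # e.g., a sample within the 250-400 ms window discussed above
sim = map_similarity(erp_emotional[:, t], erp_phonemic[:, t])
```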
Impact of attention on decoding of emotional prosody

Another fundamental issue concerns the ability of humans to rely on emotion to adapt to or cope with particularly relevant events or stimuli (Sander et al., 2003, 2005). Emotional prosody often serves the function of social communication and indeed has an impact at the behavioral level that is related to individual differences; for example, attention toward angry voices increases neuronal activity in orbitofrontal regions as a function of interindividual differences in sensitivity to punishment (Sander et al., 2005). To detect emotional signals in the environment, which are potentially relevant for survival, the CNS seems able to reorient attention via reflexive mechanisms, even when voluntary attention is occupied with another task (Vuilleumier, 2005). This reflexive process has been extensively studied in the visual domain (Vuilleumier et al., 2001; Pourtois et al., 2004) but only rarely in the auditory domain (Mitchell et al., 2003; Wambacq et al., 2004). The last research example in this chapter addresses this question through a brain imaging study comparing angry with neutral prosody in a dichotic listening paradigm (see Fig. 8). In two experiments using fMRI, Grandjean et al. (2005) demonstrated an increase of neuronal activity in the bilateral superior temporal sulcus (STS), known to be sensitive to human voices (Belin et al., 2000), during exposure to angry compared with neutral prosody (controlling for signal amplitude level and envelope, as well as F0 level; see Fig. 8). This increase of STS activity occurred even when the angry prosody was not the focus of attention in the dichotic listening paradigm, indicating possible reflexive mechanisms related to
Fig. 8. Attention and emotional prosody were manipulated in two fMRI experiments. (a) Dichotic listening paradigm allowing the presentation of different vocalizations to the left and right ears and the manipulation of spatial attention toward the right or the left side. (b) Cerebral activations of the right hemisphere. An increase of neuronal activity for angry relative to neutral speech prosody was found in the right STS (red, P<0.001). An anterior region of the right STS was modulated by spatial attention directed to the left relative to the right ear (green, P<0.005). These modulations of activation by emotion and attention occurred within voice-selective areas (blue line). (c) Right STS activation in Experiment 1. Blood oxygen level-dependent (BOLD) responses were increased for angry compared with neutral speech. (d) The same cluster in the right STS in Experiment 2. Activation occurred only in response to vocal stimuli and not to synthetic sounds.
anger prosody. Thus, emotional prosody seems able to induce an increase of neuronal activity even when the emotionally relevant event is not the focus of voluntary attention, as previously demonstrated in the visual domain (for a review, see Vuilleumier, 2005). The human auditory system, interacting with other brain areas such as the amygdala, would thus be able to allocate attentional resources to compute the relevant information, modifying attention allocation (Sander et al., 2005).

Conclusion

The research examples described above highlight the importance of investigating emotional prosody
by taking into account the specificity of acoustic signals not only in perception, but also in production. Moreover, the different subprocesses involved in these two mechanisms, such as the temporal unfolding of decoding or the differential contributions of push and pull effects during encoding, should be manipulated systematically in future research to allow better differentiation of the determinants of these phenomena. Future research in this field should systematically manipulate different acoustic parameters (i.e., the acoustic features underlying prosody proper as well as voice quality in a more general sense). In addition, the current research shows the utility of combining behavioral and neuroscience
research in trying to disentangle the complex structure of intonation as a dually coded system, subject to psychobiological push and sociocultural pull.

Abbreviations

CNS: central nervous system
EEG: electroencephalography
ERPs: event-related potentials
F0: fundamental frequency
fMRI: functional magnetic resonance imagery
max: maximum
min: minimum
ms: milliseconds
STS: superior temporal sulcus
References

Banse, R. and Scherer, K.R. (1996) Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol., 70: 614–636.
Bänziger, T. and Scherer, K.R. (2005) The role of intonation in emotional expressions. Speech Commun., 46: 252–267.
Belin, P., Zatorre, R.J., Lafaille, P., Ahad, P. and Pike, B. (2000) Voice-selective areas in human auditory cortex. Nature, 403: 309–312.
Brunswik, E. (1956) Perception and the Representative Design of Psychological Experiments (2nd ed.). University of California Press, Berkeley, CA.
Darwin, C. (1998) The Expression of the Emotions in Man and Animals. John Murray, London. (Reprinted with introduction, afterword, and commentary by P. Ekman, Ed., Oxford University Press, New York. Original work published 1872.)
Dutoit, T., Pagel, V., Pierret, N., Bataille, F. and Van Der Vrecken, O. (1996) The MBROLA project: towards a set of high-quality speech synthesizers free of use for non-commercial purposes. Proc. ICSLP'96, 3: 1393–1396.
Fonagy, I. and Magdics, K. (1963) Emotional patterns in intonation and music. Z. Phonet., 16: 293–326.
Ghazanfar, A.A. (2005) The evolution of speech reading. International Conference on Cognitive Neuroscience 9, Cuba.
Grandjean, D., Ducommun, C. and Scherer, K.R. (in preparation) Neural signatures of processing emotional and linguistic-pragmatic prosody compared to phonetic-semantic word identification: an ERP study.
Grandjean, D., Ducommun, C., Bernard, P.-J. and Scherer, K.R. (2002) Comparison of cerebral activation patterns in identifying affective prosody, semantic prosody, and phoneme differences. Poster presented at the International Organization of Psychophysiology (IOP), August 2002, Montreal, Canada.
Grandjean, D., Sander, D., Pourtois, G., Schwartz, S., Seghier, M., Scherer, K.R. and Vuilleumier, P. (2005) The voices of wrath: brain responses to angry prosody in meaningless speech. Nat. Neurosci., 8: 145–146.
Ishii, K., Reyes, J.A. and Kitayama, S. (2003) Spontaneous attention to word content versus emotional tone: differences among three cultures. Psychol. Sci., 14: 39–46.
Lehmann, D. (1987) Principles of spatial analyses. In: Gevins, A.S. and Rémond, A. (Eds.), Handbook of Electroencephalography and Clinical Neurophysiology, Vol. 1. Methods of Analyses of Brain Electrical and Magnetic Signals. Elsevier, Amsterdam, pp. 309–354.
Michel, C.M., Murray, M.M., Lantz, G., Gonzalez, S., Spinelli, L. and Grave de Peralta, R. (2004) EEG source imaging. Clin. Neurophysiol., 115: 2195–2222.
Mitchell, R.L., Elliott, R., Barry, M., Cruttenden, A. and Woodruff, P.W. (2003) The neural response to emotional prosody, as revealed by functional magnetic resonance imaging. Neuropsychologia, 41: 1410–1421.
Morton, E.S. (1982) Grading, discreteness, redundancy, and motivation-structural rules. In: Kroodsma, D.E., Miller, E.H. and Ouellet, H. (Eds.), Acoustic Communication in Birds. Academic Press, New York, pp. 182–212.
Pakosz, M. (1983) Attitudinal judgments in intonation: some evidence for a theory. J. Psycholinguist. Res., 12: 311–326.
Pourtois, G., Grandjean, D., Sander, D. and Vuilleumier, P. (2004) Electrophysiological correlates of rapid spatial orienting towards fearful faces. Cereb. Cortex, 14: 619–633.
Salzmann, Z. (1993) Language, Culture and Society: An Introduction to Linguistic Anthropology (3rd ed.). Westview Press, Boulder, CO.
Sander, D., Grafman, J. and Zalla, T. (2003) The human amygdala: an evolved system for relevance detection. Rev. Neurosci., 14: 303–316.
Sander, D., Grandjean, D., Pourtois, G., Schwartz, S., Seghier, M., Scherer, K.R. and Vuilleumier, P. (2005) Emotion and attention interactions in social cognition: brain regions involved in processing anger prosody. Neuroimage, 28: 848–858.
Sander, D., Grandjean, D. and Scherer, K.R. (2005) A systems approach to appraisal mechanisms in emotion. Neural Networks, 18: 317–352.
Scherer, K.R. (1985) Vocal affect signalling: a comparative approach. In: Rosenblatt, J., Beer, C., Busnel, M.-C. and Slater, P.J.B. (Eds.), Advances in the Study of Behavior, Vol. 15. Academic Press, New York, pp. 189–244.
Scherer, K.R. (1986) Vocal affect expression: a review and a model for future research. Psychol. Bull., 99: 143–165.
Scherer, K.R. (2001) Appraisal considered as a process of multi-level sequential checking. In: Scherer, K.R., Schorr, A. and Johnstone, T. (Eds.), Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press, New York and Oxford, pp. 92–120.
Scherer, K.R. (2003) Vocal communication of emotion: a review of research paradigms. Speech Commun., 40: 227–256.
Scherer, K.R., Johnstone, T. and Klasmeyer, G. (2003) Vocal expression of emotion. In: Davidson, R.J., Scherer, K.R. and Goldsmith, H. (Eds.), Handbook of the Affective Sciences. Oxford University Press, New York and Oxford, pp. 433–456.
Scherer, K.R., Ladd, D.R. and Silverman, K.E.A. (1984) Vocal cues to speaker affect: testing two models. J. Acoust. Soc. Am., 76: 1346–1356.
Schirmer, A. and Kotz, S.A. (2006) Beyond the right hemisphere: brain mechanisms mediating vocal emotional processing. Trends Cogn. Sci., 10: 24–30.
Vuilleumier, P. (2005) How brains beware: neural mechanisms of emotional attention. Trends Cogn. Sci., 9: 585–594.
Vuilleumier, P., Armony, J.L., Driver, J. and Dolan, R.J. (2001) Effects of attention and emotion on face processing in the human brain: an event-related fMRI study. Neuron, 30: 829–841.
Wambacq, I.J., Shea-Miller, K.J. and Abubakr, A. (2004) Non-voluntary and voluntary processing of emotional prosody: an event-related potentials study. Neuroreport, 15: 555–559.
Wildgruber, D., Hertrich, I., Riecker, A., Erb, M., Anders, S., Grodd, W. and Ackermann, H. (2004) Distinct frontal regions subserve evaluation of linguistic and emotional aspects of speech intonation. Cereb. Cortex, 14: 1384–1389.
Wildgruber, D., Riecker, A., Hertrich, I., Erb, M., Grodd, W., Ethofer, T. and Ackermann, H. (2005) Identification of emotional intonation evaluated by fMRI. Neuroimage, 15: 1233–1241.
Wundt, W. (1900) Völkerpsychologie. Eine Untersuchung der Entwicklungsgesetze von Sprache, Mythos und Sitte. Band I. Die Sprache. [Cultural Psychology: A Study of the Developmental Laws of Language, Myth, and Customs. Vol. 1. Language]. Kröner, Leipzig.
Zatorre, R.J. and Belin, P. (2001) Spectral and temporal processing in human auditory cortex. Cereb. Cortex, 11: 946–953.