Distinct functional substrates along the right superior temporal sulcus for the processing of voices

Katharina V. Kriegstein and Anne-Lise Giraud*

Cognitive Neurology Unit, Department of Neurology, J.W. Goethe University, Frankfurt am Main, Germany

Received 19 November 2003; revised 27 January 2004; accepted 11 February 2004. Available online 27 April 2004.

Abstract

The right superior temporal sulcus (STS) is involved in processing the human voice. In this paper, we report fMRI findings showing that segregated cortical regions along the STS are involved in distinct aspects of voice processing and that they functionally cooperate during speaker recognition. Subjects listened to identical sets of auditory sentences while recognizing either a target sentence irrespective of the speaking voice or a target voice irrespective of the sentence meaning. As the same stimulus material was used in both conditions, task-related activations were not confounded by differences in speech acoustic features. Half of the stimuli were voices of familiar persons and half of persons never encountered before. Recognizing voices activated the right anterior and posterior STS more than recognizing verbal content. While the right anterior STS responded equally to both voice categories, the right posterior STS displayed stronger responses to non-familiar than to familiar speakers' voices. It also responded to our baseline condition of amplitude-modulated noises, which required a detailed analysis of complex temporal patterns. Analyses of connectivity (psychophysiological interactions) revealed that during speaker recognition both the anterior and posterior right STS interacted with a region in the mid/anterior part of the right STS, a region that has been implicated in processing the acoustic properties of voices. Moreover, the anterior and posterior STS displayed distinct connectivity patterns depending on familiarity. Our results thus distinguish three STS regions that process different properties of voices and interact in a specific manner depending on familiarity with the speaker.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Sulcus; Substrates; Voices
Abbreviations: fMRI, functional magnetic resonance imaging; STS, superior temporal sulcus; vfp, recognition of familiar persons' voices; vnp, recognition of non-familiar persons' voices; cfp, recognition of verbal content in sentences spoken by familiar people; cnp, recognition of verbal content in sentences spoken by non-familiar people; SEN, speech envelope noises; F, familiar; N, non-familiar.

*Corresponding author. Cognitive Neurology Unit, Department of Neurology, J.W. Goethe University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany. Fax: +49-69-6301-6842. E-mail address: [email protected] (A.-L. Giraud).

doi:10.1016/j.neuroimage.2004.02.020
Introduction

Human speech is the primary vehicle for language, but it also conveys meaningful non-linguistic information that plays an important role in communication, for example about the speaker's gender, identity, and emotional state. Dedicated neural territories that respond selectively to voices, more than to other natural sounds, are located along both superior temporal sulci (STS) (Belin et al., 2000). Although these regions appear selective for voices as acoustic stimuli, they generally respond more strongly to speaking voices than to non-verbal vocalizations (Belin et al., 2000). This enhanced sensitivity to verbal stimuli suggests that there is no strict functional selectivity for non-linguistic features of speech. However, heterogeneous activations along the STS were observed in response to non-verbal vocalizations compared with acoustically matched control stimuli (Belin et al., 2002). Furthermore, the anterior STS has been shown to be specifically sensitive to speaker recognition (Belin and Zatorre, 2003; Von Kriegstein et al., 2003). Hence, the STS could house several distinct functional areas that serve different aspects of voice processing.

In two previous fMRI studies (Giraud et al., in press; Von Kriegstein et al., 2003), we addressed voice representation and processing in the human brain using bottom-up (stimulus-driven) and top-down (task-related) approaches. We sought to identify distinct neural substrates for different components of voice processing by precisely controlling for linguistic and acoustic processing, and to determine whether, depending on the task, areas respond more strongly to non-linguistic than to linguistic features of voices.

In the first study (Giraud et al., in press), we contrasted tasks in which subjects listened either to natural speech or to speech envelope noises (SEN). SEN are white noises shaped with the temporal envelope of speech (Shannon et al., 1995). If their temporal structure is sufficiently detailed, that is, with envelope modulations preserved up to 10 Hz and above, they can be understood while being void of spectral vocal patterns. In this case, SEN convey the same linguistic meaning as natural speech and thus retain the critical acoustic properties needed for comprehension. In contrast to natural speech, however, SEN lack information about the speaker's voice. Comparing tasks involving natural speech with SEN of the same linguistic content thus specifically targets the acoustic features of voices while avoiding confounds from semantic and phonological speech properties. Vocal acoustic features activated the middle/anterior STS bilaterally with
strong right predominance. This activation partly overlapped with the voice-responsive regions observed by Belin et al. (2000).

In a second fMRI experiment (Von Kriegstein et al., 2003), we sought to control for the acoustic aspects of voice processing by investigating task-related rather than stimulus-driven responses. We compared conditions with identical speech material while subjects performed a recognition task focusing either on the linguistic semantic content of speech, that is, the meaning of the sentence, or on the speaker's voice. We found that the right anterior superior temporal sulcus responded during voice processing but not during processing of semantic content.

The present fMRI study was conducted to further investigate the neural processes related to speaker recognition. Our experiment involved the same linguistic material and tasks as the previously described experiment, but the speakers were either familiar or non-familiar to the subjects. Familiar speakers were personal acquaintances of the subjects, while non-familiar speakers had never been encountered before the experiment. As a control for complex temporal sounds, we additionally used SEN void of linguistic
information (with a temporal envelope cut-off at 2 Hz) derived from the sentences. We expected that recognizing voices of familiar speakers would rely on the co-activation of acoustic voice areas and of the cortical network involved in person identity retrieval and autobiographical memory (Cabeza and Nyberg, 2000; Gorno-Tempini and Price, 2001; Gorno-Tempini et al., 1998; Leveroni et al., 2000; Nakamura et al., 2000; Shah et al., 2001), while recognizing non-familiar speakers' voices would rely essentially on the acoustic stages of voice processing, which emphasize detailed analysis of vocal spectro-temporal patterns.
Methods

Subjects

Nine right-handed subjects (4 females, 5 males; 27–36 years) without audiological or neurological pathology participated in the study. Written informed consent was obtained from all participants.
Table 1. Local response maxima in statistical parametric maps for the direct contrast of task (voice recognition > verbal content recognition; P < 0.05, corrected, masked by voice > sen) and for second-level analyses probing non-familiar voices ((vnp − cnp) − (vfp − cfp)) and familiar voices ((vfp − cfp) − (vnp − cnp)) at P < 0.001, uncorrected. For each contrast, the table lists the Talairach coordinates (x, y, z, in mm), Z score, and cluster size (cl.) of the local maxima in the following regions: lateral temporal (anterior STS/STG right and left, middle STS left, posterior STS right and left); frontal (dorsolateral prefrontal right and left, orbito-frontal right and left, anterior cingulate, superior frontal right and left, precentral right); occipital (middle left); inferior/medial temporal (fusiform left, fusiform/parahippocampal right, amygdala right); parietal (angular right, superior parietal right, precuneus/cingulate bilateral). (The numerical table body could not be recovered intact from the source layout.) vfp, recognition of voices (familiar person); vnp, recognition of voices (non-familiar person); cfp, recognition of verbal content (familiar person); cnp, recognition of verbal content (non-familiar person); x, medial–lateral axis; y, anterior–posterior axis; z, dorsal–ventral axis; Z, level of significance.
Protocol and data acquisition

Functional imaging was performed on a 1.5-T magnetic resonance scanner (Siemens Vision, Erlangen, Germany) with a standard head coil and gradient booster. We used echo-planar imaging to obtain image volumes of 24 contiguous oblique transverse slices every 2.7 s (voxel size 3.44 × 3.44 × 4 mm³, 1 mm gap, TE 60 ms), covering the whole brain. We acquired 494 volumes per session and ran two sessions for each subject (988 volumes per subject).

Data processing

SPM99 (http://www.fil.ion.ucl.ac.uk/spm/) was used to perform standard spatial pre-processing (realignment, normalization, and smoothing with a 10-mm Gaussian kernel for group analysis) and statistical block-design analyses comparing the modeled conditions.

Stimulus preparation and delivery

Vocal stimuli were recorded (32 kHz sampling rate, 16-bit resolution), adjusted to the same overall sound pressure level, and processed using CoolEdit 2000 (Syntrillium Software Corporation) and Soundprobe (HiSoft Corporation). Simple speech envelope noises were derived from the vocal stimuli with a cut-off frequency of 2 Hz using a MATLAB-based program (Apoux et al., 2001; Lorenzi et al., 1999; see the sketch after this section). All acoustic stimuli were presented binaurally through a commercially available system (mr-confon, Magdeburg, Germany) and calibrated to produce an average signal-to-scanner-noise ratio of 20 dB.

Experimental design

We used 47 German sentences spoken by 14 non-familiar (6 females/8 males) and 14 familiar speakers (5 females/9 males). Familiar speakers (and participants) were members of the clinical staff of the local neurology department. Non-familiar speakers were unknown to the participants before the experiment, by voice, face, or name. Stimuli in the control conditions were simple speech envelope noises (cut-off 2 Hz) derived from the sentences. Subjects were familiarized with all voices and sentences (each voice was presented three times saying different sentences) as well as with the speech envelope noises. Subjects were additionally trained on the target voices and target speech envelope noises before the sessions.

Each of the two sessions comprised four experimental conditions, consisting of recognizing: the target voice of a familiar person (condition 1, vfp), the target voice of a non-familiar person (condition 2, vnp), the verbal content of a target sentence spoken by a familiar person (condition 3, cfp), and the verbal content of a target sentence spoken by a non-familiar person (condition 4, cnp). Matched control conditions involved recognizing speech envelope noises by virtue of their temporal structure (condition 5, sen). The experimental conditions were split into three blocks presented in random order within and across conditions. Blocks with the control condition alternated with the experimental sentence/voice recognition conditions. Each block lasted 32 s and contained 8 items (sentences or noises), of which three were targets. Each session lasted 22 min. Targets differed across sessions. Every target was presented before the ensuing block. A target voice occurred with different sentences, and vice versa a target sentence occurred with different voices, during the block and during the target presentation. Therefore, neither the semantic content nor the voice could be used as a cue to identify the targets. Subjects were requested to judge each item by pressing one button if it was a target and another button if it was not.
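By way of illustration, the following is a minimal sketch of the kind of envelope-noise synthesis described under Stimulus preparation above. It is not the MATLAB program cited there (Apoux et al., 2001; Lorenzi et al., 1999); the file names, filter order, and normalization are assumptions made for the example.

```python
# Minimal sketch of speech-envelope-noise (SEN) synthesis, assuming a mono
# 32-kHz recording. This illustrates the principle only: extract the slow
# temporal envelope of speech and impose it on white noise, destroying all
# spectral voice cues while preserving the amplitude modulation.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def make_sen(infile, outfile, cutoff_hz=2.0):
    fs, speech = wavfile.read(infile)                 # e.g., fs = 32000
    speech = speech.astype(np.float64)
    speech /= np.max(np.abs(speech)) + 1e-12          # normalize to [-1, 1]

    # Temporal envelope: full-wave rectification followed by zero-phase
    # low-pass filtering at the chosen cut-off (2 Hz in this study).
    sos = butter(4, cutoff_hz / (fs / 2), btype="low", output="sos")
    envelope = np.clip(sosfiltfilt(sos, np.abs(speech)), 0.0, None)

    # Shape white noise with the speech envelope.
    sen = envelope * np.random.randn(len(speech))
    sen /= np.max(np.abs(sen)) + 1e-12                # match overall level

    wavfile.write(outfile, fs, (sen * 32767).astype(np.int16))

make_sen("sentence.wav", "sentence_sen.wav")          # hypothetical file names
```

With a 2-Hz cut-off, as used for the control condition here, only the coarse sentence rhythm survives and the noises are unintelligible; raising the cut-off to roughly 10 Hz and above yields the intelligible SEN described in the Introduction.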
Statistical thresholds and masking procedures

We applied thresholds corrected for multiple comparisons (P < 0.05, corrected, 3-voxel minimum) to direct contrasts (main effect of voice recognition) and uncorrected thresholds (P < 0.001, 3-voxel minimum) to second-level analyses (effect of familiarity). For visualization, masking was used to confine the activity to areas responsive to voices more than to the control speech envelope noises derived from the sentences (vfp + vnp > fsen + nsen). The direct contrast was masked at P < 0.001, corrected, as reported in the figures and tables. The second-level analysis was masked at P < 0.001, uncorrected. The unmasked results for the second-level analyses are reported in Table 1.

Functional connectivity analyses

To investigate functional connectivity, we performed psychophysiological interaction analyses (Friston et al., 1997; Gitelman et al., 2003). Functional MRI signal changes over time were extracted from a volume of interest (VOI) with a radius of 5 mm centered on the response maximum of the sampled area; the representative time course was the first eigenvariate of the data in all suprathreshold voxels within a given VOI. We multiplied these mean-corrected data (y) with a mean-corrected condition-specific regressor (r). As the regressor was extracted from the SPM design matrix, it was already convolved with the canonical HRF (Gitelman et al., 2003). To probe a general task effect, the regressor was set to +1 for the voice recognition conditions and −1 for the verbal content recognition conditions (r = vfp + vnp − cfp − cnp). The same procedure was employed to probe functional connectivity specific to familiar speakers' voice recognition (r = vfp − cfp − vnp + cnp) and to non-familiar speakers' voice recognition (r = vnp − cnp − vfp + cfp). The mean-corrected interaction regressors r·y for the general task, familiar, and non-familiar speakers' recognition were used in three separate analyses per region to test for psychophysiological interactions, that is, to identify voxels where the contribution of the sampled region changed significantly as a function of general task, familiar, or non-familiar speakers' recognition, respectively. In addition to r·y, the design matrices also contained the regressors r and y as covariates of no interest (confounds). Responses were considered significant at P < 0.05 (corrected, 3-voxel minimum) for the general task and at P < 0.001 (uncorrected, 3-voxel minimum) for the analyses probing the familiarity effect during voice recognition.
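As an illustration of the interaction-term construction just described, here is a minimal NumPy sketch. It assumes the VOI data and the HRF-convolved condition regressors are already available as arrays; the eigenvariate step, variable names, and design-matrix assembly are simplified assumptions, not the exact SPM99 pipeline of Gitelman et al. (2003).

```python
# Minimal sketch of a psychophysiological-interaction (PPI) regressor,
# assuming vfp, vnp, cfp, cnp are HRF-convolved condition regressors and
# voi is a (n_scans, n_voxels) array from the 5-mm sphere. A plain SVD
# stands in for SPM's first-eigenvariate extraction.
import numpy as np

def first_eigenvariate(voi):
    """Representative time course of a VOI: first left singular vector,
    scaled by its singular value, of the mean-centered voxel data."""
    u, s, _ = np.linalg.svd(voi - voi.mean(axis=0), full_matrices=False)
    return u[:, 0] * s[0]

def ppi_regressor(y, vfp, vnp, cfp, cnp, contrast="task"):
    """Interaction term: mean-corrected VOI signal times a mean-corrected
    condition regressor r (+1 voice, -1 content in the general-task case)."""
    if contrast == "task":          # r = (vfp + vnp) - (cfp + cnp)
        r = vfp + vnp - cfp - cnp
    elif contrast == "familiar":    # r = (vfp - cfp) - (vnp - cnp)
        r = vfp - cfp - vnp + cnp
    else:                           # non-familiar: (vnp - cnp) - (vfp - cfp)
        r = vnp - cnp - vfp + cfp
    y = y - y.mean()
    r = r - r.mean()
    return r * y, r, y              # r and y enter the GLM as confounds

# Usage with hypothetical data:
# y = first_eigenvariate(voi)
# ppi, r, y0 = ppi_regressor(y, vfp, vnp, cfp, cnp, contrast="non-familiar")
# X = np.column_stack([ppi, r, y0, np.ones_like(y0)])   # design matrix
```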
Fig. 1. Brain regions responding when performing a recognition task on the speaker's voice relative to the verbal content of sentences, on the same acoustic material, masked by voice > sen (red). In this contrast, conditions involving familiar and non-familiar voices were pooled (vfp + vnp > cfp + cnp); P < 0.05, corrected. Activation for the specific acoustic properties of voices (natural speech > equally understood SEN), taken from Giraud et al. (in press), is shown in green.
Results
Fig. 2. Second-level contrast showing brain regions (a) responding more when performing the recognition task on non-familiar than on familiar voices ((vnp > cnp) > (vfp > cfp)) and (b) responding more when performing the recognition task on familiar than on non-familiar voices ((vfp > cfp) > (vnp > cnp)). Both contrasts are masked by voice > noise. Visualization threshold: P < 0.003, uncorrected.
Subjects had to recognize a target speaker or a target verbal content in blocks of sentences spoken by different speakers. Contrasting these two experimental conditions controlled for the acoustic properties of natural speech. In a further control condition, subjects had to recognize a target temporal sequence of a speech envelope noise in a block of SEN derived from the sentences. Furthermore, the acoustic material was divided into voices of persons who were either familiar (colleagues working in the same department) or non-familiar to the participants (see Introduction and Methods).

The behavioral records indicated 98% correct responses for recognition of the verbal content, 99% for recognition of familiar voices, and 88% for recognition of non-familiar voices.
Fig. 3. Functional connectivity and stimulus-locked response functions. The anterior STS voice area functionally interacts with the mid/anterior STS (green). The posterior STS voice area functionally interacts with the mid/anterior STS (blue); the activation overlaps with the target region of the anterior STS (merge). Stimulus-locked response functions were sampled from the anterior, mid/anterior, and posterior STS. In the anterior region, responses cluster as a function of task, whereas in the posterior region familiarity also contributes to the response. The mid/anterior STS shows a task effect for non-familiar but not for familiar speakers' voices.
Fig. 4. Summary of functional connectivity when sampling the three regions of the right STS. Functional connections were assessed during recognition of voices (direct contrast, P < 0.05, corrected) and during either non-familiar or familiar speaker conditions (interaction analysis, P < 0.001, uncorrected). A—anterior STS; M/A—middle/anterior STS; P—posterior STS; MTL—medial temporal lobe; IP—inferior parietal.
Table 2. Response maxima of regions (target regions) showing functional connectivity with the anterior and posterior STS (probe regions) during voice vs. verbal content recognition (P < 0.05, corrected) and during non-familiar and familiar speakers' voice recognition (P < 0.001), respectively. For each probe region (anterior STS; posterior STS) and contrast, the table lists the Talairach coordinates (x, y, z), Z score, and cluster size (cl.) of the target-region maxima. Target regions: lateral temporal (middle STS left and right, posterior STS right and left); frontal (superior medial, dorsolateral prefrontal right, ventral prefrontal right); temporal (para-/hippocampal/amygdala); parietal (inferior right). (The numerical table body could not be recovered intact from the source layout.) For abbreviations see the legend of Table 1.
The noise task yielded 96% correct responses. Accuracy was high overall; the only significant differences were lower accuracy for recognition of non-familiar voices than for verbal content (vnp/cnp: T(18) = 3.5; P < 0.004) and than for familiar speakers' voices (vnp/vfp: T(18) = 3.825; P < 0.002); beyond these, there was no significant familiarity or task effect on accuracy. Reaction times were shorter for all conditions involving vocal stimuli than for noise recognition (sen/cfp: T(18) = 6.66, P < 0.001; sen/cnp: T(18) = 4.84, P < 0.001; sen/vfp: T(18) = 3.22, P < 0.008; sen/vnp: T(18) = 2.53, P < 0.028). They were also shorter for verbal content recognition than for voice recognition, independent of familiarity (vnp/cnp: T(18) = 3.27, P < 0.007; vfp/cfp: T(18) = 3.39, P < 0.006). There was no significant difference in reaction times depending on the familiarity of voices, in either the verbal content or the voice recognition task (cnp/cfp: T(18) = 0.65, P = 0.52; vnp/vfp: T(18) = 1.03, P = 0.32).

We first contrasted voice recognition with verbal content recognition (in each case irrespective of familiarity with the speaker) on the same acoustic material (same voices, same sentences). This comparison yielded activation along the superior temporal sulcus, predominantly on the right side (Fig. 1, Table 1). Additional activations were observed in bilateral prefrontal and parietal cortices. As the verbal task yielded shorter reaction times, increased effort to recognize voices may have contributed to these latter effects, at locations functionally compatible with an influence of attention and working memory load (Cohen et al., 1997; Zatorre et al., 1994, 1998). In the STS, activation peaked in an anterior (51, 18, 15; Z = 6.37) and a posterior part (63, 42, 6; Z = 7.47) (Fig. 1, Table 1).

Probing the specific responses to voices of non-familiar speakers by a second-level analysis ((vnp > cnp) > (vfp > cfp)) showed that the posterior STS (66, 27, 0; Z = 3.23) responded predominantly to voices of non-familiar speakers (Fig. 2, Table 1). This analysis additionally indicated that recognizing non-familiar speakers' voices (compared with recognizing both the verbal content and familiar speakers' voices, i.e., the second-level interaction contrast) recruited bilateral frontal regions, the right angular cortex, and the right amygdala (Table 1). Probing familiar speakers' recognition by the reverse contrast ((vfp > cfp) > (vnp > cnp)) showed no activation in the right STS (Fig. 2) but pointed to a network outside the voice-responsive STS areas, including bilateral frontal, parietal, and fusiform regions as well as a left middle occipital area (Table 1). As recognition rates were higher for familiar than for non-familiar speakers' voices, performance could be a confound in this contrast. Such effects are, however, expected under the hypothesis that recognition of familiar speakers' voices relies on relatively automatic mechanisms, whereas recognition of non-familiar speakers' voices relies more on the analysis of spectro-temporal patterns: recognizing non-familiar voices requires analyzing the temporal modulations of the voice (prosodic information, intonation, etc.) more than recognizing familiar voices does.

Analysis of the time courses of the anterior and posterior STS regions showed that the posterior region, although predominantly activated by non-familiar voices, responded in all tasks, including recognition of noises (see the plot of the time courses in this region in Fig. 3).
In contrast, the time courses at the anterior peak indicate selective responses during the voice recognition tasks and no activation above baseline (recognizing noises) during recognition of verbal content (see plot in Fig. 3).

We expected not only different activity changes in regions along the STS but also differences in the functional
interactions between these potentially segregated areas across the various conditions. We assessed the functional connectivity of voice-responsive regions with other brain regions in a condition-specific way (see Methods). We sampled the activity time courses at the response peaks of the two distinct voice-responsive regions, namely the anterior and posterior STS (red blobs, Fig. 3), and calculated their correlation with the activity time courses in all other voxels of the brain as a function of condition (see Methods). In doing so, we found regions whose functional coupling varied with the experimental conditions (Fig. 4, Table 2).

During voice recognition (but not verbal content recognition), both the anterior and posterior STS interacted predominantly with the right mid/anterior STS (60, 0, 15; P < 0.05, corrected). Other interacting regions lay in the left STS (at 54, 18, 9, P < 0.05, corrected, for the anterior STS sample, and 63, 15, 6, P < 0.05, corrected, for the posterior STS sample) (Figs. 3 and 4, Table 2). The activation in the right mid/anterior STS overlapped with the region found in our previous study (see Introduction; Giraud et al., in press; Fig. 1). This overlap suggests that the mid/anterior STS region targeted by our connectivity analyses responds to specific acoustic properties of voices. Further evidence for this hypothesis comes from the time courses: all conditions with voices elicited a response (Fig. 3, plot at 60, 0, 15), but the SEN, which contained no spectral information, did not.

We additionally assessed the weight of the functional interactions depending on the familiarity of voices (Fig. 4, Table 2). During non-familiar speaker recognition (condition-specific interaction ((vnp − cnp) − (vfp − cfp)), see Methods), the right posterior STS interacted with the right mid/anterior STS that responded to acoustic properties of voices in Giraud et al. (in press). It was also functionally connected with the left posterior and mid/anterior STS, the right inferior parietal cortex, and the right dorsolateral prefrontal cortex. In the same contrast, the right anterior STS also interacted with the right dorsolateral prefrontal cortex. During recognition of familiar speakers' voices (condition-specific interaction ((vfp − cfp) − (vnp − cnp)), see Methods), the anterior STS functionally interacted with the right amygdala/para-/hippocampal region and the superior medial frontal cortex.
Discussion

Preferential responses to voices have been observed in bottom-up approaches in regions along both STS, with a right-hemispheric predominance (Belin and Zatorre, 2003; Belin et al., 2000, 2002; Giraud et al., in press). The present findings confirm the implication of the STS in voice processing and further characterize distinct functional territories, and their connectivity, along the right STS. Top-down approaches, manipulating tasks instead of acoustic material, make it possible to identify the regions that process voices as a category of auditory object providing non-linguistic information about the speaker, as opposed to an acoustic family or a vehicle for language.

In a previous study (Von Kriegstein et al., 2003), we dissociated voice recognition mechanisms from the processing of the acoustic properties of voices by presenting the same auditory sentence material in two tasks involving either speaker recognition or semantic content recognition. That study showed that recognizing voices activates the anterior STS. The present study further confirms that the anterior STS responds during voice recognition but not when the task focuses on the verbal content of the stimuli. Our results are thus in agreement with the recent
demonstration by Belin and Zatorre (2003) that this region is concerned with voice identity, that is, shows adaptation to speaker identity.

Voice recognition (familiar and non-familiar) also activated the posterior STS. This region responded to all sounds with a complex temporal structure (including SEN, which had no spectral shape and no linguistic content), but its activity was significantly enhanced when subjects had to recognize voices of unknown speakers, presented to them for the first time during the experiment, compared with voices of familiar persons. This region did not appear in the study by Belin and Zatorre (2003), where short words were employed. As our stimuli were 3-s sentences and temporally matched speech envelope noises, we propose that the right posterior STS plays a role in the analysis of the temporal modulation of voices and sounds. The right posterior STS was recently implicated by Thierry et al. (2003) in the semantic analysis of sequences of environmental sounds compared with an equivalent analysis performed on semantically related sentences. The present findings thus confirm a role of the right posterior STS in the processing of non-verbal complex temporal acoustic forms.

Unlike non-familiar speakers' voices, the voices of familiar persons did not enhance responses in any region of the right STS. Instead, familiar voice recognition recruited regions classically involved in person familiarity recognition (Gorno-Tempini and Price, 2001; Gorno-Tempini et al., 1998; Leveroni et al., 2000; Nakamura et al., 2000; Shah et al., 2001) and episodic memory retrieval (Cabeza and Nyberg, 2000).

In summary, recognition of voices as non-linguistic meaningful sounds implicated functionally distinct regions of the right STS. While the response of the anterior STS was modulated by the task, and hence mostly under top-down influence, the posterior STS response additionally reflected the spectro-temporal complexity of the acoustic pattern targeted by the task. Neither region showed enhanced responses to familiar speakers' voices, which concurs with neuropsychological data suggesting that the critical areas for familiar person recognition through voices are extratemporal (possibly in the right parietal cortex; Van Lancker et al., 1988, 1989).

Distinct response patterns across right STS territories are further supported by the functional connectivity findings. The anterior and posterior STS voice recognition regions both interacted with a region that appeared sensitive to the spectral properties of voices in a previous study, in which we compared natural speech with equally meaningful (complex) speech envelope noises (Giraud et al., in press). The difference between these two conditions narrows down to the distribution of frequency content over time that is typical of voices, for example, frequency power clustered around harmonics in vowels in natural speech but not in SEN. Thus, the region in the right mid/anterior STS responds to spectral features of voices independently of the linguistic content they carry. Functional connectivity between the right posterior and mid/anterior STS was mainly modulated by the recognition of non-familiar persons' voices. During this condition, we also observed enhanced connectivity with the left-sided posterior and mid/anterior STS.
Bilateral activation during recognition of non-familiar persons' voices is in agreement with lesion studies showing a loss of non-familiar voice discrimination after damage to either the left or the right temporal lobe (Van Lancker et al., 1988). The right dorsolateral prefrontal cortex functionally interacted with the posterior and the anterior STS during recognition of non-familiar persons' voices. As this region is activated during tasks where memory judgments are
uncertain or involve strong monitoring demands (Henson et al., 1999), we attribute its activation to the difficulty of recognizing voices of non-familiar persons. During explicit recognition of familiar persons' voices, the anterior STS showed enhanced functional connectivity with right medial temporal regions, which agrees with the role of medial temporal lobe regions in episodic memory retrieval (Eldridge et al., 2000; Rugg et al., 2002; Zeineh et al., 2003).

The presence of three 'voice areas' with distinct properties and functional connectivity patterns along the rostro-caudal axis of the right temporal lobe fits well with the two-stream hypothesis of the auditory system (Kaas and Hackett, 1999; Rauschecker and Tian, 2000; Romanski et al., 1999; Zatorre et al., 2002). In analogy to the visual system, an antero-ventral pathway would subserve object recognition ("what"), while a postero-dorsal pathway would integrate acoustic signals into complex auditory scenes, that is, sound spatial location ("where", Rauschecker and Tian, 2000) and auditory spectral motion ("how", Belin and Zatorre, 2000). The properties of the three right STS areas are compatible with such a scheme: recognition of voices as a feature belonging to a specific person more strongly involves an antero-ventral STS region ("what"), while recognition of voices as target acoustic stimuli that have not been associated with other person-related features more strongly involves a postero-dorsal STS region. Furthermore, the anterior STS region interacted mostly with right-sided fronto-temporal structures, not with dorsal regions, and responded strongly in favor of speaker identification over acoustic processing. Conversely, the posterior STS region interacted mostly with dorsal regions and was influenced by the long-range temporal structure of the sounds, that is, the critical features of auditory motion (Belin and Zatorre, 2000). Response enhancement in the posterior STS to non-familiar voices was accompanied by similar effects in right parietal and prefrontal cortices, which we attribute to an increased load on auditory working memory (Cohen et al., 1997; Henson et al., 1999; Zatorre et al., 1994, 1998).

Our data do not support the proposal of Belin and Zatorre (2000) that the dorsal stream underlies verbal analysis while the ventral stream is involved in speaker identification. Rather, the speaker recognition effect in the anterior as well as the posterior region indicates that, although the posterior STS is less specifically activated by the task, both streams underlie different aspects of voice processing in the right hemisphere.

In conclusion, we have delineated three distinct areas along the right STS involved in different aspects of voice processing. These areas respond and interact differentially depending on (1) the acoustic information in the speech stimulus, (2) the specific task, and (3) familiarity with the speaker. The mid/anterior STS carries out a spectral analysis of voices. More posterior and more anterior areas emphasize voice processing over linguistic analysis of speech sounds, and both are functionally connected to the mid/anterior area during voice recognition. They furthermore show different response properties: the anterior area responds specifically to voice recognition, while the posterior area plays a less specific role in voice processing, that is, shows a sensitivity to the temporal complexity of sounds, including non-vocal and non-linguistic sounds.
While recognition of familiar voices predominantly modulates connectivity between the anterior STS and the medial temporal lobe memory system, recognizing non-familiar voices predominantly involves functional interactions between bilateral mid/anterior and posterior STS regions as well as with a fronto-parietal network.
Acknowledgments

We thank Christian Lorenzi for his help in stimulus preparation and Andreas Kleinschmidt for his helpful comments on the manuscript. ALG is funded by the BMBF (Germany) and KvK by the Volkswagen Stiftung. The sound system was funded by a BMBF grant.
References

Apoux, F., Crouzet, O., Lorenzi, C., 2001. Temporal envelope expansion of speech in noise for normal-hearing and hearing-impaired listeners: effects on identification performance and response times. Hear. Res. 153, 123–131.

Belin, P., Zatorre, R.J., 2000. 'What', 'where' and 'how' in auditory cortex. Nat. Neurosci. 3, 965–966.

Belin, P., Zatorre, R.J., 2003. Adaptation to speaker's voice in right anterior temporal lobe. NeuroReport 14, 2105–2109.

Belin, P., Zatorre, R.J., Lafaille, P., Ahad, P., Pike, B., 2000. Voice-selective areas in human auditory cortex. Nature 403, 309–312.

Belin, P., Zatorre, R.J., Ahad, P., 2002. Human temporal-lobe response to vocal sounds. Brain Res. Cogn. Brain Res. 13, 17–26.

Cabeza, R., Nyberg, L., 2000. Imaging cognition II: an empirical review of 275 PET and fMRI studies. J. Cogn. Neurosci. 12, 1–47.

Cohen, J.D., Perlstein, W.M., Braver, T.S., Nystrom, L.E., Noll, D.C., Jonides, J., Smith, E.E., 1997. Temporal dynamics of brain activation during a working memory task. Nature 386, 604–608.

Eldridge, L.L., Knowlton, B.J., Furmanski, C.S., Bookheimer, S.Y., Engel, S.A., 2000. Remembering episodes: a selective role for the hippocampus during retrieval. Nat. Neurosci. 3, 1149–1152.

Friston, K.J., Buechel, C., Fink, G.R., Morris, J., Rolls, E., Dolan, R.J., 1997. Psychophysiological and modulatory interactions in neuroimaging. NeuroImage 6, 218–229.

Giraud, A., Kell, C., Thierfelder, C., Sterzer, P., Russ, M.O., Preibisch, C., Kleinschmidt, A., 2003. Contributions of sensory input, auditory search and verbal comprehension to cortical activity during speech processing. Cereb. Cortex 14, 247–255.

Gitelman, D.R., Penny, W.D., Ashburner, J., Friston, K.J., 2003. Modeling regional and psychophysiologic interactions in fMRI: the importance of hemodynamic deconvolution. NeuroImage 19, 200–207.

Gorno-Tempini, M.L., Price, C.J., Josephs, O., Vandenberghe, R., Cappa, S.F., Kapur, N., Frackowiak, R.S., Tempini, M.L., 1998. The neural systems sustaining face and proper-name processing. Brain 121, 2103–2118.

Gorno-Tempini, M.L., Price, C.J., 2001. Identification of famous faces and buildings: a functional neuroimaging study of semantically unique items. Brain 124, 2087–2097.

Henson, R.N., Rugg, M.D., Shallice, T., Josephs, O., Dolan, R.J., 1999. Recollection and familiarity in recognition memory: an event-related functional magnetic resonance imaging study. J. Neurosci. 19, 3962–3972.
Kaas, J.H., Hackett, T.A., 1999. 'What' and 'where' processing in auditory cortex. Nat. Neurosci. 2, 1045–1047.

Leveroni, C.L., Seidenberg, M., Mayer, A.R., Mead, L.A., Binder, J.R., Rao, S.M., 2000. Neural systems underlying the recognition of familiar and newly learned faces. J. Neurosci. 20, 878–886.

Lorenzi, C., Berthommier, F., Apoux, F., Bacri, N., 1999. Effects of envelope expansion on speech recognition. Hear. Res. 136, 131–138.

Nakamura, K., Kawashima, R., Sato, N., Nakamura, A., Sugiura, M., Kato, T., Hatano, K., Ito, K., Fukuda, H., Schormann, T., et al., 2000. Functional delineation of the human occipito-temporal areas related to face and scene processing. A PET study. Brain 123, 1903–1912.

Rauschecker, J.P., Tian, B., 2000. Mechanisms and streams for processing of "what" and "where" in auditory cortex. Proc. Natl. Acad. Sci. U. S. A. 97, 11800–11806.

Romanski, L.M., Tian, B., Fritz, J., Mishkin, M., Goldman-Rakic, P.S., Rauschecker, J.P., 1999. Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat. Neurosci. 2, 1131–1136.

Rugg, M.D., Otten, L.J., Henson, R.N., 2002. The neural basis of episodic memory: evidence from functional neuroimaging. Philos. Trans. R. Soc. Lond., B Biol. Sci. 357, 1097–1110.

Shah, N.J., Marshall, J.C., Zafiris, O., Schwab, A., Zilles, K., Markowitsch, H.J., Fink, G.R., 2001. The neural correlates of person familiarity. A functional magnetic resonance imaging study with clinical implications. Brain 124, 804–815.

Shannon, R.V., Zeng, F.G., Kamath, V., Wygonski, J., Ekelid, M., 1995. Speech recognition with primarily temporal cues. Science 270, 303–304.

Thierry, G., Giraud, A.L., Price, C., 2003. Hemispheric dissociation in access to the human semantic system. Neuron 38, 499–506.

Van Lancker, D.R., Cummings, J.L., Kreiman, J., Dobkin, B.H., 1988. Phonagnosia: a dissociation between familiar and unfamiliar voices. Cortex 24, 195–209.

Van Lancker, D.R., Kreiman, J., Cummings, J., 1989. Voice perception deficits: neuroanatomical correlates of phonagnosia. J. Clin. Exp. Neuropsychol. 11, 665–674.

von Kriegstein, K., Eger, E., Kleinschmidt, A., Giraud, A.L., 2003. Modulation of neural responses to speech by directing attention to voices or verbal content. Brain Res. Cogn. Brain Res. 17, 48–55.

Zatorre, R.J., Evans, A.C., Meyer, E., 1994. Neural mechanisms underlying melodic perception and memory for pitch. J. Neurosci. 14, 1908–1919.

Zatorre, R.J., Perry, D.W., Beckett, C.A., Westbury, C.F., Evans, A.C., 1998. Functional anatomy of musical processing in listeners with absolute pitch and relative pitch. Proc. Natl. Acad. Sci. U. S. A. 95, 3172–3177.

Zatorre, R.J., Bouffard, M., Ahad, P., Belin, P., 2002. Where is 'where' in the human auditory cortex? Nat. Neurosci. 5, 905–909.

Zeineh, M.M., Engel, S.A., Thompson, P.M., Bookheimer, S.Y., 2003. Dynamics of the hippocampus during encoding and retrieval of face–name pairs. Science 299, 577–580.