NeuroImage 65 (2013) 109–118
Mechanisms of enhancing visual–speech recognition by prior auditory information

Helen Blank ⁎, Katharina von Kriegstein
Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstr. 1A, 04103 Leipzig, Germany

Abbreviations: STS, superior temporal sulcus; fMRI, functional magnetic resonance imaging.
⁎ Corresponding author. Fax: +49 341 9940 2448. E-mail address: [email protected] (H. Blank).
Article history: Accepted 20 September 2012; Available online 27 September 2012
Keywords: fMRI; Lip-reading; Multisensory; Predictive coding; Speech reading
Abstract

Speech recognition from visual-only faces is difficult, but can be improved by prior information about what is said. Here, we investigated how the human brain uses prior information from auditory speech to improve visual–speech recognition. In a functional magnetic resonance imaging study, participants performed a visual–speech recognition task, indicating whether the word spoken in visual-only videos matched the preceding auditory-only speech, and a control task (face-identity recognition) containing exactly the same stimuli. We localized a visual–speech processing network by contrasting activity during visual–speech recognition with the control task. Within this network, the left posterior superior temporal sulcus (STS) showed increased activity and interacted with auditory–speech areas if prior information from auditory speech did not match the visual speech. This mismatch-related activity and the functional connectivity to auditory–speech areas were specific for speech, i.e., they were not present in the control task. The mismatch-related activity correlated positively with performance, indicating that posterior STS was behaviorally relevant for visual–speech recognition. In line with predictive coding frameworks, these findings suggest that prediction error signals are produced if visually presented speech does not match the prediction from preceding auditory speech, and that this mechanism plays a role in optimizing visual–speech recognition by prior information.
Introduction

Visual–speech recognition is an essential skill in human communication that helps us to better understand what a speaker is saying. For example, if we are in a noisy environment, such as a crowded pub, the information from the moving face helps us “hear” what is said (Ross et al., 2007; Sumby and Pollack, 1954). In addition, for hearing-impaired or deaf people, visual–speech recognition is a central means of communication (Giraud et al., 2001; Rouger et al., 2007; Strelnikov et al., 2009). Previous functional magnetic resonance imaging (fMRI) studies have shown that frontal and temporal lobe areas are involved in visual–speech recognition, based on contrasts containing different stimulus material, such as “viewing visual speech vs. viewing facial gestures” (Besle et al., 2008; Hall et al., 2005; Nishitani and Hari, 2002; Okada and Hickok, 2009). Our first aim was to refine this network and to investigate whether a subset of areas is activated when contrasting visual–speech recognition with a control task (i.e., face-identity recognition) containing exactly the same stimulus material, thereby avoiding any confound due to differences in stimulus material. Visual–speech recognition from visual-only faces is very difficult (Altieri et al., 2011; Summerfield, 1992). However, it can be improved if one has prior information about what is said (Alegria et al., 1999;
Gregory, 1987), and it can also be influenced by auditory speech (Baart and Vroomen, 2010; Nath and Beauchamp, 2011). Our second aim in this study was to investigate how the human brain uses prior information from auditory speech, and combines it with the visual information, in order to improve visual–speech recognition. How auditory and visual information are combined in the human brain is currently a matter of debate (Beauchamp et al., 2004b; Driver and Noesselt, 2008; Ghazanfar et al., 2005; Noesselt et al., 2010). There is evidence that interactions between simultaneously presented auditory and visual stimuli occur in the multisensory posterior STS (Amedi et al., 2005; Beauchamp et al., 2004b; Stevenson and James, 2009). When only auditory–speech or only visual–speech stimuli are presented, it has been shown that input in one modality also leads to activation of cortices related to the other modality [e.g., auditory speech activates visual–speech processing areas in the visual left posterior STS (von Kriegstein et al., 2008) and, vice versa, visual speech activates auditory regions (Calvert et al., 1997; van Wassenhove et al., 2005)]. Here, we investigated whether predictions from auditory-only speech influence processing of subsequently presented visual speech in the left posterior STS during visual–speech recognition (i.e., lip-reading) of visual-only videos, using activity and connectivity analyses. To do this, we used an event-related fMRI design in which auditory-only speech and visual-only speech stimuli were presented in a match–mismatch paradigm (Fig. 1). Participants performed a visual–speech recognition task and indicated whether the word spoken in visual-only videos matched the preceding auditory-only speech. The design additionally included task-based and stimulus-based control conditions
(see Fig. 1 and Materials and methods). We expected three findings. Firstly, a mismatch of auditory information and facial movement should cause high prediction error signals in the left face-movement-sensitive and/or multisensory STS. This prediction error signal can be measured as increased activity during visual–speech recognition if there is a mismatch of auditory and visual speech. Secondly, the strength of the prediction error signal in STS should correlate with the ability to recognize visual speech (visual–speech recognition performance). Thirdly, we expected increased functional connectivity between those areas that process auditory speech and those that generate the prediction error signal if the auditory and visual speech do not match. We interpret our findings in line with predictive coding frameworks, in which predictions are the cause of a higher prediction error signal in response to a non-matching stimulus (Friston and Kiebel, 2009; Rao and Ballard, 1999).

Materials and methods

Subjects
Twenty-one healthy volunteers (10 female; mean age 26.9 years, age range 23–34 years; all right-handed [Edinburgh questionnaire (Oldfield, 1971)]) participated in the study. Written informed consent was collected from all participants according to procedures approved by the Research Ethics Committee of the University of Leipzig. Two subjects were excluded from the analysis: one because of difficulties with acquiring the field-map during fMRI and the other because he did not follow the task instructions. Furthermore, one subject's behavioral results had to be excluded due to intermittent technical problems with the response box. Therefore, the analysis of the fMRI data was based on 19 subjects and the analysis of the behavioral data was based on 18 subjects.

Stimuli
Stimuli consisted of videos, with and without an audio stream, and of auditory-only files. Stimuli were created by recording three male speakers
(22, 23, and 25 years old). For an additional stimulus-based control condition (see below) we recorded three mobile phones. All recordings were done in a soundproof room under constant luminance conditions. Videos were taken of the speakers' faces and of a hand operating the mobile phones. All the speaker videos started and ended with the closed mouth of the speaker. The videos therefore contained all movements made during word production. Speech samples of each speaker included 20 single words (example: “Dichter”, English: “poet”) and 11 semantically neutral and syntactically homogeneous five-word sentences (example: “Der Junge trägt einen Koffer.”, English: “The boy carries a suitcase.”). Twelve words were used in the experiment; 8 different words and the five-word sentences were used for the training. Key tone samples of each mobile phone included 20 different sequences of two to nine key presses per sequence. Twelve sequences were used in the main experiment and 8 sequences were used for the training. Videos were recorded with a digital video camera (Canon, Legria HF S10 HD-Camcorder). High-quality auditory stimuli were simultaneously recorded with a condenser microphone (Neumann TLM 50, pre-amplifier LAKE PEOPLE Mic-AmpF-35, soundcard PowerMac G5, 44.1 kHz sampling rate, and 16 bit resolution) and the software Sound Studio 3 (Felt Tip Inc., USA). The auditory stimuli were post-processed using Matlab (version 7.7, The MathWorks, Inc., MA, USA) to adjust the overall sound level. The audio files of all speakers and mobile phones were adjusted to a root mean square (RMS) of 0.083 (a minimal sketch of this kind of level adjustment is given after the Fig. 1 caption below). All videos were processed and cut in Final Cut Pro (version 6, HD, Apple Inc., USA). They were converted to mpeg format and presented at a size of 727 × 545 pixels.

Experimental procedure

fMRI experiment
The experiment contained four types of blocks including two stimulus types (person stimuli/mobile stimuli) and two tasks (content task/identity task, Figs. 1A/B). In the person-stimulus condition (Fig. 1C), each trial consisted of an auditory-only word followed by a visual-only video of a speaker saying a word. The stimuli were taken from the audio-visually recorded single
Fig. 1. Experimental Design. A/B. Overview of the experimental conditions. A. The factors, stimulus type and task type, were organized in four types of blocks (i.e. two stimulus types: person and mobile; and two tasks: content and identity). B. The panel displays two factors of the experiment, i.e. stimulus type and task type. C/D. Blocks started with an instruction screen and included 12 trials. Each trial consisted of an auditory-only stimulus followed by a silent video. Because the auditory and visual stimulus could either match or mismatch within each trial, the experiment included a third factor, i.e. congruency (see panels C and D). C. Person stimulus block: In the visual–speech recognition task, participants indicated whether or not the auditory-only presented word matched the word spoken by the person in the visual-only (muted) video. In the face-identity recognition task, participants indicated whether or not the visual-only face belonged to the same person that spoke the preceding auditory-only word. D. Mobile stimulus block: In the mobile-content task, participants indicated whether or not the auditory-only presented number of key tones matched the visual-only presented number of key presses. In the mobile-identity recognition task, participants indicated whether or not the visual-only mobile phone corresponded to the preceding auditory-only key tone.
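The audio post-processing described in the Stimuli section above adjusted every sound file to a common root-mean-square (RMS) level of 0.083; the authors did this in Matlab. Purely as an illustration of that kind of level adjustment, and not the authors' code, the following Python sketch rescales a file to a target RMS, assuming the soundfile library and hypothetical file names.

    import numpy as np
    import soundfile as sf  # assumed audio I/O library; any WAV reader/writer would do

    TARGET_RMS = 0.083  # common RMS level reported in the Stimuli section

    def normalize_rms(in_path, out_path, target=TARGET_RMS):
        """Rescale an audio file so that its overall RMS equals the target level."""
        audio, sr = sf.read(in_path)          # samples in the range [-1, 1]
        rms = np.sqrt(np.mean(audio ** 2))    # current root-mean-square level
        sf.write(out_path, audio * (target / rms), sr)

    # hypothetical usage for one recording
    # normalize_rms("speaker1_word01_raw.wav", "speaker1_word01.wav")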
words. In the mobile-stimulus condition (Fig. 1D), each trial consisted of an auditory-only presentation of key tones followed by a visual-only video of a hand pressing the keys of a mobile phone. In addition to the factors stimulus type and task, there was a third factor, congruency, because the auditory and visual stimuli could either match or mismatch within each auditory–visual stimulus pair. Subjects were asked to perform two types of tasks on the person and mobile stimuli, i.e., a content and an identity task. In the content task during person-stimulus conditions (Fig. 1C), participants were requested to indicate, via button press, whether the visual-only silently spoken word in the video matched the preceding auditory-only presented word or not. This task required visual–speech recognition during the visual-stimulus presentation; therefore this task is henceforth referred to as the visual–speech recognition task. Similarly, in the content task during mobile-stimulus conditions (Fig. 1D), participants indicated, via button press, whether or not the number of pressed mobile keys matched the number of the previously heard auditory-only key tones. In the identity task during person-stimulus conditions (Fig. 1C), participants indicated, via button press, whether or not the visual-only face and the preceding auditory-only voice belonged to the same person. This task required recognizing the identity of the face in the video; therefore this task is henceforth referred to as the “face-identity recognition” task. Similarly, for the identity task during mobile-stimulus conditions (Fig. 1D), participants indicated whether the visual-only mobile phone and the auditory-only key tones belonged to the same mobile or not. The identity associations were learned prior to the fMRI experiment (for a detailed description of this training see below). The experiment contained 12 different speech stimuli which were each randomly repeated three times per condition. While these stimuli were presented repeatedly across the two tasks, none were systematically presented as the first item in either task. Because of this randomized presentation, contrasting brain activation between tasks controlled for potential stimulus repetition effects. Within each trial, a fixation cross occurred between the auditory-only stimulus and the visual-only stimulus. It was presented for a jittered duration (average 1.6 s; range 1.2 to 2.2 s). The auditory and visual stimuli were presented for 1.6 s on average. A blue frame appeared around the visual stimulus to indicate the response phase (see Figs. 1C and D). The response phase lasted 2.3 s and started with the beginning of the visual stimulus within each trial. Also, a fixation cross was presented between trials for a jittered duration (average 1.6 s; range 1.2 to 2.2 s). One third of the stimuli were null events of 1.6 s duration, randomly presented within the experiment (Friston et al., 1999). The trials for each of the four types of blocks were grouped into 24 blocks of 12 trials (84 s) of the same task (i.e., content or identity recognition task) to minimize the time spent instructing the subjects on which task to perform. This also had the advantage that subjects did not frequently switch between different task-related mechanisms. Blocks were presented in semi-random order, with no two neighboring blocks of the same type. All blocks were preceded by a short task instruction. The written words “Wort” (Eng. “word”) and “Anzahl der Tastentöne” (Eng. “number of key tones”) indicated the content task, whereas “Person” (Eng. “person”) and “Handy” (Eng. “mobile phone”) indicated the identity task. Each instruction was presented for 2.7 s. The whole experiment consisted of two 16.8-min scanning sessions. Between the sessions, subjects were allowed to rest for approximately two minutes. Before the experiment, subjects received a short familiarization with all tasks. Aside from the factors task (content/identity), stimulus (person/mobile) and congruency (match and mismatch), the experiment included a fourth factor (auditory-only first/visual-only first). The “auditory-only first” condition has been described above. The setup for the “visual-only first” condition was exactly like that of the “auditory-only first” condition, with the difference that the first stimulus was the visual-only video and the second stimulus was auditory-only. This condition was part of a different research question
and the results will be described in detail elsewhere. However, the condition contributed to defining the functional localizers (see below) and was used to show that visual–speech recognition was significantly better when the auditory-only speech preceded the visual-only speech than when the visual-only speech preceded the auditory-only speech (see Results).

Training
To enable the participants to do the identity tasks during fMRI scanning, they were trained on the audio-visual identity of three speakers and three mobile phones, directly before they went into the MRI scanner. The stimulus material used in training was different from the material used in the fMRI experiment. Participants learned the speakers by watching audio-visual videos taken of the speakers' faces while they recited 36 five-word sentences. Participants learned the mobile phones by watching audio-visual videos showing a hand pressing keys on a mobile phone. 36 sequences with different numbers of key presses were used. After the training, recognition performance was tested. In the test, participants first saw silent videos of a person (or mobile phone) and subsequently listened to a voice (or a key tone). They were asked to indicate whether the auditory voice (or key tone) belonged to the face (or mobile phone) in the video. Subjects received feedback about correct, incorrect, and “too slow” responses. The training, including the learning and the test, took 25 minutes. Training was repeated twice for all participants (50 minutes in total). If a participant performed at less than 80% correct after the second training session, the training was repeated a third time.

Image acquisition
Functional images and structural T1-weighted images were acquired on a 3 T Siemens Tim Trio MR scanner (Siemens Healthcare, Erlangen, Germany). For the functional MRI, a gradient-echo EPI (echo planar imaging) sequence was used (TE 30 ms, flip angle 90°, TR 2.79 s, 42 slices, whole-brain coverage, acquisition bandwidth 116 kHz, 2 mm slice thickness, 1 mm interslice gap, in-plane resolution 3 mm × 3 mm). Geometric distortions were characterized by a B0 field-map scan. The field-map scan consisted of a gradient-echo readout (24 echoes, inter-echo time 0.95 ms) with standard 2D phase encoding. The B0 field was obtained by a linear fit to the unwrapped phases of all odd echoes. The structural images were acquired with a T1-weighted magnetization-prepared rapid gradient echo sequence (3D MP-RAGE) with selective water excitation and linear phase encoding. Magnetization preparation consisted of a non-selective inversion pulse. The imaging parameters were TI = 650 ms, TR = 1300 ms, TE = 3.93 ms, alpha = 10°, spatial resolution of 1 mm³, two averages. To avoid aliasing, oversampling was performed in the read direction (head–foot).

Data analysis

Behavioral
Behavioral data (visual–speech recognition performance) were analyzed with Matlab (version 7.7, The MathWorks, Inc., MA, USA). To test whether our experimental contrasts were influenced by an effect of difficulty, we computed a repeated-measures two-way analysis of variance (ANOVA) with the factors task and congruency. We used a paired t-test to assess whether visual–speech recognition was improved by prior auditory information, i.e., whether visual–speech recognition performance was significantly better when the auditory-only speech preceded the visual-only speech than when the visual-only speech preceded the auditory-only speech.
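The behavioral analyses above were run in Matlab. As a minimal sketch of the same statistical models, not the authors' code, the following Python example computes the repeated-measures two-way ANOVA (factors task and congruency), the paired t-test comparing auditory-first with visual-first presentation, and one common form of the rationalized arcsine (rau) transform that is applied to the near-ceiling scores in the Results section. The table layout and column names are assumptions.

    import numpy as np
    import pandas as pd
    from scipy.stats import ttest_rel
    from statsmodels.stats.anova import AnovaRM

    # hypothetical long-format table: one row per subject, task (speech/identity),
    # congruency (match/mismatch), order (auditory_first/visual_first), accuracy (% correct)
    df = pd.read_csv("behavior_long_format.csv")

    # repeated-measures two-way ANOVA with the within-subject factors task and congruency
    # (cells containing several rows per subject are averaged first)
    anova = AnovaRM(df, depvar="accuracy", subject="subject",
                    within=["task", "congruency"], aggregate_func="mean").fit()
    print(anova)

    # paired t-test: visual-speech recognition with auditory-only speech presented first
    # vs. with visual-only speech presented first
    speech = df[df["task"] == "speech"]
    wide = speech.groupby(["subject", "order"])["accuracy"].mean().unstack("order")
    print(ttest_rel(wide["auditory_first"], wide["visual_first"]))

    def rau(x_correct, n_trials):
        """Rationalized arcsine transform (Studebaker, 1985) for scores near ceiling."""
        theta = (np.arcsin(np.sqrt(x_correct / (n_trials + 1)))
                 + np.arcsin(np.sqrt((x_correct + 1) / (n_trials + 1))))
        return 146.0 / np.pi * theta - 23.0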
Functional MRI
Functional MRI data were analyzed with statistical parametric mapping (SPM8, Wellcome Trust Centre for Neuroimaging, UCL, UK,
http://www.fil.ion.ucl.ac.uk/spm), using standard spatial pre-processing procedures (realignment and unwarp, normalization to Montreal Neurological Institute (MNI) standard stereotactic space, and smoothing with an isotropic Gaussian filter, 8 mm FWHM). Geometric distortions due to susceptibility gradients were corrected by an interpolation procedure based on the B0 map (the field-map). Statistical parametric maps were generated by modeling the evoked hemodynamic response for the different conditions as boxcars convolved with a synthetic hemodynamic response function in the context of the general linear model (Friston et al., 2007). The population-level inferences using BOLD signal changes between conditions of interest were based on a random-effects model that estimated the second-level t statistic at each voxel (Friston et al., 2007). The Anatomy Toolbox in SPM was used for assigning labels to the regions in Tables 1, 3, and 4 (Eickhoff et al., 2005).

Definition of experimental contrasts of interest
To assess regions that were involved in the task of visual–speech recognition, we used the contrast “visual–speech recognition > face-identity recognition”. We expected that a mismatch of auditory information and facial movement would cause high prediction error signals in face-movement-sensitive and/or multisensory STS (Arnal et al., 2009; Nath and Beauchamp, 2011). We used two functional contrasts to investigate prediction error signals during presentation of the visual stimulus, when auditory and visual speech did not match: first, a mismatch contrast (“mismatch > match, during visual–speech recognition”), and second, a more controlled mismatch-interaction contrast (“[mismatch > match] during visual–speech recognition > [mismatch > match] during face-identity recognition”). We will refer to these contrasts as the mismatch contrast and the mismatch-interaction contrast, respectively. For the experimental contrasts, we modeled the second event in each trial. Only events in which the auditory stimulus was presented first were included. This was done to equate the contrasted conditions with regard to behavioral difficulty at the group level (see Table 2b). When the visual stimulus was presented before the auditory stimulus, visual–speech recognition was more difficult than any of the other tasks. This condition was therefore not included in any experimental contrast.

Definition of functional localizers for activity and connectivity analyses
We defined functional localizer contrasts to test whether activity and connectivity results were located in specific regions (visual–face and/or multisensory STS, auditory–speech areas). Localizers were based on contrasts within the main experiment (Friston et al., 2006). For localizer contrasts we used only the first event in each trial (visual or auditory) to avoid modeling exactly the same events for the localizer and the experimental contrast (which was computed on the second event in each trial). This means that the visual events used in the localizer contrasts (for visual–face and multisensory areas) were from trials that were separate and independent of the trials used for the experimental contrasts. Although the auditory events used in the localizer contrasts (for the multisensory and auditory–speech areas) and the visual events for the experimental contrasts were two successive events
Table 1
Coordinates for the contrast “visual–speech recognition > face-identity recognition” a.

Region        Peak location x, y, z (mm)    Peak Z    Approximate Brodmann's area    Cluster size (number of voxels)
Right SMA     12, 11, 67                    4.05      area 6                         146
  Left SMA    −9, 14, 61                    3.99      area 6
  Right SMA   6, 5, 67                      3.54      area 6
Right PCG     57, −1, 43                    4.00      area 6                         20
Left PCG      −54, −4, 46                   3.92      area 6                         15
Left IFG      −60, 14, 13                   3.82      area 44                        8
Left pSTG     −54, −43, 10                  3.50      area 42                        33
  Left pSTG   −48, −40, 19                  3.42      area 42

a Indented rows indicate maxima within the same cluster. SMA = supplementary motor area, PCG = precentral gyrus, IFG = inferior frontal gyrus, pSTG = posterior superior temporal gyrus.
from the same trial, they were designed to be independent: the experimental contrasts involved taking the difference between conditions which were both preceded by the auditory stimuli used in the localizer contrasts. Therefore any potential effect of a correlated mean is accounted for. To localize visual–face areas in left posterior STS, we used the contrast “faces > mobiles” (including all tasks). For localizing multisensory regions, we combined activation for “faces and voices” (including all tasks) in a conjunction analysis. This corresponds to the widely used approach of localizing multisensory areas by combining activation for visual and auditory stimuli (Beauchamp et al., 2010; Noesselt et al., 2010; Stevenson and James, 2009). As a localizer for auditory–speech areas we used the contrast “voices > tones, both during the content task”. The auditory–speech localizer was consistent with the location of auditory–speech areas in the left anterior temporal lobe reported in previous literature [p < 0.001, FWE-corrected for regions of interest taken from (Obleser et al., 2007) and from a meta-analysis on word length auditory–speech processing (DeWitt and Rauschecker, 2012)]. Regions of interest for the small volume correction were defined by the visual, auditory–speech, and multisensory localizers (visual–face cluster in STS: x = −54, y = −34, z = 10, k = 30; multisensory cluster in STS: x = −60, y = −37, z = 7, k = 57; auditory–speech cluster in STS: x = −63, y = −31, z = −11, k = 12). The visual–face area localizer was relatively unspecific (i.e., contrasting faces against objects), while our hypothesis was specific for the left STS visual face-movement area; we therefore complemented the visual–face area localizer with a coordinate that has been shown to be specifically activated by face movement (“moving faces vs. static faces”) (von Kriegstein et al., 2008).

Significance thresholds for fMRI data
Activity in regions for which we had an a-priori anatomical hypothesis [i.e., responses of areas usually reported for lip-reading (Calvert et al., 1997; Calvert and Campbell, 2003; Campbell et al., 2001; Hall et al., 2005; Okada and Hickok, 2009)] was considered significant at p < 0.001, uncorrected. To test whether mismatch-related activity was located within specific regions of interest in the posterior STS (e.g., multisensory, visual–face) we used the localizers described above. Activity was considered to be within a region of interest if it was present at p < 0.05, FWE-corrected for the region of interest. All other effects were considered significant at p < 0.05, FWE-corrected for the whole brain. In the Results section, we only refer to the regions that adhere to these significance criteria. For completeness and as an overview for interested readers, activations in regions for which we did not have an a-priori anatomical hypothesis are listed in Tables 1, 3, and 4 at p < 0.001, uncorrected, with a cluster extent of 5 voxels. None of them reached the significance threshold of p < 0.05, FWE-corrected for the whole brain. For display purposes only, in Figs. 2A and B, the BOLD response from the contrast “visual–speech recognition > face-identity recognition” was overlaid on the group mean structural image at p < 0.005, uncorrected, with a voxel threshold of 20 voxels.
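The first-level statistics were computed in SPM8. Only to make the modeling step concrete, the sketch below shows, in Python, how condition regressors of the kind used here can be built by convolving event boxcars with a canonical double-gamma hemodynamic response function, and how a contrast such as “mismatch > match during visual–speech recognition” is simply a weight vector over the resulting design-matrix columns. The onset times, the number of scans, and the HRF parameters are illustrative assumptions, not values from the study (only the TR of 2.79 s and the 1.6 s stimulus duration come from the sections above).

    import numpy as np
    from scipy.stats import gamma

    TR = 2.79          # repetition time of the EPI sequence (s)
    N_SCANS = 360      # illustrative number of volumes
    frame_times = np.arange(N_SCANS) * TR

    def hrf(t):
        """Canonical double-gamma HRF (SPM-like shape; parameters are illustrative)."""
        peak = gamma.pdf(t, 6)          # positive response peaking around 5-6 s
        undershoot = gamma.pdf(t, 16)   # late undershoot
        return peak - undershoot / 6.0

    def regressor(onsets, duration=1.6, dt=0.1):
        """Boxcar at resolution dt, convolved with the HRF, resampled to scan times."""
        t_hi = np.arange(0.0, N_SCANS * TR, dt)
        box = np.zeros_like(t_hi)
        for onset in onsets:
            box[(t_hi >= onset) & (t_hi < onset + duration)] = 1.0
        conv = np.convolve(box, hrf(np.arange(0.0, 32.0, dt)))[: len(t_hi)] * dt
        return np.interp(frame_times, t_hi, conv)

    # hypothetical onsets (s) of the visual events in the two conditions of interest
    X = np.column_stack([
        regressor([12.0, 55.0, 98.0]),    # mismatch, visual-speech recognition task
        regressor([33.0, 76.0, 119.0]),   # match, visual-speech recognition task
        np.ones(N_SCANS),                 # constant term
    ])

    # contrast weights for "mismatch > match during visual-speech recognition"
    c = np.array([1.0, -1.0, 0.0])
    # per voxel: beta = lstsq(X, y); the contrast estimate is c @ beta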
Correlation analysis
To investigate whether the amount of mismatch-related activity correlated with behavioral performance, we correlated activity from the mismatch contrast (“mismatch > match, during visual–speech recognition”) with behavioral performance during visual–speech recognition over subjects. Behavioral performance was included in the SPM analysis as a covariate at the second level. Significance was assessed with SPM. We subsequently analyzed the data in Matlab only to obtain a Pearson's r.

Psychophysiological interactions analysis (PPI)
To investigate the functional connectivity between auditory–speech regions and posterior STS regions during visual–speech recognition, we
Fig. 2. A and B BOLD responses were overlaid on a mean structural image of all participants. (A. sagittal and B. coronal view). The contrast “visual–speech recognition>face-identity recognition” is displayed in orange. The contrasts for investigating representations of prediction errors for visual–speech are displayed in dark blue for the mismatch contrast (“mismatch>match during the visual–speech recognition task”), and in green for the mismatch-interaction contrast (“[mismatch>match] during visual–speech recognition>[mismatch>match] during face-identity recognition”). C and D. Overlay of the mismatch-interaction contrast (green) and two functional localizer contrasts (C. sagittal and D. axial view), i.e., the visual-face area localizer (“faces>mobiles”, light blue) and the multisensory localizer (“faces and voices”, magenta). E. Percent signal change in the left posterior STS for conditions that were included in the mismatch-interaction contrast (“[mismatch>match] during visual–speech recognition>[mismatch>match] during face-identity recognition”). Match conditions are shown in light green and mismatch conditions in dark green. F. Correlation between correct performance during visual–speech recognition and the eigenvariate extracted from the mismatch contrast “mismatch>match during visual–speech recognition” revealed that activity from the left posterior STS correlated significantly with behavioral visual–speech recognition performance (eigenvariate from the left posterior STS during the mismatch contrast with peak coordinate x=−54, y=−40, and z=1, p=0.006 FWE-corrected, r=0.6493).
conducted PPI analyses (Friston et al., 1997). We performed the PPI following standard procedures (Friston et al., 1997; O'Reilly et al., 2012). These included modeling the psychological variable and the first eigenvariate as regressors, in addition to the psychophysiological interaction term, at the single-subject level. Including the psychological variable and the first eigenvariate in the PPI analysis ensured that only connectivity effects that are explained neither by task nor by the eigenvariate are present in the results. We extracted time courses from the regions of interest (i.e., posterior STS, as defined by the mismatch-interaction contrast, and auditory–speech regions, defined by the contrast “voices > tones, both during the content task”). The psychological variable was the mismatch-interaction contrast. Population-level inferences about BOLD signal changes were based on a random-effects model that estimated the second-level statistic at each voxel using a one-sample t-test. Results were considered significant if they were present at p < 0.05, FWE-corrected for the region of interest. In Fig. 3, for display purposes only, the PPI results are presented at p < 0.01, uncorrected, with a voxel threshold of 20 voxels.

Results

Identifying areas involved in visual–speech recognition: “visual–speech recognition > face-identity recognition” (Figs. 2A and B, orange)
The contrast “visual–speech recognition > face-identity recognition” revealed activation of left superior temporal gyrus (STG, x = −54, y = −43, z = 10), left supplementary motor area (SMA, x = −9, y = 14, z = 61), left precentral gyrus (PCG, x = −54, y = −4, z = 46), and the inferior frontal gyrus (IFG, x = −60, y = 14, z = 13; Figs. 2A and B, orange, and Table 1). The location of these areas is in line with brain regions involved in visual–speech recognition, as reported in previous studies (Brown et al., 2008; Chainay et al., 2004; Hall et al., 2005; Nishitani and Hari, 2002; Okada and Hickok, 2009). The activity did not seem to be primarily caused by difficulty differences between the conditions, since the visual–speech recognition and face-identity recognition tasks did not differ in difficulty at the group level (paired t-test: 93.09 vs. 92.57, t = 0.22, df = 17, p = 0.8285; Tables 2a, 2b). There was also no higher variability in one of the task conditions (Levene's test for equality of variances: F(1, 34) = 0.0760, p = 0.7845). Since both tasks were performed at ceiling, we applied the rationalized arcsine transformation (Sherbecoe and Studebaker, 2004; Studebaker, 1985) and repeated the analysis with these rau-transformed performance scores. The results of this analysis were qualitatively similar to those with the raw performance scores (t = 0.3464, df = 17, p = 0.7333).

Mismatch of auditory speech and visual speech increases activity in visual–speech recognition area (Figs. 2A–E, green and dark blue)
The mismatch contrast and the more controlled mismatch-interaction contrast revealed increased activity in left posterior STS (mismatch contrast: “mismatch > match, during visual–speech recognition”, Figs. 2A and B, dark blue, and Table 3; mismatch-interaction contrast: “[mismatch > match] during visual–speech recognition > [mismatch > match] during face-identity recognition”, Figs. 2A–E, green, and Table 4; p < 0.001, uncorrected). The mismatch-interaction contrast resulted in a more extended cluster than the mismatch contrast, but both had the same peak coordinate (x = −57, y = −40, z = −2). The plot (Fig.
2E) shows that at this coordinate there was not only higher activation for “speech mismatch vs. speech match”, but also for “face-identity match vs. face-identity mismatch”. The difference between “face-identity match vs. face-identity mismatch” was, however, not significant at the corrected significance threshold that would be necessary for such an unexpected result. It also did not reach the more lenient threshold of p < 0.001, uncorrected (x = −57, y = −40, z = −2, p = 0.002, uncorrected). The results indicate that
activation in the left posterior STS was predominantly increased when the visual–speech information did not match the auditory–speech input and the task emphasized speech processing. Furthermore, the mismatch-interaction contrast controlled for regions being activated by any general mismatch between vocal and facial information. Activity elicited by both mismatch contrasts did not seem to be primarily caused by an effect of difficulty, since there was neither a main effect of congruency nor of task in the behavioral data (repeated-measures two-way ANOVA, for the factor congruency: F(1,17) = 2.425, p = 0.1378, and for the factor task: F(1,17) = 0.044, p = 0.8365; Tables 2a, 2b). There was also no interaction between mismatch and task (repeated-measures two-way analysis of variance, F(1,17) = 0.022, p = 0.8843; Tables 2a, 2b). As hypothesized, the activity in left posterior STS for the mismatch-interaction contrast was located in a visual–face area (x = −60, y = −34, z = −4, p = 0.009, FWE-corrected for the visual–face area localizer) and also in a more specific face-movement area (x = −54, y = −55, z = 7, p = 0.009, FWE-corrected with the localizer from von Kriegstein et al., 2008). Furthermore, the activity in left posterior STS for the mismatch-interaction contrast overlapped with the activity of the contrast “visual–speech recognition > face-identity recognition” (at x = −51, y = −43, z = 13, p = 0.014, FWE-corrected). It is, however, not possible to conclude that the left posterior STS area that reacted to mismatch conditions corresponds more to a visual face-movement area than to a multisensory area, because there was considerable overlap also with the multisensory localizer (p = 0.004, FWE-corrected for the multisensory localizer, Figs. 2C and D).

The mismatch-related activity in posterior STS correlates with visual–speech recognition ability (Fig. 2F)
Performance during visual–speech recognition and the mismatch-related activity in left posterior STS correlated significantly (x = −54, y = −40, and z = 1, p = 0.006, FWE-corrected for the mismatch contrast; Pearson's r = 0.6493, p = 0.0035, n = 18). Post hoc analysis confirmed that this correlation was not affected by the subject with the highest eigenvariate value in the posterior STS (x = −54, y = −40, and z = 1, p = 0.006, FWE-corrected for the mismatch contrast; Pearson's r = 0.6672, p = 0.0034, n = 17). The correlation indicates that the activation of this area was behaviorally relevant for visual–speech recognition.

Auditory–speech and visual–speech recognition areas are functionally connected during visual-only speech recognition (Fig. 3, Table 4)
We used PPI analyses to investigate whether there was increased functional connectivity between the areas that processed auditory speech and those that generated the mismatch-related activity if the auditory and visual speech did not match. We found functional connectivity between the areas that generated the mismatch-related activity in the left posterior STS, localized with the mismatch-interaction contrast, and auditory–speech areas (p = 0.04, FWE-corrected with the auditory–speech area localizer). Vice versa, we found functional connectivity between auditory–speech areas in anterior/middle STS/STG and left posterior STS areas (p = 0.046, FWE-corrected with the mismatch-interaction contrast).
Both functional connectivity analyses included the mismatch-interaction contrast as the psychological variable (i.e., “[mismatch > match] during visual–speech recognition > [mismatch > match] during face-identity recognition”).

Auditory predictions improve visual–speech recognition performance
Visual–speech recognition was significantly better when the auditory-only speech preceded the visual-only speech than when the visual-only speech preceded the auditory-only speech (paired t-test: t = 5.3136, df = 17, p < 0.00001; Tables 2a, 2b).
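The PPI analyses reported above were carried out with the standard SPM procedure referenced in the Materials and methods (Friston et al., 1997; O'Reilly et al., 2012). As a simplified illustration of what the interaction term is, the Python sketch below builds a single-subject PPI design from a seed eigenvariate and a psychological variable coding the mismatch-interaction contrast. SPM additionally forms the interaction at the deconvolved neural level; that step is omitted here, and all variable names are hypothetical.

    import numpy as np

    def ppi_design(seed_ts, psy):
        """Simplified single-subject PPI design matrix.

        seed_ts: first eigenvariate of the seed region, one value per scan
                 (e.g. left posterior STS or auditory anterior/middle STS/STG).
        psy:     psychological variable coding the mismatch-interaction contrast,
                 one value per scan.
        Columns: [interaction, psychological, physiological, constant]; the test of
        interest is on the interaction column, so effects explained by task or by the
        seed time course alone do not show up as connectivity.
        """
        psy_c = psy - psy.mean()           # mean-center the psychological variable
        phys = seed_ts - seed_ts.mean()    # physiological (seed) regressor
        ppi = psy_c * phys                 # psychophysiological interaction term
        return np.column_stack([ppi, psy_c, phys, np.ones_like(phys)])

    def fit_glm(Y, X):
        """Ordinary least squares; the first row of beta holds the PPI effect per voxel."""
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return beta

    # group level: the subjects' PPI betas at each voxel enter a one-sample t-test,
    # e.g. scipy.stats.ttest_1samp(subject_betas, popmean=0.0)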
Fig. 3. Functional connectivity between auditory–speech areas and mismatch areas in left posterior STS during visual-only speech recognition. A and B. There was a significant functional connectivity (yellow) between the auditory seed in anterior/middle STS/STG (x=−57, y=−16, z=−5, red) and the left posterior STS target region (p=0.046, FWE-corrected for target region, x=−57, y=−40, z=−2, green). C and D. There was a significant functional connectivity (yellow) between the seed region in posterior STS (x=−57, y=−40, z=−2, green) and the auditory–speech target region (p=0.040, FWE-corrected for target region, x=−57, y=−16, z=−5, red). The psychological variable for all PPI analyses was the mismatch-interaction contrast, i.e., (“[mismatch>match] during visual–speech recognition>[mismatch>match] during face-identity recognition”).
Discussion

In the present study, we identified a visual–speech recognition network by contrasting visual–speech recognition and face-identity recognition tasks that used exactly the same stimulus material. Within this network, activity in the left posterior STS increased when visual speech did not match the prior information from auditory speech. The correlation between the mismatch-related activity and visual–speech recognition performance suggests that responses in
Table 2a
Behavioral results: visual–speech and face-identity recognition for match and mismatch (n = 18, mean (std)).

Task conditions              Per cent correct for match    Per cent correct for mismatch
Visual–speech recognition    91.74 (9.68)                  94.39 (8.55)
Face-identity recognition    91.03 (12.51)                 94.12 (6.17)
left posterior STS were behaviorally relevant. In addition, if auditory and visual speech information did not match, left posterior STS interacted with auditory–speech areas during visual-only speech recognition.

Brain regions involved in visual–speech recognition
The finding that posterior STS, supplementary motor area, precentral gyrus, and inferior frontal gyrus were involved in visual–speech
Table 2b
Behavioral results: visual–speech and face-identity recognition for visual and auditory stimulus presented first (n = 18, mean (std)).

Task conditions              Per cent correct for visual stimulus first    Per cent correct for auditory stimulus first
Visual–speech recognition    81.61 (10.81)                                 93.09 (7.57)
Face-identity recognition    90.47 (7.87)                                  92.58 (8.52)
Table 3
Coordinates for the mismatch contrast “mismatch > match during the visual–speech recognition task” a.

Region       Peak location x, y, z (mm)    Peak Z    Approximate Brodmann's area    Cluster size (number of voxels)
Left IFG     −54, 29, 4                    5.06      area 44/45                     63
  Left IFG   −51, 23, −5                   3.88      area 44/45
Left STG     −45, −22, −8                  3.92                                     6
Right MTG    54, −13, −11                  3.71                                     11
Right CG     21, −100, −2                  3.56      area 17/18                     7

a Indented rows indicate maxima within the same cluster. IFG = inferior frontal gyrus, STG = superior temporal gyrus, MTG = middle temporal gyrus, CG = cingulum.
recognition is in line with previous research. However, studies have used different stimulus material, e.g., contrasts between conditions with talking faces and conditions with meaningless face movements (e.g., “gurning”), or with still faces (Calvert et al., 1997; Calvert and Campbell, 2003; Campbell et al., 2001; Hall et al., 2005; Okada and Hickok, 2009), or with more basic stimuli, such as check patterns (Puce et al., 1998). This might explain why most previous studies also found additional regions activated, such as occipital and fusiform gyrus, calcarine sulcus, middle and superior temporal gyrus, and parietal cortex (Calvert and Campbell, 2003; Campbell et al., 2001; Hall et al., 2005), Heschl's gyrus, and also subcortical regions like the amygdala and the insula (Calvert et al., 1997) or the thalamus (Calvert and Campbell, 2003; Hall et al., 2005). Because in the present study there were no stimulus differences, and performance was matched over the group for the contrast (“visual–speech recognition > face-identity recognition”), we assume that the areas found were involved in the task of visual–speech recognition. The network of four cortical regions includes areas that are involved in the perception of faces, speech, and biological motion (inferior frontal gyrus and superior temporal gyrus; Hall et al., 2005; Okada and Hickok, 2009; Skipper et al., 2005; Skipper et al., 2007; for a review see Hein and Knight, 2008), as well as areas that are involved in the active performance of mouth movements (supplementary motor area and precentral gyrus; Brown et al., 2008; Chainay et al., 2004). This supports the suggestion that the same system is involved in both perceiving and performing lip movements (Skipper et al., 2007; Wilson et al., 2004).

Prediction error signals are represented in left posterior STS
The present findings showed that the left posterior STS was not only involved in visual-only speech recognition, but also responded to a greater extent if the visual sensory information did not match the preceding auditory information. We interpret the changes in BOLD activity as reflecting a prediction error signal, as defined by predictive coding theories (Friston and Kiebel, 2009; Rao and Ballard, 1999). In this view, predictions about the visual–speech input, which in our study were based on the preceding auditory stimulus, were present in the left posterior STS. These predictions were the cause
Table 4
Coordinates for the mismatch-interaction contrast “(mismatch > match) during visual–speech recognition > (mismatch > match) during face-identity recognition” a.

Region       Peak location x, y, z (mm)    Peak T    Approximate Brodmann's area    Cluster size (number of voxels)
Left MTG     −57, −40, −2                  4.16                                     83
Right PCG    54, 2, 28                     3.92      area 6/44                      12
Left IFG     −54, 29, 7                    3.75      area 44/45                     25
  Left IFG   −48, 23, 13                   3.26      area 44/45
Left STG     −45, −22, −8                  3.42                                     5
Right MTG    48, −70, 1                    3.35      area 37                        16

a Indented rows indicate maxima within the same cluster. MTG = middle temporal gyrus, PCG = precentral gyrus, IFG = inferior frontal gyrus, STG = superior temporal gyrus.
of a higher prediction error signal in response to the non-matching visual stimulus. Investigating brain function with mismatching stimuli entails a particular difficulty: the mismatch effect could be present for any kind of mismatching stimuli, not necessarily only for speech. Here, we addressed this by computing a mismatch-interaction contrast in which we contrasted the mismatch in auditory and visual speech with a mismatch in auditory and visual identity (“[mismatch > match] during visual–speech recognition > [mismatch > match] during face-identity recognition”). This implies that the mismatch responses we found in left posterior STS were relatively specific to speech and not a common prediction error signal in response to mismatching auditory and visual stimuli in general. Our results are consistent with a recent MEG/fMRI study showing that activity in left posterior STS correlated with the predictability of the visual stimuli during the presentation of incongruent audio-visual videos of a speaker pronouncing syllables (Arnal et al., 2011). The present results extend these findings and suggest that it is not only visual predictability that elicits high activity in left posterior STS for a mismatching stimulus, but that the predictions can also be derived from preceding auditory speech.

Interaction of auditory–speech areas with visual–speech recognition areas in the left posterior STS
Functional connectivity between the left posterior STS and auditory–speech areas in left anterior and middle STS/STG was increased when the visual speech did not match the predictions from the preceding auditory–speech sample. The location of the auditory–speech area is in line with previous studies on speech intelligibility [corrected for regions of interest taken from (Obleser et al., 2007) and from a meta-analysis on word length auditory–speech processing (DeWitt and Rauschecker, 2012)]. A recent study showed increased connectivity between left posterior STS and auditory cortex when the auditory information was less noisy than the visual information [and vice versa for the visual modality; (Nath and Beauchamp, 2011)]. While this previous study investigated connectivity from posterior STS regions to primary auditory cortex, we investigated the connectivity from posterior STS to higher-level auditory–speech regions. Both the connectivity findings regarding primary auditory cortex and our results integrate well with the predictive coding framework. Here, this framework would assume that auditory–speech regions send predictions about the expected speech input to visual–speech regions and, vice versa, that visual–speech regions send back prediction errors, i.e., the residual errors between the predictions and the actual visual–speech input. We therefore speculate that the increased functional connectivity between mismatch and auditory–speech regions is a result of sending the prediction error signal from left posterior STS back to auditory–speech areas in more anterior and middle parts of the left temporal lobe.

The mismatch-related activity in left posterior STS is behaviorally relevant
The activity in left posterior STS in response to mismatching speech stimuli was positively correlated with the behavioral performance of subjects, indicating that this signal was relevant for visual–speech recognition performance. This is in line with a previous study on visual–speech recognition that showed that the extent of activation in left STG increased as a function of lip-reading ability (Hall et al., 2005).
As expected, predictions from the auditory modality significantly improved the behavioral performance in matching the visual and the auditory–speech signals in the present study. In communication that is entirely based on visual–speech recognition, e.g., in hearing-impaired people, prior information is used to improve visual–speech recognition success (Giraud et al., 2001; Rouger et al., 2007; Strelnikov et al., 2009). This can be done in several ways, e.g., by auditory cues
from cochlear implants or by cueing phonemes with hand gestures (“cued speech”; Alegria et al., 1999; Gregory, 1987). It is an open question whether such predictions also lead to improved behavioral performance via the posterior STS, similar to the predictability of visual speech (Arnal et al., 2011) and the prediction provided by prior auditory speech in the present study.

Visual and/or multisensory processing in the left posterior STS?
Results from single-unit recording and fMRI suggest that the posterior STS consists of a patchy structure of neuronal populations that respond to auditory, visual, or multisensory auditory–visual stimuli (Beauchamp et al., 2004a; Dahl et al., 2009). Many neuroimaging studies consider the posterior STS mainly as a site of multisensory integration (Beauchamp et al., 2004a, 2004b; Okada and Hickok, 2009). Importantly for visual speech, however, the left posterior STS is also a central region for processing biological motion and has been implicated as a specialized cortical area for the perception of dynamic aspects of the face, such as speech and emotion (Haxby et al., 2000; Hocking and Price, 2008; Puce et al., 1998). The present results showed that the prediction error signal in response to non-matching auditory and visual speech in left posterior STS was co-localized with visual–speech area localizers in posterior STS, but it was also co-localized with a multisensory-area localizer. In addition, the multisensory region located with the conjunction contrast “faces and voices” lay entirely within the visual–face region located with the contrast “faces > mobiles” (see Fig. 2C). Therefore we cannot draw strong conclusions about whether the prediction error signal represented in left posterior STS is represented in an auditory–visual integration site or in a visual biological-movement area. An fMRI study showed that even during auditory-only speech recognition the visual face-movement STS was activated if listeners had previously had a brief auditory–visual experience with the speaker (von Kriegstein et al., 2008). This activation in the left posterior STS has led to the hypothesis that visual speech is simulated during auditory-only speech recognition and thereby improves the understanding of auditory speech. Interpreting our results in line with these prior findings, we speculate that predictions from auditory speech are translated into visual face movements and that these are simulated in the visual face-movement-sensitive regions within the left posterior STS, leading to a prediction error signal if the sensory input does not match the internal simulation. However, previous imaging studies have implicated the posterior STS in phonological processing of auditory speech (Buchsbaum et al., 2001; Hickok and Poeppel, 2007; Vaden et al., 2010) and in multisensory processing of auditory–visual signals (Beauchamp et al., 2004a, 2004b; Okada and Hickok, 2009). Therefore, an alternative interpretation of the present findings could be that the auditory–speech information was translated not into visual but into multisensory or phonological speech information that was represented in this region.

The role of inferior frontal cortex
Prompted by the helpful suggestion of an anonymous reviewer, we here also report and discuss the results of inferior frontal cortex activation, which was not our primary focus of interest.
Within the identified lip-reading network, not only regions in the posterior STS but also regions in inferior frontal cortex were activated in the mismatch-interaction contrast (x = −54, y = 11, z = 13, p = 0.04, FWE-corrected for “visual–speech recognition > face-identity recognition”) and were functionally connected with the seed regions in left posterior STS (x = −51, y = 32, z = 10, p = 0.013, FWE-corrected for the mismatch-interaction contrast) and in auditory anterior/middle STS/STG (x = −51, y = 32, z = 10, p = 0.04, FWE-corrected for the mismatch-interaction contrast). The activity in inferior frontal regions did, however, not correlate with behavioral
performance during mismatch of auditory and visual speech (x = −51, y = 11, z = 16, p > 0.05, FWE-corrected for both “visual–speech recognition > face-identity recognition” and the mismatch contrast). Since previous studies suggested that the inferior frontal gyrus is involved in the articulation of speech (Hall et al., 2005; Hickok and Poeppel, 2007; Okada and Hickok, 2009; Skipper et al., 2005, 2007), these functional activity and connectivity findings could indicate that posterior STS regions exchange information about the speech-related motor plan with inferior frontal regions. On the other hand, as inferior frontal regions have also been implicated in processing incongruence of non-speech sounds (Noppeney et al., 2010) and in abstract coding of audio-visual speech (Hasson et al., 2007), we cannot rule out that activation of inferior frontal regions represents a more abstract level of speech processing.

Conclusion
In summary, the present study shows that the left posterior STS is a central hub for visual–speech recognition, which plays a key role in integrating auditory predictive information for optimizing visual-only speech recognition. The findings might represent a general mechanism for how predictive information can influence visual-only speech recognition, a mechanism that is especially important for communication in difficult hearing conditions and for hearing-impaired people.

Conflict of interest statement
The authors declare no conflict of interest.

Acknowledgments
This work was supported by a Max Planck Research Group grant to K.v.K. We thank Begoña Díaz and Arnold Ziesche for helpful comments on an earlier version of the manuscript and Stefan Kiebel for methodological advice.

References
Alegria, J., Charlier, B.L., Mattys, S., 1999. The role of lip-reading and cued speech in the processing of phonological information in French-educated deaf children. Eur. J. Cogn. Psychol. 11, 451–472. Altieri, N.A., Pisoni, D.B., Townsend, J.T., 2011. Some normative data on lip-reading skills. J. Acoust. Soc. Am. 130, 1–4. Amedi, A., von Kriegstein, K., van Atteveldt, N.M., Beauchamp, M.S., Naumer, M.J., 2005. Functional imaging of human crossmodal identification and object recognition. Exp. Brain Res. 166, 559–571. Arnal, L.H., Morillon, B., Kell, C.A., Giraud, A.L., 2009. Dual neural routing of visual facilitation in speech processing. J. Neurosci. 29, 13445–13453. Arnal, L.H., Wyart, V., Giraud, A.-L., 2011. Transitions in neural oscillations reflect prediction errors generated in audiovisual speech. Nat. Neurosci. 14, 797–801. Baart, M., Vroomen, J., 2010. Do you see what you are hearing? Cross-modal effects of speech sounds on lipreading. Neurosci. Lett. 471, 100–103. Beauchamp, M.S., Argall, B.D., Bodurka, J., Duyn, J.H., Martin, A., 2004a. Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nat. Neurosci. 7, 1190–1192. Beauchamp, M.S., Lee, K.E., Argall, B.D., Martin, A., 2004b. Integration of auditory and visual information about objects in superior temporal sulcus. Neuron 41, 809–823. Beauchamp, M.S., Nath, A.R., Pasalar, S., 2010. fMRI-guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. J. Neurosci. 30, 2414–2417. Besle, J., Fischer, C., Bidet-Caulet, A., Lecaignard, F., Bertrand, O., Giard, M.H., 2008. Visual activation and audiovisual interactions in the auditory cortex during speech perception: intracranial recordings in humans. J. Neurosci. 28, 14301–14310.
Brown, S., Ngan, E., Liotti, M., 2008. A larynx area in the human motor cortex. Cereb. Cortex 18, 837–845. Buchsbaum, B.R., Hickok, G., Humphries, C., 2001. Role of left posterior superior temporal gyrus in phonological processing for speech perception and production. Cognit. Sci. 25, 663–678. Calvert, G.A., Campbell, R., 2003. Reading speech from still and moving faces: the neural substrates of visible speech. J. Cogn. Neurosci. 15, 57–70. Calvert, G.A., Bullmore, E.T., Brammer, M.J., Campbell, R., Williams, S.C., McGuire, P.K., Woodruff, P.W., Iversen, S.D., David, A.S., 1997. Activation of auditory cortex during silent lipreading. Science 276, 593–596. Campbell, R., MacSweeney, M., Surguladze, S., Calvert, G., McGuire, P., Suckling, J., Brammer, M.J., David, A.S., 2001. Cortical substrates for the perception of face
actions: an fMRI study of the specificity of activation for seen speech and for meaningless lower-face acts (gurning). Cogn. Brain Res. 12, 233–243. Chainay, H., Krainik, A., Tanguy, M.L., Gerardin, E., Le Bihan, D., Lehericy, S., 2004. Foot, face and hand representation in the human supplementary motor area. Neuroreport 15, 765–769. Dahl, C.D., Logothetis, N.K., Kayser, C., 2009. Spatial organization of multisensory responses in temporal association cortex. J. Neurosci. 29, 11924–11932. DeWitt, I., Rauschecker, J.P., 2012. Phoneme and word recognition in the auditory ventral stream. Proc. Natl. Acad. Sci. 109, E505–E514. Driver, J., Noesselt, T., 2008. Multisensory interplay reveals crossmodal influences on ‘sensory-specific’ brain regions, neural responses, and judgments. Neuron 57, 11–23. Eickhoff, S.B., Stephan, K.E., Mohlberg, H., Grefkes, C., Fink, G.R., Amunts, K., Zilles, K., 2005. A new SPM toolbox for combining probabilistic cytoarchitectonic maps and functional imaging data. Neuroimage 25, 1325–1335. Friston, K., Kiebel, S., 2009. Predictive coding under the free-energy principle. Philos. Trans. R. Soc. B-Biol. Sci. 364, 1211–1221. Friston, K.J., Buechel, C., Fink, G.R., Morris, J., Rolls, E., Dolan, R.J., 1997. Psychophysiological and modulatory interactions in neuroimaging. Neuroimage 6, 218–229. Friston, K.J., Zarahn, E., Josephs, O., Henson, R.N.A., Dale, A.M., 1999. Stochastic designs in event-related fMRI. Neuroimage 10, 607–619. Friston, K.J., Rotshtein, P., Geng, J.J., Sterzer, P., Henson, R.N., 2006. A critique of functional localisers. Neuroimage 30, 1077–1087. Friston, K., Ashburner, A., Kiebel, S., Nichols, T., Penny, W. (Eds.), 2007. Statistical Parametric Mapping: The Analysis of Functional Brain Images. Academic Press. Ghazanfar, A.A., Maier, J.X., Hoffman, K.L., Logothetis, N.K., 2005. Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. J. Neurosci. 25, 5004–5012. Giraud, A.L., Price, C.J., Graham, J.M., Truy, E., Frackowiak, R.S., 2001. Cross-modal plasticity underpins language recovery after cochlear implantation. Neuron 30, 657–663. Gregory, J.F., 1987. An investigation of speechreading with and without cued speech. Am. Ann. Deaf 132, 393–398. Hall, D.A., Fussell, C., Summerfield, A.Q., 2005. Reading fluent speech from talking faces: typical brain networks and individual differences. J. Cogn. Neurosci. 17, 939–953. Hasson, U., Skipper, J.I., Nusbaum, H.C., Small, S.L., 2007. Abstract coding of audiovisual speech: beyond sensory representation. Neuron 56, 1116–1126. Haxby, J.V., Hoffman, E.A., Gobbini, M.I., 2000. The distributed human neural system for face perception. Trends Cogn. Sci. 4, 223–233. Hein, G., Knight, R.T., 2008. Superior temporal sulcus—it's my area: or is it? J. Cogn. Neurosci. 20, 2125–2136. Hickok, G., Poeppel, D., 2007. The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402. Hocking, J., Price, C.J., 2008. The role of the posterior superior temporal sulcus in audiovisual processing. Cereb. Cortex 18, 2439–2449. Nath, A.R., Beauchamp, M.S., 2011. Dynamic changes in superior temporal sulcus connectivity during perception of noisy audiovisual speech. J. Neurosci. 31, 1704–1714. Nishitani, N., Hari, R., 2002. Viewing lip forms: cortical dynamics. Neuron 36, 1211–1220. Noesselt, T., Tyll, S., Boehler, C.N., Budinger, E., Heinze, H.J., Driver, J., 2010. 
Sound-induced enhancement of low-intensity vision: multisensory influences on human sensoryspecific cortices and thalamic bodies relate to perceptual enhancement of visual detection sensitivity. J. Neurosci. 30, 13609–13623.
Noppeney, U., Ostwald, D., Werner, S., 2010. Perceptual decisions formed by accumulation of audiovisual evidence in prefrontal cortex. J. Neurosci. 30, 7434–7446. Obleser, J., Wise, R.J.S., Dresner, M.A., Scott, S.K., 2007. Functional integration across brain regions improves speech perception under adverse listening conditions. J. Neurosci. 27, 2283–2289. Okada, K., Hickok, G., 2009. Two cortical mechanisms support the integration of visual and auditory speech: a hypothesis and preliminary data. Neurosci. Lett. 452, 219–223. Oldfield, R.C., 1971. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9, 97–113. O'Reilly, J.X., Woolrich, M.W., Behrens, T.E.J., Smith, S.M., Johansen-Berg, H., 2012. Tools of the trade: psychophysiological interactions and functional connectivity. Soc. Cogn. Affect. Neurosci. 7, 604–609. Puce, A., Allison, T., Bentin, S., Gore, J.C., McCarthy, G., 1998. Temporal cortex activation in humans viewing eye and mouth movements. J. Neurosci. 18, 2188–2199. Rao, R.P., Ballard, D.H., 1999. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 2, 79–87. Ross, L.A., Saint-Amour, D., Leavitt, V.M., Javitt, D.C., Foxe, J.J., 2007. Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environment. Cereb. Cortex 17, 1147–1153. Rouger, J., Lagleyre, S., Fraysse, B., Deneve, S., Deguine, O., Barone, P., 2007. Evidence that cochlear-implanted deaf patients are better multisensory integrators. Proc. Natl. Acad. Sci. U. S. A. 104, 7295–7300. Sherbecoe, R.L., Studebaker, G.A., 2004. Supplementary formulas and tables for calculating and interconverting speech recognition scores in transformed arcsine units. Int. J. Audiol. 43, 554. Skipper, J.I., Nusbaum, H.C., Small, S.L., 2005. Listening to talking faces: motor cortical activation during speech perception. Neuroimage 25, 76–89. Skipper, J.I., van Wassenhove, V., Nusbaum, H.C., Small, S.L., 2007. Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cereb. Cortex 17, 2387–2399. Stevenson, R.A., James, T.W., 2009. Audiovisual integration in human superior temporal sulcus: inverse effectiveness and the neural processing of speech and object recognition. Neuroimage 44, 1210–1223. Strelnikov, K., Rouger, J., Lagleyre, S., Fraysse, B., Deguine, O., Barone, P., 2009. Improvement in speech-reading ability by auditory training: evidence from gender differences in normally hearing, deaf and cochlear implanted subjects. Neuropsychologia 47, 972–979. Studebaker, G.A., 1985. A rationalized Arcsine transform. J. Speech Hear. Res. 28, 455–462. Sumby, W.H., Pollack, I., 1954. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 26, 212–215. Summerfield, Q., 1992. Lipreading and audiovisual speech-perception. Philos. Trans. R. Soc. Lond. B Biol. Sci. 335, 71–78. Vaden, K.I., Muftuler, L.T., Hickok, G., 2010. Phonological repetition-suppression in bilateral superior temporal sulci. Neuroimage 49, 1018–1023. van Wassenhove, V., Grant, K.W., Poeppel, D., 2005. Visual speech speeds up the neural processing of auditory speech. Proc. Natl. Acad. Sci. U. S. A. 102, 1181–1186. von Kriegstein, K., Dogan, O., Gruter, M., Giraud, A.L., Kell, C.A., Gruter, T., Kleinschmidt, A., Kiebel, S.J., 2008. Simulation of talking faces in the human brain improves auditory speech recognition. Proc. Natl. Acad. Sci. U. S. A. 105, 6747–6752. 
Wilson, S.M., Saygin, A.P., Sereno, M.I., Iacoboni, M., 2004. Listening to speech activates motor areas involved in speech production. Nat. Neurosci. 7, 701–702.