Involvement of superior temporal areas in audiovisual and audiomotor speech integration


N. KOMEILIPOOR, a,b P. CESARI b AND A. DAFFERTSHOFER a*


a MOVE Research Institute Amsterdam, Faculty of Behavioural and Movement Sciences, Vrije Universiteit, Van der Boechorststraat 9, 1081BT Amsterdam, The Netherlands
b Department of Neurological, Biomedical and Movement Sciences, University of Verona, 37131 Verona, Italy

Abstract—Perception of speech sounds is affected by observing facial motion. Incongruence between a speech sound and the sight of somebody articulating may influence the perception of the auditory syllable, an effect referred to as the McGurk effect. We tested the degree to which silent articulation of a syllable also affects speech perception and searched for its neural correlates. Listeners were instructed to identify the auditory syllables /pa/ and /ta/ while silently articulating congruent/incongruent syllables or observing videos of a speaker’s face articulating them. As a baseline, we included an auditory-only condition without competing visual or sensorimotor input. As expected, perception of sounds degraded when incongruent syllables were observed, and also when they were silently articulated, albeit to a lesser extent. This degradation was accompanied by significant amplitude modulations in the beta frequency band in right superior temporal areas. In these areas, the event-related beta activity during congruent conditions was phase-locked to responses evoked during the auditory-only condition. We conclude that proper temporal alignment of different input streams in right superior temporal areas is mandatory for both audiovisual and audiomotor speech integration. © 2016 Published by Elsevier Ltd. on behalf of IBRO.

Key words: EEG, McGurk effect, multisensory integration, sensorimotor interaction, superior temporal gyrus.

*Corresponding author. Address: MOVE Research Institute Amsterdam, Faculty of Behavioural and Movement Sciences, Vrije Universiteit, Van der Boechorststraat 9, 1081BT Amsterdam, The Netherlands. E-mail address: a.daff[email protected] (A. Daffertshofer).
Abbreviations: BEM, boundary element method; DICS, dynamic imaging of coherent sources; EEG, electroencephalography; EMG, electromyography; MEG, magnetoencephalography; MNI, Montreal Neurological Institute; PLV, phase-locking value; STS/STG, superior temporal sulcus/gyrus.
http://dx.doi.org/10.1016/j.neuroscience.2016.03.047
0306-4522/© 2016 Published by Elsevier Ltd. on behalf of IBRO.

INTRODUCTION

The brain receives a continuous stream of information from different sensory modalities. Proper integration of these inputs is essential for accurate perception. The perception of speech sounds is clearly affected by the observation of facial motion: incongruent visual input can degrade sound perception, as the visual input may alter the perceived auditory syllable. This is referred to as the McGurk effect (McGurk and MacDonald, 1976). The McGurk effect has inspired many researchers investigating multisensory integration (Tiippana, 2014). The perception of a syllable can also be affected by tactile stimulation (Gick and Derrick, 2009; Ito et al., 2009). The identification of auditory syllables can be either degraded or improved when listeners silently articulate incongruent or congruent syllables, respectively, as well as when they observe others producing those syllables (Sams et al., 2005; Mochida et al., 2013; Sato et al., 2013). Sams et al. (2005) suggested that both effects may rely on the same neural mechanism and may be due to a modulation of activity in auditory cortical areas. Functional magnetic resonance imaging (fMRI) studies indicated that lip reading modulates activity of the auditory cortex (Calvert et al., 1997). Visual speech may hence affect auditory perception by altering the activation of auditory cortical areas. Likewise, magnetoencephalography (MEG) studies suggest a modulation of activity in the auditory cortex during both silent and loud reading (Numminen et al., 1999; Kauramäki et al., 2010; Tian and Poeppel, 2010) as well as during silent articulation (Numminen and Curio, 1999) and lip reading (Kauramäki et al., 2010). Interestingly, responses were weaker for covert speech compared with silent reading (Numminen et al., 1999), for lip reading and covert speech compared with visual control and baseline tasks (Kauramäki et al., 2010), and during silent articulation compared with speech listening (Numminen and Curio, 1999). It has been suggested that this auditory suppression during speech might be due to an efference-copy pathway from articulatory networks in Broca’s area to the auditory cortex via the inferior parietal lobe (Rauschecker and Scott, 2009). Thus, the effect of observing and articulating incongruent syllables on the perception of auditory syllables (Sams et al., 2005; Mochida et al., 2013; Sato et al., 2013) may be ascribed to their impact on activity in auditory areas, which interferes with speech perception.

A further way to conceive the neuronal underpinning of multisensory perception is to consider it a result of the activity of multimodal neurons processing inputs from different sensory modalities. In mammals, such multisensory cell assemblies are presumably located at multiple neural levels, from midbrain to cortex (Stein and Stanford, 2008).


Regarding the McGurk effect, neuroimaging revealed an involvement of the superior temporal sulcus/gyrus (STS/STG) (Calvert et al., 2000; Jones and Callan, 2003; Sekiyama et al., 2003; Bernstein et al., 2008; Irwin et al., 2011; Nath and Beauchamp, 2012; Szycik et al., 2012; Erickson et al., 2014). A number of recent papers considered the dynamic interplay of neural populations as a key to cross-modal integration (Senkowski et al., 2008; Arnal et al., 2009; Arnal and Giraud, 2012). The superior temporal area is considered a multisensory convergence site, as it receives inputs from unimodal auditory and visual cortices and contains multisensory neurons (Karnath, 2001). However, what precisely happens in this area to accomplish multisensory integration, and whether it is responsible for the reported effect of silent articulation on auditory perception (e.g. Sams et al., 2005), is still largely unclear.

For the present study, we capitalized on the competition between auditory and visual inputs as well as between auditory and sensorimotor inputs to probe how cortical oscillations contribute to multisensory integration. We adopted a protocol recently introduced by Mochida et al. (2013), in which listeners are instructed to identify auditory syllables while silently articulating congruent/incongruent syllables or observing videos of a speaker’s face articulating congruent/incongruent syllables. Cortical activity was monitored using electroencephalography (EEG). Consistent with the McGurk effect (McGurk and MacDonald, 1976), we expected subjects to typically misperceive the sound when the acoustic syllable /pa/ was dubbed onto the visual presentation of the articulatory gestures of /ta/. We also expected a similar result when subjects themselves silently articulated an incongruent syllable (Sams et al., 2005; Mochida et al., 2013; Sato et al., 2013). Furthermore, we expected source localization of the EEG to reveal STS/STG as the area discriminating between proper and improper perception, in line with the aforementioned imaging studies. Finally, we hypothesized the phase dynamics in STS/STG to be essential for multisensory integration, as we believe that temporal alignment of distinct sensory streams is key to their integration.

EXPERIMENTAL PROCEDURE


Subjects


Twelve volunteers (mean age 26.1 years, five females) participated after giving their written informed consent. All were right-handed and had normal hearing and normal or corrected-to-normal vision.


Protocol


The experimental protocol was adopted from a recent study by Mochida et al. (2013). The ethics committee of the Faculty of Human Movement Sciences, VU University Amsterdam, approved the protocol prior to the experiment.


Task. Participants were asked to identify the syllables (/pa/ and /ta/) that they heard among four possible alternatives (/pa/, /ta/, /ka/, or ‘etc’) displayed on the screen, under the following subtask conditions: silently articulating congruent/incongruent syllables (motor condition), observing videos of a speaker’s face articulating congruent/incongruent syllables (visual condition), and a condition without a subtask (baseline or auditory-only condition); see Fig. 1 for an overview. In the motor condition, participants were instructed to articulate the syllables with as little vocalization as possible while moving the lips and tongue as much as possible and to identify the syllables that they heard. In the visual condition, subjects were required to indicate what they heard while they were presented with audiovisual stimuli. In the baseline (auditory-only) condition, participants were asked to listen to the syllables while watching a still frame of the video and to choose the heard syllable after it had been presented.


Stimuli. Stimuli were produced by a Dutch male speaker. Conventional videos were recorded at a 50-Hz frame rate and edited in iMovie 10.0. Audio signals were digitized at a rate of 44.1 kHz. They were delivered at a level of 60 dB through a pair of speakers placed in front of the participants (at a distance of 55 cm from the participant’s torso) and separated by approximately 30 cm. We superimposed white noise on the syllables (signal-to-noise ratio of 5 dB) to create ambiguity and reduce word recognition accuracy (Sato et al., 2013). The beginning and end of the noise were faded in and out, respectively (0.5 s duration). Syllables were preceded by four clicks (0.67 s inter-click interval) to provide a cue for silently articulating a syllable in the motor condition. For the visual conditions, auditory syllables were paired with videos of a speaker’s face producing either congruent or incongruent syllables, yielding four different combinations: (i) congruent /pa/ (visual /pa/, auditory /pa/), (ii) congruent /ta/ (visual /ta/, auditory /ta/), (iii) incongruent stimuli (visual /pa/, auditory /ta/), and (iv) the converse incongruent stimuli (visual /ta/, auditory /pa/). Similarly, in the motor conditions, silent articulation of congruent/incongruent syllables paired with the auditory syllables produced four combinations: (i) congruent /pa/ (articulation of /pa/, auditory /pa/), (ii) congruent /ta/ (articulation of /ta/, auditory /ta/), (iii) the incongruent combination (articulation of /pa/, auditory /ta/), and (iv) the converse incongruent combination (articulation of /ta/, auditory /pa/). In the motor condition, English characters representing /pa/ or /ta/ were presented on a front display (LCD monitor, frame rate 60 Hz, about 55 cm in front of the participant’s nasion) until the participants pressed the space bar of a computer keyboard to start the trial. They were asked to silently articulate the indicated syllable in time with the clicks and the onset of the syllable while watching a still frame of the video.
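The noise-embedding step can be made concrete with a short sketch. This is an illustrative re-implementation in Python, not the authors’ stimulus-preparation code: the 44.1-kHz sampling rate, the 5-dB signal-to-noise ratio, and the 0.5-s fades come from the text, whereas the padding, the linear ramp shape, and all function and variable names are assumptions.

```python
import numpy as np

def embed_in_noise(syllable, fs=44100, snr_db=5.0, fade_s=0.5, pad_s=1.0):
    """Mix a syllable into white noise at a given SNR, with faded noise on-/offset (sketch)."""
    noise = np.random.randn(len(syllable) + 2 * int(pad_s * fs))
    # Scale the noise so that 10*log10(P_signal / P_noise) equals snr_db.
    p_signal = np.mean(syllable ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    # Fade the noise in and out with 0.5-s linear ramps.
    ramp = np.linspace(0.0, 1.0, int(fade_s * fs))
    noise[:ramp.size] *= ramp
    noise[-ramp.size:] *= ramp[::-1]
    # Add the syllable after the leading noise-only segment.
    onset = int(pad_s * fs)
    mixed = noise.copy()
    mixed[onset:onset + len(syllable)] += syllable
    return mixed, onset

# Example with a dummy 0.3-s "syllable" (real recordings would be used instead).
fs = 44100
syllable = np.random.randn(int(0.3 * fs))
stimulus, onset_sample = embed_in_noise(syllable, fs)
```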


Fig. 1. Scheme of the experimental protocol. The stimulus was preceded by four clicks at 0.67-s intervals, which provided the subjects with a cue to silently articulate a syllable in the motor condition. The syllables to be articulated were presented visually using English characters, which disappeared at the second click. Participants were asked to silently articulate the indicated syllable in time with the clicks and the onset of the syllable while watching a still video frame. In the visual condition, videos of a speaker’s face producing the syllables were presented. The videos were synchronized with the auditory stimulus. The initial frame of each video was presented from the noise onset to the stimulus onset.


For the visual conditions, a video of the speaker’s face articulating either /pa/ or /ta/ was presented on the front display. Prior to the video presentation, the initial frame of the video was shown from the onset of the white noise until the onset of the syllable. In the baseline (auditory-only) condition, participants were asked to listen to the auditory stimuli while watching a still frame of the video.

Procedures. The experimental session consisted of a familiarization phase followed by a test phase. During familiarization, participants performed one baseline block followed by one motor-condition block (see below for block definitions). In the test phase, the subjects performed three sets of motor- and visual-condition blocks; the order of the two blocks was randomized within each set and counterbalanced for each participant. After these six blocks, they performed another baseline block. In addition, resting-state activity was recorded before and after every block; these recordings will be reported elsewhere. After each stimulus presentation, the participants were asked to select which of the syllables they had perceived among the four possible alternatives, /pa/, /ta/, /ka/, or ‘etc’, displayed on the screen. They provided their responses using a computer keyboard. The next trial was initiated 10 s after the participants entered their response. The total number of trials across all blocks was 45 for each motor and visual condition and 15 for each auditory-alone condition. Motor- and visual-condition blocks consisted of 60 trials in which each of the four different combinations (2 auditory syllables × 2 subtask syllables) was performed 15 times. A baseline-condition block consisted of 30 trials in which each of the two auditory syllables was presented 15 times. The order of trials within each block was randomized, and the order of blocks was randomized across participants.

We recorded EEG using a 64-channel amplifier (Refa, TMSi, Enschede, The Netherlands; Ag/AgCl electrodes mounted in an elastic cap plus two mastoid electrodes) and sampled the signals at a rate of 1024 Hz. We also co-registered electromyographic (EMG) activity from the left and right masseter and digastric muscles using Ag/AgCl surface electrodes (sampling rate 1 kHz, 16-channel Porti amplifier, TMSi, Enschede, The Netherlands). All subsequent off-line analyses were realized using Matlab 2014a (The MathWorks, Natick, MA), including the open-source toolbox FieldTrip (fieldtrip.fcdonders.nl).


DATA ANALYSIS


Behavioral data


For each auditory stimulus and condition, 45 responses were collected, from which error-response rates were defined as the percentage of incorrectly identified syllables and used as a measure of syllable intelligibility. Rates were subsequently averaged over trials for every participant according to whether the visual or motor subtask was congruent or incongruent with the auditory stimulus. The mean error-response rate of the baseline (auditory-only) condition was subtracted from those of the congruent and incongruent stimulus sets. The resulting unbiased rates were compared using a two-way repeated-measures analysis of variance (ANOVA) with condition (motor/visual) and stimulus (congruent/incongruent) as within-subject factors.


Post-hoc comparisons were performed via t-tests, applying a Bonferroni correction for multiple comparisons when required. A partial-eta-squared statistic served to estimate effect size.
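For reference, this statistical pipeline can be sketched as follows. This is not the authors’ analysis code; the long-format layout, the column names, and the use of statsmodels’ AnovaRM together with SciPy’s paired t-test are assumptions chosen to mirror the described 2 × 2 within-subject design with Bonferroni-corrected post hoc comparisons.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

# One baseline-corrected error-response rate per participant, condition and stimulus
# (random numbers stand in for the real data).
rng = np.random.default_rng(0)
rows = [{"subject": s, "condition": c, "stimulus": st, "rate": rng.normal()}
        for s in range(12)
        for c in ("motor", "visual")
        for st in ("congruent", "incongruent")]
df = pd.DataFrame(rows)

# Two-way repeated-measures ANOVA with condition and stimulus as within-subject factors.
res = AnovaRM(df, depvar="rate", subject="subject",
              within=["condition", "stimulus"]).fit()
print(res.anova_table)

# Post hoc paired t-tests with Bonferroni correction (two comparisons in this sketch).
pivot = df.pivot_table(index="subject", columns=["condition", "stimulus"], values="rate")
comparisons = [(("visual", "incongruent"), ("visual", "congruent")),
               (("motor", "incongruent"), ("motor", "congruent"))]
for a, b in comparisons:
    t, p = ttest_rel(pivot[a], pivot[b])
    print(a, "vs", b, "t = %.2f" % t, "p(Bonferroni) = %.3f" % min(1.0, p * len(comparisons)))
```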


Electrophysiological data


We considered the interval between the third and the fourth (last) click onset (‘pre’) and the interval between the last click onset and the end of the syllable presentation (‘post’) as the intervals of interest. In detail, we selected epochs of 1.34 s (±0.67 s) around the last click onset (= speech onset). Equivalent to the behavioral responses, trials were categorized into two stimulus categories: congruent and incongruent.
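A minimal epoching sketch of this step, assuming continuous data sampled at 1024 Hz and a list of last-click (speech-onset) sample indices; array shapes and names are illustrative, not taken from the original pipeline.

```python
import numpy as np

def extract_epochs(data, onsets, fs=1024, half_window_s=0.67):
    """Cut symmetric epochs of +/- 0.67 s around each speech onset.

    data: (n_channels, n_samples); returns (n_trials, n_channels, 2 * half_window_samples)."""
    half = int(round(half_window_s * fs))
    epochs = [data[:, s - half:s + half] for s in onsets
              if s - half >= 0 and s + half <= data.shape[1]]
    return np.stack(epochs)

# Example with dummy data: 64 channels, 60 s of recording, three onsets.
fs = 1024
eeg = np.random.randn(64, 60 * fs)
epochs = extract_epochs(eeg, onsets=[10 * fs, 25 * fs, 40 * fs], fs=fs)
# The first half of each epoch is the 'pre' interval, the second half the 'post' interval.
```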


Preprocessing. All signals were filtered with a 50-Hz notch filter (2nd-order bi-directional Butterworth) to remove power-line artifacts. We reduced movement artifacts by high-pass filtering (2nd-order bi-directional Butterworth, cut-off frequency 3 Hz for EEG and 10 Hz for EMG signals). Since the EMG signals were collected in a bipolar montage, we full-wave rectified them using the modulus of the corresponding analytic signal, also referred to as the Hilbert amplitude. Subsequently, we removed movement and muscle artifacts by combining EEG and EMG data. We identified EMG contamination via an independent component analysis (ICA; Bell and Sejnowski, 1995) of the combined z-scored signals (64 EEG + 6 EMG = 70 channels). Components were considered artifacts if (1) they were highly correlated with (one of) the EMG signals (the six components with the highest correlations were selected for removal), or (2) they had a median frequency larger than the minimum (nine subjects) or the average (three subjects) of the EMG median frequencies. On average, 26 independent components were considered artifacts. EEG signals were reconstructed by back-projecting the remaining components (i.e. using the correspondingly reduced mixing matrix), followed by inverting the aforementioned z-scoring.
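The filtering and rectification steps translate directly into code. The original analysis was done in Matlab/FieldTrip; the following SciPy-based version is only an illustrative sketch of those steps (filter orders, cut-offs, and sampling rates as stated above, everything else assumed), and the subsequent ICA-based artifact rejection is merely indicated in a comment rather than re-implemented with the infomax algorithm cited in the text.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs_eeg, fs_emg = 1024, 1000

def notch_50hz(x, fs, half_width=2.0, order=2):
    # 2nd-order bi-directional Butterworth band-stop around 50 Hz.
    b, a = butter(order, [(50.0 - half_width) / (fs / 2), (50.0 + half_width) / (fs / 2)],
                  btype="bandstop")
    return filtfilt(b, a, x, axis=-1)

def highpass(x, fs, cutoff, order=2):
    # 2nd-order bi-directional Butterworth high-pass (3 Hz for EEG, 10 Hz for EMG).
    b, a = butter(order, cutoff / (fs / 2), btype="highpass")
    return filtfilt(b, a, x, axis=-1)

# Dummy data: 64 EEG channels and 6 bipolar EMG channels, 60 s each.
eeg = np.random.randn(64, 60 * fs_eeg)
emg = np.random.randn(6, 60 * fs_emg)

eeg = highpass(notch_50hz(eeg, fs_eeg), fs_eeg, cutoff=3.0)
emg = highpass(notch_50hz(emg, fs_emg), fs_emg, cutoff=10.0)

# Full-wave rectification of the bipolar EMG via the Hilbert amplitude.
emg_rectified = np.abs(hilbert(emg, axis=-1))

# Artifact rejection would follow here: z-score and combine the 64 EEG and 6 EMG channels,
# decompose them with ICA, flag components that correlate with the EMG or have a high
# median frequency, and back-project the remaining components.
```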

Source localization. We concentrated on neural activity in the beta frequency band (15–30 Hz) using beamformers based on dynamic imaging of coherent sources (DICS; Gross et al., 2001); the analysis of other frequency bands can be found in Appendix A. We note that beta-band oscillations are prospectively modulated during the McGurk illusion and – of course – also during motor tasks (Van Wijk et al., 2009; Houweling et al., 2010). The beta band was hence considered the primary target to assess the role of phase entrainment in multisensory processing. Cross-spectral density matrices of all conditions were determined using a multi-taper method over the beta frequency band (22.5 ± 7.5 Hz) for a time period of 0.67 s before and after syllable onset. A volume conduction model was derived from the Montreal Neurological Institute (MNI) template brain, resulting in an anatomically realistic three-shell model. The lead-field matrix was estimated using the boundary element method (BEM) for each grid point in the brain. A grid with 5-mm resolution was normalized onto a standard MNI brain in order to calculate group statistics and for illustrative purposes. To establish significance across participants, we used a non-parametric permutation t-statistic (Monte Carlo method; 1000 iterations; α < 0.05; Oostenveld et al., 2010). By this, we identified voxels with statistically significant activity contrasts between the pre- and post-stimulus intervals. Epochs of the pre and post intervals were band-pass filtered in the beta frequency band with a second-order bi-directional Butterworth filter and projected onto the locations determined via DICS. The resulting time series were aligned at stimulus onset per subtask condition (45 signals for visual and motor, 15 signals for auditory-alone) and averaged. We used the post-stimulus intervals to assess the phase relation between the beta-band event-related activities after congruent/incongruent stimuli relative to baseline.
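The group-level permutation statistic can be illustrated with a simple sign-flip Monte Carlo test on per-participant post-minus-pre power contrasts, using the maximum statistic across voxels to control for multiple comparisons. This is a bare-bones sketch of the idea rather than FieldTrip’s implementation; the contrast definition and all names are assumptions.

```python
import numpy as np

def signflip_permutation_t(contrast, n_iter=1000, alpha=0.05, seed=0):
    """contrast: (n_subjects, n_voxels) array of post-minus-pre values per voxel.

    Returns the observed t-values and a Boolean mask of voxels exceeding the
    permutation-derived threshold (maximum statistic over voxels)."""
    rng = np.random.default_rng(seed)
    n_subj = contrast.shape[0]

    def tval(x):
        return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(n_subj))

    t_obs = tval(contrast)
    t_max = np.empty(n_iter)
    for i in range(n_iter):
        flips = rng.choice([-1.0, 1.0], size=(n_subj, 1))  # random sign per subject
        t_max[i] = np.max(np.abs(tval(contrast * flips)))
    threshold = np.quantile(t_max, 1.0 - alpha)
    return t_obs, np.abs(t_obs) > threshold

# Example: 12 participants, 5000 voxels, a handful of truly modulated voxels.
rng = np.random.default_rng(1)
contrast = rng.normal(size=(12, 5000))
contrast[:, :10] += 1.5
t_obs, significant = signflip_permutation_t(contrast)
```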


Phase dynamics. We defined the phase at each source via the analytic signal, i.e. we computed the instantaneous Hilbert phase of the beta-band-filtered, DICS-beamformed EEG. The degree of phase synchrony in the motor and visual conditions was estimated as the phase-locking value (PLV; Mormann et al., 2000) of the difference between the respective phases and the phase in the baseline condition. If the phase in the two conditions is not altered relative to baseline, the corresponding PLV equals 1; otherwise it is smaller, bounded below by 0. The baseline was the auditory-only condition, in which auditory processing was ‘optimal’. Hence, a large PLV implied that the visual or sensorimotor input stream was entrained to the auditory one. PLVs underwent the same statistical assessment as the error-response rates, i.e. a two-way repeated-measures ANOVA with condition (motor/visual) and stimulus (congruent/incongruent) as within-subject factors, and post hoc t-tests with Bonferroni correction when required (partial eta squared as effect size).
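Written out, the measure is PLV = |⟨exp(i[φ_cond(t) − φ_base(t)])⟩|, with φ_cond and φ_base the instantaneous beta-band Hilbert phases of the condition-specific and auditory-only event-related activities, and the average taken over the post-stimulus samples. A minimal sketch with dummy signals (band limits and filter order from the text; everything else illustrative):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def beta_phase(x, fs, band=(15.0, 30.0), order=2):
    # Beta band-pass (2nd-order bi-directional Butterworth) followed by the Hilbert phase.
    b, a = butter(order, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    return np.angle(hilbert(filtfilt(b, a, x)))

def plv(phase_a, phase_b):
    # Phase-locking value of the phase difference: 1 = fully locked, near 0 = no locking.
    return np.abs(np.mean(np.exp(1j * (phase_a - phase_b))))

# Dummy event-related signals in the post-stimulus interval (0.67 s at 1024 Hz).
fs = 1024
t = np.arange(int(0.67 * fs)) / fs
baseline = np.sin(2 * np.pi * 22.5 * t) + 0.1 * np.random.randn(t.size)
congruent = np.sin(2 * np.pi * 22.5 * t + 0.3) + 0.1 * np.random.randn(t.size)

plv_congruent = plv(beta_phase(congruent, fs), beta_phase(baseline, fs))
# A constant phase offset still yields a PLV near 1; phase drift pushes it towards 0.
```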


RESULTS


Behavioral data


For the error-response rates in the motor and visual conditions, we found a significant main effect of condition (F(1,11) = 48.318, p < .001, ηp² = .815): rates were larger for the visual than for the motor subtask. We also found a significant main effect of stimulus (F(1,11) = 81.022, p < .001, ηp² = .88), with larger rates for incongruent than for congruent stimuli. The interaction between condition and stimulus was also significant (F(1,11) = 17.343, p < .005, ηp² = .612); for both the visual and the motor condition, the error-response rate was larger for incongruent than for congruent stimuli (p < .001 and p < .005, respectively). For the incongruent stimuli, the rate was significantly larger in the visual than in the motor condition (p < .001); cf. Fig. 2.


Electrophysiological data

Source localization. DICS beamformer analyses were performed separately for the different frequency bands (theta: 4–8 Hz; alpha: 8–12 Hz; beta: 15–30 Hz; gamma: 30–80 Hz) during the intervals of 0.67 s before and after stimulus presentation. The spatial topographies of source power were estimated separately for all conditions (congruent and incongruent visual and motor subtasks and the auditory-only condition). The power estimates at all grid points (i.e., all voxels) were subsequently contrasted using a non-parametric permutation t-statistic. The analysis revealed contributing areas in temporal, occipital, and parietal cortices, with statistically significant activity during the post as opposed to the pre stimulus interval. The most consistent sources were located around the right STG, which will be highlighted in what follows. An overview of the other sources (and other frequency bands) is provided in Appendix A. The right STG, right supramarginal gyrus, and right Rolandic operculum turned out to be the three major sources contrasting conditions from baseline in the beta band; an example is given in Fig. 3. This underscores the role of STG in the (mis-)perception of sounds. We selected the averaged maximum values of seven voxel locations (MNI coordinates: 47, 31, 16).

Fig. 2. Effect of congruent and incongruent stimuli on syllable intelligibility. The error-response rate relative to baseline across all 12 participants for both congruent and incongruent stimuli during the two conditions (visual and motor). Error bars refer to standard errors. Significant differences between conditions are highlighted with an asterisk (*p < 0.05).


Fig. 3. Example of DICS source projection. Axial, sagittal, and coronal views of the contrast image transformed to MNI template space and overlaid on the template structural image. Possible generators of the beta-band changes were localized in the right hemisphere, from the Rolandic operculum extending to the supramarginal gyrus and superior temporal gyrus. The red areas represent cortical tissue displaying a significant difference between the time periods of 0.67 s before versus after syllable onset. The peak coherence was observed at MNI coordinates (47, 32, 21). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 4. PLV in right STG between event-related activities during visual and motor conditions and baseline averaged over all 12 participants for both congruent and incongruent stimuli. Error bars refer to standard errors. Significant differences between conditions are highlighted with an asterisk (*p < 0.05).


Phase dynamics. The PLV between the event-related beta activities in the right STG during the visual and motor conditions and those in the auditory-only condition revealed a significant main effect of stimulus (F(1,11) = 6.3, p < .05, ηp² = .364), with the PLV being significantly larger for congruent than for incongruent stimuli. The interaction between condition and stimulus was not significant (F(1,11) = 1.643, p > .05, ηp² = .13); see Fig. 4.


DISCUSSION


We used EEG to identify neural correlates of the influence of observing and/or silently articulating congruent/incongruent syllables on auditory perception. We found the perception of auditory syllables to be degraded when the subjects observed and, to a lesser degree, when they silently articulated incongruent syllables. This was accompanied by significant amplitude modulations in the beta frequency band in right superior temporal areas. There, the event-related beta modulations during congruent conditions (i.e. properly perceived syllables) were phase-locked to the responses evoked during the auditory-only condition.

Different studies have evidenced the involvement of the motor system in speech and gestural perception (Fadiga et al., 2002; Komeilipoor et al., 2014). But does that imply that perception is accomplished by the motor system itself? The behavioral results (a) do confirm that speech motor control contributes to listening ability, which supports the general idea of a close link between speech production and perception (Sams et al., 2005; Mochida et al., 2013; Sato et al., 2013). Our EEG source results (b), however, hint at a major involvement of the auditory rather than the motor system in audio-articulatory interaction. The importance of STG in the McGurk illusion, in particular, and in audiovisual speech integration, in general, has already been underscored in several imaging studies (Calvert et al., 2000; Jones and Callan, 2003; Sekiyama et al., 2003; Bernstein et al., 2008; Irwin et al., 2011; Nath and Beauchamp, 2012; Szycik et al., 2012; Erickson et al., 2014). We would like to note that STS/STG is not exclusively implicated in the integration of audiovisual information regarding the McGurk effect; it has been associated with a broad variety of stimuli such as audiovisual presentations of speech (Calvert et al., 2000), letters (Van Atteveldt et al., 2004), emotions (Kreifelts et al., 2007), animals (Beauchamp et al., 2004b), and tools (Beauchamp et al., 2004a,b).


A recent MEG study revealed that the perception of the McGurk illusion is preceded by high beta activity in STG (Keil et al., 2012), meaning that not only the overall activity and excitability of STG but also its dynamics are important. Our results corroborate this suggestion, implying that especially the right STG plays a central role in both audiovisual and audiomotor speech convergence. Apparently, both types of multisensory integration rely on the same neural mechanism. fMRI-guided TMS delivered over the left STG yields a significant reduction in the perception of the McGurk illusion (Beauchamp et al., 2010). TMS may thus be used to test whether inhibition of the right STG, similar to the left one (Beauchamp et al., 2010), would lead to a reduction in the perception of the McGurk illusion as well as in the effect of silent articulation on the perception of speech syllables.

In line with Keil et al. (2012), the beta frequency band was most informative when pinpointing STG. More importantly, however, our finding of pronounced phase locking between the event-related activities during congruent conditions and the auditory-only baseline suggests that proper timing between the inputs from different modalities is mandatory for their proper integration. If stimuli were incongruent, it appears that additional processing of (one of) the individual inputs would yield a beta desynchronization in STG and thus a misperception of the sound.

Taken together, the influence of observing and silently articulating congruent/incongruent syllables on speech perception reflects multisensory-motor interaction mechanisms. We showed that in both cases auditory perception was changed and that this was accompanied by – and might be due to – a modulation of the activity in superior temporal auditory areas. The activity in the primary motor cortex might be a good candidate for the source of this modulation. In fact, it has been shown that both listening to speech syllables and observing a mouth producing them activate the primary motor areas used in controlling the articulation of the perceived stimuli (Watkins et al., 2003). This suggests that the perception of speech syllables, which is linked to the activation of articulatory motor areas in a somatotopic manner (Fadiga et al., 2002), can be degraded if these areas are activated by articulation or observation of a discordant syllable. It may also be the case that these articulatory motor areas then send, either directly or indirectly, articulatory codes to auditory cortical areas in STG, which ultimately interfere with the activity caused by the heard syllables and result in misperception of the syllables. Sams et al. (2005) suggested Broca’s area as a good candidate for modulating activity in the auditory cortex during the McGurk and articulatory effects. This is in accord with results reporting increased activity and sub-additive responses in Broca’s area during the perception of incongruent audiovisual speech stimuli compared with congruent or unimodal conditions (Calvert et al., 2000; Jones and Callan, 2003; Sekiyama et al., 2003). Skipper et al. (2007) suggested a model for audiovisual speech integration in which multisensory speech representations in STG areas are projected onto speech motor control commands localized in Broca’s area, which are then mapped to the motor commands in ventral premotor and primary motor cortices that are used in speech production.

Please cite this article in press as: Komeilipoor N et al. Involvement of superior temporal areas in audiovisual and audiomotor speech integration. Neuroscience (2016), http://dx.doi.org/10.1016/j.neuroscience.2016.03.047

426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486

NSC 17011

No. of Pages 8

29 March 2016

N. Komeilipoor et al. / Neuroscience xxx (2016) xxx–xxx

506

The activated motor commands can be used to predict the acoustic and somatosensory consequences of those movements and thereby constrain the final phonetic interpretation of the incoming sensory information (Skipper et al., 2007). Given our current results, we hypothesize that a very similar mechanism may also explain the effect of silently articulating incongruent syllables. Speculating about further details is beyond the scope of the current study, not least because using only 64-channel EEG comes with certain limitations. Future studies should address the involvement of M1, STS/STG, and Broca’s area with higher spatial resolution; we suggest employing the (modified) experimental protocol of Mochida et al. (2013), as it allows for tackling not only the integration of auditory and visual input but also that of motor information.


REFERENCES


Arnal LH, Giraud A-L (2012) Cortical oscillations and sensory predictions. Trends in cognitive sciences 16:390–398.
Arnal LH, Morillon B, Kell CA, Giraud A-L (2009) Dual neural routing of visual facilitation in speech processing. The Journal of neuroscience 29:13445–13453.
Beauchamp MS, Argall BD, Bodurka J, Duyn JH, Martin A (2004) Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nat Neurosci 7:1190–1192.
Beauchamp MS, Lee KE, Argall BD, Martin A (2004) Integration of auditory and visual information about objects in superior temporal sulcus. Neuron 41:809–823.
Beauchamp MS, Nath AR, Pasalar S (2010) FMRI-guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. J Neurosci 30:2414–2417.
Bell AJ, Sejnowski TJ (1995) An information-maximization approach to blind separation and blind deconvolution. Neural Comput 7:1129–1159.
Bernstein LE, Lu Z-L, Jiang J (2008) Quantified acoustic–optical speech signal incongruity identifies cortical sites of audiovisual speech processing. Brain Res 1242:172–184.
Calvert GA, Bullmore ET, Brammer MJ, Campbell R, Williams SC, McGuire PK, Woodruff PW, Iversen SD, David AS (1997) Activation of auditory cortex during silent lipreading. Science 276:593–596.
Calvert GA, Campbell R, Brammer MJ (2000) Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biol 10:649–657.
Erickson LC, Zielinski BA, Zielinski JEV, Liu G, Turkeltaub PE, Leaver AM, Rauschecker JP (2014) Distinct cortical locations for integration of audiovisual speech and the McGurk effect. Frontiers Psychol 5.
Fadiga L, Craighero L, Buccino G, Rizzolatti G (2002) Speech listening specifically modulates the excitability of tongue muscles: a TMS study. European J Neurosci 15:399–402.
Gick B, Derrick D (2009) Aero-tactile integration in speech perception. Nature 462:502–504.
Gross J, Kujala J, Hämäläinen M, Timmermann L, Schnitzler A, Salmelin R (2001) Dynamic imaging of coherent sources: studying neural interactions in the human brain. Proc Nat Acad Sci USA 98:694–699.
Houweling S, Beek PJ, Daffertshofer A (2010) Spectral changes of interhemispheric crosstalk during movement instabilities. Cereb Cortex 20:2605–2613.


Irwin JR, Frost SJ, Mencl WE, Chen H, Fowler CA (2011) Functional activation for imitation of seen and heard speech. J Neurol 24:611–618.
Ito T, Tiede M, Ostry DJ (2009) Somatosensory function in speech perception. Proc Nat Acad Sci 106:1245–1248.
Jones JA, Callan DE (2003) Brain activity during audiovisual speech perception: an fMRI study of the McGurk effect. NeuroReport 14:1129–1133.
Karnath H-O (2001) New insights into the functions of the superior temporal cortex. Nat Rev Neurosci 2:568–576.
Kauramäki J, Jääskeläinen I, Hari R, Möttönen R, Rauschecker J, Sams M (2010) Transient adaptation of auditory cortex organization by lipreading and own speech production. J Neurosci 30:1314–1321.
Keil J, Müller N, Ihssen N, Weisz N (2012) On the variability of the McGurk effect: audiovisual integration depends on prestimulus brain states. Cereb Cortex 22:221–231.
Komeilipoor N, Vicario CM, Daffertshofer A, Cesari P (2014) Talking hands: tongue motor excitability during observation of hand gestures associated with words. Frontiers Hum Neurosci 8.
Kreifelts B, Ethofer T, Grodd W, Erb M, Wildgruber D (2007) Audiovisual integration of emotional signals in voice and face: an event-related fMRI study. NeuroImage 37:1445–1456.
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748.
Mochida T, Kimura T, Hiroya S, Kitagawa N, Gomi H, Kondo T (2013) Speech misperception: speaking and seeing interfere differently with hearing. PLoS One 8:e68619.
Mormann F, Lehnertz K, David P, Elger CE (2000) Mean phase coherence as a measure for phase synchronization and its application to the EEG of epilepsy patients. Phys D 144:358–369.
Nath AR, Beauchamp MS (2012) A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage 59:781–787.
Numminen J, Curio G (1999) Differential effects of overt, covert and replayed speech on vowel-evoked responses of the human auditory cortex. Neurosci Lett 272:29–32.
Numminen J, Salmelin R, Hari R (1999) Subject’s own speech reduces reactivity of the human auditory cortex. Neurosci Lett 265:119–122.
Oostenveld R, Fries P, Maris E, Schoffelen J-M (2010) FieldTrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comp Int Neurosci 2011.
Rauschecker JP, Scott SK (2009) Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat Neurosci 12:718–724.
Sams M, Möttönen R, Sihvonen T (2005) Seeing and hearing others and oneself talk. Cog Brain Res 23:429–435.
Sato M, Troille E, Ménard L, Cathiard M-A, Gracco V (2013) Silent articulation modulates auditory and audiovisual speech perception. Exp Brain Res 227:275–288.
Sekiyama K, Kanno I, Miura S, Sugita Y (2003) Auditory-visual speech perception examined by fMRI and PET. Neurosci Res 47:277–287.
Senkowski D, Schneider TR, Foxe JJ, Engel AK (2008) Crossmodal binding through neural coherence: implications for multisensory processing. Trends Neurosci 31:401–409.
Skipper JI, van Wassenhove V, Nusbaum HC, Small SL (2007) Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cereb Cortex 17:2387–2399.
Stein BE, Stanford TR (2008) Multisensory integration: current issues from the perspective of the single neuron. Nat Rev Neurosci 9:255–266.
Szycik GR, Stadler J, Tempelmann C, Münte TF (2012) Examining the McGurk illusion using high-field 7 Tesla functional MRI. Frontiers Hum Neurosci 6.
Tian X, Poeppel D (2010) Mental imagery of speech and movement implicates the dynamics of internal forward models. Frontiers in Psychology 1:166.
Tiippana K (2014) What is the McGurk effect? Frontiers Psychol 5.


Van Atteveldt N, Formisano E, Goebel R, Blomert L (2004) Integration of letters and speech sounds in the human brain. Neuron 43:271–282.
Van Wijk B, Daffertshofer A, Roach N, Praamstra P (2009) A role of beta oscillatory synchrony in biasing response competition? Cereb Cortex 19:1294–1302.

Watkins KE, Strafella AP, Paus T (2003) Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia 41:989–994.


APPENDIX A


See Tables A1–A3.


Table A1. Sources localized using DICS contrasting pre versus post syllable presentation periods during congruent and incongruent visual subtasks. Frequency ranges considered: theta (4–8 Hz), alpha (8–12 Hz), beta (15–30 Hz), gamma (30–80 Hz).
Visual subtask, congruent: Calcarine R; Rolandic oper R; Heschl R; Temporal sup R; Temporal sup R; SupraMarginal R; Occipital sup R.
Visual subtask, incongruent: Postcentral R; Precentral R; Parietal sup L; Occipital sup R.

Table A2. Sources localized using DICS contrasting pre versus post syllable presentation periods during congruent and incongruent motor subtasks. Frequency ranges considered: theta (4–8 Hz), alpha (8–12 Hz), beta (15–30 Hz), gamma (30–80 Hz).
Motor subtask, congruent: Calcarine R; Postcentral R; SupraMarginal R; Rolandic oper R; SupraMarginal R; Temporal sup R; Temporal sup R.
Motor subtask, incongruent: Calcarine L; Temporal inf R; Rolandic oper R; SupraMarginal R; Temporal sup R; Occipital sup R.

Table A3. Source areas localized using DICS contrasting pre versus post syllable presentation periods during the auditory-alone condition. Frequency ranges considered: theta (4–8 Hz), alpha (8–12 Hz), beta (15–30 Hz), gamma (30–80 Hz).
Auditory-alone: Postcentral R; Postcentral R; Rolandic oper R; SupraMarginal R; Temporal sup R; SupraMarginal R; Temporal sup R.

(Accepted 16 March 2016) (Available online xxxx)
