Registered Report
Auditory–visual integration during nonconscious perception

April Shi Min Ching*, Jeesun Kim and Chris Davis

The MARCS Institute, Western Sydney University, Australia
Article history: Protocol received: June 5, 2017; Protocol accepted: December 15, 2017; Received 4 December 2018; Reviewed 29 January 2019; Revised 11 February 2019; Accepted 12 February 2019; Action editor: Zoltan Dienes; Published online 1 March 2019

Keywords: Multimodal integration; Subliminal; Consciousness; Priming

Abstract

Our study proposes a test of a key assumption of the most prominent model of consciousness – the global workspace (GWS) model (e.g., Baars, 2002, 2005, 2007; Dehaene & Naccache, 2001; Mudrik, Faivre, & Koch, 2014). This assumption is that multimodal integration requires consciousness; however, few studies have explicitly tested if integration can occur between nonconscious information from different modalities. The proposed study examined whether a classic indicator of multimodal integration – the McGurk effect – can be elicited with subliminal auditory–visual speech stimuli. We used a masked speech priming paradigm developed by Kouider and Dupoux (2005) in conjunction with continuous flash suppression (CFS; Tsuchiya & Koch, 2005), a binocular rivalry technique for presenting video stimuli subliminally. Applying these techniques together, we carried out two experiments in which participants categorised auditory syllable targets which were preceded by subliminal auditory–visual (AV) speech primes. Subliminal AV primes were either illusion-inducing (McGurk) or illusion-neutral (Incongruent) combinations of speech stimuli. In Experiment 1, the categorisation of the syllable target ("pa") was facilitated by the same syllable prime when it was part of a McGurk combination (auditory "pa" and visual "ka") but not when part of an Incongruent combination (auditory "pa" and visual "wa"). This dependency on specific AV combinations indicated a nonconscious AV interaction. Experiment 2 presented a different syllable target ("ta") which matched the predicted illusory outcome of the McGurk combination – here, both the McGurk combination (auditory "pa" and visual "ka") and the Incongruent combination (auditory "ta" and visual "ka") failed to facilitate target categorisation. The combined results of both Experiments demonstrate a type of nonconscious multimodal interaction that is distinct from integration – it allows unimodal information that is compatible for integration (i.e., McGurk combinations) to persist and influence later processes, but does not actually combine and alter that information. As the GWS model does not account for non-integrative multimodal interactions, this places some pressure on such models of consciousness.

© 2019 Elsevier Ltd. All rights reserved.
* Corresponding author. The MARCS Institute, Western Sydney University, Locked Bag 1797, Penrith, NSW 2751, Australia. E-mail addresses: [email protected], [email protected] (A.S.M. Ching).
https://doi.org/10.1016/j.cortex.2019.02.014
0010-9452/© 2019 Elsevier Ltd. All rights reserved.
1. Introduction
Signals from the world are registered via separate sensory organs, whose receptors process different physical properties in markedly different ways. Yet, our conscious experience of the world is not partitioned by sensory modality, but is integrated and multisensory. When does distributed sensory information become integrated? Is this integration related to the unity of conscious experience? A class of prominent theories of consciousness, which we will call global workspace theories (e.g., Baars, 2002, 2005, 2007; Dehaene & Naccache, 2001; Mudrik, Faivre & Koch, 2014), provides a clear answer to both of these questions. These theories propose stimulus processing to take place within a sensoryeperceptualecognitive architecture that has the following structure. Firstly, sensory information is processed in specialised networks, which themselves do not produce a conscious percept. If this sensory signal is strong enough, it is transduced by the sensory system and processed by the appropriate perceptual system. The subsequent products of perceptual processing are then broadcast throughout a global workspace. This global workspace, whose neural substrate is thought to be long-range neurons in the frontoparietal regions of the brain (Dehaene & Changeux, 2011), mediates connections between remote parts of the brain. Entry into the workspace enables information about a stimulus to become conscious. That is, the broadcast of that information through the workspace makes it globally available to otherwise inaccessible sources of information, such as that from other perceptual systems and executive processes. This global availability of information from multiple sources enables their integration into a coherent analysis. In short, the global workspace theories posit a process of broadcast and binding of information which correlates with consciousness, resulting in conscious experience that is integrated and multimodal. Under the above cognitive architecture, a stimulus that is nonconsciously perceived would be processed by the relevant sensory and perceptual systems into post-categorical information, but does not enter the global workspace. Consequently, information about that stimulus is prevented from integrating with information processed by other perceptual systems. That is, the integration of post-categorical information based on the analysis derived by different modalities e which we presently term multimodal integration e requires consciousness. We distinguish this from multisensory integration, which involves the cross-talk of information from different modalities earlier in the processing stream. In our view, although the link between multimodal integration and consciousness is intuitively appealing, a key test of this idea has yet to be conducted.1 This key test is to determine whether multimodal integration can take place in the absence of conscious awareness. A successful 1
This was true at the time of the Stage 1 Registered Reports acceptance in December 2017, but is no longer the case. A study published in Cognition in 2018 (Scott et al., 2018) is the first to have carried out the proposed key test (i.e., examining integration between stimuli of different modalities presented outside of conscious awareness).
demonstration of this would provide evidence for a type of broadcast and binding of information that does not involve consciousness. In other words, it would demonstrate that there can be a workspace where information derived from different senses can be combined, but is not global.

A test of unconscious multimodal integration clearly requires that the information from the different modalities be presented subliminally. Very few multisensory studies have done so (for a review, see Mudrik et al., 2014), because it is a considerable technical challenge to subliminally present information from two different modalities and determine if multimodal interaction occurs. Since the point of our study is to examine integration while global conscious processes are inactive, we are not interested in whether conscious and unconscious information of different modalities can interact. Such demonstrations have already been carried out very extensively (see Deroy, Chen, & Spence, 2014; Mudrik et al., 2014). Moreover, a test of unconscious multimodal integration must also provide an unambiguous measure of what we have called multimodal integration. It is important to understand what we mean by this term, and why we are concentrating on it rather than other forms of sensory–perceptual interaction.

To make it clear what we mean by multimodal integration, consider the following two experiments that do not in our view measure this property. In the first study, by Arzi et al. (2012), sleeping participants were presented with pleasant and unpleasant odours preceded by predictive tones. Their participants were able to learn the associations, as indicated by differences in their frequency of sniffing to tones alone during both sleep and subsequent waking. In the second study, Faivre, Mudrik, Schwartz, and Koch (2014) conducted a congruence priming task in which co-presented spoken and written digits (/wʌn/; 1) served as subliminal primes, and co-presented spoken and written letters (/keɪ/; k) served as supraliminal targets. Participants judged whether the target stimuli were the same or different letters. The congruence judgement was shown to be faster when the letters and digits had the same congruence relationship than when one was congruent and the other incongruent. However, this facilitation was observed only if participants were first trained with a similar procedure using supraliminal primes instead of subliminal ones.

While both studies have successfully demonstrated some form of cross-sensory interaction in the absence of conscious awareness, neither can be construed as a demonstration of nonconscious multimodal integration (though we note that this was not the aim of the Arzi study). This is because their reported effects can be explained in terms of cross-sensory processes without appealing to integration – the combination of unisensory information into an integrated category (Deroy et al., 2014; Massaro, 2009). The learning of the conditioned response in the Arzi study is based on only one of the sensory inputs (the tone). Learning the conditioned response is not likely to involve the combination of two different sensory inputs; instead, the contents of one input may simply be used to predict changes in the other input, while both sources of information remain separate. As for the Faivre et al. study, the finding that facilitation was observed only after training (with
supraliminal stimuli) suggests that cross-sensory influence takes place not through integration, but via semantic representations that were activated by prior elaboration of the supraliminal primes (see also Noel, Wallace & Blake, 2015, p. R157, for a similar view). In sum, these two studies illustrate that not all cross-sensory phenomena necessarily index multimodal integration. An appropriate index of multimodal integration must reflect the presence of integrated information, rather than mere correlation between information in separate sensory channels. In addressing the above concerns, our chosen index of integration is an auditoryevisual phenomenon known as the McGurk effect (MacDonald & McGurk, 1978). Here, auditory speech (e.g., “pa”) coupled with incongruent visual speech (e.g., lip movement for a spoken “ka”) can sometimes result in an illusory auditory outcome (“ta”). The McGurk effect provides an unambiguous index of multimodal integration because the integration of information from the different modalities occurs after unimodal stimulus evaluation (at a post-categorical level). This is apparent in neurophysiological studies of where such integration occurs in the brain, i.e., the McGurk effect acts at the level of the left medial superior temporal sulcus, a cortical area implicated in high-level/ abstract aspects of auditory and articulatory processing (e.g., Beauchamp, Nath, & Pasalar, 2010; Nath & Beauchamp, 2012; Venezia et al., 2017). This late stage integration is consistent with the global workspace conception of multimodal integration, in which information from different modalities is integrated in the global workspace to obtain a perceptual hypothesis that minimises discrepancies between the unimodal analyses.
We have devised a paradigm that enables the subliminal presentation of McGurk auditoryevisual stimuli, and is able to determine if the subliminal components of a McGurk stimulus still result in a McGurk effect. Since it would not be sensible to probe the McGurk effect by asking a participant to identify subliminal auditory speech, we used an implicit measure of McGurk integration, that being changes in the processing of a following target. That is, we modified a masked speech priming paradigm developed by Kouider and Dupoux (2005), and used it in conjunction with continuous flash suppression (CFS; Tsuchiya & Koch, 2005), a binocular rivalry technique for presenting video stimuli subliminally. Using these techniques together, we carried out a primed syllable identification task using subliminal auditoryevisual primes and supraliminal auditory-only targets (Fig. 1). Subliminal auditory primes and supraliminal auditory targets would have either the same or different phonological representations. We expected the identification of auditory targets to be facilitated by repeated subliminal auditory primes, in a phenomenon known as the masked repetition priming effect (Davis, Kim, & Barbaro, 2010; Dupoux, De Gardelle, & Kouider, 2008; Kouider & Dupoux, 2005). To determine if nonconscious multimodal integration occurs, we examined if subliminal visual primes can alter the post-categorical analysis of the subliminal auditory primes. That is, we employed visual speech primes to either nullify or induce a repetition effect. Specifically, we tested whether a McGurk-inducing visual prime can modify a repeat auditory prime to act as a non-repeat prime, or vice versa; this will be detected via the robustness of the repetition priming effect.
Fig. 1 – Schematic description of the experimental stimuli. Auditory stream is represented on top, and dominant and suppressed visual streams below. Each auditory prime is preceded by 4–9 masks (approximately 600–1500 msec), and followed by 3–5 masks (approximately 500 msec). The auditory target is superimposed over this stream, immediately after the auditory prime. The auditory prime is co-presented with visual speech (shown as a face), presented to the non-dominant eye. The auditory mask/prime stream is co-presented with a visual suppressor (shown as a coloured Mondrian), presented to the dominant eye. After the auditory and visual streams, silence and a blank stimulus are presented for 1000 msec, followed by the auditory and visual awareness rating scales. Each visual stream is bounded by a black and white frame and contains a centred fixation cross, which are onscreen at all times (not shown to scale).
In addition to any repetition priming due to the subliminal auditory prime affecting target processing, the subliminal visual prime could also have some influence upon target processing. This would occur due to their proximity in the presentation stream and could contribute to behavioural effects (see van Wassenhove, Grant, & Poeppel, 2007). To determine the degree of visual prime-target interaction, we compared conditions containing the same non-repeat auditory prime (e.g., afa; 'a' subscript indicates auditory stimuli), paired with visual speech which is McGurk-inducing (vka; 'v' subscript indicates visual stimuli) or merely Incongruent (vwa; i.e., mismatched and McGurk non-inducing, see Supplementary Materials, Pilot Experiments 3, and Jiang & Bernstein, 2011, who found no McGurk effect with the alternate bilabial syllable aba and vwa) relative to the target (apa). Given that subliminal visual speech is unable to elicit the McGurk effect in supraliminal auditory speech (Palmer & Ramsey, 2012), we predict that the influence of McGurk and Incongruent visual speech upon the target and subsequent task behaviour should be similar. If this is not the case, it would nonetheless constitute a novel demonstration of cross-sensory interaction between conscious and subliminal stimuli (and go against the Palmer and Ramsey (2012) finding).

If nonconscious multimodal integration is possible, then post-categorical information related to subliminal stimuli of different modalities would be able to influence each other. In other words, under subliminal conditions the phonological representation of auditory speech could be altered by McGurk visual speech while remaining unchanged with Incongruent visual speech. This alteration should occur in both of our experiments. In Experiment 1, we would observe a loss of repetition priming when an auditory repeat prime (prime apa; target apa) is paired with a McGurk visual prime (vka). In Experiment 2, we would observe repetition priming being induced when an auditory non-repeat prime (prime apa; target ata) is paired with a McGurk visual prime (vka). Observing these experimental effects, in conjunction with negligible visual prime-target interaction, would be the most straightforward demonstration of a nonconscious McGurk effect. Conversely, if nonconscious multimodal integration does not occur at all, the influence of McGurk visual primes on reaction times should be equivalent to that of Incongruent visual primes in both Experiments.

Since this procedure to produce a nonconscious McGurk effect is entirely novel, it is unclear whether any observed AV integration will have the properties of a conscious McGurk effect. We have thus proposed two separate experiments to probe how the visual speech information changes the properties of the auditory prime. Experiment 1 tests for a general change in the prime: this experiment tests for a loss in repetition priming, but this can be caused by any change in the phonological representation of the prime (i.e., this change might not necessarily match the auditory percept typical of the McGurk illusion). Experiment 2 tests for a specific change in the prime: here, priming of a target will only occur if a visual speech prime produces a McGurk-like change in the auditory one.

In addition to a set of quality checks (see Results, Participants), we implemented an outcome neutral test (as recommended by the Action Editor; see Methods, Conscious Control
Test). The aim of these tests was to demonstrate that the effects predicted for both Main Experiments can be elicited by conscious, unmasked AV prime stimuli. In piloting these tests (see Supplementary Materials, Pilot Experiments 3), it became clear that the first of these tests (based on the procedure in Experiment 1) would likely produce results that were difficult to interpret, thus the outcome neutral test employed in the study is based on Experiment 2.
2. Method
The current experiments used a modified version of a masked speech priming technique developed by Kouider and Dupoux (2005). The presentation schematic is illustrated in Fig. 1. Here, the auditory prime was time compressed and presented in the midst of spectrally similar masking noise. These auditory masks were created from time compressed and reversed speech samples. The auditory prime was then immediately followed by an uncompressed and louder auditory target. Our modification to Kouider and Dupoux's (2005) original technique was the additional co-presentation of suppressed visual stimuli. Specifically, the auditory prime was co-presented with a visual speech prime to the non-dominant eye and a suppressor video to the dominant eye (i.e., CFS; Tsuchiya & Koch, 2005). Stimulus presentation was done through the Psychophysics Toolbox Matlab extension (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997).
2.1. Materials
In both experiments, primes consisted of co-presented visual and auditory (AV) stimuli while targets were auditory only. The primes and targets were designed around the apa vka McGurk stimulus, which is usually reported to induce an illusory auditory percept of ata in susceptible individuals. There were two experiments, each with different stimulus content (Table 1).
2.1.1. Experiment 1
Experiment 1 tested whether a loss of repetition priming occurs when an auditory repeat prime is paired with a McGurk-inducing visual prime (apa vka prime; apa target). Repetition priming should occur normally when an Incongruent visual prime is used instead (i.e., apa vwa prime; apa target). Incongruent visual primes are incongruent with respect to the auditory prime but do not elicit the McGurk illusion (see Supplementary Materials, Pilot Experiments 3). Targets were either apa or afa in equal proportion. Only trials containing apa targets were analysed; i.e., trials containing afa targets were filler trials and were excluded from analysis. Primes were presented in four conditions, each being a specific combination of auditory and visual speech stimuli. The key condition consisted of the McGurk Repeat stimuli, in which McGurk AV stimuli apa vka preceded apa targets. The Incongruent Repeat condition served as the key comparison condition, presenting the AV incongruent but non-McGurk stimulus apa vwa before apa targets. Lastly, there were the corresponding McGurk Non-Repeat and Incongruent Non-Repeat control conditions, presenting afa vka and afa vwa as primes respectively. Here, the auditory component of the Repeat AV primes was replaced with afa, a non-repeat syllable, resulting in an incongruent non-McGurk AV stimulus. Each prime-target combination consisted of 30 trials, resulting in a total of 240 trials and 120 analysable trials.

Table 1 – Syllabic content of primes and targets in Experiments 1 and 2. Subscript 'a' indicates auditory content and subscript 'v' indicates visual content.

Experiment     Primes                            Target (analysed)   Target (filler)
Experiment 1   Incongruent Repeat: apa vwa       apa                 afa
               Incongruent Non-Repeat: afa vwa
               McGurk Repeat: apa vka
               McGurk Non-Repeat: afa vka
Experiment 2   Repeat: ata vka                   ata                 afa
               McGurk: apa vka
               Non-Repeat: ada vka
               Non-Repeat: afa vka
2.1.2. Experiment 2
Experiment 2 tested if repetition priming can be induced when an auditory non-repeat prime is paired with a McGurk visual prime (apa vka prime; ata target). The effect of the McGurk AV prime was compared with that of a real repetition prime (ata vka prime; ata target). Targets were either ata or afa in equal proportion. Only trials containing ata targets were analysed; trials containing afa targets were filler trials and excluded from analysis. Primes were presented in four conditions, each being a specific combination of auditory and visual speech stimuli. Here the key condition was the McGurk one, in which McGurk AV stimuli apa vka were presented before ata targets. The Repeat condition served as the key comparison condition, presenting the AV incongruent but non-McGurk stimulus ata vka as a prime to ata targets. Lastly, there were the corresponding Non-Repeat control prime conditions, presenting afa vka and ada vka respectively. Here, the auditory component of the Repeat AV primes was replaced with afa and ada, both non-repeat syllables, resulting in incongruent non-McGurk AV stimuli. Each prime-target combination consisted of 30 trials, resulting in a total of 240 trials and 120 analysable trials.
2.2. Stimulus creation
All auditory and visual speech stimuli were created using recordings of a single female speaker. Praat software (Boersma & Weenink, 2017) was used to time-reverse and time-compress audio where needed; time compression was achieved through the pitch synchronous overlap and add (PSOLA) algorithm (Moulines & Charpentier, 1990). FFmpeg (http://www.ffmpeg.org) was used to manipulate and time-compress video. Synthesis of suppressor videos (randomly generated Mondrian patterns that alternate at 2 Hz), and all other sound modifications, were carried out in Matlab (The Mathworks, Natick, MA).

Auditory masks were created using words (nouns and verbs only) extracted from recordings of the speaker reading the IEEE/Harvard sentence corpus (1969). Each word had leading and trailing silences removed, and was then time-compressed to 70% of its original duration, time-reversed, and normalised to an intensity of 50 dB. The 70% compression rate was selected to match that of the auditory primes (as per Kouider and Dupoux (2005)), which was determined in a pilot experiment (see Supplementary Materials, Pilot Experiments 1). This created a set of 1025 reversed speech masks, from which a random selection was presented sequentially to create masking noise as needed. Each auditory prime was preceded by masking noise of approximately 600–1500 msec in duration (random selection of 4–10 masks), and followed by noise of approximately 300–500 msec in duration (random selection of 3–5 masks). The composition and length of masking noise per trial was pre-generated, resulting in two stimulus sets. The assignment of the two sets was counterbalanced between participants; this counterbalance would serve as a check of whether the pre-generated features of the masking noise had an influence on task performance.

The auditory–visual speech primes were created from video recordings of six different syllables ("pa", "ta", "da", "wa", "ka", "fa") by the same speaker. In order to render the videos more amenable to conscious suppression, they were converted to greyscale and gamma compressed to 40% of their original value (which has the effect of lightening shadows and thus reducing contrast). They were then cropped to show only the speaker's mouth and chin. Audio was normalised to an intensity of 50 dB. Frames preceding 300 msec before sound onset and following 300 msec after sound offset were dropped; the resultant videos had an approximate average duration of 500 msec and the resultant auditory syllables an approximate average duration of 300 msec. To create incongruent auditory–visual speech stimuli, the sound streams of different videos were swapped while retaining the same sound onset. Visual and audio streams were then time-compressed to 70% of their original duration. We chose this compression value based on earlier pilot data (see Supplementary Materials, Pilot Experiments 1), as it rendered stimuli more amenable to masking (i.e., they could not be identified reliably after masking) while their syllabic content remained identifiable. A second pilot (see Supplementary Materials, Pilot Experiments 2 and 3) also indicated that 70% time-compressed stimuli were able to elicit the McGurk illusion in participants who were susceptible under normal conditions (i.e., with uncompressed videos).

Auditory targets were uncompressed audio from the video recordings, normalised to an intensity of 70 dB. The targets were dubbed over reversed speech masks, as shown in Fig. 1.
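To make the pre-generation of the masking stream concrete, the following is a minimal R sketch of how one trial's mask composition could be drawn. It is illustrative only (the study's stimulus preparation was carried out in Matlab, Praat, and FFmpeg), and the mask file names are hypothetical.

```r
# Illustrative sketch (not the authors' Matlab code) of pre-generating one
# trial's masking-noise composition: 4-10 reversed-speech masks before the
# prime (~600-1500 msec) and 3-5 masks after it (~300-500 msec), drawn at
# random from the pool of 1025 mask tokens.
set.seed(1)                                    # for reproducibility of the example
mask_pool <- sprintf("mask_%04d.wav", 1:1025)  # hypothetical mask file names

make_trial_masks <- function(pool) {
  list(pre  = sample(pool, sample(4:10, 1)),   # leading masks
       post = sample(pool, sample(3:5, 1)))    # trailing masks
}

# Pre-generate one stimulus set of 240 trials
stimulus_set <- lapply(1:240, function(i) make_trial_masks(mask_pool))
```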
2.3. Experimental procedure
The experiment session was split into two parts. The first part consisted of trials in which participants made a speeded categorisation response to the target syllable (Target Categorisation), followed by non-speeded reports of the audibility and visibility of the auditory–visual prime (Prime Audibility; Prime Visibility). The second part tested whether the participant normally experiences the McGurk effect (McGurk Susceptibility Test). The prime visibility and audibility reports, together with the McGurk Susceptibility Test, were used to determine whether trial data and participants should be excluded (detailed description in Results, Participants).
2.3.1. Target categorisation
Eye dominance was first determined before the experiment by using a variant of Porta's test (Porta, 1593). In this test, the participant was instructed to look at a distant object in the laboratory, stretch their arms out and form a triangle with their index fingers and thumbs, and centre the distant object within the triangle. They then alternately shut each eye; the dominant eye when shut causes a larger perceived change in alignment between object and triangle. The mirror stereoscope is then calibrated for the participant. The participant viewed a test stimulus e a pair of black and white square frames each with a fixation cross in the centre (see Fig. 1) through the stereoscope. The stereoscope is adjusted until the participant's visual fields converge and the two frames are fused stably into a single frame. A test video was used to verify that suppression is successful for the participant. The participant was tested in a semi-dark and quiet room. Participants in both Experiments carried out 10 practice trials, followed by 240 experimental trials split into 8 blocks. The experiment session was self-paced and short breaks could be taken between blocks if desired; participants were advised to at least stretch and look away from the stereoscope for a moment before continuing onto the next block. Each of the four prime conditions occurred with equal frequency within each block; the order of trials was fully randomised within blocks. In each trial, an auditory stream was co-presented with two concurrent visual streams, displayed left and right of fixation on a computer monitor. The content of the auditory stream was as shown in Fig. 1, beginning with an audio-only mask of 600e1500 msec, then the subliminal auditoryevisual syllable prime, followed by an audible audio-only syllable target dubbed over reversed speech masks. The visual component of the prime was presented to the participant's non-dominant eye while a suppressor was presented to the dominant eye. The two visual streams were fused by viewing the monitor through a stereoscope. To stabilise the fusing process, each visual stream was bounded by a black and white frame and contained a fixation cross in the centre. The size of the fused frame was 6.5 by 6.5 visual degrees; the size of the fused fixation cross was .52 by .52 visual degrees. Onsets of the auditory masks and video suppressors coincided at the beginning of the trial and were jittered randomly, so as to prevent participants from using their relative onsets to predict the start of the downstream auditory prime. Offsets of the auditory masks and video suppressors coincided at the end of a trial and occur 300e500 msec after target offset. This is followed by a post-stimulus interval of 800 msec, during which no audio was presented and only the fixation cross and bounding frame were displayed on-screen. Participants were advised to withhold excessive blinking until this interval. In both experiments, participants were told that they would be presented with (1) louder spoken syllables (“pa” and “fa” for Experiment 1; “ta” and “fa” for Experiment 2) in the midst of softer background noise over headphones, concurrent with (2) a fused visual stream through the stereoscope, in which only the frame, fixation, and occasional suppressor should be visible. They were instructed to listen for the louder
syllables and ignore the less intense background sounds, while viewing the fused visual stream through the stereoscope. When a target was presented, participants would have to decide as quickly and accurately as possible which of the two possible syllables were presented and indicate their answer by pressing the corresponding key (left or right shift) with their left or right hand. Key-syllable assignment was counterbalanced across participants.
2.3.2. Prime audibility
To assess prime audibility, we used a combined subjectivee objective measure of awareness scale (see Gelbard-Sagiv, Faivre, Mudrik, & Koch, 2016). This scale was presented immediately after the post-stimulus interval of every trial. In Experiment 1, this scale consisted of eight options, represented by four “PA”s and four “FA”s in four different font sizes (see Fig. 1). Participants made a response to the identity of the prime syllable (select a PA or FA; essentially a twoalternative forced-choice task), combined with a confidence level rating (select the appropriate font size; from smallest to largest these correspond to (1) “pure guess”, (2) “weak experience”, (3) “almost clear experience”, and (4) “clear experience”). The correct and alternative syllables were assigned randomly to the left or right sides of the scale with equal frequency. The participant used the left and right shift keys to move a cursor to the desired option, and then confirmed their selection with the spacebar. In Experiment 2, we presented the same scale and task but with different syllable options. This change was necessary because Experiment 2 presented four possible auditory primes (while Experiment 1 presented only two). For all trials containing a given auditory prime (e.g., “TA”), half of the options was the matching syllable (“TA”) while the other half would be one of the three non-matching syllables (“PA”, “FA”, or “DA”); each of the three non-matching syllables appeared as the alternative choice with equal frequency. The correct and alternative syllable choices were assigned randomly to either sides of the scale with equal frequency. In doing so, we simplified a potential four-choice task to two choices; additionally, all possible combinations of syllables and lefteright assignments were presented with equal frequency across the experiment session, ensuring that the presented options were not predictive of the correct response.
2.3.3. Prime visibility
Once a prime audibility rating was made, the participant was then prompted to rate the visibility of the visual prime using the Perceptual Awareness Scale (Ramsøy & Overgaard, 2004), which provides a measure of subjective awareness. Participants indicated whether they saw (1) nothing but the suppressor i.e., “saw nothing”, (2) a “brief glimpse” of something, (3) an “almost clear experience”, or (4) a “clear experience”, by using the left and right shift keys to select one of four options and confirming with the spacebar. Completing this task initiated the next trial.
2.3.4. McGurk Susceptibility Test
After completing the first part of the experiment session, participants underwent a test to determine if they
experienced the McGurk illusion with uncompressed and 70% time-compressed stimuli. Twenty-four AV speech stimuli were presented on-screen and over headphones. Of these, 12 were based on stimuli presented in the main experiments, presented in four conditions of three items each: (1) Normal McGurk, an uncompressed apa vka; (2) Compressed McGurk, a 70% time-compressed apa vka; (3) Normal Incongruent, an uncompressed apa vwa; and (4) Compressed Incongruent, a 70% time-compressed apa vwa. The remaining 12 were filler items consisting of ada vda, aga vga, afa vfa, and aka vka in equal proportions. At the beginning of each trial a fixation cross was presented onscreen. The fixation remained onscreen until a key press, upon which a randomly selected auditory–visual speech stimulus was presented. Participants were instructed to look at the fixation cross before responding, ensuring that they were watching the lip movement when the stimulus was presented. After the presentation of each auditory–visual stimulus, the participant was presented with a free-response prompt, where they typed in the syllable that was heard. The confirmation of a typed response would then bring up a fixation cross, prompting for initiation of the next trial. Responses to apa vka McGurk stimuli were categorised as either (1) an auditory response ("pa"), (2) a visual response ("ka"), (3) a combination response ("pka"), or (4) a McGurk response ("ta"). This test provided a check of whether our AV stimuli did induce the McGurk effect, as well as a basis for participant exclusion.
2.4. Number of participants and power analysis
We ran a sequential analysis for each experiment, as proposed in Lakens (2014). Initial analysis used sample sizes of 15 participants with an adjusted alpha of .038. In the event that observed effects were marginally significant, we had planned to incorporate data from an additional 5 participants and reanalyse at an alpha of .029. However, since this did not occur in either experiment, we stopped data collection at 15 participants in both cases. The adjusted alphas for sequential analysis were determined using the R-based package GroupSeq (Lakens, 2014).

We selected this number of participants based on power calculations for linear mixed models, which were carried out via Monte Carlo simulation. This technique determines power by simulating a dataset, refitting the model, and testing the fit statistically over a large number of iterations [we used the R-based (R Core Team, 2015) package SIMR; Green & MacLeod, 2016]. The key parameters for the simulations were taken from the findings of a previous pilot experiment (N = 14). This pilot experiment was carried out preliminarily to determine if syllables subjected to Kouider and Dupoux's (2005) auditory masking could still elicit repetition priming, as previous studies had only reported the use of this technique with words. Participants were presented with 20 trials containing aba and afa targets in equal proportions. Targets were preceded by masked aba or afa primes in equal proportions, yielding 10 repetition trials and 10 non-repetition trials. There were no co-presented visual speech stimuli. Participants were instructed to identify the target syllable and indicate their
answer by pressing the corresponding key with their left or right hand. Trial-wise reaction times were analysed in a mixed ANOVA with participant as random factor and Repetition (Repeat/Non-Repeat) as fixed factor. This test indicated a significant main effect of Repetition [F(1,13) = 8.33, p < .01].

To determine what model we should simulate, we considered what pattern of data and statistical outcome would indicate the presence of nonconscious integration in Experiment 1. If nonconscious integration was detected in Experiment 1, we assumed that the McGurk AV stimulus would display a loss of repetition priming while the Incongruent AV stimulus would exhibit normal repetition priming (i.e., the same magnitude as the Repetition main effect in the aforementioned pilot study). We would observe no difference in reaction times between the McGurk Repeat and McGurk Non-Repeat conditions, while reaction times in the Incongruent Repeat condition would be significantly faster than in Incongruent Non-Repeat. In a mixed ANOVA analysis with Repetition (Repeat/Non-Repeat) and Relation (McGurk/Incongruent) as fixed factors and subject as random factor, this pattern of data would exhibit a significant interaction between Repetition and Relation. Thus, we tested for the ability to detect an interaction of small effect size between the Repetition and Relation conditions.

We first needed an estimate of the fixed intercept and slopes – the aforementioned pilot experiment suggested a magnitude of about 30 msec, or a beta coefficient of .05, for the Repetition fixed effect. We thus entered fixed effect betas of 1, .05, 0, and .03 for the intercept, Repetition, Relation, and their interaction, together with a residual variance of .09, describing a model in which the repetition priming effect was present for the Incongruent conditions but weaker for the McGurk conditions. Secondly, we needed estimates of the variances and covariances of the random effects. Based on the pilot, we assigned variances of .04 and .001 for the random intercept and random slope of Repetition respectively. Lastly, we assumed a variance of .001 for the random slope of Relation, and 0 for all covariances. The power estimate based on 1000 rounds of simulation, a sample size of 15, and 25 repetitions per condition at an alpha of .038 was .878. The power estimate based on 1000 rounds of simulation, a sample size of 20, and 25 repetitions per condition at an alpha of .029 was .964.

We also carried out simulations of Type 1 error rates at sample sizes of 15 and 20 with a modified version of the R-based function phack authored by Sherman (2014) (http://rynesherman.com/phack.r). At 5000 repetitions, the simulation revealed a Type 1 error of .023 at a sample size of 15 and .037 at a sample size of 20, which is well below the typical alpha of .05.

We did not carry out an explicit power analysis for Experiment 2 because, unlike Experiment 1, nonconscious integration would likely be indicated by an absence of an interaction between Repetition and Relation (i.e., similar repetition priming effects for both McGurk and Incongruent conditions). The subsequent power analysis would involve simulating a model with only a significant Repetition main effect. However, since Experiment 2 involves the same effect size values and statistical tests as Experiment 1, and the present analysis already indicates sufficient power for Experiment 1, the same
sequential testing process ought to be adequate for Experiment 2. As such we used the same power analysis outcome for Experiments 1 and 2 (i.e., 15 participants at α = .038; an additional 5 participants at α = .029).
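To make the simulation concrete, the sketch below re-expresses the power analysis with the simr package using the parameters reported above. The data grid, variable names, and the use of a likelihood-ratio comparison are our assumptions; this is not the authors' script.

```r
# Sketch of the Monte Carlo power analysis with simr (assumed names/structure).
library(simr)   # loads lme4 as well

# 15 subjects x 2 (Repetition) x 2 (Relation) x 25 trials per cell
design <- expand.grid(Subject    = factor(1:15),
                      Repetition = c("Repeat", "NonRepeat"),
                      Relation   = c("Incongruent", "McGurk"),
                      trial      = 1:25)

# Fixed-effect betas (intercept, Repetition, Relation, interaction) from the pilot
fixed <- c(1, .05, 0, .03)

# By-subject random effects: variances for the intercept and the two slopes,
# covariances set to 0; residual variance .09 (sigma = .3)
V <- diag(c(.04, .001, .001))

model <- makeLmer(rt ~ Repetition * Relation + (Repetition + Relation | Subject),
                  fixef = fixed, VarCorr = V, sigma = sqrt(.09), data = design)

# Power to detect the Repetition x Relation interaction via a likelihood-ratio
# comparison against the main-effects-only model, at the adjusted alpha of .038
powerSim(model, test = fcompare(~ Repetition + Relation), nsim = 1000, alpha = .038)
```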
2.5. Predicted outcomes

In both Experiments, we expected an absence of priming in both Non-Repeat conditions (i.e., similar reaction times in McGurk Non-Repeat and Incongruent Non-Repeat). Meanwhile, a genuine repetition effect was expected in Incongruent Repeat trials (i.e., faster reaction times in Incongruent Repeat than Incongruent Non-Repeat). Reaction times in the remaining condition – McGurk Repeat – were key to our interpretation and were affected differently in each experiment. In Experiment 1, if auditory–visual integration can take place nonconsciously, repetition priming would be disrupted in McGurk Repeat trials (i.e., significantly slower reaction times in McGurk Repeat than Incongruent Repeat). This would be reflected in ANOVA as an interaction of Repetition (i.e., auditory prime type; apa or afa) and Relation (i.e., visual prime type; vka or vwa). Conversely, if integration has not occurred, the ANOVA should indicate only a significant main effect of Repetition with no interaction. In Experiment 2, if auditory–visual integration can take place nonconsciously, repetition priming would be induced in McGurk trials (i.e., significantly faster reaction times in McGurk than Non-Repeat). This effect would be present in a model with only a Repetition main effect, and pairwise comparisons would indicate significantly different reaction times in the McGurk and Non-Repeat conditions. Conversely, if integration has not occurred, the ANOVA will indicate an interaction in which the effect of Repetition is significant, but pairwise comparisons would show non-significant differences in McGurk and Non-Repeat reaction times.

2.6. Conscious Control Test

A series of pilot experiments was carried out with a different and separate set of participants (N = 20) to demonstrate that the effects predicted for the Main Experiments can be elicited by conscious, unmasked AV prime stimuli. This experiment extends the initial dataset of four participants reported in the Stage 1 report (see Supplementary Materials, Pilot Experiments 3). Participants first carried out a McGurk Susceptibility Test to ascertain that they experienced the McGurk effect under normal (i.e., conscious) presentation conditions. Participants next carried out a Conscious Control Test, which was a modified version of Experiment 2. Here, the auditory–visual primes were rendered consciously perceivable by compromising the auditory and visual masking, while as much as possible of the original procedure was retained.

2.6.1. Method

The presentation scheme within the Conscious Control Test was based on that of the Main Experiments (Fig. 1) while differing in three aspects: (1) the auditory prime was normalised to 60 dB instead of 55 dB; (2) 50 msec of silence was introduced between the onset of the auditory prime and the offset of the preceding mask; and (3) the visual prime was presented to both eyes onscreen, without suppression. These changes served to render the auditory and visual primes consciously perceivable. All other aspects of stimulus creation were as described in the Materials and Stimulus Creation subsections.

2.6.2. Participants

Twenty undergraduate and graduate students from Western Sydney University participated in the study for monetary or course credit reimbursement. All participants reported normal or corrected-to-normal vision, normal hearing, no history of neurological disorders, and high proficiency in English (native English speaker, or at least 12 years of English language education). Two participants were excluded from the analysis due to poor task performance (70% and 24% correct target responses). Three participants were excluded for making only auditory responses during the McGurk Susceptibility Test (i.e., McGurk stimuli did not elicit an illusory percept). We thus retained fifteen participants for analysis. All retained participants gave at least three non-auditory responses to the six McGurk stimulus trials (co-presented visual "ka" and spoken "pa"; three at 70% duration compression and three uncompressed) during the McGurk Susceptibility Test. We identified the most common response type to the McGurk stimulus for each participant – this was the McGurk percept ("ta") for eleven participants, the visual percept ("ka") for one participant, and equal numbers of McGurk and visual responses for three participants. These participants also gave at least 80% correct responses in the Target Categorisation task.

2.6.3. Results
Trial-wise reaction times were submitted to a linear mixed effects analysis to determine the effects of auditory–visual speech primes. The model included fixed effects of Prime Type (McGurk, Repeat, Non-Repeat), as well as by-Subject random intercepts and random slopes of Prime Type. Trials with incorrect target responses and reaction times less than 200 msec or more than 2000 msec were excluded from analysis; within the retained dataset, reaction times more or less than 2 standard deviations from the mean for each combination of participant and condition were replaced by the relevant cut-off (Wilcox, 1995). Trial-wise accuracies (see Table 2 for a summary) were also submitted to binomial mixed effects models with the same factors; however, as these did not yield any significant effects, we restrict our discussion to effects on reaction time.

In the Conscious Control Test (Fig. 2), there was a significant main effect of Prime Type [χ²(1) = 28.8, p < .001]. Pairwise comparisons for each combination of Prime Type were then carried out using Tukey contrasts with Bonferroni–Holm corrections. Reaction times associated with all three types of auditory–visual primes were significantly different from each other (p < .001 for all three comparisons) – reaction times were fastest with Repeat primes (M = .684 sec; SD = .191), followed by McGurk primes (M = .761 sec; SD = .187), then Non-Repeat primes (M = .804 sec; SD = .178).

Table 2 – Mean accuracies (percentage correct) for each of the four auditory–visual primes in the Conscious Control Test. Standard error of mean in parentheses.

Experiment and prime             Accuracy (%)
Conscious control (ata target)
  ata vka                        97.7 (.672)
  apa vka                        96.1 (.762)
  ada vka                        94.2 (1.48)
  afa vka                        94.4 (1.01)

Fig. 2 – Mean reaction times for each of the four auditory–visual primes in the Conscious Control Test. Error bars correspond to 95% CI.

3. Results

3.1. Participants

In total, seventeen undergraduate and graduate students from Western Sydney University gave written informed consent and participated in the study for monetary reimbursement: sixteen participated in both Experiments 1 and 2, one in only Experiment 2. One participant from Experiment 1 and two participants from Experiment 2 were excluded due to greater than chance performance in the auditory prime identification task (see Awareness of Auditory Primes below for details). Fifteen participants per Experiment were thus retained for analysis. All participants reported normal or corrected-to-normal vision, normal hearing, no history of neurological disorders, and high proficiency in English (native English speaker, or at least 12 years of English language education). Normal vision and hearing were also confirmed on-site with a Snellen chart test and air conduction audiometry at 500 Hz, 1 kHz, and 2 kHz.

All retained participants gave at least four non-auditory responses to the six McGurk stimulus trials (co-presented visual "ka" and spoken "pa"; three at 70% duration compression and three uncompressed) during the McGurk Susceptibility Test. We identified the dominant response type to the McGurk stimulus for each participant – this was the McGurk percept ("ta") for nine participants, the visual percept ("ka") for four participants, and equal numbers of McGurk and visual responses for four participants. These participants also rated Prime Visibility and Prime Audibility as "almost clear experience" or "clear experience" on less than 20% of total trials (48 of 240 trials), and gave at least 80% correct responses in the Target Categorisation task.

3.2. Awareness of auditory primes
To test for subliminality of the auditory primes, participants' responses to the objective component of the Prime Audibility test were entered into a mass-at-chance model (MAC; Rouder, Morey, Speckman, & Pratte, 2007), a hierarchical model within the Bayesian framework that can estimate whether a given participant is performing at chance. The key assumption of the MAC model is that each participant has a true latent ability – negative values correspond to at-chance performance (i.e., subliminality), while positive values are above chance (i.e., supraliminality). In addition, the distribution of latent ability in the population is assumed to be normal. Taken together, one can determine the posterior probability that a participant's latent ability is less than zero; if this posterior probability is below a criterion (.95, as recommended by Rouder et al., 2007), the participant's performance is judged to be above chance.

Since the two Experiments had the same procedure and differed only in stimulus content, data from both Experiments were entered into the same model to maximise statistical power. Only data that were under consideration for the subliminal priming analysis – non-filler trials with correct target responses and no/weak visual and auditory awareness – were submitted to the MAC model. We also excluded data from one participant in Experiment 2, who displayed both low subliminality (no/weak visual and auditory awareness indicated for only 26.6% of trials) and high prime identification accuracy (85.4% correct responses). After all exclusions, 32 datasets with an average of 110 trials per participant and Experiment (Experiment 1: N = 16; Experiment 2: N = 16) were entered into the model.

Participants generally indicated very low awareness of the auditory prime – the average proportion of trials rated "no experience" or "weak experience" was 96.0% (SE = .536%) in Experiment 1 and 95.3% (SE = 1.12%) in Experiment 2. Within these trials, participants' accuracy in the auditory prime identification task was consistently low (Experiment 1: M = 49.7%, SE = .421%; Experiment 2: M = 50.4%, SE = 1.53%). MAC model estimates of posterior probabilities for all 32 datasets are displayed in Fig. 3. Based on these results, 15 of 16 participants in Experiment 1 and 14 of 16 participants in Experiment 2 had posterior probabilities greater than .95, and thus were judged to be performing at chance and their data retained for the subliminal priming analysis. One participant in Experiment 1 displayed a marginally acceptable posterior probability of .942 – because previous similar applications of the MAC model have suggested the .95 criterion to be too strict (e.g., Finkbeiner, 2011), we provisionally retained this participant's data for analysis.
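For concreteness, the per-participant screening quantities reported in this subsection (the proportion of low-awareness trials, and prime-identification accuracy within those trials) could be tabulated as in the R sketch below. The data frame and column names are assumptions, and the sketch is not the MAC model itself, which additionally pools participants hierarchically.

```r
# Sketch (assumed column names) of the per-participant awareness screening:
# proportion of non-filler trials rated "no experience"/"weak experience" for
# the auditory prime, and 2AFC prime-identification accuracy within those trials.
library(dplyr)

screening <- trials %>%                        # `trials`: one row per non-filler trial
  group_by(Experiment, Subject) %>%
  summarise(prop_low_awareness = mean(audibility_rating <= 2),
            id_accuracy        = mean(prime_id_correct[audibility_rating <= 2]),
            .groups = "drop")

# The trial-level identification responses (not these summaries) feed the MAC
# model, which yields each participant's posterior probability of at-chance ability.
```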
Fig. 3 – Estimated posterior probabilities from the MAC model for each individual dataset in Experiments 1 (circles) and 2 (triangles) as a function of accuracy in the prime audibility task. The dashed line indicates the .95 criterion.
3.3. Effects of auditory–visual primes: Experiment 1
Trial-wise reaction times in Experiment 1 were submitted to a linear mixed effects analysis to determine the effects of auditory–visual speech primes. The model included fixed effects of Stimulus Set, Repetition (i.e., auditory prime type; apa or afa), Relation (i.e., visual prime type; vka or vwa), and the interaction between Repetition and Relation, as well as by-Subject and by-Set random intercepts and random slopes of Repetition and Relation. Trials with incorrect target responses, "almost clear" or "clear" ratings for prime audibility or visibility, and reaction times less than 300 msec or more than 2000 msec were excluded from analysis. All these procedures resulted in the exclusion of an average of 8.70% of the 120 analysable trials per participant (M = 10.5 trials; range = 1–19). Within the retained dataset, reaction times greater or less than 2 standard deviations from the mean for each combination of participant and condition were replaced by the relevant cut-off (Wilcox, 1995). As a consequence of planned sequential analysis, these tests use an adjusted alpha of .038.
Table 3 – Mean accuracies (percentage correct) for each of the four auditory–visual primes in Experiments 1 and 2. Standard error of mean in parentheses.

Experiment and prime          Accuracy (%)
Experiment 1 (apa target)
  apa vwa                     96.0 (.907)
  apa vka                     97.2 (.913)
  afa vwa                     93.0 (2.23)
  afa vka                     94.7 (1.15)
Experiment 2 (ata target)
  ata vka                     95.0 (1.22)
  apa vka                     96.1 (1.20)
  ada vka                     96.2 (1.13)
  afa vka                     97.2 (.892)
Trial-wise accuracy data (see Table 3 for a summary) were also submitted to binomial mixed effects models with the same factors; however, as these did not yield any significant effects, we restrict our discussion to effects on reaction time. In Experiment 1 (see Fig. 4, left panel), there was a significant interaction of Repetition and Relation [χ²(1) = 9.93, p < .001]. Further analyses were then carried out to evaluate the effect of Relation in repetition and non-repetition trials separately. Relation was marginally significant for repetition trials [χ²(1) = 3.84, p = .050], in which reaction times were faster with McGurk visual primes (M = 1.027 sec; SD = .0345 sec) relative to Incongruent visual primes (M = 1.051 sec; SD = .0352 sec). Conversely, Relation was not significant for non-repetition trials [χ²(1) = 2.70, p = .101], indicating no difference between McGurk (M = 1.046 sec; SD = .0314 sec) and Incongruent visual primes (M = 1.028 sec; SD = .0344 sec). The results of Experiment 1 indicate processing differences for subliminal McGurk audio-visual stimuli (i.e., apa vka) relative to incongruent pairs (apa vwa, afa vka, afa vwa).
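A minimal sketch of this analysis is given below, assuming hypothetical column names; the by-Set random terms reported above are omitted for brevity, so it illustrates the modelling approach rather than reproducing the registered model exactly.

```r
# Sketch of the Experiment 1 mixed-model analysis (assumed data frame `exp1`
# with columns rt, StimulusSet, Repetition, Relation, Subject).
library(lme4)

m_full    <- lmer(rt ~ StimulusSet + Repetition * Relation +
                    (1 + Repetition + Relation | Subject),
                  data = exp1, REML = FALSE)
m_reduced <- update(m_full, . ~ . - Repetition:Relation)

# Likelihood-ratio (chi-square) test of the Repetition x Relation interaction
anova(m_reduced, m_full)

# Follow-up: effect of Relation within repetition trials only (analogous code
# applies to the non-repetition trials)
m_rep <- lmer(rt ~ Relation + (1 + Relation | Subject),
              data = subset(exp1, Repetition == "Repeat"), REML = FALSE)
anova(update(m_rep, . ~ . - Relation), m_rep)
```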
3.4. Effects of auditory–visual primes: Experiment 2
Trial-wise reaction times for Experiment 2 were submitted to a linear mixed effects analysis to determine the effects of auditory–visual speech primes. The model included a fixed effect of Prime Type (McGurk, Repeat, Non-Repeat), as well as by-Subject random intercepts and random slopes of Prime Type. Trials with incorrect target responses, "almost clear" or "clear" ratings for prime audibility or visibility, and reaction times less than 300 msec or more than 2000 msec were excluded from analysis. All these procedures resulted in the exclusion of an average of 8.13% of the 120 analysable trials per participant (M = 9.8 trials; range = 1–24). Within the retained dataset, reaction times more or less than 2 standard deviations from the mean for each combination of participant and condition were replaced by the relevant cut-off (Wilcox, 1995). As a consequence of planned sequential analysis, these tests use an adjusted alpha of .038. Trial-wise accuracies (see Table 3 for a summary) were also submitted to binomial mixed effects models with the same factors; however, as these did not yield any significant effects for either Experiment, we restrict our discussion to effects on reaction time.

In Experiment 2 (see Fig. 4, right panel), there were no significant effects involving reaction time. To ascertain if the interference effect associated with repetition primes in Experiment 1 was also present in Experiment 2, a pairwise comparison between the repeat prime ("ta") and non-repeats ("fa" and "da") was carried out using Tukey contrasts with Bonferroni–Holm corrections. Reaction times for repeat primes (M = 1.035 sec) were not significantly different from non-repeat primes (M = 1.026 sec, p > .1). In short, there was no evidence in Experiment 2 that subliminal McGurk audio-visual stimuli (apa vka) produced a representation that matched their predicted illusory effect (ata).
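The Prime Type model and the pairwise contrasts used here (and in the Conscious Control Test) can be sketched as follows; the data frame and column names are assumed, and multcomp is named as one way to obtain Tukey-style contrasts with Holm correction rather than as the authors' exact code.

```r
# Sketch of the Experiment 2 / Conscious Control analysis (assumed data frame
# `exp2` with columns rt, PrimeType, Subject; PrimeType is a factor with
# levels such as "McGurk", "Repeat", "NonRepeat").
library(lme4)
library(multcomp)

m_prime <- lmer(rt ~ PrimeType + (1 + PrimeType | Subject),
                data = exp2, REML = FALSE)

# Pairwise (Tukey-style) contrasts between prime types, Holm-adjusted
pairs <- glht(m_prime, linfct = mcp(PrimeType = "Tukey"))
summary(pairs, test = adjusted("holm"))
```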
Fig. 4 – Mean reaction times for each of the four auditory–visual primes in Experiments 1 (left panel; apa target) and 2 (right panel; ata target). Error bars correspond to 95% CI.

4. Discussion

In this study, we set out to investigate if multimodal integration could occur for visual and auditory stimuli that were
nonconsciously perceived. To do so, we examined how the processing of an auditory target syllable could be influenced by pre-target presentations of subliminal auditory–visual (AV) speech stimuli in various combinations. In Experiment 1, we found a significant interaction effect in which the speed of classifying an auditory target (aPA) was facilitated by repeat auditory primes co-presented with McGurk visual primes (apa vka) relative to Incongruent visual primes (apa vwa), while no difference in classification speed occurred for Incongruent pairings of non-repeat auditory primes with either visual prime token (afa vka, afa vwa). That this effect on target classification depended on a specific combination of AV stimuli (the McGurk combination) is evidence that a nonconscious auditory–visual interaction had occurred. That such a specific interaction was unpredicted indicates that the processes involved in priming from masked AV stimuli are likely more complicated than we assumed.

Was this interaction an example of multimodal integration? In the present study we argued that unambiguous multimodal integration would only be indicated by the combination of unimodal information that resulted in a McGurk-like change in auditory representations. Several aspects of our results strongly imply that this change did not occur for subliminal McGurk stimuli. The first is the outcome of Experiment 2, in which the speed of target classification (aTA) was the same for all four prime combinations (apa vka, ata vka, afa vka, ada vka). Since the McGurk prime (apa vka) had failed to elicit a priming effect and instead resulted in target response times similar to those preceded by the non-repeat primes (i.e., afa vka, ada vka), we conclude that the subliminal McGurk prime did not match the illusory percept typically associated with that particular stimulus combination (i.e., "ta"). The second is the outcome of the Conscious Control Test, which was an exact replication of Experiment 2 but with consciously perceivable primes. Contrary to their subliminal counterparts in Experiment 2, conscious McGurk primes were able to elicit priming, i.e., we observed strong facilitation by the repeat prime (ata vka) and a weaker facilitation by the McGurk prime (apa vka) relative to non-repeat primes (afa vka, ada vka). The absence of significant priming effects for the McGurk stimuli
The absence of significant priming effects for the McGurk stimuli in Experiment 2 indicates that McGurk-associated integration processes are unlikely to have occurred under nonconscious conditions. Finally, the finding that the McGurk primes in Experiment 1 led to facilitation indicates that the interaction that occurred with subliminal McGurk primes likely left the auditory component unaltered (i.e., "pa"). Taken together, we did not find evidence that multimodal integration, the combination of unimodal information, can proceed under nonconscious conditions. This finding supports existing evidence that the elicitation of the McGurk effect requires conscious access to the information conveyed by the visual (Munhall, Ten Hove, Brammer, & Paré, 2009; Palmer & Ramsey, 2012) and auditory modalities (Eskelund, Tuomainen, & Andersen, 2011), and also provides novel evidence suggesting that the illusion does not arise in a subliminal auditory form. This is consistent with the claim of global workspace theory (GWT) that consciousness is required for multimodal integration. Additionally, the idea that conscious access is mediated by a global workspace involving long-range intercortical connections agrees with current theories of the neural basis of the McGurk effect: some have theorised that the McGurk effect is underpinned by auditory–visual integration at the multisensory superior temporal sulcus (e.g., Eskelund et al., 2011; Miller & D'Esposito, 2005).

If the observed AV interaction is not integration, then what is it? In the following explanation, we treat the effect of the AV primes as the combination of a masked auditory priming effect and an additional visual influence. We first consider the effects of auditory-only primes in the present study, adopting a Bayesian perspective on masked priming (see Norris & Kinoshita, 2008), in which perception is viewed as the product of inference based on accumulated evidence. To make this conception clearer, consider how it explains the priming effects that were observed with repeat auditory-only primes and targets (i.e., ₐpa) in our pilot experiments (see Supplementary Materials, Pilot Experiment 1). Here, the auditory masked prime can be understood as a source of perceptual evidence that is taken into account when processing the target (which immediately follows the prime).
When the prime is a repeat of the target, the earlier presentation of the same information gives target processing a "head-start" and speeds up reaction time. When the prime is a non-repeat, it generates evidence for a different syllable from the target, causing interference and slower reaction times. On this view, priming based on accumulated evidence can explain the results of the auditory priming experiment; but why does the type of concurrent visual speech affect whether priming occurs in the auditory–visual version? We propose that, unlike the case with auditory-only primes, evidence is sourced from AV primes only if it can be interpreted as a coherent whole. In other words, the visual and auditory components must plausibly come from the same source. This would be true for illusion-inducing McGurk combinations (ₐpaᵥka), where both the visual and auditory components contain features that can be found in a common phonetic identity (i.e., auditory "pa" sounds like auditory "ta", and visual "ka" looks like visual "ta"; see Massaro & Stork, 1998), but not for Incongruent combinations, whose components do not have compatible features and cannot be integrated into a single multisensory percept. We suggest that this compatibility is evaluated through an early, bottom-up mechanism that compares only the features of stimuli and not full representations; a possible mediator could be the outputs of feature detectors (Hubel & Wiesel, 1962), populations of neurons in early sensory cortices that are sensitive to simple stimulus attributes (e.g., auditory frequency or visual intensity). Coincident auditory and visual stimuli are "flagged" if they have the potential to be integrated into a coherent event; we refer to this process as multimodal alignment. This nonconscious alignment can be thought of as an early-stage evaluator of multimodal inputs: through the "flagging" mechanism, coherent inputs, and thus potentially meaningful information, are promoted for further processing, while incoherent inputs are downplayed and excluded. McGurk AV primes were coherent and aligned; as a result, their information would persist and be considered in the processing of the target, but would also remain unaltered at this stage of processing. Thus, the McGurk prime (ₐpaᵥka) remained representative of its ₐpa and ᵥka tokens. In Experiment 1, then, the McGurk prime would be a repeat of the target (ₐPA) and so result in a facilitative priming effect; in Experiment 2, the McGurk prime would be a non-repeat of the target (ₐTA) and so result in a null priming effect. Incongruent AV primes, on the other hand, were left unaligned. We posit that, without alignment, the sensory traces associated with the Incongruent prime are not maintained in the system and become too weak to contribute to target processing. The ₐpaᵥwa prime in Experiment 1 would therefore be equivalent or similar to a hypothetical no-prime condition, resulting in a null priming effect relative to the facilitative ₐpaᵥka prime. Along the same lines, we suggest that the same mechanism is at work with the ₐtaᵥka prime in Experiment 2, which failed to elicit repetition priming of the ₐTA target. Again, the Incongruent ₐtaᵥka prime would be equivalent to a no-prime condition, resulting in reaction times similar to those with the interfering McGurk prime (ₐpaᵥka) and the null-priming Incongruent primes (ₐfaᵥka, ₐdaᵥka).
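To make the evidence-accumulation account above more concrete, the following toy sketch in R is purely illustrative and is not a model fitted to our data; all rates, thresholds and evidence values are arbitrary assumptions. It shows how a prime that pre-loads evidence for the target syllable shortens the time needed to reach a decision threshold, a prime that pre-loads a competing syllable lengthens it, and an unaligned prime that contributes nothing leaves it unchanged.

    # Toy accumulator: the target accrues evidence at a fixed rate until a
    # threshold is reached; the prime's contribution shifts the starting point.
    simulate_rt <- function(prime_evidence, rate = 1, threshold = 100,
                            non_decision = 0.45) {
      steps <- (threshold - prime_evidence) / rate   # time to reach threshold
      non_decision + steps / 1000                    # arbitrary units -> seconds
    }

    round(c(repeat_or_aligned_McGurk = simulate_rt( 30),   # head-start: faster
            unaligned_Incongruent    = simulate_rt(  0),   # no contribution
            competing_non_repeat     = simulate_rt(-30)),  # interference: slower
          3)

On this toy account, the facilitation observed for the aligned McGurk prime in Experiment 1 corresponds to the first case, unaligned Incongruent primes behave like a no-prime baseline, and the auditory-only non-repeat primes of the pilot experiments correspond to the interference case.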
In summary, we did not find evidence that multimodal integration, presently defined as the combination of unimodal information into an integrated category, can occur under nonconscious conditions; we have, however, proposed the existence of a type of multimodal process which we have termed alignment, in which auditory and visual components are flagged as potentially belonging to the same event. This latter type of multimodal interaction does not alter the auditory or visual sensory representations and can occur in the absence of consciousness.

Taken together, our findings suggest a number of constraints on the GWT and similar theories. Clearly, the findings go against the strictest possible form of GWT, in which nonconsciously perceived sensory inputs are confined within their respective processing modules, and in which interaction between the contents of modules occurs only by being conscious of the output of each. Instead, the results support a less restricted form of the theory that allows for some nonconscious cross-sensory interaction; that is, the results indicate the existence of a mode of broadcast between modules that is distinct from the conscious global workspace. Is such a nonconscious broadcast compatible with GWT? It has been stated previously (e.g., Scott, Samaha, Chrisley, & Dienes, 2018) that the existence of nonconscious crossmodal interactions shows that nonconscious communication between modules is possible. A key feature of GWT is that global broadcast is the only way to overcome modular encapsulation; thus any form of crossmodal interaction, conscious or nonconscious, must be mediated via the global workspace. If nonconscious information can potentially contact introspection systems through global connections, then why do nonconscious stimuli remain unreportable?

Our interpretation of the current results offers a possible accommodation with the GWT. We have proposed that nonconscious information (e.g., ₐpa) cannot be altered by other nonconscious sources (ᵥka), but only flagged as compatible for integration with those sources. On the other hand, elicitation of the McGurk effect requires conscious processes that integrate auditory and visual representations. Taken together, one could suppose two distinct and separate intermodular mechanisms: a nonconscious mechanism that manages basic sensory information and has a limited operation (i.e., it cannot transform AV information), and a global workspace mechanism that alone can carry out complex operations on the end-products of modular processing to arrive at an outcome that is most consistent with both sources of sensory information. This picture of conscious and nonconscious processing, while speculative, is consistent with evidence that crossmodal speech processes exist at both lower and higher processing stages (see Eskelund et al., 2011; Miller & D'Esposito, 2005; Klucharev, Möttönen, & Sams, 2003). These studies generally describe a dissociation between AV interactions during speech detection tasks and speech identification tasks: detection is associated with an earlier, preconscious locus of effect, while identification has a later locus and a requirement for consciousness. These properties of detection and identification mirror the present nonconscious alignment and conscious integration processes, respectively. Multilevel crossmodal connections are also apparent neurally. For instance, while the McGurk effect is mainly associated with auditory cortical structures
(Beauchamp et al., 2010; Nath & Beauchamp, 2012; Venezia et al., 2017), anatomical and brain imaging studies using unisensory–multisensory contrasts have consistently implicated thalamic and low-level cortical structures (Calvert, Hansen, Iversen, & Brammer, 2001; Driver & Noesselt, 2008; Schroeder & Foxe, 2005), suggesting that crossmodal interactions can also occur at a relatively early processing locus.

Scott et al. (2018) recently demonstrated that associative learning between novel pairs of visual and auditory stimuli was possible even under subliminal presentation conditions. While the authors interpreted their result as a successful demonstration of nonconscious multimodal integration, we are unsure how their findings, which appear to describe a very different type of crossmodal process, relate to our own. While both phenomena have been proposed as indices of crossmodal integration, associative priming and the alignment effect likely probe processes with very different characteristics. Associative priming is likely mediated by the establishment of a form of episodic memory linking the associations. Interpreted from this perspective, Scott and colleagues' study is a demonstration of the nonconscious triggering of an episodic memory (e.g., Henke, 2010) that encoded the co-presentation of specific stimuli in different modalities. It is not clear, however, that the alignment effect we describe in our study could be adequately explained as an outcome of associative priming. This is because the effect we observed only occurred for specific auditory–visual pairs with physical correlations like those naturally found in auditory–visual speech, whereas associative priming could occur with arbitrary stimuli. Our findings might reflect a speech-specific process that precedes and can influence nonconscious memory encoding, but this is only speculative. Although Scott and colleagues' demonstration of a nonconscious crossmodal interaction still presents a legitimate challenge to GWT and requires explanation, we note that their use of associative priming does not fulfil what we claimed is needed for an unambiguous index of integration: a combination of information that results in a qualitative change. Since our study did not find evidence for this specific definition of integration, but did find a different type of nonconscious crossmodal effect, it opens up the possibility of multimodal processes that are nonconscious but not integrative. Associative priming could be another example of this type of process. If integrative and non-integrative crossmodal processes differentially depend on consciousness, then the mere presence of any crossmodal phenomenon might be too coarse a test of the GWT. When future studies interpret their results, they should carefully consider the level of processing and the type of information involved in their crossmodal phenomena, and possibly include multiple measures or predictions aimed at probing different processing loci. On this view, whether a given behaviour is an index of crossmodal integration or of crossmodal association might be discerned from its reliance on conscious or nonconscious processes, respectively.

An important caveat here concerns the generalisability of our results, because several features of our methodology are, to our knowledge, unique to this study. One is the use of syllables with auditory masking, which may have unknown characteristics compared with
the use of words in previous applications. However, we have accounted for this to the best of our ability through pilot tests, which showed effects typical of a repetition priming paradigm. Our study is also the first to use CFS and auditory masking in conjunction, and it is unknown whether the two techniques interact in ways other than those mediated cross-modally. Given these untested aspects of our study, there is a possibility that our observations are a consequence of our presentation conditions and do not actually reflect real limitations on nonconscious processing. However, there is little indication that this is the case. If the AV interactions in the present study were mostly the result of the presentation techniques used (e.g., the suppressor video or the masking sound), we would have observed the same reaction times in all prime conditions regardless of which syllables were presented. Since this was clearly not the case, and the AV interactions we observed depended on the speech content of the visual and auditory primes, a technique-related confound seems fairly unlikely.

To conclude, our study did not find evidence of multimodal integration, that is, the combination of information in the absence of conscious awareness of both modalities. Our findings did, however, provide novel evidence of a type of multimodal process that is able to proceed under completely nonconscious conditions: a process that functions as an early evaluator of sensory coherence. The possibility of such nonconscious AV interactions places some pressure on the GWT, and further empirical studies seem to be required to reconcile GWT with the complexities of multisensory processing. A fuller characterisation of multisensory processing at multiple levels, which could be probed by manipulations of attention, awareness, stimulus complexity and stimulus type, would make a significant contribution to a more complete account of the role of consciousness in multimodal integration.
Open practices

The study in this article earned Open Materials, Open Data and Preregistered badges for transparent practices. Materials and data for the study are available at https://osf.io/sbntf/?view_only=12aeec9cda114f6eb138064f6c45aefe.
CRediT authorship contribution statement

April Shi Min Ching: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Writing - original draft, Visualization, Project administration. Jeesun Kim: Writing - review & editing, Supervision. Chris Davis: Writing - review & editing, Supervision.
Appendix A

The raw data (Experiments 1 and 2, McGurk Susceptibility, and Conscious Control) and the Stage 1 protocol are freely and publicly available at Open Science Framework Storage (DOI: 10.17605/OSF.IO/X5WUF) at: https://osf.io/sbntf/?view_only=12aeec9cda114f6eb138064f6c45aefe. A version of the study materials with a smaller stimulus set (due to prohibitively large file size) is freely and publicly
available at Open Science Framework at: https://osf.io/74cq5/?view_only=f4d4b42377a0496eb0baacd774be6df0.
Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.cortex.2019.02.014.
References
Arzi, A., Shedlesky, L., Ben-Shaul, M., Nasser, K., Oksenberg, A., Hairston, I. S., et al. (2012). Humans can learn new information during sleep. Nature Neuroscience, 15, 1460–1465.
Baars, B. J. (2002). The conscious access hypothesis: origins and recent evidence. Trends in Cognitive Sciences, 6(1), 47–52.
Baars, B. J. (2005). Global workspace theory of consciousness: toward a cognitive neuroscience of human experience. Progress in Brain Research, 150, 45–53.
Baars, B. J. (2007). The global workspace theory of consciousness. In The Blackwell Companion to Consciousness (pp. 236–246).
Beauchamp, M. S., Nath, A. R., & Pasalar, S. (2010). fMRI-guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. Journal of Neuroscience, 30, 2414–2417.
Boersma, P., & Weenink, D. (2017). Praat: Doing phonetics by computer [computer program] (version 6.0.27). Retrieved from http://www.praat.org/.
Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10, 433–436.
Calvert, G. A., Hansen, P. C., Iversen, S. D., & Brammer, M. J. (2001). Detection of audio-visual integration sites in humans by application of electrophysiological criteria to the BOLD effect. NeuroImage, 14(2), 427–438.
Davis, C., Kim, J., & Barbaro, A. (2010). Masked speech priming: Neighborhood size matters. The Journal of the Acoustical Society of America, 127, 2110–2113.
Dehaene, S., & Changeux, J.-P. (2011). Experimental and theoretical approaches to conscious processing. Neuron, 70, 200–227.
Dehaene, S., & Naccache, L. (2001). Towards a cognitive neuroscience of consciousness: Basic evidence and a workspace framework. Cognition, 79, 1–37.
Deroy, O., Chen, Y.-C., & Spence, C. (2014). Multisensory constraints on awareness. Philosophical Transactions of the Royal Society B: Biological Sciences, 369, 20130207.
Driver, J., & Noesselt, T. (2008). Multisensory interplay reveals crossmodal influences on 'sensory-specific' brain regions, neural responses, and judgments. Neuron, 57(1), 11–23.
Dupoux, E., De Gardelle, V., & Kouider, S. (2008). Subliminal speech perception and auditory streaming. Cognition, 109, 267–273.
Eskelund, K., Tuomainen, J., & Andersen, T. S. (2011). Multistage audiovisual integration of speech: Dissociating identification and detection. Experimental Brain Research, 208(3), 447–457.
Faivre, N., Mudrik, L., Schwartz, N., & Koch, C. (2014). Multisensory integration in complete unawareness: Evidence from audiovisual congruency priming. Psychological Science, 25, 2006–2016.
Finkbeiner, M. (2011). Subliminal priming with nearly perfect performance in the prime-classification task. Attention, Perception, & Psychophysics, 73(4), 1255–1265.
Gelbard-Sagiv, H., Faivre, N., Mudrik, L., & Koch, C. (2016). Low-level awareness accompanies "unconscious" high-level processing during continuous flash suppression. Journal of Vision, 16(1), 3.
Green, P., & MacLeod, C. J. (2016). SIMR: an R package for power analysis of generalized linear mixed models by simulation. Methods in Ecology and Evolution, 7(4), 493–498.
Henke, K. (2010). A model for memory systems based on processing modes rather than consciousness. Nature Reviews Neuroscience, 11(7), 523.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1), 106–154.
Jiang, J., & Bernstein, L. E. (2011). Psychophysics of the McGurk and other audiovisual speech integration effects. Journal of Experimental Psychology: Human Perception and Performance, 37, 1193–1209.
Kleiner, M., Brainard, D., Pelli, D., Ingling, A., Murray, R., Broussard, C., et al. (2007). What's new in Psychtoolbox-3. Perception, 36, 1.
Klucharev, V., Möttönen, R., & Sams, M. (2003). Electrophysiological indicators of phonetic and non-phonetic multisensory interactions during audiovisual speech perception. Cognitive Brain Research, 18(1), 65–75.
Kouider, S., & Dupoux, E. (2005). Subliminal speech priming. Psychological Science, 16, 617–625.
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710.
MacDonald, J., & McGurk, H. (1978). Visual influences on speech perception processes. Attention, Perception, & Psychophysics, 24, 253–257.
Massaro, D. W. (2009). Caveat emptor: The meaning of perception and integration in speech perception. Available from Nature Precedings: http://hdl.handle.net/10101/npre.2009.4016.1.
Massaro, D. W., & Stork, D. G. (1998). Speech recognition and sensory integration: A 240-year-old theorem helps explain how people and machines can integrate auditory and visual information to understand speech. American Scientist, 86(3), 236–244.
Miller, L. M., & D'Esposito, M. (2005). Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. Journal of Neuroscience, 25(25), 5884–5893.
Moulines, E., & Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9, 453–467.
Mudrik, L., Faivre, N., & Koch, C. (2014). Information integration without awareness. Trends in Cognitive Sciences, 18, 488–496.
Munhall, K. G., Ten Hove, M. W., Brammer, M., & Paré, M. (2009). Audiovisual integration of speech in a bistable illusion. Current Biology, 19(9), 735–739.
Nath, A. R., & Beauchamp, M. S. (2012). A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage, 59(1), 781–787.
Noel, J.-P., Wallace, M., & Blake, R. (2015). Cognitive neuroscience: Integration of sight and sound outside of awareness? Current Biology, 25, R157–R159.
Norris, D., & Kinoshita, S. (2008). Perception as evidence accumulation and Bayesian inference: Insights from masked priming. Journal of Experimental Psychology: General, 137(3), 434.
Palmer, T. D., & Ramsey, A. K. (2012). The function of consciousness in multisensory integration. Cognition, 125, 353–364.
Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 10, 437–442.
Porta, J. B. (1593). De refractione. Optices parte. Libri novem. Naples: Salviani.
R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria. Retrieved from http://www.R-project.org/.
Ramsøy, T. Z., & Overgaard, M. (2004). Introspection and subliminal perception. Phenomenology and the Cognitive Sciences, 3(1), 1–23.
Rouder, J. N., Morey, R. D., Speckman, P. L., & Pratte, M. S. (2007). Detecting chance: A solution to the null sensitivity problem in subliminal priming. Psychonomic Bulletin & Review, 14, 597–605.
Schroeder, C. E., & Foxe, J. (2005). Multisensory contributions to low-level, 'unisensory' processing. Current Opinion in Neurobiology, 15(4), 454–458.
Scott, R. B., Samaha, J., Chrisley, R., & Dienes, Z. (2018). Prevailing theories of consciousness are challenged by novel crossmodal associations acquired between subliminal stimuli. Cognition, 175, 169–185.
Sherman, R. (2014). phack. Retrieved from http://rynesherman.com/phack.r.
The MathWorks. (2015). Matlab (version 2015a). Natick, MA.
Tsuchiya, N., & Koch, C. (2005). Continuous flash suppression reduces negative afterimages. Nature Neuroscience, 8, 1096–1101.
Venezia, J. H., Vaden, K. I., Jr., Rong, F., Maddox, D., Saberi, K., & Hickok, G. (2017). Auditory, visual and audiovisual speech processing streams in superior temporal sulcus. Frontiers in Human Neuroscience, 11.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2007). Temporal window of integration in auditory-visual speech perception. Neuropsychologia, 45, 598–607.
Wilcox, R. R. (1995). ANOVA: The practical importance of heteroscedastic methods, using trimmed means versus means, and designing simulation studies. British Journal of Mathematical and Statistical Psychology, 48, 99–114.