Journal of Memory and Language 85 (2015) 42–59
Turning a blind eye to the lexicon: ERPs show no cross-talk between lip-read and lexical context during speech sound processing

Martijn Baart a,*, Arthur G. Samuel a,b,c

a BCBL. Basque Center on Cognition, Brain and Language, Donostia, Spain
b IKERBASQUE, Basque Foundation for Science, Spain
c Stony Brook University, Dept. of Psychology, Stony Brook, NY, United States

* Corresponding author at: Basque Center on Cognition, Brain and Language, Paseo Mikeletegi 69, 2nd Floor, 20009 Donostia (San Sebastián), Spain.
http://dx.doi.org/10.1016/j.jml.2015.06.008
Article history: Received 17 June 2014; revision received 23 March 2015
Keywords: ERPs; N200-effect; P2; Lexical processing; Audiovisual speech integration
Abstract

Electrophysiological research has shown that pseudowords elicit more negative Event-Related Potentials (i.e., ERPs) than words within 250 ms after the lexical status of a speech token is defined (e.g., after hearing the onset of ‘‘ga” in the Spanish word ‘‘lechuga”, versus ‘‘da” in the pseudoword ‘‘lechuda”). Since lip-read context also affects speech sound processing within this time frame, we investigated whether these two context effects on speech perception operate together. We measured ERPs while listeners were presented with auditory-only, audiovisual, or lip-read-only stimuli, in which the critical syllable that determined lexical status was naturally-timed (Experiment 1) or delayed by 800 ms (Experiment 2). We replicated the electrophysiological effect of stimulus lexicality, and also observed substantial effects of audiovisual speech integration for words and pseudowords. Critically, we found several early time-windows (<400 ms) in which both contexts influenced auditory processes, but we never observed any cross-talk between the two types of speech context. The absence of any interaction between the two types of speech context supports the view that lip-read and lexical context mainly function separately, and may have different neural bases and purposes.

© 2015 Elsevier Inc. All rights reserved.
Introduction

Over the course of a half century of research on speech perception and spoken word recognition, the central observation has been that despite the enormous variability in the speech signal (both between and within speakers), listeners are generally able to correctly perceive the spoken message. Researchers have found that in addition to extra-segmental cues like prosody, context is used whenever possible to support the interpretation of the
auditory stream. Two substantial literatures exist in this domain: One body of research has shown that perceivers rely on the visible articulatory gestures of the speaker, here referred to as lip-reading (e.g., Sumby & Pollack, 1954). The second widely studied context effect is based on the mental lexicon – the knowledge that informs listeners about existing words within the language (e.g., Ganong, 1980). In the current study, we use electrophysiological techniques to examine the relationship between these two types of contextual support during spoken word recognition. The studies of lip-read context and of lexical context have generally developed independently of each other (for exceptions, see e.g., Barutchu, Crewther, Kiely,
Murphy, & Crewther, 2008; Brancazio, 2004), but there are several ways in which the two context types seem to be quite similar. One commonality is that both context types appear to be most effective when the auditory speech signal is degraded or ambiguous. For example, in Sumby and Pollack’s (1954) classic study of audiovisual (henceforth AV) speech, participants tried to recognize words in sentences under various levels of noise masking, either with auditory-only or with AV presentation. When the listening conditions were good, the lip-read information had little effect, but when the signal-to-noise ratio was substantially reduced, there was a very large advantage for the AV over the audio-only condition. Lexical context also seems to be particularly potent when the auditory information is unclear. For instance, an ambiguous sound in between ‘‘g” and ‘‘k” will be heard as ‘‘g” when followed by ‘‘ift” and as ‘‘k” when followed by ‘‘iss” because ‘‘gift” and ‘‘kiss” are lexically valid items and ‘‘kift” and ‘‘giss” are not (Ganong, 1980); under very clear listening conditions, listeners are certainly able to identify nonlexical tokens like ‘‘kift” or ‘‘giss”. Ambiguous speech sounds (e.g., a sound in between /b/ and /d/) can be disambiguated through lip-reading, with the sound heard as /b/ when listeners see a speaker pronouncing /b/, and heard as /d/ when combined with a visual /d/ (e.g., Bertelson, Vroomen, & De Gelder, 2003). There are many studies showing such disambiguation of the auditory input by both lexical context, and by lip-read context. In addition, in the last decade there have been parallel developments of studies showing that both lexical context and lip-read context can be used by listeners to recalibrate their phonetic category boundaries. The seminal report for lexical context was done by Norris, McQueen, and Cutler (2003), and the original work showing such effects for lip-read speech was by Bertelson et al. (2003). The recalibration effect occurs as a result of exposure to ambiguous speech that is disambiguated by lip-read or lexical context. It manifests as a subsequent perceptual shift in the auditory phoneme boundary such that the initially ambiguous sound is perceived in accordance with the phonetic identity provided by the lip-read context (e.g., Baart, de Boer-Schellekens, & Vroomen, 2012; Baart & Vroomen, 2010; Bertelson et al., 2003; van Linden & Vroomen, 2007; Vroomen & Baart, 2009a, 2009b; Vroomen, van Linden, de Gelder, & Bertelson, 2007; Vroomen, van Linden, Keetels, de Gelder, & Bertelson, 2004) or the lexical context (Eisner & McQueen, 2006; Kraljic, Brennan, & Samuel, 2008; Kraljic & Samuel, 2005, 2006, 2007; Norris et al., 2003; van Linden & Vroomen, 2007). In both cases, the assumption is that the context acts as a teaching signal that drives a change in the perceived phoneme boundary that can be observed when later ambiguous auditory speech tokens are presented in isolation. In addition to these parallel behavioral patterns for lexical and lip-read context, electrophysiological studies have revealed that both types of context modulate auditory processing within 250 ms after onset of the critical speech segment. For lip-reading, electrophysiological evidence for such early context-induced modulations of auditory processing is found in studies that relied on the mismatch negativity (i.e., MMN, e.g., Näätänen, Gaillard, & Mäntysalo,
1978), which is a negative component in the event-related potentials (ERPs) usually occurring 150–200 ms post-stimulus in response to a deviant sound in a sequence of standard sounds (the standards are all the same). As demonstrated by McGurk and MacDonald (1976), lip-read context can change perceived sound identity, and when it does, it triggers an auditory MMN response when the illusory AV stimulus is embedded in a string of congruent AV stimuli (e.g., Colin, Radeau, Soquet, & Deltenre, 2004; Colin et al., 2002; Saint-Amour, De Sanctis, Molholm, Ritter, & Foxe, 2007). When sound onset is sudden and does not follow repeated presentations of standard sounds, it triggers an N1/P2 complex (a negative peak at 100 ms followed by a positive peak at 200 ms) and it is well-documented that amplitude and latency of both peaks are modulated by lip-read speech (e.g., Alsius, Möttönen, Sams, Soto-Faraco, & Tiippana, 2014; Baart, Stekelenburg, & Vroomen, 2014; Besle, Fort, Delpuech, & Giard, 2004; Frtusova, Winneke, & Phillips, 2013; Klucharev, Möttönen, & Sams, 2003; Stekelenburg, Maes, van Gool, Sitskoorn, & Vroomen, 2013; Stekelenburg & Vroomen, 2007, 2012; van Wassenhove, Grant, & Poeppel, 2005; Winneke & Phillips, 2011). Thus, studies measuring both the MMN and the N1/P2 peaks indicate that lip-reading affects sound processing within 200 to 250 ms after sound onset. The electrophysiological literature on lexical context paints a similar picture, as both the MMN and the P2 are modulated by lexical properties of the speech signal. For instance, when a spoken syllable completes a word it elicits a larger MMN response than when the same syllable completes a non-word (Pulvermüller et al., 2001). Similarly, single-syllable deviant words (e.g., ‘day’) in a string of non-word standards (‘de’) elicit larger MMNs than non-word deviants in a sequence of word standards (Pettigrew et al., 2004). Recently, we demonstrated that the auditory P2 is also sensitive to lexical processes (Baart & Samuel, 2015). We presented listeners with spoken, naturally timed, three-syllable tokens in which the lexical status of each token was determined at third syllable onset (e.g., the Spanish word ‘‘lechuga” [lettuce] versus the pseudoword ‘‘lechuda”). The lexical context effect occurred by about 200 ms after onset of the third syllable, with pseudowords eliciting a larger negativity than words. In previous studies, a comparable ERP pattern was observed for sentential context rather than within-item lexical context, and was referred to as an N200 effect (Connolly, Phillips, Stewart, & Brake, 1992; van den Brink, Brown, & Hagoort, 2001; van den Brink & Hagoort, 2004). In our study (Baart & Samuel, 2015), we sought to determine whether the N200 effect is robust against violations of temporal coherence within the stimulus, and we therefore added 440 or 800 ms of silence before onset of the third syllable. The lexicality effect survived the delay and pseudowords again elicited more negative ERPs than words at around 200 ms. As expected, the ERPs following the onset of the third syllable after a period of silence had a different morphology than the ERPs obtained for the naturally timed items. That is, the delayed syllables elicited an auditory N1/P2 complex, in which the lexicality
effect was now observed at the P2. This result demonstrated that the N200-effect of lexicality was robust against violations of temporal coherence. As with lip-read context, effects of lexical context can be found even before the N200 effect. Recently, it was demonstrated that brain activity in response to words may start to differentiate from the response to pseudowords as early as 50 ms after the information needed to identify a word becomes available (MacGregor, Pulvermüller, van Casteren, & Shtyrov, 2012). There are also linguistic context effects later than the N200, including a negative going deflection in the waveform at around 400 ms (i.e., the N400 effect) in response to meaningful stimuli such as spoken words, written words and sign language (see Kutas & Federmeier, 2011 for a review). In the auditory domain, the N400 is larger (more negative) when words are preceded by unrelated words than when words are preceded by related words (i.e., a semantic priming effect, see Holcomb & Neville, 1990). The N400 has been argued to reflect cognitive/linguistic processing in response to auditory input (Connolly et al., 1992), and is proposed to be functionally distinct from the earlier N200 (Connolly, Stewart, & Phillips, 1990; Connolly et al., 1992; van den Brink & Hagoort, 2004; van den Brink et al., 2001). Looking at the literature on lexical context effects, and the literature on the effect of lip-read context, we see multiple clear parallels: In both cases, the context effect strongly influences how listeners interpret ambiguous or unclear speech. In both cases, the context guides recalibration of phonemic category boundaries, bringing them into alignment with the contextually-determined interpretation. In both cases, the context effects can be detected in differing patterns of electrophysiological activity within approximately 250 ms after the context is available. Based on these strong commonalities, it seems plausible that in the processing stream that starts with activity on the basilar membrane, and results in a recognized word, contextual influences operate together very rapidly. Despite the appealing parsimony of this summary, there are some findings in the literature that suggest that lexical and lip-read context might not operate together, and may actually have rather different properties. We will briefly review two types of evidence that challenge the idea that lexical and lip-read context operate in the same way. The domains of possible divergence include differing outcomes in selective adaptation studies, and possibly different semantic priming effects. Both lip-read and lexical context have been employed in the selective adaptation paradigm (Eimas & Corbit, 1973; Samuel, 1986), with divergent results. In an adaptation study, subjects identify tokens along some speech continuum (e.g., a continuum from ‘‘ba” to ‘‘da”), before and after an adaptation phase. In the adaptation phase, an ‘‘adaptor” is played repeatedly. For example, in one condition an adaptor might be ‘‘ba”, and in a second condition it might be ‘‘da”. The adaptation effect is a reduction in report of sounds that are similar to the adaptor (e.g., in this case, listeners would report fewer items as ‘‘ba” after adaptation with ‘‘ba”, and fewer items as ‘‘da” after adaptation with ‘‘da”). Multiple studies have
tested whether a percept generated via lip-read context will produce an adaptation effect that matches the effect for an adaptor that is specified in a purely auditory way, and all such studies have shown no such effect. For example, Roberts and Summerfield (1981) had subjects identify items along a ‘‘b”–‘‘d” continuum before and after adaptation, with adaptors that were either purely auditory or that were audiovisual. Critically, in the AV case, a pairing of an auditory ‘‘b” with a visual ‘‘g” consistently led the subjects to identify the AV combination as ‘‘d” (see McGurk & MacDonald, 1976). Despite this lip-read-generated percept, the pattern of adaptation effects for the AV adaptor perfectly matched the pattern for the simple auditory ‘‘b”. This failure by lip-read context to produce an adaptation shift has been replicated repeatedly (Saldaña & Rosenblum, 1994; Samuel & Lieblich, 2014; Shigeno, 2002). This inefficacy contrasts with adaptors in which the critical sound is generated by lexical context. Samuel (1997) constructed adaptors in which a ‘‘b” (e.g., in ‘‘alphabet”) or a ‘‘d” (e.g., in ‘‘armadillo”) was replaced by white noise. The lexical context in such words caused listeners to perceive the missing ‘‘b” or ‘‘d” (via phonemic restoration: Samuel, 1981; Warren, 1970), and these lexically-driven sounds successfully generated adaptation shifts. Similarly, when a segment was phonetically ambiguous (e.g., midway between ‘‘s” and ‘‘sh”), lexical context (e.g., ‘‘arthriti_” versus ‘‘demoli_”) caused listeners to hear the appropriate sound (see Ganong, 1980; Pitt & Samuel, 1993), again successfully producing adaptation shifts. Thus, across multiple tests, the percepts generated as a function of lip-read information cannot support adaptation, despite being phenomenologically compelling; lexically-driven percepts are both compelling and able to generate adaptation. A recent study examining semantic priming effects also suggests that there may be an important dissociation between the phenomenological experience produced by lip-read context and actual activation of the apparently perceived word. Ostrand, Blumstein, and Morgan (2011) presented AV primes followed by auditory test items. For some primes an auditory nonword (e.g., ‘‘bamp”) was combined with a video of a speaker producing a real word (e.g., ‘‘damp”), leading subjects to report that they heard the real word (‘‘damp”) due to the lip-read context. Other primes were made with an auditory word (e.g., ‘‘beef”) combined with a visual nonword (e.g., ‘‘deef”), leading to a nonword percept (e.g., ‘‘deef”) due to the lip-read context. Ostrand et al. (2011) found that if the auditory component of the AV prime was a word then semantic priming was found even if the perception of the AV prime was a nonword (e.g., even if people said they heard the prime as ‘‘deef”, ‘‘beef” would prime ‘‘pork”). No priming occurred when a prime percept (e.g., ‘‘deef”) was based on an AV combination that was purely a nonword (e.g., both audio and video ‘‘deef”). These semantic priming results again indicate that lip-read context may not be activating representations of words in the same way that lexical context does. Samuel and Lieblich (2014) have suggested that lip-read context may have strong and direct effects on the immediate percept, but may not directly affect linguistic encoding.
The current study builds on electrophysiological evidence for effects of lexical and lip-read context on auditory processing within approximately 250 ms after the relevant stimulus. Two experiments explore the neural correlates of the two types of context, with the specific goal of assessing whether they show interactive patterns of neural activation, which would indicate a mutual influence on auditory processing. This approach has been used for decades in the behavioral literature, and more recently, in the ERP literature. It is grounded in the additive factors logic developed by Sternberg (Sternberg, 1969; see Sternberg, 2013, for a recent and thoughtful discussion of the method). The fundamental idea is that if two factors are working together during at least some part of processing, they have the opportunity to produce over-additive or under-additive effects, whereas if they operate independently then each will produce its own effect, and the two effects will simply sum. In the behavioral literature this approach has typically been applied to reaction time measurements. With ERPs, the approach gets instantiated by looking for additive versus non-additive effects on the evoked responses. ERPs have proven to be sensitive enough to reveal interactions between various properties of a stimulus. For instance, lip-read induced suppression of the N1 is modulated by AV spatial congruence (Stekelenburg & Vroomen, 2012). Interactions between phonological and semantic stimulus properties (Perrin & García-Larrea, 2003) and between sentence context and concreteness of a target word (Holcomb, Kounios, Anderson, & West, 1999) have been found for the N400 (see also Kutas & Federmeier, 2011). Here, we measure ERPs while listeners are presented with the same auditory words and pseudowords used in our recent study (Baart & Samuel, 2015), but now the third syllable (that determines whether the item is a word or pseudoword) is presented in auditory form (as in the previous study), in the visual modality (requiring lip-reading), or audiovisually. We expect, based on prior work, to see an early ERP effect of lexicality, and an early ERP effect of lip-read context. Our central question is whether these two effects are independent and simply sum, or if there is evidence for some coordination of the two context effects that yields non-additive effects. We provide two quite different stimulus conditions to look for any interaction of the two context types. In Experiment 1, we use naturally timed items in which we should observe a lexicality effect in the form of an auditory N200 effect, as we found previously with these words and pseudowords. The goal is to determine whether lip-read context modulates the effects of lexicality. A complication with using naturally timed speech is that it means that we will be looking for an effect of lip-read context that occurs during ongoing speech (i.e., at the beginning of the third syllable in our stimuli). There is nothing fundamentally wrong with this, but almost all of the existing studies that show early ERP consequences of lip-reading have done so for utterances that follow silence. Such conditions provide a clean N1/P2 complex to look at, a complex that is not found during ongoing speech. To provide a test with these properties, in Experiment 2, we use items in which the third syllable is delayed,
providing conditions that should elicit an N1/P2 response. Effects of lexical context and lip-read speech were both expected to occur at the P2 peak. The existing literature predicts that these effects will push the P2 in opposite directions (i.e., lip-read information suppresses the P2, lexical information yields more positive ERPs), but the critical issue is whether the two effects are additive or interactive. Assuming that lip-read-induced modulations at the N1 reflect integration of relatively low-level features, the P2 is the most promising component to reveal interactions between the two contexts. By testing both normally-timed speech, and speech delayed to offer a clean P2, we maximize the opportunity to observe any interaction of the two types of speech context.

Experiment 1: Naturally timed items

In Experiment 1, participants were presented with three-syllable words or pseudowords that were identical through the first two syllables, but diverged at the onset of the third syllable. In all cases the first two syllables were presented in auditory form only. The third syllable was then presented either in auditory-only form (A), in visual-only form (V), or audiovisually (AV). We measured the resulting ERP patterns for the six resulting conditions (word/pseudoword A/V/AV), focusing on whether there was any interaction of the lexical and lip-read context effects following third syllable onset.

Methods

Participants

20 adults with normal hearing and normal or corrected-to-normal vision participated in the experiment in return for a payment of €10 per hour. Participants were only eligible for participation if Spanish was their dominant language. Language proficiency in other languages (e.g., English, German, French, Basque) was variable. All participants gave their written informed consent prior to testing, and the experiment was conducted in accordance with the Declaration of Helsinki. Two participants were excluded from the analyses, one because of poor data quality (see ‘EEG recording and analyses’ below) and one because one of the experimental blocks was accidentally repeated. The mean age in the final sample of 18 participants (7 females) was 22 years (SD = 2.1).

Stimuli

The auditory stimuli were the same as those used in our recent study (Baart & Samuel, 2015). We selected 6 three-syllable nouns from the EsPal subtitle database (Duchon, Perea, Sebastián-Gallés, Martí, & Carreiras, 2013) that were matched on four criteria (frequency, stress, absence of embedded lexical items, and absence of higher frequency phonological neighbors). The resulting nouns (i.e., ‘‘brigada” [brigade], ‘‘lechuga” [lettuce], ‘‘granuja” [rascal], ‘‘laguna” [lagoon], ‘‘pellejo” [hide/skin] and ‘‘boleto” [(lottery) ticket]) were produced by a male native speaker of Spanish. The speaker was recorded with a digital video camera (at a rate of 25 frames/s) and its internal microphone (Canon Legria
HF G10), framing the video as a headshot. The speaker was asked to produce several tokens of each item, and also to produce versions in which the onset consonant of the third syllable was replaced by ‘‘ch” or ‘‘sh”. These tokens were used for splicing purposes as ‘‘ch” and ‘‘sh” have relatively weak co-articulatory effects on the preceding vowel, and they also would not provide accurate predictive information after the splicing process. Word stimuli were created by splicing the final syllable of the original recording onto the first two syllables of the ‘‘ch” or ‘‘sh” item (e.g., ‘‘na” from ‘‘laguna” was spliced onto ‘‘lagu” from ‘‘lagucha”). Pseudowords were created by rotating the final syllables – which led to ‘‘brigaja”, ‘‘lechuda”, ‘‘granuna”, ‘‘laguga”, ‘‘pelleto” and ‘‘bolejo” – and splicing the final syllables of the original word recordings onto the first two syllables of the ‘‘ch” item (e.g., ‘‘na” from ‘‘laguna” was spliced onto ‘‘granu” from ‘‘granucha”). All of these naturally-timed stimuli sounded natural, with no audible clicks or irregularities. The original video recordings that corresponded to the six final syllables were converted to bitmap sequences and matched on total duration (11 bitmaps; 440 ms, including 3 bitmaps of anticipatory lip-read motion before sound onset).

Procedures

Participants were seated in a sound-attenuated, dimly lit, and electrically shielded booth at 80 cm from a 19-in. CRT monitor (100 Hz refresh). Auditory stimuli were presented at 65 dB(A) (measured at ear-level) and played through a regular computer speaker placed directly above the monitor. In total, 648 experimental trials were delivered: 216 A trials, 216 V trials, and 216 AV trials. For each modality, half of the trials were words and half were pseudowords. An additional 108 catch-trials (14% of the total number of trials) were included to keep participants fixated on the monitor and minimize head-movement. During a catch-trial, a small white dot briefly appeared at auditory onset of the third syllable (120 ms, Ø = 4 mm) between the nose and upper-lip of the speaker on AV and V trials, and in the same location on the screen on A trials. Participants were instructed to press the space-bar of a regular keyboard whenever they detected the dot. We used Presentation software for stimulus delivery. As can be seen in Fig. 1, each trial started with a 400 ms fixation cross followed by 600, 800 or 1000 ms of silence before the auditory stimulus was delivered. Auditory onset of the first syllable was slightly variable, as we added silence before the stimulus (in the .wav files) to ensure that third syllable onset was always at 640 ms, which coincided with bitmap number 17 from a bitmap sequence that was initiated at sound onset (see Fig. 1). Bitmaps were 164 mm (W) × 179 mm (H) in size, and on AV and V trials, auditory onset of the critical third-syllable was preceded by 3 fade-in bitmaps (still frames) and 3 bitmaps (120 ms) of anticipatory lip-read information (which were bitmaps 11–16 in the bitmap sequence, see Fig. 1). In all conditions, the ITI between sound offset and fixation onset was 1800 ms. The trials were pseudo-randomly distributed across six experimental blocks. Before the experiment started, participants completed a 12-trial practice session to familiarize them with the procedures.
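To make the trial structure concrete, the sketch below builds a trial list with the counts and timing parameters described above (216 trials per modality, half words and half pseudowords, 108 additional catch trials, and a jittered 600/800/1000 ms silence after the fixation cross). It is only an illustration under assumed variable names; the actual experiment was programmed in Presentation, and none of the identifiers below come from those scripts.

import random

# Hypothetical trial-list builder (illustrative names; not the authors' Presentation scripts).
MODALITIES = ["A", "V", "AV"]                 # auditory-only, visual-only, audiovisual
WORDS = ["brigada", "lechuga", "granuja", "laguna", "pellejo", "boleto"]
PSEUDOWORDS = ["brigaja", "lechuda", "granuna", "laguga", "pelleto", "bolejo"]
REPS_PER_ITEM = 18                            # 18 reps x 6 items x 2 lexicality = 216 trials per modality
SILENCES_MS = (600, 800, 1000)                # jittered silence after the 400 ms fixation cross
N_CATCH = 108                                 # extra catch trials (~14% of all trials)

def build_trials(seed=0):
    rng = random.Random(seed)
    trials = []
    for modality in MODALITIES:
        for lexicality, items in (("word", WORDS), ("pseudoword", PSEUDOWORDS)):
            for item in items:
                for _ in range(REPS_PER_ITEM):
                    trials.append({
                        "modality": modality,
                        "lexicality": lexicality,
                        "item": item,
                        "silence_ms": rng.choice(SILENCES_MS),
                        "syllable3_onset_ms": 640,    # third-syllable onset relative to sound onset
                        "catch": False,
                    })
    # Catch trials: a brief dot appears at third-syllable onset; these trials are not analyzed.
    for _ in range(N_CATCH):
        catch = dict(rng.choice(trials))
        catch["catch"] = True
        trials.append(catch)
    rng.shuffle(trials)                       # pseudo-random order across the six blocks
    return trials

trials = build_trials()
print(len(trials))                            # 756 trials: 648 experimental + 108 catch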
EEG recording and analyses

The EEG was recorded at a 500 Hz sampling rate through a 32-channel BrainAmp system (Brain Products GmbH). 28 Ag/AgCl electrodes were placed in an EasyCap recording cap and EEG was recorded from sites Fp1, Fp2, F7, F3, Fz, F4, F8, FC5, FC1, FC2, FC6, T7, C3, Cz, C4, T8, CP5, CP1, CP2, CP6, P7, P3, Pz, P4, P8, O1 and O2 (FCz served as ground). Four electrodes (2 on the orbital ridge above and below the right eye and 2 on the lateral junctions of both eyes) recorded the vertical and horizontal electrooculogram (EOG). Two additional electrodes were placed on the mastoids, of which the left was used to reference the signal on-line. Impedance was kept below 5 kΩ for mastoid and scalp electrodes, and below 10 kΩ for EOG electrodes. The EEG signal was analyzed using Brain Vision Analyzer 2.0. The signal was referenced off-line to an average of the two mastoid electrodes and band-pass filtered (Butterworth Zero Phase Filter, 0.1–30 Hz, 24 dB/octave). Additional interference was removed by a 50 Hz notch filter. ERPs were time-locked to auditory onset of the third syllable and the raw data were segmented into 1100 ms epochs (from 200 ms before to 900 ms after third syllable onset). After EOG correction (i.e., ERPs were subtracted from the raw signal, the proportion of ocular artifacts was calculated in each channel and also subtracted from the EEG, and ERPs were then added back to the signal, see Gratton, Coles, & Donchin, 1983, for details), segments that corresponded to catch-trials and segments with an amplitude change >120 µV at any channel were rejected. One participant was excluded from analyses because 40% of the trials contained artifacts, versus 4% or less for the other participants. ERPs were averaged per modality (A, V and AV) for words and pseudowords separately, and baseline-corrected (200 ms before onset of the third syllable). Next, we subtracted the visual ERPs from the audiovisual ERPs. This was done in order to compare the AV–V difference with the A condition, which captures AV integration effects (see e.g., Alsius et al., 2014; Baart et al., 2014; Besle, Fort, Delpuech, et al., 2004; Fort, Delpuech, Pernier, & Giard, 2002; Giard & Peronnet, 1999; Klucharev et al., 2003; Stekelenburg & Vroomen, 2007; van Wassenhove et al., 2005; Vroomen & Stekelenburg, 2010). These ERPs are provided in Fig. 2, and ERPs for all experimental conditions (A, V and AV) are provided in Appendix A (Fig. 4). The hypothesis that lip-read and lexical context effects are independent predicts that their impact on ERPs will be additive, i.e., that there will be no interaction of these two factors. The absence of an interaction is of course a null effect, and it is not statistically possible to prove a null effect. What can be done is to show that each factor itself produces a robust effect, and to then look at every possible way that an interaction might manifest itself if it exists. Toward this end, in addition to conducting the test both with normal timing (Experiment 1) and with a delay that offers a clear P2 (Experiment 2), in each experiment we first analyze a window large enough to contain any interaction that might occur (the first 400 ms), and then provide a focused analysis within the window that prior work indicates is the most likely one for any interaction (150–250 ms). All ANOVAs are Greenhouse–Geisser corrected.
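As a rough illustration of the averaging and subtraction logic just described (epoching around third-syllable onset, baseline correction over the 200 ms pre-onset window, rejection of epochs exceeding 120 µV, per-condition averaging, and the AV–V difference used to index AV integration), a minimal numpy sketch is given below. It assumes an already filtered and EOG-corrected continuous recording stored as a channels × samples array; it is not the Brain Vision Analyzer pipeline that was actually used, and the toy data are random noise.

import numpy as np

FS = 500                      # sampling rate (Hz)
PRE_MS, POST_MS = 200, 900    # epoch from -200 to +900 ms around third-syllable onset
REJECT_UV = 120.0             # reject epochs with >120 microvolt amplitude change on any channel

def epoch(continuous, onsets_s):
    """Cut epochs (trials x channels x samples) from a channels x samples array."""
    pre, post = int(PRE_MS / 1000 * FS), int(POST_MS / 1000 * FS)
    return np.stack([continuous[:, int(t * FS) - pre : int(t * FS) + post] for t in onsets_s])

def clean_and_average(epochs):
    """Baseline-correct on the 200 ms pre-onset window, reject artifacts, return the average ERP."""
    baseline = epochs[:, :, : int(PRE_MS / 1000 * FS)].mean(axis=2, keepdims=True)
    epochs = epochs - baseline
    peak_to_peak = epochs.max(axis=2) - epochs.min(axis=2)          # per trial, per channel
    keep = (peak_to_peak <= REJECT_UV).all(axis=1)
    return epochs[keep].mean(axis=0)                                # channels x samples ERP

# Toy data: 27 channels, 100 s of "EEG", and third-syllable onsets for one condition.
rng = np.random.default_rng(0)
eeg = rng.normal(0, 10, size=(27, 100 * FS))
onsets = np.arange(2.0, 98.0, 2.0)

erp_a = clean_and_average(epoch(eeg, onsets))          # auditory-only condition
erp_av = clean_and_average(epoch(eeg, onsets))         # audiovisual condition (same toy data here)
erp_v = clean_and_average(epoch(eeg, onsets))          # visual-only condition
av_minus_v = erp_av - erp_v                            # compared against erp_a to index AV integration
print(erp_a.shape, av_minus_v.shape)                   # (27, 550) each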
Fig. 1. Overview and timing for auditory-only (left) and audiovisual and visual-only (right) trials in Experiments 1 and 2.
Results

On average, participants detected 99% of the catch-trials (SD = 2%, ranging from 94% [N = 1] to 100% [N = 9]), indicating that they had attended the screen as instructed.

The 0–400 ms window

We averaged the data in eight 50 ms bins spanning a time-window from 0 to 400 ms. This window was chosen to ensure that we did not miss any of the context effects of interest, since these clearly occur before 400 ms (e.g., Besle, Fort, Delpuech, et al., 2004; Klucharev et al., 2003; MacGregor et al., 2012; Pettigrew et al., 2004; Stekelenburg & Vroomen, 2007; van Linden, Stekelenburg, Tuomainen, & Vroomen, 2007; van Wassenhove et al., 2005). In a first analysis, we submitted the data to an 8 (Time-window; eight 50 ms bins from 0 to 400 ms) × 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV–V) ANOVA. The results of this omnibus ANOVA are summarized in Table 1. As indicated in Table 1, both contexts produced significant effects. Lexicality yielded a main effect in the entire 0–400 ms epoch, and also showed an interaction with Time-window. Modality also interacted with Time-window, as well as with Electrode. The Time-window × Electrode × Modality interaction was also significant. Thus, the results meet the first critical criterion of demonstrating effects of both Lexicality and Modality. The second critical criterion is the finding of purely additive effects for these two factors, and that is exactly what Table 1 shows: None of the interactions involving
Lexicality and Modality even approached significance. As noted, the first, broad analysis was intended to cast a wide net in looking for any evidence of an interaction of the two types of context. Since both interacted with Time-window, we analyzed the data separately for each 50 ms bin, using eight 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV–V) ANOVAs. The results of those ANOVAs are summarized in Fig. 3a, where the p-values for the main and interaction effects involving Lexicality and Modality are plotted. The top panel shows the main effects of Lexicality and Modality, and each one’s interaction with Electrode; the bottom panel shows the interaction of the two context types with each other, and their three-way interaction with Electrode. As the bottom panel illustrates, at no point within the 400 ms window were there any interactions between Modality and Lexicality, Fs < 1. There were never any interactions between Electrode, Modality and Lexicality, Fs(26, 442) < 1.35, ps > .26, ηp²s < .08, presumably because Modality had no influence on the effect that Lexicality had on auditory processing, and vice versa. In contrast, in the top panel, there are robust effects of the two factors, in overlapping time windows. The time-windows where both contexts yielded significant results (as a main effect and/or interaction effect with Electrode) are shaded in gray. In all time-windows, ANOVAs yielded significant main effects of Electrode, Fs(26, 442) > 4.00, ps < .01, ηp²s > .18. The main effects of Lexicality in the time-windows 100–150 ms, 250–300 ms, 300–350 ms and 350–400 ms, Fs(1, 17) > 4.59, ps < .05, ηp²s > .20, indicated that in all four
Fig. 2. ERPs time-locked to third-syllable onset for auditory words and pseudowords and the AV–V difference-waves for Experiment 1 (upper panel) and Experiment 2 (lower panel).
time-windows, words yielded more positive ERPs than pseudowords (differences ranged between .39 µV and .91 µV). The interactions between Lexicality and Electrode (in the 300–350 ms window and the 350–400 ms window) were followed up by paired-samples t-tests that tested activity for words against pseudowords at each electrode. Family-wise error was controlled by applying a step-wise Holm–Bonferroni correction (Holm, 1979) to the t-tests. Appendix A provides these tests and topography maps across the 400 ms window (Fig. 6). The ANOVAs showed a main effect of Modality in the 150–200 ms time-window, F(1, 17) = 5.80, p = .03, ηp² = .25, because amplitude across the scalp was 1.01 µV more negative for A than for AV–V. Interactions between Electrode
Table 1
Greenhouse–Geisser corrected results of the 8 (average activity in 50 ms time-windows in a 0–400 ms epoch) × 27 (Electrode) × 2 (Modality; A vs. AV–V) × 2 (Lexicality; Words vs. Pseudowords) omnibus ANOVAs in Experiment 1 and Experiment 2, with significant effects indicated by asterisks. Interactions that involve Lexicality and Modality never approached significance, which is indicated by highlighted cells.
and Modality were observed in all time-windows, Fs(26, 442) > 6.64, ps < .01, ηp²s > .27, because auditory activity was more negative than AV–V at bilateral (centro-)frontal locations, with largest effects in a 100–250 ms time-frame. Interactions were again followed up as before (see the topography maps in Appendix A for details).

The 150–250 ms window

Looking across the entire 400 ms window, we saw multiple significant effects of both lexical context and lip-read context, but no hint anywhere of their interacting. We now focus on the time range that a priori has the greatest chance of including an interaction, the period around 200 ms that has been implicated in previous studies as showing both lexical and lip-reading effects. A 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV–V) ANOVA on the averaged data in a 150–250 ms window showed a main effect of Electrode, F(26, 442) = 14.98, p < .01, ηp² = .47, as activity was most positive at central electrodes (i.e., the maximal activity was 2.00 µV at Cz). There was no main effect of Lexicality in the 150–250 ms window (as Fig. 3a shows, the significant Lexicality effect in Experiment 1 surrounded this window), and there was also no interaction between Electrode and Lexicality, F < 1. There was no main effect of Modality, F(1, 17) = 3.66, p = .07, ηp² = .18, but there was an Electrode × Modality interaction, F(26, 442) = 14.01, p < .01, ηp² = .45, that was already explored in the 0–400 ms window analyses. Critically, Lexicality did not interact with Modality, F < 1, and the interaction between Electrode, Lexicality and Modality was also not significant, F < 1. To quantify this finding, we constructed the AV–V–A difference waves for words, and compared them with the difference-waves for pseudowords. The rationale was that residual activity after the AV–V–A subtraction fully captures the effect of AV integration. Critically, if AV integration was modulated by Lexicality, this effect would be different for words and pseudowords. However, as can be seen in Fig. 3b, the effect of AV integration was similar for words and pseudowords, and none of the pair-wise comparisons (testing the AV–V–A wave for words against pseudowords) reached significance, ts(17) < 1.78, ps > .09, let alone survived a correction for multiple comparisons.

Experiment 1: Discussion
Experiment 1 yielded four main findings: (1) lexical context modulated auditory processing in the 100–150 ms and 250–400 ms windows, (2) lip-read context modulated auditory processing in the entire 0–400 ms epoch, (3) both contexts had a similarly directed effect; words yielded more positive ERPs relative to pseudowords, and effects of AV integration (AV–V) were manifest through more positive ERPs than auditory-only presentations, and (4) there were no interactions between the two context types; AV integration was statistically alike for words and pseudowords, and likewise, lexical processing was statistically alike for A and AV–V.

For lexical context, the pattern of results was as expected, although the N200-effect was somewhat smaller, earlier and shorter-lived (.39 µV in a 100–150 ms window) compared to prior findings for sentential auditory context (i.e., the effect ranged in between .71 and .94 µV in a 150–250 ms window, see van den Brink & Hagoort, 2004; van den Brink et al., 2001). The N200 effect we observed was statistically alike across the scalp, which is consistent with previous findings (van den Brink et al., 2001). It has been argued that the N200 effect is distinct from the N400 effect that follows it (e.g., Connolly et al., 1992; Hagoort, 2008; van den Brink & Hagoort, 2004; van den Brink et al., 2001), and given that our ERPs also show two distinct effects of Lexicality, it appears that the N200 at 100–150 ms was indeed followed by an (early) N400 effect starting at 250 ms. As argued before (Baart & Samuel, 2015), the N200 effect obtained with final syllables and sentence final words may be quite similar, given that both ultimately reflect phonological violations of lexical predictions (see e.g., Connolly & Phillips, 1994, who argued that such early negativity may be attributed to a phonemic deviation from the lexical form). Effects of AV integration were largest in a 100–250 ms time-window at bilateral frontal electrodes. The time-course of these effects is thus quite consistent with the
Fig. 3. Panel (a) depicts p-values of the main effects of Modality and Lexicality and their interactions with Electrode, obtained with ANOVAs on averaged data in eight 50 ms time-windows in Experiment 1. The lower panel in (a) displays significance of the interaction between both contexts and their 3-way interaction with Electrode. The gray shaded areas below the alpha threshold indicate time-windows in which both contexts yielded significant results. Panel (b) displays the averaged effect of AV integration (AV–V–A) for words and pseudowords in a 150–250 ms window, for each electrode. Panels (c) and (d) are like (a) and (b) respectively, but for Experiment 2 instead of Experiment 1.
literature that shows AV integration at the N1 and P2 peaks (or in between both peaks, e.g., Alsius et al., 2014). The effect became manifest as more negative A-only ERPs than for the AV–V difference waves, which is consistent with the ERPs observed by Brunellière,
Sánchez-García, Ikumi, and Soto-Faraco (2013) for highly salient lip-read conditions, although these authors did not find this effect to be significant. Since Brunellière et al. (2013) used ongoing AV speech sentences whereas we only presented lip-read information in the final
syllable, it seems likely that the differences are related to the differences in experimental procedures. Possibly, insertion of a lip-read signal during ongoing auditory stimulation renders it highly unexpected for participants. This could have triggered a positive P3a component that has an anterior distribution and peaks at around 300 ms (Courchesne, Hillyard, & Galambos, 1975), which is essentially what we observed in the ERPs (see Fig. 2). However, since we know of no prior research in which a lip-read signal was inserted during ongoing auditory stimulation, this suggestion is clearly speculative. The central question of Experiment 1 was whether there is electrophysiological evidence for an interaction between lip-read and lexical speech context. Both contexts produced significant effects in four time-windows, yet we never found interactions between Lexicality and Modality. Our more focused test was less conclusive because the 150–250 ms time-window did not show clear effects of both contexts. Experiment 2 provides another opportunity to look for evidence of the two context types working together. By delaying the onset of the critical third syllable, the effect of each context type should potentially be clearer because the silent period should allow a clear auditory P2 to emerge, and to possibly be affected by each type of context.
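Before turning to Experiment 2, the additivity test used in both experiments can be spelled out in code form. The sketch below assumes per-subject mean amplitudes (subjects × electrodes) for each condition in a given time-window, computes the AV–V–A residual separately for words and pseudowords, and compares the two electrode-by-electrode with paired t-tests under a step-wise Holm–Bonferroni correction. It illustrates the logic behind Fig. 3b (and later Fig. 3d); it is not the authors' analysis code, and the data generated here are random.

import numpy as np
from scipy import stats

def holm_bonferroni(pvals, alpha=0.05):
    """Step-wise Holm-Bonferroni correction: returns a boolean 'significant' mask."""
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    m = len(pvals)
    significant = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break                      # once one test fails, all larger p-values fail too
    return significant

def av_integration_residual(erp_av, erp_v, erp_a):
    """AV - V - A residual; any activity left over is attributed to AV integration."""
    return erp_av - erp_v - erp_a

# Toy per-subject mean amplitudes in a 150-250 ms window: subjects x electrodes.
rng = np.random.default_rng(1)
n_subjects, n_electrodes = 18, 27
words = {c: rng.normal(0, 1, (n_subjects, n_electrodes)) for c in ("AV", "V", "A")}
pseudo = {c: rng.normal(0, 1, (n_subjects, n_electrodes)) for c in ("AV", "V", "A")}

resid_words = av_integration_residual(words["AV"], words["V"], words["A"])
resid_pseudo = av_integration_residual(pseudo["AV"], pseudo["V"], pseudo["A"])

# If lexicality modulated AV integration, the residuals should differ between words and
# pseudowords at some electrodes; in the reported data no electrode survived correction.
t_vals, p_vals = stats.ttest_rel(resid_words, resid_pseudo, axis=0)
print(holm_bonferroni(p_vals).any())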
Experiment 2: Delayed third syllable

In Experiment 2, we use the same basic materials and the same approach, but with stimuli in which we delayed the onset of the third syllable. By inserting 800 ms of silence before the onset of the third syllable, the syllable’s onset should produce the usual ERP pattern that has been studied in most previous research on AV speech integration. More specifically, there should be an N1/P2 complex for such cases, and prior studies of the effect of visual context have found significant effects in this time window (e.g., Stekelenburg & Vroomen, 2007; van Wassenhove et al., 2005). In our previous work (Baart & Samuel, 2015) we demonstrated a robust lexical effect at about 200 ms post onset (i.e. at the P2) with this kind of delay in the onset of the third syllable. Thus, Experiment 2 offers an opportunity to test for any interaction of the two context types under conditions that have been more widely examined in the literature on audiovisual context effects.

Methods

Participants

20 new adults with Spanish as their dominant language and with normal hearing and normal or corrected-to-normal vision participated. Two participants were excluded because of poor data quality (see below). The final sample included 8 females, and the mean age across participants was 21 years (SD = 1.9).

Stimuli

Stimulus material was the same as in Experiment 1.
Procedures

Experimental procedures were the same as in Experiment 1, except that in all conditions, third syllable onset was delayed relative to offset of the second syllable. This delay was 760, 800 or 840 ms (216 trials per delay, 72 per modality) and realized by adding silence in the sound files after offset of the second syllable, and adding 19, 20 or 21 black bitmaps (40 ms each) to the sequence triggered at stimulus onset (see Fig. 1).

EEG recording and analyses

EEG recording and analyses procedures were the same as before. Two participants were excluded from the analyses because more than a third (i.e. 43% and 50%) of the trials did not survive artifact rejection, whereas the proportion of trials with artifacts for the remaining participants was 17% (N = 1), 11% (N = 1) or <7% (N = 16). To facilitate a comparison across experiments, the analysis protocol was similar to that in Experiment 1. As before, we first look across the full 400 ms window for any evidence of interaction, and then focus on the most likely time window for these effects – around 200 ms.

Results

On average, participants detected 96% of the catch-trials (SD = 5%, range from 85% [N = 1] to 100% [N = 5]), indicating that they attended the screen as instructed.

The 0–400 ms window

The results of the 8 (Time-window; eight 50 ms bins from 0 to 400 ms) × 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV–V) ANOVA are summarized in Table 1. As in Experiment 1, Lexicality yielded a main effect and also showed an interaction with Time-window. Modality interacted with Time-window, as well as with Electrode. The Time-window × Electrode × Modality interaction was also significant. Critically, none of the interactions involving Lexicality and Modality reached significance. The 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV–V) ANOVAs conducted on each time-window are summarized in Fig. 3c; topography maps are shown in Appendix A. As in Experiment 1, the critical overarching finding is that in none of the time windows does the interaction of Modality and Lexicality (or their interaction together with Electrode) approach significance (see the bottom panel of Fig. 3c), Fs(1, 17) < 2.35, ps > .13, ηp²s < .13. Modality did not influence the effect that Lexicality had on auditory processing, and vice versa. In contrast, as summarized in the top panel of Fig. 3c, there were strong effects of both Lexicality and Modality in several critical time periods. Across the scalp, words elicited more positive ERPs than pseudowords in five consecutive time-windows from 150 to 400 ms (word–pseudoword differences ranged between .55 µV and .82 µV). There was also a main effect of Modality in the 350–400 ms window as averaged activity for A was 1.17 µV more positive than for AV–V. Lexicality interacted with Electrode in the 150–200 ms window, the 250–300 ms window, and the 350–400 ms window, Fs
(26, 442) > 3.09, ps < .05, ηp²s > .14. However, none of the significant paired comparisons between words and pseudowords in the 150–200 ms window survived the Holm–Bonferroni correction, whereas a cluster of six occipitoparietal electrodes in the 250–300 ms window, and four parietal electrodes in the 350–400 ms window did show reliable Lexicality effects in the anticipated direction (i.e., activity for pseudowords was more negative than for words). Modality interacted with Electrode in the 50–100 ms window, the 200–250 ms window, the 300–350 ms window, and the 350–400 ms window, Fs(26, 442) > 3.20, ps < .04, ηp²s > .15. In those four time-windows, auditory stimuli elicited more positive ERPs than AV–V, although the interaction was only reliable at electrode T8 in the 50–100 ms, 200–250 ms and 300–350 ms windows, whereas AV integration in the 350–400 ms window was observed in a large cluster of left-, mid- and right-central electrodes, with most prominent interactions at right scalp locations.

The 150–250 ms window

As in Experiment 1, the broad-window analyses showed strong Lexicality and Modality effects, in overlapping time periods, and no hint of any interactions between the two types of context. Looking at the time period around the expected critical delay of 200 ms, we conducted a targeted analysis as in Experiment 1. A 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV–V) ANOVA on the averaged data in a 150–250 ms window showed a main effect of Electrode, F(26, 442) = 13.10, p < .01, ηp² = .44, as activity was most positive at central electrodes (i.e., the maximal activity was 2.33 µV at Cz, consistent with the topography of the P2). There was a main effect of Lexicality, F(1, 17) = 9.61, p < .01, ηp² = .36, as Lexical context increased positive activity in the ERPs (i.e., average amplitudes were 1.43 µV for words and .87 µV for pseudowords). There was also an interaction between Electrode and Lexicality, F(26, 442) = 3.04, p = .02, ηp² = .15, with stronger lexical effects at some sites than others. Lip-read context decreased positivity (i.e., average amplitudes were 1.22 µV for A and 1.09 µV for AV–V). This effect varied with location, as the main effect of Modality was not significant, F < 1, but the interaction of Electrode and Modality was, F(26, 442) = 3.37, p = .02, ηp² = .17. Critically, Lexicality did not interact with Modality, F < 1. The interaction between Electrode, Lexicality and Modality was also not significant, F(26, 442) = 1.91, p = .13, ηp² = .10, as underscored by the comparisons between the averaged amplitudes of the AV integration effect (AV–V–A) for words with pseudowords at each electrode, that all yielded ps > .09 (see Fig. 3d).

Experiment 2: Discussion

Experiment 2 yielded four main findings: (1) lexical context modulated auditory processing in the 150–400 ms window, with relatively constant effects in terms of ERP amplitude across that time-frame, (2) lip-read context modulated auditory processing modestly in the 50–
100 ms, 200–250 ms, and 300–350 ms epochs, whereas effects in the 350–400 ms window were clearly larger, (3) the effects of context were in opposite directions; words yielded more positive ERPs relative to pseudowords, and effects of AV integration (AV–V) were manifest through more negative ERPs than auditory-only presentation, and (4) there were no interactions between the two types of context; AV integration was statistically alike for words and pseudowords, and likewise, lexical processing was statistically alike for A and AV–V.

As can be seen in Fig. 2, the effect of lexicality starting at around 150 ms was characterized by a more negative P2 peak in the ERPs for pseudowords than words. In our previous study that included a comparable delay condition for auditory stimuli, we observed a similar effect (Baart & Samuel, 2015). Given that the lexicality effect is thus similar to the effect observed with naturally timed stimuli, it appears that the N200 effect can survive the 800 ms delay, producing an N200 effect superimposed on the obligatory P2. Despite the clear central topography (as often found for the P2, see e.g., Stekelenburg & Vroomen, 2007), effects of lexicality were statistically alike across the scalp, providing additional support for the hypothesis that the effect is an N200-effect with a central distribution (Connolly et al., 1990) that is nonetheless statistically alike across the scalp (van den Brink et al., 2001). Although effects of lip-read context were largest in the 350–400 ms epoch, the same pattern of AV integration (i.e., more negative ERP amplitudes for AV–V than for A) started to appear at the P2 peak for a number of mid-central electrodes (e.g., C3, Cz, C4). The trend is clearly visible in Fig. 2, though as noted not significant until later. The patterns in the ERPs are consistent with previous studies that have shown that lip-read information suppresses the auditory P2 (e.g., van Wassenhove et al., 2005). The weaker/later effect here may be due to the overall modest size of the N1/P2 complex, which presumably stems from our materials being more complex than the speech typically used in previous studies. Such studies usually employ single, stressed syllables beginning with a stop consonant (i.e., /pa/, /py/, /po/, /pi/, Besle, Fort, Delpuech, et al., 2004), and/or that are extremely well articulated (Stekelenburg & Vroomen, 2007). Such stimuli produce sharper sound onsets than our critical syllables because ours were taken from natural 3-syllable utterances with second-syllable stress, with initial consonants belonging to different consonant classes (i.e., the voiced dental fricative /d/ [ð], the velar approximant /g/ [ɣ̞], the voiceless velar fricative /j/ [x], and the nasal /n/ [n]). Of note, however, is that others have observed N1/P2 peaks that are comparable in size (i.e., a 4 µV peak-to-peak amplitude, Ganesh, Berthommier, Vilain, Sato, & Schwartz, 2014), or even smaller than what we observed here (Winneke & Phillips, 2011). Moreover, even when N1/P2 amplitudes are of the usual size, lip-read information does not always suppress the N1 (Baart et al., 2014) and/or the P2 (Alsius et al., 2014). AV P2 peaks may even be larger than for A-only stimulation (see e.g. Fig. 2 in Treille, Vilain, & Sato, 2014), indicating that the effects partially depend on procedural details.
General discussion

The current study was designed to examine the neural consequences of lip-read and lexical context when listening to speech input, and in particular, to determine whether these two types of context exert a mutual influence. Our approach is based on the additive factors method that has been used very productively in both behavioral work (e.g., Sternberg, 1969, 2013) and in prior ERP studies (e.g., Alsius et al., 2014; Baart et al., 2014; Besle, Fort, Delpuech, et al., 2004; Besle, Fort, & Giard, 2004; Stekelenburg & Vroomen, 2007). This approach depends on finding clear effects of each of the two factors of interest, and then determining whether the two effects simply sum, or if instead they operate in a non-additive fashion. Independent factors produce additive effects, while interactive ones can yield over-additive or under-additive patterns. In two experiments that were built on quite different listening situations (naturally timed versus delayed onset of a word’s third syllable) we found pervasive effects of both lexicality and lip-read context. In comparing items in which the third syllable determined whether an item was a word or a pseudoword, we replicated the recent finding of an auditory N200-effect of stimulus lexicality (Baart & Samuel, 2015). In naturally timed items, ERPs for pseudowords were more negative than for words just before and just after 200 ms, and for the delayed stimuli this effect reached significance by 200 ms and remained robust after that. Effects of lip-read context were widespread for both the naturally timed and the delayed stimuli, and critically, overlapped the lexicality effect at various time-windows in both experiments. Thus, our two experiments produced the necessary conditions for detecting an interaction of these two factors, if such an interaction exists. The fundamental finding of the current study is that the interaction of the two types of context never approached significance, in any time window. We believe we provided every opportunity for an interaction to appear by running the test with two such different timing situations, and by analyzing the ERPs both broadly and with a specific focus in the 200 ms region. Thus, the most appropriate conclusion is that lexical context and lip-read context operate independently, rather than being combined to form a broader ‘‘context effect”. As we noted previously, the claim of independent effects that comes from finding an additive pattern is inherently a kind of null effect – the absence of an interaction. The fact that the main effects were robust, and that there were many other interactions (e.g., with Electrode) suggests that the lack of the critical interaction is not due to insufficient power. More substantively, as Sternberg (2013) points out in his comments on the additive factors approach, any finding of this sort needs to be placed in a broader empirical and theoretical context – the test provides one type of potential converging evidence. We believe that the broader empirical and theoretical context does in fact converge with our conclusion. There are three
relevant bodies of empirical work, and some recent theoretical developments that provide a very useful context for them. One set of empirical findings comes from behavioral studies that examined the potential interaction between lip-read context and the lexicon. These studies tested whether the impact of lip-read context varies as a function of whether the auditory signal is a real word or not. The evidence from these studies is rather mixed. Dekle, Fowler, and Funnell (1992) found McGurk-like lip-read biases on auditory words, and a study in Finnish (Sams, Manninen, Surakka, Helin, & Kättö, 1998) also showed strong and similar McGurk effects for syllables, words and words embedded in sentences. These initial findings are consistent with independent effects of lip-read and lexical context. However, Brancazio (2004) suggested that the stimuli used by Sams et al. (1998) may have been too variable to determine any effect of the lexicon on AV integration. In particular, the position of the audiovisual phonetic incongruency differed between words and nonwords, as did the vowel that followed the incongruence. Brancazio (2004) therefore fixed the position of the AV incongruency at stimulus onset and controlled subsequent vowel identity across stimuli, and observed that McGurk effects were larger when the critical phoneme produced a word rather than a non-word. In a subsequent study (Barutchu et al., 2008), the observed likelihood of McGurk-influenced responses was the same for words and pseudowords when the incongruency came at stimulus onset (which seems reasonable given that lexical information at sound onset is minimal at best), but was lower for words than for pseudowords at stimulus offset. Overall, studies using this approach may provide some suggestion of an interaction between lip-read and lexical context, but the evidence is quite variable and subject to a complex set of stimulus properties. As discussed in the Introduction, two other lines of research have shown a clear divergence between the effect of lip-read context and lexical context – studies using selective adaptation, and semantic priming. Recall that Ostrand et al. (2011) recently reported that when an audiovisual prime is constructed with conflicting auditory (e.g., ‘‘beef”) and visual (e.g., ‘‘deef”) components, listeners often perceive the prime on the basis of visual capture (e.g., they report perceiving ‘‘deef”), but semantic priming is dominated by the unperceived auditory component (‘‘beef” primes ‘‘pork”, even though the subject claims not to have heard ‘‘beef”). This pattern demonstrates a dissociation between a percept that is dominated by lip-read context, and internal processing (lexical/semantic priming) that is dominated by the auditory stimulus. Several studies using the selective adaptation paradigm have shown a comparable dissociation (Roberts & Summerfield, 1981; Saldaña & Rosenblum, 1994; Samuel & Lieblich, 2014; Shigeno, 2002). In these studies, the critical adaptors were audiovisual stimuli in which the reported percept depended on the lip-read visual component, but in all cases the observed adaptation
Samuel and Lieblich (2014) tried to strengthen the audiovisual percept by making it lexical (e.g., pairing a visual "armagillo" with an auditory "armibillo" to generate the perceived adaptor "armadillo"), but the results were entirely like those from the studies using simple syllables as adaptors – the shifts were always determined by the auditory component, not by the perceived stimulus. The results from all of these studies contrast with those from multiple experiments in which the perceived identity of the adaptor was determined by lexical context, either through phonemic restoration (Samuel, 1997) or via the Ganong effect (Samuel, 2001). In the lexical cases, the perceived identity of the adaptors matched the observed adaptation shifts.

Looking across all of these findings, it is clear that both lip-read context and lexical context produce reliable and robust effects on what listeners say they are perceiving. At the same time, there is now sufficient behavioral and electrophysiological evidence to sustain the claim that these two types of context operate differently, and at least mostly independently. Samuel and Lieblich (2014) suggested that the adaptation data are consistent with somewhat different roles for lip-read information and for lexical information. They argued that people's reported percepts indicate that a primary role for lip-read context is to aid in the overt perception of the stimulus – such context directly affects what people think they are hearing. Lexical context can also do this, but in addition, it seems to have a direct impact on the linguistic encoding of the speech. Samuel and Lieblich argued that the successful adaptation by lexically-determined speech sounds was evidence for the linguistic encoding of those stimuli, while the unsuccessful adaptation found repeatedly for lip-read based percepts shows that this type of context does not directly enter the linguistic processing chain. This distinction is grounded in the idea that speech is simultaneously both a linguistic object and a perceptual one; lip-read context seems to primarily affect the latter.

This distinction in fact closely matches one that Poeppel and his colleagues have drawn on the basis of a wide range of other evidence. As Poeppel (2003) put it, "Sound-based representations interface in task-dependent ways with other systems. An acoustic–phonetic–articulatory 'coordinate transformation' occurs in a dorsal pathway . . . that links auditory representations to motor representations in superior temporal/parietal areas. A second, 'ventral' pathway interfaces speech derived representations with lexical semantic representations" (p. 247). The first pathway in this approach aligns with Samuel and Lieblich's (2014) more perceptual analysis, while the second one clearly matches the more linguistic encoding. In fact, Poeppel's first pathway explicitly connects to motor representations, exactly the kinds of representations that would naturally be associated with lip-read information. And his second pathway explicitly involves lexical semantic representations, an obvious match to the lexical context effects examined here.
The theoretical distinctions made by both Poeppel and by Samuel and Lieblich provide a natural basis for the hypothesized independence of lip-read context and lexical context: Each type of context is primarily involved with somewhat different properties of the speech signal, reflecting its dual nature as both a perceptual and a linguistic stimulus. Moreover, Poeppel's analysis suggests that the two functions are subserved by different neural circuits, providing an explanation for the observed additive effects: There is little or no interaction because the different types of context are being routed separately.

We should stress that the independence of the two pathways is a property of the online processing of speech. The initial independence does not preclude one type of context from eventually affecting the "other" kind of information sometime later. For example, lip-read-only words can prime the semantic categories to which an auditory target belongs (Dodd, Oerlemans, & Robinson, 1989), and lip-reading the first syllable of a low-frequency word may prime auditory recognition of that word (Fort et al., 2012). However, the fact that lip-read information may affect later auditory perception is not evidence for on-line interaction between the two contexts while processing a speech sound, or even for the visual information directly affecting linguistic encoding; an initial perceptual effect may eventually have downstream linguistic consequences. The consistently independent effects of the two context types on the observed ERPs in the current study, together with the accumulating findings from multiple behavioral paradigms, provide strong support for the view that lip-reading primarily guides perceptual analysis while lexical context primarily supports linguistic encoding of the speech signal.
Acknowledgments

This work was supported by Rubicon grant 446-11-014 from the Netherlands Organization for Scientific Research (NWO) to MB, and by MINECO grant PSI2010-17781 from the Spanish Ministry of Economy and Competitiveness to AGS.
Appendix A

Figs. 4 and 5 display the ERPs for each stimulus type (A, V, and AV), from 600 ms before onset of the critical third syllable to 500 ms after it. As can be seen, all conditions yield similar ERPs before video-onset, which is to be expected given that two auditory syllables were played during this period in all conditions. Onset of the third syllable's lip-read information (indicated by 'video') clearly led to a response in the V and AV conditions (there was no lip-read information in the A condition), and the V and AV waveforms start to diverge after auditory onset of the third syllable (indicated by 'audio'). ERPs for words and pseudowords look quite similar in all conditions, but to capture effects of AV integration, analyses were conducted on A and on the AV–V difference waves (see Fig. 2).
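As a rough illustration of this additive-model comparison (the array names, shapes, and timing parameters below are hypothetical, not the scripts used for the reported analyses), the AV–V difference wave is obtained by subtracting the visual-only response from the audiovisual response before extracting window-averaged amplitudes:

```python
# Minimal sketch of the AV-V additive-model comparison; all names, shapes, and
# timing parameters are hypothetical and only illustrate the arithmetic.
import numpy as np

n_subjects, n_electrodes, n_samples = 20, 32, 550   # e.g., -0.1 to 1.0 s at 500 Hz
sfreq, tmin = 500, -0.1                             # sampling rate (Hz), epoch start (s)

# Per-subject average ERPs (placeholder zeros stand in for real data),
# time-locked to onset of the critical third syllable.
erp_a  = np.zeros((n_subjects, n_electrodes, n_samples))  # auditory-only
erp_v  = np.zeros((n_subjects, n_electrodes, n_samples))  # lip-read-only
erp_av = np.zeros((n_subjects, n_electrodes, n_samples))  # audiovisual

# AV-V removes the visual-evoked activity that AV and V have in common, so any
# remaining difference from A reflects audiovisual interaction rather than the
# mere presence of a visual response.
erp_av_minus_v = erp_av - erp_v

def window_mean(erp, t_start, t_stop):
    """Mean amplitude per subject and electrode in a [t_start, t_stop) window (s)."""
    i0 = int(round((t_start - tmin) * sfreq))
    i1 = int(round((t_stop - tmin) * sfreq))
    return erp[:, :, i0:i1].mean(axis=-1)

# Values that would enter a Modality (A vs. AV-V) comparison for one 50 ms window.
mean_a    = window_mean(erp_a, 0.150, 0.200)
mean_av_v = window_mean(erp_av_minus_v, 0.150, 0.200)
modality_effect = mean_av_v - mean_a
```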
Fig. 4. ERPs for unimodal (A and V) and AV stimuli in Experiment 1 when the final syllable made up a word (upper panel) or a pseudoword (lower panel). Onset of the lip-read information is indicated by 'video', and 'audio' refers to auditory onset of the third syllable (e.g., 'ga'), which was presented immediately after the first two auditory syllables (e.g., 'lechu').
Fig. 6 displays topography maps of the averaged activity in eight consecutive 50 ms time-windows. The maps are symmetrically scaled, with different maximal
amplitudes for each time-window. Negativity is indicated by black contour lines and positivity by white ones. Main effects of Lexicality are indicated by rectangles around the topography maps corresponding to the time-windows where main effects were observed.
Fig. 5. ERPs for unimodal (A and V) and AV stimuli in Experiment 2 when the final syllable made up a word (upper panel) or a pseudoword (lower panel). Onset of the lip-read information is indicated by 'video', and 'audio' refers to auditory onset of the third syllable (e.g., 'ga'), which was preceded by 800 ms of silence inserted after the first two auditory syllables (e.g., 'lechu').
Interactions between Lexicality or Modality and Electrode in any given time-window were assessed through pair-wise t-tests at each electrode (testing activity for words vs. pseudowords, or for A vs. AV–V, depending on the factor that interacted with Electrode),
and electrodes for which the obtained p-value survived a Holm–Bonferroni correction are indicated by asterisks. Test parameters of interest are summarized below the maps. Delta values (Δ) indicate the minimal amplitude difference observed across the significant tests. The minimal t-values and maximal p-values of the comparisons are provided, as well as the maximal 95% Confidence Interval around delta and the minimal effect-size (Cohen's d).
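The following sketch shows one way such a follow-up step can be implemented (the data, electrode count, and alpha level are hypothetical, and this is not the authors' own code): a paired t-test is run at every electrode and the resulting p-values are passed through the Holm–Bonferroni step-down rule (Holm, 1979).

```python
# Sketch of per-electrode follow-up tests with a Holm-Bonferroni correction;
# the data and dimensions are hypothetical placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_electrodes = 20, 32
cond1 = rng.normal(0.0, 1.0, (n_subjects, n_electrodes))  # e.g., words
cond2 = rng.normal(0.3, 1.0, (n_subjects, n_electrodes))  # e.g., pseudowords

# Paired t-test at each electrode (one p-value per electrode).
t_vals, p_vals = stats.ttest_rel(cond1, cond2, axis=0)

def holm(p, alpha=0.05):
    """Holm-Bonferroni step-down: test the k-th smallest p against alpha/(m-k),
    stopping at the first comparison that fails."""
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for k, idx in enumerate(order):
        if p[idx] <= alpha / (m - k):
            reject[idx] = True
        else:
            break
    return reject

significant = holm(p_vals)
print("electrodes surviving correction:", np.flatnonzero(significant))
```

The step-down rule tests the smallest p-value against α/m, the next smallest against α/(m−1), and so on, which controls the family-wise error rate across the electrodes while being less conservative than a plain Bonferroni correction.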
Fig. 6. Scalp topographies of average activity in 50 ms windows for words, pseudowords, auditory and AV–V in Experiments 1 (upper panel) and 2 (lower panel). Rectangles indicate time-windows in which main effects of Lexicality (words vs. pseudowords) or Modality (AV–V vs. A) were observed. Interactions between Electrode and Lexicality, or Electrode and Modality, were followed up by pair-wise comparisons that tested the context effect at each electrode. Tests for which the p-value survived a Holm–Bonferroni correction are indicated by asterisks. The corresponding test parameters are displayed below the maps and include the minimal difference across electrodes (Δ), the minimal t-value, the maximal p-value, the maximal 95% Confidence Interval around the difference, and the minimal effect-size (Cohen's d).
Fig. 6 displays these maps for both experiments, with the upper panel corresponding to Experiment 1 and the lower panel to Experiment 2.
References

Alsius, A., Möttönen, R., Sams, M. E., Soto-Faraco, S., & Tiippana, K. (2014). Effect of attentional load on audiovisual speech perception: Evidence from ERPs. Frontiers in Psychology, 5, 727. Baart, M., de Boer-Schellekens, L., & Vroomen, J. (2012). Lipread-induced phonetic recalibration in dyslexia. Acta Psychologica, 140, 91–95. Baart, M., & Samuel, A. G. (2015). Early processing of auditory lexical predictions revealed by ERPs. Neuroscience Letters, 585, 98–102. Baart, M., Stekelenburg, J. J., & Vroomen, J. (2014). Electrophysiological evidence for speech-specific audiovisual integration. Neuropsychologia, 53, 115–121. Baart, M., & Vroomen, J. (2010). Phonetic recalibration does not depend on working memory. Experimental Brain Research, 203, 575–582. Barutchu, A., Crewther, S. G., Kiely, P., Murphy, M. J., & Crewther, D. P. (2008). When /b/ill with /g/ill becomes /d/ill: Evidence for a lexical effect in audiovisual speech perception. European Journal of Cognitive Psychology, 20, 1–11. Bertelson, P., Vroomen, J., & De Gelder, B. (2003). Visual recalibration of auditory speech identification: A McGurk aftereffect. Psychological Science, 14, 592–597. Besle, J., Fort, A., Delpuech, C., & Giard, M. H. (2004). Bimodal speech: Early suppressive visual effects in human auditory cortex. European Journal of Neuroscience, 20, 2225–2234. Besle, J., Fort, A., & Giard, M. H. (2004). Interest and validity of the additive model in electrophysiological studies of multisensory interactions. Cognitive Processing, 5, 189–192. Brancazio, L. (2004). Lexical influences in audiovisual speech perception. Journal of Experimental Psychology: Human Perception and Performance, 30, 445–463. Brunellière, A., Sánchez-García, C., Ikumi, N., & Soto-Faraco, S. (2013). Visual information constrains early and late stages of spoken-word recognition in sentence context. International Journal of Psychophysiology, 89, 136–147. Colin, C., Radeau, M., Soquet, A., & Deltenre, P. (2004). Generalization of the generation of an MMN by illusory McGurk percepts: Voiceless consonants. Clinical Neurophysiology, 115, 1989–2000. Colin, C., Radeau, M., Soquet, A., Demolin, D., Colin, F., & Deltenre, P. (2002). Mismatch negativity evoked by the McGurk-MacDonald effect: A phonetic representation within short-term memory. Clinical Neurophysiology, 113, 495–506. Connolly, J. F., & Phillips, N. A. (1994). Event-related potential components reflect phonological and semantic processing of the terminal word of spoken sentences. Journal of Cognitive Neuroscience, 6, 256–266. Connolly, J. F., Phillips, N. A., Stewart, S. H., & Brake, W. G. (1992). Event-related potential sensitivity to acoustic and semantic properties of terminal words in sentences. Brain and Language, 43, 1–18. Connolly, J. F., Stewart, S. H., & Phillips, N. A. (1990). The effects of processing requirements on neurophysiological responses to spoken sentences. Brain and Language, 39, 302–318. Courchesne, E., Hillyard, S. A., & Galambos, R. (1975). Stimulus novelty, task relevance and the visual evoked potential in man. Electroencephalography and Clinical Neurophysiology, 39, 131–143. Dekle, D. J., Fowler, C. A., & Funnell, M. G. (1992). Audiovisual integration in perception of real words. Perception & Psychophysics, 51, 355–362. Dodd, B., Oerlemans, M., & Robinson, R. (1989). Cross-modal effects in repetition priming: A comparison of lip-read graphic and heard stimuli. Visible Language, 22, 59–77. Duchon, A., Perea, M., Sebastián-Gallés, N., Martí, A., & Carreiras, M.
(2013). EsPal: One-stop shopping for Spanish word properties. Behavior Research Methods, 1–13. Eimas, P. D., & Corbit, J. D. (1973). Selective adaptation of linguistic feature detectors. Cognitive Psychology, 4, 99–109. Eisner, F., & McQueen, J. M. (2006). Perceptual learning in speech: Stability over time. Journal of the Acoustical Society of America, 119, 1950–1953. Fort, A., Delpuech, C., Pernier, J., & Giard, M. H. (2002). Early auditory– visual interactions in human cortex during nonredundant target identification. Cognitive Brain Research, 14, 20–30. Fort, M., Kandel, S., Chipot, J., Savariaux, C., Granjon, L., & Spinelli, E. (2012). Seeing the initial articulatory gestures of a word triggers lexical access. Language and Cognitive Processes, 28, 1207–1223. Frtusova, J. B., Winneke, A. H., & Phillips, N. A. (2013). ERP evidence that auditory–visual speech facilitates working memory in younger and older adults. Psychology and Aging, 28, 481–494. Ganesh, A. C., Berthommier, F., Vilain, C., Sato, M., & Schwartz, J. L. (2014). A possible neurophysiological correlate of audiovisual binding and unbinding in speech perception. Frontiers in Psychology, 5, 1340.
Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6, 110–125. Giard, M. H., & Peronnet, F. (1999). Auditory-visual integration during multimodal object recognition in humans: A behavioral and electrophysiological study. Journal of Cognitive Neuroscience, 11, 473–490. Gratton, G., Coles, M. G., & Donchin, E. (1983). A new method for off-line removal of ocular artifact. Electroencephalography and Clinical Neurophysiology, 55, 468–484. Hagoort, P. (2008). The fractionation of spoken language understanding by measuring electrical and magnetic brain signals. Philosophical Transactions of the Royal Society B: Biological Sciences, 363, 1055–1069. Holcomb, P. J., Kounios, J., Anderson, J. E., & West, W. C. (1999). Dual-coding, context-availability, and concreteness effects in sentence comprehension: An electrophysiological investigation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 721–742. Holcomb, P. J., & Neville, H. J. (1990). Auditory and visual semantic priming in lexical decision: A comparison using event-related brain potentials. Language and Cognitive Processes, 5, 281–312. Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70. Klucharev, V., Möttönen, R., & Sams, M. (2003). Electrophysiological indicators of phonetic and non-phonetic multisensory interactions during audiovisual speech perception. Cognitive Brain Research, 18, 65–75. Kraljic, T., Brennan, S. E., & Samuel, A. G. (2008). Accommodating variation: Dialects, idiolects, and speech processing. Cognition, 107, 54–81. Kraljic, T., & Samuel, A. G. (2005). Perceptual learning for speech: Is there a return to normal? Cognitive Psychology, 51, 141–178. Kraljic, T., & Samuel, A. G. (2006). Generalization in perceptual learning for speech. Psychonomic Bulletin & Review, 13, 262–268. Kraljic, T., & Samuel, A. G. (2007). Perceptual adjustments to multiple speakers. Journal of Memory and Language, 56, 1–15. Kutas, M., & Federmeier, K. D. (2011). Thirty years and counting: Finding meaning in the N400 component of the event-related brain potential (ERP). Annual Review of Psychology, 62, 621–647. MacGregor, L. J., Pulvermüller, F., van Casteren, M., & Shtyrov, Y. (2012). Ultra-rapid access to words in the brain. Nature Communications, 3, 711. McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748. Näätänen, R., Gaillard, A. W. K., & Mäntysalo, S. (1978). Early selective-attention effect in evoked potential reinterpreted. Acta Psychologica, 42, 313–329. Norris, D., McQueen, J. M., & Cutler, A. (2003). Perceptual learning in speech. Cognitive Psychology, 47, 204–238. Ostrand, R., Blumstein, S. E., & Morgan, J. L. (2011). When hearing lips and seeing voices becomes perceiving speech: Auditory-visual integration in lexical access. Paper presented at the 33rd annual conference of the Cognitive Science Society, Austin, Texas. Perrin, F., & García-Larrea, L. (2003). Modulation of the N400 potential during auditory phonological/semantic interaction. Cognitive Brain Research, 17, 36–47. Pettigrew, C. M., Murdoch, B. E., Ponton, C. W., Finnigan, S., Alku, P., Kei, J., et al. (2004). Automatic auditory processing of English words as indexed by the mismatch negativity, using a multiple deviant paradigm. Ear and Hearing, 25, 284–301. Pitt, M. A., & Samuel, A. G. (1993).
An empirical and meta-analytic evaluation of the phoneme identification task. Journal of Experimental Psychology: Human Perception and Performance, 19, 699–725. Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as ‘asymmetric sampling in time’. Speech Communication, 41, 245–255. Pulvermüller, F., Kujala, T., Shtyrov, Y., Simola, J., Tiitinen, H., Alku, P., et al. (2001). Memory traces for words as revealed by the mismatch negativity. Neuroimage, 14, 607–616. Roberts, M., & Summerfield, Q. (1981). Audiovisual presentation demonstrates that selective adaptation in speech perception is purely auditory. Perception & Psychophysics, 30, 309–314. Saint-Amour, D., De Sanctis, P., Molholm, S., Ritter, W., & Foxe, J. J. (2007). Seeing voices: High-density electrical mapping and source-analysis of the multisensory mismatch negativity evoked during the McGurk illusion. Neuropsychologia, 45, 587–597. Saldaña, H. M., & Rosenblum, L. D. (1994). Selective adaptation in speech perception using a compelling audiovisual adaptor. Journal of the Acoustical Society of America, 95, 3658–3661.
Sams, M., Manninen, P., Surakka, V., Helin, P., & Kättö, R. (1998). McGurk effect in Finnish syllables, isolated words, and words in sentences: Effects of word meaning and sentence context. Speech Communication, 26, 75–87. Samuel, A. G. (1981). Phonemic restoration: Insights from a new methodology. Journal of Experimental Psychology: General, 110, 474–494. Samuel, A. G. (1986). Red herring detectors and speech perception: In defense of selective adaptation. Cognitive Psychology, 18, 452–499. Samuel, A. G. (1997). Lexical activation produces potent phonemic percepts. Cognitive Psychology, 32, 97–127. Samuel, A. G. (2001). Knowing a word affects the fundamental perception of the sounds within it. Psychological Science, 12, 348–351. Samuel, A. G., & Lieblich, J. (2014). Visual speech acts differently than lexical context in supporting speech perception. Journal of Experimental Psychology: Human Perception and Performance, 40, 1479–1490. Shigeno, S. (2002). Anchoring effects in audiovisual speech perception. Journal of the Acoustical Society of America, 111, 2853–2861. Stekelenburg, J. J., & Vroomen, J. (2007). Neural correlates of multisensory integration of ecologically valid audiovisual events. Journal of Cognitive Neuroscience, 19, 1964–1973. Stekelenburg, J. J., & Vroomen, J. (2012). Electrophysiological correlates of predictive coding of auditory location in the perception of natural audiovisual events. Frontiers in Integrative Neuroscience, 6, 26. Sternberg, S. (1969). The discovery of processing stages: Extensions of Donders' method. In W. G. Koster (Ed.). Attention and performance II. Acta Psychologica (Vol. 30, pp. 276–315). Amsterdam: North-Holland. Sternberg, S. (2013). The meaning of additive reaction-time effects: Some misconceptions. Frontiers in Psychology, 4, 744. Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215. Treille, A., Vilain, C., & Sato, M. (2014). The sound of your lips: Electrophysiological cross-modal interactions during hand-to-face and face-to-face speech perception. Frontiers in Psychology, 5, 420. van den Brink, D., Brown, C. M., & Hagoort, P. (2001). Electrophysiological evidence for early contextual influences during spoken-word
recognition: N200 versus N400 effects. Journal of Cognitive Neuroscience, 13, 967–985. van den Brink, D., & Hagoort, P. (2004). The influence of semantic and syntactic context constraints on lexical selection and integration in spoken-word comprehension as revealed by ERPs. Journal of Cognitive Neuroscience, 16, 1068–1084. van Linden, S., Stekelenburg, J. J., Tuomainen, J., & Vroomen, J. (2007). Lexical effects on auditory speech perception: An electrophysiological study. Neuroscience Letters, 420, 49–52. van Linden, S., & Vroomen, J. (2007). Recalibration of phonetic categories by lipread speech versus lexical information. Journal of Experimental Psychology: Human Perception and Performance, 33, 1483–1494. van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102, 1181–1186. Vroomen, J., & Baart, M. (2009a). Phonetic recalibration only occurs in speech mode. Cognition, 110, 254–259. Vroomen, J., & Baart, M. (2009b). Recalibration of phonetic categories by lipread speech: Measuring aftereffects after a twenty-four hours delay. Language and Speech, 52, 341–350. Vroomen, J., & Stekelenburg, J. J. (2010). Visual anticipatory information modulates multisensory interactions of artificial audiovisual stimuli. Journal of Cognitive Neuroscience, 22, 1583–1596. Vroomen, J., van Linden, S., de Gelder, B., & Bertelson, P. (2007). Visual recalibration and selective adaptation in auditory-visual speech perception: Contrasting build-up courses. Neuropsychologia, 45, 572–577. Vroomen, J., van Linden, S., Keetels, M., de Gelder, B., & Bertelson, P. (2004). Selective adaptation and recalibration of auditory speech by lipread information: Dissipation. Speech Communication, 44, 55–61. Warren, R. M. (1970). Perceptual restoration of missing speech sounds. Science, 167, 392–393. Winneke, A. H., & Phillips, N. A. (2011). Does audiovisual speech offer a fountain of youth for old ears? An event-related brain potential study of age differences in audiovisual speech perception. Psychology and Aging, 26, 427–438.