Hearing Research 307 (2014) 98–110
Review
Functional imaging of auditory scene analysis
Alexander Gutschalk*, Andrew R. Dykstra
Department of Neurology, Ruprecht-Karls-University Heidelberg, Heidelberg, Germany
Article history: Received 21 May 2013 Received in revised form 26 July 2013 Accepted 8 August 2013 Available online 19 August 2013
Our auditory system is constantly faced with the task of decomposing the complex mixture of sound arriving at the ears into perceptually independent streams constituting accurate representations of individual sound sources. This decomposition, termed auditory scene analysis, is critical for both survival and communication, and is thought to underlie both speech and music perception. The neural underpinnings of auditory scene analysis have been studied utilizing invasive experiments with animal models as well as non-invasive (MEG, EEG, and fMRI) and invasive (intracranial EEG) studies conducted with human listeners. The present article reviews human neurophysiological research investigating the neural basis of auditory scene analysis, with emphasis on two classical paradigms termed streaming and informational masking. Other paradigms, such as the continuity illusion, mistuned harmonics, and multi-speaker environments, are briefly addressed thereafter. We conclude by discussing the emerging evidence for the role of auditory cortex in remapping incoming acoustic signals into a perceptual representation of auditory streams, which are then available for selective attention and further conscious processing. This article is part of a Special Issue entitled
© 2013 Elsevier B.V. All rights reserved.
1. Introduction

At any given moment, our environment is comprised of multiple sound sources, such that the sound arriving at our ear canals is a complex mixture, with acoustic energy from each source overlapping in both time and frequency with that from other sources. One of the primary functions of the human auditory system is to break this mixture down into individual sound elements that ideally, when grouped together, constitute all the elements produced by an individual source while excluding elements from all other sources. Successive sounds that bind together perceptually are referred to as an auditory stream, and the process of perceptual organization by integrating sound into auditory streams and segregating two or more streams from each other has been termed auditory scene analysis (Bregman, 1990).
Abbreviations: ARN, awareness related negativity; BOLD, blood oxygenation level dependent; ΔF, frequency difference; EEG, electroencephalography; fMRI, functional magnetic resonance imaging; ISI, inter-stimulus interval; ITD, inter-aural time difference; MEG, magnetoencephalography; ROI, region of interest; SSR, steady-state response; STG, superior temporal gyrus
* Corresponding author. Department of Neurology, University of Heidelberg, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany. Tel.: +49 6221 56 36811; fax: +49 6221 56 5258. E-mail address: [email protected] (A. Gutschalk).
0378-5955/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.heares.2013.08.003
Auditory scene analysis relies on a number of physiological processes, some of which are well studied in other contexts, and others which may be specifically related to perceptual organization. The quest for the neural mechanisms specifically related to auditory scene analysis has received increasing research interest in recent years, utilizing invasive animal models as well as noninvasive functional imaging and electrophysiological studies in human listeners. Previous reviews on auditory scene analysis have discussed in detail the research performed with animal models (Micheyl et al., 2007b) as well as the behavioral and mismatch-negativity (MMN) literature (Snyder and Alain, 2007). The present review will focus on functional imaging studies conducted in human listeners, streaming cues other than pure-tone frequency, and more complex sequence configurations. The first section focuses on the auditory stream segregation (or streaming) paradigm, a classical paradigm often used to study basic, sequential source segregation. Although the stimuli themselves are quite simple, the streaming paradigm has nonetheless proven particularly fruitful in light of the fact that it produces bistable perception; that is, perception that changes despite identical stimuli. The use of bistability in examining the neural basis of perception, per se, is addressed in the following section. We then turn to more complex stimulation paradigms, where multiple tones are presented in variable configurations, focusing in particular on so-called multi-tone informational masking paradigms. We then
briefly address a number of other paradigms from the scene-analysis literature, before attempting a synthesis of the various findings in the final section. Based on the studies reviewed, we argue that auditory cortex may represent the major interface between faithful representations of acoustic stimuli and perceptual representations of auditory streams.

2. Auditory stream segregation

Stream segregation is a now-classical paradigm with which to study the segregation of temporally interleaved tone sequences. In the simplest case, two tones, A and B, continuously alternate in a regular ABAB... pattern. When the frequency difference (ΔF) between the tones is small and the presentation rate slow, the sequence is perceived as a single stream of alternating tones (a 'trill') (Miller and Heise, 1950). Conversely, when the rate is sufficiently high and the ΔF sufficiently large, two separate streams, one of A tones and another of B tones, are perceived. The latter phenomenon has been referred to as the streaming effect (Bregman, 1990).1 Several variants of such patterns have been used in the auditory scene analysis literature, the most ubiquitous being the ABA_ABA_... pattern (Fig. 1) introduced by Van Noorden (1975). This pattern produces a characteristic change in perceived rhythm that is well suited for instructing experimental listeners, because a rhythmic percept is less abstract than a theoretical explanation of what constitutes one or two streams. When the ABA_ pattern is perceived as one stream, it produces a distinct, galloping rhythm. When the pattern splits, two isochronous streams are perceived, one with double the rate (A) of the other (B). Apart from the modification of rate or rhythmic percept that can be effected by streaming, there are other, objective effects as well. For example, the separation of streams makes it more difficult to estimate the temporal relationship between two sound elements, even if they are adjacent in time, if they do not belong to the same stream (Bregman and Campbell, 1971; Vliegen et al., 1999b).

Fig. 1. Example of the classical ABA_ streaming paradigm introduced by Van Noorden (1975), where A and B are pure tones with a frequency difference ΔF and "_" is a silent pause. (a) The sequence is perceived as one stream with a characteristic, galloping rhythm when the ΔF is small. (b) At larger ΔF, the pattern is usually heard as two segregated, isochronous streams. In the latter case, the predominant perceptual inter-stimulus interval (ISI) within the B-tone stream is markedly longer than the time interval between the B tones and the leading A tones when both are integrated into one stream.

1 Note that the term streaming has alternatively been used to characterize any kind of sequential grouping in auditory perception. Following Bregman (1990, page 47), we will only use streaming in the context of the classical, alternating stream segregation paradigm in this paper.

2.1. Computational models for stream segregation
The earliest neuronal models purporting to explain stream segregation suggested that segregation of pure-tone sequences can be explained based on neuronal representation distance along the frequency axis of the cochlea (the so-called peripheral channeling hypothesis; Hartmann and Johnson, 1991), along with an additional temporal integrator in the central nervous system (Beauvois and Meddis, 1996; McCabe and Denham, 1997). Multi-unit recordings in macaque monkeys suggested that frequency separation in the auditory cortex is modulated by forward suppression (Brosch and Schreiner, 1997), such that the separation of the individual neuronal representations of the different stimuli is enhanced at shorter inter-stimulus intervals (ISI) (Bee and Klump, 2004; Fishman et al., 2004, 2001), potentially explaining why stream segregation can also be observed with smaller ΔF at faster rates and shorter ISIs (Bregman et al., 2000; Van Noorden, 1975). Furthermore, streaming is often not an instantaneous percept, but may build up over the course of seconds (Anstis and Saida, 1985; Bregman, 1978). Adaptation processes with longer time constants than forward suppression have been proposed to explain this gradual buildup of streaming-related activity in auditory cortex that is often observed at intermediate ΔF (Micheyl et al., 2005). Similar multi-second adaptation has since been observed as early in the auditory pathway as the cochlear nucleus (Pressnitzer et al., 2008). The models that explain streaming based on a separation of streams into distinct neuronal representations are generally summarized as the population-separation model of auditory stream segregation (Fishman et al., 2012; Micheyl et al., 2007b). The population-separation model goes beyond the earlier peripheral channeling model by also considering neuronal representations of features other than ear of entry and tone frequency, which supposedly emerge in the central auditory system.
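The population-separation idea, and the temporal-coherence idea discussed later in this section, can be illustrated with a deliberately simplified numerical sketch. All parameters (tuning bandwidth, suppression time constant) and both helper functions are illustrative assumptions for this review, not values or code from any of the studies cited:

```python
import numpy as np

# Illustrative assumptions only -- not fitted to any published data.
SIGMA_ST = 4.0   # assumed tonotopic tuning bandwidth (semitones)
TAU_MS = 150.0   # assumed forward-suppression recovery time constant (ms)

def b_tone_response(df_semitones: float, isi_ms: float) -> float:
    """Toy population-separation model: the response of the B-tuned
    population is reduced by forward suppression from the preceding
    A tone, in proportion to the overlap of their tonotopic channels."""
    overlap = np.exp(-0.5 * (df_semitones / SIGMA_ST) ** 2)
    suppression = overlap * np.exp(-isi_ms / TAU_MS)
    return float(1.0 - suppression)

def channel_coherence(synchronous: bool, n_tones: int = 16) -> float:
    """Toy temporal-coherence measure: correlation between the A- and
    B-channel activity time courses. Synchronous tones yield high
    coherence (grouping); alternating tones yield negative coherence
    (segregation)."""
    a = np.tile([1.0, 0.0], n_tones // 2)    # A-channel activity over time
    b = a if synchronous else np.roll(a, 1)  # B: same slots or interleaved
    return float(np.corrcoef(a, b)[0, 1])

for df in (0, 2, 4, 8, 12):
    print(f"dF = {df:2d} semitones -> B response {b_tone_response(df, 100.0):.2f}")
print("coherence, synchronous:", channel_coherence(True))
print("coherence, alternating:", channel_coherence(False))
```

With these assumed parameters the B-tone response grows monotonically with ΔF and saturates near 8–12 semitones, qualitatively mirroring the release from adaptation discussed below, while the coherence measure flips sign between synchronous and alternating presentation.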
The adaptation phenomena described above are an additional component of the model, explaining temporal phenomena such as buildup and rate dependency. While the population-separation model of stream segregation can explain the classical streaming effect introduced above, it may not be universal enough to explain why stream segregation does or does not occur with other stimulus configurations. For example, it has been pointed out that the separation of two streams of tones along the tonotopic axis in auditory cortex is similar for alternating and synchronous pure-tone sequences, but that synchronous sequences are generally perceived as a single coherent stream of chords (Elhilali et al., 2009b). In the framework of Bregman (1990), one would argue that the common onsets of the synchronous sequence are a stronger cue for integration than frequency separation is for segregation. An alternative model that accounts for synchronicity cues, and additionally for other temporal characteristics of auditory objects, was introduced as the temporal coherence model of auditory stream segregation (Elhilali et al., 2009b; Shamma et al., 2011). This model adds a module subsequent to the separation of sounds into different neuronal populations (i.e., feature extraction), which computes the coherence between stimulus-locked activity in all neural channels over a time interval of up to 500 ms. Sound elements with high coherence are
grouped into coherent streams, whereas low coherence forms the basis for stream segregation.

Fig. 2. Auditory-evoked MEG activity in auditory cortex in response to ABA_ stimuli (Fig. 1). (a) Source waveforms averaged for the left and right auditory cortex, based on dipoles fitted to the P1m. The evoked response increases with ΔF, most prominently for the B-tone evoked response, whose perceived ISI changes more prominently when the two tones are segregated into distinct streams (cf. Fig. 1). (b) Behavioral rating of how easily the triplet sequences are perceived as two streams, on a scale from 0 (impossible) to 1 (automatic). (c) Correlation of the normalized MEG response strength with the behavioral data shown in panel b. Modified from Gutschalk et al. (2005).

Another line of models to explain auditory stream segregation is based on the predictive-coding framework (Winkler et al., 2012). Predictive-coding theory assumes that the brain constantly generates a model of the world based on previous experience and incoming sensory signals. These models stress the role of top-down processing pathways, which are required to compare the model's predictions with the bottom-up stream of sensory information. A recent implementation of a predictive-coding model of auditory stream segregation was particularly successful in predicting the bistable nature of the streaming effect (Mill et al., 2013). Like the population-separation model, however, the implemented model (Mill et al., 2013) is not yet capable of handling situations other than the classical streaming effect. It is nevertheless conceivable that such models could be constructed and altered to more universally account for the range of perceptual phenomena associated with auditory scene analysis.

2.2. Spectral cues for streaming

The first human neurophysiological study of the streaming effect utilized an auditory deviance response known as the mismatch negativity, or MMN. The MMN is evoked by any violation of an otherwise regular stimulus pattern, and is thought to be a predominantly automatic component (Näätänen et al., 2011). Using alternating patterns of low- and high-frequency tones, it was shown that the MMN is elicited much more readily for violations of within-stream vs. across-stream patterns (Sussman et al., 1999, 2007), i.e. for violations that are more readily perceived when the frequency separation between the tone patterns is large. Based on these findings, it has been suggested that streaming is an automatic process that does not require directed attention (Sussman et al., 1999).
Because the MMN does not monitor the ongoing stream but only the processing of deviants within a
stream, its physiological relationship with streaming is an indirect one, whose advantages and limitations have been reviewed in detail elsewhere (Snyder and Alain, 2007). The MMN is of particular interest in the context of predictive-coding models of stream segregation, because it may be interpreted as reflecting the prediction error when the regularity of a stream is interrupted (Winkler et al., 2012).

The first study to directly demonstrate that human auditory-cortex activity depends on tone-frequency separation in the streaming paradigm came later (Gutschalk et al., 2005). Using magnetoencephalography (MEG), it was shown that the P1m and N1m, which are evoked by individual tones of a continuous ABA_ sequence, were smaller when the ΔF was small and the tones were grouped into a single stream (Fig. 2). For larger ΔF, the evoked response increased in amplitude and plateaued at a ΔF of about 8 semitones, where the sequence was generally segregated into two streams (Gutschalk et al., 2005). This phenomenon is thought to be caused by selective adaptation2: the suppression imposed by one tone on the following tones is smaller for tones separated by a large ΔF. The range of this frequency-selective adaptation is in good agreement with the transition from integrated to segregated perception (Fig. 2), suggesting some relationship between the underlying physiological mechanisms of selective adaptation and the perception of streaming. Similar results for the P1 and N1 (Snyder et al., 2006), and for the P1m (Chakalov et al., 2012), have also been obtained by other investigators. One difference between adaptation of the P1m and the N1m is their ISI dependence: at 200 ms ISIs, P1m amplitude is not suppressed much more than at ISIs several times longer (600 ms) (Gutschalk et al., 2004a), whereas the N1m can show suppression effects lasting 3 s or longer (Hari et al., 1982; Imada et al., 1997). At ISIs of 200 ms and below, the N1m is often not even observed.
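The differing ISI dependence of P1m and N1m adaptation can be caricatured as exponential recovery with component-specific time constants. The time constants below are loose illustrative assumptions, chosen only to reproduce the qualitative pattern just described (P1m largely recovered by 200 ms; N1m suppressed for seconds), not estimates from the cited studies:

```python
import math

# Purely illustrative recovery time constants (assumptions, not fits).
TAU_S = {"P1m": 0.12, "N1m": 1.5}  # seconds

def recovered_amplitude(component: str, isi_s: float) -> float:
    """Fraction of the unadapted response amplitude available after a
    given inter-stimulus interval, under a toy exponential-recovery model."""
    return 1.0 - math.exp(-isi_s / TAU_S[component])

for isi_s in (0.2, 0.6, 3.0):
    amps = {c: round(recovered_amplitude(c, isi_s), 2) for c in TAU_S}
    print(f"ISI = {isi_s:.1f} s -> {amps}")
```

Under these assumptions the P1m is nearly fully recovered at 200 ms ISI, whereas the N1m only approaches its unadapted amplitude at multi-second ISIs.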
With respect to the psychoacoustics of auditory stream segregation, the temporal integration reflected by the P1m would be a candidate marker for physiological processes related to the ISI dependence of the temporal coherence boundary, defined as the minimum ΔF at which one stream can no longer be perceived, even if the listener is instructed to maintain a one-stream percept (Van Noorden, 1975). The coherence boundary decreases with ISI, most prominently below 200 ms (Bregman et al., 2000; Van Noorden, 1975), similar to the range in which adaptation of the P1m is most pronounced (Gutschalk et al., 2004a). Supposedly, P1m adaptation is also related to streaming phenomena observed in monkey A1 (Fishman et al., 2004, 2001), although the P1m is likely not generated exclusively in A1 but also in adjacent core and belt areas (Liegeois-Chauvel et al., 1994; Yvert et al., 2001). We return to the potential dissociation of the P1m and N1m in Section 6.

fMRI has also provided evidence for potentially streaming-related selective adaptation effects in human auditory cortex. The blood-oxygen level-dependent (BOLD) activity in auditory cortex evoked by sequences of sounds depends on the repetition rate (Harms and Melcher, 2002; Harms et al., 2005): when the ISI between sounds is below about 200 ms, sustained BOLD activity decreases, whereas the initial onset transient remains strong or may even increase (Harms and Melcher, 2002). It was therefore examined whether this rate-dependent effect can also be observed in the context of streaming, where the perceived rate is halved for an ABAB sequence perceived as two streams compared to when it is perceived as a single, coherent stream (Wilson et al., 2007). Indeed, the sustained BOLD response evoked by an ABAB sequence increased when the ΔF was larger than 0 semitones, in the same range where perceptual streaming is observed. This effect is likely related to the forward suppression phenomena discussed above (Brosch and Schreiner, 1997; Fishman et al., 2001). It should be mentioned, however, that another study failed to show ΔF-dependent BOLD enhancement in the auditory cortex (Cusack, 2005). While there are some technical differences between these studies (Cusack, 2005; Wilson et al., 2007), the discrepancy is likely related to the sequence type (ABA_ and ABAB for Cusack and Wilson et al., respectively), specifically that the presence of the pause in the ABA_ paradigm causes a strong sustained response that does not readily increase with increasing ΔF. At the neural level, there is likely stronger overall adaptation for the continuous ABAB sequence, whereas the pause in the ABA_ pattern limits the overall adaptation across pattern repetitions. Accordingly, the additional release from adaptation with increasing ΔF, which supposedly produces the BOLD enhancement, is small in the case of the ABA_ pattern. At the behavioral level, reduced adaptation is possibly related to the reduced likelihood of stream segregation for tone patterns that are interrupted by a pause (Bregman, 1978; Carl and Gutschalk, 2013).

2 The term selective adaptation is used here to describe the phenomenon of response reduction for sequentially presented sounds that is selective for spectral composition or other sound features. In the present review, selective adaptation implies neither a specific neural mechanism nor a specific range of temporal dynamics (time constants). There are good reasons to believe that the selective adaptation of the P1m described above is related to the frequency-specific forward suppression observed in animal models (Brosch and Schreiner, 1997; Fishman et al., 2001). Because the equivalence of these phenomena has not yet been firmly established, however, and the generalization to other sound features remains to be shown for forward suppression, we will use the more general term selective adaptation in this review.
A discrepancy between the MEG and fMRI data nevertheless remains, since selective adaptation was observed for the ABA_ pattern in MEG, and we would have expected at least some frequency-selective adaptation in fMRI as well if the two reflected the same mechanism. Finally, a recent streaming study examined the covariation between ΔF and neural activity in areas outside classically-defined auditory cortex (Dykstra et al., 2011). Utilizing direct recordings from the cortical surface of epilepsy patients, the authors observed that the brain areas engaged when listeners perform an active streaming task extend far beyond the auditory cortex, with many areas, including sites in temporal, frontal, and parietal cortex, showing activity that significantly correlated with ΔF (Fig. 3).

2.3. Non-spectral cues

While differences in sound frequency, or more generally spectral differences, are probably the strongest cue underlying stream segregation (Hartmann and Johnson, 1991; Moore and Gockel, 2002), there is also evidence for streaming based on other, non-spectral cues, such as the pitch of unresolved harmonic
tone complexes (Vliegen and Oxenham, 1999a). Evidence from both MEG and fMRI (Gutschalk et al., 2007) shows that the selective-adaptation effects in auditory cortex, described in detail for spectral cues in the previous section, are similarly found for stimuli with completely overlapping spectra, in which streaming is based on a difference in fundamental frequency, or ΔF0. The sustained BOLD response in auditory core and belt areas was stronger when the ΔF0 in a continuous ABAA sequence was 3 or 10 semitones, where streaming was generally perceived, than when the ΔF0 was 0 or 1 semitone, where a coherent, integrated percept was generally reported. In MEG, the P1m evoked by B tones was suppressed for ΔF0s of 0 and 1 semitone, but was clearly present for ΔF0s of 3 and 10 semitones. Further studies have shown that selective adaptation is also observed based on inter-aural time differences (ITD), reflected by the P1m in MEG (Carl and Gutschalk, 2013; Schadwinkel and Gutschalk, 2010b) as well as by the sustained BOLD response in auditory cortex (Schadwinkel and Gutschalk, 2010a, 2010b), both of which were enhanced in situations that typically evoke percepts of segregated streams. When the enhanced BOLD activity was compared between ΔF0 and ΔITD, the enhancement was found in Heschl's gyrus and the adjacent, anterior parts of the planum temporale, corresponding well to the supposed auditory core and belt areas, and there was no topographical difference between these two cues on a macroscopic scale (Schadwinkel and Gutschalk, 2010b). In summary, these findings demonstrate that selective adaptation (potentially forward suppression) effects observed in the context of fast, alternating sequences generalize across different cues, including tone frequency, pitch of complex tones, and ITD.
These results demonstrate further that these cues are selectively processed at the level of auditory cortex, a condition that is important for the population-separation model, but also for the coherence models of auditory streaming (cf. Section 2.1). Potentially, there is a more general role of selective adaptation as a neural basis for streaming, beyond it being a measure of stimulus-selective processing (Carl and Gutschalk, 2013; Fishman et al., 2001; Gutschalk et al., 2005), but this will require further examination.

3. Bistable streaming perception

The streaming effect is related to certain physical stimulus parameters, but is not uniquely determined by them. It has long been known that such sequences can be perceptually bistable (Van Noorden, 1975). For long sequences (seconds to minutes), the percept evoked by the sequence may spontaneously alternate between integration and segregation (Gutschalk et al., 2005), with temporal characteristics similar to bistable visual phenomena (Pressnitzer and Hupé, 2006).
Fig. 3. Electrode sites that showed significant correlation between intracranial EEG (iEEG) evoked responses and the ΔF of an ABA_ streaming sequence. Reproduced from Dykstra et al. (2011).
A continuous bistable streaming stimulus can therefore be used to study the neural correlates of perception directly, by avoiding confounding physical stimulus differences, an approach widely applied in visual neuroscience (Blake and Logothetis, 2002; Sterzer et al., 2009). In a bistable streaming experiment (Gutschalk et al., 2005), listeners were instructed to listen to the streaming stimulus and indicate when their perception switched from one to two streams and vice versa. The data were evaluated such that time intervals in which one stream was perceived were averaged separately from intervals in which listeners indicated they heard two streams. The results showed that the responses to B tones of ABA_ triplets were enhanced for segregated vs. integrated percepts (Gutschalk et al., 2005), concordant with the changes observed when adaptation was reduced as a consequence of larger ΔF (Fig. 2). In the original publication of these data, a highpass filter at 3 Hz was used to avoid placing the baseline in an interval that was not completely flat. In this analysis, a small but consistent enhancement of both the P1m and the N1m was observed. For better comparability with subsequent studies, a re-analysis of the data, in which the highpass filter was omitted and a short baseline in the 25 ms preceding the ABA_ triplet was used, is shown in Fig. 4. As can be seen, the enhanced positive response for two- compared to one-stream percepts is even more prominent in this analysis, although the average latency of the ΔP1m was in the range of 69–74 ms, 10–20 ms later than the native P1m (measured from B-tone onset).

A number of investigators have performed similar experiments in EEG: one study used the build-up effect of streaming, i.e. the observation that two-stream percepts are more likely with increasing time since sequence onset (Anstis and Saida, 1985).
They showed a positive difference wave in the auditory cortex for the late interval (again, where streaming is more likely) compared to the early interval, with a latency of approximately 80–90 ms after B-tone onset (Snyder et al., 2006). Two other studies used a continuous bistable setup as described above. One study showed a vertex-positive difference wave with a latency of approximately 100 ms after B-tone onset for the contrast of segregation vs. integration (Hill et al., 2012). The other study found a positive
difference wave with a latency of 60–140 ms after B-tone onset (Szalardy et al., 2013). Thus, absolute latency variability notwithstanding, there seems to be converging evidence for a positive difference wave, peaking about 60–100 ms after B-tone onset and located in auditory cortex, for segregated vs. grouped percepts. It remains unclear whether this activity is physiologically related to the native P1m (Gutschalk et al., 2005) or whether the difference wave reflects a new, independent component (Hill et al., 2012; Szalardy et al., 2013). The appeal of a more direct relationship between the difference wave and the P1m is that the P1m is also modulated by adaptation that is selective for a variety of streaming cues, as outlined in Section 2. Since these adaptation effects are best observed for the B tones of the ABA_ paradigms used, it was predicted that the modulation during bistable experiments would also be detected in the 50–150 ms latency range after B-tone onset (Gutschalk et al., 2005). Conversely, the positive difference wave is generally later and broader than the P1m, which suggests it may be related to a distinct or additional process, potentially downstream of the P1m.

It should be mentioned that the study by Dykstra et al. (2011), despite utilizing spatiotemporally resolved cortical recordings, failed to identify neuronal correlates of perceptual bistability with the streaming paradigm. The authors suggested that this could have been due to the combination of limited coverage and the focal nature of such recordings, and posited that the neuronal representation of bistability during streaming might be uniquely situated in brain areas (e.g. on the superior temporal plane or in the intraparietal sulcus) not covered by their surface electrodes. The first fMRI study of bistable streaming perception used the buildup paradigm, but did not find a correlation between perception and activity in the auditory cortex (Cusack, 2005).
Instead, this study revealed enhanced activity in the intraparietal sulcus for segregation compared to integration. Enhanced BOLD activity in the intraparietal sulcus was confirmed in a subsequent study (Hill et al., 2011). The latter study also found enhanced activity in the auditory cortex for streaming perception with a region-of-interest (ROI) analysis, matching the results of the EEG and MEG studies.
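The percept-contingent averaging underlying these bistable analyses can be sketched as follows. The function below is a hypothetical minimal implementation of the logic (label each triplet epoch by the most recent reversal report, then average within percept), not the analysis code of any cited study:

```python
import numpy as np

def percept_contingent_average(epochs, epoch_times, reversal_times,
                               initial_percept="one"):
    """Average evoked-response epochs separately for intervals in which
    the listener reported hearing one vs. two streams.

    epochs         : (n_epochs, n_samples) array of single-triplet responses
    epoch_times    : onset time (s) of each epoch
    reversal_times : sorted times (s) at which the reported percept flipped
    """
    epochs = np.asarray(epochs, dtype=float)
    percepts = ("one", "two")
    start = percepts.index(initial_percept)
    labels = []
    for t in epoch_times:
        n_flips = int(np.searchsorted(reversal_times, t))  # reversals before t
        labels.append(percepts[(start + n_flips) % 2])
    labels = np.asarray(labels)
    return {p: epochs[labels == p].mean(axis=0)
            for p in percepts if np.any(labels == p)}

# Hypothetical example: four epochs, one perceptual reversal at t = 1.5 s.
out = percept_contingent_average(
    epochs=[[1.0], [2.0], [3.0], [4.0]],
    epoch_times=[0.0, 1.0, 2.0, 3.0],
    reversal_times=[1.5])
print(out)  # one-stream mean 1.5, two-stream mean 3.5
```

Real analyses additionally involve the filtering and baseline choices discussed above, and typically discard epochs close to the reversal reports.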
Fig. 4. MEG activity (grand average across 14 listeners) in auditory cortex binned according to whether a bistable ABA_ sequence (ΔF = 4 or 6 semitones) was perceived as one stream (gray) or two streams (orange). Once the listeners perceived two streams, they selectively attended either the A tones (left) or B tones (right). The difference waves (black) show enhanced activity in the latency range subsequent to the B tones, overlapping with the P1m and also with the P2m (thin lines represent the bootstrap-based t interval, p < 0.05). Note that the enhancement is also in the B-tone latency range when listeners attended to the A tones, making it unlikely that the bistability effect can be explained by selective attention. The data used are the same as those used in Figure 4 of Gutschalk et al. (2005), but without the 3-Hz highpass filter used in that study.
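The "bootstrap-based t interval" mentioned in the Fig. 4 caption can be outlined as follows. This is a generic bootstrap-t sketch over listeners; the resampling unit and number of resamples are assumptions, not details taken from the original analysis:

```python
import numpy as np

def bootstrap_t_interval(diff_waves, alpha=0.05, n_boot=2000, seed=0):
    """Bootstrap-t confidence band for a grand-average difference wave:
    resample subjects with replacement, studentize each resampled mean,
    and invert the percentiles of the bootstrap t distribution."""
    rng = np.random.default_rng(seed)
    x = np.asarray(diff_waves, dtype=float)   # (n_subjects, n_samples)
    n = x.shape[0]
    mean = x.mean(axis=0)
    se = x.std(axis=0, ddof=1) / np.sqrt(n)
    t_star = np.empty((n_boot, x.shape[1]))
    for i in range(n_boot):
        xb = x[rng.integers(0, n, size=n)]    # resample subjects
        se_b = xb.std(axis=0, ddof=1) / np.sqrt(n)
        t_star[i] = (xb.mean(axis=0) - mean) / np.where(se_b > 0, se_b, np.inf)
    t_lo, t_hi = np.percentile(t_star, [100 * alpha / 2,
                                        100 * (1 - alpha / 2)], axis=0)
    return mean - t_hi * se, mean - t_lo * se  # lower and upper band

# Hypothetical demo: 14 listeners with a constant true effect of 1.0.
demo = np.random.default_rng(1).normal(1.0, 0.1, size=(14, 8))
lo, hi = bootstrap_t_interval(demo, n_boot=500)
print("lower-band minimum:", round(float(lo.min()), 2))
```

Time points whose band excludes zero would be marked significant at the corresponding level, as for the thin lines in Fig. 4.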
Fig. 5. BOLD activity evoked by perceptual reversals at the onset (On) and offset (Off) of two-stream percepts. (a) Activity in the auditory cortex (red) and neighboring areas (orange) based on all streaming reversals (On and Off combined). (b) Time course of the BOLD response in the auditory cortex ROI (red in panel a). (c) ROI in the inferior colliculus, determined with a localizer stimulus. (d) Time course for On and Off reversals in the inferior colliculus, based on the ROI shown in panel c. Modified from Schadwinkel and Gutschalk (2011).
While the studies reviewed so far compared the entire intervals in which segregation was perceived with the intervals in which grouping was perceived, another approach is to evaluate the reversals between these intervals. In fMRI, transient BOLD activity in auditory cortex has been observed in response to perceptual switches from one to two streams and vice versa (Kondo and Kashino, 2009; Schadwinkel and Gutschalk, 2011). Reversal-related activity has also been found in the thalamus, supramarginal gyrus, and insular cortex (Kondo and Kashino, 2009). Similar activation patterns were observed in a target detection task, where the frequency of targets was matched to the frequency with which reversals were observed (Kondo and Kashino, 2009), suggesting that the response to perceptual reversals signifies the detection of transient changes in the sources present in the acoustic environment. Finally, an ROI analysis revealed transient activity related to perceptual reversals in the inferior colliculus (Fig. 5), an obligatory synaptic nucleus in the auditory midbrain (Schadwinkel and Gutschalk, 2011). Of particular interest is that structures considered predominantly sensory are activated by transient perceptual events, even though there is no transient present in the acoustic stimulus. The physiological underpinnings of these imaging results are currently unclear. A simple explanation could be that the detection of stream-segregation onset in cortex produces a transient enhancement of activity in the inferior colliculus via top-down attentional gain control (Rinne et al., 2008). Alternatively, there might be an as-yet unspecified bottom-up mechanism of receptive-field rearrangement that actually causes stream segregation.

4. Informational masking

The ability to parse the auditory scene into an accurate perceptual representation is of particular importance in situations with multiple sound sources, where listening selectively to one source can be quite demanding. It has been demonstrated that the ability to detect the presence of a tone of pre-defined frequency may be hindered by the presence of simultaneous, randomly drawn tones at frequencies distant from that of the probe tone (Neff and
Green, 1987). This phenomenon cannot be explained by “energetic” masking, determined by cochlear-level activation patterns (Delgutte, 1990; Moore, 1995). Rather, the competition for neural resources in such experiments is thought to occur at later stages of processing, in the central auditory system. To dissociate these two forms of masking, masking observed with multi-tone maskers, that is, masking that cannot be explained by activation patterns on the basilar membrane, is termed “informational” masking (Durlach et al., 2003; Kidd et al., 2008). Informational masking³ depends on multiple parameters, some of which are also important for the stream-segregation paradigm reviewed above (Kidd et al., 1994). Impressive all-or-none perceptual effects are found when the target tone is repeated while the masker is newly randomized for each tone repetition (Kidd et al., 2003). With this paradigm (Fig. 6a), the repetitive target stream sometimes “pops out” from the random masker tones. This pop-out effect sometimes builds up over time, much as listeners become increasingly likely to hear a segregated percept in the streaming paradigms discussed earlier (Micheyl et al., 2007a).
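The multi-tone masking stimulus described above can be sketched in a few lines of numpy. All parameter values below (tone duration, masker density, a half-octave protected region) are illustrative placeholders rather than the settings of any particular study.

```python
import numpy as np

def multitone_masker_trial(fs=16000, target_freq=1000.0, n_targets=8,
                           tone_dur=0.05, isi=0.1, n_masker=10,
                           protected_octaves=0.5, rng=None):
    """One trial of a repeated target in a random multi-tone masker.
    Masker tones are redrawn for each target interval and rejected if
    they fall within the protected region around the target frequency.
    All parameter values are illustrative, not taken from any study."""
    rng = np.random.default_rng(rng)
    interval = tone_dur + isi
    mix = np.zeros(int(fs * n_targets * interval))
    t_tone = np.arange(int(fs * tone_dur)) / fs
    ramp = np.hanning(t_tone.size)          # smooth gating of each tone

    def add_tone(freq, onset_s):
        i0 = int(fs * onset_s)
        mix[i0:i0 + t_tone.size] += ramp * np.sin(2 * np.pi * freq * t_tone)

    masker_freqs = []
    for k in range(n_targets):
        add_tone(target_freq, k * interval)          # regular target stream
        for _ in range(n_masker):                    # random masker tones
            f = 10 ** rng.uniform(np.log10(200), np.log10(5000))
            # redraw until outside the protected region (log-frequency distance)
            while abs(np.log2(f / target_freq)) < protected_octaves:
                f = 10 ** rng.uniform(np.log10(200), np.log10(5000))
            add_tone(f, k * interval + rng.uniform(0, isi))
            masker_freqs.append(f)
    return mix, np.array(masker_freqs)

mix, masker_freqs = multitone_masker_trial(rng=1)
```

Randomizing the masker onsets within each interval, as in the function above, corresponds to the asynchronous variant in Fig. 6b; setting the onset jitter to zero would yield the synchronous variant of Fig. 6a.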
4.1. Multi-tone masking and MEG

The neural underpinnings of this informational-masking pop-out effect have been explored in recent years by a number of studies. To separate neural activity that is time-locked to the target stream, a variant of the paradigm was introduced in which the target and masker tones are asynchronous (Fig. 6b). Constructing the masker in this way results in the cancellation of all activity evoked by individual masker tones when the response is evaluated time-locked to the target tones. Listeners are presented with sequences of targets embedded in a random multi-tone masker and
³ The term informational masking is not limited to multi-tone masking, but is also used for other situations, such as the masking of speech by other speech stimuli. Note that this does not necessarily mean that the same mechanisms are relevant for all kinds of informational masking.
Fig. 6. Variants of the random multi-tone masker paradigm used by the studies summarized in Section 4. Target tones are plotted in black, masker tones in gray. When the presence of the regular target stream is not detected in these stimuli, it is considered informationally masked. A “protected region” between target and masker, indicated by the gray shading, is used to reduce energetic masking. (a) Multi-tone masker with repetitive target tones (black) and random pure tones which are newly drawn for each interval (Kidd et al., 2003; Micheyl et al., 2007a). (b) Variant used by a number of MEG studies. In contrast to (a), the onset of the masker tones is additionally randomized with respect to the target-tone onsets. This modification allows the target-evoked activity to be separated from the masker-evoked activity (Dykstra, 2011; Elhilali et al., 2009a; Gutschalk et al., 2008; Königs and Gutschalk, 2012). (c) In this variant, the target occurrence is uncoupled from the masker onset. The number of masker tones is reduced once the target starts, such that the average number of tones per time interval is constant (Wiegand and Gutschalk, 2012). (d) The salience of the regular target is enhanced when multiple tone components are repeated at unrelated frequencies (Teki et al., 2011). In the example shown here, no protected region around the target tones is used.
are instructed to press a response button as soon as they detect the target stream. Neural activity can then be evaluated separately for intervals in which the target tones are detected vs. intervals in which they are missed (Gutschalk et al., 2008). While detected tones evoke a negative deflection in the MEG, generated in auditory cortex between 75 and 250 ms post-onset, targets that are missed evoke virtually no such response, and instead evoke a response that is nearly identical to intervals in which no target tones were present (Fig. 7). This response has been labeled the awareness-related negativity (ARN), and is likely related to the N1m and subsequent negative response components in auditory cortex, which are evoked automatically when tones are presented in the absence of perceptual competition. However, these negative waves are significantly enhanced when a listener selectively attends one stream among competing streams of tones (Hillyard et al., 1973; Rif et al., 1991), and they are strongly suppressed or entirely absent when a tone sequence remains undetected amongst multi-tone maskers. Like the N1m, hemispheric lateralization of the ARN depends on whether a tone is presented at the left or at the right, but only in the early time interval that overlaps with the N1m (Königs and Gutschalk, 2012). In the later time interval, the ARN is balanced or slightly left lateralized and there is no modulation by stimulus laterality, whereas the negativity subsequent to the N1m peak recorded without masking is right lateralized. Similar results were obtained in a study that used an informational masking stimulus with two task instructions (Elhilali et al., 2009a): listeners were required to either detect deviants within a regular target
stream, or within the random-tone masker. The periodic 4-Hz activity evoked by the target was stronger and left lateralized when the target was attended, but right lateralized when the masker was attended. A lateralization of auditory-cortex activity towards the left has also been observed in fMRI stream-segregation tasks, when listeners were instructed to actively segregate two sequences from each other (Deike et al., 2010, 2004). It may therefore be that the left auditory cortex plays a specific role in active listening, but this requires further investigation. In contrast to the surface-negative activity in auditory cortex starting about 75 ms after tone onset, earlier activity in auditory cortex is similar for detected and undetected targets. This was shown first for the 40-Hz steady-state response (SSR) evoked by amplitude-modulated target tones (Gutschalk et al., 2008) and subsequently for the P1m (Königs and Gutschalk, 2012). One might therefore suggest that early activity in auditory cortex, reflected by the 40-Hz SSR and the P1m, reflects purely sensory stimulus processing, whereas later activity, e.g., the N1m and ARN, is more directly coupled to perception. It is unclear, however, if this temporal sequence also manifests as a hierarchical anatomical organization of the auditory cortex: the 40-Hz SSR is generated predominantly in medial Heschl’s gyrus (Brugge et al., 2009; Steinmann and Gutschalk, 2011; but see Nourski et al., 2013), the location of the medial core field, whereas the N1m and later components are generated in a more distributed network in the auditory cortex including both core and belt areas (Gutschalk et al., 2004b; Liegeois-Chauvel et al., 1994). The P1m also comprises generators in the auditory core and belt areas, partly overlapping with those of the N1m (Bidet-Caulet et al., 2007; Liegeois-Chauvel et al., 1994).
While the “center of mass” of MEG or EEG activity can be determined reasonably well, the extent of a source cannot be determined precisely (Hämäläinen et al., 1993). Other, complementary techniques, such as intracranial EEG and fMRI, are required to answer this question.
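The logic of the asynchronous-masker design described in Section 4.1, in which averaging time-locked to the target cancels the masker-evoked activity, can be illustrated with a toy simulation. The sampling rate, response kernel, and event rates below are arbitrary choices, not modeled on real MEG data.

```python
import numpy as np

# Toy demonstration: responses evoked at random (masker-like) times average
# to a flat baseline, while the regularly time-locked (target-like) response
# survives averaging. All numbers are illustrative assumptions.
rng = np.random.default_rng(0)
fs = 250
kernel = np.hanning(50)            # toy evoked response, 200 ms long
dur = 120.0                        # seconds of simulated recording
sig = np.zeros(int(fs * dur))

def add_event(t):
    i0 = int(fs * t)
    sig[i0:i0 + kernel.size] += kernel

for t in rng.uniform(0.5, dur - 0.5, 2000):     # masker events, random onsets
    add_event(t)
target_onsets = np.arange(1.0, dur - 1.0, 0.8)  # target events, regular
for t in target_onsets:
    add_event(t)

# epoch around target onsets with a 100-ms pre-stimulus baseline
epochs = np.stack([sig[int(fs * t) - 25:int(fs * t) + 75] for t in target_onsets])
avg = epochs.mean(axis=0) - epochs[:, :25].mean()

# control: epochs locked to random times contain no consistent response
ctrl_onsets = rng.uniform(1.0, dur - 1.0, target_onsets.size)
ctrl = np.stack([sig[int(fs * t) - 25:int(fs * t) + 75] for t in ctrl_onsets])
ctrl = ctrl.mean(axis=0) - ctrl[:, :25].mean()
```

After baseline correction, the target-locked average recovers the response kernel, whereas the randomly locked control average remains close to flat; this is the cancellation that allows target-evoked activity to be isolated from the masker.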
4.2. Multi-tone masking and fMRI

The spatial distribution of activity for detected compared to undetected targets under informational masking was evaluated with fMRI (Wiegand and Gutschalk, 2012). The results showed that detected targets evoke activity in Heschl’s gyrus and planum temporale compared to the masker baseline interval (Fig. 8), similar to the activity observed for unmasked targets. In contrast, undetected targets produced little activity in lateral auditory cortex when contrasted to the masker baseline without a regular target. The more stringent comparison of detected minus undetected targets revealed significant activity in medial Heschl’s gyrus, at the same site where the 40-Hz SSR is generated that was not modulated by target detection in MEG (see Section 4.1). It therefore appears that activity in the human core fields reflects both physical stimulus properties (at early latencies) and perception (presumably only at longer latencies). While the two cannot be dissociated with respect to their timing in fMRI, the finding is generally consistent with the model of recurrent activation of primary sensory areas supporting conscious perception (Bullier et al., 2001; Lamme, 2004; Meyer, 2011). Such recurrent projections may emanate from secondary auditory areas or from frontal or parietal areas involved in selective attention, target detection, and working-memory processes. Intracranial recordings (Dykstra, 2011) show that target tones evoke activity in supra-modal areas of frontal and temporal cortex, in addition to that in unimodal auditory areas on the postero-lateral STG, only when (i) they are embedded in random multi-tone maskers and (ii) they are detected. However, whether the activity in supramodal brain areas is required for perceptual awareness (Dehaene and Changeux, 2011),
Fig. 7. MEG experiment using a random multi-tone masker. The target comprised 12 repetitions of the same tone at a regular rate. (a) The detection probability for the target increased over time and was on average 50%. (b) Based on dipole source analysis, activity evoked by detected target tones was located in auditory cortex. (c) MEG source activity in auditory cortex shows a broad (~70–250 ms) negative wave for detected targets. This includes all tones presented after the listener has indicated awareness of the target tones. Target tones that were presented in the preceding time interval (undetected targets) did not evoke this negative source component. The two lower panels show the activity when no target is present (left) and when the targets are presented in isolation (right). (d) Dipole source strength measured separately for each of the 12 tones comprised in each target stream. While the activity decreases over time when the target is presented without a masker (orange), the activity for detected targets (black), but not undetected targets (blue), builds up in the presence of the masker. Modified from Gutschalk et al. (2008).
or instead indicates task-related processes subsequent to perception remains underexplored in the context of audition. A stimulation paradigm similar to classical multi-tone masking, termed the “stochastic figure-ground stimulus,” was used in an fMRI experiment (Teki et al., 2011) while the listeners performed an unrelated auditory task. The targets comprised repetitions of several tone frequencies in otherwise random chords (Fig. 6d). The audibility of the targets increased with the number of synchronous frequency components comprising the target. Perceptual segregation of this target type cannot be explained by selective attention to a single frequency, and instead requires a mechanism that integrates simultaneous cross-frequency events, as suggested by the coherence model (Teki et al., 2013) (see Section 2.1). The fMRI results show that activity in the temporal lobe and in the intraparietal sulcus increased with increasing number of target components. Given that the subjects in that study did not give trial-by-trial reports of their perception, it remains to be seen whether the BOLD activity observed merely reflects the spectro-temporal coherence of the stimuli, and thus a passive scene-analysis mechanism, or is linked to bottom-up perceptual pop-out of the more salient targets.

5. Other scene analysis paradigms

Streaming and informational masking are certainly not the only paradigms that have been used to study auditory scene analysis with functional imaging techniques. We now briefly review three other paradigms: the continuity illusion, concurrent segregation based on mistuned harmonics, and the segregation of multiple talkers.

5.1. Continuity illusion

When a speaker or other ongoing sound is occasionally masked by brief noise bursts or other competing transient sounds, listeners often report hearing the ongoing sound as continuing through the
Fig. 8. fMRI study of target detection in a continuous multi-tone masker paradigm (Fig. 6c). The target comprised four identical tones with a regular repetition rate. (a) Areas in Heschl’s gyrus and planum temporale activated by the target when presented without the masker. (b) Time courses of BOLD activity in the region shown in panel a. The blue and red curves show activity elicited by detected and undetected targets, respectively. The gray curve shows the activation time course for targets presented passively without a masker. (c) The peak activity evoked by detected targets (TD) in the auditory cortex is significantly stronger than activity evoked by undetected targets (TN). Reproduced from Wiegand and Gutschalk (2012).
noise. Strikingly, this perceived continuity is observed even when the continuously-perceived tone is completely switched off during the interrupting noise, a phenomenon known as the auditory continuity illusion (Warren et al., 1972). A detailed review of the continuity illusion and its neural underpinnings can be found elsewhere (Petkov and Sutter, 2011); below we focus on recent human neuroimaging and electrophysiological studies for comparison with the other phenomena discussed above. Recordings in monkey auditory cortex show that activity in some neurons may be sustained through the gap in cases where the tones are interrupted (Petkov et al., 2007). Conversely, onset and offset responses at the beginning and end of the tone gap were more prominent for tones that were likely to be perceived as interrupted. Results from combined behavioral and fMRI experiments better match the latter observation, i.e. the transient coding of the perceived gap, since activity in auditory cortex has been found to be higher for discontinuous compared to continuous perception of the same sounds (Riecke et al., 2007; Shahin et al., 2009). A time-frequency analysis of EEG data showed reduced theta-band (~4 Hz) activity (Riecke et al., 2009, 2012) for trials in which the sound was perceived as continuing through the interrupting noise (that is, trials that elicited the continuity illusion) as compared to trials in which the gap was perceived. One study that applied additional visual cues found similar effects for the N1 and P2 responses evoked by transient offsets and onsets of the interrupted sound (Shahin et al., 2012). Specifically, the N1 and P2 responses were smaller when the sound was perceived as continuous than when it was perceived as interrupted.
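The single-frequency time-frequency measures used in such studies can be sketched with a complex Morlet wavelet, which yields both band-limited power and inter-trial coherence (ITC). This is an illustrative implementation with made-up data, not the pipeline of any of the studies cited here.

```python
import numpy as np

def morlet_power_itc(trials, fs, freq, n_cycles=5):
    """Single-frequency complex Morlet wavelet transform of a trials-by-time
    array: returns trial-averaged power and inter-trial coherence (ITC).
    An illustrative sketch, not any specific study's analysis pipeline."""
    sigma = n_cycles / (2 * np.pi * freq)
    t = np.arange(-3 * sigma, 3 * sigma, 1 / fs)
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t ** 2 / (2 * sigma ** 2))
    wavelet /= np.abs(wavelet).sum()
    analytic = np.stack([np.convolve(tr, wavelet, mode="same") for tr in trials])
    power = (np.abs(analytic) ** 2).mean(axis=0)              # total power
    itc = np.abs((analytic / np.abs(analytic)).mean(axis=0))  # phase locking
    return power, itc

# toy data: 4-Hz activity that is phase-locked across trials (evoked-like)
# vs. 4-Hz activity with a random phase on every trial (induced-like)
fs, n_trials, n_samp = 200, 50, 400
rng = np.random.default_rng(1)
time = np.arange(n_samp) / fs
phase_locked = np.stack([np.sin(2 * np.pi * 4 * time)
                         + 0.5 * rng.standard_normal(n_samp)
                         for _ in range(n_trials)])
induced = np.stack([np.sin(2 * np.pi * 4 * time + rng.uniform(0, 2 * np.pi))
                    + 0.5 * rng.standard_normal(n_samp)
                    for _ in range(n_trials)])
p_locked, itc_locked = morlet_power_itc(phase_locked, fs, 4.0)
p_induced, itc_induced = morlet_power_itc(induced, fs, 4.0)
```

Both toy conditions carry 4-Hz power, but only the phase-locked condition yields high ITC; when ITC is high and the evoked response is not subtracted beforehand, band-limited power and the traditional evoked response largely capture the same signal.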
Given that, in the studies by Riecke, the evoked response was not subtracted out of individual trials before computing the event-related spectral perturbation, it remains unclear if the modulation of theta activity is independent of the N1–P2 modulation found by Shahin et al., or if both are related to transient onset and offset responses for physically interrupted sounds. Note that traditional evoked responses and event-related spectra reflect the same processes when the inter-trial coherence is high and the evoked response is not subtracted from individual trials prior to event-related time-frequency analysis. Thus, the difference between the studies might to some extent reflect differences in the sensitivity of these methods and their associated filter functions. One important difference between the techniques is that baseline activity has a stronger impact on the relative measures (e.g. event-related spectral perturbation) typically used in time-frequency analysis of neural signals. A recent EEG study (Vinnik et al., 2012) evaluated the possibility that neural activity before the gap, which was used as the baseline interval in previous studies (Riecke et al., 2009, 2012), could predict whether the listener perceives the subsequent gap as present or not. Their results showed that such a prediction was possible significantly more often than chance, but based on activity in the gamma and beta frequency bands only. The anatomical source of this activity is unclear at this point, since the analysis was performed in electrode space, and the relationship to the studies reviewed above remains to be clarified. In any case, the results suggest that reduced theta-band activity after masker onset (Riecke et al., 2009) is not caused by baseline fluctuations, since no significant theta-band effects were observed in the baseline interval.

5.2. Concurrent sound segregation

In the paradigms discussed so far, the two segregated streams typically have asynchronous onsets.
However, perceptual segregation into two distinct streams is sometimes also observed for synchronous sounds (so-called simultaneous integration vs. sequential integration, following the terminology of Bregman). A common
example is the case of a harmonic complex tone with a single, mistuned partial. While the correctly-tuned harmonics are perceptually grouped into one stream with a common pitch, the mistuned harmonic stands out from this percept as a separate tone that does not belong to the complex (Moore et al., 1985). It has been shown that activity in the auditory cortex is enhanced when the mistuned harmonic is perceived as emanating from a distinct source compared to when the mistuned partial is grouped with the remainder of the harmonic complex (Alain et al., 2001). The enhanced activity includes a positive peak (60–100 ms) and a negative transient (140–180 ms), the latter being termed the object-related negativity (Alain and McDonald, 2007). A later positivity (230–400 ms) is additionally observed when listeners attend the tones (Alain et al., 2002). When the sounds are longer, a negative sustained potential is also enhanced when the mistuned harmonic is perceptually isolated (Alain et al., 2002).

5.3. Simultaneous speech

One of the most traditional paradigms for studying auditory scene analysis is the use of multiple, simultaneous speakers (Broadbent, 1952; Cherry, 1953), a laboratory version of the “cocktail party” problem. It has recently been shown that activity in the auditory cortex that is time-locked to the envelope of a speech stream can be extracted by cross-correlation between the envelope and the ongoing MEG signal. The results showed that a time-locked response can be deconvolved from the ongoing MEG, the characteristics of which are similar to the long-latency responses (i.e. P1m, N1m, and P2m) evoked by isolated stimuli in auditory cortex (Ding and Simon, 2012b). When listeners were instructed to listen to one of two voices, cued either by ear in dichotic paradigms (Ding and Simon, 2012b) or by the speaker’s gender in diotic settings (Ding and Simon, 2012a), the part of the response that resembles the N1m was selectively enhanced for the attended speaker.
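The envelope cross-correlation approach just described can be sketched as follows. The simulated envelope, response kernel, and 0.2-s latency are toy assumptions chosen only to show that the method recovers an envelope-following response at its true lag, not a model of real MEG data.

```python
import numpy as np

def xcorr_response(envelope, signal, fs, max_lag_s=0.3):
    """Cross-correlate a stimulus envelope with an ongoing neural signal to
    estimate the envelope-following response at positive lags (sketch)."""
    env = (envelope - envelope.mean()) / envelope.std()
    sig = (signal - signal.mean()) / signal.std()
    lags = np.arange(int(max_lag_s * fs))
    r = np.array([np.dot(env[:env.size - l], sig[l:]) / (env.size - l)
                  for l in lags])
    return lags / fs, r

fs = 100
rng = np.random.default_rng(2)
# toy "speech envelope": low-pass filtered noise
env = np.convolve(rng.standard_normal(3200), np.hanning(25), mode="same")
# simulated neural signal: envelope convolved with a response kernel that
# peaks around 0.2 s after the stimulus, plus additive noise
kernel = np.zeros(40)
kernel[10:30] = np.hanning(20)
meg = np.convolve(env, kernel, mode="full")[:env.size] + rng.standard_normal(env.size)

lag_s, r = xcorr_response(env, meg, fs)
peak_lag = lag_s[np.argmax(r)]   # recovers the simulated response latency
```

The peak of the cross-correlation function falls near the latency of the simulated kernel; with real data, the shape of this function is what resembles the long-latency P1m–N1m–P2m complex.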
This response was robust to modifications of the stimulus, and was not disrupted when the competing speaker’s voice was made louder than the attended speaker’s (Ding and Simon, 2012a). These data demonstrate how a long-known effect (Hillyard et al., 1973; Rif et al., 1991) can be generalized to natural, complex stimuli, highlighting its relevance in real-world acoustic settings. More importantly, these data show that stimuli whose frequency spectra possess considerable overlap are nevertheless segregated at the level of auditory cortex, such that switching the attentional focus selectively enhances one or the other ecologically relevant stream. Similarly large effects of selective attention on the neuronal representation of individual speakers in cocktail-party settings have also been observed with intracranial EEG recordings from secondary auditory areas of the human postero-lateral superior temporal gyrus (Mesgarani and Chang, 2012; Zion Golumbic et al., 2013) and areas outside auditory cortex (Zion Golumbic et al., 2013). In both studies, high-gamma power (thought to reflect high-frequency synaptic activity and/or multi-unit firing in the immediate vicinity of the electrode; Ray and Maunsell, 2011; Steinschneider et al., 2008) selectively tracked the speech envelope of the attended speaker in a multi-talker setting.

6. Summary and perspective

The data reviewed above suggest that the human auditory cortex represents the auditory scene via a multiple-level hierarchy. Early activity, until about 80 ms, likely reflects the processing of spectral and other features that are important for auditory grouping, most convincingly demonstrated with the P1m component in the classical streaming paradigm. This activity is likely not directly linked to perception, but appears to instead represent
sensory stimulus processing that is important for mapping the acoustic input onto a perceptual representation of auditory streams. Accordingly, the P1m is not modulated when a target stream is detected under informational masking, and is similarly evoked when the same tones are perceptually masked (Königs and Gutschalk, 2012). The sensory nature of the P1m has also been confirmed for speech in noise, where its amplitude is directly related to the signal-to-noise ratio (Ding and Simon, 2013). Other activation patterns in the auditory cortex are clearly not fully determined by the physical nature of acoustic stimulation, and are instead more closely related to perceptual events reported by listeners. One example is the negativity (ARN) evoked in the latency range 75–200 ms only by target streams that are perceived to pop out from random multi-tone backgrounds (Dykstra, 2011; Gutschalk et al., 2008; Wiegand and Gutschalk, 2012). Another is the enhanced N1m for a selectively attended speaker (Ding and Simon, 2012b), which is invariant to modifications of the signal-to-noise ratio (Ding and Simon, 2012a, 2013). The object-related negativity evoked during the perception of mistuned harmonics in concurrent segregation likely also belongs to this group (Alain et al., 2001). We suggest that the processes reflected by these surface-negative waves operate on a representation of streams rather than acoustics. It appears that these processing resources are limited, and that as a consequence there is competition for perceptual resources in the presence of multiple streams (Desimone and Duncan, 1995; Lavie, 2006; Zylberberg et al., 2010). Therefore, a stream that is selected by top-down attention or because of its bottom-up salience is processed in more depth, and it may be that this level of processing is required for conscious perception of the stream in question. Conversely, when the same stream is presented in silence, i.e.
without perceptual competition, the same processes are hypothesized to occur invariably and automatically, as reflected by the negative waves in auditory cortex that remain present even in passive recording conditions (Ding and Simon, 2012b; Dykstra, 2011; Gutschalk et al., 2008), as well as by the lack of wide-spread activity in supramodal areas (specifically frontal cortex). It remains to be determined how the modulation of activity in the auditory cortex is controlled by frontal and parietal cortex under sensory competition (Fritz et al., 2010; Lee et al., 2012), and whether these areas themselves are required for conscious auditory perception (Dehaene and Changeux, 2011). This view differs from suggestions that segregating a stream necessarily requires voluntary attention (Carlyon et al., 2001) and that these attention-driven mechanisms are themselves the crucial mechanism for segregating auditory streams (Lakatos et al., 2013). Note, however, that our hypothesis that attention is primarily deployed on streams does not mean that attention cannot be focused on a basic feature before auditory stream formation, and that this can then impact the stream-formation process. For example, access to early, perhaps even subcortical, auditory representations has been suggested by reverse hierarchy theory (Nahum et al., 2008). Moreover, attentional modulation of earlier activity (overlapping with the P1m) in auditory cortex has been reported in MEG (Poghosyan and Ioannides, 2008; Woldorff and Hillyard, 1991) and at the level of the inferior colliculus in fMRI (Rinne et al., 2008) and EEG (Sorqvist et al., 2012). However, these effects are small in comparison to the effects of attention observed for later, cortical activity, and their impact on perception may be subtle in most situations.
In bistable perception, where the interpretation of the sensory evidence is ambiguous, such small variations could nevertheless bias subsequent perceptual decisions more frequently and thus determine perception. The same could apply to the enhanced positivity observed for segregation compared to
integration (Gutschalk et al., 2005; Hill et al., 2012; Szalardy et al., 2013). Note that the finding of differential neuronal activity in one perceptual condition vs. another does not indicate that such activity is, per se, the locus of the associated percept. A somewhat different attention mechanism has been suggested as part of the coherence model of stream segregation (Shamma et al., 2011, 2013). Here it is assumed that when attention is deployed to one feature of an auditory stream, all coherent elements are automatically segregated from the rest, such that the object representation essentially emerges from attention to a sound feature. While this model elegantly integrates feature- and object-based perception, it does not explain well why different effects of attention have been observed at different stages of the auditory system (subcortical and cortical), or why attention is often deployed directly to streams (or sound objects) rather than to a singleton sound feature (Shinn-Cunningham, 2008). Moreover, in vision, it has been shown that attention can operate on subconscious items (Watanabe et al., 2011), and similar effects likely exist in human audition. To understand the neural underpinnings of auditory scene analysis, it will be important to understand in more detail how auditory sensory information is transformed from sensory representations of physical stimuli into perceptual representations of streams. Based on the considerations above, we suspect that major parts of these computational processes take place in the auditory cortex, before the level reflected by the late negative components of the auditory evoked response. The separation of sound features characterizing a stream is stressed by the population-separation model (Fishman et al., 2012; Micheyl et al., 2007b), and it has been shown that part of this transformation is already achieved at the level reflected by the P1m.
As auditory streams unfold over time, another important aspect of the transformation into streams is the coding of onsets and offsets or, more generally, transitions within each stream. This has been demonstrated for streaming, the continuity illusion, and speech segregation. Processing in the ascending auditory system accentuates transitions that are present in sound. Selective adaptation is one mechanism along these lines that further enhances the separation along frequency and other feature dimensions, and makes the coding of transitions more similar to the coding of the same stream presented in isolation. For a universal framework of auditory scene analysis, other mechanisms are required before auditory streams can be confidently segregated. One suggestion is the temporal coherence model, which has been proposed to operate on stimulus-driven, phase-locked activity in auditory cortex (Shamma et al., 2011, 2013). Population separation is a necessary preprocessing step for this model, but selective adaptation could also potentially enhance stream segregation prior to the computation of coherence (or alternative spectro-temporal integration mechanisms). Further microscopic recordings in animal preparations will be required to test these models and their neural implementation, but functional imaging in combination with behavioral tests will remain critical in translating these findings to the human brain. While we think that many of these basic scene-analysis mechanisms are bottom-up processes, it is likely that top-down processes beyond selective attention are also involved in auditory scene analysis. This follows the suggestion of Bregman (1990), who dissociated primitive scene analysis mechanisms, to which he assigns streaming, from schema-based mechanisms, which supposedly require top-down processing.
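The grouping principle of the temporal coherence model can be caricatured as a pairwise correlation between frequency-channel envelopes: channels whose envelopes rise and fall together belong to one stream. The on/off patterns below are stand-ins for cochlear-channel envelopes in an ABAB-like sequence; they are illustrative assumptions, not the model's actual cortical implementation.

```python
import numpy as np

def coherence_matrix(channel_envs):
    """Pairwise zero-lag correlation between frequency-channel envelopes; in
    coherence-based models, channels with strongly correlated envelopes are
    grouped into the same stream (illustrative sketch)."""
    z = channel_envs - channel_envs.mean(axis=1, keepdims=True)
    z /= z.std(axis=1, keepdims=True)
    return (z @ z.T) / z.shape[1]

rng = np.random.default_rng(3)
t = np.arange(2000)
a = (np.sin(2 * np.pi * t / 200) > 0).astype(float)  # on/off pattern, stream A
b = 1.0 - a                                          # alternating pattern, stream B
# four toy channels: 0-1 follow pattern A, 2-3 follow pattern B, plus noise
envs = np.stack([a + 0.1 * rng.standard_normal(t.size) for _ in range(2)]
                + [b + 0.1 * rng.standard_normal(t.size) for _ in range(2)])
C = coherence_matrix(envs)
```

Channels sharing a temporal pattern cohere strongly, while channels belonging to the alternating stream anti-cohere; thresholding such a matrix partitions the channels into two streams, which is the sense in which coherence can group synchronous components and segregate alternating ones.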
Based on functional imaging studies, a potential site for top-down control of auditory cortex in the context of auditory stream formation is the intraparietal sulcus (Cusack, 2005; Hill et al., 2011; Teki et al., 2011). However, more studies will be required to disentangle activity in this region that is directly related to perceptual organization from other processes
like target detection and attentional selection, which are likely to involve overlapping neural substrates (Weisz et al., 2013). Another source of top-down processes may be expected to emanate from the lateral temporal lobe, in particular for the processing of speech elements, but also of other complex sound objects, which are thought to be specifically processed in the superior temporal sulcus (Belin et al., 2000; Binder et al., 2000). Given the fast connection of these sites with the auditory cortex on the superior temporal plane (Howard et al., 2000), their influence may already be exerted in the time window before the stream organization is firmly established. A potential mechanism could be gain control of lower, feature-based sound representations in the auditory cortex by complex, memory-based object representations in the superior temporal sulcus. Alternatively, top-down control of auditory cortex from the superior temporal sulcus could be modeled in the predictive-coding framework. At this point, however, such interactions remain mostly unaddressed. In summary, we are only just beginning to understand the neural underpinnings of auditory scene analysis. Urgent questions include the roles of primary and secondary areas of the auditory cortex, as well as their interactions with other areas in the frontal, parietal, and temporal lobes. Other questions include the dissociation of processes within the auditory cortex, and a unifying framework for auditory feature extraction. Clearly, invasive recordings will be required to dissociate between some of the potential mechanisms of auditory scene analysis. Functional imaging techniques can help increase the relevance of these studies to human listeners. Furthermore, they can also serve to guide invasive research by establishing models on a macroscopic scale (note that a number of the imaging findings summarized in this review have not yet been explored on a microscopic level).
The closer these often-segregated branches of research interact, the larger the potential benefit will be, not only to each field individually but also to the field at large and, most importantly, to those who, even in the absence of traditional hearing loss, often suffer breakdowns in parsing everyday, complex auditory scenes.

Acknowledgments

Research supported by the Bundesministerium für Bildung und Forschung (BMBF, grant 01EV0712) and the Deutsche Forschungsgemeinschaft (DFG, grant GU593/3-2).

References

Alain, C., McDonald, K.L., 2007. Age-related differences in neuromagnetic brain activity underlying concurrent sound perception. J. Neurosci. 27, 1308–1314.
Alain, C., Arnott, S.R., Picton, T.W., 2001. Bottom-up and top-down influences on auditory scene analysis: evidence from event-related brain potentials. J. Exp. Psychol. Hum. Percept. Perform. 27, 1072–1089.
Alain, C., Schuler, B.M., McDonald, K.L., 2002. Neural activity associated with distinguishing concurrent auditory objects. J. Acoust. Soc. Am. 111, 990–995.
Anstis, S., Saida, S., 1985. Adaptation to auditory streaming of frequency-modulated tones. J. Exp. Psychol. Hum. Percept. Perform. 11, 257–271.
Beauvois, M.W., Meddis, R., 1996. Computer simulation of auditory stream segregation in alternating-tone sequences. J. Acoust. Soc. Am. 99, 2270–2280.
Bee, M.A., Klump, G.M., 2004. Primitive auditory stream segregation: a neurophysiological study in the songbird forebrain. J. Neurophysiol. 92, 1088–1104.
Belin, P., Zatorre, R.J., Lafaille, P., Ahad, P., Pike, B., 2000. Voice-selective areas in human auditory cortex. Nature 403, 309–312.
Bidet-Caulet, A., Fischer, C., Besle, J., Aguera, P.E., Giard, M.H., Bertrand, O., 2007. Effects of selective attention on the electrophysiological representation of concurrent sounds in the human auditory cortex. J. Neurosci. 27, 9252–9261.
Binder, J.R., Frost, J.A., Hammeke, T.A., Bellgowan, P.S., Springer, J.A., Kaufman, J.N., Possing, E.T., 2000.
Human temporal lobe activation by speech and nonspeech sounds. Cereb. Cortex 10, 512–528.
Blake, R., Logothetis, N.K., 2002. Visual competition. Nat. Rev. Neurosci. 3, 13–21.
Bregman, A.S., 1978. Auditory streaming is cumulative. J. Exp. Psychol. Hum. Percept. Perform. 4, 380–387.
Bregman, A.S., 1990. Auditory Scene Analysis. MIT Press, Cambridge, MA.
Bregman, A.S., Campbell, J., 1971. Primary auditory stream segregation and perception of order in rapid sequences of tones. J. Exp. Psychol. 89, 244–249.
Bregman, A.S., Ahad, P.A., Crum, P.A., O’Reilly, J., 2000. Effects of time intervals and tone durations on auditory stream segregation. Percept. Psychophys. 62, 626–636.
Broadbent, D.E., 1952. Listening to one of two synchronous messages. J. Exp. Psychol. 44, 51–55.
Brosch, M., Schreiner, C.E., 1997. Time course of forward masking tuning curves in cat primary auditory cortex. J. Neurophysiol. 77, 923–943.
Brugge, J.F., Nourski, K.V., Oya, H., Reale, R.A., Kawasaki, H., Steinschneider, M., Howard 3rd, M.A., 2009. Coding of repetitive transients by auditory cortex on Heschl’s gyrus. J. Neurophysiol. 102, 2358–2374.
Bullier, J., Hupe, J.M., James, A.C., Girard, P., 2001. The role of feedback connections in shaping the responses of visual cortical neurons. Prog. Brain Res. 134, 193–204.
Carl, D., Gutschalk, A., 2013. Role of pattern, regularity, and silent intervals in auditory stream segregation based on inter-aural time differences. Exp. Brain Res. 224, 557–570.
Carlyon, R.P., Cusack, R., Foxton, J.M., Robertson, I.H., 2001. Effects of attention and unilateral neglect on auditory stream segregation. J. Exp. Psychol. Hum. Percept. Perform. 27, 115–127.
Chakalov, I., Draganova, R., Wollbrink, A., Preissl, H., Pantev, C., 2012. Modulations of neural activity in auditory streaming caused by spectral and temporal alternation in subsequent stimuli: a magnetoencephalographic study. BMC Neurosci. 13, 72.
Cherry, C., 1953. Some experiments on the recognition of speech, with one and two ears. J. Acoust. Soc. Am. 25, 975–981.
Cusack, R., 2005. The intraparietal sulcus and perceptual organization. J. Cogn. Neurosci. 17, 641–651.
Dehaene, S., Changeux, J.P., 2011. Experimental and theoretical approaches to conscious processing. Neuron 70, 200–227.
Deike, S., Scheich, H., Brechmann, A., 2010.
Active stream segregation specifically involves the left human auditory cortex. Hear. Res. 265, 30–37.
Deike, S., Gaschler-Markefski, B., Brechmann, A., Scheich, H., 2004. Auditory stream segregation relying on timbre involves left auditory cortex. Neuroreport 15, 1511–1514.
Delgutte, B., 1990. Physiological mechanisms of psychophysical masking: observations from auditory-nerve fibers. J. Acoust. Soc. Am. 87, 791–809.
Desimone, R., Duncan, J., 1995. Neural mechanisms of selective visual attention. Annu. Rev. Neurosci. 18, 193–222.
Ding, N., Simon, J.Z., 2012a. Emergence of neural encoding of auditory objects while listening to competing speakers. Proc. Natl. Acad. Sci. U. S. A. 109, 11854–11859.
Ding, N., Simon, J.Z., 2012b. Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. J. Neurophysiol. 107, 78–89.
Ding, N., Simon, J.Z., 2013. Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. J. Neurosci. 33, 5728–5735.
Durlach, N.I., Mason, C.R., Kidd Jr., G., Arbogast, T.L., Colburn, H.S., Shinn-Cunningham, B.G., 2003. Note on informational masking. J. Acoust. Soc. Am. 113, 2984–2987.
Dykstra, A.R., 2011. Neural Correlates of Auditory Perceptual Organization Measured with Direct Cortical Recordings in Humans. Thesis. Massachusetts Institute of Technology. Available: http://dspace.mit.edu/handle/1721.1/68451.
Dykstra, A.R., Halgren, E., Thesen, T., Carlson, C.E., Doyle, W., Madsen, J.R., Eskandar, E.N., Cash, S.S., 2011. Widespread brain areas engaged during a classical auditory streaming task revealed by intracranial EEG. Front. Hum. Neurosci. 5, 74.
Elhilali, M., Xiang, J., Shamma, S.A., Simon, J.Z., 2009a. Interaction between attention and bottom-up saliency mediates the representation of foreground and background in an auditory scene. PLoS Biol. 7, e1000129.
Elhilali, M., Ma, L., Micheyl, C., Oxenham, A.J., Shamma, S.A., 2009b.
Temporal coherence in the perceptual organization and cortical representation of auditory scenes. Neuron 61, 317–329.
Fishman, Y.I., Arezzo, J.C., Steinschneider, M., 2004. Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration. J. Acoust. Soc. Am. 116, 1656–1670.
Fishman, Y.I., Micheyl, C., Steinschneider, M., 2012. Neural mechanisms of rhythmic masking release in monkey primary auditory cortex: implications for models of auditory scene analysis. J. Neurophysiol. 107, 2366–2382.
Fishman, Y.I., Reser, D.H., Arezzo, J.C., Steinschneider, M., 2001. Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hear. Res. 151, 167–187.
Fritz, J.B., David, S.V., Radtke-Schuller, S., Yin, P., Shamma, S.A., 2010. Adaptive, behaviorally gated, persistent encoding of task-relevant auditory information in ferret frontal cortex. Nat. Neurosci. 13, 1011–1019.
Gutschalk, A., Micheyl, C., Oxenham, A.J., 2008. Neural correlates of auditory perceptual awareness under informational masking. PLoS Biol. 6, e138.
Gutschalk, A., Patterson, R.D., Uppenkamp, S., Scherg, M., Rupp, A., 2004a. Recovery and refractoriness of auditory evoked fields after gaps in click trains. Eur. J. Neurosci. 20, 3141–3147.
Gutschalk, A., Patterson, R.D., Scherg, M., Uppenkamp, S., Rupp, A., 2004b. Temporal dynamics of pitch in human auditory cortex. Neuroimage 22, 755–766.
Gutschalk, A., Oxenham, A.J., Micheyl, C., Wilson, E.C., Melcher, J.R., 2007. Human cortical activity during streaming without spectral cues suggests a general neural substrate for auditory stream segregation. J. Neurosci. 27, 13074–13081.
Gutschalk, A., Micheyl, C., Melcher, J.R., Rupp, A., Scherg, M., Oxenham, A.J., 2005. Neuromagnetic correlates of streaming in human auditory cortex. J. Neurosci. 25, 5382–5388.
Hämäläinen, M.S., Hari, R., Ilmoniemi, R.J., Knuutila, J., Lounasmaa, O.V., 1993. Magnetoencephalography – theory, instrumentation, and applications to noninvasive studies of the working human brain. Rev. Mod. Phys. 65, 413–497.
Hari, R., Kaila, K., Katila, T., Tuomisto, T., Varpula, T., 1982. Interstimulus interval dependence of the auditory vertex response and its magnetic counterpart: implications for their neural generation. Electroencephalogr. Clin. Neurophysiol. 54, 561–569.
Harms, M.P., Melcher, J.R., 2002. Sound repetition rate in the human auditory pathway: representations in the waveshape and amplitude of fMRI activation. J. Neurophysiol. 88, 1433–1450.
Harms, M.P., Guinan Jr., J.J., Sigalovsky, I.S., Melcher, J.R., 2005. Short-term sound temporal envelope characteristics determine multisecond time patterns of activity in human auditory cortex as shown by fMRI. J. Neurophysiol. 93, 210–222.
Hartmann, W.M., Johnson, D., 1991. Stream segregation and peripheral channeling. Music Percept. 9, 155–184.
Hill, K.T., Bishop, C.W., Miller, L.M., 2012. Auditory grouping mechanisms reflect a sound’s relative position in a sequence. Front. Hum. Neurosci. 6, 158.
Hill, K.T., Bishop, C.W., Yadav, D., Miller, L.M., 2011. Pattern of BOLD signal in auditory cortex relates acoustic response to perceptual streaming. BMC Neurosci. 12, 85.
Hillyard, S.A., Hink, R.F., Schwent, V.L., Picton, T.W., 1973. Electrical signs of selective attention in the human brain. Science 182, 177–180.
Howard, M.A., Volkov, I.O., Mirsky, R., Garell, P.C., Noh, M.D., Granner, M., Damasio, H., Steinschneider, M., Reale, R.A., Hind, J.E., Brugge, J.F., 2000. Auditory cortex on the human posterior superior temporal gyrus. J. Comp. Neurol. 416, 79–92.
Imada, T., Watanabe, M., Mashiko, T., Kawakatsu, M., Kotani, M., 1997.
The silent period between sounds has a stronger effect than the interstimulus interval on auditory evoked magnetic fields. Electroencephalogr. Clin. Neurophysiol. 102, 37–45.
Kidd, G., Mason, C.R., Richards, V.M., Gallun, F.J., Durlach, N.I., 2008. Informational masking. In: Yost, W.A., Popper, A.N., Fay, R.R. (Eds.), Auditory Perception of Sound Sources. Springer, New York.
Kidd Jr., G., Mason, C.R., Richards, V.M., 2003. Multiple bursts, multiple looks, and stream coherence in the release from informational masking. J. Acoust. Soc. Am. 114, 2835–2845.
Kidd Jr., G., Mason, C.R., Deliwala, P.S., Woods, W.S., Colburn, H.S., 1994. Reducing informational masking by sound segregation. J. Acoust. Soc. Am. 95, 3475–3480.
Kondo, H.M., Kashino, M., 2009. Involvement of the thalamocortical loop in the spontaneous switching of percepts in auditory streaming. J. Neurosci. 29, 12695–12701.
Königs, L., Gutschalk, A., 2012. Functional lateralization in auditory cortex under informational masking and in silence. Eur. J. Neurosci. 36, 3283–3290.
Lakatos, P., Musacchia, G., O’Connel, M.N., Falchier, A.Y., Javitt, D.C., Schroeder, C.E., 2013. The spectrotemporal filter mechanism of auditory selective attention. Neuron 77, 750–761.
Lamme, V.A., 2004. Separate neural definitions of visual consciousness and visual attention; a case for phenomenal awareness. Neural Netw. 17, 861–872.
Lavie, N., 2006. The role of perceptual load in visual awareness. Brain Res. 1080, 91–100.
Lee, A.K., Rajaram, S., Xia, J., Bharadwaj, H., Larson, E., Hämäläinen, M.S., Shinn-Cunningham, B.G., 2012. Auditory selective attention reveals preparatory activity in different cortical regions for selection based on source location and source pitch. Front. Neurosci. 6, 190.
Liegeois-Chauvel, C., Musolino, A., Badier, J.M., Marquis, P., Chauvel, P., 1994. Evoked potentials recorded from the auditory cortex in man: evaluation and topography of the middle latency components. Electroencephalogr. Clin. Neurophysiol.
92, 204–214.
McCabe, S.L., Denham, M.J., 1997. A model of auditory streaming. J. Acoust. Soc. Am. 101, 1611–1621.
Mesgarani, N., Chang, E.F., 2012. Selective cortical representation of attended speaker in multi-talker speech perception. Nature 485, 233–236.
Meyer, K., 2011. Primary sensory cortices, top-down projections and conscious experience. Prog. Neurobiol. 94, 408–417.
Micheyl, C., Shamma, S., Oxenham, A.J., 2007a. Hearing out repeating elements in randomly varying multitone sequences: a case of streaming. In: Kollmeier, B., Klump, G.M., Hohmann, V., Langemann, U., Mauermann, M., Uppenkamp, S., Verhey, J. (Eds.), Hearing – from Basic Research to Application. Springer, Berlin, pp. 267–274.
Micheyl, C., Tian, B., Carlyon, R.P., Rauschecker, J.P., 2005. Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron 48, 139–148.
Micheyl, C., Carlyon, R.P., Gutschalk, A., Melcher, J.R., Oxenham, A.J., Rauschecker, J.P., Tian, B., Wilson, E.C., 2007b. The role of auditory cortex in the formation of auditory streams. Hear. Res. 229, 116–131.
Mill, R.W., Bohm, T.M., Bendixen, A., Winkler, I., Denham, S.L., 2013. Modelling the emergence and dynamics of perceptual organisation in auditory streaming. PLoS Comput. Biol. 9, e1002925.
Miller, G.A., Heise, G.A., 1950. The trill threshold. J. Acoust. Soc. Am. 22, 637–638.
Moore, B.C., Peters, R.W., Glasberg, B.R., 1985. Thresholds for the detection of inharmonicity in complex tones. J. Acoust. Soc. Am. 77, 1861–1867.
Moore, B.C.J., 1995. Frequency analysis and masking. In: Moore, B.C.J. (Ed.), Handbook of Perception and Cognition, Hearing, vol. 6. Academic Press, Orlando, Florida, pp. 161–205.
Moore, B.C.J., Gockel, H., 2002. Factors influencing sequential stream segregation. Acta Acust. United Acust. 88, 320–333.
Näätänen, R., Kujala, T., Winkler, I., 2011. Auditory processing that leads to conscious perception: a unique window to central auditory processing opened by the mismatch negativity and related responses. Psychophysiology 48, 4–22.
Nahum, M., Nelken, I., Ahissar, M., 2008. Low-level information and high-level perception: the case of speech in noise. PLoS Biol. 6, e126.
Neff, D.L., Green, D.M., 1987. Masking produced by spectral uncertainty with multicomponent maskers. Percept. Psychophys. 41, 409–415.
Nourski, K.V., Brugge, J.F., Reale, R.A., Kovach, C.K., Oya, H., Kawasaki, H., Jenison, R.L., Howard 3rd, M.A., 2013. Coding of repetitive transients by auditory cortex on posterolateral superior temporal gyrus in humans: an intracranial electrophysiology study. J. Neurophysiol. 109, 1283–1295.
Petkov, C.I., O’Connor, K.N., Sutter, M.L., 2007. Encoding of illusory continuity in primary auditory cortex. Neuron 54, 153–165.
Petkov, C.I., Sutter, M.L., 2011. Evolutionary conservation and neuronal mechanisms of auditory perceptual restoration. Hear. Res. 271, 54–65.
Poghosyan, V., Ioannides, A.A., 2008. Attention modulates earliest responses in the primary auditory and visual cortices. Neuron 58, 802–813.
Pressnitzer, D., Hupe, J.M., 2006. Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Curr. Biol. 16, 1351–1357.
Pressnitzer, D., Sayles, M., Micheyl, C., Winter, I.M., 2008. Perceptual organization of sound begins in the auditory periphery. Curr. Biol. 18, 1124–1128.
Ray, S., Maunsell, J.H., 2011. Different origins of gamma rhythm and high-gamma activity in macaque visual cortex. PLoS Biol. 9, e1000610.
Riecke, L., van Opstal, A.J., Goebel, R., Formisano, E., 2007. Hearing illusory sounds in noise: sensory-perceptual transformations in primary auditory cortex. J. Neurosci. 27, 12684–12689.
Riecke, L., Esposito, F., Bonte, M., Formisano, E., 2009.
Hearing illusory sounds in noise: the timing of sensory-perceptual transformations in auditory cortex. Neuron 64, 550–561.
Riecke, L., Vanbussel, M., Hausfeld, L., Baskent, D., Formisano, E., Esposito, F., 2012. Hearing an illusory vowel in noise: suppression of auditory cortical activity. J. Neurosci. 32, 8024–8034.
Rif, J., Hari, R., Hämäläinen, M.S., Sams, M., 1991. Auditory attention affects two different areas in the human supratemporal cortex. Electroencephalogr. Clin. Neurophysiol. 79, 464–472.
Rinne, T., Balk, M.H., Koistinen, S., Autti, T., Alho, K., Sams, M., 2008. Auditory selective attention modulates activation of human inferior colliculus. J. Neurophysiol. 100, 3323–3327.
Schadwinkel, S., Gutschalk, A., 2010a. Functional dissociation of transient and sustained fMRI BOLD components in human auditory cortex revealed with a streaming paradigm based on interaural time differences. Eur. J. Neurosci. 32, 1970–1978.
Schadwinkel, S., Gutschalk, A., 2010b. Activity associated with stream segregation in human auditory cortex is similar for spatial and pitch cues. Cereb. Cortex 20, 2863–2873.
Schadwinkel, S., Gutschalk, A., 2011. Transient BOLD activity locked to perceptual reversals of auditory streaming in human auditory cortex and inferior colliculus. J. Neurophysiol. 105, 1977–1983.
Shahin, A.J., Bishop, C.W., Miller, L.M., 2009. Neural mechanisms for illusory filling-in of degraded speech. Neuroimage 44, 1133–1143.
Shahin, A.J., Kerlin, J.R., Bhat, J., Miller, L.M., 2012. Neural restoration of degraded audiovisual speech. Neuroimage 60, 530–538.
Shamma, S.A., Elhilali, M., Micheyl, C., 2011. Temporal coherence and attention in auditory scene analysis. Trends Neurosci. 34, 114–123.
Shamma, S.A., Elhilali, M., Ma, L., Micheyl, C., Oxenham, A.J., Pressnitzer, D., Pingbo, Y., Yanbo, X., 2013. Temporal coherence and the streaming of complex sounds. In: Moore, B.C.J., Carlyon, R.P., Patterson, R.D., Gockel, H.
(Eds.), Basic Aspects of Hearing: Physiology and Perception. Springer, New York, pp. 535–544.
Shinn-Cunningham, B.G., 2008. Object-based auditory and visual attention. Trends Cogn. Sci. 12, 182–186.
Snyder, J.S., Alain, C., 2007. Toward a neurophysiological theory of auditory stream segregation. Psychol. Bull. 133, 780–799.
Snyder, J.S., Alain, C., Picton, T.W., 2006. Effects of attention on neuroelectric correlates of auditory stream segregation. J. Cogn. Neurosci. 18, 1–13.
Sorqvist, P., Stenfelt, S., Ronnberg, J., 2012. Working memory capacity and visual-verbal cognitive load modulate auditory-sensory gating in the brainstem: toward a unified view of attention. J. Cogn. Neurosci. 24, 2147–2154.
Steinmann, I., Gutschalk, A., 2011. Potential fMRI correlates of 40-Hz phase locking in primary auditory cortex, thalamus and midbrain. Neuroimage 54, 495–504.
Steinschneider, M., Fishman, Y.I., Arezzo, J.C., 2008. Spectrotemporal analysis of evoked and induced electroencephalographic responses in primary auditory cortex (A1) of the awake monkey. Cereb. Cortex 18 (3), 610–625.
Sterzer, P., Kleinschmidt, A., Rees, G., 2009. The neural bases of multistable perception. Trends Cogn. Sci. 13, 310–318.
Sussman, E., Ritter, W., Vaughan Jr., H.G., 1999. An investigation of the auditory streaming effect using event-related brain potentials. Psychophysiology 36, 22–34.
Sussman, E., Wong, R., Horvath, J., Winkler, I., Wang, W., 2007. The development of the perceptual organization of sound by frequency separation in 5–11-year-old children. Hear. Res. 225, 117–127.
Szalardy, O., Bohm, T.M., Bendixen, A., Winkler, I., 2013. Event-related potential correlates of sound organization: early sensory and late cognitive effects. Biol. Psychol. 93, 97–104.
Teki, S., Chait, M., Kumar, S., von Kriegstein, K., Griffiths, T.D., 2011. Brain bases for auditory stimulus-driven figure-ground segregation. J. Neurosci. 31, 164–171.
Teki, S., Chait, M., Kumar, S., Shamma, S., Griffiths, T.D., 2013. Segregation of complex acoustic scenes based on temporal coherence. eLife 2, e00699.
Van Noorden, L.P.A.S., 1975. Temporal Coherence in the Perception of Tone Sequences. University of Technology, Eindhoven.
Vinnik, E., Itskov, P.M., Balaban, E., 2012. Beta- and gamma-band EEG power predicts illusory auditory continuity perception. J. Neurophysiol. 108, 2717–2724.
Vliegen, J., Oxenham, A.J., 1999a. Sequential stream segregation in the absence of spectral cues. J. Acoust. Soc. Am. 105, 339–346.
Vliegen, J., Moore, B.C., Oxenham, A.J., 1999b. The role of spectral and periodicity cues in auditory stream segregation, measured using a temporal discrimination task. J. Acoust. Soc. Am. 106, 938–945.
Warren, R.M., Obusek, C.J., Ackroff, J.M., 1972. Auditory induction: perceptual synthesis of absent sounds. Science 176, 1149–1151.
Watanabe, M., Cheng, K., Murayama, Y., Ueno, K., Asamizuya, T., Tanaka, K., Logothetis, N., 2011. Attention but not awareness modulates the BOLD signal in the human V1 during binocular suppression. Science 334, 829–831.
Weisz, N., Muller, N., Jatzev, S., Bertrand, O., 2013. Oscillatory alpha modulations in right auditory regions reflect the validity of acoustic cues in an auditory spatial attention task. Cereb. Cortex.
Wiegand, K., Gutschalk, A., 2012. Correlates of perceptual awareness in human primary auditory cortex revealed by an informational masking experiment. Neuroimage 61, 62–69.
Wilson, E.C., Melcher, J.R., Micheyl, C., Gutschalk, A., Oxenham, A.J., 2007. Cortical fMRI activation to sequences of tones alternating in frequency: relationship to perceived rate and streaming. J. Neurophysiol. 97, 2230–2238.
Winkler, I., Denham, S., Mill, R., Bohm, T.M., Bendixen, A., 2012. Multistability in auditory stream segregation: a predictive coding view. Philos. Trans. R. Soc. Lond. B Biol. Sci. 367, 1001–1012.
Woldorff, M.G., Hillyard, S.A., 1991. Modulation of early auditory processing during selective listening to rapidly presented tones. Electroencephalogr. Clin. Neurophysiol. 79, 170–191.
Yvert, B., Crouzeix, A., Bertrand, O., Seither-Preisler, A., Pantev, C., 2001. Multiple supratemporal sources of magnetic and electric auditory evoked middle latency components in humans. Cereb. Cortex 11, 411–423.
Zion Golumbic, E.M., Ding, N., Bickel, S., Lakatos, P., Schevon, C.A., McKhann, G.M., Goodman, R.R., Emerson, R., Mehta, A.D., Simon, J.Z., Poeppel, D., Schroeder, C.E., 2013. Mechanisms underlying selective neuronal tracking of attended speech at a “cocktail party”. Neuron 77, 980–991.
Zylberberg, A., Fernandez Slezak, D., Roelfsema, P.R., Dehaene, S., Sigman, M., 2010. The brain’s router: a cortical network model of serial processing in the primate brain. PLoS Comput. Biol. 6, e1000765.