Hearing Research 229 (2007) 225–236
www.elsevier.com/locate/heares
Research paper
Breaking the wave: Effects of attention and learning on concurrent sound perception

Claude Alain *

Rotman Research Institute, Baycrest Centre for Geriatric Care, Department of Psychology, University of Toronto, 3560 Bathurst Street, Toronto, Ont., Canada M6A 2E1

Received 13 September 2006; received in revised form 6 December 2006; accepted 3 January 2007. Available online 16 January 2007.

* Tel.: +1 416 785 2500x3523; fax: +1 416 785 2862. E-mail address: [email protected].
doi:10.1016/j.heares.2007.01.011
Abstract

The auditory environment is often complex, with many sound sources active simultaneously. Yet listeners are proficient in breaking apart the composite acoustic wave reaching the ears. This achievement is thought to be the result of bottom-up as well as top-down processes that reflect listeners' experience and knowledge of the auditory environment. Here, specific findings concerning the role of bottom-up and top-down (schema-driven) processes in concurrent sound perception are reviewed, with particular emphasis on studies that have used scalp recording of event-related brain potentials. Findings from several studies indicate that frequency periodicity, upon which concurrent sound perception partly depends, is quickly and automatically registered in primary auditory cortex. Moreover, success in identifying concurrent vowels is accompanied by enhanced neural activity, as revealed by functional magnetic resonance imaging, in the thalamus, primary auditory cortex and planum temporale. Lastly, listeners' ability to segregate concurrent vowels improves with training, and these neuroplastic changes occur rapidly, demonstrating the flexibility of human speech segregation mechanisms. Together, these studies suggest that the primary auditory cortex and the planum temporale play an important role in concurrent sound perception, and reveal a link between thalamo-cortical activation and the successful separation and identification of speech sounds presented simultaneously.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Streaming; Speech; ERP; Attention; Auditory; Perception
1. Introduction

A sound rarely occurs in isolation: listen carefully and you will realize that many physical sound sources may be active simultaneously (e.g., music from a TV or radio, vehicle traffic, people talking). While we may not be aware of all of them at the same time, we nevertheless can effortlessly switch our attention to those various sound sources in the environment. This ability to focus our attention on a particular sound object within an auditory scene presumes that the incoming composite acoustic wave has been ‘‘broken’’ into its various constituents, which either activate old representations or lead to the formation of new auditory
objects (i.e., mental representations) in memory. This is a complicated problem for the brain to solve because each ear has access to only a single pressure wave that comprises the summation of acoustic energy coming from all individual sound sources. However, the human auditory system seems to have evolved in a way to successfully partition this amalgam of acoustic information such that we can build representations of the various sources of sounds present in the environment.

The putative mechanisms engaged in the decomposition of complex acoustic waves, such that perceptual objects can be identified, fall under the collective term of auditory scene analysis. Bregman (1990) has proposed a comprehensive theory of auditory scene analysis based primarily on the Gestalt laws of organization such as similarity, proximity, and good continuity (Koffka, 1935). In this framework, auditory scene analysis is divided into two classes of
processes, dealing with the perceptual organization of simultaneously (i.e., concurrently) and sequentially occurring acoustic elements. These processes are responsible for grouping and parsing components of the acoustic mixture to construct perceptual representations of sound sources. Many of these processes are considered automatic or ‘primitive’ since they are found in infants (McAdams and Bertoncini, 1997), birds (Hulse et al., 1997; MacDougall-Shackleton et al., 1998) and non-human primates (Izumi, 2002, 2003). Perception and recognition of complex sound sources involve not only pre-attentive processes, which use basic stimulus properties to segregate the incoming acoustic wave, but also controlled schema-driven processes that reflect our knowledge from past experience with auditory stimuli. The use of prior knowledge in auditory perception is particularly evident in an analogous laboratory condition of a cocktail party situation, in which a sentence's final word embedded in multi-talker babble is more easily detected when it is contextually predictable (Pichora-Fuller et al., 1995).

Auditory scene analysis theory concerns itself with how the auditory system assigns incoming acoustic elements from different physical sound sources to auditory objects. Here, a sound source is defined as the physical entity that generates acoustic energy (see also, Bregman, 1990). Sound sources may differ in terms of place of origin, intensity, spectral, and/or temporal complexity (i.e., pattern of amplitude and frequency modulations over time). An auditory object, on the other hand, is the perceptual formation of a group of sounds into a coherent whole that seems to emanate from a single physical acoustic source. Thus, the term auditory object denotes a mental description of a physical sound source in the environment and its behavior over time rather than the source itself or the sounds it emits. The current definition shares similarities with, but is not limited to, the notion of auditory streams proposed by Bregman (1990), because it encompasses the segregation and perception of concurrent sound sources. The concept of auditory streams usually refers to groups of sounds occurring over several seconds rather than the grouping of sounds occurring simultaneously. In the present review, I also make a distinction between auditory objects and auditory events. While the former refers to the perception of a sound source and its behavior over time, the latter is used when referring to the perceptual dimensions of a sound source at one particular time point. Auditory events exist within the larger entity of the auditory object. For example, in dichotic listening experiments co-occurring conversations would be defined as two separate auditory objects (i.e., each is perceived to originate from a separate auditory source) whereas the elements (e.g., the listener's own name) within either conversation would be defined as auditory events. In this example, the sound sources, of course, would be the left and right ear locations.

As the auditory scene unfolds over time, the listener is faced with the problem of perceptually organizing acoustic elements along the dimensions of time (sequential sound
segregation) and frequency (concurrent sound segregation). Organization along the time axis entails the sequential grouping of acoustic data over several seconds, whereas organization along the frequency axis involves parsing acoustic elements from simultaneous sound sources according to different frequencies and their harmonic relations. For example, to follow a conversation in a typically complex acoustic environment, the individual must be able to separate out the acoustic elements that correspond to the conversation of interest from those that belong to other competing sound sources, and perform simultaneous temporal integration of those acoustic elements emanating from each source. Sounds that have similar onsets, intensities, and harmonically related frequencies are more likely to be perceived as coming from the same source than are sounds that begin at different times, and/or differ in intensity and frequency.

The perception of a mistuned harmonic in an otherwise periodic complex sound provides a good example of concurrent sound segregation. When listeners are presented simultaneously with tonal elements that are integer multiples of the fundamental frequency (f0), they usually report hearing a single buzz-like sound with a pitch corresponding to the f0 of the harmonic series. However, if one of the lower harmonics is mistuned such that it is no longer an integer multiple of the fundamental, then listeners report hearing a buzz-like sound and another sound at the pitch of the mistuned harmonic. The likelihood of hearing the mistuned harmonic as a separate object increases with mistuning (Alain et al., 2001a; Moore et al., 1986). When the amount of mistuning is large (e.g., 10–16% lower or higher than its original value), the mistuned harmonic appears to perceptually ‘‘pop out’’ of the complex sound, in much the same way that a visual target defined by a unique color pops out of a display filled with homogeneous distractors of a different color (Treisman and Gelade, 1980). Listeners are also more likely to segregate lower than higher harmonics (Moore et al., 1986), an effect that can partly be accounted for by the width of peripheral auditory filters, which increases with increasing frequency. As a result, higher harmonics are unresolved, and the perceptual ‘pop-out’ that occurs for lower mistuned partials (i.e., a ‘buzz’ plus pure-tone percept) is replaced by a sensation of roughness for higher-frequency mistuned components (Moore et al., 1986). Decreasing signal duration has also been found to reduce the likelihood of reporting the lower mistuned harmonic as a separate object (Moore et al., 1986). In hindsight, such an effect of duration is no surprise given that the processing of the incoming acoustic mixture is likely to require some time before its various constituents can be identified and assigned to distinct perceptual objects.

The processes involved in the detection of mistuning can be modeled using the concept of a harmonic sieve or template (Brunstrom and Roberts, 1998; Duifhuis et al., 1982; Lin and Hartmann, 1998; Scheffers, 1983). That is, tonal elements that are integer multiples of the f0 would ‘‘pass’’ through the sieve and be grouped into one coherent sound object, while the mistuned harmonic would lie outside of the ‘‘hole’’ and be attributed to another object. The basis for the construction or activation of such a template is not well understood, but could involve neurons sensitive to equal spacing (i.e., period) between tonal elements (Roberts and Brunstrom, 1998). That is, neurons sensitive to frequency periodicity could act as a series of filters that allow harmonically related partials to group together with the f0, and mistuned partials to form separate representations. For shorter-duration stimuli (e.g., 50 ms) the auditory system may be more tolerant of deviations from harmonicity because of a weak estimation of periodicity pitch.
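To make the mistuned-harmonic stimulus and the sieve idea concrete, here is a minimal sketch in Python/NumPy. The fundamental frequency, number of harmonics, mistuned partial, amount of mistuning and sieve tolerance are illustrative values chosen for this example, not parameters taken from the studies cited above.

```python
import numpy as np

def harmonic_complex(f0=200.0, n_harmonics=10, mistuned=3, mistune_pct=8.0,
                     dur=0.4, fs=16000):
    """Sum of equal-amplitude partials at integer multiples of f0,
    with one partial shifted by a percentage of its nominal frequency."""
    t = np.arange(int(dur * fs)) / fs
    freqs = f0 * np.arange(1, n_harmonics + 1)
    if mistuned is not None:
        freqs[mistuned - 1] *= 1.0 + mistune_pct / 100.0   # e.g., 3rd partial, +8%
    wave = sum(np.sin(2 * np.pi * f * t) for f in freqs)
    return wave / n_harmonics, freqs

def harmonic_sieve(freqs, f0, tolerance_pct=3.0):
    """Label each partial as grouped with f0 (it falls through a sieve 'hole')
    or segregated. The tolerance is illustrative; listeners begin to hear a
    separate object at roughly a few percent mistuning of a low harmonic."""
    nearest = np.round(freqs / f0) * f0
    deviation = 100 * np.abs(freqs - nearest) / nearest
    return ["grouped" if d <= tolerance_pct else "segregated" for d in deviation]

wave, freqs = harmonic_complex()
for f, label in zip(freqs, harmonic_sieve(freqs, f0=200.0)):
    print(f"{f:7.1f} Hz -> {label}")
```

With these settings only the shifted third partial falls outside the sieve tolerance and is labeled as a separate object, mirroring the ‘buzz plus pure tone’ percept described above.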
Another paradigm that has been instrumental in advancing knowledge of the psychological and neural mechanisms supporting concurrent sound segregation is the double-vowel task. In the double-vowel task, participants are usually presented with a mixture of two phonetically different synthetic vowels (taken from a set of 4–5 vowels) and are required to indicate which two vowels are presented by sequentially pressing the two corresponding buttons. In most cases, both vowels begin and end at the same time and the difference in f0 between the two vowels varies randomly from trial to trial. This paradigm overcomes, to some extent, the artificiality of the mistuned harmonic tasks and has the additional benefits of providing a more direct assessment of speech separation as well as allowing for an examination of the processes involved in acoustic identification as opposed to detection. Psychophysical studies have shown that the identification rate improves with increasing separation between the f0 of the two vowels (Assmann and Summerfield, 1990, 1994; Chalikia and Bregman, 1989, 1993). When two vowels are played together, listeners generally hear one of the vowels as foreground (dominant) and the other as background (non-dominant). The difficulty of the task resides in identifying the non-dominant vowel, which depends on successfully parsing the incoming vowel mixture into its constituent parts for a comparison with schemata of the vowels. The auditory system is thought to achieve this separation by first extracting the f0 of the dominant vowel and then ‘‘subtracting’’ its components in order to facilitate the identification of the non-dominant vowel (de Cheveigne, 1999).

In this paper, I will discuss findings from recent studies that have used the paradigms described above in conjunction with recordings of event-related brain potentials (ERPs) in an attempt to identify specific neural markers underlying the segregation and perception of concurrent auditory objects. Scalp recording of human ERPs (obtained by averaging together EEG segments time-locked to stimulus onset) is a powerful technique for investigating the neural correlates of temporally unfolding auditory scenes because it allows for the examination of brain activity with exquisite temporal resolution. Moreover, ERPs can be recorded for sounds presented either within or outside the focus of attention, thereby providing an opportunity to assess the impact of top-down controlled
processes such as attention (Alain and Izenberg, 2003; Hillyard et al., 1973) and learning (Shahin et al., 2003; Tremblay et al., 1997) on auditory scene analysis.

Cortical auditory evoked responses such as those recorded in the middle- (10–50 ms post-stimulus) and long-latency (50–250 ms post-stimulus) ranges result from stimulus-locked post-synaptic potentials within apical dendrites of pyramidal neurons in the cerebral cortex. Middle- and long-latency auditory ERPs are comprised of several positive and negative deflections (i.e., waves) that reflect activation from neural ensembles of tonotopically organized generators (Woods, 1995) as well as from associative auditory areas showing fewer frequency-specific characteristics than primary cortex (Picton et al., 1999). Each deflection is identified according to its polarity, order of occurrence and/or latency. For example, the N1 wave, also referred to as the N100, is a negative peak that occurs approximately 100 ms following sound onset (Fig. 1). It is followed by a positive wave referred to as P2 or P200, which peaks at about 180 ms after sound onset and is largest over the frontocentral scalp areas. The number of recruited neurons, the extent of neuronal activation, and the synchrony of the neural response all contribute to the resulting pattern. The amplitude of ERPs can be used as an index of the strength of the response in microvolts (µV), whereas the latency refers to the amount of time, in milliseconds (ms), that it takes to generate the bioelectrical response following stimulus onset. Latency is therefore related to neural conduction time and site of excitation, that is, the time it takes for the sound to travel through the peripheral auditory system to the place of excitation in the central auditory system.

One goal of ERP studies of auditory scene analysis is to relate the ERP components to psychological processes, using task manipulations that differentially affect the latencies and amplitudes of different components. A second goal is to identify the brain generators that contribute to a particular ERP effect. Thus, on the one hand, ERPs can contribute to our understanding of auditory scene analysis by revealing the latency and sequence of processing operations, while on the other, they can inform computer modeling and cellular neurophysiology by localizing the brain areas involved in particular aspects of auditory scene analysis. Therefore, ERPs may serve as a bridge between the various approaches used to understand auditory scene analysis.

In the following sections, prior research that has used ERPs to study concurrent sound perception in general, and concurrent vowel perception in particular, is reviewed. The role of attention in parsing concurrent sounds is first examined. This is followed by a review of studies that have examined the role of learning in concurrent sound segregation.

2. The role of attention on concurrent sound segregation
Fig. 1. (a) Group mean ERPs elicited by tuned and mistuned stimuli (2nd harmonic mistuned by 16% of its original value) and the corresponding difference wave in a group of healthy young adults. Note that the ORN is recorded during both passive (i.e., participants read a book of their choice and no response is required) and active listening (response required). P400 wave was present only when participants were required to report whether they heard one complex sound or whether they heard a buzz-like sound plus another sound with a pure tone quality. The ORN and P400 amplitude correlated with listeners’ likelihood of reporting having heard concurrent auditory objects. (b) Isocontour maps illustrating the amplitude distribution of the ORN and P400 during active and passive listening. Adapted with permission from Alain et al. (2001a).
Alain and colleagues used scalp-recorded ERPs to investigate the neural underpinning of concurrent sound segregation in humans (Alain and Izenberg, 2003; Alain et al., 2001a, 2002; Dyson and Alain, 2004). As mentioned earlier, ERP recording is particularly well-suited for studying the impact of attention on auditory scene analysis because ERPs can easily be recorded for the same sounds whether they are task-relevant and require a behavioral response from the participants or task-irrelevant (no response required). In the first study by Alain et al. (2001a), participants were presented with complex sounds composed of either all tuned harmonics or multiple tuned harmonics and one mistuned harmonic. The amount of mistuning varied randomly from trial to trial and listeners were required to indicate on each trial whether they heard one buzz-like sound or two sounds (i.e., a buzz plus another sound with a pure-tone quality). In addition, the same participants were presented with the same stimuli while reading a book of their choice, to examine the extent to which the processing of mistuning could be performed independently of sustained attention. Consistent with behavioral studies
(Hartmann et al., 1990; Lin and Hartmann, 1998; Moore et al., 1986), the perception of the mistuned harmonic as a separate sound increased as the mistuning increased. When a single low harmonic was mistuned by more than 4%, listeners heard it as a separate object. This perception of simultaneous auditory objects was accompanied by negative and positive waves that peaked at 180 and 400 ms post-stimulus, respectively (Fig. 1). The negative wave, revealed by the difference wave between ERPs elicited by tuned and mistuned stimuli, and referred to as the object-related negativity (ORN), overlapped in time with the N1 and P2 deflections. Its amplitude correlated with perceptual judgment, being greater when participants were more likely to report hearing two distinct perceptual objects. Interestingly, the ORN was recorded even when participants were not attending to the stimuli, suggesting that concurrent sound segregation may occur independently of listeners’ attention.
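As a concrete illustration of how such a difference wave is obtained, the following sketch (Python/NumPy, operating on simulated placeholder epochs rather than real recordings) averages single-trial EEG segments time-locked to sound onset, forms the mistuned-minus-tuned difference wave, and measures its most negative point in the latency window where the ORN is typically observed. The epoch counts, sampling rate and search window are illustrative assumptions.

```python
import numpy as np

fs = 500                        # EEG sampling rate (Hz); illustrative value
t = np.arange(-0.1, 0.6, 1/fs)  # epoch from -100 ms to +600 ms around sound onset

# Hypothetical single-trial epochs at one electrode (trials x time points),
# e.g. segmented from a continuous recording time-locked to stimulus onset.
rng = np.random.default_rng(0)
tuned_trials    = rng.normal(0, 5, (200, t.size))
mistuned_trials = rng.normal(0, 5, (200, t.size))

# Averaging across trials attenuates activity that is not phase-locked to the
# stimulus, leaving the event-related potential (in microvolts).
erp_tuned    = tuned_trials.mean(axis=0)
erp_mistuned = mistuned_trials.mean(axis=0)

# The object-related negativity (ORN) is defined as the difference wave:
# mistuned minus tuned, peaking roughly 150-180 ms after sound onset.
orn = erp_mistuned - erp_tuned

window = (t >= 0.10) & (t <= 0.25)            # search window for the ORN peak
peak_idx = np.argmin(orn[window])             # most negative point = ORN peak
peak_latency_ms = 1000 * t[window][peak_idx]
peak_amplitude_uv = orn[window][peak_idx]
print(f"ORN peak: {peak_amplitude_uv:.2f} uV at {peak_latency_ms:.0f} ms")
```

The same two-step logic (average per condition, then subtract) underlies the difference waves shown in Figs. 1 and 2.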
The hypothesis that ORN generation is independent of listeners' attention was further investigated by directly manipulating selective listening demands. Alain and Izenberg (2003) presented equiprobable tuned and mistuned stimuli to the left and right ears and asked their participants to focus their attention on one ear and to ignore the stimuli presented in the other ear. For both tuned and mistuned stimuli, long (standard) and occasional shorter (deviant) duration sounds were presented in both ears. In the easy version of the task, the participants were instructed to press a button whenever they heard a short-duration sound (targets), irrespective of tuning. In the more difficult version, they were requested to also indicate whether the short-duration sounds were tuned or mistuned. Participants were faster in detecting targets defined by duration alone than targets defined by both duration and tuning. More importantly, standard mistuned stimuli at the unattended location generated an ORN whose amplitude was not affected by task difficulty. In other words, the amount of attention allocated to the attended location had little impact on listeners' ability to process inharmonicity. More recently, Dyson et al. (2005) have shown that manipulating visual attentional load during an n-back task has no significant impact on ORN amplitude. Together, these findings provide further support for the hypothesis that concurrent sound segregation, as indexed by the ORN, operates independently of conscious control. However, it is important to point out that under certain circumstances the process indexed by the ORN can be facilitated by top-down controlled processes. Indeed, listening situations that promoted selective attention to the frequency region of the mistuned harmonic generated a larger ORN than was observed during passive listening (see Experiments 1 and 3 from Alain et al., 2001a).

The perception of concurrent auditory objects has also been associated with a late positive wave that peaks at about 400 ms following stimulus onset (P400). Like the ORN, the amplitude of the P400 correlated with perceptual judgment, being larger when participants perceived the mistuned harmonic as a separate tone. However, in contrast with the ORN, this component was present only when participants were required to report whether they heard one or two auditory stimuli. Moreover, the amplitude of the P400, like subjective reports from listeners, was sensitive to contextual manipulations such as the probability of the mistuned harmonic within a sequence of stimuli, whereas the ORN amplitude was little affected by contextual manipulations. These findings suggest that the P400 reflects the conscious evaluation of, and decision-making about, the number of auditory objects present (Alain et al., 2001a, 2002).

It was originally proposed that the ORN indexes the early and automatic registration of a new and additional object from the acoustic mixture reaching the ear, while the P400 reflects the more conscious evaluation of the mixture in which context and memory are used to guide the response. However, in the studies reviewed above only
mistuning was manipulated, making it difficult to know whether the changes in neural activity reflect mistuning or whether they index concurrent sound segregation and perception per se. Presumably, if the ORN and P400 responses are related to concurrent sound perception, then they should also be observed when concurrent sounds are segregated based on other types of acoustic cues. McDonald and Alain (2005) examined the contribution of location and harmonicity cues in a free-field environment. They found that observers were more likely to report hearing two concurrent auditory objects if a tonal element, either in tune or slightly mistuned (2%), was presented at a different location than the remaining harmonics. Interestingly, this was paralleled by an ORN that was present even when the tonal component was segregated based on location alone. These results indicate that listeners can segregate sounds based on harmonicity or location alone and that the conjunction of harmonicity and location cues contributes to sound segregation primarily when harmonicity is ambiguous.

The notion that the ORN indexes concurrent sound segregation rather than mistuning is further supported by recent work from Johnson and colleagues, who measured ERPs while listeners were presented with stimuli that promoted the perception of dichotic pitches (Hautus and Johnson, 2005; Johnson et al., 2003). Dichotic pitch is a perception of pitch that is extracted by the brain from two noise segments, neither of which alone contains any cues to pitch. The pitch percept arises when a dichotic delay is introduced to a narrow frequency region of a noise segment, and it may be localized to a different spatial location than the background noise with which it is heard. Johnson et al. found that the perception of this dichotic pitch along with the background noise was paralleled by ORN and P400 responses. Taken together, these findings suggest that both the ORN and the P400 reflect fairly general mechanisms that can broadly utilize a range of cues to help separate simultaneous acoustic events. More importantly, findings from these studies indicate that generation of the ORN and P400 is not limited to the detection of mistuning, which is consistent with the proposal that they index concurrent sound segregation and perception.

3. Concurrent vowel segregation

The studies reviewed above suggest that the ORN reflects an early and automatic registration of concurrent sound objects. However, it is unclear whether models derived from studies using harmonic series can account for the segregation and identification of over-learned stimuli such as speech sounds. Speech stimuli are likely to activate stored representations, which may help solve the scene analysis problem (i.e., schema-based segregation). To examine whether ERP findings obtained with the mistuned harmonic paradigm are generalizable to more ecologically valid stimuli, Alain et al. (2005a) recorded ERPs while participants were presented with a mixture of two phonetically different
vowels and were asked to identify both of them by sequentially pressing the corresponding buttons on a response pad; the vowel pair and the difference in f0 (Δf0) varied from trial to trial. As previously reported in the behavioral literature (Assmann and Summerfield, 1990, 1994; Chalikia and Bregman, 1989), listeners' ability to identify both vowels improved with increasing difference in f0 between the two vowels. This improvement in performance as a function of Δf0 was paralleled by two ERP modulations thought to underlie the detection and identification of concurrent vowels, respectively. The first ERP modulation was a negative wave that was superimposed on the N1 and P2 waves and peaked around 145 ms after sound onset (Fig. 2a). This component was maximal over midline central electrodes and showed similarities in latency and amplitude distribution with the ORN. As with the ORN, the amplitude of this component appears to be related to the detection of the discrepancy between f0's, signaling to higher auditory centers that two sound sources are present. As the difference in f0 increases, different populations of neurons responding to the temporal characteristics of the steady-state vowels and/or the ‘‘place’’ on the cortex (Ohl and Scheich, 1997) may be activated, whereas vowels sharing the same f0 would activate the same population of neurons.
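For readers unfamiliar with the semitone scale used to express Δf0, the sketch below shows the relationship between semitone separation and the f0 of the second vowel, and how a double-vowel trial can be thought of as the sum of two harmonic sources with a common onset. The 100 Hz reference f0, the harmonic count and the duration are illustrative assumptions, and the formant filtering that turns a harmonic source into an actual vowel is deliberately omitted.

```python
import numpy as np

def f0_from_semitones(f0_ref, semitones):
    """f0 of the second vowel when it lies `semitones` above the reference f0."""
    return f0_ref * 2.0 ** (semitones / 12.0)

def harmonic_source(f0, dur=0.3, fs=16000, n_harmonics=30):
    """Flat-spectrum harmonic source; real double-vowel stimuli would shape
    this spectrum with vowel formants, which is omitted in this sketch."""
    t = np.arange(int(dur * fs)) / fs
    return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harmonics + 1))

f0_a = 100.0                                  # reference f0 (illustrative)
for st in (0, 0.25, 0.5, 1, 2, 4):            # typical delta-f0 steps in semitones
    print(f"{st:>4} semitones -> second f0 = {f0_from_semitones(f0_a, st):6.2f} Hz")

# A double-vowel trial is the sum of the two sources with equal level and
# identical onsets, so only the f0 difference distinguishes the two voices.
mixture = harmonic_source(f0_a) + harmonic_source(f0_from_semitones(f0_a, 4))
```

A 4-semitone separation, the largest used in the studies above, corresponds to an f0 ratio of about 1.26 (e.g., 100 Hz versus roughly 126 Hz).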
The second ERP modulation associated with concurrent vowel perception was a negative wave that peaked at about 250 ms after sound onset and was larger over the right and central regions of the scalp (Fig. 2b). As mistuned harmonics do not generate this late modulation (Alain and Izenberg, 2003; Alain et al., 2001a, 2002), it is likely related to the identification and categorization that followed the automatic detection of the two constituents of the double-vowel stimuli. This sequence of neural events supports a multistage model of auditory scene analysis in which the spectral pattern of each vowel constituent is automatically extracted and then matched against representations of those vowels in working memory.

The proposal that Δf0 is automatically registered in auditory cortices was tested further in a second experiment in which a different group of participants was presented with the same stimuli while they watched a muted movie of their choice presented on a computer monitor with English subtitles. This manipulation was designed to test whether the Δf0-related changes in ERPs depend on focused attention, thereby allowing us to clarify the stage at which attention affects concurrent vowel perception. The use of muted subtitled movies is important because the text dialogue effectively captures attention without interfering with auditory processing (Pettigrew et al., 2004). Fig. 2c shows the difference
Fig. 2. (a) Group mean ERPs elicited by double-vowel stimuli with Δf0 equal to 0 or 4 semitones. The difference wave reveals an early negative peak at about 145 ms after sound onset, referred to as the object-related negativity (ORN). The amplitude distribution is consistent with dipole sources in the supratemporal plane. The contour spacing was set at 0.10 µV. Negative polarity is shown in blue (shaded). (b) Neural activity following the initial registration of Δf0. The latency and amplitude distribution show similarity with the N2b component thought to index stimulus classification and categorization. (c) Effects of attention on the ORN and N2b waves. The gray rectangle is a schematic representation of the double-vowel stimuli. Adapted with permission from Alain et al. (2005a).
wave between ERPs elicited when the two vowels shared the same f0 and when they were separated by four semitones in both groups. As expected, the ORN was present whether or not the participants were involved in identifying the two constituents of the double-vowel stimulus. This result is consistent with neurophysiological studies showing that Δf0 is registered early in the ascending auditory pathways (Cariani and Delgutte, 1996a,b; Palmer, 1990) and provides further support for the proposal that concurrent vowel segregation occurs automatically. However, the subsequent negativity was present only when listeners were required to make a perceptual decision. This modulation shows similarity in latency with another component referred to as the N2b (Ritter et al., 1979, 1982) and may index a matching process between the incoming signal and the stored representations of the vowels in working memory. Given that vowels are over-learned, the second modulation may also reflect the influence of schema-driven processes on vowel identification.

4. Neural networks involved in concurrent sound perception

The neural generators underlying scalp-recorded ERPs can be estimated using dipole source modeling algorithms. Determining the intra-cerebral sources of ERPs measured outside the head is referred to as the bioelectromagnetic inverse problem, since there is no unique solution to the problem. Nonetheless, this technique has proven useful in investigating the neural underpinnings of perceptual and cognitive processes. Brain electrical source analysis (BESA) uses models with only a small number of dipoles, for example single sources in the left and right auditory cortex. For dipole estimation, the difference between the measured electric field and the calculated field is minimized by varying the location and the orientation of the dipoles (Scherg, 1989); a toy illustration of this fitting procedure is sketched below. Dipole source modeling suggests that the ORN sources are inferior and medial to the N1 sources (Alain et al., 2001a), indicating that the neurons activated by co-occurring auditory stimuli are different from those activated by stimulus onset.

More recently, Dyson and Alain (2004) measured middle-latency auditory evoked responses for tuned and mistuned stimuli and found that the Pa wave at 30 ms was significantly larger when the third harmonic was mistuned by 16% of its original value. The enhanced Pa amplitude was also associated with an increased likelihood that participants would report the presence of multiple, concurrent auditory objects. These results are consistent with an early stage of auditory scene analysis in which acoustic properties such as mistuning act as preattentive segregation cues that can subsequently lead to the perception of multiple auditory objects. It therefore appears that the primary auditory cortex (the main source of the Pa wave) plays an important role in sound segregation. It is very likely that concurrent sound perception depends on a widely distributed network of brain regions.
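The following toy sketch illustrates the fitting idea behind dipole source modeling: a forward model predicts the sensor values produced by a dipole with a given location and moment, and a least-squares routine adjusts those parameters to minimize the mismatch with the measured topography. The forward model used here (a single current dipole in an unbounded homogeneous conductor) and all sensor positions and dipole values are invented for illustration; it is far simpler than the spherical or realistic head models used by packages such as BESA.

```python
import numpy as np
from scipy.optimize import least_squares

def dipole_potential(sensors, pos, moment, sigma=0.33):
    """Potential of a current dipole in an unbounded homogeneous conductor;
    a deliberately crude stand-in for a real head model."""
    r = sensors - pos                                  # dipole-to-sensor vectors
    d = np.linalg.norm(r, axis=1)
    return (r @ moment) / (4 * np.pi * sigma * d ** 3)

rng = np.random.default_rng(1)
# Fake 64-channel sensor layout (positions in meters, clustered above the head)
sensors = rng.normal(0, 0.05, (64, 3)) + np.array([0.0, 0.0, 0.09])

true_pos = np.array([0.01, 0.02, 0.05])                # "unknown" source (m)
true_moment = np.array([0.0, 10e-9, 5e-9])             # dipole moment (A*m)
measured = dipole_potential(sensors, true_pos, true_moment)
measured += rng.normal(0, 0.02 * measured.std(), measured.shape)  # sensor noise

def residual(params):
    # First three parameters: dipole location; last three: dipole moment.
    return dipole_potential(sensors, params[:3], params[3:]) - measured

fit = least_squares(residual,
                    x0=np.array([0.0, 0.0, 0.04, 1e-9, 1e-9, 1e-9]),
                    x_scale=[0.01] * 3 + [1e-8] * 3)

print("estimated position (m):", np.round(fit.x[:3], 3))
print("residual variance explained:",
      1 - np.var(residual(fit.x)) / np.var(measured))
```

The quantity printed last corresponds to the goodness-of-fit criterion typically reported for dipole models: the proportion of the measured field that the fitted dipole accounts for.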
In an effort to delineate more precisely the brain areas involved in concurrent sound perception, Alain et al. (2005b) measured brain activation by means of the blood oxygenation level-dependent (BOLD) effect (Ogawa et al., 1990) using functional magnetic resonance imaging (fMRI) while participants performed the double-vowel task. They used a ‘‘sparse sampling’’ protocol (Belin et al., 1999; Hall et al., 1999), which allowed them to examine segregation of the two vowels without the interference of the scanner noise. Stimuli were presented binaurally via circumaural, fMRI-compatible headphones 4–6 s prior to the ‘‘scan repeat’’ time (also known as the TR). The analysis of the behavioral data revealed increased accuracy with increased Δf0 between the two vowels. The analysis of the fMRI data did not yield reliable changes in BOLD signal as a function of Δf0. However, a comparison of fMRI signals between trials in which participants successfully identified both vowels and trials in which only one of the two vowels was recognized revealed a significant increase in BOLD signal in the left thalamus, Heschl's gyrus, the superior temporal gyrus, and the planum temporale (for more details, see Alain et al., 2005b). Because participants successfully identified at least one of the two vowels on each trial, the difference in fMRI signal indexes the extra computational work needed to segregate and successfully identify the other concurrently presented vowel. The results support the view that auditory cortex in or near Heschl's gyrus as well as in the planum temporale is involved in sound segregation and reveal a link between left thalamo-cortical activation and the successful separation and identification of simultaneous speech sounds.

5. Role of learning on concurrent vowel perception

Learning can take many forms, some of which relate to improvement in motor control or general cognitive functioning, including attention and memory, and some of which reflect improvement in observers' ability to discriminate differences in the attributes of sensory stimuli (perceptual learning). Perceptual learning occurs when two stimuli that at first appear identical become differentiated with practice. It often involves a rapid improvement in performance within the first hour of training (fast perceptual learning) followed by more gradual improvements that take place over several daily practice sessions (slow perceptual learning). This is a research area in which there is a great deal of cross-pollination between basic neurophysiological studies, which reveal fundamental properties of neurons as well as cortical plasticity during training, and neuroimaging research in humans, which shows neuroplastic changes in larger intertwined systems or networks rather than in a specific subpopulation of neurons.

During the last decade, converging evidence from animal, neuropsychological and neuroimaging research has revealed a remarkable degree of brain plasticity in the sensory systems during adulthood. Learning-induced changes in mature sensory cortices may occur in a task-dependent
manner, and can include rapid and highly specific changes in the response properties of cells (Bakin et al., 1996; Edeline et al., 1993; Fritz et al., 2003). However, as the training regimen or rehabilitation continues, significant changes can take place in the topographical organization representing the trained sensory attributes (Recanzone et al., 1992, 1993). These changes in sensory and/or motor representations following extended training may involve the expression of new synaptic connections, thereby resulting in an enlarged cortical representation of a specific stimulus after training, as well as changes in the tuning properties of sensory neurons (Recanzone et al., 1993; Rutkowski and Weinberger, 2005).

In humans, scalp-recorded ERPs have been instrumental in identifying the physiological correlates of auditory perceptual learning (Atienza et al., 2002; Bosnyak et al., 2004; Gottselig et al., 2004; Reinke et al., 2003; Tremblay et al., 1997). Training-related changes in
auditory evoked responses have been reported for a wide range of tasks requiring discrimination between tones of different frequencies (Bosnyak et al., 2004; Brattico et al., 2003; Menning et al., 2000) or between consonant-vowel stimuli varying in voice onset time (Tremblay et al., 1997, 2001). Prior studies focusing on auditory perceptual learning have shown a decrease in N1 latency (Bosnyak et al., 2004) as well as an enhanced N1m amplitude (the magnetic counterpart of the N1 wave) following extended training (Menning et al., 2000). The training-related enhancement in N1m may indicate either that more neurons are activated or that the neurons representing the stimulus are firing more synchronously. The N1c component also showed an increase in amplitude with extended practice; interestingly, the N1c amplitude continued to grow over 15 training sessions (Bosnyak et al., 2004). In addition, extended training has been found to enhance the amplitude of the P2 wave (Atienza et al., 2002; Bosnyak
Fig. 3. (a) Group mean accuracy during the four daily training sessions as well as during the first and second ERP recording sessions in trained and untrained individuals. Error bars reflect the standard error of the mean. (b) Group mean ERPs from the trained group recorded before and after the four-day training period at a midline frontal site (i.e., FCz). (c) Dipole source model over the 160–220 ms interval of the difference wave between ERPs recorded before and after 4 days of practice. The dipoles are superimposed on a standardized MR image. Reprinted from Reinke et al. (2003) with permission from Elsevier.
et al., 2004; Reinke et al., 2003; Tremblay et al., 2001), which can appear after two (Atienza et al., 2002) or three (Bosnyak et al., 2004) daily test sessions.

Most studies to date have focused on tasks involving the discrimination of two successively presented stimuli. Few studies have examined whether individuals can also learn to segregate concurrent auditory objects. This is an important question to address given that concurrent sound segregation is usually described as a low-level process that occurs independently of attention. Since speech is typically perceived against a background of other sounds, a training program that improves listeners' ability to segregate concurrent sounds could have important clinical applications. Reinke et al. (2003) examined the effects of a four-day training protocol on listeners' ability to segregate two vowels presented concurrently. ERPs to double-vowel stimuli were recorded during two sessions separated by one week. Half of the participants practiced the discrimination task during the intervening week, while the other half served as controls and did not receive any training. Trained listeners showed greater improvement in accuracy than untrained participants (Fig. 3a). In both groups, vowels generated N1 and P2 waves at the fronto-central and temporal scalp regions. The behavioral effects of training were paralleled by
decreased N1 and P2 latencies as well as a marked increase in P2 amplitude (Fig. 3b), which was, like performance, greater in trained than in untrained listeners. While the practice-related decrease in N1 latency may reflect more efficient encoding of sensory information, the P2 amplitude enhancement appears more consistent with the recruitment of additional neurons involved in parsing and representing the vowel constituents and/or a higher degree of synchronization within a particular neural ensemble.

How much training is needed before physiological changes can be observed? Alain et al. (2006) examined whether rapid perceptual learning was paralleled by changes in ERP amplitude and/or latency. Participants performed the same double-vowel task as described earlier. Listeners' ability to segregate and identify both vowels improved gradually during the first hour of testing (about 6% improvement from the first to the last block of trials). This improvement in vowel segregation and identification was paralleled by enhancements in an early evoked response (approximately 130 ms) localized in the right auditory cortex and a late evoked response (340 ms) localized in the right anterior superior temporal gyrus and/or inferior prefrontal cortex (Fig. 4). These rapid changes in ERP amplitude were preserved only if practice continued, and declined within one
Fig. 4. (a) Group mean ERPs recorded during the first ERP session at a midline frontal site in participants required to identify both vowels (Active) or asked to watch a subtitled movie (Passive). Note that the increase in ERP amplitude from the beginning to the end of the recording session is present only when participants are asked to focus attention on the stimuli. (b) Isocontour maps showing the amplitude distribution for the difference in brain activity between the first and the fourth block of trials. Adapted with permission from Alain et al. (2006).
week without training. The changes in ERP amplitude are consistent with top-down modulation and may reflect a sharpening of the receptive fields of neurons involved in periodicity coding. There was no difference in P2 amplitude between the first and last block of trials (Alain et al., 2006), suggesting that the training-related P2 enhancement indexed a relatively slow learning process that may depend on consolidation over several days.

While attention to stimuli facilitates learning, there is also evidence that mere exposure to sounds improves performance in subsequent recognition and identification tasks (Clarke and Garrett, 2004; Szpunar et al., 2004; Yonan and Sommers, 2000). Do the rapid changes in ERP amplitude reflect learning or simply stimulus exposure? To address this issue, Alain et al. (2006) measured ERPs elicited by the same stimuli in a group of participants who were instructed to ignore the stimuli and watch a muted movie of their choice with subtitles. The ERP amplitude recorded over the right temporal lobe did not differ significantly with increased exposure, suggesting that active listening is required to generate reliable and rapid neuroplastic changes (Fritz et al., 2003; Recanzone et al., 1993). These findings highlight the role of top-down controlled processes in brain plasticity and learning. Although it remains unclear whether increased ERP amplitude would have occurred with more exposure to the stimuli, these results also suggest that attentional capacities should be taken into account when designing interventions aimed at improving perceptual and cognitive functions in individuals suffering from hearing impairments.
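As a minimal sketch of how a training-related change in a component such as the P2 might be quantified across listeners, the following example applies a paired comparison to hypothetical per-participant peak amplitudes; the numbers are invented and do not come from the studies reviewed above.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant P2 peak amplitudes (microvolts), measured as the
# most positive point between 150 and 250 ms at a fronto-central electrode,
# before and after a multi-day training protocol. Values are illustrative only.
pre  = np.array([2.1, 1.8, 2.5, 3.0, 2.2, 1.9, 2.7, 2.4, 2.0, 2.6])
post = np.array([2.9, 2.4, 3.1, 3.6, 2.8, 2.3, 3.4, 2.9, 2.6, 3.2])

t_stat, p_value = stats.ttest_rel(post, pre)   # paired comparison across listeners
print(f"mean P2 change: {np.mean(post - pre):.2f} uV, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```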
6. Concluding remarks

Over the past decade there has been an explosion in research activity related to the neuropsychological basis of auditory scene analysis. This sudden increase in scientific work has been motivated in part by changes in current models of auditory perception and by a realization that hearing is more than just signal detection: it also involves organizing the sounds that surround us in meaningful ways. The other major motivation for studying the neural correlates of auditory perception is that they may provide new ways to assess and rehabilitate auditory perception.

A fundamental problem faced by the human auditory system is the segregation of concurrent speech signals. To discriminate between individual voices, listeners must extract information from the composite acoustic wave, which reflects the summed activation from all the simultaneously active voices. Numerous studies have shown that the observer's ability to identify two different vowels presented simultaneously improves with increasing f0 separation between the vowels. ERP recordings have revealed a different pattern of neural activity when individuals are required to identify concurrent vowels than when they are asked whether one or two auditory objects are present in a mixture using the mistuned harmonic paradigm. These differences may be related to observer expertise with the speech material, i.e., the processing of vowels may automatically engage both low-level and schema-driven processes. Moreover, successful segregation and identification of concurrently presented vowels depend on a thalamo-cortical network including the left thalamus, primary auditory cortex and planum temporale. Lastly, training improves listeners' ability to segregate concurrent vowels, and this improvement is paralleled by neuroplastic changes in auditory sensory cortex as well as in non-sensory cortices. These changes can occur quickly, within the first hour of training. These findings show that the functional organization of the adult's sensory systems is dynamic, modifiable by training, and can be revealed with scalp recordings of human ERPs. This may have important applications for rehabilitation, especially in individuals (e.g., older adults) experiencing difficulties in processing simultaneous sources of sounds (Alain et al., 2001b; Snyder and Alain, 2005). Further research is needed to explore the impact of training on older adults and hearing-impaired listeners and to assess whether learning to segregate simple acoustic stimuli can be generalized to more natural and complex listening situations.

Acknowledgements

The research presented in this review was supported by grants from the Canadian Institutes of Health Research and the Natural Sciences and Engineering Research Council of Canada. Special thanks to the volunteers who participated in the experiments from my laboratory reviewed here.

References
Alain, C., Izenberg, A., 2003. Effects of attentional load on auditory scene analysis. J. Cogn. Neurosci. 15, 1063–1073.
Alain, C., Arnott, S.R., Picton, T.W., 2001a. Bottom-up and top-down influences on auditory scene analysis: evidence from event-related brain potentials. J. Exp. Psychol. Hum. Percept. Perform. 27, 1072–1089.
Alain, C., McDonald, K.L., Ostroff, J.M., Schneider, B., 2001b. Age-related changes in detecting a mistuned harmonic. J. Acoust. Soc. Am. 109, 2211–2216.
Alain, C., Schuler, B.M., McDonald, K.L., 2002. Neural activity associated with distinguishing concurrent auditory objects. J. Acoust. Soc. Am. 111, 990–995.
Alain, C., Reinke, K., He, Y., Wang, C., Lobaugh, N., 2005a. Hearing two things at once: neurophysiological indices of speech segregation and identification. J. Cogn. Neurosci. 17, 811–818.
Alain, C., Reinke, K., McDonald, K.L., Chau, W., Tam, F., Pacurar, A., Graham, S., 2005b. Left thalamo-cortical network implicated in successful speech separation and identification. Neuroimage 26, 592–599.
Alain, C., Snyder, J.S., He, Y., Reinke, K.S., 2006. Changes in auditory cortex parallel rapid perceptual learning. Cereb. Cortex. [Epub ahead of print].
Assmann, P., Summerfield, Q., 1990. Modeling the perception of concurrent vowels: vowels with different fundamental frequencies. J. Acoust. Soc. Am. 88, 680–697.
Assmann, P., Summerfield, Q., 1994. The contribution of waveform interactions to the perception of concurrent vowels. J. Acoust. Soc. Am. 95, 471–484.
Atienza, M., Cantero, J.L., Dominguez-Marin, E., 2002. The time course of neural changes underlying auditory perceptual learning. Learn. Mem. 9, 138–150.
Bakin, J.S., South, D.A., Weinberger, N.M., 1996. Induction of receptive field plasticity in the auditory cortex of the guinea pig during instrumental avoidance conditioning. Behav. Neurosci. 110, 905–913.
Belin, P., Zatorre, R.J., Hoge, R., Evans, A.C., Pike, B., 1999. Event-related fMRI of the auditory cortex. Neuroimage 10, 417–429.
Bosnyak, D.J., Eaton, R.A., Roberts, L.E., 2004. Distributed auditory cortical representations are modified when non-musicians are trained at pitch discrimination with 40 Hz amplitude modulated tones. Cereb. Cortex 14, 1088–1099.
Brattico, E., Tervaniemi, M., Picton, T.W., 2003. Effects of brief discrimination-training on the auditory N1 wave. Neuroreport 14, 2489–2492.
Bregman, A.S., 1990. Auditory Scene Analysis: The Perceptual Organization of Sounds. The MIT Press, London, England.
Brunstrom, J.M., Roberts, B., 1998. Profiling the perceptual suppression of partials in periodic complex tones: further evidence for a harmonic template. J. Acoust. Soc. Am. 104, 3511–3519.
Cariani, P.A., Delgutte, B., 1996a. Neural correlates of the pitch of complex tones. II. Pitch shift, pitch ambiguity, phase invariance, pitch circularity, rate pitch, and the dominance region for pitch. J. Neurophysiol. 76, 1717–1734.
Cariani, P.A., Delgutte, B., 1996b. Neural correlates of the pitch of complex tones. I. Pitch and pitch salience. J. Neurophysiol. 76, 1698–1716.
Chalikia, M.H., Bregman, A.S., 1989. The perceptual segregation of simultaneous auditory signals: pulse train segregation and vowel segregation. Percept. Psychophys. 46, 487–496.
Chalikia, M.H., Bregman, A.S., 1993. The perceptual segregation of simultaneous vowels with harmonic, shifted, or random components. Percept. Psychophys. 53, 125–133.
Clarke, C.M., Garrett, M.F., 2004. Rapid adaptation to foreign-accented English. J. Acoust. Soc. Am. 116, 3647–3658.
de Cheveigne, A., 1999. Vowel-specific effects in concurrent vowel identification. J. Acoust. Soc. Am. 106, 327–340.
Duifhuis, H., Willems, L.F., Sluyter, R.J., 1982. Measurement of pitch in speech: an implementation of Goldstein's theory of pitch perception. J. Acoust. Soc. Am. 71, 1568–1580.
Dyson, B., Alain, C., 2004. Representation of concurrent acoustic objects in primary auditory cortex. J. Acoust. Soc. Am. 115, 280–288.
Dyson, B., Alain, C., He, Y., 2005. Effects of visual attentional load on low-level auditory scene analysis. Cognit. Affect. Behav. Neurosci. 5, 319–338.
Edeline, J.M., Pham, P., Weinberger, N.M., 1993. Rapid development of learning-induced receptive field plasticity in the auditory cortex. Behav. Neurosci. 107, 539–551.
Fritz, J., Shamma, S., Elhilali, M., Klein, D., 2003. Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex. Nat. Neurosci. 6, 1216–1223.
Gottselig, J.M., Brandeis, D., Hofer-Tinguely, G., Borbely, A.A., Achermann, P., 2004. Human central auditory plasticity associated with tone sequence learning. Learn. Mem. 11, 162–171.
Hall, D.A., Haggard, M.P., Akeroyd, M.A., Palmer, A.R., Summerfield, A.Q., Elliott, M.R., Gurney, E.M., Bowtell, R.W., 1999. ‘‘Sparse’’ temporal sampling in auditory fMRI. Hum. Brain Mapp. 7, 213–223.
Hartmann, W.M., McAdams, S., Smith, B.K., 1990. Hearing a mistuned harmonic in an otherwise periodic complex tone. J. Acoust. Soc. Am. 88, 1712–1724.
Hautus, M.J., Johnson, B.W., 2005. Object-related brain potentials associated with the perceptual segregation of a dichotically embedded pitch. J. Acoust. Soc. Am. 117, 275–280.
Hillyard, S.A., Hink, R.F., Schwent, V.L., Picton, T.W., 1973. Electrical signs of selective attention in the human brain. Science 182, 177–180.
Hulse, S.H., MacDougall-Shackleton, S.A., Wisniewski, A.B., 1997. Auditory scene analysis by songbirds: stream segregation of birdsong by European starlings (Sturnus vulgaris). J. Comp. Psychol. 111, 3–13.
Izumi, A., 2002. Auditory stream segregation in Japanese monkeys. Cognition 82, B113–B122.
Izumi, A., 2003. Effect of temporal separation on tone-sequence discrimination in monkeys. Hear. Res. 175, 75–81.
Johnson, B.W., Hautus, M., Clapp, W.C., 2003. Neural activity associated with binaural processes for the perceptual segregation of pitch. Clin. Neurophysiol. 114, 2245–2250.
Koffka, K., 1935. Principles of Gestalt Psychology. Harcourt, Brace & World, New York.
Lin, J.Y., Hartmann, W.M., 1998. The pitch of a mistuned harmonic: evidence for a template model. J. Acoust. Soc. Am. 103, 2608–2617.
MacDougall-Shackleton, S.A., Hulse, S.H., Gentner, T.Q., White, W., 1998. Auditory scene analysis by European starlings (Sturnus vulgaris): perceptual segregation of tone sequences. J. Acoust. Soc. Am. 103, 3581–3587.
McAdams, S., Bertoncini, J., 1997. Organization and discrimination of repeating sound sequences by newborn infants. J. Acoust. Soc. Am. 102, 2945–2953.
McDonald, K.L., Alain, C., 2005. Contribution of harmonicity and location to auditory object formation in free field: evidence from event-related brain potentials. J. Acoust. Soc. Am. 118, 1593–1604.
Menning, H., Roberts, L.E., Pantev, C., 2000. Plastic changes in the auditory cortex induced by intensive frequency discrimination training. Neuroreport 11, 817–822.
Moore, B.C., Glasberg, B.R., Peters, R.W., 1986. Thresholds for hearing mistuned partials as separate tones in harmonic complexes. J. Acoust. Soc. Am. 80, 479–483.
Ogawa, S., Lee, T.M., Kay, A.R., Tank, D.W., 1990. Brain magnetic resonance imaging with contrast dependent on blood oxygenation. Proc. Natl. Acad. Sci. USA 87, 9868–9872.
Ohl, F.W., Scheich, H., 1997. Orderly cortical representation of vowels based on formant interaction. Proc. Natl. Acad. Sci. USA 94, 9440–9444.
Palmer, A.R., 1990. The representation of the spectra and fundamental frequencies of steady-state single- and double-vowel sounds in the temporal discharge patterns of guinea pig cochlear-nerve fibers. J. Acoust. Soc. Am. 88, 1412–1426.
Pettigrew, C.M., Murdoch, B.E., Ponton, C.W., Kei, J., Chenery, H.J., Alku, P., 2004. Subtitled videos and mismatch negativity (MMN) investigations of spoken word processing. J. Am. Acad. Audiol. 15, 469–485.
Pichora-Fuller, M.K., Schneider, B.A., Daneman, M., 1995. How young and old adults listen to and remember speech in noise. J. Acoust. Soc. Am. 97, 593–608.
Picton, T.W., Alain, C., Woods, D.L., John, M.S., Scherg, M., Valdes-Sosa, P., Bosch-Bayard, J., Trujillo, N.J., 1999. Intracerebral sources of human auditory-evoked potentials. Audiol. Neurootol. 4, 64–79.
Recanzone, G.H., Jenkins, W.M., Hradek, G.T., Merzenich, M.M., 1992. Progressive improvement in discriminative abilities in adult owl monkeys performing a tactile frequency discrimination task. J. Neurophysiol. 67, 1015–1030.
Recanzone, G.H., Schreiner, C.E., Merzenich, M.M., 1993. Plasticity in the frequency representation of primary auditory cortex following discrimination training in adult owl monkeys. J. Neurosci. 13, 87–103.
Reinke, K.S., He, Y., Wang, C., Alain, C., 2003. Perceptual learning modulates sensory evoked response during vowel segregation. Brain Res. Cogn. Brain Res. 17, 781–791.
Ritter, W., Simson, R., Vaughan Jr., H.G., Friedman, D., 1979. A brain event related to the making of a sensory discrimination. Science 203, 1358–1361.
Ritter, W., Simson, R., Vaughan Jr., H.G., Macht, M., 1982. Manipulation of event-related potential manifestations of information processing stages. Science 218, 909–911.
Roberts, B., Brunstrom, J.M., 1998. Perceptual segregation and pitch shifts of mistuned components in harmonic complexes and in regular inharmonic complexes. J. Acoust. Soc. Am. 104, 2326–2338.
Rutkowski, R.G., Weinberger, N.M., 2005. Encoding of learned importance of sound by magnitude of representational area in primary auditory cortex. Proc. Natl. Acad. Sci. USA 102, 13664–13669.
Scheffers, M.T., 1983. Simulation of auditory analysis of pitch: an elaboration on the DWS pitch meter. J. Acoust. Soc. Am. 74, 1716–1725.
Scherg, M., 1989. Fundamentals of dipole source analysis. In: Grandori, F., Hoke, M., Romani, G.L. (Eds.), Auditory Evoked Magnetic Fields and Evoked Potentials. Karger, pp. 40–69.
Shahin, A., Bosnyak, D.J., Trainor, L.J., Roberts, L.E., 2003. Enhancement of neuroplastic P2 and N1c auditory evoked potentials in musicians. J. Neurosci. 23, 5545–5552.
Snyder, J.S., Alain, C., 2005. Age-related changes in neural activity associated with concurrent vowel segregation. Brain Res. Cogn. Brain Res. 24, 492–499.
Szpunar, K.K., Schellenberg, E.G., Pliner, P., 2004. Liking and memory for musical stimuli as a function of exposure. J. Exp. Psychol. Learn. Mem. Cogn. 30, 370–381.
Treisman, A.M., Gelade, G., 1980. A feature-integration theory of attention. Cognit. Psychol. 12, 97–136.
Tremblay, K., Kraus, N., Carrell, T.D., McGee, T., 1997. Central auditory system plasticity: generalization to novel stimuli following listening training. J. Acoust. Soc. Am. 102, 3762–3773.
Tremblay, K., Kraus, N., McGee, T., Ponton, C., Otis, B., 2001. Central auditory plasticity: changes in the N1–P2 complex after speech-sound training. Ear Hear. 22, 79–90.
Woods, D.L., 1995. The component structure of the N1 wave of the human auditory evoked potential. Electroencephalogr. Clin. Neurophysiol. 44 (Suppl.), 102–109.
Yonan, C.A., Sommers, M.S., 2000. The effects of talker familiarity on spoken word identification in younger and older listeners. Psychol. Aging 15, 88–99.