Journal of Phonetics 31 (2003) 495–501 www.elsevier.com/locate/phonetics
Physiological foundations of temporal integration in the perception of speech

Shihab Shamma
Electrical and Computer Engineering Department, Center for Auditory and Acoustic Research, Institute for Systems Research, University of Maryland, College Park, MD 20742, USA

Received 13 November 2002; received in revised form 1 September 2003; accepted 17 September 2003
Speech understanding involves the integration and identification of acoustic cues that are distributed over multiple time scales. These range from the sub-millisecond intervals associated with spectral estimates, to the few-millisecond periods of the fundamental frequency (f0), to the tens of milliseconds spanning phonemic and syllabic segments, and to the longer time scales involved in perceiving words and sentences. Much of what is known about the auditory representation of these cues comes from experimental studies in various animal species. Especially well studied are the early stages of the cochlea and cochlear nucleus, and the later cortical stages (Sachs & Young, 1979; Young & Sachs, 1979; Young, 1997, Chap. 4; Clarey, Barone, & Imig, 1992; Calhoun & Schreiner, 1995; Shamma, Versnel, & Kowalski, 1995; Kowalski, Depireux, & Shamma, 1996; deCharms, Blake, & Merzenich, 1998). By contrast, the physiological underpinnings of the linguistic processes remain highly elusive despite extensive investigations employing a host of new human fast-imaging technologies and computational models over the last decade (Poeppel, 2001; Horwitz, Friston, & Taylor, 2000). These techniques do not yet have the resolution to give clear insight into the responses and representations of single units and their neural circuits. Consequently, the review below concerns conceptions of auditory processes operating at the faster time scales found in the earlier auditory pathway, where animal experimentation is possible. Furthermore, these conceptions are based on extrapolations from experiments that employ stimuli simpler than speech (such as tones and noise with various amplitude and frequency modulations), and hence the models discussed are not specific to speech perception.

Temporal integration in the auditory system refers to the integration of spectro-temporal features over several processing stages, giving rise to varied forms of spectro-temporal selectivity that have been deemed valuable for speech processing. One example is the selectivity to the speed and direction of frequency-modulated (FM) tones that resemble formant transitions in speech (Nelken & Versnel, 2000).
Other examples are the sensitivity to different rates of amplitude-modulated (AM) tones (Langner, 1992), to sound onsets (Heil, 2001), to complex spectro-temporal modulation features, and to elaborate combinations of such features found in species-specific vocalizations (Lyon & Shamma, 1996). The existence and functional relevance of these temporal feature detectors have often been associated with extensive physiological response maps in numerous auditory structures (Clarey et al., 1992; Lyon & Shamma, 1996).

In its journey from the eardrum to the cortex, the speech signal undergoes a profound transformation from a simple one-dimensional temporal pressure waveform to an elaborate multidimensional representation that is closely associated with the intelligibility of the speech signal and the perception of musical timbre. Figs. 1 and 2 depict the two conceptual stages of this transformation, the temporal features that survive at each stage, and the percepts associated with each. Anatomically and physiologically, these transformations probably occur over four synapses from the cochlea to the cortex, passing through the cochlear nucleus (terminal end of the auditory nerve), the midbrain (nuclei of the lemniscus and the inferior colliculus), the thalamus (several divisions of the medial geniculate body), and finally the auditory cortex (primary and numerous surrounding secondary fields).
Fig. 1. Multiple time scales of a speech utterance. (A) The auditory spectrogram of the utterance "come home right away" spoken by a male with a pitch of approximately 100–130 Hz. The ordinate is labeled by the center frequencies (CF) of the analysis filterbank. Up to six harmonics of the fundamental frequency are very well resolved. Higher harmonics are only partially resolved, becoming closely spaced small peaks that can be discerned up to the 12th harmonic (for example, near the diphthong at 900 ms). Also visible are the formants and their transitions. The first formant typically "rides" on strong low-order harmonics (near 500 Hz in this example); the second formant sweeps between 700 and 2000 Hz. Higher formants are weaker and occur above 1000 Hz. The dashed line marks the auditory channel at 550 Hz whose temporal modulations are depicted in (B) to the right. (B) Temporal modulations in the auditory spectrogram at different time scales. (Top) At the coarsest scale, the slow modulations (a few Hz) roughly correlate with the different syllabic segments of the utterance. (Middle) At an intermediate time scale, modulations due to inter-harmonic interactions occur at a rate that reflects the f0 of the signal (100–130 Hz). These modulations are highlighted by the dashed curve. (Bottom) At the finest scale, the fast temporal modulations are due to the frequency component driving this channel best (around 550 Hz).
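To make the constant-Q analysis behind such an auditory spectrogram concrete, the following is a minimal, hedged sketch rather than the actual model used to generate Fig. 1: it assumes log-spaced Butterworth bandpass filters with a Q of about 10, half-wave rectification as a crude hair-cell stage, and a short smoothing window, applied to a synthetic harmonic complex (f0 = 100 Hz) shaped by two illustrative formant-like peaks instead of the recorded utterance.

```python
# Illustrative constant-Q auditory-spectrogram sketch (not the model behind Fig. 1):
# log-spaced bandpass analysis -> half-wave rectification -> short-window smoothing.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000                                   # sampling rate (Hz)
t = np.arange(0, 0.5, 1 / fs)                # 500 ms of signal

# Synthetic "vowel": harmonics of f0 = 100 Hz shaped by two formant-like peaks
# (center frequency, bandwidth in Hz); the values are illustrative only.
f0 = 100.0
formants = [(500.0, 80.0), (1500.0, 120.0)]
sig = np.zeros_like(t)
for k in range(1, 41):
    fk = k * f0
    gain = sum(np.exp(-0.5 * ((fk - fc) / bw) ** 2) for fc, bw in formants)
    sig += gain * np.cos(2 * np.pi * fk * t)

# Constant-Q filterbank: 128 CFs spaced uniformly on a log-frequency axis over
# 5 octaves (125-4000 Hz), each with a bandwidth of roughly 10% of its CF.
cfs = 2.0 ** np.linspace(np.log2(125.0), np.log2(4000.0), 128)
Q = 10.0
channels = []
for cf in cfs:
    sos = butter(2, [cf * (1 - 0.5 / Q), cf * (1 + 0.5 / Q)],
                 btype="bandpass", fs=fs, output="sos")
    y = sosfiltfilt(sos, sig)                # cochlear filter output
    y = np.maximum(y, 0.0)                   # half-wave rectification ("hair cell")
    win = int(0.002 * fs)                    # ~2 ms smoothing keeps f0-rate beats
    channels.append(np.convolve(y, np.ones(win) / win, mode="same"))

spectrogram = np.array(channels)             # (128 channels x time), cf. Fig. 1(A)
```

Plotting rows of this array against a log-frequency axis would show resolved low harmonics and merged high harmonics, qualitatively as in panel (A); the row nearest 550 Hz plays the role of the channel plotted in panel (B).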
Fig. 2. Cortical multiscale analysis and representation of speech. (A) The auditory spectrogram of Fig. 1 is analyzed in the auditory cortex by (B) a bank of spectro-temporal modulation-selective filters. The STRF of one such filter is shown in the small panel below. It is tuned to spectra with peaks 2 octaves apart (i.e., it has a scale of Ω = 0.5 cycles/octave) that sweep downwards at a temporal rate of ω = 4 Hz; in other words, it is most responsive when input peaks sweep downwards at 8 octaves/s. The output from each STRF is computed by convolving it with the entire input spectrogram, producing a new spectrogram as shown to the right. For display purposes, we often collapse (integrate over) the frequency axis, reducing the output from each STRF to a one-dimensional time function representing the total output energy from this filter as a function of time. (C) Four panels displaying cross-sections of outputs from all STRFs at the time instants marked by the vertical dashed lines in the spectrogram above. In each panel, filter outputs are organized according to the selectivity of the filters, i.e., are indexed by two parameters: spectral scale and temporal rate (both for upward and downward selectivity).
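As a rough illustration of the STRF filtering described in this caption, the sketch below idealizes the receptive field as a spectro-temporal Gabor function (the measured STRF in the figure is of course more complex) with a scale of 0.5 cyc/oct and a downward-sweeping rate of 4 Hz, and applies it to the illustrative spectrogram from the previous sketch; the strf() helper and its parameters are assumptions introduced here, not part of the original model.

```python
# Illustrative downward-sweeping STRF (a spectro-temporal Gabor idealization) applied
# to a log-frequency spectrogram, followed by collapsing the output over frequency.
import numpy as np
from scipy.signal import convolve2d

def strf(scale_cyc_per_oct, rate_hz, direction, n_oct=2.0, dur_s=0.25,
         d_oct=5 / 128, d_t=0.004):
    """Gabor-like STRF sampled on (octave, time) axes.
    direction = +1: peaks sweep upward in frequency; -1: downward."""
    x = np.arange(-n_oct / 2, n_oct / 2, d_oct)          # spectral axis (octaves)
    t = np.arange(0.0, dur_s, d_t)                       # temporal axis (seconds)
    X, T = np.meshgrid(x, t, indexing="ij")
    envelope = (np.exp(-(X / (n_oct / 4)) ** 2) *
                np.exp(-((T - dur_s / 2) / (dur_s / 4)) ** 2))
    phase = 2 * np.pi * (scale_cyc_per_oct * X - direction * rate_hz * T)
    return envelope * np.cos(phase)

# The spectrogram from the previous sketch has 128 channels over 5 octaves
# (5/128 octave per channel); downsample time to 4 ms frames before filtering.
frame = int(0.004 * 16000)
S = spectrogram[:, ::frame]
h = strf(scale_cyc_per_oct=0.5, rate_hz=4.0, direction=-1)
cortical = convolve2d(S, h, mode="same")        # filtered spectrogram, cf. Fig. 2(B)
energy = np.sqrt((cortical ** 2).mean(axis=0))  # collapse over frequency -> time function
```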
In the awake behaving animal, massive top-down feedback influences the forward transformations (Winslow & Sachs, 1987), and hence it is a drastic simplification to imagine the emergent representations at each stage as the result of signal processing in isolated successive stages. It is therefore prudent to keep in mind that what we do understand thus far is largely based on studies of anesthetized and of awake but non-behaving animals.

In the first stage, the acoustic signal is transformed into an auditory spectrogram—a representation resembling the well-known spectrogram of speech, and postulated to arise very early in the auditory pathway, perhaps as early as the cochlear nucleus (Lyon & Shamma, 1996; Blackburn & Sachs, 1990). It is the end result of three steps (Fig. 1): frequency analysis in the cochlea, detection and temporal smoothing in the hair cells, and a final edge enhancement and temporal integration in the cochlear nucleus (Lyon & Shamma, 1996). A simplified view of the auditory spectrogram is to think of it as the output of a bank of bandpass filters with center frequencies (CF) that are uniformly spaced along a logarithmic frequency axis (these are depicted along the frequency axis of Fig. 1(A)). Unlike in a usual spectrogram, cochlear filters are not of constant bandwidth, but rather of roughly constant relative bandwidth (constant Q), with bandwidths of about 10% of the CF. Consequently, filters with higher CFs have broader bandwidths. The auditory spectrogram differs from the usual spectrogram in other details, the most important being that its filter outputs do not simply encode the instantaneous power of each frequency band, but rather preserve or explicitly encode the rapidly modulated waveforms of the frequency components falling within the band. This is strictly true in the lower CF bands (<2000 Hz). At higher CFs (>4000 Hz), the hair cell stage following each filter smooths out the output waveform, replacing it with the instantaneous power output, as in the usual spectrogram. Aside from these fast modulations, cochlear outputs at all CFs may become modulated due to formant transitions or other dynamic features of speech, or they may beat at f0 because of interactions among signal components that pass within their bandwidths. These latter modulations are identical to the f0 modulations (vertical striations) typically seen in broadband spectrograms.

These three types of temporal modulations are depicted in more detail in Fig. 1(B), where the output waveform from an auditory channel at CF = 550 Hz is shown in response to the speech signal "come home right away". The response encodes the three temporal scales simultaneously. At the top are the slowest, gross modulations (approximately 4–5 bursts per second) that reflect the rise and fall of energy in this frequency band during the speech utterance. These slow modulations reflect the succession of syllables, and hence are shaped by the dynamics of the vocal tract that are directly responsible for the intelligibility of the speech signal: movements of the formants and the acoustic consequences of onsets and offsets of consonantal articulations. The middle plot illustrates the intermediate-rate modulations (dashed waveform of about 112 Hz) that are due to interactions among partially resolved harmonics that fall within the bandwidth of the cochlear filter in this band.
The depth and shape of these modulations are sensitive to the relative phase of the interacting components and therefore reflect the timbre of the sound, e.g., whether it is voiced, harsh, or whispered. Furthermore, these modulations decrease as the interacting harmonics within a filter bandwidth become more resolved, and hence they largely disappear in responses dominated by low-order (well-resolved) harmonics, as in the CF < 300 Hz region in Fig. 1(A). Finally, the bottom plot depicts the responses at the finest (fastest) temporal scale, which reflect the acoustic frequency
components (around 550 Hz) that carry the energy of the stimulus in this band. As mentioned earlier, these fast modulations disappear in high CF channels (>4 kHz), leaving only the f0 modulations as "carriers" of the all-important slow modulations of speech in these bands. In general, this overall picture of the modulations remains the same up the auditory pathway, with one major exception: the fastest rates that can be followed decrease progressively, to under a few hundred Hertz at the collicular and thalamic stages (Clarey et al., 1992; Langner, 1992).

The second conceptual transformation mimics aspects of the responses of cortical auditory stages, especially the primary auditory cortex (Fig. 2). Functionally, this stage is mostly concerned with the analysis and representation of the slow modulations critical for the intelligibility of speech. The analysis is performed by a bank of "modulation-selective filters" (Fig. 2(B)) that detect various spectro-temporal features created by the up and down sweeps of the formant transitions, their convergence or divergence, and the onset and offset of narrow or broad spectral peaks. The shape of the spectro-temporal receptive field (STRF) of one such cortical cell is shown in Fig. 2(B). This STRF is best activated by input spectrogram features that have formant peaks two octaves apart and that sweep downwards past the STRF at a rate of about 8 octaves/s. This rate is in fact comparable to the formant transitions seen at the end of the utterance in Fig. 2(A). The auditory cortex contains a wide variety of STRFs with different spectral bandwidths (also called scales), asymmetries, dynamics (rates), and directional preferences (peaks sweeping up or down in frequency) (Depireux, Simon, Klein, & Shamma, 2001). Consequently, each spectro-temporal feature in the spectrogram will activate a unique pattern of filters. A map of responses across the entire filterbank therefore provides a unique characterization of the spectrogram, one that is sensitive to the short-time spectral shape and temporal dynamics of the stimulus.

Such response patterns are illustrated in Fig. 2(C) at different instants in the spectrogram. These scale-rate plots illustrate the distribution of activity from STRFs tuned to a wide range of scales and rates, from broadly tuned (0.5 cyc/oct) to narrowly tuned (8 cyc/oct), and from slow dynamics (2 Hz, or integration time constants of about 500 ms) to fast (32 Hz, or time constants of about 30 ms). These plots can uniquely summarize the salient features of the underlying spectrogram. For instance, the downward-sweeping harmonic peaks between 300 and 500 ms generate a strongly asymmetric pattern in the second panel of Fig. 2(C). Along the scale axis, the activity is concentrated at a scale of Ω = 2 cyc/oct, signifying harmonics that are separated by an average of around 0.5 octave. In the third panel, the distribution displays the opposite asymmetry since the second formant sweeps upwards between 500 and 700 ms. In the rightmost panel, the divergent formants just prior to 1000 ms evoke a roughly balanced response pattern with two notable differences between the two halves of the scale-rate plot: the downward-sweeping first formant is faster (maximum response occurs at higher rates), and it is composed of clearly resolved low harmonics, producing a concentrated pattern of activity at approximately 2 cyc/oct, as in the second panel.
Finally, the first panel illustrates that rapid onsets and transient events, such as the plosive /k/ in come at the beginning of the sentence, generate outputs at the highest temporal rates. The wide spread along the scale axis signifies a broad but well-resolved harmonic spectrum, while the strong asymmetry in favor of the "down" half of the panel is largely due to the accumulating phase lags of the cochlear filters (Lyon & Shamma, 1996).
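A scale-rate summary of the kind shown in Fig. 2(C) can be sketched, under the same simplifying assumptions as before, by sweeping the illustrative strf() helper from the previous sketch over a grid of scales and rates in both directions and tabulating the response energy around one time instant; the grid values below follow the ranges quoted in the text, while the windowing choices are arbitrary.

```python
# Illustrative scale-rate analysis in the spirit of Fig. 2(C): filter the spectrogram
# with STRFs spanning a grid of scales and rates (up and down sweeps) and tabulate the
# response energy around a single time instant. Uses strf() and S from the sketch above.
import numpy as np
from scipy.signal import convolve2d

scales = [0.5, 1.0, 2.0, 4.0, 8.0]            # cyc/oct
rates = [2.0, 4.0, 8.0, 16.0, 32.0]           # Hz
t_idx = S.shape[1] // 2                        # frame index of the instant to summarize
half_win = 12                                  # ~50 ms of 4-ms frames on either side

scale_rate = {}                                # (direction, scale, rate) -> energy
for direction in (-1, +1):                     # -1: downward sweeps, +1: upward sweeps
    for sc in scales:
        for rt in rates:
            h = strf(sc, rt, direction)
            out = convolve2d(S, h, mode="same")
            seg = out[:, max(0, t_idx - half_win): t_idx + half_win]
            scale_rate[(direction, sc, rt)] = float(np.sqrt((seg ** 2).mean()))

# Reshaping 'scale_rate' into two 5 x 5 panels (upward vs. downward sweeps) gives a
# crude analogue of one of the four cross-sections displayed in Fig. 2(C).
```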
It has long been known that the perception of speech is critically dependent on a faithful representation of the spectral and slow temporal modulations in the auditory spectrogram (Drullman, Festen, & Plomp, 1994; Shannon, Zeng, Wygonski, Kamath, & Ekelid, 1995; Greenberg, Arai, & Silipo, 1998). Similarly, the nuances of musical timbre (e.g., a violin versus a piano playing the same note) depend critically on the instruments' dynamics (attack and decay time constants), on the subtle fluctuations in their pitch (vibrato) or amplitude (shimmer), and on the spectral signature of their resonances. Since all these spectral and temporal factors are directly reflected in the multiscale cortical output, one can employ this representation directly in a variety of tasks, such as: (1) assessing the intelligibility of a speech utterance by measuring the integrity of its representation relative to that of a clean sample (Elhilali, Chi, & Shamma, 2003; Chi, Gao, Guyton, Ru, & Shamma, 1999), as sketched at the end of this section; (2) using the representation as a unique descriptor of the timbre of a musical instrument (Ru & Shamma, 1997); or (3) accounting for the perceptual effects of phase changes in complex sounds (Carlyon & Shamma, 2003).

In summary, we have seen that the initial stages of the auditory system (cochlea to primary auditory cortex) integrate features of the incoming sound over multiple time scales, ranging from milliseconds (kHz rates) to tenths of a second (a few Hz). However, the slowest of these time scales is only commensurate with phonemic and syllabic rates, and hence it is still too fast to account for a later, major stage of "linguistic" integration, which operates over seconds and gives rise to the comprehension of words and sentences. This stage recruits syntactic, lexical, and semantic rules and mechanisms that go beyond any auditory phenomena, and it must engage short- and long-term memory traces, expectations, and strong top-down influences. Grossberg's model (Grossberg, 2003), described in this volume, provides a rare quantitative instantiation of such processes, illustrating both their enormous complexity and their comparatively simple consequences when tested with the well-controlled settings and stimuli of "auditory scene analysis" (Bregman, 1991). However, uncovering the neurobiological substrate and mechanisms responsible for these phenomena remains a formidable and very exciting challenge.
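Returning to point (1) above, the following is a schematic, correlation-based stand-in for such an intelligibility measure, not the actual STMI of Elhilali, Chi, and Shamma (2003): it simply compares the multiscale representations of a clean and a degraded auditory spectrogram, reusing the illustrative strf() helper defined earlier.

```python
# Schematic intelligibility comparison (a correlation-based stand-in, not the STMI of
# Elhilali, Chi, & Shamma, 2003): score a degraded utterance by how well its multiscale
# cortical representation matches that of the clean original.
import numpy as np
from scipy.signal import convolve2d

def multiscale_representation(spec, scales=(0.5, 1, 2, 4, 8), rates=(2, 4, 8, 16, 32)):
    """Stack |STRF * spectrogram| over a grid of scales, rates, and both sweep
    directions, using the illustrative strf() helper defined in the earlier sketch."""
    outputs = []
    for direction in (-1, +1):
        for sc in scales:
            for rt in rates:
                h = strf(sc, rt, direction)
                outputs.append(np.abs(convolve2d(spec, h, mode="same")))
    return np.stack(outputs)                   # (filters x channels x frames)

def intelligibility_index(clean_spec, degraded_spec):
    """Normalized similarity: near 1 for a faithful representation, lower as the
    spectro-temporal modulations of the degraded signal are corrupted."""
    r_clean = multiscale_representation(clean_spec)
    r_degraded = multiscale_representation(degraded_spec)
    num = float(np.sum(r_clean * r_degraded))
    den = float(np.sqrt(np.sum(r_clean ** 2) * np.sum(r_degraded ** 2))) + 1e-12
    return num / den

# Usage: score = intelligibility_index(S_clean, S_degraded), where both arguments are
# auditory-like spectrograms (channels x time frames) computed as in the first sketch.
```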
Acknowledgements

This work has been supported in part by a grant from the Office of Naval Research under the ODDR&E MURI97 Program to the Center for Auditory and Acoustic Research.
References

Blackburn, C., & Sachs, M. (1990). The representation of the steady-state vowel /e/ in the discharge patterns of cat anteroventral cochlear nucleus neurons. Journal of Neurophysiology, 63, 1191–1212.
Bregman, A. S. (1991). Auditory scene analysis: The perceptual organization of sound. Cambridge, MA: MIT Press.
Calhoun, B., & Schreiner, C. (1995). Spectral envelope coding in cat primary auditory cortex. Journal of Auditory Neuroscience, 1, 39–61.
Carlyon, R., & Shamma, S. (2003). An account of monaural phase sensitivity. The Journal of the Acoustical Society of America, 114, 333–348.
Chi, T., Gao, Y., Guyton, M. C., Ru, P., & Shamma, S. A. (1999). Spectro-temporal modulation transfer functions and speech intelligibility. The Journal of the Acoustical Society of America, 106(5), 2719–2732.
Clarey, J., Barone, P., & Imig, T. (1992). Physiology of thalamus and cortex. In D. Webster (Ed.), The mammalian auditory pathway: Neurophysiology (pp. 232–334). Berlin: Springer.
deCharms, R. C., Blake, D. T., & Merzenich, M. M. (1998). Optimizing sound features for cortical neurons. Science, 280(5368), 1439–1443.
Depireux, D., Simon, J., Klein, D., & Shamma, S. (2001). Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex. Journal of Neurophysiology, 85(3), 1220–1234.
Drullman, R., Festen, J., & Plomp, R. (1994). Effect of envelope smearing on speech perception. The Journal of the Acoustical Society of America, 95(2), 1053–1064.
Elhilali, M., Chi, T., & Shamma, S. (2003). A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Communication, 41, 331–348.
Greenberg, S., Arai, T., & Silipo, R. (1998). Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the international conference on spoken language processing, Sydney.
Grossberg, S. (2003). Resonant neural dynamics of speech perception. Journal of Phonetics, 31, doi:10.1016/S0095-4470(03)00051-2.
Heil, P. (2001). Representation of sound onsets in the auditory system. Audiology & Neuro-Otology, 6, 167–172.
Horwitz, B., Friston, K., & Taylor, J. (2000). Neural modelling and functional brain imaging: An overview. Neural Networks, 13, 829–846.
Kowalski, N., Depireux, D., & Shamma, S. (1996). Analysis of dynamic spectra in ferret primary auditory cortex: I. Characteristics of single unit responses to moving ripple spectra. Journal of Neurophysiology, 76(5), 3503–3523.
Langner, G. (1992). Periodicity coding in the auditory system. Hearing Research, 60, 115–142.
Lyon, R., & Shamma, S. (1996). Auditory representation of timbre and pitch. In H. Hawkins, M. McMullen, A. Popper, & R. Fay (Eds.), Auditory computation (pp. 221–270). Berlin: Springer.
Nelken, I., & Versnel, H. (2000). Responses to linear and logarithmic frequency-modulated sweeps in ferret primary auditory cortex. European Journal of Neuroscience, 12, 549–562.
Poeppel, D. (2001). New approaches to the neural basis of speech sound processing: Introduction to special section on brain and speech. Cognitive Science, 25, 659–661.
Ru, P., & Shamma, S. (1997). Representation of musical timbre in the auditory cortex. Journal of New Music Research, 26(2), 154–169.
Sachs, M. B., & Young, E. D. (1979). Encoding of steady state vowels in the auditory-nerve: Representation in terms of discharge rate. The Journal of the Acoustical Society of America, 66, 470–479.
Shamma, S., Versnel, H., & Kowalski, N. (1995). Ripple analysis in ferret primary auditory cortex: I. Response characteristics of single units to sinusoidally rippled spectra. Auditory Neuroscience, 1, 233–254.
Shannon, R., Zeng, F.-G., Wygonski, J., Kamath, V., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304.
Winslow, R., & Sachs, M. (1987). Effect of electrical stimulation of the olivocochlear bundle on auditory nerve responses to tones in noise. Journal of Neurophysiology, 57(4), 1002–1021.
Young, E. (1997). The cochlear nucleus. In G. M. Shepherd (Ed.), Synaptic organization of the brain (pp. 131–157). London: Oxford University Press.
Young, E., & Sachs, M. (1979). Representation of steady-state vowels in the temporal aspects of the discharge patterns of populations of auditory-nerve fibers. The Journal of the Acoustical Society of America, 66, 1381–1403.