Journal of Phonetics (1993) 21, 491-496
Reviews

Auditory Scene Analysis
By Albert Bregman
Cambridge, MA: The MIT Press, 1990, ISBN 0-262-02297-4, xiii + 773 pp.

Keith Johnson
Department of Linguistics, Ohio State University, Columbus, OH, U.S.A.
The central problem which Bregman addresses in Auditory Scene Analysis is that real world auditory experience is rarely comparable to the benign environment of the sound booth. Sounds (and acoustic reflections of sounds) from a variety of sources are blended in the typical acoustic environment. Bregman illustrates the problem with this analogy (pp. 5-6). Imagine two narrow channels on the bank of a lake busy with motorboats. Each channel is a few feet long and a few inches wide, spaced apart by a few feet. Auditory perception is like being able to count the boats, locate them, and identify them by simply watching the motion of the water in these narrow channels. The hypothesis which Bregman presents is that auditory sensations are grouped into auditory objects (explicitly analogous to visual objects), which he calls auditory streams. Table I shows Bregman's taxonomy of the perceptual processes which produce auditory streams and some of the acoustic parameters upon which they rely.

The book is a summary of 20 years of research on auditory perceptual organization, and as such is much too long (over 700 pages) for a detailed review here. It is organized in eight chapters with a glossary of basic terms (rather than a chapter on basic acoustics), index and bibliography. The first and last chapters provide an overview of the theory, while the middle six chapters detail the experimental data upon which the theory is based. In the Preface, Bregman explains his decision to write two books in one: the first and last chapters are intended for a general audience, the middle chapters for specialists. This choice is perhaps unfortunate. The chapters for non-specialists are a joy to read, full of common sense and clear language. However, they summarize so much data and skim over so many details (especially the last chapter) that one is left with more questions than answers. The middle chapters, while providing many missing details, are repetitious and lack organization.
They include many detours that deal with interesting, but baffling side-topics. Sometimes, too, the Introduction and Conclusion are not representative of the more detailed chapters. For instance, one might assume from the Introduction that the chapter on speech perception will focus mainly on the problem of segmentation in speech, but it actually focuses on the segregation of simultaneous voices.

Some examples from each major category in Bregman's taxonomy of stream formation processes illustrate the theory and the type of data summarized in the book.

0095-4470/93/040491 + 06 $08.00/0 © 1993 Academic Press Limited
TABLE I. Bregman's taxonomy of processes of auditory organization

1. Primitive analysis: preattentive, innate processes.
   A. Sequential integration
      Acoustic parameters:
      • rate: separation of onsets
      • frequency separation
      • continuous sounds: involve a "unit forming process" which is sensitive to discontinuities, especially sudden rises in intensity
      • spatial location
      • F0
      • timbre
      • loudness: a helping factor but not very effective alone
      • spectral continuity (like formant transitions) holds the stream together
      • rhythm: very little research on this
   B. Spectral integration
      Acoustic parameters:
      • continuity of components across time
      • harmonic relationship
      • common fate
        --> correlated changes of harmonics
        --> amplitude modulation across frequencies
      • spatial location
2. Schema-based analysis: learned patterns of organization
   • knowledge-driven
   • schema-governed attention: partial activation of a schema makes it possible to predict upcoming signal properties
   • phenomena which illustrate the operation of schema-based processes:
     --> phonemic restoration
     --> schema-driven expectations for rhythm
Sequential integration occurs when sequential elements group together. Imagine a sequence of two pure tones, A and B, where A has a lower frequency than B [Fig. 1(a)]. Your task as a listener is to say whether A precedes or follows B. When these two are presented in isolation this task is easy, but when flanker tones (F) are added to the presentation the task becomes much harder. However, when we add a sequence of captor tones (C), which have the same frequency and occur at the same rate as the flankers, it is again easy to decide whether or not A preceded B. According to Bregman, the complexity of the stimulus array increases when the flankers are added, increasing the difficulty of the task. When the captors are added, however, they form an auditory stream which includes (or captures) the flankers, thus leaving A and B as the only elements in a separate stream. With A and B as elements in a separate object, temporal order is again easy to judge.

Once the basic phenomenon has been established, one can go on to manipulate acoustic parameters such as the frequency of the captors, the rate at which they occur, their spatial location, and so on, uncovering acoustic parameters which have an impact on the integration of the flankers into the sequence of captor sounds. Bregman's interpretation of this phenomenon is based on the intuitions of his listeners. They typically report that there are two things making noise when the captors are present, one which makes a repetitive noise at a fixed frequency, and one which makes a brief two-note noise. When the flankers are present but not the captors, listeners report just one source, and so hear a complicated four-note noise.
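The three presentation conditions in this tone-sequence example (A and B alone, with flankers, and with flankers plus captors) can be made concrete with a short sketch. The Python below is an illustration only and is not the procedure of the original experiments: the frequencies, tone duration, regular timing grid, number of captors, sample rate, and the names `Tone`, `build_sequence`, `labels` and `synthesize` are all assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical values for illustration; NOT the published stimulus parameters.
# The only constraints taken from the text: A is lower than B, and the
# captors (C) share the frequency and rate of the flankers (F).
FREQ_HZ = {"A": 1000.0, "B": 1200.0, "F": 600.0, "C": 600.0}
SOA_S = 0.10        # onset-to-onset interval on an assumed regular grid
DUR_S = 0.06        # assumed tone duration
SAMPLE_RATE = 8000  # assumed samples per second


@dataclass(frozen=True)
class Tone:
    label: str      # "A", "B", "F" (flanker) or "C" (captor)
    freq_hz: float
    onset_s: float


def build_sequence(flankers: bool, captors: bool, a_first: bool = True,
                   n_lead: int = 3, n_trail: int = 2) -> List[Tone]:
    """Lay out one trial as a list of tones on a regular time grid."""
    order: List[str] = []
    if captors:
        order += ["C"] * n_lead
    if flankers:
        order.append("F")
    order += ["A", "B"] if a_first else ["B", "A"]
    if flankers:
        order.append("F")
    if captors:
        order += ["C"] * n_trail
    return [Tone(lab, FREQ_HZ[lab], i * SOA_S) for i, lab in enumerate(order)]


def labels(seq: List[Tone]) -> str:
    """Return the tone labels in temporal order, e.g. 'FABF'."""
    return "".join(t.label for t in seq)


def synthesize(seq: List[Tone]) -> List[float]:
    """Render the sequence as sine-tone samples separated by silence."""
    samples = [0.0] * int(round(len(seq) * SOA_S * SAMPLE_RATE))
    n_tone = int(round(DUR_S * SAMPLE_RATE))
    for t in seq:
        start = int(round(t.onset_s * SAMPLE_RATE))
        for n in range(n_tone):
            samples[start + n] = math.sin(2 * math.pi * t.freq_hz * n / SAMPLE_RATE)
    return samples


if __name__ == "__main__":
    for flank, capt in [(False, False), (True, False), (True, True)]:
        print(labels(build_sequence(flank, capt)))
```

Varying `FREQ_HZ["C"]` or the spacing of the captors in such a sketch corresponds to the parameter manipulations mentioned above; the sketch only constructs stimuli and makes no claim about how listeners would group them.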
[Figure 1 not reproduced. Panel (a) shows tones A, B, F and C; panel (b) shows tones A, B and C; horizontal axis: Time.]

Figure 1. (a) A tone sequence illustrating primitive sequential grouping. (After Bregman & Rudnicky, 1975.) (b) A tone array illustrating primitive simultaneous grouping. (After Bregman & Pinker, 1978.)
Spectral integration occurs when simultaneous elements group together. Imagine three tones A, B and C as shown in Fig. 1(b). Listeners perceive a complex tone when B and C are presented by themselves, but this spectral integration can be altered in several ways. If A is close to the frequency of B, listeners perceive a repeating tone of one frequency (A followed by B) with a lower tone (C) occurring once for each occurrence of the repeating pair. If C does not start and stop (nearly) simultaneously with B, the percept of a complex tone disintegrates, and A, B and C each stand out as separate auditory objects. These methods for manipulating the spectral integration of B and C illustrate some of the factors listed in Table I.

Schema-based analysis involves grouping upon the basis of a learned pattern. When melodies are interleaved with sequences of distractor tones, listeners are more likely to correctly recognize familiar than unfamiliar melodies (Dowling, 1973). This effect is not instantaneous, but requires several repeated presentations. It is sensitive to factors such as the degree of familiarity of the melody and attention. From these findings, Bregman concludes that familiarity (the prior existence of a schema) can produce segregation in the absence of primitive segregation. He suggests that this mechanism is qualitatively different from primitive stream segregation in that it requires prior learning and attention, and produces a different output. In the examples of primitive segregation discussed earlier, listeners could describe all of the auditory objects produced by the primitive analysis. However, in Dowling's (1973) example of schema-based processing, listeners could identify the melody but showed no evidence of having acquired knowledge about the sequence of distractor tones. Bregman suggests that primitive processes partition the acoustic
signal into separate objects, which may then become the focus of the listener's attention. Schema-based processes, on the other hand, select objects from the signal, leaving an incoherent residual.

Auditory scene analysis is relevant to the study of speech perception in several interesting ways that are not explored in detail by Bregman. Primitive scene analysis processes could help listeners segment the speech signal, in response to sudden changes in intensity at the boundaries of phonetic segments. That is, primitive processes may underlie listeners' abilities to detect the number and order of speech sounds. However, in Chapter 6 ("Auditory organization in speech perception"), Bregman assumes that sounds produced by a single talker end up in the same auditory stream. If so, the cues used in segmentation may not be absolute. For example, formant transitions provide continuity which may counteract cues for segmentation. In this connection, note that demonstrations of a streaming effect with speech sounds make use of rapidly repeating presentations or increased speech rate. Thus, segmentation may be based more on schema-based processing than on primitive processes, but the segmentation problem has generally not been addressed from the standpoint of auditory scene analysis.

Auditory scene analysis also suggests some ways that speech prosody may affect the perception of utterances. F0 and rhythm may serve to hold utterances together, with F0 reset at the edges of intonational phrases and rhythm suspension (also at edges) tending to produce prosodic grouping in the auditory scene. Again, Bregman's theory suggests some specific hypotheses which have not been adequately explored.

The distinction between primitive and schema-based processes is central to Bregman's taxonomy (Table I), the main claim being that primitive processes are innate, while schema-based processes are learned.
Bregman presents two arguments for innateness, the first based on Shepard's (1981) theory of the evolution of psychological mechanisms. Bregman argues that "certain constant properties of the environment that would have to be dealt with by every human everywhere" (p. 38) have driven the evolutionary development of innate processes of auditory organization. Shepard's principle of psychophysical complementarity predicts that humans, having "evolved in a world of mixture" (p. 33), have innately specified rules of auditory grouping that are "complementary with the redundancies that link acoustic components that have arisen from the same source" (p. 39). Bregman's paraphrase is "the mental processes of animals have evolved to be complementary with the structure of the surrounding world" (p. 39). Note that this principle does not contradict the notion of a genetic endowment for language. If there is a linguistic genetic endowment, the principle of psychophysical complementarity predicts that this endowment for language is derived from and complements some prior structure. Bregman's argument is appealing because it starts from a very general principle of how psychological mechanisms might become genetically specified. It is not compelling, however, because Bregman presents no evidence that there are indeed "certain constant properties" of auditory objects in the environment.

The second argument for the innateness of primitive processes is based on the bootstrapping problem. Without innate auditory object formation processes (or innate schemas for particular auditory objects), how can we learn to recognize unfamiliar patterns in a multi-source auditory environment? "If we are to learn about patterns in the first place ... we need some primitive processes that are capable of extracting them from their acoustic contexts" (p. 403). The infant
imitating the mother does not "insert into the imitation the squeaks of a cradle that have been occurring at the same time" (p. 5). This argument is appealing, but one wonders if coincidental sounds are really a problem for a schema-forming process given the nature of coincidences. The cradle squeaks one time, the dog barks the next. What's to keep a schema-forming process from converging on a practical solution to the bootstrapping problem by finding constancies across situations?

Bregman provides very little direct evidence to show that primitive stream segregation processes are innate. He cites (pp. 41ff, 405) a study by Demany (1982) which found evidence of auditory stream segregation in 11-month-old infants. Using the method of habituation, Demany designed stimuli (sequences of high and low tones) which, when changed from one sequence to the other, would be expected to produce dishabituation if the tones remained unsegregated, but not if the infants segregated the high and low tones into separate streams. Demany's results suggested that the infants segregated the streams in an adult-like manner, but hers is apparently the only published study of stream segregation by infants.

Although the arguments for the innateness of primitive processes are not compelling, Bregman's distinction between primitive and schema-based processes is well founded, at least for non-speech sounds. However, from Bregman's point of view, strange things happen in speech. In duplex perception, objects which are partitioned by primitive processes (the F2 transition and the base) are put back together by the speech perception process. This integration is surprisingly insensitive to acoustic characteristics which ordinarily affect stream segregation (Nygaard & Eimas, 1990). Duplex perception is a problem for Bregman's theory because schema-based analysis is only supposed to select among the objects provided by primitive analysis.
Chapter 7 ("The Principle of Exclusive Allocation in Scene Analysis") is devoted to this problem. Despite the fact that duplex perception can be produced with musical stimuli (Pastore, Schmuckler, Rosenblum & Szczesiul, 1983; Collins, 1985) or environmental noises (Fowler & Rosenblum, 1990), the examples from speech perception lead Bregman to conclude that "speech schemas ... seem to have an impressive ability to put information together, even under circumstances in which primitive segregation is pulling it apart" (p. 618).

Although Chapters 1-6 describe primitive processes which produce auditory objects, this view is revised in Chapter 7. The duplex perception phenomenon violates the principle of exclusive allocation, a principle related to the Gestalt psychologists' principle of belongingness, which says that "if a piece of sensory evidence is allocated to one perceptual organization, it cannot, at the same moment, be allocated to another" (p. 595). In response to duplex perception and similar violations of the principle of exclusive allocation, Bregman proposes a "two-component theory of links and schemas". In this revision, primitive processes do not result in all-or-none groupings but in variable probabilities for grouping. Bregman suggests that such a system has evolved because "no cue is utterly reliable in identifying parts of evidence that must be combined" (p. 630). A second process "decides, in an all or nothing manner, whether to include a particular perceptual property ... in the current description" (p. 630). Thus, primitive processes set up grouping tendencies, but it is the "description-building" process that makes all of the decisions and produces the percepts (either by fitting evidence into existing schemas or tallying the "votes" of the primitive processes). This revision offers an
account of duplex perception, but it also weakens Bregman's innateness claims. The bootstrapping problem is now back because primitive processes, under this revision, do not produce auditory streams.

Does duplex perception warrant this awkward revision? Bregman might have focused on the properties of schemas and the role that schemas might play in speech perception (Johnson & Ralston, 1990). Instead, Chapter 7 concentrates on exclusive allocation and ideas about the transparency of sound. Given Bregman's interest in primitive processes, his strategy for dealing with duplex perception is understandable but unfortunate.

Regardless of the details, Bregman's approach has important theoretical implications. Liberman & Mattingly (1985) argued that perception of an F2 transition, which depends on whether listeners attend to the speech percept or the non-speech percept, shows that speech perception is tied to production while auditory perception is not. This argument hinges on the fact that the same stimulus produces different response patterns depending on whether the listener focuses on the speech percept or the non-speech percept. Contextual information in the base component (such as the presence or absence of a silent interval between an /s/ and the base) affects the speech percept, but not the non-speech percept. However, given a scene analysis view in which the two percepts (chirp and syllable) are separate auditory objects, there is no reason to expect the chirp to be affected by properties of the base.

Toward the end of Auditory Scene Analysis, Bregman describes his proposal as "really just a cluster of ideas" (p. 637). The book belies this assessment.
With great creativity and persistence, Bregman has succeeded in identifying a fundamental problem in a largely unexplored area of auditory perception, and has constructed and tested a theory which, although incomplete and occasionally contradictory, is impressive both for the range of phenomena it covers and for the scientific outlook it illustrates.

References

Bregman, A. S. & Pinker, S. (1978) Auditory streaming and the building of timbre, Canadian Journal of Psychology, 32, 19-31.
Bregman, A. S. & Rudnicky, A. (1975) Auditory segregation: Stream or streams? Journal of Experimental Psychology: Human Perception and Performance, 1, 263-267.
Collins, S. (1985) Duplex perception with musical stimuli: A further investigation, Perception and Psychophysics, 38, 172-177.
Demany, L. (1982) Auditory stream segregation in infancy, Infant Behavior and Development, 5, 261-276.
Dowling, W. J. (1973) Rhythmic groups and subjective chunks in memory for melodies, Perception and Psychophysics, 14, 37-40.
Fowler, C. A. & Rosenblum, L. D. (1990) Duplex perception: A comparison of monosyllables and slamming of doors, Journal of Experimental Psychology: Human Perception and Performance, 16, 742-754.
Johnson, K. & Ralston, J. R. (1990) Automaticity in speech perception: Some speech/nonspeech comparisons. Research on Speech Perception Progress Report No. 16, Bloomington, IN: Speech Research Laboratory, Indiana University.
Liberman, A. M. & Mattingly, I. G. (1985) The motor theory of speech perception revised, Cognition, 21, 1-36.
Nygaard, L. C. & Eimas, P. D. (1990) A new version of duplex perception: Evidence for phonetic and nonphonetic fusion, Journal of the Acoustical Society of America, 88, 75-86.
Pastore, R. E., Schmuckler, M. A., Rosenblum, L. & Szczesiul, R. (1983) Duplex perception with musical stimuli, Perception and Psychophysics, 33, 469-474.
Shepard, R. N. (1981) Psychophysical complementarity. In Perceptual organization, (M. Kubovy & J. R. Pomerantz, editors). Hillsdale, NJ: Lawrence Erlbaum.