Computational auditory scene analysis: Exploiting principles of perceived continuity


Speech Communication 13 (1993) 391-399 North-Holland


M.P. Cooke and G.J. Brown

Department of Computer Science, University of Sheffield, 211 Portobello Street, Sheffield, England

Received 2 February 1993; revised 24 July 1993

Abstract. Acoustic sources are often occluded by other sounds, yet the strategies for recovering individual sources employed by the auditory system in tasks such as speech recognition are remarkably robust against these intrusions. There are often sufficient cues which allow the auditory system to determine whether sound components continue through such occlusions. This paper reviews the situations where an assumption of continuity is warranted and demonstrates how the principles governing the so-called "continuity illusion" can be used within a computational system for segregating acoustic sources.

Keywords. Perceived continuity; auditory scene analysis; auditory model; auditory grouping; speech segregation; resynthesis.

1. Introduction

Acoustic sources such as speech are generally perceived against a background of other sounds, yet listeners are often able to recover sufficient information to allow interpretation of the individual sources. Auditory scene analysis (Bregman, 1990) is a recent theoretical account, backed up by two decades of psychoacoustic investigation, which proposes that the auditory system uses a combination of primitive (bottom-up) organisational principles and schema-driven (top-down or learned) processes to determine which parts of the mixture are likely to have arisen from the same physical event. Recent years have witnessed a number of attempts (Weintraub, 1985; Cooke, 1993; Mellinger, 1991; Brown, 1992) to build computational systems for source segregation inspired by theoretical accounts of auditory scene analysis. Our own work has demonstrated the value of these principles in segregating speech from a variety of other sources. A recent evaluation (Brown and Cooke, 1992) concluded that a large improvement in signal-to-noise ratio is possible using a system based primarily on grouping acoustic components which have sufficiently similar pitch contours.


Figure 1 depicts the computational auditory scene analysis system (Brown, 1992) employed in the study described here. The example illustrated in this figure is a mixture consisting of speech and an artificial siren. In the first stage of processing, the acoustic input is passed through a simulation of the auditory periphery which consists of a bank of bandpass (gammatone) filters (Patterson et al., 1988; Cooke, 1993), each of which is followed by a model of inner hair cell function (Meddis, 1988). Filters are spaced according to the ERB-rate scale of Glasberg and Moore (1990). The next stage employs a series of representational abstractions called computational maps, motivated by the discovery of physiological auditory maps which appear to place-code acoustic parameters in orderly arrays of neurons (Moore, 1987). The maps are two-dimensional, such that characteristic frequency and the value of the mapped parameter are represented on orthogonal axes. Periodicity, intensity, interaural time and intensity differences, frequency transition and spectral shape appear to be encoded in this fashion. In the model, computational maps represent averaged auditory nerve firing rate, onsets and offsets of energy in auditory filter channels, frequency transitions, and periodicities present in the fine structure of each filter output.

The frequency transition and periodicity maps can be combined to produce a collection of auditory scene components (referred to throughout simply as "components"), shown in Figure 1(c), each of which represents the evolution in time of collections of auditory filter channels which have similar temporal responses. Grouping of components which are likely to have arisen from the same acoustic source is based on similarity of pitch contours derived individually from each component. Pairs of components with sufficiently similar contours (as judged by a Gaussian-weighted distance metric) are grouped; this process iterates until no further components can be recruited. The pitch contour of a component is obtained by finding the best path through a time-series of pitch estimates, as described in (Brown and Cooke, 1992). Pitch estimates are derived by an autocorrelation analysis at the output of each auditory filter channel.
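To make this grouping stage concrete, the following sketch shows one way such a pairwise, iterative grouping might be implemented. It assumes each component carries a local pitch contour sampled on a common frame grid (undefined frames marked NaN); the particular Gaussian-weighted similarity function, its width and the grouping threshold are illustrative assumptions, not the exact metric of Brown and Cooke (1992).

```python
# A minimal sketch of grouping by pitch contour similarity, assuming each
# contour is a 1-D array of pitch estimates (Hz) with NaN where undefined.
import numpy as np

def contour_similarity(p1, p2, sigma=2.0):
    """Gaussian-weighted similarity between two pitch contours, computed
    over the frames where both contours are defined (sigma is assumed)."""
    overlap = ~np.isnan(p1) & ~np.isnan(p2)
    if not overlap.any():
        return 0.0
    d = p1[overlap] - p2[overlap]
    return np.mean(np.exp(-d**2 / (2 * sigma**2)))

def group_components(contours, threshold=0.5):
    """Iteratively merge components whose contours are sufficiently
    similar; stops when no further components can be recruited."""
    groups = [{i} for i in range(len(contours))]
    merged = True
    while merged:
        merged = False
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                if any(contour_similarity(contours[i], contours[j]) > threshold
                       for i in groups[a] for j in groups[b]):
                    groups[a] |= groups.pop(b)   # recruit group b into a
                    merged = True
                    break
            if merged:
                break
    return groups
```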

[Figure 1: panels (a) acoustic mixture passed to the auditory periphery model; (b) auditory maps; (c) auditory scene components and corresponding local pitch contours, plotted as frequency (ERB-rate) against time; (d) components grouped by pitch contour similarity; (e) resynthesis from grouped components.]

Fig. 1. Auditory scene analysis: outline of representations and strategy in Brown's model. The example mixture here consists of an artificial siren added to the utterance "our lawyer will allow your rule". Upper panels of (c) and (d) depict auditory scene components derived from the mixture, whilst lower panels show pitch contours for each component (in this example, these form three clusters, the largest of which corresponds to the pitch contour of the utterance). Those pitch contours emboldened in part (d) show components which have been grouped by pitch contour similarity.


The subset of components grouped in this way for the siren + speech mixture is shown in Figure 1(d). Other grouping rules based on onset and offset synchrony are present in the model but are not employed in this study.

It is possible to resynthesise a signal from any subset of components. Apart from allowing informal listening assessments, this facility permits quantitative evaluation of the system's performance in segregating sources. Two quantitative metrics have been developed: one measures the signal-to-noise ratio (SNR) prior to and after segregation; the other reflects the proportion of the intended source which is recovered from the mixture (the "characterisation" metric). These metrics are defined in Section 4. For the example illustrated in Figure 1, the SNR of the mixture prior to segregation is 0.29, measured on a scale where 0 represents all noise and 1 denotes all signal. After segregation by the model, the SNR is 0.98, indicating that virtually all of the recovered signal belongs to the speech source as opposed to the siren. However, whilst SNR improvement is valuable, it is not sufficient on its own, since it is apparent that significant chunks of the utterance have been omitted by the grouping system. Little energy in the higher frequencies has been assigned to the utterance, and large sections of the utterance have been "carved out" by the siren. The proportion of energy in the signal which is recovered during grouping is estimated by the characterisation metric at only 35%. In spite of this, the resynthesised speech remains intelligible.

Why is characterisation so poor? Part of the explanation lies in the fact that the current system follows a "synthetic" approach to grouping (Ellis, 1992), in which organisational principles such as onset synchrony are used to piece together components of an acoustic source. The default is to segregate components unless there is some reason to fuse them. Thus, a low characterisation performance points to the paucity of organisational principles in the model; certainly, the presence of speech-specific schemas could help to fuse rather more of the auditory scene than the sole use of primitive principles, as is the case in the current model.


In this paper, an alternative approach is presented which is suggested by inspection of Figure 1(c), in which components of the utterance appear, visually, to continue through the siren. There is an auditory equivalent to this visual illusion which has received extensive experimental attention - the "continuity illusion".

2. The continuity illusion

If parts of a signal are deleted and replaced with some other (usually louder) signal with the right characteristics, the softer sound is often heard as continuing through the louder intrusion even though it is not physically present. This illusory continuity has been demonstrated for a range of stimuli, from simple tonal signals to more complex signals such as speech (Warren, 1970). Bregman (1990) suggests an explanation in terms of auditory scene analysis, the role of which is to find the simplest explanation for some auditory configuration which is not contradicted by any of the sensory evidence. Consider the possible explanations for the configuration of Figure 2. It is possible that A1 and A2 are perceived as having separate explanations in the environment. Alternatively, A1, B and A2 may have arisen from the same source. A third explanation is that A1 and A2 arose from the same source, and that B is a manifestation of something else. Which explanation should the auditory system adopt? The answer depends on the precise nature of A1, A2 and B. However, if there is no evidence that A1 and A2 have arisen from separate events, the scene analysis explanation will adopt the simpler interpretation on the basis that it is more likely to reflect the sort of sound configurations that exist in the environment.

Fig. 2. Components of the continuity illusion: B is a loud sound, whilst A1 and A2 are softer sounds with "similar" properties. (Redrawn from (Bregman, 1990, Figure 3.22).)


Bregman goes on to separate out the questions of whether a sound has continued through an interruption and what the occluded sound is. Since it appears that any type of sound can undergo perceptual restoration, the "what" question involves factors such as sentential context, e.g. (Warren, 1970). How can we tell "whether" a sound has continued through an interruption? Bregman suggests the following four rules - a detailed discussion of the psychoacoustic evidence for each is presented in his book (Bregman, 1990, pp. 345-382):

i. No discontinuity in A. There should be no evidence that A1 stopped at the onset of B. Similarly, there should be no reason to suppose that A2 started just as B ended.
ii. Sufficiency of evidence for occlusion. There must be enough neural activity during B - at least as much as would have been generated by a continuation of A1.
iii. A1-A2 grouping. A1 and A2 must have similar properties. In fact, there should be reason to suppose that they have arisen from the same event.
iv. A is not B. It should be possible to interpret B as the result of a separate event rather than as a continuation of A1.

One consequence of rules iii and iv is that perceptual restoration might be considered as occurring after sound components have been grouped. Section 3 describes how these rules can be used to improve characterisation performance in a model of auditory scene analysis.

3. A two-stage model which exploits principles of perceived auditory continuity

The computational strategy described here consists of a general method for answering the "whether" question, followed by a specific solution to the "what" issue which involves exploiting harmonic relationships between components.

3.1. Answering the "whether" question

In determining whether some acoustic component might have continued through an occluding source, the model draws on auditory maps for averaged auditory nerve firing rate, onsets and offsets. Bregman's four rules are transformed into cost functions in a dynamic programming (DP) search which attempts to link those components grouped by Brown's model. Specific rules are handled as follows:

i. No discontinuity in A. A cost is computed based on the evidence for an offset at the end of A1, and on the evidence for an onset at the start of A2 (using offset and onset maps, respectively).
ii. Sufficiency of evidence for occlusion. An extension of A1 should be supported by the presence of sufficient activity in the map of firing rate. A cost is incurred if the rate drops below that registered at the end of A1 (rather arbitrarily, this is calculated as the average firing rate over the final 30 ms of A1).
iii. A1-A2 grouping. Since the search for perceptual restoration takes place after grouping, it ought to be the case that pairs of components linked are parts of the same source. However, it is still possible for components to link with inappropriate components (e.g. a 2nd harmonic track linking with a 3rd harmonic; both are part of the same organisation, but it would be an error to link them as part of a continuous track). Dealing with this effectively appears to rely on source-specific knowledge. A general principle which might be used is frequency proximity: a cost could be associated with a transition in frequency. Here, we attempt only to restore similarly labelled harmonics, as described in Section 3.2.
iv. A is not B. If A1 is transformed gradually into B, little or no activity in the onset map at the transition would be expected (similarly for the offset map at the B-A2 interface). Conversely, any evidence for an onset at these points can be rewarded by introducing a negative cost based on the degree of onset activity at the A1-B interface (and for offsets at the B-A2 interface).


The costs associated with these rules (with the exception of iii) can be combined to produce a DP search through the time-frequency region between any pair of components we are seeking to link. In the general case, the system does not know which A2 is a candidate continuation of A1, so it is necessary to allow the search to proceed up to some specified limit in time (300 ms in the current implementation). Indeed, more than one such continuation may be discovered, in which case the lowest cost candidate is chosen. In the following, R denotes the amount of activity in the firing rate map at the end of component A1, and r(t, f), on(t, f) and off(t, f) represent the activity in the rate, onset and offset maps at time t in channel f. The local cost function, c(t, f), is defined as

$$c(t, f) = \begin{cases} 0, & R \le r(t, f), \\ \mu_{\mathrm{rate}}\,(R - r(t, f)), & \text{otherwise}, \end{cases} \tag{1}$$

where $\mu_{\mathrm{rate}}$ is a constant used to weight the contribution of this component in relation to that made by the onset and offset maps. Equation (1) implements rule ii above. Initial costs are based on rules i and iv. For each channel f, we define

$$C(t_0, f) = \mu_{\mathrm{on/off}}\,\bigl(\mathrm{off}(t_0, f) - \mathrm{on}(t_0, f)\bigr), \tag{2}$$

where $t_0$ is the time at which A1 ends. This provides a positive penalty for any evidence of an offset at the end of A1, balanced by evidence for an onset. Whilst the negative penalty introduced by the onset term in (2) goes against the spirit of a DP cost solution, it provides an effective means of introducing rule iv. The constant $\mu_{\mathrm{on/off}}$ is a weighting factor. The DP iteration uses a 3-way decision function; consequently, the cumulative cost, C(t, f), of extending A1 to a cell (t, f) is given by

$$C(t, f) = c(t, f) + \min\bigl[C(t-1, f-1),\; C(t-1, f),\; C(t-1, f+1)\bigr]. \tag{3}$$

For each cell containing a candidate A2 start point, a penalty for any onset evidence is added, again balanced by evidence for an offset. For all such cells, the additional cost is given by

$$c(t, f) = \mu_{\mathrm{on/off}}\,\bigl(\mathrm{on}(t, f) - \mathrm{off}(t, f)\bigr). \tag{4}$$

Finally, the lowest cost continuation is determined (of course, there may be no such continuation present within the search region). Its cost reflects the likelihood that A1 has a continuation: the lower the cost, the more likely it is that A1 can be continued. Ideally, any evidence that the rules are broken should result in a negative answer to the "whether" question. In practice, costs which are less than a small positive tolerance indicate that restoration is admissible; the precise value of this tolerance should depend on how conservative a restoration strategy one wishes to adopt. The choice of values for $\mu_{\mathrm{rate}}$ and $\mu_{\mathrm{on/off}}$ depends on many factors, not least of which is the sensitivity of the onset and offset maps. Currently, contributions from equations (1), (2) and (4) are weighted equally. Further coupled experimental-modelling work is required to devise weights which reflect listeners' susceptibility to the continuity illusion.

3.2. Signal restoration

Having determined that a component is likely to have been occluded, the question arises of what form the restoration should take. Whilst an answer to this might be considered to require source-specific knowledge, there may be cues provided by primitive grouping processes such as harmonicity. Here, since components are grouped according to a principle of pitch contour similarity, it is possible to use harmonic relations to determine appropriate time-frequency tracks. Specifically, each component is assigned a harmonic label as a result of the grouping process, and the cost of linking successive fragments of each harmonic is calculated using the approach outlined in the previous section. In fact, it is unnecessary to employ a full DP search in this case because it is possible to predict the time-frequency regions occupied by the missing harmonic fragment. A further modification to the general algorithm is possible since the system knows which A2 matches each A1: they possess the same harmonic number. In these circumstances, firing rate information is available both at the end of A1 and at the start of A2, allowing a more accurate prediction of firing rate during the occlusion. In practice, this means replacing the constant R in equation (1) by R(t, f), calculated by linear interpolation of the rates between the end of A1 and the start of A2. More sophisticated prediction methods based on speech-specific knowledge might be employed.

It is inappropriate to assign the full amount of energy in any restored time-frequency region to the signal, since some of it will belong to the occluding source. We can achieve such an assignment of energy by creating a real-valued "mask" of weights, w(t, f), which specifies the allocation of energy in each time-frequency region deemed to belong to the source undergoing restoration. The weight at (t, f) is simply the ratio of the predicted rate to the actual rate:

$$w(t, f) = \frac{R(t, f)}{r(t, f)}. \tag{5}$$

The weights, in fact, represent not so much an "energy" assignment as an allocation of auditory nerve firing rates. Each time-frequency region of the rate map represents a time-averaged output from a model of the inner hair cell (Meddis, 1988). The well-known compressive characteristics of the inner hair cell transduction process (Kiang et al., 1965) will tend to produce a compressed estimate of signal energy, leading to an overestimation of mask weights. However, we use such a representation of "auditory energy" since it is this information which is available to the auditory system.

Figure 3(a) displays r(t, f) for the siren + speech mixture of Figure 1, whilst Figures 3(b) and 3(c) depict the weights, w(t, f), and the restored speech produced by multiplying the rate map of the mixture with the mask. It is evident that the restoration has been partially successful in reconstructing occluded parts of the utterance. Precisely how successful? Using the metrics described in Section 4, we find that characterisation has improved from 35% to 49%, whilst the SNR now stands at 0.92. This is slightly worse than the figure of 0.98 obtained by grouping, indicating that the mask has allowed through not only more of the speech signal, but also some of the siren. However, this figure still represents a significant improvement in SNR over the original mixture.


Fig. 3. Stages in signal restoration for the siren + speech mixture illustrated in Figure 1. (a) Firing rate map, r(t, f); (b) mask of weights, w(t, f); (c) reconstructed source formed by multiplying the rate map by the mask in each time-frequency region.
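A minimal sketch of the mask construction for one restored harmonic fragment follows, assuming a rate map and a predicted time-frequency track for the missing fragment; the helper names and data layout are illustrative assumptions rather than the authors' implementation.

```python
# A sketch of equation (5) with interpolated rate prediction, assuming
# `rate` is a float array of shape (frames, channels) and `track` gives
# the predicted channel index of the harmonic for each frame.
import numpy as np

def restoration_mask(rate, track, t_end_A1, t_start_A2):
    """Build weights w(t, f) = R(t, f) / r(t, f) over the occluded region,
    where R(t, f) is linearly interpolated between the rate at the end of
    A1 and the rate at the start of A2 (same harmonic number)."""
    w = np.zeros(rate.shape)
    R1 = rate[t_end_A1, track[t_end_A1]]      # rate at the end of A1
    R2 = rate[t_start_A2, track[t_start_A2]]  # rate at the start of A2
    for t in range(t_end_A1 + 1, t_start_A2):
        alpha = (t - t_end_A1) / (t_start_A2 - t_end_A1)
        f = track[t]                          # predicted channel this frame
        R = (1 - alpha) * R1 + alpha * R2     # interpolated prediction R(t, f)
        if rate[t, f] > 0:
            w[t, f] = R / rate[t, f]          # equation (5)
    return w

# The restored source is then formed by multiplying the rate map of the
# mixture by the mask in each time-frequency region, as in Figure 3(c).
```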


4. Quantitative results

One way to assess the performance of a separation system is to measure some kind of distance between original and reconstructed signals. Whilst this is certainly possible in our case, it does not necessarily lead to an intuitive metric which can be compared with listeners' ratings of intelligibility improvement. For example, a signal which contains very little intrusion (high SNR) after separation but poorly characterises the original signal may still be quite intelligible despite having a large objective distance score. Instead, we prefer to separate out the two measures referred to earlier as characterisation and SNR improvement.


For the evaluation reported below, the task was to retrieve a spoken utterance from various mixtures. In the following, we refer loosely to the intrusive or occluding source as the 'noise' signal. Quantities s(t, f) and n(t, f) denote the RMS energy at time t and frequency f in the speech signal and intrusive source, respectively.

4.1. Signal-to-noise ratio

The SNR for each 10 ms frame of a restored signal is measured using equation (6), which varies from 0, representing a time-frequency region containing all "noise", to 1, representing all "signal":

$$\mathrm{SNR}(t) = \frac{2}{\pi}\,\arctan\!\left(\frac{\sum_f \min\bigl[s(t, f),\; p(t, f)\bigr]}{\sum_f \max\bigl[0,\; p(t, f) - s(t, f)\bigr]}\right), \tag{6}$$

where p(t, f) = w(t, f)[s(t, f) + n(t, f)] is the value of the signal predicted by the mask. Results thus obtained are averaged across the utterance. The arctangent compression is used since, for some frames, the denominator in equation (6) is zero. Note that the ideal separation, achieved when p(t, f) = s(t, f) everywhere, results in an SNR of 1.

4.2. Characterisation

Characterisation, which measures the proportion of some source recovered from the mixture, is measured using

$$\mathrm{CHAR}(t) = 1 - \frac{\sum_f \max\bigl[0,\; s(t, f) - p(t, f)\bigr]}{\sum_f s(t, f)}. \tag{7}$$

Fig. 4. Characterisation and SNR improvement over the 100 mixtures. The noise sources are n0 = 1 kHz tone; n1 = white noise; n2 = impulse series; n3 = laboratory noise; n4 = rock music; n5 = siren; n6 = telephone; n7 = female speech; n8 = male speech; n9 = female speech. SNRs are measured in the mixture, and before and after principles of perceived auditory continuity are exploited.

A value of 1 corresponds to a time frame in which the reconstructed signal completely characterises the utterance. Note that the "max" function in equation (7) allows the signal to "overcharacterise" the utterance - that is, to allocate more energy than required in some time-frequency regions. Of course, such an over-allocation will result in a degradation of the SNR as measured by equation (6).
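For concreteness, the two metrics can be transcribed almost directly from equations (6) and (7). The sketch below assumes s, n and w are arrays of shape (frames, channels); the function name and the small epsilon guard for silent frames are illustrative assumptions.

```python
# A minimal sketch of the evaluation metrics, equations (6) and (7).
import numpy as np

def snr_and_char(s, n, w, eps=1e-12):
    """Frame-level SNR (eq. (6)) and characterisation (eq. (7)),
    averaged across the utterance; p is the signal predicted by the mask."""
    p = w * (s + n)
    num = np.minimum(s, p).sum(axis=1)          # energy matching the signal
    den = np.maximum(0.0, p - s).sum(axis=1)    # intruding "noise" energy
    snr = (2.0 / np.pi) * np.arctan2(num, den)  # arctan compression handles
                                                # frames where den == 0
    char = 1.0 - (np.maximum(0.0, s - p).sum(axis=1)
                  / np.maximum(s.sum(axis=1), eps))
    return float(snr.mean()), float(char.mean())
```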


Figure 4 shows the values taken by SNR and CHAR over a database of 100 mixtures (Cooke, 1993) consisting of 10 voiced utterances mixed with each of 10 different acoustic sources. Results in each case are averaged over the set of voiced utterances. The figure also shows the original SNR in the mixture and the SNR after grouping but prior to exploiting the principles underlying the continuity illusion. Characterisation performance increases in each case, sometimes to double its original level. Hence, use of the principles of continuity in determining whether source restoration is justified is demonstrably successful in allowing a more complete recovery of the source. SNR improvement generally falls, though its level is still well above that of the mixture itself. Informally, speech resynthesised from masks restored in this manner generally sounds more natural than tokens resynthesised prior to source restoration.

5. Discussion

The principles governing perceived auditory continuity appear to provide a basis for the restoration of acoustic sources occluded by other signals. Since this situation occurs in most listening environments, computational techniques which exploit such principles may be necessary components of any sophisticated system for acoustic source separation. The model presented here is rather crude, and could certainly be improved upon in several ways. The system performance is largely dependent on the accuracy with which the firing rate during occluded regions is estimated, and a model which provides more detailed spectral predictions in such regions would result in improved accuracy. Such predictions could be supplied in a top-down fashion by a model of the acoustic source (such as a hidden Markov model), although such models would need to be trained specifically for each type of source.

As noted in Section 2, the continuity illusion appears to apply to the restoration of components which have already been grouped. Hence, one might expect further improvements in characterisation through the development of more extensive auditory scene analysis strategies and the employment of further grouping principles. This leads to consideration of the following question: is grouping a necessary precondition for restoration? The A1-A2 grouping rule suggests that it is. If so, a mechanism for further exploitation of the principle would be to use the rule which resulted in the A1-A2 group to indicate the form of the restored region. In this study, since auditory scene components were grouped by common periodicity, we were able to exploit harmonic relations to fill in the missing portions of these components. One can imagine other grouping principles (such as common location in space) providing similar suggested restorations. This raises the issue of the extent to which perceptual "restoration" can be modelled as a top-down or bottom-up process. The top-down explanation requires that restoration occurs after "recognition" of a partial input, in much the same way that missing characters in a word can often be supplied by the context. An alternative, and probably complementary, viewpoint is that some of the restoration occurs as a result of the kinds of process described in this study.

Possibly the most important consequence of exploiting the principles behind the continuity illusion is the way it allows regions of the signal to be assigned an occlusion estimate. This enables those parts of the signal which represent unobstructed glimpses of some target source to be identified. Such assignments will be crucial in situations where computational auditory scene analysis operates as a pre-processor for systems which rely on single acoustic sources (e.g. automatic speech recognisers). We are currently investigating speech recognition algorithms which recover source-specific information based on the combination of partially-specified input and likelihood of occlusion information.

6. Conclusions

This paper presents what we believe to be the first model to exploit principles of perceived auditory continuity in the segregation of acoustic mixtures. The results, measured over a fairly large corpus, indicate that real benefits are achievable in the reconstruction of a target signal from the mixture, without much loss in SNR improvement. The general method, based on a dynamic programming search, can be used in any situation where evidence for occlusion is sought. Precisely how this evidence is used depends on such things as source-specific knowledge. More work is required to derive suitable weightings for evidence from onset, offset and firing rate maps. In particular, the model should predict continuity only in those situations in which listeners fall for the continuity illusion. Future work will attempt to widen the range of cues used within the model to include, for example, those based on source location (Denbigh and Zhao, 1992). A further challenge is the integration of source knowledge with primitive organisational principles. The work described in this paper suggests a link between partial auditory scene groupings and stored source knowledge, in that it demonstrates how to find those areas of the signal which provide reliable estimates of single source characteristics.

Acknowledgments MPC thanks the Royal Society for a study visit grant. This work is supported by the SERC Image Interpretation Initiative.


References

A.S. Bregman (1990), Auditory Scene Analysis (MIT Press, London).
G.J. Brown (1992), Computational auditory scene analysis: A representational approach, Ph.D. Thesis, University of Sheffield.
G.J. Brown and M.P. Cooke (1992), "Computational auditory scene analysis: Grouping sound sources using common pitch contours", Proc. Inst. Acoustics, Windermere, November 1992, pp. 439-446.
M.P. Cooke (1993), Modelling Auditory Processing and Organisation (Cambridge Univ. Press, Cambridge).
P.N. Denbigh and J. Zhao (1992), "Pitch extraction and separation of overlapping speech", Speech Communication, Vol. 11, Nos. 2-3, pp. 119-125.
D. Ellis (1992), A perceptual representation of audio, M.S. Thesis, MIT.
B.R. Glasberg and B.C.J. Moore (1990), "Derivation of auditory filter shapes from notched noise data", Hearing Research, Vol. 47, pp. 103-138.
N.Y.S. Kiang, T. Watanabe, E.C. Thomas and L.F. Clark (1965), Discharge Patterns of Single Fibres in the Cat's Auditory Nerve (MIT Press, Cambridge, MA).
R. Meddis (1988), "Simulation of auditory-neural transduction: Further studies", J. Acoust. Soc. Amer., Vol. 83, pp. 1056-1063.
D.J. Mellinger (1991), Event formation and separation in musical sound, Ph.D. Thesis, Stanford University.
D.R. Moore (1987), "Physiology of the higher auditory system", British Medical Bull., Vol. 43, No. 4, pp. 856-870.
R.D. Patterson, J. Holdsworth, I. Nimmo-Smith and P. Rice (1988), SVOS Final Report: The auditory filterbank, APU Report 2341.
R.M. Warren (1970), "Perceptual restoration of missing speech sounds", Science, Vol. 167, pp. 392-393.
M. Weintraub (1985), A theory and computational model of monaural auditory sound separation, Ph.D. Thesis, Stanford University.