Spatially selective sound capture for speech and audio processing







Speech Communication 13 (1993) 207-222 North-Holland

J.L. Flanagan, A.C. Surendran and E.E. Jan
CAIP Center, Rutgers University, Frelinghuysen Road, Piscataway, NJ 08855-1390, USA

Received 24 December 1992; revised 26 May 1993

Abstract. Advances in transducer technology, signal processing and computing make possible high-quality sound capture from designated spatial volumes under adverse acoustic conditions. The techniques of multiple beamforming and matched filtering are applied to two- and three-dimensional arrays of sensors. Array performance is assessed in a preliminary way from computer simulations of rooms and from image characterization of the multipath environment. The results suggest that high-quality signals can be retrieved from spatially-selected volumes in severely reverberant enclosures. Reciprocally, the same techniques can be applied to spatially-selective sound projection.

Zusammenfassung (translated from German). Thanks to advances in transducer technology, signal processing and computing, it is now possible to capture sound of high quality from designated spatial regions despite adverse acoustic conditions. The techniques of beamforming and matched filtering are applied to two- and three-dimensional arrays of sensors. Array performance is assessed in a preliminary way by computer simulation of rooms and by an image analysis of the multipath environment. The results support the expectation that high-quality signals can be recovered from spatially localized regions of strongly reverberant enclosures. In the same sense, the same techniques can be applied to spatially selective sound projection.

Résumé (translated from French). Thanks to progress in transducer technology, signal processing and computing, it is now possible to capture sound of good quality from designated spatial zones despite difficult acoustic conditions. The techniques of beamforming and matched filtering are applied to two- and three-dimensional arrays of sensors. Array performance is assessed in a preliminary way by computer simulation of rooms and by an image characterization of the multipath environment. The results suggest that good-quality signals can be recovered from spatially localized zones of strongly reverberant enclosures. Reciprocally, the same techniques can be applied to spatially selective sound projection.

Keywords. Speech processing; audio processing; microphone systems.

* This report is offered as a tribute to Professor Hiroya Fujisaki on the multiple occasions of (a) his sixtieth anniversary, (b) his retirement from long and distinguished service at the University of Tokyo to become Professor Emeritus, and (c) his appointment to key faculty responsibilities in the Science University of Tokyo. Professor Fujisaki's personal research, and that of his many students over the years, has influenced scientific thinking and progress in speech research in many significant ways. His work has inspired complementary research not only in the prestigious speech laboratories of Japan, but also among diverse groups throughout Europe, Asia and America. Professor Fujisaki's career-long devotion to Spoken Language Processing, and to the organizations which promote this field, is internationally recognized and lauded. Professor Fujisaki's career and scientific purview span at least two generations. One, the older, is exemplified by the first author of this report who, as a contemporary of Professor Fujisaki, worked in Professor Kenneth Stevens' speech group in the M.I.T. Acoustics Laboratory. And the newer generation of emerging scientists in speech processing is exemplified by the PhD candidates who co-author this report.

0167-6393/93/$06.00 © 1993 - Elsevier Science Publishers B .V. All rights reserved



1. Issues in sound capture

Telephonic communication in general, and speech processing in particular, require that the human speech signal be "captured", transduced and digitized with the greatest fidelity possible. Ideally, one wishes a facsimile of the acoustic signal produced by a human talker, having intelligibility and quality comparable to that received in face-to-face communication. One also often desires spatial realism in the acoustic representation, as well as the avoidance of hand-held, body-worn or tethered equipment that might encumber the talker and restrict movement. In most environments, the acoustic properties of the enclosure and unwanted sources of sound pose obstacles to this objective.

Over recent years, great progress has been made in the quality, availability and cost of high-performance microphones. Electret technology (Sessler and West, 1969) now provides inexpensive solid-dielectric condenser elements having flat frequency response and linear phase over the audio frequency range. At the same time, advances in microelectronics and the understanding of digital signal processing permit complex and advantageous algorithms to operate in real time on transduced signals. At this point in time, large numbers of microphone elements, along with dedicated signal processors, can be devoted to high-quality signal capture.

A variety of technical approaches have been explored for audio signal capture. Some include: beamforming to minimize interfering reflections and extraneous noise; adaptive noise filtering (or null steering) to mitigate interference; active noise cancellation by computing and radiating a phase-inverse of interfering signals; neural network "learning" and compensation of the acoustic environment; algorithmic signal enhancement employing non-linear filtering to reduce the effects of distortion and interference; and synergistic combinations of these (Che et al., 1992; Flanagan et al., 1991; Kaneda and Ohga, 1984; Silverman, 1987; Stern and Acero, 1989).

Among the foregoing methods, the technique favored for applications such as teleconferencing, hands-free cellular radio, and voice control of information systems has been beamforming by

arrays of microphones (Berkley and Flanagan, 1990; Flanagan et al., 1985). Moreover, with appropriate computation and logic, one can arrange beamformers to be "autodirective" - that is, to determine automatically the existence and direction of an acoustic signal, and then to determine whether the signal is a desired one (such as speech) or an unwanted one (such as air conditioner noise) (Elko et al., 1988; Flanagan and Silverman, 1992).

Successful signal processing methods for microphone arrays must therefore address issues that include the effects of
- acoustic multipath distortion (reverberation),
- interference from noise sources,
- accuracy of location of desired sound sources,
- reliability of speech/non-speech discrimination,
- identification and separation of multiple talkers.

The technical discussion here will focus on the first of these issues; that is, on new possibilities for mitigating unfavorable characteristics of the room enclosure while at the same time realizing volumetric selectivity in sound capture.

2. Microphone performance in enclosures; simulation of rooms

A microphone element is sensitive to the pressure fluctuations of an acoustic wave. For a point source of sound of sinusoidal frequency ω (i.e., a small pulsating sphere), the sound pressure at a distance r from the source is simply

p(r, t) = (A/r) e^{jω(t - r/c)},

where A is the source strength (depending upon the volume velocity of air displaced by the vibrating sphere) and c is the speed of sound. At large distances from the source the curvature of the wavefront is relatively small, the wave becomes plane and is characterized simply by the phase factor e^{-jωr/c}. The latter quantity renders very simple the computation of the responses of complex microphone arrays to plane waves (from distant sources). Within rooms, however, one usually is in the near field and the spherical spreading must be taken into account.




The walls of conventional rooms are large compared to acoustic wavelengths of interest, and hence constitute effective reflectors (mirrors). Some absorption of acoustic energy may occur upon each encounter with a surface. Objects whose dimensions are small compared to a wavelength may act as scatterers and diffuse the sound energy. They are difficult to account for in detail.

As a first approximation to the multipath effects in the room, the image technique of computing a source-receiver impulse response is valuable and insightful (Allen and Berkley, 1979). It permits estimation of the reverberation time of the enclosure and of the filtering characteristics that the multipath environment imposes. Image sources are determined in accordance with Snell's Law and behave as a constellation of point sources in free space. The sound pressure contributed by each source is given by the foregoing expression for p(r, t) and typically includes spherical spreading. The strength of each image source depends upon the absorption its ray path has encountered in the room, and in the simplest instance this absorption is considered to be frequency independent.

Figure 1 shows an image diagram in two dimensions for a hard-walled room. Also given is the number of images for the three-dimensional space through fifth order. One notices that the transit times for reflection place all images outside the wall boundaries of the room. Also, while the sound source gives rise to a first-order image in each bounding surface, i.e., six, the first-order (and higher-order) images do not necessarily spawn the same number of next-order images. Nevertheless, the number of images grows rapidly and increases the complexity of the computation.

Using the relationship for p(r, t), Figure 2 illustrates the computed impulse response for a rectangular room of 10 x 8 x 3 meters. A single receiver is positioned at coordinates (5, 0, 2) meters and the sound source is positioned at (8.5, 4.5, 1.35) meters. All surfaces exhibit an absorption coefficient of α = 0.1. Images through fifth order are used.

Clearly, typical rooms contain irregularities and contents whose dimensions are small compared to a wavelength of sound in air. Diffraction of incident sound energy therefore occurs and acts

Order of images:                     0    1    2    3    4    5
Number of images:                    1    6   18   38   66  102
Cumulative total number of images:   1    7   25   63  129  231

Fig. 1. Number of images and cumulative total number of images associated with order in room simulation. The figure shows the location of images of order 1 and 2 corresponding to a sound source located at X in the room.

to produce a diffuse sound field. It is difficult to account for all such detail in a simulation, though current research is addressing statistical approaches to modeling the integrated effects of images of high order and of diffraction from irregularities in geometry (Naylor, 1992). The interest here is in the gross behavior of arrays in rooms, with the objective of finding designs that effectively combat multipath distortion.
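The image construction lends itself to a compact simulation. The sketch below is an illustrative rendering in Python/NumPy, not the authors' code: the source is mirrored through the walls, each image is weighted by the wall reflection coefficient raised to its reflection count and by 1/r spherical spreading, and each arrival is accumulated at the nearest sample of its propagation delay. The function name and interface are assumptions; the example values follow the room of Figure 2.

import numpy as np

def image_method_rir(room, src, rcv, beta=0.9, c=343.0, fs=64000, max_order=5):
    """Source-to-receiver impulse response of a rectangular room computed with
    the image-source construction (after Allen and Berkley, 1979).

    room      -- (Lx, Ly, Lz) interior dimensions in metres
    src, rcv  -- source and receiver positions (x, y, z) in metres
    beta      -- frequency-independent pressure reflection coefficient of all walls
    max_order -- keep images whose total number of wall reflections <= max_order
    """
    room, src, rcv = (np.asarray(v, float) for v in (room, src, rcv))
    # The longest admissible path bounds the length of the response.
    t_max = 2.0 * (max_order + 1) * np.linalg.norm(room) / c
    h = np.zeros(int(np.ceil(t_max * fs)) + 1)

    ks = range(-max_order, max_order + 1)          # mirror-copy index along each axis
    for kx in ks:
        for ky in ks:
            for kz in ks:
                order = abs(kx) + abs(ky) + abs(kz)    # total number of wall reflections
                if order > max_order:
                    continue
                # Image position: even mirror copies translate the source,
                # odd copies also reflect it across the nearer wall.
                img = np.array([k * L + s if k % 2 == 0 else (k + 1) * L - s
                                for k, L, s in zip((kx, ky, kz), room, src)])
                r = np.linalg.norm(img - rcv)
                n = int(round(r / c * fs))             # sample index of this arrival
                if n < len(h):
                    h[n] += beta ** order / r          # reflection loss and 1/r spreading
    return h

# Illustrative use with the room of Figure 2 (10 x 8 x 3 m, wall reflectivity 0.9):
h = image_method_rir((10, 8, 3), src=(8.5, 4.5, 1.35), rcv=(5, 0, 2),
                     beta=0.9, fs=64000, max_order=5)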

3. Limitations of single beamformers

Beamforming is a technique that has proved very effective for hands-free sound pick up in auditoria and teleconference rooms (Berkley and




Flanagan, 1990; Flanagan et al., 1991). Arrays for single beamforming may be one-, two- or three-dimensional in architecture depending upon the characteristics of spatial selectivity desired (Flanagan, 1987). But single beamformers are adversely affected by severe multipath distortion, and function best if the enclosure has at least a modest amount of sound treatment.

The single beamformer is traditionally of the "delay and sum" form, which directs a beam of receiver sensitivity along the direction of the sound source (the direct path), and attenuates sound arriving from other directions. Suppose the multipath enclosure gives rise to K significant signal arrivals at every microphone element (i.e., a direct path and K - 1 reflections), and that N microphones constitute the receiving array. Pure

delay is provided to each individual microphone so as to cohere (i.e., add in phase, or voltage-wise) all arrivals coming directly from the source. The summed output of the array therefore provides N cohered arrivals and (K - 1)N that distribute in time (and typically add power-wise). This situation is illustrated in Figure 3 for an impulse of sound at the source, where all arrivals are assumed to be of comparable amplitude. On the basis of these relations, one can make a simple definition of the ratio of undistorted signal power to the reverberant (interfering) noise power. For a source on the axis of the single beam, this signal-to-noise ratio (SNR) is

SNR_1 = N^2 / [N(K - 1)] = N / (K - 1).
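The delay-and-sum operation itself is simple to state in code. The following sketch is an illustrative Python/NumPy fragment, not taken from the paper, and its function name and interface are assumptions: it time-aligns the direct-path arrivals from a known source position by applying a pure delay to each microphone channel and summing; fractional delays are rounded to the nearest sample for brevity.

import numpy as np

def delay_and_sum(signals, mic_positions, source_position, fs, c=343.0):
    """Single-beam delay-and-sum: steer a beam at a known source position.

    signals         -- (N, T) array, one row of samples per microphone
    mic_positions   -- (N, 3) array of sensor coordinates in metres
    source_position -- (x, y, z) of the desired source in metres
    """
    signals = np.asarray(signals, float)
    mics = np.asarray(mic_positions, float)
    dists = np.linalg.norm(mics - np.asarray(source_position, float), axis=1)
    # Delay each channel so that every direct-path arrival lines up with the
    # arrival at the most distant microphone.
    delays = np.round((dists.max() - dists) / c * fs).astype(int)
    out = np.zeros(signals.shape[1] + int(delays.max()))
    for x, d in zip(signals, delays):
        out[d:d + len(x)] += x           # cohere the direct path; reflections scatter in time
    return out / len(signals)            # normalise by the number of sensors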

Fig. 2. Impulse response for a single receiver at coordinates (5, 0, 2) m in a room of dimensions 10 x 8 x 3 m. The source is at (8.5, 4.5, 1.35) m and the reflectivity of the walls is 0.9. The impulse response is simulated up to fifth-order images with a 64 kHz sampling rate. (Axes: relative pressure amplitude versus time in units of 15.625 msec; bin size 1/64000 sec.)



J.L. Flanagan et al. / Spatially selective sound capture

21 1

Fig. 3. Single beamforming using delay-and-sum. (The figure shows the sensor signals, with the direct path and reflections #1 through #(K-1), and the output of the single beam for an impulse source on the axis of the beam.)

Fig. 4. Simulation of a single beam of a 2-dimensional rectangular array steered to a source in a hard-walled room. A number of reflection paths are traced, showing that the single beam can "see" a number of images.

Notice that as the multipath becomes severe (i.e., K >> 1),

SNR_1 ≈ N / K,

which monotonically diminishes with K.

The point is that although the single beam mitigates the effect of some images, it nevertheless collects all sound along its axis. Figure 4 is a simulation of a two-dimensional array, single-beamed to a source. A few rays are plotted to show that, other than the direct path, some images are along the "bore" of the beam, leading to the degradation mentioned above.

Fig. 5. Different configurations of two- and three-dimensional arrays: 2-dimensional arrays on adjacent walls and ceiling, and a 3-dimensional array on the ceiling.




4. Multiple beamforming

Considering the deleterious effects on single beamforming, a clear incentive is to examine array designs that might collect the reverberant multipath energy in useful ways. One possibility is multiple beamforming on the direct path and on major images, the positions of which are known from the room geometry once the source location is given. In particular, three-dimensional array architectures are attractive for two reasons: they can provide some selectivity in range, and their

Fig. 6. Principle of multiple beamforming. (The figure shows the sensor signals with the direct path and reflections #1 through #(K-1), the outputs of beams 1 through B, and the output of the B beams for an impulse source at the focus, in which BN arrivals cohere and BN(K-1) distribute in relative time.)




beam patterns are sensibly independent of pointing direction (Flanagan, 1987). Configurations of particular interest are shown in Figure 5, and include a three-dimensional "chandelier" arrangement on the room ceiling and orthogonal two-dimensional arrays on adjacent walls.

Suppose, again, that one has N microphones in an array, and the acoustic environment gives rise to K significant multipaths (a direct path and K - 1 reflections). If one now forms B beams on, respectively, the source and main images, the impulse response of the array is illustrated in Figure 6. Now, for a source at the focal point of the array, BN of the BNK arrivals cohere and BN(K - 1) distribute in time. Hence, the ratio of undistorted signal output to reverberant noise power is bounded by

SNR_B <= (BN)^2 / [BN(K - 1)] = BN / (K - 1).

Now, for K >> 1 and B -> K, SNR_B -> N, independent of the number of multipaths.

4.1. Simulation results - multiple beamforming

To test this notion, a three-dimensional rectangular array of 7 x 7 x 7 microphones spaced by 4 cm is placed at the ceiling of a 7 x 5 x 3 meter room. The array design gives adequate spatial selectivity for a two-octave signal, 1-4 kHz. For no steering of the array (i.e., zero delay at every microphone), the response to an impulse of sound delivered at (1.80, 2.75, 0.75) meters is shown in Figure 7. In contrast, the array with 63 beams steered to all images through third order produces the impulse response of Figure 8. When natural speech is delivered at the focal point, the ratio of undistorted speech to reverberant noise

Fig. 7. Impulse response of an unsteered 7 x 7 x 7 array at the center of the ceiling of a 7 x 5 x 3 m room. Images through 3rd order are assumed. Sensor spacing is 4 cm. Source location is (1.8, 2.75, 0.75) m. (Time axis in units of 62.5 msec.)




power produced by the array is shown as a function of the number of beams in Figure 9a. The signal-to-noise ratio (SNR) for the simulation is calculated as follows:

SNR = (energy in the speech processed by the desired impulse response) / (energy in the speech processed by the undesired part of the impulse response)

    = Σ_n {s(n) * h_d(n)}^2 / Σ_n {s(n) * (h(n) - h_d(n))}^2,

where h(n) is the impulse response of the steered array and h_d(n) is the desired impulse response, which in the idealization of Figure 6 is a single pulse of strength BN, where B is the number of beams and N is the number of sensors. In the simula-

tion with spherical spreading and absorption, the desired signal is convolved with the maximum component of the impulse response shown in Figure 8. This definition of SNR leads to pessimistic values and ignores the fact that the ear makes useful integration of the signal convolved by impulse response components close to the maximum component. As shown in Figure 9a, the SNR of speech received through a single omnidirectional microphone is -11 dB and is of very low intelligibility. The SNR of speech received through the unsteered array is below -8 dB and is comparable to that for a single microphone. When 63 beams are steered to images through third order, the SNR improves to 13 dB, a 24 dB improvement over a single omnidirectional microphone. Figure 9b shows spectrograms for the original speech signal, the signal from the unsteered array and the signal captured by the 63-beam array.
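The SNR measure defined above can be computed directly. The short Python/NumPy sketch below is illustrative only, and its function name is an assumption: it takes the steered-array impulse response h, treats its single strongest tap as the desired response h_d, as in the simulations described here, and reports the ratio of the energies of the speech convolved with h_d and with h - h_d.

import numpy as np

def reverberant_snr_db(s, h):
    """SNR of a signal s observed through an array impulse response h, with the
    desired response taken as the single strongest tap of h and the remainder
    of h treated as reverberant interference."""
    s, h = np.asarray(s, float), np.asarray(h, float)
    h_d = np.zeros_like(h)
    k = np.argmax(np.abs(h))
    h_d[k] = h[k]                                  # desired part: the dominant arrival
    desired = np.convolve(s, h_d)
    undesired = np.convolve(s, h - h_d)
    return 10.0 * np.log10(np.sum(desired**2) / np.sum(undesired**2))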

Fig. 8. Impulse response of a 7 x 7 x 7 array at the center of the ceiling of a 7 x 5 x 3 m room steered to images through third order. The 3-dimensional array is steered to a source at (1.8, 2.75, 0.75) m. (Time axis in units of 62.5 msec.)




Also shown in Figure 9a is the SNR_B for two positions away from the focal point, 0.5 m (SNR = -5 dB) and 1.0 m (SNR = -10 dB), respectively. In this instance, where beam outputs are summed directly, the spatial selectivity in three dimensions is relatively acute. An alternative that leads to more practical focal volumes is to sum the spectral magnitudes of the individual beams and inverse Fourier transform this result for the array output (Elko et al., 1988).
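Multiple beamforming amounts to running several delay-and-sum beams at once, one aimed at the source and one at each selected image position (the image positions follow from the room geometry, for example by the mirror construction sketched in Section 2), and combining their outputs. The fragment below is an illustrative Python/NumPy sketch of the direct-sum variant analysed above; the function name and interface are assumptions, and the spectral-magnitude summation of Elko et al. (1988) would replace the final sum over beams.

import numpy as np

def multiple_beam_output(signals, mic_positions, focal_points, fs, c=343.0):
    """Sum of B delay-and-sum beams, one steered at the source and one at each
    chosen image position (focal_points holds the B positions).

    signals       -- (N, T) array of microphone samples
    mic_positions -- (N, 3) array of sensor coordinates in metres
    """
    signals = np.asarray(signals, float)
    mics = np.asarray(mic_positions, float)
    n_mics, n_samp = signals.shape
    # Integer steering delays (in samples) for every beam and every microphone.
    beam_delays = []
    for p in focal_points:
        dists = np.linalg.norm(mics - np.asarray(p, float), axis=1)
        beam_delays.append(np.round((dists.max() - dists) / c * fs).astype(int))
    out = np.zeros(n_samp + max(int(d.max()) for d in beam_delays))
    for delays in beam_delays:                 # one delay-and-sum beam per focal point
        for x, d in zip(signals, delays):
            out[d:d + n_samp] += x
    return out / (len(beam_delays) * n_mics)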

5. Matched-filter processing of microphone arrays

The well-known technique of matched filtering, when applied to microphone arrays, leads to advantages similar to those of multiple beamforming. The traditional matched filter is the time inverse of the impulse response of the system to be matched. Therefore, the impulse response of the filter applied to each sensor of the array is the impulse response from the focal point to that sensor. For the nth sensor of the array this impulse response is h_nf(t), as shown in Figure 10, and the matched filter of the nth sensor is h_nf(-t). Typically the latter is non-causal, and fixed delay and truncation are used to realize a causal filter that approximates the desired response. These relations are such that if a source located at the focal point emits a signal s(t) that travels to the nth sensor via a multipath with impulse response h_nf(t), then the temporal output of the matched-filter array is

o(t) = Σ_{n=1}^{N} s(t) * h_nf(t) * h_nf(-t)        (1)

     = s(t) * Σ_{n=1}^{N} φ(h_nf, h_nf),            (2)

Fig. 9a. SNR versus number of beams for a 7 x 7 x 7 array at the center of the ceiling of a 7 x 5 x 3 m room. Focus at (1.8, 2.75, 0.75) m. Sensor spacing = 4 cm. The speech signal is band-pass filtered to 1-4 kHz. (Curves are shown for the source at the focus, for a source at (2.1, 3.05, 1.05) m, 0.5 m from the focus, and for a source at (2.4, 3.35, 1.35) m, 1 m from the focus; the SNR of a single microphone on the ceiling, with third-order images, and of the unsteered array are indicated for reference.)




Fig. 9b. Sound spectrograms for (i) the original speech signal, (ii) the signal produced by the unsteered 7 x 7 x 7 array, and (iii) the signal provided by the array steered to 63 images. The original utterance is "The best way to learn is to solve extra problems" and the signal is band-pass filtered to 1-4 kHz.

where * designates convolution and φ(h_nf, h_nf) is the autocorrelation of the multipath response.

For a source located off the focal position, the path impulse response from the source to the nth sensor is h_ns(t) and the temporal output of the array is

o(t) = Σ_{n=1}^{N} s(t) * h_ns(t) * h_nf(-t) = s(t) * Σ_{n=1}^{N} φ(h_nf, h_ns),

where φ(h_nf, h_ns) is the cross-correlation of the impulse responses from the focus to the nth sensor and from the source to the nth sensor. Immediately one sees that the size of the focal volume for retrieval of low-distortion signals is conditioned by the spatial correlation of the impulse responses h_nf(t) and h_ns(t).

If the signal s(t) is an impulse of the characteristics previously mentioned, and if spherical spreading and absorption are ignored, the temporal relationships are illustrated in Figure 11. One notes the superficial similarity to the multiple beamformer. For K paths and N sensors, K^2 N arrivals appear at the output of the matched-filter array. For a source on focus, KN of these cohere (or, add voltage-wise) and K(K-1)N distribute in time (to sum power-wise). Arguments about SNR advanced previously lead to relations similar to those for the multiple beamformer, i.e., the

Fig. 10. Configuration of a matched-filter microphone array with N microphones in an enclosure. The impulse responses from the focus and from the source include the multipath structure.

Fig. 11. Principle of matched-filter processing. (The figure shows the sensor signals with the direct path and reflections #1 through #(K-1), the matched-filter outputs for each sensor, and the output of the N matched filters for K multipaths, in which KN arrivals cohere and K(K-1)N distribute in relative time.)




matched-filter array can produce an SNR that is sensibly independent of the number of multipaths.
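The matched-filter array is equally compact to sketch. In the illustrative Python/NumPy fragment below (not the authors' implementation; the function name and arguments are assumptions), each channel is filtered with the time-reversed focal-point impulse response h_nf, optionally truncated to a finite length so that the non-causal matched filter becomes realizable, and the filtered channels are summed. Steering the array to a different point simply means supplying the impulse responses measured or simulated from that point.

import numpy as np

def matched_filter_array(signals, focal_responses, filter_length=None):
    """Matched-filter array processing.

    signals         -- (N, T) array of microphone samples
    focal_responses -- list of N impulse responses h_nf from the focal point to
                       each sensor, including the multipath structure
    filter_length   -- optional truncation (in samples) of each matched filter
    """
    out = None
    for x, h_nf in zip(np.asarray(signals, float), focal_responses):
        h_nf = np.asarray(h_nf, float)
        if filter_length is not None:
            h_nf = h_nf[:filter_length]            # keep the early, strongest part
        g = h_nf[::-1]                             # time reversal: h_nf(-t)
        y = np.convolve(x, g)
        if out is None:
            out = y
        else:                                      # pad to a common length and sum over sensors
            n = max(len(out), len(y))
            out = np.pad(out, (0, n - len(out))) + np.pad(y, (0, n - len(y)))
    return out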

5.1. Simulation results - matched filtering

As an initial assessment of these relations, matched filters are applied to a two-dimensional array for the room conditions given in Figure 12. The 11 x 11 array with element spacing of 4 cm is adequate for a two-octave signal, 1-4 kHz. Speech received by the 11 x 11 array in the unsteered condition is processed by the impulse response of Figure 13, resulting in a speech SNR of -10 dB. For the matched-filter processing, the array impulse response is shown in Figure 14, and produces a speech SNR of 12 dB, a 22 dB increment over the unsteered condition. In comparison, the SNR of the output of a single, matched-filtered microphone located at the center of the array is approximately 2 dB. By contrast, if the 11 x 11 array is used in this hostile environment as a single beamformer of the delay-and-sum variety, its impulse response is shown in Figure 15 and the speech SNR is measured as 2 dB, a condition of poor intelligibility and quality. Indicative of the acute volume selectivity of the matched-filter array used in its full form, the impulse response for a source 1 meter from the focus is given in Figure 16. This condition produces a speech SNR of -5 dB. Research in progress is examining means for controlling this spatial volume selectivity through means such as

SIMULATION OF ARRAY PERFORMANCE IN AN ENCLOSURE
  ROOM:           10 x 8 x 3 m
  ARRAY CENTER:   5, 0, 2 m
  FOCUS/SOURCE:   8.5, 4.5, 1.35 m; spherical spreading
  ALPHA:          0.1
  IMAGES:         through fifth order
  ARRAY:          2D matched filters; D = 4 cm

Fig. 12. Configuration for simulation of a two-dimensional matched-filter array.




bandlimiting, decimating and truncating the matched filters.

6. Conclusion

Concomitant advances in transducer technology, signal processing theory, computing and microelectronics support sophisticated new techniques in sound capture under adverse acoustic conditions. In particular, multiple beamforming and matched filtering, applied to two- and three-dimensional arrays of sensors, can obtain high-quality signals for teleconferencing and multimedia use, freeing the user from encumbrances such as body-worn or hand-held equipment. Moreover, these techniques lead to sound capture from des-


ignated spatial volumes. Reciprocally, the same processing techniques permit projection of sound to the same spatially-selective points. Current research continues to characterize the control of spatial volumes, automatic location of desired sources and discrimination of speech/audio/music signals from undesired interference.

Acknowledgments

The research reported here is supported by the Circuits and Signal Processing Division of the National Science Foundation under Contract No. MIP-9121541, and by sustaining grants to the CAIP Center from the New Jersey Commission on Science and Technology.

Fig. 13. Impulse response for the 11 x 11 unsteered array. For a 1-4 kHz signal, the calculated speech SNR = -10 dB. (Time axis in units of 125 msec.)


Fig. 14. Impulse response for the 11 x 11 matched-filter array. For a speech signal of two octaves, 1-4 kHz, the calculated speech SNR = 12 dB. (Time axis in units of 125 msec.)



Fig. 15. Impulse response for the 11 x 11 single-beam delay-and-sum array. The calculated speech SNR = 2 dB. (Time axis in units of 125 msec.)




Fig. 16. Impulse response for the 11 x 11 matched-filter array for a source 1 meter off focus. The focal point is at (7.5, 4.5, 1.35) m instead of (8.5, 4.5, 1.35) m. The calculated speech SNR = -5 dB. (Time axis in units of 125 msec.)

References

J.B. Allen and D.A. Berkley (1979), "Image method for efficiently simulating small-room acoustics", J. Acoust. Soc. Amer., Vol. 65, No. 4, pp. 943-950.
D.A. Berkley and J.L. Flanagan (1990), "HuMaNet: An experimental human/machine communication network based on ISDN", AT&T Tech. J., pp. 87-98.
C.W. Che, M. Rahim and J.L. Flanagan (1992), "Robust speech recognition in a multimedia teleconferencing environment", J. Acoust. Soc. Amer., Vol. 92, No. 4, Pt. 2, p. 2476(A).
G.W. Elko, J.L. Flanagan and J.D. Johnston (1988), Sound location arrangement, U.S. Patent No. 4,741,038, 26 April 1988.
J.L. Flanagan (1987), "Three dimensional microphone arrays", J. Acoust. Soc. Amer., Vol. 82, No. 1, p. 539.
J.L. Flanagan, J.D. Johnston, R. Zahn and G.W. Elko (1985), "Computer steered microphone arrays for sound transduction in large rooms", J. Acoust. Soc. Amer., Vol. 78, pp. 1508-1518.
J.L. Flanagan, D.A. Berkley, G.W. Elko, J.E. West and M.M. Sondhi (1991), "Autodirective microphone systems", Acustica, Vol. 73, pp. 58-71.
J.L. Flanagan and H.F. Silverman, eds. (1992), Proc. Internat. Workshop on Microphone Array Systems, Technical Report No. LEMS-113, Brown University.
Y. Kaneda and J. Ohga (1984), "Adaptive microphone-array system for noise reduction", J. Acoust. Soc. Amer., Vol. 76, p. 584.
G. Naylor (1992), "Treatment of early and late reflections in a hybrid computer model for room acoustics", J. Acoust. Soc. Amer., Vol. 92, No. 4, Pt. 2, p. 2345(A).
G.M. Sessler and J.E. West (1969), "First-order gradient microphones based on the foil-electret principle: Discrimination against air-borne and solid-borne noise", J. Acoust. Soc. Amer., Vol. 46, pp. 28-36.
H.F. Silverman (1987), "Some analysis of microphone arrays for speech data acquisition", IEEE Trans. Acoust. Speech Signal Process., Vol. 35, No. 12, pp. 1699-1712.
R.M. Stern and A. Acero (1989), "Acoustical preprocessing for automatic speech recognition", DARPA Speech and Natural Language Workshop, Harwich Port, MA.