Beamforming microphone arrays for speech acquisition in noisy environments

Speech Communication 20 (1996) 215-227

Sven Fischer a,*, Klaus Uwe Simmer b

a University of Bremen, Department of Physics and Electrical Engineering, P.O. Box 330 440, D-28334 Bremen, Germany
b Houpert Digital Audio, Wiener Str. 5, D-28359 Bremen, Germany

Received 1 February 1996; revised 2 September 1996

Abstract

In this paper we present an adaptive microphone array with adaptive constraint values to suppress coherent as well as incoherent noise in disturbed speech signals. We use a generalized sidelobe cancelling (GSC) structure implemented in the frequency domain, since it allows the adaptive look-direction response, which suppresses incoherent noise, to be determined separately from the adaptive filters, which cancel coherent noise. The transfer function in the look-direction is an adaptive Wiener filter which is estimated using the short-time Fourier transform and the Nuttall/Carter method for spectrum estimation. The experimental results demonstrate that the proposed method works well for a large range of reverberation times and is therefore able to operate independently of the correlation properties of the noise field.

Keywords: Microphone-arrays; Audio-beamforming; Noise reduction; Speech processing

* Corresponding author. E-mail: [email protected].
Audio files available. See http://www.elsevier.nl/locate/specom
0167-6393/96/$15.00 Copyright © 1996 Elsevier Science B.V. All rights reserved. PII S0167-6393(96)00054-4

1. Introduction

Hands-free voice communication systems are desirable when the user needs his hands free for other tasks. Important applications for hands-free audio communication are mobile telephony, aids for the handicapped, automatic information systems, and the expanding field of multimedia applications. However, environmental noise severely limits the performance of hands-free speech acquisition devices. Therefore, techniques for selectively receiving a specific signal while suppressing interfering signals and noise are strongly desired. Several noise reduction techniques using one or two sensors have been proposed in the past 20 years (Lim, 1983). Most of them fail in practical applications due to the complexity of real noise fields. Microphone arrays seem to be the most promising technique today for reducing noise and reverberation in sound pick-up. The principal advantage of a microphone array is that it provides spatial selectivity which can be steered by electronic means.

Beamforming microphone arrays for speech applications have been investigated by several authors. The different approaches can be classified into three main categories:
- conventional beamforming,
- adaptive beamforming,
- microphone arrays with adaptive postfiltering.

The simplest method, known as delay-and-sum beamforming (or conventional beamforming), involves shifting the data so that the desired speech signals on all channels are aligned and then averaging over all channels to obtain a single output signal (Pirz, 1979; Flanagan, 1985; Flanagan et al., 1985; Kellermann, 1991; Sydow, 1994). Such systems are very robust and need less complex signal processing but in general require many sensors to yield a high directivity and noise reduction (Flanagan et al., 1991). To overcome the problem of low angular resolution at low frequencies and too narrow beams at high frequencies, splitting the array into subarrays has been successfully implemented (Flanagan et al., 1985; Kellermann, 1991; Mahieux et al., 1993; Khalil et al., 1994; Mahieux et al., 1995).

Limitations associated with small array apertures can be somewhat circumvented by using adaptive filters behind each sensor. The filters can be optimized to null out signals arriving from directions other than the specified look-direction. This technique is known as adaptive beamforming (Widrow et al., 1967; Frost, 1972; Griffiths and Jim, 1982; Cox et al., 1987; Van Veen and Buckley, 1988) and its application to speech enhancement has been investigated by several authors (Kaneda and Ohga, 1986; Sondhi and Elko, 1986; Van Compernolle et al., 1990; Dowling et al., 1992; Grenier, 1993; Nordholm et al., 1993; Peterson et al., 1987; Greenberg and Zurek, 1992; Hoffman et al., 1994). Adaptive beamforming works well if the number of point noise sources is smaller than the number of sensors, which is often the case in free-field wave propagation. However, in closed environments noise is influenced by multipath propagation and reverberation, which yields a multi-source noise field. In such diffuse noise situations the performance of adaptive beamformers is limited.

For incoherent (i.e. diffuse) noise fields, postfiltering of a conventional beamformer output can be used for noise suppression as proposed by Zelinski (1988, 1990). This delay-sum-and-filter technique is also adaptive but differs from the adaptive beamforming approach in the optimization criterion used. This method yields a high noise reduction performance in diffuse environments with a relatively small number of sensors; however, in the case of coherent direct path noise, signal distortions might appear in the output signal. Improvements of this method are proposed by Simmer and Wasiljeff (Simmer and Wasiljeff, 1992; Yang et al., 1993). Conventional beamforming combined with spectral subtraction can also be classified in this category (Gierl, 1990; Kroschel, 1991). A comparison of these methods in a car environment can be found in (Affes and Grenier, 1994).

From the above statements we can conclude that the adaptive beamforming technique and the microphone arrays with adaptive postfiltering require contrary assumptions about the correlation properties of the noise field. Therefore, the performance of these array techniques for noise reduction depends on the acoustical environment in which they have to operate. In practice, noise fields are neither perfectly diffuse nor do they consist of direct-path noise only. The reflection coefficients of the walls as well as the distance between the noise sources and the array determine the ratio of coherent and incoherent noise components received by a microphone array. A practical system for noise reduction has to operate independently of the correlation properties of the noise field. The method presented here is able to suppress coherent (i.e. direct path) noise and incoherent (i.e. diffuse) noise and can be conceived as a unification of the three array techniques for noise reduction mentioned above.

In Section 2 the linearly constrained beamforming problem is reviewed. The problems associated with this technique for noise reduction in closed environments are discussed in Section 2.1. In Section 3 we generalize the linearly constrained beamformer to include a data dependent look-direction response to suppress incoherent noise. Two possible implementations (a direct form and a GSC structure) are described. Section 4 describes the simulation method used to validate our noise reduction approach, and some experimental results are given.
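The delay-and-sum principle described in the introduction can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not part of the paper: the steering delays are whole samples and already known, and the function name `delay_and_sum` and the toy signal are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer steering delay (in samples) and average."""
    # Only keep output samples for which every delayed channel has data.
    n_out = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(n_out)]

# Toy example: the same waveform reaches three microphones 0, 1 and 2 samples late.
speech = [0.0, 1.0, -2.0, 3.0, 0.5, -1.0]
delays = [0, 1, 2]
channels = [[0.0] * d + speech + [0.0] * (2 - d) for d in delays]
aligned = delay_and_sum(channels, delays)
```

In this noise-free toy case the aligned average reproduces the source waveform; with independent noise on each channel the same averaging would attenuate the noise relative to the speech.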

2. Linearly constrained broadband beamforming

The objective of linearly constrained adaptive beamforming is to minimize the total output power of the array subject to the constraint of preserving an a priori specified impulse response in the look-direction. The system we are looking at is a direct form broadband beamformer as depicted in Fig. 1(a). The array consists of M microphones, each of whose output signals feeds an FIR filter with K taps. The M-dimensional vector of data observed at the output of the array at the nth time instant is the snapshot vector $\tilde{x}(n) = [x_1(n), x_2(n), \ldots, x_M(n)]^T$. The MK-dimensional stacked snapshot vector, containing K delayed snapshot vectors, is $x(n) = [\tilde{x}^T(n), \tilde{x}^T(n-1), \ldots, \tilde{x}^T(n-K+1)]^T$. The beamformer output is then given by the inner product of the stacked snapshot vector and a (real valued) MK-dimensional weight vector w:

$$y(n) = w^T x(n). \tag{1}$$

Fig. 1. Block diagram of two possible implementations of the linearly constrained broadband beamformer. (a) Direct form, (b) GSC structure.

For the linearly constrained minimum variance beamformer, w is the solution to the problem

$$\min_{w} \; w^T R_{xx} w \tag{2a}$$

subject to the constraint

$$C^T w = f, \tag{2b}$$

where $R_{xx}$ is the covariance matrix of the input data vector. For the steer-direction gain-only constraints the constraint matrix C is sparsely constructed and given by $C = I_K \otimes 1_M$. $I_K$ is the K-dimensional identity matrix, $1_M$ is a column vector containing M ones and $\otimes$ denotes the Kronecker product. The K-dimensional vector f describes the impulse response of the beamformer to a signal impinging on the array from the desired look-direction. For example, if f is a vector which contains a single one and K - 1 zeros, the system will pass any signal which is incident on the array from the look-direction without distortion. The weight vector which satisfies Eqs. (2a) and (2b) is obtained via Lagrange multipliers as (Frost, 1972)

$$w_{\mathrm{opt}} = R_{xx}^{-1} C \left( C^T R_{xx}^{-1} C \right)^{-1} f. \tag{3}$$
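To make the structure of the gain-only constraint concrete, the following sketch builds $C = I_K \otimes 1_M$ and evaluates $C^T w$ for plain delay-and-sum weights. The helper names and the sizes M = 7, K = 4 are illustrative assumptions, and the matrices are stored as nested lists rather than with a linear algebra library.

```python
def kron_identity_ones(K, M):
    """Return C = I_K (x) 1_M as an MK x K nested list."""
    # Row r belongs to delayed snapshot r // M, so it has a one in that column.
    return [[1.0 if row // M == col else 0.0 for col in range(K)]
            for row in range(M * K)]

def constraint_response(C, w):
    """Compute C^T w, the look-direction impulse response of the weights w."""
    return [sum(C[r][c] * w[r] for r in range(len(w))) for c in range(len(C[0]))]

M, K = 7, 4
C = kron_identity_ones(K, M)
# Delay-and-sum weights: 1/M on the current snapshot, zero on delayed snapshots.
w_das = [1.0 / M] * M + [0.0] * (M * (K - 1))
f = constraint_response(C, w_das)
```

Here `f` comes out as a single one followed by K - 1 zeros, so a conventional beamformer already satisfies the distortionless constraint of Eq. (2b); the adaptive solution of Eq. (3) uses the remaining degrees of freedom to reduce output power.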

An iterative algorithm to solve the above equation has been introduced by Frost in his well-known adaptive beamforming algorithm (Frost, 1972). The linearly constrained beamformer can be implemented as a Generalized Sidelobe Canceller (GSC). A block diagram of the GSC is shown in Fig. 1(b). The GSC consists of an upper non-adaptive path, which is designed to pass any signal meeting the constraint conditions, and a lower adaptive path containing a blocking matrix B which prevents any such signal from reaching the adaptive filters v. The weight vector of the upper path and B are fixed processors, and v is adjusted to minimize the total output power of the system. Conditions under which the GSC implementation is equivalent to the direct form were derived by Jim (1977) and generalized by Buckley (1986). The columns of the blocking matrix must be linearly independent while satisfying BC = 0. The GSC can be viewed as a mapping of the constrained least-squares problem into an unconstrained minimization problem. Therefore, well-known unconstrained adaptation algorithms like the recursive least squares or frequency domain least mean square algorithms can be used to improve convergence for speech applications (Chen and Fang, 1992; An and Champagne, 1994). The GSC will serve as a useful vehicle for generalizing the linearly constrained beamformer to include an arbitrary (data dependent) look-direction response, as will be shown in Section 3.

Fig. 2. Noise reduction as function of input SNR and spatial coherence of the respective noise fields. Panels (a) and (b) plot noise reduction against input SNR [dB] for the anechoic chamber and the office room, respectively; panels (c) and (d) plot the corresponding spatial coherence against frequency [Hz].

2.1. Application to speech enhancement

First experiments with broadband beamformers were carried out in an anechoic chamber. We used a seven element linear array with 5 cm inter-element spacing and FIR filters with only 32 taps (sampling frequency 8 kHz); the use of more taps leads to marginal improvements only. Note that these filters are very short compared with the filter lengths often used in acoustic echo cancellation. The desired speech signal came from broadside, and as noise source we used a hair dryer, which was positioned approximately 2 meters away from the center of the array. Given the small aperture used, we can assume that the noise source is in the far field of the array. However, since the distance between the speaker and the array center (50 cm) is significant with respect to the array dimension, curvature of the acoustic wave front must be considered. The main beam of the array is focused on the desired speaker assuming spherical wave propagation.

Fig. 2(a) shows the SNR improvement as function of the input SNR for the Frost beamformer (Frost, 1972), the Griffiths-Jim implementation of the GSC (Griffiths and Jim, 1982) and a conventional (delay & sum) beamformer. The spatial coherence $C_{ij}(\omega)$ of the noise field in the anechoic chamber is shown in Fig. 2(c). The (magnitude squared) coherence function $C_{ij}(\omega)$ is defined according to the following formula:

$$C_{ij}(\omega) = \frac{\left| \Phi_{x_i x_j}(\omega) \right|^2}{\Phi_{x_i x_i}(\omega) \, \Phi_{x_j x_j}(\omega)}, \tag{4}$$

where $\Phi_{x_i x_j}(\omega)$ is the spatial cross power density spectrum between sensor signals i and j; $\Phi_{x_i x_i}(\omega)$ and $\Phi_{x_j x_j}(\omega)$ are the auto power spectral densities of microphone signals i and j, respectively. $C_{ij}(\omega)$ is a measure of the linear dependence between the two sensor signals i and j. As we can see from Fig. 2(a), the adaptive beamforming approach yields a very high noise reduction performance compared with the conventional beamformer. The output signal is nearly noise-free.

A dramatic loss of performance can be observed if the same experiment is carried out in a highly reverberant room (reverberation time $T_{60} \approx 300$ ms). The results are shown in Fig. 2(b) (note the different scaling of the y-axis). The spatial coherence of the noise field is shown in Fig. 2(d). The dashed line in Fig. 2(d) shows the theoretical coherence of a diffuse noise field (Simmer and Wasiljeff, 1992). Although only one active noise source was present, we obtained a diffuse noise situation and the different beamformers lost their effectiveness, especially for higher signal-to-noise ratios (i.e. higher input speech quality). Increasing the number of microphones will lead to better performance in the office room but has no effect on the adaptive beamformers in the anechoic chamber when only one interfering source is present.

The above experiment shows that the predominant limitation to the performance of adaptive beamformers for speech applications is caused by the structure of the acoustical environment. In particular, reverberation yields a diffuse noise situation and the degrading effect on system performance is evident. Therefore the performance in diffuse noise fields must be improved before adaptive beamforming can provide an acceptable solution to the noise reduction problem.
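For orientation, the theoretical coherence of a diffuse field can be evaluated numerically. The sketch below assumes the commonly used spherically isotropic model, in which the magnitude-squared coherence between two omnidirectional sensors is $\left(\sin(\omega d/c)/(\omega d/c)\right)^2$; this formula, the speed of sound of 343 m/s and the function name are assumptions introduced here, not values quoted from the paper.

```python
import math

def diffuse_msc(freq_hz, spacing_m, c=343.0):
    """Magnitude-squared coherence of an ideal diffuse field between two sensors."""
    arg = 2.0 * math.pi * freq_hz * spacing_m / c
    if arg == 0.0:
        return 1.0          # limit of (sin(x)/x)^2 for x -> 0
    return (math.sin(arg) / arg) ** 2

d = 0.05                    # 5 cm sensor spacing, as in the experiments
curve = [(f, diffuse_msc(f, d)) for f in range(0, 4001, 500)]
first_null_hz = 343.0 / (2.0 * d)   # coherence vanishes where omega*d/c = pi
```

Under these assumptions the coherence between adjacent sensors stays high only at low frequencies and reaches its first null near 3.4 kHz, which suggests why cancellation that relies on coherent noise degrades across most of the speech band in a reverberant room.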

3. Constrained adaptive beamforming with adaptive look-direction response

The data vector x(n) can be represented as a sum of the desired speech signal s(n) and all other undesired terms m(n) representing noise and interference signals. If the steering delays are adjusted so that the desired signal is identical at the filter inputs, we can write for the data vector:

$$x(n) = C \mathbf{s}(n) + m(n), \tag{5}$$

where $\mathbf{s}(n) = [s(n), s(n-1), \ldots, s(n-K+1)]^T$ contains K delayed samples of the desired speech signal, m(n) is the MK-dimensional vector containing the noise components at the taps of the beamformer and C is the steer-direction gain-only constraint matrix (see Section 2). As pointed out by Frost (1972), the response of the array to a desired signal incident from the look-direction is equivalent to that of an FIR filter with K taps. For the output signal $y_s(n)$ of this equivalent FIR filter we can write

$$y_s(n) = w^T C \mathbf{s}(n) = f^T \mathbf{s}(n). \tag{6}$$

The constraint values f are identical with the impulse response of the equivalent FIR filter in the look-direction, and arbitrary values for these coefficients can be enforced by using K linear constraints.

3.1. Adaptive look-direction response

Ideally, a linearly constrained beamformer operates in an environment where a single desired signal from a known direction is incident upon the array while some interference signals arrive from other unknown directions. This ideal situation is hardly ever met in speech applications. The acoustical environment in beamforming for speech applications is highly reverberant, making multipath propagation and reverberation the rule rather than the exception. The degrading effect on system performance has been shown experimentally in Section 2.1. There is always some incoherent diffuse noise present, which reaches the microphones from all directions. Such noise is reduced by the adaptive beamformer, as in a conventional beamformer, by adding the desired signal coherently at the array output while adding the noise incoherently.

Noise arriving from the look-direction can be suppressed by a suitable choice of the frequency response in that direction. Denoting by $\mathbf{x}(n)$ the K-dimensional vector of the signal from the look-direction, containing the desired signal s(n) and incoherent noise n(n), we choose the look-direction response f so that the output signal $y_s(n)$ of the equivalent FIR filter in look-direction is the least-squares estimate of the desired speech signal s(n). Minimization of the mean squared error leads to the well-known matrix formulation of the Wiener-Hopf equation under the assumption of uncorrelated speech and noise:

$$R_x f = r_s, \tag{7}$$

where $R_x$ is the K × K autocorrelation matrix of the input signal of the equivalent FIR filter and $r_s$ is the K × 1 autocorrelation vector of the desired speech signal. The optimal impulse response in the look-direction is then given by

$$f_{\mathrm{opt}} = R_x^{-1} r_s. \tag{8}$$

If there is no noise from the look-direction (n(n) = 0), the optimal impulse response in the look-direction is of course the Kronecker delta. For application of Eq. (8) the matrix $R_x$ and the vector $r_s$ have to be estimated from the received microphone signals. The elements $r_x(m) = E\{x(n)x(n+m)\}$, m = 0, 1, ..., K - 1, of the autocorrelation matrix $R_x$ can be estimated directly from the microphone data. For the input signal of the equivalent FIR filter in look-direction we write

$$\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-K+1)]^T \tag{9}$$

$$= \frac{1}{M} \left[ \mathrm{sum}\{\tilde{x}(n)\}, \mathrm{sum}\{\tilde{x}(n-1)\}, \ldots, \mathrm{sum}\{\tilde{x}(n-K+1)\} \right]^T, \tag{10}$$

where $\mathrm{sum}\{\tilde{x}(n)\}$ denotes the sum of the elements of the snapshot vector $\tilde{x}(n)$ (see Section 2). Assuming N data samples (N > K) are available, we use the biased estimate for the autocorrelation sequence of the input signal of the equivalent FIR filter in look-direction:

$$\hat{r}_x(m) = \frac{1}{N} \sum_{n=0}^{N-m-1} x(n) x(n+m), \quad m = 0, \ldots, K-1. \tag{11}$$

The elements $r_s(m)$ of the autocorrelation vector $r_s$ can be estimated from the spatial cross-correlation of the disturbed input signals if the noise between adjacent sensors is uncorrelated (Zelinski, 1988). In this case, for $i \neq j$,

$$E\{x_i(n) x_j(n+m)\} = E\{(s(n) + n_i(n))(s(n+m) + n_j(n+m))\} \tag{12a}$$

$$= r_s(m), \tag{12b}$$

and therefore we obtain an estimate for $r_s(m)$:

$$\hat{r}_s(m) = \frac{2}{N M (M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \sum_{n=0}^{N-m-1} x_i(n) x_j(n+m), \quad m = 0, \ldots, K-1. \tag{13}$$
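The two correlation estimates can be written out directly. The sketch below follows Eqs. (10), (11) and (13) in the time domain, assuming the microphone signals are already time-aligned and stored as equal-length lists; the function names are hypothetical, and no attempt is made at the frequency-domain speed-up mentioned in the text.

```python
def beamformer_output(mics):
    """Average the aligned microphone signals sample by sample, as in Eq. (10)."""
    M = len(mics)
    return [sum(ch[n] for ch in mics) / M for n in range(len(mics[0]))]

def autocorr_biased(x, K):
    """Biased autocorrelation estimate of Eq. (11) for lags 0..K-1."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) / N for m in range(K)]

def speech_autocorr_from_pairs(mics, K):
    """Average cross-correlation over all sensor pairs i < j, as in Eq. (13)."""
    M, N = len(mics), len(mics[0])
    scale = 2.0 / (N * M * (M - 1))
    return [scale * sum(mics[i][n] * mics[j][n + m]
                        for i in range(M - 1)
                        for j in range(i + 1, M)
                        for n in range(N - m))
            for m in range(K)]

# Noise-free toy case: every microphone observes the same signal.
mics = [[1.0, -1.0, 2.0, 0.5] for _ in range(3)]
r_x = autocorr_biased(beamformer_output(mics), 2)
r_s = speech_autocorr_from_pairs(mics, 2)
```

With identical, noise-free channels the two estimates coincide, which is the case in which Eq. (8) yields the Kronecker delta; with spatially uncorrelated noise added, `r_x` would grow while `r_s` would, on average, stay at the speech autocorrelation.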

The convolutional computations for estimating the elements of $R_x$ and $r_s$ can also be carried out in the frequency domain as described in (Zelinski, 1988). Once the correlation sequences have been estimated from a finite data segment (e.g. K = 32 coefficients from N = 256 data samples at 8 kHz sampling rate), the normal equations (8) may be solved using the Levinson recursion. With the optimal impulse response $f_{\mathrm{opt}}$, the Frost algorithm (Frost, 1972) can be used to minimize the total output power of the beamformer. The disadvantage of this combined block and iterative processing is that the convergence rates of the two adaptive processes have to be matched to each other to guarantee a stable overall performance. A more robust implementation is obtained by using a block-adaptation scheme as described in the next section.
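As an illustration of the last step, a textbook Levinson recursion for a symmetric Toeplitz system is sketched below. It is a generic solver written for this note under the assumption that the matrix is positive definite; it is not taken from the paper, and the function name and test values are hypothetical.

```python
def levinson_solve(t, b):
    """Solve T f = b, where T[i][j] = t[|i - j|] is symmetric positive definite."""
    a = [1.0]                 # prediction error filter of the current order
    err = t[0]                # prediction error power
    f = [b[0] / t[0]]         # solution of the order-1 system
    for m in range(1, len(t)):
        # Durbin step: extend the prediction error filter to order m.
        k = -sum(a[i] * t[m - i] for i in range(m)) / err
        a_ext = a + [0.0]
        a = [a_ext[i] + k * a_ext[m - i] for i in range(m + 1)]
        err *= 1.0 - k * k
        # Levinson step: correct the solution so that row m is satisfied too.
        q = (b[m] - sum(f[i] * t[m - i] for i in range(m))) / err
        f = [(f[i] if i < m else 0.0) + q * a[m - i] for i in range(m + 1)]
    return f

r_x = [2.0, 1.0, 0.5]                               # example autocorrelation sequence
f_delta = levinson_solve(r_x, r_x)                  # r_s = r_x: no look-direction noise
f_general = levinson_solve(r_x, [1.0, 2.0, 3.0])    # arbitrary right-hand side
```

When the right-hand side equals the first column of the matrix, the recursion returns the Kronecker delta, matching the noise-free case discussed in Section 3.1. The recursion needs on the order of K² operations, compared with K³ for a general linear solver.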

3.2. Open-loop frequency domain implementation of the GSC

The Generalized Sidelobe Canceller (as described briefly in Section 2) serves as a useful tool for implementing the adaptive look-direction response beamformer. To separate the two adaptive processes, an open-loop block-adaptation scheme is used. The block diagram is shown in Fig. 3. The system operates completely in the frequency domain with the short-time Fourier transform and the overlap-add method. For time-delay estimation we use the Generalized Cross Correlation method (GCC) (Carter, 1993) and the time delay compensation is performed in the frequency domain. A parabolic interpolator is used to yield a temporal resolution of one fourth of the sampling period (Boucher and Hassab, 1981). The Smooth Coherence Transform (SCOT) and the Approximate Maximum Likelihood (AML) approach used as weighting functions in the GCC were found to work well in pure diffuse noise fields (Kuczynski et al., 1994). In the case of coherent direct path noise the time delay estimation unit may steer the main beam into the direction of the noise source, especially when the input SNR is low. This worst case can be limited if the desired speaker is closer to the array than the noise source and if the maximum search sector of the GCC is limited. Then the GCC works well for combined noise fields, too.

Fig. 3. Block diagram of the open-loop adaptive look-direction response beamformer.

The sidelobe cancelling part (lower signal track in Fig. 3) supplies a minimum mean square estimate of the signal $Y_w(k)$. Hence, the optimal values for the transfer functions $H_i$ are given by the standard expression for the unconstrained linear estimator:

$$H_i(e^{j\Omega}) = \frac{\Phi_{\delta_i Y_w}(e^{j\Omega})}{\Phi_{\delta_i \delta_i}(e^{j\Omega})}, \quad i = 1, \ldots, L, \tag{14}$$

where the number L depends on the blocking matrix structure (L can be M - 1 or less). The cross power density spectrum $\Phi_{\delta_i Y_w}$ and the auto power density spectrum $\Phi_{\delta_i \delta_i}$ of Eq. (14) are estimated by using the recursive update formulas:

$$\hat{\Phi}^{(l)}_{\delta_i Y_w}(k) = \alpha \, \hat{\Phi}^{(l-1)}_{\delta_i Y_w}(k) + \delta^*_{i,l}(k) \, Y_{w,l}(k), \tag{15}$$

$$\hat{\Phi}^{(l)}_{\delta_i \delta_i}(k) = \alpha \, \hat{\Phi}^{(l-1)}_{\delta_i \delta_i}(k) + \left| \delta_{i,l}(k) \right|^2, \tag{16}$$

where k is the frequency index, l is the time segment index, $\delta_{i,l}(k)$ is the short-time spectrum at the output of the signal blocking unit and $Y_{w,l}(k)$ is the postfiltered output spectrum of the conventional beamformer (see also Fig. 3). In Eqs. (15) and (16), $\alpha$ is a number close to one and defines the averaging time.

The GSC approach for noise reduction is closely related to adaptive noise cancelling proposed by Widrow et al. (1975). The noise reduction which can be achieved by this type of processor is completely specified by the spatial coherence of the noise field. A reasonable noise reduction can be achieved if the noise signals between adjacent microphones are highly correlated (Armbrüster et al., 1986). Therefore, this part of our noise reduction system is able to suppress the coherent direct path noise but is inefficient for incoherent noise. Incoherent noise components are partially suppressed by the conventional beamformer (upper path in Fig. 3) and more effectively by the postfilter W, which contains the constraint values. The optimal constraint values are given by the normal equations (8). Using the Z-transform and evaluating the result on the unit circle we obtain the following expression for the transfer function estimate in look-direction under the assumption of spatially incoherent noise (Simmer et al., 1994):

$$\hat{W}(e^{j\Omega}) = \frac{\dfrac{2}{M(M-1)} \displaystyle\sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \hat{\Phi}_{x_i x_j}(e^{j\Omega})}{\hat{\Phi}_{\bar{x}\bar{x}}(e^{j\Omega})}. \tag{17}$$

$\hat{\Phi}_{\bar{x}\bar{x}}(e^{j\Omega})$ is the auto power density spectrum of the output signal $\bar{x}$ of the conventional beamformer and $\hat{\Phi}_{x_i x_j}(e^{j\Omega})$ is the cross power density spectrum between microphone signals i and j. To ensure that the estimate of the power spectrum $\hat{\Phi}_{ss}(e^{j\Omega})$ is positive, we take the modulus of the estimated spatial cross power density spectrum (numerator in Eq. (17)). It can be shown that this transfer function $\hat{W}$ is identical with the transfer function of a non-causal Wiener filter in the case of zero spatial correlation of the noise signals (Simmer et al., 1994). In the case of a completely coherent noise field the transfer function $\hat{W}$ equals one and the noise reduction is only due to the sidelobe cancelling path of the system shown in Fig. 3. The power density spectra in the numerator and denominator of Eq. (17) can be estimated in a manner similar to Eqs. (15) and (16) from the short-time spectra:

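$$\hat{\Phi}^{(l)}_{ss}(k) = \alpha \, \hat{\Phi}^{(l-1)}_{ss}(k) + \frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} X^*_{i,l}(k) \, X_{j,l}(k), \tag{18}$$

$$\hat{\Phi}^{(l)}_{\bar{x}\bar{x}}(k) = \alpha \, \hat{\Phi}^{(l-1)}_{\bar{x}\bar{x}}(k) + \left| \bar{X}_l(k) \right|^2. \tag{19}$$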

$X_{i,l}(k)$ and $X_{j,l}(k)$ are the DFT coefficients for block l and frequency bin k of microphone signals i and j, respectively, * denotes complex conjugation and $\bar{X}_l(k)$ are the DFT coefficients for block l and frequency bin k of the output signal of the conventional beamformer. The transfer functions $\hat{W}$ and $H_i$ are determined as the data arrives at the input microphones. Thus, the adaptation of all the transfer functions takes place simultaneously.

It should be noted that the three array techniques for noise reduction described in Section 1 are included in the adaptive look-direction response beamformer as special cases: (1) for spatially coherent noise we obtain a standard linearly constrained beamformer which nulls out signals arriving from outside the look-direction; (2) for completely incoherent noise fields the system is equivalent to the post-filtering approach; and (3) if desired, a conventional beamformer output is also available. Consequently, no a priori assumptions about the correlation properties of the noise field are required.
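The postfilter estimate for a single frequency bin can be sketched as follows. The code tracks the recursive averages of Eqs. (18) and (19), takes the modulus of the numerator as described above, and additionally limits the gain to the range from zero to one. The forgetting factor 0.9, the function name and the toy inputs are assumptions for illustration only.

```python
def postfilter_gain(blocks, alpha=0.9):
    """Track Eqs. (18)-(19) for one frequency bin and return the gain of Eq. (17).

    blocks: list of time blocks; each block holds the M complex DFT
    coefficients X_{i,l}(k) of the time-aligned microphone signals.
    """
    phi_ss = 0.0 + 0.0j   # recursively averaged cross spectrum, Eq. (18)
    phi_xx = 0.0          # recursively averaged beamformer spectrum, Eq. (19)
    for X in blocks:
        M = len(X)
        cross = sum(X[i].conjugate() * X[j]
                    for i in range(M - 1) for j in range(i + 1, M))
        phi_ss = alpha * phi_ss + 2.0 / (M * (M - 1)) * cross
        X_bar = sum(X) / M                 # conventional beamformer output
        phi_xx = alpha * phi_xx + abs(X_bar) ** 2
    if phi_xx == 0.0:
        return 0.0
    gain = abs(phi_ss) / phi_xx            # modulus keeps the numerator non-negative
    return min(1.0, max(0.0, gain))        # limit the gain to the range [0, 1]

coherent = [[1 + 1j] * 4 for _ in range(5)]    # same coefficient at every sensor
single = [[1, 0, 0, 0] for _ in range(5)]      # energy at one sensor only
g_coherent = postfilter_gain(coherent)
g_single = postfilter_gain(single)
```

A component that is identical at all sensors passes with a gain close to one, whereas energy that shows no cross-sensor correlation is attenuated, here completely.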

3.2.1. Improvement of the transfer function estimate $\hat{W}$

The transfer function estimate $\hat{W}$ given by Eq. (17) is based on the identity in Eqs. (12a) and (12b). But this identity holds only in a statistical sense. In practice, only estimates of the spatial cross correlations are available, and validation of the identity in Eqs. (12a) and (12b) requires infinitely long time-averaging. Due to the nonstationary nature of speech signals, only short time intervals are available for spectrum estimation. Therefore, the transfer function $\hat{W}$ is only a rough estimate of the true Wiener filter.

To improve the estimate $\hat{W}$ we use the combined time and lag weighting technique for periodogram smoothing as introduced by Nuttall and Carter (1982), which was adapted to our application. The starting point is Eqs. (18) and (19), which describe a short-time Weighted Overlapped Segment Averaging (WOSA) method (excluding constant factors). These estimates are subjected to an inverse Fourier transform to yield the correlation function estimates $\hat{r}_{ss}$ and $\hat{r}_{\bar{x}\bar{x}}$, respectively. In the next step, the correlation estimates are multiplied by a symmetric real lag weighting function $w_{\mathrm{lag}}$, which takes into account the windowing of the input data prior to the computation of the FFTs, and is calculated according to the following expression (Nuttall and Carter, 1982):

$$w_{\mathrm{lag}}(n) = \frac{w_d(n) \, r_{ww}(0)}{r_{ww}(n)}. \tag{20}$$
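Eq. (20) can be evaluated directly for non-negative lags. In the sketch below the data window is assumed to be a Hann window of 256 samples and the desired lag window is the decaying half of a Hann window reaching zero at lag 32 (one fourth of an assumed FFT length of 256, split symmetrically around lag zero); these sizes and the function names are illustrative assumptions.

```python
import math

def hann(length):
    """Symmetric Hann (Hanning) window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def window_autocorr(w, lag):
    """Autocorrelation r_ww(lag) of the data window w."""
    return sum(w[n] * w[n + lag] for n in range(len(w) - lag))

def reshaped_lag_window(data_window, desired_lag_window):
    """Eq. (20): w_lag(n) = w_d(n) * r_ww(0) / r_ww(n) for lags n >= 0."""
    r0 = window_autocorr(data_window, 0)
    return [wd * r0 / window_autocorr(data_window, n)
            for n, wd in enumerate(desired_lag_window)]

data = hann(256)
n_max = 32
desired = [0.5 + 0.5 * math.cos(math.pi * n / n_max) for n in range(n_max + 1)]
w_lag = reshaped_lag_window(data, desired)
```

Because the window autocorrelation decreases with lag, the reshaped window is never smaller than the desired one; the division compensates for the taper that the data window has already imposed on the correlation estimate.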

In Eq. (20), $w_d$ is the desired lag window (in our case a Hanning window of one fourth of the FFT length to perform the desired smoothing), $r_{ww}$ is the autocorrelation function of the data window, and $w_{\mathrm{lag}}$ is the reshaped lag window. In a final step the weighted correlation estimates are transformed back into the frequency domain to obtain the desired power density spectra used to determine $\hat{W}$ according to Eq. (17). This combined smoothing in time and frequency yields estimates with low variance by using only relatively short data segments. The musical tones in the output signal are reduced noticeably.

Theoretically, a Wiener filter frequency response is a real positive function in the range $0 \le W(e^{j\Omega}) \le 1$. But due to estimation errors of the power spectral densities this property is not always guaranteed. To further improve the transfer function estimate we finally constrain $\hat{W}$ to values between zero and one.

3.2.2. Spatiotemporal blocking matrix

The rows of the blocking matrix can be interpreted as fixed beamformers, each of them forming a spatial null in the look-direction (Griffiths and Jim, 1982). In its simplest form, the signal blocking is realized by taking the difference between the signal samples of adjacent sensors to yield (in the ideal case) noise-only reference signals. The (far-field) beampattern of this two-element "blocking beamformer" is shown in Fig. 4(a) for 5 cm sensor distance (note that the sensor distance not only has an effect on the delay-and-sum section but also on the blocking beamformer). As we can see from this figure, there is a zero in the look-direction (zero degrees in this example), but signals from outside the look-direction (i.e. the noise) will also be suppressed, especially at low frequencies. In addition to this spatial filtering, the signal blocking causes a temporal high-pass filtering of the array signals, as can be seen in Fig. 4(a). Therefore, the transformation filters $H_i$ in Fig. 3 must have large gains to form a proper cancelling signal $Y_w(k)$. The transformation filters are theoretically given by Eq. (14) for time stationary signals. In practice, however, only estimates are available and the filter order is limited. There is always a potential for mismatch in the transfer functions $H_i$, and since these filters have to operate over a large range of gain values, a mismatch can result in a very distorted output signal.

Fig. 4. Beam-power-pattern of the blocking beamformer as a function of azimuth angle [deg.] and frequency [Hz]. (a) Standard blocking matrix, (b) temporally filtered blocking matrix.

In order to emphasize the low frequency components in the signals $\delta_i$ we include a fixed temporal lowpass filter in the blocking matrix. The beampattern of this filter-and-sum blocking beamformer is shown in Fig. 4(b). Signals arriving from outside the look-direction will not be suppressed. Hence, the filters $H_i$ operate within a restricted range. To broaden the look-direction and consequently prevent the adaptive filters $H_i$ from cancelling signals coming from an area around the focal point, a spatial filter can additionally be included in the blocking matrix as proposed in (Claesson and Nordholm, 1992). However, this requires many sensors.
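The behaviour of the simple two-element blocking beamformer can be checked numerically. The sketch assumes a far-field plane wave, a speed of sound of 343 m/s and ideal omnidirectional sensors; the function name is hypothetical and the numbers are not read off Fig. 4.

```python
import cmath
import math

def difference_response_db(freq_hz, azimuth_deg, spacing_m=0.05, c=343.0):
    """Far-field magnitude response (dB) of x_1 - x_2 for a plane wave."""
    # Inter-sensor delay of a plane wave arriving from the given azimuth.
    tau = spacing_m * math.sin(math.radians(azimuth_deg)) / c
    h = 1.0 - cmath.exp(-2j * math.pi * freq_hz * tau)
    mag = abs(h)
    return -math.inf if mag == 0.0 else 20.0 * math.log10(mag)

look_direction = difference_response_db(1000.0, 0.0)   # perfect null at broadside
endfire_low = difference_response_db(200.0, 90.0)      # strongly attenuated
endfire_high = difference_response_db(3000.0, 90.0)    # passed with some gain
```

Under these assumptions a 200 Hz wave from 90 degrees is attenuated by roughly 15 dB while a 3 kHz wave from the same direction is slightly amplified, which illustrates the high-pass characteristic that the fixed lowpass filter in the blocking matrix is meant to compensate.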

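Returning briefly to the look-direction estimate of Section 3.2.1, the final constraint on the Wiener filter response amounts to a per-bin limiter. The sketch below is a minimal Python illustration. It assumes that the estimate is formed as a ratio of two real-valued power spectral density estimates per frequency bin, which merely stands in for Eq. (17). The function name, the regularization floor and the example values are hypothetical.

```python
def constrain_wiener_gain(psd_signal, psd_input, floor=1e-12):
    """Per-bin ratio of two PSD estimates, limited to the interval [0, 1]."""
    gains = []
    for s, x in zip(psd_signal, psd_input):
        raw = s / max(x, floor)          # raw estimate, may leave [0, 1]
        gains.append(min(1.0, max(0.0, raw)))
    return gains


# Hypothetical estimates: errors push two of the raw ratios outside [0, 1]
psd_signal = [0.8, -0.1, 1.3, 0.5]
psd_input = [1.0, 1.0, 1.0, 2.0]
print(constrain_wiener_gain(psd_signal, psd_input))  # prints [0.8, 0.0, 1.0, 0.25]
```

Bins whose raw estimate is negative are muted and bins whose raw estimate exceeds one are passed unchanged, so the limiter never amplifies.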

4. Experiments

4.1. Simulation description

To test the noise reduction performance of the described system, a computer program was developed which allows the acoustical properties of the enclosure to be changed easily. The input signals were generated by convolving single-channel anechoic recordings of speech and noise with the source-to-microphone impulse responses. These room impulse responses were simulated using the image method described by Allen and Berkley (1979). The room dimensions were 3.50 × 7.10 × 2.96 m³ and the wall reflection coefficients were varied to simulate different reverberation times, i.e. different ratios of direct-path noise to diffuse noise. The reflection coefficients were chosen to be equal for all six walls so that Eyring's formula

$$\alpha = 1 - \exp\left\{ \frac{\ln(10^{-6})\, 4V}{c\, T_{60}\, A} \right\} \qquad (21)$$

yielded the desired reverberation time. In Eq. (21), $V$ is the volume of the room, $A$ is the surface area of all six walls and $c$ is the speed of sound. From the absorption coefficient $\alpha$, the reflection coefficients $\rho$ can be calculated according to Eq. (22).
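As a rough numerical illustration of Eq. (21) and of the relation $\rho = \sqrt{1 - \alpha}$ between absorption and reflection coefficient, the following Python sketch computes both quantities for the room dimensions given above. The speed of sound of 343 m/s and the function name `eyring_coefficients` are assumptions; the value of $c$ used in the simulations is not stated.

```python
import math


def eyring_coefficients(t60, dims=(3.50, 7.10, 2.96), c=343.0):
    """Return (alpha, rho) for a rectangular room with six identical walls.

    t60 is the desired reverberation time in seconds; c = 343 m/s is assumed.
    """
    lx, ly, lz = dims
    volume = lx * ly * lz
    area = 2.0 * (lx * ly + lx * lz + ly * lz)
    # Eq. (21): absorption coefficient from Eyring's formula
    alpha = 1.0 - math.exp(math.log(1e-6) * 4.0 * volume / (c * t60 * area))
    # Reflection coefficient that follows from the absorption coefficient
    rho = math.sqrt(1.0 - alpha)
    return alpha, rho


for t60_ms in (100, 300, 500, 1000):
    alpha, rho = eyring_coefficients(t60_ms / 1000.0)
    print(f"T60 = {t60_ms:4d} ms  alpha = {alpha:.3f}  rho = {rho:.3f}")
```

Short reverberation times require strongly absorbing walls (large $\alpha$, small $\rho$), which corresponds to a noise field dominated by the direct path.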

$$\rho = \sqrt{1 - \alpha}. \qquad (22)$$

4.2. Choosing the array aperture

An array of discrete sensors can be regarded as a sampled continuous aperture. If the sampling period is not chosen appropriately, this sampling introduces spatial aliasing in the form of grating lobes (Johnson and Dudgeon, 1993). On the other hand, the estimation of the transfer function in look-direction $\hat{W}$ assumes a spatially white noise field. In practice, the noise field can at best be diffuse, with a spatial coherence function given by a sinc function. To obtain spatially uncorrelated noise signals, the continuous aperture is therefore usually undersampled. This works well for purely diffuse noise fields and if the desired speaker is close to the array. Our proposed noise reduction system, however, is in principle an adaptive beamformer, and an undersampled aperture yields poor system performance in the case of direct-path noise.

We used a seven-element linear, equally spaced array with 5 cm inter-element spacing and a total aperture length of 35 cm to avoid spatial aliasing in the frequency band below 3400 Hz. Experiments with various sensor configurations led us to revert to the linear array, which yielded the best performance under the constraint of a maximum number of seven sensors.

4.3. Performance measure

In speech communications, the ultimate recipient of information is the human being. The artefacts generated by many speech enhancement techniques decrease user acceptance of voice communication systems. Hence, subjective listening tests are indispensable for performance evaluation. Since such tests are time- and cost-intensive, however, we used the Log Area Ratio (LAR) distance (norm without energy weighting) as an objective measure of speech quality, which has been found to correlate well with subjective perception (Barnwell and Voiers, 1979). The LAR distance is defined according to the following formula:

(23)

where $g_{\mathrm{ref}}(l;m)$ and $g_{\mathrm{test}}(l;m)$ represent the $l$th area ratio function of the reference data (desired signal) and of the test data (output signal of the noise reduction system), respectively, computed over the frame ending at time $m$. The area ratio function is defined according to

$$g(l;m) = \frac{1 + k(l;m)}{1 - k(l;m)}, \qquad (24)$$

where the $k(l;m)$ are the $L$ PARCOR coefficients computed over the frame ending at time $m$, which are calculated from an LPC analysis of order $L = 12$. The mean LAR distance is computed between the beginning and the end point of the test word.

4.4. Results

Fig. 5(a) shows the LAR improvement as a function of Eyring's reverberation time $T_{60}$ (low LAR corresponds to high speech quality). The sampling frequency was 8
kHz, and a hair drier was used as the noise source. The solid line shows the LAR of the disturbed input speech signal (input SNR = 3 dB). The other curves show the output LAR of various beamforming algorithms (seven-sensor linear array with 5 cm inter-element spacing).* Fig. 5(b) shows the mean spatial coherence of the noise field measured by the outer sensors of the array. As we can see from Fig. 5(a), the LAR improvement of the conventional beamformer is relatively constant with regard to the reverberation time. The Frost beamformer yields a very low LAR in anechoic environments. However, for reverberation times above 300 ms the conventional beamformer performs better than the Frost beamformer. The proposed adaptive look-direction response beamformer, in both the direct form with the Frost algorithm (described in Section 3.1) and the GSC form (Section 3.2), yields the best LAR improvement, which is nearly independent of the reverberation time.

Fig. 5. Log Area Ratio distance as function of reverberation time of the enclosure for various beamforming algorithms (a) and mean spatial coherence of the noise field (b). Panel title: Input SNR = 3 dB; abscissa: reverberation time $T_{60}$ [ms].

Fig. 6 shows the performance of various microphone arrays with adaptive postfiltering. For this experiment we used an array with four sensors. The microphones were placed at the corners of a square of 0.6 × 0.6 m as suggested by Zelinski (1988) (the total aperture is twice as large as the aperture used for the experiment in Fig. 5). These large sensor distances are necessary for this kind of algorithm, because the undersampled aperture yields uncorrelated noise signals in diffuse environments. As we can see from Fig. 6, these algorithms (Simmer and Wasiljeff, 1992; Zelinski, 1988) work well only for reverberation times above 300 ms. For very low reverberation times, the output speech quality is poorer than the input speech quality. For spatially correlated noise (i.e. for low reverberation times), not only the postfilter but also the conventional beamformer is ineffective because of the undersampled aperture used. The dashed line in Fig. 6 shows again the results of the adaptive look-direction response
beamformer, but with the seven-sensor linear array with 5 cm inter-element spacing.

Fig. 6. Log Area Ratio distance as function of reverberation time of the enclosure for various microphone arrays with adaptive postfiltering. Abscissa: reverberation time $T_{60}$ [ms].

* The available audio files 1-5 correspond to this acoustic condition with $T_{60}$ = 300 ms (see http://www.elsevier.nl/locate/specom).
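The trade-off between the two array geometries can be made concrete with two textbook relations: the half-wavelength spacing rule for avoiding grating lobes and the sinc-shaped coherence of an ideal diffuse noise field. The following Python sketch assumes omnidirectional sensors and a speed of sound of 343 m/s, and its function names are hypothetical. It is not the simulation used for Figs. 5 and 6.

```python
import math

C = 343.0  # assumed speed of sound in m/s


def aliasing_limit_hz(spacing_m, c=C):
    """Highest frequency free of grating lobes for any arrival angle (d = lambda / 2)."""
    return c / (2.0 * spacing_m)


def diffuse_coherence(freq_hz, distance_m, c=C):
    """Coherence sin(x) / x of an ideal diffuse field, with x = 2 * pi * f * d / c."""
    x = 2.0 * math.pi * freq_hz * distance_m / c
    return 1.0 if x == 0.0 else math.sin(x) / x


print(f"5 cm spacing: alias-free up to {aliasing_limit_hz(0.05):.0f} Hz")
for d in (0.05, 0.30, 0.60):
    row = ", ".join(f"{f} Hz: {diffuse_coherence(f, d):+.2f}" for f in (250, 500, 1000, 2000))
    print(f"d = {d:.2f} m -> {row}")
```

Under these assumptions the 5 cm spacing stays alias-free up to 3430 Hz, just above the 3400 Hz band edge. For a sensor distance of 0.6 m the diffuse-field coherence has its first zero near 286 Hz, so the noise at the sensors is largely uncorrelated over most of the speech band, at the price of an undersampled aperture.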

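For readers who want to experiment with the objective measure of Section 4.3, the following Python sketch computes the area ratios of Eq. (24) from PARCOR coefficients and a frame-wise distance between reference and test data. The distance used here, a root-mean-square difference of natural-log area ratios, is an assumption and may differ from the exact norm of Eq. (23). The coefficient values and function names are hypothetical.

```python
import math


def area_ratios(parcor):
    """Eq. (24): g(l) = (1 + k(l)) / (1 - k(l)) for PARCOR coefficients with |k(l)| < 1."""
    return [(1.0 + k) / (1.0 - k) for k in parcor]


def lar_distance(parcor_ref, parcor_test):
    """Assumed frame distance: RMS difference of the log area ratios."""
    g_ref = area_ratios(parcor_ref)
    g_test = area_ratios(parcor_test)
    squared = [(math.log(a) - math.log(b)) ** 2 for a, b in zip(g_ref, g_test)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical PARCOR coefficients of one frame (order 4 instead of 12, for brevity)
k_ref = [0.90, -0.40, 0.20, -0.10]
k_test = [0.80, -0.30, 0.25, -0.05]
print(f"LAR distance of this frame: {lar_distance(k_ref, k_test):.3f}")
```

A mean LAR distance for a test word would then be obtained by averaging such frame values between the beginning and the end point of the word.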
5. Conclusion

In this paper we proposed an adaptive beamformer with adaptive constraint values for suppression of coherent and incoherent noise in disturbed speech signals. Two possible implementation methods (a direct form and an open-loop GSC structure) were described. Improvements of the estimation of the adaptive transfer function in look-direction and the design of the signal blocking matrix were given. The experimental results demonstrated that the proposed method works well for a large range of reverberation times and is therefore able to operate independently of the acoustical properties of the enclosure.

Acknowledgements

We would like to thank Prof. K.D. Kammeyer of the University of Bremen, Germany for his support in the course of this work and for critical reading of the manuscript. We also thank the anonymous reviewers for a number of comments and suggestions which improved this paper.

References

S. Affes and Y. Grenier (1994), “Test of adaptive beamformers for speech acquisition in cars”, Proc. Internat. Conf. Signal Processing Applications and Technology, ICSPAT '94, Dallas, TX, Vol. 1, pp. 154-159.
J.B. Allen and D.A. Berkley (1979), “Image method for efficiently simulating small-room acoustics”, J. Acoust. Soc. Amer., Vol. 65, No. 4, pp. 943-950.
J. An and B. Champagne (1994), “GSC realisation using the two-dimensional transform-domain LMS algorithm”, IEE Proc. - Radar, Sonar, Navig., Vol. 141, No. 5, pp. 270-278.
W. Armbrüster, R. Czarnach and P. Vary (1986), “Adaptive noise cancellation with reference input - possible applications and theoretical limits”, Proc. European Signal Processing Conf. EUSIPCO-86, The Hague, pp. 391-394.
T. Barnwell and W. Voiers (1979), An analysis of objective measures for user acceptance of voice communication systems, Georgia Institute of Technology, Atlanta, Report DCA100-78-003.


R.E. Boucher and J.C. Hassab (1981), “Analysis of discrete implementation of generalized cross correlator”, IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-29, No. 3, pp. 609-611.
K.M. Buckley (1986), “Broad-band beamforming and the Generalized Sidelobe Canceller”, IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-34, No. 5, pp. 1322-1323.
G.C. Carter (Ed.) (1993), Coherence and Time Delay Estimation (IEEE Press, New York).
Y.-H. Chen and H.-D. Fang (1992), “Frequency-domain implementation of Griffiths-Jim adaptive beamformer”, J. Acoust. Soc. Amer., Vol. 91, pp. 3354-3366.
I. Claesson and S. Nordholm (1992), “A spatial filtering approach to robust adaptive beamforming”, IEEE Trans. Antennas Propagat., Vol. 40, No. 9, pp. 1093-1096.
H. Cox, R.M. Zeskind and M.M. Owen (1987), “Robust adaptive beamforming”, IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-35, No. 10, pp. 1365-1376.
E.M. Dowling, D.A. Linebarger, Y. Tong and M. Munoz (1992), “An adaptive microphone array processing system”, Microprocessors and Microsystems, Vol. 16, No. 10, pp. 507-516.
J.L. Flanagan (1985), “Beamwidth and usable bandwidth of delay-steered microphone arrays”, AT&T Tech. J., Vol. 64, No. 4, pp. 983-994.
J.L. Flanagan, J.D. Johnston, R. Zahn and G.W. Elko (1985), “Computer-steered microphone arrays for sound transduction in large rooms”, J. Acoust. Soc. Amer., Vol. 78, No. 5, pp. 1508-1518.
J.L. Flanagan, D.A. Berkley, G.W. Elko, J.E. West and M.M. Sondhi (1991), “Autodirective microphone systems”, Acustica, Vol. 73, pp. 58-71.
O.L. Frost (1972), “An algorithm for linearly constrained adaptive array processing”, Proc. IEEE, Vol. 60, No. 8, pp. 926-935.
S. Gierl (1990), “Noise reduction for speech input systems using an adaptive microphone-array”, Proc. 22nd Internat. Symp. on Automotive Technology & Automation - ISATA 90, Florence, Italy, pp. 517-524.
J.E. Greenberg and P.M. Zurek (1992), “Evaluation of an adaptive beamforming method for hearing aids”, J. Acoust. Soc. Amer., Vol. 91, No. 3, pp. 1662-1676.
Y. Grenier (1993), “A microphone array for car environments”, Speech Communication, Vol. 12, No. 1, pp. 25-39.
L.J. Griffiths and C.W. Jim (1982), “An alternative approach to linearly constrained adaptive beamforming”, IEEE Trans. Antennas Propagat., Vol. AP-30, No. 1, pp. 27-34.
M.W. Hoffman, T.D. Trine, K.M. Buckley and D.J. van Tasell (1994), “Robust adaptive microphone array processing for hearing aids: realistic speech enhancement”, J. Acoust. Soc. Amer., Vol. 96, No. 2, pp. 759-770.
C.W. Jim (1977), “A comparison of two LMS constrained optimal array structures”, Proc. IEEE, Vol. 65, No. 12, pp. 1730-1731.
D.H. Johnson and D.E. Dudgeon (1993), Array Signal Processing - Concepts and Techniques (Prentice-Hall, Englewood Cliffs, NJ).
Y. Kaneda and J. Ohga (1986), “Adaptive microphone-array system for noise reduction”, IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-34, No. 6, pp. 1391-1400.
W. Kellermann (1991), “A self-steering digital microphone array”, Proc. Internat. Conf. Acoust. Speech Signal Process., ICASSP-91, pp. 3581-3584.
F. Khalil, J.P. Jullien and A. Gilloire (1994), “Microphone array for sound pickup in teleconference systems”, J. Audio Eng. Soc., Vol. 42, No. 9, pp. 691-700.
K. Kroschel (1991), “Enhancement of speech signals using microphone arrays”, in: V. Cappellini and A.G. Constantinides, Eds., Digital Signal Processing - 91, pp. 223-228.
P. Kuczynski, S. Fischer, F.Ch. Krügel, A. Wasiljeff and K.U. Simmer (1994), “Adaptive Mehrkanalgeräuschunterdrückung und Sprecherortung bei gestörten Sprachsignalen innerhalb geschlossener Räume”, Kleinheubacher Berichte, Band 38, pp. 369-378.
J.S. Lim, Ed. (1983), Speech Enhancement (Prentice-Hall, Englewood Cliffs, NJ).
Y. Mahieux, A. Gilloire and G. Le Tourneur (1993), “A microphone array for multimedia applications”, 1993 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (Mohonk Mountain, New York).
Y. Mahieux, T.G. Le Tourneur and A. Saliou (1995), “A microphone array for multimedia workstations”, Preprints AES 98th Convention, Paris.
S. Nordholm, I. Claesson and B. Bengtsson (1993), “Adaptive array noise suppression of handsfree speaker input in cars”, IEEE Trans. Vehicular Technology, Vol. 42, No. 4, pp. 514-518.
A.H. Nuttall and G.C. Carter (1982), “Spectral estimation using combined time and lag weighting”, Proc. IEEE, Vol. 70, No. 9, pp. 1115-1125.
P.M. Peterson, N.I. Durlach, W.M. Rabinowitz and P.M. Zurek (1987), “Multimicrophone adaptive beamforming for interference reduction in hearing aids”, J. Rehabilitation Res. Development, Vol. 24, No. 4, pp. 103-110.
F. Pirz (1979), “Design of a wideband, constant beamwidth, array microphone for use in the near field”, Bell Syst. Techn. J., Vol. 58, No. 8, pp. 1839-1851.


K.U. Simmer and A. Wasiljeff (1992), “Adaptive microphone arrays for noise suppression in the frequency domain”, Second COST 229 Workshop on Adaptive Algorithms in Communications, Bordeaux, France, pp. 185-194.
K.U. Simmer, S. Fischer and A. Wasiljeff (1994), “Suppression of coherent and incoherent noise using a microphone array”, Annales des Télécommunications, Vol. 49, Nos. 7-8, pp. 439-446.
M.M. Sondhi and G.W. Elko (1986), “Adaptive optimization of microphone arrays under a nonlinear constraint”, Proc. Internat. Conf. Acoust. Speech Signal Process., ICASSP-86, pp. 981-984.
C. Sydow (1994), “Broadband beamforming for a microphone array”, J. Acoust. Soc. Amer., Vol. 96, No. 2, pp. 845-849.
D. van Compernolle, W. Ma, F. Xie and M. van Diest (1990), “Speech recognition in noisy environments with the aid of microphone arrays”, Speech Communication, Vol. 9, Nos. 5/6, pp. 433-442.
B.D. van Veen and K.M. Buckley (1988), “Beamforming: A versatile approach to spatial filtering”, IEEE ASSP-Magazine, pp. 4-24.
B. Widrow, P.E. Mantey, L.J. Griffiths and B.B. Goode (1967), “Adaptive antenna systems”, Proc. IEEE, Vol. 55, No. 12, pp. 2143-2159.
B. Widrow et al. (1975), “Adaptive noise cancelling: Principles and applications”, Proc. IEEE, Vol. 63, No. 12, pp. 1692-1716.
Z. Yang, K.U. Simmer and A. Wasiljeff (1993), “Improved performance of multimicrophone speech enhancement systems”, Quatorzième Colloque GRETSI, Juan-Les-Pins, France, pp. 479-482.
R. Zelinski (1988), “A microphone array with adaptive post-filtering for noise reduction in reverberant rooms”, Proc. Internat. Conf. Acoust. Speech Signal Process., ICASSP-88, New York, pp. 2578-2581.
R. Zelinski (1990), “Noise reduction based on microphone array with LMS adaptive post-filtering”, Electronics Lett., Vol. 26, No. 24, pp. 2036-2037.