CHAPTER 6
Multichannel Speech Enhancement in the Frequency Domain

In this chapter, we study the multichannel speech enhancement problem in the frequency domain. By exploiting the structure of the speech subspace, we can easily estimate all the convolved speech signals at the microphones with a simple complex filter. As a result, binaural noise reduction with this approach is straightforward, since we can choose any two signals from the estimated convolved speech signals, which contain all the necessary spatial information for the localization of the desired source signal.
6.1 SIGNAL MODEL AND PROBLEM FORMULATION

We consider a sensor array consisting of M microphones. In a general way, the received signals at frequency index f are expressed as [1, 2]

Y_m(f) = G_m(f) S(f) + V_m(f)
       = X_m(f) + V_m(f),  m = 1, 2, ..., M,   (6.1)

where Y_m(f) is the mth microphone signal, S(f) is the unknown speech source, G_m(f) is the acoustic transfer function (frequency response of the acoustic impulse response) from the position of S(f) to the mth microphone, X_m(f) = G_m(f) S(f) is the convolved speech signal, and V_m(f) is the additive noise. It is assumed that X_m(f) and V_m(f) are incoherent and zero mean. It is more convenient to write the M frequency-domain microphone signals in vector notation:

y(f) = g(f) S(f) + v(f)
     = x(f) + v(f),   (6.2)

where

y(f) = [Y_1(f) Y_2(f) ··· Y_M(f)]^T,
x(f) = [X_1(f) X_2(f) ··· X_M(f)]^T = g(f) S(f),

Speech Enhancement. http://dx.doi.org/10.1016/B978-0-12-800139-4.00006-2
Copyright © 2014 Elsevier Inc. All rights reserved.
g(f) = [G_1(f) G_2(f) ··· G_M(f)]^T,
v(f) = [V_1(f) V_2(f) ··· V_M(f)]^T.

In the rest, we assume that the whole vector x(f) is the desired signal that we wish to estimate from the observation vector, y(f). Since S(f) and V_m(f) are incoherent by assumption, the correlation matrix of y(f) is

Φ_y(f) = E[y(f) y^H(f)]
       = Φ_x(f) + Φ_v(f)
       = φ_S(f) g(f) g^H(f) + Φ_v(f),   (6.3)

where Φ_x(f) = φ_S(f) g(f) g^H(f) is the correlation matrix (whose rank is equal to 1) of x(f), φ_S(f) = E[|S(f)|^2] is the variance of S(f), and Φ_v(f) = E[v(f) v^H(f)] is the correlation matrix (whose rank is assumed to be equal to M) of v(f). The speech spatial correlation matrix can be decomposed as follows:
Φ_x(f) = Q_x(f) Λ_x(f) Q_x^H(f),   (6.4)

where

Q_x(f) = [q_{x,1}(f) q_{x,2}(f) ··· q_{x,M}(f)]   (6.5)

is a unitary matrix and

Λ_x(f) = diag[λ_{x,1}(f), λ_{x,2}(f), ..., λ_{x,M}(f)]   (6.6)

is a diagonal matrix. The orthonormal vectors q_{x,1}(f), q_{x,2}(f), ..., q_{x,M}(f) are the eigenvectors corresponding, respectively, to the eigenvalues λ_{x,1}(f), λ_{x,2}(f), ..., λ_{x,M}(f) of the matrix Φ_x(f), where

λ_{x,1}(f) = tr[Φ_x(f)] = φ_S(f) g^H(f) g(f)   (6.7)

and λ_{x,2}(f) = λ_{x,3}(f) = ··· = λ_{x,M}(f) = 0. Let

Q_x(f) = [q_{x,1}(f) Υ_x(f)],   (6.8)

where

q_{x,1}(f) = g(f) / √[g^H(f) g(f)]   (6.9)
and

Υ_x(f) = [q_{x,2}(f) q_{x,3}(f) ··· q_{x,M}(f)].   (6.10)

It can be verified that

I_M = q_{x,1}(f) q_{x,1}^H(f) + Υ_x(f) Υ_x^H(f).   (6.11)
Notice that q_{x,1}(f) q_{x,1}^H(f) and Υ_x(f) Υ_x^H(f) are two orthogonal projection matrices of rank 1 and M − 1, respectively. Hence, q_{x,1}(f) q_{x,1}^H(f) is the orthogonal projector onto the speech subspace (where all the energy of the speech signal is concentrated), or range of Φ_x(f), and Υ_x(f) Υ_x^H(f) is the orthogonal projector onto the null subspace of Φ_x(f). Using (6.11), we can write the speech vector as

x(f) = Q_x(f) Q_x^H(f) x(f)
     = q_{x,1}(f) q_{x,1}^H(f) x(f)
     = q_{x,1}(f) X̃(f),   (6.12)

where

X̃(f) = q_{x,1}^H(f) x(f)   (6.13)

is the transformed desired signal. Therefore, the signal model for multichannel noise reduction becomes

y(f) = q_{x,1}(f) X̃(f) + v(f).   (6.14)

We see that the estimation of the vector x(f) of length M is equivalent to the estimation of the component X̃(f). From (6.14), we give another form of the correlation matrix of y(f):

Φ_y(f) = φ_X̃(f) q_{x,1}(f) q_{x,1}^H(f) + Φ_v(f),   (6.15)

where φ_X̃(f) = E[|X̃(f)|^2] = λ_{x,1}(f) and, obviously, Φ_x(f) = λ_{x,1}(f) q_{x,1}(f) q_{x,1}^H(f).
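The rank-1 structure of Φ_x(f) and the reduction of x(f) to the single transformed signal X̃(f) can be checked numerically. The following is a minimal NumPy sketch for one frequency bin; it is our own illustration (not code from the book), and names such as `q_x1` and `Xtilde` are ours:

```python
import numpy as np

# One synthetic frequency bin of the model of Section 6.1.
rng = np.random.default_rng(0)
M = 4                                         # number of microphones
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # transfer functions G_m(f)
S = 1.5 - 0.5j                                # speech source S(f) (single realization)
x = g * S                                     # convolved speech x(f) = g(f) S(f)
phi_S = abs(S) ** 2                           # stand-in for the variance of S(f)
Phi_x = phi_S * np.outer(g, g.conj())         # rank-1 speech correlation matrix, cf. (6.3)

# Eigenstructure (6.4)-(6.7): a single nonzero eigenvalue.
eigvals, _ = np.linalg.eigh(Phi_x)
lam_x1 = eigvals[-1]                          # largest eigenvalue
q_x1 = g / np.linalg.norm(g)                  # (6.9): eigenvector spanning the speech subspace
print(np.isclose(lam_x1, phi_S * np.vdot(g, g).real))  # (6.7)
print(np.linalg.matrix_rank(Phi_x) == 1)

# (6.12)-(6.13): x(f) is fully described by the scalar Xtilde(f).
Xtilde = np.vdot(q_x1, x)                     # q_x1^H x
print(np.allclose(q_x1 * Xtilde, x))
```

All three checks print True: the speech correlation matrix has rank 1, its nonzero eigenvalue is φ_S(f) g^H(f) g(f), and the length-M vector x(f) is recovered exactly from the scalar X̃(f).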
6.2 LINEAR ARRAY MODEL

In our context, multichannel noise reduction consists of estimating X̃(f) from the observations. This task is performed in the same way as in classical beamforming, i.e., by applying a complex weight to the output of each sensor, at frequency f, and summing across the aperture [3–5]:

Z(f) = Σ_{m=1}^{M} H_m^*(f) Y_m(f)
     = h^H(f) y(f),   (6.16)

where Z(f) is supposed to be the estimate of X̃(f) and

h(f) = [H_1(f) H_2(f) ··· H_M(f)]^T   (6.17)

is a complex-valued filter of length M. We can rewrite (6.16) as

Z(f) = X_fd(f) + V_rn(f),   (6.18)
where

X_fd(f) = h^H(f) x(f)
        = X̃(f) h^H(f) q_{x,1}(f)   (6.19)

is the filtered transformed desired signal and

V_rn(f) = h^H(f) v(f)   (6.20)

is the residual noise. Equivalently, the estimate of x(f) is supposed to be

z(f) = q_{x,1}(f) Z(f)
     = q_{x,1}(f) h^H(f) y(f)
     = H(f) y(f),   (6.21)

where

H(f) = q_{x,1}(f) h^H(f)   (6.22)

is a filtering matrix of size M × M that leads to the estimation of x(f). Now, it is easy to compute the variance of Z(f), which is

φ_Z(f) = E[|Z(f)|^2] = φ_{X_fd}(f) + φ_{V_rn}(f),   (6.23)
where

φ_{X_fd}(f) = h^H(f) Φ_x(f) h(f)
            = φ_X̃(f) |h^H(f) q_{x,1}(f)|^2,   (6.24)

φ_{V_rn}(f) = h^H(f) Φ_v(f) h(f).   (6.25)

We also observe that Φ_z(f) = φ_Z(f) q_{x,1}(f) q_{x,1}^H(f) and tr[Φ_z(f)] = φ_Z(f). The variance of Z(f) is helpful in defining meaningful performance measures.
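Because speech and noise are incoherent, the output variance (6.23) splits exactly into the filtered-speech and residual-noise terms. A short, hedged NumPy check (synthetic quantities, our own variable names):

```python
import numpy as np

# Check of the variance decomposition (6.23)-(6.25) for an arbitrary filter h.
rng = np.random.default_rng(1)
M = 4
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_S = 2.0
Phi_x = phi_S * np.outer(g, g.conj())                    # rank-1 speech correlation
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = A @ A.conj().T + 0.1 * np.eye(M)                 # full-rank noise correlation
Phi_y = Phi_x + Phi_v                                    # (6.3)

h = rng.standard_normal(M) + 1j * rng.standard_normal(M) # arbitrary complex filter
q_x1 = g / np.linalg.norm(g)
phi_Xt = phi_S * np.vdot(g, g).real                      # phi_Xtilde = lambda_x1, (6.7)

phi_Xfd = phi_Xt * abs(np.vdot(h, q_x1)) ** 2            # (6.24)
phi_Vrn = (h.conj() @ Phi_v @ h).real                    # (6.25)
phi_Z = (h.conj() @ Phi_y @ h).real                      # variance of Z(f)
print(np.isclose(phi_Z, phi_Xfd + phi_Vrn))              # (6.23) holds
```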
6.3 PERFORMANCE MEASURES

In this section, we define some important performance measures for multichannel noise reduction in the frequency domain. We discuss both narrowband and broadband measures.
6.3.1 Noise Reduction
Since X̃(f) is the transformed desired signal, the narrowband input SNR is

iSNR(f) = tr[Φ_x(f)] / tr[Φ_v(f)]
        = φ_X̃(f) / tr[Φ_v(f)].   (6.26)

From (6.26), we deduce the broadband input SNR:

iSNR = ∫_f φ_X̃(f) df / ∫_f tr[Φ_v(f)] df.   (6.27)

It can be shown that

min_m [φ_{X_m}(f) / φ_{V_m}(f)] ≤ iSNR(f) ≤ max_m [φ_{X_m}(f) / φ_{V_m}(f)],   (6.28)
where φ_{X_m}(f) and φ_{V_m}(f) are the variances of X_m(f) and V_m(f), respectively.

From (6.23), we find the narrowband output SNR:

oSNR[h(f)] = φ_{X_fd}(f) / φ_{V_rn}(f)
           = φ_X̃(f) |h^H(f) q_{x,1}(f)|^2 / [h^H(f) Φ_v(f) h(f)]
           = [φ_{X_m}(f) / |G_m(f)|^2] × |h^H(f) g(f)|^2 / [h^H(f) Φ_v(f) h(f)]   (6.29)

and the broadband output SNR:

oSNR[h] = ∫_f φ_X̃(f) |h^H(f) q_{x,1}(f)|^2 df / ∫_f h^H(f) Φ_v(f) h(f) df.   (6.30)

Since it is assumed that the matrix Φ_v(f) is full rank, we have

|h^H(f) q_{x,1}(f)|^2 ≤ [h^H(f) Φ_v(f) h(f)] [q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)],   (6.31)

with equality if and only if h(f) ∝ Φ_v^{-1}(f) q_{x,1}(f). Using the inequality (6.31) in (6.29), we find an upper bound for the narrowband output SNR:

oSNR[h(f)] ≤ φ_X̃(f) q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)
           = [φ_{X_m}(f) / |G_m(f)|^2] g^H(f) Φ_v^{-1}(f) g(f), ∀h(f).   (6.32)

As a consequence, we define the maximum narrowband output SNR as

oSNR_max(f) = φ_X̃(f) q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)
            = tr[Φ_v^{-1}(f) Φ_x(f)].   (6.33)
For the particular filters i_m, m = 1, 2, ..., M, where i_m corresponds to the mth column of the identity matrix I_M, we have

oSNR[i_m(f)] = φ_{X_m}(f) / φ_{V_m}(f).   (6.34)

As a result,

oSNR_max(f) ≥ max_m [φ_{X_m}(f) / φ_{V_m}(f)] ≥ iSNR(f).   (6.35)

It follows from the definitions of the input and output SNRs that the narrowband and broadband array gains are, respectively,

A[h(f)] = oSNR[h(f)] / iSNR(f),   (6.36)
A[h] = oSNR[h] / iSNR.   (6.37)
The noise reduction factor is defined as the power of the noise at the microphones over the power of the noise remaining after filtering. We have the narrowband noise reduction factor:

ξ_nr[h(f)] = tr[Φ_v(f)] / [h^H(f) Φ_v(f) h(f)]   (6.38)

and the broadband noise reduction factor:

ξ_nr[h] = ∫_f tr[Φ_v(f)] df / ∫_f h^H(f) Φ_v(f) h(f) df.   (6.39)

The noise reduction factor should be greater than 1 for optimal filters.
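The narrowband measures above can be evaluated for any filter. Below is a hedged single-bin NumPy sketch (our own synthetic example) computing the input SNR (6.26), the output SNR (6.29), and the bound (6.33); no filter can exceed oSNR_max(f):

```python
import numpy as np

# Narrowband performance measures of Section 6.3 for one synthetic bin.
rng = np.random.default_rng(2)
M = 4
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_S = 1.0
Phi_x = phi_S * np.outer(g, g.conj())
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = A @ A.conj().T + 0.1 * np.eye(M)
q_x1 = g / np.linalg.norm(g)
phi_Xt = phi_S * np.vdot(g, g).real

iSNR = phi_Xt / np.trace(Phi_v).real                     # (6.26)

h = rng.standard_normal(M) + 1j * rng.standard_normal(M) # arbitrary filter
oSNR = phi_Xt * abs(np.vdot(h, q_x1)) ** 2 / (h.conj() @ Phi_v @ h).real  # (6.29)
gain = oSNR / iSNR                                       # array gain (6.36)
xi_nr = np.trace(Phi_v).real / (h.conj() @ Phi_v @ h).real  # (6.38)

# (6.33): oSNR_max = tr(Phi_v^{-1} Phi_x) bounds the output SNR of any filter.
oSNR_max = np.trace(np.linalg.solve(Phi_v, Phi_x)).real
print(oSNR <= oSNR_max + 1e-9)                           # (6.32)
print(oSNR_max >= iSNR)                                  # (6.35)
```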
6.3.2 Speech Distortion

To measure the distortion of the transformed desired signal, we can use the narrowband speech reduction factor:

ξ_sr[h(f)] = φ_X̃(f) / φ_{X_fd}(f)
           = 1 / |h^H(f) q_{x,1}(f)|^2   (6.40)

and the broadband speech reduction factor:

ξ_sr[h] = ∫_f φ_X̃(f) df / ∫_f φ_X̃(f) |h^H(f) q_{x,1}(f)|^2 df.   (6.41)

To avoid any distortion of the transformed desired signal, we must have

h^H(f) q_{x,1}(f) = 1, ∀f.   (6.42)

So when the speech reduction factor is greater than 1, the transformed desired signal is distorted. It is clear that we have the following relationships:

A[h] = ξ_nr[h] / ξ_sr[h],   (6.43)
A[h(f)] = ξ_nr[h(f)] / ξ_sr[h(f)].   (6.44)
We can also measure distortion via the narrowband speech distortion index:

υ_sd[h(f)] = E[|X_fd(f) − X̃(f)|^2] / φ_X̃(f)
           = |h^H(f) q_{x,1}(f) − 1|^2.   (6.45)

We deduce that the broadband speech distortion index is

υ_sd[h] = ∫_f φ_X̃(f) |h^H(f) q_{x,1}(f) − 1|^2 df / ∫_f φ_X̃(f) df
        = ∫_f φ_X̃(f) υ_sd[h(f)] df / ∫_f φ_X̃(f) df.   (6.46)

The speech distortion index should be smaller than 1 for optimal filters and equal to 0 in the distortionless case.
6.3.3 MSE Criterion

We define the error signal between the estimated and transformed desired signals, Z(f) and X̃(f), at frequency f, as

E(f) = Z(f) − X̃(f)
     = h^H(f) y(f) − X̃(f)
     = X_fd(f) + V_rn(f) − X̃(f).   (6.47)

This error can also be expressed as

E(f) = E_ds(f) + E_rs(f),   (6.48)

where

E_ds(f) = [h^H(f) q_{x,1}(f) − 1] X̃(f)   (6.49)

is the speech distortion due to the complex filter and

E_rs(f) = h^H(f) v(f)   (6.50)

represents the residual noise. Notice that the error signals E_ds(f) and E_rs(f) are incoherent. The narrowband MSE is then

J[h(f)] = E[|E(f)|^2]
        = φ_X̃(f) + h^H(f) Φ_y(f) h(f) − φ_X̃(f) h^H(f) q_{x,1}(f) − φ_X̃(f) q_{x,1}^H(f) h(f),   (6.51)
which can be expressed as

J[h(f)] = E[|E_ds(f)|^2] + E[|E_rs(f)|^2]
        = J_ds[h(f)] + J_rs[h(f)],   (6.52)

where

J_ds[h(f)] = φ_X̃(f) |h^H(f) q_{x,1}(f) − 1|^2
           = φ_X̃(f) υ_sd[h(f)]   (6.53)

and

J_rs[h(f)] = h^H(f) Φ_v(f) h(f)
           = tr[Φ_v(f)] / ξ_nr[h(f)].   (6.54)

We deduce that

J_ds[h(f)] / J_rs[h(f)] = iSNR(f) × ξ_nr[h(f)] × υ_sd[h(f)]
                        = oSNR[h(f)] × ξ_sr[h(f)] × υ_sd[h(f)].   (6.55)

We observe how the narrowband MSEs are related to the narrowband performance measures.
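The decomposition (6.52) and the ratio identity (6.55) can be confirmed numerically. A hedged NumPy sketch (synthetic single-bin example of our own making):

```python
import numpy as np

# Check J = J_ds + J_rs (6.52) and J_ds/J_rs = iSNR * xi_nr * upsilon_sd (6.55).
rng = np.random.default_rng(3)
M = 4
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_S = 1.0
Phi_x = phi_S * np.outer(g, g.conj())
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = A @ A.conj().T + 0.1 * np.eye(M)
Phi_y = Phi_x + Phi_v
q_x1 = g / np.linalg.norm(g)
phi_Xt = phi_S * np.vdot(g, g).real

h = rng.standard_normal(M) + 1j * rng.standard_normal(M)

# (6.51): the two cross terms combine into -2 Re{phi_Xt h^H q_x1}.
J = phi_Xt + (h.conj() @ Phi_y @ h).real - 2 * (phi_Xt * np.vdot(h, q_x1)).real
J_ds = phi_Xt * abs(np.vdot(h, q_x1) - 1) ** 2           # (6.53)
J_rs = (h.conj() @ Phi_v @ h).real                       # (6.54)
print(np.isclose(J, J_ds + J_rs))                        # (6.52)

iSNR = phi_Xt / np.trace(Phi_v).real
xi_nr = np.trace(Phi_v).real / J_rs
ups_sd = abs(np.vdot(h, q_x1) - 1) ** 2
print(np.isclose(J_ds / J_rs, iSNR * xi_nr * ups_sd))    # (6.55)
```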
6.4 OPTIMAL FILTERS

In this section, we derive the most important filters that can help mitigate the level of the noise picked up by the microphones. Since we can estimate the whole vector x(f), the framework proposed here is valid for both the monaural and binaural noise reduction problems. In the monaural case, we can choose any component of x(f) as the desired signal, while in the binaural case, any two elements of x(f) can be considered as the desired signals; these include all the spatial information needed for our binaural hearing system to localize the source signal.
6.4.1 Maximum SNR

Let us rewrite the narrowband output SNR:

oSNR[h(f)] = φ_X̃(f) h^H(f) q_{x,1}(f) q_{x,1}^H(f) h(f) / [h^H(f) Φ_v(f) h(f)].   (6.56)

The maximum SNR filter, h_max(f), is obtained by maximizing the output SNR as given above. In (6.56), we recognize the generalized Rayleigh quotient [6]. It is well known that this quotient is maximized with the maximum eigenvector of the matrix φ_X̃(f) Φ_v^{-1}(f) q_{x,1}(f) q_{x,1}^H(f). Let us denote by λ_max(f) the maximum eigenvalue corresponding to this maximum eigenvector. Since the rank of the mentioned matrix is equal to 1, we have

λ_max(f) = tr[φ_X̃(f) Φ_v^{-1}(f) q_{x,1}(f) q_{x,1}^H(f)]
         = φ_X̃(f) q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)
         = oSNR_max(f).   (6.57)

As a result,

oSNR[h_max(f)] = λ_max(f),   (6.58)

which corresponds to the maximum possible SNR, and

A[h_max(f)] = A_max(f)
            = tr[Φ_v(f)] q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f).   (6.59)
Obviously, we also have

h_max(f) = ς(f) Φ_v^{-1}(f) q_{x,1}(f)
         = [ς(f) / √(g^H(f) g(f))] Φ_v^{-1}(f) g(f),   (6.60)

where ς(f) is an arbitrary frequency-dependent complex number different from zero. While this factor has no effect on the narrowband output SNR, it does affect the broadband output SNR and the speech distortion. In fact, all the filters (except for the LCMV) derived in the rest of this section are equivalent up to this complex factor; each of them selects its own complex factor at each frequency, depending on what is optimized. It is important to understand that while the maximum SNR filter maximizes the narrowband output SNR, it certainly does not maximize the broadband output SNR, whose value depends quite a lot on the ς(f)'s.

Let us denote by A_max^(m)(f) the maximum narrowband array gain of a microphone array with m sensors. By virtue of the inclusion principle [6] for the matrix φ_X̃(f) Φ_v^{-1}(f) q_{x,1}(f) q_{x,1}^H(f), we have

A_max^(M)(f) ≥ A_max^(M−1)(f) ≥ ··· ≥ A_max^(2)(f) ≥ A_max^(1)(f) = 1.   (6.61)
This shows that by increasing the number of microphones, we necessarily increase the narrowband array gain. If there is only one microphone, the narrowband array gain cannot be improved, as expected [7].

Using (6.22), we deduce the optimal, in the maximum SNR sense, equivalent filtering matrix for the estimation of the whole vector x(f):

H_max(f) = ς*(f) q_{x,1}(f) q_{x,1}^H(f) Φ_v^{-1}(f)
         = [ς*(f) / φ_X̃(f)] Φ_x(f) Φ_v^{-1}(f)
         = [ς*(f) / φ_X̃(f)] [Φ_y(f) Φ_v^{-1}(f) − I_M].   (6.62)

We deduce that any element, X_m(f), of the vector x(f) can be estimated by the filter

h_max,m(f) = [ς(f) / φ_X̃(f)] [Φ_v^{-1}(f) Φ_y(f) − I_M] i_m.   (6.63)
6.4.2 Wiener

By minimizing the narrowband MSE, J[h(f)], with respect to h(f), we easily find the Wiener filter:

h_W(f) = φ_X̃(f) Φ_y^{-1}(f) q_{x,1}(f).   (6.64)

Let

Γ_y(f) = M Φ_y(f) / tr[Φ_y(f)]   (6.65)

be the pseudo-coherence matrix of the spatial observations. We can rewrite (6.64) as

h_W(f) = M [iSNR(f) / (1 + iSNR(f))] Γ_y^{-1}(f) q_{x,1}(f)
       = M H_W(f) Γ_y^{-1}(f) q_{x,1}(f),   (6.66)

where

H_W(f) = iSNR(f) / [1 + iSNR(f)]   (6.67)

is the (single-channel) Wiener gain and Γ_y^{-1}(f) q_{x,1}(f) is the spatial information vector. If we do not want to rely on the statistics of the noise to estimate the Wiener filter, we can approximate q_{x,1}(f) with the steering vector and estimate iSNR(f) with the decision-directed approach [8].

We can write the general form of the Wiener filter in another way that will make it easier to compare to other optimal filters. Indeed, determining the inverse of Φ_y(f) from (6.15) with Woodbury's identity, we get

Φ_y^{-1}(f) = Φ_v^{-1}(f) − Φ_v^{-1}(f) q_{x,1}(f) q_{x,1}^H(f) Φ_v^{-1}(f) / [φ_X̃^{-1}(f) + q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)].   (6.68)
Substituting (6.68) into (6.64) gives

h_W(f) = φ_X̃(f) Φ_v^{-1}(f) q_{x,1}(f) / [1 + φ_X̃(f) q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)]
       = φ_X̃(f) Φ_v^{-1}(f) q_{x,1}(f) / [1 + λ_max(f)].   (6.69)

It is interesting to see that the two filters h_W(f) and h_max(f) differ only by a real-valued factor. Indeed, taking

ς(f) = φ_X̃(f) / [1 + λ_max(f)]   (6.70)

in (6.60) (maximum SNR filter), we find (6.69) (Wiener filter). From (6.69), we deduce that the narrowband output SNR is

oSNR[h_W(f)] = λ_max(f)
             = tr[Φ_v^{-1}(f) Φ_y(f)] − M   (6.71)

and, obviously,

oSNR[h_W(f)] ≥ iSNR(f),   (6.72)
since the Wiener filter maximizes the narrowband output SNR. The speech distortion indices are

υ_sd[h_W(f)] = 1 / [1 + λ_max(f)]^2,   (6.73)

υ_sd[h_W] = ∫_f φ_X̃(f) [1 + λ_max(f)]^{-2} df / ∫_f φ_X̃(f) df.   (6.74)

The higher the value of λ_max(f) (and/or the number of microphones), the less the desired signal is distorted. It is also easy to find the noise reduction factors:

ξ_nr[h_W(f)] = [1 + λ_max(f)]^2 / [iSNR(f) λ_max(f)],   (6.75)

ξ_nr[h_W] = ∫_f φ_X̃(f) iSNR^{-1}(f) df / ∫_f φ_X̃(f) λ_max(f) [1 + λ_max(f)]^{-2} df,   (6.76)

and the speech reduction factors:

ξ_sr[h_W(f)] = [1 + λ_max(f)]^2 / λ_max^2(f),   (6.77)

ξ_sr[h_W] = ∫_f φ_X̃(f) df / ∫_f φ_X̃(f) λ_max^2(f) [1 + λ_max(f)]^{-2} df.   (6.78)

The broadband output SNR of the Wiener filter is

oSNR[h_W] = ∫_f φ_X̃(f) λ_max^2(f) [1 + λ_max(f)]^{-2} df / ∫_f φ_X̃(f) λ_max(f) [1 + λ_max(f)]^{-2} df.   (6.79)
Property 6.1. With the multichannel frequency-domain Wiener filter given in (6.64), the broadband output SNR is always greater than or equal to the broadband input SNR, i.e., oSNR[h_W] ≥ iSNR.

From (6.22) and (6.64), we deduce the Wiener filtering matrix for the estimation of the vector x(f):

H_W(f) = φ_X̃(f) q_{x,1}(f) q_{x,1}^H(f) Φ_y^{-1}(f)
       = Φ_x(f) Φ_y^{-1}(f)
       = I_M − Φ_v(f) Φ_y^{-1}(f)   (6.80)

or, from (6.69),

H_W(f) = Φ_x(f) Φ_v^{-1}(f) / [1 + tr(Φ_v^{-1}(f) Φ_x(f))]
       = [Φ_y(f) Φ_v^{-1}(f) − I_M] / [1 − M + tr(Φ_v^{-1}(f) Φ_y(f))].   (6.81)

As a result, X_m(f) can be estimated with the Wiener filter:

h_W,m(f) = [I_M − Φ_y^{-1}(f) Φ_v(f)] i_m
         = [Φ_v^{-1}(f) Φ_y(f) − I_M] i_m / [1 − M + tr(Φ_v^{-1}(f) Φ_y(f))].   (6.82)

We can express h_W,m(f) as a function of the narrowband input SNR and the pseudo-coherence matrices, i.e.,

h_W,m(f) = [1 + iSNR(f)] [Γ_v^{-1}(f) Γ_y(f) − I_M] i_m / {1 − M + [1 + iSNR(f)] tr[Γ_v^{-1}(f) Γ_y(f)]},   (6.83)

where

Γ_v(f) = M Φ_v(f) / tr[Φ_v(f)].   (6.84)

If we know that we are in the presence of spherically isotropic noise, the Wiener filter simplifies to

h_W,m(f) = [1 + iSNR(f)] [Γ_si^{-1}(f) Γ_y(f) − I_M] i_m / {1 − M + [1 + iSNR(f)] tr[Γ_si^{-1}(f) Γ_y(f)]},   (6.85)

which makes it very practical since the coherence matrix, Γ_si(f), of the spherically isotropic noise is known, while iSNR(f) and Γ_y(f) are easy to estimate.
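The equivalence of the two Wiener forms (6.64) and (6.69), and the identity oSNR[h_W(f)] = λ_max(f) of (6.71), can be verified directly. A hedged NumPy sketch for one synthetic bin (our own illustration):

```python
import numpy as np

# Multichannel Wiener filter: closed form (6.64) vs. Sherman-Morrison form (6.69).
rng = np.random.default_rng(5)
M = 4
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_S = 1.0
Phi_x = phi_S * np.outer(g, g.conj())
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = A @ A.conj().T + 0.1 * np.eye(M)
Phi_y = Phi_x + Phi_v
q_x1 = g / np.linalg.norm(g)
phi_Xt = phi_S * np.vdot(g, g).real

h_W = phi_Xt * np.linalg.solve(Phi_y, q_x1)              # (6.64)

lam_max = (np.trace(np.linalg.solve(Phi_v, Phi_y)) - M).real       # (6.71)
h_W_alt = phi_Xt * np.linalg.solve(Phi_v, q_x1) / (1 + lam_max)    # (6.69)
print(np.allclose(h_W, h_W_alt))                         # the two forms coincide

oSNR_W = phi_Xt * abs(np.vdot(h_W, q_x1)) ** 2 / (h_W.conj() @ Phi_v @ h_W).real
print(np.isclose(oSNR_W, lam_max))                       # oSNR[h_W] = lambda_max
```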
6.4.3 MVDR

The well-known MVDR filter proposed by Capon [9, 10] is derived by minimizing the narrowband MSE of the residual noise with the distortionless constraint, i.e.,

min_{h(f)} h^H(f) Φ_v(f) h(f) subject to h^H(f) q_{x,1}(f) = 1,   (6.86)

for which the solution is

h_MVDR(f) = Φ_v^{-1}(f) q_{x,1}(f) / [q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)]
          = φ_X̃(f) Φ_v^{-1}(f) q_{x,1}(f) / λ_max(f).   (6.87)

Taking

ς(f) = φ_X̃(f) / λ_max(f)   (6.88)

in (6.60) (maximum SNR filter), we find (6.87) (MVDR filter), showing how the maximum SNR and MVDR filters are equivalent up to a real-valued factor. Alternatively, we can also write the MVDR filter as

h_MVDR(f) = Φ_y^{-1}(f) q_{x,1}(f) / [q_{x,1}^H(f) Φ_y^{-1}(f) q_{x,1}(f)].   (6.89)
The Wiener and MVDR filters are simply related as follows:

h_W(f) = C_W(f) h_MVDR(f),   (6.90)

where

C_W(f) = h_W^H(f) q_{x,1}(f)
       = λ_max(f) / [1 + λ_max(f)]   (6.91)

can be seen as a single-channel frequency-domain Wiener gain. In fact, any filter of the form

h(f) = C(f) h_MVDR(f),   (6.92)

where C(f) is a real number with 0 < C(f) < 1, removes more noise than the MVDR filter at the price of some desired signal distortion, which is

ξ_sr[h(f)] = 1 / C^2(f)   (6.93)

or

υ_sd[h(f)] = [C(f) − 1]^2.   (6.94)

It can be verified that we always have

oSNR[h_MVDR(f)] = oSNR[h_W(f)],   (6.95)
υ_sd[h_MVDR(f)] = 0,   (6.96)
ξ_sr[h_MVDR(f)] = 1,   (6.97)

and

ξ_nr[h_MVDR(f)] ≤ ξ_nr[h_W(f)],   (6.98)
ξ_nr[h_MVDR] ≤ ξ_nr[h_W].   (6.99)
The MVDR filter rejects the maximum level of noise allowable without distorting the desired signal at each frequency. While the narrowband output SNRs of the Wiener and MVDR filters are strictly equal, their broadband output SNRs are not. The broadband output SNR of the MVDR filter is

oSNR[h_MVDR] = ∫_f φ_X̃(f) df / ∫_f φ_X̃(f) λ_max^{-1}(f) df   (6.100)

and

oSNR[h_MVDR] ≤ oSNR[h_W].   (6.101)
Property 6.2. With the multichannel frequency-domain MVDR filter given in (6.87), the broadband output SNR is always greater than or equal to the broadband input SNR, i.e., oSNR[h_MVDR] ≥ iSNR.

It is easy to observe that the MVDR filtering matrix for the estimation of the vector x(f) is

H_MVDR(f) = q_{x,1}(f) q_{x,1}^H(f) Φ_v^{-1}(f) / [q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)]
          = Φ_x(f) Φ_v^{-1}(f) / tr[Φ_v^{-1}(f) Φ_x(f)]
          = [Φ_y(f) Φ_v^{-1}(f) − I_M] / [tr(Φ_v^{-1}(f) Φ_y(f)) − M]   (6.102)

or, from (6.89),

H_MVDR(f) = [I_M − Φ_v(f) Φ_y^{-1}(f)] / [M − tr(Φ_y^{-1}(f) Φ_v(f))]
          = H_W(f) / [M − tr(Φ_y^{-1}(f) Φ_v(f))].   (6.103)

As a result, X_m(f) can be estimated with the MVDR filter:

h_MVDR,m(f) = [I_M − Φ_y^{-1}(f) Φ_v(f)] i_m / [M − tr(Φ_y^{-1}(f) Φ_v(f))]
            = [Φ_v^{-1}(f) Φ_y(f) − I_M] i_m / [tr(Φ_v^{-1}(f) Φ_y(f)) − M].   (6.104)

We can express h_MVDR,m(f) as a function of the narrowband input SNR and the pseudo-coherence matrices, i.e.,

h_MVDR,m(f) = [1 + iSNR(f)] [Γ_v^{-1}(f) Γ_y(f) − I_M] i_m / {[1 + iSNR(f)] tr[Γ_v^{-1}(f) Γ_y(f)] − M}.   (6.105)

If we know that we are in the presence of spherically isotropic noise, the MVDR filter simplifies to

h_MVDR,m(f) = [1 + iSNR(f)] [Γ_si^{-1}(f) Γ_y(f) − I_M] i_m / {[1 + iSNR(f)] tr[Γ_si^{-1}(f) Γ_y(f)] − M}.   (6.106)
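Two defining properties of the MVDR filter are easy to confirm numerically: it is exactly distortionless, h_MVDR^H(f) q_{x,1}(f) = 1, and the Wiener filter is its scaled version through the gain C_W(f) of (6.90) and (6.91). A hedged single-bin NumPy sketch (our own synthetic setup):

```python
import numpy as np

# MVDR (6.87) vs. Wiener (6.64) on one synthetic frequency bin.
rng = np.random.default_rng(6)
M = 4
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_S = 1.0
Phi_x = phi_S * np.outer(g, g.conj())
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = A @ A.conj().T + 0.1 * np.eye(M)
Phi_y = Phi_x + Phi_v
q_x1 = g / np.linalg.norm(g)
phi_Xt = phi_S * np.vdot(g, g).real

w = np.linalg.solve(Phi_v, q_x1)                         # Phi_v^{-1} q_x1
h_MVDR = w / np.vdot(q_x1, w)                            # (6.87)
print(np.isclose(np.vdot(h_MVDR, q_x1).real, 1.0))       # distortionless, cf. (6.42)

lam_max = phi_Xt * np.vdot(q_x1, w).real                 # (6.57)
h_W = phi_Xt * np.linalg.solve(Phi_y, q_x1)              # Wiener (6.64)
C_W = lam_max / (1 + lam_max)                            # (6.91)
print(np.allclose(h_W, C_W * h_MVDR))                    # (6.90)
```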
6.4.4 Tradeoff

The tradeoff filter is derived by minimizing the narrowband MSE of the speech distortion with the constraint that the narrowband noise reduction factor is equal to a positive value greater than 1, i.e.,

min_{h(f)} J_ds[h(f)] subject to J_rs[h(f)] = β tr[Φ_v(f)],   (6.107)

where 0 < β < 1 to ensure that we get some noise reduction. By using a Lagrange multiplier, μ > 0, to adjoin the constraint to the cost function, we easily deduce the tradeoff filter:

h_T,μ(f) = φ_X̃(f) [Φ_x(f) + μ Φ_v(f)]^{-1} q_{x,1}(f)
         = φ_X̃(f) Φ_v^{-1}(f) q_{x,1}(f) / [μ + φ_X̃(f) q_{x,1}^H(f) Φ_v^{-1}(f) q_{x,1}(f)],   (6.108)

where we have assumed that the matrix Φ_x(f) + μ Φ_v(f) is invertible and the Lagrange multiplier, μ, satisfies

J_rs[h_T,μ(f)] = β tr[Φ_v(f)].   (6.109)

However, in practice, it is not easy to determine the optimal μ. Therefore, when this parameter is chosen in a heuristic way, we can see that:
• μ = 1 gives h_T,1(f) = h_W(f), which is the Wiener filter;
• μ = 0 gives h_T,0(f) = h_MVDR(f), which is the MVDR filter;
• μ > 1 results in a filter with low residual noise at the expense of high speech distortion (as compared to Wiener);
• μ < 1 results in a filter with high residual noise and low speech distortion (as compared to Wiener).

Note that the MVDR filter cannot be derived from the first line of (6.108), since taking μ = 0 there would require inverting a matrix that is not full rank.

It can be observed that the tradeoff, Wiener, and maximum SNR filters are equivalent up to a real-valued number. As a result, the narrowband output SNR of the tradeoff filter is independent of μ and is identical to the narrowband output SNR of the maximum SNR filter, i.e.,

oSNR[h_T,μ(f)] = oSNR[h_max(f)], ∀μ ≥ 0.   (6.110)

We have

υ_sd[h_T,μ(f)] = [μ / (μ + λ_max(f))]^2,   (6.111)
ξ_sr[h_T,μ(f)] = [1 + μ / λ_max(f)]^2,   (6.112)
ξ_nr[h_T,μ(f)] = [μ + λ_max(f)]^2 / [iSNR(f) λ_max(f)].   (6.113)
The tradeoff filter is interesting from several perspectives since it encompasses both the Wiener and MVDR filters. It is then useful to study the broadband output SNR and the broadband speech distortion index of the tradeoff filter. Next, we give some important results.

It can be verified that the broadband output SNR of the tradeoff filter is

oSNR[h_T,μ] = ∫_f φ_X̃(f) λ_max^2(f) [μ + λ_max(f)]^{-2} df / ∫_f φ_X̃(f) λ_max(f) [μ + λ_max(f)]^{-2} df.   (6.114)

Property 6.3. The broadband output SNR of the tradeoff filter is an increasing function of the parameter μ [11].

An important consequence of the previous property is that, in the class of tradeoff filters, the MVDR filter gives the smallest broadband output SNR [11]. While the broadband output SNR is upper bounded, the broadband noise reduction factor and broadband speech reduction factor are not; as μ goes to infinity, so do ξ_nr[h_T,μ] and ξ_sr[h_T,μ]. The broadband speech distortion index is

υ_sd[h_T,μ] = ∫_f φ_X̃(f) μ^2 [μ + λ_max(f)]^{-2} df / ∫_f φ_X̃(f) df.   (6.115)

Property 6.4. The broadband speech distortion index of the tradeoff filter is an increasing function of the parameter μ. It is clear that

0 ≤ υ_sd[h_T,μ] ≤ 1, ∀μ ≥ 0.   (6.116)
Therefore, as μ increases, the broadband output SNR increases at the price of more distortion to the desired signal.

Property 6.5. With the multichannel frequency-domain tradeoff filter given in (6.108), the broadband output SNR is always greater than or equal to the broadband input SNR, i.e., oSNR[h_T,μ] ≥ iSNR, ∀μ ≥ 0 [11].

From the previous results, we deduce that, for μ ≥ 1,

1 ≤ A[h_MVDR] ≤ A[h_W] ≤ A[h_T,μ],   (6.117)
0 = υ_sd[h_MVDR] ≤ υ_sd[h_W] ≤ υ_sd[h_T,μ],   (6.118)

and, for 0 ≤ μ ≤ 1,

1 ≤ A[h_MVDR] ≤ A[h_T,μ] ≤ A[h_W],   (6.119)
0 = υ_sd[h_MVDR] ≤ υ_sd[h_T,μ] ≤ υ_sd[h_W].   (6.120)

It can be verified that the equivalent tradeoff filtering matrix for the estimation of x(f) is

H_T,μ(f) = [Φ_y(f) Φ_v^{-1}(f) − I_M] / [μ − M + tr(Φ_v^{-1}(f) Φ_y(f))],   (6.121)

so that the tradeoff filter to estimate X_m(f) is

h_T,μ,m(f) = [Φ_v^{-1}(f) Φ_y(f) − I_M] i_m / [μ − M + tr(Φ_v^{-1}(f) Φ_y(f))].   (6.122)
6.4.5 LCMV

It is possible to derive an LCMV filter [12, 13], which can handle more than one linear constraint, by exploiting the decomposition of the noise spatial correlation matrix. Let

Φ_v(f) = Q_v(f) Λ_v(f) Q_v^H(f),   (6.123)

where the unitary and diagonal matrices Q_v(f) and Λ_v(f) are defined similarly to Q_x(f) and Λ_x(f), respectively. We assume that the (positive) eigenvalues of Φ_v(f) have the following structure: λ_{v,1}(f) ≥ λ_{v,2}(f) ≥ ··· ≥ λ_{v,Q}(f) > φ_swn(f) and λ_{v,Q+1}(f) = λ_{v,Q+2}(f) = ··· = λ_{v,M}(f) = φ_swn(f), where Q + 1 ≤ M. In this case, we can express the unitary matrix as

Q_v(f) = [T_v(f) Υ_v(f)],   (6.124)

where the M × Q matrix T_v(f) contains the eigenvectors corresponding to the first Q eigenvalues of Φ_v(f) and the M × (M − Q) matrix Υ_v(f) contains the eigenvectors corresponding to the last M − Q eigenvalues of Φ_v(f). As a result, the noise signal vector can be decomposed as

v(f) = v_c(f) + v_i(f),   (6.125)

where

v_c(f) = T_v(f) T_v^H(f) v(f)   (6.126)

corresponds to the coherent noise,

v_i(f) = Υ_v(f) Υ_v^H(f) v(f)   (6.127)

corresponds to the incoherent noise, and E[v_c(f) v_i^H(f)] = 0_{M×M}. The LCMV filter that we propose in this subsection consists of estimating X̃(f) without any distortion, partially removing the coherent noise, and attenuating the incoherent noise as much as possible. It follows that the constraints are

h^H(f) C_X̃v(f) = [1 α^T],   (6.128)

where

C_X̃v(f) = [q_{x,1}(f) T_v(f)]   (6.129)

is the constraint matrix of size M × (Q + 1) and

α = [α_1 α_2 ··· α_Q]^T   (6.130)

is the attenuation vector of length Q with 0 ≤ α_q < 1, q = 1, 2, ..., Q. The optimization problem is now

min_{h(f)} h^H(f) Φ_y(f) h(f) subject to h^H(f) C_X̃v(f) = [1 α^T],   (6.131)

from which we deduce the LCMV filter:

h_LCMV(f) = Φ_y^{-1}(f) C_X̃v(f) [C_X̃v^H(f) Φ_y^{-1}(f) C_X̃v(f)]^{-1} [1 α^T]^H.   (6.132)

We see from (6.132) that we must have Q + 1 ≤ M, otherwise the matrix C_X̃v^H(f) Φ_y^{-1}(f) C_X̃v(f) is not invertible. For Q + 1 > M, the LCMV filter does not exist, and for Q + 1 = M, the LCMV filter simplifies to

h_LCMV(f) = C_X̃v^{-H}(f) [1 α^T]^H.   (6.133)

Finally, we see that the LCMV filtering matrix for the estimation of x(f) is

H_LCMV(f) = q_{x,1}(f) [1 α^T] [C_X̃v^H(f) Φ_y^{-1}(f) C_X̃v(f)]^{-1} C_X̃v^H(f) Φ_y^{-1}(f).   (6.134)
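The LCMV solution (6.132) satisfies its constraints exactly: the speech is kept undistorted and the coherent-noise direction is attenuated by the chosen factor. Below is a hedged NumPy sketch with one coherent interferer plus white noise (so Q = 1 and Q + 1 ≤ M); the scenario and names are our own illustration:

```python
import numpy as np

# LCMV filter (6.132) on one synthetic bin: constraints h^H C = [1, alpha].
rng = np.random.default_rng(8)
M = 4
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_S = 1.0
Phi_x = phi_S * np.outer(g, g.conj())
q_x1 = g / np.linalg.norm(g)

d = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # coherent interferer direction
Phi_v = 4.0 * np.outer(d, d.conj()) + 0.2 * np.eye(M)     # one dominant eigenvalue (Q = 1)
Phi_y = Phi_x + Phi_v

# T_v of (6.124): eigenvector of Phi_v for its single dominant eigenvalue.
vals, vecs = np.linalg.eigh(Phi_v)
t_v1 = vecs[:, -1]

alpha = 0.1                                               # attenuation of the coherent noise
C = np.column_stack([q_x1, t_v1])                         # constraint matrix (6.129)
resp = np.array([1.0, alpha])                             # desired response [1, alpha]

# (6.132): h = Phi_y^{-1} C (C^H Phi_y^{-1} C)^{-1} [1, alpha]^H.
PhiyinvC = np.linalg.solve(Phi_y, C)
h_LCMV = PhiyinvC @ np.linalg.solve(C.conj().T @ PhiyinvC, resp)

print(np.allclose(C.conj().T @ h_LCMV, resp))             # constraints (6.128) met exactly
```

The first entry of the constraint check is the distortionless condition h^H(f) q_{x,1}(f) = 1; the second shows the coherent-noise eigenvector passed with gain α.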
REFERENCES

[1] J. Benesty, J. Chen, Y. Huang, Microphone Array Signal Processing, Springer-Verlag, Berlin, Germany, 2008.
[2] J. Benesty, J. Chen, E. Habets, Speech Enhancement in the STFT Domain, SpringerBriefs in Electrical and Computer Engineering, Springer-Verlag, 2011.
[3] J.P. Dmochowski, J. Benesty, Microphone arrays: fundamental concepts, in: I. Cohen, J. Benesty, S. Gannot (Eds.), Speech Processing in Modern Communication—Challenges and Perspectives, Springer-Verlag, Berlin, Germany, 2010, pp. 199–223 (Chapter 8).
[4] D.H. Johnson, D.E. Dudgeon, Array Signal Processing—Concepts and Techniques, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[5] G.W. Elko, J. Meyer, Microphone arrays, in: J. Benesty, M.M. Sondhi, Y. Huang (Eds.), Springer Handbook of Speech Processing, Springer-Verlag, Berlin, Germany, 2008, pp. 1021–1041 (Chapter 48).
[6] J.N. Franklin, Matrix Theory, Prentice-Hall, Englewood Cliffs, NJ, 1968.
[7] J. Benesty, J. Chen, Y. Huang, I. Cohen, Noise Reduction in Speech Processing, Springer-Verlag, Berlin, Germany, 2009.
[8] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (1984) 1109–1121.
[9] J. Capon, High resolution frequency-wavenumber spectrum analysis, Proc. IEEE 57 (1969) 1408–1418.
[10] R.T. Lacoss, Data adaptive spectral analysis methods, Geophysics 36 (1971) 661–675.
[11] M. Souden, J. Benesty, S. Affes, On the global output SNR of the parameterized frequency-domain multichannel noise reduction Wiener filter, IEEE Signal Process. Lett. 17 (2010) 425–428.
[12] O. Frost, An algorithm for linearly constrained adaptive array processing, Proc. IEEE 60 (1972) 926–935.
[13] M. Er, A. Cantoni, Derivative constraints for broad-band element space antenna array processors, IEEE Trans. Acoust. Speech Signal Process. 31 (1983) 1378–1393.