Signal Processing 117 (2015) 126–137
Efficient online target speech extraction using DOA-constrained independent component analysis of stereo data for robust speech recognition

Minook Kim, Hyung-Min Park

Department of Electronic Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 121-742, Republic of Korea
Article history: Received 9 March 2015; Accepted 27 April 2015; Available online 16 May 2015

Abstract
This paper describes an efficient online target-speech-extraction method used as a preprocessing step for robust automatic speech recognition (ASR). Because a target speaker is located relatively close to the microphones in many ASR applications, the acoustic paths to the microphones are moderately reverberant, and the target speaker direction can easily be estimated. In this situation, noise estimation is effectively performed by forming a directional null to the target speaker. The weights required for extracting target speech, independent of the estimated noise, are then determined using an adaptation rule derived from a modified version of the cost function for independent component analysis (ICA), while retaining the minimal distortion principle. In particular, an online natural-gradient learning rule with a nonholonomic constraint and normalization by a smoothed power estimate of the input signal is derived for stable convergence, even for dynamically changing speech levels, with much less computational complexity than conventional ICA. Furthermore, stereo mixtures are considered as input data for further reduction of computational loads and fast convergence. Although the method may suffer from the underdetermined problem, the weights are adapted to obtain signal-to-noise-ratio-maximization beamformers for successful target speech estimation. The experimental results obtained for various conditions demonstrate the effectiveness of the proposed method.
Keywords: Target speech extraction; Robust speech recognition; Independent component analysis; Direction of arrival; Online adaptation
1. Introduction

Noise robustness is a very important issue in the commercialization of automatic speech recognition (ASR) systems because the performance of such systems is seriously degraded in noisy real-world environments. This degradation occurs mainly because of the differences between training and testing environments, and many algorithms have been proposed to compensate for the mismatch (e.g. [1–5]). Although
these approaches can improve recognition accuracy under some conditions, most of them frequently fail to yield high-performance ASR systems in dynamically changing environments with various nonstationary interferences (e.g. [6]).

In recent decades, the performance of speech enhancement under highly adverse circumstances has been dramatically improved by the use of techniques for blind source separation (BSS), i.e., recovering source signals from their mixtures without information about the mixing process and its environment (e.g. [7–9]). Some of these techniques can be used as preprocessing for robust speech recognition (e.g. [10–12]). Among the various techniques available, independent component analysis (ICA), a signal processing
method that expresses multivariate data as linear combinations of statistically independent random variables, has attracted considerable interest because of its successful performance in many BSS applications (e.g. [8,9,13]). Because acoustic mixing in real-world situations involves complex reverberations, ICA has been extended to the deconvolution of mixtures in both the time and frequency domains. In general, the frequency-domain (FD) approach is preferred for acoustic source separation because of the intensive computations required for, and the slow convergence of, the time-domain (TD) approach [13].

To avoid the degenerate problem of obtaining an inverse of the mixing system, most conventional ICA-based approaches should not be used for underdetermined cases, where the number of sensors is less than the number of sources [13]. This is a very undesirable requirement for real-world applications because the number of active sources is not known in advance in most practical situations. Moreover, the acoustic inverse system needed to reconstruct source signals from a sufficient number of mixtures (to avoid an underdetermined problem) usually requires too many parameters to be accurately estimated, so its estimation results in slow convergence and intensive computations. Therefore, the number of mixtures needs to be reduced to develop an efficient ICA-based preprocessing method with fast convergence.

Fortunately, the preprocessing of ASR systems does not require restoration of all sources but rather enhancement of the target speech only. In addition, the target speaker is frequently located relatively close to the microphones, so the target speaker can be regarded as a point source, and the accompanying reverberation components are moderate. In such cases, array processing with small numbers of microphones and filter taps is proficient in noise estimation by forming a directional null to the target speaker [10]. A preprocessing method called the blind spatial subtraction array (BSSA) was proposed along these lines [10]. This method involves subtraction of the noise power spectrum estimated by ICA, followed by application of the projection-back (PB) method after elimination of the target speech output. This method is more advantageous than conventional ICA methods because it circumvents the problem caused by underdetermined cases, with a limited number of microphones, in which acoustic environments including widespread background noises may be formulated. However, the target speech output of ICA in the BSSA still contains residual noise,¹ even though ICA optimizes parameters to obtain signal-to-noise-ratio (SNR)-maximization beamformers, and the noise power removed when eliminating the target speech output can be expanded by the PB method. Therefore, the noise power spectrum estimate may not be accurate, which may result in performance degradation in speech recognition.

¹ Note that this method is proficient in noise estimation only when the target speaker is located close to the microphones.

Other problems with ICA-based preprocessing techniques are scaling (arbitrary filtering for convolutive mixtures) and permutation indeterminacy [13–15]. In FD-ICA methods, arbitrary scaling and permutation across
frequency bins should be resolved to successfully restore source signals without distortion. Scaling indeterminacy can be relieved by the minimal distortion principle (MDP) [16], and there are several approaches to overcoming the permutation problem (e.g. [17–23]). However, most of these approaches utilize properties of natural sounds or mixing environments heuristically, which may not always be effective for some kinds of acoustic mixtures. Moreover, as the number of restored sources increases, the problem becomes considerably more difficult. As an alternative to FD-ICA methods, independent vector analysis (IVA) resolves the permutation problem by introducing a new source prior [24]. The prior models the frequency dependencies of source signals, while IVA utilizes the same framework as FD ICA in that it estimates a separating matrix in each frequency bin. Furthermore, even when the scaling and permutation problems are successfully overcome, the preprocessing method should still identify the target speech output.

In many applications, the target source direction can be estimated in advance because it can be obtained by sound localization methods or image-based source detection methods [15]. Because BSS to recover all sources from mixtures can be regarded as a set of beamformers whose response is constrained to a set of directions of arrival (DOAs), geometric information about sound sources can be incorporated as constraints or regularizations in ICA to resolve the indeterminacy inherent in the independence criterion (e.g. [14,15]). Recently, Nesta and Matassoni [11,12] presented a constrained ICA method based on a semi-blind source separation framework [25]. In the semi-blind source extraction (SBSE), a separating-matrix constraint is imposed to force the first output to give a target source signal, while the others provide the remaining noise source signals [11,12]. Unfortunately, for estimated (inaccurate) target mixing parameters, fixing the first column of the adaptive separating matrix without adaptation can yield deteriorated target speech, whereas moderate adaptation does not necessarily guarantee a target source signal in the first system output [12]. In addition, moderate adaptation requires more intensive computations than conventional FD ICA because of the additional transformation of input mixture vectors.

In this paper, we propose a preprocessing method with low computational complexity and fast convergence that uses stereo data from dual microphones. To minimize the number of parameters to be estimated, only two weights are adapted so that the weighted signals are summed to extract a target speech signal in each frequency bin. To guide the target speech signal to the output, a dummy output for noise estimation is obtained by forming a directional null to the target speaker. The target speaker's direction is assumed to be known as a priori knowledge, and a simple “delay-and-subtract nullformer” is used to provide the dummy output. Because ICA is more robust against inaccurate estimation of the target direction, possibly due to moderate reverberation, than conventional adaptive beamformers [15], the weights are estimated by an ICA-based method with real and dummy outputs. The adaptation rule is derived from a cost function modified from the conventional ICA one, with the MDP retained to
overcome the scaling problem. In addition, an online natural-gradient learning rule [13,26] with a nonholonomic constraint [27] and normalization by a smoothed power estimate of the input signal [28] is derived to improve parameter convergence, even for dynamically changing speech signal levels, with much less computational complexity than conventional ICA.

Unfortunately, the method also suffers from the problem caused by the underdetermined case in practical situations. However, the dummy output is proficient in noise estimation, assuming that the target speaker is located close to the microphones in a typical ASR situation, such as recognition of the speech of a car driver or a kiosk user, so the ICA-based adaptation optimizes weights to obtain an SNR-maximization beamformer for the real (or target speech) output in each frequency bin.

The remainder of this paper is organized as follows: Section 2 describes the basic formulation of BSS and the conventional ICA algorithm in the frequency domain. BSS approaches incorporating geometric information about sources are described in Section 3. In Section 4, a real-time version of IVA [28] is reviewed, and a method for selecting a target speech output using geometric information is provided for ASR applications. A method for efficiently estimating target speech independent of noise estimation using the DOA of a target speaker is proposed in Section 5. Section 6 evaluates the performance of the proposed method through speech recognition experiments. Conclusions are presented in Section 7.

2. BSS formulation and conventional ICA in the frequency domain

2.1. Formulation

Let us consider mutually independent unknown sources, $\{s_n(t),\, n = 1, \ldots, N\}$. The sources are transmitted through acoustic channels and mixed to yield observations, $\{x_m(t),\, m = 1, \ldots, M\}$. Therefore, the mixtures are linear combinations of filtered versions of the sources and can be given by
$$x_m(t) = \sum_{n=1}^{N} \sum_{p=0}^{L-1} a_{mn}(p)\, s_n(t-p), \quad m = 1, \ldots, M, \qquad (1)$$
where $a_{mn}(p)$ denotes a mixing filter coefficient from the n-th source to the m-th observation, and L is the filter length.
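For concreteness, the mixing model (1) can be sketched in a few lines of NumPy; the function name and array layout below are our own illustration, not part of the paper.

```python
import numpy as np

def convolutive_mix(sources, A):
    """Convolutive mixing model, Eq. (1): each microphone observes a sum
    of FIR-filtered sources. sources: (N, T); A: (M, N, L) taps a_mn(p).
    (Illustrative names; not from the paper.)"""
    M, N, L = A.shape
    T = sources.shape[1]
    x = np.zeros((M, T + L - 1))
    for m in range(M):
        for n in range(N):
            x[m] += np.convolve(A[m, n], sources[n])
    return x
```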
The TD mixture $x_m(t)$ is converted into time–frequency segments $X_m(k,\tau)$ by the short-time Fourier transform (STFT) expressed as
$$X_m(k,\tau) = \sum_{t=0}^{K-1} x_m(\tau H + t)\, w(t)\, e^{-j\omega_k t}, \quad m = 1, \ldots, M, \qquad (2)$$
where τ and H denote the frame index and the frame-shift size in samples, respectively, w(t) is a window function with a length up to the number of frequency bins K, and $\omega_k$ represents the frequency at the k-th bin, which is equal to $2\pi(k-1)/K$. Assuming that the length of the window function is sufficiently longer than the effective length of the mixing filter, the convolution in the time domain is approximately transformed into multiplication in the frequency domain as follows:
$$\mathbf{x}(k,\tau) = \mathbf{A}(k)\,\mathbf{s}(k,\tau), \qquad (3)$$
where $\mathbf{x}(k,\tau) = [X_1(k,\tau), \ldots, X_M(k,\tau)]^T$ and $\mathbf{s}(k,\tau) = [S_1(k,\tau), \ldots, S_N(k,\tau)]^T$ denote vectors composed of the time–frequency segments of the mixture and source signals, respectively, at frequency bin k and frame τ, and $\mathbf{A}(k)$ represents a mixing matrix at the k-th frequency bin.

The aim of BSS for convolutive acoustic mixtures is to restore source signals by estimating separating matrices such that
$$\mathbf{u}(k,\tau) = \mathbf{W}(k)\,\mathbf{x}(k,\tau), \qquad (4)$$
where $\mathbf{u}(k,\tau) = [U_1(k,\tau), \ldots, U_N(k,\tau)]^T$ denotes a vector composed of the time–frequency segments of the estimated source signals at frequency bin k and frame τ, and $\mathbf{W}(k)$ is a separating matrix at the k-th frequency bin. After fixing the scaling and permutation problems described in Section 2.2, the TD waveform of an estimated source signal can be reconstructed by the inverse Fourier transform and overlap-add method as follows [24]:
$$u_n(t) = \sum_{\tau}\sum_{k=1}^{K} U_n(k,\tau)\, e^{j\omega_k (t - \tau H)}, \quad n = 1, \ldots, N. \qquad (5)$$
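The analysis–synthesis pair (2) and (5) can be sketched as follows, assuming the Hanning analysis window and frame shift H = K/4 adopted in the paper; helper names are ours.

```python
import numpy as np

def stft(x, K=1024, H=256):
    """Analysis, Eq. (2): Hanning-windowed K-point DFT every H samples."""
    w = np.hanning(K)
    n_frames = 1 + (len(x) - K) // H
    frames = np.stack([w * x[t * H : t * H + K] for t in range(n_frames)])
    return np.fft.fft(frames, axis=1)              # X[tau, k]

def istft(X, H=256):
    """Synthesis, Eq. (5): inverse DFT of each frame plus overlap-add.
    With a Hanning window and H = K/4, the shifted windows sum to an
    (approximately) constant factor of 2, which is divided out."""
    n_frames, K = X.shape
    y = np.zeros((n_frames - 1) * H + K)
    for t in range(n_frames):
        y[t * H : t * H + K] += np.fft.ifft(X[t]).real
    return y / 2.0
```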
The window effect of the reconstructed waveform is avoided by using a Hanning window with length K and setting H to K/4.

2.2. FD ICA and its indeterminacy

In (4), FD ICA can be used to accomplish BSS by trying to find a linear transform of the mixtures that yields statistically independent signals corresponding to the source signals $\mathbf{s}(k,\tau)$. To measure the dependency between estimated time–frequency segments of source signals in the k-th frequency bin, the Kullback–Leibler (KL) divergence between the exact joint probability density function (pdf) $p(U_1(k,\tau),\ldots,U_N(k,\tau))$ and the product of hypothesized pdf models of the estimated sources $\prod_{n=1}^{N} q(U_n(k,\tau))$ can be used as [13,26]
$$D\left(p(U_1(k,\tau),\ldots,U_N(k,\tau)) \,\middle\|\, \prod_{n=1}^{N} q(U_n(k,\tau))\right) = \int p(U_1(k,\tau),\ldots,U_N(k,\tau)) \log\frac{p(U_1(k,\tau),\ldots,U_N(k,\tau))}{\prod_{n=1}^{N} q(U_n(k,\tau))}\, d\mathbf{u}(k,\tau)$$
$$= f(\mathbf{x}(k,\tau)) - \log|\det\mathbf{W}(k)| - \sum_{n=1}^{N}\int p(U_n(k,\tau)) \log q(U_n(k,\tau))\, dU_n(k,\tau)$$
$$= f(\mathbf{x}(k,\tau)) - \log|\det\mathbf{W}(k)| - \sum_{n=1}^{N} E[\log q(U_n(k,\tau))], \qquad (6)$$
where $f(\mathbf{x}(k,\tau)) = \int p(X_1(k,\tau),\ldots,X_M(k,\tau)) \log p(X_1(k,\tau),\ldots,X_M(k,\tau))\, d\mathbf{x}(k,\tau)$. The natural gradient to minimize the KL divergence provides the following learning rule for the separating matrix at the k-th frequency bin [13,26]:
$$\Delta\mathbf{W}(k) \propto \{\mathbf{I} - E[\boldsymbol{\phi}(\mathbf{u}(k,\tau))\,\mathbf{u}^H(k,\tau)]\}\,\mathbf{W}(k), \qquad (7)$$
where $\mathbf{I}$ denotes the identity matrix, and $\boldsymbol{\phi}(\mathbf{u}(k,\tau))$ is the score function given by
$$\boldsymbol{\phi}(\mathbf{u}(k,\tau)) = \left[-\frac{d\log q(U_1(k,\tau))}{dU_1(k,\tau)}, \ldots, -\frac{d\log q(U_N(k,\tau))}{dU_N(k,\tau)}\right]^T. \qquad (8)$$
Assuming that the time–frequency segments $U_n(k,\tau)$ at a frequency bin are described by a Laplace distribution, $-d\log q(U_n(k,\tau))/dU_n(k,\tau) = \exp(j\arg(U_n(k,\tau)))$, $n = 1, \ldots, N$. Adopting a simple stochastic gradient by omitting the expectation in (7), the online FD-ICA algorithm can be expressed as
$$\Delta\mathbf{W}(k) \propto \{\mathbf{I} - \boldsymbol{\phi}(\mathbf{u}(k,\tau))\,\mathbf{u}^H(k,\tau)\}\,\mathbf{W}(k). \qquad (9)$$

In fact, the independence criterion of FD ICA implies mutual independence between all pairs of estimated signals, so the time–frequency segments of the estimated signals $\mathbf{u}(k,\tau)$ are an arbitrarily scaled and permuted version of the original source signals $\mathbf{s}(k,\tau)$ such that
$$\mathbf{u}(k,\tau) = \mathbf{W}(k)\mathbf{A}(k)\,\mathbf{s}(k,\tau) = \mathbf{P}(k)\mathbf{C}(k)\,\mathbf{s}(k,\tau), \qquad (10)$$
where $\mathbf{P}(k)$ and $\mathbf{C}(k)$ denote a permutation matrix and a diagonal scaling matrix at the k-th frequency bin, respectively. Inconsistent scaling across frequency bins leads to distortion of the restored signals, whereas inconsistent permutation results in failure of signal separation even when the outputs are separated in each frequency bin. The inconsistency is caused by the FD formulation of ICA, in which the separation task is decoupled into an individual task for each frequency bin. The permutation problem becomes more severe as the number of sensors increases because the number of possible permutations increases dramatically [14].

Because natural sounds are nonstationary and their variances at frequency bins are unknown, the scaling indeterminacy can be relieved by adjusting the estimated separating matrices using the MDP [16]. There are several approaches to fixing the permutation problem by exploiting the spectral properties of natural sounds [21–23]. However, in practice, it is difficult to extract features from estimated time–frequency segments that are appropriate for distinguishing the spectral components of various kinds of sounds at all frequency bins. In addition, these methods must consider various combinations of spectral components, which results in high computational complexity [14]. On the other hand, smoothness constraints on the separating filter coefficients in the frequency domain, which limit the filter length in the time domain, may not always be reasonable, as rather long filters are required in strongly reverberant environments [14].
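As an illustration of the online update (9) with the Laplace score, here is a minimal per-bin sketch (our own naming and step size, not the paper's implementation):

```python
import numpy as np

def fd_ica_step(W, x, lr=1e-2):
    """One online natural-gradient FD-ICA step at a single frequency bin,
    Eq. (9), with the Laplace score of Eq. (8). Assumes N = M sources and
    sensors; W is the N x N separating matrix W(k) and x the mixture
    vector x(k, tau). (Sketch; names and step size are ours.)"""
    u = W @ x                                # source estimates, Eq. (4)
    phi = np.exp(1j * np.angle(u))           # score exp(j arg U_n)
    W = W + lr * (np.eye(len(u)) - np.outer(phi, u.conj())) @ W
    return W, u
```

The update would be run independently for every bin k, after which the scaling and permutation indeterminacies discussed above must still be resolved.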
3. BSS using geometric information on sources

In many ASR applications, the target speaker is located relatively close to the microphones, which means that the target source direction is easily estimated and the accompanying acoustic path has moderate reverberation [10,15].

3.1. Geometric source separation

The array processing structure needed for BSS to recover all sources from mixtures can be regarded as a set of beamformers whose response is constrained to a set of DOAs. Conventional beamforming methods exploit geometric information about source locations to extract a source signal of interest while reducing the other interferences, but they may suffer from performance degradation when source signals reach the sensor array from many directions in reverberant environments. On the other hand, ICA circumvents this performance degradation by optimizing separating filters to yield mutually independent outputs. Geometric source separation [14] attempts to combine the advantages of both approaches by incorporating geometric constraints on source locations into ICA to resolve the indeterminacy inherent in the independence criterion. This method requires geometric information about all source locations, but some locations may be unknown in advance, or their direct-path components (corresponding to their geometric information) may not be dominant if the sources are far from the microphones in highly reverberant conditions.

3.2. SBSE

Semi-blind source separation is a special type of BSS for use when some prior knowledge about the source signals or mixing environments is available [25]. Assuming that the mixing parameters related to a target source are known or can be estimated in advance, with an always active target source and stationary mixing parameters, Nesta and Matassoni proposed a constrained ICA called SBSE that imposes a constraint on the separating matrix to force the first system output to estimate a target signal while the others provide estimates of the remaining interfering sources [11,12]. In (3), let us assume that the first element $S_1(k,\tau)$ in $\mathbf{s}(k,\tau)$ corresponds to the target source signal and that the first column $\mathbf{a}_1(k)$ in $\mathbf{A}(k)$ contains the mixing parameters of the target signal. Using $\mathbf{a}_1(k)$, the mixing matrix $\mathbf{A}(k)$ can be factorized as [11,12]
$$\mathbf{A}(k) = [\mathbf{a}_1(k)\,|\,\mathbf{I}_{2\cdots M}]\,[\boldsymbol{\sigma}\,|\,\mathbf{F}(k)], \qquad (11)$$
where $\mathbf{I}_{2\cdots M}$ denotes the last M−1 columns of the M×M identity matrix, $\boldsymbol{\sigma}$ is an M×1 column vector $[\sigma, 0, \ldots, 0]^T$, σ is an arbitrary constant, and $\mathbf{F}(k)$ is a matrix resulting from the factorization. To obtain the target signal in the first system output, the separating matrix should be represented by
$$\mathbf{W}(k) = \mathbf{W}_{constr}(k)\,[\mathbf{a}_1(k)\,|\,\mathbf{I}_{2\cdots M}]^{-1}, \qquad (12)$$
where $\mathbf{W}_{constr}(k)$ corresponds to $[\boldsymbol{\sigma}\,|\,\mathbf{F}(k)]^{-1}$. Therefore, $\mathbf{W}_{constr}(k)$ can be represented by $[\tilde{\boldsymbol{\sigma}}\,|\,\tilde{\mathbf{F}}(k)]$, where $\tilde{\boldsymbol{\sigma}} = [1/\sigma, 0, \ldots, 0]^T$ and $\tilde{\mathbf{F}}(k)$ is the remaining part of $[\boldsymbol{\sigma}\,|\,\mathbf{F}(k)]^{-1}$. The adaptation rule of $\mathbf{W}_{constr}(k)$ with a transformed mixture vector $\tilde{\mathbf{x}}(k,\tau) = [\mathbf{a}_1(k)\,|\,\mathbf{I}_{2\cdots M}]^{-1}\,\mathbf{x}(k,\tau)$ is given by [11,12]
$$\Delta\mathbf{W}_{constr}(k) = [\mu\,\Delta\mathbf{w}_1(k)\,|\,\Delta\mathbf{W}_{2\cdots M}(k)], \qquad (13)$$
where μ is a scalar between 0 and 1 that reflects the strength of the constraint imposed by $[\mathbf{a}_1(k)\,|\,\mathbf{I}_{2\cdots M}]^{-1}$, and $\Delta\mathbf{w}_1(k)$ and $\Delta\mathbf{W}_{2\cdots M}(k)$ are the first column vector and the matrix composed of the last M−1 columns of $\{\mathbf{I} - \boldsymbol{\phi}(\mathbf{u}(k,\tau))\,\mathbf{u}^H(k,\tau)\}\,\mathbf{W}_{constr}(k)$, respectively, similar to Eq. (9). In [12], the adaptation speeds of the first column vector and the remaining matrix of $\mathbf{W}_{constr}(k)$ are
controlled by different local step sizes. However, the step sizes are determined by the normalized coherence between the normalized cross-power spectrum and the target mixing parameters, and the normalization of the coherence is based on its maximum and minimum values, which is not suitable for real-time processing. If μ = 1, no constraint is imposed, and the adaptation rule is the same as in conventional FD ICA [13] with the transformed mixture vector $\tilde{\mathbf{x}}(k,\tau)$. On the other hand, if μ = 0, the adaptation constrains the target mixing parameters to $\mathbf{a}_1(k)$ while adapting the parameters related to the interfering signals. If 0 < μ < 1, the adaptation is less constrained by $\mathbf{a}_1(k)$, under the assumption that there is a certain degree of uncertainty in the initial guess for $\mathbf{a}_1(k)$ [11,12]. Unfortunately, accurate knowledge of the mixing parameters related to a target source is not available in advance in most practical situations, so the parameters must be estimated, which may not result in accurate values. In this case, setting μ to zero may provide a deteriorated target signal, whereas even a small positive μ to settle the target mixing parameters cannot guarantee that the first system output always corresponds to the target source [12]. In addition, this method requires more computations than conventional FD ICA because of the additional transformation of the input mixture vectors required to obtain $\tilde{\mathbf{x}}(k,\tau)$.
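A rough sketch of the constrained update (13), assuming the transform $[\mathbf{a}_1(k)\,|\,\mathbf{I}_{2\cdots M}]^{-1}$ has been precomputed from estimated target mixing parameters and reusing the Laplace score; this is our own illustration, not the reference implementation of [11,12]:

```python
import numpy as np

def sbse_step(W_constr, T_inv, x, mu=0.2, lr=1e-2):
    """One constrained SBSE-style step, Eq. (13), at a single frequency
    bin. Assumes T_inv = [a_1(k) | I_{2..M}]^{-1} was precomputed from
    estimated target mixing parameters. (Our sketch.)"""
    x_t = T_inv @ x                          # transformed mixture x~(k, tau)
    u = W_constr @ x_t                       # first element: target estimate
    phi = np.exp(1j * np.angle(u))           # Laplace score, Eq. (8)
    G = (np.eye(len(u)) - np.outer(phi, u.conj())) @ W_constr
    G[:, 0] *= mu                            # weaker adaptation of target column
    return W_constr + lr * G
```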
4. Real-time IVA for preprocessing of ASR systems

4.1. Review of the real-time IVA

While conventional FD ICA estimates the separating matrix at each frequency bin independently of those at the others, in IVA the estimation of a separating matrix at a frequency bin is affected by those at the others because IVA exploits an improved source signal prior [24,28]. By considering higher-order dependencies of source signals across frequencies, IVA can avoid the permutation problem [24]. If IVA is applied to the framework in Section 2, the KL divergence between $p(\mathbf{v}_1(\tau), \ldots, \mathbf{v}_N(\tau))$ and $\prod_{n=1}^{N} q(\mathbf{v}_n(\tau))$ is used to measure the dependency between the estimated source signals of all frequency bins, where $\mathbf{v}_n(\tau) = [U_n(1,\tau), \ldots, U_n(K,\tau)]$, $n = 1, \ldots, N$ [24,28]. After eliminating the terms that are independent of the separating matrices, the cost function can be given by
$$J = -\sum_{k=1}^{K} \log|\det\mathbf{W}(k)| - \sum_{n=1}^{N} E[\log q(\mathbf{v}_n(\tau))]. \qquad (14)$$
As with the FD-ICA algorithm, the online natural-gradient algorithm for minimizing the cost function provides the IVA learning rule expressed as [28]
$$\Delta\mathbf{W}(k) \propto \{\mathbf{I} - \boldsymbol{\varphi}^{(k)}(\mathbf{v}_{1\cdots N}(\tau))\,\mathbf{u}^H(k,\tau)\}\,\mathbf{W}(k), \qquad (15)$$
where the multivariate score function is given by $\boldsymbol{\varphi}^{(k)}(\mathbf{v}_{1\cdots N}(\tau)) = [\varphi^{(k)}(\mathbf{v}_1(\tau)), \ldots, \varphi^{(k)}(\mathbf{v}_N(\tau))]^T$ and $\varphi^{(k)}(\mathbf{v}_n(\tau)) = -\partial\log q(\mathbf{v}_n(\tau))/\partial U_n(k,\tau) = U_n(k,\tau)\big/\sqrt{\sum_{\kappa=1}^{K} |U_n(\kappa,\tau)|^2}$, assuming that $q(\mathbf{v}_n(\tau)) \propto \exp\left(-\sqrt{\sum_{k=1}^{K} |U_n(k,\tau)|^2}\right)$ [24].

In (15), $\boldsymbol{\varphi}^{(k)}(\mathbf{v}_{1\cdots N}(\tau))\,\mathbf{u}^H(k,\tau)$ is on average equal to the identity matrix $\mathbf{I}$ at a stationary point. The constraints for the off-diagonal elements require independence among the estimated source signals, whereas the other constraints regularize the value of $\varphi^{(k)}(\mathbf{v}_n(\tau))\,U_n(k,\tau)$, which has nothing to do with independence. Unfortunately, when a source signal such as speech suddenly becomes very small, the corresponding parameters of the separating matrices should increase in amplitude in the learning process to compensate for the change, which may lead to divergence of the parameters. To avoid this compensation, which is unnecessary for source separation, and to improve convergence of the online algorithm, a nonholonomic constraint can be adopted in which $\mathbf{I}$ is replaced with $\mathrm{diag}[\boldsymbol{\varphi}^{(k)}(\mathbf{v}_{1\cdots N}(\tau))\,\mathbf{u}^H(k,\tau)]$, denoting a diagonal matrix whose diagonal elements are the same as those of $\boldsymbol{\varphi}^{(k)}(\mathbf{v}_{1\cdots N}(\tau))\,\mathbf{u}^H(k,\tau)$ [27,28]. To achieve convergence that is further robust to the input signal level in the online learning of the separating matrices, the update amount is normalized by a smoothed power estimate as follows [28]:
$$\Delta\mathbf{W}(k) \propto \frac{1}{\sqrt{\xi(k,\tau)}}\left\{\mathrm{diag}\left[\boldsymbol{\varphi}^{(k)}(\mathbf{v}_{1\cdots N}(\tau))\,\mathbf{u}^H(k,\tau)\right] - \boldsymbol{\varphi}^{(k)}(\mathbf{v}_{1\cdots N}(\tau))\,\mathbf{u}^H(k,\tau)\right\}\mathbf{W}(k), \qquad (16)$$
where the smoothed power estimate $\xi(k,\tau)$, with a smoothing factor β, is given by
$$\xi(k,\tau) = \beta\,\xi(k,\tau-1) + (1-\beta)\,\|\mathbf{x}(k,\tau)\|^2/M. \qquad (17)$$
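Eqs. (15)–(17) can be sketched for one frame as follows; the array shapes, names, and step size are our own assumptions, and N = M is assumed as in the framework above.

```python
import numpy as np

def rt_iva_step(W, x, xi, beta=0.3, lr=1e-2):
    """One real-time IVA step over all K bins, Eqs. (15)-(17), with the
    nonholonomic constraint and smoothed-power normalization. (Sketch.)
    W : (K, N, N) separating matrices; x : (K, M) current-frame mixtures;
    xi: (K,) smoothed power estimates xi(k, tau - 1)."""
    K, M = x.shape
    u = np.einsum('knm,km->kn', W, x)                 # u(k,tau) = W(k) x(k,tau)
    norm = np.sqrt((np.abs(u) ** 2).sum(axis=0))      # ||v_n(tau)|| over bins
    phi = u / np.maximum(norm, 1e-12)                 # multivariate score
    xi = beta * xi + (1 - beta) * (np.abs(x) ** 2).sum(axis=1) / M   # Eq. (17)
    for k in range(K):
        G = np.outer(phi[k], u[k].conj())             # phi^(k) u^H at bin k
        G = np.diag(np.diag(G)) - G                   # nonholonomic constraint
        W[k] += lr * (G @ W[k]) / np.sqrt(xi[k])      # Eq. (16)
    return W, u, xi
```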
Although IVA does not suffer from the permutation problem across frequency bins, the scales of the estimated time–frequency segments may differ from the original ones, which may cause distortion. MDP is a method for resolving the scaling indeterminacy by generating the estimated output signals with minimum scaling distortion from the observed mixtures [16]. In the IVA framework, MDP can be accomplished by premultiplying the estimated time–frequency segments $\mathbf{u}(k,\tau)$ by $\mathrm{diag}[\mathbf{W}^{-1}(k)]$ [28].

4.2. Selection of a target speech output

With the scaling and permutation problems resolved, the target speech output must still be identified to use IVA as a preprocessing tool for an ASR system. As mentioned in Section 1, the target source direction can be estimated in advance in many ASR applications [15]. The estimated direction can be compared with the DOAs computed from the separating matrix $\mathbf{W}(k)$ to identify the target speech output. Let us consider the uniform linear microphone array illustrated in Fig. 1. At the k-th frequency bin, the DOA of the n-th sound source, $\theta_n(k)$, can be estimated from the m-th and m'-th microphone pair by [10,29]
$$\theta_n(k) = \arcsin\left\{\frac{\arg\left([\mathbf{W}^{-1}(k)]_{mn}\,/\,[\mathbf{W}^{-1}(k)]_{m'n}\right)}{\omega_k\, d\,(m-m')/c}\right\}, \qquad (18)$$
where $[\cdot]_{mn}$ denotes the element at the m-th row and n-th column of a matrix, d is the distance between adjacent microphones, and c is the speed of sound. Because the permutation problem is avoided in IVA, the DOA of the n-th sound source is obtained by averaging $\theta_n(k)$ over the frequency bins without spatial aliasing. Therefore, the target speech output can be determined by identifying the output with the minimum DOA difference from the estimated target source direction.
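A sketch of the DOA computation (18) and the bin-averaged target-output selection just described, under our own naming assumptions (bins affected by spatial aliasing or near-zero frequencies should be excluded in practice):

```python
import numpy as np

def doa_per_bin(W_inv_k, omega_k, d, c=343.0, m=1, m2=0):
    """Eq. (18): DOAs (rad) of all outputs at one bin from rows m and m2
    (0-based microphone indices) of W^{-1}(k). (Our naming.)"""
    ratio = W_inv_k[m, :] / W_inv_k[m2, :]
    s = np.angle(ratio) / (omega_k * d * (m - m2) / c)
    return np.arcsin(np.clip(s, -1.0, 1.0))

def select_target_output(W_inv_all, omegas, d, theta_target):
    """Average theta_n(k) over the given bins and pick the output with
    the minimum DOA difference from the known target direction."""
    thetas = np.array([doa_per_bin(Wk, wk, d)
                       for Wk, wk in zip(W_inv_all, omegas)])
    return int(np.argmin(np.abs(thetas.mean(axis=0) - theta_target)))
```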
Fig. 1. Configuration of a microphone array and a target source (a uniform linear array of microphones 1, …, M with inter-microphone distance d; the target source arrives from direction θ_target).
5. Proposed method

Although IVA significantly enhances speech in a target speech output, it requires estimation of all the parameters constituting the separating matrices. In addition, the target speech output must be chosen from among all the outputs for speech recognition. Because the preprocessing of ASR systems requires enhancement of the target speech only, rather than restoration of all sources, a method for estimating only the parameters needed to generate a target speech signal (corresponding to a row of the separating matrix in each frequency bin) is proposed. To estimate these parameters, the other outputs are set to provide noise estimates by canceling the target speech signal. As with the DOA-based selection of a target speech output described in Section 4.2, the target speaker direction is assumed to be known.

Referring to Fig. 1, the signal at the m-th microphone can be represented as
$$X_m(k,\tau) = [\mathbf{A}(k)]_{m1}\, S_1(k,\tau) + \sum_{n=2}^{N} [\mathbf{A}(k)]_{mn}\, S_n(k,\tau), \qquad (19)$$
where $S_1(k,\tau)$ denotes a time–frequency segment of the target source signal. As mentioned in Section 3, the target speaker is usually located relatively close to the microphones in ASR applications, and the acoustic paths from the speaker to the microphones have moderate reverberation, which means that their direct-path components are dominant. If the acoustic paths are approximated by their direct paths and the relative attenuation among microphones is negligible, assuming proximity of the microphones without any obstacles between them, the ratio of the target source components in a pair of microphone signals is given by
$$\frac{[\mathbf{A}(k)]_{m1}\, S_1(k,\tau)}{[\mathbf{A}(k)]_{m'1}\, S_1(k,\tau)} \approx \exp\left(-j\omega_k\,\frac{d\,(m-m')\sin\theta_{target}}{c}\right), \qquad (20)$$
where $\theta_{target}$ denotes the DOA of the target source. Therefore, the simple “delay-and-subtract nullformer” to cancel out the target signal from the first and m-th microphones can be expressed as
$$U_m(k,\tau) = X_m(k,\tau) - \exp\left(-j\omega_k\,\frac{d\,(m-1)\sin\theta_{target}}{c}\right) X_1(k,\tau), \quad m = 2, \ldots, M. \qquad (21)$$

To derive the learning rule conveniently, the nullformer outputs are regarded as dummy outputs, in addition to the real target speech output represented by
$$Y(k,\tau) = \mathbf{w}(k)\,\mathbf{x}(k,\tau), \qquad (22)$$
where $\mathbf{w}(k)$ denotes the adaptive vector that generates the real output. Therefore, the real and dummy outputs can be expressed in matrix formulation as
$$\mathbf{y}(k,\tau) = \begin{bmatrix} \mathbf{w}(k) \\ -\boldsymbol{\gamma}_k \,|\, \mathbf{I} \end{bmatrix} \mathbf{x}(k,\tau), \qquad (23)$$
where $\mathbf{y}(k,\tau) = [Y(k,\tau), U_2(k,\tau), \ldots, U_M(k,\tau)]^T$ and $\boldsymbol{\gamma}_k = [\Gamma_k^1, \ldots, \Gamma_k^{M-1}]^T$ with $\Gamma_k = \exp(-j\omega_k\, d \sin\theta_{target}/c)$.

Because the nullformer parameters used to generate the dummy outputs are fixed for noise estimation purposes, the proposed method is safe from the permutation problem across frequency bins, and estimating $\mathbf{w}(k)$ at one frequency bin independently of the others, in contrast with IVA, may achieve fast convergence and avoid performance degradation in target speech extraction as preprocessing for ASR systems, as shown in the experimental results discussed in Section 6. Therefore, we obtain a desired target signal at the real output by maximizing independence between the real and dummy outputs at a frequency bin. From the KL divergence between $p(Y(k,\tau), U_2(k,\tau), \ldots, U_M(k,\tau))$ and $q(Y(k,\tau))\,p(U_2(k,\tau), \ldots, U_M(k,\tau))$, the terms independent of $\mathbf{w}(k)$ are removed to yield the cost function
$$J' = -\log\left|\sum_{m=1}^{M} \Gamma_k^{m-1}\,[\mathbf{w}(k)]_m\right| - E[\log q(Y(k,\tau))], \qquad (24)$$
where $[\cdot]_m$ denotes the m-th element of a vector. The natural-gradient algorithm to minimize the cost function is
$$\Delta\mathbf{w}(k) \propto \left\{[1, 0, \ldots, 0] - E\left[\phi(Y(k,\tau))\,\mathbf{y}^H(k,\tau)\right]\right\} \begin{bmatrix} \mathbf{w}(k) \\ -\boldsymbol{\gamma}_k \,|\, \mathbf{I} \end{bmatrix}, \qquad (25)$$
where $\phi(Y(k,\tau)) = -d\log q(Y(k,\tau))/dY(k,\tau) = \exp(j\arg(Y(k,\tau)))$. Therefore, the online natural-gradient algorithm with a nonholonomic constraint and normalization by a smoothed power estimate can be expressed as
$$\Delta\mathbf{w}(k) \propto \frac{1}{\sqrt{\xi(k,\tau)}}\left\{\left[\phi(Y(k,\tau))\,Y^*(k,\tau), 0, \ldots, 0\right] - \phi(Y(k,\tau))\,\mathbf{y}^H(k,\tau)\right\} \begin{bmatrix} \mathbf{w}(k) \\ -\boldsymbol{\gamma}_k \,|\, \mathbf{I} \end{bmatrix}$$
$$= \frac{\phi(Y(k,\tau))}{\sqrt{\xi(k,\tau)}}\left[\sum_{m=2}^{M} \Gamma_k^{m-1}\, U_m^*(k,\tau),\; -U_2^*(k,\tau),\; \ldots,\; -U_M^*(k,\tau)\right]. \qquad (26)$$
The MDP can be used to resolve the scaling indeterminacy of the output signal by dividing $Y(k,\tau)$ by $\sum_{m=1}^{M} \Gamma_k^{m-1}\,[\mathbf{w}(k)]_m$, because the element in the first row and the first column of $\begin{bmatrix} \mathbf{w}(k) \\ -\boldsymbol{\gamma}_k \,|\, \mathbf{I} \end{bmatrix}^{-1}$ is $1\Big/\sum_{m=1}^{M} \Gamma_k^{m-1}\,[\mathbf{w}(k)]_m$. The TD waveform of the estimated target speech can then be reconstructed by
$$y(t) = \sum_{\tau}\sum_{k=1}^{K} Y(k,\tau)\, e^{j\omega_k (t - \tau H)}. \qquad (27)$$
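Collecting Eqs. (21)–(26) and the MDP scaling, one adaptation step of the proposed method at a single frequency bin can be sketched as follows (our own naming and step size, not the authors' code):

```python
import numpy as np

def dc_ica_step(w, x, Gam, xi, beta=0.3, lr=1e-2):
    """One step of the proposed DOA-constrained adaptation at one bin,
    Eqs. (21)-(26) plus the MDP scaling. (Sketch.)
    w   : (M,) adaptive weights w(k) for the real output
    x   : (M,) microphone segments X_1..X_M at bin k, frame tau
    Gam : Gamma_k = exp(-1j * omega_k * d * sin(theta_target) / c)
    xi  : smoothed power estimate xi(k, tau - 1)"""
    M = len(x)
    g = Gam ** np.arange(1, M)                       # Gamma^1 .. Gamma^{M-1}
    dummy = x[1:] - g * x[0]                         # nullformer outputs, Eq. (21)
    Y = w @ x                                        # real output, Eq. (22)
    phi = np.exp(1j * np.angle(Y))                   # Laplace score of Y
    xi = beta * xi + (1 - beta) * (np.abs(x) ** 2).sum() / M   # Eq. (17)
    grad = np.concatenate(([np.sum(g * dummy.conj())], -dummy.conj()))
    w = w + lr * phi * grad / np.sqrt(xi)            # Eq. (26)
    scale = np.sum(Gam ** np.arange(M) * w)          # sum_m Gamma^{m-1} w_m
    return w, Y / scale, xi                          # MDP-scaled target output
```

For stereo input (M = 2), each bin adapts exactly the two weights discussed above, and the dummy output reduces to the single nullformed signal $U_2(k,\tau)$.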
For the proposed method and a real-time version of FD ICA (corresponding to real-time IVA with the score function $\boldsymbol{\phi}(\mathbf{u}(k,\tau))$) with N = M, Table 1 summarizes the number of complex multiplications and divisions required to compute a value in the first column from the time–frequency segments of M microphone signals for a frame.

Table 1. The number of complex multiplications and divisions required to compute a value in the first column for a frame in the proposed method and real-time FD ICA.

Value to be computed              Proposed method   Real-time FD ICA
y or u                            K(2M−1)           KM²
φ(Y(k,τ)) or φ(u(k,τ))            2K                2KM
ξ                                 K(M+2)            K(M+2)
Δw or ΔW                          2KM               K(M³+M²−M)
MDP                               KM                K(M+O(M³))ᵃ
Total                             K(6M+3)           K(M³+2M²+3M+2+O(M³))

ᵃ The number of operations for matrix inversion.

The numbers of complex operations required for the proposed method and the real-time FD ICA are O(M) and O(M³), respectively, which indicates that the proposed method reduces the computational complexity significantly. Furthermore, the real-time FD ICA requires additional computations to resolve the permutation problem and to identify a target speech output.

In many practical applications, the reverberation is not negligible, even though the target speaker is located close to the microphones. In this case, the approximation in (20) is not valid even if the DOA of the target source $\theta_{target}$ is accurately estimated. Fortunately, as the experimental results discussed in Section 6 show, the proposed method is sufficiently robust that the target speech in the system output is successfully enhanced. This is consistent with the discussion in [15] concerning ICA being more robust against inaccurate estimation of the target source direction than conventional adaptive beamformers.

Although the proposed method provides an efficient online algorithm with less computational complexity, the number of microphones needs to be reduced to further decrease the number of parameters to be estimated. Because the proposed method requires noise estimation as a dummy output formed by a directional null to the target speaker, stereo microphone signals can be considered. In addition, more than two microphones may be unavailable in some systems because of hardware costs. When stereo microphones are used, the method suffers from the underdetermined BSS problem if two or more noise sources in addition to a target source are active. However, the dummy output is proficient in noise estimation since the target speaker is located close to the microphones in many ASR applications, so two weights in each frequency bin are adapted by the proposed method to form an SNR-maximization beamformer, yielding an efficient preprocessing method for ASR with fast convergence and reduced computational loads. The proposed algorithm can certainly be applied to signals acquired by more than two microphones if available, as described in this section.

6. Experimental evaluation

6.1. Experimental setup

To evaluate the proposed target-speech-extraction method as a preprocessing tool for ASR, we conducted recognition experiments using the DARPA Resource Management (RM) database [30] and the HMM toolkit (HTK) [31]. The recognition system was based on fully continuous hidden Markov models trained on 3990 sentences recorded in a quiet environment. The test set consisted of 300 sentences. Speech recognition was based on the observed values of 13th-order mel-frequency cepstral coefficients with corresponding delta and acceleration coefficients. The cepstral coefficients were obtained from 24 mel-frequency bands with a frame size of 25 ms and a frame shift of 10 ms.

Each test utterance was corrupted by competing speech that was randomly chosen from the same RM database, excluding the target speech. To assess robustness against interference types, we also considered the subway, car, and exhibition hall noises from the AURORA2 database [32]. Two microphone signals were simulated using the image method [33], with a target source and one to three interfering sources arranged according to the configuration shown in Fig. 2. When the acoustic filters were generated using the image method, the reflection coefficient was selected to obtain a designated reverberation time, RT60. The source at “Interference 1” was used for one interfering source, and the sources at “Interference 1 and 2” were used for two interfering sources. Because users are located in front of the microphone array in many ASR applications, $\theta_{target}$ was set to 5°, and the target speaker was located close to the microphones to simulate a typical ASR situation such as recognition of the speech of a car driver or a kiosk user. Because the original sampling rate of 16 kHz was too low to simulate the signal delay between the two closely spaced microphones, the source signals were upsampled to 1024 kHz, convolved with room impulse responses generated at a sampling rate of 1024 kHz, and downsampled back to 16 kHz. The noise components at both microphones were scaled by the same factor to obtain a designated input SNR at “Mic. 1”.

We compared the proposed method with SBSE [11], BSSA [10], and real-time IVA [28] with DOA-based target output selection. For a fair comparison, all of the tested methods employed a Hanning window with a 1024-sample length and a 256-sample shift as the analysis window for the STFT. In addition, online natural-gradient learning with a nonholonomic constraint and normalization by a smoothed power estimate was used with each method, and the MDP was employed with each, except the BSSA, which was conducted using the PB method. The optimal step size was fixed for each method, μ was set to 0.2 for the SBSE, and the smoothing factor β was set to 0.3.
These values were determined on the basis of extensive experiments.
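As a side note, the delay-accurate mixture simulation described above can be sketched with SciPy's polyphase resampler; names are ours and the room impulse response is assumed to be given:

```python
import numpy as np
from scipy.signal import fftconvolve, resample_poly

def simulate_mic(src_16k, rir_1024k, up=64):
    """Delay-accurate mixing sketch: upsample a 16 kHz source by 64x
    (to 1024 kHz), convolve with a room impulse response generated at
    1024 kHz, and downsample back to 16 kHz. (Our illustration of the
    procedure described above.)"""
    src_hi = resample_poly(src_16k, up, 1)     # 16 kHz -> 1024 kHz
    mic_hi = fftconvolve(src_hi, rir_1024k)    # apply room acoustics
    return resample_poly(mic_hi, 1, up)        # 1024 kHz -> 16 kHz
```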
Fig. 2. Source and microphone positions to simulate corrupted speech as test data: a 5 m × 4 m room with a height of 3 m; two microphones at a height of 1.1 m with a 16 cm gap; the target speech source at azimuth −5°; and Interference 1, Interference 2, and Interference 3 at azimuths −50°, 75°, and 25°, respectively, all at a height of 1.1 m.

6.2. Recognition results
Fig. 3 shows the word accuracies for the proposed method, denoted by DC ICA (DOA-constrained ICA), and the corresponding IVA-based method, denoted by DC IVA, derived similarly by maximizing independence between the real and dummy outputs of all frequency bins with higher-order dependencies of source signals across frequencies. For all of the tested cases, the recognition accuracies of the proposed method were comparable to or better than those of the corresponding IVA-based method. As discussed in Section 5, the proposed method is safe from the permutation problem even though it estimates a separating matrix at each frequency bin independently of the others. In addition, the independent estimation across frequency bins may yield fast convergence, which may result in better recognition performance.
Fig. 3. Word accuracies (%) for the proposed method (DC ICA) and the DC IVA, in the configuration shown in Fig. 2 with competing speech as interferences. (a) One interference and RT60 of 0.2 s. (b) One interference and RT60 of 0.4 s. (c) One interference and RT60 of 0.6 s. (d) Two interferences and RT60 of 0.2 s. (e) Two interferences and RT60 of 0.4 s. (f) Two interferences and RT60 of 0.6 s. (g) Three interferences and RT60 of 0.2 s. (h) Three interferences and RT60 of 0.4 s. (i) Three interferences and RT60 of 0.6 s.
Fig. 4. Word accuracies (%) for the proposed target-speech-extraction method with a dummy output for noise estimation computed by the DSNF or by the FSNF of stereo data, in the configurations shown in Fig. 2 with competing speech as one interference. (a) RT60 of 0.2 s, (b) RT60 of 0.4 s, and (c) RT60 of 0.6 s.
Fig. 5. Word accuracies (%) for the proposed method (DC ICA), SBSE [11], BSSA [10], and real-time IVA [28] with DOA-based target output selection (RT IVA), in the configuration shown in Fig. 2 with competing speech as interferences. (a) One interference and RT60 of 0.2 s. (b) One interference and RT60 of 0.4 s. (c) One interference and RT60 of 0.6 s. (d) Two interferences and RT60 of 0.2 s. (e) Two interferences and RT60 of 0.4 s. (f) Two interferences and RT60 of 0.6 s. (g) Three interferences and RT60 of 0.2 s. (h) Three interferences and RT60 of 0.4 s. (i) Three interferences and RT60 of 0.6 s.
Fig. 4 compares the word accuracies for the proposed method when a dummy output for noise estimation is obtained by the simple “delay-and-subtract nullformer” of stereo data (corresponding to the “real” proposed method), denoted by DSNF, or by a “filter-and-subtract nullformer,” denoted by FSNF. The latter is represented by Eq. (21), using the ratio of the target source components in the stereo microphone signals, $\{[\mathbf{A}(k)]_{21} S_1(k,\tau)\}/\{[\mathbf{A}(k)]_{11} S_1(k,\tau)\}$, instead of $\exp(-j\omega_k d \sin\theta_{target}/c)$ (corresponding to “ideal” noise estimation).²

² This method yields better recognition accuracies than a method in which the parameters were set to the discrete Fourier transform of the exact simulation mixing filters downsampled from 1024 kHz to 16 kHz.
Fig. 6. Word accuracies (%) for the proposed method (DC ICA), SBSE [11], BSSA [10], and real-time IVA [28] with DOA-based target output selection (RT IVA), in the configuration shown in Fig. 2 with the subway, car, or exhibition hall noises from the AURORA2 database as three interferences. (a) Subway noise and RT60 of 0.2 s. (b) Subway noise and RT60 of 0.4 s. (c) Subway noise and RT60 of 0.6 s. (d) Car noise and RT60 of 0.2 s. (e) Car noise and RT60 of 0.4 s. (f) Car noise and RT60 of 0.6 s. (g) Exhibition hall noise and RT60 of 0.2 s. (h) Exhibition hall noise and RT60 of 0.4 s. (i) Exhibition hall noise and RT60 of 0.6 s.
It is interesting to note that the method adopting the “delay-and-subtract nullformer” yielded recognition accuracies comparable to those obtained with the “filter-and-subtract nullformer” under moderately reverberant conditions. In particular, the recognition performance of the former was better than that of the latter in a heavily reverberant environment with an RT60 of 0.6 s. Even though the parameters for generating the noise estimate at the dummy output were inaccurate because of reverberation, the proposed method optimized the weights based on ICA to obtain an SNR-maximization beamformer for target speech extraction at the real output in each frequency bin, which is consistent with the statement that ICA is robust against inaccurate estimation of the target source direction [15]. In the heavily reverberant environment, the mixing filter in the frequency domain has more variation in amplitude, and the denominator in the ratio of the target source components frequently has a very small amplitude, which may result in a numerically erroneous “filter-and-subtract nullformer.” Therefore, the estimated noise may contain more target speech components than that estimated under moderately reverberant conditions, and recognition performance may degrade further.
Figs. 5 and 6 display the word accuracies for the proposed method and the methods to which it was compared, using one to three competing speech interferences or other types of noises as interferences. As both figures show, the recognition accuracies decreased as the reverberation time increased, regardless of the method, because of the performance limitation of the FD-ICA-based approaches, i.e., the fact that a limited frame size cannot adequately cover a long reverberation [34–36]. It is remarkable that the performance degradation of the proposed method as the number of interferences increased was not more severe than that of the other methods, as shown in Fig. 5. The reason for this is that the proposed method seeks to achieve robustness by adapting two weights in each frequency bin to provide a target speech signal independent of the estimated noise at the dummy output, although it also suffers from the underdetermined BSS problem with more than one interference. Above all, the word accuracies of the proposed method were comparable to or better than those of the other methods for all tested environments and interferences, with less computational complexity.
7. Conclusion

In this paper, we present an efficient online target-speech-extraction method using stereo data that can be used as a preprocessing step for robust ASR. Because a target speaker is located close to the microphones in many ASR applications, we focused on the case in which the accompanying target speech paths have moderate reverberation and the target speaker direction is known in advance. For this case, a dummy output for noise estimation was obtained using a simple “delay-and-subtract nullformer” of stereo data, and weights for extracting target speech were then estimated using a learning rule derived from a modified ICA cost function to maximize independence between the nullformed noise and the estimated target speech, while retaining the MDP to overcome the scaling and permutation problems. In particular, the online learning rule was based on a stochastic natural gradient with a nonholonomic constraint and normalization by a smoothed power estimate of the input signal to improve parameter convergence, even for dynamically changing speech levels, with less computational complexity than conventional ICA. Although the method using stereo data is advantageous for small computational loads and fast convergence, it suffers from the underdetermined BSS problem if two or more noise sources in addition to a target source are active. However, the weights were adapted to form SNR-maximization beamformers for robust target speech estimation at the real output. The experimental results obtained using data from the RM database simulated in various configurations and reverberant environments demonstrate that the proposed method delivers better speech recognition performance on average, with fewer computations, than the methods to which it was compared.
Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2014R1A2A2A01006581).

References

[1] T. Virtanen, R. Singh, B. Raj (Eds.), Techniques for Noise Robustness in Automatic Speech Recognition, John Wiley & Sons Ltd., Chichester, UK, 2012.
[2] J. Hung, W. Tu, C. Lai, Improved modulation spectrum enhancement methods for robust speech recognition, Signal Process. 92 (11) (2012) 2791–2814.
[3] I. Mporas, T. Ganchev, O. Kocsis, N. Fakotakis, Context-adaptive preprocessing scheme for robust speech recognition in fast-varying noise environment, Signal Process. 91 (8) (2011) 2101–2111.
[4] M. Wölfel, J. McDonough, Distant Speech Recognition, John Wiley & Sons, Ltd., Chichester, UK, 2009.
[5] J. Droppo, A. Acero, Environmental robustness, in: J. Benesty, M. Sondhi, Y. Huang (Eds.), Springer Handbook of Speech Processing, Springer, Berlin, 2008, pp. 653–680.
[6] B. Raj, V. Parikh, R.M. Stern, The effects of background music on speech recognition accuracy, in: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), Munich, Germany, 1997, pp. 851–854.
[7] M. El Rhabi, H. Fenniri, A. Keziou, E. Moreau, A robust algorithm for convolutive blind source separation in presence of noise, Signal Process. 93 (4) (2013) 818–827.
[8] P. Comon, C. Jutten (Eds.), Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, Kidlington, UK, 2010.
[9] S. Haykin (Ed.), Unsupervised Adaptive Filtering: Blind Source Separation, vol. 1, John Wiley & Sons, Ltd., New York, NY, USA, 2000.
[10] Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, K. Shikano, Blind spatial subtraction array for speech enhancement in noisy environment, IEEE Trans. Audio Speech Lang. Process. 17 (2009) 650–664.
[11] F. Nesta, M. Matassoni, Robust automatic speech recognition through on-line semi blind source extraction, in: Proceedings of CHiME Workshop on Machine Listening in Multisource Environments, Florence, Italy, 2011, pp. 18–23.
[12] F. Nesta, M. Matassoni, Blind source extraction for robust speech recognition in multisource noisy environments, Comput. Speech Lang. 27 (2013) 703–725.
[13] A. Hyvärinen, J. Karhunen, E. Oja (Eds.), Independent Component Analysis, John Wiley & Sons, Ltd., New York, NY, USA, 2001.
[14] L. Parra, C. Alvino, Geometric source separation: merging convolutive source separation with geometric beamforming, IEEE Trans. Speech Audio Process. 10 (2002) 352–362.
[15] M. Knaak, S. Araki, S. Makino, Geometrically constrained independent component analysis, IEEE Trans. Audio Speech Lang. Process. 15 (2007) 715–726.
[16] K. Matsuoka, S. Nakashima, Minimal distortion principle for blind source separation, in: Proceedings of International Conference on Independent Component Analysis and Blind Signal Separation, San Diego, CA, USA, 2001, pp. 722–727.
[17] L. Parra, C. Spence, Convolutive blind separation of non-stationary sources, IEEE Trans. Speech Audio Process. 8 (2000) 320–327.
[18] P. Smaragdis, Blind separation of convolved mixtures in the frequency domain, Neurocomputing 22 (1998) 21–34.
[19] F. Asano, S. Ikeda, M. Ogawa, H. Asoh, N. Kitawaki, Combined approach of array processing and independent component analysis for blind separation of acoustic signals, IEEE Trans. Speech Audio Process. 11 (2003) 204–215.
[20] M.Z. Ikram, D.R. Morgan, A beamforming approach to permutation alignment for multichannel frequency-domain blind source separation, in: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, 2002, pp. 881–884.
[21] W. Wang, J.A. Chambers, S. Sanei, A novel hybrid approach to the permutation problem of frequency domain blind source separation, in: Proceedings of International Conference on Independent Component Analysis and Blind Signal Separation, Granada, Spain, 2004, pp. 532–539.
[22] N. Murata, S. Ikeda, A. Ziehe, An approach to blind source separation based on temporal structure of speech signals, Neurocomputing 41 (2001) 1–24.
[23] H. Sawada, S. Araki, S. Makino, Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS, in: Proceedings of IEEE International Symposium on Circuits and Systems, New Orleans, LA, USA, 2007, pp. 3247–3250.
[24] T. Kim, H.T. Attias, S.-Y. Lee, T.-W. Lee, Blind source separation exploiting higher-order frequency dependencies, IEEE Trans. Audio Speech Lang. Process. 15 (2007) 70–79.
[25] F. Nesta, T. Wada, B.-H. Juang, Batch-online semi-blind source separation applied to multi-channel acoustic echo cancellation, IEEE Trans. Audio Speech Lang. Process. 19 (2011) 583–599.
[26] T.-W. Lee (Ed.), Independent Component Analysis: Theory and Applications, Kluwer, Boston, MA, USA, 1998.
[27] S.-I. Amari, T.-P. Chen, A. Cichocki, Nonholonomic orthogonal learning algorithms for blind source separation, Neural Comput. 12 (2000) 1463–1484.
[28] T. Kim, Real-time independent vector analysis for convolutive blind source separation, IEEE Trans. Circuits Syst. I: Reg. Pap. 57 (2010) 1431–1438.
[29] H. Sawada, R. Mukai, S. Araki, S. Makino, A robust and precise method for solving the permutation problem of frequency-domain blind source separation, IEEE Trans. Speech Audio Process. 12 (2004) 530–538.
[30] P. Price, W.M. Fisher, J. Bernstein, D. Pallet, The DARPA 1000-word resource management database for continuous speech recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, NY, USA, 1988, pp. 651–654.
[31] S.J. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P.C. Woodland, The HTK Book (for HTK Version 3.4), University of Cambridge, Cambridge, UK, 2006.
[32] H.-G. Hirsch, D. Pearce, The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions, in: Proceedings of ASR2000—Automatic Speech Recognition: Challenges for the New Millennium, Paris, France, 2000, pp. 851–854.
[33] J.B. Allen, D.A. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Am. 65 (1979) 943–950.
[34] S. Araki, S. Makino, A. Blin, R. Mukai, H. Sawada, Blind separation of more speech than sensors with less distortion by combining sparseness and ICA, in: Proceedings of International Workshop on Acoustic Echo and Noise Control, Kyoto, Japan, 2003, pp. 271–274.
[35] H.-M. Park, C.S. Dhir, S.-H. Oh, S.-Y. Lee, A filter bank approach to independent component analysis for convolved mixtures, Neurocomputing 69 (2006) 2065–2077.
[36] H.-M. Park, S.-H. Oh, S.-Y. Lee, A Bark-scale filter bank approach to independent component analysis for acoustic mixtures, Neurocomputing 73 (2009) 304–314.