Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds




Available online at www.sciencedirect.com

Computer Speech and Language 27 (2013) 851–873

Marc Delcroix∗, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Atsunori Ogawa, Takaaki Hori, Shinji Watanabe, Masakiyo Fujimoto, Takuya Yoshioka, Takanobu Oba, Yotaro Kubo, Mehrez Souden, Seong-Jun Hahm, Atsushi Nakamura

NTT Communication Science Laboratories, NTT Corporation, 2-4, Hikaridai Seika-cho, Souraku-gun, Kyoto 619-0237, Japan

Received 20 December 2011; received in revised form 24 April 2012; accepted 9 July 2012; available online 20 July 2012

Abstract

Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise; that is, spatial (directional), spectral and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements: a multi-channel speech–noise separation method that exploits spatial and spectral information, followed by a single-channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with the dynamic adaptive compensation of the variance of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.

© 2012 Elsevier Ltd. All rights reserved.

Keywords: Robust ASR; Model-based speech enhancement; Example-based speech enhancement; Model adaptation; Dynamic variance adaptation

1. Introduction

Recently, automatic speech recognition (ASR) has started to move out of the research laboratory and is increasingly used in real-life applications. This has been exemplified by the incorporation of speech interfaces in many recent mobile devices. However, for optimal performance, such applications still require the speaker to be relatively close to the microphone and the background noise level to be low.



☆ This paper has been recommended for acceptance by Jon Barker.
∗ Corresponding author at: NTT Communication Science Laboratories, Media Information Laboratory, Processing Research Group, 2-4, Hikaridai, Seika-cho, Keihanna Science City, Kyoto 619-0237, Japan. Tel.: +81 774 93 5288; fax: +81 774 93 5158. E-mail address: [email protected] (M. Delcroix).

0885-2308/$ – see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.csl.2012.07.006



If we want to bring speech recognition products into everyday life, these constraints must be relaxed. Therefore, there is an increasing need to tackle the challenging problem of noise-robust distant speech recognition in real-world environments.

Research on noise robust speech recognition has focused on acoustic model compensation and speech/feature enhancement. Acoustic model compensation includes approaches such as multi-style retraining (Lippmann et al., 1987; Sehr et al., 2010), parallel model combination (Gales and Young, 1996) and vector Taylor series (VTS) (Moreno et al., 1996; Kim et al., 1998; Li et al., 2007). For example, VTS has recently received great attention. This method approximates the non-linear relationship between speech and additive noise or short-term convolution in the log-spectrum domain using a Taylor series. A noisy speech model can then be obtained from the clean speech acoustic model and noise characteristics. By compensating the acoustic model directly, the model compensation approaches achieve a close relationship with ASR decoding and can therefore greatly improve recognition performance in the presence of stationary noise. However, most acoustic model compensation approaches assume relatively stationary noise, which is often not the case in practice. Moreover, they usually use a single microphone and do not therefore take advantage of the spatial information provided by microphone arrays (Souden et al., 2011).

Another approach to noise robust speech recognition consists of performing speech/feature enhancement prior to recognition. Many speech enhancement algorithms have been designed to handle non-stationary distortions emanating from non-stationary noise sources (Ephraim and Malah, 1985; Martin, 2001; Ming et al., 2011; Fujimoto et al., 2011), reverberation (Naylor and Gaubitch, 2010; Nakatani et al., 2010) or interfering speakers (Lee et al., 1999; Parra and Spence, 2000; Yilmaz and Rickard, 2004; Makino et al., 2007; Sawada et al., 2011; Roweis, 2003) using spectral, spatial or temporal information. For example, blind speech separation (BSS) algorithms usually use either spatial information about the speaker localization derived from the multi-channel microphone signals (Sawada et al., 2011) or spectral information obtained from pre-trained speech spectrum models (Roweis, 2003). To perform separation, these approaches rely on various assumptions about speech and noise sources, specific to the type of information they use. For example, most BSS algorithms that employ spatial information require the number of sources to be known a priori and the sources to be point sources. In contrast, the approach based on trained spectral models assumes the diversities/complexities of speech and noise spectra to be sufficiently small for their models to be accurately trained in advance or estimated from observations. These constraints mean that the performance of such approaches may deteriorate in complex noise conditions such as those found in a family home. Moreover, many speech enhancement methods introduce distortions that are detrimental to ASR because they cause a mismatch between the enhanced speech and the acoustic model used by ASR.

In this paper, we present a speech recognition system that we proposed for the PASCAL 'CHiME' Speech Separation and Recognition Challenge (Barker et al., 2013). This challenge consists of recognizing commands recorded with binaural microphones in a family living room.
There are various types of background noise, including noise sources such as radio or television, children's voices and vacuum cleaners, and the noise has rapidly time-varying characteristics (Barker et al., 2013). Such noise conditions are extremely challenging for conventional noise robust speech recognition. To tackle such a complex recognition task, we introduce the directional, spectral and temporal information based noise robust command recognition system (DISTINCT). DISTINCT has the following four characteristics.

1. It makes full use of information available about speech and noise.
2. It utilizes spectral models for speech enhancement and recognition that are very similar to each other.
3. It includes a mechanism to compensate for any mismatch between enhanced speech and the acoustic model.
4. It uses state-of-the-art speech recognition technologies.

We elaborate on these characteristics below.

First, if we are to deal with such severe background noise conditions, it appears that we should use any information available about the speech and noise. For example, the previous CHiME challenge revealed that the use of log-spectrum models for speech combined with temporal information were the keys to recognizing mixtures of speech signals (Cooke et al., 2010; Kristjansson et al., 2006). DISTINCT uses two complementary model-based enhancement algorithms to fully exploit spatial (directional/locational), spectral and temporal information about the speech and noise. We start by performing multi-channel speech–noise separation using locational and spectral models of speech and noise. For this purpose, we adopt a method called dominance based locational and power-spectral characteristic integration (DOLPHIN) (Nakatani et al., 2011a, 2011b). Then long-term temporal information about the speech is used to further reduce non-stationary noise. This is achieved using an example-based enhancement algorithm (Ming et al., 2011; Kinoshita et al., 2011). These two enhancement algorithms will be discussed in detail in this paper.

Second, both of the speech enhancement algorithms used by DISTINCT operate using spectral models similar to those used by ASR. From one point of view, using the same model for different processing units is desirable because reducing the spectral distortion in one unit also reduces that in the other units. Consequently, to realize an optimal connection with state-of-the-art ASR, speech enhancement units should rely on spectral models that are the same as or similar to the model used by ASR, namely hidden Markov models on mel frequency cepstral coefficients (HMM-MFCC). Therefore, we employed Gaussian mixture models (GMM) on log-spectra and GMMs on MFCCs for the speech enhancement units. Because the spectral distributions represented by these two models are closely related to the distributions represented by HMM-MFCCs, the two enhancement methods are well suited for ASR. This contrasts with enhancement approaches that employ linear spectral domain models.

Third, although our speech enhancement and recognition units use similar spectral models as discussed above, a certain mismatch inevitably remains between the enhanced speech and the acoustic models used for recognition. Due to the noise non-stationarity and the frame-by-frame processing of the speech enhancement algorithms, the mismatch is time-varying. We call such a time-varying mismatch a dynamic mismatch. The third characteristic of DISTINCT is that it mitigates the influence of the dynamic mismatch using static and dynamic adaptations of the acoustic model parameters. Static adaptation is realized with conventional approaches such as maximum likelihood linear regression (MLLR) (Leggetter and Woodland, 1995). For dynamic model adaptation, DISTINCT uses dynamic variance adaptation (DVA) proposed in Delcroix et al. (2009) and revisited in Section 5.

Finally, DISTINCT employs state-of-the-art techniques for ASR such as discriminative training of the acoustic models (McDermott et al., 2010) and system combination (Evermann and Woodland, 2000) in addition to the technologies described above. Consequently, DISTINCT achieves high recognition performance even under adverse noise conditions, i.e., more than 91% average keyword recognition accuracy. Note that to obtain such high recognition performance, we exploit the characteristics of the CHiME challenge, that is, the availability of noise and reverberant speech training data, a known fixed speaker position, the possibility to use speaker-dependent modeling, and offline processing. We will discuss how these constraints could be relaxed to use DISTINCT in more general situations.

This paper is an extension of the work we presented in Delcroix et al. (2011b). It includes more detailed descriptions of our proposed recognition system and provides additional discussions of the results. We also discuss future research directions that could further improve recognition performance.

The organization of the paper is as follows. Section 2 provides a brief overview of DISTINCT. Section 3 describes the principle of DOLPHIN. Then Section 4 derives example-based speech enhancement. Section 5 discusses static and dynamic model adaptation. Section 6 reports experimental results obtained in the CHiME challenge task.
Section 7 discusses possible extensions to our system to improve performance or relax its assumptions. Section 8 concludes the paper and discusses potential future research directions.

2. Overview of DISTINCT

DISTINCT aims at recognizing commands spoken in the presence of noise. In the following derivations, we assume that the speaker is located right in front of a binaural microphone array. In contrast, the noise sources may be located anywhere in the available space. s_j^{(1)}(t) denotes the waveform of the speech image at the microphone of index j (j = 1, 2), where t is the time index. Accordingly, the noise microphone image is denoted by s_j^{(2)}(t). The observed noisy speech, o_j(t), is thus given as

    o_j(t) = \sum_{l=1}^{2} s_j^{(l)}(t),    (1)

where l is the source index.



Fig. 1. Processing flow of DISTINCT.

The system consists of an enhancement process that aims at recovering the sum of the 2-channel (2-ch) speech microphone images, s^{(1)}(t), as

    s^{(1)}(t) = \sum_{j=1}^{2} s_j^{(1)}(t).    (2)

This corresponds to the delay-and-sum beamformer used for enhancing the signal located right in front of the array, i.e., the target speech. The enhanced speech is then passed to the recognizer.

Fig. 1 is a schematic diagram of DISTINCT. It consists of the following modules.

• Speech–noise separation using DOLPHIN (see Section 3): DOLPHIN processes the binaural observed signals, o_j(t), and outputs a single channel time domain enhanced signal that we denote as x(t). With DOLPHIN, rapidly changing speech and noise can be appropriately distinguished based mainly on their locational features, while the spectral shapes of the speech can be estimated reliably based mostly on the spectral features. With the CHiME challenge, since the target speaker location is fixed and speech and noise training data are available, spectral and locational models can be trained in advance to achieve optimal performance.

• Example-based speech enhancement (see Section 4): Example-based enhancement uses a parallel corpus containing speech sentences processed with DOLPHIN and the corresponding clean speech. The long-term temporal information of the speech corpus is represented by an example model that is detailed in Section 4.1. Enhancement is performed by searching for the longest speech segments in the corpus that best match the DOLPHIN processed speech, x(t), using the example model. Then we employ the corresponding clean speech segments to reconstruct the target speech y(t) using Wiener filtering. Using the long-term temporal information enables us to distinguish non-stationary noise from speech, thereby achieving high-quality enhancement. y(t) can be obtained either by enhancing the observed noisy speech, o_j(t) (example-based I), or the speech processed with DOLPHIN, x(t) (example-based II), thus generating two enhanced speech signals.

• Static and dynamic acoustic model adaptation (see Section 5): The mean parameters of the Gaussians of the acoustic model are adapted using conventional MLLR. We use DVA to dynamically adapt the variance of the Gaussians of the acoustic model. DVA is similar to uncertainty decoding (Droppo et al., 2002), in the sense that a dynamic (i.e., time-varying) feature variance is added to the acoustic model variance during decoding to mitigate the effect of unreliable features. DVA is based on a simple dynamic feature variance model that provides a general formulation, enabling its use with many speech enhancement methods. Moreover, the dynamic feature variance model parameters are optimized for recognition using adaptation data.

• Speech recognition: Speech recognition is performed using clean and multi-condition acoustic models (AM) trained using the differenced maximum mutual information (dMMI) discriminative criterion (McDermott et al., 2010). dMMI is a generalization of the minimum phone error (MPE) criterion that achieves superior or equivalent performance while being simpler to implement. Recognition is performed in parallel using the speech processed with DOLPHIN and the two enhanced outputs of the example-based speech enhancement.



• System combination: The different recognition outputs are merged together using the system combination technique (Evermann and Woodland, 2000). Each speech enhancement output is separately processed by the recognizer to output lattice results. These lattices are then combined by taking account of the word posterior probabilities.

In the following sections, we describe the enhancement and model adaptation modules in more detail.

3. Speech–noise separation using DOLPHIN

In this section, we describe the principles of DOLPHIN, which was recently proposed in Nakatani et al. (2011a, 2011b). The goal of DOLPHIN is to separate the observed signals, o_j(t), into speech and noise. In the following, we use S_{j,n,f}^{(l)} to denote the short time Fourier transforms (STFT) of the speech (l = 1) and noise (l = 2) microphone images at a microphone j (j = 1, 2) for time frame n and frequency f (f = 1, ..., N_f), where N_f is the STFT length. In the rest of this section, we omit the time frame index, n, of all the symbols because the speech–noise separation is applied to each time frame independently. The STFT of the observed signals is given by O_{j,f}. The 2-ch observed signal, O_f, and the 2-ch speech and noise microphone images, S_f^{(l)}, are then given as

    O_f = [O_{1,f}, O_{2,f}]^T  and  S_f^{(l)} = [S_{1,f}^{(l)}, S_{2,f}^{(l)}]^T.    (3)

DOLPHIN aims at recovering the log power spectra of the desired signals, s^{(l)}, defined as

    s^{(l)} = [s_1^{(l)}, ..., s_{N_f}^{(l)}]^T  where  s_f^{(l)} = ln(|S_{1,f}^{(l)} + S_{2,f}^{(l)}|^2),    (4)

where T denotes the non-conjugate transpose of a vector. In particular, the estimate of the desired speech signal s_f^{(l)} for l = 1, denoted by x_f, is handled as the output of DOLPHIN. The output signal is converted to its waveform, x(t), before being transmitted to the subsequent processing steps.

Two important signal characteristics, namely locational and spectral characteristics, are employed by DOLPHIN to enable it to cope well with highly non-stationary noise such as that occurring in the CHiME challenge. The advantages of employing these characteristics can be summarized as follows.

• No matter how rapidly the spectral characteristics of the individual source signals (namely speech and noise signals) change over time, and no matter how complex the shapes of the source signal spectra are, we can track their dominant spectral components robustly based on their locational characteristics as long as they are distinguishable.

• The spectral characteristics are essential for the accurate estimation of the entire spectral shape of each source, including both dominant spectral components and non-dominant spectral components that may be masked by the other source.

In the following, we describe how DOLPHIN utilizes both characteristics in a unified manner to make use of the above advantages. Corresponding to the two important characteristics above, DOLPHIN uses two sets of observed features: one is a level-normalized 2-ch observed signal, denoted by d_f, and the other is the log power spectrum of a monaural signal, denoted by o_f, obtained as the sum of the 2-ch observed signals. Letting || · || be the Euclidean norm of a vector, the two sets of features are defined as

    d = {d_1, ..., d_{N_f}}  where  d_f = O_f / ||O_f||,
    o = [o_1, ..., o_{N_f}]^T  where  o_f = ln(|O_{1,f} + O_{2,f}|^2).

Because d_f includes information about the difference between channels, namely interchannel phase and level differences, it is referred to as a locational feature. In contrast, o_f is referred to as a spectral feature.
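As an illustration, the two feature sets defined above can be computed from the 2-ch STFT coefficients of one frame with a few lines of numpy. The sketch below follows the definitions of d_f and o_f directly; the variable names and the random stand-in data are otherwise illustrative.

```python
import numpy as np

def dolphin_features(O):
    """Compute DOLPHIN's observed features for one time frame.

    O : complex array of shape (2, Nf), the 2-ch STFT coefficients O_{j,f}.
    Returns
      d : complex array (2, Nf), level-normalized 2-ch signal d_f = O_f / ||O_f||
      o : real array (Nf,), log power spectrum o_f = ln(|O_{1,f} + O_{2,f}|^2)
    """
    eps = 1e-12                                      # avoid division by zero / log(0)
    norms = np.sqrt(np.sum(np.abs(O) ** 2, axis=0))  # ||O_f|| per frequency
    d = O / (norms[None, :] + eps)                   # locational features
    o = np.log(np.abs(O[0] + O[1]) ** 2 + eps)       # spectral features (monaural sum)
    return d, o

# Example with random data standing in for one STFT frame (Nf = 257).
rng = np.random.default_rng(0)
O = rng.standard_normal((2, 257)) + 1j * rng.standard_normal((2, 257))
d, o = dolphin_features(O)
```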



Fig. 2. Graphical model of DOLPHIN. The locational sub-model (left, with mixture indices r_f^{(1)}, r_f^{(2)} and feature d_f) and the spectral sub-model (right, with Gaussian indices q^{(1)}, q^{(2)} and feature o_f) share the dominant source index L_f; l = 1 denotes speech and l = 2 denotes noise.

To analyze the observed features, DOLPHIN introduces their generative models as illustrated in Fig. 2. The model is composed of the two sub-models shown on the left and right in the figure, which correspond to generative models for d_f and o_f, respectively.

The sub-model for d_f, which is discussed in more detail in Section 3.2, corresponds to an extension of the generative model for sparseness-based source separation proposed in Yilmaz and Rickard (2004) and Sawada et al. (2011). Sparseness-based source separation was shown to work effectively for speech–noise separation by distinguishing the dominant spectral components based on their location characteristics, even without the other sub-model for o_f in Fig. 2. However, this sub-model does not take any spectral characteristics into account, and thus the separation is not necessarily optimal in terms of separated speech quality.

In contrast, in the sub-model for o_f, which is discussed in more detail in Section 3.1, a Gaussian mixture model (GMM) is employed to represent the spectral characteristics of each source. Thus, this sub-model as a whole corresponds to a generative model for a factorial hidden Markov model (fHMM) (Roweis, 2003), except that GMMs are employed here instead of HMMs for the sake of simplicity.¹ An fHMM was shown to work effectively for the separation of multiple speakers' utterances (Rennie et al., 2010). However, when used alone, it may have trouble separating speech and noise when the diversity of the noise is very high, as in the CHiME challenge database. This is because it is difficult to model such a highly diverse noise precisely, solely using a pre-trained spectral model. In addition, this approach incurs a huge computational cost to search for the combination of speech and noise GMM/HMM index pairs that best fit the observation. This is also a serious problem for practical applications.

Taking the above observation into account, we propose DOLPHIN to combine these two sub-models and make them work cooperatively. For this combination, DOLPHIN lets the two sub-models share the dominant source index (DSI), L_f, which indicates whether speech or noise is more dominant at each frequency f. More precisely, DOLPHIN first estimates the DSIs jointly with the parameters of both sub-models, and then recovers the clean speech spectral shapes based on the estimated DSIs and the speech log-spectral GMM. With this combination, DOLPHIN presents the following advantages.

• DOLPHIN can estimate the DSIs and parameters of the sub-models more reliably than conventional approaches by considering both sub-models simultaneously, and thus estimate the spectral shapes of the clean speech more accurately.²

• A new computationally efficient search for the optimal speech/noise GMM index pairs, which is used by DOLPHIN (see Section 3.3), can be made reliable by using locational models (Nakatani et al., 2012).

It is also important that DOLPHIN is capable of preparing precise sub-models based on prior training when 2-ch training data sets are provided separately for speech and noise. This is an additional advantage of DOLPHIN, particularly in the context of the CHiME challenge. In the following, we describe the two sub-models and the parameter estimation method using the integrated generative model.

¹ The use of HMMs with DOLPHIN is straightforward as discussed in Nakatani et al. (2011a). This extension improved the keyword recognition accuracy for the challenge, but the improvement was small. Therefore, we did not use it in this paper.
² This can be confirmed by the performance improvement obtained by using the sub-models jointly compared with using only one of the sub-models (Nakatani et al., 2011a, 2011b).



3.1. Sub-model for spectral features

DOLPHIN models the log power spectra of source signals, s^{(l)}, by using log-spectral GMMs. With a log-spectral GMM, the probability density function (pdf) of s^{(l)} for each source l is modeled as

    p(s^{(l)}) = \sum_{q^{(l)}} w_{q^{(l)}}^{(l)} p(s^{(l)} | q^{(l)}),    (5)

    p(s^{(l)} | q^{(l)}) = \prod_f p(s_f^{(l)} | q^{(l)}),    (6)

where p(s^{(l)} | q^{(l)}) = N(s^{(l)}; μ_{q^{(l)}}^{(l)}, Σ_{q^{(l)}}^{(l)}) is a pdf of a Gaussian component indexed by q^{(l)} with a mean vector μ_{q^{(l)}}^{(l)} and a diagonal covariance matrix Σ_{q^{(l)}}^{(l)}, and w_{q^{(l)}}^{(l)} is a mixture weight. In this paper, the parameters of this model, Σ_{q^{(l)}}^{(l)}, μ_{q^{(l)}}^{(l)}, and w_{q^{(l)}}^{(l)}, are all assumed to be fixed in advance by prior training using the training data set.

To model the relationship between the source signal s_f^{(l)} and the observed signal o_f, we adopt the log-max model (Roweis, 2003) because it allows us to achieve efficient optimization based on the EM algorithm, as discussed in Nakatani et al. (2011a). The relationship is defined as

    o_f = \max(s_f^{(1)}, s_f^{(2)}).    (7)

To deal with this model in a probabilistic form, we further introduce the following equations,

    p(o_f | L_f, s_f^{(1)}, s_f^{(2)}) = δ(o_f - s_f^{(L_f)}),    (8)

    p(L_f | s_f^{(1)}, s_f^{(2)}) = 1  when  L_f = \arg\max_l s_f^{(l)},  and  0  otherwise.    (9)

The first equation above ensures that o_f is equal to the source, s_f^{(L_f)}, indexed by the DSI, L_f, and the second equation ensures that the DSI is equal to the index of the dominant source. Then, given a spectral Gaussian index pair, q = [q^{(1)}, q^{(2)}], the joint pdf of o_f and L_f is derived based on Eqs. (6), (8) and (9) with marginalization over s_f^{(1)} and s_f^{(2)} as

    p(o_f, L_f | q) = \int\int p(o_f, L_f, s_f^{(1)}, s_f^{(2)} | q) \, ds_f^{(1)} ds_f^{(2)},    (10)
                    = \int\int p(o_f, L_f | s_f^{(1)}, s_f^{(2)}, q) p(s_f^{(1)}, s_f^{(2)} | q) \, ds_f^{(1)} ds_f^{(2)},
                    = \int\int p(o_f | L_f, s_f^{(1)}, s_f^{(2)}) p(L_f | s_f^{(1)}, s_f^{(2)}) p(s_f^{(1)} | q^{(1)}) p(s_f^{(2)} | q^{(2)}) \, ds_f^{(1)} ds_f^{(2)},
                    = p(s_f^{(L_f)} = o_f | q^{(L_f)}) \int p(L_f | s_f^{(L_f)} = o_f, s_f^{(\bar{L}_f)}) p(s_f^{(\bar{L}_f)} | q^{(\bar{L}_f)}) \, ds_f^{(\bar{L}_f)},
                    = p(s_f^{(L_f)} = o_f | q^{(L_f)}) \int_{-\infty}^{o_f} p(s_f^{(\bar{L}_f)} | q^{(\bar{L}_f)}) \, ds_f^{(\bar{L}_f)},    (11)

where \bar{L}_f indicates the non-dominant source index. To derive the expression of Eq. (11), we used the graphical model of Fig. 2 to express p(s_f^{(1)}, s_f^{(2)} | q), the fact that p(o_f, L_f | s_f^{(1)}, s_f^{(2)}, q) = p(o_f, L_f | s_f^{(1)}, s_f^{(2)}), and that the domain of integration for s_f^{(\bar{L}_f)} can be obtained from Eq. (9). Note that it is possible to generalize Eq. (11) to more than 2 sources.

Interestingly, the first element in Eq. (11) corresponds to the likelihood of the dominant source taking the value of o_f, and the second element corresponds to the probability of the non-dominant source taking a value of less than o_f. Note that we can derive p(o, q) = p(o | q) p(q) based on Eq. (11) and w_{q^{(l)}}^{(l)} in Eq. (5) with further marginalization over L_f, and thus achieve a maximum a posteriori (MAP) estimation of q by maximizing p(o, q) even without locational models. This corresponds to the solutions obtained with the conventional fHMM approaches (Roweis, 2003; Rennie et al., 2010).
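For reference, the per-frequency term of Eq. (11) is simply a Gaussian likelihood for the dominant source multiplied by a Gaussian cumulative distribution for the non-dominant source. The following sketch evaluates it with scipy under the assumption that the selected Gaussian components q^{(1)} and q^{(2)} are already summarized by their per-frequency means and variances; the toy values are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def logmax_log_joint(o_f, mu, var, dominant):
    """log p(o_f, L_f = dominant | q) under the log-max model, Eq. (11).

    o_f      : observed log power at frequency f
    mu, var  : length-2 arrays with the per-frequency mean/variance of the
               selected Gaussian components for sources l = 1, 2 (index 0, 1)
    dominant : 0 or 1, the hypothesized dominant source index L_f
    """
    other = 1 - dominant
    ll_dom = norm.logpdf(o_f, loc=mu[dominant], scale=np.sqrt(var[dominant]))
    # probability that the non-dominant source lies below o_f
    ll_other = norm.logcdf(o_f, loc=mu[other], scale=np.sqrt(var[other]))
    return ll_dom + ll_other

# Toy numbers: the speech component is slightly louder than noise at this frequency.
print(logmax_log_joint(o_f=1.0, mu=np.array([0.5, -2.0]),
                       var=np.array([1.0, 2.0]), dominant=0))
```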



3.2. Sub-model for locational features

To define the sub-model for the locational features, DOLPHIN adopts the sparseness assumption proposed in Yilmaz and Rickard (2004). With this assumption, each source signal is assumed to be sparse so that each frequency is dominated only by one of the sources, and thus the observed location feature d_f is assumed to be equal to the location feature of the dominant source at that frequency. Here, the location feature of a source l at a frequency f is defined as D_f^{(l)} = S_f^{(l)} / ||S_f^{(l)}||. Then, the posterior pdf of d_f given L_f can be written as

    p(d_f = d | L_f = l) = p(D_f^{(l)} = d),    (12)

where d is the realization of the location feature and p(D_f^{(l)}) is the pdf of D_f^{(l)} that is defined below.

A model of the location feature, referred to as a location vector model (LM) in this paper, is proposed in Sawada et al. (2011) based on the above assumption, and used for separating a sound mixture composed of a given number of point sources. However, in the assumed scenario for the CHiME challenge, the number of noise sources is not fixed and noise may include certain sources, such as diffusive noise, whose location features are distributed much more widely than a point source. In addition, the reverberation of each source inherently widens the distribution of its locational features. As a consequence, the pdf of D_f^{(l)} should be more complex than that used by existing source separation methods.

To model the pdf of such complex location features for each source, we introduce a location vector mixture model (LMM), which is defined as a mixture pdf of LMs. We expect the LMM to be able to model the location features of speech, which are distributed mainly in front, while it can model the features of noise distributed widely in all directions. This enables DOLPHIN to distinguish the locational characteristics of speech and noise appropriately. The LMM for D_f^{(l)} of each source l is defined at each frequency f as

    p(D_f^{(l)}) = \sum_r w_{r,f}^{(l)} p(D_f^{(l)} | r),    (13)

where w_{r,f}^{(l)} is a mixture weight, and p(D_f^{(l)} | r) is a mixture component of the LMM indexed by r at a frequency f,³ which corresponds to a pdf of a point source, namely an LM of the conventional approach. For each mixture component, p(D_f^{(l)} | r), we adopt the LM proposed in Sawada et al. (2011), which is defined as

    p(D_f^{(l)} | r) = \frac{C}{(η_{r,f}^{(l)})^2} \exp\left( \frac{ -\| D_f^{(l)} - (ξ_{r,f}^{(l)H} D_f^{(l)}) ξ_{r,f}^{(l)} \|^2 }{ (η_{r,f}^{(l)})^2 } \right),    (14)

where ξ_{r,f}^{(l)} and (η_{r,f}^{(l)})^2 are the model parameters, which correspond to a normalized steering vector and a variance, respectively, and C is the normalization constant (see Sawada et al., 2011 for further details of this model). We also assumed, for computational efficiency, that D_f^{(l)} and D_{f'}^{(l)} for f' ≠ f are statistically independent.

In this paper, all the model parameters of the LMMs, η_{r,f}^{(l)}, ξ_{r,f}^{(l)}, and w_{r,f}^{(l)}, are assumed to be trained in advance on the training data set. The prior training can be accomplished according to the maximum likelihood criterion, using the EM algorithm as shown in Sawada et al. (2011).

³ In Fig. 2 this index is indicated as r_f^{(l)} with source and frequency indices, l and f, respectively.
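Eq. (14) can be evaluated directly once the LMM parameters are available. The numpy sketch below is a literal transcription of Eqs. (13) and (14) with illustrative parameter values; in a real system the steering vectors ξ, variances η² and weights would be trained with the EM algorithm of Sawada et al. (2011), and the constant C would be set accordingly.

```python
import numpy as np

def lm_density(D, xi, eta2, C=1.0):
    """LM component density p(D_f^(l) | r) of Eq. (14).

    D    : complex 2-dim location feature D_f^(l) (unit norm)
    xi   : complex 2-dim normalized steering vector xi_{r,f}^(l)
    eta2 : variance (eta_{r,f}^(l))^2
    C    : normalization constant (set to 1 here for illustration)
    """
    proj = np.vdot(xi, D) * xi              # (xi^H D) xi, projection onto the steering vector
    resid = np.sum(np.abs(D - proj) ** 2)   # squared residual norm
    return (C / eta2) * np.exp(-resid / eta2)

def lmm_density(D, weights, xis, eta2s):
    """LMM mixture density of Eq. (13)."""
    return sum(w * lm_density(D, xi, e2)
               for w, xi, e2 in zip(weights, xis, eta2s))

# Illustrative: an observed location feature and a single-component LMM with a
# steering vector pointing to the front of the array.
D = np.array([1.0, 1.0]) / np.sqrt(2)
xi = np.array([1.0, 1.0]) / np.sqrt(2)
print(lmm_density(D, weights=[1.0], xis=[xi], eta2s=[0.1]))
```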



3.3. Parameter estimation by DOLPHIN

Using the two sub-models defined above, DOLPHIN estimates a spectral Gaussian index pair q at each time frame, assuming the DSIs, L_f, to be hidden variables. A MAP function is used as the estimation criterion, and defined as

    L(q) = \sum_{L_1} \cdots \sum_{L_{N_f}} p(d, o, L_1, \ldots, L_{N_f}, q)
         = \sum_{L_1} \cdots \sum_{L_{N_f}} p(d | L_1, \ldots, L_{N_f}) p(o, L_1, \ldots, L_{N_f} | q) p(q)
         = \left( \prod_f \sum_{L_f} p(d_f | L_f) p(o_f, L_f | q) \right) \prod_l w_{q^{(l)}}^{(l)},    (15)

where we assumed conditional independence across frequencies. p(d_f | L_f) and p(o_f, L_f | q) are defined in Eqs. (12) and (11), respectively, and p(q^{(l)}) = w_{q^{(l)}}^{(l)} is a model parameter of the log-spectral GMM in Eq. (5). Then, as in Nakatani et al. (2011a), the maximization of the MAP function is accomplished using the EM algorithm. The auxiliary function for the EM algorithm is defined as

    Q(q | \hat{q}) = E\{ \log p(d, o, L_1, \ldots, L_{N_f}, q) | \hat{q} \}
                   = \sum_f \sum_l p(L_f = l | d_f, o_f, \hat{q}) \log p(o_f, L_f = l | q) + \sum_l \log w_{q^{(l)}}^{(l)}    (16)
                   = \sum_f \sum_l \left( \hat{M}_f^{(l)} \log p(s_f^{(l)} = o_f | q^{(l)}) + \hat{M}_f^{(l)} \log \int_{-\infty}^{o_f} p(s_f^{(\bar{l})} | q^{(\bar{l})}) \, ds_f^{(\bar{l})} \right) + \sum_l \log w_{q^{(l)}}^{(l)},    (17)

where the term related to p(d_f | L_f) is omitted from Eq. (16) because it does not include q, and \hat{M}_f^{(l)} is defined as p(L_f = l | d_f, o_f, \hat{q}) and can be obtained as

    \hat{M}_f^{(l)} = \frac{ p(d_f | L_f = l) p(o_f, L_f = l | \hat{q}) }{ \sum_{l'} p(d_f | L_f = l') p(o_f, L_f = l' | \hat{q}) }.    (18)

If we expand the summation over l and note that \hat{M}_f^{(1)} = (1 - \hat{M}_f^{(2)}),⁴ we can express Eq. (17) as

    Q(q | \hat{q}) = \sum_f \left( \hat{M}_f^{(1)} \log p(s_f^{(1)} = o_f | q^{(1)}) + (1 - \hat{M}_f^{(2)}) \log \int_{-\infty}^{o_f} p(s_f^{(2)} | q^{(2)}) \, ds_f^{(2)}
                   + \hat{M}_f^{(2)} \log p(s_f^{(2)} = o_f | q^{(2)}) + (1 - \hat{M}_f^{(1)}) \log \int_{-\infty}^{o_f} p(s_f^{(1)} | q^{(1)}) \, ds_f^{(1)} \right) + \sum_l \log w_{q^{(l)}}^{(l)}    (19)
                   = \sum_l Q^{(l)}(q^{(l)} | \hat{q}),

where Q^{(l)}(q^{(l)} | \hat{q}) is obtained by rearranging the terms of the summation as

    Q^{(l)}(q^{(l)} | \hat{q}) = \sum_f \left( \hat{M}_f^{(l)} \log p(s_f^{(l)} = o_f | q^{(l)}) + (1 - \hat{M}_f^{(l)}) \log \int_{-\infty}^{o_f} p(s_f^{(l)} | q^{(l)}) \, ds_f^{(l)} \right) + \log w_{q^{(l)}}^{(l)}.    (20)

Here, \hat{M}_f^{(l)} in Eq. (18) is the posterior of the DSI, L_f, which is assumed to be updated in the E-step. It is this posterior calculation that combines the information of the two sub-models within DOLPHIN. Q^{(l)}(q^{(l)} | \hat{q}) in Eq. (20) is the spectral matching function for updating q^{(l)} depending on \hat{M}_f^{(l)} in the M-step. Because Eq. (20) only contains q^{(l)} for a certain l, we can update the GMM index of each source independently in the M-step. Thanks to this unique characteristic, DOLPHIN can update the log-spectral GMM index pairs in a computationally efficient manner.

⁴ Here, for simplicity, we consider the case of two sources, but the same relation can be obtained for more sources.



In addition, the terms inside the brackets in Eq. (20) ensure that the dominant source takes the same value as the observation, while the non-dominant source takes any smaller value.

3.4. Estimation of clean speech by DOLPHIN

After parameter estimation with the EM algorithm, speech–noise separation is achieved based on a minimum mean square error (MMSE) estimation. The estimated log power spectrum of the speech signal, denoted by x_f for all f, is given as

    x_f = \hat{M}_f^{(1)} o_f + (1 - \hat{M}_f^{(1)}) \frac{ \int_{-\infty}^{o_f} s_f^{(1)} p(s_f^{(1)} | \hat{q}^{(1)}) \, ds_f^{(1)} }{ \int_{-\infty}^{o_f} p(s_f^{(1)} | \hat{q}^{(1)}) \, ds_f^{(1)} }.    (21)

The enhanced speech waveform, x(t), is then calculated by using an inverse Fourier transform of exp(x_f) using the phase of the observed signal, followed by an overlap-add synthesis.

3.5. DOLPHIN algorithm

The DOLPHIN algorithm is summarized in Algorithm 1. In the experiments, we used a relatively long window for the feature extraction, that is, a 100 ms Hann window with a 25 ms shift, to capture features for reverberant signals appropriately. For both speech and noise, we fixed the number of mixture components at 256 and 4 in the spectral and locational models, respectively. Speaker-dependent log-spectral GMMs were prepared for speech, while a speaker-independent LMM was prepared for the locational model. As regards noise, a pair of spectral and locational models was trained on all the noise data in the training data set. To ensure convergence, we fixed the number of iterations to a sufficiently large value, which we set to 10.

Algorithm 1. Pseudocode of DOLPHIN.

for all time frames n do
    (1. Estimation of parameters)
    Set the initial estimate of \hat{M}_f^{(l)} based only on the locational features as
        \hat{M}_f^{(l)} = p(d_f | L_f = l) / \sum_{l'} p(d_f | L_f = l').
    repeat
        for all sources l do
            (M-step) Update the log-spectral GMM index \hat{q}^{(l)} as
                \hat{q}^{(l)} = \arg\max_{q^{(l)}} Q^{(l)}(q^{(l)} | \hat{q}).
        end for
        for all sources l do
            for all frequency bins f do
                (E-step) Update \hat{M}_f^{(l)} based on Eq. (18).
            end for
        end for
    until convergence
    (2. Estimation of clean speech)
    Estimate the clean speech x_f using Eq. (21).
    Convert the estimated clean speech to its waveform, x(t).
end for
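To make the two core per-frequency computations of Algorithm 1 concrete, the following numpy/scipy sketch implements the E-step posterior of Eq. (18) and the MMSE estimate of Eq. (21). It assumes the selected Gaussian components are summarized by per-frequency means and variances and that the two sub-model terms have already been evaluated (e.g., via Eqs. (12) and (11)); all names and toy values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def dsi_posterior(log_p_d, log_p_o_L):
    """E-step of Eq. (18): posterior M_f^(l) of the dominant source index.

    log_p_d   : length-2 array, log p(d_f | L_f = l) from the locational sub-model
    log_p_o_L : length-2 array, log p(o_f, L_f = l | q) from the spectral sub-model (Eq. 11)
    """
    log_num = log_p_d + log_p_o_L
    log_num -= np.max(log_num)            # numerical stabilization
    num = np.exp(log_num)
    return num / np.sum(num)

def mmse_speech_estimate(o_f, M1, mu1, var1):
    """MMSE estimate of the speech log power spectrum, Eq. (21).

    o_f       : observed log power at frequency f
    M1        : posterior that speech (l = 1) is dominant at f
    mu1, var1 : mean/variance of the selected speech Gaussian at f
    """
    sigma = np.sqrt(var1)
    beta = (o_f - mu1) / sigma
    # mean of a Gaussian truncated to (-inf, o_f]
    trunc_mean = mu1 - sigma * norm.pdf(beta) / np.maximum(norm.cdf(beta), 1e-12)
    return M1 * o_f + (1.0 - M1) * trunc_mean

M = dsi_posterior(np.log([0.7, 0.3]), np.array([-1.2, -2.5]))
print(mmse_speech_estimate(o_f=1.0, M1=M[0], mu1=0.5, var1=1.0))
```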

3.6. Further extension of DOLPHIN

As future work, an important extension of DOLPHIN is the use of different spectral models. For example, GMMs for mel-frequency cepstral coefficients (MFCC) can be incorporated into DOLPHIN as spectral models instead of the log power spectra (Nakatani et al., 2012).



This can be formulated as a MAP estimation of the MFCCs rather than of the GMM indices, for which the MFCCs of each source are modeled by a GMM and the spectral shapes of each source are modeled by a linear regression from the MFCCs. It was shown by our preliminary experiments that this extension greatly improves the accuracy of the spectral estimation. A detailed discussion of this extension is beyond the scope of this paper, which therefore only provides brief results for the CHiME challenge database as a reference in Section 7.

4. Example-based speech enhancement

Even if DOLPHIN is powerful in terms of speech–noise separation, it may not completely suppress non-stationary noise. Indeed, DOLPHIN focuses on frame-by-frame processing and does not incorporate dynamic information about speech. Therefore, it is not optimal for dealing with highly non-stationary noise. Recently, an example-based approach to speech enhancement has been developed to handle highly non-stationary noise (Ming et al., 2011; Kinoshita et al., 2011). Here we apply this example-based speech enhancement method as a post-processing technique for DOLPHIN. The method uses a parallel speech corpus created using stereo data composed of speech processed by DOLPHIN and the corresponding clean speech. The characteristics of the example-based enhancement method can be summarized as follows.

• It uses examples obtained from clean speech training data and therefore can achieve precise modeling of speech.

• It models precisely the long-term temporal characteristics of speech. With this model, the method can efficiently discover the long-term underlying clean speech sequence embedded in the observed signals, and therefore can completely eliminate the relatively short noise components.

The method first trains an example model that efficiently captures the exact spectro-temporal patterns of the training data processed with DOLPHIN. The example model is described in detail in Section 4.1. During enhancement, the algorithm employs this example model to search for the longest examples in the training data that are most likely to match the input signal. Finally, clean speech examples that correspond to the examples found previously are used to reconstruct clean speech. Note that the use of such long segments provides long-term temporal dynamic information that enables us to distinguish speech from non-stationary noise.

4.1. Construction of example model

Here, we describe the example model M, which will be used to search for speech examples, and show how it models the exact and fine spectro-temporal information of the training data. The model is constructed in two steps. First, we train a GMM to represent the noisy training data processed with DOLPHIN. The GMM is defined as

    G = \sum_{m=1}^{M} \underbrace{w_m N(x; μ_m, Σ_m)}_{g(x|m)},    (22)

where x is an MFCC feature vector of the noisy speech processed with DOLPHIN, g(x|m) is the m-th Gaussian component with mean μ_m and covariance Σ_m, and w_m is the corresponding weight. M is the number of mixture components. Then, based on G, we build an example model, M, by associating the most likely state sequence with the training data as

    M = {G, m_i : i = 1, 2, \ldots, I},    (23)

where m_i is the index of a Gaussian component g(x|m_i) in G that produces the maximum likelihood for the i-th frame feature of the training data set, and I is the total number of training data frames. Hereafter, we call M an example model. In parallel with this example model, the clean speech used in the corpus is stored as amplitude spectra as follows:

    A = {A_i : i = 1, 2, \ldots, I},    (24)

where A_i is the amplitude spectrum of the clean speech associated with the i-th frame feature of the training data set.
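A minimal sketch of the two-step construction described above, using scikit-learn's GaussianMixture as the GMM G. The arrays standing in for the parallel corpus (MFCC features of the DOLPHIN-processed training data and the aligned clean amplitude spectra) and the small model size are illustrative; the actual system used 4096 components on 60th order MFCCs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_example_model(train_feats, train_clean_amp, n_components=64, seed=0):
    """Build the example model M = {G, m_i} of Eq. (23) and the spectra A of Eq. (24).

    train_feats     : (I, D) MFCC features of DOLPHIN-processed training frames
    train_clean_amp : (I, F) clean amplitude spectra aligned with the same frames
    """
    # Step 1: train the GMM G of Eq. (22) on the processed training data.
    G = GaussianMixture(n_components=n_components, covariance_type='diag',
                        random_state=seed).fit(train_feats)
    # Step 2: associate each training frame with its best-matching component index m_i
    # (predict returns the most responsible component for each frame).
    m = G.predict(train_feats)          # shape (I,)
    A = train_clean_amp                 # stored clean amplitude spectra, Eq. (24)
    return G, m, A

# Toy data standing in for a parallel corpus (I = 2000 frames, 13-dim MFCC, 257-bin spectra).
rng = np.random.default_rng(0)
G, m, A = build_example_model(rng.standard_normal((2000, 13)),
                              np.abs(rng.standard_normal((2000, 257))))
```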



4.2. Enhancement using example model

Enhancement is performed by first searching for the longest segment in the training data that matches a segment of the input processed speech. With u, τ and n as frame indices, this search is performed by maximizing the posterior probability, p(M_{u:u+τ} | x_{n:n+τ}), over u and τ as

    (u_n^*, τ_n^*) = \arg\max_{u,τ} p(M_{u:u+τ} | x_{n:n+τ}),    (25)
                   \approx \arg\max_{u,τ} \frac{ p(x_{n:n+τ} | M_{u:u+τ}) }{ p(x_{n:n+τ}) },    (26)

where M_{u:u+τ} = {G, m_i : i = u, u+1, \ldots, u+τ} represents the sequence of Gaussian components modeling consecutive frames from u to u+τ in the training data, and x_{n:n+τ} represents a segment taken from time frame n to n+τ of the noisy observed speech processed with DOLPHIN x, i.e., x_{n:n+τ} = {x_n, ..., x_{n+τ}}. From Eq. (25) to (26), we assumed the prior probability of the corpus segment p(M_{u:u+τ}) to be constant for all segments.

Note that the posterior probability, p(M_{u:u+τ} | x_{n:n+τ}), is designed to favor longer matching between x_{n:n+τ} and M_{u:u+τ}, i.e., a larger τ. This can be realized by defining the denominator of Eq. (26) as in Ming et al. (2011),

    p(x_{n:n+τ}) = \sum_u p(x_{n:n+τ} | M_{u:u+τ}) + ζ(x_{n:n+τ}),    (27)

where ζ(x_{n:n+τ}) is a penalty term that tends to take large values for small τ, and decays with τ. Therefore, the penalty term has the effect of penalizing the short segments in Eq. (26), and thus the posterior probability increases as the matched segment length increases. More detail about the interpretation of the penalty term and its effect can be found in Ming et al. (2011). The numerator of Eq. (26), which is the likelihood of the processed speech given the segment M_{u:u+τ}, is given by

    p(x_{n:n+τ} | M_{u:u+τ}) = \prod_{v=0}^{τ} g(x_{n+v} | m_{u+v}).    (28)

After finding the longest matched segments for all frames, an estimate of the underlying clean speech at time frame n, \hat{Y}_n, is formed by weighted-averaging the clean amplitude spectra associated with the matched segments that include the time frame n as follows:

    \hat{Y}_n = \frac{ \sum_v A_{u^* + n - v} \, p(M_{u^*:u^*+τ^*} | x_{v:v+τ^*}) }{ \sum_v p(M_{u^*:u^*+τ^*} | x_{v:v+τ^*}) },    (29)

where A_{u^* + n - v} is the amplitude spectrum associated with M_{u^*:u^*+τ^*} that corresponds to time frame n. To simplify the notations we dropped the time index v in u_v^* and τ_v^*. Finally, the enhanced speech, Y_n, is obtained through Wiener filtering, using the estimated target speech given in Eq. (29), as

    Y_n = \frac{ \hat{Y}_n^2 }{ \hat{Y}_n^2 + N_n^2 } \cdot B_n,    (30)

    N_n^2 = α N_{n-1}^2 + (1 - α) \max( B_n^2 - \hat{Y}_n^2, 0 ),    (31)

where B_n corresponds to the complex spectrum of the observed speech ("example-based I") or that of the observed speech preprocessed by DOLPHIN ("example-based II"), N_n^2 is an estimate of the noise power spectrum and α is a smoothing factor.
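Once the matched segments have been found and \hat{Y}_n of Eq. (29) is available, the final filtering of Eqs. (30) and (31) reduces to a simple frame loop. The numpy sketch below assumes \hat{Y}_n is already computed; B can hold the observed spectra (example-based I) or the DOLPHIN output spectra (example-based II), and the toy inputs are illustrative.

```python
import numpy as np

def wiener_enhance(B, Y_hat, alpha=0.2):
    """Apply Eqs. (30)-(31) frame by frame.

    B     : (N, F) complex spectra of the signal to be filtered
    Y_hat : (N, F) estimated clean amplitude spectra from Eq. (29)
    alpha : smoothing factor of the recursive noise power estimate
    """
    n_frames, F = B.shape
    Y = np.zeros_like(B)
    noise_pow = np.zeros(F)                       # noise power estimate, initialized to zero
    for n in range(n_frames):
        obs_pow = np.abs(B[n]) ** 2
        # Eq. (31): recursive estimate of the noise power spectrum
        noise_pow = alpha * noise_pow + (1 - alpha) * np.maximum(obs_pow - Y_hat[n] ** 2, 0.0)
        # Eq. (30): Wiener gain applied to B_n
        gain = Y_hat[n] ** 2 / (Y_hat[n] ** 2 + noise_pow + 1e-12)
        Y[n] = gain * B[n]
    return Y

rng = np.random.default_rng(0)
B = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
Y = wiener_enhance(B, np.abs(B) * 0.8)   # toy Y_hat for illustration
```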



4.3. Implementation details for the CHiME task

The above discussion assumes a single set of training data. One of the problems with the example-based enhancement method is that the searching process becomes computationally expensive when the example model becomes large, i.e., large I. Indeed, we need to search for the best matching sequence among all the utterances used to create the example model. A single example model covering all speaker and noise conditions would be very large and so would greatly increase the complexity of the search process. The CHiME recognition challenge allows for a speaker-dependent recognition system. Therefore, we created a separate example model for each speaker. Moreover, to reduce the search cost further, we utilize a separate example model for each signal to noise ratio (SNR) level. First, we blindly select the optimal SNR level by choosing the GMM of Eq. (22) that gives the highest likelihood of the input utterance. Then we use the corresponding example model to enhance the utterance of interest. The pseudocode of the example-based enhancement for an utterance is given in Algorithm 2.

The parameters used in the experiments were as follows. The feature vector for the GMM G consisted of 60th order MFCCs with a log energy term. We used 60 triangular shape filters for the mel-filterbank. The number of mixture components M was 4096. The frame length was 20 ms, and the frame shift was 10 ms. The total number of frames in the training data, I, ranged from 512,127 to 755,650 depending on the target speaker. The smoothing parameter α in Eq. (31) was set at 0.2.

Algorithm 2. Pseudocode of the example-based enhancement.

(1. Searching for longest matched segments)
for all n do
    Initialize u_n^*, τ_n^* and p_n^max to 0; set τ ← 0.
    repeat
        Calculate p(x_{n:n+τ}) as in Ming et al. (2011, Eqs. (3) and (5)).
        for all u do
            Calculate the likelihood p(x_{n:n+τ} | M_{u:u+τ}) using Eq. (28).
            Calculate the posterior probability p(M_{u:u+τ} | x_{n:n+τ}) (see Eq. (26)).
            if p(M_{u:u+τ} | x_{n:n+τ}) ≥ p_n^max then
                p_n^max ← p(M_{u:u+τ} | x_{n:n+τ})
                τ_n^* ← τ
                u_n^* ← u
            end if
        end for
        τ ← τ + 1
    until (no increase in p_n^max)
end for
(2. Enhancement)
for all n do
    Obtain \hat{Y}_n using Eq. (29) and the previously obtained u_n^*, τ_n^*.
    Perform Wiener filtering using Eqs. (30) and (31).
end for
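A simplified numpy sketch of the scoring inside Algorithm 2: it evaluates the segment likelihood of Eq. (28) from precomputed per-frame Gaussian log-likelihoods and the posterior of Eq. (26) with the denominator of Eq. (27). The penalty ζ is treated as a given scalar here, since its exact form follows Ming et al. (2011); all sizes and values are toy examples.

```python
import numpy as np

def segment_posteriors(log_g, n, tau, zeta):
    """Posterior p(M_{u:u+tau} | x_{n:n+tau}) for every start index u (Eqs. 26-28).

    log_g : (N_test, I) matrix with log g(x_t | m_i), the log-likelihood of test
            frame t under the Gaussian associated with training frame i
    n     : start frame of the test segment
    tau   : segment length minus one
    zeta  : penalty value zeta(x_{n:n+tau}) of Eq. (27)
    """
    N_test, I = log_g.shape
    starts = np.arange(I - tau)                       # candidate training start frames u
    # Eq. (28): sum of diagonal entries log_g[n+v, u+v], v = 0..tau
    seg_ll = np.array([log_g[n:n + tau + 1, u:u + tau + 1].diagonal().sum()
                       for u in starts])
    lik = np.exp(seg_ll - seg_ll.max())               # scaled for numerical safety
    denom = lik.sum() + zeta * np.exp(-seg_ll.max())  # Eq. (27), with the same scaling
    return lik / denom                                # Eq. (26) for every u

rng = np.random.default_rng(0)
log_g = rng.standard_normal((50, 200)) - 5.0          # toy per-frame log-likelihoods
post = segment_posteriors(log_g, n=10, tau=4, zeta=1e-3)
print(post.argmax(), post.max())
```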

5. Interconnection of speech enhancement and recognizer using static and dynamic acoustic model adaptation

The above speech enhancement pre-processor can greatly reduce noise. However, the processed speech usually contains artifacts that are detrimental to ASR since they introduce a mismatch between the enhanced speech features and the acoustic models. Because noise is non-stationary and because of the frame-by-frame processing of speech enhancement algorithms, the mismatch changes from one analysis time frame to another and thus cannot be fully compensated for with conventional model compensation approaches such as MLLR (Leggetter and Woodland, 1995). We call such a time-varying mismatch a dynamic mismatch. Recently, there has been increasing interest in mitigating such a dynamic mismatch by considering features as distributions rather than as point estimates (Droppo et al., 2002; Kolossa et al., 2008; Deng et al., 2005).



Let us represent the clean, observed and enhanced speech feature vectors as s_n, o_n and \hat{s}_n,⁵ respectively, for time frame n. We can model the conditional probability of the enhanced speech feature vector given clean features with a Gaussian as in Deng et al. (2005),

    p(\hat{s}_n | s_n) \approx N(\hat{s}_n; s_n, Σ_n),    (32)

where Σ_n is a dynamic feature covariance matrix that we assume to be diagonal, with diagonal elements denoted as the dynamic feature variance σ_{n,i}^2, where i is the feature dimension index. With this model of the enhanced feature distribution, assuming that the acoustic model is represented by HMMs with a state density modeled by GMMs, the probability of the enhanced MFCC feature vector, \hat{s}_n, given an acoustic model HMM state k, can then be expressed as in Deng et al. (2005),

    p(\hat{s}_n | k) = \int_{-\infty}^{+\infty} p(\hat{s}_n, s_n | k) \, ds_n \approx \sum_{m=1}^{M_a} ω_{k,m} N(\hat{s}_n; μ_{k,m}, \underbrace{Σ_{k,m} + Σ_n}_{Σ_{k,m,n}}),    (33)

where m is the Gaussian mixture component index, M_a is the number of Gaussian mixtures, ω_{k,m} is the mixture weight, and μ_{k,m} and Σ_{k,m} are the mean vector and covariance matrix, respectively. The dynamic feature variance, σ_{n,i}^2, represents the feature reliability. For reliable features, the dynamic feature variance values are small, and the decoding with Eq. (33) becomes equivalent to conventional decoding using the clean acoustic model. When features are unreliable, the dynamic feature variance values are large, and therefore Eq. (33) takes similarly small values for all HMM states. Consequently, the influence of unreliable features on the final recognition results is reduced.

It is shown in Deng et al. (2005) that a level of performance close to that of clean speech recognition could be achieved when using an Oracle feature variance, i.e., given by the square difference between clean and enhanced features. However, in practice, it is not easy to find dynamic feature variance estimates that approach the Oracle values and that are optimal for recognition. Moreover, the approaches that have been proposed for calculating the dynamic feature variance (Droppo et al., 2002; Kolossa et al., 2008; Deng et al., 2005) usually rely on the use of a specific pre-processor and therefore lack generality. To cope with these issues, we have recently introduced DVA.

5.1. Dynamic feature variance model

To provide a general model of the dynamic feature variance we introduce two hypotheses. First we present the intuitive hypothesis that the more the features are affected by acoustic distortions, the more uncertain they will become. Moreover, the level of acoustic distortion is assumed to be proportional to the damping caused by the speech enhancement pre-processor. The latter can be quantified as the difference between the enhanced and observed speech features. These hypotheses enable us to express the dynamic feature variance with the following simple expression (Delcroix et al., 2009),

    \hat{σ}_{n,i}^2 = α_i^2 \underbrace{(o_{n,i} - \hat{s}_{n,i})^2}_{\hat{b}_{n,i}^2},    (34)

where \hat{σ}_{n,i}^2 is the estimated dynamic feature variance and α_i is the pre-processor uncertainty weight. \hat{b}_{n,i}^2 provides the root of the time-varying feature variance. The pre-processor uncertainty weight α_i measures the reliability of the speech enhancement pre-processor. If the speech enhancement introduces many artifacts and the enhanced features are therefore unreliable, the pre-processor uncertainty weights will be large.

⁵ Note that depending on the case, \hat{s}_n may be the feature obtained from the output of DOLPHIN, x_n, or from the output of DOLPHIN + example-based enhancement, y_n.
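The decoding rule of Eqs. (33) and (34) amounts to adding the estimated dynamic feature variance to the model variances of every Gaussian before computing the state likelihood. A numpy sketch, assuming diagonal covariances; the dimensions and the constant uncertainty weight are illustrative.

```python
import numpy as np

def dva_state_loglik(s_hat, o, alpha, means, variances, weights):
    """log p(s_hat_n | k) of Eq. (33) with the feature variance model of Eq. (34).

    s_hat, o  : (D,) enhanced and observed feature vectors at frame n
    alpha     : (D,) pre-processor uncertainty weights alpha_i
    means     : (M, D) Gaussian means of HMM state k
    variances : (M, D) diagonal Gaussian variances of state k
    weights   : (M,) mixture weights
    """
    dyn_var = (alpha ** 2) * (o - s_hat) ** 2        # Eq. (34)
    var = variances + dyn_var[None, :]               # Sigma_{k,m} + Sigma_n
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
    log_exp = -0.5 * np.sum((s_hat - means) ** 2 / var, axis=1)
    comp = np.log(weights) + log_norm + log_exp
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())        # log-sum-exp over mixture components

rng = np.random.default_rng(0)
D, M = 39, 7
print(dva_state_loglik(rng.standard_normal(D), rng.standard_normal(D),
                       0.5 * np.ones(D), rng.standard_normal((M, D)),
                       np.ones((M, D)), np.ones(M) / M))
```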



Fig. 3. Recognition system based on the proposed static and dynamic adaptation using MLLR and DVA.

5.2. Estimation of feature variance model parameters

In Delcroix et al. (2009) we proposed estimating α_i using adaptation to obtain optimal values as follows:

    α = \arg\max_{α} \left( p(\{\hat{s}_n\} | W, α) \, p(W) \right),    (35)

where {\hat{s}_n} is a sequence of enhanced speech feature vectors, W is the word sequence corresponding to the feature sequence {\hat{s}_n}, α is the set of model parameters to be optimized, i.e., α = (α_1, ..., α_F), and F is the dimension of the feature vector. Eq. (35) can be solved using the EM algorithm or with a gradient ascent optimization method. A detailed derivation of the algorithm can be found in Delcroix et al. (2009).

Note that the above discussion assumes that the mismatch as regards the mean of the Gaussians of the acoustic model could be well compensated by using the speech enhancement pre-processor, which is not always the case in practice. Thus we combine DVA with conventional mean adaptation techniques such as MLLR. Several approaches can be used to combine MLLR and DVA. Here we performed 3 iterations of MLLR followed by 3 iterations of DVA and repeated the process 20 times to ensure convergence. In the experiments we used unsupervised adaptation to estimate α, i.e., the word sequence W was obtained from a first recognition pass performed without adaptation.

The advantages of the proposed static and dynamic model adaptation scheme can thus be summarized as follows.

• It can dynamically compensate for the mismatch between the enhanced speech and the acoustic model.

• It relies on a general formulation that does not depend on the internal processing of the speech enhancement and can therefore easily be used with DOLPHIN and example-based methods.

• The model parameters are optimized for recognition using the maximum likelihood criterion and adaptation data.

Fig. 3 is a schematic diagram of the static and dynamic adaptation and recognition process when using the proposed adaptation scheme based on MLLR and DVA.
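The alternation schedule described above (3 MLLR iterations followed by 3 DVA iterations, repeated 20 times) reduces to a simple outer loop. The sketch below is purely structural: mllr_step and dva_step are hypothetical placeholders for the actual MLLR and DVA re-estimation procedures, and the labels stand for the transcriptions obtained from the unsupervised first recognition pass.

```python
def adapt_acoustic_model(model, alpha, feats, labels,
                         mllr_step, dva_step,
                         n_outer=20, n_mllr=3, n_dva=3):
    """Sketch of the MLLR/DVA alternation used for static and dynamic adaptation.

    mllr_step(model, feats, labels)       -> updated model  (hypothetical helper)
    dva_step(alpha, model, feats, labels) -> updated alpha  (hypothetical helper, Eq. 35)
    """
    for _ in range(n_outer):
        for _ in range(n_mllr):
            model = mllr_step(model, feats, labels)        # static mean adaptation
        for _ in range(n_dva):
            alpha = dva_step(alpha, model, feats, labels)  # dynamic variance weights
    return model, alpha

# Runs with trivial stand-ins for the two update steps:
model, alpha = adapt_acoustic_model(model={'means': None}, alpha=[0.5],
                                    feats=None, labels=None,
                                    mllr_step=lambda m, f, l: m,
                                    dva_step=lambda a, m, f, l: a)
```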



5.3. Extension to cluster-based DVA

Using the model introduced in Eq. (34) we can already achieve a significant recognition improvement. This simple model assumes a linear relation between the Oracle feature variance and the estimated dynamic feature variance, \hat{σ}_{n,i}^2, which is not always true. In particular, we have observed that the relation between the estimated feature variance and the Oracle feature variance may change according to the speech sound level. This suggests that a better representation of the dynamic feature variance could be obtained by introducing HMM state dependency into the model of Eq. (34). This can be achieved by making α_i dependent on the HMM states. To limit the number of parameters, we use a cluster-based approach similar to the one used by MLLR, i.e., the Gaussians of the acoustic model are grouped into clusters and a different α_{i,r} is associated with each cluster r (Delcroix et al., 2012). The experimental results described in Sections 6.2 and 6.3 and those of Delcroix et al. (2011b) were obtained by defining clusters in a rule-based manner, i.e., one cluster for the silent states and one for the other non-silent states. We have recently investigated the effect of increasing the number of clusters. This was achieved by using conventional regression tree Gaussian clusters obtained by Gaussian clustering based on the Euclidean distance (Leggetter and Woodland, 1995). Results with an increased number of clusters are briefly discussed in Section 7.

6. Experimental results

We tested DISTINCT on the PASCAL 'CHiME' Speech Separation and Recognition Challenge (Barker et al., 2013). The CHiME challenge task consists of the recognition of commands recorded using two microphones in a reverberant family living room, corrupted by various types of noise sources at six SNR levels between −6 and 9 dB. The CHiME data were uttered by 34 speakers consisting of 16 females and 18 males. The task includes 500 'clean' (reverberant only) training utterances per speaker and 6 h of background noise recordings. The test data consist of development and evaluation test sets, both containing 600 utterances at each of the six different SNRs. The evaluation focuses on the recognition accuracy of keywords that consist of a letter followed by a digit, spoken in the middle of the command.

6.1. Experimental settings

We used the speech recognizer platform SOLON (Hori et al., 2007), which was developed at NTT Communication Science Laboratories. The feature vectors consisted of 39 coefficients: 12 MFCCs, the log energy, and the delta and acceleration coefficients. For the feature extraction, we used a frame length and frame shift of 25 ms and 10 ms, respectively. The features were processed with cepstral mean normalization (CMN) performed per utterance. These settings of the feature extraction correspond to those provided by the CHiME challenge baseline system (Barker et al., 2013).

In accordance with the CHiME challenge regulations, the acoustic models were speaker-dependent. They consisted of left-to-right word HMMs, trained with reverberant speech. We generated two types of speaker-dependent acoustic models, one using 'clean' speech (reverberant only) and one using multi-condition training data. The clean speech model consisted of conventional left-to-right HMMs with a total of 254 states, each modeled by a Gaussian mixture consisting of seven Gaussians. We added a silence and short pause model to the original model provided by the CHiME challenge organizers. To train the acoustic models, we first used the acoustic models provided by the challenge organizers as initial models. These models were then retrained with SOLON using Viterbi training. We also created models trained using the dMMI discriminative criterion (McDermott et al., 2010). The dMMI models were obtained by using the previously obtained maximum likelihood (ML) models as initial models.

We created multi-condition data by adding background noise samples to the reverberant training data. There were 42 times more multi-condition training data than clean training data (seven noise environments obtained from the background noise data provided by the CHiME challenge, Barker et al., 2013, times six SNR levels). The multi-condition noisy data were then processed by the DOLPHIN and example-based enhancement algorithms. The obtained multi-condition training data were used to train acoustic models. For the multi-condition model, we did not use silence and short pause models because they did not provide any significant recognition improvement. We used 20 Gaussians per HMM state to cover the variability of the multi-condition training data. The multi-condition acoustic models were also trained using the dMMI discriminative criterion (McDermott et al., 2010).
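As a reference for the front end described at the beginning of this subsection, the per-utterance cepstral mean normalization and the delta/acceleration appending can be written compactly in numpy. The 13-dimensional static features (12 MFCCs plus log energy, 25 ms frames with a 10 ms shift) are assumed to come from any standard MFCC extractor; the random input below merely stands in for them.

```python
import numpy as np

def deltas(feat, width=2):
    """Standard regression-based delta features over +/- `width` frames."""
    pad = np.pad(feat, ((width, width), (0, 0)), mode='edge')
    num = sum(w * (pad[width + w: len(feat) + width + w] -
                   pad[width - w: len(feat) + width - w]) for w in range(1, width + 1))
    return num / (2 * sum(w * w for w in range(1, width + 1)))

def front_end(static_mfcc):
    """Per-utterance CMN, then append delta and acceleration -> 39-dim features."""
    static = static_mfcc - static_mfcc.mean(axis=0, keepdims=True)  # CMN per utterance
    d = deltas(static)
    dd = deltas(d)
    return np.hstack([static, d, dd])

feats = front_end(np.random.default_rng(0).standard_normal((300, 13)))
print(feats.shape)   # (300, 39)
```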
In addition, we used speaker-dependent, SNR-independent, unsupervised adaptation to further reduce any mismatch between the input features and the acoustic model. We employed all the test data (at all SNR levels) from a given speaker to generate labels that we used for adaptation in a batch manner. The adaptation combined DVA and MLLR with a diagonal transformation matrix to adapt the mean parameters of the Gaussians. Hereafter we refer to this adaptation process as Adap.

We evaluated the results in terms of keyword recognition accuracy using the evaluation script provided by the CHiME challenge organizers (Barker et al., 2013).

6.2. Results for the development test set

Table 1 shows the keyword recognition accuracy for the development set when using clean (Systems I–VI) and multi-condition acoustic models (Systems VII–XII). Systems I, II, VII and VIII provide baseline results obtained with noisy speech (without enhancement) using clean and multi-condition training with the ML and dMMI criteria.


6.2. Results for development test set

Table 1 shows the keyword recognition accuracy for the development set when using clean (Systems I–VI) and multi-condition acoustic models (Systems VII–XII). Systems I, II, VII and VIII provide baseline results obtained with noisy speech (without enhancement) using clean and multi-condition training with the ML and dMMI criteria.

Table 1
Keyword recognition accuracy in percent for the development test set. The 'clean' baseline achieved 96.75% keyword accuracy.

System  Model       Speech            Adap.  −6 dB  −3 dB  0 dB   3 dB   6 dB   9 dB   Mean
I       ML-clean    Noisy             –      49.75  52.58  64.25  75.08  84.25  90.58  69.42
II      dMMI-clean  Noisy             –      50.42  53.58  63.33  75.25  84.58  90.50  69.61
III     dMMI-clean  DOLPHIN           –      71.33  76.92  82.08  87.42  90.92  91.75  83.40
IV      dMMI-clean  DOLPHIN + EX I    –      77.42  80.92  84.17  89.42  92.33  94.50  86.46
V       dMMI-clean  DOLPHIN           X      77.08  81.42  86.83  89.33  92.42  93.42  86.75
VI      dMMI-clean  DOLPHIN + EX I    X      78.58  81.83  85.50  90.58  92.83  94.33  87.28
VII     ML-multi    Noisy             –      69.75  75.08  83.25  86.33  92.00  92.75  83.19
VIII    dMMI-multi  Noisy             –      73.25  78.08  84.92  87.75  92.08  93.67  84.96
IX      dMMI-multi  DOLPHIN           –      82.75  85.42  89.17  91.25  92.00  92.67  88.88
X       dMMI-multi  DOLPHIN + EX II   –      84.08  86.00  88.08  90.92  92.92  93.58  89.26
XI      dMMI-multi  DOLPHIN           X      83.83  87.33  90.25  91.50  93.83  93.75  90.08
XII     dMMI-multi  DOLPHIN + EX II   X      83.33  86.75  88.58  91.33  94.17  94.50  89.78
System combination (VI + XI + XII)           85.08  88.75  90.25  92.25  95.08  95.00  91.07

The clean ML baseline (System I) performs better than the baseline provided by the organizers of the challenge because of the use of the silence model and because SOLON's handling of the sparse training data provided better speaker-dependent models (see footnote 6) (Barker et al., 2013). The systems trained using dMMI (Systems II and VIII) provided an improvement over the ML systems (Systems I and VII), especially for the multi-condition model. Indeed, with multi-condition training, dMMI can take advantage of the large quantity of data. Therefore, in the following, we report only results obtained using acoustic models trained with dMMI. Note that we obtained an upper-bound keyword recognition accuracy of 96.75% by recognizing 'clean' speech (reverberant speech with no noise) using the clean model trained with dMMI.

In Table 1, Systems III and IV show the results obtained using DOLPHIN (System III) and DOLPHIN + example-based enhancement with Wiener filtering applied to the noisy speech (DOLPHIN + EX I, System IV) with clean acoustic models. DOLPHIN (System III) already provided an average keyword accuracy improvement of more than 13 points. Combining DOLPHIN with example-based enhancement (System IV) provided an additional improvement of more than 3 points (see footnote 7). In the context of the CHiME challenge, the test utterances consist of commands with a fixed grammar that can be well represented by a speech corpus. Therefore, for the CHiME recognition task, example-based enhancement seems particularly suited to suppressing the non-stationary noise remaining at the DOLPHIN output. Note that example-based enhancement can also be performed without DOLPHIN if we generate the example model directly from noisy speech data. In this case, a preliminary experiment revealed that the average keyword recognition accuracy of example-based enhancement alone was 80.26%, which is much worse than that of DOLPHIN + example-based enhancement.

Using adaptation (MLLR combined with DVA), as shown with Systems V and VI, we further improved the keyword accuracy by 3 points for DOLPHIN and by 0.8 points for DOLPHIN combined with example-based enhancement. Note that for DOLPHIN, we confirmed that MLLR and DVA separately achieved comparable performance improvements of around 2 points, and that combining them achieved an additional 1 point improvement.

The second part of Table 1 (Systems IX–XII) shows the results obtained when using an acoustic model trained with multi-condition training data.
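The Wiener filtering mentioned above can be summarized by the standard power-spectral gain below; this generic form is given only for illustration and is not necessarily identical to the gain estimator used by the example-based enhancement.

import numpy as np

def wiener_gain(speech_psd, noise_psd, gain_floor=0.1):
    """Wiener gain per time-frequency bin from estimated speech and noise power spectra."""
    gain = speech_psd / (speech_psd + noise_psd + 1e-12)
    return np.maximum(gain, gain_floor)  # flooring limits musical noise

# EX I : the gain is applied to the STFT of the noisy observation.
# EX II (introduced below with the multi-condition models): the same kind of gain
#        is applied to the STFT of the DOLPHIN output.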

6 This was corroborated by the observation that we obtained a baseline comparable to the challenge baseline with a speaker-independent model, but a more than 8 point accuracy improvement for the speaker-dependent acoustic models trained using SOLON. The reason may simply be that SOLON updates the variances of the acoustic model only if the state occupancy count exceeds a threshold; otherwise it keeps the variances of the seed model unchanged. Since the seed model corresponds to the speaker-independent model, which is trained with a larger amount of data, we may obtain more reliable variances when the training data are sparse. Note that a similar improvement would be expected with HTK if only the mean parameters were retrained.
7 Note that the example-based enhancement algorithm uses multi-condition data, and therefore, strictly speaking, the whole system does not rely solely on clean training data.
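A minimal sketch of the thresholded variance update described in footnote 6 is given below; the threshold value and the exact SOLON behaviour are not specified in the paper, so both are placeholders.

import numpy as np

def reestimate_gaussian(occ, sum_x, sum_xx, seed_mean, seed_var, min_occ=10.0):
    """Re-estimate one Gaussian from accumulated statistics (occupancy, 1st/2nd order sums).

    The variance is updated only when the occupancy exceeds `min_occ` (placeholder value);
    otherwise the variance of the speaker-independent seed model is kept.
    """
    if occ <= 0.0:
        return seed_mean, seed_var
    mean = sum_x / occ
    if occ > min_occ:
        var = np.maximum(sum_xx / occ - mean ** 2, 1e-6)  # with a small variance floor
    else:
        var = seed_var
    return mean, var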


Table 2
Keyword recognition accuracy in percent for the evaluation test set.

System  Model       Speech            Adap.  −6 dB  −3 dB  0 dB   3 dB   6 dB   9 dB   Mean
Human   –           Noisy             X      90.30  93.00  93.80  95.30  96.80  98.80  94.67
II      dMMI-clean  Noisy             –      45.67  52.67  65.25  75.42  83.33  91.67  69.00
III     dMMI-clean  DOLPHIN           –      71.58  77.92  85.08  90.25  91.58  93.92  85.06
IV      dMMI-clean  DOLPHIN + EX I    –      79.83  82.25  89.75  91.92  92.42  94.92  88.52
V       dMMI-clean  DOLPHIN           X      78.33  82.50  87.42  91.67  93.17  94.83  87.99
VI      dMMI-clean  DOLPHIN + EX I    X      80.42  82.58  90.00  92.58  92.75  95.00  88.89
VIII    dMMI-multi  Noisy             –      70.58  77.75  84.92  89.42  91.50  94.00  84.70
IX      dMMI-multi  DOLPHIN           –      84.25  86.17  90.92  92.58  93.67  93.75  90.22
X       dMMI-multi  DOLPHIN + EX II   –      84.08  86.33  90.42  92.25  93.08  94.08  90.04
XI      dMMI-multi  DOLPHIN           X      85.83  87.92  91.17  93.58  94.08  94.17  91.13
XII     dMMI-multi  DOLPHIN + EX II   X      84.42  87.50  91.67  93.08  94.08  94.42  90.86
System combination (VI + XI + XII)           85.75  88.67  92.92  94.33  94.75  96.25  92.11

The multi-condition training data were obtained by processing the multi-condition noisy training data with DOLPHIN or with DOLPHIN + example-based enhancement. Using DOLPHIN with the multi-condition acoustic model (System IX), we obtained an average accuracy improvement of 4 points compared with the multi-condition noisy speech baseline, and of 5 points when using adaptation (System XI). We also investigated the performance of the multi-condition model with DOLPHIN + example-based enhancement (Systems X and XII). To increase the diversity available for system combination, we used example-based enhancement with Wiener filtering applied to the speech processed by DOLPHIN (DOLPHIN + EX II). The performance is equivalent to that obtained with DOLPHIN, but it provides an additional system that can be used for system combination.

Among Systems I–XII in Table 1, System XI achieved the best average performance. Even though the other systems perform worse than System XI on average, they may cause different types of errors and can thus be used as a different source of information to improve performance with the system combination method (Evermann and Woodland, 2000). The last part of Table 1 shows the results obtained with the system combination technique using the three systems that provided the best performance, i.e., Systems VI, XI and XII. For all SNR levels, the best performance was obtained with the system combination, and an additional average absolute improvement of about 1 point in keyword accuracy was achieved.
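The combination itself follows the posterior-based approach of Evermann and Woodland (2000). As a much simplified illustration of why combining systems with different error patterns helps, a majority vote over the keyword hypotheses of the selected systems could be written as follows (the inputs are hypothetical):

from collections import Counter

def combine_keywords(hypotheses):
    """Majority vote over per-system (letter, digit) keyword hypotheses.

    Ties are broken in favour of the system listed first.
    """
    counts = Counter(hypotheses)
    best, _ = max(counts.items(), key=lambda kv: (kv[1], -hypotheses.index(kv[0])))
    return best

print(combine_keywords([("a", "1"), ("a", "1"), ("e", "1")]))  # -> ('a', '1')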

6.3. Results for evaluation test set

Table 2 shows the recognition results we obtained for the evaluation test set. For conciseness, we only provide the most relevant results. For comparison, we also include the results obtained by a human listener (Barker et al., 2013). Note that even though all the parameters were set using the development set, we obtained a slightly better performance on the evaluation test set. These results confirm the robustness of DISTINCT. Note also that these results are slightly better than those we provided for the CHiME challenge, which achieved the best performance among the challenge participants (Barker et al., 2013). The results in Table 2 were obtained using a multi-condition acoustic model trained on the speech processed by DOLPHIN + example-based enhancement for Systems X and XII, whereas the results reported in Delcroix et al. (2011b) and Barker et al. (2013) were obtained using an unmatched acoustic model trained with speech processed by DOLPHIN only, because we had insufficient time to train the matched acoustic model when writing Delcroix et al. (2011b). These results demonstrate that DISTINCT can greatly improve recognition accuracy: the relative keyword error rate reduction is more than 70% and the performance approaches that of a human listener.
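For example, taking the clean-model noisy baseline (System II, 69.00% average keyword accuracy on the evaluation set) as the reference, the combined system (92.11%) yields a relative keyword error rate reduction of

\[ \frac{(100 - 69.00) - (100 - 92.11)}{100 - 69.00} = \frac{31.00 - 7.89}{31.00} \approx 0.745, \]

i.e., about 74.5%, consistent with the figure of more than 70% quoted above.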


Table 3
Segmental SNR in dB for the development test set.

                  −6 dB  −3 dB  0 dB  3 dB  6 dB  9 dB  Mean
Noisy             −5.7   −4.9   −2.7  −0.7  1.3   3.0   −1.6
DOLPHIN            3.8    4.3    5.1   6.1  6.9   7.7    5.6
DOLPHIN + EX I     4.3    4.5    5.2   6.1  6.9   7.6    5.8

6.4. Evaluation of speech enhancement

DISTINCT provides a significant improvement in recognition accuracy, and it also improves the audible quality of the speech. We evaluated the improvement brought about by speech enhancement in terms of the segmental SNR, computed as in Nakatani et al. (2012), for the six SNR conditions. Table 3 shows the segmental SNR as a function of the input SNR for the development test set for the input noisy speech, DOLPHIN and DOLPHIN + example-based enhancement (see footnote 8). The average segmental SNR of the noisy speech was −1.6 dB. DOLPHIN improved the average segmental SNR to 5.6 dB and DOLPHIN + example-based enhancement improved it to 5.8 dB. Enhancement was particularly effective at low SNRs. With −6 dB noise, the segmental SNR of the noisy speech was −5.7 dB, and it was improved to 3.8 dB and 4.3 dB by DOLPHIN and DOLPHIN + example-based enhancement, respectively. These results demonstrate that DOLPHIN could greatly improve the segmental SNR for all input SNR conditions. Example-based enhancement further improved the segmental SNR when the input SNR was low, but the improvement was small. Nevertheless, a clear audible quality improvement and the suppression of transient noise are confirmed by the sound samples provided on our demo webpage (Delcroix, 2012).

6.5. Computational time

DISTINCT is designed to achieve high performance and assumes offline processing. Therefore, we did not focus on reducing the computational complexity of the system. We can evaluate the computational time of the main processing steps of DISTINCT using the real time factor (RTF), i.e., the processing time divided by the duration of the processed audio. DOLPHIN has an RTF of about 4–5. The RTF of example-based enhancement reaches about 200; it is the most computationally demanding step of our system. The RTF of the ASR decoding when using dynamic model adaptation is about 0.3–1, which is about ten times the decoding time without adaptation. These values indicate that our system is computationally complex. However, there are possibilities for reducing the complexity, as we discuss in the next section.

7. Discussion and future extensions of DISTINCT

Finally, we discuss some possible improvements to the proposed recognition system. First, it is possible to modify DOLPHIN to use source models based on MFCCs instead of log spectra, as explained in Section 3.6. This extension is referred to as DOLPHIN-MFCC. Since GMMs based on MFCCs may model speech better than GMMs based on log spectra, DOLPHIN-MFCC is expected to further improve the enhancement performance. Moreover, with DOLPHIN-MFCC, all the processing units of DISTINCT operate using spectral models based on MFCCs. Consequently, the spectral models used by the enhancement units are even more similar to the model used for recognition, and therefore a superior recognition performance can be expected. Preliminary experiments revealed that DOLPHIN-MFCC could improve the average keyword accuracy by up to 1.7 points.

Another improvement may be achieved by increasing the number of Gaussian clusters used for adaptation, as discussed in Section 5.3. The previous experiments used a rule-based approach to generate Gaussian clusters, i.e., one cluster for silent states and one cluster for all non-silent states. This approach is very simple but hard to extend to more clusters.

8 The results here are provided only for the development test set, because the ‘clean’ reference signals are only available for this test set. Even so, we would expect to observe the same tendency for the evaluation test set.
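A minimal sketch of a segmental SNR computation of this kind is shown below; the frame length (25 ms at 16 kHz) and the clamping of the per-frame SNR to [−10, 35] dB are common conventions and are assumptions here, since the paper follows the evaluation of Nakatani et al. (2012).

import numpy as np

def segmental_snr(reference, processed, frame_len=400, floor_db=-10.0, ceil_db=35.0):
    """Average frame-wise SNR (dB) between a clean reference and a processed signal."""
    n_frames = min(len(reference), len(processed)) // frame_len
    snrs = []
    for i in range(n_frames):
        ref = reference[i * frame_len:(i + 1) * frame_len]
        err = ref - processed[i * frame_len:(i + 1) * frame_len]
        num, den = np.sum(ref ** 2), np.sum(err ** 2) + 1e-12
        if num > 0.0:  # skip silent reference frames
            snrs.append(np.clip(10.0 * np.log10(num / den), floor_db, ceil_db))
    return float(np.mean(snrs)) if snrs else float("nan")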

[Fig. 4 appears here; it plots the keyword accuracy (%) against the input SNR (dB) for the conditions listed in the caption.]
Fig. 4. Summary of the keyword accuracy for the evaluation set for human, noisy speech with multi-condition acoustic model, DISTINCT and DISTINCT 2.

Instead, we may use the conventional Gaussian clustering approach to create a regression tree, similar to conventional MLLR. When we used this simple extension of the adaptation process, we were able to achieve an absolute error reduction of up to 1.5 points, using the same number of clusters for both MLLR and DVA.

As a reference, we show the results obtained using DOLPHIN-MFCC and the above adaptation procedure (DISTINCT 2) in Fig. 4. The figure plots the keyword accuracy as a function of the input SNR for a human listener, for the noisy speech recognized with a multi-condition acoustic model, and for DISTINCT without and with the above improvements (DISTINCT and DISTINCT 2, respectively). The figure clearly shows the potential of these extensions. The detailed results are given in Appendix A.

To ensure optimal performance, we took advantage of the specifications of the CHiME challenge task. In addition to the above modifications, the dependency on the CHiME challenge specifications could be relaxed by addressing the following issues.

• Dealing with unknown speakers: If we are to use speaker-independent models, speaker adaptation techniques should be applied to the different speaker-dependent models used by DISTINCT. DOLPHIN can easily include speaker adaptation techniques to adapt the spectral model (Nakatani et al., 2011a). Speaker adaptation of the example model remains an issue that we should investigate; nevertheless, we have demonstrated experimentally that example-based enhancement can be used with a speaker-independent example model (Kinoshita et al., 2011). For the acoustic models used by the recognizer, many conventional speaker adaptation techniques could be used (Leggetter and Woodland, 1995). Note that channel mismatch could be treated in the same way.
• Dealing with an unknown speaker position: In our system, only DOLPHIN uses information about the known speaker position, through the use of pre-trained locational feature models. However, the locational feature models could be learned from the observed signals when the number of sources is given (Nakatani et al., 2011a), therefore mitigating the need for a fixed, known speaker position.
• Dealing with the absence of noise training data: Noise training data were used to train the noise spectral and locational models of DOLPHIN and to create the multi-condition training data used by the example-based enhancement and the recognizer. The noise locational model used by DOLPHIN could be estimated online (Nakatani et al., 2011a). As for the example-based enhancement, we could use a clean example model (Kinoshita et al., 2011), either directly or by tightly integrating the example-based enhancement within the framework of DOLPHIN.
• Dealing with non-reverberant speech training data: If only clean speech training data were available (i.e., without reverberation), we could process the speech with a dereverberation method before DOLPHIN and example-based enhancement (Naylor and Gaubitch, 2010; Nakatani et al., 2010). This would be similar to the approach we proposed for meeting recognition (Hori et al., 2012).
• Reducing the computational complexity: Currently, as discussed in Section 6.5, the most computationally expensive part of DISTINCT is the example-based enhancement. Its complexity could be greatly reduced by designing more efficient search algorithms and using more compact example models; this will be part of our future research directions. It is also possible to greatly reduce the computational complexity of DOLPHIN and achieve an RTF below 1 using a filterbank implementation, as shown in Nakatani et al. (2011b).


• Extending to less restricted recognition tasks: It is important to evaluate the proposed system on more complex recognition tasks such as spontaneous speech recognition. We have already confirmed the usability of DOLPHIN (Hori et al., 2012) and DVA (Delcroix et al., 2011a) on large vocabulary tasks. Therefore, the main issue would be to confirm the performance of the example-based processing when the corpus utterances do not fully represent the test utterances. However, this issue could be alleviated if a sufficiently large amount of training data were available.

From these considerations, we expect that our proposed framework could be used under less restricted conditions than those of the CHiME challenge. The implications of relaxing these constraints on recognition performance will be part of our future research directions.

8. Conclusion

In this paper we presented a new speech recognition system, called DISTINCT, for use in environments with highly non-stationary noise. We demonstrated that DISTINCT efficiently uses spatial, spectral and temporal information and could therefore greatly improve the audible quality of speech and provide a substantial improvement in recognition performance. The proposed system was developed for the CHiME challenge command recognition task, but it could be extended for use under broader conditions. In this case, one issue will be to extend the framework to unknown speakers and acoustic environments, including unknown speaker locations and unknown channel responses. Another issue is to confirm the performance on more complex tasks such as spontaneous speech.

Appendix A. Results with DOLPHIN-MFCC and cluster-based DVA

Tables A.4 and A.5 show the keyword accuracy obtained with DOLPHIN-MFCC and cluster-based adaptation (DISTINCT 2). Note that in this case, similarly good performance can be achieved with many different systems. Therefore, we used five systems for the system combination (Systems V, VI, XI, XIIa and XIIb).

Table A.4
Keyword recognition accuracy in percent for the development test set for DISTINCT 2.

System  Model       Speech                  Adap.  −6 dB  −3 dB  0 dB   3 dB   6 dB   9 dB   Mean
III     dMMI-clean  DOLPHIN-MFCC            –      74.75  79.42  83.83  88.82  91.67  91.92  85.07
IV      dMMI-clean  DOLPHIN-MFCC + EX I     –      81.25  84.67  87.08  90.08  92.17  92.75  88.00
V       dMMI-clean  DOLPHIN-MFCC            X      81.42  85.83  88.83  91.08  93.75  94.58  89.25
VI      dMMI-clean  DOLPHIN-MFCC + EX I     X      84.33  87.17  90.17  91.42  94.42  94.08  90.27
IX      dMMI-multi  DOLPHIN-MFCC            –      84.83  87.83  89.67  91.00  93.83  93.08  90.04
Xa      dMMI-multi  DOLPHIN-MFCC + EX I     –      86.00  88.83  90.08  91.58  93.50  93.83  90.64
Xb      dMMI-multi  DOLPHIN-MFCC + EX II    –      86.00  87.33  88.92  92.00  93.83  93.75  90.31
XI      dMMI-multi  DOLPHIN-MFCC            X      85.58  90.00  91.67  92.17  95.00  94.58  91.50
XIIa    dMMI-multi  DOLPHIN-MFCC + EX I     X      86.67  89.83  91.50  93.08  94.33  94.50  91.65
XIIb    dMMI-multi  DOLPHIN-MFCC + EX II    X      86.92  89.75  91.25  93.58  95.25  94.92  91.95
System combination (V + VI + XI + XIIa + XIIb)     87.67  90.83  93.33  94.08  95.92  96.17  93.00


Table A.5
Keyword recognition accuracy in percent for the evaluation test set for DISTINCT 2.

System  Model       Speech                  Adap.  −6 dB  −3 dB  0 dB   3 dB   6 dB   9 dB   Mean
Human   –           Noisy                   X      90.30  93.00  93.80  95.30  96.80  98.80  94.67
III     dMMI-clean  DOLPHIN-MFCC            –      77.08  81.25  87.83  89.83  92.42  95.08  87.25
IV      dMMI-clean  DOLPHIN-MFCC + EX I     –      83.25  87.58  89.08  92.67  93.50  95.33  90.24
V       dMMI-clean  DOLPHIN-MFCC            X      83.17  86.83  90.58  94.08  94.42  96.33  90.90
VI      dMMI-clean  DOLPHIN-MFCC + EX I     X      85.33  89.00  91.75  94.17  94.83  96.00  91.85
IX      dMMI-multi  DOLPHIN-MFCC            –      86.00  88.08  91.50  93.92  93.17  94.75  91.24
Xa      dMMI-multi  DOLPHIN-MFCC + EX I     –      86.67  88.25  91.00  93.42  93.92  95.50  91.46
Xb      dMMI-multi  DOLPHIN-MFCC + EX II    –      85.75  89.00  91.67  92.50  93.83  95.42  91.36
XI      dMMI-multi  DOLPHIN-MFCC            X      87.50  90.58  92.42  93.92  95.33  95.83  92.60
XIIa    dMMI-multi  DOLPHIN-MFCC + EX I     X      88.00  90.58  92.33  94.50  94.67  96.42  92.75
XIIb    dMMI-multi  DOLPHIN-MFCC + EX II    X      87.75  90.00  93.17  94.67  94.58  96.17  92.72
System combination (V + VI + XI + XIIa + XIIb)     89.67  92.08  94.17  95.83  96.08  97.17  94.17

References

Barker, J., Vincent, E., Ma, N., Christensen, H., Green, P., 2013. The PASCAL CHiME Speech Separation and Recognition Challenge. Computer Speech and Language 27 (3), 621–633.
Cooke, M., Hershey, J.R., Rennie, S.J., 2010. Monaural speech separation and recognition challenge. Computer Speech and Language 24 (1), 1–15.
Delcroix, M., Nakatani, T., Watanabe, S., 2009. Static and dynamic variance compensation for recognition of reverberant speech with dereverberation preprocessing. IEEE Transactions on ASLP 17 (2), 324–334.
Delcroix, M., Nakatani, T., Watanabe, S., 2011a. Variance compensation for recognition of reverberant speech with dereverberation preprocessing. In: Haeb-Umbach, R., Kolossa, D. (Eds.), Robust Speech Recognition of Uncertain or Missing Data. Springer, pp. 225–255.
Delcroix, M., Kinoshita, K., Nakatani, T., Araki, S., Ogawa, A., Hori, T., Watanabe, S., Fujimoto, M., Yoshioka, T., Oba, T., Kubo, Y., Souden, M., Hahm, S.-J., Nakamura, A., 2011b. Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation. In: Proc. CHiME Workshop, pp. 12–17.
Delcroix, M., Watanabe, S., Nakatani, T., Nakamura, A., 2012. Cluster-based dynamic variance adaptation for interconnecting speech enhancement pre-processor and speech recognizer. Computer Speech and Language, http://dx.doi.org/10.1016/j.csl.2012.07.001
Delcroix, M., 2012b. Demonstration of the results obtained for the PASCAL 'CHiME' Challenge, http://www.kecl.ntt.co.jp/icl/signal/CHiME demo/index.html (cited 24 April 2012).
Deng, L., Droppo, J., Acero, A., 2005. Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion. IEEE Transactions on SAP 13 (3), 412–421.
Droppo, J., Acero, A., Deng, L., 2002. Uncertainty decoding with SPLICE for noise robust speech recognition. In: Proc. ICASSP'02, vol. 1, pp. 57–60.
Ephraim, Y., Malah, D., 1985. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Transactions on ASSP 33 (2), 443–445.
Evermann, G., Woodland, P.C., 2000. Posterior probability decoding, confidence estimation and system combination. In: Proc. NIST Speech Transcription Workshop.
Fujimoto, M., Watanabe, S., Nakatani, T., 2011. Non-stationary noise estimation method based on bias residual component decomposition for robust speech recognition. In: Proc. ICASSP'11, pp. 4816–4819.
Gales, M.J.F., Young, S.J., 1996. Robust continuous speech recognition using parallel model combination. IEEE Transactions on SAP 4 (5), 352–359.
Hori, T., Hori, C., Minami, Y., Nakamura, A., 2007. Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition. IEEE Transactions on ASLP 15 (4), 1352–1365.
Hori, T., Araki, S., Ogawa, A., Souden, M., Delcroix, M., Yoshioka, T., Oba, T., Fujimoto, M., Kinoshita, K., Kubo, Y., Hahm, S.-J., Watanabe, S., Nakatani, T., Nakamura, A., 2012. Evaluation of multi-speaker distant-talk speech recognition in a meeting analysis task. In: Proc. ASJ'12, March, pp. 223–224 (in Japanese).
Kim, D.Y., Un, C.K., Kim, N.S., 1998. Speech recognition in noisy environments using first-order vector Taylor series. Speech Communication 24, 39–49.
Kinoshita, K., Souden, M., Delcroix, M., Nakatani, T., 2011. Single channel dereverberation using example-based speech enhancement with uncertainty decoding technique. In: Proc. Interspeech'11, pp. 197–200.
Kolossa, D., Araki, S., Delcroix, M., Nakatani, T., Orglmeister, R., Makino, S., 2008. Missing feature speech recognition in a meeting situation with maximum SNR beamforming. In: Proc. ISCAS'08, pp. 3218–3221.
Kristjansson, T., Hershey, J., Olsen, P., Gopinath, R., 2006. Super-human multi-talker speech recognition: the IBM 2006 speech separation challenge system. In: Proc. ICSLP'06, pp. 97–100.


Lee, T.-W., Lewicki, M.S., Girolami, M., Sejnowski, T.J., 1999. Blind source separation of more sources than mixtures using overcomplete representations. IEEE SP Letters 6 (4), 87–90.
Leggetter, C.J., Woodland, P.C., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 9 (2), 171–185.
Li, J., Deng, L., Gong, Y., Acero, A., 2007. High-performance HMM adaptation with joint compensation of additive and convolutive distortions via vector Taylor series. In: Proc. ASRU'07, pp. 65–70.
Lippmann, R., Martin, E., Paul, D., 1987. Multi-style training for robust isolated-word speech recognition. In: Proc. ICASSP'87, vol. 12, pp. 705–708.
Makino, S., Lee, T.-W., Sawada, H. (Eds.), 2007. Blind Speech Separation. Springer, Dordrecht.
Martin, R., 2001. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on SAP 9 (5), 504–512.
McDermott, E., Watanabe, S., Nakamura, A., 2010. Discriminative training based on an integrated view of MPE and MMI in margin and error space. In: Proc. ICASSP'10, pp. 4894–4897.
Ming, J., Srinivasan, R., Crookes, D., 2011. A corpus-based approach to speech enhancement from nonstationary noise. IEEE Transactions on ASLP 19 (4), 822–836.
Moreno, P.J., Raj, B., Stern, R.M., 1996. A vector Taylor series approach for environment-independent speech recognition. In: Proc. ICASSP'96, vol. 2, pp. 733–736.
Nakatani, T., Kellermann, W., Naylor, P., Miyoshi, M., Juang, B.H., 2010. Introduction to the special issue on processing reverberant speech: methodologies and applications. IEEE Transactions on ASLP 18 (7), 1673–1675.
Nakatani, T., Araki, S., Yoshioka, T., Fujimoto, M., 2011a. Joint unsupervised learning of hidden Markov source models and source location models for multichannel source separation. In: Proc. ICASSP'11, pp. 237–240.
Nakatani, T., Araki, S., Delcroix, M., Yoshioka, T., Fujimoto, M., 2011b. Reduction of highly nonstationary ambient noise by integrating spectral and locational characteristics of speech and noise for robust ASR. In: Proc. Interspeech'11, pp. 1785–1788.
Nakatani, T., Yoshioka, T., Araki, S., Delcroix, M., Fujimoto, M., 2012. LogMax observation model with MFCC-based spectral prior for reduction of highly nonstationary ambient noise. In: Proc. ICASSP'12, pp. 4029–4032.
Naylor, P.A., Gaubitch, N. (Eds.), 2010. Speech Dereverberation. Springer, London.
Parra, L., Spence, C., 2000. Convolutive blind separation of non-stationary sources. IEEE Transactions on SAP 8 (3), 320–327.
Rennie, S.J., Hershey, J.R., Olsen, P.A., 2010. Single-channel multitalker speech recognition. IEEE SP Magazine, pp. 66–80 (November).
Roweis, S.T., 2003. Factorial models and refiltering for speech separation and denoising. In: Proc. EUROSPEECH-2003, pp. 1009–1012.
Sawada, H., Araki, S., Makino, S., 2011. Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Transactions on ASLP 19 (3), 516–527.
Sehr, A., Hofmann, C., Maas, R., Kellermann, W., 2010. A novel approach for matched reverberant training of HMMs using data pairs. In: Proc. Interspeech'10, pp. 566–569.
Souden, M., Kinoshita, K., Delcroix, M., Nakatani, T., 2011. A multichannel feature-based processing for robust speech recognition. In: Proc. Interspeech'11, pp. 689–692.
Yilmaz, O., Rickard, S., 2004. Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on SP 52 (7), 1830–1847.