Self-learning speaker identification for enhanced speech recognition


Computer Speech and Language 26 (2012) 210–227

Tobias Herbig a,b,∗,1, Franz Gerl c,1, Wolfgang Minker a

a University of Ulm, Institute of Information Technology, Ulm, Germany
b Nuance Communications Aachen GmbH, Ulm, Germany
c SVOX Deutschland GmbH, Ulm, Germany

Received 23 May 2010; received in revised form 18 November 2011; accepted 21 November 2011 Available online 1 December 2011

Abstract

A novel approach for joint speaker identification and speech recognition is presented in this article. Unsupervised speaker tracking and automatic adaptation of the human–computer interface is achieved by the interaction of speaker identification, speech recognition and speaker adaptation for a limited number of recurring users. Together with a technique for efficient information retrieval, a compact modeling of speech and speaker characteristics is presented. Applying speaker specific profiles allows speech recognition to take individual speech characteristics into consideration and thus to achieve higher recognition rates. Speaker profiles are initialized and continuously adapted by a balanced strategy of short-term and long-term speaker adaptation combined with robust speaker identification. Different users can be tracked by the resulting self-learning speech controlled system; only a very short enrollment of each speaker is required. Subsequent utterances are used for unsupervised adaptation, resulting in continuously improved speech recognition rates. Additionally, the detection of unknown speakers is examined with the aim of avoiding the explicit training of new speaker profiles. The speech controlled system presented here is suitable for in-car applications, e.g. speech controlled navigation, hands-free telephony or infotainment systems, on embedded devices. Results are presented for a subset of the SPEECON database. The results validate the benefit of the speaker adaptation scheme and the unified modeling in terms of speaker identification and speech recognition rates.

© 2011 Elsevier Ltd. All rights reserved.

Keywords: Speaker identification; Speech recognition; Speaker adaptation

1. Introduction

Automatic speech recognition has attracted extensive research activities since the 1950s. A natural and easy-to-use human–computer interface for technical applications is targeted to guarantee a high degree of user-friendliness for a wide range of users. Even those who are not familiar with the handling of technical devices, such as infotainment systems, should be supported without any training of the user.

This paper has been recommended for acceptance by Philip Jackson, Ph.D.
∗ Corresponding author at: Nuance Communications Aachen GmbH, Ulm, Germany. E-mail address: [email protected] (T. Herbig).
1 This work was done while the authors were affiliated with Harman/Becker. Tobias Herbig is now with Nuance Communications. Franz Gerl is now with SVOX.

doi:10.1016/j.csl.2011.11.002


In the history of speech recognition and speaker identification the performance of speech controlled systems has steadily improved (Furui, 2009). The high level of recognition accuracy of current speech recognizers allows applications of increasing complexity to be controlled by speech. Speech recognition has been applied to a wide range of applications (Rabiner and Juang, 1993): Speech recognition can increase labor efficiency in call-centers. Dictation systems have been developed which are highly sophisticated for special occupational groups with extensive documentation duties, e.g. medical documentation. In-car applications aim to improve usability and safety for a wide variety of users. The driver should be supported to safely participate in road traffic while operating technical devices such as navigation systems or hands-free sets. The distraction of the driver shall be minimized.

A typical scenario for the system presented in this article may be an infotainment system with speech recognition for navigation, telephony or music control. Such systems are typically not personalized to a single user. The speech signal is generally degraded by varying background noises. For in-car applications speech controlled devices have to stay robust against engine, wind and tire noises, or transient events such as passing cars or babble noise. Especially for embedded systems, computational efficiency and memory consumption are important design parameters due to limited resources. Nevertheless, a very large vocabulary, e.g. city or street names for navigation, has to be recognized accurately.

Despite significant advances through ongoing research activities, some deficiencies still delay the wide-spread application of speech recognition in technical devices (Furui, 2009). One reason is the fact that recognition accuracy is negatively affected by changing environments, speaker variability and natural language input (Junqua, 2000). Speaker independent speech recognizers are trained on a large set of speakers. Thus, it has to be expected that a particular speaker's voice is not optimally represented by the averaged speech model of a speaker independent recognizer (Zavaliagkos et al., 1995).

One possibility to achieve high recognition rates in large vocabulary systems is to train the statistical models on a particular user. Especially when the speaker is known and the environment does not change, the speech model can be extensively trained. For some applications, e.g. dictation, these assumptions may be realistic. For in-car applications these conditions are typically not fulfilled. In a system without speaker tracking, all information acquired about a particular speaker is lost with each speaker turn. However, one may often expect that a device is only used by a small number of users, e.g. five recurring speakers. This allows speaker adaptation to be realized for several speakers separately. If speaker identification is integrated into such a system, speech recognition accuracy may be enhanced by progressive speaker adaptation. When a robust estimate of the speaker's identity is given, additional features can be developed. For example, the usability may be improved by tracking personal preferences and habits to design speaker adaptive speech dialogue strategies.

A speaker adaptive system can be implemented in a simple way if the user has to identify himself after each reset of the system or each change of the current user. When the user is identified in an unsupervised way, a more natural human–computer communication is achieved; no additional intervention of the user should be required.

In this article a self-learning Speaker Identification and Long-term Adaptation System (SILAS) is presented. An integrated implementation of speech recognition and speaker identification has been developed which adapts individually to a small number of users. The requirements for such a system influence the choice of the algorithms for speaker adaptation, speech recognition and speaker identification, as discussed in the following.

Many techniques have been developed for speaker adaptation to capture individual speaker characteristics for enhanced speech recognition. The main differences between the various adaptation schemes which have been published, e.g. by Stern and Lasry (1987), Gauvain and Lee (1994) and Kuhn et al. (2000), lie in the number of free parameters. However, the performance of adaptation techniques strongly depends on stable estimates of the tuning parameters. Especially during the initialization of a new speaker profile, fast speaker adaptation that converges within a few utterances is essential to provide robust statistical models for efficient speech recognition and reliable speaker identification. To support a self-learning speech controlled system in building up individual speaker profiles, fast initial adaptation should be combined with detailed long-term adaptation. A balanced strategy is required to benefit from the respective advantages of both approaches.

Unsupervised long-term adaptation without explicit authentication of the user requires speaker identification. New profiles have to be initialized on the first occurrences of new users. Recurring users have to be tracked across speaker turns despite limited adaptation data to guarantee long-term stability. The realization of SILAS presented in this article relies on a short enrollment of each speaker. In an absolutely unsupervised scenario, users who are unknown to the system have to be detected after a few utterances.


If the statistical models of a speech recognizer can be adapted to particular users, speech recognition can be significantly enhanced by applying long-term adaptation profiles in speech decoding. Optimal recognition accuracy may be achieved for each user by an efficient profile selection during speech recognition. Both recognition accuracy and computational complexity have to be addressed. Since applications for embedded systems are considered, special requirements w.r.t. computational complexity, memory consumption and latency have to be met: The system shall be designed to efficiently retrieve and represent speech and speaker characteristics. Multiple recognitions of speaker identity and spoken text should be avoided, especially for real-time computation. A unified statistical modeling of speech and speaker related characteristics will be presented as an extension of common speech recognizer technologies.

This article is structured as follows. First, the fundamental components of speaker identification, speech recognition and three techniques for speaker adaptation are discussed. Then a balanced speaker adaptation strategy is introduced. It is subsequently used to initialize and continuously adapt speaker profiles to be used for speaker identification and speech recognition. A new realization of speaker identification and speech recognition has been developed, and a compact statistical representation suitable for both tasks is presented. The experiments carried out validate the benefit for enhanced speech recognition and speaker identification. A summary and a conclusion are given. Finally, some extensions of SILAS are described.

2. Speaker identification

Speaker identification is an essential part of SILAS. Speakers shall be tracked across several speaker changes so that speech recognition can be supported by individually adapted speaker profiles. Only in combination with a robust detection of unknown users can a completely unsupervised system be achieved which is able to integrate new speaker profiles without any additional training.

Speaker variability is subsequently modeled by Gaussian Mixture Models (GMMs) since various probability density functions can be modeled accurately if a sufficient number of parameters can be estimated (Reynolds et al., 2000). GMMs model probability density functions by a set of N multivariate Gaussian densities which are subsequently addressed by the component index k. The resulting probability density function

p(x_t | \Theta_i) = \sum_{k=1}^{N} w_k^i \cdot \mathcal{N}(x_t | \mu_k^i, \Sigma_k^i)   (1)

is a convex combination of all component densities. The parameter set \Theta_i contains the weighting factors w_k^i, mean vectors \mu_k^i and covariance matrices \Sigma_k^i for a particular speaker i. The weighting factors are characterized by \sum_{k=1}^{N} w_k^i = 1. For convenience, the parameter set \Theta_i is abbreviated by the speaker index i where the context is clear. t denotes the discrete time index.

Mel Frequency Cepstral Coefficients (MFCCs) or mean normalized MFCCs have been used successfully as feature vectors x_{1:T} for speaker identification (Reynolds et al., 2000; Reynolds, 1995). Furthermore, so-called delta features are frequently used to integrate knowledge about the temporal order of the feature vectors (Reynolds et al., 2000). Independent and identically distributed (iid) feature vectors are usually assumed to simplify likelihood computation. The Maximum A Posteriori (MAP) criterion

i_{MAP} = \mathop{\mathrm{argmax}}_{i} \{p(x_{1:T} | i) \cdot p(i)\}   (2)

may be used to identify the speaker, as found by Reynolds and Rose (1995), for example.

Detecting unknown speakers is crucial for open-set speaker identification because no speaker specific data can be employed for an explicit modeling. Instead, unknown or out-of-set speakers may be detected using a simple threshold \theta_{th} for absolute likelihood values (Fortuna et al., 2005). Advanced techniques apply normalization techniques based on cohort models (Markov and Nakagawa, 1996) or a Universal Background Model (UBM) (Fortuna et al., 2005; Angkititrakul and Hansen, 2007), for instance. The UBM is a speaker independent model which is trained on data from a large group of speakers. For example, the log-likelihood ratio of the speaker models and the UBM can be employed for out-of-set detection (Fortuna et al., 2005):

\log(p(x_{1:T} | \Theta_i)) - \log(p(x_{1:T} | \Theta_{UBM})) \leq \theta_{th}, \quad \forall i.   (3)
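To make (1)-(3) concrete, the following minimal sketch (not from the paper; function and variable names are illustrative, and scores are normalized per frame, so the threshold is a per-frame quantity here) scores an utterance against a set of enrolled speaker GMMs and a UBM:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def avg_log_likelihood(X, weights, means, covs):
    """Average per-frame log-likelihood of a GMM, Eq. (1), assuming
    iid feature vectors. X: (T, D); one weight/mean/cov per density."""
    lls = np.stack([np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
                    for w, m, c in zip(weights, means, covs)])   # (N, T)
    return float(logsumexp(lls, axis=0).mean())

def identify_speaker(X, speaker_gmms, ubm, theta_th, log_priors=None):
    """MAP identification, Eq. (2), with UBM-based out-of-set test, Eq. (3).
    speaker_gmms: list of (weights, means, covs); returns None if out-of-set."""
    scores = np.array([avg_log_likelihood(X, *g) for g in speaker_gmms])
    if np.all(scores - avg_log_likelihood(X, *ubm) <= theta_th):
        return None                      # no enrolled model beats the UBM
    posterior = scores if log_priors is None else scores + log_priors
    return int(np.argmax(posterior))
```

In a practical system the Gaussian evaluations would be shared across models rather than recomputed per speaker, as described for SILAS in Section 5.3.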

The UBM-based likelihood ratio has the advantage of lowering the influence of events which affect all statistical models in a similar way, e.g. text-dependent fluctuations due to unseen data or training conditions (Fortuna et al., 2005; Markov and Nakagawa, 1996).

3. Speech recognition

Automated interaction with the user is essential for a speech controlled system. Hands-free telephony or voice-driven navigation devices are typical in-car applications where the user is enabled to enter telephone numbers by speaking sequences of digits, to select entries from an address book or to enter city names for navigation, for example.

Subsequently, speech recognition is organized in three steps as shown in Fig. 1: In the front-end, background noise is reduced by Wiener filtering before 11 MFCCs are extracted from the audio signal. The zeroth coefficient is replaced by an energy estimate. Cepstral mean subtraction is applied to compensate for channel characteristics. Furthermore, a Linear Discriminant Analysis (LDA) is employed to obtain compact feature vectors.

Fig. 1. Block diagram of a speech recognizer based on Semi-Continuous HMMs (SCHMMs). Figure is taken from Herbig et al. (2010a).

Speech can be modeled by Hidden Markov Models (HMMs). HMMs fuse a Markov model and one or more GMMs to represent static and dynamic characteristics of speech. Speech dynamics are reflected by the states s_t and transitions of a Markov model whereas intra-speaker and inter-speaker variability is modeled by GMMs. In this article so-called Semi-Continuous HMMs (SCHMMs) are employed for speech recognition (Rabiner and Juang, 1993):

p_{SCHMM}(x_t | \Theta, s_t) = \sum_{k=1}^{N} w_k^{s_t} \cdot \mathcal{N}(x_t | \mu_k, \Sigma_k).   (4)

All states of the Markov model share the mean vectors and covariances of one GMM and only differ in their weighting factors. The GMM is also called codebook. \Theta denotes the corresponding parameter set. SCHMMs achieve high recognition accuracy while retaining the modeling power of GMMs. Another strong point is the possibility to define compact codebooks that can be easily modified, which makes speaker adaptation straightforward (Schukat-Talamazzini, 1995). There are additional advantages of SCHMMs as used in this setup that will be discussed in the description of the system.

In the second stage of the speech recognizer shown in Fig. 1 a soft quantization is applied to the feature vectors using a Speaker Independent (SI) codebook. The SI codebook is defined by the parameter set \Theta_0. The soft quantization

q_t^0 = (q_1^0(t), \ldots, q_N^0(t))^T   (5)

q_k^0(t) = \frac{\mathcal{N}(x_t | \mu_k^0, \Sigma_k^0)}{\sum_{l=1}^{N} \mathcal{N}(x_t | \mu_l^0, \Sigma_l^0)}   (6)

contains the likelihood scores of all component densities. The weighting factors of each component are evaluated in the subsequent speech decoding. In the speech decoder a transcription of the spoken phrase is generated based on the acoustic models, lexicon and language model. The acoustic model is realized by Markov chains. The corpus of all possible words to be recognized is given by the lexicon. The prior probabilities of word sequences are reflected by the language model. The Viterbi algorithm (Rabiner and Juang, 1993) is applied to determine the most likely word string. More details on generic language modeling can be found in Manning and Schütze (1999).
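As a small illustration of (5) and (6), the soft quantization can be sketched as follows (illustrative names; a real front-end would operate on the LDA-transformed features and in the log domain to avoid underflow):

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_quantization(x_t, means, covs):
    """Soft quantization against the SI codebook, Eqs. (5)-(6):
    relative likelihood scores of all N component densities."""
    scores = np.array([multivariate_normal.pdf(x_t, mean=m, cov=c)
                       for m, c in zip(means, covs)])
    return scores / scores.sum()   # q_t^0, normalized over all densities
```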


4. Speaker adaptation

Statistical models generally suffer from unseen training conditions when the training has to be done on incomplete data which does not match the actual test situation. Speaker adaptation is one possibility to improve existing statistical models or to obtain new speaker models based on speaker specific data. An extensive training such as the Baum–Welch algorithm (Rabiner and Juang, 1993) is often not possible due to limited data, resources or latency.

Since speaker adaptation generally follows speech decoding, a two-stage approach is feasible. The Viterbi algorithm is applied during speech recognition to determine the optimal state sequence \hat{s}_{1:T}, which can be employed for codebook adaptation. Speaker adaptation is typically realized by a shift of mean vectors whereas covariances are left unchanged. The state variable will be omitted to obtain a compact notation.

4.1. Maximum a posteriori

MAP adaptation, also known as Bayesian learning (Gauvain and Lee, 1994), is a standard approach. It integrates prior knowledge and ML estimates based on the observed data and is characterized by individual adaptation of each Gaussian density (Gauvain and Lee, 1994; Reynolds et al., 2000). MAP adaptation starts from a speaker independent model, e.g. the SI codebook \Theta_0, as the initial parameter set, or alternatively from a trained speaker specific codebook. An enhanced speaker specific model is calculated by optimizing the auxiliary function of the EM algorithm

Q_{MAP}(\Theta, \Theta_0) = Q_{ML}(\Theta, \Theta_0) + \log(p(\Theta))   (7)

Q_{ML}(\Theta, \Theta_0) = \sum_{t=1}^{T} \sum_{k=1}^{N} p(k | x_t, \Theta_0) \cdot \log(p(x_t, k | \Theta)).   (8)

For an iterative procedure \Theta_0 has to be substituted by \bar{\Theta}. The latter represents the parameter set of the preceding iteration. First, a new parameter set of ML estimates is calculated by using one E-step and one M-step of the EM algorithm (Gauvain and Lee, 1994; Reynolds et al., 2000):

p(k | x_t, \Theta_0) = \frac{w_k^0 \cdot \mathcal{N}(x_t | \mu_k^0, \Sigma_k^0)}{\sum_{l=1}^{N} w_l^0 \cdot \mathcal{N}(x_t | \mu_l^0, \Sigma_l^0)}   (9)

n_k = \sum_{t=1}^{T} p(k | x_t, \Theta_0)   (10)

\mu_k^{ML} = \frac{1}{n_k} \sum_{t=1}^{T} p(k | x_t, \Theta_0) \cdot x_t.   (11)

The ML estimates \mu_k^{ML} only represent the observed data and are therefore not reliable when facing few data. MAP adaptation therefore balances ML estimates with prior knowledge about the mean vectors, covariance matrices and the weights given by \mu_k^0, \Sigma_k^0 and w_k^0, respectively. For each component density the interpolation

\mu_k^{MAP} = (1 - \alpha_k) \cdot \mu_k^0 + \alpha_k \cdot \mu_k^{ML}   (12)

\alpha_k = \frac{n_k}{n_k + \eta}   (13)

is controlled by the number of softly assigned feature vectors n_k. For large n_k the ML estimates dominate whereas small values cause the MAP estimate to resemble the prior (Reynolds et al., 2000).
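The E-step and the MAP interpolation of (9)-(13) fit in a few lines; a minimal sketch (illustrative names; means only, matching the setting above where covariances are left unchanged):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def map_adapt_means(X, w0, mu0, cov0, eta=4.0):
    """MAP mean adaptation, Eqs. (9)-(13).
    X: (T, D) feature vectors; w0: (N,) prior weights;
    mu0: (N, D) prior means; cov0: list of N (D, D) covariances."""
    log_post = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
        for w, m, c in zip(w0, mu0, cov0)
    ])                                          # (N, T), unnormalized
    log_post -= logsumexp(log_post, axis=0)     # Eq. (9): p(k | x_t)
    post = np.exp(log_post)
    n_k = post.sum(axis=1)                      # Eq. (10): soft counts
    mu_ml = (post @ X) / np.maximum(n_k, 1e-10)[:, None]     # Eq. (11)
    alpha = n_k / (n_k + eta)                   # Eq. (13)
    return (1.0 - alpha)[:, None] * mu0 + alpha[:, None] * mu_ml  # Eq. (12)
```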


4.2. Maximum likelihood linear regression

Approaches for speaker adaptation, where each Gaussian density is adapted individually, require a sufficient amount of adaptation data to be efficient. Otherwise, the number of tuning parameters has to be decreased to stay robust against limited data. One possibility is to use a set of model transformations for classes of Gaussian densities to capture speaker characteristics in an efficient manner. Maximum Likelihood Linear Regression (MLLR) realizes model adaptation by a linear transformation. For example, mean vector adaptation

\mu_{kr}^{opt} = W_r^{opt} \cdot \zeta_{kr}   (14)

W_r = [b_r \; A_r]   (15)

\zeta_{kr} = (1 \; \mu_{kr}^T)^T   (16)

is implemented for each regression class r by multiplying the mean vector \mu_{kr} with a matrix A_r and by a shift with an offset vector b_r, as found by Gales and Woodland (1996) and Leggetter and Woodland (1995). The transformation matrix W_r^{opt} is estimated by maximizing the auxiliary function Q_{ML}(\Theta, \bar{\Theta}) of the EM algorithm. In this context \bar{\Theta} is given by the initial parameter set \Theta_0 or the parameter set of the previous iteration. The optimal transformation matrix W_r^{opt} is given by the solution of the following set of equations (Gales and Woodland, 1996):

\sum_{t=1}^{T} \sum_{r=1}^{R} L_{kr} \cdot \Sigma_{kr}^{-1} \cdot x_t \cdot \zeta_{kr}^T = \sum_{t=1}^{T} \sum_{r=1}^{R} L_{kr} \cdot \Sigma_{kr}^{-1} \cdot W_r^{opt} \cdot \zeta_{kr} \cdot \zeta_{kr}^T   (17)

where L_{kr} denotes the posterior probability that feature vector x_t is assigned to a particular Gaussian density k within a given regression class r.
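For illustration, a minimal sketch of the mean-transform estimation for a single regression class with diagonal covariances (a simplifying assumption made here for brevity; the codebooks in this article use full covariance matrices, for which solving (17) is more involved):

```python
import numpy as np

def mllr_mean_transform(X, post, mu, var):
    """Single-class MLLR mean transform W = [b A], Eqs. (14)-(17),
    assuming diagonal covariances so each row of W solves a separate
    linear system. X: (T, D) features; post: (T, N) posteriors L_k(t);
    mu: (N, D) means; var: (N, D) diagonal covariances."""
    T, D = X.shape
    N = mu.shape[0]
    zeta = np.hstack([np.ones((N, 1)), mu])       # (N, D+1), Eq. (16)
    occ = post.sum(axis=0)                        # (N,) soft counts
    W = np.zeros((D, D + 1))
    for d in range(D):
        inv_var = 1.0 / var[:, d]
        # G = sum_k (occ_k / sigma_kd^2) * zeta_k zeta_k^T
        G = (zeta * (occ * inv_var)[:, None]).T @ zeta
        # z = sum_t sum_k L_k(t) x_td / sigma_kd^2 * zeta_k
        z = ((post * X[:, [d]]) @ (zeta * inv_var[:, None])).sum(axis=0)
        W[d] = np.linalg.solve(G, z)
    return W    # adapted means: mu_new = (W @ zeta.T).T, Eq. (14)
```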

4.3. Eigenvoices

The Eigenvoice (EV) approach is advantageous when only a very limited amount of training data can be used, as shown by Kuhn et al. (2000). This technique benefits from prior knowledge about the speaker variability of each Gaussian density. This adaptation scheme therefore allows the number of tuning parameters to be significantly reduced, e.g. to some 10 parameters (Kuhn et al., 2000). Even short command and control utterances enable the robust extraction of 10 scalar parameters, allowing an efficient codebook adaptation.

One possibility to extract the main directions of speaker variations in the feature space is using Principal Component Analysis (PCA) (Jolliffe, 2002) in an off-line training. For example, MAP adaptation may be performed on a large training set for several speakers, and the corresponding mean vectors are stacked into long supervectors before a PCA is applied. The main adaptation directions are given by the eigenvectors e_l^{EV}, which are also called eigenvoices. Subsequently, only the first L eigenvoices associated with the highest eigenvalues are employed in speaker adaptation. For convenience, each eigenvoice is split into the contributions e_{k,l}^{EV} of the individual component densities k = 1, \ldots, N.

Codebook adaptation is achieved by the linear combination of all eigenvoices (Kuhn et al., 2000). The adapted mean vector \mu_k^{EV} results from a linear combination of an offset vector, e.g. the SI mean vector \mu_k^0, and a weighted sum of the eigenvoices:

\mu_k^{EV} = \mu_k^0 + \sum_{l=1}^{L} \alpha_l \cdot e_{k,l}^{EV}.   (18)

A more compact notation is achieved by introducing the matrix M_k = (e_{k,1}^{EV}, \ldots, e_{k,L}^{EV}), k = 1, \ldots, N. In combination with the weight vector \alpha = (\alpha_1, \ldots, \alpha_L)^T the adaptation can then be formulated by a matrix product and an offset:

\mu_k^{EV} = \mu_k^0 + M_k \cdot \alpha.   (19)



For adaptation the optimal scalar weighting factors \alpha_l have to be determined by optimizing the auxiliary function of the EM algorithm Q(\Theta, \Theta_0). For an iterative speaker adaptation scheme \Theta_0 has to be replaced by the parameter set of the preceding iteration. The optimal parameter set \alpha^{ML} results from the following set of equations:

\sum_{k=1}^{N} n_k \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot M_k \cdot \alpha^{ML} = \sum_{k=1}^{N} n_k \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot (\mu_k^{ML} - \mu_k^0).   (20)

The adapted mean vectors \mu_k^{EV} are given by the weighted sum of all eigenvoices in (18) given the optimal weights \alpha^{ML}. In contrast to MAP adaptation, this approach is characterized by fast convergence even on limited data due to prior knowledge about speaker variability and a very limited number of adaptation parameters.
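Eq. (20) is a small L × L linear system; a minimal sketch (illustrative names; the statistics n_k and \mu_k^{ML} are accumulated as in (10) and (11)):

```python
import numpy as np

def eigenvoice_adapt(n_k, mu_ml, mu0, cov0_inv, M):
    """Eigenvoice weight estimation and mean adaptation, Eqs. (18)-(20).
    n_k: (N,) soft counts; mu_ml, mu0: (N, D) ML and SI means;
    cov0_inv: (N, D, D) inverted SI covariances; M: (N, D, L) eigenvoices."""
    N, D, L = M.shape
    A = np.zeros((L, L))
    b = np.zeros(L)
    for k in range(N):
        MtS = M[k].T @ cov0_inv[k]               # (L, D)
        A += n_k[k] * (MtS @ M[k])               # left-hand side of Eq. (20)
        b += n_k[k] * (MtS @ (mu_ml[k] - mu0[k]))
    alpha = np.linalg.solve(A, b)                # optimal weights alpha^ML
    return mu0 + M @ alpha                       # Eq. (19) for every density
```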

5. Speaker identification and long-term adaptation system

Starting from a general view on the architecture of SILAS, each module will be discussed in closer detail. SILAS is able to identify the speaker and to decode the utterance simultaneously. Speaker specific codebooks are initialized and continuously adapted.

5.1. System architecture

First, the architecture of SILAS is described. In Fig. 2 the block diagram of the system is depicted. It is assumed that users have to press a push-to-talk button before starting to enter voice commands. Thus, speech segmentation and an additional speaker change detection are not considered in this article since at least a rough end-pointing can be used.

Fig. 2. System architecture of SILAS. One front-end is employed for feature extraction. Speech decoding (I) and speaker identification (II) are realized in a single step by using speaker specific codebooks. Future speaker identification and speech recognition is enhanced by progressive speaker adaptation. Cepstral mean and energy normalization are adapted for each speaker separately. Figure is taken from Herbig et al. (2010d).

SILAS can be briefly summarized as follows:

• Front-end. A Wiener filter is employed to reduce background noises. Since speaker identification is only intended to support speech recognition as an optional component of SILAS, the feature vectors of the speech recognizer are used. The computational complexity of two feature extractions is avoided. The zeroth coefficient of the MFCCs is substituted by a normalized log-energy estimate based on mel-band energy values. For on-line normalization a peak tracker is used. To increase the robustness against channel characteristics, cepstral mean subtraction is applied. Starting from initial values trained off-line, the mean of the MFCC feature vectors is tracked between two speaker turns by applying an exponential window on a frame level (Class et al., 1993). When a speaker change is detected, the adaptation of peak tracking and cepstral mean normalization is continued with the corresponding speaker specific parameter set. To obtain compact feature vectors, an LDA is applied. Windows of nine frames of zero-mean MFCCs are stacked into long vectors. The LDA has been optimized in a bootstrap training to extract the dynamics of delta and delta–delta coefficients (Class et al., 1993). The vector representation x_t comprises 32 elements representing the relevant spectral speech characteristics. A feature vector is computed every 10 ms.

• Speech recognition. An efficient codebook selection scheme is realized by fast speaker tracking on a frame level prior to speech decoding. Each codebook consists of about 1000 multivariate Gaussian densities. SCHMMs with full covariance matrices are used. Because of the bootstrap procedure described by Class et al. (1993), the Gaussian densities correspond to the phonetic inventory of the initial training. The mean vectors thus represent an average phonetic realization, whereas much of the intra- and inter-speaker variance is modeled in the covariance matrices. A moderate number of parameters (approximately 30,000) is required for the mean vectors whereas the covariance matrices are more complex. The transcription of the spoken phrases is obtained on an utterance level. Speaker adaptation is supported by a state alignment of the acoustic models.

• Speaker identification. The definite decision on the speaker identity is made on an utterance level using the codebooks and the speech segmentation of the speech recognizer.

• Speaker adaptation. For speaker adaptation it is convenient to shift only the mean vectors, whereas the covariance matrices are left unchanged. The separation of codebook parameters enables SILAS to achieve high speech recognition accuracies and makes adaptation efficient. Speaker specific codebooks are initialized and continuously adapted based on the recognized word string and the guess of the speaker's identity.

For each utterance speech recognition and speaker identification are calculated in parallel. The corresponding speaker profile is adapted after the utterance is finished. The accuracy of speaker identification and speech recognition is continuously improved with each utterance.

5.2. Balanced speaker adaptation

The first component of SILAS is a flexible yet robust speaker adaptation to provide speaker specific statistical models. A two-stage procedure is applied. Speech decoding provides an optimal state sequence and determines the state dependent weights w_k^s of a particular codebook before the codebook of the current user is adapted. The main aspect is to limit the effective number of adaptation parameters in the case of limited training data and to allow individual adjustment later on. Thus, a balanced adaptation strategy of EV and MAP adaptation can be applied using the Bayesian framework (Kuhn et al., 2000; Botterweck, 2001). One possibility is using EV adaptation to capture speaker characteristics in the feature space along the eigenvoices. In addition, MAP estimates allow the modeling of individual variations not represented by the eigenvoices (Botterweck, 2001):

\check{\mu} = \check{\mu}^0 + \check{P} \cdot (\check{\mu}^{MAP} - \check{\mu}^0) + \sum_{l=1}^{L} \alpha_l \cdot \check{e}_l^{EV}.   (21)

The matrix \check{P} guarantees that MAP adaptation only takes effect in directions orthogonal to all eigenvoices. Adaptation is performed in the supervector space where all mean vectors are stacked into a long vector, indicated by the check accent of \check{\mu}. To realize a smooth transition from eigenvoices to ML estimates without the costly matrix multiplication in (21), Bayesian parameter estimation is subsequently employed (Duda et al., 2001):

n_k \cdot \Sigma_k^{-1} \cdot (\mu_k^{opt} - \mu_k^{ML}) = \tilde{\Sigma}_k^{-1} \cdot (\mu_k^{EV} - \mu_k^{opt}).   (22)

\tilde{\Sigma}_k reflects the variation of MAP estimates \mu_k^{MAP}, obtained on extensive data of particular speakers, w.r.t. EV estimates \mu_k^{EV} based on limited data. \mu_k^{opt} denotes the mean vector to be optimized. For small but non-zero values of n_k the EV approach dominates, but a matrix multiplication is additionally applied. \Sigma_k^0 represents all data used in the training of a particular Gaussian density of the speech recognizer, whereas \tilde{\Sigma}_k models the variation of that density caused by inter-speaker variation. For sufficiently large data sets one may assume that the covariance matrices are proportional, \Sigma_k \cdot (\tilde{\Sigma}_k)^{-1} = \lambda \cdot I, \lambda = const. Then the final solution is quite easy to interpret:

\mu_k^{opt} = (1 - \alpha_k) \cdot \mu_k^{EV} + \alpha_k \cdot \mu_k^{ML}   (23)

\alpha_k = \frac{n_k}{n_k + \lambda}.   (24)

Only the sufficient statistics \{n_k, \mu_k^{ML}\} of the SI codebook are required to calculate \mu_k^{EV} and \mu_k^{opt}, which simplifies accumulating adaptation data during speech recognition. However, unreliable ML estimates still negatively affect EV adaptation due to limited data. Subsequently, the ML estimates in (20) are empirically replaced by MAP estimates \mu_k^{MAP}, e.g. with \eta = 4, to stabilize the EV estimate. For sufficiently large n_k there is no difference because \mu_k^{MAP} and \mu_k^{ML} converge. If n_k is small for particular Gaussian densities, the MAP estimate guarantees that \mu_k^{MAP} \approx \mu_k^0; those Gaussian densities then have only a limited influence on the result of the set of linear equations in (20). Furthermore, in the experiments carried out EV adaptation was only calculated when sufficient speech data (≥ 0.5 s) was available.
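A minimal sketch of the resulting interpolation (23)-(24) (illustrative names; per the text above, the ML estimates entering (20) would be replaced by MAP estimates with η = 4 before \mu_k^{EV} is computed, e.g. with the eigenvoice sketch of Section 4.3):

```python
import numpy as np

def balanced_means(n_k, mu_ev, mu_ml, lam=12.0):
    """Balanced interpolation between EV and ML estimates, Eqs. (23)-(24).
    n_k: (N,) soft counts from Eq. (10); mu_ev, mu_ml: (N, D) mean estimates;
    lam: tuning parameter lambda (values of about 4-20 are studied in Sec. 6)."""
    alpha = (n_k / (n_k + lam))[:, None]           # Eq. (24)
    return (1.0 - alpha) * mu_ev + alpha * mu_ml   # Eq. (23)
```

The design choice is visible directly in the code: densities with few softly assigned frames stay close to the prior-driven EV estimate, while frequently observed densities move towards their locally optimal ML means.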

5.3. Joint speaker identification and speech recognition

When the speaker's identity is known, speaker adaptation can individually modify the SI codebook to enhance the recognition rate for each speaker. Therefore, several speaker profiles have to be handled by the speech recognizer in an efficient way. In Fig. 3 one implementation of an advanced speaker specific speech recognizer is shown. To obtain an accurate speech modeling for each speaker, N_{Sp} speaker specific codebooks are operated in parallel to the SI codebook. However, the computational load is significantly increased since the speech decoder has to process N_{Sp} + 1 data streams in parallel.

Fig. 3. Block diagram of an implementation for speaker specific speech recognition. N_{Sp} speaker specific codebooks and speech decoders are employed in parallel to generate a transcription for each speaker profile separately. Figure is taken from Herbig et al. (2011a).

Especially for navigation applications, most of the computational complexity is due to the Viterbi search algorithm. For embedded devices it is obviously advantageous to separate speaker tracking and speaker specific speech decoding using a codebook selection on a frame level. Therefore SILAS continuously forwards only the result of the codebook belonging to the current speaker.

Subsequently, the realization of speaker identification on two time scales is described in closer detail. First, a fast yet possibly less confident identification is used to select the optimal codebook on a frame level for speaker specific speech recognition. Second, the definite decision on the speaker's identity is made on an utterance level to obtain an improved guess which can be employed in speaker adaptation (Herbig et al., 2010b).

5.3.1. Speaker specific codebook selection

Speaker turns may be captured by allowing the speech recognizer to switch rapidly between speaker profiles. The goal is to achieve optimal speech recognition and low latencies. Switching rapidly, however, holds the risk of not always applying the correct speaker profile in speech decoding. On the other hand, if an improper speaker specific speech model is selected by the speech recognizer, the speech recognition rates do not degrade significantly since two speakers often yield similar pronunciation patterns. The long-term stability of the system is more severely affected if the wrong codebook is adapted.

A codebook selection strategy of low complexity is preferred. The technique presented here is oriented towards the match between the codebooks and the observed feature vectors. Since speaker specific codebooks are optimized to represent the speaker's pronunciation characteristics, this decision is expected to be correlated with the definite speaker identification result which will be obtained on an utterance level.

Class et al. (2003) describe a technique for automatic speaker change detection in a speech recognition system. Their approach achieves low computational complexity by evaluating several speaker specific codebooks in parallel. This approach is extended here by speaker specific codebook selection and simultaneous speaker tracking.

Fig. 4 (part I) shows the core component of SILAS. The setup comprises one speaker independent and several speaker specific codebooks operated in parallel.

Fig. 4. Joint speaker identification and speaker specific speech recognition. Only the soft quantization of the codebook (part I) which is expected to represent the target speaker is employed in speech decoding. Speaker identification is calculated in parallel (part II). A definite decision on the speaker's identity is performed after the utterance is finished. Figure is taken from Herbig et al. (2011a).

Additional latencies in speech decoding are avoided by evaluating only the codebooks and not the underlying Markov models. Therefore, the statistical models are reduced in this context to GMMs. Furthermore, equal weighting factors w_k^{s_t} are assumed for each codebook. No state alignment of the acoustic models is therefore required at this stage. The Markov models are evaluated later during speech decoding. Using this approach the time consuming Viterbi decoding has to be calculated only once.

The computational load can be reduced even further: Because of the limited modification during adaptation compared to the SI codebook, one can use the result of the SI codebook evaluation to reduce the number of Gaussian densities that are evaluated for every speaker. Only the N_b Gaussian densities which are relevant for the likelihood computation of the SI codebook are evaluated:

p(x_t | \Theta_i) = \frac{1}{N} \sum_{k=1}^{N} \mathcal{N}(x_t | \mu_k^i, \Sigma_k^i)   (25)

\approx \frac{1}{N} \sum_{k \in \phi_t^0} \mathcal{N}(x_t | \mu_k^i, \Sigma_k^i).   (26)


In the experiments that will be discussed later, N_b = 10 was employed. In this context \phi_t^0 denotes the indices of the corresponding N_b Gaussian densities resulting from the evaluation of the SI codebook on the current feature vector. For each codebook a separate vector

q_t^i \propto (p(x_t | k = \phi_{t,1}^0, \Theta_i), \ldots, p(x_t | k = \phi_{t,N_b}^0, \Theta_i))^T   (27)

is generated by investigating this subset of Gaussian densities. Since the quadratic part of the exponent of each Gaussian density does not have to be calculated anew, only N_b scalar products have to be calculated. Furthermore, the scalar likelihood p(x_t | \Theta_i) is computed for each codebook to represent the match between codebook and observation (Herbig et al., 2010b). Thus, with only about 300 multiplications and additions per frame and speaker, one receives a first estimate of the identity of the speaker and the corresponding vector quantization.

A decision logic may follow several strategies to select an appropriate speaker specific codebook. Only the corresponding soft quantization vector q_t^{i_{fast}} should be further processed in the speech decoder. One possibility is a decision on the actual likelihood value p(x_t | \Theta_i) using the ML criterion: in each time instance the codebook with the highest likelihood value is selected. Speaker changes, e.g. from male to female, have to be detected rapidly. However, an unrealistic number of speaker turns within a single utterance has to be avoided. Therefore, a decision logic which relies on Bayes' theorem is used in analogy to the forward algorithm (Rabiner and Juang, 1993). The posterior probability p(i_t | x_{1:t}) is calculated for each speaker using the entire history of feature vectors x_{1:t} of the current utterance (Herbig et al., 2010b):

p(i_t | x_{1:t}) = \frac{p(x_t | i_t, x_{1:t-1}) \cdot p(i_t | x_{1:t-1})}{p(x_t | x_{1:t-1})}, \quad i_t = 0, 1, \ldots, N_{Sp}   (28)

\propto p(x_t | i_t) \cdot p(i_t | x_{1:t-1}), \quad t > 1   (29)

p(i_t | x_{1:t-1}) = \sum_{i_{t-1}=1}^{N_{Sp}} p(i_t | i_{t-1}) \cdot p(i_{t-1} | x_{1:t-1}), \quad t > 1   (30)

p(i_1 | x_1) \propto p(x_1 | i_1) \cdot p(i_1).   (31)

A small prior probability p(i_t | i_{t-1}) is used to reflect speaker changes. For convenience, the parameter set \Theta_{i_t} is omitted in this context. The initial distribution p(i_1) can be used to boost a particular speaker identity, e.g. using the identification result of the preceding utterance. The codebook with the highest posterior probability is selected for the subsequent processing:

i_t^{fast} = \mathop{\mathrm{argmax}}_{i_t} \{p(i_t | x_{1:t})\}.   (32)
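The recursion (28)-(32) is a forward algorithm over speaker identities; a minimal per-utterance sketch (illustrative names and transition probability; log-domain arithmetic is used for numerical stability):

```python
import numpy as np
from scipy.special import logsumexp

def track_speaker(frame_logliks, log_p_change=np.log(0.01), log_prior=None):
    """Forward recursion over speaker identities, Eqs. (28)-(32).
    frame_logliks: (T, S) log p(x_t | Theta_i) for SI and speaker codebooks;
    log_p_change: small log transition probability between two speakers;
    log_prior: (S,) log p(i_1), e.g. boosting the previous utterance's result."""
    T, S = frame_logliks.shape
    trans = np.full((S, S), log_p_change)
    np.fill_diagonal(trans, np.log1p(-(S - 1) * np.exp(log_p_change)))
    log_post = (np.zeros(S) - np.log(S)) if log_prior is None else log_prior
    log_post = log_post + frame_logliks[0]          # Eq. (31)
    log_post -= logsumexp(log_post)
    selection = [int(np.argmax(log_post))]          # i_1^fast, Eq. (32)
    for t in range(1, T):
        pred = logsumexp(log_post[:, None] + trans, axis=0)   # Eq. (30)
        log_post = frame_logliks[t] + pred          # Eq. (29)
        log_post -= logsumexp(log_post)             # normalization, Eq. (28)
        selection.append(int(np.argmax(log_post)))
    return selection
```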

Speech decoding is performed based on the soft quantization q_t^{i_{fast}}. This approach has the advantage that multiple recognitions are avoided since speaker identification and speech recognition are not performed sequentially on an utterance level. A smoothing of the posterior probability p(i_t | x_{1:t}) may be employed to prevent an instantaneous drop or rise, e.g. caused by background noises during short speech pauses.

The SI codebook is always evaluated in parallel to all speaker specific codebooks to ensure that the speech recognizer's performance does not decrease. When the SI codebook shows a higher likelihood than the adapted codebooks, this is evidence that no suitable codebook is present, e.g. during the initialization of a new codebook. In such a case, using the quantization given by the SI codebook empirically prevents performance below the baseline.

5.3.2. Speaker identification

For speaker identification each speaker specific codebook may be evaluated on an utterance level in the same way as standard GMMs. The codebooks are therefore used not only to decode speech utterances; they are also employed to identify speakers in parallel to speech decoding, as shown in Fig. 4. Equal prior probabilities w_k^i = p(k | \Theta_i) are assumed to simplify the likelihood calculation, as discussed before. The weighting factors will be omitted in the subsequent notation.


L_i denotes the normalized log-likelihood of speaker i. The log-likelihood computation can be further simplified by the iid assumption:

L_i = \frac{1}{T} \sum_{t=1}^{T} \log\left(\frac{1}{N} \sum_{k=1}^{N} \mathcal{N}(x_t | \mu_k^i, \Sigma_k^i)\right).   (33)

The most likely in-set speaker

i^{slow} = \mathop{\mathrm{argmax}}_{i} \{L_i\}   (34)

is identified after the utterance is finished (Herbig et al., 2010b). A strictly unsupervised solution can only be achieved if out-of-set speakers can be detected automatically. For example, a simple threshold decision

L_i - L_0 < \theta_{th}, \quad \forall i,   (35)

may be applied (Herbig et al., 2010c). This approach is motivated by Fortuna et al. (2005). However, if speaker identification is implemented in this way, it may suffer from speech pauses, e.g. in spelling or digit loops, or from garbage words which do not contain speaker specific information. A more precise speech segmentation is given by the speech recognition result. A more robust guess of the speaker identity is achieved when speech pauses are excluded. Since speaker identification is performed at the end of each utterance, the scalar likelihood values p(x_t | \Theta_i) can be buffered. When a precise segmentation is available, the log-likelihood L_i is accumulated for each speaker.

The advantages of this setup using SCHMMs are:

• A moderate number of parameters that are optimized for every speaker, which efficiently improves recognition.
• An estimate of the speaker likelihood and the corresponding vector quantization at very low cost.
• An adaptation procedure that naturally switches from fast response on little training material to detailed modeling later on.

6. Experiments

Several experiments were conducted to investigate the benefit of the unified approach for speech recognition and speaker identification rates. First, the database and the evaluation set are described. Then, speaker adaptation with predefined speaker identity is examined to evaluate the benefit of a balanced speaker adaptation scheme. Finally, SILAS comprising speaker adaptation, speaker identification and speech recognition is evaluated.

6.1. Database

SPEECON is an extensive speech database which was collected to support the development of speech-driven consumer applications. The database covers about 20 languages, each represented by 600 speakers. Several recordings were done under real conditions and environments. For example, recordings for home, office, public places and in-car environments are available. Recordings of read and spontaneous speech can be used (Iskra et al., 2002).

A subset of the US-SPEECON database was used for the evaluation analyzed in this article. Recordings from 73 speakers (50 male and 23 female) in an automotive environment are considered. The Lombard effect (Junqua, 1996) is reflected. The sound files were down-sampled from 16 kHz to 11.025 kHz and only the AKG microphone recordings were employed in the experiments. Colloquial utterances with more than four words and utterances with mispronunciations were not used. Digit and spelling loops were kept. At least 250 utterances per speaker are available. The time order of the utterances in the recording session was not changed.


6.2. Evaluation

The evaluation was conducted on 60 sets of five enrolled speakers each. To ensure independence of the speaker set composition, the speakers were chosen randomly. The number of female and male speakers in a group is not balanced.

In the first experiment the speaker identity is known prior to each utterance to obtain a reference for the word error rate reduction which may be achieved by SILAS. Then experiments which are more realistic for the use case of SILAS are analyzed for closed speaker sets. At the beginning of each speaker set a short enrollment of 10 utterances takes place for each speaker. Since a closed-set scenario is investigated, the first two utterances of each new speaker are indicated to SILAS during enrollment. After the first two utterances, the initialized speaker model has to be identified in a completely unsupervised way. In an open-set scenario new users have to be detected automatically. After this enrollment, speaker changes are randomly inserted between two utterances with a probability of 10%. The utterances of each speaker set are organized in blocks of at least five utterances, assuming that several utterances are spoken before the speaker changes. The number of utterances within each block and the time order of the speakers in each set are randomized to obtain a realistic profile of speaker turns; a sketch of this protocol is given below.

The speech recognizer without any speaker profiles or continuous speaker adaptation is used as the baseline for the experiments discussed here. Grammars for digit and spelling loops, for numbers and a grammar containing the remaining ≈2000 utterances were generated. For testing, the corresponding recognition grammar was selected prior to each utterance. More details on the training of the speech recognizer are provided by Class et al. (1993).

In an off-line training the main directions of speaker variation in the feature space were estimated. For this purpose speaker specific codebooks were extensively trained by MAP adaptation for a large set of speakers of the USKCP2 development database. For each codebook the mean vectors were stacked into supervectors before the eigenvoices e^{EV} were extracted by a PCA.

A further reference for SILAS is given by a simple implementation of continuous speaker adaptation neglecting any speaker identification. Instead of representing each speaker separately, only one speaker specific codebook is employed for speech recognition and speaker adaptation. To capture speaker turns without using speaker identification, a weighting with an exponential decay is applied to the sufficient statistics of the SI codebook. The sufficient statistics are incrementally updated for speaker adaptation. In the experiments this weighting guarantees that only some five utterances have an effect on speaker adaptation.
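The turn profile described above can be read, for example, as the following sketch (an illustrative reconstruction of the protocol, not the authors' code):

```python
import random

def speaker_turn_profile(speakers, n_utterances, p_change=0.1, min_block=5):
    """Generate a randomized sequence of speaker labels: blocks of at
    least `min_block` utterances, then a 10% change probability after
    each further utterance."""
    current = random.choice(speakers)
    profile, block_len = [], 0
    while len(profile) < n_utterances:
        profile.append(current)
        block_len += 1
        if block_len >= min_block and random.random() < p_change:
            current = random.choice([s for s in speakers if s != current])
            block_len = 0
    return profile
```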

6.3. Results

First, an upper bound for unsupervised speaker adaptation is examined before a more realistic closed-set scenario is considered. The first experiment determines the Word Error Rate (WER) which may be achieved if the correct speaker identity is given. No prior knowledge about the spoken word string is used in speech decoding. Several adaptation strategies have been evaluated (Herbig et al., 2010a). In addition, feature extraction is controlled externally: energy normalization and mean subtraction are continuously adapted or reset before the next utterance is processed. The tuning parameter λ in (24) is considered for specific values. The performance of MAP adaptation is tested for dedicated values of the tuning parameter η in (13).

The results in Table 1 show a significant improvement for the balanced speaker adaptation compared to the baseline and to short-term adaptation. If no knowledge about the speaker identity is given, about 6% relative error rate reduction w.r.t. the baseline can be achieved by the implementation for short-term adaptation. If EV estimates are interpolated with ML estimates, 11.06% WER is achieved using λ = 12. This corresponds to 25% relative error rate reduction w.r.t. the SI baseline. Obviously, the performance of the speech recognizer does not degrade if λ is selected within a relatively wide range.

When either EV or MAP adaptation (according to (12)) is used alone, the results for speech recognition fall below the optimal combination of both adaptation schemes, as expected. MAP adaptation yields better results than the ML or EV approach. The EV approach, however, performs better than ML estimates in the initialization phase.

2 The USKCP is a speech database internally collected by TEMIC Speech Dialog Systems, Ulm, Germany. The USKCP comprises command and control utterances recorded in an automotive environment. The language is US-English.


Table 1
Comparison of different adaptation techniques with predefined speaker identity.

Speaker adaptation        WER [%]
Baseline                  14.77
Short-term adaptation     13.87
Balanced adaptation
  ML (λ ≈ 0)              11.56
  λ = 4                   11.13
  λ = 8                   11.14
  λ = 12                  11.06 *
  λ = 16                  11.14
  λ = 20                  11.10
  EV (λ → ∞)              11.81
MAP adaptation
  η = 4                   11.38
  η = 8                   11.38
  η = 12                  11.43

* Best result of the evaluation (shown in bold in the original table).

Table 2
Comparison of dedicated adaptation techniques for self-learning speaker identification in a closed-set scenario.

Speaker adaptation        WER [%]    ID [%]
Baseline                  14.77      –
Short-term adaptation     13.87      –
Balanced adaptation
  ML (λ ≈ 0)              13.11      81.54
  λ = 4                   11.90      94.64 *
  λ = 8                   11.83      93.49
  λ = 12                  11.84      92.42
  λ = 16                  11.82      92.26
  λ = 20                  11.80 *    91.68
  EV (λ → ∞)              12.49      84.71
MAP adaptation
  η = 4                   12.52      87.35
  η = 8                   12.68      81.67
  η = 12                  12.97      75.60

* Best results of the evaluation (shown in bold in the original table).

ML estimates suffer from poor performance when the number of training utterances is smaller than ≈40, as shown by Herbig et al. (2010a).

The significance of the achieved improvements was evaluated by paired difference tests. The latter are more appropriate than tests for independent data since all experiments were conducted on the identical data set. To evaluate whether the tuning of speaker adaptation yields a real improvement, the results of two different experiments are compared sentence-wise. Recognition errors which occurred in both experiments are separated from errors which were caused by modifying the tuning parameter. The latter are employed to estimate a probability of improvement by calculating the cumulative probability of the underlying binomial distribution. This approach is similar to that of Bisani and Ney (2004). The paired difference tests for the error rates in Table 1 result in probabilities larger than 95% that the balanced adaptation yields a real improvement compared to standard MAP adaptation.

In the following, a closed-set scenario is considered to examine the benefit of self-learning speaker identification combined with speech recognition and unsupervised speaker adaptation. Only a very short training phase of two utterances is used before the user is identified in a completely unsupervised manner. The results for speech recognition and speaker identification (ID) are summarized in Table 2 (Herbig et al., 2010b,e).
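Returning to the paired difference test described above, one way to realize it is a sign test over the discordant sentences; the following sketch reflects one plausible reading of the description (not the authors' exact procedure):

```python
from scipy.stats import binom

def prob_of_improvement(errors_a, errors_b):
    """Sentence-wise paired comparison of two experiments.
    errors_a, errors_b: per-sentence error indicators (True = error).
    Errors common to both experiments are discarded; under the hypothesis
    of no real difference, the discordant sentences follow Binomial(n, 0.5)."""
    only_a = sum(a and not b for a, b in zip(errors_a, errors_b))
    only_b = sum(b and not a for a, b in zip(errors_a, errors_b))
    n = only_a + only_b
    if n == 0:
        return 0.5
    # one-sided: few errors unique to B indicate that B improves on A
    return 1.0 - binom.cdf(only_b, n, 0.5)
```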


Table 3
Detailed comparison of word error rates for baseline and the best results of SILAS. The results for λ = 12 and predefined speaker identity are shown in column I. The results for λ = 20 and a short enrollment of each speaker are shown in column II.

Categories    Baseline (WER [%])    SILAS I (WER [%])    SILAS II (WER [%])
Digits        4.64                  3.24                 3.67
Spelling      21.41                 18.14                19.16
Numbers       8.74                  5.79                 5.86
Words         16.75                 12.46                13.29

[Fig. 5 shows ROC curves, detection rate versus false alarm rate: (a) five speakers are enrolled; (b) 10 speakers are enrolled.]

Fig. 5. Detection of out-of-set speakers for open-set speaker identification for different amounts of adaptation data N_A: N_A ≤ 20 (◦), 20 < N_A ≤ 50, 50 < N_A ≤ 100 and 100 < N_A ≤ 200 (+). Figure is taken from Herbig et al. (2010c).

Significant improvements of the WER can be observed w.r.t. the baseline and short-term adaptation. The two special cases ML (λ ≈ 0) and EV (λ → ∞) are clearly outperformed by the combination of both adaptation techniques. In the case of standard MAP adaptation the identification rates decrease significantly. Inefficient speaker adaptation during the unsupervised learning phase is assumed to be a major reason, as only marginal speaker discrimination can be expected. Since no notable difference in the WER can be observed for λ = 4, ..., 20, speaker identification may be optimized independently from speech recognition. An optimum of 94.64% is reached for speaker identification when λ = 4 is used in speaker adaptation. Speaker characteristics can be efficiently captured even on limited training data. Speaker identification obviously profits from faster adaptation towards the ML values. For speech recognition a higher value of λ also limits the damage done by misidentifications. This may prevent maladjustments of some Gaussian densities that could lead to occasional recognition errors.

In Table 3 the results of the experiments discussed so far are shown in closer detail for 3694 digit loops (20,126 digits), 1271 spelling loops (8140 letters), 864 number loops (5959 numbers) and a further category containing all remaining 69,171 utterances (93,737 words). SILAS yields consistent results for all categories. In the paired difference tests described above the probability of improvement is always larger than 95% for each category.

Finally, the experiment was repeated for an open-set scenario. For testing, the following two-stage technique is employed: First, the current user is assumed to be an in-set speaker. The speaker model characterized by the highest log-likelihood score is assumed to represent the target speaker according to (34). In a second step this decision has to be verified. A threshold criterion is applied to the log-likelihood ratios of speaker specific and SI codebooks according to (35). Depending on the threshold, speech recognition performance could be improved by 16-18% w.r.t. the SI baseline. However, the system could not track recurring speakers accurately; too many codebooks were initialized.

To examine the performance of the in-set/out-of-set classification in more detail, an additional supervised experiment was conducted in which detection errors and maladaptations are neglected (Herbig et al., 2010c). The performance of the in-set/out-of-set classification is evaluated by the so-called Receiver Operating Characteristic (ROC) curve: the detection rate is depicted versus the false alarm rate. The ROC curves in Fig. 5(a) and (b) show the problem of open-set speaker identification in a self-learning system.
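The ROC evaluation can be sketched as follows, assuming one log-likelihood-ratio score per utterance as in (35) (illustrative names):

```python
import numpy as np

def roc_points(in_set_scores, out_of_set_scores):
    """Detection rate vs. false alarm rate for a sweep of thresholds.
    Scores are, e.g., max_i (L_i - L_0) per utterance; an utterance is
    declared out-of-set when its score falls below the threshold."""
    thresholds = np.sort(np.concatenate([in_set_scores, out_of_set_scores]))
    points = []
    for th in thresholds:
        detection = np.mean(out_of_set_scores < th)   # unknowns correctly rejected
        false_alarm = np.mean(in_set_scores < th)     # known speakers rejected
        points.append((false_alarm, detection))
    return points
```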


When speaker models are progressively trained starting from only a few utterances, unknown speakers cannot be reliably detected, especially for fixed thresholds. In this context N_A denotes the number of utterances used for adaptation. Even if one could afford a training on N_A ≈ 20 utterances, the error rates would still be too high for an unsupervised system. This problem would be even more complicated if error propagation were considered in this experiment: mixed codebooks representing several speakers would be used, or several codebooks would belong to one speaker. Therefore a more sophisticated strategy which relies on several utterances and flexible thresholds is required to recognize recurring speakers.

7. Summary and conclusion

A self-learning speech controlled system has been presented and discussed. Based on a flexible adaptation scheme, speaker specific codebooks can be initialized and continuously adapted for several speakers. This technique is characterized by a data-driven smooth transition: starting from globally estimated mean vectors, SILAS transitions to locally optimized mean vectors. The overhead is moderate compared to EV adaptation since the ML estimates can be re-used. When SILAS has perfect knowledge of the identity of the current user, a relative error rate reduction of 25% may be achieved using the adaptation techniques described in this article. This condition may be realistic for some systems where the user is asked to identify himself before he starts a speaker turn or after the system is reset.

A unified approach for simultaneous speaker identification and speech recognition has been developed. Only one front-end is employed; there is no need to realize feature extraction for speech recognition and speaker identification separately. Speaker identification was introduced on different time scales. A trade-off between fast profile selection for speech recognition and a delayed but more confident guess of the speaker identity for speaker adaptation has been discussed. Progressive adaptation of speaker specific codebooks becomes feasible and recognition accuracy can be continuously improved. A relative error rate reduction of 20% w.r.t. the SI baseline has been achieved.

Detecting unknown speakers by a simple threshold for the log-likelihood ratios of speaker specific and SI codebooks requires sufficiently trained codebooks. The detection rates achieved in the experiments show the limitations of such a system in detecting new users based on only one utterance. Especially for unsupervised systems, an improved strategy is required, as will be discussed in the extensions. If the training level of each speaker model is taken into consideration, better results should be obtained (Herbig et al., 2010c).

In summary, the unified modeling of SILAS is preferable for an embedded system. Speaker identification may use the front-end of a speech recognizer. The speaker identity can be estimated in parallel to speech decoding. To achieve optimal speech recognition rates the corresponding speaker specific codebook may be selected on a frame level. Multiple recognitions are not required, limiting computational complexity, memory consumption and latency. With some effort similar features could be developed for continuous HMMs (CHMMs) and/or using MLLR adaptation variants that employ increasing numbers of parameters. It may be expected that the relative improvements would be very similar. These encouraging results in a moderately complex system will hopefully prompt more research in different arenas.
8. Extensions

One important aspect of SILAS is to capture speaker characteristics quickly and accurately for robust speaker identification. In this work EV adaptation was examined. Optimal performance of EV adaptation can only be achieved for balanced training and test conditions. The robustness of the system may therefore be improved by re-estimating the eigenvoices with PCA to obtain a better representation of the eigenspace (Nguyen et al., 1999), as sketched at the end of this section. Since MLLR adaptation is widely used in speaker adaptation, e.g. Leggetter and Woodland (1995), Gales and Woodland (1996) and Ferràs et al. (2007, 2008), it may be integrated into SILAS in future work.

As discussed so far, a robust solution to the closed-set speaker identification problem has been developed. Detecting unknown speakers, however, is essential for strictly unsupervised systems. Given the first results, it has to be viewed as a challenging task which requires further research. The task is closely related to the problem of speaker specific codebooks trained at highly different adaptation levels: in practical applications weakly adapted speaker models compete with well trained ones, so internal parameters of SILAS, e.g. likelihood scores, are expected to be biased (Yin et al., 2008). Furthermore, ML decisions do not provide reliable confidence measures concerning the speaker identity, and speaker adaptation cannot reject single utterances of uncertain origin, so long-term stability may be affected. These aspects shall be addressed to increase the robustness of speaker identification and speech recognition under the objective of unsupervised speaker tracking (Herbig et al., 2011b).
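As a rough sketch of the PCA-based re-estimation of the eigenvoices mentioned at the beginning of this section (Nguyen et al., 1999), the following lines recover an eigenspace from stacked speaker supervectors and constrain a new speaker's mean vectors to that subspace. The supervector layout and the function names are hypothetical, and the ML estimation of the eigenvoice weights is omitted.

import numpy as np

def estimate_eigenvoices(supervectors, n_ev):
    # PCA over speaker supervectors (one row per training speaker, each
    # row the concatenation of that speaker's adapted codebook means).
    mean_sv = supervectors.mean(axis=0)
    centered = supervectors - mean_sv
    # The leading right singular vectors of the centered data matrix are
    # the principal directions, i.e. the eigenvoices.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_sv, vt[:n_ev]

def eigenvoice_supervector(mean_sv, eigenvoices, ev_weights):
    # A new speaker is represented by only n_ev weights; his supervector
    # is constrained to the affine subspace spanned by the eigenvoices.
    return mean_sv + ev_weights @ eigenvoices

Re-estimating the eigenvoices on data that better matches the test conditions should yield a more representative eigenspace and hence faster and more robust EV adaptation.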


References

Angkititrakul, P., Hansen, J.H.L., 2007. Discriminative in-set/out-of-set speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 15 (2), 498–508.
Bisani, M., Ney, H., 2004. Bootstrap estimates for confidence intervals in ASR performance evaluation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-2004, Vol. 1, pp. 409–412.
Botterweck, H., 2001. Anisotropic MAP defined by eigenvoices for large vocabulary continuous speech recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-2001, Vol. 1, pp. 353–356.
Class, F., Kaltenmeier, A., Regel-Brietzmann, P., 1993. Optimization of an HMM-based continuous speech recognizer. In: EUROSPEECH-1993, pp. 803–806.
Class, F., Haiber, U., Kaltenmeier, A., 2003. Automatic detection of change in speaker in speaker adaptive speech recognition systems. US Patent Application 2003/0187645 A1.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, 2nd ed. Wiley-Interscience, New York.
Ferràs, M., Leung, C.C., Barras, C., Gauvain, J.-L., 2007. Constrained MLLR for speaker recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-2007, Vol. 4, pp. 53–56.
Ferràs, M., Leung, C.C., Barras, C., Gauvain, J.-L., 2008. MLLR techniques for speaker recognition. In: The Speaker and Language Recognition Workshop, Odyssey-2008, pp. 21–24.
Fortuna, J., Sivakumaran, P., Ariyaeeinia, A., Malegaonkar, A., 2005. Open-set speaker identification using adapted Gaussian mixture models. In: INTERSPEECH-2005, pp. 1997–2000.
Furui, S., 2009. Selected topics from 40 years of research in speech and speaker recognition. In: INTERSPEECH-2009, pp. 1–8.
Gales, M.J.F., Woodland, P.C., 1996. Mean and variance adaptation within the MLLR framework. Computer Speech and Language 10 (4), 249–264.
Gauvain, J.-L., Lee, C.-H., 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing 2 (2), 291–298.
Herbig, T., Gerl, F., Minker, W., 2010a. Fast adaptation of speech and speaker characteristics for enhanced speech recognition in adverse intelligent environments. In: The 6th International Conference on Intelligent Environments, IE-2010, pp. 100–105.
Herbig, T., Gerl, F., Minker, W., 2010b. Simultaneous speech recognition and speaker identification. In: IEEE Workshop on Spoken Language Technology, SLT-2010, pp. 206–210.
Herbig, T., Gerl, F., Minker, W., 2010c. Detection of unknown speakers in an unsupervised speech controlled system. In: Lee, G.G., Mariani, J., Minker, W., Nakamura, S. (Eds.), Spoken Dialogue Systems for Ambient Environments: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS-2010, Vol. 6392 of Lecture Notes in Computer Science. Springer, pp. 25–35.
Herbig, T., Gerl, F., Minker, W., 2010d. Speaker tracking in an unsupervised speech controlled system. In: INTERSPEECH-2010, pp. 2666–2669.
Herbig, T., Gerl, F., Minker, W., 2010e. Evaluation of two approaches for speaker specific speech recognition. In: Lee, G.G., Mariani, J., Minker, W., Nakamura, S. (Eds.), Spoken Dialogue Systems for Ambient Environments: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS-2010, Vol. 6392 of Lecture Notes in Computer Science. Springer, pp. 36–47.
Herbig, T., Gerl, F., Minker, W., 2011a. Self-Learning Speaker Identification: A System for Enhanced Speech Recognition. Springer, Heidelberg.
Herbig, T., Gerl, F., Minker, W., Haeb-Umbach, R., 2011b. Adaptive systems for unsupervised speaker tracking and speech recognition. Evolving Systems 2 (3), 199–214.
Iskra, D., Grosskopf, B., Marasek, K., van den Heuvel, H., Diehl, F., Kiessling, A., 2002. SPEECON – speech databases for consumer devices: database specification and validation. In: Proceedings of the Third International Conference on Language Resources and Evaluation, LREC-2002, pp. 329–333.
Jolliffe, I.T., 2002. Principal Component Analysis, 2nd ed. Springer Series in Statistics. Springer, Heidelberg.
Junqua, J.-C., 1996. The influence of acoustics on speech production: a noise-induced stress phenomenon known as the Lombard reflex. Speech Communication 20 (1–2), 13–22.
Junqua, J.-C., 2000. Robust Speech Recognition in Embedded Systems and PC Applications. Kluwer Academic Publishers, Dordrecht.
Kuhn, R., Junqua, J.-C., Nguyen, P., Niedzielski, N., 2000. Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing 8 (6), 695–707.
Leggetter, C.J., Woodland, P.C., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 9, 171–185.
Manning, C.D., Schütze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.
Markov, K., Nakagawa, S., 1996. Frame level likelihood normalization for text-independent speaker identification using Gaussian mixture models. In: International Conference on Spoken Language Processing, ICSLP-1996, Vol. 3, pp. 1764–1767.
Nguyen, P., Wellekens, C., Junqua, J.-C., 1999. Maximum likelihood eigenspace and MLLR for speech recognition in noisy environments. In: EUROSPEECH-1999, pp. 2519–2522.
Rabiner, L., Juang, B.-H., 1993. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs.
Reynolds, D.A., Rose, R.C., 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3 (1), 72–83.
Reynolds, D.A., Quatieri, T.F., Dunn, R.B., 2000. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10 (1–3), 19–41.


Reynolds, D.A., 1995. Large population speaker identification using clean and telephone speech. IEEE Signal Processing Letters 2 (3), 46–48.
Schukat-Talamazzini, E.G., 1995. Automatische Spracherkennung. Vieweg, Braunschweig (in German).
Stern, R.M., Lasry, M.J., 1987. Dynamic speaker adaptation for feature-based isolated word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing 35 (6), 751–763.
Yin, S.-C., Rose, R., Kenny, P., 2008. Adaptive score normalization for progressive model adaptation in text independent speaker verification. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP-2008, pp. 4857–4860.
Zavaliagkos, G., Schwartz, R., McDonough, J., Makhoul, J., 1995. Adaptation algorithms for large scale HMM recognizers. In: EUROSPEECH-1995, pp. 1131–1135.