Robust speaker verification with state duration modeling


Speech Communication 38 (2002) 77–88 www.elsevier.com/locate/specom

Nestor Becerra Yoma a,*, Tarciano Facco Pegoraro b

a Electrical Engineering Department, University of Chile, Av. Tupper 2007, P.O. Box 412-3, Santiago, Chile
b Ericsson do Brasil, Rodovia Ermênio de Oliveira Penteado km 55,5, Idaiatuba, SP, Brazil

Received 5 July 2000; received in revised form 30 March 2001; accepted 7 May 2001

Abstract

This paper addresses the problem of state duration modeling in the Viterbi algorithm in a text-dependent speaker verification task. The results presented in this paper suggest that temporal constraints can lead to reductions of 10% and 20% in the error rates with signals corrupted by noise at SNR equal to 6 and 0 dB, respectively, and that the accurate statistical modeling of state duration (e.g. with gamma probability distribution) does not seem to be very relevant if maximal and minimal state duration restrictions are imposed. In contrast, temporal restrictions do not seem to give any improvement in a speaker verification task with clean speech or high SNR. It is also shown that state duration constraints can easily be applied with the likelihood normalization metrics based on speaker-dependent temporal parameters. Finally, the results presented here show that word position-dependent state duration parameters give no significant improvement when compared with the word position-independent approach if the coarticulation effect between contiguous words is low. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Speaker verification; Noise robustness; Temporal restrictions; HMM

1. Introduction

The constant transition probability in the ordinary HMM topologies used in speech recognition and in speaker verification leads to a geometric probability density for state duration, which is not accurate. Consequently, bounding or modeling state durations seems an interesting approach to reduce the error rate, especially when the speech signal is corrupted by noise. In a previous paper (Yoma et al., 2001), where state duration modeling was tested in speech recognition in noise, it is concluded that in isolated or connected word speaker-dependent tasks temporal constraints can lead to high reductions in the error rate with signals corrupted by additive or convolutional noise, and that the accurate statistical modeling of state duration (e.g. with gamma probability distribution) does not seem to be very relevant if maximal (max) and minimal (min) state duration restrictions are imposed. The introduction of max and min duration to states also gives better results than the metric proposed in (Burshtein, 1996). This metric, as shown in (Yoma et al., 2001), neither corresponds to a transition probability for an arbitrary duration probability distribution, nor includes max and min duration to states. Moreover, connected word recognition experiments showed that word position-dependent (WPD) temporal constraints are much more effective than the ordinary word position-independent (WPI) temporal restrictions with the same computational load. However, state duration modeling led to less improvement in a speaker-independent connected word task, which suggests that the introduction of temporal constraints in the Viterbi algorithm could be more useful when the state duration parameters are trained and employed on a speaker-dependent basis, although the HMMs could still be speaker independent, or in speaker verification systems. However, in (Forsyth, 1995) state duration information was combined with spectral-based parameters in a speaker verification system with clean signals but no significant improvement was observed.

In this paper, the temporal constraints introduced in (Yoma et al., 2001) were applied to a text-dependent speaker verification task with clean and noisy signals. Experiments with clean signals showed that temporal restrictions did not lead to any improvement, which is consistent with (Forsyth, 1995). However, with speech signals corrupted by additive noise, state duration modeling led to reductions of 10% and 20% in the error rate at SNR equal to 6 and 0 dB, respectively. The results presented here also suggest that the lower the SNR, the higher the improvement. The temporal restriction parameters (mean duration, variance, and min and max durations) were computed with the training database for every state in each model by means of estimating the optimal state sequence for every training utterance using the Viterbi algorithm after the HMMs had been trained.

In the context of text-dependent speaker verification, the contributions of this paper concern:
• study of the applicability of state duration modeling;
• introduction of temporal restrictions based on a penalization procedure;
• comparison of the truncated gamma and geometric probability distributions;
• comparison of WPD with WPI state duration modeling; and
• discussion on the applicability of state duration restrictions in Lombard speech.

Surprisingly, the use of temporal constraints has not been widely addressed in the literature and this paper represents a contribution toward robust speaker verification.

* Corresponding author. Tel.: +56-2-678-4205; fax: +56-2695-3881. E-mail address: [email protected] (N.B. Yoma).

0167-6393/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-6393(01)00044-9

2. The speaker verification system

2.1. The database

The Viterbi algorithm with temporal constraints was tested on a text-dependent speaker verification system using the YOHO database. The YOHO Speaker Verification Corpus (Linguistic Data Consortium) supports development, training and testing of speaker verification systems that use limited vocabulary, free-text input. The vocabulary is composed of two-digit numbers spoken continuously in sets of three (e.g. "62–31–53" or "sixty-two thirty-one fifty-three"). The database is divided into "enrollment" and "verification" segments; each segment contains data from all 138 speakers (108 males and 30 females). There are four enrollment sessions per speaker and each session contains 24 utterances. Each verification segment contains 10 sessions and each session contains four utterances per speaker.

2.2. The HMM representation

Each two-digit number can be decomposed into two words: a decade ("20", "30", "40", "50", "60", "70", "80" or "90") and a digit ("1", "2", "3", "4", "5", "6", "7" or "9"). As a consequence, the vocabulary is composed of 16 words, and each word is represented by a left-to-right HMM containing eight emitting states (Fig. 1), with a single multivariate Gaussian density per state and a diagonal covariance matrix. The false-acceptance and false-rejection curves (needed to compute the equal error rate, EER) were estimated with 97 speakers (77 males and 20 females) and the global HMMs (Carey and Parris, 1992), used in the likelihood normalization (Furui, 1997), were trained with 41 speakers (31 males and 10 females).


Fig. 1. Eight-state left-to-right HMM without skip-state transition.

All the HMMs were trained with the Baum–Welch algorithm. The speaker-dependent HMMs (16 per speaker) were estimated using 96 utterances (4 enrollment sessions and 24 utterances per session), and the 16 global HMMs (one per vocabulary word) were trained with 3936 utterances (41 speakers and 96 training utterances per speaker). The false-rejection errors were estimated using the 40 verification utterances (10 sessions and four utterances per session) per speaker. The false-acceptance curve for a given speaker was computed with only one utterance per impostor. Each utterance (O) was processed with the Viterbi algorithm in order to estimate the normalized log likelihood ($\log L(O)$):

$$\log L(O) = \log P(O/\lambda_i) - \log P(O/\lambda_g), \qquad (1)$$

where $P(O/\lambda_i)$ is the likelihood related to the speaker i, and $P(O/\lambda_g)$ is the likelihood related to the global HMMs. Both models, $\lambda_i$ and $\lambda_g$, correspond to the sequence of word HMMs that compose the testing sequence O (text-dependent speaker verification). In order to estimate the false-rejection and false-acceptance error curves, the normalized log likelihood $\log L(O)$ was divided by the number of frames (N) in the verification utterance,

$$\log L(O)' = \frac{\log L(O)}{N}. \qquad (2)$$
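The scoring in Eqs. (1) and (2) can be sketched in a few lines. This is only an illustration: the function names, the decision threshold and the numeric values are hypothetical and do not come from the paper, and the two log likelihoods are assumed to be the Viterbi scores of the same utterance under the speaker and global models.

```python
def normalized_score(loglik_speaker, loglik_global, n_frames):
    """Per-frame normalized log likelihood, Eq. (1) followed by Eq. (2)."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    log_l = loglik_speaker - loglik_global  # Eq. (1)
    return log_l / n_frames                 # Eq. (2)


def accept(score, threshold):
    """Accept the claimed identity when the score reaches the threshold."""
    return score >= threshold


# Hypothetical Viterbi log likelihoods for a 300-frame utterance.
score = normalized_score(-5200.0, -5350.0, 300)
decision = accept(score, threshold=0.2)  # threshold value is an assumption
```

Dividing by the number of frames makes scores from utterances of different lengths comparable against a single threshold.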

3. Temporal restrictions in the Viterbi alignment

Given the topology shown in Fig. 1, in (Yoma et al., 2001) the temporal restrictions were included in the Viterbi algorithm by means of replacing the ordinary transition probabilities with

$$a^{\tau}_{i,i} = \begin{cases} 1 & \text{if } \tau < \mathrm{tmin}_i, \\ 0 & \text{if } \tau \ge \mathrm{tmax}_i, \\ \dfrac{D_i(\tau) - d_i(\tau)}{D_i(\tau)} & \text{if } \mathrm{tmin}_i \le \tau < \mathrm{tmax}_i, \end{cases} \qquad (3)$$

$$a^{\tau}_{i,i+1} = \begin{cases} 0 & \text{if } \tau < \mathrm{tmin}_i, \\ 1 & \text{if } \tau \ge \mathrm{tmax}_i, \\ \dfrac{d_i(\tau)}{D_i(\tau)} & \text{if } \mathrm{tmin}_i \le \tau < \mathrm{tmax}_i, \end{cases} \qquad (4)$$

where $\tau$ is the number of frames in state i up to time t; $\mathrm{tmin}_i = \mathrm{tol\_min} \cdot \min_i$ and $\mathrm{tmax}_i = \mathrm{tol\_max} \cdot \max_i$; $\min_i$ and $\max_i$ are the observed min and max durations, respectively; the constants tol_min and tol_max introduce a tolerance to the min and max state duration for every state; $d_i(\tau)$ is the probability of state duration equal to $\tau$; and $D_i(\tau) = \sum_{t=\tau}^{\infty} d_i(t)$. Eqs. (3) and (4) correspond to the truncated conditional probability

$$a^{\tau}_{i,j} = \mathrm{Prob}(s_{t+1} = j \mid s_t = s_{t-1} = \cdots = s_{t-\tau+1} = i),$$

where $j = i$ or $j = i + 1$ according to the topology shown in Fig. 1.

3.1. Geometric and gamma probability functions

If the geometric distribution is used, $\{D_i(\tau) - d_i(\tau)\}/D_i(\tau)$ in (3) coincides with $a_{i,i}$, and $d_i(\tau)/D_i(\tau)$ in (4) with $a_{i,i+1}$, where $a_{i,i}$ and $a_{i,i+1}$ are the ordinary transition probabilities estimated during the HMM training algorithm. However, the gamma function better fits the empirical state duration distributions (Burshtein, 1996) and $d_i(\tau)$ can also be modeled as a gamma probability function


whose parameters are estimated with the mean ($E_i(\tau)$) and the variance ($\mathrm{Var}_i(\tau)$) of state durations. The state duration parameters ($E_i(\tau)$, $\mathrm{Var}_i(\tau)$, $\max_i$ and $\min_i$) are computed for every state in each model by means of estimating the optimal state sequence for every training utterance using the Viterbi algorithm after the HMMs have been trained. In this paper, the geometric and gamma probability functions are compared from the speaker verification point of view. The discrete gamma distribution is given by

$$d_i(\tau) = K \cdot e^{-\alpha \tau} \cdot \tau^{p-1}, \qquad (5)$$

where $\tau = 0, 1, 2, \ldots$ is the duration of a given state i in number of frames, $\alpha > 0$, $p > 0$ and K is a normalizing term. The parameters $\alpha$ and p were estimated by

$$\alpha_i = \frac{E_i(\tau)}{\mathrm{Var}_i(\tau)}, \qquad p_i = \frac{E_i^2(\tau)}{\mathrm{Var}_i(\tau)}.$$
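As an illustration of Eq. (5) and the moment-based parameter estimates, the sketch below builds a normalized discrete gamma duration table from a mean and a variance. Several details are assumptions rather than the authors' implementation: the value of the variance floor, the truncation of the support at max_tau frames, and assigning zero probability to a duration of zero frames (which avoids evaluating the power term at zero when p < 1).

```python
import math


def gamma_params(mean, var, var_floor=1e-3):
    """Moment estimates: alpha = E / Var and p = E^2 / Var."""
    var = max(var, var_floor)  # floor value is an assumption
    return mean / var, mean * mean / var


def discrete_gamma_pmf(alpha, p, max_tau=200):
    """Return d[tau] for tau = 0..max_tau, normalized to sum to 1.

    Weights are computed in the log domain so that large p (very low
    variance) does not overflow. d[0] is set to 0 by assumption.
    """
    log_w = [-alpha * tau + (p - 1.0) * math.log(tau)
             for tau in range(1, max_tau + 1)]
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)  # plays the role of 1/K in Eq. (5)
    return [0.0] + [w / total for w in weights]


# Hypothetical state with mean duration 10 frames and variance 4.
alpha, p = gamma_params(10.0, 4.0)
pmf = discrete_gamma_pmf(alpha, p)
mode = max(range(len(pmf)), key=pmf.__getitem__)
```

The tail sums $D_i(\tau)$ needed by the transition probabilities can then be taken directly from this table.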

The discrete geometric distribution, provided by the ordinary HMM topology, is given by

$$d_i(\tau) = (a_{i,i})^{\tau} \cdot a_{i,i+1}.$$

3.2. Word position-independent and word position-dependent temporal constraints

Two sets of state duration parameters ($E_i(\tau)$, $\mathrm{Var}_i(\tau)$, $\max_i$ and $\min_i$) are tested in the speaker verification system: WPI and WPD (Yoma et al., 2001). In the WPI model the state duration parameters were estimated independently of the word's position in the string. In contrast, the WPD model was composed of two subsets of state duration parameters for each word HMM: string-initial or preceded by other words for decades ("twenty", "thirty", etc.); string-final or followed by other words for digits ("one", "two", etc.). For example, in the sequence "sixty-two forty-one sixty-three" the word "sixty" can appear at the beginning of the string or be preceded by another word, and these different contexts are modeled with two sets of state duration parameters. It is worth mentioning that WPD temporal constraints better model the coarticulation effect on state duration, which can be especially interesting in word

modeling where each HMM attempts to capture the coarticulation effect between contiguous words independently of the context.

3.3. False-rejection error rate and truncated transition probabilities

In general, the improvement due to the introduction of temporal restrictions in the Viterbi alignment comes from the truncation of the transition probability and from the statistical modeling of state duration. Some preliminary experiments showed that loose temporal restrictions (i.e. a high tol_max and a low tol_min) give no improvement in the error rate with clean and noisy signals. In contrast, tight temporal constraints (i.e. tol_max and tol_min equal to 1) could give some improvement with noisy signals, but could also increase the error rate with clean speech. This could be a result of the low number of enrollment utterances, which causes the parameters $\max_i$ and $\min_i$ to be poorly estimated. As a consequence, tight temporal constraints help to improve the noise robustness of the Viterbi algorithm, but may introduce a distortion in clean signals when a client is rejected because the optimal alignment presents at least one state whose duration does not satisfy the restrictions imposed by $\max_i$ and $\min_i$. A solution to this problem would be to set a floor to the transition probabilities in order to penalize those frames where the optimal Viterbi alignment gives a state duration longer than $\mathrm{tmax}_i$ or shorter than $\mathrm{tmin}_i$. According to (3) and (4), a state whose duration is one frame shorter or longer than the min or max bounds, respectively, is enough to reject a client due to the fact that $\log(a^{\tau}_{i,j}) = -\infty$ in the log likelihood domain. This may result from the highly variable silence intervals between two words in the task considered here. The extreme (first and last) states usually present higher duration variance than intermediate states.
In other words, the highest coarticulation effect in duration takes place in the first and last states, and the speaker-dependent duration variance is generally low for internal states (see Fig. 2). If the variance is low, state durations are likely to be concentrated around the mean, and the max and min should be reliably estimated even if only a few samples are available. However, in the YOHO


Fig. 2. Speaker-dependent state duration variance in model ‘‘seven’’. These results were observed with the TIDIGITS database (Linguistic Data Consortium).

database the silence intervals between two words are highly variable, which, added to the fact that the highest coarticulation effect in duration is mainly in the extreme states, means that the restrictions imposed by max and min are easily violated. In order to counteract these limitations the conditional transition probabilities (3) and (4) were modified to

$$a^{\tau}_{i,i} = \begin{cases} 1 & \text{if } \tau < \mathrm{tmin}_i, \\ \min(a_{i,j}) & \text{if } \tau \ge \mathrm{tmax}_i, \\ \dfrac{D_i(\tau) - d_i(\tau)}{D_i(\tau)} & \text{if } \mathrm{tmin}_i \le \tau < \mathrm{tmax}_i, \end{cases} \qquad (6)$$

$$a^{\tau}_{i,i+1} = \begin{cases} \min(a_{i,j}) & \text{if } \tau < \mathrm{tmin}_i, \\ 1 & \text{if } \tau \ge \mathrm{tmax}_i, \\ \dfrac{d_i(\tau)}{D_i(\tau)} & \text{if } \mathrm{tmin}_i \le \tau < \mathrm{tmax}_i, \end{cases} \qquad (7)$$

where $\min(a_{i,j})$ is a threshold that was empirically estimated and depends on the percentage of frames that are allowed not to comply with the min and max state durations. It is worth mentioning that the threshold $\min(a_{i,j})$ alleviates the dependence of the method on tol_max and tol_min, which in turn is interesting from the practical point of view.

3.4. Accuracy and stability of the parameters tol_max and tol_min

As mentioned in Section 3.3, intermediate states generally present a speaker-dependent duration distribution concentrated around the mean, and the parameters tol_max and tol_min are employed as a tolerance to the min and max state duration for every state. In order to estimate the accuracy of tol_max and tol_min, the probabilities of duration $\tau$ being longer than $\mathrm{tmax}_i$ (MaxDurErrorProb) and shorter than $\mathrm{tmin}_i$ (MinDurErrorProb), where $\mathrm{tmax}_i$ and $\mathrm{tmin}_i$ are defined above, were numerically estimated with the gamma distribution for all the intermediate states for each speaker. Then, the averages of MaxDurErrorProb and MinDurErrorProb were computed across all the speakers. With tol_max = 1.5 and tol_min = 0.8, the configuration used in the experiments reported here, MaxDurErrorProb and MinDurErrorProb gave less than 1%, which suggests that these parameters provide an accurate model for state duration distribution. On the other hand, the parameters tol_max, tol_min and $\min(a_{i,j})$ were empirically estimated and a wide range of values was observed in which the error rates did not present a high variation. As mentioned in Section 3.3, the threshold $\min(a_{i,j})$ alleviates the dependence of the method on tol_max and tol_min, which makes these parameters more stable and the modeling more robust. Section 5.4 presents further results on the stability of the parameters tol_max and tol_min.
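The case structure of Eqs. (6) and (7) can be illustrated with a small function that returns log transition probabilities for a state that has already been occupied for tau frames. This is a sketch under stated assumptions, not the authors' code: the helper names are hypothetical, natural logarithms are assumed, the duration table is truncated at a finite length, and the example bounds (observed min 5 and max 8 frames) are invented. Setting the floor to minus infinity recovers the hard truncation of Eqs. (3) and (4).

```python
import math


def safe_log(x):
    """Natural log that maps 0 to -inf instead of raising."""
    return math.log(x) if x > 0.0 else -math.inf


def geometric_pmf(a_ii, max_tau=400):
    """d(tau) = a_ii**tau * (1 - a_ii), truncated at max_tau frames."""
    return [a_ii ** tau * (1.0 - a_ii) for tau in range(max_tau + 1)]


def log_transitions(tau, d, t_min, t_max, log_floor=-10.0):
    """Return (log a_ii, log a_i,i+1) after tau frames in state i.

    d[t] is the probability of a duration of t frames. log_floor plays
    the role of log{min(a_ij)}; its base is an assumption here.
    """
    if tau < t_min:
        return 0.0, log_floor  # staying is free, leaving is penalized
    if tau >= t_max:
        return log_floor, 0.0  # leaving is free, staying is penalized
    tail = sum(d[tau:])        # D(tau) = sum of d(t) for t >= tau
    if tail <= 0.0:
        return log_floor, 0.0  # no probability mass left: force an exit
    return safe_log((tail - d[tau]) / tail), safe_log(d[tau] / tail)


d = geometric_pmf(0.8)
t_min, t_max = 0.8 * 5, 1.5 * 8  # tol_min * min_i and tol_max * max_i
stay, leave = log_transitions(5, d, t_min, t_max)
```

Inside the bounds the geometric table gives back the ordinary transition probabilities (here 0.8 and 0.2), which matches the remark in Section 3.1; outside the bounds a violation costs a finite penalty per frame instead of rejecting the whole alignment.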

4. State duration modeling and the Lombard effect It has been shown that the speaker style usually changes in severe noisy environments. This effect, denominated Lombard effect (Junqua, 1993), has generally been studied using databases where the speech was produced by having speakers listen to noise (generally at 85 dB SPL) while uttering tokens (Junqua, 1993, 1999; Hansen, 1996; Mokbel and Chollet, 1995). The Lombard effect seems to be ‘‘governed by the desire to achieve successful intelligible communication’’ (Junqua, 1999). In other words, the speaker changes the characteristics of his/her voice to communicate better with others and not to enter speech commands to computers. However, under severe noisy conditions the speaker may also modify his voice when interacting with a speech recognizer and/or speaker verification systems. The differences between neutral and Lombard speech have been documented in (Hansen, 1996; Junqua, 1993). Five separate perturbation models can be employed to describe the Lombard effect (Bou-Ghazale and Hansen, 1998):


(a) voiced duration variation; (b) pitch contour perturbation; (c) derivative of pitch contour perturbation; (d) explicit state occupancy for pitch–perturbation HMM; and (e) average spectral mismatch. As far as phoneme duration is concerned, vowels are longer and consonants are slightly shorter in Lombard speech when compared with neutral voice (Junqua, 1993). For instance, vowels present an increase of 10 or 20% in the duration according to Hansen (1996) and Junqua (1999). A practical solution to deal with Lombard speech could be to employ a max that is 20% or 30%, which is probably the highest phoneme duration variation in Lombard speech, higher than the one estimated with neutral voice. However, a text-dependent speaker verification database with speech under simulated and/or actual stress does not exist, and the elaboration of such corpus is out of the scope of this paper. It is interesting to mention that the same hypotheses related to the inaccuracy of the geometric probability density for state duration in the ordinary HMM topology are applicable to both neutral and stressed speech, and the method here proposed should also be useful to help to compensate the Lombard effect in speaker verification. Actually, temporal restrictions in the Viterbi alignment require the adaptation of just a few parameters and could be used in combination with techniques in the spectral domain (Mokbel and Chollet, 1995; Bou-Ghazale and Hansen, 1998) to deal with Lombard speech. When compared with speech produced in quiet, in Lombard voice the greatest increase in energy takes place in the higher frequencies, which in turn results in an increase of the spectral center of gravity (Junqua, 1993, 1999). This effect could be modeled as a spectral tilt that depends on the phoneme, speaker and noise (Bou-Ghazale and Hansen, 1998). 
However, in order to approximately evaluate the effectiveness of the method proposed here, the spectral distortion in Lombard speech was modeled as an average tilt (Van Summers, 1988; Chen, 1988). The problem of stressed speech in heavy noise environments is not the main focus of this paper, but the authors believe that the results shown in Section 5.5 represent clear evidence that the transition probabilities according to (6) and (7) can also be applicable if the Lombard effect is significant. In fact, as discussed later, the reduction in the error rate due to state duration modeling was slightly higher with the spectral tilt than without it. It is worth mentioning that the spectral mismatch in the Lombard effect also includes modifications in the formant frequencies (Van Summers, 1988; Junqua, 1993). Nevertheless, the fact that the temporal restrictions were effective without any knowledge about the spectral tilt suggests that state duration modeling should also lead to reductions in the error rate when more complex and more intense spectral mismatches are observed.

5. Experiments

The proposed methods were tested with the text-dependent speaker verification system explained in Section 2. The signals were divided into 25 ms frames with 10 ms overlap, each frame was processed with a Hamming window before the DFT spectral estimation, and spectral subtraction (SS) according to (Vaseghi and Milner, 1997) was applied. The band from 300 to 3400 Hz was covered with 20 Mel DFT filters, the log of the energy was estimated, and 12 static cepstral coefficients and their time derivatives were computed. Besides the cepstral and Δ-cepstral parameters, the frame log energy ($\log E$) and its time derivative ($\Delta \log E$) were also estimated. Each word was modeled with an 8-state left-to-right topology (see Fig. 1) without skip-state transition, with a single multivariate Gaussian density per state and a diagonal covariance matrix. The HMMs were estimated by means of the clean signal utterances using the Baum–Welch algorithm. The state duration parameters were estimated by means of Viterbi alignment of the enrollment utterances after the HMMs had been trained. In some cases it was observed that the variance of state duration was equal to 0, and a threshold was introduced to set a floor for $\mathrm{Var}_i(\tau)$. For each client (97 speakers), speaker-dependent temporal parameters were computed. In contrast, speaker-independent state duration parameters were estimated with all the 41 impostors employed to train the global HMMs (speaker-independent models). The verification clean utterances were used to create the noisy database by adding car noise from the Noisex database (Varga et al., 1992) at four global-SNR levels: +18, +12, +6 and 0 dB. The global SNR was defined as in (Ghitza, 1987).

In order to test the validity of the state duration modeling from the text-dependent speaker verification point of view, the following configurations were tested: the ordinary Viterbi algorithm, Vit; the Viterbi algorithm with state duration distribution with gamma function without max and min state duration (according to (3) and (4) but with a high tol_max and tol_min equal to 0), Vit-WPD-Gamma and Vit-WPI-Gamma; and finally, the penalization procedure according to (6) and (7), with tol_max = 1.5, tol_min = 0.8 and $\log\{\min(a_{i,j})\} = -10$, in combination with the ordinary geometric distribution (Vit-WPD-P-Geom) and with the gamma function (Vit-WPD-P-Gamma). The methods covered here are compared using a posteriori equal error rates (EERs): $EER_{SS}$, using speaker-specific thresholds; and $EER_{SI}$, with a speaker-independent threshold.

5.1. Likelihood normalization and state duration modeling

According to (1) the normalized log likelihood $\log L(O)$ is composed of two components: $\log P(O/\lambda_i)$ and $\log P(O/\lambda_g)$. As mentioned above, $\lambda_i$ and $\lambda_g$ correspond to the speaker i and global HMMs, respectively. The approximation of $\log P(O/\lambda_i)$ was made using speaker-dependent state duration parameters estimated with speaker-dependent HMMs but, in order to evaluate the state duration modeling in the Viterbi algorithm to compute $\log P(O/\lambda_g)$, four contexts were considered:
(a) The state duration parameters were estimated with the speaker-independent (global) models with all the 41 speakers used to train the global HMMs.
(b) The state duration parameters were computed on a speaker-dependent basis with each client's enrollment utterances using speaker-independent HMMs (global models).
(c) The temporal parameters were estimated on a speaker-dependent basis with speaker-dependent HMMs (i.e. $\log P(O/\lambda_i)$ and $\log P(O/\lambda_g)$ were computed with the same temporal restrictions).
(d) No temporal constraints were used to estimate $\log P(O/\lambda_g)$.

The previous configurations were tested with clean speech using the Viterbi algorithm with state duration distribution according to Vit-WPD-Gamma. Results are shown in Table 1. According to Table 1, the highest reductions in the error rate were achieved when $\log P(O/\lambda_i)$ and $\log P(O/\lambda_g)$ were estimated with the same speaker-dependent temporal parameters computed with speaker-dependent HMMs. This is an interesting result due to the fact that the state duration parameters need to be estimated only once and then can also be employed in combination with speaker-independent or global HMMs.
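The state duration parameters used throughout this section (mean, variance, min and max for every state) come from Viterbi alignments of the enrollment utterances. A minimal sketch of that bookkeeping follows; it assumes each alignment is given as a list with one state label per frame, that each state is visited at most once per alignment (left-to-right topology without skips), and that the population variance with an arbitrary floor value is used. None of these details are specified in the paper.

```python
from itertools import groupby


def state_durations(alignment):
    """Run lengths of consecutive identical state labels.

    Example: [0, 0, 1, 1, 1] gives [(0, 2), (1, 3)].
    """
    return [(state, sum(1 for _ in run)) for state, run in groupby(alignment)]


def duration_stats(alignments, var_floor=0.25):
    """Collect mean, floored variance, min and max duration per state."""
    per_state = {}
    for alignment in alignments:
        for state, duration in state_durations(alignment):
            per_state.setdefault(state, []).append(duration)
    stats = {}
    for state, durations in per_state.items():
        mean = sum(durations) / len(durations)
        var = sum((x - mean) ** 2 for x in durations) / len(durations)
        stats[state] = {
            "mean": mean,
            "var": max(var, var_floor),  # floor value is an assumption
            "min": min(durations),
            "max": max(durations),
        }
    return stats


# Two hypothetical frame-level alignments of the same three-state model.
stats = duration_stats([[0, 0, 1, 1, 1, 2], [0, 0, 0, 1, 2, 2]])
```

For word position-dependent parameters, the state label could be extended with the position context so that each context accumulates its own statistics.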

Table 1
Equal error rate [speaker-independent ($EER_{SI}$)] according to the state duration parameters used to approximate $\log P(O/\lambda_g)$ as indicated in Section 5.1

State duration parameters                                                          EER_SI (%)
Speaker-independent temporal constraints                                           1.50
Speaker-dependent temporal constraints (estimated with global HMMs)                1.31
Speaker-dependent temporal constraints (estimated with speaker-dependent HMMs)     1.24
No temporal constraints                                                            1.60

The experiments were done with clean speech and the estimation of $\log P(O/\lambda_i)$ was made using speaker-dependent state duration parameters computed with speaker-dependent HMMs. The duration modeling was applied according to Vit-WPD-Gamma.
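To make the comparison in Table 1 concrete, the relative change in $EER_{SI}$ with respect to context (d), no temporal constraints for the global models, can be computed as below. The helper name is hypothetical; the figures are those of Table 1.

```python
def relative_reduction(reference, value):
    """Relative reduction of value with respect to reference, in percent."""
    return 100.0 * (reference - value) / reference


no_constraints = 1.60  # context (d) in Table 1
table1 = {
    "speaker-independent": 1.50,
    "speaker-dependent (global HMMs)": 1.31,
    "speaker-dependent (speaker-dependent HMMs)": 1.24,
}
for name, eer in table1.items():
    print(f"{name}: {relative_reduction(no_constraints, eer):.1f}% lower")
```

Using the same speaker-dependent temporal parameters for both likelihoods, context (c), corresponds to a relative reduction of about 22.5% with respect to context (d).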


Table 2
$EER_{SI}$ (speaker-independent EER, in %) for WPD and WPI state duration parameters with the gamma function modeling: Vit-WPD-Gamma and Vit-WPI-Gamma

SNR (dB)        Baseline system    WPD (Vit-WPD-Gamma) (SS)    WPI (Vit-WPI-Gamma) (SS)
0               32.47              22.70                       22.99
6               13.87              8.71                        8.72
12              4.57               3.65                        3.44
18              2.0                1.92                        2.09
Clean speech    0.96               1.24                        1.23

The experiments were done with clean speech and signals corrupted with additive noise (car noise). In the experiments with noisy speech, spectral subtraction was applied when indicated.
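The equal error rates reported in the tables are the operating points where the false-acceptance and false-rejection rates coincide. A simple way to estimate such a point from two lists of scores is sketched below. This is an illustration, not the authors' procedure: the threshold sweep over observed scores, the tie handling and the example scores are assumptions. Pooling the scores of all speakers would correspond to a speaker-independent threshold, whereas applying the function per speaker gives speaker-specific thresholds; how the per-speaker values are aggregated is not stated in the text.

```python
def equal_error_rate(client_scores, impostor_scores):
    """Return (EER, threshold) at the point where FAR and FRR are closest.

    A claim is accepted when its score is >= threshold. Candidate
    thresholds are the observed scores themselves.
    """
    best = None
    for th in sorted(set(client_scores) | set(impostor_scores)):
        frr = sum(s < th for s in client_scores) / len(client_scores)
        far = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, th)
    return best[1], best[2]


# Hypothetical normalized scores, Eq. (2), for clients and impostors.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.2], [0.1, 0.3, 0.4, 0.75])
```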

5.2. Word position-dependent (WPD) and word position-independent (WPI) state duration parameters

The WPD and WPI state duration parameters were compared using the gamma function without max and min state duration: Vit-WPD-Gamma and Vit-WPI-Gamma. The estimation of $\log P(O/\lambda_i)$ and $\log P(O/\lambda_g)$ was made with the same speaker-dependent temporal parameters computed with speaker-dependent HMMs according to option (c) in Section 5.1. Results are presented in Table 2, where the WPD and WPI approaches were tested with clean and noisy speech. In the experiments with noisy speech, spectral subtraction (SS) was applied. As can be seen in Table 2, the WPD temporal modeling did not give any significant improvement when compared with the WPI approach. This is probably due to the short silence intervals that generally appear between contiguous two-digit words, which in turn result in a low coarticulation effect. Nevertheless, the WPD state duration modeling should still be useful in those tasks where the coarticulation effect is higher (e.g. digit strings and other long utterances without silence intervals).

5.3. Comparison of state duration modeling methods In order to compare the temporal constraints, four configurations were tested: Vit-WPD-Gamma; Vit-WPD-P-Gamma; Vit-WPD-P-Geom; and finally, Vit (the ordinary Viterbi algorithm). The experiments were performed with clean speech,

Table 3
Equal error rate [speaker-specific ($EER_{SS}$) and speaker-independent ($EER_{SI}$)] according to the temporal constraints employed in the Viterbi alignment in experiments with clean speech

Temporal restrictions    EER_SS (%)    EER_SI (%)
Vit-WPD-Gamma            0.63          1.24
Vit-WPD-P-Gamma          0.33          1.03
Vit-WPD-P-Geom           0.35          1.03
Vit                      0.36          0.96

The estimation of $\log P(O/\lambda_i)$ and $\log P(O/\lambda_g)$ was made with the same speaker-dependent temporal parameters (computed with speaker-dependent HMMs) according to option (c) in Section 5.1.

and speech corrupted by car and speech noises, and the estimation of $\log P(O/\lambda_i)$ and $\log P(O/\lambda_g)$ was made with the same speaker-dependent temporal parameters (computed with speaker-dependent HMMs) according to option (c) in Section 5.1. Results are shown in Tables 3 and 4. According to Table 3, experiments with clean speech suggested that the state duration distribution with gamma function without max and min state duration (Vit-WPD-Gamma) could introduce a slight increase in error rate when compared with the Viterbi algorithm without temporal restrictions (Vit). This result indicates that in a speaker-verification task the pure statistical modeling of state duration may not be appropriate, which in turn is consistent with the low variance (equal to 0 sometimes) in state duration observed in some cases. When the penalization procedure was used, the gamma (Vit-WPD-P-Gamma) and geometric (Vit-WPD-P-Geom) distributions gave almost the same results, which seems to be consistent with (Yoma


Table 4
Equal error rate [speaker-specific ($EER_{SS}$), in %] according to the temporal constraints employed in the Viterbi alignment in experiments with speech corrupted by additive noise

Noise     SNR (dB)    Baseline system    SS       SS (Vit-WPD-Gamma)    SS (Vit-WPD-P-Gamma)    SS (Vit-WPD-P-Geom)
Car       0           22.90              13.12    11.34                 10.20                   10.24
Car       6           6.34               3.70     3.76                  3.15                    3.11
Car       12          1.80               1.24     1.57                  1.12                    1.11
Car       18          0.68               0.59     0.74                  0.49                    0.49
Speech    0           23.58              13.87    12.30                 11.36                   11.17
Speech    6           6.77               3.93     3.76                  3.76                    3.72
Speech    12          1.88               1.23     1.41                  1.29                    1.28
Speech    18          0.68               0.55     0.84                  0.59                    0.59

The estimation of $\log P(O/\lambda_i)$ and $\log P(O/\lambda_g)$ was made with the same speaker-dependent temporal parameters (computed with speaker-dependent HMMs) according to option (c) in Section 5.1. SS indicates that spectral subtraction was applied.

et al., 2001). Moreover, there is no significant difference between the error rate given by Vit-WPD-P-Gamma/Vit-WPD-P-Geom and the ordinary Viterbi algorithm (Vit), showing that the temporal constraints according to the penalization procedure (6) and (7) do not introduce any distortion with clean speech signals. As can be seen in Table 4, experiments with noisy signals confirmed the results presented in Table 3: Vit-WPD-Gamma was able to improve the results only at SNR equal to 0 dB and introduced some distortion at SNR equal to 18, 12 and 6 dB; the penalization procedure with the gamma function (Vit-WPD-P-Gamma) and the geometric distribution (Vit-WPD-P-Geom) did not introduce any significant error at higher SNR and showed reductions of 20% and 10% in the error rate at SNR equal to 0 and 6 dB, respectively; finally, no significant differences were observed in the error rates provided by Vit-WPD-P-Gamma and Vit-WPD-P-Geom.

5.4. Stability of the parameters tol_max and tol_min

Figs. 3 and 4 present results related to the curves $EER_{SS}$ versus tol_max and tol_min, respectively. The configuration Vit-WPD-P-Geom, which corresponds to the transition probability as defined in (6) and (7) with the geometric distribution, was employed in experiments with speech signal

corrupted by car noise at SNR = 6 dB. In Fig. 3 tol_min was made equal to 0.8, and in Fig. 4 tol_max was made equal to 1.5. As can be seen in Figs. 3 and 4, there are reasonably wide ranges of sub-optimal values defined by the intervals $1.5 \le \mathrm{tol\_max} \le 2.5$ and $0.5 \le \mathrm{tol\_min} \le 1.0$.

5.5. Experiments with spectral tilt and additive noise

The clean speech signals were processed with a 1.4 dB/oct high-pass filter before adding the additive noise (car) in order to model the average spectral tilt that takes place when the Lombard effect is observed (Van Summers, 1988; Chen, 1988). According to Table 5, temporal restrictions following (6) and (7) with the geometric distribution, Vit-WPD-P-Geom, led to reductions of 26% and 11% in $EER_{SS}$ at SNR = 0 and 6 dB, respectively. In fact, the spectral distortion in Lombard speech, which is also characterized by modifications in the formant frequencies (Van Summers, 1988; Junqua, 1993), is very difficult to model because it depends on the speaker, phoneme and noise. Nevertheless, the improvements shown in Table 5 suggest that temporal restrictions can lead to reductions in the error rate without any knowledge about the corrupting environment and about the effects that this environment may cause on speech production. It is interesting to notice that the reductions in the error rate presented in Table 5 are slightly higher


Fig. 3. Equal error rate [speaker-specific (EERSS)] according to tol max. The parameter tol min was made equal to 0.8. The configuration Vit-WPD-P-Geom, which corresponds to the transition probability as defined in (6) and (7) with the geometric distribution, was employed in experiments with speech signal corrupted by car noise at SNR = 6 dB. The estimation of log P(O/λi) and log P(O/λg) was made with the same speaker-dependent temporal parameters (computed with speaker-dependent HMMs) according to option (c) in Section 5.1. Spectral subtraction was applied.

Fig. 4. Equal error rate [speaker-specific (EERSS)] according to tol min. The parameter tol max was made equal to 1.5. The configuration Vit-WPD-P-Geom, which corresponds to the transition probability as defined in (6) and (7) with the geometric distribution, was employed in experiments with speech signal corrupted by car noise at SNR = 6 dB. The estimation of log P(O/λi) and log P(O/λg) was made with the same speaker-dependent temporal parameters (computed with speaker-dependent HMMs) according to option (c) in Section 5.1. Spectral subtraction was applied.
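The role of tol max and tol min can be illustrated with a minimal sketch. It assumes, purely for illustration, that the two tolerances scale the minimal and maximal state durations estimated in training (t_min and t_max, hypothetical names) and that transitions outside the scaled interval are forbidden. The exact penalization in (6) and (7) is not reproduced here, and the self-loop probability a_stay is likewise an assumed parameter. The default tolerances are the values used in Figs. 3 and 4.

```python
import math

NEG_INF = float("-inf")


def bounded_geometric_log_probs(tau, t_min, t_max, a_stay,
                                tol_min=0.8, tol_max=1.5):
    """Return (log P(stay), log P(leave)) after tau frames spent in a state.

    Hypothetical truncation rule, NOT the paper's equations (6)-(7):
    leaving is forbidden before tol_min * t_min frames, staying is
    forbidden from tol_max * t_max frames on, and in between the ordinary
    constant (geometric) transition probabilities are used.
    """
    lower = tol_min * t_min   # scaled minimal duration
    upper = tol_max * t_max   # scaled maximal duration
    if tau < lower:
        return 0.0, NEG_INF   # too short: the path must stay in the state
    if tau >= upper:
        return NEG_INF, 0.0   # too long: the path must leave the state
    return math.log(a_stay), math.log(1.0 - a_stay)
```

In a Viterbi recursion these two values would replace the constant log transition probabilities of a left-to-right HMM, with tau tracked along each surviving path.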

Table 5
Equal error rate [speaker-specific (EERSS)] with signals processed with a 1.4 dB/oct high-pass filter before adding the additive noise (car)

SNR (dB)   EERSS (%)
           Baseline system   SS      SS (Vit-WPD-P-Geom)
0          31.12             20.66   15.40
6          9.65              5.11    4.53
12         2.67              1.60    1.60
18         0.89              0.67    0.70

The estimation of log P(O/λi) and log P(O/λg) was made with the same speaker-dependent temporal parameters (computed with speaker-dependent HMMs) according to option (c) in Section 5.1. SS indicates that spectral subtraction was applied.
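The relative reductions quoted in the text can be checked directly from Table 5. The short sketch below assumes that the values in each column are listed in order of SNR equal to 0, 6, 12 and 18 dB, and the variable names are illustrative only.

```python
# EERSS (%) from Table 5, keyed by SNR in dB (assumed ordering: 0, 6, 12, 18).
ss = {0: 20.66, 6: 5.11, 12: 1.60, 18: 0.67}        # spectral subtraction only
ss_vit = {0: 15.40, 6: 4.53, 12: 1.60, 18: 0.70}    # SS with Vit-WPD-P-Geom


def relative_reduction(before, after):
    """Relative reduction in percent; negative values mean a higher error."""
    return 100.0 * (before - after) / before


reductions = {snr: relative_reduction(ss[snr], ss_vit[snr]) for snr in ss}

for snr in sorted(reductions):
    print(f"SNR = {snr:2d} dB: {reductions[snr]:6.2f} %")
```

With the rounded table entries this gives roughly 25% at 0 dB and 11% at 6 dB, close to the 26% and 11% quoted in the text, no change at 12 dB and a slight increase in EERSS at 18 dB.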

than the ones in Table 4, which in turn indicates that the higher the spectral mismatch, the higher the improvement due to state duration modeling in the Viterbi algorithm.

6. Conclusions

The results presented in this paper suggest that, in a text-dependent speaker verification task, temporal constraints based on the penalization method proposed here can lead to significant reductions in the error rate with signals corrupted by noise (SNR equal to 0 and 6 dB). Moreover, the accurate statistical modeling of state duration with the gamma probability distribution does not seem to be very relevant if maximal and minimal state duration restrictions are imposed, which in turn is consistent with (Yoma et al., 2001), where state duration restrictions were tested in speech recognition systems. It is also shown that state duration constraints can easily be applied with the likelihood normalization metrics using speaker-dependent temporal parameters. In contrast, as observed in (Forsyth, 1995), temporal restrictions do not seem to give any improvement in a speaker verification task with clean speech or high SNR, and introducing no distortion with clean signals (when compared with the ordinary Viterbi algorithm) could be one of the criteria to tune the methods covered here. This is consistent with (Ljolje and Levinson, 1991; Ljolje, 1994) in the sense that duration model effects could be much less significant when the acoustic model is trained using data that matches the test data, which in turn is the case in text-dependent speaker verification.

As far as WPD state duration modeling is concerned, no significant improvement was observed when compared with the WPI approach. This must result from the low coarticulation effect between contiguous words in the task addressed here. Nevertheless, the WPD approach should still be relevant in those applications that employ digit strings and other long utterances without silence intervals, where the coarticulation is higher.

The fact that the truncated gamma and geometric probability distributions gave the same results suggests that the approach proposed here could also be employed with Lombard speech by using a max parameter 20% or 30% higher than the one estimated with neutral voice. Moreover, the results presented here suggest that if the spectral mismatch between training and testing speech increases, which may be caused by the Lombard effect, the improvement due to temporal restrictions also increases. Nevertheless, the effects of noise on speech production are not the main topic of this paper, a text-dependent speaker verification database with speech under simulated and/or actual stress is not currently available, and the elaboration of such a corpus is out of the scope of the research reported here.

It is interesting to highlight that state duration modeling does not need any information about the testing environment, and hence it is an interesting technique from the practical application point of view. It is also worth mentioning that the Viterbi algorithm with the temporal restrictions may not give a globally optimal alignment path, and further research is currently in progress to overcome this limitation. Finally, a more complete study of the applicability of temporal restrictions to Lombard speech is also proposed as future work.

Acknowledgements

The authors wish to thank Dr. Fergus McInnes, from the University of Edinburgh, UK, for having proofread this manuscript, and Prof. John Hansen, from the University of Colorado, USA, for the discussions on Lombard effect. Finally, the research described in this paper was sponsored by Conicyt/Fondecyt-Chile and T.F. Pegoraro was supported by a scholarship from CNPq-Brazil.

References

Bou-Ghazale, S.E., Hansen, J., 1998. HMM-based stressed speech modeling with application to improved synthesis and recognition of isolated speech under stress. IEEE Trans. Speech Audio Process. 6 (3), 201–216.
Burshtein, D., 1996. Robust parametric modeling of durations in hidden Markov models. IEEE Trans. Speech Audio Process. 4 (3).
Carey, M., Parris, E., 1992. Speaker verification using connected words. Proc. Inst. Acoust. 14 (6), 95–100.
Chen, Y., 1988. Cepstral domain talker stress compensation for robust speech recognition. IEEE Trans. ASSP 36 (4).
Forsyth, M.E., 1995. Semi-continuous hidden Markov models for automatic speaker verification. Ph.D. Thesis, University of Edinburgh, UK.
Furui, S., 1997. Recent advances in speaker recognition. Pattern Recognition Lett. 18, 859–872.
Ghitza, O., 1987. Robustness against noise: the role of timing-synchrony measurement. In: Proc. ICASSP'87, pp. 2372–2375.
Hansen, J., 1996. Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Communication 20, 151–173.
Junqua, J.C., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Am. 93 (1), 510–524.
Junqua, J.C., 1999. The Lombard effect: A reflex to better communicate with others in noise. In: Proc. ICSLP99.
Ljolje, A., 1994. High accuracy phone recognition using context clustering and quasi-triphonic models. Comput. Speech Language 8, 129–151.
Ljolje, A., Levinson, S.E., 1991. Development of an acoustic-phonetic hidden Markov model for continuous speech recognition. IEEE Trans. ASSP 39 (1), 29–39.
Mokbel, C., Chollet, G., 1995. Automatic word recognition in cars. IEEE Trans. Speech Audio Process. 3 (5), 346–356.
Van Summers, W. et al., 1988. Effects of noise on speech production: Acoustic and perceptual analyses. J. Acoust. Soc. Am. 84 (3).
Varga, A. et al., 1992. The Noisex-92 study on the effect of additive noise in automatic speech recognition. Technical Report, DRA, UK.
Vaseghi, S.V., Milner, B.P., 1997. Noise compensation methods for hidden Markov model speech recognition in adverse environments. IEEE Trans. Speech Audio Process. 5 (1), 11–21.
Yoma, N.B. et al., 2001. On including temporal constraints in Viterbi alignment for speech recognition in noise. IEEE Trans. Speech Audio Process. 9 (3), 179–182.