Digital Signal Processing 10, 42–54 (2000). doi:10.1006/dspr.1999.0360. Available online at http://www.idealibrary.com.
Score Normalization for Text-Independent Speaker Verification Systems

Roland Auckenthaler∗,†, Michael Carey∗, and Harvey Lloyd-Thomas∗

∗ Ensigma Ltd., Turing House, Station Road, Chepstow, NP16 5PB, United Kingdom; † Department of Electrical & Electronic Engineering, University of Wales, Swansea, SA2 8PP, United Kingdom

E-mail: [email protected], [email protected], [email protected]

Auckenthaler, Roland, Carey, Michael, and Lloyd-Thomas, Harvey, Score Normalization for Text-Independent Speaker Verification Systems, Digital Signal Processing 10 (2000), 42–54.

This paper discusses several aspects of score normalization for text-independent speaker verification. The theory of score normalization is explained using Bayes' theorem and detection error trade-off plots. Based on the theory, the world, cohort, and zero normalization techniques are explained. A novel normalization technique, test normalization, is introduced. Experiments showed significant improvements for this new technique compared to the standard techniques. Finally, there is a discussion of the use of additional knowledge to further improve the normalization methods. Here, the test normalization method is extended to use knowledge of the handset type. © 2000 Academic Press

Key Words: speaker verification; score normalization.
1. INTRODUCTION The important problems in speaker verification are those of finding the right feature set and finding an optimal strategy for classification. Over recent years, features based on cepstral coefficients and statistical approaches such as Gaussian mixture models (GMMs) [1] have come to dominate the field of speaker verification. An important issue in the statistical approaches to speaker verification is that of score normalization, which covers aspects such as the scaling of likelihood scores [2, 3], and handset normalization [4, 5]. The scaling of the likelihood score distributions of different speakers is used to find a global speaker independent threshold for the decision making process. Handset or channel normalization is used to reduce environmental effects on the verification decision.
Normalization techniques also depend on the approach used for verification. As noted, the most common statistical approach for text-independent speaker verification is based on GMMs [6]. In previous NIST evaluations, and in [7], it has been shown that global GMMs outperform phoneme-based hidden Markov models (HMMs) on text-independent tasks, due to the sharing of information between mixture components during training. In this paper we first discuss the theory of normalization. We then describe the common normalization techniques based on this theory and present some experimental results. A new strategy called test normalization (T-norm) is also presented. Finally, we discuss the use of additional knowledge to improve the performance of speaker verification.
2. THEORY OF SCORE NORMALIZATION

For speaker verification we need to evaluate the probability of a hypothesized speaker model m for a given test utterance O, P(m|O). However, the output of a GMM system gives P(O|m). By making the assumption that all speaker models are equally likely and by approximating the probability of the utterance using a speaker-independent world model m_W, Bayes' theorem [8] gives the following when transformed into the log domain:

$$\log P(m|O) = \log P(O|m) - \log P(O|m_W). \tag{1}$$

This equation gives a relative log-likelihood score between a speaker model and a world model for the observation O. The effect of Eq. (1) on the speaker verification task is that quality mismatches which occur between the test observation O and the speaker model m will have a corresponding effect on the world model m_W. Therefore, effects which lead to a bias in P(O|m) are eliminated in P(m|O) by the relative log-likelihood scoring.

In a GMM system, a speaker's voice is modeled by a large number of Gaussian probability densities. The speaker model is often created by maximum likelihood adaptation from a world model [8], the Gaussian density parameters being linearly adapted from the world model toward the speaker's voice characteristics. The speaker parameters are used to calculate a similarity score S between the observation sequence $O = \{\vec{o}_1, \vec{o}_2, \ldots, \vec{o}_N\}$ and the speaker model m. This is performed by

$$S(\vec{o}_i, m) = \max_j \frac{w_{j,m}}{\sqrt{(2\pi)^D \, |\Sigma_{j,m}|}} \exp\left(-\frac{1}{2}(\vec{o}_i - \vec{\mu}_{j,m})^T \, \Sigma_{j,m}^{-1} \, (\vec{o}_i - \vec{\mu}_{j,m})\right), \tag{2}$$
where $w_{j,m}$ is the weight, $\vec{\mu}_{j,m}$ is the mean vector, and $\Sigma_{j,m}$ is the covariance matrix of the most likely Gaussian density (index j) in the speaker model m for a given observation vector of dimension D. The covariance matrix $\Sigma_{j,m}$ is assumed to be diagonal. In general, more than one nearest neighbor is used for the likelihood calculation. However, considering only the nearest neighbor
simplifies the explanation of the effects of normalization. In real applications, an accumulation of a number of the best mixture densities is used [9]. The probability of the observation O is related to the vector likelihood scores by

$$P(O|m) = \left(\prod_{i=1}^{N} S(\vec{o}_i, m) \cdot d\vec{o}\right)^{1/N}, \tag{3}$$

where the exponent 1/N is a normalization by the length of the utterance, which is equal to the number of test vectors N. The differential interval for the probability is given by $d\vec{o}$. This leads to a final equation for the verification task for an observation O, speaker model parameters $w_{j,m}$, $\vec{\mu}_{j,m}$, $\Sigma_{j,m}$, world model parameters $w_{j,m_W}$, $\vec{\mu}_{j,m_W}$, $\Sigma_{j,m_W}$, and the best mixture index j for each observation vector $\vec{o}_i$:

$$\log P(m|O) = \frac{1}{N}\sum_{i=1}^{N} \log\left(\sqrt{\frac{|\Sigma_{j,m_W}|}{|\Sigma_{j,m}|}} \cdot \frac{w_{j,m}}{w_{j,m_W}}\right) - \frac{1}{2N}\sum_{i=1}^{N} (\vec{o}_i - \vec{\mu}_{j,m})^T \, \Sigma_{j,m}^{-1} \, (\vec{o}_i - \vec{\mu}_{j,m}) + \frac{1}{2N}\sum_{i=1}^{N} (\vec{o}_i - \vec{\mu}_{j,m_W})^T \, \Sigma_{j,m_W}^{-1} \, (\vec{o}_i - \vec{\mu}_{j,m_W}). \tag{4}$$
By applying Bayes' theorem and the transformation to the log domain, the factor $d\vec{o}$ is eliminated in Eq. (4). This equation shows that the log-likelihood score depends upon the similarity between the observation and the speaker model parameters and the similarity between the observation and the world model parameters. The first term in Eq. (4) is a constant based on the weights and variance parameters of the models. This model-dependent constant is included in the final score. From the NIST Speaker Recognition Evaluations [10] it is known that the best verification results are obtained when the speaker's weights and variances are identical to those of the world model. The verification performance due to adaptation of mixture weights is better than that due to adaptation of the variances, but neither improves upon adapting the means only. The variance parameters are not well trained with the limited amount of training data for a speaker, while some mixture components may be underrepresented in the weights adaptation and the influence of additional vectors becomes very large. Another explanation is that the first term in Eq. (4) disappears when the weights and variances of the speaker are identical to those of the world model. Otherwise, the term can lead to a bias in the scoring depending on the number of vectors seen for each mixture component.
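To make the scoring concrete, the following is a minimal numpy sketch of Eqs. (2)–(4) under the top-1 (nearest neighbor) approximation with diagonal covariances. The function names and the (weights, means, variances) model representation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def frame_log_score(o, weights, means, variances):
    """Top-1 log-likelihood of one observation vector under a diagonal-covariance
    GMM, i.e., the log of the similarity score S in Eq. (2)."""
    diff = o - means                                 # (J, D) differences to all means
    quad = np.sum(diff * diff / variances, axis=1)   # Mahalanobis terms
    log_p = (np.log(weights)
             - 0.5 * o.size * np.log(2.0 * np.pi)
             - 0.5 * np.sum(np.log(variances), axis=1)
             - 0.5 * quad)
    return float(np.max(log_p))                      # keep only the best mixture

def log_likelihood_ratio(observations, speaker, world):
    """Length-normalized relative score of Eq. (4):
    (1/N) * sum_i [log S(o_i, m) - log S(o_i, m_W)]."""
    return float(np.mean([frame_log_score(o, *speaker) - frame_log_score(o, *world)
                          for o in observations]))
```

In practice an accumulation over the few best mixture components would replace the single max, as noted above.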
2.1. Distribution Scaling

The input observation for the verification process can be uttered by either a target speaker or an impostor. A threshold has to be found to distinguish between observations from targets and impostors. This threshold depends on the bimodal distribution of the log-likelihood output score $\log P(m|O)$, where the bimodal distribution parameters depend on the target speaker model as well as on the observation itself. A distribution scaling is therefore needed to find a single scale for all target speakers. This then allows the use of a single threshold across all hypothesized speakers.

In the NIST evaluations [11] a threshold based on a detection cost function (DCF) is used to assess the performance of the systems at a specific operating point. In more general assessments, a detection error trade-off (DET) plot [12] is used for evaluation. The axes of this plot are scaled in proportion to the variances of the distributions, which leads to a straight line if the scores of both classes of uttered observations, from impostors or targets, are Gaussian distributed.

FIG. 1. Theoretical DET plots for variations of the target distribution.

Figure 1 shows the relationship between the two distributions and the corresponding DET plot. The impostor distribution is assumed to have zero mean and unit variance. In Fig. 1a, three target score distributions are shown
with differing means. The separation of targets and impostors improves with an increasing mean of the target distribution. The DET plots for these examples show increasing performance for higher target means. Moreover, the plots for the differing means are parallel. The performance characteristic changes when the mean of the target score distribution stays constant and the variance changes (Fig. 1b). The DET plots are seen to rotate about a fixed point. A flattening of the performance occurs with larger variances. The anchor point of rotation is given by the distance between the target and impostor distributions. Similar results can be obtained by varying the impostor distribution parameters, which will show the same behavior for shifting the mean and a rotation based on an anchor point on the vertical axis.

The DET plots in Fig. 1 show that the two types of verification error, false alarms and missed recognitions (x and y), are related to each other. The relationship is given by

$$\sigma_T y + \mu_T = -\sigma_I x + \mu_I \tag{5}$$

with

$$P_{\mathrm{Miss}} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-t^2/2}\, dt \qquad \text{and} \qquad P_{\mathrm{Fa}} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt, \tag{6}$$
where $\mu_I$, $\sigma_I$ are the parameters of the impostor distribution and $\mu_T$, $\sigma_T$ are the parameters of the target distribution, both distributions being assumed Gaussian. The relation between x and y contains the four distribution parameters. Since these parameters can be different for each target speaker, it is almost impossible to find an optimal global threshold for the decision making process. A distribution scaling should therefore try to normalize the $(\mu, \sigma)$ pairs to a single distribution, the most common case being zero mean and unit variance. There are two different approaches to this problem: the first tries to unify the impostor distributions, the second the target distributions. We therefore distinguish between impostor-centric and speaker-centric normalization methods. An accurate estimation of the distribution parameters implies the need for a large amount of data. All the currently available databases [13] only contain enough data to allow the estimation of impostor distributions; no databases are available to develop speaker-centric normalization techniques. Therefore we will limit further discussion to impostor-centric techniques. The distribution scaling also suggests an easy way of setting a global threshold with a standardized scale for one distribution. In the case of an impostor-centric distribution, points on the horizontal axis of the DET plot are related to a zero mean, unit variance distribution. This allows us to set the threshold to an a priori known probability for false alarms. A threshold level of zero is equal to a false alarm probability of 50%.
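As a worked illustration of this threshold setting (a sketch, not from the paper): once impostor scores are normalized to zero mean and unit variance, the threshold for any desired false alarm probability is simply a standard normal quantile.

```python
from statistics import NormalDist

def threshold_for_pfa(p_fa):
    """Threshold on impostor-normalized scores such that
    P(impostor score > threshold) = p_fa under N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - p_fa)

print(threshold_for_pfa(0.50))  # 0.0   -> a zero threshold yields 50% false alarms
print(threshold_for_pfa(0.01))  # ~2.33 -> 1% false alarms
```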
2.2. Impostor-centric Scaling

An impostor-centric distribution scaling simplifies the relation between the x and y points in Eq. (5) to

$$y = -\frac{1}{\sigma_T} x - \frac{\mu_T}{\sigma_T}, \tag{7}$$

which reveals the two quantities that are responsible for the performance of the system: the first term in the equation gives the tilt of the DET plot and the second term results in a parallel shift of the plot. A special point of interest is the equal error rate (EER); here, the probability of a false alarm is equal to the probability of a missed target speaker. This can be expressed by

$$x = y = -\frac{\mu_T}{1 + \sigma_T} \qquad \text{with} \qquad P_{\mathrm{EER}} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt. \tag{8}$$
3. NORMALIZATION TECHNIQUES

Different normalization techniques can be assigned to different categories. First, techniques can be distinguished by purpose: normalizing the score itself (by applying Bayes' theorem) or scaling the score distribution. A second characterization is made for distribution scaling, where the technique can be either speaker-centric or impostor-centric. It is not always clear to which category a particular technique belongs.

Score normalization procedures include two main approaches. The first is based on the use of a world model [14], which is derived from Bayes' theorem. The second is cohort normalization [15], which uses a set of cohort speakers who are close to the target speaker. The cohort can be seen as a replacement for the world model, calculating a probability of the cohort under the conditions of the observation. The selection of the cohort can be done during training, when the speaker model is compared to cohort models using a similarity measure [16]. Another approach finds a cohort set of speakers during testing; this is referred to as an unconstrained cohort [17].

However, the cohort approach can also be seen as a distribution scaling, since the cohort is different for each target speaker model. If the cohort is chosen at test time, the selected speakers may also be different for each test utterance. This leads to a mean estimate for the distribution scaling problem; the variance factor of the distribution scaling is ignored. Another aspect of cohort normalization is the size of the cohort. If a large set of speakers is chosen for a cohort, it behaves as an impostor-centric normalization. With a reduction in the number of speakers, cohort normalization behaves in a more speaker-centric fashion. In [17, 18] it is shown that a cohort size smaller than five performs best for an unconstrained cohort in text-dependent verification.
A different approach presented in [19] creates a virtual model based on a set of cohort speakers. Components of each individual cohort speaker model are used to create an artificial speaker model which is close to the target speaker.
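For concreteness, a minimal sketch of the unconstrained cohort idea described above, assuming the cohort models are scored with the same log-likelihood scorer as the target; the function name and the best-two selection are illustrative (the best-two setting mirrors the configuration used in Section 4):

```python
import numpy as np

def unconstrained_cohort_score(target_score, cohort_scores, cohort_size=2):
    """Test-time cohort normalization: subtract the mean score of the
    best-scoring cohort speakers for this utterance (mean only, no variance)."""
    best = np.sort(np.asarray(cohort_scores))[-cohort_size:]
    return target_score - float(np.mean(best))
```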
3.1. Zero Normalization

A normalization technique which uses a mean and variance estimation for distribution scaling is zero normalization (Z-norm) [9]. The advantage of Z-norm is that the estimation of the normalization parameters can be performed off-line during training. A speaker model is tested against example impostor utterances and the log-likelihood scores are used to estimate a speaker-specific mean and variance for the impostor distribution. The normalization has the form

$$S = \frac{\log P(m|O) - \mu_I}{\sigma_I}, \tag{9}$$

where $\mu_I$ and $\sigma_I$ are the estimated impostor parameters for speaker model m and S is the distribution-normalized score.
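A minimal sketch of the off-line Z-norm estimation, assuming a scoring function such as the log_likelihood_ratio sketch given earlier; the helper names are hypothetical:

```python
import numpy as np

def znorm_params(speaker_model, impostor_utterances, score_fn):
    """Score one speaker model against example impostor utterances (off-line,
    during training) and estimate the speaker-specific impostor statistics."""
    scores = np.array([score_fn(u, speaker_model) for u in impostor_utterances])
    return float(scores.mean()), float(scores.std())

def znorm(raw_score, mu_i, sigma_i):
    """Apply Eq. (9) with the stored per-speaker parameters."""
    return (raw_score - mu_i) / sigma_i
```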
3.2. Test Normalization

A new normalization method which is also based on a mean and variance estimation for distribution scaling is test normalization (T-norm). During testing, a set of example impostor models is used to calculate impostor log-likelihood scores for a test utterance, similar to a cohort approach. However, unlike the cohort approach, mean and variance parameters are estimated from these scores. These parameters are then used to perform the distribution normalization given by Eq. (9). The advantage of T-norm over a cohort normalization is the use of the variance parameter, which approximates the distribution of the cohort population more accurately. The estimation of these distribution parameters is carried out on the same utterance as the target speaker test. Therefore, an acoustic mismatch between the test utterance and normalization utterances, possible in Z-norm, is avoided.
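The corresponding sketch for T-norm; note how the roles are reversed relative to Z-norm: the statistics come from scoring the same test utterance against a set of cohort models. Again, the names are illustrative:

```python
import numpy as np

def tnorm(raw_score, test_utterance, cohort_models, score_fn):
    """T-norm: estimate impostor mean and variance at test time from the
    current utterance scored against cohort models, then apply Eq. (9)."""
    scores = np.array([score_fn(test_utterance, m) for m in cohort_models])
    return (raw_score - float(scores.mean())) / float(scores.std())
```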
4. PERFORMANCE EVALUATION OF NORMALIZATION METHODS In this section we describe the results of experiments using different normalization methods against a baseline using a standard world model normalization. In addition to the world model normalization, an unconstrained cohort, Z-norm, and T-norm are applied to perform distribution scaling.
4.1. Experimental Configuration Tests were performed on the NIST 1998 Speaker Recognition Evaluation database [20]. Speaker models were trained using two sessions from the same handset, each of about 1-min duration. Testing was performed with 2500 30-s segments for each gender. Only test segments from the same phone number were taken from the evaluation (same number condition).
A GMM-based verification system [6] was used, with features extracted using a 19th order mel filterbank at a frame rate of 10 ms. The filterbank outputs were decorrelated using a discrete cosine transform giving 12 cepstral features per frame. First and second order derivatives were then calculated over spans of five and seven frames, respectively. The cepstral mean was subtracted from the static features on a per-segment basis. Frame energy was appended to the feature vector, as were its first and second order derivatives. Separate world models were built for each gender, each with 1024 mixture densities. Speaker models were adapted from the world models using linear adaptation of the mean parameters only (weight and variance parameters were copied directly from the world model).
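A sketch of the cepstral post-processing just described, assuming the mel filterbank and DCT stages have already produced a (frames × 12) cepstral array and a log-energy track; the regression-delta formula and edge handling are common practice, not taken from the paper:

```python
import numpy as np

def deltas(feats, half_span):
    """Regression deltas over a window of 2*half_span + 1 frames."""
    n = len(feats)
    pad = np.pad(feats, ((half_span, half_span), (0, 0)), mode="edge")
    num = sum(k * (pad[half_span + k:half_span + k + n]
                   - pad[half_span - k:half_span - k + n])
              for k in range(1, half_span + 1))
    return num / (2 * sum(k * k for k in range(1, half_span + 1)))

def front_end(cepstra, log_energy):
    """Per-segment CMS on the 12 static cepstra, first/second derivatives over
    spans of five and seven frames, plus energy and its derivatives."""
    static = cepstra - cepstra.mean(axis=0)   # cepstral mean subtraction
    d1 = deltas(static, 2)                    # five-frame span
    d2 = deltas(d1, 3)                        # seven-frame span
    e = log_energy[:, None]
    return np.hstack([static, d1, d2, e, deltas(e, 2), deltas(deltas(e, 2), 3)])
```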
4.2. Results

Figure 2 shows a comparison of the different normalization methods described. The T-norm and unconstrained cohort are based on a set of 150 impostor speakers taken from the NIST 1997 evaluation training set. For the unconstrained cohort the best two speaker scores were used for normalization. The impostor data for Z-norm were taken from 150 test segments from the NIST 1997 test set, with the number of segments being equally distributed between carbon and electret handset types.

FIG. 2. Comparison of verification performance for different normalization methods.

The curves show no significant improvement for Z-norm compared to the world model normalization baseline, due to the mean-only adaptation. If a complete adaptation of means, variances, and weights is applied [6], Z-norm should improve the performance compared to the baseline. The improved performance of Z-norm when a complete adaptation is applied can be explained with reference to the first term in Eq. (4): the estimation of the Z-norm parameters includes this term, which is based on the mismatch of the weight and variance parameters.

It is also shown that the basic performance around the EER point is the same for world model normalization, T-norm, and the unconstrained cohort method. However, the cohort approaches seem to rotate the DET plot in favor of a lower miss probability at low false alarm rates. The resulting plot also appears straightened, closer to that of a Gaussian distribution. The unconstrained cohort performs more poorly than T-norm for low miss rates and high false alarm rates. This is because T-norm also models a variance, whereas the unconstrained cohort uses only the mean. A problem with T-norm is the need for a large number of cohort speakers to estimate this variance parameter for distribution scaling.

FIG. 3. Comparison of verification performance for T-norm with different cohort sizes.

Figure 3 shows a comparison of T-norm with different cohort sizes. With an increase in size, performance improves, with a large gain being seen for an increase from 10 to 20 speakers. However, a cohort size above 50 speakers leads to no significant further improvement in performance.
5. ADDITIONAL KNOWLEDGE SOURCES Additional knowledge about the speaker or channel can improve verification performance. It is known that a gender-specific world model improves performance when it is used in the adaptation during training and in testing.
FIG. 4. Scatter plot of handset-probability vs verification log-likelihood score for T-norm.
In addition to the knowledge of the gender, NIST also provides knowledge of the handset type for evaluations. This is important if a handset mismatch occurs between training and testing, where the handset type can be either carbon or electret. Note that handsets of the same type are closer to each other than handsets of different types. The information about the handset type can therefore be used for normalization purposes: distribution parameters are estimated for carbon and electret handset types separately, which leads to a closer approximation of the impostor distribution.

The scatter plot in Fig. 4 shows four groups of test scores. For each test utterance, a single point is plotted for the actual target model, while for impostors the average over all hypothesized impostor models for that utterance is plotted. The upper two groups belong to electret handsets, corresponding to a larger value of the electret probability; the lower two groups represent tests performed on carbon handsets. The two groups on the right side are based on target speaker scores; the impostor scores are represented on the left. Looking at the T-norm log-likelihood scores for the different handset types (Fig. 4), a bias (shift to the right) is seen in the scoring of carbon handset utterances. This is due to a mixture of both handset types in the cohort set and an emphasis on the number of electret handset-type models in the evaluation. Therefore the normalization is not optimal. A handset-dependent normalization can be applied if the cohort set is divided into two handset-dependent groups; in testing, the appropriate cohort set is used to normalize for a claimed target speaker.
FIG. 5. Scatter plot of handset-probability vs verification log-likelihood score for handset-dependent T-norm.
Figure 5 shows a scatter plot when handset-dependent T-norm (HT-norm) was applied: the bias for the carbon handset test utterances disappears. The NIST 1998 Speaker Recognition Evaluation only included a small number of carbon test segments compared to the number of electret segments. Therefore the effect of handset normalization shows no significant shift in overall performance. Figure 6 shows a comparison of verification performance between T-norm and HT-norm for same number tests and different number tests. In the case of same number tests, no improvement is seen; the performance actually decreases with HT-norm for small miss probabilities and large false alarm rates. An improvement can be seen, however, for the different number tests. This is mainly due to tests which contain a mismatch between training and testing handset types. No improvement was revealed when the handset type was the same but the number was different.
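A sketch of this handset-dependent T-norm, extending the tnorm sketch above; the dictionary keyed by handset type and the handset label itself (provided by NIST, or by a handset classifier) are illustrative assumptions:

```python
import numpy as np

def htnorm(raw_score, test_utterance, cohorts_by_handset, handset, score_fn):
    """Handset-dependent T-norm: normalize with the cohort whose handset type
    ("carbon" or "electret") matches that of the test utterance."""
    scores = np.array([score_fn(test_utterance, m)
                       for m in cohorts_by_handset[handset]])
    return (raw_score - float(scores.mean())) / float(scores.std())
```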
6. COMMENTS AND CONCLUSION

This paper has presented an overview of different normalization techniques. The methods were categorized based on their theoretical background and on the scaling of the distribution. The distribution scaling can be either speaker-centric or impostor-centric, depending on the available data. Due to the lack of suitable databases, only impostor-centric normalization methods were discussed.
FIG. 6. Verification performance for T-norm and handset-dependent T-norm.
Impostor-centric normalization techniques use either a global description (world model) or individual models (cohort) for the estimation of impostor distribution parameters. A novel method, T-norm, was introduced: based on a cohort set, mean and variance parameters are estimated at test time for normalization of a claimed speaker score. A comparison showed a significant improvement over the baseline of a world model normalization. At a miss probability of 10%, T-norm reduced the false alarm rate to 0.3%, compared to 0.8% for the baseline.

Additional knowledge, if provided, is useful for normalization. An example was given for handset type normalization, where a handset-dependent normalization improved verification performance when a mismatch of the handset type occurred between training and testing.

Finally, a drawback of T-norm is its dependence on the language of the speakers. For a world model normalization, the world model may be trained using material from several languages to yield good performance. In T-norm, the claimed target speaker and the cohort set must speak the same language to obtain good performance. If the language is also given as additional knowledge, perhaps through the use of a language recognition system [21], the cohort set could be chosen according to the recognized language.
REFERENCES

1. Reynolds, D., Quatieri, T., and Dunn, R., Speaker verification using adapted Gaussian mixture models, Digital Signal Process. 10 (2000), 19–41.
2. Doddington, G., Speaker recognition evaluation methodology—an overview and perspective. In Proc. RLA2C 1998, Avignon, 1998, pp. 60–66.
3. Furui, S., An Overview of Speaker Recognition Technology. Kluwer Academic, Dordrecht/Norwell, 1996.
4. Reynolds, D., The effects of handset variability on speaker recognition performance: experiments on the Switchboard corpus. In Proc. ICASSP 1996, Atlanta, 1996, Vol. 1, pp. 113–116.
5. Heck, L., and Weintraub, M., Handset dependent background models for robust text-independent speaker recognition. In Proc. ICASSP 1997, Munich, 1997, Vol. 2, pp. 1071–1074.
6. Reynolds, D., Robust text-independent speaker identification using Gaussian mixture speaker models, Speech Commun. 17 (1995), 91–108.
7. Auckenthaler, R., Parris, E., and Carey, M., Improving a GMM speaker verification system by phonetic weighting. In Proc. ICASSP 1999, Phoenix, 1999, Vol. 1, pp. 313–316.
8. Carey, M., Parris, E., and Bennett, S., Speaker verification. In Proc. Institute of Acoustics (Speech & Hearing), Windermere, UK, 1996, pp. 99–106.
9. Reynolds, D., Comparison of background normalization methods for text-independent speaker verification. In Proc. Eurospeech 1997, Rhodes, 1997, pp. 963–966.
10. NIST, Speaker Recognition Workshop, June 1999.
11. Przybocki, M., and Martin, A., NIST Speaker Recognition Evaluation—1997. In Proc. RLA2C 1998, Avignon, 1998, pp. 120–123.
12. Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M., The DET curve in assessment of detection task performance. In Proc. Eurospeech 1997, Rhodes, 1997, pp. 1895–1898.
13. Campbell, J., and Reynolds, D., Corpora for the evaluation of speaker recognition systems. In Proc. ICASSP 1999, Phoenix, 1999.
14. Carey, M., Parris, E., and Bridle, J., A speaker verification system using alpha-nets. In Proc. ICASSP 1991, Toronto, 1991, pp. 397–400.
15. Rosenberg, A., DeLong, J., Lee, C., Juang, B., and Soong, F., The use of cohort normalized scores for speaker recognition. In Proc. ICSLP 1992, Banff, 1992, pp. 599–602.
16. Fukunaga, K., Introduction to Statistical Pattern Recognition, Academic Press, New York, 1972.
17. Ariyaeeinia, A., and Sivakumaran, P., Analysis and comparison of score normalization methods for text-dependent speaker verification. In Proc. Eurospeech 1997, Rhodes, 1997, pp. 1379–1382.
18. Auckenthaler, R., and Mason, J., Score normalization in a multi-band speaker verification system. In Proc. RLA2C 1998, Avignon, 1998, pp. 102–105.
19. Isobe, T., and Takahashi, J., A new cohort normalization using local acoustic information for speaker verification. In Proc. ICASSP 1999, Phoenix, 1999.
20. Martin, A., and Przybocki, M., The NIST 1999 Speaker Recognition Evaluation—An overview, Digital Signal Process. 10 (2000), 1–18.
21. Lloyd-Thomas, H., Parris, E., and Wright, J., Recurrent substrings and data fusion for language recognition. In Proc. ICSLP 1998, Sydney, 1998, pp. 169–172.
ROLAND AUCKENTHALER received his degree in telematics from the University of Graz, Austria. He is currently working on a collaborative project between the University of Wales, Swansea, and Ensigma, developing techniques for speaker validation.

MICHAEL CAREY (FIEE, SMIEEE) received his B.Sc. (Eng.) and Ph.D. degrees in electrical engineering from Imperial College London in 1971 and 1978. He worked on digital filters and orthogonal multiplexing techniques at British Telecom Research Laboratories. While a lecturer in electronics at Keele University, England, from 1976 to 1981, he researched the applications of signal processing in audio and communications. As Head of Research at Mitel Telecom he worked on speech technology and spread-spectrum techniques. He is now Managing Director of Ensigma, a company specializing in speech and signal processing which he founded in 1986. There he has worked on speech and speaker recognition and language and topic spotting. He has published over fifty papers and holds five patents.

HARVEY LLOYD-THOMAS (AMIEE) received his B.Eng. (Hons.) degree in computer systems engineering from the University of Bristol, England, in 1991. He then went on to study for his Ph.D., also at Bristol. In 1995 he completed his thesis on an integrated language model for automatic speech recognition and joined Ensigma, where he works on speech technology. He is co-author of a number of papers on language modelling and speaker and language recognition.