Engineering Applications of Artificial Intelligence 18 (2005) 13–19 www.elsevier.com/locate/engappai
A genetic classification method for speaker recognition

Q.Y. Hong, S. Kwong

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Received 24 July 2000; accepted 18 August 2004
Available online 18 October 2004
Abstract

The Gaussian mixture model (GMM) has been widely used for modeling speakers. In speaker identification, one major problem is how to generate a set of GMMs for identification purposes based upon the training data. Due to the hill-climbing characteristic of the maximum likelihood (ML) method, any arbitrary estimate of the initial model parameters will usually lead to a sub-optimal model in practice. To resolve this problem, this paper proposes a hybrid training method based on the genetic algorithm (GA). It utilizes the global searching capability of the GA and combines it with the effectiveness of the ML method. Experimental results based on TI46 and TIMIT showed that this hybrid GA could obtain more optimized GMMs and better results than the simple GA and the traditional ML method.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Gaussian mixture model; Speaker identification; Genetic algorithm
1. Introduction

Spoken utterances convey both linguistic and speaker-specific information. Different from speech recognition (Rabiner and Juang, 1993), the objective of speaker recognition (Furui, 1996) is to find out the speaker's identity from the utterances. This technique has many real-world applications, e.g. voice dialing, call centers, information retrieval, secure access control, etc. Speaker recognition can be text-dependent or text-independent. It is preferable to develop a text-independent system, since the speaker is not required to utter a specific phrase or sentence. In practice, longer utterances are usually used for this type of system, since they contain more speaker information.

Research on speaker recognition has been undertaken for over four decades. Many techniques have been proposed to model the speakers, e.g. vector quantization (VQ) (Soong et al., 1985), hidden Markov models (HMM) (Tishby, 1991; Matsui and Furui, 1992) and
neural networks (NN) (Hattori, 1992; Farrell et al., 1994). HMM has been widely used in the area of speech recognition. As a left-to-right model, the HMM has a temporal sequence modeling capability and performs well for text-dependent speaker recognition. But for a text-independent system, the improvement due to the temporal sequence capability can be negligible (Tishby, 1991). The NN method produces good performance compared with the VQ method, but it needs to be retrained whenever a new speaker is added to the system (Reynolds and Rose, 1995). In recent years, it has become more popular to represent speakers with the Gaussian mixture model (GMM) (Rose and Reynolds, 1990; Reynolds and Rose, 1995; Reynolds, 1995). The linear combination of Gaussian functions is capable of representing a large class of sample distributions (Reynolds and Rose, 1995). The GMM is computationally efficient and easily implemented, especially on a real-time platform (Reynolds et al., 1992). In principle, it is a compromise between performance and complexity.

There are two fundamental tasks in speaker recognition: identification and verification (Furui, 1996). While speaker verification is to accept or reject a claimed
speaker based on a threshold, speaker identification is to determine the identity among the registered speakers. This study focuses on the GMM-based speaker identification system. As shown in Fig. 1, there is a single GMM for each of the M speakers. The speaker model that yields the highest likelihood score for the speech utterance is selected as the identification result.

Fig. 1. Block diagram of speaker identification system (the speech utterance passes through feature extraction, is scored against the GMM of each of the M speakers, and the identification stage outputs the speaker ID).

In the GMM framework, one of the major problems is how to generate a set of GMM models for identification purposes based upon a set of limited training data such that it can accurately match the speaker source. The commonly used training method is maximum likelihood (ML) estimation because of its simplicity and mathematical tractability. Based on the ML criterion, the model parameters are updated repeatedly and the probability of the observation sequences is improved until some limit point. However, due to the hill-climbing characteristic, any arbitrary estimate of the initial model parameters will usually lead to a sub-optimal model in practice.

Genetic algorithm (GA) (Tang et al., 1996; Man et al., 1997) is a searching process based on the laws of natural selection and genetics. It emulates individuals in the natural environment, that is, the natural selection mechanism makes the stronger individuals the likely winners in the competing environment. Previously, Chau et al. (1997) applied GA training to HMM-based speech recognition and obtained better-quality solutions than the traditional Baum-Welch algorithm (Rabiner and Juang, 1993). Kwong et al. (2001) combined the GA with the Baum-Welch algorithm to form a hybrid GA such that the quality of the results and the runtime behavior of the GA were further improved. In this work, we propose a hybrid GA for the training of the GMM-based speaker identification system. It uses the ML re-estimation as a heuristic operator to improve the converging speed of the GA. Experimental results showed that the proposed GA for the GMM training could obtain more optimized GMMs than the simple GA and the traditional ML estimation method.

This paper is organized as follows: Section 2 briefly describes the GMM. Section 3 introduces the ML parameter estimation. After that, optimizing the parameters of the GMM using the GA is described in Section 4. Section 5 reports and discusses the experimental results on the TI46 and TIMIT databases. Finally, the conclusion is given in Section 6.
2. Gaussian mixture model

The Gaussian mixture has a remarkable capability to model irregular data. This characteristic makes the GMM very suitable for obtaining a smooth estimate of a speaker's acoustic variability, e.g. emotion, health, etc. In the GMM-based system, the distribution of feature vectors extracted from a speaker's speech $X = \{x_t, 1 \le t \le T\}$ is modeled by a weighted sum of K mixture components, which can be defined as

\[ p(x_t \mid \lambda) = \sum_{k=1}^{K} c_k\, N(x_t; \mu_k, R_k), \qquad (1) \]

where $\lambda$ is the brief notation of the GMM parameters:

\[ \lambda = \{ c_k, \mu_k, R_k \}, \quad 1 \le k \le K. \qquad (2) \]

The mixture component $N(x_t; \mu_k, R_k)$ denotes a Gaussian density function with mean vector $\mu_k$ and covariance matrix $R_k$:

\[ N(x_t; \mu_k, R_k) = \frac{1}{(2\pi)^{D/2} |R_k|^{1/2}} \exp\!\left( -\frac{1}{2} (x_t - \mu_k)'\, R_k^{-1}\, (x_t - \mu_k) \right), \qquad (3) \]

where the prime denotes vector transpose and D is the dimension of the vector $x_t$. It is more popular to use diagonal covariance matrices for the GMM, since a linear combination of diagonal-covariance Gaussians has the same modeling capability as full covariance matrices (Reynolds and Rose, 1995). Another reason is that speech utterances are usually parameterized with cepstral features. Cepstral features are compact, discriminative and, most importantly, nearly decorrelated, which allows diagonal covariances to be used by the GMMs. The selection of the mixture number K depends on the amount of available training data. It needs to be large enough to adequately model the acoustic variability of the speech utterances. On the other hand, it should also be small enough to allow a reliable estimate of the model parameters. In any case, the mixture weights must satisfy the following constraint:
\[ \sum_{k=1}^{K} c_k = 1, \qquad c_k \ge 0. \qquad (4) \]
Given a sequence of observations $X = \{x_t, 1 \le t \le T\}$, the likelihood of observing $X$ by speaker $y$ is denoted by the following expression:

\[ P(X \mid \lambda_y) = \prod_{t=1}^{T} p(x_t \mid \lambda_y). \qquad (5) \]

For speaker identification, we have

\[ \hat{y} = \arg\max_{y} P(X \mid \lambda_y), \qquad (6) \]

with $\hat{y}$ being the identified speaker that attains the highest likelihood score among all the registered speakers.
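As a concrete illustration, a minimal C sketch of Eqs. (1), (3), (5) and (6) is given below. It mirrors the data structure later shown in Fig. 2 and accumulates scores in the log domain, which is the usual way to avoid numerical underflow when multiplying many frame probabilities; the constants K and D and the helper names are illustrative choices, not part of the original system.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define K 4   /* number of mixture components (illustrative) */
#define D 2   /* feature vector dimension (illustrative)     */

struct GMM {                /* same layout as the phenotype of Fig. 2 */
    float C[K];             /* mixture weights c_k                    */
    float mV[K][D];         /* mean vectors mu_k                      */
    float Sigma[K][D];      /* diagonal covariance elements of R_k    */
};

/* log N(x; mu_k, R_k) of Eq. (3) with a diagonal covariance matrix */
static double log_gauss(const struct GMM *g, int k, const float x[D])
{
    double log_det = 0.0, quad = 0.0;
    for (int l = 0; l < D; l++) {
        double d = x[l] - g->mV[k][l];
        log_det += log((double)g->Sigma[k][l]);
        quad    += d * d / g->Sigma[k][l];
    }
    return -0.5 * (D * log(2.0 * M_PI) + log_det + quad);
}

/* log p(x_t | lambda) of Eq. (1), summed over the K mixtures via log-sum-exp */
static double log_mixture(const struct GMM *g, const float x[D])
{
    double lg[K], max = -HUGE_VAL, sum = 0.0;
    for (int k = 0; k < K; k++) {
        lg[k] = log((double)g->C[k]) + log_gauss(g, k, x);
        if (lg[k] > max) max = lg[k];
    }
    for (int k = 0; k < K; k++)
        sum += exp(lg[k] - max);
    return max + log(sum);
}

/* log P(X | lambda_y) of Eq. (5) for a T-frame utterance X */
static double log_likelihood(const struct GMM *g, const float X[][D], int T)
{
    double lp = 0.0;
    for (int t = 0; t < T; t++)
        lp += log_mixture(g, X[t]);
    return lp;
}

/* Eq. (6): return the index of the registered speaker whose GMM scores highest */
static int identify(const struct GMM models[], int M, const float X[][D], int T)
{
    int best = 0;
    double best_lp = -HUGE_VAL;
    for (int y = 0; y < M; y++) {
        double lp = log_likelihood(&models[y], X, T);
        if (lp > best_lp) { best_lp = lp; best = y; }
    }
    return best;
}

Because Eq. (5) multiplies many probabilities smaller than one, working with log P(X | lambda_y) keeps the scores in a numerically safe range; the maximization in Eq. (6) is unaffected, since the logarithm is monotonic.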
3. Maximum likelihood parameter estimation

Before using the GMM for speaker identification, the model parameters must be trained to describe the observation sequences of the speaker accurately. There are many criteria that can be used to estimate the model parameters. The most common one is the ML criterion. This criterion uses a training sequence of observations $X$ to derive the set of model parameters $\lambda$, yielding

\[ \lambda_{\mathrm{ML}} = \arg\max_{\lambda} P(X \mid \lambda), \qquad (7) \]

where $P(X \mid \lambda)$ is the probability of the observation sequence. The ML parameters can be obtained through the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). It includes a set of re-estimation formulas and guarantees that the re-estimated GMM $\bar{\lambda}$ will be equal to or better than the initial model $\lambda$. The recursive ML re-estimation formulas for multiple observation sequences are as follows:

\[ \bar{c}_k = \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k) \Bigg/ \sum_{k=1}^{K} \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k), \qquad (8) \]

\[ \bar{\mu}_k = \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k)\, x_t^c \Bigg/ \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k), \qquad (9) \]

\[ \bar{R}_k = \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k)\, (x_t^c - \mu_k)(x_t^c - \mu_k)' \Bigg/ \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k), \qquad (10) \]

where $\bar{c}_k$, $\bar{\mu}_k$, $\bar{R}_k$ are the model parameters of $\bar{\lambda}$, $C$ is the number of observation sequences, $T_c$ is the length of the $c$th observation sequence for the speaker model, and $\gamma_t^c(k)$ is the probability at time $t$ of the $k$th mixture component accounting for the $c$th observation sequence, which is defined as

\[ \gamma_t^c(k) = \frac{c_k\, N(x_t^c; \mu_k, R_k)}{\sum_{k=1}^{K} c_k\, N(x_t^c; \mu_k, R_k)}. \qquad (11) \]

Before the parameter estimation, a prototype model is usually initialized via the K-means algorithm (Rabiner and Juang, 1993). This prototype model is further re-estimated with the ML formulas. Using Eqs. (8)-(10), the model parameters are improved by repeatedly substituting $\lambda = \bar{\lambda}$. This re-estimation process is terminated when some convergence criterion based on a threshold is met. The criterion is based on the average of the logarithms of the probabilities of the observation sequences and will be described in more detail later. To avoid confusion, we refer to the whole training process through the EM algorithm as the ML method and to each single re-estimation as an ML re-estimation.

The ML method is widely used because of its simplicity and mathematical tractability. However, as a hill-climbing algorithm, it will usually lead to a sub-optimal model for any arbitrarily chosen initial model $\lambda$. In the following, we introduce a GA-based training for the GMM. The GA provides a global searching capability for the optimization problem. Therefore, the GMM training can escape from the initial guess and find the optimal solution if the GA is applied to the training process.
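For reference, a minimal sketch of one ML re-estimation pass is given below, written for a single observation sequence (C = 1) to keep it short; the multi-sequence case of Eqs. (8)-(10) simply accumulates the same sums over c. It reuses the struct GMM, the constants K and D and the log_gauss() helper from the sketch in Section 2; directly exponentiating the Gaussian in Eq. (11) is adequate for an illustration, although production code would stay in the log domain.

/* One EM re-estimation pass (Eqs. (8)-(11)) for one observation sequence. */
static void ml_reestimate(struct GMM *g, const float X[][D], int T)
{
    double sum_gamma[K]  = {0};     /* sum_t gamma_t(k)                   */
    double sum_x[K][D]   = {{0}};   /* sum_t gamma_t(k) * x_t             */
    double sum_sq[K][D]  = {{0}};   /* sum_t gamma_t(k) * (x_t - mu_k)^2  */

    for (int t = 0; t < T; t++) {
        /* Eq. (11): posterior probability of mixture k given frame x_t */
        double gamma[K], denom = 0.0;
        for (int k = 0; k < K; k++) {
            gamma[k] = g->C[k] * exp(log_gauss(g, k, X[t]));
            denom += gamma[k];
        }
        for (int k = 0; k < K; k++) {
            gamma[k] /= denom;
            sum_gamma[k] += gamma[k];
            for (int l = 0; l < D; l++) {
                double d = X[t][l] - g->mV[k][l];   /* uses the old mean, as in Eq. (10) */
                sum_x[k][l]  += gamma[k] * X[t][l];
                sum_sq[k][l] += gamma[k] * d * d;   /* diagonal of Eq. (10) */
            }
        }
    }

    double total = 0.0;             /* denominator of Eq. (8), summed over k as well */
    for (int k = 0; k < K; k++)
        total += sum_gamma[k];

    for (int k = 0; k < K; k++) {
        g->C[k] = (float)(sum_gamma[k] / total);                   /* Eq. (8)  */
        for (int l = 0; l < D; l++) {
            g->mV[k][l]    = (float)(sum_x[k][l]  / sum_gamma[k]); /* Eq. (9)  */
            g->Sigma[k][l] = (float)(sum_sq[k][l] / sum_gamma[k]); /* Eq. (10) */
        }
    }
}

Repeating this pass and substituting the updated model for the current one until the relative increase of the average log probability falls below the threshold reproduces the ML method described above.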
4. Optimizing the parameters using genetic algorithm

In this section, the data structure of the phenotype and the representation of the chromosome in the GA training, the population initialization technique, the fitness evaluation and the genetic operations are described in detail. Our GA is a hybrid GA, where ML re-estimations are used as a special heuristic operator to improve its performance.

4.1. Encoding mechanism

In Section 2, we discussed the advantages of the diagonal covariance compared with the full covariance for the GMM. Assuming that this kind of covariance is used, the phenotype of a single speaker model for the GA training can be represented using the structure given in Fig. 2, where K is the number of mixtures and D is the dimension of the observation vector. The GMM parameters are represented by the mixture weight array, the mean matrix and the covariance matrix, respectively. Although the three kinds of parameters of the same mixture are distributed in different places, they are actually correlated and will be used as a unit in the crossover operation. As shown in Fig. 2, the basic data type of the elements of the GMM is the real number, so we use real-number strings instead of bit-strings as the representation of the chromosomes in our GA (see Fig. 3), where C_k, mV_{k,l} and Sigma_{k,l} are the weight, the lth mean element and the lth diagonal covariance element of the kth mixture component, respectively.
struct GMM {
    float C[K];        /* mixture weight */
    float mV[K][D];    /* mean matrix */
    float Sigma[K][D]; /* diagonal covariance matrix */
};

Fig. 2. The data structure of the phenotype in the GA training of GMM.
Fig. 3. The representation of the chromosome in the GA training: a real-number string [C_0, ..., C_K, mV_{0,0}, ..., mV_{K,D}, Sigma_{0,0}, ..., Sigma_{K,D}] concatenating the mixture weights, the mean elements and the diagonal covariance elements.
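To make the mapping between the phenotype of Fig. 2 and the chromosome of Fig. 3 explicit, a hedged sketch of the flattening and restoring routines is shown below; the element order (all weights, then all means, then all diagonal covariances) is our reading of Fig. 3, and the names encode(), decode() and CHROM_LEN are illustrative, not taken from the paper. The struct GMM and the constants K and D are those of the sketch in Section 2.

#define CHROM_LEN (K + 2 * K * D)   /* C[K], mV[K][D], Sigma[K][D] laid end to end */

/* Flatten a GMM phenotype into a real-number chromosome (Fig. 3). */
static void encode(const struct GMM *g, float chrom[CHROM_LEN])
{
    int pos = 0;
    for (int k = 0; k < K; k++)
        chrom[pos++] = g->C[k];
    for (int k = 0; k < K; k++)
        for (int l = 0; l < D; l++)
            chrom[pos++] = g->mV[k][l];
    for (int k = 0; k < K; k++)
        for (int l = 0; l < D; l++)
            chrom[pos++] = g->Sigma[k][l];
}

/* Restore the phenotype from a chromosome after the genetic operations. */
static void decode(const float chrom[CHROM_LEN], struct GMM *g)
{
    int pos = 0;
    for (int k = 0; k < K; k++)
        g->C[k] = chrom[pos++];
    for (int k = 0; k < K; k++)
        for (int l = 0; l < D; l++)
            g->mV[k][l] = chrom[pos++];
    for (int k = 0; k < K; k++)
        for (int l = 0; l < D; l++)
            g->Sigma[k][l] = chrom[pos++];
}

Under this layout, the weight, mean elements and covariance elements of the kth mixture remain easy to locate, so the mixture crossover of Section 4.4 can still treat them as a single unit.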
4.2. Population initialization

Two strategies can be used to initialize the population. The first is to generate the chromosomes directly, without any heuristic knowledge. The second is to initialize them based on a prototype model. As in the ML method, this model is initialized via the K-means algorithm. Experimental results showed that the second strategy is a better choice. The initialization procedure can be described as follows:

Step 1: Initialize a prototype model $\lambda_0$ based on the training data.
Step 2: For n = 1, ..., P
            For k = 1, ..., K
                $\lambda_n$.C[k] = $\lambda_0$.C[k] × G(1.0, 0.2), where G(1.0, 0.2) is a Gaussian random generator with mean = 1.0 and variance = 0.2.
                For l = 1, ..., D
                    1. $\lambda_n$.mV[k][l] = $\lambda_0$.mV[k][l] × G(1.0, 0.2);
                    2. $\lambda_n$.Sigma[k][l] = $\lambda_0$.Sigma[k][l] × G(1.0, 0.3);
                End
            End
        End

where P is the size of the population. The variances of the Gaussian random generators were determined experimentally.

4.3. Fitness evaluation

In the GA training, the fitness values are the results of the objective function. The objective function is defined as the average of the logarithms of the probabilities of the observation sequences $X_1, X_2, \ldots, X_C$ generated by the given $n$th GMM $\lambda_n$:

\[ lp_n = \left( \sum_{i=1}^{C} \log P(X_i \mid \lambda_n) \right) \Bigg/ C, \qquad (12) \]
where $lp_n$ is the fitness value of the $n$th chromosome in the population and $C$ is the number of observation sequences in the training data. This objective function only considers the likelihood for a single speaker. It is also used to determine the termination condition in the ML method. The probability $P(X_i \mid \lambda_n)$ is calculated through Eqs. (1), (3) and (5). During the training process, some mixtures of the GMM may become corrupted. In this case, we discard them and only calculate the average log probability based on the remaining mixtures.

4.4. Genetic operations

In the training process, three genetic operators are designed: mixture crossover, mutation and the heuristic operator. Without any heuristic operator, the convergence of the GA cycle is slow, e.g. the simple GA needed 20,000 generations to obtain a better HMM solution than the traditional method (Chau et al., 1997). To overcome this shortcoming, we use the ML re-estimation as a heuristic operator.

Mixture crossover. In the GMM framework, mixture crossover is used. It is a derivative of the standard crossover operator. In this operator, two parents are selected from the population pool based on the Roulette Wheel mechanism (Man et al., 1997; Kwong et al., 2001). Five mixtures are randomly selected from one parent and replaced with the corresponding mixtures of the other parent. Experimental results showed that five is a good choice. The first parent is reserved as the offspring.

Mutation. Mutation introduces local variations to the individuals for searching different solution spaces and keeps the diversity of the population. It recovers the information lost in the initialization phase, lets the GA escape from the initial model parameters and helps to find the optimal model parameter set. This operation is conducted at the level of each model parameter. If mutation takes place, the model parameter is multiplied by a Gaussian random number with mean = 1.0 and variance = 0.001.

ML re-estimation. In each generation, each chromosome in the offspring is re-estimated five times with the ML formulas of Eqs. (8)-(10). Each chromosome in the population is further re-estimated eight times every ten GA generations. Since the GA operations may violate the constraints on the model parameters, threshold checking should be conducted before the ML re-estimation, and the mixture weights are normalized so that they sum to 1.0. In the replacement stage, we generate the new population by replacing the worst chromosomes in the current population with the sub-population (the pool of offspring). Finally, the control parameters for the GA training are summarized in Table 1.
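The sketch below gives one possible C rendering of the initialization of Section 4.2 and of the mixture crossover and mutation operators of Section 4.4, operating directly on the struct GMM phenotype from the Section 2 sketch. The gauss_rand() helper and the rand()-based uniform draws are illustrative stand-ins, and parent selection (Roulette Wheel in the paper) is not spelled out here; the constants follow the paper (variances 0.2/0.3 for initialization, 0.001 for mutation, five mixtures swapped per crossover).

#include <stdlib.h>
#include <math.h>

/* Illustrative Gaussian random number with the given mean and variance
   (Box-Muller transform). */
static double gauss_rand(double mean, double variance)
{
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);   /* in (0,1), avoids log(0) */
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double n  = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    return mean + sqrt(variance) * n;
}

/* Section 4.2: perturb a K-means prototype to create one chromosome. */
static void init_chromosome(const struct GMM *prototype, struct GMM *g)
{
    for (int k = 0; k < K; k++) {
        g->C[k] = prototype->C[k] * (float)gauss_rand(1.0, 0.2);
        for (int l = 0; l < D; l++) {
            g->mV[k][l]    = prototype->mV[k][l]    * (float)gauss_rand(1.0, 0.2);
            g->Sigma[k][l] = prototype->Sigma[k][l] * (float)gauss_rand(1.0, 0.3);
        }
    }
}

/* Section 4.4: mixture crossover - copy parent1 and overwrite five randomly
   chosen mixtures (weight, mean and covariance treated as one unit) with
   parent2's; this simplified sketch may pick the same mixture more than once. */
static void mixture_crossover(const struct GMM *parent1, const struct GMM *parent2,
                              struct GMM *offspring)
{
    *offspring = *parent1;
    for (int n = 0; n < 5; n++) {
        int k = rand() % K;
        offspring->C[k] = parent2->C[k];
        for (int l = 0; l < D; l++) {
            offspring->mV[k][l]    = parent2->mV[k][l];
            offspring->Sigma[k][l] = parent2->Sigma[k][l];
        }
    }
}

/* Section 4.4: mutation - each parameter is multiplied, with probability
   p_mut, by a Gaussian random number with mean 1.0 and variance 0.001. */
static void mutate(struct GMM *g, double p_mut)
{
    for (int k = 0; k < K; k++) {
        if ((double)rand() / RAND_MAX < p_mut)
            g->C[k] *= (float)gauss_rand(1.0, 0.001);
        for (int l = 0; l < D; l++) {
            if ((double)rand() / RAND_MAX < p_mut)
                g->mV[k][l] *= (float)gauss_rand(1.0, 0.001);
            if ((double)rand() / RAND_MAX < p_mut)
                g->Sigma[k][l] *= (float)gauss_rand(1.0, 0.001);
        }
    }
}

After these operations, the mixture weights are renormalized to sum to 1.0 and each offspring is improved with the ML re-estimation heuristic before it competes for a place in the next population, as described above.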
Table 1
The control parameters for the GA training

Parameters                               Setting
The size of population                   30
The size of offspring chromosomes        5
The mixture crossover rate               0.08
The mutation rate                        0.01
The maximum number of generations        30
5. Experimental results

Experiments are based on the TI46 (Liberman et al., 1980) and TIMIT (Garofolo et al., 1986) corpora, respectively. Speech utterances from these two databases were parameterized with mel-frequency cepstral coefficients (MFCC). We used the HTK toolkit 3.0 (Young et al., 2000) to conduct the feature extraction procedure. The speech signal was pre-emphasized using a coefficient of 0.97. Each frame of speech was windowed with a Hamming window and represented by a 24-dimensional feature vector, which consists of 12 MFCCs and their first differentials.

The results of the traditional ML method are used as the baseline for comparison. In the ML training, the re-estimations are terminated when the relative increase of the average log probability of the training data between two successive iterations is less than 0.01%. The associated limits for the mixture weight and for the elements of the diagonal covariance are $1.0/(30.0 \cdot K)$ and 0.0001, respectively, where K refers to the number of mixtures. The same threshold checking is conducted for the GA training.
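These two stopping and flooring rules are simple to state in code. The sketch below is an illustrative rendering that continues the Section 2 sketch (struct GMM, K, D, math.h): the 0.01% relative-increase test and the floors 1.0/(30.0*K) for the weights and 0.0001 for the diagonal covariance elements are taken from the text, while the function names and the use of the absolute relative change are assumptions.

/* Terminate the ML training when the relative increase of the average log
   probability between two successive iterations is below 0.01%.  The log
   probability is negative, so the absolute relative change is used here. */
static int has_converged(double prev_lp, double curr_lp)
{
    return fabs((curr_lp - prev_lp) / prev_lp) < 0.0001;   /* 0.01% */
}

/* Threshold checking: floor the mixture weights and the diagonal covariance
   elements so that no component collapses during the ML or GA training. */
static void thresh_check(struct GMM *g)
{
    for (int k = 0; k < K; k++) {
        if (g->C[k] < 1.0f / (30.0f * K))
            g->C[k] = 1.0f / (30.0f * K);
        for (int l = 0; l < D; l++)
            if (g->Sigma[k][l] < 0.0001f)
                g->Sigma[k][l] = 0.0001f;
    }
}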
5.1. Results of TI46

The TI46 corpus contains 16 speakers: 8 males labeled M1-M8 and 8 females labeled F1-F8. The complete corpus is divided into two directories: TI20 and TI_ALPHA. TI20 contains all utterances of the words 'ZERO'-'NINE' and the words 'ENTER'-'YES'. From this sub-corpus, we select the 10-command set {enter, erase, go, help, no, rubout, repeat, stop, start, yes} of the 8 female speakers to conduct the text-dependent speaker identification. There are 100 training utterances used for each speaker. The whole test sections are selected as the test data. Each word has 16 utterances spoken by each speaker; thus there are 1280 test utterances in total to be used for the identification evaluation.

To demonstrate the effectiveness of the heuristic operator, we compare the final fitness of the hybrid GA and of the simple GA (without the ML re-estimation operator) after 30 generations. The results are given in Fig. 4, from which we can see that the value of lp for each speaker is increased significantly through the heuristic operator.
Fig. 4. The fitness comparison of the simple GA (without the ML re-estimation operator) and the hybrid GA for the 8 female speakers. Each speaker model consists of 32 mixtures.
To compare the performance of the ML and GA training, the GMMs are trained with 16, 32 and 64 mixtures, respectively. Table 2 gives the results of fitness and identification accuracy. It is seen that in all cases, the GMMs trained by the GA have higher values of lp than the GMMs trained by the ML method. This implies that the GMMs trained by our GA are more optimized than the GMMs trained by the ML method. After the training, we compare the identification accuracy. The results show that the GA has better identification performance.

5.2. Results of TIMIT

TIMIT contains a total of 630 speakers from 8 major dialect regions (labeled DR1 to DR8) of the United States. There are 10 sentences for each speaker, including 2 SA sentences, which are the same across all speakers, 5 SX sentences and 3 SI sentences. The SX and SI sentences are different from each other and across speakers. The speech signal was recorded through a high-quality microphone in a quiet environment. From the second dialect region (DR2) of the train section of TIMIT, we select 23 female speakers to conduct the text-independent speaker identification. The number of training sentences of each speaker ranges from 4 to 8 and the remaining sentences are used as the test data. Each GMM consists of 8 mixtures.

The results of ML and GA are compared in Fig. 5. In the first two cases, when 4 or 5 training sentences are used, the identification accuracy of the GA is lower than that of the ML method. This might be due to over-training on the limited training data. When there are 6, 7 or 8 training sentences, the performance of the GA is equal to or better than that of ML. Thus our GA training method is preferable when enough training data are available. Actually, over-training is also a problem of the traditional ML method.
Table 2
Experimental results: lp_n and identification accuracy

Trained speaker    16 mixtures              32 mixtures              64 mixtures
                   ML          GA           ML          GA           ML          GA
F1                 -3787.80    -3786.82     -3705.70    -3701.32     -3617.96    -3609.59
F2                 -3791.18    -3779.34     -3700.24    -3691.50     -3619.87    -3604.43
F3                 -4113.22    -4104.45     -4036.45    -4028.66     -3944.00    -3935.72
F4                 -4162.01    -4154.72     -4083.44    -4078.08     -3988.24    -3982.03
F5                 -4132.95    -4122.38     -4045.24    -4032.45     -3948.83    -3932.45
F6                 -4515.99    -4512.31     -4434.00    -4423.31     -4351.70    -4336.34
F7                 -4416.49    -4413.68     -4319.99    -4307.82     -4199.82    -4192.83
F8                 -3874.24    -3869.81     -3790.61    -3783.49     -3689.15    -3678.62
Accuracy           89.61%      90.39%       94.22%      95.78%       97.73%      98.20%
Fig. 5. Identification results of the 23 female speakers: identification accuracy (%) of the ML and GA training versus the number of training sentences (4-8).
It is difficult to strike a balance between the number of re-estimations and the open-test performance. Generally, when the amount of training data is large enough, more re-estimations on the training data lead to higher accuracy on the test data. On the other hand, only 20 or even fewer ML re-estimations are needed when only limited amounts of training data are available.

To further compare their performance in the case of enough training data, we conduct the identification experiment on the first two dialect regions, DR1 and DR2, respectively. There are 8 training sentences per speaker and the training process is conducted for 8 mixtures and 16 mixtures, respectively. The results are shown in Table 3. For the 76 speakers of DR2, we can see that the identification results of the GA are better than those of the ML method. For the 38 speakers of DR1, however, there is no improvement through the GA training, although the value of lp of the GMM trained by the GA is higher than that of the GMM trained by ML. We tested the other dialect regions and some of them show the same behavior. This further indicates that increasing the value of lp is not always an effective strategy to improve the identification performance. It can be explained by the fact that the ML criterion only considers the likelihood
for a single speaker, that is, each model is estimated separately using its labeled training utterances. When there are confusable models or the training data are limited, it usually only reaches a locally optimized classifier. To resolve this problem, one of the solutions is to use a discriminative evaluation criterion. For example, in the MMD framework (He et al., 2000), each model represents the stochastic characteristics of a class of acoustic signals and the differences of those stochastic characteristics can be mapped into the dissimilarities of their models. By maximizing the dissimilarities among models, the performance of the classifier can be improved. This discriminative criterion could be used as the evaluation function of the GA training. However, this evaluation function was still not positively proportional to the identification performance. It showed some fluctuations during the training process, although the MMD score could be increased in each generation. The open identification accuracy usually increased at first but might deteriorate later on. This indicates that the discriminative criterion also suffers from the over-training problem. Alternatively, we could take the training results from the first several generations.
6. Conclusion

In this study, we have developed a hybrid GA for GMM-based speaker identification. It makes the training process escape from the initial model parameters and provides a global searching capability to find a more optimized GMM. Using the ML re-estimation as the heuristic operator, the runtime behavior of the GA cycle is further improved. Experimental results on TI46 showed that our GA has better identification performance than the traditional ML method. Based on the dialect results of TIMIT, we also discussed the limitation of the ML criterion. In the future, we will further study the possibility of applying a discriminative criterion such as MMD to the GA training to improve the identification performance.
Table 3
Dialect identification performance

Dialect region    Speaker number    Test sentences    8 mixtures               16 mixtures
                                                      ML (%)      GA (%)       ML (%)      GA (%)
DR1               38                38 × 2            98.68       98.68        100         100
DR2               76                76 × 2            96.05       98.68        98.68       99.34
Acknowledgment

This work is supported by City University of Hong Kong Strategic Grant Number 7001488.

References

Chau, C.W., Kwong, S., Diu, C.K., Fahrner, W.R., 1997. Optimisation of HMM by a genetic algorithm. In: Proceedings of the ICASSP, vol. 3, April 1997, pp. 1727-1730.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39, 1-38.
Farrell, K.R., Mammone, R.J., Assaleh, K.T., 1994. Speaker recognition using neural networks and conventional classifiers. IEEE Transactions on Speech and Audio Processing 2, 194-205.
Furui, S., 1996. An overview of speaker recognition technology. In: Lee, C., Soong, F., Paliwal, K. (Eds.), Automatic Speech and Speaker Recognition. Kluwer Academic Press, Dordrecht.
Garofolo, J.S., Lamel, L.F., et al., 1986. TIMIT Acoustic-Phonetic Continuous Speech Corpus, http://www.ldc.upenn.edu/.
Hattori, H., 1992. Text-independent speaker recognition using neural networks. In: Proceedings of the ICASSP, vol. 2, March 1992, pp. 153-156.
He, Q.H., Kwong, S., Man, K.F., Tang, K.S., 2000. Improved maximum model distance for HMM training. Pattern Recognition 33, 1749-1758.
Kwong, S., Chau, C.W., Man, K.F., Tang, K.S., 2001. Optimisation of HMM topology and its model parameters by genetic algorithms. Pattern Recognition 34, 509-522.
Liberman, M., Amsler, R., et al., 1980. TI46-Word, http://www.ldc.upenn.edu/.
Man, K.F., Tang, K.S., Kwong, S., Halang, W.A., 1997. Genetic Algorithms for Control and Signal Processing. Springer, Berlin.
Matsui, T., Furui, S., 1992. Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs. In: Proceedings of the ICASSP, March 1992, pp. II.157-164.
Rabiner, L.R., Juang, B.-H., 1993. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ.
Reynolds, D.A., 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication 17, 91-108.
Reynolds, D.A., Rose, R.C., 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3 (1), 72-83.
Reynolds, D.A., Rose, R.C., Smith, M.J.T., 1992. PC-based TMS320C30 implementation of the Gaussian mixture model text-independent speaker recognition system. In: Proceedings of the International Conference on Signal Processing Applications and Technology, November 1992, pp. 967-973.
Rose, R.C., Reynolds, D.A., 1990. Text-independent speaker identification using automatic acoustic segmentation. In: Proceedings of the ICASSP, vol. 1, April 1990, pp. 293-296.
Soong, F.K., Rosenberg, A.E., Rabiner, L.R., Juang, B.H., 1985. A vector quantization approach to speaker recognition. In: Proceedings of the ICASSP, pp. 387-390.
Tang, K.S., Man, K.F., Kwong, S., He, Q., 1996. Genetic algorithms and their applications. IEEE Signal Processing Magazine 13 (6), 22-37.
Tishby, N.Z., 1991. On the application of mixture AR hidden Markov models to text independent speaker recognition. IEEE Transactions on Signal Processing 39, 563-570.
Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P., 2000. The HTK Book (for HTK version 3.0), htk.eng.cam.ac.uk/prot-docs/HTKBook/htkbook.html, July 2000.