A genetic classification method for speaker recognition

Engineering Applications of Artificial Intelligence 18 (2005) 13–19

Q.Y. Hong, S. Kwong
Department of Computer Science, City University of Hong Kong, 83 Tatchee Avenue, Kowloon, Hong Kong
Received 24 July 2000; accepted 18 August 2004; available online 18 October 2004

Abstract

The Gaussian mixture model (GMM) has been widely used for modeling speakers. In speaker identification, one major problem is how to generate a set of GMMs for identification purposes based upon the training data. Due to the hill-climbing characteristic of the maximum likelihood (ML) method, any arbitrary estimate of the initial model parameters will usually lead to a sub-optimal model in practice. To resolve this problem, this paper proposes a hybrid training method based on the genetic algorithm (GA). It utilizes the global searching capability of the GA and combines it with the effectiveness of the ML method. Experimental results based on TI46 and TIMIT showed that this hybrid GA could obtain more optimized GMMs and better results than the simple GA and the traditional ML method.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Gaussian mixture model; Speaker identification; Genetic algorithm

1. Introduction

Spoken utterances convey both linguistic and speaker-specific information. Different from speech recognition (Rabiner and Juang, 1993), the objective of speaker recognition (Furui, 1996) is to find out the speaker's identity from the utterances. This technique has many real-world applications, e.g. voice dialing, call centers, information retrieval, secure access control, etc. Speaker recognition can be text-dependent or text-independent. Text-independent systems are generally preferred, since the speaker is not required to utter a specific phrase or sentence. In practice, longer utterances are usually used for this type of system, since they contain more speaker information. Research on speaker recognition has been undertaken for over four decades. Many techniques have been proposed to model the speakers, e.g. vector quantization (VQ) (Soong et al., 1985), the hidden Markov model (HMM) (Tishby, 1991; Matsui and Furui, 1992) and


neural networks (NN) (Hattori, 1992; Farrell et al., 1994). The HMM has been widely used in the area of speech recognition. As a left-to-right model, the HMM has temporal sequence modeling capability and performs well for text-dependent speaker recognition, but for text-independent systems the improvement due to this capability is negligible (Tishby, 1991). The NN method gives good performance compared with the VQ method, but it needs to be retrained whenever a new speaker is added to the system (Reynolds and Rose, 1995). In recent years, it has become more popular to represent speakers with the Gaussian mixture model (GMM) (Rose and Reynolds, 1990; Reynolds and Rose, 1995; Reynolds, 1995). A linear combination of Gaussian functions is capable of representing a large class of sample distributions (Reynolds and Rose, 1995). The GMM is computationally efficient and easily implemented, especially on a real-time platform (Reynolds et al., 1992). In principle, it is a compromise between performance and complexity. There are two fundamental tasks of speaker recognition: identification and verification (Furui, 1996). While speaker verification is to accept or reject the claimed


speaker based on a threshold, speaker identification is to determine the identity among the registered speakers. This study focuses on the GMM-based speaker identification system. As shown in Fig. 1, there is a single GMM for each of the M registered speakers. The speaker model that gives the highest likelihood score for the speech utterance is selected as the identification result. In the GMM framework, one of the major problems is how to generate a set of GMMs for identification purposes from a limited set of training data such that the models accurately match the speaker sources. The commonly used training method is maximum likelihood (ML) estimation because of its simplicity and mathematical tractability. Under the ML criterion, the model parameters are updated repeatedly and the probability of the observation sequences is improved until some limit point. However, due to the hill-climbing characteristic, any arbitrary estimate of the initial model parameters will usually lead to a sub-optimal model in practice. The genetic algorithm (GA) (Tang et al., 1996; Man et al., 1997) is a searching process based on the laws of natural selection and genetics. It emulates individuals in the natural environment, that is, the natural selection mechanism makes the stronger individuals likely winners in the competing environment. Previously, Chau et al. (1997) applied GA training to HMM-based speech recognition and obtained better solutions than the traditional Baum–Welch algorithm (Rabiner and Juang, 1993). Kwong et al. (2001) combined the GA with the Baum–Welch algorithm to form a hybrid GA such that the quality of the results and the runtime behavior of the GA were further improved. In this work, we propose a hybrid GA for the training of GMM-based speaker identification. It uses the ML re-estimation as the heuristic operator to improve the converging speed of the GA. Experimental results showed that the proposed GA for GMM training could obtain more optimized GMMs than the simple GA and the traditional ML estimation method. This paper is organized as follows: Section 2 briefly describes the GMM. Section 3 introduces the ML parameter estimation. After that, optimizing the parameters of the GMM using the GA is described in Section 4.

Section 5 reports and discusses the experimental results on the TI46 and TIMIT databases. Finally, the conclusion is given in Section 6.

Fig. 1. Block diagram of the speaker identification system: the speech utterance is passed through feature extraction, scored against the GMM for each of the M registered speakers (Speaker 1, Speaker 2, ..., Speaker M), and the identification stage outputs the speaker ID.

2. Gaussian mixture model

The Gaussian mixture has a remarkable capability to model irregular data. This characteristic makes the GMM very suitable for obtaining a smooth estimate of a speaker's acoustic variability, e.g. emotion, health, etc. In the GMM-based system, the distribution of the feature vectors extracted from a speaker's speech $X = \{x_t, 1 \le t \le T\}$ is modeled by a weighted sum of $K$ mixture components, which can be defined as

\[ p(x_t \mid \lambda) = \sum_{k=1}^{K} c_k\, N(x_t; \mu_k, R_k), \qquad (1) \]

where $\lambda$ is the brief notation of the GMM parameters:

\[ \lambda = \{ c_k, \mu_k, R_k \}, \quad 1 \le k \le K. \qquad (2) \]

The mixture component $N(x_t; \mu_k, R_k)$ denotes a Gaussian density function with mean vector $\mu_k$ and covariance matrix $R_k$:

\[ N(x_t; \mu_k, R_k) = \frac{1}{(2\pi)^{D/2}\, |R_k|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (x_t - \mu_k)' R_k^{-1} (x_t - \mu_k) \Big\}, \qquad (3) \]

where the prime denotes vector transpose and $D$ is the dimension of the vector $x_t$. It is more popular to use diagonal covariance matrices for the GMM, since a linear combination of diagonal-covariance Gaussians has the same modeling capability as full covariance matrices (Reynolds and Rose, 1995). Another reason is that the speech utterances are usually parameterized with cepstral features, which are compact, discriminative and, most importantly, nearly decorrelated, which allows diagonal covariances to be used by the GMMs. The selection of the mixture number $K$ depends on the amount of available training data. It needs to be large enough to adequately model the acoustic variability of the speech utterances. On the other hand, it should also be small enough to allow a reliable estimate of the model parameters. In any case, the mixture weights must satisfy the following constraint:


\[ \sum_{k=1}^{K} c_k = 1, \qquad c_k \ge 0. \qquad (4) \]
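As a concrete illustration of Eqs. (1)–(4), the following NumPy sketch (ours, not part of the original paper) evaluates the log of the mixture density $p(x_t \mid \lambda)$ for a diagonal-covariance GMM; the arrays weights, means and variances stand for $c_k$, $\mu_k$ and the diagonal of $R_k$.

import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Log of Eq. (1) for one frame x, assuming diagonal covariances.

    x:         (D,) feature vector
    weights:   (K,) mixture weights c_k, summing to 1 (Eq. (4))
    means:     (K, D) mean vectors mu_k
    variances: (K, D) diagonal elements of the covariance matrices R_k
    """
    D = x.shape[0]
    # Log of the Gaussian density of Eq. (3) for every mixture component.
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    # Log of the weighted sum in Eq. (1), computed stably in the log domain.
    m = np.max(log_components)
    return m + np.log(np.sum(np.exp(log_components - m)))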

Given a sequence of observations $X = \{x_t, 1 \le t \le T\}$, the likelihood of observing $X$ by speaker $y$ is denoted by the following expression:

\[ P(X \mid \lambda_y) = \prod_{t=1}^{T} p(x_t \mid \lambda_y). \qquad (5) \]

For speaker identification, we have

\[ \hat{y} = \arg\max_{y} P(X \mid \lambda_y), \qquad (6) \]

with $\hat{y}$ being the identified speaker that attains the highest likelihood score among all the registered speakers.
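Building on the previous sketch, a minimal (hypothetical) implementation of the identification rule of Eqs. (5) and (6): each registered speaker's GMM scores the whole utterance in the log domain, and the best-scoring speaker is returned.

def identify_speaker(frames, speaker_models):
    """Eq. (6): pick the registered speaker with the highest log P(X | lambda_y).

    frames:         (T, D) array of feature vectors x_1..x_T
    speaker_models: list of (weights, means, variances) tuples, one GMM per speaker
    """
    best_id, best_score = None, -np.inf
    for speaker_id, (weights, means, variances) in enumerate(speaker_models):
        # Eq. (5) in the log domain: sum of per-frame log densities.
        score = sum(gmm_log_density(x, weights, means, variances) for x in frames)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id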

3. Maximum likelihood parameter estimation

Before using the GMM for speaker identification, the model parameters must be trained to describe the observation sequences of the speaker accurately. There are many criteria that can be used to estimate the model parameters. The most common one is the ML criterion, which uses a training sequence of observations $X$ to derive the set of model parameters $\lambda$, yielding

\[ \lambda_{\mathrm{ML}} = \arg\max_{\lambda} P(X \mid \lambda), \qquad (7) \]

where $P(X \mid \lambda)$ is the probability of the observation sequence. The ML parameters can be obtained through the Expectation–Maximization (EM) algorithm (Dempster et al., 1977). It provides a set of re-estimation formulas and guarantees that the re-estimated GMM $\bar{\lambda}$ will be equal to or better than the initial model $\lambda$. The recursive ML re-estimation formulas for multiple observation sequences are as follows:

\[ \bar{c}_k = \sum_{c=1}^{C}\sum_{t=1}^{T_c} \gamma_t^c(k) \Big/ \sum_{k=1}^{K}\sum_{c=1}^{C}\sum_{t=1}^{T_c} \gamma_t^c(k), \qquad (8) \]

\[ \bar{\mu}_k = \sum_{c=1}^{C}\sum_{t=1}^{T_c} \gamma_t^c(k)\, x_t^c \Big/ \sum_{c=1}^{C}\sum_{t=1}^{T_c} \gamma_t^c(k), \qquad (9) \]

\[ \bar{R}_k = \sum_{c=1}^{C}\sum_{t=1}^{T_c} \gamma_t^c(k)\, (x_t^c - \mu_k)(x_t^c - \mu_k)' \Big/ \sum_{c=1}^{C}\sum_{t=1}^{T_c} \gamma_t^c(k), \qquad (10) \]

where $\bar{c}_k$, $\bar{\mu}_k$, $\bar{R}_k$ are the model parameters of $\bar{\lambda}$, $C$ is the number of observation sequences, $T_c$ is the length of the $c$th observation sequence for the speaker model, and $\gamma_t^c(k)$ is the probability at time $t$ of the $k$th mixture component accounting for the $c$th observation sequence, which is defined as

\[ \gamma_t^c(k) = \frac{c_k\, N(x_t^c; \mu_k, R_k)}{\sum_{j=1}^{K} c_j\, N(x_t^c; \mu_j, R_j)}. \qquad (11) \]

Before the parameter estimation, a prototype model is usually initialized via the K-means algorithm (Rabiner and Juang, 1993). This prototype model is then re-estimated with the ML formulas. Using Eqs. (8)–(10), the model parameters are improved by repeating the substitution $\lambda = \bar{\lambda}$. This re-estimation process is terminated when a convergence criterion based on a threshold is met. The monitored quantity is usually the average of the logarithms of the probabilities of the observation sequences; it will be described in more detail later (Section 5). To avoid confusion, we refer to the whole training process through the EM algorithm as the ML method and to each single re-estimation as an ML re-estimation. The ML method is widely used because of its simplicity and mathematical tractability. However, as a hill-climbing algorithm, it will usually lead to a sub-optimal model for any arbitrarily chosen initial model $\lambda$. In the following, we introduce a GA-based training for the GMM. The GA provides a global searching capability for the optimization problem. Therefore, the GMM training can escape from the initial guess and find the optimal solution if we apply the GA to the training process.
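For illustration, one ML re-estimation step of Eqs. (8)–(11) can be sketched as follows, again assuming diagonal covariances (so $R_k$ is handled as a vector of variances); this is our reading of the formulas, not the authors' code.

def ml_reestimate(sequences, weights, means, variances):
    """One EM re-estimation step, Eqs. (8)-(11), for diagonal-covariance GMMs.

    sequences: list of (T_c, D) arrays, the C training observation sequences
    Returns the updated (weights, means, variances).
    """
    K, D = means.shape
    gamma_sum = np.zeros(K)        # sum over c, t of gamma_t^c(k)
    x_sum = np.zeros((K, D))       # gamma-weighted sum of x_t^c
    x2_sum = np.zeros((K, D))      # gamma-weighted sum of (x_t^c - mu_k)^2
    for seq in sequences:
        for x in seq:
            # Eq. (11): posterior probability of each mixture component for this frame.
            log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1))
            log_comp = np.log(weights) + log_norm \
                       - 0.5 * np.sum((x - means) ** 2 / variances, axis=1)
            gamma = np.exp(log_comp - np.max(log_comp))
            gamma /= np.sum(gamma)
            gamma_sum += gamma
            x_sum += gamma[:, None] * x
            x2_sum += gamma[:, None] * (x - means) ** 2
    new_weights = gamma_sum / np.sum(gamma_sum)       # Eq. (8)
    new_means = x_sum / gamma_sum[:, None]            # Eq. (9)
    new_variances = x2_sum / gamma_sum[:, None]       # Eq. (10), diagonal of R_k
    return new_weights, new_means, new_variances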

4. Optimizing the parameters using the genetic algorithm

In this section, the data structure of the phenotype and the representation of the chromosome in the GA training, the population initialization technique, the fitness evaluation and the genetic operations will be described in detail. Our GA is a hybrid GA, where ML re-estimations are used as a special heuristic operator to improve its performance.

4.1. Encoding mechanism

In Section 2, we discussed the advantages of diagonal covariances compared with full covariances for the GMM. Assuming that this kind of covariance is used, the phenotype of a single speaker model for the GA training can be represented using the structure given in Fig. 2, where K is the number of mixtures and D is the dimension of the observation vector. The GMM parameters are represented with the mixture weight array, the mean matrix and the covariance matrix, respectively. Although the three kinds of parameters of the same mixture are stored in different places, they are actually correlated and will be treated as a unit in the crossover operation. As shown in Fig. 2, the basic data type of the elements of the GMM is the real number. We therefore use a real-number string instead of a bit-string as the representation of the chromosomes in our GA (see Fig. 3), where C_k, mV_{k,l} and Sigma_{k,l} are the weight, the lth mean element and the lth diagonal covariance element of the kth mixture component, respectively.


struct GMM {
    float C[K];        /* mixture weight */
    float mV[K][D];    /* mean matrix */
    float Sigma[K][D]; /* diagonal covariance matrix */
};

Fig. 2. The data structure of the phenotype in the GA training of GMM.

Fig. 3. The representation of the chromosome in the GA training: a real-number string containing the mixture weights C_k, the mean elements mV_{k,l} and the diagonal covariance elements Sigma_{k,l} of all K mixture components.
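To make the encoding concrete, here is a small illustrative sketch of flattening the GMM parameters of Fig. 2 into the real-number chromosome of Fig. 3 and restoring them. The exact gene ordering (weights, then means, then covariances) is an assumption; the paper only specifies that a real-number string is used.

def encode(weights, means, variances):
    """Flatten the GMM parameters into a real-number chromosome (Fig. 3)."""
    return np.concatenate([weights.ravel(), means.ravel(), variances.ravel()])

def decode(chromosome, K, D):
    """Restore the weight array, mean matrix and diagonal covariance matrix (Fig. 2)."""
    weights = chromosome[:K]
    means = chromosome[K:K + K * D].reshape(K, D)
    variances = chromosome[K + K * D:].reshape(K, D)
    return weights, means, variances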

4.2. Population initialization

Two strategies can be used to initialize the population. The first one is to generate the chromosomes directly, without any heuristic knowledge. The second is to initialize them based on a prototype model, which, as in the ML method, is obtained via the K-means algorithm. Experimental results showed that the second strategy is the better choice. The initialization procedure can be described as follows (a code sketch is given after Section 4.3):

Step 1: Initialize a prototype model λ_0 based on the training data.
Step 2: For n = 1, ..., P:
    For k = 1, ..., K:
        λ_n.C[k] = λ_0.C[k] × G(1.0, 0.2), where G(1.0, 0.2) is a Gaussian random generator with mean 1.0 and variance 0.2.
        For l = 1, ..., D:
            1. λ_n.mV[k][l] = λ_0.mV[k][l] × G(1.0, 0.2);
            2. λ_n.Sigma[k][l] = λ_0.Sigma[k][l] × G(1.0, 0.3).

Here P is the size of the population. The variances of the Gaussian random generators are determined experimentally.

4.3. Fitness evaluation

In the GA training, the fitness values are the results of the objective function, which is defined as the average of the logarithms of the probabilities of the observation sequences $X_1, X_2, \ldots, X_C$ generated by the given $n$th GMM $\lambda_n$:

\[ lp_n = \left( \sum_{i=1}^{C} \log P(X_i \mid \lambda_n) \right) \Big/ C, \qquad (12) \]

where $lp_n$ is the fitness value of the $n$th chromosome in the population and $C$ is the number of observation sequences in the training data. This objective function only considers the likelihood for a single speaker. It is also used to determine the termination condition in the ML method. The probability $P(X_i \mid \lambda_n)$ is calculated through Eqs. (1), (3) and (5). During the training process, some mixtures of the GMM may become corrupted. In this case, we discard them and calculate the average log probability based only on the remaining mixtures.
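A minimal sketch of the initialization procedure of Section 4.2 and the fitness of Eq. (12), assuming the diagonal-covariance representation and the gmm_log_density helper defined earlier; the renormalization of the perturbed weights is our own guard, whereas the paper applies threshold checking before the ML re-estimation.

def initialize_population(prototype, P, rng=np.random.default_rng()):
    """Step 2: perturb the prototype model (weights, means, variances) P times."""
    weights0, means0, variances0 = prototype
    population = []
    for _ in range(P):
        # Multiplicative Gaussian noise G(1.0, 0.2) / G(1.0, 0.3); variance 0.2 -> std sqrt(0.2).
        w = weights0 * rng.normal(1.0, np.sqrt(0.2), size=weights0.shape)
        m = means0 * rng.normal(1.0, np.sqrt(0.2), size=means0.shape)
        v = variances0 * rng.normal(1.0, np.sqrt(0.3), size=variances0.shape)
        w = np.abs(w) / np.sum(np.abs(w))   # simple guard so Eq. (4) still holds
        population.append((w, m, np.abs(v)))
    return population

def fitness(sequences, model):
    """Eq. (12): average log probability of the C training sequences."""
    weights, means, variances = model
    total = sum(gmm_log_density(x, weights, means, variances)
                for seq in sequences for x in seq)
    return total / len(sequences)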

4.4. Genetic operations

In the training process, three genetic operators are designed: mixture crossover, mutation and the heuristic operator. Without any heuristic operator, the convergence speed of the GA cycle is slow; for example, the simple GA needed 20,000 generations to obtain a better HMM solution than the traditional method (Chau et al., 1997). To overcome this shortcoming, we use the ML re-estimation as a heuristic operator. A sketch of the crossover and mutation operators is given at the end of this subsection.

Mixture crossover. In the GMM framework, mixture crossover is used. It is a derivative of the standard crossover operator. Two parents are selected from the population pool based on the roulette wheel mechanism (Man et al., 1997; Kwong et al., 2001). Five mixtures are randomly selected from one parent and replaced with the corresponding mixtures of the other parent; experimental results show that five is a good choice. The first parent is retained as the offspring.

Mutation. Mutation introduces local variations to the individuals for searching different solution spaces and keeps the diversity of the population. It recovers information lost in the initialization phase, makes the GA escape from the initial model parameters and helps to find the optimal model parameter set. This operation is conducted at the level of each model parameter. If mutation takes place, the model parameter is multiplied by a random number drawn from a Gaussian generator with mean 1.0 and variance 0.001.

ML re-estimation. In each generation, each chromosome in the offspring is re-estimated five times with the ML formulas of Eqs. (8)–(10). In addition, every ten GA generations, each chromosome in the population is further re-estimated eight times. Since the GA operations may violate the constraints on the model parameters, threshold checking is conducted before the ML re-estimation, and the sum of the mixture weights is normalized to 1.0. In the replacement stage, we generate the new population by replacing the worst chromosomes in the current population with the sub-population (the pool of offspring). Finally, the control parameters for the GA training are summarized in Table 1.
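Under the same assumptions, a sketch of the mixture crossover and mutation operators (parent selection, the ML re-estimation schedule and the replacement stage are omitted; the swap count and mutation variance follow the text, and the mutation rate follows Table 1).

def mixture_crossover(parent_a, parent_b, n_swap=5, rng=np.random.default_rng()):
    """Replace n_swap randomly chosen mixtures of parent_a with those of parent_b."""
    w_a, m_a, v_a = (x.copy() for x in parent_a)
    w_b, m_b, v_b = parent_b
    K = w_a.shape[0]
    for k in rng.choice(K, size=n_swap, replace=False):
        # The weight, mean vector and covariance vector of a mixture move as one unit.
        w_a[k], m_a[k], v_a[k] = w_b[k], m_b[k], v_b[k]
    w_a = w_a / np.sum(w_a)   # renormalize the mixture weights to sum to 1.0
    return w_a, m_a, v_a

def mutate(model, rate=0.01, rng=np.random.default_rng()):
    """With probability `rate`, multiply each parameter by G(1.0, 0.001)."""
    mutated = []
    for param in model:
        mask = rng.random(param.shape) < rate
        noise = rng.normal(1.0, np.sqrt(0.001), size=param.shape)
        mutated.append(np.where(mask, param * noise, param))
    return tuple(mutated)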

Table 1
The control parameters for the GA training

Parameters                              Setting
The size of population                  30
The size of offspring chromosomes       5
The mixture crossover rate              0.08
The mutation rate                       0.01
The maximum number of generations       30

5. Experimental results

Experiments are based on the TI46 (Liberman et al., 1980) and TIMIT (Garofolo et al., 1986) corpora, respectively. Speech utterances from these two databases were parameterized with mel-frequency cepstral coefficients (MFCCs). We used the HTK toolkit 3.0 (Young et al., 2000) to conduct the feature extraction procedure. The speech signal was pre-emphasized using a coefficient of 0.97. Each frame of speech was windowed with a Hamming window and represented by a 24-dimensional feature vector, which consists of 12 MFCCs and their first differentials. The results of the traditional ML method are used as the baseline for comparison. In the ML training, re-estimation is terminated when the relative increase of the average log probability of the training data between two successive iterations is less than 0.01%. The associated limits for the mixture weights and the elements of the diagonal covariances are 1.0/(30.0 × K) and 0.0001, respectively, where K refers to the number of mixtures. The same threshold checking is conducted for the GA training.
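An illustrative sketch of the stopping rule and the limits described above, reusing the ml_reestimate and fitness sketches from earlier sections; interpreting the associated limits as floors applied after each re-estimation is our assumption.

def floor_check(model, K):
    """Apply the limits used in training: 1.0/(30.0*K) for the mixture weights and
    0.0001 for the diagonal covariance elements, then renormalize the weights."""
    weights, means, variances = model
    weights = np.maximum(weights, 1.0 / (30.0 * K))
    weights = weights / np.sum(weights)
    variances = np.maximum(variances, 1e-4)
    return weights, means, variances

def train_ml(sequences, model, K, rel_tol=1e-4, max_iter=200):
    """Repeat ML re-estimation until the relative gain in average log probability
    between two successive iterations falls below 0.01% (rel_tol = 1e-4)."""
    prev = fitness(sequences, model)
    for _ in range(max_iter):
        model = floor_check(ml_reestimate(sequences, *model), K)
        cur = fitness(sequences, model)
        if abs((cur - prev) / prev) < rel_tol:
            break
        prev = cur
    return model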

5.1. Results of TI46

The TI46 corpus contains 16 speakers: 8 males labeled M1–M8 and 8 females labeled F1–F8. The complete corpus is divided into two directories: TI20 and TI_ALPHA. TI20 contains all utterances of the words 'ZERO'–'NINE' and the words 'ENTER'–'YES'. From this sub-corpus, we select the 10-command set {enter, erase, go, help, no, rubout, repeat, stop, start, yes} of the 8 female speakers to conduct text-dependent speaker identification. There are 100 training utterances for each speaker. The whole test sections are selected as the test data; each word has 16 test utterances spoken by each speaker, so there are 1280 test utterances in total for the identification evaluation. To demonstrate the effectiveness of the heuristic operator, we compare the final fitness of the hybrid GA and of the simple GA (without the ML re-estimation operator) after 30 generations. The results are given in Fig. 4, from which

Fig. 4. The fitness comparison of the simple GA (without the ML re-estimation operator) and the hybrid GA for 8 female speakers. Each speaker model consists of 32 mixtures.

we can see that the value of lp for each speaker is increased significantly by the heuristic operator. To compare the performance of the ML and GA training, GMMs are trained with 16, 32 and 64 mixtures, respectively. Table 2 gives the results in terms of fitness and identification accuracy. In all cases, the GMMs trained by the GA have higher values of lp than the GMMs trained by the ML method, which implies that the GMMs trained by our GA are more optimized. After the training, we compare the identification accuracy; the results show that the GA gives better identification performance.

5.2. Results of TIMIT

TIMIT contains a total of 630 speakers from 8 major dialect regions (labeled DR1 to DR8) of the United States. There are 10 sentences for each speaker, including 2 SA sentences, which are the same across all speakers, 5 SX sentences and 3 SI sentences. The SX and SI sentences differ from each other and across speakers. The speech was recorded through a high-quality microphone in a quiet environment. From the second dialect region (DR2) in the train section of TIMIT, we select 23 female speakers to conduct text-independent speaker identification. The number of training sentences per speaker ranges from 4 to 8 and the remaining sentences are used as the test data. Each GMM consists of 8 mixtures. The results of ML and GA are compared in Fig. 5. In the first two cases, when 4 or 5 training sentences are used, the identification accuracy of the GA is lower than that of the ML method. This might be due to over-training on the limited training data. When there are 6, 7 or 8 training sentences, the performance of the GA is equal to or better than that of ML. Thus our GA training method is preferable when enough training data are available. Actually, over-training is also a problem of the traditional ML method. It is difficult to strike a balance between the


Table 2
Experimental results: lp_n and identification accuracy

Trained speaker   16 mixtures             32 mixtures             64 mixtures
                  ML          GA          ML          GA          ML          GA
F1                -3787.80    -3786.82    -3705.70    -3701.32    -3617.96    -3609.59
F2                -3791.18    -3779.34    -3700.24    -3691.50    -3619.87    -3604.43
F3                -4113.22    -4104.45    -4036.45    -4028.66    -3944.00    -3935.72
F4                -4162.01    -4154.72    -4083.44    -4078.08    -3988.24    -3982.03
F5                -4132.95    -4122.38    -4045.24    -4032.45    -3948.83    -3932.45
F6                -4515.99    -4512.31    -4434.00    -4423.31    -4351.70    -4336.34
F7                -4416.49    -4413.68    -4319.99    -4307.82    -4199.82    -4192.83
F8                -3874.24    -3869.81    -3790.61    -3783.49    -3689.15    -3678.62
Accuracy          89.61%      90.39%      94.22%      95.78%      97.73%      98.20%

Fig. 5. Identification results of 23 female speakers: identification accuracy of the ML and GA training as a function of the number of training sentences (4–8).

re-estimation number and the open test performance. Generally, when the amount of training data is large enough, more re-estimations on the training data lead to higher accuracy on the test data. On the other hand, only 20 or even fewer ML re-estimations are needed when only limited amounts of training data are available. To further compare their performance in the case of enough training data, we conduct the identification experiment on the first two dialect regions, DR1 and DR2, respectively. There are 8 training sentences per speaker and the training process is conducted for 8 mixtures and 16 mixtures, respectively. The results are shown in Table 3. For the 76 speakers of DR2, we can see that the identification results of the GA are better than those of the ML method. For the 38 speakers of DR1, however, there is no improvement through the GA training, although the value of lp of the GMM trained by the GA is higher than that of the GMM trained by ML. We tested the other dialect regions and some of them show the same result. This further indicates that increasing the value of lp is not always an effective strategy for improving the identification performance. It can be explained by the fact that the ML criterion only considers the likelihood

for a single speaker, that is, each model is estimated separately using its labeled training utterances. When there are confusable models or the training data is limited, it usually reaches only a locally optimized classifier. To resolve this problem, one solution is to use a discriminative evaluation criterion. For example, in the maximum model distance (MMD) framework (He et al., 2000), each model represents the stochastic characteristics of a class of acoustic signals, and the differences among those stochastic characteristics can be mapped into dissimilarities of their models. By maximizing the dissimilarities among models, the performance of the classifier can be improved. This discriminative criterion could be used as the evaluation function of the GA training. However, this evaluation function was still not positively proportional to the identification performance. It showed some fluctuations during the training process, although the MMD score could be increased in each generation. The open identification accuracy usually increased at first but might deteriorate later on. This indicates that the discriminative criterion also suffers from over-training. Alternatively, we could take the training results from the first several generations.

6. Conclusion

In this study, we have developed a hybrid GA for GMM-based speaker identification. It makes the training process escape from the initial model parameters and provides a global searching capability to find a more optimized GMM. Using the ML re-estimation as the heuristic operator, the runtime behavior of the GA cycle is further improved. Experimental results on TI46 showed that our GA has better identification performance than the traditional ML method. Based on the dialect results of TIMIT, we also discussed the limitations of the ML criterion. In the future, we will further study the possibility of applying a discriminative criterion such as MMD to the GA training to improve the identification performance.


Table 3
Dialect identification performance

Dialect region   Speaker number   Test sentences   8 mixtures            16 mixtures
                                                   ML (%)     GA (%)     ML (%)     GA (%)
DR1              38               38 × 2           98.68      98.68      100        100
DR2              76               76 × 2           96.05      98.68      98.68      99.34

Acknowledgment

This work is supported by City University of Hong Kong Strategic Grant Number 7001488.

References

Chau, C.W., Kwong, S., Diu, C.K., Fahrner, W.R., 1997. Optimisation of HMM by a genetic algorithm. In: Proceedings of the ICASSP, vol. 3, April 1997, pp. 1727–1730.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39, 1–38.
Farrell, K.R., Mammone, R.J., Assaleh, K.T., 1994. Speaker recognition using neural networks and conventional classifiers. IEEE Transactions on Speech and Audio Processing 2, 194–205.
Furui, S., 1996. An overview of speaker recognition technology. In: Lee, C., Soong, F., Paliwal, K. (Eds.), Automatic Speech and Speaker Recognition. Kluwer Academic Press, Dordrecht.
Garofolo, J.S., Lamel, L.F., et al., 1986. TIMIT Acoustic-Phonetic Continuous Speech Corpus. http://www.ldc.upenn.edu/.
Hattori, H., 1992. Text-independent speaker recognition using neural networks. In: Proceedings of the ICASSP, vol. 2, March 1992, pp. 153–156.
He, Q.H., Kwong, S., Man, K.F., Tang, K.S., 2000. Improved maximum model distance for HMM training. Pattern Recognition 33, 1749–1758.
Kwong, S., Chau, C.W., Man, K.F., Tang, K.S., 2001. Optimisation of HMM topology and its model parameters by genetic algorithms. Pattern Recognition 34, 509–522.

Liberman, M., Amsler, R., et al., 1980. TI46-Word. http://www.ldc.upenn.edu/.
Man, K.F., Tang, K.S., Kwong, S., Halang, W.A., 1997. Genetic Algorithms for Control and Signal Processing. Springer, Berlin.
Matsui, T., Furui, S., 1992. Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs. In: Proceedings of the ICASSP, March 1992, pp. II.157–164.
Rabiner, L.R., Juang, B.-H., 1993. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ.
Reynolds, D.A., 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication 17, 91–108.
Reynolds, D.A., Rose, R.C., 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3 (1), 72–83.
Reynolds, D.A., Rose, R.C., Smith, M.J.T., 1992. PC-based TMS320C30 implementation of the Gaussian mixture model text-independent speaker recognition system. In: Proceedings of the International Conference on Signal Processing Applications and Technology, November 1992, pp. 967–973.
Rose, R.C., Reynolds, D.A., 1990. Text-independent speaker identification using automatic acoustic segmentation. In: Proceedings of the ICASSP, vol. 1, April 1990, pp. 293–296.
Soong, F.K., Rosenberg, A.E., Rabiner, L.R., Juang, B.H., 1985. A vector quantization approach to speaker recognition. In: Proceedings of the ICASSP, pp. 387–390.
Tang, K.S., Man, K.F., Kwong, S., He, Q., 1996. Genetic algorithms and their applications. IEEE Signal Processing Magazine 13 (6), 22–37.
Tishby, N.Z., 1991. On the application of mixture AR hidden Markov models to text independent speaker recognition. IEEE Transactions on Signal Processing 39, 563–570.
Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P., 2000. The HTK Book (for HTK version 3.0). http://htk.eng.cam.ac.uk/, July 2000.