COMPUTER SPEECH AND LANGUAGE
Computer Speech and Language 24 (2010) 545–561
A real-time trained system for robust speaker verification using relative space of anchor models

Ali Sadeghi Naini a,b,*, M. Mehdi Homayounpour a, Abbas Samani b,c

a Laboratory for Intelligent Sound and Speech Processing, Computer Engineering Department, Amirkabir University of Technology (Tehran Polytechnic), 424 Hafez Avenue, Tehran, Iran
b Department of Electrical and Computer Engineering, University of Western Ontario, London, ON, Canada N6A 5B9
c Department of Medical Biophysics, University of Western Ontario, London, ON, Canada N6A 5C1

Received 16 June 2008; received in revised form 1 May 2009; accepted 12 July 2009
Available online 15 July 2009
Abstract

A real-time trained system for robust speaker verification is proposed. This system was developed using a relative space of reference speakers, also referred to as anchor models. The real-time training aspect of the system is based on the properties of this relative space. The relative space concept uses a relative speaker representation rather than an absolute representation, by comparing the speaker to a set of well-trained reference speakers. The advantage of this approach is that, instead of estimating numerous parameters of an absolute model for a speaker, only a few parameters of a model relative to a number of anchor models are estimated. In order to optimize the performance of the proposed system, several techniques were assessed for possible implementation in various blocks of the system. The best performance was achieved when the vectors’ mutual angle, normalized with the Minimum normalization method, was applied to speaker verification in conjunction with an orthogonal relative space of virtual reference speakers. In this case, an Equal Error Rate (EER) of 0.12% on 400 test samples of 100 speakers was obtained. In addition to assessment under normal conditions, the developed speaker verification system was also evaluated under abnormal conditions where the speech was noisy or telephonic. Experiments conducted in this case demonstrated that, in most cases, this system outperforms absolute-space-based systems even with shortened training speech sequences. Another major contribution of this research is the development of a more complex speaker verification system capable of tackling abnormal conditions more effectively. In this case, other features of the relative space approach were employed. For this purpose, a novel enrichment method was developed to construct a relative space of anchor models trained to tackle noise.
The results of the experiments conducted in this part of the research demonstrated an excellent ability of this approach to tackle abnormal conditions. Compared to the absolute-space-based system, applying this method in the relative space led to lower speaker verification error rates in all cases, even at low SNR values.

© 2009 Elsevier Ltd. All rights reserved.

Keywords: Speaker verification; Robust; Noisy condition; Real-time training; Relative space; Absolute space; Anchor models; Reference speakers; Eigenspace; Normalization; Orthogonal
* Corresponding author. Current address: Department of Electrical and Computer Engineering, University of Western Ontario, London, ON, Canada N6A 5B9. Tel.: +1 519 661 2111x86228; fax: +1 519 850 2436.
E-mail addresses: [email protected], [email protected] (A. Sadeghi Naini), [email protected] (M.M. Homayounpour), [email protected] (A. Samani).
0885-2308/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.csl.2009.07.002
A. Sadeghi Naini et al. / Computer Speech and Language 24 (2010) 545–561
1. Introduction

Speaker verification systems have various applications that involve identity authentication through the World Wide Web, telephone lines, and direct microphone speech data acquisition systems. For example, secure web- or telephone-banking can be facilitated efficiently using such systems. Furthermore, over the past few years, reliable security systems, where the security of the public at a large scale is concerned, have become a matter of paramount importance. Another important application of speaker verification systems is crime investigation conducted by law enforcement organizations. In such applications, usually only short speech segments are available, and processing them with current systems often fails to yield reliable results. Another hurdle associated with this application is that speech segments are usually contaminated with noise, including channel noise or environmental noise. This further degrades the ability of current systems to produce correct output. These shortcomings of current verification systems have led many research groups to investigate the possibility of developing more effective systems. The main purpose of this research is to provide solutions to address these issues. These solutions employ novel techniques based on features extracted in the relative space.

Feature extraction is a primary phase in speaker verification systems. Speech features used in a speaker verification system fall into two categories based on the space in which they are defined. One category includes features defined in an absolute (non-relative) space, while the other includes features defined in a relative space. For the first category, the representation of a speaker in the feature space is not related to any reference speaker. While there is a significant body of literature on features in the absolute space, very little research has been conducted to investigate the properties of features extracted in the relative space. 
MFCC (Young, 1996), LPCC (Rabiner and Juang, 1993), wavelet coefficients (Avci and Akpolat, 2006), etc. are among the most prevalent speech features in absolute space. Recently, Campbell et al. used Maximum A Posteriori (MAP) adapted GMM mean supervectors as an absolute feature with Support Vector Machine (SVM) as a discriminative model for speaker verification (Campbell et al., 2006). As an attempt to investigate the relative space approach and its application in verification systems, a number of groups have recently utilized such approach in developing novel speaker verification systems. For features defined in a relative space, each speaker in the feature space is represented relative to some reference speakers. Given that the feature extraction phase in any verification system can be performed independently from its verification phase, the latter will not be affected by the choice of feature space. As such, extracted features in the relative space can be applied in conjunction with any other set of techniques from the verification phase menu that are deemed more suitable. Relative feature space was first developed for speaker adaptation in speech recognition. Merlin et al. proposed a new approach to speaker recognition and indexation systems, based on nondirectly-acoustic processing in the relative space (Merlin et al., 1999). In 2000 Kuhn et al. introduced the eigenvoices concept and represented each new speaker relative to eigenvoices (Kuhn et al., 2000). Eigenvoices were, then, used by Thyes et al. for speaker identification and verification (Thyes et al., 2000). Later, other researchers used a different approach where they introduced the idea of space of anchor models to represent enrolled speakers in verification systems, and to verify a test speaker in a relative feature space (Mami and Charlet, 2002, 2003, 2006). The main concept of anchor models space, which will be described in the methods, has been utilized in this research. 
In this study, the effectiveness of applying various techniques involved in the development of a speaker verification system in a relative feature space of reference speakers has been assessed. As a result, the most effective techniques have been identified and optimal values of the parameters involved in the system have been determined and used to develop a novel speaker verification system. The proposed system was developed using a novel relative space of reference speakers, also referred to as anchor models. The real-time training aspect of this speaker verification system is based on the relative space’s intriguing features and properties as described in the methods. Real-time training in speaker verification systems is often required in applications where acquisition of long segments of speech data for clients’ enrollment in the system is either not desirable for clients or not feasible. Examples of such applications include temporary security systems such as exhibition authentication systems, large-scale security systems involving biometrics passports or national ID, and criminal identity determination associated with investigative procedures where long segments of speech data is not available in the crime scene. Real-time training in speaker verification systems is also desirable in applications
where clients may not feel comfortable spending a relatively long time for enrollment in the security system, e.g., in online security systems. Another important contribution of this work is improving the accuracy of existing verification systems in the relative space for which distance normalization was applied in the relative space. Among the applied normalization techniques, the Minimum normalization method provided the highest accuracy. Furthermore, a simple clustering method was developed and implemented in the proposed systems. This method clusters the anchor models, thus leading to reduction of the relative space complexity, hence reducing the system’s response time and increasing its accuracy. To further enhance the performance of the system, an orthogonalization method was employed on the coordinate vectors in the relative space. This has led to significant improvement in the relative space discrimination ability. Under various circumstances, speech signals can be distorted or contaminated with noise. Under such abnormal conditions, existing speaker verification systems such as the ones proposed recently (Mami and Charlet, 2002, 2003, 2006) are not adequate. In this research, the developed speaker verification system was employed with data obtained under abnormal conditions where noisy or telephonic speech sequence contamination was present. Results obtained in this case confirmed that, in most cases, this system outperforms absolute space based systems even with shortened training speech sequences. Another major contribution of this research is the development of a more complex speaker verification system capable of tackling abnormal conditions more effectively. In this case, other interesting features of the relative space approach were employed. For this purpose, a novel enrichment method was employed to construct a relative space of anchor models trained to tackle noise. 
Applying this enrichment method has proven to be more effective for tackling noisy conditions. Another major difficulty a speaker verification system needs to overcome is session variability, which frequently happens in real world applications. It refers to all of the phenomena, which cause two recordings of a given speaker to sound different from each other. Recently, new techniques of joint factor analysis and eigenchannels were introduced by Kenny et al. to deal with the session variability in speaker recognition systems (Kenny et al., 2007). These techniques had a profound impact on the speaker verification field of research (Fauve et al., 2007; Matrouf et al., 2007). As discussed in Section 6, it is speculated that the relative space approach may also be able to tackle session variability efficiently. This paper is organized such that the main concept of relative space of anchor models will be first introduced in the following section. In Section 3, primary and optional steps for a speaker verification system in such space will be described. Section 4 focuses on the details of the systems implemented in this study and the conditions under which the experiments were conducted. In Section 5, the conducted experiments and their observed results will be presented and discussed, and finally in Section 6, the proposed techniques and results will be discussed and conclusions made.
2. Relative space of anchor models

In order to extract relative features from a speech segment, the space containing these features should first be defined. The main idea is to build an orthogonal space where each axis represents some general characteristics of the speakers’ voices. In effect, a speech representation space is defined that contains useful and common information about the voices of a diverse group of speakers. Generally, this supplementary information can complement the system’s available information for speaker verification, and thus improve its performance. Another interesting feature of the relative space approach is that it allows the space to be enriched with information known a priori about the conditions under which the verification system is to be applied. Such enrichment can help the tasks involved in the system to be performed more efficiently. To the authors’ knowledge, this feature has not been utilized in existing systems. As will be described in Section 5.2, this feature was exploited in the proposed speaker verification system to tackle noisy abnormal conditions.

A relative space is constructed using speech data from a sufficient number of speakers, called reference speakers. For each reference speaker, a model (a GMM in this research) is trained using absolute feature vectors extracted from his/her speech data. As described in Section 3, the normalized outputs of these GMMs for the absolute feature vectors extracted from an incoming speaker’s speech segments compose that speaker’s coordinate vector in the relative space (Eqs. (3) and (4)). Section 4.3 presents the specifications of the GMM training process applied in this research.
A set of anchor models may include real or virtual speakers. Applying virtual anchor models may lead to time and memory complexity reduction, system response time reduction, and (as it will be discussed in Section 5) it may even improve the system performance. A virtual anchor model may be constructed using speech data from a group of real speakers by a clustering method. Various methods can be used for anchor models clustering. Our simple algorithm for constructing virtual anchor models is as follows. To be able to compare virtual speakers easily, in this algorithm each virtual speaker is constructed using an equal amount of training data.
Input: a set of real speakers
1. Train a model for each speaker.
2. Based on a similarity criterion, find the two speakers whose GMM models are the most similar:
   (a) Eliminate these two speakers from the list of real speakers.
   (b) Construct a new virtual speaker using the feature vectors of these two similar speakers.
   (c) Add the new speaker to the virtual speaker set, which includes all constructed virtual speakers.
3. If the set of real speakers is empty, go to the next step; otherwise return to step 2.
4. Train a GMM model for each virtual speaker.
Output: a set of trained models belonging to the virtual speakers

To assess speaker similarity in this algorithm, distance measures such as the Euclidean distance (Eq. (1)), the Bhattacharyya distance (Eq. (2)), or other distance measures can be used.

\[ D_{\mathrm{Euclidean}}(M_i, M_j) = \sqrt{\sum_{k=1}^{d} (\mu_{jk} - \mu_{ik})^2} = \sqrt{(\mu_j - \mu_i)^T (\mu_j - \mu_i)} \tag{1} \]

\[ D_{\mathrm{Bhattacharyya}}(M_i, M_j) = \frac{1}{8} (\mu_j - \mu_i)^T \left[ \frac{\Sigma_i + \Sigma_j}{2} \right]^{-1} (\mu_j - \mu_i) + \frac{1}{2} \log \frac{\left| \frac{\Sigma_i + \Sigma_j}{2} \right|}{\sqrt{|\Sigma_i| \, |\Sigma_j|}} \tag{2} \]
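The pairing procedure and the distance of Eq. (2) can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not part of the paper: each speaker is summarized by a single diagonal-covariance Gaussian rather than a full GMM, the helper names (`fit_gaussian`, `cluster_once`) are hypothetical, and a speaker left over when the number of speakers is odd is kept unchanged.

```python
import numpy as np


def bhattacharyya_diag(mu_i, var_i, mu_j, var_j):
    """Eq. (2) for two Gaussians with diagonal covariances given as variance vectors."""
    var_avg = 0.5 * (var_i + var_j)
    diff = mu_j - mu_i
    mean_term = 0.125 * np.sum(diff * diff / var_avg)
    # log|avg| - 0.5 * (log|S_i| + log|S_j|); diagonal determinants are products
    log_det_term = 0.5 * (np.sum(np.log(var_avg))
                          - 0.5 * (np.sum(np.log(var_i)) + np.sum(np.log(var_j))))
    return mean_term + log_det_term


def fit_gaussian(features):
    """Hypothetical stand-in for GMM training: one diagonal Gaussian per speaker."""
    return features.mean(axis=0), features.var(axis=0) + 1e-6


def cluster_once(speaker_features):
    """One pass of the pairing algorithm; roughly halves the number of speakers."""
    remaining = [np.asarray(f, dtype=float) for f in speaker_features]
    models = [fit_gaussian(f) for f in remaining]             # step 1
    virtual = []
    while len(remaining) >= 2:                                # steps 2 and 3
        best = None
        for a in range(len(remaining)):
            for b in range(a + 1, len(remaining)):
                d = bhattacharyya_diag(*models[a], *models[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Pool the feature vectors of the two most similar speakers
        virtual.append(np.vstack([remaining[a], remaining[b]]))
        for idx in (b, a):                                    # b > a, so delete b first
            del remaining[idx]
            del models[idx]
    virtual.extend(remaining)  # assumption: an odd leftover speaker is kept as is
    return [fit_gaussian(v) for v in virtual], virtual        # step 4
```

Calling `cluster_once` again on the returned feature sets would correspond to a further iteration of the algorithm, for example going from 100 to 50 virtual speakers.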
In Eqs. (1) and (2), $M_i$ denotes the ith speaker’s mixture model. In addition, $\mu_i$ and $\Sigma_i$ (in Eq. (2)) are the mean vector and covariance matrix of the ith mixture model, respectively. Because it uses both the mean vectors and the covariance matrices of the mixture models to assess their similarity, the Bhattacharyya criterion seems more appropriate.

3. Speaker verification in relative space of anchor models

3.1. Enrolling system speakers in relative space

After constructing a relative space, relative feature vectors of speakers can be extracted. In this paper, this is referred to as enrolling or locating speakers in the relative feature space. To enroll or locate a speaker in the relative space, the speaker’s speech data is preprocessed and its absolute features are extracted. Scores of the speaker’s feature vectors under all reference speakers’ models as well as under the GMM/UBM model are evaluated. The GMM/UBM model is a GMM trained using speech data from a large number of speakers. As such, the location of each speaker in the relative space is a coordinate vector that can be obtained as follows:

\[ w_i = \begin{bmatrix} \tilde{P}(x \mid \lambda_1) \\ \tilde{P}(x \mid \lambda_2) \\ \vdots \\ \tilde{P}(x \mid \lambda_E) \end{bmatrix} \tag{3} \]
This coordinate vector $w_i$ serves as the feature vector of speaker i in our speaker verification system. E, the dimension of $w_i$, is equal to the total number of anchor models, or in other words, the number of dimensions of the relative space. $\tilde{P}(x \mid \lambda_e)$ is the normalized log-likelihood score of the speech feature sequence x (of $N_f$ absolute feature vectors), given the GMM of the anchor model $\lambda_e$:

\[ \tilde{P}(x \mid \lambda_e) = \frac{1}{N_f} \log \frac{P(x \mid \lambda_e)}{P(x \mid \lambda_{UBM})} \tag{4} \]

In this formula, the likelihood score of x under the anchor model $\lambda_e$ is normalized by the likelihood score of x under the GMM/UBM model $\lambda_{UBM}$. In Section 5, this is referred to as UBM score normalization in the process of coordinate vector calculation.

3.2. Locating and verifying test samples in the relative space

As explained in the previous section, in the training phase, each speaker is enrolled in the relative space. In the test phase, a speaker claims an identity to be verified by the system. Test samples from this speaker are located in the relative space and their coordinate vectors are extracted. This process is similar to the speaker enrollment process explained earlier. Subsequently, the distance between the coordinate vector obtained from the test data and the coordinate vector corresponding to the identity claimed by the speaker is measured in the relative space. The distance is compared to a predefined threshold, and the speaker’s claim is accepted or rejected accordingly. The distance between two points in the relative space can be measured with different metrics, such as the Euclidean distance using Eq. (1) or the mutual angle of the points’ coordinate vectors as follows:

\[ D_{\mathrm{angle}}(V_1, V_2) = \cos(\alpha) = \frac{V_1 \cdot V_2}{|V_1| \, |V_2|} \tag{5} \]
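Eqs. (3)–(5) translate directly into code. The sketch below is a minimal illustration, not the authors' implementation. It assumes diagonal-covariance GMMs stored as `(weights, means, variances)` tuples and frames treated as independent, so that the sequence log-likelihood is the sum of per-frame log-likelihoods. All function names are hypothetical.

```python
import numpy as np


def gmm_loglik(x, weights, means, variances):
    """log P(x | lambda): summed frame log-likelihoods under a diagonal-covariance GMM.

    x: (N_f, d) frames; weights: (C,); means, variances: (C, d).
    """
    diff = x[:, None, :] - means[None, :, :]                       # (N_f, C, d)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    # Numerically stable log-sum-exp over the mixture components
    m = log_comp.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(frame_ll.sum())


def coordinate_vector(x, anchors, ubm):
    """Eqs. (3) and (4): one UBM-normalized, length-normalized score per anchor model."""
    ll_ubm = gmm_loglik(x, *ubm)
    return np.array([(gmm_loglik(x, *anchor) - ll_ubm) / len(x) for anchor in anchors])


def angle_measure(v1, v2):
    """Eq. (5): cosine of the angle between two coordinate vectors (larger = closer)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```

Enrollment and testing would both call `coordinate_vector`; verification then compares `angle_measure(test_vector, claimed_vector)` with a threshold.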
In Eq. (5), $V_1$ and $V_2$ are the position vectors of the test data and the claimed identity points in the relative space, respectively, and $(\cdot)$ indicates the inner product of these vectors. Larger values of $\cos(\alpha)$ correspond to smaller distances between the two vectors.

3.3. Orthogonalization of coordinate vectors and distance normalization in the relative space

By applying orthogonalization using techniques such as Linear Discriminant Analysis (LDA) (Manly, 1990), coordinate vectors in the relative space can be transferred to an orthogonal relative space. During orthogonalization, it is possible to either preserve or reduce the dimension of the coordinate vectors. Another advantage of space orthogonalization is that it increases the ability to discriminate between different speakers, thus improving the performance of speaker verification systems in the relative space.

Score normalization methods such as Maximum (Rosenberg et al., 1992), Tnorm (Auckenthaler et al., 2000) or Znorm (Li and Porter, 1988) usually improve the performance of speaker verification systems in the absolute space. As such, it may be advantageous to normalize the distances obtained by comparing the test samples and the claimed speaker model in the relative space before they are used in the verification process. In the absolute space, score normalization usually refers to normalizing the output scores of the models, e.g., GMM models. In the relative space context, where a distance is used, the score is related to the reciprocal of the distance. Hence, the Minimum normalization method is used instead of the Maximum normalization method. In the Minimum normalization method (Eq. (6)), the minimum distance between the test speaker and the other system speakers (except the claimed speaker) is added to the distance between the test speaker and the claimed speaker.

\[ Dis'(\vec{x}, Spkr_i) = Dis(\vec{x}, Spkr_i) + \min_{\substack{j=1 \\ j \neq i}}^{n} Dis(\vec{x}, Spkr_j) \tag{6} \]

where $Spkr_i$ and $\vec{x}$ are the ith system speaker and the coordinate vector of the test sample in the relative space, respectively. Tnorm and Znorm methods use a general relation as follows:
[Fig. 1 block labels: anchor models’ training data; extra anchor models’ training data, containing useful information for the relative space; system speakers’ training data; testing data; clustering and constructing virtual anchor models; anchor models training; non-orthogonal relative space of anchor models; enrolling and locating in relative space; locating in the relative space; coordinate vectors orthogonalization; coordinate vectors of system speakers in the (orthogonal) relative space of anchor models; measuring distance to the claimed system speaker in the relative space; distance normalization; threshold; decision-making; acceptance or rejection.]

Fig. 1. General scheme of the proposed speaker verification system in the relative space of anchor models (dotted boxes represent optional step blocks).
\[ Dis'(\vec{x}, Spkr_i) = \frac{Dis(\vec{x}, Spkr_i) - \mu_{Spkr_i}}{\sigma_{Spkr_i}} \tag{7} \]
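As a rough sketch, Eqs. (6) and (7) can be written as two small functions. The code follows the equations as printed. The function names and the `cohort_distances` argument are illustrative assumptions: Znorm and Tnorm differ only in which set of distances is used to estimate the mean and spread, and that choice is left to the caller here.

```python
import numpy as np


def minimum_normalization(distances_to_all, claimed):
    """Eq. (6): add the smallest distance to any other enrolled speaker.

    distances_to_all[j] is Dis(x, Spkr_j) for every system speaker j;
    claimed is the index i of the claimed identity.
    """
    others = np.delete(np.asarray(distances_to_all, dtype=float), claimed)
    return float(distances_to_all[claimed] + others.min())


def mean_std_normalization(distance, cohort_distances):
    """Eq. (7): (Dis - mu) / sigma, with mu and sigma estimated from cohort distances."""
    cohort = np.asarray(cohort_distances, dtype=float)
    return float((distance - cohort.mean()) / cohort.std())
```

For Znorm, the cohort distances would be computed once per enrolled speaker; for Tnorm, they would be recomputed for each test sample.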
In this formula, $\mu_{Spkr_i}$ and $\sigma_{Spkr_i}$ are the normalization parameters. In the Znorm normalization method, the distances between each system speaker and a number of unauthorized speakers’ speech data are measured and the distribution of the unauthorized speakers’ distances is obtained. The corresponding mean and variance of the obtained distribution for each system speaker (the normalization parameters) are then estimated. In the Tnorm normalization method, the distances between each test sample and some of the speakers most similar to the claimed speaker (except the claimed speaker) are measured. The distribution of these distances is then obtained and used for evaluating the corresponding mean and variance (the normalization parameters). Fig. 1 illustrates the general scheme of the proposed speaker verification system in the relative space of anchor models.

4. Implementation

4.1. Speech databases

To test the proposed system in this study, a number of experiments were conducted. These experiments involved three speech databases: (A) FARSDAT microphony (clean) (Bijankhan et al., 1994), (B) telephonized FARSDAT (Momtazi et al., 2007), and (C) NoiseX (Varga and Steeneken, 1993). FARSDAT microphony is a Persian speech database, which contains speech data with a sampling frequency of 22,050 Hz from 300 speakers. Each speaker uttered a number of sentences in two sessions. This database includes male and female speakers who cover a wide range of ages, with various academic education levels and accents (10 accents from various regions in Iran were represented: Tehrani, Torki, Esfahani, Jonubi,
Shomali, Khorassani, Baluchi, Kordi, Lori, and Yazdi). The telephonized FARSDAT database was generated by passing the clean FARSDAT data through a telephone channel. This database contains files with a sampling frequency of 11,025 Hz. More specifications on this database, the framework applied for telephonizing, and the telephone lines, including their statistical parameters, can be found in Momtazi et al. (2007). NoiseX is a noise database that was used to study abnormal noisy conditions. To generate the desired noisy speech database in this case, various noise sequences from the NoiseX database were added to the clean FARSDAT speech segments of the testing samples. The speech samples used to train the anchor models of the relative space, as well as the training samples of the enrolled speakers, were noise-free during the experiments that involved noisy test data. The noise sequences included white noise, car noise, factory noise and babble noise, which were added such that the SNR values were 0, 5, 10, and 20 dB. It should be noted that adding artificial noise to clean database samples and using them as testing samples may lead to over-estimation of the speaker verification systems’ performance under real noisy conditions.

In all experiments of this research, speech samples of 200 out of the 300 speakers of the FARSDAT database were used to construct the relative space, while those of the remaining 100 speakers were used for training and testing the speaker verification system in this space. Moreover, all the experiments conducted in this research were independent of age, gender and accent, i.e., no a priori information was assumed.

4.2. Preprocessing and extracting the absolute features

Data preprocessing in this study includes four steps: silence removal, framing, preemphasis and windowing (Bimbot et al., 2004). 
Since silence does not contain any useful information for speaker verification, it increases the verification error rate and the training and testing time. As such, silence segments were removed from the speech samples. Then, each speaker’s speech sample was segmented into 32 ms frames with an overlap of 24 ms. Subsequently, the obtained frames were preemphasized (Picone, 1993) with a coefficient of 0.975. Finally, the preemphasized frames were windowed (Rabiner and Juang, 1993) using a Hamming window. After preprocessing, a vector of 12 MFCCs was extracted from each frame as its absolute feature vector. Cepstral Mean Subtraction (CMS) was then performed on the feature vectors in order to normalize the effects of the microphone or handset as well as the transmission channel on the extracted features (Bimbot et al., 2004). These vectors were used for constructing the relative space and for training and testing the speaker verification system, as described next.

4.3. Implementation of relative space

In this study, 200 GMM models along with a GMM/UBM model (Bimbot et al., 2004) were used to construct the relative space of anchor models. For this purpose, the GMM/UBM model, with 512 Gaussian components, was first trained using the extracted absolute feature vectors of all reference speakers via an EM/ML algorithm. Then, a GMM was derived for each reference speaker by adapting the parameters of the GMM/UBM model via a MAP algorithm. The MAP algorithm was performed only on the mean vectors of the GMM/UBM, using absolute feature vectors extracted from a 30 sec speech segment of the corresponding reference speaker. Moreover, to study speaker clustering and to construct a relative space of virtual reference speakers, the algorithm described in Section 2 was employed iteratively to obtain 100, 50 and 25 virtual reference speakers. 
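The mean-only MAP adaptation used to derive each anchor model from the GMM/UBM can be sketched as follows. This is a minimal illustration of the commonly used relevance-factor form of MAP adaptation for diagonal-covariance GMMs, not the authors' code. The function name and the relevance factor of 16 are assumptions, since the paper does not state the value it used.

```python
import numpy as np


def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Return MAP-adapted means; weights and variances stay those of the UBM.

    x: (N, d) speaker frames; weights: (C,); means, variances: (C, d).
    relevance: assumed relevance factor (not specified in the paper).
    """
    diff = x[:, None, :] - means[None, :, :]                       # (N, C, d)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    # Posterior probability of each component for each frame
    log_comp -= log_comp.max(axis=1, keepdims=True)
    post = np.exp(log_comp)
    post /= post.sum(axis=1, keepdims=True)                        # (N, C)
    n = post.sum(axis=0)                                           # soft counts, (C,)
    data_mean = (post.T @ x) / np.maximum(n, 1e-10)[:, None]       # (C, d)
    alpha = (n / (n + relevance))[:, None]                         # adaptation weights
    # Components that saw little data stay close to the UBM means
    return alpha * data_mean + (1.0 - alpha) * means
```

With short enrollment segments, most of the 512 components receive small soft counts, so the adapted model differs from the UBM only where the speaker's data actually falls.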
For implementing the clustering algorithm, the Bhattacharyya distance was calculated based on the mean and variance vectors of the GMMs, and used for evaluating the models’ similarity.

4.4. Evaluating system’s performance

Speech samples of the remaining 100 speakers were used to train and test the speaker verification system in the absolute and relative spaces, and to evaluate their performance. These 100 speakers were selected randomly from the FARSDAT database at the first stage of this research and cover both male and female speakers. Experiments were conducted using three data sets: (1) 10 sec for training and 5 sec for testing, (2) 5 sec for training and 5 sec for testing, and (3) 2 sec for training and 2 sec for testing. In each of these three sets, one training sample and four testing samples belonged to each speaker, i.e., each set contained 400 test samples.
In the relative space case, speaker-dependent thresholds were set a posteriori in order to obtain the EER in the test phase. The confidence interval for each experiment was calculated at a confidence level of 95% and is given in the tables and figures for each error rate.

In this study, the speaker verification system in the absolute space was also implemented, for which conventional GMM models were used (Bimbot et al., 2004). As with the training of reference models in the relative space, the training process in this case consisted of adapting a GMM/UBM model with 512 Gaussian components to each system speaker via a mean-only MAP algorithm, using the absolute feature vectors of his/her training samples. The training was followed by calculating the score of the test samples under each model. Thresholding and error calculation in the absolute space followed a process similar to that used in the relative space case.

In addition, in order to compare the performance of the speaker verification system in an orthogonal relative space of anchor models with that in an eigenspace, the latter was implemented following the procedure described by Thyes et al. (2000). LDA was applied to the mean vectors of the reference speakers’ GMM models to obtain an eigenspace made up of eigenvoice basis vectors. To estimate each system speaker’s coordinates in the eigenspace, the MLED technique (Maximum Likelihood Eigen Decomposition) was used (Thyes et al., 2000; Kuhn et al., 2000). The MLED technique was also applied in the test phase, where samples were projected into the eigenspace. The process of finding the distances between the test samples’ and the system speakers’ points in the eigenspace, i.e., “eigendistance decoding” in Thyes et al. (2000), was performed in a similar manner to that in the relative space of anchor models. A similar process was also used to find thresholds and calculate errors.

4.5. Enriching the relative space to tackle noisy conditions

By exploiting the main idea of the relative space described in Section 2, the relative space may be enriched to tackle noisy conditions. As such, we propose improving the knowledge about noisy conditions embedded in the relative space in order to improve the speaker verification system’s performance under such conditions. The relative space can use this information to be more robust and to reduce the error rate under noisy conditions. The proposed enrichment approach consists of using noisy speech samples to train some of the anchor models. The noisy anchor models along with the clean anchor models are then utilized to construct the relative space. Results obtained using this technique are presented in Section 5.2.

5. Experiments and results

5.1. Normal conditions

The speaker verification system in the relative space was first examined under normal conditions, where noise and transmission channel effects were not present. The data set of 10 sec for training and 5 sec for testing from microphony (clean) FARSDAT was used in these experiments. In the first set of experiments, two different distance criteria, namely the Euclidean distance and the vectors’ mutual angle, were applied as distance metrics in the relative space. The results shown in Table 1 indicate that the vectors’ mutual angle criterion clearly outperforms the Euclidean distance criterion. It can be inferred that, in the relative space, the angular distance discriminates between diverse speakers much better than the spatial distance. As such, this metric was used in all of the subsequent experiments.

Another experiment was conducted to assess the performance of the system in the absolute space and the relative space. Table 2 summarizes the speaker verification error rates corresponding to these spaces for real or virtual reference speakers with 25, 50, 100, or 200 reference speakers.
Table 1
Performance of distance metrics in the relative space. The minimum EER obtained (written in bold), corresponds to the vector’s mutual angles.

| Distance metric in the relative space | EER in the relative space (%) |
|---|---|
| Euclidian distance | 35.50 ± 0.47 |
| Vectors’ mutual angles | **2.05 ± 0.14** |
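
To make the comparison in Table 1 concrete, the following Python sketch computes both distance criteria between two coordinate vectors in the relative space. The function names and the toy vectors are illustrative assumptions and are not taken from the original implementation. The sketch only shows that the mutual angle ignores the overall magnitude of a coordinate vector, whereas the Euclidian distance does not.

```python
import numpy as np

def euclidean_distance(u, v):
    """Spatial (Euclidian) distance between two coordinate vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

def mutual_angle(u, v):
    """Angle (in radians) between two coordinate vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical 3-dimensional coordinate vectors (one score per anchor model).
enrolled = np.array([-1.0, -2.0, -4.0])
scaled_test = 1.5 * enrolled          # same direction, different magnitude

print("Euclidian distance:", euclidean_distance(enrolled, scaled_test))
print("Mutual angle (rad):", mutual_angle(enrolled, scaled_test))
```

In this toy case the two vectors point in the same direction, so the mutual angle is essentially zero while the Euclidian distance is clearly non-zero.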
Table 2
Performance of speaker verification system in different spaces. The minimum EER obtained (written in bold), corresponds to the relative space with 100 virtual anchor models.

| Space | Anchor models | Number of reference speakers | EER (%) |
|---|---|---|---|
| Absolute space | – | – | 2.09 ± 0.14 |
| Relative space | Real | 200 | 2.05 ± 0.14 |
| Relative space | Virtual | 100 | **1.83 ± 0.13** |
| Relative space | Virtual | 50 | 4.72 ± 0.21 |
| Relative space | Virtual | 25 | 9.13 ± 0.28 |
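
The ± values reported alongside each EER can be related to the 95% confidence level used for the error rates. The sketch below assumes a normal-approximation binomial interval and N = 40,000 verification trials (400 test samples scored against 100 claimed identities). Neither the formula nor the trial count is stated in the text, so both are assumptions, and the function name `ci_half_width` is hypothetical.

```python
import math

def ci_half_width(error_rate_percent, n_trials, z=1.96):
    """Half-width (in %) of a normal-approximation binomial interval.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    p = error_rate_percent / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_trials)

# Assumption: 400 test samples x 100 claimed speakers = 40,000 trials.
N_TRIALS = 400 * 100

for eer in (2.09, 2.05, 1.83, 4.72, 9.13):   # EERs from Table 2
    print(f"{eer:5.2f} ± {ci_half_width(eer, N_TRIALS):.2f}")
```

Under these assumptions the five EERs of Table 2 give half-widths of 0.14, 0.14, 0.13, 0.21, and 0.28, which agree with the table. Other entries in the paper may not match as closely, so this should be read as a plausibility check rather than a statement of the procedure actually used.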
Table 2 indicates that, under similar conditions, the relative space provides a better platform for speaker verification than the absolute space. Furthermore, performing clustering and constructing virtual reference speakers in the first iteration not only reduced the number of necessary reference speakers, the dimension of the relative space, and the time and memory complexity, but also led to a smaller verification error rate. More iterations of the clustering algorithm, and hence further reduction in the number of reference speakers, degraded system performance. Table 2 shows that a total of 100 reference speakers led to the smallest verification error rate; therefore, 100 reference speakers were used in our subsequent experiments. Clustering can reduce the virtual reference speakers to a small number. However, to achieve the best results, clustering should be stopped once the speaker verification performance begins to decline. This stopping criterion was used as a compromise between speaker verification performance, system complexity, and memory requirements.

The next experiment studied the effect of orthogonalizing the coordinate vectors of speakers in the relative space. In this experiment, an LDA process was applied to the speakers’ coordinate vectors obtained from the previous experiment, in which the 100-dimensional relative space of virtual reference speakers was constructed. The orthogonalized vectors of the training and test samples were then used in the verification system. Moreover, as described in Section 4.4, in order to compare the performance of the speaker verification system in an orthogonal relative space of anchor models with that in an eigenspace, the latter was implemented as described by Thyes et al. (2000).
LDA was applied to the mean vectors of the 100 virtual reference speakers’ GMM models to obtain eigenspaces made up of 100, 90, 70, and 63 eigenvoice basis vectors, respectively. To enroll each system speaker, as well as to project each test sample into the eigenspace, the MLED technique was applied. All other conditions were kept the same in order to make a valid comparison possible. The results obtained from this experiment are summarized in Table 3. These results imply that orthogonalizing the coordinate vectors has a favorable effect on the performance of the speaker verification system. In other words, the orthogonal relative space can discriminate between diverse speakers more accurately. The best result in the anchor models’ relative space was obtained when LDA
Table 3
Effects of orthogonalization. The minimum EERs obtained (written in bold), corresponds to the eigenspace with 90 eigenvoices and 100 dimensional orthogonal relative space of virtual anchor models.

| Space | Performing LDA process (on coordinate vectors/GMM mean vectors) | Final dimensions of coordinate vectors/number of eigenvoices | EER (%) |
|---|---|---|---|
| Relative space of anchor models | No | 100 | 1.83 ± 0.13 |
| Relative space of anchor models | Yes | 100 | **0.73 ± 0.08** |
| Relative space of anchor models | Yes | 90 | 0.80 ± 0.09 |
| Relative space of anchor models | Yes | 70 | 0.84 ± 0.09 |
| Relative space of anchor models | Yes | 63 | 0.95 ± 0.10 |
| Eigenspace | Yes | 100 | 0.94 ± 0.10 |
| Eigenspace | Yes | 90 | **0.71 ± 0.08** |
| Eigenspace | Yes | 70 | 1.19 ± 0.11 |
| Eigenspace | Yes | 63 | 2.34 ± 0.15 |
was performed while the space’s dimension was not changed. Reducing the number of coordinate dimensions caused the system error to increase slowly. As such, in the subsequent experiments, LDA was performed while keeping the space’s dimension the same. In this experiment, the best result in the eigenspace was slightly different from, but very close to, the one in the anchor models’ space. This result was achieved when the number of eigenvoices used to build the space was reduced from 100 to 90. Further reduction in the number of eigenvoices, however, increased the system error at a faster rate.

The effect of distance normalization on the verification system performance was investigated in the next experiment. Normalization methods were applied to the distances obtained in the previous experiments, in conjunction with both the non-orthogonal and orthogonal relative spaces. Moreover, in an independent experiment, the enrollment and location of training and test speakers in the relative space were performed without using a universal background model (λ_UBM in Eq. (4)). The distances based on the resulting coordinate vectors were then normalized. The latter experiment was conducted to study the effect of distance normalization independently of the UBM score normalization used in calculating the coordinate vectors. For comparison, the best normalization method (Maximum) was applied to the GMM model scores obtained from the previous experiment in the absolute space. The results of this experiment are given in Table 4. They demonstrate that the speaker verification error rate increases when the UBM normalization is not applied. Furthermore, Minimum (Maximum) normalization succeeded in improving the verification system performance in both the relative (orthogonal and non-orthogonal) and absolute spaces.
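
As a rough illustration of the Tnorm- and Znorm-style normalizations compared in this experiment, the sketch below standardizes a raw distance with the statistics of a cohort of impostor (unauthorized speaker) distances. It follows the conventional form of these methods: subtract the cohort mean and divide by the cohort standard deviation. In that convention, Znorm estimates the statistics per enrolled speaker from impostor utterances, and Tnorm estimates them per test sample from a cohort of other models. The function name, the cohort values, and the use of the standard deviation as the scale are assumptions and may differ from the exact formulas used in the system.

```python
import numpy as np

def standardize_distance(raw_distance, cohort_distances):
    """Standardize a distance with impostor-cohort statistics.

    Assumed conventional form: (d - mean) / std of the cohort distances.
    """
    cohort = np.asarray(cohort_distances, dtype=float)
    return (raw_distance - cohort.mean()) / cohort.std()

# Hypothetical angular distances (radians) from impostors to one speaker.
impostor_distances = [0.8, 1.0, 1.2]

# A test sample much closer to the claimed speaker than the impostors are
# receives a strongly negative normalized distance.
print(standardize_distance(0.2, impostor_distances))
# A test sample at the impostor mean is mapped to zero.
print(standardize_distance(1.0, impostor_distances))
```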
The Tnorm and Znorm methods, however, did not yield satisfactory performance in the relative space, even when the UBM normalization was not applied. The failure of these two methods can be attributed to the fact that they use a relationship that rescales and compacts the distance distribution space. The compacting occurs when the mean of the unauthorized speakers’ distances is subtracted from each original distance and the result is divided by their variance in the normalization formula. The distance criterion is less discriminative in a compact space because the values in such a space are smaller, so small differences may be lost in the calculation and accuracy may decrease. In other words, the resolution of features in the relative space is lower in a compact space, which may have a negative impact on the measurements and performance. As seen in Table 4, the best result corresponds to the case where the UBM normalization is applied in conjunction with the Minimum normalization method.

In another experiment, whose purpose was to assess the effect of the length of the training samples, three data sets with different lengths of training and testing speech samples were applied to the systems. The relative
Table 4
Effects of distance (score) normalization on speaker verification system’s performance. The minimum EER obtained (written in bold), corresponds to the orthogonal relative space of virtual anchor models where the UBM normalization is applied in conjunction with the Minimum normalization method.

| Space of speaker verification | Using λ_UBM in coordinate vectors calculation | Applied normalization method | EER (%) |
|---|---|---|---|
| Relative space of virtual reference speakers | Yes | – | 1.83 ± 0.13 |
| Relative space of virtual reference speakers | No | – | 4.99 ± 0.21 |
| Relative space of virtual reference speakers | Yes | Minimum | 1.24 ± 0.11 |
| Relative space of virtual reference speakers | No | Minimum | 2.82 ± 0.16 |
| Relative space of virtual reference speakers | Yes | Tnorm | 17.61 ± 0.37 |
| Relative space of virtual reference speakers | No | Tnorm | 19.90 ± 0.39 |
| Relative space of virtual reference speakers | Yes | Znorm | 20.38 ± 0.39 |
| Relative space of virtual reference speakers | No | Znorm | 23.36 ± 0.41 |
| Orthogonal (LDA) relative space of virtual anchor models | Yes | – | 0.73 ± 0.08 |
| Orthogonal (LDA) relative space of virtual anchor models | Yes | Minimum | **0.12 ± 0.02** |
| Absolute space | – | – | 2.09 ± 0.14 |
| Absolute space | – | Maximum | 0.60 ± 0.07 |
Table 5
EER of speaker verification in different spaces vs. different length of data samples. The minimum EER obtained for each set of data samples’ length is written in bold.

| Space of speaker verification | Length of training samples (s) | Length of testing samples (s) | EER (%) |
|---|---|---|---|
| Absolute space | 10 | 5 | 2.09 ± 0.14 |
| Relative space + clustering | 10 | 5 | 1.83 ± 0.13 |
| Relative space + clustering + LDA | 10 | 5 | **0.73 ± 0.08** |
| Absolute space | 5 | 5 | 2.96 ± 0.17 |
| Relative space + clustering | 5 | 5 | 2.01 ± 0.14 |
| Relative space + clustering + LDA | 5 | 5 | **1.18 ± 0.11** |
| Absolute space | 2 | 2 | 8.39 ± 0.27 |
| Relative space + clustering | 2 | 2 | 4.70 ± 0.21 |
| Relative space + clustering + LDA | 2 | 2 | **3.41 ± 0.18** |
Fig. 2. EER of speaker verification in different spaces.
space of virtual anchor models (using clustering) was utilized in this experiment. Table 5 and Fig. 2 summarize the results obtained in this experiment. As these results imply, the relative space approach is much more robust than the absolute space approach to a reduction in the length of the training and testing speech samples. As the length of the training and testing samples is reduced, the error rates of the systems in the absolute and relative spaces diverge rapidly. Given its high efficiency, in conjunction with its robustness to shorter training samples, the relative space approach makes real-time training of speaker verification systems possible. In other words, real-time training of the speaker verification system in this space is feasible because such a system does not require long training samples for enrollment of system speakers. In addition, the training process merely consists of calculating a set of scores and forming a coordinate vector for each system speaker. Contrary to speaker verification systems in relative space, systems in absolute space are very sensitive to the length of training samples;
Table 6
EER of speaker verification systems in telephony condition. The minimum EER obtained (written in bold), corresponds to the orthogonal relative space of virtual anchor models.

| Space of speaker verification | EER (%) |
|---|---|
| Absolute space | 9.22 ± 0.28 |
| Non-orthogonal relative space of virtual anchor models | 7.43 ± 0.26 |
| Orthogonal relative space of virtual anchor models | **5.62 ± 0.23** |
Table 7
EER of speaker verification systems in the abnormal condition. The minimum EER obtained for each noise sequence and SNR is written in bold.

| Space | Noise sequence | EER (%), SNR = 20 dB | EER (%), SNR = 10 dB | EER (%), SNR = 5 dB | EER (%), SNR = 0 dB |
|---|---|---|---|---|---|
| Absolute space | Car | 7.99 ± 0.27 | 12.16 ± 0.32 | 13.73 ± 0.34 | 15.13 ± 0.35 |
| Orthogonal relative space of virtual anchor models | Car | **2.92 ± 0.16** | **4.44 ± 0.20** | **5.98 ± 0.23** | **10.31 ± 0.30** |
| Absolute space | Babble | 8.49 ± 0.27 | 10.17 ± 0.30 | 11.89 ± 0.32 | 15.02 ± 0.35 |
| Orthogonal relative space of virtual anchor models | Babble | **3.24 ± 0.17** | **5.25 ± 0.22** | **6.94 ± 0.25** | **12.16 ± 0.32** |
| Absolute space | Factory | 10.21 ± 0.30 | 17.66 ± 0.37 | 23.40 ± 0.41 | 28.47 ± 0.44 |
| Orthogonal relative space of virtual anchor models | Factory | **4.32 ± 0.20** | **11.93 ± 0.32** | **20.35 ± 0.39** | **26.61 ± 0.43** |
| Absolute space | White | 16.70 ± 0.37 | **24.20 ± 0.42** | **27.17 ± 0.44** | **30.12 ± 0.45** |
| Orthogonal relative space of virtual anchor models | White | **15.98 ± 0.36** | 25.31 ± 0.43 | 31.16 ± 0.45 | 33.52 ± 0.46 |
Fig. 3. EER of speaker verification systems under the condition of car noise.
as such, they need longer speech segments for enrollment of system speakers. Moreover, the training process in such systems comprises training an initial model, which is more time-consuming than calculating
Fig. 4. EER of speaker verification systems under the condition of babble noise.
a set of scores. These limitations arise from the nature of the absolute space approach, which precludes the possibility of real-time training.

5.2. Abnormal condition

Subsequent to the assessment in the normal condition, we conducted experiments to assess the speaker verification system in the relative space under abnormal noisy and telephony conditions. The data set with 2 s of training speech and 2 s of testing speech was used in these experiments. In the first experiment, the performances of the speaker verification systems in the relative and absolute spaces were compared under the telephony condition (channel noise). The results of this experiment are summarized in Table 6. They indicate that the relative space approach (non-orthogonal and orthogonal) provides a more appropriate platform than the absolute space approach for speaker verification in the telephony condition. Moreover, the orthogonal relative space outperforms its non-orthogonal counterpart, as was also observed in the normal condition.

In the next experiment, the performances of the systems in the absolute and relative spaces were compared under noisy conditions. Several noise sequences, namely car, babble, factory, and white noise, were added to testing samples from the microphone (clean) FARSDAT recordings at SNR values of 20, 10, 5, and 0 dB. The speech samples used to train the anchor models of the relative space, as well as the training samples of the enrolled speakers, were noise-free. Table 7 and Figs. 3–6 summarize the results obtained from these experiments. These results show that the system in the relative space performs better under all noisy conditions except white noise. In the latter condition, as seen in Fig. 6, the system in the absolute space outperforms that in the relative space for SNR values of 0, 5, and 10 dB.
In this case, the error rates of the two systems converge around an SNR value of 15 dB, above which the system in the relative space performs better.
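
The contamination step used in these experiments, adding a noise sequence to clean speech at a prescribed SNR, can be sketched as follows. This is a minimal illustration that assumes equal-length, single-channel signals held as NumPy arrays. The function names and the synthetic signals are hypothetical and are not taken from the original experiments, which used recorded noise sequences.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power is p_speech / 10^(SNR/10); take the amplitude gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def measured_snr_db(clean, noisy):
    """SNR (dB) of `noisy`, treating `noisy - clean` as the noise component."""
    residual = np.asarray(noisy, dtype=float) - np.asarray(clean, dtype=float)
    return 10.0 * np.log10(np.mean(np.square(clean)) / np.mean(residual ** 2))

# Synthetic stand-ins: a 220 Hz tone as "speech" and Gaussian white noise.
rng = np.random.default_rng(0)
speech = np.sin(2.0 * np.pi * 220.0 * np.arange(16000) / 16000.0)
white = rng.standard_normal(16000)

for snr in (20, 10, 5, 0):   # SNR values used in the experiments
    noisy = add_noise_at_snr(speech, white, snr)
    print(snr, "dB requested ->", round(measured_snr_db(speech, noisy), 2), "dB measured")
```

The same mixing step would apply when noisy training material is generated for anchor models, although how the original system prepared that material is not specified beyond the SNR values.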
Fig. 5. EER of speaker verification systems under the condition of factory noise.
Table 8
Speaker verification EER in white noise-trained orthogonal relative space of virtual anchor models. The EER obtained for the white noise condition (written in bold) is lower than those from previous systems in both absolute and relative spaces in the same noise condition.

| Noise sequence | EER (%), SNR = 20 dB | EER (%), SNR = 10 dB | EER (%), SNR = 5 dB | EER (%), SNR = 0 dB |
|---|---|---|---|---|
| Car | 4.13 ± 0.20 | 5.63 ± 0.23 | 7.65 ± 0.26 | 11.15 ± 0.31 |
| Babble | 4.56 ± 0.20 | 6.19 ± 0.24 | 8.39 ± 0.27 | 13.79 ± 0.34 |
| Factory | 6.78 ± 0.25 | 14.52 ± 0.35 | 21.29 ± 0.40 | 26.94 ± 0.43 |
| White | **12.89 ± 0.33** | **22.31 ± 0.41** | **25.93 ± 0.43** | **29.71 ± 0.45** |
As described in Section 4.5, the relative space may be enriched to tackle noisy conditions. The proposed enrichment approach consists of using noisy speech samples to train some of the anchor models. The noisy anchor models, along with the clean anchor models, are then used to construct the relative space. To implement this enrichment technique, white noise was added to the clean speech samples of the 200 anchor models at SNR values of 0, 5, 10, and 20 dB. This yielded 1000 speech samples (200 clean samples and 800 noisy samples), which were then used to train 1000 GMM models as anchor models. These 1000 anchor models were reduced to 100 virtual anchor models using the clustering algorithm proposed in Section 2. In this case, the two most similar real anchor models and their eight related noisy anchor models were placed in one cluster to create a virtual anchor model in each iteration of the algorithm. This process yields a relative space of 100 virtual anchor models trained to tackle white noise. Following the construction of this relative space, the previous experiment in the noisy condition was repeated using it. The results of this experiment are presented in Table 8 and Figs. 3–6. These results indicate that the proposed verification system tackles the white noise condition more robustly
Fig. 6. EER of speaker verification systems under the condition of white noise.
and outperforms the two previous speaker verification systems. Under the other three noisy conditions, the system in the white noise-trained relative space could not match the performance achieved in the previous relative space, which was not trained for white noise. Nonetheless, it still outperformed the previous system in the absolute space in all cases. This degradation relative to the untrained relative space arises because the white noise training process may provide the space with incorrect information about the noisy condition it encounters. However, the results indicate that this process led to only a small deviation from the best achieved results and that the error rates are still smaller than those in the absolute space.

6. Summary and conclusion

Robust and real-time trained speaker verification systems in a relative space of anchor models were proposed in this paper. The systems were developed using a relative space of reference speakers, also referred to as anchor models. The primary steps include constructing the relative space, enrolling and locating system speakers and test samples in the space, and finally measuring distances and making a decision for each test sample based on a predefined threshold. In this study, several methods and techniques were applied for constructing a relative space and verifying a speaker. The performance of these methods, along with the techniques involved in their various steps, was assessed by conducting several experiments. The results indicated that the vectors’ mutual angle performs better than the Euclidian distance as the distance metric in the relative space. Using this distance criterion in the subsequent experiments, it was shown that the relative space provides a better platform than the absolute space for speaker verification.
The absolute space baseline used in this study was implemented by adapting a GMM/UBM model with 512 Gaussian components to each system speaker via a mean-only MAP algorithm, using the absolute feature vectors of his/her training samples.
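
For readers unfamiliar with mean-only MAP adaptation of a GMM/UBM, the sketch below shows the conventional relevance-factor form of the update. It is not the authors’ implementation: the diagonal-covariance assumption, the relevance factor of 16, the function name `map_adapt_means`, and the toy two-component model are all assumptions made for illustration.

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_vars, features, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (assumed form).

    ubm_weights: (K,), ubm_means and ubm_vars: (K, D), features: (T, D).
    Returns the adapted means, shape (K, D). Weights and variances stay fixed.
    """
    diff = features[:, None, :] - ubm_means[None, :, :]            # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_vars, axis=2)
                        + np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1))
    log_post = np.log(ubm_weights) + log_gauss                      # (T, K)
    log_post -= log_post.max(axis=1, keepdims=True)                 # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                         # responsibilities

    n_k = post.sum(axis=0)                                          # soft counts, (K,)
    first_moment = (post.T @ features) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]                      # adaptation weight
    return alpha * first_moment + (1.0 - alpha) * ubm_means

# Toy example: 1-D UBM with two components; all speaker data lies near the first.
weights = np.array([0.5, 0.5])
means = np.array([[0.0], [10.0]])
variances = np.ones((2, 1))
speaker_frames = np.full((50, 1), 1.0)

adapted = map_adapt_means(weights, means, variances, speaker_frames)
print(adapted.ravel())   # first mean moves toward 1.0, second stays near 10.0
```

Components that receive little speaker data keep their UBM means, which is why this update remains usable with short enrollment samples but still requires a full pass over the features for every enrolled speaker.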
A simple clustering method was also developed and implemented in the proposed systems. This method clusters the anchor models, which reduces the complexity of the relative space and hence reduces the system’s response time and increases its accuracy. An interesting feature of the proposed algorithm is that it guarantees a uniform distribution of the information pertaining to virtual speakers over the relative space. The experiments showed that creating virtual anchor models by clustering anchor models is helpful in constructing the relative space up to a certain point, beyond which the number of virtual anchor models is over-reduced. In the proposed methods, the optimum number of virtual anchor models has to be determined for each system in order to enhance performance. It was shown that this optimum number can be obtained by conducting a set of iterative experiments.

Orthogonalization of the coordinate vectors has a positive influence on the performance of speaker verification systems in the relative space. In other words, the orthogonal relative space can discriminate between diverse speakers more accurately than its non-orthogonal counterpart. This was confirmed through a number of experiments. To increase the accuracy of the system, distance normalization was applied in the relative space. It was shown that the Minimum normalization method improves the verification system performance in both the relative (orthogonal and non-orthogonal) and absolute spaces. However, the Tnorm and Znorm methods, which compact the distance distribution space, raise the verification error rate in the relative space even when the UBM normalization is not applied in the calculation of the coordinate vectors. This can be attributed to the fact that the resolution of features in the relative space is lower in a compact space, which may have a negative effect on the measurements and performance.
The results of the experiments based on the implemented techniques show that the performance of this system improved on that of the existing systems. The best results obtained in this research corresponded to the case where the normalized vectors’ mutual angle with the Minimum normalization method was applied to speaker verification in conjunction with an orthogonalized relative space of virtual anchor models. In this case, an EER of 0.12% on 400 test samples of 100 speakers was obtained.

Another important observation was that, in comparison with the absolute space, the relative space is much more robust to reductions in the length of the training and testing speech samples. This robustness to shortened training samples makes real-time training of speaker verification systems in this space possible. In other words, real-time training is feasible because such a system does not need long training samples for enrollment of system speakers. In addition, the training process merely consists of calculating a set of scores and forming a coordinate vector for each system speaker. The training process in absolute space based systems, however, comprises training an initial model, which is more time-consuming than calculating a set of scores. Contrary to speaker verification systems in relative space, speaker verification systems in absolute space are very sensitive to the length of training samples; as such, they need longer speech segments for enrollment of system speakers. These limitations arise from the nature of the absolute space approach, which precludes the possibility of real-time training.

Finally, the developed speaker verification system was evaluated with data obtained under abnormal conditions, where noisy or telephonic speech sequence contamination was present.
In this case, it was observed that the system in the relative space is generally more noise-robust than a similar system in the absolute space. Another major contribution of this research is the development of a more complex speaker verification system capable of tackling abnormal conditions more effectively. For this purpose, a novel enrichment method was proposed and developed to construct a relative space of anchor models trained specifically to tackle noise. The last set of experiments confirmed that the robustness of such a system may be increased by enriching the relative space with a priori knowledge about the abnormal conditions. The results of the experiments conducted in this part of the research demonstrated the ability of this approach to tackle abnormal conditions even at low SNR values.

Given that the relative space approach is capable of overcoming noisy conditions effectively, it may also be able to tackle session variability. As such, an interesting study would be to compare the performance of the joint factor analysis and eigenchannels approaches with that of the relative space approach in dealing with session variability. Integrating the relative space approach with joint factor analysis or eigenchannels may also improve the performance of current systems dealing with session variability. Whether such integration yields an improvement can be determined in a future investigation.
References

Auckenthaler, R., Carey, M., Lloyd-Thomas, H., 2000. Score normalization for text-independent speaker verification system. Digital Signal Processing 10 (1), 42–54.
Avci, E., Akpolat, Z.H., 2006. Speech recognition using a wavelet packet adaptive network based fuzzy inference system. Expert Systems with Applications 31 (3), 495–503.
Bijankhan, M., Seikhzadeghan, J., Roohani, M.R., Samareh, Y., Lucas, K., Tebyani, M., 1994. FARSDAT – the speech database of Farsi spoken language. In: Proceedings of SST-94. pp. 826–831.
Bimbot, F., Bonastre, J.F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., Merlin, T., Ortega-Garcia, J., Petrovska-Delacretaz, D., Reynolds, D.A., 2004. A tutorial on text-independent speaker verification. EURASIP Journal on Applied Signal Processing 4, 430–451.
Campbell, W.M., Sturim, D.E., Reynolds, D.A., 2006. Support vector machines using GMM supervectors for speaker verification. Signal Processing Letters 13 (5), 308–311.
Fauve, B., Matrouf, D., Scheffer, N., Bonastre, J.F., Mason, J.S.D., 2007. State-of-the-art performance in text-independent speaker verification through open-source software. IEEE Transactions of Audio, Speech, Lang. Processing 15 (7), 1960–1968.
Kenny, P., Boulianne, G., Ouellet, P., Dumouchel, P., 2007. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans. Audio, Speech, Lang. Process 15 (4), 1435–1447.
Kuhn, R., Junqua, J.-C., Nguyen, P., Niedzielski, N., 2000. Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech Audio Process 8 (6), 695–707.
Li, K.P., Porter, J.E., 1988. Normalizations and selection of speech segments for speaker recognition scoring. In: Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), vol. 1. pp. 595–598.
Mami, Y., Charlet, D., 2002. Speaker identification by location in an optimal space of anchor models. In: International Conference on Spoken Language Processing (ICSLP), vol. 2. pp. 1333–1336.
Mami, Y., Charlet, D., 2003. Speaker identification by anchor models with PCA/LDA post-processing. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1. pp. 180–183.
Mami, Y., Charlet, D., 2006. Speaker recognition by location in the space of reference speakers. Speech Communication 48 (2), 127–141.
Manly, B.F.J., 1990. Multivariate Statistical Methods: A Primer. Chapman & Hall, London.
Matrouf, D., Scheffer, N., Fauve, B., Bonastre, J.F., 2007. A straightforward and efficient implementation of the factor analysis model for speaker verification. In: Proc. Interspeech. pp. 1242–1245.
Merlin, T., Bonastre, J.F., Fredouille, C., 1999. Non directly acoustic process for costless speaker recognition and indexation. In: Workshop on Intelligent Communication Technologies and Applications.
Momtazi, S., Sameti, H., Vaisipour, S., Tefagh, M., 2007. Introducing a framework to create telephony speech databases from direct ones. In: Systems, Signals and Image Processing, 2007 and Sixth EURASIP Conference Focused on Speech and Image Processing, 14th International Workshop on Multimedia Communications and Services. pp. 327–330.
Picone, J., 1993. Signal modeling techniques in speech recognition. Proceedings of the IEEE 81 (9), 1215–1247.
Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice Hall, New Jersey.
Rosenberg, A.E., DeLong, J., Lee, C.-H., Juang, B.-H., Soong, F.K., 1992. The use of cohort normalized scores for speaker verification. In: International Conference on Spoken Language Processing (ICSLP). pp. 599–602.
Thyes, O., Kuhn, R., Nguyen, P., Junqua, J.-C., 2000. Speaker identification and verification using eigenvoices. In: International Conference on Spoken Language Processing (ICSLP), vol. 2. pp. 242–245.
Varga, A., Steeneken, H.J.M., 1993. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 12 (3), 241–246.
Young, S., 1996. A review of large-vocabulary continuous-speech recognition. IEEE Signal Proc. Magazine 13 (5), 45–57.