Low-complexity speaker verification with decimated supervector representations


Speech Communication 68 (2015) 11–22

B.C. Haris, Rohit Sinha
Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India
Received 16 June 2014; received in revised form 12 December 2014; accepted 17 December 2014; available online 26 December 2014

Abstract

This work explores the use of a few low-complexity data-independent projections for reducing the dimensionality of GMM supervectors in the context of speaker verification (SV). Projections derived using a sparse random matrix and decimation are explored and used as speaker representations. The reported study is done on the NIST 2012 SRE task using a state-of-the-art PLDA based SV system. Interestingly, the systems incorporating the proposed projections yield performances competitive with that of the commonly used i-vector representation based one. Both the sparse random matrix and the decimation based approaches require very little computation to find the speaker representations. A novel SV system that exploits the diversity among the representations obtained by using different offsets in the decimation of the supervector is also proposed. The resulting system achieves a relative improvement of 7% in terms of both detection cost and equal error rate over the default i-vector based system while still having lower overall complexity.

© 2015 Elsevier B.V. All rights reserved.

Keywords: Speaker verification; GMM mean supervector; Data-independent projection

1. Introduction

Current text-independent speaker verification (SV) systems predominantly use short-term cepstral feature extraction approaches to parameterize the speech signal and Gaussian mixture models (GMMs) to model the distribution of the feature vectors (Kinnunen and Li, 2010). An efficient approach to train a GMM for a speaker is to adapt the parameters of a universal background model (UBM) using the speaker dependent data with the maximum a posteriori (MAP) approach (Reynolds et al., 2000). The UBM is a GMM trained using a large amount of speech data gathered from a number of speakers, and thus it captures the speaker independent distribution of feature vectors.

http://dx.doi.org/10.1016/j.specom.2014.12.005

Adapting only the means of the UBM has been found to work well in practice. For verification, the likelihood of the test data feature vectors over the claimed speaker's model is computed and compared to a predefined threshold to obtain the decision. Apart from the likelihood based approaches, the support vector machine (SVM) classifier has also been proposed for the SV task (Wan and Renals, 2005; Campbell et al., 2006a,b). Among the SVM based SV systems, the one using GMM mean supervector representations (Campbell et al., 2006b) happens to be the most popular. GMM mean supervectors are created by concatenating the mean vectors of a speaker adapted GMM-UBM. The interpretation of GMM mean supervectors as a fixed dimension representation of a speaker utterance has also led to the development of efficient session/channel compensation techniques for the SV task (Hatch et al., 2006; Yin et al., 2007; Kenny et al., 2007).


The GMM mean supervectors are found to be effective in representing speaker utterances, but are noted to be highly redundant in terms of speaker dependent information. To remove the redundancy of the supervectors, a factor analysis based approach is commonly used in the front-end of SV systems. In that approach, a learned low-rank projection matrix called the total variability matrix (T-matrix) is used to derive the low-dimensional speaker representation, which is commonly referred to as the i-vector (Dehak et al., 2011). Systems using i-vectors as speaker representations and a probabilistic linear discriminant analysis (PLDA) based classifier are considered to be the state of the art for SV (Kenny, 2010; Garcia-Romero and Espy-Wilson, 2011). The major disadvantage of the i-vector based SV systems is the computational complexity and the memory requirement involved in deriving the i-vector representations. In addition, a large amount of speech data is required to learn the T-matrix. Recently, some works simplifying the i-vector computation and reducing the memory requirements have been reported (Glembek et al., 2011; Cumani and Laface, 2013; Li and Narayanan, 2014).

In the literature, data-independent projection approaches using random matrices are reported to provide a viable alternative to data-dependent ones like principal component analysis (PCA) for reducing the dimensionality of very high dimensional vectors (Kaski, 1998). In a recent work (Haris and Sinha, 2014), we explored some low-complexity data-independent projections using random matrices to reduce the dimensionality of GMM mean supervectors for a sparse representation classification based speaker identification system. Among the various projections explored, the sparse random matrix based one demands only modest computational resources and resulted in a performance competitive with that of the i-vector based approach. Such approaches are yet to be explored for an SV task, and that forms the basic motivation of this work. The salient contributions reported in this paper are as follows. (1) The use of low-complexity sparse random projections for reducing the dimensionality of supervectors in the context of a PLDA based SV system developed on a large multi-variability (NIST 2012 SRE) dataset. (2) A non-random data-independent projection based on simple decimation of supervectors is proposed for dimensionality reduction and is shown to be as effective as the sparse random projection while having a much lower complexity. (3) A multi-offset decimation diversity based SV system is proposed which outperforms not only the individual offset decimation based systems but also the default i-vector based system while still having lower computational requirements.

The rest of the paper is organized as follows. Section 2 presents the low-complexity data-independent projections used in this work. Section 3 describes the proposed projection based SV system and Section 4 the multi-offset decimation diversity based extension. The experimental setup is given in Section 5 and the results and discussions in Section 6. Section 7 compares the computational complexity of the various systems, and the paper is concluded in Section 8.

2. Data-independent projections of GMM mean supervectors

In many pattern recognition applications, the use of data-independent random projections as an alternative to data-dependent projections like PCA and factor analysis has been explored (Kaski, 1998; Bingham and Mannila, 2001). To reduce the dimensionality with random projections, the original d-dimensional supervector x is projected to a k-dimensional subspace using a random k x d matrix R as

$\hat{x} = R x$    (1)

where $\hat{x}$ denotes the low dimensional representation of the data. The basis for the use of random projections for classification tasks lies in the well known Johnson and Lindenstrauss (JL) lemma (Dasgupta and Gupta, 2003). It states that a set of n points in a high dimensional space can be mapped to a k-dimensional subspace such that the Euclidean distance between any two points changes by only a factor of $(1 \pm \epsilon)$, provided $k \geq 12 \ln n / \epsilon^2$. The random projection matrix R is typically created using random numbers having a standard normal distribution. Such a projection matrix is much simpler to generate than one derived using data-dependent approaches like PCA or factor analysis. At the same time, the random matrix based approach does not provide any computational advantage in finding the projections compared to PCA. Also, this approach does not produce representations as compact as those in the data-dependent cases. As a result, the size of the projection matrix in the random projection approach is much larger than in the data-dependent ones, and the generation and storage of such a large matrix is non-trivial. To address these issues, the use of a non-Gaussian sparse random matrix is proposed in Achlioptas (2001). In the following we first describe the data-independent projection using a sparse random matrix, which was shown to have attractive computational advantages in our earlier work (Haris and Sinha, 2014) in the context of a sparse representation based speaker identification task. Additionally, in this work we explore the decimation of supervectors as a method of data-independent dimensionality reduction. Interestingly, the decimation process can be interpreted as a projection using a sparse matrix, and this is also described in the following.

2.1. Sparse random projection matrix

As proposed in Li et al. (2006), the elements of the sparse random projection matrix R are distributed as

$[R]_{ij} = \sqrt{s} \times \begin{cases} +1 & \text{with probability } 1/2s \\ 0 & \text{with probability } 1 - 1/s \\ -1 & \text{with probability } 1/2s \end{cases}$    (2)


where s is the sparsity control parameter. Let the reduced dimension $k \geq k_0 = \frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \ln(n)$, where $\epsilon$ is the representation error, $\beta > 0$ is a parameter which controls the probability of success of the projections, and n is the total number of data points to be projected. In Achlioptas (2001), it is shown that the transformation using the Achlioptas matrix satisfies the JL bound with a probability of at least $1 - n^{-\beta}$. With the elements of the projection matrix being proportional to {±1, 0}, the multiplications in computing the projections reduce to additions of the data elements corresponding to the non-zero entries, except for the final normalization. For the value of the sparsity control parameter s being 200 in this work, the projection matrix defined in Eq. (2) turns out to have only 0.5% non-zero elements. Hence the sparse projection matrix allows for a substantial reduction in the computational complexity of finding the projections along with a very low storage requirement compared to non-sparse random projection matrices.
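To make the construction concrete, the following is a minimal NumPy sketch (ours, not the authors' code) of the projection of Eq. (1) with the sparse matrix of Eq. (2). It never materializes the k x d matrix: for each output coefficient only the non-zero column positions and their signs are drawn. The supervector dimension, the target dimension of 10k and s = 200 follow the values used later in Section 5.1; the random supervector is placeholder data.

import numpy as np

def sparse_random_projection(x, k, s=200, seed=0):
    # Apply the k x d matrix of Eq. (2) without storing it: per output row, draw
    # the (roughly d/s) non-zero column positions and their signs, then sum.
    rng = np.random.default_rng(seed)
    d = x.size
    y = np.empty(k)
    for i in range(k):
        n_nz = rng.binomial(d, 1.0 / s)                 # number of non-zero entries in this row
        cols = rng.choice(d, size=n_nz, replace=False)  # their column positions
        signs = rng.choice([1.0, -1.0], size=n_nz)      # equiprobable +/- signs
        y[i] = np.sqrt(s) * np.dot(signs, x[cols])      # additions/subtractions plus one scaling
    return y

def jl_dimension(n, eps=0.1):
    # Simplified JL bound quoted in Section 5.1: k >= 12 ln(n) / eps**2.
    return 12.0 * np.log(n) / eps ** 2

d = 1024 * 39                                # supervector size: 1024 mixtures x 39-dim MFCCs
x = np.random.randn(d)                       # placeholder GMM mean supervector
x_hat = sparse_random_projection(x, k=10000) # Eq. (1) with the Eq. (2) matrix, s = 200
print(jl_dimension(1931))                    # ~9079 for 1931 speakers, close to the 9078 of Sec. 5.1

Only the non-zero positions and signs need to be stored (or re-drawn from a fixed seed), which is the storage advantage noted above.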

2.2. Decimation as a projection using a sparse matrix

Decimation is the process of re-sampling the data at a lower rate. In this, every ith sample in the given data vector is retained and the shift i is referred to as the decimation factor. The decimation process can also be interpreted as a low rank projection in which each row of the projection matrix has only one non-zero (unity) coefficient whose position shifts by the chosen decimation factor across the rows. This is depicted below with the help of an example.

$[x_1 \; x_3 \; x_5]^T = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} [x_1 \; x_2 \; x_3 \; x_4 \; x_5]^T$    (3)

For a chosen decimation factor, shifting the non-zero entry in the first row of the projection matrix leads to decimation with an offset. By changing the offset one can select different sets of entries from the given vector. The decimation depicted above corresponds to the zero- or no-offset case. Later in this work, we also exploit decimations with different offset values. Based on the matrix interpretation of the decimation operation, we argue that it can be considered as a projection with one particular realization of a highly sparse random projection matrix. The sparsity of the projection matrix in this case depends upon the chosen decimation factor. For the decimation factor of 4 used in this work, only 0.0025% of the elements in the projection matrix turn out to be non-zero. Since in actual practice the decimation is implemented with a simple selection operation, it avoids the need for both multiplication and addition in computing the projections, and so entails even lower complexity than the sparse random projection approach described in Section 2.1.
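A minimal sketch (ours, not the authors' implementation) of decimation as a selection operation, together with a check of its equivalence to the sparse projection-matrix view of Eq. (3) on the toy five-dimensional example:

import numpy as np

def decimate(supervector, factor=4, offset=0):
    # Keep every `factor`-th coefficient starting at `offset`; no multiplications or additions.
    return supervector[offset::factor]

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
P = np.array([[1, 0, 0, 0, 0],          # projection matrix of Eq. (3): one unit entry per row,
              [0, 0, 1, 0, 0],          # shifted by the decimation factor (here 2, zero offset)
              [0, 0, 0, 0, 1]], dtype=float)
assert np.allclose(P @ x, decimate(x, factor=2, offset=0))

For the supervectors used later (decimation factor 4), decimate(y, 4, o) with o = 0, 1, 2, 3 yields the four mutually exclusive representations exploited in Section 4.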

3. Proposed low-complexity data-independent projection based SV system

In this section, we describe the use of the proposed low-complexity data-independent projections in the front end of a PLDA based SV system. Prior to that, the existing i-vector representation based PLDA system is reviewed. The low dimensional representations of GMM mean supervectors derived using factor analysis are generally known as i-vectors. In the factor analysis framework for computing i-vectors, a low rank factor loading matrix that represents the dominant speaker and channel variabilities is used. It is often referred to as the total variability matrix and is learned from suitable development data using the EM algorithm (Dehak et al., 2011). Let y be the GMM mean supervector created by concatenating the component specific centered 1st order statistics $F_c$ normalized with the 0th order statistics $N_c$ of the data computed over the UBM. The factor analysis model for y with the total variability matrix T is given as

$y = m + Tw$    (4)

where m is the UBM mean supervector and w is the latent variable, which is assumed to have a standard normal prior. The MAP point estimate of w, denoted by $\hat{w}$, is the i-vector representation and is computed as

$\hat{w} = (I + T'\Sigma^{-1}NT)^{-1}\, T'\Sigma^{-1}\tilde{F}$    (5)

where N and $\Sigma$ are matrices whose diagonal blocks are $N_c I$ and $\Sigma_c$ respectively, $\tilde{F}$ is the vector obtained by concatenating the component specific centered 1st order statistics, I is the identity matrix and $\Sigma_c$ is the covariance matrix of the cth component of the UBM. Often the i-vectors are further processed with suitable session/channel compensation methods such as linear discriminant analysis (LDA) and within class covariance normalization (WCCN) prior to classification.

State-of-the-art SV systems employ a PLDA based Bayesian classifier for computing verification scores from the speaker utterance representations. In the PLDA (Prince and Elder, 2007) based SV approach, a speaker utterance representation w is modeled as

$w = \mu + Hh + Gg + \epsilon$    (6)

where $\mu$ is the global mean of the representation vector population. The matrices H and G represent the speaker and channel subspaces respectively. The vectors h and g are the latent variables corresponding to the speaker and channel subspaces respectively and $\epsilon$ is the residual factor. In Kenny (2010), heavy-tailed priors are assumed for the latent variables to handle the effect of outlying data and the residual term is modeled with a zero mean and diagonal covariance assumption. The resultant model is known as heavy-tailed PLDA (HPLDA).


In the simplified Gaussian PLDA (GPLDA) model proposed in Garcia-Romero and Espy-Wilson (2011), the term Gg is ignored and a radial Gaussianization followed by length normalization is applied on the data vectors. Also, it assumes a full covariance matrix for the residual term and Gaussian priors for the latent variables. This approach is reported to be much faster than HPLDA without any degradation in performance. In this work we follow the GPLDA approach for building the SV system. Further, the scores are calibrated to optimize the detection performance. The overall processing pipeline is depicted in Fig. 1a.

In the proposed approach, to reduce the complexity of the system, the first front-end projection, performed through the computation of i-vectors, is replaced with a low-complexity data-independent projection of the supervectors. For doing so, the supervectors have to be computed explicitly from the statistics, unlike in the i-vector approach. The computed supervectors are then projected to a lower dimensional space using the data-independent approach. In this work both the sparse random projection and the decimation approaches are explored for data-independent dimensionality reduction. Fig. 1b shows the pipeline of the proposed SV system with the decimation based first projection in the front-end.

Unlike the sparse random projection, the decimation based approach selects the coefficients from the supervector in a systematic fashion. As the GMM mean supervector is formed by concatenating the mean vectors of the adapted GMM, it possesses a unique structure in which the speaker information is encoded in the quantum of the acoustic feature vector. In the decimation process, some choices of decimation factor lead to the complete absence of certain dimensions of the acoustic feature vector across all the Gaussian components in the decimated supervector. Such cases arise when the chosen decimation factor shares a common factor with the acoustic feature dimension. For example, with 39-dimensional MFCCs as the acoustic features, decimation factors of 3, 6, 9, ... lead to the complete absence of certain dimensions in the decimated supervectors. Such degenerative selections are expected to result in degraded performance. On the other hand, for the non-degenerative choices of decimation factors, all dimensions of the acoustic features are certain to be selected within a few Gaussian mixtures. As none of the acoustic feature dimensions is removed, this allows for effective modeling of the speakers even in the reduced space. With an increasing decimation factor, the number of Gaussian components required to achieve coverage of all the acoustic feature dimensions also increases. This in turn leads to a greater loss of the non-redundant speaker information in the supervector, resulting in a degradation in SV performance.
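To make the coverage argument above concrete, the following sketch (an illustration under the paper's settings of 39-dimensional features and a 1024-component GMM, not part of the original work) counts how many acoustic-feature dimensions survive a given decimation:

import numpy as np

def surviving_feature_dims(feat_dim=39, n_mix=1024, factor=4, offset=0):
    # Supervector index i corresponds to acoustic-feature dimension i % feat_dim.
    kept = np.arange(offset, feat_dim * n_mix, factor)
    return np.unique(kept % feat_dim).size

print(surviving_feature_dims(factor=4))   # 39 -> all dimensions covered (non-degenerative)
print(surviving_feature_dims(factor=3))   # 13 -> two thirds of the dimensions are never selected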

4. Multi-offset decimation diversity based SV system

In the data-independent dimensionality reduction approaches the coefficients of the supervector are selected in a non-informative fashion, unlike in the data-dependent ones, so the selected subspace is not expected to be optimal for all speakers. To some extent this issue can be addressed by introducing diversity into the process so as to derive complementary subspaces from the supervector space. The diverse representations extracted by using different projection matrices can be used to build a multi-classifier system with a score level fusion to exploit the complementary information. It is known that for multi-classifier systems exploiting diversity, the component systems should be competitive and at the same time should be based on complementary features (Oza and Tumer, 2001). For the sparse random projection case, the features derived using different realizations of the projection matrix may exhibit competitive performances but are not guaranteed to be complementary. In the decimation process, different low-dimensional representations can be derived from the same supervector by varying the offset value while keeping the decimation factor unchanged.

Fig. 1. Block diagrams showing the various processes involved in (a) i-vector and (b) proposed decimation based SV systems.


As this process chooses mutually exclusive sets of coefficients from the supervector, a relatively higher diversity between the derived representations is guaranteed. This motivated us to build decimation based SV systems using the representations derived by choosing different offsets. A logistic regression based score level fusion is used to combine the systems corresponding to the different offsets, as illustrated in Fig. 2. The resulting system is referred to as the multi-offset decimation diversity based SV (MODD-SV) system.

One can also explore the diversity approach with i-vector based SV systems. Doing so requires creating different T-matrices, and as T-matrix learning requires a large amount of data, this is non-trivial in practice. In addition, it would lead to a manifold increase in computational complexity. Alternatively, one can explore the decimation diversity while collecting statistics for i-vector computation. By applying decimation to the first order statistics supervector derived from the data, multiple i-vectors corresponding to different offsets can be computed from the same utterance. The scores of the various systems based on such i-vectors can then be fused as done for the MODD-SV system. Such a system still has very high complexity due to the computation of multiple i-vectors, but at the same time it is expected to provide a better performance than the proposed MODD-SV system. We have also developed this system for comparison purposes; it is referred to as the MODD-SV with i-vector system in this work.
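The score-level fusion itself is performed with the BOSARIS toolkit in this work; the sketch below is only a hedged stand-in showing the form of a logistic-regression fusion of the per-offset scores, with the learning rate and iteration count chosen arbitrarily.

import numpy as np

def train_fusion(dev_scores, labels, lr=0.1, n_iter=2000):
    # dev_scores: (n_trials, n_systems) per-offset scores; labels: 1 = target, 0 = non-target.
    n, m = dev_scores.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(dev_scores @ w + b)))   # sigmoid of the fused score
        w -= lr * (dev_scores.T @ (p - labels)) / n        # gradient of the logistic loss
        b -= lr * np.mean(p - labels)
    return w, b

def fuse(test_scores, w, b):
    # Fused (and implicitly calibrated) score of the multi-offset systems for each trial.
    return test_scores @ w + b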


5. Experimental setup

The studied low-complexity data-independent projections are contrasted with the data-dependent factor analysis based i-vector representations. The i-vector representations computed using both the default approach (Dehak et al., 2011) and a simplified (for better speed) approach (Li and Narayanan, 2014) are considered. The various front-end processing steps involved in the i-vector and the proposed low-complexity projection based SV systems are shown in Fig. 3. The illustrations highlight the kind and the size of the different projections involved in both approaches. In the i-vector case, the first projection is data-dependent, whereas in the proposed case it is data-independent. The next two projections in both cases are intended for the removal of session/channel variability from the representations. Those projections are required to be supervised and hence are data-dependent. For the i-vector based SV system, LDA followed by WCCN is used for session/channel compensation. The data-independent projections cannot produce representations as compact as the i-vectors, so the dimension of the vectors happens to be much higher than the number of training examples available in our experimental setup. In such conditions the computation of the LDA matrix becomes infeasible, as the within-class covariance matrix of the training data is singular. This is referred to as the 'small sample size' problem in the literature, and to overcome it, direct linear discriminant analysis (DLDA) (Yu and Yang, 2001) is used with the data-independent projections.

5.1. Database and system parameters

Evaluation of the proposed approaches is done on the NIST 2012 SRE database (NIST 2012 Speaker Recognition Evaluation, 2012), which contains 770 male and 1161 female speakers.

Fig. 2. Multi-offset decimation diversity based SV system. For ease of depiction, the case with decimation factor 3 is illustrated.


Fig. 3. Front-end projections involved in the PLDA based SV system using (a) the i-vector and (b) the proposed low-complexity projections as the speaker representations. In the i-vector case, the supervector is implicit and shown for ease of comparison.

Table 1
Number of speakers in the NIST 2012 SRE training data across gender and sensor types.

                                    Male    Female
Total number of speakers            770     1161
No. of speakers with phone data     770     1161
No. of speakers with mic data       335     423

Fig. 4. Tuning of the projection dimensions for the sparse random and decimation cases on the dev-test set. For the sparse random projection case, the averaged performance along with its standard deviation computed over 10 realizations is shown.

The training data for the target speakers are derived from multiple recording sessions, and the number of segments available across the speakers varies from 1 to 240. The training set contains telephone recorded phone conversations, microphone recorded phone conversations and microphone recorded interview recordings. The gender-wise and sensor-wise distributions of speakers in the training dataset are shown in Table 1. The microphone recorded data is available for only about half of the speakers. The test dataset contains 68,954 speech segments of 30, 100 and 300 s durations.


These segments are used to perform 1.38 million verification trials in the primary task, which is chosen for this study. In SRE 2012, the knowledge of all targets is allowed in computing the detection score of each trial. To examine the effect of this on system performance, known as well as unknown false trials are included in the task. The trials in the primary task are split into five evaluation conditions based on the sensor, noise and recording environment conditions of the test data segments, as shown in Table 2. For system development, speaker utterance segments of 3–5 min duration derived from the NIST 2006, 2008 and 2010 SRE datasets are used. Among those utterances, approximately 19k (7k male; 12k female) are telephone recorded and 7k (3k male; 4k female) are microphone recorded. For tuning the system parameters and for the calibration of scores, we have developed an initial version of each of the systems under study; these are referred to as development (dev) systems. For this purpose, the NIST 2012 SRE training dataset is split into development train (dev-train) and development test (dev-test) sets. The dev-test set contains about 4000 speech segments of 30–100 s durations. A set of development trials (dev-trials) is created comprising approximately 80,000 claims, keeping a ratio of 1:20 between true and false trials.

Table 2
Channel and noise conditions for test data corresponding to the different evaluation conditions in the NIST 2012 SRE dataset.

Evaluation condition    Test data channel and noise conditions
TC-1                    Microphone: clean
TC-2                    Telephone: clean
TC-3                    Microphone: noisy
TC-4                    Telephone: noisy
TC-5                    Telephone: noisy environment

The dev systems are trained on the dev-train dataset and evaluated on the dev-test dataset using these trials. The parameters of the different systems are tuned based on the performance on the dev-trials. The final versions of all the systems are trained on the 2012 SRE training data with the tuned parameter values and evaluated on the 2012 SRE test data. A detection cost measure ($C_{DET}$), defined as the mean of the normalized detection costs computed at two pre-defined operating points (NIST 2012 Speaker Recognition Evaluation, 2012), and the equal error rate (EER) are used as the performance measures.

All speech data used is sampled at 8 kHz with 16 bits/sample resolution and analyzed using a Hamming window of length 20 ms with a frame shift of 10 ms. To remove the silence portions from the data, an energy based voice activity detector with a threshold equal to 0.06 times the average energy of the utterance is employed. The speech data is parameterized into 39 dimensional acoustic features consisting of 13 base MFCCs along with their first and second derivatives. Cepstral mean subtraction and variance normalization are also performed on the acoustic feature vectors. All SV systems are developed in gender-dependent mode. Two gender-dependent UBMs of 1024 Gaussian mixtures are created and are kept common to all the developed systems. The UBM for modeling the male speakers is trained using approximately 40 h of telephone recorded speech data from 725 speakers, and that for the female speakers is trained using 50 h of telephone recorded speech data from 1099 speakers. For deriving the speaker representations, the 0th and 1st order Baum–Welch statistics of the target speech data are computed with respect to the corresponding UBM. The size of the i-vector is chosen to be 800 for both the default and simplified implementations. This choice is supported by an i-vector dimension tuning experiment reported in Kenny (2012). The default i-vectors are derived using the computed statistics and a T-matrix conditioned to 600 telephone and 200 microphone columns as described in Senoussaoui et al. (2010). The code provided at Li (2014) is used to compute the simplified i-vectors (Li and Narayanan, 2014). For the proposed approaches, the GMM mean supervectors are computed by normalizing the UBM mean centered 1st order statistics with the 0th order statistics and then projecting them to a low dimensional space.
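For reference, a minimal sketch (assuming a diagonal-covariance UBM; not the authors' code) of the statistics-to-supervector step just described, i.e., centering the 1st order Baum–Welch statistics around the UBM means and normalizing them by the 0th order statistics:

import numpy as np

def mean_supervector(frames, ubm_means, ubm_covs, ubm_weights, floor=1e-6):
    # frames: (T, p) acoustic features; ubm_means/ubm_covs: (C, p); ubm_weights: (C,).
    log_gauss = -0.5 * np.sum(((frames[:, None, :] - ubm_means) ** 2) / ubm_covs
                              + np.log(2.0 * np.pi * ubm_covs), axis=-1)
    log_post = np.log(ubm_weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)          # frame-level mixture posteriors, (T, C)
    N = post.sum(axis=0)                             # 0th order statistics N_c
    F = post.T @ frames                              # 1st order statistics F_c
    centred = F - N[:, None] * ubm_means             # centering around the UBM means
    return (centred / np.maximum(N, floor)[:, None]).reshape(-1)   # (C * p,) supervector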


In case of the sparse random projection matrix, the optimal size can be determined based on the JL lemma discussed earlier. In the chosen evaluation task (NIST 2012 SRE), there are 1931 speakers ($n = 1931$); assuming an error of 10% in representation ($\epsilon = 0.1$), the bound on the projection dimension ($k \geq 12 \ln n / \epsilon^2$) turns out to be about 9078. We have tuned the projection size around this theoretical value on the dev-test set. Fig. 4a shows the mean performance along with the standard deviation computed over 10 realizations for the sparse random projection based systems. Based on this tuning, a projection dimension of 10k is chosen. The tuning of the projection dimension for the decimation with zero-offset case is shown in Fig. 4b. Note that the decimation based system results in significantly degraded performance for projection dimensions 13.31k, 6.66k and 4.44k, which correspond to the degenerative decimation factor cases of 3, 6 and 9, respectively. These results support our earlier arguments given in Section 3. On the other hand, considering the performances for the non-degenerative cases, the projection dimension 9984 ≈ 10k (which corresponds to a decimation factor of 4) turns out to be a suitable size. The channel-conditioned LDA/DLDA matrices with 300 projection dimensions and a full-rank WCCN matrix are used in the appropriate systems. The approaches followed for handling the multiple training examples of speakers and the cross channel trials are the same as those in our submission to NIST 2012 SRE, reported in Haris et al. (2013). The telephone and microphone speaker models for each speaker are created by taking the mean of the representations of the telephone and microphone training utterances available for that speaker, respectively. All test segments are scored against the matching channel model of the claimed speaker. As the microphone models are not available for some speakers, the telephone models are used in trials involving such speakers irrespective of the channel of the test data. The score calibration for optimal $C_{DET}$ for all the systems and the fusion in the MODD-SV systems are performed with logistic regression using the BOSARIS toolkit (Brümmer and de Villiers). The dev-trials scores are used to train the parameters for the same.
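A small sketch of the enrollment and trial-handling logic described above (channel-wise model averaging with a telephone fallback); plda_score stands in for the GPLDA scoring function and is a hypothetical placeholder.

import numpy as np

def build_speaker_models(utt_reps, utt_channels):
    # Average the available representations per channel ("tel"/"mic") for one speaker.
    models = {}
    for ch in ("tel", "mic"):
        reps = [r for r, c in zip(utt_reps, utt_channels) if c == ch]
        if reps:
            models[ch] = np.mean(reps, axis=0)
    return models

def score_trial(models, test_rep, test_channel, plda_score):
    # Score against the matching-channel model; fall back to the telephone model
    # when no microphone model exists for the claimed speaker.
    model = models.get(test_channel, models["tel"])
    return plda_score(model, test_rep)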

5.2. Performance measures

A detection cost as defined in NIST 2012 Speaker Recognition Evaluation (2012) is used for evaluating the performance of the various SV systems developed in this work. It is defined as the mean of the normalized detection costs ($C_{norm}$) computed at two pre-defined operating points. Given the miss probability $P_{Miss|Tar}$, the known false-alarm probability $P_{FA|KnownNontar}$ and the unknown false-alarm probability $P_{FA|UnknownNontar}$, computed by applying the corresponding threshold values on the scores, $C_{norm}$ is defined as

$C_{norm} = P_{Miss|Tar} + \beta \left\{ P_{Known} \times P_{FA|KnownNontar} + (1 - P_{Known}) \times P_{FA|UnknownNontar} \right\}, \quad \text{where } \beta = \frac{C_{FA}}{C_{Miss}} \cdot \frac{1 - P_{Tar}}{P_{Tar}}$    (7)


$C_{FA}$ and $C_{Miss}$ denote the costs associated with false alarms and misses, respectively, and both are set to 1. $P_{Tar}$ is the prior probability of a target trial and $P_{Known}$ is the prior probability that a non-target trial is from one of the evaluation target speakers. The value of $\log(\beta)$ is used as the threshold for computing the miss and false-alarm probabilities. Let $C_{norm1}$ and $C_{norm2}$ denote the normalized detection costs computed with $P_{Tar}$ set to 0.01 and 0.001, respectively, while keeping the value of $P_{Known}$ equal to 0.5 in both cases. The overall detection cost $C_{DET}$ is then computed as

$C_{DET} = (C_{norm1} + C_{norm2})/2$    (8)

The detection cost $C_{DET}$ is used as the primary performance measure of the systems. For assessing the non-application-specific performance, the equal error rate (EER) is also used in this work.
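The following sketch evaluates Eqs. (7) and (8) on arrays of calibrated trial scores; it follows the definitions above and is an illustrative re-implementation, not the official NIST scoring tool.

import numpy as np

def c_norm(scores, is_target, is_known, p_tar, p_known=0.5, c_miss=1.0, c_fa=1.0):
    # Eq. (7) evaluated at the application threshold log(beta).
    beta = (c_fa / c_miss) * (1.0 - p_tar) / p_tar
    accept = scores >= np.log(beta)
    p_miss = np.mean(~accept[is_target])
    p_fa_known = np.mean(accept[~is_target & is_known])
    p_fa_unknown = np.mean(accept[~is_target & ~is_known])
    return p_miss + beta * (p_known * p_fa_known + (1.0 - p_known) * p_fa_unknown)

def c_det(scores, is_target, is_known):
    # Eq. (8): mean of the normalized costs at P_Tar = 0.01 and 0.001, with P_Known = 0.5.
    return 0.5 * (c_norm(scores, is_target, is_known, p_tar=0.01)
                  + c_norm(scores, is_target, is_known, p_tar=0.001))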

6. Results and discussions

The performances of the proposed projection based systems, as well as those of the default and simplified i-vector (Li and Narayanan, 2014) based systems with appropriate session/channel compensation, are given in the rightmost columns of Table 3 for the five different evaluation conditions. In comparison to the default i-vector, the simplified i-vector based system results in slightly degraded performance. This degradation is attributed to the simplification made in the factor analysis, which replaces the 0th order statistics of each GMM component by the total feature count of the utterance and thereby causes an imbalance between mixture components. Though a pre-weighting of the 1st order statistics by the square root of the normalized feature count of each mixture component is incorporated to address this effect, it is not able to recover the original counts fully. In addition, the quantization error due to the use of a look-up table for matrix inversion also contributes to the degradation. The proposed sparse random projection and decimation based systems are found to give competitive performances. Note that in the decimation case the implied projection matrix happens to be 200 times sparser than the projection matrix in the sparse random projection case. The reason for achieving competitive performance despite such high sparsity is attributed to the systematic selection of the coefficients of the supervector such that all the dimensions of the acoustic features are covered in the decimation process.

Table 3
Performances (with and without session/channel compensation) of the proposed SV systems with different types of data-independent projections and that of the i-vector based systems on the NIST SRE 2012 primary task, in terms of % EER and normalized detection cost (C_DET).

1st projection matrix                        Eval.   No comp.         With comp.
                                             cond.   EER     C_DET    EER     C_DET

Data-dependent
T-matrix: i-vector (Default)                 TC-1    20.69   0.905    6.55    0.492
                                             TC-2     8.39   0.644    4.90    0.557
                                             TC-3    18.76   0.990    5.45    0.483
                                             TC-4    13.32   0.677    4.54    0.512
                                             TC-5     9.42   0.689    4.84    0.528
                                             Avg.    14.12   0.785    5.26    0.514
T-matrix: Simplified i-vector                TC-1    21.85   0.954    7.38    0.568
(Li and Narayanan, 2014)                     TC-2     9.43   0.744    5.98    0.641
                                             TC-3    19.85   0.993    6.37    0.519
                                             TC-4    14.56   0.771    5.60    0.538
                                             TC-5     9.93   0.678    5.67    0.581
                                             Avg.    15.12   0.828    6.20    0.569

Data-independent
Sparse random (with s = 200)                 TC-1    23.54   0.975    7.68    0.582
                                             TC-2    10.57   0.862    5.89    0.675
                                             TC-3    21.42   0.958    6.63    0.517
                                             TC-4    15.09   0.934    5.66    0.557
                                             TC-5    11.30   0.878    5.75    0.613
                                             Avg.    16.38   0.921    6.32    0.588
Decimation (with no-offset)                  TC-1    23.47   0.982    7.68    0.584
                                             TC-2    11.86   0.878    5.82    0.664
                                             TC-3    21.29   0.964    6.65    0.523
                                             TC-4    15.30   0.929    5.58    0.568
                                             TC-5    12.56   0.882    5.73    0.596
                                             Avg.    16.89   0.927    6.29    0.587

No projection                                TC-1    22.40   0.977    7.19    0.534
                                             TC-2    10.56   0.854    5.82    0.614
                                             TC-3    20.39   0.964    6.25    0.502
                                             TC-4    14.81   0.927    5.34    0.527
                                             TC-5    11.15   0.876    5.53    0.526
                                             Avg.    15.86   0.920    6.03    0.541


The data-independent approaches are found to be degraded in comparison to the default i-vector based system and quite competitive with the simplified i-vector approach. The i-vector, being a data-dependent projection, provides not only dimensionality reduction but also an improvement in classification performance through the de-emphasis of the noisy measurements. So a fair comparison for the data-independent projections is with the full representation. Table 3 also shows the performance of the SV system with the DLDA matrix applied directly to the supervectors. The performance for the no-projection case is noted to be inferior to that of the default i-vector, which supports the above argument about the noisy dimension suppression achieved with data-dependent projections. On comparing the best performing projection approach among the proposed ones (i.e., the decimation based one) with the default i-vector representation, a degradation of 0.073 in $C_{DET}$ and 1.03 in % EER is noted when averaged over all evaluation conditions. On the other hand, in comparison to the simplified i-vector, no significant degradation in performance is noted with the proposed projections. When compared to the no-projection case, the decimation based system shows a degradation of 0.046 in $C_{DET}$ and 0.26 in % EER. The fact that the performance of the simplified i-vector approach, which involves a data-dependent transform, can be closely attained by the data-independent projection approaches appears somewhat counter-intuitive. We attribute this observed behavior to the succeeding discriminative transforms present in the front-end of both systems. To analyze it further we have evaluated the performances of all the systems using a simple cosine kernel classifier without session/channel compensation; those results are also given in Table 3. We note that without session/channel compensation both the default and simplified i-vector based systems outperform all the data-independent projection based systems. The DET curves corresponding to the TC-1 evaluation condition of the various SV systems with session/channel compensation are shown in Fig. 5.

Fig. 5. DET curves corresponding to the TC-1 evaluation condition of the various SV systems developed (default i-vector, simplified i-vector, sparse random projection, decimation and no projection; axes: false alarm probability vs. miss probability, in %).

6.1. MODD-SV system performance

The performances of the MODD-SV and the MODD-SV with i-vector systems, along with those of their component systems corresponding to different offset values, are given in Table 4. All the component systems of the MODD-SV system result in competitive performances despite each being based on different coefficients of the supervector. It is to be noted that by exploiting the diversity among the different decimated representations, the MODD-SV system achieves consistent improvements in all five evaluation conditions. A relative improvement of 7% over the default i-vector approach in terms of both $C_{DET}$ and EER is noted when averaged across the evaluation conditions. In comparison to that, the MODD-SV with i-vector system results in a small but consistent improvement over the direct MODD-SV system. As the major motivation of the work is the complexity reduction, the same is discussed in the following section.

7. Computational complexity

The order of the multiplication complexity and the memory requirement involved in the first two projections of the proposed low-complexity projection based systems and those of the default as well as the simplified i-vector based systems are given in Table 5. For comparison of the actual complexity involved, the observed runtimes of the different systems are also given. The computations and memory usage involved in the third projection (WCCN) and the PLDA classifier are relatively small and are the same for all the systems, so those are not considered in the comparison. As mentioned earlier, the projection with sparse random matrices can be realized with additions only, except for the final scaling. This makes the corresponding system much faster than both the default and simplified i-vector based systems. On the other hand, the decimation approach does not involve multiplication at all and its runtime turns out to be about 40 times smaller than that of the simplified i-vector approach (Li and Narayanan, 2014). In addition to the reduced projection complexity, both the sparse random projection and the decimation based systems are also attractive in terms of the storage space required for the projection matrix in comparison to the i-vector based ones. In case of the sparse projection matrix, one needs to save only the indices of the non-zero elements along with the sign information, which do not require floating point representations. The decimation approach has the least storage requirement as it avoids the need of a projection matrix completely.


Table 4
Performances of the MODD-SV system for the direct and with i-vector cases along with those of the corresponding component systems, on the NIST SRE 2012 primary task, in terms of % EER and normalized detection cost (C_DET).

1st projection matrix            Eval.   Direct           With i-vector
                                 cond.   EER     C_DET    EER     C_DET

Decimation (Offset-0)            TC-1    7.68    0.584    7.31    0.555
                                 TC-2    5.82    0.664    5.92    0.620
                                 TC-3    6.65    0.523    6.41    0.516
                                 TC-4    5.58    0.568    5.53    0.547
                                 TC-5    5.73    0.596    5.69    0.588
                                 Avg.    6.29    0.587    6.17    0.565

Decimation (Offset-1)            TC-1    7.71    0.579    7.28    0.552
                                 TC-2    5.79    0.672    6.01    0.632
                                 TC-3    6.72    0.519    6.32    0.513
                                 TC-4    5.62    0.571    5.68    0.557
                                 TC-5    5.74    0.603    5.76    0.591
                                 Avg.    6.32    0.589    6.21    0.569

Decimation (Offset-2)            TC-1    7.69    0.581    7.32    0.556
                                 TC-2    5.82    0.674    6.07    0.633
                                 TC-3    6.84    0.523    6.33    0.513
                                 TC-4    5.72    0.582    5.70    0.560
                                 TC-5    5.83    0.611    5.80    0.595
                                 Avg.    6.38    0.594    6.24    0.571

Decimation (Offset-3)            TC-1    7.56    0.561    7.24    0.543
                                 TC-2    5.71    0.630    5.92    0.625
                                 TC-3    6.74    0.519    6.24    0.504
                                 TC-4    5.63    0.572    5.61    0.556
                                 TC-5    5.51    0.580    5.72    0.583
                                 Avg.    6.23    0.572    6.15    0.562

MODD-SV (Fusion: Offset 0–3)     TC-1    5.93    0.458    5.84    0.449
                                 TC-2    4.31    0.521    4.26    0.505
                                 TC-3    5.54    0.454    5.46    0.442
                                 TC-4    4.38    0.467    4.27    0.453
                                 TC-5    4.48    0.491    4.37    0.478
                                 Avg.    4.93    0.478    4.84    0.465

Table 5
Comparison of the multiplication complexity and memory usage in the first two front-end projections of various SV systems. (GMM size m = 1024, MFCC feature dim. p = 39, i-vector size q = 800, random proj. size r = 10k, and LDA/DLDA proj. size s = 300.)

Projection-I approach                   Multiplication complexity        Runtime     Memory usage (in MB)
                                        Proj.-I              Proj.-II    (ms)^a      Proj.-I  Proj.-II  Algo. specific  Total

i-vector (default)                      O(mpq + mq^2 + q^3)  O(qs)       4620        255      2         5250            5507
i-vector (simplified)                   O(mpq)               O(qs)       273         255      2         51              308
Sparse random                           O(r)                 O(rs)       13          5        24        -               29
Decimation: single                      -                    O(rs)       6           -        24        -               24
Decimation diversity: MODD + i-vector   O(rq + mq^2 + q^3)   O(qs)       16,920^b    255      8         5250            5513
Decimation diversity: MODD              -                    O(rs)       24^b        -        96        -               96

^a Computed using 64-bit MATLAB (R2010b) running on an Intel Xeon 6-core CPU, 2.4 GHz with 32 GB RAM, in single-thread default-precision mode.
^b Corresponds to serial implementation of the four component systems.
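As a rough cross-check of Table 5, the sketch below evaluates the Projection-I multiplication-count expressions at the stated parameter values; these are indicative operation counts only, not the measured runtimes reported in the table.

m, p, q, r, s = 1024, 39, 800, 10000, 300   # GMM size, MFCC dim., i-vector size, random proj. size, LDA/DLDA size

proj1_mults = {
    "i-vector (default)":    m * p * q + m * q * q + q ** 3,   # ~1.2e9 multiplications
    "i-vector (simplified)": m * p * q,                        # ~3.2e7
    "sparse random":         r,                                # final scaling only
    "decimation":            0,                                # pure selection, no arithmetic
}
for name, count in proj1_mults.items():
    print(f"{name:24s} {count:15,d}")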

In addition, the proposed approaches do not have any algorithm-specific memory requirement during the computation of the representation, unlike the i-vector computation. Hence, the overall memory requirement for both the sparse random projection and the decimation based systems is noted to be about 10 times lower than that for the simplified i-vector based system.

Recently, another fast i-vector computation using a factored sub-space approach has been proposed, which achieves a five-fold reduction in the memory required for storing the T-matrix without degradation in performance (Cumani and Laface, 2013). The complexity of this approach is slightly higher than that of the eigen-decomposition based approach reported in Glembek et al. (2011), which in turn has the same complexity as the simplified i-vector considered in this work.


On comparing with the factored subspace based i-vector, the proposed decimation approach is noted to have much lower computational complexity and memory requirements but with a small degradation in performance. With this, the proposed decimation approach becomes very attractive for limited resource platforms due to its low computational burden. Further, in the context of distributed speaker verification systems, it can also lead to a very simple synchronization between the client and the server.

For both kinds of MODD-SV systems, the runtime as well as the memory usage is increased by a factor equal to the employed decimation factor in comparison to those of the corresponding component systems. In this work, the total memory requirement for the MODD-SV system with four decimation components turns out to be 96 MB, which still remains competitive with that of the efficient i-vector approaches reported in the literature. Thus, the proposed MODD-SV system provides an improved performance with a much lower computational complexity and a competitive memory requirement even compared to the efficient implementations of the i-vector based SV system. On the other hand, the complexity of the MODD-SV with i-vector system is noted to be much larger than that of the direct MODD-SV system in terms of both runtime and memory requirement. In Eq. (5), the term that contributes the most to the complexity of the i-vector computation is $T'\Sigma^{-1}NT$. The corresponding order of complexity is given by $mq^2$, where m is the number of mixtures in the GMM and q is the i-vector dimension (Glembek et al., 2011). As this term is independent of the statistics supervector dimension, the overall complexity of the component i-vector systems remains very high even after decimation of the statistics. With four such component systems being used, the total complexity of the MODD-SV with i-vector becomes even higher. It is to be noted that the reported runtimes correspond to a sequential implementation of the component systems. With the use of parallelism in the implementation, the runtimes of both kinds of MODD-SV systems could be brought close to those of their component systems.

8. Conclusion

In this work, the use of a few low-complexity data-independent projections for deriving compact speaker representations from GMM mean supervectors for speaker verification is presented. The low-complexity data-independent projections derived using a sparse random matrix and a decimation process are explored and contrasted with the existing i-vector based data-dependent ones. The performance of the proposed projections is found to be slightly inferior to the default i-vector implementation and competitive with the fast i-vector implementation on the NIST 2012 SRE primary task. In terms of computational complexity and memory usage the proposed approaches, especially the decimation based one, are found to have a significant advantage even over the fast i-vector implementation. Further, the diversity among the representations obtained by decimation with different offsets is also exploited through score-level fusion.

21

The resulting system is noted to provide a 7% relative improvement over the default i-vector based system in terms of both the detection cost and the equal error rate. The computational requirements of the final system remain much lower than those of the default i-vector based system and comparable to those of the fast i-vector based systems. By introducing the i-vector computation as the second transformation in the decimation diversity based system, the performance can be further improved, but at the cost of a manifold increase in computational complexity.

Acknowledgements

This work has been partially supported by the ongoing project "Development of speech based multilevel authentication system" sponsored by the Department of Information Technology, Government of India.

References

Achlioptas, D., 2001. Database-friendly random projections. In: Proc. ACM Symposium on Principles of Database Systems, pp. 274–281.
Bingham, E., Mannila, H., 2001. Random projection in dimensionality reduction: applications to image and text data. In: Proc. ACM Intl. Conf. Knowledge Discovery and Data Mining, pp. 245–250.
Brümmer, N., de Villiers, E. BOSARIS toolkit. (accessed 10.01.14).
Campbell, W.M., Campbell, J., Reynolds, D.A., Singer, E., Torres-Carrasquillo, P., 2006a. Support vector machines for speaker and language recognition. Comput. Speech Lang. 20, 210–229.
Campbell, W.M., Sturim, D.E., Reynolds, D.A., 2006b. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13, 308–311.
Cumani, S., Laface, P., 2013. Fast and memory effective i-vector extraction using a factorized sub-space. In: Proc. Interspeech, pp. 1599–1603.
Dasgupta, S., Gupta, A., 2003. An elementary proof of the Johnson–Lindenstrauss lemma. Random Struct. Algorithms 22, 60–65.
Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19, 788–798.
Garcia-Romero, D., Espy-Wilson, C.Y., 2011. Analysis of i-vector length normalization in speaker recognition systems. In: Proc. Interspeech, pp. 249–252.
Glembek, O., Burget, L., Matějka, P., Karafiát, M., Kenny, P., 2011. Simplification and optimization of i-vector extraction. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4516–4519.
Haris, B.C., Pradhan, G., Sinha, R., Prasanna, S., 2013. The IITG speaker verification systems for NIST SRE 2012. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7668–7672.
Haris, B.C., Sinha, R., 2014. Exploring data-independent dimensionality reduction in sparse representation-based speaker identification. Circ. Syst. Signal Process., 1–18.
Hatch, A.O., Kajarekar, S., Stolcke, A., 2006. Within-class covariance normalization for SVM-based speaker recognition. In: Proc. International Conference on Spoken Language Processing (ICSLP), pp. 1471–1474.
Kaski, S., 1998. Dimensionality reduction by random mapping: fast similarity computation for clustering. In: Proc. IEEE International Joint Conference on Neural Networks, pp. 413–418.
Kenny, P., 2010. Bayesian speaker verification with heavy-tailed priors. In: Proc. Odyssey: The Speaker and Language Recognition Workshop.


Kenny, P., 2012. A small footprint i-vector extractor. In: Proc. Odyssey: The Speaker and Language Recognition Workshop.
Kenny, P., Boulianne, G., Ouellet, P., Dumouchel, P., 2007. Speaker and session variability in GMM-based speaker verification. IEEE Trans. Audio Speech Lang. Process. 15, 1448–1460.
Kinnunen, T., Li, H., 2010. An overview of text-independent speaker recognition: from features to supervectors. Speech Commun. 52, 12–40.
Li, M., 2014. Simplified i-vector MATLAB code. (accessed 14.04.14).
Li, M., Narayanan, S., 2014. Simplified supervised i-vector modeling with application to robust and efficient language identification and speaker verification. Comput. Speech Lang. 28, 940–958.
Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections. In: Proc. ACM International Conference on Knowledge Discovery and Data Mining, pp. 287–296.
NIST 2012 Speaker Recognition Evaluation, 2012.
Oza, N.C., Tumer, K., 2001. Input decimation ensembles: decorrelation through dimensionality reduction. In: Proc. Second International Workshop on Multiple Classifier Systems. Springer, pp. 238–247.

Prince, S., Elder, J.H., 2007. Probabilistic linear discriminant analysis for inferences about identity. In: Proc. IEEE International Conference on Computer Vision (ICCV), pp. 1–8.
Reynolds, D.A., Quatieri, T.F., Dunn, R.B., 2000. Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 10, 19–41.
Senoussaoui, M., Kenny, P., Dehak, N., Dumouchel, P., 2010. An i-vector extractor suitable for speaker recognition with both microphone and telephone speech. In: Proc. Odyssey: The Speaker and Language Recognition Workshop.
Wan, V., Renals, S., 2005. Speaker verification using sequence discriminant support vector machines. IEEE Trans. Speech Audio Process. 13, 203–210.
Yin, S.C., Rose, R., Kenny, P., 2007. A joint factor analysis approach to progressive model adaptation in text-independent speaker verification. IEEE Trans. Audio Speech Lang. Process. 15, 1999–2010.
Yu, H., Yang, J., 2001. A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recogn. 34, 2067–2070.