
Phonotactic language recognition using dynamic pronunciation and language branch discriminative information

Xianliang Wang, Yulong Wan, Lin Yang, Ruohua Zhou (corresponding author), Yonghong Yan

Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China

Received 22 June 2014; received in revised form 11 August 2015; accepted 5 October 2015; available online 14 October 2015

Abstract

This paper presents our study of a phonotactic language recognition system using dynamic pronunciation and language branch discriminative information. The theory of the language branch from linguistics is introduced to language recognition, and a phonotactic language branch variability (PLBV) method based on factor analysis is proposed. In our work, a phoneme variability factor containing dynamic pronunciation information is investigated first. By concatenating low-dimensional phoneme variability factors from the language branch spaces, the phonotactic language branch variability factor is obtained. Language models are trained within and between language branches with support vector machines (SVM). The proposed method exploits dynamic and discriminative phonotactic pronunciation characteristics without relying on error-prone phoneme sequences. Results on the 2011 NIST Language Recognition Evaluation (LRE) 30 s data set show that the proposed method significantly outperforms the parallel phoneme recognizer followed by vector space models (PPRVSM) and ivector systems, obtaining relative improvements of 28.2–72.0% in EER, minDCF and the language-pair performance metrics.

© 2015 Elsevier B.V. All rights reserved.

Keywords: Phonotactic language branch variability; Factor analysis; Dynamic pronunciation information; Discriminative information; Language recognition

1. Introduction

Language recognition aims to determine the language identity of a given segment of speech. Two representative approaches have been widely used, one based on phonotactic features and one based on spectral features. In acoustic feature systems, Gaussian mixture models (GMM) and support vector machines (SVM) have been the usual choices (Burget et al., 2006; Campbell et al., 2006), gradually outperforming the phonotactic systems (Dehak et al., 2011; Glembek et al., 2012). Recently, the ivector approach based on factor analysis has provided significant


improvements to language recognition systems using spectral features (Martínez et al., 2011; Dehak et al., 2011; Li and Narayanan, 2013). It defines a low-dimensional ivector space modeling both speaker and channel variabilities; a low-dimensional ivector is obtained by mapping a sequence of speech frames onto this space. Previous results in the 2011 National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) (Greenberg et al., 2012) showed the superiority of the ivector system over phonotactic systems (Singer et al., 2012).

In phonotactic approaches, speech utterances are first tokenized into phoneme sequences or lattices, and an n-gram lexicon model is then applied to model them. Phoneme recognizer followed by language model (PRLM) (Zissman and Singer, 1994) is a classic approach, using a phoneme recognizer built from an Artificial Neural Network (ANN) (Bourlard and


Morgan, 1994) and a Viterbi decoder, with an n-gram language model to model the discriminative gram statistics. There have been several significant improvements on PRLM. In W. M. Campbell's work (Campbell et al., 2007), representative n-gram phone statistics were selected and an SVM was used to model the statistics. In H. Li's work (Li et al., 2007), parallel phoneme recognizer followed by vector space models (PPRVSM) was proposed for language recognition based on vector space modeling and obtained excellent performance. T. Mikolov proposed a PCA-based feature extraction method for phonotactic language recognition (Glembek et al., 2010), in which the dimension of the trigram soft-counts is reduced using PCA. M. Soufifar proposed an ivector approach to phonotactic language recognition (Soufifar et al., 2011) and obtained performance comparable to the PCA-based feature extraction approach. In D'Haro's work (D'Haro et al., 2012), ivectors based on trigram counts were proposed using a multinomial subspace model. These phonotactic approaches were proven efficient for the language recognition task and achieved performance comparable to the ivector at the spectral level. M. Li (Li and Liu, 2014) also presented a generalized i-vector framework with phonotactic tokenizations and tandem features for speaker verification as well as language identification.

However, these methods model the n-grams derived from a phoneme recognizer and may suffer from several problems. For example, the model size grows exponentially with the model order n, and the selection of representative n-grams inevitably brings some loss of discriminative information. Meanwhile, the approach is vulnerable to the errors induced by the phoneme recognizer. In H. Wang's work (Wang et al., 2013), shift-delta multilayer perceptron (SDMLP) features based on phoneme posterior probabilities were introduced to a GMM-based language recognition system. The feature does not depend on n-gram phone statistics and achieved good performance owing to the rich pronunciation information of phonotactic features (Zissman and Berkling, 2001; Lei et al., 2014). Even though it obtained significant improvements, the dimension of the SDMLP features is usually high, making the features impractical for well-performed factor analysis.

With the development of language recognition, more attention is paid to the discrimination between pairs of languages, as emphasized in the NIST 2011 LRE. In the NIST 2011 LRE, more confusable target languages were evaluated, and new performance metrics that consider only the N worst-performing language pairs were defined. This means that all the target languages should be modeled suitably, and the discriminative ability between confusable pairs becomes more important.

In this paper, the phonotactic language branch variability (PLBV) system is proposed. In this method, phoneme-dependent features and delta phoneme-dependent features are put forward to capture the dynamic pronunciation information of speech signals, and are investigated in phonotactic factor analysis. The PLBV method aims to strengthen the discrimination between confusable


languages. Languages can be divided into different language branches from the perspective of linguistics. Languages in the same language branch share remarkably similar patterns, and may be related through descent from a common ancestor or be different dialects of a region. The proposed method considers the discriminative information of languages both within and between language branches, at the factor level and at the model level. At the factor level, language branch variability factors are obtained by combining factors mapped onto the language branch spaces, and the concatenated factors contain rich representative information and discriminative information between language branches. At the model level, two groups of SVM models are trained: one group covers richer discriminative information of languages related to language branches, and the other group emphasizes the discrimination between languages. Experimental results show that the discernment between confusable language pairs is stronger than with traditional language recognition methods.

The remainder of this paper is organized as follows. Section 2 briefly introduces the ivector approach for language recognition. Section 3 describes the process of extracting features and phoneme variability factors. Section 4 describes phonotactic factor analysis using language branch discriminative information in detail. The corpus and experimental setup are presented in Section 5. Section 6 presents experimental results. Conclusions are given in Section 7.

2. Ivector for language recognition

The ivector approach based on factor analysis has become state-of-the-art in both speaker verification and language recognition. It defines a new low-dimensional space mapping the high-dimensional GMM supervector to a low-dimensional, fixed-length vector. For a given utterance, the language- and variability-dependent supervector is denoted as Eq. (1):

    M = m + Tw                                                        (1)

where m is the supervector from the Universal Background Model (UBM), T is the language total variability space, and w is a standard-normally distributed latent variable called the ivector factor.

3. Feature and phoneme variability factor

3.1. Phoneme-dependent feature and delta phoneme-dependent feature

The process of extracting phoneme-dependent features is illustrated in Fig. 1. Firstly, speech utterances are tokenized into frames of features represented by phoneme posterior probabilities. The phoneme posterior probabilities are obtained from the phoneme recognizer (Schwarz, 2009) using long-time TempoRAl Pattern (TRAP) features (Hermansky and Sharma, 1999) and an ANN developed by the Brno University of Technology (BUT). They are estimated from critical band energies using a context of 310 ms around the current frame (Schwarz et al., 2006).

Fig. 1. Block diagram of extracting phoneme-dependent features.

In the next step, the probabilistic features are transformed into the logarithmic domain, which ensures constant sharpness of the posterior probabilities. Principal Component Analysis (PCA) is then used for decorrelation and dimension reduction; it finds the dimensions with the highest variability in the feature space. After the PCA transformation, each feature element is normalized to zero mean and unit variance by Mean and Variance Normalization (MVN) computed over the whole utterance. MVN reduces the impact of the pronunciation variability of different speakers. It operates similarly to Cepstral Mean Subtraction (CMS) and Cepstral Variance Normalization (CVN), which are commonly applied to the features of acoustic systems. The resulting features are better matched to GMM modeling. We call the normalized features phoneme-dependent features.

Delta phoneme-dependent features are derived from the phoneme-dependent features by taking time derivatives to capture short-term speech dynamics; they contain dynamic pronunciation information of the speech signals. The first-order derivative is implemented as follows:

    Δy_it = y_i(t+1) − y_i(t−1)                                       (2)

where Δy_it is the delta feature at frame t of utterance i, and y_i(t+1) and y_i(t−1) represent the transformed features of the following and previous frames, respectively. As with the phoneme-dependent features, MVN is applied to the delta features.
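To make the front-end concrete, the following is a minimal Python/NumPy sketch of Section 3.1, not the authors' code. It assumes the phoneme posteriors have already been produced by the BUT recognizer and that a PCA basis was estimated on training data; all function names are illustrative.

```python
import numpy as np

def phoneme_dependent_features(post, pca_basis, eps=1e-10):
    """Sketch: phoneme posteriors -> log domain -> PCA -> utterance-level MVN.

    post: (T, P) per-frame phoneme posterior probabilities.
    pca_basis: (P, D) PCA projection estimated on training data (assumed given).
    """
    logp = np.log(post + eps)                 # transform to logarithmic domain
    y = logp @ pca_basis                      # decorrelation and dimension reduction
    return (y - y.mean(axis=0)) / (y.std(axis=0) + eps)  # MVN over the utterance

def delta_features(y, eps=1e-10):
    """First-order derivative of Eq. (2); edge frames are zeroed, an
    implementation choice the paper does not specify."""
    dy = np.zeros_like(y)
    dy[1:-1] = y[2:] - y[:-2]                 # y_{t+1} - y_{t-1}
    return (dy - dy.mean(axis=0)) / (dy.std(axis=0) + eps)  # MVN on the deltas
```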

3.2. Phoneme variability factor

The phoneme variability factor is extracted as shown in Fig. 2. In principle, the delta phoneme-dependent feature Δy_it could simply be appended to the static feature vector y_it. Considering the high dimensionality of the features, however, the phoneme-dependent features and delta phoneme-dependent features are instead fed to factor analysis separately, yielding a phoneme factor w and a delta phoneme factor Δw. We concatenate the phoneme factor and delta phoneme factor as follows:

    w̃ = [w, Δw]                                                      (3)

A speech utterance is thus represented by a compact factor containing pronunciation variability information over the time-lapse sequence. We call this factor the phoneme variability factor.

Fig. 2. Block diagram of extracting phoneme variability factor.

4. Phonotactic factor analysis using language branch discriminative information

4.1. Language branch

A language family (http://en.wikipedia.org/wiki/Language_family) is, in historical linguistics, a group of languages related to each other through descent from a common ancestor. A language branch is established by languages sharing common features that are not found in the other languages of the language family, for example the West Slavic branch and the East Slavic branch, both from the Common Slavic family. Fig. 3 shows an example, the Proto-Indo-European language family tree (Fortson, 2004).

Fig. 3. Proto-Indo-European language family tree.

Membership of languages in the Indo-European language family is determined by genetic relationships, meaning that all members are presumed descendants of a common ancestor, Proto-Indo-European. Membership in the various branches, groups and subgroups is also genetic. For example, what makes the Germanic languages a branch of Indo-European is that much of their structure and phonology can be stated in rules that apply to all of them. Many of their common features are presumed innovations that took place in Proto-Germanic, the source of all the Germanic languages. Membership of languages in the same language branch is established by comparative linguistics: the languages are either genetically related or contact languages. Two languages are considered genetically related if one is descended from the other, as with Hindi and Urdu, or if the two languages are descended from a common ancestor, as with Czech and Slovak. A language may also be in contact with other languages, the languages influencing each other through language transfer and, specifically, borrowing, as with Japanese and Chinese. Mixed languages, pidgin languages and creole languages also form language branches because of their inseparable relationships.

4.2. Phonotactic language branch variability method

The NIST 2011 LRE differed from the previous evaluations in emphasizing the language-pair condition, and it contained more confusable language pairs. The most confusable pairs were generally within clusters of linguistically similar languages. The PLBV method pays particular attention to the discriminative information of languages between and within language branches, at both the factor level and the model level. At the factor level, phonotactic language branch variability factors are extracted by mapping the GMM supervector onto all of the language branch variability spaces. For a given utterance, the language- and variability-dependent supervector is denoted in Eq. (4):


    M = m_bi + T_bi w_bi                                              (4)

where b_i denotes language branch i, m_bi is the GMM supervector of language branch i, T_bi is the phonotactic language branch variability space, and w_bi is the variability factor in that space, which has a standard normal distribution. For simplicity, GMM in the remainder of this paper denotes a specific language branch GMM. The process of training the language branch variability space T_bi of a language branch is exactly the same as learning a total variability space, except that the language branch GMM supervector is used instead of the UBM supervector; it is a process of eigenvoice modeling (Kenny et al., 2005). The phonotactic language branch variability factor is obtained by concatenating the phoneme variability factors from all of the language branch spaces as follows:

    w = [w_b1, Δw_b1, w_b2, Δw_b2, ..., w_bi, Δw_bi, ..., w_bL, Δw_bL]    (5)

where L is the number of language branches. A variability factor can be seen as a distribution in a variability space, and the same utterance is represented by different distributions in different variability spaces. The concatenated factor covers the different characteristics of these distributions across the language branch variability spaces, and therefore contains rich representative and discriminative information. In this way, different variability factors within the same language branch represent the different characteristics of different utterances, while the variability factors of all the language branches together constitute the phonotactic language branch variability factor, which contains richer representative information about the utterance and richer discriminative information between language branches. The variability factor of language branch i is defined as follows:

    w_bi = (I + T_bi^t R_bi^{-1} N_bi(u) T_bi)^{-1} T_bi^t R_bi^{-1} F̃_bi(u)    (6)

where N_bi(u) is a diagonal matrix whose diagonal blocks are N_bi^[c] I, and F̃_bi(u) is obtained by concatenating all the first-order Baum-Welch statistics F_bi^[c] for utterance u. R_bi is the common diagonal covariance of the GMM. N_bi^[c] and F_bi^[c] are computed as follows:

    N_bi^[c] = Σ_{t=1..T} γ_bi^[c](t)                                 (7)

    F_bi^[c] = Σ_{t=1..T} γ_bi^[c](t) (y_t − m_bi^[c])                (8)

where T is the number of frames and c is the Gaussian index. γ_bi^[c](t) is the posterior probability of mixture component c generating the vector y_t, and m_bi^[c] is the mean of GMM mixture component c.

Two groups of SVM models are trained, and the scores of the two groups of models are fused as the final results. A block diagram of the process is shown in Fig. 4, taking Arabic Iraqi as an example.
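Eqs. (6)–(8) translate directly into NumPy. The sketch below assumes the per-branch parameters (variability space, diagonal covariance, GMM means) have already been trained, and that the delta stream has its own branch spaces, which the paper does not spell out; the names are illustrative, not the authors' implementation.

```python
import numpy as np

def baum_welch_stats(gamma, Y, means):
    """Eqs. (7)-(8): zero-order and centered first-order statistics.

    gamma: (T, C) posteriors of the branch GMM components per frame.
    Y: (T, D) feature frames; means: (C, D) branch GMM component means.
    """
    N = gamma.sum(axis=0)                    # Eq. (7), shape (C,)
    F = gamma.T @ Y - N[:, None] * means     # Eq. (8), shape (C, D)
    return N, F

def branch_factor(T_b, R_b, N, F):
    """Eq. (6): point estimate of the branch variability factor.

    T_b: (C*D, r) branch variability space; R_b: (C*D,) diagonal covariance.
    """
    C, D = F.shape
    N_diag = np.repeat(N, D)                 # diagonal of N(u): blocks N^[c] I
    TtRinv = T_b.T / R_b                     # T^t R^{-1}, shape (r, C*D)
    A = np.eye(T_b.shape[1]) + (TtRinv * N_diag) @ T_b
    return np.linalg.solve(A, TtRinv @ F.ravel())

def plbv_factor(branches, stats):
    """Eq. (5): concatenate static and delta factors over all L branches."""
    parts = []
    for (T_b, R_b, T_db, R_db), (N, F, dN, dF) in zip(branches, stats):
        parts.append(branch_factor(T_b, R_b, N, F))      # w_bi
        parts.append(branch_factor(T_db, R_db, dN, dF))  # delta w_bi
    return np.concatenate(parts)
```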

In the first group, models are trained within and between language branches, emphasizing the discrimination of languages within a language branch and the distinction between language branches. Take Arabic Iraqi as an example: we assign Arabic Iraqi, Arabic Levantine, Arabic Maghrebi and Arabic MSA to the Arabic branch. When training the Arabic Iraqi model, the utterances of Arabic Iraqi are the positive samples and the utterances of the other languages in the Arabic branch are the negative samples. In this process, the discrimination of confusable language pairs becomes more prominent. Models of the language branches themselves are also trained; for example, when training the Arabic branch model, the utterances in the Arabic branch are the positive samples and the other utterances in the training dataset are the negative samples. The models in this group complement each other, making the discrimination of languages more apparent.

As for the second group of models, target language models are trained over all the languages. The positive samples of a target language are composed of the utterances of that language, and the utterances of all the other languages are merged into the negative samples. The language models in this group cover the discriminative information of languages from different language branches.

4.3. Backend

The scores output by the SVM classifiers are normalized and warped as follows:

    s'_i = exp((s_i − μ) / (2σ))                                      (9)

    s''_i = log((L − 1) s'_i / (Σ_{j=1..L} s'_j − s'_i))              (10)

where s_i is the initial SVM score, μ and σ are the mean and standard deviation of the scores for the utterance, L is the number of scores, and s''_i is the normalized result. The resulting scores fall into a lower dynamic range and attain greater differentiation. Linear Discriminant Analysis (LDA) (Balakrishnama and Ganapathiraju, 1998) and a generative Gaussian backend (BenZeghiba et al., 2009) are then applied to the scores for calibration.
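A direct transcription of Eqs. (9) and (10), assuming one raw score per SVM model for each utterance; this is a sketch under that assumption, not the authors' implementation.

```python
import numpy as np

def normalize_scores(s):
    """Warp the raw per-utterance SVM scores s (shape (L,)) per Eqs. (9)-(10)."""
    mu, sigma = s.mean(), s.std()
    s1 = np.exp((s - mu) / (2.0 * sigma))               # Eq. (9)
    return np.log((len(s) - 1) * s1 / (s1.sum() - s1))  # Eq. (10)
```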


Fig. 4. A block diagram for training the two groups of SVM models, taking Arabic Iraqi as an example.
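The two groups in Fig. 4 differ only in how negative samples are selected. A minimal sketch follows, using scikit-learn's LinearSVC as a stand-in for the SVMTorch classifier used in the paper, and omitting the branch-level models of the first group for brevity; `factors` and `branch_of` are assumed inputs.

```python
from sklearn.svm import LinearSVC  # stand-in for SVMTorch

def train_two_groups(factors, branch_of):
    """factors: {language: list of PLBV factor vectors};
    branch_of: {language: branch name} (cf. Table 1)."""
    group1, group2 = {}, {}
    for lang, pos in factors.items():
        # Group 1: negatives come only from the SAME language branch
        neg1 = [f for l, fs in factors.items()
                if l != lang and branch_of[l] == branch_of[lang] for f in fs]
        if neg1:
            group1[lang] = LinearSVC().fit(pos + neg1,
                                           [1] * len(pos) + [0] * len(neg1))
        # Group 2: negatives are utterances of ALL other target languages
        neg2 = [f for l, fs in factors.items() if l != lang for f in fs]
        group2[lang] = LinearSVC().fit(pos + neg2,
                                       [1] * len(pos) + [0] * len(neg2))
    return group1, group2
```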

5. Corpora and experiments setup

5.1. Corpora and evaluation

Our experiments were carried out on the 2011 NIST LRE closed-set 30 s, 10 s and 03 s tasks. There were 24 target languages in the 2011 evaluation database: Arabic Iraqi, Arabic Levantine, Arabic Maghrebi, Arabic MSA, Bengali, Czech, Dari, English American, English India, Farsi/Persian, Hindi, Lao, Mandarin, Panjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian and Urdu. More confusable language-pairs are involved than in previous evaluations; most of the confusable languages are genetically related or contact languages, for example English American/English India, Hindi/Urdu, Czech/Slovak and the Arabic dialects.

Equal error rate (EER) and minimum decision cost value (minDCF) were used as the old metrics (Martin and Przybocki, 2003) for evaluation. We also used the new evaluation criteria adopted in the NIST 2011 LRE (Greenberg et al., 2012): the minimum average cost (minCavg) and actual average cost (actCavg) computed on the 24 worst language-pairs, and the minCavg and actCavg computed over all 276 language-pairs.

5.2. Experiment setup

The 24 target languages in the NIST 2011 LRE were divided into ten language branches according to linguistic knowledge, as shown in Table 1. The performance of five language recognition systems is evaluated: ivector (Martínez et al., 2011), PPRVSM (Li et al., 2007), the language branch variability (LBV) system (Wang et al., 2014), the phoneme variability (PV) system, and the phonotactic language branch variability (PLBV) system.

Table 1
Classification of the 24 target languages in NIST 2011 LRE according to language branches.

Language name                                                  Language branch
Czech, Slovak, Polish                                          West Slavic branch
Russian, Ukrainian                                             East Slavic branch
Bengali, Hindi, Panjabi, Urdu                                  Indic branch
Farsi/Persian, Pashto, Dari                                    Iranian branch
Arabic Iraqi, Arabic Levantine, Arabic Maghrebi, Arabic MSA    Arabic branch
English American, English India                                English branch
Mandarin, Lao, Thai                                            Sino-Tibetan branch
Spanish                                                        Spanish branch
Turkish                                                        Turkish branch
Tamil                                                          Tamil branch
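Table 1 is exactly the branch mapping that the training sketch above assumes as its `branch_of` argument; written out in Python:

```python
BRANCH_OF = {}
for langs, branch in [
    (["Czech", "Slovak", "Polish"], "West Slavic"),
    (["Russian", "Ukrainian"], "East Slavic"),
    (["Bengali", "Hindi", "Panjabi", "Urdu"], "Indic"),
    (["Farsi/Persian", "Pashto", "Dari"], "Iranian"),
    (["Arabic Iraqi", "Arabic Levantine", "Arabic Maghrebi", "Arabic MSA"], "Arabic"),
    (["English American", "English India"], "English"),
    (["Mandarin", "Lao", "Thai"], "Sino-Tibetan"),
    (["Spanish"], "Spanish"),
    (["Turkish"], "Turkish"),
    (["Tamil"], "Tamil"),
]:
    for lang in langs:
        BRANCH_OF[lang] = branch
```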

In the LBV method, the language branch variability factor is obtained by concatenating low-dimensional factors from the language branch variability spaces; the feature used in the LBV system is Mel Shifted Delta Coefficients (MSDC) (Torres-Carrasquillo et al., 2002), i.e. 7 Mel Frequency Cepstral Coefficients (MFCC) concatenated with the Shifted Delta Coefficients (SDC) 7-1-3-7 feature. In the PV system, phoneme variability factors are obtained from the phoneme-dependent features and delta phoneme-dependent features as in Section 3. In PV_static, only the static phoneme-dependent features are used, and in PV_dynamic only the dynamic (delta) phoneme-dependent features are used. SVM is used to model the factors for all the target languages. The dimension of the phoneme-dependent and delta phoneme-dependent features in our experiments was 56, the same as MSDC.

The features of the ivector system were MSDC, with CMS and CVN applied. The UBM contained 1024 Gaussians, and the dimension of the latent factors was 400. SVMTorch (Collobert and Bengio, 2001) was used to train the one-vs-rest SVM classifiers. LDA and a diagonal covariance Gaussian backend (BenZeghiba et al., 2009) were applied to calculate the log-likelihoods for the target languages.
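For reference, the MSDC front-end (7 MFCCs concatenated with SDC 7-1-3-7, 56 dimensions in total) can be sketched as follows; the MFCC frames are assumed already computed, and edge padding is an implementation choice the paper does not specify.

```python
import numpy as np

def msdc(mfcc, d=1, P=3, k=7):
    """MSDC sketch: stack k delta blocks spaced P frames apart (SDC N-d-P-k).

    mfcc: (T, N) cepstral frames with N = 7 for the 7-1-3-7 configuration.
    """
    T = mfcc.shape[0]
    pad = np.pad(mfcc, ((d, d + (k - 1) * P), (0, 0)), mode='edge')
    delta = pad[2 * d:] - pad[:-2 * d]            # c(t+d) - c(t-d) for each t
    blocks = [delta[i * P:i * P + T] for i in range(k)]
    return np.hstack([mfcc] + blocks)             # (T, 7 + 7*7) = (T, 56)
```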

6. Experimental results

6.1. Performance of static phoneme-dependent features and delta phoneme-dependent features

Firstly, we present the performance of the static and delta phoneme-dependent features on the 2011 NIST LRE 30 s, 10 s and 03 s tasks in Tables 2–4, respectively. Even though the static phoneme-dependent features achieve better performance than the delta ones, the delta features still obtain results comparable to the static features across the metrics.

Table 2
Performance of static and delta features with the Russian (RU), Czech (CZ) and Hungarian (HU) phoneme recognizers on the 2011 NIST LRE 30 s task. All values are percentages; minCavg/actCavg are given for the 24 worst language-pairs and for all 276 language-pairs.

System       Rec.   EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
PV_static    RU     5.61    5.76      9.15         12.26          1.54           2.71
PV_static    CZ     5.90    6.10     10.22         13.48          1.74           3.00
PV_static    HU     6.27    6.38     10.08         13.58          1.67           3.27
PV_dynamic   RU     6.73    6.84     10.15         13.79          1.87           3.44
PV_dynamic   CZ     6.93    7.14     11.34         14.60          2.13           3.77
PV_dynamic   HU     6.28    6.45     10.71         13.40          1.95           3.10

Table 3
Performance of static and delta features with the Russian, Czech and Hungarian phoneme recognizers on the 2011 NIST LRE 10 s task (all values %).

System       Rec.   EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
PV_static    RU    11.50   11.69     17.44         20.53          5.15           6.89
PV_static    CZ    11.52   11.65     18.13         22.41          5.50           7.00
PV_static    HU    11.78   11.91     17.90         21.12          5.07           6.95
PV_dynamic   RU    13.42   13.47     19.00         22.10          6.30           8.25
PV_dynamic   CZ    13.87   13.98     20.14         23.07          6.78           8.77
PV_dynamic   HU    12.68   12.75     19.68         21.88          6.22           7.57

Table 4
Performance of static and delta features with the Russian, Czech and Hungarian phoneme recognizers on the 2011 NIST LRE 03 s task (all values %).

System       Rec.   EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
PV_static    RU    26.40   26.37     29.82         32.76         16.52          19.15
PV_static    CZ    25.43   25.30     29.27         32.49         15.62          17.90
PV_static    HU    25.65   25.58     29.83         32.25         15.85          18.33
PV_dynamic   RU    28.55   28.34     31.99         34.70         18.06          20.98
PV_dynamic   CZ    28.60   28.35     31.40         34.03         18.66          21.30
PV_dynamic   HU    27.45   27.01     31.13         33.88         17.66          19.93

Table 5
Performance of phonotactic systems with the Russian, Czech and Hungarian phoneme recognizers on the 2011 NIST LRE 30 s task (all values %).

System    Rec.   EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
PPRVSM    RU     7.55    7.73     11.94         15.51          2.79           4.22
PPRVSM    CZ     9.03    9.37     14.23         17.25          3.67           5.27
PPRVSM    HU     7.75    8.00     13.04         16.48          3.01           4.39
PV        RU     5.30    5.45      8.62         11.13          1.45           2.40
PV        CZ     5.71    5.96      9.95         12.82          1.66           2.81
PV        HU     5.64    5.79      9.54         13.01          1.64           2.82
PLBV1     RU     4.80    4.94      8.67         10.97          1.27           2.26
PLBV1     CZ     4.97    5.10      8.78         11.56          1.33           2.31
PLBV1     HU     5.06    5.23      9.00         12.33          1.35           2.54
PLBV2     RU     4.68    4.84      7.79         10.19          1.16           2.17
PLBV2     CZ     5.00    5.15      8.06         11.01          1.25           2.32
PLBV2     HU     5.08    5.21      8.29         11.37          1.25           2.47


Table 6
Performance of phonotactic systems with the Russian, Czech and Hungarian phoneme recognizers on the 2011 NIST LRE 10 s task (all values %).

System    Rec.   EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
PPRVSM    RU    14.98   15.30     20.84         22.98          7.96           9.83
PPRVSM    CZ    16.95   17.13     23.83         26.06          9.61          11.38
PPRVSM    HU    15.62   15.80     22.05         24.44          8.53          10.09
PV        RU    11.67   11.84     17.65         19.91          5.27           6.76
PV        CZ    11.70   11.91     18.28         21.16          5.70           7.23
PV        HU    11.78   11.87     17.89         21.18          5.21           6.82
PLBV1     RU    10.45   10.70     17.53         19.67          4.82           6.22
PLBV1     CZ    10.59   10.78     17.53         19.65          5.07           6.57
PLBV1     HU    10.51   10.72     17.28         19.52          4.85           6.36
PLBV2     RU    10.50   10.72     16.79         18.53          4.66           6.26
PLBV2     CZ    10.82   11.03     16.90         19.52          4.96           6.68
PLBV2     HU    10.64   10.90     16.82         19.33          4.67           6.41

Table 7
Performance of phonotactic systems with the Russian, Czech and Hungarian phoneme recognizers on the 2011 NIST LRE 03 s task (all values %).

System    Rec.   EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
PPRVSM    RU    30.33   30.36     32.97         35.28         20.14          23.77
PPRVSM    CZ    32.10   32.07     34.91         37.22         22.65          25.48
PPRVSM    HU    29.88   29.95     33.36         35.54         20.65          23.33
PV        RU    27.08   27.14     31.08         33.30         17.68          20.14
PV        CZ    26.59   26.73     30.58         33.49         17.56          19.84
PV        HU    26.47   26.47     30.02         32.69         17.01          19.52
PLBV1     RU    26.87   27.04     31.26         32.82         17.24          20.08
PLBV1     CZ    26.64   26.78     30.94         32.82         17.17          20.07
PLBV1     HU    26.34   26.60     30.42         32.32         16.88          20.00
PLBV2     RU    27.11   27.34     30.84         33.12         16.91          20.57
PLBV2     CZ    26.64   26.85     30.13         32.74         16.73          20.20
PLBV2     HU    26.29   26.56     29.96         32.20         16.38          20.06

Table 8
Performance of the phoneme variability system with the Russian and Hungarian phoneme recognizers on the 2011 NIST LRE, using 2048 Gaussians and 600-dimensional factors (all values %).

Task   Rec.   EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
30 s   RU     5.31    5.54      8.12         11.24          1.40           2.59
30 s   HU     5.46    5.60      8.56         12.48          1.42           2.76
10 s   RU    11.51   11.75     17.04         19.61          5.27           6.96
10 s   HU    11.41   11.61     17.52         20.92          5.19           6.87
03 s   RU    27.23   27.21     30.59         33.62         18.02          20.72
03 s   HU    26.64   26.60     31.06         33.64         17.79          19.96

Table 9
Performance of systems on the 2011 NIST LRE 30 s task (all values %).

System        EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
(1) ivector   7.19    7.46     12.59         15.81          2.71           3.92
(2) PPRVSM    6.51    7.01      9.22         12.53          1.79           4.19
(3) LBV       6.14    6.50     11.59         14.67          2.23           3.35
(4) PV        4.45    4.57      6.85         10.15          0.95           1.96
(5) PLBV      3.93    4.09      5.97          9.01          0.76           1.71


Table 10
Performance of systems on the 2011 NIST LRE 10 s task (all values %).

System        EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
(1) ivector  14.34   14.55     22.77         24.82          8.11           9.27
(2) PPRVSM   11.53   11.88     16.75         19.13          5.24           7.68
(3) LBV      13.01   13.31     21.72         23.70          7.23           8.36
(4) PV        8.86    9.06     13.56         16.98          3.26           4.77
(5) PLBV      8.21    8.42     12.74         15.32          2.83           4.49

Table 11
Performance of systems on the 2011 NIST LRE 03 s task (all values %).

System        EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
(1) ivector  29.26   29.33     34.39         35.77         21.78          22.93
(2) PPRVSM   26.81   27.23     29.66         31.69         16.93          19.63
(3) LBV      28.80   28.79     34.08         35.27         21.19          22.28
(4) PV       22.56   22.70     25.94         29.79         12.44          15.62
(5) PLBV     23.29   23.58     25.37         28.93         11.74          16.65

6.2. Performance of systems on Russian, Czech and Hungarian phoneme recognizers

We present the performance of the phonotactic systems with the Russian, Czech and Hungarian phoneme recognizers on the 2011 NIST LRE 30 s task in Table 5. In PLBV1, only the SVM models of the first group are trained, as these models already contain discriminative information both within and between languages; in PLBV2, the results of the two groups of SVM models are fused by adding the corresponding scores of the target languages, as described in Section 4. The results demonstrate that PV consistently outperforms PPRVSM on all three phoneme recognizers.

Take the results of the Hungarian phoneme recognizer as an example. Comparing Table 5 with Table 2, PV achieves relative improvements of 10.1%, 9.3%, 5.4%, 4.2%, 1.8% and 13.8% over PV_static (e.g., for EER, (6.27 − 5.64)/6.27 = 10.1%). The dynamic pronunciation information brings relative improvements of 27.2%, 27.6%, 26.8%, 21.1%, 45.5% and 35.8% in the corresponding metrics when comparing PV with PPRVSM. The language branch discriminative information also provides significantly superior performance: relative improvements of 10.3%, 9.7%, 5.7%, 5.2%, 17.7% and 9.9% when comparing PLBV1 with PV. From the results of PLBV1 and PLBV2, it can be seen that the second group of SVM models contributes mainly to the new metrics of the NIST 2011 LRE, since the proposed method lays emphasis on language-pair performance. Results of the phonotactic systems on the 2011 NIST LRE 10 s and 03 s tasks are presented in Tables 6 and 7, respectively; they also show the clear superiority of PV over PPRVSM. In the remainder of this paper, PLBV denotes PLBV2.

In Table 8, the performance of PV with the Russian and Hungarian phoneme recognizers on the 2011 NIST LRE 30 s, 10 s and 03 s tasks is presented for a larger configuration: here the UBM contained 2048 Gaussians and the dimension of the latent factors was 600.

6.3. Performance of the mainstream and the proposed systems

In this section, the performance of the mainstream and the proposed systems is evaluated. Tables 9–11 show the performance of the ivector, PPRVSM, LBV, PV and proposed PLBV systems on the 2011 NIST LRE 30 s, 10 s and 03 s tasks, respectively. DET curves for the 30 s task are given in Fig. 5. The 2011 NIST LRE differed from the prior evaluations in emphasizing the language-pair condition; in Figs. 6 and 7, the minDCF and actDCF for the 24 worst language-pairs on the 30 s task are depicted, respectively.

Fig. 5. DET curves of systems on 2011 NIST LRE 30 s task.

Fig. 6. minDCF for the 24 worst language-pairs of systems on 2011 NIST LRE 30 s task.

Fig. 7. actDCF for the 24 worst language-pairs of systems on 2011 NIST LRE 30 s task.

Table 9 gives the results in EER, minDCF, minCavg and actCavg computed on the 24 worst language-pairs, and minCavg and actCavg computed over all 276 language-pairs. Comparing the two systems based on spectral features, LBV and ivector, the theory of the language branch brings relative reductions of 14.6%, 12.9%, 7.9%, 7.2%, 17.7% and 14.5% in the corresponding performance metrics. The results of PV and ivector reflect the advantage of dynamic pronunciation information: PV achieves relative improvements of 38.1%, 38.7%, 45.6%, 37.3%, 64.9% and 50.0% in the performance metrics. The PLBV method, using both dynamic pronunciation and language branch discriminative information, yields relative improvements of 45.3%, 45.2%, 52.6%, 43.0%, 72.0% and 56.4% compared to ivector, and 39.6%, 41.7%, 35.2%, 28.1%, 57.5% and 59.2% compared to PPRVSM. PLBV achieves reductions of 11.7%, 10.5%, 12.8%, 11.2%, 20.0% and 12.8% in the performance metrics relative to PV. From Tables 10 and 11, it can be seen that PV and PLBV still perform excellently compared to the acoustic approaches and PPRVSM on the short-duration tasks, although PLBV does not show a clear advantage over PV on the 03 s task.

The DET curves in Fig. 5 give a straightforward performance comparison of the ivector, PPRVSM, LBV, PV and PLBV systems. Fig. 6 shows that the proposed method significantly outperforms the other systems on the 24 most challenging language-pairs: owing to the dynamic pronunciation information and the language branch discriminative information, most minDCF values for these pairs are reduced. The actDCF results for the 24 worst language-pairs in Fig. 7 show a similar pattern.

Table 12
Fusion results of systems on the 2011 NIST LRE 30 s task (all values %; bold in the original marks the best results).

Systems    EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
(1) + (2)  3.58    3.97     4.92          8.65          0.58           1.68
(2) + (3)  3.44    3.85     4.63          8.12          0.52           1.63
(1) + (4)  3.15    3.23     4.26          7.35          0.44           1.04
(1) + (5)  2.92    2.97     3.93          6.75          0.39           0.93
(3) + (5)  2.88    2.98     3.73          6.81          0.37           0.92

Table 13
Fusion results of systems on the 2011 NIST LRE 10 s task (all values %; bold in the original marks the best results).

Systems    EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
(1) + (2)  9.62    9.92     14.50         17.47         3.97           5.75
(2) + (3)  9.40    9.66     14.52         17.06         3.79           5.55
(1) + (4)  8.97    9.16     14.11         17.97         3.42           4.81
(1) + (5)  8.26    8.51     13.92         16.46         3.15           4.47
(3) + (5)  8.16    8.40     13.64         16.37         3.06           4.44


Table 14
Fusion results of systems on the 2011 NIST LRE 03 s task (all values %; bold in the original marks the best results).

Systems    EER     minDCF   minCavg(24)   actCavg(24)   minCavg(276)   actCavg(276)
(1) + (2)  23.99   24.19    28.21         29.93         15.27          17.06
(2) + (3)  23.89   24.10    28.24         30.08         15.06          16.76
(1) + (4)  23.30   23.24    27.99         29.96         14.27          16.21
(1) + (5)  23.03   23.10    27.67         29.37         13.62          15.92
(3) + (5)  22.92   23.08    27.47         29.39         13.61          15.88

6.4. Fusion results of systems

Fusion results of systems based on spectral features and phonotactic features are presented in Tables 12–14. In our

experiments, fusion results are obtained by adding the corresponding scores of the target languages, and we only fuse systems whose features are at different levels. From the results, the fusion of LBV and PLBV obtains the best performance. Including either the dynamic pronunciation information or the language branch discriminative information alone already improves performance significantly, which again reflects the superiority of both kinds of information. In Figs. 8 and 9, the minDCF and actDCF of the fusion of LBV with PLBV for the 24 worst language-pairs on the 30 s task are displayed, respectively. Most of the language-pairs achieve lower decision cost values. Compared with Figs. 6 and 7, the fusion results improve the performance significantly, and the complementarity between systems based on spectral features and systems based on phonotactic features remains notable.

Fig. 8. minDCF of fusion results of LBV with PLBV for the 24 worst language-pairs of systems on 2011 NIST LRE 30 s task.

Fig. 9. actDCF of fusion results of LBV with PLBV for the 24 worst language-pairs of systems on 2011 NIST LRE 30 s task.

7. Conclusion

In this paper, a novel phonotactic language recognition method called the phonotactic language branch variability (PLBV) method is proposed. Knowledge from linguistics is introduced to language recognition: phonotactic language branch variability factors are extracted based on factor analysis and the theory of the language branch. The obtained factors contain dynamic pronunciation and language branch discriminative information, and they do not involve error-prone phoneme sequences. Experiments on the 2011 NIST LRE 30 s, 10 s and 03 s tasks show that both the dynamic pronunciation information and the language branch discriminative information contribute to system performance, and that the proposed PLBV method significantly outperforms the PPRVSM and ivector systems.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Nos. 11161140319, 91120001, 61271426), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDA06030100, XDA06030500), the National 863 Program (No. 2012AA012503) and the CAS Priority Deployment Project (No. KGZD-EW-103-2).


References

Balakrishnama, S., Ganapathiraju, A., 1998. Linear discriminant analysis — a brief tutorial. Institute for Signal and Information Processing.
BenZeghiba, M.F., Gauvain, J.-L., Lamel, L., 2009. Language score calibration using adapted Gaussian back-end. In: INTERSPEECH, pp. 2191–2194.
Bourlard, H.A., Morgan, N., 1994. Connectionist Speech Recognition: A Hybrid Approach, vol. 247. Springer.
Burget, L., Matejka, P., Cernocky, J., 2006. Discriminative training techniques for acoustic language identification. In: ICASSP 2006, vol. 1. IEEE.
Campbell, W.M., Campbell, J.P., Reynolds, D.A., Singer, E., Torres-Carrasquillo, P.A., 2006. Support vector machines for speaker and language recognition. Comput. Speech Language 20 (2), 210–229.
Campbell, W.M., Richardson, F., Reynolds, D., 2007. Language recognition with word lattices and support vector machines. In: ICASSP 2007, vol. 4. IEEE, pp. IV-989.
Collobert, R., Bengio, S., 2001. SVMTorch: support vector machines for large-scale regression problems. J. Machine Learning Res. 1, 143–160.
Dehak, N., Torres-Carrasquillo, P.A., Reynolds, D.A., Dehak, R., 2011. Language recognition via i-vectors and dimensionality reduction. In: INTERSPEECH, pp. 857–860.
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Language Process. 19 (4), 788–798.
D'Haro, L.F., Glembek, O., Plchot, O., Matejka, P., Soufifar, M., Cordoba, R., Černocký, J., 2012. Phonotactic language recognition using i-vectors and phoneme posteriogram counts. In: Proceedings of Interspeech.
Fortson, B.W., IV, 2004. Indo-European Language and Culture. Blackwell Publishing, pp. 16–44.
Glembek, O., Mikolov, T., Plchot, O., 2010. PCA-based feature extraction for phonotactic language recognition. In: Proceedings of the Odyssey Speaker and Language Recognition Workshop, vol. 42.
Glembek, O., Brümmer, N., Cumani, S., 2012. Description and analysis of the Brno276 system for LRE2011. In: Odyssey: The Speaker and Language Recognition Workshop.
Greenberg, C.S., Martin, A.F., Przybocki, M.A., 2012. The 2011 NIST language recognition evaluation. In: INTERSPEECH.
Hermansky, H., Sharma, S., 1999. Temporal patterns (TRAPs) in ASR of noisy speech. In: ICASSP 1999, vol. 1. IEEE, pp. 289–292.
Kenny, P., Boulianne, G., Dumouchel, P., 2005. Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 13 (3), 345–354.
Lei, Y., Scheffer, N., Ferrer, L., McLaren, M., 2014. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: ICASSP 2014.
Li, M., Liu, W., 2014. Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenization and tandem features. In: Proceedings of Interspeech.
Li, M., Narayanan, S., 2013. Speaker verification using simplified and supervised i-vector modeling. In: ICASSP 2013.
Li, H., Ma, B., Lee, C.-H., 2007. A vector space modeling approach to spoken language identification. IEEE Trans. Audio Speech Language Process. 15 (1), 271–284.
Martin, A.F., Przybocki, M.A., 2003. NIST 2003 language recognition evaluation. In: Proceedings of Eurospeech, Geneva, Switzerland.
Martínez, D., Plchot, O., Burget, L., Glembek, O., Matejka, P., 2011. Language recognition in ivectors space. In: Proceedings of Interspeech, Firenze, Italy, pp. 861–864.
Schwarz, P., 2009. Phoneme Recognition Based on Long Temporal Context. Ph.D. thesis, Brno University of Technology.
Schwarz, P., Matějka, P., Černocký, J., 2006. Hierarchical structures of neural networks for phoneme recognition. In: ICASSP 2006, pp. 325–328.
Singer, E., Torres-Carrasquillo, P., Reynolds, D., McCree, A., Richardson, F., Dehak, N., Sturim, D., 2012. The MITLL NIST LRE 2011 language recognition system. In: Odyssey 2012, pp. 209–215.
Soufifar, M., Kockmann, M., Burget, L., Plchot, O., Glembek, O., Svendsen, T., 2011. iVector approach to phonotactic language recognition. In: INTERSPEECH, pp. 2913–2916.
Torres-Carrasquillo, P.A., Singer, E., Kohler, M.A., Greene, R.J., Reynolds, D.A., Deller Jr., J.R., 2002. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. In: INTERSPEECH.
Wang, H., Leung, C.-C., Lee, T., Ma, B., Li, H., 2013. Shifted-delta MLP features for spoken language recognition. IEEE Signal Process. Lett. 20, 15–18.
Wang, X., Wan, Y., Yang, L., Zhou, R., Yan, Y., 2014. Language recognition system using language branch discriminative information. In: ICASSP 2014.
Language family. http://en.wikipedia.org/wiki/Language_family.
Zissman, M.A., Berkling, K.M., 2001. Automatic language identification. Speech Commun. 35.
Zissman, M.A., Singer, E., 1994. Automatic language identification of telephone speech messages using phoneme recognition and n-gram modeling. In: ICASSP 1994, vol. 1. IEEE, pp. I-305.