TSINGHUA SCIENCE AND TECHNOLOGY  ISSN 1007-0214  16/22  pp528-532  Volume 13, Number 4, August 2008
Maximum Likelihood A Priori Knowledge Interpolation-Based Handset Mismatch Compensation for Robust Speaker Identification*

LIAO Yuanfu (廖元甫)**, ZHUANG Zhixian (庄智显), YANG Jyhher (杨智合)†

Department of Electronic Engineering, Taipei University of Technology, Taipei 106, China;
† Department of Communication Engineering, Chiao Tung University, Hsinchu 300, China

Abstract: Unseen handset mismatch is the major source of performance degradation for speaker identification in telecommunication environments. To alleviate the problem, a maximum likelihood a priori knowledge interpolation (ML-AKI)-based handset mismatch compensation approach is proposed. It first collects a set of handset characteristics of seen handsets to use as the a priori knowledge for representing the space of handsets. During evaluation, the characteristics of an unknown test handset are optimally estimated by interpolation from this a priori knowledge. Experimental results on the HTIMIT database show that the ML-AKI method improves the average speaker identification rate from 60.0% to 74.6% compared with conventional maximum a posteriori-adapted Gaussian mixture models. The proposed ML-AKI method is therefore a promising method for robust speaker identification.

Key words: robust speaker identification; maximum likelihood estimation; handset mismatch compensation; Gaussian mixture model; maximum a posteriori
Introduction

Speaker identification systems operating over the public switched telephone network (PSTN) need to be robust to the distortions introduced by different handsets. Moreover, the characteristics of the handset and the speaker are usually tightly mixed or coupled together. Separating these characteristics is essentially a difficult one-to-many mapping problem unless a priori knowledge about the handset is available. However, some handsets may not be known about in advance; such unseen handsets are unavoidable and may lead to serious performance degradation. Several successful techniques using a priori knowledge for handset mismatch compensation have been

Received: 2007-09-10; revised: 2008-02-28
* Supported by the Science Council of Taiwan, China (No. NSC 95-2221-E-027-102)
** To whom correspondence should be addressed. E-mail:
[email protected]; Tel: 886-919-968592
studied in the past. They include the feature transformation (FT)[1] and speaker model synthesis (SMS)[2] approaches. In general, these methods rely on a handset classifier and a set of pre-trained handset-specific characteristics. They first determine the type of the test handset (e.g., carbon button or electret), and then apply the corresponding handset-specific characteristics to compensate for the distortion. For example, the FT method transforms the distorted input feature vectors to fit the underlying speaker models. In contrast, the SMS method adapts the speaker models to match the distorted input feature vectors. However, it is generally difficult to collect beforehand data for all handsets that users may use. Therefore, handset classifier-based approaches may encounter problems when dealing with test utterances from an unseen handset. They can only select the most likely handset from the set of seen handsets, simply reject the utterance as out-of-handset (OOH), or resort to the cepstral mean subtraction (CMS)-based approach[1].
To alleviate the unavoidable problem of the presence of unseen handsets, the heuristic a priori knowledge interpolation (AKI) approach, previously proposed in Yang and Liao[3], is further modified in this paper using the maximum likelihood criterion and an expectation-maximization (EM) algorithm[4] (called ML-AKI from now on). The ML-AKI method not only optimally estimates the interpolation weights, but also eliminates the need for the handset classifiers that are critical for the FT, SMS, and previous heuristic AKI approaches. Specifically, ML-AKI first collects a set of handset-specific characteristics of seen handsets as the a priori knowledge to represent the space of handsets. During evaluation, the handset characteristic ĥ of an unknown test handset is optimally estimated via interpolation from the set of a priori handset characteristics as defined by

\hat{h} = \sum_{n=1}^{N} \alpha_n h_n \quad (1)

where H = {h_n, n = 1, 2, ..., N} is the set of a priori handset characteristics collected from the N seen handsets, and α_n are the optimal interpolation weights estimated by the EM algorithm. Here, h_n can be a feature- or model-space transformation function between the n-th seen handset and an enrollment handset, computed by the stochastic matching (SM) method[1] or by the maximum likelihood linear regression (MLLR) method[5]. In this way, the proposed ML-AKI approach can apply the estimate ĥ to optimally adapt the underlying speaker Gaussian mixture models (GMMs) and thereby compensate for the distortion of an unknown test handset. This paper introduces the proposed ML-AKI approach and the EM algorithm for optimizing the interpolation weights. An experiment using the well-known HTIMIT database[6] is carried out to verify the ML-AKI approach.
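As a concrete illustration, Eq. (1) can be realized in model space by interpolating the transformation matrices and bias vectors of the seen handsets. The following sketch uses made-up dimensions and weights; the names A, b, and alpha are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical setup: N = 3 seen handsets, feature dimension D = 4.
# Each seen handset contributes a mean transformation matrix A_n and a
# bias vector b_n as its a priori characteristic h_n.
rng = np.random.default_rng(0)
N, D = 3, 4
A = rng.standard_normal((N, D, D))   # a priori transformation matrices
b = rng.standard_normal((N, D))      # a priori bias vectors
alpha = np.array([0.5, 0.3, 0.2])    # interpolation weights, sum to 1

# Eq. (1): characteristics of the unknown test handset as a weighted sum
A_hat = np.einsum("n,nij->ij", alpha, A)
b_hat = alpha @ b
```

The weights themselves are not fixed in advance; Section 2 describes how they are estimated by the EM algorithm.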
1 Maximum Likelihood A Priori Knowledge Interpolation

The ML-AKI scheme for robust speaker identification is shown in Fig. 1. The scheme includes an AKI-adapted handset universal background model (AKI-UBM), a set of a priori handset characteristics, and a maximum likelihood (ML) interpolation weight estimator.
The handset UBM is chosen in preference to the set of all speaker GMMs for estimating the interpolation weights because the UBM covers all speaker GMMs and its use reduces the computational complexity. Once the characteristic ĥ of an unknown test handset has been estimated, all underlying speaker GMMs can be adapted. The detailed procedures of the ML-AKI scheme are described in the following sub-sections.
Fig. 1 Schematic of the proposed maximum likelihood a priori knowledge interpolation (ML-AKI) scheme for handset characteristics estimation
1.1 MLLR a priori knowledge
The ML-AKI scheme can be applied in both the feature and model spaces. In this paper, a model-space ML-AKI is adopted using the maximum likelihood linear regression (MLLR) mixture transformation. Assume that the observations of the speakers from the enrollment handset are modeled by an M-component GMM Λ = {μ_m, Σ_m, m = 1, 2, ..., M}, and that the corresponding observations from the n-th seen handset are modeled by another GMM Λ_n = {μ_{n,m}, Σ_{n,m}, m = 1, 2, ..., M}. Then the relationship between the n-th seen handset and the enrollment handset can be described using the first-order linear regression equations

\mu_{n,m} = A_n \mu_m + b_n \quad (2)

\Sigma_{n,m} = C_{n,m}^{\mathrm{T}} T_n C_{n,m} \quad (3)
where T_n is the variance transformation; A_n, b_n, and C_{n,m} are the mixture mean transformation matrix, the bias, and the inverse of the Choleski factor of the variance matrix Σ_{n,m}^{-1} for the m-th mixture component of the n-th seen handset. The collection of the N mixture mean transformations, biases, and variance transformation matrices, i.e., H = {A_n, b_n, T_n, n = 1, 2, ..., N}, can then be used as the model-space set of a priori handset characteristics.

1.2 AKI-adapted handset UBM
Given a set of MLLR a priori handset characteristics, the handset UBM trained from the enrollment handset can then be adapted to match the input feature vectors from an unknown test handset by

\hat{\mu}_m = \sum_{n=1}^{N} \alpha_n W_n \mu_m \quad (4)

where α_n are the optimal interpolation weights with \sum_{n=1}^{N} \alpha_n = 1, α_n ≥ 0, n = 1, 2, ..., N; \hat{\mu}_m and μ_m are the adapted mean vector for the test handset and the extended mean vector of the enrollment handset; and W_n = [A_n b_n]^T are the mean transformation matrices. Given the observation sequence O = {o_1, ..., o_T} of a test speaker from an unknown handset, the probability P(O | Φ, Λ) that O is generated by the AKI-UBM can then be computed by

P(o_t \mid \Phi, \Lambda) = \sum_{m=1}^{M} c_m N\!\left(o_t \,\Big|\, \sum_{n=1}^{N} \alpha_n W_n \mu_m, \Sigma_m\right) \quad (5)
where M is the number of mixture components of the GMM and c_m is the weight of the m-th mixture component. The problem can, therefore, be stated in the following way: how can the set of interpolation weights Φ = {α_n, n = 1, 2, ..., N} be estimated directly, in an optimal manner, from the short evaluation data in order to adapt all speaker GMMs?
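Eqs. (4) and (5) can be sketched as follows, assuming diagonal covariances; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def aki_ubm_loglik(obs, alpha, c, trans_means, var):
    """Per-frame log-likelihood under the AKI-adapted UBM, Eq. (5).

    obs: (T, D) observation sequence; alpha: (N,) interpolation weights;
    c: (M,) mixture weights; trans_means[m, n]: W_n mu_m, shape (M, N, D);
    var: (M, D) diagonal covariances."""
    mu_hat = np.einsum("n,mnd->md", alpha, trans_means)   # Eq. (4): adapted means
    diff2 = (obs[:, None, :] - mu_hat[None, :, :]) ** 2   # (T, M, D)
    log_gauss = -0.5 * (np.log(2.0 * np.pi * var)[None]
                        + diff2 / var[None]).sum(axis=-1)  # (T, M)
    # log-sum-exp over the M mixture components
    return np.logaddexp.reduce(np.log(c)[None, :] + log_gauss, axis=1)
```

Summing the returned per-frame values gives log P(O | Φ, Λ), the quantity maximized over Φ in the next section.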
2 Interpolation Weights Optimization and System Fusion

2.1 Optimization of the interpolation weights
To find the optimal interpolation weights Φ = {α_n, n = 1, 2, ..., N}, an EM algorithm based on the ML criterion is applied. First, a sequence of hidden data, i.e., a mixture sequence Θ = {θ_t, t = 1, 2, ..., T} that generates the observation sequence O = {o_1, ..., o_T}, is added to form a complete data sequence {O, Θ}. Then an auxiliary function Q(Φ, Φ̂) is defined as

Q(\Phi, \hat{\Phi}) = \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_m(t) \log N\!\left(o_t \,\Big|\, \sum_{n=1}^{N} \hat{\alpha}_n W_n \mu_m, \Sigma_m\right) \quad (6)

where Φ and Φ̂ are the old and new sets of interpolation weights, and γ_m(t) is the occupancy probability of the m-th mixture component, which can be computed by

\gamma_m(t) = \frac{c_m P_m(o_t \mid \Phi, \Lambda)}{\sum_{m'=1}^{M} c_{m'} P_{m'}(o_t \mid \Phi, \Lambda)} \quad (7)

By ignoring some terms not related to α̂_n, Eq. (6) can be simplified and expressed by

Q'(\hat{\alpha}_1, \ldots, \hat{\alpha}_N) = -\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_m(t) \left(o_t - \sum_{n=1}^{N} \hat{\alpha}_n W_n \mu_m\right)^{\mathrm{T}} \Sigma_m^{-1} \left(o_t - \sum_{n=1}^{N} \hat{\alpha}_n W_n \mu_m\right) \quad (8)

Moreover, a modified Lagrange equation Q''(α̂_1, ..., α̂_N) can be defined to impose the equality constraint \sum_{n=1}^{N} \hat{\alpha}_n = 1, i.e.,

Q''(\hat{\alpha}_1, \ldots, \hat{\alpha}_N) = Q'(\hat{\alpha}_1, \ldots, \hat{\alpha}_N) - \lambda\left(\sum_{n=1}^{N} \hat{\alpha}_n - 1\right) \quad (9)

The Kuhn-Tucker conditions[7] for this second-order nonlinear optimization problem (where a diagonal covariance matrix is assumed for the sake of simplicity) are

\frac{\partial Q''(\hat{\alpha}_1, \ldots, \hat{\alpha}_N)}{\partial \hat{\alpha}_j} = \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_m(t) \left\{ (W_j \mu_m)^{\mathrm{T}} \Sigma_m^{-1} \left(o_t - \sum_{n=1}^{N} \hat{\alpha}_n W_n \mu_m\right) \right\} - \lambda = 0, \quad j = 1, \ldots, N, \quad \lambda \geq 0,

\sum_{n=1}^{N} \hat{\alpha}_n = 1 \quad \text{and} \quad \lambda\left[\sum_{n=1}^{N} \hat{\alpha}_n - 1\right] = 0 \quad (10)

Equations (10) can then be solved by a nonlinear programming (NLP) search algorithm[8]. The expectation and maximization steps of the EM algorithm are iteratively applied until the set of interpolation weights converges. Finally, the estimated characteristics of the unknown test handset are employed for the adaptation of all underlying speaker GMMs.
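The constrained maximization step of Eqs. (8)-(10) amounts to a small quadratic program over the weight simplex. Rather than a dedicated NLP solver such as DONLP2[8], this minimal sketch (all names hypothetical) uses SciPy's SLSQP solver to minimize the negative of Q' under the simplex constraints:

```python
import numpy as np
from scipy.optimize import minimize

def mstep_weights(obs, gamma, trans_means, inv_var):
    """One M-step: maximize Q' of Eq. (8) over the interpolation weights.

    obs: (T, D); gamma: (T, M) occupancies from Eq. (7);
    trans_means[m, n]: W_n mu_m, shape (M, N, D); inv_var: (M, D) diagonal."""
    N = trans_means.shape[1]

    def neg_q(alpha):
        mu_hat = np.einsum("n,mnd->md", alpha, trans_means)   # Eq. (4)
        diff = obs[:, None, :] - mu_hat[None, :, :]           # (T, M, D)
        # weighted squared Mahalanobis distance, i.e., -Q' up to constants
        return np.einsum("tm,tmd,md->", gamma, diff ** 2, inv_var)

    res = minimize(
        neg_q,
        x0=np.full(N, 1.0 / N),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * N,                              # alpha_n >= 0
        constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
    )
    return res.x
```

In the full EM loop, the occupancies gamma would be recomputed from the current weights (E-step) and mstep_weights called again until the weights converge.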
2.2 System fusion
To handle outliers that are totally uncovered by the set of a priori handset characteristics, the ML-AKI method and the conventional MAP-GMM/CMS approach[9] can be combined to complement each other by fusing the scores from the two methods to obtain the final identification score s_f according to

s_f = \lambda \frac{s_1 - \bar{s}_1}{\sigma_{s_1}} + (1 - \lambda) \frac{s_2 - \bar{s}_2}{\sigma_{s_2}} \quad (11)

where λ is a weighting constant, s_1 and s_2 are the identification scores of the two systems, and \bar{s}_1, \bar{s}_2, \sigma_{s_1}, and \sigma_{s_2} are the means and standard deviations of s_1 and s_2, respectively. The overall process is illustrated in Fig. 2.
Fig. 2 Proposed combined ML-AKI and MAP-GMM/CMS scheme for robust speaker identification
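The z-normalized fusion of Eq. (11) can be sketched as follows; s1 and s2 would be the per-speaker score arrays from the two systems for one test utterance, and lam is a tuning constant whose value here is illustrative, not taken from the paper.

```python
import numpy as np

def fuse_scores(s1, s2, lam=0.5):
    """Fuse ML-AKI and MAP-GMM/CMS identification scores, Eq. (11)."""
    z1 = (s1 - s1.mean()) / s1.std()    # zero-mean, unit-variance scores
    z2 = (s2 - s2.mean()) / s2.std()
    return lam * z1 + (1.0 - lam) * z2  # final score s_f per speaker
```

The identified speaker is then the one with the largest fused score.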
3 Experiments

3.1 HTIMIT and experiment conditions
To evaluate the effectiveness of the proposed ML-AKI approach, the well-known HTIMIT database[6], which was recorded by the Massachusetts Institute of Technology (MIT) for studying handset mismatch problems, was chosen. In total, 384 speakers were used, each giving ten utterances through a Sennheiser head-mounted microphone (referred to in the following as "sen"). The set of 384×10 utterances was then played back and recorded through nine other handsets, including four carbon-button handsets (referred to as cb1, cb2, cb3, and cb4), four electret handsets (referred to as el1, el2, el3, and el4), and one portable cordless phone (called pt1). In this paper, all experiments were performed on a sub-database containing 356 speakers (178 female and 178 male), each with ten utterances. For training the speaker models, the first 16 s of speech for each speaker, taken from the first seven utterances of the "sen" handset dataset, was used as the enrollment speech. Ten four-second sessions of speech for each speaker, taken from the last three utterances of all ten handsets, were used as the evaluation speech. A total of 38 recognition features, comprising 12 mel-frequency cepstral coefficients (MFCCs), 12 Δ-MFCCs, 12 Δ²-MFCCs, Δ-log energy, and Δ²-log energy, were computed with a window size of 30 ms and a frame shift of 10 ms.

3.2 Unseen handset robustness
First, a GMM speaker recognizer using the conventional CMS method for removing handset bias was evaluated as a baseline (referred to as GMM/CMS). The identification results using 32 mixture components are shown in Table 1; an average identification rate of 55.8% was achieved. Compared with the results reported in Reynolds[6], the baseline results are promising. Second, the MAP-GMM system using the CMS approach (referred to as MAP-GMM/CMS) was evaluated. To construct the speaker models, a 256-mixture UBM was first built from the enrollment speech of all 356 speakers. Then, for each speaker, a MAP-GMM adapted from the UBM using his/her own enrollment speech was built. The identification results are shown in Table 1. The average identification rate for this method is 60.0%. These results show that the MAP-GMM/CMS approach works well but that it is still affected by distortions from mismatched handsets.
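The MAP-GMM enrollment used above follows Reynolds et al.[9]; a minimal sketch of the mean-only MAP update, with illustrative names and relevance factor value, is:

```python
import numpy as np

def map_adapt_means(ubm_means, resp, frames, r=16.0):
    """Mean-only MAP adaptation of a UBM toward one speaker's enrollment data.

    ubm_means: (M, D) UBM component means; resp: (T, M) UBM posteriors of the
    enrollment frames; frames: (T, D) enrollment features; r: relevance factor."""
    n = resp.sum(axis=0)                                     # soft counts per component
    ex = (resp.T @ frames) / np.maximum(n, 1e-10)[:, None]   # posterior mean of the data
    w = (n / (n + r))[:, None]                               # data-dependent interpolation
    return w * ex + (1.0 - w) * ubm_means                    # adapted speaker means
```

Components with little enrollment data stay close to the UBM mean, which is what makes the adapted models robust to short enrollment speech.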
Finally, the robustness to unseen handsets was investigated to test the proposed ML-AKI fusion approach. In brief, each of the nine handsets (cb1 to cb4, el1 to el4, and pt1) was chosen in turn as the unseen handset and removed from the set of a priori handset characteristics, while the remaining eight handsets plus the "sen" data were used as the seen handsets. The average speaker identification rates are shown in Table 1, which shows that the ML-AKI fusion system improves the average performance to 74.6%.

The average speaker identification rates for each of the unseen handsets are shown in Table 2. The results show that even under unseen handset conditions the ML-AKI fusion approach still gives a performance improvement from 58.3% (MAP-GMM/CMS) to 67.5%. Based on the results shown in Tables 1 and 2, we can, therefore, conclude that the proposed ML-AKI method can efficiently compensate for the mismatch of both seen and unseen handsets.
Table 1 Average speaker identification rates (%) of the unseen handset robustness experiments on the HTIMIT database achieved by the GMM/CMS, MAP-GMM/CMS, and ML-AKI/MAP-GMM fusion approaches

Method           sen   cb1   cb2   cb3   cb4   el1   el2   el3   el4   pt1   Average
GMM/CMS          67.1  62.9  63.2  25.3  35.4  68.5  61.0  55.6  63.5  55.9  55.8
MAP-GMM/CMS      75.6  69.7  75.0  28.7  34.6  73.9  63.8  59.8  64.9  53.9  60.0
ML-AKI/MAP-GMM   85.3  80.0  84.5  51.8  62.8  85.0  76.1  72.8  78.7  68.6  74.6
Table 2 Speaker identification rates (%) of the nine unseen handsets in the unseen handset robustness experiment on the HTIMIT database achieved by the GMM/CMS, MAP-GMM/CMS, and ML-AKI/MAP-GMM fusion approaches

Method           cb1   cb2   cb3   cb4   el1   el2   el3   el4   pt1   Average
GMM/CMS          62.9  63.2  25.3  35.4  68.5  61.0  55.6  63.5  55.9  54.6
MAP-GMM/CMS      69.7  75.0  28.7  34.6  73.9  63.8  59.8  64.9  53.9  58.3
ML-AKI/MAP-GMM   78.4  82.6  32.9  53.4  84.8  63.8  71.6  75.3  64.3  67.5
4 Conclusions and Future Work

In this paper, we have proposed an ML-AKI fusion method to alleviate the problem of unseen handset mismatch. Unlike conventional hard-decision handset classifier-based approaches, which may choose incorrect handset types, have to reject data, or resort to CMS-based systems, the proposed approach removes the need for a handset classifier and allows an optimal estimate of the characteristics of an unseen handset using the EM algorithm. The performance is therefore improved significantly, making ML-AKI a promising method for robust speaker identification. In the future, the ML-AKI method will be improved using eigenvector analysis to produce a compact and orthogonal handset space when the number of seen handsets is large enough.

References

[1] Mak M W, Tsang C L, Kung S Y. Stochastic feature transformation with divergence-based out-of-handset rejection for robust speaker verification. EURASIP Journal on Applied Signal Processing, 2004, 4: 452-465.
[2] Teunen R, Shahshahani B, Heck L P. A model-based transformational approach to robust speaker recognition. In: Proc. ICSLP. Beijing, China, 2000.
[3] Yang Jyhher, Liao Yuanfu. Unseen handset mismatch compensation based on feature/model-space a priori knowledge interpolation for robust speaker recognition. In: Proc. ISCSLP'2004. Hong Kong, China, 2004.
[4] Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 1977, 39: 1-38.
[5] Leggetter C J, Woodland P C. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 1995, 9: 171-185.
[6] Reynolds D A. HTIMIT and LLHDB: Speech corpora for the study of handset transducer effects. In: Proc. ICASSP'97. Munich, Germany, 1997, II: 1535-1538.
[7] Kuhn H W, Tucker A W. Nonlinear programming. In: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1951: 481-492.
[8] Spellucci P. DONLP2. http://www.mathematik.tu-darmstadt.de:8080/ags/ag8/Mitglieder/spellucci_de.html, 2008.
[9] Reynolds D A, Quatieri T F, Dunn R B. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 2000, 10: 19-41.