
GPR-based Thai speech synthesis using multi-level duration prediction

Decha Moungsri (a,*), Tomoki Koriyama (b), Takao Kobayashi (b)

(a) Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama 226-8502, Japan
(b) School of Engineering, Tokyo Institute of Technology, Yokohama 226-8502, Japan

Speech Communication 99 (2018) 114–123

Keywords: Thai language; speech synthesis; Gaussian process regression; multi-level model; prosody; duration prediction

Abstract

This paper proposes a multi-level Gaussian process regression (GPR)-based method for duration prediction by incorporating phone- and syllable-level duration models. In this method, we first train the syllable model and predict syllable durations for a given input of context labels. Then, we use the predicted syllable duration as an additional context for the phone-level model to predict phone durations. To apply multi-level duration prediction to the GPR-based speech synthesis framework, we designed phone- and syllable-level context sets for Thai that include linguistic information and the relative positions of speech units. We also examined a multi-level deep neural network (DNN)-based duration-prediction method using the same approach as for the proposed multi-level GPR-based one. We conducted objective and subjective evaluations using two-hour training data to compare the proposed method with single-level ones. The results indicate that the proposed multi-level duration-prediction method outperformed single-level ones in the DNN- and GPR-based frameworks. They also indicate that the proposed multi-level GPR-based method can provide better performance than the multi-level HMM-based duration-prediction method.

* Corresponding author. E-mail addresses: [email protected] (D. Moungsri), [email protected] (T. Koriyama), [email protected] (T. Kobayashi).

https://doi.org/10.1016/j.specom.2018.03.005
Received 15 March 2017; Received in revised form 30 January 2018; Accepted 13 March 2018; Available online 14 March 2018
0167-6393/ © 2018 Elsevier B.V. All rights reserved.

1. Introduction

In text-to-speech synthesis, accurate prosody modeling is essential to synthesize natural-sounding and intelligible speech, since prosody contributes to linguistic functions such as intonation, stress, accent phrase, prominence, and emphasis. In terms of acoustic features, fundamental frequency (F0), duration, and intensity are the attributes of prosody. In this paper, we focus on improving duration modeling for statistical parametric speech synthesis (SPSS), which contributes to making synthetic speech more natural-sounding and intelligible.

Various methods of duration modeling have been proposed. For hidden Markov model (HMM)-based SPSS, the state durations of a phoneme HMM were modeled by a multi-dimensional Gaussian distribution, and a decision tree was used to cluster the variations of contextual factors (Yoshimura et al., 1998). Iwahashi and Sagisaka (2000) proposed a constrained tree regression method incorporating linear and tree regressions for speech segmental duration prediction. Another tree-based method was proposed by Yamagishi et al. (2008), which uses a meta-algorithm of regression trees, called gradient tree boosting, to iteratively construct a regression tree for phone-duration modeling. Qian et al. (2011) used tree-based context clustering for phone- and syllable-level context in a multi-level HMM-based duration-prediction method. Regarding the neural network-based approach, Teixeira and Freitas (2003) used a neural network to predict segmental duration.



Zen and Sak (2015) proposed a streaming synthesis architecture using a unidirectional long short-term memory recurrent neural network for duration prediction and acoustic feature prediction. Nagy and Németh (2016) demonstrated a deep neural network (DNN) for duration prediction in short sentences. Henter et al. (2016) proposed DNN-based duration modeling by using robust statistics to disregard dubious and unhelpful data points in parameter estimation. Moreover, a variety of machine learning approaches have been applied to duration prediction, for example, a multiplicative model (Chen et al., 2003), support vector regression (SVR) (Lazaridis et al., 2014), sums-of-products (SoP) models (Van Santen, 1992), and Bayesian networks (Goubanova and King, 2008).

In addition to using a single method for duration prediction, various studies combining multiple methods have also been reported. Campbell (1992, 1993) used a neural network to predict syllable duration, then estimated phone durations from the predicted syllable duration using the elasticity principle (Campbell and Isard, 1991). Rao and Yegnanarayana (2007) combined a support vector machine (SVM) and feedforward neural networks to predict syllable duration. Sainz et al. (2011) proposed a hybrid approach combining unit selection and statistical methods for duration prediction. Lazaridis et al. (2011) combined the outputs of dissimilar phone-duration predictors to improve duration modeling. The concept of using multiple predictors was extended by Lazaridis et al. (2012) to a two-stage method for phone-duration modeling, which uses multiple feature constructors in the first stage and uses their outputs as extended features in the second stage of phone-duration modeling. Wang et al. (2015) used an extreme learning machine, a variation of a feedforward neural network, for phone-duration prediction and a decision tree-based method to predict state duration.

In this paper, we focus on the problem of improving the naturalness of Thai SPSS, in which the duration-prediction accuracy significantly affects synthetic speech quality. We consider information from multiple speech-unit layers for duration prediction because the single phone level is not sufficient to capture linguistic functions that belong to longer units. Specifically, we incorporate multi-level duration models, i.e., phone- and syllable-level models, to improve the accuracy of duration prediction. Thai has clear syllable boundaries, so errors caused during boundary labeling can be avoided. Moreover, syllable duration plays an important role in Thai linguistic functions, such as stress, accent, and prominence, which are the main factors that affect intonation (Peyasantiwong, 1986; Potisuk et al., 1996; Luksaneeyanawin, 1983; Potisuk et al., 1998). Therefore, incorporating the syllable-level model can be expected to improve the naturalness and intelligibility of predicted durations. The contextual factors at the syllable level are more complicated than those at the phone level since they include the contextual factors of multiple phones¹, and the conventional HMM-based method that involves decision tree-based context clustering is inefficient in expressing complex context dependencies, as described by Zen et al. (2013). To handle the complexity of syllable-level context, a large decision tree is required, which leads to an overfitting problem and degrades the quality of synthetic speech. Another limitation is that decision tree-based context clustering ties the states assigned to each leaf node to a single state, which leads to an over-smoothing effect and reduces contextual diversity. To overcome these problems, we investigated an alternative duration-prediction method based on Gaussian process regression.

Gaussian process regression (GPR)-based SPSS was proposed to overcome the limitations of HMM-based SPSS (Koriyama et al., 2013a; Koriyama et al., 2013b; Koriyama et al., 2014a; Koriyama et al., 2014b; Koriyama and Kobayashi, 2015b). The GPR-based SPSS uses a Gaussian process (GP) to model the relationship between contextual factors and acoustic features. A GP is a nonparametric Bayesian model whose model complexity grows with the amount of training data. Furthermore, a GP leads to robust parameter estimation that alleviates the overfitting problem. The GPR-based SPSS uses a kernel function, which is more flexible than tree-based context clustering for determining the similarity between input contexts. It has also been shown that the performance improvement of the GPR-based SPSS relative to the HMM-based one is comparable to or slightly higher than that of the DNN-based one when a limited amount of training data is used (Koriyama and Kobayashi, 2015a).

In this paper, we extend our previous study (Moungsri et al., 2015) and investigate multi-level GPR-based modeling for duration prediction in detail under the condition in which around 30 minutes to two hours of speech data are available for model training. In the proposed method, the duration model at the syllable level is trained first, then the predicted syllable duration is used as an additional context for training the phone-level duration model. Moreover, since the key idea of the proposed method is not limited to the GPR-based SPSS framework, we also apply it to the DNN-based one. To examine the effectiveness of the proposed method, we compared the duration-prediction performance of the multi-level GPR-based model with multi-level HMM- and DNN-based models.

In the next section, we explain GPR-based speech synthesis and describe the contextual factors and kernel functions of GPR-based Thai speech synthesis. In Section 3, we describe the proposed GPR-based speech synthesis method with multi-level duration prediction, including the context and kernel function for the multi-level model, and briefly explain the multi-level HMM- and DNN-based methods. In Section 4, we present the results of objective and subjective evaluations of the proposed method and compare its performance with the conventional HMM- and DNN-based speech synthesis methods. We conclude the paper in Section 5.

¹ In Thai, syllables are combinations of an initial consonant, vowel, final consonant, and tone. Since Thai has 38 initial consonants, 24 vowels, 13 final consonants, and 5 tones, there are many possible syllables (Wutiwiwatchai and Furui, 2007).

2. GPR-based speech synthesis

In this paper, we introduce a GPR-based SPSS framework to Thai speech synthesis in the same manner as that of Japanese (Koriyama et al., 2013b; Koriyama et al., 2014b; Koriyama et al., 2014a; Koriyama et al., 2013a; Koriyama and Kobayashi, 2015b). The difference is the specific contexts for Thai, which are described in Section 2.3.

2.1. Gaussian process for regression

In GPR, the relation between an output variable $y_n$ and input variables $\mathbf{x}_n$ is defined by

$$ y_n = f(\mathbf{x}_n) + \epsilon \tag{1} $$

where $f(\cdot)$ is a noise-free latent function and $\epsilon$ is Gaussian noise whose variance is $\sigma^2$. Let $X_N = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N]^\top$, $\mathbf{y} = [y_1, y_2, \dots, y_N]^\top$, and $\mathbf{f} = [f(\mathbf{x}_1), f(\mathbf{x}_2), \dots, f(\mathbf{x}_N)]^\top$ be the matrix forms of the input variable values, output variable values, and latent function values of the training data, respectively. Similarly, let $X_T = [\mathbf{x}_{t_1}, \mathbf{x}_{t_2}, \dots, \mathbf{x}_{t_T}]^\top$, $\mathbf{y}_T$, and $\mathbf{f}_T$ be the matrix forms for the test data. Since it is assumed that $f(\cdot)$ is sampled from a GP, the joint probability of the latent function values of the training and test data is given by

$$ p(\mathbf{f}, \mathbf{f}_T \mid X_N, X_T) = \mathcal{N}\!\left( \begin{bmatrix} \mathbf{f} \\ \mathbf{f}_T \end{bmatrix}; \mathbf{0},\, K_{N+T} \right) \tag{2} $$

$$ K_{N+T} = \begin{bmatrix} K_{NN} & K_{NT} \\ K_{TN} & K_{TT} \end{bmatrix} \tag{3} $$

where $K_{NN}$ and $K_{TT}$ are the covariance matrices of the training and test data, respectively, and $K_{NT} = K_{TN}^\top$ is the covariance matrix between the training and test data. The $(m, n)$ element of the covariance matrix is given by $k_{mn} = \kappa(\mathbf{x}_m, \mathbf{x}_n)$, where $\kappa(\mathbf{x}_m, \mathbf{x}_n)$ is a kernel function for calculating the similarity between input variables. Using Eqs. (1) and (2), the joint distribution of $\mathbf{y}$ and $\mathbf{y}_T$ is expressed as

$$ p(\mathbf{y}, \mathbf{y}_T \mid X_N, X_T) = \mathcal{N}\!\left( \begin{bmatrix} \mathbf{y} \\ \mathbf{y}_T \end{bmatrix}; \mathbf{0},\, K_{N+T} + \sigma^2 I \right). \tag{4} $$

The predictive distribution of $\mathbf{y}_T$ is given by

$$ p(\mathbf{y}_T \mid \mathbf{y}, X_N, X_T) = \mathcal{N}(\mathbf{y}_T; \boldsymbol{\mu}_T, \Sigma_T). \tag{5} $$

Then, the parameters of the predictive distribution can be calculated as

$$ \boldsymbol{\mu}_T = K_{TN} [K_{NN} + \sigma^2 I]^{-1} \mathbf{y} \tag{6} $$

$$ \Sigma_T = K_{TT} + \sigma^2 I - K_{TN} [K_{NN} + \sigma^2 I]^{-1} K_{NT}. \tag{7} $$
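As a concrete reference, the following is a minimal NumPy sketch of the predictive computation in Eqs. (6) and (7); the function name and the Cholesky-based solve are our own choices for numerical stability, not details from the paper.

```python
import numpy as np

def gpr_predict(K_NN, K_TN, K_TT, y, noise_var):
    """Predictive mean (Eq. 6) and covariance (Eq. 7) of GPR."""
    A = K_NN + noise_var * np.eye(K_NN.shape[0])          # K_NN + sigma^2 I
    L = np.linalg.cholesky(A)                             # A = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # A^{-1} y
    mu_T = K_TN @ alpha                                   # Eq. (6)
    V = np.linalg.solve(L, K_TN.T)                        # V^T V = K_TN A^{-1} K_NT
    Sigma_T = K_TT + noise_var * np.eye(K_TT.shape[0]) - V.T @ V   # Eq. (7)
    return mu_T, Sigma_T
```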

Koriyama et al. (2013b) applied GPR to speech synthesis by defining a frame context as the input variables $\mathbf{x}_n$ and acoustic features as the output variables $y_n$. For calculating the similarity of the frame context, the kernel function $\kappa(\mathbf{x}_m, \mathbf{x}_n)$ is defined by including linguistic information and frame-related position in speech units.

2.2. Frame-level context and kernel function

The frame context is composed of temporal event and relative position contexts:


$$ \mathbf{x}_n = (x_{n,1}, \dots, x_{n,K}), \qquad x_{n,k} = (p_{n,k}, c_{n,k}) $$

$$ p_{n,k} = (p_{n,k}^{(-1)}, p_{n,k}^{(0)}, p_{n,k}^{(+1)}), \qquad c_{n,k} = (c_{n,k}^{(-1)}, c_{n,k}^{(0)}, c_{n,k}^{(+1)}) \tag{8} $$

where $\mathbf{x}_n$ is an array of partial frame contexts that has $K$ temporal events, and $x_{n,k}$ is the context of the $k$-th event, composed of the relative position ($p_{n,k}$) and the temporal event context ($c_{n,k}$). The superscripts $(-1)$, $(0)$, and $(+1)$ denote the preceding, current, and succeeding temporal events, respectively. The kernel function $\kappa(\mathbf{x}_m, \mathbf{x}_n)$ is the sum of the similarities of events,

$$ \kappa(\mathbf{x}_m, \mathbf{x}_n) = \sum_{k=1}^{K} \theta_{r,k}^2\, \kappa_k(x_{m,k}, x_{n,k}) + \delta_{mn}\, \theta_{\mathrm{floor}}^2 \tag{9} $$

$$ \kappa_k(x_{m,k}, x_{n,k}) = \sum_{u=-1}^{+1} \sum_{v=-1}^{+1} \left[ w(p_{m,k}^{(u)})\, w(p_{n,k}^{(v)}) \cdot \kappa_p(p_{m,k}^{(u)}, p_{n,k}^{(v)})\, \kappa_c(c_{m,k}^{(u)}, c_{n,k}^{(v)}) \right] \tag{10} $$

where $\theta_{r,k}^2$ and $\theta_{\mathrm{floor}}^2$ are kernel parameters, and $\kappa_k(x_{m,k}, x_{n,k})$ is the kernel for the $k$-th event context, which is composed of a weighting function $w(\cdot)$, the event kernel $\kappa_c(\cdot)$, and the position kernel $\kappa_p(\cdot)$. The event kernel is a linear one defined by

$$ \kappa_c(c_{m,k}^{(u)}, c_{n,k}^{(v)}) = c_{m,k}^{(u)} \cdot c_{n,k}^{(v)}. \tag{11} $$

Table 1. Examples of temporal events of Thai phonetic features for GPR-based speech synthesis.

Phonetic feature         /p/   /pl/   /a/   /u:a/   /k^/
Labial                    +     +     −      −       −
Alveolar                  −     −     −      −       −
Velar                     −     −     −      −       +
Glottal                   −     −     −      −       −
Voiceless unaspirated     +     +     −      −       +
Lateral                   −     +     −      −       −
Cluster                   −     +     −      −       −
Vowel Short               −     −     +      −       −
Vowel High                −     −     −      −       −
Vowel Low                 −     −     +      +       −
Vowel Front               −     −     −      −       −
Vowel Central             −     −     +      +       −
Vowel Back                −     −     −      −       −
Diphthong                 −     −     −      +       −
Initial consonant         +     +     −      −       −
Vowel                     −     −     +      +       −
Final consonant           −     −     −      −       +
⋮                         ⋮     ⋮     ⋮      ⋮       ⋮

We use the event kernel to calculate the similarity of temporal event contexts represented by binary values, as described by Koriyama et al. (2013a). The position kernel is a squared exponential (SE) one given by

$$ \kappa_p(p_{m,k}^{(u)}, p_{n,k}^{(v)}) = \exp\!\left( -\frac{(p_{m,k}^{(u)} - p_{n,k}^{(v)})^2}{l_k^2} \right) \tag{12} $$

where $l_k$ denotes a length-scale hyperparameter. The position kernel is used to determine the distance of the relative position $p_{n,k}$.

Table 2. Temporal context for Thai GPR-based speech synthesis.

Unit: Phone
  Type: {beginning, end} of each phonetic feature
  Scale: phone-normalized scale, time*
Unit: Syllable
  Type: {beginning, end} of tone-type; {beginning, end} of syllable duration†
  Scale: {syllable, word}-normalized scale, time*
Unit: Word
  Type: {beginning, end} of part of speech
  Scale: {syllable, word}-normalized scale, time*
Unit: Utterance
  Type: {beginning, end} of utterance
  Scale: {syllable, word, utterance}-normalized scale, time*

* The scales marked with * are not used for the duration model.
† The temporal context marked with † is used only in the phone-level duration model of the multi-level approach.
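To make Eqs. (9)-(12) concrete, here is a small Python sketch of the composite frame kernel under simplifying assumptions introduced for illustration: each event carries a single scalar relative position per offset u in {−1, 0, +1} (the paper uses several position scales per event), and the weighting function w(·), whose exact form is not given here, is taken to be constant.

```python
import numpy as np

def event_kernel(c_m, c_n):
    # Linear event kernel of Eq. (11) on binary/one-hot context vectors
    return float(np.dot(c_m, c_n))

def position_kernel(p_m, p_n, length_scale):
    # Squared-exponential position kernel of Eq. (12)
    return float(np.exp(-((p_m - p_n) ** 2) / length_scale ** 2))

def frame_kernel(x_m, x_n, theta_r, theta_floor, length_scales, same_frame=False):
    """Composite kernel of Eqs. (9)-(10).

    x_m, x_n: lists of K events; each event is (positions, contexts), both
    dicts keyed by the offset u in {-1, 0, +1}. theta_r and length_scales
    are per-event parameters; w(.) is assumed to be constant 1 here.
    """
    total = 0.0
    for k, ((p_m, c_m), (p_n, c_n)) in enumerate(zip(x_m, x_n)):
        k_event = 0.0
        for u in (-1, 0, +1):
            for v in (-1, 0, +1):
                k_event += (position_kernel(p_m[u], p_n[v], length_scales[k])
                            * event_kernel(c_m[u], c_n[v]))
        total += theta_r[k] ** 2 * k_event
    if same_frame:                     # the delta_mn floor term of Eq. (9)
        total += theta_floor ** 2
    return total
```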

2.3. GPR-based Thai speech synthesis


We used the Thai speech corpus called T-Sync-1 (Hansakunbuntheung et al., 2005), which includes linguistic information in four layers: phone, syllable, word, and utterance. Therefore, we define the frame context in a similar manner to the conventional HMM-based method (Chomphan and Kobayashi, 2007a) to match it with the linguistic information from T-Sync-1. The phone layer contains Thai phonetic features, as described by Wutiwiwatchai and Furui (2007). Each phonetic feature event is represented by a binary value (+1 or −1). Table 1 lists examples of the temporal events of phonetic features. The syllable layer contains tone-type information. Thai has five tones, and a different tone changes the meaning of a word. We represent the tone context by a one-hot vector; for example, if the tone-type is tone 2, the context is given as [0, 0, 1, 0, 0]. The word layer contains part-of-speech (POS) information obtained from the POS tag set of T-Sync-1. We use one-hot vectors to represent the POS context in the same manner as the tone-type context. Table 2 is a summary of the temporal events for GPR-based Thai speech synthesis. In addition, we define the relative position $p_{n,k}$ corresponding to the positional information used in conventional HMM-based Thai speech synthesis (Chomphan and Kobayashi, 2007a). Fig. 1 shows an example of the frame-level context defined for the beginning of a tone-type event $x_{n,k}$. In this example, $p_{n,k}^{(-1)}$, $p_{n,k}^{(0)}$, and $p_{n,k}^{(+1)}$ denote the relative positions from the beginnings of the preceding, current, and succeeding tone-type temporal events to the $n$-th frame position, respectively, and $c_{n,k}^{(-1)}$, $c_{n,k}^{(0)}$, and $c_{n,k}^{(+1)}$ denote the tone-types of the preceding, current, and succeeding syllables, respectively. Each relative position $p_{n,k}^{(u)}$ consists of relative positions in different scales, as shown in Table 2:

$$ p_{n,k}^{(u)} = \left[ p_{n,k,\mathrm{syllable}}^{(u)},\; p_{n,k,\mathrm{word}}^{(u)},\; p_{n,k,\mathrm{time}}^{(u)} \right], \quad u = -1, 0, +1. $$

Specifically, in the figure, the relative position context $p_{n,k}$ is expressed as

$$ p_{n,k} = (p_{n,k}^{(-1)}, p_{n,k}^{(0)}, p_{n,k}^{(+1)}) = ([1.7, 0.825, 0.33],\, [0.7, 0.375, 0.15],\, [-0.3, -0.175, -0.07]) $$

and the temporal event context $c_{n,k}$ is expressed as

$$ c_{n,k} = (c_{n,k}^{(-1)}, c_{n,k}^{(0)}, c_{n,k}^{(+1)}) = ([1, 0, 0, 0, 0],\, [0, 0, 0, 1, 0],\, [0, 0, 1, 0, 0]) $$

where [1, 0, 0, 0, 0], [0, 0, 0, 1, 0], and [0, 0, 1, 0, 0] represent tones 0, 3, and 2, respectively.

Fig. 1. Illustrative example of the contextual factor of the beginning of a tone-type temporal event, $x_{n,k} = (p_{n,k}, c_{n,k})$.

3. Duration prediction using multi-level model

3.1. Multi-level GPR-based duration prediction

In this paper, as well as at the phone level, we consider duration modeling at the syllable level, where Thai intonation is chiefly characterized (Luksaneeyanawin, 1983). For this purpose, we developed a multi-level model combining phone and syllable models for duration prediction. A key idea of the method is that the syllable-level model is used to predict the syllable duration, and then the predicted syllable duration is used as an additional context for the phone-duration model. Finally, the phone-duration model is used to predict phone durations for generating the other speech parameters. Fig. 2 is an overview of a speech synthesis system with our multi-level model for duration prediction.

Fig. 2. Block diagram of GPR-based speech synthesis system with multi-level model for duration prediction.

• Training part
  – Phone-level model
    Ph.1 Extract acoustic features including mel-cepstral coefficients, aperiodicity, and F0.
    Ph.2 Convert labels into phone- and frame-level context including the additional syllable-duration context (see Table 2).
    Ph.3 Train the phone-level duration model.
    Ph.4 Train the mel-cepstral coefficient, aperiodicity, and F0 models based on the GPR-based framework described in Section 2.
  – Syllable-level model
    Syl.1 Convert labels into syllable-level context (see Table 3) for model training.
    Syl.2 Train the syllable-level duration model.
• Synthesis part
    Syn.1 Make the syllable-level context of the input text.
    Syn.2 Predict the syllable durations corresponding to the input text by using the syllable-level duration model.
    Syn.3 Make the phone-level context of the input text including the syllable-duration context predicted with the syllable-duration model.
    Syn.4 Generate phone durations by using the phone-level duration model.
    Syn.5 Make the frame context, generate speech parameters, and synthesize speech.
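As a sketch of this two-stage flow (referring to the steps above), the following uses scikit-learn's GaussianProcessRegressor with its default kernel as a stand-in for the composite kernel of Section 2.2; the array layout and the ph_parent bookkeeping are ours, and we read the training-time syllable-duration context as coming from the reference labels.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def train_multilevel(syl_X, syl_dur, ph_X, ph_parent, ph_dur):
    """Steps Syl.1-2 and Ph.2-3. ph_parent[i] is the index of the
    syllable that contains phone i (illustrative bookkeeping)."""
    syl_dur = np.asarray(syl_dur)
    syl_model = GaussianProcessRegressor().fit(syl_X, syl_dur)    # Syl.2
    # Ph.2: append the syllable-duration context to each phone's features
    ph_aug = np.hstack([ph_X, syl_dur[ph_parent][:, None]])
    ph_model = GaussianProcessRegressor().fit(ph_aug, ph_dur)     # Ph.3
    return syl_model, ph_model

def predict_durations(syl_model, ph_model, syl_X, ph_X, ph_parent):
    """Steps Syn.2-4: predict syllable durations first, then use them
    as an additional context for phone-duration prediction."""
    syl_pred = syl_model.predict(syl_X)                           # Syn.2
    ph_aug = np.hstack([ph_X, syl_pred[ph_parent][:, None]])      # Syn.3
    return ph_model.predict(ph_aug)                               # Syn.4
```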

Step Ph.3 calculates the similarity between two syllable-duration contexts in the phone-level model by using the SE kernel as follows:

$$ \kappa_c(c_{m,k}^{(u)}, c_{n,k}^{(v)}) = \exp\!\left( -\frac{(c_{m,k}^{(u)} - c_{n,k}^{(v)})^2}{l_{c,k}^2} \right), \tag{13} $$

where $c_{m,k}^{(u)}$ and $c_{n,k}^{(v)}$ are syllable-duration contexts, and $l_{c,k}$ denotes a length-scale hyperparameter.

In step Syl.1, i.e., the syllable-duration model training, the syllable-level context is composed of the temporal event and relative position contexts. The temporal event has three layers: syllable, word, and utterance. Table 3 shows the syllable-level context used in this paper. Thai syllables are a combination of an initial consonant, vowel, final consonant, and tone. Accordingly, the syllable layer has three phonetic components (initial consonant, vowel, and final consonant) and one tone-type. Each phonetic component is represented by an array of phonetic features in the same manner as a single phone in the phone-level model. The tone-type in the syllable layer and the POS in the word layer are represented with one-hot vectors in the same manner as described in Section 2.3. In addition, we use the same kernel function as in the phone-level model for calculating the similarity of syllable-level contexts.

Table 3. Syllable-level temporal context for Thai GPR-based speech synthesis.

Unit: Syllable
  Type: beginning of each initial-consonant phonetic feature; beginning of each vowel phonetic feature; beginning of each final-consonant phonetic feature; beginning of tone-type
  Scale: {syllable, word}-normalized scale
Unit: Word
  Type: {beginning, end} of part of speech
  Scale: {syllable, word}-normalized scale
Unit: Utterance
  Type: {beginning, end} of utterance
  Scale: {syllable, word, utterance}-normalized scale

3.2. Multi-level DNN-based duration prediction

The proposed multi-level method can be applied not only to the GPR-based speech synthesis framework but also to other frameworks. In this paper, we incorporate it into DNN-based duration prediction. In multi-level DNN-based duration prediction, the syllable-level DNN-based duration model is used to predict the syllable duration. Then, the predicted syllable duration is used as an additional context in the phone-level DNN-based duration model. The input features of the DNN-based duration model are defined using the same approach as described by Zen et al. (2013) and are binary and numerical features including linguistic and positional information. More specifically, the input features are derived from the linguistic and position contexts shown in Tables 2 and 3. An overview of the DNN-based multi-level model for duration prediction is shown in Fig. 3.

Fig. 3. Overview of DNN-based multi-level model for duration prediction.
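A minimal PyTorch sketch of this cascade, with the network settings reported in Section 4.1 (1024-unit hidden layers, tanh activations); the layer count, input sizes, and ph_parent indexing are illustrative choices of ours.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden=1024, layers=3):
    # tanh MLP with a linear output predicting one duration value
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    mods.append(nn.Linear(d, 1))
    return nn.Sequential(*mods)

syl_net = mlp(in_dim=300)      # syllable-level model (input size illustrative)
ph_net = mlp(in_dim=300 + 1)   # phone-level model; +1 for the syllable-duration context

def predict_phone_durations(syl_x, ph_x, ph_parent):
    # ph_parent: LongTensor mapping each phone to its syllable's row
    with torch.no_grad():
        syl_dur = syl_net(syl_x)          # predicted syllable durations
        extra = syl_dur[ph_parent]        # broadcast to the phones
        return ph_net(torch.cat([ph_x, extra], dim=1))
```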

3.3. Multi-level HMM-based duration prediction using joint probability maximization

This section briefly explains a conventional approach to multi-level HMM-based duration prediction. We separately train state-, phone-, and syllable-duration models. The question sets used in the tree-based clustering are based on conventional HMM-based Thai speech synthesis (Chomphan and Kobayashi, 2007b) and include questions related to the phoneme, tone-type, part of speech, and position of the unit. In the syllable-level model, the question set includes the three phonetic components of a syllable, as in our multi-level GPR-based method. The question sets of the phone and syllable levels are summarized in Table 4. The duration of each leaf node is modeled by a Gaussian distribution.

For duration prediction, we combine the state-, phone-, and syllable-duration models in such a way that the joint probability is maximized (Qian et al., 2011; Gao et al., 2008): the likelihood of state durations is jointly maximized with the weighted likelihoods of phone and syllable durations.


For a given duration sequence $D = [d_1, d_2, \dots, d_J]$ of $J$ syllables, the log likelihood $L(D)$ of duration is defined as

$$ L(D) = \sum_j \left[ \sum_n \left[ \sum_k \log p_{j,n,k}(d_{j,n,k}) + \alpha \log p_{j,n}(d_{j,n}) \right] + \beta \log p_j(d_j) \right], \tag{14} $$

where $d_{j,n,k}$ is the duration of state $k$ in phone $n$ and syllable $j$, and $p_{j,n,k}(\cdot)$ is the corresponding probability density function. The pdfs of the phone and syllable durations are likewise defined by $p_{j,n}(\cdot)$ and $p_j(\cdot)$, respectively. The syllable and phone durations are constrained as follows:

$$ \sum_k d_{j,n,k} = d_{j,n} \tag{15} $$

$$ \sum_n d_{j,n} = d_j. \tag{16} $$

By maximizing $L(D)$, we obtain the solution for the state duration $d_{j,n,k}$, which is given by

$$ d_{j,n,k} = \mu_{j,n,k} + \left[ -\alpha \frac{d_{j,n} - \mu_{j,n}}{\sigma_{j,n}^2} - \beta \frac{d_j - \mu_j}{\sigma_j^2} \right] \sigma_{j,n,k}^2 \tag{17} $$

where $d_{j,n}$ and $d_j$ are obtained by applying the constraints of Eqs. (15) and (16).
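As a worked sketch, the following function applies Eq. (17) to refine the state durations of one phone once $d_{j,n}$ and $d_j$ are fixed; the array layout is illustrative.

```python
import numpy as np

def refine_state_durations(mu, var, d_phone, mu_phone, var_phone,
                           d_syl, mu_syl, var_syl, alpha, beta):
    """Eq. (17) for the K states of phone n in syllable j.
    mu, var: per-state Gaussian duration means/variances (length-K arrays)."""
    adjust = (-alpha * (d_phone - mu_phone) / var_phone
              - beta * (d_syl - mu_syl) / var_syl)
    return np.asarray(mu) + adjust * np.asarray(var)
```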

4. Experiments

To evaluate the effectiveness of the proposed multi-level method, we compared its performance with those of other methods in the HMM-, DNN-, and GPR-based SPSS frameworks. To confirm that the use of a multi-level model has more impact than simply adding syllable information to a single-level model, we also examined duration prediction using a single-level model that takes both phone- and syllable-level contexts into account, called the extended context method. The linguistic information used as contextual factors in all methods is summarized in Table 5.

4.1. Experimental conditions

A set of phonetically balanced sentences from the Thai speech database T-Sync-1 developed by NECTEC (Hansakunbuntheung et al., 2003; Hansakunbuntheung et al., 2005) was used for training and evaluation. The sentences were uttered by one professional female speaker with clear articulation in the reading style of the standard Thai accent. The phone durations were obtained from forced alignment using an HMM and rechecked by linguists (Hansakunbuntheung et al., 2003). The syllable duration can be calculated as the sum of the phone durations in a syllable. The number of training sentences was varied from 250 to 950 utterances, approximately 30 to 120 min. The test set contained 50 utterances that were not included in the training data. Speech signals were sampled at a rate of 16 kHz. Spectral features, aperiodicity, and F0 were extracted by STRAIGHT (Kawahara et al., 1999) with a 5 ms frame shift. The acoustic feature vector consisted of the 0th to 39th mel-cepstral coefficients, five-band aperiodicity, log F0, and their delta and delta-delta coefficients.


Table 4. Question sets of the phone- and syllable-level models for the multi-level HMM-based duration-prediction method.

Phoneme level
  Phone-duration model: phonetic features of {preceding, current, succeeding} phone; position of {preceding, current, succeeding} phone in syllable
  Syllable-duration model: —
Syllable level
  Phone-duration model: —
  Syllable-duration model: phonetic features of the initial consonant, vowel, and final consonant in the {preceding, current, succeeding} syllable
  Both models: tone-type of {preceding, current, succeeding} syllable; position of {preceding, current, succeeding} syllable in word; number of phones in {preceding, current, succeeding} syllable
Word level (both models): POS of {preceding, current, succeeding} word; number of syllables in {preceding, current, succeeding} word
Utterance level (both models): number of syllables in utterance; number of words in utterance

Table 5. Comparison of linguistic information as contextual factors used in the experiments.

Context (level)                                       Single-level   Single-level model      Multi-level model
                                                      (baseline)     with extended context   Phone-dur.   Syllable-dur.
Phonetic features (phoneme)                           ✓              ✓                       ✓            —
Tone (syllable)                                       ✓              ✓                       ✓            ✓
Phonetic features of initial consonant, vowel, and
final consonant within a syllable (syllable)          —              ✓                       —            ✓
POS (word)                                            —              —                       —            ✓

Fig. 4. Optimal α and β values for each number of training utterances.

Fig. 6. Mel-cepstrum distortions.

Fig. 7. Log F0 distortions.

In the conventional HMM-based SPSS framework (Chomphan and Kobayashi, 2008), we used context-dependent triphone hidden semi-Markov models (HSMMs) having a five-state, left-to-right, no-skip model topology (Zen et al., 2007). Decision-tree-based context clustering was carried out with the minimum description length (MDL) criterion (Shinoda and Watanabe, 2000).

In duration prediction with the multi-level HMM-based method, we used one hundred utterances as a development set to find the optimal α and β in Eq. (17). The development set was included in neither the training nor the test set. We used a grid search to find the optimal values of α and β that generated the lowest error in phone duration. We conducted the grid search separately for each number of training utterances used in the objective evaluation. Fig. 4 shows the optimal values of α and β for each training-utterance set. Fig. 5 shows the results of the grid search for 950 training utterances, where the optimal values were 1.4 and 1.0 for α and β, respectively.

Fig. 5. Full grid search result of optimal α and β values. Lowest distortion is marked with a star.

For the DNN-based SPSS, we used the framework proposed by Zen et al. (2013). We constructed two different multi-level DNN-based duration-prediction models having 3 and 6 hidden layers, respectively, since Zen et al. (2013) and Qian et al. (2014) showed that objective errors almost converged with between 3 and 6 hidden layers.

Fig. 8. Comparison of phone-duration distortions among the single-level model, extended context, and multi-level model for (a) HMM-based, (b) DNN-based with 3 hidden layers, (c) DNN-based with 6 hidden layers, and (d) GPR-based SPSS frameworks.

Fig. 9. Comparison of multi-level model methods in phone-duration distortions.

Fig. 10. Comparison of duration prediction errors in terms of phone units for (a) the single-level HMM-based model (H-S), (b) the multi-level HMM-based model (H-M), (c) the multi-level DNN-based model (D3-M), and (d) the multi-level GPR-based model (G-M). The sentence is "... then it occurs between modules ..." in English. Vertical dashed lines are syllable boundaries. The unit of error is milliseconds.
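A sketch of this grid search as we read it: choose the (α, β) pair minimizing the phone-duration RMSE on the development set. The candidate grid and the predict_fn callback are illustrative, not values from the paper.

```python
import itertools
import numpy as np

def grid_search_alpha_beta(dev_ref, predict_fn, grid=np.arange(0.0, 2.01, 0.2)):
    """dev_ref: reference phone durations of the development set.
    predict_fn(alpha, beta) -> predicted phone durations (same shape)."""
    best = None
    for alpha, beta in itertools.product(grid, grid):
        err = np.asarray(predict_fn(alpha, beta)) - np.asarray(dev_ref)
        rmse = np.sqrt(np.mean(err ** 2))
        if best is None or rmse < best[0]:
            best = (rmse, alpha, beta)
    return best   # (lowest RMSE, optimal alpha, optimal beta)
```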


Each hidden layer had 1024 nodes. The activation function was the tanh function. The input and output features used in network training were normalized to zero mean and unit variance. The multi-level DNN-based duration prediction was conducted in the same manner as with the GPR-based method: we first trained a DNN-based syllable-level duration model and then used the predicted syllable duration as an additional context in a DNN-based phone-level duration model.

In the GPR-based model, the numbers of temporal events, kernel function parameters, and input features were 39, 133, and 507 for the duration model, and 72, 376, and 1146 for the mel-cepstral coefficient, five-band aperiodicity, and log F0 models, respectively. We conducted GPR-based modeling using the partially independent conditional approximation (Koriyama et al., 2013b) and optimized the kernel function parameters using the expectation-maximization (EM)-based method (Koriyama et al., 2014a).

4.2. Objective evaluation

Before evaluating the duration-prediction performance, we compared the spectral and F0 distortions of the single-level models in the HMM-, DNN-, and GPR-based SPSS frameworks. Figs. 6 and 7 plot the mel-cepstral distances and the root mean square (RMS) errors of log F0, respectively. These results show that the GPR-based method achieved lower spectral and F0 distortions than the HMM- and DNN-based ones, which is consistent with the results of the study by Koriyama and Kobayashi (2015a).

Regarding the duration distortion, we first compared the effectiveness of the extended context and the multi-level model in terms of phone-duration distortion. Fig. 8 shows the phone-duration RMS errors of the respective duration-prediction models in the different SPSS frameworks. Although the single-level model with the extended context exhibited lower phone-duration distortion than the single-level model in every framework, the multi-level model achieved the lowest RMS error. From this result, we confirm that the use of a multi-level model has more impact than a single-level model with the extended context.

Fig. 9 shows the comparison of phone-duration distortions among the multi-level models in the different SPSS frameworks. The multi-level GPR-based model and the DNN-based model with 3 hidden layers exhibited lower RMS errors than the DNN-based model with 6 hidden layers in most cases, except at 950 utterances, where the GPR-based method gave a result comparable to the DNN-based model with 6 hidden layers. The DNN-based model with 3 hidden layers had a lower RMS error than the GPR-based model in all cases, and the difference became smaller above 650 utterances.

Fig. 10 illustrates an example of the duration errors in a sentence, where each bar represents the difference between the generated phone duration and the original one. The duration sequences were predicted using 950 utterances of training data. For the DNN-based method, the result of the 3-hidden-layer model is shown since it had less distortion than the 6-hidden-layer model. The duration distortions decreased when we applied the multi-level methods. Moreover, the multi-level DNN- and GPR-based models had smaller distortions than the HMM-based models, especially at the end of the sentence.
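For reference, a minimal sketch of the two objective measures under common conventions (dB scaling for mel-cepstral distortion with the 0th coefficient excluded, and log F0 error over voiced frames); the paper does not spell out these details, so the constants are assumptions.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_gen):
    # Conventional MCD in dB over coefficients 1..39 (0th excluded)
    diff = mc_ref[:, 1:] - mc_gen[:, 1:]
    return float(np.mean((10.0 / np.log(10.0))
                         * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def rms_log_f0_error(lf0_ref, lf0_gen, voiced):
    # RMS error of log F0 on frames marked voiced in both signals
    d = lf0_ref[voiced] - lf0_gen[voiced]
    return float(np.sqrt(np.mean(d ** 2)))
```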

4.3. Subjective evaluation

The subjective evaluation involved mean opinion score (MOS) and forced-choice preference tests to assess the perceived naturalness of the synthetic speech. In the MOS test, the participants evaluated each sample on a five-point scale from 1 to 5 according to their satisfaction regarding naturalness (1: bad, 2: poor, 3: fair, 4: good, 5: excellent). In the forced-choice preference test, the participants were asked to choose the more natural of each pair of speech samples. The participants could repeat playback as many times as they required. We used speech parameters generated from the models trained with 950 utterances of training data. The subjective evaluation consisted of four listening tests. All the tests were evaluated by a group of ten Thai native speakers. Each person evaluated ten samples that were randomly selected from the test set. The models for acoustic feature generation used in the MOS tests are summarized in Table 6.

Table 6. Acoustic feature generation frameworks used in the MOS tests.

Method       Spectral features, aperiodicity, F0    Duration
HMM:Single   HMM                                    Single-level HMM
DNN:Single   DNN (6-layer)                          Single-level DNN (3-layer)
DNN:Multi    DNN (6-layer)                          Multi-level DNN (3-layer)
GPR:Single   GPR                                    Single-level GPR
GPR:Multi    GPR                                    Multi-level GPR

4.3.1. Naturalness of generated acoustic features between HMM- and GPR-based methods

In the first subjective evaluation, we conducted a MOS test to compare the baseline performance of the conventional HMM-based and GPR-based SPSS frameworks. Fig. 11 shows the resulting scores with 95% confidence intervals. In this experiment, all speech parameters of the HMM:Single and GPR:Single methods were generated using the conventional single-level HMM- and GPR-based models, respectively. It is clear that the GPR-based model significantly outperformed the HMM-based one. This result is consistent with the objective evaluation, in which the GPR-based model achieved lower distortion than the HMM-based one in speech parameter generation.

Fig. 11. Comparison of mean opinion scores (MOSs) of naturalness between conventional HMM-based and baseline GPR-based SPSS.
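For reference, a sketch of the statistics reported in this section under the usual assumptions (a normal-approximation 95% confidence interval for MOS values and a two-sided binomial sign test for paired preference choices); the paper does not state which tests were used, so this is only our reading.

```python
import numpy as np
from scipy import stats

def mos_ci95(scores):
    scores = np.asarray(scores, dtype=float)
    half = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean(), half        # mean and 95% CI half-width

def preference_p_value(wins_a, wins_b):
    # Two-sided binomial sign test against the 50/50 null
    return stats.binomtest(wins_a, wins_a + wins_b, 0.5).pvalue
```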

4.3.2. Naturalness of generated acoustic features between DNN- and GPR-based methods

We then conducted a MOS test to evaluate the effectiveness of the GPR- and DNN-based SPSS frameworks with and without multi-level duration prediction. The results of the MOS test are shown in Fig. 12, where Single and Multi denote the use of single- and multi-level models for duration prediction, respectively. In this experiment, the spectral features, aperiodicity, F0, and duration for DNN:Single and DNN:Multi were generated based on the DNN-based SPSS framework. Similarly, the speech parameters for GPR:Single and GPR:Multi were generated in the GPR-based SPSS framework. The GPR:Single method significantly outperformed DNN:Single (p-value = 0.002). This result is consistent with the objective evaluation, in which the GPR-based method exhibited lower mel-cepstral and log F0 distortions than the DNN-based one. The DNN:Multi method also had a statistically higher score than DNN:Single (p-value = 0.00002). Thus, we confirm that the proposed multi-level method can improve duration prediction. The comparison between GPR:Multi and DNN:Multi shows that GPR:Multi had a higher score than DNN:Multi, but the difference is statistically insignificant (p-value = 0.22).

Fig. 12. Comparison of MOSs of naturalness between DNN- and GPR-based SPSS with single- and multi-level models.

4.3.3. Naturalness of predicted duration between HMM-based and proposed multi-level GPR-based methods

In the third evaluation, we compared the performance of the proposed GPR-based method with that of the conventional HMM-based ones by using a forced-choice preference test. The spectral features, aperiodicity, and F0 of all methods were trained and generated in the GPR-based framework because it exhibited the lowest distortions in the objective evaluation. As a result, the only difference was in duration prediction. Fig. 13 shows the forced-choice preference scores, where Single-level HMM, Multi-level HMM, Single-level GPR, and Multi-level GPR denote the duration-prediction methods. The results show that the proposed multi-level GPR-based method had significantly higher preference scores than the single-level GPR-, single-level HMM-, and multi-level HMM-based methods, with p-values of 0.0003, 0.002, and 0.04, respectively.

Fig. 13. Preference scores of duration prediction with the single- and multi-level models of the HMM- and GPR-based methods.

4.3.4. Naturalness of predicted duration between DNN-based and GPR-based methods

Finally, we conducted a forced-choice preference test to compare the effectiveness of duration prediction with the single- and multi-level models of the GPR- and DNN-based frameworks. The spectral features, aperiodicity, and F0 of each method were generated in the GPR-based framework; thus, the difference was only in the phone-duration prediction. Fig. 14 shows the results, where Single-level DNN, Multi-level DNN, and Multi-level GPR denote the duration-prediction methods. The multi-level DNN-based method outperformed the single-level DNN-based one (p-value = 0.0007) and was comparable to the multi-level GPR-based one (p-value = 0.4). From these results, we can see that the proposed multi-level duration-modeling approach is useful not only for the GPR-based but also for the DNN-based SPSS framework.

Fig. 14. Comparison of forced-choice preference scores of naturalness between DNN- and GPR-based methods with single- and multi-level duration predictions.

5. Conclusion

We proposed a multi-level GPR-based method for duration prediction using phone- and syllable-level models. In this method, a syllable-level model is used to predict the syllable duration, and then the predicted syllable duration is used as an additional context for the phone-level duration model. To train the syllable-duration model, we also designed a syllable-level context set for Thai speech synthesis, which consists of the phonetic features of the phonemes in a syllable, linguistic information of the syllable and longer units, and relative position information. The experimental results under the condition of using two-hour training data showed that the proposed multi-level duration-prediction method outperformed the conventional single-level one in the HMM-, DNN-, and GPR-based SPSS frameworks. Furthermore, we showed that the proposed method is useful not only for the GPR-based SPSS framework but also for the DNN-based one, and that it can perform better than the multi-level HMM-based method. The current study used a relatively small amount of training data, which is a realistic situation when the available training data are limited for constructing a usable speech synthesis system, especially for under-resourced languages. For future work, we will conduct experiments with a larger amount of training data, various speakers, and different languages. Moreover, we will examine a multi-level model for F0 modeling in the GPR-based SPSS framework.

Acknowledgments

We would like to thank Dr. Vataya Chunwijitra of NECTEC, Thailand, for providing the T-Sync-1 speech database. A part of this work was supported by JSPS KAKENHI Grant Number JP15H02724.

References

Campbell, W.N., 1992. Syllable-based segmental duration. In: Talking Machines: Theories, Models, and Designs. Elsevier, North-Holland, Amsterdam, pp. 211–224.
Campbell, W.N., 1993. Predicting segmental durations for accommodation within a syllable-level timing framework. Proc. EUROSPEECH, pp. 1081–1084.
Campbell, W.N., Isard, S.D., 1991. Segment durations in a syllable frame. J. Phon. 19 (1), 37–47.
Chen, S.-H., Lai, W.-H., Wang, Y.-R., 2003. A new duration modeling approach for Mandarin speech. IEEE Trans. Speech Audio Process. 11 (4), 308–320.



Chomphan, S., Kobayashi, T., 2007a. Design of tree-based context clustering for an HMM-based Thai speech synthesis system. Proc. of Sixth ISCA Workshop on Speech Synthesis (SSW6), pp. 160–165.
Chomphan, S., Kobayashi, T., 2007b. Implementation and evaluation of an HMM-based Thai speech synthesis system. Proc. INTERSPEECH, pp. 2849–2852.
Chomphan, S., Kobayashi, T., 2008. Tone correctness improvement in speaker-dependent HMM-based Thai speech synthesis. Speech Commun. 50 (5), 392–404.
Gao, B., Qian, Y., Wu, Z., Soong, F.K., 2008. Duration refinement by jointly optimizing state and longer unit likelihood. Proc. INTERSPEECH, pp. 2266–2269.
Goubanova, O., King, S., 2008. Bayesian networks for phone duration prediction. Speech Commun. 50 (4), 301–311.
Hansakunbuntheung, C., Rugchatjaroen, A., Wutiwiwatchai, C., 2005. Space reduction of speech corpus based on quality perception for unit selection speech synthesis. Proc. of the 6th Symposium on Natural Language Processing (SNLP), pp. 127–132.
Hansakunbuntheung, C., Tesprasit, V., Sornlertlamvanich, V., 2003. Thai tagged speech corpus for speech synthesis. The Oriental COCOSDA 2003, pp. 97–104.
Henter, G.E., Ronanki, S., Watts, O., Wester, M., Wu, Z., King, S., 2016. Robust TTS duration modelling using DNNs. Proc. ICASSP, pp. 5130–5134.
Iwahashi, N., Sagisaka, Y., 2000. Statistical modelling of speech segment duration by constrained tree regression. IEICE Trans. Inf. Syst. 83 (7), 1550–1559.
Kawahara, H., Masuda-Katsuse, I., de Cheveigné, A., 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun. 27 (3–4), 187–207. https://doi.org/10.1016/S0167-6393(98)00085-5.
Koriyama, T., Kobayashi, T., 2015a. A comparison of speech synthesis systems based on GPR, HMM, and DNN with a small amount of training data. Proc. INTERSPEECH, pp. 3496–3500.
Koriyama, T., Kobayashi, T., 2015b. Prosody generation using frame-based Gaussian process regression and classification for statistical parametric speech synthesis. Proc. ICASSP, pp. 4929–4933.
Koriyama, T., Nose, T., Kobayashi, T., 2013a. Frame-level acoustic modeling based on Gaussian process regression for statistical nonparametric speech synthesis. Proc. ICASSP, pp. 8007–8011.
Koriyama, T., Nose, T., Kobayashi, T., 2013b. Statistical nonparametric speech synthesis using sparse Gaussian processes. Proc. INTERSPEECH, pp. 1072–1076.
Koriyama, T., Nose, T., Kobayashi, T., 2014a. Parametric speech synthesis based on Gaussian process regression using global variance and hyperparameter optimization. Proc. ICASSP, pp. 3834–3838.
Koriyama, T., Nose, T., Kobayashi, T., 2014b. Statistical parametric speech synthesis based on Gaussian process regression. IEEE J. Sel. Topics Signal Process. 8 (2), 173–183.
Lazaridis, A., Ganchev, T., Mporas, I., Dermatas, E., Fakotakis, N., 2012. Two-stage phone duration modelling with feature construction and feature vector extension for the needs of speech synthesis. Comput. Speech Lang. 26 (4), 274–292.
Lazaridis, A., Honnet, P.-E., Garner, P.N., 2014. SVR vs MLP for phone duration modelling in HMM-based speech synthesis. Technical Report, Idiap.
Lazaridis, A., Mporas, I., Ganchev, T., Kokkinakis, G., Fakotakis, N., 2011. Improving phone duration modelling using support vector regression fusion. Speech Commun. 53 (1), 85–97.
Luksaneeyanawin, S., 1983. Intonation in Thai. University of Edinburgh.
Moungsri, D., Koriyama, T., Kobayashi, T., 2015. Duration prediction using multi-level model for GPR-based speech synthesis. Proc. INTERSPEECH, pp. 1591–1595.
Nagy, P., Németh, G., 2016. DNN-based duration modeling for synthesizing short sentences. Proc. of International Conference on Speech and Computer (SPECOM). Springer, pp. 254–261.
Peyasantiwong, P., 1986. Stress in Thai. Papers from a Conference on Thai Studies in Honor of William J. Gedney. Michigan Papers on South and Southeast Asia, Center for South and Southeast Asian Studies, University of Michigan, Ann Arbor, pp. 19–39.
Potisuk, S., Gandour, J., Harper, M., 1996. Acoustic correlates of stress in Thai. Phonetica 53 (4), 200–220.
Potisuk, S., Gandour, J., Harper, M.P., 1998. Vowel length and stress in Thai. Acta Linguistica Hafniensia 30 (1), 39–62.
Qian, Y., Fan, Y., Hu, W., Soong, F., 2014. On the training aspects of deep neural network (DNN) for parametric TTS synthesis. Proc. ICASSP, pp. 3829–3833.
Qian, Y., Wu, Z., Gao, B., Soong, F.K., 2011. Improved prosody generation by maximizing joint probability of state and longer units. IEEE Trans. Audio Speech Lang. Process. 19 (6), 1702–1710.
Rao, K.S., Yegnanarayana, B., 2007. Modeling durations of syllables using neural networks. Comput. Speech Lang. 21 (2), 282–295.
Sainz, I., Erro, D., Navas, E., Hernáez, I., 2011. A hybrid TTS approach for prosody and acoustic modules. Proc. INTERSPEECH, pp. 333–336.
Shinoda, K., Watanabe, T., 2000. MDL-based context-dependent subword modeling for speech recognition. J. Acoust. Soc. Japan 21 (2), 79–86.
Teixeira, J.P., Freitas, D., 2003. Segmental durations predicted with a neural network. Proc. EUROSPEECH/INTERSPEECH. ISCA, pp. 169–172.
Van Santen, J.P., 1992. Contextual effects on vowel duration. Speech Commun. 11 (6), 513–546.
Wang, Y., Yang, M., Wen, Z., Tao, J., 2015. Combining extreme learning machine and decision tree for duration prediction in HMM-based speech synthesis. Proc. INTERSPEECH, pp. 2197–2201.
Wutiwiwatchai, C., Furui, S., 2007. Thai speech processing technology: a review. Speech Commun. 49 (1), 8–27.
Yamagishi, J., Kawai, H., Kobayashi, T., 2008. Phone duration modeling using gradient tree boosting. Speech Commun. 50 (5), 405–415.
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T., 1998. Duration modeling for HMM-based speech synthesis. Proc. ICSLP, pp. 29–32.
Zen, H., Sak, H., 2015. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. Proc. ICASSP, pp. 4470–4474.
Zen, H., Senior, A., Schuster, M., 2013. Statistical parametric speech synthesis using deep neural networks. Proc. ICASSP, pp. 7962–7966.
Zen, H., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T., 2007. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst. E90-D (5), 825–834.