HIGHLIGHTS

• A novel excitation model based on excitation source parameters and phone-specific natural residual segments is proposed for accurate parameterization and generation of excitation signals in the SPSS framework.
• Energy, source spectrum, epoch strength and epoch sharpness are considered as the excitation features.
• The effectiveness of the proposed excitation model is analyzed in HMM-based and DNN-based statistical parametric speech synthesis frameworks.
Excitation modelling using epoch features for statistical parametric speech synthesis

M Kiran Reddy*, K Sreenivasa Rao
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India
* Corresponding author. Email addresses: [email protected] (M Kiran Reddy), [email protected] (K Sreenivasa Rao)
Abstract

In this paper, a novel excitation modelling method is proposed for improving the naturalness of statistical parametric speech synthesis (SPSS). In the proposed approach, the excitation or residual signal is parameterized using features extracted at the epochs. The epoch parameters used in this work are epoch strength and epoch sharpness. These features are modeled in the statistical framework along with the other parameters. During synthesis, the excitation signal is constructed by imposing the generated epoch parameters on natural instances of the excitation signal. The effectiveness of the proposed method is evaluated in the framework of hidden Markov model (HMM)-based and deep neural network (DNN)-based SPSS. Evaluation results have shown that the SPSS systems developed using the proposed excitation model are capable of synthesizing more natural sounding speech than the ones based on two state-of-the-art excitation modelling approaches.

Keywords: Speech synthesis, hidden Markov model, deep neural networks, epoch parameters, source features, excitation modelling

1. Introduction

Speech synthesis is widely used in applications such as screen readers and telephony inquiry systems. Nowadays, statistical parametric speech synthesis (SPSS) based on hidden Markov models (HMMs) [1] or neural networks (NNs) [2, 3, 4] is one of the most popular text-to-speech synthesis (TTS) techniques.
The HMM-based approach offers enough flexibility to manipulate voice characteristics, whereas the NN-based approach is more limited in terms of adaptability. On the other hand, NN-based SPSS has the potential to provide better voice quality than HMM-based SPSS, owing to improved acoustic modelling with NNs. Although the quality of vocoder-based SPSS has improved dramatically in recent years, the achieved quality is still far from that of natural speech. One of the major issues contributing to quality degradation is imprecise excitation modelling. The excitation source component contributes significantly to the naturalness of synthesized speech [13]. Hence, accurate modelling of the excitation signal is essential to improve the naturalness of synthesized speech.

The simplest excitation scheme [1, 5] uses only pitch or fundamental frequency (F0) to model the excitation signal. During synthesis, a sequence of pulses positioned according to the generated pitch is used as voiced excitation, and white noise is used as unvoiced excitation. As a result, a typical buzziness can be perceived in the synthesized speech. Therefore, several excitation or source modelling techniques have been proposed in the literature to generate improved excitation signals. First, a mixed excitation (ME) model was proposed by Yoshimura et al. in [6]. In this approach, the excitation parameters, namely band-pass voicing strengths, pitch, and Fourier magnitudes, are used to generate the voiced excitation. Zen et al. adopted the popular STRAIGHT vocoder [7] for HMM-based speech synthesis in [8]. This method generates voiced excitation using impulse trains and white noise components weighted by aperiodicity parameters. In [10], Maia et al. proposed an ME approach based on a closed-loop training procedure for an HMM-based TTS system. Kim et al. proposed a two-band excitation model for generating a mixed excitation source [11].

Instead of modelling the excitation signal with an ME approach, some techniques based on glottal flow pulses have been proposed. The Liljencrants-Fant (LF) model is used to generate the glottal excitation in [12]. Here, the LF parameters are modeled by HMMs, and during synthesis the generated LF parameters are used to control the glottal pulse shape. Raitio et al. utilized a single natural glottal pulse, modified according to the parameters generated by HMMs, to synthesize the excitation signal [13]. This model was further improved by utilizing a library of glottal flow pulses [14]. Recently, glottal neural vocoders have been introduced for spectral modelling and glottal excitation generation in the SPSS framework [15, 16, 17, 18].

As an alternative to the glottal source signal, the residual signal obtained by
inverse filtering has gained interest in excitation modelling. In [20, 19], a codebook of pitch-synchronous residual frames is constructed, which is used during synthesis to generate the source signal according to a target residual specification. A method based on the pitch-scaled spectrum of the excitation signal is proposed in [21]. Here, the residual signal is modeled as a combination of a periodic spectrum and an aperiodicity measure fitted by a sigmoid function. A uniform concatenative excitation model is proposed in [22] to generate the excitation signal in both voiced and unvoiced speech. Drugman et al. proposed the deterministic plus stochastic model (DSM) of the residual signal in [23]. In the DSM approach, the voiced excitation is modeled as a superposition of two components, low-frequency (deterministic) and high-frequency (stochastic), separated in the spectral domain by a fixed maximum voiced frequency. The deterministic component is the first eigenvector obtained by Principal Component Analysis (PCA) of a dataset of pitch-synchronous residual frames. The stochastic component is obtained by modifying the spectral and amplitude envelopes of white Gaussian noise. More recently, an excitation model based on a time-domain deterministic plus noise (DPN) model was proposed in [24]. This approach models the voiced excitation as a combination of deterministic and noise components. The deterministic component is represented using 20 PCA coefficients, and the noise component is parameterized in terms of amplitude and spectral envelopes. The DPN model has shown better speech quality than the traditional pulse excitation, STRAIGHT, and DSM approaches.

This paper focuses on efficient parameterization and modelling of the residual excitation signal. The existing residual-based models [19, 20, 21, 22, 23, 24] have shown a significant improvement in speech quality. However, most of these approaches cannot adequately capture the time-varying characteristics of the excitation signal around the epoch locations. Epochs, also called glottal closure instants (GCIs), are the impulse-like excitations due to the abrupt closure of the vocal folds. In the entire residual signal, the regions around the GCIs carry significant information related to the perceptual characteristics of voiced speech [22, 25]. The features derived from the epochs contain emotion and voice-quality cues [29, 30, 31, 27, 28]. Our intuition is that the synthesized speech will be more natural if the epoch features are utilized in the generation of the excitation signal. In this paper, we propose an excitation model based on epoch parameters for improving the naturalness of SPSS. In the proposed method, the excitation signal is compactly represented using epoch parameters, which are modeled in the statistical framework. At the
time of synthesis, the generated excitation parameters and phone-specific residual segments are used to construct more realistic excitation signals. Speech is synthesized by filtering the resulting excitation signal with a Mel-generalised log spectrum approximation (MGLSA) filter.

The rest of the paper is organized as follows. Section 2 details the proposed approach for modelling and generation of the excitation signal. In Section 3, the HMM-based SPSS framework incorporating the proposed source modelling approach is described. The quality of the proposed method is evaluated and compared with the state-of-the-art approaches in Section 4; this section also describes the effectiveness of the proposed excitation model for DNN-based speech synthesis. Section 5 summarizes the present work and provides directions for future work.

2. Proposed excitation modelling method

The source-filter theory of speech production interprets the speech signal s(t) as the convolution of two components, namely the excitation signal e(t) and the response of the vocal tract filter v(t). Generally, the residual signal obtained by inverse filtering is considered a good approximation of the excitation source signal. The residual excitation signal mostly contains phone-, emotion-, and speaker-specific information [13, 26, 29, 30, 31]. Hence, the proposed excitation modelling approach tries to represent the excitation signal using features that can capture the information embedded in it.

The various steps in the proposed method are depicted in Fig. 1. The speech signal is independently given as input to an MGLSA inverse filter and a zero-frequency filter to generate the excitation signal (explained in Section 2.1) and the zero-frequency filtered signal (explained in Section 2.2), respectively. In this work, we considered speech signals sampled at 16 kHz. First, the energy and the source spectral envelope are extracted from every frame (frame size = 25 ms and frame shift = 5 ms) of the excitation signal. In [26], residual energy has been used successfully for speaker recognition; hence, speaker-related information can be captured adequately by modelling the energy. The frame-wise spectral envelope of the source signal is computed using linear prediction (LP) analysis. For a sampling frequency of 16 kHz, typically 10-14 poles are used for LP analysis [24]. Hence, in this work, the order of LP is chosen to be 10. This process is similar to the estimation of the speech spectral envelope, except that the input is the excitation signal instead of the speech signal [24, 32]. The LP coefficients (LPCs) capture the spectral properties of the excitation signal in each frame.
Figure 1: Flow diagram indicating different steps in excitation parameterization.
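As an illustration of the frame-wise analysis just described, the following sketch (hypothetical helper names, not the authors' code) computes the per-frame energy and a 10th-order LP spectral envelope of a residual signal; the frame size, shift, and LP order follow the values stated above, and librosa's LPC routine is assumed as a stand-in for the analysis used in the paper.

```python
import numpy as np
import librosa  # provides librosa.lpc (Burg's method)

def residual_frame_features(residual, fs=16000, frame_ms=25, shift_ms=5, lp_order=10):
    """Per-frame energy and LP coefficients of a residual signal.

    A minimal sketch of the parameterization in Section 2:
    25 ms frames, 5 ms shift, 10th-order LP analysis.
    """
    residual = np.asarray(residual, dtype=float)
    flen = int(fs * frame_ms / 1000)    # 400 samples at 16 kHz
    fshift = int(fs * shift_ms / 1000)  # 80 samples at 16 kHz
    win = np.hanning(flen)
    energies, lpcs = [], []
    for start in range(0, len(residual) - flen + 1, fshift):
        frame = residual[start:start + flen] * win
        energies.append(np.sum(frame ** 2))     # frame energy
        a = librosa.lpc(frame, order=lp_order)  # [1, a1, ..., a10]
        lpcs.append(a)
    return np.array(energies), np.array(lpcs)
```

In the paper, the LPCs are subsequently converted to LSFs for statistical modelling; that conversion step is omitted from the sketch.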
As LPCs are sensitive to quantization noise, they are transformed to line spectral frequencies (LSFs), which are more stable and offer better quantization performance. The zero-frequency filtered signal (ZFFS) is used to compute the epoch locations. Using the epochs as anchor points, the epoch strength and epoch sharpness are computed from the ZFFS and the excitation signal (explained in Section 2.2), respectively. Energy, F0, source spectral envelope, epoch strength and epoch sharpness are considered as the excitation parameters. At the time of synthesis, the excitation signal is reconstructed by imposing the statistically generated parameters on natural instances of the excitation signal (explained in Section 2.3).

2.1. MGLSA inverse filtering

The excitation signal is obtained by inverse filtering the speech signal with an MGLSA filter derived from Mel-generalized cepstral coefficients (MGCCs). The MGCCs are used to model the speech spectral envelope or vocal tract component. The transfer function of the MGLSA synthesis filter D(z) [33] is given by

D(z) = \left( 1 + \gamma \sum_{m=1}^{M} b(m)\, \phi_m(z) \right)^{1/\gamma}    (1)
where

\phi_m(z) = \frac{(1 - \alpha^2)\, z^{-1}}{1 - \alpha z^{-1}} \, \tilde{z}^{-(m-1)}, \quad m \ge 1    (2)

and \tilde{z}^{-1} is the transfer function of an all-pass filter defined as

\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}    (3)

Here, \alpha \in [-1, 0) \cup (0, 1] controls the frequency warping, and \gamma = -1/K (K \in \mathbb{Z}_{>0}) is the power parameter.
Figure 2: (a) Speech signal, (b) ZFFS and (c) residual signal. The detected epoch locations are shown with red stars.
The MGLSA filter coefficients b(m) are the gain-normalized and linearly transformed version of the Mth-order MGCCs [33]. In this work, 34th-order MGCCs are extracted with the parameter values \alpha = 0.42 and \gamma = -1/3 [9]. The filter 1/D(z) is the MGLSA inverse filter, which is excited by the speech signal to generate the residual signal. The residual signal contains sharp, periodic discontinuities for voiced speech, whereas for unvoiced speech it is noise-like, without any periodicity. Fig. 2(a) shows a segment of a voiced speech signal, and the corresponding residual signal obtained by inverse filtering is shown in Fig. 2(c).
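A full MGLSA implementation requires Mel-generalized cepstral analysis; as a lightweight stand-in for the inverse filter 1/D(z), the sketch below whitens the speech frame-by-frame with an ordinary LP inverse (prediction-error) filter, which performs the same kind of source-filter deconvolution. Names and windowing choices are illustrative.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual(speech, fs=16000, frame_ms=25, shift_ms=5, order=34):
    """Residual via frame-wise LP inverse filtering (a stand-in for the
    MGLSA inverse filter 1/D(z) used in the paper)."""
    speech = np.asarray(speech, dtype=float)
    flen = int(fs * frame_ms / 1000)
    fshift = int(fs * shift_ms / 1000)
    win = np.hanning(flen)
    residual = np.zeros_like(speech)
    for start in range(0, len(speech) - flen + 1, fshift):
        frame = speech[start:start + flen]
        a = librosa.lpc(frame * win, order=order)  # [1, a1, ..., aM]
        # Prediction-error (whitening) filter A(z) applied to the frame
        e = lfilter(a, [1.0], frame)
        # Overlap-add the windowed residual segments; the overlapping
        # windows introduce only a constant gain, irrelevant here
        residual[start:start + flen] += e * win
    return residual
```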
2.2. Extraction of epoch strength and sharpness

The most significant excitation of the vocal-tract system takes place during the production of voiced speech, around the instant of glottal closure, called the epoch. Accurate detection of epochs is useful in characterizing voice-quality features. In this work, the zero-frequency filtering (ZFF) technique [34] is used to extract the epoch locations from the speech signal. In the ZFF method, the speech is first passed through a cascade of two zero-frequency resonators. The trend in the zero-frequency resonator output is then removed by local mean subtraction to obtain the zero-frequency filtered signal (ZFFS). Fig. 2(b) shows the ZFFS for the speech signal shown in Fig. 2(a). The time instants of positive zero crossings in the ZFFS are considered as GCIs. In Fig. 2, the red stars indicate the epoch locations estimated from the ZFFS.

Epoch strength, also called strength of excitation (SoE), provides information about the amplitude of the excitation signal at the epoch locations. SoE, s[k], is commonly computed as the slope of the ZFFS, \hat{y}[n], at the epoch locations k, as follows [25]:

s[k] = \left| \hat{y}[k+1] - \hat{y}[k-1] \right|    (4)

In the literature, this feature is reported as one of the major emotion-specific source parameters [29, 30, 31]. The importance of SoE in preserving naturalness has also been demonstrated in [25], which shows that modelling SoE is essential for capturing information related to naturalness. Fig. 3 shows a segment of a speech signal, the ZFFS, and the strength of the epochs estimated from the ZFFS.

Epoch sharpness (ES) represents the impulse-like nature of the excitation signal during glottal closure, and is indicative of the perceived loudness of speech [31]. ES has been found effective in discriminating voice qualities [27, 28]. It is derived from the Hilbert envelope of the residual signal r(n). The Hilbert envelope h_e(n) of r(n) is given by

h_e(n) = \sqrt{ r^2(n) + r_{HT}^2(n) }    (5)

where r_{HT}(n) denotes the Hilbert transform of r(n), given by

r_{HT}(n) = \mathrm{IFT}\left( R_{HT}(\omega) \right)    (6)

where IFT denotes the inverse Fourier transform, and R_{HT}(\omega) is given by

R_{HT}(\omega) = \begin{cases} j R(\omega), & \omega \le 0 \\ -j R(\omega), & \omega > 0 \end{cases}    (7)

Here, R(\omega) denotes the Fourier transform of r(n), and \omega and j represent the angular frequency and the imaginary unit, respectively. The sharpness of an epoch is computed as \eta = \sigma / \mu, where \sigma and \mu are the standard deviation and mean of the samples of the Hilbert envelope in a 2 ms window around the epoch location. Fig. 4 shows a segment of a speech signal, the corresponding residual signal, and the sharpness of the epochs estimated from the Hilbert envelope of the residual signal.
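The epoch pipeline above can be sketched compactly as follows: the two cascaded zero-frequency resonators are realized as repeated cumulative sums, the trend is removed by local mean subtraction (a 10 ms window is an assumed choice, roughly one to two pitch periods), GCIs are the positive zero crossings, SoE follows Eq. (4), and sharpness is η = σ/μ of the Hilbert envelope in a 2 ms window.

```python
import numpy as np
from scipy.signal import hilbert

def zff_epochs(speech, fs=16000, win_ms=10):
    """Zero-frequency filtering: return the ZFFS and epoch (GCI) locations."""
    x = np.diff(speech, prepend=speech[0]).astype(np.float64)  # remove DC
    y = x.copy()
    for _ in range(4):                   # two cascaded 2nd-order 0-Hz resonators
        y = np.cumsum(y)
    w = int(fs * win_ms / 1000) | 1      # odd-length trend-removal window
    kern = np.ones(w) / w
    for _ in range(3):                   # successive local-mean subtraction
        y -= np.convolve(y, kern, mode="same")
    gcis = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]  # positive zero crossings
    return y, gcis

def epoch_strength(zffs, gcis):
    """Eq. (4): SoE as the slope of the ZFFS at each epoch."""
    k = gcis[(gcis > 0) & (gcis < len(zffs) - 1)]
    return np.abs(zffs[k + 1] - zffs[k - 1])

def epoch_sharpness(residual, gcis, fs=16000):
    """Sharpness eta = sigma/mu of the Hilbert envelope in a 2 ms window."""
    env = np.abs(hilbert(residual))      # Eq. (5): Hilbert envelope
    half = int(0.001 * fs)               # +/- 1 ms -> 2 ms window
    etas = []
    for g in gcis:
        seg = env[max(0, g - half): g + half + 1]
        etas.append(seg.std() / (seg.mean() + 1e-12))
    return np.array(etas)
```

The cumulative-sum realization grows without bound on long signals; differencing the input first and working in float64, as above, keeps this manageable for utterance-length inputs.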
Figure 3: (a) Speech signal, (b) corresponding ZFFS and (c) strength of epochs.
Figure 4: (a) Speech signal, (b) corresponding residual signal, (c) Hilbert envelope of residual signal and (d) sharpness of epochs.
Figure 5: Flow diagram showing different stages in excitation generation. The parameters generated by HMM/DNN are shown in bold-italic.
From Figs. 3 and 4, it can be seen that the characteristics of the excitation signal vary from one epoch to another. To reproduce these natural variations in the excitation signal used for synthesis, it is necessary to include the epoch parameters in the modelling and generation of the excitation signal. As it is convenient to model all features computed at a constant frame size and frame rate in a unified framework, the epoch strength and sharpness computed from the epochs present in every 25 ms frame are averaged and assigned as the parameters of that frame. Except for energy and the source spectrum, all other excitation parameters are set to zero for unvoiced speech. The voicing decision is obtained using the continuous wavelet transform based method proposed in [36]. The excitation parameters are trained in the HMM and DNN frameworks (explained in Section 3).

2.3. Excitation signal generation during synthesis

During synthesis, the source parameters generated by the HMM/DNN framework and the natural residual segments are used to construct the excitation signal, as shown in Fig. 5. Depending on the input phone, the corresponding natural reference segment is chosen from a pre-created database. The procedure for developing the database of natural reference segments is given in Section 2.4. The excitation signal is constructed separately for voiced and unvoiced frames.
For voiced frames, the chosen reference segment is first resampled to a length of twice the target pitch period. Second, the resampled reference segment is pitch-synchronously overlap-added using a Hanning window of length twice the target period, centered at the peak of the segment, to obtain the excitation signal. The spectral envelope of the generated excitation signal (estimated using LPCs, b_k) differs from the target LP spectrum. The target LP spectrum is represented by the LSFs obtained from the HMM/DNN; these LSFs are converted back to LPCs, a_k. The excitation signal is then passed through an IIR filter constructed from both sets of LPCs to compensate for the difference between the two spectra. The transfer function of the IIR filter is given by

H(z) = \frac{1 - \sum_{k=1}^{p} b_k z^{-k}}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (8)
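A sketch of the voiced-excitation generation and the compensation filter of Eq. (8) follows. Function names are illustrative; note that standard LPC routines (e.g. librosa.lpc) return the full polynomial [1, c1, ..., cp] of A(z) = 1 + Σ c_k z^{-k}, in which convention Eq. (8) is simply A_gen(z)/A_target(z).

```python
import numpy as np
from scipy.signal import lfilter, resample

def compensate_spectrum(excitation, lpc_gen, lpc_target):
    """Eq. (8): replace the LP spectrum of the generated excitation with
    the target LP spectrum.

    Both arguments are full polynomial arrays [1, c1, ..., cp] as returned
    by librosa.lpc; with that sign convention H(z) = A_gen(z)/A_target(z).
    """
    # Numerator whitens the generated spectrum; denominator imposes the target
    return lfilter(lpc_gen, lpc_target, excitation)

def voiced_excitation(ref_segment, t0_samples, n_frames):
    """Minimal pitch-synchronous overlap-add of a reference segment
    (illustrative; real synthesis tracks a time-varying pitch contour)."""
    seg = resample(ref_segment, 2 * t0_samples)  # two pitch periods long
    seg = seg * np.hanning(len(seg))             # centered Hanning window
    out = np.zeros(t0_samples * (n_frames + 1))
    for i in range(n_frames):                    # one pulse per pitch period
        out[i * t0_samples : i * t0_samples + len(seg)] += seg
    return out
```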
The filtering operation in Equation (8) imposes the target LP spectrum on the generated excitation signal.

Next, the peaks of the Hilbert envelope of the excitation signal are modified in order to incorporate the target epoch parameters. The modification of epoch strength and sharpness is carried out using the methods proposed in [31], where the samples of the Hilbert envelope of the excitation signal around the epoch locations are manipulated according to fixed strength and sharpness modification factors. In this work, the modification is performed at the frame level, based on modification factors derived from the parameters generated for every frame. Since the peaks in the Hilbert envelope of the generated excitation correspond to approximate epoch locations, modifying the peaks achieves the desired effects. To modify the epoch strength, the samples within 2 ms around the Hilbert envelope peaks are scaled by a strength modification factor s, calculated as

s = \frac{s_t}{s_r}    (9)

where s_t represents the target epoch strength and s_r denotes the epoch strength estimated from the GCI present in the chosen reference segment. Epoch sharpness is modified by changing the samples to the left and right in the 2 ms interval around each peak, while keeping the peak constant [31]. If n denotes a peak location in the Hilbert envelope, then for any sample before the peak h_e(n), the previous sample h_e(n-1) is modified as follows:

h_e'(n-1) = (1-k)\, h_e(n) + k\, h_e(n-1)    (10)
Figure 6: Reference segment extracted from (a) voiced phone /aa/ and (b) unvoiced phone /s/ spoken by female SLT speaker.
Similarly, for any sample after the peak, the next sample h_e(n+1) following h_e(n) is modified as follows:

h_e'(n+1) = (1-k)\, h_e(n) + k\, h_e(n+1)    (11)

Here, k is the sharpness modification factor, computed according to the relation

k = \frac{E_t}{E_r}    (12)

where E_t represents the target epoch sharpness and E_r denotes the epoch sharpness estimated from the GCI present in the chosen reference segment. After the modifications are made to the Hilbert envelope, the corresponding modified excitation signal is reconstructed by multiplying the modified Hilbert envelope by the cosine of the phase of the analytic signal corresponding to the Hilbert envelope. Finally, the energy of the obtained excitation signal is equalized to the target energy measure. For unvoiced frames, a reference segment whose LP spectrum and energy are modified according to the target parameters is used as the excitation signal.
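The envelope manipulation of Eqs. (9)-(12) and the cosine-phase reconstruction can be sketched as follows; the sequential left/right propagation from each peak is one plausible reading of the scheme in [31], and the peak locations are assumed given:

```python
import numpy as np
from scipy.signal import hilbert

def modify_epoch_features(excitation, peaks, s, k, fs=16000):
    """Impose target epoch strength and sharpness on an excitation signal.

    s: strength factor, Eq. (9); k: sharpness factor, Eq. (12);
    peaks: Hilbert-envelope peak indices (approximate epoch locations).
    """
    analytic = hilbert(excitation)
    env, phase = np.abs(analytic), np.angle(analytic)
    half = int(0.001 * fs)                  # 2 ms window: +/- 1 ms
    for p in peaks:
        lo, hi = max(0, p - half), min(len(env) - 1, p + half)
        env[lo:hi + 1] *= s                 # strength scaling, Eq. (9)
        for n in range(p, lo, -1):          # left of the peak, Eq. (10)
            env[n - 1] = (1 - k) * env[n] + k * env[n - 1]
        for n in range(p, hi):              # right of the peak, Eq. (11)
            env[n + 1] = (1 - k) * env[n] + k * env[n + 1]
    # Reconstruct: modified envelope times cosine of the analytic phase
    return env * np.cos(phase)
```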
2.4. Storing natural reference segments

A reference segment acts as a representative sample of the excitation signal of the corresponding phone. During synthesis, the natural reference segments of all phones are available for the generation of the excitation signal, and the reference segment corresponding to the input phone is selected from the database. To accomplish this, reference segments of all voiced and unvoiced phonetic classes are stored in the database. For a voiced phone, a single pitch-synchronous residual frame (two pitch periods long and GCI-centred) is considered as the reference segment. The procedure for selecting the reference segment of a voiced phone is as follows (a sketch of the selection step appears at the end of this subsection):

1. Extract pitch-synchronous residual frames from the steady-state portions of the phone.
2. Parameterize each frame into a set of 14-dimensional features: (i) pitch value (1 dim), (ii) epoch sharpness (1 dim), (iii) maximum-to-minimum peak ratio (1 dim), (iv) harmonic-to-noise ratio (HNR) (1 dim) [35], and (v) LPCs (10 dim). These features describe the residual frames as closely as possible.
3. Compute the mean of all distances between a given residual frame and all other residual frames. The distance between two residual frames is computed as the Mahalanobis distance between the corresponding feature vectors.
4. The residual frame with the lowest mean distance is considered as the reference segment of the phone.

The above procedure is followed for every voiced phone present in the speech corpus to extract the corresponding reference segments. The voiced reference segments are normalized in both energy and pitch period: the pitch periods of the reference segments are normalized to the maximum pitch period of the speaker, and the energy of each reference segment is normalized by fixing its total energy to 1. For an unvoiced phone, each residual frame is parameterized in terms of its amplitude envelope (15 dim). Then, the average of all the amplitude envelopes is computed, and the residual segment whose amplitude envelope is closest to the mean amplitude envelope (according to the Mahalanobis distance) is taken as the reference segment of that phone. Examples of voiced and unvoiced reference segments for the SLT speaker of the CMU Arctic database [41] are shown in Fig. 6.
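A sketch of the medoid-style selection in steps 3-4 is given below; estimating the Mahalanobis covariance from the pooled frame features is an assumption, as the paper does not state how it is obtained:

```python
import numpy as np
from scipy.spatial.distance import cdist

def select_reference_frame(features):
    """Pick the residual frame whose mean Mahalanobis distance to all
    other frames is smallest (steps 3-4 of the selection procedure).

    features: (n_frames, 14) array of per-frame feature vectors.
    """
    vi = np.linalg.pinv(np.cov(features, rowvar=False))  # inverse covariance
    d = cdist(features, features, metric="mahalanobis", VI=vi)
    return int(np.argmin(d.mean(axis=1)))                # index of the medoid
```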
Table 1: Preference test results (%) between the synthesized speech samples

Test Index   Baseline   PM4    PM3    PM2    PM1    Equivalent
Test 1       11.4       61.4   -      -      -      27.2
Test 2       -          21.7   44.6   -      -      33.7
Test 3       -          -      23.6   33.8   -      42.6
Test 4       -          -      -      10.4   19.1   70.5
Test 5       7.4        -      -      -      85.8   6.8

(Bold font for p < 0.001, italic font for p < 0.05)
2.5. Analyzing the effect of epoch parameters and unvoiced modelling

To analyze the effect of the epoch parameters and of unvoiced excitation modelling on synthesis quality, the proposed method is evaluated against its variants in terms of analysis/resynthesis quality. The methods compared are:

• Baseline – the traditional pulse excitation approach, which considers only F0 as the excitation parameter.
• PM1 – the proposed excitation modelling method.
• PM2 – the same as PM1, except that white noise is used as the unvoiced excitation.
• PM3 – the same as PM2, except that there is no epoch sharpness modification.
• PM4 – the same as PM2, except that there is no epoch strength or sharpness modification.

Note that in these experiments, white noise is used as the unvoiced excitation in all systems except PM1. For each method, 50 randomly selected utterances from the SLT speaker [41] are analyzed and resynthesized. We carried out preference listening tests in which subjects were asked to choose the synthesized utterance that is closer to the corresponding natural speech. Fifteen native Indian subjects participated in the test, and the results are shown in Table 1.

The trends in the table can be analyzed as follows. (1) The quality of PM1 and PM4 is significantly better than that of the baseline system (Tests 1 and 5); intuitively, a significant improvement in quality over the baseline can also be inferred for PM2 and PM3. This confirms that the excitation signals generated using natural residual segments are much better than a sequence of pulses. (2) PM3 provided superior quality to PM4 (Test 2), demonstrating that incorporating the epoch strength results in a more faithful reconstruction of the excitation signals. (3) PM2 showed slightly better quality than PM3, at a significance level of 0.05 (Test 3). (4) The results
indicate that natural residual segments and epoch strength contribute significantly to the naturalness of speech, and that incorporating epoch sharpness can further enhance the quality of the synthesized speech.

In order to assess the impact of unvoiced excitation modelling, the PM1 system (which includes unvoiced excitation modelling) and the PM2 system (which uses white noise as unvoiced excitation) are compared. Fifteen short sentences that mostly contain consonants (such as "strychnine Knightsbridge cyttyns"), analyzed/resynthesized with both systems, are used for this evaluation. From Table 1, it can be seen that there is no significant difference between PM1 and PM2 (Test 4). However, the listeners observed that PM1 is slightly more intelligible than PM2 for some of the synthesized files. These experiments confirm that the epoch parameters and the modelling of the unvoiced excitation do indeed improve the perceived quality of synthesized speech.

3. Integration of the proposed method in SPSS

In this section, the integration of the proposed excitation model in SPSS using the HMM and DNN frameworks is discussed.

3.1. HMM-based system

In this work, the publicly available HTS toolkit (HTS version 2.3) [38] is used to implement our HMM-based speech synthesizer. During training, the fundamental frequency (F0), vocal-tract and excitation parameters are derived from the speech signals. 34-dimensional MGCCs computed with STRAIGHT [9] represent the vocal tract part. F0 is estimated using the method proposed in [36], which performs pitch estimation using the mean signal obtained from continuous wavelet transform coefficients. The excitation parameters, namely energy, spectral envelope and epoch parameters, are computed using the proposed excitation model. The speech parameters (F0, MGCCs and excitation parameters) are extracted with a frame size of 25 ms and a frame shift of 5 ms, and are modeled with multi-stream HMMs [39]. All the parameters (except F0), together with their first and second derivatives, are modeled using continuous probability density HMMs (CD-HMMs). The F0 patterns are statistically modeled as a mixture of continuous values for voiced regions and discrete symbols for unvoiced regions by the multi-space probability distribution HMM (MSD-HMM). The epoch parameters would be natural candidates for modelling with MSD-HMMs, but in this work CD-HMMs were used and the epoch parameters generated
for unvoiced regions are simply discarded. We also examined modelling the epoch parameters with MSD-HMMs, but there was no noticeable difference in the synthesized speech between CD-HMM and MSD-HMM modelling of the epoch parameters.

In this paper, 5-state left-to-right HMMs are used for all the experiments. A single Gaussian distribution with diagonal covariance is used to model the output probabilities of each state and the state durations of each phoneme HMM. First, monophone HMMs are trained using the EM algorithm and segmental K-means, based on phonetic labels containing time-alignment information. The monophone HMMs are converted into context-dependent HMMs, and the model parameters are re-estimated. Then, the data appearing in similar contexts are modeled using the decision-tree based context clustering technique [40]. The model parameters at each leaf node of the decision trees are tied and re-estimated again. Except for the MGCC and F0 streams, the weights of the other streams are set to zero during the alignment stage of re-estimation.

During synthesis, context-dependent phoneme label sequences are generated from the input text. An utterance HMM is constructed according to the generated label sequence, and the most probable speech parameter sequence is generated from it using the global variance (GV)-based parameter generation algorithm [37]; the advantage of the GV-based approach is that it alleviates the oversmoothing problem in SPSS. From the generated source parameters, the excitation signal is constructed using the proposed approach. Finally, the generated STRAIGHT-MGCCs and the excitation signal are given as input to the MGLSA synthesis filter to produce speech.

3.2. DNN-based system

Besides the HMM-based SPSS systems, we have also developed DNN-based SPSS systems using the HTS toolkit (HTS version 2.3) [38]. We use a feed-forward neural network with four hidden layers to map the linguistic features to the acoustic features. Each hidden layer has 2048 hyperbolic tangent units. The input features consist of 482 binary features for categorical linguistic contexts and 9 numerical features for numerical linguistic contexts, derived from the label files using HTS-style questions. The output features include the spectral and excitation source parameters and their time derivatives (dynamic features), which are essentially the same as those used in the corresponding HMM-based systems. Input features are normalised to the range [0.01, 0.99], whereas the output features are normalised to zero mean and unit variance. The weights of the DNN are initialized randomly and optimized using the mini-batch stochastic gradient descent algorithm (mini-batch size = 256).
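A hedged PyTorch sketch of such an acoustic model and the two normalization schemes is given below; the layer sizes match the description above, while the output dimensionality and the learning rate are placeholders, as the paper does not state them:

```python
import torch
import torch.nn as nn

def minmax_01_099(x, lo, hi):
    """Scale input features to [0.01, 0.99] given per-dimension min/max."""
    return 0.01 + 0.98 * (x - lo) / (hi - lo + 1e-8)

def zscore(x, mean, std):
    """Zero-mean, unit-variance normalization for the output features."""
    return (x - mean) / (std + 1e-8)

class AcousticDNN(nn.Module):
    """Feed-forward acoustic model: 491 linguistic inputs (482 binary +
    9 numerical), four tanh hidden layers of 2048 units, linear output.
    The output size (spectral + excitation features with their deltas)
    is an assumed placeholder."""
    def __init__(self, n_in=491, n_hidden=2048, n_out=199):
        super().__init__()
        layers, d = [], n_in
        for _ in range(4):
            layers += [nn.Linear(d, n_hidden), nn.Tanh()]
            d = n_hidden
        layers.append(nn.Linear(d, n_out))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = AcousticDNN()
opt = torch.optim.SGD(model.parameters(), lr=0.01)  # mini-batches of 256
```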
4. Evaluation

In this section, the quality of the proposed method is compared with two state-of-the-art methods, namely STRAIGHT [8] and DPN [24]. STRAIGHT is a vocoding technique which makes use of pitch-adaptive spectral smoothing performed in the time-frequency domain for speech representation, manipulation and reconstruction. In STRAIGHT, besides the fundamental frequency, the excitation is modeled by 5 aperiodicity measurements derived from five spectral sub-bands: (0-1), (1-2), (2-4), (4-6) and (6-8) kHz. During synthesis, a mixed excitation made up of a weighted sum of impulses and white noise is used to excite voiced speech, and white noise is used to excite unvoiced speech. In the DPN approach, the pitch-synchronous residual frames are modeled as a combination of deterministic and noise components. The deterministic component is the region of the residual frame around the GCI, and the remaining portion of the residual frame corresponds to the noise component. The deterministic components are parameterized in terms of amplitude and spectral envelopes. During synthesis, the reconstructed deterministic and noise components are superimposed and then overlap-added to generate the voiced excitation signal. For unvoiced speech, white noise is used as the excitation signal.

4.1. Experimental setup

For the evaluation, we used two speakers (female (SLT) and male (AWB)) from the CMU Arctic database [41] and a US female speaker (Nancy) from the Blizzard Challenge 2011 [42]. The prompts in the Blizzard 2011 corpus are annotated for the speaker to read with target intonation patterns, making the data somewhat expressive. The training sets consist of about 1100 utterances (approx. 62 min), 1100 utterances (approx. 80 min) and 6000 utterances (approx. 7 hrs) for speakers SLT, AWB and Nancy, respectively.

The evaluation is carried out using two subjective listening tests, namely a preference test and a mean opinion score (MOS) test for speaker similarity, and one objective test. In the preference tests, subjects are asked to listen to a pair of synthesized speech utterances and choose the one that is more natural, or rate both as equal. In the MOS tests, subjects are presented with natural speech from the original speaker and are asked to rate the test samples in terms of voice similarity on a 5-point scale ranging from 5 (sounds like the same person) to 1 (sounds completely different). Before the listening tests, the energy of the speech signals is normalized to the same level. Altogether, 25 subjects between the ages of 24 and 32 participated in the listening tests. Of these, fifteen are postgraduates or research scholars with sufficient background knowledge in speech processing, and the rest are research scholars not working in the speech processing area. The listening tests are conducted in a laboratory environment, with the speech signals played through headphones.
The objective evaluation is carried out on vocoded speech files (i.e. analysis/resynthesis without any statistical modelling) using the ITU-T Rec. P.862 Perceptual Evaluation of Speech Quality (PESQ) measure [43]. PESQ evaluates the audible difference between a reference waveform (the natural speech signal) and a test signal (the speech waveform resynthesised with one of the excitation models in the analysis/synthesis framework). For a pair of speech utterances, PESQ yields a single value in the range -1 to 4.5. If the synthesised speech waveform and the corresponding natural speech are perceptually similar, the PESQ value is close to 4.5; a value close to -1 indicates a severe perceptual degradation of the synthesised speech relative to the corresponding natural speech signal.

4.2. Results

4.2.1. Subjective evaluation results

A. HMM-based speech synthesis. In this experiment, the proposed method is evaluated in the context of SPSS using systems based on hidden Markov models (HMMs). We trained HMM systems with the STRAIGHT, DPN and proposed methods for the three voices mentioned above. The test set consists of 50 sentences per speaker that are not seen in the training data. Hence, a total of 450 utterances (3 speakers × 50 sentences × 3 excitation models) are synthesized. In order to reduce the workload on the subjects, 20 sentences from all speakers are randomly selected for each subject and presented to them in each test. Thus, each subject rated a total of 120 stimulus pairs (in the preference tests) and 180 wave files (in the MOS test). The preference and MOS test results are shown in Table 2 and Fig. 7, respectively.

The trends in the results can be analyzed as follows. (1) For all speakers, the proposed method exhibits higher perceptual quality and speaker similarity than STRAIGHT-HTS, with a significance level of p < 0.001.
Table 2: Preference test results (%) between the synthesized speech samples

Method                      Other method   Equivalent   Proposed method
                            preferred                   preferred

SLT speaker
Proposed versus STRAIGHT    21.4           34.1         44.5
Proposed versus DPN         22.6           41.3         36.1

AWB speaker
Proposed versus STRAIGHT    19.3           32.1         48.6
Proposed versus DPN         22.2           40.7         37.1

Nancy speaker
Proposed versus STRAIGHT    19.7           28.5         51.8
Proposed versus DPN         20.2           42.2         37.6

(Bold font for p < 0.001, italic font for p < 0.05)
The subjects reported slight buzziness in the speech synthesized by the STRAIGHT-based system, particularly for the male (AWB) speaker. In STRAIGHT-HTS, mixed excitation parameters are used to model and generate the voiced excitation. The proposed method, in contrast, utilizes natural residual segments for generating the excitation signal. This preserves some of the detailed structure of the real excitation signal, which is very difficult to model. Moreover, the perceptually important information present in successive frames of the excitation signal is retained by incorporating the epoch parameters. Hence, the proposed method generates realistic excitation signals and consequently synthesizes natural sounding speech. (2) The preference scores show that the proposed method provides better quality than DPN-HTS for all speakers, at a significance level of p < 0.05; in terms of speaker similarity, however, there is no significant difference between the two systems. In the DPN source model, the deterministic component (the segment of the residual frame around the GCI), which is perceptually important, is accurately reconstructed for every frame from the PCA coefficients.
Figure 7: Results of MOS on speaker similarity with 95% confidence intervals.
In the proposed method, the region around the GCI of the natural residual frame is modified to incorporate the generated parameters. Hence, in both methods, the characteristics of the excitation signal in the vicinity of the instants of glottal closure are retained. In the DPN source model, however, the noise component (the remaining part of the residual frame) is constructed from white Gaussian noise, which differs from the natural noise signal. This leads to an improper fusion of the deterministic and noise components, and hence results in a slight degradation of the synthesis quality. In the proposed method, on the other hand, the use of natural residual segments particular to every phone preserves some of the phone-dependent characteristics of the residual frame, which cannot be modeled with discrete parameters. Unlike the DPN source model, the proposed method also generates the unvoiced excitation signal in a more meaningful way, by utilizing natural unvoiced residual frames. A subset of the synthesized wave files used in the listening tests is available online at http://cse.iitkgp.ac.in/~kiran.reddy.m/EP_SourceModel/epsm-spss.html.

B. DNN-based speech synthesis. We trained DNN systems with the STRAIGHT, DPN and proposed methods for the two speakers (SLT and AWB) from the CMU Arctic database. The test set consists of 50 sentences per speaker that are not seen in the training data. Hence, a total of 300 utterances (2 speakers × 50 sentences × 3 excitation models) are synthesized.
Table 3: Preference scores (%) between speech samples from the DNN-based systems

Method                      Other method   Equivalent   Proposed method
                            preferred                   preferred

SLT speaker
Proposed versus STRAIGHT    19.6           41.2         39.2
Proposed versus DPN         17.4           50.9         30.7

AWB speaker
Proposed versus STRAIGHT    17.7           38.9         43.4
Proposed versus DPN         21.5           44.3         34.2

(Bold font for p < 0.05)
A preference listening test was then carried out in which the subjects rated the synthesized speech by quality preference. The preference test results are shown in Table 3. The proposed method performed better than STRAIGHT and DPN for both the male and female speakers, at a significance level of 0.05. From informal listening tests, we also observed that all the systems provided better perceptual quality than their HMM-based counterparts. This confirms that combining improved excitation and acoustic modelling results in a significant improvement in the naturalness of synthetic speech. Synthesized speech samples of the DNN-based systems used for the comparison are available online at http://cse.iitkgp.ac.in/~kiran.reddy.m/EP_SourceModel/epsm-spss.html.

4.2.2. Objective evaluation results

To analyze the effectiveness of the vocoders without any influence from the statistical models, natural speech signals are modeled using STRAIGHT, DPN and the proposed method. Using the natural spectrum, F0 and the excitation signals obtained with the three approaches, the speech signals are resynthesized. The natural spectrum corresponds to the smoothed spectral envelope, MGCCs and STRAIGHT-MGCCs in the case of STRAIGHT, DPN and the proposed method, respectively. A total of 25 utterances per speaker are randomly selected for analysis. Fig. 8 shows an example of a natural speech signal and its corresponding excitation signal (Fig. 8(a)), together with the synthesized speech and corresponding excitation signals reconstructed with the various methods (Fig. 8(b)-(d)).
Figure 8: Illustration of speech (left plots) and excitation signals (right plots) synthesized from various vocoding techniques. (a) Natural speech and corresponding excitation signal. Speech and excitation signal synthesized by using (b) STRAIGHT, (c) DPN and (d) proposed method.
Figure 9: PESQ scores obtained for various vocoding techniques.
Table 4: Source features and number of parameters used in the source modelling methods

Method      Features                    Number of parameters
STRAIGHT    Pitch                       1
            Aperiodicity                5
            Total                       6
DPN         Pitch                       1
            Energy                      1
            Harmonic-to-noise ratio     1
            PCA coefficients            20
            Noise spectral envelope     10
            Noise amplitude envelope    15
            Total                       48
Proposed    Pitch                       1
            Energy                      1
            Epoch strength              1
            Epoch sharpness             1
            Source spectral envelope    10
            Total                       14
From the figure, it can be observed that the excitation and speech waveforms constructed with the proposed method, shown in Fig. 8(d), are much closer to the corresponding natural signals shown in Fig. 8(a) than those of the other excitation models. Fig. 9 shows the PESQ values obtained by comparing the speech resynthesized from the various source models with the natural speech signals. For all the speakers, the proposed approach has the highest PESQ value. This measure objectively confirms that the proposed method generates better quality speech than the STRAIGHT and DPN excitation models.

The source features considered in the three source modelling approaches are listed in Table 4. From the table, it can be seen that the DPN-based source model uses a larger number of parameters than the other methods, which significantly increases the computational footprint at run time. The proposed method, on the other hand, uses fewer parameters than the DPN-based method
and requires only a little additional memory for storing the voiced and unvoiced reference segments. For the SLT speaker, each voiced and unvoiced reference waveform is represented using 200 samples (the normalized pitch period) and 400 samples (the frame size), respectively. If 4 bytes are required to represent each sample in floating-point format, then the memory needed to store one voiced reference segment is 200 × 4 bytes = 0.8 KB, and the memory needed to store one unvoiced reference segment is 400 × 4 bytes = 1.6 KB. The total numbers of unique voiced and unvoiced phones present in the considered database are 28 and 12, respectively. Therefore, the total memory required to store the reference segments is 28 × 0.8 KB + 12 × 1.6 KB = 41.6 KB. The memory requirement of the proposed method is influenced only by the number of unique voiced and unvoiced phones in the speech corpus, the frame size, and the number of samples used for length normalization (which depends on the sampling frequency). Hence, the storage required by the proposed method cannot increase substantially. The overall results indicate that the proposed method can generate better quality speech than the existing source models while keeping the memory footprint small, and that the use of a better acoustic model such as a DNN can further enhance its quality.

5. Summary and conclusion

The generation of improper excitation signals is one of the major issues affecting the naturalness of synthesized speech. The excitation signals reconstructed with recent residual-based methods are significantly better than the traditional pulse excitation. However, these approaches have certain limitations: (1) they fail to model and incorporate the time-varying epoch characteristics, or require many parameters, as in the DPN model; and (2) most of them have not attempted to model the unvoiced excitation. As a result, the generated excitation signal is unable to mimic the natural excitation. This paper proposes a new source modelling approach to address these issues. In the proposed method, the excitation signal is represented using features which carry important information related to the perceptual characteristics of speech. During synthesis, the statistically generated excitation parameters are imposed on natural voiced and unvoiced residual segments to synthesize the excitation signals. This approach reproduces the time-varying characteristics of the real excitation signal well, and hence preserves the speech quality. The experimental results
indicated that the speech synthesized with the proposed method is appreciably better than that of the existing excitation modelling approaches.

As the epoch parameters carry significant emotion-specific information, the proposed method can potentially be used for emotional speech synthesis. However, this requires accurate estimation of epochs from emotional speech: errors in epoch extraction lead to improper modelling of the epoch parameters, which significantly degrades the synthesis quality. Although existing epoch extraction methods work well for neutral speech, their performance degrades significantly for emotional speech. Therefore, future work may focus on developing a robust epoch extraction technique for emotional speech, which can be used with the proposed excitation model for the synthesis of high-quality emotional voices. The proposed excitation model can also be extended to reconstruct better excitation signals for voice qualities such as creaky voice.

References

[1] Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., Oura, K., "Speech synthesis based on hidden Markov models," Proceedings of the IEEE, vol. 101, no. 5, pp. 1234-1252, 2013.

[2] Zen, H., Senior, A., Schuster, M., "Statistical parametric speech synthesis using deep neural networks," in Proceedings of ICASSP, Vancouver, Canada, pp. 7962-7966, 2013.

[3] Ling, Z.-H., Kang, S.-Y., Zen, H., Senior, A., Schuster, M., Qian, X.-J., Meng, H., Deng, L., "Deep learning for acoustic modelling in parametric speech generation: A systematic review of existing techniques and future trends," IEEE Signal Process. Mag., vol. 32, no. 3, pp. 35-52, May 2015.

[4] Shen, J., et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," Proc. ICASSP, Alberta, Canada, pp. 4779-4783, 2018.

[5] Tamamori, A., et al., "Speaker-dependent WaveNet vocoder," Proc. Interspeech, Stockholm, Sweden, pp. 1118-1122, 2017.

[6] Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T., "Mixed excitation for HMM-based speech synthesis," Proceedings of Eurospeech, Aalborg, Denmark, pp. 2259-2262, 2001.
[7] Kawahara, H., Masuda-Katsuse, I., de Cheveigné, A., "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187-207, 1999.

[8] Zen, H., Toda, T., Nakamura, M., Tokuda, K., "Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005," IEICE Transactions on Information and Systems, vol. 90, no. 1, pp. 325-333, 2007.

[9] Zen, H., Toda, T., Tokuda, K., "The Nitech-NAIST HMM-based speech synthesis system for the Blizzard Challenge 2006," IEICE Trans. Inform. Syst., vol. E91-D, no. 6, pp. 1764-1773, 2008.

[10] Maia, R., Toda, T., Zen, H., Nankaku, Y., Tokuda, K., "An excitation model for HMM-based speech synthesis based on residual modelling," 6th ISCA Speech Synthesis Workshop (SSW6), Bonn, Germany, 2007.

[11] Kim, S.J., Hahn, M., "Two-band excitation for HMM-based speech synthesis," IEICE Trans. Inf. Syst., vol. E90-D, 2007.

[12] Cabral, J.P., Renals, S., Yamagishi, J., Richmond, K., "HMM-based speech synthesiser using the LF-model of the glottal source," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, pp. 4704-4707, 2011.

[13] Raitio, T., Suni, A., Yamagishi, J., Pulakka, H., Nurminen, J., Vainio, M., Alku, P., "HMM-based speech synthesis utilizing glottal inverse filtering," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 153-165, 2011.

[14] Raitio, T., Suni, A., Pulakka, H., Vainio, M., Alku, P., "Utilizing glottal source pulse library for generating improved excitation signal for HMM-based speech synthesis," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., pp. 4564-4567, 2011.

[15] Raitio, T., et al., "Voice source modelling using deep neural networks for statistical parametric speech synthesis," Proc. EUSIPCO, Lisbon, pp. 2290-2294, 2014.
[16] Hwang, M., et al., "A unified framework for the generation of glottal signals in deep learning-based parametric speech synthesis systems," Proc. Interspeech, Hyderabad, India, pp. 912-916, 2018.

[17] Cui, Y., et al., "A new glottal neural vocoder for speech synthesis," Proc. Interspeech, Hyderabad, India, pp. 2017-2021, 2018.

[18] Airaksinen, M., et al., "A comparison between STRAIGHT, glottal, and sinusoidal vocoding in statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1658-1670, 2018.

[19] Drugman, T., Moinet, A., Dutoit, T., Wilfart, G., "Using a pitch-synchronous residual codebook for hybrid HMM/frame selection speech synthesis," in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, pp. 3793-3796, 2009.

[20] Csapó, T.G., Németh, G., "Statistical parametric speech synthesis with a novel codebook-based excitation model," Intelligent Decision Technologies, vol. 8, no. 4, pp. 289-299, 2014.

[21] Wen, Z., Tao, J., Pan, S., Wang, Y., "Pitch-scaled spectrum based excitation model for HMM-based speech synthesis," J. Signal Process. Syst., vol. 74, no. 3, pp. 423-435, 2013.

[22] Cabral, J.P., "Uniform concatenative excitation model for synthesising speech without voiced/unvoiced classification," in Proceedings of INTERSPEECH, Lyon, France, pp. 1082-1086, 2013.

[23] Drugman, T., Dutoit, T., "The deterministic plus stochastic model of the residual signal and its applications," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 968-981, 2012.

[24] Narendra, N.P., Rao, K.S., "Parameterization of excitation signal for improving the quality of HMM-based speech synthesis system," Circuits Syst. Signal Process., vol. 36, no. 9, pp. 3650-3673, 2017.

[25] Adiga, N., Prasanna, S.M., "Significance of instants of significant excitation for source modelling," in Proc. Interspeech, Lyon, France, pp. 1677-1681, 2013.
[26] Wakita, H., "Residual energy of linear prediction applied to vowel and speaker recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 3, pp. 270-271, 1976.

[27] Thati, S.A., Bollepalli, B., Bhaskararao, P., Yegnanarayana, B., "Analysis of breathy voice based on excitation characteristics of speech production," in Proc. International Conference on Signal Processing and Communications (SPCOM), IISc Bangalore, Karnataka, pp. 1-5, 2012.

[28] Seshadri, G., Yegnanarayana, B., "Perceived loudness of speech based on the characteristics of glottal excitation source," The Journal of the Acoustical Society of America, vol. 126, no. 4, pp. 2061-2071, 2009.

[29] Koolagudi, S.G., Devliyal, S., Chawla, B., Barthwal, A., Rao, K.S., "Recognition of emotions from speech using excitation source features," Procedia Engineering, vol. 38, pp. 3409-3417, 2012.

[30] Kadiri, S.R., Gangamohan, P., Gangashetty, S.V., Yegnanarayana, B., "Analysis of excitation source features of speech for emotion recognition," in Proc. Interspeech, Dresden, Germany, pp. 1324-1328, 2015.

[31] Haque, A., Rao, K.S., "Modification of energy spectra, epoch parameters and prosody for emotion conversion," International Journal of Speech Technology, vol. 20, no. 1, pp. 15-25, 2017.

[32] Reddy, M.K., Rao, K.S., "Inverse filter based excitation model for HMM-based speech synthesis system," IET Signal Process., vol. 12, no. 4, pp. 544-548, 2018.

[33] Koishida, K., Hirabayashi, G., Tokuda, K., Kobayashi, T., "A 16 kbit/s wideband CELP-based speech coder using mel-generalized cepstral analysis," IEICE Trans. Inf. and Syst., vol. E83-D, no. 4, pp. 876-883, 2000.

[34] Murty, K.S.R., Yegnanarayana, B., "Epoch extraction from speech signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1602-1613, 2008.

[35] de Krom, G., "A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals," J. Speech Hear. Res., vol. 36, no. 2, pp. 254-266, Apr. 1993.
[36] Reddy, M.K., Rao, K.S., "Robust pitch extraction method for the HMM-based speech synthesis system," IEEE Signal Processing Letters, vol. 24, no. 8, pp. 1133-1137, 2017.

[37] Toda, T., Tokuda, K., "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis," IEICE Trans. Inform. Systems, vol. E90-D, no. 5, pp. 816-824, 2007.

[38] HMM-based Speech Synthesis System (HTS). [Online]. Available: http://hts.sp.nitech.ac.jp/

[39] Young, S.J., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P., "The Hidden Markov Model Toolkit (HTK) version 3.4," 2006. [Online]. Available: http://htk.eng.cam.ac.uk/

[40] Shinoda, K., Watanabe, T., "MDL-based context-dependent subword modelling for speech recognition," Acoustical Science and Technology, vol. 21, no. 2, pp. 79-86, 2001.

[41] CMU Arctic Speech Synthesis Databases. [Online]. Available: http://festvox.org/cmu_arctic/

[42] King, S., Karaiskos, V., "The Blizzard Challenge 2011," in Proc. Blizzard Challenge 2011 Workshop, 2011.

[43] Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, ITU-T Recommendation P.862, 2000.