Computer Speech & Language 60 (2020) 101025
A continuous vocoder for statistical parametric speech synthesis and its evaluation using an audio-visual phonetically annotated Arabic corpus

Mohammed Salah Al-Radhi (a,*), Tamás Gábor Csapó (a,d), Sherif Abdou (c), Omnia Abdo (b), Géza Németh (a), Mervat Fashal (b)
a Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
b Phonetics and Linguistics Department, Alexandria University, Egypt
c Faculty of Computers and Information, Cairo University, Egypt
d MTA-ELTE Lendület Lingual Articulation Research Group, Budapest, Hungary
Article History: Received 12 May 2018; Revised 31 May 2019; Accepted 26 September 2019; Available online 29 September 2019

Keywords: Speech synthesis; Continuous vocoder; Envelope; Arabic
Abstract
In this paper, we present an extension of a novel continuous residual-based vocoder for statistical parametric speech synthesis by addressing two objectives. First, because the noise component is often not accurately modelled in modern vocoders (e.g. STRAIGHT), a new technique for modelling unvoiced sounds is proposed: a time-domain envelope is added to the unvoiced segments to avoid any residual buzziness. Four time-domain envelopes (Amplitude, Hilbert, Triangular, and True) are investigated, enhanced, and then applied to the noise component of the excitation in our continuous vocoder, in which all parameters are continuous. Second, with the future aim of producing high-quality Arabic speech synthesis, we apply this vocoder to a Modern Standard Arabic audio-visual corpus which is annotated both phonetically and visually and is dedicated to emotional speech processing studies. In an objective experiment, we investigated the Phase Distortion Deviation, whereas a MUSHRA-type subjective listening test was conducted comparing natural and vocoded speech samples. Both experiments show that the proposed noise modelling yields satisfactory results in terms of naturalness and intelligibility, outperforming STRAIGHT and earlier residual-based approaches. © 2019 Elsevier Ltd. All rights reserved.
This paper is a revised and expanded version of the work presented at the Interspeech 2017 conference (Al-Radhi et al., 2017a).

1. Introduction

State-of-the-art speech synthesis can be described as the artificial generation of human speech (Suendermann et al., 2010). In particular, statistical parametric speech synthesis (SPSS) has been an important research field in recent years due to the development of Hidden Markov Model (HMM) (Zen et al., 2009) and deep neural network (Zen et al., 2013) based approaches. Such a statistical framework relies on a vocoder (also called a speech analysis/synthesis system) to reproduce human speech. A vocoder is the most important component of various speech synthesis applications such as text-to-speech synthesis (Dutoit, 1997), voice conversion (Toda et al., 2007), and singing synthesizers (Kenmochi, 2012). Although there are several different types of vocoders that use analysis/synthesis, they follow the same main strategy. The analysis stage is used to convert the
https://doi.org/10.1016/j.csl.2019.101025
speech waveform into a set of parameters which separately represent the vocal-fold excitation signal (whether the sound is voiced or unvoiced, i.e. the vocal-fold movements) and the vocal-tract filter transfer function that filters the excitation signal, whereas in the synthesis stage, all of these parameters are used to reconstruct the original speech signal.

Since the design of a vocoder-based SPSS system depends on speech characteristics, preserving voice quality through the analysis/synthesis phase is the main problem of the vocoder. Hu et al. (2013) presented an experimental comparison of a wide range of important vocoder types. Although most of these vocoders have been successful in synthesizing speech, they are not always successful in synthesizing high-quality speech. Therefore, in this study we aim to develop a vocoder-based high-quality speech synthesis system whose approach remains computationally efficient.

Within the vocoder, accurate modeling of the fundamental frequency (also referred to as pitch or F0) plays a crucial role during the analysis/synthesis process, because F0 values are continuous in voiced regions and discontinuous in unvoiced regions, which makes F0 complicated to model accurately. For modeling discontinuous F0, the Multi-Space Probability Distribution based HMM (MSD-HMM) was proposed and is generally accepted (Tokuda et al., 2002). However, because of the discontinuities at the boundary between voiced and unvoiced regions, the MSD-HMM is not optimal (Yu and Young, 2011). To solve this, among others, Yu and Young (2011) and Yu et al. (2010) proposed a continuous F0 model, showing that continuous F0 observations can similarly appear in unvoiced regions. It has also been shown recently that continuous modeling can be more effective in achieving natural synthesized speech (Garner et al., 2013). Another vocoder parameter is the Maximum Voiced Frequency (MVF), which was recently proposed and shown to result in a major improvement in the quality of synthesized speech (Drugman and Stylianou, 2014). During the synthesis of various sounds, the MVF parameter can be used as a boundary frequency to separate the voiced and unvoiced components.

The widely used STRAIGHT vocoder (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) (Kawahara et al., 1999) was proposed as an effective framework to achieve high-quality speech synthesis. STRAIGHT decomposes the voice signal into three parameters: F0, extracted using the instantaneous frequency of the fundamental component of the speech signal; band aperiodicity, which represents the ratio between periodic and aperiodic components; and the spectrogram, extracted using F0-adaptive spectral smoothing. In synthesis, STRAIGHT uses mixed excitation, in which impulse and noise excitations are mixed according to the band aperiodicity parameters in voiced speech. For real-time processing, however, STRAIGHT is computationally expensive. As the noise component is not accurately modeled in the STRAIGHT vocoder, Degottex and colleagues aimed to improve this and presented a novel noise model which is of slightly worse quality than STRAIGHT but much simpler (Degottex et al., 2018). Espic et al. (2017) observed that in natural voiced speech, the time-domain amplitude envelope of aperiodic components (above the MVF) is pitch-synchronous, with energy concentrated around epoch locations.
In our earlier work, we proposed a computationally feasible residual-based vocoder (Csapó et al., 2015) using continuous F0 (Garner et al., 2013) and MVF (Drugman and Stylianou, 2014). In this method, the voiced excitation, consisting of pitch-synchronous PCA (Principal Component Analysis) residual frames, is low-pass filtered, while the unvoiced part is high-pass filtered, using the MVF contour as the cutoff frequency. The approach was especially successful for modeling speech sounds with mixed excitation. In Csapó et al. (2016), we removed the post-processing step in the estimation of the MVF parameter and thus improved the modeling of unvoiced sounds within the continuous vocoder.

To reconstruct the time-domain characteristics of voiced segments, a time-domain envelope is often applied, which was shown to be related to speech intelligibility (Drullman, 1995). There are various methods to obtain a reliable representation of such envelopes. In an early attempt (Schloss, 1985), the amplitude envelope is shaped by obtaining peaks of the signal in a window that runs over the data. In Stylianou (2001), a pitch-synchronous triangular envelope is proposed. In Pantazis and Stylianou (2008), Hilbert and energy envelopes are introduced. In Robel et al. (2007), an iterative technique is used to estimate the true envelope. The Frequency Domain Linear Prediction (FDLP) envelope is presented in Ellis and Athineos (2003). In vocoding, such envelopes are often used to enhance the source model (e.g. Maia et al. (2011), Drugman and Dutoit (2012), Cabral and Berndsen (2013)).

Therefore, this article addresses the above issues by suggesting a simple method for advanced modeling of the noise excitation, which can yield an accurate noise component of the excitation. We expect that by adding such envelope-modulated noise to the voiced and unvoiced components, accounting for the presence of noise in voiced frames, the quality of synthesized speech in the noisy time regions will be more accurate than the baseline.

The rest of this paper is structured as follows: Section 2 gives details of the continuous vocoder used as a baseline. Section 3 describes the novel methods we used for speech synthesis. Experimental conditions are then described in Section 4. Evaluation and discussion are presented in Section 5. Finally, Section 6 concludes the contributions of this paper.

2. Continuous vocoder: baseline

The baseline system is our earlier continuous vocoder (Csapó et al., 2016). During the analysis phase, F0 is calculated on the input waveforms by the open-source¹ implementation of a simple continuous F0 tracker (Garner et al., 2013). In regions of creaky voice, and in the case of unvoiced sounds or silences, this F0 tracker interpolates F0 based on a linear dynamic system and Kalman smoothing. After this step, the MVF is calculated from the speech signal using the MVF_Toolkit,² resulting in the MVF parameter (Drugman and Stylianou, 2014). In the next step, 24-order Mel-Generalized Cepstral analysis (MGC) (Tokuda et al., 1994) is
¹ https://github.com/idiap/ssp
² http://tcts.fpms.ac.be/~drugman/files/MVF.zip
performed on the speech signal with α = 0.42 and γ = −1/3. In all steps, a 5 ms frame shift is used. The results are the F0, MVF, and MGC parameter streams. In addition, a Glottal Closure Instant (GCI) detection algorithm (Drugman et al., 2012) is used to find the glottal period boundaries of individual cycles in the voiced parts of the inverse-filtered residual signal. From these F0 cycles, a PCA residual is built, which will be used in the synthesis phase (see Fig. 1).

During the synthesis phase, the voiced excitation is composed of PCA residuals, overlap-added pitch-synchronously depending on the continuous F0. This voiced excitation is then low-pass filtered frame by frame at the frequency given by the MVF parameter. At frequencies higher than the actual value of the MVF, white noise is used. The voiced and unvoiced excitations are added together. Finally, a Mel-Generalized Log Spectrum Approximation (MGLSA) filter is used to synthesize speech from the excitation and the MGC parameter stream (Imai et al., 1983). A sketch of this excitation mixing is given after Fig. 1.
Fig. 1. Workflow of the proposed method.
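To make the synthesis phase concrete, the following is a minimal sketch of the mixed-excitation construction described above, under stated assumptions: the Butterworth design, the filter order, and the Gaussian white-noise generator are our illustrative choices, not necessarily those of the actual implementation.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def mixed_excitation_frame(voiced_frame, mvf_hz, fs=16000, order=6):
    """One frame of mixed excitation: the voiced part (overlap-added PCA
    residual) is low-pass filtered at the MVF, white noise is high-pass
    filtered at the same cutoff, and the two are summed (Section 2)."""
    wn = np.clip(mvf_hz / (fs / 2.0), 1e-3, 0.999)   # normalized cutoff
    lowpass = butter(order, wn, btype='low', output='sos')
    highpass = butter(order, wn, btype='high', output='sos')
    voiced = sosfilt(lowpass, voiced_frame)
    unvoiced = sosfilt(highpass, np.random.randn(len(voiced_frame)))
    return voiced + unvoiced

# Hypothetical usage with a 25 ms frame at 16 kHz and an MVF of 4 kHz:
excitation = mixed_excitation_frame(np.random.randn(400), mvf_hz=4000.0)
```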
3. Parameterizing the noise components

It has been shown that in natural speech, the high-frequency noise component is time-aligned with the F0 periods (Stylianou, 2001). In the baseline system, this voiced-synchronous behaviour is missing at higher frequencies. The proposed model therefore differs from the baseline in the synthesis phase: we test various time envelopes to shape the high-frequency component (above the MVF) of the excitation, estimating the envelope of the PCA residual and modulating the noise component with it to make the result more similar to the residual of natural speech. The framework is composed of three parts: analysis, statistical modeling, and synthesis. In this paper, we deal only with the analysis and synthesis phases; the statistical modeling is investigated in Al-Radhi et al. (2017b). Our proposed framework is presented in Fig. 1, and the aim of this section is to show that the time-envelope estimation techniques lead to better results.

3.1. Amplitude envelope

The amplitude envelope refers here to the shape of the sound energy over time. It is usually calculated by filtering the absolute value of the voiced frame v(n) with a moving-average filter of order 2N+1 (Pantazis and Stylianou, 2008), where N is chosen to be 10. The amplitude envelope is given by

$$A(n) = \frac{1}{2N+1}\sum_{k=-N}^{N} |v(n-k)| \qquad (1)$$
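As an illustration of Eq. (1), a minimal sketch of the moving-average amplitude envelope follows; the input `v` is a hypothetical voiced frame, with N = 10 as stated above.

```python
import numpy as np

def amplitude_envelope(v, N=10):
    """Amplitude envelope per Eq. (1): a moving average of |v(n)|
    computed with a filter of length 2N+1."""
    kernel = np.ones(2 * N + 1) / (2 * N + 1)
    return np.convolve(np.abs(v), kernel, mode='same')
```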
Previous work showed that down-sampling the amplitude envelope to a smaller number of samples reduces the relative time squared error (Narendra and Sreenivasa, 2015) when parameterizing the noise components. Fig. 4b shows the effect of applying the amplitude envelope to the PCA residual signal.

3.2. Hilbert envelope

Another method of calculating an envelope is based on the Hilbert transform (Yan and Gao, 2009; Moore, 2014), which was first used to obtain an analytic signal in the complex-valued time domain. Here, the analytic signal $v_a(n)$ is a complex function of time derived from a real voiced frame v(n), and can be written as

$$v_a(n) \triangleq v(n) + j\,\mathcal{H}\{v(n)\} \qquad (2)$$

where $j$ is the imaginary unit $\sqrt{-1}$ and $\mathcal{H}\{\cdot\}$ denotes the Hilbert transform, which is equivalent to the integral form (Potamianos and Maragos, 1994)

$$\mathcal{H}\{v(n)\} = \frac{1}{\pi}\int_{-\infty}^{+\infty} \frac{v(t)}{n-t}\,dt = v(n) * \frac{1}{\pi n} \qquad (3)$$
where $*$ denotes convolution. Thus, the Hilbert envelope H(n) can be estimated by taking the magnitude of the analytic signal, capturing the slowly varying features of the sound signal (see Fig. 4c):

$$H(n) = |v_a(n)| = \sqrt{v(n)^2 + \mathcal{H}\{v(n)\}^2} \qquad (4)$$
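Eqs. (2)-(4) amount to taking the magnitude of the analytic signal; a minimal sketch follows, where the use of scipy.signal.hilbert is our choice of implementation rather than a detail confirmed by the text.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(v):
    """Hilbert envelope per Eqs. (2)-(4): |v_a(n)|, the magnitude of
    the analytic signal v_a(n) = v(n) + j*H{v(n)}."""
    return np.abs(hilbert(v))
```

In the proposed vocoder, such an envelope would then modulate the high-pass filtered noise component frame by frame.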
3.3. Triangular envelope

A further time-domain parametric envelope that can easily be applied to each signal frame is the triangular envelope. It was proposed in Stylianou et al. (1995) using four parameters, as it assumes the triangle to be symmetric; Cabral and Berndsen (2013) used a polynomial curve to detect these parameters. In this work, our approach for estimating the triangular envelope T(n), which differs slightly from Stylianou et al. (1995), uses only three parameters (a, b, and c) obtained by detecting them directly on the envelope. Here, the design parameters are set as $a = 0.35 L_f$, $b = 0.65 L_f$, $c = \frac{a+b}{2}$, and $A = 1$, where $L_f$ is the frame length. These parameters are illustrated in Fig. 2, and the performance of the triangular envelope can be observed in Fig. 4d; a minimal sketch is given after the caption of Fig. 2.
Fig. 2. Triangular time-domain envelope estimation.
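A minimal sketch of the three-parameter triangular envelope of Section 3.3 follows; the zero value outside the interval [a, b] is our assumption, as the text does not state the baseline level.

```python
import numpy as np

def triangular_envelope(frame_len, A=1.0):
    """Triangular envelope T(n) from the three parameters of Section 3.3:
    a = 0.35*Lf, b = 0.65*Lf, c = (a+b)/2, with peak amplitude A = 1."""
    a, b = 0.35 * frame_len, 0.65 * frame_len
    c = 0.5 * (a + b)
    n = np.arange(frame_len)
    env = np.zeros(frame_len)
    rising = (n >= a) & (n <= c)       # linear rise from a to the peak at c
    falling = (n > c) & (n <= b)       # linear fall from c down to b
    env[rising] = A * (n[rising] - a) / (c - a)
    env[falling] = A * (b - n[falling]) / (b - c)
    return env
```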
3.4. True envelope

Another approach that can be used for estimating the time-domain envelope is the true envelope (TE). It is based on cepstral smoothing of the amplitude spectrum (Robel and Rodet, 2005; Villavicencio et al., 2006). In an iterative procedure, the TE algorithm starts by estimating the cepstrum and updates it with the maximum of the original spectrum and the current cepstral representation. For an efficient real-time implementation, Galas and Rodet (1990) proposed the concept of a discrete cepstrum based on a least mean square approximation, and Cappé and Moulines (1996) added a regularization technique that improves the smoothness of the envelope. In this study, the procedure for estimating the TE is shown in Fig. 3, in which the cepstrum c(n) is calculated as the inverse Fourier transform of the log-magnitude spectrum S(k) of a signal frame v(n):

$$c(n) = \sum_{k=0}^{N-1} S(k)\, e^{j\left(\frac{2\pi}{N}\right)kn} \qquad (5)$$
$$S(k) = \log|V(k)| \qquad (6)$$
where V(k) is the N-point discrete Fourier transform of v(n):

$$V(k) = \sum_{n=0}^{N-1} v(n)\, e^{-j\left(\frac{2\pi}{N}\right)nk} \qquad (7)$$
Next, the algorithm iteratively updates M(k) with the maximum of S(k) and the Fourier transform C(k) of the smoothed cepstrum, i.e. the cepstral representation of the spectral envelope at iteration i:

$$C(k) = \sum_{n=0}^{N-1} c(n)\, e^{-j\left(\frac{2\pi}{N}\right)nk} \qquad (8)$$

$$M_i(k) = \max\left(S_{i-1}(k),\, C_{i-1}(k)\right) \qquad (9)$$
It can be noted that the TE with a weighting factor $w_f$ yields a unique time envelope that brings the convergence closer to natural speech. In practice, the value of $w_f$ found to be most successful is 10. Thus, the TE envelope T(n) is obtained here as

$$T(n) = \sum_{k=0}^{N-1} w_f\, M(k)\, e^{j\left(\frac{2\pi}{N}\right)kn} \qquad (10)$$
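A minimal sketch of the iterative TE estimation follows, reading Eqs. (5)-(10) literally; the cepstral order and the use of a fixed iteration count in place of a convergence test are our assumptions.

```python
import numpy as np

def true_envelope(v, cep_order=40, n_iter=20, wf=10.0):
    """Iterative true-envelope estimation (Eqs. (5)-(10)): the log-magnitude
    spectrum is repeatedly replaced by its pointwise maximum with the current
    cepstrally smoothed envelope, then re-smoothed."""
    N = len(v)
    S = np.log(np.abs(np.fft.fft(v)) + 1e-12)   # Eqs. (6)-(7): log |V(k)|
    M = S.copy()
    for _ in range(n_iter):
        c = np.fft.ifft(M).real                 # Eq. (5): cepstrum
        c[cep_order:N - cep_order] = 0.0        # keep low-quefrency part only
        C = np.fft.fft(c).real                  # Eq. (8): smoothed envelope
        M = np.maximum(S, C)                    # Eq. (9): max update
    # Eq. (10): back to the time domain with weighting factor wf
    # (numpy's ifft includes a 1/N factor, hence the multiplication by N)
    return wf * N * np.fft.ifft(M).real
```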
Despite its good performance, the TE oscillates whenever S(k) changes rapidly. This can be seen in Fig. 4e.

4. Experimental conditions

Since the evaluation results depend on the speech database, two speech databases designed for speech synthesis research were selected to measure the performance of the obtained model: an English corpus, to determine the effectiveness of the proposed method, and an Arabic corpus, to ensure a sound evaluation in a different language and to provide a basis for future research towards high-quality Arabic speech synthesis.
Fig. 3. Procedure for estimating the true envelope.
Fig. 4. Illustration of the performance of the time envelopes. “unvoiced_frame” is the excitation signal consisting of white noise, whereas “resid_pca” is the result of applying PCA to the voiced excitation frames.
4.1. English database

Two English speakers were chosen from the CMU-ARCTIC database (Kominek and Black, 2003), denoted AWB (Scottish English, male) and SLT (American English, female), whose sets consist of 1138 and 1132 sentences, respectively. The waveform sampling rate of the database is 16 kHz. In the vocoding experiments, 100 sentences from each speaker were analyzed and synthesized with the baseline, proposed, and STRAIGHT vocoders.

4.2. Arabic database

Our database (Abdo et al., 2017) was motivated by the fact that it constitutes the first Modern Standard Arabic audio-visual expressive corpus annotated both visually and phonetically. It contains 500 sentences covering six emotions (Happiness, Sadness, Fear, Anger, Inquiry, Neutral), recorded by a native Arabic male speaker, denoted ARB. The waveform sampling rate of the database is 48 kHz. In the vocoding experiments, 40 sentences (resampled to 16 kHz) per emotion were analyzed and synthesized with the baseline, proposed, and STRAIGHT vocoders.

5. Evaluation and discussion

In order to achieve our goals and to verify the effectiveness of the proposed methods, objective and subjective evaluations were carried out for each database.

5.1. Objective evaluations

5.1.1. Phase distortion deviation

Recent progress in the speech synthesis field showed that the phase distortion of the signal carries all of the crucial information relevant to the shape of glottal pulses (Degottex and Erro, 2014). As the noise component in our continuous vocoder is parameterized in terms of time envelopes and computed for every pitch-synchronous residual frame, we compared the natural and vocoded sentences by measuring the phase distortion deviation (PDD). Originally, PDD was calculated based on Fisher's standard deviation for circular data (Fisher, 1995). However, Degottex and Erro (2014) showed two issues related to variance and source shape in voiced segments. Avoiding these limitations, PDD is estimated in this experiment at a 5 ms frame shift by³

$$\mathrm{PDD} = \sigma_i(f) = \sqrt{-2\log\left|\frac{1}{N}\sum_{n \in C} e^{j\left(PD_n(f)-\mu_i(f)\right)}\right|} \qquad (11)$$

$$\mu_i(f) = \angle\left(\frac{1}{N}\sum_{n \in C} e^{j\,PD_n(f)}\right) \qquad (12)$$

where $C = \left\{i-\frac{N-1}{2}, \ldots, i+\frac{N-1}{2}\right\}$, N is the number of frames, PD is the phase difference between two consecutive frequency components, and $\angle$ denotes the phase. As we wanted to quantify the noisiness in the higher frequency bands only, we zeroed out the PDD values below the MVF contour.

PDD samples of one natural and seven vocoded utterances are shown in Fig. 5 as an example. Significant differences between the vocoded samples of the different envelope types can be noted. As can be seen, the baseline vocoded sample has too strong a noise component compared to the natural sample (e.g. see the colors between 1 and 1.7 s in the English sentence and between 1.4 and 2 s in the Arabic sentence). On the other hand, the proposed systems with envelopes and STRAIGHT have PDD values (i.e., colors in the figure) closer to those of natural speech.

Fig. 6 shows the mean PDD values of the three speakers grouped by the seven variants. As can be seen, the PDD values of the baseline system are significantly higher than those of natural speech. The various envelopes result in different PDD values, but in general they are closer to natural speech than the baseline.
In particular, the Hilbert envelope outperformed STRAIGHT (though not significantly) for the English and Arabic male speakers, whereas both the True envelope and STRAIGHT were found to be the best for the female speaker. We also quantified the distribution of the PDD measure across all of the natural and vocoded variants of several sentences, and conducted Mann-Whitney-Wilcoxon rank sum tests to check statistical significance. The systems with the ‘Hilbert’ and ‘True’ envelopes are not significantly different from natural speech. In general, the ‘Amplitude’ envelope system results in PDD values that are too low, meaning that its noisiness is too low compared to natural speech; otherwise, the ‘Hilbert’ and ‘True’ envelopes are in general closer to natural speech.
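As an illustration of Eqs. (11) and (12), the following sketch computes PDD from a precomputed frame-by-bin matrix of phase differences; the layout of `PD` is our assumption, and a complete implementation (such as the COVAREP toolbox, see footnote 3) also performs the pitch-synchronous phase analysis omitted here.

```python
import numpy as np

def pdd(PD, N=5):
    """Phase Distortion Deviation per Eqs. (11)-(12). PD has shape
    (n_frames, n_bins); the deviation at frame i is computed over a
    window C of N frames centred on i."""
    n_frames, n_bins = PD.shape
    half = (N - 1) // 2
    out = np.zeros_like(PD)
    for i in range(half, n_frames - half):
        z = np.exp(1j * PD[i - half:i + half + 1])        # e^{j PD_n(f)}
        mu = np.angle(z.mean(axis=0))                      # Eq. (12)
        r = np.abs((z * np.exp(-1j * mu)).mean(axis=0))    # |mean e^{j(PD - mu)}|
        out[i] = np.sqrt(-2.0 * np.log(np.maximum(r, 1e-12)))  # Eq. (11)
    return out
```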
³ http://covarep.github.io/covarep/
Fig. 5. Phase Distortion Deviation of natural and vocoded speech samples above the Maximum Voiced Frequency region. The top row shows the spectrograms of the natural utterances. English sentence: “He made sure that the magazine was loaded, and resumed his paddling.” (speaker AWB); Arabic sentence translated as “I did not know what it was as I did not care about it”, with Latin transcription “knt la ahlm mahowo kma knt la ahtm bh”. The warmer the color, the larger the PDD value and the noisier the corresponding time-frequency region.
5.1.2. Performance comparison

A range of acoustic objective measures is considered to evaluate the quality of speech synthesized with the proposed method. We adopt the frequency-weighted segmental SNR (fwSNRseg) as an error criterion, since it is reported to correlate much better with subjective speech quality than classical SNR (Ma et al., 2009). Jensen and Taal (2016) introduced the Extended Short-Time Objective Intelligibility (ESTOI) measure, which calculates the correlation between the temporal envelopes of clean and processed speech. Another objective measure is the Weighted-Slope Spectral distance (WSS) (Klatt, 1982), which computes the weighted difference between the spectral slopes in each frequency band; the spectral slope is obtained as the difference between adjacent spectral magnitudes in decibels. The final objective measure used here is the Normalized Covariance Metric (NCM) (Ma et al., 2009), which is based on the covariance between the time envelopes of the clean and processed speech. All objective measures are calculated frame by frame, and a higher value indicates better performance, except for the WSS measure (lower is better). The results were averaged over the selected utterances (20 sentences) for each speaker.

Table 1 displays the results of the evaluation of the four methods in comparison with the TANDEM-STRAIGHT vocoder (Kawahara and Morise, 2011), a high-quality vocoder widely regarded as the state of the art in SPSS. As Table 1 shows, the proposed methods significantly outperform the baseline approach on all metrics, suggesting the superiority of these techniques. In particular, the NCM and ESTOI measures show that the proposed vocoder with the Hilbert and True envelopes is closest to STRAIGHT for all speakers. Hence, we can conclude that the time-envelope based approaches are beneficial for modeling the noise component. It should be pointed out, however, that better objective scores do not guarantee a better model, as synthetic speech quality is inherently perceptual.

5.2. Subjective evaluations

As a subjective evaluation, the goal was to assess the closeness between the vocoded and original speech signals. In order to evaluate which proposed vocoder variant is closer to natural speech, we conducted a web-based MUSHRA (MUlti-Stimulus test with Hidden Reference and Anchor) listening test (ITU-R Recommendation, 2001). The advantage of MUSHRA is that it enables evaluation of multiple samples in a single trial without breaking the task into many pairwise comparisons.
Fig. 6. Mean PDD values by sentence type.

Table 1
Average performance scores of the synthesized speech signal per speaker. An asterisk marks the best performance among the proposed vocoder variants.

Method       fwSNRseg                 NCM                      ESTOI                    WSS
             AWB     SLT      ARB     AWB    SLT    ARB        AWB    SLT    ARB        AWB     SLT     ARB
Baseline     6.971   8.020    9.288   0.642  0.665  0.682      0.532  0.665  0.599      54.162  58.449  43.359
Amplitude    9.638   10.820   12.732  0.865  0.867  0.864      0.785  0.868  0.847      37.158* 40.168  26.977
Hilbert      9.665   10.949*  12.748  0.872* 0.883* 0.875      0.785  0.872* 0.853*     37.362  40.401  26.932*
Triangular   9.566   10.775   12.664  0.863  0.863  0.857      0.782  0.865  0.845      37.524  40.150* 27.051
True         9.693*  10.919   12.770* 0.871  0.880  0.876*     0.786* 0.871  0.851      37.404  40.488  27.066
STRAIGHT     12.209  15.427   14.248  0.990  0.981  0.978      0.796  0.943  0.880      35.586  21.719  24.589
The listeners had to rate the naturalness of each stimulus relative to the reference (the natural sentence), from 0 (highly unnatural) to 100 (highly natural). The utterances were presented in a randomized order (different for each participant). Our aim was to measure how the ratio of the voiced and unvoiced components is perceived; therefore, we compared natural sentences with the sentences synthesized by the baseline, proposed, and STRAIGHT systems, plus a hidden anchor (a vocoder with simple pulse-noise excitation). In this section, we present the methods and results of two perceptual listening tests.

5.2.1. Listening test #1: English corpus

A total of 19 participants between the ages of 21 and 39 (mean age: 30 years), mostly with an engineering background, were asked to complete the online listening test; eight were male and eleven female. All of them were
Fig. 7. Results of subjective evaluation #1 (English samples) for the naturalness question. Higher values mean higher naturalness. Error bars show the bootstrapped 95% confidence intervals.
non-native English speakers and none of them reported any hearing loss. On average, the test took 14 min to complete. The listening test samples are available online.⁴

The MUSHRA scores of the listening test are presented in Fig. 7. We can observe that all of the proposed systems significantly outperformed the baseline (Mann-Whitney-Wilcoxon rank sum test). For the male speaker (see Fig. 7a), the Amplitude, Hilbert, and True variants reached the highest naturalness scores among the proposed versions. Moreover, a significant improvement in sound quality was noted for the STRAIGHT vocoder over the proposed systems. Nevertheless, the proposed systems received significantly higher ratings than STRAIGHT for the female voice (see Fig. 7b). Overall, our model contributes notably to the synthetic quality of the proposed vocoder compared with the other systems (see Fig. 7c). We therefore conclude that the average scores achieved by the proposed vocoder with the Hilbert and True envelopes significantly outperformed STRAIGHT for the female speaker, while reaching almost the highest naturalness for the male speaker. This means that the approach presented in this work is an interesting alternative to the earlier version of the proposed vocoder (Csapó et al., 2016) and, at least for the female voice, to the STRAIGHT vocoder.

5.2.2. Listening test #2: Arabic corpus

For the second MUSHRA test, twelve sentences were selected from the Arabic corpus (two from each emotion). Altogether, 84 utterances were included in the test (1 speaker × 7 types × 6 emotions × 2 sentences). Another set of 21 participants (8 males and 13 females) between the ages of 21 and 34 (mean age: 24 years), mostly with a linguistics background, were asked to complete the online
⁴ http://smartlab.tmit.bme.hu/vocoder_Arabic_2018
Fig. 8. Results of subjective evaluation #2 (Arabic samples) for the naturalness question. Higher values mean higher naturalness. Error bars show the bootstrapped 95% confidence intervals.
listening test. All of them were native Arabic speakers and none of them reported any hearing loss. On average, the test took 20 min to complete. The listening test samples are available online.⁵

The MUSHRA scores of the listening test are presented in Fig. 8. Several observations can be made. The results show that all of the proposed systems are significantly better than the baseline (Mann-Whitney-Wilcoxon rank sum test). It is also important to note that the difference between the proposed systems using the envelopes and the STRAIGHT vocoder is significant, meaning that overall, our system could reach the quality of state-of-the-art vocoders. Focusing on the Neutral and Sad types, STRAIGHT clearly works better (with a mean naturalness of 86%) than the other methods. For the other emotions, the proposed vocoder based on the envelopes was superior, with mean naturalness of 86% for Anger, 85% for Fear, 87% for Happiness, and 85% for Question. Overall, our proposed vocoder is preferred for synthesized Arabic speech and reached the highest rating (85%) in the listening test, above the STRAIGHT (78%) and baseline (77%) vocoders.

5.3. Discussion

To discuss why the STRAIGHT synthesizer scored below 70 for the English female speaker and below 80 for some Arabic emotions, PDD samples of natural utterances and utterances vocoded by STRAIGHT are shown in Fig. 9. The main cause appears to be that the voiced segments were wrongly affected by higher-frequency harmonics (e.g. above 5 kHz between 0.2 and 0.5 s in Fig. 9, left), which degrades the quality of the synthesized speech; this explains the lower values for the English female speaker and for Anger, Fear, and Question for the Arabic male speaker. Conversely, the synthetic speech of the proposed technique avoids this limitation by controlling the harmonic frequencies, improving speech quality as described in Section 5.1.

Some confusion between the Arabic emotion types synthesized by our methods and by STRAIGHT was observed in the results of listening test #2. Therefore, the empirical cumulative distribution functions (Waterman and Whiteman, 1978) of the phase distortion mean (PDM) values were calculated and are displayed in Fig. 10, to see how these systems are distributed and how far they are from the natural signal. The empirical cumulative distribution function $F_n(x)$ is defined as

$$F_n(x) = \frac{\#\{X_i : X_i \le x\}}{n} = \frac{1}{n}\sum_{i=1}^{n} I_{X_i \le x}(X_i) \qquad (13)$$

where the $X_i$ are the PDM variables with density function f(x) and distribution function F(x), $\#A$ denotes the number of elements in the set A, n is the number of experimental observations, and I is the indicator of an event A, given as

$$I_A(x) = \begin{cases} 1, & x \in A \\ 0, & x \notin A \end{cases} \qquad (14)$$

It can be noticed that the upper mode of the distributions (positive x-axis in Fig. 10b-f) corresponding to STRAIGHT's PDMs is clearly higher than that of the original signal. This also explains why the speech synthesized by STRAIGHT ranked lower in the perception test. On the contrary, the upper mode of the distributions corresponding to the proposed configurations is better reconstructed, especially for the Anger, Fear, and Question emotions. These results can be explained by the fact that time-envelope based modulation of the high frequencies is beneficial and can substantially reduce any residual buzziness. Focusing on the lower mode of the distributions (negative x-axis in Fig. 10a, e), STRAIGHT's PDMs give better synthesized performance than the other systems for the Neutral and Sad emotions, whereas the proposed vocoder almost reaches the natural distribution for the Fear and Question emotions (Fig. 10c, f).
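Eqs. (13) and (14) reduce to counting the fraction of samples at or below each threshold; a minimal sketch follows (the input array of PDM values is hypothetical).

```python
import numpy as np

def ecdf(samples):
    """Empirical cumulative distribution function per Eqs. (13)-(14):
    F_n(x) = (1/n) * #{X_i <= x}, evaluated at the sorted sample points."""
    x = np.sort(np.asarray(samples))
    F = np.arange(1, len(x) + 1) / len(x)
    return x, F

# Hypothetical usage: compare natural vs. vocoded PDM distributions
# x_nat, F_nat = ecdf(pdm_values_natural)
# x_voc, F_voc = ecdf(pdm_values_vocoded)
```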
Fig. 9. Phase Distortion Deviation of natural speech samples and samples vocoded with STRAIGHT. English sentence: “Will we ever forget it.” (speaker SLT); Arabic sentence translated as “it would be better to start studying now as I don't want to lose time”, with Latin transcription “yuhsen en abda mothakeraty alaan heta la yakhtalet aly alamer”. The warmer the color, the larger the PDD value and the noisier the corresponding time-frequency region.
⁵ http://smartlab.tmit.bme.hu/vocoder_Arabic_2018
Fig. 10. Empirical cumulative distribution functions of the PDMs for the six vocoder variants per emotion, compared with the PDM measure of the natural speech signal.
Consequently, the experimental results verify the effectiveness of the proposed vocoder in terms of speech naturalness; it is comparable to, and in some cases even better than, STRAIGHT. In particular, our emotional Arabic utterances are modeled more suitably by the continuous vocoder with the envelopes, which provides better performance in Arabic speech re-synthesis.

6. Conclusions

This paper proposed a new approach aimed at improving the accuracy of our continuous vocoder and evaluated it using English and Arabic speech samples. The main idea was to further control the time structure of the high-frequency noise component by estimating a suitable time envelope. Four different envelopes from the literature (Amplitude, Hilbert, Triangular, and True) were tested. Using a variety of measurements, the strengths and weaknesses of each of the proposed methods for different speakers were highlighted. The objective experiments showed that the proposed vocoders model the time structure of the noise component of the excitation better than the baseline (see e.g. the error metrics in Table 1). It can be concluded that the Hilbert and True envelopes are the best choices in combination with the continuous vocoder (i.e. they are closest to the natural sentences in terms of PDD). Furthermore, the results of the MUSHRA tests demonstrated the effectiveness of the proposed approaches for improving the quality of synthetic speech: the proposed vocoder outperformed the baseline and the state-of-the-art (STRAIGHT) models for Arabic and for the female English speaker. Future research plans involve adding a Harmonics-to-Noise Ratio as a new parameter to the analysis, statistical learning, and synthesis steps in order to further increase vocoder performance (i.e., the voice quality of the male speaker). We believe the results obtained in this paper will allow us to enhance the performance of other types of vocoders.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The research was partly supported by the VUK (AAL-2014-1-183) and EUREKA (DANSPLAT E!9944) projects, and by the National Research, Development and Innovation Office of Hungary (FK 124584). The Titan X GPU used was donated by NVIDIA Corporation. We would like to thank the subjects for participating in the listening tests.

References

Abdo, O., Abdou, S., Fashal, M., 2017. Building audio-visual phonetically annotated Arabic corpus for expressive text to speech. In: Proceedings of Interspeech 2017, Stockholm, pp. 3767-3771.
Al-Radhi, M.S., Csapó, T.G., Németh, G., 2017a. Time-domain envelope modulating the noise component of excitation in a continuous residual-based vocoder for statistical parametric speech synthesis. In: Proceedings of Interspeech 2017, Stockholm, pp. 434-438.
Al-Radhi, M.S., Csapó, T.G., Németh, G., 2017b. Deep recurrent neural networks in speech synthesis using a continuous vocoder. In: Proceedings of the 19th International Conference on Speech and Computer (SPECOM), Hatfield, England. Lecture Notes in Computer Science, vol. 10458. Springer, Cham, pp. 282-291.
Cabral, J.P., Berndsen, J.C., 2013. Towards a better representation of the envelope modulation of aspiration noise. In: Proceedings of Advances in Nonlinear Speech Processing, Berlin, Heidelberg.
Cappé, O., Moulines, E., 1996. Regularization techniques for discrete cepstrum estimation. IEEE Signal Process. Lett. 3 (4), 100-103.
Csapó, T.G., Németh, G., Cernak, M., 2015. Residual-based excitation with continuous F0 modeling in HMM-based speech synthesis. In: Proceedings of the Third International Conference on Statistical Language and Speech Processing, vol. 9449, pp. 27-38.
Csapó, T.G., Németh, G., Cernak, M., Garner, P.N., 2016. Modeling unvoiced sounds in statistical parametric speech synthesis with a continuous vocoder. In: Proceedings of the EUSIPCO, pp. 1338-1342.
Degottex, G., Erro, D., 2014. A uniform phase representation for the harmonic model in speech synthesis applications. EURASIP J. Audio Speech Music Process. 38 (1), 1-16.
Degottex, G., Lanchantin, P., Gales, M., 2018. A log domain pulse model for parametric speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process. 26 (1), 57-70.
Drugman, T., Dutoit, T., 2012. The deterministic plus stochastic model of the residual signal and its applications. IEEE Trans. Audio Speech Lang. Process. 20 (3), 968-981.
Drugman, T., Thomas, M., Gudnason, J., Naylor, P., Dutoit, T., 2012. Detection of glottal closure instants from speech signals: a quantitative review. IEEE Trans. Audio Speech Lang. Process. 20 (3), 994-1006.
Drugman, T., Stylianou, Y., 2014. Maximum voiced frequency estimation: exploiting amplitude and phase spectra. IEEE Signal Process. Lett. 21 (10), 1230-1234.
Drullman, R., 1995. Temporal envelope and fine structure cues for speech intelligibility. J. Acoust. Soc. Am. 97 (1), 585-591.
Dutoit, T., 1997. High-quality text-to-speech synthesis: an overview. J. Electr. Electron. Eng. Aust. 17, 25-36.
Ellis, D., Athineos, M., 2003. Frequency domain linear prediction for temporal features. In: Proceedings of the IEEE ASRU Workshop.
Espic, F., Botinhao, C.V., King, S., 2017. Direct modelling of magnitude and phase spectra for statistical parametric speech synthesis. In: Proceedings of Interspeech 2017, Stockholm, pp. 1383-1387.
Fisher, N.I., 1995. Statistical Analysis of Circular Data. Cambridge University Press, UK.
Galas, T., Rodet, X., 1990. An improved cepstral method for deconvolution of source-filter systems with discrete spectra. In: Proceedings of the International Computer Music Conference, Glasgow.
Garner, P.N., Cernak, M., Motlicek, P., 2013. A simple continuous pitch estimation algorithm. IEEE Signal Process. Lett. 20 (1), 102-105.
Hu, Q., Richmond, K., Yamagishi, J., Latorre, J., 2013. An experimental comparison of multiple vocoder types. In: Proceedings of the ISCA SSW8, pp. 155-160.
Imai, S., Sumita, K., Furuichi, C., 1983. Mel log spectrum approximation (MLSA) filter for speech synthesis. Electron. Commun. Jpn. Part I Commun. 66 (2), 10-18.
ITU-R Recommendation BS.1534, 2001. Method for the subjective assessment of intermediate audio quality.
Jensen, J., Taal, C.H., 2016. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Trans. Audio Speech Lang. Process. 24 (11), 2009-2022.
Kawahara, H., Morise, M., 2011. Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework. Sadhana 36 (5), 713-727.
Kawahara, H., Masuda-Katsuse, I., de Cheveigné, A., 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun. 27 (3), 187-207.
Kenmochi, H., 2012. Singing synthesis as a new musical instrument. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, pp. 5385-5388.
Klatt, D., 1982. Prediction of perceived phonetic distance from critical band spectra: a first step. In: Proceedings of the IEEE ICASSP, New York, pp. 1278-1281.
Kominek, J., Black, A.W., 2003. CMU ARCTIC Databases for Speech Synthesis. Carnegie Mellon University.
Ma, J., Hu, Y., Loizou, P., 2009. Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. J. Acoust. Soc. Am. 125 (5), 3387-3405.
Maia, R., Zen, H., Knill, K., Gales, M., Buchholz, S., 2011. Multipulse sequences for residual signal modeling. In: Proceedings of Interspeech 2011, pp. 1833-1836.
Moore, B., 2014. Auditory Processing of Temporal Fine Structure: Effects of Age and Hearing Loss. World Scientific, UK.
Narendra, N.P., Sreenivasa, K.R., 2015. Time-domain deterministic plus noise model based hybrid source modeling for statistical parametric speech synthesis. Speech Commun. 77, 65-83.
Pantazis, Y., Stylianou, Y., 2008. Improving the modeling of the noise part in the harmonic plus noise model of speech. In: Proceedings of the ICASSP, pp. 4609-4612.
Potamianos, A., Maragos, P., 1994. A comparison of the energy operator and the Hilbert transform approach to signal and speech demodulation. Signal Process. 37, 95-120.
Robel, A., Rodet, X., 2005. Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In: Proceedings of the International Conference on Digital Audio Effects, Madrid.
Robel, A., Villavicencio, F., Rodet, X., 2007. On cepstral and all-pole based spectral envelope modeling with unknown model order. Pattern Recognit. Lett. 28 (11), 1343-1350.
Schloss, A., 1985. On the Automatic Transcription of Percussive Music: From Acoustic Signal to High-Level Analysis. Ph.D. thesis, Stanford University.
Stylianou, Y., 2001. Applying the harmonic plus noise model in concatenative speech synthesis. IEEE Trans. Speech Audio Process. 9 (1), 21-29.
Stylianou, Y., Laroche, J., Moulines, E., 1995. High-quality speech modification based on a harmonic + noise model. In: Proceedings of EUROSPEECH, pp. 451-454.
Suendermann, D., Höge, H., Black, A., 2010. Challenges in speech synthesis. In: Speech Technology: Theory and Applications. Springer, Boston, MA, pp. 19-32.
Toda, T., Black, A.W., Tokuda, K., 2007. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process. 15, 2222-2235.
Tokuda, K., Kobayashi, T., Masuko, T., Imai, S., 1994. Mel-generalized cepstral analysis: a unified approach to speech spectral estimation. In: Proceedings of the ICSLP, pp. 1043-1046.
Tokuda, K., Masuko, T., Miyazaki, N., Kobayashi, T., 2002. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst. E85-D (3), 455-464.
Villavicencio, F., Robel, A., Rodet, X., 2006. Improving LPC spectral envelope extraction of voiced speech by true-envelope estimation. In: Proceedings of the ICASSP, vol. 6, pp. 869-872.
Waterman, M.S., Whiteman, D.E., 1978. Estimation of probability densities by empirical density functions. Int. J. Math. Educ. Sci. Technol. 9 (2), 127-137.
Yu, K., Thomson, B., Young, S., 2010. From discontinuous to continuous F0 modelling in HMM-based speech synthesis. In: Proceedings of the ISCA SSW7, pp. 94-99.
Yu, K., Young, S., 2011. Continuous F0 modeling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 19 (5), 1071-1079.
Zen, H., Senior, A., Schuster, M., 2013. Statistical parametric speech synthesis using deep neural networks. In: Proceedings of the ICASSP, pp. 7962-7966.
Zen, H., Tokuda, K., Black, A.W., 2009. Statistical parametric speech synthesis. Speech Commun. 51 (11), 1039-1064.