Available online at www.sciencedirect.com
Speech Communication 55 (2013) 606–618 www.elsevier.com/locate/specom
Complex cepstrum for statistical parametric speech synthesis
Ranniery Maia a,⇑, Masami Akamine b,1, Mark J.F. Gales a,2
a Cambridge Research Laboratory, Toshiba Research Europe Limited, 208 Cambridge Science Park, Milton Road, Cambridge CB4 0GZ, UK
b Corporate Research and Development Center, Toshiba Corporation, 1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki 212-8582, Japan
Received 9 July 2012; received in revised form 14 December 2012; accepted 15 December 2012
Available online 1 February 2013
Abstract

Statistical parametric synthesizers have typically relied on a simplified model of speech production. In this model, speech is generated using a minimum-phase filter, implemented from coefficients derived from spectral parameters, driven by a zero or random phase excitation signal. This excitation signal is usually constructed from fundamental frequencies and parameters used to control the balance between the periodicity and aperiodicity of the signal. The application of this approach to statistical parametric synthesis has partly been motivated by speech coding theory. However, in contrast to most real-time speech coders, parametric speech synthesizers do not require causality. This allows the standard simplified model to be extended to represent the natural mixed-phase characteristics of speech signals. This paper proposes the use of the complex cepstrum to model the mixed-phase characteristics of speech through the incorporation of phase information in statistical parametric synthesis. The phase information is contained in the anti-causal portion of the complex cepstrum. These parameters have a direct connection with the shape of the glottal pulse of the excitation signal. Phase parameters are extracted on a frame basis and are modeled in the same fashion as the minimum-phase synthesis filter parameters. At synthesis time, phase parameter trajectories are generated and used to modify the excitation signal. Experimental results show that the use of such complex cepstrum-based phase features results in better synthesized speech quality. Listening test results yield an average preference of 60% for the system with the proposed phase feature on both female and male voices.

© 2013 Elsevier B.V. All rights reserved.

Keywords: Speech synthesis; Statistical parametric speech synthesis; Spectral analysis; Cepstral analysis; Complex cepstrum; Glottal source models
⇑ Corresponding author. Tel.: +44 1223 436974; fax: +44 1223 436909.
E-mail addresses: [email protected] (R. Maia), masa.[email protected] (M. Akamine), [email protected] (M.J.F. Gales).
1 Tel.: +81 44 549 2020; fax: +81 44 520 1308.
2 Tel.: +44 1223 436900; fax: +44 1223 436909.
0167-6393/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.specom.2012.12.008

1. Introduction

Statistical parametric synthesizers (Zen et al., 2009) have mostly relied on a simplified parametric model of speech production in which a minimum-phase filter is excited by a signal consisting of a mixture of pulses and noise. Although this approach has been successful in producing high-quality synthetic speech (Zen et al., 2005), achieving naturalness and speaker similarity at the same level as speech synthesized by concatenative systems is still an open problem. The use of a minimum-phase filter, as an approximation for the effects of the vocal tract and lip radiation of the human speech production model, is a legacy of the speech coding area, where causality is essential (Deller et al., 2000). However, in speech coders the limitations of this simplification can be compensated for by the excitation signals, which are usually derived through a frame-based analysis-by-synthesis procedure in the encoder (Chu, 2003). In parametric speech synthesis, this is not possible.

Improvements to the speech production mechanism in parametric synthesis have mainly focused on enhancing the excitation signal, for example (Yoshimura et al., 2001; Zen et al., 2005; Maia et al., 2007; Cabral et al., 2007; Raitio et al., 2008; Drugman et al., 2009). The utilization of parameters for mixing the periodic and aperiodic components of the excitation signal, such as the band-aperiodicity parameters described in (Zen et al., 2005), removed the buzz sensation present in the binary excitation typical of linear prediction (LP) vocoders (Chu, 2003). Additionally, efforts have been made to improve naturalness and speaker similarity by mimicking the natural glottal pulses of the human excitation flow (Cabral et al., 2007; Raitio et al., 2008) or through models of the speech residual, the optimal excitation signal that is produced by inverse filtering the speech signal with the minimum-phase synthesis filter (Maia et al., 2007; Drugman et al., 2009).

We have proposed the use of the complex cepstrum to incorporate phase as glottal pulse information into statistical parametric speech synthesis systems (Maia et al., 2012). This paper follows up on and expands that work. From the perspective of the speech production mechanism in source-filter modeling, the use of the complex cepstrum has an advantage over the commonly used cepstrum of minimum-phase sequences³ since it better represents the mixed-phase characteristics of speech signals. The complex cepstrum representation of the speech signal allows non-causal modeling of short-time speech segments, which is actually observed in natural speech (Deller et al., 2000).

However, though theoretically advantageous, complex cepstrum analysis has certain drawbacks. The speech signal must be windowed at the glottal closure instants (GCIs). The accuracy of the detection of the GCIs, as well as the type of window used for analysis, has a direct impact on the estimation of the complex cepstrum (Quatieri, 1979; Verhelst and Steenhaut, 1986). In addition, a phase unwrapping procedure is usually performed to obtain the phase spectrum of the speech segment as a continuous function of the frequency.
A high-order Fast Fourier Transform (FFT) often improves the performance of this phase unwrapping process and helps to avoid aliasing, at the cost of increased computational complexity (Oppenheim, 2010).

Aside from the analysis issues, another drawback for statistical parametric synthesizers is that statistical methods, such as hidden semi-Markov models (HSMMs), usually require the observation vectors to be extracted at a fixed frame rate rather than at GCIs. This paper addresses this issue by proposing a method for calculating frame-based complex cepstra. Chirp analysis (Drugman and Dutoit, 2010) is one possible approach for calculating the frame-based complex cepstrum. The idea, as described in (Drugman and Dutoit, 2010), can be interpreted as searching for an optimal GCI at which to place the analysis window. However, this approach performs poorly in extracting temporal changes of the complex cepstrum when several frames exist between consecutive GCIs, because the frame-based complex cepstrum becomes constant between consecutive pitch period onset times. In this work the complex cepstrum is derived by interpolating pitch-synchronous amplitude spectra and phase spectra over frames. This method has lower computational complexity than chirp analysis. Frame-based complex cepstra are then decomposed into all-pass and minimum-phase components. The all-pass cepstra, which are related to the non-causal part of the complex cepstrum, are regarded as phase parameters and modeled by HSMMs as an additional observation vector stream. At synthesis time, the generated phase parameters are used to implement a glottal filter.

In previous work (Maia et al., 2012), the basic idea of applying the complex cepstrum to statistical parametric synthesis was first proposed. Here we perform a more in-depth investigation of the issues involved in its use for speech analysis and synthesis. For instance, at analysis time a comparison between different methods of frame-based complex cepstrum extraction is presented. In addition, the use of frequency warping significantly increases the speech quality when compared with the results obtained in (Maia et al., 2012).

This paper is organized as follows. Section 2 gives a brief review of speech modeling using the complex cepstrum. Section 3 describes how the complex cepstrum can be incorporated into the statistical parametric synthesis framework. Experimental results are shown in Section 4, and the conclusions we draw from this work are given in Section 5.

³ Henceforth referred to as the minimum-phase cepstrum.

2. Speech modeling using the complex cepstrum

2.1. Cepstral analysis

Cepstral analysis was initially developed in the field of homomorphic deconvolution (Oppenheim, 2010). The complex cepstrum, $\hat{s}(n)$, of a signal, $s(n)$, is given by the inverse Fourier transform of its log spectrum,

\begin{align}
\hat{s}(n) &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S\left(e^{j\omega}\right) e^{j\omega n}\, d\omega, \tag{1}\\
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln\left|S\left(e^{j\omega}\right)\right| e^{j\omega n}\, d\omega + \frac{j}{2\pi}\int_{-\pi}^{\pi} \theta(\omega)\, e^{j\omega n}\, d\omega, \tag{2}
\end{align}

where

\[ S\left(e^{j\omega}\right) = \sum_{n=-\infty}^{\infty} s(n)\, e^{-j\omega n} = \left|S\left(e^{j\omega}\right)\right| e^{j\theta(\omega)} \tag{3} \]

is the Discrete-Time Fourier Transform (DTFT) of $s(n)$, with $|S(e^{j\omega})|$ and $\theta(\omega)$ being the amplitude and phase spectra, respectively. The complex cepstrum of a real sequence is an infinite and non-causal sequence, i.e., $\hat{s}(n) \neq 0$, $n = -\infty, \ldots, \infty$, and a lossless representation of the signal from which it is derived. The original signal $s(n)$ can be fully recovered from $\hat{s}(n)$ through the following inverse operations:

\begin{align}
S\left(e^{j\omega}\right) &= \exp \sum_{n=-\infty}^{\infty} \hat{s}(n)\, e^{-j\omega n}, \tag{4}\\
s(n) &= \frac{1}{2\pi}\int_{-\pi}^{\pi} S\left(e^{j\omega}\right) e^{j\omega n}\, d\omega. \tag{5}
\end{align}
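As an illustration of Eqs. (1) to (5), the following minimal sketch computes a sampled complex cepstrum of a short mixed-phase sequence and then recovers the sequence from it. It is not the analysis procedure used in this paper: the direct $O(N^2)$ DFT, the toy signal, the function names, and the simple bin-to-bin phase unwrapping are all assumptions made to keep the example self-contained.

```python
import cmath
import math


def dft(x, n_fft, sign=-1):
    """Direct O(N^2) DFT (sign=-1) or unnormalised inverse DFT (sign=+1)."""
    return [sum(v * cmath.exp(sign * 2j * math.pi * k * n / n_fft)
                for n, v in enumerate(x))
            for k in range(n_fft)]


def complex_cepstrum(x, n_fft=64):
    """Sampled complex cepstrum; index n_fft - n holds the value at time -n."""
    spectrum = dft(x, n_fft)
    log_amp = [math.log(abs(v)) for v in spectrum]
    # Unwrap by accumulating the principal phase increment between bins
    # (assumes every true increment is smaller than pi in magnitude).
    phase = [cmath.phase(spectrum[0])]
    for k in range(1, n_fft):
        phase.append(phase[-1] + cmath.phase(spectrum[k] / spectrum[k - 1]))
    # Remove the linear phase term so that the phase is zero at omega = pi.
    slope = phase[n_fft // 2] / math.pi
    phase = [p - slope * 2.0 * math.pi * k / n_fft for k, p in enumerate(phase)]
    log_spectrum = [complex(a, p) for a, p in zip(log_amp, phase)]
    return [v.real / n_fft for v in dft(log_spectrum, n_fft, sign=+1)]


def inverse_complex_cepstrum(cepstrum):
    """Exponentiate the DFT of the cepstrum, then invert the DFT (Eqs. (4)-(5))."""
    n_fft = len(cepstrum)
    spectrum = [cmath.exp(v) for v in dft(cepstrum, n_fft)]
    return [v.real / n_fft for v in dft(spectrum, n_fft, sign=+1)]


# Hypothetical toy signal: (1 - a z^-1)(1 - b z) delayed by one sample,
# with one zero inside (at a) and one outside (at 1/b) the unit circle.
a, b = 0.5, 0.4
signal = [-b, 1.0 + a * b, -a]
cepstrum = complex_cepstrum(signal, n_fft=64)
recovered = inverse_complex_cepstrum(cepstrum)
```

For this toy signal the causal part of the cepstrum follows $-a^n/n$ and the anti-causal part follows $-b^{|n|}/|n|$, so the zero outside the unit circle shows up only at negative quefrencies. The recovered sequence equals the input with its one-sample delay removed, because the linear phase term is discarded during analysis.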
In contrast, the real cepstrum of a signal takes into consideration only its amplitude spectrum. Therefore, the real cepstrum, $\hat{c}(n)$, of the signal $s(n)$ is given by the first term on the right-hand side of (2),

\[ \hat{c}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln\left|S\left(e^{j\omega}\right)\right| e^{j\omega n}\, d\omega. \tag{6} \]

Note that the real cepstrum is a special case of the complex cepstrum when $\theta(\omega) = 0, \forall\omega$, i.e., the zero-phase situation.

2.2. Speech analysis and synthesis using the complex cepstrum

When a signal $s(n)$ can be represented as the convolution of two signals, i.e. $s(n) = h(n) * e(n)$, in the cepstral domain this convolution is realized as an addition: $\hat{s}(n) = \hat{h}(n) + \hat{e}(n)$. Assuming that one of the signals is more stationary than the other, the separation of $\hat{h}(n)$ and $\hat{e}(n)$ can be achieved by a lifter, which is a filter in the cepstral domain. Hence, using a source-filter model of speech production, where speech is generated by a slowly varying filter driven by a rapidly varying excitation (Deller et al., 2000), the impulse response of the filter, $h(n)$, related to the spectral envelope of $s(n)$, can be obtained from a windowed (or liftered) version of $\hat{s}(n)$.

When pitch-synchronous analysis is performed, as depicted in Fig. 1, the excitation signal in each windowed portion of the speech signal can be considered as the unit impulse sequence, $e(n) = \delta(n)$, which means that $\hat{s}(n)$ will correspond to the short-term spectral envelope term $\hat{h}(n)$. In this case the samples of $\hat{s}(n)$ will decay rapidly to zero as $|n| \to \infty$ (Deller et al., 2000). Therefore, the impulse response $h(n)$ can be approximated by

\begin{align}
H\left(e^{j\omega}\right) &= \exp \sum_{n=-C}^{C} \hat{h}(n)\, e^{-j\omega n}, \tag{7}\\
h(n) &= \frac{1}{2\pi}\int_{-\pi}^{\pi} H\left(e^{j\omega}\right) e^{j\omega n}\, d\omega, \tag{8}
\end{align}

where $C$ is the cepstral order, and $\hat{h}(n)$ is a truncated version of $\hat{s}(n)$ so that $\hat{h}(n) = 0$ for $|n| > C$. The speech signal can then be re-synthesized by convolving $h(n)$ with an excitation signal, where $e(n) = \delta(n)$ for voiced regions and white noise for unvoiced regions (Vondra and Vích, 2011).
Fig. 1. Segmentation of a speech signal using pitch-synchronous windows.
2.3. Practical computation of the complex cepstrum

Assuming that accurate GCIs are given (pitch period onset time detection is not covered in this paper), the two main computational issues of complex cepstrum analysis are aliasing and phase unwrapping. Both issues arise because the continuous frequency-domain function $S(e^{j\omega})$ must be sampled in practical speech analysis, and therefore the DTFT must be replaced by a Discrete Fourier Transform (DFT). Thus, by replacing the integration in (1) with a summation, an approximation to the complex cepstrum can be calculated as follows:

\begin{align}
\hat{s}_d(n) &= \frac{1}{2L+1} \sum_{l=-L}^{L} \ln S\left(e^{j\omega_l}\right) e^{j\omega_l n}, \tag{9}\\
&= \frac{1}{2L+1} \left\{ \ln\left|S\left(e^{j\omega_0}\right)\right| + 2\sum_{l=1}^{L} \left[ \ln\left|S\left(e^{j\omega_l}\right)\right| \cos(\omega_l n) - \theta(\omega_l) \sin(\omega_l n) \right] \right\}, \tag{10}
\end{align}

where $\{\omega_0, \ldots, \omega_L\}$ are $L+1$ sampled frequencies so that $\omega_0 = 0$ and $\omega_L = \pi$. $|S(e^{j\omega_l})|$ and $\theta(\omega_l)$ are the amplitude and phase responses at $\omega_l$, respectively. To obtain (10), it is assumed that the phase function is odd and periodic, i.e., $\theta(\omega_l) = -\theta(-\omega_l)$ and $\theta(0) = \theta(\pi) = 0$, and that the amplitude function is even and periodic, i.e., $|S(e^{j\omega_l})| = |S(e^{-j\omega_l})|$.

Aliasing comes from the fact that $S(e^{j\omega_l})$, $l = 0, \ldots, L$, is a sampled version of $S(e^{j\omega})$. Consequently $\hat{s}_d(n)$ is an aliased version of $\hat{s}(n)$,

\[ \hat{s}_d(n) = \sum_{r=-\infty}^{\infty} \hat{s}\left(n + (L+1)r\right). \tag{11} \]

Since for (1) to be valid the phase function $\theta(\omega)$ must be continuous, odd, and periodic in $\omega$, a phase unwrapping process must be applied to the samples of the principal value of the phase function, $\Theta(\omega)$. This is calculated as

\[ \Theta(\omega_l) = \tan^{-1} \frac{\mathrm{Im}\left[S\left(e^{j\omega_l}\right)\right]}{\mathrm{Re}\left[S\left(e^{j\omega_l}\right)\right]}, \quad l = 0, \ldots, L, \tag{12} \]

where $\mathrm{Re}[\cdot]$ and $\mathrm{Im}[\cdot]$ denote the real and imaginary parts of $[\cdot]$, respectively. Consecutive samples of $\Theta(\omega)$ may have jumps of $2\pi$ due to the direct application of (12) to compute the phase response. These jumps need to be detected to enable phase unwrapping.

In general, both issues can be moderated by increasing the number of sampled frequencies $L+1$. For the aliasing case, $\hat{s}_d(n) \to \hat{s}(n)$ as $L \to \infty$. For phase unwrapping, the performance of the unwrapping algorithms is greatly improved when the difference between two consecutive angular frequencies, $\Delta\omega = \omega_l - \omega_{l-1}$, is small. In Section 3.1.2 some approaches to unwrapping the phase function are discussed. Henceforth in this paper the notation $\hat{s}_d(n)$ is dropped and $\hat{s}(n)$ is used instead to indicate the approximation to the theoretical complex cepstrum.
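As a worked step for Section 2.3, the two-sided sum over sampled frequencies can be folded into a one-sided sum using only the symmetry conditions stated there (even log-amplitude, odd phase with $\theta(\omega_0) = 0$). The shorthand $A_l = \ln|S(e^{j\omega_l})|$ and $\theta_l = \theta(\omega_l)$ is introduced here purely for this derivation.

```latex
\begin{align*}
\sum_{l=-L}^{L} \left(A_l + j\theta_l\right) e^{j\omega_l n}
&= A_0 + \sum_{l=1}^{L} \left[ \left(A_l + j\theta_l\right) e^{j\omega_l n}
   + \left(A_l - j\theta_l\right) e^{-j\omega_l n} \right] \\
&= A_0 + 2 \sum_{l=1}^{L} \left[ A_l \cos(\omega_l n) - \theta_l \sin(\omega_l n) \right].
\end{align*}
```

Dividing by $2L+1$ gives a real-valued sequence, as expected for the complex cepstrum of a real signal. The amplitude term contributes the even part of the cepstrum and the phase term contributes the odd part.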
3. Complex cepstrum for statistical parametric synthesis

To use the complex cepstrum in statistical parametric synthesis, the following issues must be addressed: (1) frame-based complex cepstrum analysis; (2) statistical modeling of additional features to represent the mixed-phase characteristics of speech signals; and (3) synthesis with these features.

3.1. Frame-based complex cepstrum

Statistical parametric synthesizers usually utilize observation vectors composed of speech parameters that are extracted from speech signals at a fixed period. To obtain frame-based complex cepstra, pitch-synchronous analysis is performed first, and then linear interpolation is used to obtain frame-based features.

3.1.1. Pitch-synchronous spectral analysis

In short-term complex cepstrum analysis, speech segmentation is a very important step to avoid representing a circularly shifted version of the original speech segment from which the complex cepstrum is extracted (Oppenheim, 2010). The analysis windows should be placed at pitch period onset times and should cover two pitch periods, as illustrated in Fig. 1. Samples of pitch-synchronous spectra $S(e^{j\omega})$ are obtained by taking the DFT of the speech segments at pitch period onset times $\ldots, p_{i-1}, \ldots, p_{i+4}, \ldots$. Furthermore, the choice of the window function is crucial to achieving a good estimate of the complex cepstrum (Quatieri, 1979; Verhelst and Steenhaut, 1986; Drugman et al., 2011). In this work, windowing is performed with an asymmetric Blackman window. This window was found to outperform Hann and Hamming windows in initial informal listening evaluations of analyzed-synthesized speech.

3.1.2. Phase unwrapping

Phase unwrapping is an important procedure in the calculation of the complex cepstrum. As mentioned in Section 2.3, increasing the DFT size decreases the $\Delta\omega$ between two consecutive samples and therefore makes it easier to detect discontinuities in the phase function modulo $2\pi$, $\Theta(\omega)$.

In this work the performance of two algorithms for unwrapping the phase function was compared (see Section 4.2). The first algorithm is a simple implementation in which the difference between two consecutive samples of $\Theta(\omega)$ is analyzed and processed (Maia et al., 2012). The second algorithm is based on integration of the phase derivative (Tribolet, 1977). They are described below.

Simple method. Given $\{\Theta(\omega_0), \ldots, \Theta(\omega_L)\}$, $L+1$ samples of the phase function modulo $2\pi$, $\Theta(\omega)$, between $\omega_0 = 0$ and $\omega_L = \pi$, phase unwrapping is achieved by

\[ \varphi(\omega_l) = \Theta(\omega_l) - \sum_{k=0}^{l-1} w(k), \quad l = 1, \ldots, L, \tag{13} \]

where $\{\varphi(\omega_0), \ldots, \varphi(\omega_L)\}$ are samples of the unwrapped phase function, with $\varphi(\omega_0)$ assumed to be $\varphi(\omega_0) = 0$, and

\[ w(k) = \left\lfloor \frac{\Theta(\omega_{k+1}) - \Theta(\omega_k) + \pi}{2\pi} \right\rfloor 2\pi, \quad k = 0, \ldots, L-1. \tag{14} \]
In essence, the algorithm above examines the difference between two consecutive points of $\Theta(\omega)$ and checks whether its magnitude is greater than $\pi$.

Integration of the phase derivative. In this case the phase function is computed through trapezoidal integration of the phase derivative. Here

\[ \tilde{\varphi}(\omega_l) = \varphi(\omega_{l-1}) + \frac{\Delta\omega}{2}\left[\varphi'(\omega_l) + \varphi'(\omega_{l-1})\right], \quad l = 1, \ldots, L, \tag{15} \]

where samples of the phase derivative $\varphi'(\omega) = \frac{d\varphi(\omega)}{d\omega}$ can be obtained by

\[ \varphi'(\omega_l) = \frac{\mathrm{Re}\left[S\left(e^{j\omega_l}\right)\right] \mathrm{Im}\left[S'\left(e^{j\omega_l}\right)\right] - \mathrm{Im}\left[S\left(e^{j\omega_l}\right)\right] \mathrm{Re}\left[S'\left(e^{j\omega_l}\right)\right]}{\left|S\left(e^{j\omega_l}\right)\right|^2}, \tag{16} \]

with $S'(e^{j\omega})$ being the derivative of $S(e^{j\omega})$ with respect to $\omega$, given by

\[ S'\left(e^{j\omega}\right) = \frac{d}{d\omega}\left( \sum_{n=-\infty}^{\infty} s(n)\, e^{-j\omega n} \right) = -j \sum_{n=-\infty}^{\infty} n\, s(n)\, e^{-j\omega n}. \tag{17} \]

$\varphi(\omega_0)$ is again assumed to be zero, and each sample $\tilde{\varphi}(\omega_l)$ in (15) is an estimate of $\varphi(\omega_l)$. Finally, $\tilde{\varphi}(\omega_l)$ is tested to verify whether an integer $r(l)$ exists such that

\[ \left|\tilde{\varphi}(\omega_l) - \Theta(\omega_l) - 2\pi r(l)\right| < \pi. \tag{18} \]

Where no integer $r(l)$ is found to satisfy this condition, $L$ is increased and the process is repeated (Tribolet, 1977). An extension to this method is proposed in (Bhanu and McClellan, 1980), which improves the unwrapping procedure especially for speech segments that have zeros close to the unit circle. However, the same paper also mentions that when a large DFT size is utilized, the performance of the proposed method becomes similar to the one shown in (Tribolet, 1977).

Regardless of the unwrapping process, the final step is to remove the linear phase component from the unwrapped phase spectrum:

\[ \theta(\omega_l) = \varphi(\omega_l) - \frac{\varphi(\pi)}{\pi}\,\omega_l, \quad l = 1, \ldots, L. \tag{19} \]

This operation ensures that $\theta(\omega)$ is odd and periodic in $\omega$. Fig. 2 shows, for a given speech segment, the phase function modulo $2\pi$, $\Theta(\omega)$; its unwrapped version obtained by using the simple algorithm, $\varphi(\omega)$; and the final phase function, $\theta(\omega)$, after removing the linear phase component according to (19).
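The simple unwrapping method of Section 3.1.2 and the subsequent removal of the linear phase component can be sketched as below. This is an illustration under stated assumptions rather than the implementation evaluated in this paper: the function names and the synthetic test phase are hypothetical, and the correction term is taken to be the floor of the consecutive difference plus $\pi$, divided by $2\pi$ and scaled by $2\pi$.

```python
import math


def unwrap_simple(principal):
    """Subtract the running sum of detected 2*pi jumps from the principal phase."""
    unwrapped = [principal[0]]  # the text assumes this first sample is zero
    correction = 0.0
    for k in range(len(principal) - 1):
        difference = principal[k + 1] - principal[k]
        correction += 2.0 * math.pi * math.floor(
            (difference + math.pi) / (2.0 * math.pi))
        unwrapped.append(principal[k + 1] - correction)
    return unwrapped


def remove_linear_phase(unwrapped):
    """Subtract phi(pi) * omega_l / pi, with omega_l = pi * l / L."""
    L = len(unwrapped) - 1
    return [p - unwrapped[-1] * l / L for l, p in enumerate(unwrapped)]


# Synthetic check (hypothetical data): a three-sample delay plus a small
# odd, periodic term, wrapped into the principal range and then unwrapped.
L = 64
true_phase = [-3.0 * math.pi * l / L + 0.5 * math.sin(math.pi * l / L)
              for l in range(L + 1)]
principal = [math.atan2(math.sin(p), math.cos(p)) for p in true_phase]
unwrapped = unwrap_simple(principal)
theta = remove_linear_phase(unwrapped)
```

In this example the unwrapped phase follows the steep linear trend caused by the delay, and only the small sinusoidal term remains once the linear component is removed. The method relies on every true phase increment between neighboring frequency samples being smaller than $\pi$ in magnitude, which is why a large DFT size helps.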
[Figure 2: three panels, each plotting phase (rad) against frequency (Hz) from 0 to 8000 Hz.]

Fig. 2. Phase function before and after the phase unwrapping process. Top: phase function modulo $2\pi$ (principal value of the phase), $\Theta(\omega)$; middle: unwrapped (continuous) phase function, $\varphi(\omega)$; bottom: unwrapped phase function with the linear phase component removed, $\theta(\omega)$.
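The second unwrapping approach described in Section 3.1.2, integration of the phase derivative, can be sketched in the same spirit. The example below is a simplified illustration, not Tribolet's full adaptive algorithm: it uses a fixed number of frequency samples instead of increasing $L$ when the consistency check fails, it evaluates the spectrum directly rather than with an FFT, and the function name and toy signal are hypothetical.

```python
import cmath
import math


def unwrap_by_integration(s, L):
    """Unwrap the phase of s(n) on omega_l = pi * l / L, l = 0..L."""
    d_omega = math.pi / L
    phases = [0.0]  # the unwrapped phase at omega_0 is assumed to be zero
    prev_deriv = 0.0
    for l in range(L + 1):
        w = l * d_omega
        # Spectrum and its derivative with respect to omega.
        S = sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(s))
        dS = -1j * sum(n * v * cmath.exp(-1j * w * n) for n, v in enumerate(s))
        deriv = (S.real * dS.imag - S.imag * dS.real) / abs(S) ** 2
        if l > 0:
            # Trapezoidal estimate of the unwrapped phase at omega_l.
            estimate = phases[-1] + 0.5 * d_omega * (deriv + prev_deriv)
            # Snap to the principal value plus the nearest multiple of 2*pi.
            principal = cmath.phase(S)
            r = round((estimate - principal) / (2.0 * math.pi))
            phases.append(principal + 2.0 * math.pi * r)
        prev_deriv = deriv
    return phases


# Hypothetical toy signal: (1 - 0.5 z^-1) delayed by three samples.
signal = [0.0, 0.0, 0.0, 1.0, -0.5]
phases = unwrap_by_integration(signal, L=256)
```

The integration step only needs to land within $\pi$ of the true value, because the final phase is always the principal value shifted by an integer multiple of $2\pi$. This makes the estimate exact up to floating-point error whenever the integer is chosen correctly.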
Complex cepstrum without phase unwrapping. The complex cepstrum can also be calculated without performing phase unwrapping (Oppenheim, 2010; Bednar and Watt, 1985). One approach is to factorize the z-transform of the windowed speech segment into first-order terms using a polynomial rooting algorithm and then compute the complex cepstrum directly from the zeros of the polynomial (Oppenheim, 2010). However, this method is computationally costly, and numerical imprecision in the calculation of the roots of the polynomial may result in more distortion in the complex cepstrum than phase unwrapping algorithms produce. Another approach to computing the complex cepstrum without phase unwrapping is based on the derivative of the log spectrum (Oppenheim, 2010). However, initial experiments showed that this method yielded an aliased complex cepstrum of far lower quality than the complex cepstrum obtained according to (10).

3.1.3. Interpolation of pitch-synchronous features

Once pitch-synchronous spectra have been calculated, frame-based complex cepstra can be obtained through linear interpolation as follows:

\[ f_t = \frac{(p_i - t)\, f_{p_{i-1}} + (t - p_{i-1})\, f_{p_i}}{p_i - p_{i-1}}, \quad t = 0, N-1, 2N-1, 3N-1, \ldots \tag{20} \]

where $f$ is the feature vector being interpolated, $t$ is the index of the sample to be obtained through interpolation, $N$ is the frame size, and $p_{i-1}$ and $p_i$ are respectively the pitch positions before and after $t$. Interpolation is possible in two different domains:

1. spectral domain: $|S(e^{j\omega})|$, $\theta(\omega)$;
2. cepstral domain: $\hat{s}(n)$.

In the absence of a significant difference between the interpolation methods discussed above, interpolation of amplitude and phase spectra (spectral domain) has an advantage in practical terms because it yields frame-based amplitude spectra. This enables other spectral parameterizations for the synthesis filter to be used, as discussed in Section 3.3.

3.2. Complex cepstrum-based phase features for HMM-TTS

3.2.1. Minimum-phase/all-pass decomposition

It is assumed that a speech segment $s(n)$ is selected by an appropriate pitch-synchronous window function that covers two pitch periods, as depicted in Fig. 1. This segment can be represented as the convolution of its minimum-phase, $s_m(n)$, and all-pass, $s_a(n)$, components (Deller et al., 2000):

\[ s(n) = s_m(n) * s_a(n). \tag{21} \]

In the cepstral domain, the relationship above becomes a sum: $\hat{s}(n) = \hat{s}_m(n) + \hat{s}_a(n)$. Assuming that $s(n)$ is produced by passing an excitation signal $e(n)$ through a synthesis filter with impulse response $h(n)$, and that $e(n) = \delta(n)$, then $\hat{s}(n) \approx \hat{h}(n)$, where $\hat{h}(n)$ is a truncated version of $\hat{s}(n)$ (see Section 2.2). $\hat{h}(n)$ is hereby defined as the complex cepstrum
of $s(n)$. In this case $\hat{h}(n)$ can be represented as the sum of a minimum-phase cepstrum, $\hat{h}_m(n)$, and an all-pass cepstrum, $\hat{h}_a(n)$: $\hat{h}(n) = \hat{h}_m(n) + \hat{h}_a(n)$. The minimum-phase cepstrum, $\hat{h}_m(n)$, is a causal sequence and can be obtained from the complex cepstrum, $\hat{h}(n)$, as follows (Deller et al., 2000):

\[ \hat{h}_m(n) = \begin{cases} 0, & n = -C, \ldots, -1, \\ \hat{h}(n), & n = 0, \\ \hat{h}(n) + \hat{h}(-n), & n = 1, \ldots, C, \end{cases} \tag{22} \]

where $C$ is the cepstral order. The all-pass cepstrum $\hat{h}_a(n)$ can be simply retrieved from the complex and minimum-phase cepstra as

\[ \hat{h}_a(n) = \hat{h}(n) - \hat{h}_m(n), \quad n = -C, \ldots, C. \tag{23} \]

By substituting (22) into (23) it is clear that the all-pass cepstrum $\hat{h}_a(n)$ is non-causal and anti-symmetric, and only depends on the non-causal part of $\hat{h}(n)$:

\[ \hat{h}_a(n) = \begin{cases} \hat{h}(n), & n = -C, \ldots, -1, \\ 0, & n = 0, \\ -\hat{h}(-n), & n = 1, \ldots, C. \end{cases} \tag{24} \]

Therefore, $\{\hat{h}(-C), \ldots, \hat{h}(-1)\}$, and consequently the all-pass cepstrum $\hat{h}_a(n)$, can be interpreted as carrying the extra phase information which is not usually taken into account in conventional source-filter models based on minimum-phase filter impulse responses. The all-pass cepstrum is related to the shape of the glottal pulse since it contains the mixed-phase information of the speech signal which is left after the minimum-phase component of the signal is removed.

3.2.2. Warping

Frequency warping can be applied to the all-pass cepstrum in the same way as in the minimum-phase cepstrum case. The frequency response of the all-pass filter component $h_a(n)$ becomes

\[ H_a\left(e^{j\omega_l}\right) = \exp \sum_{n=-C}^{C} \tilde{\hat{h}}_a(n)\, e^{-j\tilde{\omega}_l n}, \tag{25} \]

where $\tilde{\hat{h}}_a(n)$ is the warped all-pass cepstrum, and $\{\tilde{\omega}_0, \ldots, \tilde{\omega}_L\}$ are the angular frequencies on the warped axis, which can be defined as the phase response of an all-pass system (Tokuda et al., 1994):

\[ \tilde{\omega} = \tan^{-1} \frac{(1 - \alpha^2) \sin\omega}{(1 + \alpha^2) \cos\omega - 2\alpha}, \quad |\alpha| < 1. \tag{26} \]

The constant $\alpha$ controls the intensity of warping. Because of its symmetric properties, $\hat{h}_a(n)$ can be warped using the recursive formula presented in (Tokuda et al., 1994) by using only its causal portion $\{\hat{h}_a(0), \ldots, \hat{h}_a(C)\}$.

3.2.3. Phase features for statistical parametric synthesis

To integrate the all-pass cepstrum $\hat{h}_a(n)$ into the statistical parametric speech synthesis framework, it is converted into phase parameters, $\phi(n)$. These parameters can be defined as the causal part of $\hat{h}_a(n)$,

\[ \phi(n) = \hat{h}_a(n+1), \quad n = 0, \ldots, C_a - 1, \tag{27} \]

where $C_a < C$ is the dimension of the phase parameters. The definition of these phase features is an important step because it does not restrict the use of the cepstrum as spectral parameters for the synthesis filter, as discussed in Section 3.3. When warping is applied, the phase features $\{\phi(0), \ldots, \phi(C_a - 1)\}$ are still defined according to (27), with $\hat{h}_a(n)$ replaced by $\tilde{\hat{h}}_a(n)$.

3.3. Waveform generation using the complex cepstrum in statistical parametric synthesis

3.3.1. Phase information incorporation using a glottal filter

To incorporate the complex cepstrum into the waveform synthesis stage, the time-domain process shown in Fig. 3 is used. The generated $\ln F_0$ and phase parameters $\phi(n)$ are used to derive the pulse train $t(n)$ and the all-pass filter impulse response $h_a(n)$, respectively. This pulse train is passed through $H_a(z)$ to yield the pulse sequence $t_p(n)$. Finally, the speech signal $s(n)$ is synthesized by convolving $e(n)$ with $h_m(n)$, the impulse response of the minimum-phase synthesis filter, $H_m(z)$. $h_m(n)$ in this case is causal and obtained from spectral parameters, as in conventional statistical parametric synthesizers.

Fig. 3. Synthesis: the phase parameters are used to include glottal pulse information in the pulse train through the all-pass filter $H_a(z)$.

The impulse response of the non-causal glottal filter, $h_a(n)$, is obtained from the phase features $\phi(n)$. By using the relationship in (27), and by considering the inverse complex cepstrum operation according to Eqs. (7) and (8), the phase and impulse responses of the glottal filter $H_a(z)$ become, respectively,

\begin{align}
\theta_a(\tilde{\omega}_l) &= -2 \sum_{n=0}^{C_a - 1} \phi(n) \sin\left(\tilde{\omega}_l (n+1)\right), \tag{28}\\
h_a(n) &= \frac{1}{2L+1} \left\{ 1 + 2 \sum_{l=1}^{L} \cos\left(\omega_l n + \theta_a(\tilde{\omega}_l)\right) \right\}, \tag{29}
\end{align}

for $n = -P_a, \ldots, P_a$, where $P_a$ is the order of the glottal filter impulse response, and $\{\tilde{\omega}_0, \ldots, \tilde{\omega}_L\}$ are the $L+1$ frequencies on a warped scale obtained from $\{\omega_0, \ldots, \omega_L\}$ according to (26). Proofs of Eqs. (28) and (29) can be found in Appendices A.1 and A.2, respectively. Fig. 4 shows examples of $\theta_a(\tilde{\omega}_l)$ and $h_a(n)$ derived from generated phase parameters for a male and a female speaker, for the case where no warping is used ($\alpha = 0.0$). In both cases the resemblance of the impulse response $h_a(n)$ to a glottal pulse can be noticed. Fig. 5 shows the impact of the dimension $C_a$ on the re-synthesized speech for a segment with high $F_0$ when $\alpha = 0.42$. Signal-to-noise ratios for the segments shown in Fig. 5 are 5.887 dB, 8.851 dB, and 5.797 dB for $C_a = 39$, $C_a = 19$, and $C_a = 0$, respectively.
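The chain from complex cepstrum to glottal filter described in Sections 3.2 and 3.3.1 can be sketched as follows. The sketch assumes no frequency warping ($\alpha = 0$, so the warped and linear frequencies coincide), takes the phase features to be the causal all-pass coefficients themselves, and uses the resulting sign convention for the phase response; the function names and the toy cepstrum are hypothetical.

```python
import math


def split_min_phase_all_pass(h_hat, C):
    """Split a cepstrum given as {n: value} for n in [-C, C] into two parts."""
    h_min = {n: 0.0 for n in range(-C, 0)}
    h_min[0] = h_hat[0]
    for n in range(1, C + 1):
        h_min[n] = h_hat[n] + h_hat[-n]
    h_all = {n: h_hat[n] - h_min[n] for n in range(-C, C + 1)}
    return h_min, h_all


def phase_features(h_all, C_a):
    """Keep the first C_a causal all-pass cepstral coefficients."""
    return [h_all[n + 1] for n in range(C_a)]


def glottal_filter(phi, L, P_a):
    """Phase response on omega_l = pi * l / L and impulse response on [-P_a, P_a]."""
    omegas = [math.pi * l / L for l in range(L + 1)]
    theta = [-2.0 * sum(p * math.sin(w * (n + 1)) for n, p in enumerate(phi))
             for w in omegas]
    h_a = {n: (1.0 + 2.0 * sum(math.cos(omegas[l] * n + theta[l])
                               for l in range(1, L + 1))) / (2 * L + 1)
           for n in range(-P_a, P_a + 1)}
    return theta, h_a


# Hypothetical second-order complex cepstrum used only for illustration.
h_hat = {-2: 0.1, -1: -0.3, 0: 0.5, 1: 0.2, 2: 0.05}
h_min, h_all = split_min_phase_all_pass(h_hat, C=2)
phi = phase_features(h_all, C_a=2)
theta, h_a = glottal_filter(phi, L=64, P_a=8)

# With all phase features set to zero the filter is close to a unit impulse.
flat_theta, flat_h_a = glottal_filter([0.0, 0.0], L=64, P_a=8)
```

The all-pass part produced by the split is anti-symmetric and depends only on the anti-causal half of the input, so the phase features carry exactly the information that a minimum-phase model discards. When the features are all zero the impulse response reduces to approximately a unit impulse, which corresponds to the conventional zero-phase pulse excitation.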
3.3.2. Synthesis under a mixed excitation framework

The system of Fig. 3 only implements glottal pulses in the excitation signal $e(n)$. To create mixed excitation, additional parameters are necessary for controlling the combination of the pulse with noise excitation. These parameters are typically based on aperiodicity measures such as those described in (Kawahara et al., 2001). Aperiodicity parameters indicate, for each sampled frequency $\omega_l$, the amount of pulse and noise that should be used to compose the excitation signal, i.e.,

\[ E\left(e^{j\omega_l}\right) = \left(1 - a(\omega_l)\right) T\left(e^{j\omega_l}\right) + a(\omega_l)\, W\left(e^{j\omega_l}\right), \quad l = 0, \ldots, L, \tag{30} \]

where $a(\omega_l)$ is the aperiodicity measure at frequency $\omega_l$, and $E(e^{j\omega_l})$, $T(e^{j\omega_l})$ and $W(e^{j\omega_l})$ are respectively the excitation, a delta pulse, and white noise in the frequency domain.

In this work synthesis with mixed excitation is implemented in the time domain. The aperiodicity parameters $\{a(\omega_0), \ldots, a(\omega_L)\}$ are used to derive the voiced and unvoiced filter impulse responses, $h_v(n)$ and $h_u(n)$, respectively. Using the fact that these filters have zero-phase response, i.e., $H_v(e^{j\omega_l}) = |H_v(e^{j\omega_l})| = 1 - a(\omega_l)$ and $H_u(e^{j\omega_l}) = |H_u(e^{j\omega_l})| = a(\omega_l)$, and taking the inverse DFT, their impulse responses can be obtained from $\{a(\omega_0), \ldots, a(\omega_L)\}$ as

\begin{align}
h_v(n) &= \frac{1}{2L+1} \left\{ 1 - a(\omega_0) + 2 \sum_{l=1}^{L} \left(1 - a(\omega_l)\right) \cos(\omega_l n) \right\}, \tag{31}\\
h_u(n) &= \frac{1}{2L+1} \left\{ a(\omega_0) + 2 \sum_{l=1}^{L} a(\omega_l) \cos(\omega_l n) \right\}. \tag{32}
\end{align}

The mixed excitation signal $e(n)$ is formed by passing $t_p(n)$ and $w(n)$ through $H_v(z)$ and $H_u(z)$, respectively, as shown in Fig. 6. The phase features $\phi(n)$ could also be implemented in the frequency domain, as in the vocoding method employed in (Zen et al., 2005). In this case the phase response $\theta_a(\omega)$ replaces the zero-phase response of the pulse $T(e^{j\omega})$. However, this approach is not investigated in this work.

4. Experiments

4.1. Corpora

Two speech databases, one comprising 4994 sentences spoken by a female speaker and another comprising 1421 sentences spoken by a male speaker, were utilized to train statistical parametric synthesizers. Both databases were recorded by American English speakers at 48 kHz and down-sampled to 16 kHz. The advantage of using waveforms with a sample rate of 16 kHz is that it allows a better comparison with other statistical parametric systems that have already been reported, e.g. (Zen et al., 2005). Pitch period onset times were detected from the natural speech waveforms using a proprietary tool. The algorithm utilizes the phase spectrum of the input signal to detect the glottal closure instants. Although the calculation of the complex cepstrum is very much dependent on having accurate pitch marks, we do not include here any evaluation of the accuracy of the pitch marks we used, nor any demonstration of how pitch marking accuracy influences results generally.
[Figure 4: four panels. Phase responses (rad) are plotted against frequency (Hz) from 0 to 8000 Hz, and impulse responses $h_a(n)$ are plotted against $n$ from -60 to 60.]

Fig. 4. Examples of phase and impulse responses of the filter $H_a(z)$ for a given speech frame, obtained from generated phase parameters, $\phi(n)$, according to (28) and (29), for $C_a = 19$, $\alpha = 0.0$, $L = 4096$, and $P_a = 64$. From top to bottom: phase response for a female speaker, corresponding impulse response, phase response for a male speaker and corresponding impulse response.
[Figure 5: eight waveform panels in two columns, each plotted over sample indices 0 to 500.]

Fig. 5. Impact of the dimension of the phase parameter $\phi(n)$ on the excitation signal and re-synthesized speech. From top to bottom: residual (left), natural speech (right); excitation signal for $C_a = 39$ (left), corresponding re-synthesized speech (right); excitation signal for $C_a = 19$ (left), corresponding re-synthesized speech (right); excitation signal for $C_a = 0$ (left) and corresponding re-synthesized speech (right). In all cases the glottal filter order is $P_a = 64$ and the warping factor is $\alpha = 0.42$. Signal-to-noise ratios for the synthetic segments are 5.887 dB, 8.851 dB, and 5.797 dB for $C_a = 39$, $C_a = 19$, and $C_a = 0$, respectively.
Fig. 6. Synthesis with mixed excitation: the voiced and unvoiced filter coefficients, $H_v(z)$ and $H_u(z)$, respectively, are derived from the aperiodicity parameters $\{a(\omega_0), \ldots, a(\omega_L)\}$.

Phonetic labels and segmentation for both voices were manually corrected after automatic annotation and alignment according to the procedure described in (Buchholz et al., 2007).

4.2. Selection of the best frame-based complex cepstrum computation method

Different methods of extracting frame-based complex cepstra from speech were evaluated. The methods differ according to the phase unwrapping algorithm utilized and whether the frame-based interpolation of pitch-synchronous features is conducted in the spectral or cepstral domain. The complex cepstrum extraction methods evaluated are summarized in Table 1.

Table 1
Frame-based complex cepstrum extraction methods compared. "simple" and "integration" stand, respectively, for the simple algorithm and the integration of the phase derivative methods described in Section 3.1.2.

Method          Phase unwrapping   Interpolation domain
Method 1 (M1)   simple             Amplitude and phase spectra
Method 2 (M2)   simple             Complex cepstrum
Method 3 (M3)   integration        Amplitude and phase spectra
Method 4 (M4)   integration        Complex cepstrum
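To make the role of phase unwrapping concrete, the following Python sketch computes a complex cepstrum for a single segment: it takes the FFT, unwraps the phase, removes the linear-phase term, and inverse-transforms the complex log spectrum. It is only a minimal sketch under stated assumptions. NumPy's `np.unwrap` stands in for a phase unwrapping step and is not claimed to be the simple algorithm of Section 3.1.2, the pitch-synchronous interpolation of Table 1 is omitted, and the function name `complex_cepstrum` and the constant `eps` are hypothetical.

```python
import numpy as np

def complex_cepstrum(segment, n_fft=8192, eps=1e-12):
    """Complex cepstrum of one segment (illustrative sketch, not the paper's code)."""
    spectrum = np.fft.fft(segment, n_fft)           # zero-padded FFT
    log_amplitude = np.log(np.abs(spectrum) + eps)  # eps guards against log(0); assumption
    phase = np.unwrap(np.angle(spectrum))           # stand-in unwrapping step
    # Remove the linear-phase term so that the phase is zero at omega = pi.
    half = n_fft // 2
    delay = np.round(phase[half] / np.pi)
    phase = phase - np.pi * delay * np.arange(n_fft) / half
    # The inverse FFT of the complex log spectrum gives the complex cepstrum.
    return np.fft.ifft(log_amplitude + 1j * phase).real

# A minimum-phase sequence has a purely causal cepstrum,
# while a maximum-phase one has a purely anti-causal cepstrum.
c_min = complex_cepstrum(np.array([1.0, 0.5]))
c_max = complex_cepstrum(np.array([0.5, 1.0]))
```

Negative quefrencies are stored at the end of the returned array, so `c_max[-1]` holds the anti-causal coefficient at index -1.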
Fifty sentences for each speaker were randomly selected from the corpora described in Section 4.1. The speech signals were analyzed and re-synthesized using frame-based complex cepstra. For synthesis, the system engine shown in Fig. 3 was used. The complex cepstrum order $C$ was set to 39. For spectral analysis, an 8192-point fast Fourier transform (FFT), i.e. $L = 4096$, was taken from each speech segment. The filter was driven by a simple excitation signal (unit pulses for the voiced regions and white noise for the unvoiced regions) constructed from the extracted $F_0$.

Two objective measures were used to evaluate the distortion between natural and re-synthesized speech. The first one was the segmental signal-to-noise ratio (SNRseg)

$$\mathrm{SNRseg\,(dB)} = \frac{10}{T \ln 10} \sum_{t=0}^{T-1} \ln \left[ \frac{\sum_{n=tN}^{(t+1)N-1} s^2(n)}{\sum_{n=tN}^{(t+1)N-1} \left[ s(n) - \tilde{s}(n) \right]^2} \right], \tag{33}$$

where $T$ is the number of frames, $t$ the frame index, $s(n)$ the natural speech signal, and $\tilde{s}(n)$ the synthesized speech signal. The second measure was the mean log spectral distance (LSD) given by

$$\mathrm{LSD} = \frac{1}{T} \sum_{t=0}^{T-1} \sqrt{ \frac{1}{2K+1} \left\{ \left[ \ln \frac{|S_t(e^{J\omega_0})|}{|\tilde{S}_t(e^{J\omega_0})|} \right]^2 + 2 \sum_{k=1}^{K} \left[ \ln \frac{|S_t(e^{J\omega_k})|}{|\tilde{S}_t(e^{J\omega_k})|} \right]^2 \right\} }. \tag{34}$$
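The two measures (33) and (34) can be sketched in Python as follows. This is an illustrative sketch rather than the evaluation code used for the experiments: the function names and the guard constant `eps` are hypothetical, the window length of 400 samples assumes that a frame shift of 80 samples corresponds to 5 ms (16 kHz sampling), and (34) is read with $2K+1$ as the normalizing term.

```python
import numpy as np

def segmental_snr_db(natural, synthetic, frame_len=80, eps=1e-12):
    """Segmental SNR in dB over non-overlapping frames of frame_len samples, as in (33)."""
    natural = np.asarray(natural, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    n_frames = min(len(natural), len(synthetic)) // frame_len
    log_ratios = []
    for t in range(n_frames):
        frame = slice(t * frame_len, (t + 1) * frame_len)
        signal_energy = np.sum(natural[frame] ** 2)
        error_energy = np.sum((natural[frame] - synthetic[frame]) ** 2)
        # eps avoids division by zero for silent or identical frames (assumption).
        log_ratios.append(np.log((signal_energy + eps) / (error_energy + eps)))
    return 10.0 / np.log(10.0) * float(np.mean(log_ratios))

def log_spectral_distance(natural, synthetic, frame_shift=80, win_len=400,
                          n_fft=1024, eps=1e-12):
    """Mean log spectral distance with Blackman windows, as in (34)."""
    natural = np.asarray(natural, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    window = np.blackman(win_len)
    K = n_fft // 2  # rfft returns K + 1 samples between 0 and pi
    n_frames = 1 + (min(len(natural), len(synthetic)) - win_len) // frame_shift
    distances = []
    for t in range(n_frames):
        start = t * frame_shift
        S = np.abs(np.fft.rfft(window * natural[start:start + win_len], n_fft))
        S_tilde = np.abs(np.fft.rfft(window * synthetic[start:start + win_len], n_fft))
        d = np.log((S + eps) / (S_tilde + eps))
        # The omega_0 term is counted once, the remaining K terms twice.
        total = d[0] ** 2 + 2.0 * np.sum(d[1:] ** 2)
        distances.append(np.sqrt(total / (2 * K + 1)))
    return float(np.mean(distances))

rng = np.random.default_rng(0)
s = rng.standard_normal(1600)
snr = segmental_snr_db(np.ones(160), 0.9 * np.ones(160))
lsd_scaled = log_spectral_distance(s, 0.5 * s)
```

A pure gain change leaves the phase untouched, so the LSD of a signal against a half-amplitude copy reduces to $\ln 2$, while the SNRseg depends only on the waveform error.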
In (34), $\{S_t(e^{J\omega_0}), \ldots, S_t(e^{J\omega_K})\}$ and $\{\tilde{S}_t(e^{J\omega_0}), \ldots, \tilde{S}_t(e^{J\omega_K})\}$ are samples between $\omega_0 = 0$ and $\omega_K = \pi$ of the DFT of the $t$-th frame of natural and re-synthesized speech, respectively. The analysis conditions for the measures were as follows. The frame shift was $N = 80$ samples. For the LSD, the input speech was windowed every 5 ms using 25-ms Blackman windows. Finally, a 1024-point FFT was taken, i.e., $K = 512$.

Table 2 shows the SNRseg and LSD results. As reference, measures for analysis-synthesis using the minimum-phase cepstrum derived from interpolated amplitude spectra are also shown. Minimum-phase cepstrum synthesis was obtained by ignoring the phase parameter $\phi(n)$ at synthesis time. From the results it can be seen that the use of the complex cepstrum produces waveforms that are closer to their natural versions. This is most clearly highlighted by the difference in terms of SNRseg. In terms of LSD the results were similar since this measure ignores the phase spectra of the speech signal. Among the complex cepstrum computation methods, although Method 2 presented the highest SNRseg for both speakers, the difference is not large. Therefore, Method 1 was selected as the phase unwrapping process is simpler. Additionally, the interpolation of pitch-synchronous spectra results in frame-based amplitude spectra, which can be used to derive other types of spectral parameters for the synthesis filter, such as mel-generalized cepstral coefficients (Tokuda et al., 1994).

Table 2
SNRseg (dB) and LSD results for the frame-based complex cepstrum extraction methods summarized in Table 1. MP denotes speech synthesized using the minimum-phase cepstrum (no phase parameter).

(a) Female voice
          MP       M1      M2      M3      M4
SNRseg    -2.565   1.697   1.705   1.688   1.697
LSD       3.389    3.422   3.461   3.423   3.462

(b) Male voice
          MP       M1      M2      M3      M4
SNRseg    1.527    0.680   0.712   0.678   0.707
LSD       3.581    3.497   3.428   3.411   3.432

4.3. Feature extraction and HSMM training

Frame-based complex cepstra were extracted through Method 1 as described in Section 4.2. For spectral analysis, an 8192-point FFT, i.e. $L = 4096$, was taken from each speech segment. Interpolated spectra were then converted into complex cepstra with order $C = L = 4096$ using (10). Each
complex cepstrum coefficient set was then decomposed into its minimum-phase and all-pass components as described in Section 3.2. The minimum-phase and all-pass cepstra were warped to yield 40 mel-cepstral coefficients and 19 phase features ($C_a = 19$), respectively, by using the recursive formula shown in (Tokuda et al., 1994) with $\alpha = 0.42$. The choice of 19 phase parameters was made according to informal listening tests using re-synthesized versions of the speech signal for $C_a = 19$ and $C_a = 39$: no audible difference could be noticed.

The aperiodicity coefficients were computed by using the amplitude spectrum of the voiced and unvoiced components of the speech signal, calculated using the pitch-scaled harmonic filter (Jackson and Shadle, 2001). After that, band-aperiodicity parameters were computed as

$$b(n) = \frac{\sum_{\omega_l \in \Omega_n} a(\omega_l)\, |S(e^{J\omega_l})|}{\sum_{\omega_l \in \Omega_n} |S(e^{J\omega_l})|}, \quad n = 0, \ldots, B-1, \tag{35}$$

where $\Omega_n$ is the $n$-th frequency band for which the $n$-th band-aperiodicity coefficient, $b(n)$, is calculated, and $B = 22$ is the number of bands. The frequency sub-band configuration was chosen according to the Bark critical frequencies (Yamagishi and Watts, 2010).

The observation vectors for statistical modeling were composed of six streams arranged as follows:

stream 1: 40 mel-cepstral coefficients, plus delta and delta-delta;
streams 2, 3, 4: $\ln F_0$, $\Delta \ln F_0$ and $\Delta\Delta \ln F_0$, respectively;
stream 5: 22 band-aperiodicity coefficients, plus delta and delta-delta;
stream 6: 19 phase parameters, plus delta and delta-delta.

This is similar to the configuration of the statistical parametric synthesizer presented in (Zen et al., 2005) with an additional stream of phase parameters. The observation vectors were used to train HSMMs with 5 states and a left-to-right no-skip topology. All the streams were independently clustered using the decision tree method.
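The band-aperiodicity computation in (35) amounts to an amplitude-weighted average of the aperiodicity coefficients within each band. The Python sketch below illustrates this reading. The function name, the argument layout and the `band_edges` representation (DFT bin indices delimiting each band) are hypothetical, and the toy band edges in the usage lines are not the Bark configuration used in the experiments.

```python
import numpy as np

def band_aperiodicity(aperiodicity, amplitude, band_edges):
    """Amplitude-weighted mean of a(omega_l) in each band, following (35).

    aperiodicity : a(omega_l) sampled on the DFT grid
    amplitude    : |S(e^{J omega_l})| on the same grid
    band_edges   : B + 1 increasing bin indices; band n covers
                   bins band_edges[n] .. band_edges[n + 1] - 1 (hypothetical layout)
    """
    aperiodicity = np.asarray(aperiodicity, dtype=float)
    amplitude = np.asarray(amplitude, dtype=float)
    n_bands = len(band_edges) - 1
    b = np.empty(n_bands)
    for n in range(n_bands):
        lo, hi = band_edges[n], band_edges[n + 1]
        weights = amplitude[lo:hi]
        b[n] = np.sum(aperiodicity[lo:hi] * weights) / np.sum(weights)
    return b

# Toy example with two bands of two bins each (illustrative values only).
b = band_aperiodicity([0.0, 1.0, 0.5, 0.5], [1.0, 3.0, 2.0, 2.0], [0, 2, 4])
```

Because the weights are the spectral amplitudes, bins that carry more energy dominate the band value, so a band whose aperiodicity is constant simply returns that constant.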
During training, weights were set to zero for the streams of band-aperiodicity coefficients and phase parameters so that they could not influence the model segmentation. At synthesis time, parameter generation with global variance (Toda and Tokuda, 2007) was used for all the streams except the phase parameter stream, since experiments have shown that the smoothing effect of statistical parametric synthesis is beneficial for the phase parameters. Three voices were built using the corpora described in Section 4.1: a female voice trained on 4994 utterances (Female-4994); a male voice trained on 1421 utterances (Male-1421); and a female voice trained on 1421 utterances (Female-1421) randomly selected from the full corpus. The purpose of building Female-1421 was to compare it in terms of subjective test results with Male-1421 (see Section 4.4).
Fig. 7. Results of the subjective tests in terms of MOS for the female and male voices according to each synthesis mode (simple, mixed, and mixed plus phase). The circles and squares indicate the means while the error bars represent the 95% confidence intervals. Female-4994, Female-1421 and Male-1421 denote, respectively, the female voice trained on 4994 utterances, the female voice trained on 1421 utterances, and the male voice trained on 1421 utterances.
4.4. Subjective listening tests

To investigate the impact of the phase features on the subjective quality of synthesized speech, subjective tests were conducted with 51 open sentences, for both male and female systems, using the Amazon Mechanical Turk (2012). The first test collected the mean opinion score (MOS) on the quality of the test samples synthesized using three modes according to the parameters utilized to generate the excitation signal:

simple: excitation signal constructed with $\ln F_0$ using the system of Fig. 3 with $H_a(z) = 1$;
mixed: excitation signal constructed with $\ln F_0$ and band-aperiodicity coefficients using the system of Fig. 6 with $H_a(z) = 1$;
mixed with phase: excitation signal constructed with $\ln F_0$, band-aperiodicity coefficients and phase features using the system of Fig. 6.

In total, 92, 86 and 100 subjects took part in the listening tests for Female-4994, Female-1421, and Male-1421, respectively. The subjects, who were not speech experts, were instructed to use headphones and rate the quality of the synthetic speech according to the following scale: 1: very bad; 2: bad; 3: average; 4: good; 5: very good. Subjects who did not complete at least 10 judgments were excluded.

Fig. 7 shows the MOS and corresponding 95% confidence intervals for the three voices. According to these results, the simple mode is the one with the lowest quality, while there is no difference between the mixed and mixed with phase modes. This will be discussed later. Another interesting point to note is that, for the same amount of data, the female voice achieves a higher MOS. This could be a consequence of the subjects' preference towards female voices, since the speech synthesized by both systems presents no major distortions according to informal listening tests.

It is usually difficult to track the subjects' commitment to the tests conducted through the Amazon Mechanical Turk. On the other hand, ABX preference tests usually offer better ways to cope with this problem. For instance, one can detect cheating by the subject's consistent choice for the second sample or by the average disagreement coefficient (Buchholz and Latorre, 2011). Based on this, and in order to investigate the impact of the phase feature on the synthesized speech, preference tests were conducted between the mixed and mixed with phase modes. The subjects were asked to use headphones and were given the instruction: Indicate which of the sound files is better. In total, the opinions of 50, 63 and 62 listeners were considered in the preference tests for Female-4994, Female-1421 and Male-1421, respectively. Each sentence was judged by at least 24 listeners. The total number of stimuli considered for each voice was respectively 786, 564 and 919, out of 1225. The stimuli were filtered according to empirical methods to detect cheating (Buchholz and Latorre, 2011).

The results shown in Table 3 indicate a strong preference for the system with phase features for all the voices. The p-values, all smaller than 0.005, indicate a significant preference for the proposed system. Comparing the results obtained for Female-1421 and Male-1421 shows that the impact of the phase parameter was larger for the male voice. However, the largest impact of $\phi(n)$ on the synthesized speech occurs for the female voice trained on the full corpus. This could be a consequence of training the system on a fairly large database, which results in better estimation of the statistical models.

Another interesting fact to note from the preference test results is the impact of the frequency warping of the phase parameter $\phi(n)$. The results shown here are stronger than those reported in (Maia et al., 2012), where no warping was used.⁶ The experiments in (Maia et al., 2012) were conducted on the female voice trained on 4994 utterances.
The results showed an average preference of 35.5% and 43.3%, respectively, for the modes mixed and mixed with phase, whereas on average in 21.3% of the judgments the quality of the speech synthesized by these modes was considered to be the same. Informal listening tests indeed confirm the impact of the phase parameter on the synthesized speech when warping is utilized.

Fig. 8 shows some examples of excitation signals and the resulting synthetic waveforms for a segment of the phone /E/, using the Speech Assessment Methods Alphabet (SAMPA, 2012). The synthetic waveforms were generated from voice Female-4994. The excitation signal with phase information results in a speech waveform that is closer to the natural one when compared with the waveform generated without phase information.

⁶ Most likely the subjects who took part in the two tests are completely different.

Fig. 8. Speech segment corresponding to the phone /E/ (SAMPA, 2012), as pronounced in the second e of the word Lafreniere. From top to bottom: residual signal, natural speech, excitation signal without phase information, corresponding synthetic speech, excitation signal with phase information, and corresponding synthetic speech.

Table 3
Results of the preference test (%) between the synthesis modes mixed and mixed with phase for each voice. The highest average preference for each voice is that of the mixed with phase mode. Female-4994, Female-1421 and Male-1421 denote, respectively, the female voice trained on 4994 utterances, the female voice trained on 1421 utterances, and the male voice trained on 1421 utterances.

Voice         Mixed (Fig. 6, $H_a(z) = 1$)   Mixed with phase (Fig. 6)   None   p-value
Female-4994   27.8                           62.1                        10.1   0.0
Female-1421   39.4                           55.0                        5.7    0.0
Male-1421     33.0                           59.0                        8.0    0.0
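Table 3 reports p-values without naming the statistical test behind them. One plausible way to obtain such a value, assumed here purely for illustration, is an exact two-sided sign (binomial) test on the judgments that expressed a preference, with the "None" judgments discarded. The function name and the toy counts below are hypothetical and are not taken from the listening tests.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided binomial test of H0: both systems are equally preferred.

    wins_a, wins_b : numbers of judgments preferring each system
                     (ties are assumed to have been removed beforehand).
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability, under p = 0.5, of a split at least as lopsided as k out of n.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)

# Toy counts for illustration only.
p_small_sample = sign_test_p_value(8, 2)
p_balanced = sign_test_p_value(5, 5)
```

With only ten judgments an 8-to-2 split is not significant at the 0.005 level, which shows why preference tests of this kind rely on several hundred stimuli per voice.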
5. Conclusions

An approach to modeling glottal pulse shape in statistical parametric speech synthesizers through the use of the complex cepstrum has been presented. At the parameter extraction stage, the interpolation of pitch-synchronous magnitude and phase spectra has been shown to be effective for obtaining frame-based complex cepstra. For acoustic modeling, the minimum-phase/all-pass decomposition of complex cepstra was used. The all-pass component, which is related to the non-causal part of the complex cepstrum, can be viewed as glottal parameters and modeled as a separate stream of information for hidden semi-Markov modeling. At synthesis time these glottal parameters were used to implement a glottal filter. Experimental results under the framework of band-aperiodicity-based mixed excitation show that the use of the proposed complex-cepstrum-based glottal parameter significantly increases synthetic speech quality, at the cost of additional computation at the feature extraction stage due to the increased FFT order and the phase unwrapping procedure. The proposed approach moves beyond the simplified source-filter model that has been applied to statistical parametric synthesis thus far.

Appendix A. Proof of Eqs. (28) and (29)

A.1. Proof of Eq. (28)

To simplify the notation, frequency warping is dropped. The complex spectrum of the all-pass filter, $H_a(e^{J\omega})$, can be obtained from the all-pass component of the complex cepstrum, $\hat{h}_a(n)$, through

$$H_a(e^{J\omega}) = \exp \sum_{n=-\infty}^{\infty} \hat{h}_a(n)\, e^{-J\omega n}. \tag{A.1}$$

Since $H_a(e^{J\omega})$ is all-pass, then

$$H_a(e^{J\omega}) = |H_a(e^{J\omega})|\, e^{J\theta_a(\omega)} = e^{J\theta_a(\omega)}, \tag{A.2}$$

because $|H_a(e^{J\omega})| = 1, \forall \omega$. Considering that the all-pass cepstrum is truncated between $n = -C_a$ and $n = C_a$, and after substituting (A.2) into (A.1), then

\begin{align}
J\theta_a(\omega) &= \sum_{n=-C_a}^{C_a} \hat{h}_a(n)\, e^{-J\omega n} \tag{A.3} \\
&= \sum_{n=-C_a}^{-1} \hat{h}_a(n)\, e^{-J\omega n} + \hat{h}_a(0) + \sum_{n=1}^{C_a} \hat{h}_a(n)\, e^{-J\omega n} \tag{A.4} \\
&= \hat{h}_a(-C_a)\, e^{J\omega C_a} + \cdots + \hat{h}_a(0) + \cdots + \hat{h}_a(C_a)\, e^{-J\omega C_a}. \tag{A.5}
\end{align}

But $\hat{h}_a(0) = 0$ and $\hat{h}_a(-n) = -\hat{h}_a(n)$. Then

\begin{align}
J\theta_a(\omega) &= \sum_{n=1}^{C_a} \hat{h}_a(n) \left( e^{-J\omega n} - e^{J\omega n} \right) \notag \\
&= -2J \sum_{n=1}^{C_a} \hat{h}_a(n) \sin(\omega n) \notag \\
\Rightarrow\ \theta_a(\omega) &= -2 \sum_{n=1}^{C_a} \hat{h}_a(n) \sin(\omega n) \tag{A.6}
\end{align}

and finally

$$\theta_a(\omega) = 2 \sum_{n=0}^{C_a - 1} \phi(n) \sin(\omega (n+1)), \tag{A.7}$$

since $\phi(n) = -\hat{h}_a(n+1)$, $n = 0, \ldots, C_a - 1$.

A.2. Proof of Eq. (29)

The all-pass filter impulse response, $h_a(n)$, can be obtained from the all-pass complex spectrum through the inverse Fourier transform

$$h_a(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} H_a(e^{J\omega})\, e^{J\omega n}\, d\omega. \tag{A.8}$$

Since $H_a(e^{J\omega}) = e^{J\theta_a(\omega)}$, and approximating the integration by a summation, then

\begin{align}
h_a(n) &= \frac{1}{2L+1} \sum_{l=-L}^{L} e^{J[\theta_a(\omega_l) + \omega_l n]} \tag{A.9} \\
&= \frac{1}{2L+1} \left\{ \sum_{l=-L}^{-1} e^{J[\theta_a(\omega_l) + \omega_l n]} + e^{J\theta_a(\omega_0)} + \sum_{l=1}^{L} e^{J[\theta_a(\omega_l) + \omega_l n]} \right\} \tag{A.10} \\
&= \frac{1}{2L+1} \left\{ e^{J[\theta_a(\omega_{-L}) + \omega_{-L} n]} + \cdots + e^{J[\theta_a(\omega_0) + \omega_0 n]} + \cdots + e^{J[\theta_a(\omega_L) + \omega_L n]} \right\}, \tag{A.11}
\end{align}

where $L + 1$ is the number of samples of $H_a(e^{J\omega})$ taken between $\omega_0 = 0$ and $\omega_L = \pi$. Noticing that $\omega_{-l} = -\omega_l$, and $\theta_a(\omega_{-l}) = \theta_a(-\omega_l) = -\theta_a(\omega_l)$, then

\begin{align}
h_a(n) &= \frac{1}{2L+1} \left\{ e^{-J[\theta_a(\omega_L) + \omega_L n]} + \cdots + e^{J[\theta_a(\omega_0) + \omega_0 n]} + \cdots + e^{J[\theta_a(\omega_L) + \omega_L n]} \right\} \tag{A.12} \\
&= \frac{1}{2L+1} \left\{ e^{J[\theta_a(\omega_0) + \omega_0 n]} + \sum_{l=1}^{L} \left( e^{J[\theta_a(\omega_l) + \omega_l n]} + e^{-J[\theta_a(\omega_l) + \omega_l n]} \right) \right\} \tag{A.13} \\
&= \frac{1}{2L+1} \left\{ e^{J[\theta_a(\omega_0) + \omega_0 n]} + 2 \sum_{l=1}^{L} \cos\left( \theta_a(\omega_l) + \omega_l n \right) \right\}. \tag{A.14}
\end{align}

Since $\omega_0 = 0$, from (A.7) $\theta_a(\omega_0) = \theta_a(0) = 0$. Then

$$h_a(n) = \frac{1}{2L+1} \left\{ 1 + 2 \sum_{l=1}^{L} \cos\left( \theta_a(\omega_l) + \omega_l n \right) \right\}. \tag{A.15}$$
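The two results of this appendix translate directly into a short numerical procedure: evaluate the phase response from the phase parameters as in (A.7), then sum cosines as in (A.15) to obtain the impulse response. The Python sketch below illustrates this. It is a minimal sketch and not the synthesis code of the paper: it takes (A.7) with a positive sign in front of the sum, which is an assumption about the sign convention, it ignores frequency warping as the appendix does, and the function names and the choice of evaluating $h_a(n)$ for $n = -P_a, \ldots, P_a$ are hypothetical.

```python
import numpy as np

def allpass_phase(phi, L=4096):
    """Phase response theta_a(omega_l) on L + 1 points between 0 and pi, as in (A.7)."""
    phi = np.asarray(phi, dtype=float)
    omega = np.pi * np.arange(L + 1) / L      # omega_0 = 0, ..., omega_L = pi
    n_plus_1 = np.arange(1, len(phi) + 1)     # the factor (n + 1) for n = 0..C_a - 1
    theta = 2.0 * np.sin(np.outer(omega, n_plus_1)) @ phi
    return omega, theta

def allpass_impulse_response(phi, L=4096, P_a=64):
    """Impulse response h_a(n) for n = -P_a..P_a (assumed support), as in (A.15)."""
    omega, theta = allpass_phase(phi, L)
    n = np.arange(-P_a, P_a + 1)
    # One row per frequency omega_1..omega_L, one column per time index n.
    arg = theta[1:, None] + omega[1:, None] * n[None, :]
    h = (1.0 + 2.0 * np.cos(arg).sum(axis=0)) / (2 * L + 1)
    return n, h

# With all phase parameters at zero the filter reduces to (almost) a unit pulse at n = 0.
n, h_zero = allpass_impulse_response(np.zeros(19))
omega, theta_one = allpass_phase(np.array([0.5]))
```

Setting every $\phi(n)$ to zero corresponds to $H_a(z) = 1$, which is a convenient sanity check before plugging in generated phase parameter trajectories.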
References

Amazon Mechanical Turk, last visited in December 2012.
Bednar, J.B., Watt, T.L., 1985. Calculating the complex cepstrum without phase unwrapping or integration. IEEE Trans. Acoust. Speech Signal Process. ASSP-33 (4), 1014–1017.
Bhanu, B., McClellan, J.H., 1980. On the computation of the complex cepstrum. IEEE Trans. Acoust. Speech Signal Process. (5), 583–585.
Buchholz, S., Latorre, J., 2011. Crowdsourcing preference tests, and how to detect cheating. In: Proc. Annual Conf. of the ISCA (INTERSPEECH), pp. 3053–3056.
Buchholz, S., Braunschweiler, N., Morita, M., Webster, G., 2007. The Toshiba entry for the 2007 Blizzard Challenge, on-line, last accessed in December 2012.
Cabral, J., Renals, S., Richmond, K., Yamagishi, J., 2007. Towards an improved modeling of the glottal source in statistical parametric speech synthesis. In: Proc. Sixth ISCA Speech Synthesis Workshop (SSW6), pp. 113–118.
Chu, W., 2003. Speech Coding Algorithms. Wiley-Interscience, USA.
Deller Jr., J.R., Hansen, J.H.L., Proakis, J.G., 2000. Discrete-Time Processing of Speech Signals. IEEE Press Classic Reissue.
Drugman, T., Dutoit, T., 2010. Chirp complex cepstrum-based decomposition for asynchronous glottal analysis. In: Proc. Annual Conf. of the ISCA (INTERSPEECH), pp. 657–660.
Drugman, T., Wilfart, G., Dutoit, T., 2009. A deterministic plus stochastic model of residual signal for improved parametric speech synthesis. In: Proc. Annual Conf. of the ISCA (INTERSPEECH), pp. 1779–1782.
Drugman, T., Bozkurt, B., Dutoit, T., 2011. Causal-anticausal decomposition of speech using complex cepstrum for glottal source estimation. Speech Comm. 53, 855–866.
Jackson, P.J., Shadle, C.H., 2001. Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech. IEEE Trans. Speech Audio Process. 9 (7), 713–726.
Kawahara, H., Estill, J., Fujimura, O., 2001. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In: Proc. Internat. Workshop on Models and Analysis of Vocal Emissions for Biological Applications (MAVEBA), pp. 13–18.
Maia, R., Toda, T., Zen, H., Nankaku, Y., Tokuda, K., 2007. An excitation model for HMM-based speech synthesis based on residual modeling. In: Proc. Sixth ISCA Speech Synthesis Workshop (SSW6), pp. 131–136.
Maia, R., Akamine, M., Gales, M., 2012. Complex cepstrum as phase information in statistical parametric speech synthesis. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4581–4584.
Vondra, M., Vích, R., 2011. Speech modeling using the complex cepstrum. In: Proc. Third COST 2102 Internat. Training School Conf. on Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues, pp. 324–330.
Oppenheim, A.V., 2010. Discrete-time signal processing. Pearson.
Quatieri Jr., T.F., 1979. Minimum and mixed phase speech analysis-synthesis by adaptive homomorphic deconvolution. IEEE Trans. Acoust. Speech Signal Process. ASSP-27 (4), 328–335.
Raitio, T., Suni, A., Pulakka, H., Vainio, M., Alku, P., 2008. HMM-based Finnish text-to-speech system using glottal inverse filtering. In: Proc. Annual Conf. of the ISCA (INTERSPEECH), pp. 1881–1884.
SAMPA – computer readable phonetic alphabet, on-line, last accessed in August 2012.
Toda, T., Tokuda, K., 2007. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Systems E90-D (5), 816–824.
Tokuda, K., Kobayashi, T., Masuko, T., Imai, S., 1994. Mel-generalized cepstral analysis – a unified approach to speech spectral estimation. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP), pp. 1043–1046.
Tribolet, J.M., 1977. A new phase unwrapping algorithm. IEEE Trans. Acoust. Speech Signal Process. ASSP-25 (2), 170–177.
Verhelst, W., Steenhaut, O., 1986. A new model for the short-time complex cepstrum of voiced speech. IEEE Trans. Acoust. Speech Signal Process. ASSP-34 (1), 43–51.
Yamagishi, J., Watts, O., 2010. The CSTR/EMIME HTS system for the Blizzard Challenge 2010, on-line, last accessed in December 2012.
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T., 2001. Mixed-excitation for HMM-based speech synthesis. In: Proc. European Conf. on Speech Communication and Technology (EUROSPEECH), pp. 2263–2266.
Zen, H., Toda, T., Nakamura, M., Tokuda, K., 2005. Details of the Nitech HMM-based speech synthesis for Blizzard Challenge 2005. IEICE Trans. Inf. Systems E90-D (1), 325–333.
Zen, H., Tokuda, K., Black, A., 2009. Statistical parametric speech synthesis. Speech Comm. 51 (11), 1039–1064.