Digital Signal Processing 71 (2017) 131–143
Improved voicing decision using glottal activity features for statistical parametric speech synthesis

Nagaraj Adiga*, Banriskhem K. Khonglah, S.R. Mahadeva Prasanna

Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India

* Corresponding author. E-mail addresses: [email protected] (N. Adiga), [email protected] (B.K. Khonglah), [email protected] (S.R. Mahadeva Prasanna).
Article history: Available online 28 September 2017

Keywords: Glottal activity features; Statistical parametric speech synthesis; Voicing decision; Support vector machine
Abstract: A method to improve the voicing decision using glottal activity features is proposed for statistical parametric speech synthesis. In existing methods, the voicing decision relies mostly on the fundamental frequency (F0), which may result in errors when the prediction is inaccurate. Even though F0 is a glottal activity feature, other features that characterize this activity may help in improving the voicing decision. The glottal activity features used in this work are the strength of excitation (SoE), normalized autocorrelation peak strength (NAPS), and higher-order statistics (HOS). These features are obtained from approximated source signals like the zero-frequency filtered signal and the integrated linear prediction residual. To improve the voicing decision and to avoid a heuristic threshold for classification, the glottal activity features are trained using different statistical learning methods such as the k-nearest neighbor, support vector machine (SVM), and deep belief network. The voicing decision works best with the SVM classifier, and its effectiveness is tested using statistical parametric speech synthesis. The glottal activity features SoE, NAPS, and HOS are modeled along with F0 and Mel-cepstral coefficients in a hidden Markov model and a deep neural network to get the voicing decision. The objective and subjective evaluations demonstrate that the proposed method improves the naturalness of synthetic speech.
1. Introduction

Statistical parametric speech synthesis (SPSS) has become a state-of-the-art synthesis approach over the past few years [1]. The merit of this method is its capability to produce reasonably intelligible speech with a small footprint [2]. The statistical models in SPSS give the flexibility to change the speaking style and prosody of speech [3]. In recent years, SPSS using hidden Markov models (HMM) and deep neural networks (DNN) has been extensively used. Statistical models using HMM are commonly implemented using software known as the HMM based speech synthesis system (HTS) [1–4]. To build DNN based speech synthesis systems, the Merlin toolkit was recently introduced [5]. In the Merlin system, linguistic features are taken as input and used to predict acoustic features, which are then passed to a vocoder to produce the speech waveform. However, the synthesis quality is poor concerning naturalness or speaker individuality when compared to unit selection speech synthesis [3,6]. Many factors result in the unnaturalness of synthetic speech. These include poor representation of acoustic features, over-smoothing of acoustic features in statistical modeling,
and a simplified vocoder model to generate synthetic speech [7]. This work focuses on the better extraction of acoustic features. In particular, the components present in the glottal activity region, like aperiodicity, phase information, variation in the strength of excitation, glottal pulse shape, etc., have a positive impact on the naturalness of synthetic speech [8,9]. A speech segment can be categorized mainly into voiced and unvoiced regions, based on whether glottal activity or glottal vibration is present or not. The glottal activity region is characterized by variations in the locations of glottal closure instants or epochs due to the quasi-periodic vibration of the vocal folds, by the epoch strength due to changes in the strength of the excitation signal, and by an aperiodic component due to the turbulence noise generated in voiced speech. In the existing literature [10,11], epochs correspond to glottal closure instants (GCIs), glottal opening instants (GOIs), and onsets of bursts [10]. However, GCIs are the major excitation present in speech production. Therefore, in this work, GCIs are referred to as epochs. The glottal pulse is characterized by the duration of the closing and opening phases of the glottal cycle, skewness or glottal pulse shape, etc. [8,9]. In this work, the significance of glottal activity features is illustrated for speech synthesis. Some of these glottal activity parameters are modeled and used to improve the naturalness of SPSS.
In conventional SPSS using the hidden Markov model (HMM), Mel-cepstral coefficients (MCEP) are used to model the vocal-tract transfer function, and the fundamental frequency (F0) or pitch parameter is used to model the excitation feature [12]. In the base version of SPSS, the voicing decision is computed from F0, which represents only the quasi-periodic characterization of glottal activity. For the voicing decision, other source parameters of glottal activity are ignored. F0 along with the voicing decision plays a critical role in the quality of synthetic speech. The errors in F0 come from the pitch computation algorithm as well as from the average statistical model. In HMM modeling, due to pitch estimation errors, an unvoiced frame may occur for a voiced sound. If it occurs in the middle of a voiced region, the degradation in synthesis quality will be higher. Hence, the synthesis quality of SPSS is not as natural as that of the unit selection method [1]. Besides, when the speech sounds are weakly periodic or the strength of excitation is low [13,14], there is a chance of missing some portions of the voiced region in the classification. For instance, misclassification may happen around voiced–unvoiced (V–UV) and UV–V transitions due to the low amplitude of the speech signal [13,14]. This paper focuses on showing the significance of glottal activity features for the voicing decision and then training these features in the HMM and DNN frameworks for improving the naturalness of synthetic speech. The focus is on voiced speech as it is perceptually essential and relatively easier to model. The glottal activity features are used for the voicing decision in the speech synthesis task to generate the excitation signal. However, for the voicing decision, some heuristic threshold has to be applied to the final evidence, which can be avoided by using classifiers. The advantage of using classifiers is the transformation of features into a higher dimensional space, which may provide better separability for classification. Classifiers such as the k-nearest neighbor (k-NN), support vector machine (SVM), and deep belief network (DBN) are studied [15–17]. Classifiers from simple to complex ones have been explored to examine their capability for the task of voicing classification and to identify the classifier that performs best with these features.
1.1. Contributions of this paper
The main contributions of the work are:
• To show the significance of different glottal activity features for voicing decision.
• Improving the voicing decision from glottal activity features like the strength of excitation (SoE), normalized autocorrelation peak strength (NAPS), and higher-order statistics (HOS) using classifiers such as k-NN, DBN, and SVM.

• Modeling the glottal activity features SoE, NAPS, and HOS as a separate stream in HMM with a continuous distribution.

• Using the voicing decision in the HMM and DNN synthesis frameworks to improve the naturalness of synthetic speech.

This paper is organized as follows: Section 2 gives an overview of the voicing decision in the context of SPSS. The issues present in the conventional voiced/unvoiced decision used in SPSS and the behavior of glottal activity features for a better voicing decision are described in Section 3. The refinement of the voicing decision using different classifiers is described in Section 4. The integration of the proposed voicing decision into SPSS is explained in Section 5. The experimental evaluation and the effectiveness of the proposed voicing classification for synthesis are described in Section 6. The work is finally concluded in Section 7.
2. Related literature

2.1. Voicing decision

In the literature, different features have been used for the voicing decision. They can be classified into time-domain and frequency-domain features. Typically, time-domain features measure the acoustic nature of voiced sound, such as energy, periodicity, zero crossing rate, and short-term correlation [18,19]. Frequency-domain features include the spectral peaks and harmonic measures obtained by decomposing the speech signal using the Fourier transform or wavelet transform [20,21]. Besides, various refinements have been introduced for the voicing decision from degraded speech recorded in realistic conditions. These cover the extraction of the voicing decision in adverse situations by utilizing the autocorrelation of the Hilbert envelope of the linear prediction (LP) residual [22,23]. Further, some of the recent algorithms like YIN, PRAAT, Get_F0, SWIPE, etc. give very good voicing estimation [24–27]. Also, some algorithms have focused on instantaneous F0 estimation, which in turn gives a voicing decision based on event processing [19,28]. In all these methods, the voicing decision is taken using some threshold chosen empirically, and the performance of these methods depends critically on that threshold. To improve the accuracy and to avoid the threshold, statistical methods such as HMM, the Gaussian mixture model (GMM), neural network models, and deep neural networks have also been used for the voicing decision [17,18,29]. These methods do not depend on a threshold. However, they require discriminative features for training.

2.2. Voicing decision in HMM

In the existing literature of HMM, the voicing decision along with the fundamental frequency (F0) is modeled as a multi-space distribution (MSD) [30]. F0 is modeled as a continuous Gaussian distribution in the voiced region and as discrete symbols in the unvoiced region. It is modeled for a state (S) as follows:
$$P(F^{+} = F \mid L, S) = \begin{cases} \mathcal{N}(F; \mu_S, \sigma_S), & L = \mathrm{V} \\ 0, & L = \mathrm{U} \end{cases} \qquad (1)$$

$$P(F^{+} = \mathrm{NULL} \mid L, S) = \begin{cases} 0, & L = \mathrm{V} \\ 1, & L = \mathrm{U} \end{cases} \qquad (2)$$
where N is a Gaussian density with mean μ_S and variance σ_S, F ∈ (−∞, ∞) represents the real F0 value, and L ∈ {U, V} is the voicing label. According to the hypothesis of continuous F0, the voicing label V and the NULL symbol cannot be observed at the same time. Similarly, the unvoicing label U and the real F0 value cannot occur together. The F0 model using MSD can provide good quality speech if the voicing decision is accurate. However, there are failures like a voiced region being classified as an unvoiced region (false unvoiced), which gives hoarseness to the voice quality. Sometimes, an unvoiced region is also classified as a voiced region (false voiced), producing buzziness, mainly at the higher frequencies of the synthesized speech [31]. Another major approach to model F0 and voicing labels for HMM is the continuous F0 model, instead of the discontinuous MSD F0 model [32,33]. In the continuous model, F0 observations are assumed to be always available, and the voicing decision is modeled separately. Yu et al. [32] modeled F0 and voicing labels independently in separate streams to make the voicing decision independent of F0. In the globally tied distribution method [33], pitch values for unvoiced frames are extended from neighboring voiced frames by interpolation and smoothing. Hence, F0 and its derivatives are modeled in a single stream. Besides, a two-mixture GMM is modeled for each state, with one mixture for voiced frames and
Fig. 1. (a) HMM synthesized speech for the English word “sleep” ([s], [l], [iy], [p], [sil]) using the RAPT algorithm; (b) fundamental frequency with voicing decision; (c) spectrogram for the same word; (d)–(f) show the HMM synthesized speech for the same word using the proposed glottal activity features, the fundamental frequency with voicing decision, and the spectrogram, respectively.
the other mixture for unvoiced frames. Finally, the voicing decision is made based on the voicing mixture weight. The continuous F0 model has shown significant improvement in synthesis quality compared to the MSD F0 model for HMM. Voicing decisions based on statistical methods such as GMM and the multilayer perceptron have also been attempted in HMM [31,34]. In [29], long-term suprasegmental properties of F0 features have been exploited using deep neural models to get an improved voicing decision. These classifiers helped in improving the accuracy of the voicing decision using existing features like F0, zero crossing rate, and energy. Hence, using glottal activity features along with a classifier may enhance the accuracy of the voicing decision. Narendra et al. [7] recently proposed a robust voicing decision computed from the zero-frequency filter, where the voicing decision obtained from the strength of excitation is modeled together with F0 in HMM with MSD. This method showed improvement in the synthesis quality. The motivation of the current work is to obtain a better accuracy of the voicing decision using other attributes of glottal activity features for the vocoder in HMM, which needs an accurate voicing decision. Moreover, in the proposed method, instead of MSD, the glottal activity features are trained using HMM with the continuous model and also in the DNN framework.

3. Glottal activity features for voicing decision

In the base version of SPSS using HMM, the voicing decision is made independently for each state using the MSD model of F0. The pitch parameter is computed using the robust algorithm for pitch tracking (RAPT) [26]. The RAPT algorithm performs frame by frame autocorrelation analysis to capture periodicity information. Errors in the pitch detection algorithm may happen due to a poorly pronounced vowel or low energy boundary regions of a voiced sound. These errors may even propagate into the statistical training. Due to this, during parameter generation, a voiced frame may be classified as an unvoiced frame. Besides, if the misclassified frame happens to occur in the middle of a voiced sound with voiced neighboring frames, the quality will be poor. Fig. 1 shows a synthesized sample of the English word “sleep” from SPSS, where F0 and the voicing decision are obtained from the RAPT algorithm [26]. In this word, due to the voicing decision error, unvoiced frames appeared
within the voiced sound [iy]. Hence, the vowel sounds very dry and hoarse, which degrades the naturalness of the synthetic speech. Thus, the features currently used for the voicing decision are insufficient to capture the source dynamics of voiced speech. Hence, there is a need for better features to represent the voicing information, along with the standard periodicity feature.

3.1. Analysis of F0 for different voiced sounds

In the basic version of SPSS, the voicing decision is made using a glottal activity feature like F0. To understand the significance of this feature, the relative frequency of the F0 feature for all the sentences of the SLT speaker taken from the ARCTIC database [35] is plotted in Fig. 2. It shows the distribution of F0 values for different voiced sound categories like vowels, semivowels, nasals, voiced fricatives, and voiced stops. F0 is calculated from the RAPT algorithm with a frame shift of 5 ms. Even though most of the F0 values for sound units such as vowels, semivowels, and nasals are around 110 Hz to 230 Hz, around 10% of the values have F0 = 0, i.e., these frames are classified as unvoiced sounds. For voiced stops and voiced fricatives, around 40% of frames are classified as unvoiced sounds. There is a need for better glottal activity features, which are able to classify these frames correctly.

3.2. Different glottal activity features for voicing decision

In our recent work [36], three glottal activity features and their combination were proposed to identify the regions where the activity of the glottis is present. The three glottal activity features considered are SoE, NAPS, and HOS of the source signal. This section describes the voicing decision of phoneme units using these glottal activity features present in the speech production mechanism. The glottal activity features are computed from the source signal by minimizing the effect of the time-varying vocal-tract response present in the speech signal. Hence, the zero-frequency filter and the inverse LP filter are applied independently on the speech signal to get two different source representations, the zero-frequency filtered signal (ZFFS) and the integrated linear prediction residual (ILPR) signal [11,37], respectively.
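The zero-frequency filtering operation itself is only summarized here; the NumPy sketch below shows one common way it can be realized, broadly following the recipe associated with [11]. The trend-removal window length (about 1.5 times the average pitch period) and the number of trend-removal passes are assumptions for illustration, not values taken from this paper.

```python
import numpy as np

def zero_frequency_filter(speech, fs, avg_pitch_ms=8.0, passes=3):
    """Rough sketch of zero-frequency filtering: a cascade of two zero-frequency
    resonators (realized as repeated cumulative sums of the differenced signal)
    followed by local-mean (trend) removal over a pitch-period-related window."""
    x = np.diff(speech.astype(float), prepend=float(speech[0]))  # remove DC bias
    y = x.copy()
    for _ in range(4):                      # two resonators, each a double pole at z = 1
        y = np.cumsum(y)
    win = int(round(1.5 * avg_pitch_ms * 1e-3 * fs)) | 1         # odd window (assumed length)
    kernel = np.ones(win) / win
    for _ in range(passes):                 # successive trend removal (number of passes assumed)
        y = y - np.convolve(y, kernel, mode="same")
    return y                                # zero-frequency filtered signal (ZFFS)
```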
Fig. 2. The distribution of different glottal activity features for voiced and unvoiced/silence regions: (a)–(d) Relative frequency of feature values for F0, SoE, NAPS, and HOS, respectively.
3.2.1. Strength of excitation (SoE)

During the production of voiced sounds, the rapid movement of the vocal folds results in high energy during the closing phase of the glottis and gives higher strength to voiced speech around the epoch locations. This strength around the epoch locations may be estimated by passing the speech signal through the zero-frequency filter and computing the slope near the epoch locations of the filtered signal [38]. The strength of excitation (s_e[k]) of the filtered signal (z[n]) is defined as

$$s_e[k] = \left| z[k+1] - z[k] \right| \qquad (3)$$

where k is the epoch location. s_e[k] gives the strength of the impulse-like excitation at the epoch location.

3.2.2. Normalized autocorrelation peak strength (NAPS)

In the production of voiced sounds, quasi-periodic glottal pulses of air excite the time-varying vocal-tract system. The speech signal is therefore quasi-periodic for voiced sounds, and this behavior can be characterized by taking the NAPS of the ZFFS (z[n]). The NAPS (n_p[τ]) for a frame of the filtered signal is given by

$$n_p[\tau] = \frac{\sum_{n=1}^{N} z[n]\, z[n-\tau]}{\sum_{n=1}^{N} z^{2}[n]} \qquad (4)$$

where τ is the time lag. In this case, it represents the location of the highest peak in the 2.5–15 ms range and is the pitch period or periodicity for a voiced sound.

3.2.3. Higher-order statistics (HOS) measure

In voiced speech, due to the airflow pressure, the closing and opening of the vocal folds occur, resulting in significant excitation around both the closing and opening instants. There is a mismatch, with higher strength at the closing instant compared to the opening instant, due to differences in the airflow pressure around those instants [8], which results in an asymmetric glottal pulse shape. The ratio of skewness to kurtosis computed from the ILPR signal is used for capturing the asymmetric characteristic of the glottal pulse [36]. The nature of the ILPR signal is somewhat identical to the glottal flow derivative signal and is obtained when non-preemphasized speech is used during the inverse filtering operation. Normalized skewness and kurtosis may be affected by high noise energy, and hence an appropriate skewness and kurtosis ratio (SKR) is considered by Nemer et al. [39], which is given by

$$\mathrm{SKR} = \frac{\gamma^{2}}{\beta^{1.5}} \qquad (5)$$

where γ and β represent the skewness and kurtosis of the ILPR signal (r_i[n]) of N samples of a frame, respectively, and are given by

$$\gamma = \frac{\frac{1}{N}\sum_{n=1}^{N} \left(r_i[n] - \bar{r}_i\right)^{3}}{\left(\frac{1}{N}\sum_{n=1}^{N} \left(r_i[n] - \bar{r}_i\right)^{2}\right)^{3/2}} \qquad (6)$$

$$\beta = \frac{\frac{1}{N}\sum_{n=1}^{N} \left(r_i[n] - \bar{r}_i\right)^{4}}{\left(\frac{1}{N}\sum_{n=1}^{N} \left(r_i[n] - \bar{r}_i\right)^{2}\right)^{2}} - 3 \qquad (7)$$
where r̄_i is the mean of r_i[n]. The SKR evidence is the HOS measure considered in the present work to represent the glottal activity feature.

3.3. Analysis of SoE, NAPS, and HOS for voicing decision

The distributions of SoE, NAPS, and HOS values are plotted in Fig. 2((b)–(d)) for different voiced sounds. The SoE values for voiced fricatives and voiced stops are relatively low for only around 5% of the frames present in the entire database, which is better than the F0 distribution shown in Fig. 2(a). Moreover, the relative frequency values of the NAPS and HOS features also work better for voiced fricative and voiced stop sounds. The combination of these features is helpful for the correct classification of voiced sounds; in particular, the NAPS and HOS features are necessary for low energy voiced sounds. The combination is suitable in the case of voiced sound units with weakly periodic regions and V–UV or UV–V boundary regions. It also helps in the case of low amplitude voiced sounds such as voiced consonants and semivowels. In these sounds, there is a tendency to have lower periodicity values, either because pitch estimates are not reliable or because instantaneous values of pitch vary even within an analysis frame near the boundary regions, which leads to misclassification of voiced frames. Fig. 3(a) shows natural speech for the phrase “a big canvas” consisting of some weakly voiced sound units such as [b], [g], and [v]. The voicing decision obtained from the RAPT algorithm is shown in Fig. 3(b) along with the normalized fundamental frequency plotted as a continuous line. The RAPT algorithm works fairly well for a weakly voiced sound unit like [g]. For the voiced consonant [b]
Fig. 3. (a) Natural speech for the phrase “a big canvas” ([a], [b], [ih], [g], [k], [ae], [n], [v], [ah], [s]) with reference voicing marking in dotted line; (b) Voicing decision obtained from RAPT with amplitude 0.8, and normalized F0 evidence; (c)–(e) glottal activity evidence obtained from the features SoE, NAPS, and HOS, respectively; (f) Proposed voicing decision (dotted line) using the combined evidence of glottal activity features (continuous line).
(around 0.4 s), pitch estimation fails due to the sudden dip, and as a result the sound unit is classified as an unvoiced region. A similar observation can be made around the 0.9 s region for the voiced sound unit [v]. Fig. 3((c)–(e)) shows the SoE, NAPS, and HOS features applied to the same segment, respectively. Even though the SoE evidence is low for the [b] and [v] sounds, the NAPS (around 0.4 s and 0.9 s) and the HOS (around 0.4 s) values are high around these regions, helping in the correct classification of these regions, which is shown in Fig. 3(f) with the combined evidence and the final classification. The combined evidence is computed by normalizing all three individual evidences to a maximum value and then averaging them. The reason for combining the features is that the NAPS evidence is calculated on the ZFFS, which behaves like the output of an approximate band-pass filter with high gain for the components around the fundamental frequency, resulting in an enhanced sinusoidal nature of low amplitude voiced sound units. The HOS feature computed on the ILPR captures the boundary/transition regions well, as residual errors around these regions are relatively high. Fig. 1(d) shows a synthesized sample of the same English word “sleep” generated from SPSS using the proposed combination of glottal activity features. The voicing decision shown in Fig. 1(e) is obtained from the proposed three glottal activity features trained using HMM. The detailed procedure for computing the voicing decision is shown in Algorithm 1. In the synthesized word, there is no voicing decision error, as can be seen in Fig. 1(e). This is mainly due to the better voicing decision obtained from the combination of the glottal activity features SoE, NAPS, and HOS. Further, the narrowband spectrogram shown in Fig. 1(c) shows no regular harmonic structure around 0.3 s for the sound unit [iy] for the RAPT method. However, we can see the regular harmonic structure for the sound unit [iy] in the proposed method, as shown in Fig. 1(f). Hence, the additional features along with F0 are helpful in improving the voicing decision.

4. Glottal activity features in classifiers

An extensive study of the classification of voiced sounds using glottal activity features with different classifiers is presented in this section. There are two advantages of using a classifier. Firstly, it is helpful for enhancing the classification accuracy. Secondly,
the heuristic threshold for classification can be avoided. The first classifier is the k-NN, a simple classifier, which stores all the features and their labels during the training process. During testing, the (Euclidean) distance of an unknown sample to all the training samples is computed. The labels of the k nearest samples to the unknown sample are considered, and the dominant label is assigned to the unknown sample [15]. Another classifier explored is the SVM, which is a binary classifier. This classifier finds the optimal hyperplane by maximizing the distance of the hyperplane from the support vectors. The details of SVM can be found in [16]. The LIBSVM [40] software is used for the experiments in this work, where the radial basis function kernel is considered, and the kernel equation is
$$K(x, y) = \exp\left(-Y \, \| x - y \|^{2}\right) \qquad (8)$$
where x and y are two training feature vectors and Y is the width parameter. There is also a cost parameter c for the SVM, and details can be found in [41]. The parameters Y and c are varied to achieve the best performance. The deep belief network (DBN) was used for the voice activity detection task by Zhang et al. [17]. In the DBN, serially concatenated multiple features are given as input and passed through the multiple nonlinear hidden layers of the DBN. The final layer of the DBN consists of a soft-max layer, which acts as a classifier and is used to classify speech frames as belonging to a glottal activity region or not. A stack of restricted Boltzmann machines (RBM) constitutes a DBN, and the training process takes place in two steps [42]. The first step consists of a pre-training stage, where each RBM is pre-trained in an unsupervised manner, and the output of the first RBM is given as input to the next RBM. This step is repeated until the final layer of the DBN. The above process is performed to find initial parameters of the DBN which are close to a good solution. The next step is the supervised back-propagation step, which is carried out to fine-tune the parameters. For a 2-class classification problem like the voicing decision, a given observation sequence o is assigned to the class whose corresponding output unit is given a value of 1. The output unit y_k, k = 1, 2, is calculated, similar to [17], by the following relations:
Fig. 4. Voiced/unvoiced classification using classifiers: (a)–(b) Speech and DEGG segment with reference marking in dotted line; (c) signal processing combined evidence in continuous line (black color) and final nonlinear mapping in dotted line (red color); (d)–(f) the evidence of glottal activity region computed from k-NN, DBN, and SVM, respectively, with likelihood probability in continuous line (black color) and mapping in dotted line.
$$y_k = \begin{cases} 1, & \text{if } p_k > p_i, \ \forall i = 1, 2, \ i \neq k \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

where p_k is the probabilistic soft output of the voicing class “y_k = 1”; p_k is defined as exp(d_k) / Σ_i exp(d_i), with d_k defined as

$$d_k = \sum_i w^{(L+1,L)}_{k,i}\, f^{(L)}_i\!\left( \sum_j w^{(L,L-1)}_{j,n}\, f^{(L-1)}_n\!\left( \cdots \sum_b w^{(2,1)}_{a,b}\, f^{(1)}_b\!\left( \sum_c w^{(1,0)}_{b,c}\, x_c \right) \cdots \right) \right) \qquad (10)$$
where f^{(l)}(·) represents the nonlinear activation function of the lth hidden layer, l = 1, ..., L, (w^{(l,l−1)}_{i,j})_{i,j} represents the weights between neighboring layers, with i as the ith unit of the lth layer and j as the jth unit of the (l−1)th layer, and (x_c)_c represents the input feature vector [17]. The activation function is generally the logistic function. The input to all classifiers consists of a combination of glottal activity features. The features are concatenated to form a feature vector for each frame. Further, contextual information is also incorporated, wherein a five-frame context is used. This means five frames to the left and five frames to the right of the current frame are used. The glottal activity features without classifiers are compared by combining the features and nonlinearly mapping the combined evidence to get the binary classification. For the combination, the three features are first added and then the nonlinear mapping is performed. This mapping function [43] is given by
$$E_p = \frac{1}{1 + e^{-(S_p - \theta)/\tau}} \qquad (11)$$

where E_p is the nonlinearly mapped value and S_p is the combined glottal activity feature value, which is obtained after adding the SoE, NAPS, and HOS features and normalizing to the mean value. The variables θ and τ are the mapping parameters: τ is the slope parameter, chosen to be very small with a value of 0.001, and θ is the primary threshold, determined empirically to be 0.7 in this work. The advantage of using the nonlinear mapping over a direct threshold is a better voicing decision, as also shown in [44] for speech/music classification. The better results may come from the fact that, due to the sigmoid function used in the nonlinear mapping, the range of the combined evidence of the glottal activity features is increased. That is, higher feature values increase further and are mapped to the binary decision 1, while lower feature values decrease further and are assigned to the binary decision 0.
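As a rough, self-contained illustration of Eqs. (4)–(7) and of the mapping in Eq. (11) (not the authors' implementation), the sketch below computes per-frame NAPS and SKR values and maps the combined evidence to a soft voicing decision. The guards against degenerate values, the max-normalization of each feature before combining, and the clipping of the sigmoid argument are assumptions.

```python
import numpy as np

def naps(zffs_frame, fs, lag_min_ms=2.5, lag_max_ms=15.0):
    """Normalized autocorrelation peak strength, Eq. (4), for one ZFFS frame."""
    z = np.asarray(zffs_frame, dtype=float)
    energy = np.sum(z ** 2) + 1e-12
    lags = range(int(lag_min_ms * 1e-3 * fs), int(lag_max_ms * 1e-3 * fs))
    # Highest energy-normalized autocorrelation peak in the 2.5-15 ms lag range.
    return max(np.sum(z[lag:] * z[:-lag]) / energy for lag in lags)

def skr(ilpr_frame):
    """Skewness-to-kurtosis ratio, Eqs. (5)-(7), for one ILPR frame."""
    d = np.asarray(ilpr_frame, dtype=float)
    d = d - np.mean(d)
    m2 = np.mean(d ** 2) + 1e-12
    gamma = np.mean(d ** 3) / m2 ** 1.5      # skewness, Eq. (6)
    beta = np.mean(d ** 4) / m2 ** 2 - 3.0   # kurtosis, Eq. (7)
    # abs() guards against a negative kurtosis value (an assumption, not from the paper).
    return gamma ** 2 / (abs(beta) + 1e-12) ** 1.5

def mapped_evidence(soe, naps_vals, hos, theta=0.7, tau=0.001):
    """Combine per-frame SoE, NAPS, and HOS arrays and apply the sigmoid of Eq. (11).
    Each feature is max-normalized before averaging (normalization choice assumed)."""
    feats = [np.asarray(f, dtype=float) for f in (soe, naps_vals, hos)]
    feats = [f / (np.max(np.abs(f)) + 1e-12) for f in feats]
    s_p = np.mean(feats, axis=0)             # combined evidence S_p
    arg = np.clip(-(s_p - theta) / tau, -60.0, 60.0)   # avoid overflow in exp()
    return 1.0 / (1.0 + np.exp(arg))
```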
4.1. Voiced/unvoiced classification

The classifiers mentioned above are used for the voicing decision with the glottal activity features obtained from speech, which characterize voiced sounds. The remaining sounds, where the activity of the glottis is minimal, are classified as unvoiced sounds. The reference labels for training the classifiers were obtained using differential electroglottography (DEGG). The features SoE, NAPS, and HOS are applied on the DEGG signal. The three obtained features are added and normalized to a maximum value. The combined evidence for each frame is classified as a voiced frame when the evidence value is above 1% of the mean value [19]; otherwise, the frame is labeled as an unvoiced frame. The authors have further verified manually the reference labels for the voicing decision estimated from the DEGG signal while setting the threshold, to get reliable voicing labels as mentioned in [19]. The evidence obtained by the classifiers is shown in Fig. 4. The reference voicing label marking along with the speech and DEGG signals is shown in Fig. 4(a) and (b), respectively. The evidence obtained from the nonlinear mapping of the combined glottal activity features shows accurate detection of the voicing decision, except for the region around 0.8 s, as shown in Fig. 4(c). The evidence obtained from the k-NN classifier is given in Fig. 4(d). Note that for k-NN only the mapped evidence is displayed, since the k-NN output provides only a binary value based on the dominant label. The k-NN classifier gives an excellent detection accuracy for this speech sample. The evidence obtained from the DBN slightly fails in the regions around 0.7 s and 1.25 s, as seen in Fig. 4(e). The evidence from the SVM classifier performs the best, as seen in Fig. 4(f). It can be seen that the nonlinearly mapped value of the likelihood probability of the SVM is very close to the reference value.
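A sketch of how the context stacking and RBF-SVM training described above might be implemented is given below, using scikit-learn's SVC (a wrapper around LIBSVM) rather than the LIBSVM command-line tools used in the paper; the edge padding of the context window and the shapes of the feature and label arrays are assumptions. In the paper, the cost and width parameters are tuned by an exhaustive grid search with fivefold cross-validation; the values c = 8 and Y = 16 reported in Table 1 are used here only as defaults.

```python
import numpy as np
from sklearn.svm import SVC

def stack_context(features, context=5):
    """Append the +/- `context` neighboring frames to each frame (edges padded)."""
    n_frames, _ = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n_frames] for i in range(2 * context + 1)])

def train_voicing_svm(features, labels, c=8.0, width=16.0, context=5):
    """features: (n_frames, 3) per-frame SoE, NAPS, HOS values (assumed precomputed);
    labels: (n_frames,) voiced/unvoiced reference labels derived from DEGG."""
    X = stack_context(np.asarray(features, dtype=float), context)
    clf = SVC(C=c, gamma=width, kernel="rbf", probability=True)
    clf.fit(X, np.asarray(labels))
    return clf

def predict_voicing(clf, features, context=5):
    X = stack_context(np.asarray(features, dtype=float), context)
    # Hard decisions plus class probabilities, which can be used as soft evidence.
    return clf.predict(X), clf.predict_proba(X)
```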
Table 1. Comparison of the different classifiers, represented in terms of the percentage (%) of voiced (VE), unvoiced (UE), and combined (VUE) voicing error.

Database:               SLT                    BDL                    KEELE                  CSTR
Method                  VE     UE     VUE      VE     UE     VUE      VE     UE     VUE      VE     UE     VUE
Nonlinear               3.34   1.85   5.19     4.0    1.92   5.92     4.51   1.92   6.43     4.03   1.98   6.01
k-NN (k = 23)           2.84   1.48   4.32     3.41   1.6    5.01     4.06   1.78   5.84     3.67   1.84   5.51
SVM (c = 8, Y = 16)     1.72   1.02   2.74     1.81   1.15   2.96     2.01   1.18   3.19     1.92   1.1    3.02
DBN (40-25-15-5-2)      3.24   1.71   4.95     3.51   1.94   5.45     2.86   1.5    4.36     2.92   1.79   4.71
RAPT                    4.04   2.45   6.49     4.85   3.2    8.05     5.36   3.92   9.28     4.82   2.76   7.58
TEMPO                   3.6    2.33   5.93     4.02   2.83   6.85     4.65   3.55   8.20     5.12   3.49   8.61
SRH                     3.86   2.31   6.17     4.79   2.57   7.36     4.81   4.11   8.92     4.55   2.78   7.33
REAPER                  2.02   1.51   3.53     3.35   2.06   5.41     3.97   2.38   6.35     3.65   2.16   5.81
Table 2. Comparison of the frame level context information used in different classifiers, represented in terms of the percentage (%) of voicing error.

                   SVM                          DBN
No. of context     SLT     BDL     Average      SLT     BDL     Average
0                  4.68    6.19    5.43         7.92    9.34    8.63
1                  4.55    6.02    5.28         7.68    8.64    8.17
2                  3.53    4.27    3.90         7.41    8.25    7.83
3                  3.01    3.95    3.48         6.61    7.35    6.98
4                  2.78    3.36    3.07         5.86    6.57    6.21
5                  2.74    2.96    2.85         4.95    5.45    5.20
6                  3.15    4.07    3.61         7.36    8.54    7.95
4.2. Evaluation parameters

The classifier results are evaluated by finding the percentage of voiced–unvoiced errors (VUE) between the reference and detected voicing frames. The voicing frame error is defined as follows:

$$\mathrm{VUE} = \mathrm{VE} + \mathrm{UE} \qquad (12)$$

where VE and UE represent the percentage of voiced and unvoiced errors, respectively, given by

$$\mathrm{VE} = V_N / T_{ref} \qquad (13)$$

$$\mathrm{UE} = U_N / T_{ref} \qquad (14)$$
where V_N and U_N are the total error frames for the voicing and unvoicing decisions, respectively, and T_ref is the total number of frames present in the testing database. A voicing failure happens when a reference voiced frame is recognized as an unvoiced frame. An unvoiced error occurs when a reference unvoiced frame is identified as voiced. For the evaluation of these parameters, two speakers, SLT (female) and BDL (male), are taken from the publicly available CMU ARCTIC database [35], which consists of 1132 sentences. A subset of the database of around 800 sentences is used for training. The parameters of the classifiers are optimized using an exhaustive grid search with fivefold cross-validation to get the best performance. The remaining sentences are used for validating the voiced–unvoiced errors.

4.3. Results

The voicing decisions obtained from the different classifiers are tabulated in Table 1. The results are further compared with state-of-the-art algorithms such as RAPT, time domain excitation extraction based on a minimum perturbation operator (TEMPO) used in the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) vocoder, SRH (Summation of Residual Harmonics), and the Robust Epoch And Pitch EstimatoR (REAPER) [22,23,28,45]. For all the methods, the glottal activity features are obtained for a frame shift of 5 ms. The VUE error rate of the signal processing method is evaluated using the nonlinear mapping. Note that the contextual information is incorporated in the input of the classifiers, where a five-frame context has been used.
The use of this information showed a drastic improvement in the results. The reason may be that processing each frame without context results in a finer level of processing and may produce plenty of misclassifications. The use of the left and right context of the current frame may provide additional information to the classification task, which may help to smooth out small changes and hence reduce the classification error. The results of varying the number of context frames to the left and right of the current frame are shown in Table 2 for the SVM and DBN classifiers. From the table, it can be seen that a frame-level context of 5 gives the best results on the ARCTIC database (for both the SLT and BDL speakers), and further increasing the context does not help in improving the voicing decision. The SVM method performed best among all the classifiers and is even superior to state-of-the-art voicing algorithms such as RAPT, TEMPO, SRH, and REAPER. Among the classifiers, SVM performed even better than DBN, which may be because the task is a binary classification. Moreover, in this work the database used is only around one hour (800 sentences), which may not be sufficient for the DBN to give good accuracy for this classification task. In addition to the ARCTIC database, the proposed voicing decision is evaluated using the KEELE and CSTR databases [46,47] and compared with the other state-of-the-art methods. The trend for these databases remains similar, with the proposed method having voicing decision errors of 3.19 and 3.02, respectively, for the KEELE and CSTR databases. The slightly higher voicing decision error for both these databases is due to the slightly degraded nature of the wave files present in them when compared to the ARCTIC database. The same trend is observed in the other methods also.

5. Glottal activity features for synthesis

In this section, the integration of the proposed voicing decision algorithm into SPSS using the HMM and DNN frameworks is explained. The proposed synthesis framework using HMM is shown in Fig. 5. In this work, we used the HTS software to train the glottal activity features. HTS provides a unified framework to train vocal-tract, excitation, and duration parameters simultaneously in HMM [1]. This framework is divided into training and synthesis parts. In training, excitation, vocal-tract, and duration parameters are derived for each phoneme from the whole database. All the phonemes are modeled
Fig. 5. Block diagram of HMM based speech synthesis.
with five states. In each state, three streams are used to model the different parameters [32]. The first stream consists of the vocal-tract parameters, i.e., MCEPs including the zeroth coefficient and their delta and delta-delta coefficients. It is trained using continuous density HMMs. The source parameter F0 and its delta and delta-delta coefficients are modeled in a single stream using a continuous distribution instead of the three independent streams employed in the MSD modeling of F0 [32]. In the third stream, the glottal activity features SoE, NAPS, and HOS are modeled with a continuous distribution along with their delta and delta-delta coefficients. The details of each stream and its distribution are as follows:
• The first stream for MCEPs and its derivatives with the continuous probability distribution.
• The second stream for fundamental frequency, its delta, and delta-delta with the continuous distribution.
• The third stream for the SoE, NAPS, and HOS features and their derivatives with the continuous probability distribution.

For each phoneme, the parameters mentioned above are extracted along with their corresponding labels. In the training part, the maximum likelihood estimate of each parameter is computed using the Baum–Welch re-estimation algorithm [1]. During synthesis, as per the input text, the frame-wise MCEPs, glottal activity features, and F0 parameters are computed by maximizing the output probability using the maximum likelihood parameter generation algorithm [1]. The derived glottal activity features are used for the voicing decision to generate the excitation source signal.

5.1. Integration of voicing decision in SPSS

The block diagram of integrating the glottal activity features into the SPSS framework is shown in Fig. 6. The excitation signal is generated according to the voicing decision modeled from the proposed glottal activity features. The voicing decision is obtained from the generated SoE, NAPS, and HOS parameters by passing them through the SVM classifier. The F0 is computed from the zero-frequency filter for training in the HMM/DNN framework. It is based on sub-segmental analysis, calculating the interval between successive epochs [11]. The zero-frequency filter method gives an accurate estimation of F0 [48]. The obtained F0 over an epoch
Fig. 6. Integration of proposed voicing decision using glottal activity features to SPSS vocoder.
Algorithm 1 Algorithm for computing voicing decision from the proposed glottal activity features in HMM training and synthesis. Training: Step 1: Compute F0 from zero-frequency filter algorithm, which gives instantaneous F0 values. To get frame wise F0 value, average the F0 values in each frame. Step 2: Compute glottal activity features SoE and NAPS from ZFFS, and HOS feature from ILPR. Step 3: Apply glottal activity features with five frame context information to SVM classifier. Optimize cost parameter (c) and width parameter (Y). Save the SVM model. Step 4: Train F0 and glottal activity features (SoE, NAPS, and HOS) with two streams in HMM as continuous Gaussian distribution. Synthesis: Step 5: Generate the glottal activity parameters and F0 from the maximum likelihood generation algorithm of HMM for a given text. Step 6: Obtain the voicing decision from the SVM trained model by feeding glottal activity features (SoE, NAPS, and HOS obtained from HMM) with five frame context. Step 7: Generate the excitation signal from the voicing decision as shown in Fig. 6.
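The excitation generation of Step 7 can be sketched as follows. This is a minimal illustration, not the authors' exact vocoder code; the frame shift, pulse amplitude, and noise gain are assumptions.

```python
import numpy as np

def generate_excitation(voicing, f0, fs, frame_shift_ms=5.0):
    """Frame-wise excitation: an impulse train at the generated F0 for voiced
    frames and white Gaussian noise for unvoiced frames (Fig. 6)."""
    hop = int(fs * frame_shift_ms / 1000.0)
    excitation = np.zeros(len(voicing) * hop)
    next_pulse = 0.0
    for i, (voiced, f) in enumerate(zip(voicing, f0)):
        start = i * hop
        if voiced and f > 0:
            period = fs / f
            while next_pulse < start + hop:
                excitation[int(next_pulse)] = 1.0      # one impulse per pitch period
                next_pulse += period
        else:
            excitation[start:start + hop] = 0.05 * np.random.randn(hop)
            next_pulse = start + hop                   # restart pulse train after noise
    return excitation
```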
interval is averaged over a frame segment to get the frame-wise F0. During synthesis, the voicing decision is obtained from the trained SVM model, where the input to the SVM is the glottal activity parameters generated from the HMM/DNN framework. The detailed procedure for the integration of glottal activity features into SPSS is explained in Algorithm 1. For a voiced frame, an impulse train is generated. In
Table 3. Subjective evaluation results of MOS and PT with 95% confidence intervals from expert subjects.

Evaluation            GA + continuous   RAPT          TEMPO         REAPER        Continuous    Same    p value
MOS                   3.36 ± 0.13       2.62 ± 0.29   2.96 ± 0.23   3.06 ± 0.16   3.12 ± 0.24   –       –
PT (vs. RAPT)         72%               20%           –             –             –             8%      4.38 × 10⁻⁹
PT (vs. TEMPO)        53%               –             31%           –             –             16%     2.12 × 10⁻⁵
PT (vs. REAPER)       47%               –             –             35%           –             18%     1.56 × 10⁻²
PT (vs. Continuous)   45%               –             –             –             34%           21%     1.92 × 10⁻¹
the case of an unvoiced frame, white Gaussian noise is generated. The generated excitation is passed through the synthesis filter. In the present work, the Mel-log spectral approximation (MLSA) filter is used as the synthesis filter. The entire process is done frame-wise, and the convolved source and system responses are overlapped and added to obtain the synthetic speech.

6. Experimental evaluation

The proposed glottal activity features in the vocoder framework are evaluated using HTS systems built for two speakers: SLT (US female) and BDL (US male) [35]. The SLT and BDL sets each consist of 1132 sentences. For training, 1000 randomly selected sentences were used, and the remaining sentences were used for testing. The parameters are analyzed for a frame size of 25 ms with a frame rate of 5 ms. The glottal activity features and the F0 parameter are extracted frame-wise along with the MCEPs. These parameters are trained in the HMM framework. For comparison purposes, along with the proposed system, three more systems based on the voicing decisions obtained from the RAPT, TEMPO, and REAPER methods are developed in the HMM framework [22,28,45]. In all three methods, F0 is modeled with MSD, whereas in the proposed method F0 is modeled with a continuous distribution [32]. Also, MCEPs of order 35 are used in all the experiments and systems. The proposed method is also compared with the original continuous model proposed by Yu et al. [32], where the voicing decision is modeled in a separate stream using the voicing strength. In the RAPT and REAPER methods, the voicing decision is based on the autocorrelation analysis of speech and uses dynamic programming for decision making. The TEMPO method is based on the wavelet transform, where the voicing decision is made based on the voicing strength in different bands. The SRH based voicing decision is not used here since its accuracy is lower compared to the other algorithms. For a fair comparison of the voicing decision, in all the methods F0 is computed from the zero-frequency filter, since the zero-frequency filter method gives an excellent estimate of F0 [48]. The synthesis is done using the MLSA filter. The synthesized files for all the methods can be accessed from the following link.1

6.1. Subjective evaluation

In this evaluation, two tests are conducted, namely, the mean opinion score (MOS) and the preference test (PT), for all the HMM systems. In the MOS test, 25 sentences which are not used in training are given to the subjects along with the original waveforms, and the subjects are asked to give a mean opinion score on a scale of 1 to 5 (1: poor and 5: excellent) by comparing with the original wave file. Two groups of subjects were used in the subjective evaluation: ten speech experts and 15 naive listeners. The evaluators are not native English speakers; however, they are fluent in speaking English. Naive listeners are not speech experts, and they have not
1 http://www.iitg.ernet.in/cseweb/tts/tts/Assamese/gaspss.php.
worked in the speech processing area. For the evaluations, listeners were asked to examine the naturalness present in each file and give their overall scores accordingly. The average scores obtained from the expert listeners are given in Table 3 along with the 95% confidence intervals. From the table, it can be seen that the proposed voicing decision method outperforms the RAPT, TEMPO, and REAPER algorithms. Moreover, the proposed method is marginally superior to the continuous model, which signifies the importance of an accurate estimation of the voicing decision for improving the naturalness of the synthetic speech. Besides, to show the distribution of scores for the male and female speakers, a bar chart is plotted in Fig. 7. It shows the MOS scores with standard deviations for all five systems. From the bar plot, it is clear that the proposed method is slightly better than the continuous system for both the male and female speakers. Results for the naive listeners are presented in Table 4. The conclusions that can be drawn from these results are similar to those with the speech experts, except that they are less pronounced in this case. This can be explained by two facts: compared to speech experts, naive listeners pay less attention to the minute variations in naturalness, and contrary to expert listeners, naive listeners have a higher variance in their scores. Nonetheless, it is appreciable that even the naive listeners can observe differences between the proposed glottal activity (GA) features based SPSS system and the other methods, with a considerable difference in the MOS score. This is consistent for both male and female speakers, as shown in Fig. 7. In the preference test, for each sentence, subjects were requested to listen to two versions shuffled randomly from the five systems at a time and asked to choose one of the systems, or to mark them as the same, as their preference. The percentage preference scores with p values from the expert listeners can be viewed in Table 3. A distinct improvement of the proposed method over the RAPT, TEMPO, REAPER, and continuous voicing methods can be observed according to the preference test and the p values given by the hypothesis tests. In the case of naive listeners, similar to the MOS test, the preference for the glottal activity based system is highest when compared with the RAPT system. The preference margin of the proposed glottal activity based method over the TEMPO/REAPER/continuous algorithms is slightly reduced. Naive listeners may not be able to distinguish small errors occurring due to the voicing decision. The proposed features from the glottal activity region for the voicing decision using continuous modeling outperform the base version of the continuous voicing model, where only the voicing strength is used. This signifies the contribution of the other features present in the glottal activity region, like NAPS and HOS, to the improvement in speech quality.

6.2. Objective evaluation

In this work, two objective measures are used, namely, the log spectral distance (LSD) and the voicing frame error (VUE). The durations of the synthesized speech and the original speech may not be the same. Hence, alignment of the original and synthesized speech
Fig. 7. Average MOS of five HMM systems with RAPT, TEMPO, REAPER, Continuous, and glottal activity (GA) feature based voicing decision, respectively, for SLT (female speaker) and BDL (male speaker).

Table 4. Subjective evaluation results of MOS and PT with 95% confidence intervals from naive subjects.

Evaluation            GA + continuous   RAPT          TEMPO         REAPER        Continuous    Same    p value
MOS                   3.21 ± 0.76       2.65 ± 1.12   2.82 ± 0.95   2.96 ± 1.02   3.04 ± 1.03   –       –
PT (vs. RAPT)         58%               12%           –             –             –             28%     3.19 × 10⁻⁷
PT (vs. TEMPO)        44%               –             24%           –             –             32%     2.07 × 10⁻⁴
PT (vs. REAPER)       39%               –             –             28%           –             33%     1.78 × 10⁻³
PT (vs. Continuous)   33%               –             –             –             31%           36%     4.14 × 10⁻¹
Table 5. Objective evaluation results of LSD and VUE with standard deviation.

           GA + continuous   RAPT           TEMPO          REAPER         Continuous
LSD        2.05 ± 0.26       2.64 ± 0.27    2.49 ± 0.32    2.43 ± 0.26    2.27 ± 0.29
VUE (%)    4.61              11.02          9.15           7.12           5.93
is done using the dynamic time warping algorithm. The first objective evaluation is the LSD measure, which gives the distortion error in the spectral domain. The reason for choosing this measure is that, due to voicing errors, a voiced frame may be classified as an unvoiced frame, and vice versa. This mismatch results in a change in the excitation, which is also reflected in the synthesized speech spectrum. Note that the distortion is normalized, and lower values indicate smaller distortion and better synthesis quality. This measure is evaluated between the original reference speech and the synthesized speech for the same text. The average LSD for all five voicing methods is given in Table 5 along with the standard deviation. The LSD of the proposed voicing method is the lowest, with a distortion of 2.05, indicating better spectral modeling of the proposed method compared to the continuous F0 model as well as the MSD methods such as the RAPT, TEMPO, and REAPER algorithms. Similarly, as mentioned in Section 4.2, the VUE error rate is also evaluated for the synthesized waveforms. From Table 5, it can be seen that the VUE values are relatively higher for the synthesized files when compared to the evaluation done on the original waveforms for the whole database. This may be because HMM training errors are also included in the synthesized files. However, the VUE error rate in the synthesized waveforms is lowest for the proposed method.
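The exact LSD variant is not specified in the text; one common frame-wise RMS log-spectral definition, assuming magnitude spectrograms that have already been time-aligned by DTW, is sketched below.

```python
import numpy as np

def log_spectral_distance(mag_ref, mag_syn, eps=1e-10):
    """Mean frame-wise log spectral distance (in dB) between two aligned
    magnitude-spectrogram arrays of shape (n_frames, n_bins)."""
    log_diff = 20.0 * (np.log10(mag_ref + eps) - np.log10(mag_syn + eps))
    per_frame = np.sqrt(np.mean(log_diff ** 2, axis=1))   # RMS log difference per frame
    return float(np.mean(per_frame))
```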
6.3. GA system using advanced excitation

To check the proposed voicing decision with an advanced excitation method, we used the STRAIGHT vocoder as a reference method. These systems are referred to as HMM-Advanced, whereas the earlier systems are referred to as HMM-Impulse. In the STRAIGHT method, within a voiced region, both periodic and aperiodic excitation is generated using the aperiodicity component. Further, a phase component is also added using the group delay method [28]. In the proposed method, we use a similar procedure, except that the voicing decision is generated from the glottal activity features and F0 is obtained from the zero-frequency filter method. During HMM training, the aperiodicity component is trained as 25 band aperiodicity (BAP) parameters along with the other parameters mentioned in the previous section. During synthesis, the glottal activity features generated from the HMM are used to generate the voicing decision. Along with F0, the voicing decision is given as input to the STRAIGHT vocoder. The synthesis quality of the GA system is better compared to impulse excitation, with reduced buzziness and improved naturalness. Further, the proposed method (VUE = 4.45) is better than the STRAIGHT method (VUE = 8.92) regarding the voicing decision error. The MOS of the proposed method is shown in Fig. 8. The MOS of the proposed method is 3.44, which is better than the STRAIGHT method. This indicates that
Fig. 8. Average MOS of proposed GA feature based voicing decision vs STRAIGHT for HMM-Impulse, HMM-Advanced, and DNN systems.
even with advanced excitation (HMM-Advanced), the voicing decision plays a significant role in improving the naturalness of the synthesis system.

6.4. DNN based speech synthesis

To build the DNN based system, the Merlin toolkit is used [5]. In the Merlin system, linguistic features are taken as input to predict acoustic features, which are then passed to a vocoder to produce the speech waveform. Many neural network architectures are implemented in the Merlin software. In this work, we have chosen a long short-term memory based recurrent neural network. For evaluation, the SLT and BDL speakers from the ARCTIC database are used. The speech signal is sampled at a rate of 48 kHz. 1000 utterances are used for training, 70 as a development set, and 72 as the evaluation set. The input for the system consisted of 491 features. 482 of these are derived from the linguistic context, including quinphone identity, part-of-speech, and positional information within a syllable, word, and phrase, etc. The remaining 9 are within-phone positional information: frame position within the HMM state and phone, state position within the phone both forward and backward, and state and phone durations. The frame alignment and state information are obtained from forced alignment using a monophone HMM based system with five emitting states per phone [5]. For comparison, the voicing decision from the STRAIGHT method is used in the experiment. In the STRAIGHT method, the fundamental frequency on the log scale, 25 BAP parameters to represent aperiodicity, and 60 MCEP parameters computed from the STRAIGHT spectrum, at a 5 ms frame rate, are used. Similarly, for the proposed method, 60-dimensional MCEP and 25 BAP parameters calculated from STRAIGHT are used, whereas the voicing decision and fundamental frequency are computed from the proposed glottal activity features. At synthesis time, maximum likelihood parameter generation is applied to generate smooth parameter trajectories from the DNN outputs. The objective results of the proposed method are slightly better than the STRAIGHT method regarding the voicing decision error. In particular, in the proposed framework, the VUE is 3.01, whereas for the STRAIGHT method it is 4.12. In both methods, the VUE is reduced compared to HMM, which is due to the DNN framework used for training the models. Some of the
synthesized files from the DNN systems can be accessed from the following link.2 For the DNN systems also, MOS and PT tests are conducted. A total of 10 subjects participated in the listening test. 25 sentences from both the proposed and STRAIGHT methods are given to the subjects. The MOS test gives a score of 3.64 for the proposed system compared to a rating of 3.53 for the STRAIGHT system. The DNN based system is relatively better than the HMM system. Further, the improvement of the proposed system over the STRAIGHT system is not as high as that observed in the HMM system. In the preference test, 42% of the waveforms were felt to be the same, indicating the similarity of the proposed and STRAIGHT methods. The preference for the proposed method is around 35%, indicating that the subjects were able to distinguish between the two systems for some of the synthesized files, which means that the proposed method works even in the DNN system.

7. Summary and conclusions

This work demonstrated the significance of glottal activity features for the voicing decision. The voicing decision is performed based on the SoE, NAPS, and HOS features, which represent the strength, periodicity, and asymmetrical nature of glottal activity regions, respectively. The voicing decision using the proposed features is further refined using an SVM classifier. The performance of the proposed approach is compared with current pitch estimation based voicing decision algorithms. The results demonstrated that the classification error of the proposed approach is notably lower compared with the other methods. The glottal activity features are modeled along with F0 and MCEPs using a continuous distribution in SPSS. The quality of the synthesized speech using the proposed approach is compared with three SPSS systems developed using the RAPT, TEMPO, and REAPER voicing decision methods. The proposed method is also compared with the continuous F0 model, an advanced excitation method, and a DNN based system. The objective and subjective assessment results show that the proposed voicing decision using glottal activity features for SPSS, both in the HMM and DNN frameworks, significantly reduces voicing decision errors. It also helps in improving the naturalness of the synthesized waveforms when compared with the other methods. The future focus may be on modeling the other aspects of glottal activity features like
2 http://www.iitg.ernet.in/cseweb/tts/tts/Assamese/gaspss.php.
epoch location, epoch strength, aperiodicity, and phase component in SPSS to get a better source model.

Acknowledgments

This work is part of the ongoing project on the Development of Text-to-Speech Synthesis for Assamese and Manipuri languages funded by TDIL, DEiTy, MCIT, GOI.

References

[1] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, K. Oura, Speech synthesis based on hidden Markov models, Proc. IEEE 101 (5) (2013) 1234–1252.
[2] S. King, An introduction to statistical parametric speech synthesis, Sadhana 36 (5) (2011) 837–852.
[3] H. Zen, K. Tokuda, A.W. Black, Statistical parametric speech synthesis, Speech Commun. 51 (11) (2009) 1039–1064.
[4] HTS [online]. Available: http://hts.sp.nitech.ac.jp/.
[5] Z. Wu, O. Watts, S. King, Merlin: an open source neural network speech synthesis system, in: Proc. SSW, Sunnyvale, USA, 2016.
[6] A.J. Hunt, A.W. Black, Unit selection in a concatenative speech synthesis system using a large speech database, in: Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 1, 1996, pp. 373–376.
[7] N. Narendra, K.S. Rao, Robust voicing detection and F0 estimation for HMM-based speech synthesis, Circuits Syst. Signal Process. (2015) 1–23.
[8] A. Krishnamurthy, D. Childers, Two-channel speech analysis, IEEE Trans. Acoust. Speech Signal Process. 34 (4) (1986) 730–743, http://dx.doi.org/10.1109/TASSP.1986.1164909.
[9] T.V. Ananthapadmanabha, Acoustic Analysis of Voice Source Dynamics, STL-QPSR 2–3, Tech. rep., Dept. of Speech, Music and Hearing, Royal Institute of Technology, Stockholm, 1984.
[10] R. Smits, B. Yegnanarayana, Determination of instants of significant excitation in speech using group delay function, IEEE Trans. Speech Audio Process. 3 (5) (1995) 325–333.
[11] K.S.R. Murthy, B. Yegnanarayana, Epoch extraction from speech signals, IEEE Trans. Audio Speech Lang. Process. 16 (2008) 1602–1613.
[12] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura, Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, in: Proc. Eurospeech, 1999.
[13] K.U. Ogbureke, J.P. Cabral, J. Carson-Berndsen, Using noisy speech to study the robustness of a continuous F0 modelling method in HMM-based speech synthesis, in: Proc. Speech Prosody, 2012.
[14] P. Cabañas-Molero, D. Martínez-Muñoz, P. Vera-Candeas, N. Ruiz-Reyes, F.J. Rodríguez-Serrano, Voicing detection based on adaptive aperiodicity thresholding for speech enhancement in non-stationary noise, IET Signal Process. 8 (2) (2014) 119–130.
[15] K. Fukunaga, P.M. Narendra, A branch and bound algorithm for computing k-nearest neighbors, IEEE Trans. Comput. 100 (7) (1975) 750–753.
[16] C. Cortes, M. Mohri, A. Rostamizadeh, Two-stage learning kernel algorithms, in: Proc. of the 27th Int. Conf. on Machine Learning, 2010, pp. 239–246.
[17] X.-L. Zhang, J. Wu, Deep belief networks based voice activity detection, IEEE Trans. Audio Speech Lang. Process. 21 (4) (2013) 697–710, http://dx.doi.org/10.1109/TASL.2012.2229986.
[18] B. Atal, L. Rabiner, A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition, IEEE Trans. Acoust. Speech Signal Process. 24 (3) (1976) 201–212, http://dx.doi.org/10.1109/TASSP.1976.1162800.
[19] N. Dhananjaya, B. Yegnanarayana, Voiced/nonvoiced detection based on robustness of voiced epochs, IEEE Signal Process. Lett. 17 (3) (2010) 273–276, http://dx.doi.org/10.1109/LSP.2009.2038507.
[20] D. Arifianto, Dual parameters for voiced-unvoiced speech signal determination, in: Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 4, 2007, pp. 749–752.
[21] L. Janer, J.J. Bonet, E. Lleida-Solano, Pitch detection and voiced/unvoiced decision algorithm based on wavelet transforms, in: Proc. IEEE Int. Conf. Spoken Lang. Process., 1996, pp. 1209–1212.
[22] D. Talkin, REAPER: robust epoch and pitch estimator [online]. Available: https://github.com/google/reaper.
[23] T. Drugman, A. Alwan, Joint robust voicing detection and pitch estimation based on residual harmonics, in: Proc. Interspeech, 2011, pp. 1973–1976, http://www.isca-speech.org/archive/interspeech_2011/i11_1973.html.
[24] A. de Cheveigné, H. Kawahara, YIN, a fundamental frequency estimator for speech and music, J. Acoust. Soc. Am. 111 (4) (2002) 1917–1930.
[25] P. Boersma, et al., Praat, a system for doing phonetics by computer, Glot Int. 5 (9/10) (2002) 341–345.
[26] D. Talkin, A robust algorithm for pitch tracking (RAPT), in: Speech Coding and Synthesis, vol. 495, 1995, p. 518.
[27] A. Camacho, SWIPE: A Sawtooth Waveform Inspired Pitch Estimator for Speech and Music, Ph.D. thesis, University of Florida, 2007.
[28] H. Kawahara, I. Masuda-Katsuse, A. de Cheveigné, Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction, Speech Commun. 27 (3–4) (1999) 187–207.
[29] X. Yin, M. Lei, Y. Qian, F.K. Soong, L. He, Z.-H. Ling, L.-R. Dai, Modeling F0 trajectories in hierarchically structured deep neural networks, Speech Commun. 76 (2016) 82–92, http://dx.doi.org/10.1016/j.specom.2015.10.007, http://www.sciencedirect.com/science/article/pii/S0167639315001223.
[30] K. Tokuda, T. Masuko, N. Miyazaki, T. Kobayashi, Hidden Markov models based on multi-space probability distribution for pitch pattern modeling, in: Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 1, 1999, pp. 229–232.
[31] S. Kang, Z. Shuang, Q. Duan, Y. Qin, L. Cai, Voiced/unvoiced decision algorithm for HMM-based speech synthesis, in: Proc. Interspeech, 2009, pp. 412–415.
[32] K. Yu, S. Young, Continuous F0 modeling for HMM based statistical parametric speech synthesis, IEEE Trans. Audio Speech Lang. Process. 19 (5) (2011) 1071–1079, http://dx.doi.org/10.1109/TASL.2010.2076805.
[33] K. Yu, T. Toda, M. Gasic, S. Keizer, F. Mairesse, B. Thomson, S. Young, Probabilistic modelling of F0 in unvoiced regions in HMM based speech synthesis, in: Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2009, pp. 3773–3776.
[34] U. Ogbureke, J. Cabral, J. Berndsen, Using multilayer perceptron for voicing strength estimation in HMM-based speech synthesis, in: Proc. IEEE Int. Conf. Information Science, Signal Process. and Their Applications, 2012, pp. 683–688.
[35] J. Kominek, A.W. Black, The CMU ARCTIC speech databases, in: Proc. 5th ISCA Speech Synthesis Workshop, 2004, pp. 223–224, http://festvox.org/cmu_arctic/index.html.
[36] N. Adiga, S.R.M. Prasanna, Detection of glottal activity using different attributes of source information, IEEE Signal Process. Lett. 22 (11) (2015) 2107–2111, http://dx.doi.org/10.1109/LSP.2015.2461008.
[37] A. Prathosh, T. Ananthapadmanabha, A. Ramakrishnan, Epoch extraction based on integrated linear prediction residual using plosion index, IEEE Trans. Audio Speech Lang. Process. 21 (12) (2013) 2471–2480, http://dx.doi.org/10.1109/TASL.2013.2273717.
[38] K.S.R. Murthy, B. Yegnanarayana, M.A. Joseph, Characterization of glottal activity from speech signals, IEEE Signal Process. Lett. 16 (6) (2009) 469–472.
[39] E. Nemer, R. Goubran, S. Mahmoud, Robust voice activity detection using higher-order statistics in the LPC residual domain, IEEE Trans. Speech Audio Process. 9 (3) (2001) 217–231, http://dx.doi.org/10.1109/89.905996.
[40] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27.
[41] C.J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (2) (1998) 121–167.
[42] G. Hinton, A practical guide to training restricted Boltzmann machines, Momentum 9 (1) (2010) 926.
[43] P. Krishnamoorthy, S.R.M. Prasanna, Enhancement of noisy speech by temporal and spectral processing, Speech Commun. 53 (2) (2011) 154–174.
[44] B.K. Khonglah, S.R.M. Prasanna, Speech/music classification using speech-specific features, Digit. Signal Process. 48 (2016) 71–83, http://dx.doi.org/10.1016/j.dsp.2015.09.005, http://www.sciencedirect.com/science/article/pii/S1051200415002730.
[45] K. Sjölander, J. Beskow, Wavesurfer – an open source speech tool, in: Proc. Interspeech, 2000, pp. 464–467.
[46] F. Plante, G. Meyer, W. Ainsworth, A pitch extraction reference database, in: Proc. Eurospeech, 1995.
[47] P.C. Bagshaw, S.M. Hiller, M.A. Jack, Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching.
[48] B. Yegnanarayana, K. Murty, Event-based instantaneous fundamental frequency estimation from speech signals, IEEE Trans. Audio Speech Lang. Process. 17 (4) (2009) 614–624.
Nagaraj Adiga was born in India in 1987. He received the B.E. degree in Electronics and Communication Engineering from University Visvesvaraya College of Engineering, Bengaluru, India, in 2008, and the Ph.D. degree from the Department of Electronics and Electrical Engineering, Indian Institute of Technology (IIT) Guwahati, India, in 2017. He is currently pursuing postdoctoral research at the Department of Computer Science, University of Crete, Greece. He was a Software Engineer at Alcatel-Lucent India Private Limited, Bengaluru, India, from 2008 to 2011, working mainly on next-generation high-leverage optical transport networks. His research interests are in speech processing, speech synthesis, signal processing, voice conversion, and machine learning.
Banriskhem K. Khonglah was born in India in 1987. He received the B.E. degree in Electronics and Communication Engineering from Sri Jayachamarajendra College of Engineering, Visvesvaraya Technological University, Mysore, India, in 2010, and the M.Tech. degree in Microelectronics and VLSI from the National Institute of Technology, Silchar, India, in 2012. He is currently pursuing the Ph.D. degree in Electronics and Electrical Engineering at the Indian Institute of Technology Guwahati, India. His research interests include speech processing, analysis, and recognition.
S.R. Mahadeva Prasanna received the B.E. degree in Electronics Engineering from Sri Siddartha Institute of Technology, Bangalore University, Bangalore, India, in 1994, the M.Tech. degree in Industrial Electronics from the National Institute of Technology Karnataka (NITK), Surathkal, India, in 1997, and the Ph.D. degree in Computer Science and Engineering from the Indian Institute of Technology (IIT) Madras, Chennai, India, in 2004. He is currently a Professor in the Department of Electronics and Electrical Engineering, Indian Institute of Technology (IIT) Guwahati, and also a Professor (on deputation) in the Department of Electrical Engineering, IIT Dharwad. His research interests are in speech, handwriting, and signal processing.