Improved Multiscale Product Method for Glottal Opening Instants From Speech Signal



W. Saidi*, A. Bouzid**, N. Ellouze***

*Ecole Nationale d’Ingénieurs Tunis - Le Belvédère B.P 37, 1002 Tunis, Tunisie (Tel: +216 71 872 729; e-mail: [email protected])
**Ecole Nationale d’Ingénieurs Tunis - Le Belvédère B.P 37, 1002 Tunis, Tunisie (e-mail: [email protected])
***Ecole Nationale d’Ingénieurs Tunis - Le Belvédère B.P 37, 1002 Tunis, Tunisie (e-mail: [email protected])

Abstract: In this paper, we improve glottal opening instant (GOI) detection from the speech signal with a new approach based on the Multiscale Product Method (MPM). We compute the product of the wavelet transform coefficients at three scales using the quadratic spline wavelet. Glottal closure and opening instants appear respectively as a strong minimum and a weak maximum on the product signal. GOI localisation is enhanced by refining the time interval that contains this instant, based on the open quotient value. The proposed method is applied and evaluated on the Keele University database. Evaluation results show that the MPM detects GOIs with a performance rate of 96% for the entire database.

Keywords: Speech, Glottal opening instant detection, open quotient, Multi-scale product.

1. INTRODUCTION

Voiced speech is produced by a quasi-periodic vibration of the vocal folds that excites the vocal tract. During each period, two important excitations are considered. The primary excitation occurs at the instant of vocal fold closure, called the glottal closure instant (GCI). The second occurs at the opening of the glottis and is defined as the glottal opening instant (GOI). The GCI marks the start of the closed phase, while the GOI marks the start of the open phase. Accurate identification of GCIs and GOIs from the speech signal is crucial for many applications, such as voice pathology classification. GOI identification is necessary for many important current problems in speech processing.
Closed-phase LPC analysis and the subsequent inverse filtering for glottal volume flow estimation from the speech signal are such examples. Further uses are found in the analysis of pathological speech, including types of dysphonia, and of vocal fold impact stress.

The multi-scale product method (MPM) is proposed in this paper to estimate the GCI and the GOI. It operates by multiplying the wavelet transform coefficients of the speech signal at three scales. The MPM is a multi-scale analysis proposed to overcome the limitations of single-scale analysis: at fine scales, signal noise produces spurious edge points, whereas at large scales the wavelet modulus maxima are smoothed and the peak marking the singularity loses precision.

The glottal opening instant is regarded as a secondary excitation of the vocal tract. Indeed, it is characterised by weaker and more dispersed energy than the primary excitation at glottal closure. As a result, the GOI has received less attention and remains a challenging problem for signal-processing researchers.

On the derivative of the electroglottographic (EGG) signal, termed DEGG, GCIs and GOIs appear as minima and maxima respectively. Many works rely on this criterion to measure glottal opening and closing moments directly. Numerous approaches have been developed for GCI detection from the speech signal, based on the Frobenius norm, the wavelet transform, the LPC residual and the group delay function. In a previous work, Bouzid and Ellouze proposed the MPM for GCI and GOI detection. The difficulty of GOI localisation stems from the structure of the maxima produced by the MP, which results from the weak energy of the opening excitation. Few works have addressed the determination of GOIs from the speech signal [Bouzid et al. (2007)], [Drugman et al. (2009)].

The present paper aims to enhance the localisation of glottal opening instants using the MPM. To improve the GOI position, we look for a new time interval that is smaller than the pitch period and necessarily contains the GOI. Knowing the open quotient (OQ) makes it possible to define this interval. The OQ is defined as the ratio of the open phase duration to the pitch period. In our case, we use the Keele database, which is characterised by normal voice quality, where the glottis is typically open during half of the pitch period. Thus, the open quotient is around 0.5 for the female and male speakers of the database.

The paper is structured as follows. Section 2 describes the multi-scale product method for edge detection. Section 3 outlines the improved MPM used for glottal opening identification. Evaluation results of the MPM on the Keele University database are presented in Section 4. Section 5 draws our conclusions.

2. MULTISCALE PRODUCT FOR EDGE DETECTION

The multi-scale product is a well-known method for edge detection. It was first used in image processing and then applied to clean and noisy speech signals [Bouzid et al. (2007)], [Saidi et al. (2008)], [Saidi et al. (2010)]. The MPM consists of calculating the product of wavelet transforms at three scales. Negative peaks represent GCIs and positive peaks correspond to GOIs. Generally, a GOI is modelled as a maximum occurring between two successive GCIs. This method gives a good representation of the GCI and a better detection of the GOI, because the product is a nonlinear combination that reduces noise and spurious peaks [Bouzid et al. (2006)].

Previous works have shown that the multi-scale product is an accurate and robust method for GCI detection [Saidi et al. (2008)], [Saidi et al. (2010)]. Tested on the Keele University database, the MPM achieved a performance rate of 99% with a good detection rate of about 96%.

The product of the wavelet transforms of a function $f(n)$ at some adjacent dyadic scales is:

$$p(n) = \prod_{j} W_{s_j} f(n) \qquad (1)$$

where $W_{s_j} f(n)$ represents the wavelet transform of the function $f(n)$ at scale $s_j$. The product $p(n)$ shows peaks at signal edges and has relatively small values elsewhere. An odd number of terms in $p(n)$ preserves the edge sign.

In the present work, we apply the MPM with the three scales 2, 2.5 and 3, which give better results for GCI detection than the dyadic scales ½, 1 and 2.

Figure 1 shows the speech signal of the female speaker f2, followed by its multi-scale product and the derivative of the EGG signal (DEGG), taken as a reference signal. GCIs appear as well-aligned minima on the product signal. Glottal closure instants are easily detected using the “findpeaks” MATLAB function provided by VOICEBOX [Brooks (2008)].
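As an illustration of equation (1), the following Python sketch computes a multi-scale product at the scales 2, 2.5 and 3. It is only a minimal sketch under stated assumptions: a first-derivative-of-Gaussian wavelet stands in for the quadratic spline wavelet used in this work, border samples are simply repeated, and the names `gaussian_derivative_wavelet`, `wavelet_transform` and `multiscale_product` are hypothetical.

```python
import math


def gaussian_derivative_wavelet(scale, half_width=4.0):
    """Sampled first derivative of a Gaussian (ASSUMPTION: a stand-in
    for the quadratic spline wavelet, chosen only for brevity)."""
    half = int(math.ceil(half_width * scale))
    return [-(t / scale ** 2) * math.exp(-t * t / (2.0 * scale ** 2))
            for t in range(-half, half + 1)]


def wavelet_transform(signal, scale):
    """Convolve the signal with the wavelet dilated to the given scale."""
    psi = gaussian_derivative_wavelet(scale)
    half = len(psi) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(psi):
            k = j - half                      # tap offset in [-half, half]
            idx = min(max(i - k, 0), last)    # repeat the border samples
            acc += signal[idx] * weight
        out.append(acc)
    return out


def multiscale_product(signal, scales=(2.0, 2.5, 3.0)):
    """Point-wise product of the wavelet transforms at the given scales."""
    product = [1.0] * len(signal)
    for scale in scales:
        coeffs = wavelet_transform(signal, scale)
        product = [p * c for p, c in zip(product, coeffs)]
    return product
```

Applied to a unit step, for instance, this product is close to zero away from the transition and shows a single peak at the edge. With three (an odd number of) scales, a rising and a falling step give peaks of opposite sign, which is the sign-preserving property stated above.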

Fig. 1. Speech signal of a female voice (speaker f2), the multi-scale product and the DEGG signal.

3. THE PROPOSED APPROACH FOR GLOTTAL OPENING INSTANT DETECTION

The method that we propose for glottal opening instant detection from the speech signal is strongly related to the GCI positions and is based on three essential steps. The first consists of determining the glottal closure instants. The second refines the time interval in which the GOI is likely to occur. Finally, the GOI is taken as the maximum of the speech MP located within this interval.

The GOI usually follows the glottal closing instant. It is identified as a maximum in the MP occurring between two GCIs. The pitch period is the primary interval containing the glottal opening instant. However, this interval is too large to permit accurate detection of the GOI. We therefore define a smaller interval that facilitates correct estimation. The GOI is the beginning of the open glottal phase and is consequently related to the ratio of the open phase to the pitch period, called the open quotient. The open quotient is given by:

$$OQ(k) = \frac{GCI(k+1) - GOI(k)}{GCI(k+1) - GCI(k)} \qquad (2)$$

where $GOI(k)$ is the glottal opening instant occurring between the two closure instants $GCI(k)$ and $GCI(k+1)$, $GCI(k+1) - GOI(k)$ is the duration of the open phase, and $T(k) = GCI(k+1) - GCI(k)$ is the pitch period.

In this study, we use the Keele University database, which consists of normal voices for which the OQ is around 0.5. GCIs and GOIs are estimated on voiced speech. The search interval is therefore reduced from $[GCI(k) : GCI(k+1)]$ to $[GCI(k) + 0.4\,T(k) : GCI(k+1) - 0.55\,T(k)]$.

Figure 2 shows three signals. The first is the multi-scale product of a voiced speech frame of speaker f5; minima correspond to GCIs and maxima to GOIs. Glottal opening instants are detected using the intervals depicted by the second signal. The last signal is the MP of the EGG signal. In this case, GOIs are correctly detected.

Fig. 2. The MP of the speech signal (speaker f5), time interval containing the GOI and the DEGG signal.
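The interval refinement and the open quotient of equation (2) can be sketched as follows. This is a hedged illustration rather than the exact implementation used in this work: `detect_gois` and `open_quotient` are hypothetical names, GCIs are assumed to be available as sorted sample indices, and the bounds 0.4 and 0.55 are the values quoted above for the Keele database.

```python
def detect_gois(mp, gcis, lower=0.4, upper=0.55):
    """For each pair of successive GCIs, return the index of the MP maximum
    inside [GCI(k) + lower*T(k), GCI(k+1) - upper*T(k)].

    `mp` is the multi-scale product as a sequence of samples and `gcis` a
    sorted list of GCI sample indices (ASSUMPTION: computed beforehand).
    A cycle whose interval is empty yields None.
    """
    gois = []
    for gci_k, gci_next in zip(gcis[:-1], gcis[1:]):
        period = gci_next - gci_k                    # T(k) in samples
        lo = gci_k + int(round(lower * period))
        hi = gci_next - int(round(upper * period))
        if hi < lo:
            gois.append(None)
            continue
        gois.append(max(range(lo, hi + 1), key=lambda n: mp[n]))
    return gois


def open_quotient(gci_k, goi_k, gci_next):
    """Equation (2): open phase duration divided by the pitch period."""
    return (gci_next - goi_k) / (gci_next - gci_k)
```

Restricting the search in this way means that a larger maximum lying elsewhere in the pitch period is ignored, so the choice of the two bounds directly controls which peak is retained.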

4. RESULTS AND EVALUATION

As mentioned above, the test corpus used in this work is the Keele University database, widely used for evaluating pitch estimation [Jinachitra (2006)]. It includes two kinds of signals: acoustic speech signals and laryngograph signals. Five adult female speakers and five adult male speakers were recorded in low ambient noise conditions in a sound-proof room. These speakers are denoted fi and mi, where i varies from 1 to 5. Each utterance consists of the same phonetically balanced English text. In each case, the acoustic and laryngograph signals are time-synchronised and share the same sampling rate of 20 kHz [Plante et al. (1995)].

With the new interval, the method succeeds in correctly localising the glottal opening instants. Table 1 shows the performance results on the Keele database for GOI detection.

To evaluate our GOI detection approach, we calculate the delay between the estimated GOI and the reference one given by the EGG signal. The speech and EGG signals are time-aligned to compensate for the observed larynx-to-microphone delay. In practice, we compute the mean distance between the reference GOIs and the estimated GOIs for each voiced frame. This delay is subtracted from the vector of estimated GOIs of the same frame. We denote as Good Detection (GD) the rate of GOIs detected with an error below 0.25 ms. When the error lies within the interval [0.25 ms, 1 ms], it is considered a Fine Error (FE). If the error exceeds 1 ms, it is a Gross Error (GE). When the algorithm finds no GOI during a pitch period, the case is counted as a Missing Measure (MM). A False Alarm (FA) occurs when more than one GOI is identified in a pitch period.
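The delay compensation and the three error classes described above can be sketched as follows. The sketch assumes that estimated and reference GOIs have already been paired one-to-one within a voiced frame, so missing measures and false alarms are counted separately; the function name `classify_goi_errors` is hypothetical, and the default sampling rate is the 20 kHz of the Keele database.

```python
def classify_goi_errors(estimated, reference, fs_hz=20000.0):
    """Label each paired GOI as 'GD', 'FE' or 'GE'.

    `estimated` and `reference` hold GOI positions in samples for one voiced
    frame (ASSUMPTION: same length, already matched pair by pair).
    """
    diffs = [e - r for e, r in zip(estimated, reference)]
    delay = sum(diffs) / len(diffs)          # mean larynx-to-microphone delay
    labels = []
    for d in diffs:
        error_ms = abs(d - delay) / fs_hz * 1000.0
        if error_ms < 0.25:
            labels.append("GD")              # good detection
        elif error_ms <= 1.0:
            labels.append("FE")              # fine error
        else:
            labels.append("GE")              # gross error
    return labels
```

Because the delay is estimated as a plain mean over the frame, a single large outlier shifts the alignment of every other GOI in that frame, which is worth keeping in mind when interpreting fine errors.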

Fig. 4. MP of the speech signal, the DEGG signal and GOIs.

Figure 4 displays the MP of the speech signal of a female speaker and the MP of the corresponding EGG signal. GCIs and GOIs are marked on these signals. The estimated and reference GOIs, shown in the last panel of the figure, coincide very well thanks to the improved MPM.

Table 1. Performance rates (%) of the Keele University database for GOI detection

Speakers    GD      FE      GE     FA     MM     PR      FR
Female      90.83   5.44    0.40   0.44   3.33   96.27   3.73
Male        73.92   18.02   1.78   0.53   6.28   91.94   8.06
Database    88.04   7.51    0.62   0.45   3.83   95.55   4.45

• Performance Rate (PR) (%) = Good Detection (%) + Fine Error (%).
• Failure Rate (FR) (%) = Gross Error (%) + Missing Measure (%).

As shown in Table 1, the MPM reaches a good detection rate for GOI localisation of 91% for the female speakers of the Keele database. This rate is about 74% for male speakers. For the entire database, GOIs are correctly detected at a rate of 88%.

From these measures we also derive the two aggregate rates reported in Table 1, the performance rate (PR) and the failure rate (FR). The performance rate reflects the reliability of the detector, whereas the good detection rate reflects its accuracy.

The performance rate is shown as a bar diagram in Figure 5. It is around 96% for the Keele University database.
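As a small worked check of the two aggregate rates, the sketch below recomputes PR and FR from the GD, FE, GE and MM columns of Table 1. The function name `summary_rates` is hypothetical, and rounding to two decimals is an assumption made to match the precision of the table.

```python
def summary_rates(gd, fe, ge, mm):
    """Return (PR, FR) in percent: PR = GD + FE and FR = GE + MM."""
    performance_rate = round(gd + fe, 2)
    failure_rate = round(ge + mm, 2)
    return performance_rate, failure_rate


# Rows of Table 1 as (GD, FE, GE, MM), in percent.
table_1 = {
    "female": (90.83, 5.44, 0.40, 3.33),
    "male": (73.92, 18.02, 1.78, 6.28),
    "database": (88.04, 7.51, 0.62, 3.83),
}
rates = {name: summary_rates(*row) for name, row in table_1.items()}
```

The recomputed pairs agree with the PR and FR columns of Table 1, for example 96.27 and 3.73 for the female speakers.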

Fig. 3. MP of the speech signal, and the DEGG signal.

Figure 3 illustrates the multi-scale product of the speech signal of speaker f5, followed by the DEGG. In this example the maxima are duplicated, so the GOI is ambiguous unless the search is restricted to the new interval.

Estimating the interval in which the GOIs occur improves on our previous multi-scale product method for identifying these sensitive spikes.

Fig. 5. Performance of the MPM for GOI detection using Keele University database (bar diagram of the performance rate PR, in %, for the female speakers, the male speakers and the whole database).

5. CONCLUSIONS

An improved multi-scale product method for GOI detection has been proposed. The MP approach consists of calculating the wavelet transforms of the speech signal at three scales and then multiplying them. The product reinforces the cross-scale peaks and reduces spurious noisy peaks. On the speech product signal, the GCI appears as a minimum and the GOI is modelled as a maximum located between two successive GCIs. The MPM is used first to identify GCIs. Then, short intervals are estimated in order to secure the correct position of the GOIs. Since the open quotient is about 0.5 for all speakers of the Keele database, the interval is bounded by $GCI(k) + 0.4\,T(k)$ on one side and $GCI(k+1) - 0.55\,T(k)$ on the other. The indices of the maxima located within these intervals form the set of estimated GOIs.

The MPM achieves considerable performance, especially for female speakers: it correctly detects 91% of GOIs with a performance rate above 96%. The performance rate for the entire database is 95.55%. This study motivates us to look for an approach that estimates the open quotient used to define the interval, so that the MPM can be applied automatically to any speech corpus.

REFERENCES

Bouzid, A., and Ellouze, N. (2006). Singularity detection of electroglottogram signal by multiscale product method. 14th European Signal Processing Conference, EUSIPCO’06, Florence, Italy.
Bouzid, A., and Ellouze, N. (2007). Open quotient measurements based on multiscale product of speech signal wavelet transform. Research Letters in Signal Processing, Vol. 2007.
Plante, F., Meyer, G.F., and Ainsworth, W.A. (1995). A pitch extraction reference database. In Proc. Eurospeech, Madrid, pp. 837-840.
Brooks, M. (2008). A speech processing toolbox for MATLAB. Available from .
Jinachitra, P. (2006). Glottal closure and opening detection for flexible parametric voice coding. Interspeech, Pittsburgh.
Drugman, T., and Dutoit, T. (2009). Glottal closure and opening instant detection from speech signals. International Conference on Spoken Language Processing, ICSLP’09.
Saidi, W., Bouzid, A., and Ellouze, N. (2008). Evaluation of multi-scale product method and DYPSA algorithm for glottal closure instant detection. 3rd International Conference on Information and Communication Technologies: From Theory to Applications, ICTTA 2008, pp. 1-5, 7-11.
Saidi, W., Bouzid, A., and Ellouze, N. (2010). MPM method and DYPSA algorithm evaluation for GCI detection in noisy speech signal. International Journal of Computing and Information Technology, Serials Publications, Vol. 2, No. 1.
