Robust voice activity detection using perceptual wavelet-packet transform and Teager energy operator


Pattern Recognition Letters 28 (2007) 1327–1332 www.elsevier.com/locate/patrec

Shi-Huang Chen a,*, Hsin-Te Wu a, Yukon Chang b, T.K. Truong b

a Department of Computer Science and Information Engineering, Shu-Te University, Kaohsiung County 824, Taiwan, ROC
b Department of Information Engineering, I-Shou University, Kaohsiung County 840, Taiwan, ROC

Available online 21 April 2007

Abstract

In this letter, a robust voice activity detection (VAD) algorithm is presented. This proposed VAD algorithm makes use of the perceptual wavelet-packet transform and the Teager energy operator to compute a robust parameter called voice activity shape for VAD. The main advantage of this algorithm is that the preset threshold values or a priori knowledge of the SNR usually needed in conventional VAD methods can be completely avoided. Various experimental results show that the proposed VAD algorithm is capable of outperforming the VAD of the Adaptive Multi-Rate (AMR) speech codec in both additive noisy and real noisy environments.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Voice activity detection (VAD); Perceptual wavelet-packet transform (PWPT); Teager energy operator (TEO)

* Corresponding author. E-mail address: [email protected] (S.-H. Chen).
0167-8655/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2006.11.023

1. Introduction

Voice activity detection (VAD) is used to distinguish speech from noise and is required in a variety of speech communication systems. For example, in an adaptive multi-rate (AMR) cellular phone system, the VAD module can reduce co-channel interference and power consumption in portable equipment (ETSI EN 301 708 V7.1.1, 1999). Because speech enhancement aims to remove noise components from speech signals, it can be adapted to solve the VAD problem, and vice versa (Le Bouquin-Jeannes and Faucon, 1995). Recently, Bahoura and Rouat (2001) proposed an effective speech enhancement algorithm based on the wavelet-packet transform and the Teager energy operator (TEO) (Kaiser, 1990). The de-noising performance of the Bahoura and Rouat (BR) algorithm has been shown to be similar to that of Ephraim and Malah's noise suppression scheme (Ephraim and Malah, 1984). The main reason for this satisfactory performance is that the BR algorithm applies wavelet-based time-adaptive (WBTA) thresholding to speech enhancement (Bahoura and Rouat, 2001). The WBTA adapts its threshold values for speech frames yet leaves them unchanged for non-speech frames, regardless of the background noise. The WBTA threshold is computed by approximating the Teager energy of the wavelet coefficients.

In this letter, the WBTA method is further developed to construct a robust VAD algorithm. The developed VAD algorithm employs the perceptual wavelet-packet transform (PWPT), instead of the conventional wavelet-packet transform, to decompose the input speech signal into critical sub-band signals. Such a PWPT is designed to match the psychoacoustic model and to improve the performance of various wavelet-based speech processing systems, such as speech de-noising (Pinter, 1996; Carneno and Drygajlo, 1999) and speech coding (Srinivasan and Jamieson, 1998; Carneno and Drygajlo, 1999). In each of the critical sub-band signals, the mask construction is obtained by smoothing the TEO of the corresponding wavelet coefficients. Unlike the BR algorithm, where the mask constructions are used to compute time-adaptive thresholding values, these mask constructions are applied to generate a robust parameter called voice activity shape (VAS) for VAD. The VAS has a property similar to that of the WBTA, in that its magnitude is time-adapted as a function of the speech components. In other words, a robust VAD algorithm can be achieved by tracing the magnitude of the VAS. A primary advantage of the proposed VAD algorithm is that the preset threshold values or a priori knowledge of the SNR usually needed in conventional VAD methods can be completely avoided. Using speech signals corrupted with additive and real noises, experimental results show that the proposed VAD algorithm performs better than the VADs of G.729B (ITU-T Rec. G.729 Annex B, 1996) and the AMR codec (ETSI EN 301 708 V7.1.1, 1999).

2. PWPT and TEO for the VAD algorithm

2.1. Perceptual wavelet-packet transform

As mentioned in (Pinter, 1996; Srinivasan and Jamieson, 1998; Carneno and Drygajlo, 1999), the decomposition tree of the PWPT is designed to approximate the critical bands as closely as possible in order to match the psychoacoustic model efficiently. Hence, the size of the PWPT decomposition tree is directly related to the number of critical bands. In this letter, the sampling rate is set to 8 kHz, yielding a speech bandwidth of 4 kHz. Within this bandwidth there are approximately 17 critical bands (Zwicker and Terhardt, 1980), and the corresponding PWPT decomposition tree can be constructed as shown in Fig. 1. With this PWPT, the input speech signal is transformed into 17 critical wavelet sub-band signals via the filter bank approach proposed by Mallat (1989).

2.2. Teager energy operator

It has been shown (Jabloun et al., 1999; Bahoura and Rouat, 2001) that the TEO is a powerful nonlinear operator that has been used successfully in various speech applications. For a given band-limited discrete speech signal $y(n)$, the discrete form of the TEO introduced by Kaiser (1990) is given by

\[ \Psi[y(n)] = y^2(n) - y(n+1)\,y(n-1) \tag{1} \]

where $\Psi[y(n)]$ is called the TEO coefficient of $y(n)$. Note that the TEO is applied to enhance the discriminability of speech components against those of noise (Bahoura and Rouat, 2001).

3. Implementation of the VAD algorithm

In the flow-chart shown in Fig. 2, the proposed VAD algorithm first computes the PWPT of the input speech signal $x(n)$ for $1 \le n \le N$, which results in 17 critical wavelet sub-band signals $w_{j,m}(k)$, where $3 \le j \le 5$ is the level of the PWPT, $1 \le m \le 17$, and $1 \le k \le N/2^j$. Then, from (1), a set of $t_{j,m}(k) = \Psi[w_{j,m}(k)]$ is derived from the TEO of $w_{j,m}(k)$.
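As a minimal sketch of the discrete TEO of Eq. (1), the following Python function evaluates y²(n) − y(n+1)·y(n−1) at every interior sample. The function name `teager_energy` and the choice to set the two endpoint values to zero are assumptions made for illustration; the letter does not state how the boundaries are handled.

```python
import math


def teager_energy(y):
    """Discrete Teager energy: psi[n] = y[n]^2 - y[n+1] * y[n-1].

    The first and last samples have no two-sided neighbourhood, so they are
    set to 0.0 here (an assumption, not something the letter specifies).
    """
    n = len(y)
    psi = [0.0] * n
    for i in range(1, n - 1):
        psi[i] = y[i] * y[i] - y[i + 1] * y[i - 1]
    return psi


# A pure tone A*cos(w*n) has constant Teager energy A^2 * sin(w)^2.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(64)]
tone_energy = teager_energy(tone)
```

For a pure tone the operator returns a constant proportional to the squared amplitude and, for small frequencies, to the squared frequency, which is why it responds strongly to speech components in a sub-band.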

Fig. 1. The tree structure of the proposed PWPT. [Tree diagram: decomposition levels 0 to 5 plotted against frequency from 0 to 4.0 kHz. The terminal nodes are w5,1 to w5,8 at level 5, w4,9 to w4,14 at level 4, and w3,15 to w3,17 at level 3.]

Fig. 2. The flow-chart of the proposed VAD algorithm. [x(n) → PWPT → w_{j,m}(k) → TEO → t_{j,m}(k) → Band Selection → T_{j,m}(k) → Mask → M_{j,m}(k) → IPWPT → W_m(n) → VAS → V(n) → AWT → VAD Results Output, with m = 1, ..., 17.]

3.1. Band selection

The level-dependent threshold $\lambda_j = \sigma_j \sqrt{2 \log(N)}$ proposed by Johnstone and Silverman (1997) is embedded in the band selection. That is,

\[ T_{j,m}(k) = \begin{cases} t_{j,m}(k), & \text{if } \operatorname{var}\{t_{j,m}(k)\} \ge \lambda_j \\ 0, & \text{otherwise} \end{cases} \tag{2} \]

where $\operatorname{var}\{t_{j,m}(k)\}$ denotes the variance of $t_{j,m}(k)$. Here the band selection is used to reject sub-bands that contain only noise.
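The band selection rule of Eq. (2) can be sketched in a few lines of Python. The letter does not say how the level-dependent scale σ_j is estimated, so in this sketch it is simply passed in by the caller; the function names, the use of the natural logarithm, and the use of the population variance are likewise assumptions made for illustration.

```python
import math
from statistics import pvariance


def universal_threshold(sigma_j, n_samples):
    """Level-dependent threshold lambda_j = sigma_j * sqrt(2 * log(N))."""
    return sigma_j * math.sqrt(2.0 * math.log(n_samples))


def select_band(t_band, sigma_j, n_samples):
    """Keep the TEO coefficients of one sub-band, or zero the whole band.

    t_band    : TEO coefficients t_{j,m}(k) of a single sub-band
    sigma_j   : noise scale of level j (hypothetical input, see lead-in)
    n_samples : N, the length of the input speech signal
    """
    lam = universal_threshold(sigma_j, n_samples)
    if pvariance(t_band) >= lam:
        return list(t_band)
    return [0.0] * len(t_band)


flat_band = select_band([1.0, 1.0, 1.0, 1.0], sigma_j=1.0, n_samples=4)
busy_band = select_band([0.0, 10.0, 0.0, 10.0], sigma_j=1.0, n_samples=4)
```

A band whose TEO coefficients barely vary is rejected as a whole, while a band with strongly varying coefficients passes through unchanged to the mask construction.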

3.2. Mask construction

For each selected $T_{j,m}(k)$, a mask is obtained by

\[ M_{j,m}(k) = T_{j,m}(k) * H_j(k) \tag{3} \]

where $*$ denotes the convolution operation and $H_j(k)$ is a $256/2^j$-point level-dependent Hamming window.

3.3. Calculation of voice activity shape (VAS)

The voice activity shape $V(n)$ is calculated by

\[ V(n) = \sum_{m=1}^{17} W_m(n) \]

where $W_m(n)$ is the inverse PWPT (IPWPT) of each $M_{j,m}(k)$ given by (3).

Fig. 3. VAS, SDRM, and AWT of a speech signal "Zero-Four-Seven-Five-Seven" with 0 dB white noise. [Plot of V(n), SDRM, and AWT: magnitude against sample point.]

Fig. 4. VAS, SDRM, and AWT of a speech signal "Oh-One-Five" with 0 dB white noise. [Plot of V(n), SDRM, and AWT: magnitude against sample point.]

3.4. VAD decision

As mentioned previously, the VAS is time-adapted as a function of the speech components, so it can be applied to the proposed VAD algorithm. It is observed that the magnitudes of $V(n)$ in voice-active regions are always greater than those in voice-inactive regions. Under noiseless conditions, $V(n)$ is zero in voice-inactive regions, and voice-active regions can be detected simply by checking whether $V(n) > 0$. However, when the speech is contaminated with noise, $V(n)$ is uniformly raised by a time-adaptive threshold value "AWT", making it non-zero even in the voice-inactive regions. In this condition, the voice-active regions are characterized by $V(n) > \mathrm{AWT}$. To determine this time-adaptive threshold value, this letter proposes an iterative algorithm consisting of the following five steps:

(1) Initially set $k = 1$ and define $V^{(1)}(n) = V(n)$.

(2) Let $V^{(k+1)}(n)$ be defined as

\[ V^{(k+1)}(n) = \begin{cases} V^{(k)}(n), & \text{if } V^{(k)}(n) < E[V^{(k)}(n)] \\ E[V^{(k)}(n)], & \text{otherwise} \end{cases} \tag{4} \]

where $E[V^{(k)}(n)]$ is the mean of $V^{(k)}(n)$.

(3) Repeat Step (2); the second derivative round mean (SDRM) is then obtained from $E[V^{(2)}(n)]$.

(4) Determine the voiced rate $p = L_v/L$ of the given speech signal, where $L_v$ is the number of samples for which $V^{(2)}(n) = V^{(1)}(n)$ and $L$ is the length of the input speech.

(5) Finally, the AWT of each frame is computed by Eq. (5):

\[ \mathrm{AWT}(i) = \begin{cases} \operatorname{Max}(\mathrm{Frame}(i)) \times 1.1, & \text{if } \operatorname{Max}(\mathrm{Frame}(i)) < \mathrm{Noise\_dis}(n) \\ \left( E[V^{(2)}(n)] + \operatorname{Mean}(\mathrm{Frame}(i)) \right)/2, & \text{otherwise} \end{cases} \tag{5} \]
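The five steps can be sketched in Python as follows. This is one reading of the procedure rather than the authors' implementation: it assumes a single clipping pass of Eq. (4) to obtain V^(2)(n), reads the factor 1.1 in Eq. (5) as a multiplication, takes L_v as the number of samples left unchanged by the clipping, uses 160-sample frames with Noise_dis = p · (SDRM + Mean(Frame(i)))/2 as defined for Eq. (5), and ignores a trailing partial frame. All function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def clip_to_mean(v):
    """One pass of Eq. (4): samples not below the mean are replaced by the mean."""
    m = mean(v)
    return [x if x < m else m for x in v]


def adaptive_thresholds(vas, frame_len=160):
    """Return one AWT value per complete frame of the voice activity shape."""
    v1 = list(vas)                       # Step (1): V^(1)(n) = V(n)
    v2 = clip_to_mean(v1)                # Step (2): V^(2)(n)
    sdrm = mean(v2)                      # Step (3): SDRM = E[V^(2)(n)]
    unchanged = sum(1 for a, b in zip(v1, v2) if a == b)
    p = unchanged / len(v1)              # Step (4): voiced rate p = Lv / L
    awt = []
    for start in range(0, len(v1) - frame_len + 1, frame_len):
        frame = v1[start:start + frame_len]
        half_sum = (sdrm + mean(frame)) / 2.0
        noise_dis = p * half_sum
        if max(frame) < noise_dis:       # Step (5), first branch of Eq. (5)
            awt.append(max(frame) * 1.1)
        else:                            # Step (5), second branch of Eq. (5)
            awt.append(half_sum)
    return awt


def vad_decision(vas, frame_len=160):
    """Mark a sample as voice-active when V(n) exceeds the AWT of its frame."""
    flags = []
    for i, threshold in enumerate(adaptive_thresholds(vas, frame_len)):
        frame = vas[i * frame_len:(i + 1) * frame_len]
        flags.extend(x > threshold for x in frame)
    return flags


# Toy shape: one quiet frame followed by one loud frame.
toy_vas = [1.0] * 160 + [100.0] * 160
toy_awt = adaptive_thresholds(toy_vas)
toy_flags = vad_decision(toy_vas)
```

On the toy input the quiet frame receives a threshold just above its own maximum, so none of its samples is flagged, while the loud frame receives a threshold halfway between the SDRM and its own mean, so all of its samples are flagged.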

In Eq. (5), $\mathrm{AWT}(i)$ is the time-adaptive threshold value of frame $i$, $\mathrm{Frame}(i)$ is defined as $\mathrm{Frame}(i) = [V((i-1) \times 160 + 1), \ldots, V(i \times 160)]$, and $\mathrm{Noise\_dis}(n)$ is defined as $\mathrm{Noise\_dis}(n) = p \cdot \{E[V^{(2)}(n)] + \operatorname{Mean}(\mathrm{Frame}(i))\}/2$.

Figs. 3 and 4 illustrate the VAS, SDRM, and AWT of two utterances under the 0 dB additive-noise condition.

4. Experimental results

In this letter, the probabilities of detection $P_d$ and false alarm $P_f$ for a number of noisy speech signals are used to evaluate the performance of the proposed VAD algorithm. Ideally, a VAD should maximize $P_d$ and minimize $P_f$. To obtain $P_d$ and $P_f$, the active and inactive regions of the clean speech signals are first marked manually. $P_d$ is calculated as the percentage of test cases in which the hand-marked speech regions are correctly detected by the VAD algorithm, while $P_f$ is the percentage of test cases in which hand-marked noise regions are erroneously identified as speech. There are 640 test speech signals used in this letter, all of them selected from the "Aurora 2 database" (Aurora 2 Database, 2000). The software simulations were performed using Matlab 7.0 on a Pentium IV 2.0, Windows XP PC.

The orthogonal wavelet filters considered in this letter include Daubechies, Coiflets, and Symlets (Daubechies, 1992; Burrus et al., 1998; Addison, 2002). In order to select an appropriate wavelet filter for the proposed VAD algorithm, a test experiment was carried out, and its results are listed in Table 1. This test experiment used the 640 test speech signals corrupted by different Gaussian white noises. The cost-performance (CP) rate used in Table 1 is defined as

\[ \mathrm{CP} = (\text{Average } P_d - \text{Average } P_f)/\text{CPU time} \tag{6} \]

where the CPU time is the average PWPT processing time of the specific wavelet.

Fig. 5. (a) The VAS (solid line) and the offset value (dashed line) of the clean speech signal "three-zero-eight-two" shown in (b); (b) the waveform of the clean speech signal and its VAD result; (c) the VAS (solid line) and the offset value (dashed line) of the noisy speech signal (white noise, SNR = 0 dB) shown in (d); and (d) the waveform of the noisy speech signal and its VAD result. [Four panels plotted against sample point [n]; the legends of panels (a) and (c) label the curves V(n) and AWT.]
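The two figures of merit used in this section can be sketched as a small scoring routine. The letter describes Pd and Pf as percentages over hand-marked regions without giving the exact counting unit, so scoring per sample (or per frame) against a boolean reference, as done here, is an assumption, as is the function name `detection_rates`.

```python
def detection_rates(reference, decision):
    """Return (Pd, Pf) in percent.

    reference : hand-marked labels, True where speech is present
    decision  : VAD output for the same positions, True where speech is declared
    """
    if len(reference) != len(decision):
        raise ValueError("reference and decision must have the same length")
    speech_total = sum(1 for r in reference if r)
    noise_total = len(reference) - speech_total
    hits = sum(1 for r, d in zip(reference, decision) if r and d)
    false_alarms = sum(1 for r, d in zip(reference, decision) if not r and d)
    # Guard against inputs that contain only speech or only noise.
    pd = 100.0 * hits / speech_total if speech_total else 0.0
    pf = 100.0 * false_alarms / noise_total if noise_total else 0.0
    return pd, pf


marked = [True, True, True, True, False, False, False, False]
detected = [True, True, True, False, True, False, False, False]
pd_percent, pf_percent = detection_rates(marked, detected)
```

In this toy case three of four speech positions are detected and one of four noise positions is flagged, giving Pd = 75% and Pf = 25%.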

Table 1
The test experimental results on the choice of wavelet filter

Wavelet filter type   Filter   Filter length   Average Pd (%)   Average Pf (%)   CPU time (s)   CP
Daubechies            D4       4               91.4             4.34             0.31           280
Daubechies            D8       8               93.1             2.32             0.32           283
Daubechies            D10      10              96.1             1.64             0.33           286
Daubechies            D14      14              96.3             1.6              0.35           270
Daubechies            D20      20              96.5             1.58             0.39           243
Coiflets              C6       6               92.1             3.49             0.32           277
Coiflets              C12      12              96.2             1.66             0.34           278
Coiflets              C18      18              96.4             1.61             0.38           249
Symlets               S4       4               91.9             4.56             0.31           282
Symlets               S10      10              95.5             1.61             0.33           285
Symlets               S16      16              96.3             1.57             0.37           256
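As a worked check of the cost-performance rate, the sketch below recomputes CP = (Average Pd − Average Pf)/CPU time from the rows of Table 1 and picks the filter with the largest value. The variable and function names are hypothetical. Because the tabulated CPU times are rounded to two decimals, the recomputed values agree with the listed CP column only to within about one unit.

```python
# filter name: (average Pd %, average Pf %, CPU time in s, CP as listed in Table 1)
TABLE_1 = {
    "D4": (91.4, 4.34, 0.31, 280),
    "D8": (93.1, 2.32, 0.32, 283),
    "D10": (96.1, 1.64, 0.33, 286),
    "D14": (96.3, 1.6, 0.35, 270),
    "D20": (96.5, 1.58, 0.39, 243),
    "C6": (92.1, 3.49, 0.32, 277),
    "C12": (96.2, 1.66, 0.34, 278),
    "C18": (96.4, 1.61, 0.38, 249),
    "S4": (91.9, 4.56, 0.31, 282),
    "S10": (95.5, 1.61, 0.33, 285),
    "S16": (96.3, 1.57, 0.37, 256),
}


def cost_performance(avg_pd, avg_pf, cpu_time):
    """Cost-performance rate: detection margin per second of CPU time."""
    return (avg_pd - avg_pf) / cpu_time


def best_filter(table):
    """Name of the filter with the highest recomputed CP."""
    return max(table, key=lambda name: cost_performance(*table[name][:3]))


recomputed = {name: cost_performance(*row[:3]) for name, row in TABLE_1.items()}
winner = best_filter(TABLE_1)
```

The longer filters gain little in Pd and Pf but cost more CPU time, so the CP rate peaks at a moderate filter length.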

Table 2
Probability of detection Pd (%) and probability of false alarm Pf (%) of the proposed VAD, AMR VAD Option 1, AMR VAD Option 2, and G.729B VAD for various noise conditions

Noise           SNR (dB)   Proposed VAD     AMR VAD (Option 1)   AMR VAD (Option 2)   G.729B VAD
                           Pd      Pf       Pd      Pf           Pd      Pf           Pd      Pf
Train-station   20         98.90   7.17     99.40   16.70        88.37   19.69        96.22   37.44
Train-station   15         98.83   8.91     98.71   25.02        84.37   20.02        97.83   38.50
Train-station   10         98.38   8.49     97.94   31.44        90.93   22.56        99.13   38.43
Train-station   5          96.73   9.18     96.23   45.04        96.79   30.45        98.43   39.42
Train-station   0          95.22   9.67     93.78   57.17        96.39   39.57        99.71   39.46
Street          20         99.08   8.04     98.78   26.24        78.74   19.36        97.76   41.78
Street          15         98.93   8.18     98.18   30.32        82.78   18.72        97.24   42.18
Street          10         97.84   8.34     97.45   32.42        88.73   19.18        96.35   42.71
Street          5          96.73   9.19     96.83   40.33        94.96   27.27        94.02   43.94
Street          0          94.82   9.46     96.18   48.65        98.46   40.37        93.71   41.39
Car             20         98.73   7.28     98.96   19.48        83.96   22.18        97.85   41.87
Car             15         98.39   7.82     98.05   20.27        86.42   23.29        97.82   45.88
Car             10         97.84   8.72     97.80   39.45        93.58   25.18        98.73   45.24
Car             5          94.32   9.11     97.43   53.51        95.69   30.35        98.68   46.72
Car             0          88.87   9.73     97.08   53.94        98.09   41.42        98.64   47.87
White noise     20         99.8    3.2      97.09   8.34         96.53   7.94         98.79   34.79
White noise     15         98.9    1.6      96.71   11.16        96.47   13.81        98.45   39.67
White noise     10         96.9    1.3      95.82   14.34        96.25   18.23        97.98   42.59
White noise     5          95.4    1.1      95.17   16.84        95.60   21.31        97.34   43.45
White noise     0          89.6    1.0      95.32   18.52        98.93   25.44        96.96   47.50

Considering the cost-performance rates given in Table 1, the Daubechies wavelet filter of length 10, which has the best CP rate, is recommended for the proposed VAD algorithm. An illustrative example of the proposed VAD algorithm with a clean speech signal and its corresponding noisy speech signal (SNR = 0 dB) is shown in Fig. 5.

Under a variety of noise sources and SNRs, the Pd and the Pf of the proposed algorithm are compared with those of the VAD specified in the ITU standard G.729B (ITU-T Rec. G.729 Annex B, 1996) and of VAD Option 1 and Option 2 of the AMR codec (ETSI EN 301 708 V7.1.1, 1999). The experimental results are summarized in Table 2. From Table 2, one observes that although the probabilities of detection Pd of AMR VAD Option 1, Option 2, and G.729B are high, their probabilities of false alarm Pf are unsatisfactory. This means that a substantial share of noise frames is classified as speech by AMR VAD Option 1, Option 2, and G.729B. The proposed VAD scheme presents a good alternative to these standardized algorithms. In addition to good average probabilities of detection, the proposed VAD scheme has a lower probability of false alarm Pf than all of the standardized algorithms in every noise environment and SNR condition listed in Table 2.

5. Conclusions

A robust VAD algorithm based on the BR speech enhancement technique has been presented in this letter. Two robust parameters, called VAS and AWT, that are time-adapted to the speech waveform are first developed by means of the PWPT and the TEO. The VAS and AWT are then used to achieve a robust VAD. The advantage of the proposed algorithm over conventional VAD methods is that it requires neither preset threshold values nor a priori knowledge of the SNR. Finally, various experimental results show that the proposed VAD algorithm performs better than the VADs of G.729B and the AMR codec.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments, which have greatly helped improve the presentation of the paper.

References

Addison, P.S., 2002. The Illustrated Wavelet Transform Handbook. Institute of Physics Publishing.
Aurora 2 Database, 2000. (accessed 12.07.06).
Bahoura, M., Rouat, J., 2001. Wavelet speech enhancement based on the Teager energy operator. IEEE Signal Process. Lett. 8, 10–12.
Burrus, C.S., Gopinath, R.A., Guo, H., 1998. Introduction to Wavelets and Wavelet Transforms. Prentice-Hall, Upper Saddle River, NJ.
Carneno, B., Drygajlo, A., 1999. Perceptual speech coding and enhancement using frame-synchronized fast wavelet-packet transform algorithms. IEEE Trans. Signal Process. 47 (6), 1622–1635.


Daubechies, I., 1992. Ten Lectures on Wavelets. CBMS, SIAM publ.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32, 1109–1121.
ETSI EN 301 708 V7.1.1, 1999. Voice Activity Detector (VAD) for Adaptive Multi-Rate.
ITU-T Rec. G.729 Annex B, 1996. A silence compression scheme for G.729 optimized for terminals conforming to ITU-T V.70.
Jabloun, F., Cetin, A.E., Erzin, E., 1999. Teager energy based feature parameters for speech recognition in car noise. IEEE Signal Process. Lett. 6, 259–261.
Johnstone, I.M., Silverman, B.W., 1997. Wavelet threshold estimators for data with correlated noise. J. Roy. Stat. Soc. B 59, 319–351.
Kaiser, J.F., 1990. On a simple algorithm to calculate the 'energy' of a signal. In: Proc. IEEE ICASSP'90, pp. 381–384.
Le Bouquin-Jeannes, R., Faucon, G., 1995. Study of a voice activity detector and its influence on a noise reduction system. Speech Commun. 16, 245–254.
Mallat, S., 1989. Multifrequency channel decomposition of images and wavelet model. IEEE Trans. Acoust. Speech Signal Process. 37, 2091–2110.
Pinter, I., 1996. Perceptual wavelet-representation of speech signals and its application to speech enhancement. Comput. Speech Lang. 10 (1), 1–22.
Srinivasan, P., Jamieson, L.H., 1998. High quality audio compression using an adaptive wavelet decomposition and psychoacoustic modeling. IEEE Trans. Signal Process. 46 (4), 1085–1093.
Zwicker, E., Terhardt, E., 1980. Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. JASA 68, 1523–1525.
