Pattern Recognition Letters 28 (2007) 1327–1332 www.elsevier.com/locate/patrec
Robust voice activity detection using perceptual wavelet-packet transform and Teager energy operator
Shi-Huang Chen a,*, Hsin-Te Wu a, Yukon Chang b, T.K. Truong b
a Department of Computer Science and Information Engineering, Shu-Te University, Kaohsiung County 824, Taiwan, ROC
b Department of Information Engineering, I-Shou University, Kaohsiung County 840, Taiwan, ROC
Available online 21 April 2007
Abstract

In this letter, a robust voice activity detection (VAD) algorithm is presented. The proposed VAD algorithm makes use of the perceptual wavelet-packet transform and the Teager energy operator to compute a robust parameter called the voice activity shape for VAD. The main advantage of this algorithm is that it needs neither the preset threshold values nor the a priori knowledge of the SNR usually required by conventional VAD methods. Various experimental results show that the proposed VAD algorithm is capable of outperforming the VAD of the Adaptive Multi-Rate (AMR) speech codec in both additive-noise and real-noise environments.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Voice activity detection (VAD); Perceptual wavelet-packet transform (PWPT); Teager energy operator (TEO)
* Corresponding author. E-mail address: [email protected] (S.-H. Chen).
0167-8655/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2006.11.023

1. Introduction

Voice activity detection (VAD) is used to distinguish speech from noise and is required in a variety of speech communication systems. For example, in an adaptive multi-rate (AMR) cellular phone system, the VAD module can reduce co-channel interference and power consumption in portable equipment (ETSI EN 301 708 V7.1.1, 1999). Because a speech enhancement algorithm removes noise components from speech signals, it can be adapted to solve the VAD problem, and vice versa (Le Bouquin-Jeannes and Faucon, 1995). Recently, Bahoura and Rouat (2001) proposed an effective speech enhancement algorithm based on the wavelet-packet transform and the Teager energy operator (TEO) (Kaiser, 1990). The de-noising performance of the Bahoura and Rouat (BR) algorithm has been shown to be similar to that of Ephraim and Malah's noise suppression scheme (Ephraim and Malah, 1984). The main reason for this satisfying performance in (Bahoura and Rouat, 2001) is that the BR algorithm applies a wavelet-based time-adaptive (WBTA) thresholding to speech enhancement. The WBTA adapts its threshold values in speech frames yet leaves them unchanged in non-speech frames, regardless of the background noise. This WBTA method is computed by approximating the Teager energy of the wavelet coefficients.

In this letter, the WBTA method is further developed to construct a robust VAD algorithm. The developed VAD algorithm employs the perceptual wavelet-packet transform (PWPT), instead of the conventional wavelet-packet transform, to decompose the input speech signal into critical sub-band signals. Such a PWPT is designed to match the psychoacoustic model and to improve the performance of various wavelet-based speech processing systems, such as speech de-noising (Pinter, 1996; Carneno and Drygajlo, 1999) and speech coding (Srinivasan and Jamieson, 1998; Carneno and Drygajlo, 1999). In each of the critical sub-band signals, the mask construction is obtained by smoothing the TEO of the corresponding wavelet coefficients. Unlike the BR algorithm, where the mask constructions are used to compute time-adaptive thresholding values, these mask constructions are applied to generate a robust parameter
called voice activity shape (VAS) for VAD. The VAS has a property similar to that of the WBTA, in that the magnitude of the VAS is time-adapted as a function of speech components. In other words, a robust VAD algorithm can be achieved by tracing the magnitude of the VAS. A primary advantage of the proposed VAD algorithm is that it needs neither the preset threshold values nor the a priori knowledge of the SNR usually required by conventional VAD methods. Using speech signals corrupted with additive and real noises, experimental results show that the proposed VAD algorithm performs better than the VADs of G.729B (ITU-T Rec. G.729 Annex B, 1996) and the AMR codec (ETSI EN 301 708 V7.1.1, 1999).
2. PWPT and TEO for the VAD algorithm

2.1. Perceptual wavelet-packet transform

As mentioned in (Pinter, 1996; Srinivasan and Jamieson, 1998; Carneno and Drygajlo, 1999), the decomposition tree structure of the PWPT is designed to approximate the critical bands as closely as possible in order to efficiently match the psychoacoustic model. Hence, the size of the PWPT decomposition tree is directly related to the number of critical bands. In this letter, the underlying sampling rate is set to 8 kHz, yielding a speech bandwidth of 4 kHz. Within this bandwidth, there are approximately 17 critical bands (Zwicker and Terhardt, 1980), and the corresponding PWPT decomposition tree can be constructed as shown in Fig. 1. By the use of this PWPT, the input speech signal can be transformed into 17 critical wavelet sub-band signals via the filter bank approach proposed by Mallat (1989).

2.2. Teager energy operator

It has been shown (Jabloun et al., 1999; Bahoura and Rouat, 2001) that the TEO is a powerful nonlinear operator which has been used successfully in various speech applications. For a given band-limited discrete speech signal $y(n)$, the discrete form of the TEO introduced by Kaiser (1990) is given by

$$\Psi[y(n)] = y^2(n) - y(n+1)\,y(n-1) \qquad (1)$$

where $\Psi[y(n)]$ is called the TEO coefficient of $y(n)$. Note that the TEO is applied to enhance the discriminability of speech components against those of noise (Bahoura and Rouat, 2001).

3. Implementation of the VAD algorithm

As shown in the flow-chart of Fig. 2, the proposed VAD algorithm first computes the PWPT of the input speech signal $x(n)$ for $1 \le n \le N$, which results in 17 critical wavelet sub-band signals, namely $w_{j,m}(k)$, where $3 \le j \le 5$ is the level of the PWPT, $1 \le m \le 17$, and $1 \le k \le N/2^j$. Then, from (1), a set of $t_{j,m}(k) = \Psi[w_{j,m}(k)]$ can be derived from the TEO of $w_{j,m}(k)$.
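The discrete TEO of Eq. (1) is simple enough to state in a few lines of code. The following Python fragment is only a minimal sketch: the function name `teager_energy` and the treatment of the first and last samples (where Eq. (1) is undefined because a neighbour is missing) are assumptions made here for illustration and are not specified in this letter. In the proposed algorithm the operator would be applied to each sub-band signal separately.

```python
import math


def teager_energy(y):
    """Discrete TEO of Eq. (1): psi[n] = y[n]^2 - y[n+1] * y[n-1].

    Assumption (not from the letter): the two edge samples, for which a
    neighbour is missing, copy the nearest interior value so that the
    output has the same length as the input.
    """
    n = len(y)
    if n < 3:
        return [0.0] * n
    interior = [y[k] * y[k] - y[k + 1] * y[k - 1] for k in range(1, n - 1)]
    return [interior[0]] + interior + [interior[-1]]


# For a pure tone A*cos(omega*n) the TEO is the constant (A*sin(omega))^2,
# which gives a convenient sanity check of the implementation.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * k) for k in range(64)]
tone_teo = teager_energy(tone)
```

The pure-tone check reflects a known property of the operator: its output depends on both the amplitude and the frequency of the component, which is why it can emphasise speech components relative to noise.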
[Figure: PWPT decomposition tree drawn as decomposition level (0 to 5) versus frequency (0 to 4.0 kHz); the leaf sub-bands are labelled w5,1 to w5,8 (level 5), w4,9 to w4,14 (level 4), and w3,15 to w3,17 (level 3).]
Fig. 1. The tree structure of the proposed PWPT.

[Figure: flow-chart with the blocks PWPT, TEO, Band Selection, Mask, IPWPT, VAS, AWT, and VAD Results Output; the signals passed between them are x(n), wj,m(k), tj,m(k), Tj,m(k), Mj,m(k), Wm(n), and V(n), with m = 1, ..., 17.]
Fig. 2. The flow-chart of the proposed VAD algorithm.

3.1. Band selection

The level-dependent threshold, i.e. $\lambda_j = \sigma_j \sqrt{2\log(N)}$, proposed by Johnstone and Silverman (1997) is embedded in the band selection. That is,

$$T_{j,m}(k) = \begin{cases} t_{j,m}(k), & \text{if } \mathrm{var}\{t_{j,m}(k)\} \ge \lambda_j \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $\mathrm{var}\{t_{j,m}(k)\}$ denotes the variance of $t_{j,m}(k)$. Here the band selection is used to reject the sub-bands that contain only noise.
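The band-selection rule of Eq. (2) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the letter does not say how the level-dependent noise level σj is estimated, so it is passed in as a parameter; the logarithm is taken as the natural logarithm; and the population variance is used. The function names are hypothetical.

```python
import math
from statistics import pvariance


def level_threshold(sigma_j, n_samples):
    """Level-dependent threshold lambda_j = sigma_j * sqrt(2 * log(N)).

    Assumptions: natural logarithm, and sigma_j supplied by the caller
    (its estimation is not described in the letter).
    """
    return sigma_j * math.sqrt(2.0 * math.log(n_samples))


def select_band(t_band, sigma_j, n_samples):
    """Eq. (2): keep the TEO sub-band if its variance reaches lambda_j,
    otherwise replace it with zeros (the band is treated as noise only).

    Assumption: population variance is used for var{t_{j,m}(k)}.
    """
    if pvariance(t_band) >= level_threshold(sigma_j, n_samples):
        return list(t_band)
    return [0.0] * len(t_band)


# A nearly flat band is rejected, a strongly fluctuating one is kept.
flat_band = select_band([0.1, 0.1, 0.1, 0.1], sigma_j=1.0, n_samples=8)
busy_band = select_band([0.0, 10.0] * 4, sigma_j=1.0, n_samples=8)
```

Note that the decision is made once per sub-band: the whole band is either passed on unchanged or zeroed, which differs from coefficient-wise wavelet thresholding.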
3.2. Mask construction

For each selected $T_{j,m}(k)$, a mask is obtained by

$$M_{j,m}(k) = T_{j,m}(k) * H_j(k) \qquad (3)$$

where $*$ denotes the convolution operation and $H_j(k)$ is a $256/2^j$-point level-dependent Hamming window.

3.3. Calculation of voice activity shape (VAS)

The voice activity shape $V(n)$ is calculated by

$$V(n) = \sum_{m=1}^{17} W_m(n)$$

where $W_m(n)$ is the inverse PWPT (IPWPT) of each $M_{j,m}(k)$ given by (3).

[Figure: magnitude versus sample point, showing V(n), SDRM, and AWT.]
Fig. 3. VAS, SDRM, and AWT of a speech signal “Zero-Four-Seven-Five-Seven” with 0 dB white noise.

[Figure: magnitude versus sample point, showing V(n), SDRM, and AWT.]
Fig. 4. VAS, SDRM, and AWT of a speech signal “Oh-One-Five” with 0 dB white noise.

3.4. VAD decision

As mentioned previously, the VAS is time-adapted as a function of speech components, so it can be applied to the proposed VAD algorithm. It is observed that the magnitudes of $V(n)$ in voice-active regions are always greater than those in voice-inactive regions. Under noiseless conditions, $V(n)$ is zero in voice-inactive regions, and voice-active regions can be easily detected by checking whether $V(n) > 0$. However, when the speech is contaminated with noise, $V(n)$ is uniformly raised by a time-adaptive threshold value “AWT”, making it nonzero even in the voice-inactive regions. In this condition, the voice-active regions are characterized by $V(n) >$ AWT. To determine this time-adaptive threshold value, the iterative algorithm proposed in this letter consists of the following five steps:

(1) Initially set $k = 1$ and define $V^{(1)}(n) = V(n)$.
(2) Let $V^{(k+1)}(n)$ be defined as

$$V^{(k+1)}(n) = \begin{cases} V^{(k)}(n), & \text{if } V^{(k)}(n) < E[V^{(k)}(n)] \\ E[V^{(k)}(n)], & \text{otherwise} \end{cases} \qquad (4)$$

where $E[V^{(k)}(n)]$ is the mean of $V^{(k)}(n)$.
(3) Repeat Step (2) and then one can obtain the second derivative round mean (SDRM) from $E[V^{(2)}(n)]$.
(4) Determine the voiced rate $p = L_v/L$ of the given speech signal, where $L_v$ is the length of $V^{(2)}(n) = V^{(1)}(n)$ and $L$ is the length of the input speech.
(5) Finally, the AWT of each frame can be computed by the use of Eq. (5). That is,

$$\mathrm{AWT}(i) = \begin{cases} \mathrm{Max}(\mathrm{Frame}(i)) \times 1.1, & \text{if } \mathrm{Max}(\mathrm{Frame}(i)) < \mathrm{Noise\_dis}(n) \\ \left(E[V^{(2)}(n)] + \mathrm{Mean}(\mathrm{Frame}(i))\right)/2, & \text{otherwise} \end{cases} \qquad (5)$$
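The five steps above can be sketched in Python as follows. This is a hedged illustration, not the authors' code. It relies on the definitions of Frame(i) (consecutive blocks of 160 samples) and Noise_dis(n) that accompany Eq. (5) in the text. Three readings are assumptions made here: Lv is taken as the number of samples left unchanged by the first pass of Eq. (4); L is taken as the length of V(n); and the factor 1.1 in Eq. (5) is read as a multiplication. All function names are hypothetical, and the input is assumed to be non-empty.

```python
FRAME_LEN = 160  # samples per frame, from the definition of Frame(i)


def mean(values):
    return sum(values) / len(values)


def round_mean_clip(v):
    """One pass of Eq. (4): samples not below the mean are set to the mean."""
    m = mean(v)
    return [x if x < m else m for x in v]


def adaptive_thresholds(vas):
    """Return one AWT value per 160-sample frame of the VAS (Steps 1 to 5)."""
    v1 = list(vas)                    # Step (1): V^(1)(n) = V(n)
    v2 = round_mean_clip(v1)          # Step (2): V^(2)(n)
    # Step (3): repeating Step (2) clips V^(2) at its own mean; only that
    # mean, E[V^(2)(n)] (the SDRM), is needed below.
    sdrm = mean(v2)
    # Step (4), assumed reading: L_v counts samples where V^(2) == V^(1).
    l_v = sum(1 for a, b in zip(v1, v2) if a == b)
    p = l_v / len(v1)
    awt = []
    for start in range(0, len(v1), FRAME_LEN):   # Step (5), Eq. (5)
        frame = v1[start:start + FRAME_LEN]
        frame_max, frame_mean = max(frame), mean(frame)
        noise_dis = p * (sdrm + frame_mean) / 2.0
        if frame_max < noise_dis:
            awt.append(frame_max * 1.1)          # assumed "x 1.1"
        else:
            awt.append((sdrm + frame_mean) / 2.0)
    return awt


def vad_decision(vas):
    """Mark sample n as voice-active when V(n) exceeds the AWT of its frame."""
    awt = adaptive_thresholds(vas)
    return [x > awt[n // FRAME_LEN] for n, x in enumerate(vas)]


# Toy VAS: one quiet frame followed by one loud frame.
toy_vas = [1.0] * FRAME_LEN + [100.0] * FRAME_LEN
toy_awt = adaptive_thresholds(toy_vas)
toy_flags = vad_decision(toy_vas)
```

In the toy example the quiet frame receives a threshold just above its own maximum, so it is classified as inactive, while the loud frame receives a threshold midway between the SDRM and the frame mean and is classified as active.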
In Eq. (5), AWT(i) is the time-adaptive threshold value of frame i, Frame(i) is defined as Frame(i) = [V((i − 1) × 160 + 1), V(i × 160)], i.e., the i-th block of 160 samples (20 ms at the 8 kHz sampling rate), and Noise_dis(n) is defined as Noise_dis(n) = p · {E[V^{(2)}(n)] + Mean(Frame(i))}/2. Figs. 3 and 4 illustrate the VAS, SDRM, and AWT of two sentences under the 0 dB additive-noise condition.

4. Experimental results

In this letter, the probabilities of detection Pd and false alarm Pf for a number of noisy speech signals are utilized to evaluate the performance of the proposed VAD algorithm. Ideally, a VAD should maximize Pd and minimize Pf. To obtain Pd and Pf, the active and inactive regions of the clean speech signals are first marked manually. Pd is calculated as the percentage of test cases in which the hand-marked speech regions are correctly detected by the VAD algorithm, while Pf is the percentage of test cases in which hand-marked noise regions are erroneously identified as speech.

There are 640 test speech signals used in this letter, and all of them are selected from the “Aurora 2 database” (Aurora 2 Database, 2000). The software simulations were performed using Matlab 7.0 on a Pentium IV 2.0, Windows XP PC. The orthogonal wavelet filters including Daubechies, Coiflets, and Symlets (Daubechies, 1992; Burrus et al., 1998; Addison, 2002) are considered in this letter.

In order to select an appropriate wavelet filter for the proposed VAD algorithm, a test experiment was carried out and its results are listed in Table 1. This test experiment used the 640 test speech signals corrupted by different Gaussian white noises. The cost-performance (CP) rate used in Table 1 is defined as

$$\mathrm{CP} = (\text{Average } P_d - \text{Average } P_f)/\text{CPU time} \qquad (6)$$

where the CPU time is the average PWPT processing time of the specific wavelet. Considering the cost-performance rate
[Figure: four panels (a) to (d) plotted against sample point [n]; panels (a) and (c) show V(n) and AWT, panels (b) and (d) show the speech waveform with the VAD result.]
Fig. 5. (a) The VAS (solid line) and the offset value (dashed line) of the clean speech signal “three-zero-eight-two” shown in (b), (b) the waveform of the clean speech signal and its VAD result, (c) the VAS (solid line) and the offset value (dashed line) of the noisy speech signal (white noise, SNR = 0 dB) shown in (d), and (d) the waveform of the noisy speech signal and its VAD result.
Table 1
The test experimental results on choosing of wavelet filter

Wavelet filter type  Filter  Filter length  Average Pd (%)  Average Pf (%)  CPU time (s)  CP
Daubechies           D4      4              91.4            4.34            0.31          280
Daubechies           D8      8              93.1            2.32            0.32          283
Daubechies           D10     10             96.1            1.64            0.33          286
Daubechies           D14     14             96.3            1.6             0.35          270
Daubechies           D20     20             96.5            1.58            0.39          243
Coiflets             C6      6              92.1            3.49            0.32          277
Coiflets             C12     12             96.2            1.66            0.34          278
Coiflets             C18     18             96.4            1.61            0.38          249
Symlets              S4      4              91.9            4.56            0.31          282
Symlets              S10     10             95.5            1.61            0.33          285
Symlets              S16     16             96.3            1.57            0.37          256
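The CP column of Table 1 can be recomputed from the other columns with Eq. (6). The short Python sketch below does this and picks the filter with the highest CP; the dictionary layout and the names `TABLE_1` and `cost_performance` are introduced here only for illustration. The recomputed values agree with the printed CP column to within one unit, which is consistent with the printed Pd, Pf, and CPU times being rounded.

```python
# Table 1 rows: filter -> (average Pd in %, average Pf in %, CPU time in s)
TABLE_1 = {
    "D4": (91.4, 4.34, 0.31),
    "D8": (93.1, 2.32, 0.32),
    "D10": (96.1, 1.64, 0.33),
    "D14": (96.3, 1.6, 0.35),
    "D20": (96.5, 1.58, 0.39),
    "C6": (92.1, 3.49, 0.32),
    "C12": (96.2, 1.66, 0.34),
    "C18": (96.4, 1.61, 0.38),
    "S4": (91.9, 4.56, 0.31),
    "S10": (95.5, 1.61, 0.33),
    "S16": (96.3, 1.57, 0.37),
}


def cost_performance(avg_pd, avg_pf, cpu_time):
    """Eq. (6): CP = (Average Pd - Average Pf) / CPU time."""
    return (avg_pd - avg_pf) / cpu_time


cp_by_filter = {name: cost_performance(*row) for name, row in TABLE_1.items()}
best_filter = max(cp_by_filter, key=cp_by_filter.get)
```

With the figures of Table 1, `best_filter` is the length-10 Daubechies filter D10.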
Table 2
Probability of detection Pd and probability of false-alarm Pf of the proposed VAD, AMR VAD Option 1, AMR VAD Option 2, and G.729B VAD for various noise conditions

Noise          SNR (dB)  Proposed VAD     AMR VAD (Option 1)  AMR VAD (Option 2)  G.729B VAD
                         Pd (%)  Pf (%)   Pd (%)  Pf (%)      Pd (%)  Pf (%)      Pd (%)  Pf (%)
Train-station  20        98.90   7.17     99.40   16.70       88.37   19.69       96.22   37.44
Train-station  15        98.83   8.91     98.71   25.02       84.37   20.02       97.83   38.50
Train-station  10        98.38   8.49     97.94   31.44       90.93   22.56       99.13   38.43
Train-station  5         96.73   9.18     96.23   45.04       96.79   30.45       98.43   39.42
Train-station  0         95.22   9.67     93.78   57.17       96.39   39.57       99.71   39.46
Street         20        99.08   8.04     98.78   26.24       78.74   19.36       97.76   41.78
Street         15        98.93   8.18     98.18   30.32       82.78   18.72       97.24   42.18
Street         10        97.84   8.34     97.45   32.42       88.73   19.18       96.35   42.71
Street         5         96.73   9.19     96.83   40.33       94.96   27.27       94.02   43.94
Street         0         94.82   9.46     96.18   48.65       98.46   40.37       93.71   41.39
Car            20        98.73   7.28     98.96   19.48       83.96   22.18       97.85   41.87
Car            15        98.39   7.82     98.05   20.27       86.42   23.29       97.82   45.88
Car            10        97.84   8.72     97.80   39.45       93.58   25.18       98.73   45.24
Car            5         94.32   9.11     97.43   53.51       95.69   30.35       98.68   46.72
Car            0         88.87   9.73     97.08   53.94       98.09   41.42       98.64   47.87
White noise    20        99.8    3.2      97.09   8.34        96.53   7.94        98.79   34.79
White noise    15        98.9    1.6      96.71   11.16       96.47   13.81       98.45   39.67
White noise    10        96.9    1.3      95.82   14.34       96.25   18.23       97.98   42.59
White noise    5         95.4    1.1      95.17   16.84       95.60   21.31       97.34   43.45
White noise    0         89.6    1.0      95.32   18.52       98.93   25.44       96.96   47.50
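The Pd and Pf figures of Table 2 follow the definitions given in Section 4: Pd is the percentage of hand-marked speech that the VAD detects, and Pf is the percentage of hand-marked noise that the VAD flags as speech. The Python sketch below shows one way to compute them. It assumes that the reference marks and the VAD output are available as per-frame boolean labels of equal length; the letter does not state the unit of comparison, so this framing and the function name `detection_rates` are assumptions.

```python
def detection_rates(reference, decisions):
    """Return (Pd, Pf) in percent.

    reference: hand-marked labels, True for speech and False for noise.
    decisions: VAD output for the same frames, True for detected speech.
    Assumption: both are per-frame labels of equal length.
    """
    if len(reference) != len(decisions):
        raise ValueError("reference and decisions must have the same length")
    speech_total = sum(1 for r in reference if r)
    noise_total = len(reference) - speech_total
    hits = sum(1 for r, d in zip(reference, decisions) if r and d)
    false_alarms = sum(1 for r, d in zip(reference, decisions) if d and not r)
    pd = 100.0 * hits / speech_total if speech_total else 0.0
    pf = 100.0 * false_alarms / noise_total if noise_total else 0.0
    return pd, pf


# Four speech frames followed by four noise frames: one miss, one false alarm.
marks = [True, True, True, True, False, False, False, False]
output = [True, True, True, False, True, False, False, False]
example_pd, example_pf = detection_rates(marks, output)
```

A detector that labels every frame as speech would reach Pd = 100% together with Pf = 100%, which is why the two rates have to be read together when comparing the methods in Table 2.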
given in Table 1, the Daubechies wavelet filter with length of 10, which has the best CP ratio, is recommended for the proposed VAD algorithm.

An illustrative example of the proposed VAD algorithm with a clean speech signal and its corresponding noisy speech signal (SNR = 0 dB) is shown in Fig. 5. Under a variety of noise sources and SNRs, the Pd and the Pf of the proposed algorithm are compared with those of the VAD specified in the ITU standard G.729B (ITU-T Rec. G.729 Annex B, 1996) and of VAD Option 1 and Option 2 of the AMR codec (ETSI EN 301 708 V7.1.1, 1999). The experimental results are summarized in Table 2. From Table 2, one observes that although the probabilities of detection Pd of AMR VAD Option 1, Option 2, and G.729B are high, their probabilities of false alarm Pf are unsatisfactory. This means that many noise-only frames will be classified as speech frames by AMR VAD Option 1, Option 2, and G.729B. The proposed VAD scheme presents a good alternative to these standardized algorithms. In addition to good average probabilities of detection, the proposed VAD scheme has lower probabilities of false alarm Pf than all the mentioned standardized algorithms over a variety of noise environments and conditions.

5. Conclusions

A robust VAD algorithm based on the BR speech enhancement technique has been presented in this letter. Two robust parameters, called VAS and AWT, that are time-adapted to the speech waveform are first developed by the use of the PWPT and the TEO. Then, the VAS and the AWT are utilized to achieve a robust VAD. The advantage of the proposed algorithm over the conventional VAD is that it requires neither preset threshold values nor a priori knowledge of the SNR. Finally, various experimental results show that the proposed VAD algorithm performs better than the VADs of G.729B and the AMR codec.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments, which have greatly helped improve the presentation of the paper.

References

Addison, Paul S., 2002. The Illustrated Wavelet Transform Handbook. Institute of Physics Publishing.
Aurora 2 Database, 2000. (accessed 12.07.06).
Bahoura, M., Rouat, J., 2001. Wavelet speech enhancement based on the Teager energy operator. IEEE Signal Process. Lett. 8, 10–12.
Burrus, C.S., Gopinath, R.A., Guo, Haitao, 1998. Introduction to Wavelets and Wavelet Transforms. Prentice-Hall, Upper Saddle River, NJ.
Carneno, B., Drygajlo, A., 1999. Perceptual speech coding and enhancement using frame-synchronized fast wavelet-packet transform algorithms. IEEE Trans. Signal Process. 47 (6), 1622–1635.
Daubechies, I., 1992. Ten Lectures on Wavelets. CBMS, SIAM publ.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32, 1109–1121.
ETSI EN 301 708 V7.1.1, 1999. Voice Activity Detector (VAD) for Adaptive Multi-Rate.
ITU-T Rec. G.729 Annex B, 1996. A silence compression scheme for G.729 optimized for terminals conforming to ITU-T V.70.
Jabloun, F., Cetin, A.E., Erzin, E., 1999. Teager energy based feature parameters for speech recognition in car noise. IEEE Signal Process. Lett. 6, 259–261.
Johnstone, I.M., Silverman, B.W., 1997. Wavelet threshold estimators for data with correlated noise. J. Roy. Stat. Soc. B 59, 319–351.
Kaiser, J.F., 1990. On a simple algorithm to calculate the ‘energy’ of a signal. In: Proc. IEEE ICASSP’90, pp. 381–384.
Le Bouquin-Jeannes, R., Faucon, G., 1995. Study of a voice activity detector and its influence on a noise reduction system. Speech Commun. 16, 245–254.
Mallat, S., 1989. Multifrequency channel decomposition of images and wavelet model. IEEE Trans. Acoust. Speech Signal Process. 37, 2091–2110.
Pinter, I., 1996. Perceptual wavelet-representation of speech signals and its application to speech enhancement. Comput. Speech Lang. 10 (1), 1–22.
Srinivasan, P., Jamieson, L.H., 1998. High quality audio compression using an adaptive wavelet decomposition and psychoacoustic modeling. IEEE Trans. Signal Process. 46 (4), 1085–1093.
Zwicker, E., Terhardt, E., 1980. Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. JASA 68, 1523–1525.