Signal Processing 84 (2004) 431 – 433
www.elsevier.com/locate/sigpro
Fast communication
Dissonant frequency #ltering technique for improving perceptual quality of noisy speech and husky voice Sangki Kang Telecommunication R&D Center, Samsung Electronics Co. Ltd., Suwon P.O. Box 105, Maetan-3dong, Paldal-gu, Suwon-si, Gyeonggi-do, South Korea Received 13 March 2003; received in revised form 23 March 2003
Abstract There have been numerous studies on the enhancement of the noisy speech signal. In this paper, we propose a completely new speech quality improvement method, that is, a dissonant frequency #ltering technique (especially B and F# in each octave of the tempered scale when the reference frequency is C) based on improved fundamental frequency estimation which is developed in frequency domain. The subjective test results indicate that the proposed method provides a signi#cant gain in audible improvement especially for a husky voice signal. Therefore if the #lter is employed as a pre-#lter for speech quality improvement, the output speech quality and intelligibility should be enhanced perceptually. ? 2003 Elsevier B.V. All rights reserved. Keywords: Dissonant frequency; Speech quality enhancement
1. Introduction Degradation of the quality or intelligibility of speech caused by the acoustic background noise is common in most practical situations. In general, the addition of noise reduces intelligibility and degrades the performance of digital voice processors used for applications such as speech compression and recognition. Therefore, the problem of removing the uncorrelated noise component from the noisy speech signal, i.e., speech enhancement, has received considerable attention. There have been numerous studies on the enhancement of the noisy speech signal. Many di7erent types of speech enhancement algorithms have been proposed and tested [1–4,6,7,10]. E-mail address:
[email protected] (S. Kang).
They can be grouped into three categories. The #rst category contains speech enhancement algorithms based on the short-time spectral estimation such as the spectrum subtraction [1] and Wiener #ltering [4,6] techniques. The algorithms in the second category are comb #ltering and adaptive noise cancelling techniques which exploit the quasi-periodic nature of the speech signal [2,7,10]. The third category contains algorithms that are based on the statistical model of the speech signal and use hidden Markov model (HMM) or expectation and maximization (EM) for speech enhancement [3]. In this paper, we propose a completely new speech quality improvement method, that is, a #ltering of a dissonant frequency (especially B and F# in each octave of the tempered scale when reference frequency is C) based on an improved fundamental frequency estimation using parametric cubic convolution [8]. There
0165-1684/$ - see front matter ? 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2003.10.023
432
S. Kang / Signal Processing 84 (2004) 431 – 433
are a lot of fundamental frequency estimation algorithms in speech signal processing. Autocorrelation method, AMDF, SIFT, etc. are such examples. But most of them su7er from the problem of resolution because of discrete sampling time and limited window size. So we adopt the parametric cubic convolution method [8], which o7ers fundamental frequency estimation results beyond the resolution of FFT.
160 and 400 Hz for female [9]. However, exact elimination of corresponding dissonant frequencies is not desirable or not good to implement. So elimination region with margin of half tone is de#ned where the dissonant frequency exist according to the following equations. F0 × 2(n+11=24) ¡ Fd1 ¡ F0 × 2(n+13=24) ; n = 2; 3; : : : ; 7;
2. A dissonant frequency ltering
F0 × 2(n+21=24) ¡ Fd2 ¡ F0 × 2(n+23=24) ;
The standard musicological de#nition is that a musical interval is consonant if it sounds pleasant or restful. Dissonance, on the other hand, is the degree to which an interval sounds unpleasant or rough. Dissonant intervals generally feel tense and unresolved. When reference frequency is C, B and F# are such examples. Among them, F# is known as “the Devil’s interval” in music [11]. So we eliminate those dissonant frequencies to make speech less annoying and pleasant. In a dissonant frequency #ltering of input speech signal to enhance perceptual speech quality, it is important to accurately estimate the fundamental frequency that is closely related to a dissonant frequency #ltering. A #ltering of a dissonant frequency based on the fundamental frequency estimation which is developed in frequency domain is as follows. First, frequency transform of the original input signal using fast Fourier transform (FFT) is carried out and then an improved fundamental frequency estimation is performed using parametric cubic convolution [8]. Next, #ltering of the dissonant frequency based on the obtained fundamental frequency (pitch) is performed. The dissonant frequency Fd1 and Fd2 corresponding to B and F# relative to fundamental frequency are de#ned as follows [5]: Fd1 = F0 × 2(n+6=12) ; Fd2 = F0 × 2(n+11=12) ;
n = 0; 1; : : : ; 7; n = 0; 1; : : : ; 7:
(3)
(1) (2)
Here F0 is fundamental frequency estimated using parametric cubic convolution [8]. And n which represent octave band varies from 0 to 7 on condition that Fd1 and Fd2 is less than half of the sampling frequency. In this paper, however, we consider that n is varying from 2 to 7 because fundamental frequency ranges between 80 and 160 Hz for male talkers and between
n = 2; 3; : : : ; 7;
(4)
where each range corresponds to a half tone or a semitone. Very simple #ltering algorithm is employed in this paper. If speech shows a peak in the region de#ned above, that peak is smeared out, that is, the magnitude in that region is lowered to the neighboring magnitude while the phase is kept. Finally, the inverse Fourier transform of #ltered signal is carried out. This is illustrated in Fig. 1. 3. Experimental results The database for performance evaluation consists of 14 speech #les collected from 7 speakers (5 males and 2 females), each one delivering 2 Korean sentences. Also we obtained 6 husky voice #les from 3 speakers (2 males and 1 female) who have worse quality than normal speakers. All utterances were sampled at 8 kHz with 8 bit resolution. Speech input is windowed by 256-point Hanning window and padded with 256 point zeroes. And hanning window is half overlapped. 256 point window corresponds to 32 ms in which speech is considered to be stationary. Zero padding is used to enhance the accuracy of fundamental frequency estimation. Noise types considered in our experiments include white Gaussian noise, babble noise and car noise recorded inside a car moving approximately at a speed of 90 km=h. We obtain the noisy speech by adding noise to clean speech with the noise power being adjusted to achieve 10 and 15 dB SNR. In order to evaluate the performance of the proposed speech quality improvement scheme, subjective tests (MOS test) were conducted. Twenty listeners participated in the test. During the testing period, each individual used a high-quality headset as a listening
S. Kang / Signal Processing 84 (2004) 431 – 433
433
Fig. 1. Speech quality improvement scheme.
Table 1 Results of subjective measures of enhanced speech Noise type
SNR
MOS (unprocessed)
MOS (processed)
White Gaussian noise
10 dB 15 dB
2.78 2.87
3.09 3.02
Babble noise
10 dB 15 dB
2.79 2.97
3.01 3.09
Car noise (motor noise)
10 dB 15 dB
2.89 3.11
3.14 3.22
Clean speech
Male Female
4.58 4.64
4.69 4.80
Husky voice (clean speech)
Male Female
2.98 3.05
3.49 3.46
device in a quiet room. The recordings in each presentation were played randomly. The listeners were asked to give a score ranging from 1 (very bad) to 5 (excellent) according to the perceived perceptual quality. The results are illustrated in Table 1. The subjective test results indicate that the proposed method provides a signi#cant gain in perceptual improvement especially for a husky voice signal. 4. Conclusion We proposed a new speech quality improvement scheme, that is, a #ltering of a dissonant frequency based on the improved fundamental frequency estimation. The subjective test results indicated that the proposed method delivered improvements in terms of both speech intelligibility and perceived quality
when compared with the unprocessed case. Therefore when the #lter is employed as a pre-#lter for speech enhancement, the output speech quality is more enhanced perceptually. References [1] S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust. Speech Signal Process. ASSP-29 (April 1979) 113–120. [2] S.F. Boll, D.C. Pulsipher, Suppression of acoustic noise in speech using two microphone adaptive noise cancelation, IEEE Trans. Acoust. Speech Signal Process. ASSP-28 (6) (December 1980) 752–753. [3] Y. Ephraim, Statistical-model-based speech enhancement systems, Proc. IEEE 80 (10) (October 1992) 1526–1555. [4] J. Hansen, M. Clements, Constrained iterative speech enhancement with application to speech recognition, IEEE Trans. Acoust. Speech Signal Process. ASSP-39 (4) (April 1991) 21–27. [5] D.M. Howard, J. Angus, Acoustics and Psychoacoustics, Music Technology Series, Focal Press, 1996. [6] J.S. Lim, A.V. Oppenheim, All-pole modeling of degraded speech, IEEE Trans. Acoust. Speech Signal Process. ASSP-26 (June 1978) 197–210. [7] J.S. Lim, A.V. Oppenheim, L.D. Braida, Evaluation of an adaptive comb #ltering method for enhancing speech degraded by white noise addition, IEEE Trans. Acoust. Speech Signal Process. ASSP-26 (4) (April 1991) 354–358. [8] Hee-Suk Pang, SeongJoon Baek, Koeng-Mo Sung, Improved fundamental frequency estimation using parametric cubic convolution, IEICE Trans. Fundamentals E83-A (12) (2000) 2747–2750. [9] T.W. Parsons, Voice and Speech Processing, McGraw-Hill, New York, 1986. [10] M.R. Samber, Adaptive noise canceling for speech signals, IEEE Trans. Acoust. Speech Signal Process. ASSP-26 (October 1978) 419–423. [11] J. Tennenbaum, The foundations of scienti#c musical tuning, FIDELIO Magazine, 1(1) (Winter 1991–92).