Applied Acoustics 73 (2012) 50–55
Contents lists available at ScienceDirect
Applied Acoustics journal homepage: www.elsevier.com/locate/apacoust
Wide-band dereverberation method based on multichannel linear prediction using prewhitening filter Takuma Okamoto a,b,⇑, Yukio Iwaya a,c, Yôiti Suzuki a,c a
Research Institute of Electrical Communications, Tohoku University, Katahira 2-1-1, Aoba-ku, Sendai 980-8577, Japan Graduate School of Engineering, Tohoku University, Katahira 2-1-1, Aoba-ku, Sendai 980-8577, Japan c Graduate School of Information Sciences, Tohoku University, Katahira 2-1-1, Aoba-ku, Sendai 980-8577, Japan b
a r t i c l e
i n f o
Article history: Received 28 February 2011 Received in revised form 4 July 2011 Accepted 7 July 2011 Available online 31 July 2011 Keywords: Microphone array signal processing Dereverberation LIME algorithm Colored source signal Prewhitening filter
a b s t r a c t Several dereverberation algorithms have been studied. The sampling frequencies used in conventional studies are typically 8–16 kHz because their main purpose is preprocessing for improving the intelligibility of speech communication and articulation for automatic speech recognition. However, in next-generation communication systems, techniques to analyze and reproduce not only semantic information of sound but also more high-definition components such as spatial information and directivity will be increasingly necessary. To decompose these sound field characteristics with high definition, a dereverberation algorithm that is useful at high sampling frequencies is an important technique to process sound that includes high-frequency spectra such as musical sounds. The LInear-predictive Multichannel Equalization (LIME) algorithm is a promising dereverberation method. Using the LIME algorithm, however, a dereverberation signal cannot be solved at high sampling frequencies when the source signal is colored, such as in the case of speech and sound of musical signals. Because the rank of the correlation matrix calculated from such a colored signal is not full, the characteristic polynomial cannot be calculated precisely. To alleviate this problem, we propose preprocessing of all input signals with filters to whiten their spectra so that this algorithm can function for colored signals at high sampling frequencies. Ó 2011 Elsevier Ltd. All rights reserved.
1. Introduction Conventionally, the main purpose of dereverberation algorithms is to improve speech communication intelligibility and the performance of automatic speech recognition systems [1]. Therefore, in these studies, sampling frequencies are typically 8–16 kHz. However, in next-generation high-definition communications systems, techniques to analyze and synthesize not only semantic information of sound but also more qualitative characteristics comparable to textures of sound must be very important. Therefore, it is necessary to extract room acoustic characteristics and original sound precisely from each sound source, along with the directivity of each sound source from the sound sensed in a room at a position that is distant from the sound sources [2–5]. Dereverberation is a key technique. The sampling frequency must be sufficiently high, such as 44.1 kHz. Recently, dereverberation of musical sounds has been attracting the interest of researchers [6]. Therefore, we conducted a study to develop an algorithm that can decompose reverberant components from received signals at high sampling frequencies. ⇑ Corresponding author at: Research Institute of Electrical Communications, Tohoku University, Katahira 2-1-1, Aoba-ku, Sendai 980-8577, Japan. E-mail address:
[email protected] (T. Okamoto). 0003-682X/$ - see front matter Ó 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.apacoust.2011.07.004
To date, various dereverberation techniques have been proposed [1]. Among them, methods using inverse filtering of a room transfer function based on the Multi input–output INverse Theorem (MINT) [7] show good performances to extract original source signals from input signals, including reflected sound. In methods based on MINT, the impulse responses corresponding to room acoustics (room impulse responses, RIR) are necessary to calculate the inverse filters. However, it is difficult to estimate impulse responses solely from the input signals because of the great length of room impulse responses [8]. These techniques are known as blind multichannel identification based on second-order or higher-order statistical [9] or subspace methods [10]. Another type is blind dereverberation, where the inverse filter is estimated solely from the input signals [11–13]. Particularly, LInear-predictive Multi-input Equalization (LIME) [12] is widely used because this method shows good performance against whitening problems in blind dereverberation. In the LIME algorithm, the signal generating process is estimated from input signals based on multi-channel linear prediction; the original signal is extracted almost perfectly if the sampling frequency is as high as 16 kHz. However, when the sampling frequency is higher and the source signal is colored as normally as typical music signals and speech signals, then the dereverberation performance is decreased
T. Okamoto et al. / Applied Acoustics 73 (2012) 50–55
even with the LIME algorithm. This performance degradation occurs because the matrix rank used in the calculation is decreased and becomes extremely ill-conditioned. To cope with this problem in this paper, we propose a new wide-band dereverberation algorithm by improving the LIME algorithm. In Section 2, the LIME algorithm is briefly introduced. The problem of the LIME algorithm at high sampling frequency is discussed in Section 3. A new algorithm using a prewhitening filter is then proposed in Section 4. In Section 5, the performance of the proposed method is evaluated. Finally, our conclusions are presented in Section 6. 2. LIME algorithm The LIME algorithm is based on the hypothesis that the original sound source signal s(n) is modeled using linear prediction. The estimated source signal ^sðnÞ is recovered by calculating the estimated prediction residual ^eðnÞ and the estimated Auto-Regressive ^ðzÞ from the matrix Q obtained from the input (AR) parameters a signals xi(n), as shown by the following equation:
Q ¼ ðEfxðn 1ÞxT ðn 1ÞgÞþ Efxðn 1ÞxT ðnÞg:
ð1Þ
+
In that equation, A is the Moore–Penrose generalized inverse of matrix [15] A, xi(n) = [xi(n), . . . , xi(n (L 1))]T and xðnÞ ¼ ½xT1 ðnÞ; . . . ; xTP ðnÞT . ^ðzÞ, which correspond to the linThe estimated AR parameters a ear predictive coefficient a(n) of the source signal s(n), are obtained from the characteristic polynomial [14] of Q. The LIME algorithm schema is depicted in Fig. 1. The order of a(n) is defined as
N ¼ M þ L 1;
ð2Þ
where M is the order of the room impulse response hi(n) and where L signifies the order of the inverse filter based on MINT. The order of a(n) is obtainable as Eq. (2) by calculating Q using the higher order than the real order M of hi(n) despite unclear M. This constitutes a great advantage of LIME. The LIME algorithm has been introduced and discussed in detail in an earlier report [12]. 3. Performance degradation of LIME at high sampling frequency 3.1. Occasion of performance degradation In the LIME algorithm, the prediction residual ^ eðnÞ and the esti^ðzÞ of the source signal mated Auto-Regressive (AR) parameters a
51
are calculated from the correlation matrix of received signals and its inverse given by Eq. (1). Therefore, the accuracy of LIME depends on the granularity of Q in Eq. (1). To calculate Eq. (1), the rank of the correlation matrix must be full because the precision of inverse in Eq. (1) depends on its rank. The rank of the correlation matrix is determined by its condition number, which is obtainable as a ratio between the maximum eigenvalue and minimum eigenvalue. If the length of L in Eq. (2) is sufficiently large in LIME, then the eigenvalues of the correlation matrix E {x(n 1)xT(n)} in Eq. (1) are equivalent to the average power spectrum of x(n) [16,17]. Therefore, if the average power spectrum of x(n) is colored, then the difference between the maximum eigenvalue and the minimum eigenvalue is large and the rank of its correlation matrix becomes smaller than the full rank. However, most actual sounds around us are frequently colored, but have energy bias. For example, in speech and musical instrumental signals, energy is concentrated in the low-frequency region. It decays concomitantly at higher frequencies. Such a tendency of colored characteristics increases when the sampling frequency is high. At high sampling frequency, the correlation matrix E {x(n 1)xT(n 1)}, defined as Eq. (1), is colored when the input signal is colored. Results show that the rank of the correlation matrix E {x(n 1)xT(n 1)} is reduced to less than N(= the order of AR ^ðzÞ) and that its inverse and the characteristic polynopolynomial a mial of Q cannot be calculated accurately, which underscores an important problem: LIME does not work at high sampling frequencies if the input signals are colored. 3.2. Computer simulation Using computer simulations, we demonstrate that the dereverberation performance is degraded when a source signal is a colored signal in LIME. In the simulations, the room impulse responses hi(n) that were used were those measured using a time-stretched pulse (TSP) [18] in a soundproof room using a microphone array of 20 microphones (Fig. 2). The room size was 5.18 m 3.38 m 2.52 m. The soundproof room reverberation time was 0.15 s and sampling frequency was 44.1 kHz for deliberation at high sampling frequency. Therefore, the length of the impulse response hi(n) was around 44,100 0.15 = 6150 samples. Fig. 3 shows the temporal waveform and the averaged spectrum of a measured impulse response. The averaged spectra of s(n) h(n) at two sampling frequencies are shown in Fig. 4a–d, respectively. The original source signal in
Fig. 1. Block diagram of the LIME Algorithm [12].
52
T. Okamoto et al. / Applied Acoustics 73 (2012) 50–55
rank of the correlation matrix is less than the order of a(z). Meanwhile, the results of the rank of patterns (a) and (d) are equivalent to a(z). From these results, in LIME, it is shown that the dereverberation performance is degraded when a source signal is a colored signal. 4. Proposal for improving the method of LIME: White-LIME To alleviate the problem discussed in the previous section, we propose a wide-band dereverberation algorithm that works at high sampling frequency for colored signals by improving the LIME algorithm with a prewhitening filter. The proposed method is designated as White-LIME. The procedure used for the proposed method is shown schematically in Fig. 5. If the colored characteristics of the input signals are mitigated such that the rank of the correlation matrix E {x(n 1)xT(n 1)} would be of the same order of N, then LIME would work well even when the sampling frequency is high. Therefore, we propose whitening of all input signals before application of the LIME algorithm. Consequently, the rank will be improved and the polynomial of Q can be obtained stably. In conventional LIME, a source signal is decomposed by white noise and the linear prediction coefficient (AR polynomial). This decomposition is based on the hypothesis that the signal x(n) is described by the linear prediction coefficient a(n) and by white noise e(n) shown as
Fig. 2. Measurement environment.
xðnÞ ¼ Fig. 4a and b is male speech from the Phoneme-Balanced 1000 Sentence Speech Database by NTT – Advanced Technology (MIYB985.WAV; about 3.19 s, sampling frequency is 44.1 kHz, 16-bit linear PCM). The source signal in Fig. 4a is a downsampled source (b) from 44.1 kHz to 8 kHz. The impulse response hi(n) is downsampled, too. The source signal s(n) in Fig. 4c is a violin sound from Sound Quality Assessment Material (SQAM) Recordings for subjective tests [19] (track59.wav; L channel, 5 s (0.6–5.5 s); sampling frequency, 44.1 kHz; 16 bit linear PCM). The source signal s(n) in Fig. 4d is white noise from a computer simulation. The received signals xi(n) (i = 1–20) were simulated by generation using s(n) and hi(n). The averaged spectrum of male speech in Fig. 4a shows that if s(n) is a colored signal at low sampling frequency, then the spectral energy bias is not so pronounced. However, when the sampling frequency is high, such as 44.1 kHz, then the average spectrum of the colored signal has spectral energy bias, as shown in Fig. 4b and c. The LIME computer simulation was calculated using these received signals using 10 channels of microphones. The microphone arrangement is shown in Fig. 2b-(1). The results are presented in Table 1. The results of patterns (b) and (c) show that if the average spectrum of the colored signal has a spectral energy bias, then the
ð3Þ
In the present proposal, the prewhitening filter was estimated as follows: Eq. (3) is rewritten as
eðnÞ ¼ xðnÞ
N X
aðkÞxðn kÞ:
ð4Þ
k¼1
From Eq. (4), the white noise e(n) is generated by convolving the linear prediction coefficient a(n) and signal x(n). Applying Eq. (4) to all input signals xi(n), the new input signals can be obtained so that the original signals are whitened with Eq. (4). Consequently, the colored characteristics in input signals are reduced. In the LIME algorithm, common components among input signals remain in prediction error ^eðnÞ. Specific components in each input signal, however, are removed from ^eðnÞ. Therefore, even when each input signal is whitened based on Eq. (4), the characteristic of the convolved filter remains in ^ eðnÞ. However, the prediction coefficients to be estimated become more numerous. Furthermore, when common poles exist among whitening filters, dereverberation performance deteriorates. To prevent this problem, we propose the application of one common whitening filter to all input channels.
Relative level [dB]
5
0.5 Amplitude
aðkÞxðn kÞ þ eðnÞ:
k¼1
1
0 −0.5 −1 0
N X
0.05
0.1
0.15
0 −5 −10 0
Time [s]
5
10 Frequency [kHz]
Fig. 3. Measured impulse response h(n).
15
20
53
T. Okamoto et al. / Applied Acoustics 73 (2012) 50–55
30 Relative level [dB]
Relative level [dB]
20 10 0 −10 0
1000
2000 Frequency [Hz]
3000
4000
20 10 0 −10 −20 0
10 15 Frequency [kHz]
20
5
10 15 Frequency [kHz]
20
25 Relative level [dB]
20 Relative level [dB]
5
10 0 −10 −20
20 15 10 5 0
−30 0
5
10 15 Frequency [kHz]
0
20
Fig. 4. Average spectrum of s(n) h(n) at two sampling frequencies.
Table 1 Simulation results of LIME algorithm (i) order of a(z); (ii) rank of Efxðn 1ÞxT(n 1)}.
(a) (b) (c) (d)
Source
xi(n)
Sampling frequency (kHz)
(i)
(ii)
Speech Speech Violin White noise
10 10 10 10
8 44.1 44.1 44.1
1342 7358 7358 7358
1342 5818 6094 7358
ch ch ch ch
Based on the discussion presented above, we propose the following dereverberation method for colored signals, which is effective at a high sampling frequency. First, the prewhitening filter that is common to all input signals is calculated from input signals xi(n). We use a prewhitening filter c(n) calculated as an averaged AR polynomial. Each linear prediction coefficient bi(n) is calculated from each input signal by solving the Yule–Walker equation using the Levinson–Durbin algorithm [20].
xi ðnÞ ¼
N X
bi ðkÞxi ðn kÞ þ ei ðnÞ;
ð5Þ
cðzÞ ¼ 1 Efbi ð1Þgz1 þ þ Efbi ðnÞgzN :
ð6Þ
k¼1
Therein, c(z) is an averaged coefficient of bi(n) that is obtainable from each channel i. Because the room impulse responses can be expected to have certain similarity, all the input signals should be
whitened efficiently, although not perfectly, using this averaged coefficient. All whitened signals c(n) xi(n) are then fed to the LIME algorithm. In this case, the output signal is also whitened because common components among input signals to LIME remain in the LIME algorithm output. However, the original signals to be estimated can be recovered by deconvolving the prewhitening filter. This characteristic is an advantageous point of the proposed method because the rank of the correlation matrix E {x(n 1)xT(n 1)} can be maintained almost entirely by the beneficial effect of introducing the prewhitening filter. To explain the effect of prewhitening filter c(n), we introduce four figures in Figs. 6 and 7. The temporal waveform and the averaged frequency spectrum of an example of prewhitening filter c(n) are shown in Fig. 6a and b, respectively. The filter c(n) is calculated from all received signals x(n) s. In the simulation, the original source signal s(n) was male speech. The averaged power spectrum of a received signal of a specific channel x(n) is shown in Fig. 7a. The received signal is seen to have an energy bias. The averaged power spectrum of whitened signal c(n) x(n) is portrayed in Fig. 7b. This result shows that the spectral energy bias of x(n) is moderately flattened by the proposed prewhitening filter c(n). It was confirmed that all received signals could be flattened with the common whitening filter c(n).
Fig. 5. Block diagram of White-LIME.
54
T. Okamoto et al. / Applied Acoustics 73 (2012) 50–55
1 Relative level [dB]
10
Amplitude
0.5 0 −0.5
0 −10 −20 −30
−1 0
1
2
3 Time [s]
4
5
0
−3
5
x 10
10 15 Frequency [kHz]
20
Fig. 6. Prewhitening filter c(n) (256 samples).
Relative level [dB]
Relative level [dB]
20 10 0 −10 −20
5 0 −5 −10 −15
−30 0
5
10 15 Frequency [kHz]
20
0
5
10
15
20
Frequency [kHz]
Fig. 7. Averaged spectrum of received signal and whitened signal (speech).
5. Performance evaluation Computer simulations were performed to evaluate the performance of the proposed method. The source signal included male speech and a violin sound used in Section 3, as well as a piano sound from SQAM Recordings for subjective tests [19] (track60.wav; L channel, 5 s (0.6–5.5 s), sampling frequency of 44.1 kHz, (16-bit linear PCM). The impulse responses were the same as those described in Section 3. The number of channels were 10 and 20. Each arrangement of microphones is shown in Fig. 2b-(1) and b(2), respectively. Five patterns of lengths of prewhitening filters of 128, 256, 512, 1024, and 2048 were tested. The ranks of the correlation matrices Efxðn 1ÞxT(n 1)} using 256 samples of prewhitening filters of each source are presented respectively in Tables 2–4. From these results, it is apparent that when the original LIME algorithm was applied with a 44.1 kHz sampling frequency, the rank of the correlation matrix was de^ðzÞ). Concreased to less than N(=the order of the AR polynomial a sequently, the estimated original signal ^sðnÞ could not be obtained. Furthermore, when applying White-LIME at a sampling frequency of 44.1 kHz, the rank of the correlation matrices remained equal to ^ðzÞ and the estimated source signal the order of AR polynomial a ^sðnÞ was almost perfectly obtainable. The signal-to-distortion ratio (SDR) and the spectral distortion (SD) for the results with WhiteLIME with 10 microphones and 20 microphones are portrayed in Tables 2–4. Table 2 Simulation results (male voice): sampling frequency of 44.1 kHz, whitening filter length of 256 samples (i) order of a(z); (ii) rank of Efxðn 1ÞxT(n 1)}. (i) Input signal x1(n) LIME (10 ch) LIME (20 ch) White-LIME (10 ch) White-LIME (20 ch)
7358 6972 7358 6972
(ii)
SDR (dB)
SD (dB)
5818 5562 7358 6972
5.2 – – 36.3 49.3
5.79 – – 0.18 0.05
Table 3 Simulation results (violin): sampling frequency of 44.1 kHz, whitening filter length of 256 samples (i) order of a(z); (ii) rank of Efxðn 1ÞsxT(n 1)}. (i) Input signal x1(n) LIME (10 ch) LIME (20 ch) White-LIME (10 ch) White-LIME (20 ch)
7358 6972 7358 6972
(ii)
SDR (dB)
SD (dB)
6094 5845 7358 6972
3.2 – – 15.9 16.1
8.54 – – 0.70 0.51
Table 4 Simulation results (piano): sampling frequency of 44.1 kHz, whitening filter length of 256 samples (i) order of a(z); (ii) rank of E{x(n 1)xT(n 1)}
Input signal x1(n) LIME (10 ch) LIME (20 ch) White-LIME (10 ch) White-LIME (20 ch)
(i)
(ii)
SDR (dB)
SD (dB)
7358 6972 7358 6972
4857 4610 7358 6972
3.3 – – 24.4 41.0
8.50 – – 0.10 0.22
The differences of the performance among the lengths of prewhitening filters are discussed next. The respective SD of sources for each prewhitening filter length are shown in Fig. 8a–c. These results show that when the filter length is too large, the dereverberation performance decreases. As inferred from these results, 256 points is the best length in the simulated condition. Finally, from these simulation results, the precise dereverberation at a high sampling frequency can be calculated by the proposed method.
! jsðnÞj2 ; SDR ¼ 10log10 P jsðnÞ ^sðnÞj2 rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 XF1 b ÞjÞ2 : SD ¼ ð20 log jPðf Þj 20 log j Pðf f ¼0 F P
ð7Þ ð8Þ
55
T. Okamoto et al. / Applied Acoustics 73 (2012) 50–55
10 ch 20 ch
6 5 4 3 2 1 0
128
256 512 1024 2048 Whitening filter length
10 Spectral distortion [dB]
Spectral distortion [dB]
7
8
10 ch 20 ch
6 4 2 0
128
256 512 1024 2048 Whitening filter length
Spectral distortion [dB]
5 4
10 ch 20 ch
3 2 1 0
128
256 512 1024 2048 Whitening filter length
Fig. 8. Results of spectral distortion for several whitening filter lengths in White-LIME.
6. Conclusion A novel algorithm for wide-band blind dereverberation has been proposed in this paper. This method, based on prewhitening the inputs for LIME, is called White-LIME. Computer simulations demonstrated the validity of the proposed method at high sampling frequencies such as 44.1 kHz. As future work, the robustness of this method against additional noise must be considered because calculation of the polynomial of matrix Q is extremely sensitive to noise. Acknowledgements The authors thank Dr. Masato Miyoshi (currently a professor at Kanazawa University) and other members of NTT Communication Science Laboratories for their intensive discussion and invaluable advice related to this study. This study was partly supported by the Japan Society for the Promotion of Science (19-55071) and the GCOE program of the Graduate School of Technology, Tohoku University (14200033). References [1] Naylor PA, Gaubitch ND. Speech dereverberation. 1st ed. Springer-Verlag; 2010. [2] Okamoto T, Nishimura R, Iwaya Y. Estimation of sound source positions using a surrounding microphone array. Acoust Sci Technol 2007;28(3):181–9. [3] Okamoto T, Iwaya Y, Suzuki Y. Blind directivity estimation of a sound source in a room using a surrounding microphone array. In: Proc. ICA 2010; 2010. [4] Suzuki T, Nakajima H, Tsuru H, Arai T, Nakadai K. 3D sound field recording and reproducing system including sound source orientation. In: Proc IUCS 2010; 2010. p. 214–9.
[5] Rafaely Boaz. Spherical loudspeaker array for local active control of sound. J Acoust Soc Am 2009;125(5):3006–17. [6] Yasuraoka N, Yoshioka T, Nakatani T, Nakamura A, Okuno HG. Music dereverberation using harmonic structure source model and Wiener filter. In: Proc ICASSP 2010; 2010. p. 53–6. [7] Miyoshi M, Kaneda Y. Inverse filtering of room acoustics. IEEE Trans Speech Signal Process 1988;36(2):145–52. [8] Huang Y, Benesty J, Chen J. Identification of acoustic MIMO systems: challenges and opportunities. Signal Process 2006;86(6):1278–95. [9] Huang Y, Benesty J. A class of frequency-domain adaptive approaches to blind multichannel identification. IEEE Trans Signal Process 2003;51(1): 11–24. [10] Moulines E, Duhamel P, Cardoso J–F, Mayrargue S. Subspace methods for the blind identification of multichannel FIR filters. IEEE Trans Signal Process 1995;43(2):516–25. [11] Furuya K, Kataoka A. Robust speech dereverberation using multichannel blind deconvolution with spectral subtraction. IEEE Trans Audio Speech Lang Process 2007;15(7):1579–91. [12] Delcroix M, Hikichi T, Miyoshi M. Precise dereverberation using multichannel linear prediction. IEEE Trans Audio Speech Lang Process 2007;15(2): 430–40. [13] Ram I, Habets E, Avargel Y, Cohen I. Multi-microphone speech dereverberation using LIME and least squares filtering. In: Proc EUSIPCO 2008, August 2008. p. 25–9. [14] Rombouts S, Heyde K. An accurate and efficient algorithm for the computation of the characteristic polynomial of a general square matrix. J Comput Phys 1998;140(2):453–8. [15] Harville DA. Matrix algebra from a statistician’s perspective. New York: Springer-Verlag; 1997. [16] Hayes MH. Statistical digital signal processing and modeling. Wiley; 1996. [17] Moon TK, Stirling WC. Mathematical methods and algorithms for signal processing. Prentice Hall; 1999. [18] Suzuki Y, Asano F, Kim Y-H, Sone T. An optimum computer-generated pulse signal suitable for the measurement of very long impulse responses. J Acoust Soc Am 1995;97(2):1119–23. [19] Waters GT. Sound quality assessment material recordings for subjective tests. Users’ handbook for the EBU-SQAM compact disc, Technical centre of the European Broadcasting Union, Tech. Rep. 3253-E, April 1998. [20] Ljung L. System identification: theory for the user. Prentice Hall; 1987.