Signal Processing 83 (2003) 919 – 929 www.elsevier.com/locate/sigpro
Adaptive wavelet-packet analysis for audio coding purposes

Nicolás Ruiz Reyes (a), Manuel Rosa Zurera (b,*), Francisco López Ferreras (b), Pilar Jarabo Amores (b)

(a) Dpto. de Electrónica, Escuela Universitaria Politécnica, Universidad de Jaén, 23700 Linares, Jaén, Spain
(b) Dpto. de Teoría de la Señal y Comunicaciones, Escuela Politécnica, Universidad de Alcalá, Campus Universitario, 28871 Alcalá de Henares, Madrid, Spain
* Corresponding author.
Received 30 July 2001; received in revised form 9 September 2002
Abstract

This paper describes a wavelet-based perceptual audio coder, addressing the problem of the search for the wavelet-packet decomposition that minimizes a new perceptual cost function computed in the wavelet domain. We are interested in decompositions adapted to the nature of audio signals which take into account the characteristics of human hearing. The results of audio coding with three different decomposition criteria are presented for comparison purposes. They all give rise to adaptive wavelet trees obtained by minimizing different cost functions: the non-normalized Shannon entropy, the SUPER and our proposed perceptual cost function. Another important contribution is the algorithm for bit allocation, which takes into consideration the synthesis filter bank. The results confirm that the best way to achieve maximum compression rate and transparent coding is the usage of perceptual-entropy-based decompositions. Experimental results indicate that our coding scheme ensures transparent coding of one-channel CD-quality audio signals at bit rates below 64 kbps for most audio signals. © 2003 Elsevier Science B.V. All rights reserved.

Keywords: Adaptive; Wavelet; Audio; Coding; Perceptual cost function
1. Introduction

CD quality for an audio signal is defined as each channel being sampled at 44.1 kHz and linearly coded with 16 bits/sample. This results in very high bit rates for the multimedia applications in use nowadays, and justifies the use of compression techniques. A very important issue in audio coding is the possibility of exploiting the masking phenomenon in the inner ear. Another important issue is the non-stationary nature of the audio signal, which
forces us to use local analysis. Several techniques have been proposed to do that; among them, wavelet-packet decompositions. This paper deals with the selection of the wavelet-packet base leading to maximum compression rate. In other words, the objective is to find the wavelet-packet base that allows audio signals to be represented with the minimum bit rate, maintaining the perceptual quality of the reconstructed signal as high as possible. The search for the best base has been tackled by several researchers recently. Many of them use cost functions based on the statistical entropy in order to obtain a compact representation of the source (the
0165-1684/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved. doi:10.1016/S0165-1684(02)00489-9
audio signal, in our case) but without taking into consideration any perceptual information. Although a bit assignment taking into account the masking threshold is accomplished, this approach is not suitable when the objective is transparent audio coding with maximum compression rate, as shown in Section 7. There are some works where the best base selection is made using psycho-acoustic information [3–6,14]. The main drawback of these approaches is the fact that psycho-acoustic information is considered in the Fourier domain, but not in the wavelet domain. Furthermore, the search criterion also considers a measure of computational complexity [5,6,14], making it difficult to analyze the improvement due to the usage of psycho-acoustic information in the search criterion.

Our approach tries to obtain an audio signal representation in the wavelet domain that allows us to eliminate as much irrelevant information as possible. A new cost function that takes into consideration the masking threshold provided by the psycho-acoustic model is used for the selection of the best base. In order to define a complete adaptive wavelet analysis environment, an adaptive tiling of the time axis is also included.

The sections of the paper are organized as follows. In Section 2, the coder and decoder structures are presented, which follow the general scheme of subband coding. Section 3 is a review of the main theoretical concepts related to the best base selection and the different approaches considered to date. Several original contributions are included, such as a novel algorithm for the best base selection, which is presented in Section 4, and a bit allocation algorithm that takes into consideration the low selectivity of the equivalent subband filters, presented in Section 6. For the assessment of the quality of the decoded signal, the ITU-R BS.1116-1 recommendation has been used. The compression capability is very high, resulting in a bit rate below 64 kbps for one-channel audio signals while maintaining a high perceptual quality. These results are presented in Section 7. Finally, conclusions are drawn in Section 8.
2. Audio coder and decoder structures

In this section, the coder and decoder structures are described. They work with one-channel audio signals, sampled at 44.1 kHz, where each sample has been linearly coded with 16 bits (CD quality). The objective is to obtain a digital representation of the original signal of minimum size that preserves its psycho-acoustic properties as far as possible. The following scheme has been implemented:

(1) The audio signal is segmented into frames of L = 2048 samples each. Each frame is first segmented into N = 4 non-overlapping time intervals of M = L/N = 512 samples each. After that, adjacent time intervals are analyzed jointly in order to decide whether they should be linked or not. To do so, a two-band wavelet decomposition is applied to each time interval and the pairs P_i = (1/C)·(L_i, H_i), i = 0, …, N−1, are formed, where L_i and H_i are the ℓ2 norms of the sequences l_i(k) and h_i(k), respectively, C = max{L_0, …, L_{N−1}, H_0, …, H_{N−1}} is a normalization factor, and l_i(k) and h_i(k) represent the low- and high-frequency subbands, respectively, corresponding to the ith time interval. The proposed algorithm for deciding whether two consecutive time intervals should be linked is completed with a spectral clustering process that takes into consideration the Euclidean distances d_{i,j} = ‖P_i − P_j‖ between two consecutive time-frequency pairs. The distances d_{i,j} are compared with a decision threshold U. The algorithm is summarized in Fig. 1, and is somewhat similar to the one described in [6].

(2) Before decomposing the input signal, the masking threshold for each resulting audio segment is estimated in the frequency domain, using the second method proposed in the ISO/MPEG standard.

(3) To avoid the number of wavelet coefficients which characterize the segment being higher than the number of samples, the input segment is interpreted as a periodic signal, implementing a periodized wavelet-packet decomposition. The periodization may introduce discontinuities and therefore audible coding artefacts at low bit rates.
Fig. 1. Time axis tiling algorithm.
This drawback is avoided by making adjacent segments overlap by 32 samples, the overlapping samples of each segment being windowed with the square root of a raised cosine. The same windowing procedure is applied at the synthesis stage, so that, after the overlap-add process, the perfect reconstruction property is preserved in the absence of quantization at the coder.

(4) The proposed audio coder analyzes each segment with a filter bank that implements a wavelet-packet decomposition adapted to the segment in order to minimize the bit rate, maintaining the perceptual quality as high as possible.

(5) A set of 15 uniform quantizers is used to quantize the wavelet coefficients, with the size of the quantization steps adjusted to the maximum absolute value of the coefficients of each band in the segment (scale factors). The scale factors are also coded, using log-PCM with only 8 bits. They are sent as side information together with the number of bits used to code the wavelet coefficients and an index that characterizes the implemented wavelet-packet decomposition.

(6) At the decoder, the coded wavelet coefficients and the side information are demultiplexed; then, the coded wavelet coefficients are decoded and applied to the reconstruction process of the audio frame. Of course, the reconstructed signal is not equal to the original one, due to the quantization process introduced at the coder.

3. Background to adaptive wavelet-packet analysis

The main advantage of the wavelet-packet transform is the possibility of selecting the optimum base for analyzing a given signal according to a suitable criterion. The choice of such a criterion depends on the application the decomposition is applied to. The complexity of the search algorithm is also important. From this point of view, the algorithm for selecting the best base must be as simple as possible.
The choice of the best base within a library of orthonormal wavelet-packet bases B requires the definition of a cost function F. For an input vector x, the best base according to the cost function F is the base B ∈ B for which F(B·x) is minimum. A very important set of cost functions is composed of those that are 'additive'. A cost function is additive if the following relations hold:

F_add(0) = 0,   F_add(x) = Σ_{i=1}^{N} F_add(x(i) · u_i),   (1)

where u_i is the vector of all zeros except for the ith component, whose value is one, and x(i) is the ith component of the N-length vector x. These are the most frequently used additive cost functions [15]:

(1) Number of coefficients higher than a threshold. An arbitrary threshold is fixed and the number of samples of x whose absolute value is higher than the threshold is obtained.

(2) Functional −l² ln l², related to the Shannon entropy. For a signal x, the Shannon entropy is defined as H(x) = −Σ_j p_j ln(p_j), with p_j = |x(j)|² / ‖x‖². H(x) is not an additive cost function, but considering the following definition:

H_add(x) = − Σ_{i: x(i)≠0} x²(i) · ln(x²(i)),   (2)

an additive functional related to the Shannon entropy is obtained. The equation H(x) = ‖x‖⁻² · H_add(x) + ln(‖x‖²) ensures that as H_add is minimized, H is also minimized.

(3) Functional l², related to the entropy of a Gauss–Markov process, defined as:

G(x) = Σ_{i: x(i)≠0} ln(x²(i)).   (3)
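For illustration, these additive functionals can be evaluated directly on a vector of coefficients. The following NumPy sketch uses function names of our own choosing, not identifiers taken from the paper:

```python
import numpy as np

def count_above_threshold(x, thr):
    # Additive cost: number of coefficients whose magnitude exceeds the threshold.
    return int(np.sum(np.abs(x) > thr))

def shannon_additive(x):
    # H_add(x) = -sum over non-zero x(i) of x(i)^2 * ln(x(i)^2), Eq. (2).
    x2 = x[x != 0] ** 2
    return float(-np.sum(x2 * np.log(x2)))

def shannon_entropy(x):
    # H(x) = ||x||^-2 * H_add(x) + ln(||x||^2), the relation quoted above.
    energy = float(np.sum(x ** 2))
    return shannon_additive(x) / energy + np.log(energy)

def gauss_markov(x):
    # G(x) = sum over non-zero x(i) of ln(x(i)^2), Eq. (3).
    x2 = x[x != 0] ** 2
    return float(np.sum(np.log(x2)))
```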
These functions are used in a general way in [2] for selecting the best base, but without establishing a concrete application. They were applied in [11] for the selection of the best base within a perceptual audio coding framework. The results obtained reveal that these functions are not very suitable for efficient audio coding. Furthermore, the performance of the system hardly differs from one function to another.

These entropy-type cost functions are suited to the selection of the best base in classical source coding applications, because they describe information-related properties for an accurate and compact representation of a given signal. However, in perceptual audio coding applications we are interested in cost functions that give rise to wavelet-packet decompositions that minimize the bit rate whilst maintaining the perceptual quality as high as possible. The cost function to be considered must take into account the psycho-acoustic information provided by a perceptual model. Several approaches to the usage of perceptual cost functions have been proposed recently:

(1) In [14], an audio coding scheme that uses a dynamic wavelet-packet decomposition is proposed. The cost function proposed in that paper considers the masking threshold in the frequency domain, and looks for a subband decomposition that minimizes the subband perceptual rate (SUPER), a measure that tries to adapt the subband structure to approach the perceptual entropy (PE) as closely as possible. The SUPER is the iteratively determined minimum number of bits Σ_k b_k, where b_k is the number of bits per sample in subband k such that the quantization error incurred when quantizing with b_k bits remains below the masking threshold in the frequency domain. The main drawback of this approach is that the influence of the synthesis filter bank is considered neither for the best base selection nor for the quantization of the subband signals. The distortion due to quantization depends both on the quantization noise injected in the wavelet domain and on the synthesis filter bank. The equivalent filter frequency responses of the filter bank branches are far from ideal, and this approach is not optimum, especially for filter banks with low selectivity. The algorithm also incorporates a mechanism for controlling the computational complexity, and the zerotree algorithm proposed in [12] is included.

(2) The algorithm proposed in [5,6] is quite similar to the previous one. The main difference refers to the mechanism for stopping the search algorithm, which is based on the use of predefined thresholds.

(3) In [3], a signal-adaptive filter bank is used for decomposing the audio signal, which is controlled using optimization techniques for a rate-distortion or a perceptual metric. In order to choose the best decomposition and the best quantizing scheme for a given signal block, different measures are required. For signals having large redundancy, the objective is to decorrelate the signal in order to code it more efficiently. If the signal has little redundancy but a large amount of irrelevancy, the masking properties of the human auditory system are exploited to coarsely quantize some subbands. Once again, the masking threshold is considered in the frequency domain.

It appears from this review that there is a lack of methods for the best base selection using a measure of the signal-to-mask ratio in the wavelet domain. Such a measure provides information about the minimum number of bits to be allocated in each subband to achieve transparent coding. Furthermore, the minimization of the signal-to-mask ratio jointly minimizes the subband signal power (decorrelation) and maximizes the power of the quantization noise that can be injected while maintaining the perceptual quality as high as possible.

4. Proposal of a new perceptual cost function

The wavelet-packet decomposition allows us to choose the best base for analyzing a given signal. The best base selection is made according to a suitable criterion. We propose a perceptual cost function closely related to the perceptual entropy, defined as the
minimum number of bits per sample required to ensure transparent coding [8]. For a given N_i-length subband audio signal x_i, the proposed perceptual cost function is defined in expression (4), where SMR(x_i) represents the signal-to-mask ratio of subband x_i:

F(x_i) = N_i · log₂(SMR(x_i)).   (4)

For the zero-mean subband signal x_i, the signal-to-mask ratio SMR(x_i) can be expressed as the ratio between the power of the subband signal (σ²_{x_i}) and the maximum power of quantization noise that can be injected into the subband (σ²_{n_i}):

SMR(x_i) = σ²_{x_i} / σ²_{n_i}.   (5)

For the distortion due to quantization noise to be inaudible, its power spectrum must remain below the masking threshold, T(e^{jω}), which is estimated in the Fourier domain but should be considered in the wavelet domain. Therefore, the equivalent filter frequency responses of the synthesis filter bank branches, F_i(e^{jω}), obtained using the noble identities, should be taken into consideration. Note that the wavelet-packet decomposition is implemented with a maximally decimated filter bank. The power spectrum of the quantization noise injected into the ith subband, considered at the input of the equivalent filter of the ith synthesis branch, is S_{n_i}(e^{jωM}), M being the interpolation factor. If the ith subband signal is considered in isolation, the distortion at the output, due to the quantization noise injected into that subband, must be less than or equal to the masking threshold:

S_{n_i}(e^{jωM}) · |F_i(e^{jω})|² ≤ T(e^{jω}),   ω ∈ W_i,   (6)

where W_i is the set of frequency lines within the ith subband. Taking into consideration that F_i(e^{jω}) is the frequency response of a band-pass filter, and supposing that its magnitude response is approximately constant inside the pass band W_i, the power of the quantization noise, assumed white, can be approximated according to (7):

σ²_{n_i} ≈ min_{ω∈W_i} [ T(e^{jω}) / |F_i(e^{jω})|² ].   (7)

The proposed perceptual cost function is finally defined in expression (8):

F(x_i) = N_i · log₂ [ ((1/N_i) · Σ_{j=0}^{N_i−1} x_i²(j)) / min_{ω∈W_i} ( T(e^{jω}) / |F_i(e^{jω})|² ) ],   (8)

where:
• T(e^{jω}) is the masking threshold estimated in the Fourier domain.
• |F_i(e^{jω})| is the equivalent filter frequency response magnitude of the ith synthesis filter bank branch.
• N_i is the number of samples of the ith subband.

This cost function has several advantages:
• It takes into consideration both the wavelet representation of the input signal and the masking threshold estimated in the Fourier domain. The masking threshold is normalized in order to obtain the quantization noise power that can be injected in the wavelet domain so as to achieve transparent coding, if the subband is considered in isolation.
• The value returned by the cost function is indicative of the number of bits per sample required for transparent coding.
• The joint search of the optimum decomposition tree structure and the best prototype filter is possible. This possibility is not dealt with in this paper because of the high computational cost of the joint search. The selection of the best prototype filter is possible because both the energy of the subband signal and the equivalent filter frequency response magnitude of the synthesis filter bank branch depend on the given prototype filter.
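As an illustration of how expressions (7) and (8) might be evaluated in practice, the following NumPy sketch computes the proposed cost for one subband. The argument names and the assumption of a common frequency grid for the masking threshold and the synthesis responses are ours, not details taken from the paper:

```python
import numpy as np

def perceptual_cost(x_i, T, F_i_mag, band):
    # x_i     : wavelet coefficients of the ith subband (length N_i)
    # T       : masking threshold T(e^{jw}) sampled on a frequency grid (linear power)
    # F_i_mag : |F_i(e^{jw})| of the equivalent synthesis filter on the same grid
    # band    : boolean mask selecting the frequency lines W_i of this subband
    N_i = len(x_i)
    signal_power = np.mean(x_i ** 2)                    # (1/N_i) * sum_j x_i(j)^2
    noise_power = np.min(T[band] / F_i_mag[band] ** 2)  # admissible noise power, Eq. (7)
    return N_i * np.log2(signal_power / noise_power)    # Eq. (8)

def base_cost(subbands):
    # Total cost of a candidate base: sum of the per-subband costs.
    return sum(perceptual_cost(x, T, F, b) for (x, T, F, b) in subbands)
```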
5. Algorithm for selecting the best wavelet-packet base

After defining the cost function to be used, the algorithm for selecting the best wavelet-packet base must be stated. Our option is to use a search procedure based on the pruning of a complete tree [16]. This approach provides optimum solutions at the expense of a slight increase in algorithm complexity. The pruning algorithm has to examine the complete wavelet-packet tree in order to select the best base.
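A minimal sketch of such a pruning search, in the spirit of the entropy-based best-basis algorithm of [2,16], is given below. The node structure and the generic per-subband cost function are assumptions of this sketch:

```python
def best_basis(node, cost_fn):
    # Bottom-up pruning of a complete wavelet-packet tree.
    # node    : tree node with .coeffs (subband coefficients) and .children (empty at max depth)
    # cost_fn : per-subband cost function, e.g. the perceptual cost of Section 4
    # Returns (best_cost, leaves), where leaves are the subbands retained as terminal nodes.
    own_cost = cost_fn(node.coeffs)
    if not node.children:
        return own_cost, [node]
    child_cost, child_leaves = 0.0, []
    for child in node.children:
        c, l = best_basis(child, cost_fn)
        child_cost += c
        child_leaves += l
    if child_cost < own_cost:       # splitting this node is cheaper: keep the children
        return child_cost, child_leaves
    return own_cost, [node]         # otherwise prune: keep this node as a leaf of the base
```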
Fig. 2. Example of a given audio segment: (a) Waveform; (b) Masking threshold (dB) vs. frequency index; (c) Energy spectral density (dB) vs. frequency index.
In our implementation, a complete five-level wavelet-packet decomposition (D = 5) is always carried out, thus achieving a trade-off between coding gain and computational complexity. Greater decomposition depths give rise to higher computational costs for selecting the best base and also to an increase in the side information required to represent the decomposition tree. Furthermore, when the decomposition depth is increased, coding gain is usually obtained only for low-frequency subbands, where the subbands are narrow and the number of wavelet coefficients is small. The value D = 5 leads to a maximum number of 32 subbands. The number of bases in a binary tree of D = 5 levels is 458,329, which can be represented with 19 bits. This is the side information that must be sent to the decoder to represent the best base. The amount of side information is independent of the segment length, once the maximum decomposition depth is fixed. If the decomposition depth is D = 6, the number of bits needed to represent the best base is 38, and the side information required to represent the selected base is doubled.
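These side-information figures can be checked with the usual recursion for counting the admissible bases of a binary tree, n(d) = n(d−1)² + 1; whether the undivided root is counted as a base accounts for the one-base difference with respect to the figure quoted above. A small sketch:

```python
from math import ceil, log2

def num_bases(depth):
    # n(d) = n(d-1)^2 + 1: either the node is kept as a leaf (the '+1' term),
    # or it is split and each subtree independently chooses one of its own bases.
    n = 1
    for _ in range(depth):
        n = n * n + 1
    return n

for d in (5, 6):
    n = num_bases(d)
    print(d, n, ceil(log2(n)))
# depth 5 -> 458330 bases (458329 if the undivided root is excluded), 19 bits
# depth 6 -> roughly 2.1e11 bases, 38 bits
```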
It must be mentioned that neither the number of subbands nor the selected base will influence the perceptual quality of the reconstructed signal. The perceptual quality is forced to be transparent if the hypotheses mentioned in Section 6 hold. So, only the bit rate will depend on the selected base. In Fig. 3, an illustrative example of the resulting wavelet-packet base for a given audio segment (see Fig. 2) is presented.

6. Algorithm for bit allocation

When the wavelet-packet decomposition is implemented using filters with low selectivity, the equivalent filter frequency response magnitudes of the synthesis filter bank branches are far from ideal. Therefore, an algorithm for bit allocation that takes into consideration the non-ideal frequency responses of these filters should be used. Several algorithms have been reviewed. The most relevant one, proposed in [13], has the disadvantage of a high computational cost. To reduce this complexity, a simplification was proposed using long wavelet filters.
Fig. 3. Wavelet-packet decomposition tree for the audio segment of Fig. 2.
In [10], an algorithm for translating psycho-acoustic information to the wavelet domain is presented. It can be used with filters that generate orthogonal wavelets with any compact support, but the analysis tree must be close to the critical band division. Work closely related to the problem formulated here is described in [1], where an optimum bit allocation algorithm applicable to either uniform or non-uniform frequency decompositions is presented, allowing the non-ideal selectivity of the reconstruction filters to be taken into account. The aim of that algorithm is to minimize the noise variance under the constraint of a given average number of bits per sample. Because our objective is the study of the behavior of time-varying wavelet-packet decompositions obtained using a perceptual cost function, the bit allocation procedure should focus only on achieving transparent coding with the minimum bit rate. The objective is not to minimize the noise variance, but to minimize the bit rate while ensuring transparent coding. Therefore, the algorithm proposed in [1] has been slightly modified to seek transparent coding at the minimum bit rate. The new algorithm can be used with filters that generate wavelets with any compact support and can be applied to time-varying wavelet-packet decompositions. The following hypotheses are assumed:

(1) Subband signals are orthogonal.
(2) Uniform quantization is used.

(3) Quantization noise can be modelled as white noise, independent of the subband signals. Strictly, this hypothesis is only true when the number of bits to assign is high, but for simplicity we assume it always holds. The results obtained allow us to maintain this assumption and avoid the use of a more complex model.

When these hypotheses hold, the overall distortion power spectrum due to quantization noise, S_N(e^{jω}), is the sum of the contributions of the different subbands. It is formulated in expression (9), where σ_i² is the variance of the quantization noise injected in the ith subband, taken to be additive white noise:

S_N(e^{jω}) = Σ_i σ_i² · |F_i(e^{jω})|².   (9)
For a filter bank that implements a wavelet-packet decomposition, the synthesis filter output due to quantization noise in one subband is not stationary. However, it is cyclostationary [9]. Taking this fact into consideration, S_N(e^{jω}) represents the Fourier transform of the mean value of the overall distortion autocorrelation function. The objective is to find the variance values, so that the overall distortion remains below the masking threshold in the frequency domain, minimizing
the bit rate. Before presenting the algorithm, the nomenclature to be used must be stated: M is the number of subbands, N_i is the number of samples in the ith subband, W_i is the frequency interval of the ith subband, b_i is the number of bits per sample assigned to that subband, and fS_i is its scale factor. If uniform quantization is used, the quantization step is obtained using expression (10). Furthermore, if the above hypotheses hold, the variance of the quantization noise is calculated with (11):

Δ_i = 2 · fS_i / 2^{b_i},   (10)

σ_i² = Δ_i² / 12.   (11)

The number of bits allocated to the ith subband is calculated using expression (12):

b_i = log₂ [ 2 · fS_i / (12 · σ_i²)^{1/2} ].   (12)

Once the nomenclature has been stated, the algorithm for bit allocation is formulated as follows:

(1) The variance values are initialized taking into consideration the equivalent filter frequency response magnitudes of the synthesis filter bank branches, obtained using the noble identities (expression (7)).
(2) The initial bit assignment is calculated using the variance values previously obtained and expression (12).
(3) The new variance values are obtained using expression (11).
(4) If Σ_{i=1}^{M} σ_i² · |F_i(e^{jω})|² < T(e^{jω}) for all ω, then:
 • Find the subband i with the maximum value of N_i · b_i.
 • Set b_i = b_i − 1, and multiply σ_i² by four.
(5) If Σ_{i=1}^{M} σ_i² · |F_i(e^{jω})|² > T(e^{jω}) for some ω, then:
 • Find the subband i with the minimum value of N_i · b_i.
 • Set b_i = b_i + 1, and divide σ_i² by four.
(6) Steps 4 and 5 are repeated until a stable situation is reached.
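A sketch of this loop is given below. The common frequency grid for the masking threshold and the synthesis responses, the non-negativity clamp on the bit assignment, and the simple iteration cap used in place of a formal stability test are assumptions of the sketch, not details taken from the paper:

```python
import numpy as np

def allocate_bits(scale, n_samples, F_mag, T, max_iter=200):
    # scale     : (M,) scale factors fS_i (maximum |coefficient| per subband)
    # n_samples : (M,) number of samples N_i per subband
    # F_mag     : (M, K) |F_i(e^{jw})| of each synthesis branch on a K-point grid
    # T         : (K,) masking threshold on the same grid (linear power)
    M = len(scale)
    # Step 1: initial admissible noise variances from Eq. (7)
    var = np.array([np.min(T / F_mag[i] ** 2) for i in range(M)])
    # Step 2: initial bit assignment from Eq. (12), clamped to a non-negative integer
    bits = np.maximum(0, np.ceil(np.log2(2 * scale / np.sqrt(12 * var)))).astype(int)
    for _ in range(max_iter):
        # Step 3: noise variances implied by the current assignment, Eqs. (10)-(11)
        var = (2 * scale / 2.0 ** bits) ** 2 / 12.0
        S_N = (var[:, None] * F_mag ** 2).sum(axis=0)   # overall distortion spectrum, Eq. (9)
        cost = n_samples * bits
        if np.all(S_N < T):
            # Step 4: distortion everywhere below the threshold -> remove a bit
            i = int(np.argmax(cost))
            if bits[i] == 0:
                break
            bits[i] -= 1
        else:
            # Step 5: threshold exceeded somewhere -> add a bit
            i = int(np.argmin(cost))
            bits[i] += 1
        # Step 6 (the stability test) is replaced here by the simple iteration cap above.
    return bits
```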
7. Experimental results

Numerous tests have been carried out with different configurations so as to assess the performance of our audio coding scheme. Several objective and subjective measures have been obtained, and will be presented and analyzed in this section. Music samples considered hard to encode have been used. The chosen set, which consists of five one-channel CD-quality audio signals of about 15 s, is described in Table 1. Special attention has been paid to signals with sharp attacks, since these signals are extremely susceptible to the presence of pre-echoes.

Table 1. Source material used in the quality test (CD-quality source material)

Code     Instrument/style
Drums    Solo drums (Caribbean music)
Piano    Solo piano (classical music)
Guitar   Solo Spanish guitar
Saxo     Solo saxophone
Pop      Pop music by Spandau Ballet

7.1. Objective results

In this section, the average number of bits per sample is used to evaluate the behavior of our approach. Three different cost functions are used to obtain a time-varying wavelet-packet tree:

(1) The perceptual cost function defined in Section 4.
(2) The perceptual cost function defined in [14].
(3) The non-normalized Shannon entropy.

In order to assess the performance of the proposed adaptive wavelet-packet analysis for audio coding purposes, the average number of bits per sample has been obtained with filters of different lengths using the three considered criteria. The results shown in Tables 2 and 3 have been obtained using 24-sample and 8-sample minimum-phase Daubechies wavelet filters, respectively. The best results are obtained using the perceptual cost function defined in Section 4 for the best base selection, independently of the filter length used.
Table 2. Average number of bits/sample obtained using the three considered criteria with 24-sample Daubechies filters

Signal   Our perceptual cost function   SUPER cost function   Non-normalized Shannon entropy
Drums    1.474                          1.506                 1.80
Guitar   1.445                          1.472                 1.73
Piano    1.370                          1.404                 1.67
Saxo     1.330                          1.362                 1.75
Pop      1.459                          1.486                 1.72
Table 3. Average number of bits/sample obtained using the three considered criteria with 8-sample Daubechies filters

Signal   Our perceptual cost function   SUPER cost function   Non-normalized Shannon entropy
Drums    1.722                          1.826                 2.051
Guitar   1.695                          1.822                 2.002
Piano    1.636                          1.760                 1.932
Saxo     1.602                          1.736                 2.025
Pop      1.716                          1.828                 1.994
Focusing on the usage of long filters, the results obtained using the cost function defined in [14] are only a little worse than those obtained using the cost function proposed in this paper. However, when short prototype filters are used, the differences between the results obtained with the two evaluated perceptual cost functions (the proposed one and the one defined in [14]) become larger, showing that SUPER is far from ideal when non-selective filters are used to implement the wavelet-packet decomposition. We believe that the performance of the perceptual cost function proposed in this work is really good, showing that consideration of the synthesis filter frequency responses when selecting the best base is crucial when non-selective filters are used. Finally, the worst results correspond to the non-normalized Shannon entropy.

Only one entropy-type cost function has been compared to our perceptual cost function. The question that arises next is: can we find any improvement using a different entropy-type cost function for the best base selection? This problem has already been analyzed in [11]. The main conclusion of that previous study was that
the behavior of the coder hardly varies if, instead of the non-normalized Shannon entropy, we use other entropy-type cost functions.

From the analysis of the results presented in Tables 2 and 3, we can conclude that the usage of the proposed perceptual cost function is the best option for selecting the wavelet-packet decomposition for a given audio segment, especially in the case of non-selective filters. The usage of non-selective filters is useful for low computational cost implementations. If many audio segments are considered, the average bit rate decreases when the filter length is increased, but improvements can also be obtained if an optimal search of the filter length is made before processing each audio segment [10]. If we concentrate our attention on a particular audio segment, the optimal filter length depends on the frequency components of that segment. So, because the optimal filter could be short, this is another practical situation in which one would use non-selective filters.

7.2. Subjective results

In order to assess the subjective quality, informal tests with listeners from our laboratory have been carried out using test procedures inspired by the ITU-R BS.1116-1 recommendation [7] for the subjective evaluation of small impairments in audio systems, including multi-channel sound systems (double-blind triple-stimulus with hidden reference and the 5-grade scale). In this section we present the subjective results obtained when the adaptive wavelet analysis proposed in this paper is applied. These results have been obtained with 24-length (Table 4) and 8-length (Table 5) minimum phase Daubechies filters.

Analyzing the results presented in Tables 4 and 5, we determine whether meaningful perceptual differences are produced upon using the adaptive decomposition structure proposed in this paper. We also demonstrate that the prototype filter length has no influence on the perceptual quality, because the bit allocation procedure takes into account the non-ideal reconstruction filters.
Table 4. Adaptive wavelet analysis using the proposed cost function and 24-length filters: subjective results

Test signal   Orig. MOS   Decoded MOS   Orig. σMOS   Decoded σMOS   RMOS (Orig. − Decoded)
Drums         4.95        4.87          0.08         0.10           0.08
Guitar        4.97        4.68          0.06         0.17           0.29
Piano         4.97        4.87          0.05         0.10           0.10
Saxo          5.00        4.60          0.00         0.12           0.40
Pop           4.96        4.83          0.07         0.12           0.13
Mean          4.97        4.77          0.05         0.12           0.20
Std. Dev.     0.02        0.12          0.03         0.03           0.14
Table 5. Adaptive wavelet analysis using the proposed cost function and 8-length filters: subjective results

Test signal   Orig. MOS   Decoded MOS   Orig. σMOS   Decoded σMOS   RMOS (Orig. − Decoded)
Drums         4.94        4.85          0.09         0.12           0.09
Guitar        4.97        4.70          0.06         0.19           0.27
Piano         4.96        4.84          0.04         0.11           0.12
Saxo          5.00        4.57          0.00         0.10           0.43
Pop           4.98        4.85          0.06         0.09           0.13
Mean          4.97        4.76          0.05         0.12           0.21
Std. Dev.     0.02        0.12          0.03         0.04           0.14
We notice that, in general, low values of the parameter RMOS are obtained with the time-varying tree, which allows us to assert that the adaptation process of the tree structure produces not only a reduction of the bit rate, but also the maintenance of a very high perceptual quality. This is due to the usage of the proposed bit allocation algorithm, which takes into account the equivalent filter frequency responses of the synthesis filter bank branches.

A careful analysis of the results shows that the signals for which the perceptual quality is higher are those with sharp attacks: Drums, Guitar, Piano, and Pop. The music samples named Guitar and Piano have a high sinusoidal content, but at the instant a note is played a sharp attack occurs. The algorithm for adaptive tiling of the time axis described in Section 2 allows sharp attacks to be isolated and avoids the appearance of pre-echoes. As a result, the perceptual quality is very high.

On the other hand, slightly worse results are obtained for Saxo. This is a signal with high sinusoidal content but without sharp attacks. Its spectral density varies quickly, but the short-time energy varies slowly. The algorithm for adaptive tiling of the time axis only takes into consideration a two-band wavelet decomposition. So, spectral changes that do not alter the energy of the wavelet subbands make this algorithm perform poorly, and slightly perceptible artefacts may appear, reducing the perceptual quality of the signal, because the distortion due to quantization noise may rise above the masking threshold in the frequency domain.

8. Conclusions

We have presented a new cost function for the selection of the best wavelet-packet decomposition for audio coding purposes. The results confirm
that psycho-acoustic information must be taken into account in order to minimize the resulting bit rate while maintaining the perceptual quality as high as possible. As has been shown, adaptive wavelet analysis is only interesting for audio coding purposes if psycho-acoustic information is taken into consideration when selecting the best base. The adaptation process of the tree structure, taking into account the psycho-acoustic information, provides a notable reduction in the bit rate, maintaining a high perceptual quality, if the equivalent filter frequency responses of the synthesis filter bank branches are considered.

Another important remark is the fact that not only psycho-acoustic information should be taken into account, but also the synthesis equivalent filter frequency responses. In order to prove this statement, results for audio compression have been obtained using the cost function proposed in [14], which assumes the usage of ideal filters. The results confirm that the synthesis filter frequency responses should be taken into account, because the behavior of the coder with time-varying wavelet-packet decompositions is not the best when filters with low selectivity are used and the synthesis equivalent filter frequency responses are not considered.

From the analysis of the objective results presented in Section 7.1, we can conclude that the usage of the proposed perceptual cost function is the best option for selecting the wavelet-packet decomposition for a given audio segment, especially in the case of non-selective prototype filters. Analyzing the subjective results presented in Section 7.2, it has been proved that no meaningful perceptual differences are produced upon using the adaptive decomposition structure proposed in this paper. It is also demonstrated that the prototype filter length has no influence on the perceptual quality, because the bit allocation procedure takes into account the non-ideal reconstruction filters.

References

[1] C. Caini, A. Vanelli-Coralli, Optimum bit allocation in subband coding with nonideal reconstruction filters, IEEE Signal Process. Lett. 8 (6) (June 2001) 157–159.
[2] R. Coifman, M. Wickerhauser, Entropy based algorithms for best basis selection, IEEE Trans. Inform. Theory, Part 2, 38 (2) (March 1992) 713–718.
[3] M. Erne, G. Moschytz, Audio coding based on rate-distortion and perceptual optimization techniques, Proceedings of the AES 17th International Conference, Florence, Italy, September 2–5, 1999, pp. 220–225.
[4] M. Erne, G. Moschytz, C. Faller, Best wavelet-packet bases for audio coding using perceptual and rate-distortion criteria, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, Arizona, March 15–19, 1999, pp. 909–912.
[5] N. González Prelcic, A. Pena, An adaptive tree search algorithm with application to multiresolution-based perceptive audio coding, Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, Paris, France, June 1996, pp. 117–120.
[6] N. González Prelcic, A. Pena, An adaptive tiling of the time-frequency plane with application to multiresolution based perceptive audio coding, Signal Processing 81 (2) (February 2001) 301–319.
[7] ITU-R, Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems, Rec. ITU-R BS.1116-1, 1997.
[8] J. Johnston, Transform coding of audio signals using perceptual noise criteria, IEEE J. Selected Areas Commun. 6 (2) (February 1988) 314–323.
[9] A. Papoulis, S. Unnikrishna Pillai, Probability, Random Variables, and Stochastic Processes, fourth ed., McGraw-Hill, New York, 2002.
[10] M. Rosa, F. López, P. Jarabo, S. Maldonado, N. Ruiz, New algorithm for translating psycho-acoustic information to the wavelet domain, Signal Processing 81 (3) (March 2001) 519–531.
[11] N. Ruiz, M. Rosa, F. López, D. Martínez, An adaptive wavelet-based approach for perceptual low bit rate audio coding attending to entropy-type criteria, in: Software and Hardware Engineering for the 21th Century, WSES Press, 1999, pp. 305–309.
[12] J. Shapiro, Embedded image coding using zerotrees of wavelet coefficients, IEEE Trans. Signal Process. 41 (12) (December 1993) 3445–3462.
[13] D. Sinha, A. Tewfik, Low bit-rate transparent audio compression using adapted wavelets, IEEE Trans. Signal Process. 41 (12) (December 1993) 3463–3479.
[14] P. Srinivasan, L. Jamieson, High quality audio compression using an adaptive wavelet packet decomposition and psychoacoustic modeling, IEEE Trans. Signal Process. 46 (4) (January 1998) 1087–1093.
[15] M. Wickerhauser, Acoustic signal compression with wavelet packets, in: Wavelets: A Tutorial in Theory and Applications, Academic Press, Boston, 1992, pp. 679–700.
[16] M. Wickerhauser, Adapted Wavelet Analysis from Theory to Software, A.K. Peters, Natick, MA, 1994.