Digital Signal Processing 13 (2003) 23–41 www.elsevier.com/locate/dsp
A low-complexity, high-quality, 64-Kbps audio codec with efficient bit allocation ✩ Din-Yuen Chan a,∗ and Cheng-Yuan Ku a,b a Department of Information Engineering, I-Shou University, Kaohsiung County, Taiwan, ROC b Department of Information Management, National Chung Cheng University, Chia-Yi, Taiwan, ROC
Abstract This paper presents a very low-complexity audio codec that provides audio playback quality similar to the MPEG-I/audio level 3 codec at 64 Kbps for a monophonic-channel signal. This welldesigned low-sophistication scheme uses a simple noise-masking model, a specialized nonuniform quantizer, an effectively recursive refinement module, and an adaptive arithmetic coder with multiplication-free adaptation. For bit allocation, we propose an appropriate nonuniform quantizer incorporating noise-masking effects and designed for fast implementation as well as efficient acceleration of the proposed refinement process. This recursive refinement algorithm effectively improves recovered perceptual audio quality after quantization. The adaptive arithmetic coder uses two fast adaptation algorithms that do not require multiplication to quickly obtain efficient bit allocation. A Taylor series expansion is used to simplify the frequently executed functions in the masking threshold formulas. Thus, the proposed high-quality audio codec appears to be a very valuable consumer electronic approach or a software solution at low cost. 2002 Elsevier Science (USA). All rights reserved.
1. Introduction The MPEG audio coding standards [1,2] call for typical audio bit rates ranging from 64 to 192 Kbps for digital storage media associated transmission applications. High-fidelity audio coders [1–4] rely primarily on extremely complicated psychoacoustic models to control adaptive bit-allocation algorithms. Basically, the high-quality audio compression ✩ This research was supported by the National Science Council of the ROC under Contract NSC-88-2612-E-214-0C1. * Corresponding author. E-mail address:
[email protected] (D.-Y. Chan).
1051-2004/02/$ – see front matter 2002 Elsevier Science (USA). All rights reserved. PII: S 1 0 5 1 - 2 0 0 4 ( 0 2 ) 0 0 0 0 5 - 2
24
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
uses the noise masking [1–7] effectively to remove the acoustical irrelevance. Effective combination of psychoacoustic models and effective bit-allocation mechanisms [1–4,8, 9] is indispensable for achieving various high compression ratios. A 32-channel cosinemodulated filterbank is used to decompose audio signals [1,10], and a fast Fourier transformation (FFT) is used to analyze each critical bank. At higher compression rates, the MPEG-I/audio level 3 and AC-3 encoders [3] pass windowed audio frames through the time-domain-aliasing–cancellation (TDAC) filterbank to obtain the highest frequency–resolution data for subband coding. By modified discrete cosine transform (MDCT) [11,12], TDAC filtering can be rapidly implemented/achieved using the fast DCT algorithms [13]. Perceptual quantization of the filtered signals, which achieves the lowest noise-to-mask ratios (NMRs) [1,14,15] to obtain optimum bit allocation, is very important in yielding high coding gains. The complex, high-fidelity audio codecs mentioned above involve several very complicated processes; for example, they require complex psychoacoustic analysis to achieve optimum noise-masking and spectrum sectioning, and must iterate requantization and perceptual allocation of saved bits [1–3] at given target bit rates. These are heavy burdens for personal computers performing high-fidelity audio coding, especially when running realtime applications such as visual/audio encoding of real-time transmissions, collaborative music authoring with the real-time delivery of musical creations among cocomposers at different Web sites, and real-time audio compression for sound games. Thus, developing a low-cost, high-quality audio coder to meet real-time requirements is becoming more important as demand for interactive multimedia communication increases. In this paper, we propose a low-complexity audio coder that obtains well-reconstructed audio quality, which can be comparable to that given from the codec specified in the MPEG-I/audio level 3 standard. The coder can compress a single-channel audio signal of 768 Kbps (16 bits/sample × 48 K sample/s) into a 64 Kbps bit stream. It is especially suitable for the multimedia extension (MMX) technology and for digital signal processing (DSP) chips.
2. Overview of the proposed low-complexity audio codec A subband filterbank, a perceptual model, a quantizer, an entropy coder, and a refinement module being the fundamental processing units consist of high-quality audio codecs operating at low bit rates. The proposed audio codec includes only these fundamental processing units in significantly simplified forms. It incorporates the most basic noise-masking functions specified in Model I of the MPEG-I standard, an appropriate nonuniform quantizer, an effectively recursive refinement module, and an adaptive arithmetic coder with fast multiplication-free adaptation. Tolerable perceptual errors based on the signal-to-mask ratio (SMR) [1] can be directly derived from the MDCT outcome of high-frequency resolution without running a FFTbased spectral analysis. Therefore, for simplification, we also adopt the TDAC filterbank [13,16] to decompose the input audio signal such that the noise masking of MDCT coefficients can be directly analyzed by the psychoacoustic model. We did not labor to design an appropriate perceptual model for our codec, instead, we use the most basic and indispensable noise-masking functions specified in the MPEG/audio standard [1] to determine
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
25
the modest quantization stepsize for each critical band. Then, we considered the noisemasking effect, quantization simplification, and refinement acceleration simultaneously in designing an appropriate nonuniform quantizer that yields perceptually noise-free quantized MDCT coefficients and efficiently assists in saved bits dispatch. To make the corresponding decoder as simple as possible, a constant delay is necessary for the decoding process. For this aim, our proposed encoder is a constant bit rate encoder to compress all the input frames into packets of the same length without the need of a “bit reservoir” for bit borrowing between individual frames. In some realizations, quantization and entropy coding must be reiterated until the compacted codes fall within the allowed bit budget. Since the nature of audio signals leads to high skew probability distributions of the MDCT quantization levels in some critical bands, arithmetic coding can provide quite effective compaction. Therefore, we use an effective adaptive arithmetic coder for entropy coding after quantization. By using the accumulated self-information derived from the probabilities of quantization levels to predict encoded bit counts, the most suitable scaling factor can be found by means of a table lookup process that increases the quantization stepsizes to limit the encoded bits to the bit quota. Thus, the proposed encoder avoids superfluous quantization and arithmetic coding iterations. After quantization and dynamic arithmetic coding, saved bits being residual usable bits will exist that reflect the differences between the encoded bits and target bits per frame. Allocating these saved bits to refine quantized spectral coefficient accuracy leaves significant room for improvement of recovered audio signals. With our proposed refinement algorithm, the perceptual degradation resulting from quantizing and the incomplete fitting of noisemasking effects is effectively compensated by efficiently decreasing the NMR to repress the audible coding distortions as much as possible. Fast adaptation algorithms designed for the adaptive arithmetic coding were developed to further improve the efficiency of the proposed codec. These can quickly tune the probability tables without multiplication. And, in order to facilitate implementation of the popular single-instruction-multiple-data (SIMD) technique for low-precision processors, we further propose simplification algorithms for quickly yielding noise-masking thresholds via a Taylor series expansion. In summary, the proposed audio codec framework represents a noticeably simple and methodical approach to providing interactive, real-time audio services.
3. Simple psychoacoustic model with the basic noise-masking equations Each N -sample windowed frame, where a 50% overlap exists between successive transform windows, is passed through a TDAC filterbank to obtain N/2 independent MDCT coefficients [17,18]. In this paper, we emphasize low-sophistication design, so we use the most basic functions in MPEG psychoacoustic model 1 for N = 1024. Thus, the proposed audio codec uses only spectral masking, being a simultaneous masking process, and temporal masking effects are not further considered. The resulting masking thresholds are adequate for achieving 64 Kbps transmission. Noise-masking thresholds obtained by the psychoacoustic model are represented by a masking curve of powers in dB versus discrete frequencies.
26
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
The threshold in quiet indicates a perceptual criterion to decide whether the sound pressure level of a pure tone is audible or not, only on the basis of loudness. After aligning the maximum of MDCT coefficients in dB to the reference maximum provided in the MPEG psychoacoustic model 1 for normalizing all MDCT coefficients in dB, the proposed psychoacoustic model uses the Crit–Band–Rate table listed in model I to obtain the N/2 absolute masking thresholds. These absolute masking thresholds are called the thresholds in quiet of which the ith MDCT coefficient is denoted as QM(i). In advance, the encoder drops normalized spectral coefficients, of which powers are below their absolute masking thresholds, and leaves the remaining MDCT coefficients to carry out global noise masking. These MDCT coefficients are classified as tonal and nontonal; then, the proposed psychoacoustic model uses discrimination criteria based on special local peaks [1] to identify the tonal components of each frame. Summing up the powers of a local-peak MDCT coefficient and the coefficients neighboring this peak forms a tonal component and the total power of other MDCT coefficients enclosed in an individual critical band represents a nontonal component. The tonal and nontonal components, of which strengths in dB are named as sound pressure levels (SPL), act as maskers primarily to mask audible noise. Locating the origin of each tonal/nontonal elementary masking function relating its corresponding masker begins configuration of the global masking curve. The fundamental masking functions in the Bark domain are first exploited to obtain the global masking thresholds. A frequency region of a Bark maps a critical band [19]. The Bark scale of the j th MDCT coefficient, denoted as the real number b(j ), is obtained by nonlinearly mapping its discrete frequency index. The elementary tonal and nontonal masking functions, respectively, denoted as TM and NM, are expressed as TM(j, i) = SPLT(j ) + UT(b(j )) + V (b(j ), b(i)) and
(1)
NM(j, i) = SPLN(j ) + UN(b(j )) + V (b(j ), b(i)),
(2)
where SPLT(j ) and SPLN(j ) are, respectively, the SPLs of the tonal and nontonal maskers at the j th discrete frequency index. The so-called j th tonal and nontonal indices are two monotonously decreasing functions along the Bark scales and are expressed as UT(j ) = −1.525 − 0.275b(j ) − 4.5 and UN(j ) = −1.525 − 0.275b(j ) − 0.5,
(3)
respectively. In (1) and (2), the masking curve-shaped function V (b(j ), b(i)) identically used by the tonal and nontonal maskers possesse asymmetrical slopes as V (b(j ), b(i)) = 17(b + 1) − (0.4SPL(j ) + 6), V (b(j ), b(i)) = [0.4SPL(j ) + 6]b, V (b(j ), b(i)) = −17b, V (b(j ), b(i)) = − (b − 1)(17 − 0.15SPL(j )) − 17,
for −3 b < −1 Bark, for −1 b < 0 Bark, (4) for 0 b < 1 Bark, for 1 b < 8 Bark,
where SPL(j ) represents the SPL of the tonal or nontonal masker at the j th discrete frequency index, and b = b(i) − b(j ). The global masking function is obtained via convolution of the elementary masking threshold function and the basilar membrane spreading function. For practical implementation, the global masking threshold of each discrete frequency can be obtained by directly summing up the powers given from all
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
27
elementary masking functions at the position of this frequency index. Consequently, the global masking threshold MTg (i) of the ith MDCT coefficient is yielded by the contribution of three kinds of masking functions MTg (i) = 10(QM(i)−D)/10 +
m
10(TM(j,i)−D)/10 +
j =1
m
10(NM(j,i)−D)/10,
j =1
i = 0, 1, . . . , N/2 − 1,
(5)
where integer m equals N/2 if all the high frequency coefficients are included, and D is the normalizing difference resulting from subtracting the maximal MDCT coefficient in dB from the referred maximum. When coefficients are less than their corresponding masking thresholds as shown in (5), their values must be forced to zero. This indicates that their SPLs will be inaudible to the human auditory system. Based on this psychoacoustic model, we developed an effective bit-allocation algorithm combined with an appropriate nonuniform quantizer that compensates for the simplification of this perceptual model.
4. The design of an appropriate nonuniform quantizer According to (5), basically, the value 2 MTg (i) must be used as the quantization stepsize for the ith MDCT coefficient to make sure that its quantization noise cannot be sensed. However, encoding all the N/2 global masking thresholds, i.e., MTg (i), for i = 0, 1, 2, . . . , N/2, is certainly not meaningful to data compression. Therefore, the coefficients are quantized using one quantizer per critical band. Each critical band contains a different number of MDCT coefficients, as shown in Fig. 1. Here, the minimal global masking threshold within the kth Bark is defined as MTmin (k). Thus, the quantization √ stepsize of this Bark shall be set as 2 × MTmin (k) to ensure that the quantization noises in this Bark are below the masking thresholds. No doubt, using a minimal tolerable quantization stepsize to all coefficients of one Bark will influence compression performance.Therefore, we designed a proprietary nonuniform quantizer via analysis of the elementary masking function to elevate the compression gain and effectively facilitate the refinement process. If not considering the interactive masking related to (4) among spectral coefficients, substituting the maximal Bark index, i.e., 24, to b(j ) into (3), yields a minimum for tonal and nontonal indices of about 12.625. Letting UT(b(j )) in (1) or UN(b(j )) in (2) equal 12.625, we can deduce from (5) that the acceptable ratio of spectral coefficient to corresponding reconstruction error is roughly equal to 1012.625/10 ∼ = 4.378. According to the brief analysis above, we know that there should be no perceptual noise after quantization if all the ratios of MDCT coefficients to quantization stepsizes keep beyond 8.756. But, in fact, the exceedingly simplified psychoacoustic model cannot approximate the human auditory system very well for all audio patterns. Hence, by experiment and conservative design, we limit the maximum of ratios of MDCT coefficients to quantization stepsizes to 4 rather than 8.756 in the nonuniform quantizer for achieving a
28
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
Fig. 1. The number of MDCT coefficients in each Bark.
monophonic channel of 64 Kbps high-quality audio. According to the above constraint and simulations, an appropriate nonuniform quantizer is effectively designed and described by the quantization stepsize of ith MDCT coefficient, denoted as X(i), as δ for 0 |X(i)| 8δ, for 8δ < |X(i)| 16δ, 2δ for 16δ < |X(i)| 32δ, for i = 0, 1, . . . , N/2, (6) = 4δ 8δ for 32δ < |X(i)| 64δ, 16δ for 64δ < |X(i)| 128δ, √ where δ is 2 MTmin (k), while X(i) is in the kth Bark. According to (6), the nonuniform quantizer yields a reconstruction noise curve approximately parallel to the masking threshold curve, which means that a fairly consistent signal-to-noise ratio exists over the range of quantizer values. After the value of δ is determined, only right-shifting operations in (6) are needed to obtain four primary decision levels, the powers of 2, 8δ, 16δ, 32δ, 64δ, for initially determining to which of five coarse regions each quantizing datum belongs. Four or eight more refined decision levels in each of these regions are then used to yield the final quantization levels. The MMX technology extension of the Intel architecture can then quickly carry out the inner product using SIMD techniques, which can perform the operations of packed words that each packed word usually binds four or eight individual data up. Hence, the quantization process described above can easily be achieved by modern software with MMX instructions or simple hardware. It is significant to note that using this quantizer allows fast computation of individual MDCT coefficient NMRs that can further provide effective acceleration during performing refinement. To simplify controls of constant rate and delay, an effectively recursive refinement algorithm is addressed in this section.
5. Effectively recursive refinement algorithm Arithmetic coding achieves optimal entropy coding and provide a high compression ratio for high skew probability distributions. After quantization, just like the advanced
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
29
Fig. 2. The minimum masking thresholds in dB versus Bark scale yielded by an experimental audio frame.
audio codec [1], arithmetic encoding is used to compact the quantization results. As shown in Fig. 2, some low-frequency Bark stepsizes are over three orders of magnitude larger than high-frequency stepsizes for some audio frames. This implies that more distant critical bands may have the larger probability distribution differences, so their spectral coefficients should be classified into groups with close statistical characteristics, and individual group probability tables are used for arithmetic coding. Certainly, using more different probability tables to various Barks provides greater compression, but also it requires more memory and increases complexity. Thus, for simplicity, we divide the whole spectrum into only two sections with two individual proprietary probability tables when arithmetic coding is performed. The first includes the first 128 MDCT coefficients, which can nearly encompass the spectrum of human speech, and the second contains the residuals. Thus, if abrupt diminution of transmission bandwidth occurs, we need only transmit the first part by which human voice can be recovered. Basically, the encoded bits can be estimated in advance by using the probability table of arithmetic coding to a formula described below as K−1 I (k)−1 1 , (7) Nki log2 Eb = pki k=0
i=0
where Nki is the ith quantization level occurrence count for the kth critical band, pki is the occurrence probability of this quantization level, and I (k) represents the number of possible quantization levels in this section. Here, since the transmission bit rate is limited to Rb = 64 Kbps, for fixed-rate coding, the number of allocated bits in an N/2 sample frame is the permitted bit quota ((N/2)/fs ) × Rb = 683 (bits) for the sampling rate fs = 48 (KHz). If the number of estimated bits for the two spectral sections using (7), plus related side information, exceeds this permitted bit quota, the basic but effective method is to enlarge the quantization stepsize of every Bark generated by the psychoacoustic model. By giving precise scaling factors for weighting these stepsizes, arithmetic coding need be executed only one time for the fixed-rate coding. Therefore, to handle bits in excess of the bit quota, suitable scaling factors are assigned to constrain the encoded frame bits just below this bit quota. By tabulating factors and numbers of excess bits to obtain a checking table with an offline process, as soon as the estimated information exceeds the bit
30
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
quota during the encoding process, the codec exploits this checking table to find the most suitable scaling factor to weight all original quantization stepsizes up. For simplicity, only four probability tables are included in our codec. One pair is, respectively, used to lowand high-frequency spectra, as quantization stepsizes are not weighted, and another pair is exploited after the stepsize is weighted. As shown in Fig. 3, the weighting parameter W represents the most appropriate scaling factor obtained from the lookup table. The value of W is constrained to the range [1, 8.756/4] for the tolerable perceptual distortion level mentioned in the previous section. When the bit estimate in (7) exceeds 683 bits, all encoder switches change processing paths to make use of the pair of probability tables for arithmetic coding of the quantization levels resulting from quantization of weighted stepsizes. After arithmetic coding, the proposed quantized coefficient refinement module yields reconstructed 64 Kbps audio. As shown in Fig. 4, the reconstruction quality is very similar to that of the MPEG-I/audio level 3 encoder. We define the quantization stepsize as j , which is used to the j th MDCT coefficient X(j ), to formulate the average error energy of a frame [20] as E= 2j /12 + X(j )2 , j = 0, 1, . . . , N/2. (8) j /2X(j )
j /2>X(j )
Conventional precision-increasing processes work to minimize mean-square errors such as E in (8) [21–23]. By contrast, currently popular high-quality audio encoders attempt to effectively increase perceptual quality by minimizing noise-to-mask ratios [1,14,24]. Hence, the proposed refinement module is designed to minimize NMRs by efficiently and quickly allocating the saved bits to the quantized MDCT coefficients. According to [1, 15], an individual NMR of a subband is defined as the ratio of the reconstructed error energy to the minimal global masking threshold of this subband. For convenience and compression performance considerations, the second term in (8) is ignored in this work. Thus, an individual approximate NMR for the ith MDCT coefficient is set as (i /δ)2 , which is derived from the quantization using (6). Obviously, with the proposed quantizer described in (6), individual approximate NMRs of all the spectral coefficients can be easily obtained for both encoder and decoder. Through repeated but rapid comparison of individual approximate NMRs for all spectral components, the refinement module can achieve near minimal NMR for an encoded frame. Auditory perception considerations demand that the refinement of spectral coefficients should be performed repetitively from the coefficient of the largest individual approximate NMR to that of the smallest NMR, and from unmasked to masked coefficients. And, since the same individual approximate NMRs often exist in MDCT coefficients, the refinement priorities must be established for these coefficients, which will be discussed in the following cases. If spectral coefficients with identical quantization levels (i.e., same individual approximate NMR) occur in distinct Barks, the precision of the quantized coefficient with the larger quantization stepsize is increased prior to that with the smaller stepsize to effectively inhibit the quantization error power shown in (8). But, if distinct Barks mentioned above use the same quantization stepsize, the lower frequency critical bands are first refined to maintain the better waveform envelope. By substituting (3) in (1) and (2), we find that the higher frequency masker causes fewer elementary masking effects, which means that the higher frequency coefficients tolerate relatively smaller quantization
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
Fig. 3. The flow diagram of the proposed high-quality audio encoder. When the number of estimated encoded bits exceeds 683, all the switches will be switched to paths utilizing the weighting stepsizes.
31
32
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
Fig. 4. Original waveforms as well as reconstructed waveforms resulting from the proposed and MPEG-I/audio level 3 codes for four experimental audio frames.
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
33
noise than the lower frequency coefficients. Thus, in assigning residual bits to MDCT coefficients of same individual approximate NMRs within a Bark, the order should be from higher to lower indices. In practice, the quantization design in (6) can yield a simple and quick refinement process. According to the above concepts, we delineate the recursive refinement process in the next paragraph. The ith reconstructed MDCT coefficient is set as Xr (i) and the corresponding i yielded by the proposed quantizer is used as the initial value of the double quantization error, and let ε(i) = i . The individual approximate NMRs resulting from the proposed quantizer have only five values {162 , 82 , 42 , 22 , 1} before the recursive refinement process starts. Therefore, the maximum of all the square roots of individual approximate NMRs (SNMRs) is the parameter M with an initial value of 16. Therefore, minimization of the total approximate NMR of a frame is accomplished by the following iteration loop of allocating residual bits. Step 1: Check M; if M = 1, search the MDCT components whose SNMRs are equal to M, then go to Step 3 to refine them. But, as M is reduced to become 1, while the MDCT components with SNMR = M are found, their unmasked coefficients are separated from their masked coefficients (i.e., the coefficients with the zero quantization level). Refine all these unmasked coefficients before these masked coefficients by Step 3. Record the number of SMNR = M MDCT components as C. Step 2: If C equals zero, divide M by 2, then go back to Step 1. Otherwise, go to the next step. Step 3: Allocate a residual bit to a SMNR = M MDCT coefficient, then decrement the number of residual bits by one and C = C − 1. Assume that the reallocated coefficient is the ith MDCT for the moment. If X(i) − Xr (i) 0, the value of allocated residual bit γ is assigned as 1 (i.e., γ = 1); otherwise, let γ = 0. Then, update Xr (i) and ε(i) by using
ε(i) and ε(i) = ε(i)/2, (9) Xr (i) = Xr (i) + (−1)γ +1 4 as well as divide its SMNR by 2. The refinement order of allocating residual bits to SMNR = M MDCT components is from the Bark with the largest δ to the Bark with smallest δ for the various Barks, as well as from the highest frequency index to the lowest index for a Bark. Step 4: Check the residual bit count. If the residual bits have not been exhausted, go to Step 2. Otherwise, exit this loop. Since all the SNMRs and M are always an integer power of 2 only one bit in the binary forms of these data is 1. Thus, in Step 1, searching the SMNR = M MDCT components is very fast since only one bit must be checked. This refinement procedure is very efficient in minimizing total approximate NMR without requiring any extra side information related to actual reconstruction errors. Of course, when the perceptual characteristics are completely analyzed to precisely confirm the perceptual importance of each Bark to the human ear, the encoded bits of quantized coefficients in the various Barks can be separately compacted as
34
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
independent arithmetic codewords in significance order. And, the enhancement bits yielded by the refining processes are subsequently delivered in the same order as the arithmetic codewords. Thus, the proposed audio codec can achieve progressive audio coding with MNR scalability. Consequently, this proposed codec with bit-rate scalability can easily be handled in real-time, interactive multimedia applications.
6. Fast implementation of probability adaptation and basic functions The adaptation in arithmetic coding can be implemented more easily than that of Huffman coding, which relies on the contiguous reconstruction of code trees [25]. Existing adaptation and estimation methodologies using probability tables, such as that in [26], appear complex more or less for low-cost processors. In the next section, we address two adaptation techniques having no need for multiplication to update probability tables during adaptive arithmetic coding for efficient bit-allocation achievement in real-time processors. A. Fast multiplication-free adaptation techniques In general, the initial probability tables used for adaptive arithmetic codecs are obtained by appropriate offline statistical analyses and zero probability is forbidden. Assume a total of L quantization levels; the size of the alphabet composing all quantization levels as symbol members is L symbols. Taking into account the precision constraint, the total occurrence count of all quantization levels resulting from (6) are constrained to the integer power of 2 as 2µ , in the sense that the buffers storing probabilities are of (µ + 1) bits wordwidth. The initial occurrence probabilities p0 , p1 , . . . , and pL−1 of L quantization levels are numerically represented by C0 (0)/ζ , C1 (0)/ζ , . . . , and CL−1 (0)/ζ , respectively, where the integers C0 (0), C1 (0), . . . , and CL−1 (0) are referred to as the initial occurrence counts. By intuition, changes in probabilities are performed immediately after the arithmetic codec processes an event, i.e., a quantized MDCT coefficient. To facilitate fast implementation within the (µ + 1)-bit precision limitation, we do not use such a short updating period. And we propose two extremely low-complexity techniques for updating the occurrence probabilities with nonideal but fairly reasonable frequencies. Both proposed methods need L registers to store the currently used numerical probabilities, and L auxiliary counters to record the L quantization level instantaneous occurrence times during online encoding/decoding. In the first method, the updating of the probability table is performed once per ζ -input events, which are considered as a data group. When the count summation of the L auxiliary counters reaches ζ , meaning that an ζ -event group has just been coded, all corresponding L counts are simultaneously reset to zero to restart counting. The L numerical probabilities p0 , p1 , and pL−1 are then immediately modified at once as Cj (T ) + Cj (T − 1) , 2 pj (T ) = Cj (T )/ζ, for j = 0, 1, . . . , L − 1,
Cj (T ) =
(10)
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
35
where Cj (T ) expresses the occurrence count of the j th level within the T th ζ -event data group. According to the low-complexity requirement, the divisions by 2 and ζ in (10) require only one-bit and µ-bit right-shifting operations, respectively. For arithmetic coding convenience, zero probability must be prohibited even if some levels do not occur for a very long time, so the minimal count must be forced to 1 for each adaptation by (10). Through the probability adaptation in (10), past events grow less influential on the increasing offsets relative to the current input because (10) increases recent count potential weights and gradually decreases past count weights to achieve pj (T ) for j = 0, 1, . . . , (L − 1). Therefore, the proposed adaptation technique yields reasonable modification probabilities for the next input data to be processed. To obtain even higher compaction performance, the updating period for fitting the dynamic audio characteristics could become 2 to the power of a variable integer if the statistical audio characteristics can be well detected in time during the encoding process. The fitness of the probabilities for musical signals with large active characteristics is still in question because high irrelevance will exist in some neighboring audio samples of these signals. Hence, for generalization, another effective adaptation technique is further proposed to relate probability updates to only current ζ -input samples. In achieving multiplication-free adaptation, there is a simple limitation that the updating period must be an integer multiple of the alphabet size, so the smallest permitted update period is equal to the time spent on inputting L events. This technique requires two probability tables to separately record the probabilities in use and the adapted probabilities. To accurately track changes in audio characteristics, the adapted probability table is modified for every L-input event. Thus, the p0 , p1 , and pL−1 are adjusted immediately after inputting L events once Cj (T ) = Cj (T ) + Cj (T − 1) − 1, pj (T ) = Cj (T )/ζ,
for j = 0, 1, . . . , L − 1,
(11)
where Cj (T ) expresses the occurrence count of the j th quantization level within the T th L-event data group. Then, according to the speed of changes in audio characteristics, the arithmetic coder can adopt a renewed probability table by replacing the currently used probabilities with the adapted probabilities at any appropriate moment. For most real-world applications, statistical knowledge of the source is usually nowhere near sufficient to figure out the exact difference between predicted and true probabilities for the next L-input event. Therefore, for precision limitation of low-cost processors and calculation simplification, each count is fairly decremented by 1 in the first equation of (11) to keep summation of all the L counts equal to ζ . To make arithmetic coding more easily workable, any zero count resulting from (11) must be set to 1 immediately, and 1 must be simultaneously borrowed from other counts larger than 1 to keep the summation equal to ζ . This complex borrowing method will obstruct realization regularization to a greater or lesser extent; therefore, we propose a simple but reasonable method to resolve it. We use an auxiliary 1-counter to count 1 up as a count with value 0 set to the count with value 1 in the L-iteration loop of (11). If the counting result of this auxiliary counter is λ at the end of the L-iteration loop, it indicates that a total of λ extra occurrence times were generated in avoiding zero counts. These L levels are first classified into two groups that the levels with 1 count belong to the first group, the others to the second. Next, a λ-iteration
36
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
loop is immediately performed to execute count lending. In the λ-iteration loop, we check the codewords of counts of the second group bit-by-bit from the maximal significant bit (MSB), i.e., the (µ − 1)th bit, to the least significant bit (LSB). The count codewords have higher MSB as bit 1; these counts will be more antecedently selected to decrement 1. If any are reduced to become 1 by the decrementing process, then the corresponding level is assigned to the first group as a new member. Thus, the auxiliary counter continuously counts down one-by-one from λ along the count decrement of the second group until its count reaches zero. This process exploits the concept that sacrificing larger counts under the precision limitation will cause less entropy increase. In next paragraph, we give a brief derivation to confirm the above claim. For simplicity, assume that the alphabet has only two members whose actual probabilities are defined as (ζ /2 + η)/ζ and (ζ /2 − η)/ζ , respectively, where η is a nonzero positive integer less than ζ /2. If the smaller count borrows 1 from the larger count, the average information under the fixed probabilities becomes Λ1 =
ζ /2 − η ζ ζ ζ /2 + η log2 + log2 . ζ ζ /2 + η − 1 ζ ζ /2 − η + 1
(12)
By contrast, the average information is Λ2 =
ζ /2 − η ζ /2 + η ζ ζ log2 + log2 . ζ ζ /2 + η + 1 ζ ζ /2 − η − 1
(13)
Then, comparing Λ1 with Λ2 by the subtraction, we have ζ /2 + η − 1 ζ /2 − η ζ /2 − η + 1 ζ /2 + η log2 + log2 ζ ζ /2 + η + 1 ζ ζ /2 − η − 1 ζ /2 − η ζ /2 + η − 1 ζ /2 − η + 1 log2 × > ζ ζ /2 + η + 1 ζ /2 − η − 1 2 ζ /2 − η ζ /4 − η2 + 2η − 1 = log2 2 > 0. ζ ζ /4 − η2 − 2η − 1
Λ2 − Λ1 =
(14)
From (14), we know that Λ1 is less than Λ2 . Hence, without doubt, when a zero probability occurs, the count lending must be from larger to smaller counts. All the processes proposed above for probability adaptation involve only zero checking, one-bit checking, one-by-one up/down counting, and right-shifting. Therefore, they can be very easily realized especially in hardware. In our simulations, the proposed audio encoding programmed by our master’s degree candidate students can require only about one half the time of the MPEG-I/audio level 3 encoding. As shown in Fig. 4, the recovered audio waveforms have perceptual qualities similar to those resulting from the MPEG-I/audio level 3 encoder and are very close to the original waveforms. B. Fast realization of the basic unit functions via taylor series expansion Fast implementation of frequently used functions appears necessary in the low precision processors for low-cost applications. Achieving the masking thresholds and
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
37
the information estimation in (7) is fundamentally based on transfers and inverse transfers of dB scales such that logarithm and exponentiation (e.g., 10 to the powers of a real number) calculations are frequently needed. Hence, the two functions can be considered as fundamental key functions of which frequent executions will require much computational time. To facilitate fast implementation in MMX technology, we propose effective approximation of these two basic functions via Taylor series expansion. The normalized floating-point binary representation of X(i) is in general denoted as ±(1.bT bT −1 . . . b0 )2 × 2n , where bT , bT −1 , bT −2 , . . . , b0 ∈ {0, 1} and n is an integer. If the bit bT , which only performs the single-bit checking, equals zero then log10 |X(i)| is expressed as log10 |X(i)| = log10 (1 + α) + n × log10 2,
(15)
where α is (X(i) − 2n )/2n . Otherwise, it is represented as log10 |X(i)| = log10 (2 − β) + n × log10 2 = log10 (1 − α ) + (n + 1) × log10 2,
(16)
where β is (2n+1 − X(i))/2n and α is β/2. Since α and α are less than or equal to 0.5 and 0.25, respectively, we can expand (15) and (16) into a Taylor series up to only the first few terms. To very rapidly obtain the approximated dB values, (15) and (16) are merely expanded up to a two-term series about 1 as α2 + (log10 2)(n) log10 x(i) ∼ (17) = (log10 e) α − 2 and
α 2 log10 x(i) ∼ (log e) −α − + (log10 2)(n + 1), = 10 2
(18)
respectively. In some cases, the identification of tonal and nontonal components can rely only on the comparison between the second terms of MDCT coefficients shown in (17) or (18). By extracting the constants log10 e and log10 2 from (17) and (18) outside the equations concerned with the summation of logarithms such as (1), (2), (4), and (7), log10 e and log10 2 can be multiplied only once to perform these operations. Thus, multiplications and processing round-off errors can be reduced. Of course, the constants can be further merged into the perceptual formula coefficients such that the multiplications and the roundoff errors can be further reduced. In psychoacoustic models, the inverse of the logarithm is usually 10 to the power of a real number, as depicted in (5), such that a Taylor series expansion can provide simplification similar to that above. One can first express 10yj as 10Ij × 10Fj , where Ij is the closet integer to the real number yj and Fj is the fraction within [−0.5, 0.5]. Thus, when performing a Taylor series expansion about a proper value F0 , 10yj can be approximated by summing the first three terms in the Taylor series
(ln 10)2 (Fj − F0 )2 , (19) 10yj = 10Ij 10F0 1 + ln 10(Fj − F0 ) + 2
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
Fig. 5. Flow diagram of the proposed high-quality audio decoder.
38
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
39
where F0 is set as 0.25 to limit (Fj − F0 ) to within [−0.25, 0.25]. Substituting (19) into the summation equations in (5), we rearrange the summation of the powers of 10 in the following form: J
J yj F0 Ij Ij 2 Ij C1 + C2 (Fj × 10 ) + C3 (Fj × 10 ) , 10 = 10 10 (20) j =1
j =1
where C1 = (1−F0 ln 10+F02((ln 10)2 /2)), C2 = ln 10(1−F0 ln 10), and C3 = (ln 10)2/2; the constants C1 , C2 , and C3 can be calculated and stored in advance. Similarly, the constants C1 , C2 , and C3 can be merged into the related coefficients of all the perceptual formulas to reduce multiplications and round-off errors. Hence, we can efficiently reduce the total computations required to achieve N/2-point MTg (i), which requires extreme computational complexity. Using the above fast algorithms for the very low-precision processors, round-off errors can be further reduced with special rearrangement of multiplication terms in the summation for achieving the dot products, where multiplication terms with close values in the summation will be previously added together [27] in (5), (7), and (20). Significantly, the successive multiplication-and-addition operations in (17)–(20) can be effectively executed using the SIMD technique on extended Intel CPUs or DSP chips. By contrast, use of the SIMD technique does not specially benefit the table lookup in yielding actual logarithms and 10 to the powers of real numbers.
7. Conclusions The proposed audio codec is very low complicated and has high regular scheme. It has been verified by the experimental simulations that the proposed audio encoder can compress 48 × 16 Kbps audio sequences into 64 Kbps bitstreams with the reconstruction quality and SNR similar to those resulting from the MPEG-I/audio level 3 encoder. The audio coder effectively combines the proposed refinement algorithms with a specially designed nonuniform quantizer to obtain high subjective quality using the fast and efficient implementations. Generally speaking, most music signals reproduced by the proposed coding scheme are transparent to their original signals for the general listener. Furthermore, all the proposed fast implementations, especially the multiplication-free adaptation in adaptive arithmetic coding, can be realized easily in the low-precision processors. Consequently, the codec can be rapidly realized in low-cost hardware or in software with MMX instructions. Thus, it is very suitable for real-time or just-in-time interactive multimedia applications, especially with only the decoding requirement shown in Fig. 5. In addition, due to the simple data encoding structure, the proposed audio codec can be easily realized as a progressive audio codec in scaleable systems.
Acknowledgments The authors are grateful to Chaur-Heh Hsieh and Jar-Ferr Yang for their great help and many valuable suggestions.
40
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
References [1] ISO/IEC JTC1/SC29/WG11 MPEG, International Standard, ISO-IEC 13818-7 (Part 3, Audio). Generic coding of moving pictures and associated advanced audio coding, 1994. [2] Y.F. Dehery, M. Lever, P. Urcun, A MUSICAM source codec for digital audio broadcasting and storage, in: Proc. ICASSP’91, 1991, pp. 3605–3608. [3] Dolby AC-3, Multi-channel digital audio compression system algorithm description, Revision 1.12, Dolby Laboratories, London, 1994. [4] Y. Mahieux, J.P. Petit, High-quality audio transform coding at 64 Kbps, IEEE Trans. Commun. 42 (11) (1994). [5] Y.F. Dehery, MUSICAM source coding, in: AES 10th International Conference, pp. 71–79. [6] E. Zwicker, U.T. Zwicker, Audio engineering and psychoacoustics: Matching signals to the final receiver, the human auditory system, J. Audio Eng. Soc. 39 (3) (1991) 115–126. [7] Jayant, Johnston, Safranek, Signal compression based on models of human perception, in: Proc. IEEE, Vol. 81, 1993, pp. 1383–1422. [8] G. Davidson, L. Fielder, M. Antill, High-quality audio transform coding at 128 Kbits/s, in: Proc. ICASSP’90, 1990, pp. 1117–1120. [9] J.D. Johnston, Transform coding of audio signals using perceptual noise criteria, IEEE Selected Area Commun. 6 (1) (1988) 314–323. [10] G. Theile, G. Stoll, M. Link, Low bit-rate coding of high-quality audio signals. An introduction to the MUSICAM system, EBU Rev. Tech. 230 (1988) 158–181. [11] P.P. Vaidyanathan, in: Multirate Systems and Filter Banks, Prentice Hall International, Englewood Cliffs, NJ, 1993, pp. 373–391. [12] P. Duhamel, Y. Mahieux, J.P. Petit, A fast algorithm for the implementation of filter banks based on “time domain aliasing cancellation”, in: Proc. ICASSP’91, 1991, pp. 2209–2212. [13] D.-Y. Chan, J.-F. Yang, S.-Y. Chen, Regular implementation algorithms of time domain aliasing cancellation, IEEE Proc. Vision Image Signal Process. 143 (6) (1996). [14] S. Voran, Perceptual-based bit-allocation algorithms for audio coding, in: IEEE ASSP Workshop, 1997, pp. 57–60. [15] X.W. Martyn, J. Show, M.R. Varley, Optimum bit allocation and decomposition for high quality audio coding, in: Proc. ICASSP’97, Vol. 1, 1997, pp. 315–318. [16] J.P. Princen, A.B. Bradley, Analysis/synthesis filter bank design on time domain aliasing cancellation, IEEE Trans. Acoustics Speech Signal Process. ASSP-34 (5) (1986) 1153–1161. [17] J.P. Princen, A.W. Johnson, A.B. Bradley, Subband/transform coding using filter bank designs based on time domain aliasing cancellation, in: Proc. ICASSP’87, 1987, pp. 2161–2164. [18] D.-Y. Chan, J.-F. Yang, S.-Y. Chen, Regular implementation algorithms of time domain aliasing cancellation, IEEE Proc. Vision Image Signal Process. 143 (6) (1996). [19] E. Zwicker, Subdivision of the audio frequency range into critical bands, J. Acoust. Soc. Am. 33 (2) (1961) 248. [20] A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing, Prentice Hall, New York, 1989, pp. 359– 361. [21] D.G. Luenberger, Optimization by Vector Space Methods, Wiley, New York, 1969. [22] N.S. Jayant, P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice Hall International, Englewood Cliffs, NJ, 1984. [23] T. Berger, Rate Distortion Theory and Application, Prentice Hall International, Englewood Cliffs, NJ, 1971. [24] C.H. Lim, Improving the performance of MPEG audio coder layer I using adaptive representation of bit allocation information, Electronic Lett. 33 (18) (1997) 1356–1357. [25] D.E. Knuth, Dynamic Huffman coding, J. Algorithms (1985–1986) 163–180. [26] D.L. Duttweiler, C. Chamzas, Probability estimation in arithmetic and adaptive-Huffman entropy coders, IEEE Trans. Image Process. 4 (3) (1995) 237–246. [27] J.-F. Yang, D.-Y. Chan, U.-B. Chen, Fast- and low-roundoff implementation of quadrature mirror filters for subband coding, IEEE Trans. Circuit Systems Video Tech. 5 (6) (1995) 524–532.
D.-Y. Chan, C.-Y. Ku / Digital Signal Processing 13 (2003) 23–41
41
Din-Yuen Chan was born in Taipei, Taiwan, on October 30, 1962. He received the M.S. in medical engineering and the Ph.D. in electrical engineering from Cheng Kung University in 1992 and 1996, respectively. He has been a senior engineer and a project leader in a company named Multi-Media for developing quasi videophone products. He is currently an associate professor in the Department of Information Engineering, I-Shou University, which he joined in 1997. His teaching and research are primarily in the areas of signal processing, data compression, image database retrieval, and visual communication. Cheng-Yuan Ku was born in Taipei, Taiwan, on October 14, 1965. He received the B.S. in control engineering from the National Chiao Tung University in 1987 and the M.S. and Ph.D. in electrical engineering from Northwestern University in 1993 and 1995, respectively. From 1995 to 1999, he was with the Department of Information Engineering, I-Shou University. In 1999, he joined the Department of Information Management, National Chung Cheng University. His current research interests include cryptography, digital signal processing, and network management. He is a member of INFORMS, IEEE, and IEICE.