
Predictive vector quantization of wideband LSF using narrowband LSF for bandwidth scalable coders

Hiroyuki Ehara, Toshiyuki Morii, Koji Yoshida

Next-Generation Mobile Communication Development Center, Matsushita Electric Industrial Co. Ltd. (Panasonic), 239-0847 Yokosuka, Japan

Received 20 February 2006; received in revised form 1 February 2007; accepted 11 April 2007

Portions of this work were presented at INTERSPEECH 2005 (Lisbon, September 2005).

Abstract

For implementing a bandwidth-scalable coder, a wideband line spectral frequency (LSF) quantizer was developed. It works in combination with a narrowband LSF quantizer. A new predictive vector quantization was introduced to the wideband LSF quantizer. The predictive vector quantizer is based on the use of several predictive contributions, which include first-order autoregressive (AR) prediction and vector quantization (VQ) codebook mapping. One feature of the new predictive vector quantizer is exploitation of the correlation between wideband and narrowband LSFs quantized in the previous frame for estimating the wideband LSF in the current frame. A 16-bit switched predictive three-stage vector quantizer was used to encode the estimation residues. Results showed that introduction of the predictor brought about a performance improvement of 0.3 dB in spectral distortion. This paper describes the procedures for designing the predictor and the three-stage codebook, as well as simulation results.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Predictive vector quantization; LSF; LSP; Bandwidth scalability; Codebook mapping

1. Introduction

Bandwidth-scalable coding, which is the target application of the algorithm proposed in this paper, has a layered coding structure with one or more enhancement layers on top of a core layer. Such a coding structure is suitable for a speech/audio codec used in heterogeneous networks and voice over IP (VoIP) communications. It can provide the optimum speech/audio quality according to service, network, or terminal constraints. Simple truncation of the layered bitstream can adjust the bit-rate and quality of the communication. Such truncation can be performed at gateways or at any point of the communication chain. One use case of this feature is a tele-conference service having a multi-point connection with several clients of different types. In this case, a server can distribute a proper bitstream to each client by truncating a layered bitstream. The feature can also be exploited in a congestion control mechanism of gateways. Furthermore, scalability in bandwidth is an expected feature of future speech/audio codecs: improving communication quality by offering wideband quality is important, along with support of the conventional narrowband speech communication service. Recently, such a scalable coding algorithm was standardized by the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) as G.729.1 (ITU-T, 2006). Another scalable coding algorithm is also being studied under Q.9/16 of ITU-T; it is expected to be standardized in 2008.

In this paper, we propose a new predictor that estimates wideband line spectral frequency (LSF) parameters using several predictive contributions, including previously quantized wideband LSF parameters (WB-LSF), narrowband LSF parameters (NB-LSF), and WB-LSF mapped from NB-LSF. The paper presents the algorithm of a WB-LSF quantizer which, based on predictive vector quantization (PVQ), exploits the proposed predictor.


The WB-LSF quantizer works in combination with an NB-LSF quantizer. Consequently, it can be thought of as the bandwidth extension layer of a narrowband/wideband scalable LSF quantizer. Such an LSF quantizer is used in the MPEG-4 CELP speech coder (Nomura et al., 1998), and an optimized design algorithm for that LSF quantizer was discussed in the literature (Koishida et al., 2000). However, few studies have examined how to improve the performance of such LSF quantizers, which motivated us to explore several PVQ designs (Ehara et al., 2005a) and to develop an algorithm for the extension layer of a bandwidth-scalable LSF quantizer (Ehara et al., 2005b).

The remainder of the paper is organized as follows. In Section 2, we give an overview of the structure of our bandwidth-scalable LSF quantizer. In Section 3, we consider introducing memory-based prediction. In Section 4, codebook (CB) mapping based WB-LSF estimators are studied as memory-less predictors. In Section 5, we describe the procedures for training the CBs of the LSF quantizer. In Section 6, results of the performance evaluation are presented. Section 7 compares the complexity of the tested quantizers. Finally, in Section 8, we summarize our findings on designing PVQ for an enhancement layer of a bandwidth-scalable LSF quantizer.


Fig. 2. Block diagram of the WB-LSF quantizer (quantized NB-LSF upsampling, CB mapping, a coefficient table holding β0–β5 for Mode0 and Mode1, a three-stage codebook CB1–CB3 with separate Mode0/Mode1 subsets, and weighted error minimization producing the encoded data).

2. Bandwidth scalable LSF quantizer

In this section, an overview of our bandwidth scalable LSF quantizer is presented. Because the main subject of our study is the scalable quantization of WB-LSF using NB-LSF, the overall structure of the quantizer is reviewed only briefly; we specifically examine the detailed structure of the WB-LSF quantizer, which functions as the bandwidth extension layer of the bandwidth-scalable LSF quantizer. Fig. 1 shows a schematic diagram of the scalable LSF quantizer. The WB-LSF quantizer performs quantization of a WB-LSF vector in a layered manner using the output from an NB-LSF quantizer. A 29-bit vector quantizer is used as the NB-LSF quantizer in our experiments. The detailed structure of the WB-LSF quantizer is depicted in Fig. 2. The NB-LSF vector output from the NB-LSF quantizer is upsampled in the autocorrelation domain; then the error vector between the upsampled NB-LSF and the unquantized input WB-LSF vector is quantized using a switched predictive three-stage vector quantizer. The upsampling process and the switched predictive quantizer are described in the following subsections. In this study, the sampling rates of the NB and WB signals are 8 and 16 kHz, respectively, and the LSF orders are 12 and 18. The quantizer operates on 20 ms frames.

Fig. 1. Schematic diagram of the bandwidth-scalable LSF quantizer (the wideband speech input is downsampled and analyzed to obtain the 12th-order NB-LSF, which is quantized by the NB-LSF quantizer; the 18th-order WB-LSF is quantized by the WB-LSF quantizer, which forms the extension layer).

2.1. Signal analysis

The LSF parameters are extracted once per 20-ms frame. A 35-ms asymmetric Hamming window is used as the analysis window. The last 5 ms of the window is a lookahead beyond the 20-ms frame; the first 10 ms overlaps with the preceding frame. The 20-ms frame is divided into four subframes, and the LSFs of the fourth subframe are quantized by the studied LSF quantizer. The center of the asymmetric Hamming window is therefore placed at the center of the fourth subframe, meaning that the 35-ms asymmetric window consists of half of a 55-ms Hamming window and half of a 15-ms Hamming window. This window specification is used for both signals, the 8 kHz sampled NB signal and the 16 kHz sampled WB signal. A 12th-order linear predictive (LP) analysis is performed for the NB signal, whereas the WB signal is analyzed with an 18th-order LP analysis. The LP analysis uses the autocorrelation method, and a 60 Hz bandwidth expansion is applied in the same way as in ITU-T Recommendation G.729 (Salami et al., 1998). The WB signal is a 50–7000 Hz band-limited signal obtained with the ITU-T Recommendation P.341 (ITU-T, 1998) send-side weighting filter provided in the ITU-T Software Tool Library of ITU-T Recommendation G.191 (ITU-T, 2005b); the NB signal is band-limited to 300–3600 Hz using an IIR HPF and an IIR LPF.
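As a concrete reading of this analysis step, the following sketch computes LP coefficients for one already-windowed frame with the autocorrelation method and a 60 Hz Gaussian lag window. It is a minimal illustration rather than the reference implementation; the white-noise correction constant and function names are assumptions made for the example.

```python
import numpy as np

def lp_coefficients(frame, order=18, fs=16000, bw_hz=60.0):
    """Autocorrelation-method LP analysis of one (already windowed) frame.

    order : 12 for the 8 kHz NB signal, 18 for the 16 kHz WB signal.
    """
    # Autocorrelation up to the LP order.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    # 60 Hz bandwidth expansion: Gaussian lag window, as used in G.729-style analysis.
    k = np.arange(order + 1)
    r = r * np.exp(-0.5 * (2.0 * np.pi * bw_hz * k / fs) ** 2)
    r[0] *= 1.0001  # small white-noise correction keeps the normal equations well conditioned
    # Levinson-Durbin recursion solving the autocorrelation normal equations.
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k_m = -acc / err
        a_prev = a[1:m].copy()
        a[1:m] += k_m * a_prev[::-1]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a  # A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order; the LSFs are derived from A(z)
```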

2.2. NB-LSF quantizer

The NB-LSF parameters are quantized using a 29-bit scalable quantizer that consists of a 21-bit quantizer and an 8-bit complementary VQ, as shown in Fig. 3. The 21-bit quantizer exploits an MA-predicted two-stage vector quantization; it resembles that used in (Thyssen et al., 2001; Hiwasaki et al., 2004). Two bits specify one of four moving average (MA) predictors, and the second stage is split into two sub-codebooks. The bit allocation is shown in Table 1: seven bits are assigned to the first-stage codebook, and six bits are allocated to each sub-codebook of the second stage. The two sub-codebooks have the same vector dimension. The quantization error of the 21-bit quantizer is encoded using a single 8-bit VQ codebook. When eight sentence pairs in Japanese were used for performance evaluation, the objective performance of the quantizer was about 1.05 dB and 0.75 dB in spectral distortion (SD) using 21 bits and 29 bits, respectively.

Fig. 3. Schematic diagram of the NB-LSF quantizer (a 21-bit MA-predicted two-stage split VQ: predictor switch 2 bits, first-stage CB 7 bits, second stage lower/higher 6 bits each; followed by an 8-bit MA-predicted single-stage VQ of the quantization error: predictor switch 0 bits, first-stage CB 8 bits).

Table 1
Bit allocation of the WB-LSF quantizer

Layer  Parameter                                                          Bits/frame
NB     21-bit VQ: MA + 1st (dim. 12) + 2nd-l (dim. 6) + 2nd-h (dim. 6)    2 + 7 + 6 + 6
       Complementary VQ (as NB enhancement layer, dim. 12)                8
       Subtotal (1.45 kbit/s)                                             29
WB     Mode                                                               1
       1st stage codebook (dimension 18)                                  5
       2nd stage codebook (dimension 18)                                  5
       3rd stage codebook (dimension 18)                                  5
       Subtotal (0.8 kbit/s)                                              16
       Total (2.25 kbit/s)                                                45

2.3. Upsampling of NB-LSF

The quantized NB-LSF parameters are converted to linear prediction (LP) coefficients, which are transformed into autocorrelation coefficients. The autocorrelation coefficients are upsampled in a way that is equivalent to an upsampling process of the input signal in the time domain, using Eqs. (1) and (2):

R(2k) = r(k) + \sum_{m=-\infty}^{+\infty} \sum_{n=-\infty}^{+\infty} r(k - n + m)\,\rho(m)\,\rho(n)    (1)

R(2k + 1) = \sum_{m=-\infty}^{+\infty} \bigl( r(k - m) + r(k + 1 + m) \bigr)\,\rho(m)    (2)

where

\rho(x) = \mathrm{sinc}\bigl( (x + \tfrac{1}{2})\,\pi \bigr)    (3)

The value of r(i) is the ith autocorrelation coefficient converted from the NB-LSF; R(i) is the upsampled version of r(i). This upsampling calculation is equivalent to the upsampling calculation in the time domain (see Appendix A). The upsampled autocorrelation coefficients, R(i), are reconverted to LSF parameters; the upsampled versions of the NB-LSF parameters are thereby obtained.
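To make Eqs. (1)–(3) concrete, the following sketch evaluates them directly with a truncated interpolation kernel. The truncation length is an assumption made for the example; as noted in Appendix A, any finite interpolation order turns the exact relations into approximations.

```python
import numpy as np

def upsample_autocorrelation(r, wb_lags=18, taps=30):
    """1:2 upsampling of autocorrelation coefficients following Eqs. (1)-(3).

    r       : NB autocorrelation values r(0..K); symmetric extension r(-k) = r(k) is used,
              and lags outside the available range are treated as zero.
    wb_lags : highest upsampled lag to produce (R(0..wb_lags) is returned).
    taps    : truncation of the infinite sums over m and n.
    """
    def r_sym(k):
        k = abs(k)
        return r[k] if k < len(r) else 0.0

    def rho(x):
        # rho(x) = sinc((x + 1/2) * pi); np.sinc(t) = sin(pi t) / (pi t)
        return np.sinc(x + 0.5)

    m = np.arange(-taps, taps + 1)
    R = np.zeros(wb_lags + 1)
    for j in range(wb_lags + 1):
        if j % 2 == 0:
            k = j // 2   # Eq. (1)
            R[j] = r_sym(k) + sum(rho(mm) * rho(nn) * r_sym(k - nn + mm)
                                  for mm in m for nn in m)
        else:
            k = (j - 1) // 2   # Eq. (2)
            R[j] = sum((r_sym(k - mm) + r_sym(k + 1 + mm)) * rho(mm) for mm in m)
    return R  # subsequently reconverted to LP coefficients and then to the upsampled NB-LSF
```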

2.4. Switched predictive three-stage VQ

The quantizer has two prediction modes. One is designed to handle various input LSF vectors, whereas the other is specialized to quantize the LSF vectors of stationary segments. As described later, the former is a memory-less VQ mode (Mode0) and the latter is a memory-based VQ mode (Mode1). Mode selection is performed in a closed-loop manner: quantization is performed in both modes, and the mode that gives the smaller quantization error is selected. As shown in Fig. 2, the prediction is based on several predictive contributions, which fall into two categories: memory-based prediction and memory-less prediction. The former uses a kind of autoregressive (AR) prediction and is activated only in the memory-based VQ mode. The latter exploits a VQ codebook mapping technique, which is enabled in both VQ modes. The prediction residue vector is quantized using a three-stage VQ codebook. The bit allocation of the WB-LSF quantizer is shown in Table 1: five bits are assigned to each stage codebook, and one bit identifies the VQ mode. In the following two sections, the two types of predictive contribution are explained along with preliminary experimental results.

3. Predictive VQ

Because LSF vectors are generally correlated between successive frames, memory-based VQ is commonly used to quantize them, and predictive vector quantization (PVQ) is the most popular memory-based VQ. A new feature of our PVQ is that it exploits both interframe and intraframe correlations. The relationship between the NB-LSF and the WB-LSF quantized in the previous frame is used to predict the current WB-LSF from the quantized NB-LSF of the current frame. Specifically, the current WB-LSF is predicted by multiplying the currently quantized NB-LSF by the ratio of the previously quantized WB-LSF to the previously quantized NB-LSF. A first-order AR prediction is further introduced in combination with this new prediction. These two contributions are shown in Fig. 2 as the paths through the amplifiers β5 and β4.


Because our PVQ is based on AR prediction, a "safety-net PVQ" strategy (Eriksson et al., 1999) and a "forgetting" capability are introduced for resetting or attenuating an erroneous memory of the predictor.

3.1. Preliminary experiment

The following four configurations of the WB-LSF quantizer were compared.

Baseline: This configuration uses only one predictive contribution, which corresponds to the path through the amplifier β1 in Fig. 2. The ith element of the quantized WB-LSF vector at the nth frame, \hat{L}_W^{(n)}(i), is given by Eq. (4). Both β0(i) and β1(i) are predictive coefficients for the ith element, \hat{C}^{(n)}(i) is the ith element of the residue vector quantized using the three-stage VQ at the nth frame, and \hat{L}_N^{(n)}(i) is the ith element of the upsampled NB-LSF vector at the nth frame:

\hat{L}_W^{(n)}(i) = \beta_0(i)\,\hat{C}^{(n)}(i) + \beta_1(i)\,\hat{L}_N^{(n)}(i)    (4)

PVQa: In this configuration, the first-order AR predictive contribution, \beta_4(i)\,\hat{L}_W^{(n-1)}(i), is added to "Baseline," and \hat{L}_W^{(n)}(i) is given by Eq. (5):

\hat{L}_W^{(n)}(i) = \beta_0(i)\,\hat{C}^{(n)}(i) + \beta_1(i)\,\hat{L}_N^{(n)}(i) + \beta_4(i)\,\hat{L}_W^{(n-1)}(i)    (5)

PVQb: The new predictive contribution, \beta_5(i)\,\frac{\hat{L}_W^{(n-1)}(i)}{\hat{L}_N^{(n-1)}(i)}\,\hat{L}_N^{(n)}(i), is added to "PVQa." Therefore, \hat{L}_W^{(n)}(i) is given by Eq. (6):

\hat{L}_W^{(n)}(i) = \beta_0(i)\,\hat{C}^{(n)}(i) + \beta_1(i)\,\hat{L}_N^{(n)}(i) + \beta_4(i)\,\hat{L}_W^{(n-1)}(i) + \beta_5(i)\,\frac{\hat{L}_W^{(n-1)}(i)}{\hat{L}_N^{(n-1)}(i)}\,\hat{L}_N^{(n)}(i)    (6)

PVQc: This configuration is the same as PVQb, but the VQ codebook is switched using class information. A classifier performs 3-bit vector quantization of the NB-LSF vector and selects the corresponding subset out of eight sub-codebooks of the three-stage vector quantizer. Here, \hat{L}_W^{(n)}(i) is given by Eq. (7), and \hat{C}_{cl}^{(n)}(i) is the ith element of a code vector generated from the cl-th sub-codebook at the nth frame:

\hat{L}_W^{(n)}(i) = \beta_0(i)\,\hat{C}_{cl}^{(n)}(i) + \beta_1(i)\,\hat{L}_N^{(n)}(i) + \beta_4(i)\,\hat{L}_W^{(n-1)}(i) + \beta_5(i)\,\frac{\hat{L}_W^{(n-1)}(i)}{\hat{L}_N^{(n-1)}(i)}\,\hat{L}_N^{(n)}(i)    (7)

This configuration can be considered a technique based on classified VQ (Gersho and Gray, 1992a); classified VQ using a switched VQ codebook has been studied by So and Paliwal (2007). In our study, the classification is performed using the NB-LSF, which is quantized by the NB-LSF quantizer and is therefore available at the decoder side. Consequently, no additional bit needs to be transmitted to identify the class selected by the classifier.
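A per-frame reading of Eq. (6) is sketched below; the vector shapes and argument names are illustrative assumptions, and the residue vector would come from the three-stage VQ search.

```python
import numpy as np

def predict_wb_lsf_pvqb(beta0, beta1, beta4, beta5,
                        residue, nb_lsf, wb_lsf_prev, nb_lsf_prev):
    """PVQb reconstruction of Eq. (6), applied element-wise to 18-dimensional vectors.

    residue     : quantized residue vector C_hat^(n) from the three-stage VQ
    nb_lsf      : upsampled quantized NB-LSF of the current frame, L_N_hat^(n)
    wb_lsf_prev : quantized WB-LSF of the previous frame, L_W_hat^(n-1)
    nb_lsf_prev : upsampled quantized NB-LSF of the previous frame, L_N_hat^(n-1)
    """
    ratio = wb_lsf_prev / nb_lsf_prev            # previous-frame WB/NB relationship
    return (beta0 * residue                      # beta_0(i) * C_hat(i)
            + beta1 * nb_lsf                     # beta_1(i) * L_N_hat(i)
            + beta4 * wb_lsf_prev                # first-order AR term (PVQa)
            + beta5 * ratio * nb_lsf)            # new interframe/intraframe term (PVQb)
```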


3.2. Experimental results

Results for the four configurations are presented in Tables 2 and 3. The performance for clean speech was tested in error-free and frame-erasure conditions in terms of spectral distortion (SD). Eight Japanese sentence pairs (four males and four females, 64 s in total) were used as test data. The training data contained 26,600 LSP vectors, generated from 200 short sentences in Japanese; the test data were not included in the training data. As shown in the tables, the three PVQ algorithms outperformed the baseline algorithm in both error-free and frame-erasure conditions. The PVQ algorithms improved the average SD and also reduced the number of outliers exceeding 2 dB SD. In this experiment, the LSF vector of an erased frame was concealed using the LSF vector decoded in the frame preceding the erased frame. Results showed that the performance of the PVQ algorithms was not worse than that of the baseline algorithm up to a 20% frame erasure rate.

Fig. 4 shows an example of the obtained coefficients for Mode1 of PVQa. The horizontal axis shows the order of LSF; the vertical axis presents the value of the predictor coefficient for each order of LSF. The predictor coefficient can be seen as a weight for each term of the predictor given by Eq. (5). A large value of the predictor coefficient indicates a strong weight for the corresponding term; in other words, such a term is inferred to be more important or useful for PVQ than the other terms. As shown in this figure, a previously quantized wideband LSF is important for predicting the higher-order LSFs, although it becomes less useful for the lower-order LSF parameters. Because 0.4 is set as a minimum value for β1 to maintain a "forgetting" capability, β1 in the high band is saturated at 0.4. For that reason, β4 (or β4 + β5 in the case of Fig. 5) is clipped at 0.6 as its maximum value.

Table 2
Comparison of SD performance (error-free)

Configuration   Average SD (dB)   2 dB ≤ SD < 4 dB (%)   4 dB ≤ SD (%)
Baseline        1.62              20.44                  0.50
PVQa            1.47              14.66                  0.12
PVQb            1.44              13.84                  0.22
PVQc            1.42              13.09                  0.16

Table 3
Comparison of SD performance (FER = 10%)

Configuration   Average SD (dB)   2 dB ≤ SD < 4 dB (%)   4 dB ≤ SD (%)
Baseline        1.81              23.38                  3.40
PVQa            1.74              20.44                  3.47
PVQb            1.71              19.81                  3.38
PVQc            1.70              19.19                  3.41


Fig. 4. Example of predictive coefficients (β0, β1 and β4 for Mode1 of PVQa, plotted against the order of LSF).

Fig. 5. Example of predictive coefficients (β0, β1, β4 and β5 for Mode1 of PVQb, plotted against the order of LSF).

Fig. 5 shows an example of the coefficients for Mode1 of PVQb. This figure suggests that the third prediction component, β5, is useful for the prediction of lower-order LSF parameters because the values of β5 for the lower-order LSFs are higher than those for the higher-order LSFs. Because the values of β4 for the higher-order LSFs are higher than those for the lower-order LSFs, combining β5 with β4, the memory-based components, was inferred to be beneficial for prediction over the full bandwidth. In this sense, the proposed AR-based prediction (PVQb) appears to be capable of exploiting interframe and intraframe correlations efficiently. The classified VQ technique of PVQc improves the performance of PVQb at the cost of memory requirements and computational load. The PVQc scheme is one means of exploiting intraframe correlation and realizing efficient quantization. However, mapping-based prediction is more effective than the classified VQ, as shown in the following sections.

4. VQ codebook mapping

This section gives a brief description of our codebook mapping algorithm. One-to-one mapping is adopted as the simplest mapping scheme. The upsampled NB-LSF is vector quantized using an NB-LSF codebook for mapping, and each codevector in the NB-LSF codebook is associated with a codevector in a WB-LSF codebook. The estimated WB-LSF vector, \tilde{L}_W, can then be expressed as Eq. (8).

[Mapping0]

\tilde{L}_W = L_{CBw}^{(I_{CBn})}    (8)

Therein, L_{CBw}^{(idx)} is the WB-LSF codevector whose index is idx, and I_{CBn} is the index of the codevector selected from the NB-LSF codebook in the vector quantization performed for the mapping. With this mapping scheme, however, the number of possible WB-LSF vectors is limited to the size of the mapping codebook. To increase the number of possible WB-LSF vectors, we considered using the upsampled NB-LSF vector, L_N, and the codevector selected from the NB-LSF codebook, L_{CBn}^{(I_{CBn})}. We compared the following two mapping schemes. Mapping1 simply interpolates between the upsampled NB-LSF and the mapped WB-LSF, whereas Mapping2 additionally uses information about the distance between the upsampled NB-LSF and the selected NB-LSF codevector to estimate the WB-LSF.

[Mapping1]

\tilde{L}_W = \beta_1 L_N + \beta_2 L_{CBw}^{(I_{CBn})}    (9)

[Mapping2]

\tilde{L}_W = \beta_1 L_N + \beta_2 L_{CBw}^{(I_{CBn})} + \beta_3 L_{CBn}^{(I_{CBn})}    (10)

Therein, β1, β2 and β3 are representative predictive coefficients. They are obtainable through off-line training that minimizes the total estimation error, D_{map}, over a training database (n is a frame number):

D_{map} = \sum_n \left\| L_W^{(n)} - \tilde{L}_W^{(n)} \right\|^2    (11)

Our experimental results showed that the Mapping2 scheme gave slightly lower D_{map} than Mapping1, as shown in the following subsection.

4.1. Preliminary experiment

The total estimation error, D_{map}, was calculated for each of the mapping-based predictors, Mapping0, Mapping1 and Mapping2, using a 794 s (39,700 frame) training database, which included seven languages, several background noise conditions, and music samples. A 7-bit codebook was used for the mapping CB in this test. The results are compared in Fig. 6.

Fig. 6. Comparison between three mapping-based predictors (7-bit mapping): estimation error D_map for Mapping0, Mapping1 and Mapping2 versus the order of LSF.

Fig. 6 shows that Mapping2 gives the lowest estimation error among the three mapping-based predictors. The main improvement is found in the lower orders, whereas no clear difference is apparent in the higher orders. Examples of prediction coefficients for the Mapping1 and Mapping2 predictors are shown in Figs. 7 and 8, respectively. These figures suggest that the contribution of L_{CBw} is dominant in the estimated WB-LSF vector \tilde{L}_W for the higher orders. Therefore, the three mapping-based predictors yield similar performance for those orders.

Fig. 7. An example of prediction coefficients for Mapping1 (β1 and β2 versus the order of LSF).

Fig. 8. An example of prediction coefficients for Mapping2 (β1, β2 and β3 versus the order of LSF).

4.2. New PVQ algorithm

We introduced the Mapping2 predictor into the PVQb scheme of Section 3. The quantized LSF parameters \hat{L}_W(i), i = 1, ..., 18, are given by Eq. (12):

\hat{L}_W^{(n)}(i) = \beta_0(i)\,\hat{C}^{(n)}(i) + \beta_1(i)\,\hat{L}_N^{(n)}(i) + \beta_2(i)\,L_{CBw}^{(n)}(i) + \beta_3(i)\,L_{CBn}^{(n)}(i) + \beta_4(i)\,\hat{L}_W^{(n-1)}(i) + \beta_5(i)\,\frac{\hat{L}_W^{(n-1)}(i)}{\hat{L}_N^{(n-1)}(i)}\,\hat{L}_N^{(n)}(i)    (12)

This is also known as nonlinear vector prediction (Gersho and Gray, 1992b). The NB-LSF and WB-LSF codebooks each have 128 (7-bit) code vectors.

In Eq. (12), the following representations are used: \hat{L}_W^{(n)}(i) is the ith quantized WB-LSF parameter at the nth frame; \hat{L}_N^{(n)}(i) is the ith upsampled NB-LSF parameter at the nth frame; L_{CBw}^{(n)}(i) is the ith element of the WB-LSF vector selected in the codebook mapping at the nth frame; L_{CBn}^{(n)}(i) is the ith element of the NB-LSF vector selected in the codebook mapping at the nth frame; and \hat{C}^{(n)}(i) is the ith element of the error vector quantized using the three-stage VQ at the nth frame. In addition, β0(i) to β5(i) are predictive coefficients for the ith element. In this formulation, Mapping1 is the special case in which β3 is set to the zero vector. The last two terms of Eq. (12) include previously quantized LSF; for that reason, the VQ given by Eq. (12) is a memory-based VQ unless β4(i) = β5(i) = 0. To improve robustness against channel errors, we adopt a two-mode predictive VQ in which a memory-less VQ is used as one of the two modes. That is, as shown in Fig. 2, the coefficient table consists of two sets of coefficients; one of the sets is for the memory-less VQ mode (Mode0) and has β4(i) = β5(i) = 0. The three-stage VQ codebook also contains two subsets: one for Mode0 (memory-less) and the other for Mode1 (memory-based, for stationary segments). It is noteworthy that Mode1 also has the memory-less contributions β1, β2 and β3. Therefore, Mode1 is a memory-based VQ having a "forgetting" capability.

5. Training procedures

Training procedures are presented in this section. The algorithms for training (1) the codebook for mapping, (2) the set of predictor coefficients, and (3) the three-stage VQ codebook are outlined in the following subsections. The database utilized for this training comprised a total of 794 s (39,700 frames) of data, including seven languages, several background noise conditions, and music samples.

5.1. Codebook for mapping

The design procedure of the mapping codebook is straightforward and can be summarized as follows (a code sketch is given after the list):

Step 1: Prepare a training data set of vector pairs of NB-LSF and WB-LSF.
Step 2: Create a codebook of NB-LSF using the LBG algorithm with the NB-LSF training data.
Step 3: Perform VQ on the NB-LSF training data using the created NB-LSF codebook and collect the paired WB-LSF data for each NB-LSF code space (cluster).
Step 4: Calculate the average of the collected WB-LSF vectors for each cluster.
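The sketch below follows Steps 1–4 with a plain k-means loop standing in for the LBG algorithm; the matrix shapes, iteration count, and initialization are assumptions made for the example.

```python
import numpy as np

def train_mapping_codebooks(nb_lsf, wb_lsf, bits=7, iters=20, seed=0):
    """Design of the mapping codebooks (Section 5.1, Steps 1-4).

    nb_lsf : (frames, d_nb) matrix of upsampled NB-LSF training vectors (Step 1)
    wb_lsf : (frames, 18) matrix of the paired WB-LSF training vectors (Step 1)
    """
    rng = np.random.default_rng(seed)
    size = 1 << bits                                       # 128 codevectors for 7 bits

    def nearest(data, cb):
        d = (data ** 2).sum(1)[:, None] - 2.0 * data @ cb.T + (cb ** 2).sum(1)[None, :]
        return d.argmin(axis=1)

    # Step 2: NB-LSF codebook (k-means as an LBG stand-in).
    cb_nb = nb_lsf[rng.choice(len(nb_lsf), size, replace=False)].copy()
    for _ in range(iters):
        idx = nearest(nb_lsf, cb_nb)
        for j in range(size):
            if np.any(idx == j):
                cb_nb[j] = nb_lsf[idx == j].mean(axis=0)

    # Steps 3-4: quantize the NB-LSF data and average the paired WB-LSF of each cluster.
    idx = nearest(nb_lsf, cb_nb)
    cb_wb = np.vstack([wb_lsf[idx == j].mean(axis=0) if np.any(idx == j)
                       else np.zeros(wb_lsf.shape[1]) for j in range(size)])
    return cb_nb, cb_wb   # Mapping0 then estimates WB-LSF as cb_wb[nearest(nb_vector, cb_nb)]
```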


5.2. Predictor coefficients

5.2.1. Initial coefficients

As shown in Eq. (12), the predicted WB-LSF, \tilde{L}_W^{(n)}(i), is expressed as follows:

\tilde{L}_W^{(n)}(i) = \beta_1(i)\,\hat{L}_N^{(n)}(i) + \beta_2(i)\,L_{CBw}^{(n)}(i) + \beta_3(i)\,L_{CBn}^{(n)}(i) + \beta_4(i)\,\hat{L}_W^{(n-1)}(i) + \beta_5(i)\,\frac{\hat{L}_W^{(n-1)}(i)}{\hat{L}_N^{(n-1)}(i)}\,\hat{L}_N^{(n)}(i)    (13)

Then the total weighted prediction error, E, is given as

E = \sum_n E_n = \sum_n \sum_i \left( w^{(n)}(i) \left[ L_W^{(n)}(i) - \tilde{L}_W^{(n)}(i) \right] \right)^2    (14)

where w^{(n)}(i) is the weighting coefficient for the ith LSF at the nth frame and is given as w^{(n)}(i) = c_1(i)/l^2(i) + c_2(i)/l(i) + c_3(i), with l(i) = L_W^{(n)}(i+1) - L_W^{(n)}(i-1), and c_1, c_2 and c_3 are constants. By solving the simultaneous equations \partial E/\partial\beta_1(i) = \partial E/\partial\beta_2(i) = \partial E/\partial\beta_3(i) = \partial E/\partial\beta_4(i) = \partial E/\partial\beta_5(i) = 0, the initial set of predictor coefficients, {β1(i), β2(i), β3(i), β4(i), β5(i)}, is obtained. The initial value of β0 is set to 1.0.
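Minimizing E decouples across the LSF index i, so each set {β1(i), ..., β5(i)} can be found by a small weighted least-squares solve. The sketch below assumes the five regressor terms of Eq. (13) have been collected per frame, and it takes the per-frame weight on the squared error as given, whatever convention Eq. (14) uses.

```python
import numpy as np

def fit_initial_coefficients(targets, regressors, weights):
    """Initial predictor coefficients for one LSF element i (Section 5.2.1).

    targets    : (frames,)   values L_W^(n)(i)
    regressors : (frames, 5) columns holding the five Eq. (13) terms with unit coefficients:
                 L_N_hat, L_CBw, L_CBn, L_W_prev and (L_W_prev / L_N_prev) * L_N_hat
    weights    : (frames,)   effective weight applied to each frame's squared error
    Setting the partial derivatives of E to zero gives the weighted normal equations.
    """
    A = regressors.T @ (regressors * weights[:, None])
    b = regressors.T @ (weights * targets)
    return np.linalg.solve(A, b)   # [beta1..beta5] for element i; beta0 is initialized to 1.0
```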

5.2.2. Update of predictor coefficients

The total quantizing distortion, D, is expressed as

D = \sum_n D_n = \sum_n \sum_i \left( w^{(n)}(i) \left[ L_W^{(n)}(i) - \hat{L}_W^{(n)}(i) \right] \right)^2    (15)

in which \hat{L}_W^{(n)}(i) is given by Eq. (12). By solving the simultaneous equations \partial D/\partial\beta_0(i) = \partial D/\partial\beta_1(i) = \partial D/\partial\beta_2(i) = \partial D/\partial\beta_3(i) = \partial D/\partial\beta_4(i) = \partial D/\partial\beta_5(i) = 0, the updated set of predictor coefficients, {β0(i), β1(i), β2(i), β3(i), β4(i), β5(i)}, is obtained.

5.3. Three-stage VQ codebook

5.3.1. Initial codebook

Prediction residual vectors, L_R^{(n)}(i), are calculated using unquantized WB-LSF vectors as follows (note that the unquantized WB-LSF, L_W(i), is used instead of the quantized WB-LSF, \hat{L}_W(i)):

L_R^{(n)}(i) = L_W^{(n)}(i) - \beta_1(i)\,\hat{L}_N^{(n)}(i) - \beta_2(i)\,L_{CBw}^{(n)}(i) - \beta_3(i)\,L_{CBn}^{(n)}(i) - \beta_4(i)\,L_W^{(n-1)}(i) - \beta_5(i)\,\frac{L_W^{(n-1)}(i)}{\hat{L}_N^{(n-1)}(i)}\,\hat{L}_N^{(n)}(i)    (16)

The initial first-stage codebook is obtained using the LBG algorithm (Linde et al., 1980) with the training data set {L_R^{(n)}(i)}. For each cluster of the first-stage codebook, the residues L_{R2}^{(n)}(i) = L_R^{(n)}(i) - \hat{L}_R^{(n)}(i) are calculated, where \hat{L}_R^{(n)}(i) is the centroid of the cluster to which L_R^{(n)}(i) belongs, and the initial second-stage codebook is obtained using the LBG algorithm with the training data set {L_{R2}^{(n)}(i)}. Similarly, the residues L_{R3}^{(n)}(i) = L_{R2}^{(n)}(i) - \hat{L}_{R2}^{(n)}(i) are calculated, where \hat{L}_{R2}^{(n)}(i) is the centroid of the cluster to which L_{R2}^{(n)}(i) belongs, and the initial third-stage codebook is obtained using the LBG algorithm with the training data set {L_{R3}^{(n)}(i)}.
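A sketch of that stage-by-stage construction is given below, again with a k-means loop standing in for LBG; the codebook sizes and iteration count are example values.

```python
import numpy as np

def train_initial_three_stage(residuals, bits=(5, 5, 5), iters=20, seed=0):
    """Initial three-stage codebooks (Section 5.3.1): train a codebook on the current
    residues, subtract the nearest codevector, and train the next stage on the remainder.
    """
    rng = np.random.default_rng(seed)

    def nearest(data, cb):
        d = (data ** 2).sum(1)[:, None] - 2.0 * data @ cb.T + (cb ** 2).sum(1)[None, :]
        return d.argmin(axis=1)

    data = residuals.copy()
    codebooks = []
    for b in bits:                                    # three 5-bit stages
        size = 1 << b
        cb = data[rng.choice(len(data), size, replace=False)].copy()
        for _ in range(iters):
            idx = nearest(data, cb)
            for j in range(size):
                if np.any(idx == j):
                    cb[j] = data[idx == j].mean(axis=0)
        codebooks.append(cb)
        data = data - cb[nearest(data, cb)]           # residues for the next stage
    return codebooks
```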

5.3.2. Update of codebook

The codevector generated from the three-stage codebook at the nth frame is written as Eq. (17). The column vector C_j represents the jth-stage sub-codebook; it is created by concatenating all codevectors of the jth stage. The matrix S_j^{(n)}, which contains only elements one or zero, selects the codevector for the nth frame from the jth-stage sub-codebook:

\hat{C}^{(n)} = S^{(n)} \cdot C = \left[ S_1^{(n)} \; S_2^{(n)} \; S_3^{(n)} \right] \begin{bmatrix} C_1 \\ C_2 \\ C_3 \end{bmatrix}    (17)

Then the total quantizing distortion, D, is given as

D = \sum_n D_n = \sum_n w^{(n)} \left\| L_W^{(n)} - \hat{L}_W^{(n)} \right\|^2    (18)

where w^{(n)} is the diagonal matrix whose elements are w^{(n)}(1), ..., w^{(n)}(17) and w^{(n)}(18), and L_W^{(n)} and \hat{L}_W^{(n)} respectively represent the unquantized and the quantized WB-LSF vectors at the nth frame. The updated codebook is obtained by minimizing Eq. (18) using a projection method (LeBlanc et al., 1993).

6. Performance evaluation

This section presents the results of an objective evaluation test. The performance of the WB-LSF quantization is evaluated in terms of spectral distortion (SD), given as Eq. (19):

SD = \sqrt{ \frac{1}{n_h - n_l} \sum_{n=n_l}^{n_h} \left( 10 \log_{10} \frac{ | \hat{A}(e^{j2\pi n/N}) |^2 }{ | A(e^{j2\pi n/N}) |^2 } \right)^2 }    (19)

In that equation, |1/A(e^{j2\pi n/N})|^2 and |1/\hat{A}(e^{j2\pi n/N})|^2, respectively, denote the unquantized and the quantized LPC power spectra. A 512-point FFT (N = 512) was used to compute A(e^{j2\pi n/N}) and \hat{A}(e^{j2\pi n/N}). Also, n_l = 2 and n_h = 224 are used so that the SD is calculated within the "wideband" frequency range of 50–7000 Hz.

Table 4 shows the configurations of the LSF quantizers tested. VQ1 and VQ3 correspond to Baseline and PVQb of Section 3, respectively. Eight Japanese sentence pairs (four females and four males, 64 s in total) were used as the test material. Performance in frame-erasure conditions, FER = 2%, 5%, 10% and 20%, was simulated as well as in the error-free condition. In this experiment, frame erasures were inserted periodically. To remove error propagation arising from the MA-predictive NB-LSF quantizer, actual errors were inserted in the WB-LSF extension layer only.
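For reference, Eq. (19) can be evaluated per frame as in the sketch below; the FFT helper and bin range are illustrative and follow the N = 512, n_l = 2, n_h = 224 settings quoted above.

```python
import numpy as np

def spectral_distortion(a_ref, a_quant, n_fft=512, n_lo=2, n_hi=224):
    """Spectral distortion of Eq. (19) for one frame, in dB.

    a_ref, a_quant : LP coefficient vectors [1, a1, ..., ap] of the unquantized and
                     quantized models; bins 2..224 of a 512-point FFT cover roughly
                     50-7000 Hz at a 16 kHz sampling rate.
    """
    A_ref = np.fft.rfft(a_ref, n_fft)
    A_quant = np.fft.rfft(a_quant, n_fft)
    bins = np.arange(n_lo, n_hi)
    # 10*log10(|1/A|^2 / |1/A_hat|^2) = 20*log10(|A_hat| / |A|)
    diff_db = 20.0 * np.log10(np.abs(A_quant[bins]) / np.abs(A_ref[bins]))
    return np.sqrt(np.mean(diff_db ** 2))
```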

Table 4
Tested configurations

Configuration   Mapping-based pred.        Memory-based pred.
VQ1             Not used (β2 = β3 = 0)     Not used (β4 = β5 = 0)
VQ2             Mapping1 (β3 = 0)          Not used (β4 = β5 = 0)
VQ3             Not used (β2 = β3 = 0)     Used
VQ4             Mapping1 (β3 = 0)          Used
VQ5             Mapping2                   Used

Table 5
Test results for eight Japanese sentence pairs in error-free conditions

Condition                             Configuration   Average SD (dB)   2 dB ≤ SD < 4 dB (%)   4 dB ≤ SD (%)
Clean speech                          VQ1             1.60              19.19                  0.25
                                      VQ2             1.46              13.97                  0.03
                                      VQ3             1.44              13.97                  0.09
                                      VQ4             1.35              9.63                   0.03
                                      VQ5             1.29              7.84                   0.03
Speech with car noise (SNR = 15 dB)   VQ1             1.45              11.06                  0.06
                                      VQ2              1.33              7.31                   0
                                      VQ3              1.24              6.19                   0
                                      VQ4              1.15              5.03                   0
                                      VQ5              1.13              4.44                   0

However, NB-LSF parameters decoded at erased frames were not used for concealing erased WB-LSF parameters; a previously decoded WB-LSF parameter was simply used to replace a lost WB-LSF parameter. Test results are presented in Fig. 9.

Fig. 9. Comparison of SD performance (average spectral distortion in dB versus frame erasure rate in % for VQ1–VQ5).

Observations about the results are summarized as follows:

(1) Introducing the "Mapping1"-based prediction improves the SD by 0.14 dB in both the error-free and the FER conditions (comparison of VQ1 and VQ2).
(2) The proposed memory-based prediction, PVQb described in Section 3, also improves the SD performance, by 0.16 dB in the error-free condition. However, the improvement lessens as the FER increases and vanishes at an FER of 20% (comparison of VQ1 and VQ3).
(3) The combination of the memory-based and mapping-based predictions yields the best SD performance; the SD improvement over VQ1 is 0.25 dB in the error-free condition (comparison among VQ1, VQ2, VQ3 and VQ4).
(4) Mapping2 provides better performance than Mapping1 by around 0.05 dB in SD (comparison between VQ4 and VQ5).

Another set of eight Japanese sentence pairs, corrupted with car noise at SNR = 15 dB, was also used as test data. The test results are shown in Table 5 together with those for the clean speech condition. Observations for the car-noise condition are summarized as follows:

(1) The memory-based prediction achieves lower SD and fewer outliers exceeding 2 dB SD than the mapping-based prediction does (comparison between VQ2 and VQ3).
(2) The absolute SD values are smaller and the outliers are fewer than in the clean speech condition.

Regarding the results for the car-noise condition, the SD values are lower than those of the clean speech condition. Our observations indicate that this is because most segments have large energy in their low-frequency bands, which have already been quantized using the 29-bit NB-LSF quantizer. In the clean speech condition, by contrast, large SD values are typically found in unvoiced segments, in which the high-frequency bands have large energy; those high-band components are quantized only by the 16-bit WB-LSF extension layer. Such unvoiced characteristics can be obscured by car noise. Consequently, the segments where large SD values are found in the clean speech condition almost disappear in the car-noise condition, and the average SD value and the number of outliers decrease. Because of the stationary nature of car noise, the memory-based prediction is more effective in the car-noise condition, and the effectiveness of the mapping-based prediction is reduced.

The prediction coefficients for Mode1 of the VQ5 quantizer are shown in Fig. 10. In that figure, β0 to β3 are coefficients for the memory-less contributions, and β4 and β5 are weighting coefficients for the memory-based contributions. In addition, β1 to β3 correspond to those in Fig. 8, and β4 and β5 correspond to those in Fig. 5.

Fig. 10. Coefficients β0–β5 for Mode1 of VQ5 (six panels, one per coefficient, each plotting the value of the coefficient against the order of LSF).

Given that β4 is large for the higher-order LSFs and β5 is large for the lower-order LSFs, an LSF parameter decoded in the previous frame is most useful for predicting the higher part of the LSF vector, whereas the last memory-based term contributes mainly to estimation of its lower part. Our results demonstrate that a one-to-one mapping model is useful for bandwidth-scalable LSF quantization. The one-to-many mapping model suggested in (Agiomyrgiannakis and Stylianou, 2004) might bring further improvement at the cost of increased complexity.

7. Complexity

In this section, the differences in complexity between the predictive schemes are discussed. Computational complexity was estimated using weighted million operations per second (wMOPS).

The weight for each operation basically follows the ITU-T guideline (ITU-T, 2005a). It is noteworthy, however, that the simulation program is implemented in floating-point arithmetic. Because we focused on the comparison between prediction schemes, we allowed the three-stage VQ to take a rather high computational load of about 10.76 wMOPS to obtain an encoding performance close to that of a full-search VQ. Furthermore, about 1.24 wMOPS are needed for the upsampling of the autocorrelation coefficients described in Appendix A. Including other operations such as the calculation of interpolated LPC, the baseline parts, which are common to the five VQ configurations listed in Table 4, consume around 12.26 wMOPS. The three-stage VQ codebook consists of two three-stage codebooks for the two modes; each stage is a 5-bit codebook of dimension 18, so 3456 words are required for the three-stage VQ codebook. Including the prediction coefficients β0 and β1 for two modes and 18 orders, 3528 words are required for the baseline parts.

The computational complexity and memory requirements of the five configurations are compared in Table 6. The complexity of the baseline parts and the additional complexity introduced by each predictive scheme are presented separately; the table ROM is presented in the same manner. Most of the additional complexity comes from the codebook mapping between NB-LSF and WB-LSF: around 0.26 wMOPS and 4608 words are required for this mapping VQ.

Table 6
Comparison of computational complexity and memory requirements

Configuration   Computational complexity   Table-ROM
VQ1             (12.26 + 0.00) wMOPS       (3528 + 0) words
VQ2             (12.26 + 0.26) wMOPS       (3528 + 4640) words
VQ3             (12.26 + 0.04) wMOPS       (3528 + 64) words
VQ4             (12.26 + 0.30) wMOPS       (3528 + 4716) words
VQ5             (12.26 + 0.30) wMOPS       (3528 + 4748) words

8. Conclusion

This paper has presented a newly developed predictor that estimates wideband LSF parameters using several predictive contributions, and its application to a wideband LSF quantizer that works in combination with a narrowband LSF quantizer. The wideband LSF quantizer works as the bandwidth extension layer of a narrowband/wideband scalable LSF quantizer. The predictor exploits both memory-less and memory-based contributions. One feature of the memory-based contribution is the exploitation of the correlation between the wideband and narrowband LSFs quantized in the previous frame for estimating the wideband LSF in the current frame; it was found to be particularly effective for the lower-band LSFs. The memory-less contribution is based on the use of a codebook mapping technique. Both types of predictors improved the objective performance of the WB-LSF quantizer, which achieved 1.29 dB in spectral distortion using 16 bits for the bandwidth extension layer when 29 bits were assigned to the NB-LSF quantizer, whose SD performance was about 0.75 dB. The paper also described the design procedure of the wideband LSF quantizer, i.e., the algorithm for codebook training and the optimization of the predictor.

Appendix A. Upsampling of autocorrelation coefficients

In this appendix, the derivation of Eqs. (1) and (2) is given. A 1:2 upsampling can be expressed as Eqs. (A.1) and (A.2) using an interpolation formula based on the sinc function:

u(2i) = \sum_{n=-\infty}^{+\infty} x(i - n)\,\mathrm{sinc}(n\pi) = x(i)    (A.1)

u(2i + 1) = \sum_{n=-\infty}^{+\infty} x(i - n)\,\mathrm{sinc}\bigl( (n + \tfrac{1}{2})\,\pi \bigr)    (A.2)

Here, u(2i) and u(2i + 1) are the upsampled versions of x(i); x(i) corresponds to an 8 kHz sampled NB signal. The autocorrelation coefficients, R(j), of the upsampled signal, u(l), are calculated using Eq. (A.3):

R(j) = \sum_{l=-\infty}^{+\infty} u(l)\,u(l + j) = \sum_{i=-\infty}^{+\infty} u(2i)\,u(2i + j) + \sum_{i=-\infty}^{+\infty} u(2i + 1)\,u(2i + 1 + j)    (A.3)

Using Eqs. (A.1), (A.2), and the autocorrelation function of x(i), r(k) = \sum_{l=-\infty}^{+\infty} x(l)\,x(l + k), Eq. (A.3) can be rewritten as Eqs. (A.4) and (A.5). When j = 2k (k is an integer),

R(2k) = r(k) + \sum_{m=-\infty}^{+\infty} \sum_{n=-\infty}^{+\infty} r(k - n + m)\,\mathrm{sinc}\bigl( (m + \tfrac{1}{2})\,\pi \bigr)\,\mathrm{sinc}\bigl( (n + \tfrac{1}{2})\,\pi \bigr)    (A.4)

Otherwise (when j = 2k + 1),

R(2k + 1) = \sum_{m=-\infty}^{+\infty} \bigl( r(k - m) + r(k + 1 + m) \bigr)\,\mathrm{sinc}\bigl( (m + \tfrac{1}{2})\,\pi \bigr)    (A.5)
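The equivalence can be checked numerically: upsample a short test signal in the time domain with a truncated sinc kernel and compare its autocorrelation with the values produced by Eqs. (A.4) and (A.5). The truncation length and test setup below are assumptions made for the example, so the two results agree only up to the kernel truncation and edge effects.

```python
import numpy as np

def compare_paths(x, max_lag=8, taps=40):
    """Time-domain upsampling + autocorrelation versus Eqs. (A.4)/(A.5)."""
    n = np.arange(-taps, taps + 1)
    # Time-domain path: u(2i) = x(i), u(2i+1) by truncated sinc interpolation (A.1)/(A.2).
    u = np.zeros(2 * len(x))
    u[0::2] = x
    for i in range(len(x)):
        idx = i - n
        ok = (idx >= 0) & (idx < len(x))
        u[2 * i + 1] = np.sum(x[idx[ok]] * np.sinc(n[ok] + 0.5))
    R_time = np.array([np.dot(u[:len(u) - j], u[j:]) for j in range(max_lag + 1)])

    # Autocorrelation-domain path: r(k) of x, then Eqs. (A.4)/(A.5).
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(len(x))])
    r_sym = lambda k: r[abs(k)] if abs(k) < len(r) else 0.0
    rho = lambda m: np.sinc(m + 0.5)
    m = np.arange(-taps, taps + 1)
    R_corr = np.zeros(max_lag + 1)
    for j in range(max_lag + 1):
        if j % 2 == 0:
            k = j // 2
            R_corr[j] = r_sym(k) + sum(rho(a) * rho(b) * r_sym(k - b + a) for a in m for b in m)
        else:
            k = (j - 1) // 2
            R_corr[j] = sum((r_sym(k - a) + r_sym(k + 1 + a)) * rho(a) for a in m)
    return R_time, R_corr   # approximately equal; differences reflect truncation and edges
```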

In practice, however, interpolation is performed with a finite order, so Eqs. (A.4) and (A.5) are not strictly exact and become approximations.

References

Agiomyrgiannakis, Y., Stylianou, Y., 2004. Combined estimation/coding of highband spectral envelopes for speech spectrum expansion. In: Proc. IEEE ICASSP-2004, pp. I-469–I-472.
Ehara, H., Morii, T., Oshikiri, M., Yoshida, K., 2005a. Predictive VQ for bandwidth scalable LSP quantization. In: Proc. IEEE ICASSP-2005, pp. I-137–I-140.
Ehara, H., Morii, T., Oshikiri, M., Yoshida, K., Honma, K., 2005b. Design of bandwidth scalable LSF quantization using interframe and intraframe prediction. In: Proc. ISCA INTERSPEECH-2005, pp. 1493–1496.
Eriksson, T., Lindén, J., Skoglund, J., 1999. Interframe LSF quantization for noisy channels. IEEE Trans. Speech Audio Process. 7 (5), 495–509.
Gersho, A., Gray, R.M., 1992a. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Chapter 12.5, pp. 423–424.
Gersho, A., Gray, R.M., 1992b. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Chapter 13.5, pp. 506–509.
Hiwasaki, Y., Mano, K., Yasunaga, K., Morii, T., Ehara, H., Kaneko, T., 2004. Design of a robust LSP quantizer for a high-quality 4-kbit/s CELP speech coder. IEICE Trans. Inf. Systems E87-D (6), 1496–1506.
ITU-T, 1998. Transmission characteristics for wideband (150–7000 Hz) digital hands-free telephony terminals. ITU-T Recommendation P.341, ITU-T.
ITU-T, 2005a. ITU-T Software Tool Library 2005 User's Manual. ITU-T, Chapter 12, pp. 161–185.
ITU-T, 2005b. Software tools for speech and audio coding standardization. ITU-T Recommendation G.191, ITU-T.
ITU-T, 2006. G.729 based embedded variable bit-rate coder: An 8–32 kbit/s scalable wideband coder bitstream interoperable with G.729. ITU-T Recommendation G.729.1, ITU-T.
Koishida, K., Lindén, J., Cuperman, V., Gersho, A., 2000. Enhancing MPEG-4 CELP by jointly optimized inter/intra-frame LSP predictors. In: Proc. IEEE Workshop on Speech Coding, pp. 90–92.
LeBlanc, W.P., Bhattacharya, B., Mahmoud, S.A., Cuperman, V., 1993. Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding. IEEE Trans. Speech Audio Process. 1 (4), 373–385.
Linde, Y., Buzo, A., Gray, R.M., 1980. An algorithm for vector quantizer design. IEEE Trans. Comm. COM-26 (1), 84–95.
Nomura, T., Iwadare, M., Serizawa, M., Ozawa, K., 1998. A bitrate and bandwidth scalable CELP coder. In: Proc. IEEE ICASSP-98, pp. 341–344.
Salami, R., Laflamme, C., Adoul, J.-P., Kataoka, A., Hayashi, S., Moriya, T., Lamblin, C., Massaloux, D., Proust, S., Kroon, P., Shoham, Y., 1998. Design and description of CS-ACELP: a toll quality 8 kb/s speech coder. IEEE Trans. Speech Audio Process. 6 (2), 116–130.
So, S., Paliwal, K.K., 2007. Efficient product code vector quantisation using the switched split vector quantiser. Digital Signal Process. 17, 138–171.
Thyssen, J., Gao, Y., Benyassine, A., Shlomot, E., Murgia, C., Su, H., Mano, K., Hiwasaki, Y., Ehara, H., Yasunaga, K., Lamblin, C., Kovesi, B., Stegmann, J., Kang, H.-G., 2001. A candidate for the ITU-T 4 kbit/s speech coding standard. In: Proc. IEEE ICASSP-2001, pp. 681–684.