A flexible bit rate switching method for low bit rate vocoders


Signal Processing 81 (2001) 1737–1742

www.elsevier.com/locate/sigpro

Osman Eroğul (a), Hakki Gökhan İlk (b, ∗), Özlem İlk (c)

(a) Gülhane Military Medical Academy, Biomedical and Clinical Engineering Centre, Ankara, Turkey
(b) Faculty of Sciences, Department of Electronics Engineering, Ankara University, Beşevler 06100, Ankara, Turkey
(c) Department of Statistics, Middle East Technical University, 06530, Ankara, Turkey

Received 4 September 2000; received in revised form 11 April 2001

Abstract

Robust low bit rate speech coders are essential in commercial and military communication systems. They operate at fixed bit rates, which cannot be altered without major modifications to the vocoder design. In this paper we introduce a novel approach to vocoders by coding the time-scale modified input speech signal. The proposed method offers any bit rate from 2400 to 720 bits/s without modifying the vocoder structure. Simulation results, which mainly concentrate on intelligibility, talker recognisability and voice quality versus codec complexity and delay, are also presented. The proposed scaled speech coder delivers communication quality speech at half the bit rate of the new US Federal Standard, the mixed excitation linear prediction vocoder. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Low bit rate vocoders; Time-scale modification of speech; Scaled speech coder; Mixed excitation linear prediction

1. Introduction

In many applications it is desirable to transform a speech waveform into a signal which is more useful than the original. For example, in time-scale modification, speech can be sped up in order to compress the words spoken into an allocated time interval or to quickly scan a passage. As an application, a full-duplex link can be achieved over a single-channel radio system if both ends of the link operate in what is known as time division duplex mode. In this mode, the channel is allocated half of the time to transmit information in each of the two directions. That is, the radio channel is divided into time slots of T/2 seconds, with each end transmitting in alternate time slots. In this way a single radio channel can support information flow in both directions, resulting in a virtual analog full-duplex link [8].

Speech coding at low bit rates, on the other hand, has been the focal point of considerable research activity over the last two decades. In this paper a novel approach to vocoders is proposed in order to reduce the bit rate required to transmit the speech signal. The proposed method is particularly useful for existing low bit rate speech coding algorithms such as the mixed excitation linear prediction (MELP) vocoder [4,9], because time-scale modification is performed as a pre-process at the transmitter and as a post-process at the receiver.

∗ Corresponding author. Tel.: +90-312-2126720/1160; fax: +90-312-223-2395. E-mail address: [email protected] (H.G. İlk).

0165-1684/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0165-1684(01)00072-X


O. Eroğul et al. / Signal Processing 81 (2001) 1737–1742

One important time-domain modification algorithm is the waveform similarity overlap-and-add (WSOLA) method, which maintains the signal continuity that exists in the original speech signal [10]. The WSOLA algorithm preserves qualities such as naturalness and intelligibility, and speaker-dependent features such as pitch and formant structure, after time-scale modification.
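To make the WSOLA idea concrete, the following is a heavily simplified sketch in pure Python. It is not the implementation of Ref. [10], nor the one used in this paper's simulations. The function name `wsola`, the frame length of 256 samples, the search tolerance of 64 samples and the plain (unnormalised) cross-correlation used as the similarity measure are all illustrative assumptions. For each output frame, the sketch searches near the nominal input position for the segment that best matches the natural continuation of the previously copied segment, then overlap-adds it with a Hann window.

```python
import math


def wsola(x, alpha, frame_len=256, tolerance=64):
    """Return x time-scaled to roughly alpha times its duration.

    alpha < 1 compresses (speeds up), alpha > 1 expands (slows down).
    frame_len is assumed to be even; all parameter values are illustrative.
    """
    if alpha <= 0 or len(x) < frame_len:
        raise ValueError("need alpha > 0 and at least one full frame of input")
    hop_out = frame_len // 2       # synthesis hop: 50% overlap between output frames
    hop_in = hop_out / alpha       # analysis hop: larger than hop_out when compressing
    # Periodic Hann window: copies shifted by half a frame sum to one.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    # Zero-pad so that the search region and the template stay inside the list.
    padded = ([0.0] * tolerance + list(x)
              + [0.0] * (frame_len + tolerance + hop_out))
    n_frames = int((len(x) - frame_len) / hop_in) + 1
    out = [0.0] * ((n_frames - 1) * hop_out + frame_len)
    prev = tolerance               # start of the last copied segment (padded index)
    for k in range(n_frames):
        centre = round(k * hop_in) + tolerance   # nominal input position
        best = centre
        if k > 0:
            # Natural continuation of the previous segment, i.e. what would
            # follow it if no time scaling were applied.
            start = prev + hop_out
            template = padded[start:start + frame_len]
            best_score = -math.inf
            for cand in range(centre - tolerance, centre + tolerance + 1):
                segment = padded[cand:cand + frame_len]
                score = sum(a * b for a, b in zip(segment, template))
                if score > best_score:
                    best, best_score = cand, score
        # Overlap-add the chosen segment into the output.
        base = k * hop_out
        for n in range(frame_len):
            out[base + n] += window[n] * padded[best + n]
        prev = best
    return out
```

Under these assumptions, `wsola(x, 0.5)` roughly halves the duration of `x` while keeping its local periodicity, and `wsola(y, 2.0)` stretches a compressed signal `y` back towards the original duration.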

2. Scaled speech coder

Traditional LPC vocoders use either a periodic pulse train or white noise as the excitation for an all-pole synthesis model. These vocoders either operate at a fixed bit rate or the bit rate varies depending on the channel or source conditions. However, an increasing demand for digital speech coding applications and new advances in speech coding have made further development of low bit rate speech coding systems inevitable. The proposed scaled speech coder (SSC) takes advantage of both time and frequency domain approaches. While low bit rate coders employ linear predictive coding, pitch estimation and Fourier magnitude modelling for the excitation, the WSOLA algorithm offers a powerful tool for removing redundancy from speech signals in the time domain.

Speech signals exhibit both short-term (i.e. from sample to sample) and long-term (i.e. pitch related) correlation. The WSOLA algorithm, when used with modest compression factors, removes some of the long-term correlation from speech by using waveform similarity measures. However, the time-compressed speech signal still exhibits strong short-term correlation and some pitch correlation, and therefore LP coding can be successfully applied after time-scale modification. In addition, the proposed SSC offers a novel way to change the bit rate without any modification to the vocoder design. The motivation, therefore, is to merge these two approaches in order to compress and code speech signals effectively.

In this method, the transmitter speeds up the original speech, Signal A, by a factor of α using WSOLA, preserving the pitch and formant structure. This signal, Signal B in Fig. 1, is then coded using the new US Federal Standard MELP algorithm at 2.4 kbits/s.
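As a rough illustration of how the compression factor sets the operating point: Signal B lasts α times as long as Signal A and is coded at a fixed 2.4 kbits/s, so the number of bits spent per second of original speech scales with α. The sketch below tabulates this, together with the delay rule stated in Section 4 (the MELP delay of 45.5 ms multiplied by the expansion factor 1/α). The function names `ssc_bit_rate` and `ssc_delay_ms` are illustrative assumptions, not part of the paper. The computed bit rates reproduce those listed in Table 2, while the computed delays are close to, though not always identical to, the tabulated ones.

```python
# Values quoted in the paper for the standard MELP vocoder.
MELP_BIT_RATE = 2400.0   # bits/s
MELP_DELAY_MS = 45.5     # ms, overall MELP delay stated in Section 4


def ssc_bit_rate(alpha, base_rate=MELP_BIT_RATE):
    """Bits per second of original speech when the signal is compressed to
    alpha times its duration before being coded at base_rate."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in (0, 1]")
    return alpha * base_rate


def ssc_delay_ms(alpha, base_delay=MELP_DELAY_MS):
    """Overall delay estimate: coder delay times the expansion factor 1/alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in (0, 1]")
    return base_delay / alpha


for alpha in (0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"alpha={alpha:.1f}  rate={ssc_bit_rate(alpha):6.0f} bits/s  "
          f"delay={ssc_delay_ms(alpha):6.1f} ms")
```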

At the receiver side, the coded signal is first decoded to obtain the time-scale modified Signal C, which is then slowed down by a factor of 1/α in order to reconstruct the synthesised speech, Signal D. The advantage stems from the fact that time-scale modified signals are shorter and therefore require less coding time than the original input speech, at the expense of increased complexity and delay. Another question of interest is the degraded voice quality. Fig. 1 illustrates the overall block diagram of the SSC. It is possible to obtain variants of the SSC by employing other time-scale modification algorithms [6,7] and/or other low bit rate speech coders [3,5]. The WSOLA and MELP algorithms were selected for their well-known performance.

Fig. 1. The overall block diagram of the SSC.

3. Simulation

In order to assess the performance of the proposed coder, compression factors (α) of 0.3, 0.5, 0.7 and 0.9 were simulated on a computer, under benign input conditions, for source coding. Transmission channel characterisation and simulation are beyond the scope of this paper. Phonetically balanced quiet sentences spoken by one male and one female speaker, obtained from the TIMIT database, were used. Fig. 2(a) illustrates the original speech utterance “Cash-mere”, while Figs. 2(b)–(d) depict Signals B, C and D, respectively, for α = 0.5. Note that the time scale of Figs. 2(b) and (c) differs from that of the original. Fig. 3, on the other hand, illustrates the synthesised output speech obtained from the MELP vocoder without any compression.

Fig. 2. (a) Original speech utterance “Cash-mere”, Signal A; (b) speech signal obtained after time compression with α = 0.5, Signal B; (c) compressed, MELP-decoded Signal C; and (d) synthesised speech Signal D, after time expansion by 1/α = 2.0.

Simulation results mainly concentrate on intelligibility, talker recognisability and voice quality versus codec complexity and delay. Twenty-five subjects, of whom 8 were trained, took part in this study. Voice quality was assessed using mean opinion score (MOS) and degradation mean opinion score (DMOS) tests. During the MOS tests, listeners were asked to rate the output speech quality on an absolute scale, ranging from “very bad” for grade 1 to “excellent” for grade 5.

The main obstacle during the MOS tests was that the ordinary subjects were not familiar with low bit rate vocoders, and were therefore confused between the harsh, muffled, buzzy and nasal quality of the speech and the noise added after coding. To overcome this limitation, a DMOS test was also conducted. In this test, listeners were asked to rate the quality of the time-scaled and coded sentences relative to the quality of the standard MELP vocoder output. Note that this study is not concerned with the performance measures of the MELP vocoder. A detailed comparison of the MELP vocoder with other standard coders is given in Ref. [4].

In general, 25 subjects are not sufficient to assess the quality of the decoded speech. Thus a statistical tool called the bootstrap [1], for determining the accuracy of the sample mean as an estimate of an unknown population


Fig. 3. MELP decoded speech utterance “Cash-mere”.

mean, is used. 95% BCa (bias-corrected and accelerated) confidence intervals have been constructed for the MOS and DMOS test results. The BCa method is an improved version of the standard intervals: it adjusts for the non-normality of the estimator and corrects the resulting bias. Intelligibility and talker recognisability were tested by the trained subjects only.

4. Results

The question of interest is whether there is a significant difference between the average scores of the compressed and coded speech file for each α and those of the MELP standard, that is, whether subjects prefer the MELP output over the compressed and then coded speech outputs. The other questions of interest are codec complexity and delay. In this study, 1000 bootstrap samples (of the 25 subjects) were drawn for each MOS and DMOS of the male and female speakers. Since there are no extreme observations that badly influence the mean, use of the median or a trimmed mean would not improve our estimator. In addition, our results indicate that the bootstrap sample means approach a normal distribution. Fig. 4 depicts the histogram of the means of 1000 bootstrap samples, where the actual sample consists of the mean opinion scores of the male speaker when

Fig. 4. Histogram of 1000 bootstrap samples of MOS for male speaker when α = 0.3.
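The bootstrap procedure summarised in Fig. 4 and the BCa intervals can be sketched as follows. This is a minimal standard-library illustration of the general BCa recipe described in Ref. [1], not the authors' code. The function name `bca_interval`, the fixed random seed, the convention of counting ties as one half and the listener ratings in `example_ratings` are all assumptions made for illustration. The ratings are made-up placeholders, not data from this study.

```python
import random
from statistics import NormalDist


def bca_interval(data, n_boot=1000, level=0.95, seed=0):
    """Bias-corrected and accelerated bootstrap interval for the sample mean."""
    rng = random.Random(seed)
    norm = NormalDist()
    n = len(data)
    theta_hat = sum(data) / n

    # Bootstrap replicates of the mean: resample n ratings with replacement.
    reps = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_boot))

    # Bias correction z0 from the share of replicates below the sample mean
    # (ties count as one half, which helps with discrete ratings).
    below = (sum(r < theta_hat for r in reps)
             + 0.5 * sum(r == theta_hat for r in reps))
    share = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = norm.inv_cdf(share)

    # Acceleration from the jackknife (leave-one-out) means.
    total = sum(data)
    jack = [(total - v) / (n - 1) for v in data]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted(q):
        # Map a nominal quantile level q to its BCa-adjusted level.
        z = z0 + norm.inv_cdf(q)
        return norm.cdf(z0 + z / (1.0 - accel * z))

    def pick(q):
        # Order statistic of the sorted replicates at level q.
        return reps[min(n_boot - 1, max(0, int(q * n_boot)))]

    tail = (1.0 - level) / 2.0
    return pick(adjusted(tail)), pick(adjusted(1.0 - tail))


# Hypothetical ratings from 25 listeners on the 1 to 5 MOS scale.
example_ratings = [4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3,
                   4, 5, 4, 4, 3, 4, 4, 5, 4, 3, 4, 4]
low, high = bca_interval(example_ratings)
print(f"95% BCa interval for the mean rating: [{low:.2f}, {high:.2f}]")
```

With `n_boot=1000` and 25 ratings this mirrors the sample sizes quoted in the text. The sorted list `reps` is also what a histogram such as Fig. 4 would be drawn from.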

α = 0.3. Since the histogram is close to normal in shape, it is safe to assume the normality of the mean values. The mean opinion and degradation mean opinion scores for both speakers and all α values were checked and found to be normally distributed. Table 1 gives the statistically assessed MOS and DMOS test results for each compression factor. From Table 1 it is seen that when α = 0.9, the compressed and coded male speech scores somewhere in [3.42, 4.40] with 95% confidence for MOS, while the MELP coded male speech scores somewhere in [3.74, 4.54]. The DMOS test results also indicate that there is no significant difference between these two decoded speech files. On the other hand, abrupt changes, especially in the lower bounds of the mean opinion and degradation mean opinion scores, are observed when α = 0.5. These results indicate that the compression factor α should lie somewhere in the range 1.0 to 0.5, and preferably between 0.7 and 0.5, for the best compromise between coding efficiency and voice quality.

This result is also supported by the coder delay. The MELP vocoder operates on 22.5 ms frames and requires one additional frame for buffering, so that the overall delay is 45.5 ms. Note that this additional frame is required not because of codec complexity but by the nature of the MELP algorithm. The WSOLA algorithm delay is within the frame, and therefore the overall delay of the proposed SSC algorithm is determined by multiplying


45.5 ms by the expansion factor (1/α). Table 2 gives the overall delay with minimal buffering for each compression factor α. Intelligibility and talker recognisability comments from the trained subjects are presented in Table 3.

Table 1
Statistically assessed mean opinion and degradation mean opinion scores for each compression factor α: 95% BCa (bias-corrected and accelerated) confidence intervals

| Compression ratio (α) | MOS (male) | DMOS (male) | MOS (female) | DMOS (female) |
|---|---|---|---|---|
| 0.3 | [1.09, 1.46] | [1.27, 2.04] | [1.00, 1.40] | [1.29, 2.01] |
| 0.5 | [2.12, 2.83] | [2.51, 3.60] | [2.01, 2.78] | [2.88, 3.89] |
| 0.7 | [2.89, 3.76] | [4.23, 4.66] | [2.79, 3.74] | [4.10, 4.62] |
| 0.9 | [3.42, 4.40] | [4.62, 4.95] | [3.10, 4.14] | [4.47, 4.91] |
| 1.0 (MELP) | [3.74, 4.54] | 5 | [3.45, 4.41] | 5 |

Table 2
The overall delay with minimal buffering for each compression factor α

| Compression ratio (α) | Bit rate (bits/s) | Codec delay (ms) |
|---|---|---|
| 0.3 | 720 | 151.5 |
| 0.5 | 1,200 | 91 |
| 0.7 | 1,680 | 64.5 |
| 0.9 | 2,160 | 50.5 |
| 1.0 (MELP) | 2,400 | 45.5 |

Table 3
Intelligibility and talker recognisability comments for each compression factor α

| Compression ratio (α) | Intelligibility | Talker recognisability |
|---|---|---|
| 0.3 | Limited (interrupted) | Limited |
| 0.5 | Intelligible (harsh quality) | Recognisable |
| 0.7 | Highly intelligible | Clearly recognisable |
| 0.9 | Highly intelligible | Clearly recognisable |
| 1.0 (MELP) | Highly intelligible | Clearly recognisable |

It is possible to implement the proposed SSC in real time. The MELP algorithm has been implemented in real time on both fixed and floating point DSP processors [2]. Ref. [8], on the other hand, describes the details of a real-time implementation of the WSOLA algorithm on a floating point TMS 320-C31, 60 MHz, 45 MIPS processor. 128K RAM and 32K ROM

memory supported this processor. Two DSP processors were used, one for time compression and another for time expansion [8]. This implementation has been tested for α values between 0.45 and 1.0.

5. Conclusions

A novel approach to the coding of speech signals in vocoders was introduced in this paper. While conventional methods code the original input speech, the proposed algorithm codes time-scale modified speech signals. The WSOLA algorithm was used for time-scale modification, while the new US Federal Standard MELP vocoder was employed for coding. The proposed SSC produces communication quality speech at half the bit rate of the standard MELP vocoder. Synthesised speech quality degrades gracefully for decreasing values of the compression factor α and it is similar to that of the standard MELP vocoder. Although an increase in tonal artefacts is observed, the output speech sounds natural and speaker identification is preserved when α is between 1.0 and 0.5. In addition, the SSC does not require modifications to the MELP vocoder architecture. Time-scale modification algorithms appear as cascaded blocks that compress and expand the speech signals at the encoder and decoder, respectively.

Acknowledgements

The authors are grateful to the members of the Ankara University Speech Processing Group


(AUSPG) for evaluating the decoded speech files and making constructive comments. The authors would also like to thank Rachel Martin of Iowa State University for proofreading this study. Synthesised speech files can be found at http://auspg.science.ankara.edu.tr.

References

[1] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall Publications, London, 1993, pp. 153–165.
[2] E. Ertan, E. Aksu, H.G. Ilk, H. Karci, O. Karpat, T. Kolcak, L. Sendur, M. Demirekler, E. Cetin, Implementation of an enhanced fixed point variable bit rate MELP vocoder on TMS320C549, IEEE Internat. Conf. Acoust. Speech Signal Process. (ICASSP) 4 (1999) 2295–2298.
[3] W.B. Kleijn, J. Haagen, A speech coder based on decomposition of characteristic waveforms, IEEE Internat. Conf. Acoust. Speech Signal Process. (ICASSP) 1 (1995) 508–511.
[4] M.A. Kohler, A comparison of the new 2400 bps MELP federal standard with other standard coders, IEEE Internat. Conf. Acoust. Speech Signal Process. (ICASSP) 2 (1997) 1541–1544.
[5] A. Kondoz, A 2.4–1.2 kb/s SB-LPC speech coder and noise pre-processor: the Turkish NATO candidate, IEEE National Signal Processing and Applications Conference, SİU 2000, Antalya-Turkey, 12 June 2000, pp. 539–544.
[6] S. Roucos, A.M. Wilgus, High quality time-scale modification of speech, IEEE Internat. Conf. Acoust. Speech Signal Process. (ICASSP) 1 (1985) 493–496.
[7] S. Seneff, System to independently modify excitation and/or spectrum of speech waveform without explicit pitch extraction, IEEE Trans. Acoust. Speech Signal Process. ASSP-30 (4) (1982) 566–578.
[8] N. Serinken, B. Gagnon, O. Erogul, Full duplex speech for HF radio systems, IEE HF Radio Systems and Techniques Conference Publications, 1997, pp. 281–284.
[9] L.M. Supplee, R.P. Cohn, J.S. Collura, A.V. McCree, MELP: the new federal standard at 2400 BPS, IEEE Internat. Conf. Acoust. Speech Signal Process. (ICASSP) 2 (1997) 1591–1594.
[10] W. Verhelst, M. Roelands, An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech, IEEE Internat. Conf. Acoust. Speech Signal Process. (ICASSP) 2 (1993) 554–557.