Adaptive time scale modification of speech for graceful degrading voice quality in congested networks for VoIP applications

Signal Processing 86 (2006) 127–139 www.elsevier.com/locate/sigpro

Hakkı Gökhan İlk, Saadettin Güler
Ankara University, Department of Electronics Engineering, Beşevler, 06100 Ankara, Turkey
Received 4 June 2004; received in revised form 3 May 2005; accepted 3 May 2005
Available online 23 June 2005

Abstract

This paper proposes an alternative scheme to variable bit rate (VBR) speech coding for voice over Internet protocol (VoIP) during network congestion on the Internet. The proposed scheme is called "adaptive bit rate switching" and ensures that the available bandwidth is used as efficiently as possible. When congestion is signaled, a time scale modification algorithm called WSOLA (waveform similarity overlap and add) is employed with a time-dependent compression rate, determined according to the severity of the network congestion, in order to adaptively reduce the bit rate required to transmit speech. This approach differs from VBR speech coding and is novel in the sense that the coder operates at any desired bit rate for any desired duration. This is particularly useful in network environments because the load may be different in each direction. WSOLA has been selected as the time scale modification algorithm because it is computationally efficient and produces high quality output. In addition, the proposed scheme integrates WSOLA, or any other time scale modification algorithm, into any commercial or military constant bit rate (CBR) or VBR codec without any modification of the vocoder structure. The results of the proposed method are statistically evaluated using diagnostic rhyme tests (DRT) and mean opinion score (MOS) tests. The DRT results obtained from the simulation of the proposed system revealed, under a 90% confidence interval, that the perceptual success of the adaptively compressed and G.729 coded speech is 98.92 ± 0.03 percent. The MOS test results, on the other hand, showed that the system provides better perceptual quality than standard time scale modification, indicating that the proposed system indeed provides graceful degradation in voice quality even in additive increase multiplicative decrease modeled channels, provided that the dynamic network conditions grant bandwidth. © 2005 Elsevier B.V. All rights reserved.
Keywords: Quality of service; Adaptive time scale modification; Adaptive voice communication; Wired and wireless access

Corresponding author. Tel.: +90 312 2126720x1160; fax: +90 312 212 54 80.

E-mail addresses: [email protected], [email protected] (H.G. İlk).
0165-1684/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2005.05.006


Nomenclature

$\beta_t$  time scale compression factor (for the transmitting side)
$\beta_r = 1/\beta_t$  time scale expansion factor (for the receiving side)
AIMD  additive increase multiplicative decrease
AMDF  average magnitude difference function
CBR  constant bit rate
CS-ACELP  conjugate-structure algebraic code excited linear prediction
DRT  diagnostic rhyme test
IVR  interactive voice response
MELP  mixed excitation linear prediction
MOS  mean opinion score
RSVP  resource reservation protocol
VBR  variable bit rate
WSOLA  waveform similarity overlap and add
$\tau(m)$  time warping function that provides scaling
$\Delta_k$  tolerance that maximises similarity between consecutive segments

1. Introduction

Integration of voice and data networks has been made possible by recent advances in digital signal processors (DSP) and computer technologies. Especially in the last decade, an increasing amount of attention has been drawn to technologies for the transmission of voice over data networks by the International Telecommunication Union (ITU), the Internet Engineering Task Force (IETF) and commercial companies. Today voice over Internet protocol (VoIP) is widely used in private and public networks, even by telecommunication carrier companies. VoIP offers lower costs compared to traditional circuit based networks such as the public switched telephony network (PSTN). Due to this lower cost model, it is foreseen that in the coming years a large amount of long distance voice traffic will continue to be transmitted through VoIP channels. However, Internet protocol (IP) networks do not guarantee lossless and real-time transmission. Although the resource reservation protocol (RSVP) provides end-to-end transmission reservation, in practice it is not used across the Internet, where there are many hops and packet loss is inevitable when congestion occurs.

One solution to the congestion problem is adaptive rate voice coding, for which four different methods are employed, namely on/off, adaptive multi-rate, multimode and scalable. The on/off method uses silence suppression [1,2], where transmission occurs during talkspurt periods and background noise is modeled during silence. Adaptive multi-rate uses different predefined bit rates of the same CBR codec depending on the congestion [3–5]. The multimode method adapts itself according to the features of the input speech signal [4]. Finally, the scalable method uses an embedded structure in which the data packet obtained by coding each single frame comprises a very low bit-rate core (e.g. 2 kbit/s) to which a series of enhancement stages are added that increase the bit rate and therefore the quality of the reconstructed signal [4].

An alternative solution to the congestion problem, which is discussed in this paper, is to modify the time scale of speech signals according to the congestion severity. Time scale modification of speech has useful applications in many speech areas, such as interactive voice response (IVR), voice mail, dictation tape playback and post synchronization of voice and video by increasing or decreasing the speaking rate. Increasing the speaking rate also shortens the duration of the speech message and thus saves transmission rate. In this paper waveform similarity overlap and add (WSOLA) is used for time scale modification since it is computationally efficient and produces high quality speech [6,7]. A comparison of different time scale modification algorithms (operating in both the time and frequency domains) was made in Ref. [8], and WSOLA was found to provide the best quality at a given bit rate. The method proposed in this paper modifies the conventional WSOLA algorithm to provide congestion adaptivity.


When congestion occurs in an IP environment, providing high quality speech at the available bit rate raises two important issues: (i) signaling the network congestion and (ii) reducing the bit rate according to the congestion severity. Network concerns are beyond the scope of this paper and are therefore not discussed. Instead, the decision to increase or decrease the speech rate is assumed to be provided by a function determined by an additive increase, multiplicative decrease (AIMD) model. Reducing the bit rate adaptively, according to the congestion determined by this model, is the focal point of our research activity. Furthermore, the time scale modification is adaptive and therefore changes in time. Synchronization, i.e. synthesizing the same amount of speech as was processed, therefore poses new challenges. This work presents a novel approach that provides easy buffer management and synchronization of the speech at the receiver with respect to the speech at the transmitter.

The paper is organized as follows. After briefly summarizing WSOLA in Section 2, the adaptive WSOLA solution is derived in Section 3. The results are then discussed in Section 4 and finally some conclusions are drawn in Section 5.

2. Background: WSOLA

Speech synthesized with the WSOLA algorithm after time scaling has the same perceptual characteristics as the original speech, except for the speaking rate [6,7]. This is achieved by finding similar regions of the original speech $x(n)$ and overlapping these regions according to a time warping function, which provides the time scale modified speech signal $y(n)$. It should be noted that this is not a conventional multi-rate DSP technique, where decimation simply discards samples. The WSOLA algorithm can be expressed as follows [6,7]:

$$\forall m:\; y(n+\tau(m))\,w(n) \;(=)\; x(n+m)\,w(n), \tag{1}$$

where $w(n)$ is the weighting window and $(=)$ denotes maximum similarity. $\tau(m)$ is the time warping function and provides time scaling. The synthesized output signal which provides maximum local similarity according to this definition is given as follows [6,7,9]:

$$y(n) = \frac{\sum_k w^2(n-S_k)\,x\big(n+\tau^{-1}(S_k)-S_k+\Delta_k\big)}{\sum_k w^2(n-S_k)}, \tag{2}$$

where $S_k$ is the time instance of the $k$th synthesized (voice) packet, which is defined as the overlapped regions of the speech segments. Segments are the speech signals under the corresponding weighting windows.

It is important to understand that the WSOLA algorithm requires analysis and synthesis at both the transmitter and the receiver, i.e. the original speech signal must be analysed in order to find the most similar locations and then synthesized to obtain the compressed speech at the transmitter. Conversely, the compressed speech must first be analysed to find the most similar locations and then synthesized to obtain the expanded signal at the receiver. In fact WSOLA does not differentiate between time compression and expansion and regards both operations as time scale modification using Eq. (2).

There are three kinds of segments based on their locations:

(i) the base candidate segment, which is located at the time instance $\tau^{-1}(S_k)$,
(ii) candidate segments, located in the $\Delta_{max}$ neighborhood of the base candidate,
(iii) the elected, most similar, segment, which is located at the time instance $\tau^{-1}(S_k)+\Delta_k$.

Here, $\tau^{-1}(n)$ is the time warping function for scaling and $\Delta_k$ is the time tolerance which provides maximum local similarity with the previous elected segment. The value of $\Delta_k$ may vary between $-\Delta_{max}$ and $\Delta_{max}$. The average magnitude difference function (AMDF) or the cross-correlation coefficient (or normalized cross-correlation coefficient) of the segment starting at the time instance $\tau^{-1}(S_{k-1})+S+\Delta_{k-1}$ and the segment starting at the time instance $\tau^{-1}(S_k)+\Delta$ is calculated for each $\Delta$. The elected segment is given by the optimum time tolerance, $\Delta=\Delta_k$, which is determined by the index that gives the minimum AMDF, the maximum cross-correlation coefficient or the maximum normalized cross-correlation magnitude [6,7].

For ease of computation, the window function is selected to provide $\sum_k w^2(n-S_k)=1$. When synthesized speech packets are regularly spaced, the time instance of the $k$th packet becomes $S_k=kS$. A linear speaking rate increase (compression) or decrease (expansion) can be calculated as $\tau^{-1}(S_k)=kS/\beta$, where $\beta$ is the scaling factor and $kS/\beta$ is the original speech time instance of the $k$th base candidate segment. The real-time computation of regularly spaced WSOLA with 50% output overlap [6] for $S_k<n<S_{k+1}$ is given in Eq. (3):

$$y(n) = w^2\big(n-(k-1)S\big)\,x\big(n+(k-1)S/\beta-(k-1)S+\Delta_{k-1}\big) + w^2(n-kS)\,x\big(n+kS/\beta-kS+\Delta_k\big), \tag{3}$$

where $\beta<1$ is used for rate increment and $\beta>1$ is used for rate decrement [6,7].
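The overlap-add of Eq. (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function names `ncc` and `wsola` are hypothetical, $w^2$ is assumed to be a Hann window (so that the squared windows sum to one at 50% overlap), the first elected segment is placed at the start of the signal with $\Delta_0 = 0$, and the similarity search uses the normalized cross-correlation coefficient.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation coefficient of two equal-length segments."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def wsola(x, beta, S=40, delta_max=40):
    """Time-scale x by a fixed beta (beta < 1 compresses, beta > 1 expands)."""
    N = 2 * S  # window length: two half-windows of S samples
    # Assumption: w^2 is a Hann window, so squared windows at 50% overlap sum to 1.
    w2 = [0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / N) for n in range(N)]
    # Keep every candidate segment and every similarity template inside x.
    limit = len(x) - N - delta_max - S
    if limit < 0:
        return []
    K = int(limit * beta / S) + 1          # number of synthesized packets
    y = [0.0] * ((K + 1) * S)
    start = 0                              # elected segment start; Delta_0 = 0
    for k in range(K):
        if k > 0:
            base = int(k * S / beta)       # base candidate, tau^{-1}(S_k) = kS/beta
            # Natural continuation of the previously elected segment.
            template = x[start + S:start + S + N]
            lo, hi = max(0, base - delta_max), base + delta_max
            # Elected segment: candidate with maximum similarity to the template.
            start = max(range(lo, hi + 1), key=lambda s: ncc(x[s:s + N], template))
        for n in range(N):                 # overlap-add of the weighted segment
            y[k * S + n] += w2[n] * x[start + n]
    return y
```

With `beta = 1.0` the search elects the natural continuation and the interior of the output reproduces the input; the first and last `S` output samples are covered by a single window only and are therefore faded in and out. Replacing `ncc` with an AMDF (and `max` with `min`) would give the other similarity measure mentioned in the text.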

Fig. 1. Illustration of waveform similarity overlap and add.


Fig. 1 illustrates the synthesized speech $y(n)$, derived from an original input speech $x(n)$ plotted in Fig. 1(a). The three dotted and two solid windows drawn in this diagram show base candidate and elected segments, respectively. The three base candidate segments have time instances of $\sum_{m=1}^{k-2} S/\beta$, $\sum_{m=1}^{k-1} S/\beta$ and $\sum_{m=1}^{k} S/\beta$, respectively. For the (initial) $(k-2)$th elected segment, the location is assumed to be the same as the base candidate segment location. Fig. 1(b) illustrates the most similar regions of the $(k-1)$th and the $k$th elected segments and their corresponding portions of the weighting windows. Fig. 1(c) illustrates the weighted values of these elected segments and Fig. 1(d) shows the overlap-add of these weighted values to form the $k$th packet of the synthesized speech.

3. Adaptive WSOLA

3.1. Introduction

Using WSOLA for real-time transmission was first introduced in Ref. [10] to provide full duplex communication over half-duplex channels, where $\beta = 0.5$. The system was further enhanced in Ref. [11] in order to provide flexible bit rate switching, where $\beta$ is determined once at the beginning of the transmission. A block diagram of this system is given in Fig. 2, where the input speech of the transmitting side WSOLA is called the "original speech", $x_a(n)$, and the output speech of the transmitting side WSOLA is called the "synthesized speech", $x'_a(n)$. Likewise, the input speech of the receiving side WSOLA is called the "received speech", $x''_a(n)$, and the output speech of the receiving side WSOLA is called the "re-synthesized speech", $y_a(n)$. In this system the transmitting side speaking rate is increased with the WSOLA scaling factor $\beta$ and the result is then mixed excitation linear prediction (MELP) coded [11]. At the receiving side, the speech signal is MELP decoded and the speaking rate is decreased with the WSOLA scaling factor $1/\beta$ [11]. The system in Ref. [11] can also be used for VoIP or a similar voice over packet switching network technology without synchronization problems because the time compression is fixed and determined once at the beginning of the transmission.

Time scale modification enables the transmission of speech in less bandwidth but requires longer delays. The main motivation is therefore to modify the time scale only at times when the available bandwidth is not sufficient for communication. The proposed adaptive WSOLA technique enables "non-uniform" time scale modification. The problem would have been easier if the time-scaled signal were the final goal. However, one needs to go back to the original time scale for speech coding purposes; otherwise speech would slow down or speed up according to the time scale modification factor. In fact this kind of application might be useful for audio–video synchronization purposes. In such cases adaptive WSOLA on the transmitter side is sufficient to provide "non-uniform" time scaling. If the time-scaled signal is not the final result, one has to go back to the original time scale, which requires the application of adaptive WSOLA on the receiver side. The following sections consider the adaptation of window sizes (actually half-window sizes) at the receiver side as well as the application details of the adaptive WSOLA at the transmitter side.

Fig. 2. WSOLA block diagram.

3.2. Adaptive WSOLA: rate increment (compression) for the transmitter side

During speech communication over an IP network, congestion may occur at any time. When congestion occurs, the bandwidth required by speech communication should adapt itself to the network congestion conditions. In adaptive WSOLA, if the speaking rate is increased to $\beta = \beta_t$ at the transmitting side, then the speaking rate at the receiver side should be decreased to $\beta_r = 1/\beta_t$, depending on the network load. During speech communication, the scaling factor of the transmitter side can change from $\beta_{t1}$ to $\beta_{t2}$ at any time due to congestion. In such a scenario, if the first $i$ speech packets are scaled with $\beta = \beta_{t1}$ and the rest are scaled with $\beta = \beta_{t2}$ (for a packet size or synthesis lag of $S$), the analysis lag becomes $S/\beta_{t1}$ for the first $i$ speech segments and $S/\beta_{t2}$ for the remaining speech segments. In this case the time warping function is modified as follows:

$$\tau^{-1}(S_k) = \begin{cases} kS/\beta_{t1}, & k \le i,\\ iS/\beta_{t1} + (k-i)S/\beta_{t2}, & k > i. \end{cases} \tag{4}$$

By using this time warping function, Eq. (3) is modified, for $k > i+1$, as follows:

$$y(n) = w^2\big(n-(k-1)S\big)\,x\big(n + iS/\beta_{t1} + (k-1-i)S/\beta_{t2} - (k-1)S + \Delta_{k-1}\big) + w^2(n-kS)\,x\big(n + iS/\beta_{t1} + (k-i)S/\beta_{t2} - kS + \Delta_k\big). \tag{5}$$

During conversation, the scaling factor may change again according to the congestion conditions. In practice, even each packet of the speech may require a different scaling factor. If we denote the scaling factor of the $m$th packet as $\beta^t_m$, then the time warping function becomes

$$\tau^{-1}(S_k) = \sum_{m=1}^{k} \frac{S}{\beta^t_m} \tag{6}$$

and the synthesized speech is expressed as in Eq. (7):

$$y(n) = \sum_k w^2(n-S_k)\,x\big(n+\tau^{-1}(S_k)-S_k+\Delta_k\big) = \sum_k w^2(n-kS)\,x\Big(n + \sum_{m=1}^{k} \frac{S}{\beta^t_m} - kS + \Delta_k\Big). \tag{7}$$

In Eqs. (6) and (7), the scaling factor $\beta^t_m < 1$ for the $m$th packet should be selected according to congestion.

Fig. 3(a) illustrates the original input speech, where dotted windows show the time instances of base candidate segments. The time instances of elected segments are shown with solid windows. Each dotted window has the same length, but the windows are separated by different durations since each analysis lag is different. In Fig. 3(a), the first dotted window, located at the time instance $\sum_{m=1}^{k-2} S/\beta^t_m$, is for the $(k-2)$th base candidate segment, and the second and third ones are for the $(k-1)$th and the $k$th segments, which are located at the time instances $\sum_{m=1}^{k-1} S/\beta^t_m$ and $\sum_{m=1}^{k} S/\beta^t_m$, respectively. It is much easier to follow Fig. 3(a) by observing the distance between analysis lags rather than their locations. The analysis lag between the $(k-2)$th and $(k-1)$th base candidates is $S/\beta^t_{k-1}$ and it is $S/\beta^t_k$ between the $(k-1)$th and $k$th base candidates. Assuming that $\Delta_{k-2}$ is zero, the time instance of the $(k-1)$th elected segment is found in the $\Delta_{k-1}$ neighborhood of the second dotted window. Similarly, the time instance of the $k$th elected segment is searched in the $\Delta_{max}$ neighborhood of the third dotted window to provide maximum similarity with the second half of the $(k-1)$th elected segment, which is located at the time instance $t_2$. During synthesis, the $k$th segment of the original speech will overlap by 50% with the $(k-1)$th segment. Fig. 3(b) illustrates these two consecutive maximally similar halves of the original input speech and their corresponding weighting windows. In Fig. 3(c), the multiplication of each sample of these segments with their corresponding weighting values is shown. The overlap-add of these two weighted segments is illustrated in Fig. 3(d). This $k$th packet of the synthesized speech is located at the time instance of $t_3 = kS$.

Fig. 3. Time scale modification, rate increment for the adaptive WSOLA.
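As a small numerical check, the per-packet time warping function of Eq. (6) can be computed as a running sum and compared against the two-rate special case of Eq. (4). The sketch below is illustrative only; the function names and the example schedule (three packets at 0.5 followed by two at 0.8) are assumptions, not values taken from the paper.

```python
from itertools import accumulate

def analysis_positions(betas_t, S=40):
    """Eq. (6): tau^{-1}(S_k) = sum_{m=1..k} S / beta^t_m, for k = 1..len(betas_t)."""
    return list(accumulate(S / b for b in betas_t))

def two_rate_position(k, i, beta_t1, beta_t2, S=40):
    """Eq. (4): the first i packets use beta_t1, the remaining ones beta_t2."""
    if k <= i:
        return k * S / beta_t1
    return i * S / beta_t1 + (k - i) * S / beta_t2

# Hypothetical schedule: i = 3 packets compressed with 0.5, then 0.8.
betas = [0.5, 0.5, 0.5, 0.8, 0.8]
positions = analysis_positions(betas)
# Analysis lags between consecutive base candidates: S / beta^t_k.
lags = [b - a for a, b in zip([0.0] + positions, positions)]
```

Here `positions` gives the base candidate locations in the original speech (in samples) and `lags` shows that the dotted windows of Fig. 3(a) are spaced by $S/\beta^t_k$, i.e. 80 samples while $\beta^t_k = 0.5$ and about 50 samples once $\beta^t_k = 0.8$.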

3.3. Adaptive WSOLA: rate decrement (expansion) for the receiving side

A novel approach proposed in this paper is to use a $\beta$-dependent synthesis packet size at the receiving side rather than fixed size synthesis packets, as used in conventional WSOLA and in the adaptive WSOLA of the transmitting side. In the proposed approach the distance between base candidate segments of the received speech is constant for the receiving side. Therefore, each $S$ samples of the received speech lead to the synthesis of $S/\beta^t_m$ samples of re-synthesized speech at the receiving side for each $\beta^t_m$. Consequently, even if each analysis lag has a different WSOLA factor, synchronization is achieved.


The time instance of the $k$th base candidate segment of the received speech is given by the warping function $\tau^{-1}(S_k) = L_k$. In order to provide fixed sized analysis lags for each $\beta$ value, the time warping function becomes $\tau^{-1}(S_k) = kL$, where $L$ is half of the window length at the transmitting side. The time instance of the $k$th re-synthesized speech packet is therefore $S_k = \tau(L_k)$ and it depends on the past $\beta^r_m$ values. Since $L_k = kL$, the $k$th instance of the re-synthesized speech is determined as follows:

$$S_k = \tau(L_k) = \sum_{m=1}^{k} \beta^r_m L. \tag{8}$$

When $L = S$ and $\beta^r_m = 1/\beta^t_m$, end-to-end speech duration synchronization is accomplished. During synthesis each overlapped segment should have a length of $\beta^r_m L$. In order to synthesize the $k$th re-synthesized speech packet, the first half of the $k$th elected segment of the received speech overlaps with the second half of the $(k-1)$th elected segment of the received speech. This half of the weighting window should have a length of $\beta^r_k L$. In addition, the second half of the $k$th segment will overlap with the first half of the $(k+1)$th segment, and this half of the weighting window has a length of $\beta^r_{k+1} L$. If the weighting window is represented by $w_k(\cdot)$, then the length of $w_k(\cdot)$ is $L(\beta^r_k + \beta^r_{k+1})$. According to this time warping function and these weighting windows, the adaptive WSOLA speaking rate decrement at the receiving side is determined as follows:

$$y(n) = \sum_k w_k^2(n-S_k)\,x\big(n+\tau^{-1}(S_k)-S_k+\Delta_k\big), \tag{9}$$

$$y(n) = \sum_k w_k^2\Big(n - \sum_{m=1}^{k} \beta^r_m L\Big)\,x\Big(n + kL - \sum_{m=1}^{k} \beta^r_m L + \Delta_k\Big). \tag{10}$$

Fig. 4 illustrates an example of adaptive WSOLA at the receiving side. Fig. 4(a) illustrates three elected segments of the received speech where, for ease of interpretation, only elected segments are presented. The weighting windows belong to the $(k-2)$th, $(k-1)$th and $k$th elected segments, which are located at the time instances $(k-1)L+\Delta_{k-1}$, $kL+\Delta_k$ and $(k+1)L+\Delta_{k+1}$ of the received speech, respectively. Fig. 4(b) illustrates two consecutive maximally similar segments of the received speech with their corresponding weightings. Fig. 4(c), on the other hand, illustrates the weighted values of these segments, while Fig. 4(d) illustrates their overlap-added values, which form the re-synthesized speech. The re-synthesized speech packets are located at $\sum_{m=1}^{k-1} \beta^r_m L$ for the overlap of the $(k-2)$th and $(k-1)$th received speech segments and at $\sum_{m=1}^{k} \beta^r_m L$ for the overlap of the $(k-1)$th and $k$th received speech segments, respectively. Also note that the network load may be different in each direction during conversation.
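The synchronization property of Eq. (8) can be illustrated numerically. The sketch below is a hedged example, not the authors' implementation: the function names are hypothetical and the factor sequence 0.5 to 0.9 is only an assumed example schedule. It checks that, with $L = S$ and $\beta^r_m = 1/\beta^t_m$, the re-synthesized packet positions at the receiver coincide with the analysis positions used at the transmitter, and it evaluates the window length $L(\beta^r_k + \beta^r_{k+1})$.

```python
from itertools import accumulate

def resynthesis_positions(betas_r, L=40):
    """Eq. (8): S_k = sum_{m=1..k} beta^r_m * L, for k = 1..len(betas_r)."""
    return list(accumulate(b * L for b in betas_r))

def window_lengths(betas_r, L=40):
    """Length of w_k in samples: L * (beta^r_k + beta^r_{k+1})."""
    return [L * (a + b) for a, b in zip(betas_r, betas_r[1:])]

# Assumed example: transmitter compression factors, one per packet.
betas_t = [0.5, 0.6, 0.7, 0.8, 0.9]
betas_r = [1.0 / b for b in betas_t]          # receiver expansion factors

# Transmitter side analysis positions in the original speech (Eq. (6), S = 40).
original_positions = list(accumulate(40 / b for b in betas_t))
# Receiver side packet positions in the re-synthesized speech (Eq. (8), L = S = 40).
resynth_positions = resynthesis_positions(betas_r, L=40)
windows = window_lengths(betas_r, L=40)
```

In this example five received packets of 40 samples each (200 samples in total) are expanded back to roughly 298 samples, which is the amount of original speech the transmitter consumed for the same five packets.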

Fig. 4. Time scale modification, rate decrement for the adaptive WSOLA.

4. Tests and discussion

The proposed method has been simulated by implementing algorithms using Eqs. (7) and (10). In the simulation code the original input speech was read from disk and the synthesized output speech was written to disk. If the sampling rate is represented by $f$, the number of samples at each read is $0.005 f/\beta^t_k$ for rate increment and $0.005 f$ for rate decrement. The number of samples at each write is $0.005 f$ for rate increment and $0.005 f \beta^r_k$ for rate decrement.

The following parameter defaults were used during simulations. The length of the weighting window was selected to be 10 ms for rate increment, and $S = 40$ samples and $f = 8000$ Hz were used in Eq. (7). The length of the weighting window was set to $5(\beta^r_k + \beta^r_{k+1})$ ms ($L = 40$) for rate decrement in Eq. (10). Adaptive rate increment was provided by initially selecting $\beta^t_m = 0.5$ and then incrementing it by 0.1 until 0.9. After 0.9, the factor was suddenly set back to 0.5 in order to follow the additive increase multiplicative decrease (AIMD) congestion model. Synchronization for adaptive rate decrement is performed by selecting $\beta^r_m = 1/\beta^t_m$ (where $\beta^r_m = 1/0.5$, $\beta^r_{m+1} = 1/0.6$, …, $\beta^r_{m+4} = 1/0.9$ and then again $1/0.5$, …) in Eq. (10). The G.729 CS-ACELP codec, which is commonly used in VoIP applications, was used as the voice coder. During simulations, adaptive WSOLA rate increment was performed first and then the time scale modified speech was CS-ACELP coded. For the receiving side simulation, the CS-ACELP encoded speech was first decoded and then adaptive WSOLA rate decrement was applied to the decoded output in order to obtain the re-synthesized speech. For all of these tests the maximum similarity tolerance ($\Delta_k$) was found using normalized cross-correlation.

The overall algorithm complexity, including normalized cross-correlation, is $(2f\beta)(2\Delta_{max} + 2 + 2\Delta_{max}/S)$ multiply-adds for 1 s of input speech. As an example, for the sampling rate $f = 8000$, overlap size $S = 40$, maximum tolerance $\Delta_{max} = 40$ and WSOLA scaling factor $\beta = 0.5$, the algorithm requires 672,000 multiply-adds at the transmitting side. At the receiving side the overlap size becomes $S = 80$ and the adaptive WSOLA scaling factor becomes $\beta = 2$. The complexity becomes nearly 2,700,000 multiply-add operations for 1 s of received speech, which corresponds to 2 s of re-synthesized speech. Note that these complexity figures do not include the vocoder complexity because different voice coders can be utilized. However, it should be noted that the proposed method also reduces vocoder complexity because the same amount of coded speech information requires half of the coding time (for $\beta = 0.5$) and therefore requires less computation.

The performance of the proposed adaptive WSOLA is evaluated using diagnostic rhyme tests (DRT). The DRT were carried out on 50 rhyming Turkish word pairs. 65 subjects were asked to identify which word of the pair was spoken. Word pairs can generally be classified into four categories.

Voiced/unvoiced difference (Class 1): These words have only three letters. Word pairs differ only in the first letter. The material therefore tests whether the subjects can distinguish between b–p, g–k, v–f, j–s and d–t.

Nasality difference (Class 2): These words again have only three letters and differ only in the first letter. The material tests whether the subjects can distinguish between m–b and n–d.

Sustained/interrupted difference (Class 3): These are words which have similar sounds but different meanings. In this case, one word of the pair takes longer to pronounce. Although the DRT were performed in Turkish, a good example of this class in English could be the words "lunch" and "launch".

Sibilated/unsibilated difference (Class 4): These words have three letters and test the difference between z–t, ç–k and j–g.

Each subject was required to fill in three test templates. These test templates were prepared as follows:

Adaptive WSOLA: Adaptive time scale modification rate increment (Eq. (7)) and then time scale modification rate decrement (Eq. (10)) were applied to the original input signal (8 kHz, 16 bit, mono) using the simulation code. $\beta^t_m$ was changed at each consecutive segment according to the AIMD congestion bandwidth model.

CS-ACELP: The original input signal was coded and then decoded with ITU-T G.729 according to Annex A.

Adaptive WSOLA & CS-ACELP: The time scale of the original input signal was adaptively modified for rate increment and then CS-ACELP coded. The coded output was then decoded with CS-ACELP and time scale modified for rate decrement.

Table 1 lists the total number of misclassifications for each class of the Turkish rhyming pairs for the different compression/coding techniques. The DRT was carried out on 65 subjects with a total of 50 rhyming word pairs.

Table 1
Total number of misclassifications for the DRT results obtained from 65 subjects, 50 rhyming word pairs

                      Adaptive WSOLA   CS-ACELP   Adaptive WSOLA & CS-ACELP
Class 1 (13 pairs)           8             9                  19
Class 2 (13 pairs)           4             2                   5
Class 3 (13 pairs)          10            12                   4
Class 4 (11 pairs)           3             6                   7
Total (50 pairs)            25            29                  35

It is interesting to note that for the sustained/interrupted difference class (Class 3), adaptive WSOLA used in conjunction with CS-ACELP outperforms both the adaptive WSOLA and the CS-ACELP DRT scores. Although this result might look unexpected at first, speaking rate and time scale modification are widely used in the literature in order to increase intelligibility [12]. While tests revealed complex inter-correlations between time scale modification and speech intelligibility for hearing impaired and normal listeners at different ages, time scale modification has also been shown to increase the performance of automatic speech recognizers (ASR) [13]. In our case the situation is different, in the sense that the speaking rate is first increased and then decreased in order to go back to the original speaking rate. While this processing leads to 10 misclassifications for the adaptive WSOLA, applying the CS-ACELP coder to the compressed sustained/interrupted word pair and then time-expanding the decoded speech leads to only 4 misclassifications. It should be noted that in this case encoding is performed on the time compressed speech, and the perceptual weighting of the CS-ACELP decoder is performed on the decoded speech (which is time compressed speech). This decoded, perceptually weighted, compressed speech is expanded in order to synthesize the output speech. It is thought that the perceptual enhancement of the compressed speech might increase the intelligibility, but only for the sustained/interrupted word pairs. This is consistent with the fact that the only difference between the sustained/interrupted word pairs is the duration. For the other classes (classes other than Class 3), misclassifications for adaptive WSOLA + CS-ACELP are higher than for adaptive WSOLA and CS-ACELP used alone, as would be expected heuristically. Table 2 gives the statistical significance of the DRT results obtained for the subjects under a 90% confidence interval.

Finally, MOS tests were carried out on 50 subjects in order to evaluate the quality of the re-synthesized speech. During these tests listeners were asked to rate the output speech quality of two phonetically balanced sentences (one uttered by a male and the other by a female) obtained from the TIMIT database on a scale from 1 to 5, where 1 is unacceptable and 5 is toll quality. Table 3 gives the MOS test results for standard time scale modification (where $\beta = 0.5$) and for adaptive WSOLA with the AIMD bandwidth model described at the beginning of this section. The adaptive model requires 1.4 times more channel capacity than the conventional model.
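The numerical figures quoted in this section can be reproduced with a short script. This is a sketch under stated assumptions rather than the original simulation code: the function names are hypothetical, the AIMD schedule follows the description above (0.5, increased by 0.1 up to 0.9, then reset), and the capacity ratio assumes that the required channel capacity is proportional to the compression factor, which is an interpretation and not something the paper states explicitly.

```python
from itertools import islice

def aimd_betas(low=0.5, step=0.1, high=0.9):
    """Yield beta^t_m: additive increase from low to high, then a drop back to low."""
    n_steps = round((high - low) / step)
    while True:
        for j in range(n_steps + 1):
            yield round(low + j * step, 10)   # rounding avoids float drift

def multiply_adds(f, beta, delta_max, S):
    """(2 f beta)(2 Delta_max + 2 + 2 Delta_max / S) multiply-adds per second of input."""
    return 2 * f * beta * (2 * delta_max + 2 + 2 * delta_max / S)

cycle = list(islice(aimd_betas(), 5))             # one AIMD cycle of compression factors
# Assumption: required capacity scales with beta, so compare the mean adaptive
# factor with the fixed conventional factor of 0.5.
capacity_ratio = (sum(cycle) / len(cycle)) / 0.5
tx_cost = multiply_adds(8000, 0.5, 40, 40)        # transmitting side, beta = 0.5, S = 40
rx_cost = multiply_adds(8000, 2.0, 40, 80)        # receiving side, beta = 2, S = 80
```

Under these assumptions the mean factor over one cycle is 0.7, which gives the capacity ratio of 1.4 mentioned in the text, and the complexity formula evaluates to 672,000 multiply-adds at the transmitter and 2,656,000 (the "nearly 2,700,000" of the text) at the receiver.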

5. Conclusion In this paper an alternative solution to adaptive rate voice coding is proposed. The proposed algorithm, ‘‘adaptive WSOLA’’, provides a mechanism to change the time-scale modification factor for rate increment/decrement in time,

137

Table 2 Standard deviation and 90% confidence interval of the DRT results Adaptive WSOLA (%)

Standard deviation 90% confidence interval

0.083

CSACELP (%)

0.118

Adaptive WSOLA & CS-ACELP (%) 0.141

99.2370.017 99.1170.025 98.9270.029

Table 3 Mean opinion scores for the conventional and adaptive WSOLA modified and CS-ACELP coded speech Conventional (b ¼ 0:5)

Adaptive (AIMD model)

MOS (Male) 3.49

MOS (Male) 3.93

MOS (Female) 3.76

MOS (Female) 4.02

which enables bandwidth efficiency in VoIP channels. This approach is particularly useful in network environments because the network load is dynamic and may be different at each direction. In adaptive WSOLA, time warping function is not constant and depends on network congestion. Although nonlinear time warping functions of WSOLA were proposed for prosodic transplantation and post synchronization of speech utterances [14] and time scale modification for voice transmission in IP networks were also proposed for loss concealment and adaptive play out [15,16] the proposed scheme provides adaptive rate coding for real-time voice communication and differs from other proposed WSOLA systems in algorithm and concept. The only expense introduced with the adaptive WSOLA is delay (complexity increase is partly compensated by the reduction of codec complexity due to time compression). Delay is inversely proportional with the compression rate and therefore adaptive WSOLA should be used only with low delay codecs. However, the system is generic


and can be applied to any low delay (preferably low bit rate) voice coder. The simulations of adaptive WSOLA showed that the rate can be adapted with a granularity as fine as 10 ms. The effect of this rapid change on quality was tested with the DRT, using recordings obtained in an office environment. As shown in Tables 1 and 2, when adaptive WSOLA and CS-ACELP are used together, the decrease in DRT perceptual success is less than 0.3%. The 90% confidence interval of the perceptual success of the proposed system is calculated as 98.92% ± 0.029%, according to the Student's t distribution. This result indicates that there is no significant difference (in terms of DRT results) between CS-ACELP coded speech and adaptive WSOLA+CS-ACELP coded speech. Finally, although adaptive time scale modification provides only slightly better MOS results than the conventional model, it should be noted that the proposed system makes optimum use of the channel bandwidth while introducing the minimum possible delay. The interested reader is encouraged to visit the web site http://auspg.science.ankara.edu.tr to listen to the synthesized speech files and to download free simulation software (executable only) that provides adaptive WSOLA+CS-ACELP.
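The confidence intervals quoted above have the form mean ± half-width. As a rough sketch of how such a half-width follows from a sample standard deviation, the code below assumes the usual form quantile × s / √n and substitutes the normal quantile for the Student's t quantile, an approximation that is reasonable only for fairly large samples. The helper name `ci_half_width` and the sample size `n = 60` are hypothetical; the number of DRT listeners is not stated in this section, so the output is not expected to reproduce Table 2.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(std_dev, n, confidence=0.90):
    """Half-width of a two-sided confidence interval for a mean.

    Uses the normal quantile as a large-sample stand-in for the
    Student's t quantile (an assumption of this sketch).
    """
    if n < 2:
        raise ValueError("need at least two observations")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.645 for 90%
    return z * std_dev / sqrt(n)


# Standard deviations are taken from Table 2; n = 60 is a hypothetical sample size.
for label, s in [("Adaptive WSOLA", 0.083),
                 ("CS-ACELP", 0.118),
                 ("Adaptive WSOLA & CS-ACELP", 0.141)]:
    print(f"{label}: +/- {ci_half_width(s, 60):.3f}")
```

For small samples the Student's t quantile is noticeably larger than the normal one, so this approximation would understate the interval width.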

Acknowledgements

The authors would like to thank Dr. Kemal Ozdemir, Schlumberger Norway, and Rachel Martin, Iowa State University, for proofreading this manuscript. The authors would also like to thank Prof. Werner Verhelst for fruitful discussions on adaptive (non-uniform) WSOLA.

References

[1] Appendix II to Recommendation G.711, A comfort noise payload definition for ITU-T G.711 use in packet-based multimedia communication systems, February 2000.
[2] Annex B to Recommendation G.729, C source code and test vectors for implementation verification of the algorithm of the G.729 silence compression scheme, August 1997.
[3] A. Barberis, C. Casetti, J.C. De Martin, M. Meo, A simulation study of adaptive voice communications on IP networks, Comput. Commun. 24 (2001) 757–767.
[4] G. Ruggeri, E. Beritelli, S. Casale, Hybrid multi-mode/multi-rate CS-ACELP speech coding for adaptive voice over IP, IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings of ICASSP 2001, vol. 2, May 2001, pp. 733–736.
[5] C. Casetti, J.C. De Martin, M. Meo, A framework for the analysis of adaptive variable bit rate voice sources, International Conference on Communications, Proceedings of ICC 2000, vol. 2, June 2000, pp. 821–826.
[6] W. Verhelst, Overlap-add methods for time-scaling of speech, Speech Commun. 30 (2000) 207–221.
[7] W. Verhelst, M. Roelands, An overlap-add technique based on waveform similarity (WSOLA) for high quality time scale modification of speech, IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings of ICASSP 93, April 1993, pp. 554–557.
[8] H.G. Ilk, S. Tugac, Channel and source considerations of a bit rate reduction technique for a possible wireless communications system's performance enhancement, IEEE Trans. Wireless Commun. 4 (1) (January 2005) 93–99.
[9] W. Verhelst, D.V. Compernolle, P. Wampacq, A unified view on synchronized overlap-add methods for prosodic modification of speech, International Conference on Spoken Language Processing, ICSLP 00, vol. 2, October 2000, pp. 63–66.
[10] N. Serinken, B. Gagnon, O. Eroğul, Full duplex speech for HF radio systems, IEE HF Radio Systems and Techniques Conference, 1997, pp. 281–284.
[11] O. Eroğul, H.G. İlk, O. Ilk, A flexible bit rate switching method for low bit rate vocoders, Signal Processing 81 (2001) 1737–1742.
[12] N.E. Vaughan, I. Furukawa, N. Balasingam, M. Mortz, S.A. Fausti, Time-expanded speech and speech recognition in older adults, J. Rehabil. Res. Dev. 39 (5) (September–October 2002) 559–565.
[13] E. Fosler-Lussier, N. Morgan, Effects of speaking rate and word frequency on pronunciations in conversational speech, Speech Commun. 29 (2–4) (November 1999) 137–158.
[14] W. Verhelst, H. Brouckxon, Rejection phenomena in inter-signal voice transplantations, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Proceedings of WASPAA 03, October 2003, pp. 165–168.
[15] Y.J. Liang, N. Farber, B. Girod, Adaptive playout scheduling using time-scale modification in packet voice communications, IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings of ICASSP 01, vol. 3, May 2001, pp. 1445–1448.
[16] F. Liu, J. Kim, C.-C.J. Kuo, Adaptive delay concealment for Internet voice applications with packet-based time-scale modification, IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings of ICASSP 01, vol. 3, May 2001, pp. 1461–1464.