Signal Processing 41 (1995) 43-48
A nonuniform sampling method of speech signal and its application to speech coding

JaeYeol Rheem^a,*, BeomHun Kim^b, SouGuil Ann^a

^a Department of Electronics Engineering, Seoul National University, Shinlim-dong San 56-1, Kwanak-gu, Seoul 151-742, South Korea
^b Samho Electronic Company, Bucheon 421-040, South Korea
Received 8 November 1993; revised 22 April 1994
Abstract

A nonuniform sampling method for speech signals is proposed which rejects perceptually redundant samples by sampling at the maxima and minima of the waveform. Data reduction is further improved by silence processing implemented at the sampling stage without transmitting any side information. As an application, a waveform coding scheme with an average rate of 13.4 kbit/s is proposed.

Keywords: Nonuniform and uniform sampling; Redundancy; Maxima and minima; Compression ratio

*Corresponding author. Tel.: +82-2-880-7279. Fax: +82-2-882-3906. E-mail: [email protected].

0165-1684/95/$9.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0165-1684(94)00089-1

1. Introduction

To get data reduction, conventional speech waveform coding methods incorporate their own
quantization method which utilizes the inherent redundancy in the speech signal. Such redundancy is known to come from the relatively high sample-by-sample correlation that results from uniform sampling [3]. To reduce the sample-by-sample correlation, nonuniform sampling or nonredundant sample coding methods can be considered [5, 2]. However, conventional nonuniform sampling
methods such as polynomial predictors and interpolators cannot be applied to the speech signal, which is well known to have much redundancy. Since the speech signal has a very complex waveform and shows quasi-periodicity, noise-like randomness and nonstationarity, many samples are needed to approximate it with straight-line segments under conventional nonuniform sampling criteria, and the resulting amount of storage required is usually comparable to or more than that required by a uniform sampling method such as PCM [2]. Thus, a new criterion which determines the nonredundant samples of the speech signal is required to obtain data reduction by a nonuniform sampling method.

Classical experiments on the perception of amplitude-distorted speech show that the local peak-to-peak interval, i.e., the interval between the maxima and minima of the speech waveform, carries the very information which preserves intelligibility [4, 1]. In Licklider and Pollack's research [4], the intelligibility score of differentiated and clipped speech was 97% while that of the original speech was 99%. The only information contained in a differentiated and clipped waveform is the times of occurrence of the maxima and minima of the original waveform. Furthermore, it was found that the polarity and shape of the signals are not important as long as the time intervals between maxima and minima are maintained [1]. In our listening experiment with such a simplified signal, we observed that personality information is almost lost. However, if each amplitude of the simplified signal is adjusted to the actual amplitude of the corresponding maximum or minimum of the speech waveform, the personality becomes distinguishable. Since the intervals are closely related to the frequency components of the speech signal, we conclude that the linguistic information lies mainly in the intervals, while the amplitudes and their shapes are closely related to personality information.
Consequently, those facts lead to the conclusion that the samples between maxima and minima are redundant in the sense of speech perception. We propose a simple method to represent speech signal by sampling nonuniformly at the times of maxima and minima of speech waveform.
Fig. 1. Considerations on the local maxima and minima of a PCM signal on the positive side: (a) normal maximum; (b) unresolved maxima and minima due to quantization error; (c) unresolved maximum due to quantization or to overload clipping.
2. Proposed nonuniform sampling method

2.1. Discussion on the local maxima and minima of speech waveform
In the digitized speech waveform, some care is needed to determine the extreme points, i.e., its maxima and minima. There are three possible cases of maxima caused by the effects of sampling and quantization, as shown in Fig. 1, when the positive side of the waveform is considered. The zero-slope regions in (b) and (c) can be interpreted as the result of unresolved maxima and minima. In the continuous signal before digitization, there might be two extreme points in (b) and one in (c). Especially in case (c), the possibility of clipping by overload cannot be excluded. If the sampling frequency and the number of quantizer bits increase, the extreme points in the zero-slope region will be resolved as normal extreme points in the digitized waveform. There are therefore two kinds of extreme points: normal and unresolved. In case (b), the starting and ending points of the zero-slope region should be taken as nonredundant samples, and in case (c), it is reasonable to estimate a possible center point of the zero-slope region as an extreme point. However, for further data reduction, the extreme points in (b) can be ignored. Since they occur only a few times in long-term speech and their intervals are always short, we observed that the perceptual quality of the reconstructed signal shows no severe difference whether they are ignored or not. By ignoring these extreme points, we obtain a further reduction of data.
45
J. Y. Rheem et al. 1 Signal Processing 41 (1995) 43-48
2.2. Algorithm description

The determination of those extreme points can be implemented by a simple slope test on successive samples. Let x(n) be a PCM speech signal at time n. The slope at time n, ẋ(n), may be estimated as

ẋ(n) = x(n) - x(n - 1).    (1)
If the slope product of successive samples, ẋ(n)ẋ(n + 1), is less than zero, x(n) is a normal extreme point. If the slope product is zero while the successive slopes ẋ(n) and ẋ(n + 1) are not both zero, x(n) is a starting or an ending point of a zero-slope region. By checking the slopes at those starting and ending points, the state of the zero-slope region can be determined. If the zero-slope region is in the state of Fig. 1(b), the starting and ending points are taken as two nonredundant samples in algorithm I and ignored in algorithm II. If the zero-slope region is in the state of Fig. 1(c), a possible center point of the zero-slope region is estimated as the nonredundant sample. If the sample x(n) at time n has been stored as a nonredundant sample, the next sample to be stored is the sample x(n + τ) at time n + τ which first satisfies the above conditions for a nonredundant sample. The interval τ, which is the number of uniformly sampled samples between the previous and the present nonredundant samples, is stored as timing information. Thus each nonredundant sample consists of its amplitude and interval. The first and last samples are stored as nonredundant samples to maintain the entire duration of the signal.
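The slope test above can be sketched as follows. This is our own minimal reading of algorithm II, not the authors' code: normal extreme points are kept, a case-(c) zero-slope run is replaced by its estimated center point, and the case-(b) endpoints are ignored; the handling of a zero-slope run reaching the end of the signal is our assumption.

```python
# Sketch of the slope-test sampler of Section 2.2 (algorithm II variant).
# Each kept sample is emitted as (interval, amplitude), the interval being
# the number of uniform sampling periods since the previously kept sample.

def nonuniform_sample(x):
    """Return [(interval, amplitude), ...]; first and last samples are always kept."""
    kept = [(0, x[0])]                      # first sample, interval 0
    last = 0                                # index of the previously kept sample
    n = 1
    while n < len(x) - 1:
        prev_slope = x[n] - x[n - 1]
        next_slope = x[n + 1] - x[n]
        if prev_slope * next_slope < 0:     # normal extreme point
            kept.append((n - last, x[n]))
            last = n
            n += 1
        elif next_slope == 0 and prev_slope != 0:
            # start of a zero-slope run: find where it ends
            m = n
            while m < len(x) - 1 and x[m + 1] == x[m]:
                m += 1
            end_slope = x[m + 1] - x[m] if m < len(x) - 1 else 0
            if prev_slope * end_slope < 0 or end_slope == 0:
                # case (c): one unresolved extreme point; take the center
                c = (n + m) // 2
                kept.append((c - last, x[c]))
                last = c
            # otherwise case (b): both endpoints ignored, as in algorithm II
            n = m + 1
        else:
            n += 1
    kept.append((len(x) - 1 - last, x[-1]))  # last sample
    return kept
```

On a toy waveform such as [0, 1, 2, 1, 0, 1, 2, 2, 2, 1, 0], this keeps the two resolved extrema, the center of the flat top, and the two end samples.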
2.3. Reconstruction

For reconstruction, linear interpolation by a straight line between two nonredundant samples as a function of the interval can be used [5]. However, though the reconstructed signal has almost the same intelligibility as the original speech, the resulting saw-tooth waveform is quite different from the original waveform, especially in voiced segments. To compensate for the waveform shape, a sinusoidal interpolation method is proposed. A half period of a cosine function can be used to
approximate the original waveform. If a(k) and i(k) represent the amplitude and interval of the kth nonredundant sample, respectively, the reconstructed signal between the (k - 1)th and kth nonredundant samples by sinusoidal interpolation, y_i(n + t), is expressed as

y_i(n + t) = [ ((a(k - 1) - a(k))/2) cos(πt/i(k)) + (a(k - 1) + a(k))/2 ]_L,  for 1 ≤ t ≤ i(k),    (2)
where n is the sampling instant of the (k - 1)th nonredundant sample, and [·]_L is a quantization operator with L levels. Since the intervals in unvoiced and silence segments are relatively short, the difference between the PCM data and the reconstructed signals in these segments is hardly noticeable, as shown in Fig. 2. However, the difference between the two interpolation methods shows clearly in voiced segments.
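The two interpolators can be sketched as below (function names are ours); plain rounding stands in for the L-level quantizer [·]_L of Eq. (2).

```python
import math

# Rebuild the i uniform samples between the (k-1)th nonredundant sample
# a_prev and the kth nonredundant sample a_next.

def linear_segment(a_prev, a_next, i):
    """Straight-line interpolation for t = 1 .. i."""
    return [round(a_prev + (a_next - a_prev) * t / i) for t in range(1, i + 1)]

def cosine_segment(a_prev, a_next, i):
    """Eq. (2): half a cosine period running from a_prev (t = 0) to a_next (t = i)."""
    return [round((a_prev - a_next) / 2 * math.cos(math.pi * t / i)
                  + (a_prev + a_next) / 2) for t in range(1, i + 1)]
```

For a peak-to-valley pair, e.g. cosine_segment(100, -100, 4) gives [71, 0, -71, -100] while linear_segment(100, -100, 4) gives [50, 0, -50, -100]: the cosine lingers near the extrema, avoiding the saw-tooth shape of the straight-line version.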
2.4. Improvement of data reduction by silence processing

Since the speech signal contains a considerable portion of silence segments, much data reduction can be achieved if the extreme points in them are rejected. Rejecting those extreme points can be implemented by fixed-length windowing and a silence decision algorithm. The average energy and zero-crossing rate of the uniformly sampled and quantized data within the windowed segment are estimated, and the silence decision is made by the usual silence decision algorithm [6]. If the segment is decided to be silence, its final sample is taken as the only nonredundant sample, with an interval equal to the window length. In this case, it does not matter whether the final sample is an extreme point or not. If the segment is decided to be nonsilence, it is processed by the proposed nonuniform sampling algorithm. When the state of successive segments changes from nonsilence to silence, the final sample of the nonsilence segment should be sampled with an appropriate interval to
match the length of the interval of the following silence segment. In this scheme, we get only one nonredundant sample, with a constant interval equal to the window length, for each silence segment. It is notable that the suggested silence processing can be implemented at the sampling stage with one frame of delay and without any side information to be transmitted or stored.

Fig. 2. Comparison of reconstructed signals: (a) original signal; (b) reconstructed signal by linear interpolation; (c) reconstructed signal by sinusoidal interpolation.

3. Experimental result

Davisson has found the compression ratio of conventional nonuniform sampling methods in analytic form for a stationary stochastic process [2]. In that case, the probability distribution of the run lengths of the interval, in terms of the given conditional probability of sample occurrence, is required. However, the speech signal is a nonstationary process and its conditional probability is not well known [3]. Thus, in our experiment, the probability distribution of the interval as a function of its length is estimated empirically from long-term speech, and the compression ratio is then computed. If p(τ) is the estimated probability of the interval τ, the average interval length N is estimated as

N = Σ_{τ=1}^{T} τ p(τ),

where T is the maximum interval. Then the compression ratio C is expressed as

C = N log₂ L / (log₂ T + log₂ L),
where L is the number of quantizer levels [2]. For the experiment, three males and one female read the same material for three minutes, at a rather high average speaking rate of 5.34 syllables/s, in an ordinary computer room. The speech signals were lowpass filtered with a cut-off frequency of 3.8 kHz, uniformly sampled at 8 kHz and quantized into 8-bit words (L = 256). As a result, T, N and C were 16, 1.98 and 1.32, respectively, for algorithm I; for algorithm II, they were 24, 2.43 and 1.54. Since the extreme points of Fig. 1(b) are ignored in algorithm II, its compression ratio is higher than that of algorithm I. For algorithm I, the segmental SNR [3] of the reconstructed signal was 11.49 dB with linear interpolation and 11.20 dB with sinusoidal interpolation; for algorithm II, the values were 10.42 dB and 9.91 dB, respectively. The subjective quality measured on the five-point MOS (mean opinion score) scale [3] was 3.9 with linear interpolation and 3.8 with sinusoidal interpolation for algorithm I; for algorithm II, the scores were 3.6 and 3.4, respectively. As shown above, the quality of the signal reconstructed by linear interpolation is better than that by sinusoidal interpolation. For silence processing, five window lengths of 16, 32, 64, 128 and 256 samples were considered, only for
algorithm II. The thresholds for silence processing were chosen conservatively, so as not to damage nonsilent segments. When the interval of a nonredundant sample exceeds the window length for silence processing, which often occurs when the window is relatively short, the uniformly sampled sample at the maximum interval after the previous nonredundant sample is taken as a pseudo nonredundant sample. Since the possible maximum interval is restricted to the window length, there is a trade-off between the compression ratio and the window length. The resulting compression ratios for window lengths 16, 32, 64, 128 and 256 were 2.62, 2.45, 2.26, 2.10 and 1.90, respectively. The quality of the reconstructed signal with silence processing was the same as that without it. The compression ratio with window length 16 corresponds to about 24 kbit/s. As the above results show, the proposed method by itself gives only a slight data reduction. However, combined with silence processing, the compression ratio can be improved to give a moderate transmission rate without much degradation of quality. Furthermore, there remains room for further improvement: since the sequence of amplitudes reflects the original speech signal, conventional amplitude quantization techniques can be applied to encode it with fewer bits without much degradation.
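The reported figures can be reproduced from the compression-ratio formula of this section; the short check below uses only the numbers stated above.

```python
import math

# Each nonredundant sample costs log2(T) bits of interval plus log2(L) bits
# of amplitude, and replaces an average of N uniform samples of log2(L) bits
# each, giving C = N*log2(L) / (log2(T) + log2(L)).

def compression_ratio(N, T, L=256):
    return N * math.log2(L) / (math.log2(T) + math.log2(L))

ratio_I = compression_ratio(1.98, 16)   # algorithm I  -> about 1.32
ratio_II = compression_ratio(2.43, 24)  # algorithm II -> about 1.54
rate_kbps = 64 / 2.62                   # 64 kbit/s PCM over the best silence-processed ratio
```

The last line recovers the "about 24 kbit/s" figure quoted for window length 16.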
4. Application to speech coding: an example

In this section, a practical coding scheme based on the proposed algorithm II combined with silence processing is developed as follows. Each of the sequences of intervals and amplitudes is encoded with 4 bits. To encode the interval sequence, 4-bit encoding is sufficient; the encoding rule is: for interval τ, send the binary value of (τ - 1). However, some intervals exceed the encoding range, since the maximum interval is usually greater than 16. In that case, the interval is split into two intervals by a pseudo nonredundant sample whose interval equals 16, as discussed in Section 3. To encode the amplitude sequence, a 15-level (4-bit) μ-law quantizer is used. The remaining one level is
reserved for silence processing. Thus each nonredundant sample is encoded as an 8-bit word: 4 bits for the interval and 4 bits for the amplitude. For silence processing, a window length of 256 is used. When a windowed segment is decided to be silent, its interval, 256, is encoded as the word 255 without encoding the amplitude. In decoding, when an 8-bit word is 255, it is decoded as an interval of 256 and an amplitude of zero level; otherwise, each 4 bits are decoded as the interval and the amplitude. For reconstruction, linear interpolation is used. For the experiment, 120 sentences were used, composed of ten Korean phoneme-balanced sentences spoken three times at different speaking rates by one female and three males. The performance varies from sentence to sentence and from person to person. The overall average transmission rate, speaking rate and segmental SNR were 13.4 kbit/s, 4.0 syllables/s and 9.7 dB, respectively. The subjective quality of the reconstructed signal was 3.0 on the MOS scale. To improve the compression ratio or the quality, other quantization methods such as DPCM or ADPCM can be applied to encode the amplitude sequence.
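The 8-bit word format above can be sketched as follows. The nibble order (interval in the high 4 bits, amplitude index in the low 4 bits) is our assumption; the paper specifies only 4 bits for each field.

```python
# Sketch of the 8-bit word of Section 4. Amplitude indices 0..14 come from
# the 15-level mu-law quantizer, so under the assumed nibble order the word
# 255 can never occur for a normal sample and is free to mark a 256-sample
# silence window, matching the "remaining one level" reserved in the text.

SILENCE_WORD = 255

def encode(interval, amp_index=None):
    if amp_index is None:                 # silence window: interval fixed at 256
        assert interval == 256
        return SILENCE_WORD
    assert 1 <= interval <= 16 and 0 <= amp_index <= 14
    return ((interval - 1) << 4) | amp_index

def decode(word):
    if word == SILENCE_WORD:
        return 256, None                  # 256-sample silence, zero amplitude
    return (word >> 4) + 1, word & 0x0F
```

For example, decode(encode(7, 3)) returns (7, 3), and decode(255) returns (256, None).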
5. Conclusion

In this paper, we proposed a nonuniform sampling method for the speech signal, in which the signal is sampled nonuniformly at the times of the local maxima and minima of the speech waveform. This gives a new representation of the speech signal. Since the samples between the maxima and minima are redundant in the sense of speech perception, the reconstructed speech signal shows almost the same intelligibility score, with only a little degradation of speech quality. To improve the compression ratio, a silence processing method which can be implemented at the sampling stage, without transmitting or storing any side information, was considered. As a practical application, an average 13.4 kbit/s speech coding scheme was developed. Although the proposed method gives only a moderate transmission rate and its performance depends on the message and the speaker's characteristics such as speaking rate and voice quality, because of its simplicity and extremely low computational requirement,
many applications such as voice mailing, voice recording, and speech synthesis are expected.
References

[1] W.A. Ainsworth, "Relative intelligibility of different transforms of clipped speech", J. Acoust. Soc. Amer., Vol. 41, 1967, pp. 1272-1276.
[2] L.D. Davisson, "Data compression using straight line interpolation", IEEE Trans. Inform. Theory, Vol. IT-14, No. 3, May 1968, pp. 390-394.
[3] N.S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[4] J.C.R. Licklider and I. Pollack, "Effects of differentiation, integration, and infinite peak clipping upon the intelligibility of speech", J. Acoust. Soc. Amer., Vol. 20, No. 1, January 1948, pp. 42-51.
[5] T.J. Lynch, Data Compression: Techniques and Applications, Lifetime Learning Pub., 1985.
[6] L.R. Rabiner and M.R. Sambur, "An algorithm for determining the endpoints of isolated utterances", Bell System Technical J., Vol. 54, No. 2, February 1975, pp. 297-315.