Speech Communication 10 (1991) 533-538 (Vol. 10, Nos. 5-6, December 1991)
North-Holland
An algorithm for the measurement of jitter
Jean Schoentgen and Raoul de Guchteneere
Institute of Phonetics, Université Libre de Bruxelles, Brussels, Belgium
Received 7 June 1991
Abstract. Jitter is the small fluctuation from one glottis cycle to the next in the duration of the fundamental period of the voice source. Analyzing jitter requires measuring glottal cycle durations accurately. Generally speaking, this is carried out by sampling at a medium rate and interpolating the discretized signal to obtain the required time resolution. In this article we describe an algorithm which solves the following two signal processing problems. Firstly, signal samples obtained by interpolation are only estimates of the original samples, which are unknown. The quality of the reconstruction of the signal therefore has to be evaluated. Secondly, small variations in cycle durations are easily corrupted by noise and measurement errors. The magnitude of measurement errors therefore has to be gauged. In our algorithm, the quality of reconstruction by signal interpolation is evaluated by a statistical test which takes into account the distribution of the corrections (which are brought about by interpolation) to the positions of the signal events which mark the beginnings of the glottal cycles. Three different interpolation methods have been implemented. Measurement errors are controlled by estimating independently the cycle durations of the speech and the electroglottographic signals. When the series obtained from both signals agree, we may then conclude that they reflect vocal fold activity and that they have not been unduly corrupted by errors or noise. The algorithm has been tested on 77 signals produced by healthy and dysphonic subjects. Its performance was satisfactory on all counts.
Zusammenfassung (translated from German). Period-duration fluctuations (jitter) are the small changes, from one glottal cycle to the next, in the duration of the pseudo-period of the source signal. Analyzing these fluctuations requires a very accurate measurement of the duration of the glottal cycles. In general, this is achieved by sampling the speech signal at a medium rate and interpolating the digital signal to reach the desired precision. In this contribution we describe an algorithm that measures the duration fluctuations accurately by providing solutions to the following two problems. Firstly, signal values obtained by interpolation are only estimates; the original values are unknown. The problem is to determine the quality of the reconstruction of the signal. Secondly, measurements of the very small period-duration fluctuations are easily distorted by noise or by measurement errors. Here the problem is to estimate the influence of measurement errors on the final result. In the algorithm presented, the quality of the signal reconstruction is determined from the statistical distribution of the corrections (due to interpolation) to the positions of the features that mark the beginning of the periods. Three different interpolation methods have been compared. Measurement errors are estimated by measuring the duration of the glottal cycles independently from the speech signal and from the laryngogram. When the two measurement series agree, we may assume that they reflect the activity of the vocal folds and that they have not been distorted too much by noise and measurement errors. The algorithm was checked on 77 speech signals produced by healthy and dysphonic speakers. Its performance was satisfactory in all cases.
Résumé (translated from French). Microperturbations of the fundamental period are the slight fluctuations in the durations of the glottal cycles. In order to analyze microperturbations, it is necessary to measure the duration of the glottal cycles with very high precision. In general, this is achieved by sampling the speech signal at a medium rate and interpolating the digitized signal. In this article we describe an algorithm dedicated to the precise measurement of microperturbations, which proposes a solution to the following two signal processing problems. Firstly, the samples obtained by interpolation are only estimates of the original samples, which are unknown. The problem is to evaluate the quality of the signal reconstruction. Secondly, small variations in the durations of the glottal cycles are easily biased by noise and measurement errors. Here the problem is to estimate the impact of measurement errors on the final result. In our algorithm, the quality of the interpolation is evaluated with a statistical test that takes into account the distribution of the corrections (due to interpolation) to the positions of the events that mark the beginning of the periods. Three different interpolation methods have been implemented. Measurement errors are controlled by estimating the durations of the glottal cycles independently from the speech signal and from the laryngogram. If the two series coincide, one may conclude that they reflect the activity of the vocal cords and that they have not been unduly biased by measurement errors or noise. The algorithm has been verified on 77 signals produced by healthy and dysphonic speakers. Its performance is very satisfactory in general.
National Fund for Scientific Research, Belgium.
0167-6393/91/$03.50 © 1991 Elsevier Science Publishers B.V. All rights reserved
Keywords. Jitter, voice quality, dysphonia.
1. Introduction
Jitter is the fluctuation, from one glottis cycle to the next, in the duration of the pseudo-period of the voice source. This article presents a method for measuring jitter accurately. Jitter has been studied for some thirty years (e.g., Lieberman, 1963). Lieberman was among the first to establish that voiced speech signals are not purely periodic. Since then, the phenomenon has been shown to exist in connected speech as well as in sustained vowels. In healthy speakers the amount of jitter is small: between 0.1 and 1% of the average fundamental period in the case of sustained vowels (e.g., Horii, 1982). When laryngeal disorders are present, jitter may increase considerably (e.g., Koike et al., 1977).
The method usually employed for estimating jitter has not changed much over thirty years; one notable exception is Orlikoff and Baken (1989). Basically, the extent of jitter is gauged by measuring individual period durations over an analysis interval of, typically, fifty periods (in the case of sustained vowels). Subsequently, average jitter is expressed with the help of a statistical dispersion measure of the individual period durations (Pinto and Titze, 1990). Several methodological problems arise from this approach. Because of this, articles published over the last few years on the subject of jitter have become more methodologically and less clinically oriented. The following problems are among those which still await solutions:
(a) Several authors have drawn attention to the fact that the speech signal has to be sampled at a very high frequency to capture a phenomenon which consists of time differences of the order of 10^-4 to 10^-5 seconds (Horii, 1979; Heiberger and
Horii, 1982; Deem et al., 1989). Measuring the amount of fluctuation with a precision of 10% would therefore require a time resolution of 10^-5 to 10^-6 seconds, that is, a sampling frequency of 100 to 1000 kHz. In contrast, bandwidth requirements appear to be modest. Experiments have shown that microperturbations can still be reliably detected after low-pass filtering at 1000 Hz (Titze et al., 1987). In order to avoid wasting computer resources, a compromise has been reached that involves sampling the signal at a medium rate of 20 or 30 kHz and interpolating the sampled signal to obtain the required resolution in time. The problem here is to evaluate the quality of the reconstruction of the sampled signal. Indeed, the interpolated values of in-between samples are nothing but estimates; the exact values are unknown.
(b) The smallness of the cycle-to-cycle perturbations means that their measurements are easily corrupted by noise or by random or systematic measurement errors. An experiment which shows how the methods used to extract the fundamental period may bias the estimates of jitter can be found in Schoentgen (1989). We propose hereafter a method for checking that measurement results are not unduly corrupted by noise or errors.
(c) Generally speaking, the average amount of jitter in an utterance is estimated by the statistical dispersion of the durations of the successive glottal cycles. Statistically speaking, these variations are considered to be the consequence of the fluctuations of a random variable (i.e., the period) about its average. We think that this point of view is too limited. Indeed, cycles are produced sequentially and so belong to a time series. Hence, neighbouring cycles are not necessarily independent. As a result, a statistical framework able to handle time series may well be required to describe jitter correctly. This implies switching from
the single random variable point of view to an approach based on a sequence of random variables (e.g., Chatfield, 1984).
In addition to the algorithm that we have developed to measure glottis cycle durations accurately, in this article we describe the solutions that we have adopted to solve the problems posed by interpolation (a) and by noise (b). The statistical treatment (c) is left out.
(a) We evaluate the reliability of interpolation techniques by using a method proposed by Hess and Indefrey (1987). It is based on the statistical distribution of the corrections (which are brought about by interpolation) of the positions of the events (e.g., peak positions and zero crossings) which mark the beginning of a new glottal cycle. The null hypothesis is that the corrections are uniformly distributed. In other words, within the permitted bandwidth any analogue signal value is as likely to be sampled as any other. Any deviation from the uniformity assumption indicates that the interpolation technique has favoured some signal features over others. In this article we compare three different interpolators. The first is a parabolic interpolator, which is generally recommended in the literature (e.g., Titze et al., 1987; Nittrauer et al., 1990); the second consists of oversampling and smoothing with a truncated Fast Fourier Transform (FFT); the third involves oversampling followed by Finite Impulse Response (FIR) filtering. The quality of the results so obtained is compared with the help of Hess' criterion.
(b) As stated above, period-to-period fluctuations are tiny. As a result, jitter measurements are easily biased. In order to bring sources of error under control we propose extracting the pitch period sequence from both the acoustic and the electroglottographic signal, two physically very different signals that are recorded simultaneously.
A good agreement between the time series obtained independently indicates that they in fact reflect phenomena related to vocal fold movement, and that they have not strayed too far from their true appearance under the influence of systematic or random errors.
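As a worked check of the sampling-rate requirement quoted under problem (a), the following minimal Python sketch converts a perturbation size and a desired relative precision into a time resolution and a sampling frequency. The function name `required_sampling_rate_hz` is hypothetical and only illustrates the arithmetic; it is not part of the algorithm described in this article.

```python
def required_sampling_rate_hz(perturbation_s: float, relative_precision: float) -> float:
    """Sampling rate whose sample spacing equals the wanted time resolution.

    Hypothetical helper: resolution = perturbation * precision, rate = 1 / resolution.
    """
    resolution_s = perturbation_s * relative_precision
    return 1.0 / resolution_s


if __name__ == "__main__":
    # Perturbations of the order of 1e-4 s and 1e-5 s, measured to within 10%.
    for perturbation_s in (1e-4, 1e-5):
        rate_hz = required_sampling_rate_hz(perturbation_s, 0.10)
        print(f"{perturbation_s:.0e} s at 10% precision -> {rate_hz / 1000:.0f} kHz")
```

Under these assumptions the two cases correspond to the 100 kHz and 1000 kHz bounds given in the text, which is why sampling at a medium rate followed by interpolation is attractive.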
2. Subjects and methods
Twenty-six adult speakers served as subjects for this preliminary study (eight healthy males, five healthy females, six dysphonic males and seven dysphonic females). They were told to sustain the vowels [a], [i] and [u] as long and as steadily as possible at a comfortable pitch and loudness level. The acoustic signals were recorded in a soundproof room. The microphone was placed approximately 5 cm from the subjects' lips. The laryngograph signal, which varied proportionally to the laryngeal conductance, was recorded simultaneously. The signals were digitized by a two-channel SONY PCM audio processor and recorded on video tape. A central one-second portion of the signal of each vowel was redigitized at a 20 kHz sampling frequency with a 12-bit resolution, and was stored in two files (laryngograph and acoustic signal) on the hard disk of a Masscomp 5050 computer for further processing.
As recommended by Hess and Indefrey (1987), the procedure that we designed to measure the duration of individual glottis cycles used oversampling followed by low-pass filtering to obtain a high resolution in time. In addition, we implemented linear and parabolic interpolation. The period duration measurements were made in two steps. Firstly, a coarse detection of the important events in the original signal was carried out, i.e., the peaks were determined in the first derivative of the laryngogram (EGG), as were the zero crossings in the acoustic signal, which were assumed to mark the instant of glottal closure. At this point, the visual display of the coarse period durations was scanned for errors; these could be corrected manually. Secondly, a portion of the signal centred on the main events was oversampled and filtered to improve accuracy.
The algorithm worked as follows (Figure 1):
(a) Input of the EGG and acoustic signals from the disk files.
(b) Computation of the EGG signal autocorrelation function to estimate the average fundamental period.
(c) Optional low-pass filtering of the EGG signal (to remove undesirable high-frequency components) and band-pass filtering of the acoustic signal.
(d) First-order differentiation of the EGG signal.
(e) Coarse detection of the peaks in the first-order derivative, and coarse detection of the zero crossings in the filtered acoustic signal.
(f) Visual check of the events detected and optional return to (e) (some detection parameters could be modified, such as the search interval, the minimum amplitude of the peaks or their slope, etc.).
(g) Oversampling and interpolation around the main peaks or the zero crossings, and fine detection.
(h) Visualization.
(i) Statistical tests.

Fig. 1. General design of the algorithm for measuring glottis cycle durations from laryngograms and speech signals.

The algorithm processed only one of the two signals when the other was, for instance, too weak or too noisy. It was nevertheless worthwhile to obtain period estimates from both signals, since the extent of agreement between the two was a measure of the reliability of the results obtained.
Oversampling and interpolation were carried out by inserting the current base sample value seven times between the current and the following sample, thus increasing the sampling rate from 20 kHz to 160 kHz. The "staircase function" so obtained was then filtered with a 199th order filter (interpolator and differentiator for the EGG signal, and interpolator only for the acoustic signal) (McClellan et al., 1979). We thus obtained a theoretical temporal resolution of 1/160 kHz = 6.25 microseconds. When low-pass filtering was carried out by Fourier transformation instead, the FFT (whose order was equal to 1024) was truncated beforehand to a sixteenth of its order.
The algorithm included a test of oversampling accuracy. As a first step, our method roughly detected the important events (peaks or zero crossings), represented by the local maxima or by the samples nearest to zero. But the true event might have occurred anywhere in the vicinity of the sample detected. Hess has argued that the distance (in number of samples) between the event position found after oversampling and the last base sample of the signal is distributed uniformly. We therefore implemented a chi-square test to verify the uniformity of the distribution of the distances so obtained. Any important deviation from a uniform distribution meant that the shape of the oversampled signal had been unduly influenced by the oversampling procedure, and that the period durations so obtained were possibly biased.
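The oversampling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a Hamming-windowed sinc low-pass with 201 taps stands in for the 199th order filter designed with the program of McClellan et al. (1979), the whole signal is oversampled rather than only a portion around each event, and all function names are hypothetical. Only the acoustic branch (upward zero crossings) is shown.

```python
import math


def hold_oversample(samples, factor=8):
    """Repeat each base sample `factor` times (the "staircase function")."""
    return [value for value in samples for _ in range(factor)]


def windowed_sinc_lowpass(num_taps, cutoff):
    """Hamming-windowed sinc low-pass; `cutoff` is in cycles per sample (< 0.5)."""
    middle = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - middle
        ideal = 2.0 * cutoff if t == 0 else math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [tap / gain for tap in taps]  # unit gain at 0 Hz


def filter_same(signal, taps):
    """Convolve and compensate the filter delay; samples beyond the edges count as zero."""
    half = (len(taps) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += tap * signal[j]
        out.append(acc)
    return out


def upward_zero_crossings(signal, start, stop):
    """Indices in [start, stop) where the signal goes from negative to non-negative."""
    return [i for i in range(max(start, 1), stop) if signal[i - 1] < 0.0 <= signal[i]]


def period_durations(samples, base_rate_hz=20000.0, factor=8, num_taps=201):
    """Oversample, low-pass filter and return cycle durations in seconds."""
    staircase = hold_oversample(samples, factor)
    taps = windowed_sinc_lowpass(num_taps, cutoff=0.5 / factor)  # base Nyquist frequency
    smooth = filter_same(staircase, taps)
    margin = num_taps  # skip the edge region disturbed by the implicit zero padding
    marks = upward_zero_crossings(smooth, margin, len(smooth) - margin)
    step_s = 1.0 / (base_rate_hz * factor)  # 6.25 microseconds at 160 kHz
    return [(b - a) * step_s for a, b in zip(marks, marks[1:])]


if __name__ == "__main__":
    f0 = 123.4  # synthetic test tone, not a recorded vowel
    tone = [math.sin(2.0 * math.pi * f0 * n / 20000.0 + 0.3) for n in range(800)]
    for duration in period_durations(tone):
        print(f"{duration * 1e3:.4f} ms (true period {1e3 / f0:.4f} ms)")
```

Repeating samples delays every event by the same constant amount, so the cycle durations, which are differences between event positions, are not affected by that delay.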
3. Results and discussion
The algorithm was tested on 77 signals produced by 26 speakers sustaining the vowels [a], [i] and [u]. On the whole, the performance of the algorithm was very good. For example, Figures 2 and 3 show the cycle durations of a stable one-second portion of a sustained [a] vowel for a male and a female speaker, respectively. The upper curve has been obtained from the speech signal, the lower curve from the laryngogram. The upper curve has been shifted upwards by 0.1 milliseconds to make the comparison easier. One notices that the time series agree not only as far as melody, mean-term jitter and the order of magnitude of short-term jitter are concerned, but also in most of the fine detail. We obtained a similar quality of agreement between the speech and the EGG signal for most of our speakers. Since the two time series were obtained from signals which were physically very dissimilar and which were not processed in the same manner, this shows that the measurements are not overly corrupted by errors or noise.

Fig. 2. Period durations of a one-second portion of an [a] vowel sustained by a healthy male speaker. The upper curve represents the series obtained from the speech signal; it has been shifted upwards by 0.1 milliseconds. The lower curve represents the series obtained from the electroglottograph signal. The vertical scale is in milliseconds; the horizontal scale shows the period numbers in the utterance.

Fig. 3. Period durations of a one-second portion of an [a] vowel sustained by a healthy female speaker. The upper curve represents the series obtained from the speech signal; it has been shifted upwards by 0.1 milliseconds. The lower curve represents the series obtained from the electroglottograph signal. The vertical scale is in milliseconds; the horizontal scale shows the period numbers in the utterance.

Table 1
Result of a chi-square test checking the reliability of signal interpolation. Interpolation has been carried out by three methods (parabolic or linear interpolation, oversampling and FFT-based low-pass filtering, oversampling and FIR filtering). Given are the numbers of signals which satisfy (P > 5%) and which do not satisfy (P ≤ 5%) the null hypothesis (which is that the corrections brought about by interpolation are uniformly distributed)

                                 Laryngogram          Speech signal
                                 P ≤ 5%    P > 5%     P ≤ 5%    P > 5%
FIR filtering                    25        52         3         74
FFT filtering                    38        39         32        45
Parab./linear interpolation      40        37         20        57

Table 1 summarizes the result of a chi-square test carried out on 77 signals whose time resolution had been increased by interpolation with the help of three different methods: (i) oversampling followed by FIR filtering; (ii) oversampling followed by a truncated Inverse Fourier Transform; (iii) the fitting of a linear or parabolic curve to two or three samples. The table indicates the number of signals which satisfy Hess' criterion when the level of significance is fixed at 5% (i.e., accepting one error in twenty when rejecting the null hypothesis). The null hypothesis consisted of assuming that corrections to the peak positions or zero crossings are distributed uniformly. The table shows a difference between the performances of oversampling followed by FIR filtering on the one hand, and of fitting a parabola on the other. For the laryngogram, the former did not satisfy Hess' criterion in 25 and the latter in 40 cases out of the 77. The results are still more in favour of FIR filtering in the case of the acoustic signal: 74 of the 77 cases satisfy the criterion. The score of the FFT is intermediate.
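The uniformity check behind Table 1 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that each correction is expressed as an integer offset between 0 and 7 (the position of the fine event within the eight oversampled points of one base sample), that the eight offsets form the bins of the test, and that the decision is made by comparing Pearson's statistic with the tabulated 5% critical value for 7 degrees of freedom. The names `chi_square_uniformity` and `satisfies_hess_criterion` are hypothetical.

```python
from collections import Counter

# Upper 5% points of the chi-square distribution, indexed by degrees of freedom.
CHI2_CRITICAL_5_PERCENT = {3: 7.815, 7: 14.067, 15: 24.996}


def chi_square_uniformity(offsets, num_bins=8):
    """Pearson chi-square statistic of integer offsets 0..num_bins-1 against a uniform law."""
    counts = Counter(offsets)
    expected = len(offsets) / num_bins
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(num_bins))


def satisfies_hess_criterion(offsets, num_bins=8):
    """True when the uniformity hypothesis is not rejected at the 5% level."""
    critical = CHI2_CRITICAL_5_PERCENT[num_bins - 1]
    return chi_square_uniformity(offsets, num_bins) <= critical


if __name__ == "__main__":
    # Synthetic offsets: one evenly spread series, one piled up on offset 0.
    even = [i % 8 for i in range(160)]
    piled = [0] * 80 + [i % 8 for i in range(80)]
    print(satisfies_hess_criterion(even), satisfies_hess_criterion(piled))
```

With the oversampled event index of each cycle at hand, the offsets would be obtained as `index % 8`; a series whose corrections pile up on a few offsets fails the test, which is the situation counted in the P ≤ 5% columns.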
4. Conclusion
We have presented an algorithm for the measurement of jitter. Interpolation was used to improve time resolution. The algorithm contained a built-in check of the accuracy of the waveform reconstruction. Furthermore, time series of the glottis cycle durations were extracted from two physically different signals. The less the measurements had been corrupted by noise or errors, the better the agreement between both time series was assumed to be. In most of the cases the time series obtained from the speech signal and the laryngogram agreed to a very great extent. This indicated that measurement errors had been kept within bounds and that the time sequences correctly reflected vocal fold behaviour. Furthermore, oversampling followed by FIR filtering was the best of three interpolation schemes. The next stage will consist of statistically treating the period sequence as a time series. This approach dispenses with the hypotheses that adjacent glottal cycles are independent and that any mean-term or long-term trends are not more than the outcome of melodic variations.
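For readers who wish to put numbers on the two quantities discussed in this article, the sketch below computes a jitter value in percent of the mean period and an agreement score between two period series. Both choices are assumptions made for illustration: the article assesses agreement from the plotted series and does not prescribe a metric, and the mean absolute difference between consecutive periods is only one of several possible dispersion measures. The function names and the sample values are hypothetical.

```python
import math


def jitter_percent(periods):
    """Mean absolute difference of consecutive periods, in % of the mean period."""
    mean_period = sum(periods) / len(periods)
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / mean_period


def pearson_correlation(x, y):
    """Pearson correlation of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up period durations in milliseconds, for illustration only.
    speech_ms = [8.10, 8.14, 8.08, 8.12, 8.09]
    egg_ms = [8.11, 8.13, 8.08, 8.12, 8.10]
    print(f"jitter (speech): {jitter_percent(speech_ms):.2f} %")
    print(f"agreement (correlation): {pearson_correlation(speech_ms, egg_ms):.2f}")
```

A correlation close to one between the speech and laryngogram series would be consistent with the close agreement described above, although it does not replace the time-series treatment announced as the next stage.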
References
C. Chatfield (1984), The Analysis of Time Series: An Introduction (Chapman & Hall, London).
J.F. Deem, W.H. Manning, J.V. Knack and J.S. Matesich (1989), "The automatic extraction of pitch perturbation using microcomputer: Some methodological considerations", J. Speech Hearing Res., Vol. 32, pp. 689-697.
V.L. Heiberger and Y. Horii (1982), "Jitter and shimmer in sustained phonation", Speech and Language, Advances in Basic Research and Practice, ed. by N.J. Lass (Academic Press, New York), Vol. 7, pp. 299-332.
W. Hess and H. Indefrey (1987), "Accurate time-domain pitch determination of speech signal by means of a laryngograph", Speech Communication, Vol. 6, No. 1, pp. 55-68.
Y. Horii (1979), "Fundamental frequency perturbation observed in sustained phonation", J. Speech Hearing Res., Vol. 22, pp. 5-19.
Y. Horii (1982), "Jitter and shimmer differences among sustained vowel phonations", J. Speech Hearing Res., Vol. 25, pp. 12-14.
Y. Koike, A. Takahashi and T.C. Calcaterra (1977), "Acoustic measures for detecting laryngeal pathology", Acta Otolaryngologica, Vol. 84, pp. 105-117.
P. Lieberman (1963), "Some acoustic measures of the fundamental periodicity of normal and pathologic larynges", J. Acoust. Soc. Amer., Vol. 35, pp. 344-353.
J.H. McClellan, T.W. Parks and L.R. Rabiner (1979), "FIR linear phase filter design program", in Programs for Digital Signal Processing (IEEE Press, New York).
S. Nittrauer, R.S. McGowan, P.H. Milenkovic and D. Beehler (1990), "Acoustic measurements of men's and women's voices: A study of context effects and covariations", J. Speech Hearing Res., Vol. 33, pp. 761-775.
R.F. Orlikoff and R.J. Baken (1989), "Fundamental frequency modulation by the heartbeat: Preliminary results and possible mechanisms", J. Acoust. Soc. Amer., Vol. 85, pp. 888-893.
N.B. Pinto and I.R. Titze (1990), "Unification of perturbation measures in speech signals", J. Acoust. Soc. Amer., Vol. 87, No. 3, pp. 1278-1289.
J. Schoentgen (1989), "Jitter in sustained vowels and isolated sentences produced by dysphonic speakers", Speech Communication, Vol. 8, No. 1, pp. 61-79.
I. Titze, Y. Horii and R. Scherer (1987), "Some technical considerations in voice perturbation measurements", J. Speech Hearing Res., Vol. 30, pp. 252-260.