Speech Communication 8 (1989) 61-79 North-Holland
JITTER IN SUSTAINED VOWELS AND ISOLATED SENTENCES PRODUCED BY DYSPHONIC SPEAKERS Jean SCHOENTGEN
Institute of Phonetics, University of Brussels, 1050 Brussels, Belgium
Received 11 March 1988; Revised 17 October 1988
Abstract. Jitter measures are known to discriminate between normal and dysphonic speakers. We investigated the influence of (i) speech material type (i.e. sustained vowels vs. isolated sentences); (ii) phonetic vowel quality; (iii) preprocessing; and (iv) speaker sex on the discriminatory performance of two pitch perturbation measures. The aim was to learn about the influence of experimental conditions on the output of dysphonic voice analysis systems. Two comparative studies were carried out. The first showed that, as far as inter-vowel quality differences were concerned, all significant differences could be related to the idiosyncratic behaviour of several preprocessing schemes with reference to vowel quality. Intrinsic differences were cancelled out by normalizing absolute jitter by the average fundamental period. As a rule of thumb, preprocessing routines were more successful the further F0 and F1 were apart. With all other experimental factors held constant, significant differences persisted between several preprocessing schemes, e.g. analysis by linear prediction failed on female voices, and low-pass filtering eliminated so much fine signal detail that discrimination between normal and dysphonic voices became impossible. In a second experiment, jitter values extracted from connected speech did not discriminate between normal and dysphonic speakers any more efficiently than values calculated from sustained vowels. As far as our corpora were concerned, no intrinsic superiority in the discrimination performance of connected speech as opposed to sustained vowels could be found. In the case of running speech, absolute microperturbation values appeared to be higher during inter-segment transitions and during voice onset and offset.

Keywords. Jitter, dysphonic speakers.

0167-6393/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)

1. Introduction
In the present article we study the microperturbations of the fundamental period in normal speakers and in speakers suffering from laryngeal pathologies. By microperturbations, we mean slight cycle-to-cycle variations in the duration of the fundamental period (jitter). The purpose of our study is to examine how the possible configurations of an analysis scheme affect discrimination performance. It is part of an investigation whose aim is to describe dysphonic voices by means of acoustic parameters extracted from speech signals. An analysis scheme for dysphonic voices can be subdivided into four stages: (i) speech signal recording and sampling; (ii) acoustic signal preprocessing; (iii) feature extraction from the processed signal; (iv) data reduction and discrimination performance evaluation. The attribute upon which we concentrate here belongs to the period perturbation feature family. Jitter has been studied for some twenty-five years and it is believed to be useful in the characterization of the dysphonic voice signal. However, the overall discrimination performance between normal and dysphonic speakers depends not only on acoustic parameters as such but also on other levels, including the type of pathologies examined. We focus here on the dependency of pitch perturbation values on speech material and preprocessing methods. These aspects have rarely been studied systematically. In other words, (i) does jitter differ from one vowel quality to the next and, therefore, is one quality more likely to discriminate between normal and dysphonic speakers than another? (ii) What characterizes jitter in sustained vowels compared to connected speech, and what type of production is more likely to disclose laryngeal pathologies? (iii) How do jitter values vary with speech signal preprocessing schemes? (iv) How does speaker sex interact with the other factors?

More precisely, we present here the results of two experiments. In the first, one set of normal and one set of dysphonic speakers sustained the three French vowels [a], [e] and [i]. Each vowel was then preprocessed in three ways: (i) by band-pass filtering followed by signal envelope extraction and finite difference computations; (ii) by low-pass filtering; and (iii) by whitening with an inverse linear predictive filter. The result of whitening is a flattening of the spectral envelope of the inverse filter output signal. Procedures (ii) and (iii) are standard in digital speech processing, and the study of their performance on dysphonic speech signals is therefore interesting for reference purposes. Procedure (i) is new and was introduced to overcome shortcomings in the other two. The same jitter measure was computed nine times for each speaker (3 vowels × 3 processing routines). The results were then statistically compared in order to highlight inter-vowel and inter-processing differences.

In a second experiment a set of normal and a set of dysphonic male and female speakers recorded a short sentence [set yn saga] and the same vowels as in the first experiment. The connected speech signal was processed once using the preferred method of experiment 1. A total of three jitter values were extracted for each speaker. These were again compared in order to study the differences in discriminatory power between sustained vowels and connected speech. In addition, inter-sex differences are discussed.

In practice, the three options (i) spoken texts (40 seconds' duration or more); (ii) isolated sentences; (iii) sustained vowels, have been used in dysphonic voice analysis systems, with a preference for sustained vowels (Hirano, 1981). The respective duration of the analysis interval is then equal to thousands, hundreds or tens of fundamental periods. The computer memory need is proportional to (i) the duration of the analysis interval; (ii) the bandwidth of the signal; and (iii) the number of discretisation levels. Therefore the choice of speech material has a direct impact on the technical feasibility of a given voice analysis system. On a more fundamental level, some researchers consider the study of connected speech to be more fruitful than the study of sustained
vowels (Askenfelt and Hammarberg, 1980, 1981). They fear that the latter are less discriminatory than the former. One of the first to draw attention to cycle variations in dysphonic voices was Lieberman (1963). Since then several studies have been carried out, mostly on sustained vowels. Table 1 summarizes results published by different authors. Most studies agree on the order of magnitude of jitter. On the other hand, no agreement has yet been reached as far as inter-vowel differences and preprocessing methods are concerned. Among five recent studies looking into between-vowel differences, three found that there were no systematic differences (Koike et al., 1977; Horii, 1982; Kasuya and Kobayashi, 1983), one concluded that jitter was lowest for [a] (Ramig and
Table 1
Jitter in sustained vowels. Summary of published results

Author                 Year    Feature                     Typical values                            Processing                                   Vowel
Hollien                (1973)  jitter factor               0.3-1.3%                                  optical processing                           [a]
Koike                  (1973)  PPQ                         0.15-0.85%                                throat micro. + manual measur.               [a]
Kitajima et al.        (1975)  magnitude of perturbation   0.11-0.21 (M), 0.15-0.24 (F) semitones    pretracheal voice recording + peak picking   [a]
Koike et al.           (1977)  frequency PQ                0.4-1%                                    contact micro. + semi-manual peak picking    [a]
Davis                  (1978)  PPQ                         0.2-0.6%                                  residue inverse filtering + peak picking     [a]
Horii et al.           (1979)  mean jitter                 30-45 µs                                  SEARP (a), accelerometer                     [i]
Horii                  (1982)  jitter factor               0.7%                                      SEARP                                        eight vowels
Kasuya et al.          (1983)  PPQ                         0.1-0.5%                                  contact micro. + peak picking                [e]
Rasch                  (1983)  mean jitter                 3-50 cents                                LP-filtering + zero-crossings                eight vowels
Ramig and Ringel       (1983)  jitter factor               0.4-0.9%                                  SEARP                                        [a] [i] [u]
Ludlow et al.          (1985)  mean jitter                 47 µs (M), 33 µs (F)                      analogue preprocessing                       [a]
Klingholz and Martin   (1985)  jitter factor               0.3-1.6%                                  rectification + peak picking                 [a]
Imaizumi               (1985)  PPQ (b)                     0.3%                                      LP-filtering + peak picking                  [e]

(a) Horii (1975). (b) Period Perturbation Quotient.
Ringel, 1983) and one (Rasch, 1983), that it was highest for [a]. As far as recording is concerned, apart from the speech signal, mechanical vibrations picked up at the throat or neck have been used and, occasionally, electroglottograph signals, which are related to the vocal fold contact area. The contact microphone signal has practical advantages over the conventional microphone signal in so far as it is less corrupted by environmental noise, yields a simple waveform which does not need any further preprocessing and is relatively free from the influences of the supra-glottal cavities. In Table 1, only results obtained by aerial and contact microphones are displayed. Comparisons between results obtained by different pick-up methods have to be made with caution (Horii, 1982).

Studies of jitter in isolated sentences are relatively rare (Lieberman, 1963; Lebrun and Hasquin, 1971; Hecker and Kreul, 1971). Lieberman proposed a so-called pitch perturbation factor which counts the percentage of periods exhibiting a perturbation greater than 0.5 ms. He believed that perturbations of less than 0.5 ms are due to phase changes occurring during the evolution of the vocal tract transfer function. He found furthermore that the lower the speaker's pitch, the higher the period perturbations and, for comparable pitch, that pathological speakers may exhibit greater perturbations than normal speakers. These findings have since been confirmed for both connected speech and isolated vowels. Hecker and Kreul proposed a directional perturbation. They considered measurements on five patients suffering from laryngeal cancer and five normal speakers.

Several authors worked with spoken texts. Some made use of the Long Time Average Spectrum (LTAS). The LTAS fails to detect the fine temporal details of the speech signal and, therefore, cannot characterize any cycle-to-cycle perturbation of either period or amplitude. Those who have made use of period-to-period measurements (Laver et al., 1985; Gubrynowicz et al., 1980; Askenfelt and Hammarberg, 198?) have established statistical measures of period and period perturbation distributions. One argument in favour of spoken texts is that dysphonic speakers may compensate for their laryngeal deficits during the phonation of sustained vowels or short sentences, but may not be able to do so when producing utterances of longer duration.

It is important to stress that most authors, whatever their choice of speech material, considered that the results achieved by their respective analysis systems confirmed the feasibility of separating normal and dysphonic subjects at a reasonable level of performance. One exception is Ludlow et al. (1985), who took a critical stance on screening.
2. Methods and materials (experiment 1)
2.1. Subjects
Thirty-three male speakers with no known laryngeal problems provided the control signals. Their ages were between 20 and 25, and all were French-speaking university students. They were instructed to sustain the three vowels [a], [e] and [i] at a comfortable pitch and loudness level, with each vowel separated by a pause. The dysphonic group was made up of 30 male speakers. One speaker had been recorded more than once, so the total number of signals was 32. The interval between two successive recordings by this same speaker was at least a week. The 32 recordings were considered distinct items during statistical analysis. All the dysphonic subjects were patients of the O.R.L. department of the St-Pierre University Hospital in Brussels. The speakers' ages were as follows: 13 were below 40, 13 between 40 and 60 and 4 were over 60. Like the normal speakers, they sustained the three vowels at comfortable pitch and loudness levels. Table 2 lists the speakers' numbers and gives a short description of their pathologies. The signals marked by an asterisk were produced by the same speaker. The control speaker sample is akin to a standard provided by young voices not yet affected by smoking, drinking and environmental factors. As far as the dependency of jitter on chronological age is concerned, published results are contradictory (see for example Ramig and Ringel, 1983). In fact, Ludlow et al. (1985) compared normal and dysphonic subjects with very dissimilar chronological age distributions. A factorial analysis did not reveal any significant differences in jitter rate due to chronological age, drinking, smoking or maximum phonation time. Relevant factors were fundamental frequency, intensity and the number of voice breaks during a maximum phonation time task.
Table 2
Experiment 1: Dysphonic male speakers

(1) Hyperhaemia of right vocal fold, imperfect adduction
(2) Carcinoma of right vocal fold and anterior commissure
(3) Paralysis left vocal fold
(4) Chronic laryngitis, Reinke's oedema*
(5) Paralysis left vocal fold
(6) Carcinoma both vocal folds
(7) Hypotonia during adduction, glottal slit
(8) Light hyperhaemia (allergic)
(9) Hypertrophy of the ventricular bands, light oedema on left vocal fold, dryness of the mucosa
(10) Light oedema right vocal fold
(11) Epithelioma left hemilarynx
(12) Tumor right ventricular band, lymphoma, margin of right vocal fold slightly irregular
(13) Chronic laryngitis
(14) Carcinoma pharynx
(15) Sequelae of left cordectomy (leukoplakia), post-surgical status almost normal
(16) Slightly hyperhaemic residual polypus of anterior commissure on right vocal fold
(17) Laryngitis, hyperhaemia of left vocal fold
(18) Synechia anterior part of vocal folds
(19) Laryngitis, right vocal fold slightly hyperhaemic
(20) Chronic laryngitis, Reinke's oedema*
(21) Post-traumatic haematoma of the ventricular bands
(22) Chronic laryngitis, Reinke's oedema*
(23) Dysfunctional dysphonia
(24) Spread oedema post-radiotherapy
(25) Extended carcinoma T4N3
(26) Spread papillomatosis
(27) Chronic laryngitis
(28) Acute laryngitis
(29) Acute laryngitis
(30) Incomplete vocal mutation
(31) Incomplete vocal mutation
(32) Hyperhaemia of the free margins of the vocal folds, posterior slit

* Signals (4), (20), and (22) were produced by the same speaker.
2.2. Data acquisition
The normal subjects sustained the vowels twice in a row, and the dysphonic subjects three times in a row. The duration of each vowel was typically 2 seconds. The recordings were made using a Nagra 4.2 recorder, at a recording speed of 38 cm/s. The microphone (Shure 5455) was placed 40 cm from the mouth of the control speakers, and between 20 and 40 cm for the dysphonic speakers, their voices being in certain cases more feeble. The recordings were made in a sound-proof booth. Of all the vowels sustained, the longest in duration was retained for digitization. Those shorter than 0.7 s were discarded. Before analogue-to-digital conversion, the signals were low-pass filtered through an elliptic filter (Blesser, 1978). The sampling frequency was equal to 10 kHz and the number of discretization levels was equal to 4096 (12 bits). The signal level was adjusted so as to use the whole dynamic range of the digitizer. All the data acquisition and analysis programmes were implemented on a DEC 11/60 computer. The analysis methods were fully automated. Jitter measures require a minimum time resolution between 0.1 and 0.01 ms, whereas the sampling interval at 10 kHz is 0.1 ms. For a sampled signal, the Fourier spectrum repeats itself periodically in the frequency domain. Improving time resolution is equivalent to increasing the frequency distance between periodic spectra. This can be achieved either by oversampling (Horii, 1979) or by interpolation (Davis, 1981; Kasuya et al., 1986). We chose parabolic interpolation in the case of the residue signal and the envelope variation criterion, and linear interpolation in the case of the low-pass filtered signal.
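To illustrate how interpolation can refine the location of an event below the 0.1 ms sampling interval, the following is a minimal sketch of parabolic interpolation around a sampled peak. It is not the original DEC implementation: the helper name `parabolic_peak` and the Gaussian test pulse are assumptions made purely for illustration.

```python
import math

FS = 10_000  # sampling frequency in Hz, as stated in the text


def parabolic_peak(samples, i):
    """Refine a local maximum at index i (hypothetical helper).

    Fits a parabola through samples[i-1], samples[i], samples[i+1]
    and returns the fractional index of its vertex.
    """
    y0, y1, y2 = samples[i - 1], samples[i], samples[i + 1]
    denom = y0 - 2.0 * y1 + y2
    if denom == 0.0:  # three collinear points: no refinement possible
        return float(i)
    return i + 0.5 * (y0 - y2) / denom


# Illustrative test signal (an assumption, not data from the study):
# a smooth pulse whose true maximum lies between two samples.
true_peak = 12.3  # in samples, i.e. 1.23 ms at 10 kHz
pulse = [math.exp(-0.5 * ((n - true_peak) / 4.0) ** 2) for n in range(25)]

# Coarse estimate: index of the largest sample (interior points only).
coarse = max(range(1, len(pulse) - 1), key=lambda n: pulse[n])
# Refined estimate: vertex of the parabola through the three top samples.
fine = parabolic_peak(pulse, coarse)

print(f"coarse peak:  {coarse / FS * 1e3:.3f} ms")
print(f"refined peak: {fine / FS * 1e3:.3f} ms")
```

The coarse estimate is quantized to the sampling grid, while the refined estimate lands close to the true position of the pulse maximum; how close depends on how well a parabola approximates the actual peak shape.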
2.3. Preprocessing
We processed the speech signal in three distinct manners in order to compare their influence on overall discriminatory performance. Signal processing considerably facilitates pitch period extraction (Hess, 1982) and improves the discriminatory performance of a complete feature extraction procedure (Schoentgen, 1985). In the first experiment we applied three different routines: whitening by linear predictive inverse
filtering, low-pass filtering and band-pass filtering followed by signal envelope extraction and finite difference computation. Since the first two are standard procedures explained in textbooks on speech processing (e.g. Rabiner and Schafer, 1978), we will only describe them briefly here. The third procedure is new and is explained in detail in an appendix.
2.3.1. Whitening
The residue signal was obtained from the speech signal by inverse filtering using a linear predictive filter (Makhoul, 1975). The number of predictor coefficients was fixed at 14 (zeroth included). We chose an analysis interval in the stationary part of the vowel; i.e. the analysis routine discarded the first 270 ms and retained the following 450 ms, which were Hamming-windowed before the autocorrelation method was applied in order to compute the linear prediction filter coefficients. The raw speech signal was then fed into the inverse linear predictive filter to obtain the residue signal. This signal reflects how accurately the linear predictive filter models the speech signal over time. It assumes a prominent peak in the vicinity of glottal closure. The pitch period duration was taken to be the interval between two successive peaks extracted by a peak picking routine put forward by Davis (1976). The analysis procedure described here closely follows the one proposed by Davis and described in considerable detail in Davis (1976, 1978, 1979, 1981).
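The whitening steps described above (windowing, autocorrelation method, inverse filtering, peak picking) can be sketched as follows. This is a simplified illustration under stated assumptions, not Davis's routine: the synthetic vowel (an impulse train shaped by two resonances), the helper names and the naive "largest sample per interval" peak picker are all hypothetical stand-ins.

```python
import math

FS = 10_000   # sampling frequency in Hz, as in the text
ORDER = 13    # 14 predictor coefficients when the zeroth one is counted


def resonator(x, freq, bandwidth, fs=FS):
    """Two-pole resonator, used only to build a synthetic test vowel."""
    r = math.exp(-math.pi * bandwidth / fs)
    c1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    c2 = -r * r
    y = []
    for n, v in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(v + c1 * y1 + c2 * y2)
    return y


def hamming(n):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(x, max_lag):
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Inverse-filter coefficients a[0..order], with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        k = -(r[m] + sum(a[j] * r[m - j] for j in range(1, m))) / err
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a


def inverse_filter(x, a):
    """Residue e[n] = sum_j a[j] * x[n - j]."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


def peak_index(e, lo, hi):
    """Naive peak picker: largest absolute residue sample in [lo, hi)."""
    return max(range(lo, hi), key=lambda n: abs(e[n]))


# Synthetic stand-in for a sustained vowel (F0 = 125 Hz, two formants).
period = FS // 125
excitation = [1.0 if n % period == 0 else 0.0 for n in range(8000)]
speech = resonator(resonator(excitation, 700.0, 100.0), 1200.0, 120.0)

# Discard the first 270 ms, keep the following 450 ms, Hamming-window it.
start, length = 270 * FS // 1000, 450 * FS // 1000
frame = speech[start:start + length]
windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]

a = levinson_durbin(autocorrelation(windowed, ORDER), ORDER)
residue = inverse_filter(speech, a)   # raw speech through the inverse filter

p1 = peak_index(residue, 4040, 4120)
p2 = peak_index(residue, 4120, 4200)
print(f"period from successive residue peaks: {(p2 - p1) / FS * 1e3:.2f} ms")
```

On real recordings the peak positions would additionally be refined by interpolation, and a more careful peak picker would be needed for irregular voices.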
2.3.2. Low-pass filtering
Many pitch analyzers contain a low-pass filtering stage. Since pitch analyzers have been used for F0 and microperturbation measurement (e.g. Gubrynowicz et al., 1980) and since the very popular throat microphone has much in common with LP-filtering, it seemed worthwhile to test discriminatory performance on speech signals processed in such a manner. In practice, we implemented a low-pass elliptic filter bank with successive cut-off frequencies at 120 Hz, 150 Hz, 200 Hz, 250 Hz, 300 Hz and 350 Hz respectively. The transition between the cut-off and reject frequencies was 70 Hz wide. The dynamic rejection of the filters was about 55 dB. The speech signal was high-pass filtered at 70 Hz in order to remove any 50 Hz hum. The output of such a filter bank was six quasi-sinusoids, each an estimate of the first harmonic of the speech signal. In each sinusoid, a succession of iso-directional zero-crossings was detected and stored in a vector. The exact zero-crossings were determined by linear interpolation. Since there was a set of crossings from the (+) to the (-) and from the (-) to the (+) signal values, it was possible to compute two pitch perturbation quotients for each filter output. Of the total of 12 values, the minimum one was chosen to represent pitch perturbation: the minimum was always realized by the output of the filter which achieved the best estimate of the speaker's fundamental. This F0 extraction scheme is very robust. It is very similar in principle to the formant tracking method used by Niederjohn and Lahat (1985). It never failed on any of the sustained vowel signals that we examined. This fact is not irrelevant to the task in hand; we discovered that for many jitter measurement schemes, at least a small part of the high jitter values was due to wrong F0 values output by F0 extractors unable to cope with not quite regular sustained vowel signals. It should be noted that the computation of two pitch perturbation values for each sinusoid renders the output independent of the absolute phase of the signal, an option which in practice considerably simplifies the experimental set-up.
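The zero-crossing step can be sketched as below. The elliptic filter bank itself is omitted: a clean 130 Hz sinusoid stands in for a single filter output, which is an assumption made for illustration only. The function names are hypothetical, and `ppq` follows the period perturbation quotient of Section 2.4 with k = 5, expressed in percent.

```python
import math

FS = 10_000  # sampling frequency in Hz, as in the text


def zero_crossings(x, fs, rising=True):
    """Times (s) of same-direction zero crossings, by linear interpolation."""
    times = []
    for n in range(len(x) - 1):
        a, b = x[n], x[n + 1]
        crosses = (a < 0.0 <= b) if rising else (a >= 0.0 > b)
        if crosses:
            # The straight line through (n, a) and (n + 1, b) is zero here.
            times.append((n + a / (a - b)) / fs)
    return times


def periods_from(times):
    return [t2 - t1 for t1, t2 in zip(times, times[1:])]


def ppq(periods, k=5):
    """Period perturbation quotient in percent (k-point local average)."""
    n = len(periods)
    if n < k:
        raise ValueError("need at least k periods")
    m = (k - 1) // 2
    dev = 0.0
    for i in range(n - k + 1):
        local_mean = sum(periods[i:i + k]) / k
        dev += abs(local_mean - periods[i + m])
    return 100.0 * (dev / (n - k + 1)) / (sum(periods) / n)


# Stand-in for one filter-bank output (assumption): 0.5 s of a 130 Hz tone.
F0 = 130.0
signal = [math.sin(2.0 * math.pi * F0 * n / FS + 0.3) for n in range(FS // 2)]

# One PPQ per crossing direction, over the first 45 periods; keep the minimum.
candidates = []
for rising in (True, False):
    periods = periods_from(zero_crossings(signal, FS, rising))[:45]
    candidates.append(ppq(periods))
best_ppq = min(candidates)

rising_periods = periods_from(zero_crossings(signal, FS, True))[:45]
mean_period = sum(rising_periods) / len(rising_periods)
print(f"mean period: {mean_period * 1e3:.4f} ms, minimum PPQ: {best_ppq:.2e} %")
```

With a full filter bank, the same two values would be computed for each of the six outputs and the smallest of the twelve retained, as described in the text.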
2.3.3. Envelope variation criterion (EVS)
The extraction of the so-called envelope variation criterion was achieved with the aid of a signal processing procedure which includes (i) a band-pass filter; (ii) a signal envelope estimator; (iii) a smoothing window; (iv) a finite difference calculator (Jospa, 1982; Jospa and Schoentgen, 1982; Jospa, 1984). The width of the band-pass filter was chosen so as to allow for the separation of two neighbouring formants. The transition band of the filter has a width of 300 Hz. The EVS computational procedure does not prescribe any value for the centre frequency (Fc) of the filter. In an earlier experiment we successively set Fc at values in the vicinity of the first three formants, since the dynamic range of the signal is greatest in the formant regions. Our results show that
Table 3
Effective smoothing window width (N) and difference interval (L) as functions of vowel quality (experiment 1). The durations are expressed in numbers of samples. The sampling frequency was 10 kHz

                                       [a]   [e]   [i]
Effective smoothing window width N:    17    21    39
Difference interval L:                 21    25    43
good overall performance is achieved when the band-pass filter is centred around the first formant. This was the case for all the experiments described in the present article. Smoothing window length (2N + 1) and finite difference interval L vary with centre frequency Fc. Table 3 shows the values of N and L as they were set for the three vowels. The rationale underlying the EVS extraction and further computational details are given in Appendix 1.
2.4. Jitter measure
Recent jitter measures include pitch period normalization and intonation pattern compensation. One of the more sophisticated measures is the so-called period perturbation quotient, which has been extensively studied by Davis (1976) and Koike (1977). We adopted this jitter measure in our study. Its definition is:

$$
\mathrm{PPQ} = \frac{\dfrac{1}{N-(k-1)} \displaystyle\sum_{i=1}^{N-(k-1)} \left| \dfrac{1}{k} \sum_{r=1}^{k} P_{i+r-1} - P_{i+m} \right|}{\dfrac{1}{N} \displaystyle\sum_{i=1}^{N} P_i}
$$
where 2m = k - 1 and N = 45 is the number of periods. Long-term trends in fundamental period values due to intonation are removed by comparing individual period values to a local average of adjacent periods (k = 5). We tested the overall precision of our analysis routines by applying them to rigorously stable synthetic [a] and [i] vowels sampled at 10 kHz. With fundamental frequencies between 100 Hz and 200 Hz, typical period perturbation quotient values (in %) were 10^-5 (Residue), 10^-4 (EVS) and 10^-? (LPF).
2.5. Statistical analysis
When opting for a statistical analysis method, it must be borne in mind that the jitter value distributions to be compared are neither Gaussian nor of equal variance. Furthermore, an acoustic parameter is only helpful in characterizing dysphonic voices when normal and dysphonic populations do not share exactly the same score interval. Accordingly, we settled for the (non-parametric) Mann-Whitney U test in our comparison of normal and dysphonic samples. The Mann-Whitney U test is applicable to independent samples and is sensitive to differences in the central tendencies of sample distributions (Siegel, 1956). In addition to laryngeal status, we had to consider differences arising from vowel quality and signal processing. Their influence on jitter values was studied by comparing samples provided by the same speaker set (either normal or dysphonic). We therefore applied the Friedman two-way Analysis of Variance by Ranks test (Siegel, 1956).
2.6. Merit factor
In the second part of our discussion we compare the discrimination performance of different analysis schemes on identical corpora. This task becomes difficult when based on statistical probability values alone. In fact, the outcome of a statistical test is not an appropriate quantitative expression of discriminatory performance, since probabilities depend explicitly on the number of samples examined, and since probability values drop dramatically in a non-linear fashion when the separation between two corpora becomes more and more distinct. Both mathematical properties are inconvenient as far as the evaluation of discriminatory power is concerned. A mathematical expression of the latter has to (i) be independent of the total number of samples; (ii) vary linearly with detection and false-alarm rates; (iii) assume a value between 0 and 1. We defined a so-called quality factor which satisfies these requirements (Schoentgen, 1982; 1988). It is a kind of merit factor, i.e. a factor which evaluates the outcome of a clustering analysis. The simplest way of putting a value to the quality factor is to allocate to one class all those speakers who have a feature value greater than or equal to a given threshold and all the others to a second class. When the threshold successively assumes all the feature values realized in the mixed corpus, the quality factor takes on a maximum value for one or several of them. This maximum is a measure of the discriminatory performance achieved.
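The threshold sweep described above can be sketched as follows. The exact formula of the quality factor is given in the cited references and not in this section, so the sketch assumes, purely for illustration, that the factor equals the detection rate minus the false-alarm rate; this stand-in meets the three stated requirements once maximized over thresholds, but it should not be read as the author's definition. The function name and the PPQ values are invented for the example.

```python
def merit_factor(control, dysphonic):
    """Maximum over thresholds of (detection rate - false-alarm rate).

    ASSUMPTION: this difference is a stand-in for the quality factor of
    the cited references, chosen because it is independent of sample
    count, linear in both rates, and lies between 0 and 1 at its maximum.
    Speakers with a feature value >= threshold are labelled dysphonic.
    """
    best_q, best_threshold = 0.0, None
    # The threshold successively assumes every value in the mixed corpus.
    for threshold in sorted(set(control) | set(dysphonic)):
        detection = sum(v >= threshold for v in dysphonic) / len(dysphonic)
        false_alarm = sum(v >= threshold for v in control) / len(control)
        q = detection - false_alarm
        if q > best_q:
            best_q, best_threshold = q, threshold
    return best_q, best_threshold


# Invented PPQ values in %, for illustration only (not data from the study).
control_ppq = [0.25, 0.30, 0.35, 0.40, 0.69]
dysphonic_ppq = [0.33, 0.71, 1.32, 2.03, 3.57]

q, threshold = merit_factor(control_ppq, dysphonic_ppq)
print(f"merit factor {q:.2f} at threshold {threshold} %")
```

Because both rates are fractions of their own class, doubling the size of either corpus without changing its distribution leaves the result unchanged, which is the property that plain test probabilities lack.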
Table 6 Merit factor values for three vowels and three processing methods (Re,; = residue signal: EVS = envelope variation criterion: LPS = low-pass filtered signal). The merit factor varies linearly with discrimination performance between control and dysphonic speakers. A value of 1 means perfect separation
3. Results
~vhen p > 0.05. Table 5 gives the same information for dysphonic speakers. Discrimination per~brmance has been e v a l u a t e d with the help of the merit factor and U-test probabilities. Merit factor values are listed in T a b l e 6 for all cases.
Results are displayed in Tables 4 to 6. Table 4 shows the median period perturb~titm quotient values realized bv the control speakers for three vowel qualities and three preprocessing methods. The range of values spanning t~6% of the total and centred on the median are given in parentheses. C o n v e n t i o n a l l y , differences b e t w e e n rows or coktmns are considered to be due to chance Table 4 Median period perturbation quotient values(PPO) in % for male control speakers for three vowelsand three processing methods IRes = residue signal: EVS = envelopevariation criterion: LPS = low-pass filtered signall. Valuesgivenin parenthesisshow the range spanned by of the speakers, i.e. ~ above and ~below the median [al
lel
[ii
Re.,, 0.35 {0.26~k69) 0.25 (0.18-0.69} I.I.26(0.19-1.08) EVS 0.35 10.24-1L551 0.42 10.26-tl.621 11.1;701.59-1.281 p < O.0lll kPS 0.47 (11.2841.S51 0.49 111.211-0.71t 11.37(0.13-0.661 p <0.ll01 p < 0.111 p < O.IlO|
Table 5
Median period perturbation quotient values (PPQ) in % for male dysphonic speakers for three vowels and three processing methods (Res = residue signal; EVS = envelope variation criterion; LPS = low-pass filtered signal). Values given in parentheses show the range spanned by two thirds of the speakers, i.e. one third above and one third below the median

        [a]                [e]                [i]
Res     2.03 (0.71-3.57)   1.32 (0.33-3.42)   1.39 (0.43-3.63)
EVS     1.10 (0.39-6.28)   1.04 (0.43-5.72)   1.82 (0.68-4.35)
LPS     0.51 (0.25-1.08)   0.36 (0.20-0.76)   0.34 (0.17-0.76)
        p < 0.001          p < 0.001          p < 0.001

Table 6

        Res     EVS     LPS
[a]     0.66    0.56    0.16
[e]     0.50    0.56    0.22
[i]     0.50    0.44    0.09

3.1. Period perturbation quotient values

For the control speakers, the PPQ medians were comparable for all three vowel qualities and for all three processing methods (0.25%-0.49%), with one exception: [i] preprocessed via envelope variation (0.87%). In accordance with results reported elsewhere (Table 1) they were typically equal to several tenths of a percent. No simple rule could be given, however, for the speakers suffering from a laryngeal pathology. In the case of our corpora, median perturbation values were several times higher (0.36%-2%). They varied from one processing method to another: "normal" values for the low-pass filtered speech signal, and higher values for the residue and envelope variation signals. A peculiarity of the latter was that, in isolated instances involving very abnormal voices, the period perturbation quotient appeared severely boosted when compared to the residue and LP-filtered signals. Discriminatory performance, though, is not adversely affected by such run-away behaviour in heavily perturbed voices.
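Tables 4 and 5 summarize each cell as a median together with the range spanned by two thirds of the speakers. A minimal Python sketch of such a summary is given here for illustration. The function name, the use of the 1/6 and 5/6 quantiles as range limits, and the interpolation rule of `statistics.quantiles` are assumptions, and the sample values are invented rather than taken from the study.

```python
import statistics

def median_and_central_range(values):
    """Return (median, low, high), where [low, high] spans the central two
    thirds of the values: one third below and one third above the median."""
    # Assumption: the range limits are the 1/6 and 5/6 quantiles.
    cuts = statistics.quantiles(values, n=6, method="exclusive")
    return statistics.median(values), cuts[0], cuts[-1]

# Invented PPQ values in %, for illustration only.
ppq_percent = [0.18, 0.22, 0.25, 0.26, 0.31, 0.35, 0.38, 0.42, 0.47, 0.55, 0.69]
med, low, high = median_and_central_range(ppq_percent)
print(f"{med:.2f} ({low:.2f}-{high:.2f})")
```

A rank-based summary of this kind is insensitive to the few very abnormal voices that inflate some of the values, which is presumably why medians rather than means are reported.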
3.2. Fundamental frequency

For the control speakers the median fundamental frequency was equal to 136 Hz, 138 Hz and 141 Hz for [a], [e] and [i] respectively. Dependency of F0 on phonetic vowel quality was to be expected from what is known about the intrinsic fundamental frequency of vowels (Rossi and Autesserre, 1981; Guérin and Boë, 1980). For the dysphonic speakers the median fundamental frequency was equal to 135 Hz, 133 Hz and 137 Hz for [a], [e] and [i] respectively. Differences between control and dysphonic corpora were not significant as far as F0 was concerned. In the case of [e] the lower than expected F0 was due to an experimental artifact: [e] was produced last, and as a consequence several speakers lowered their fundamental frequency.

3.3. Discrimination between normal and dysphonic speakers
All the significant differences between the two speaker classes were removed when the speech signal was processed by means of the low-pass filter bank described above. However, on the residue and envelope variation signals, control and dysphonic speakers yielded significantly dissimilar perturbation values (Table 6). More precisely, overall performance was somewhat better for the residue signal than for the envelope variation signal and somewhat better for [a] than for [e] or [i]. The degree to which these observations apply to female voices remains to be discussed in the light of the results of experiment 2, reported below. Moreover, the contrast between control and dysphonic speakers was enhanced when more pathologies inducing asymmetric changes (e.g. paralyses and tumors of the vocal folds) were included among the dysphonic samples (Schoentgen, 1985). As a rule of thumb, discriminatory performance was proportional to the number of such pathologies present. Such behaviour was to be expected from results published elsewhere (Isshiki et al., 1972; 1976). The present study will not dwell further on these matters.

3.4. Inter-vowel differences
Discrepancies between vowel qualities were only clear-cut for the normal speakers in the case of the envelope variation and the low-pass filtered signal. For the latter, relative jitter was lowest and for the former highest for [i]. For the dysphonic speakers no significant differences were observed; the general trend was comparable, though.

3.5. Inter-processing differences
Apart from [a], between-processing differences were systematic for all vowels for both control and dysphonic speakers (Tables 4 and 5). Most of these differences could be traced to a few salient traits of the signal processing operations. In particular, the low-pass filtered signal gave rise to significantly lower perturbation values in the case of the dysphonic speakers. As a result, jitter was equally low for both speaker classes (Table 5). For the normal speakers, the envelope variation signal calculated from [i] was the most conspicuous. The cause of this was the obligation to choose different analysis parameter values for different vowel qualities (Table 3).
4. Discussion
Very highly significant jitter differences between normal and dysphonic subjects were observed when jitter was computed from the residue signal or the envelope variation signal. For the low-pass filtered signal, differences were not significant, whatever the vowel color. A possible explanation is that in the case of the LP-filtered speech signal most of the information about periodicity perturbations was lost. In fact, when the fundamental period of a strictly periodic signal is slightly perturbed, the width of its harmonics increases. Low-pass filtering gets rid of all the harmonics apart from the fundamental, i.e., the one which only carries information concerning the coarsest perturbations. As a result, discrimination between normal and perturbed signals becomes impossible. This explains why pitch analyzers which incorporate heavy LP-filtering stages fail to differentiate between sustained vowels produced by normal and dysphonic speakers (Gubrynowicz, 1983). This does not necessarily imply that sustained vowels are intrinsically less discriminatory than connected speech. It was in order to obtain a grasp of these matters that we conducted a comparative study of vowels and isolated sentences, the results of which are discussed below.

Two mechanisms must be held responsible for the significant differences which persist between signal handling routines for a given corpus and for a given phonetic vowel quality. The first has been touched upon above: for the dysphonic speakers, the significant differences between the residue and the envelope variation signals on the one hand, and the LP-filtered signal on the other, can be explained by a loss in the filtering process of part of the information dealing with signal irregularities. As a consequence, "dysphonic" signals take on "normal" perturbation values. This emphasizes the need not to limit the available bandwidth too drastically when "shaping up" a speech signal for automatic peak processing. This requirement is in conflict with the prerequisite for an easy and reliable estimation of the fundamental period (Laver et al., 1982).

The second mechanism is related to a computational peculiarity of the envelope variation signal. The duration of the time-span over which the difference is taken in order to obtain the EVS from the smoothed signal envelope increases nonlinearly with decreasing centre frequencies. Since the first formant is lowest for [i], the difference interval is exceptionally long (Table 3) compared with [e] and [a]. Since two samples situated further apart in time are less strongly correlated (due to noise in the first formant region), the resulting microperturbation values are boosted for [i] (and somewhat for [e]). As a consequence, between-processing differences are particularly strong in the case of [i].

Not surprisingly, one must conclude that the inner workings of the preprocessing routines affect the outcome of jitter value measurements. This does not necessarily jeopardize their use in a clinical framework, when the primary concern is not the most precise measurement possible, but a reasonable separation capability.
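The argument that low-pass filtering discards the fine perturbations can be made concrete with a short worked relation. It is a sketch added for illustration and assumes that a perturbation amounts to a small time shift $\delta$ of an otherwise periodic signal; the symbols $k$, $f_0$, $\delta$ and $\varphi_k$ are introduced here, and the numerical values are illustrative.

```latex
% Harmonic k of a signal with fundamental frequency f_0, delayed by \delta:
\cos\bigl(2\pi k f_0 (t - \delta)\bigr) = \cos\bigl(2\pi k f_0 t - \varphi_k\bigr),
\qquad \varphi_k = 2\pi k f_0 \delta .
% Illustrative values: f_0 = 130 Hz, \delta = 30 \mu s
\varphi_1 \approx 0.025\ \mathrm{rad}, \qquad \varphi_{10} \approx 0.25\ \mathrm{rad}.
```

The phase deviation grows in proportion to the harmonic number, so a signal reduced to its fundamental retains only the smallest trace of a given timing perturbation.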
Leaving the envelope variation signal aside, the behaviour of which is discussed above, relative jitter values have a tendency to be lower for [i] and [e] than for [a], but not always significantly so. No unanimity has yet been reached in the literature as far as the dependence of jitter on vowel quality is concerned. We think that the conflicting clues are the result of lumping together absolute and relative jitter measurements. In fact, our results suggest that vowel quality has only a minor influence on relative jitter, if any at all. This observation can be brought in line with all those reporting higher absolute jitter for low vowels (e.g. [a]) compared to high vowels (e.g. [i]). The point is that intrinsic fundamental frequency increases for high vowels; consequently, raw jitter is diminished since jitter is known to decrease with increasing fundamental frequency. In relative measurements these between-vowel differences are compensated for when absolute jitter values and intrinsic fundamental period covary so as to keep their ratio constant.
5. Methods and materials (experiment 2)
5.1. Corpora

All the control speakers (21 male and 22 female) were French-speaking university students aged between 18 and 25. They sustained the vowels [a], [e], [i], and produced the sentence [sɛt yn saga] at a comfortable pitch and loudness level. They were instructed to articulate distinctly, affirmatively and without any paralinguistic bias. The sentence chosen contained the voiced stop [g], whose production in French imposes severe aerodynamic constraints on the laryngeal system. All the dysphonic speakers (16 male and 20 female) were patients either of the O.R.L. department of the St.-Pierre University Hospital or of the Centre d'Audiophonologie Paul Guns (St.-Luc University Hospital), both in Brussels. Their diagnoses had been established by the doctors of the two departments. The dysphonic speakers produced, under comparable conditions, the same vowels and sentences as the normal speakers. Tables 7 and 8 list the signal numbers, together with a brief description of pathologies. The speakers' ages were as follows: for the males, 5 were below 40, 8 between 40 and 60, and 3 were over 60; for the females, 7 were below 40, 11 between 40 and 60, and 3 were over 60.
Table 7
Experiment 2: Dysphonic male speakers

(1) Polypus on anterior third of left vocal fold
(2) Paralysis of right vocal fold in paramedian position, sequel to traumatic dysphonia
(3) Laryngitis, hyperhaemia of both vocal folds
(4) Chronic laryngitis
(5) Hypotonia, air loss, atrophic fold
(6) Hyperhaemia, inflammatory condition of both vocal folds, hypertonia of ventricular folds*
(7) Benign laryngitis
(8) Benign laryngitis
(9) Hyperhaemia, inflammatory condition of both vocal folds, hypertonia of ventricular folds*
(10) Keratonic papillomatosis
(11) Laryngeal carcinoma
(12) Dryness of mucosa, sequel to radiotherapy**
(13) Dryness of mucosa, sequel to radiotherapy**
(14) Hyperhaemia of vocal folds, light oedema
(15) Incomplete vocal mutation
(16) Papillomatosis, numerous scars from former surgery
(17) Mutational falsetto voice
(18) Partial laryngectomy

* Signals (6) and (9) were produced by the same speaker.
** Signals (12) and (13) were produced by the same speaker.
Table 8
Experiment 2: Dysphonic female speakers

(1) Synechia, anterior part of the vocal fold
(2) Paralysis of right vocal fold
(3) Sulcus glottidis
(4) Vocal nodules
(5) Vocal nodules
(6) Motor disorder due to lateral amyotrophic sclerosis
(7) Vocal nodules
(8) Laryngitis, pharyngitis, reddened vocal folds
(9) Laryngitis, pharyngitis, reddened vocal folds
(10) Paraphlegia of left vocal fold, sequel to polypodectomy
(11) Hyperhaemic laryngitis, air loss
(12) Inflammatory granuloma - left arytenoid, hyperkinesy
(13) Light laryngitis (recurrent)
(14) Light laryngitis, slight hyperhaemia of the vocal folds
(15) Bilateral abductor paresis with air loss and quasi-immobility of both vocal folds
(16) Pharyngitis
(17) Acute laryngitis, oedema
(18) Light laryngitis*
(19) Dysfunctional dysphonia
(20) Light laryngitis*
(21) Laryngitis, hyperhaemia, oedema on two vocal folds

* Signals (18) and (20) were produced by the same speaker.
5.2. Data acquisition
The recordings and signal digitizations were carried out in the same way as described above. The only difference was that the [sɛt yn saga] utterances were sampled at 6600 Hz instead of 10 000 Hz and were prefiltered accordingly at 3300 Hz.
5.3. Preprocessing
The results of the first part of the second experiment showed that the EVS performed better than the residue and LP-filtered signals. The EVS pretreating scheme was thus applied to running speech. It is true that in most cases the EVS and the residue signal worked equally well on signals provided by male speakers. However, the residue signal gave systematically poor results on female voices. The respective merit factor values were 0.33 and 0.52 in the case of the residue signal and the envelope variation signal (Table 11); we obtained similar results in the case of the vowels [i] and [e]. We therefore preferred the envelope variation criterion, which allowed for a good discriminatory performance on both the male and the female voices. Inter-gender differences with reference to LPC analysis are discussed in detail below. In the case of the sentence, the band-pass filter centre frequency was set to 650 Hz and the effective smoothing window width was adjusted so as to be equal to 15 samples for the males and 9 samples for the females. The finite difference interval was adjusted so as to be equal to 19 and 13 samples respectively. The analysis procedures applied to the sustained vowels were the same as in experiment 1.
5.4. Jitter measures
We applied the period perturbation quotient to both sustained vowels and isolated sentences. In the case of the latter the PPQ was computed for each uninterrupted stretch of voiced speech. Three or four PPQ values were typically computed for each sentence. The median value of these was taken to represent relative jitter in the sentence. The total number of periods was typically 100 per sentence for male speakers and 200 for female speakers. We also computed Lieberman's pitch perturbation factor (PPF), i.e., the percentage of perturbations occurring between adjacent periods which were longer than 0.5 ms. This feature seemed appropriate for an analysis using connected speech, where important perturbations may occur during voice onset and offset and between adjacent segments (Lieberman, 1963).
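The two sentence-level computations described in this section can be sketched in Python as follows. This is a minimal illustration and not the original analysis code: the function names are hypothetical, the period values are invented, and taking the number of adjacent-period pairs as the denominator of the percentage is an assumption.

```python
import statistics

def pitch_perturbation_factor(periods_ms, threshold_ms=0.5):
    """Percentage of adjacent-period differences whose magnitude exceeds
    threshold_ms (0.5 ms is the threshold quoted in the text)."""
    diffs = [abs(b - a) for a, b in zip(periods_ms, periods_ms[1:])]
    if not diffs:
        raise ValueError("need at least two periods")
    return 100.0 * sum(d > threshold_ms for d in diffs) / len(diffs)

def sentence_value(stretch_values):
    """Median of the values obtained on the voiced stretches of one sentence."""
    return statistics.median(stretch_values)

# Invented period lengths (ms) for one voiced stretch, for illustration only.
periods = [8.0, 8.1, 8.9, 8.8, 8.0]
print(pitch_perturbation_factor(periods))       # two of four differences exceed 0.5 ms
print(pitch_perturbation_factor(periods, 1.0))  # same data, doubled threshold
print(sentence_value([2.1, 2.8, 3.4]))          # median over three voiced stretches
```

Keeping the threshold as a parameter makes it easy to explore how strongly the measure depends on that single choice.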
6. Results

Our second experiment was carried out in order (i) to check the results achieved during experiment 1; (ii) to compare jitter values on sustained vowels and connected speech; and (iii) to compare male and female speakers. Tables 9 and 10 show the median period perturbation quotient values produced by male and female speakers, for the control and the dysphonic corpora respectively. PPQ values were compared when extracted respectively from the three auxiliary signals (Res, EVS, LPS) for [a], and from EVS for [sɛt yn saga].

6.1. Fundamental frequency and median pitch perturbation quotient values (sustained vowel [a])

The median fundamental frequencies for normal males and females were 126 Hz and 252 Hz respectively. In the case of the dysphonic subjects these became 132 Hz and 215 Hz respectively. For the male subjects (Table 9), the median PPQs found (0.22%-0.40%) were in agreement with experiment 1. For the normal female subjects, the PPQs were higher (0.51%-0.52%), a consequence of arithmetically dividing smaller (than in the case of the males) absolute perturbations by a fundamental period that is twice as short. Absolute jitter appears to decrease more slowly than the fundamental period. This suggests
Table 9
Median period perturbation quotient (PPQ) and factor (PPF) values in % for [a] and [sɛtynsaga] for male and female control speakers. The speech signal is processed via an envelope variation criterion in each case. Values in parentheses show the respective ranges spanned by two thirds of the speakers, i.e. one third above and one third below the median value. A question mark stands for an illegible digit

                      Res                EVS                 LPS
PPQ: [a] ♂            0.22 (0.17-0.52)   0.27 (0.24-0.32)    0.40 (0.28-0.73)
PPQ: [sɛtynsaga] ♂                       2.78 (1.83-3.95)
PPF: [sɛtynsaga] ♂                       26.9 (15.?-36.8)
PPQ: [a] ♀            0.51 (0.27-1.87)   0.52 (0.35-?)       0.50 (0.31-0.87)
PPQ: [sɛtynsaga] ♀                       3.11 (1.85-3.65)
PPF: [sɛtynsaga] ♀                       8.8 (5.5-13.1)

Table 10
Median period perturbation quotient (PPQ) and factor (PPF) values in % for [a] and [sɛtynsaga] for male and female dysphonic speakers. The speech signal is processed via an envelope variation criterion in each case. Values in parentheses show the respective ranges spanned by two thirds of the speakers, i.e. one third above and one third below the median value

                      Res                EVS                 LPS
PPQ: [a] ♂            1.59 (0.36-3.81)   0.65 (0.28-2.22)    0.53 (0.26-2.48)
PPQ: [sɛtynsaga] ♂                       5.52 (3.91-15.3)
PPF: [sɛtynsaga] ♂                       35.3 (19.7-67.4)
PPQ: [a] ♀            1.96 (0.31-3.55)   1.32 (0.40-7.48)    0.30 (0.21-0.37)
PPQ: [sɛtynsaga] ♀                       4.64 (2.82-7.33)
PPF: [sɛtynsaga] ♀                       20.2 (11.8-41.9)
that the relationship between jitter and period is not exactly linear or that the noise floor of the experimental set-up keeps jitter values from reaching arbitrarily low values. The median PPQ (0.51%) realized by the normal female subjects on the residue signals obscures the fact that, because the residue signal was often more chaotic in appearance for the female speakers than for the male ones, a high proportion of the control speakers exhibited abnormally high perturbation values. The PPQ values extracted from the low-pass filtered signals produced by the female dysphonic speakers were especially low (median = 0.3%) compared to the "control" values (median = 0.5%), a consequence of arithmetically dividing already low absolute perturbation values (due to LP filtering) by significantly longer fundamental periods (median F0 = 215 Hz).

6.2. Inter-processing differences
For the normal male speakers, the results confirmed those of experiment 1. When starting from the residue signal and envelope variation criterion, the "normal" and "dysphonic" corpora were separated reasonably well (merit factors = 0.55 and 0.61). Low-pass filtering, on the other hand, forestalled all efficient separation. For the female speakers the picture was quite different. Only one pretreatment method (EVS) allowed for a good discriminatory performance (merit factor = 0.52). The LPC model appeared less adequate for female voices. As a consequence, the separation capability of the residue signal dropped considerably. The low-pass filtered signal gave rise to significant between-class differences, although in the wrong direction: the dysphonic signal perturbation values were lower than the "normal" average, a consequence, as mentioned above, of a fundamental period which increased for part of the dysphonic speakers.

6.3. Running speech [sɛt yn saga]
The envelope variation signal, which had been shown to yield reliable results for male and female speakers, was subsequently computed for the isolated sentences. A peak picking routine extracted individual period lengths. A pitch perturbation quotient value was computed for each voiced stretch of speech. Each sentence production thus gave rise to several PPQ values whose median was considered to represent the jitter inside each production. Lieberman's pitch perturbation factor was also implemented. PPF is a feature which counts all occurrences of perturbations exceeding a threshold of 0.5 ms.

Both Lieberman's perturbation factor (PPF) and the pitch perturbation quotient (PPQ) demonstrated significant differences between normal and dysphonic subjects. Their behaviour on female corpora was in agreement with expectations voiced elsewhere (Askenfelt and Hammarberg, 1981): the pitch perturbation factor, a typical "running speech" feature, performed best. However, for the male speakers the discriminatory capability of the same feature appeared considerably diminished.

One way to compare isolated sentences and sustained vowels is to order all the features and signal processing procedures according to their merit factor values. These are in a linear relationship to discriminatory performance. For the male voices the discriminatory performances of [a] and [sɛt yn saga] were strictly comparable. Although [a] performed somewhat less well for the females, this was not a standard occurrence; e.g. when an analysis interval of 15 instead of 45 periods was used, the PPQ merit factor changed from 0.52 to 0.66. For our corpora, it was generally true that discrimination performance increased somewhat when fewer periods were taken into account. This is in agreement with Davis (1976), but the great majority of more recent studies have made use of analysis intervals of around 50 periods. A possible explanation is that the stationarity requirement is better fulfilled on shorter analysis intervals.
7. Discussion
Table 9 reveals that, as far as [a] is concerned, the results agree with experiment 1. There are very highly significant differences between normal and pathological male speakers when they are analyzed by residue and envelope variation signals respectively (Table 11). The low-pass filtered signal produces no discrimination whatever. Significant inter-signal differences are related to treatment idiosyncrasies as discussed above. As for the features computed from the sentence, both differ significantly between normal and pathological speakers. It should be noted that the best performance was furnished by a feature computed from a sustained vowel (i.e. [e], the merit value of which is not displayed in Table 11) via the residue signal, and not from connected speech.

Table 11
Merit factor values for [a] and [sɛtynsaga] for male and female speakers. The merit factor values vary linearly with discrimination performance between control and dysphonic speakers. A value equal to 1 means perfect discrimination

                      Res      EVS      LPS
PPQ: [a] ♂            0.55     0.61     0.22
PPQ: [sɛtynsaga] ♂             0.61
PPF: [sɛtynsaga] ♂             0.39
PPQ: [a] ♀            0.33     0.52     -0.43
PPQ: [sɛtynsaga] ♀             0.57
PPF: [sɛtynsaga] ♀             0.67

The poor achievement of Lieberman's pitch perturbation factor needs further comment. The explanation is that the 0.5 ms threshold is not optimal for both male and female speakers. Since raw jitter values vary with fundamental frequency, the threshold should be higher for males. Accordingly, when we tentatively doubled the threshold, the merit factor jumped from 0.38 to 0.61, a 60% increase.

For female speakers, the discrimination performance hierarchy is also listed in Table 11. This time the pitch perturbation factor performs best, and the residue signal worst. The feeble performance of the latter is systematic and has been observed for other features and corpora. Analysis by linear prediction is a standard speech processing method and has frequently been found wanting with respect to female voices (Kahn and Garst, 1983; Wong, 1980). In basic terms, what distinguishes female from male voices with reference to LPC is that: (i) the differentiated glottal signal is less peaky; (ii) source-tract interactions are stronger; (iii) the time-span to which the all-pole model applies exactly is shorter; (iv) in general, the formant locations in the LPC spectrum are affected by the F0 harmonics. It may be worthwhile mentioning that laryngeal pathologies have been predicted as having similar effects (Koike and Markel, 1975). Even if the excitation source were perfect, a qualitative difference in LPC behaviour on female and male speech would still be observed. The point is that correct LPC procedure requires that the impulse response of the vocal tract should have died away before the next pulse occurs, in order to avoid putting several impulse responses one on top of the other. This requirement is better fulfilled by low-pitch male than by high-pitch female voices. While our results do not show a gradual deterioration from male to female voices, for a high proportion of female speakers jitter values are roughly ten times higher than expected. What happens is that for female speech the formant structure is less well removed by LPC inverse filtering. Spurious peaks therefore appear superimposed on the main residue signal spikes, and these peaks fool the peak picking routine which measures the exact duration between adjacent spikes.

Our results show that connected speech did not provide a decisive advantage over sustained vowels in revealing anomalous jitter. In fact, in the case of the male speakers discriminatory performance was strictly comparable. What distinguishes both types of speech material are the higher relative and absolute perturbations of running speech. For sustained vowels, typical absolute perturbation values were equal to 30 and 20 µs for the male and female subjects respectively. These should be compared to typical pitch perturbation factor values of 0.27 and 0.09 respectively in the case of connected speech. This means that 27% and 9% of the periods of a typical male and female speaker exhibited perturbations higher than 500 µs.
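The sustained-vowel figures quoted above can be cross-checked with a line of arithmetic. The sketch below assumes that relative jitter is approximately the typical absolute perturbation divided by the median fundamental period; the helper name is hypothetical, and the inputs are the typical absolute perturbations (30 and 20 µs) and the control-speaker median F0 values (126 and 252 Hz) reported in the text.

```python
def relative_jitter_percent(abs_perturbation_us, f0_hz):
    """Absolute perturbation (microseconds) expressed as a percentage of
    the fundamental period 1/f0."""
    period_us = 1e6 / f0_hz
    return 100.0 * abs_perturbation_us / period_us

male = relative_jitter_percent(30.0, 126.0)
female = relative_jitter_percent(20.0, 252.0)
print(f"male: {male:.2f} %, female: {female:.2f} %")
```

Both results are of the same order as the sustained-vowel medians of Table 9 (0.22%-0.40% for the male and about 0.5% for the female control speakers), which illustrates how a smaller absolute perturbation can still yield a larger relative value when the period is halved.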
Accordingly, we found that PPQ values measured in voiced portions of connected speech were roughly 5 to 10 times higher than when measured on sustained vowels produced by the same speaker. Lieberman chose a high threshold since he believed that smaller perturbations were the consequence of inter-segment transients induced by vocal tract dynamics. Lebrun (1970) confirmed these findings by showing that in speech signals produced when exciting the vocal tract with the help of an artificial larynx, cycle-to-cycle perturbations higher than 0.1 ms persisted throughout the voiced portions. A small part of the pitch perturbation factor percentage is brought about by errors in the peak picking routine. Errors affecting PPF are estimated to be below 2% for male speakers.

In fact, in the case of running speech several extrinsic and intrinsic factors may either individually or cooperatively increase jitter and shimmer measures. Among the extrinsic factors are more or less rapid changes in the F0 contour which are not completely compensated for. Another extrinsic factor is the segment-specific behaviour of preprocessing routines, which do not cope equally well with all categories of speech segments (see discussion above). The same observation applies mutatis mutandis to human operators who pick periods by hand. Intrinsic factors are laryngeal and vocal tract dynamics respectively. Koike (1973) showed, for example, that period variability is much lower during the stationary part of a sustained vowel than during onset and offset. Vocal tract dynamics, on the other hand, constantly changes the acoustic load experienced by the voice source. Since the internal impedance is not infinitely high, its output changes somewhat from one speech segment to the next.

Finally, when inspecting sentence oscillograms it appeared that dysphonic speakers had not had any particular difficulties producing [g] in a voiced context, i.e. [aga], contrary to expectations expressed elsewhere. We think it more likely that they will meet increased difficulties when producing [g] syllable-initially (Serniclaes, 1984). To summarize, since jitter appears to be higher in running speech, it should be easier to measure in those circumstances.
This may explain in part the claim made by several authors as to the superiority of running speech as far as laryngeal pathology detection is concerned. No definitive conclusion can be drawn as to whether connected speech is intrinsically more likely to reveal laryngeal anomalies.
8. Conclusion
Two comparative studies were carried out. The first showed that idiosyncrasies in the preprocessing schemes gave rise to systematic differences in jitter values. In the case of the control speakers jitter values could, nonetheless, be made to agree to far better than one order of magnitude. As far as between-vowel quality differences were concerned, no significant differences were observed. In fact, in relative jitter such differences appear to be compensated for by normalization by the fundamental period.

In a second experiment, jitter values extracted from connected speech did not discriminate between normal and dysphonic speakers any more efficiently than values calculated from sustained vowels. Absolute microperturbation values appeared to be higher in connected speech, however, and were consequently easier to measure. These findings substantiate the claims made by several authors. No intrinsic superiority in the discrimination performance of connected speech as opposed to sustained vowels could be found, though.

Our results also draw attention to the necessity of analyzing jitter in sustained vowels with equipment allowing for sufficient bandwidth. When low-pass filtering becomes too extreme, too many fine signal details are lost, jitter included. As a final point, our results confirm what is known about jitter in female voices, i.e., the lower absolute perturbation in female voices. They also highlight the increased difficulties encountered when acoustically analyzing female speech.

As far as clinical applications are concerned, acoustic measurements are a non-invasive means to characterize a dysphonic voice signal reproducibly, quantitatively and objectively. We think that the relevance of these measurements should be evaluated in the framework of a precisely defined task. Possible clinical tasks are:
(i) Follow-up, i.e. charting the evolution of a patient's voice in the course of treatment.
(ii) Comparison, i.e. a voice signal is compared and classified with reference to a set of voices which are well known by the laryngologist and which therefore play the role of an ad hoc standard.
(iii) Documentation, i.e. building a data-base for further reference. Accumulating data on an important number of voices would also permit in the long run to establish statistics on different speaker categories and different pathologies.
(iv) Expertise in a legal framework involving accidental damage to the larynx of a person.
(v) Screening for vocal pathologies, especially cancer of the vocal folds.

It must be understood that when proceeding from task (i) to (v), demands for knowledge about voice quality in the population at large become more and more pressing. However, as yet, a model describing voice quality in the general population in quantitative terms is not available. We do not believe that task (v), for example, can be tackled successfully without completing task (iii) beforehand (e.g. Kasuya et al., 1986). Our experience suggests that applications (i) to (iii) can be attempted successfully at the present time. However, screening and expertise, based on acoustic measurements alone, must wait until these limitations are better understood with reference to the population at large.
Appendix

In this appendix we give further explanations of the rationale underlying the extraction of the so-called envelope variation criterion. Let us make a qualitative examination of the evolution in time, for one glottal cycle, of the inflow of energy into a stationary vocal tract. For normal phonation, the glottis is closed for about 40% of the duration of a glottal cycle (Monsen and Engebretson, 1977). There is no influx of energy and losses are constant. If one, and only one, formant is excited the signal envelope takes on the shape of a decreasing exponential and the logarithmic derivative is a negative constant. When the glottis opens, the infraglottal cavities are connected to the vocal system and losses increase dramatically (Fant, 1981). The influx of energy also increases so as to attain a maximum near the instant of glottal closure; losses then prevail once more and the cycle starts all over again. Ideally, the logarithmic derivative would take on successively positive and negative values at the rhythm of the opening and closing of the glottis. This reasoning is at the root of the "envelope variation criterion" computation (Jospa, 1984).

The sophisticated procedure which is explained below is not entirely necessary to obtain a signal which takes on positive and negative extrema once per glottal cycle: it would suffice to low-pass filter the signal or to record it with the help of a contact microphone. What is peculiar to the treatment proposed here is that the low-pass filtering stage is replaced by band-pass filtering. This allows for the analysis of arbitrary frequency regions, provided that there is sufficient energy inside the frequency interval considered. When analyzing vowels, for example, the centre frequency of the band-pass filter can be positioned near one of the first three formants. The envelope variation criterion (EVS) exhibits a local maximum in the vicinity of glottal closure, a characteristic it shares with the residue and raw speech signals.

To summarize, a very rough picture of the waxing and waning of signal energy during vocal-fold operation can be obtained as follows:
(i) band-pass filtering isolates a formant;
(ii) Hilbert transformation and smoothing yield an approximation of the signal envelope;
(iii) a finite difference calculation outputs a signal which varies rhythmically with the opening and closing of the glottis.

The EVS computational procedure is governed by a set of relationships which constrain the values of analysis parameters such as the centre frequency of the band-pass filter $F_c$, the width of the smoothing window $N$, and the finite difference spacing $L$. These relationships are established below (Jospa, 1982). Let $s(i)$ be a band-pass filtered and rectified speech sample, and let $T$ be the sampling period. A smoothing window is defined as follows:

$$h(i; N) = 0 \quad \text{when } i \text{ is greater than the window width, equal to } 2N + 1,$$
$$h(-i; N) = h(i; N), \quad \text{i.e., the window is symmetrical},$$
$$\sum_i h(i; N) = 1, \quad \text{i.e., the window is normalised}.$$
Let $f_{\mathrm{inf}}$ be the lowest significant frequency component of the band-pass filtered signal. The maximum period duration of the frequency component still smoothed out by the window is
roughly equal to the window width. The width and $f_{\mathrm{inf}}$ are therefore related:

$$2N + 1 > \frac{1}{T f_{\mathrm{inf}}} \qquad (1)$$
Here $f_{\mathrm{inf}} = F_c - \tfrac{1}{2}dF$, where $F_c$ is the centre frequency of the band-pass filter, $dF$ is the width of the filter, and $T$ is the sampling interval. The result of the smoothing operation is to replace each sample with a linear combination of itself and the neighbouring samples. The aim is to get rid of any irregularities persisting in the envelope after extracting it with the help of a Hilbert transform; i.e., it is necessary to level out undesired frequency components still present after imperfect band-pass filtering. Instead of the total window width, let us consider the effective width, i.e., the duration for which the window amplitude differs appreciably from zero. For example, the effective length of a rectangular window equals its total length. For the optimal window (Geckinli and Yavuz, 1981), one assumes the effective width to be equal to $N$. Condition (1) then becomes

$$N > \frac{1}{T\,(F_c - \tfrac{1}{2}dF)} \qquad (1')$$
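A short worked example may help compare the two forms of the first condition. The numbers are illustrative assumptions, not values taken from the study: a sampling rate of 10 kHz (so $T = 10^{-4}$ s) and a filter centred on $F_c = 500$ Hz. Only the 400 Hz filter width is the one quoted in the computation procedure.

```latex
\begin{align*}
f_{\mathrm{inf}} &= F_c - \tfrac{1}{2}\,dF = 500 - 200 = 300\ \text{Hz},\\
\frac{1}{T f_{\mathrm{inf}}} &= \frac{10\,000}{300} \approx 33.3,\\
\text{(1):}\quad 2N + 1 > 33.3 &\;\Rightarrow\; N \ge 17,\\
\text{(1'):}\quad N > 33.3 &\;\Rightarrow\; N \ge 34.
\end{align*}
```

Under these assumed values, the effective-width form doubles the smallest admissible half-width.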
Note that (1') is a more severe constraint than (1). The finite difference operation which we want to carry out only makes sense when carried out between two successive samples which have never belonged to the same window (the sampling theorem would make it possible to discard the in-between samples). The difference interval $L$ is therefore contained between two smoothing window half-widths at least, and four half-widths at most. By assuming an effective width equal to a half-width, we obtain another condition:

$$N < L < 2N \ \text{ if } L \text{ is even}, \qquad N < L < 2N + 1 \ \text{ if } L \text{ is odd}. \qquad (2)$$

The third condition reads

$$(N + L)\,T < T_0 \qquad (3)$$
Summary of the constraints:
(1) $N > 1/\bigl(T\,(F_c - \tfrac{1}{2}dF)\bigr)$,
(2) $N < L < 2N + 1$,
(3) $(N + L)\,T < T_0$.
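The three constraints can be checked numerically. The sketch below is only an illustration under stated assumptions: $T_0$ is read as the fundamental period ($1/F_0$), the sampling rate and the $F_0$ values used in the usage note are invented for the example, and the function names are hypothetical rather than part of the original implementation.

```python
import math


def min_half_width(fc_hz, df_hz, fs_hz):
    """Smallest integer N satisfying constraint (1): N > 1 / (T * (Fc - dF/2))."""
    f_inf = fc_hz - 0.5 * df_hz           # lowest significant frequency in the band
    if f_inf <= 0:
        raise ValueError("lower band edge Fc - dF/2 must be positive")
    return math.floor(fs_hz / f_inf) + 1  # strict inequality, with T = 1 / fs


def satisfies_constraints(n, spacing, fc_hz, df_hz, fs_hz, f0_hz):
    """Check constraints (1)-(3) for half-width n and difference spacing L."""
    f_inf = fc_hz - 0.5 * df_hz
    c1 = n * f_inf > fs_hz                # (1)  N > 1 / (T * f_inf)
    c2 = n < spacing < 2 * n + 1          # (2)  N < L < 2N + 1
    c3 = (n + spacing) * f0_hz < fs_hz    # (3)  (N + L) * T < T0, assuming T0 = 1 / F0
    return c1 and c2 and c3


def smallest_feasible_pair(fc_hz, df_hz, fs_hz, f0_hz):
    """Return the (N, L) pair with the smallest N + L, or None if no pair exists."""
    n = min_half_width(fc_hz, df_hz, fs_hz)
    spacing = n + 1                       # smallest L allowed by (2)
    if satisfies_constraints(n, spacing, fc_hz, df_hz, fs_hz, f0_hz):
        return n, spacing
    return None                           # larger N or L can only increase (N + L) * T
```

With an assumed 10 kHz sampling rate, a 400 Hz band centred on 500 Hz and an assumed $F_0$ of 120 Hz, `smallest_feasible_pair(500, 400, 10000, 120)` returns a valid pair, whereas an assumed $F_0$ of 220 Hz returns `None` for the same band. Moving the band to 1000 Hz makes the 220 Hz case feasible again in this toy setting.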
It should be noted that these constraints derive from very general reasoning as to what goes on inside a glottal cycle. These constraints can easily be satisfied by voiced sounds produced by male speakers, but this is not necessarily so when the sounds are produced by female speakers.

Computation procedure:
(1) The first stage consists of band-pass filtering the speech signal in order to isolate a formant. The numerical filter employed is a finite impulse response filter which leaves phase relationships intact. Its bandwidth is equal to 400 Hz (McClellan, Parks and Rabiner, 1979).
(2) As we are interested in events taking place in the envelope of the band-pass filtered signal, the envelope is estimated with the help of a Hilbert transform.
(3) The Hilbert transform applied to a signal with non-zero noise content and filtered by means of a non-ideal filter results in an estimate of the signal envelope in which irregularities persist. These need smoothing out; the characteristics of the smoothing window are governed by (1). The implemented smoothing window is optimal in the sense of Geckinli and Yavuz (1981).
(4) The difference operation, finally, is governed by constraints (2) and (3).
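The four stages can be sketched in a few lines of standard-library Python. This is a rough illustration, not the implementation used in the study, and it swaps in simpler techniques throughout: a Hamming-windowed sinc filter stands in for the McClellan-Parks design, full-wave rectification followed by smoothing stands in for the Hilbert-transform envelope, a triangular window stands in for the optimal window of Geckinli and Yavuz, and the finite difference is taken on the logarithm of the envelope, which is one possible reading of the criterion. All function names and the synthetic test signal are hypothetical.

```python
import math


def bandpass_fir(fc_hz, df_hz, fs_hz, half_len=50):
    """Linear-phase band-pass FIR taps (Hamming-windowed sinc), length 2*half_len + 1."""
    lo = (fc_hz - 0.5 * df_hz) / fs_hz    # band edges in cycles per sample
    hi = (fc_hz + 0.5 * df_hz) / fs_hz
    taps = []
    for k in range(-half_len, half_len + 1):
        if k == 0:
            ideal = 2.0 * (hi - lo)
        else:
            ideal = (math.sin(2 * math.pi * hi * k)
                     - math.sin(2 * math.pi * lo * k)) / (math.pi * k)
        taps.append(ideal * (0.54 + 0.46 * math.cos(math.pi * k / half_len)))
    return taps


def filter_zero_phase(x, kernel):
    """Apply a symmetric odd-length kernel centred on each sample (zero-padded edges)."""
    half = len(kernel) // 2
    n = len(x)
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        out.append(sum(kernel[j - i + half] * x[j] for j in range(lo, hi)))
    return out


def smoothing_window(n_half):
    """Symmetric triangular window of width 2N + 1, normalised to unit sum."""
    raw = [n_half + 1 - abs(i) for i in range(-n_half, n_half + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def envelope_variation(x, fc_hz, df_hz, fs_hz, n_half, spacing):
    """Band-pass, rectify, smooth, then difference the log envelope over `spacing` samples."""
    band = filter_zero_phase(x, bandpass_fir(fc_hz, df_hz, fs_hz))
    rectified = [abs(v) for v in band]              # stand-in for the Hilbert envelope
    env = filter_zero_phase(rectified, smoothing_window(n_half))
    log_env = [math.log(v + 1e-12) for v in env]    # small offset avoids log(0)
    before = spacing // 2
    after = spacing - before
    out = [0.0] * len(x)                            # samples near the edges stay at zero
    for i in range(before, len(x) - after):
        out[i] = log_env[i + after] - log_env[i - before]
    return out


def synthetic_vowel(fs_hz, f0_hz, formant_hz, bandwidth_hz, n_samples):
    """Toy voiced sound: one damped formant oscillation restarted every glottal period."""
    period = round(fs_hz / f0_hz)
    pulse = [math.exp(-math.pi * bandwidth_hz * k / fs_hz)
             * math.sin(2 * math.pi * formant_hz * k / fs_hz)
             for k in range(5 * period)]
    x = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for k, v in enumerate(pulse):
            if start + k < n_samples:
                x[start + k] += v
    return x
```

A call such as `envelope_variation(synthetic_vowel(10000, 100, 500, 60, 2400), 500, 400, 10000, n_half=34, spacing=35)` uses parameter values chosen to satisfy constraints (1) to (3) for the toy signal; away from the edges the output alternates in sign and repeats once per synthetic glottal period.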
Acknowledgments

I would like to thank Professor D. Hennebert, Service O.R.L., Clinique Universitaire St. Pierre, and Professor Ph. Dejonckere, Centre d'Audiologie Paul Guns, Clinique Universitaire St. Luc, both in Brussels, for providing voice samples from patients with laryngeal disorders. I would also like to thank my colleague Dr. P. Jospa for his valuable advice during the implementation of the envelope variation criterion analysis method on computer. Part of the work reported here was carried out during my period as a "chargé de recherches" with the Belgian "Fonds National de la Recherche Scientifique".
References

A. Askenfelt and B. Hammarberg (1980), "Speech waveform perturbation analysis", Speech Transmission Laboratory, Quarterly Progress and Status Report, No. 4, pp. 40-49.
A. Askenfelt and B. Hammarberg (1981), "Speech waveform perturbation analysis revisited", Speech Transmission Laboratory, Quarterly Progress and Status Report, No. 4, pp. 40-49.
A. Askenfelt and A. Sjölin (1980), "Voice analysis in depressed patients: Rate of change of fundamental frequency related to mental state", Speech Transmission Laboratory, Quarterly Progress and Status Report, Nos. 2-3, pp. 71-84.
B. Blesser (1978), "Digitization of audio: A comprehensive examination of theory, implementation and current practice", J. of the Audio Engineering Soc., Vol. 26(10), pp. 739-769.
St.B. Davis (1976), "Computer evaluation of laryngeal pathology based on inverse filtering of speech", Monograph 13, Speech Communication Research Laboratories Inc., Santa Barbara, CA.
St.B. Davis (1978), "Acoustic characteristics of normal and pathological voices", Haskins Lab. Stat. Rep. on Speech Res., SR 54, pp. 133-164.
St.B. Davis (1979), "Acoustic characteristics of normal and pathological voices", in Speech and Language: Advances in Basic Research and Practice, Vol. 1, ed. by N.J. Lass (Academic Press, New York), pp. 273-335.
St.B. Davis (1981), "Acoustic characteristics of laryngeal pathology", in Speech Evaluation in Medicine, ed. by J.K. Darby (Grune and Stratton, New York), pp. 77-104.
G. Fant (1981), "The source filter concept in voice production", Speech Transmission Laboratory, Quarterly Progress and Status Report, No. 1, pp. 21-37.
N.C. Geckinli and D. Yavuz (1981), "A set of optimal discrete linear smoothers", Signal Processing, Vol. 3, pp. 49-62.
B. Gold and L. Rabiner (1969), "Parallel processing techniques for estimating pitch periods of speech in the time domain", J. Acoust. Soc. Am., Vol. 46, pp. 442-448.
R. Gubrynowicz (1983), Personal communication.
R. Gubrynowicz, W. Mikiel and P. Zarnecki (1977), "Evaluation de l'état pathologique des cordes vocales d'après l'analyse des variations du fondamental", Proceedings 8th "Journées d'études sur la parole", Aix-en-Provence, pp. 21-27.
R. Gubrynowicz, W. Mikiel and P. Zarnecki (1980), "An acoustic method for the evaluation of the state of the larynx source in cases involving pathological changes of the larynx", Archives of Acoustics, Vol. 5(1), pp. 3-30.
R. Gubrynowicz, B. Kacprowski, W. Mikiel and P. Zarnecki (1981), "Detection and evaluation of laryngeal pathology based on pitch period measurements in continuous speech", Proceedings 4th Symposium of the Federation of the Acoustical Societies of Europe, Venice, pp. 131-134.
B. Guérin and J.L. Boë (1980), "Etude de l'influence du couplage acoustique source-conduit vocal sur F0 des voyelles orales", Phonetica, Vol. 37, pp. 169-192.
M. Hecker and E.J. Kreul (1971), "Descriptions of the speech of patients with cancer of the vocal folds. Part I: Measures of fundamental frequency", J. Acoust. Soc. Am., Vol. 49(4), pp. 1275-1282.
W. Hess (1982), Pitch Determination of Speech Signals, Springer Series in Information Sciences (Springer Verlag, Berlin).
S. Hiller, J. Laver and J. Mackenzie (1983), "Automatic analysis of waveform perturbations in connected speech", Work in Progress 16, Department of Linguistics, University of Edinburgh, pp. 40-49.
S.M. Hiller, J. Laver and J. Mackenzie (1984), "Durational aspects of long-term measurements of fundamental frequency perturbations in connected speech", Work in Progress 13, Department of Linguistics, University of Edinburgh, pp. 59-77.
M. Hirano (1981), Clinical Examination of Voice (Springer Verlag, New York).
H. Hollien, J. Michel and E. Doherty (1973), "A method for analyzing vocal jitter in sustained phonation", J. of Phonetics, Vol. 1(1), pp. 85-91.
Y. Horii (1975), "Some statistical characteristics of voice fundamental frequency", J. Speech and Hearing Res., Vol. 18, pp. 192-201.
Y. Horii (1979), "Fundamental frequency perturbation observed in sustained phonation", J. Speech and Hearing Res., Vol. 22, pp. 5-19.
Y. Horii (1982), "Jitter and shimmer differences among sustained vowel phonations", J. Speech and Hearing Res., Vol. 25, pp. 12-14.
S. Imaizumi (1985), "Acoustic measures of pathological voice qualities", Ann. Bull. RILP 19, pp. 179-190.
N. Isshiki (1972), "Imbalance of the vocal cords as a factor of dysphonia", Studia Phonologica, Vol. VI, pp. 39-44.
N. Isshiki and K. Ishizaka (1976), "Computer simulation of pathological vocal cord vibration", J. Acoust. Soc. Am., Vol. 60(5), pp. 1193-1198.
P. Jospa (1982), "Critères de variation d'amplitude", Rapport d'Activités de l'Institut de Phonétique de l'Université Libre de Bruxelles 17, pp. 89-108.
P. Jospa (1984), "Détection synchrone du pitch par le critère de variation d'amplitude", Proceedings 13th "Journées d'Etudes sur la parole", Groupement des Acousticiens de Langue Française, Bruxelles, pp. 161-162.
P. Jospa and J. Schoentgen (1982), "Signal acoustique, signal résiduel, critère de variation d'amplitude: Trois supports au calcul d'indices de dysphonie", Proceedings 5th Symposium of the Federation of the Acoustical Societies of Europe, Göttingen, pp. 993-996.
J. Kacprowski (1979), "Objective acoustical methods in phoniatric diagnostics of speech organ disorders", Archives of Acoustics, Vol. 4(4), pp. 289-304.
M. Kahn and P. Garst (1983), "The effects of five voice characteristics on LPC quality", Proc. IEEE Intern. Conf. Acoust., Speech, Signal Process., Boston, pp. 531-534.
K. Kasuya and T. Kobayashi (1983), "Characteristics of pitch period and amplitude perturbations in pathologic voice", Proc. IEEE Intern. Conf. Acoust., Speech, Signal Process., Boston, pp. 341-353.
K. Kasuya, S. Ogawa, Y. Kikuchi and S. Ebihara (1986), "An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology", Speech Communication, Vol. 2(5), pp. 171-181.
K. Kitajima, M. Tanabe and N. Isshiki (1975), "Pitch perturbation in normal and pathologic voice", Studia Phonologica, Vol. IX, pp. 25-32.
F. Klingholz and F. Martin (1985), "Quantitative spectral evaluation of shimmer and jitter", J. Speech and Hearing Res., Vol. 28, pp. 169-174.
Y. Koike (1973), "Application of some acoustic measures for the evaluation of laryngeal dysfunction", Studia Phonologica, Vol. VII, pp. 17-23.
Y. Koike and J. Markel (1975), "Application of inverse filtering for detecting laryngeal pathology", Annals of Otolaryngology, Vol. 84, pp. 117-123.
Y. Koike, A. Takahashi and T.C. Calcaterra (1977), "Acoustic measures for detecting laryngeal pathology", Acta Otolaryngologica, Vol. 84, pp. 105-117.
J. Laver, S. Hiller and R. Hanson (1982), "Comparative performance of pitch detection algorithms on dysphonic voices", Proc. IEEE Intern. Conf. Acoust., Speech, Signal Process., Paris, pp. 192-195.
J. Laver, S. Hiller, J. Mackenzie and E. Rooney (1985), "An acoustic screening system for the detection of laryngeal pathology", Work in Progress 18, Department of Linguistics, University of Edinburgh, pp. 1-11.
Y. Lebrun and J. Hasquin (1971), "Variations in vocal wave duration", J. Laryngology and Otology, Vol. 1, LXXXV, pp. 43-56.
Ph. Lieberman (1963), "Some acoustic measures of the fundamental periodicity of normal and pathological larynges", J. Acoust. Soc. Am., Vol. 35(3), pp. 344-352.
C. Ludlow, C. Bassich, N. Connor, D. Coulter and Y. Lee (1985), "The validity of using phonatory jitter and shimmer to detect laryngeal pathology", Fourth International Vocal Fold Physiology Conference, New Haven, CN.
J.H. McClellan, T.W. Parks and L.R. Rabiner (1979), FIR linear phase filter design program, ed. by Digital Signal Processing Committee, Programs for Digital Signal Processing (IEEE Press, New York).
J. Mackenzie, J. Laver and S. Hiller (1984), "Acoustic screening for vocal pathology: Preliminary results", Work in Progress 17, Department of Linguistics, University of Edinburgh, pp. 98-111.
J. Makhoul (1975), "Linear prediction: A tutorial review", Proc. IEEE Intern. Conf. Acoust., Speech, Signal Process., Vol. 63, pp. 561-580.
R.B. Monsen and A.M. Engebretson (1977), "Study of variations in the male and female glottal wave", J. Acoust. Soc. Am., Vol. 62, pp. 981-993.
R.J. Niederjohn and M. Lahat (1985), "A zero-crossing consistency method for formant tracking of voiced speech in high noise levels", IEEE Trans. Acoust., Speech, Signal Process., Vol. 33(2), pp. 349-355.
L.R. Rabiner and R.W. Schafer (1978), Digital Processing of Speech Signals (Prentice-Hall, Englewood Cliffs).
L.A. Ramig and R.L. Ringel (1983), "Effects of physiological aging on selected acoustic characteristics of voice", J. Speech and Hearing Res., Vol. 26, pp. 22-30.
R.A. Rasch (1983), "Jitter in the singing voice", Proceedings of the Tenth Intern. Congress of Phonetic Sciences, Utrecht, pp. 288-292.
M. Rossi and D. Autesserre (1981), "Movements of the thyroid and the larynx and the intrinsic frequency of vowels", J. of Phonetics, Vol. 9(2), pp. 233-249.
J. Schoentgen (1982), "Quantitative evaluation of the discrimination performance of acoustic features in detecting laryngeal pathology", Speech Communication, Vol. 1, pp. 269-282.
J. Schoentgen (1985), "L'incidence des pathologies laryngées sur le signal de parole: Le pouvoir discriminatif des indices acoustiques" (Unpublished thesis, Free University of Brussels).
J. Schoentgen (1988), "Performance of jitter in discriminating between normal and dysphonic speakers", Applied Stochastic Models and Data Analysis, Vol. 4, pp. 127-135.
W. Serniclaes (1984), Personal communication.
S. Siegel (1956), Nonparametric Statistics for the Behavioural Sciences (McGraw-Hill, New York).
D.Y. Wong (1980), "On understanding the quality problems of LPC speech", Intern. IEEE Conf. Acoust., Speech, Signal Process., pp. 725-728.