Perception of jitter and shimmer in synthetic vowels


Journal of Phonetics (1979) 7, 343-355

Anton J. Rozsypal and Bruce F. Millar

Department of Linguistics, The University of Alberta, Edmonton, Alberta, Canada T6G 2H1

Received 15th May 1978

Abstract:

The naturalness of synthetic vowels /a/, /i/, and /u/, with introduced jitter (perturbations of the period durations of the glottal wave) and shimmer (perturbations of the amplitude of the glottal pulses) was investigated. In a paired-comparisons experiment the subjects were instructed to select the more human-like speech sound in each pair of synthetic vowel stimuli with different amounts of jitter and shimmer. The preference data were analysed using a multidimensional scaling technique. The obtained solution was two-dimensional, not readily interpretable in terms of signal parameters. In a rating experiment the subjects estimated the naturalness of such stimuli. The rating data were subjected to analysis of variance. The results support the observation that some degree of roughness is necessary for synthetic vowels to be perceived as natural. No trade-off relationship between jitter and shimmer was obtained.

0095-4470/040343+13 $02·00/0 © 1979 Academic Press Inc. (London) Ltd.

Introduction

Upon close analysis, speech signals which are in general regarded as periodic show imperfect periodicity. This cycle-to-cycle variation is caused by perturbations of the glottal wave period, called jitter, and by perturbations of the amplitude of glottal pulses, referred to as shimmer. Both are the result of involuntary laryngeal variations during phonation, with extreme deviations from periodicity taking place at the onset and decay of voicing and at abrupt spectral shifts. They do not seem to be linked to any particular language (Lieberman, 1963). Jitter appears to be affected by the degree of subjective vocal constriction (Beckett, 1969) and by the emotional state of the speaker (Lieberman, 1961; Lieberman & Michaels, 1962), since various emotional speech modes are apparently produced with different degrees of conscious vocal control. With pathological changes in the laryngeal structure these perturbations become more pronounced (Lieberman, 1963; Koike, 1969; Hecker & Kreul, 1971; Kreul & Hecker, 1971; Iwata & von Leden, 1970).

Data on the perceptual role of these perturbations are applicable in speech synthesis, since the presence of jitter and shimmer contributes to the natural quality of synthetic speech (Rozsypal & Millar, 1975). Schroeder & David (1960) report that preservation of individuality and naturalness in synthetic speech necessitates faithful reproduction of irregular fluctuations of consecutive periods of voiced sounds, in particular at the onset and offset of voicing. On the other hand, Rosenberg (1968) claims that listeners are unable
to detect elimination of period duration fluctuations from natural speech as long as the natural pitch contour of a sentence remains preserved.

Research in this field can be of considerable clinical interest. Isshiki, Yanagihara & Morimoto (1966) and Yanagihara (1967) recognized two basic types of hoarseness: one due to jitter, the other due to constant air leakage through the glottis without noticeable perturbations of the glottal waveform. They suggested spectrographic measurements of relative levels of noise and harmonic components in vowel sounds as a method for objective diagnosis and classification of hoarseness. New clinical methods for early detection and diagnosis of laryngeal pathologies can be based on acoustic parameters derived from patients' utterances (Crystal, 1972; Hiki, Imaizumi, Hirano, Matsushita & Kakita, 1975; Davis, 1975; Murry & Doherty, 1977; Hiki, Imaizumi, Hirano & Matsushita, 1978).

Several perceptual aspects of perturbed trains of narrow pulses have been investigated. Rosenberg (1966) tested discrimination between pulse trains of equal pulse rates and different polarity patterns as a function of jitter and found that the spectral features most relevant for discriminating such pulses lie in the frequency range from approximately 300 to 1000 Hz. In a pitch matching experiment Cardozo & Ritsma (1968) showed that with increasing jitter, pitch definition decreases and eventually, for jitter greater than about 10%, pitch becomes imperceptible. The threshold of jitter detectability, normally of the order of several tenths of a percent, rises sharply for stimuli shorter than about 100 ms. This threshold also rises for stimuli with non-stationary pitch contours.
In numerous experiments, Pollack investigated the influence of pulse rate, stimulus duration, and amount of waveform perturbation on perception of imperfect periodicity over a wide range of experimental conditions, such as detection of single period disturbances (gaps) in either periodic (Pollack, 1967) or jittered (Pollack, 1968a) pulse trains, thresholds of detectability and discriminability of random and patterned jitter (Pollack, 1968b, d, e, 1969b), pitch discrimination of periodic and jittered pulse trains (Pollack, 1968c), and detection and discrimination of jitter and shimmer in rectangular waveforms (Pollack, 1971b). Pollack also tested the contribution of various frequency regions to jitter detection by measuring the effect of noise masking (Pollack, 1969a) and frequency filtering (Pollack, 1971a) on jitter detection. These experiments indicate that at higher pulse rates jitter detection and discrimination improve sharply with stimulus duration, while at low pulse rates jitter thresholds depend on stimulus duration to a lesser degree. For pulse rates below about 100 Hz, jitter appears to be detected on the basis of temporal analysis, while above that frequency the detection relies on spectral cues, the most relevant of which are centered around 1000 Hz.

Perceptual roughness of a simulated glottal waveform with no formant structure imposed on its spectrum was studied by several investigators. In a series of experiments Wendahl (1963, 1966a, b) generated his stimuli in an electrical laryngeal analog LADIC. Jitter and shimmer were introduced by controlling the period duration and amplitude of individual glottal sawtooth pulses, respectively. The cycle-to-cycle variations were introduced in a deterministic fashion, the perturbation pattern being repeated over eight, four, or two cycles. Wendahl's results indicate that both jitter and shimmer can be judged on a perceptual roughness scale in the same general manner.
In both cases the roughness judgements are directly related to the magnitude of the period and amplitude variations. A systematic relationship exists between jitter stimuli and shimmer stimuli which produce equal auditory roughness, but only highly trained subjects are able to discriminate between these two stimulus types. For a jitter waveform with given period variations, an increase
of its median frequency results in a decrease of perceptual roughness. Coleman (1969, 1971), also using the stimulus generator LADIC, examined the influence of two signal characteristics on the estimation of perceptual roughness. One was jitter, the other was the range of quasi-random pulse-to-pulse changes in the glottal pulse duration, simulating the variance in the open-to-close ratio of the glottis. Coleman found that stimuli with the same proportional jitter, related to median period duration, elicit higher roughness judgements as their pitch is decreased. Increased changes in the pulse durations, whether in combination with jitter or not, give rise to increased perceptual roughness. LaBelle (1973) studied the effect of vibrato rate and its frequency extent on auditory roughness judgements. The test signal was a periodic sawtooth waveform, which was frequency modulated by a sinusoidal signal whose frequency and amplitude determined the vibrato rate and its extent, respectively. Roughness of such sinusoidally jittered simulated glottal waveforms significantly increased both with increased vibrato frequency and with increased vibrato extent. Also, the interaction between these two stimulus variables was found to be significant.

Our aim at the outset of this project was to determine perceptually the optimal amount of jitter and shimmer in sustained synthetic vowels. We expected to find some kind of trade-off relationship between jitter and shimmer in perception of synthetic vowels. Several reasons led us to this expectation. One argument for the jitter-shimmer trade-off may be based on the findings of Wendahl (1966b), that jitter and shimmer result in similar auditory experiences, that both can be scaled on an auditory continuum of roughness, and that the roughness of both types of signals is directly comparable.
Only some highly trained subjects in his experiment were able to distinguish between jitter and shimmer stimuli, characterizing the latter as sounding more "bassy".

Another argument in favor of the trading relationship might be the spectral similarities between jitter and shimmer in combination with the limited sensitivity of hearing to phase differences of components of complex signals. As a crude simplification, the jittered glottal waveform can be compared to a periodic sequence of glottal pulses frequency modulated by a random signal. Similarly, the shimmer type of perturbation can be conceived of as a periodic glottal wave modulated in amplitude. For a low modulation index certain approximations are satisfied and the frequency modulation can be treated as the so-called narrowband frequency modulation. In such a case the spectra of frequency modulated signals consist of the same components as spectra of amplitude modulated signals, provided the carrier and modulating signals are identical (Lathi, 1968). The only spectral distinction between these two types of modulation can be found in the different phase spectra of the components resulting from the modulation process. This has been illustrated by Zwicker & Feldtkeller (1967) for the case of modulating a sinusoidal carrier signal by another sinusoid. In spite of these spectral similarities, the amplitude and frequency modulated signals have distinctly different waveforms.

More accurately, jitter and shimmer can be considered as pulse-position modulation and pulse-amplitude modulation, respectively. In the case of a random modulating signal such pulse trains are necessarily random and thus non-deterministic. Consequently, their spectra cannot be calculated using the conventional method of Fourier transform. Power density spectra of such signals can be obtained in two steps. First, the autocorrelation function of the random waveform is calculated.
At this stage the phase information about the spectral components is lost since the autocorrelation function is independent of the phase relationship of the spectral components of the correlated function. As a second step, the power density spectrum is
obtained as the Fourier transform of the autocorrelation function. Using this two-step procedure Nelsen (1964) calculated power spectra for several cases of randomly perturbed waveforms. Examination of his formula (6) for the power spectrum of a jittered and shimmered train of narrow pulses reveals that both types of perturbations generate components of the same frequencies. In special cases the amplitudes of these components can also be similar. But, as noted above, no phase information about the components is available.

To help us better understand the mechanism of jitter and shimmer perception, we decided also to investigate the dimensionality, i.e. the number and possibly also the perceptual nature of the major dimensions, of the perceptual space into which vowel signals with different degrees of jitter and shimmer are mapped.
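The narrowband-modulation argument made earlier in the Introduction can be checked numerically for the simplest case of a sinusoidal carrier modulated by a sinusoid. The sketch below is only an illustration under assumed values: the sampling rate, carrier frequency, modulating frequency, and modulation index are hypothetical and are not taken from the experiments. It evaluates the carrier and the two first-order sidebands of an amplitude-modulated and a frequency-modulated signal with the same modulation index.

```python
import cmath
import math

FS = 8000.0    # sampling rate in Hz (assumed for illustration)
FC = 1000.0    # carrier frequency in Hz (assumed)
FM = 100.0     # modulating frequency in Hz (assumed)
INDEX = 0.1    # modulation index, small enough for the narrowband case
N = 8000       # one second of signal, so each component falls on a DFT bin


def component(signal, freq_hz):
    """Complex amplitude of the cosine component of `signal` at freq_hz."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * k / FS)
              for k, x in enumerate(signal))
    return 2 * acc / len(signal)


t = [k / FS for k in range(N)]
# Amplitude modulation: the envelope follows the modulating sinusoid.
am = [(1 + INDEX * math.cos(2 * math.pi * FM * x)) * math.cos(2 * math.pi * FC * x)
      for x in t]
# Frequency (angle) modulation with the same index.
fm = [math.cos(2 * math.pi * FC * x + INDEX * math.sin(2 * math.pi * FM * x))
      for x in t]

am_c = {f: component(am, f) for f in (FC - FM, FC, FC + FM)}
fm_c = {f: component(fm, f) for f in (FC - FM, FC, FC + FM)}

for f in (FC - FM, FC, FC + FM):
    print(f, round(abs(am_c[f]), 4), round(abs(fm_c[f]), 4),
          round(am_c[f].real, 4), round(fm_c[f].real, 4))
```

Both signals show sidebands of nearly equal magnitude, about half the modulation index, at the same two frequencies. The difference lies in the sign of the lower sideband, which is inverted for the frequency-modulated signal. This is the phase distinction the text refers to.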

Experimental method

Two experiments were carried out. The first was a paired-comparisons experiment in which pairs of vowel stimuli with varying degrees of roughness were presented. Roughness of the stimuli was caused by jitter and shimmer introduced into the glottal waveform. By having the subjects indicate the more natural of the two stimuli in each pair, it was possible to draw inferences about their underlying perceptual space and hence relate jitter and shimmer to vowel naturalness. In the second experiment stimuli of the same type were presented individually. The subjects were requested to rate their naturalness directly and thus determine the optimal amounts of jitter and shimmer in the synthetic vowels tested.

Stimuli

Three stationary vowel sounds /a/, /i/, and /u/ were tested. These were synthesized on a cascade analog speech synthesizer PAT with three variable formants (Anthony & Lawrence, 1962), in which the normal glottal waveform was replaced by a triangular glottal wave, generated by a PDP-12 computer (Digital Equipment Corporation), with controlled amounts of jitter and shimmer. Two typical waveforms are shown in Fig. 1. Formant frequencies for the three vowels tested are presented in Table I.

Figure 1. Examples of glottal waveforms used in simulating jitter and shimmer in synthetic vowels. Upper trace: periodic waveform with no jitter and shimmer. Lower trace: waveform with intermediate levels of jitter (2%) and shimmer (9·5%).

Table I. Formant frequencies (Hz) of the vowel stimuli synthesized for the present experiments

Vowel    F1     F2      F3
/a/      730    1500    2200
/i/      320    2300    3100
/u/      320    1020    1100

The duration of the individual glottal pulses was constant. To simulate jitter, the mean glottal period was adjusted for each period by a random period deviation, which had the discrete probability distribution given in Table II. This discrete distribution approximated a Gaussian (normal) distribution. Its mean was zero and the standard deviation depended on the jitter level. The extreme deviations were equal to three standard deviations. Shimmer was simulated in a manner parallel to that
for jitter. The same probability distribution was used to compute the deviations from the mean amplitude of the individual glottal pulses. In no case did the extreme deviations result in glottal pulse periods or amplitudes of negative value.

All stimuli were 1200 ms long at a constant mean fundamental frequency of 110 Hz. Four glottal pulses at the beginning and termination of the stimulus were scaled in amplitude to obtain a smooth onset and decay. Jitter was tested at five levels. The first level denotes no jitter. At the highest jitter level, the standard deviation of the period perturbations amounted to 4% of the mean period. At the intervening three jitter levels the respective standard deviations amounted to 1%, 2%, and 3% of the mean period.

Table II. Discrete Gaussian probability distribution used in synthesizing the perturbed glottal waveform. Period and amplitude deviations are expressed in terms of the standard deviation s

Probability    Deviation
10/64          0·00
9/64           ±0·38 s
7/64           ±0·76 s
5/64           ±1·12 s
3/64           ±1·48 s
2/64           ±1·90 s
1/64           ±3·00 s

Shimmer was also tested at five levels, the first of which had no shimmer. At the higher shimmer levels the standard deviations of the amplitude perturbations were equal to 4·75%, 9·5%, 14·25%, and 19% of the mean amplitude of the glottal pulses. The random processes governing jitter and shimmer were independent. The tested ranges of perturbations were selected in a pilot study so that two stimuli, one with the maximum level of jitter and the other with the maximum level of shimmer, were perceptually about equally rough, but still acceptable as speech-like. The chosen ratio of the maximum shimmer and jitter values in our experiment, 19% to 4%, is in very good agreement with the ratio of threshold values for shimmer and jitter, about 10% to 2%, obtained by Pollack (1971b) for a square wave of comparable fundamental frequency.
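The perturbation scheme described in this section can be sketched in a few lines. This is a minimal illustration and not the original PDP-12 program. It makes three assumptions that the text does not state outright: each non-zero deviation in Table II carries the listed probability for each sign (the only reading under which the probabilities sum to one), deviations scale the mean period and mean amplitude multiplicatively, and the four-pulse onset and decay ramps are ignored. The function and variable names are hypothetical.

```python
import random
from fractions import Fraction

# Table II: (probability, deviation in units of the standard deviation s).
# Assumption: each non-zero deviation occurs with this probability for each sign.
TABLE_II = [
    (Fraction(10, 64), 0.00), (Fraction(9, 64), 0.38), (Fraction(7, 64), 0.76),
    (Fraction(5, 64), 1.12), (Fraction(3, 64), 1.48), (Fraction(2, 64), 1.90),
    (Fraction(1, 64), 3.00),
]


def build_distribution():
    """Expand Table II into parallel lists of signed deviations and probabilities."""
    values, weights = [], []
    for prob, dev in TABLE_II:
        if dev == 0:
            values.append(0.0)
            weights.append(prob)
        else:
            values.extend([dev, -dev])
            weights.extend([prob, prob])
    return values, weights


VALUES, WEIGHTS = build_distribution()


def perturbed_pulses(jitter_pct, shimmer_pct, f0_hz=110.0, duration_s=1.2, seed=0):
    """Return (periods in seconds, relative amplitudes) for one stimulus.

    jitter_pct and shimmer_pct are the nominal standard deviations as a
    percentage of the mean period and of the mean amplitude. The two random
    sequences are drawn independently, as in the experiment.
    """
    rng = random.Random(seed)
    n_pulses = round(duration_s * f0_hz)
    mean_period = 1.0 / f0_hz
    w = [float(p) for p in WEIGHTS]
    period_dev = rng.choices(VALUES, weights=w, k=n_pulses)
    amp_dev = rng.choices(VALUES, weights=w, k=n_pulses)
    periods = [mean_period * (1 + jitter_pct / 100 * d) for d in period_dev]
    amplitudes = [1 + shimmer_pct / 100 * d for d in amp_dev]
    return periods, amplitudes


# Variance of the tabulated deviations, in units of s squared.
table_variance = float(sum(p * v * v for p, v in zip(WEIGHTS, VALUES)))
periods, amplitudes = perturbed_pulses(jitter_pct=4.0, shimmer_pct=19.0)
print(len(periods), min(periods) > 0, min(amplitudes) > 0, round(table_variance, 4))
```

Even at the highest levels (4% jitter, 19% shimmer) the extreme deviation of three standard deviations leaves every period and amplitude positive, in line with the statement above. The variance of the tabulated values works out slightly above one, so the realized standard deviation is marginally larger than the nominal s.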

Subjects

Twenty-seven subjects participated in the experiments. They were native Canadian English speakers, university students and high school teachers, ranging in age from 18 to 42, with no history of hearing defects. Although they were informed about the purpose of the test, namely to find the optimal amount of harshness in synthetic speech, they had no knowledge of the nature of the study.

Testing procedure

The testing session lasted about one hour and consisted of two experiments. Instructions for both were issued at the beginning of the session in order not to disrupt continuity of the testing. Small groups of subjects were tested in a sound-treated classroom. The stimuli were reproduced from an audio tape by a loudspeaker system at a comfortable listening level, equalized for all vowel types. The testing tape was recorded under PDP-12 computer control.

The session started with a paired-comparisons experiment in which pairs of stimuli, namely the same vowel with different amounts of jitter and shimmer, were presented. Only three levels, that is the first, third, and fifth levels of jitter and shimmer, were tested in this experiment. The duration of both stimuli was 1200 ms, with an intervening pause of one second. A six-second pause separated pairs. The order of presentations was randomized. Each stimulus was compared with all other stimuli of the same vowel. Furthermore, each pair was presented once in both possible orders. This resulted in a total of 216 testing pairs being presented, preceded by 17 practice pairs. Subjects indicated on scoring sheets which member of the pair they preferred as more human-like and natural.

In the second experiment of the same testing session, a naturalness rating experiment, single stimuli were presented in a random order using the same jitter and shimmer range as in the paired-comparisons experiment, but at all five levels of jitter and shimmer. This experiment consisted of 75 presentations, as each stimulus was presented once. Stimulus duration was again 1200 ms and the interstimulus interval was six seconds. Subjects rated the naturalness of these stimuli on a seven-point scale.
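The trial counts quoted above follow directly from the design, and a short enumeration makes the arithmetic explicit. The sketch below is illustrative only. The function names are hypothetical, and the timing estimate assumes that each paired trial occupies two stimuli, the one-second pause, and the six-second pause, and that each rating trial occupies one stimulus plus the six-second interval.

```python
from itertools import permutations, product

VOWELS = ["a", "i", "u"]
PAIRED_LEVELS = [1, 3, 5]         # first, third, and fifth levels
RATING_LEVELS = [1, 2, 3, 4, 5]   # all five levels


def paired_comparison_trials():
    """Every ordered pair of distinct stimuli within each vowel."""
    trials = []
    for vowel in VOWELS:
        stimuli = list(product(PAIRED_LEVELS, PAIRED_LEVELS))  # (jitter, shimmer)
        for first, second in permutations(stimuli, 2):
            trials.append((vowel, first, second))
    return trials


def rating_trials():
    """Every vowel at every jitter and shimmer level, presented once."""
    return [(v, j, s) for v in VOWELS for j, s in product(RATING_LEVELS, repeat=2)]


def session_minutes(n_pairs, n_practice, n_ratings):
    """Rough playback time under the timing assumptions stated in the lead-in."""
    pair_s = 1.2 + 1.0 + 1.2 + 6.0
    rating_s = 1.2 + 6.0
    return ((n_pairs + n_practice) * pair_s + n_ratings * rating_s) / 60


pairs = paired_comparison_trials()
ratings = rating_trials()
print(len(pairs), len(ratings), round(session_minutes(len(pairs), 17, len(ratings)), 1))
```

Nine stimuli per vowel give 72 ordered pairs, hence 216 pairs over three vowels, and 25 stimuli per vowel give the 75 rating presentations. Under the stated timing assumptions playback takes roughly 45 minutes, which is compatible with a session of about one hour once instructions are included.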
Since it was stressed in the instructions that the same stimuli would be used in both experiments and since the second experiment immediately followed the first one, the subjects started the rating experiment with the set of anticipated stimuli well anchored in their minds.

Experimental results

Preferred naturalness data

The preference data from the paired-comparisons experiment were analysed using the Torgerson proximity scaling algorithm (Torgerson, 1958) in order to obtain estimates of interstimulus distances in the perceptual space. These estimates were input into a Kruskal-Sheppard multidimensional scaling program (Kruskal, 1964a, b; MacGuire, 1968) assuming a Euclidean metric. The dimensionality of the perceptual space for each vowel was chosen on the basis of the Kruskal stress values. The computed solutions are presented in Fig. 2. In these plots each stimulus point is labelled by two digits, the first denoting the jitter level and the second the level of shimmer.

Figure 2. Multi-dimensional scaling solutions of the paired-comparisons data for the synthetic perturbed vowels. One-dimensional solution for the vowel /a/ and two-dimensional solutions for the vowels /a/, /i/, and /u/. Each stimulus point in the perceptual space is labelled by two digits, the first corresponding to the level of jitter and the second to the level of shimmer.

For the vowel /a/ the obtained dimensionality is one; for /i/ and /u/ the solution is two-dimensional. For the sake of symmetry the two-dimensional solution for /a/ is also presented. Comparison of the three two-dimensional solutions indicates that the perceptual configuration is quite stable across the vowels investigated. However, the obtained dimensions I and II are not readily interpretable, nor can any rotated coordinate system be expressed in terms of stimulus variables. The closest to any plausible interpretation is a "naturalness" coordinate running in the direction of the dashed lines, along which the
clusters defined by jitter are ordered. Some other, apparently dichotomous, perceptual variable is operating orthogonally to this "naturalness" axis.

Naturalness rating data

In processing the rating experiment data the between-subjects differences were reduced by standardizing each individual's response profile. After subsequent averaging across subjects, smoothing, and interpolation, the data were plotted in contour maps of equal natural quality ratings. These maps are presented in Fig. 3. In these plots the 25 grid points represent the rated stimuli, with jitter increasing from top to bottom and shimmer from left to right. The contours are marked in terms of a standardized naturalness rating scale, in which the higher numbers correspond to higher naturalness.

In the case of the vowel /i/, the stimulus with jitter level two (1%) and no shimmer was rated best. While shimmer for the vowel /i/ appears to be critical at low jitter levels, it does not seem to play any important role at high jitter levels. The contours do not suggest the existence of any trade-off between jitter and shimmer, at least not in the area of the higher naturalness ratings. Such a trade-off relationship would be indicated in these plots by contours running diagonally, in the left-bottom to right-top direction. With the possible exception of the central area
for the vowel /a/, the other two vowels do not show the expected trade-off either. The vowel /a/ has its optimum at jitter level three (2%) and no shimmer, a higher jitter level than for the vowel /i/. Here, in contrast to the case of /i/, shimmer at low jitter levels has only a small effect on the rating. For the vowel /u/ the region of maximum naturalness rating was at jitter level three (2%) and shimmer level two (4·75%), the highest amount of glottal wave perturbation of all three vowels tested. The truly periodic /u/ stimulus sounded more like a buzzer than a speech sound. Except for the region around the optimum, shimmer appears to have very little influence on the rating.

Figure 3. Contours of equal naturalness rating for synthetic perturbed vowels /a/, /i/, and /u/.

The same naturalness rating data were also analysed using an analysis of variance technique. Results of this analysis are presented in Table III.

Table III. Results of the analysis of variance of data from the naturalness rating experiment

Source        df      SS         F
Vowel (V)     2       105·18     59·16*
Jitter (J)    4       64·70      18·20*
Shimmer (S)   4       18·81      5·29*
V × J         8       25·97      3·65*
V × S         8       7·15       1·01
J × S         16      10·75      0·76
V × J × S     32      31·56      1·11
Subjects      1950    1733·36

*p < 0·01.

In a mixed model the vowel
type, jitter, and shimmer were treated as fixed factors and subjects as a random factor. All main effects and the vowel by jitter interaction were found to be significant at the 0·01 level. In order of naturalness rating, the vowel /i/ was the best (30·5), then /u/ (-6·2), and finally /a/ (-24·3). The vowel by shimmer interaction was found to be non-significant. The significant main effect of shimmer is illustrated in the top graph of Fig. 4, showing a decreasing trend in naturalness rating with increased shimmer. This indicates that the presence of shimmer in synthetic vowels is undesirable since, irrespective of vowel quality, shimmer reduces their naturalness. In the bottom graph of Fig. 4 the vowel by jitter interaction is illustrated, with the vowel main effect removed. Here it can be seen that jitter in the vowels /a/ and /u/ is perceived alike as to naturalness and that the vowel /i/ differs from these two vowels only at the two extreme jitter levels. Optimum jitter for /a/ is 2% and for /u/ is 1%. The optimum jitter range for /i/ appears to be wider, between no jitter and 2%.
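The F ratios in Table III can be recomputed from the tabulated degrees of freedom and sums of squares. The sketch below assumes that every effect is tested against the last line of the table (labelled "Subjects", 1950 df), an inference from the numbers and not something the text states. The dictionary and function names are hypothetical. Two F values that are damaged in the available scan are read here as 1.01 and 1.11, which is the reading the recomputation supports.

```python
# (df, sum of squares, reported F) for each effect in Table III.
ANOVA_ROWS = {
    "Vowel (V)": (2, 105.18, 59.16),
    "Jitter (J)": (4, 64.70, 18.20),
    "Shimmer (S)": (4, 18.81, 5.29),
    "V x J": (8, 25.97, 3.65),
    "V x S": (8, 7.15, 1.01),
    "J x S": (16, 10.75, 0.76),
    "V x J x S": (32, 31.56, 1.11),
}
ERROR_DF, ERROR_SS = 1950, 1733.36   # the line labelled "Subjects"


def f_ratio(df, ss):
    """Mean square of the effect divided by the assumed error mean square."""
    return (ss / df) / (ERROR_SS / ERROR_DF)


recomputed = {name: f_ratio(df, ss) for name, (df, ss, _) in ANOVA_ROWS.items()}

# Degrees-of-freedom bookkeeping: 27 subjects rated 75 stimuli each.
n_observations = 27 * 75
effect_df = sum(df for df, _, _ in ANOVA_ROWS.values())

for name, (df, ss, reported) in ANOVA_ROWS.items():
    print(f"{name:12s} reported {reported:6.2f}  recomputed {recomputed[name]:6.2f}")
print(n_observations - 1, effect_df + ERROR_DF)
```

Under this assumption every recomputed ratio agrees with the reported value to within rounding, and the effect and error degrees of freedom add up to the total for 27 subjects and 75 stimuli.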

Figure 4. Results of the analysis of variance of the naturalness rating data: shimmer main factor (upper plot) and vowel by jitter interaction with the vowel main effect removed (bottom plot).


Non-significance of the jitter by shimmer interaction further supports the finding that the trade-off between jitter and shimmer does not exist.

Discussion

Interpreting the data of this experiment is difficult because of the lack of other experimental data with which to compare them. Other studies have investigated jitter in natural speech, but no data have been collected investigating both jitter and shimmer.

It has been suggested by several authors that the degree of perceived harshness in vowels is related to different vowel categories. To investigate this question Sherman & Linke (1952) let their subjects, diagnosed as exhibiting harsh voice quality, read passages of connected speech which were constructed to contain six dichotomized vowel categories: front-back, high-low, and tense-lax. Tape recordings of these speech samples were presented to qualified observers for rating of harshness severity. High vowels were rated as less harsh than low vowels, and similarly lax vowels as less harsh than tense vowels. No significant difference in harshness rating between front and back vowels was found.

Rees (1958) studied perceived harsh voice quality of individual vowels in different contexts: in isolation, initiated with /h/, or in interconsonantal position of the CVC type, with the same initial and final consonant. These speech samples were recorded by adult male speakers with harsh voice quality and presented for rating of harshness severity to a group of trained judges. Analysis of variance of these ratings yielded as significant both the vowel and the consonantal context main factors and the vowel by context interaction. Vowels in isolation were rated as more harsh than vowels initiated with /h/ and both these as more harsh than most of the vowels in the CVC context. Of many other interesting findings we will mention here only those related to our set of vowels. For all contexts, harshness increased from high to mid to low vowels. The high vowels /i/ and /u/ in all contexts were rated as the two least harsh of the nine vowels tested.
The low vowel /a/, depending on the context, ranked either as sixth or eighth most harsh of the set. Harshness rating did not seem to depend on the front-back or tense-lax vowel categorization.

Higher roughness ratings for low vowels than for high vowels were also confirmed by the series of experiments of Emanuel & Sansone (1969), Sansone & Emanuel (1970), Lively & Emanuel (1970), and Emanuel, Lively & McCoy (1973), who measured the ratio of harmonic to inharmonic energy in different frequency bands of vowels. They found that the spectra of rough vowels were distinguished by increased spectral noise levels and by decreased harmonic components with respect to normal vowels. The perceived roughness was proportional to the spectral noise levels. They report that in sustained vowels produced by normal speakers of both sexes, for both normal and simulated rough vowels, the low vowels in general contained higher spectral noise and were rated higher in roughness severity than the high vowels. For vowels sustained by normal speakers, Emanuel & Smith (1974) observed that raising vocal pitch above the subject's habitual pitch consistently reduces the roughness ratings and the level of the noise components.

Our results can be compared with available data on perturbations in natural vowels. To our knowledge, no work has yet been published on analysis of shimmer in natural speech. Hollien, Michel & Doherty (1973) provided data on the amount of jitter in the sustained vowel /a/ produced by normal adult male speakers at four fundamental frequencies within their normal phonatory range. Their results indicate that laryngeal jitter increases with fundamental frequency of the vowel. To quantitatively evaluate jitter, Hollien et al. introduced a jitter factor which is comparable to the way jitter is expressed as a percentage in our paper. Their mean jitter factor of between about 0·5 at low and 1·0 at
high fundamental frequencies (corresponding to jitter 0·43% and 0·87%, respectively) is in fairly good agreement with the value for optimal jitter around 2% for the same vowel /a/ obtained for synthetic speech in our experiments.

Kasprzyk & Gilbert (1975) measured jitter as a function of tongue height in five sustained vowels produced by normal adult male speakers. Their findings indicate that jitter is not dependent on tongue height, since their mean perturbation factor for different vowels did not show any significant differences. For the vowels tested in the present study, Kasprzyk & Gilbert obtained these mean perturbation factors: 4·20 for /u/, 6·02 for /i/, and 6·36 for /a/. Although the differences are not significant, jitter for the low vowel /a/ is greater than for the high vowels /i/ and /u/. These values are not directly comparable to the jitter values obtained in our experiments since fundamentally different methods were used to evaluate jitter. Results of our naturalness rating experiment agree with the above experiments in confirming that optimal jitter for the high vowels /i/ and /u/ is lower than for the low vowel /a/.

The authors are aware of many limitations of the present experiments. For instance, the stimuli were relatively long, with a flat pitch contour. Such vowel sounds are a rare phenomenon in natural speech. Only one fundamental frequency, 110 Hz, was tested. In order to generalize about the roughness of natural speech the data should cover the complete range of fundamental frequencies, both male and female. Both jitter and shimmer modulating functions in our stimuli were random and mutually independent. This is not the case in natural speech. Lieberman (1961) observed that jitter is not entirely random since a high degree of correlation exists between the durations of each period and its next but one successor.
For certain types of laryngeal pathologies Koike (1969) observed shimmer to display periodicity ranging from two to twelve glottal periods. Further, we did not test as an experimental variable the relationship between the fundamental and formant frequencies of our stimuli, although their ratio critically influences the threshold of jitter detection (Cardozo & Neelen, 1968; Neelen, 1969) and consequently may represent a significant factor in naturalness rating.

Conclusions

From the present experiments it can be concluded that some degree of jitter is necessary for sustained vowels to be perceived as natural. The optimal amount of jitter depends on the vowel sound. The effect of shimmer is equal for all vowels, and less pronounced than that of jitter. The presence of shimmer in synthetic vowels lowers their naturalness rating. The expected trade-off between jitter and shimmer in the perception of naturalness of sustained vowels is called into question. The perceptual space for perturbed vowel sounds was found to be two-dimensional. These two dimensions are not readily interpretable either in terms of jitter and shimmer, or in any other physical parameters of the stimuli.

This research was supported in part by the Summer Temporary Employment Program (STEP) of the Department of Advanced Education of the Alberta Provincial Government.

References

Anthony, J. & Lawrence, W. (1962). A resonance analogue speech synthesizer. The Fourth International Congress on Acoustics, Copenhagen.
Beckett, R. L. (1969). Pitch perturbation as a function of subjective vocal constriction. Folia Phoniatrica 21, 416-25.

Cardozo, B. L. & Neelen, J. J. M. (1968). Audibility of jitter in pulse trains as affected by filtering. Institute for Perception, Annual Progress Report 3, 13-15.
Cardozo, B. L. & Ritsma, R. J. (1968). On the perception of imperfect periodicity. IEEE Transactions on Audio and Electroacoustics AU-16, 159-64.
Coleman, R. F. (1969). Effect of median frequency levels upon the roughness of jittered stimuli. Journal of Speech and Hearing Research 12, 330-6.
Coleman, R. F. (1971). Effect of waveform changes upon roughness perception. Folia Phoniatrica 23, 314-22.
Crystal, T. H. (1972). Laryngeal-disorder detection through speech analysis (Abstract). Journal of the Acoustical Society of America 52, 158.
Davis, S. B. (1975). Preliminary results using inverse filtering of speech for automatic evaluation of laryngeal pathology (Abstract). Journal of the Acoustical Society of America 58, S111.
Emanuel, F. W. & Sansone, F. E., Jr. (1969). Some spectral features of "normal" and simulated "rough" vowels. Folia Phoniatrica 21, 401-15.
Emanuel, F. W., Lively, M. A. & McCoy, J. F. (1973). Spectral noise levels and roughness ratings for vowels produced by males and females. Folia Phoniatrica 25, 110-20.
Emanuel, F. W. & Smith, W. F. (1974). Pitch effects on vowel roughness and spectral noise. Journal of Phonetics 2, 247-53.
Hecker, M. H. L. & Kreul, E. J. (1971). Descriptions of the speech of patients with cancer of the vocal folds. Part I: Measures of fundamental frequency. Journal of the Acoustical Society of America 49, 1275-82.
Hiki, S., Imaizumi, S., Hirano, M., Matsushita, H. & Kakita, Y. (1975). Acoustical analysis for voice disorders: An attempt towards clinical applications (Abstract). Journal of the Acoustical Society of America 58, S111.
Hiki, S., Imaizumi, S., Hirano, M. & Matsushita, H. (1978). Analysis of pathological voices with a sound spectrograph (Abstract). Journal of the Acoustical Society of America 63, S86.
Hollien, H., Michel, J. & Doherty, E. T. (1973). A method for analyzing vocal jitter in sustained phonation. Journal of Phonetics 1, 85-91.
Isshiki, N., Yanagihara, N. & Morimoto, M. (1966). Approach to the objective diagnosis of hoarseness. Folia Phoniatrica 18, 393-400.
Iwata, S. & von Leden, H. (1970). Pitch perturbations in normal and pathologic voices. Folia Phoniatrica 22, 413-24.
Kasprzyk, P. L. & Gilbert, H. R. (1975). Vowel perturbation as a function of tongue height. Journal of the Acoustical Society of America 57, 1545-6.
Koike, Y. (1969). Vowel amplitude modulations in patients with laryngeal diseases. Journal of the Acoustical Society of America 45, 839-44.
Kreul, E. J. & Hecker, M. H. L. (1971). Descriptions of the speech of patients with cancer of the vocal folds. Part II: Judgments of age and voice quality. Journal of the Acoustical Society of America 49, 1283-7.
Kruskal, J. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1-27.
Kruskal, J. (1964b). Nonmetric multidimensional scaling: A numerical method. Psychometrika 29, 115-29.
LaBelle, J. L. (1973). Judgements of vocal roughness related to rate and extent of vibrato. Folia Phoniatrica 25, 196-202.
Lathi, B. P. (1968). Communication Systems. Pp. 214-16. New York: J. Wiley.
Lieberman, P. (1961). Perturbations in vocal pitch. Journal of the Acoustical Society of America 33, 597-603.
Lieberman, P. & Michaels, S. B. (1962). Some aspects of fundamental frequency and envelope amplitude as related to the emotional content of speech. Journal of the Acoustical Society of America 34, 922-7.
Lieberman, P. (1963). Some acoustic measures of the fundamental periodicity of normal and pathologic larynges. Journal of the Acoustical Society of America 35, 344-53.
Lively, M. A. & Emanuel, F. W. (1970). Spectral noise levels and roughness severity ratings for normal and simulated rough vowels produced by adult females. Journal of Speech and Hearing Research 13, 503-17.
MacGuire, T. O. (1968). SCAL: 02 Kruskal-Sheppard multidimensional scaling. The University of Alberta Educational Services Computer Documentation.
Murry, T. & Doherty, E. T. (1977). Frequency perturbation and duration characteristics of pathological and normal speakers (Abstract). Journal of the Acoustical Society of America 62, Suppl. 1, S5.
Neelen, J. J. M. (1969). Audibility of jitter pulse trains as affected by filtering - II. Institute for Perception, Annual Progress Report 4, 23-9.
Nelsen, D. E. (1964). Calculation of power density spectra for a class of randomly jittered waveforms. Massachusetts Institute of Technology, Research Laboratory of Electronics, Quarterly Progress Report 74, 168-74.
Pollack, I. (1967). Asynchrony: The perception of temporal gaps within periodic auditory pulse patterns. Journal of the Acoustical Society of America 42, 1335-40.
Pollack, I. (1968a). Asynchrony II: Perception of temporal gaps within periodic and jittered pulse patterns. Journal of the Acoustical Society of America 43, 74-6.
Pollack, I. (1968b). Detection and relative discrimination of auditory "jitter". Journal of the Acoustical Society of America 43, 308-16.
Pollack, I. (1968c). Discrimination of mean temporal interval within jittered auditory pulse trains. Journal of the Acoustical Society of America 43, 1107-12.
Pollack, I. (1968d). Periodicity discrimination for auditory pulse trains. Journal of the Acoustical Society of America 43, 1113-19.
Pollack, I. (1968e). Can the binaural system preserve temporal information for jitter? Journal of the Acoustical Society of America 44, 968-72.
Pollack, I. (1969a). Effect of masking noise and pulse level upon jitter detection. Journal of the Acoustical Society of America 45, 1022-4.
Pollack, I. (1969b). Auditory random-walk discrimination. Journal of the Acoustical Society of America 46, 422-5.
Pollack, I. (1971a). Spectral basis of auditory "jitter" detection. Journal of the Acoustical Society of America 50, 555-8.
Pollack, I. (1971b). Amplitude and time jitter thresholds for rectangular-wave trains. Journal of the Acoustical Society of America 50, 1133-42.
Rees, M. (1958). Some variables affecting perceived harshness. Journal of Speech and Hearing Research 1, 155-68.
Rosenberg, A. E. (1966). Pitch discrimination of jittered pulse trains. Journal of the Acoustical Society of America 39, 920-8.
Rosenberg, A. E. (1968). Effect of pitch averaging on the quality of natural vowels. Journal of the Acoustical Society of America 44, 1592-5.
Rozsypal, A. J. & Millar, B. F. (1975). Perceptual dimensionality of jitter and shimmer in synthetic vowels (Abstract). Journal of the Acoustical Society of America 58, S23.
Sansone, F. E., Jr. & Emanuel, F. W. (1970). Spectral noise levels and roughness severity ratings for normal and simulated rough vowels produced by adult males. Journal of Speech and Hearing Research 13, 489-502.
Schroeder, M. R. & David, E. E., Jr. (1960). A vocoder for transmitting 10 kc/s speech over a 3·5 kc/s channel. Acustica 10, 35-43.
Sherman, D. & Linke, E. (1952). The influence of certain vowel types on degree of harsh voice quality. Journal of Speech and Voice Disorders 17, 401-8.
Torgerson, W. S. (1958). Theory and Methods of Scaling. Pp. 254-259. New York: J. Wiley.
Wendahl, R. W. (1963). Laryngeal analog synthesis of harsh voice quality. Folia Phoniatrica 15, 241-50.
Wendahl, R. W. (1966a). Some parameters of auditory roughness. Folia Phoniatrica 18, 26-32.
Wendahl, R. W. (1966b). Laryngeal analog synthesis of jitter and shimmer auditory parameters of harshness. Folia Phoniatrica 18, 98-108.
Yanagihara, N. (1967). Significance of harmonic changes and noise components in hoarseness. Journal of Speech and Hearing Research 10, 531-41.
Zwicker, E. & Feldtkeller, R. (1967). Das Ohr als Nachrichtenempfänger. Pp. 6-12, 171. Stuttgart: S. Hirzel.