Intelligibility comparisons for two synthetic and one natural speech source


Journal of Phonetics (1983) 11, 37-49

John E. Clark
Speech and Language Research Centre, School of English and Linguistics, Macquarie University, North Ryde, New South Wales 2113, Australia

Received 16th July 1982

Abstract

A set of simple test procedures for making a quantitative assessment of synthetic speech intelligibility characteristics at the phonetic level is described. To show where the major weaknesses of the synthesized speech sources lie, comparisons were made between natural speech, analysis-synthesis speech, and speech synthesized by rule, using the same test materials and protocol. The results indicate that the intelligibility of synthesized speech sources is equal or superior to that of natural speech for vocalic segments, but that synthetic stop and fricative class consonants have lower intelligibility than their natural counterparts. Moreover, the synthesized consonants were more vulnerable to degradation by masking noise, showing the synthesized acoustic cues to be less robust perceptually. The results suggest that terminal analogue speech synthesizers perform least well when generating approximations to the complex spectra found in fricatives and the release phase of stops. Not unexpectedly, the rule-synthesized consonants showed poorer performance in noise than the analysis-synthesis consonants, since the derivation of their parametric specifications was one step further removed from the natural speech parametric data.

Introduction

It has long been recognised that informal assessments of synthesized speech are of little value. Fant (1968) has made the point that researchers regularly working with a particular speech synthesis system are often unable to distinguish between output which is highly intelligible and output which is almost totally unintelligible to others unfamiliar with the signal source or material concerned.

The use of speech synthesis systems for stimulus generation in speech perception research, and (more recently) for machine-man communication, has become an established technique over the past two decades. The degree to which the intelligibility of such speech synthesis systems matches that of natural speech is therefore a question of some interest, both in establishing their validity as research tools and in assessing their effectiveness as communications devices.

The present paper describes the application of a simple intelligibility test to two forms of synthesized speech produced by a 12-parameter serial formant speech synthesizer developed by the author (Clark, 1977):

(i) Analysis-synthesis derived from natural speech.
(ii) Synthesis-by-rule from a system developed by the author (Clark, 1981).

Identical testing of the same human talker used in the analysis-synthesis was also undertaken, to provide benchmark comparisons for the synthesized speech. Although the data presented are specific to the speech synthesis system under consideration, the results suggest that certain intelligibility limitations exist which may be inherent to the performance of the class of synthesizer being used.

0095-4470/83/010037+14 $03.00/0 © 1983 Academic Press Inc. (London) Ltd.

Test materials and methods

Intelligibility test scores have long been known to be influenced by the linguistic properties of the test materials themselves. (See Miller et al., 1951; Hirsh et al., 1954; Rubenstein et al., 1959; Schultz, 1964; Traul & Black, 1965; Epstein et al., 1968; Kalikow et al., 1976.) To establish worst-case intelligibility performance figures, the test materials in this study were chosen to contain as little unnecessary predictable linguistic structure as possible, forcing listeners to make maximum use of the acoustic cues associated with the speech sounds being tested. The 19 vocalic and 22 consonantal sounds of English (Gimson, 1972) were tested in two lists: list one comprising 19 vocalic segments in /h-d/ frames, and list two comprising 22 consonantal segments in CV (consonant-vowel) frames, as shown in Table I.

Table I. Test lists 1 and 2

List 1: hi:d hɪd hed hæd ha:d hɒd hɔ:d hʊd hu:d hʌd hɜ:d heɪd haɪd haʊd hɔɪd həʊd hɪəd hɛəd hʊəd
List 2: pa: ba: ta: da: ka: ga: tʃa: fa: va: θa: sa: ʒa: wa: ...

It is arguable that, in the interests of completely removing all potential linguistic context, the vocalic test list should simply have consisted of isolated segments. However, Millar & Ainsworth (1972), Strange et al. (1974), and Gottfried & Strange (1980) have all reported that the overall acoustic structure of the syllabic nucleus, including consonant-associated formant transitions, contributes to normal vocalic segment identification. Because of this, it seemed advisable to avoid any potential artifactual effects on intelligibility resulting from abnormally reduced linguistic context. The choice of the /h-d/ frame was based on its established use (for example, Peterson & Barney (1952) and Ainsworth & Millar (1971)) and its stable phonetic properties.

A CV syllable structure was chosen for testing the consonantal sounds because it represents the most basic syllable type containing an initial consonant; only the occurrence of /ŋ/ is phonotactically excluded in English. The single vowel /a:/ was used in all cases for four reasons:

(i) The vowel itself is highly intelligible in all speech sources (as evidenced in the /h-d/ list test results).
(ii) The use of a single vowel type maintains consistent coarticulatory effects on the consonant-associated formant transitions of the vocalic nucleus.
(iii) There is evidence from studies of consonant perception by Wang & Fillmore (1961) and Gay (1970) that consonant intelligibility varies with the choice of vowel used in the test syllable on an individual basis, with /a:/ giving the highest overall intelligibility figures.
(iv) The use of a variety of consonant-vowel combinations would have extended beyond the assessment of base-level worst-case performance (and in the light of (iii) above it would have made data interpretation rather more complex).

An open response rather than multiple or forced choice format was chosen to help obtain worst-case test scores. Because preliminary trial tests showed that all the speech sources had inherently high intelligibility levels (80%) under quiet listening conditions, four levels of masking noise were used to sensitise test conditions to help identify potentially vulnerable acoustic cues in the speech signals being tested.

Experimental method

Natural speech master recordings of the test items in lists one and two were made by a trained phonetician in a sound-treated studio. An analysis-synthesis version of the two test lists was produced with a computer-controlled 12-parameter serial formant synthesizer (Clark, 1977) using parameters derived from spectrographic analysis of the natural speech in the master recordings of the two test lists. A synthesis-by-rule version of the test lists was produced with the rule system (Clark, 1981) operating from a parametric data-base itself derived from spectrographic analysis of the same speaker used for the natural speech test list master recordings. The set of phonetic input strings specifying the segmental structure of the test items was given values intended to approximate the dialect of the natural speech master recordings. Master recordings of the two synthetic speech versions of the test lists were made directly from the output of the synthesizer.

A set of sub-master tapes was then dubbed from the master recordings of the two test lists of each of the three speech sources. During this process, each test item in each list was repeated (by re-recording) three times at 0.5 s intervals, and the recording levels of all test items were normalised to peak at 0 VU on the vocalic nucleus of the test item. The speech was also band-pass filtered from 90 Hz to 6300 Hz using a sixth order (-36 dB/octave), maximally flat Butterworth filter.

Triple repetition was introduced for use in the actual listening tests to avoid error responses caused by inattention or other spurious factors. Level normalizing was applied to ensure consistency in the sound pressure level at which the test items were presented to listeners. Filtering was applied to provide some degree of spectral equality between the natural and synthetic speech sources, since the latter inherently lacks some of the high frequency formant structure (and hence energy) present in the former.
This spectral disparity is due specifically to the very rapid attenuation of output from the vowel formant filter chain at frequencies above F4 (3500 Hz) and is an inherent property of the class of serial formant terminal analogue synthesizer used here to generate the speech.

The test conditions chosen for this study were:

Quiet          condition A
+12 dB S/N     condition B
+6 dB S/N      condition C
0 dB S/N       condition D
-6 dB S/N      condition E
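For the masked conditions listed above, the stated signal-to-noise ratio fixes the masker level relative to the speech reference level. The following is a minimal sketch of that arithmetic, not the author's procedure: it assumes both levels are expressed in dB on the same amplitude (20 log10) scale, with the speech reference at 0 dB as on the level-normalized recordings. The names `masker_level_db` and `db_to_amplitude_ratio` are hypothetical.

```python
# Masked test conditions and their nominal S/N in dB (taken from the list above).
SNR_DB = {"B": 12.0, "C": 6.0, "D": 0.0, "E": -6.0}


def masker_level_db(snr_db, speech_ref_db=0.0):
    """Masker level in dB (same reference as the speech) giving the requested S/N.

    Assumption: S/N is simply speech level minus masker level, both in dB.
    """
    return speech_ref_db - snr_db


def db_to_amplitude_ratio(level_db):
    """Convert a level difference in dB to a linear amplitude ratio (20 log10 scale)."""
    return 10.0 ** (level_db / 20.0)


for condition, snr in SNR_DB.items():
    level = masker_level_db(snr)
    ratio = db_to_amplitude_ratio(level)
    print(f"condition {condition}: masker at {level:+.0f} dB, amplitude ratio {ratio:.3f}")
```

Under these assumptions, condition E (-6 dB S/N) places the masker 6 dB above the speech reference, which is roughly twice its amplitude.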

Test tapes of conditions A to E for each of the test lists were prepared by dub-editing separate randomizations of each test list in each test condition from the previously described sub-master recordings. Actual signal-to-noise ratios were set for conditions B to E during dub-editing using the level-normalized 0 VU peak in the vocalic nucleus on the sub-master recordings as a reference. The required masker level was then set with respect to this in the absence of the speech signal. Standardized recording levels on the final test tapes were maintained by then resetting the record level of the mixed speech and masker to peak again at 0 VU during re-recording. The masker remained on during the three repeats of each test item, overlapping the first and third presentations by 0.5 s.

The masker signal used was white noise (General Radio generator model 1382) band-pass filtered from 90 Hz to 6300 Hz exactly as previously described for the speech signal. This procedure was used so that the measured intensity of the masker energy would be in a pass-band specifically related to that of the speech signal being masked. It also contributes to a degree of subjective perceptual matching between the speech and the masker.

The choice of an effective masker for general sensitizing of speech intelligibility tests is basically limited to wide-band (white) noise, or low frequency weighted signals such as filtered white noise and multi-speaker babble approximating a long-time average speech spectrum. Miller (1947) showed in comparative tests on word lists that wide-band noise and babble were the most effective maskers, yielding similar intelligibility scores at 0 dB S/N, but that there was a steeper sloped overall articulation curve with babble. Investigations by Miller & Nicely (1955), Pickett (1957) and Pickett & Rubenstein (1960) have shown, however, that despite similarities in total score, the spectral properties of the masker may significantly affect the pattern of intelligibility scores obtained for individual speech sounds. The available evidence suggests that all masker types have selective effects on individual speech sound intelligibilities, but that these effects may be slightly less marked with wide-band noise, and it thus seemed an appropriate choice in this study. However, quantitative comparisons between listening test scores obtained with different masker types or different forms of sensitising distortion should be avoided, as Hirsh et al. (1954) and Williams & Hecker (1968) have pointed out.

The intelligibility tests were conducted with groups of listeners sitting at numbered test booths in a sound-treated room using Beyer type DT96 headphones with circumaural cushions, at a presentation level of +65 dB (re 20 µPa) at each headphone on the peak of the vocalic nucleus of each test item.
The tests of the two synthetic speech types were each conducted in two test sessions one day apart using identical test protocols but with two separate listener groups of 24 each. Both listener groups heard the same set of natural speech test lists; in addition, group 1 heard the analysis-synthesis lists and group 2 the synthesis-by-rule lists. Listeners were volunteer staff and students from the School of English and Linguistics, and all were capable of competent broad phonetic transcription. None of the listeners had any history of hearing disorders, and all were screened with a simple speech discrimination test to ensure that they could reliably identify monosyllabic CVC words at a presentation level of 40 dB s.p.l. (25 dB below that used in the tests proper).

To minimize order, learning, and fatigue effects in the tests, each listener group was split into two sub-groups of 12 each, and conditions A to E for each of the test lists were presented to the two sub-groups so that the test lists were heard both in ascending (A to E) and descending (E to A) levels of masking noise by the counter-balanced groups. Listeners were instructed to write their responses in phonetic transcription on a numbered test sheet, and each test item was signalled one second prior to presentation by a cue lamp illuminated for a one second interval in each test booth.
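The counter-balanced presentation described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how listeners were allocated to the two sub-groups, so the first-half/second-half split and the function name `presentation_orders` are hypothetical.

```python
# Test conditions in order of increasing masking noise (A = quiet, E = -6 dB S/N).
CONDITIONS = ["A", "B", "C", "D", "E"]


def presentation_orders(listener_ids):
    """Assign each listener an ascending (A to E) or descending (E to A) order.

    Assumption: the first half of the group forms one sub-group and the
    second half the other; the source does not describe the allocation.
    """
    half = len(listener_ids) // 2
    ascending = list(CONDITIONS)
    descending = list(reversed(CONDITIONS))
    orders = {}
    for index, listener in enumerate(listener_ids):
        orders[listener] = ascending if index < half else descending
    return orders


# One listener group of 24, split into two sub-groups of 12.
orders = presentation_orders(list(range(1, 25)))
n_ascending = sum(1 for order in orders.values() if order[0] == "A")
print(n_ascending, len(orders) - n_ascending)
```

Each condition is then heard equally often early and late in a session, which is the point of the counter-balancing.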

Results and discussion

Mean intelligibility scores of the two listener groups for lists 1 and 2 with each speech type are shown in Table II. The most commonly used method of analysis for data of this type is analysis of variance (Winer, 1971). However, as Winer and other authors note, the analysis of variance is based on the assumption of compound symmetry (equal variances of treatment scores and equal correlations between every pair of treatment scores). For the present data this assumption has proved untenable, and profile analysis (Timm, 1975) was therefore proposed by Homel (1981) as the multivariate method most suited in this instance. For purposes of analysis, the data were arranged into four categories:

list 1 (/h-d/ words), natural speech;
list 2 (CV syllables), natural speech;
list 1 (/h-d/ words), synthesized speech;
list 2 (CV syllables), synthesized speech.

Table II. Intelligibility scores for test lists 1 and 2 for the two listener groups with natural speech, analysis-synthesis speech, and speech synthesis by rule. Each cell gives Mean / S.D. / % Int. for one masking condition (S/N).

Group  Test list and speech type      -6 dB                   0 dB                    +6 dB                   +12 dB                  Quiet
1      List 1 Natural                 11.87 / 2.96 / 62.5     17.83 / 1.3 / 93.86     18.79 / 0.5 / 98.9      18.91 / 0.28 / 99.56    18.91 / 0.28 / 99.56
1      List 1 Analysis-synthesis      13.33 / 2.94 / 70.16    18.29 / 0.8 / 96.26     18.91 / 0.28 / 99.53    18.95 / 0.2 / 99.78     18.79 / 0.41 / 98.89
1      List 2 Natural                 7.54 / 1.53 / 34.27     13.83 / 2.18 / 62.86    18.85 / 1.77 / 84.09    19.38 / 1.47 / 88.09    21.04 / 0.69 / 95.64
1      List 2 Analysis-synthesis      6.0 / 1.5 / 27.27       8.2 / 1.47 / 37.27      10.83 / 1.88 / 46.95    13.5 / 1.74 / 61.36     18.0 / 1.53 / 81.82
2      List 1 Natural                 12.17 / 2.73 / 64.05    17.96 / 0.86 / 94.53    18.63 / 0.65 / 98.05    18.76 / 0.64 / 98.26    18.88 / 0.45 / 99.37
2      List 1 Synthesis by rule       12.96 / 2.27 / 68.24    17.96 / 1.16 / 94.53    18.88 / 0.34 / 99.37    18.79 / 0.51 / 98.89    18.96 / 0.2 / 99.79
2      List 2 Natural                 7.42 / 2.19 / 33.73     14.5 / 2.99 / 65.91     18.33 / 1.86 / 83.32    19.67 / 1.43 / 89.41    21.04 / 0.91 / 95.64
2      List 2 Synthesis by rule       2.42 / 1.28 / 11.0      6.21 / 1.89 / 28.23     8.88 / 1.65 / 40.36     10.63 / 1.58 / 48.32    17.96 / 2.05 / 81.66
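The percentage intelligibility figures in Table II appear to be the mean number of items correct divided by the number of items in the list (19 for list 1, 22 for list 2). That relationship is inferred from the tabulated values rather than stated in the text, so the sketch below should be read as an assumption; `percent_intelligibility` is a hypothetical helper name.

```python
# Number of test items per list, as given in the test materials section.
LIST_SIZES = {1: 19, 2: 22}


def percent_intelligibility(mean_correct, list_number):
    """Mean items correct as a percentage of list length.

    Assumption: this is how the '% Int.' column of Table II was derived.
    """
    return 100.0 * mean_correct / LIST_SIZES[list_number]


# (list number, mean score) pairs copied from Table II, listener group 1, natural speech.
examples = [(1, 11.87), (2, 7.54), (2, 21.04)]
for list_number, mean_score in examples:
    print(list_number, mean_score, round(percent_intelligibility(mean_score, list_number), 2))
```

For these three cells the computed values agree with the tabulated 62.5, 34.27 and 95.64 to within rounding.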

Natural speech /h-d/ words and CV syllables

For both the /h-d/ and CV lists there were no significant differences (p < 0.05) in mean intelligibility scores due to the age or sex of listeners under any of the masking conditions, nor were there any interactions between sex and listener group. No significant differences existed (p < 0.05) between the mean scores of the two listener groups under any masking condition for either the /h-d/ or CV lists. This shows that the two listener groups had a high degree of consistency on the single natural speech source, and were not disturbed by any variation in experimental conditions.

For both listener groups pooled, the CV list scores at each masking condition were significantly different (p < 0.05, Bonferroni t-tests for individual comparisons). However, in the case of the /h-d/ lists, scores were only significantly different between the +6 dB, 0 dB and -6 dB signal-to-noise ratios. No significant differences existed between +6 dB, +12 dB and quiet, so that the latter two conditions need not have been tested. This result also illustrates the high resistance of vocalic sounds to masking when compared with consonants under the same conditions.

Synthesized speech /h-d/ words and CV syllables

As with the natural speech tests, there were no significant differences (p < 0.05) in mean intelligibility scores due to the age or sex of listeners under any of the masking conditions in either /h-d/ words or CV syllables, nor were there any interactions between sex and listener group. For the /h-d/ words, the mean scores of the two listener groups (analysis-synthesis and synthesis-by-rule) showed no significant differences (p < 0.05) under any of the masking conditions. This shows that the overall intelligibility of vocalic sounds produced by analysis-synthesis and synthesis-by-rule is the same, and that the rule structure and its data-base are capable of predicting parametric data as valid as that obtained from an analysis of natural speech for vocalic sounds.
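The Bonferroni t-tests mentioned above control the overall error rate by testing each individual comparison at a stricter level. The paper does not state how many comparisons were in each family, so the sketch below assumes, purely for illustration, that all pairs of the five masking conditions were compared; `bonferroni_alpha` is a hypothetical helper name.

```python
from itertools import combinations

# The five listening conditions used in the study.
CONDITIONS = ["quiet", "+12 dB", "+6 dB", "0 dB", "-6 dB"]


def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    return family_alpha / n_comparisons


# Assumption: every pair of conditions is compared (not confirmed by the source).
pairs = list(combinations(CONDITIONS, 2))
per_test_alpha = bonferroni_alpha(0.05, len(pairs))
print(len(pairs), per_test_alpha)
```

With ten pairwise comparisons and an overall level of 0.05, each individual t-test would be judged against 0.005 under this assumption.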
Furthermore, as with the natural speech /h-d/ words, only the masking conditions of +6 dB, 0 dB and -6 dB S/N produced significantly different scores (Bonferroni t-test, p < 0.05).

For the CV syllables, the mean scores of the two listener groups showed significant differences (p < 0.05) at each of the masking conditions except quiet. In all cases where significant differences existed, the intelligibility of the synthesis-by-rule was poorer than that of the analysis-synthesis. This result illustrates the value of testing under sensitized listening conditions such as noise masking to show up intelligibility differences not apparent in quiet listening alone.

The superior performance of analysis-synthesis speech under noise masking compared to synthesis-by-rule (with CV syllables) also reflects a difference in the properties of the parametric data used to produce each speech type. Analysis-synthesis gives a measure of potential synthesizer performance degraded only by the validity of the parametric data derived from the analysis of natural speech. The synthesis-by-rule speech also reflects synthesizer limitations, but uses rule-generated parametric data which can only ever predict an approximation to the parametric properties of natural speech for any input string of phonological units from its rule structure and data-base. The disparity between the intelligibility performance of synthesis-by-rule and analysis-synthesis therefore gives some measure of rule structure and data-base limitations.

Comparisons between natural and synthesized speech

/h-d/ words

Figure 1 shows a plot of /h-d/ word intelligibility scores versus masking noise level from +6 dB to -6 dB for the two synthetic speech sources and the natural speech (the latter with listener groups pooled).

Figure 1. /h-d/ word intelligibility scores versus masking noise (% intelligibility plotted against signal-to-noise ratio). ● = natural; ○ = analysis-synthesis; △ = synthesis by rule.

Differences between them are non-significant (p < 0.05) except at -6 dB, where the synthesized source scores are slightly higher than those of the natural speech. These results suggest that the synthesizer is capable of reproducing the acoustic properties of natural vocalic sounds with great fidelity, and that this performance is not degraded by the synthesis-by-rule system. Such a result should be predictable, given the known ability of the serial formant class of terminal analogue synthesizers to model the idealized acoustic properties of the vocal tract during vowel production (Flanagan, 1957; Fant, 1960). The slightly superior synthesized vowel intelligibilities at -6 dB S/N probably result from the almost idealized formant structure in the synthesized vocalic sounds being marginally less affected by noise masking than their natural counterparts.

Differences between the vocalic sounds for the three speech sources are best seen by comparing intelligibility scores for the individual vocalic segments. Figure 2(a) and (b) show the comparative vowel and diphthong intelligibilities respectively at -6 dB S/N, and it can be seen that for the majority of segments the scores are reasonably consistent across the three speech sources, with a small overall superiority for synthetic speech.

Figure 2. (a) Individual vowel intelligibilities and (b) individual diphthong intelligibilities at -6 dB S/N, shown separately for natural speech, synthesis by rule and analysis-synthesis. Vowels are grouped as front, central and back (unrounded and rounded); diphthongs as closing and centering.

There are, however, some disparities in performance worth noting. Natural speech /i:/, /æ/ and /ɜ:/ are all markedly lower in intelligibility than their synthesized counterparts, illustrating the greater resistance to noise masking previously noted as a feature of synthetically generated vowel spectra. (The particularly low intelligibility of natural speech /ɜ:/ may be partly due to poor articulation of that /h-d/ word by the speaker.) /e/, /ɔ:/, /ɒ/ and /ʊ/ all have markedly lower intelligibility in synthesis-by-rule than in analysis-synthesis, suggesting that either their phonetic representation in the input string to the rule system or aspects of the rule system itself could be improved. The fact that three out of four of these are lip-rounded back vowels could indicate a deficiency in the rules and data for this segment sub-class. /ɪ/, /u:/ and /ɜ:/ show higher intelligibility in synthesis-by-rule than in analysis-synthesis, suggesting less than optimum parametric data analysis in the latter form of synthesis. The poor intelligibility of /u:/ (< 45%) in all speech sources may be due to its centralized and rather unstable target in Australian English, which makes it relatively easily confused with /i:/ due to the strong masking effects of flat (white) noise on its high F2 value (typically ≥ 1500 Hz). The overall superior intelligibility of the low unrounded vowels may also be explained by the greater resistance to masking of their relatively compact and high amplitude F1 F2 structure.

Amongst the diphthongs, /aɪ/ shows very low intelligibility in analysis-synthesis by comparison with the other two speech sources, indicating poor parametric data, as does /əʊ/ for synthesis-by-rule, which again shows up weakness in generating back rounded vowels. /eɪ/ shows low intelligibility in natural speech, suggesting poor articulation in the test token used, and all three sources show low intelligibility for /ʊə/, which is a reflection of its low functional load in the language coupled with its inherently low level of spectral energy.

Comparison of these results with other studies is difficult both because of the lack of published data and because of differences in the experimental methodology used to obtain such data as are available. Arnold et al.
(1958) in a comparative test of natural and synthesized vowels in /h-d/ words obtained intelligibility scores of 94% for natural vowels and 76% for synthesized vowels produced by analysis-synthesis, and in a later study of a synthesis-by-rule system obtained a score of 76.6% for isolated vowels. These results, all obtained in quiet listening (no masking), are relatively poor in comparison with those of the present study. They may be explained partly by the fact that in each case the vowels were produced with parallel formant terminal analogue synthesizers, which are inherently less able to model vowel production acoustics. (Although, to be fair, Holmes (1972) has shown this problem can be minimized with very careful design.) By comparison, in an evaluation of a synthesis-by-rule system using a serial formant synthesizer, Ingemann (1975) obtained a vowel and diphthong intelligibility score of 98%, which is very similar to those of the present study under the same quiet (non-masked) listening conditions.

CV syllables

Figure 3 shows CV syllable intelligibility scores as a function of masking noise level from quiet to -6 dB for the two synthetic speech sources and the natural speech source (the latter with listener groups pooled). The natural scores are significantly superior (p < 0.05) to both of the synthesized speech scores under all listening conditions. The synthesized speech sources are thus unable to match the overall intelligibility of natural speech in the acoustically and parametrically complex situation of consonant production.

A determination of the specific sources of deficiencies in consonant intelligibility and their relationship to the properties of natural speech is therefore of some interest. A useful overall measure of these is obtained from comparisons between the patterns of intelligibility scores for the natural phonetic consonantal classes of voiceless stops, voiced stops, voiceless fricatives, voiced fricatives, nasals and liquids, for the three speech sources. These are shown in Fig. 4(a), (b) and (c) for natural speech (both listener groups pooled), analysis-synthesis, and synthesis-by-rule, over the range of listening conditions.

Figure 3. CV syllable intelligibility scores versus masking noise (signal-to-noise ratio, from quiet to -6 dB). ● = natural speech; ○ = analysis-synthesis; △ = synthesis by rule.

The most obvious overall difference observable in these data is that the natural speech consonants remain quite resistant to masking until 0 dB S/N is reached. By contrast, the synthesized consonants show a rapid fall in the intelligibility of both stops and fricatives even at +12 dB S/N masking. As might be expected from the overall intelligibility scores, this trend is less marked in analysis-synthesis than in synthesis-by-rule. In all three speech sources the nasals and liquids remain most resistant to masking (which accords in general with Miller & Nicely, 1955), although at -6 dB all consonant classes have very poor intelligibility in the synthesis-by-rule.

Figure 4. CV syllable intelligibility scores versus masking noise (signal-to-noise ratio, from quiet to -6 dB) for (a) natural speech, (b) analysis-synthesis and (c) synthesis by rule. ● = voiceless stops; ○ = voiced stops; ■ = voiceless fricatives; □ = voiced fricatives; ▲ = nasals; △ = liquids.

The marked disparity between the intelligibility scores for stops and fricatives from the synthesized speech sources compared to natural speech illustrates an important limitation in the acoustic modelling capabilities of the synthesizer. Sounds in these consonant classes have complex spectra containing zeroes, energy peaks and energy distributions which are not always easily described accurately and unambiguously in terms of the limited number of parameters available on a conventional serial formant synthesizer. This class of synthesizer does not allow independent specification of formant amplitudes, and uses a separate fricative filter system with characteristics defined by a relatively simple parameter set. For the synthesizer used in the present study, this is limited to specifying the upper and lower frequency limits of frication noise, with a tracking zero below the lower cut-off of the filter.

The consequence of these limitations is that, when modelling the complex spectra found in fricatives and the release phase of stops, only approximations can be generated. Furthermore, the effects of such approximations on intelligibility are exacerbated by having separate fricative and (vocalic) formant filter circuits which produce inevitable pole discontinuities between the spectra generated by the two filters. This tends to reduce perceptual integration in some fricative sounds, whether the filters are used sequentially or simultaneously in a speech sequence. (Careful manipulation of parameter values can minimize this effect.)

Although this problem of spectral approximation also exists with nasals and the liquids /l/ and /r/, it does not cause serious intelligibility problems (as the analysis-synthesis results show) because the spectra of these sounds are predominantly specified by their (vowel-like) formant structure. The presence of additional spectral zeroes can be compensated for partly in an all-pole filter, as Fant (1960) has pointed out, by (slight) shifts in adjacent formant frequencies.
For nasals, there is also a simple nasal formant in the synthesizer which contributes very effectively to the perceptual adequacy of the approximated spectra for this consonant class.

Despite the disparities between consonant class scores for the three speech sources, the ranking of these scores is reasonably consistent. The most useful listening condition for a ranking comparison is 0 dB S/N, where all sources have wide class score spreads. The order from most to least intelligible is: liquids, nasals, voiced stops, voiceless stops, voiced fricatives and voiceless fricatives. This order partly reflects the properties of the flat noise masker (as noted earlier in the article) in favouring the intelligibilities of liquids and nasals compared with stops, fricatives and voiceless sounds generally. The poorer performance of the stops and fricatives in the synthesized speech is essentially a difference of degree in intelligibility compared with the natural speech. Differences in intelligibility scores between the two synthesis sources amongst the consonant classes are likewise largely ones of degree. The only real exception to this general trend occurs amongst the fricatives, where the poor intelligibility of the voiceless fricatives in analysis-synthesis indicates a need for better parametric data.

As noted for the vowels, comparisons with other studies can only be made with caution due to differences in experimental conditions. Flanagan & House (1956) in a quite early study obtained scores of 22% to 29% for three automatic analysis-synthesis (formant vocoder) systems: lowest intelligibilities were in stops and nasals, and the highest in fricatives and liquids, which for the nasals and fricatives is an opposite result to that obtained in the present study. Strong (1967) obtained 83.1% intelligibility in an analysis-synthesis system with the lowest scores amongst the fricatives. Keeler et al. (1976) tested several analysis-synthesis vocoder configurations, obtaining scores of 85% and 86% for three-formant and five-formant synthesis respectively. In synthesis-by-rule system evaluations, Rabiner (1968) obtained a score of 77% for CV syllable consonants in which fricatives had amongst the lowest intelligibilities. Haggard & Mattingly (1968) obtained a consonant score of 85.7% which also showed the lowest level of intelligibility amongst fricatives. Similar results were obtained by Nye & Gaitenby (1974) with a score of 78%, and Ingemann (1975) with scores from 73% to 82%. None of the studies mentioned tested the speech sources with masking, but the intelligibility figures for the present system compare very favourably with the limited amount of information published for other systems.

Conclusion

The set of test procedures and materials described here has been shown to yield essential basic information about speech intelligibility characteristics under conditions of minimal linguistic context. The importance of sensitizing test conditions (in the present instance by masking) to reveal differences in intelligibility characteristics not apparent in quiet listening is also illustrated by the data presented for the three speech sources tested in this study.

The three speech sources have effectively identical intelligibility characteristics for vowels under all listening conditions. This high level of performance primarily reflects the fidelity with which the serial formant synthesizer models the acoustics of vowel production in analysis-synthesis. The near identical performance of the synthesis-by-rule suggests that its rule structure and data-base do not significantly degrade this basic capability.

In the case of consonant intelligibility, the synthesized speech sources show a marked lack of consistency in scores compared with natural speech. For natural speech, overall consonant intelligibility is only 4% below vowel intelligibility in quiet listening, whereas the two synthesized speech sources have consonant scores 14% below vowel intelligibility in quiet. Masking further shows up this lack of consistency in the far greater vulnerability of the synthesized sources, even at +12 dB S/N. The masking also shows differences between analysis-synthesis and synthesis-by-rule which are not apparent from their near identical intelligibility scores in quiet listening. The poorer performance of the synthesis-by-rule in masking noise indicates that the rule structure and data-base require further refinement.

The sources of lower intelligibility for consonants in the synthesized speech sources become apparent in the analysis of consonant class intelligibilities.
The general trend is for those consonant classes (fricatives and stops) whose spectral characteristics are most poorly modelled in the acoustic properties and performance of the synthesizer to be least intelligible in masking. This form of analysis therefore has a very useful diagnostic function. Despite the poorer performance of the synthesized sources in consonants compared with natural speech, the general pattern of ranking of the consonant class intelligibility scores is the same as for the natural speech. In that sense the synthesized sources behave like a degraded form of natural speech, but with a degree of spread in the scores not characteristic of natural speech.

The degree to which limitations inherent in serial formant type synthesis are the cause of limitations in the intelligibility of the consonant data presented here, as opposed to limitations in the parametric data used, remains an interesting topic for further investigation. It does seem likely that a parallel formant synthesizer might produce a better consonant performance. By virtue of its independent formant amplitude control, this class of synthesizer can replicate a range of speech spectra produced by vocal tract configurations in which an all-pole transfer function acoustic model is not valid. Spurious spectral distortions occurring when formant amplitudes are adjusted are inherent in simple parallel synthesizers and may degrade vowel quality, but as Holmes (1972) points out, they can be avoided by appropriate filtering of the independent formant generator outputs prior to mixing. An indication of the potential superiority of the parallel formant synthesizer class is perhaps also reflected in the shift of the latest in the OVE series of synthesizers (OVE IV) to a parallel configuration where its predecessors were all serial types, and also in the successful hybrid (parallel/serial) synthesizer described by Klatt (1980).

The author would like to thank the Speech & Language Research Centre technical staff for their contributions to the technical aspects of this study.
Ross Homel from the School of Behavioural Sciences provided essential advice and help on the statistical analysis of the data, and Mary Clark and Sallyanne Palethorpe assisted with several aspects of the detailed data analysis.

References

Ainsworth, W. A. (1974). Performance of a speech synthesis system. International Journal of Man-Machine Studies, 6, 493-511.
Ainsworth, W. A. & Millar, J. B. (1971). Methodology of experiments on the perception of synthesised vowels. Language & Speech, 14, 201-212.
Arnold, G. F. et al. (1958). The synthesis of English vowels. Language & Speech, 1, 114-125.
Busch, A. C. & Eldridge, D. (1967). The effect of differing noise spectra on the consistency of identification of consonants. Language & Speech, 10, 194-202.
Clark, J. E. (1977). A real-time speech synthesis system. Monitor, 38, 56-67.
Clark, J. E. (1981). A low-level speech synthesis by rule system. Journal of Phonetics, 9, 451-476.
Epstein, A. et al. (1968). Familiarity & intelligibility of monosyllabic word lists. Journal of Speech & Hearing Research, 11, 428-434.
Fant, G. (1960). Acoustic Theory of Speech Production. The Hague: Mouton.
Fant, G. (1968). Analysis and synthesis of speech processes. In: Manual of Phonetics. B. Malmberg (ed.) Amsterdam: North Holland.
Flanagan, J. L. (1957). Note on the design of 'terminal-analog' speech synthesisers. Journal of the Acoustical Society of America, 29, 306-310.
Flanagan, J. L. & House, A. S. (1956). Development and testing of a formant-coding speech compression system. Journal of the Acoustical Society of America, 28, 1099-1106.
Gay, T. (1970). Effects of filtering and vowel environment on consonant perception. Journal of the Acoustical Society of America, 45, 993-998.
Gimson, A. C. (1972). An Introduction to the Pronunciation of English. London: Edward Arnold.
Gottfried, T. L. & Strange, W. (1980). Identification of coarticulated vowels. Journal of the Acoustical Society of America, 68, 000-000.
Haggard, M. P. & Mattingly, I. (1968). A simple program for synthesising British English. IEEE Transactions on Audio & Electroacoustics, AU-16, 95-99.
Hirsh, I. J. et al. (1954). Intelligibility of different speech materials. Journal of the Acoustical Society of America, 26, 530-538.
Holmes, J. N. (1972). Speech Synthesis. London: Mills & Boone.
Homel, R. (1981). Personal communication.
Ingemann, F. (1975). Testing synthesis-by-rule with the OVEBORD program. Status Report on Speech Research, Haskins Laboratories, SR-41, 127-136.
Kalikow, D. N. et al. (1976). Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. Journal of the Acoustical Society of America, 61, 1337-1351.
Keeler, L. O. et al. (1976). Two preliminary studies of the intelligibility of predictor-coefficient and formant coded speech. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-24, 429-432.
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesiser. Journal of the Acoustical Society of America, 67, 971-995.
Millar, J. B. & Ainsworth, W. A. (1972). Identification of synthetic isolated vowels in /h-d/ context. Acustica, 27, 278-282.
Miller, G. A. (1947). The masking of speech. Psychological Bulletin, 44, 105-129.
Miller, G. A. et al. (1951). The intelligibility of speech as a function of the context of test materials. Journal of Experimental Psychology, 17, 748-762.
Miller, G. A. & Nicely, P. (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338-352.
Nye, P. W. & Gaitenby, J. (1974). The intelligibility of synthetic monosyllables in short, syntactically normal sentences. Status Report on Speech Research, Haskins Laboratories, SR-37/38, 169-190.
Peterson, G. E. & Barney, H. L. (1952). Control methods used in the study of vowels. Journal of the Acoustical Society of America, 24, 175-184.
Pickett, J. M. (1957). Perception of vowels heard in noises of various spectra. Journal of the Acoustical Society of America, 29, 613-620.
Pickett, J. M. & Rubenstein, H. (1960). Perception of consonant voicing in noise. Language & Speech, 3, 155-163.
Rabiner, L. R. (1968). Speech synthesis by rule: an acoustic domain approach. Bell System Technical Journal, 47, 17-37.
Rubenstein, H. et al. (1959). Word length and intelligibility. Language & Speech, 2, 175-178.
Schultz, M. C. (1964). Word familiarity influences in speech discrimination. Journal of Speech & Hearing Research, 7, 395-400.
Strange, W. et al. (1974). Consonant environment specifies vowel identity. Status Report on Speech Research, Haskins Laboratories, SR-37/38, 209-216.
Strong, W. J. (1967). Machine-aided formant determination for speech synthesis. Journal of the Acoustical Society of America, 41, 1434-1442.
Timm, N. (1975). Multivariate Analysis with Applications in Psychology and Education. California: Brooks-Cole.
Traul, G. N. & Black, J. W. (1965). The effect of context on aural perception of words. Journal of Speech & Hearing Research, 8, 363-369.
Wang, W. S-Y. & Fillmore, C. J. (1961). Intrinsic cues and consonant perception. Journal of Speech & Hearing Research, 4, 130-136.
Williams, C. E. & Hecker, M. H. L. (1968). Relations between intelligibility scores of four test methods and three types of speech distortion. Journal of the Acoustical Society of America, 44, 1002-1006.
Winer, B. J. (1971). Statistical Principles in Experimental Design (2nd edition). New York: McGraw-Hill.