Recognition of steady-state French sounds pronounced by several speakers: Comparison of human performance and an automatic recognition algorithm


Speech Communication 2 (1983) 173-177, North-Holland



RECOGNITION OF STEADY-STATE FRENCH SOUNDS PRONOUNCED BY SEVERAL SPEAKERS: COMPARISON OF HUMAN PERFORMANCE AND AN AUTOMATIC RECOGNITION ALGORITHM

M. ESKENAZI and J.S. LIENARD

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI-CNRS), B.P. 30, 91406 Orsay Cedex, France

Received 30 March 1983

1. Introduction

In general, when results in automatic speaker-independent recognition on the acoustic-phonetic level are obtained, above 90% 'correct' recognition is considered acceptable. This figure, however, rests on presumptions which are far from evident on three levels: production, perception, and automatic processing. First, the speaker's production of each sound to be recognised (and his own judgement thereon, or the judgement of the person compiling the database) is presumed to be correct in all cases. Second, each of these sounds is presumed to be perceived by listeners (speaking differing dialects of the same language) all of the time as one and only one phoneme. Third, regardless of what human results are, automatic recognition is considered to be good only on the basis of a general total 'correct' recognition percentage; any other type of figure is left aside.

In this article we will show that by carrying out intelligibility tests on the databases used in automatic processing, one may come to a more realistic evaluation of the value of a database itself, as well as of the kind of results that may be meaningful for automatic recognition.

In order to obtain an aural and an automatic vowel classification, three similar databases were compiled. Each was recorded under the same material and environmental conditions, and contained three repetitions of each of the isolated French vowels (/i/, /y/, /u/, /e/, /ɛ/, /g/, /a/, /~/, /3/, /5/, /o/, /œ/, /~/) by a different set of five male and five female speakers. They had a short training period before the recording, and repeated the random list of vowels five times.

2. Intelligibility tests

The databases were tested in two ways: a group of 17 listeners heard 300 ms sounds; 5 other listeners heard only the 50 ms portion of the sounds used in the automatic classification experiment.

2.1. 300 ms segment tests

In order to avoid the problems involved in multiple-choice testing, the listeners chosen were all persons with a minimum of ten hours of training in IPA transcription. Contrary to the listeners in tests like those of Peterson and Barney [1], none of them were also speakers. The listeners were tested beforehand to determine whether they had any particular bias in noting the vowels (for example, could they regularly make the distinction between /~/ and /e/). The 300 ms segments, consisting of the first 250 ms of the vowel plus the 50 ms preceding its beginning, were presented in random order with a 2-second pause between each, the first 100 tokens counting only as a learning buffer. Only the first two of each speaker's three pronunciations of each vowel were used, shortening the tests in order not to cause fatigue; ample pauses were provided throughout the tests. The two identical presentations of the test were 14 days apart.

0167-6393/83/$3.00 © 1983, Elsevier Science Publishers B.V. (North-Holland)

M. Eskenazi, J.S. Lienard / Recognition of French sounds


2.2. 50 ms segment tests

Five professional speech researchers, three of whom were also speakers, took part in this test. As above, French is their mother tongue, and IPA notation was used. Whereas the 300 ms segments could be comfortably listened to with a Hamming window, the 50 ms tokens could not. These segments were embedded in the middle of 650 ms of Gaussian white noise produced by an adaptation of the Klatt synthesizer [2], in accordance with the observations on testing sounds of extremely short duration by Pols [3]. The maximum noise amplitude was 1/6 of the maximum amplitude of the least intense of all of the tokens. All three pronunciations of each vowel were heard once this time. To avoid fatigue, two equal parts were presented on two successive days, with a training buffer of 195 tokens at the beginning of each day's test. The learning buffers are presently being examined to find out approximately how long it takes listeners to get accustomed to each vowel in this type of test.

3. Automatic classification

For an automatic classification of the French oral and nasal vowels, a 50 ms segment of each isolated vowel, taken 200 ms after its beginning, was put through an FFT. The results were adjusted to a Bark frequency scale, and then each value was transformed to a quasi-logarithmic one. After severe spectral smoothing, a derivative was obtained to reflect the degree of curvature of the vowel spectrum at various points along the frequency scale (Eskenazi and Lienard [4]).

With simple Gaussian statistics, each vowel is then represented by the mean and the standard deviation values for the whole of the speakers at each point along the frequency scale. Classification may be carried out for vowels from unknown speakers by summing the distances (in terms of standard deviations) between the unknown vowel and the mean values at all points for each vowel reference. The reference corresponding to the lowest cumulative distance is proposed. A first simple classification of the vowels (one 'feature', with other classification criteria to be superposed afterwards) is then obtained. Table 1 shows the confusion matrix for this classification, where the first of the three databases was used as the base, and the two others for the test.

4. Discussion of results

Overall listener recognition for the 300 ms segment test is 88.3%; the listeners' global scores were between 81.3% and 92.1%. From one session to the other, the global and individual vowel percentages for any one listener did not vary by more than 2%. This tends to indicate that listeners were consistent in their answers.

Table 1
Confusion matrix for intelligibility (human tests)

       i     y     e     ɛ     g     ø     œ     a     5     3     ~     o     u
i   99.9   0.1   0.2     0     0     0     0     0     0     0     0     0   0.6
y      0  97.6   0.1     0     0   0.1   0.1     0     0   0.1     0     0   0.5
e    0.1   1.3  96.9  10.4     0   0.1   0.1     0     0     0   0.1     0     0
ɛ      0     0   2.6  89.2   0.6   0.1   2.4     0   0.4     0   0.1     0     0
g      0     0     0   0.1  99.0   0.1   0.2   1.1     0   0.1   1.3     0     0
ø      0   0.3   0.1   0.1     0  85.6  34.1     0     0   0.1   0.1   0.1     0
œ      0     0   0.1   0.1   0.2  14.1  63.0   0.1   5.3   0.3   0.1   0.1     0
a      0     0     0     0     0     0   0.1  96.9  98.0   2.9     0     0     0
5      0     0     0     0   0.1   0.1   0.2   0.4   2.6  13.4   1.4     0     0
3      0     0     0     0     0   0.1   0.1   1.4   2.3   0.9   1.0   3.8   0.2
~      0     0     0   0.1     0     0     0     0   0.1  65.1  94.5   0.4   0.1
o      0   0.1     0     0     0     0   0.1     0   0.1   0.6   1.1  92.1   1.9
u      0   0.7     0     0     0     0     0     0  16.5     0   0.2   3.4  96.4
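The classification procedure of Section 3, and the tallying of a confusion matrix such as Table 1, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance is taken to be the absolute deviation from the reference mean divided by the reference standard deviation, summed over frequency points; the population standard deviation is used; and the function names, the two-point feature vectors, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def train(tokens):
    """Build one (means, stds) reference per vowel from (label, vector) pairs."""
    by_label = defaultdict(list)
    for label, vec in tokens:
        by_label[label].append(vec)
    refs = {}
    for label, vecs in by_label.items():
        points = list(zip(*vecs))  # one tuple of values per frequency point
        means = [mean(p) for p in points]
        # Assumption: a small floor keeps the division below well defined.
        stds = [max(pstdev(p), 1e-6) for p in points]
        refs[label] = (means, stds)
    return refs


def distance(vec, ref):
    """Sum over frequency points of |value - mean| in units of std (assumed form)."""
    means, stds = ref
    return sum(abs(x - m) / s for x, m, s in zip(vec, means, stds))


def classify(vec, refs):
    """Propose the reference with the lowest cumulative distance."""
    return min(refs, key=lambda label: distance(vec, refs[label]))


def confusion_matrix(test_tokens, refs):
    """Count responses per intended vowel: counts[intended][response]."""
    labels = sorted(refs)
    counts = {a: {b: 0 for b in labels} for a in labels}
    for intended, vec in test_tokens:
        counts[intended][classify(vec, refs)] += 1
    return counts


# Hypothetical toy data: one database as the base, other tokens for the test.
base = [("i", [1.0, 5.0]), ("i", [1.2, 5.4]), ("i", [0.8, 4.6]),
        ("u", [4.0, 1.0]), ("u", [4.4, 1.2]), ("u", [3.6, 0.8])]
test = [("i", [1.1, 5.1]), ("u", [4.1, 0.9]), ("u", [3.9, 1.1])]

refs = train(base)
matrix = confusion_matrix(test, refs)
for intended, row in matrix.items():
    print(intended, row)
```

Dividing each row of counts by its total and multiplying by 100 would give percentages comparable in form to those of a confusion matrix.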


The speakers' production axis was then examined. First, if all of the listeners agreed that a given token was other than what was intended, for example, that an /a/ was actually a /u/, the token was eliminated from both the final intelligibility and the automatic results, the speaker's production, or the labelling of the token, being deemed to be incorrect. Other evidence questioning the unilateral 'correctness' of phonemes is large listener disagreement (fewer than 5 'correct' answers out of 34, the number 5 being chosen as at least twice the probability of getting the 'right' answer if guessing). Fig. 1 shows that for some vowels, such as /i/ and /a/, which are the most acoustically 'separate', there are never that many disagreements, but that the other vowels, some of which are presently in linguistic evolution, do provoke numerous disagreements.

The discrepancies also reflect the problems inherent in constituting an out-of-context database. The person evaluating the acceptability of the pronunciation compares his statistical knowledge of the vowel reference in question with the token. He will then either accept what he has heard, mentally filing it with the rest of his representation, or reject it because it is just too 'distant' from what he expected to hear. The definition of distance changes for each individual, due to the differences in past linguistic experience.

The confusion matrices in Tables 1 and 2 show, respectively, automatic and perception results. It would be wrong, for example, to say that the relatively poorer results for the /a/ in both automatic and human recognition show that automatic recognition parallels human tendencies. Vowel production is the real culprit on both fronts. The /a/ is the only French vowel that is not a morpheme in isolated form; it is also rarely in CV context.
Phonologically unnatural in form, the isolated /a/ is therefore difficult to obtain; the speakers' variation is much greater than if continuous speech were used. This fact is reflected in the automatic recognition statistics. If there is abnormally large variation for one given vowel as opposed to the rest, the standard deviation for this vowel will also be large, and will overlap the other vowels' values. Many confusions with /a/ and /u/, for example, may be explained in this manner.
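The effect of an abnormally large standard deviation can be illustrated with a small sketch. It assumes that the distance of Section 3 is the absolute deviation from the reference mean divided by the reference standard deviation; the reference names, the single frequency point, and all numbers are hypothetical and are not taken from the databases.

```python
def normalised_distance(token, means, stds):
    """Sum of |value - mean| / std over frequency points (assumed form)."""
    return sum(abs(x - m) / s for x, m, s in zip(token, means, stds))


def classify(token, references):
    """Return the label of the reference with the lowest cumulative distance."""
    return min(references,
               key=lambda label: normalised_distance(token, *references[label]))


# Hypothetical references at a single frequency point: (means, stds).
references = {
    "stable": ([0.0], [1.0]),    # vowel produced consistently
    "variable": ([4.0], [4.0]),  # vowel with abnormally large variation
}

# In raw units this token is nearer the "stable" mean (1.5 away against 2.5),
# but in units of standard deviation it is nearer the "variable" reference.
token = [1.5]
print(classify(token, references))
```

Because the large standard deviation shrinks every distance to the variable vowel, tokens that lie closer to a neighbouring mean in raw units can still be attributed to it.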

This explanation is confirmed by retesting without /a/ in the base. Fig. 2 shows the number of tokens for which there was not 100% agreement on the identity of the vowel. Here, the phonological evolution of certain pairs of vowels, and therefore the listeners' perception, is the cause. Listeners were tested on unstable pairs such as /œ/ and /ø/; anyone showing a marked preference for one or the other of the candidates in his production or perception was eliminated. Globally, no individual scored extremely low for one specific vowel, nor did anyone's confusions often involve the same vowel. This type of disagreement therefore seems to be spread evenly among the listeners, reflecting unstable elements of French phonology. Automatic results do seem to reflect this fact; confusions increase in approximately the same proportions for these vowels (compared to stable vowels) as in the intelligibility results.

The matrices show that /e/ and /u/ are quite well recognized by human subjects, but not by the machine. This cannot be attributed to characteristics of speech production; it is an indication of the limits of the automatic classification feature. Other distinctions, such as labiality, are also limited. Two new features, based on the same principles as the first and presently being designed, will soon be superposed on the first, and tested with a view to eliminating these and other confusions. They will also have to deal with nasality, which seems to show differences between the machine and humans: human errors tend to be between one nasal and another, whereas automatic confusions are most often between a nasal vowel and its oral neighbors.

It may be noted that in all cases there are no greatly deviant confusions for the automatic process. Acoustically 'close neighbors' are confused, but very different vowels (/i/ and /o/, for example) are not.
A number of large classes appear for humans and the machine, which are much the same in the two cases; confusions are only found among members of the same class. Much more information has been gleaned from these tests than we have room to describe herein. Not all results of the 50 ms tests have been obtained as of the writing of this article, but a first sampling shows 10 to 15% lower global recognition and the same types of confusions as those in the 300 ms matrix.
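The criterion of fewer than 5 'correct' answers out of 34, used earlier in this section to flag large listener disagreement, can be checked with a little arithmetic. The sketch below assumes that the 34 answers are the 17 listeners of the 300 ms test times the two presentations, and that a guessing listener picks each of the 13 vowels with equal probability; neither assumption is stated explicitly in the text.

```python
from fractions import Fraction

N_VOWELS = 13     # vowels in the database (Section 1)
N_LISTENERS = 17  # listeners in the 300 ms test (Section 2)
N_SESSIONS = 2    # identical presentations, 14 days apart
THRESHOLD = 5     # "5 'correct' answers out of 34"

answers = N_LISTENERS * N_SESSIONS              # answers per token
p_guess = Fraction(1, N_VOWELS)                 # assumed chance of a right guess
expected_by_chance = answers * p_guess          # right answers expected by chance
ratio = Fraction(THRESHOLD, answers) / p_guess  # threshold rate over chance rate

print(answers, float(expected_by_chance), float(ratio))
```

Under these assumptions, guessing alone would give about 2.6 'right' answers out of 34, so a threshold of 5 sits at roughly 1.9 times the chance rate.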


Conclusions and perspectives

Using close analysis of the type of errors made by humans in multispeaker vowel identification, a realistic evaluation of the databases used, and of human and automatic recognition performances, may be obtained. The automatic classification used as a base proves to fulfill expectations in many ways if we use human perception as a model. The shortcomings, and the goals future features should fulfill, become clear, and may be realistically evaluated on this basis.

Acknowledgement

We wish to thank François Lonchamp (Phonetic Institute of Nancy), who kindly made available the means necessary to make the intelligibility tests possible.

References

[1] G. Peterson and H. Barney, "Control methods used in a study of the vowels", J. Acoust. Soc. Am., Vol. 24, 1952, pp. 175-184.
[2] D. Klatt, "Software for a cascade/parallel formant synthesizer", J. Acoust. Soc. Am., Vol. 67, 1980, pp. 971-995.
[3] L.C.W. Pols, "Intelligibility of intervocalic consonants", IEEE-ICASSP, Washington, 1979, pp. 460-463, CH1379-7/79/0000-0460.
[4] M. Eskenazi and J.S. Lienard, "On the acoustic characterisation of the oral and nasal vowels of French", article submitted to the Tenth Int. Congress of Phonetic Sciences, Utrecht, August 1983.

Vol. 2, Nos. 2-3, July 1983