Journal of Phonetics (1984) 12, 17-21

Effectiveness of different vowel sounds in automatic speaker identification

K. K. Paliwal†
Division of Telecommunications, University of Trondheim, Trondheim-NTH, Norway

Received 22nd July 1983
Abstract:
An experiment is performed to study the effectiveness of eleven English vowels (/ɜ/, /ʌ/, /u/, /ʊ/, /o/, /ɔ/, /ɑ/, /æ/, /ɛ/, /ɪ/, /i/) in an automatic speaker identification task. The first four formant frequencies of the vowel sounds are used as parameters and a minimum distance classifier is used for speaker classification. The vowel /ɜ/ is found to give the best speaker identification performance.
Introduction

Speaker identification systems use the speech signal to identify an unknown (or test) speaker from a group of known speakers (Atal, 1976). They differ from speaker verification systems, which verify the identity claimed by an unknown speaker (Rosenberg, 1976). Speaker identification systems can be of two types, text-dependent and text-independent, and the type to be used depends on the cooperation offered by the unknown speaker. If he is cooperative enough, a text-dependent system can be used and the speaker can be asked to utter a key sentence (or phrase) prespecified for the speaker identification system (Atal, 1972; Sambur, 1976; Dante & Sarma, 1979). If, on the other hand, the unknown speaker is not cooperative, which is normally the case, a text-independent system has to be used. In this case the system has to cope with an utterance of arbitrary text from the unknown speaker. For this, it uses either time-averaged speech characteristics (Sambur, 1976; Markel, Oshika & Gray, 1977; Shridhar & Mohankrishnan, 1982) or certain specified speech events (Wolf, 1972; Su, Li & Fu, 1974; Sambur, 1975; Kashyap, 1976).

Irrespective of the type of speaker identification system to be used, it is important to know which phonemes best convey speaker-specific information. For text-dependent systems, this knowledge will be helpful in constructing the key sentence. For text-independent systems, knowledge about the relative effectiveness of different phonemes will be helpful in selecting the speech events to be used for speaker identification. Since a large proportion of the phonemes in conversational English (about 38.2%; Mines, Hanson & Shoup, 1978) are vowels and every spoken word contains at least one vowel sound, it will be quite useful to know how different vowels perform in an automatic speaker identification task.
So, the aim of the present paper is to study the relative effectiveness of different English vowels for speaker identification. Since the formant structure of the vowel spectrum is directly related to the unique shape of the vocal tract (Stevens, 1971) and supplies important information about the speaker's identity (Matsumoto et al., 1973; Sambur, 1975), the first four formant frequencies are chosen as parameters to represent different vowel sounds in the present investigation.

† On leave from Speech and Digital Systems Group, Tata Institute of Fundamental Research, Bombay-400005, India.
0095-4470/84/010017+05$03.00/0 © 1984 Academic Press Inc. (London) Ltd.

Data acquisition and preprocessing

The speaker identification experiment was conducted on ten native speakers of British English, one female and nine male. The effectiveness of eleven different vowels (/ɜ/, /ʌ/, /u/, /ʊ/, /o/, /ɔ/, /ɑ/, /æ/, /ɛ/, /ɪ/, /i/) commonly occurring in British English (Ainsworth, 1976) was studied in a speaker identification task. These vowels were uttered in /h/-vowel-/d/ context. The speakers were given a list of /h/-vowel-/d/ syllables (heard, hud, who'd, hood, hoard, hod, hard, had, head, hid, heed) and asked to read them. The /h/-vowel-/d/ context was chosen because ten of the eleven syllables are familiar words of conversational English and the coarticulation effects within these groups of phonemes are not very complex (Paliwal, Lindsay & Ainsworth, 1983).

Five repetitions of these syllables were recorded in an ordinary office room with a microphone and a Revox A77 tape recorder. The utterances were digitized at a sampling rate of 10 kHz on a Computer Automation Alpha computer, using a 12-bit analog-to-digital converter. The waveforms of these utterances were displayed on a CRT and a 25.6 ms segment of the vowel (256 samples at 10 kHz) was selected carefully for spectral analysis, using a manually controlled cursor. The log-power spectrum was computed by taking a 256-point discrete Fourier transform, using a fast Fourier transform algorithm. The log-power spectrum was then smoothed by cepstral smoothing (Schafer & Rabiner, 1970).
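The spectral analysis described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original analysis program: the text specifies only the sampling rate, the 256-point transform and the use of cepstral smoothing, so the Hamming window, the lifter cutoff `n_keep = 30` and all function names are hypothetical choices. A small pure-Python radix-2 FFT is included only to keep the sketch self-contained.

```python
import cmath
import math

FS = 10_000   # sampling rate in Hz (from the text)
N = 256       # analysis length: 256 samples = 25.6 ms at 10 kHz


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT computed through the conjugation identity."""
    n = len(x)
    return [v.conjugate() / n for v in fft([c.conjugate() for c in x])]


def log_power_spectrum(frame):
    """Log-power spectrum in dB of a frame (Hamming window is an assumption)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # The small constant avoids log(0) for empty bins.
    return [10 * math.log10(abs(c) ** 2 + 1e-12) for c in fft(windowed)]


def cepstral_smooth(log_spec, n_keep=30):
    """Smooth a log spectrum by keeping only low-quefrency cepstral terms.

    n_keep is a hypothetical cutoff (in samples); the mirrored high indices
    are kept as well so that the smoothed spectrum stays real.
    """
    n = len(log_spec)
    cepstrum = ifft(log_spec)
    liftered = [c if (q < n_keep or q > n - n_keep) else 0
                for q, c in enumerate(cepstrum)]
    return [v.real for v in fft(liftered)]
```

Calling `cepstral_smooth(log_power_spectrum(frame))` on a 256-sample vowel frame returns a smoothed spectrum in which bin k corresponds to k * FS / N Hz, that is, about 39 Hz per bin.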
The first four formant frequencies were measured from the smoothed log-power spectrum by a peak-picking method. For this, the spectrum was displayed on a CRT and the peaks were picked with the help of the manually controlled cursor. A three-point parabolic interpolation around each peak was performed to locate it more accurately. In this manner, the formant frequencies for each of the five repetitions were measured for all ten speakers and all eleven vowels.

Speaker identification procedure

The aim here is to identify the unknown speaker from a known group of ten speakers by using his speech utterance for a given vowel, which can be represented as a vector in a four-dimensional vector space (with the four formant frequencies along the four axes). This is a standard problem in statistical pattern recognition and has been treated exhaustively in the literature (Duda & Hart, 1973). In the present study, a minimum distance classifier is used for speaker classification. The test vector representing the unknown speaker is compared with the mean vectors of all ten known speakers and the distances are computed. The unknown speaker is identified as the speaker showing the least distance. The minimum distance classifier is studied here for two different distance measures: the Euclidean distance measure and the correlation distance measure. These distance measures are defined as follows:

(Euclidean distance measure)

and

(Correlation distance measure)
where d_i is the distance between the test vector X (representing the unknown speaker) and the mean vector M_i (representing the ith speaker from the group of known speakers).

Results and discussion

Speaker identification performance is studied here separately for each of the eleven English vowels. To estimate this performance we have, for a given vowel, a limited sample of 50 preidentified vectors (obtained from five repetitions of that vowel by ten speakers). To obtain an unbiased estimate of the speaker identification performance, it is necessary to have independent training and test data sets. For this, we use the following procedure. Each of the five repetitions for every speaker in turn is used for testing the speaker identification system, while the remaining four repetitions are used for training the system. (By training, we mean here the computation of mean vectors for each of the ten speakers.) This procedure not only maintains the independence of the training and test data sets, but also uses all fifty available vectors for performance evaluation, and thus gives a statistically more meaningful estimate of the performance.

In Table I, we show the speaker identification performance with different English vowels. Speaker identification rates are listed separately for the Euclidean and correlation distance measures. It can be seen from this table that, for the Euclidean distance measure, the vowel /ɜ/ is most effective for speaker identification, followed by the vowel /ʊ/. For the correlation distance measure, the speaker identification rates are lower than those obtained with the Euclidean distance measure; but as far as the relative effectiveness of different vowels is concerned, the vowel /ɜ/ is still the most effective vowel for speaker identification.

Table I
Speaker identification performance using different vowel sounds

         Speaker identification rate (%) with
Vowel    Euclidean distance measure    Correlation distance measure
/ɜ/                  84                            72
/ʌ/                  76                            64
/u/                  74                            62
/ʊ/                  78                            68
/o/                  44                            46
/ɔ/                  74                            66
/ɑ/                  58                            56
/æ/                  72                            62
/ɛ/                  54                            46
/ɪ/                  76                            62
/i/                  60                            48
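The evaluation procedure and the minimum distance classifier described above can be sketched as follows. This is an illustrative sketch, not the program used in the study. The displayed definitions of the two distance measures are not reproduced in the text, so the forms used here (squared Euclidean distance, and one minus the normalized inner product for the correlation measure) are assumptions, as are all function names and the toy formant values. The hold-out scheme shown, in which the same repetition index is withheld for all speakers at once, is one reading of the procedure.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(x, m):
    """Squared Euclidean distance (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def correlation(x, m):
    """One minus the normalized inner product (assumed form)."""
    dot = sum(a * b for a, b in zip(x, m))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in m))
    return 1.0 - dot / norm


def identification_rate(data, distance):
    """Percent correct for data[speaker][repetition] -> formant vector.

    Each repetition index is held out in turn; speaker means are computed
    from the remaining repetitions, and every held-out vector is assigned
    to the speaker whose mean is nearest.
    """
    n_reps = len(next(iter(data.values())))
    correct = total = 0
    for held_out in range(n_reps):
        means = {spk: mean_vector([v for r, v in enumerate(reps) if r != held_out])
                 for spk, reps in data.items()}
        for spk, reps in data.items():
            guess = min(means, key=lambda s: distance(reps[held_out], means[s]))
            correct += guess == spk
            total += 1
    return 100.0 * correct / total


# Made-up formant vectors (F1..F4 in Hz) for three hypothetical speakers,
# three repetitions each; for illustration only, not data from the study.
toy = {
    "A": [[500, 1500, 2500, 3500], [510, 1490, 2510, 3490], [495, 1505, 2495, 3505]],
    "B": [[600, 1300, 2700, 3300], [610, 1310, 2690, 3310], [590, 1295, 2705, 3295]],
    "C": [[450, 1700, 2400, 3700], [455, 1690, 2410, 3690], [445, 1710, 2395, 3705]],
}
rate_euclidean = identification_rate(toy, euclidean)
rate_correlation = identification_rate(toy, correlation)
```

With ten speakers and five repetitions per vowel, the same function would score 50 test vectors per vowel, so each rate is a multiple of 2%.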
We have also studied the relative effectiveness of different vowels with more complex distance measures such as the weighted-Euclidean distance measure (Paliwal, 1982) and the Mahalanobis distance measure (Paliwal & Rao, 1982). These distance measures require the computation of second-order statistics from the training set data. It was found that for these distance measures too, the vowel /ɜ/ gave the best speaker identification performance.

So far, we have studied the relative effectiveness of different vowels for speaker identification using all four formant frequencies (F1, F2, F3 and F4). Now, we shall study the
relative effectiveness of individual formant frequencies of different vowels. For this, we use the F-ratio (defined as the ratio of inter-speaker to intra-speaker variance) as the criterion. The F-ratio is a measure of separability between speakers and has been used successfully for speaker identification (Pruzansky & Mathews, 1964; Wolf, 1972) and speaker verification (Das & Mohn, 1971). Table II lists the F-ratios of the individual formant frequencies F1, F2, F3 and F4 for the different vowels. All the F-ratios listed in this table exceed the critical value of the F-ratio for the test at the 0.01 significance level (which is 2.88 for 9 and 40 degrees of freedom for the numerator and the denominator, respectively). Thus, all the individual formant frequencies F1, F2, F3 and F4 for all the vowels are statistically meaningful for speaker identification. The most effective vowels for the individual formant frequencies are /ʌ/ for F1, /ɪ/ for F2, /ɜ/ for F3 and /o/ for F4. However, the vowel /ɜ/ shows relatively high F-ratio values for all four formant frequencies. This explains its effectiveness in the speaker identification task.

Table II
F-ratios for individual formant frequencies (F1, F2, F3 and F4) of different vowels
         F-ratio
Vowel      F1       F2       F3       F4
/ɜ/       35.3     87.0     76.7     43.6
/ʌ/      106.4     49.0     12.0     20.1
/u/       13.0     45.5     24.8      9.9
/ʊ/       23.1     56.5     13.2     13.1
/o/        7.9     18.3     20.2     58.8
/ɔ/       29.0     20.5     19.8     26.7
/ɑ/        9.4      6.0     31.0     11.5
/æ/       46.9     10.7     21.3     14.7
/ɛ/        3.8     34.4     41.1     10.1
/ɪ/       15.4    110.2     55.5     30.2
/i/        6.9     54.4     13.0      9.8
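The F-ratio can be illustrated with a short computation. The sketch below assumes the one-way analysis-of-variance form of the statistic (between-speaker mean square divided by within-speaker mean square), which is consistent with the 9 and 40 degrees of freedom quoted above for ten speakers and five repetitions. The function name and the toy numbers are hypothetical.

```python
F_CRITICAL_01 = 2.88  # critical F at the 0.01 level for (9, 40) d.f., as quoted in the text


def f_ratio(groups):
    """One-way ANOVA F statistic for equal-sized groups of scalar values.

    groups[i] holds the repeated measurements of one formant frequency
    for speaker i; k speakers with n repetitions each give (k - 1) and
    k * (n - 1) degrees of freedom.
    """
    k = len(groups)
    n = len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    # Inter-speaker variance: spread of the speaker means around the grand mean.
    between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    # Intra-speaker variance: spread of repetitions around each speaker's mean.
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    return between / within


# Toy example with three hypothetical speakers and three repetitions each.
toy_f = f_ratio([[1, 2, 3], [2, 3, 4], [3, 4, 5]])  # between = 3, within = 1
```

A formant frequency whose F-ratio exceeds `F_CRITICAL_01` would, under this reading, separate the ten speakers significantly at the 0.01 level.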
The present study was conducted with automatic speaker identification in mind. However, the results showing the relative effectiveness of different vowels for speaker identification should, hopefully, also be useful for speaker verification.

Conclusion

The relative effectiveness of eleven English vowels for automatic speaker identification has been studied. The first four formant frequencies were used as parameters and a minimum distance classifier was used for speaker classification. The vowel /ɜ/ was found to give the best speaker identification performance.

The author wishes to thank Dr W. A. Ainsworth of Keele University (England) for allowing him to use the facilities at his speech laboratory for measuring the formant frequencies, and Mr D. Lindsay for his help during the measurement process.

References

Ainsworth, W. A. (1976). Mechanisms of Speech Recognition. Oxford: Pergamon Press.
Atal, B. S. (1972). Automatic speaker recognition based on pitch contours. Journal of the Acoustical Society of America, 52, 1687-1697.
Atal, B. S. (1976). Automatic recognition of speakers from their voices. Proceedings IEEE, 64, 460-475.
Dante, M. M. & Sarma, V. V. S. (1979). Automatic speaker identification from a large population. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP 27, 255-263.
Das, S. K. & Mohn, W. S. (1971). A scheme for speech processing in automatic speaker verification. IEEE Transactions on Audio Electroacoustics, AU 19, 32-43.
Duda, R. O. & Hart, P. E. (1973). Pattern Classification and Scene Analysis. New York: Wiley.
Kashyap, R. L. (1976). Speaker recognition from unknown utterance and speaker-speech interaction. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP 24, 481-488.
Markel, J. D., Oshika, B. T. & Gray, A. H. Jr. (1977). Long-term feature averaging for speaker recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP 25, 330-337.
Matsumoto, H., Hiki, S., Sone, T. & Nimura, T. (1973). Multidimensional representation of personal quality of vowels and its acoustic correlates. IEEE Transactions on Audio Electroacoustics, AU 21, 428-436.
Mines, M. A., Hanson, F. & Shoup, J. E. (1978). Frequency of occurrence of phonemes in conversational English. Language and Speech, 21, 221-235.
Paliwal, K. K. (1982). On the performance of the quefrency-weighted cepstral coefficients in vowel recognition. Speech Communication, 1, 151-154.
Paliwal, K. K., Lindsay, D. & Ainsworth, W. A. (1983). Correlation between production and perception of English vowels. Journal of Phonetics, 11, 77-83.
Paliwal, K. K. & Rao, P. V. S. (1982). Evaluation of various linear prediction parametric representations in vowel recognition. Signal Processing, 4, 323-327.
Pruzansky, S. & Mathews, M. V. (1964). Talker-recognition procedure based on analysis of variance. Journal of the Acoustical Society of America, 36, 2041-2047.
Rosenberg, A. E. (1976). Automatic speaker verification: A review. Proceedings IEEE, 64, 475-487.
Sambur, M. R. (1975). Selection of acoustic features for speaker identification. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP 23, 176-182.
Sambur, M. R. (1976). Speaker recognition using orthogonal linear prediction. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP 24, 283-289.
Schafer, R. W. & Rabiner, L. R. (1970). System for automatic formant analysis of voiced speech. Journal of the Acoustical Society of America, 47, 634-648.
Shridhar, M. & Mohankrishnan, N. (1982). Text-independent speaker recognition: A review and some new results. Speech Communication, 1, 257-267.
Stevens, K. N. (1971). Sources of inter- and intra-speaker variability in acoustic properties of speech sounds. Proceedings of Seventh International Congress on Phonetic Science, Montreal, Canada, August 22-28.
Su, L. S., Li, K. P. & Fu, K. S. (1974). Identification of speakers by use of nasal coarticulation. Journal of the Acoustical Society of America, 56, 1876-1882.
Wolf, J. (1972). Efficient acoustic parameters for speaker recognition. Journal of the Acoustical Society of America, 51, 2044-2056.