Journal of Phonetics (1982) 10, 139-148

Perceptual identification of voices under normal, stress and disguise speaking conditions

Harry Hollien
Program in Linguistics and Institute for Advanced Study of the Communication Processes, University of Florida, Gainesville, Florida 32611, U.S.A.

Wojciech Majewski
Institute of Telecommunications and Acoustics, Technical University of Wroclaw, Wroclaw, Poland

E. Thomas Doherty
Department of Communicative Disorders, University of South Carolina, Columbia, South Carolina, U.S.A.

Received 10th June 1981
Abstract:
A number of authors have contended that, at present, the human listener constitutes the most accurate "system" for correctly identifying individuals from their speech, especially if tape recorded materials are used. Other investigators have indicated that listeners do not have to know the speakers in order to make highly accurate judgements of this type. The present experiments attempt to (1) estimate listeners' capabilities in this area and (2) assess the importance of the auditors being acquainted with the talkers. Speakers were 10 adult males who recorded speech samples under three types of speaking conditions: (a) normal, (b) stress and (c) disguise. Three classes of listeners were utilized: (a) a group of individuals who knew the talkers, (b) a group that did not know the talkers but was trained to identify them and (c) a group that neither knew the talkers nor understood the language spoken. The analyses indicate that the performances among the groups were significantly different. Listeners who knew the talkers performed best, while the non-English speaking listeners produced the lowest level of correct identification. The "middle" group, i.e. the English speaking listeners who did not know the talkers, was divided into two subgroups by the method of extremes. However, even in this case, the most competent of the subgroups still was significantly less able to identify the talkers than were the listeners who knew them; the least competent subgroup performed at about the same level as the auditors who did not speak English. Finally, analysis of the three types of speech revealed that the normal and stress conditions were not statistically different relative to the identification task, whereas the disguised productions yielded fewer correct identifications.
0095-4470/82/020139+10 $03.00/0 © 1982 Academic Press Inc. (London) Ltd.

Introduction
It is argued that, of the several approaches to speaker identification, aural/perceptual techniques are among the most effective (Hecker, 1971). Indeed, the perception of a talker's identity solely from his or her speech is a familiar (if subjective) everyday experience: it is one that occurs under many different circumstances, i.e. during telephone conversations, at parties, from television broadcasts, and so on. In other instances, the identification of a person based on the perceived auditory signal alone can be crucial to a situational outcome. Pilots, for example, sometimes must be able to identify and attend to a single voice amid a welter of voices in order to obtain appropriate instructions for aircraft guidance; individuals will not reveal confidences over the telephone unless they feel reasonably sure of the listener's identity. While these subjective events do not demonstrate that aural/perceptual strategies for speaker identification are the most accurate, they do lend credence to the notion that such approaches are used frequently and are effective under a large variety of conditions.

While few comparisons contrasting the several approaches to speaker identification have been reported, substantial research has been carried out evaluating specific elements or relationships within each. In the aural/perceptual area, for example, the influence of various temporal/physical parameters upon the perceptual process has been investigated. Several researchers (Bricker & Pruzansky, 1966; Compton, 1963; Cort & Murry, 1971; LaRiviere, 1971) have studied the effects of utterance duration on the identification task. In this case, the evidence suggests that levels of correct speaker identification correlate with utterance duration only for very brief samples, and that longer utterances are important primarily because they permit listeners to sample a larger repertoire of a speaker's phonemes.
Second, there is evidence to show that mean speaking fundamental frequency or SFF (Compton, 1963; Iles, 1972; LaRiviere, 1971) and formant frequencies, especially F2 (Iles, 1972; Meltzer & Lehiste, 1977; Stevens et al., 1968), provide important cues in the perceptual identification of speakers, and that even the relative contributions of the source and the vocal tract can be postulated (LaRiviere, 1971). Third, phonemic effects on the identification task have been investigated. It has been reported that perceptual identifications appear to vary as a function of (1) the vowel produced, (2) consonant-vowel transitions and (3) inflections. Confusions in speaker pairs often vary with the vowel also (Bricker & Pruzansky, 1966; Iles, 1972; LaRiviere, 1971; Stevens et al., 1968). Finally, other investigators have studied the identification response in other ways, e.g. the effects on the process of different speakers using different speech materials (Carbonell et al., 1965; Stuntz, 1963; Williams, 1956), as well as contrasting the performance of observers who both listen to the speech and make judgements from spectrograms (Hennessy, 1970; Stevens et al., 1968; Tosi et al., 1972).

On the other hand, only minimal information is available with respect to the processes and strategies utilized in aural/perceptual speaker identification tasks. To illustrate, even though some research has been carried out on the effects of emotion, speaker disguise and distortion on speaker identification (Carbonell et al., 1965; Endress, Bambach & Flosser, 1971; Hecker et al., 1968; Reich & Duke, 1979; Simonov & Frolov, 1973; Williams & Stevens, 1972), in only a few of these cases have attempts been made to relate the events specifically to aural/perceptual approaches. In short, more information needs to be generated about the perceptual approaches/processes utilized in speaker identification.
The purpose of this set of experiments was to compare the characteristics of certain talker/listener relationships to the perceptual identification process. Basically, the possible effects of two sets of variables upon the identification response were studied. They were: (1) different speech modes (specifically, normal speech, speech under stress and voice disguise) and (2) different classes of auditors (specifically, listeners who knew the talkers, listeners who did not know them and listeners who knew neither the talkers nor the language).
Method
Talker characteristics and the basic methodology for generating the speech samples have been described in detail elsewhere (Hollien & Majewski, 1977); hence, they will be reviewed only briefly here. Ten individuals, drawn from a population of 25 healthy adult males, read a modernization of R. L. Stevenson's "An Apology for Idlers" three times. Tape recordings of these three 2.5 min speech passages were made for the conditions of: (1) normal speech: in this case, the passage was read as "naturally" as possible, (2) speech under stress: stress was induced by administering electric shocks (randomly delivered from a standard ElectroDermal Response unit at levels which caused the subject discomfort) while the subject was speaking and (3) voice disguise: here the talkers were permitted to disguise their voices in any way they wished except by using a "foreign dialect" or by whispering. The ten talkers used in these experiments were selected from the larger group available because the speech of all members of this subgroup was familiar to the first class of listeners. None exhibited a marked regional dialect.

Listening sessions were conducted in a sound treated room. High quality tape recording equipment and audio speakers were employed in order to present the stimuli free-field. The auditors heard four different sets of utterances, ranging in length from 50 to 58 words, selected from the available speech material. Two of the utterances were used for training; the third was used in the pre-test. The remaining sample was duplicated and used twice in the experimental tape. In all cases, exactly the same utterances were utilized for all talkers within each condition, i.e. training, pre-test and experimental. In any event, the experimental sequence consisted of 60 quasi-randomized stimuli as follows: 10 talkers each produced three types of speech (normal, stress, disguise), with the entire series of 30 stimuli being presented twice.
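The composition of the experimental sequence can be sketched in a few lines of Python. This is only an illustration of the counting described above: the talker labels, the seed and the use of a plain shuffle within each pass are assumptions, since the text states only that the order was "quasi-randomized".

```python
import random

TALKERS = [f"T{i:02d}" for i in range(1, 11)]   # hypothetical labels for the 10 talkers
CONDITIONS = ["normal", "stress", "disguise"]

def build_sequence(seed=0):
    """Return 60 (talker, condition) stimuli: the 30-item series presented twice.

    Assumption: each pass through the 30 stimuli is shuffled independently;
    the actual ordering constraints are not reported in the text.
    """
    rng = random.Random(seed)
    series = [(t, c) for t in TALKERS for c in CONDITIONS]  # 10 x 3 = 30 stimuli
    sequence = []
    for _ in range(2):              # the whole series is presented twice
        one_pass = series[:]
        rng.shuffle(one_pass)
        sequence.extend(one_pass)
    return sequence

sequence = build_sequence()
print(len(sequence))   # 60
```

Every talker/condition pair occurs exactly twice in the resulting list, which matches the 60-item field of stimuli to which each listener responded.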
Listeners indicated which of the 10 talkers was speaking by placing a check in the appropriate box on an answer form.

Listeners of three types were used. Group A consisted of 10 individuals who were extremely familiar with the speech of the talkers; they were drawn from the IASCP staff and students (University of Florida). In order to be utilized as a Group A listener, subjects were required to demonstrate that they could recognize the talkers solely from their speech; i.e. each volunteer had to correctly identify all speakers (no errors permitted) in a set of test samples. The first 10 individuals who volunteered achieved this level of correct identification.

The protocols for the selection of Group B were as follows. Staff and students at the University of Florida who neither knew the 10 talkers, nor had heard them speak, were requested to (1) listen to a series of normal utterances by these 10 individuals (feedback concerning the correct response was provided), (2) take a preliminary identification test based on similar normal utterances and (3) respond (ultimately) to the experimental task. Even though response data were obtained on all volunteers, the original protocols specified that only the first 10 (or more) volunteers who achieved an identification score of 70% or better on the pre-test were to be assigned to Group B. Nearly 50 volunteers were evaluated before this criterion was met. Thus, information was obtained for more Group B individuals than was expected. In any case, these subjects were organized into a "total" group (Bt: N = 47 listeners) and subgroups consisting of those individuals who scored highest (Bh: scores = 70-100%; N = 13) and those who scored lowest on the pre-test (B1: scores = 0-30%; N = 17).

The final procedure was carried out on a group (Group C) consisting of 14 Polish speaking staff employed at the Technical University of Wroclaw. None of these individuals knew, or had studied, English; no one in Group C knew any of the talkers prior to the experiment.
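The division of Group B by pre-test score can be expressed as a simple rule. The sketch below uses the 70% and 30% cut-offs given above; the function name and the example scores are hypothetical, and listeners falling between the cut-offs are assumed to remain in the total group (Bt) only, which is how the reported group sizes (13 + 17 < 47) read.

```python
def assign_subgroup(pretest_percent):
    """Map a Group B pre-test score (0-100) to a subgroup label.

    Cut-offs follow the text: 70-100% -> "Bh", 0-30% -> "B1".
    Scores in between belong to the total group (Bt) only, so None is returned.
    """
    if not 0 <= pretest_percent <= 100:
        raise ValueError("score must lie between 0 and 100")
    if pretest_percent >= 70:
        return "Bh"
    if pretest_percent <= 30:
        return "B1"
    return None

# Hypothetical pre-test scores, for illustration only.
example_scores = [85, 70, 55, 30, 10]
labels = [assign_subgroup(s) for s in example_scores]
print(labels)   # ['Bh', 'Bh', None, 'B1', 'B1']
```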
Table 1. Per cent correct identifications, by listeners, of the 10 talkers who produced controlled utterances under the three speaking conditions. Listener groups were made up of individuals (A) who knew the talkers, (B) who did not know the talkers or (C) who knew neither the talkers nor the language

                                           Speaking condition
Group                       N   Pre-test   Normal   Stress   Disguise
Group A            Mean    10   100        98.0     97.5     79.0
                   Range                   90-100   90-100   65-90
Group B
  Bt Total group   Mean    47   49.8       39.8     31.4     20.7
                   Range        0-100      5-80     0-65     0-50
  Bh High group    Mean    13   84.6       50.0     40.8     24.6
                   Range        70-100     15-75    10-65    5-50
  B1 Low group     Mean    17   22.4       32.9     25.0     16.8
                   Range        0-30       5-60     0-50     0-40
Group C            Mean    14   15.7       27.1     26.8     17.9
                   Range        0-40       15-45    5-45     5-30
Results
Group A: listeners who knew the talkers
The obtained data for all groups, including Group A, can be found in Table 1. From examination of the table, it may be seen that this class of listeners could identify the talkers virtually all of the time when they were speaking normally, or under stress, even though they had to respond to a relatively large field of stimuli (60 items). Indeed, only four errors, out of a total of 200 trials, were made for the normal speaking condition and only five for stress. Thus, it would appear that listeners have very little difficulty identifying talkers under these conditions if they are familiar enough with their speaking modes. On the other hand, even these listeners were somewhat confused by talker disguise and exhibited an error rate of 21% for this condition. Even though this correct identification level was significantly better than that for our machine approach (Hollien et al., 1977; Hollien & Majewski, 1977; Doherty & Hollien, 1978), such levels simply are not of acceptable accuracy, especially for laboratory controlled research. The data do suggest, however, that it is difficult for individuals to disguise their voices sufficiently well so that they can continue to confuse listeners who are intimately familiar with their speech.
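The Group A percentages follow directly from the trial counts: 10 listeners each judged 10 talkers twice per speaking condition, giving 200 trials per condition. A minimal sketch of that arithmetic, using a hypothetical helper name, is:

```python
def percent_correct(errors, trials):
    """Percentage of correct identifications given an error count."""
    return 100.0 * (trials - errors) / trials

LISTENERS, TALKERS, REPETITIONS = 10, 10, 2
trials_per_condition = LISTENERS * TALKERS * REPETITIONS   # 200

normal = percent_correct(4, trials_per_condition)   # four errors reported
stress = percent_correct(5, trials_per_condition)   # five errors reported
print(normal, stress)   # 98.0 97.5
```

Both values agree with the Group A means given in Table 1; the 21% error rate for disguise likewise corresponds to the tabled mean of 79.0%.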
Group B: listeners who did not know the talkers
The response levels for the English speaking listeners who did not know the talkers also may be found in Table 1. They are presented separately for: (1) the 13 listeners who scored highest on the identification pre-test, i.e. they exhibited correct identification scores of 70-100% after training, (2) the total group of 47 volunteers and (3) a subgroup of individuals who scored the lowest on the pre-test, i.e. levels of only 0-30% correct identification after training. Examination of the table reveals that, while all group/subgroup scores are above chance, it is apparent that some individuals demonstrate a better (perceptual) ability to recognize talkers from their speech than do others or, possibly, that they can be more easily trained in this task. Indeed, the identification pre-test mean of 84.6% for the better group (Bh) is nearly four times higher than that for the poorer group (the B1 mean is 22.4%). The abilities of this superior group also are apparent within the experiment itself, but the differences are not as marked. In this case, the scores of Group Bh are only about 50% better than are those for Group B1. Moreover, the total group (Group Bt) exhibited substantial overall variability for both the pre-test and the experiment itself. For example, the pre-test scores ranged over the entire possible spectrum (i.e. 0-100%); variation of nearly the same magnitude can be noted for the three experimental conditions. Such variability suggests wide differences in listeners' abilities to recognize speakers from their voices and/or to learn such skills; in addition, it suggests that a substantial number of different listening strategies may be employed by auditors.

One of the major issues in these experiments concerned the relative performances between listeners who knew the talkers and others who did not. When the responses provided by Group A are compared to those for Group Bt, it can be seen that the first group is over twice as adept as the second at identifying the speakers under the normal speaking condition, three times as good for stress and better for disguise by a factor of nearly four. Even when the Bh subgroup was compared to Group A, the differences favored the group who knew the talkers by factors varying in magnitude from two to three.
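The factors quoted in this comparison can be checked against the group means in Table 1. The short sketch below simply divides one set of means by another; the dictionary layout and the names are illustrative choices, not part of the original analysis.

```python
# Mean per cent correct from Table 1 (normal, stress, disguise).
MEANS = {
    "A":  {"normal": 98.0, "stress": 97.5, "disguise": 79.0},
    "Bt": {"normal": 39.8, "stress": 31.4, "disguise": 20.7},
    "Bh": {"normal": 50.0, "stress": 40.8, "disguise": 24.6},
    "B1": {"normal": 32.9, "stress": 25.0, "disguise": 16.8},
}

def ratio(numerator, denominator):
    """Ratio of two groups' mean scores for each speaking condition."""
    return {
        cond: round(MEANS[numerator][cond] / MEANS[denominator][cond], 2)
        for cond in ("normal", "stress", "disguise")
    }

a_vs_bt = ratio("A", "Bt")     # listeners who knew the talkers vs. total Group B
a_vs_bh = ratio("A", "Bh")     # ... vs. the better Group B subgroup
bh_vs_b1 = ratio("Bh", "B1")   # better vs. poorer Group B subgroup
print(a_vs_bt)
print(a_vs_bh)
print(bh_vs_b1)
```

With these means, Group A exceeds Bt by factors of roughly 2.5, 3.1 and 3.8 across the three conditions, and Bh exceeds B1 by roughly 1.5 to 1.6, in line with the description above.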
Group C: listeners who knew neither the talkers nor the language
Table 1 also presents the data for the third experiment, i.e. the identification scores for the 14 Polish speaking listeners (Group C). As stated, this group neither knew nor had studied English, nor did they know any of the 10 talkers. As with Group B, the Group C identification scores are clearly above chance and are roughly at the same level as the scores for the poorer of the English speaking subgroups (B1). That is, while the pre-test scores are somewhat higher for Group B1 than they are for Group C, the results for the three experimental conditions are rather similar and, in fact, the Polish listeners achieved slightly higher scores for two. On the other hand, the performance pattern (for Group C) is more like that for Group A than it is for any of the B group/subgroups, i.e. the identification scores for the Poles are more nearly the same for the normal and stress conditions. Apparently, the strategies employed by these listeners were as effective for the stress speaking conditions as they were for the identification of normal speech.

Statistical analyses of the data
The results of the statistical analyses of the data are presented in Table 2; two separate analyses were performed. The first is shown in the top portion of the table (Part a); it examined the performance of three groups of listeners responding to the three types of speech. This analysis of variance indicates that some differences exist both between the groups of listeners and between the types of speech. Post hoc comparisons indicate that all three groups are significantly different from one another. Further, the listeners responded similarly for the normal and stress speaking conditions but performed at a substantially reduced level for disguise. Due to the great variability in pre-test performance, Group B was split into Bh, those individuals who scored 70-100% on the pre-test, and B1, listeners with pre-test scores of 0-30%.
Accordingly, a second ANOVA was conducted and is reported in the lower half of the table. Again, post hoc comparisons confirmed that Group A performed better than any of the others. However, in this case the Bh listeners were shown to achieve significantly higher scores than the B1 and C groups (which, in turn, were not significantly different from each other). An examination of the speech conditions demonstrated that, as expected, listener responses followed the same pattern as in the "three group" analysis: namely, the scores for normal and stress are similar and are, in turn, statistically better than for the disguised condition.

Table 2. Results of analyses of variance based on listeners' per cent correct identifications of the 10 talkers who produced controlled utterances under the three speaking conditions. Listener groups were made up of individuals (A) who knew the talkers, (B) who did not know the talkers or (C) who knew neither the talkers nor the language

Source                    df   MS         F-value
(a) Analysis based on three groups
  Listeners                2   50911.23   251.80*
  Speech                   2    5390.49    26.66*
  Listeners × speech       4     224.91     1.11
  Error                  204     202.19
(b) Analysis based on four groups
  Listeners                3   34041.21   198.46*
  Speech                   2    2801.34    24.50*
  Listeners × speech       6     205.47     1.20
  Error                  150     171.53
*Significant at the 0.05 level.

Discussion and conclusions
Figure 1 summarizes the information provided by the three experiments. First, it may be seen that there is a distinct difference in the level of correct identification for each group of listeners. As noted previously, Groups A, Bt and C are all significantly different from one another. Further, the data are consistent with those reported by Williams (1956) with regard to the size of talker groups. While his research was based on rather different experimental procedures, he reported that he could obtain correct identification levels of only about 40% for speaker groups as large as eight individuals when the listeners were unfamiliar with the talkers. Even though the present experiments utilized (1) considerably shorter training schedules than did Williams, (2) a larger group of talkers than his and (3) a rather large field of stimuli from which to make the identifications, it is interesting to note that the overall mean for the 47 subjects in Group B is almost exactly 40%. Admittedly, neither the procedures used by Williams nor those in the present study parallel the forensic or investigational situation; indeed, they do not even replicate the typical speaker verification paradigm. Nevertheless, data from both studies suggest that it is difficult for an individual to select a particular, and unfamiliar, speaker from a reasonably large field of other speakers under any conditions, but especially when an attempt at voice disguise is present.
The relationships among the three speaking conditions (over groups) also may be assessed from the information presented by Fig. 1. First, the identification response by all 71 subjects was best for the normal speaking conditions. Although the difference is not of statistical significance, it was somewhat poorer for stress; only the disguised speech showed a statistically significant (α = 0.05) decrease in identification. As has been pointed out also, the patterns among listener types were consistent across groups; it is especially clear that the auditors who knew the talkers (Group A) and the group that knew neither the talkers nor the language (Group C) performed as well for the stress condition as they did for normal speech. Apparently the strategies used by those groups, even though they may be quite different with respect to overall competency, were such that they were equally effective for either of these two (normal/stress) speaking conditions. It is even possible to speculate that knowledge of the language actually contributed slightly to distracting the 47 listeners in Group B and made it difficult for them to attend to the temporal and physical parameters that would permit them to obtain higher scores for the stress condition.

[Figure 1: plot of per cent correct identification (0 to 100, chance level marked) against speaking condition (N, S, D) for each listener group.]

Figure 1. Per cent correct identification data for all listener groups plotted as a function of three speaking conditions, i.e. normal, stress and disguise. Legend:
A: 10 English-speaking listeners who knew the talkers
Bt: 47 English-speaking listeners who did not know the talkers
Bh: Subgroup of 13 English-speaking listeners who did not know the talkers, but who scored 70-100% on identification pre-test
B1: Subgroup of 17 English-speaking listeners who did not know the talkers, and who scored 0-30% on identification pre-test
C: 14 Polish-speaking listeners who did not know the talkers or the language spoken.

Another comparison can be made. The long-term speech spectra (LTS) identification data previously reported (Hollien & Majewski, 1977) were re-evaluated for the 10 talkers utilized in this investigation. These data were then compared, by speaking condition, to those obtained via the aural/perceptual approach used in this research. Specifically, the scores for the objective (speech spectra) approach were 100, 80 and 30% for the normal, stress and disguise speaking conditions, respectively. Comparisons of these data with the results obtained from Group A reveal that both approaches result in somewhat similar scores for the normal speaking condition and not extremely dissimilar scores for stress. However, the Group A auditors did much better at the identification task for disguise than did the machine approach. On the other hand, when the identification response of the auditors who did not know the talkers was compared to the long-term spectra data, it was found that their performance was greatly reduced for the normal and stress speaking conditions and that it was slightly poorer for disguise. Admittedly, all of the data reported in the two studies resulted from laboratory type research and it is hazardous to predict how the two approaches would compare for either identification or verification tasks in the field, or if the research protocols utilized were of a different type. Presumably, however, identification scores for the human auditors would be less affected by such distorting conditions as noise, limited bandpass and so on. On the other hand, the correct response levels by the listeners probably would be more dependent upon the size of the talker groups. In any case, the LTS procedure appears to be superior to human (perceptual) judgements for the procedure that most closely parallels the forensic model.

Reich & Duke (1979) have examined the problem of auditory identification of disguised talkers. Although their procedure (paired comparisons) was markedly different from that used in the present experiments, the results are consistent with those found here. In both cases, listeners' performance was significantly greater than chance but was severely degraded by the talkers' use of vocal disguise.

Finally, the data for the disguise condition can be contrasted with those that have been reported for voice mimicry.
Of course, mimicry and disguise constitute two distinctly different processes; in the former, the speaker attempts to make his or her voice/speech similar or identical to that of another person; in the second case, the speaker attempts to change his or her voice so it will not be recognized as his or her own. In any event, several investigators (Endress et al., 1971; Hall, 1975; Lummis & Rosenberg, 1972) have reported that mimics are either totally unsuccessful, or enjoy relatively low success levels, when attempting to match their voices to those of others, depending, of course, on the nature of the research and the recognition analysis procedures. In the case of disguise (i.e. non-recognition), however, much higher success levels apparently are obtainable (Reich & Duke, 1979; Tate, 1978). Hence, from the limited data currently available, it appears that it is easier for a person to disguise his or her voice and speech successfully than it is to mimic the speech of another individual accurately. An understanding of this relationship should be of particular value to individuals who are working in the forensic area.

In summary, it is possible to draw a number of conclusions from this research; however, it should be remembered that all of the generalizations to be listed should be interpreted in a manner consistent with the nature of the investigations described in this report (i.e. they are laboratory experiments). First, auditors who listen to the speech of individuals with whom they are very familiar can be expected to identify them at very high levels of accuracy for conditions where the speech is normal or even when it is produced during the application of the type of mild stress used in this research. Second, individuals who do not know the speakers can be expected to be able to quickly learn to identify talkers at levels well above chance but not sufficiently high to be useful in the practical identification situation.
Third, the effects of attempted voice disguise should be expected to confuse members of any group of auditors; the confusion is even more pronounced when the listeners are not familiar with the speaker's speech and language. Accordingly, it is argued that voice disguise probably will constitute one of the more difficult challenges to any speaker identification approach, especially to those employed in the field. Finally, while the evidence from this research does not seriously challenge the currently held opinion that the human auditory mechanism provides a reasonable system for the identification of speakers, it does suggest that errors can be expected if the listener is not familiar with the speaker and that these errors will be sufficiently numerous to be unacceptable, at least for the forensic model.

This paper was presented in part at the convention of the Acoustical Society of America, St. Louis, Missouri, November, 1974. The authors wish to thank Mrs Patti Hollien and Mrs Clynthia Slater for their assistance with the project. The research was supported in part by NIH grant NS-06459 and by grants from the Graduate Schools at the University of Florida and the Technical University of Wroclaw.
References
Bricker, P. & Pruzansky, S. (1966). Effects of stimulus content and duration on talker identification. Journal of the Acoustical Society of America, 40, 1441-1450.
Carbonell, J. R., Grignetti, M. C., Stevens, K. N., Williams, C. E. & Woods, B. (1965). Speaker authentication techniques. Report 1296 No. DA-28-043-AMC-00116 (E), Bolt, Beranek and Newman, Inc., Cambridge, Massachusetts.
Compton, A. J. (1963). Effects of filtering and vocal duration upon the identification of speakers aurally. Journal of the Acoustical Society of America, 35, 1748-1752.
Cort, S. & Murry, T. (1972). Aural identification of children's voices. Journal of the Acoustical Society of America, 51, S131(A).
Doherty, E. T. & Hollien, H. (1978). Multiple factor speaker identification of normal and distorted speech. Journal of Phonetics, 6, 1-8.
Endress, W., Bambach, W. & Flosser, G. (1971). Voice spectrograms as a function of age, voice disguise and voice imitation. Journal of the Acoustical Society of America, 49, 1842-1848.
Hall, M. E. (1975). Spectrographic Analysis of Interspeaker and Intraspeaker Variabilities of Professional Mimicry, unpublished master's thesis, Michigan State University.
Hecker, M. H. L. (1971). Speaker recognition: an interpretative survey of the literature. American Speech Hearing Association Monographs, 16, 1-103.
Hecker, M. H. L., Stevens, K. N., von Bismarck, G. & Williams, C. E. (1968). Manifestation of task-induced stress on the acoustic speech signal. Journal of the Acoustical Society of America, 44, 993-1001.
Hennessy, J. J. (1970). An Analysis of Voiceprint Identification, unpublished master's thesis, Michigan State University.
Hollien, H. & Majewski, W. (1977). Speaker identification by long-term spectra under normal and distorted speech conditions. Journal of the Acoustical Society of America, 62, 975-980.
Hollien, H., Childers, D. G. & Doherty, E. T. (1977). Semi-automatic system for speaker identification (SAUSI). Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, Hartford, Connecticut, pp. 768-771.
Iles, M. (1972). Speaker Identification as a Function of Fundamental Frequency and Resonant Frequencies, unpublished doctoral dissertation, University of Florida.
Johnson, C. C. (1979). Speaker identification by means of temporal parameters: preliminary data. Current Issues in the Phonetic Sciences (H. and P. Hollien, Eds), Amsterdam: J. Benjamins, 9(11), pp. 821-828.
LaRiviere, C. L. (1971). Some acoustic and perceptual correlates of speaker identification. Proceedings of the Seventh International Congress of Phonetic Sciences, Montreal, Canada, pp. 558-564.
Lummis, R. C. & Rosenberg, A. E. (1972). Test of an automatic speaker verification method with intensively trained professional mimics. Journal of the Acoustical Society of America, 51, S131(A).
Majewski, W. & Hollien, H. (1974). Euclidean distances between long-term speech spectra as a criterion for speaker identification. Proceedings, Speech Communication Seminar-74, Stockholm, Sweden, pp. 202-210.
Meltzer, D. & Lehiste, I. (1972). Vowel and speaker identification in natural and synthetic speech. Journal of the Acoustical Society of America, 51, S131(A).
Pollack, I., Pickett, J. M. & Sumby, W. H. (1954). On the identification of speakers by voice. Journal of the Acoustical Society of America, 26, 403-412.
Reich, A. R. & Duke, J. E. (1979). Effects of selected vocal disguises upon speaker identification by listening. Journal of the Acoustical Society of America, 66, 1023-1028.
Rekieta, T. W. & Hair, G. D. (1972). Mimic resistance of speaker verification using phoneme spectra. Journal of the Acoustical Society of America, 51, S132(A).
Rosenberg, A. E. (1972). Evaluation of an automatic speaker verification system over telephone lines. Journal of the Acoustical Society, 57, S23(A).
Simonov, P. V. & Frolov, M. V. (1973). Utilization of human voice for estimation of man's emotional stress and state of attention. Aerospace Medicine, 44, 256-258.
Stevens, K. N., Williams, C. E., Carbonell, J. R. & Woods, D. (1968). Speaker authentication and identification: a comparison of spectrographic and auditory presentation of speech materials. Journal of the Acoustical Society of America, 44, 1596-1607.
Stuntz, S. E. (1963). Speech intelligibility and talker recognition tests of air force communication systems. Report, ESP-TDR-63-224, Electronic Systems Division, Air Force Systems Command, Hanscom Field, Massachusetts.
Tate, D. A. (1978). A Study of Speaker Disguise, unpublished master's thesis, University of Florida.
Tosi, O., Oyer, H., Lashbrook, W., Pedrey, C., Nichol, J. & Nash, E. (1972). Experiment on voice identification. Journal of the Acoustical Society of America, 51, 2030-2043.
Williams, C. E. (1956). The effects of selected factors on the aural identification of speakers. Tech. Doc. Opt. ESD-TDR-65-153, Electronic Systems Division, Air Force Systems Command, Hanscom Field, Massachusetts.
Williams, C. E. & Stevens, K. N. (1972). Emotions and speech: some acoustic correlates. Journal of the Acoustical Society of America, 52, 1238-1250.