Journal of Phonetics (1984) 12, 319-326
Speaker identification utilizing selected temporal speech features
Charles C. Johnson
Amherst Associates, 210 Old Farm Road, Amherst, Massachusetts 01002, U.S.A.
Harry Hollien
Institute for Advanced Study of the Communication Processes, University of Florida, Gainesville, Florida 32611, U.S.A.
and James W. Hicks
SCI Systems Inc., 8600 South Memorial Parkway, M/S 32, Huntsville, Alabama 35802, U.S.A.
Received 28th April 1984
Abstract:
This investigation was undertaken in an effort to determine if certain temporal features of speech could be used as speaker identification cues. In order to rigorously test the selected features, they were evaluated on the basis of normal, stress and disguise speaking conditions. Specifically, 20 adult male subjects were recorded under the three cited conditions and temporal analyses carried out on each speech sample. These measurements included multilevel durational analyses of the speech bursts/pauses plus several estimates of speech rate. The results of this experiment suggest that it may be possible to utilize temporal speech features as speaker identification cues. For example, a majority of the talkers were correctly identified for the normal speaking mode and nearly as many for the stressful and disguised conditions. While these levels are not high enough for immediate application, they do suggest that temporal speech features contribute to the identification process - and that a properly developed temporal vector should operate to support a machine-based speaker identification system.
Please address correspondence to: Dr Harry Hollien.
0095-4470/84/040319+08 $03.00/0 © 1984 Academic Press Inc. (London) Limited
Introduction
Recognizing a speaker from his/her voice alone routinely occurs under a variety of familiar circumstances - during telephone conversations, at cocktail parties, from television broadcasts, and so on. Moreover, in some situations, it is not simply desirable but, indeed, crucial to be able to do so. For example, valid speaker authentication techniques must be available to the military before voice activated weaponry can be deployed and utilized with safety; banking by telephone will not become a reality until robust verification techniques are operational. Just as important (as verification) are speaker identification methods - they are
of special interest to law enforcement agencies as well as to relevant individuals working in the security, intelligence and business sectors. Within law enforcement, for example, the recording of the speech of criminals and/or suspects is common practice and these recorded conversations often are admissible in Courts of Law. In any case, it would appear that a reliable and objective speaker identification system would be of value in the identification and conviction of criminals.
Some research has been carried out in the speaker identification area; the reports resulting from the efforts may be sorted roughly into three categories: (1) aural/perceptual speaker recognition (see among others, McGehee, 1937; Pollack et al., 1954; Stevens et al., 1968; Rothman, 1977; Hollien et al., 1982), (2) the spectrographic or visual recognition approach often referred to as "voiceprints" or "voicegrams" (see especially Kersta, 1962; Tosi et al., 1972; Hollien & McGlone, 1976; Reich et al., 1976; Houlihan, 1979; Rothman, 1977) and (3) the machine (or automatic) approach (see among others, Pruzansky, 1963; Atal, 1972; Wolf, 1972; Sambur, 1975; Doherty, 1976; Hollien & Majewski, 1977; Johnson, 1979; Furui, 1981). While most investigations involve only normal speaking conditions, some also have dealt with speech produced in stressful or disguised situations - issues of interest relative to this research (for example, see Endres et al., 1971; Hollien & McGlone, 1976; Reich et al., 1976; Hollien & Majewski, 1977; Doherty & Hollien, 1978; Houlihan, 1979; Hollien et al., 1982).
In any case, the research literature on speaker recognition can be summarized as follows. First, a reasonable amount of research has been carried out on speaker verification (the citations here are too numerous to include) and a lesser amount on speaker identification (see the above citations among others).
Second, investigators (in the identification area) have divided their efforts into three subareas: (1) one group has focused upon the human's ability to perceive a person's identity from listening to his or her voice (and upon the various strategies employed by listeners in this, the identification task), (2) a second group has studied "voice-prints/grams" (i.e. pattern matching of time-by-frequency-by-amplitude sound spectrograms) and (3) a final group is working with machine-based approaches to the problem. Yet, it is ironic to note that - no matter what approach is utilized or how intense the effort - temporal parameters rarely are included, at least to any great extent, among the features/vectors studied as cues to the identification process.
Accordingly, it was the aim of this project to investigate the identification capabilities of a number of temporal speech/voice parameters - and the effects of different speaking conditions upon the robustness of the resulting vectors. Specifically, the experiment involved three speaking conditions: (1) normal, (2) stress and (3) disguise; attempts were made to identify individuals by analysing the temporal characteristics of the speech produced under these conditions. It was hoped that the data from this investigation would provide information about the strength of a temporal (identification) vector and its resistance to speaker distortions (voluntary and involuntary).
Method
Subjects and speech material
The speakers utilized were 40 adult males ranging in age from 25 to 45 years; they exhibited no unusual dialects and no speech or voice disorders. Subjects read a modernization of R. L. Stevenson's An Apology for Idlers while being recorded on high-quality laboratory equipment. Three speaking conditions (normal speech, stress and disguise) were included; they may be defined as follows: (1) "normal" speech resulted from subjects' best effort to read
the passage; it constituted the control (or baseline) production, (2) speech under stress is self-explanatory; it was induced by application of electrical shock, delivered randomly to one hand while the subject was speaking and (3) disguise was the least controlled of the three variables; here, subjects were requested to disguise their speech as completely as they could, with the only restriction being that they should not use a "foreign dialect", whisper or speak in any but the modal voice register.
Analysis procedure
A number of temporal parameters were selected for analysis; only the two with the best potential are included in this report. These vectors were a time-energy distribution (TED) and a voiced/voiceless speech time contrast (VVL); they were applied as single vectors and in combination.
The TED vector was based on a group of time-by-energy measurements which reflected the total accumulated time a talker's speech bursts remained at a specific energy level (relative to his peak amplitude). Moreover, this vector provided indications of speakers' speech patterns with respect to both the bursts and pause periods. Strictly defined, "speech bursts" constitute that portion of the (speech) energy produced which is above a specified level and pause periods are those areas (between the bursts) where the energy falls below that level (see Fig. 1). The measurement of the speech bursts and pause periods was accomplished by processing a digitized energy envelope on a Digital Equipment Corporation PDP-8i mini-computer. All parameters were computed for 10 equidistant energy levels; they included the number and extent of the speech bursts (including both the means and standard deviations) and the standard deviations of the pause periods. The mean pause periods and number of pauses were not computed as they are direct reciprocals of the corresponding speech burst values. In any case, a total of 40 parameters (i.e. four at each of the 10 levels) contributed to the development of this vector.
Figure 1. Schematic representation of (a) a typical energy envelope and (b) the quantized bands generated for the vector. (Horizontal axis: time.)
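As a rough illustration of how a TED-style vector could be assembled, the following Python sketch measures speech bursts and pauses at 10 equidistant levels of a digitized energy envelope. It is not the original PDP-8i program. The function names (`runs_at_level`, `ted_vector`), the 10 ms frame spacing, the placement of the levels at k/11 of the peak, and the reading of the 40 parameters as four measures per level are all assumptions made for this example.

```python
from statistics import mean, pstdev


def runs_at_level(envelope, threshold, frame_s):
    """Return (burst_durations, pause_durations) in seconds.

    A burst is a run of frames whose energy exceeds the threshold; a pause
    is a run at or below it. In this sketch, low-energy runs at the very
    start or end of the envelope are also counted as pauses.
    """
    bursts, pauses = [], []
    run_len, run_is_burst = 0, None
    for value in envelope:
        is_burst = value > threshold
        if is_burst == run_is_burst:
            run_len += 1
        else:
            if run_is_burst is not None:
                (bursts if run_is_burst else pauses).append(run_len * frame_s)
            run_len, run_is_burst = 1, is_burst
    if run_is_burst is not None:
        (bursts if run_is_burst else pauses).append(run_len * frame_s)
    return bursts, pauses


def ted_vector(envelope, frame_s=0.01, n_levels=10):
    """Build a 4-per-level feature list (40 values for 10 levels).

    Per level: number of bursts, mean burst duration, SD of burst
    durations, SD of pause durations. Levels are spaced evenly between
    zero and the talker's peak energy (an assumed placement).
    """
    peak = max(envelope)
    features = []
    for k in range(1, n_levels + 1):
        threshold = peak * k / (n_levels + 1)
        bursts, pauses = runs_at_level(envelope, threshold, frame_s)
        features += [
            len(bursts),
            mean(bursts) if bursts else 0.0,
            pstdev(bursts) if bursts else 0.0,
            pstdev(pauses) if pauses else 0.0,
        ]
    return features


# Toy envelope (arbitrary energy units, one value per 10 ms frame).
toy_envelope = [0, 0, 5, 10, 5, 0, 0, 8, 8, 0]
vec = ted_vector(toy_envelope)
print(len(vec))  # 40
```

Expressing the levels relative to each talker's own peak keeps the measures comparable across recordings made at different gains, which appears to be the intent of the "relative to his peak amplitude" wording above.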
The voiced/voiceless speech time (VVL) vector was composed of only two parameters: phonation time and speech time. The first term of this vector consisted of the total duration of phonation or vocal activity which occurred during a speech "sample" (i.e. phonation time); it was obtained by signal analysis carried out on the IASCP Fundamental Frequency Indicator (FFI-8). This system was developed so that speaking fundamental frequency could be extracted from a sample of connected speech (Hollien, 1981). In one of its secondary modes, FFI-8 is programmed to calculate the duration of vocal activity (phonation time) from a tape recorded sample. The second element within the VVL subvector was generated by extracting total articulation time from the sample. This procedure can best be understood by further reference to Fig. 1. Here it can be seen that the summed duration of the
speech energy bursts (at level one) represents the total amount of articulated speech. Thus, the voiceless speech component of the sample was represented on the basis of total articulation time as related to phonation time.
A discriminant analysis statistical procedure was selected as the basic approach to the identification task. This technique was chosen because it appears (at least in this case) to be a more sensitive approach than do methods such as Euclidean distance analysis or cross-correlation (Doherty, 1976; Doherty & Hollien, 1978; Zalewski et al., 1975). Additionally, three classification methods were utilized in association with the statistical analysis. That is, each of a talker's three speech samples was divided into roughly equal quarters and each sample reclassified, in turn, with respect to the reference sets. The first of the three methods was posterior classification; it is used to "pretest" the speaker identifying features. The second method chosen was the jackknife approach. In this case, each sample was eliminated (in turn) from the reference set before computation of the discriminant functions. Classifications, then, were made on the "removed" samples. A third and final method was utilized to simulate the forensic model. Here, the talker's initial sample was arbitrarily designated the test sample and the reference set calculated from the remaining three samples. This third method is the identification procedure. Finally, the "identification" scores also were calculated on a paradigm where the fourth sample served as the test.
Results
Normal speaking condition
The results of the pretest and identification procedures may be found in Table I. As may be seen from examination of the first column in Table I, application of the time-energy distribution (TED) vector produced the highest identification scores. Indeed, this vector correctly classified all speakers when the posterior procedure was applied, 95% of the speakers in the jackknifed pretest and 60% of the speakers when the forensic type identification analysis was carried out. The VVL vector followed TED in effectiveness with correct classification rates of 65% and 40%, respectively, for the two pretest methods and only 7.5% relative to the identification method. Moreover, the combination vector demonstrated little-to-no improvement over the TED vector alone. This finding undoubtedly was due to the following relationships: (1) since all speakers were correctly identified by the posterior classification procedure, no improvement was possible, and (2) the lack of improvement for the jackknifed and identification procedures apparently resulted because few (if any) new parameters were added by the VVL vector.
Stress speaking condition
The posterior classification procedure was eliminated from the analysis where subjects were stressed; it was judged an inappropriate approach because the "stressed" speaking samples were compared only with the normals. Therefore, only the jackknifed pretest and identification tasks were carried out. The results of this set of evaluations also may be found in Table I. Specifically, the TED vector again yielded the highest levels of speaker classification and identification (70% and 40% correct). However, in this case the VVL vector proved to be almost as robust as TED - at least for the jackknifed analysis (i.e. 65% correct identifications). Finally, the combined vector provided no real improvement either with respect to classification or identification of the stressed talkers. Indeed, the accuracy of the combination vector was less than that of TED alone. Thus, it would appear that substantial overlap occurred between TED and VVL (at least in this case).
Table I. Pretest and identification scores obtained utilizing the time-energy distribution (TED), the voiced/voiceless speech time (VVL) and the combined vectors. n = 40; all scores are percentages

                                 Speaking conditions
Vectors                        Normal    Stress    Disguise
Posterior classification
  TED                           100.0      NA        NA
  VVL                            65.0      NA        NA
  Combination vector            100.0      NA        NA
Jack-knifed classification
  TED                            95.0     70.0      45.0
  VVL                            40.0     65.0      35.0
  Combination vector             95.0     65.0      60.0
Identification only
  TED                            60.0     40.0      30.0
  VVL                             7.5     20.0       5.0
  Combination vector             55.0     30.0      40.0

NA = not applicable.
Disguised speaking condition
The results of the experiment involving disguised speech may be seen in Table I (again only the jackknifed and identification values were calculated). As with the two preceding conditions, the highest set of single vector scores was associated with TED; i.e. 45% of the disguised voices were correctly matched to relevant talkers in the jackknifed pretest and correct identifications were made in 30% of the cases. Use of the VVL vector resulted in 35% correct disguise-to-normal matches for the jackknife pretest but only 5% of the voices were correctly identified. Combining the vectors resulted in some improvement; specifically, 60% of the disguised voices were correctly classified and 40% correctly identified. It is possible that, in this case, the characteristics identified by the two subvectors were substantially different.
Discussion
Normal speaking condition
As will be remembered, the normal speech analyses consisted of comparisons between one set of speech samples (test) and another like set (reference). The high levels of identification found for this condition were expected, and for several reasons. First, in previous experiments, investigators have suggested that identification cues based on temporal information should result in relatively robust predictions of a talker's identity - see, for example, Majewski & Hollien (1977), Doherty & Hollien (1978) as well as Pruzansky (1963), who reported that a time-energy distribution was effective as a speaker identification cue. Second, the very nature of the speech act would suggest that there could be individual characteristics which are time invariant (or nearly so). Finally, these vectors - especially the TED - were structured from a relatively large number of parameters. Accordingly, the fairly robust findings for these time-based vectors suggest that temporal information is (or can be) processed in support of the speaker identification task. On the other hand, it should be remembered that the highest identification score obtained under any of the conditions was 60%. Thus, it would appear that, while the present procedure (as structured) is of encouraging potential, it still is not sufficiently discriminative for practical use - even when high quality speech is utilized.
Stress speaking condition
Data from a second experiment were evaluated in an attempt to discover if speaking situations distorted by psychological stress would tend to reduce the speaker identification performance of the cited temporal parameters. The results demonstrate such an effect. Indeed, the TED classification scores were reduced (from 95% to 70%) as were the identification scores (from 60% to 40%). These relationships suggest that the features being evaluated are not as resistant to the external effects of stress as would be desired. However, it should be remembered that each subject's "stressed" speech samples were compared to referents calculated from his "normal" samples. Thus, they were both non-contemporary and drawn from different speaking categories. There is little doubt that use of speech samples of different classes tends to reduce the accuracy of identification, and the lack of speech contemporariness has been demonstrated to be deleterious to the speaker identification process (McGehee, 1937; Rothman, 1977; Tosi et al., 1972).
The patterns among the data for the other vectors (VVL and combined) are similar to those discussed for TED. Indeed, a reduction in both the classification and identification scores is observed when the combination vector was evaluated. However, it is possible that not enough new information was being added to this vector to permit improvement in its predictive function. Of course, such was not the case for the normal speech evaluations, i.e. in that case the addition of more variables often improved the overall vector performance. In this instance, however, there is little evidence to suggest that VVL adds very much to the more powerful TED vector.
While previous research on the relationship between stressful speaking conditions and speaker identification has been rather limited, data from the few studies reported tend to suggest that stressful speaking conditions will degrade the speaker identification processes. For example, both Hollien & Majewski (1977) and Doherty & Hollien (1978) report that their speaker identification rates - based on long-term speech spectra analysis - were reduced when "stressed" speech was compared to normal samples. The present findings are in general agreement and, thus, suggest that the presence of stress may reduce both the validity and reliability of most approaches to speaker identification. These relationships argue, then, that any speaker identification system which is proposed for field use must be tested both in a large variety of situations and in those that are stressful to the talker. Evaluations of that type will help to accurately predict the robustness of the procedures when they are applied to real-world situations.
Disguised speaking conditions
The scores obtained from the third experiment (i.e. disguise) were the lowest of all. However, while voice disguise obviously interferes with the identification task, most of the scores were substantially above chance - a finding which appears to demonstrate that at least some of the identification features measured by the temporal vectors are both present and effective when a speaker attempts to disguise his or her voice. Furthermore, the present vectors yielded identification levels that were as high as or higher than those of procedures utilized by other investigators (Hollien et al., 1974; Hollien & McGlone, 1976; Houlihan, 1979; Hollien & Majewski, 1977; Doherty & Hollien, 1978; Reich et al., 1976). Thus, it would appear that talkers may not be capable of altering their temporal speaking characteristics as easily as they do other speech features. Finally, utilization of the TED/VVL combined vector resulted in an increase of identification power. It now may be conceivable that the two subvectors are finally sampling different types of speaker-dependent information. This explanation appears possible as the TED vector may be measuring characteristics which can
be considered the temporal counterpart of spectral information whereas the VVL parameter may be more directly associated with vocal activity. If these suppositions are correct, the vector effects may be both discriminatory and additive when the speech mode is markedly changed from normal (as in disguise).

Table II. The identification scores obtained when the fourth speech sample was utilized as the test (second column in each pair); the TED, VVL and combination vectors were examined and the results compared to the prior evaluation (first column in each pair). n = 20; all scores are percentages

                            Speaking condition
               Normal            Stress            Disguise
Vectors     No. 1   No. 4     No. 1   No. 4     No. 1   No. 4
TED           60      60        40      15        30      25
VVL            8      10        20      35         5      10
Combined      55      65        30      25        40      25
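The "substantially above chance" remark made for the disguise scores can be checked with a small calculation. The sketch below is only illustrative: it takes n = 40 from the caption of Table I, assumes each test amounts to an independent guess among n equally likely talkers, and uses hypothetical helper names (`chance_rate`, `prob_at_least`).

```python
from math import comb


def chance_rate(n_speakers):
    """Probability of a correct guess in a closed set of n talkers."""
    return 1.0 / n_speakers


def prob_at_least(k_correct, n_trials, p):
    """Binomial probability of k_correct or more hits in n_trials."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(k_correct, n_trials + 1)
    )


n = 40                      # taken from the Table I caption (assumption)
p = chance_rate(n)          # 0.025, i.e. 2.5% correct by guessing

# Disguise, identification only: TED 30% and VVL 5% of n talkers.
ted_hits = round(0.30 * n)  # 12 talkers
vvl_hits = round(0.05 * n)  # 2 talkers

print(prob_at_least(ted_hits, n, p))  # vanishingly small under guessing
print(prob_at_least(vvl_hits, n, p))  # roughly one chance in four
```

Under these assumptions the 30% TED score is far beyond what guessing would produce, whereas the 5% VVL score is not clearly distinguishable from chance, which fits the wording "most of the scores".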
Validation procedure
It must be stressed that the pretest classification procedures (posterior and jack-knifed) utilized in these experiments served only to evaluate the vectors and identify those demonstrating the greatest potential for speaker identification. Since they are only viable in a closed-set test (i.e. when the speaker is known), the identification task must be considered to constitute the core of this research - and, as was expected, it yielded lower sets of scores than did the classification approaches. As stated earlier, the initial analysis procedure utilized only the first of the four speech samples (within each speaking condition) as the test sample. Thus, in order to validate the cited findings and provide information relevant to reliability, a second identification procedure was carried out. In this case, the last of the four speech samples was utilized as the test sample and the results summarized in Table II. Two relationships may be observed from the data contained in this table. First, the identification scores are similar enough to suggest that the first set of data is verified. Second, some variation in the patterns can be observed. That is, in some cases vector performance improved and in others it did not. Of course, it must be conceded that neither the initial nor the final sample in a series is necessarily the most idiosyncratic of a given talker's speech; moreover, there is some indication that vector effectiveness can vary as a function of sample size and type. Accordingly, these findings tend to suggest that test sample selection is critical to the evaluation process if speaker-dependent measures are to be effective as identification cues.
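The difference between the two pretest procedures and the forensic-type identification procedure can be sketched in a few lines of Python. This is a simplified stand-in and not the discriminant analysis used in the study: the classifier below is a nearest-centroid rule with each feature scaled by its pooled within-speaker standard deviation (a diagonal-covariance simplification with equal priors). All function names and the toy data are hypothetical.

```python
from statistics import mean


def centroid(vectors):
    """Feature-wise mean of a list of equal-length vectors."""
    return [mean(col) for col in zip(*vectors)]


def pooled_scale(reference):
    """Per-feature pooled within-speaker SD (floored to avoid division by 0)."""
    n_feat = len(next(iter(reference.values()))[0])
    sq, dof = [0.0] * n_feat, 0
    for vectors in reference.values():
        c = centroid(vectors)
        for v in vectors:
            for j, x in enumerate(v):
                sq[j] += (x - c[j]) ** 2
        dof += len(vectors) - 1
    return [max((s / max(dof, 1)) ** 0.5, 1e-9) for s in sq]


def classify(sample, reference):
    """Return the speaker whose reference centroid is nearest to the sample."""
    scale = pooled_scale(reference)

    def dist(c):
        return sum(((x - m) / s) ** 2 for x, m, s in zip(sample, c, scale))

    return min(reference, key=lambda spk: dist(centroid(reference[spk])))


def percent_correct(data, mode, test_index=0):
    """Score one procedure: 'posterior', 'jackknife' or 'identification'."""
    hits = trials = 0
    for spk, samples in data.items():
        tests = [test_index] if mode == "identification" else range(len(samples))
        for i in tests:
            if mode == "posterior":      # test sample stays in the reference
                reference = data
            elif mode == "jackknife":    # remove only the sample under test
                reference = {**data, spk: samples[:i] + samples[i + 1:]}
            else:                        # hold out sample i for every talker
                reference = {s: v[:i] + v[i + 1:] for s, v in data.items()}
            hits += classify(samples[i], reference) == spk
            trials += 1
    return 100.0 * hits / trials


# Toy data: three talkers, four samples each, two features per sample.
toy = {
    "A": [[1.0, 10.0], [1.1, 10.2], [0.9, 9.8], [1.0, 10.1]],
    "B": [[5.0, 2.0], [5.2, 2.1], [4.9, 1.9], [5.1, 2.0]],
    "C": [[9.0, 6.0], [9.1, 6.2], [8.8, 5.9], [9.0, 6.1]],
}
for mode in ("posterior", "jackknife", "identification"):
    print(mode, percent_correct(toy, mode))
print("fourth sample as test", percent_correct(toy, "identification", test_index=3))
```

With well-separated toy talkers every procedure scores 100%. On real data the posterior score is optimistic because each test sample also helps define its own reference, which is one reason the identification scores reported here are lower than the pretest scores. The sketch covers only same-condition testing; the stress and disguise comparisons against normal references are not modelled.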
Conclusions
The findings reported above would appear to support the following conclusions.
(1) Temporal features within the speech act appear to have potential as speaker identification cues - that is, if the techniques utilized are structured appropriately.
(2) The temporal parameters/vectors employed in this research demonstrate reasonable utility as speaker identification cues but will need to be modified if their power is to be enhanced.
(3) Speaker distortions (stress, disguise) tend to have a generally degrading effect on the identification robustness of temporal vectors - that is, at least for those features studied.
(4) Temporal speech feature analyses ultimately may provide a more powerful identification approach, in the presence of speaker disguise, than any of the other identification features currently available.

This research was supported in part by a grant from the Max and Victoria Dreyfus Foundation and the U.S. Army Research Office (ARO).
References
Atal, B. S. (1972). Automatic speaker recognition based on pitch contours. Journal of the Acoustical Society of America, 52, 1687-1697.
Doherty, E. T. (1976). An evaluation of selected acoustic parameters for use in speaker identification. Journal of Phonetics, 4, 321-326.
Doherty, E. T. & Hollien, H. (1978). Multiple-factor speaker identification of normal and distorted speech. Journal of Phonetics, 6, 1-8.
Endres, W., Bambach, W. & Flösser, G. (1971). Voice spectrograms as a function of age, voice disguise and voice imitation. Journal of the Acoustical Society of America, 49, 1842-1848.
Furui, S. (1981). Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-29, 254-272.
Hollien, H. (1981). Analog instrumentation for acoustic speech analysis. In Speech Evaluation in Psychiatry (J. Darby, ed.), pp. 79-103. New York: Grune and Stratton.
Hollien, H. & Majewski, W. (1977). Speaker identification by long-term speech spectra under normal and distorted speech conditions. Journal of the Acoustical Society of America, 62, 975-980.
Hollien, H. & McGlone, R. E. (1976). The effects of disguise on "voice-print" identification. National Journal of Criminal Defense, 2, 117-130.
Hollien, H., Majewski, W. & Doherty, E. T. (1982). Perceptual identification of voice under normal, stress and disguise speaking conditions. Journal of Phonetics, 10, 139-148.
Houlihan, K. (1979). The effects of disguise on speaker identification from sound spectrograms. In Current Issues in the Phonetic Sciences (H. Hollien & P. A. Hollien, eds), pp. 811-820. Amsterdam: John Benjamins.
Johnson, C. C. (1979). Speaker identification by means of temporal parameters: preliminary data. In Current Issues in the Phonetic Sciences (H. Hollien & P. A. Hollien, eds), pp. 821-828. Amsterdam: John Benjamins.
Kersta, L. G. (1962). Voiceprint identification. Nature, 196, 1253-1257.
McGehee, F. (1937). The reliability of the identification of the human voice. Journal of General Psychology, 17, 246-271.
Pollack, I., Pickett, J. M. & Sumby, W. H. (1954). On the identification of speakers by voice. Journal of the Acoustical Society of America, 26, 403-412.
Pruzansky, S. (1963). Pattern matching procedure for automatic talker recognition. Journal of the Acoustical Society of America, 35, 354-358.
Reich, A. R., Moll, K. L. & Curtis, J. F. (1976). Effects of selected vocal disguise upon spectrographic speaker identification. Journal of the Acoustical Society of America, 60, 919-925.
Rothman, H. B. (1977). A perceptual (aural) and spectrographic identification of talkers with similar sounding voices. Proceedings of the International Conference of Crime Countermeasures Science and Engineering, Oxford, pp. 37-42.
Sambur, M. R. (1975). Selection of acoustic features for speaker identification. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-23, 169-176.
Stevens, K. N., Williams, C. E., Carbonell, J. R. & Woods, D. (1968). Speaker identification and authentication: a comparison of spectrographic and auditory presentation of speech materials. Journal of the Acoustical Society of America, 44, 1596-1607.
Tosi, O., Oyer, H., Lashbrook, W., Pedrey, C. & Nash, W. (1972). Experiment on voice identification. Journal of the Acoustical Society of America, 51, 2030-2043.
Wolf, J. J. (1972). Efficient acoustic parameters for speaker recognition. Journal of the Acoustical Society of America, 51, 2044-2056.
Zalewski, J., Majewski, W. & Hollien, H. (1975). Cross-correlation between long-term speech spectra as a criterion for speaker identification. Acustica, 34, 20-24.