SCIENTIFIC & TECHNICAL
The relation between speech tempo, loudness, and fundamental frequency: an important issue in forensic speaker recognition
HJ KÜNZEL* Speaker Identification & Tape Authentication Department, Bundeskriminalamt (BKA), Wiesbaden, Germany
and
HR MASTHOFF, JP KÖSTER Phonetics Institute, University of Trier, Trier, Germany
Science & Justice 1995; 35: 291-295 (Received 23 May 1994; accepted 17 March 1995)
In voice identification, the mean fundamental frequency is a powerful and by far the best known speaker-specific acoustic parameter. In some cases, however, large differences in fundamental frequency values in different speech samples from the same subject may occur. This observation gave rise to the hypothesis that this parameter may interact with others, particularly speech tempo and loudness, which may both attain extreme values under forensic circumstances. In an experimental investigation with a total of 50 subjects in two groups, no statistically significant relationship was found between speech tempo, loudness and voice fundamental frequency. Therefore no correction or normalising procedure need be applied prior to the assessment or interpretation of fundamental frequency means. The issue needs further investigation with spontaneous speech samples and material from casework.
Key Words: Voice Recognition; Speech; Fundamental Frequency; Tempo; Loudness.
*Corresponding author: BKA, KT 54, D-65173 Wiesbaden.
Science & Justice 1995; 35(4): 291-295
Introduction
In the field of forensic speaker identification [1,2], voice fundamental frequency (Fo), the acoustic correlate of perceived pitch, is generally considered to be one of the most powerful speaker-specific parameters. At present it is certainly the best known parameter in terms of empirical data, both acoustic and physiological, and in terms of influencing factors such as various aspects of the speaking situation, the duration of samples, and others [3,4]. Some of the results may be used directly in forensic casework, for instance in assessing the chance probability of two speakers exhibiting an average Fo of n Hertz or less (or more).

Consider the following, previously published example [5]. Figure 1 shows the cumulative distributions of the mean Fo of two groups of subjects. Group A consisted of 100 male adult Germans who read a fake kidnapper's telephone call (average duration 45 s) five times in a six-month period. Group B consisted of 166 male adult German speakers who were asked to produce, once, a single phrase taken from an actual hoax bomb threat (average duration ca. 3 s). There was no overlap between the two sets of speakers. It is evident that both the shapes and the mean values of the distributions are very similar in spite of the differences in sample duration and size of speaker group. Figure 2 shows the Fo distribution for the data of both speaker groups combined.

In all cases, Fo was extracted on a computer developed specifically for forensic speech processing, using two independent algorithms: one derived from the inverse logarithmic power spectrum (cepstrum), the other a simple inverse filter tracking (SIFT) algorithm.
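As an illustration only, the use of a cumulative distribution of mean Fo for estimating the chance probability of a speaker exhibiting an average Fo of n Hertz or less (or more) might be sketched as follows. The function names and the ten reference values are hypothetical and invented for the example; they are not data from the study.

```python
import numpy as np

def proportion_at_or_below(reference_f0_hz, value_hz):
    """Empirical cumulative proportion of reference speakers with mean Fo <= value_hz."""
    ref = np.sort(np.asarray(reference_f0_hz, dtype=float))
    return np.searchsorted(ref, value_hz, side="right") / ref.size

def proportion_at_or_above(reference_f0_hz, value_hz):
    """Empirical proportion of reference speakers with mean Fo >= value_hz."""
    ref = np.sort(np.asarray(reference_f0_hz, dtype=float))
    return (ref.size - np.searchsorted(ref, value_hz, side="left")) / ref.size

# Invented mean-Fo values (Hz) for ten hypothetical reference speakers.
reference = [98, 104, 110, 112, 115, 117, 120, 124, 131, 140]

low = proportion_at_or_below(reference, 110)   # 3 of 10 speakers
high = proportion_at_or_above(reference, 131)  # 2 of 10 speakers
print(low, high)
```

With a real reference population, such as the 266 speakers underlying Figure 2, the same look-up would give the proportion of speakers expected to fall at or beyond a measured mean Fo.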
Since nearly all forensic casework performed at the BKA and the University of Trier forensic phonetics laboratories involves telephone recordings and is thus afflicted by all sorts of degradations, Fo results are regarded as sufficiently reliable only if mean values gained with both algorithms differ by not more than 1 per cent.
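The two-algorithm agreement criterion described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used on the forensic signal processing computer: `f0_cepstrum` is a textbook cepstral peak picker, `f0_autocorr` is a plain autocorrelation estimator used here only as a stand-in for SIFT (which is not implemented), the 60 to 400 Hz search range is assumed, and the 1 per cent difference is taken relative to the mean of the two estimates. All names are hypothetical.

```python
import numpy as np

def f0_cepstrum(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate Fo (Hz) from the peak of the real cepstrum of one frame."""
    windowed = frame * np.hamming(len(frame))
    log_power = np.log(np.abs(np.fft.rfft(windowed)) ** 2 + 1e-12)
    cepstrum = np.fft.irfft(log_power)          # inverse transform of the log power spectrum
    qmin, qmax = int(fs / fmax), int(fs / fmin)  # quefrency search range in samples
    peak = qmin + np.argmax(cepstrum[qmin:qmax + 1])
    return fs / peak

def f0_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Stand-in second estimator (NOT SIFT): peak of the autocorrelation function."""
    x = frame - np.mean(frame)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lmin, lmax = int(fs / fmax), int(fs / fmin)
    lag = lmin + np.argmax(ac[lmin:lmax + 1])
    return fs / lag

def estimates_agree(f0_a, f0_b, tolerance=0.01):
    """True if the two estimates differ by no more than `tolerance` of their mean."""
    return abs(f0_a - f0_b) / ((f0_a + f0_b) / 2.0) <= tolerance

# Synthetic test frame: harmonics of 125 Hz at a telephone-like 8 kHz sampling rate.
fs = 8000
t = np.arange(1024) / fs
frame = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 32))

a, b = f0_cepstrum(frame, fs), f0_autocorr(frame, fs)
print(a, b, estimates_agree(a, b))
```

In practice the estimates would be averaged over many voiced frames of a sample before the two means are compared.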
Mean Fo and its standard deviation (STD) may be affected by a variety of physiological, psychological, and situational factors [4,6]. In order to assess correctly the value of Fo results in a given case, the expert must be acquainted with such factors and their potential mutual interactions. Since forensic speaker identification is a fairly recent discipline, we are still at the starting point of such tests and for the time being will have to combine just a small number of variables in order to detect any interrelations.

The investigation reported below was stimulated by Fo-related findings in a considerable number of actual cases where the authors of this study, as well as colleagues from Germany and the UK, found that even though an array of phonetic features exhibited great similarity between the known and unknown speakers, fundamental frequency data indicated varying degrees of dissimilarity. In most of those cases, the discrepancy could be attributed to the condition of the speakers in different speaking situations. This is an unsatisfactory explanation, since we presently know very little about the effect of the condition of a speaker on his/her vocal behaviour in general and fundamental frequency in particular. In forensically valid voice comparisons Fo, like all other speaker-dependent features of the speech signal, must be "normalised" to take account of vastly
differing speaking situations, effects of the transmission channel, voice and/or language disguise, and other forms of non-cooperativity of the unknown and/or known speaker(s). Such factors account for the basic difference between forensic and commercial or cooperative speaker recognition. Automatic techniques such as artificial neural networks and Hidden Markov Models cannot effectively account for these factors. In forensic speaker recognition, the process of evaluation and normalisation is performed by the forensically trained expert in speech science or phonetics.

FIGURE 1 Cumulative distribution of mean voice fundamental frequency in two groups of adult male speakers (horizontal axis: mean Fo in Hertz).
FIGURE 2 Cumulative distribution of mean voice fundamental frequency in 266 adult male speakers (horizontal axis: mean Fo in Hertz).

In many of the cases mentioned above the speakers exhibited vast differences in speaking tempo. This parameter may be quantified in terms of Goldman-Eisler's syllable rate and articulation rate [7,8], the former defined as the average number of syllables per minute, the latter as the number of syllables per minute after deduction of all speech pauses contained in the sample. These parameters yield what might be conceived of as the net working speed of the speech organs. For forensic purposes it is useful to express both rates in syllables per second, because many texts last less than one minute yet may well yield representative and stable values [9].

While the speaking rate and speech tempo may be quantified, the affective condition of a speaker may not. We therefore considered whether it would be possible to normalise fundamental frequency data using speech tempo as a correction factor. There is a positive correlation between loudness of speaking, or intensity in acoustic terms, and Fo due to physiological constraints: increases in Fo of 40 per cent and more have been observed in the past [4,10]. Therefore it was considered possible that a significant acceleration of the speaking tempo might involve a general increase of the muscle tone of the articulating structures, including the glottis, leading to increased overall loudness and hence possibly to a higher Fo.
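The two tempo measures defined above, expressed in syllables per second, can be illustrated with a short sketch. The function names and the example figures (150 syllables, 30 s total duration, 5 s of pauses) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def syllable_rate(n_syllables, total_duration_s):
    """Syllables per second over the whole sample, pauses included."""
    return n_syllables / total_duration_s

def articulation_rate(n_syllables, total_duration_s, pause_durations_s):
    """Syllables per second after deducting all speech pauses from the duration."""
    net_duration = total_duration_s - sum(pause_durations_s)
    if net_duration <= 0:
        raise ValueError("pauses must be shorter than the total duration")
    return n_syllables / net_duration

# Hypothetical sample: 150 syllables in 30 s, of which 5 s are pauses.
sr = syllable_rate(150, 30.0)                       # 150 / 30 = 5.0 syllables/s
ar = articulation_rate(150, 30.0, [2.0, 1.5, 1.5])  # 150 / 25 = 6.0 syllables/s
print(sr, ar)
```

The articulation rate is never lower than the syllable rate for the same sample, since it divides the same syllable count by a shorter net duration.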
Experimental Method
The general pattern of the experiment was to have subjects read a text at a normal and at an accelerated rate, determine the respective articulation rates according to Goldman-Eisler, measure the average speaking fundamental frequency in both readings, and attempt to relate articulation rate to possible changes in Fo. The experiment was a joint operation of the Federal Criminal Police Office (BKA) and the University of Trier. The experiments were conducted independently in Wiesbaden and in Trier. The BKA group of subjects consisted of 10 men and 10 women aged 20-40 and the Trier group of 17 men and 13 women aged 20-30.

The BKA group were first asked to read a fake kidnapper's message consisting of 169 syllables at their individual "normal" speed. Subjects then read the same text under the instruction "imagine you need to communicate this text over the telephone in a public telephone booth. It is a long-distance call and you realise that you are running out of change." Both versions of the text were recorded on digital audio tape.

The Trier subjects were asked to read a neutral text from a newspaper. Sample size varied individually between 50 and 121 syllables. The subjects then had to read their text again according to the following instruction: "You have just read this text within N seconds" (for instance, 20 s). "Now try to repeat within half that time". It goes without saying that a 50 per cent reduction of the reading time was totally unrealistic. The purpose of the instruction was, however, to give the subjects an adequate incentive to increase their speaking rate substantially. Both versions were recorded on digital audio tape.

For both groups, articulation rates (AR) were measured under both conditions using digital editing of the speech signal for determination and suppression of pauses. Average Fo, its standard deviation (Fo-STD) and average overall intensity (Ao) as a measure of loudness were measured under both conditions. Fo data were obtained using the two above-mentioned algorithms, which were implemented in a Medav Spectro 3000 specialised signal processing computer. The articulation rate was determined after interactively marking the pauses and activating a compression function which eliminated the marked sections and linked the pre- and post-marked portions of the text. The criteria for the portion of the signal to be regarded as a speech pause were thus established by a phonetician (as opposed to an automatic pause-detector).
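The pause suppression step described above (marked pauses are cut out and the remaining portions joined) can be sketched as follows, together with the articulation rate and an overall level computed on the resulting "net" signal. This is a minimal sketch, not the Medav or Kay software: the function names are hypothetical, the pause boundaries are assumed to have been marked by hand and supplied in seconds, and the level is expressed in dB relative to a full-scale amplitude of 1.0, which is an assumption of the example.

```python
import numpy as np

def remove_pauses(signal, fs, pauses_s):
    """Cut out hand-marked (start, end) pause intervals and join the remainder."""
    keep = np.ones(len(signal), dtype=bool)
    for start, end in pauses_s:
        keep[int(round(start * fs)):int(round(end * fs))] = False
    return signal[keep]

def mean_level_db(signal):
    """Overall RMS level in dB re full scale (1.0); assumed convention."""
    rms = np.sqrt(np.mean(signal.astype(float) ** 2))
    return 20.0 * np.log10(max(rms, 1e-12))

# Synthetic 3 s recording at 8 kHz: tone, 1 s of silence (the "pause"), tone.
fs = 8000
t = np.arange(3 * fs) / fs
recording = 0.5 * np.sin(2 * np.pi * 100 * t)
recording[fs:2 * fs] = 0.0

net = remove_pauses(recording, fs, [(1.0, 2.0)])
n_syllables = 10                           # hypothetical syllable count
net_ar = n_syllables / (len(net) / fs)     # articulation rate in syllables/s
print(len(net) / fs, net_ar, mean_level_db(net))
```

Measuring the level on the net signal rather than the full recording keeps silent pauses from lowering the average.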
The articulation rate rather than the syllable rate was selected because the syllable rate would have introduced another undesirable variable, namely fluency of speaking or reading. Intensity measurements and calculations were performed on a Kay CSL 4300 speech processing system, with the "net" speech signal, i.e. after removal of all pauses (see above), serving as input. The inclusion of intensity as a parameter in this investigation required a constant distance between the mouth of the speaker and the microphone as well as a constant input sensitivity of the recording device and the signal processing system during the recording of the samples. The use of digital recording equipment with a large dynamic range (72 dB) was found to be particularly useful for this purpose.

Results
Somewhat surprisingly, measurements and calculations produced nearly identical results for the two groups of subjects, even though the experimental situations had been different in Trier and Wiesbaden as far as length and type of text as well as reading instructions were concerned. Nearly identical results were also obtained for the respective male and female populations. It was therefore considered adequate to "pool" both sets of data and thus obtain a larger statistical basis.

As was to be expected, articulation rates were markedly increased as a result of the instructions given prior to the task: on average by 1.5 syllables per second for men and 1.0 syllables per second for women. Since an increase can be shown for every subject, the difference may be regarded as significant at the 0.01% level of confidence according to the Dixon-Mood sign test.

Table 1 summarises the numerical values obtained for the four parameters in both conditions, normal and fast. Values for male and female subjects are presented separately because of the sex-related general differences for Fo and its standard deviation (STD). As far as tempo is concerned, there is a sex-independent, uniform response pattern: Fo increases by 0.07% and 0.05%; Fo-STD decreases by 7% and 9%; Ao increases by 1.3% and 1.4%. None of the changes is statistically significant (p >> 0.05). Figures 3 and 4 illustrate the results for males and females, respectively. It should also be noted that the Fo data from this experiment agree perfectly with the statistics presented in Figure 1 (mean Fo for 266 subjects: 116.6 Hz) and with data from other experiments with subjects of both sexes [3,9,11].

Discussion
Under the conditions of this experiment, voice fundamental frequency and its standard deviation over the test samples were found to be independent of articulation rate, or, in a broader sense, of speech tempo. Loudness of speaking in terms of overall intensity seems to be independent of speech tempo too. Thus it will neither be possible nor necessary to "correct" or "normalise" fundamental frequency values for speech tempo (articulation rate). The forensic phonetician will be quite pleased with this finding, since it implies that two or more samples containing speech produced at different tempos may well be compared in terms of average fundamental frequency. In view of the forensic bearing of this conclusion, other researchers are invited to further investigate the subject, using spontaneous rather than read speech material, perhaps even taken from real cases.

The result may also have a bearing on areas other than speaker identification. It is well known that one objective of modern speech synthesis is to create a high-quality, natural sounding signal in order to make
TABLE 1 Mean values for male and female subjects (n = 50) in normal and fast reading conditions. Columns: 27 males; 23 females. Rows, each for normal and fast reading: Fo (Hz); Fo-STD (Hz); Intensity (dB); Art.-Rate (syll/s).
FIGURE 3 Fundamental frequency (Fo), Fo-standard deviation (STD), intensity (Int) and articulation rate (AR) in normal (n) and fast (f) reading: male subjects.
FIGURE 4 Fundamental frequency (Fo), Fo-standard deviation (STD), intensity (Int) and articulation rate (AR) in normal (n) and fast (f) reading: female subjects.
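The sign test applied to the articulation rates in the Results can be illustrated with a short calculation. The sketch below is a generic two-sided sign test written for this illustration (the function name is hypothetical, and tied pairs are assumed to have been dropped beforehand); it uses only the facts reported in the text, namely that all 50 subjects showed an increase.

```python
from math import comb

def sign_test_p(n_increase, n_decrease):
    """Two-sided sign test p-value under the null hypothesis P(increase) = 0.5."""
    n = n_increase + n_decrease
    k = max(n_increase, n_decrease)
    # Probability of k or more changes in the same direction out of n.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# All 50 subjects increased their articulation rate in the fast condition.
p = sign_test_p(50, 0)   # 2 * 0.5**50 = 2**-49, about 1.8e-15
print(p, p < 0.0001)     # far below the 0.01% level
```

Because the sign test uses only the direction of each subject's change, it needs no assumption about how the size of the changes is distributed.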
it acceptable to commercial users. Typical deficits such as poor intelligibility and robotic sound must be avoided. In view of the above-mentioned finding it may be inferred that there is no need for an algorithmic interrelation of speech tempo and Fo.

Results on the standard deviation of Fo are, however, not as unequivocal as they may seem. Even though speed-induced changes were found not to be statistically significant for the respective speaker sex groups, this result was based on a high degree of inter-speaker variability. Also, the observation that some individuals tend to increase their Fo-STD considerably while others decrease it needs to be checked in future investigations. A similar change of an Fo-related parameter was observed when speech was produced before and after alcohol intoxication [12].

References
1. French JP. Developments in forensic speaker identification. Acoustics Bulletin 1993; 18(5): 13-16.
2. Künzel HJ. Current approaches to forensic speaker recognition. Proceedings of the ESCA Workshop on Automatic Speaker Recognition, Identification, Verification. Martigny, Switzerland, April 4-7, 1994: 135-141.
3. Baken RJ. Clinical Measurement of Speech and Voice. London: Taylor & Francis, 1978.
4. Braun A. Zur Bedeutung des Merkmals "mittlere Sprechstimmlage" in der forensischen Sprechererkennung. In: Dingeldein H, Lauf R (eds) Festschrift für J Göschel. Marburg: Universitätsbibliothek, 1992: 1-26.
5. Künzel HJ. Perspektiven einer forensischen Phonetik. Zeitschrift für Dialektologie und Linguistik 1992; 3: 294-311.
6. Howard DM, Hirson A, French JP, Szymanski JE. A survey of fundamental frequency estimation techniques used in forensic phonetics. Proceedings of the Institute of Acoustics 1993; 15(7): 207-215.
7. Goldman-Eisler F. Continuity of speech utterance, its determinants and its significance. Language and Speech 1961; 4: 220-231.
8. Goldman-Eisler F. Psycholinguistics. Experiments in Spontaneous Speech. London/New York: Cambridge University Press, 1968.
9. Künzel HJ. Sprechererkennung. Grundzüge forensischer Sprachverarbeitung. Heidelberg: Kriminalistik-Verlag, 1987.
10. Harris CM and Weiss MR. Pitch and formant shifts accompanying changes in speech power level. Journal of the Acoustical Society of America 1963; 35: 1876.
11. Hollien H and Shipp T. Speaking fundamental frequency and chronological age in adult males. Journal of Speech and Hearing Research 1972; 15: 155-159.
12. Künzel HJ, Braun A and Eysholdt U. Einfluss von Alkohol auf Sprache und Stimme. Heidelberg: Kriminalistik-Verlag, 1992.
13. Masthoff HR and Köster JP. Automatische und halbautomatische Erkennung von Sprechern an Hand ihrer Stimme. In: Nixdorff K, ed. Anwendungen der Akustik in der Wehrtechnik. Hamburg, 1992.