Journal of Phonetics (1984) 12, 85-89
Review: The phonetic bases of speaker recognition by F. J. Nolan†
Peter Ladefoged
Phonetics Laboratory, Linguistics Dept, UCLA, Los Angeles, California, U.S.A.
Received 7th October 1983
The principal aim of this book is to present a model of the sources of variation between speakers, and an account of experiments testing this model. It does not set out to be an account of the state of the art in speaker recognition. Indeed, it does not contain a complete review of the literature, the author correctly claiming that this can be found in Tosi (1979), Bricker & Pruzansky (1976), and elsewhere (although, of course, there have been important studies since these literature surveys, many of which are noted in Nolan's own extensive list of references). Nolan emphasizes that without a testable model of how speakers differ from one another, we are not going to be able to advance very far in the study of speaker recognition.

After a short introductory chapter, the second chapter provides us with such a model. Drawing largely on Lyons (1977), Crystal (1969, 1975), and Laver (1980), Nolan presents a description of the communicative intent of a speaker. In his view there is a segmental strand and a suprasegmental strand, both of which are accompanied by long-term (Laver's voice quality) strands. Communicative intent includes cognitive meaning (largely Lyons's factual or propositional information), affective meaning (Crystal's friendly, angry, etc. tone of voice), social intent (e.g. formal or informal style), self-presentational information (the personality or image the speaker wishes to convey), and interaction management (signals indicating the degree of willingness to be interrupted, controlling turn-taking in a conversation), all of which can be varied by the speaker. Speakers can differ from one another in all the aspects of communicative intent, as well as in other, non-intentional, ways.

Differences between speakers in the segmental strand are described in terms suggested by Wells (1982) for describing accents (and, although not noted by Nolan, by Abercrombie (1967)).
They include (a) differences in system (the number of phonemic oppositions); (b) phonotactics (Abercrombie's, and Firth's, structure: the permitted sequences of phonemes); (c) incidental differences (of the kind noted by Porter (1936): "you say [iðər] and I say [aiðər]"); and (d) realisational differences (Abercrombie's phonetic differences). Nolan stresses that the latter may include differences in coarticulation. The suprasegmental strand is described in traditional British terms, with the suggestion that between-speaker differences in this strand could also be described in terms of the same four factors as those used for describing differences on the segmental strand. But Nolan recognizes that there are many difficulties in doing this, in that the units of intonational meaning are not as discrete as phonemes.

When considering the long-term strands, Nolan makes an interesting three-way distinction between null values, which are the most natural for a given individual; default values, which a speaker has chosen to adopt and which have no communicative value other than signaling personal identity; and determined values "resulting from the mapping of some aspect of communicative intent". He points out that the vocal tract "does not determine particular acoustic characteristics of a person's speech, but merely the range within which variation in a particular parameter is constrained to take place". He suggests that "This clearly invalidates any belief in the existence of immutable cues to speaker identity in the speech signal".

† Published by Cambridge University Press, Cambridge, England.
0095-4470/84/010085+05 $03.00/0 © 1984 Academic Press Inc. (London) Ltd.

The third chapter describes a set of experiments that test part of this model of potential differences between speakers. Nolan chose 15 speakers (17-year-old males from the same school), all of whom presumably had the same system of phonemes and the same permitted phonotactic structures, but differed in their realization of these units. These speakers read a list of words that enabled Nolan to analyse /l/ and /r/ before different vowels. He found that speakers differed both in their mean formant values for these sounds and in the degree to which these sounds (especially /l/) were affected by coarticulation with the following vowel.

The presentation of these results is followed by an interesting general discussion on personal variation and language-specific aspects of coarticulation. It is, of course, not a new notion that languages differ in their degree of coarticulation. Jones (1972) commented on differences revealed by palatograms of English and French /t/s, and Ladefoged (1972), for example, pointed out that English "peak, pock" differ from French "Pique, Paques" in the degree of coarticulation of the velar stop with the preceding vowel. But Nolan's ideas go far beyond those expressed earlier on the language-specific nature of coarticulation.
He proposes a model in which each extrinsic allophone is considered to consist of a bundle of properties, each of which is specified in terms of three numbers, one indicating its physical value, the second its duration, and the third being an index of the degree of coarticulation permitted (an idea that can be traced back to Holmes et al., 1964).

The fourth (and longest) chapter describes a set of experiments that is not concerned with determining differences between speakers, but with the acoustic correlates of the long-term articulatory settings that characterize an individual's voice. Nolan emphasizes that we must have a thorough understanding of these matters before we can construct good systems for recognizing voices. I doubt this is true, just as I doubt that I have to understand the electronic engineering involved in building a computer before I can write programs. Speaker recognition is an auditory or acoustic procedure. Although I, like most phoneticians, find the study of articulatory-acoustic relations very interesting, there is no logical necessity for a speaker recognition system to be based on articulatory actions. Furthermore, a major weakness of this chapter is that it is based on a detailed analysis of a single speaker, supplemented by a partial analysis of one other speaker. Nolan himself comments on "the limitations of evidence based on a single performance by a single speaker". Moreover, the articulatory settings are described simply in impressionistic auditory terms, so there is no way in which anybody not familiar with these terms (i.e. the majority of phoneticians) can pursue this line of research. This section of the book does not lead us very far in our search for the phonetic basis of speaker recognition.

The major analysis in Chapter 4 is of a single speaker, John Laver, reading a 30-55 second passage using 31 different long-term settings, the "voice qualities" which he has himself described (Laver, 1980).
Nolan shows that the long-term spectra of these readings do not differ very much when the differences in setting are supra-laryngeal (e.g. with the superimposition of velarization, or lip rounding, or nasalization); but some laryngeal settings (creaky voice, breathy voice) have a noticeable effect on the long-term spectrum.
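
For readers unfamiliar with the measure, a long-term spectrum can be sketched roughly as follows. This is a minimal illustration only, not Nolan's procedure: it assumes a long-term spectrum is taken to be the average power spectrum over many short windowed frames, and the function names, frame length, and hop size are all hypothetical choices made for the example.

```python
import numpy as np

def long_term_average_spectrum(signal, sample_rate, frame_len=512, hop=256):
    """Average the power spectra of overlapping Hann-windowed frames.

    Returns (freqs_hz, level_db). Frame length and hop are illustrative
    assumptions, not values taken from the book under review.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one analysis frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    power_sum = np.zeros(frame_len // 2 + 1)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        power_sum += np.abs(np.fft.rfft(frame)) ** 2
    mean_power = power_sum / n_frames
    # Small constant avoids log(0) in empty frequency bins.
    level_db = 10.0 * np.log10(mean_power + 1e-12)
    freqs_hz = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return freqs_hz, level_db

def spectral_distance_db(level_a, level_b):
    """Root-mean-square difference (in dB) between two long-term spectra."""
    diff = np.asarray(level_a) - np.asarray(level_b)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Under these assumptions, two readings of the same passage with different settings could be compared by computing one spectrum per reading and passing both level arrays to `spectral_distance_db`; a small distance would correspond to spectra that "do not differ very much" in the sense used above.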

In a second set of analyses of the same data base, Nolan uses LPC analysis to determine the formant frequencies at 67 measurement points within the text. He suggests that these measurements can be used to characterize most supra-laryngeal long-term settings. But there are some discrepancies between the data for the major speaker, John Laver, and the supplemental data for which Nolan himself was the speaker. Considering that only two speakers were used, Nolan's claim that "the characterisations (of long term settings) were shown to be speaker independent" seems to be premature.

The fifth and final chapter is mainly a statement of Nolan's views on current speaker recognition practices as given in Tosi (1979) and Bolt et al. (1979). Many of his views are also apparent in other parts of the book. But before discussing these general issues, it is appropriate to consider some more minor matters.

The book is well produced, remarkably free from printer's errors. But at £20 (over $30) for something that could be photocopied for a little more than a tenth of that price, it is too expensive. There are some good explanations of some general phonetic notions, such as the relation between phonemes and both extrinsic allophones (i.e. phonologized allophones) and intrinsic allophones (i.e. low-level coarticulated allophones). The account of the relation between formants and cavities is also good, although a little too condensed for those not already familiar with the topic. But it must be noted that the references are sometimes oddly esoteric. Thus it is curious to find that an unpublished paper by Crompton (1981) is given the credit for the standard generative phonology view of phonetic representation. There is no doubt that Chomsky and Halle (1968) consider phonetic representation to refer to abstract properties with the same theoretical status as syntactic categories such as noun and verb.

At this point I would like to consider the more general issues.
Nolan is very much concerned with forensic issues. He considers a good speaker recognition system to be one that permits the identification of speakers in all possible circumstances. This is certainly a worthy ideal aim; but it is not one that is relevant in many forensic applications. For example, only very seldom does one have to have a system that cannot be foiled by mimicry or disguise. I have worked on several legal cases, having been called to give evidence on about 25 occasions, and having been consulted on well over three times as many matters. Only once has there been an attempt to sustain a claim that the voice in question was not that of the accused but of someone who was deliberately mimicking him. In by far the majority of cases this is not a plausible defence. Criminals are seldom capable of mimicking other people who might be suspected of their crimes.

Nor is the possibility of a voice being disguised a common occurrence in legal proceedings. Fortunately for the law-abiding, most criminals are rather stupid and do not attempt to disguise their voices. The clever ones who do may well go free, because their voices cannot be matched. As Nolan notes, we probably cannot reliably detect whether a voice is disguised or not. But this has little bearing on the reliability of identifications that are made. I have known cases in which the prosecution maintained that an unknown voice on a recording was that of the accused, but did not sound like him because he was trying to disguise his voice. But evidence of this kind is rarely brought into the proceedings, because it obviously has so little value. Usually when the prosecution bring a recording into evidence, it is because they believe that the voice matches that of the accused. And in these circumstances the possibility of disguise has no meaning from a defence point of view. Mimicry and disguise are seldom relevant issues in real-life legal cases.

Similarly, many sources of within-speaker variation are not very important in legal situations. Nolan makes a major point of the effect of style differences, correctly pointing out that a person recorded while committing a crime may (my emphasis) talk in a different style when making an official recording in a police station. But if that happens it will probably be impossible to establish that the two recordings were made by the same person. A guilty person may go free, but no innocent person will suffer. It would be nice if we could develop a system of voice recognition that would be impervious to changes in style. But if we cannot, there are still scores of situations in which there are no style changes involved. It is just not true to say, "Before speech can ever be used positively to identify an individual ... evidence will need to be adduced to show that the parameters on which the identification is founded are immune to the effects of voice quality choices made by the speaker in mapping communicative intent". Such evidence is necessary before we can have a system that will correctly declare that two samples were not produced by the same speaker. But when two samples do match one another to a very great extent we can declare that they might well have been produced by the same speaker.

Nolan raises the possibility that adopting a disguise or style shifting may make two dissimilar voices more alike. He quotes Labov (1972), "it may therefore be difficult to interpret any signal by itself - to distinguish, for example, a casual salesman from a careful pipefitter", and says that this may lead to a false identification. But as Nolan himself emphasizes, there are a very large number of factors involved in the production of speech. It could happen, but it is statistically very unlikely, that a criminal's altered voice will become in all ways the same as that of somebody else who is suspected of the crime. Style shifting is likely to allow a guilty person to remain unidentified; but it is highly unlikely to be the cause of an innocent person being wrongly matched with a voice recorded during the commission of a crime.
However, it is obviously true that there are chance likenesses between voices; and also that the voices of members of the same family or close community can be very similar to one another. I doubt that there will ever be any absolute certainty in voice recognition. But it, like visual recognition, handwriting analysis, and a host of other procedures, can still have evidential value.

It seems to me that we must be very careful not to make sweeping generalizations about the applicability of acoustic phonetic techniques to voice recognition in legal cases. It is not simply a matter of deciding that they should or should not be admissible in courts of law. Each case must be considered on its merits. We will probably never be able to say anything about a poorly recorded, ten-second bomb threat, made over a low-quality telephone in a noisy bar. But I am sure that virtually all phoneticians would agree that acoustic phonetic analyses have some evidential value when comparing two ten-minute recordings made under studio conditions. I have been consulted more than once on cases of both these kinds of recordings. I have always refused to say anything about short noisy recordings (except that, in my opinion, no one could identify the voice "beyond a reasonable doubt"). But in the case of lengthy, high-quality recordings I think that I can be helpful. Most legal questions lie somewhere between these two extremes. Where we draw the line as to what would be admissible is still in doubt. But there is no doubt that Nolan's book is a valuable discussion of these issues, even if I disagree on some points.

References

Abercrombie, D. (1967). Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Bolt, R. H., Cooper, F. S., Green, D. M., Hamlet, S. L., McKnight, J. G., Pickett, J. M., Tosi, O. I. & Underwood, B. D. (1979). On the Theory and Practice of Voice Identification. Washington: National Academy of Sciences.
Bricker, P. D. & Pruzansky, S. (1966). Effects of stimulus content and duration on talker identification. Journal of the Acoustical Society of America 40, 1441-1449.
Chomsky, N. & Halle, M. (1968). The Sound Pattern of English. New York: Harper and Row.
Crompton, A. (1981). Phonetic Representation. Unpublished paper. University of Nottingham.
Crystal, D. (1969). Prosodic Systems and Intonation in English. Cambridge: Cambridge University Press.
Crystal, D. (1975). The English Tone of Voice. London: Edward Arnold.
Holmes, J. N., Mattingly, I. G. & Shearme, J. N. (1964). Speech synthesis by rule. Language and Speech 7, 127.
Jones, D. (1972). An Outline of English Phonetics. Cambridge: Heffer.
Labov, W. (1972). Sociolinguistic Patterns. Philadelphia: University of Pennsylvania Press.
Ladefoged, P. (1972). Phonetic prerequisites for a distinctive feature theory. Papers in Linguistics and Phonetics in Memory of Pierre Delattre. pp. 273-286. (Valdman, A. ed.) The Hague: Mouton.
Laver, J. (1980). The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.
Lyons, J. (1977). Semantics. Cambridge: Cambridge University Press.
Porter, C. (1936). Anything Goes: A Musical Comedy. New York: Harms.
Tosi, O. I. (1979). Voice Identification: Theory and Legal Applications. Baltimore: University Park Press.
Wells, J. (1982). Accents of English: An Introduction. (3 Vols). Cambridge: Cambridge University Press.