Accepted Manuscript

Title: The Case for Aural Perceptual Speaker Identification
Authors: Harry Hollien PhD, Grace Didla PhD, James D. Harnsberger PhD, Keith A. Hollien BA
PII: S0379-0738(16)30337-1
DOI: http://dx.doi.org/10.1016/j.forsciint.2016.08.007
Reference: FSI 8560
To appear in: Forensic Science International
Received date: 19-8-2015
Revised date: 3-8-2016
Accepted date: 3-8-2016
Manuscript No. FSI-D-15-00812
The Case for Aural Perceptual Speaker Identification

Harry Hollien PhD¹, Grace Didla PhD², James D. Harnsberger PhD³, and Keith A. Hollien BA⁴

¹ Professor Emeritus of Linguistics; Founding Director, The Institute for Advanced Study of the Communication Processes, University of Florida, Gainesville, FL; 352-378-9771; [email protected]

² Assistant Professor, English and Foreign Languages, Hyderabad 500 605, India; Post-Doctoral Fellow, Department of Linguistics, University of Florida, Gainesville, FL; [email protected]

³ Associate Dean, College of Forensic Sciences, New Haven University, New Haven, CT; 203-848-7741; [email protected]

⁴ Associate, Forensic Communication Associates, Box 12323, University Station, Gainesville, FL 32601; 352-332-3754; [email protected]

Address correspondence to: Dr. Harry Hollien, 229 SW 43rd Terrace, Gainesville, FL 32607; 352-378-9771; [email protected]
The first author's SI research was supported primarily by grants from the U.S. Justice Dept. (82-IJ-CX-0034), the U.S. Army Research Office (DAA6-81-D-0100), the Dreyfus Foundation (parts of DRS78214JC-1 and 62197PB-4) and the University of Florida.
Abstract
It was expected that valid computer-based speaker identification (SI) systems would be developed well before the start of the 21st century. This has not happened. One reason for the lack of progress was that the complexity of the problem was seriously underestimated. Another was that appropriate protocol was not followed during system development. Appropriate standards are now available and will be reviewed. Nevertheless, there is a critical, and growing, demand for effective SI systems. This situation leads to a need for some sort of stop-gap procedure that would fill the temporary void. Fortunately, a large amount of SI research has been carried out during the cited period. Its results (which will be summarized) lead to the postulation that aural perceptual (AP SI) systems, although somewhat subjective, could be used to meet this requirement. A number of such systems have been developed (an example will be provided) and relevant research supports the position that, if rigorously controlled and cautiously interpreted, they can provide useful SI information. It also is recommended that, if an appropriate research model is followed, the speech/voice analysis approach should be seriously considered as a platform for the creation of digital speaker identification systems.
Key words: Forensic Science, Speaker Identification, Speaker Recognition, Speech Analysis, Aural Perceptual Analysis, Forensic Phonetics, Expert Witnesses
Introduction
The first concept to be considered in this presentation is the controlling one: Aural Perceptual Speaker Identification (AP SI) is a stop-gap approach to speaker identification (SI). Moreover, it always has been. Indeed, 50 years ago when the initial (organized) attempts were made to identify individuals by speech/voice analysis, this dictum was accepted on a nearly universal basis. However, during that period, two other SI approaches also were being proffered. First, the so-called “voiceprint” technique appeared and was being touted by its practitioners. By the time it was shown to lack validity, the SI field was somewhat in disarray.
The second approach to emerge was a far more creditable one. A number of scientists and/or engineers attempted to develop electronic SI processors -- and some of their procedures appeared to have merit. Unfortunately, however, few enjoyed reasonably robust research programs. Hence, nearly all of them suffered from inadequate development and/or were abandoned.
On the other hand, it must be stressed that, during this same period, a strong -- very strong -- demand existed for a valid, working SI system. Even so, the agencies which funded such programs tended to become uninterested in efforts which did not (or could not) quickly provide "deliverables". Also, the splitting of the general area of speaker recognition into two parts -- i.e., into speaker verification (SV) and speaker identification (SI) -- complicated the problems being faced. As is well known, speaker verification is a "closed set" task: 1) a talker wishes to be correctly identified (example: to gain access to a bank account) or 2) his or her identity needs to be validated (example: which astronaut was speaking during a flight?). Only the most powerful procedures and equipment available are employed for SV and the examiner is permitted to establish an extended library of referent speech samples to compare with the questioned utterance(s). On the other hand, speaker identification (SI) is an "open-set" task; it involves an uncooperative, unknown speaker residing within a population of unknown size and composition, and the examiner must analyze speech under conditions which often range from poor to minimally adequate. As is obvious, a more sophisticated approach is needed for successful SI than for SV. Actually, if an effective SI procedure could be developed, the SV problem would be resolved also. This situation also served to shift engineering talent away from SI; these professionals tended to be more interested in the highly controlled SV area -- plus its generally more attractive commercial outcome. Moreover, it was broadly assumed that, after all, these problems would be satisfactorily solved within "but a few years". Unfortunately, however, this has not happened.
Even with all the cited diversions, the development of SI systems continued; however, it did so in a somewhat chaotic manner. For one thing, those individuals working in the area did not seem to be fully aware that the challenge being confronted was as complex as it proved to be. Even the behaviorists (i.e., mostly phoneticians) seemed not to be disturbed by the hundreds (perhaps thousands) of human, situational and environmental variables they faced. Undoubtedly they would have been startled had they been made aware of the exchange that took place between U.S. Senator Daniel Patrick Moynihan and an eminent Harvard chemist [1]. When elected to Congress, the Senator had vigorously continued his efforts (originated at the U.S. Labor Dept.) to reverse the negative effects that the expanding number of homes headed by a single parent was having on various communities. Moynihan indicated to the chemist that it was difficult not to become confused by all the variables he encountered. The professor agreed with him that solutions become most obstinate when the variables present interact with each other. When the senator asked the scientist, "At what number of variables do such problems begin?", the chemist replied, "Three."
Therein lies the problem faced when attempts are made to identify individuals by analyzing their speech and voice. Variables! Many, many variables. Yet, even today, this problem is not fully understood. On the other hand, it does explain why -- even after 50 years of serious attempts at SI system development -- the aural perceptual speaker identification approach (AP SI) is still needed as an SI stop-gap method. While this relationship appears clear-cut, it still seems to require further justification. Hence, the review to follow.
History
As the first author of this paper has previously pointed out [2, 3], the SI area has had a long and uneven history. It all started with efforts of this type being made (by humans) well before recorded history. Of course, written references about the subject could not be found until ancient records surfaced [4]. There are more modern references also, but they tend to involve only the courts and legal issues. In such instances, it is asked if a witness knew a speaker well enough to make an identification in court [5] or if testimony based on some early form of earwitness "lineup" should be admitted at a trial [6].
It was not until the 1930s and 1940s that a shift occurred. In one such instance, scientists with engineering backgrounds, working at Bell Telephone Laboratories, reported that they had developed a “visual speech” device. Basically, what they had invented was an operating “time-frequency-amplitude” spectrometer, i.e., the Sonagraph [7]. It proved to be an important breakthrough as it permitted graphic presentations to be made of complex sounds. Displays of that type are still being used to provide (at least, crude) visual representations of many types of acoustic events. Also, it was during that period that engineers were able to sharply upgrade electronic recording systems and measurement equipment.
In addition, behaviorists, such as phoneticians and psychologists, started to make AP SI type contributions. An example: A psychologist [8] became interested in the earwitness testimony made by aviator Charles Lindbergh at the trial of the man who had kidnapped and murdered his child. In an attempt to assess the accuracy of this testimony, she developed an early form of AP SI. It will be reviewed in a subsequent section.
A little later, Max Steer (a phonetician) was tasked to determine if Adolf Hitler had survived the attempt on his life which occurred at the Wolf's Lair on July 20, 1944. Steer, who was Director of the Navy's Acoustic Lab at NAS Pensacola FL (on leave from Purdue), assembled a team to examine the utterances of the person who was identified as Hitler after that occurrence. Was he the same person as the one who had made appearances and speeches before the assassination attempt? The team employed several AP SI approaches plus just about every other analytical procedure available at the time (personal communication, July 1976). As it turns out, they were successful [9]. Hitler had survived.
Attempts at SI Development
The period 1970-2000 was not much more orderly than the earlier ones. A number of investigators attempted to address the issue but adequate coordination was relatively sparse. Further, more research was being conducted in the SV area than in SI. It helped but not materially. Halfway through that period, the first author of this presentation attempted to address these problems by developing a semi-automatic (computer supported) SI system [2, 10]. It was based on the then available corpus of SI research. However, to do so, a model for system development -- and its interface with criminal justice -- had to be structured. Perhaps the greatest difficulty in doing so was that little-to-no appreciation existed here for the primary operational requirement for such
programs -- one that is almost absolute. It is that a sustained research program, designed to validate any evolving SI system, is virtually mandatory. Indeed, the standards for such a model specify that appropriate studies must be carried out. Further, adequate SI system development simply is not possible unless a majority of the relevant variables are identified and either removed, mitigated or counterbalanced. In short, development of a valid SI system must be modeled by a well-structured experimental program.
A Model
The first step in model development was to provide its framework. Such an effort (based, of course, on the scientific method) is partially illustrated by flow charts of the type found in Figure 1. It is contended that any and all approaches to SI development should be guided by this, or a similar, model.
An important element specified by the model is that the research approach selected should first generate some sort of analytical vectors (i.e., clusters of related parameters). These vectors, in turn, should be tested both individually and in concert with each other. The program should be continued yet further by replicating this process when the vectors have been distorted by various external variables -- that is, by inter- and intra-speaker variations, system distortions, varying environmental conditions and so on. Such experiments should be replicated yet again, using simulated crimes, interrogations and "real world" cases. Important standards here are that the research should be continued until 1) the system's basic validity is established, 2) its reliability can be demonstrated and 3) its vulnerabilities documented.

As stated, a vector-based approach was employed to develop the cited semi-automatic system. In turn, its vectors were created from the speaker-specific elements found in voice. To illustrate: one of the first vectors created was speaking fundamental frequency (SFF) analysis (seen second from the left in Fig. 1). This SFF vector consisted of 1) the geometric mean of the Fo spoken in connected speech, 2) its variability, 3) its patterns and 4) the distribution of all individual frequencies organized into semitone "bins" (a minimal computational sketch of such a vector is given below). Development of other vectors, such as vowel formant tracking (VFT), etc., then followed.

The validity of this model has been substantiated multiple times [10]. However, even though some centers (involved in forensic investigations) adopted the approach, the model itself was not received with much enthusiasm [11]. Indeed, in some quarters, it was argued that the extensive research specified by the model was unnecessarily wasteful. This controversy will be addressed in the next section.
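To make the SFF vector concrete, the following Python sketch (not the implementation reported in [10], simply an illustration) shows how its four components might be assembled, assuming a fundamental-frequency (Fo) track in Hz is already available from any pitch extractor. The function name, the reference frequency and the bin range are assumptions introduced for the example.

```python
import numpy as np

def sff_vector(f0_hz, ref_hz=100.0, n_bins=24):
    """Illustrative speaking fundamental frequency (SFF) vector.

    f0_hz: array of voiced-frame Fo estimates (Hz) from any pitch tracker.
    ref_hz and n_bins are arbitrary choices made for this sketch.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                                   # keep voiced frames only

    # 1) geometric mean of the Fo used in connected speech
    geo_mean_hz = float(np.exp(np.mean(np.log(f0))))

    # 2) variability, expressed in semitones so it is register independent
    semitones = 12.0 * np.log2(f0 / ref_hz)
    variability_st = float(np.std(semitones))

    # 3) a crude "pattern" measure: mean frame-to-frame Fo movement
    pattern_st = float(np.mean(np.diff(semitones))) if f0.size > 1 else 0.0

    # 4) distribution of the individual frequencies in semitone "bins"
    counts, _ = np.histogram(semitones, bins=n_bins, range=(-12.0, 36.0))
    distribution = counts / counts.sum()

    return {"geo_mean_hz": geo_mean_hz,
            "variability_st": variability_st,
            "pattern_st": pattern_st,
            "semitone_distribution": distribution}
```

A vector of this kind, computed separately for the unknown (U) and known (K) samples, could then be compared term by term, which is the general strategy the text describes for vector-based SI.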
As for the semi-automatic system itself: it has enjoyed modest success but has not yet been finalized [10-12]. On the plus side, it has assisted in the establishment of platforms useful in the developmental support of a number of SI approaches.
Further Justification of the Model
Even though the validity of the model appears virtually self-evident, it has not yet been fully accepted. Accordingly, further support would appear necessary. First, the U.S. National Academy of Sciences' National Research Council committee on the needs of the forensic sciences (and how to improve them) recently published its report [13]. It stresses the necessity of providing a thorough vetting of any forensic procedure before it is put to use and, especially, before the results it provides are introduced in the courtroom. The report specifies that, to achieve "the goals of justice", the validity of any process/system should be based on "sound operational principles and procedures, plus serious research, [designed] to establish [its] limits and measures of performance." Indeed, the report stresses that standards are required that are based on a "body of research and measures of performance."
The cited committee's report further justified its position by stressing the Daubert test [14] and the evidentiary standards it provides, i.e., those that must be met before the results of any procedure are accepted. "Daubert" requires that 1) "a theory or technique [should] be (or has been) tested", 2) "the theory or technique [should have] been subjected to peer review and publication", 3) "the potential rate of error of a particular scientific technique [should be] known", 4) "the existence and maintenance of standards controlling the technique's operation" [should have been] established and 5) the scientific technique's "degree of acceptance within a relevant scientific community" should be available.

Clearly, a serious and extended research program is needed to meet the National Academy of Sciences and Daubert standards -- as well as those previously articulated by our model. It is simply not enough to build a device (based on a developed algorithm) and put it to use. Nor should a set of AP SI procedures be structured and immediately applied. It now should be clear that very few SI programs have even minimally followed the standards specified above. This also explains why a stop-gap procedure, such as AP SI, is presently needed.
Aural Perceptual Speaker Identification
Stop-gap or not, aural perceptual speaker identification systems are, at least, minimally viable and probably will be needed for some time. Accordingly, the bases underlying this approach will be provided first; then specific areas of research further supporting it will be reviewed. These discussions will be followed, in turn, by an example of a relatively successful AP SI system. Finally, it should be reiterated that portions of the information reviewed in the following sections are drawn from the first author’s prior publications [3, 10].
The Human Auditory System Potential
First, it must be asked if a speaker actually can be identified by auditory assessment of his or her motor speech. Specifically, is the human auditory system, and its processing abilities, sensitive enough to make the determinations required for adequate speaker identification purposes? Can it resist the effects of unfavorable environments and conditions? Actually, a response in the affirmative can be argued.

Psychoacousticians and neurophysiologists have extensively studied how the hearing system operates (i.e., from the external auditory meatus all the way to the cortex). While many of the early experiments are of substantial importance, of course, some of the more recent reports turn out to be particularly relevant to SI. Indeed, many studies demonstrate that humans exhibit a substantial number of absolutely remarkable -- and relevant -- auditory processing abilities [15]. Investigators also describe how speech embedded in noise can be decoded [16], the speed (only 10-12 ms) with which complex sounds can be processed [17], the enhancement of speech processing provided by training and/or experience (a very important area) [17, 18], how emotions can be identified, how language [18], including tonal language and speech, can be decoded [19, 20], and so on. Other research in this area supports the validity of these findings. In short,
there is support for the hypothesis that the auditory system is capable of the very rapid and accurate processing of those complex signals required for robust SI.
Research on Related Behaviors
Before embarking on an overview of those characteristics which permit a speaker's identity to be established, a brief review of the general domain within which SI is located would appear useful [for additional reviews, see: 3, 10]. It is an area of research that, to date, has been little appreciated. Specifically, it involves the detection of the many human behaviors which can be identified by analysis of a speaker's utterances. Included here are conditions/behaviors such as age, intoxication, emotions, deception, fatigue, language, dialect, speech intelligibility, voice disorders and so on. Research on such relationships permits the investigator to gather information about: 1) the nature of the cited behaviors, 2) the techniques which permit them to be detected and decoded, 3) the upgrading of the quality of relevant experimental protocol and 4) the identification of new and more sophisticated analysis systems/devices. In turn, data from these studies often directly support the processing techniques for, and/or understanding of, speaker identification.
One example consists of investigations designed to discover which vocal features signal the age of a speaker. Obviously, if acoustic and/or temporal patterns associated with a person's age can be detected from his or her utterances, they also should assist in the identification of that specific person. A number of studies about the speaking correlates of age have been published or reviewed [21, 22]. While data on infants and children are rarely relevant, information about young adults and adults [23, 24], as well as the aged [25-27], can be useful. That is, parameters such as age-related 1) speaking rates, 2) dialects, 3) fundamental frequency usage, and/or 4) vowel idiosyncrasies sometimes will provide an assist in identifying a particular speaker.

Another example involves detection of psychological stress. Again, a rather substantial research literature exists on this subject [2, 28]. Sub-areas are stress 1) induced in the forensic milieu [29-31], 2) resulting from external stimuli [32, 33] or 3) created by serious threats of danger [34-36]. It has been found that a number of analytical techniques employed in these studies also can be adapted to identify SI cues. Moreover, when stress-related changes are identified in the speaking behaviors of a suspect, SI comparisons will be upgraded when stress is also present in the (unknown) voice found on an evidence recording.
A final example. In some cases, the research being conducted can be useful either for its primary purpose or as a supplement to SI evaluations. One such bridge is illustrated by studies carried out by Huntley Bahr and her associates [37]. They recognized that it is important to understand how and why people intentionally vary their speech production. While no one produces speech in the same manner at all times, in some instances they “purposefully” modify speech. These intentional shifts often can be used to identify a specific individual. In any event, this group studied patterns found in social dialects, code switching and language structure for SI purposes [38, 39].
In short, detection of the speaker-specific speech elements in a number of related areas provides data and techniques useful for AP SI; these also assist in structuring the foundation for other SI approaches.
Earwitness Research
Earwitness analyses result when an individual -- who was either a victim of, or witness to, a crime -- heard the perpetrator but did not see him or her. The task, then, is for that witness to make judgments about the perpetrator's identity by listening to voice samples obtained from a suspect (who may or may not actually be the criminal). To do so, procedures must be employed which can enhance the possibility of a witness accurately making an identification or elimination. A review of this area is available [40]; it also will be summarized below.

Earwitness lineups (or voice parades), and the capabilities of earwitnesses, cannot be studied as rigorously as laboratory (AP SI) experiments can. Nevertheless, investigations of this kind can provide additional perspective on the ability of humans to make SI judgments. Of course, some controversy [41-43] has accompanied efforts in this area and, hence, the remarks to follow should be carefully considered. Caution is required because the procedure ordinarily involves the lay public; in addition, confrontation with the unknown speaker (or his voice) is usually stressful.

Moreover, the overall process here resembles eyewitness lineups. Both require selection of an individual from a group and both involve use of sensory modalities and memory [44, 45]. However, it is clear that the two procedures actually are different. They vary as to how: 1) auditory and visual memories are processed, 2) eyes and ears are structured and 3) fear, anger or arousal differently affect the two processes. Also, poor eyesight differs from hearing disabilities, and native abilities make certain people good at visual memory and others better listeners. In short, eyewitness and earwitness identifications simply do not mirror
each other [40, 46]. Thus, to assess the AP SI hypothesis, it is important to focus only on the fundamental aspects of the auditory speaker recognition process.

Just as with other SI issues, some of the elements in earwitness processing degrade accuracy while others enhance it. One such difficulty involves the latency between confrontation and the earwitness lineup, as degradation will ordinarily occur within a period of only a few weeks. It does so even though some auditors are capable of performing well even after greater latencies [47-49]. Training and memory also are factors. Indeed, there are data which suggest that many witnesses have inherent capabilities which permit them to effectively handle the earwitness task. For example, DeJong [44] studied the effects of memory, intelligence, auditory capability and musical skills on earwitness accuracy. Although she reported that intelligence was a superior predictor of a subject's ability to identify speakers, basic auditory and memory skills also contributed. Moreover, she found that listeners who exhibited a high degree of musical aptitude (or training) also performed in a superior fashion. Her findings expand those reported in a previous section [16-18]. Chronological age also proved to be a factor, as individuals tend to be more accurate identifying people within their own age range.

Finally, researchers at the University of Florida have found that many untrained individuals are capable of making surprisingly accurate identifications even under severely challenging conditions [46] and that they tend to be more accurate when stressed or aroused [50]. Without question, certain of the factors reported in this section operate to augment the understanding of auditory SI accuracy.
Speech Analysis for SI
The information provided in the following three areas is focused directly on speaker-specific -- or at least, speaker-related -- speech and/or vocal features.
Early Research
As previously indicated, much of the earliest SI research tended to involve lay-listener responses to speech samples. Investigators employed either 1) individuals who knew the speakers or 2) subjects (listeners) who were familiarized with the speaker's utterances by some sort of training. For example, the previously cited McGehee research [8] was of the latter type. She presented listeners, who had received some training, with live utterances of sentence-length material, spoken by the target subjects.
The uttered speech was embedded in sets of similar samples provided by distracters and the listener had to identify the target speaker within each set. She found the identification rates to be quite high (over 80% in most cases) when the trials were conducted immediately after training. However, these levels decayed rapidly over time. Even though this research was somewhat primitive, the data have been essentially validated by others -- both for correct level [3, 10, 51] and for speed of decay [44, 49].
Later, researchers [51] obtained even higher correct identification rates (i.e., of 98%) for sentence length stimuli when they employed voices familiar to the auditors. While their response levels for very short samples were lower (mean = 56%), both data sets were statistically significant. It should be noted also that, as early as 1954 [52], data became available which suggested that modest identification accuracy could be attained for even extremely short stimuli. In that case, the investigators demonstrated that scores would systematically rise for samples up to about 1200ms in length. Further increases resulted not from greater duration but from a "more extensive phonemic repertoire". Please note that these findings also are consistent with those reported in the prior section on basic auditory potential.
Subsequent to that time, many similar studies have been carried out and reported. For example, different speakers, auditors, utterances, dialects, recording fidelity, environments, and so on have been studied [2, 10]. These investigators also confirmed that, overall, humans demonstrate rather robust SI abilities. Hence, it appears very likely that 1) identification cues embedded in speech do exist and 2) they can be detected.

On the other hand, it also has been observed that yet other factors can operate to enhance or degrade accuracy. On the plus side are: 1) voices which are distinctive or idiosyncratic, 2) auditors familiar with the speaker, 3) reinforcements, 4) good acoustics, 5) high fidelity, 6) long samples, 7) training to task, 8) listeners experienced with the SI process, 9) professionally trained auditors [10, 53] and so on. Conversely, accuracy (by the lay public anyway) can be degraded if some of the above conditions are not extant or if variables such as increased signal complexity, noise or distortions exist -- or if the target population contains "sound-alikes" [54] and/or very large numbers of talkers [55].

In any event, much of the obtained data support the hypothesis that humans can make remarkably good identifications under many different conditions and that their skills can be augmented by training and/or experience. While none of the individual procedures reviewed resulted in extremely low error rates in "real-life" situations, they support our position that a "speech vector" approach to SI is a realistic one.
Research Focused on Speech Features
But what about the speech/voice signal itself? Since it is hypothesized that speaker-specific cues are embedded within it, this section will shift focus to the analysis of the signal itself. One area of study here is that of speech suprasegmentals; included here are: SFF, or speaking fundamental frequency [56-58], voice quality (often determined by long-term spectra) [59, 60], temporal aspects of speech [61, 62], whispering [63] and so on. Identification research also has been carried out on the speech segmentals; that is, on vowels [64-66] and their formants as well as diphthongs [67], nasality [68, 69], fricatives [70, 71], and related segmentals [10]. Many of these results (plus subsequent data) confirm and/or expand Stevens' [72] very early review wherein he lists many of the segmental and suprasegmental elements of speech that can be employed in SI.

In any event, the importance of such research on speech features should not be underestimated. For example, many different (internally related) clusters of these parameters can be most useful in building speech vectors [10, 73]. An especially effective approach is to combine multiparameter vectors with other vectors to generate speaker profiles [2, 3, 10].
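As a simple illustration of how several parameter vectors might be combined into a speaker profile and then compared across the unknown (U) and known (K) samples, consider the sketch below. The feature names, the invented numbers and the scale-normalized Euclidean distance are assumptions made for this example only; they are not the specific procedures used in the studies cited above.

```python
import numpy as np

def speaker_profile(vectors):
    """Concatenate several parameter vectors (hypothetical names below, e.g.
    SFF terms, long-term spectral levels, temporal features) into one profile;
    keys are sorted so the U and K profiles line up feature by feature."""
    return np.concatenate([np.ravel(np.asarray(vectors[k], dtype=float))
                           for k in sorted(vectors)])

def profile_distance(u, k, scale):
    """Scale-normalized Euclidean distance between U and K profiles; `scale`
    holds per-feature spreads (e.g. reference-population standard deviations)."""
    diff = (u - k) / scale
    return float(np.sqrt(np.mean(diff ** 2)))

# Illustrative use with invented values (order after sorting: ltas, sff, tempo).
u = speaker_profile({"sff": [118.0, 2.1], "ltas": [0.30, 0.20, 0.10], "tempo": [4.7]})
k = speaker_profile({"sff": [121.0, 2.4], "ltas": [0.28, 0.22, 0.09], "tempo": [4.5]})
scale = np.array([0.05, 0.05, 0.05, 10.0, 0.5, 0.4])  # spreads, same order as above
print(profile_distance(u, k, scale))                  # small value -> similar profiles
```

In an actual system the choice of features, their weighting and the reference population would matter a great deal; the point here is only that clusters of speech features can be pooled and compared numerically.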
Forensic Related Experiments
The content will now shift to research directly related to forensics, intelligence and the like. First, it is important to understand that most of the experiments here have employed subjects who have had but brief encounters with limited stimuli. Consider how difficult it would be for a listener to identify a person -- who was "known" to him or her only from a relatively brief exposure to their speech -- when samples were presented in noise or embedded among utterances produced by other individuals. Many of the experimental tasks upon which research of this type has been based are at least this challenging to the auditor; few are markedly easier. Yet a number of experiments of this type -- including the three to follow -- have provided support for the hypothesis that speech-based analysis provides a robust platform for SI systems of all types and does so even within the sharp limitations of the forensic model.

In the first example, Shirt [74] obtained speech samples from a large number of talkers (suspected perpetrators) provided by the British Home Office. After some organized training, her listeners were requested to identify specific subjects from speech samples randomly embedded in sets of similar ones produced by "foil" talkers. Listeners were: 1) phoneticians (not forensic phoneticians,
however) and 2) university students. Although some students did as well as the phoneticians, the success of the professional group -- as based on the means for correct identification -- was somewhat superior. What was remarkable was that both groups did reasonably well even though their training was limited and the stimuli relatively brief. Second, Koester and his associates [53, 75] carried out similar studies but with known voices. Both the student listeners and the phoneticians did quite well in identifying the speakers. Remarkably, it is reported that not one phonetician made as much as a single error.
One experiment [76] in particular can be said to provide pivotal information in this area. Three groups of listeners were studied: 1) individuals who knew the talkers, 2) those who did not know them but who received 1-2 hours of training in recognizing their voices and 3) a group who knew neither the speakers nor the language spoken (but who also were briefly trained). The talkers were 10 adult males who produced a number of phrase-length samples under the conditions of: 1) normal speech, 2) speech uttered during (electric) shock-induced stress and 3) disguised speech. The listeners heard a recording of 60 samples of the 10 subjects presented randomly (two different samples of each of the three speaking conditions) and had to identify the person who uttered each one.

The results will be best understood by consideration of Figure 2. As can be seen, the accuracy of the listeners who knew the speakers approached 100% correct identification for both normal and stressed speech; further, they could often tell (80% accuracy) who the talker was even when free disguise was permitted. The university students did not do as well, of course, but their native processing abilities, and the limited training they had received, allowed them to perform at levels ranging from double chance to four times better than chance, depending on the speaking condition. Even a general population of native speakers of Polish (generally unfamiliar with English) exceeded chance -- and for all conditions. These data can be considered of yet greater import when it is remembered that the presentation procedure was an extremely challenging one. That is, all 60 samples were presented in a single trial and listeners had to rapidly identify the heard talker, and check him off on a form, before the next sample in the sequence was automatically presented.
One More Possibility
The remarkable (human) auditory processing abilities described in the several preceding sections would not have surprised Fisher and Thompson [77]. Indeed, they may even have provided an answer (at least a partial one) about how humans are able to make the judgments that they do. Specifically, these authors argue that humans can analyze stimuli both consciously and subconsciously. That is, they contend that heard events/discourse also are processed at a "below" conscious level and this capacity enhances a person's ability to detect behaviors of many types. They refer to such processing as "intuition." Of course, these concepts are most difficult to understand -- much less to organize and/or research. Yet, some investigators already have carried out this type of study [for an example, see 78] and relevant data also have emerged as a by-product of related research.

That is, Horvath et al. [79] report data obtained from a large multifaceted project that inadvertently demonstrated that such a subconscious capacity appears to actually exist. They had a group of experienced investigators listen to the recorded speech of suspects being interrogated (by other examiners) and then indicate when the suspect was lying and when he or she was telling the truth. Even though none of the details of these cases were known to the auditors, the accuracy of their decisions was substantially above chance. While just how they reached those performance levels is difficult to explain, evidence now exists that humans may possess an "accessory auditory system." The hypothesis here is based on the presence of otoliths (i.e., small stones of the type also used for hearing by many other species) which reside in our vestibular system [80]. In any event, it now appears clear that the area of subconscious processing (of heard signals) is one that warrants further research.
Developing AP SI
The mass of data provided by the several prior sections clearly demonstrates that: 1) humans can function effectively in identifying individuals from their speech and 2) there are elements within speech and voice which can be detected and extracted for identification purposes. Specifically, as the various relationships reviewed began to unfold, a number of phoneticians commenced to organize them into structured tests. While this activity could be found occurring at numerous sites around the world, groups in four regions were particularly active: several each in Western Europe [41, 48, 75], Great Britain [81, 82], Australia [67, 83] and the United States [10, 27, 37]. These phoneticians -- plus their colleagues and
students -- developed a number of AP SI approaches that proved useful in actual practice.
The individuals who did so theorized that their task would be to identify appropriate motor speech elements, organize them into a procedure and apply it. They also had observed that, in criminal investigations especially, the small field of suspects sharply reduced task magnitude and thereby often assisted them in making reasonably accurate judgments as to whether the unknown talker was the same person as the known speaker (i.e., U = K) or if they were two different people (i.e., U ≠ K). Indeed, these situations operated to reduce both the effect of external variables and the complexity of the task. So too did the fact that a forensic phonetician could devote as much time, and employ as many techniques (and/or systems), as judged necessary before attempting a decision.
An Example of AP SI
It must be conceded that the AP SI system to be described violates a number of the standards established in the earlier sections of this presentation [2, 13, 14]. Moreover, it is not enough to argue (as a defense) that its development was initiated well before the cited standards were published and/or that its protocol continues to be upgraded. On the other hand, two defensible explanations exist; they provide material support for AP SI use.

First, the size, nature and complexity of any experiment on AP SI are so great that it would be almost impossible to successfully design and carry out. The huge subject populations necessary would be virtually impossible to assemble. Nor could investigators be expected to recruit acceptable groups of forensic phoneticians (as examiners), or locate the necessary numbers of speakers' samples drawn from actual cases. Additionally, it would not be expected that studies of small groups of subjects and/or examiners, placed in artificial situations, would yield very much useful data. Most difficult of all would be experiments designed to evaluate the effects on AP SI resulting from the hundreds of behavioral, situational and environmental variables extant. Of course, small studies testing the competency of forensic phoneticians have already been carried out; so, occasionally, has their performance as subjects in SI research been examined [10]. Finally, it also should be noted that there are cases where specific forensic phoneticians have made successful judgments in court. Admittedly, such
determinations can hardly be considered scientific; nonetheless, they are of some value.
Perhaps the second of the two reasons provides the best support for the application of the referenced AP SI techniques. It certainly provides justification for the use of the AP SI system that will be described below. That is, such support is provided by the extensive body of research on SI that is available. As can be seen from the relatively few examples included in this presentation, a great deal is now known about the SI process, its many dimensions, and the effect variables have upon it. Thus, a robust number of the critical relationships actually have been researched and are presently understood. Of course, more information is needed. Nevertheless, we would argue that it is legitimate to present the following AP SI example for consideration.
The protocol to follow has been described previously [10, 84]. It is the one that parallels (as closely as possible) the model employed when the semi-automatic SI system was being developed. As stated, it is based on a cohesive body of research which, in turn, provides understanding of: 1) the identification processes employed by humans and 2) how to deal with the many variables that impact these processes.

It may be best understood by consideration of Figure 3. Here, a number of parameters (not quite vectors as yet) have been organized into groups quantified by low-to-high continua -- for which a 0-10 range provides a simple metric scale. Note that the first parameter that can be judged is fundamental frequency or heard "pitch". In this case, a simple vector can be seen; it is made up of Fo level, variability and patterns. These three features can each be detected aurally and -- just as important -- they can be physically measured. While many of the parameters that follow are suprasegmental in nature (i.e., voice quality, vocal intensity, speech timing, etc.), the system being described also includes a number of segmentals (that is: vowels, consonants, consonant clusters, nasality, etc.). In any event, the protocol here is to repeatedly compare pairs of recorded speech samples uttered by an unknown talker (U) with those of a known person (K) -- and to do so, one parameter at a time, until every one of the pairs (i.e., of those that can be judged) has been rated on the basis of their U-K similarity. After a number of trials, a judgment can be made as to how likely it is that the paired samples being assessed were produced either by a single individual (scoring would be in the 7-10 range) or by two individuals (judgments falling into the 0-3 region). A brief scoring sketch is provided at the end of this section.

As might be expected, structured approaches such as this one (although labor intensive) have served forensic phoneticians quite well and for many years. As cited, some data directly on their validity (both experimental and field) have been published and show reasonable accuracy and reliability [10]. So too were the assessments satisfactory in those cases where verification experiments were carried out or when external confirmation was available. Moreover, this particular procedure has been approved under the Daubert standards [14], as it has been accepted nearly 100 times in the United States (in both state and federal courts).
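To make the scoring procedure concrete, the sketch below records ratings on the 0-10 continua and summarizes them against the 0-3 and 7-10 decision bands described above. The parameter names, the use of a simple mean and the handling of parameters that cannot be judged are assumptions introduced for illustration; this is not the published form itself.

```python
# Hypothetical subset of the parameters on an AP SI form of the kind in Figure 3.
AP_SI_PARAMETERS = [
    "pitch_level", "pitch_variability", "pitch_patterns",      # the Fo cluster
    "voice_quality", "vocal_intensity", "speech_timing",       # other suprasegmentals
    "vowels", "consonants", "consonant_clusters", "nasality",  # segmentals
]

def summarize_ratings(ratings):
    """ratings maps a parameter to a 0-10 U-K similarity score, or to None
    when that parameter could not be judged from the recordings."""
    judged = {p: s for p, s in ratings.items() if s is not None}
    if not judged:
        raise ValueError("no parameters could be judged")
    mean_score = sum(judged.values()) / len(judged)

    if mean_score >= 7:
        decision = "consistent with one speaker (U = K)"
    elif mean_score <= 3:
        decision = "consistent with two speakers (U != K)"
    else:
        decision = "inconclusive"
    return mean_score, decision

# Example: ratings accumulated over repeated U-K paired comparisons.
ratings = {p: None for p in AP_SI_PARAMETERS}
ratings.update({"pitch_level": 8, "pitch_variability": 7, "pitch_patterns": 9,
                "voice_quality": 8, "speech_timing": 6, "vowels": 7})
print(summarize_ratings(ratings))  # -> (7.5, 'consistent with one speaker (U = K)')
```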
Discussion
Of course, it must be conceded that a number of concerns or reservations have been expressed about virtually any SI method structured or attempted [43, 85, 86]. Because of its subjectivity, the AP SI approach is particularly vulnerable to such challenges [41, 42, 86]. Further, as has been indicated, some of the standards established by the governing model have been violated. However, this second challenge has been mitigated to some extent by the large body of research that supports its use.
On the other hand, the issue of subjectivity must be addressed. An additional reason for doing so is that the National Academy of Sciences Research Council [13] has advised that anyone carrying out forensic procedures should reduce the human factors associated with their use as much as is possible -- i.e., to prevent subjective judgments from unfairly biasing any forensic outcome.
Of course, it can be argued that there has to be some subjectivity even if the system being used is said to be "fully automatic." After all, the operator still has to make a number of decisions: whether SI can be carried out in the first place, which system is to be employed (and how it is calibrated), which speech samples should be assessed, how the output will be organized and how the findings will be interpreted. In short, a human must control, and be responsible for, such operations. Hence, certain subjective decisions must always be made no matter what forensic procedure is being employed.

Nevertheless, it is conceded that the subjective elements inherent in speech-based SI cannot be ignored. First, for all AP SI systems, the practitioner is a potential error source. The level and quality of his or her training and experience definitely are factors here, as are biases, emotional issues, arousal and health states. Additionally, it is conceded that problems can result from 1) how the practitioner selects (or has developed) the system, 2) the standards which must be met (or were not met), and 3) how these, and other, variables influence the process. In many instances, problems such as these can be controlled, mitigated or counterbalanced. Nonetheless, when a human being has a major role in the SI process, caution is recommended [43, 85]. It also would appear desirable to set up -- and use -- procedures designed to test the accuracy of the practitioner in question [14].
Finally, it must be stressed that, even with all their limitations, the use of established AP SI procedures is recommended. First, systems such as the one illustrated above have been shown to be reasonably valid and effective; that is, if they are carefully applied under specific (controlled) conditions and the resultant data are cautiously interpreted. Indeed, they mostly have been successful in meeting such challenges when applied. And, after all, they constitute but a stop-gap procedure until a universally valid and efficient digital approach can be developed.
References
[1] Will G. The prescience of Daniel Patrick Moynihan. Conservative Chronicle 2015; 30.12: 25.

[2] Hollien H. Acoustics of crime. New York: Plenum Press, 1990.

[3] Hollien H. An approach to speaker identification. J Forensic Sci, in press.

[4] Quintilian. Institutio oratoria. From: Quintilian's Institutes of Oratory, Watson J (editor). London: G. Bell, 1899.

[5] Deering and Co. vs Shumpik. Minn Reporter 67: 348, 1897.

[6] Mack vs Florida. 54 Fla. 55, 44 So. 706, 1907.

[7] Potter R, Kopp G, Green H. Visible speech. New York: Van Nostrand, 1947.

[8] McGehee F. The reliability of the identification of the human voice. J Gen Psychol 1936; 17: 249-71.

[9] Steer MD. Special report on recordings. US Navy Acoustics Lab, Pensacola (FL), 1945.

[10] Hollien H. Forensic voice identification. London: Academic Press Forensics, 2002.

[11] Hollien H. Barriers to progress in speaker identification. Ling Evidence Security Law Intel 2013; 1: 1-23.

[12] Hollien H, Harnsberger J. Speaker identification: the case for speech vector analysis. J Acoust Soc Amer 2010; 128: 2394A (abstract of presentation).

[13] National Research Council. Strengthening forensic science in the United States: a path forward. Report to the U.S. Department of Justice (Contract 2006-DN-BX-0001). Washington: National Academies Press, 2009.

[14] Daubert vs Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 113 S. Ct. 2786, 1993.

[15] Kraus N, Nicol T. The musician's auditory world. Acoustics Today 2010; 3: 15-27.

[16] Bronkhorst AW. The cocktail party phenomenon: review of research on speech intelligibility in multiple-talker conditions. Acustica 2000; 86: 117-28.

[17] Kraus N, Skoe E, Parbery-Clark A, Ashley R. Experience-induced malleability in neural encoding of pitch, timbre and timing: implications for language and music. Annals New York Acad Sci (Neurosciences and Music III) 2009; 1169: 543-57.

[18] Wong P, Dees T, Kraus N, Russo N, Skoe E. Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nature Neurosci 2007; 10: 420-22.

[19] Kraus N, McGee T, Carrell T, Sharma A. Neurophysiologic bases of speech discrimination. Ear Hear 1995; 16: 19-37.

[20] Krishnan A, Cariani P, Gandour J, Xu YS. Encoding of pitch in the human brainstem is sensitive to language experience. Cognitive Brain Res 2005; 25: 161-68.

[21] Linville S. Vocal aging. San Diego (CA): Singular, 2001.

[22] Hollien H. The normal aging voice. In: Huntley R, Helfer K (editors). Communication in Later Life. Boston: Butterworth-Heinemann, 1995; 23-40.

[23] Neiman GS, Applegate JA. Accuracy of listener judgments of perceived age relative to chronological age in adults. Folia Phoniatr 1990; 42: 327-30.

[24] Hollien H, Shipp T. Speaking fundamental frequency and chronologic age in males. J Speech Hear Res 1972; 15: 155-59.

[25] Ferrand C. Harmonics-to-noise ratio: an index of vocal aging. J Voice 2002; 16: 480-87.

[26] Smith B, Preston J, Wasowicz J. Temporal characteristics of the speech of normal elderly adults. J Speech Hear Res 1987; 30: 522-29.

[27] Harnsberger JD, Shrivastav R, Brown W, Rothman H, Hollien H. Speaking rate and fundamental frequency as speech cues to perceived age. J Voice 2008; 22: 58-69.

[28] Kirchhübel C, Howard D, Stedmon A. Acoustic correlates of speech when under stress: research methods and future directions. J Speech Lang Law 2011; 18: 75-98.

[29] Hicks W, Hollien H. The reflection of stress in voice: understanding the basic correlates. Proceed Carnahan Conf Crime Countermeasures, Lexington, KY, 1981; 189-94.

[30] Hollien H, Saletto JA, Miller SK. Psychological stress in voice: a new approach. Studia Phonet Posnaniensia 1993; 4: 5-17.

[31] Vrij A, Mann S. Telling and detecting lies in a high-stake situation: the case of a convicted murderer. Appl Cogn Psychol 2001; 15: 187-203.

[32] Fernandez R, Picard RW. Modeling drivers' speech under stress. Speech Comm 2003; 40: 145-59.

[33] Scherer KR, Johnstone T, Klasmeyer G, Bänziger T. Acoustic correlates of task load and stress. Proceed ICSLP, Denver, USA, 2002; 2017-20.

[34] Finch MI, Stedmon AW. The complexities of stress in the operational military environment. In: Hanson M (editor). Contemporary Ergonomics. London: Taylor and Francis, 1998; 388-92.

[35] Johannes B, Kirsch K, Gunga HC, Salnitski VP. Voice stress monitoring in space: possibilities and limits. Aviation Space Environ Med 2000; 71: 58-64.

[36] Williams CE, Stevens KN. On determining the emotional state of pilots during flight: an exploratory study. Aerospace Med 1969; 40: 1369-72.

[37] Huntley Bahr R, Pass K. The influence of style-shifting on voice identification. Forensic Ling 1996; 3: 24-38.

[38] Huntley Bahr R, Frisch S. The problem of codeswitching in voice identification. In: Braun A, Masthoff HR (editors). Phonetics and its Applications. Stuttgart: Steiner, 2002; 89-96.

[39] Betancort KS, Huntley Bahr RH. Dialect and channel influences in a voice identification task. Int J Speech Lang Law 2010; 17: 179-200.

[40] Hollien H. On earwitness lineups. Invest Sci J 2012; 4: 1-17.

[41] Broeders A. Earwitness identification: common ground, disputed territory and uncharted areas. Forensic Ling 1996; 3: 1-13.

[42] Lea W. Voice analysis on trial. Springfield (IL): Charles C. Thomas, 1981.

[43] Hollien H, Majewski W. Unintended consequences: due to lack of standards for speaker identification and other forensic procedures. Proceed 16th Internat Congr Sound Vibrat, Kraków, Poland, July 2009; 1-6.

[44] DeJong G. Earwitness characteristics and speaker identification accuracy. Unpublished PhD dissertation, University of Florida, Gainesville, FL, 1998.

[45] Bull R, Clifford BR. Earwitness testimony. In: Shepard E, Wolchover D (editors). Analyzing Witness Testimony: A Guide for Legal Practitioners and Other Professionals. London: Blackstone Press, 1999; 194-206.

[46] Hollien H, Bennett G, Gelfer MP. Criminal investigation comparison: aural/visual identification resulting from a simulated crime. J Forensic Sci 1983; 28: 208-21.

[47] Kerstholt JH, Jansen J, Van Amelsvoort A, Broeders A. Earwitnesses: effects of speech duration, retention interval and acoustic environment. Applied Cogn Psychol 2004; 18: 327-36.

[48] Künzel H. On the problems of voice lineups. Forensic Ling 1994; 1: 45-58.

[49] Papcun G, Davis A, Kreiman J. Long-term memory for unfamiliar voices. J Acoust Soc Amer 1989; 85: 913-25.

[50] Aarts N. The effects of listener stress on perceptual speaker identification. MA thesis, University of Florida, Gainesville, FL, 1984.

[51] Bricker P, Pruzansky S. Effects of stimulus content and duration on talker identification. J Acoust Soc Amer 1966; 40: 1441-50.

[52] Pollack I, Pickett JM, Sumby WH. On the identification of speakers by voice. J Acoust Soc Amer 1954; 26: 403-12.

[53] Schiller N, Koester O. The ability of expert witnesses to identify voices: a comparison between trained and untrained listeners. Forensic Ling 1998; 5: 1-9.

[54] Rothman H. A perceptual and spectrographic investigation of talkers with similar sounding voices. Proceed Internat Conf Crime Countermeasures, Oxford, UK, 1977; 37-41.

[55] Dante H, Sarma V. Automatic speaker identification for a large population. IEEE Trans Acoust Speech Signal Process 1979; 27: 255-63.

[56] Atal BS. Automatic speaker recognition based on pitch contours. J Acoust Soc Amer 1972; 52: 1687-97.

[57] Jiang M. Fundamental frequency vector for a speaker identification system. Forensic Ling 1996; 3: 95-106.

[58] Kinoshita Y, Ishihara S, Rose P. Exploring the discriminatory potential of Fo distribution parameters in traditional forensic speaker recognition. J Speech Lang Law 2009; 16: 91-111.

[59] Hollien H, Majewski W. Speaker identification using long-term spectra under normal and distorted speech conditions. J Acoust Soc Amer 1977; 62: 975-80.

[60] Gelfer MP, Massey KP, Hollien H. The effects of sample duration and timing on speaker identification accuracy by means of long-term spectra. J Phonet 1989; 17: 327-38.

[61] Jacewicz E, Fox RA, Wei L. Between-speaker and within-speaker variation in speech tempo of American English. J Acoust Soc Amer 2010; 128: 839-50.

[62] Johnson CC, Hicks JW, Hollien H. Speaker identification utilizing selected temporal speech features. J Phonet 1984; 12: 319-27.

[63] Orchard T, Yarmey A. The effects of whispers, voice-sample duration and voice distinctiveness on criminal speaker identification. Appl Cogn Psychol 1995; 9: 249-60.

[64] LaRiviere CL. Contributions of fundamental frequency and formant frequencies to speaker identification. Phonetica 1975; 31: 185-97.

[65] Ganesan M, Agrawal S, Ansari A, Pavate K. Recognition of speakers based on acoustic parameters of vowel sounds. J Acoust Soc India 1985; 4: 202-12.

[66] Duckworth M, McDougall K, DeJong G, Shockey L. Improving the consistency of formant measurement. J Speech Lang Law 2011; 18: 35-51.

[67] Morrison GS. Likelihood-ratio forensic voice comparison using parametric representations of the formant trajectories of diphthongs. J Acoust Soc Amer 2009; 125: 2387-97.

[68] Glenn JW, Kleiner N. Speaker identification based on nasal phonation. J Acoust Soc Amer 1968; 43: 368-72.

[69] Su LS, Li K, Fu K. Identification of speakers by use of nasal coarticulation. J Acoust Soc Amer 1975; 56: 22-32.

[70] LaRiviere C. Speaker identification from turbulent portions of fricatives. Phonetica 1974; 29: 245-52.

[71] Schwartz MF. Identification of speaker sex from isolated, voiceless fricatives. J Acoust Soc Amer 1968; 43: 1178-79.

[72] Stevens KN. Sources of inter- and intra-speaker variability in the acoustic properties of speech sounds. Proceed 7th Internat Cong Phonetic Sci, Montreal, 1971; 206-23.

[73] Tsai WH, Wang HM. Speech utterance clustering based on the maximization of within-cluster homogeneity of speaker voice characteristics. J Acoust Soc Amer 2006; 120: 1631-45.

[74] Shirt M. An auditory speaker recognition experiment. Proceed Conf Police Applicat Speech Tape Record Evidence. London: Instit Acoust, 1984; 71-74.

[75] Koester JP. Performance of experts and naïve listeners in auditory speaker recognition (in German). In: Weiss R (editor). Festschrift für H. Wängler. Hamburg: Buske, 1987; 171-218.

[76] Hollien H, Majewski W, Doherty ET. Perceptual identification of voices under normal, stress and disguise speaking conditions. J Phonetics 1982; 10: 139-48.

[77] Fischer C, Thompson CR. Psychiatry and behavioral science: the power of intuition in deception detection. AAFS Academy News 2014; 44: 25-28.

[78] Ten Brinke L, Stimson D, Carney DR. Some evidence for unconscious lie detection. Psychological Science 2014; 25: 1098-105.

[79] Horvath F, McClaughan J, Weatherman D, Slowiska S. The accuracy of auditors' and layered voice analysis (LVA) operators' judgments of truth and deception during police questioning. J Forensic Sci 2013; 58: 385-92.

[80] Todd N. Do humans possess a second sense of hearing? Amer Scientist 2015; 103: 348-55.

[81] Nolan JF. The phonetic basis of speaker identification. Cambridge: Cambridge University Press, 1983.

[82] French P. An overview of forensic phonetics with particular reference to speaker identification. Forensic Ling 1994; 1: 169-81.

[83] Rose P. Forensic speaker identification. London: Taylor and Francis, 2002.

[84] Hollien H, Hollien PA. Improving aural-perceptual speaker identification techniques. In: Braun A, Köster JP (editors). Studies in Forensic Phonetics. Trier: Wissenschaftlicher Verlag, 1995; 64: 87-97.

[85] Campbell J, Shen W, Campbell R, Schwartz R, Bonastre JF, Matrouf D. Forensic speaker recognition. IEEE Signal Processing Mag, March 2009; 95-103.

[86] Hollien H, Jiang M. The challenge of effective speaker identification. RLA2C, Avignon, France, 1998; 1: 2-9.
Legends
Figure 1. This flow-chart provides a graphic view of the model described in the text. (It also parallels those articulated by NRC and Daubert.) This configuration is a working version of the original and includes the research steps which must be taken to develop a valid SI system. In this case, the vectors (i.e., clusters of the related parameters) are speech-based. Note that information is fed into the process from external as well as internal (i.e., research data) sources. The research protocol flows from the laboratory level, via several stages, to “real life” situations.
Figure 2. The means for the correct identifications (by listeners) of 10 talkers speaking under the conditions of N = normal, S = stress and D = disguise. The three listener groups consisted of: (A) individuals who knew the talkers very well, (B) listeners who did not know the talkers but were trained to recognize them and (C) those who were trained similarly but knew neither English nor the talkers. The task was to name each speaker for all six of his presentations (the total number being 60).
Figure 3. The AP SI form developed for use with the aural-perceptual speaker identification approach described in the text. Basically, a series of recorded pairs of U, K samples are presented. Judgments are made as to whether they were produced by the same speaker (high scoring) or by two individuals (low scoring). All parameters are judged independently and one at a time.