Distinguishing between forensic science and forensic pseudoscience: Testing of validity and reliability, and approaches to forensic voice comparison


Science and Justice 54 (2014) 245–256


Review

Geoffrey Stewart Morrison ⁎
Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales, UNSW Sydney, NSW 2052, Australia

Article info
Article history: Received 12 January 2013; Received in revised form 28 May 2013; Accepted 17 July 2013
Keywords: Validity; Reliability; Forensic voice comparison; Aural; Spectrographic; Acoustic–phonetic

Abstract
In this paper it is argued that one should not attempt to directly assess whether a forensic analysis technique is scientifically acceptable. Rather one should first specify what one considers to be appropriate principles governing acceptable practice, then consider any particular approach in light of those principles. This paper focuses on one principle: the validity and reliability of an approach should be empirically tested under conditions reflecting those of the case under investigation using test data drawn from the relevant population. Versions of this principle have been key elements in several reports on forensic science, including forensic voice comparison, published over the last four-and-a-half decades. The aural–spectrographic approach to forensic voice comparison (also known as "voiceprint" or "voicegram" examination) and the currently widely practiced auditory–acoustic–phonetic approach are considered in light of this principle (these two approaches do not appear to be mutually exclusive). Approaches based on data, quantitative measurements, and statistical models are also considered in light of this principle.
© 2013 Forensic Science Society. Published by Elsevier Ireland Ltd. All rights reserved.

Contents
1. Introduction
   1.1. The 2009 National Research Council report's versus Cole's concept of forensic "science"
   1.2. Paradigm
   1.3. The likelihood-ratio framework
   1.4. Approaches based on quantitative measurements, databases representative of the relevant population, and statistical models
2. Testing of validity and reliability under conditions reflecting those of the case under investigation using data drawn from the relevant population
   2.1. Introduction
   2.2. Procedures for measuring validity and reliability
   2.3. Lack of testing of experience-based systems
   2.4. Lack of testing/lack of appropriate testing of systems based on data, quantitative measurements, and statistical models
   2.5. Pre-testing or case-by-case testing?
3. The spectrographic/aural–spectrographic approach
   3.1. Introduction
   3.2. Legal admissibility
   3.3. Reports including consideration of principles for determining acceptable practice
   3.4. Tests of validity
   3.5. Conversions and reasons for conversion
   3.6. Is the fact that a spectrogram is used a key aspect of the criticism of the approach?
   3.7. The IAFPA resolution
4. The auditory–acoustic–phonetic approach
5. Conclusion
Acknowledgments
References

☆ This is a version of the opening presentation of the Special Session on Distinguishing Between Science and Pseudoscience in Forensic Acoustics <http://montreal2013.forensic-acoustics.net/> at ICA 2013: 21st International Congress on Acoustics / 165th Meeting of the Acoustical Society of America / 52nd Meeting of the Canadian Acoustical Association, Montréal, 2–7 June 2013 <http://www.ica2013montreal.org/>. An abridged written version appears in the conference proceedings under the title "Distinguishing between science and pseudoscience in forensic acoustics". The present written version maintains some of the oral character of the original presentation.
⁎ Now Forensic Consultant, Vancouver, British Columbia, Canada. Tel.: +1 604 637 0896, +44 191 645 0896, +61 2 800 74930. E-mail address: [email protected].
1355-0306/$ – see front matter © 2013 Forensic Science Society. Published by Elsevier Ireland Ltd. All rights reserved. http://dx.doi.org/10.1016/j.scijus.2013.07.004


1. Introduction

The title of this paper was deliberately chosen to be provocative, but is probably somewhat (if not highly) inaccurate: I don't plan to actually provide a definition which could be used to include everything one wants to count as science and to exclude everything one does not want to count as science, a problem known in philosophy of science as the demarcation problem. See Edmond & Mercer [1] on the problems with the "junk science" versus "good science" debate. I will, however, provide a discussion of what I consider to be relevant principles governing acceptable practice in forensic science in general and forensic voice comparison in particular. I believe that it is more productive to focus on and potentially debate principles and then consider different approaches in light of these principles, rather than immediately attempt to critique the approaches. I believe that a focus on principles will help us to understand what really matters.

There are serious problems with current practice in forensic science, as documented in the 2009 National Research Council (NRC) report on Strengthening forensic science in the United States: A path forward [2], in the 2012 Frontline documentary The real CSI: How reliable is the science behind forensic science? [3], and elsewhere. Although both the aforementioned report and documentary are from the United States, I would be very surprised if similar problems did not exist in Canada and in other parts of the world.

1.1. The 2009 National Research Council report's versus Cole's concept of forensic "science"

The message of the 2009 NRC report [2] could be summarized as "forensic science should be more scientific", and it explicitly calls for the adoption of a "scientific culture" [2, p. 125]. From a philosophy and sociology of science perspective, Cole [4] is critical of the NRC report's portrayal of science and scientific culture, arguing among other things that it focused on "discovery science" whereas the majority of forensic science practice is what he calls "mundane science". Discovery science can be exemplified by the recently completed process of hypothesizing the existence of the Higgs boson then designing and running an experiment to test this hypothesis, whereas mundane science can be exemplified by "laboratory technicians performing routine assays, industrial scientists seeking to refine a product or process, and even physicians trying to diagnose patients or engineers trying to design a safer bridge" [4, p. 447]. Cole points out, however, that the NRC report never claimed that forensic science was "not science", "unscientific", or "pseudoscience", and that it instead made a number of specific claims and recommendations. One of these recommendations, Recommendation 3 [2, pp. 22–23], will be the focus of my presentation, and can be summarized as: The validity and reliability of forensic analysis approaches and procedures should be tested.

1.2. Paradigm

For several years I have been advocating a paradigm for the evaluation of forensic evidence consisting of the following three elements:

1. obligatory use of the likelihood-ratio framework;
2. highly preferred use of approaches based on quantitative measurements, databases representative of the relevant population, and statistical models;
3. obligatory testing of validity and reliability under conditions reflecting those of the case under investigation using data drawn from the relevant population.

Recent summaries of the paradigm appear in Morrison, Evett, et al. [5] and Morrison [6].
Details of my thoughts on selecting an appropriate database for forensic-voice-comparison cases appear in Morrison, Ochoa, & Thiruvaran [7], and details of my thoughts on appropriate metrics and methodology for testing validity and reliability for forensic comparison in general appear in Morrison [8]. Below I briefly discuss the first two elements, then discuss the third element in greater detail.

1.3. The likelihood-ratio framework

I (and many others) consider the likelihood-ratio framework to be the logically correct framework for the evaluation and interpretation of forensic evidence irrespective of the approach adopted (several approaches to forensic voice comparison are discussed below). There is increasing support for this position: in 2011, 31 experts in the field signed a position statement that included an affirmation that they consider the likelihood-ratio framework to be the most appropriate framework for the evaluation of evidence (Evett et al. [9]), and this position statement was endorsed by the Board of the European Network of Forensic Science Institutes (ENFSI), representing 58 laboratories in 33 countries. In the context of forensic voice comparison, the forensic scientist must assess the likelihood of getting the acoustic properties of the recording of a speaker of questioned identity had it been produced by a speaker of known identity (similarity) versus had it been produced by some other speaker from the relevant population (typicality).1 The likelihood-ratio framework requires the forensic scientist to consider both similarity and typicality, and to consider what constitutes the relevant population. Much has been written and said about the likelihood-ratio framework, and I will not focus on this element of the paradigm in the current paper.2

Footnote 1: The speaker of questioned identity is usually the offender and the speaker of known identity is usually a suspect. This is not always the case; for example, the speaker of questioned identity could be a victim, and the recording of the speaker of known identity a recording of a missing person who it is believed could be that victim. For simplicity, I will use the terms "offender" and "suspect" hereafter, rather than the more widely applicable but periphrastic "speaker of questioned identity" and "speaker of known identity".

Footnote 2: Introductions to the likelihood-ratio framework include Robertson & Vignaux [10], Balding [11], and Morrison [12].
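As a notational sketch (the symbols below are mine, not the paper's), the likelihood ratio contrasts the probability of the observed evidence under the same-speaker hypothesis with its probability under the hypothesis that some other speaker from the relevant population produced it:

\[ \mathrm{LR} = \frac{p(E \mid H_{\mathrm{ss}})}{p(E \mid H_{\mathrm{ds}})} \]

where \(E\) denotes the measured acoustic properties of the questioned recording, \(H_{\mathrm{ss}}\) is the hypothesis that the known speaker produced them, and \(H_{\mathrm{ds}}\) is the hypothesis that some other speaker from the relevant population produced them. The numerator corresponds to similarity and the denominator to typicality; fuller introductions are listed in footnote 2.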


1.4. Approaches based on quantitative measurements, databases representative of the relevant population, and statistical models

Approaches based on quantitative measurements, databases representative of the relevant population, and statistical models are highly preferred over more human–expert–experience-based approaches because they are more transparent, more easily replicated, and, as a practical matter, more easily subjected to validity and reliability testing.3

Footnote 3: Systems with the output based directly on human expert judgments can be fused with systems based on data, quantitative measurements, and statistical models (see Morrison [13]); however, because such a fused system would include a system with the output based directly on human expert judgments, it would still be less transparent, harder to replicate, and harder to test than a system based on data, quantitative measurements, and statistical models. Note that the use of systems based on data, quantitative measurements, and statistical models is preferred rather than obligatory within the paradigm; the paradigm does not absolutely preclude the use of systems whose output is based directly on human expert judgments. As discussed in Section 2.1 below, whatever the approach used, the validity and reliability of the system should be tested under conditions reflecting those of the case under investigation, and the best-performing system should be used irrespective of whether it is based directly on human expert judgments, based on data, quantitative measurements, and statistical models, or a fusion of the two.

They are more transparent and more easily replicated because it is possible to describe the data used, measurements made, and statistical models applied in sufficient detail that another suitably qualified and equipped forensic scientist can copy what was done — the first forensic scientist can even provide the second with the data and software which they used. If there are major discrepancies in results, these can potentially be traced back to mistakes in the application of the procedures (e.g., measuring the wrong sample or misrecording a measurement) or genuine disagreements with respect to issues such as what constituted the relevant population. Complete objectivity is unachievable, and it may be reasonable to expect that differences in subjective decisions will typically be the cause of the latter type of disagreement. Such disagreements could be discussed by the forensic scientists and potentially resolved before trial,4 or could become matters to be debated before the trier of fact. I would consider these legitimate topics for debate. For example, last year (2012) I was asked to critique a forensic-voice-comparison report written by an expert who used (at least in part of their analysis) databases, quantitative measurements, and statistical models to calculate likelihood ratios; one of my primary negative criticisms was that, in my opinion, the data used did not reflect the relevant population or the recording conditions of the case under investigation.5 In contrast, if the final conclusion presented by the forensic expert is directly dependent on an experience-based subjective decision, there is really no way to interrogate the process by which that decision was made, and no way to sensibly debate or resolve a major discrepancy between the conclusions of two forensic experts.

Footnote 4: "Concurrent evidence" (aka "hot tubbing") is practiced in some Australian jurisdictions, and has recently been introduced in Canadian Federal Courts [Federal Court Rules SOR/2010-176, s. 9, 282.1–282.2] and in Ontario [Rules of Civil Procedure 20.05(2)(k)].

Footnote 5: The suspect and offender recordings were spontaneous speech (not necessarily the same speaking style on each recording), neither was of studio quality, the suspect recording was made in a police interview room and the offender recording was made using the audio-recording facility of a mobile telephone, and both were recorded in 2009, whereas the recordings constituting the sample of the population were studio-quality recordings of read speech recorded in the 1960s. The questions of whether the data adequately reflect the relevant population and the casework conditions cannot be addressed via empirical testing, because they are questions related to the selection of data which would subsequently be used for empirical testing. A forensic scientist can use their expertise to select what they consider to be appropriate data, but they must also make it clear to the trier of fact that this was a subjective decision, and the trier of fact should have the opportunity to decide whether it was an appropriate decision, possibly after also taking into account the subjective opinion of another forensic scientist. In casework reports issued by my lab we make it clear that if the trier of fact does not believe that we have obtained data that adequately reflect the relevant population and the conditions of the case under investigation, then all subsequent testing is meaningless. Our data-selection procedure involves the use of a panel of listeners; see Morrison, Ochoa, & Thiruvaran [7]. Although subjective decisions are needed to select relevant databases, these decisions are remote from the ultimate output of the system compared to a system whose output is directly the subjective judgment of an expert. The former type of system is therefore much more robust than the latter with respect to resistance to human bias influencing the outcome.

2. Testing of validity and reliability under conditions reflecting those of the case under investigation using data drawn from the relevant population

2.1. Introduction

I described the second element of the paradigm (use of quantitative measurements, databases representative of the relevant population, and statistical models) as highly preferred rather than obligatory because it should be subservient to the third element: testing of validity and reliability under conditions reflecting those of the case under investigation using data drawn from the relevant population. If, under the conditions of a particular case, a more subjective experience-based system is found to have better validity and reliability than a system based on data, quantitative measurements, and statistical models, then the former should be employed rather than the latter.

Why is it essential to measure the validity and reliability of a forensic analysis under conditions reflecting those of the case under investigation using samples drawn from the relevant population? Quite simply, such testing is the only way to demonstrate the degree to which a forensic system does what it is claimed to do, and to demonstrate the degree of consistency with which it does that.

2.2. Procedures for measuring validity and reliability

The following summarizes parts of Morrison [8]; see the latter for details.

Validity, synonymous with accuracy, refers to the extent to which, on average, a set of measurements or estimates approximates the true value of the property being measured or estimated, e.g., how close is the mean of a set of estimates to the true value.6

Footnote 6: One of the reviewers pointed out that ISO/IEC 17025:2005 tests of validity of human-based approaches focus on assessing competence in terms of whether the individual possesses certain qualifications, experience, and knowledge, and whether they follow certain procedures and practices. This would, however, give no indication of how well the system of which this individual is a part would perform under conditions reflecting those of a particular case. That ISO/IEC 17025:2005 sense of "validity" is very different from the sense of "validity" employed in the current paper, and compliance with the former is insufficient in and of itself to produce a system acceptable for casework. Advocates of the use of the aural–spectrographic approach wrote protocols for the application of this approach (see Section 3.7) and then argued that acceptability was based on whether one followed these protocols (see footnote 10). Establishing poor standards can be a means of giving an undeserved imprimatur to poor practice (see Morrison, Evett, et al. [5]). An argument that a particular approach or procedure is valid because it conforms with an existing standard or protocol distracts from defining appropriate principles for acceptability and then determining whether an approach and the procedures which implement that approach conform with those principles.

Reliability, synonymous with precision, refers to the spread of a set of measurements or estimates around the average value of those measurements or estimates, e.g., what is the variance of a set of estimates of a value.

The validity of a forensic-comparison system can be empirically assessed using a database of pairs of test samples. Some pairs must be same-origin pairs and other pairs must be different-origin pairs. The tester must know which pairs are same origin and which are different origin, but the system being tested must not have access to this information. The system is presented with the test pairs and it provides an answer for each pair. The tester compares the system's answers with the truth as to whether each pair was a same-origin or a different-origin pair. The tester assigns a penalty value to each answer according to the correctness of the answer and takes an average of these penalty values as an indicator of the validity of the performance of the system. The function which assigns the penalty value and the averaging function constitute a metric of system validity.

Correct-classification rate (or its inverse, classification-error rate) has been proposed as a metric of system validity; however, it is not consistent with the role of the forensic scientist within the likelihood-ratio framework. It is based on hard-thresholded decisions made on the basis of posterior probabilities. Within the likelihood-ratio framework such decisions are the responsibility of the trier of fact, and the forensic scientist should not usurp the trier of fact's role. An appropriate metric for assessing the validity of a forensic-comparison system within the likelihood-ratio framework must be based on likelihood ratios, not posterior probabilities, and must assign continuous penalty values, not discrete hard-thresholded values. For example, a likelihood-ratio value of 1000 from a test pair known to be a different-origin pair should attract a greater penalty value than a likelihood-ratio value of 1.1 for the same pair, since the former provides greater support for the contrary-to-fact same-origin hypothesis over the consistent-with-fact different-origin hypothesis than does the latter; likewise, a likelihood-ratio value of 0.001 for that same pair should attract a smaller penalty value than a likelihood-ratio value of 0.9, since the former provides more support for the consistent-with-fact different-origin hypothesis over the contrary-to-fact same-origin hypothesis than does the latter. A metric which has these properties is the log-likelihood-ratio cost (Cllr).
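Morrison [8] gives the full details; as a minimal sketch (the function name and toy values below are mine), Cllr as standardly defined in this literature can be computed from two sets of likelihood ratios, one for pairs known to be same-origin and one for pairs known to be different-origin:

import numpy as np

def cllr(same_origin_lrs, different_origin_lrs):
    """Log-likelihood-ratio cost (Cllr); lower is better.
    A system which always reports LR = 1 (no strength of evidence) scores Cllr = 1."""
    same = np.asarray(same_origin_lrs, dtype=float)
    diff = np.asarray(different_origin_lrs, dtype=float)
    penalty_same = np.log2(1 + 1 / same)  # penalizes small LRs on same-origin pairs
    penalty_diff = np.log2(1 + diff)      # penalizes large LRs on different-origin pairs
    return 0.5 * (penalty_same.mean() + penalty_diff.mean())

# Toy values only: on a different-origin pair an LR of 1000 costs about 10,
# an LR of 1.1 about 1.07, and an LR of 0.001 almost nothing, matching the
# ordering of penalties described in the text above.
print(cllr(same_origin_lrs=[200.0, 15.0, 0.7],
           different_origin_lrs=[0.001, 0.9, 1.1, 1000.0]))

The penalty function is continuous and is a function of the likelihood ratio itself, not of a thresholded decision, which is what makes it compatible with the framework described above.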

Imprecision can come from various sources. For example, if the same sample is remeasured multiple times, imprecision in the measurement system may result in different values for the measurements and ultimately different values for the calculated likelihood ratio. If multiple samples are taken from the same object (e.g., multiple recordings of the same speaker), this may also result in different likelihood-ratio values. Different samples of the same population may also result in different likelihood-ratio values. A test set including multiple measurements and/or multiple samples can be used to obtain multiple likelihood-ratio estimates for each pair of test objects, e.g., multiple recordings of each speaker resulting in multiple likelihood-ratio estimates for each same-speaker and each different-speaker comparison (the speaker rather than the recording being the object of interest). Procedures have been proposed for calculating estimates of credible intervals (CIs) as metrics of system reliability.

The results of empirical assessments of system validity and reliability depend on the test data as well as on the system. In order to be informative with respect to the case under investigation, the data used to test a forensic-voice-comparison system should therefore reflect the relevant population, the speaking styles, and the recording conditions of the suspect and offender recordings in the case under investigation. This is discussed in Section 2.4 below.

The procedures proposed for measuring validity and reliability can be applied to any forensic-comparison system irrespective of whether the output is based directly on a human expert's judgment or whether it is based on relevant data, quantitative measurements, and statistical models. The only requirements are that the system accepts pairs of samples as inputs and that it provides likelihood ratios as outputs.

2.3. Lack of testing of experience-based systems

The idea that experience-based systems should be tested is not new:

For an expert to say "I think this is true because I have been doing this job for x years" is, in my view, unscientific. On the other hand, for an expert to say "I think this is true and my judgement has been tested in controlled experiments" is fundamentally scientific. (Evett [14, p. 21])

It is my impression, however, that practitioners of experience-based approaches are often unable or unwilling to undergo validity and reliability testing. I have even heard one practitioner of such an approach claim in court that the validity and reliability of forensic voice comparison cannot be tested, and another claim that their approach to forensic voice comparison was scientifically valid because it was reproducible and testable, but without presenting any evidence that their system had in fact been reproduced or that their ability to do what they claimed to be able to do had in fact been tested (either under conditions reflecting those of the case at trial or under any other conditions).

Some of this is likely due to a practical issue: systems based on data, quantitative measurements, and statistical models are often wholly or substantially automated, and once the system has been built, tailored, and optimized to the relevant population and the conditions of the case under investigation it may be relatively easy to run a large number of test trials; in contrast, an experience-based system may have to start from scratch on each trial. There may be a large investment in setting up the former, but then little additional cost for each test trial, whereas for the latter there may be a moderate investment on the first test trial and the same moderate investment on every other test trial, resulting in a rapidly increasing total investment as the number of test trials increases.

2.4. Lack of testing/lack of appropriate testing of systems based on data, quantitative measurements, and statistical models

The use of approaches based on data, quantitative measurements, and statistical models is not itself a panacea. Leaving aside the issue of whether I am critical of the design of any particular system based on this general approach, I have seen such systems used inappropriately in both research and casework. The principal problems are inappropriate selection of the relevant population and a sample thereof, and no testing or inappropriate testing of validity and reliability.
These are issues which we discussed at length in Morrison, Ochoa, & Thiruvaran [7], so I discuss them only briefly here. A likelihood ratio as a forensic strength-of-evidence statement is the answer to a particular question, a question defined by the prosecution and defense hypotheses. If the forensic scientist does not properly consider what the appropriate defense hypothesis would be for the case under investigation, and what would be the relevant population specified by that hypothesis, and does not obtain a sample representative of this population with which to model the denominator of the likelihood ratio, then the question for which they provide an answer will not be the question for which the trier of fact requires an answer.

Whatever the value of the likelihood ratio calculated under such circumstances, it is meaningless because it answers the wrong question. Note that a very large likelihood ratio can be obtained if the properties of the suspect and offender samples are far out on a tail of the distribution of those properties in a model of the population, but this is very misleading if the model is of the wrong population. Likewise, if the actually relevant population has not been sampled to build the test database, then testing will be on a sample of the wrong population and the results will not be relevant to the case under investigation.7

Footnote 7: A sample of the wrong population in forensic voice comparison may include pairs of voice recordings which subjectively sound very different from each other, potentially so different that no one would think that they could be produced by the same speaker and hence so different that they would not be submitted for forensic comparison. Such pairs may also present very easy trials for a forensic-voice-comparison system, leading to validity and reliability test results which are highly optimistic compared to how the system would perform if a truly relevant population were sampled.

The other error in testing procedures is not to test under conditions reflecting the conditions of the particular case under investigation. In forensic voice comparison such conditions can include recording duration, speaking style, and recording conditions, the latter including noise, reverberation, transmission of the speech signal over different communication systems, and lossy compression in the storage format. Mismatches in conditions are typical, e.g., a half-hour police interview with background noise and reverberation recorded directly from a microphone versus a two-minute exchange of information about bank account details with the speaker of interest using a mobile telephone and the recording made by a device attached to a landline telephone and saved in a compressed format. The results of testing under conditions which do not reflect those of the case under investigation may be of little or no relevance with respect to the performance of the system when applied to the actual suspect and offender recordings.

I am aware of court cases involving forensic voice comparison using quantitative measurements, databases, and statistical models where the practitioner has either not tested the performance of the system using samples from the relevant population and under conditions reflecting those of the case under investigation, or has not performed any tests at all and has at best relied on tests conducted by commercial manufacturers or tests reported in academic research publications; the latter tests having been conducted using samples of populations and under recording conditions which were very different from those of the cases under investigation.8

Footnote 8: In my lab we use databases containing multiple non-contemporaneous recordings of a large number of speakers, with each speaker recorded using multiple speaking styles on each occasion (Morrison, Rose, & Zhang [15]). We select recordings in the speaking styles which we judge to be closest to the speaking styles on the suspect and offender recordings, e.g., interview and telephone conversation. Unless we were to collect additional data, we could not perform analyses involving speaking styles which we judge to be dissimilar from any of the speaking styles in our existing databases (we do not have recordings of whispered speech, for example). Our databases are of particular languages and dialects and, pending the collection of databases of other languages and dialects, we cannot perform analyses involving other languages and dialects. For a particular case, a subset of speakers from a database is selected for inclusion in the sample of the population via a procedure which makes use of a panel of listeners (see Morrison, Ochoa, & Thiruvaran [7]). Our databases consist of high-quality audio recordings which we process to reflect the conditions of the suspect and offender recordings, for example by adding reverberation and noise, and by passing the recordings through different telephone channels (landline, mobile telephone, etc.) or simulations thereof. A paper describing the details of both the data-selection procedure and the recording-condition simulation is in preparation; it is based on the speaking styles and recording conditions which we encountered in an actual case.
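As a toy illustration of the kind of recording-condition simulation mentioned in footnote 8 (the actual laboratory procedures are more elaborate; the parameter values and function name below are illustrative assumptions), one might band-limit a high-quality recording to a traditional telephone passband and add noise at a chosen signal-to-noise ratio:

import numpy as np
from scipy import signal

def simulate_landline_and_noise(x, fs, snr_db=15.0, band=(300.0, 3400.0)):
    """Crude landline-channel simulation: band-pass filter plus additive white noise.
    x: 1-D array of speech samples; fs: sampling rate in Hz; snr_db: target SNR."""
    sos = signal.butter(4, band, btype="bandpass", fs=fs, output="sos")
    y = signal.sosfiltfilt(sos, x)                      # zero-phase band-pass filtering
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(y))
    target_noise_power = np.mean(y ** 2) / (10 ** (snr_db / 10.0))
    noise *= np.sqrt(target_noise_power / np.mean(noise ** 2))
    return y + noise

# Example with a synthetic signal standing in for one second of clean speech:
fs = 16000
clean = np.random.default_rng(1).standard_normal(fs)
degraded = simulate_landline_and_noise(clean, fs)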


2.5. Pre-testing or case-by-case testing?

Cole [4] proposes that forensic science culture/society be reformed on the model of medical culture/society. This would include researchers who develop and validate new techniques; practitioners who are skilled users of those techniques, who understand the theory behind them and the results of the validation studies, and who make informed practical decisions accordingly; and technicians who follow prescribed protocols without necessarily having substantial knowledge about theory or the results of validation studies. Under this system the practitioners would be highly trained individuals and would be required to keep up with research developments, but would not themselves typically conduct validation studies. Such a model, I think, assumes that there are a relatively small number of conditions under which forensic systems need to be validated, that the results of validation studies can then be published, and that when a practitioner works on a case they have to determine what the conditions are and look up the results of existing validation studies under those conditions for the system or systems they are considering using. Perhaps the number of conditions which need to be pre-tested runs to the tens or hundreds, but testing them all is conceivably achievable in the short term, and then the practitioners are covered for the majority of cases on which they are likely to work. There would be a small proportion of cases in which the practitioner recognizes that the conditions are unusual and not covered in the existing validation literature, and in these instances they would call in the researchers to address the problems or even take over the case. I could conceive that this may be the way that some forensic DNA analysis laboratories already work, and that there may be other branches of forensic science where this model could be applied. If Cole's [4] proposal could be widely implemented, I think this would lead to improvement in forensic-science practice.

When it comes to forensic voice comparison, however, I do not think that the pre-testing aspect of the model can be applied in the foreseeable future, if ever. Given the very large (perhaps infinite) number of possible combinations of conditions and the variability in what constitutes the relevant population from case to case, and therefore the low probability of frequent repetitions of the same conditions and relevant population, I think that validity and reliability have to be tested on a scenario-by-scenario basis, which is effectively a case-by-case basis.9 This means that every forensic-voice-comparison laboratory will have to have staff capable of running validity and reliability tests.

Footnote 9: Testing on a case-by-case basis (scenario-by-scenario basis) should not be confused with optimizing the system for the conditions of each case (scenario). The system must be optimized in the sense that it must be trained using data from a sample of the relevant population and that the data must reflect the conditions of the actual suspect and offender recordings (speaking styles and recording conditions). The system may also be optimized to, for example, attempt to compensate for differences between the suspect and offender recordings due to mismatches in speaking styles and recording conditions. All system optimization should be completed and the system should be frozen (no further changes allowed) before the system is tested, and testing should be performed using test data which are separate from any data used to train and optimize the system. The system must be tested on previously unseen data, since training and testing on the same data would give an overly optimistic assessment as to how the system would be expected to perform on previously unseen data such as the actual suspect and offender samples. In our laboratory we maintain a strict chronology in casework whereby the system is first trained and optimized, second the performance of the system is tested, and third the likelihood ratio for the actual suspect and offender recordings is calculated. Each step is completed before starting the subsequent step and no changes to previous steps are allowed once subsequent steps have begun. Preliminary tests may be performed in order to select parameters which optimize the system for the conditions of the case under investigation, but no overlap is allowed between the data used for these preliminary tests (development data) and the data used for the final test of system performance.
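As an illustration only, the following toy example runs the chronology described in footnote 9 end to end with a deliberately simple score-based system; all data are simulated and the modelling choices are mine, not those of any laboratory described in this paper. The point is the order of the steps: train and freeze a model, measure Cllr on previously unseen test pairs, and only then compute a likelihood ratio for the "case" pair.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def make_recordings(n_speakers, recs_per_speaker):
    # Each "speaker" has a characteristic value; each "recording" is a noisy observation of it.
    speaker_means = rng.normal(120.0, 20.0, n_speakers)                 # between-speaker variation
    noise = rng.normal(0.0, 5.0, (n_speakers, recs_per_speaker))        # within-speaker variation
    return speaker_means[:, None] + noise

def pair_scores(recs):
    # The score for a pair of recordings is simply the absolute difference.
    same, diff = [], []
    n_spk, n_rec = recs.shape
    for i in range(n_spk):
        for a in range(n_rec):
            for j in range(i, n_spk):
                for b in range(n_rec):
                    if i == j and b <= a:
                        continue
                    (same if i == j else diff).append(abs(recs[i, a] - recs[j, b]))
    return np.array(same), np.array(diff)

# Step 1: train score-to-likelihood-ratio models on training speakers, then freeze them.
tr_same, tr_diff = pair_scores(make_recordings(30, 3))
same_model = (tr_same.mean(), tr_same.std())
diff_model = (tr_diff.mean(), tr_diff.std())

def likelihood_ratio(score):
    return norm.pdf(score, *same_model) / norm.pdf(score, *diff_model)

# Step 2: test on previously unseen speakers under the same (simulated) conditions.
te_same, te_diff = pair_scores(make_recordings(30, 3))
cllr = 0.5 * (np.mean(np.log2(1 + 1 / likelihood_ratio(te_same))) +
              np.mean(np.log2(1 + likelihood_ratio(te_diff))))
print(f"Cllr on held-out test pairs: {cllr:.3f}")

# Step 3: only after testing, evaluate the actual "case" pair.
suspect_value, offender_value = 118.0, 123.0
print(f"Case likelihood ratio: {likelihood_ratio(abs(suspect_value - offender_value)):.2f}")

In a real case the test speakers and conditions would of course have to be selected to reflect the relevant population and the conditions of the suspect and offender recordings, which is precisely why the testing has to be repeated for each new scenario.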

3. The spectrographic/aural–spectrographic approach

3.1. Introduction

The aural–spectrographic approach (aka "voiceprinting" and "voicegram identification") consists of listening to suspect and offender recordings (and, depending on the protocol, also recordings of foil speakers), and looking at spectrograms made from each of those same recordings, then making a decision on the basis of this aural and visual examination.


A spectrogram in this context is a graphical representation of the time, frequency, and amplitude properties of an acoustic signal. Traditionally time is represented on the abscissa of a two-dimensional graph, frequency on the ordinate, and amplitude as the intensity of a monochrome image within those axes (colors can also be used to represent amplitude, and on-screen computer graphics can be used to view virtual three-dimensional objects). An example of a spectrogram is shown in Fig. 1. Using both auditory and visual examination, the aural–spectrographic practitioner forms an experience-based subjective opinion as to whether the suspect and offender recordings were produced by the same speaker. The spectrographic approach is visual-mode only, but was supplanted by the aural–spectrographic approach at the beginning of the 1970s (the term "spectrographic approach" can also be used as a cover term for both the visual-only and the aural-plus-visual approaches). In both approaches the practitioner's opinion is based directly and wholly on their subjective experience-based judgment.

There has been much debate as to whether these approaches are appropriate. Unsupported claims of near infallibility have been made, and at times the debate has been acrimonious; see, for example, Hollien [16, pp. 24–25 and ch. 6] and Koenig [17]. The most comprehensive balanced review of the use of the approaches and the debate about their use appears in Gruber & Poza [18]. Other more recent, though less thorough, reviews appear in Meuwly [19, ch. 5]/[20,21], Rose [22, pp. 107–123], and Morrison [12, §99.680–99.690]. Almost all the literature on the topic focuses on the situation in the United States.

3.2. Legal admissibility

In 2003, in United States v Robert N Angleton [2003, 269 F Supp 2nd 892 S D TX], the court conducted a relatively thorough review of the admissibility of the aural–spectrographic approach under Federal Rule of Evidence 702 and case law following Daubert v Merrell Dow Pharmaceuticals [1993, 509 US 579], and ruled that the approach was not admissible.10 The Innocence Project documents a case in which use of the spectrographic approach may have contributed to a wrongful conviction [State of Texas v David Shawn Pope, 1986, 204th District Court, Case No F85-98755-MQ].11

I am not aware of any recent use of the aural–spectrographic approach in Canadian courts. It was admitted in Ontario and Manitoba in the 1970s (see Tosi [23], Appendix A), and in February 2013 the opinion of a practitioner of the aural–spectrographic approach was used by a journalist as part of an investigation into the "robocall" political scandal [24].

I appeared as an expert witness in two cases in Australian courts last year (2012) where I was asked to critique reports submitted by experts who used the aural–spectrographic approach. One of these cases was before a jury, and the lawyer on one side attempted at voir dire to have the testimony based on the aural–spectrographic approach ruled inadmissible (the other case was heard by judge alone). The admissibility rules in Australian jurisdictions are very different from the US federal rules, and the attempt was unsuccessful.
The judge decided that they were bound by Regina v Gilmore [1977, 2 NSWLR 935], a case from 1977 in which the approach had been ruled admissible, in part because it had been ruled admissible by a number of US courts in the early to mid 1970s (which was in turn partly on the basis of the results of a study by Tosi et al. [25], which will be discussed in Section 3.4).

Footnote 10: I am also aware of one US-state-level appeal ruling in 2003 [State of Louisiana v Gary Morrison 2003 KW 1554 (no relation)] and another in 2009 [State of Vermont v Gregory S Forty 2009 VT 118], both Daubert-based, both spending less time on this issue than Angleton, and both coming to the same conclusion of inadmissibility. The appeal court in Forty ruled that the lower court should not have made its decision on the basis of earlier case law such as Angleton which was not actually brought forward by either the prosecution or defense. It did, however, uphold the exclusion of the aural–spectrographic approach in this case on the grounds that the ABRE protocol (see Section 3.7) required that at least ten words be examined and the expert had only been able to examine eight.

Footnote 11: http://www.innocenceproject.org/Content/David_Shawn_Pope.php

[Fig. 1. Example spectrogram of the word "spectrogram". Time (s), 0.1–0.8, on the abscissa; frequency (kHz), 0–4, on the ordinate.]
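For readers who want to reproduce this kind of figure, a minimal sketch using standard tools follows; the file name and analysis parameters are illustrative assumptions, not those used to produce Fig. 1. Analysis windows of roughly 5 ms give the broadband appearance typical of such spectrograms.

import matplotlib.pyplot as plt
from scipy.io import wavfile

fs, x = wavfile.read("speech.wav")   # placeholder file name; any mono speech recording will do
x = x.astype(float)

# Short-time Fourier analysis: time on the abscissa, frequency on the ordinate,
# amplitude rendered as image intensity.
plt.specgram(x, Fs=fs, NFFT=int(0.005 * fs), noverlap=int(0.004 * fs), cmap="gray_r")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.ylim(0, 4000)
plt.show()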

3.3. Reports including consideration of principles for determining acceptable practice

In 1968 Peter B. Denes, at that time the Chair of the Speech Communication Technical Committee (SCTC) of the Acoustical Society of America (ASA), appointed six SCTC members (including himself) to investigate the use of the spectrographic approach [26]. In a sense this study group was the forerunner of the current ASA Forensic Acoustics Subcommittee (FAS), although there was approximately a 40-year gap between the study group being active and the formation of the FAS. At the time, a visual-mode-only approach was prevalent and this was the focus of the study group's investigation. The study group presented a draft report at the SCTC meeting on 9 April 1969, at which the SCTC endorsed the report. The final version (Bolt et al. [27]) was published in the Journal of the Acoustical Society of America (JASA) in 1970 with a footnote that the views expressed were those of the authors as individuals. The following quotes from the report are of particular interest in relation to my topic of principles governing acceptable practice in forensic science.

1970 ASA SCTC study group report:

What kinds of evidence would convince scientists of the reliability of speaker identification based on voice patterns? The usual basis for the scientific acceptance of any new procedure is an explicit description of experimental methods and of results of relevant tests. The description must be sufficient to allow the replication of experiments and results by other scientists.... Lacking explicit knowledge and procedures, can individuals nevertheless acquire such expertise in identification from voice patterns that their opinions could be accepted as reliable?… Validation of this approach to voice identification becomes a matter of replicable experiments on the expert himself, considered as a voice identifying machine. Thus, voice identification might be accomplished either on the basis of explicit knowledge and procedure available to anyone, or on the basis of the unexplained expertise of individuals. In either case, validation requires experimental assessment of performance on relevant tasks.… It may be objected that this minimal set of tests is unreasonably arduous. We do not believe that it is. As scientists we could accept no less in checking the reliability of a "black box" supposed to perform speaker identification. [27, pp. 601–602]

Court determinations may also depend on the apparent validity of exhibits brought in evidence. [27, p. 602]

We conclude that the available results are inadequate to establish the reliability of voice identification by spectrograms… Procedures exist, as we have suggested, by which the reliability of voice identification methods can be evaluated. We believe that such validation is urgently required. [27, p. 603]

The principles expressed in Bolt et al. [27] parallel the second and third elements of the paradigm I promote: highly preferred use of approaches based on quantitative measurements, databases representative of the relevant population, and statistical models; and obligatory testing of validity and reliability under conditions reflecting those of the case under investigation using test data drawn from the relevant population. The first element, obligatory use of the likelihood-ratio framework, was not introduced to forensic voice comparison until the late 1990s (see Morrison [28] for a history).

The principles and conclusions expressed in Bolt et al. [27] were also similar to principles and conclusions expressed in a 1979 NRC report on the aural–spectrographic approach [29] prepared at the request of the Federal Bureau of Investigation (FBI; the committee included both proponents and opponents of the approach), in the 2009 NRC report on forensic science in general [2], and in the US National Institute of Standards and Technology (NIST) and National Institute of Justice (NIJ) 2012 report on forensic fingerprint analysis [30].

1979 NRC report:

The degree of accuracy, and the corresponding error rates, of aural-visual voice identification vary widely from case to case, depending upon several conditions including the properties of the voices involved, the conditions under which the voice samples were made, the characteristics of the equipment used, the skill of the examiner making the judgments, and the examiner's knowledge about the case. Estimates of error rates now available pertain to only a few of the many combinations of conditions encountered in real-life situations. These estimates do not constitute a generally adequate basis for a judicial or legislative body to use in making judgments concerning the reliability and acceptability of aural-visual voice identification in forensic applications. [29, p. 60]

The Committee concludes that the full development of voice identification by both aural-visual and automated methods can be attained only through a longer-term program of research and development leading to a science-based technology of voice identification. [29, p. 60]

An important initial step in developing research plans will be the development of a standard data base of voice samples that are representative of the relevant populations and of the characteristics encountered in voice identification. [29, p. 61]

determining the acceptability of a particular error rate for a particular forensic application is a value question and not a question of scientific or technical fact. It can be answered properly not by this Committee and not by the technical examiner, but only by the judicial or legislative body charged with regulating the proceeding in question. [29, p. 62]

2009 NRC report:

some forensic disciplines are supported by little rigorous systematic research to validate the discipline's basic premises and techniques. There is no evident reason why such research cannot be conducted. [2, p. 22]

The judicial system is encumbered by, among other things, judges and lawyers who generally lack the scientific expertise necessary to comprehend and evaluate forensic evidence in an informed manner, trial judges (sitting alone) who must decide evidentiary issues without the benefit of judicial colleagues and often with little time for extensive research and reflection, and the highly deferential nature of the appellate review afforded trial courts' Daubert rulings. Given these realities, there is a tremendous need for the forensic science community to improve. Judicial review, by itself, will not cure the infirmities of the forensic science community. [footnote omitted] The development of scientific research, training, technology, and databases associated with DNA analysis have resulted from substantial and steady federal support for both academic research and programs employing techniques for DNA analysis.


Similar support must be given to all credible forensic science disciplines if they are to achieve the degrees of reliability needed to serve the goals of justice. [2, pp. 12–13]

2012 NIST/NIJ report:

A basic tenet of experimental science is that "errors and uncertainties exist that must be reduced by improved experimental techniques and repeated measurements, and those errors remaining must always be estimated to establish the validity of our results." [31, p. 1] What applies to physics and chemistry applies to all of forensic science: "A key task … for the analyst applying a scientific method is to conduct a particular analysis to identify as many sources of error as possible, to control or eliminate as many as possible, and to estimate the magnitude of remaining errors so that the conclusions drawn from the study are valid." [2, p. 111] In other words, errors should, to the extent possible, be identified and quantified. [30, p. 21]

quantified "error rates"… not only can lead to improvements in the reliability and validity of current practices, but it also could assist in more appropriate use of the evidence by fact-finders… Many court opinions have discussed error rates of scientific tests such as polygraphy, speaker identification, and latent print identification as a consideration affecting the admissibility of these tests. [30, p. 22]

Recommendation 6.3: A testifying expert should be familiar with the literature related to error rates. A testifying expert should be prepared to describe the steps taken in the examination process to reduce the risk of observational and judgmental error. The expert should not state that errors are inherently impossible or that a method inherently has a zero error rate. [30, p. 209]

I think that a clear pattern emerges across all these sober reports issued over the last four-and-a-half decades: The key scientific concern regarding any approach to forensic analysis (including the spectrographic or aural–spectrographic approach for forensic voice comparison) is whether it has been demonstrated to be sufficiently valid and reliable under casework conditions. It is up to forensic scientists to demonstrate the degree of validity and reliability of the approach under casework conditions, and it is up to the legal authorities, who may not have a good understanding of the approach itself, to decide whether the demonstrated degree of validity and reliability is sufficient.12

Footnote 12: The brief pre-Daubert discussion of legal acceptability in Bolt et al. [27] focused on the court's determination of whether the approach had gained general acceptability within the scientific community, which in turn they argued should be based on demonstrated degree of validity.

3.4. Tests of validity

Over the years there have been a number of tests of the validity of the spectrographic and aural–spectrographic approaches. Some of these are summarized in the 1979 NRC report [29], Gruber & Poza [18], and elsewhere, but one research report merits attention here because it described the largest study conducted and it had the greatest impact on practice and (for a number of years) admissibility.

In 1972 Tosi et al. [25] reported on a study conducted between 1968 and 1970 using recordings of 250 US-English speakers, and just under 35 thousand experimental trials performed by 29 examiners. These experiments were conducted in visual-only mode. Prior to performing the test, as part of the research protocol the examiners received one month of training in the spectrographic approach (the participants were not previously trained professional practitioners of the approach). In general, the examiners were presented with spectrograms from a recording of one speaker, and a set of spectrograms from recordings of multiple other speakers, one of whom might be the same speaker. Some trials were closed set, where the examiner knew that the target speaker was included, but others were open set, where the examiner had to either choose a recording as being produced by the same speaker as on the first recording or say that none of the other recordings were produced by the same speaker.


The examiners also had to indicate how confident they were in their decision.

The interpretation of the results reported in Tosi et al. [25] is hindered by a conflation of different types of error. In open-set trials there are four types of error (assuming that the suspect is in the lineup)13:

A. The suspect is the offender but the examiner picks a foil speaker instead.
B. The suspect is the offender but the examiner says that the offender is not in the lineup.
C. The suspect is not the offender and the examiner picks the suspect.
D. The suspect is not the offender and the examiner picks a foil speaker.

Footnote 13: This discussion is framed in terms of categorical decisions, which are not consistent with the role of the forensic scientist in the likelihood-ratio framework. Error types A and B here are the same as indicated by those letters in Tosi et al. [25], but error types C and D differ.

In a real case, errors A and D would be immediately detected (assuming that none of the foil speakers could be the offender). In signal-detection theory error B is known as a miss; in this context it would not be immediately detectable and could contribute to a guilty person being declared not guilty. In signal-detection theory error C is known as a false alarm; in this context it would not be immediately detectable and could contribute to an innocent person being declared guilty. Guilty versus not-guilty decisions would be made by the trier of fact, who would weigh the voice-comparison evidence along with all the other evidence presented in the legal trial; other evidence may outweigh the voice-comparison evidence.

In Tosi et al. [25] no speakers were specified as suspects and the situation was simplified to presence of the target speaker (A and B) and absence of the target speaker (C and D). Given this, Tosi et al. [25] did not have a distinction between error types C and D, so conflated C and D as a single error type, and, for no apparent reason, they also conflated this with error type A. Results were therefore reported as error type B, "false elimination", and conflated error type A + C + D, "wrong matches". According to Tosi et al. [25], in the most forensically realistic condition tested (open set and non-contemporaneous recordings) the "false elimination" rate was 13% and the "wrong matches" rate was 6%. If only the 74% of decisions in which the examiners were "fairly certain" or "almost certain" were included, these rates dropped to 5% and 2% respectively. These values were averaged across fixed and random word contexts, but results for the more forensically realistic random context were said to be poorer than those for the fixed context. Tosi et al. [25] ended their paper by speculating as to how their results might be relevant for casework conditions.

Tosi et al. [25] was immediately criticized by the ASA SCTC study group (Bolt et al. [32]; see also Gruber & Poza [18]).14 The primary criticisms were that the laboratory tests were methodologically flawed and did not reflect casework conditions, and that Tosi et al.'s attempt to extrapolate to casework conditions was based on dubious assumptions.15 One criticism of note is that the 250 speakers who Tosi et al. [25] claimed to be from a homogeneous population probably included many pairs of individuals who did not sound very much like each other, and did not therefore constitute members of the same relevant population. If they were not sufficiently similar sounding that an investigator would submit them for forensic comparison with each other, then they would not constitute forensically realistic different-speaker trials. Those trials would be too easy and the correct-decision rate would be inflated.

Footnote 14: In contrast to the opinion expressed by the ASA SCTC study group, Greene [33, pp. 189–190] and Tosi [23, p. 144] quote extracts from a letter written on 23 or 28 March 1973 by the then President of the ASA, Karl D. Kryter, which appears to give conditional support to the use of the aural–spectrographic approach: "contrary to the resolution [Bolt et al. [32]?], it can be stated, in my opinion, that by scientific tests it has been proven within normal standards of scientific reliability and validity, that voiceprints for some speakers, under certain conditions and with certain analysis procedures, can provide positive identification…" (Karl D. Kryter quoted in Tosi [23, p. 144]). There is no record of this letter in the minutes of ASA Executive Council meetings (p.c. Elaine Moran, ASA Office Manager, 13 November 2012), so it appears that Kryter wrote this letter in his personal capacity rather than in his capacity as President of the ASA.

Footnote 15: Rather than discuss the "Tosi extrapolation" here, I recommend the review of this issue in Gruber & Poza [18, part B]. Extrapolation from laboratory studies to casework was also discussed at length in the 1979 NRC report [29].



If they were not sufficiently similar sounding that an investigator would submit them for forensic comparison with each other, then they would not constitute forensically realistic different-speaker trials. Those trials would be too easy and the correct-decision rate would be inflated.

Tosi et al. [25] also made reference to a "field study" conducted at the Michigan Department of State Police Crime Laboratory. This was a review of 673 aural–spectrographic cases conducted between 1967 and 1970. The results, as reported in Tosi et al. [25], were that no decision was made in 59% of cases because of poor audio quality or quantity (as compared with 26% "almost uncertain" and "fairly uncertain" responses in the laboratory study), and of the remainder, 38% were declared to be the same speaker and 62% different speakers. It was further reported that of the same-speaker conclusions 29% were confirmed because the suspects "admitted culpability or were convicted by evidence other than that produced by their voice" (p. 2042). The latter clearly has a danger of circularity: a suspect may plead guilty or be found guilty even if they are innocent, and if the voice evidence is presented one cannot determine the extent to which this contributed to a jury's decision (it may be that the voice evidence was not presented in these legal trials).

The argument in Tosi et al. [25] appears to have been that the use of the aural–spectrographic approach (visual and auditory) by professionals taking as much time as they need, as opposed to the use of the spectrographic approach (visual only) by amateurs with only one month of training performing the task in a limited amount of time, combined with the use of a "no decision" option, will lead to higher correct-decision rates. As an argument in favor of the use of the aural–spectrographic approach (not necessarily in contrast with the spectrographic approach), I find this unconvincing. I need to see the results of tests under conditions reflecting those of casework, and for which there can be no dispute as to the same-speaker or different-speaker status of every test pair. I do, however, believe it is self-evident that if one avoids making a decision in cases one judges to be difficult and removes these from the statistics, then one will be left with cases which are generally easier and one will therefore achieve a better correct-decision rate. I also believe that in fact all practitioners, irrespective of their approach, decline to perform analyses when a priori they believe that the quantity or quality of the recorded material is such that their system is unlikely to produce a high strength of evidence in either direction. I believe that it would be inappropriate to proceed with a full analysis without at least making the client aware of the likely limitations. An attack on this practice per se I would not consider appropriate, but neither would I consider it appropriate to make unsubstantiated claims that this practice will eliminate or virtually eliminate errors. If this practice is part of casework conditions, then it should be included when assessing the degree of validity and reliability of a forensic system under casework conditions.
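The error-type bookkeeping above, and the effect on reported rates of removing low-confidence or "no decision" trials from the statistics, can be illustrated with a minimal sketch in Python. The trials, confidence values, and threshold below are hypothetical illustrations, not Tosi et al.'s materials or procedure.

    # Classify open-set trial outcomes into error types A-D (as defined above) and compute
    # the "false elimination" (B) rate and Tosi et al.'s conflated "wrong matches" (A+C+D)
    # rate, optionally excluding low-confidence trials from the statistics.
    def classify(suspect_is_offender, response):
        # response: "suspect", "foil", or "none" (examiner says offender not in lineup)
        if suspect_is_offender:
            return {"suspect": "correct", "foil": "A", "none": "B"}[response]
        return {"suspect": "C", "foil": "D", "none": "correct"}[response]

    def rates(trials, min_confidence=0.0):
        # trials: list of (suspect_is_offender, response, confidence) tuples
        decided = [t for t in trials if t[2] >= min_confidence]
        outcomes = [classify(t[0], t[1]) for t in decided]
        n = len(outcomes)
        false_elimination = outcomes.count("B") / n
        wrong_matches = sum(outcomes.count(e) for e in "ACD") / n
        return false_elimination, wrong_matches, n

    trials = [
        (True,  "suspect", 0.9),   # correct decision
        (True,  "none",    0.4),   # error B (miss)
        (False, "foil",    0.8),   # error D
        (False, "none",    0.7),   # correct decision
    ]
    print(rates(trials))                      # rates over all trials
    print(rates(trials, min_confidence=0.5))  # rates after dropping low-confidence trials

As the second call shows, dropping the harder, low-confidence trials from the denominator mechanically improves the reported error rates; this is the inflation effect discussed in the preceding paragraph.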

3.5. Conversions and reasons for conversion

It is interesting to note that although some proponents of the spectrographic or aural–spectrographic approach seem to have believed in its efficacy more as a matter of faith than on the basis of evidence, there were a number of researchers and practitioners who converted to or away from supporting the use of the approach on the basis of their assessment of the results of empirical tests of its validity. Even if I were to disagree with someone else's assessment as to the extent to which the results of such tests constituted convincing evidence, I would still consider this a rational basis on which to make such a decision. Both Oscar Tosi and Peter Ladefoged were initially of the opinion that the validity of the spectrographic approach had not been proven, but later became supporters of the aural–spectrographic approach, apparently on the basis of the studies reported in Tosi et al. [25] and their own personal experience (see Ladefoged [34], and Tosi [23, pp. 137, 138, 140]). Ladefoged, however, seemed to be particularly concerned that the validity of the aural–spectrographic approach not be overstated, and regarded the 6% "wrong matches" rate as a minimum error rate (Ladefoged [34], Gruber & Poza [18, §7], Solan & Tiersma [35, pp. 418–420]).

The FBI began using the aural–spectrographic approach in the 1950s or early 1960s, it commissioned the 1979 NRC report, and continued using the approach until 2011 (Koenig [36], Archer [37]). Throughout that period the approach was used for investigative purposes but not for presentation of evidence in court (Koenig [36], Archer [37]). As of 2012 the FBI no longer uses the aural–spectrographic approach (p.c. Hirotaka Nakasone, Senior Scientist, Digital Evidence Section, FBI, 30 November 2011). In the late 1990s the FBI got involved in automatic-speaker-recognition (ASR) research, and now uses an ASR-based approach to forensic voice comparison, again for investigative purposes only (Archer [37]). Why did the FBI move away from the aural–spectrographic approach and towards an ASR approach? According to Hirotaka Nakasone, the consensus in the laboratory was that16:

• an ASR approach uses quantitative measurements, data, and statistical models, rather than the aural–spectrographic approach's subjective decisions, and is therefore a priori considered more reliable (a generic score-based sketch follows this list as an illustration)
• an ASR approach allows for easier testing of within- and between-analyst reliability
• ASR approaches in theory satisfy three out of five Daubert criteria (they have not yet been tested at a Daubert hearing), whereas the aural–spectrographic approach satisfies none
• ASR approaches have ample support from the scientific community, which has abandoned the aural–spectrographic approach
• scientists and engineers around the world are conducting serious research and development on ASR algorithms, whereas little or no research is being done on the aural–spectrographic approach
• ASR approaches use recordings of spontaneous speech, whereas the aural–spectrographic approach requires verbatim voice samples which are difficult if not impossible to obtain
• an ASR approach can perform a much larger number of comparisons in a given time
• ASR approaches can be applied regardless of the language spoken
• training an ASR analyst is easier and less time consuming, requiring training in fewer disciplines.
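The first point in the list above can be unpacked with a minimal, purely illustrative sketch of one generic way in which a score-based system uses quantitative measurements, data drawn from a reference database, and statistical models. This is not a description of the FBI's system or of any particular ASR product; the scores, the Gaussian models, and all the numbers are hypothetical simplifications.

    # Toy score-to-likelihood-ratio calculation: model same-speaker and different-speaker
    # comparison scores from a reference database, then convert a case score into a
    # likelihood ratio. All values are hypothetical; real systems use more sophisticated
    # measurements, models, and calibration.
    import statistics
    from math import exp, pi, sqrt

    def gaussian_pdf(x, mu, sigma):
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

    same_speaker_scores      = [2.0, 1.0, 3.0, 1.5, 2.5]      # hypothetical reference scores
    different_speaker_scores = [-1.0, 0.0, -2.0, -0.5, -1.5]  # hypothetical reference scores

    mu_s, sd_s = statistics.mean(same_speaker_scores), statistics.stdev(same_speaker_scores)
    mu_d, sd_d = statistics.mean(different_speaker_scores), statistics.stdev(different_speaker_scores)

    case_score = 1.2
    lr = gaussian_pdf(case_score, mu_s, sd_s) / gaussian_pdf(case_score, mu_d, sd_d)
    print(f"likelihood ratio for the case score: {lr:.1f}")

Because every step in such a system is explicit and repeatable, it can be run over large sets of test pairs, which is what makes empirical testing of validity and reliability, and of within- and between-analyst reliability, comparatively easy.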

3.6. Is the fact that a spectrogram is used a key aspect of the criticism of the approach?

With respect to the spectrographic and aural–spectrographic approaches, is the fact that a spectrogram is used a key aspect of the criticism of the approach? I would argue that it is not. As I understand it, sober criticisms have always centered on the issue of whether the degree of validity and reliability of the approach has been demonstrated under conditions reflecting those of casework. If a forensic scientist did not use spectrograms, but, for example, instead measured formant values from tokens of a number of vowel phonemes, plotted these on a first-formant by second-formant (F1 by F2) plot, and then on the basis of looking at these plots made an experience-based subjective opinion, this approach would be subject to the same criticisms. Whether such an approach were deemed acceptable should depend on whether it had been tested under casework conditions and found to be sufficiently valid and reliable. The same criterion should apply even if a graphic representation is not used at all. The same criterion should apply to a purely auditory approach, to looking at a table of numbers, and, as argued earlier, to an approach based on data representative of the relevant population, quantitative measurements, and statistical models.

16 The following is a paraphrase of a written personal communication from Nakasone received 26 November 2012. Drafts of the paraphrase were sent to Nakasone and revised on the basis of feedback received from him on 9 December 2012.


Under cross-examination I was asked by a lawyer whether there were studies which had shown that the aural–spectrographic approach was more valid than the spectrographic approach (it seemed to be a question from 1972 rather than 2012). The expert that this lawyer had called had used the aural–spectrographic approach, and what the lawyer seemed to be implying was that criticisms of the spectrographic approach did not apply because that expert had listened as well as looked. There was also an argument made that the expert had actually formed their opinion on the basis of listening, and had only used the spectrograms to confirm that opinion. What the lawyer seemed to have failed to understand (or willfully ignored) was that the key point in my testimony with respect to the expert's approach had been that they had failed to present any evidence of the validity and reliability of their approach under the conditions of the particular case at trial and with respect to the relevant population for this case (or under any other conditions or with respect to any other population for that matter). The issue of the use of spectrograms per se was actually a red herring. The opposing lawyer, who tried to have the aural–spectrographic approach ruled inadmissible, may also have fallen into this trap. What one needs to focus on is principles; one should not get fixated on approaches.

3.7. The IAFPA resolution

I was quite surprised when I critiqued the two 2012 reports in which the aural–spectrographic approach had been used, because I knew that the authors of both reports were members of the International Association for Forensic Phonetics and Acoustics (IAFPA),17 and I thought that in 2007 IAFPA had issued a statement to the effect that its members should not use this approach. The resolution is as follows:

IAFPA dissociates itself from the approach to forensic speech comparison known as the "voiceprint" or "voicegram" method in the sense described in Tosi (1979). This approach to forensic speaker identification involves the holistic, i.e., non-analytic, comparison of speech spectrograms in the absence of interpretation based on understanding of how spectrographic patterns relate to acoustic reflexes of articulatory events and vocal tract configurations. The Association considers this approach to be without scientific foundation, and it should not be used in forensic casework. [38]

After a brief conversation with the president of IAFPA (p.c. J. Peter French 24 October 2012, who was in office in 2007 and has been continually ever since), I came to realize that my interpretation of the resolution as an outright ban on the use of the aural–spectrographic approach was not in fact what the drafters and endorsers had intended. Rather, the resolution was specifically restricted to: "the method in the sense described in Tosi (1979)… involv[ing] the holistic, i.e., non-analytic, comparison of speech spectrograms in the absence of interpretation based on understanding of how spectrographic patterns relate to acoustic reflexes of articulatory events and vocal tract configurations." But what does this mean? Let us unpack the IAFPA resolution.

First, Tosi disliked the term "voiceprint" but recognized that it referred to the aural–spectrographic approach, the same general approach which he used:

[In this book] special emphasis was placed on the popularly, and wrongly, named "voiceprinting" method because it is the only method presently used for legal evidence.
This author, an expert witness on voice identification, refers to this method as "aural and spectrographic examination of speech samples," that is, the aural examination of tape recordings and the visual examination of their spectrograms. [23, p. ix]

17 For the record, I am also a member of this association, but joined after the resolution was passed. The minutes of the IAFPA Annual General Meeting on 24 July 2007 indicate that the resolution was passed by 22 in favor, 3 abstentions, and 0 against.


"Voiceprint" was a trademark owned in the 1960s and early 70s by Voiceprint Laboratories, Inc., a company established by Lawrence Kersta. If not the originator, Kersta was definitely the popularizer of the spectrographic approach, and was an advocate of its use in visual-only mode. Tosi cited criticisms of Kersta for making unsubstantiated claims of infallibility (Tosi [23, pp. 68–69], see also Gruber & Poza [18, §10]). The term "voicegram" does not appear in Tosi [23]. The 1979 NRC report used the term "voicegram" rather than "voiceprint" to avoid the implicit suggestion of association with "fingerprint" examination which was perceived to have high validity [29, pp. 6–7].

Second, what does "holistic, i.e., non-analytic" mean? My best guess is that it refers to a gestalt approach to auditory and visual comparison. This is an approach recommended by Poza & Begault [39], but it is not, at least on the face of it, the approach recommended by Tosi [23]. Tosi [23] discussed a number of acoustic–phonetic features considered in earlier aural and spectrographic studies (p. 43), and it is also clear that he expected examiners to follow the protocol promulgated by the International Association of Voice Identification (IAVI). This required examiners to auditorily compare features such as "melody pattern, pitch, quality, respiratory grouping of words, and any peculiar common features" and to spectrographically compare "mean frequencies and apparent bandwidths (clarity) of formants, rates of change of formant frequencies, levels of components between formants, type of vertical striations and distances between them, spectral distributions of fricatives and plosives, gaps of plosives, and voice onset times of vowels following plosives" (quoted from the 1979 NRC report [29, p. 77]).18 Tosi [23] provided examples of the application of the spectrographic part of his approach (pp. 118–127) which made use of multiple acoustic–phonetic features on tokens of different words appearing in the recordings. He gave a numeric rating (from −10 to +10) to his subjective evaluation of the similarity of each word and a description of the observed similarities/differences which led him to assign each rating, then averaged the ratings to provide a final score on the basis of which he made his decision (a minimal illustrative sketch of this rate-and-average logic is given after the footnotes below). One could query the reliability of his assignment of ratings or the appropriateness of the function he used to combine these into a single score, but this approach is clearly not gestalt, holistic, or non-analytic. Even if the approach were gestalt, as in Poza & Begault [39], would that in and of itself be unacceptable? As I argued earlier, what matters is the degree of validity and reliability of a system under conditions reflecting those of the case under investigation. If a gestalt approach were found to have better validity and reliability than any other approach, then that is the approach which should be preferred.

Third, does the aural–spectrographic approach lack "interpretation based on understanding of how spectrographic patterns relate to acoustic reflexes of articulatory events and vocal tract configurations"? Let us say that this may have been true and may still be true of some practitioners of the approach (I would even accept something more definitive than "may"),19 but the IAFPA resolution claims, or at least implies, that this was the case for Tosi's [23] approach.

18 Similar lists of features appeared in a protocol promulgated by the American Board of Recorded Evidence (ABRE) [40], and in a description of the FBI's protocol in Koenig [36]. 19 Let us consider, for the sake of argument, that the intention of the IAFPA resolution was to ban the use of the aural–spectrographic approach by individuals who lacked sufficient training and qualifications in phonetics but allow it by individuals with sufficient training and qualifications in phonetics. Who would decide what qualifications were necessary and who was sufficiently qualified? Would being qualified in and of itself be sufficient to make the approach and procedures used by the qualified individual acceptable for any particular case? The central thesis of the present paper is that the acceptability of a particular approach should not be based on these sorts of considerations. It should instead be based on first establishing the principles required for acceptability independent of any approach, then considering whether the particular approach conforms to those principles. In particular the principle proposed is that the validity and reliability of the approach should be tested under conditions reflecting those of the case under investigation and that the results of those tests should be presented to the judge at an admissibility hearing and/ or the trier of fact at trial who can decide if the performance of the system is sufficient to provide them with useful information.
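The word-by-word rating-and-averaging procedure described above can be made concrete with a minimal sketch. The words, ratings, and decision thresholds below are hypothetical illustrations, not Tosi's actual materials or criteria; the point is only to show that the procedure is analytic rather than holistic, and that the aggregation step is simple enough to be examined and tested.

    # Rate-and-average logic: each compared word receives a subjective similarity rating
    # from -10 to +10, the ratings are averaged into a single score, and a categorical
    # conclusion is read off the score. All values here are hypothetical.
    word_ratings = {
        "hello": 6,       # strong similarity observed (hypothetical)
        "money": 2,       # weak similarity (hypothetical)
        "tomorrow": -3,   # some differences observed (hypothetical)
    }

    score = sum(word_ratings.values()) / len(word_ratings)

    if score >= 4:                 # hypothetical decision thresholds
        decision = "same speaker"
    elif score <= -4:
        decision = "different speakers"
    else:
        decision = "no decision"

    print(f"average rating: {score:.1f} -> {decision}")

Whether such ratings can be assigned consistently, and whether a simple average is an appropriate combination function, are exactly the kinds of questions that empirical testing of validity and reliability under casework conditions is meant to answer.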



Is that claim true? I think not. Oscar Tosi, among other qualifications and appointments, had a PhD in audiology, speech sciences, and electronics from Ohio State University, and was Director of the Speech and Hearing Research Laboratory at Michigan State University (1979 NRC report [29, p. 161]). Chapter 2 of Tosi [23] is on Acoustics, phonetics, and theory of voice production, and clearly demonstrates an understanding of "how spectrographic patterns relate to acoustic reflexes of articulatory events and vocal tract configurations". Education and demonstrated knowledge may not be definitive indicators that a practitioner has integrated these into their practice, but it is probably reasonable to assume that they are correlated.

Finally, the IAFPA resolution states that "The Association considers this approach to be without scientific foundation", but fails to specify what it considers to constitute a "scientific foundation". Perhaps I am being too literal, perhaps we all know what the drafters and endorsers of the resolution really meant, but I'm not convinced that this is the case. There is clearly a demarcation problem here, and I don't know what the conditions are under which an IAFPA member is or is not permitted to use the spectrographic or aural–spectrographic approach.

My criticisms of the IAFPA resolution are not meant as an endorsement of Tosi's [23] aural–spectrographic approach. There are appropriate criticisms which can be made of the approach. Tosi [23] is far from unbiased, and at a minimum makes convenient omissions. Rather, my point is that the IAFPA resolution failed to address the issue on the level of principles, and came unstuck on a direct attack on the approach.

4. The auditory–acoustic–phonetic approach

Given IAFPA's attack on "the approach to forensic speech comparison known as the 'voiceprint' or 'voicegram' method", I think it is fair to examine what approach might be recommended by the majority or a large proportion of IAFPA members. There is no single approach officially endorsed by IAFPA, but I believe that the Position Statement concerning use of impressionistic likelihood terms in forensic speaker comparison cases (French & Harrison [41]) is representative of the practice of a substantial proportion of its members. The Position Statement was endorsed by 25 IAFPA members, claimed to be all except one of the "practising forensic speech scientists and interested academics within the UK" (French & Harrison [41, p. 138]), hence it has come to be known as the UK Position Statement. According to the minutes of the 2007 IAFPA Annual General Meeting, the association had 82 members including 6 students, so the 25 signatories of the Position Statement represented approximately a third of the membership (excluding students, who presumably were not counted as practicing speech scientists or interested academics), and it appears that IAFPA members living outside the United Kingdom (the other two thirds of the membership) were not invited to sign.

We have previously been critical of the UK Position Statement (Rose & Morrison [42], Morrison [28, §2.5], Morrison [12, §99.400]; and see response in French et al. [43]), and I will not repeat our whole set of criticisms in the present paper. Rather I want to focus on the approach espoused in the UK Position Statement from the perspective of what I consider to be appropriate principles as to what constitutes acceptable forensic-science practice. The focus of the UK Position Statement is a framework for the evaluation of evidence, but it also assumes an experience-based "auditory–acoustic–phonetic" approach (French et al. [43, p. 150]). "It may also involve statistical analysis of the features found" (French & Harrison [41, p. 138]), but there is no elaboration as to the nature of this possible statistical analysis, and databases representative of the relevant population are not used: "we consider the lack of demographic data along with the problems of defining relevant reference populations as grounds for precluding the quantitative application of this type of approach [the likelihood-ratio framework] in the present context" (French & Harrison [41, p. 142]). Note that there are no explicit references to the spectrographic/aural–spectrographic approach in either French & Harrison [41] or French et al. [43].

In the survey on International practices in forensic speaker comparison reported in Gold & French [44], 71% of respondents (25 of 35) reported using an auditory–acoustic–phonetic approach (it is not clear whether this category subsumed the aural–spectrographic approach, the latter was not explicitly mentioned), and another 6% (2 of 35) reported using an auditory-only approach. 70% of respondents reported using "some form of population statistics in arriving at their conclusions" (Gold & French [44, p. 299]). The survey included practitioners from outside the UK and likely included practitioners who were not members of IAFPA, but I am still unable to reconcile the latter figure with the statement in the UK Position Statement about the lack of data and problem of defining the relevant population. Also, I do not know how respondents use "population statistics" and combine them with the experience-based auditory element of their approach.

My key concern, irrespective of the approach, is whether the validity and reliability of the forensic–voice–comparison system has been assessed under conditions reflecting those of the case under investigation using test samples drawn from the relevant population. Gold & French [44] did not report on this issue. They did, however, note that approaches based on automatic-speaker-recognition techniques (used by only 20% of the respondents, 7 of 35) lend themselves more readily to being tested. Testing of validity and reliability was not even mentioned in the description of the UK Position Statement set out in French & Harrison [41]. The word "reliable" occurred once in French et al. [43], "we are of the view that it is unrealistic to see it as merely a matter of time and research before a rigorously and exclusively quantitative LR approach can be regarded as feasible, let alone reliable,…" (pp. 149–150), but there was no discussion of the validity or reliability of their own approach. Ultimately I do not know what proportion of practitioners of auditory–acoustic–phonetic approaches test the validity and reliability of their systems under conditions reflecting those of the case at trial using samples drawn from the relevant population, but the lack of indication that this is a widespread practice is disturbing if one considers it essential practice.

I would further contend that the approach advocated in the UK Position Statement is open to many of the same criticisms as have previously been levied against the aural–spectrographic approach. As a demonstration, I provide the following quotes from French & Harrison [41] and French et al. [43] juxtaposed with quotes criticizing the aural–spectrographic approach.

UK Position Statement:

In considering consistency one would assess the degree to which observable features were similar or different. (French & Harrison [41, p. 141])

Criticism of the aural–spectrographic approach:

The problem here is that… it is never stated what the criteria for similarity are. (Rose [22, p. 114])

UK Position Statement:

This assessment… involves 'separating out' the samples into their constituent phonetic and acoustic 'strands' (e.g., voice quality, intonation, rhythm, tempo, articulation rate, consonant and vowel realisations) and analysing each one separately. (French & Harrison [41, p. 138])

Amongst the features commonly considered in speaker comparison cases are the following:

1. Vocal setting and voice quality. Full analysis… distinguishes phonation features, overall muscular tension features and vocal tract features, with up to 38 individual elements to be considered.
2. Intonation, potentially including analysis of tone unit nuclei, heads and tails.

3. Pitch, measured as average and variation in fundamental frequency.
4. Articulation rate.
5. Rhythmical features.
6. Connected speech processes such as patterns of assimilation and elision.
7. A large set of consonantal features, including energy loci of fricatives and plosive bursts, durations of nasals, liquids, and fricatives in specific phonological environments, voice onset time of plosives, presence/absence of (pre-)voicing in lenis plosives, and discrete sociolinguistic variables.
8. A large set of vowel features, including acoustic patterns such as formant configurations, centre frequencies, densities, and bandwidths, and auditory qualities of sociolinguistic variables.
9. Higher-level linguistic information including use and patterning of discourse markers, lexical choices, morphological and syntactic variants, pragmatic behaviour such as turn-taking and telephone call opening habits, aspects of multilingual behaviour such as code-switching.
10. Evidence of speech impediment, voice and language pathology.
11. Non-linguistic features characteristic of the speaker, for example patterns of audible breathing, throat-clearing, tongue clicking, and both filled and silent hesitation phenomena. (French et al. [43, pp. 146–147])


Criticism of the aural–spectrographic approach:

As Gruber and Poza (1995: section 59 fn11) point out, these features, although sounding impressively scientific, are not selective but exhaustive, and the protocol amounts to instructions to 'examine all the characteristics that appear on a spectrogram'. Likewise, they point out that the instructions as to aural cues also amount to telling the examiner to take everything he hears into account. Finally, they point out that it is no use just to name the parameters of comparison: one has to be specific about what particular aspect of the feature it is that is to be compared. (Rose [22, pp. 113–114])

The lists of features to analyze provided by the proponents of the UK Position Statement look remarkably similar, though not identical, to lists of features which appeared in protocols for the aural–spectrographic approach. Also, unlike Tosi [23], which the IAFPA resolution criticized for being "holistic, i.e., non-analytic", the UK Position Statement fails to explain how the separately analyzed "strands" are ultimately to be combined to provide a single assessment of the strength of evidence.

UK Position Statement:

the rigour and detail of the analysis, together with the education, training and experience pre-requisite to carrying it out, put it well beyond the resources of a layman simply listening to the material. Additionally, by drawing upon research literature and general experience, the analyst may provide an assessment of the degree to which the features common to the questioned voice and that of the suspect are unusual or distinctive. (French & Harrison [41, p. 138])

Criticism of the aural–spectrographic approach:

the competency of forensic examiners, both in absolute terms and relative to laypersons who just listen to voices, is largely unknown;… to assert that the individual examiner's experience, combined with his competence and talent, should, in the end, override any concerns about the problems associated with subjective decision making is to make a very questionable assumption. (Gruber & Poza [18, §6])

Proponents often claim that an examiner's "experience" will enable him or her to distinguish inter- from intra-talker characteristics,… Putting forth "experience" as the answer to difficult scientific questions, without the accompaniment of compelling empirical scientific data, has done little to quell the controversy surrounding this technique. (Gruber & Poza [18, §7])

"'voiceprint' enthusiasts" should follow the example of those working on automatic and semi-automatic speaker identification systems and resist premature application of their technique in court and in forensic investigation; specifically… they should apply a "moratorium" to their activities until they can unequivocally demonstrate that their system provides acceptable identification levels… (Nolan [45, p. 25])

Although too small to be contemplated as a representative sample, I critiqued reports written by three IAFPA members last year (2012). Two used aural–spectrographic approaches20 and one used an approach based on data, quantitative measurements, and statistical models. None of the three presented tests of the validity and reliability of their approach, and one even claimed that such testing was impossible.

20 In addition to listening and looking at spectrograms, they also reported some quantitative fundamental–frequency measurements. Thus perhaps their approach should be called aural–spectrographic–acoustic–phonetic, but it doesn't matter what it is called; what matters is whether it has been tested under conditions reflecting those of the case at trial using test samples drawn from the relevant population, and its validity and reliability found to be sufficient.

5. Conclusion

I have argued that if one wants to determine whether a particular approach to forensic analysis is acceptable, one should first specify what one considers to be the principles governing what would be acceptable. Once this has been done, the same principles can be applied to all approaches which one may want to consider. In my opinion, one of the key principles is that the validity and reliability of the approach be empirically tested under conditions reflecting those of the case under investigation using test samples drawn from the relevant population. This (or a very similar principle) was also proposed in each of the Acoustical Society of America Speech Communication Technical Committee study group's 1970 report on Speaker identification by speech spectrograms: a scientists' view of its reliability for legal purposes (Bolt et al. [27]), the 1979 National Research Council report On the theory and practice of voice identification [29], the 2009 National Research Council report on Strengthening forensic science in the United States [2], and the 2012 National Institute of Standards and Technology/National Institute of Justice report on Latent print examination and human factors [30].

I have considered the aural–spectrographic approach to forensic voice comparison from the perspective of this principle, and the auditory–acoustic–phonetic approach (it does not appear that these two approaches are mutually exclusive), and also approaches based on data, quantitative measurements, and statistical models. In the end I will refrain from making an explicit statement as to whether I think any of these approaches are acceptable. What I want to emphasize instead is that, for whatever approach in whatever branch of forensic science, this decision should be based on principles. I believe that a key principle is that the forensic scientist should test the validity and reliability of the approach under conditions reflecting those of the case under investigation using test samples drawn from the relevant population, that the forensic scientist should present the results of such testing to the judge at an admissibility hearing and/or the trier of fact during a legal trial, and that it is the judge/trier of fact who should ultimately determine acceptability on the basis of the test results.

Acknowledgments

This research was supported by the Australian Research Council, Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Science, Australasian Speech Science and Technology Association, and the Guardia Civil through Linkage Project

LP100200142. Unless otherwise explicitly attributed, the opinions expressed are those of the author and do not necessarily represent the policies or opinions of any of the above mentioned organizations. References [1] G. Edmond, D. Mercer, Trashing “junk science”, Stanford Technology Law Review (3) (1998) 1–86, (http://stlr.stanford.edu/STLR/Articles/98_STLR_3). [2] National Research Council, Strengthening Forensic Science in the United States: A Path Forward, National Academies Press, Washington, DC, 2009. http://www. nap.edu/catalog.php?record_id=12589. [3] A. Cediel, L. Bergman, The Real CSI: How Reliable is the Science Behind Forensics? PBS Frontline, WGBH Educational Foundation, Boston, MA, April 17 2012. (http:// www.pbs.org/wgbh/pages/frontline/real-csi/). [4] S.A. Cole, Acculturating forensic science: what is ‘scientific culture’, and how can forensic scientists adopt it? Fordham Urban Law Journal 38 (2010) 435–472, (http://ssrn.com/abstract=1788414). [5] G.S. Morrison, I.W. Evett, S.M. Willis, C. Champod, C. Grigoras, J. Lindh, N. Fenton, A. Hepler, C.E.H. Berger, J.S. Buckleton, W.C. Thompson, J. González-Rodríguez, C. Neumann, J.M. Curran, C. Zhang, C.G.G. Aitken, D. Ramos, J.J. Lucena-Molina, G. Jackson, D. Meuwly, B. Robertson, G.A. Vignaux, Response to Draft Australian Standard: DR AS 5388.3 Forensic Analysis — Part 3 — Interpretation, 2012. (http://forensic-evaluation.net/australian-standards/#Morrison_et_al_2012). [6] G.S. Morrison, The likelihood-ratio framework and forensic evidence in court: a response to R v T, International Journal of Evidence and Proof 16 (2012) 1–29, http://dx.doi.org/10.1350/ijep.2012.16.1.390. [7] G.S. Morrison, F. Ochoa, T. Thiruvaran, Database selection for forensic voice comparison, Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop, International Speech Communication Association, Singapore, 2012, pp. 62–77. http:// geoff-morrison.net/documents/Morrison,%20Ochoa,%20Thiruvaran%20(2012)%20 Database%20selection%20for%20forensic%20voice%20comparison.pdf. [8] G.S. Morrison, Measuring the validity and reliability of forensic likelihood-ratio systems, Science & Justice 51 (2012) 91–98, http://dx.doi.org/10.1016/j.scijus.2011.03.002. [9] I.W. Evett, C.G.G. Aitken, C.E.H. Berger, J.S. Buckleton, C. Champod, J.M. Curran, A.P. Dawid, P. Gill, J. González-Rodríguez, G. Jackson, A. Kloosterman, T. Lovelock, D. Lucy, P. Margot, L. McKenna, D. Meuwly, C. Neumann, N. Nic Daeid, A. Nordgaard, R. Puch-Solis, B. Rasmusson, M. Radmayne, P. Roberts, B. Robertson, C. Roux, M.J. Sjerps, F. Taroni, T. Tjin-A-Tsoi, G.A. Vignaux, S.M. Willis, G. Zadora, Expressing evaluative opinions: a position statement, Science & Justice 51 (2011) 1–2, http://dx.doi.org/10.1016/j.scijus.2011.01.002. [10] B. Robertson, G.A. Vignaux, Interpreting Evidence, Wiley, Chichester UK, 1995. [11] D.J. Balding, Weight-of-Evidence for Forensic DNA Profiles, Wiley, Chichester, UK, 2005. [12] G.S. Morrison, Forensic voice comparison, in: I. Freckelton, H. Selby (Eds.), Expert Evidence, Thomson Reuters, Sydney, Australia, 2010, (ch. 99). http://www. thomsonreuters.com.au/forensic-voice-comparison-expert-evidence/product detail/91156. [13] G.S. Morrison, Tutorial on logistic-regression calibration and fusion: converting a score to a likelihood ratio, Australian Journal of Forensic Sciences 45 (2013) 173–197, http://dx.doi.org/10.1080/00450618.2012.733025. [14] I.W. Evett, Interpretation: a personal odyssey, in: C.G.G. Aitken, D.A. 
Stoney (Eds.), The Use of Statistics in Forensic Science, Ellis Horwood, Chichester UK, 1991, pp. 9–22. [15] G.S. Morrison, P. Rose, C. Zhang, Protocol for the collection of databases of recordings for forensic–voice–comparison research and practice, Australian Journal of Forensic Sciences 44 (2012) 155–167, http://dx.doi.org/10.1080/00450618.2011.630412. [16] H. Hollien, Forensic Voice Identification, Academic Press, San Diego CA, 2002. [17] B.E. Koenig, Review of Hollien (2002) Forensic voice identification, Journal of Forensic Identification 52 (2002) 762–766. [18] J.S. Gruber, F. Poza, Voicegram identification evidence, American Jurisprudence Trials 54 (1) (1995) §1–§133. [19] D. Meuwly, Reconnaissance de locuteurs en sciences forensiques: L'apport d'une approche automatique. Doctoral dissertation University of Lausanne, 2001. (www. unil.ch/webdav/site/esc/shared/These.Meuwly.pdf). [20] D. Meuwly, Le mythe de l'empreinte vocale I, Revue Internationale de Criminologie et Police Technique 56 (2003) 219–236. [21] D. Meuwly, Le mythe de l'empreinte vocale II, Revue Internationale de Criminologie et Police Technique 56 (2003) 361–374.

[22] P. Rose, Forensic Speaker Identification, Taylor and Francis, London, UK, 2002. [23] O. Tosi, Voice Identification: Theory and Legal Applications, University Park Press, Baltimore, MD, 1979. [24] G. McGregor, S. Maher, Tories now admit they sent Saskatchewan robocall: Forensic expert links company behind latest push poll to firm behind Pierre Poutine calls, Ottawa Citizen, February 5 2013. (http://www.ottawacitizen.com/news/Tories+admit+they+sent+Saskatchewan+robocall/7922470/story.html). [25] O. Tosi, H. Oyer, L. Lashbrook, C. Pedrey, J. Nicol, E. Nash, Experiment on voice identification, Journal of the Acoustical Society of America 51 (1972) 2030–2043, http://dx.doi.org/10.1121/1.1913064. [26] J.M. Pickett, Annual report of the Speech Communication Technical Committee, Journal of the Acoustical Society of America 46 (1969) 687–688. [27] R.A. Bolt, F.S. Cooper, E.E. David Jr., P.B. Denes, J.M. Pickett, K.N. Stevens, Speaker identification by speech spectrograms: a scientists' view of its reliability for legal purposes, Journal of the Acoustical Society of America 47 (1970) 597–612, http://dx.doi.org/10.1121/1.1911935. [28] G.S. Morrison, Forensic voice comparison and the paradigm shift, Science & Justice 49 (2009) 298–308, http://dx.doi.org/10.1016/j.scijus.2009.09.002. [29] National Research Council, On the Theory and Practice of Voice Identification, National Academies Press, Washington, DC, 1979. http://books.google.ca/books?id=FjMrAAAAYAAJ. [30] Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice through a Systems Approach, US Department of Commerce, National Institute of Standards and Technology, Gaithersburg, MD, 2012. (http://www.nist.gov/manuscript-publication-search.cfm?pub_id=910745). [31] P.R. Bevington, D.K. Robinson, Data Reduction and Error Analysis for the Physical Sciences, 3rd edition, McGraw Hill, Boston, MA, 2003. [32] R.A. Bolt, F.S. Cooper, E.E. David Jr., P.B. Denes, J.M. Pickett, K.N. Stevens, Speaker identification by speech spectrograms: some further observations, Journal of the Acoustical Society of America 54 (1973) 531–534, http://dx.doi.org/10.1121/1.1913613. [33] H.F. Greene, Voiceprint identification: the case in favor of admissibility, American Criminal Law Review 13 (1975) 171–200. [34] P. Ladefoged, An opinion on voiceprints, UCLA Working Papers in Phonetics, 19, 1971, pp. 84–87. http://escholarship.org/uc/item/4k81b31v. [35] L.M. Solan, P.M. Tiersma, Hearing voices: speaker identification in court, Hastings Law Journal 54 (2003) 373–435. [36] B.E. Koenig, Spectrographic voice identification: a forensic survey, Journal of the Acoustical Society of America 79 (1986) 2088–2090, http://dx.doi.org/10.1121/1.393170. [37] C. Archer, HSNW conversation with Hirotaka Nakasone of the FBI: voice recognition capabilities at the FBI — from the 1960s to the present, Homeland Security News Wire, July 11 2012. (http://www.homelandsecuritynewswire.com/bull20120711voice-recognition-capabilities-at-the-fbi-from-the-1960s-to-the-present). [38] International Association for Forensic Phonetics and Acoustics, Resolution on voiceprints, http://www.iafpa.net/voiceprintsres.htm July 24 2007. [39] F.T. Poza, D.R. Begault, Voice identification and elimination using aural–spectrographic protocols, Proceedings of the Audio Engineering Society 26th International Conference: Audio Forensics in the Digital Age, 2005 (paper 1-1). [40] American Board of Recorded Evidence, Voice comparison standards, http://www.tapeexpert.com/pdf/abrevoiceid.pdf 1999. [41] J.P. French, P. Harrison, Position Statement concerning use of impressionistic likelihood terms in forensic speaker comparison cases, International Journal of Speech, Language and the Law 14 (2007) 137–144, http://dx.doi.org/10.1558/ijsll.v14i1.137. [42] P. Rose, G.S. Morrison, A response to the UK position statement on forensic speaker comparison, International Journal of Speech, Language and the Law 16 (2009) 139–163, http://dx.doi.org/10.1558/ijsll.v16i1.139. [43] J.P. French, F. Nolan, P. Foulkes, P. Harrison, L. McDougall, The UK position statement on forensic speaker comparison: a rejoinder to Rose and Morrison, International Journal of Speech, Language and the Law 17 (2010) 143–152, http://dx.doi.org/10.1558/ijsll.v17i1.143. [44] E. Gold, J.P. French, International practices in forensic speaker comparison, International Journal of Speech, Language and the Law 18 (2011) 143–152, http://dx.doi.org/10.1558/ijsll.v18i2.293. [45] F. Nolan, The Phonetic Bases of Speaker Recognition, Cambridge University Press, Cambridge, UK, 1983.
[40] American Board of Recorded Evidence, Voice comparison standards, http:// www.tapeexpert.com/pdf/abrevoiceid.pdf 1999. [41] J.P. French, P. Harrison, Position Statement concerning use of impressionistic likelihood terms in forensic speaker comparison cases, International Journal of Speech, Language and the Law 14 (2007) 137–144, http://dx.doi.org/10.1558/ijsll.v14i1.137. [42] P. Rose, G.S. Morrison, A response to the UK position statement on forensic speaker comparison, International Journal of Speech, Language and the Law 16 (2009) 139–163, http://dx.doi.org/10.1558/ijsll.v16i1.139. [43] J.P. French, F. Nolan, P. Foulkes, P. Harriaon, L. McDougall, The UK position statement on forensic speaker comparison: a rejoinder to Rose and Morrison, International Journal of Speech, Language and the Law 17 (2010) 143–152, http://dx.doi.org/10.1558/ijsll.v17i1.143. [44] E. Gold, J.P. French, International practices in forensic speaker comparison, International Journal of Speech, Language and the Law 18 (2011) 143–152, http://dx.doi.org/10.1558/ijsll.v18i2.293. [45] F. Nolan, The Phonetic Bases of Speaker Recognition, Cambridge University Press, Cambridge, UK, 1983.