6 The influence of bias

Chapter Outline
Introduction
Defining bias and understanding its relevance
Objectivity and subjectivity—Definition and interplay
Understanding the literature
Potential for and dealing with bias
Final thoughts
References
Introduction

In their 2009 report on the state of forensic science in the United States, the National Research Council devoted time to discussing the potential for bias and the negative impact bias could have in forensic science. Two primary issues stood out for them. The first was that the impact bias was having, or could have, on forensic science was largely unknown. “Unfortunately, at least to date, there is no good evidence to indicate that the forensic science community has made a sufficient effort to address the bias issue; thus, it is impossible for the committee to fully assess the magnitude of the problem” (National Research Council, 2009). For them, there was potential for a problem, but what was unknown was how large that problem might be. The second issue was directed at those disciplines that rely upon subjective interpretations of data to render an opinion. The Council stated that, “The forensic science disciplines need to develop rigorous protocols for performing subjective interpretations, and they must pursue equally rigorous research and evaluation programs. The development of such research programs can benefit significantly from work in other areas, notably from the large body of research that is available on the evaluation of observer performance in diagnostic medicine and from the findings of cognitive psychology on the potential for bias and error in human observers” (National Research Council, 2009). This points specifically to those disciplines relying on pattern matching, which include fingerprint comparisons, shoe print and tire print comparisons, blood spatter interpretation, and firearm and toolmark identification. The Council was especially critical of the AFTE Theory of Identification, suggesting that it does not provide a substantive criterion that can be relied upon to render a common source determination.
This concern for bias has been behind the movement of forensic science laboratories to separate themselves from law enforcement administration altogether. The Washington, DC, Metro police forensics laboratory was removed from police authority and placed under management that was independent of law enforcement. The forensic laboratory that was once part of the Houston Police Department became the Houston Forensic Science Center, still under city control but no longer under law enforcement authority. In an effort to respond to this concern regarding bias, some agencies have isolated their technical staff from any police reports regarding cases they are working unless information from those reports is absolutely necessary. In these cases, supervisors determine the information that is revealed, keeping the analysts away from the supposed influence of the reports, which some maintain would have a potential biasing effect. This discussion will look at bias, its definition, and its relevance. It will examine the interplay of objectivity and subjectivity in the firearm and toolmark discipline and how this interplay relates to bias. Published literature and studies dealing with bias in the pattern identification disciplines will be reviewed and discussed, including the significance of the work and its potential limitations. Finally, the potential for bias in a forensic laboratory, and specifically in firearm and toolmark identification, will be discussed, along with strategies to minimize its potential impact.
Defining bias and understanding its relevance

Bias is an interesting word because the word itself can cause an emotional reaction in people just by hearing it. When teaching on another subject, I once asked the class whether the word “consequences” brought up negative, positive, or neutral thoughts. The overwhelming answer was negative, which is understandable if that had been their general experience with the word. However, since consequences are simply the result of a choice put into action, consequences can be either good or bad, and the word itself is actually fairly neutral. There is no reason in particular that would cause one to think negatively or positively unless that person was predisposed to that thought based on prior experience. In fact, the discussion with respect to the word consequence is not only an example of bias, but is also an example of the type of reaction bias can cause in an individual. Ask someone who deals with fabric on a daily basis and they will probably think of bias as a good thing: cutting along that “bias” line, which is at a 45-degree angle to the fabric grain, makes for a better fit. Ask a statistician about bias and they may think of the systematic deviation of an estimate from the true value it is meant to capture. That is neither good nor bad, just a matter of doing business. Ask a group of scientists whether the word “bias” brings to mind negative or positive thoughts and the answer will almost assuredly be overwhelmingly negative. The reason is that scientists are supposed to be neutral, not biased one way or another. Indeed, the implication that a scientist is biased is almost certain to be met with strong confrontation because it strikes at the very core of being a scientist, especially a forensic scientist, where the evidence is supposed to “speak for itself.” However, that is primarily because bias is seen as
a conscious choice as opposed to the subconscious influence that it more often is. Looking at Merriam-Webster’s dictionary, one of the definitions for bias is, “An inclination of temperament or outlook; especially: a personal and sometimes unreasoned judgment: prejudice” (Web, 2017a). When one considers that a scientist should be viewing things as objectively as possible, it is understandable why bias can have such a negative connotation. If not controlled, bias can have a significant impact on the work being done. If the bias occurs in the context of a forensic case and deals with evidence, it could mean inappropriate or incomplete testing, inappropriate test results and, potentially, the implication of an innocent party. In fact, bias is one of the key factors cited by the Office of the Inspector General (OIG) in a misidentification of a fingerprint discovered during a terror bombing investigation (OIG, 2006). In March 2004, there was a bombing of commuter trains in Madrid, Spain. The Spanish National Police developed prints on a bag of detonators and sent images to the FBI to be searched against a nationwide database (IAFIS). In May 2004, the FBI arrested Brandon Mayfield, an Oregon attorney, based on the identification of one of the prints recovered from the plastic bag as his. Two weeks later, the Spanish National Police arrested an Algerian national based on comparisons using the same print that the FBI had identified as belonging to Mayfield, ironically on the same day that an independent expert confirmed the finding of the FBI. After reviewing the known prints of the Algerian national, the FBI withdrew the identification of Mayfield and released him from custody. Several potential reasons were cited by the OIG for the misidentification. The first was an unusual similarity between the prints of Brandon Mayfield and Ouhnane Daoud (the Algerian national). The second was bias, whereby the original examiner appeared to be influenced in his interpretation of the recovered latent print after viewing the known prints of Mayfield. In their Method for Fingerprint Identification, Interpol identified this as a potential issue (Interpol, 2003). Under definitions, they state, “A difference in appearance between compared fingerprints (or details of them) that is contributed to normal variations with printing can be tolerated. Tolerances should be applied consistently and honestly. Experts should be aware of the paradox that one may be inclined to accept more differences in bad prints under the umbrella of distortion than one would accept in better quality prints. Distortion not only limits the perception of the similar but also of the dissimilar. The pitfall is that a premature assumption of donorship leads to transplantation of data from the ‘original’ into the blur of the latent. It is circular reasoning like: this print comes from this donor, prints are unique, thus all data must be the same and subsequently all differences are not real.” There were five features of the latent print in particular that the original examiner had keyed in on and that were relevant to the identification of Mayfield to that print. However, the OIG pointed out that there was no evidence the examiner had identified these as significant prior to examining the known prints of Mayfield. At best, the OIG reported, these features were ambiguous or blurred in the latent print.
Further, of the seven points that were encoded by the examiner for the IAFIS search prior to having seen the Mayfield prints, the examiner changed the interpretation of either the type or location of five of those points after seeing the Mayfield print. The OIG was clear in pointing out the ramifications of this when they reported, “The bias that this
reinterpretation introduced can be appreciated by comparing Green’s original coding with the Daoud exemplars. For four of the five points that Green changed as to type or location after seeing the Mayfield prints, it turned out that his original interpretation was correct (i.e., was consistent with the Daoud exemplar)…For those points, Green’s original analysis, still unbiased by any comparison to Mayfield, was in fact more accurate than the adjusted interpretation he made after seeing the Mayfield exemplars” (OIG, 2006). Sir Francis Bacon wrote, “The human understanding when it has once adopted an opinion (either as being the received opinion or as being agreeable to itself) draws all things else to support and agree with it” (Bacon, 1620). In other words, there can be a tendency to seek, interpret, and remember information that confirms our preconceptions. Similarly, information that does not fall within the perceived framework is more easily dismissed, especially the more strongly we hold on to our preconceptions. This is a concern for those disciplines within forensic science that rely on pattern identification to determine a potential common source, such as fingerprint identification and firearm and toolmark identification. In order to understand the potential for bias in firearm and toolmark identification, it will be important to discuss the discipline in terms of objectivity and subjectivity.
Objectivity and subjectivity—Definition and interplay

One of the earliest struggles in firearm and toolmark identification was how to define it. Some individuals would refer to it as a science, while others said it was an art. Others would suggest that there were elements of both. If one looks at the earliest foundations of the forensic discipline, one can claim that, because it was developed out of the university system, there was some science at its foundation. At the same time, given the skill set necessary and the ultimately subjective interpretation of what one is observing, there is some support for suggesting that, at least in part, the discipline is an art as well. Yet if one reviews the literature, it appears that when the words science and art were being used, there was no agreed-upon consensus as to the meaning or intent of the authors who used them. Given that, and given how important these terms are to this discussion and to evaluating the published literature, it will be important to set a foundation of common understanding. Therefore, several terms will be defined and then, once defined, the discipline of firearm and toolmark identification will be discussed in that context. In the context of this chapter, science is defined as, “knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through the scientific method; such knowledge or such a system of knowledge concerned with the physical world and its phenomena” (Web, 2017b). It is also defined as “a system or method reconciling practical ends with scientific laws,” with cooking given as an example of both a science and an art (Web, 2017b). Art has many definitions, of which one seems particularly relevant for this discussion: “skill acquired by experience, study, or observation” (Web, 2017c). As one can see from
these definitions, science and art are not mutually exclusive of one another, and each has relevance for the discipline. Two more definitions are important. The first is the word objective. In the context of this chapter, objective is defined as, “of, relating to, or being an object, phenomenon, or condition in the realm of sensible experience independent of individual thought and perceptible by all observers; involving or deriving from sense perception or experience with actual objects, conditions, or phenomena” (Web, 2017d). This definition is discussed in terms of data, data that can be collected through the use of the scientific method. The second is the word subjective. Of the several definitions, one is particularly relevant for this discussion: “peculiar to a particular individual; personal subjective judgments: modified or affected by personal views, experience, or background” (Web, 2017e). One definition that clearly does not apply is “lacking in reality or substance” (Web, 2017e), which seems at times to be the implication of some critics of the discipline and of pattern matching in general.

By definition, science neither precludes subjectivity nor exclusively embraces objectivity. And yet, there are some who equate science with fact only, with no opinion, and “pure science” with objectivity only (Dror, 2013). The challenge has been, and always will be, that the natural sciences require the collection of data that at some point have to be interpreted by an individual. What seems more appropriate is Dror’s observation that, “Ideally all domains (not only across the forensic domains, but also medicine and others) will be purely scientific and objective. However, this is not realistic within our current knowledge and understandings…Of course, it is not a dichotomy of either being a domain that is totally objective or one that is purely a matter of opinion. It is a continuum, where domains are more or less objective” (Dror, 2013). Considering that firearm and toolmark identification is a forensic domain, the question to be examined is its place on that continuum. To examine this, we look at the actual practice of the discipline and the various objective and subjective elements involved.

When looking at the various determinations that are made in the course of examining firearm and tool-related evidence from crime scenes, there are a number of objective elements. One of these elements is class characteristics, features indicating a restricted group of items. A firearm with a rifled barrel will have a certain number of lands and grooves in the bore along with a direction of twist. The number and the direction of twist are quite objective. The lands and grooves will also have dimensions that can be measured. Provided there is an understanding regarding the potential for error, these measurements can also be considered objective. In addition, the bore has a certain caliber, a caliber that can be measured and that helps to define the size of the bullet that can be fired through that bore. Each of these, the number of lands and grooves, their dimensions, the direction of twist, and the caliber, can be argued to be, “…relating to…an object…perceptible by all observers; involving or deriving from sense perception or experience with actual objects…” In other words, these features are observable, they are verifiable, and they are derived from sense perception, being capable of observation and measurement.
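To make the distinction concrete, the class data just described can be captured in a simple record, with exclusion on class characteristics expressed as a direct comparison. The following is a minimal sketch only; the field names and the width tolerance are illustrative assumptions, not a published schema. The exclusion example in the next paragraph is an instance of exactly this kind of check.

```python
from dataclasses import dataclass

@dataclass
class ClassCharacteristics:
    """Objective, measurable class data for a fired bullet or a barrel."""
    num_lands_grooves: int   # count of land and groove impressions
    twist_direction: str     # "right" or "left"
    land_width_mm: float     # measured land impression width
    caliber: str             # nominal bore class, e.g. ".38/.357"

def class_incompatible(a: ClassCharacteristics,
                       b: ClassCharacteristics,
                       width_tol_mm: float = 0.05) -> bool:
    """Return True when class characteristics alone exclude a common source.

    Counts, twist direction, and caliber class are categorical, so any
    disagreement excludes. Widths are measurements, so a tolerance
    reflecting measurement uncertainty (an illustrative value here) is
    applied.
    """
    return (a.num_lands_grooves != b.num_lands_grooves
            or a.twist_direction != b.twist_direction
            or a.caliber != b.caliber
            or abs(a.land_width_mm - b.land_width_mm) > width_tol_mm)

# The exclusion described in the next paragraph: 8 lands, right twist,
# versus a bore with 6 lands and a left twist.
bullet = ClassCharacteristics(8, "right", 1.75, ".38/.357")
barrel = ClassCharacteristics(6, "left", 2.10, ".38/.357")
print(class_incompatible(bullet, barrel))  # True: class data alone exclude
```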
The interpretation of sets of data based on class characteristics can be either objective or subjective. For example, a pristine bullet that has eight lands and eight grooves engraved on its surface, inclined to the right, could not have been fired
from a firearm having a bore with six lands and grooves and a left twist. The class characteristics are incompatible, and this is not so much a matter of opinion as it is a simple fact, at least for those knowledgeable and trained in the measurement, understanding, and application of class characteristics as they relate to firearm and toolmark identification. Therefore, this interpretation would be seen as quite objective, leaving little room for subjectivity. At the same time, if one were to take the same set of data collected on the bullet and propose a potential list of firearms from which it may have been fired based on that data, this would introduce a greater level of subjectivity. Yes, the data are being compared to a database of measured data, but whether or not to include some elements of that data due to acceptable tolerances could be considered subjective.

Another objective element is the correspondence of characteristics within a pattern. As was demonstrated in an earlier chapter, these characteristics are capable of being measured and graphed. It is possible to measure the distance of a striation from the shoulder of the land-engraved area in which it is present. It is possible to measure the slope on either side of a peak along with its height relative to other striations within the pattern. Individual measurements can be taken of all such data on two different items and then the data compared. But this is not practical, not only because of the time involved, but also because studies referenced earlier have demonstrated that the trained human examiner is better at lining up a potential pattern of agreement than a machine is. Using a comparison microscope with adequate and relatively equal lighting, it is possible to align two toolmarks in a matching or nonmatching position and have another trained examiner look and see what the first examiner is seeing. Such pattern correspondence is “…relating to…an object…perceptible by all observers; involving or deriving from sense perception or experience with actual objects…” In other words, pattern correspondence is observable, verifiable (that it exists), and derived from sense perception, being capable of observation and measurement (if one desired). It is the interpretation of that pattern correspondence that is subjective. However, as will be seen in the next chapter, it is not so subjective that pattern identification in the firearm and toolmark discipline should not be considered reliable.

At the same time, it is important to understand that subjectivity is inherent in the process of data collection. Whether in the measurement of land- and groove-engraved areas on a bullet and understanding instrument tolerances, or in the adjustment of lighting for the best visualization of toolmarks in a field of view on a comparison microscope, there is some subjectivity involved in the entire data collection process. Indeed, as one performs comparative pattern analysis, every time an image is moved, one has just made a subjective assessment of the objective correspondence (or lack thereof) of the pattern. However, there is more. As data are collected and tend to favor a particular decision, it is possible that data not supporting that decision are given less and less attention as more and more data supporting that decision are collected (Cole, 2006).
This is also a circular issue: as less contrary data are considered, the data that are considered feed into the one particular decision rather than a competing decision. In essence, the examiner is being consistently “tuned” through “constructive effects” to consider certain data and ignore other data as more and more data are collected that support a particular decision (Risinger et al., 2002). As a result, decisions are consistently and constantly being formed and made whether
or not to consider data being observed throughout the entire process of comparative pattern analysis. Therefore, for firearm and toolmark identification, and specifically comparative pattern analysis, there is no clear separation of objectivity and subjectivity. At the same time, this is not detrimental, and there are ways in which subjectivity can be lessened or minimized.

One is through increased transparency in documentation. A colleague once shared, “If I say that two bullets match, that statement alone tells you all you need to know about what I saw.” I respectfully replied that it tells me nothing about what he observed, only that whatever he observed was sufficient for him to reach his conclusion about the evidence he was looking at. The initial and most obvious step in documentation transparency is the recording of observations and not just conclusions. For example, based on previous chapters, it is possible for a trained examiner to examine a tool surface and determine whether or not subclass characteristics may be present. To make a simple statement that they are not, while helpful, does not provide an understanding of why the examiner thinks they are not present. It could be argued (and has been) that the statement, “no subclass characteristics were observed,” is an observation as much as it is a conclusion. While true to some degree, what is missing are the observations made about the tool surface itself, such as “toolmarks are straight, irregular in directionality and length, with a combination of coarse and fine markings.” Observations such as these point to a filing or grinding method. Knowledge of those manufacturing methods, along with the observed character of the markings themselves, can lead the examiner to the conclusion that “no subclass characteristics were observed.” That statement alone, however, does not provide any information about what was observed.

Another way in which subjectivity can be lessened is through a technical review of sufficiently thorough documentation. A second, trained, and qualified examiner can review the documentation and report and determine whether there is sufficient documentation to support the conclusions. These conclusions would include not only those in the report but also those in the documentation itself, such as the instance with the potential for subclass characteristics. There have been attempts by those in the community to define what minimally sufficient documentation would entail, as well as attempts by those outside the community to define what it should entail. Since there are many different ways in which documentation can be accomplished, it is more suitable to identify common core principles than to prescribe specific manners in which the documentation must be accomplished.

The first core principle is understanding that documentation cannot perfectly record and capture everything with respect to an examination. However, it should capture critical observations and the thought processes leading to the various conclusions reached. The aforementioned issue with subclass characteristics is one example. Another example is photographs, especially for cases in which common source identifications have been made. Photographs of areas of correspondence, especially those areas whose correspondence led to the common source determination, are helpful in illustrating what was observed.
Some have argued against photographs, suggesting that they capture just a very small portion of the comparison; some even go to the extreme of suggesting that if the comparison is not being videotaped, there is no point. While the first part of the argument is true, it is essential to
recognize that those areas of significant correspondence are critical parts of the comparison. Not only that, they are much more effective than words alone in helping to illustrate what an examiner means by “sufficient” or “significant” correspondence. While photographs are not perfect representations of the entire comparative examination, they can help to illustrate some data that can be considered objective. The second core principle is that conclusions are not observations, and that all observations leading to conclusions should be documented in some form, whether narrative or through sketches or photographs. The third and final core principle is that documentation should be sufficient that another trained examiner could understand what was done, what was observed, and whether there is sufficient support for the conclusions that were reached. To demand that it be sufficient for anyone to understand everything associated with the case is an unrealistic standard that is not met in any area of science.

Related to this is the third way in which subjectivity can be minimized: independent verification of comparative examinations. One of the arguments posed when dealing with the reliability of results is the subjectivity involved in the final interpretation of results. Having a second qualified examiner provide an independent assessment of the evidence and comparative results is helpful in defusing this argument against reliability. There is much involved in a quality verification process, which will be discussed later in this chapter in the context of dealing with bias in the laboratory.

The fourth means by which subjectivity can be reduced is by using a quantitative standard for the interpretation of observed correspondence. Earlier, the concept of consecutive matching striations (CMS) was discussed and presented. Based on this concept, proponents have offered a minimum standard above which a conclusion of common source may be offered. For three-dimensional toolmarks, this standard is two independent groups of three CMS or a single group of six or more. For two-dimensional toolmarks, this standard is two independent groups of five CMS or a single group of eight or more. For both, the potential for subclass characteristics has to be ruled out (Biasotti and Murdock, 1997). While the application of CMS does not make the discipline of firearm and toolmark identification more objective, it lessens subjectivity when interpreting the significance of observed patterns of correspondence. Without CMS, an examiner’s criterion for identification is based on the correspondence in known matching and nonmatching situations that he or she has observed during the course of training and experience. This is not inherently bad, because that individual’s criterion can be, and is, validated through competency and proficiency testing. At the same time, it is subjective because it is, “peculiar to a particular individual; personal subjective judgments: modified or affected by personal views, experience, or background” (Web, 2017e). With CMS, one is able to draw not only on one’s own experiences and testing, but also on the experiences of others who utilize a similar model, which, to date, involves thousands of data points (Nichols, 2003).
While it does not make the discipline more objective, it does lessen the subjectivity associated with the interpretation of the observed correspondence, because that interpretation is based not only on one’s own training and experience, but also on data collected from a number of other examiners using the same model for interpretation.
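Because the CMS thresholds are explicit counts, the criterion can be expressed directly. A minimal sketch, assuming the striation agreement has already been reduced to run lengths of consecutive matching striations; identifying those runs, and ruling out subclass characteristics, remains the examiner's task and is not modeled here:

```python
def meets_cms_criterion(run_lengths, three_dimensional=True):
    """Apply the conservative CMS thresholds (Biasotti and Murdock, 1997).

    run_lengths: length of each independent group of consecutive
    matching striations observed in the comparison, e.g. [3, 4] or [6].
    3D toolmarks: two groups of >= 3 CMS, or one group of >= 6.
    2D toolmarks: two groups of >= 5 CMS, or one group of >= 8.
    """
    group_min, single_min = (3, 6) if three_dimensional else (5, 8)
    if any(run >= single_min for run in run_lengths):
        return True
    return sum(1 for run in run_lengths if run >= group_min) >= 2

print(meets_cms_criterion([3, 4]))                        # True: two 3D groups of >= 3
print(meets_cms_criterion([7], three_dimensional=False))  # False: a single 2D group needs 8
```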
By definition, subjectivity and bias are inextricably linked. While ways have been presented that can reduce subjectivity, it can never be completely eliminated because of the constant decision making that is occurring during the comparative analysis and examination process. Therefore, bias can never be completely eliminated but, as subjectivity is reduced, a reduction in bias should follow. Following is a discussion of the potential impact of bias based on the available literature and how to deal with bias in the discipline of firearm and toolmark identification.
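Before turning to the literature, the earlier point that pattern correspondence is measurable data can be made concrete. The sketch below shows one simple way two striation depth profiles could be compared numerically; normalized correlation is a common choice in the objective-comparison literature, though it is not the method of any particular study cited here, and the profiles are simulated rather than real data:

```python
import numpy as np

def profile_correlation(profile_a: np.ndarray, profile_b: np.ndarray) -> float:
    """Pearson correlation of two equally sampled striation depth profiles.

    Each array holds surface heights sampled at equal intervals across a
    land-engraved area (e.g., from a profilometer). A value near 1.0
    indicates strong objective agreement of the striation pattern; the
    significance of any given value remains an interpretive question.
    """
    a = (profile_a - profile_a.mean()) / profile_a.std()
    b = (profile_b - profile_b.mean()) / profile_b.std()
    return float(np.mean(a * b))

# Illustrative use with simulated profiles: a shared striation signature
# plus independent measurement noise for each "bullet."
rng = np.random.default_rng(1)
signature = rng.normal(size=500).cumsum()       # simulated tool signature
bullet_1 = signature + rng.normal(scale=2, size=500)
bullet_2 = signature + rng.normal(scale=2, size=500)
print(f"correlation: {profile_correlation(bullet_1, bullet_2):.2f}")
```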
Understanding the literature

The literature with respect to bias includes general discussions of the different types of bias, along with research in areas of science other than forensic science. This literature is important because it helps to identify potential concerns with bias of which practitioners need to be aware, lest the bias inadvertently sway them from proper decision making. There is also literature dealing with the potential for bias in the various forensic science disciplines, including research examining decision making in the overall analytical process and in isolated steps within those processes. As a result of their work, various authors contend that bias does exist, that it can take on different forms, and that it is potentially a more significant obstacle for those forensic disciplines relying heavily on subjective interpretation for their conclusions.

To begin this discussion, a published article unrelated to firearm and toolmark identification is presented because it speaks to the issue of assumptions that can be observed in the literature, especially among those who are concerned with the credibility of forensic science but are not actually practicing forensic scientists. As an example, it has been implied that there is greater danger with respect to bias for disciplines in which subjective interpretations or judgments of individual examiners are, “their primary instrumentality, and not based on techniques derived from normal (emphasis added) science methodology” (Risinger et al., 2002). In a footnote, they give gas chromatography and scanning electron microscopy as examples of such techniques. However, a previously published article focusing on errors made in drug proficiency testing between 1985 and 1993 demonstrated that it was not the techniques, but rather the lack of critical thought, that was at fault for the errors (Nichols, 1997). That article was written to address concerns about the use of microscopy-based microcrystalline testing for drug identification, which was considered subjective at a time when state-of-the-art technology, including gas chromatography-mass spectrometry (GC-MS) and infrared spectroscopy (IR), was available; these techniques were considered far superior and, according to Risinger et al., would be classified as “normal science methodology.” Discovered in that review of proficiency tests was that 56 of the 63 errors involved the use of GC-MS, IR, or both, while no errors were reported for testing schemes that included at least two microcrystalline tests for identification (148 of the returned tests). When this study is re-examined in the context of bias, two statements made by respondents, one committing an error and one not, take on even greater significance. Both concluded that their results indicated that the suspect was attempting to manufacture methamphetamine.
There was nothing in the testing scenario that would even support such a conclusion in that particular test. This is not to suggest that nonpracticing forensic scientists are not qualified to speak on issues relating to forensic science. Indeed, in many ways, it is helpful, because such discussions can help identify areas in which forensic scientists have not been sufficiently communicating the basis for their work or conclusions. They can also help to highlight areas in which forensic science may have weaknesses that need to be addressed. At the same time, when the discussion includes assumptions and presumptions such as “pure science” (Dror, 2013), “normal science methodology” (Risinger et al., 2002), and “Latent print individualizations are merely (emphasis added) subjective determinations formed in the government examiners’ minds” (Cole, 2006), it needs to be approached with the same skepticism with which those authors approach forensic science.

Michael Risinger, among others, discusses the issue of expectation and suggestion, which they classify under a broader category of “observer effects” (Risinger et al., 2002). These effects include the typical affiliation of forensic laboratories with law enforcement bodies, referred to as motivational bias. They also include “examination-irrelevant information,” which can lead to an examination influenced by confirmation bias, “the tendency to test a hypothesis by looking for instances that confirm it rather than be searching for potentially falsifying instances.” They further include context bias, which occurs when a known sample and a questioned sample are compared side by side, which can lead to an examiner overvaluing similarities and undervaluing differences.

Paul Gianelli discusses two types of bias, motivational bias and cognitive bias (Gianelli, 2007). The first, motivational bias, is often subconscious and based on one’s affiliation with law enforcement. This bias is one of the primary reasons for the Committee on Identifying the Needs of the Forensic Sciences Community making this recommendation: “To improve the scientific bases of forensic science examinations and to maximize independence from or autonomy within the law enforcement community, Congress should authorize and appropriate incentive funds to the National Institute of Forensic Science (NIFS) for allocation to state and local jurisdictions for the purpose of removing all public forensic laboratories and facilities from the administrative control of law enforcement agencies or prosecutors’ offices” (National Research Council, 2009). It is believed that forensic scientists can be inadvertently swayed simply by being affiliated with a law enforcement agency or a prosecutor’s office. The other type of bias he identified was cognitive bias, in which examiners see what they expect to see, a tendency that is most influential in areas of ambiguity.

Kassin et al. define confirmation bias as confirming what is already known or suspected based on previous information (Kassin et al., 2013). This includes not only the typical seeking, perception, and interpretation of evidence to confirm what is already known or suspected, but also the creation of “new evidence.” They also discuss contextual bias in terms of two concepts—bottom-up processing and top-down processing. Bottom-up processing is when decisions are based solely on data. Top-down processing is when decisions are made with context in mind, such as case
circumstances. The authors indicate that top-down processing can lead to confirmation bias. In whatever way the authors discuss the terms, they tend to focus on three primary forms of bias: motivational, confirmation, and contextual bias. It is important to note that a significant portion of the literature is not implying that bias leads to a purposeful and intentional misrepresentation of the data. At the same time, the literature has to be approached with caution because there are incidental statements made without further definition or support, as discussed earlier. In addition, the studies themselves have to be reviewed in light of potential bias inherent in the study, or of other limitations that could result in data that are interesting but not necessarily applicable to the ultimate task of demonstrating the pervasiveness of bias in the forensic science disciplines, especially those relying on subjective interpretations of data. This will not be an exhaustive presentation of all the research that has been performed on bias in forensic science, but it will help to highlight the potential concerns that bias raises for forensic science. Necessarily, this discussion will involve many studies from disciplines other than firearm and toolmark identification, because studies involving firearm and toolmark identification specifically are scarce.

Though not the first study chronologically, one study seems to focus on motivational bias (Charlton et al., 2010). This was a qualitative, interview-type study in which 13 different participants answered a series of questions from which different observations were made. Each of these participants had at least 7 years of experience, with that experience incorporating a range of responsibilities: from bench-level examiners with more routine casework, to bench-level examiners whose casework involved violent person-related crimes, to supervisors. These individuals were drawn from a number of different law enforcement agencies. What the researchers discovered was that among those 13 participants, there was a relatively high need for cognitive closure with respect to any given case. In addition, there was a relatively high level of job satisfaction in terms of case outcome and other emotional experiences associated with job performance. They emphasized that this study was not definitive in any way and was designed to promote further research. That this is true is evidenced by the use of the words “may” or “might” a total of 44 different times in this particular article when discussing potential consequences of certain concerns. At the same time, the conjecture is useful for identifying potential blind spots that can exist. It may not be possible to address and study the potential influence of each one, because no particular number of studies would ever be exhaustive.

Concerns and potential limitations of the study would include the sampling of the population. There was no control group that would allow for a comparison and evaluation of broad-based issues such as cognitive closure. Was the apparent need for cognitive closure due to the individuals being involved in law enforcement, or does law enforcement tend to draw individuals with an inherently high need for cognitive closure? The employment history of the participants was not discussed in great detail; therefore, it is not known whether they were police officers or civilians. In addition, could the issue be culturally based?
Although it is known that the authors are based in England, there is no indication of where the participants perform their work. It is
common that there are significant differences in lifestyles among different cultures and countries, and this would seem to include work-related attitudes and potential biases. None of this should detract from the value of the article, which was to offer potential concerns and promote further research; in doing so, the work does have value. It has to be understood for what it is, though, and that is nondefinitive, as the authors themselves share.

If the article by Charlton et al. was qualitative in nature, there have been others that could be considered more quantitative. The first to be discussed in a forensic context is a study focusing on the influence of contextual bias when rendering fingerprint comparison decisions (Dror et al., 2005). Twenty-seven participants, all university student volunteers, were each given a total of 96 pairs of prints. Forty-eight of the pairs were considered clear, detailed, and unambiguous; of these 48, 24 were from the same source and 24 were not. The other 48 pairs of prints were considered more ambiguous and believed to be more indicative of what would be encountered in day-to-day casework. The participants were tested under two different conditions representing low and high contextual bias, created by the introduction of stories and photographs representing the background of the cases. The case backgrounds for the low-bias conditions involved rather common crimes that did not involve physical harm to another person, and included photographs of items that were stolen. The case backgrounds for the high-bias conditions included stories in which victims were seriously injured or killed. One photograph offered as an example of the photos shown for the high-bias cases appears to be an autopsy photo of an individual’s head, with the side of the face having a slash from the forehead down the cheek, revealing underlying tissue. These biasing conditions were chosen because they reflected conditions under which latent print examiners would be faced with performing their work. Similar conditions had been successful in other studies dealing with the potential for bias in decision making in nonfingerprint contexts, and they were relatively easy to control, administer, and quantify. The experiment was programmed and delivered on a computer, with subliminal messages, “guilty” or “same,” added to some of the pairs of prints as they were presented for evaluation. All participants were tested under four different biasing conditions: control, making decisions based on the prints only; low-bias conditions; high-bias conditions; and high-bias conditions with subliminal messaging. Before testing, participants were provided six practice sessions in which they would make a determination of same or different, receiving feedback after each practice session. There is no indication that any further training was provided to the participants. During the trial of the 96 pairs of prints, no feedback was provided, and the prints were divided into four trials of 24 print pairs each, 12 ambiguous and 12 unambiguous, presented randomly within each set. Each pair would remain on the screen until a same or different decision was made, without an opportunity for any other response, including something akin to “I don’t know.” Based on their study, it does appear that there was a positive correlation between the rate of identifications and the top-down biasing information being provided.
The rate of identifications was 47% for the control condition, 49% for the low-emotion condition, 58% for the high-emotion condition, and 66% for the high-emotion condition with
subliminal messages. When presenting these results, the researchers did indicate that this tendency to make an increased number of positive associations was limited to the ambiguous prints; there were no significant differences with the unambiguous prints. They concede that the study was preliminary in nature, trying to establish the existence of contextual bias in fingerprint decisions for the purpose of identifying further areas for research.

One of the limitations of the study is that the participants were not trained latent print examiners. The researchers acknowledge this and offer that trained examiners might not respond in a similar manner or, based on their training and experience, could even be more susceptible to the contextual bias. This is a statement that could have been made without the study, and one the study actually did nothing to address at any level. At the same time, that only touches the tip of the issue: two other issues associated with this lack of experience are not addressed by this dismissal. The first is that the data showed the trend for positive associations was limited to the ambiguous print sets. It is these sets in particular that require training and expertise to understand and appreciate. Along with this is the requirement that the decision be either “same” or “different” without allowing an opportunity for an “I cannot determine,” which is an option available to trained examiners. Looming potentially larger is the second issue, which has to do with the potential impact of the photographs used for the high-bias conditions. If the example provided in the article was representative, such photographs would elicit an emotional response. The concern is the impact of such photographs given the two populations, the one being tested and the one about which inferences are being suggested. It might be expected that, given the nature of their work and exposure to some rather heinous circumstances, trained examiners would have less of an emotional response to the same stimuli than a university student, especially a student whose exposure to such circumstances has been limited. That being said, provided that the article is understood in context, it does have value in highlighting the potential for bias, and it warrants further study. The use of university students was reasonable given the recognized difficulty of examining the issue of bias in actual examiners.

In 2006, researchers worked to bridge this gap with a limited study examining whether trained examiners would reach the same decision on a case when given different contextual information about that case (Dror et al., 2006). In this study, five different fingerprint examiners, with an average of 17 years of experience among them, were asked to compare fingerprints that they had examined 5 years previously and had determined were made by the same person. The examiners knew that they would be tested but did not know when or under what circumstances. Each of the examiners, representing five different countries, was provided with a previously identified pair of prints accompanied by contextual information that favored an elimination: that the prints being presented were the prints the FBI had erroneously identified in the aforementioned Madrid bombing case. Of those five examiners, four offered conclusions that differed from their original conclusions.
Three went from an identification to an exclusion, and one went from an identification to an insufficient, cannot-be-determined conclusion. One of the five maintained that
it was an identification, even in light of the extraneous information strongly indicating that it was not. The study had self-recognized limitations, one of which was the limited number of participants. The authors offered, and rightly so, that there are significant challenges in performing studies of this nature. At the same time, it is important to discuss these limitations because this particular article has been taken out of that “preliminary” context in which it so clearly belongs. It seems plausible that, given they were already able to obtain previously worked cases from the archives of five examiners, additional cases could have been obtained to test other contextual information. That may have been of value because the contextual information introduced with the prints is much more highly suggestive than what may be encountered in a typical case. From personal experience working in local and Federal laboratories, I have encountered a number of instances in which the detectives assured me that the submitted gun was the one responsible, or that two cases were definitely linked. Experience quickly demonstrated that they were wrong as often as they were correct and, in one particular case, they were wrong on three different occasions. Therefore, since the researchers were able to obtain one set of prints to test one particular biasing statement, it seems plausible that others could have been obtained, including, at a minimum, previously excluded prints that were ambiguous, paired with the contextual information that they had been identified by another laboratory seeking assistance. Part of the reason is that, while an error is still an error, there is a general perception in forensic science that an error of exclusion is more acceptable than an error of inclusion, because the former will not result in an innocent person serving time for a crime he or she did not commit. Another potential limitation is that the quality of the previously identified print sets was not discussed in this article, though it was in the follow-up study discussed next, which also happens to address many of the concerns and limitations of this study.

As a follow-up to the just-discussed study, another within-subject study was performed with six different examiners (Dror and Charlton, 2006). In this study, each examiner was provided eight pairs of prints that he or she had previously compared, being told that they were participating in an assessment project for problematic prints. These prints were placed in four different categories, as illustrated in Fig. 1. One pair from each of the four categories was used as a control, while the other was provided with biasing information, which differed depending on whether the original conclusion was an exclusion or an individualization. If the original conclusion was an individualization, then the contextual information was that the suspect was in custody at the time of the crime, biasing toward an exclusion. Similarly, if the original conclusion was an exclusion, then the contextual information was that the suspect confessed, biasing toward an individualization. Two questions were asked with each pair. The first asked about sufficiency for a determination, with a yes or no offered for the response. The second asked for a conclusion of individualization or exclusion, provided there was sufficient information upon which to base a decision. Of the 48 total trials, there were six decisions that were inconsistent with the first decision made for the pair of prints.
One of the six individuals was responsible for three of those inconsistent decisions, three made one inconsistent decision each, and two made no inconsistent decisions. Two of the six inconsistent decisions were from the control trials and the remaining four from the biased trials. Five of the six inconsistent decisions were from ambiguous print sets, while one was from a print set considered unambiguous. Only one of the inconsistent decisions went from an individualization to an exclusion, and that was a control trial. That particular print set was considered ambiguous and was worked by the individual who was responsible for three of the six inconsistent decisions. There were no instances of an exclusion being concluded as an individualization with biasing information suggesting it was an individualization.

Fig. 1 Print set distribution for participants in the study by Dror and Charlton (2006).

The study makes up for many of the weaknesses of the previous study, and even though the rate of inconsistent decisions, at 12.5%, is much lower than the earlier 80%, it is still very significant and reason for concern. As in the other studies, when biasing information was introduced, it tended to influence the more ambiguous print sets as opposed to the unambiguous print sets. Apart from the one conclusion in a control trial that went from an exclusion to an individualization, no other similar decisions were made. The authors suggested that the threshold for an exclusion was lower than the threshold for an individualization. That explanation is possible, just as the tendency to think that an error of exclusion is less egregious than an error of individualization is also possible. The authors offered potential reasons other than bias that could have resulted in the inconsistent conclusions. At the same time, even if the results of the examiner responsible for the three inconsistent decisions were removed from consideration, that leaves three inconsistent decisions, two of which had biasing information, for a rate of 4.7%. While lower, it cannot and should not be ignored as a potential issue. This was confirmed in a report that re-evaluated the data using meta-analytic procedures to better quantify the limited data (Dror and Rosenthal, 2008).

Another study that evaluated the effect of bias on decision making and outcomes examined the potential for contextual bias due to Automated Fingerprint Identification Systems (AFIS) (Dror et al., 2012). AFIS is a means by which latent prints can be searched against a database; the technology returns a list of potential matching candidates for each searched latent print. The true match may not be on the list at all or, if it is, it can appear in positions other than the first. Typically, examiners review the list, making comparisons, and often stop if and when an individualization is made. This particular study focused on understanding the potential for bias in decision making and outcomes based on the ranking of a true match in a candidate list. A total of 160 different latent prints of low and medium quality were selected for this study, along with the corresponding matching known prints. The reason for the quality choice was that it had been demonstrated that ambiguous prints are more susceptible to bias. Half of the latent prints would be inserted into lists that had 10 candidates and the other half into lists that had 20 candidates. Corresponding matching prints were inserted into some but not all lists; a primary reason for this was that the study was to be covertly incorporated into day-to-day casework, and many lists would not have a matching print corresponding to the searched latent. When the corresponding matching print was inserted into the candidate list, it would be inserted into one of four slots: (1) top: #1 position; (2) near top: #2 position (10-candidate list) or #3 position (20-candidate list); (3) near bottom: #8 position (10-candidate list) or #15 position (20-candidate list); and (4) bottom: #10 or #20 position, depending on list size. Examiners were required to make comparisons on each and every print in the candidate list, even if one was already matched, making one of three decisions—identification, exclusion, or inconclusive. Each examiner made a total of 2400 comparisons, for a grand total of 55,200 comparisons across all 23 examiners. The testing method allowed for measuring not only errors, which were reflective of the outcome of the decision-making process, but also the time that a particular comparison required to reach a decision (response time), which was reflective of the decision-making process itself. A significant amount of data was generated, and the authors were quick to caution about its use given the intent of the study. This included not only the total number of comparisons and match comparisons, but also total errors, including false identifications, missed identifications, and what they termed false inconclusive results—results reported as inconclusive when an identification should have been made. The data are listed in Table 1.
Table 1 Data from study evaluating biasing effects of AFIS contextual information on human experts (Dror et al., 2012)

                                 Number     %
Comparisons performed            55,200
Match comparisons available      1832
Total errors (all)               1516
False identifications            49a        0.09%
False inconclusives              1001       1.81%
Missed identifications           502        27.40%

a Of the 49, 37 could have been due to clerical errors, leaving 12 certain false identifications.
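A note on the percentage column: the three rates appear to be computed against different denominators, the false identifications and false inconclusives against all comparisons performed, and the missed identifications against the match comparisons available. This reading is an inference from the numbers themselves, not something stated in the table:

\[
\frac{49}{55{,}200} \approx 0.09\%, \qquad
\frac{1001}{55{,}200} \approx 1.81\%, \qquad
\frac{502}{1832} \approx 27.40\%
\]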
When the data were reviewed in the context of the intent of the study, several things became apparent. The first was, “No main effect of the position of the target matching print as far as the final (emphasis added) false exclusion and false inconclusive conclusions are concerned” (Dror et al., 2012). When they examined the response time, they observed a correlation between response time and errors: as response time lessened, there was a noticeable increase in the number of errors. There was also a direct correlation between position and the time spent on a particular comparison, and more time was spent on comparisons from the 10-candidate list than the 20-candidate list. From this, the research indicates that there is a potential biasing effect of rank position on the decision-making process, even if it may not impact the final outcome. They also evaluated the 49 false identifications to determine if there was a correlation between rank and the tendency to make a false identification. Of the 49 false identifications, 17 had the true match on the candidate list. One issue with which they were concerned was the potential for some of those false identifications to be the result of clerical error, and that was discussed fully. One way in which they considered that potential was when two identifications were made off a single print—the false identification and the true identification, which was on the candidate list. However, since a clerical error was only a potential reason for a false identification, it could not be known with certainty. Therefore, they evaluated the total of 49 and then the 12 that remained once the potential clerical errors were removed. What they observed was that there was a greater likelihood for false identifications at the top of the lists, even with the removal of the 37 potential clerical errors.

So far, the reviewed studies having quantitative data have focused on the actual decision-making process, the process associated with making an outcome decision. The studies have also focused on fingerprint identification because that is the discipline in which the largest amount of quantitative work has been performed. Studies have also been performed on different phases of the fingerprint identification process, which has been termed ACE-V, standing for analysis, comparison, evaluation, and verification. Two in particular focused on the analysis stage, in which latent prints are analyzed for suitability for identification prior to being compared. In the first study to be discussed, researchers examined the potential biasing effect of having a known print against which to evaluate a latent print, as opposed to an evaluation of that latent print without any context (Dror et al., 2011). Examiners were tasked with counting minutiae (potential points of identification in fingerprints) while examining latent prints solo and in the context of having a known print against which the latent prints could be evaluated. Three different experiments were performed. The first was an evaluation of the biasing effect, the second was an interobserver evaluation of minutiae counts, and the third was an intraobserver evaluation of minutiae counts. In the first experiment, 20 examiners were provided 10 trials: five latent prints by themselves and five with a matching print providing a context against which the latent print could be evaluated. The results indicated that the presence of a matching print did have an effect on the judgment and perception of the latent print, with fewer minutiae recorded in the biasing context than in the solo trials. The mean difference was 2.6 fewer minutiae in the biasing condition than in the solo trials. It appears that this could be due to a number of potential reasons.
One would be that with the known print, attention was more focused on particular areas of the latent print than would otherwise be the case. A second could be that when evaluating a print by itself, it is unknown which minutiae may be relevant and therefore more are recorded than in
the biasing context, in which nonmatching minutiae may be more readily disregarded and remain uncounted.

In the second experiment, the variation of minutiae counts in the solo trials among the 20 examiners was examined. Significant ranges were observed in the counts for each of the latent prints, with the minimum range being 8 and the maximum range being 21. Those prints having a greater average of minutiae counted also had the highest ranges of counts among the examiners. For example, in one trial in which an average of 20.1 minutiae were recorded, one examiner counted 30 while another counted 9. In another trial, in which the average count was 7.1, the range was 4 to 12 minutiae. Because of the small sample size, the data were statistically evaluated for reliability, and the resulting index of 0.85 showed that the data were relatively reliable.

In the third experiment, a different set of 10 examiners was selected. They were provided 10 latent prints each and, a few months later, were provided the same 10 latent prints. They were asked to analyze the latent prints and provide minutiae counts for each print at each of the two different times. This was statistically analyzed and a retest reliability factor was determined, which was used to quantify consistency. A higher score (approaching 1.00) would represent significant consistency, while a lower score would represent greater intraexaminer variation. The range of retest reliability factors was 0.65–0.95, but it should be noted that nine of the 10 were 0.80 and above, with three being above 0.90. It was also observed that, as in the second experiment, there were significant interexaminer differences and that the intraexaminer consistency was impacted by the quality of the latent prints being evaluated.

When evaluating this study, it is important to keep context in mind. While it is clear that there were statistically significant intraexaminer and interexaminer variations, it has to be remembered that this is the analysis stage. When the intraexaminer differences are evaluated, 16% of the trials showed no difference, 40% showed a difference of no more than 1, and 55% showed differences in minutiae counts of 2 or less. When examined in the context of decision thresholds, this bias may or may not be sufficient to make a difference in the actual outcome. At the same time, this study did demonstrate a potential for bias, which could be significant for more ambiguous prints. As a result, the authors offered recommendations that they believe would help guard against such bias, recommendations that will be discussed later.
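The study's exact computation of the retest reliability factor is not spelled out here. Purely as an illustration, the short Python sketch below treats it as a Pearson correlation between the two sessions' minutiae counts, a common choice for quantifying test-retest consistency; the counts themselves are invented for the example.

    def pearson(xs, ys):
        """Pearson correlation, standing in here for a retest reliability factor."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    # Invented minutiae counts for the same 10 latent prints, analyzed by
    # one examiner in two sessions several months apart.
    session_1 = [12, 8, 15, 20, 7, 11, 9, 14, 18, 10]
    session_2 = [13, 8, 14, 21, 6, 12, 9, 15, 16, 10]
    print(f"retest reliability: {pearson(session_1, session_2):.2f}")

A value near 1.00 would indicate that the examiner counted nearly the same minutiae both times; values toward the lower end of the reported 0.65–0.95 range would indicate greater intraexaminer variation.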
In a later study, the analysis stage was examined again in an attempt to evaluate its robustness (Fraser-Mackenzie et al., 2013). The scope of the study was to evaluate the potential bias that a matching or nonmatching print may have on a suitability determination of a latent print. Participants were asked to make a suitability decision when examining the latent print by itself (solo) and declare it suitable, unsuitable, or questionable. Then they were presented the print in the context of a comparison with a matching or nonmatching print. They were not directly asked if the latent prints were suitable in this context. Rather, the suitability was inferred based on the examination results: if the conclusion was an individualization or exclusion, then it was inferred that the latent print was suitable. Three experiments were conducted in the context of this study.

In the first experiment, a total of 640 latent print images were used, with half being judged by a panel as highly suitable. The remaining 320 were evenly distributed
among suitable, unsuitable, highly unsuitable, and inconclusive. Twenty-four examiners were given at least 125 latent prints in solo trials and at least 180 latent prints in comparison trials, half with a known match and the other half with a known nonmatch. There were no repeats of latent prints from the solo trials into the comparison trials. Each examiner was tested covertly, in the normal course of casework. Based on the experiment, there is contextual influence from the presence of matching or nonmatching prints, and the influence is greatest when the latent prints are more difficult to judge. When the latent print was presented in combination with a nonmatching print, the tendency was for the latent to be considered more suitable than when evaluated in a solo trial, even for those latent prints predetermined to be "highly unsuitable." When presented with matching exemplars, there was an increase in the number of questionable inferences when compared with the solo trials, even for those latent prints predetermined to be "highly suitable." A potential reason for the difference is that when evaluating a latent print in the context of a nonmatching print, it is easier to find differences, thus making the latent print appear more "suitable" than it might otherwise be judged.

The second experiment involved 30 latent prints in each of three categories—suitable, questionable, and unsuitable. Half were provided to the participants with a suitable prompt and half with an unsuitable prompt, representing the bias of another examiner's analysis determination. There were no significant differences except for latent prints predetermined to be suitable when an unsuitable prompt was presented. In this instance, the percentage of suitable latent prints deemed unsuitable was 16.5% when a suitable prompt was offered and 23% when an unsuitable prompt was offered.

The third experiment involved 24 different examiners divided into three equal groups. One group received latent prints that were predetermined to be "highly suitable," the second group received latent prints that were predetermined to be "suitable," and the third group received latent prints that were predetermined to be "questionable." The biasing information was that it was a major case and that another examiner had declared an identification. The results indicated that even under these strong biasing conditions, the underlying suitability of the latent prints still plays an important role. This tends to show that even though other experiments demonstrated a potential for bias in the analysis of latent prints depending on the circumstances, the analysis stage is still rather robust.

Much as the previous two studies dealt with the analysis stage of the ACE-V process, research has also been conducted with respect to the potential for bias impacting the verification stage only (Langenburg et al., 2009). Three groups of examiners participated, selected from International Association for Identification (IAI) conference attendees. The participants were given different biasing information depending on the group into which they were placed. One group of 15 served as the control group. A group of 12 served as the low-bias group, which received answers for each of the print sets to be compared. These answers represented the conclusions of the originating examiner.
A group of 16 served as the high-bias group and was told by a well-recognized expert that the print sets were from actual cases he had previously examined and that they were all identifications, going so far as to have the results on official letterhead. Each group was given the same six sets of prints. Three had the same source, with a predetermined difficulty of easy, medium, and difficult (one set each). Three had different sources with the same predetermined difficulty range.
The results of the study indicated that, in general, experts under biasing conditions provided significantly fewer definitive and erroneous conclusions than those in the control group. One false identification occurred in the control group, in which there was no bias. Three false exclusions also occurred in the control group. There were no false identifications or false exclusions in either of the groups provided biasing information. However, there was a higher rate of inconclusive responses in the low- and high-bias groups. There were significant limitations to this study, which the authors recognize could have impacted the results. The two primary limitations are that all examiners were aware they were being tested and that the examiners in the two groups exposed to bias were admittedly suspicious and more "alert" than the control group. In addition, experience levels were self-assessed and backgrounds among the participants varied significantly.

This study was repeated with university students from a variety of majors who had taken a forensic science course as an elective. These students were provided with approximately 1 hour of training and, like the experts, were divided into three separate groups—control (31 students), low bias (27 students), and high bias (28 students). When the students were provided with same-source prints, there were 11 false exclusions for the control group and 10 false exclusions for the low-bias group. The significant difference was obtained with the high-bias group, for which the number of false exclusions dropped to 1, suggesting that the high-bias conditions had a significant impact on the students in that group. Similarly, for different-source prints, the number of false identifications was the same for the control and low-bias groups, increasing by 3, to 10 false identifications, for the high-bias group. These results were similar to those observed for novices in other studies and demonstrate the importance of using appropriate populations when studying phenomena such as bias.

While the preponderance of the quantitative studies dealing with bias and forensic science has focused on fingerprint examinations, studies have been done in other disciplines as well. One of the three that will be discussed involves DNA, which is often the standard against which other disciplines are measured. In this particular study, an adjudicated case involving the interpretation of a complex DNA mixture was used (Dror and Hampikian, 2011). The case involved a gang rape with three suspects. In that case, a suspect agreed to provide testimony in return for a lesser sentence. For such testimony to be offered in the jurisdiction, it had to be corroborated by physical evidence (DNA) and, at a minimum, the results could not be inconclusive. One suspect in particular was clearly not identified, and the conclusion was that the suspect offering testimony "could not be excluded" as a potential contributing source. The same data were provided to 17 other trained and qualified analysts in North America. Of those 17, only one agreed with the original analyst. Four of the 17 reported inconclusive results, and 12 of the 17 excluded the individual.

The last two studies to be reviewed in this context deal with firearm and toolmark identification. The first is a validation study involving bullets and cartridge cases provided to eight different examiners assigned to the FBI Laboratory Firearms/Toolmark Unit (Smith, 2005).
Each examiner was provided with two test packets, one containing bullets and one containing cartridge cases. Each test packet contained 45 specimens
to include one true identification, with the rest being true eliminations. The participants were advised that incorrect results could impact the theory of identification, a statement that was believed could function as a bias toward conservative results, with the participants less inclined to make identifications. The results for the 360 cartridge case comparisons were 7 identifications (of 8 possible), 335 "no conclusion" results, and 18 exclusions. The results for the 360 bullet comparisons were 6 identifications and a majority of "no conclusion" results. In no instance was there a false identification or false exclusion. With respect to the bias issue, the author concluded that bias was not an issue because the identifications were correctly made in light of an overwhelming number of true eliminations in the test packets. This conclusion must be considered in light of limitations such as the lack of a control group, a number of true identifications that was likely insufficient, and the uncertainty with respect to the strength of the data set. As other studies have demonstrated, top-down contextual bias will generally not override a strong data set.

The last study to be reviewed in this context looked at the potential for bias in two different experiments involving firearm and toolmark identification (Kerstholt et al., 2010). In the first experiment, six examiners evaluated six sets of bullets that were presented to them twice, with at least 6 months' separation, once under neutral conditions and the second time under biasing conditions. Half the sets were known matching bullets and the other half were known nonmatches. The neutral condition was that there were two suspects and two crime scenes. The biasing condition was that there was one shooter and one crime scene. The examiners were asked to determine whether or not the two bullets from each of the six sets were fired from the same gun. Two cases did have large individual differences: one went from "beyond any reasonable doubt" to "inconclusive" and the second from "likely" to "likely not." These were the most extreme, as the remaining results indicated that there was no significant difference between the neutral and biasing conditions.

In the second experiment, 153 previously worked cases that had two opinions were evaluated. It is an accepted practice that in complex or ambiguous cases, two independent examinations may be performed. In these instances, the first examiner typically had case information while the second examiner did not. It was expected that, had there been a contextual bias, the first examiner would have a higher tendency to identify than the second examiner because they possessed extraneous case information. However, in the evaluation of the cases, it was determined that the first examiner was generally more "conservative" in their responses than the second examiner.

A number of studies have evaluated the potential for bias in several different forensic science disciplines, including firearm and toolmark identification. A common theme is that researchers admit that critical research is very difficult to perform because of several issues, including test-taking bias and the difficulty of mimicking expected bias conditions. Dror and others were successful in performing some of their research covertly, but that is difficult to do in some of the other disciplines, including firearm and toolmark identification. As a result, the studies have limitations, which the authors do recognize.
At the same time, these studies have also demonstrated that arguments in support of the potential for bias do have some merit and cannot be summarily dismissed.
Potential for and dealing with bias

Studies have shown that bias is a potential issue, and it can have its greatest impact in those forensic science disciplines in which human judgment is the primary means by which a conclusion is reached and the data set is ambiguous. It has also been recognized that bias can impact the decision-making process without necessarily having an impact on the actual outcome of the decision (Dror et al., 2012). The reason is that while bias may have been involved at some point, it may not have caused a shift over a particular decision threshold. Nonetheless, it is important to know and understand that bias can have a potential impact in casework related to firearm and toolmark identification and to take proper precautions to minimize that potential impact.

One manner in which examiners can be exposed to bias is when they are provided investigative information. This can be as overt as an investigator reaching out and telling the examiner his or her thoughts with regard to the evidence and expectations with respect to that evidence. While not as overt, the mere presence of a firearm on an evidence listing can also suggest that an identification to submitted bullets and cartridge cases is expected. One means by which this can be alleviated is sequential unmasking, a process endorsed by many scholars (Krane et al., 2008). Sequential unmasking keeps extraneous, unnecessary information from the examiner, providing only the information required for the examiner to adequately and reliably proceed with the next step of the analysis. When that step is completed, the examiner is provided more information to adequately and reliably proceed with the step after that, and the process continues until the analysis is completed. Such a process requires case managers with sufficient training and experience to assess incoming cases and understand what is needed for examiners to complete the various analyses. This helps to minimize the potential influence of any case-related investigative information on the actual examination of the evidence, as the sketch below illustrates.
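As a minimal illustration of the idea, the Python sketch below models a case file in which a hypothetical case manager has partitioned the information by the earliest analysis stage that needs it. The stage names and facts are invented for the example and are not drawn from any particular laboratory's procedure.

    # Hypothetical analysis stages, in the order information is unmasked.
    STAGES = ["evidence_examination", "comparison", "report_review"]

    class CaseFile:
        """Case information partitioned by the earliest stage that requires it."""

        def __init__(self, info_by_stage):
            self.info_by_stage = info_by_stage  # stage name -> list of facts
            self.stages_completed = 0

        def visible_info(self):
            """Return only the facts needed through the current stage."""
            unmasked = STAGES[: self.stages_completed + 1]
            return [fact for stage in unmasked
                    for fact in self.info_by_stage.get(stage, [])]

        def complete_stage(self):
            """Unmask the next tranche of information once a stage is done."""
            if self.stages_completed < len(STAGES) - 1:
                self.stages_completed += 1

    case = CaseFile({
        "evidence_examination": ["four fired cartridge cases submitted"],
        "comparison": ["one pistol submitted, with test fires"],
        "report_review": ["investigator's narrative and expectations"],
    })
    print(case.visible_info())  # only the evidence itself at the first stage
    case.complete_stage()
    print(case.visible_info())  # the exemplars are unmasked for the comparison

The structure makes the point of the process plain: the most potentially biasing item, the investigator's narrative, is the last to be revealed, and only if a later step genuinely requires it.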
A forensic filler-control method has been proposed to minimize the supposed bias introduced merely by the evidence that is submitted, referred to as an "evidentiary show-up" (Risinger et al., 2002; Wells et al., 2013). It is suggested that if a single firearm or tool is submitted for comparison with recovered evidence, the addition of similar firearms or tools could help minimize the contextual bias that would be introduced by the presence of a single firearm or tool alone. In this way, the examiner will not know which of the firearms or tools is the one suspected by the investigator to be responsible for the evidence. If adopted as a practice, some suggest that it would allow for an estimate of not only the error rate of the discipline, but of the examiners as well (Wells et al., 2013). For example, if three additional tools are introduced as exemplars along with the suspected tool, then if the method were completely unreliable, there would only be a 25% chance of getting the correct answer, the result of pure chance. If the method were indeed reliable, then repeated trials among examiners would allow the laboratory to calculate a reliability index for the discipline. In the same way, potential error rates could be determined for the individual examiners within the laboratory.
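The arithmetic behind that example is straightforward to make concrete. The Python sketch below, offered only as an illustration, computes the chance-level accuracy for a given number of fillers and uses a simple binomial test (via SciPy, an assumed dependency) to ask whether a hypothetical examiner's record over repeated filler-control trials is distinguishable from guessing. The trial counts are invented, and a working reliability index could certainly be formulated differently.

    from scipy.stats import binomtest

    def chance_accuracy(n_fillers):
        """Probability of picking the suspected tool by pure chance when it is
        presented alongside n_fillers similar tools."""
        return 1.0 / (n_fillers + 1)

    # Three fillers plus the suspected tool: chance accuracy is 25%.
    p_chance = chance_accuracy(3)

    # Invented track record: the correct source picked in 46 of 50 trials.
    result = binomtest(k=46, n=50, p=p_chance, alternative="greater")
    print(f"chance level: {p_chance:.0%}")
    print(f"p-value against pure guessing: {result.pvalue:.2e}")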
However logical the proposal sounds, there is one significant flaw: the firearm or tool submitted by the investigator may not actually be the responsible firearm or tool. Furthermore, when an examiner declares a common source, it is because it is believed that no other firearm or tool would be responsible for the toolmarks on the evidence. Therefore, if a common source is declared with the submitted firearm or tool, no other firearms or tools would be examined. If examiners were compelled to examine them regardless of the common source determination, then a different bias is introduced when examining those added firearms or tools. Other concerns would be obtaining appropriate tools to serve as the foils and the potential for confusion in the courts, though the latter can be mitigated by appropriate documentation.

Another manner in which examiners can be exposed to bias is when laboratory processes are poorly defined. Authors have suggested that linear processing can reduce the potential contextual biasing that can occur if submitted exemplars are examined prior to the evidence samples (Dror et al., 2011). Similar to the ACE-V methodology employed for fingerprint comparisons, the idea is to evaluate the evidence first and make documented decisions about that evidence. In the case of firearm and toolmark evidence, that would include the measurement of class characteristics and the evaluation of the evidence toolmarks for subclass characteristics and for the potential sufficiency of the individual marks, in both quality and quantity, for identification. Once the examinations of the evidence have been completed, the exemplars are then examined and documented decisions are made with respect to the same features as for the evidence. In this way, the information learned from the exemplars does not influence the observations made when examining the evidence. Once the evidence and exemplars have been examined, any comparisons can take place. When this is done, previous evaluations of the evidence may change. For example, a lack of features on the evidence might become more important if the exemplars display a similar lack of features. Such re-evaluations are to be expected and, at the same time, should be documented as to the change and the reason for the change so that the thinking process is transparent. A simple sketch of such a record follows.
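The Python sketch below is one hypothetical way to capture both requirements in a case record: examination steps must be logged in evidence-exemplar-comparison order, and any later change to an earlier observation must carry a documented reason. The phase names and checks are invented for illustration.

    PHASES = ["evidence", "exemplars", "comparison"]

    class ExaminationRecord:
        """Append-only log enforcing linear processing with documented amendments."""

        def __init__(self):
            self.entries = []      # (phase, observation) in the order made
            self.amendments = []   # (entry_index, new_observation, reason)

        def log(self, phase, observation):
            # Returning to an earlier phase is not allowed; changes of mind
            # must go through amend() so the record stays transparent.
            if self.entries:
                last_phase = self.entries[-1][0]
                if PHASES.index(phase) < PHASES.index(last_phase):
                    raise ValueError(f"cannot return to '{phase}' after '{last_phase}'")
            self.entries.append((phase, observation))

        def amend(self, entry_index, new_observation, reason):
            """Re-evaluations are expected, but the change and reason are recorded."""
            if not reason:
                raise ValueError("an amendment must document its reason")
            self.amendments.append((entry_index, new_observation, reason))

    record = ExaminationRecord()
    record.log("evidence", "fine striae present; no subclass carryover apparent")
    record.log("exemplars", "test fires show a similar lack of gross features")
    record.log("comparison", "pattern agreement in two land impressions")
    record.amend(0, "sparse marking now considered significant",
                 "exemplars display a similar lack of features")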
Other researchers call for an even stricter evaluation of the evidence, suggesting that forensic science should have more of a research culture in daily operations (Haber and Haber, 2013). They call for greater measurement and objective assessment of evidence items and exemplars, which would include, as an example, measuring the spatial relationships between minutiae in latent prints. While this would seem to be relatively straightforward in some instances, it would be a significant challenge in firearm and toolmark identification, in which the totality of a pattern is often assessed and compared. The reason is that within a particular area of the pattern, there could be a number of features that could potentially be measured, a process that could greatly increase the amount of time spent on a case. If this were to be done for 10 fired cartridge cases, which could be quite common at some crime scenes, the evaluation of the evidence alone could take several days. However, when employed properly, comparative microscopy is just as effective in assessing pattern similarity and is much more efficient at doing so. With proper photo documentation, it is possible for that similarity to be assessed during a review of the work performed.
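To see why such measurement is tractable for a handful of fingerprint minutiae yet balloons for toolmark patterns, consider the sketch below, which computes every pairwise distance between invented minutiae coordinates. The number of measurements grows quadratically with the number of features being measured.

    from itertools import combinations
    from math import comb, dist

    # Invented minutiae coordinates (in pixels) marked on a latent print.
    minutiae = {"A": (103, 210), "B": (145, 188), "C": (120, 251), "D": (98, 177)}

    for (name1, p1), (name2, p2) in combinations(minutiae.items(), 2):
        print(f"{name1}-{name2}: {dist(p1, p2):.1f} px")

    # Four features require 6 measurements; 60 measurable features in a
    # striated toolmark pattern would require comb(60, 2) = 1770.
    print(comb(60, 2))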
While proper and adequate documentation does not necessarily lessen the impact of bias, it is related to laboratory processes and should be discussed. Proper and adequate documentation allows for better verifiability of the observations made and a better evaluation of the conclusions rendered as a result of those observations. Therefore, laboratory processes should define, at a minimum, what is expected with respect to the documentation of observations and conclusions. For example, when an evaluation for subclass characteristics is made, the documentation should include a description of the marks observed and how those relate to the potential presence, or absence, of subclass characteristics. By recording things in such a way, there is a better appreciation for the thought process behind the examination.

Related to laboratory processes, examiners can also be exposed to bias when they know the results of an examination and are asked to verify or confirm those results. This can be overt, such as when the verifier reviews the notes of the original examiner, or it can be due to laboratory processes in which only identifications are verified. In the latter case, the original examiner does not have to say anything; the individual asked to verify already knows that he or she would not be doing so were an identification not reached. If comprehensive 100% verification is not possible, it is imperative that all casework at least have an opportunity to be verified. In addition, the verifications should be blind: the individual asked to verify should be shielded from knowledge of the original examiner's conclusions. Furthermore, it is also important that verifications by subordinates of the original examiner be masked or avoided altogether, lest the verification be hindered by personnel concerns external to the examination.
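A hypothetical assignment routine makes these constraints concrete. In the Python sketch below, every case is eligible for verification regardless of outcome, the verifier never reports to the original examiner, and the verification packet simply omits the original conclusion. The names and structure are invented for the example.

    import random

    def assign_blind_verifier(case_id, original_examiner, staff, supervisor_of):
        """Choose a verifier shielded from the conclusion and from rank pressure."""
        eligible = [
            person for person in staff
            if person != original_examiner
            # Exclude anyone who reports to the original examiner.
            and supervisor_of.get(person) != original_examiner
        ]
        verifier = random.choice(eligible)
        # The packet lists the items but deliberately omits the conclusion,
        # and it is issued for every case, not only for identifications.
        packet = {"case_id": case_id, "items": ["evidence", "exemplars"]}
        return verifier, packet

    staff = ["Avery", "Blake", "Casey", "Devon"]
    supervisor_of = {"Blake": "Avery"}  # Blake reports to Avery
    verifier, packet = assign_blind_verifier("2017-0415", "Avery", staff, supervisor_of)
    print(verifier, packet)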
Another manner in which examiners are exposed to potential bias has to do with poorly filtered external influences on the laboratory and its staff. This is one of the reasons, discussed earlier, for the call to separate forensic science laboratories from law enforcement agencies or prosecutors' offices. Many newscasts throughout the years have featured law enforcement officials declaring that staff will work round the clock if necessary to deal with a particularly heinous or high-profile crime. While it may seem admirable and self-sacrificing, such demands on forensic laboratory staff can lead to unintended mistakes. For example, there have been comparisons in which I worked for hours on particularly difficult samples only to finally observe a patch of correspondence. Knowing the potential for observer effects, it was at that point that the examination was halted, the items were removed from the microscope, any indexing marks were removed, and the items were put away to be looked at fresh another day. If the patch of correspondence was genuinely present, it would be found again. If not, then the time away from the samples allowed for a dissipation of the observer bias. Unreasonable demands made without an understanding of this phenomenon can create potential issues for laboratory personnel, and staff should be shielded from those demands and external influences as much as possible.

That being said, another source of potential bias for the examiner, and the final one to be discussed here, is poorly filtered internal influences. While separation from law enforcement agencies or prosecutors' offices can help alleviate external influences, such separation will not help in these instances. Examples of poorly filtered internal influences include fear in the workplace, quota demands, and quality control issues.
Subordinates can be subjected to bias when supervisors can exercise punitive measures if the subordinate does not follow particular courses of action that may be unjustified but are demanded by the supervisor. Such pressures can be very overt, or subtle if there is a history of such punitive measures being implemented. Quota demands can also introduce bias because examiners rushed to accomplish a set number of tasks may not pay as much attention to detail as they otherwise would. An example of this was the discussion of response time for the candidate-list study, in which examiners spent less time on comparisons from the 20-candidate lists than from the 10-candidate lists (Dror et al., 2012). There could be a reliance on other information to help "speed along" the examination, thus allowing for a greater top-down biasing effect.

Finally, internal quality control issues can also introduce unintended bias, and one example emphasizes the potential danger. When discussing bias at a workshop, it was mentioned that laboratories could document the number of inconclusive and elimination results an examiner reaches over a period of time to help provide some quantifiable evidence for a common anecdote: that examiners make determinations of inconclusive or elimination as often as they make identifications; such determinations are just not highlighted because they often do not result in a suspect being arrested. One attendee replied that this was the practice in their laboratory. The unfortunate issue was that the quality team in that laboratory would review the statistics and, if too many inconclusive results were reached, the examiner could be given corrective counseling for being "lazy." I asked the attendee if the staff knew what that threshold was and was advised that they did. The concern in such a situation is that an individual who reached their "allotment" of inconclusive results might be inclined toward an identification in the next case not because the evidence warranted it, but because he or she wanted to avoid a confrontation in the manager's office.
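Tracking such outcome distributions is itself trivial, as the short sketch below shows with an invented log of one examiner's reported conclusions; the danger described above lies not in the tally but in turning it into a quota.

    from collections import Counter

    # Invented log of one examiner's reported conclusions over a quarter.
    conclusions = [
        "identification", "inconclusive", "identification", "elimination",
        "inconclusive", "identification", "inconclusive", "elimination",
    ]

    tally = Counter(conclusions)
    total = sum(tally.values())
    for outcome, count in tally.most_common():
        print(f"{outcome}: {count} ({count / total:.0%})")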
Final thoughts

Bias is present in nearly everything that one does, and it is linked to subjectivity. All disciplines in forensic science exist on an objective-subjective continuum, with the understanding that this can result in varying levels of reliability. During data collection in firearm and toolmark identification, especially comparative microscopy, there is an interplay between objectivity and subjectivity. In addition, the conclusions are a subjective assessment of the totality of the data. By definition, subjectivity does imply a personal bias at some level.

A number of studies have examined bias in various contexts in forensic science. These studies have had limitations in terms of sample sizes, populations tested, and participants knowing that they were being tested, which in and of itself is a bias. At the same time, these studies have demonstrated that there is a potential for bias, and this potential cannot and should not be ignored because it can impact the reliability of the results. Researchers have generally demonstrated that when the data set is strong, the influence of bias is lessened. It is when the data set is ambiguous that the influence of bias is of greatest concern. At the same time, while there may be bias
in the decision-making process, it still may not affect the outcome if the bias was not the reason a particular decision threshold was crossed. Processes and procedures can be examined to determine if there is anything inherent in them that could be unintentionally introducing an examiner bias. In addition, improvements can be made to minimize bias by minimizing subjectivity as much as is reasonable, understanding that subjectivity can never be completely eliminated. It is also important to understand that there are normal case situations involving firearm and toolmark evidence in which inconclusive or exclusion results are reported alongside common source determinations for other evidence. It is important that each individual data set be treated individually, understanding that, for example, the results of a cartridge case comparison could potentially influence decision making when comparing more ambiguous bullets from the same case.

Bias cannot be avoided. At the same time, the mere potential for bias should not be cause for declaring a particular forensic science discipline that relies on subjective interpretation wholly unreliable. The next chapter will review and evaluate validation studies that have examined the reliability of firearm and toolmark identification as currently practiced in contemporary forensic science laboratories.
References

Bacon, F., 1620. Novum Organum Scientiarum.
Biasotti, A., Murdock, J., 1997. Firearms and toolmark identification. In: Faigman, D., Kaye, D., Saks, M., Sanders, J. (Eds.), Modern Scientific Evidence: The Law and Science of Expert Testimony, vol. 2. West, St. Paul, MN.
Charlton, D., Fraser-Mackenzie, P., Dror, I., 2010. Emotional experiences and motivating factors associated with fingerprint analysis. J. Forensic Sci. 55 (2), 385–393.
Cole, S., 2006. The prevalence and potential causes of wrongful conviction by fingerprint evidence. Golden Gate Univ. Law Rev. 37, 39–105.
Dror, I., Peron, A., Hind, S., Charlton, D., 2005. When emotions get the better of us: the effect of contextual top-down processing on matching fingerprints. Appl. Cogn. Psychol. 19, 799–809.
Dror, I., Charlton, D., Peron, A., 2006. Contextual information renders experts vulnerable to making erroneous identifications. Forensic Sci. Int. 156, 74–78.
Dror, I., Charlton, D., 2006. Why experts make errors. J. Forensic Identification 56 (4), 600–616.
Dror, I., Rosenthal, R., 2008. Meta-analytically quantifying the reliability and biasability of forensic experts. J. Forensic Sci. 53 (4), 900–903.
Dror, I., Hampikian, G., 2011. Subjectivity and bias in forensic DNA mixture interpretation. Sci. Justice 51, 204–208.
Dror, I., Champod, C., Langenburg, G., Charlton, D., Hunt, H., Rosenthal, R., 2011. Cognitive issues in fingerprint analysis: inter- and intra-expert consistency and the effect of the "target" comparison. Forensic Sci. Int. 208, 10–17.
Dror, I., Wertheim, K., Fraser-Mackenzie, P., Walajtys, J., 2012. The impact of human-technology cooperation and distributed cognition in forensic science: biasing effects of AFIS contextual information on human experts. J. Forensic Sci. 57 (2), 343–352.
Dror, I., 2013. The ambition to be scientific: human expert performance and objectivity. Sci. Justice 53, 81–82.
Fraser-Mackenzie, P., Dror, I., Wertheim, K., 2013. Cognitive and Contextual Influences in Determination of Latent Fingerprint Suitability for Identification Judgments. National Institute of Justice Grant Number 2010-DN-BX-K270, Final Report.
Gianelli, P., 2007. Confirmation bias. Crim. Justice 22, 61–62.
Haber, R., Haber, L., 2013. The culture of science: bias and forensic evidence. J. Appl. Res. Memory Cogn. 2, 65–67.
Interpol European Expert Group on Fingerprint Identification (IEEGFI), 2003. Method for Fingerprint Identification. http://latent-prints.com/images/Forensic%20-%20Fingerprint.htm. Accessed August 21, 2017.
Kassin, S., Dror, I., Kukucka, J., 2013. The forensic confirmation bias: problems, perspectives, and proposed solutions. J. Appl. Res. Memory Cogn. 2, 42–52.
Kerstholt, J., Eikelboom, A., Dijkman, T., Stoel, R., Hermsen, R., van Leuven, B., 2010. Does suggestive information cause a confirmation bias in bullet comparisons? Forensic Sci. Int. 198, 138–142.
Krane, D., Ford, S., Gilder, J., Inman, K., Jamieson, A., Koppl, R., Kornfield, I., Risinger, D.M., Rudin, N., Taylor, M., Thompson, W., 2008. Letter to the editor: sequential unmasking: a means of minimizing observer effects in forensic DNA interpretation. J. Forensic Sci. 53 (4), 1006–1007.
Langenburg, G., Champod, C., Wertheim, P., 2009. Testing for potential contextual bias effects during the verification stage of the ACE-V methodology when conducting fingerprint comparisons. J. Forensic Sci. 54 (3), 571–582.
National Research Council, 2009. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press, Washington, DC.
Nichols, R., 1997. Drug proficiency test false positives: a lack of critical thought. Sci. Justice 37 (3), 191–196.
Nichols, R., 2003. Consecutive matching striations (CMS): its definition, study and application in the discipline of firearms and tool mark identification. AFTE J. 35 (3), 298–306.
Office of the Inspector General, 2006. A Review of the FBI's Handling of the Brandon Mayfield Case. US Department of Justice, Washington, DC.
Risinger, D.M., Saks, M., Thompson, W., Rosenthal, R., 2002. The Daubert/Kumho implications of observer effects in forensic science: hidden problems of expectation and suggestion. Calif. Law Rev. 90 (1), 1–56.
Smith, E., 2005. Cartridge case and bullet comparison validation study with firearms submitted in casework. AFTE J. 37 (2), 130–135.
Web, 2017a. Bias. Merriam-Webster.com. Merriam-Webster, n.d. Accessed August 22, 2017.
Web, 2017b. Science. Merriam-Webster.com. Merriam-Webster, n.d. Accessed August 22, 2017.
Web, 2017c. Art. Merriam-Webster.com. Merriam-Webster, n.d. Accessed August 22, 2017.
Web, 2017d. Objective. Merriam-Webster.com. Merriam-Webster, n.d. Accessed August 22, 2017.
Web, 2017e. Subjective. Merriam-Webster.com. Merriam-Webster, n.d. Accessed August 22, 2017.
Wells, G., Wilford, M., Smalarz, L., 2013. Forensic science testing: the forensic filler-control method for controlling contextual bias, estimating error rates, and calibrating analysts' reports. J. Appl. Res. Memory Cogn. 2, 53–55.