SCIENTIFIC & TECHNICAL
When can a DNA profile be regarded as unique?
DJ BALDING
Department of Applied Statistics, University of Reading, P.O. Box 240, Reading RG6 6FN, UK
Science & Justice 1999; 39: 257-260
Received 8 April 1999; accepted 1 July 1999
The probability that a defendant's DNA profile is unique in a population of untyped individuals is shown to be bounded below by one minus twice the sum of the match probabilities over the population. This bound assumes that the possibility of laboratory or handling error can be neglected, and applies only when there is no non-DNA evidence in favour of the defendant. There cannot be a completely general lower bound: if there is overwhelming non-DNA evidence that the defendant is not the source of the crime stain, then that is also overwhelming evidence of non-uniqueness. Application to k-locus short tandem repeat (STR) profiles is discussed, and illustrated with calculations based on the 6-STR-locus system used in current UK casework. However, because of the problem of the non-DNA evidence, there seems to be no satisfactory way for an expert witness to address the question of uniqueness in court.
Key Words: Forensic science; Identification; Statistics; Probability; Weight of evidence; Interpretation; Match probability. Science & Justice 1999; 39(4): 257-260
Uniqueness of DNA profiles.
Introduction

When the DNA profile of a defendant matches that obtained from a crime scene sample, the strength of this evidence is often conveyed in court by an expert witness's assessment of the "match probability". For a precise definition of the match probability, see [1,2]. Here, it can be taken to be the probability that a particular unprofiled individual would also have a matching DNA profile. The smaller the match probability, the less plausible is the possibility that such an individual could have been the source of the crime stain, and hence the more likely it is that the defendant was the source. Match probabilities for alternative suspects unrelated to the defendant are often extremely small: 1 in 100 million is not unusual in current UK practice, based on six STR loci [3], and a 10-STR-locus system currently under development will lead to calculated match probabilities substantially less than 1 in 1 billion (IW Evett, personal communication). Methods for calculating match probabilities have been the subject of controversy [4], although the controversy seems to have been quelled following the widespread introduction of an F_ST (or θ) correction to allow for possible shared ancestry between the defendant and the unprofiled individual [5,6].

We do not discuss here methods of calculating match probabilities, nor their role in interpreting DNA evidence [1,7]. Instead, we consider another question. When an expert witness quotes a match probability as small as 1 in 100 million, isn't this equivalent to saying that he or she is reasonably certain that the defendant's DNA profile is unique in the population of possible sources of the crime stain? And if so, wouldn't jurors be better assisted by the expert giving a "plain English" statement of this, rather than a match probability whose unfamiliar magnitude may overwhelm or confuse?
For example, perhaps an expert witness could assert that, excluding identical twins and laboratory/handling errors, in his/her opinion the defendant's DNA profile is almost certainly unique in some population which includes most or all of the alternative possible sources. The US FBI announced in November 1997 a policy of declaring "uniqueness" in criminal identification cases [8], but it seems not to have been routinely implemented. Then, as now (during 1999), there was very little analysis or discussion in the scientific literature which could underpin such a policy. The issue is briefly discussed in the report on forensic DNA evidence by the US National Research Council [6]. However, that report did not consider population genetic issues, relatedness, or the role of the non-DNA evidence, so that its discussion is of little value. More recently, the monograph of Evett and Weir [7] on interpreting DNA evidence touches on the issue of uniqueness only very briefly, to warn against some elementary fallacies.
Although attractive in some respects, a practice of declaring uniqueness in court would lead to substantial difficulties. One of these is the problem of dealing with those cases in which uniqueness cannot reasonably be asserted, which may often arise for mixed profiles, or in cases of questioned paternity or other relatedness. Perhaps the most important barrier to declaring uniqueness is the problem of the non-DNA evidence in a case.

The event that a particular DNA profile is unique is either true or false; no "objective" probability can be assigned to it. Nevertheless, since this truth or falsity cannot be established in practice, a probability of uniqueness based on the information available to an expert witness, such as that obtained from DNA profile databases, together with population genetics theory, may potentially be useful to a court. The problem then arises as to what data and theory the expert should take into account. Specifically, the non-DNA evidence in a case may be directly relevant, yet it may not be appropriate for the DNA expert to assess this evidence. We return later to the difficulties inherent in declaring uniqueness. First, we outline an analysis leading to a lower bound on the probability of uniqueness when the non-DNA evidence in the case does not favour the defendant.

Probability analysis

Let U denote the event that (i) the DNA profile of s, the defendant, matches the crime scene profile, and (ii) there is no matching individual in some population 𝒫 of unprofiled individuals. The population 𝒫 is assumed to include all the possible sources of the crime stain other than s. It is also assumed that s is known to have no identical twin, and that laboratory or handling errors do not occur. Under these assumptions, U implies G, the event that s is the source of the crime stain, and so P(U|A) can never exceed P(G|A), whatever the conditioning event A. In fact, writing E for all the evidence presented in the case, the implication gives P(U|E) = P(U|G,E) P(G|E), and the expression for P(G|E) shown in [2] then yields

\[
P(U \mid E) = P(U \mid G, E)\, P(G \mid E) = \frac{P(U \mid G, E)}{1 + \sum_{x \in \mathcal{P}} R_s(x)\, w_s(x)}, \tag{1}
\]

where R_s(x) denotes the match probability for alternative possible culprit x, while w_s(x) measures the weight of the non-DNA evidence against x, relative to its weight against s (for example, w_s(x) = 1 means that the non-DNA evidence is equally incriminating, or exculpatory, for s and x). It seems reasonable to suppose that the evidence in the case has no bearing on the question of uniqueness beyond its bearing on the question of guilt, so that P(U|G,E) = P(U|G). Further, non-uniqueness is the union of the events "the DNA profile of x matches the DNA profile of s", for all x ∈ 𝒫. From the elementary rules of probability, the probability of this union is bounded above by the sum of all the match probabilities (which equals the expected number of individuals in 𝒫 sharing the DNA profile of s). It follows that

\[
P(U \mid G) \ge 1 - \sum_{x \in \mathcal{P}} R_s(x). \tag{2}
\]
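Substituting (2) in (1) gives

\[
P(U \mid E) \ge \frac{1 - \sum_{x \in \mathcal{P}} R_s(x)}{1 + \sum_{x \in \mathcal{P}} R_s(x)\, w_s(x)}. \tag{3}
\]

In some cases, the non-DNA evidence does not favour s, so that on the basis of this evidence no individual is regarded as a more plausible suspect than s. Then w_s(x) ≤ 1 for all x ∈ 𝒫, and (3) leads to

\[
P(U \mid E) \ge \frac{1 - \sum_{x \in \mathcal{P}} R_s(x)}{1 + \sum_{x \in \mathcal{P}} R_s(x)} \ge 1 - 2 \sum_{x \in \mathcal{P}} R_s(x). \tag{4}
\]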
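As a rough numerical illustration of the two bounds, the following Python sketch evaluates them for a list of match probabilities. It reads bound (3) as (1 - ΣR) / (1 + ΣRw) and bound (4) as 1 - 2ΣR, the latter being the form stated in the abstract. The function names and the example population are hypothetical, introduced here only for illustration, and the calculation inherits the assumptions of the text (no identical twin, no laboratory or handling error).

```python
from typing import Optional, Sequence


def bound_eq3(match_probs: Sequence[float],
              weights: Optional[Sequence[float]] = None) -> float:
    """Lower bound (3) on P(U|E): (1 - sum R) / (1 + sum R*w).

    If no weights are given, every w_s(x) is taken to be 1, meaning the
    non-DNA evidence bears equally on s and on each alternative x.
    """
    if weights is None:
        weights = [1.0] * len(match_probs)
    if len(weights) != len(match_probs):
        raise ValueError("one weight is needed per match probability")
    total_r = sum(match_probs)
    total_rw = sum(r * w for r, w in zip(match_probs, weights))
    return (1.0 - total_r) / (1.0 + total_rw)


def bound_eq4(match_probs: Sequence[float]) -> float:
    """Simpler lower bound (4): 1 - 2 * sum R, valid when all w_s(x) <= 1."""
    return 1.0 - 2.0 * sum(match_probs)


# Hypothetical population: 1000 alternative sources, each with R = 1e-7.
probs = [1e-7] * 1000
print("bound (3):", bound_eq3(probs))
print("bound (4):", bound_eq4(probs))
```

With all weights equal to 1, bound (4) is never larger than bound (3), so it is the more conservative of the two.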
To help motivate the bound (4), consider the following simplified example. A person is sampled anonymously and at random from a population 𝒫 of size N+1, and found to have DNA profile D. The DNA profiles of the other N individuals are unknown; each is assumed to be D independently with probability p, so that the probability that D is unique in 𝒫 is P(U) = (1-p)^N. A second individual is then drawn at random from 𝒫. With probability 1/(N+1), the second individual is the same as the first (call this event G). Now, the second individual is typed and found to have profile D (call this observation E). If G holds, E provides no additional information about U, so that P(U|G,E) = (1-p)^N. Otherwise, the two individuals sampled are distinct yet both have profile D, implying that U is false. By Bayes' Theorem we can calculate

\[
P(G \mid E) = \frac{1}{1 + Np}, \tag{5}
\]

and hence

\[
P(U \mid E) = \frac{(1-p)^N}{1 + Np} \ge 1 - 2Np, \tag{6}
\]

which is the bound given by (4) in this setting.
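The simplified example can be checked numerically. Under its assumptions, Bayes' theorem gives P(G|E) = 1/(1+Np), so that P(U|E) = (1-p)^N / (1+Np), and the bound is 1 - 2Np. The sketch below compares the two for a few values of N and p; these values and the function names are hypothetical and chosen only to show how the gap between exact value and bound behaves.

```python
def exact_uniqueness(n_others: int, p: float) -> float:
    """P(U|E) = (1 - p)^N / (1 + N*p) in the simplified sampling example."""
    return (1.0 - p) ** n_others / (1.0 + n_others * p)


def simple_bound(n_others: int, p: float) -> float:
    """Lower bound 1 - 2*N*p, since N*p is the sum of the match probabilities."""
    return 1.0 - 2.0 * n_others * p


# Hypothetical (N, p) pairs, for illustration only.
cases = [(10**6, 1e-9), (10**7, 1e-9), (10**7, 2.4e-8)]
for n_others, p in cases:
    exact = exact_uniqueness(n_others, p)
    bound = simple_bound(n_others, p)
    print(f"N={n_others:.0e}  p={p:.1e}  exact={exact:.6f}  bound={bound:.6f}")
```

When Np is small the bound is close to the exact value; as Np grows the bound becomes loose but remains valid.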
Application to profiles of k STR loci

Currently in the UK, a profile consisting of six Short Tandem Repeat (STR) loci is routinely employed in forensic identification casework [9,10]. Typical match probabilities for this system, for a range of possible relationships of x with s, are shown in Table 1. Since we assume here that identical twins do not exist, the largest match probabilities apply to brothers, and the first row of Table 1 indicates that R_s(x) is usually about 0.002 when x is a brother of s. For unrelated individuals, R_s(x) ≈ 2 × 10^-8, assuming an allowance of 2% for F_ST.

The quantiles (5% point, median, 95% point) of match probabilities reported in Table 1 for each relationship are calculated from 1000 simulated 6-locus STR profiles. The simulations assumed independence within and between loci, and used the Caucasian STR allele frequencies reported in [3]. Calculated match probabilities were for a sample size of n=600; population allele proportions were calculated as (x+2)/(n+4), where x is the sample allele frequency.
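The allele-proportion adjustment (x+2)/(n+4) mentioned above can be written out as a one-line function. This is a minimal sketch under the assumption that x is the count of the allele among n sampled alleles; the function name is hypothetical.

```python
def allele_proportion(count: int, sample_size: int) -> float:
    """Adjusted population allele proportion (x + 2) / (n + 4).

    Assumes `count` (x) is the number of copies of the allele observed
    among `sample_size` (n) sampled alleles.
    """
    if not 0 <= count <= sample_size:
        raise ValueError("count must lie between 0 and sample_size")
    return (count + 2) / (sample_size + 4)


# With n = 600, an allele never seen in the sample still gets a
# non-zero proportion, so no match probability collapses to zero.
for x in (0, 6, 60):
    print(x, round(allele_proportion(x, 600), 5))
```

One effect of the adjustment is that rare or unobserved alleles are assigned slightly larger proportions than their raw sample frequencies.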
Suppose, arbitrarily, that uniqueness can be regarded as established if P(U|E) > 99.9%. Using (4), this amounts to requiring that the sum over 𝒫 of the match probabilities is at most 1/2000. As an illustration, consider the population 𝒫 described in Table 1. These numbers of possible culprits in each category of relatedness to s might be close to the largest expected to arise in casework. When k=6, uniqueness clearly cannot be asserted: the expected number of individuals sharing the DNA profile of s is typically about 1/4, with the largest contribution coming from the individuals unrelated to s. Depending on the other evidence in the case, a satisfactory conviction may nevertheless be achieved. Convictions are routinely secured in UK courts based primarily on 6-STR-locus matches.

TABLE 1 Typical match probabilities for the system presented in this paper, for a range of possible relationships of x with s.

Relationship with s            No. of individuals    Match probability: (5%, 50%, 95%) quantiles
brother                        5                     (1.4, 2.2, 3.3) × 10^-3
uncle, nephew, half-brother    20                    (0.6, 2.9, 11) ×
cousin                         100                   (0.6, 4.7, 25) ×
unrelated                      10^7                  (0.1, 2.4, 21) × 10^-8

Table 2 indicates the proportion of cases in which we might expect uniqueness to be established, according to our chosen criterion, for k-locus STR profiles of a discriminatory power similar to those of the 6 loci in current use [3]. The k-locus match probabilities are obtained by raising to the power k/6 the match probability for a 6-locus profile simulated as for Table 1. When k=8, uniqueness is almost never achieved, whereas it is usually achieved when k=10 and almost always when k=11. The contribution to the uniqueness calculation from unrelated individuals becomes relatively less important as k increases, and the contribution from brothers begins to dominate for k larger than about 8, which is close to the point at which uniqueness becomes routinely achievable. The contribution to the calculation of relatives other than brothers is negligible in all cases. Therefore, as a rough guide to whether or not uniqueness can be asserted for a particular profile, one can check that the expected number of brothers of s sharing his profile is sufficiently small, say less than 1/2500.

TABLE 2 Proportion of k-locus STR profiles achieving "uniqueness" for the suspect population of Table 1.

No. of loci, k         8       9       10      11
Proportion 'unique'    0.00    0.32    0.98    1.00

The relative importance of brothers in the uniqueness calculation means that the results of Table 2 are somewhat insensitive to the number of individuals in 𝒫 who are unrelated to s. Even if this number were increased from 10 million to 100 million, 11 loci would still suffice to establish uniqueness under our adopted assumptions and criteria. However, 12 loci would usually be needed to accommodate 1000 million unrelated individuals.

Another consequence of the role of brothers is the importance of the number of distinct loci investigated by a DNA profiling system. Systems which are very highly discriminatory at only a few loci, such as the "digital fingerprints" of Jeffreys et al. [11], suffer the drawback that they are relatively weak in eliminating the relatives of s as alternative possible sources.
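The k-locus calculation described above can be sketched in a few lines of Python. The sketch scales a 6-locus match probability to k loci by raising it to the power k/6, sums the expected number of matching individuals, and compares the sum with the 1/2000 threshold implied by the 99.9% criterion. The values used are assumptions for illustration: 5 brothers with a 6-locus match probability of 2.2e-3 (the text's "about 0.002") and 10 million unrelated individuals at 2.4e-8 (consistent with an expected number of matches of about 1/4 when k=6). Other relatives are left out because their contribution is described as negligible, and only a single typical profile is evaluated, so this does not reproduce the simulated proportions of Table 2.

```python
# Assumed typical 6-locus values (illustrative, see lead-in):
# relationship -> (number of individuals, 6-locus match probability)
POPULATION = {
    "brother": (5, 2.2e-3),
    "unrelated": (10**7, 2.4e-8),
}
UNIQUENESS_LEVEL = 0.999  # the arbitrary 99.9% criterion


def expected_matches(k: int, population=POPULATION) -> float:
    """Expected number of individuals sharing a k-locus profile.

    Each 6-locus match probability is raised to the power k/6.
    """
    return sum(n * r6 ** (k / 6) for n, r6 in population.values())


def is_unique(k: int, level: float = UNIQUENESS_LEVEL) -> bool:
    """Apply the bound 1 - 2*sum: uniqueness needs sum <= (1 - level) / 2."""
    return expected_matches(k) <= (1.0 - level) / 2.0


for k in range(6, 13):
    print(f"k={k:2d}  expected matches={expected_matches(k):.2e}  "
          f"unique={is_unique(k)}")
```

For these assumed typical values the threshold is first met at k=10, in line with the statement that uniqueness is usually achieved with 10 loci.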
The role of the non-DNA evidence

Consider a crime scene DNA profile which is thought to be so rare that an expert might be prepared to assert that it is unique. Suppose that, for reasons unrelated to the crime, it is subsequently noticed that the crime scene profile matches that of the Archbishop of Canterbury. On further investigation, it is found to be a matter of public record that the Archbishop was taking tea with the Queen of England at the time of the offence in another part of the country. (You may consider your preferred religious leader, beverage, and head of state in place of those named here.) A reasonable expert would, in the light of these facts, revise downwards any previous assessment of the probability that the crime scene profile was unique. However, this is just an extreme case of the more general phenomenon that any evidence in favour of a defendant's claim that he is not the source of the crime stain is evidence against the uniqueness of his DNA profile.

A juror conversant with probability theory could assess the effect of the non-DNA evidence on the probability of uniqueness by assigning values to the w_s(x) in (3). It would not usually be appropriate for a DNA expert to assign values to the w_s(x), as this would imply an assessment of the non-scientific evidence. The expert may, however, perform illustrative calculations using (3) for various assignments of the w_s(x). The expert may try to avoid the "other evidence" problem by using in court a formulation such as "My assessment that the DNA profile of the defendant is unique takes no account of the non-DNA evidence in this case, and this evidence may have a bearing on uniqueness". However, the goal of simplification may then be lost, as it seems likely that such a statement would be no less confusing to a juror than a match probability.
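An illustrative calculation of the kind mentioned above might look like the following Python sketch. It reads bound (3) as (1 - ΣR) / (1 + ΣRw) and, for simplicity, assigns the same weight w to every alternative source. The summed match probability of 2e-4 and the grid of w values are hypothetical, chosen only to show how the bound responds to non-DNA evidence favouring the defendant (large w).

```python
def bound_with_common_weight(sum_match_probs: float, w: float) -> float:
    """Bound (3) when every alternative source x has the same weight w.

    w < 1: the non-DNA evidence points towards the defendant s.
    w > 1: the non-DNA evidence points away from s.
    """
    return (1.0 - sum_match_probs) / (1.0 + w * sum_match_probs)


SUM_R = 2e-4  # hypothetical sum of match probabilities over the population
WEIGHTS = (0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0)

for w in WEIGHTS:
    print(f"w={w:>8.1f}  lower bound on P(U|E) = "
          f"{bound_with_common_weight(SUM_R, w):.4f}")
```

In this sketch the bound stays above 99.9% only while w is at most 1; strong evidence in favour of the defendant drives it down sharply, which is the Archbishop example in numerical form.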
Discussion

We have seen that, under certain assumptions, 10 or 11 STR loci may often suffice to establish "uniqueness". (Current UK practice employs six loci, although a 10-locus system is under development.) One of our assumptions was a 99.9% criterion for uniqueness, chosen arbitrarily. Another key assumption is the validity of the match probability calculations. The incorporation of an appropriate value of F_ST, usually between 1% and 3%, into match probability calculations makes it unlikely, in the opinion of most commentators, that the calculated values will be unfavourable to defendants. Nevertheless, this can only be checked empirically for two or perhaps three loci. The larger number of loci required to establish uniqueness would increase concerns over the validity of the match probability calculations. However, it was noted above that in cases of interest the uniqueness bound is dominated by the contribution from siblings (usually brothers) of s. Match probability calculations may be more robust for brothers than for unrelated individuals, provided that the STR loci are unlinked, as the calculation is primarily based on well-established Mendelian laws for shared inheritance from parents.

Perhaps the most problematic assumption underlying the calculation of a probability of uniqueness is the assumption that there is no evidence in favour of s. In some cases there is evidence favouring the defendant. More generally, it is usually not appropriate for the forensic scientist to pre-empt the jurors' assessment of the non-scientific evidence. Although the latter difficulty presents a substantial barrier to reporting uniqueness in court, this does not imply a serious problem for the fair and effective use of DNA profile evidence. Focussing on the directly relevant issue, whether or not the defendant is the source of the crime stain, rather than uniqueness, makes more efficient use of the evidence and, properly presented and explained to the court, can suffice as a basis for satisfactory prosecutions.
Acknowledgment

I thank Charles Brenner, Dennis Lindley, Karen Ayres and Bruce Weir for comments on a draft manuscript. This work was initiated while the author was a visitor at the Isaac Newton Institute for Mathematical Sciences, Cambridge, UK.
References

1. Balding DJ and Donnelly P. Inferring identity from DNA profile evidence. Proceedings of the National Academy of Sciences of the USA 1995; 92: 11741-11745.
2. Balding DJ and Donnelly P. Inference in forensic identification. Journal of the Royal Statistical Society A 1995; 158: 21-53 (with discussion).
3. Evett IW, Gill PD, Lambert JA et al. Statistical analysis of data for three British ethnic groups from a new STR multiplex. International Journal of Legal Medicine 1997; 110: 5-9.
4. Roeder K. DNA fingerprinting: a review of the controversy. Statistical Science 1994; 9: 222-278.
5. Balding DJ and Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 1995; 96: 3-12.
6. National Research Council. The evaluation of forensic DNA evidence. Washington DC: National Academy Press, 1996.
7. Evett IW and Weir BS. Interpreting DNA evidence: statistical genetics for forensic scientists. Sunderland MA: Sinauer, 1998.
8. Holden C. DNA fingerprinting comes of age. Science 1997; 278: 1407.
9. Sparkes R, Kimpton C, Watson S et al. The validation of a 7-locus multiplex STR test for use in forensic casework (I) Mixtures, ageing, degradation and species studies. International Journal of Legal Medicine 1996; 109: 186-194.
10. Sparkes R, Kimpton C, Gilbard S et al. The validation of a 7-locus multiplex STR test for use in forensic casework (II) Artefacts, casework studies and success rates. International Journal of Legal Medicine 1996; 109: 195-204.
11. Jeffreys AJ, MacLeod A, Tamaki K, Neil DL and Monckton DG. Minisatellite repeat coding as a digital approach to DNA typing. Nature 1991; 354: 204-209.