Forensic Science International 193 (2009) 63–71
Implementing statistical learning methods through Bayesian networks. Part 1: A guide to Bayesian parameter estimation using forensic science data

A. Biedermann a,*, F. Taroni a, S. Bozza b

a The University of Lausanne, Ecole des Sciences Criminelles, Institut de Police Scientifique, le Batochime, 1015 Lausanne-Dorigny, Switzerland
b University Ca’ Foscari, Department of Statistics, Venice, Italy

* Corresponding author. Tel.: +41 21 692 46 00. E-mail addresses: [email protected] (A. Biedermann), [email protected] (F. Taroni), [email protected] (S. Bozza).
ARTICLE INFO

Article history: Received 11 July 2008; Received in revised form 10 July 2009; Accepted 14 September 2009; Available online 14 October 2009.

ABSTRACT
As a thorough aggregation of probability and graph theory, Bayesian networks currently enjoy widespread interest as a means for studying factors that affect the coherent evaluation of scientific evidence in forensic science. Paper I of this series of papers intends to contribute to the discussion of Bayesian networks as a framework that is helpful for both illustrating and implementing statistical procedures that are commonly employed for the study of uncertainties (e.g., the estimation of unknown quantities). While the respective statistical procedures are widely described in the literature, the primary aim of this paper is to offer an essentially non-technical introduction on how interested readers may use these analytical approaches – with the help of Bayesian networks – for processing their own forensic science data. Attention is mainly drawn to the structure and underlying rationale of a series of basic and context-independent network fragments that users may incorporate as building blocks while constructing larger inference models. As an example of how this may be done, the proposed concepts will be used in a second paper (Part II) for specifying graphical probability networks whose purpose is to assist forensic scientists in the evaluation of scientific evidence encountered in the context of forensic document examination (i.e., results of the analysis of black toners present on printed or copied documents).

© 2009 Elsevier Ireland Ltd. All rights reserved.
Keywords: Bayesian networks; Statistical learning methods; Bayesian parameter estimation
1. Introduction

Most of the currently employed inference procedures for evaluating scientific evidence in forensic science involve a likelihood ratio [1,2]. Such a ratio provides an expression of the degree to which particular evidence allows one to discriminate between competing propositions of interest. Likelihood ratio formulae with different levels of technicality can be found in the specialised scientific and legal literature, as is illustrated by approaches developed in domains such as DNA [3–5] or fibre [6] evidence.

The technicality of likelihood ratios may be encountered in the form of different kinds of questions. "What are the assumptions built into a given model?" is a typical example, and one of importance for forensic scientists, because they need to examine carefully whether the assumptions encoded in a model – for example, the definition of the variables along with the relationships thought to hold among them – are appropriate for a case
under examination. Questions of this kind may meaningfully be approached and studied through the use of, for example, modern graphical modelling techniques with an underlying probabilistic framework, in particular Bayesian networks.1 Actually, existing likelihood ratio formulae can be transcribed into Bayesian networks, which allow intrinsic assumptions to be represented in a transparent and rigorous way [9]. Because of their graphical environment, Bayesian networks also facilitate the task of adjusting, if necessary, a model's specification (as new information becomes available) without the need to re-work the formal derivation of the respective likelihood ratio. Note, however, that beyond the observation that Bayesian networks support likelihood ratio computation in the presence of model reformulation, the specific topic of model selection will not be addressed in this paper. More generally, Bayesian networks enjoy widespread acceptance as a formalism for handling uncertainty (as well as for building practical devices to handle uncertainty) in various fields in which incomplete information plays a central role. Examples of applications can be found in contexts as diverse as industry [10], medicine [11], or agriculture and livestock management [12].

Once the scientist agrees with the structural properties of a model (such as the definition of nodes as well as their relationships), a second major category of questions may need to be approached. Typically, these may be of the form "How is one to choose particular numerical values?" This sort of question equally affects the use of formal likelihood ratios and of Bayesian networks. With Bayesian networks, for example, this question typically refers to completing node probability tables. Generally, the complication is one that emerges when the scientist aims at transposing a given inference procedure from a purely theoretical consideration to a substantive practical application. The point is that the appropriateness of a model's numerical specification has an essential bearing on that model's capacity to provide a meaningful approach to the case scenario at hand. The elicitation of numerical values, too, inevitably relates to a framework of assumptions. This can be considered a sufficiently specific topic on its own – often discussed separately as such – that can be thought about quite independently of a given likelihood ratio formula or, in the case of a graphical model, of the network itself. It is this latter theme – the assessment of numerical values – that the present paper will primarily focus on.

Forensic scientists may need numerical values when the specification of the quantitative part of models entails questions of the kind: "What is the probability of an unknown item of clothing from a particular population of potential sources being acrylic?", "What is the probability of an unknown firearm from a given population of firearms having a barrel with six right-twisted grooves?", or "What is the probability of the allele A in the population?" Statistics is a discipline that provides methods that can help answer such questions. This paper will primarily focus on the Bayesian approach, which, roughly speaking, aims at drawing inferences about populations based on sample information. What is not covered by the present paper is the question of what ought to be an appropriate population for a problem of interest, and whether a sample meets requirements such as randomness and representativeness. Rather, the discussion concentrates on how a given method is operated once a sample is available. In particular, Bayesian networks will be used for the purpose of (a) clarifying procedural characteristics of a given method and (b) explaining how – using examples and real data – particular numerical assessments come to be. Such understanding is an important aspect of efforts that seek to assure that evidence offered by scientists is trustworthy [13].

Part II of this series of papers will concentrate on how the notions introduced in this paper may be used to assist the specification of a graphical inference model. The discussion will rely on a setting that involves the evaluation of the results of toner analyses in forensic document examination. This forensic discipline is chosen here because it is one that currently is not regularly approached from a (Bayesian) probabilistic perspective [14].

1 Bayesian networks aggregate graph and probability theory, with nodes representing variables and directed edges between nodes representing relevance relationships. The strength of probabilistic influence among variables within a network – an acyclic combination of nodes and edges – is expressed in terms of probability tables associated with each of a network's nodes. Additional elements of formal definition are available in the specialised literature on this topic [7,8].
Throughout the papers, the subject matter will be presented in a way that tends to avoid a formal approach, because such a form of presentation can be found in more specialised treatises available on the topic [2,7,8]. Instead, the intention is to illustrate the rationale behind the key concepts – basic knowledge of which the reader is assumed to have – through both theoretical and practical examples, and to show how they may be operated in casework using real forensic science data. The primary aim will be to present a collection of generic network fragments that scientists may reuse and adapt for building their own applications [15–17].

2. Preliminaries

The kind of question of interest here and in later sections is how one is to assign a numerical value to the event that an unknown member of a given population is of a certain type. A fairly general setting to which this sort of question applies – and which will be studied below – is one where each member of the population has one of two possible attributes. The question of interest thus relates to the proportion, in this context referred to as a parameter, of members of the population which have a given attribute. The actual value of that parameter is almost always unknown, essentially because the entire population cannot be investigated. Instead, the parameter may be estimated, or 'learned', on the basis of a sample, that is, a subset of the population whose attributes are investigated.

Sometimes it is convenient to think of the population in terms of a model (alternatively called a hypothesis), which may be, for example, an urn containing balls whose labels reflect the kind of attribute possessed by each member of the population. Maximum likelihood estimation consists of finding that model (or hypothesis) whose likelihood – that is, the probability of the observations (i.e., the number of items in a sample found to be of a certain type) given that model – is maximised. Section 3 will briefly summarise the main considerations to which this concept amounts.

The discussion will focus on a sample of individual documents, each of which was printed with a machine that uses black toner. For the time being, attention is confined to a toner characteristic that may be one of two possible types, referred to as 'single component toner' or 'bi-component toner'. More formally speaking, imagine that a sample $D$ of $n$ individual documents $d_i$ ($i = 1, 2, \ldots, n$) is available. Each member of that sample is the result of a xerographic printing process using black toner. Suppose also that each member of the sample is inspected and its type noted, that is, either $S$ (for 'single component toner') or $B$ (for 'bi-component toner'). The sample is thus given by the collection of observed $S$'s and $B$'s. Let $s$ and $b$ denote the total numbers of observed $S$'s and $B$'s in the sample, so that $s + b = n$. By saying that the data can be summarised using just the two pieces of information $s$ and $b$, one disregards the order in which the $S$'s and $B$'s appear.

The assumption is made that the observations are mutually independent, so that if the value of the proportion $\theta$ of members of the population that possess the attribute $S$ were known with certainty, the probability for each observation to be $S$ would be $\theta$. Such samples are said to be 'independent and identically distributed', or 'i.i.d.'. The probability of $s$ (the total number of items among the $n$ documents which contain toner of type $S$) given $\theta$ can be considered as governed by a binomial probability distribution, written shorthand as $B(n, \theta)$:

$$\Pr(s \mid n, \theta) = \binom{n}{s}\,\theta^{s}(1-\theta)^{n-s}, \qquad s = 0, 1, \ldots, n. \tag{1}$$

3. Maximum likelihood parameter learning

The method of maximum likelihood provides an estimate of the unknown parameter $\theta$ by finding that value of $\theta$, often denoted $\hat{\theta}$, that maximises the likelihood function $L(\theta \mid x)$, that is, $L(\hat{\theta} \mid x) = \max_{\theta} L(\theta \mid x)$, where $x$ denotes the observed sample.
For illustration, consider the binomial likelihood function (Eq. (1)) for a scenario in which the data consist of $s = 77$ documents found to have a toner of type $S$ in a sample of $n = 100$ documents.2 That function, formally written $L(\theta \mid 77, 100)$, has a maximum near 0.8; the maximising value is the maximum likelihood estimate (MLE). It may also be determined analytically, and it can be shown that it is given by the sample proportion:

$$\hat{\theta} = \frac{s}{n} = \frac{77}{100} = 0.77.$$

2 As the discussion proceeds, this sample will repeatedly be referred to, and consideration will be given to further descriptors of it. A more detailed account of the sample's characteristics (e.g., mode of collection) will be provided in Part II of this series of papers.

The maximum likelihood approach is a procedure that is sometimes used to relate the information contained in a database, typically a collection of individual cases, to a given Bayesian network. Such an operation may be needed because the available set of cases, which may be thought of as a registry of past experiences, is relevant for the scenario under study. In particular, it may be that one wishes the Bayesian network to rely on that data while processing new cases. The feature of Bayesian networks that interfaces with data is the set of tables associated with each of a network's nodes. The entries of these tables are commonly referred to as 'parameters'. The problem of parameterising a model with a known structure thus consists of finding the values that make up the node tables, that is, of 'learning' the required values from data.

For ease of argument, two assumptions will be made here. The first is to assume the structure of the Bayesian network to be known. The second is to assume the available database to be complete, in the sense that there are no cases without observations (no missing values). Thus, procedures that aim to 'learn' Bayesian networks from data, together with particular methods that focus on deficient data collections, are not considered here. The complete-data assumption is taken to be acceptable here because, usually, in forensic population studies all sampled items (e.g., individuals) are analysed.

In order to illustrate the method of maximum likelihood, imagine a single binary node $A$ that does not receive entering arcs from other nodes. Such a node may be thought of as part of a larger network, for example. For the current discussion, let this binary node reflect the possible outcomes for a physical descriptor of an item of black toner. Following the notation introduced in Section 2, denote these descriptors by $S$ and $B$. Imagine that the node $A$ pertains to a database $D$ with $n$ entries of black toner samples $d_i$ ($i = 1, 2, \ldots, n$). The data are complete in the sense that each item $d_i$ is known to be either of type $S$ or of type $B$. The parameters of the unconditional table of $A$ are $\theta$ and $1 - \theta$. Denote, as in Section 2, by $s$ and $b$ the total numbers of occurrences of $S$ and $B$ among the $n$ items in the database. Assuming independence between the cases $d_i$ given $\theta$, the distribution of the total number $s$ of observed items of type $S$ is binomial with parameters $(n, \theta)$. The maximum likelihood approach for parameterising the node $A$ says to choose that value $\hat{\theta}$ for $\theta$ that maximises the likelihood function:

$$L(\hat{\theta} \mid x) = \max_{\theta} \binom{n}{s}\,\theta^{s}(1-\theta)^{n-s},$$

which yields $\hat{\theta} = s/n$. The maximum likelihood estimates are given by the observed frequencies.
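As a numerical side note (an addition for illustration, not part of the original paper), the maximisation can be checked directly. The following minimal sketch, assuming a Python environment with numpy and scipy available, evaluates the binomial likelihood of Eq. (1) over a grid of candidate values for $\theta$ and compares the grid maximum with the analytic estimate $s/n$:

```python
# Minimal sketch: maximum likelihood for the binomial example of Section 3
# (s = 77 documents of type S in a sample of n = 100). Assumes numpy/scipy.
import numpy as np
from scipy.stats import binom

n, s = 100, 77

# Evaluate the likelihood L(theta | s, n) on a grid of candidate values.
thetas = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(s, n, thetas)

# The grid value maximising the likelihood coincides with the analytic MLE s/n.
print(thetas[np.argmax(likelihood)])  # 0.77
print(s / n)                          # 0.77
```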
4. Bayesian learning

4.1. The Bayesian perspective

The maximum likelihood solution for learning the parameters of a Bayesian network from data is, provided that certain assumptions can be admitted, readily feasible and likely to be well known to most readers. However, there may be situations that lead to peculiar results. For example, with a small data set in which the same value is observed for all cases, a maximum likelihood estimate of 1 is obtained. This situation may be particularly acute if the sample consists of only a single observation. Although this is an extreme setting for a small sample, it is not per se unrealistic. In such a setting, one may prefer to inspect further cases, and any values contrary to the first observation would result in another estimate. In some sense, thus, the value of 1 may be taken as an upper-bound estimate of the 'true' value. An analogous argument applies to settings in which zero occurrences are made.

An alternative view approached in this section – known as the Bayesian approach – allows one to cope with this and other issues. In order to anticipate some of the forthcoming discussion, the key ideas underlying the Bayesian approach, which are essentially twofold, may be summarised as follows:

- The true value of the parameter of interest is taken to be uncertain. It is thus considered a random variable and is allowed to have a probability distribution.
- Inferences about the parameter of interest are made through the use of the rules of probability calculus. Based on an expression of the belief held by an individual in each of the possible parameter values, known in this context as a 'prior distribution', Bayes' theorem is operated to adjust that distribution in the light of the observed data. This gives the so-called 'posterior distribution', from which the desired estimates are calculated. These Bayesian estimates allow one to talk in terms of the probability of the parameter given the data.

The Bayesian approach thus views the learning task as a form of uncertain reasoning from observations, that is, a process of probabilistic inference. Below, the features of this approach will be outlined in further detail and illustrated through the use of graphical probability models (Bayesian networks).

4.2. Parameter learning for a Bernoulli trial

In this section the discussion will focus on two variables. One variable, $\theta$, represents the proportion of the currently discussed population of toners that is of type $S$. A prior distribution on the parameter $\theta$ will be introduced to model uncertainty about its value. The second variable, $d_i$, can assume two possible values: 1 if state $S$ is observed, and 0 if state $B$ is observed, with probability $\theta$ and $1 - \theta$, respectively. Its distribution gives the probability that a particular item of that population will be of type $S$. This is a Bernoulli variable with parameter $\theta$, written $d_i \sim \text{Bernoulli}(\theta)$, with probability distribution

$$\Pr(d_i \mid \theta) = \theta^{d_i}(1-\theta)^{1-d_i}, \qquad d_i \in \{0, 1\}.$$
Generally, the variable $d_i$ can be thought of as a member, or data point, of an $n$-sized sample $\{d_1, d_2, \ldots, d_n\}$. Notice that the sum of $n$ Bernoulli trials with parameter $\theta$ has a binomial distribution with parameters $(n, \theta)$.

4.2.1. Discrete parameter

For the time being, attention will be restricted to a single observation. Another important aspect of the discussion at this point is that – for ease of argument – the target population
Fig. 1. (a) Structure of a Bayesian network for learning the probabilities of the $j$ possible discrete states of a population parameter $\theta$. The node $d_i$ is binary with states $S$ and $B$ that represent the possible observations on an item drawn randomly from the population. (b) Initialised Bayesian network with both nodes expanded. Due to limitations of space, the node $\theta$ only displays the states $\theta_7$ to $\theta_{11}$. State probabilities are indicated both numerically (in %) and graphically (bar plots). (c) State of the Bayesian network after finding that the item $d_i$ is of type $S$. (d) Bayesian network structure for sequential Bayesian learning through conditionally independent trials $d_1, d_2, \ldots, d_n$.
parameter, $\theta$, will be assumed to be discrete. A further reason for adopting the discrete setting here is that it represents the standard specification of Bayesian networks. Let $\theta$ be defined so as to cover a total of $m$ states $\theta_j$ ($j = 1, 2, \ldots, m$) that correspond to distinct values in the interval $[0, 1]$. For example, the values $0, 0.1, 0.2, \ldots, 1$ could be chosen for $m = 11$. A discussion of a setting in which $\theta$ assumes a continuum of values is given in Section 4.2.2.

Notice that, in this context, $\theta$ may also be called a 'model' for the population. The current discussion thus considers a total of $m$ distinct models. An alternative way to think about this setting is to imagine that the population could be represented by one of $m$ urns, each containing a certain number of balls labelled either $S$ or $B$, depending on the proportion. For example, the population whose proportion of members of type $S$ is 0.6 could be represented by an urn that contains six balls labelled $S$ and four balls labelled $B$.

Let us recall from the preceding discussion that a question of interest is to derive a value for the probability that a particular item $d_i$ will be found, upon inspection, to be of type $S$, written formally as $\Pr(d_i = S)$. In terms of a Bayesian network, one could say that the aim is to fill out the table associated with a binary node $d_i$. The actual learning task relates, however, only indirectly to $d_i$, because $\Pr(d_i = S)$ is functionally related to $\theta$. This follows because the scenario here is understood as one in which one could state the probability of an inspected item $d_i$ being of a certain type given knowledge of the proportion of the population that is of that type (or, stated otherwise, given that a particular model $\theta_j$ holds):

$$\Pr(d_i = S \mid \theta_j) = \theta_j. \tag{2}$$
What needs to be done, instead, is to assign probabilities to each of the $m$ competing models for the population. These probabilities are required to find $\Pr(d_i = S)$, which is given by the weighted average of that probability over the $m$ different population models, that is, the proportions $\theta_j$:

$$\Pr(d_i = S) = \sum_{j=1}^{m} \Pr(d_i \mid \theta_j)\,\Pr(\theta_j). \tag{3}$$
The question thus is how to learn the various values for the $\theta_j$. The Bayesian approach to this problem starts with some information about each $\theta_j$ and revises this information – using the laws of probability calculus – when new data become available. The initial state of knowledge about the various models is known in this context as the prior probabilities. Notice that these reflect personal degrees of belief – ours, yours, anybody's.
The target question can now be reframed as follows: based on an initial assignment of probabilities to each $\theta_j$ (with $j = 1, 2, \ldots, m$) – which may be thought of as the entries of a Bayesian network variable with $m$ states – what ought to be one's belief about each of the $\theta_j$, given a set of data points $D$ (written formally as $\Pr(\theta_j \mid D)$)? This process of 'learning' the probabilities of the $\theta_j$ can be conceptualised as an inference according to Bayes' theorem. Considering, for ease of argument, a single observation $d_i$ and a particular model $\theta_j$, Bayes' theorem can be written as follows:

$$\Pr(\theta_j \mid d_i) = \frac{\Pr(d_i \mid \theta_j)\,\Pr(\theta_j)}{\Pr(d_i)}. \tag{4}$$
Verbally stated, Eq. (4) says that for finding the probability of the model $\theta_j$ given the observation $d_i$, one needs to multiply the prior probability of that model, $\Pr(\theta_j)$, by the likelihood of that model, $\Pr(d_i \mid \theta_j)$, and divide by the probability of the data, $\Pr(d_i)$. The latter term is given by Eq. (3), whereas the likelihood of the model is given by Eq. (2).

These calculations can be traced in a Bayesian network by defining two nodes $\theta$ and $d_i$. Assuming that the outcome of inspecting an item $d_i$ depends on the distribution of the characteristics in the population from which the item is drawn, the two variables are linked such that the following Bayesian network is obtained: $\theta \rightarrow d_i$ (Fig. 1(a)). The variable $\theta$ covers a total of $m$ states. The second variable, $d_i$, is Bernoulli with possible states $S$ and $B$ (Section 4.2). The associated node table requires conditional probabilities for each state of $d_i$, given each state of the parental variable $\theta$. This represents a total of $2m$ probabilities.

This node table can be specified in different ways. One possibility would be to specify manually $\Pr(d_i = S \mid \theta_j) = \theta_j$ and $\Pr(d_i = B \mid \theta_j) = 1 - \theta_j$ for each $\theta_j$ (with $j = 1, 2, \ldots, m$). Although this may be promptly feasible, it may not be very convenient, because manual adjustments will be needed whenever changes are made in the table of the node $\theta$. An alternative way of specifying the node table of $d_i$ is to rely on the automation provided by special functionality included in certain Bayesian network software, such as Hugin.3 In Hugin language, for example, the conditional probabilities for $d_i$ can be defined through the expression Distribution(theta, 1 - theta), where theta refers to the name of the node that represents the variable $\theta$. A comparable table can also be constructed outside any particular software package; see the sketch below.

3 A free demonstration version of Hugin, called Hugin Lite, can be obtained from www.hugin.com. Hugin Lite allows some, and Hugin Researcher all, of the models presented in this series of papers to be built.
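The following fragment is a minimal, software-independent sketch (an addition for illustration, assuming Python with numpy; it is not the paper's Hugin implementation). It generates the same $2m$ conditional probabilities from the states of $\theta$, in the spirit of the Distribution(theta, 1 - theta) expression:

```python
# Sketch: filling the conditional table of the node d_i from the states of theta.
import numpy as np

# m = 11 discrete states for theta: 0, 0.1, ..., 1, as in the text.
theta_states = np.linspace(0.0, 1.0, 11)

# One column per theta_j; rows correspond to the states S and B of d_i:
# Pr(d_i = S | theta_j) = theta_j, Pr(d_i = B | theta_j) = 1 - theta_j (Eq. (2)).
cpt = np.vstack([theta_states, 1.0 - theta_states])

print(cpt.shape)  # (2, 11), i.e. 2m = 22 conditional probabilities
```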
Fig. 1(b) shows such an example for a Bayesian network with $m = 11$ states. Initially, no evidence is entered. The node $\theta$ displays exemplar prior probabilities given by the vector

$$\theta = \{0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.10, 0.18, 0.34, 0.16, 0.01\}.$$

The node $d_i$ shows the value 0.682 for the state $S$ (and $1 - 0.682 = 0.318$ for the state $B$). This value is given by the sum of the products of the individual likelihoods and the prior probabilities of the respective models. It corresponds to the denominator of Bayes' theorem (Eq. (4)), formally defined in Eq. (3).

Communicating to a Bayesian network that the state of a variable has become known with certainty is called instantiating that variable. Fig. 1(c) shows the Bayesian network with the node $d_i$ instantiated to $S$. Knowing that $S$ is observed for the item $d_i$ affects the probability distribution of the node $\theta$: this node now displays posterior probabilities, found according to Eq. (4).

Notice that the proposed network provides an explicit representation of the idea that the probability of an inspected item being of a given type depends on the proportion of the population that has that type. By adopting a distinct node for $\theta$, one can represent one's uncertainty about the various possible proportions. It is for this reason that the corresponding Bayesian network is also known in this context as an 'augmented Bayesian network' [18].

The Bayesian network language makes it easy to add further branches of the kind '$\theta \rightarrow d_i$', which represent further conditionally independent observations (sampling with replacement). This allows one to implement a sequential Bayesian learning procedure where the 'posterior' probability distribution for the variable $\theta$, given an observation, becomes the 'prior' probability distribution before inspecting an additional item $d_i$. Fig. 1(d) shows a general structure for a Bayesian network that is appropriate for this kind of analysis.
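The figures just quoted can be verified outside any Bayesian network software. The following sketch (an addition for illustration, assuming Python with numpy) computes the predictive probability of Eq. (3) and the posterior distribution of Eq. (4) from the prior vector of Fig. 1(b):

```python
# Sketch: discrete Bayesian updating for the network of Fig. 1(b) and (c).
import numpy as np

theta = np.linspace(0.0, 1.0, 11)          # model values theta_1, ..., theta_11
prior = np.array([0.01, 0.02, 0.03, 0.04, 0.05,
                  0.06, 0.10, 0.18, 0.34, 0.16, 0.01])  # prior of Fig. 1(b)

likelihood_S = theta                       # Pr(d_i = S | theta_j), Eq. (2)

pr_S = np.sum(likelihood_S * prior)        # predictive probability, Eq. (3)
print(round(pr_S, 3))                      # 0.682, as displayed by the node d_i

posterior = likelihood_S * prior / pr_S    # Bayes' theorem, Eq. (4)
print(posterior.round(3))                  # posterior distribution of Fig. 1(c)
```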
Fig. 2. Representation of, respectively, a Beta(1, 1) and a Beta(8, 2) prior distribution (dashed lines). The more intensively peaked solid lines represent the corresponding posterior distributions obtained by updating through a binomial sample of size 100 in which 77 items were found to share the target characteristic.

4.2.2. Continuous parameter

In many situations a collection of discrete states may not provide a satisfactory coverage of the possible values of a parameter. Actually, it may be that all values over some range need to be accounted for. Such a setting is addressed in this section.

The discussion here will continue to focus on $\theta$, the proportion of items in the target population that are of type $S$. It will be assumed that $\theta$ may take an uncountably infinite number of real values between 0 and 1, a setting in which $\theta$ is referred to as a continuous random variable. Again, the aim of the analysis is to work on one's uncertainty about the various possible values of $\theta$.

To get started, an initial distribution over $\theta$ is needed (i.e., prior to the data). A family of distributions used very commonly in the statistical literature for continuous random variables (within a fixed range) is the beta family, written $\text{Beta}(a, b)$, a more formal definition of which is given in Appendix A. This kind of distribution has some very convenient properties, in particular when conditioning on data that take the form of a binomial sample. This is explained in further detail below.

Let us suppose that a beta distribution is used to represent belief about the possible values of $\theta$, prior to considering particular data. Saying that the density function for the variable $\theta$ is a beta distribution thus means that the densities at each point in the interval $[0, 1]$ are determined by the following relation:

$$f(\theta \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}, \qquad 0 < \theta < 1,\; a > 0,\; b > 0. \tag{5}$$

In order to obtain a particular distribution, one needs to provide positive values (which may be fractions) for the two parameters $a$ and $b$. The idea is to shape the function in such a way that it appropriately reflects one's beliefs about the possible values of $\theta$. Appendix A provides some guiding considerations for this.

Following the choice of a particular beta distribution, the aim is to obtain a Bayesian update of that distribution in the light of a binomial sample. The posterior beta distribution for $\theta$ (with $0 < \theta < 1$) is found by applying Bayes' theorem and leads to the following:

$$f(\theta \mid a, b, n, s) = \frac{\Gamma(a + b + n)}{\Gamma(a + s)\Gamma(b + n - s)}\,\theta^{a+s-1}(1-\theta)^{b+n-s-1}. \tag{6}$$

Stated otherwise, updating a $\text{Beta}(a, b)$ density in the light of a binomial sample of size $n$ with a number $s$ of items found to be of type $S$ leads to another beta density of the form $\text{Beta}(a + s, b + n - s)$. Yet another way to describe this updating procedure is to say that the parameter $a$ is incremented by the number of items found to be of type $S$, whereas the parameter $b$ is incremented by the number of items that are not of type $S$. It is interesting to note that this allows one to interpret any $\text{Beta}(a, b)$ distribution as a $\text{Beta}(1, 1)$ distribution after observing $a - 1$ counts of items of type $S$ and $b - 1$ counts of type not-$S$.

Fig. 2 provides a graphical illustration of the Bayesian updating of a $\text{Beta}(1, 1)$ and a $\text{Beta}(8, 2)$ distribution. The former considers all values in the interval $[0, 1]$ as equally probable, whereas the latter assigns more belief to values greater than 0.5. Actually, a $\text{Beta}(8, 2)$ distribution could be interpreted as an expectation to see about 8 out of 10 items being of type $S$ (recall that the expectation of a beta distributed random variable with parameters $a$ and $b$ is $a/(a+b)$). These distributions are updated by the binomial sample $(n = 100, s = 77)$ discussed earlier in Section 2. The solid-lined curves represent the resulting posterior distributions $\text{Beta}(78, 24)$ and, respectively, $\text{Beta}(85, 25)$.

Let us recall that the purpose of the current analysis still is to find a point estimate for the unknown parameter $\theta$. The Bayesian approach to obtaining such an estimate relies on the posterior distribution for $\theta$ given the observed data (i.e., the binomial sample). That task may consist of summarising that distribution by a single statistic, such as the posterior mean, $a/(a+b)$, where $a$ and $b$ now refer to the parameters of the posterior distribution. As an example, let us refer again to the distributions shown in Fig. 2. Here, the mean of the $\text{Beta}(85, 25)$ distribution is 0.773, whereas that of the $\text{Beta}(78, 24)$ distribution is 0.765.
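This beta-binomial conjugacy is straightforward to reproduce in code. The sketch below (an addition for illustration, not part of the original paper, assuming Python with scipy) updates the Beta(8, 2) prior with the binomial sample (n = 100, s = 77) and recovers the posterior mean 0.773 quoted above:

```python
# Sketch: beta-binomial updating of Section 4.2.2.
from scipy.stats import beta

a, b = 8, 2                        # prior parameters of Beta(8, 2)
n, s = 100, 77                     # binomial sample of Section 2

a_post, b_post = a + s, b + n - s  # posterior parameters: Beta(85, 25)
posterior = beta(a_post, b_post)

print(posterior.mean())            # 0.7727..., the point estimate 0.773
print(posterior.interval(0.95))    # a 95% credibility interval for theta
```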
Fig. 3. Different states of a Bayesian network for learning the probability distribution of a proportion. The node $\theta$ represents the proportion of documents that are of type $S$. The nodes $d_1$ and $d_2$ represent the possible outcomes of individual Bernoulli observations (usable within a sequential updating procedure). That is, an item $d_i$ from the population may, if inspected, be found to be either of type $S$ or of type $B$. The nodes $n$ and $s$ act as summary nodes in the sense that they allow one to enter a binomial sample of size $n$ in which $s$ items are found to be of type $S$. The possible states of the node $n$ have been arbitrarily restricted to $96, 97, \ldots, 100$, whereas $s$ covers any integer between 0 and 100 (including these endpoints). In the Bayesian network shown in figure (i), the nodes alpha and beta are instantiated to 8 and 2, respectively (states are marked in bold), so that the node $\theta$ displays a Beta(8, 2) 'prior' distribution. Figure (ii) shows the state of the Bayesian network after entering a (100, 77) binomial sample (nodes $n$ and $s$ are instantiated). Accordingly, the node $\theta$ displays a Beta(85, 25) posterior distribution.
In summary, the Bayesian procedure for inference about a proportion is thus based on the posterior distribution, which gives an account of one's belief about the parameter after analysing the data.

How is a continuous approach to inference about the population parameter $\theta$ to be implemented in terms of a Bayesian network? In order to illustrate an answer to this, it is useful to consider an analogy to the Bayesian network discussed towards the end of Section 4.2.1 (Fig. 1(a)). The parameter $\theta$ was modelled in that example in terms of a discrete node with $m$ states $\theta_j$ ($j = 1, 2, \ldots, m$). In order to adapt that network to a continuous setting, the definition of the node $\theta$ requires a slight formal modification. That is, instead of a collection of discrete states, $\theta$ needs to be defined in such a way that the whole range between 0 and 1 is covered. How this is practically to be achieved requires some explanation because, currently, Bayesian networks allow genuine continuous nodes only for variables with a Gaussian (normal) distribution function. This constraint can often be circumvented, however, in a reasonably acceptable way. The way ahead is to describe the continuous range in terms of disjoint intervals over which the continuous quantity of interest can be discretised. Certain Bayesian network software packages, such as Hugin, support this mode of network specification through a special subtype of node called an 'interval node'. Such an interval node can be used for implementing a continuous variable $\theta$ as discussed here. For the purpose of illustration, allow the node $\theta$ to cover a total of 20 states that represent disjoint intervals of size 0.05, that is, $0$–$0.05$, $0.05$–$0.1$, $0.1$–$0.15$, $\ldots$, $0.95$–$1$ (notice that another number of intervals may be chosen, and the size of the disjoint intervals need not be constant). As is readily seen, this node definition results in a remarkable increase in the number of node states. At this point, however, this should not be seen as a major complication, because convenient software functionalities exist that allow one to complete node tables automatically through the definition of expressions. Referring again to Hugin language, probabilities can be assigned to the interval states of the node $\theta$ according to a user-specified statistical distribution, defined in terms of a particular expression. For example, if one were to define one of the prior distributions displayed in Fig. 2 (for a node $\theta$ without parental variables), expressions of the form Beta(1,1) or Beta(8,2) would be needed.

Fig. 3(i) shows a Bayesian network with a Beta(8, 2) prior distribution for $\theta$. Notice that a particular feature of this network is that $\theta$ is not modelled as a root node. Instead, it is conditioned on two additional nodes named, respectively, alpha and beta. These have states assuming the values $0.1, 0.5, 1, 2, \ldots, 10$ (other sets of values $> 0$ may be chosen) and provide the parameters for the beta distribution defined over the states of the node $\theta$. The latter node is thus specified by using the expression Beta(alpha,beta), where alpha and beta refer to the names of the newly added parental nodes. The distributions of the nodes alpha and beta do not require further attention here, because these nodes are supposed to be instantiated whenever the network is used for inference (notice, however, that this instantiation is an assumption that may be relaxed; it is possible to consider a model with hyperparameters).
The definition of the nodes $d_i$ is the same as that given in Section 4.2.1. These nodes inform about the main question formulated at the beginning of this paper (Section 2), that is, the probability that a random member of the population will be found to be of, respectively, type $S$ or type $B$. That (predictive) probability is given by the mean of the distribution associated with the node $\theta$, the population proportion. Because a beta distribution is involved, the mean is given by $a/(a+b)$. An illustration of these results is given by the models shown in Fig. 3, where a probability of 0.8 for the state $S$ of the nodes $d_i$ (Fig. 3(i)) can be found to correspond to the mean of the Beta(8, 2) 'prior' distribution, whereas the value 0.77 agrees with the mean of the posterior Beta(85, 25) distribution (Fig. 3(ii)).

4.3. Parameter learning for a multinomial variable: the Dirichlet distribution

The kinds of samples discussed so far are particular in the sense that they consist of sequences of trials in which the target characteristic may assume exactly one of two mutually exclusive outcomes. These samples have been referred to as binomial samples, and the binomial distribution (Eq. (1)) has been used to model probabilities over the set of possible outcomes. Let us recall the sample of 100 documents introduced in Section 2, and let us imagine now that attention is drawn to a descriptor that may assume one of more than two possible outcomes. In the case of black toner that may be found on printed documents, forensic scientists commonly analyse resins by means of Fourier Transform Infrared Spectroscopy (FTIR), the results of which (so-called IR data) may be classified into one of several mutually exclusive categories. An example of such a set of data, in this context also referred to as a multinomial sample, is given in Table 1.

When evaluating the weight of evidence for IR data, forensic scientists may need to address questions of the kind "What is the probability of an unknown item of black toner being of type $i$?" A Bayesian approach to this learning task uses prior knowledge about the $k$ outcomes, expressed in terms of prior probabilities $\theta_1, \ldots, \theta_k$, which are then adjusted in the light of data. More formally, the move to a posterior probability distribution is operated by combining a prior probability distribution with a multinomial likelihood. A distribution for modelling probabilities $\theta_i$ for $k$ outcomes, commonly referred to throughout the statistical literature, is the Dirichlet distribution (Appendix B). For the seven classes represented in Table 1, that distribution is written $\text{Dirichlet}(a_1, a_2, \ldots, a_7)$, where the mean value for the $i$th class is $a_i/a_0$ and $a_0 = \sum_{i=1}^{7} a_i$.

In order to illustrate the revision of the parameters in terms of a numerical example, let us say that the prior distribution is uniform, that is, $a_i = 1$ for all $k$ categories. As an aside, notice that the parameters $a_i$ may be thought of as 'prior observation counts' for observations determined by $\theta_i$. The Dirichlet distribution has very convenient updating properties, comparable to those of the beta distribution, where the observed 'successes' and 'failures' add to the parameters $a$ and $b$ (Section 4.2.2).
Table 1
Total number of observations (counts) for each of seven resin groups. The sample covers a total of 100 items (documents with black toner), each of which falls into exactly one category.

Resin group (number – name)      Counts
1 – styrene-co-acrylate          83
2 – epoxy A                      11
3 – epoxy B                      2
4 – epoxy C                      1
5 – epoxy D                      1
6 – polystyrene                  1
7 – unknown                      1
Table 2
Example of updating a Dirichlet distribution with parameters $a_i$. The data are taken from Table 1 (represented here in column four) and consist of the numbers $x_i$ of observations (counts) of items found to be of type $i$ (the sample size is 100). Column three represents the prior means, column five the revised parameters and column six the posterior means. These means are estimators of $\Pr(\theta_i)$ and $\Pr(\theta_i \mid x_i, n)$, respectively.

Resin group i   a_i       a_i/a_0   x_i       a_i + x_i        (a_i + x_i)/(n + a_0)
1               1         1/7       83        84               0.78505
2               1         1/7       11        12               0.11215
3               1         1/7       2         3                0.02804
4               1         1/7       1         2                0.01869
5               1         1/7       1         2                0.01869
6               1         1/7       1         2                0.01869
7               1         1/7       1         2                0.01869
Sums:           a_0 = 7   1         n = 100   n + a_0 = 107    1
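For readers who wish to retrace Table 2 computationally, the following sketch (an addition for illustration, not part of the original paper, assuming Python with numpy) updates the uniform Dirichlet(1, ..., 1) prior with the counts of Table 1 and returns the revised parameters and posterior means:

```python
# Sketch: Dirichlet updating reproducing Table 2.
import numpy as np

alpha = np.ones(7)                          # prior parameters a_i = 1 (uniform)
counts = np.array([83, 11, 2, 1, 1, 1, 1])  # observed counts x_i of Table 1, n = 100

alpha_post = alpha + counts                 # revised parameters a_i + x_i
post_means = alpha_post / alpha_post.sum()  # (a_i + x_i) / (n + a_0)

print(alpha_post)           # [84. 12.  3.  2.  2.  2.  2.]
print(post_means.round(5))  # [0.78505 0.11215 0.02804 0.01869 0.01869 0.01869 0.01869]
```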
In the case of the Dirichlet distribution, which is a generalisation of the beta distribution, the posterior distribution for $(\theta_1, \theta_2, \ldots, \theta_{k-1})$ is found – for binomial and multinomial likelihoods alike – by adding to each parameter $a_i$ the counts $x_i$ of items found to be of type $i$ [19]: $\text{Dirichlet}(a_1 + x_1, a_2 + x_2, \ldots, a_k + n - \sum_{i=1}^{k-1} x_i)$. Notice that the distribution is written only for the $k - 1$ proportions because of the constraint $\theta_k = 1 - \sum_{i=1}^{k-1} \theta_i$. A summary of the updating procedure is given in Table 2. The posterior means for the $\theta_i$ are shown in the column on the far right-hand side. They are given by $(a_i + x_i)/(n + a_0)$, where $a_0 = \sum_{i=1}^{k} a_i$. In Part II of this series of papers, mean values found in this way will be used as entries for probability tables of multinomial variables of a Bayesian network.

5. Discussion and conclusions

Bayesian networks represent an important addition to the collection of analytical and evaluative tools currently available to forensic scientists [9,20–23]. One of the major reasons for this is that they allow one to extend probabilistic analyses to levels of complexity – in terms of the number of variables as well as their dependency structure – that would otherwise become increasingly difficult to manage. Furthermore, the graphical component of the concept allows users to focus their attention on finding appropriate structural representations of a problem, while underlying calculations can largely be confined to widely available software systems. Besides, such systems also greatly facilitate the coordinated use of probabilistic data from different sources, such as expert opinions, databases of past experiences or results of specific experiments. Although, in principle, Bayesian networks accept virtually any kind of numerical values – provided that they agree with the laws of probability – users should take seriously the task of specifying network models numerically, because of the crucial effect that this process has on the appropriateness of the result.

With regard to this general context, the main purpose of the present paper was to point out that Bayesian networks can also provide a platform for clarifying the way in which beliefs are informed by data. In particular, Bayesian networks support the task of furthering the understanding of underlying assumptions and how they function in practice, of examining and keeping track of results, and of transposing inferential procedures to practical applications (see also Part II of this series of papers).

Appendix A. Beta distribution

The beta distribution offers a way of representing variability in a continuous random variable whose values are confined to a fixed range (from zero to one). Let $\theta$ denote such a random variable. $\theta$ has a
beta distribution with two parameters $a$ and $b$, denoted $\text{Beta}(a, b)$, if its probability density function, $f(\theta \mid a, b)$, is

$$f(\theta \mid a, b) = B(a, b)\,\theta^{a-1}(1-\theta)^{b-1}, \qquad 0 < \theta < 1,$$

where

$$B(a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}$$

is the beta function and

$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt$$

is the gamma function. For integer arguments, the gamma function produces factorials:

$$\Gamma(x + 1) = x!, \qquad \text{for integer } x > 0,$$

where $x! = x(x-1)(x-2)\cdots 1$. Notice that $\Gamma(1/2) = \sqrt{\pi}$.

It is also worth noting that $\theta^{a-1}(1-\theta)^{b-1}$ determines the shape of the curve, whereas $B(a, b)$ is the constant needed to make the curve a density function. Finding values for the parameters $a$ and $b$ may be guided by the following considerations:

- $a = b = 1$: for these values the beta distribution is uniform, that is, it assigns a constant density to all values in $[0, 1]$. This distribution could be used if one does not wish to prefer, a priori, any one value over another. This device is sometimes used for the purpose of presenting the effect of data in terms of a posterior distribution that is constructed from a starting point that does not favour any particular value. The uniform distribution is shown in Fig. A.1 as a horizontal line. Generally, for values $a = b$ the density is symmetric.
- $a = b > 1$: choosing values different from one will tend to concentrate the density in favour of some values of $\theta$ over others. Actually, the tendency will be that more weight is placed on values surrounding the distribution's mean, the value of which is given by $a/(a+b)$. That concentration will be more intensive the larger the values chosen for $a$ and $b$. This can be seen in Fig. A.1 by comparing the shapes of the curves of the distributions $\text{Beta}(2, 2)$ and $\text{Beta}(4, 4)$. Generally, by choosing smaller parameter values, distributions will be less committal (more open-minded [24]) with respect to particular ranges of values for $\theta$.
- $a = b < 1$: the effect of values inferior to 1 is to confine the density preferentially towards the upper and lower bounds of the interval $[0, 1]$. For the purpose of illustration, Fig. A.1 shows a $\text{Beta}(0.1, 0.1)$ distribution.
- $a < b$, $a > b$: generally, when $a$ is smaller than $b$, the density has more weight on the lower half of the parameter interval, whereas the opposite holds when $a$ is greater than $b$. An example of the latter case is given in Fig. 2 in terms of a $\text{Beta}(8, 2)$ distribution.

Particular choices of $a$ and $b$ may be further refined by consistency checks [24].

Fig. A.1. Some beta distributions with different parameters $a$ and $b$.
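To explore these parameter choices numerically, one may evaluate the densities directly. The sketch below (an addition for illustration, not part of the original paper, assuming Python with scipy) reports, for each $(a, b)$ pair discussed above, the mean $a/(a+b)$ and the grid location of the highest density:

```python
# Sketch: shapes of the beta distributions of Fig. A.1.
import numpy as np
from scipy.stats import beta

grid = np.linspace(0.01, 0.99, 99)
for a, b in [(1, 1), (2, 2), (4, 4), (0.1, 0.1), (8, 2)]:
    density = beta(a, b).pdf(grid)
    # mean a/(a + b) and the grid point of highest density
    # (for the uniform Beta(1, 1), the density is flat and the maximum arbitrary)
    print((a, b), round(a / (a + b), 3), round(float(grid[np.argmax(density)]), 2))
```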
Appendix B. Dirichlet distribution

The Dirichlet distribution is commonly used to distribute probability over a set of parameters $\theta_i$ ($i = 1, \ldots, k$; $0 < \theta_i < 1$; $\sum_{i=1}^{k} \theta_i = 1$). A typical situation where this applies is one where each of $n$ trials may result in outcomes that fall into one of $k$ classes, that is, with multinomial samples. When such trials are independent, the probability of a trial falling into the $i$th category is $\theta_i$. A Dirichlet distribution for the set $\{\theta_1, \ldots, \theta_k\}$ has the probability density function

$$f(\theta_1, \ldots, \theta_{k-1} \mid a_1, \ldots, a_k) = \frac{\Gamma(a_1 + \cdots + a_k)}{\Gamma(a_1)\cdots\Gamma(a_k)}\,\theta_1^{a_1 - 1} \cdots \theta_k^{a_k - 1},$$

$$0 < \theta_i < 1, \quad i = 1, \ldots, k, \quad \sum_{i=1}^{k} \theta_i = 1, \quad a_1, \ldots, a_k > 0.$$
The mean value for $\theta_i$ is $a_i/a_0$, where $a_0 = \sum_{i=1}^{k} a_i$. It can be observed that the beta distribution is a special case of the Dirichlet distribution, in the sense that the latter is equivalent to the former when $k = 2$. Notice also that $\theta_k = 1 - \sum_{i=1}^{k-1} \theta_i$ implies that the Dirichlet density function is uniquely determined by the values of the first $k - 1$ variables.

References

[1] B. Robertson, G.A. Vignaux, Interpreting Evidence. Evaluating Forensic Science in the Courtroom, John Wiley & Sons, Chichester, 1995.
[2] C.G.G. Aitken, F. Taroni, Statistics and the Evaluation of Evidence for Forensic Scientists, 2nd edition, John Wiley & Sons, New York, 2004.
[3] I.W. Evett, B.S. Weir, Interpreting DNA Evidence, Sinauer Associates Inc., Sunderland, 1998.
[4] J.S. Buckleton, C.M. Triggs, S.J. Walsh, Forensic DNA Evidence Interpretation, CRC Press, Boca Raton, FL, 2004.
[5] D.J. Balding, Weight-of-Evidence for Forensic DNA Profiles, John Wiley & Sons, Chichester, 2005.
[6] C. Champod, F. Taroni, The Bayesian approach, in: J. Robertson, M. Grieve (Eds.), Forensic Examination of Fibres, Taylor and Francis, London, 1999, pp. 379–398.
[7] R.G. Cowell, A.P. Dawid, S.L. Lauritzen, D.J. Spiegelhalter, Probabilistic Networks and Expert Systems, Springer, New York, 1999.
[8] F.V. Jensen, Bayesian Networks and Decision Graphs, Springer, New York, 2001.
[9] F. Taroni, C.G.G. Aitken, P. Garbolino, A. Biedermann, Bayesian Networks and Probabilistic Inference in Forensic Science, John Wiley & Sons, Chichester, 2006.
[10] D.M. Buede, Engineering design using decision analysis, in: IEEE International Conference on Uncertainty Modelling and Analysis, IEEE Systems, Man and Cybernetics, vol. 2, 1994, pp. 1868–1873.
[11] G. Parmigiani, Decision models in screening for breast cancer, Bayesian Statistics 6 (1999) 525–546.
[12] F. Sahin, S.H. Bay, A biological decision theoretic intelligent agent solution to a herding problem in the context of distributed multi-agent systems, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 2, 2000, pp. 137–142.
[13] G. Jackson, The scientist and the scales of justice, Science & Justice 40 (2000) 81–85.
[14] F. Taroni, C. Champod, P. Margot, Forerunners of Bayesianism in early forensic science, Jurimetrics Journal 38 (1998) 183–200.
[15] S.M. Mahoney, K. Blackmond Laskey, Network engineering for complex belief networks, in: Uncertainty in Artificial Intelligence, Proceedings of the Twelfth Conference, Morgan Kaufmann Publishers, San Francisco, CA, 1996, pp. 389–396.
[16] K. Blackmond Laskey, S.M. Mahoney, Knowledge fragments: representing knowledge for constructing probabilistic models, in: D. Geiger, P.P. Shenoy (Eds.), Uncertainty in Artificial Intelligence, Proceedings of the Thirteenth Conference, Morgan Kaufmann Publishers, San Francisco, CA, 1997, pp. 334–341.
[17] M. Neil, N. Fenton, L. Nielsen, Building large-scale Bayesian networks, The Knowledge Engineering Review 15 (2000) 257–284.
[18] R.E. Neapolitan, Learning Bayesian Networks, Prentice Hall Series in Artificial Intelligence, Pearson Prentice Hall, Upper Saddle River, NJ, 2004.
[19] M.H. De Groot, Optimal Statistical Decisions, Wiley Classics Library Edition, John Wiley & Sons, Inc., Hoboken, NJ, 1970.
[20] C.G.G. Aitken, A. Gammerman, Probabilistic reasoning in evidential assessment, Journal of the Forensic Science Society 29 (1989) 303–316.
[21] A.P. Dawid, I.W. Evett, Using a graphical method to assist the evaluation of complicated patterns of evidence, Journal of Forensic Sciences 42 (1997) 226–231.
[22] P. Garbolino, F. Taroni, Evaluation of scientific evidence using Bayesian networks, Forensic Science International 125 (2002) 149–155.
[23] A.P. Dawid, J. Mortera, V.L. Pascali, D. van Boxel, Probabilistic expert systems for forensic inference from genetic markers, Scandinavian Journal of Statistics 29 (2002) 577–595.
[24] D.A. Berry, Statistics: A Bayesian Perspective, Duxbury, Belmont, 1996.