Bayesian Networks


M E Borsuk, Thayer School of Engineering, Dartmouth College, Hanover, NH, USA. © 2008 Elsevier B.V. All rights reserved.

Contents: Introduction; Definition of BNs; Building Models; Using Models; Special Cases; Further Reading

Introduction

Ecological informatics is concerned with the use of advanced computational technology to: (1) further our understanding of ecosystems at all levels of detail and (2) support rational and transparent decision making concerning ecological management. Distinct features of ecological informatics include: integration across scales and levels of complexity, translation of patterns in data to ecological processes, and adaptive methods of model revision and prediction under uncertainty. One approach for pursuing these goals is Bayesian network (BN) modeling. By succinctly and effectively translating causal assertions between variables into patterns of probabilistic dependence, Bayesian networks facilitate logical and holistic reasoning under uncertainty in complex systems. Such reasoning is necessary for accurate analysis, synthesis, prediction, inference, and decision making. The first section of this article defines BNs and introduces a simple ecological example that will be used throughout to illustrate basic concepts. Methods for constructing BNs will then be described, including specification of model structure and conditional probabilities. This will be followed by a description of BN use for prediction, inference, explanation, intervention, and decision. Finally, special cases of BNs will be presented including hierarchical, dynamic, and integrated modeling. In this relatively brief overview there is not enough space for detailed theoretical development, algorithms, or examples. Rather, the goal is to provide an introduction to the basic concepts and rely on the sources listed for further reading to fill in the details.

Definition of BNs

A BN is a directed acyclic graph (DAG) that leads to a compact representation, or factorization, of the joint probability distribution of a set of variables in a system of interest. Graphically, variables are represented by nodes and dependences between nodes are represented by directed edges. It is important to note that a BN illustrates patterns of probabilistic dependence (i.e., statistical correlation, causal influence, or diagnostic reasoning) and not the flow of mass or process control. Therefore, variables need not be in compatible units and states of variables need not be discrete. The absence of a directly connecting edge between any two nodes in a BN implies that the two variables are independent given the values of any intermediate nodes. This means that the probability distribution of any variable Xi in any state of the network can be determined by knowing only the values of its immediate predecessors (called its parents, PAi), without regard to the values of any other variables. This is referred to as the parental Markov property. In this way, the joint probability distribution for the entire network can be written as the product of a limited number of conditional distributions using the chain rule of probability calculus:

P(x1, ..., xn) = ∏(i=1 to n) P(xi | pai)    [1]

Nodes without any parents are called roots and are specified by marginal (i.e., unconditional) distributions. (Following the notation of Pearl 2000, lowercase symbols

Ecological Informatics | Bayesian Networks

are used to indicate particular realizations of the corresponding uppercase variables.) Figure 1 illustrates a simple BN representing the relationships between nutrient concentration in a water body (N), algal density (A), the presence of algal toxins (T), water column hypoxia (H), and the occurrence of a fishkill (K). Such a graphical network can be drawn based on causal knowledge of the system. For example, the absence of a direct link between N and H captures our understanding that nutrient inputs to a water body only cause hypoxia via the stimulation of algae and not through any other direct or indirect means. In probabilistic terms, knowing A renders H and N independent. Together with the remainder of the relationships expressed in Figure 1, this implies that the joint distribution of all variables can be written in the mathematical form of eqn [1] as

P(n, a, t, h, k) = P(n) · P(a|n) · P(t|a) · P(h|a) · P(k|t, h)    [2]

Figure 1  A simple BN representing dependences among five variables characterizing eutrophication of a water body. Nutrient concentration (N) is assumed to influence algal density (A), which in turn influences the probability of algal toxins (T) and water column hypoxia (H). These two variables then influence the probability of a fishkill (K).

The recognition that causal assertions expressed in graphical form have practical implications for determining the probabilistic relationships among variables significantly facilitates the handling of uncertainty in complex systems. For example, generating a prediction for the probability of a fishkill given a particular nutrient concentration can proceed by decomposing the full causal chain connecting these two variables into the conditional relationships contained in eqn [2]. These local relationships can be quantified independently using the data, expert knowledge, or mechanistic models that are directly relevant. The parts can then be reassembled in a way that makes logical sense based on the causal assertions embedded in the graphical model.

The implications for forward prediction are perhaps the most obvious benefit of the BN approach. More subtle is the benefit to performance of probabilistic inference, or diagnosis. This will be discussed later in this article, but it is worth noting here that the use of Bayes's theorem for conducting such inference is one reason why BNs are called Bayesian. The other reason is that the subjective interpretation of probabilities as degrees of belief rather than as long-run frequencies is consistent with the Bayesian philosophy. This will also be discussed in more detail below.

Building Models

Specifying the Structure

The interpretation of DAGs in terms of causality is not necessary for extracting meaningful conditional dependence relations from BNs. However, it is usually the causal interpretation that allows the structure of a BN to be drawn before explicitly consulting the relevant data. That is to say that the modeler, or an appropriate expert, can draw a BN based on straightforward, qualitative notions of cause and effect (which are the basic building blocks of scientific knowledge) without necessarily being fluent in probabilistic reasoning.

The interpretation of BNs in terms of causality also allows the model to easily represent and respond to external interventions or spontaneous changes in the system. Such adaptation is arguably a defining trait of ecological informatics tools. Any changes in the mechanisms of a system translate into minor modifications of the network structure. For example, to represent the installation of a mechanical aerator to artificially add oxygen to the water body modeled in Figure 1, we simply need to add a node representing the management of the aerator (O) with a link to hypoxia (H) and modify the conditional distribution of H to include the influence of O: P(h|a,o). If the aerator were managed in a way that responded to measured nutrient concentrations, then we would add a link between N and O and revise P(o|n). Such changes would be much more difficult to identify if the BNs were not constructed according to causal relations.

Specifying the structure of a BN can proceed most effectively by first identifying the key system variables to be modeled. In an ecological management context, this may involve detailed discussions with decision makers and other stakeholders to determine the variables that they would like to see predicted by the model. Ideally, these would consist of measurable quantities that indicate the degree to which a particular decision alternative fulfills their management objectives.
With these endpoints identified, it is then natural to proceed by identifying the nodes immediately preceding them in the causal chain, then the nodes preceding them, and so on, back to the primary causes representing model inputs. This might occur by consulting the relevant scientific literature or interviewing subject matter experts directly. However, caution should be exercised at this stage, as most experts usually have their own 'pet processes' that they would like to see included in a model representing their area of expertise.

The inclusion of many variables and processes may, in principle, produce a more precise network. If the values of those variables and the parameters of the processes are well known, then other variables can be conditioned on them, thereby reducing uncertainty in model relationships. However, if the variables are stochastic or uncontrollable and must be described by marginal probability distributions themselves, then their explicit inclusion is not very useful and their effect can be subsumed by the conditional distributions.

For example, a scientist studying algal growth might emphasize that the response of algae to a particular nutrient concentration will depend on the ambient light availability and therefore 'light' should be added as a node in Figure 1. However, if light is not a controllable factor and data are not available to estimate or predict light availability on a given day, then it may not be necessary to include it explicitly. Instead, the prediction for algal density (A) conditional on a given nutrient concentration (N) can simply be represented by a probability distribution, rather than a precise value, to account for the variability in A that is caused by variation in light (as well as any other disregarded factors). In other words, any factors not explicitly accounted for in a model become part of the unexplained variability, or model error, forming the conditional distributions. The unexplained variability associated with a variable X is sometimes included as an explicit disturbance term in the network indicated, for example, by a node labeled UX (Figure 2).
Such nodes are also referred to as latent variables.

Figure 2  A BN indicating the unmodeled factors (U), such as light, that may influence the algal density resulting from a given nutrient concentration.
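The graph-editing steps described above, such as adding an aerator-management node O as a new parent of H, can be sketched with a simple parents dictionary. This is an assumption of the illustration, not the API of any particular BN software; the variable names follow the article.

```python
# Sketch: the Figure 1 network as a dictionary mapping each node to its
# parents, then the intervention described in the text: adding a
# hypothetical aerator-management node O with a link to hypoxia H.

parents = {
    "N": [],          # nutrient concentration (root node)
    "A": ["N"],       # algal density depends on nutrients
    "T": ["A"],       # algal toxins depend on algal density
    "H": ["A"],       # hypoxia depends on algal density
    "K": ["T", "H"],  # fishkill depends on toxins and hypoxia
}

def add_node(net, node, node_parents):
    """Return a copy of the network with a new node and its parents."""
    net = {k: list(v) for k, v in net.items()}
    net[node] = list(node_parents)
    return net

def add_edge(net, parent, child):
    """Return a copy of the network with one extra parent-child link."""
    net = {k: list(v) for k, v in net.items()}
    net[child] = net[child] + [parent]
    return net

# Install the aerator: O becomes a parent of H, so P(h|a) becomes P(h|a,o).
net = add_node(parents, "O", [])
net = add_edge(net, "O", "H")

# If the aerator responds to measured nutrients, N becomes a parent of O.
net = add_edge(net, "N", "O")
```

Only the conditional distributions of the affected children (here, H and O) need to be revised; the rest of the network is untouched, reflecting the modularity discussed in the text.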


The decision about whether to include disturbance terms (or the influential, but omitted, variables whose influence they represent) explicitly or implicitly in a BN should be dictated by whether the resulting models satisfy the parental Markov property (see eqn [1]). Any causal diagram among system variables X that includes latent variables U and is acyclic leads to a semi-Markovian model, and the values of all variables X will be uniquely determined by the values of the variables U. Equivalently, the joint distribution of the variables, P(x1, ..., xn), will be determined uniquely by the marginal distribution of the latent variables, P(u). The model is called Markovian if and only if, in addition to the model being acyclic, the disturbances represented by U are jointly independent. As proven by Pearl 2000, Markovian models induce distributions that satisfy the parental Markov property.

In practical terms, achieving a Markovian model involves: (1) being sure to explicitly include in the model any variable that is a causal parent of two or more other variables, and (2) assuming that if any two variables are correlated, then one is the cause of the other or there is a third variable causing both. These two assumptions imply that the disturbances are mutually independent and therefore the causal model is Markovian.

Having a Markovian model is important for a number of reasons. Most importantly, the relationships between variables in a Markovian model are guaranteed to be stable, meaning that they are invariant to changes in our knowledge about other variables in the model, as well as to parametric changes in the mechanisms governing the relationships themselves. This is because each parent–child relationship in a Markovian BN is assumed to represent an autonomous physical mechanism, independent of all other mechanisms (or disturbance terms resulting from omitted mechanisms) in the model.
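The factorization guaranteed by the parental Markov property (eqn [1]) can be sketched numerically. The following minimal example uses two-state versions of the Figure 1 variables; the conditional probability entries are invented for illustration, and only the graph structure follows the article.

```python
# Sketch: the chain-rule factorization of eqn [1] for Figure 1, with
# invented two-state conditional probability tables (CPTs).

P_N = {"low": 0.7, "high": 0.3}                     # root node: marginal
P_A = {"low": {"low": 0.9, "high": 0.1},            # P(a | n)
       "high": {"low": 0.2, "high": 0.8}}
P_T = {"low": {"no": 0.99, "yes": 0.01},            # P(t | a)
       "high": {"no": 0.7, "yes": 0.3}}
P_H = {"low": {"no": 0.95, "yes": 0.05},            # P(h | a)
       "high": {"no": 0.4, "yes": 0.6}}
P_K = {("no", "no"): {"no": 0.999, "yes": 0.001},   # P(k | t, h)
       ("no", "yes"): {"no": 0.7, "yes": 0.3},
       ("yes", "no"): {"no": 0.8, "yes": 0.2},
       ("yes", "yes"): {"no": 0.4, "yes": 0.6}}

def joint(n, a, t, h, k):
    """P(n, a, t, h, k) = P(n) P(a|n) P(t|a) P(h|a) P(k|t, h)."""
    return P_N[n] * P_A[n][a] * P_T[a][t] * P_H[a][h] * P_K[(t, h)][k]

# Because each CPT row is a proper distribution, the product sums to 1
# over all 2^5 = 32 joint states.
total = sum(joint(n, a, t, h, k)
            for n in ("low", "high") for a in ("low", "high")
            for t in ("no", "yes") for h in ("no", "yes")
            for k in ("no", "yes"))
```

Each node is quantified only against its own parents, so each local table can be revised without touching the others.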
Maintaining the Markov property as a constraint also determines the level of abstraction that is allowable for model construction. For example, if we start at one extreme, where all variables and processes are represented in microscopic detail, then the Markov property would certainly hold. If we then increase the level of abstraction by aggregating variables in space and time and representing stochasticity or missing factors by probability distributions (or hidden disturbance terms), we need some indication of when the abstraction has gone too far and the essential properties of causation are lost. The Markov property tells us that the set of parents PAi of a variable xi is too small if there are disturbance terms that influence two or more variables simultaneously. In such a case, the Markov property is violated. However, as shown by Pearl 2000, if such disturbances are treated as latent variables and represented explicitly as nodes in a graph, then the Markov property is restored. The consideration of season (S) as an additional node in Figure 1 provides a relevant example of using the

Markov property to test whether a variable should be explicitly included in a model. Let us assume that we have already determined that season is an appropriate scale at which to capture the effects of light, temperature, and flow variation. We can then expect that season will have an influence on nutrient concentration (N), algal density (A), and hypoxia (H) by impacting physical and biological processes (Figure 3). As an attempt to further simplify, we may consider omitting season as an explicit variable. Since we cannot control its effects, it might be convenient to consider these effects part of the stochasticity of the system and fold them into the probability distributions of N, A, and H. However, this would be a mistake, as N and H would then no longer be conditionally independent given A (they would be correlated through the effects of S), and therefore the Markov property would be violated. Therefore, S (or at least an equivalently connected latent variable) must be explicitly included in the model.

The requirement that Markovian models be acyclic may appear to be a significant limitation of BNs, for many natural systems are known to contain feedback loops. However, the apparent need to include cycles in a BN usually arises from overaggregation of variables in time or space. For example, in the system represented by Figure 1, the algal decay process that causes hypoxia would also release nutrients to the water column, which could promote further algal growth, inducing additional hypoxia. However, a network cycle is only necessary if the variables are defined on a temporal scale that is greater than the nutrient turnaround time. At smaller scales, cycles can be avoided by indexing variables to represent multiple points in time, so that a variable referenced at one time point can be connected to one referenced at another, rather than looping back on itself. Another option for avoiding cycles is to define variables to represent long-term equilibrium values, rather than short-term responses. These issues will be discussed in more detail later in the section titled 'Dynamic models'.

Figure 3  A BN indicating the potential influence of season (S) on other network variables.
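The time-indexing idea for avoiding cycles can be sketched as an unrolled, two-slice style network: algal decay at time t feeds nutrients at time t+1, so no node ever loops back on itself. The node naming convention is an assumption of this illustration.

```python
# Sketch: avoiding a feedback cycle by indexing variables in time.
# Nutrients at time t influence algae at time t; algal decay then feeds
# nutrients at time t+1, so the graph stays acyclic.

def unrolled_parents(n_steps):
    """Parents dictionary for an unrolled nutrient-algae-hypoxia chain."""
    net = {}
    for t in range(n_steps):
        net[f"N{t}"] = [] if t == 0 else [f"A{t-1}", f"N{t-1}"]
        net[f"A{t}"] = [f"N{t}"]
        net[f"H{t}"] = [f"A{t}"]
    return net

net = unrolled_parents(3)

def is_acyclic(parents):
    """Depth-first check that the directed graph contains no cycles."""
    seen, onstack = set(), set()
    def visit(v):
        if v in onstack:
            return False      # back-edge found: cycle
        if v in seen:
            return True
        seen.add(v)
        onstack.add(v)
        ok = all(visit(p) for p in parents.get(v, []))
        onstack.discard(v)
        return ok
    return all(visit(v) for v in parents)
```

A direct loop such as {"X": ["Y"], "Y": ["X"]} would fail the same check, which is exactly the situation the time-indexed definition avoids.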

Specifying the Conditional Probabilities

Once a graphical model is drawn that has a causal structure and satisfies the Markov property, it defines the appropriate factorization of the joint distribution of all variables in the system. 'Appropriate' means that: (1) conditional distributions characterizing the relationships between a variable and its parents will be stable to changes in other variables and relationships; and (2) as a logical consequence, the full network can be modularized to allow the characterization of individual subnetworks to proceed independently without regard to the broader context. This implies that each subnetwork can be specified using an approach suitable for the type and scale of information available.

Specification of the conditional probabilities can proceed in several ways depending on the properties of the variables involved and the nature of the knowledge being brought to bear. Most examples of BN modeling in ecology have used either inherently discrete variables or continuous variables that have been discretized into a finite number of categories for representation in the network. However, this may have more to do with convenience and ease of interpretation than fidelity to the true properties of the system. When all variables in a network are discrete, then the network relationships are specified by conditional probability tables for each node that provide the probability of it being in a particular state (or category), given any combination of states of its parents. This has the advantage of being fairly easy to interpret when the parents are set to particular states, but when many different states are possible, the number of probabilities required to fill out the table quickly becomes prohibitive.
For example, if T, H, and K in Figure 1 are each assumed to have only three possible states, then specifying the full conditional probability table for K would require 3^3 = 27 probabilities (albeit, 9 of which could be inferred from the law of total probability). Estimating this many conditional probabilities from either data or expert judgment is a demanding task. Representing all variables in a network model as being discrete does have the advantage that software is readily available to handle all possible calculations one would want to make with such a model. However, discretizing variables that are inherently continuous introduces a degree of imprecision into the model that would otherwise not exist. This is because of the vagueness that arises from assigning all values within a specified range of a continuous variable to the same discrete state. For example, we might define nutrient concentration (N) in Figure 1 to have three possible states, corresponding to the ranges of 0–10, 10–40, and >40 μg l–1, respectively, and then specify the probability of various levels of algal density (A) conditional on each of these states. The inability of the probability table to distinguish between the different values for A likely to result from a value for N of 11 μg l–1 compared to a value of 39 μg l–1 adds significant imprecision to model predictions and inferences.

Another problem associated with discretization is that it encourages vagueness in variable definitions. For example, many BN studies have been published that only define states of variables to be low, medium, and high, without giving precise quantitative definitions. This is unacceptable, as it opens up the possibility for model developers or users to have very different ideas of what the variable and its different states represent. This can lead to errors in assessing the probabilities required of the model or in applying the results for decision making. The clarity test provides confirmation that variables and states have been defined in adequate detail. To implement the test, one needs to imagine that, at some point in the future, perfect information will be available regarding all aspects of the system. Will it be possible to determine unequivocally the state of every node in the network, without any interpretation or judgment? If not, then further specificity is required.

A satisfying alternative to constructing BNs entirely of discrete variables related by conditional probability tables is to use continuous variables when appropriate, connected by functional equations. Probabilities are introduced through the assumption that certain variables or parameters in the equations are uncertain or unobserved. In many ways, this is more consistent with the semideterministic way that causal models are conceived in biology, physics, and engineering. In its most general form, a probabilistic functional equation for a network variable Xi consists of an equation of the form

xi = fi(pai, ui)    [3]

where PAi are the parents of Xi, and Ui are the disturbances caused by omitted variables or random (e.g., measurement) errors. This conceptualization can be considered a nonlinear, nonparametric version of the more familiar linear structural equation models (SEMs). It has been shown that for every BN characterized by some distribution P (as in eqn [1]), there exists a functional model (as in eqn [3]) that generates a distribution identical to P. In other words, characterizing each relationship as a functional equation, instead of a conditional probability P(xi|pai), leads to all the valuable properties of a Markov model, and, as shown by Pearl 2000, this holds regardless of the choice of function fi or error distribution P(ui). This implies that for all applications of BNs, including synthesis, prediction, and inference, one can regard functional models as a legitimate way of specifying the conditional distributions.

As an example of the value of using a functional expression rather than a conditional probability table to describe the relationships between variables, consider how one might use measured data on nutrient concentration and algal density to characterize the relation between N and A in Figure 1. A linear regression fit to log-transformed data of Dillon and Rigler (1974) would provide the following functional form:

log(A) = β0 + β1 · log(N) + UA

where β0 and β1 are model coefficients and UA is a normally distributed disturbance term with a mean of zero and a standard deviation derived from the residuals of the model fit. As described above, UA might be represented as an explicit node in the model or as an implicit error term, depending on whether the omitted factors it represents also influence variables beyond A. The same holds true for the model coefficients β0 and β1, which also have uncertainty associated with them; they can either be treated as explicit nodes or implicit disturbance terms with distributions defined by the parameter means, standard errors, and correlations estimated by the regression procedure. An actual fit to data (Figure 4) suggests that the conditional probability distribution of A given a value for N of 11 μg l–1, for example, can be appropriately represented by a lognormal distribution with mean 2.5 μg l–1 and standard deviation of 2.4 μg l–1. If N were to be 39 μg l–1, A would have a lognormal conditional distribution with median 15.5 μg l–1 and standard deviation of 15.0 μg l–1.

Figure 4  Linear regression fit to data on spring phosphorus concentration and summer chlorophyll level (as a measure of algal density) from 46 lakes. The solid line represents the mean prediction, dashed lines represent the 95% confidence interval in the mean resulting from uncertainty in model coefficients, and dotted lines represent the 95% predictive interval representing the full conditional distribution. Vertical and horizontal dashed lines represent the thresholds for categorical definitions, as described in the text.
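Converting such a fitted log-log regression into conditional category probabilities can be sketched as follows. The coefficients and error standard deviation below are assumed placeholders, not the actual Dillon and Rigler fit; the category thresholds (2 and 15 μg l–1) follow the text.

```python
# Sketch: turning a log-log regression with a normal disturbance into
# conditional probabilities of discrete algal-density categories.
# beta0, beta1, and sigma are assumed values on the log10 scale, used
# only for illustration.
from math import log10, sqrt, erf

beta0, beta1, sigma = -0.5, 1.3, 0.25   # assumed regression fit

def normal_cdf(x, mu, sd):
    """CDF of a normal distribution via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))

def category_probs(n, thresholds=(2.0, 15.0)):
    """P(A in low/medium/high) given N, with log10(A) normal about the fit."""
    mu = beta0 + beta1 * log10(n)        # conditional mean of log10(A)
    lo = normal_cdf(log10(thresholds[0]), mu, sigma)
    hi = 1.0 - normal_cdf(log10(thresholds[1]), mu, sigma)
    return {"low": lo, "medium": 1.0 - lo - hi, "high": hi}

# Unlike a coarse contingency table, N = 11 and N = 39 now yield
# different conditional distributions for A.
p11 = category_probs(11.0)
p39 = category_probs(39.0)
```

This mirrors the "functional results" idea of Table 1: even when A is ultimately reported in discrete categories, the functional specification preserves the distinction between different values of N within the same discrete bin.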

Table 1  Conditional probabilities for categorical representations of N and A, as well as two examples of corresponding results from the functional representation

                              Summer chlorophyll (μg l–1)
Spring phosphorus (μg l–1)    Low (<2)       Medium (2–15)   High (>15)
Low (<10)                     0.87 (13/15)   0.13 (2/15)     0 (0/15)
Medium (10–40)                0.33 (6/18)    0.56 (10/18)    0.11 (2/18)
High (>40)                    0 (0/13)       0.08 (1/13)     0.92 (12/13)
Functional results
11                            0.37           0.63            0
39                            0              0.51            0.49

Fractions given in the body of the table represent the conditional categorical frequencies of the data in Figure 4.

If, instead of representing N and A as continuous variables, we were to artificially discretize them into three categories, then conditional probabilities could be derived from the data using a two-way contingency table (Table 1). With thresholds of 10 and 40 μg l–1 for spring phosphorus and 2 and 15 μg l–1 for summer chlorophyll, such a table would predict a conditional probability distribution for A of low: 33%; medium: 56%; and high: 11%; regardless of whether N were 11 μg l–1, 39 μg l–1, or anywhere in between. This is a significantly less-precise prediction than the functional results, which, even if A were to be discretized, would capture the difference between 11 and 39 μg l–1 for N (Table 1).

Regardless of whether variables in a BN are represented as continuous or discrete, there are a variety of ways to determine the appropriate conditional probabilities. Data-based statistical techniques, such as the regression or contingency table approaches exemplified above, are one possibility. Another possibility is to use the results of complex, process-based simulation models run externally to the BN that are then converted into reduced-form, response-surface approximations for use in BN specification. Of course, this requires a comprehensive uncertainty analysis to be performed to characterize the conditional probability distributions. This may not always be possible, but response-surface surrogate models can help in this regard also. When data and process models are not available for specifying conditional probabilities, the carefully elicited judgment of subject matter experts may be required. This approach is consistent with the Bayesian perspective on statistical inference and decision, which states that probabilities are a useful way of expressing subjective degrees of belief. Established techniques exist for eliciting probability distributions from experts and help to assure accurate and honest assessments. These distributions can then serve as Bayesian 'priors', to be formally updated according to Bayes's theorem as data become available or knowledge improves.

Learning the Conditional Probabilities

As described above, the conditional probabilities of a BN can be derived from data using statistical methods, and almost any approach is appropriate as long as results can be represented probabilistically. Linear and nonlinear regression, quantile regression, logistic regression, generalized additive models, and classification and regression trees are all suitable tools for conditional probability determination. The fact that a BN can be easily decomposed into independent substructures means that statistical methods can be chosen that are optimal for the nature of the variables in those subnetworks, without regard to the larger set of variables comprising the full network.

There may, however, be situations in which it is desirable to learn the conditional probability distributions of many nodes in a network simultaneously from a set of case data. Cases are examples, events, or situations for which the values or discrete states of some or all of the variables in a network are known. Learning can occur either by starting in a state of ignorance for all nodes or by starting with Bayesian prior distributions that are based on preexisting knowledge. If every case provides a value or discrete state for every variable, then learning the conditional probabilities of the network occurs through straightforward algorithms for Bayesian updating. If there are variables for which none of the cases have any data (latent variables), for which data are available for some cases but not for others (missing data), or for which data are expressed as likelihoods rather than certain values, then other, more complex, learning algorithms are required. These are usually based on optimization methods that attempt to find the set of network probabilities with maximum likelihood given the observed data. Expectation-maximization (EM) and gradient descent are the two most common such algorithms employed in BN learning.

Artificial neural networks (ANNs) employ many of the same learning algorithms as BNs, with latent variables in a BN corresponding to hidden neurons. This invites a comparison between the two: in general, BNs employ fewer hidden nodes and the learned relationships between the nodes are more complex. The result of BN learning usually has a direct physical interpretation (as a causal process), rather than simply leading to a set of empirical weights. This need for a causal interpretation may help avoid the problem of overfitting. Finally, as mentioned above, BNs can be treated as modular, so that parts of one network can be extracted and connected to other structures. This is not usually the case for ANNs.
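The complete-case Bayesian updating mentioned above can be sketched as pseudo-count updating of a discrete conditional probability table. The case data and the uniform Dirichlet prior (one pseudo-count per cell) are invented for illustration.

```python
# Sketch: parameter learning from complete cases by Bayesian updating of
# a discrete CPT P(A | N). Each cell starts with a Dirichlet pseudo-count
# and is incremented by the observed cases; the cases below are invented.

states_A = ("low", "medium", "high")

def learn_cpt(cases, prior_count=1.0):
    """Return P(A | N) estimated from (n_state, a_state) case pairs."""
    counts = {}
    for n_state, a_state in cases:
        row = counts.setdefault(n_state, {a: prior_count for a in states_A})
        row[a_state] += 1.0
    cpt = {}
    for n_state, row in counts.items():
        total = sum(row.values())
        cpt[n_state] = {a: row[a] / total for a in states_A}
    return cpt

# Invented complete cases: each pair is one observed (N, A) state.
cases = ([("low", "low")] * 8 + [("low", "medium")] * 2
         + [("high", "medium")] * 3 + [("high", "high")] * 7)

cpt = learn_cpt(cases)
```

With more cases, the pseudo-counts are swamped by the data; with few cases, the prior keeps every probability away from the hard extremes of 0 and 1.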

Learning the Structure

Learning the conditional probabilities of a BN, given the graphical structure and data, is referred to as parameter learning. Another type of learning in BNs is structure learning, which attempts to recover the causal structure that underlies a set of data based on the patterns of probabilistic dependence between variables. Many algorithms for learning BN structure employ the concept of d-separation (short for directed separation) to translate between probabilistic patterns and graphical structures. Graphically, two nodes (or sets of nodes) X and Y are said to be d-separated by another node (or set of nodes) Z if and only if (for details, the reader is referred to the 'Further reading' section):

1. the path between X and Y contains a chain X → m → Y or a fork X ← m → Y such that the middle node m is a member of Z, or
2. the path between X and Y contains an inverted fork (or collider) X → m ← Y such that the middle node m is not in Z and such that no descendant of m is in Z.

The term path indicates a sequence of consecutive edges (of any directionality) connecting two nodes in a graph. The notion of d-separation has important implications for the probability relations implied by a graph. Namely, if two nodes X and Y are d-separated by a node Z in a graphical model, then the corresponding variable X is independent of the variable Y, conditional on the variable Z. Conversely, if X and Y are not d-separated by Z, then X and Y are dependent conditional on Z. These implications are entirely general and do not depend on any assumptions about the distributional form of the variables or the functional form of the causal relationships.

The notion of a d-separating chain was used to establish the parental Markov property, which forms the basis for modularization of BNs. For example, in Figure 1, d-separation implies that H is independent of N given A.
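These d-separation claims for Figure 1 can be checked by brute-force enumeration of a small two-state version of the network. The CPT entries below are invented; only the graph structure (N → A → {T, H} → K) follows the article.

```python
# Sketch: verifying chain, fork, and collider behavior in Figure 1 by
# enumerating the joint distribution of a two-state version of the
# network. All numeric CPT entries are invented for illustration.
from itertools import product

vals = (0, 1)
P_N = {0: 0.7, 1: 0.3}
P_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}            # P(a|n)
P_T = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.7, 1: 0.3}}          # P(t|a)
P_H = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.4, 1: 0.6}}          # P(h|a)
P_K = {(0, 0): {0: 0.999, 1: 0.001}, (0, 1): {0: 0.7, 1: 0.3},
       (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.4, 1: 0.6}}  # P(k|t,h)

def joint(n, a, t, h, k):
    return P_N[n] * P_A[n][a] * P_T[a][t] * P_H[a][h] * P_K[(t, h)][k]

def prob(**fixed):
    """Probability that the named variables take the given values."""
    total = 0.0
    for n, a, t, h, k in product(vals, repeat=5):
        assign = {"n": n, "a": a, "t": t, "h": h, "k": k}
        if all(assign[v] == x for v, x in fixed.items()):
            total += joint(n, a, t, h, k)
    return total

# Chain N -> A -> H: conditioning on A blocks the path, so P(h|a,n) = P(h|a).
chain_lhs = prob(h=1, n=1, a=1) / prob(n=1, a=1)
chain_rhs = prob(h=1, a=1) / prob(a=1)

# Fork T <- A -> H: T and H are independent given A.
fork = prob(t=1, h=1, a=1) / prob(a=1)
fork_prod = (prob(t=1, a=1) / prob(a=1)) * (prob(h=1, a=1) / prob(a=1))

# Collider T -> K <- H: conditioning on K unblocks the path, so T and H
# become dependent given A and K ("explaining away").
z = prob(a=1, k=1)
coll = prob(t=1, h=1, a=1, k=1) / z
coll_prod = (prob(t=1, a=1, k=1) / z) * (prob(h=1, a=1, k=1) / z)
```

The chain and fork checks agree exactly, while the collider check does not: once the fishkill K is known, learning that one cause occurred lowers the probability of the other.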
The notion of a d-separating fork can be exemplified by the idea that T and H in Figure 1 are unconditionally dependent (they are both more likely to occur under conditions of high algal density), but become independent after conditioning on A (all correlation between the two is accounted for by A). Colliders indicate two causes having a common effect (such as the relation between H, T, and K in Figure 1) and act the opposite way to forks. For example, if H and T are assumed to be conditionally independent given A, they will become dependent once K is known. This is referred to as the explaining away effect because, given that a consequence has occurred, information about one of


two possible individually sufficient causes tends to reduce our belief that the second cause occurred. The d-separation criterion can be used to test the causal relations represented in a graphical model by confirming that all d-separation relations in the graph are mirrored by equivalent statistical independences in the observational data. If the test is negative, then the graph does not represent the underlying causal mechanisms generating the data. If the test is positive, then the causal structure cannot be rejected. However, there may be other structures that are also consistent with the patterns in the data. Such structures are termed observationally equivalent, and cannot be distinguished using the existing data alone. The process of proposing and testing a large number of graphical models in order to learn the causal structure underlying a set of variables is not very efficient. Instead, structural learning algorithms have been developed that take data (or a statistical summary of data) as inputs and return the set of graphical structures that are consistent with those data as outputs. In other words, the algorithms construct the set of all directed acyclic graphs whose d-separation relations are supported by the statistical dependences contained in the data. When all variables are observed and are either discrete or related by linear Gaussian models, reliable and efficient algorithms exist for recovering the causal structure. However, when some variables are unobserved, the situation is more complicated, for it is no longer clear that graphs consistent with the distributions of observed data must have a directed acyclic structure.
314 Ecological Informatics | Bayesian Networks

Therefore, the available algorithms lead to graphs with four types of edges between any two nodes: (1) a definitively directed edge indicating genuine causation; (2) a possibly directed edge indicating potential causation (leaving open the possibility of a latent common cause); (3) a bidirected edge indicating spurious association (the existence of a latent common cause); or (4) an undirected edge indicating an undetermined relationship. Distinguishing which of the four types of edges should be used to denote the relationship between two variables X and Y requires that a third variable Z exhibit a particular pattern of dependency with X and Y. This is consistent with the notion that causal claims are defined by their ability to correctly specify the behavior of X and Y under the influence of a third variable that corresponds to an external control on either X or Y. However, in the absence of experimental manipulation, the variable Z serves as a virtual control and must come from the observed data. Algorithms for generating partially directed graphs can be regarded as a systematic way of identifying the variables Z that qualify as virtual controls. A partially directed graph that includes the four types of edges described above provides a concise way of revealing all the structures that are observationally equivalent with a particular set of data. When the directions of some edges remain ambiguous, additional tests can be performed to identify which additional observations or experiments are required to reveal the underlying causal structure.
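The graphical d-separation test described above can be implemented directly. The following minimal sketch uses the standard reduction of d-separation to ordinary graph separation in the moralized ancestral graph, and checks the independences discussed earlier for the running example; the edge list is an assumption reconstructed from the text's description of Figure 1 (nutrients → algae; algae → toxins, hypoxia; toxins, hypoxia → fishkill).

```python
# Edges of the example network (reconstructed from the Figure 1 description):
EDGES = [("N", "A"), ("A", "T"), ("A", "H"), ("T", "K"), ("H", "K")]

def ancestors(nodes, edges):
    """All ancestors of `nodes`, including the nodes themselves."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    result = set(nodes)
    frontier = list(nodes)
    while frontier:
        n = frontier.pop()
        for p in parents.get(n, ()):
            if p not in result:
                result.add(p)
                frontier.append(p)
    return result

def d_separated(x, y, z, edges=EDGES):
    """True iff X and Y are d-separated by the set Z.

    Classical reduction: X and Y are d-separated by Z iff Z separates
    X and Y in the moralized ancestral graph of {X, Y} union Z.
    """
    keep = ancestors({x, y} | set(z), edges)
    sub = [(u, v) for u, v in edges if u in keep and v in keep]
    # Moralize: undirect all edges and "marry" parents of a common child.
    und = {n: set() for n in keep}
    parents_of = {}
    for u, v in sub:
        und[u].add(v)
        und[v].add(u)
        parents_of.setdefault(v, set()).add(u)
    for pars in parents_of.values():
        for p in pars:
            und[p] |= pars - {p}
    # Y must be unreachable from X without passing through Z.
    seen, frontier = {x}, [x]
    while frontier:
        n = frontier.pop()
        for m in und[n]:
            if m == y:
                return False
            if m not in seen and m not in z:
                seen.add(m)
                frontier.append(m)
    return True
```

On this network, `d_separated("H", "N", {"A"})` and `d_separated("T", "H", {"A"})` hold (the parental Markov and fork cases), while conditioning on the collider's child K makes T and H d-connected again.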

Using Models

Once the structure and conditional probabilities of a BN have been specified (using prior knowledge, models, data-based learning, or a combination), the network can be used to determine the probability distributions of specific target or query nodes, given findings (either deterministic or probabilistic observations) for other nodes. When the query nodes are descendants of the nodes with findings, this process is called prediction. When they are ancestors, it is called inference (or diagnosis). For example, using the network in Figure 3, one can predict hypoxia (H) given values (or distributions) for nutrient concentration (N) and season (S), or one can infer the value of N from findings on H (and/or any other variables). The network may also be used to determine the most probable explanation for why particular values for some system variables were observed, to accurately describe the effects of interventions (or external controls) on the system, and to support decisions about management actions in the face of uncertainty. Each of these uses will be described in the following subsections.

Prediction

Prediction using a BN is straightforward and can generally proceed most effectively using Monte Carlo simulation. The findings for a node are represented by a marginal (discrete or continuous) probability distribution, which is used to generate a large random sample for that variable. This sample is then used as input to the function(s) defining that node's descendant(s), along with samples from any other uncertain variables that are required by the function(s). This generates a sample of the first generation of descendants, which is propagated further along the causal direction in an analogous manner until the query node is reached. The sample for the query node can then be used to estimate the statistics or full distribution for the corresponding variable.

Inference

Inference against the causal direction of a BN is more difficult.
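Both directions can be illustrated with a small, self-contained sketch: forward sampling implements the Monte Carlo prediction scheme just described, while rejection sampling (keeping only simulated cases that match the findings) gives a crude but simple approximate inference. The conditional probabilities below are hypothetical placeholders for the running example network, not values from the article.

```python
import random

# Hypothetical conditional probabilities for the example network.
def sample_once():
    n = random.random() < 0.4                    # high nutrient concentration
    a = random.random() < (0.8 if n else 0.2)    # high algal density
    h = random.random() < (0.6 if a else 0.05)   # hypoxia
    t = random.random() < (0.3 if a else 0.02)   # algal toxins
    k = random.random() < (0.9 if (h or t) else 0.01)  # fishkill
    return n, a, h, t, k

samples = [sample_once() for _ in range(200_000)]

# Prediction (with the causal direction): marginal P(fishkill).
p_k = sum(s[4] for s in samples) / len(samples)

# Inference (against the causal direction) by rejection sampling:
# keep only samples consistent with the finding K = true.
accepted = [s for s in samples if s[4]]
p_n_given_k = sum(s[0] for s in accepted) / len(accepted)

print(f"P(K) ~ {p_k:.3f}, P(N | K) ~ {p_n_given_k:.3f}")
```

With these placeholder probabilities, observing a fishkill raises the probability of high nutrient concentration well above its prior of 0.4, which is exactly the diagnostic reasoning described in the text.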
Several types of algorithms have been developed for exact inference in networks that have discrete variables or multivariate Gaussian distributions. These involve either reversing the direction of the edges using Bayes’s theorem, eliminating variables through summation, performing symbolic manipulation to achieve optimal factoring, or transforming the network into a tree structure of cliques (called a junction tree) and employing a message passing scheme. However, exact inference in an arbitrary

BN is NP-hard, meaning that no algorithm is known that can perform the inference in polynomial time. Instead, calculation time is exponential in the number of variables. Therefore, network size quickly becomes a limiting factor. For large BNs or those with continuous, non-Gaussian distributions, several different approximation algorithms are available. Stochastic simulation, including importance sampling and Markov chain Monte Carlo-type algorithms, is the most popular. Other methods include systematic simplification, with the goal of finding a network with an exact solution, and search-based methods that concentrate on the areas of the joint probability space containing most of the probability mass. Unfortunately, all of the approximate inference techniques are also NP-hard, and there does not seem to be a single technique that works well for all situations. Therefore, recent research has focused on developing algorithms that can give imprecise solutions quickly and be refined iteratively over time. Another option is to link multiple algorithms that are more or less appropriate for different problems and select or combine them in an automated and intelligent way for specific applications.

Explanation

Explanation, also known as belief revision, involves finding the most probable value (or discrete state) of one or more query variables given findings for other variables in the network. When all query nodes are ancestors of nodes with findings, this is referred to as identifying the most probable explanation (MPE). For example, in Figure 3, after observing the occurrence of a fishkill (K), one might want to know the most likely coincident level of algal toxins (T) or hypoxia (H). This amounts to finding the MPE. The problem of finding the MPE is a special case of probabilistic inference and is also NP-hard.

Intervention

The idea of accommodating external interventions was introduced above when motivating the causal interpretation of BNs.
For example, the installation of an artificial aerator to a water body may completely eliminate the chances of hypoxia. This type of intervention can be represented in the network by breaking the links between algal density (A), season (S), and hypoxia (H) and keeping H fixed in a state of false (Figure 5). Alternatively, the management of the aerator can be shown explicitly by adding a node (O) that is a parent of H and can be in a state of either off or on (Figure 6). In the on state, H is false, while in the off state, H has the same conditional distribution as before the intervention. A marginal distribution is then specified for O to describe the management plan for the aerator. Prediction and inference on the new network including O can then proceed in the usual way. As

described above, the ability of a causal BN to correctly represent interventions is an essential element of its usefulness. The use of BNs for decision analysis is one reason for this, as discussed next.

Figure 5 A BN showing the effect of an external intervention that keeps hypoxia in a state of false. (Nodes: nutrient concentration (N), algal density (A), algal toxins (T), season (S), hypoxia (H), and fishkill (K), with the links into H removed.)

Figure 6 A BN explicitly showing the effect of adding an artificial oxygen aerator (O). (Nodes: nutrient concentration (N), algal density (A), algal toxins (T), oxygen aerator (O), season (S), hypoxia (H), and fishkill (K).)

Decision

Decision analysis is a normative method for selecting among actions that have uncertain outcomes. This outcome uncertainty can be characterized by probability distributions for variables that represent the key consequences of the considered actions. The decision maker's relative preference for the various possible outcomes can then be described by a utility function that also captures the decision maker's attitude toward risk. A logical decision maker should then prefer the action that maximizes a particular mathematical combination of the derived probabilities and utilities. BNs provide a convenient and appropriate tool for generating the outcome probabilities required for decision analysis. If the children of a fully specified network represent outcome variables and the root parents represent decision variables, then the necessary probabilities can be generated by forward prediction. In most management contexts, actions to be considered often involve some type of system intervention, and these can be handled appropriately as described in the previous section. Many software packages support the inclusion of utility nodes in a BN in addition to decision nodes and chance variables. Such networks are referred to in the decision analysis literature as influence diagrams. Influence diagrams can often be solved directly to find the optimal action, either for the unconditioned network or after findings have been added for some of the network variables. This facilitates value of information analysis.

Special Cases

Hierarchical Models

Hierarchical BNs can accommodate additional complexity in uncertainty representation by explicitly separating the parameters, hyperparameters, and data from the processes that are the usual focus of a causal network. For example, there may be multiple sources of data for any given model variable, and each of these may have its own measurement error. These sources could be added as explicit nodes in the network with corresponding parameters describing the error magnitude and/or bias (Figure 7). Further, the parameters characterizing the causal processes in the network may be variable across space, time, individuals, or groups. This variability can be captured by conditioning these parameters on higher-level parameters, called hyperparameters. In this way, stochasticity is allowed at multiple levels, each conditioned on one level higher. Complex statistical analysis can then proceed in reduced dimensions: rather than asking, "How does the entire process work?", after conditioning we can ask, "How does this component work, conditioned on those elements directly affecting it?" The complexity of nature then emerges when we marginalize across the components.

Figure 7 An example of how a hierarchical BN can be used to represent complexity by allowing multiple levels of uncertainty. The processes relating the two variables nutrient concentration (N) and algal density (A) are characterized by a vector of parameters θA. If θA is expected to vary across space or time, this variability can be described by the hyperparameter vector θα. If, in addition, algal density is measured using more than one method (e.g., in situ chlorophyll (C) and remotely sensed chlorophyll (R)), the different measurement errors can be described by parameters θC and θR.

Figure 8 A temporal BN, showing how algal density (A) in one time period can influence nutrient concentration (N) in the next. (The variables N, A, T, H, and K are replicated for three time slices.)

Figure 9 A BN of estuarine eutrophication that integrates a number of submodels, shown as rounded squares in the main network. Parameters of the submodels are shown as shaded nodes. (The submodels relate river flow, river nitrogen concentration, and water temperature to chlorophyll a, Pfiesteria density, carbon production, sediment oxygen demand, dissolved oxygen concentration, duration of stratification, days of hypoxia, shellfish survival, fishkills, and fish health.) Reproduced from Borsuk ME, Stow CA, and Reckhow KH (2004) A Bayesian network of eutrophication models for synthesis, prediction, and uncertainty analysis. Ecological Modelling 173: 224, with permission from Elsevier.


Dynamic Models

As discussed above in the context of introducing the Markov property, the feedback loops present in many natural systems suggest the need to include cycles in BNs. However, cycles can often be avoided by defining variables to represent long-term equilibrium values, rather than short-term responses. When this is not reasonable, variables can be replicated or indexed to represent multiple points in time, so that the value of a variable at one time point can depend on the value at another, rather than having to refer back to itself (Figure 8). Such a model is referred to as a 'dynamic' (or 'temporal') BN and is a generalization of the familiar hidden Markov model and linear dynamical system.
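The unrolling trick can be sketched for the running example: the feedback from algal density back to nutrient concentration becomes an edge A(t) → N(t+1), so the unrolled graph stays acyclic. All probabilities below are hypothetical placeholders.

```python
import random

def simulate(n_steps=3):
    """Unroll the N <-> A feedback loop over discrete time steps."""
    trajectory = []
    n = random.random() < 0.4                      # N(1): high nutrients
    for _ in range(n_steps):
        a = random.random() < (0.8 if n else 0.2)  # A(t) depends on N(t)
        h = random.random() < (0.6 if a else 0.05) # H(t) depends on A(t)
        trajectory.append((n, a, h))
        # Feedback edge A(t) -> N(t+1): a dense bloom depletes nutrients.
        n = random.random() < (0.2 if a else 0.5)
    return trajectory
```

Because every edge points either within a time slice or forward to the next slice, the variable never has to refer back to itself, which is exactly why the dynamic BN remains a directed acyclic graph.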

Integrated Models

The Markov property of causal BNs provides a rational system for decomposing a large network into a set of smaller subnetworks (Figure 9). This is especially useful in the environmental and ecological sciences, where the study of complex systems is usually broken down into smaller pieces, each addressed by a different group of researchers. The Markov property means that these groups can assemble separate submodels using approaches suitable for the type and scale of information they have available, and when the submodels are reassembled, the whole model will make logical, causal sense.
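This modular assembly can be illustrated with a toy sketch: two hypothetical submodels, developed independently, compose cleanly because each depends on the rest of the network only through its parent variables. The functional forms and numbers are illustrative assumptions, not values from the article.

```python
import random

def nutrient_submodel():
    """Group 1's submodel: nutrient concentration in mg/l (hypothetical)."""
    return random.gauss(2.0, 0.5)

def algae_submodel(n):
    """Group 2's submodel: chlorophyll in ug/l given nutrients (hypothetical)."""
    return max(0.0, 10.0 * n + random.gauss(0.0, 3.0))

# By the Markov property, assembling the full model is just composition:
# each submodel needs only samples of its parent variables as input.
chlorophyll = [algae_submodel(nutrient_submodel()) for _ in range(50_000)]
mean_chl = sum(chlorophyll) / len(chlorophyll)
```

Each group can build and test its piece in isolation; reassembly requires only that the interface variables (here, N) and their distributions are agreed upon.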


See also: Application of Ecological Informatics; Artificial Neural Networks: Temporal Networks; Artificial Neural Networks; Ecological Informatics: Overview; Sensitivity and Uncertainty; Statistical Prediction; Multilayer Perceptron.

Further Reading

Borsuk ME, Stow CA, and Reckhow KH (2004) A Bayesian network of eutrophication models for synthesis, prediction, and uncertainty analysis. Ecological Modelling 173: 219–239.
Cowell RG, Dawid AP, Lauritzen SL, and Spiegelhalter DJ (1999) Probabilistic Networks and Expert Systems. New York: Springer.
Dillon PJ and Rigler FH (1974) The phosphorus–chlorophyll relationship in lakes. Limnology and Oceanography 19: 767–773.
Jordan MI (ed.) (1999) Learning in Graphical Models. Cambridge, MA: MIT Press.
Neapolitan RE (2004) Learning Bayesian Networks. Upper Saddle River, NJ: Pearson Prentice-Hall.
Oliver RM and Smith JQ (eds.) (1990) Influence Diagrams, Belief Nets, and Decision Analysis. New York: Wiley.
Pearl J (1988) Probabilistic Reasoning in Intelligent Systems. San Francisco, CA: Morgan Kaufmann.
Pearl J (2000) Causality, pp. 16, 30, 44. Cambridge, UK: Cambridge University Press.
Shipley B (2000) Cause and Correlation in Biology. Cambridge, UK: Cambridge University Press.
Spiegelhalter DJ, Dawid AP, Lauritzen SL, and Cowell RG (1993) Bayesian analysis in expert systems. Statistical Science 8: 219–283.
Spirtes P, Glymour C, and Scheines R (2000) Causation, Prediction, and Search. Cambridge, MA: MIT Press.

Behavioral and Ecological Genetics U Ganslosser, Fürth, Germany ª 2008 Elsevier B.V. All rights reserved.

Introduction to Genetical Terms
Genes for Behavior?
Relationship and Kinship
Heritabilities and Selection Experiments
Single Gene Effects on Behavior
Population Genetics for Small Populations
Further Reading

Introduction to Genetical Terms

The aim of this chapter is not to outline general concepts of classical, Mendelian genetics or techniques of molecular sciences. These can be found in any introductory textbook of undergraduate biology. Nevertheless, a few terms and definitions of general genetics shall be recapitulated first. After that we are going to discuss behavioral genetics in a general way, followed by an outline of the importance of genetic relatedness in the evolution of animal societies, as a background for population genetic considerations. This will be followed by some studies on heritability and selection experiments for behavioral traits, and a few examples of monogenic heredity. A few comments on human behavioral genetics are included in this. Finally, the importance of population genetics, particularly genetics of small populations, for wildlife management and conservation biology shall be outlined.