Handbook of Statistics, Vol. 28. ISSN: 0169-7161. Copyright © 2012 Elsevier B.V. All rights reserved. DOI: 10.1016/B978-0-444-51875-0.00018-X

18

Exploring Genetic Epidemiology Data with Bayesian Networks

Andrei S. Rodin1, Grigoriy Gogoshin1, Anatoliy Litvinenko1 and Eric Boerwinkle1,2

1 Human Genetics Center, School of Public Health, University of Texas Health Science Center, Houston, TX 77030, USA
2 Institute of Molecular Medicine, University of Texas Health Science Center, Houston, TX 77030, USA

Abstract

Recent advances in DNA sequencing and genotyping technologies have led to the initiation of large-scale genetic association studies aimed at unraveling complex genotype–(environment)–phenotype relationships underlying common human diseases. Unfortunately, traditional statistical tools are ill-suited for analyzing high-dimensional datasets with many small effects. In addition, a primary emphasis of traditional statistical methods is formal hypothesis testing rather than hypothesis generation. When faced with hundreds of thousands (and soon, millions) of potentially predictive variables (e.g., SNPs), we need methods for automated knowledge discovery. Providentially, such methods have long been in development and are well established in the data mining research community. One such method is Bayesian or Belief Network (BN) modeling, which has its roots in both computer science and statistics. A BN is a graphical model that represents a joint multivariate probability distribution and reflects the conditional independences among variables. Given data, the optimal network topology can be estimated with the assistance of local search (optimization) algorithms and model scoring criteria. Statistical significance of edge strengths can be evaluated using model scoring criteria tests, cross-validation, and bootstrapping. BNs are an excellent tool for reverse-engineering biological (e.g., physiologic and genetic) networks from "flat" datasets (e.g., those generated by large-scale, or candidate gene, association studies, or by microarray gene expression experiments). In this chapter we review various applications of BN methodology to modern human genomics. An example application is detailed. We also discuss various technical aspects of BN reconstruction, with a special emphasis on scalability in the context of modern "omic" data.

Keywords: Bayesian (Belief) networks, data mining, genetic epidemiology, genomics, large-scale (genome-wide) association studies

1. Introduction

Recent advances in DNA sequencing and genotyping technologies, coupled with the emergence of extensive information resources (e.g., http://www.ncbi.nlm.nih.gov), mean that most biomedical researchers have access to a wealth of single nucleotide polymorphism (SNP) data. Combined with the availability of appropriate sampling designs, stored DNA, and relevant computerized phenotype data, it is now possible to begin to unravel complex genotype–(environment)–phenotype relationships underlying common human diseases (e.g., coronary heart disease (CHD), hypertension, cancer, and stroke). Unfortunately, aspects of study design and data analysis have not kept pace with advances in laboratory technologies. Traditional statistical tools are ill-suited for analyzing high-dimensional datasets with many small effects. In addition, a primary emphasis of traditional statistical methods is formal hypothesis testing rather than hypothesis generation. When faced with hundreds of thousands (and soon, millions) of potentially predictive variables (e.g., SNPs), we need methods for automated knowledge discovery.

Fortunately, such methods have long been in development and are well established in the data mining research community. Spurred by the increasingly demanding data analysis needs of business and finance, data mining practitioners have evolved a toolkit of diverse methods adopted from the fields of computer science, artificial intelligence, and machine learning. These methods have proven to be well suited to the analysis of high-dimensional datasets with many small effects, and to automated knowledge discovery. One of these methods is Bayesian or Belief Network (BN) modeling, a subspecies of dependency modeling with roots in both computer science and statistics. A BN is a graphical model that represents a joint multivariate probability distribution and reflects the conditional independences among variables.
Given data, the optimal network topology can be estimated with the assistance of local search (optimization) algorithms and model scoring criteria. Statistical significance of edge strengths can be evaluated using model scoring criteria tests, cross-validation, and bootstrapping. BNs are an excellent tool for reverse-engineering biological (e.g., physiologic and genetic) networks from "flat" datasets (e.g., those generated by large-scale, or candidate gene, association studies, or by microarray gene expression experiments). In this chapter we review various applications of BN methodology to modern human genomics, with a special emphasis on genetic epidemiology and wide-scale association studies. An example application from the domain of CHD genetics is detailed. We also discuss various technical aspects of reconstructing BNs from real data. Finally, we introduce a number of novel techniques aimed at explicitly incorporating prior expert knowledge into BNs (which would otherwise be purely data-driven), and at improving BN reconstruction in general.

2. Bayesian Networks

2.1. Overview

The origin of BN methodology, and dependency modeling in general, goes back as far as the original work of Sewall Wright on path analysis (Wright, 1934). However, its actual implementation was not possible until both theoretical and practical advances in computer science had been achieved, most notably the conditional independence/probabilistic reasoning theory of Pearl (1988). It is gratifying that a comprehensive treatment of BN methodology can now be found in recent textbooks (Russell and Norvig, 2009). However, as BN methodology might still be unfamiliar to many computational biology practitioners, we feel obligated to give a brief introductory overview. A detailed introduction to BNs can be found in Heckerman (1995) and, in a less formal fashion, in Krause (1998); the latter is particularly suitable for readers less familiar with probability theory. Additional recent introductions to BNs (Friedman et al., 2000; Pe'er, 2005; Rodin and Boerwinkle, 2005; Rodin et al., 2005; Markowetz and Spang, 2007; Needham et al., 2007) are aimed primarily at biological practitioners and concentrate predominantly on modeling signaling and cellular networks from biological data. A rigorous theoretical treatment of BN reconstruction is given in de Campos and Ji (2011).

A BN is a graphical representation of a joint probability distribution among a number of random variables (Pearl, 1988). The network's topology reflects the relationships among the variables, or nodes (i.e., which variables are dependent on, or conditionally independent of, which other variables). The nodes, or variables, can be discrete or continuous. In the present domain, an example of a discrete node is a SNP, and an example of a continuous node is plasma cholesterol level. Edges connecting the nodes indicate dependencies.
Each edge is accompanied by a Conditional Probability Function (CPF) or, for discrete variables, a Conditional Probability Table (CPT) that defines the conditional distribution of the offspring variable given its parents. The edge strength indicates the relative magnitude of the dependency between two variables. The edge directionality, when present, is not intended to imply causation. Rather, it is employed for mathematical and computational convenience in order to distinguish between "parent" and "offspring" ("daughter") nodes.

Formally, a BN for a set of random variables X = {X1, . . . , Xn} consists of (i) its structure, or topology, T, which describes conditional independence assertions about the variables in X, and (ii) a parameter vector, Θ, describing the local probability distributions for each variable (conditional distributions given the variable's respective parents in T). The topology is a Directed Acyclic Graph (DAG), and together these two components define a joint probability distribution for X. The DAG is subject to the fundamental Markov assumption, meaning that each variable is independent of its non-offspring given its parents. This assumption allows one to decompose a complex joint probability distribution into a more tractable product form. For example, for the network


[Figure: an example DAG with nodes A, B, C, D, E, and F, and edges A → B, B → C, B → E, D → E, and E → F]

the joint distribution can be decomposed into

P(A, B, C, D, E, F) = P(A) P(D) P(B | A) P(E | D, B) P(C | B) P(F | E).

The conditional probabilities in the above expression form Θ, the set of parameters defining the local probability distributions for each and every variable. This decomposition is also known as the chain rule of probability for BNs. For the above decomposition, it can be said that variable E is, for example, conditionally independent of A given B. This means that the probability distribution of E conditioned on A and B is the same as the probability distribution of E conditioned on B only, or P(E | A, B) = P(E | B). This conditional independence statement can be formalized as (E ⊥ A | B). Various principles can be used to assess conditional independence in BNs (and, therefore, to distinguish between direct and indirect dependencies), most notably Pearl's d-separation (Pearl, 1988). For our purposes, a similar but simpler idea, the Markov blanket, is useful and will be introduced below.

In most human genetics applications, researchers are primarily interested in predicting only one or two target variables (e.g., a disease endpoint), and it is usually the case that these target variables are directly influenced by a comparatively limited number of other variables. In general, biological networks, both physiologic and genetic, can be described as locally structured (or sparse) systems, meaning that each node directly interacts with only a limited number of other nodes (Miklos and Rubin, 1996; Arnone and Davidson, 1997). This structural sparseness bodes well for computational efficiency and scalability when learning a BN from data. Consider a network of N discrete variables, with S states each, that has no local structure, so that every node is connected to every other node. Its complexity, or the amount of information needed to completely specify all CPTs, is proportional to S^N. For example, if S = 3 and N = 20, 3,486,784,401 CPT entries would be required to specify an unstructured network.
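The chain rule decomposition and the conditional independence (E ⊥ A | B) discussed above can be verified numerically. A minimal sketch, with all six variables binary and CPT values invented purely for illustration:

```python
# Sketch (not from the chapter): verifying (E ⊥ A | B) in the example DAG
# A→B, B→C, B→E, D→E, E→F. All variables binary; CPT numbers are arbitrary.
from itertools import product

# Local distributions: P(A), P(D), P(B|A), P(E|D,B), P(C|B), P(F|E)
pA = {0: 0.6, 1: 0.4}
pD = {0: 0.7, 1: 0.3}
pB_A = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}          # pB_A[a][b]
pE_DB = {(d, b): {0: p, 1: 1 - p}
         for (d, b), p in {(0, 0): 0.9, (0, 1): 0.4,
                           (1, 0): 0.5, (1, 1): 0.1}.items()}
pC_B = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}
pF_E = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

def joint(a, b, c, d, e, f):
    # Chain rule for the BN
    return (pA[a] * pD[d] * pB_A[a][b] * pE_DB[(d, b)][e]
            * pC_B[b][c] * pF_E[e][f])

def prob(fixed):
    # Marginal probability of an assignment to a subset of {A,...,F}
    total = 0.0
    for a, b, c, d, e, f in product((0, 1), repeat=6):
        state = dict(A=a, B=b, C=c, D=d, E=e, F=f)
        if all(state[k] == v for k, v in fixed.items()):
            total += joint(a, b, c, d, e, f)
    return total

# P(E | A, B) equals P(E | B) for every combination of values.
for a, b, e in product((0, 1), repeat=3):
    lhs = prob(dict(A=a, B=b, E=e)) / prob(dict(A=a, B=b))
    rhs = prob(dict(B=b, E=e)) / prob(dict(B=b))
    assert abs(lhs - rhs) < 1e-12
print("(E ⊥ A | B) holds")
```

Conditioning on any other set of nodes (e.g., on E's full parent set {D, B}) can be checked the same way by changing the `fixed` dictionaries.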
However, if we assume a local structure (i.e., each node is directly influenced by no more than K other nodes), we need S^K entries for each node and no more than N·S^K entries for the whole network (only 4860 if K = 5, for example). In addition, local structure allows us to split the complete network into sub-networks of interest, such as sub-networks containing a certain gene or phenotype. Therefore, applying BN modeling to a large number of potentially predictive variables allows one to isolate the sub-network around a target variable and examine the relationship between this variable and its direct predictors.

One definition of such a sub-network is the Markov blanket of the target variable (Pearl, 1988). The Markov blanket of a node A consists of the "parents" of A, the "offspring" of A, and all the nodes that share an "offspring" with A. Given its Markov blanket, the target variable is independent of (or shielded from) any other variable in the network. The Markov blanket of a target node therefore performs natural feature selection (Guyon and Elisseeff, 2003), since all other nodes (those not belonging to the Markov blanket) can be deleted from the BN under certain reasonable assumptions (such as completeness and observability of the data). Dependencies within the Markov blanket may be checked for statistical robustness using resampling, Bayesian, and information-theoretic criteria. Thus, BN construction is simultaneously a hypothesis-generating (constructing the Markov blanket of a target variable) and a hypothesis-testing (estimating the robustness of the dependencies within that Markov blanket) activity.

In addition to their conceptual versatility, BNs have a number of other advantages over alternative representation, modeling, and data analysis techniques. First, by their very raison d'etre, BNs allow one to model and study dependencies and, potentially, causal relationships, which is crucial if we are interested in gaining an underlying mechanistic understanding of the domain in question, such as biological networks at various levels of abstraction. Second, BNs (or at least their sub-networks) can be easily visualized, understood, interpreted, shared, and improved upon by expert input. Unlike many computer science, machine learning, and artificial intelligence methods, BNs are literally an antithesis of a "black box." Third, BNs, not unlike classical Bayesian statistical analysis techniques, can be used to integrate expert knowledge and data. In other words, BN modeling can easily be transformed from purely "data-driven" modeling into modeling that takes into account both the signal contained in the data and prior human expertise. Fourth, BNs do not suffer from the ubiquitous problem of "overfitting" when proper model scoring criteria are employed; depending on the scoring metric used, BN learning generally does not require separate testing and/or validation data sets.
Fifth, the probabilistic nature of BNs allows them to accommodate the substantial quantities of random noise present in most biological datasets. Moreover, BNs can handle the uncertainty and random noise associated with unobserved variables, or hidden nodes. Finally, computational resources permitting, BNs can be augmented with traditional statistical resampling methods, such as cross-validation or bootstrapping, for validation and other purposes.

BN modeling also has practical utility for predicting future outcomes. Given exact values for some predictive variables (e.g., SNPs) and the reconstructed BN, the posterior probability distributions of the target variables (e.g., a phenotype) can be computed in closed form using well-established probabilistic inference algorithms (Pearl, 1988). By concentrating largely on the Markov blanket of a target variable, BN modeling can be reduced to Naïve Bayes classification, which has many attractive properties compared to other classifiers (Kononenko, 1990). These properties include computational efficiency, excellent scalability, and high robustness and prediction accuracy, comparable to that of, for example, support vector machines and ensemble (bagging, boosting) tree-based classifiers.

Genotype-to-phenotype modeling for complex traits is particularly suitable for BN treatment. Edges in BN models correspond to biological processes and interactions that are readily interpreted and whose relative strengths can be estimated. The variable-to-observation ratio (dimensionality) is relatively low, which makes physiologic and genetic networks much easier to model than, for example, gene expression (microarray) networks (Friedman et al., 2000). However, the issue of scalability does arise when moving from candidate gene to genome-wide association studies. We will discuss increasing BN scalability in Section 3.3.
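The Markov blanket introduced above is straightforward to extract from a graph structure. A minimal sketch, using the six-node example network from Section 2.1 (the representation and function are illustrative, not the authors' software):

```python
# Sketch (not from the chapter): extracting the Markov blanket of a target
# node from a DAG given as {node: list_of_parents}.
def markov_blanket(parents, target):
    """Parents of target, offspring of target, and co-parents of offspring."""
    children = [n for n, ps in parents.items() if target in ps]
    blanket = set(parents[target]) | set(children)
    for child in children:                  # nodes sharing an offspring
        blanket |= set(parents[child])
    blanket.discard(target)
    return blanket

# The example DAG: A→B, B→C, B→E, D→E, E→F
dag = {"A": [], "B": ["A"], "C": ["B"], "D": [], "E": ["D", "B"], "F": ["E"]}
print(sorted(markov_blanket(dag, "E")))   # ['B', 'D', 'F']
print(sorted(markov_blanket(dag, "B")))   # ['A', 'C', 'D', 'E']
```

Note that D enters the blanket of B only because the two nodes share the offspring E; this is exactly the "co-parent" clause of the definition.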

2.2. Learning a BN

Learning a BN involves learning both its structure, or topology (formally defined as an equivalence class), and its local probability distributions. Consider the following three graphs:

A → B → C
A ← B ← C
A ← B → C
Each of these structures represents only one independence assertion: that A and C are conditionally independent given B. In other words, all three graphs describe the identical set of independence assertions. When we are trying to represent a set of independence assertions by a graph, it is not always possible to resolve directionality. Therefore, a generalization of the DAG is necessary. For example, an undirected A–B edge represents an equivalence class consisting of two network structures: one containing an A → B edge, and one containing an A ← B edge. If some edges within a graph are directed and some are not, the structure is a partially directed acyclic graph (PDAG). Thus, a PDAG is a graphical representation of an equivalence class.

Once the BN topology, or PDAG, is fixed, learning the local probability distributions is straightforward (Heckerman, 1995, and references therein). For example, if all nodes are discrete variables, learning entails estimating the multinomial probabilities of each variable's states given the states of the variable's parents. For continuous variables, discretization means that there will be some loss of information. However, in many cases discretization and the multinomial model may be a better fit than the linear Gaussian model so often used in the biological sciences. Specifically, the multinomial model can capture non-linear relationships and is robust to most distributional assumptions. Also, intelligent, adaptive discretization methods can be used to minimize the potential information loss.

When learning the network topology, we aim to find an equivalence class that best fits the training set D = {x1, . . . , xm} of independent instances of X. An effective approach to learning the network structure is to apply some objective scoring criterion to various equivalence classes with respect to their fit to the data and to pick, in the course of a heuristic search through structure space, the equivalence class that scores the highest.
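The equivalence of the chain structures above can also be checked numerically: for any joint distribution whatsoever, the factorizations P(A)P(B|A)P(C|B) and P(C)P(B|C)P(A|B) coincide. A short sketch with an arbitrary randomly generated distribution (illustrative, not from the chapter):

```python
# Sketch: two Markov-equivalent factorizations assign identical probabilities.
from itertools import product
import random

random.seed(1)
# An arbitrary positive joint distribution over binary A, B, C
raw = {abc: random.random() for abc in product((0, 1), repeat=3)}
z = sum(raw.values())
p = {abc: v / z for abc, v in raw.items()}

def marg(keep):
    """Marginal over the listed axes (0 = A, 1 = B, 2 = C)."""
    out = {}
    for abc, v in p.items():
        key = tuple(abc[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

pa, pb, pc = marg([0]), marg([1]), marg([2])
pab, pbc = marg([0, 1]), marg([1, 2])

for a, b, c in product((0, 1), repeat=3):
    chain1 = pa[(a,)] * (pab[(a, b)] / pa[(a,)]) * (pbc[(b, c)] / pb[(b,)])  # A→B→C
    chain2 = pc[(c,)] * (pbc[(b, c)] / pc[(c,)]) * (pab[(a, b)] / pb[(b,)])  # A←B←C
    assert abs(chain1 - chain2) < 1e-12
print("both chains define the same distribution")
```

Because the two structures can never be distinguished by fit to data, any scoring criterion can only select their shared equivalence class, not a unique direction.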
This strategy is known in statistics as model selection. One such criterion, Bayesian in nature, is the posterior probability of an equivalence class given the data (logarithms are used for computational convenience):

log P(T | D) = log P(D | T) + log P(T) + Const.,

log P(D | T) = log ∫ P(D | T, Θ) P(Θ | T) dΘ.

The first component, log P(D | T), is also known as the log marginal likelihood; log P(T) is a network structure (topology) prior. Under a number of reasonable assumptions, the log marginal likelihood can be computed efficiently. In the case of discrete multinomial variables and parameter independence, the parameter priors P(Θ | T) can be shown to be Dirichlet-distributed (Geiger and Heckerman, 1994), and the log marginal likelihood is easily decomposed into a sum of terms involving gamma functions (Cooper and Herskovits, 1992). The posterior probability criterion is also known as the BDe (Bayesian/Dirichlet) metric (Heckerman et al., 1995). While other model scoring criteria exist, an important advantage of the BDe metric is that it avoids overfitting. We will discuss and compare other model scoring criteria from the perspective of overfitting in Section 3.2.

Finding the BN equivalence class with the best score is NP-hard (Chickering, 1995). Therefore, a heuristic search must be performed for efficient model selection. In this heuristic search, the initial structure and the allowed operators must be defined. In our experience, BN reconstruction can be somewhat sensitive to the choice of initial state. The three allowed operators usually are: (1) add an edge, (2) remove an edge, and (3) reverse an edge's directionality. Heuristic search algorithms include simple greedy hill-climbing, hill-climbing with restarts, best-first search, beam search, simulated annealing, genetic algorithms, etc. (Russell and Norvig, 2009). Hill-climbing with random restarts has been shown to be particularly effective (Chickering, 1996). We will discuss the scalability and computational efficiency of BN reconstruction in Section 3.3.

Learning causation, as opposed to "just" dependence (or correlation, or association) relationships, is a controversial topic, partly because the philosophical definition of causation is debatable (Krause, 1998; Heckerman et al., 1999; Pearl, 2000). Causal networks are similar to BNs mathematically, but the meaning of a directed edge is different: in causal networks, the parents of a node are its immediate causes.
It is often assumed that if (1) a PDAG can be learned from the data, and (2) the causal Markov assumption holds, then at least some of the directed edges can be interpreted as causal. The causal Markov assumption (CMA) states that, given a node's immediate causes, it is independent of its earlier causes. The CMA has been shown to hold for many application domains, including those in biology (Friedman et al., 2000). However, other assumptions necessary to interpret directed edges as causal, such as the absence of unobserved variables (hidden nodes), do not hold in the domain of genetic epidemiology. Therefore, we prefer to apply BNs to model dependencies in a dataset and to develop prediction equations, but not to claim causation.
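The score-plus-search procedure described in this section can be sketched compactly. The sketch below is an illustration, not the authors' software: it uses a Cooper–Herskovits-style log marginal likelihood (uniform Dirichlet priors, a special case of the Bayesian/Dirichlet family) and first-improvement greedy hill-climbing with the add/remove/reverse operators; the toy data and variable names are invented:

```python
# Sketch: Bayesian structure score plus greedy hill-climbing (illustrative).
from itertools import product
from math import lgamma
import random

def family_score(data, states, child, parents):
    """Cooper-Herskovits log marginal likelihood of `child` given `parents`
    under uniform Dirichlet priors (alpha = 1 per CPT cell)."""
    r = len(states[child])
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        cell = counts.setdefault(key, [0] * r)
        cell[states[child].index(row[child])] += 1
    score = 0.0
    for cell in counts.values():
        score += lgamma(r) - lgamma(r + sum(cell))
        score += sum(lgamma(1 + n) for n in cell)
    return score

def network_score(data, states, parents):
    # Decomposability: the network score is a sum of per-family scores.
    return sum(family_score(data, states, x, ps) for x, ps in parents.items())

def is_acyclic(parents):
    seen, done = set(), set()
    def visit(n):
        if n in done: return True
        if n in seen: return False
        seen.add(n)
        ok = all(visit(p) for p in parents[n])
        done.add(n)
        return ok
    return all(visit(n) for n in parents)

def hill_climb(data, states):
    nodes = list(states)
    parents = {n: [] for n in nodes}            # start from the empty graph
    best = network_score(data, states, parents)
    improved = True
    while improved:
        improved = False
        for a, b in product(nodes, nodes):      # operators: add/remove/reverse
            if a == b: continue
            for op in ("add", "remove", "reverse"):
                cand = {n: list(ps) for n, ps in parents.items()}
                if op == "add" and a not in cand[b]: cand[b].append(a)
                elif op == "remove" and a in cand[b]: cand[b].remove(a)
                elif op == "reverse" and a in cand[b]:
                    cand[b].remove(a); cand[a].append(b)
                else: continue
                if not is_acyclic(cand): continue
                s = network_score(data, states, cand)
                if s > best + 1e-9:
                    parents, best, improved = cand, s, True
    return parents, best

# Toy data: X strongly influences Y; Z is independent noise.
random.seed(7)
data = []
for _ in range(300):
    x = random.choice("ab")
    y = x if random.random() < 0.9 else random.choice("ab")
    z = random.choice("ab")
    data.append({"X": x, "Y": y, "Z": z})
states = {"X": ["a", "b"], "Y": ["a", "b"], "Z": ["a", "b"]}
dag, score = hill_climb(data, states)
print(dag)
```

With this data the search should recover an X–Y edge (in one direction or the other, consistent with the equivalence-class discussion above). A production implementation would add random restarts, as described in the text.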

2.3. Validating a BN

Once the BN is learned, we would like to assess the robustness of its overall structure and of particular features (e.g., Markov blankets). While BNs are essentially a statistical technique, the issue of estimating their robustness has received relatively limited attention. One way to test the relative "quality" of a particular network is to evaluate it using a model scoring criterion, such as the aforementioned posterior probability of the network given the data (Heckerman, 1995). We can compute the scores of two networks that differ in only one specific feature, such as the presence/absence of an edge or its direction, and thus estimate the relative support for this feature. This procedure is conceptually similar to a likelihood ratio test (LRT). A second way to statistically evaluate BNs is bootstrapping (Efron and Tibshirani, 1993).


Bootstrapping is a popular way of estimating the statistical significance of edges in other graphical modeling domains, such as phylogenetic analysis in molecular evolution (Zharkikh and Li, 1995) and cluster analysis in gene expression (Kerr and Churchill, 2001). In nonparametric bootstrapping, a "re-shuffled" dataset is generated from the original dataset (by resampling with replacement), the graph is built from this re-shuffled set, and the procedure is repeated a sufficient number of times (usually several hundred or several thousand). Confidence in a particular feature is measured as the percentage of times that feature actually appears in the set of reconstructed graphs.

We would like to underline that the primary goal of BN construction is not to quantify the correlations among variables and attach a definite p-value to these correlations, but rather to identify variables (out of a great many) that are likely to be correlated given the data. Once the feature sets for such variables are cleared of irrelevant conditional dependencies, and the most interesting dependencies are ascertained, traditional statistical methods (such as linear or logistic regression) can be used to rigorously scrutinize the resulting smaller sub-networks, and to validate the relationships therein.
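The nonparametric bootstrap loop can be sketched as follows; `learn_structure` is a hypothetical stand-in for any structure-learning routine (here a trivial correlation-threshold placeholder), not the authors' software:

```python
# Sketch: nonparametric bootstrap support for edges (illustrative).
import random

def learn_structure(data):
    # Hypothetical stand-in for BN reconstruction (see Section 2.2):
    # links variable pairs whose sample correlation exceeds a cutoff.
    names = sorted(data[0])
    edges = set()
    n = len(data)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ma = sum(r[a] for r in data) / n
            mb = sum(r[b] for r in data) / n
            cov = sum((r[a] - ma) * (r[b] - mb) for r in data) / n
            va = sum((r[a] - ma) ** 2 for r in data) / n
            vb = sum((r[b] - mb) ** 2 for r in data) / n
            if va > 0 and vb > 0 and abs(cov / (va * vb) ** 0.5) > 0.3:
                edges.add((a, b))
    return edges

def bootstrap_support(data, n_replicates=500, seed=0):
    rng = random.Random(seed)
    support = {}
    for _ in range(n_replicates):
        resample = [rng.choice(data) for _ in data]   # sample with replacement
        for edge in learn_structure(resample):
            support[edge] = support.get(edge, 0) + 1
    return {e: 100.0 * c / n_replicates for e, c in support.items()}

# Tiny invented demo: y duplicates x; z varies independently.
data = [{"x": i % 2, "y": i % 2, "z": (i // 2) % 2} for i in range(200)]
support = bootstrap_support(data, n_replicates=100)
print(support.get(("x", "y"), 0.0))   # near 100 (a robust feature)
```

Confidence in an edge is then the percentage of replicates in which it appears, exactly as used for the tables in Section 3.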

3. Example application in genetic epidemiology

3.1. Plasma apolipoprotein E levels

The apolipoprotein E (APOE) gene encodes a 299-amino-acid single-chain protein. It is a crucial factor in blood lipid and lipoprotein metabolism and transport. In humans, three common protein isoforms are known: E2, E3, and E4. Population-based and case-control association studies have consistently revealed an association of these isoforms with CHD and Alzheimer's disease (Utermann, 1987; Mahley and Rall, 2000). In addition to the common APOE isoforms, there is substantial other APOE gene variation that may influence interindividual differences in traits of interest, such as plasma apolipoprotein E (apoE) levels. Lai et al. (1998) generated high-density SNP maps around the human APOE locus on chromosome 19 and identified 10 SNPs showing a large allele frequency differential between populations. Stengård et al. (2002) identified multiple APOE SNP variations influencing plasma apoE levels.

We have applied BN modeling to datasets containing SNP variation in the human APOE gene, plasma lipid levels, and plasma apolipoprotein levels obtained from two samples: 702 African-Americans from Jackson, Mississippi, and 854 non-Hispanic whites from Rochester, Minnesota (Rodin and Boerwinkle, 2005; Rodin et al., 2005). Detailed descriptions of these datasets can be found in Stengård et al. (2002) and Nickerson et al. (2000). Briefly, APOE gene variation was ascertained in a two-stage process. First, the gene was sequenced in a representative sample of 72 individuals from both populations. Second, 20 SNPs were genotyped in both (complete) samples. Four of these SNPs are located in the coding regions of the APOE gene; two of them, SNPs 3937 and 4075, code for the E2, E3, and E4 protein isoforms. In addition, the datasets contained the following variables: levels of plasma cholesterol, high-density lipoprotein (HDL) cholesterol, triglycerides, and apolipoproteins E, AI, and B.
Data on the individuals’ gender, age, height, and weight were also available.


For this baseline BN reconstruction experiment, we used the BDe model scoring metric, hill-climbing-with-restarts heuristic search, tree-like structure priors, 500-fold bootstrapping, and a multinomial model with Dirichlet parameter priors. Continuous variables (including plasma levels) were discretized into tertiles. Discretizing a continuous variable into fewer categories leads to more dependencies being recovered in its Markov blanket, thus counterbalancing the potential loss of information and sensitivity due to discretization. We continued the model selection process until the networks had essentially stabilized (converged), usually requiring 10–15 restarts of the basic hill-climbing search. About 20,000,000 network structures were evaluated in each experiment, requiring about 10 min of mid-level workstation time. Our BN reconstruction software and source code (see Section 4.1) are freely available upon request from the authors.

Before discussing the resulting networks, it is important to keep in mind that:

1. The networks shown in this section were generated from the data only, without prior/expert biological knowledge, or input, of any kind. In other words, these are purely data-driven networks.
2. We were primarily interested in the Markov blanket of plasma apoE levels. However, other variables were also of interest.
3. While there is no absolute bootstrap value cut-off in the BN context, in our experience bootstrap values greater than 92% are highly significant, and values between 75% and 92% are at least suggestive (see also Friedman et al., 1999a).

The BNs generated from the Jackson and Rochester datasets are shown in Figs. 1 and 2, respectively. Tables 1A and 1B show the edge strengths, obtained using a model scoring criterion (posterior probability) test, and the bootstrap support for the edges in the Jackson apoE sub-network, respectively. Tables 2A and 2B show the same data for the Rochester sample.
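The tertile discretization mentioned above can be sketched as follows (the cut-point convention and the sample values are illustrative, not taken from the study data):

```python
# Sketch: discretizing a continuous variable into tertiles (illustrative).
def tertiles(values):
    """Map each value to 0 (low), 1 (middle), or 2 (high) by rank thirds."""
    ranked = sorted(values)
    n = len(ranked)
    cut1, cut2 = ranked[n // 3], ranked[(2 * n) // 3]
    return [0 if v < cut1 else 1 if v < cut2 else 2 for v in values]

# Invented plasma-level measurements
apoe_levels = [2.1, 7.8, 3.3, 5.0, 4.4, 9.6, 1.2, 6.1, 3.9]
print(tertiles(apoe_levels))   # [0, 2, 0, 1, 1, 2, 0, 2, 1]
```

Adaptive schemes (e.g., entropy-based cut points) can replace the fixed rank thirds, as noted in Section 2.2.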
Detailed biological interpretation of these results can be found in Rodin et al. (2005). A few key points are worth noting here. First, the edges between SNPs are indicative of linkage disequilibrium. Second, in the Jackson network, only two SNPs belong to the Markov blanket of the apoE level node, namely SNPs 4036 and 4075; the edge between the SNP 4036 and plasma apoE level nodes is only moderately strong. Third, in the Rochester network, SNPs 3937 and 4075 belong to the Markov blanket of the apoE level node. Of course, these two SNPs determine the common protein isoforms. Moreover, SNP 4036 is also one of the four coding APOE SNPs, and it has previously been associated with type III hyperlipoproteinemia in a single family (Rall et al., 1989). To summarize, it is gratifying that a purely data-driven BN analysis uncovered known relationships between three APOE coding mutations and apoE levels.

It should be noted that there is a difference between finding a single "perfect" model, or network, and extracting robust, reliable features from data. The former is difficult to achieve, mostly because the samples are typically too small and the chance sampling errors are too high. Also, no single model is likely to fit all strata within a sample, a phenomenon known as hidden heterogeneity; we will address this phenomenon, in the context of shielding, in Section 3.3. The latter, however, is readily achievable. For example, the dependency between the SNP 4075 and apoE level nodes is robust in the Jackson sample, no matter what model scoring criterion or heuristic search algorithm is chosen to reconstruct the BN. Similarly, the dependencies between SNPs 3937 and 4075 and the apoE level node are strong in the Rochester BNs. One can use bootstrapping to extract robust features. Alternatively, Bayesian model averaging can be used to generate a class of high-scoring BNs that have a number of robust features in common.

Fig. 1. Adapted from Rodin and Boerwinkle (2005). Learned BN relating APOE SNPs to plasma apoE levels in Jackson, MS. Node legends: numbers refer to corresponding SNPs (see Fig. 1 in Nickerson et al. (2000) for an APOE SNP map). APO_E, APO_A, APO_B, TRIG, CHOL, and HDL stand for levels of apolipoproteins E, AI, and B, triglycerides, cholesterol, and high-density lipoprotein cholesterol, respectively. Line thickness corresponds to the relative edge strength (see Table 1A).


Fig. 2. Adapted from Rodin et al. (2005). BN learned from the Rochester, MN dataset. All designations are as in Fig. 1. Line thickness corresponds to the relative edge strength (see Table 2A).

3.2. Overfitting, model scoring criteria, and simulation experiments

It is understood that the posterior probability model scoring criterion (used for reconstructing the above two BNs) does not overfit (Heckerman, 1995). In other words, it does not produce spurious, tentative edges that reflect only the random noise contained in the dataset. This is a useful quality, as analysis methods that tend to overfit do not replicate well on new datasets. However, BNs are predominantly used for exploratory data analysis and for generating novel hypotheses from the data. In this context, a certain amount of overfitting might actually be desirable, as it will increase the method's sensitivity. Of course, "weak" dependencies can be rigorously scrutinized in subsequent analyses; at the early, hypothesis-generating stage, however, we would rather not ignore potentially useful relationships.


Table 1A
Adapted from Rodin and Boerwinkle (2005). Relative edge strengths in the Jackson BN. For each edge, its strength is the ratio of the posterior probability of the model containing the edge to the posterior probability of the identical model without the edge.^a

Node 1      Node 2      Edge strength
APO_E       TRIG        970,916
SNP 3937    SNP 4075    190,785
SNP 4075    APO_B       105,196
SNP 471     SNP 1998    73,280
APO_A       CHOL        51,954
SNP 560     SNP 3673    13,854
SNP 2440    SNP 5361    9439
SNP 308     SNP 2440    2505
SNP 4075    APO_E       2221
SNP 4075    SNP 624     880
SNP 73      SNP 560     374
HDL         WEIGHT      261
SNP 1998    SNP 545     50
SNP 73      SNP 4951    45
GENDER      APO_A       18
SNP 1163    SNP 4075    5.95
APO_E       SNP 4036    4.68
SNP 73      SNP 4075    4.04
SNP 1998    SNP 624     3.44
SNP 2440    SNP 545     2.99
SNP 73      SNP 5361    2.81
SNP 4036    SNP 308     1.17

^a Due to the limited software resolution there is no significant difference between the values above 1,000,000. Twenty edges with such values are considered highly significant and are not shown here. However, they are drawn in Fig. 1.
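The edge strengths in Table 1A are ratios of posterior model probabilities. Since structure scores are computed on a log scale, the ratio is simply the exponentiated score difference; a minimal sketch (assuming log-scale scores from any scoring metric):

```python
import math

def edge_strength(log_score_with_edge, log_score_without_edge):
    """Posterior-probability ratio of two models, computed from their
    log-scale scores to avoid numerical overflow."""
    return math.exp(log_score_with_edge - log_score_without_edge)
```

Working on the log scale is what makes very large ratios such as those in Table 1A representable at all; the "limited software resolution" note above refers to the point at which even this ratio saturates.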

Table 1B
Adapted from Rodin and Boerwinkle (2005). Bootstrap values and strengths for the edges belonging to the Markov blanket of the apoE node in the Jackson BN

Node 1      Node 2      Edge strength^a    Bootstrap value^b (%)
SNP 4036    APO_E       4.68               76
SNP 4075    APO_E       2221               85
TRIG        APO_E       970,916            98

^a From Table 1A.
^b The percentage of times the edge was scored as present out of 500 bootstrap samples.
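The bootstrap values in Tables 1B and 2B can be obtained by re-learning the network on resampled datasets and counting how often each edge appears. A generic sketch, with `learn_edges` standing in for a (hypothetical) full BN structure-learning call:

```python
import random

def bootstrap_edge_support(data, learn_edges, n_boot=500, seed=1):
    """Fraction of bootstrap replicates in which each edge is present.

    `data` is a list of records; `learn_edges` maps a dataset to a set of
    directed edges (a stand-in for a full BN reconstruction run).
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        # Resample with replacement to the original sample size
        resample = [rng.choice(data) for _ in data]
        for edge in learn_edges(resample):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: count / n_boot for edge, count in counts.items()}
```

With n_boot = 500, the returned fractions correspond directly to the percentage column of Tables 1B and 2B.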

We will now introduce the concept of overfitting formally. We first define a distribution learning error. If M_true is the "true" model (i.e., the model generating the "true" or entire distribution of data), then Error(M) of a hypothesized model M should reflect the difference between the fit of M_true to the data and the fit of M to the data. One way to quantify this difference is via the KL-divergence (Kullback and Leibler, 1951):


Table 2A
Adapted from Rodin and Boerwinkle (2005). Relative edge strengths in the Rochester BN. For each edge, its strength is the ratio of the posterior probability of the model containing the edge to the posterior probability of the identical model without the edge.^a

Node 1      Node 2      Edge strength
APO_E       SNP 4075    191,513
WEIGHT      HDL         74,236
SNP 1163    SNP 1575    19,032
SNP 832     SNP 2440    11,575
SNP 1998    SNP 2440    2331
CHOL        SNP 4075    248
SNP 1998    SNP 3106    78
TRIG        GENDER      59
SNP 2907    SNP 2440    12
SNP 4075    SNP 1575    7.92
APO_B       GENDER      4.95
SNP 1163    SNP 2907    2.68

^a Due to the limited software resolution there is no significant difference between the values above 1,000,000. 25 edges with such values are considered highly significant and are not shown here. However, they are drawn in Fig. 2.

Table 2B
Adapted from Rodin et al. (2005). Bootstrap values and strengths for the edges belonging to the Markov blanket of the apoE node in the Rochester BN

Node 1      Node 2      Edge strength^a    Bootstrap value^b (%)
SNP 3937    APO_E       > 10^6             97
SNP 4075    APO_E       191,513            98
TRIG        APO_E       > 10^6             99
CHOL        APO_E       > 10^6             100

^a From Table 2A.
^b The percentage of times the edge was scored as present out of 500 bootstrap samples.

Error(M) = KL(P(X | M_true, θ), P(X | M, θ̂))
         = Σ_X P(X | M_true, θ) log [ P(X | M_true, θ) / P(X | M, θ̂) ],

where θ̂ is an estimate (configuration) of the parameter vector, and {X_1, ..., X_n} is the set of random variables represented by the BN. KL-divergence is a standard measure of true error for distribution learning (Van Allen and Greiner, 2000). Consider now the predictive error of a hypothesized model M over a dataset of interest, Error_training(M). Assuming that the dataset is sampled from the "true" or entire distribution of data, consider also Error_true(M) over the entire dataset. Error_true(M) is also known as the unbiased or generalization error. Then, model M overfits the training dataset if there is an alternative (typically simpler) model m such that

Error_training(M) < Error_training(m)   and   Error_true(M) > Error_true(m).
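For discrete distributions, the KL-divergence above can be computed directly; a minimal sketch, with the two distributions given as dictionaries over the same outcomes:

```python
import math

def kl_divergence(p_true, p_model):
    """KL(p_true || p_model) for discrete distributions over the same
    outcomes; terms with p_true(x) = 0 contribute nothing."""
    return sum(p * math.log(p / p_model[x])
               for x, p in p_true.items() if p > 0)
```

Note that KL-divergence is zero exactly when the two distributions coincide, and is asymmetric: KL(p || q) generally differs from KL(q || p).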


In other words, the model overfits when it learns, or "soaks in," spurious features (e.g., dependencies) of the training dataset that are mostly random noise. Because these features are unlikely to be present in other datasets generated from the same underlying "true" distribution, such models replicate poorly. Of course, it is also possible for a model to "underfit," when the model is too simple and omits some of the genuine dataset features that are not random noise. Ideally, one seeks a model of appropriate complexity that neither overfits nor underfits the data. Because we do not know the "true" distribution, we have to use dataset partitioning (into training and testing/validation subsets), cross-validation, or bootstrapping to estimate the true model fit. Alternatively, we can use model scoring criteria that penalize model complexity explicitly. Recall that the first component in posterior probability model scoring criteria is log P(D | T), or the log marginal likelihood. Under certain assumptions it can be computed efficiently and in closed form. Unfortunately, one of these assumptions is the completeness of the dataset. For incomplete data, an efficient approximation must be developed. One such approximation is the Bayesian Information Criterion, or BIC (Schwarz, 1978):

log P(D | T) ≈ log P(D | T, θ̂) − (d/2) log N,

where θ̂ is a maximum likelihood estimate (configuration) of the parameter vector, N is the sample size, and d is the dimension of θ, or the number of free parameters. Under the multinomial model, BIC can be computed very efficiently, even for very large samples. Note that BIC is inversely proportional to the Minimum Description Length (MDL) criterion (Rissanen, 1987). Generally, although different model selection criteria often come from different research disciplines and are motivated by different rationales, there is a substantial amount of agreement between them. BIC's interpretation is straightforward.
The first term measures how well the model fits the data. The second term penalizes (decreases) the model score for the model's complexity. The Akaike Information Criterion, AIC (Akaike, 1970), although derived from an information-theoretic rather than Bayesian perspective, is very similar:

log P(D | T) ≈ log P(D | T, θ̂) − d log e,

where e is the base of the natural logarithm and log e performs the conversion from the bits/entropy scale. Note that the complexity penalty (second term) is smaller than that of BIC. In order to investigate how overfitting influences BN reconstruction, we applied BIC and AIC to the Jackson and Rochester datasets. In our experience, BIC tends to "underfit" the data, while AIC tends to overfit the data, because AIC's penalty for model complexity is considerably smaller than BIC's. With both AIC and BIC, however, the extent of overfitting can be "artificially" controlled during the BN structure search by adjusting certain other algorithmic parameters. Figure 3 shows a series of BNs with progressively increasing overfitting reconstructed from the Jackson sample. At the highest level of overfitting, another SNP, SNP 4951, appears in the apoE level node Markov blanket. While this dependency might be spurious, it is beneficial to have an option of higher sensitivity

Fig. 3. BNs learned from the Jackson dataset using different model scoring criteria. All designations are as in Fig. 1, except all SNP identifiers are preceded by “X”. Line thickness corresponds to the relative edge strength. The following model scoring criteria were used: (a) BIC, (b) AIC with slight overfitting, (c) AIC with moderate overfitting, and (d) AIC with severe overfitting.


Fig. 4. BNs learned from the Rochester dataset using different model scoring criteria. All designations are as in Fig. 3.


built into the analysis methods. Figure 4 shows a similar series for the Rochester data. Again, as overfitting increases, a number of new SNPs appear in the apoE level Markov blanket. The choice of the extent of overfitting ultimately belongs to the biological practitioner. However, two practical guidelines suggest themselves:

1. Perform simulation studies by introducing a dependency of known strength (a synthetic positive control) into the BN, and ascertaining at what level of overfitting the dependency is recognized by the BN reconstruction algorithm. If the positive control dependency represents, for example, a known SNP → phenotype effect with a known allelic effect size, then the level of overfitting at which this dependency becomes "visible" to the BN reconstruction algorithm is appropriate for further predictive hypothesis generation.

2. If BN data analysis is being carried out simultaneously with other analysis methods, adjust the level of overfitting so that the number of predictive variables (SNPs) generated is roughly the same across the whole palette of analysis methods. For example, if logistic regression suggests five statistically significant predictors, the algorithmic parameters should be adjusted so that approximately five predictive variables appear in the target variable's Markov blanket. The predictive variable rankings generated by the different analysis methods can then be compared directly.

In addition to BDe, BIC/MDL and AIC, there are other model scoring criteria, notably cross-validation and bootstrapping. These criteria do not use the second, penalizing component, but simply estimate the first, model fit component via cross-validation or bootstrapping of the dataset. In principle, for each specific dataset we would like to use a criterion that leads to neither over- nor under-fitting. Unfortunately, it is difficult to state a priori which criterion would be the most appropriate. Clearly, further research is needed in this area.
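For the multinomial model, both BIC and AIC reduce to the maximized log-likelihood minus a complexity penalty. A toy single-variable sketch on the natural-log scale (where AIC's penalty is simply d), illustrative only and not the full network score:

```python
import math
from collections import Counter

def multinomial_bic_aic(sample):
    """Log-likelihood, BIC and AIC for a saturated multinomial model
    fit by maximum likelihood, on the natural-log scale."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts) - 1            # free parameters of the multinomial
    loglik = sum(c * math.log(c / n) for c in counts.values())
    bic = loglik - 0.5 * d * math.log(n)
    aic = loglik - d               # smaller penalty than BIC once N > e^2
    return loglik, bic, aic
```

Because BIC's penalty grows with log N while AIC's does not, BIC scores lie below AIC scores for any realistically sized sample, which is exactly the conservative (underfitting) tendency noted above.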
In summary, based on existing simulation studies (Myllymaki et al., 2002; Van Allen and Greiner, 2000) we would recommend using BDe scoring metric, if possible. If not (e.g., due to the incomplete data, or very large sample size), cross-validation is similarly an objective criterion, although it can be computationally expensive. Also, for a large-scale incomplete-data approximation BIC/MDL is a good conservative choice. Dependencies generated under BIC/MDL are likely to be robust. If more sensitivity is desired, either bootstrapping or AIC would be appropriate (AIC being, of course, more computationally efficient than bootstrapping). The reader is referred to de Campos and Ji (2011) for a more theoretical overview.

3.3. Scalability, computational efficiency, shielding, prior expert knowledge, and visualization

Intelligent heuristic search algorithms and the general sparseness of biological networks bode well for BN reconstruction scalability. However, there are inherent limits to the number of potential predictor variables that can be accommodated in BN modeling. These limits are (1) the direct access computer memory (RAM) required for data handling during BN reconstruction, (2) the CPU time required to traverse the model space during model selection, and (3) the ability of


the researchers to visualize and interpret the reconstructed BN. In general, BN learning is NP-hard, and while local search and more sophisticated optimization algorithms make BN reconstruction practical even for large datasets, arriving at a globally optimal BN structure is not guaranteed. Recent attempts to adapt the BN reconstruction methodology to the genome-wide scale (Jiang et al., 2010a,b) tend to limit the expressiveness of the BN model, typically to that of a Naïve Bayesian classifier (Friedman et al., 1997). We are presently working on a number of algorithmic improvements intended to address scalability and search completeness issues. These include (1) using streamlined "single-type" data storage formats, (2) using a novel recursive local search algorithm with highly sparse BN structure priors and post-reconstruction optimization, (3) parallelization, and (4) "zoom-in" visualization of the reconstructed BN that allows the human expert to concentrate on the Markov blanket(s) of particular variable(s). Our current software implementation of the novel local search algorithm can handle up to 1,000,000 variables (SNPs) on an eight-core workstation with 16 GB RAM, taking less than two days for the full analysis of a 500,000 SNP / 2000 individual dataset (unpublished). The majority of existing BN reconstruction algorithms start with a predefined BN structure prior (for example, a tree-like structure in Myllymaki et al. (2002)) which, coupled with the non-exhaustiveness of the heuristic model selection search process, can introduce significant biases into the resulting BNs. In our algorithm, the prior is 100% "sparse" (i.e., all nodes are present, no edges). The first step is to evaluate all possible "parent-offspring" pairs using CPF/CPTs and the BDe scoring metric. The "best" pair is selected as a kernel for the growing BN structure, with the directionality of the edges maintained, and the procedure is repeated recursively until full convergence is achieved.
This heuristic is based on a prototypical “forward selection” algorithm (and K2 algorithm for BN reconstruction, Cooper and Herskovits, 1992), but with a number of important distinctions. First, because the directionality of the “parent-offspring” relationships is maintained, only two types of operators are allowed in each iteration—addition or removal (but not reversal) of the edge. This drastically restricts the model search space (and, therefore, time to convergence). Second, because the initial states (and the earlier BN reconstruction states, in general) are essentially very “sparse” structures, the algorithm is straightforward to parallelize. Third, the resulting BN topology can be used as a structure prior for another, additional, optimization algorithm. In our current implementation, a series of random local rearrangements (edge additions/removals) is carried out, and the highest-scoring BN is selected as the “final” output. Obviously, this step is also straightforward to parallelize. Such local rearrangement algorithms have been proven to be useful in a conceptually similar domain of phylogenetic analysis (see, for example, Rzhetsky and Nei, 1993) and our preliminary simulation results support their efficacy. In addition to this “post-reconstruction” step, we also take advantage of the “pre-reconstruction” step, utilizing the homogeneity of the SNP variables (for the genotyping data). This approach is similar to the “sparse candidate” algorithm (Friedman et al., 1999b); however, only the SNP variables are preprocessed. Specifically, all possible pairwise combinations of SNPs are “pre-tested” for statistical significance, and all edges between the pairs of SNPs showing no significant correlation are explicitly “forbidden” (during the subsequent BN model selection process). This drastically shrinks the model search space.
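The add/remove search described above can be sketched generically. This is a simplified illustration, not the authors' implementation: `score` stands in for the BDe (or BIC/AIC) structure score, and `forbidden` encodes the pairwise pre-test pruning of SNP-SNP edges:

```python
def has_cycle(edges):
    """Depth-first cycle check on a set of directed edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    state = {}  # 1 = on the current path, 2 = fully explored

    def visit(v):
        if state.get(v) == 1:
            return True
        if state.get(v) == 2:
            return False
        state[v] = 1
        if any(visit(w) for w in graph.get(v, [])):
            return True
        state[v] = 2
        return False

    return any(visit(v) for v in list(graph))

def greedy_search(variables, score, forbidden=frozenset()):
    """Greedy structure search from the empty ('100% sparse') graph.

    Only edge addition and removal (no reversal) are allowed; `score`
    maps a set of directed edges to a number (higher is better);
    `forbidden` edges are never added to the network.
    """
    edges, current = set(), score(set())
    improved = True
    while improved:
        improved = False
        for e in [(a, b) for a in variables for b in variables
                  if a != b and (a, b) not in forbidden]:
            trial = edges ^ {e}  # toggle: add if absent, remove if present
            if has_cycle(trial):
                continue
            s = score(trial)
            if s > current:
                edges, current, improved = trial, s, True
    return edges, current
```

Restricting the operator set to addition/removal is what shrinks the search space relative to full hill-climbing with edge reversal, at the cost of committing early to the discovered edge directions.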


These improvements lead to the rapid convergence of the BN reconstruction algorithm for even the largest currently available datasets. However, neither convergence as such nor the unbiased structure priors can guarantee achieving the globally optimal BN. In order to partly alleviate the problem of getting stuck in the local extrema during the BN reconstruction process, we also intend to implement a variation on a “beam” search (see, for example, Rodin and Li, 2000) algorithm, in which more than one intermediate BN structure is kept in memory in each reconstruction step. Automated knowledge discovery in BNs is not limited to dependencies and Markov blankets. BNs are also capable of discovering hidden heterogeneity in the dataset. Figure 5 shows a series (BIC and AIC) of BNs reconstructed from the pooled (both Jackson and Rochester) sample. While both SNP 3937 and 4075 nodes continue to figure prominently in the apoE level node Markov blanket, the most striking feature of these BNs, compared to the networks depicted in Figs. 3 and 4, is a large number of nodes directly connected to the Race node. We call this pattern a “wheel hub and spoke” effect. It is also known as shielding, as in one node (Race) “shielding” many other nodes from each other. Shielding is generally an undesirable phenomenon, because it decreases sparseness and, therefore, negatively affects the efficiency of the BN reconstruction algorithms. However, when shielding is observed in a reconstructed BN, it can be interpreted as an indicator of hidden heterogeneity. If so, separate analyses of the sub-samples should be carried out. By doing the analysis separately for African-Americans and non-Hispanic whites (Figs. 3 and 4, respectively) we were able to show that while SNP 3937 influenced apoE levels in whites, it did not in African-Americans. This could not have been inferred from the pooled sample. 
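The Markov blanket queried throughout this section (a node's parents, children, and the children's other parents) is straightforward to extract from a reconstructed edge list; a minimal sketch:

```python
def markov_blanket(edges, target):
    """Markov blanket of `target` in a directed edge set: its parents,
    children, and co-parents (other parents of its children)."""
    parents = {a for a, b in edges if b == target}
    children = {b for a, b in edges if a == target}
    coparents = {a for a, b in edges
                 if b in children and a != target}
    return parents | children | coparents
```

Given the full network, conditioning on this set renders the target node independent of all remaining variables, which is why the blanket is the natural unit for both prediction and "zoom-in" visualization.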
In general, obvious shielding by a strata-related node (such as Race or Gender) suggests that the sample should be stratified accordingly. (Note that Gender does not show any shielding in this example.) In addition to inferring a purely data-driven BN, it is also possible to incorporate prior biological (expert) knowledge. Incorporating "forbidden" edges or other features (e.g., forcing an edge direction) into the BN reconstruction process is useful not only for scalability and computational efficiency purposes; it is also a natural way of comparing knowledge-derived (ontology) BNs with data-driven ones. As an illustration, we have imposed the following set of prior restrictions on BN reconstruction from genotype–phenotype data (such as the APOE SNPs and plasma apoE levels dataset). These restrictions are intuitive and fairly self-explanatory:

Forbidden edges
SNP — Gender (Note: for autosomal data only)
SNP — Age

Forced directions (if edge is present)

Age → Lipids, apolipoproteins, and blood pressure
Gender → Lipids, apolipoproteins, and blood pressure
SNP → Lipids, apolipoproteins, and blood pressure
Weight → Lipids, apolipoproteins, and blood pressure
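Such restrictions can be handed to a structure-search routine as explicit edge sets; a sketch with hypothetical variable names (not the actual dataset coding):

```python
# Hypothetical variable groups for illustration only
snps = ["SNP 3937", "SNP 4075"]
outcomes = ["CHOL", "TRIG", "HDL", "APO_E"]  # lipids, apolipoproteins, etc.

# Forbidden edges: SNP-Gender (autosomal data) and SNP-Age, either direction
forbidden = {(s, v) for s in snps for v in ("GENDER", "AGE")} | \
            {(v, s) for s in snps for v in ("GENDER", "AGE")}

# Forced directions: if such an edge is present, only one orientation is
# allowed, so the reverse orientation is forbidden outright
for src in ["AGE", "GENDER", "WEIGHT"] + snps:
    forbidden |= {(outcome, src) for outcome in outcomes}
```

A set like `forbidden` is exactly the form consumed by a search routine that skips excluded edges during candidate generation.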

As a pilot experiment, we have applied these restrictions to the APOE datasets described in Section 3.1. Figure 6 shows the results for the Rochester dataset


Fig. 5. BNs learned from the pooled (Jackson and Rochester) dataset using different model scoring criteria. All designations are as in Fig. 3.

Fig. 6. BNs learned from the Rochester dataset using different model scoring criteria, with prior expert knowledge (restrictions) incorporated (see text). All designations are as in Fig. 3.


Table 3
BN scores for the purely data-driven BNs, and BNs reconstructed from the same datasets but with prior expert knowledge incorporated (in parentheses)

Criterion^a                Jackson              Rochester            Both
BIC                        −10,712 (−10,750)    −12,341 (−12,383)    −23,752 (−24,030)
AIC, slight overfitting    −10,082 (−10,166)    −11,397 (−11,440)    −22,086 (−22,359)
AIC, moderate              −9,991 (−10,045)     −11,337 (−11,450)    −21,973 (−22,312)
AIC, high                  −9,851 (−9,889)      −11,322 (−11,406)    −21,898 (−22,255)

^a Scores obtained via different model scoring criteria should not be directly compared to each other.

(similar results for the Jackson and pooled datasets are not shown). These BNs should be compared with the BNs in Fig. 4. In order to perform a formal comparison, however, one needs to score both the purely data-driven BN (such as the ones in Fig. 4) and the BN reconstructed from the same dataset but with the restrictions applied. There are three potential outcomes:

1. The BN's score (fit) increases significantly. This means that the heuristic search algorithm used for model selection is not very robust.

2. The BN's score decreases significantly. This means that the human expert and the dataset strongly disagree, and that a thorough further investigation is indicated.

3. The BN scores do not differ significantly. This means that the BN with prior expert knowledge belongs to the same class of highly scoring BNs as the purely data-driven one, and either is acceptable. The BN with prior expert knowledge might, however, be preferable purely for aesthetic reasons.

In our experiment, the latter proved to be the case (Table 3). It should be noted, however, that it is somewhat unclear what exactly constitutes a "significant" difference for the various model scoring metrics. Clearly, additional research in this area is indicated. Finally, we will address the BN visualization issue. The concept of the Markov blanket can also be used to improve the visualization components of BN reconstruction software, particularly in the case of large-scale datasets. Currently we are working on an implementation of a BN output module in the DOT markup language that is subsequently processed by the Graphviz (graph visualization) rendering engine (http://www.graphviz.org/) to generate high-quality BN figures. A "Markov blanket" visualization routine (supporting a range of zoom/scale options) can be added to the DOT script generator so that a series of high-resolution visualizations of sub-networks will be automatically derived from the reconstructed BN.
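Such a DOT export amounts to filtering the edge list down to a Markov-blanket sub-network and emitting the corresponding DOT script; a minimal sketch with hypothetical node names, not the production module:

```python
def markov_blanket_dot(edges, target):
    """Render the Markov-blanket sub-network of `target` as a DOT script,
    suitable for processing by the Graphviz rendering engine."""
    parents = {a for a, b in edges if b == target}
    children = {b for a, b in edges if a == target}
    coparents = {a for a, b in edges if b in children and a != target}
    keep = parents | children | coparents | {target}
    lines = ["digraph BN {"]
    lines += [f'  "{a}" -> "{b}";' for a, b in sorted(edges)
              if a in keep and b in keep]
    lines.append("}")
    return "\n".join(lines)
```

Piping the returned string through the `dot` command-line tool (e.g., `dot -Tpdf`) then produces the zoomed-in figure.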
For example, one obvious option will be selection of the sub-networks covering the “boundary” between the SNPs and other variables. An example of the optimized BN reconstruction output (using novel local search algorithm, sparse priors, and Graphviz engine) is shown in Figs. 7 and 8 (same data as in Figs. 1 and 2, respectively). Figures 7a and 8a depict purely data-driven networks, whereas Figs. 7b and 8b show the BNs with the prior expert knowledge restrictions (see the restriction list above) imposed. Note that there is one important

Fig. 7. BN learned (optimized local search algorithm, sparse priors, Graphviz engine) from the Jackson, MS dataset. All designations are as in Fig. 1. Line thickness corresponds to the relative edge strength. (a) A purely data-driven BN. (b) A BN with the prior expert knowledge (restrictions) incorporated (see text).

difference between Figs. 1 and 7: the edge between SNP 3937 and apoE level nodes is absent in Fig. 1 (simple hill-climbing with restarts, tree-like prior) but present in Fig. 7 (optimized local search algorithm, sparse priors). Because SNP 3937 is, in fact, a real signal (see Section 3.1), the latter algorithm clearly outperforms the former in terms of sensitivity. In general, we would like to re-emphasize that careful

Fig. 8. BN learned (optimized local search algorithm, sparse priors, Graphviz engine) from the Rochester, MN dataset. All designations are as in Fig. 1. Line thickness corresponds to the relative edge strength. (a) A purely data-driven BN. (b) A BN with the prior expert knowledge (restrictions) incorporated (see text).

choice of priors and model selection algorithms is important for maximizing the BN reconstruction accuracy, and that clearly more work needs to be done in order to find the combination of algorithmic options and parameters that achieves the


optimal balance of scalability, computational efficiency, and search exhaustiveness in the context of genetic epidemiology data analysis applications.

4. Software and applications

4.1. Existing implementations

A number of tools, both commercial and free, exist for BN modeling. Two useful lists can be found at http://www.kdnuggets.com/software/bayesian.html and http://www.cs.ubc.ca/~murphyk/Bayes/bnsoft.html. The University of Helsinki B-course, at http://b-course.cs.helsinki.fi, is a web server with a robust GUI that is suitable for "quick" BN reconstructions from smaller datasets. There are also open-source projects and libraries developed specifically for probabilistic and dependency modeling. In particular, PEBL (Python Environment for Bayesian Learning) (Shah and Woolf, 2009) and bnlearn (Scutari, 2010) are two rich resources covering basic BNs and many useful options and extensions. Software solutions developed explicitly for BN modeling in the context of biological data tend to be limited, but are a topic of ongoing development (e.g., Nikolajewa et al., 2007). In this study we have predominantly used R and Python code and libraries. In addition, three environments were used for software development, data processing and simulation experiment management: GeNIe (a development environment for graphical decision-theoretic models, at http://genie.sis.pitt.edu/), R (The R Project for Statistical Computing, at http://www.r-project.org/) and Weka (Data Mining Software in Java, at http://www.cs.waikato.ac.nz/ml/weka/). All of our code and code extensions are freely available upon request. A less scalable autonomous Python version of our software, named BNOmics (Bayesian Network reconstruction for the heterogeneous "-Omics" data), is also freely available under an open source-style license at https://bitbucket.org/uthsph/bnomics/overview.

4.2. Existing applications

To the best of our knowledge, the first BN modeling application to modern genetic data was by Friedman, Pe'er and co-workers (Friedman et al., 2000), who applied a sparse candidate BN reconstruction algorithm, augmented with bootstrap validation, to a large gene expression dataset. Subsequent work involved modeling a finer dependency structure in biological networks using perturbed expression profiles (Pe'er, 2005), and reconstructing protein-signaling networks from multiple proteomic measurements in human immune system cells (Sachs et al., 2005). The latter study is especially significant because it included experimental intervention (by perturbing the cells experimentally), thus allowing one to infer causality, not just statistical dependencies. Recall that it is not recommended to interpret a dependency as causation unless certain assumptions are met. However, one can introduce intervention experiments into the domain of interest to infer causal relationships explicitly. Consider, for example, a dependency between two variables, A and B. If we manipulate the value of A in an experimental intervention, and the value of B does not change, then we can conclude that A does not cause B. If, subsequently, we manipulate the value of B,


and the value of A does change, then we conclude that B causes A, and we can impose a causative directionality on the edge between the two, B → A. This is precisely what has been done in the aforementioned study (Sachs et al., 2005). In human genetics, however, the ability to do intervention experiments is limited and, when possible, intervention studies are expensive. However, using interventions to infer causality is certainly promising at the cellular and proteomic levels (see Fröhlich et al. (2009) for a recent perspective). Other, less explicit, approaches to inferring causality exist (e.g., Millstein et al., 2009). One way to deal with the causality flow is via Dynamic BN (DBN) modeling. In first-order Markov DBNs, at a given time point a node is influenced only by itself and its parents in the immediately preceding time point. Thus, a dynamic sequence of BNs is generated. The issues of equivalence classes and directionality ambiguity are resolved by using temporal information from this sequence. While both intervention experiments and DBNs lie outside the scope of this review (largely because longitudinal and/or interventional genetic epidemiology datasets are still somewhat of a rarity), we would like to acknowledge their significance in the general scheme of modern genetic data analysis. Hartemink et al. (2001) developed a formal framework for reverse-engineering genetic regulatory networks using BN modeling, and applied it to gene expression data combined with the available genomic location data for the genes being expressed (Hartemink et al., 2002). The latter study highlights an attractive, and important, feature of BNs: heterogeneous variables can be accommodated by a single network. This feature makes it possible to combine signals from several datasets (and types of data) into one model. This work was followed by formalizing a DBN inference algorithm for genetic data and testing it on simulated data (Yu et al., 2004).
Recent work in DBNs includes Paluszewski and Hamelryck (2010), Chen et al. (2010), Zhu et al. (2010), Grzegorczyk and Husmeier (2011), and Li et al. (2011) and encompasses transcriptional, regulatory, sequence, and proteomic data. It is possible that simple probabilistic models, such as multinomial or linear Gaussian, do not adequately represent certain biological relationships. Imoto et al. proposed a more complex and flexible model that employs nonparametric regression with Gaussian noise to capture both linear and non-linear relationships between variables (Imoto et al., 2002). It has been successfully applied to the Saccharomyces cerevisiae cell cycle gene expression data, and to a combination of gene expression data and DNA sequence (promoter) information (Tamada et al., 2003). Other notable applications of BN methodology to genetic data include using DBNs to model cell signaling pathways (Sachs et al., 2002), combining heterogeneous predictive information in a Bayesian framework to identify genes in Drosophila DNA sequence (Pavlović et al., 2002), using BN-based classifiers to predict protein superfamilies (Raval et al., 2002), operons in Escherichia coli (Bockhorst et al., 2003) and protein–protein interactions (Jansen et al., 2003), incorporating stochasticity into DBNs (Perrin et al., 2003), reverse-engineering BNs from metabolic data (Li and Chan, 2004), using DBNs to reconstruct genetic regulatory networks with hidden nodes (Beal et al., 2005), optimizing BN reconstruction by using different types of network topology priors and by incorporating prior (expert) knowledge (Steele et al., 2009; Keilwagen et al., 2010), applying BN modeling to


large-scale sequencing, genotyping and microarray datasets (Chu et al., 2009; Wang et al., 2009; Bauer et al., 2010; Jiang et al., 2010a,b, 2011b), explicitly capturing genetic epistasis (Jiang et al., 2010a, 2011a; Han and Chen, 2011), integrating heterogeneous data types within the same network (Lee and Lee, 2005; Chan and McGeachie, 2011) and improving on the basic BN reconstruction framework (see, for example, Zou et al. (2010) and Watanabe et al. (2012) for alternative takes on network modeling). A recent overview of BN applications in the domain of systems, or pathway, biology (with special emphasis on pharmacogenetics and pharmacogenomics) can be found in Rodin et al. (2011). Alternative modeling techniques (such as ordinary differential equations, structural equations and Granger causality) and merging purely data-driven networks with prior biological knowledge (networks generated from ontology databases) are also briefly discussed there. The reader is referred to the literature list therein in addition to the above references. It is obvious that BN modeling is an exciting, dynamic and lively research area, particularly in its application to reverse-engineering genetic networks from sequencing and gene expression data. We believe that the time has come for a much wider deployment of BN methodology in genetic epidemiology research (see Rao (2008) for a balanced overview, and Kang et al. (2011), Schlosberg et al. (2011), and Namkung et al. (2011) for recent applications of BN modeling to the Genetic Analysis Workshop 17 (GAW17) data), especially for data analysis and predictive and descriptive model building in large-scale association studies.

4.3. Data mining in genetic epidemiology and association studies

We are committed to a strategy of multi-level, hierarchical analysis of large-scale association datasets. The three major levels, or steps, in our multi-level algorithmic strategy are (1) variable selection, or feature set reduction, (2) descriptive and predictive model building, and (3) model validation (Rodin et al., 2009). Such a strategy is common for a typical data mining process, especially with very large datasets. A model, in this context, contains relevant input variables (SNPs, environmental factors, physiological measurements), dependent, or target, variables (phenotypes, endpoints) and often some formalism of relationships (e.g., dependencies or causations) between the variables. In the first, or variable selection (feature set reduction), step we rank potentially predictive variables (SNPs) and remove (i.e., "prune away") irrelevant SNPs from the dataset to avoid overfitting and multiple testing problems, and to ease the computational burden. Once a relatively compact set of potentially predictive SNPs is identified, we move on to the second, or descriptive/predictive model building, step. In this step, we reverse-engineer the biological relationships between the SNPs and the other variables (e.g., other predictive variables and the primary dependent variable of interest). Finally, in the model validation step, we examine the robustness of the model and its ability to predict trait values or endpoints in a second sample from the same population or a different population. In general, BN modeling is an excellent fit for the second step of this multi-level analysis strategy (of course, BNs can also be used for variable selection and model validation). Importantly, BNs are proposed not to replace but rather to complement
other descriptive and predictive modeling methods, such as structural equations, Boolean networks, ensemble (boosting and bagging) decision-tree-based classifiers, neural networks, and linear and logistic regression (see Vignes et al. (2011) and Pirooznia et al. (2012) for examples of meta-analyses involving BNs and other methods).
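To make the first (variable selection) step of the strategy above concrete, the sketch below ranks SNPs by a simple marginal chi-square association with a discrete phenotype and keeps the top k. This is a deliberately minimal illustration on toy data structures; the function names are ours, and the chapter's actual pipeline uses more sophisticated rankings (e.g., Random Forests importance measures; Rodin et al., 2009).

```python
from collections import Counter
from itertools import product

def chi2_stat(snp, phenotype):
    """Pearson chi-square statistic between two discrete variables."""
    n = len(snp)
    joint = Counter(zip(snp, phenotype))
    s_marg, p_marg = Counter(snp), Counter(phenotype)
    stat = 0.0
    for s, p in product(s_marg, p_marg):
        expected = s_marg[s] * p_marg[p] / n
        stat += (joint.get((s, p), 0) - expected) ** 2 / expected
    return stat

def select_snps(genotypes, phenotype, k):
    """Rank SNP columns by marginal association with the phenotype; keep the top k.

    `genotypes` maps SNP names to equal-length lists of discrete genotype codes.
    """
    ranked = sorted(genotypes,
                    key=lambda name: chi2_stat(genotypes[name], phenotype),
                    reverse=True)
    return ranked[:k]
```

In a real analysis the cutoff k (or a significance threshold) would be chosen to balance the multiple testing burden against the risk of pruning away weakly predictive SNPs that matter only jointly.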

5. Summary and future directions

BN modeling is a valuable tool for studying genotype-to-phenotype relationships in modern human genetics, especially in the context of candidate gene and large-scale association studies and multi-level analysis strategies. While their prime appeal lies in purely data-driven hypothesis-generating analyses, BNs are versatile enough to incorporate prior expert knowledge and to validate the discovered features. BNs have been gaining in popularity in recent years, and we believe that BNs and similar graphical dependency modeling methods will soon become a crucial component of the genetic epidemiology practitioner's toolbox.

We are currently developing numerous extensions to our existing BN framework. One is a hybrid local probability distribution model, in which both continuous (linear Gaussian) and discrete (multinomial) random variables co-exist within the same BN. In such a model, discrete variables can be the parents of continuous variables, but not the other way around.

Another extension has to do with the applicability of BN modeling to different types of genetic data. The genetic epidemiology community is now at the threshold of another major step forward, in which not only the quantity (number of SNPs available for genotyping) but also the "quality" of the available data is rapidly increasing. Importantly, knowledge of the detailed DNA sequence variation, combined with the larger sequencing scale (genomic, rather than just candidate genes), necessitates the development of new (or improved) data analysis methods and software. Application of BNs to full exome and genome (in essence, large-scale sequencing) data presents new technical and conceptual challenges. One is the even higher (compared to candidate gene datasets) within-gene linkage disequilibrium, and thus the denser dependency patterns, inherent to sequence data.
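One way such dense pairwise dependencies could be screened before network reconstruction is sketched below: flag SNP pairs whose correlation exceeds a threshold, so that the corresponding edges can later be forced (or the pair collapsed) during model selection. This is an illustrative sketch only; the threshold and function names are hypothetical, and Cramér's V here merely stands in for whatever pairwise correlation measure a production routine would use.

```python
import math
from collections import Counter
from itertools import combinations, product

def cramers_v(a, b):
    """Cramér's V between two discrete variables (0 = independent, 1 = perfect)."""
    n = len(a)
    joint = Counter(zip(a, b))
    am, bm = Counter(a), Counter(b)
    chi2 = sum((joint.get((x, y), 0) - am[x] * bm[y] / n) ** 2 / (am[x] * bm[y] / n)
               for x, y in product(am, bm))
    k = min(len(am), len(bm)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

def ld_edge_constraints(snps, threshold=0.9):
    """Flag SNP pairs in strong pairwise LD.

    `snps` maps SNP names to equal-length lists of discrete genotype codes.
    The returned pairs can be turned into forced (or forbidden) edges,
    shrinking the model search space before structure learning begins.
    """
    return [(u, v) for u, v in combinations(snps, 2)
            if cramers_v(snps[u], snps[v]) >= threshold]
```

Pre-computing such constraints is quadratic in the number of SNPs but trivially parallelizable, whereas leaving the dense dependencies for the structure search itself to discover multiplies the cost of every search step.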
It is unclear how well baseline BN modeling will handle these high-density data (see Section 2.1 for a discussion of sparse vs. dense networks). We plan to ascertain whether it will be necessary to apply "pre-reconstruction" routines that parse pairwise SNP combinations, testing for high correlation (Section 3.3), so that during the model selection stage we can force (or forbid) the corresponding edges between the SNPs, thus significantly shrinking the model search space.

The second, and perhaps the most important, issue is that of rare variants, inherent to sequence variation data. We are not certain how BN modeling might handle rare variants. In the past, we have seen BN modeling "overreact" to rare events in the dataset by creating spurious edges of artificially high significance. We plan to investigate this issue thoroughly, using both real and simulated data.

Another recent development is the increasing availability of metabolomic data. This is a very different data type compared to SNPs: continuous metabolomic measures with unknown distributions and ambiguous missing values. We
intend to investigate whether existing BN discretization routines are adequate for handling this data type. If necessary, we will implement appropriate variations of the hybrid and/or Gaussian local probability models, which might be a better fit for the metabolomic data than the multinomial model (with discretization) currently used by our BN reconstruction software. Finally, we plan to implement an advanced missing data imputation algorithm based on assessment of the immediate Markov neighborhood of a variable with missing values, and to compare it to standard imputation algorithms (assignment of an extra variable state, majority, and proximity).
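The Markov-neighborhood imputation idea can be sketched as follows. The data structures and function name here are ours for illustration: a production algorithm would read the Markov blanket off the learned network rather than take it as a hand-supplied argument, and would handle continuous variables as well.

```python
from collections import Counter

def impute_markov_neighborhood(records, var, blanket, missing=None):
    """Fill in missing values of `var` using its Markov neighborhood.

    Each record is a dict of variable -> value. A missing value is replaced
    by the majority value of `var` among complete records that agree with
    the target record on every variable in the Markov blanket; when no such
    record exists, fall back to the overall majority value.
    """
    complete = [r for r in records if r[var] is not missing]
    overall = Counter(r[var] for r in complete).most_common(1)[0][0]
    for r in records:
        if r[var] is missing:
            matches = [c[var] for c in complete
                       if all(c[b] == r[b] for b in blanket)]
            r[var] = Counter(matches).most_common(1)[0][0] if matches else overall
    return records
```

Conditioning on the Markov blanket rather than on all variables keeps the matching step feasible in high dimensions, since by construction the blanket renders the variable independent of everything else in the network.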

Acknowledgments

This work is dedicated to the memory of Sergei N. Rodin. We thank D.C. Rao, Stanislav Zakharkin, Devika Subramanian, Anke-Hilse Maitland-van der Zee, Bas J.M. Peters, Andy Clark, and Charlie Sing for many useful discussions, and Tomi Silander for access to the B-Course source code. This work was supported in part by grants from the National Institutes of Health (R03LM009738, P50GM065509, R01HL072810/R01HL072905, RC2HL102419, U01GM074492, and U01HG004402) and by grants from the PhRMA Foundation and the Gillson-Longenbaugh Foundation. This work began as part of the MDCODe project, and the authors would like to acknowledge the laboratory of Dr. Debora Nickerson at the University of Washington for generating the SNP data used in this analysis.

References

Akaike, H., 1970. Statistical predictor identification. Ann. Inst. Statist. Math. 22, 207–217.
Arnone, M.I., Davidson, E.H., 1997. The hardwiring of development: organization and function of genomic regulatory systems. Development 124, 1851–1864.
Bauer, S., Gagneur, J., Robinson, P.N., 2010. Going Bayesian: model-based gene set analysis of genome-scale data. Nucleic Acids Res. 38 (11), 3523–3532.
Beal, M.J., Falciani, F., et al., 2005. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics 21 (3), 349–356.
Bockhorst, J., Craven, M., et al., 2003. A Bayesian network approach to operon prediction. Bioinformatics 19 (10), 1227–1235.
Chan, H.H., McGeachie, M., 2011. Phenotype prediction by integrative network analysis of SNP and gene expression microarrays. Conf. Proc. IEEE Eng. Med. Biol. Soc., 6849–6852.
Chen, X., Hoffman, M.M., et al., 2010. A dynamic Bayesian network for identifying protein-binding footprints from single molecule-based sequencing data. Bioinformatics 26 (12), i334–i342.
Chickering, D.M., 1995. Learning Bayesian networks is NP-complete. In: Fisher, H. (Ed.), Learning from Data: Artificial Intelligence and Statistics. Springer-Verlag, Heidelberg, pp. 121–130.
Chickering, D.M., 1996. Learning equivalence classes of Bayesian network structures. In: Horvitz, F.V. (Ed.), Twelfth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, pp. 150–157.
Chu, J.H., Weiss, S.T., et al., 2009. A graphical model approach for inferring large-scale networks integrating gene expression and genetic polymorphism. BMC Syst. Biol. 3, 55.
Cooper, G., Herskovits, E., 1992. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9, 309–347.
de Campos, C.P., Ji, Q., 2011. Efficient structure learning of Bayesian networks using constraints. J. Mach. Learn. Res. 12, 663–689.
Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Friedman, N., Geiger, D., Goldszmidt, M., 1997. Bayesian network classifiers. Mach. Learn. 29 (2–3), 131–163.
Friedman, N., Goldszmidt, M., et al., 1999a. Data analysis with Bayesian networks: a bootstrap approach. In: Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI).
Friedman, N., Nachman, I., et al., 1999b. Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm. In: Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI).
Friedman, N., Linial, M., et al., 2000. Using Bayesian networks to analyze expression data. J. Comput. Biol. 7 (3–4), 601–620.
Fröhlich, H., Sahin, O., et al., 2009. Deterministic effects propagation networks for reconstructing protein signaling networks from multiple interventions. BMC Bioinform. 10, 322.
Geiger, D., Heckerman, D.A., 1994. A characterization of the Dirichlet distribution through global and local independence. Technical Report MSR-TR-94-16, Microsoft Research.
Grzegorczyk, M., Husmeier, D., 2011. Improvements in the reconstruction of time-varying gene regulatory networks: dynamic programming and regularization by information sharing among genes. Bioinformatics 27 (5), 693–699.
Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182.
Han, B., Chen, X.W., 2011. bNEAT: a Bayesian network method for detecting epistatic interactions in genome-wide association studies. BMC Genomics 12 (Suppl. 2), S9.
Hartemink, A.J., Gifford, D.K., et al., 2001. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. In: Pacific Symposium on Biocomputing.
Hartemink, A.J., Gifford, D.K., et al., 2002. Combining location and expression data for principled discovery of genetic regulatory network models. In: Pacific Symposium on Biocomputing.
Heckerman, D.A., 1995. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research.
Heckerman, D., Geiger, D., Chickering, D., 1995. Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20, 197–243.
Heckerman, D., Meek, C., et al., 1999. A Bayesian approach to causal discovery. In: Computation, Causation, and Discovery, pp. 141–165.
Imoto, S., Goto, T., et al., 2002. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. In: Pacific Symposium on Biocomputing.
Jansen, R., Yu, H., et al., 2003. A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science 302, 449–453.
Jiang, X., Barmada, M.M., Visweswaran, S., 2010a. Identifying genetic interactions in genome-wide data using Bayesian networks. Genet. Epidemiol. 34, 575–581.
Jiang, X., Neapolitan, R.E., et al., 2010b. A fast algorithm for learning epistatic genomic relationships. In: AMIA Annual Symposium Proceedings, pp. 341–345.
Jiang, X., Neapolitan, R.E., et al., 2011a. Learning genetic epistasis using Bayesian network scoring criteria. BMC Bioinform. 12, 89.
Jiang, X., Barmada, M.M., et al., 2011b. A Bayesian method for evaluating and discovering disease loci associations. PLoS One 6 (8), e22075.
Kang, J., Zheng, W., et al., 2011. Use of Bayesian networks to dissect the complexity of genetic disease: application to the Genetic Analysis Workshop 17 simulated data. BMC Proc. 5 (Suppl. 9), S37.
Keilwagen, J., Grau, J., et al., 2010. Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis. BMC Bioinform. 11, 149.
Kerr, M.K., Churchill, G.A., 2001. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl. Acad. Sci. USA 98 (16), 8961–8965.
Kononenko, I., 1990. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In: Current Trends in Knowledge Acquisition, pp. 190–197.
Krause, P.J., 1998. Learning probabilistic networks. Knowl. Eng. Rev. 13 (4), 321–351.
Kullback, S., Leibler, R.A., 1951. On information and sufficiency. Ann. Math. Statist. 22, 79–86.
Lai, E., Riley, J., Purvis, I., Roses, A., 1998. A 4-Mb high-density single nucleotide polymorphism-based map around human APOE. Genomics 54 (1), 31–38.
Lee, P.H., Lee, D., 2005. Modularized learning of genetic interaction networks from biological annotations and mRNA expression data. Bioinformatics 21 (11), 2739–2747.
Li, Z., Chan, C., 2004. Inferring pathways and networks with a Bayesian framework. FASEB J. 18, 746–748.
Li, H., Wang, N., et al., 2011. Learning the structure of gene regulatory networks from time series gene expression data. BMC Genomics 12 (Suppl. 5), S13.
Mahley, R.W., Rall Jr., S.C., 2000. Apolipoprotein E: far more than a lipid transport protein. Annu. Rev. Genomics Hum. Genet. 1, 507–537.
Markowetz, F., Spang, R., 2007. Inferring cellular networks – a review. BMC Bioinform. 8 (Suppl. 6), S5.
Miklos, G.L., Rubin, G.M., 1996. The role of the genome project in determining gene function: insights from model organisms. Cell 86, 521–529.
Millstein, J., Zhang, B., et al., 2009. Disentangling molecular relationships with a causal inference test. BMC Genet. 10, 23.
Myllymaki, P., Silander, T., et al., 2002. B-Course: a web-based tool for Bayesian and causal data analysis. Int. J. Artif. Intell. Tools 3, 369–387.
Namkung, J., Raska, P., et al., 2011. Analysis of exome sequences with and without incorporating prior biological knowledge. Genet. Epidemiol. 35 (Suppl. 1), S48–S55.
Needham, C.J., Bradford, J.R., et al., 2007. A primer on learning in Bayesian networks for computational biology. PLoS Comput. Biol. 3 (8), e129.
Nickerson, D.A., Taylor, S.L., et al., 2000. Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. Genome Res. 10, 1532–1545.
Nikolajewa, S., Pudimat, R., et al., 2007. BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data. Nucleic Acids Res. 35 (Web Server issue), W688–W693.
Paluszewski, M., Hamelryck, T., 2010. Mocapy++ – a toolkit for inference and learning in dynamic Bayesian networks. BMC Bioinform. 11, 126.
Pavlović, V., Garg, A., et al., 2002. A Bayesian framework for combining gene predictions. Bioinformatics 18 (1), 19–27.
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.
Pearl, J., 2000. Causality: Models, Reasoning, and Inference. Cambridge University Press.
Pe'er, D., 2005. Bayesian network analysis of signaling networks: a primer. Sci. STKE 281, pl4.
Perrin, B.-E., Ralaivola, L., et al., 2003. Gene networks inference using dynamic Bayesian networks. Bioinformatics 19 (Suppl. 2), 138–148.
Pirooznia, M., Seifuddin, F., et al., 2012. Data mining approaches for genome-wide association of mood disorders. Psychiatr. Genet. 22 (2), 55–61.
Rall Jr., S.C., Newhouse, Y.M., et al., 1989. Type III hyperlipoproteinemia associated with apolipoprotein E phenotype E3/3: structure and genetics of an apolipoprotein E3 variant. J. Clin. Invest. 83, 1095–1101.
Rao, D.C., 2008. An overview of the genetic dissection of complex traits. Adv. Genet. 60, 3–34.
Raval, A., Ghahramani, Z., et al., 2002. A Bayesian network model for protein fold and remote homologue recognition. Bioinformatics 18 (6), 788–801.
Rissanen, J., 1987. Minimum description length principle. In: Encyclopedia of Statistical Sciences, vol. 5, pp. 523–527.
Rodin, A., Boerwinkle, E., 2005. Mining genetic epidemiology data with Bayesian networks I: Bayesian networks and example application (plasma apoE levels). Bioinformatics 21 (15), 3273–3278.
Rodin, A.S., Li, W.-H., 2000. A rapid heuristic for finding minimum evolution trees. Mol. Phylogenet. Evol. 16, 173–179.
Rodin, A., Brown, A., et al., 2005. Mining genetic epidemiology data with Bayesian networks II: application to ApoE gene variants and plasma lipid levels. J. Comput. Biol. 12, 1–11.
Rodin, A.S., Litvinenko, A., et al., 2009. Use of a Random Forests classifier for variable selection in large-scale genomic association studies. J. Comput. Biol. 16 (12), 1705–1718.
Rodin, A.S., Gogoshin, G., Boerwinkle, E., 2011. Systems biology data analysis methodology in pharmacogenomics. Pharmacogenomics 12 (9), 1349–1360.
Russell, S., Norvig, P., 2009. Artificial Intelligence: A Modern Approach, third ed. Prentice Hall.
Rzhetsky, A., Nei, M., 1993. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol. Biol. Evol. 10, 1073–1095.
Sachs, K., Gifford, D., et al., 2002. Bayesian network approach to cell signaling pathway modeling. Sci. STKE 148, pe38.
Sachs, K., Perez, O., et al., 2005. Causal protein-signaling networks derived from multiparameter single-cell data. Science 308 (5721), 523–529.
Schlosberg, C.E., Schwantes-An, T.H., et al., 2011. Application of Bayesian network structure learning to identify causal variant SNPs from resequencing data. BMC Proc. 5 (Suppl. 9), S109.
Schwarz, G.E., 1978. Estimating the dimension of a model. Ann. Stat. 6 (2), 461–464.
Scutari, M., 2010. Learning Bayesian networks with the bnlearn R package. J. Stat. Softw. 35 (3).
Shah, A., Woolf, P., 2009. Python environment for Bayesian learning: inferring the structure of Bayesian networks from knowledge and data. J. Mach. Learn. Res. 10, 159–162.
Spirtes, P., Meek, C., 1995. Learning Bayesian networks with discrete variables from data. In: Proceedings of the First International Conference on Knowledge Discovery and Data Mining. Morgan Kaufmann, Montreal, QC, Canada.
Steele, E., Tucker, A., et al., 2009. Literature-based priors for gene regulatory networks. Bioinformatics 25 (14), 1768–1774.
Stengård, J.H., Clark, A.G., et al., 2002. Contributions of 18 additional DNA sequence variations in the gene encoding apolipoprotein E to explaining variation in quantitative measures of lipid metabolism. Am. J. Hum. Genet. 71, 501–517.
Tamada, Y., Kim, S., et al., 2003. Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics 19 (Suppl. 2), 227–236.
Utermann, G., 1987. Apolipoprotein E polymorphism in health and disease. Am. Heart J. 113 (2), 433–440.
Van Allen, T., Greiner, R., 2000. Model selection criteria for learning belief nets: an empirical comparison. In: ICML 2000.
Vignes, M., Vandel, J., et al., 2011. Gene regulatory network reconstruction using Bayesian networks, the Dantzig Selector, the Lasso and their meta-analysis. PLoS One 6 (12), e29165.
Wang, Y., Zhang, X.S., Xia, Y., 2009. Predicting eukaryotic transcriptional cooperativity by Bayesian network integration of genome-wide data. Nucleic Acids Res. 37 (18), 5943–5958.
Watanabe, Y., Seno, S., et al., 2012. An estimation method for inference of gene regulatory network using Bayesian network with uniting of partial problems. BMC Genomics 13 (Suppl. 1), S12.
Wright, S., 1934. The method of path coefficients. Ann. Math. Stat. 5, 161–215.
Yu, J., Smith, V.A., et al., 2004. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20 (18), 3594–3603.
Zharkikh, A., Li, W.H., 1995. Estimation of confidence in phylogeny: the complete-and-partial bootstrap technique. Mol. Phylogenet. Evol. 4, 44–63.
Zhu, J., Chen, Y., et al., 2010. Characterizing dynamic changes in the human blood transcriptional network. PLoS Comput. Biol. 6 (2), e1000671.
Zou, C., Ladroue, C., et al., 2010. Identifying interactions in the time and frequency domains in local and global networks – a Granger causality approach. BMC Bioinform. 11, 337.