A hybrid Bayesian network learning method for constructing gene networks


Computational Biology and Chemistry 31 (2007) 361–372

Mingyi Wang a,b, Zuozhou Chen c, Sylvie Cloutier a,*

a Agriculture and Agri-Food Canada, Cereal Research Centre, Winnipeg, MB R3T 2M9, Canada
b Indiana University School of Informatics, Indianapolis, IN 46202, USA
c Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100080, China

Received 12 June 2007; accepted 13 August 2007

Abstract

A Bayesian network (BN) is a knowledge representation formalism that has proven to be a promising tool for analyzing gene expression data. Several problems still restrict its successful applications. Typical gene expression databases contain measurements for thousands of genes and no more than several hundred samples, but most existing BN learning algorithms do not scale beyond a few hundred variables, and current methods produce poor quality BNs when applied to such high-dimensional datasets. We propose a hybrid constraint-based and scored-searching method that is effective for learning gene networks from DNA microarray data. In the first phase of this method, a novel algorithm is used to generate a skeleton BN based on dependency analysis. The BN structure is then refined by a scored search combined with the knowledge learned in the first phase. Computational tests have shown that the proposed method achieves more accurate results than state-of-the-art methods and that it can be scaled to datasets with several hundred variables.

Crown Copyright © 2007 Published by Elsevier Ltd. All rights reserved.

Keywords: Gene network; Bayesian network; DNA microarray; Hybrid learning

1. Introduction

In the post-genomic era, one of the fundamental challenges in the biological field remains the understanding of the complicated gene regulation networks that control cellular functions. The accumulated genomic and gene expression data have recently made it possible to start unraveling this problem through computational analysis. In the past few years, Bayesian networks (BNs) (Cooper and Herskovits, 1992; Heckerman et al., 1995; Heckerman, 1997; Neapolitan, 2003; Pearl, 1988) have received increasing attention from the computational biology community as models of gene networks (Chen et al., 2006; Friedman et al., 2000; Hartemink et al., 2002; Husmeier, 2003; Imoto et al., 2002; Murphy and Mian, 1999; Ott et al., 2004; Pe'er et al., 2006; Peña et al., 2006; Segal et al., 2005; Wang et al., 2006).



Corresponding author. Tel.: +1 204 983 2340; fax: +1 204 983 4604. E-mail addresses: [email protected] (M. Wang), [email protected] (Z. Chen), [email protected] (S. Cloutier).

Intrinsic features of this model make it suitable for gene expression data analysis: (1) its sound statistical foundation allows BNs to deal with the noise and missing data common to gene expression data and permits incomplete knowledge; (2) causal relationships between genes can be modeled in the form of conditional independence; (3) the probabilistic model makes BNs flexible enough to reflect non-linear relationships between genes; (4) a BN is capable of integrating prior biological knowledge into the system.

However, unlike the conventional datasets used in machine learning fields, typical gene expression microarray datasets include a large number of variables (several thousand genes) and very few (at most a few hundred) samples. As a result, the number of genes far exceeds the number of sample points. This poses a serious challenge to existing BN learning algorithms and renders the estimation of genetic networks an extremely difficult problem. As a consequence, learning BNs from gene expression data is still problematic.

Current BN structure learning methods can be categorized as either constraint-based methods (Cheng et al., 2002; Cooper, 1997; Kalisch and Bühlmann, 2007; Pearl and Verma, 1991; Spirtes et al., 1993) or scored-searching methods (Cooper and Herskovits, 1992; Heckerman et al., 1995).



The constraint-based methods determine all the dependence and independence relationships among variables via conditional independence (CI) tests and construct networks that characterize these relationships. The scored-searching methods attempt to identify the network that maximizes a scoring function indicating how well the network fits the data. These two kinds of methods each have their own advantages and disadvantages. Constraint-based algorithms are computationally efficient, which makes them suitable for high-dimensional datasets. However, they rely on a significance level to decide independencies, and they can be unstable, i.e. an early error in the search can have a cascading effect that causes a different graph to result. A second drawback is that high-order CI tests need large sample sizes; in the context of microarray datasets, the number of samples is rarely sufficient to perform reliable high-order CI tests, which consequently leads to many false positives and relatively low accuracy in predicting gene networks. Finally, the results of constraint-based algorithms may contain some undirected edges. The resulting graph is called an essential graph (or an equivalence class, in BN terminology) and only reflects the CI relations among variables.

The scored-searching methods may produce more accurate results in structure learning than the constraint-based methods, and they can be applied even with limited samples, where CI tests are likely to break down. These methods are also capable of dealing with incomplete records in the database and of identifying some structures that the constraint-based methods cannot (Heckerman et al., 1997; Neapolitan, 2003). However, their most serious drawback is that they are relatively slow, especially when networks become large (above several hundred nodes), because the number of possible structures increases super-exponentially with the number of nodes. The excessive computational cost limits the applicability of current BNs to small networks, because they cannot be reliably scaled up to thousands of variables in a reasonable time. Further discussion of the comparisons and relations between these two approaches can be found in Cheng et al. (2002), Cowell (2001), Heckerman et al. (1997) and Neapolitan (2003).

In order to address the above problems, some hybrid methods (Brown et al., 2004; Spirtes and Meek, 1995; Tsamardinos et al., 2006) have been proposed. The premise of hybrid methods is to apply a constraint-based algorithm to learn a network structure pattern in a relatively short time, and then to use a scored-searching algorithm to refine the final BN structure based on that pattern. These hybrid methods are expected to strengthen BN learning efficiency and increase prediction accuracy.

In this paper, we propose a novel variant of a hybrid constraint-based and scored-searching algorithm to construct gene networks from microarray data. The method includes two main phases. In the first phase, we use a new constraint-based algorithm to identify the skeleton graph (undirected graph) of the BN; the key improvement of our algorithm is the use of heuristic search strategies to restrict the possible CI tests. In the second phase, our method performs scored searching, but we restrict the BN search space according to the skeleton graph returned from the first phase.

Our method was evaluated using four benchmark network datasets with known structures. The results demonstrate that the proposed method can effectively increase the prediction accuracy of network structures. We also applied the method to gene expression data measured during the yeast cell cycle and showed that it can be scaled up to datasets with several hundred variables.

2. Bayesian Networks

First, we review some concepts of probability and graph theory as they relate to BNs. The following definitions and theorem can be found in most books on BNs (Neapolitan, 2003). In the following text, we abbreviate "if and only if" by iff and "with respect to" by w.r.t.

Definition 1. Two variables X and Y are conditionally independent given a set of variables Z w.r.t. a probability distribution P, denoted IndP(X,Y|Z), iff P(X,Y|Z) = P(X|Z)P(Y|Z). We denote conditional dependence, the negation of this relation, by DepP(X,Y|Z). We abbreviate conditionally independent as CI. The order of a CI relation is the size of the conditioning variable set (i.e., |Z|).

Recall that a directed graph G is a pair <V, E>, where V is a set of distinct nodes and E ⊆ V × V is a set of edges. Elements of E are called edges (or arcs), and (X,Y) ∈ E represents an edge from X to Y. Two nodes X and Y are adjacent if there is an edge from X to Y or from Y to X. Suppose we have a set of nodes [X1, X2, ..., Xk], where k ≥ 2, such that (Xi−1, Xi) ∈ E for 2 ≤ i ≤ k. We call the set of edges connecting the k nodes a path from X1 to Xk. If each node in the path occurs only once, this path is called a simple path. In this paper, path always refers to a simple path.

Definition 2. Let V denote a non-empty finite set of random variables. A Bayesian network BN for V is a pair <G, θ>, where G is a Directed Acyclic Graph (DAG) whose nodes correspond to the random variables in V (in a BN, variable and node can thus be used interchangeably), and θ is a set of parameters specifying a conditional probability distribution for each node X given its parents in G, i.e. p(X|PaG(X)).

Assumption 1 (Pearl, 1988). A BN satisfies the Markov condition: every node is independent of any subset of its non-descendant nodes conditioned on its parents.

It is easy to prove that, for a BN <G, θ>, the probability distribution for V factorizes as p(V) = ∏X∈V p(X|PaG(X)). To represent a BN, the structure (i.e., the BN graph G) and the parameters θ have to be encoded.
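As a concrete illustration of the factorization above, the following minimal Python sketch evaluates p(V) for a hypothetical three-node network X → Y → Z with hand-picked parameters (neither the network nor the numbers are taken from the paper).

```python
# Minimal sketch: evaluate p(V) = prod_X p(X | Pa_G(X)) for a toy discrete BN.
# The network X -> Y -> Z and all probability values are illustrative only.

parents = {"X": [], "Y": ["X"], "Z": ["Y"]}

# Conditional probability tables: cpt[node][(parent values)][node value]
cpt = {
    "X": {(): {0: 0.6, 1: 0.4}},
    "Y": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "Z": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},
}

def joint_probability(assignment):
    """Probability of a full assignment under the Markov factorization."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_values][assignment[node]]
    return p

print(joint_probability({"X": 1, "Y": 1, "Z": 0}))  # 0.4 * 0.8 * 0.4 = 0.128
```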


The DAG G, in conjunction with the Markov condition, directly encodes some of the independencies of the probability distribution and entails others. A graphical criterion for entailment is d-separation, which is defined on the basis of blocked paths.

Definition 3 (Pearl, 1988). A path p from node X to node Y in a DAG G is said to be blocked by a set of nodes Z if and only if

1. p contains a chain X → W → Y or a fork X ← W → Y such that the middle node W ∈ Z, or
2. p contains a collider X → W ← Y such that the middle node W ∉ Z and no descendant of W is in Z.

Definition 4. A set Z is said to d-separate nodes X and Y in G iff Z blocks every path from X to Y, denoted d-sepG(X,Y|Z). Two nodes X and Y are d-connected by a set Z if they are not d-separated by Z, denoted d-conG(X,Y|Z).

Pearl and Verma (1991) proved that a pair of nodes d-separated by a variable set in a BN is also conditionally independent in P given that set. The faithfulness condition below asserts that the conditional independencies observed in the distribution of a network are not accidental properties of the distribution but are instead due to the structure of the network.

Assumption 2 (Spirtes et al., 1993). A Bayesian network satisfies the faithfulness condition when, for all X, Y ∈ V and Z ⊆ V, IndP(X,Y|Z) iff d-sepG(X,Y|Z).

Hence, an oracle for conditional independencies in the probability distribution can be regarded as an oracle for testing d-separation relations in G, and vice versa.

Theorem 1 (Pearl, 1988). In a faithful BN on variables V, there is an edge between the pair of nodes X and Y in V iff DepP(X,Y|Z) for all Z ⊆ V − {X,Y}. There is no edge between the pair of nodes X and Y in V iff there exists a Z ⊆ V − {X,Y} satisfying IndP(X,Y|Z).

In a BN structure G, X and Y being adjacent means that X and Y are d-connected; in other words, they cannot be d-separated by any subset Z ⊆ V − {X,Y}. It also means that no Z ⊆ V − {X,Y} will make IndP(X,Y|Z) hold.

One of the major tasks in BNs is structure learning, that is, given a dataset, finding the BN that best reflects the dataset. When learning the graph structure under certain conditions, one can attribute a causal interpretation to the edges of the graph G: an edge X → Y corresponds to a causal effect, meaning that manipulating X will affect the observed distribution of Y. Therefore, BNs can be used for generating causal hypotheses from datasets.

Constraint-based methods, such as IC (Pearl and Verma, 1991), FCI, PC (Spirtes et al., 1993), TPDA (Cheng et al., 2002) and algorithmGPC (Peña et al., 2006), iteratively find all of the CI relationships underlying a dataset. After all d-separation relationships are identified, the algorithms use a rule to allocate directions to the connected edges. This rule can be described as follows: for non-adjacent X and Y, if X–W–Y in G and W ∉ Sep(X,Y), then the edges can be oriented as X → W ← Y, where Sep(X,Y) is the first variable set found that satisfies IndP(X,Y|Sep(X,Y)).

For scored-searching methods, given some assumptions, a scoring metric can be defined to represent how well the network fits the training dataset. For details, see Cooper and Herskovits (1992), Heckerman et al. (1995) and Heckerman (1997). Unfortunately, in practical use it is impossible to enumerate and score all the possible structures, because the number of structures is super-exponential in the number of variables in the domain, making an exhaustive search infeasible. A number of heuristic algorithms are therefore used to search for the optimal DAG, such as greedy hill-climbing and simulated annealing (Chickering, 1996; Cooper and Herskovits, 1992; Heckerman et al., 1995; Heckerman, 1997).
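Before turning to the method, the d-separation criterion of Definitions 3 and 4 can be made concrete with a small sketch that enumerates the simple paths between two nodes in the skeleton and tests each path against the blocking conditions; the unblocked paths are exactly the d-connecting paths that Section 3.1 collects into the sets d-conSG(X,Y|Z). The four-node DAG at the bottom is hypothetical and is used only for illustration.

```python
# Sketch: test d-separation (Definitions 3 and 4) by enumerating the simple
# paths between x and y in the skeleton and checking whether z blocks each one.

def descendants(edges, node):
    """Nodes reachable from `node` by following directed edges (u, v)."""
    out, stack = set(), [node]
    while stack:
        u = stack.pop()
        for (a, b) in edges:
            if a == u and b not in out:
                out.add(b)
                stack.append(b)
    return out

def simple_paths(edges, x, y):
    """All simple paths between x and y, ignoring edge direction."""
    neigh = {}
    for (a, b) in edges:
        neigh.setdefault(a, set()).add(b)
        neigh.setdefault(b, set()).add(a)
    paths, stack = [], [[x]]
    while stack:
        path = stack.pop()
        if path[-1] == y:
            paths.append(path)
            continue
        for nxt in neigh.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def path_is_blocked(edges, path, z):
    """Check the two blocking conditions of Definition 3 for one path."""
    for i in range(1, len(path) - 1):
        prev, w, nxt = path[i - 1], path[i], path[i + 1]
        collider = (prev, w) in edges and (nxt, w) in edges  # prev -> w <- nxt
        if collider:
            if w not in z and not (descendants(edges, w) & z):
                return True   # collider outside z with no descendant in z
        elif w in z:
            return True       # chain/fork node inside z
    return False

def d_connecting_paths(edges, x, y, z):
    """The set d-conS_G(x,y|z): all simple paths left unblocked by z."""
    return [p for p in simple_paths(edges, x, y) if not path_is_blocked(edges, p, z)]

def d_separated(edges, x, y, z):
    """True iff every simple path between x and y is blocked by z."""
    return not d_connecting_paths(edges, x, y, z)

# Hypothetical DAG: W -> Y <- X -> Z (a collider at Y).
edges = {("W", "Y"), ("X", "Y"), ("X", "Z")}
print(d_separated(edges, "W", "X", set()))   # True: the collider Y blocks the path
print(d_separated(edges, "W", "X", {"Y"}))   # False: conditioning on Y opens it
```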

3. Methods

3.1. Heuristic Constraint-Based Searching

In the first phase, we use a constraint-based algorithm to identify the skeleton graph (undirected graph) of the BN. Previous analysis (Spirtes et al., 1993) has shown that, although conventional constraint-based algorithms (such as PC and FCI) do not perform all possible CI tests among the measured variables, in the worst case the number of CI tests performed grows exponentially with the number of variables. Hence, in the worst case these algorithms are not feasible on datasets with a large number of variables. Details regarding the time complexity analysis of the PC algorithm can be found in Neapolitan (2003).

To address this issue, we construct undirected networks using a heuristic constraint-based algorithm. Through heuristics, we can avoid performing an exponential number of CI tests and thus make the process computationally more efficient. Our algorithm is based on the Monotone DAG faithfulness assumption (Cheng et al., 2002), a stronger assumption than the DAG faithfulness assumption (Assumption 2). To describe this assumption, we let d-conSG(X,Y|Z) denote the set of all the d-connection paths between X and Y given a conditioning set Z in G. From Definition 4, every node on a path in d-conSG(X,Y|Z) satisfies one of the two following conditions: (1) if the node is not a collider, then the node is not in Z; (2) if the node is a collider, then the node itself or a descendant of the node is in Z.

Assumption 3. A faithful BN is Monotone DAG faithful iff, in the DAG G = <V, E>, for X,Y ∈ V and Z1, Z2 ⊆ V, if d-conSG(X,Y|Z1) ⊆ d-conSG(X,Y|Z2) then depD(X,Y|Z1) ≤ depD(X,Y|Z2).

In the above assumption, depD(X,Y|Z) is a measure of the strength of the conditional dependence w.r.t. the learning dataset D. Details of its definition will be discussed later.

Fig. 1. A simple DAG used to illustrate the Monotone DAG faithfulness assumption.


Take the example in Fig. 1. We can find that:

d-conSG(W,S|X) = {W → Y → R → S};
d-conSG(W,S|Y) = {W → Y ← X → Z → S, W → Y ← X → Z → R → S};
d-conSG(W,S|Z) = {W → Y → R → S};
d-conSG(W,S|R) = {W → Y → R ← Z → S};
d-conSG(W,S|T) = {W → Y → R → S, W → Y → R ← Z → S};
d-conSG(W,S|X,Y) = ∅;
d-conSG(W,S|X,Z) = {W → Y → R → S};
d-conSG(W,S|X,R) = {W → Y → R ← Z → S};
d-conSG(W,S|X,T) = {W → Y → R → S, W → Y → R ← Z → S};
d-conSG(W,S|Y,Z) = ∅;
d-conSG(W,S|Y,R) = {W → Y ← X → Z → S};
d-conSG(W,S|Y,T) = {W → Y ← X → Z → S, W → Y ← X → Z → R → S};
d-conSG(W,S|R,Z) = ∅;
d-conSG(W,S|R,T) = {W → Y → R ← Z → S}.

So from Assumption 3 we can easily infer that depD(W,S|X,Y) [or depD(W,S|Y,Z) or depD(W,S|R,Z)] ≤ depD(W,S|X,Z) [or depD(W,S|X,R) or depD(W,S|X,T) or depD(W,S|Y,R) or depD(W,S|Y,T) or depD(W,S|R,T)]; depD(W,S|X,Y) ≤ depD(W,S|X) ≤ depD(W,S|T); depD(W,S|R,Z) ≤ depD(W,S|R) ≤ depD(W,S|T); and depD(W,S|Y,Z) ≤ depD(W,S|Z) ≤ depD(W,S|T) [or depD(W,S|X,T)].

Before proceeding, we present three lemmas needed by our algorithm.

Lemma 1 (Neapolitan, 2003). Let G = <V, E> be a DAG and X,Y ∈ V. If X and Y are d-separated by some set, then they are d-separated either by the set consisting of the nodes adjacent to X or by the set consisting of the nodes adjacent to Y.

Lemma 2. Let G = <V, E> be a DAG and X,Y ∈ V. Under Assumption 3, if Z is a minimal subset of V such that d-sepG(X,Y|Z), and there exist two sets Z1 ⊂ Z and Z2 ⊄ Z with |Z1| = |Z2| = |Z1 ∩ Z2| + 1, then depD(X,Y|Z1) < depD(X,Y|Z2).

Lemma 3. Let G = <V, E> be a DAG, X,Y ∈ V and Z ⊆ V. Under Assumption 3, if Z is a minimal subset of V such that d-sepG(X,Y|Z), and there exist two sets Z1, Z2 such that Z2 ⊂ Z1 ⊂ Z, then depD(X,Y|Z1) < depD(X,Y|Z2).

From Lemma 1, we try to d-separate an adjacent node pair X and Y by checking only the nodes in either ADJX or ADJY, where ADJX denotes the subset of V consisting of all nodes adjacent to X. To detect the proper conditioning set Z that d-separates X and Y, we try to separate the nodes by checking first lower-order and then progressively higher-order CI tests between X and Y, that is, by checking whether depD(X,Y|Z) is lower than a threshold value. From Lemmas 2 and 3, we can also use depD(X,Y|Z) as the heuristic function to select the most probable node to add to Z, that is, we always select the node in ADJX or ADJY that decreases depD(X,Y|Z) the most in the current iteration.

Table 1 outlines our heuristic PC algorithm, algorithmHPC. The algorithm takes as input a dataset D over a set of variables V and a significance level 0 < ε < 1. It starts with a completely connected graph GP over V. For all adjacent nodes X and Y in this graph, we first try to d-separate node pairs by zero-order CI tests, i.e. with an empty conditioning set (lines 33–36). It then checks d-separation relations by progressively higher-order CI tests between X and Y. The conditioning sets and conditional dependence values at the end of each iteration are saved into the arrays Sep and dep, respectively.

Sep[X,Y] and dep[X,Y] are used to save the current conditioning set consisting of nodes adjacent to X and the corresponding conditional dependence value; similarly, Sep[Y,X] and dep[Y,X] are used for the nodes adjacent to Y. The order of the CI tests is controlled by the variable order. In each iteration (lines 3–31), given the variable order > 0, for each node pair X and Y we use heuristics to search for the conditioning set that is most likely to d-separate X and Y. Each variable in ADJX (or ADJY) is sequentially added to the current conditioning variable set oldSepx (or oldSepy) to produce a new one, newSepx (or newSepy). Then all the higher-order conditional dependence values depD(X,Y|newSepx) [and depD(X,Y|newSepy)] are calculated. Once the node pair X and Y is identified as d-separated (line 9 or 22), the edge between X and Y is removed and the CI tests for X and Y are terminated (line 10 or 23). If the edge between X and Y cannot be removed, the smallest depD(X,Y|newSep) and the corresponding conditioning variable set in this iteration are recorded in the arrays dep[X,Y] and Sep[X,Y] (or dep[Y,X] and Sep[Y,X]), respectively (lines 12–14 or lines 25–27). Note that the higher-order conditional dependence value depD(X,Y|newSep) must be lower than the lower-order conditional dependence value depD(X,Y|oldSep). If no newSep meeting this requirement is found in the current iteration (line 31), the relationship between X and Y is not tested any further in the following iterations; in this case, the relationship is considered uncertain. In our method, uncertain relationships are still marked as possible d-connection relationships because they can be checked further in the next phase. After all node pairs have been tested at the current order as described above, the order is increased by 1 to begin the next iteration. This procedure is repeated until all d-separation relations are found or the order cannot be increased any more (line 40).

Unlike the basic PC algorithm, which considers all possible conditioning variable subsets of size order within ADJ, algorithmHPC only chooses the most probable variable set when increasing the order of the CI tests. This dramatically decreases the number of CI tests to be performed.

The heuristic selection criterion in algorithmHPC is based on the measure depD(X,Y|Z). Several parametric or non-parametric tests can be used to assess conditional independence and the strength of association depD(X,Y|Z), including mutual information, Fisher's Z test, the chi-square test and the G2 test (Neapolitan, 2003). In this paper, we use the G2 statistic because it provides statistically significant values and is relatively easy to compute. The G2 test returns a p-value that corresponds to the probability of falsely rejecting the null hypothesis given that it is true. If the p-value is less than a significance level ε (set to 0.05 in our experiments), the null hypothesis is rejected; if the independence hypothesis cannot be rejected, it is accepted. As a measure of association, we used the negative p-value returned by the G2 tests: the smaller the p-value, the higher the association. For the G2 tests, the absolute minimum number of samples should be above five times the number of degrees of freedom; if this requirement could not be met, the CI results were considered not reliable.
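A compact sketch of such a G2 test on discrete data follows. The degrees of freedom use the classical product formula rather than the corrected calculation the paper adopts (described below), and the data layout, a list of per-sample dictionaries, is an assumption made for illustration.

```python
# Sketch of a G2 conditional-independence test for discrete data, usable as the
# dependence measure dep_D(X,Y|Z) (e.g. via the negative p-value).
from collections import Counter
from math import log

from scipy.stats import chi2

def g2_test(data, x, y, z):
    """data: list of dicts mapping variable name -> discrete value.
    Returns (G2 statistic, p-value) for testing X independent of Y given Z."""
    n_xyz, n_xz, n_yz, n_z = Counter(), Counter(), Counter(), Counter()
    for row in data:
        zs = tuple(row[v] for v in z)
        n_xyz[(row[x], row[y], zs)] += 1
        n_xz[(row[x], zs)] += 1
        n_yz[(row[y], zs)] += 1
        n_z[zs] += 1

    g2 = 0.0
    for (xv, yv, zs), observed in n_xyz.items():
        expected = n_xz[(xv, zs)] * n_yz[(yv, zs)] / n_z[zs]
        g2 += 2.0 * observed * log(observed / expected)

    def levels(v):
        return len({row[v] for row in data})

    dof = (levels(x) - 1) * (levels(y) - 1)
    for v in z:
        dof *= levels(v)
    return g2, chi2.sf(g2, dof)

# Independence is accepted when the p-value exceeds the significance level
# (0.05 in the paper); -p then serves as the association strength dep_D(X,Y|Z).
```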


Table 1 Heuristics PC algorithm

Due to the small sample sizes, in our implementation we adopted the method of Steck and Jaakkola (2002) to calculate the degrees of freedom.

We proved that our algorithm is always correct under the given assumptions (the proofs can be found in Appendix A).

Theorem 2. Under Assumptions 1–3, all the d-connection and d-separation relationships returned by algorithmHPC(D, ε) are true.

In practical datasets, the Monotone DAG faithfulness condition does not necessarily hold, and some recent theoretical analysis (Chickering and Meek, 2006) has also pointed out problems with this assumption. This situation can make the heuristic search in algorithmHPC miss some d-separation relations.

However, empirical evaluations (Cheng et al., 2002; Huang et al., 2007) have shown that Monotone DAG faithfulness yields reasonably good approximate network models. Although our algorithm cannot guarantee that all of the d-connection relations in GP are true, it has the advantage of rapidly decreasing the number of possible d-connection relations in GP.
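Table 1 gives the full pseudocode of algorithmHPC. As a rough, simplified sketch of the search strategy described above (not the authors' listing), the code below assumes a ci_test(data, x, y, z) function, for example the G2 test sketched earlier, that returns a dependence strength together with a flag indicating whether independence was accepted; pooling candidate conditioning nodes from both adjacency sets and capping the order are additional simplifications.

```python
# Loose sketch of the heuristic skeleton search of Section 3.1 (simplified).
def learn_skeleton(data, variables, ci_test, max_order=3):
    """Return an adjacency dict approximating the BN skeleton GP."""
    adj = {v: set(variables) - {v} for v in variables}
    sep = {}   # current conditioning set per still-connected pair
    dep = {}   # its dependence strength

    def pairs():
        return [(x, y) for x in variables for y in adj[x] if x < y]

    # Order-0 CI tests on the complete graph.
    for x, y in pairs():
        d, independent = ci_test(data, x, y, ())
        if independent:
            adj[x].discard(y); adj[y].discard(x)
        else:
            sep[(x, y)], dep[(x, y)] = set(), d

    # Progressively higher orders: extend the conditioning set of each remaining
    # pair by the adjacent node that most decreases dep_D, if any.
    for order in range(1, max_order + 1):
        for x, y in pairs():
            candidates = (adj[x] | adj[y]) - {x, y} - sep[(x, y)]
            best = None
            for c in candidates:
                z = sep[(x, y)] | {c}
                d, independent = ci_test(data, x, y, tuple(sorted(z)))
                if independent:
                    best = ("remove", None, None)
                    break
                if d < dep[(x, y)] and (best is None or d < best[1]):
                    best = ("grow", d, z)
            if best is None:
                continue                      # no admissible extension this round
            if best[0] == "remove":
                adj[x].discard(y); adj[y].discard(x)
                del sep[(x, y)], dep[(x, y)]
            else:
                dep[(x, y)], sep[(x, y)] = best[1], best[2]
    return adj
```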


3.2. Learning BNs

Table 2. BN MCMC learning algorithm

The skeleton graph GP produced in the first phase is used as the input to this phase. Because the amount of data is small relative to the number of variables in gene expression datasets, it is more appropriate to do inference by averaging over graph models (Heckerman et al., 1997; Neapolitan, 2003). For this reason, we adopted approximate model averaging using Markov Chain Monte Carlo (MCMC) to learn the BN in this phase. Table 2 outlines this BN learning algorithm.

The learning starts with the empty graph G, i.e. a graph with no edges, as the current graph Gold. Given Gold, all possible new graphs are produced by one of three elementary operators: adding one edge, deleting one edge or reversing one edge. These new graphs are called the neighborhood of Gold because each differs from Gold by only a single adjacency. In our learning method, a neighbor is legal iff it satisfies two requirements: (1) the new graph is acyclic; (2) an edge between X and Y (in either orientation) can be present in the new graph only if there is an undirected edge X–Y in GP. From all the legal neighboring DAGs, we randomly select one as the new DAG Gnew. Then, Gnew is checked as to whether it is accepted as the current graph Gold. This check depends on the Metropolis–Hastings acceptance criterion, given by

PMH = min{1, [P(Gnew|D) / P(Gold|D)] × [Q(Gold|Gnew) / Q(Gnew|Gold)]}

where P(G|D) is the probability of DAG G given the data D. From Bayes' rule we have

P(G|D) = (1/Z) P(D|G) P(G)

where Z = ΣG P(D|G)P(G) is a normalization factor and P(G) is the prior probability of the DAG. When certain regularity conditions, discussed by Heckerman et al. (1995), are satisfied and the data are complete, P(D|G) can be estimated by closed formulas. As mentioned in Heckerman et al. (1995), several scoring functions can be used to efficiently estimate the goodness of fit of a model against a given dataset. Among these score metrics none has been proven to be superior, so for our experiments we selected the widely used BDeu score. The details of the definition of BDeu scores can be found in Cooper and Herskovits (1992) or Heckerman et al. (1995); they express the posterior probability of a network given the data and prior knowledge.
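Because the BDeu score is decomposable, each node contributes one local term to log P(D|G). The sketch below writes that local term from the standard closed-form expression (Cooper and Herskovits, 1992; Heckerman et al., 1995); the record layout, the levels dictionary and the default equivalent sample size are assumptions for illustration, not details taken from the paper.

```python
# Sketch: local BDeu term log P_i(D | G) for one node given its parent set,
# using equivalent sample size `ess`; counts are taken from dict records.
from collections import Counter
from itertools import product
from math import lgamma

def local_bdeu(data, node, parents, levels, ess=1.0):
    """levels: dict mapping each variable to the list of its discrete values."""
    q = 1
    for p in parents:
        q *= len(levels[p])                    # number of parent configurations
    r = len(levels[node])                      # number of states of the node
    a_j, a_jk = ess / q, ess / (q * r)         # BDeu pseudo-counts

    n_jk = Counter()                           # counts per (parent config, value)
    for row in data:
        j = tuple(row[p] for p in parents)
        n_jk[(j, row[node])] += 1

    score = 0.0
    for j in product(*(levels[p] for p in parents)):
        n_j = sum(n_jk[(j, k)] for k in levels[node])
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for k in levels[node]:
            score += lgamma(a_jk + n_jk[(j, k)]) - lgamma(a_jk)
    return score
```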

In practice, the Hastings ratio in this case is given by

Q(Gold|Gnew) / Q(Gnew|Gold) = N(Gold) / N(Gnew)

where N(G) is the size of the neighborhood of G, that is, the number of legal acyclic graphs that can be obtained from G by applying one of the elementary operators. Thus, PMH can be written as

PMH = min{1, [P(D|Gnew) / P(D|Gold)] × [N(Gold) / N(Gnew)]}

Repeatedly running the above Monte Carlo sampling (lines 3–10 in Table 2) produces a series of DAGs, which forms a Markov chain. Theoretical results show that, after a long enough run, the scores of the graphs converge. Once the number of iterations exceeds the burn-in period, the average of the DAGs in the Markov chain is computed (line 11). From this averaged G, we determine whether each edge is present or not according to I(Gij > 0.5), where I(·) is an indicator function equal to 1 if the condition in parentheses is true and 0 otherwise (lines 12–14). In our implementation, we set the burnin value to n × 5 and the len value to n × 100. Methods for choosing a burn-in time and the number of iterations to use after burn-in are discussed in Gilks et al. (1996).

The decomposition property of the score calculation means that the global measure P(D|G) can be written as a product of local measures, where each local measure is a function of one node and its parent set only. Let Pi(D|G) be the ith term of this product, and let Gnew be the result of proposing either the addition or the removal of an arc j → i in Gold. The acceptance ratio then involves only those sets of parent nodes that differ, that is,

PMH = min{1, [Pi(D|Gnew) / Pi(D|Gold)] × [N(Gold) / N(Gnew)]}

If Gnew is the result of proposing the reversal of the edge j → i into i → j, then the acceptance ratio involves four different sets of parent nodes:


PMH = min{1, [Pi(D|Gnew) Pj(D|Gnew) / (Pi(D|Gold) Pj(D|Gold))] × [N(Gold) / N(Gnew)]}

Because of this decomposition, only the local terms affected by a move need to be recomputed, which saves computational time in the score calculation. Two additional computational tricks can be applied in the MCMC learning to improve efficiency.

First, we only need to store the adjustment information of one edge for each DAG in the learning procedure, because there is only one adjacency difference between any two contiguous DAGs in the simulated Markov chain. Thus, much redundant information describing a full DAG structure can be omitted. For example, in a given chain, if the edge X → Y in the previous DAG is reoriented as Y → X in the next DAG, only the nodes X and Y and the corresponding operator "reversal" need to be recorded; none of the other edge information in the DAG needs to be stored. Addition and deletion operations can be recorded in a similar way. From the initial empty graph and the stored adjustment information for each DAG, we can readily restore the full structure of every DAG in the chain. This dramatically reduces memory usage during MCMC learning.

Second, we can adopt the MCMC model searching method proposed in Giudici and Castelo (2003) to check the acyclicity of the DAG. In BN learning, acyclicity checking is very time-consuming even with the most efficient methods. In our implementation, we maintain an ancestor matrix M, in which the element M[X,Y] is set to 1 iff X is an ancestor of Y. After one of the three elementary operators is applied, only the matrix M needs to be adjusted accordingly. Then, after each reversal or addition operation, we just need to check whether a 1 appears on the diagonal of M; if it does, at least one cycle has been created in the DAG by the current operation. Using these techniques, we can easily increase the algorithm's efficiency and scale MCMC learning up to datasets with several thousand variables.
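A small sketch of the ancestor-matrix acyclicity check follows. For clarity the matrix is recomputed from scratch here, whereas the method of Giudici and Castelo (2003) adopted in the paper updates it incrementally after each elementary operation; the diagonal check itself is the same, and the example graph is hypothetical.

```python
# Sketch of the acyclicity check via an ancestor matrix M
# (M[x][y] = 1 iff x is an ancestor of y).

def ancestor_matrix(nodes, edges):
    """edges: set of (parent, child) pairs."""
    children = {n: set() for n in nodes}
    for a, b in edges:
        children[a].add(b)
    m = {x: {y: 0 for y in nodes} for x in nodes}
    for x in nodes:
        stack = list(children[x])
        while stack:
            y = stack.pop()
            if not m[x][y]:
                m[x][y] = 1
                stack.extend(children[y])
    return m

def creates_cycle(nodes, edges, add=None, remove=None):
    """Apply a proposed edge change and check the diagonal of M."""
    new_edges = set(edges)
    if remove:
        new_edges.discard(remove)
    if add:
        new_edges.add(add)
    m = ancestor_matrix(nodes, new_edges)
    return any(m[x][x] for x in nodes)

nodes = ["A", "B", "C"]
edges = {("A", "B"), ("B", "C")}
print(creates_cycle(nodes, edges, add=("C", "A")))                     # True
print(creates_cycle(nodes, edges, add=("B", "A"), remove=("A", "B")))  # False (reversal)
```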


4. Results

In this section, we evaluate our method first on simulated data and then on a real biological dataset, and we compare its performance with that of competitive methods using the simulated data.

4.1. Simulated Data

Since the underlying models of real-world datasets are usually unknown, it is difficult to evaluate and analyze the learning accuracy of novel methods, and almost all researchers in this field use synthesized datasets to evaluate their algorithms. We considered databases sampled from four discrete Bayesian networks that have previously been used as benchmarks for algorithms learning BNs from data, namely the Alarm BN (37 nodes and 46 edges) (Beinlich et al., 1989), the Alarm10 BN (370 nodes and 570 edges), the Pigs BN (441 nodes and 592 edges) (Jensen, 1997) and the Insurance10 BN (270 nodes and 278 edges). Among these four BNs, Alarm10 and Insurance10 were generated by the tiling algorithm used in Statnikov et al. (2003) and contain 10 copies of the original Alarm BN and Insurance BN (Binder et al., 1997), respectively. We selected these four BNs because the possible values of the attributes (domains) in these datasets are relatively small (the maximal domain size is 5), which is similar to discretized microarray datasets that include only three expression levels.

For each of the networks, we generated 10 synthetic datasets with 100, 200 and 500 training cases, respectively. The number of training cases in a dataset is hereafter referred to as the sample size (SS). We tested our method's ability to reconstruct the networks using these datasets. The performance metrics of a method were averaged over the 10 datasets for each SS and network.

We adopted the Structural Hamming Distance (SHD) metric proposed by Tsamardinos et al. (2006) to evaluate the overall performance of the algorithms. The SHD directly compares the structures of the corresponding equivalence classes (also called partially directed DAGs, PDAGs) of the learned and original networks; details about Markov equivalence classes can be found in Chickering (1996), Neapolitan (2003) and Spirtes et al. (1993). The SHD between two PDAGs is defined as the number of the following operators required to make the PDAGs match: add or delete an undirected edge, and add, delete, or reverse the orientation of a directed edge. Before calculating SHDs, we used the algorithm proposed in Chickering (1996) to convert DAGs into the corresponding PDAGs. The details of the SHD calculation can be found in Tsamardinos et al. (2006). We also report the numbers of missing arcs (MA), erroneous arcs (EA) and wrongly (or missing) oriented arcs (WO) in the learned PDAG structures as compared to the true PDAG structures, as an indication of the error distribution in the results.

We selected the MMHC algorithm (Tsamardinos et al., 2006) and the Sparse Candidate (SC) algorithm (Friedman et al., 1999) for inclusion in this study. The MMHC algorithm was selected because it achieved the best performance when compared to several other commonly used algorithms in a previous study (Tsamardinos et al., 2006); we do not include pure constraint-based algorithms in our simulation tests because of their poor performance compared to MMHC. The SC algorithm was selected because it was the first BN learning algorithm applied to large-scale gene network construction (Friedman et al., 2000). We ran our method together with these two algorithms on all of the simulated datasets to compare their performance. For each structure and each sample size (100, 200 and 500), we calculated the average counts of MA, EA, WO and SHD over the 10 trials.

Table 3 provides a summary of the performance comparisons between the three methods. In all cases, except for the Alarm networks with 200 and 500 samples and the Insurance10 networks with 100 samples, our method outperforms the MMHC and SC algorithms in the SHD metric. This suggests that our method achieved better overall performance than the competing algorithms by successfully reducing the number of false positive edges in the predicted BNs.
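As an illustration of the SHD comparison, the sketch below counts one unit for every node pair whose edge presence or orientation differs between two PDAGs. The PDAG encoding and the example graphs are assumptions made here for illustration; the operator-based definition of Tsamardinos et al. (2006) should be consulted for the exact bookkeeping.

```python
# Simplified sketch of a Structural Hamming Distance between two PDAGs.
# A PDAG is given as a set of directed edges (x, y) plus a set of undirected
# edges frozenset({x, y}); every node pair whose status differs costs 1.

def canonical(directed, undirected):
    status = {}
    for (x, y) in directed:
        status[frozenset((x, y))] = (x, y)     # oriented edge
    for e in undirected:
        status[frozenset(e)] = "undirected"
    return status

def shd(directed1, undirected1, directed2, undirected2):
    s1 = canonical(directed1, undirected1)
    s2 = canonical(directed2, undirected2)
    return sum(1 for pair in set(s1) | set(s2) if s1.get(pair) != s2.get(pair))

# Hypothetical example: {A -> B, B - C}  versus  {B -> C, A -> C, A - B}
print(shd({("A", "B")}, {frozenset(("B", "C"))},
          {("B", "C"), ("A", "C")}, {frozenset(("A", "B"))}))   # -> 3
```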


Table 3
Comparisons of simulation test results

Network | Algorithm | SS | EA | MA | WO | SHD
Alarm | Ours | 100 | 3.3 ± 1.1 | 30.0 ± 1.9 | 42.3 ± 3.1 | 42.3 ± 3.1
Alarm | MMHC | 100 | 12.6 ± 2.2 | 19.9 ± 2.2 | 12.5 ± 4.0 | 45.0 ± 3.6
Alarm | SC | 100 | 18.7 ± 5.2 | 26.5 ± 2.4 | 8.7 ± 3.5 | 53.9 ± 9.3
Alarm | Ours | 200 | 3.3 ± 1.1 | 26.7 ± 2.0 | 11.3 ± 4.1 | 41.3 ± 2.5
Alarm | MMHC | 200 | 6.8 ± 1.6 | 14.8 ± 2.2 | 8.6 ± 4.3 | 30.2 ± 6.0
Alarm | SC | 200 | 17.6 ± 6.1 | 24.5 ± 3.8 | 9.4 ± 3.0 | 51.5 ± 8.0
Alarm | Ours | 500 | 2.3 ± 1.5 | 20.0 ± 3.1 | 10.9 ± 1.7 | 33.2 ± 4.2
Alarm | MMHC | 500 | 4.1 ± 1.4 | 9.8 ± 1.5 | 7.1 ± 2.9 | 21.0 ± 3.1
Alarm | SC | 500 | 14.9 ± 2.2 | 24.3 ± 3.1 | 9.0 ± 1.7 | 48.2 ± 5.5
Alarm10 | Ours | 100 | 33.1 ± 6.4 | 426.6 ± 7.7 | 98.5 ± 6.2 | 558.2 ± 12.6
Alarm10 | MMHC | 100 | 282.3 ± 9.7 | 312.3 ± 28.8 | 97.1 ± 9.7 | 691.7 ± 34.6
Alarm10 | SC | 100 | 213.1 ± 14.2 | 562.9 ± 8.3 | 21.3 ± 5.9 | 797.3 ± 12.7
Alarm10 | Ours | 200 | 25.3 ± 4.3 | 379.7 ± 8.8 | 119.8 ± 6.0 | 524.8 ± 12.1
Alarm10 | MMHC | 200 | 164.3 ± 13.2 | 273.7 ± 7.9 | 117.6 ± 11.2 | 555.6 ± 21.8
Alarm10 | SC | 200 | 193.5 ± 34.3 | 566.6 ± 6.7 | 26.0 ± 4.0 | 786.1 ± 33.9
Alarm10 | Ours | 500 | 13.6 ± 4.1 | 309.7 ± 6.7 | 105.9 ± 9.5 | 429.2 ± 11.1
Alarm10 | MMHC | 500 | 104.5 ± 6.2 | 225.9 ± 3.3 | 127.9 ± 10.7 | 458.3 ± 11.3
Alarm10 | SC | 500 | 130.2 ± 10.0 | 545.9 ± 9.0 | 38.2 ± 2.8 | 714.3 ± 10.9
Pigs | Ours | 100 | 28.1 ± 5.2 | 69.7 ± 10.7 | 87.1 ± 14.2 | 184.9 ± 21.3
Pigs | MMHC | 100 | 169.9 ± 16.2 | 91.7 ± 8.2 | 106.4 ± 11.5 | 368.0 ± 29.8
Pigs | SC | 100 | 131.5 ± 13.5 | 584.9 ± 3.7 | 6.6 ± 3.5 | 723.0 ± 13.5
Pigs | Ours | 200 | 8.5 ± 2.7 | 17.8 ± 4.0 | 7.6 ± 3.1 | 33.9 ± 7.0
Pigs | MMHC | 200 | 59.1 ± 5.3 | 5.9 ± 2.6 | 40.7 ± 17.6 | 105.7 ± 21.4
Pigs | SC | 200 | 66.8 ± 8.3 | 588.4 ± 1.6 | 2.3 ± 1.1 | 657.5 ± 6.9
Pigs | Ours | 500 | 0.5 ± 0.7 | 0.4 ± 0.8 | 0.4 ± 1.3 | 1.3 ± 2.1
Pigs | MMHC | 500 | 9.5 ± 3.9 | 0.0 ± 0.0 | 1.5 ± 2.7 | 11.0 ± 4.9
Pigs | SC | 500 | 43.8 ± 7.1 | 591.0 ± 1.1 | 0.5 ± 0.5 | 635.3 ± 6.8
Insurance10 | Ours | 100 | 93.4 ± 7.6 | 87.1 ± 62.2 | 385.0 ± 6.8 | 565.5 ± 58.9
Insurance10 | MMHC | 100 | 124.6 ± 11.6 | 314.8 ± 7.7 | 110.2 ± 11.1 | 549.6 ± 18.7
Insurance10 | SC | 100 | 70.9 ± 5.6 | 529.9 ± 5.1 | 15.1 ± 3.3 | 615.9 ± 6.0
Insurance10 | Ours | 200 | 96.9 ± 6.8 | 41.9 ± 47.6 | 324.9 ± 7.9 | 463.7 ± 42.8
Insurance10 | MMHC | 200 | 69.6 ± 9.3 | 259.6 ± 6.2 | 136.7 ± 8.9 | 465.9 ± 12.6
Insurance10 | SC | 200 | 68.1 ± 8.2 | 529.3 ± 6.2 | 15.3 ± 4.0 | 612.7 ± 4.9
Insurance10 | Ours | 500 | 19.8 ± 3.9 | 260.7 ± 7.1 | 96.8 ± 8.2 | 377.3 ± 13.4
Insurance10 | MMHC | 500 | 47.3 ± 7.4 | 199.5 ± 7.2 | 150.7 ± 6.3 | 397.5 ± 7.2
Insurance10 | SC | 500 | 68.0 ± 3.8 | 517.2 ± 5.5 | 23.1 ± 5.1 | 608.3 ± 4.3

EA: extra (erroneous) edges in the learned PDAG; MA: edges missing from the learned PDAG; WO: edges reversed between the two PDAGs, plus edges that are undirected in one PDAG and directed in the other; SHD: Structural Hamming Distance. All values are average counts over 10 trials in each case; SS is the sample size of the datasets; the values following ± are standard deviations.

4.2. Biological Data

To further evaluate the performance of the proposed method, we applied it to a publicly available real microarray dataset from a previous study of yeast cell cycle gene regulation (Spellman et al., 1998). That study was originally designed to identify yeast genes whose transcription levels varied periodically within the cell cycle. The microarray dataset contains 77 gene expression measurements of the mRNA levels of 6177 Saccharomyces cerevisiae ORFs. A total of 800 genes were identified as cell-cycle regulated in the original report. In this paper, we used the 77 data samples for these 800 genes. A 3-level quantization scheme was chosen to discretize the continuous data in this dataset.

In this quantization scheme, we used the same threshold as Friedman et al. (2000) to perform the discretization. With a threshold of 0.5 on the logarithmic base 2 scale, expression levels with a ratio to the reference lower than 2^−0.5 were considered down-regulated, levels higher than 2^0.5 were considered up-regulated, and expression levels between 2^−0.5 and 2^0.5 were considered unchanged.
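A minimal sketch of this quantization rule follows; the −1/0/+1 coding and the assumption that the input is a linear-scale ratio to the reference are ours.

```python
# Sketch of the 3-level quantization: log2 ratios below -0.5 map to "down" (-1),
# above 0.5 to "up" (+1), and anything in between to "unchanged" (0).
from math import log2

def discretize(ratio, threshold=0.5):
    """ratio: expression level relative to the reference (linear scale)."""
    value = log2(ratio)
    if value < -threshold:
        return -1       # down-regulated (ratio below 2**-0.5)
    if value > threshold:
        return 1        # up-regulated (ratio above 2**0.5)
    return 0            # unchanged

print([discretize(r) for r in (0.4, 1.0, 2.1)])   # [-1, 0, 1]
```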


Fig. 2. The top 20 relations selected from the learned yeast cell cycle gene network.

Fig. 2 presents the top 20 relations with the highest chi-square values from the learned network. Some of the interactions in Fig. 2 have been verified in previous yeast research. For example, HHF2, HHT1, HTB1, HTA1 and HTB2 are histone genes, the mRNAs of which oscillate in abundance during the cell division cycle (Hereford et al., 1981). The interactions HHF2–HHT1, HTB1–HTA1, HTA1–HTB2 and HHT1–HTB1 have been validated by biological experiments (Ho et al., 2002; Grant et al., 1999; Kurumizaka and Wolffe, 1997).

5. Discussion

5.1. Comparison with Related Work

In recent years, many different algorithms based on BN models for inferring gene networks have been proposed. To our knowledge, only a few BN learning methods can be applied to large-scale networks with thousands of variables, such as the SC algorithm (Friedman et al., 1999), the MMHC algorithm (Brown et al., 2004; Tsamardinos et al., 2006), algorithmGPC (Peña et al., 2006), MinReg (Pe'er et al., 2006) and Module networks (Segal et al., 2005). The MinReg and Module network models impose additional constraints (modules) on the possible BN structure, so the BN structures in these methods do not exactly fit the standard BN definition. We do not discuss these two methods further because performance comparison is difficult under different structure definitions.

To restrict the search space, each of the other three algorithms (the SC algorithm, the MMHC algorithm and algorithmGPC) uses local search methods to identify important, highly correlated variables for each variable. The SC algorithm does not use d-separation concepts to search for the parent nodes of each node. Initially, it simply selects the candidate parents of each node according to the association strength between variables; this candidate parent selection is then combined with a BN structure learning procedure. In this algorithm, BN learning not only needs to identify the BN structure but also to re-estimate the possible candidate parent sets of each node. As a consequence, it takes a longer running time to converge to an acceptable approximation, and the selection of parent nodes is unsound. The other problem with the SC algorithm resides in the difficulty of specifying the necessary parameter k, i.e. the maximum number of parents allowed for any node in the network.

MMHC and algorithmGPC use Markov blanket concepts and CI tests to identify the parent and child nodes (denoted PCD) of each node in the DAG. They also use heuristic methods to determine the PCD around each node; the PCD in their algorithms corresponds exactly to the set of nodes d-connected to each node.


These two algorithms use a similar approach to identify all of the d-connection relationships, but algorithmGPC has been claimed to correct some errors in PCD selection (Peña et al., 2005). Our tests showed that these theoretical errors in the theorem proofs do not have much influence in practice, because they rarely occur in actual structures.

Compared to the above algorithms, our algorithm distinguishes itself by its use of a global search for the identification of d-connection (or d-separation) relations, which avoids some problems of local search. The first limitation of a local search is that it has to add all variables to the initial candidate PCD (called CPC in these algorithms), irrespective of the d-separation relationships already found in the lower-order CI tests for other variables; this adds computational burden because of unnecessary CI tests. The second limitation is that the d-connection relations inferred from different nodes are often mutually inconsistent in a local search; thus, poor overall quality results are generated because of myopic decisions. Peña et al. (2005) are aware of this drawback and describe a variant of local search that alleviates it, though it does not solve it: an initial variable is specified as the seed node, together with a distance variable that dictates how far the CI tests can be extended to other nodes from the seed. How to specify these two inputs is difficult to deal with, and different initial seed variables may lead to different results in practice.

The other difference is the computational time complexity. Both the MMHC algorithm and algorithmGPC use a two-phase method to sift the possible d-connection relations (PCD) for each variable. For MMHC, in the growing phase, all CI tests under all possible subsets of the conditioning set are performed exhaustively before a variable enters the CPC. In the shrinking phase, the algorithm tests the independence of each variable in the CPC with the target, conditioned on all subsets of the rest of the variables in the CPC. This is very time consuming, especially when the heuristic function is not perfect. To limit the search space, a parameter l, the maximal size of the conditioning set allowed in CI tests, must be specified; the number of CI tests is then bounded by O(|V|·|CPC|^(l+1)). It is, however, not easy to determine the parameter l. More seriously, even if l is set to a small number, in the worst case the CPC may still grow to include all the variables, which would make the search prohibitive in the context of large datasets. The algorithmGPC uses a similar idea for searching the PCD, but needs even more CI tests than MMHC.

Another important constraint-based algorithm, TPDA (Cheng et al., 2002), which is also based on the Monotone DAG faithfulness assumption, can be compared with our algorithm. The major difference from our algorithmHPC is the induction step: TPDA iteratively decreases the order of the CI tests during the dependence analysis, so large numbers of variables are present in the conditioning sets at the beginning of the procedure. From a theoretical viewpoint, the strength of TPDA is that errors made in early iterations can be corrected in later runs. In practice, however, the high-order CI tests in the first several rounds may produce many unreliable results, and the required sample sizes are hard to attain in practical datasets, especially in the microarray domain. Compared to TPDA, algorithmHPC uses a different search strategy that depends mainly on low-order CI tests in the first several rounds.


This has the effect of making the d-separation relations in the results more reliable. There are two other improvements in our algorithm that further reduce the number of false positive d-connection relations: (1) for the CI tests, algorithmHPC considers the two node sets adjacent to the two ends of each edge (i.e., both ADJX and ADJY in Table 1); (2) unlike the thinning phase in TPDA (in which d-connection relations are validated), we can delete false positive edges with the scored-searching method in the second phase, which is more robust than pure CI tests for datasets with small sample sizes.

6. Conclusion

The hybrid method described herein compares advantageously with the existing algorithms used in this field. First, it combines the advantages of constraint-based and scored-searching algorithms: it can be scaled up to high-dimensional data while increasing accuracy, especially when the sample numbers are limited and the datasets are noisy and incomplete, which makes it suitable for microarray data mining. Secondly, some false positive errors generated in the first phase can be rectified in the second phase. Thirdly, our method does not need extra input parameters, which are hard to determine without much prior biological knowledge. The time complexity of the first stage is relatively low: in algorithmHPC, the number of CI tests is bounded by O(|V|^3), and in practice the complexity can be even lower when the heuristics take effect. All of the d-separation and d-connection relations are determined by a global search, and the high-order CI results rely on lower-order CI tests performed in the first several iterations, which makes the results more reliable and stable. Finally, the techniques used in the MCMC learning improve the execution efficiency.

Two limitations remain with algorithmHPC. First, the first phase may return many false positive (FP) errors when the Monotone faithfulness assumption does not hold. Second, false negative (FN) errors produced in the first phase cannot be verified further and will remain in the final results. From our simulation tests, the FP error problem is not serious, because the FP numbers are significantly reduced by algorithmHPC; this indicates that the Monotone faithfulness assumption is still adequate for practical datasets. The FN problem (the MA numbers in Table 3) is a more serious issue, and controlling the FN numbers is a challenge for improving our method. The main reasons for this situation are: (1) statistical errors occur frequently in CI tests, especially when the sample sizes of the datasets are small; (2) the DAG faithfulness assumption for constraint-based algorithms may not be valid. Some techniques can be considered to mitigate the FN problem, such as using larger threshold values ε for the CI tests or restricting the order of the CI tests (that is, performing only low-order CI tests). From our experience, different significance levels have only a slight influence on the final results, whereas restricting the maximal order of the CI tests is an effective measure; in some cases, performing only zero-order and first-order CI tests in the first phase can achieve ideal results. Balancing the FP and FN errors is still an unresolved issue, and research is still needed to address the problems of finding suitable assumptions and improving statistical confidence in CI tests.

Acknowledgements

The authors are grateful to Travis Banks for fruitful discussion and comments, to Andrzej Walichnowski for manuscript revision, and to Kevin P. Murphy, Olivier Francois and Sonia Leach for their contribution to the Bayes Net Toolbox. Mingyi Wang received a visiting fellowship from NSERC. This research was supported by the Canadian Crop Genomics Initiative. This is Agriculture and Agri-Food Canada publication #1953.

Appendix A. Proofs of the Theorems

Lemma 2. Let G = <V, E> be a DAG and X,Y ∈ V. Under Assumption 3, if Z is a minimal subset of V such that d-sepG(X,Y|Z), and there exist two sets Z1 ⊂ Z and Z2 ⊄ Z with |Z1| = |Z2| = |Z1 ∩ Z2| + 1, then depD(X,Y|Z1) < depD(X,Y|Z2).

Proof. If Z is the minimal conditioning subset of V satisfying d-sepG(X,Y|Z), then from Definitions 3 and 4 each node in Z can only be a chain or fork node, not a collider or a descendant of a collider, on the paths between X and Y. Since Z1 ⊂ Z, all the nodes in Z1 can only be chain or fork nodes on the paths between X and Y. Thus, d-conSG(X,Y|Z1) includes all the paths between X and Y (denoted Spath(X,Y)) except the paths through the nodes in Z1 (denoted d-sepSG(X,Y|Z1)); that is, d-conSG(X,Y|Z1) = Spath(X,Y) − d-sepSG(X,Y|Z1). Since |Z1| = |Z2| = |Z1 ∩ Z2| + 1, there is one and only one node in Z2 but not in Z1; we denote this node A. We can infer that A ∉ Z, because if A ∈ Z it would violate Z2 ⊄ Z. The node A cannot be a chain or fork node on any path between X and Y, because A ∉ Z and d-sepG(X,Y|Z). So d-conSG(X,Y|Z2) includes all the paths between X and Y except the paths through the nodes in Z2 ∩ Z1; that is, d-conSG(X,Y|Z2) = Spath(X,Y) − d-sepSG(X,Y|Z2 ∩ Z1). We have d-sepSG(X,Y|Z2 ∩ Z1) ⊂ d-sepSG(X,Y|Z1) because (Z1 ∩ Z2) ⊂ Z1. Thus, d-conSG(X,Y|Z1) ⊂ d-conSG(X,Y|Z2), and from Assumption 3, depD(X,Y|Z1) < depD(X,Y|Z2) holds. □

Lemma 3. Let G = <V, E> be a DAG, X,Y ∈ V and Z ⊆ V. Under Assumption 3, if Z is a minimal subset of V such that d-sepG(X,Y|Z), and there exist two sets Z1, Z2 such that Z2 ⊂ Z1 ⊂ Z, then depD(X,Y|Z1) < depD(X,Y|Z2).

Proof. If Z is the minimal conditioning set satisfying d-sepG(X,Y|Z), then from Definitions 3 and 4 the nodes in Z can only be chain or fork nodes, not colliders or descendants of colliders, on the paths between X and Y. So d-conSG(X,Y|Z1) includes all the paths between X and Y (denoted Spath(X,Y)) except the paths through the nodes in Z1 (denoted d-sepSG(X,Y|Z1)); that is, d-conSG(X,Y|Z1) = Spath(X,Y) − d-sepSG(X,Y|Z1). Similarly, d-conSG(X,Y|Z2) = Spath(X,Y) − d-sepSG(X,Y|Z2). We have d-sepSG(X,Y|Z2) ⊂ d-sepSG(X,Y|Z1) because Z2 ⊂ Z1; thus, d-conSG(X,Y|Z1) ⊂ d-conSG(X,Y|Z2). From Assumption 3, depD(X,Y|Z1) < depD(X,Y|Z2) holds. □


Theorem 2. Under Assumptions 1–3, all the d-connection and d-separation relationships returned by algorithmHPC(D, ε) are true.

Proof. For any node pair X and Y in D, there are two cases:

(1) If there is a conditioning set Z that d-separates X and Y, then from Lemma 1 we always select nodes adjacent to X or to Y (line 4). From Lemma 2, given order = 0, in lines 33–36 we can select one node of Z into Sep[X,Y] (or Sep[Y,X]) according to the conditional dependence values depD(·). If |Z| > 1, then as order is incremented (line 39), according to Lemma 3, algorithmHPC ensures that only nodes in Z can enter Sep[X,Y] (or Sep[Y,X]) by checking depD(·) (lines 12–14 and lines 25–27). According to Lemma 2, in lines 6–17 and lines 19–30 only the nodes in Z can be added into Sep[X,Y] (or Sep[Y,X]), because only the nodes in Z can decrease depD(X,Y|Sep[X,Y]) or depD(Y,X|Sep[Y,X]) when the sizes of Sep[X,Y] (or Sep[Y,X]) are the same. The above procedure is repeated until Z is found. From Assumptions 1 and 2, X and Y are then d-separated by Z, and this is detected by a CI test (lines 9, 22 and 34).

(2) If there is no conditioning set that d-separates X and Y, then from Theorem 1 none of the CI tests performed by algorithmHPC can detect a d-separation relation between X and Y (line 9, 22 or 34). Hence, after algorithmHPC, the link between X and Y is still connected. □

References

Beinlich, I.A., Suermondt, H.J., Chavez, R.M., Cooper, G.F., 1989. The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks. In: Hunter, J., Cookson, J., Wyatt, J. (Eds.), Proceedings of 2nd European Conference on Artificial Intelligence and Medicine. Springer Berlin, London, UK, pp. 247–256.
Binder, J., Koller, D., Russell, S., Kanazawa, K., 1997. Adaptive probabilistic networks with hidden variables. Mach. Learn. 29 (2–3), 213–244.
Brown, L.E., Tsamardinos, I., Aliferis, C.F., 2004. A novel algorithm for scalable and accurate Bayesian network learning. Medinfo 11, 711–715.
Chen, X.W., Anantha, G., Wang, X., 2006. An effective structure learning method for constructing gene networks. Bioinformatics 22 (11), 1367–1374.
Cheng, J., Greiner, R., Kelly, J., Bell, D., Liu, W., 2002. Learning Bayesian networks from data: an information-theory based approach. Artif. Intell. 137 (1–2), 43–90.
Chickering, D.M., 1996. Learning equivalence classes of Bayesian network structures. In: Horvitz, E., Jensen, F.V. (Eds.), Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI). Morgan Kaufmann, Portland, OR, USA, pp. 150–157.
Chickering, D.M., Meek, C., 2006. On the incompatibility of faithfulness and monotone DAG faithfulness. Artif. Intell. 170 (8–9), 653–666.
Cooper, G.F., Herskovits, E., 1992. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9 (4), 309–347.
Cooper, G.F., 1997. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Min. Knowl. Discov. 1 (2), 203–224.
Cowell, G.R., 2001. Conditions under which conditional independence and scoring methods lead to identical selection of Bayesian network models. In: Breese, J.S., Koller, D. (Eds.), Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI). Morgan Kaufmann, Seattle, WA, USA, pp. 91–97.


Friedman, N., Linial, M., Nachman, I., Pe'er, D., 2000. Using Bayesian networks to analyze expression data. J. Comput. Biol. 7 (3–4), 601–620.
Friedman, N., Nachman, I., Pe'er, D., 1999. Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm. In: Laskey, K.B., Prade, H. (Eds.), Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI). Morgan Kaufmann, Stockholm, Sweden, pp. 206–215.
Gilks, W.R., Richardson, S., Spiegelhalter, D.J., 1996. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, Boca Raton, Florida.
Giudici, P., Castelo, R., 2003. Improving Markov Chain Monte Carlo model search for data mining. Mach. Learn. 50 (1–2), 127–158.
Grant, P.A., Eberharter, A., John, S., Cook, R.G., Turner, B.M., Workman, J.L., 1999. Expanded lysine acetylation specificity of Gcn5 in native complexes. J. Biol. Chem. 274 (9), 5895–5900.
Hartemink, A.J., Gifford, D.K., Jaakkola, T.S., Young, R.A., 2002. Combining location and expression data for principled discovery of genetic regulatory network models. Pac. Symp. Biocomput. 7, 437–449.
Heckerman, D., Geiger, D., Chickering, D., 1995. Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20 (3), 197–243.
Heckerman, D., 1997. Bayesian networks for data mining. Data Min. Knowl. Discov. 1 (1), 79–119.
Heckerman, D., Meek, C., Cooper, G., 1997. A Bayesian approach to causal discovery. Technical Report, Microsoft.
Hereford, L.M., Osley, M.A., Ludwig, T.R., Mclaughlin, C.S., 1981. Cell-cycle regulation of yeast histone mRNA. Cell 24 (2), 367–375.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., Yang, L., Wolting, C., Donaldson, I., Schandorff, S., Shewnarane, J., Vo, M., Taggart, J., Goudreault, M., Muskat, B., Alfarano, C., Dewar, D., Lin, Z., Michalickova, K., Willems, A.R., Sassi, H., Nielsen, P.A., Rasmussen, K.J., Andersen, J.R., Johansen, L.E., Hansen, L.H., Jespersen, H., Podtelejnikov, A., Nielsen, E., Crawford, J., Poulsen, V., Sorensen, B.D., Matthiesen, J., Hendrickson, R.C., Gleeson, F., Pawson, T., Moran, M.F., Durocher, D., Mann, M., Hogue, C.W., Figeys, D., Tyers, M., 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415 (6868), 180–183.
Huang, Z., Li, J., Su, H., Watts, G.S., Chen, H., 2007. Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining. Decis. Support. Syst. 43 (4), 1207–1225.
Husmeier, D., 2003. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 19 (17), 2271–2282.
Imoto, S., Goto, T., Miyano, S., 2002. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pac. Symp. Biocomput. 7, 175–186.
Jensen, C.S., 1997. Blocking Gibbs sampling for inference in large and complex Bayesian networks with applications in genetics. Ph.D. Thesis, Aalborg University, Denmark.
Kalisch, M., Bühlmann, P., 2007. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8, 613–636.
Kurumizaka, H., Wolffe, A.P., 1997. Sin mutations of histone H3: influence on nucleosome core structure and function. Mol. Cell Biol. 17 (12), 6953–6969.
Murphy, K., Mian, S., 1999. Modeling gene expression data using dynamic Bayesian networks. Technical Report, Computer Science Division, University of California, Berkeley, CA.
Neapolitan, R.E., 2003. Learning Bayesian Networks. Prentice Hall, Upper Saddle River, NJ.
Ott, S., Imoto, S., Miyano, S., 2004. Finding optimal models for small gene networks. Pac. Symp. Biocomput. 9, 557–567.
Pe'er, D., Tanay, A., Regev, A., 2006. MinReg: a scalable algorithm for learning parsimonious regulatory networks in yeast and mammals. J. Mach. Learn. Res. 7, 167–189.
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA.
Pearl, J., Verma, T.S., 1991. A theory of inferred causation. In: Allen, J.F., Fikes, R., Sandewall, E. (Eds.), Principles of Knowledge Representation and Reasoning: Proceedings of the 2nd International Conference. Morgan Kaufmann, San Mateo, CA, USA, pp. 441–452.


Peña, J.M., Björkegren, J., Tegnér, J., 2005. Scalable, efficient and correct learning of Markov boundaries under the faithfulness assumption. In: Godo, L. (Ed.), Proceedings of the 8th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty. Springer, Barcelona, Spain, pp. 136–147.
Peña, J.M., Björkegren, J., Tegnér, J., 2006. Growing Bayesian network models of gene networks from seed genes. Bioinformatics 21 (2), ii224–ii229.
Segal, E., Pe'er, D., Regev, A., Koller, D., Friedman, N., 2005. Learning module networks. J. Mach. Learn. Res. 6, 557–588.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B., 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9 (12), 3273–3297.
Spirtes, P., Glymour, C., Scheines, R., 1993. Causation, Prediction, and Search. Springer, New York.

Spirtes, P., Meek, C., 1995. Learning Bayesian networks with discrete variables from data. In: Proceedings of the First International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, USA, pp. 294–299.
Statnikov, A., Tsamardinos, I., Aliferis, C.F., 2003. An algorithm for generation of large Bayesian networks. Technical Report DSL TR-03-01, May 28, 2003. Vanderbilt University, Nashville, TN, USA.
Steck, H., Jaakkola, T., 2002. On the Dirichlet prior and Bayesian regularization. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Advances in Neural Information Processing Systems (NIPS). MIT Press, Vancouver, BC, Canada, pp. 697–704.
Tsamardinos, I., Brown, L.E., Aliferis, C.F., 2006. The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 65 (1), 31–78.
Wang, M., Lu, H., Chen, Z., Wu, P., 2006. Mining gene expression databases for local causal relationships using a simple constraint-based algorithm. Int. J. Pattern Recogn. 12 (2), 311–327.