BioSystems 103 (2011) 425–434
A multiorganism based method for Bayesian gene network estimation

Zaher Dawy*, Elias Yaacoub, Marcel Nassar, Rami Abdallah, Hady Ali Zeineddine

Department of Electrical and Computer Engineering, American University of Beirut, P.O. Box 11-0236, Riad El-Solh, Beirut 1107 2020, Lebanon
Article info

Article history: Received 29 March 2010; received in revised form 3 August 2010; accepted 6 December 2010.

Keywords: Gene networks; Homologous genes; Bayesian network estimation; Multitask learning
Abstract

The primary goal of this article is to infer genetic interactions based on gene expression data. A new method for multiorganism Bayesian gene network estimation is presented based on multitask learning. When the input datasets are sparse, as is the case in microarray gene expression data, it becomes difficult to separate random correlations from true correlations that would lead to actual edges when modeling the gene interactions as a Bayesian network. Multitask learning takes advantage of the similarity between related tasks, in order to construct a more accurate model of the underlying relationships represented by the Bayesian networks. The proposed method is tested on synthetic data to illustrate its validity. Then it is iteratively applied on real gene expression data to learn the genetic regulatory networks of two organisms with homologous genes. © 2010 Elsevier Ireland Ltd. All rights reserved.
1. Introduction

The modeling of gene interactions as networks is becoming increasingly widespread for gaining insight into various cellular properties, such as cellular state dynamics and transcriptional regulation (Brun et al., 2007). The resulting networks provide a high level description of the gene expression system by predicting how genes interact with each other through regulatory actions and translational feedback. Bayesian genetic regulatory networks are attractive in analyzing gene expression data for their ability to describe complex stochastic processes (Neapolitan, 2004). A Bayesian network can be regarded as a task, a more general term used in the machine learning literature. Multitask learning is an approach that learns a problem together with other related problems, using a shared representation. This often leads to a better model for the main task, because it allows the learner to use the common information among the tasks (Caruana, 1997). Multitask learning is well suited to domains where the cost of collecting more data is prohibitive or there is limited training data associated with each task, as in the case of DNA microarrays (Xue et al., 2007). DNA microarrays measure gene expression levels, and provide data samples that are sparse compared to the number of interacting genes that need to be modeled (Friedman et al., 2000). Although the advent of microarray technology has allowed a revolutionary transition from the exploration of the expression of a handful of genes to that of entire genomes,
* Corresponding author. Tel.: +961 1 350000x3538; fax: +961 1 744462.
E-mail addresses: [email protected] (Z. Dawy), [email protected] (E. Yaacoub), [email protected] (M. Nassar), [email protected] (R. Abdallah), [email protected] (H.A. Zeineddine).
doi:10.1016/j.biosystems.2010.12.004
microarray data has proved difficult to analyze, partly due to the significant amount of noise (Hecker et al., 2009), but also due to the large number of factors that influence gene expression and the complexity of their interactions (Badea and Tilivea, 2006). Hence, a key problem in structure learning from microarray data is that the number of gene expression levels is large compared to the number of available samples (Fung and Ng, 2007). Recent studies have revealed highly conserved proteins that exhibit similar expression patterns in different organisms, and play a role in gene regulation. Therefore, this evolutionary information can be used to refine regulatory relationships among genes, which are estimated from gene expression data, and bias the learning process. In genomics, sequence homology implies that the given sequences share a common ancestor. Orthology is a type of sequence homology that results from a speciation event, i.e., a parent species diverging into two new species. Orthologous genes usually preserve protein expression and regulation functions and can be used to quantify the evolutionary information between two organisms (Bergmann et al., 2004). A statistical method to estimate gene networks from DNA microarray data and protein–protein interactions is presented in the work of Nariai et al. (2004). The method adds knowledge about protein–protein interactions to the estimation method of gene networks under a Bayesian statistical framework. In the estimated gene network of Nariai et al. (2004), a protein complex is modeled as a virtual node based on principal component analysis. In the work of Imoto et al. (2002), the bootstrap method is used to measure the reliability of the gene network estimated from microarray data. Imoto et al. (2002) applied the bootstrap method to the Saccharomyces cerevisiae gene expression data. The authors focused on 521 genes. 
They repeated the bootstrap algorithm 100 times in order to measure the intensity of the edges and the degree of
confidence of the direction of the Bayes causality. However, the approach of Imoto et al. (2002) is based on data from a single organism. Tamada (2005) utilized evolutionary information between two organisms to estimate the individual gene networks. Assuming two organisms A and B with respective gene expression data D_A and D_B, the networks of the two organisms, G_A and G_B, are built simultaneously by a Hill-Climbing algorithm that maximizes the posterior probability function P(G_A, G_B | D_A, D_B, H_AB), where H_AB models the evolutionary information between A and B. As part of their algorithm, the authors calculate P(H_AB | G_A, G_B) based on gene expression data of the orthologous gene pairs between A and B, in addition to two free parameters N and P that need to be set. However, no systematic approach is given for choosing N and P. Instead, the values are chosen empirically by comparing the obtained networks to established biological pathways. The motivation in this paper is to use the gene network of an organism to determine the underlying structure of the gene network of another, more complex, organism, using the notion of orthologous genes. The proposed method is based on a multitask learning approach. In multitask learning, the tasks are usually not identical, and hence cannot simply be treated as a single task by merging the available datasets. Instead, the tasks are assumed to be dependent; consequently, treating them as independent tasks, as in single task learning, might cause loss of useful information. The motivation for assuming dependencies between tasks in the gene network context arises from the possibility of conservation of translational machinery across species at the expression level via orthologous genes, as noted by Jimenez et al. (2003) and Bergmann et al. (2004). As noted by Caruana (1997), learning a task requires inductive biases towards some elements of the search space.
In multitask learning, the extra tasks serve as an additional inductive bias for the learning of the main task, on top of the bias provided by the model distribution and available data. This paper proposes a new multitask learning method targeted at gene network estimation. Emphasis is given to the structure learning problem of the Bayesian network. This improves parameter learning indirectly, since parameter learning depends on structure learning (Neapolitan, 2004). A new score function is proposed in order to capture the evolutionary information between two organisms via a parameter β without requiring any additional free parameters. The proposed score function can be localized easily for reduced computational complexity, with the local score being a function of the node and its parents only. In addition, a bootstrap based approach is used to estimate the confidence of the edges induced by the microarray data. Results for synthetic data and real gene expression profiles are presented and analyzed to validate the proposed method. This paper is organized as follows. The system model is presented in Section 2. The proposed method is explained in Section 3, where similarity and confidence measures are derived, and the proposed algorithm is detailed. Results are presented and discussed in Section 4. First, the proposed method is tested on synthetic data in order to demonstrate its validity. Then it is iteratively applied on real gene expression data from humans and yeast in order to learn the genetic regulatory networks of the two organisms with homologous genes. Ideas for future extension of this work are presented in Section 5. Finally, conclusions are drawn in Section 6.
2. System Model

In this paper, gene networks are modeled as Bayesian networks. A Bayesian network consists of two components, G and Θ, where G is a directed acyclic graph (DAG) whose n nodes (genes) correspond to the random variables X_1, X_2, . . ., X_n, while Θ describes the conditional probability distribution of each node X_i given its parents¹ Pa(X_i), for i = 1, . . ., n. A basic assumption in Bayesian networks is the applicability of the Markov assumption: each variable X_i is independent of its non-descendants, given its parents in G. This property allows the joint probability distribution P(X_1, X_2, . . ., X_n) to be decomposed into the following product form (Neapolitan, 2004):

P(X_1, X_2, . . ., X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i))    (1)
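As a concrete illustration of the factorization in (1), consider a hypothetical three-gene chain X1 → X2 → X3; all probability values below are invented for illustration and are not taken from the paper:

```python
# Joint probability of a toy Bayesian network X1 -> X2 -> X3 via the
# factorization P(X1, X2, X3) = P(X1) * P(X2|X1) * P(X3|X2).
# All conditional probability tables (CPTs) below are made up.

p_x1 = {0: 0.6, 1: 0.4}                       # P(X1)
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3},         # P(X2 | X1 = 0)
                 1: {0: 0.2, 1: 0.8}}         # P(X2 | X1 = 1)
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1},         # P(X3 | X2 = 0)
                 1: {0: 0.5, 1: 0.5}}         # P(X3 | X2 = 1)

def joint(x1, x2, x3):
    """P(X1=x1, X2=x2, X3=x3) assembled from the local CPTs, as in Eq. (1)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Sanity check: the factorized joint must sum to 1 over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

The same pattern extends to any DAG: each node contributes one local factor conditioned only on its parents.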
Modeling a gene network as a Bayesian network is motivated by the fact that the causal relations between genes can be described qualitatively through the presence of edges, and quantitatively through the conditional probability distributions. Biological factors, measurement inaccuracies, and data sparsity prohibit the prediction of deterministic relations between genes, and act as a noise source that results in the probabilistic nature of relations. In order to exploit multitask learning, a quantitative link between the Bayesian networks of the different tasks (organisms) must be determined. To model this link, a similarity parameter β is proposed. This parameter represents the similarity of the underlying true Bayesian networks, i.e., the Bayesian networks that would be learned if complete information were available. Let G_1 = (V, E_1) and G_2 = (V, E_2) be the two DAGs of B_1 and B_2, the underlying true Bayesian networks of organisms 1 and 2, respectively, where V is the set of vertices (genes) and E_i is the set of edges (regulatory relations). The similarity parameter β is defined as follows:

β ≜ P(e ∈ E_2 | e ∈ E_1) = P(e ∉ E_2 | e ∉ E_1)    (2)

where it is assumed that the probability of the presence of an edge e in both networks and that of its absence from both are the same. Since ∈ is a binary relation, it can be deduced that P(e ∉ E_2 | e ∈ E_1) = P(e ∈ E_2 | e ∉ E_1) = 1 − β. Based on this definition, it is difficult to obtain a numerical value for the parameter β from general evolutionary biological information. To alleviate this problem, the known common orthologous genes between the two organisms are used in order to estimate the value of β. The following estimation procedure is proposed:

Step 1: Isolate the orthologous genes between the two organisms.
Step 2: Construct the Bayesian network for the orthologous genes of each organism separately using single network learning algorithms such as the K2 algorithm discussed in Section 3.1.
Step 3: Obtain an estimate for β using the following expression:

β = 2N_c / (|E_1| + |E_2|)    (3)

where N_c is the number of common edges between the two DAGs G_1 = (V, E_1) and G_2 = (V, E_2), and |E_i| is the total number of edges in G_i. An example is shown in Fig. 1. Networks (a) and (b) have eight edges in common, with network (a) having 12 edges and network (b) having eight edges. This leads to β = 2 × 8/(12 + 8) = 0.8, which corresponds to a relatively high similarity. Conversely, networks (a) and (c) have four edges in common, with network (a) having 12 edges and network (c) having eight edges. This leads to β = 2 × 4/(12 + 8) = 0.4, which corresponds to a relatively low similarity. To find the best matching graph using the single network approach (i.e., using the data of just one organism), the scoring
¹ The set of parents of X_i contains all nodes that have an edge directed towards X_i.
Fig. 1. Example of estimating the parameter β between three different gene networks, where the orthologous genes are denoted by the same letter (but have different numbers) in the different networks: solid lines correspond to edges common between networks (a) and (b), dashed lines correspond to edges common between networks (a) and (c), and dotted lines correspond to edges existing only in network (c).
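The estimate in Step 3 is a one-line computation. The sketch below reproduces the Fig. 1 arithmetic; the edge sets are synthetic stand-ins chosen only so that 8 common edges, |E_1| = 12, and |E_2| = 8 give β = 0.8:

```python
# Sketch of Step 3: estimating the similarity parameter beta from the edge
# sets of two DAGs over the orthologous genes, per Eq. (3).
# Edges are (parent, child) tuples; the gene names below are illustrative.

def estimate_beta(edges1, edges2):
    """beta = 2 * Nc / (|E1| + |E2|), where Nc counts common edges."""
    e1, e2 = set(edges1), set(edges2)
    n_common = len(e1 & e2)
    return 2.0 * n_common / (len(e1) + len(e2))

common = [("g0", f"g{i}") for i in range(1, 9)]   # 8 edges shared by both DAGs
only1 = [("h0", f"h{i}") for i in range(1, 5)]    # 4 edges unique to network 1
beta = estimate_beta(common + only1, common)      # 2*8 / (12+8) = 0.8
```

A low overlap between the two learned orthologous-gene networks drives β towards 0, weakening the cross-organism bias in the score function of Section 3.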
functions must be proportional to P(G|D), which can be expressed as follows:

P(G|D) = P(D|G)P(G) / P(D) ∝ P(D|G)P(G)    (4)
where G is the DAG under evaluation, D is the available dataset, and P(G) is assumed to be uniformly distributed if no prior knowledge is available. Maximizing P(G|D) is thus equivalent to maximizing P(D|G), which in turn can be decomposed into maximizing local scores by the Markov property. This allows the use of local search techniques which, for each node X_i, search for the parent set Pa(X_i) that maximizes the local scoring function. The scoring function derived in Section 3.2 can be used with various existing structure learning algorithms such as the K2, Hill-Climbing, and Markov Chain Monte Carlo (MCMC) algorithms, as done by Friedman et al. (2000) and Stuart et al. (2003). Hill-Climbing starts from a random graph structure. At every iteration the change in the score function Δ(e) for every edge operation e is computed, and the operation with the highest Δ(e) is chosen. The search terminates if, during an iteration, Δ(e) of all edges is less than zero. To avoid the local maxima, encountered easily in this case, greedy search is applied with multiple restarts. The chosen structure at the end will be the highest scoring graph among all the local-maxima graphs, e.g., see Heckerman (1999) and Neapolitan (2004). The MCMC algorithm tries to find the probability P(G|D) for every possible structure G, and then averages every feature (edge) over all the possible graphs, e.g., see Madigan and Raftery (1994), Madigan and York (1995), Husmeier (2004), and Neapolitan (2004). MCMC starts from any initial structure G = G_0. Then, it iterates n times such that at each iteration, G is changed from structure G_i to a neighboring structure G_j. The two structures differ in one edge, with a transition probability that depends on P(G_i|D) and P(G_j|D). On the other hand, K2 can be implemented easily and does not have a randomized component in it (Friedman et al., 2000).
It uses the order information instead, where an ordered permutation of the nodes is used such that only the nodes before a given node can be possible parents of that node. Although the order requirement may seem restrictive in practical applications, it is a major advantage when it comes to score comparison. In fact, the variations in accuracy between different versions of K2 are mostly due to the score function, in contrast to Hill-Climbing and MCMC, where the accuracy of the algorithm can depend heavily on its randomized procedure. It should be noted that the order information is not as restrictive as it seems: if the ordering is unknown, searching over orderings is more efficient than searching over DAGs as in MCMC and Hill-Climbing, e.g., see Friedman et al. (2000) and Stuart et al. (2003). Hence, the proposed methods in this work will be based on the K2 algorithm.

3. The Proposed Multiorganism Gene Network Estimation Method

In this section, we start by presenting the basic structure of the K2 algorithm and deriving the score function required as part of the proposed multiorganism gene network estimation method. Then, the proposed method is detailed and a low complexity implementation is presented. Finally, complexity analysis is performed.

3.1. The K2 Algorithm

The K2 algorithm benefits from the local decomposability of the Bayesian score (BDe) across the parents of every node to locally search for the best DAG, thereby minimizing the search space of the problem. This in turn reduces the required computational complexity. The algorithm assumes that an ordering on the domain variables is available, i.e., if X_i precedes X_j in the assumed ordering, all the structures with X_j → X_i are neglected, which makes the number of remaining possible structures equal to 2^{n(n−1)/2}. In addition, the algorithm poses an upper bound u on the number of parents a variable can have. For each node, the K2 algorithm adds a parent to the node as long as this parent causes a positive change in the score. Algorithm 1 presents a general description of the K2 algorithm.

Algorithm 1. K2 algorithm.
1: for i such that 1 ≤ i ≤ n do
2:   parent(i) = Null
3:   P_old = f(i, parent(i))
4:   proceed = true
5:   while |parent(i)| < u AND proceed = true do
6:     Let z be a node of order preceding x_i that maximizes f(i, parent(i) ∪ {z})
7:     P_new = f(i, parent(i) ∪ {z})
8:     if P_new > P_old then
9:       P_old = P_new
10:      Add z to parent(i)
11:    else
12:      proceed = false
13:    end if
14:  end while
15: end for
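A minimal runnable sketch of this greedy loop is shown below. The local score f(i, parents) is abstracted as a `score` callable; the toy score used in the example is invented purely so the sketch executes and is not the BDe score of the paper:

```python
# Sketch of the K2 greedy parent-selection loop of Algorithm 1.

def k2(order, u, score):
    """Per-node greedy parent selection, bounded by u parents per node."""
    parents = {i: set() for i in order}
    for pos, i in enumerate(order):
        p_old = score(i, parents[i])
        proceed = True
        while len(parents[i]) < u and proceed:
            # Only nodes preceding x_i in the ordering may become its parents.
            candidates = [z for z in order[:pos] if z not in parents[i]]
            if not candidates:
                break
            z = max(candidates, key=lambda c: score(i, parents[i] | {c}))
            p_new = score(i, parents[i] | {z})
            if p_new > p_old:          # keep z only if the score improves
                p_old = p_new
                parents[i].add(z)
            else:
                proceed = False
    return parents

# Toy score: node 2 "prefers" node 0 as a parent; extra parents are penalized.
def toy_score(i, pa):
    return (1.0 if (i == 2 and 0 in pa) else 0.0) - 0.1 * len(pa)

result = k2([0, 1, 2], u=2, score=toy_score)   # node 2 ends up with parent {0}
```

In the proposed method, only the `score` callable changes: the BDe term f(i, π_i) is multiplied by the cross-organism biasing factor derived in Section 3.2.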
The K2 algorithm tries to find the most probable structure given the data. Hence, it tries to find the structure that maximizes P(G|D), through maximizing P(D|G) since all networks are assumed to have equal priors. This maximization reduces to the following form:

max_G P(D|G) = ∏_{i=1}^{n} max_{π_i} f(i, π_i)    (5)

where n is the number of nodes of G, i is a node index corresponding to node x_i of G, and π_i is the set of parents of x_i. The term f(i, π_i) evaluates the probability that the set π_i is the real set of parents of x_i, denoted by Pa_i, given the data D. In other words, f(i, π_i) ∝ P(Pa_i = π_i | D). Detailed information about the K2 algorithm is presented by Friedman et al. (2000).

3.2. Score Function Derivation

In this section, the score function of the proposed method is derived. The inputs to the method are the following: data samples D of a given organism, input DAG G_in of another organism, and similarity parameter β between the organisms. The output of the method is a learned improved DAG structure G_out. Generalizing the Bayesian score approach (BDe), the objective is to find a structure that maximizes P(G|D, G_in) instead of P(G|D), where

P(G|D, G_in) = P(G, D|G_in) / P(D|G_in)    (6)

Using the fact that P(D|G_in) is independent of G, we obtain

P(G, D|G_in) = P(D|G, G_in) P(G|G_in)    (7)

In addition, D is independent of G_in given G, since G_in provides only structural information, and D depends only on the structural information of its own network. Therefore,

P(G, D|G_in) = P(D|G, G_in) P(G|G_in) = P(D|G) P(G|G_in)    (8)

P(D|G) is the BDe score used in the K2 algorithm, whereas the new factor P(G|G_in) will perform the biasing towards the input network. The above scores are decomposed into local scores, involving only nodes and their parents, as is done in the K2 algorithm. The local score corresponding to P(D|G) is represented by the f(i, π_i) terms in the K2 algorithm. The local score of P(G|G_in) is P(Pa_i^(1) = Π | Pa_i^(2) = α), that is, the probability that the set of parents Pa_i^(1) of x_i in G is Π given that the set of parents Pa_i^(2) of x_i in G_in is α. To calculate P(Pa_i^(1) = Π | Pa_i^(2) = α), the definition of the similarity parameter β is used. The set of parents of node x_i can be viewed as the set of edges incident to the node x_i. Ps_i is defined to be the set of all possible parents of x_i, and E_i^s denotes the set of all possible edges that are incident to the node x_i. Similarly, this definition is extended to any set of parents of x_i. That is, given Π to be a set of parents of node x_i, E_i^Π is defined to be the set of all incident edges corresponding to the set Π. In other words, if x_j ∈ Π then the edge (x_j, x_i) ∈ E_i^Π, and vice versa. Clearly, Pa_i^(1) ⊆ Ps_i and Pa_i^(2) ⊆ Ps_i. To decompose P(Pa_i^(1) = Π | Pa_i^(2) = α), the probability that the node x_i has the set Π as parents given that the prior (input) structure has α as a set of parents, the corresponding edge probabilities are used. In other words, every possible edge e in E_i^s is checked, and its absence or presence in the sets E_i^Π and E_i^α is determined. This is illustrated by the following decomposition:

P(Pa_i^(1) = Π | Pa_i^(2) = α) = ∏_{e ∈ E_i^Π, E_i^α} P(e ∈ E_i^Π | e ∈ E_i^α) × ∏_{e ∉ E_i^Π, E_i^α} P(e ∉ E_i^Π | e ∉ E_i^α) × ∏_{e ∉ E_i^Π, e ∈ E_i^α} P(e ∉ E_i^Π | e ∈ E_i^α) × ∏_{e ∈ E_i^Π, e ∉ E_i^α} P(e ∈ E_i^Π | e ∉ E_i^α)    (9)

where e ∈ E_i^s. In the proposed method, the input DAG has been learned by a traditional structure learning algorithm and thus may contain errors. When the input structure is a learned structure possibly containing errors rather than the real structure, it will naturally induce errors in the output structure. Therefore, the proposed method must be adaptive to the quality of the input network. Consequently, a measure of the confidence in the input structure should be included as an additional parameter.

3.2.1. Confidence Measure c

The confidence in a given DAG is represented by c. The confidence indicates to what degree the DAG is warranted to be correct. Let G_LR = (V, E_LR) be the learned DAG of an organism with a true DAG G_TR = (V, E_TR); then the confidence c can be defined as follows:

c = P(e ∈ E_TR | e ∈ E_LR) = P(e ∉ E_TR | e ∉ E_LR)    (10)

By applying a similar analysis to that of the similarity parameter β, it can be noted that P(e ∉ E_TR | e ∈ E_LR) = P(e ∈ E_TR | e ∉ E_LR) = 1 − c. In order to compute the confidence c, an approach based on bootstrapping is proposed.

3.2.2. Bootstrap Based Approach

The main problem imposed by the sparse nature of the datasets is that the learning algorithms fail to separate the measurable signal in the data from the noise, that is, the genuine correlation and causation properties from the spurious (random) correlations, as noted by Friedman et al. (1999). Thus, a method is needed to determine the level of confidence in the various structural features of the Bayesian networks induced from datasets. The bootstrap method is used to measure the likelihood that a given feature is actually true. Let B be a given Bayesian network with structure G. Let f be a feature of the structure G, such as the presence/absence of an edge/path from a node x to y, or the presence of a node y in the Markov blanket of node x. Given a dataset D of N instances sampled from the network B, and H(D) the structure induced from D, the following quantity is defined by Friedman et al. (1999): P_N(f) = Pr{f(H(D)) = 1 | |D| = N}, where f : G → {0, 1}. Thus, the quantity P_N(f) is the probability of inducing a network structure with feature f from any dataset of size N sampled from the network B. As N grows large, P_N(f) converges to f(G), such that as P_N(f) approaches 1, the confidence in f also grows to 1. The bootstrap method aims at approximating this quantity from one dataset D. The confidence in a feature must be high if it is still induced when the data is perturbed. In Hecker et al. (2009), it is indicated that genes with the most uncertain connections in the network model are perturbed in order to incrementally determine a gene network with a few experiments.
A way to perturb the data and yet maintain the general statistical features of the dataset is the non-parametric bootstrap that is performed by the steps presented in Algorithm 2, proposed by Friedman et al. (1999).
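As a rough sketch, the resampling loop of this non-parametric bootstrap might be implemented as follows; the `learn` argument stands in for any structure learner (e.g., K2), and the stub learner used in the usage line is purely illustrative:

```python
# Non-parametric bootstrap sketch: resample the dataset with replacement m
# times, relearn a structure from each replicate, and record the fraction of
# replicates in which each edge appears. That fraction approximates P_N(f)
# for edge features f, and serves as the per-edge confidence c(e).

import random

def bootstrap_edge_confidence(data, learn, m=100, seed=0):
    """Return {edge: fraction of bootstrap replicates containing that edge}."""
    rng = random.Random(seed)
    counts = {}
    n = len(data)
    for _ in range(m):
        # Sample n instances with replacement (one bootstrap replicate).
        replicate = [data[rng.randrange(n)] for _ in range(n)]
        for edge in learn(replicate):
            counts[edge] = counts.get(edge, 0) + 1
    return {e: c / m for e, c in counts.items()}

# Stub learner that always "learns" the same edge, so its confidence is 1.0.
conf = bootstrap_edge_confidence(list(range(20)), lambda d: {("x1", "x2")}, m=50)
```

With a real learner, edges that persist across most replicates receive confidence near 1, while edges induced by random correlations appear in few replicates and receive low confidence.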
Algorithm 2. Non-parametric bootstrap algorithm.
1: for i such that 1 ≤ i ≤ m do
2:   Re-sample, with replacement, N instances from D. Denote by D_i the resulting dataset.
3:   Apply the learning procedure on D_i to induce a network structure H_i = H_i(D_i).
4: end for
5: For each feature of interest, compute the quantity: P_N(f) = (1/m) Σ_i f(H_i).

In the proposed method the bootstrap approach is used to estimate the confidence in the presence or absence of edges in the structure that is induced from one dataset. This confidence will help in improving the learning across similar networks.

3.3. The Proposed Method

In this section, the proposed method is explained, which makes use of evolutionary relations between two organisms to estimate their genetic regulatory networks. The input to the method consists of the data samples D, input DAG G_in, structure confidence c, and similarity β. The output of the algorithm is the estimated DAG G_out. When the confidence c is obtained by bootstrapping, it will be different for each edge. Hence, it will be denoted by c(e). A new parameter λ(β, c(e)) is introduced to jointly capture both the similarity between networks (the β parameter) and the confidence in each edge (the c(e) parameters). It should be noted that λ(β, c(e)) is also computed for the absence of an edge, as is seen from the definition of β. For a given Bayesian network B = (G, Θ), the true structure G_TR = (V, E_TR) is defined as G, and the learned structure G_LR = (V, E_LR) is the structure learned from the sampled data by any learning algorithm. Hence,

P(e ∈ E_TR | e ∈ E_LR) = P(e ∉ E_TR | e ∉ E_LR) = c(e)

The required parameter is P(e ∈ E_2 | e ∈ E_1^LR) = λ(β, c(e)), that is, the probability that an edge e is in the network which is attempted to be learned, given that it is in the learned structure E_1^LR of the input (prior) network. λ(β, c(e)) is determined as follows:

λ(β, c(e)) = P(e ∈ E_2 | e ∈ E_1^LR) = P(e ∈ E_2 | e ∈ E_1^TR, e ∈ E_1^LR) P(e ∈ E_1^TR | e ∈ E_1^LR) + P(e ∈ E_2 | e ∈ E_1^LR, e ∉ E_1^TR) P(e ∉ E_1^TR | e ∈ E_1^LR)    (11)

Noticing that P(e ∈ E_2 | e ∈ E_1^TR) is independent of e ∈ E_1^LR, the term P(e ∈ E_2 | e ∈ E_1^LR, e ∈ E_1^TR) reduces to P(e ∈ E_2 | e ∈ E_1^TR) = β. Similar reasoning can be applied to the other probability. Using the definitions of β and c(e), the following expression for λ(β, c(e)) can be derived:

λ(β, c(e)) = c(e)β + (1 − β)(1 − c(e)) = 1 + 2βc(e) − β − c(e)

Hence,

P(Pa_i^(1) = Π | Pa_i^(2) = α) = ∏_{e ∈ E_i^Π, E_i^α} λ(β, c(e)) × ∏_{e ∉ E_i^Π, E_i^α} λ(β, c(e)) × ∏_{e ∉ E_i^Π, e ∈ E_i^α} [1 − λ(β, c(e))] × ∏_{e ∈ E_i^Π, e ∉ E_i^α} [1 − λ(β, c(e))]    (12)

Consequently, the steps of the proposed method are described in Algorithm 3.

Algorithm 3. Proposed method.
1: Calculate λ(β, c(e)) for each edge and for each absence of an edge.
2: Consider the set of all possible edges, and compare it with the set of edges being evaluated and the set of edges in the input structure.
3: Form the product in (12), by checking for every edge in the set of all possible edges if it is present in both the structure under consideration and the input structure, if it is in one of them but not in the other, or if it is in neither.
4: if (it is present in both) OR (it is in neither of them) then
5:   multiply the product by λ(β, c(e)) corresponding to this edge.
6: else
7:   if it is present in one of them without being in the other then
8:     multiply by 1 − λ(β, c(e))
9:   end if
10: end if
11: Multiply the above product by f(i, π_i) in the K2 algorithm.

3.4. Low Complexity Implementation

In this section, a simplified version of the proposed method is presented. This version was outlined by Nassar et al. (2008). Assuming that no confidence information is available, i.e., assuming that the learned structure is in fact the true structure, then c = 1 in (11). Hence, it can be easily shown that λ(β, c(e)) = β for all e, i.e.,

P(e ∈ E_2 | e ∈ E_1) = P(e ∉ E_2 | e ∉ E_1) = β

In this case, based on the definition of the similarity parameter β, every factor in (12) equals β (edges on which the two structures agree) or 1 − β (edges on which they differ). To solve (12), let H = E_i^Π ⊕ E_i^α be the set of all elements that are not common to the two sets. Defining m = |H|, the number of elements in H, and N = |Ps_i|, (12) reduces to

P(Pa_i^(1) = Π | Pa_i^(2) = α) = (1 − β)^m β^{N−m} = ((1 − β)/β)^m β^N    (13)

Since N is common to all the structures being evaluated, the factor β^N can be dropped and the final score expression can be expressed as

P(Pa_i^(1) = Π, D | Pa_i^(2) = α) ∝ f(i, Π) ((1 − β)/β)^m    (14)

As a result, the steps presented in Algorithm 4 should be used to calculate a new score function to replace f(i, π_i) in the K2 algorithm.

Algorithm 4. Calculating a new score function to replace f(i, π_i) in the K2 algorithm.
1: Find m, the number of edges that are different between the parents under consideration and the parents of the input structure.
2: Multiply f(i, π_i) by ((1 − β)/β)^m.

This simplified version of the proposed method leads to a score function depending only on a single parameter β. No bootstrapping is applied in order to estimate the confidence c(e) of each edge. Hence, the computational complexity is reduced. Nevertheless, this simplicity is expected to come at the expense of accuracy.
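The two score-biasing quantities of Sections 3.3 and 3.4, λ(β, c(e)) from (11) and the simplified multiplicative factor ((1 − β)/β)^m of Algorithm 4, can be sketched as follows (the function names are ours, not the paper's):

```python
# Sketch of the score-biasing quantities of the proposed method.

def lam(beta, c):
    """lambda(beta, c) = c*beta + (1-beta)*(1-c) = 1 + 2*beta*c - beta - c."""
    return c * beta + (1.0 - beta) * (1.0 - c)

def simplified_factor(parents, input_parents, beta):
    """((1-beta)/beta)**m with m = size of the symmetric difference between
    the candidate parent set and the input structure's parent set (Alg. 4)."""
    m = len(set(parents) ^ set(input_parents))
    return ((1.0 - beta) / beta) ** m

# Full confidence (c = 1) collapses lambda to beta, as noted in Section 3.4.
full_conf = lam(0.8, 1.0)                       # equals beta = 0.8
factor = simplified_factor({1, 2}, {2, 3}, 0.8)  # m = 2 -> 0.25**2
```

Multiplying f(i, π_i) by `simplified_factor` penalizes candidate parent sets geometrically in the number of edges on which they disagree with the input network, with β > 0.5 making agreement favorable.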
3.5. Complexity Analysis of the Proposed Method

The proposed method and its simplified version have the same order of complexity as the K2 algorithm. In fact, the proposed methods differ from K2 only in the computation of the score function. Assuming the order information of the n nodes and the maximum number of parents u of each node are provided, the complexity can be computed as follows:

• For the node having the lowest order (i.e., any of the other nodes can possibly be one of its parents), (n − 1) evaluations of the score function are needed to determine the first parent, (n − 2) evaluations are needed to determine the second parent, and (n − u) evaluations are needed to determine the last parent. Hence, the total number of evaluations is (n − 1) + (n − 2) + · · · + (n − u) = u(2n − u − 1)/2.
• For the node preceding the previous node in the DAG ordering, any of the other nodes can possibly be one of its parents except the first node mentioned above. Hence, (n − 2) evaluations of the score function are needed to determine the first parent, (n − 3) evaluations are needed to determine the second parent, and (n − u − 1) evaluations are needed to determine the last parent. Hence, the total number of evaluations is (n − 2) + (n − 3) + · · · + (n − u − 1) = u(2n − u − 3)/2.
• For the node preceding the previous two nodes in the DAG ordering, any of the other nodes can possibly be one of its parents except the first two nodes mentioned above. Hence, (n − 3) evaluations of the score function are needed to determine the first parent, (n − 4) evaluations are needed to determine the second parent, and (n − u − 2) evaluations are needed to determine the last parent. Hence, the total number of evaluations is (n − 3) + (n − 4) + · · · + (n − u − 2) = u(2n − u − 5)/2.
• When the node having exactly u nodes preceding it in the DAG ordering is reached, then, proceeding as above, the number of evaluations is u[2n − u − (2(n − u) − 1)]/2 = u(u + 1)/2.
• The number of score function evaluations for the remaining nodes is u(u − 1)/2, (u − 1)(u − 2)/2, . . ., 1, respectively.
• The total complexity is then [u(2n − u − 1)/2 + u(2n − u − 3)/2 + · · · + u[2n − u − (2(n − u) − 1)]/2] + [u(u − 1)/2 + (u − 1)(u − 2)/2 + · · · + 1]. Noting that u(2n − u − 1)/2 + u(2n − u − 3)/2 + · · · + u[2n − u − (2(n − u) − 1)]/2 = nu(n − u)/2, and denoting K(u) = u(u − 1)/2 + (u − 1)(u − 2)/2 + · · · + 1, the method complexity can be expressed as O(nu(n − u)/2 + K(u)) ≈ O(n²u).
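The evaluation count above can be checked numerically; the brute-force tally below reproduces the closed form nu(n − u)/2 + K(u):

```python
# Brute-force check of the score-evaluation count in the complexity analysis.

def eval_count(n, u):
    """Total score evaluations for K2 with n ordered nodes and parent bound u,
    counted directly: a node with `pos` candidate parents scans pos, pos-1, ...
    candidates while adding up to min(pos, u) parents."""
    total = 0
    for pos in range(n):            # pos = number of preceding (candidate) nodes
        k = min(pos, u)             # at most u parents are ever added
        total += sum(pos - j for j in range(k))
    return total

def closed_form(n, u):
    """n*u*(n-u)/2 + K(u), with K(u) = u(u-1)/2 + (u-1)(u-2)/2 + ... + 1."""
    K_u = sum(j * (j + 1) // 2 for j in range(1, u))
    return n * u * (n - u) // 2 + K_u

# Example: n = 10 nodes, u = 3 parents -> both counts give 109 evaluations.
direct, closed = eval_count(10, 3), closed_form(10, 3)
```

Both quantities grow as O(n²u), confirming that the biased score function leaves the K2 complexity order unchanged.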
4. Results and Discussion

Before applying the proposed method to real gene expression data, it is first tested on synthetic data provided by the ALARM network. The ALARM network was first introduced by Beinlich et al. (1989). This synthetic network is considered one of the standard benchmark networks for comparison of Bayesian learning algorithms, e.g., see Niculescu-Mizil and Caruana (2005).
4.1. Synthetic Data: ALARM Network The proposed method is applied first to two similar networks derived from the ALARM network. This Bayesian network was constructed from expert knowledge as a medical diagnostic alarm message system for patient monitoring (Beinlich et al., 1989). The network has 37 random variables (nodes) and 46 edges; each random variable takes up to four possible values and has at most five parents. To generate similar networks that resemble biological networks, an approach similar to that of Niculescu-Mizil and Caruana (2005) was used: starting from a base network, two similar networks are generated according to an input parameter Pdel , by passing through all edges of the base network and deleting any given edge with probability Pdel . This procedure can be repeated to generate different networks whose degree of similarity is governed by Pdel . Any two generated networks contain common edges in addition to edges that are present in one and not in the other. The proposed method is compared to the single network K2 algorithm in terms of the following metrics:
Fig. 2. Rate of false positives of the proposed method with ALARM network synthetic data (y-axis: average number of false positives; x-axis: dataset size; curves: K2, New w/o Confidence, New).
(1) False positives: the number of edges absent in the real underlying structure but present in the learned structure; (2) False negatives: the number of edges present in the real underlying structure but absent in the learned structure; (3) Total error: the difference between the real and the learned structures, i.e., the sum of false positives and false negatives.

Figs. 2–4 present the results based on two similar networks with equal size datasets. The y-axis represents the average number of occurrences over multiple runs and the x-axis represents the dataset size. The rate of false positives is shown in Fig. 2. The proposed method, denoted by “New”, clearly outperforms the single network K2 algorithm in terms of false positives. This illustrates the fact that the new method reduces the effect of noisy observations, i.e., the statistical deviations in the data. The results of the simplified version of the method, which does not take confidence measures into account, are also displayed (denoted by “New w/o Confidence” on the plots). The simplified version outperforms K2 but is outperformed by the New approach in terms of false positives. However, the performance in terms of false negatives is comparable, as shown in Fig. 3. This is due to the fact that real edges that are weakly supported by the data might not be identified by the proposed method, which explains why the new method is outperformed by its simplified version in terms of false negatives. In fact, when confidence is taken into account, weakly supported edges have low confidence values and hence are not included in the final DAG.

Fig. 3. Rate of false negatives of the proposed method with ALARM network synthetic data (y-axis: average number of false negatives; x-axis: dataset size; curves: K2, New w/o Confidence, New).

Nevertheless, in terms of total error, the advantage of the proposed method over both K2 and the simplified version is evident, as shown in Fig. 4. The superiority of the simplified version over K2 is also evident. Hence, the low false positive rate of the proposed method compensates for its slightly worse performance in terms of false negatives. Moreover, when biological data is considered, decreasing the false positive rate is normally more important than decreasing the false negative rate. Finally, it can be seen that as the dataset size increases, the error rates decrease due to a reduction in the estimation noise.

Fig. 4. Total error rate of the proposed method with ALARM network synthetic data (y-axis: average number of errors; x-axis: dataset size; curves: K2, New w/o Confidence, New).

4.2. Real Gene Expression Data The proposed method is used to simultaneously estimate the gene networks of two distinct organisms, with a Bayesian network model utilizing the evolutionary information so that the gene expression data of one organism improves the estimation of the gene network of the other organism. The effectiveness of the proposed method is demonstrated through the analysis of Saccharomyces cerevisiae (S. cerevisiae) and Homo sapiens (H. sapiens) cell cycle gene expression data. The proposed method succeeds in estimating gene networks that capture many known relationships as well as some unknown relationships that are likely to be novel. There are only a few datasets where both detailed knowledge of the regulatory pathways and the same type of expression data are available for two distinct organisms, as remarked by Tamada et al. (2005). The proposed method is implemented on a set of ten orthologous genes between yeast (S. cerevisiae) and human (H. sapiens). The list of the chosen genes is shown in Table 1. The data for the yeast genes is obtained from the work of Spellman et al. (1998) and is available at http://genome-www.stanford.edu/cellcycle/.
The data for the human genes is obtained from the work of Whitfield et al. (2002) and is available at http://genome-www.stanford.edu/Human-CellCycle/Hela/. The obtained results are compared to the known relations in the KEGG pathways published by Kanehisa et al. (2006). To start the analysis, the yeast gene network is learned using the classical K2 algorithm based only on the yeast dataset. Then, the proposed method is applied. The approach can be described as follows:
Table 1
Orthologous genes between human and yeast.

Yeast genes    Human genes
CDC28          CDK7
CDC6           CDC6
CDC7           CDC7
ORC1           ORC1L
MCM2           MCM2
MCM3           MCM3
CDC46          MCM5
MCM6           MCM6
CDC47          MCM7
CDC45          CDC45L
• Bootstrapping is applied to the yeast data. Since the available dataset contains 77 samples, the following cases are considered: 7 iterations with 7 samples each, 11 iterations with 7 samples each, and 7 iterations with 11 samples each. The purpose is to compare the effect of the dataset size and/or the number of iterations on the performance of the bootstrap algorithm.
• Confidence measures for each edge are obtained.
• The human gene network is learned by using the human data from Whitfield et al. (2002) in addition to the obtained yeast gene network (learned with the K2 algorithm) and the confidence measures for each edge obtained by bootstrapping. The estimated similarity factor is given by β = 0.8.
The results are shown in Fig. 5. The yeast network estimated by K2, and the human networks estimated by K2 and by the proposed method, are shown. The cases where bootstrapping is applied with 7 iterations and 7 samples, 7 iterations and 11 samples, and 11 iterations and 7 samples are considered. These values for the iterations and the number of samples per iteration were selected because the selected yeast data from Spellman et al. (1998) contains around 77 time samples. The results of the simplified version of the method, where no bootstrapping is used to measure confidence, are also displayed. In Fig. 5, solid black edges correspond to relations consistent with the KEGG pathways and hence indicate successful detection of known relations. Dashed edges correspond to newly estimated relations, i.e., new candidate relations proposed by the method to be validated by biological experiments. Finally, gray edges correspond to relations listed in the KEGG pathways that were not estimated, i.e., that the method failed to detect. When the number of samples is fixed to 7, the effect of varying the number of iterations is found to be minor.
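The bootstrap confidence step listed above can be sketched as follows. The structure learner here is only a placeholder (a crude co-occurrence rule standing in for the K2 step), and all names are illustrative; the point is the confidence computation: each edge's confidence is the fraction of resampled datasets whose learned structure contains it, in the spirit of Friedman et al. (1999).

```python
import random
from itertools import combinations

def learn_structure(data, genes, threshold=0.5):
    # Placeholder learner: connect gene pairs whose discretized
    # expression values agree in more than `threshold` of the records.
    edges = set()
    for g1, g2 in combinations(genes, 2):
        x = [row[g1] for row in data]
        y = [row[g2] for row in data]
        match = sum(a == b for a, b in zip(x, y)) / len(x)
        if match > threshold:
            edges.add((g1, g2))
    return edges

def edge_confidence(data, genes, iterations=7, samples=7, seed=0):
    # Resample the data with replacement, relearn, and count each edge.
    rng = random.Random(seed)
    counts = {}
    for _ in range(iterations):
        resample = [rng.choice(data) for _ in range(samples)]
        for edge in learn_structure(resample, genes):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / iterations for edge, c in counts.items()}
```

With (iterations, samples) set to (7, 7), (11, 7), or (7, 11), this reproduces the three bootstrap configurations compared in the text.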
Fig. 5. Obtained gene networks: solid black edges correspond to relations consistent with the KEGG pathways, dashed edges correspond to newly estimated relations, and gray edges correspond to listed relations in the KEGG pathways that were not estimated.

In fact, comparing the DAGs where the number of samples is 7, it can be seen that the only difference between the 7 iterations case and the 11 iterations case is an added edge between MCM2 and MCM5 in the 11 iterations case. If, on the other hand, the comparison is made between the cases where the number of iterations is fixed to 7, the edges CDK7 → CDC6 and MCM2 → MCM7 are present in the 7 samples case and absent in the 11 samples case, whereas the edge MCM5 → MCM7 is present in the 11 samples case and absent in the 7 samples case. Comparing the simplified version of the method to the bootstrap case, it can be seen that the edge MCM3 → MCM5 is present in the simplified version, but in none of the presented bootstrap cases. In fact, the edge between their homologous counterparts in the yeast DAG (MCM3 → CDC46) has a confidence of 0.71 in the (7 iterations, 7 samples) case, 0.55 in the (11 iterations, 7 samples) case, and 0.57 in the (7 iterations, 11 samples) case. Such relatively low confidence values (only slightly higher than 0.5), coupled with a similarity factor of 0.8, led to the elimination of the edge by the bootstrap based method. Another interesting example is the edge CDC6 → CDC45L, present in all three bootstrap cases but absent in the simplified version. Although the absence of an edge between their homologous yeast counterparts (CDC6 and CDC45) is supported by high confidence values (0.71, 0.86, and 0.73 for the (7 iterations, 7 samples), (7 iterations, 11 samples), and (11 iterations, 7 samples) cases, respectively), an edge was detected in all three cases of the bootstrap based approach. This indicates that the available data strongly suggests the presence of such an edge, especially since it was also detected in other publications in the literature, namely the work of Tamada et al. (2005). Comparing the results to the K2 algorithm, it can be seen that the edge CDK7 → CDC6, whose absence between the homologous counterparts in the yeast network (CDC28 and CDC6) has confidence values of 0.71, 1, and 0.82 for the (7 iterations, 7 samples), (7 iterations, 11 samples), and (11 iterations, 7 samples) cases, respectively, is absent in two out of three of the bootstrap based cases and in the simplified approach. From a multiorganism perspective, known relations that were not detected in the yeast network by the K2 algorithm were detected by the proposed method in the human network, e.g., the relation between CDC6 and the ORC complex. These results are supported in the literature by interesting biological interpretations. In fact, it was noted in Yan et al. (1998) that CDC6 associates with the ORC proteins to render cells competent for DNA replication. The association of CDC6 with ORC alters the nucleotide contacts made by origin binding proteins and promotes recruitment of MCM proteins and other factors
required to form prereplication complexes. In particular, MCM2 was shown to act as a marker of cancer proliferation in prostate glands in Meng et al. (2001). In fact, Meng et al. (2001) noted that MCM proteins possess highly conserved gene sequences and bind to DNA sites where the ORC and CDC6 proteins have sequentially bound. Studies of the MCM and CDC6 proteins have revealed that they are found only during the cell cycle. The presence of MCM proteins in proliferating cells, but not in quiescent or differentiated cells, suggests a role as a cell proliferation marker. Indeed, Meng et al. (2001) showed that MCM2 expression is low and limited to the basal cell layer in nonmalignant prostate glands, whereas it is consistently increased in malignant glands and is significantly associated with disease-free survival. Finally, the results are compared to other results discussed in the literature, namely Tamada et al. (2005) and Kanehisa et al. (2006). Many of the relations obtained using the proposed method are consistent with the KEGG pathways; other obtained relations are not listed in the KEGG pathways but were also estimated by Tamada et al. (2005) (e.g., the relation between CDC6 and CDC45L in human), which suggests that experimental effort should be made to verify these relations biologically, since they were detected by different approaches. Comparing the obtained results to those of Tamada et al. (2005), it can be seen that many relations listed by Tamada et al. (2005) and not yet biologically verified were not detected by the proposed method (e.g., the relation between CDK7 and CDC6 in human), whereas some new
relations estimated by the proposed method were not estimated by Tamada et al. (2005) (e.g., the relation between CDK7 and the MCM complex in human).
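The three edge categories of Fig. 5 amount to simple set operations between the estimated edge set and a reference edge set. The sketch below uses invented edge sets purely for illustration, not the actual KEGG content:

```python
def classify_edges(estimated: set, reference: set):
    # Partition edges into the three categories drawn in Fig. 5.
    detected = estimated & reference    # solid black: known, found
    candidates = estimated - reference  # dashed: new candidate relations
    missed = reference - estimated      # gray: known, not found
    return detected, candidates, missed

# Illustrative edge sets (directed edges as ordered gene pairs).
est = {("CDC6", "CDC45L"), ("MCM2", "MCM3")}
ref = {("MCM2", "MCM3"), ("ORC1L", "CDC6")}
detected, candidates, missed = classify_edges(est, ref)
```

Counting these sets on synthetic data with a known ground-truth network is exactly how the false positive, false negative, and total error curves of Section 4.1 are obtained.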
5. Work Extensions In this section, some ideas for future extension of this work are presented. The main extensions are related to the definition of the homology measure and the assumptions concerning the gene network graph structure. In this work, the parameter β is used to describe the degree of homology between two species, and an individual parameter associated with each edge, (β, c(e)), is used to accommodate both the similarity factor β and the confidence c(e) in the data relevant to edge e. Therefore, a possible extension of this work would be to define a parameter β(e) for each edge, representing the degree of homology between the two species for that particular edge. In this case, β(e) would be a sort of probability measure reflecting the possibility of having edge e present in both species. The proposed method can then be applied with β(e) incorporated in the expression of (β, c(e)). In this work, the gene networks are modeled as Bayesian networks. Modeling a gene network as a Bayesian network is motivated by the fact that the causal relations between genes can be described qualitatively through the presence of edges, and quantitatively through the conditional probability distributions. The probabilistic nature of the relations between the expression levels of the genes reflects the presence of other factors, of a biological nature, that are not well expressed by the microarray data and that prevent the prediction of a deterministic relation between the different genes. As a result, this probabilistic formulation is able to handle the noise inherent in both the biological processes and the microarray experiments (Hecker et al., 2009). This advantage is not present when gene networks are modeled as Boolean networks, where the state of a network node is modeled only by a binary on-off variable, although remarkable results were obtained in the literature with Boolean gene networks, e.g., Dougherty et al. (2002), Esmaeili and Jacob (2009), and Chuang et al. (2009). The Bayesian networks used in this work are assumed to be static, as in several other works in the literature, e.g., Imoto et al. (2002) and Tamada et al. (2005). Static Bayesian networks have a DAG structure and hence cannot capture feedback loops. Bayesian gene networks can incorporate temporal order, to account for a direct causal effect of one gene on another through different time stages. This extension of Bayesian gene networks is known as dynamic Bayesian networks (Ghahramani, 1997). Methods for incorporating loops in dynamic Bayesian networks are presented in Hecker et al. (2009) and are interesting for future investigation. The main idea consists of separating each node into an input node or regulator node (representing the expression level at time t) and an output node or target node (representing the expression level at time t + Δt). This way, dynamic Bayesian networks can describe regulatory feedback mechanisms, because a feedback loop will not create a cycle in the graph (Hecker et al., 2009). A challenging extension of this work would be to implement the proposed method with the above description of dynamic Bayesian networks. Moreover, bootstrapping is not well suited for time series data, as noted by Hecker et al. (2009). Therefore, the first step would be to investigate this extension with the low complexity implementation of Section 3.4 before studying bootstrap based techniques with dynamic Bayesian networks. Discrete models, such as Boolean networks and Bayesian networks, and continuous models, such as sets of ordinary or partial differential equations and recurrent neural network models, are the two main classes of gene regulatory models. However, a clear
relationship between the two classes of regulatory models is still lacking. An attempt to fill the gap between a differential equation based gene regulatory model and a multi-valued logical model was presented in the work of Ahmad et al. (2009), where a piecewise affine differential equation model with time delay is converted into a discrete model and the dynamics of the resulting model are analyzed using hybrid model-checking techniques. Another possible extension of this work would be to study the applicability of the proposed method to continuous gene network models instead of discrete models. 6. Conclusion A new method for multitask learning in the context of Bayesian networks was proposed. The method was successfully tested and verified on synthetic data using the ALARM benchmark network. The main objective of the proposed method is to utilize evolutionarily preserved gene interactions in order to learn or refine the construction of genetic regulatory network models across multiple organisms. Hence, the proposed method was also applied to S. cerevisiae and H. sapiens real cell cycle gene expression data. Results demonstrated that the proposed method was able to improve the gene network estimation accuracy compared to published KEGG pathways. In addition, new relations were estimated which motivate verification via lab based biological experiments. Acknowledgments The authors wish to express their gratitude to Prof. Kevin Murphy for making the Bayes Net Toolbox for Matlab available online. The authors would like to thank the Editor and the reviewers for their comments that helped improve the content and clarity of the paper. References Ahmad, J., Bourdon, J., Eveillard, D., Fromentin, J., Roux, O., Sinoquet, C., 2009. Temporal constraints of a gene regulatory network: refining a qualitative simulation. BioSystems 98 (3), 149–159. Badea, L., Tilivea, D., 2006. Sparse factorizations of gene expression data guided by binding data. Tech.
Report, National Institute for Research and Development in Informatics (ICI), Romania. Beinlich, I., Suermondt, H.J., Chavez, R.M., Cooper, G.F., 1989. The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks. In: Proceedings of the Second European Conference on Artificial Intelligence in Medicine. Bergmann, S., Ihmels, J., Barkai, N., 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biology 2 (1), 85–93. Brun, M., Kim, S., Choi, W., Dougherty, E.R., 2007. Comparison of gene regulatory networks via steady-state trajectories. EURASIP Journal on Bioinformatics and Systems Biology, doi:10.1155/2007/82702. Caruana, R., 1997. Multitask learning. Machine Learning 28 (1), 41–75. Chuang, C.-L., Chen, C.-M., Wong, W.-S., Tsai, K.-N., Chan, E.-C., Jiang, J.-A., 2009. A robust correlation estimator and nonlinear recurrent model to infer genetic interactions in Saccharomyces cerevisiae and pathways of pulmonary disease in Homo sapiens. BioSystems 98 (3), 160–175. Dougherty, E., Kim, S., Shmulevich, I., Zhang, H., 2002. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18 (2), 261–274. Esmaeili, A., Jacob, C., 2009. A multi-objective differential evolutionary approach toward more stable gene regulatory networks. BioSystems 98 (3), 127–136. Friedman, N., Goldszmidt, M., Wyner, A., 1999. Data analysis with Bayesian networks: a bootstrap approach. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 206–215. Friedman, N., Linial, M., Nachman, I., Peer, D., 2000. Using Bayesian networks to analyze expression data. Journal of Computational Biology 7, 601–620. Fung, E.S., Ng, M.K., 2007. On sparse Fisher discriminant method for microarray data analysis. Bioinformation 2 (5), 230–234. Ghahramani, Z., 1997. Learning Dynamic Bayesian Networks. Lecture Notes in Computer Science.
Department of Computer Science, University of Toronto. Hecker, M., Sandro, L., Toepfer, S., Van Someren, E., Guthke, R., 2009. Gene regulatory network inference: data integration in dynamic models—a review. BioSystems 96, 86–103. Heckerman, D., 1999. A tutorial on learning with Bayesian networks. In: Jordan, M.I. (Ed.), Learning in Graphical Models. MIT Press, MA, USA, pp. 301–354.
Husmeier, D., 2004. Inferring genetic regulatory networks from microarray experiments with Bayesian networks. In: Husmeier, D., Dybowski, R., Roberts, S. (Eds.), Probabilistic Modeling in Bioinformatics and Medical Informatics. Springer, London, pp. 239–268. Imoto, S., Kim, S., Shimodaira, H., Aburatani, S., Tashiro, K., Kuhara, S., Miyano, S., 2002. Bootstrap analysis of gene networks based on Bayesian networks and nonparametric regression. Genome Informatics 13, 369–370. Jimenez, J.L., Mitchell, M.P., Sgouros, J.G., 2003. Microarray analysis of orthologous genes: conservation of the translational machinery across species at the sequence and expression level. Genome Biology 4 (1). Kanehisa, et al., 2006. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Research 34, 354–357. Madigan, D., Raftery, A.E., 1994. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association 89, 1535–1546. Madigan, D., York, J., 1995. Bayesian graphical models for discrete data. International Statistical Review 63, 215–232. Meng, M.V., Grossfeld, G.D., Williams, G.H., Dilworth, S., Stoeber, K., Mulley, T.W., Weinberg, V., Carroll, P.R., Tlsty, T.D., 2001. Minichromosome maintenance protein 2 expression in prostate: characterization and association with outcome after therapy for cancer. Clinical Cancer Research 7, 2712–2718. Nariai, N., Kim, S., Imoto, S., Miyano, S., 2004. Using protein–protein interactions for refining gene networks estimated from microarray data by Bayesian networks. Pacific Symposium on Biocomputing 9, 336–347.
Nassar, M., Abdallah, R., Zeineddine, H.A., Yaacoub, E., Dawy, Z., 2008. A new multitask learning method for multiorganism gene network estimation. In: Proceedings of IEEE ISIT. Neapolitan, R.E., 2004. Learning Bayesian Networks. Prentice Hall. Niculescu-Mizil, A., Caruana, R., 2005. Learning the structure of related tasks. In: Proceedings of NIPS Workshop on Inductive Transfer: 10 Years Later. Spellman, et al., 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273–3297. Stuart, J.M., Segal, E., Koller, D., Kim, S., 2003. A gene co-expression network for global discovery of conserved genetic modules. Science 302 (5643), 249–255. Tamada, et al., 2005. Utilizing evolutionary information and gene expression data for estimating gene networks with Bayesian network models. Journal of Bioinformatics and Computational Biology 3 (6), 1295–1313. Whitfield, et al., 2002. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell 13, 1977–2000. Xue, Y., Liao, X., Carin, L., Krishnapuram, B., 2007. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research 8, 35–63. Yan, Z., Degregori, J., Shohet, R., Leone, G., Stillman, B., Nevins, J., Williams, R.S., 1998. CDC6 is regulated by E2F and is essential for DNA replication in mammalian cells. Cell Biology 95, 3603–3608.