Application of Bayesian networks for inferring cause–effect relations from gene expression profiles of cancer versus normal cells

Application of Bayesian networks for inferring cause–effect relations from gene expression profiles of cancer versus normal cells

Mathematical Biosciences 209 (2007) 528–546 www.elsevier.com/locate/mbs Application of Bayesian networks for inferring cause–effect relations from gen...

700KB Sizes 0 Downloads 31 Views

Mathematical Biosciences 209 (2007) 528–546 www.elsevier.com/locate/mbs

Application of Bayesian networks for inferring cause–effect relations from gene expression profiles of cancer versus normal cells Andrzej Polanski a, Joanna Polanska b,*, Michal Jarzab c, Malgorzata Wiench c, Barbara Jarzab c a

c

Department of Computer Science, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland b System Engineering Group, Department of Automatic Control, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland Department of Nuclear Medicine and Endocrine Oncology, Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Gliwice Branch, Wybrzeze Armii Krajowej 15, 44-101 Gliwice, Poland Received 21 September 2005; received in revised form 15 December 2006; accepted 9 March 2007 Available online 27 March 2007

Abstract The paper is devoted to two questions: whether distinction of causes versus effects of neoplasia leaves a signature in the cancer versus normal gene expression profiles and whether roles of genes, ‘‘causes’’ or ‘‘effects’’, can be inferred from repeated measurements of gene expressions. We model joint probability distributions of logarithms of gene expressions with the use of Bayesian networks (BN). Fitting our models to real data confirms that our BN models have the ability to explain some aspects of observational evidence from DNA microarray experiments. Effects of neoplastic transformation are well seen among genes with the highest power to differentiate between normal and cancer cells. Likelihoods of BNs depend on the biological role of selected genes, defined by Gene Ontology. Also predictions of our BN models are coherent with the set of putative causes and effects constructed based on our data set of papillary thyroid cancer. Ó 2007 Elsevier Inc. All rights reserved. Keywords: Applied statistics; Cancer; DNA microarrays; Cause–effect relations; Bayesian networks

*

Corresponding author. Tel.: +48 32 2372144; fax:+48 32 2371921. E-mail address: [email protected] (J. Polanska).

0025-5564/$ - see front matter Ó 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.mbs.2007.03.006

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

529

1. Introduction Comparisons of gene expression profiles in cancer and normal cells, abundant in the literature, e.g., [2,13,25] lead to discovering and publishing lists of differentially expressed genes and to large collections of cancer versus normal gene expression profiles available in electronic databases e.g., [22]. Data on cancer versus normal expression profiles support the research in carcinogenesis in the sense that they confirm and/or verify existing knowledge on roles of genes in neoplastic transformations and they provide genes—candidates for further studies. It should be, however, noted that at present for most of discovered differentially expressed genes, their roles in neoplastic processes are unknown. When analyzing lists of differentially expressed genes in cancer versus normal comparisons, a question arises on causality of relations implied by the detected differences. If a gene is expressed differentially in cancer and normal cells, then is it a cause or rather an effect of the ongoing neoplastic transformation? This question is of great importance. Knowledge on the distinction between causes and effects among differentially expressed genes, not only supports researches on their potential roles in tumor biology but also helps in developing diagnostic and prognostic clinical tests based on measured gene expressions. Data on expression profiles of normal versus cancer cells are of observational type where there are no experimental interventions and there is no time order of signals. Therefore labeling genes as causes or effects of the neoplastic transformation, based solely on their patterns of expressions is difficult and even problematic [5]. Nevertheless, since cause versus effect tests, even with low specificity and selectivity, can lead to improvements in many fields it seems to us that the question stated is worth studying. In this paper we present results of our research on (i) whether distinction of causes versus effects leaves a signature in the cancer versus normal gene expression profiles and on (ii) whether roles of genes, causes or effects can be inferred from repeated measurements of gene expressions. We address the question of causative dependencies by the use of Bayesian networks (BNs) [8,14,17,18], which are natural models for representing multidimensional joint probability distributions. Nodes of BNs correspond to random variables representing measurements or observations, and directed edges correspond to relations between these random variables. Causality relations are implied by directions of (non-reversible) edges of BNs. The basic obstacle in using BNs for studying causality relations is the need to estimate the topology of a BN from observations. The problem of estimating topology of BN is difficult, due to the large amount of uncertainty and to serious computational complexity even for BNs of very moderate sizes [3,9]. Therefore we take a different approach of constructing a declarative BN, which models a scenario of triggering neoplastic processes by altered gene expressions. In the next step we solve the node assignment problem for the constructed BN [21]. The assignments of genes to the nodes of the BN imply their classification as causes or effects. Applying our methodology to real gene expression data from cancer versus normal comparisons confirms that our BN models have the ability to explain some aspects of observational evidence. Effects of neoplastic transformations are most distinctly seen among genes with the highest power to differentiate between normal and cancer cells. Likelihoods of BNs depend on the biological role of selected genes, defined by Gene Ontology. Also predictions of our BN models are

530

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

coherent with the set of putative causes and effects constructed based on our data set of papillary thyroid cancer.

2. Models of BNs Bayesian network B is a pair B ¼ ðG; DÞ;

ð1Þ

where G is a directed acyclic graph (DAG) and D is a set of conditional probability distributions that corresponds to G. Nodes of graphs of BNs correspond to random variables, signals, measurements or observations. The term ‘‘signal’’ or ‘‘random variable’’ is related to the node of the BN and names ‘‘measurements’’ or ‘‘observations’’ are used in the context of repetitive experiments. We assume two structures of BNs corresponding to two different causative mechanisms of triggering the neoplastic process, by cooperative causes and by alternative causes. In the cooperative causes model (Fig. 1), the neoplastic process is initiated by combination of many events. Expressions of numerous genes contribute to the final level of risk and each single cause alone does not change the risk significantly. The structure of cooperative causes BN is fully defined by deciding to which of the two classes each gene belongs. Another possibility is the alternative causes model (Fig. 2), where cancer initiation is due to the alternative of (two) events and each of the alternative events alone has the ability of triggering transformation. Signals at nodes of our Bayesian networks, in Figs. 1 and 2, are random variables, either continuous, which physically represent logarithms of levels of measured gene expressions, or binary, which correspond to biological processes. The state of the biological process of neoplastic transformation is defined the status of cells, normal/cancer, and is coded by a binary random variable Y with  0 for normal cells; y¼ 1 for cancer cells:

Fig. 1. Bayesian network model with cooperative causes. Circles are continuous nodes, which represent base 2 logarithms of fluorescence signals at gene probes in a DNA microarray chip. Square node Y holds binary information on the status of the cells used in the experiment—normal (0) or cancer (1). Causal relation XC,1, . . . , XC,K ! Y is modeled by logistic function. The influence of the state of Y on values of XE,1, . . . , XE,M is modeled as changing values of expectations and variances of conditional normal distributions.

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

531

Fig. 2. Bayesian network model with alternative causes. Again, circles are continuous nodes, which represent base 2 logarithms of fluorescence signals at gene probes in a DNA microarray chip. There are three binary nodes, Y, ZA and ZB, marked by squares. Y holds information on the status of the cells used in the experiment—normal (0) or cancer (1). ZA and ZB are processes coded (0), inactive; (1), initiated; whose alternative leads to triggering the neoplastic transformation. Each of the processes Y, ZA and ZB has its specific effects. Models of mechanisms of altering expressions of genes by Y, ZA and ZB is changing expectations and variances of conditional distributions, described in the text. The symbol denotes deterministic mechanism triggered by exceeding a specified threshold value. Due to the deterministic mechanism, states of ZA and ZB are always known, given values of thresholds. The symbol OR means logical OR function.

For continuous random variables the notation, e.g., XC,k, XE,k or X C;k is used with superscripts i added to distinguish and number random variables and subscripts indexing experiments. Lower case letters are used for realizations of random variables. 2.1. BN model with cooperative causes In the model of cooperative causes, shown in Fig. 1, probability of initiating the neoplastic process is set by a combined influence of variables XC,1, . . . , XC,K. Random variables XC,1, . . . , XC,K are assumed normal. Probability of triggering the neoplastic process is assumed to follow the logistic relation P ½Y ¼ 1 ¼ P L ðxC;1 ; . . . ; xC;K Þ ¼

exp½bðxC;1 þ xC;2 þ    þ xC;K Þ ; 1 þ exp½bðxC;1 þ xC;2 þ    þ xC;K Þ

ð2Þ

where index ‘‘L’’ in PL(Æ) is for ‘‘logistic’’ and b is a parameter. The effects of the neoplastic process, XE,1, . . . , XE,M are conditionally normal random variables with conditional means and variances depending on the status of the cell normal (0)/tumor (1): X E;m jY ¼ 0  Nðlm0 ; rm0 Þ, X E;m jY ¼ 1  Nðlm1 ; rm1 Þ; m ¼ 1; 2; . . . ; M: Nðl; rÞ denotes normal distribution with mean l and standard deviation r. Notations for probability density functions are   pðxE;m y ¼ 0Þ ¼ p ðxE;m ; lm ; rm Þ and pðxE;m y ¼ 1Þ ¼ p ðxE;m ; lm ; rm Þ; N

0

0

N

1

1

m ¼ 1; 2; . . . ; M; where pN(x, l, r) or pN(x) stand for probability density function of normal distribution.

ð3Þ

532

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

Based on the structure in Fig. 1 and assumptions (2), (3) one can use standard method to write expression for likelihood L of the Bayesian network Y Y Y C;K C;K pN ðxiC;k Þ P L ðxC;1 Þ ½1  P L ðxC;1 Þ L¼ i ; . . . ; xi i ; . . . ; xi k2C i2H[T



Y

i2T

i2H

m m pN ðxE;m i ; l0 ; r0 Þ

m2E i2H

Y

m m pN ðxE;m i ; l1 ; r1 Þ:

ð4Þ

m2E i2T

Classes of continuous nodes (variables) are C—causes and E—effects, and classes of experiments are H—normal (healthy) and T—tumor. 2.2. BN model with alternative causes In Fig. 2, the Bayesian network model with alternative causes is shown. Neoplastic process Y is the alternative of two hypothetical processes ZA and ZB. Effects of neoplastic transformation are XE,1, . . . , XE,M. Processes ZA and ZB correspond to binary nodes with possible values 0, not active; and 1, process initiated. States 0 and 1 of the processes ZA and ZB are not directly observed, but we assume a simplified situation, namely that each of the processes ZA or ZB can be triggered by an event of the level of expression of one gene XCA or XCB exceeding the value of a specified threshold, ( A

z ¼

0

if xCA 6 thrA

1

if xCA > thrA

0

if xCB 6 thrB

1

if xCB > thrB

ð5Þ

for ZA, and ( B

z ¼

:

ð6Þ

for ZB. Under these hypotheses, given values of thresholds, states of all binary nodes are always known. Each of the binary processes ZA or ZB has its specific consequences, XEA,1, . . . , XEA,K for ZA and XEB,1, . . . , XEB,J for ZB. In the alternative causes Bayesian network, classes of continuous nodes are CA, CB, EA, EB and E. The mechanism of changes of expressions of XE,1, . . . , XE,M secondarily to the ongoing neoplastic transformation is changing conditional expectation and variance, described by conditional distributions (3), already explained. Analogous mechanisms are assumed for altering expressions XEA,1, . . . , XEA,K and XEB,1, . . . , XEB,J secondarily to ZA and Z B : pðxEA;k jzA ¼ 0Þ ¼ and pðxEB;j jzB ¼ 0Þ ¼ pN ðxEA;k ; lk0 ; rk0 Þ; pðxEA;k jzA ¼ 1Þ ¼ pN ðxEA;k ; lk1 ; rk1 Þ; k ¼ 1; 2; . . . ; K; j j j j EB;j EB;j B EB;j pN ðx ; l0 ; r0 Þ; pðx jz ¼ 1Þ ¼ pN ðx ; l1 ; r1 Þ; j ¼ 1; 2; . . . ; J : Notation is the same as that employed in Eq. (3).

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

533

Likelihood function of Bayesian network from Fig. 2 is Y Y k k L ¼pðxCA ÞpðxCB Þ pN ðxEA;k ; l ; r Þ pN ðxEA;k ; lk1 ; rk1 Þ i i 0 0 

k2EA i2HA

Y

pN ðxEB;j ; lj0 ; rj0 Þ i

j2EB i2HB



Y

m m pN ðxE;m i ; l0 ; r0 Þ

m2E i2H

k2EA i2TA

Y

pN ðxEB;j ; lj1 ; rj1 Þ i

j2EB i2TB

Y

m m pN ðxE;m i ; l1 ; r1 Þ:

ð7Þ

m2E i2T

In the above p(xCA) and p(xCB) are probability density functions corresponding to XCA and XCB. Indices k, j and m run over classes of nodes, EA, EB and E, and index i runs over categories of experiments (samples), H/T already defined as (normal/tumor), HA/TA meaning (zA = 0/zA = 1) and HB/TB meaning (zB = 0/zB = 1).

3. Node assignment by maximizing likelihood Recovering roles of genes is based on maximizing likelihood functions (4) and (7) over possible assignments of genes to their roles in the Network. For cooperative causes Bayesian network from Fig. 1, maximizing its likelihood (4) is a combinatorial optimization problem with very large number of possible assignments of genes to their roles in the network. Going through all assignments is not possible, but we used, successfully, standard Metropolis–Hastings algorithm [15] with random switching between roles of genes for obtaining reasonable approximation of the optimal structure. Each step of the procedure involves (i) labeling genes as ‘‘causes’’ or ‘‘effects’’ and (ii) optimizing parameters as described in Section 3.1. According to our assumption of fixed topology of BN proportion between causes and effects remains constant through the optimization process. For Bayesian network with alternative causes, from Fig. 2, we used the following heuristics for approximate optimization of the likelihood. We considered all genes as potential alternative causes. If gene X was selected as one of alternative causes, then the value of threshold thr in (5), (6) was set as equal to thr ¼ max xk : k2H

If expressions of gene X did not satisfy the property max xk < max xk k2H

k2H[T

ð8Þ

gene X was not considered as a potential cause. Provided that two alternative causes, XCA and XCB, are already specified and thresholds thrA and thrB are known, likelihood (7) can easily be optimized by deciding, for all remaining genes, which of the classes EA, EB or E do they belong to. Parameters of conditionally normal distributions are again estimated by standard formulae (Section 3.1) and decision is made based on maximizing likelihood over three possible assignments (EA, EB, E). We use notation ð9Þ L0 ðX CA ; X CB Þ

534

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

for likelihood, given alternative causes are XCA and XCB, optimized as described above. Now the likelihood of the whole network can be optimized by using (9) and going through all possible pairs XCA and XCB. 3.1. Parameter estimation After assignment of genes to their roles in BN is given, likelihoods are computed with the use of expressions (4) and (7). However, values of likelihoods of BNs depend on values of parameters, means and variances of normal and conditionally normal distributions and one parameter, b, of the logistic curve (3). Optimizing location and scale parameters of (conditionally) normal distributions is done by standard expressions PI PI ^ Þ2 ðxi  l i¼1 xi ^ ¼ i¼1 ^¼ and r ; l I I1 where xi, i = 1, 2, . . . , I stand for repeated measurements. Fitting the parameter b in the logistic relation can, in principle, be done with the use of maximum likelihood method by Newton–Raphson iterations. However, for numbers of repetitions of observations, which we used in our computations (30), estimators of b have rather large variations, therefore we work with models with value of b, fixed at b = 1. 3.2. Computational experiments We have performed random simulations where data were generated based on assumptions on topologies and probability distributions in our Bayesian networks [21]. The aim of these experiments was to support intuitive understanding of relations in our Bayesian networks, and estimate how accurately roles of genes can be recovered in artificial data. In artificially generated data, roles of genes can be reconstructed rather reliably and robustly based on the described optimization procedures, even for large Bayesian networks consisting of hundreds of genes. In computations we assumed numbers of repeated measurements of gene expression profiles as ranging from 30 to 100, achievable in real DNA microarray data.

4. Cause–effect relations implied by Bayesian networks Analysis of causative dependencies between the nodes of BNs, is related to the theory of observational equivalence of BNs. Two DAGs, G and G 0 , are observationally equivalent if for every Bayesian network B = (G,D) there exists a Bayesian network B 0 =(G 0 ,D 0 ) such that B and B 0 define the same probability distribution, and vice versa. The well known theorem (e.g., [18, p. 19, Theorem. 1.2.8]) states that two DAGs are observationally equivalent iff they have the same skeletons and the same v-structures. Graphical explanations of the notions of skeleton and v-structure are shown in Fig. 3. Observational equivalence concerns general, non-parametric conditional probability distributions describing edges of BNs. When probability distributions are restricted to parametric classes, theorems on observational equivalence may not be true. However, identifying BN structures by detecting parametric classes

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

535

Fig. 3. Definitions related to the concept of observational equivalence of DAGs. Upper left, exemplary DAG; right, its skeleton. Skeleton is obtained by ignoring directions of arrows in a DAG. Below: v-structure, X ! Z Y. v-Structure is a fragment of a DAG, in the definition of v-structure it is required that there are no direct edges between X and Y.

of edges is rather an artificial idea. In contrast, observational equivalence is a basic property of BN structures and has rather important consequences in inference based on models of BNs [3]. Using the concept of observational equivalence we call a DAG G identifiable, if every DAG, G 0 , obtained from G by reversing arrows is not observationally equivalent to G. In other words, a DAG is identifiable if its arrows cannot be reversed without changing the probability distributions of random variables corresponding to nodes of the related BN. In the case where the number of causes is greater or equal to 2, DAG in Fig. 1 is identifiable. Since we understand cooperative mechanism as many causes contributing to the final risk the condition that their number is greater than two is always satisfied. Bayesian network from Fig. 2 is not identifiable in the sense specified above. Based on the relation of observational equivalence the scenario which we study can be represented by partially directed graph (PDAG), [19,20], where only some of the causative dependencies can be inferred. In the BN in Fig. 2 processes ZA and ZB are causes of Y. Causative dependences between XCA, XCB,XEA,1, . . . , XEA,K, XEB,1, . . . , XEB,J and Y are not resolved.

5. Fitting BN models to cancer versus normal gene expression data We have analyzed three datasets on gene expression in lung cancer [2], prostate cancer [25] and our own dataset on papillary thyroid cancer (PTC), [13]. In all datasets, the experimental platform was Affymetrix oligo microarray and for all the final values of expression levels were computed by the same normalization procedure, Robust Multi-array Average, RMA [12].

536

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

5.1. Data standardization In simulated data we always assumed that increasing values of expressions of causes leads to increased probability of initiation of neoplastic process. Likewise, we assumed that the ongoing neoplastic process increases conditional means of probability distributions of expressions of its effects. In real data, however, cause or effect relationships can involve increase as well as decrease of expression level of a gene. In order to avoid getting involved in estimating increase versus decrease mechanisms we performed data standardization step. Namely, we rescaled realizations of each of random variables corresponding to continuous nodes of the Bayesian network, such that average becomes zero and standard deviation becomes one. For differentially expressed genes, conditional expectations E(XjY = 0) and E(XjY = 1) are statistically significantly different. The case where E(XjY = 0) < E(XjY = 1) is consistent with our convention, increased expression M increased probability. In cases where E(XjY = 0) > E(XjY = 1) we multiplied the standardized variables by 1 transforming them in such a way that they became consistent with the convention. 5.2. Effects of the neoplastic transformation among top differentially expressed genes In Fig. 4(a) we have shown a fragment of our Bayesian networks, with binary node Y and two effects of the neoplastic process XE1 and XE2. We have simulated 50 realizations of random variables appearing in Fig. 4(a) Y, XE1 and XE2. Scatter plot resulting from random simulations is shown in Fig. 4(b). For two effects, XE1 and XE2, their marginal distribution exhibits strong correlation and, for ‘‘strong effects’’ is bimodal. In this way shapes of scatter plots can support intuitive way of pointing out pairs of effects in the data. For both datasets, we applied Welch t-test for base two logarithms of gene expression values and their cancer/normal differentiating power by p-values corresponding to the t-statistic. One hundred top genes were selected and classified according to the appearance of their expression as effects. In Fig. 5 we reproduced scatter plots of values of base 2 logarithms of expressions of two pairs of genes chosen from top 100 differentially expressed genes, Fig. 5(a) corresponds to dataset [2] and Fig. 5(b) to the data from [25]. We infer roles of genes in both pairs (J03890_rna1_at, Y09267_at) and (837_s_at, 40193_at) as effects of neoplasia, by similarity of scatter plots in Fig. 5(a) and (b) to that shown in Fig. 4(b). Marginal distributions exhibit strong correlations and bimodality. When pairs of genes are picked up randomly from lists of top differentiating genes, it is rather likely to observe scatter plots similar to those presented in Fig. 5. This prompts us to draw the conclusion that many (possibly majority) of top differentially expressed genes are effects of the neoplastic process. 5.3. Sensitivity of Bayesian network likelihoods to GO classes of genes Gene Ontology (GO) terms (classes) classify biological properties of genes and make the automated analysis of gene function available [10]. Some ontology classes represent genes causative in cancer from the biological point of view (e.g., signal transduction, cell proliferation), genes represented by other classes are with higher probability the effects of neoplastic process (e.g., metabolism, transport). We analyze such cause–effect model by computing likelihoods of Bayesian network from Fig. 1, for which causes and effects were sampled from lists of genes obtained based

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

537

Fig. 4. (a) Fragment of a Bayesian network with binary node Y indicating the state of the cell (0), normal; (1), cancer; and two effects XE1 and XE2. (b) Corresponding scatter plot of 50 simulated realizations of random variables Y, XE1 and XE2. Values of y are indicated by marker shape open circles for y = 0 and closed circles for y = 1. We assumed probabilities P[Y = 0] = P[Y = 1] = 0.5 and we used following parameters for conditionally normal distributions: l10 ¼ l20 ¼ 3; l11 ¼ l21 ¼ 3 and r10 ¼ r20 ¼ r11 ¼ r21 ¼ 1:

on GO terms. After assigning genes to network nodes, parameters of the network were optimized as described in Section 3.1. Here we show comparisons for the most obvious combination of GO terms: ‘‘signal transduction’’ and ‘‘metabolism’’. Depending on whether metabolism genes are assumed to be effects or the reverse assumption will be taken, the likelihoods of the Bayesian network, obtained by using Eq. (4), will differ. In Fig. 6(a) we show comparison of two histograms for the lung cancer data from [2]. The upper histogram, follows from recording likelihoods of 10,000 randomly generated Bayesian networks from Fig. 1, such that ‘‘causes’’ are sampled from signal transduction genes and ‘‘effects’’ are sampled from metabolism genes. The lower histogram in Fig. 6(a), corresponds to the opposite situation, ‘‘causes’’ are sampled from metabolism genes and ‘‘effects’’ are sampled from signal transduction genes. In Fig. 6(b) analogous comparison is done for the prostate cancer data [25] (some more details concerning computations are in the caption). For both datasets, genes whose GO terms included both signal transduction and metabolism, were omitted. For both comparisons, the proper classification of signal transduction-related genes as putative causes and metabolism-related genes as putative effects leads, on average, to higher likelihoods of

538

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

Fig. 5. Scatter plots of values of base 2 logarithms of expressions of two pairs of genes randomly chosen from top 50 differentially expressed genes obtained from the datasets: (a) concerning lung cancer [2], and (b) concerning prostate cancer [25]. Values of y (normal, y = 0 versus cancer, y = 1) are indicated by marker shape open circles for y = 0 and closed circles for y = 1.

cooperative—cause Bayesian networks. In both plots, histograms can be well modeled by normal distributions, standard deviations have comparable values and shifts of means are of the order of 2r. 5.4. Causes and effects of neoplastic transformation in papillary thyroid cancer For further validation of the Bayesian network models to differentiate causes and effects, we used the microarray data from papillary thyroid cancer (PTC), described in detail elsewhere [13]. We included data of 15 PTC tumors and corresponding 15 normal thyroid tissues analyzed by the same platform and preprocessing as the sets described above (but with a newer generation of chips, HG-U133A). They were derived from the group described in the original paper as the initial set, however, one tumor and its normal tissue were omitted due to technical reasons.

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

539

Fig. 6. Sensitivity of likelihoods of Bayesian networks with cooperative causes to GO terms of genes. Comparisons for two GO terms, ‘‘signal transduction’’ and ‘‘metabolism’’. (a) Lung cancer data from [2]. Upper histogram follows from recording log likelihoods of 10,000 randomly generated Bayesian networks with 50 cooperating causes and 200 effects, such that ‘‘causes’’ are sampled from 816 genes in the data, termed ‘‘signal transduction’’ in GO, and ‘‘effects’’ are sampled from 698 genes termed ‘‘metabolism’’. The lower histogram corresponds to the opposite situation, ‘‘causes’’ are sampled from genes termed ‘‘metabolism’’ and ‘‘effects’’ are sampled from genes termed ‘‘signal transduction’’. (b) Analogous comparison done for the prostate cancer [25]. Upper histogram is for log likelihoods of 10,000 randomly generated Bayesian networks with 150 cooperating causes and 600 effects, with ‘‘causes’’ sampled from 1327 genes termed ‘‘signal transduction’’ and ‘‘effects’’ sampled from 1039 genes termed ‘‘metabolism’’. Lower histogram is for the opposite situation.

Papillary thyroid cancer is the tumor especially suitable for our purposes, because the molecular pathway leading to tumor initiation is quite well characterized [6]. For PTC the alternative causes model is widely accepted: nearly half of the cases show rearrangements of RET protooncogene or other receptor tyrosine kinase, NTRK, which initiate the oncogenic cascade. In remaining patients, 30–40% have a mutation in BRAF kinase, which activates the same pathway as RET but acts downstream to it. These mutations are almost never found together, while only in a small

540

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

percentage of cases none of the initiating molecular events is found. This makes PTC an ideal model for our study, especially because it has been shown that one of the initiating events, RET rearrangement, is clearly distinguishable in PTC expression profile [11,23]. First, we repeated the procedure described above for lung and prostate cancer, to analyze whether the putative effects and causes differ in their distribution between GO classes. Again, two GO classes were analyzed, ‘‘signal transduction’’ with 1948 genes and ‘‘metabolism’’ with 1761 genes, which constitute all the genes on U133A microarray within these two classes. Again 96 genes (2.6%) common to both classes were omitted and the resulting final number of 3517 genes was analyzed. We recorded logarithms of likelihood of 10,000 Bayesian networks obtained by sampling 50 putative cooperative causes from signal transduction genes and 200 putative effects from metabolism. We revealed superiority of the model in which metabolism genes are assumed to be effects, over the model with metabolism genes as the cause of neoplastic transformation, results are shown in Fig. 7(a). Similar results were obtained when we compared metabolism genes to cell proliferation gene ontology class (with 763 genes included). As we found that the method can classify between causes and effects in bulk quantities, we asked a question whether it is possible to reveal a known biological phenomenon, causative to a specific type of cancer. We decided to investigate RET rearrangement as one of the alternative causes of papillary thyroid cancer [6,24,26]. This event leads to RET tyrosine kinase overexpression, thus, may be evaluated by the proposed approach. For this goal we created an artificial mixture of causative and secondarily changed PTC genes from our dataset. To obtain a list of genes putatively related to the causative RET rearrangement, we subdivided our PTC samples by the presence of this mutation, determined previously [28] and genes with significantly different expression between RET-positive and RET-negative cancers were selected (details of the analysis are prepared for publication by M. Wiench). The top 50 genes from this analysis, with false discovery rate less or equal 5%, were included. We expected that at least some of them might be causally related to the early transformation steps consecutive to RET rearrangement or concomitant/alternative initiating event. At the first position of this list there was RET tyrosine kinase itself. We needed also genes which are ‘‘effects’’ in thyroid carcinogenesis. To strengthen the chance of selecting genes not causatively related, we took an assumption, that genes changed in similar manner in thyroid tumors with different biology are rather effects than causes of transformation. Thus, we used our own data on gene expression profile of follicular thyroid cancer (FTC, prepared for publication), a differentiated malignant tumor arising from the same follicular thyroid cell, but in different mechanism [1]. We built a common set from both papillary and follicular thyroid cancer microarray data (manuscript in preparation) and compared their gene expression profile to the expression in normal thyroid tissues. Next, we chose only the genes, which (a) were differentially expressed between tumors and normal tissues and (b) did not show any difference in expression between PTC and FTC (difference between mean log 2 of signal in PTC and FTC less than 0.2). We obtained 1071 such genes, and this list was further enriched to contain only genes with gene ontology classification which we expected to be an effect rather than a cause (metabolism, transport, cell adhesion apoptosis). We attempted to cover the full range of signal log ratios between tumor and normal tissues and not to select only genes with large difference between tumor and normal tissue. Obtained list

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

541

Fig. 7. Comparisons for PTC dataset. (a) Upper histogram follows from recording log likelihoods of 10,000 randomly generated Bayesian networks with 50 cooperating causes and 200 effects, such that ‘‘causes’’ are sampled from 1948 genes in the data, termed ‘‘signal transduction’’ in GO, and ‘‘effects’’ are sampled from 1761 genes termed ‘‘metabolism’’. The lower histogram corresponds to the opposite situation, ‘‘causes’’ are sampled from genes termed ‘‘metabolism’’ and ‘‘effects’’ are sampled from genes termed ‘‘signal transduction’’. (b) Statistics for our putative sets of causes and effects. Upper histogram concerns log likelihoods of 10,000 randomly generated Bayesian networks with 10 cooperating causes and 40 effects, such that ‘‘causes’’ are sampled from the set of our 50 putative causes and effects are sampled from the set of 221 putative effects. Lower histogram corresponds to the opposite situation.

of ‘‘expected effects’’ contained 221 genes. When we mixed them together with 50 ‘‘expected causes’’ described above, we obtained a gene set of 271 genes on which the further analysis was carried out. Its result are shown in Fig. 7(b).We obtained a much more distinct difference in probability of proper classification than in the comparisons shown on Fig. 6. We studied, which genes will be labeled effects and causes, if the value of likelihood of the Network is numerically maximized, under the hypothesis of the scenario with cooperative causes. For the model of Bayesian network with cooperative causes we assumed 271 nodes and the number of 50 genes as the number of causes, and we repeated 50 times the procedure

542

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

of 30,000 iterations of Metropolis–Hastings algorithm of Network likelihood maximization. In the estimated structure of the Bayesian network with the highest likelihood we found 30 genes predicted correctly as putative causes. When 51 genes most often classified as putative causes in 50 repetitions of the procedure were taken as predicted causes, then it appeared that 28 of them were predicted correctly. Both these estimates were significantly better than a random selection from the initial set (the proportion of cause genes in initial dataset was 18.4%, so just by chance the algorithm should identify 9.4 genes, probabilities of obtaining our predictions by random choice of genes roles are of the order of 107–108). Most probably, consistency of labeling genes ‘‘causes’’ and ‘‘effects’’ by the algorithm to our expectations, follows from existence of some genes, among putative causes, which exhibit cooperative mechanism of initiation of the neoplastic process and/or from characteristic pattern of strong effects of neoplasia versus all other genes, detected by our Bayesian network model. However, when the scenario with cooperative causes was fitted to our data as explained above, RET was always labeled as an effect of neoplasia and never classified as a cause. This is concordant to its role as an alternative rather than a cooperative cause. In the next step of the analysis, for 221 putative effects and 50 putative causes described above, we fitted the scenario of alternative causes. When all 271 genes are mixed then there are 36,585 possible pairs of genes. We apply the following conditions to exclude some of these pairs. If XCA and XCB are accepted as two alternative causes than at least two measurements of expressions of cancer cells must satisfy: xCA > thrA

and xCB 6 thrB ;

ð10Þ

and another two must fulfill the opposite xCB > thrB

and xCA 6 thrA : CA

ð11Þ CB

In other words, we believe that X and X are potential alternative causes, if at least for two cancer samples XCA is upregulated and XCB is not, and vice versa. Among all 36,585 pairs of genes, 13,500 passed (10) and (11) and for them we computed logarithms of likelihoods L0(XCA,XCB), as described in (9). In Fig. 8 some of resulting histograms of log likelihoods are presented. The upper histogram shows distribution of log likelihoods for the case where XCA and XCB are pairs of our putative causes and the lower one the distribution of log likelihoods for the case where XCA and XCB are pairs of putative effects. The shift towards higher probabilities in the upper plot is clearly seen and indicates that alternative cause scenarios are more probable if generated by our putative causes than by putative effects. When we analyzed top 100 assignments in alternative causes scenarios (BNs) (cut-off log likelihood was 9100), then among 200 genes in 100 pairs of XCA and XCB, our putative causes appeared 70 times versus 130 times when we saw putative effects. The proportion between causes and effects is again statistically significantly higher than could have been expected by random choice. 100 most probable scenarios included 11 of our putative causes as genes at nodes XCA and XCB, some of them appeared repeatedly. Among them RET transcript was found on the fifth position, and remaining 10 genes were related either to Ras/Ral signaling (2 transcripts), kinase activity (2 transcripts), regulation of transcription (3 transcripts), cell cycle, migration and apoptosis. Only in the alternative causes scenario, RET transcript is classified as a cause, validating the biological meaning of the obtained results.

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

543

Fig. 8. Histograms of log likelihoods of alternative causes scenarios. The upper histogram shows distributions of log likelihoods obtained when pairs of our putative causes are placed as nodes XCA and XCB and the lower histogram shows distributions of log likelihoods obtained when pairs of our putative effects are put as nodes XCA and XCB. Blue arrow in the upper plot shows highest likelihoods of scenarios including RET gene.

6. Discussion Cancer arises as a result of accumulated genetic alterations and tumors appear to select for genetic abnormalities that may be most advantageous for escape from normal regulatory mechanisms in their particular microenvironment [27]. Only some of causative genetic changes accumulated during tumorigenesis are reflected in changed gene transcription, allowing their follow up by gene expression profiling. However, all accumulating genetic and epigenetic changes influence finally the gene transcription and this way may exert the effect on cellular regulation. The aim of our study was to delineate these changes in gene expression which are directly driven by genetic or epigenetic cancer-initiating events (‘‘causes’’) from secondary changes in gene expression (‘‘effects’’), evoked by ‘‘causes’’. Comparisons of gene expression profiles in cancer and normal cells lead to specification of huge lists of cancer versus normal differentially expressed genes, often ordered with respect to their differentiating power determined by several approaches, e.g., by the p-value of a statistical test. Genes can be then compared across different studies, as in the paper [22] and the associated website with the collection of microarray datasets of human cancers. We propose a method to differentiate between genes primarily associated with cancer-initiating mechanisms and genes with expression altered as a secondary consequence, not directly related to the cause of neoplastic transformation and thus potentially irrelevant for tumor initiation or progression. Our analysis of lung and prostate cancer datasets supports the conclusion that genes with the most distinct changes in expression are secondary effects of neoplastic transformation, a view often expressed in discussion of microarray data but rather on the basis of intuitive considerations, not a mathematical analysis. This assumption does not preclude the role of genes with large foldchange in tumor biology and especially in cancer diagnostics, however, for better understanding of the mechanisms of neoplastic transformation an approach should be chosen which is more selective than a simple criterion of the most changed expression.

544

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

We have shown, that a simplistic assumption that gene ontology class including signal transduction genes should contain more putative causes and less putative effects of neoplastic transformation than metabolism genes may be confirmed both in publicly available and our own dataset. However, a more convincing proof on the validity of the proposed approach was necessary, thus we performed an analysis in papillary thyroid cancer and we were able to find within many thyroid cancer related genes the known cause of PTC, RET protooncogene. We used PTC dataset, as we were able to create a hypothesis related to the genes participating in neoplastic transformation. We possessed the information about the presence of RET rearrangements in the examined tumors. RET rearrangement results in the expression of the fusion gene in thyroid follicular cell with the constitutive activation of RET tyrosine kinase [24,26]. Similarly to alternative events, among them NTRK1 rearrangement or BRAF mutations [6,7], it leads to the activation of Ras-MAPK pathway which results in increased transcription of several growth and proliferation genes. Thus, all three mentioned genes may be regarded as causes of PTC tumorigenesis, however, only first two tyrosine kinase genes will show a changed expression on RNA level (BRAF mutation is a single nucleotide substitution which leads to an activation of the kinase). Recently described mechanisms of BRAF activation by chromosomal rearrangement [4] are operating rather in radiation-induced PTC. On the other side, it is well known, that RET rearrangement itself is insufficient to trigger full carcinogenesis but consecutive and/or concomitant molecular processes, resulting finally in the development of papillary thyroid cancer, are less known [16]. Thus, an approach to look for other early causes in papillary thyroid cancer is substantiated. In order to create a gene set where the putative causes are mixed with putative (and prevailing) effects we looked for a method to select genes which are most probably secondarily changed in PTC. We opted for genes changed in a similar manner both in follicular and papillary thyroid cancers, as both histotypes of thyroid cancer are due to clearly different mechanisms of induction [6], thus, genes commonly expressed in both of them may be regarded as putative effects of thyroid carcinogenesis. Additionally, we selected genes representing ontology classes considered rather as effects. Thus, for PTC data the criterion to select genes for further analysis was the knowledge of their roles as causes or effects of neoplastic transformation. Based on this knowledge, coming from both the literature and our own observations, we finally selected 271 genes which were analyzed by Bayesian networks. This set of genes was analyzed by the proposed algorithm and 28–30 from 50 genes selected as the putative PTC causes were found by this approach, significantly more than expected by random. What even more convincing, the use of the alternative causes model was more efficient than the use of cooperative causes model, because only in the first one RET transcript was present among the selected genes. RET rearrangements result in a distinct expression of RET tyrosine kinase (whose expression is absent in normal follicular thyroid cells) and RET rearrangement leaves a different expression signature in comparison to BRAF mutation [7]. Indeed, we observed a strong correlation between the presence of RET rearrangement and the expression of the RET probeset on HU-133A microarray [6]. However, the RET transcript was not selected by the approach based on cooperative causes model, which is in concordance on our knowledge on RET role in PTC tumorogenesis—RET rearrangement is considered an alternative cause of MAPK kinase activation. NTRK1 transcript was not selected in the top 50 genes differentiating between RET-positive and RET-negative tumors (NTRK rearrangement is rather a rare event and is

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

545

found only in 10–15% of PTC cases) and thus not considered by us as a cause; thus, its absence in the final set is obvious. Although a biological confirmation of the hypothesis that the other selected genes are involved in the neoplastic transformation leading finally to PTC is at present unavailable, the confirmation of the presence of RET in the chosen set among ten other genes is the first signal of the efficacy of the presented model. Similarly, we observed the significant enrichment of the selected causes by genes which we expected to participate in the early steps of PTC tumorogenesis. It is to be stressed that the algorithm selected them properly among other genes with similar expression range and derived from the same tumors. Of course the method of preliminary selection of putative PTC causative genes is highly speculative. The genes with most distinct difference between RET-positive and RET-negative PTC tumors may both include some other causative genes, which are alternative initiating molecular events (thus, changed only in RET-negative tumors) as well as genes present only in RET-positive tumors (specific for the presence of RET rearrangements and secondarily changed). However, our working hypothesis is that the genes selected by our algorithm belong to the class of alternative early causes induced by various alternate initiating events. It is beyond the scope of this article to discuss the biological significance of the selected genes and a more direct proof of their participation in PTC transformation is necessary.

Acknowledgements We are grateful to the anonymous referees for many helpful comments. This paper was supported by the Polish Ministry of Science and Education as the Research Project 3T11F01029 and Ordered Research Grant 1/1-PBZ-MNiI-2/1/2005, and European FP6 Grant GENEPI ENTB2.

References [1] M.A. Aldred et al., Papillary and follicular thyroid carcinomas show distinctly different microarray expression profiles and can be distinguished by a minimum of five genes, J. Clin. Oncol. 22 (17) (2004) 3531–3539. [2] D.G. Beer, S.L. Kardia, C.C. Huang, T.J. Giordano, A.M. Levin, D.E. Misek, L. Lin, G. Chen, T.G. Gharib, D.G. Thomas, M.L. Lizyness, R. Kuick, S. Hayasaka, J.M. Taylor, M.D. Iannettoni, M.B. Orringer, S. Hanash, Gene expression profiles predict survival of patients with lung adenocarcinoma, Nat. Med. 8 (2002) 816–824. [3] D.M. Chickering, Learning equivalence classes of Bayesian-network structures, J. Mach. Learn. Res. 2 (2002) 445– 498. [4] R. Ciampi, J.A. Knauf, H.M. Rabes, J.A. Fagin, Y.E. Nikiforov, BRAF kinase activation via chromosomal rearrangement in radiation-induced and sporadic thyroid cancer, Cell Cycle 4 (4) (2005) 547–548. [5] D.R. Cox, Causality: some statistical aspects, J. R. Stat. Soc. Ser. A 155 (1992) 291–301. [6] J.A. Fagin, Challenging dogma in thyroid cancer molecular genetics—role of RET/PTC and BRAF in tumor initiation, J. Clin. Endocrinol. Metab. 89 (2004) 4264–4266. [7] M. Frattini, C. Ferrario, P. Bressan, D. Balestra, L. De Cecco, P. Mondellini, I. Bongarzone, P. Collini, M. Gariboldi, S. Pilotti, M.A. Pierotti, A. Greco, Alternative mutations of BRAF, RET and NTRK1 are associated with similar but distinct gene expression patterns in papillary thyroid cancer, Oncogene 23 (2004) 7436–7440. [8] N. Friedman, M. Linial, I. Nachman, D. Pe’er, Using Bayesian networks to analyze expression data, J. Comput. Biol. 7 (3–4) (2004) 601–620.

546

A. Polanski et al. / Mathematical Biosciences 209 (2007) 528–546

[9] N. Friedman, The Bayesian structural EM algorithm, in: Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1998. [10] The Gene Ontology Consortium, Gene Ontology: tool for the unification of biology, Nat. Genet. 25 (2000) 25–29. [11] T.J. Giordano, R. Kuick, D.G. Thomas, D.E. Misek, M. Vinco, D. Sanders, Z. Zhu, R. Ciampi, M. Roh, K. Shedden, P. Gauger, G. Doherty, N.W. Thompson, S. Hanash, R.J. Koenig, Y.E. Nikiforov, Molecular classification of papillary thyroid carcinoma: distinct BRAF, RAS, and RET/PTC mutation-specific gene expression profiles discovered by DNA microarray analysis, Oncogene 24 (44) (2005) 6646–6656. [12] R.A. Irizarry, B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf, T.P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics 4 (2003) 249– 264. [13] B. Jarzab, M. Wiench, K. Fujarewicz, K. Simek, M. Jarzab, M. Oczko-Wojciechowska, J. Wloch, A. Czarniecka, E. Chmielik, D. Lange, A. Pawlaczek, S. Szpak, E. Gubala, A. Swierniak, Gene expression profile of papillary thyroid cancer: sources of variability and diagnostic implications, Cancer Res. 65 (2005) 1587–1597. [14] F.V. Jensen, Bayesian Networks and Decision Graphs, Springer, 2002. [15] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equations of state calculations by fast computing machines, J. Chem. Phys. 21 (1953) 1087–1092. [16] N. Mitsutake, J.A. Knauf, S. Mitsutake, C. Mesa Jr., L. Zhang, J.A. Fagin, Conditional BRAFV600E expression induces DNA synthesis, apoptosis, dedifferentiation, and chromosomal instability in thyroid PCCL3 cells, Cancer Res. 65 (2005) 2465–2473. [17] R.E. Neapolitan, Learning Bayesian Networks, Prentice Hall, 2004. [18] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, 2000. [19] J. Pearl, T.S. Verma, A theory of inferred causation, in: J.A. Allen, R. Fikes, E. Sandewall (Eds.), Principles of Knowledge Representation and Reasoning, Morgan Kaufmann, San Mateo, 1991. [20] D. Pe’er, A. Regev, G. Elidan, N. Friedman, Inferring subnetworks from perturbed expression profiles, Bioinformatics 17 (2001) S215–S224. [21] J. Polanska, D. Borys, A. Polanski, Node assignment problem in Bayesian networks, Int. J. Appl. Math. Comput. Sci. 16 (2) (2006) 233–240. [22] D.R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh, T. Barrette, A. Pandey, A.M. Chinnaiyan, ONCOMINE, a cancer microarray database and integrated data mining platform, Neoplasia 6 (1) (2004) 1–6. [23] D. Rusinek, J. Zebracka, M. Oczko-Wojciechowska, M. Wiench, D. Handkiewicz-Junak, G. Gala, K. Simek, K. Fujarewicz, Initiating mutations of BRAF gene in papillary thyroid carcinoma, in: Proceedings of the 31st Annual Meeting of the European Thyroid Association, Naples, 2006, p. 54. [24] M. Santoro, Minireview: RET: normal and abnormal functions, Endocrinology 145 (2004) 5448–5451. [25] D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D’Amico, J.P. Richie, E.S. Lander, M. Loda, P.W. Kantoff, T.R. Golub, W.R. Sellers, Gene expression correlates of clinical prostate cancer behavior, Cancer Cell 1 (2002) 203–209. [26] G. Viglietto, G. Chiappetta, F.J. Martinez-Tello, F.H. Fukunaga, G. Tallini, D. Rigopoulou, R. Visconti, A. Mastro, M. Santoro, A. Fusco, RET/PTC oncogene activation is an early event in thyroid carcinogenesis, Oncogene 11 (1995) 1207–1210. [27] R.A. Weinberg, Cancer: a genetic disorder, in: J. Mendelsohn, P.M. Howley, M.A. Israel, L.A. Liotta (Eds.), The Molecular Basis of Cancer, WB Saunders, 2001, pp. 3–9. [28] M. Wiench, J. Wloch, M. Oczko, E. Gubala, B. Jarzab, Papillary thyroid carcinoma: RET/PTC rearrangement as age-related lesion, Eur. J. Cancer 37 (suppl. 6) (2001) S25.