A route-based pathway analysis framework integrating mutation information and gene expression data

A route-based pathway analysis framework integrating mutation information and gene expression data

Accepted Manuscript A Route-Based Pathway Analysis Framework Integrating Mutation Information and Gene Expression Data Yue Zhao, Tham H. Hoang, Pujan ...

1MB Sizes 3 Downloads 97 Views

Accepted Manuscript A Route-Based Pathway Analysis Framework Integrating Mutation Information and Gene Expression Data Yue Zhao, Tham H. Hoang, Pujan Joshi, Seung-Hyun Hong, Charles Giardina, Dong-Guk Shin PII: DOI: Reference:

S1046-2023(17)30039-7 http://dx.doi.org/10.1016/j.ymeth.2017.06.016 YMETH 4246

To appear in:

Methods

Please cite this article as: Y. Zhao, T.H. Hoang, P. Joshi, S-H. Hong, C. Giardina, D-G. Shin, A Route-Based Pathway Analysis Framework Integrating Mutation Information and Gene Expression Data, Methods (2017), doi: http:// dx.doi.org/10.1016/j.ymeth.2017.06.016

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

A Route-Based Pathway Analysis Framework Integrating Mutation Information and Gene Expression Data Yue Zhaoa,∗, Tham H. Hoanga , Pujan Joshia , Seung-Hyun Honga , Charles Giardinab , Dong-Guk Shina a Computer

Science and Engineering Department, University of Connecticut, 371 Fairfield Way, Unit 4155, Storrs, CT 06269 b Department of Molecular and Cell Biology, University of Connecticut, 91 North Eagleville Road, Unit 3125, Storrs, CT 06269

Abstract We propose a new way of analyzing biological pathways in which the analysis combines both transcriptome data and mutation information and uses the outcome to identify routes of aberrant pathways potentially responsible for the etiology of disease. Each pathway route is encoded as a Bayesian Network which is initialized with a sequence of conditional probabilities which are designed to encode directionality of regulatory relationships encoded in the pathways, i.e. activation and inhibition relationships. First, we demonstrate the effectiveness of our model through simulation in which the model was able to easily separate Test samples from Control samples using fictitiously perturbed pathway routes. Second, we apply our model to analyze the Breast Cancer data set, available from TCGA, against many cancer pathways available from KEGG and rank the significance of identified pathways. The outcome is consistent with what have already been reported in the literature. Third, survival analysis has been carried out on the same data set by using pathway routes as features. Overall, we envision that our model of using pathway routes for analysis can further refine the conventional ways of subtyping cancer patients as it can discover additional characteristics specific to individuals tumor. ∗ Corresponding

author Email address: [email protected] (Yue Zhao)

Preprint submitted to Journal of LATEX Templates

June 19, 2017

Keywords: Pathway Analysis, Bayesian Network, Data Integration, Modeling and Simulation

1. Introduction Pathway analysis is a critically important topic in cancer research. Pathway analysis allows scientists to systematically examine massive amount of genomics data over a curated set of biological pathways where each pathway includes a 5

large number of proteins/genes known interacting for specific biological function, e.g., cell-cycling, apoptosis, Th1 cell differentiation, etc. Generally, pathway analysis methods can be classified into two types. The first type, called the gene set enrichment analysis, is to estimate if some specific pathways are in action in the experimental system by merely calculating the significant number

10

of common genes of a biological pathway [1]. The second type is to use the network topology in which nodes and edges designed to encode directionality of mutually interacting gene/proteins within pathways [2]. The second type is superior to the first type as it utilizes not only the topology of interacting molecular entities but also the semantics of relationships types, i.e., activation

15

and inhibition. However, the existing topology-based pathway analysis methods are still limited in the sense that the analysis does not fully exploit the network topology encoded in the pathways. We argue that the pathway analysis should be route-based which is discoverable from the topology, as opposed to the conventional way of treating the entire pathway as the unit of analysis. For

20

example, if Wnt is highly activated or suppressed, we need to know via which route Wnt is activated or suppressed. A biological pathway is generally made up of a set of multiple routes each representing an alternative or differing way of realizing its intended biological function. For example, Wnt pathway has both canonical and non-canonical routes in which the former involves β-catenin while

25

the latter does not. Likewise, TGFB pathway has the canonical route utilizing R-SMAD complexes but there are other routes involving PI3K and MAPK [3]. Identifying the details of these non-canonical routes are active research topics,

2

and developing a route-based pathway analysis method is imperative to make the biomedical community’s pathway research move forward. 30

Another motivation of this work is the need to combine multiple experimental data types for the pathway analysis. Existing pathway analysis methods mostly use transcriptome data from microarray or RNA-Seq. Recent advancement in other high-throughput sequencing technologies (e.g., DNA-Seq, ChIPSeq, ATAC-Seq) allows scientists to acquire multiple types of genomics data

35

efficiently and cost effectively, and combining these multiple genomic data sets should enhance the accuracy of pathway analysis. In the past, the cancer research community use mostly mutation data to identify the aberrant pathways. In this work, we proposed to combine both mutation and transcriptome data for the pathway analysis. Using transcriptome data to infer the protein-protein

40

interaction is a known concern, but we indicate that the most recent studies by CPTAC suggest a very good Pearson correlation (e.g., 0.84 for ERBB2) between protein and mRNA expression level changes [4]. In addition, our model can be easily extended to use protein data in lieu of or in addition to transcriptome data, should such data becomes readily available for the pathway analysis in

45

the future. The rest of this paper is organized as follows. In Section 2, existing pathway analysis methods are briefly reviewed. In Section 3, the model setting and assumptions are described in detail. In Section 4, we include discussion regarding a simulation model that we used and the results of testing our model using the

50

simulation data. We also present the outcome of applying our model to Breast cancer data from TCGA. The section also includes discussion of performing survival analysis based on pathway routes. Lastly Section 5 is the conclusion. Numerous attempts have been made to incorporate pathway information into genomic data analysis. Li et al. (2008) encode the pathway network into a

55

penalty function and try to carry out model selection by optimizing the penalty function to pick meaningful genes and subnetworks [5]. This work was refined by Pan et al. (2010) who used a new penalty function in the form of a grouped sum [6]. Tarca et al. (2008) propose an approach which measures pathway 3

significance by statistical testing against random permutation [7]. Hancock et 60

al. (2010) rank the KEGG pathway correlated with the microarray dataset using Pearson Correlation [8]. Vaske et al. (2010) present a novel method of inferring genetic activities for a certain patient by modeling the gene as a factor graph thus incorporating pathways topological information [2]. Stingo et al. (2011) apply Partial Least Squares Regression to pick meaningful pathways and

65

genes with a Markov Random Field as the prior, where a gene is more likely to be selected if its neighbor in the pathway is already selected [9]. Verbeke et al. (2015) attempt to rank the pathways by encoding binary data and pathway logic into a global network and then use the p-value to rank perturbed pathways [10]. Here the p-value is calculated based on a hypothesis test where the null

70

hypothesis is the pathway that is picked randomly. Both Korucuoglu et al. (2014) [11] and Isci et al. (2011) [12] introduce Bayesian pathway analysis methods that encode the pathway as a Bayesian network. Both approaches remove cycles in the graph first and then use the expression data to train the model. Significance of the score is given by bootstrap-generated data.

75

Most of the existing pathway analysis methods are limited by using only partial information, i.e. whether two genes are adjacent in the pathway or not, or even ignoring pathway topology by directly scoring the significance of gene sets. The limitations of these existing works include (i) semantics of regulatory relationship is usually not considered, i.e., activation or inhibition relationships,

80

(ii) some approaches involve too many parameters in the model, and (iii) most approaches are designed to elicit general phenomenon by analyzing the patients cohort thus may not be suitable for individual patient-specific analysis. In this work, we propose an approach which fully considers pathway topology together with gene expression and mutation data where patient-specific analysis is pos-

85

sible.

4

From ErbB Signaling Pathway

+P

ErbB-2

STAT5

ERBB2

GAB1

STAT5

+P Shc

SHC2 PI3K

GAB1

Simplify

Grb2

GRB2

PIK3R5

PIP3 Sos

AKT3

SOS1

AKT +P Ras

R1

BAD

HRAS

BAD

A. Pathway of interested

B. Gene regulation network GB Pick a route

M1 R2

M2

SHC2

ERBB2

GRB2

SOS1

M3

R3 R4 D. Resulting Bayesian Network

Encoding as Bayesian Network

C. The route to model, G*

Figure 1: Workflow: converting a route in the pathway to Bayesian Network

2. Material and methods 2.1. Model and Terms A biological pathway is a graphical model made up of nodes and edges which collectively represent how molecular objects interact with each other in sequence 90

to exert some biological process. Fig.1 shows an example pathway, ErbB, which has been adapted from the KEGG pathway database [13]. Fig.1 illustrates how each route of the ErbB pathway can be modeled into a Bayesian network. Initially, the source pathway (Fig.1.A) is simplified by categorizing all interactions in the pathway into either activation or inhibition relationships. We call the

95

outcome of this simplification a Gene Regulation Network (denoted as GB in Fig.1.B) and in this figure, the two edge types, activation and inhibition, are represented as an arrow and a T bar, respectively. The interactions that could adversely affect the function of the target node in the pathway are inhibition while all other interactions are encoded as activation. The next step is to iden-

100

tify all possible “routes” available from the given pathway. Fig.1.C shows a route, G∗ , which starts from ERBB2 and ends at SOS1. The selected route is then converted into a discrete Bayesian Network (denoted as G shown by

5

Fig.1.D). Our objective is to treat each “route” as a unit of pathway analysis, as opposed to the conventional approach in which the entire network is subject 105

to the analysis. This route-based method would offer a more accurate analysis as the detailed analysis would reveal whether the effect of ERBB2 amplification is more prominent through GRB2→SOS1 path or GRB2→GAB1 path, or even for both. The route-based modeling idea assumes that it is crucial to identify which portion(s) of the pathway is either abnormally activated or suppressed.

110

In this way, a more informed treatment plan could be designed. Fig.2 illustrates the conversion process from the pathway route G∗ to a Bayesian network G. For a gene regulation network GB (converted from a pathway), a path G∗ is simply a subgraph of GB , G∗ ⊆ GB , G∗ = (V ∗ , E ∗ ) where V ∗ = {g1 , . . . , gkG∗ }, where gi represents the ith gene and kG∗ is the

115

number of genes contained in path G∗ , E ∗ = {eij |1 ≤ i < kG∗ and j = i + 1}. For gi ∈ V ∗ , 0 < i < kG∗ , we create two nodes in the corresponding Bayesian Network G : Ri and Mi . For the last node, gkG∗ in the route, we only create Ri . For gi ∈ V ∗ , 0 < i < kG∗ , we add edges: Mi → gi+1 and Ri → gi+1 in G. The conditional probability table corresponding to each edge in G is decided

120

by the type of the edge in G∗ (i.e., activation or inhibition). The Bayesian Network is built based on the assumption that the expression level change, between tumor and normal cell, of the gene in the route is affected by the parent’s expression change and the mutation status in the pathway route. After

125

conversion, the resulting Bayesian Network G is formally defined as follows: S G = (V, E), where V = RR M M , RR = {Ri , i ∈ {1, . . . , kG∗ }} where Ri is a random variable representing expression level change on gene gi . M M = {Mi , i ∈ {1, . . . , kG∗ − 1}} where Mi is a random variable representing mutation level on gene gi . Now Mi and Ri are defined in detail. Since Mi does not have a parent node in G, the random variable Mi follows a Bernoulli distribution as shown in (1). The Bernouli random variable Mi has two possible values: +1 to represent obtain of function gain, i.e. the intended biological function gets stronger, and −1 to represent loss of function. The probability distribution can be set to 6

Gene 1 activation Gene 2

Gene i inhibition Genei+1

Activation conditional probability table

R1 M1 R2 Mi-1

Inhibition conditional probability table

Ri Mi Ri+1 MK-2

GeneK-1 activation GeneK

Activation conditional probability table

RK-1 MK-1 RK Corresponding Bayesian Network Model

Route in the pathway

Figure 2: Converting the route in the pathway to Bayesian Network

indicate that the prior has no specific preference on these two levels:   +1 p = 0.5 Mi =  −1 p = 0.5

(1)

Random variable Ri follows a different probability distribution based on the location of gene gi in path G∗ . Suppose the set containing parent nodes of gene gi in regulation network G∗ is denoted as pa(gi ). If pa(gi ) = ∅ or equivalently gi is the starting node in G∗ , Ri ’s distribution is shown in (2)   +1 p = 0.5 Ri =  −1 p = 0.5

(2)

where +1 represents gene gi is overexpressed in tumor cells compared with 130

normal cells, and −1 represents gene gi is under-expressed in tumor cells. On the other hand, if pa(gi ) 6= ∅, then gene gi−1 is influencing the function of 7

Table 1: THE REGULATION PROCESS IN G∗ IS ACTIVATION

*

Mi−1

Ri−1

Ri = +1

Ri = −1

+1

+1

1 − *



−1

+1

α

1−α

+1

−1

1−α

α

−1

−1



1−

 ∈ (0, 0.5) is the error rate we could tolerate

Table 2: THE REGULATION PROCESS IN G∗ IS INHIBITION

Mi−1

Ri−1

Ri = +1

Ri = −1

+1

+1



1−

−1

+1

1−α

α

+1

−1

α

1−α

−1

−1

1−



gi in G∗ . Thus the probability distribution is set to capture the mutation and expression status of its parent, Mi−1 and Ri−1 . Since G∗ is a path, | pa(gi ) |= 1. The conditional probability tables are shown in Table.1 and Table.2. The two tables, Table.1 and Table.2, are to encode how Ri ’s and Mi ’s influence with each other through the two possible regulatory relationships, activation and inhibition, in terms of the conditional probabilities. Here we discuss the activation table (Table.1) in detail; the inhibition table (Table.2) is built in a similar way. If the parent gene of gi , gi−1 , has a mutation that provides function gain and its expression level is increased in cancer cell (Mi−1 = +1, Ri−1 = +1), then the regulation process is likely to be boosted. Thus gi ’s function could be overly activating in tumor cells (compared to the case in normal cells), thus

8

making Ri = +1 given the edge between gi−1 and gi in G∗ is activation. As a result, P (Ri = +1|Mi−1 = +1, Ri−1 = +1) = 1 −  while P (Ri = −1|Mi−1 = +1, Ri−1 = +1) =  135

where  is the error rate to tolerate and should be close to zero. Similarly, if the parent gene of gi has mutation that causes function loss and its expression level is decreased in cancer cell (Mi−1 = −1, Ri−1 = −1), then the downstream regulation process towards gi may not likely be functioning. Therefore, gi can be considered less functional in tumor cells, making Ri = −1, and hence the

140

corresponding probability would be flipped. There are cases which are not straightforward. If the mutation causes function gain and the expression level is much less in tumor cells (Mi−1 = +1, Ri−1 = −1), it could be unclear how the regulation process is affected between gi−1 and gi . If the mutation causes function loss and the expression level is much higher

145

in tumor cells (Mi−1 = −1, Ri−1 = +1), then there are two scenarios to consider. The first is if 100% of cells are mutated, then even though the parent gene is highly expressed (Ri−1 = +1), the regulation process may still fail to function due to the function loss mutation. The second scenario is if the ratio of mutated cells is relatively low, (10%∼20%), then a certain amount of cells may still be

150

functioning to lead the high level of parent genes expression (Ri−1 = +1). We address this issue by introducing a weight on the mutation information, i.e. we can make the mutation level Mi−1 more likely to affect the actual expression level of the child gene Ri with a probability of α when its sign is opposite to that of Ri−1 . We then set α ∈ (, 0.5) since this type of conflict should not

155

be penalized too much, due to the uncertainty. To be thorough, the value of parameter α needs more discussion in the future. For simplicity, we set α to be 0.25 and the outcome has been satisfactory. Hyper parameters ε and α measure how much inaccuracy one would like to tolerate. Tuning the parameter will lead to more fuzzy results. One parameter tuning strategy is to start with the default

9

160

setting ε = 0.1 and α = 0.25 and evaluate the score distribution of some routes whose score are known as a prior knowledge, for example, the routes starting with ERBB2 gene in ErbB pathway should have a value close to +1 for HER2+ patients after tuning. 2.2. Ranking the Route Given (~r, m), ~ i.e., a set of data observations of the random variables in Bayesian Network G from a specific patient s, we rank the path G∗ with the probability of observing ~r and m ~ conditioning on the Bayesian network model ~ = ~r, M ~ = m|G) G, P (R ~ using a method similar to the one given in [14]. The larger the probability, the more likely the pathway route is perturbed since the observation is highly consistent with the biological logic from G∗ . The observations (~r, m) ~ may have missing values, i.e. there will be some nodes with missing data in G. One problem of using this probability as a measure is that the probability will be higher if fewer data is observed. Thus we propose to use a normalized score as shown by equation (3), where the conditional probability ~ M ~ are consistent|G). is normalized by P (R, Scores (G∗ , ~r, m) ~ =

~ = ~r, M ~ =m P (R ~ | G) ~ M ~ are consistent | G) P (R, X

~ = ~r, M ~ =m P (R ~ | G) =

(3)

~ M ~) P (R,

~ r ,M ~ =m R=~ ~

=

X

Y

P (R1 )

~ r ,M ~ =m R=~ ~

P (Ri | Mi−1 , Ri−1 )

1
~ M ~ are consistent|G) is the probability that the random variables with obP (R, servations are fully consistent with the biological logic encoded in the pathway route given R1 = +1, i.e., the first node in the path is upregulated. For instance, suppose the pathway route contains only two genes and g1 activates g2 . The observation is (r1 = −1, m1 = −1), i.e., R2 is not available from the given data. Then we have the following situation: ~ M ~ are consistent | G) P (R, (4) = P (R1 = +1, M1 = +1 | G) 10

(since g2 is the last node in the route, M2 is not included in the model). If R2 is available in the data, then we have the following: ~ M ~ are consistent | G) P (R, (5) = P (R1 = +1, M1 = +1, R2 = +1 | G) 165

A high score means that the path G∗ is highly likely perturbed based on the data we observe. A path G∗ could only get a high score if the observations (i.e., the changes in tumor cells for each gene) are highly consistent with pathway information encoded in the Bayesian Network G. Inconsistency between data and the model would lower the score greatly since the conditional probability

170

will be  instead of 1 −  during the calculation of the score. Advantages of this measure are: (i) the analysis could be done across pathways, i.e. after merging pathways in a reasonable way, this measure could recognize a significantly meaningful route across different pathways, and this could allow cancer researchers to easily notice which biological processes are likely to be overly

175

activated or suppressed; and (ii) even though some expression values are flipped due to random errors from the genomic data (it is observed to be −1 when it is actually +1), the whole path would still have a high score since the majority of other genes have consistent expression observations. We reiterate that the experimental data we use here comes from one patient, s, indicating that the

180

score is specifically tailored to the patient s. The score we discussed in the above could be extended to deal with a cohort of patients, resulting in a group score, ScoreS , S, as shown by equation (6). Assuming each patients disease condition is independent, the score for a group of patient can be set as follows. ScoreS (G∗ ) =

Y s∈S

=

Y

~ = ~rs , M ~ =m P (R ~ s | G) ~ ~ P (R, M are consistent | G)

(6)



Scores (G , ~rs , m ~ s)

s∈S

Since the perturbed route could have two possible status, activated or suppressed, we define a pathway route G∗ to be activated if the last gene’s expression value is observed to be the same as the expected value. The expected value 11

is calculated based on biological logic in G∗ by assuming that the first gene is upregulated. The route is defined to be suppressed if the observation of the last node is opposite to the expectation. The score is easily extended to include this information, resulting in a new signed score, sScore, as shown in (7). ˜ |G∗ | , r˙|G∗ | ) · Scores (G∗ , ~r, m) sScores (G∗ , ~r, m) ~ = I(r ~

(7)

where rG∗ is the observed expression level of the last node in the route G∗ and r˙|G∗ | is the expected expression level of the last node in the route conditioning on the first node being overexpressed. Function I˜ : R2 → R is defined in (8). The signed score varies from −1 (highly suppressed) to +1 (highly activated).   +1 x = y ˜ y) = I(x, (8)  −1 x 6= y Finally, we discuss how the measure for a whole pathway can be derived from the collection of individual route scores. The pathway score for pathway GB based on data of subject s, pScores (GB ), is introduced as follows: 1 |GB | X 1 X I( Scores (G∗ ) > β) |S| ∗

pScoreS (GB ) =

G ∈GB

(9)

s∈S

The idea behind generating the pathway score is the following. The pathway could be partitioned into several routes. We then simply measure the significance of this pathway, GB , by using the proportion of routes that have an average of all the patients’ scores, calculated by equation (3), that is larger than 185

some threshold β. 2.3. Data Description The gene expression variable Ri value could be measured by many types of gene/protein expression data, for instance, Microarray, RNA-seq, Reverse phase protein array (RPPA) among others. When RNA-seq is chosen, Ri ’s observation ri can be generated by log2 ratio of RNA-seq FPKM using the 12

equation (10) given blow. Here the threshold is set to 0.5 to tolerate the random error originating from sequence processing.    +1    ri = −1     missing

umorF P KMi log2 ( NT ormalF P KMi ) > 0.5 umorF P KMi log2 ( NT ormalF P KMi ) < −0.5

(10)

otherwise

The data mi observed for random variables Mi is to represent the mutation level for gene gi and it is typically obtained from WGS or WES. Observation mi = +1 if the mutation on gi is known for function gain or mi = −1 if known 190

for loss of function. We assume that whether a mutation is known for gain, or loss of function is obtainable from existing biological literatures or SNP mutation databases (e.g., ClinVar, dbSNP, COSMIC). In case such prior knowledge is not available, by default we set mi = −1 if a mutation is detected for gi , i.e., treat them as loss of function.

195

3. Results and Discussion 3.1. Simulation 3.1.1. Simulation Model Simulation is to evaluate the performance of our model using artificially created data sets which we know the answer. We first describe how our simulation

200

data sets are systematically generated. During the data generation, we control perturbation of certain routes of pathways and examine if the presumed perturbed routes are discovered by our model. In order to show that our simulation ˜ B using model is generally applicable, we create a random network defined as G the following stochastic process: (i) generate the number of nodes in the network

205

using a Binomial distribution parameterized by Binomial(n = 30, p = 0.5), (ii) assuming the number of nodes obtained is N, generate a random DAG(Directed acyclic graph) with nodes g1 , . . . , gN , (iii) randomly assign the edge type in the network as either activation or inhibition, and (iv) randomly pick one of the routes in the network as the aberrant one and denote it G˜∗ . Then, two groups 13

210

of data sets are simulated based on different assumptions. One group (denoted as Test Group) simulates data assuming the picked route is perturbed while the other group (denoted as Control Group) simulates data assuming the same route is not perturbed. Both groups will simulate transcriptome and mutation data for both tumor and normal samples for all the genes placed in the randomly

215

generated network. The model for data simulation in Test Group and Control Group is introduced separately below. (a) Test Group. Let G˜∗ be the perturbed path. The following process is repeated for each gene in G˜∗ : As discussed before, cancer could affect the path in two ways, either overly enhancing or suppressing. We generate suppressed path

220

data with a probability of 0.8 and the enhanced with probability of 0.2. After enhancing or suppressing is randomly determined with the probability distribution mentioned above, expression value for each gene in G˜∗ is determined according to the topology of the path (i.e., activation and inhibition information). The transcriptome simulation data for normal cells, F P KMnormal , is first generated

225

by a Normal distribution, N (50, 5) and then a change factor η is sampled based on the path topology. If a gene is to be highly activated, η is sampled from [1, 10] uniformly. Otherwise, η is sampled from [0, 1] uniformly. The transcriptome simulation data from cancer cells, F P KMtumor , is then sampled from the normal distribution N (50η, 5), i.e., F P KMtumor = 50η + ε, where ε is random

230

noise following N (0, 5). The random error is introduced to simulate the error in real RNA-seq data. For the mutation data, we choose it to be consistent with the simulated transcriptome data. In doing so, since mutation is infrequent in real data, we use the following approach: if gi is to be upregulated(down regulated), mi is sampled to be +1(−1) with probability of 0.2 with the unobserved

235

cases with the probability of 0.8. For gi not on path G˜∗ , data is simulated in the same way as genes in the Control Group. ˜ B be a random network. Since there is no specific (b) Control Group. Let G ˜ B is randomly set to be activated, suppressed constraint to follow, the genes in G or missing with the probability of each event being 1/3. The transcriptome 14

240

data for normal cells, F P KMnormal , is generated by the Normal distribution N (50, 5) and then a change factor η is sampled. If the gene is to be activated, η is sampled from [1, 10] uniformly. Otherwise, η is sampled from [0, 1] uniformly. The transcriptome data from cancer cells , F P KMtumor , is then sampled from the Normal distribution N (50η, 5), i.e., F P KMtumor = 50η + ε, where ε is

245

random noise following N (0, 5). In this case we make mutation data be totally independent of transcriptome data of the same gene. Its sampling is done in the following way: mi is sampled to be +1 with probability of 0.05, and sampling for −1 is done with the probability of 0.15. Sampling for unobserved is done with the probability of 0.8, similar to what was done for Test Group.

250

3.2. Simulation Results We simulate 400 “patients” data for Test Group and 400 “patients” data for Control Group. With the generated data, we calculate the score, defined in equation (3), for the chosen path G˜∗ by setting  = 0.1 and α = 0.25. The box plot of the scores is displayed in Fig.3. This box plot from the simulation

255

makes it clear that a reasonable threshold to distinguish patients from the two groups should be higher than 0.7. A threshold is needed to perform prediction, i.e. the cases with a score higher than the threshold is predicted to be from the Test Group. After obtaining False positive rate and True positive rate with various thresholds, multiple ROC (Receiver operating characteristic) curves are

260

produced and given in Fig.4. This figure includes four ROC curves, which are colored red, blue, orange and purple. The red curve is the case which combines mutation information and transcriptome data together with  = 0.1. The AUC (Area Under Curve) of the red curve is 0.94 indicating an excellent prediction accuracy. The blue curve corresponds to the result without incorporating the

265

mutation data. The resulting AUC is reduced to 0.89. The other two curves (orange and purple) are produced with different  with both expression and mutation data. The orange curve corresponds to  = 0.2 instead of 0.1. The purple curve corresponds to the result with  = 0.25. With a smaller , a larger AUC of the ROC curve is obtained. These outcomes clearly suggest that 15

Figure 3: The box plot of scores (with parameters:  = 0.1, α = 0.25) by simulation data for the same randomly picked route. Test Group (Group A) has the route aberrant while Control Group (Group B) doesn’t. The scores for the two groups are distributed in two very different patterns.

270

incorporating the mutation information into the analysis produces a noticeable impact on improving accuracy, meaning more samples of the Test Group become identifiable without increasing false positives. Using simulation data is obviously different from using real data. Studying which choices of  and α are more desired for real data is left as future work.

275

3.3. Significance Analysis The bioinformatics field frequently uses the TCGA Breast cancer data to test newly developed analysis models. We choose the same TCGA cancer data set to show effectiveness of our model. We downloaded transcriptome and mutation data from cBioPortal database [15][16] consisting of Breast Invasive Carcinoma

280

processed data sets of 1,105 samples from 1,098 patients. This data set includes 113 test and matching control data and thus we decided to exclusively use these matching-pair data sets in our study. For the transcriptome data, the cancer vs. 16

Figure 4: ROC curve: the threshold is picked from [0, 1] with step of 1/6. Smaller steps condition is also considered and the result makes no big difference

normal paired ratios of FPKM for these patients are converted to the expression observation with equation (10). For the mutation data, the processed mutation 285

calls are extracted from mutation accessor study [17]. The mutation with a ‘medium’ or ‘high’ impact factor is considered so that non functional mutation is filtered out. The entire KEGG Homo Sapiens pathways are used in this study, and the analysis is done mainly with R package “KEGGgraph” [18] and “gRain” [19]. The parameter setting for the Bayesian Network is  = 0.1 and α = 0.25.

290

We carried out a significance analysis similar to that of PARADIGM [2]. First, we produce decoy pathways by permuting the genes in the pathway while keeping the network topology intact. We generate one decoy pathway for each of 293 Homo Sapiens KEGG pathways [13]. For each pathway, we extract all possible routes in it. Then aggregating the scores from each route with

295

more than two nodes, we calculate the score for each pathway by equation (9). We then rank the real pathways and their corresponding decoy pathways. Furthermore, while taking β = 0.7224, the top significant signaling pathways

17

for Breast cancer are filtered out from 44 Homo Sapiens signaling pathways if the real pathway score is larger than the one for its decoy pathway. The 300

resulting pathways and their scores are listed in Table 3. The second column of the table is the score for the real pathway and the third column the score for the corresponding decoy pathway. The list is ordered by the score of the real pathways, while pathways with a score less than its corresponding decoy pathway are removed. The real pathway score is relatively close to the decoy

305

pathway score because the pathway are usually partially perturbed and a large proportion of the routes in each pathway could have a similar score as the routes of the corresponding decoy pathway. Nevertheless, differences between real and decoy pathways clearly emerge. The larger the score difference between the real pathway and its corresponding decoy pathway is, the more significantly the

310

real pathway would have been involved in the initiation and progression of the cancer. The pathways identified in the list have been cross-examined with reported findings in the literature. Oxytocin Receptors are related to breast cancer according to [20]. In primary breast cancer, AMPK activity is known diminished in

315

an estimated 90% of cases [21]. ErbB signaling pathway is well known frequently dysregulated in breast cancer [22]. Rap1 GTPase is found to be driving breast cancer cell migration with 1-integrin as suggested in [23]. When it comes to Calcium signaling pathway, specific Ca(2+) channels reportedly play important roles in the proliferation and invasiveness of breast cancer cells [24]. In breast

320

cancer Neurotrophins and their receptors significantly impact tumor cell growth and metastasis through various signaling pathways according to [25]. The importance of forkhead box O (FOXO) activity in the development of endocrine resistant breast cancer is highlighted in [26]. Oestrogen is known to trigger the sphingolipid signaling cascade in various tissues including breast cancer [27].

325

MAPK pathway involving ERK-1 and ERK-2 is known very relevant to breast cancer progression [28]. Phospholipase D (PLD) overexpresses in breast cancer cell and its implication in survival signals as discussed in [29]. Ras signaling pathway is reported to be activated in many cancers [30]. According to Wang 18

and Li, LOX-1 is up-regulated by TNF in endothelial cell promoting the adhe330

sion and trans-endothelial migration of MDA-MB-231 breast cancer cells [31]. In human breast cancer cells thyroid hormone signaling overshadows estrogen signaling on SMP30 gene leading to induction of apoptosis [32]. HIF-1 plays a key role in regulating breast cancer progression and metastasis [33]. Dysfunction of Hippo pathway components is linked to breast cancer stem cell regulation

335

and the connection between the disease and genetic variations in the pathway is reported in [34]. Evidence from both pre-clinical models and epidemiological studies is reviewed to examine whether adrenergic signaling may modify breast cancer biology [35]. Role of Estrogen Receptor Signaling in Breast Cancer Metastasis is discussed in [36]. Activation of the phosphoinositide 3 kinase

340

(PI3K)/Akt/mammalian target of rapamycin (mTOR) pathway is commonly reported in breast cancer [37]. An excellent review summarizing the role of the toll-like receptor signaling pathway on breast cancer risk, disease progression, survival, and disease recurrence is given in [38]. Inhibition of breast cancer cell migration can be achieved by activation of cAMP signaling [39]as this fact could

345

have been suggested by the negative scores observed in the cAMP pathway result. Overall, we found that 84% (21/25) of the pathways in Table III have some published facts implicated in breast cancer, suggesting that our analysis could be producing meaningful outcomes. The rest 16% pathways could potentially be novel pathways for breast cancer and may merit further investigation.

350

3.4. Survival Analysis The goal of survival analysis is to identify which pathway route could highly affect the mortality of the patients. After processing score calculation for all of the 293 Homo Sapiens pathways, all the route scores are saved into a data matrix where columns correspond to the routes and rows correspond to the

355

patients. This data matrix includes more than 40,000 features, i.e., the score columns obtained from each route of the pathways. We applied Lasso Cox proportional hazards model to derive the relationship between predictor (routes) variables and survival time [40]. The Lasso regression methods generates sparse 19

Table 3: SIGNIFICANT PATHWAY REPORTED

Pathway

Real Score

Decoy Score

Oxytocin signaling pathway

0.002 34

4.170 00 · 10−05

AMPK signaling pathway

0.001 76

0.000 41

ErbB signaling pathway

0.001 74

8.720 00 · 10−05

T cell receptor signaling pathway

0.001 59

0.000 48

Rap1 signaling pathway

0.001 26

0.000 12

Calcium signaling pathway

0.001 21

3.220 00 · 10−05

Fc epsilon RI signaling pathway

0.001 06

0.000 00

Neurotrophin signaling pathway

0.001 03

0.000 13

FoxO signaling pathway

0.001 00

6.250 00 · 10−05

Adipocytokine signaling pathway

0.000 88

0.000 53

Sphingolipid signaling pathway

0.000 63

7.830 00 · 10−05

MAPK signaling pathway

0.000 55

0.000 21

Phospholipase D signaling pathway

0.000 53

0.000 27

Ras signaling pathway

0.000 53

0.000 34

TNF signaling pathway

0.000 52

0.000 39

Thyroid hormone signaling pathway

0.000 40

0.000 15

HIF1 signaling pathway

0.000 37

5.530 00 · 10−05

Hippo signaling pathway

0.000 37

0.000 19

Adrenergic signaling in cardiomyocytes

0.000 31

0.000 12

Estrogen signaling pathway

0.000 24

0.000 00

PI3KAkt signaling pathway

0.000 23

0.000 16

NODlike receptor signaling pathway

0.000 22

8.920 00 · 10−05

Tolllike receptor signaling pathway

0.000 17

9.700 00 · 10−05

cAMP signaling pathway

0.000 12

7.440 00 · 10−05

Glucagon signaling pathway

5.150 00 · 10−05

0.000 00

features, setting all the unimportant features to zero. The implementation was 360

done using the R-package glmnet [41]. We used default values for most of the

20

required parameters of the package except λ which is to pick proper limit of total parameters’ L − 1 Norm, λ. For λ we have chosen 0.12 and this value controls the 10-fold cross validation error to minimal in our case. Among the 41,676 pathway routes (features), only 21 of them have nonzero estimation. 365

The Kaplan-Meier (KM) plots for these 21 routes are shown in Fig.5. This figure shows that 20 route cases out of 21 have a KM estimation significant p-value (p < 0.05). The only exception is the case whose route is made up of “CHRM2,GNB5,PIK3R5,HRAS,MAP2K1,MAPK1,CREB5”. Even in this case, (p = 0.087) which is close to the significance threshold value 0.05.

370

One of routes from Cytokine-cytokine receptor interaction pathway, IL23A → IL23R remains significant for all possible selection of λ. Thus this route could be affecting the patients survival rate significantly and worthy of further investigation. 3.5. Deep Patient-specific Investigation Analysis

375

Our next step is to demonstrate that our route identifying method can offer a comprehensive and yet deeper patient-specific biological insights. First, we merged five pathways of interest, ErbB, PI3K-AKT, Apoptosis Cell Cycle and Cancer Pathway, into one big gene regulatory network. Then the signed scores are calculated for all possible paths ending with the same 57 cancer marker

380

genes. We found that even though multiple patients are of the same type of tumor, HER2+ for instance, the scores for individual paths among the same subtype patients are still very different. We conjecture that a systematic analysis of these differences may offer a way to further refine each known subtype. We illustrate such refined subtyping possibility by a heat map as shown in Fig.6.

385

The clear pattern difference of the figure suggests that our route-based method may offer additionally deeper subtyping, for example, each column in Fig.6 could potentially represent a sub-subtype among the HER2+ patients. Next we summarize which are the commonly activated routes among the HER2+ patients in Table.4.

390

All these routes score +1 using the equation

(7). Route 1 is from the Cell Cycle pathway in which CNKN1B is inhib21

Group=1

+

++ ++ ++++++++++++++++++++++ + + ++ ++ +++++ +++++++ + + p = 0.00029 0

1000

PRKACA,KCNJ9

Group=2

Strata

+ 2000

3000

4000

1.00 0.75 0.50 0.25 0.00

+

Group=0

+

Group=1

+

+++++++++ ++ ++ + +++++ +++++++++++ +++++ +++ +++ +++++ ++ + + p = 0.00078 0

1000

+ 2000

3000

Time MAFA,PDX1,HNF4A,SLC2A2

Group=1

+

Group=2

Strata

+ 2000

3000

4000

1.00 0.75 0.50 0.25 0.00

+

Group=0

+

Group=1

+

++++++++++++++++ + +++++++++++++++ ++ + + + + +++++++ + 1000

Strata

+

p = 0.00051

0

Group=2

2000

3000

4000

Survival probability

+

Survival probability

Group=0

1.00 0.75 0.50 0.25 0.00

+

Group=0

+

Group=1

+

++++++++ +++ ++ ++ +++++ +++ ++ + +++ ++++++ +++ ++ + ++ + +++ +++ 1000

2000

Time

Time

IL23A,IL23R

PRKCB,GNB5,CACNA1B

+

1000

Group=2

Strata

+ 2000

3000

4000

1.00 0.75 0.50 0.25 0.00

+

Group=0

+

Group=1

+

Group=2

p = 0.00012

0

+

1000

2000

3000

+

Strata

++++++ ++ ++++++++++++ ++++ +++++++++++++++ + ++ ++ + ++ 4000

Survival probability

Group=1

Survival probability

+

1.00 0.75 0.50 0.25 0.00

Group=1

+

1000

+ 2000

3000

Time

Time

Time

BTC,ERBB4,PIK3R5,AKT3,CDKN1B

CCL18,CXCR6,GNAI1,FGR,PIK3R5,AKT3,FOXO3

TLR5,MYD88,RELA

Group=2

1000

Strata

+ 2000

3000

4000

1.00 0.75 0.50 0.25 0.00

+

Group=1

+

Group=2

++++++++++++++++++++ +++++ +++ ++ + +++++ ++ ++ + ++ +

+

p = 0.0031

0

1000

Strata

2000

3000

4000

Survival probability

+

Survival probability

Group=1

1.00 0.75 0.50 0.25 0.00

+

Group=0

+

Group=1

+

++ ++ +++++ +++++++++++++ +++ ++ +++ +++++++++ +++++ + ++ ++++ 1000

2000

3000

Time

Time

RASAL1,RRAS2,RAF1,MAP2K1,MAPK1,PLA2G1B

CXCL12,CXCR4,PTK2B,VAV3,RAC2

CALM2,CAMK2A,HRAS,PIK3R2

Group=1

+

Group=2

1000

Strata

+ 2000

3000

4000

1.00 0.75 0.50 0.25 0.00

+

Group=1

+

Group=2

++++++ ++ +++++++++++++ ++++++++++++++++++++ + + +

+

p < 0.0001

0

1000

Strata

2000

3000

4000

Survival probability

+

p = 0.0023

1.00 0.75 0.50 0.25 0.00

+

Group=1

+

Group=2

p < 0.0001

0

+

+

1000

2000

3000

Time

Time

CALM1,RASGRF1,RRAS2,RAF1,MAP2K1,MAPK1,PLA2G16

CHRM2,GNB5,PIK3R5,HRAS,MAP2K1,MAPK1,CREB5

PRKCB,HRAS,PIK3R2

Group=0

+

Group=1

+

Strata

+

p = 0.043

1000

Group=2

2000

3000

4000

1.00 0.75 0.50 0.25 0.00

+

Group=1

+

Group=2

++++++++++++++++++++++++ + + ++++++++ +++++++

Strata

+

p = 0.087

0

1000

2000

3000

4000

Survival probability

+

+++++++++ +++++++++++ ++++ +++ ++++++++++ + ++++++++

1.00 0.75 0.50 0.25 0.00

+

Group=0

+

Group=1

+

Group=2

p < 0.0001

0

+

+

1000

2000

3000

Time

Time

CALM1,RASGRF1,RRAS2,RAF1,MAP2K1,MAPK1,PLA2G1B

PRKACB,MAPK1,CREB3L3

FGF12,EGFR,NRAS,BRAF,MAP2K1,MAPK1,CDK4,RB1

Group=0

+

Group=1

1000

Group=2

Strata

+

p = 0.011

0

+

2000

3000

4000

1.00 0.75 0.50 0.25 0.00

Time

+

Group=0

+

Group=1

++++++++++++ +++++++++ ++ +++++++++++ +++ + ++ ++++++++ + 1000

Group=2

Strata

+

p = 0.00087

0

+

2000 Time

3000

4000

Survival probability

+

1.00 0.75 0.50 0.25 0.00

4000

++++++++ +++ ++ +++++++++++++++ ++++++ ++++++ +

Time

+++ ++ ++++++++++++++++++++++ +++++++++++ + ++++++ ++

4000

++++++++ ++ ++ +++++++++++++++ ++++++ ++++++ +

Time

0

Group=2

Time

+++ ++ ++++++++++++++++++++++ ++++++++++++ + + ++++ + +

4000

+

p = 0.0036

0

4000

Group=2

++++++++++ ++ ++++++++ +++ ++++ ++++++ ++++++++ ++++ + + p = 0.0091 0

Group=2

3000

Time

Group=0

4000

+

p = 0.0033

0

HBS1L,PELO

+

Group=2

Time

p = 0.0039

0

+

RAPGEF4,MAPK9

++++++++++++++ +++++++ +++ ++ +++ +++++++++++ + ++++ ++ 0

Group=0

Time

+++++++++++ +++ ++++++++ ++++ +++ ++++ ++ +++++ +++ ++ + ++ + + p = 0.0041 0

1.00 0.75 0.50 0.25 0.00

+

EIF3G,EIF1

1000

+

Survival probability

4000

Survival probability

Survival probability Survival probability Survival probability Survival probability

3000

p = 0.0063

0

Strata

1.00 0.75 0.50 0.25 0.00

Strata

++++++++++++++++ ++ + ++ ++++++ +++++ ++++ ++++ +++ + ++

Strata

1.00 0.75 0.50 0.25 0.00

Group=2

Survival probability

Survival probability Survival probability

+

2000

Strata

1.00 0.75 0.50 0.25 0.00

+

+

1000

Strata

1.00 0.75 0.50 0.25 0.00

RAPGEF4,RAP1A,BRAF,MAP2K1,MAPK1

Group=1

p = 0.0097

0

Strata

1.00 0.75 0.50 0.25 0.00

+

+++++++++++++++++++ ++ ++++++++++++++ ++++++ ++ + ++ + +

Strata

1.00 0.75 0.50 0.25 0.00

Group=0

Survival probability

Survival probability

1.00 0.75 0.50 0.25 0.00

+

Survival probability

B3GALT5,FUT1 Strata

+

Group=1

++++++++++++++++++++++ + + ++++++ +++++++ +++++ + 1000

Group=2

+

p = 0.00069

0

+

4000

2000

3000

4000

Time

Figure 5: Kaplan Meier plots for the selected routes by the model with p-value attached. Group 0 is the patients with zero score for the corresponding route, while patients in Group 1 and Group 2 seperately have negative and positive scores for the same route. Routes are shown in the title above each plots

ited. CNKN1B is a known tumor suppressor gene and enhancing this route would contribute to cell proliferation [42]. Route 2 is a path across PI3K-AKT pathway and Cancer pathway. In this route, PPP2CA is actively suppressing AKT3 thus possibly preventing MTOR from activation, where MTOR is known 395

responsible for protein synthesis, cell proliferation and evading apoptosis [43]. Route 3 is from the apoptosis pathway and in this path DIABLO inhibits BIRC2

22

Figure 6: The heatmap of path scores by HER2+ tumor, where rows corresponds to different routes and columns denote five HER2+ patients.

thus possibly letting CASP3 get activated. A similar finding has already been reported suggesting that CASP3 is overexpressed in carcinomas [44]. A similar observation is applied to Route 6. The rest of the routes all come from 400

the Apoptosis pathway and these routes could be due to DNA fragmentation which is a generally accepted up-stream biological process known to lead apoptosis [45]. When all these observations are combined, one could generalize that in many HER2+ patients, apoptosis is in action but cell proliferation through Route 1 could outbalance the cell death.

23

Table 4: The commonly enhanced routes for HER2+ subtype

No.

Routes

1

CDK7→CDK2aCDKN1B

2

PPP2CAaAKT3→MTOR

3

DIABLOaBIRC2aCASP3

4

DIABLOaBIRC2aCASP3aDFFA

5

HTRA2aBIRC2aCASP3aDFFA

6

CTSBaBIRC2aCASP3

Table 5: The commonly suppressed routes for HER2+ subtype

405

No.

Routes

1

VHLaEPAS1→FIGF

2

CUL2aEPAS1→FIGF

3

RUNX1→CSF1R→IRS1→PIK3CA→AKT3 →MTOR

4

RUNX1T1aCSF1R→IRS1→PIK3CA→AKT3→MTOR

5

DAPK1aMAPK1→JUN→FIGF

6

DAPK2aMAPK1→BCL2

7

DAPK2aMAPK1→JUN→FIGF

Next we discuss the routes that are commonly suppressed (with the score of −1) in Table.5. Route 3 and Route 4 are suppressed leading to the same result as the Route 2 in Table.4, i.e. suppression of MTOR expression. The fact that Route 6 is suppressed could mean that DAPK fails to inhibit MAPK1 thus

24

possibly causing the expression of BCL2. BCL2 is a well-known anti-apoptotic 410

gene [46] and this route being highly suppressed could be seen as one of the leading factors contributing to the cancer development. The last one, Route 7, ending with FIGF is known responsible for sustained angiogenesis, again an important hallmark of cancers biological capabilities [47]. Similar analysis could be presented for the other three subtypes but we omit

415

the details due to space limitation. Lastly, for an individual patient our model discovers more than 2000 highly activated routes and more than 2000 highly suppressed routes. Thus how to rank these identified ones is an important future research issue.

4. Conclusion 420

We described a new approach for pathway analysis using a scoring model that measures the consistency between gene expression data, mutation data and the interaction relationships specified in the pathways. Unlike existing methods, the key objective of computing the consistency between the data and the prior knowledge specified in the pathway is to identify the particular route(s)

425

of gene/protein interaction that could explain the molecular basis of the cancer initiation and progression. We first demonstrated the performance of the model through simulation. We then carried out a real-world data analysis using TCGA Breast Cancer data and confirmed that our model is capable of recognizing significant KEGG pathways reported in the cancer research literature.

430

During the experiment, we also observed that even though patients are from the conventionally categorized subtype (e.g., HER2+), the way their pathway routes are overly activated or suppressed is quite different, thus offering the advantage of our approach as well as its potential use in further refining the known cancer subtypes. Survival analysis based on identified pathway routes has also

435

been performed in which we discover that 19 route cases out of 20 have a KM estimator significant p-value (p < 0.05). Regarding future work, our Bayesian network model offers an easy way of incorporating additional data types such as

25

CNV, proteomics data, methylation data, and so on, and such model extension should be attempted. Lastly, we are optimistic that our model can be applied 440

to studying a wide array of biological systems beyond cancer research, as long as various experimental data types of test vs. control samples are obtainable.

References References [1] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, 445

M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences 102 (43) (2005) 15545–15550. [2] C. J. Vaske, S. C. Benz, J. Z. Sanborn, D. Earl, C. Szeto, J. Zhu, D. Haus-

450

sler, J. M. Stuart, Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using paradigm, Bioinformatics 26 (12) (2010) i237–i245. [3] J. Massagu´e, Tgfβ signalling in context, Nature reviews Molecular cell biology 13 (10) (2012) 616–630.

455

[4] P. Mertins, D. Mani, K. V. Ruggles, M. A. Gillette, K. R. Clauser, P. Wang, X. Wang, J. W. Qiao, S. Cao, F. Petralia, et al., Proteogenomics connects somatic mutations to signalling in breast cancer, Nature 534 (7605) (2016) 55–62. [5] C. Li, H. Li, Network-constrained regularization and variable selection for

460

analysis of genomic data, Bioinformatics 24 (9) (2008) 1175–1182. [6] W. Pan, B. Xie, X. Shen, Incorporating predictor network in penalized regression with application to microarray data, Biometrics 66 (2) (2010) 474–484.

26

[7] A. L. Tarca, S. Draghici, P. Khatri, S. S. Hassan, P. Mittal, J.-s. Kim, 465

C. J. Kim, J. P. Kusanovic, R. Romero, A novel signaling pathway impact analysis, Bioinformatics 25 (1) (2009) 75–82. [8] T. Hancock, I. Takigawa, H. Mamitsuka, Mining metabolic pathways through gene expression, Bioinformatics 26 (17) (2010) 2128–2135. [9] F. C. Stingo, Y. A. Chen, M. G. Tadesse, M. Vannucci, Incorporating bio-

470

logical information into linear models: A bayesian approach to the selection of pathways and genes, The annals of applied statistics 5 (3). [10] L. P. Verbeke, J. Van den Eynden, A. C. Fierro, P. Demeester, J. Fostier, K. Marchal, Pathway relevance ranking for tumor samples through network-based data integration, PloS one 10 (7) (2015) e0133503.

475

[11] M. Korucuoglu, S. Isci, A. Ozgur, H. H. Otu, Bayesian pathway analysis of cancer microarray data, PloS one 9 (7) (2014) e102803. [12] S. Isci, C. Ozturk, J. Jones, H. H. Otu, Pathway analysis of highthroughput biological data within a bayesian network framework, Bioinformatics 27 (12) (2011) 1667–1674.

480

[13] M. Kanehisa, S. Goto, Kegg: kyoto encyclopedia of genes and genomes, Nucleic acids research 28 (1) (2000) 27–30. [14] D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques, MIT press, 2009. [15] J. Gao, B. A. Aksoy, U. Dogrusoz, , et al., Integrative analysis of complex

485

cancer genomics and clinical profiles using the cbioportal, Science signaling 6 (269) (2013) pl1. [16] E. Cerami, J. Gao, U. Dogrusoz, , et al., The cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data, Cancer discovery 2 (5) (2012) 401–404.

27

490

[17] B. I. T. G. D. A. Center, Mutation assessor (2016).

doi:10.7908/

C1F18Z2Z. URL https://doi.org/10.7908/C1F18Z2Z [18] J. D. Zhang, S. Wiemann, Kegggraph: a graph approach to kegg pathway in r and bioconductor, Bioinformatics 25 (11) (2009) 1470–1471. 495

[19] S. Højsgaard, Graphical independence networks with the gRain package for R, Journal of Statistical Software 46 (10) (2012) 1–26. URL http://www.jstatsoft.org/v46/i10/ [20] A. Reversi, P. Cassoni, B. Chini, Oxytocin receptor signaling in myoepithelial and cancer cells, Journal of mammary gland biology and neoplasia

500

10 (3) (2005) 221. [21] W. Li, S. M. Saud, M. R. Young, G. Chen, B. Hua, Targeting ampk for cancer prevention and treatment, Oncotarget 6 (10) (2015) 7365–78. [22] Y. Yarden, M. X. Sliwkowski, Untangling the erbb signalling network, Nature reviews Molecular cell biology 2 (2) (2001) 127–137.

505

[23] E. A. McSherry, K. Brennan, L. Hudson, A. D. Hill, A. M. Hopkins, Breast cancer cell migration is regulated through junctional adhesion molecule-amediated activation of rap1 gtpase, Breast Cancer Research 13 (2) (2011) R31. [24] I. Azimi, S. Roberts-Thomson, G. Monteith, Calcium influx pathways in

510

breast cancer: opportunities for pharmacological intervention, British journal of pharmacology 171 (4) (2014) 945–960. [25] H. Hondermarck, Neurotrophins and their receptors in breast cancer, Cytokine & growth factor reviews 23 (6) (2012) 357–365. [26] M. Bullock, Foxo factors and breast cancer: outfoxing endocrine resistance,

515

Endocrine-related cancer 23 (2) (2016) R113–R130.

28

[27] O. Sukocheva, C. Wadham, Role of sphingolipids in oestrogen signalling in breast cancer cells: an update, Journal of Endocrinology 220 (3) (2014) R25–R35. [28] R. J. Santen, R. X. Song, R. McPherson, R. Kumar, L. Adam, M.-H. 520

Jeng, W. Yue, The role of mitogen-activated protein (map) kinase in breast cancer, The Journal of steroid biochemistry and molecular biology 80 (2) (2002) 239–256. [29] J. Gomez-Cambronero, Phosphatidic acid, phospholipase d and tumorigenesis, Advances in biological regulation 54 (2014) 197–206.

525

[30] E. Niemitz, Ras pathway activation in breast cancer, Nature genetics 45 (11) (2013) 1273–1273. [31] X. Wang, Y. Lin, Tumor necrosis factor and cancer, buddies or foes? 1, Acta pharmacologica Sinica 29 (11) (2008) 1275–1288. [32] S. K. Mishra, P. Sar, B. Rath, D. Bhargava, D. Sengupta, S. Chaudhary, In

530

human breast cancer cells thyroid hormone signaling overshadows estrogen signaling on smp30 gene leading to induction of apoptosis (2012). [33] Z.-j. Liu, G. L. Semenza, H.-f. Zhang, Hypoxia-inducible factor 1 and breast cancer metastasis, Journal of Zhejiang University Science B 16 (1) (2015) 32–43.

535

[34] J. Zhang, S. Yao, Q. Hu, Q. Zhu, S. Liu, K. L. Lunetta, S. A. Haddad, N. Yang, H. Shen, C.-C. Hong, et al., Genetic variations in the hippo signaling pathway and breast cancer risk in african american women in the amber consortium, Carcinogenesis (2016) bgw077. [35] E. I. Obeid, S. D. Conzen, The role of adrenergic signaling in breast cancer

540

biology, Cancer Biomarkers 13 (3) (2013) 161–169. [36] S. Saha Roy, R. K. Vadlamudi, Role of estrogen receptor signaling in breast cancer metastasis, International journal of breast cancer 2012. 29

[37] J. J. Lee, K. Loh, Y.-S. Yap, Pi3k/akt/mtor inhibitors in breast cancer, Cancer biology & medicine 12 (4) (2015) 342–354. 545

[38] R. K. La Creis, E. N. Rogers, S. T. Yeyeodu, D. Z. Jones, K. S. Kimbro, Contribution of toll-like receptor signaling pathways to breast tumorigenesis and treatment, Breast Cancer 5 (2013) 43. [39] H. Dong, K. P. Claffey, S. Brocke, P. M. Epstein, Inhibition of breast cancer cell migration by activation of camp signaling, Breast cancer research and

550

treatment 152 (1) (2015) 17–28. [40] R. Tibshirani, et al., The lasso method for variable selection in the cox model, Statistics in medicine 16 (4) (1997) 385–395. [41] N. Simon, J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for cox’s proportional hazards model via coordinate descent, Journal of

555

Statistical Software 39 (5) (2011) 1–13. URL http://www.jstatsoft.org/v39/i05/ [42] B. Belletti, G. Baldassarre, Roles of cdkn1b in cancer?, Aging (Albany NY) 7 (8) (2015) 529. [43] W. H. Goodson, M. G. Luciani, S. A. Sayeed, I. M. Jaffee, D. H. Moore,

560

S. H. Dairkee, Activation of the mtor pathway by low levels of xenoestrogens in breast epithelial cells from high-risk women, Carcinogenesis 32 (11) (2011) 1724–1733. [44] N. ODonovan, J. Crown, H. Stunell, A. D. Hill, E. McDermott, N. OHiggins, M. J. Duffy, Caspase 3 in breast cancer, Clinical Cancer Research

565

9 (2) (2003) 738–742. [45] X. Liu, H. Zou, C. Slaughter, X. Wang, Dff, a heterodimeric protein that functions downstream of caspase-3 to trigger dna fragmentation during apoptosis, Cell 89 (2) (1997) 175–184.

30

[46] Y. Tsujimoto, Role of bcl-2 family proteins in apoptosis: apoptosomes or 570

mitochondria?, Genes to Cells 3 (11) (1998) 697–707. [47] D. Hanahan, R. A. Weinberg, The hallmarks of cancer, cell 100 (1) (2000) 57–70.

31