Accepted Manuscript A Route-Based Pathway Analysis Framework Integrating Mutation Information and Gene Expression Data Yue Zhao, Tham H. Hoang, Pujan Joshi, Seung-Hyun Hong, Charles Giardina, Dong-Guk Shin PII: DOI: Reference:
S1046-2023(17)30039-7 http://dx.doi.org/10.1016/j.ymeth.2017.06.016 YMETH 4246
To appear in:
Methods
Please cite this article as: Y. Zhao, T.H. Hoang, P. Joshi, S-H. Hong, C. Giardina, D-G. Shin, A Route-Based Pathway Analysis Framework Integrating Mutation Information and Gene Expression Data, Methods (2017), doi: http:// dx.doi.org/10.1016/j.ymeth.2017.06.016
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
A Route-Based Pathway Analysis Framework Integrating Mutation Information and Gene Expression Data Yue Zhaoa,∗, Tham H. Hoanga , Pujan Joshia , Seung-Hyun Honga , Charles Giardinab , Dong-Guk Shina a Computer
Science and Engineering Department, University of Connecticut, 371 Fairfield Way, Unit 4155, Storrs, CT 06269 b Department of Molecular and Cell Biology, University of Connecticut, 91 North Eagleville Road, Unit 3125, Storrs, CT 06269
Abstract We propose a new way of analyzing biological pathways in which the analysis combines both transcriptome data and mutation information and uses the outcome to identify routes of aberrant pathways potentially responsible for the etiology of disease. Each pathway route is encoded as a Bayesian Network which is initialized with a sequence of conditional probabilities which are designed to encode directionality of regulatory relationships encoded in the pathways, i.e. activation and inhibition relationships. First, we demonstrate the effectiveness of our model through simulation in which the model was able to easily separate Test samples from Control samples using fictitiously perturbed pathway routes. Second, we apply our model to analyze the Breast Cancer data set, available from TCGA, against many cancer pathways available from KEGG and rank the significance of identified pathways. The outcome is consistent with what have already been reported in the literature. Third, survival analysis has been carried out on the same data set by using pathway routes as features. Overall, we envision that our model of using pathway routes for analysis can further refine the conventional ways of subtyping cancer patients as it can discover additional characteristics specific to individuals tumor. ∗ Corresponding
author Email address:
[email protected] (Yue Zhao)
Preprint submitted to Journal of LATEX Templates
June 19, 2017
Keywords: Pathway Analysis, Bayesian Network, Data Integration, Modeling and Simulation
1. Introduction Pathway analysis is a critically important topic in cancer research. Pathway analysis allows scientists to systematically examine massive amount of genomics data over a curated set of biological pathways where each pathway includes a 5
large number of proteins/genes known interacting for specific biological function, e.g., cell-cycling, apoptosis, Th1 cell differentiation, etc. Generally, pathway analysis methods can be classified into two types. The first type, called the gene set enrichment analysis, is to estimate if some specific pathways are in action in the experimental system by merely calculating the significant number
10
of common genes of a biological pathway [1]. The second type is to use the network topology in which nodes and edges designed to encode directionality of mutually interacting gene/proteins within pathways [2]. The second type is superior to the first type as it utilizes not only the topology of interacting molecular entities but also the semantics of relationships types, i.e., activation
15
and inhibition. However, the existing topology-based pathway analysis methods are still limited in the sense that the analysis does not fully exploit the network topology encoded in the pathways. We argue that the pathway analysis should be route-based which is discoverable from the topology, as opposed to the conventional way of treating the entire pathway as the unit of analysis. For
20
example, if Wnt is highly activated or suppressed, we need to know via which route Wnt is activated or suppressed. A biological pathway is generally made up of a set of multiple routes each representing an alternative or differing way of realizing its intended biological function. For example, Wnt pathway has both canonical and non-canonical routes in which the former involves β-catenin while
25
the latter does not. Likewise, TGFB pathway has the canonical route utilizing R-SMAD complexes but there are other routes involving PI3K and MAPK [3]. Identifying the details of these non-canonical routes are active research topics,
2
and developing a route-based pathway analysis method is imperative to make the biomedical community’s pathway research move forward. 30
Another motivation of this work is the need to combine multiple experimental data types for the pathway analysis. Existing pathway analysis methods mostly use transcriptome data from microarray or RNA-Seq. Recent advancement in other high-throughput sequencing technologies (e.g., DNA-Seq, ChIPSeq, ATAC-Seq) allows scientists to acquire multiple types of genomics data
35
efficiently and cost effectively, and combining these multiple genomic data sets should enhance the accuracy of pathway analysis. In the past, the cancer research community use mostly mutation data to identify the aberrant pathways. In this work, we proposed to combine both mutation and transcriptome data for the pathway analysis. Using transcriptome data to infer the protein-protein
40
interaction is a known concern, but we indicate that the most recent studies by CPTAC suggest a very good Pearson correlation (e.g., 0.84 for ERBB2) between protein and mRNA expression level changes [4]. In addition, our model can be easily extended to use protein data in lieu of or in addition to transcriptome data, should such data becomes readily available for the pathway analysis in
45
the future. The rest of this paper is organized as follows. In Section 2, existing pathway analysis methods are briefly reviewed. In Section 3, the model setting and assumptions are described in detail. In Section 4, we include discussion regarding a simulation model that we used and the results of testing our model using the
50
simulation data. We also present the outcome of applying our model to Breast cancer data from TCGA. The section also includes discussion of performing survival analysis based on pathway routes. Lastly Section 5 is the conclusion. Numerous attempts have been made to incorporate pathway information into genomic data analysis. Li et al. (2008) encode the pathway network into a
55
penalty function and try to carry out model selection by optimizing the penalty function to pick meaningful genes and subnetworks [5]. This work was refined by Pan et al. (2010) who used a new penalty function in the form of a grouped sum [6]. Tarca et al. (2008) propose an approach which measures pathway 3
significance by statistical testing against random permutation [7]. Hancock et 60
al. (2010) rank the KEGG pathway correlated with the microarray dataset using Pearson Correlation [8]. Vaske et al. (2010) present a novel method of inferring genetic activities for a certain patient by modeling the gene as a factor graph thus incorporating pathways topological information [2]. Stingo et al. (2011) apply Partial Least Squares Regression to pick meaningful pathways and
65
genes with a Markov Random Field as the prior, where a gene is more likely to be selected if its neighbor in the pathway is already selected [9]. Verbeke et al. (2015) attempt to rank the pathways by encoding binary data and pathway logic into a global network and then use the p-value to rank perturbed pathways [10]. Here the p-value is calculated based on a hypothesis test where the null
70
hypothesis is the pathway that is picked randomly. Both Korucuoglu et al. (2014) [11] and Isci et al. (2011) [12] introduce Bayesian pathway analysis methods that encode the pathway as a Bayesian network. Both approaches remove cycles in the graph first and then use the expression data to train the model. Significance of the score is given by bootstrap-generated data.
75
Most of the existing pathway analysis methods are limited by using only partial information, i.e. whether two genes are adjacent in the pathway or not, or even ignoring pathway topology by directly scoring the significance of gene sets. The limitations of these existing works include (i) semantics of regulatory relationship is usually not considered, i.e., activation or inhibition relationships,
80
(ii) some approaches involve too many parameters in the model, and (iii) most approaches are designed to elicit general phenomenon by analyzing the patients cohort thus may not be suitable for individual patient-specific analysis. In this work, we propose an approach which fully considers pathway topology together with gene expression and mutation data where patient-specific analysis is pos-
85
sible.
4
From ErbB Signaling Pathway
+P
ErbB-2
STAT5
ERBB2
GAB1
STAT5
+P Shc
SHC2 PI3K
GAB1
Simplify
Grb2
GRB2
PIK3R5
PIP3 Sos
AKT3
SOS1
AKT +P Ras
R1
BAD
HRAS
BAD
A. Pathway of interested
B. Gene regulation network GB Pick a route
M1 R2
M2
SHC2
ERBB2
GRB2
SOS1
M3
R3 R4 D. Resulting Bayesian Network
Encoding as Bayesian Network
C. The route to model, G*
Figure 1: Workflow: converting a route in the pathway to Bayesian Network
2. Material and methods 2.1. Model and Terms A biological pathway is a graphical model made up of nodes and edges which collectively represent how molecular objects interact with each other in sequence 90
to exert some biological process. Fig.1 shows an example pathway, ErbB, which has been adapted from the KEGG pathway database [13]. Fig.1 illustrates how each route of the ErbB pathway can be modeled into a Bayesian network. Initially, the source pathway (Fig.1.A) is simplified by categorizing all interactions in the pathway into either activation or inhibition relationships. We call the
95
outcome of this simplification a Gene Regulation Network (denoted as GB in Fig.1.B) and in this figure, the two edge types, activation and inhibition, are represented as an arrow and a T bar, respectively. The interactions that could adversely affect the function of the target node in the pathway are inhibition while all other interactions are encoded as activation. The next step is to iden-
100
tify all possible “routes” available from the given pathway. Fig.1.C shows a route, G∗ , which starts from ERBB2 and ends at SOS1. The selected route is then converted into a discrete Bayesian Network (denoted as G shown by
5
Fig.1.D). Our objective is to treat each “route” as a unit of pathway analysis, as opposed to the conventional approach in which the entire network is subject 105
to the analysis. This route-based method would offer a more accurate analysis as the detailed analysis would reveal whether the effect of ERBB2 amplification is more prominent through GRB2→SOS1 path or GRB2→GAB1 path, or even for both. The route-based modeling idea assumes that it is crucial to identify which portion(s) of the pathway is either abnormally activated or suppressed.
110
In this way, a more informed treatment plan could be designed. Fig.2 illustrates the conversion process from the pathway route G∗ to a Bayesian network G. For a gene regulation network GB (converted from a pathway), a path G∗ is simply a subgraph of GB , G∗ ⊆ GB , G∗ = (V ∗ , E ∗ ) where V ∗ = {g1 , . . . , gkG∗ }, where gi represents the ith gene and kG∗ is the
115
number of genes contained in path G∗ , E ∗ = {eij |1 ≤ i < kG∗ and j = i + 1}. For gi ∈ V ∗ , 0 < i < kG∗ , we create two nodes in the corresponding Bayesian Network G : Ri and Mi . For the last node, gkG∗ in the route, we only create Ri . For gi ∈ V ∗ , 0 < i < kG∗ , we add edges: Mi → gi+1 and Ri → gi+1 in G. The conditional probability table corresponding to each edge in G is decided
120
by the type of the edge in G∗ (i.e., activation or inhibition). The Bayesian Network is built based on the assumption that the expression level change, between tumor and normal cell, of the gene in the route is affected by the parent’s expression change and the mutation status in the pathway route. After
125
conversion, the resulting Bayesian Network G is formally defined as follows: S G = (V, E), where V = RR M M , RR = {Ri , i ∈ {1, . . . , kG∗ }} where Ri is a random variable representing expression level change on gene gi . M M = {Mi , i ∈ {1, . . . , kG∗ − 1}} where Mi is a random variable representing mutation level on gene gi . Now Mi and Ri are defined in detail. Since Mi does not have a parent node in G, the random variable Mi follows a Bernoulli distribution as shown in (1). The Bernouli random variable Mi has two possible values: +1 to represent obtain of function gain, i.e. the intended biological function gets stronger, and −1 to represent loss of function. The probability distribution can be set to 6
Gene 1 activation Gene 2
Gene i inhibition Genei+1
Activation conditional probability table
R1 M1 R2 Mi-1
Inhibition conditional probability table
Ri Mi Ri+1 MK-2
GeneK-1 activation GeneK
Activation conditional probability table
RK-1 MK-1 RK Corresponding Bayesian Network Model
Route in the pathway
Figure 2: Converting the route in the pathway to Bayesian Network
indicate that the prior has no specific preference on these two levels: +1 p = 0.5 Mi = −1 p = 0.5
(1)
Random variable Ri follows a different probability distribution based on the location of gene gi in path G∗ . Suppose the set containing parent nodes of gene gi in regulation network G∗ is denoted as pa(gi ). If pa(gi ) = ∅ or equivalently gi is the starting node in G∗ , Ri ’s distribution is shown in (2) +1 p = 0.5 Ri = −1 p = 0.5
(2)
where +1 represents gene gi is overexpressed in tumor cells compared with 130
normal cells, and −1 represents gene gi is under-expressed in tumor cells. On the other hand, if pa(gi ) 6= ∅, then gene gi−1 is influencing the function of 7
Table 1: THE REGULATION PROCESS IN G∗ IS ACTIVATION
*
Mi−1
Ri−1
Ri = +1
Ri = −1
+1
+1
1 − *
−1
+1
α
1−α
+1
−1
1−α
α
−1
−1
1−
∈ (0, 0.5) is the error rate we could tolerate
Table 2: THE REGULATION PROCESS IN G∗ IS INHIBITION
Mi−1
Ri−1
Ri = +1
Ri = −1
+1
+1
1−
−1
+1
1−α
α
+1
−1
α
1−α
−1
−1
1−
gi in G∗ . Thus the probability distribution is set to capture the mutation and expression status of its parent, Mi−1 and Ri−1 . Since G∗ is a path, | pa(gi ) |= 1. The conditional probability tables are shown in Table.1 and Table.2. The two tables, Table.1 and Table.2, are to encode how Ri ’s and Mi ’s influence with each other through the two possible regulatory relationships, activation and inhibition, in terms of the conditional probabilities. Here we discuss the activation table (Table.1) in detail; the inhibition table (Table.2) is built in a similar way. If the parent gene of gi , gi−1 , has a mutation that provides function gain and its expression level is increased in cancer cell (Mi−1 = +1, Ri−1 = +1), then the regulation process is likely to be boosted. Thus gi ’s function could be overly activating in tumor cells (compared to the case in normal cells), thus
8
making Ri = +1 given the edge between gi−1 and gi in G∗ is activation. As a result, P (Ri = +1|Mi−1 = +1, Ri−1 = +1) = 1 − while P (Ri = −1|Mi−1 = +1, Ri−1 = +1) = 135
where is the error rate to tolerate and should be close to zero. Similarly, if the parent gene of gi has mutation that causes function loss and its expression level is decreased in cancer cell (Mi−1 = −1, Ri−1 = −1), then the downstream regulation process towards gi may not likely be functioning. Therefore, gi can be considered less functional in tumor cells, making Ri = −1, and hence the
140
corresponding probability would be flipped. There are cases which are not straightforward. If the mutation causes function gain and the expression level is much less in tumor cells (Mi−1 = +1, Ri−1 = −1), it could be unclear how the regulation process is affected between gi−1 and gi . If the mutation causes function loss and the expression level is much higher
145
in tumor cells (Mi−1 = −1, Ri−1 = +1), then there are two scenarios to consider. The first is if 100% of cells are mutated, then even though the parent gene is highly expressed (Ri−1 = +1), the regulation process may still fail to function due to the function loss mutation. The second scenario is if the ratio of mutated cells is relatively low, (10%∼20%), then a certain amount of cells may still be
150
functioning to lead the high level of parent genes expression (Ri−1 = +1). We address this issue by introducing a weight on the mutation information, i.e. we can make the mutation level Mi−1 more likely to affect the actual expression level of the child gene Ri with a probability of α when its sign is opposite to that of Ri−1 . We then set α ∈ (, 0.5) since this type of conflict should not
155
be penalized too much, due to the uncertainty. To be thorough, the value of parameter α needs more discussion in the future. For simplicity, we set α to be 0.25 and the outcome has been satisfactory. Hyper parameters ε and α measure how much inaccuracy one would like to tolerate. Tuning the parameter will lead to more fuzzy results. One parameter tuning strategy is to start with the default
9
160
setting ε = 0.1 and α = 0.25 and evaluate the score distribution of some routes whose score are known as a prior knowledge, for example, the routes starting with ERBB2 gene in ErbB pathway should have a value close to +1 for HER2+ patients after tuning. 2.2. Ranking the Route Given (~r, m), ~ i.e., a set of data observations of the random variables in Bayesian Network G from a specific patient s, we rank the path G∗ with the probability of observing ~r and m ~ conditioning on the Bayesian network model ~ = ~r, M ~ = m|G) G, P (R ~ using a method similar to the one given in [14]. The larger the probability, the more likely the pathway route is perturbed since the observation is highly consistent with the biological logic from G∗ . The observations (~r, m) ~ may have missing values, i.e. there will be some nodes with missing data in G. One problem of using this probability as a measure is that the probability will be higher if fewer data is observed. Thus we propose to use a normalized score as shown by equation (3), where the conditional probability ~ M ~ are consistent|G). is normalized by P (R, Scores (G∗ , ~r, m) ~ =
~ = ~r, M ~ =m P (R ~ | G) ~ M ~ are consistent | G) P (R, X
~ = ~r, M ~ =m P (R ~ | G) =
(3)
~ M ~) P (R,
~ r ,M ~ =m R=~ ~
=
X
Y
P (R1 )
~ r ,M ~ =m R=~ ~
P (Ri | Mi−1 , Ri−1 )
1
~ M ~ are consistent|G) is the probability that the random variables with obP (R, servations are fully consistent with the biological logic encoded in the pathway route given R1 = +1, i.e., the first node in the path is upregulated. For instance, suppose the pathway route contains only two genes and g1 activates g2 . The observation is (r1 = −1, m1 = −1), i.e., R2 is not available from the given data. Then we have the following situation: ~ M ~ are consistent | G) P (R, (4) = P (R1 = +1, M1 = +1 | G) 10
(since g2 is the last node in the route, M2 is not included in the model). If R2 is available in the data, then we have the following: ~ M ~ are consistent | G) P (R, (5) = P (R1 = +1, M1 = +1, R2 = +1 | G) 165
A high score means that the path G∗ is highly likely perturbed based on the data we observe. A path G∗ could only get a high score if the observations (i.e., the changes in tumor cells for each gene) are highly consistent with pathway information encoded in the Bayesian Network G. Inconsistency between data and the model would lower the score greatly since the conditional probability
170
will be instead of 1 − during the calculation of the score. Advantages of this measure are: (i) the analysis could be done across pathways, i.e. after merging pathways in a reasonable way, this measure could recognize a significantly meaningful route across different pathways, and this could allow cancer researchers to easily notice which biological processes are likely to be overly
175
activated or suppressed; and (ii) even though some expression values are flipped due to random errors from the genomic data (it is observed to be −1 when it is actually +1), the whole path would still have a high score since the majority of other genes have consistent expression observations. We reiterate that the experimental data we use here comes from one patient, s, indicating that the
180
score is specifically tailored to the patient s. The score we discussed in the above could be extended to deal with a cohort of patients, resulting in a group score, ScoreS , S, as shown by equation (6). Assuming each patients disease condition is independent, the score for a group of patient can be set as follows. ScoreS (G∗ ) =
Y s∈S
=
Y
~ = ~rs , M ~ =m P (R ~ s | G) ~ ~ P (R, M are consistent | G)
(6)
∗
Scores (G , ~rs , m ~ s)
s∈S
Since the perturbed route could have two possible status, activated or suppressed, we define a pathway route G∗ to be activated if the last gene’s expression value is observed to be the same as the expected value. The expected value 11
is calculated based on biological logic in G∗ by assuming that the first gene is upregulated. The route is defined to be suppressed if the observation of the last node is opposite to the expectation. The score is easily extended to include this information, resulting in a new signed score, sScore, as shown in (7). ˜ |G∗ | , r˙|G∗ | ) · Scores (G∗ , ~r, m) sScores (G∗ , ~r, m) ~ = I(r ~
(7)
where rG∗ is the observed expression level of the last node in the route G∗ and r˙|G∗ | is the expected expression level of the last node in the route conditioning on the first node being overexpressed. Function I˜ : R2 → R is defined in (8). The signed score varies from −1 (highly suppressed) to +1 (highly activated). +1 x = y ˜ y) = I(x, (8) −1 x 6= y Finally, we discuss how the measure for a whole pathway can be derived from the collection of individual route scores. The pathway score for pathway GB based on data of subject s, pScores (GB ), is introduced as follows: 1 |GB | X 1 X I( Scores (G∗ ) > β) |S| ∗
pScoreS (GB ) =
G ∈GB
(9)
s∈S
The idea behind generating the pathway score is the following. The pathway could be partitioned into several routes. We then simply measure the significance of this pathway, GB , by using the proportion of routes that have an average of all the patients’ scores, calculated by equation (3), that is larger than 185
some threshold β. 2.3. Data Description The gene expression variable Ri value could be measured by many types of gene/protein expression data, for instance, Microarray, RNA-seq, Reverse phase protein array (RPPA) among others. When RNA-seq is chosen, Ri ’s observation ri can be generated by log2 ratio of RNA-seq FPKM using the 12
equation (10) given blow. Here the threshold is set to 0.5 to tolerate the random error originating from sequence processing. +1 ri = −1 missing
umorF P KMi log2 ( NT ormalF P KMi ) > 0.5 umorF P KMi log2 ( NT ormalF P KMi ) < −0.5
(10)
otherwise
The data mi observed for random variables Mi is to represent the mutation level for gene gi and it is typically obtained from WGS or WES. Observation mi = +1 if the mutation on gi is known for function gain or mi = −1 if known 190
for loss of function. We assume that whether a mutation is known for gain, or loss of function is obtainable from existing biological literatures or SNP mutation databases (e.g., ClinVar, dbSNP, COSMIC). In case such prior knowledge is not available, by default we set mi = −1 if a mutation is detected for gi , i.e., treat them as loss of function.
195
3. Results and Discussion 3.1. Simulation 3.1.1. Simulation Model Simulation is to evaluate the performance of our model using artificially created data sets which we know the answer. We first describe how our simulation
200
data sets are systematically generated. During the data generation, we control perturbation of certain routes of pathways and examine if the presumed perturbed routes are discovered by our model. In order to show that our simulation ˜ B using model is generally applicable, we create a random network defined as G the following stochastic process: (i) generate the number of nodes in the network
205
using a Binomial distribution parameterized by Binomial(n = 30, p = 0.5), (ii) assuming the number of nodes obtained is N, generate a random DAG(Directed acyclic graph) with nodes g1 , . . . , gN , (iii) randomly assign the edge type in the network as either activation or inhibition, and (iv) randomly pick one of the routes in the network as the aberrant one and denote it G˜∗ . Then, two groups 13
210
of data sets are simulated based on different assumptions. One group (denoted as Test Group) simulates data assuming the picked route is perturbed while the other group (denoted as Control Group) simulates data assuming the same route is not perturbed. Both groups will simulate transcriptome and mutation data for both tumor and normal samples for all the genes placed in the randomly
215
generated network. The model for data simulation in Test Group and Control Group is introduced separately below. (a) Test Group. Let G˜∗ be the perturbed path. The following process is repeated for each gene in G˜∗ : As discussed before, cancer could affect the path in two ways, either overly enhancing or suppressing. We generate suppressed path
220
data with a probability of 0.8 and the enhanced with probability of 0.2. After enhancing or suppressing is randomly determined with the probability distribution mentioned above, expression value for each gene in G˜∗ is determined according to the topology of the path (i.e., activation and inhibition information). The transcriptome simulation data for normal cells, F P KMnormal , is first generated
225
by a Normal distribution, N (50, 5) and then a change factor η is sampled based on the path topology. If a gene is to be highly activated, η is sampled from [1, 10] uniformly. Otherwise, η is sampled from [0, 1] uniformly. The transcriptome simulation data from cancer cells, F P KMtumor , is then sampled from the normal distribution N (50η, 5), i.e., F P KMtumor = 50η + ε, where ε is random
230
noise following N (0, 5). The random error is introduced to simulate the error in real RNA-seq data. For the mutation data, we choose it to be consistent with the simulated transcriptome data. In doing so, since mutation is infrequent in real data, we use the following approach: if gi is to be upregulated(down regulated), mi is sampled to be +1(−1) with probability of 0.2 with the unobserved
235
cases with the probability of 0.8. For gi not on path G˜∗ , data is simulated in the same way as genes in the Control Group. ˜ B be a random network. Since there is no specific (b) Control Group. Let G ˜ B is randomly set to be activated, suppressed constraint to follow, the genes in G or missing with the probability of each event being 1/3. The transcriptome 14
240
data for normal cells, F P KMnormal , is generated by the Normal distribution N (50, 5) and then a change factor η is sampled. If the gene is to be activated, η is sampled from [1, 10] uniformly. Otherwise, η is sampled from [0, 1] uniformly. The transcriptome data from cancer cells , F P KMtumor , is then sampled from the Normal distribution N (50η, 5), i.e., F P KMtumor = 50η + ε, where ε is
245
random noise following N (0, 5). In this case we make mutation data be totally independent of transcriptome data of the same gene. Its sampling is done in the following way: mi is sampled to be +1 with probability of 0.05, and sampling for −1 is done with the probability of 0.15. Sampling for unobserved is done with the probability of 0.8, similar to what was done for Test Group.
250
3.2. Simulation Results We simulate 400 “patients” data for Test Group and 400 “patients” data for Control Group. With the generated data, we calculate the score, defined in equation (3), for the chosen path G˜∗ by setting = 0.1 and α = 0.25. The box plot of the scores is displayed in Fig.3. This box plot from the simulation
255
makes it clear that a reasonable threshold to distinguish patients from the two groups should be higher than 0.7. A threshold is needed to perform prediction, i.e. the cases with a score higher than the threshold is predicted to be from the Test Group. After obtaining False positive rate and True positive rate with various thresholds, multiple ROC (Receiver operating characteristic) curves are
260
produced and given in Fig.4. This figure includes four ROC curves, which are colored red, blue, orange and purple. The red curve is the case which combines mutation information and transcriptome data together with = 0.1. The AUC (Area Under Curve) of the red curve is 0.94 indicating an excellent prediction accuracy. The blue curve corresponds to the result without incorporating the
265
mutation data. The resulting AUC is reduced to 0.89. The other two curves (orange and purple) are produced with different with both expression and mutation data. The orange curve corresponds to = 0.2 instead of 0.1. The purple curve corresponds to the result with = 0.25. With a smaller , a larger AUC of the ROC curve is obtained. These outcomes clearly suggest that 15
Figure 3: The box plot of scores (with parameters: = 0.1, α = 0.25) by simulation data for the same randomly picked route. Test Group (Group A) has the route aberrant while Control Group (Group B) doesn’t. The scores for the two groups are distributed in two very different patterns.
270
incorporating the mutation information into the analysis produces a noticeable impact on improving accuracy, meaning more samples of the Test Group become identifiable without increasing false positives. Using simulation data is obviously different from using real data. Studying which choices of and α are more desired for real data is left as future work.
275
3.3. Significance Analysis The bioinformatics field frequently uses the TCGA Breast cancer data to test newly developed analysis models. We choose the same TCGA cancer data set to show effectiveness of our model. We downloaded transcriptome and mutation data from cBioPortal database [15][16] consisting of Breast Invasive Carcinoma
280
processed data sets of 1,105 samples from 1,098 patients. This data set includes 113 test and matching control data and thus we decided to exclusively use these matching-pair data sets in our study. For the transcriptome data, the cancer vs. 16
Figure 4: ROC curve: the threshold is picked from [0, 1] with step of 1/6. Smaller steps condition is also considered and the result makes no big difference
normal paired ratios of FPKM for these patients are converted to the expression observation with equation (10). For the mutation data, the processed mutation 285
calls are extracted from mutation accessor study [17]. The mutation with a ‘medium’ or ‘high’ impact factor is considered so that non functional mutation is filtered out. The entire KEGG Homo Sapiens pathways are used in this study, and the analysis is done mainly with R package “KEGGgraph” [18] and “gRain” [19]. The parameter setting for the Bayesian Network is = 0.1 and α = 0.25.
290
We carried out a significance analysis similar to that of PARADIGM [2]. First, we produce decoy pathways by permuting the genes in the pathway while keeping the network topology intact. We generate one decoy pathway for each of 293 Homo Sapiens KEGG pathways [13]. For each pathway, we extract all possible routes in it. Then aggregating the scores from each route with
295
more than two nodes, we calculate the score for each pathway by equation (9). We then rank the real pathways and their corresponding decoy pathways. Furthermore, while taking β = 0.7224, the top significant signaling pathways
17
for Breast cancer are filtered out from 44 Homo Sapiens signaling pathways if the real pathway score is larger than the one for its decoy pathway. The 300
resulting pathways and their scores are listed in Table 3. The second column of the table is the score for the real pathway and the third column the score for the corresponding decoy pathway. The list is ordered by the score of the real pathways, while pathways with a score less than its corresponding decoy pathway are removed. The real pathway score is relatively close to the decoy
305
pathway score because the pathway are usually partially perturbed and a large proportion of the routes in each pathway could have a similar score as the routes of the corresponding decoy pathway. Nevertheless, differences between real and decoy pathways clearly emerge. The larger the score difference between the real pathway and its corresponding decoy pathway is, the more significantly the
310
real pathway would have been involved in the initiation and progression of the cancer. The pathways identified in the list have been cross-examined with reported findings in the literature. Oxytocin Receptors are related to breast cancer according to [20]. In primary breast cancer, AMPK activity is known diminished in
315
an estimated 90% of cases [21]. ErbB signaling pathway is well known frequently dysregulated in breast cancer [22]. Rap1 GTPase is found to be driving breast cancer cell migration with 1-integrin as suggested in [23]. When it comes to Calcium signaling pathway, specific Ca(2+) channels reportedly play important roles in the proliferation and invasiveness of breast cancer cells [24]. In breast
320
cancer Neurotrophins and their receptors significantly impact tumor cell growth and metastasis through various signaling pathways according to [25]. The importance of forkhead box O (FOXO) activity in the development of endocrine resistant breast cancer is highlighted in [26]. Oestrogen is known to trigger the sphingolipid signaling cascade in various tissues including breast cancer [27].
325
MAPK pathway involving ERK-1 and ERK-2 is known very relevant to breast cancer progression [28]. Phospholipase D (PLD) overexpresses in breast cancer cell and its implication in survival signals as discussed in [29]. Ras signaling pathway is reported to be activated in many cancers [30]. According to Wang 18
and Li, LOX-1 is up-regulated by TNF in endothelial cell promoting the adhe330
sion and trans-endothelial migration of MDA-MB-231 breast cancer cells [31]. In human breast cancer cells thyroid hormone signaling overshadows estrogen signaling on SMP30 gene leading to induction of apoptosis [32]. HIF-1 plays a key role in regulating breast cancer progression and metastasis [33]. Dysfunction of Hippo pathway components is linked to breast cancer stem cell regulation
335
and the connection between the disease and genetic variations in the pathway is reported in [34]. Evidence from both pre-clinical models and epidemiological studies is reviewed to examine whether adrenergic signaling may modify breast cancer biology [35]. Role of Estrogen Receptor Signaling in Breast Cancer Metastasis is discussed in [36]. Activation of the phosphoinositide 3 kinase
340
(PI3K)/Akt/mammalian target of rapamycin (mTOR) pathway is commonly reported in breast cancer [37]. An excellent review summarizing the role of the toll-like receptor signaling pathway on breast cancer risk, disease progression, survival, and disease recurrence is given in [38]. Inhibition of breast cancer cell migration can be achieved by activation of cAMP signaling [39]as this fact could
345
have been suggested by the negative scores observed in the cAMP pathway result. Overall, we found that 84% (21/25) of the pathways in Table III have some published facts implicated in breast cancer, suggesting that our analysis could be producing meaningful outcomes. The rest 16% pathways could potentially be novel pathways for breast cancer and may merit further investigation.
350
3.4. Survival Analysis The goal of survival analysis is to identify which pathway route could highly affect the mortality of the patients. After processing score calculation for all of the 293 Homo Sapiens pathways, all the route scores are saved into a data matrix where columns correspond to the routes and rows correspond to the
355
patients. This data matrix includes more than 40,000 features, i.e., the score columns obtained from each route of the pathways. We applied Lasso Cox proportional hazards model to derive the relationship between predictor (routes) variables and survival time [40]. The Lasso regression methods generates sparse 19
Table 3: SIGNIFICANT PATHWAY REPORTED
Pathway
Real Score
Decoy Score
Oxytocin signaling pathway
0.002 34
4.170 00 · 10−05
AMPK signaling pathway
0.001 76
0.000 41
ErbB signaling pathway
0.001 74
8.720 00 · 10−05
T cell receptor signaling pathway
0.001 59
0.000 48
Rap1 signaling pathway
0.001 26
0.000 12
Calcium signaling pathway
0.001 21
3.220 00 · 10−05
Fc epsilon RI signaling pathway
0.001 06
0.000 00
Neurotrophin signaling pathway
0.001 03
0.000 13
FoxO signaling pathway
0.001 00
6.250 00 · 10−05
Adipocytokine signaling pathway
0.000 88
0.000 53
Sphingolipid signaling pathway
0.000 63
7.830 00 · 10−05
MAPK signaling pathway
0.000 55
0.000 21
Phospholipase D signaling pathway
0.000 53
0.000 27
Ras signaling pathway
0.000 53
0.000 34
TNF signaling pathway
0.000 52
0.000 39
Thyroid hormone signaling pathway
0.000 40
0.000 15
HIF1 signaling pathway
0.000 37
5.530 00 · 10−05
Hippo signaling pathway
0.000 37
0.000 19
Adrenergic signaling in cardiomyocytes
0.000 31
0.000 12
Estrogen signaling pathway
0.000 24
0.000 00
PI3KAkt signaling pathway
0.000 23
0.000 16
NODlike receptor signaling pathway
0.000 22
8.920 00 · 10−05
Tolllike receptor signaling pathway
0.000 17
9.700 00 · 10−05
cAMP signaling pathway
0.000 12
7.440 00 · 10−05
Glucagon signaling pathway
5.150 00 · 10−05
0.000 00
features, setting all the unimportant features to zero. The implementation was 360
done using the R-package glmnet [41]. We used default values for most of the
20
required parameters of the package except λ which is to pick proper limit of total parameters’ L − 1 Norm, λ. For λ we have chosen 0.12 and this value controls the 10-fold cross validation error to minimal in our case. Among the 41,676 pathway routes (features), only 21 of them have nonzero estimation. 365
The Kaplan-Meier (KM) plots for these 21 routes are shown in Fig.5. This figure shows that 20 route cases out of 21 have a KM estimation significant p-value (p < 0.05). The only exception is the case whose route is made up of “CHRM2,GNB5,PIK3R5,HRAS,MAP2K1,MAPK1,CREB5”. Even in this case, (p = 0.087) which is close to the significance threshold value 0.05.
370
One of routes from Cytokine-cytokine receptor interaction pathway, IL23A → IL23R remains significant for all possible selection of λ. Thus this route could be affecting the patients survival rate significantly and worthy of further investigation. 3.5. Deep Patient-specific Investigation Analysis
375
Our next step is to demonstrate that our route identifying method can offer a comprehensive and yet deeper patient-specific biological insights. First, we merged five pathways of interest, ErbB, PI3K-AKT, Apoptosis Cell Cycle and Cancer Pathway, into one big gene regulatory network. Then the signed scores are calculated for all possible paths ending with the same 57 cancer marker
380
genes. We found that even though multiple patients are of the same type of tumor, HER2+ for instance, the scores for individual paths among the same subtype patients are still very different. We conjecture that a systematic analysis of these differences may offer a way to further refine each known subtype. We illustrate such refined subtyping possibility by a heat map as shown in Fig.6.
385
The clear pattern difference of the figure suggests that our route-based method may offer additionally deeper subtyping, for example, each column in Fig.6 could potentially represent a sub-subtype among the HER2+ patients. Next we summarize which are the commonly activated routes among the HER2+ patients in Table.4.
390
All these routes score +1 using the equation
(7). Route 1 is from the Cell Cycle pathway in which CNKN1B is inhib21
Group=1
+
++ ++ ++++++++++++++++++++++ + + ++ ++ +++++ +++++++ + + p = 0.00029 0
1000
PRKACA,KCNJ9
Group=2
Strata
+ 2000
3000
4000
1.00 0.75 0.50 0.25 0.00
+
Group=0
+
Group=1
+
+++++++++ ++ ++ + +++++ +++++++++++ +++++ +++ +++ +++++ ++ + + p = 0.00078 0
1000
+ 2000
3000
Time MAFA,PDX1,HNF4A,SLC2A2
Group=1
+
Group=2
Strata
+ 2000
3000
4000
1.00 0.75 0.50 0.25 0.00
+
Group=0
+
Group=1
+
++++++++++++++++ + +++++++++++++++ ++ + + + + +++++++ + 1000
Strata
+
p = 0.00051
0
Group=2
2000
3000
4000
Survival probability
+
Survival probability
Group=0
1.00 0.75 0.50 0.25 0.00
+
Group=0
+
Group=1
+
++++++++ +++ ++ ++ +++++ +++ ++ + +++ ++++++ +++ ++ + ++ + +++ +++ 1000
2000
Time
Time
IL23A,IL23R
PRKCB,GNB5,CACNA1B
+
1000
Group=2
Strata
+ 2000
3000
4000
1.00 0.75 0.50 0.25 0.00
+
Group=0
+
Group=1
+
Group=2
p = 0.00012
0
+
1000
2000
3000
+
Strata
++++++ ++ ++++++++++++ ++++ +++++++++++++++ + ++ ++ + ++ 4000
Survival probability
Group=1
Survival probability
+
1.00 0.75 0.50 0.25 0.00
Group=1
+
1000
+ 2000
3000
Time
Time
Time
BTC,ERBB4,PIK3R5,AKT3,CDKN1B
CCL18,CXCR6,GNAI1,FGR,PIK3R5,AKT3,FOXO3
TLR5,MYD88,RELA
Group=2
1000
Strata
+ 2000
3000
4000
1.00 0.75 0.50 0.25 0.00
+
Group=1
+
Group=2
++++++++++++++++++++ +++++ +++ ++ + +++++ ++ ++ + ++ +
+
p = 0.0031
0
1000
Strata
2000
3000
4000
Survival probability
+
Survival probability
Group=1
1.00 0.75 0.50 0.25 0.00
+
Group=0
+
Group=1
+
++ ++ +++++ +++++++++++++ +++ ++ +++ +++++++++ +++++ + ++ ++++ 1000
2000
3000
Time
Time
RASAL1,RRAS2,RAF1,MAP2K1,MAPK1,PLA2G1B
CXCL12,CXCR4,PTK2B,VAV3,RAC2
CALM2,CAMK2A,HRAS,PIK3R2
Group=1
+
Group=2
1000
Strata
+ 2000
3000
4000
1.00 0.75 0.50 0.25 0.00
+
Group=1
+
Group=2
++++++ ++ +++++++++++++ ++++++++++++++++++++ + + +
+
p < 0.0001
0
1000
Strata
2000
3000
4000
Survival probability
+
p = 0.0023
1.00 0.75 0.50 0.25 0.00
+
Group=1
+
Group=2
p < 0.0001
0
+
+
1000
2000
3000
Time
Time
CALM1,RASGRF1,RRAS2,RAF1,MAP2K1,MAPK1,PLA2G16
CHRM2,GNB5,PIK3R5,HRAS,MAP2K1,MAPK1,CREB5
PRKCB,HRAS,PIK3R2
Group=0
+
Group=1
+
Strata
+
p = 0.043
1000
Group=2
2000
3000
4000
1.00 0.75 0.50 0.25 0.00
+
Group=1
+
Group=2
++++++++++++++++++++++++ + + ++++++++ +++++++
Strata
+
p = 0.087
0
1000
2000
3000
4000
Survival probability
+
+++++++++ +++++++++++ ++++ +++ ++++++++++ + ++++++++
1.00 0.75 0.50 0.25 0.00
+
Group=0
+
Group=1
+
Group=2
p < 0.0001
0
+
+
1000
2000
3000
Time
Time
CALM1,RASGRF1,RRAS2,RAF1,MAP2K1,MAPK1,PLA2G1B
PRKACB,MAPK1,CREB3L3
FGF12,EGFR,NRAS,BRAF,MAP2K1,MAPK1,CDK4,RB1
Group=0
+
Group=1
1000
Group=2
Strata
+
p = 0.011
0
+
2000
3000
4000
1.00 0.75 0.50 0.25 0.00
Time
+
Group=0
+
Group=1
++++++++++++ +++++++++ ++ +++++++++++ +++ + ++ ++++++++ + 1000
Group=2
Strata
+
p = 0.00087
0
+
2000 Time
3000
4000
Survival probability
+
1.00 0.75 0.50 0.25 0.00
4000
++++++++ +++ ++ +++++++++++++++ ++++++ ++++++ +
Time
+++ ++ ++++++++++++++++++++++ +++++++++++ + ++++++ ++
4000
++++++++ ++ ++ +++++++++++++++ ++++++ ++++++ +
Time
0
Group=2
Time
+++ ++ ++++++++++++++++++++++ ++++++++++++ + + ++++ + +
4000
+
p = 0.0036
0
4000
Group=2
++++++++++ ++ ++++++++ +++ ++++ ++++++ ++++++++ ++++ + + p = 0.0091 0
Group=2
3000
Time
Group=0
4000
+
p = 0.0033
0
HBS1L,PELO
+
Group=2
Time
p = 0.0039
0
+
RAPGEF4,MAPK9
++++++++++++++ +++++++ +++ ++ +++ +++++++++++ + ++++ ++ 0
Group=0
Time
+++++++++++ +++ ++++++++ ++++ +++ ++++ ++ +++++ +++ ++ + ++ + + p = 0.0041 0
1.00 0.75 0.50 0.25 0.00
+
EIF3G,EIF1
1000
+
Survival probability
4000
Survival probability
Survival probability Survival probability Survival probability Survival probability
3000
p = 0.0063
0
Strata
1.00 0.75 0.50 0.25 0.00
Strata
++++++++++++++++ ++ + ++ ++++++ +++++ ++++ ++++ +++ + ++
Strata
1.00 0.75 0.50 0.25 0.00
Group=2
Survival probability
Survival probability Survival probability
+
2000
Strata
1.00 0.75 0.50 0.25 0.00
+
+
1000
Strata
1.00 0.75 0.50 0.25 0.00
RAPGEF4,RAP1A,BRAF,MAP2K1,MAPK1
Group=1
p = 0.0097
0
Strata
1.00 0.75 0.50 0.25 0.00
+
+++++++++++++++++++ ++ ++++++++++++++ ++++++ ++ + ++ + +
Strata
1.00 0.75 0.50 0.25 0.00
Group=0
Survival probability
Survival probability
1.00 0.75 0.50 0.25 0.00
+
Survival probability
B3GALT5,FUT1 Strata
+
Group=1
++++++++++++++++++++++ + + ++++++ +++++++ +++++ + 1000
Group=2
+
p = 0.00069
0
+
4000
2000
3000
4000
Time
Figure 5: Kaplan Meier plots for the selected routes by the model with p-value attached. Group 0 is the patients with zero score for the corresponding route, while patients in Group 1 and Group 2 seperately have negative and positive scores for the same route. Routes are shown in the title above each plots
ited. CNKN1B is a known tumor suppressor gene and enhancing this route would contribute to cell proliferation [42]. Route 2 is a path across PI3K-AKT pathway and Cancer pathway. In this route, PPP2CA is actively suppressing AKT3 thus possibly preventing MTOR from activation, where MTOR is known 395
responsible for protein synthesis, cell proliferation and evading apoptosis [43]. Route 3 is from the apoptosis pathway and in this path DIABLO inhibits BIRC2
22
Figure 6: The heatmap of path scores by HER2+ tumor, where rows corresponds to different routes and columns denote five HER2+ patients.
thus possibly letting CASP3 get activated. A similar finding has already been reported suggesting that CASP3 is overexpressed in carcinomas [44]. A similar observation is applied to Route 6. The rest of the routes all come from 400
the Apoptosis pathway and these routes could be due to DNA fragmentation which is a generally accepted up-stream biological process known to lead apoptosis [45]. When all these observations are combined, one could generalize that in many HER2+ patients, apoptosis is in action but cell proliferation through Route 1 could outbalance the cell death.
23
Table 4: The commonly enhanced routes for HER2+ subtype
No.
Routes
1
CDK7→CDK2aCDKN1B
2
PPP2CAaAKT3→MTOR
3
DIABLOaBIRC2aCASP3
4
DIABLOaBIRC2aCASP3aDFFA
5
HTRA2aBIRC2aCASP3aDFFA
6
CTSBaBIRC2aCASP3
Table 5: The commonly suppressed routes for HER2+ subtype
405
No.
Routes
1
VHLaEPAS1→FIGF
2
CUL2aEPAS1→FIGF
3
RUNX1→CSF1R→IRS1→PIK3CA→AKT3 →MTOR
4
RUNX1T1aCSF1R→IRS1→PIK3CA→AKT3→MTOR
5
DAPK1aMAPK1→JUN→FIGF
6
DAPK2aMAPK1→BCL2
7
DAPK2aMAPK1→JUN→FIGF
Next we discuss the routes that are commonly suppressed (with the score of −1) in Table.5. Route 3 and Route 4 are suppressed leading to the same result as the Route 2 in Table.4, i.e. suppression of MTOR expression. The fact that Route 6 is suppressed could mean that DAPK fails to inhibit MAPK1 thus
24
possibly causing the expression of BCL2. BCL2 is a well-known anti-apoptotic 410
gene [46] and this route being highly suppressed could be seen as one of the leading factors contributing to the cancer development. The last one, Route 7, ending with FIGF is known responsible for sustained angiogenesis, again an important hallmark of cancers biological capabilities [47]. Similar analysis could be presented for the other three subtypes but we omit
415
the details due to space limitation. Lastly, for an individual patient our model discovers more than 2000 highly activated routes and more than 2000 highly suppressed routes. Thus how to rank these identified ones is an important future research issue.
4. Conclusion 420
We described a new approach for pathway analysis using a scoring model that measures the consistency between gene expression data, mutation data and the interaction relationships specified in the pathways. Unlike existing methods, the key objective of computing the consistency between the data and the prior knowledge specified in the pathway is to identify the particular route(s)
425
of gene/protein interaction that could explain the molecular basis of the cancer initiation and progression. We first demonstrated the performance of the model through simulation. We then carried out a real-world data analysis using TCGA Breast Cancer data and confirmed that our model is capable of recognizing significant KEGG pathways reported in the cancer research literature.
430
During the experiment, we also observed that even though patients are from the conventionally categorized subtype (e.g., HER2+), the way their pathway routes are overly activated or suppressed is quite different, thus offering the advantage of our approach as well as its potential use in further refining the known cancer subtypes. Survival analysis based on identified pathway routes has also
435
been performed in which we discover that 19 route cases out of 20 have a KM estimator significant p-value (p < 0.05). Regarding future work, our Bayesian network model offers an easy way of incorporating additional data types such as
25
CNV, proteomics data, methylation data, and so on, and such model extension should be attempted. Lastly, we are optimistic that our model can be applied 440
to studying a wide array of biological systems beyond cancer research, as long as various experimental data types of test vs. control samples are obtainable.
References References [1] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, 445
M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences 102 (43) (2005) 15545–15550. [2] C. J. Vaske, S. C. Benz, J. Z. Sanborn, D. Earl, C. Szeto, J. Zhu, D. Haus-
450
sler, J. M. Stuart, Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using paradigm, Bioinformatics 26 (12) (2010) i237–i245. [3] J. Massagu´e, Tgfβ signalling in context, Nature reviews Molecular cell biology 13 (10) (2012) 616–630.
455
[4] P. Mertins, D. Mani, K. V. Ruggles, M. A. Gillette, K. R. Clauser, P. Wang, X. Wang, J. W. Qiao, S. Cao, F. Petralia, et al., Proteogenomics connects somatic mutations to signalling in breast cancer, Nature 534 (7605) (2016) 55–62. [5] C. Li, H. Li, Network-constrained regularization and variable selection for
460
analysis of genomic data, Bioinformatics 24 (9) (2008) 1175–1182. [6] W. Pan, B. Xie, X. Shen, Incorporating predictor network in penalized regression with application to microarray data, Biometrics 66 (2) (2010) 474–484.
26
[7] A. L. Tarca, S. Draghici, P. Khatri, S. S. Hassan, P. Mittal, J.-s. Kim, 465
C. J. Kim, J. P. Kusanovic, R. Romero, A novel signaling pathway impact analysis, Bioinformatics 25 (1) (2009) 75–82. [8] T. Hancock, I. Takigawa, H. Mamitsuka, Mining metabolic pathways through gene expression, Bioinformatics 26 (17) (2010) 2128–2135. [9] F. C. Stingo, Y. A. Chen, M. G. Tadesse, M. Vannucci, Incorporating bio-
470
logical information into linear models: A bayesian approach to the selection of pathways and genes, The annals of applied statistics 5 (3). [10] L. P. Verbeke, J. Van den Eynden, A. C. Fierro, P. Demeester, J. Fostier, K. Marchal, Pathway relevance ranking for tumor samples through network-based data integration, PloS one 10 (7) (2015) e0133503.
475
[11] M. Korucuoglu, S. Isci, A. Ozgur, H. H. Otu, Bayesian pathway analysis of cancer microarray data, PloS one 9 (7) (2014) e102803. [12] S. Isci, C. Ozturk, J. Jones, H. H. Otu, Pathway analysis of highthroughput biological data within a bayesian network framework, Bioinformatics 27 (12) (2011) 1667–1674.
480
[13] M. Kanehisa, S. Goto, Kegg: kyoto encyclopedia of genes and genomes, Nucleic acids research 28 (1) (2000) 27–30. [14] D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques, MIT press, 2009. [15] J. Gao, B. A. Aksoy, U. Dogrusoz, , et al., Integrative analysis of complex
485
cancer genomics and clinical profiles using the cbioportal, Science signaling 6 (269) (2013) pl1. [16] E. Cerami, J. Gao, U. Dogrusoz, , et al., The cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data, Cancer discovery 2 (5) (2012) 401–404.
27
490
[17] B. I. T. G. D. A. Center, Mutation assessor (2016).
doi:10.7908/
C1F18Z2Z. URL https://doi.org/10.7908/C1F18Z2Z [18] J. D. Zhang, S. Wiemann, Kegggraph: a graph approach to kegg pathway in r and bioconductor, Bioinformatics 25 (11) (2009) 1470–1471. 495
[19] S. Højsgaard, Graphical independence networks with the gRain package for R, Journal of Statistical Software 46 (10) (2012) 1–26. URL http://www.jstatsoft.org/v46/i10/ [20] A. Reversi, P. Cassoni, B. Chini, Oxytocin receptor signaling in myoepithelial and cancer cells, Journal of mammary gland biology and neoplasia
500
10 (3) (2005) 221. [21] W. Li, S. M. Saud, M. R. Young, G. Chen, B. Hua, Targeting ampk for cancer prevention and treatment, Oncotarget 6 (10) (2015) 7365–78. [22] Y. Yarden, M. X. Sliwkowski, Untangling the erbb signalling network, Nature reviews Molecular cell biology 2 (2) (2001) 127–137.
505
[23] E. A. McSherry, K. Brennan, L. Hudson, A. D. Hill, A. M. Hopkins, Breast cancer cell migration is regulated through junctional adhesion molecule-amediated activation of rap1 gtpase, Breast Cancer Research 13 (2) (2011) R31. [24] I. Azimi, S. Roberts-Thomson, G. Monteith, Calcium influx pathways in
510
breast cancer: opportunities for pharmacological intervention, British journal of pharmacology 171 (4) (2014) 945–960. [25] H. Hondermarck, Neurotrophins and their receptors in breast cancer, Cytokine & growth factor reviews 23 (6) (2012) 357–365. [26] M. Bullock, Foxo factors and breast cancer: outfoxing endocrine resistance,
515
Endocrine-related cancer 23 (2) (2016) R113–R130.
28
[27] O. Sukocheva, C. Wadham, Role of sphingolipids in oestrogen signalling in breast cancer cells: an update, Journal of Endocrinology 220 (3) (2014) R25–R35. [28] R. J. Santen, R. X. Song, R. McPherson, R. Kumar, L. Adam, M.-H. 520
Jeng, W. Yue, The role of mitogen-activated protein (map) kinase in breast cancer, The Journal of steroid biochemistry and molecular biology 80 (2) (2002) 239–256. [29] J. Gomez-Cambronero, Phosphatidic acid, phospholipase d and tumorigenesis, Advances in biological regulation 54 (2014) 197–206.
525
[30] E. Niemitz, Ras pathway activation in breast cancer, Nature genetics 45 (11) (2013) 1273–1273. [31] X. Wang, Y. Lin, Tumor necrosis factor and cancer, buddies or foes? 1, Acta pharmacologica Sinica 29 (11) (2008) 1275–1288. [32] S. K. Mishra, P. Sar, B. Rath, D. Bhargava, D. Sengupta, S. Chaudhary, In
530
human breast cancer cells thyroid hormone signaling overshadows estrogen signaling on smp30 gene leading to induction of apoptosis (2012). [33] Z.-j. Liu, G. L. Semenza, H.-f. Zhang, Hypoxia-inducible factor 1 and breast cancer metastasis, Journal of Zhejiang University Science B 16 (1) (2015) 32–43.
535
[34] J. Zhang, S. Yao, Q. Hu, Q. Zhu, S. Liu, K. L. Lunetta, S. A. Haddad, N. Yang, H. Shen, C.-C. Hong, et al., Genetic variations in the hippo signaling pathway and breast cancer risk in african american women in the amber consortium, Carcinogenesis (2016) bgw077. [35] E. I. Obeid, S. D. Conzen, The role of adrenergic signaling in breast cancer
540
biology, Cancer Biomarkers 13 (3) (2013) 161–169. [36] S. Saha Roy, R. K. Vadlamudi, Role of estrogen receptor signaling in breast cancer metastasis, International journal of breast cancer 2012. 29
[37] J. J. Lee, K. Loh, Y.-S. Yap, Pi3k/akt/mtor inhibitors in breast cancer, Cancer biology & medicine 12 (4) (2015) 342–354. 545
[38] R. K. La Creis, E. N. Rogers, S. T. Yeyeodu, D. Z. Jones, K. S. Kimbro, Contribution of toll-like receptor signaling pathways to breast tumorigenesis and treatment, Breast Cancer 5 (2013) 43. [39] H. Dong, K. P. Claffey, S. Brocke, P. M. Epstein, Inhibition of breast cancer cell migration by activation of camp signaling, Breast cancer research and
550
treatment 152 (1) (2015) 17–28. [40] R. Tibshirani, et al., The lasso method for variable selection in the cox model, Statistics in medicine 16 (4) (1997) 385–395. [41] N. Simon, J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for cox’s proportional hazards model via coordinate descent, Journal of
555
Statistical Software 39 (5) (2011) 1–13. URL http://www.jstatsoft.org/v39/i05/ [42] B. Belletti, G. Baldassarre, Roles of cdkn1b in cancer?, Aging (Albany NY) 7 (8) (2015) 529. [43] W. H. Goodson, M. G. Luciani, S. A. Sayeed, I. M. Jaffee, D. H. Moore,
560
S. H. Dairkee, Activation of the mtor pathway by low levels of xenoestrogens in breast epithelial cells from high-risk women, Carcinogenesis 32 (11) (2011) 1724–1733. [44] N. ODonovan, J. Crown, H. Stunell, A. D. Hill, E. McDermott, N. OHiggins, M. J. Duffy, Caspase 3 in breast cancer, Clinical Cancer Research
565
9 (2) (2003) 738–742. [45] X. Liu, H. Zou, C. Slaughter, X. Wang, Dff, a heterodimeric protein that functions downstream of caspase-3 to trigger dna fragmentation during apoptosis, Cell 89 (2) (1997) 175–184.
30
[46] Y. Tsujimoto, Role of bcl-2 family proteins in apoptosis: apoptosomes or 570
mitochondria?, Genes to Cells 3 (11) (1998) 697–707. [47] D. Hanahan, R. A. Weinberg, The hallmarks of cancer, cell 100 (1) (2000) 57–70.
31