A network-based approach to identify disease-associated gene modules through integrating DNA methylation and gene expression

A network-based approach to identify disease-associated gene modules through integrating DNA methylation and gene expression

Accepted Manuscript A network-based approach to identify disease-associated gene modules through integrating DNA methylation and gene expression Yuany...

1MB Sizes 0 Downloads 45 Views

Accepted Manuscript A network-based approach to identify disease-associated gene modules through integrating DNA methylation and gene expression Yuanyuan Zhang, Junying Zhang, Zhaowen Liu, Yajun Liu, Shouheng Tuo PII:

S0006-291X(15)30423-X

DOI:

10.1016/j.bbrc.2015.08.033

Reference:

YBBRC 34406

To appear in:

Biochemical and Biophysical Research Communications

Received Date: 2 August 2015 Accepted Date: 9 August 2015

Please cite this article as: Y. Zhang, J. Zhang, Z. Liu, Y. Liu, S. Tuo, A network-based approach to identify disease-associated gene modules through integrating DNA methylation and gene expression, Biochemical and Biophysical Research Communications (2015), doi: 10.1016/j.bbrc.2015.08.033. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

A network-based approach to identify disease-associated gene modules through integrating DNA methylation and gene expression

School of Computer Science and Technology, Xidian University, Xi'an 710071, P.R.China *

SC

1

RI PT

Yuanyuan Zhang1, Junying Zhang1*, Zhaowen Liu1, Yajun Liu1, Shouheng Tuo1

AC C

EP

TE D

M AN U

[email protected]

1

ACCEPTED MANUSCRIPT Abstract Formation and progression of complex diseases are generally the joint effect of genetic and epigenetic disorders, thus an integrative analysis of epigenetic and genetic data is essential for understanding mechanism of the diseases. In this study, we

RI PT

integrate Illuminate 450k DNA methylation and gene expression data to calculate the weights of gene network using Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA). The approach considers all methylation values of CpG sites in a gene, rather than averaging them which was used in other studies ignoring

SC

the variability of the methylation sites. Through comparing topological features of control network with those of case network, including global and local features,

M AN U

candidate disease-associated genes and gene modules are identified. We apply the approach to real data, breast invasive carcinoma (BRCA). It successfully identifies susceptibility breast cancer-related genes, such as TP53, BRCA1, EP300, CDK2, MCM7 and so forth, within which most are previously known to breast cancer. Also, GO and pathway enrichment analysis indicate that these genes enrich in cell apoptosis

TE D

and regulation of cell death which are cancer-related biological processes. Importantly, through analyzing the functions and comparing expression and methylation values of these genes between cases and controls, we find some genes, such as VASN, SNRPD3,

EP

and gene modules, targeted by POLR2C, CHMP1B and TAF9, which might be novel breast cancer-related biomarkers.

AC C

Key words: DNA methylation, Gene expression, Gene network, Integrative analysis, Canonical correlation analysis

1. Introduction

Large-scale studies of disease-associated genetic and epigenetic aberrance are emerging including single-omic and integration study of multi-omics. DNA methylation is known as one of the most important disease-related research hotspots in epigenomics mainly due to two reasons. First, there are evidences that aberrant DNA methylation presents before the presentation of clinic symptoms of diseases [1] 2

ACCEPTED MANUSCRIPT or it causes disease progression [2]. Second, some technologies are developed to measure genome-wide profiles of a large number of samples, such as Infinium HumanMethylation450 BeadChips (Illumina HM450) [3,4]. Gene expression, an important part of transcriptomics, is often used to identify disease-related genes

RI PT

through constructing gene network [5]. Integrative analysis of DNA methylation and other genomic data provides a novel opportunity to understand the mechanism of a disease [6,7,8,9,10,11].

Recently, network-based methods using DNA methylation data or integration data to disease-associated

biomarkers

or

modules

are

popular

SC

identify

[10,12,13,14,15,16,17,18,19,20]. Although these methods provide a novel viewpoint

M AN U

to identify disease-associated biomarkers, they have some limits in constructing gene weighted network. For example, Some calculate weights of gene network or CpG sites network often based on Pearson correlation coefficient, such as methods proposed by Li et al. [12] and Liu et al. [13]. However, Pearson correlation coefficient is only adaptable in calculating relationship between two vectors of a same dimension.

TE D

A gene contains multiple CpG sites and any two different genes generally have different numbers of CpG sites. This is possibly the reason why they used their average value instead of methylation values of all CpG sites. However, such average

EP

can not reflect the pattern of methylation values of all CpG sites in a gene, like variance, etc. Another work given by Li et al. [12], calculates the weights of CpG

AC C

sites network, however not considering correlations between CpG site pairs which are both located in a same gene. That is to say, intragenic information is not considered. Some of the other methods provide novel measures to calculate weights of gene network, such as methods provided by Jiao et al. [19] and Bartlett et al.[20]. The measure provided by Jiao et al. [19] does not consider all probes in a given gene, only assign average methylation values closest to Transcriptional Start Site (TSS) to the gene. It ignores the information of methylation values in other regions. The measure provided by Bartlett et al.[20] cannot statistically compare normal samples and tumor samples, it can only calculate weights of network for tumor samples instead. In this paper, Protein-Protein Interaction (PPI) is assumed to be the prior un-weighted 3

ACCEPTED MANUSCRIPT gene network and the weights of the network are added according to the PCA and CCA through integrating DNA methylation and gene expression data, where PCA is used to reduce dimension of CpG sites for each gene and CCA is used to calculate weights of gene pairs in networks based on the principal components of CpG sites and

RI PT

gene expression values for controls and cases respectively. The method considers values of all CpG sites in a gene instead of their average value and the gene expression value simultaneously. Control and case weighted networks are then obtained based on control samples and case samples respectively. By considering each

SC

network being a complex network, topological features of the network are studied, including global features of the network and local features of each node of the

M AN U

network. Applying the study to real breast cancer data (BRCA), we find that the two networks have no significant difference in global, but are significantly different in local features among the well known candidate breast cancer related 414 genes. Community study is performed on the sub-network consisting of the 414 genes and their neighbors, where the community found is referred to as a gene module. It is

TE D

found that some modules are the targets of some well known breast cancer related genes, like TP53, BRCA1, EP300, CDK2, MCM7. In addition, novel breast cancer related modules are identified and their functions need be biologically studied further.

EP

The overview of our method is shown in Fig. 1.

AC C

2. Material and Method Data source

PPI provides the real biological relationships of proteins, and a protein is coded by its corresponding gene. Therefore, we consider PPI network as a priori gene network. The PPI data was downloaded from the Human Protein Reference Database (HPRD) release 9 [21]. DNA Methylation and gene expression data of BRCA for control and matched case samples was downloaded from TCGA (http://cancergenome.nih.gov/). The data contains 32 controls and 32 matched cases which come from batches 61, 96, and 109. Preprocessing of methylation data is completed refer to reference [22]. For simplicity, we remove the CpG sites which are mapped to multiple genes. As a result, 4

ACCEPTED MANUSCRIPT 363103 CpG sites are remained for analysis. For each gene in the gene network, we map gene expression and methylation values of all CpG sites to it. For an interaction of a gene pair in the gene network, if one of the two genes does not have corresponding expression and methylation values, the interaction is removed. Thus,

RI PT

9370 genes and 36636 interactions remain. Interaction weight of a gene network

Each gene contains multiple CpG sites, and different genes generally have different numbers of CpG sites. Therefore, Pearson correlation coefficient which is often used

SC

to calculate the relationship between two vectors of a same dimension is not adaptable in measuring the relationship between two genes which have different dimensions of

M AN U

CpG sites. Motivated by the idea of CCA which can measure the relationship between two vectors with different dimensions, we calculate the edge weights of gene network using CCA. Also, considering that CCA does not work effectively if the number of data samples is significantly less than that of features (CpG sites), we use PCA to reduce dimension for CpG sites firstly for each gene. The number of principal

TE D

components depends on the percentage of cumulative variance of the principal components to total variance of all principal components. We use 0.95 in this study. Therefore, for a gene, some principal components of CpG sites and gene expression

EP

are known as features of a gene. Any two different genes generally have different numbers of features. Then, CCA is used to measure the edge weight between a gene pair. Detailed description of the procedure is as follows:



AC C

For two genes X  x1m , x2m ,..., x mp





T

and Y  y1m , y2m ,..., yqm



T

( xim and y mj are

methylation values in genes X and Y respectively, p and q are the number of CpG sites in genes X and Y respectively), PCA is used to reduce dimension and T T ~ ~ X  ~ x1m , ~ x2m ,..., ~ xsm  and Y  ~ y1m , ~ y2m ,..., ~ ytm  are the obtained principal components

of genes X and Y respectively. Then we add gene expression of X and Y as







T ~ ~ x1m , ~ x2m ,..., ~ xsm , x e and Y  ~ y1m , ~ y2m ,..., ~ ytm , y e another feature. That is to say, X  ~



T

are features of genes X and Y respectively, where x e and y e represent the 5

ACCEPTED MANUSCRIPT expression values of genes

X and Y respectively. For convenience, we still use

X  x1, x2 ,..., x p  and Y  y1, y2 ,..., yq  to represent the features of genes X and T

T

Y respectively. CCA seeks to find two vectors a and b of different dimensions,









where cov aT  X , bT  Y











var aT  X  var bT  Y





is the covariance of aT  X and bT  Y ; var aT  X



and

are the variance of aT  X and bT  Y respectively. The value of 

SC



var bT  Y





cov aT  X , bT  Y

RI PT

which maximizes the correlation   cor aT  X , bT  Y , defined as follows:

M AN U

which ranges from 0 to 1 is then set to be the weight between genes X and Y . Calculation of topological features of networks

Both global and local topological features ( of the control and case weighted gene networks are analyzed by considering them as complex networks. are studied. The global ones analyzed include distribution of node strength, average path length and

TE D

community structure; and the local ones studied for a node are node strength, distribution of its weights connecting to its neighbors and clustering coefficient. To identify disease-associated biomarkers, we compare the difference of these topological features between control and case networks and identify genes with

EP

significant difference as candidate disease-related genes. Significance study

AC C

Our aim is to finding significantly different genes responsible to disease through comparing the topological features of case network and those of control networks. To measure whether genes are different significantly, we generate random network through permuting weights of control network, and repeat this procedure 1000 times. There are two reasons why we generate random networks by permuting weights of control network. Firstly, in control and case networks, the inter-connections among genes are unchanged, while weights among them are changed. Therefore, we just permute weights and do not change the edges of networks. Secondly, disease state is known as a kind of disorder of normal state, and normal state is always known as an 6

ACCEPTED MANUSCRIPT analyzing benchmark. Therefore, we generate random network from control network. If the difference of topological features between case and control networks does not have significant difference than that of random and control networks, we think that this difference is not caused by a disease.

RI PT

We calculate the values of local topological features for each gene in 1000 random networks, control and case networks. If the difference values of local topological features for a gene between control and case networks are larger than those of random and control networks at significant level 0.01, the gene is said significantly different.

SC

For global topological features, we calculate the distribution of node strength, average path length of random networks, and the similarity of community structure between

M AN U

random networks and control network using normalized mutual information [23]. The more similar the community structure of two networks is, the closer to 1 the value of similarity is. Through comparing these global features, we can indicate whether the global structure of networks is changed significantly between case and control. Identification of disease-related modules

TE D

We construct a sub-network relating only to significantly different genes and their neighbor genes for the identification of disease-related modules. The weight wij of a gene pair i and j is defined by wijcontrol  wijcase , where wijcontrol and wijcase are

EP

weights between genes i and j in control and case networks respectively. We use

AC C

random walk algorithm [24] to detect community structures of the sub-network. The communities are known as candidate disease-related modules. We rank them according to the ratio of sum of weights to total number of nodes in a module. For each module, we rank genes according to node strength and define target genes based on the following equation, Cud

  

d i 1 i n i 1 i

s s

where Cu d represents the percentage of cumulative node strength of top d genes to the sum of node strengths of all genes in the module; n is the number of genes in 7

ACCEPTED MANUSCRIPT the module. We select top d genes as target genes when Cu d is equal to or larger than 50%.

3. Results

RI PT

Comparison of global topological features To compare the difference of the case network and the control network in their global structure, we calculate distribution of node strength, average path length and community structure for weighted control and case network respectively. When

SC

distributions of node strength are compared, we find the distributions follow power-law with the power exponent being r =1.006 and r =1.029 for control and

M AN U

case respectively. In control and case networks, average path length L equals 0.6466 and 0.9039 respectively. Also, we use random walk algorithm to identify community structure of the two networks [24] and calculate their similarity (0.8148) using normalized mutual information [23] (it is completed by igraph package in R). Through comparing the above results with the distribution of node strength, average

TE D

path length of random networks and the similarity of communities between random networks and control network respectively, we find the above differences are not significant. That is to say, control and case networks have no significant difference in

EP

global features. Therefore, the emergence of the disease may not affect global structure of the gene network.

AC C

Comparison of local topological features Through calculating and comparing the values of local topological features of control network with those of case and random networks, we find that control and case networks have significant difference in local structure. 414 genes which have significant difference in si , Yi or ci are identified. We complete GO and pathway enrichment analysis using DAVID [25] for these 414 genes. It is shown that these genes enrich in regulation of cell death, cell apoptosis cell proliferation and cell cycle etc., which are cancer-related biological processes and pathways (Supplemental table 1 and 2). 8

ACCEPTED MANUSCRIPT Discovery of susceptibility disease-associated modules Considering the fact that the 414 genes are significantly different in topological features of networks between control and case, and the local topological features are defined based on weights of genes connecting to their neighbors, we search for

RI PT

modules in a sub-network consisting of 414 genes and their neighbors. 177 modules in the sub-network are detected using random walk algorithm [24]. Target genes for each module are defined through calculating the value of Cu d .

A table summarizing top 10 modules shows that one of the modules contains 719

SC

genes and targeted by TP53, EP300, SP1, CDK2, BRCA1 and RELA etc. (Fig. 2 and Table 1). Gene TP53 are known as an oncogene of various cancers and genes EP300,

M AN U

SP1, CDK2, BRCA1, RELA are known as breast cancer related genes [26,27,28,29,30]. Through analyzing this module using DAVID, we find this module is rich in regulations of cell cycle, apoptosis, cell death and cell proliferation which are cancer-related biological processes (Table 2). Also, through analyzing the relationship between genes in the module and diseases using DAVID, we find it relates

TE D

significantly to breast cancer (Table 2).

For modules 4 and 10, we find that they centre around VASN and TIMP3, respectively. Gene VASN is different significantly in all local topological features. Although there is

EP

no report on VASN relating to breast cancer, it interacts with genes TGFB1, TGFB2 and TGFB3 which are all related with breast cancer referring to database-Cancer

AC C

Genetics (http://www.cancer-genetics.org/). Therefore, we think VASN may be a novel breast cancer associated gene. In module 10, all genes are related with breast cancer according to Cancer Genetics and reference [31]. According to Cancer Genetics database (http://www.cancer-genetics.org/), we find that two genes in module 1, three in module 2, two in module 7 and seven in module 9 are cancer-related gene. Therefore, target genes of these modules may be possible breast-cancer genes. For modules 3, 5 and 6, we compare expressions of genes in the modules and methylation level of one of target genes between cases and controls using heatmap (Fig. 3). It is shown that methylation levels of the target gene and 9

ACCEPTED MANUSCRIPT expression of some genes are significantly different for cases and controls. Through analyzing biological functions of these genes, we find TAF family may have a role in gene regulation associated with apoptosis and TAF9 is a direct effector of TP53. Therefore, we think that these modules may be related to breast cancer.

RI PT

4. Discussion

We present a novel method to calculate weights of a gene network through integrating Illumina 450k DNA methylation and gene expression data. This method can consider

SC

methylation level of all CpG sites in genes rather than the average value of probes close to TSS. It contains comprehensive information for obtaining relationship of

M AN U

gene pairs. By analyzing the topological features of control and case networks, gene modules and their corresponding target genes are identified. In application to BRCA, a large gene module whose target genes are TP53, EP300, SP1, CDK2, BRCA1 and RELA, etc. is identified. These genes are known as breast cancer related genes. Analyzing other modules, we find gene VASN and target genes of modules 1, 2 and 9

TE D

which may also be cancer-related candidate genes. VASN acts as an inhibitor of TGF-beta signaling and TGF-beta signaling is related to breast cancer. Three modules (3, 5 and 6) may be related to breast cancer through analyzing expression and

EP

methylation values of genes.

Because we identify disease-associated gene modules based on the change of weights

AC C

(gene interactions) rather than that of nodes (genes), some genes which may relates to disease might be lost. To show its possibility, we take an extreme example as follows: for genes A and B , they are all high expression (hypermethylated) genes and have strong correlation in controls; in cases, they also have strong correlation but they are all low expression (hypomethylated) genes. Our method may lose these genes. Therefore, if a measure for each gene is provided as node weight, and edge weights of gene network are considered simultaneously, the identification of disease-associated genes or modules should be more effective. It also is our further work.

Acknowledgements 10

ACCEPTED MANUSCRIPT This work was supported by the Natural Science Foundation of China under Grants 61070137, 61201312, 61170183, Research Fund for the Doctoral Program of Higher Education of China (No. 20130203110017), and the Fundamental Research Funds for the Central Universities of China (No. BDY171416, JB140306).

RI PT

References

[1] K.M. Godfrey, A. Sheppard, P.D. Gluckman, K.A. Lillycrop, G.C. Burdge, C. McLean, J. Rodford, J.L. Slater-Jefferies, E. Garratt, S.R. Crozier, B.S. Emerald, C.R. Gale, H.M. Inskip, C. Cooper, M.A. Hanson, Epigenetic gene promoter methylation at birth is associated with child's later adiposity, Diabetes 60 (2011) 1528-1534.

SC

[2] A. Jones, A.E. Teschendorff, Q. Li, J.D. Hayward, A. Kannan, T. Mould, J. West, M. Zikan, D. Cibula, H. Fiegl, S.H. Lee, E. Wik, R. Hadwin, R. Arora, C. Lemech, H. Turunen, P. Pakarinen, I.J. Jacobs, H.B. Salvesen, M.K. Bagchi, I.C. Bagchi, M. Widschwendter, Role of DNA methylation and

M AN U

epigenetic silencing of HAND2 in endometrial cancer development, PLoS Med 10 (2013) e1001551.

[3] J. Sandoval, H. Heyn, S. Moran, J. Serra-Musach, M.A. Pujana, M. Bibikova, M. Esteller, Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome, Epigenetics 6 (2011) 692-702.

[4] S. Moran, M. Vizoso, A. Martinez-Cardus, A. Gomez, X. Matias-Guiu, S.M. Chiavenna, A.G. Fernandez, M. Esteller, Validation of DNA methylation profiling in formalin-fixed

TE D

paraffin-embedded samples using the Infinium HumanMethylation450 Microarray, Epigenetics 9 (2014) 829-833.

[5] M.P. Keller, A gene expression network model of type 2 diabetes links cell cycle regulation in islets with diabetes susceptibility, Genome Research 18 (2008) 706-716. [6] H. Shen, P.W. Laird, Interplay between the cancer genome and epigenome, Cell 153 (2013) 38-55.

EP

[7] M.D. Ritchie, E.R. Holzinger, R. Li, S.A. Pendergrass, D. Kim, Methods of integrating data to uncover genotype-phenotype interactions, Nat Rev Genet 16 (2015) 85-97. [8] V.N. Kristensen, O.C. Lingjaerde, H.G. Russnes, H.K. Vollan, A. Frigessi, A.L. Borresen-Dale, Principles

AC C

and methods of integrative genomic analyses in cancer, Nat Rev Cancer 14 (2014) 299-313. [9] J.T. Bell, A.A. Pai, J.K. Pickrell, D.J. Gaffney, R. Pique-Regi, J.F. Degner, Y. Gilad, J.K. Pritchard, DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines, Genome Biol 12 (2011) R10.

[10] M. Renner, T. Wolf, H. Meyer, W. Hartmann, R. Penzel, A. Ulrich, B. Lehner, V. Hovestadt, E. Czwan, G. Egerer, T. Schmitt, I. Alldinger, E.K. Renker, V. Ehemann, R. Eils, E. Wardelmann, R. Buttner, P. Lichter, B. Brors, P. Schirmacher, G. Mechtersheimer, Integrative DNA methylation and gene expression analysis in high-grade soft tissue sarcomas, Genome Biol 14 (2013) r137. [11] S.A. Selamat, B.S. Chung, L. Girard, W. Zhang, Y. Zhang, M. Campan, K.D. Siegmund, M.N. Koss, J.A. Hagen, W.L. Lam, S. Lam, A.F. Gazdar, I.A. Laird-Offringa, Genome-scale analysis of DNA methylation in lung adenocarcinoma and integration with mRNA expression, Genome Res 22 (2012) 1197-1211. [12] Y. Li, J. Xu, H. Ju, Y. Xiao, H. Chen, J. Lv, T. Shao, J. Bai, Y. Zhang, L. Wang, X. Wang, H. Ren, X. Li, A 11

ACCEPTED MANUSCRIPT network-based, integrative approach to identify genes with aberrant co-methylation in colorectal cancer, Mol Biosyst 10 (2014) 180-190. [13] H. Liu, J. Su, J. Li, H. Liu, J. Lv, B. Li, H. Qiao, Y. Zhang, Prioritizing cancer-related genes with aberrant methylation based on a weighted protein-protein interaction network, BMC Syst Biol 5 (2011) 158. [14] Q. Zhang, J.E. Burdette, J.P. Wang, Integrative network analysis of TCGA data for ovarian cancer, BMC Syst Biol 8 (2014) 1338.

RI PT

[15] C.P. Cheng, I.Y. Kuo, H. Alakus, K.A. Frazer, O. Harismendy, Y.C. Wang, V.S. Tseng, Network-based analysis identifies epigenetic biomarkers of esophageal squamous cell carcinoma progression, Bioinformatics 30 (2014) 3054-3061.

[16] H. Sun, S. Wang, Network-based regularization for matched case-control analysis of high-dimensional DNA methylation data, Stat Med 32 (2013) 2127-2139.

[17] J. Pey, K. Valgepea, A. Rubio, J.E. Beasley, F.J. Planes, Integrating gene and protein expression data

SC

with genome-scale metabolic networks to infer functional pathways, BMC Syst Biol 7 (2013) 134.

[18] K. Mitra, A.R. Carvunis, S.K. Ramesh, T. Ideker, Integrative approaches for finding modular

M AN U

structure in biological networks, Nat Rev Genet 14 (2013) 719-732.

[19] Y. Jiao, M. Widschwendter, A.E. Teschendorff, A systems-level integrative framework for genome-wide DNA methylation and gene expression data identifies differential gene expression modules under epigenetic control, Bioinformatics 30 (2014) 2360-2366. [20] T.E. Bartlett, S.C. Olhede, A. Zaikin, A DNA methylation network interaction measure, and detection of network oncomarkers, PLoS One 9 (2014) e84573.

[21] T.S. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D.

TE D

Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan, A. Marimuthu, S. Banerjee, D.S. Somanathan, A. Sebastian, S. Rani, S. Ray, C.J. Harrys Kishore, S. Kanth, M. Ahmed, M.K. Kashyap, R. Mohmood, Y.L. Ramachandra, V. Krishna, B.A. Rahiman, S. Mohan, P. Ranganathan, S. Ramabadran, R. Chaerkady, A. Pandey, Human Protein Reference Database--2009 update, Nucleic Acids Res 37 (2009) D767-772.

EP

[22] Y. Zhang, J. Zhang, Identification of functionally methylated regions based on discriminant analysis through integrating methylation and gene expression data, Molecular BioSystems 11 (2015) 1786-1793.

AC C

[23] L. Danon, A. Díaz-Guilera, J. Duch, A. Arenas, Comparing community structure identification, Journal of Statistical Mechanics: Theory and Experiment 9 (2005) 09008.

[24] P. Pons, M. Latapy, Computing communities in large networks using random walks (long version), J. of Graph Alg. and App. bf 10 (2004) 284--293.

[25] W. Huang da, B.T. Sherman, R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources, Nat Protoc 4 (2009) 44-57.

[26] A. Zannetti, S. Del Vecchio, A. Romanelli, S. Scala, M. Saviano, G. Cali, M.P. Stoppelli, C. Pedone, M. Salvatore, Inhibition of Sp1 activity by a decoy PNA-DNA chimera prevents urokinase receptor expression and migration of breast cancer cells, Biochem Pharmacol 70 (2005) 1277-1287. [27] F. Bai, H.L. Chan, A. Scott, M.D. Smith, C. Fan, J.I. Herschkowitz, C.M. Perou, A.S. Livingstone, D.J. Robbins, A.J. Capobianco, X.H. Pei, BRCA1 suppresses epithelial-to-mesenchymal transition and stem cell dedifferentiation during mammary and tumor development, Cancer Res 74 (2014) 6161-6172. 12

ACCEPTED MANUSCRIPT [28] S.J. Kim, N. Masuda, F. Tsukamoto, H. Inaji, F. Akiyama, H. Sonoo, J. Kurebayashi, K. Yoshidome, M. Tsujimoto, H. Takei, S. Masuda, S. Nakamura, S. Noguchi, The cell cycle profiling-risk score based on CDK1 and 2 predicts early recurrence in node-negative, hormone receptor-positive breast cancer treated with endocrine therapy, Cancer Lett 355 (2014) 217-223. [29] M. Wirtenberger, S. Tchatchou, K. Hemminki, J. Schmutzhard, C. Sutter, R.K. Schmutzler, A. Meindl, B. Wappenschmidt, M. Kiechle, N. Arnold, B.H. Weber, D. Niederacher, C.R. Bartram, B. Burwinkel, Associations of genetic variants in the estrogen receptor coactivators PPARGC1A,

RI PT

PPARGC1B and EP300 with familial breast cancer, Carcinogenesis 27 (2006) 2201-2208.

[30] H.J. Park, S.R. Kim, S.S. Kim, H.J. Wee, M.K. Bae, M.H. Ryu, S.K. Bae, Visfatin promotes cell and tumor growth by upregulating Notch1 in breast cancer, Oncotarget 5 (2014) 5087-5099.

[31] A. Sadr-Nabavi, J. Ramser, J. Volkmann, J. Naehrig, F. Wiesmann, B. Betz, H. Hellebrand, S. Engert, S. Seitz, R. Kreutzfeld, Decreased expression of angiogenesis antagonist EFEMP1 in sporadic breast cancer is caused by aberrant promoter methylation and points to an impact of

AC C

EP

TE D

M AN U

SC

EFEMP1 as molecular biomarker, International Journal of Cancer 124 (2009) 1727–1735.

13

ACCEPTED MANUSCRIPT

Fig. 1 An overview of our method to identify possible disease-related modules.

RI PT

Fig. 2 Module 8 identified by our method. Node size denotes its node strength and edge widths represent weights among genes. Yellow nodes are top target genes

AC C

EP

TE D

M AN U

SC

Fig. 3 Heatmaps of gene expression and methylation level of genes in module 3, 5 and 6. P-values in left column of heatmaps is calculated using t-test to evaluate significant difference of gene expression between cases and controls.

14

ACCEPTED MANUSCRIPT Table 1 Summary of top 10 modules in BRCA Modul e Size

1

2

3

4

5

6

7

8

9

10

48

24

15

4

12

20

17

719

19

6

TP53,

SNRPD3

targets

, SMN1

, MCM7

POLR2

VAS

CHMP1

C

N

B

TAF9,

VPS16

TAF1

,

0

VPS11

EP300, SP1, CDK2,

BRCA1

EP

TE D

M AN U

SC

, RELA

AC C

BUB1B, ANAPC

TIMP 3

RI PT

Top

ORC2L

7

ACCEPTED MANUSCRIPT Table 2 GO, pathway and disease enrichment results of module 8 Frequency of mapped genes

P-value

(%)

3.2E-38

3.2E-36

18.5

Cellular response to stress

2.3E-42

2.9E-40

16.4

Regulation of apoptosis

6.7E-23

4.9E-21

15.3

1.5E-22

1.0E-20

15.3

Regulation of cell proliferation

4.1E-23

3.1E-21

KEGG pathway

P-value

Cell cycle

1.9E-34

2.3E-32

Pathways in cancer

5.1E-18

3.0E-16

P53 signaling pathway

2.0E-16

8.8E-15

Disease term

P-value

Cancer

1.1E-16

Breast cancer

4.5E-12

Regulation of programmed cell

EP AC C

15.2

Benjamini correction P-value

Frequency of mapped genes (%) 7.6 8.9 3.9

Benjamini correction

Frequency of mapped genes

P-value

(%)

TE D

death

SC

Cell cycle

M AN U

P-value

RI PT

Benjamini correction

GO term

2.0E-15

17.5

5.7E-9

8.5

AC C

EP

TE D

M AN U

SC

RI PT

ACCEPTED MANUSCRIPT

AC C

EP

TE D

M AN U

SC

RI PT

ACCEPTED MANUSCRIPT

AC C

EP

TE D

M AN U

SC

RI PT

ACCEPTED MANUSCRIPT

ACCEPTED MANUSCRIPT

Conflict of interest statement We declare that we have no financial and personal relationships with other people or

AC C

EP

TE

D

M AN U

SC

RI PT

organizations that can inappropriately influence our work.

ACCEPTED MANUSCRIPT

Topological features of weighted network Node strength and distribution of node strength Node strength of node i is si = ∑ aij ⋅ wij , where aij and wij represent the j

the weight between nodes i and

RI PT

connection ( aij = 1 , if nodes i and j have an interaction, otherwise aij = 0 ) and

j , respectively. Like degree distribution,

distribution of node strength P(s ) is defined as the ratio of the number of nodes with node strength s to that of all nodes in the network, and it can measure the degree of

SC

network homogeneity. The smoother the curve of P (s ) is, the more homogeneous

M AN U

the network is. Average path length

The weight of two adjacent nodes is the maximum correlation coefficient under the CCA regime. Thus, the larger the weight of the two nodes is, the closer they are. Therefore, we define the distance of two adjacent nodes as 1 − w , where w is the

TE D

weight of the two adjacent nodes. The shortest path Lij between any two nodes i and j is the path with the smallest sum of distances among all paths connecting

EP

these two nodes. We denote the path distance by dij , which is the sum of distances in

Lij . Average path length of the network describes the closeness degree of nodes,

AC C

defined as follows,

∑d L=

i≥ j

ij

1 ⋅ (N ⋅ ( N + 1) ) 2

where N is the total number of nodes in a network. Community structure Community structure refers to the occurrence of a group of nodes in a network that are more densely connected internally than with the rest of the network. In this study, We use random walk algorithm [1] to identify community structure of a network. Clustering coefficient

ACCEPTED MANUSCRIPT Clustering coefficient of a node quantifies how close the node together with its neighbors form a complete graph. In this study, we calculate clustering coefficient of node i for a weighted network as follows [2],

(wij + wih ) ⋅ a ⋅ a ⋅ a 1 ⋅ ∑ ij ih jh si ⋅ (ki − 1) j , h∈Neigh (i ) 2

RI PT

ci =

where ki is the degree of node i , Neigh(i ) represents a node set consisting of neighbors of node i .

Distribution of weights for a node connecting to its neighbors

SC

As shown as Fig. 1, the node strength of node i is same in two networks, but the distribution of weights connecting to neighbors of a node is generally different for the 2

M AN U

 wij  two networks. Therefore, distribution of weights for node i , Yi = ∑   , is j∈N i  si  calculated [3]. It measures the discriminant degree of weights connecting to node i .

0. 1

EP

TE D

The larger value of Yi is, the larger discriminant degree of weights is.

AC C

Yi = 0.2

Yi = 0.4

Fig. 1 An example shows the necessity of Y i being analyzed.

Reference

[1] P. Pons, M. Latapy, Computing communities in large networks using random walks (long version), J. of Graph Alg. and App. bf 10 (2004) 284--293.

[2] A. Barrat, M. Barthélemy, R. Pastor-Satorras, A. Vespignani, The architecture of complex weighted networks, Proc Natl Acad Sci U S A 101 (2004) 3747-3752. [3] E. Almaas, B. Kovács, T. Vicsek, Z.N. Oltvai, A.L. Barabási, Global organization of metabolic fluxes in the bacterium Escherichia coli, Nature 427 (2004) 839-843.

ACCEPTED MANUSCRIPT

AC C

EP

TE D

M AN U

SC

RI PT

Term Count % P-Value Benjamini cellular macromolecule metabolic process 240 58.1 1.60E-25 3.50E-22 macromolecule metabolic process 245 59.3 1.60E-21 1.80E-18 RNA metabolic process 79 19.1 8.10E-20 5.90E-17 nucleobase, nucleoside, nucleotide and nucleic 168acid metabolic 40.7 4.90E-18 process 2.70E-15 positive regulation of biological process 120 29.1 8.50E-18 3.80E-15 regulation of metabolic process 173 41.9 2.50E-17 9.30E-15 positive regulation of cellular process 111 26.9 7.10E-17 3.50E-14 regulation of macromolecule metabolic process 160 38.7 2.00E-16 6.10E-14 cellular metabolic process 257 62.2 4.60E-16 1.10E-13 regulation of cellular metabolic process 164 39.7 1.50E-15 3.40E-13 cellular nitrogen compound metabolic process 170 41.2 2.40E-15 4.30E-13 regulation of primary metabolic process 158 38.3 2.20E-15 4.40E-13 regulation of protein metabolic process 52 12.6 5.50E-15 9.40E-13 cell cycle 63 15.3 7.50E-15 1.20E-12 negative regulation of cellular process 99 24 1.70E-14 2.50E-12 cellular process 343 83.1 2.10E-14 2.90E-12 nitrogen compound metabolic process 170 41.2 4.00E-14 5.20E-12 RNA splicing, via transesterification reactions27 6.5 4.70E-14 5.80E-12 nuclear mRNA splicing, via spliceosome 27 6.5 4.70E-14 5.80E-12 RNA splicing, via transesterification reactions 27 with bulged 6.5adenosine 4.70E-14 as nucleophile 5.80E-12 DNA metabolic process 48 11.6 8.80E-14 1.00E-11 primary metabolic process 258 62.5 9.10E-14 1.00E-11 regulation of cellular protein metabolic process 46 11.1 1.40E-13 1.50E-11 negative regulation of biological process 102 24.7 2.30E-13 2.30E-11 mRNA metabolic process 39 9.4 1.20E-12 1.10E-10 RNA splicing 34 8.2 1.30E-12 1.20E-10 DNA replication 28 6.8 1.30E-12 1.20E-10 positive regulation of macromolecule metabolic 62 process 15 2.10E-12 1.80E-10 cellular response to stimulus 60 14.5 3.30E-12 2.70E-10 negative regulation of macromolecule metabolic 56 process 13.6 4.10E-12 3.20E-10 negative regulation of metabolic process 58 14 4.30E-12 3.30E-10 positive regulation of metabolic process 64 15.5 4.90E-12 3.60E-10 RNA processing 47 11.4 5.70E-12 4.00E-10 gene expression 139 33.7 6.00E-12 4.20E-10 regulation of programmed cell death 59 14.3 7.00E-12 4.70E-10 cellular component biogenesis 67 16.2 7.30E-12 4.70E-10 mRNA processing 35 8.5 8.20E-12 5.00E-10 regulation of cell death 59 14.3 8.10E-12 5.10E-10 macromolecular complex assembly 52 12.6 1.10E-11 6.80E-10 regulation of apoptosis 58 14 1.50E-11 8.50E-10 cell cycle process 47 11.4 1.70E-11 9.80E-10 positive regulation of cellular metabolic process 61 14.8 2.00E-11 1.10E-09 cellular component organization 120 29.1 4.30E-11 2.30E-09 transcription, DNA-dependent 32 7.7 6.70E-11 3.50E-09 RNA biosynthetic process 32 7.7 9.50E-11 4.90E-09 regulation of nitrogen compound metabolic 130 process 31.5 1.10E-10 5.60E-09 macromolecular complex subunit organization 52 12.6 1.20E-10 6.10E-09 transcription from RNA polymerase II promoter 28 6.8 1.80E-10 8.70E-09 negative regulation of cellular metabolic process 52 12.6 2.10E-10 9.60E-09

ACCEPTED MANUSCRIPT

AC C

EP

TE D

M AN U

SC

RI PT

cellular component assembly 59 14.3 2.30E-10 1.10E-08 metabolic process 266 64.4 2.40E-10 1.10E-08 regulation of nucleobase, nucleoside, nucleotide 128 and nucleic 31 acid 2.70E-10 metabolic 1.20E-08 process positive regulation of programmed cell death38 9.2 5.10E-10 2.20E-08 cellular protein metabolic process 112 27.1 5.10E-10 2.20E-08 positive regulation of cell death 38 9.2 5.80E-10 2.40E-08 regulation of gene expression 128 31 7.30E-10 3.00E-08 positive regulation of apoptosis 37 9 1.50E-09 6.10E-08 response to DNA damage stimulus 34 8.2 1.90E-09 7.50E-08 cellular macromolecule biosynthetic process125 30.3 2.20E-09 8.60E-08 post-translational protein modification 68 16.5 3.40E-09 1.30E-07 macromolecule biosynthetic process 125 30.3 3.50E-09 1.30E-07 positive regulation of nitrogen compound metabolic 46 process 11.1 4.10E-09 1.50E-07 regulation of biosynthetic process 129 31.2 4.30E-09 1.50E-07 response to stress 86 20.8 4.90E-09 1.70E-07 regulation of cellular biosynthetic process 128 31 5.60E-09 1.90E-07 response to organic substance 49 11.9 5.70E-09 1.90E-07 cellular macromolecule catabolic process 49 11.9 6.60E-09 2.30E-07 regulation of protein modification process 29 7 7.30E-09 2.40E-07 programmed cell death 44 10.7 7.60E-09 2.50E-07 positive regulation of protein metabolic process 26 6.3 9.50E-09 3.00E-07 response to dsRNA 10 2.4 9.40E-09 3.10E-07 regulation of macromolecule biosynthetic process 123 29.8 1.30E-08 4.10E-07 apoptosis 43 10.4 1.50E-08 4.60E-07 intracellular signaling cascade 69 16.7 1.60E-08 4.80E-07 regulation of cellular process 238 57.6 1.60E-08 4.90E-07 regulation of biological process 245 59.3 2.30E-08 6.80E-07 regulation of cell proliferation 50 12.1 3.30E-08 9.70E-07 response to chemical stimulus 69 16.7 3.50E-08 1.00E-06 transcription initiation 15 3.6 4.00E-08 1.20E-06 positive regulation of nucleobase, nucleoside,43 nucleotide 10.4 and nucleic 4.10E-08 acid1.20E-06 metabolic process interspecies interaction between organisms 27 6.5 5.00E-08 1.40E-06 DNA repair 27 6.5 5.30E-08 1.50E-06 macromolecule catabolic process 49 11.9 6.90E-08 1.90E-06 positive regulation of cellular biosynthetic process 45 10.9 7.40E-08 2.00E-06 positive regulation of cellular protein metabolic 24 process 5.8 8.10E-08 2.10E-06 regulation of molecular function 56 13.6 1.00E-07 2.70E-06 cell death 46 11.1 1.10E-07 2.80E-06 positive regulation of biosynthetic process 45 10.9 1.10E-07 2.80E-06 chromosome organization 36 8.7 1.10E-07 2.80E-06 death 46 11.1 1.30E-07 3.40E-06 regulation of transcription from RNA polymerase 46 II promoter 11.1 1.50E-07 3.70E-06 positive regulation of macromolecule biosynthetic 43 process 10.4 1.50E-07 3.70E-06 negative regulation of protein metabolic process 21 5.1 1.60E-07 3.80E-06 organelle organization 69 16.7 1.60E-07 3.80E-06 cellular response to stress 39 9.4 2.00E-07 4.80E-06 transcription initiation from RNA polymerase 13 II promoter3.1 2.30E-07 5.30E-06 ribonucleoprotein complex assembly 13 3.1 2.70E-07 6.20E-06 regulation of transcription 111 26.9 2.70E-07 6.30E-06 cell cycle phase 32 7.7 2.80E-07 6.30E-06

ACCEPTED MANUSCRIPT

AC C

EP

TE D

M AN U

SC

RI PT

protein modification by small protein conjugation 19 or removal 4.6 3.10E-07 6.90E-06 proteolysis involved in cellular protein catabolic 40 process 9.7 3.20E-07 7.10E-06 DNA-dependent DNA replication 12 2.9 3.40E-07 7.50E-06 negative regulation of apoptosis 29 7 3.50E-07 7.70E-06 cellular protein catabolic process 40 9.7 3.60E-07 7.80E-06 positive regulation of gene expression 39 9.4 3.90E-07 8.30E-06 negative regulation of cellular protein metabolic 20 process4.8 3.90E-07 8.30E-06 ribonucleoprotein complex biogenesis 20 4.8 3.90E-07 8.30E-06 negative regulation of programmed cell death29 7 4.70E-07 9.90E-06 negative regulation of cell death 29 7 4.90E-07 1.00E-05 cellular macromolecular complex assembly 27 6.5 5.00E-07 1.00E-05 positive regulation of transcription 38 9.2 5.20E-07 1.10E-05 double-strand break repair 12 2.9 6.90E-07 1.40E-05 protein metabolic process 116 28.1 7.60E-07 1.50E-05 protein catabolic process 40 9.7 7.90E-07 1.60E-05 negative regulation of gene expression 35 8.5 8.30E-07 1.60E-05 protein modification process 71 17.2 8.60E-07 1.70E-05 interphase 15 3.6 9.20E-07 1.80E-05 biological regulation 248 60 1.10E-06 2.10E-05 multi-organism process 42 10.2 1.10E-06 2.20E-05 biopolymer modification 73 17.7 1.30E-06 2.40E-05 positive regulation of molecular function 38 9.2 1.30E-06 2.40E-05 positive regulation of transcription, DNA-dependent 33 8 2.00E-06 3.60E-05 DNA replication initiation 7 1.7 2.30E-06 4.30E-05 positive regulation of RNA metabolic process 33 8 2.40E-06 4.30E-05 protein modification by small protein conjugation 16 3.9 2.60E-06 4.80E-05 protein kinase cascade 28 6.8 2.70E-06 4.80E-05 intracellular transport 40 9.7 3.00E-06 5.30E-05 anti-apoptosis 20 4.8 3.10E-06 5.40E-05 negative regulation of nucleobase, nucleoside, 34nucleotide 8.2and3.20E-06 nucleic acid5.70E-05 metabolic process interphase of mitotic cell cycle 14 3.4 3.80E-06 6.60E-05 cellular macromolecular complex subunit organization 27 6.5 4.30E-06 7.40E-05 negative regulation of nitrogen compound metabolic 34 process 8.2 4.30E-06 7.40E-05 cellular biosynthetic process 132 32 4.90E-06 8.30E-05 response to exogenous dsRNA 6 1.5 5.60E-06 9.30E-05 modification-dependent macromolecule catabolic 36 process 8.7 5.60E-06 9.40E-05 modification-dependent protein catabolic process 36 8.7 5.60E-06 9.40E-05 induction of apoptosis 25 6.1 6.20E-06 1.00E-04 induction of programmed cell death 25 6.1 6.50E-06 1.10E-04 protein complex biogenesis 33 8 6.50E-06 1.10E-04 protein complex assembly 33 8 6.50E-06 1.10E-04 spliceosomal snRNP biogenesis 8 1.9 7.00E-06 1.10E-04 mitotic cell cycle 27 6.5 8.20E-06 1.30E-04 positive regulation of transcription from RNA 27 polymerase6.5 II promoter 8.60E-06 1.40E-04 immune response-activating signal transduction 10 2.4 8.90E-06 1.40E-04 regulation of response to stimulus 31 7.5 9.00E-06 1.40E-04 regulation of cell cycle 25 6.1 1.10E-05 1.70E-04 positive regulation of catalytic activity 33 8 1.20E-05 1.80E-04 chromatin organization 27 6.5 1.20E-05 1.80E-04 positive regulation of protein modification process 18 4.4 1.20E-05 1.90E-04

ACCEPTED MANUSCRIPT 1.30E-05 2.00E-04

RI PT

2.00E-04 2.10E-04 2.10E-04 2.20E-04 2.40E-04 2.40E-04 2.40E-04 2.70E-04 2.70E-04 2.70E-04 2.90E-04 3.00E-04 3.00E-04 3.00E-04 3.10E-04 3.10E-04 3.30E-04 4.50E-04 4.50E-04 4.50E-04 5.00E-04 5.00E-04 5.30E-04 5.70E-04 5.70E-04 5.80E-04 5.80E-04 5.80E-04 6.20E-04 6.80E-04 6.80E-04 7.10E-04 7.10E-04 7.30E-04 7.40E-04 7.50E-04 8.20E-04 9.90E-04 1.00E-03 1.00E-03 1.20E-03 1.20E-03 1.20E-03 1.30E-03 1.30E-03 1.30E-03 1.40E-03 1.60E-03 1.70E-03

SC

1.40E-05 1.40E-05 1.40E-05 1.50E-05 1.60E-05 1.60E-05 1.70E-05 1.90E-05 1.90E-05 1.90E-05 2.00E-05 2.10E-05 2.20E-05 2.20E-05 2.30E-05 2.30E-05 2.40E-05 3.30E-05 3.30E-05 3.30E-05 3.80E-05 3.80E-05 4.00E-05 4.30E-05 4.40E-05 4.40E-05 4.40E-05 4.50E-05 4.90E-05 5.40E-05 5.40E-05 5.70E-05 5.70E-05 5.80E-05 5.90E-05 6.10E-05 6.60E-05 8.10E-05 8.30E-05 8.30E-05 1.00E-04 1.00E-04 1.00E-04 1.10E-04 1.10E-04 1.10E-04 1.20E-04 1.40E-04 1.40E-04

AC C

EP

TE D

M AN U

negative regulation of macromolecule biosynthetic 34 process 8.2 biosynthetic process 133 32.2 cellular catabolic process 52 12.6 G2/M transition of mitotic cell cycle 7 1.7 regulation of protein ubiquitination 13 3.1 regulation of binding 16 3.9 posttranscriptional regulation of gene expression 19 4.6 immune response-regulating signal transduction 10 2.4 negative regulation of protein modification process 14 3.4 protein ubiquitination 14 3.4 negative regulation of transcription 30 7.3 regulation of catalytic activity 45 10.9 regulation of protein localization 15 3.6 positive regulation of response to stimulus 20 4.8 negative regulation of cellular biosynthetic process 34 8.2 regulation of DNA binding 14 3.4 regulation of establishment of protein localization 14 3.4 establishment of localization in cell 45 10.9 chordate embryonic development 24 5.8 signal transduction 110 26.6 negative regulation of biosynthetic process 34 8.2 embryonic development ending in birth or egg 24hatching 5.8 RNA elongation from RNA polymerase II promoter 9 2.2 protein transport 41 9.9 cellular localization 47 11.4 regulation of ubiquitin-protein ligase activity 11 2.7 I-kappaB kinase/NF-kappaB cascade 10 2.4 response to biotic stimulus 26 6.3 negative regulation of cell proliferation 25 6.1 establishment of protein localization 41 9.9 transcription 86 20.8 regulation of response to stress 21 5.1 regulation of protein transport 13 3.1 regulation of DNA metabolic process 13 3.1 response to hormone stimulus 25 6.1 RNA elongation 9 2.2 regulation of ligase activity 11 2.7 embryonic development 33 8 regulation of RNA metabolic process 76 18.4 positive regulation of protein ubiquitination 11 2.7 in utero embryonic development 16 3.9 regulation of transcription factor activity 12 2.9 macromolecule localization 51 12.3 response to endogenous stimulus 26 6.3 histone modification 13 3.1 protein localization 44 10.7 regulation of cytokine production 16 3.9 DNA recombination 12 2.9 regulation of protein kinase cascade 19 4.6 regulation of I-kappaB kinase/NF-kappaB cascade 12 2.9

ACCEPTED MANUSCRIPT 1.70E-03

RI PT

1.70E-03 1.70E-03 1.70E-03 1.70E-03 1.80E-03 2.00E-03 2.80E-03 2.80E-03 3.00E-03 3.30E-03 3.40E-03 4.20E-03 4.50E-03 4.50E-03 4.90E-03 5.20E-03 5.40E-03 5.80E-03 5.80E-03 5.80E-03 5.90E-03 5.90E-03 5.90E-03 5.90E-03 6.40E-03 6.40E-03 6.40E-03 6.40E-03 7.00E-03 7.50E-03 7.50E-03 7.60E-03 8.10E-03 8.70E-03 9.20E-03 9.40E-03 9.40E-03 9.40E-03 9.60E-03 9.60E-03 1.00E-02 1.00E-02 1.00E-02 1.00E-02 1.00E-02

AC C

EP

TE D

M AN U

SC

membrane protein intracellular domain proteolysis 5 1.2 1.50E-04 lipopolysaccharide-mediated signaling pathway5 1.2 1.50E-04 positive regulation of cytokine production 11 2.7 1.50E-04 covalent chromatin modification 13 3.1 1.50E-04 regulation of nucleocytoplasmic transport 9 2.2 1.50E-04 chromatin modification 20 4.8 1.60E-04 spliceosome assembly 7 1.7 1.80E-04 regulation of mRNA stability 6 1.5 2.50E-04 cellular response to hormone stimulus 13 3.1 2.50E-04 DNA damage checkpoint 8 1.9 2.80E-04 ubiquitin-dependent protein catabolic process18 4.4 3.00E-04 T cell receptor signaling pathway 6 1.5 3.10E-04 regulation of RNA stability 6 1.5 3.80E-04 response to abiotic stimulus 23 5.6 4.20E-04 regulation of immune response 17 4.1 4.20E-04 DNA integrity checkpoint 8 1.9 4.60E-04 regulation of defense response 13 3.1 4.90E-04 regulation of intracellular protein transport 8 1.9 5.10E-04 positive regulation of immune response 13 3.1 5.50E-04 RNA stabilization 5 1.2 5.60E-04 mRNA stabilization 5 1.2 5.60E-04 positive regulation of ubiquitin-protein ligase activity 9 2.2 5.60E-04 positive regulation of DNA binding 9 2.2 5.60E-04 protein amino acid phosphorylation 34 8.2 5.70E-04 regulation of transcription, DNA-dependent 71 17.2 5.70E-04 regulation of intracellular transport 9 2.2 6.20E-04 regulation of ubiquitin-protein ligase activity during 9 mitotic 2.2 cell 6.20E-04 cycle regulation of protein export from nucleus 4 1 6.20E-04 positive regulation of interleukin-2 production4 1 6.20E-04 response to stimulus 123 29.8 6.90E-04 positive regulation of NF-kappaB transcription 7factor activity 1.7 7.40E-04 positive regulation of ligase activity 9 2.2 7.50E-04 cell cycle checkpoint 10 2.4 7.60E-04 telomere maintenance 6 1.5 8.10E-04 induction of apoptosis by extracellular signals11 2.7 8.80E-04 pattern recognition receptor signaling pathway5 1.2 9.30E-04 activation of immune response 10 2.4 9.60E-04 regulation of specific transcription from RNA polymerase 10 2.4 II promoter 9.60E-04 telomere organization 6 1.5 9.60E-04 enzyme linked receptor protein signaling pathway 21 5.1 9.90E-04 regulation of gene-specific transcription 12 2.9 9.90E-04 nucleocytoplasmic transport 13 3.1 1.10E-03 response to lipopolysaccharide 9 2.2 1.10E-03 response to ionizing radiation 8 1.9 1.10E-03 positive regulation of transcription factor activity 8 1.9 1.10E-03 catabolic process 53 12.8 1.10E-03

% 26 17 37 15 16 15 19 20 17 15 15 14 17 14 11 17 9 12 14 23 10

6.3 4.1 9 3.6 3.9 3.6 4.6 4.8 4.1 3.6 3.6 3.4 4.1 3.4 2.7 4.1 2.2 2.9 3.4 5.6 2.4

P-Value Benjamini 3.20E-11 3.90E-09 2.80E-08 1.70E-06 5.90E-08 2.30E-06 1.70E-07 5.20E-06 3.40E-07 8.10E-06 1.00E-06 2.00E-05 3.30E-06 5.60E-05 3.50E-06 5.20E-05 3.70E-06 4.90E-05 8.30E-06 9.90E-05 1.30E-05 1.40E-04 5.80E-05 5.80E-04 6.40E-05 5.90E-04 7.40E-05 6.30E-04 7.80E-05 6.20E-04 1.50E-04 1.10E-03 3.00E-04 2.10E-03 3.30E-04 2.20E-03 5.40E-04 3.40E-03 1.70E-03 1.00E-02 1.80E-03 1.00E-02

M AN U

SC

RI PT

ACCEPTED MANUSCRIPT Term Count Cell cycle Pancreatic cancer Pathways in cancer NOD-like receptor signaling pathway Chronic myeloid leukemia RIG-I-like receptor signaling pathway Neurotrophin signaling pathway Ubiquitin mediated proteolysis Toll-like receptor signaling pathway Small cell lung cancer Apoptosis TGF-beta signaling pathway Spliceosome Prostate cancer Cytosolic DNA-sensing pathway Insulin signaling pathway Bladder cancer Adherens junction T cell receptor signaling pathway MAPK signaling pathway Adipocytokine signaling pathway

AC C

EP

TE D

Category KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY KEGG_PATHWAY