Accepted Manuscript
Title: Network Analysis based on Low-Rank Method for Mining Information on Integrated Data of Multi-Cancers
Authors: Mi-Xiao Hou, Ying-Lian Gao, Jin-Xing Liu, Ling-Yun Dai, Xiang-Zhen Kong, Junliang Shang
PII: S1476-9271(18)30781-3
DOI: https://doi.org/10.1016/j.compbiolchem.2018.11.027
Reference: CBAC 6971
To appear in: Computational Biology and Chemistry
Received date: 25 October 2018
Revised date: 30 November 2018
Accepted date: 30 November 2018
Please cite this article as: Hou M-Xiao, Gao Y-Lian, Liu J-Xing, Dai L-Yun, Kong XZhen, Shang J, Network Analysis based on Low-Rank Method for Mining Information on Integrated Data of Multi-Cancers, Computational Biology and Chemistry (2018), https://doi.org/10.1016/j.compbiolchem.2018.11.027 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Network Analysis based on Low-Rank Method
Network Analysis based on Low-Rank Method for Mining Information on Integrated Data of Multi-Cancers
Mi-Xiao Hou^a, Ying-Lian Gao^b, Jin-Xing Liu^a,c, Ling-Yun Dai^a, Xiang-Zhen Kong^a, Junliang Shang^a
a School of Information Science and Engineering, Qufu Normal University, Rizhao, China
b Library of Qufu Normal University, Qufu Normal University, Rizhao, China
c Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China
Correspondence information: Jin-Xing Liu, School of Information Science and Engineering, Qufu Normal University, Rizhao, China; Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China;
E-mail address: [email protected]; Tel.: +086-633-3981241.
Graphical abstract
Highlights
We apply a sparse and low-rank method, RPCA, to solve the noise problem for integrated data of multi-cancers from TCGA.
Experiments show that after denoising by RPCA, the gene expression data tend to be more orderly and neat than before, and the effect is better than that of other denoising methods.
From the denoised network, we find some abnormally expressed genes and pathways that are associated with many cancers.
Abstract
The noise in cancer sequencing data is a problem that cannot be ignored. Reducing the noise of these cancer data in an appropriate way is an important issue in the analysis of gene co-expression networks. In this paper, we apply a sparse and low-rank method, Robust Principal Component Analysis (RPCA), to solve the noise problem for integrated data of multi-cancers from The Cancer Genome Atlas (TCGA). We then build the gene co-expression network based on the integrated data after noise reduction. Finally, we perform node and pathway mining on the denoised networks. Experiments in this paper show that after denoising by RPCA, the gene expression data tend to be more orderly and neat than before, and the constructed networks contain more pathway enrichment information than those built from unprocessed data. Moreover, using the betweenness centrality of the nodes in the network, we find in the denoised network some abnormally expressed genes and pathways that have been proven to be associated with many cancers. The experimental results indicate that our method is reasonable and effective, and we also find some candidate suspicious genes that may be linked to multi-cancers.
Keywords: noise reduction; gene co-expression network; multi-cancers; integrated data; abnormally expressed genes
1. Introduction
With the development of various sequencing technologies, the analysis of sequencing data by machine learning methods has become more diverse [1-5]. Among the many fields of bioinformatics, cancer research has long been a hot topic [6-9]. In recent years, the study of multi-cancers has drawn wide attention from clinicians. Many cancers may share a common oncogenic mutation, which provides good clues for cancer research. In addition, links between different cancers cannot be ruled out, and the study of the human body itself is still at an exploratory stage. With the development of gene chip technology and next-generation sequencing technology, a variety of cancer sequencing data are continuously generated, providing critical data support for cancer studies [10, 11]. For cancer-related data, research restricted to the functional level of a single gene limits the scope and process by which people explore the biological functions of living cells. In recent years, research on constructing networks from cancer data has been drawing much attention [12, 13]. A living organism is itself a complex network system, and graph theory and networks are well suited to explaining its internal principles. In addition, the pathogenesis of cancer is an extremely complicated process that involves more than one gene and more than one pathway. In this case, it is particularly important to consider the interactions between genes. Gene co-expression network analysis clusters genes with the same status and function together; it is built on gene expression and other data to explore the relationships between genes. Constructing gene co-expression networks by means of systems biology to reveal the relationships between genes at the system level has become a major research direction [14-16]. It is also an effective way to mine information for multi-cancers analysis [17].
However, the gene expression profiles used to build co-expression networks inevitably contain noise, and some of the noise is strong enough to bias the extracted information. The noise problem in gene expression data cannot be ignored. From the manufacturing of gene chips to gene expression experiments, the processing of gene chip images, and the collection and conversion of expression data, noise can be introduced at every stage. Sometimes the noise points can completely cover the original data points. Noisy data bring numerous disadvantages to the construction of gene networks, such as repetitive interference from unwanted nodes and the influence of noisy gene expression on the measurement of inter-gene relationships. Improper denoising, in turn, is likely to lose part of the original data and can even seriously damage the information. Therefore, it is necessary to properly remove the noise and reconstruct the original data matrix before constructing the network.
Little research has been done on denoising gene expression data. Some scholars have used the wavelet transform to reduce the noise of tumor data and achieved impressive results [18]. The main idea of wavelet denoising is first to compare the coefficients of each layer after the wavelet transform with the corresponding threshold, and then to process the coefficients larger than the threshold and those smaller than the threshold separately. Finally, the denoised signal is obtained by wavelet reconstruction from the two groups of processed coefficients [19]. However, the effect of the threshold is uncontrollable. In this paper, we utilize Robust Principal Component Analysis (RPCA) to reduce the noise of gene expression data and then build the network. RPCA was first successfully applied to background modeling in video surveillance and shadow removal from face images [20], and in recent years it has also been used for feature gene selection [21]. The related literature [22] assumes that gene expression data lie in a low-dimensional subspace, so that non-differentially expressed gene data can be regarded as a low-rank matrix. In this article, however, we introduce RPCA from another perspective, that of network construction: we remove noise for the purpose of constructing the network. With RPCA, the original matrix can be decomposed into a low-rank matrix and a sparse matrix. A small amount of noise is separated from the original matrix as the sparse matrix. On the one hand, the low-rank matrix keeps the integrity of the original matrix; on the other hand, the low-rank property provides a good basis for the similarity measure between nodes in the co-expression network.
In this paper, focusing on the integrated gene expression datasets of three cancers from The Cancer Genome Atlas (TCGA) [16, 23, 24], we build the network and conduct information mining. Before the network construction, a low-rank and sparse method, RPCA, is introduced to reduce noise and reconstruct the data matrix. Thus, the internal relationships of genes in the constructed networks tend to be orderly and the network structure is clearer, so that more common oncogenic factors and oncogenic modules can be found.
The rest of this paper is organized as follows: Section 2 introduces the related methods and procedure; Section 3 presents the experiments based on the integrated multi-cancers data; and the last section provides the conclusion of this paper.
2. Methods
The two most important stages of the whole process are denoising and network construction. Denoising mainly draws on RPCA. Network construction is based on the Pearson correlation coefficient (PCC) between gene nodes.
2.1 RPCA
The matrix recovery problem of RPCA can be described as follows. Let $D$ be the $m \times n$ gene expression matrix in the experiments, where $m$ and $n$ are the numbers of genes and samples, and assume that the matrix $D$ is low-rank (or approximately low-rank), namely:
$$D_0 = A_0 + S_0, \tag{1}$$
where $A_0$ is a low-rank matrix and $S_0$ is a sparse perturbation matrix. RPCA was proposed by Candes et al. and can recover a low-rank matrix from a highly corrupted observation matrix [20]. The elements of $S_0$ may have a large magnitude, but its support set is assumed to be sparse. The corresponding steps are shown in Fig. 1(a).
The nuclear norm of a matrix $A$ is $\|A\|_* := \sum_i \sigma_i(A)$, where $\sigma_i(A)$ is the $i$-th singular value of $A$. The $L_1$ norm of a matrix $S$ is $\|S\|_1 := \sum_{ij} |S_{ij}|$. Assuming that the observed data matrix is given by (1), RPCA solves the following optimization problem:
$$\text{minimize } \|A\|_* + \lambda \|S\|_1, \quad \text{subject to } D = A + S, \tag{2}$$
where $\lambda$ is a regularization parameter. To solve this problem, we utilize the algorithm given by Lin et al., the inexact Augmented Lagrange Multipliers (IALM) method [25], which eliminates the equality constraint by introducing a Lagrange multiplier. We then obtain the denoised gene expression matrix $A$.
2.2 Network Construction
The correlation between two genes is commonly measured by the PCC. The PCC between two genes $X$ and $Y$ of $A$ is defined as follows:
$$R = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}, \tag{3}$$
where $X_i$ and $Y_i$ are the observations of genes $X$ and $Y$ on the $i$-th patient, $n$ is the number of patients, and $\bar{X}$ and $\bar{Y}$ are the means of the observations of genes $X$ and $Y$ over all patients. There are many ways to select the threshold for the correlation coefficients. After the threshold is determined, the $m \times m$ adjacency matrix can be obtained from the correlation coefficient matrix $R$, and the co-expression networks can then be built, as shown in Fig. 1(b).
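The two stages of this section, RPCA denoising followed by PCC thresholding, can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function names `rpca_ialm` and `coexpression_adjacency`, the default $\lambda = 1/\sqrt{\max(m, n)}$, the multiplier initialisation and step schedule (`mu`, `rho`), the stopping tolerance, and the use of the absolute PCC when thresholding are all choices made here for illustration, and the demo matrix is synthetic.

```python
import numpy as np


def rpca_ialm(D, lam=None, tol=1e-7, max_iter=500):
    """Split D into low-rank A and sparse S (D = A + S) with an inexact ALM loop."""
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))  # assumed default, not the paper's tuned value
    norm_fro = np.linalg.norm(D, "fro")
    norm_two = np.linalg.norm(D, 2)
    # Assumed initialisation of the Lagrange multiplier Y and the penalty mu.
    Y = D / max(norm_two, np.abs(D).max() / lam)
    mu, rho = 1.25 / norm_two, 1.5
    A = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(max_iter):
        # A-step: singular value thresholding (proximal map of the nuclear norm).
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        A = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-step: entrywise soft thresholding (proximal map of the L1 norm).
        R = D - A + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Multiplier update on the constraint residual D - A - S.
        Z = D - A - S
        Y = Y + mu * Z
        mu *= rho
        if np.linalg.norm(Z, "fro") <= tol * norm_fro:
            break
    return A, S


def coexpression_adjacency(A, threshold):
    """Boolean m x m adjacency from the gene-gene PCC matrix (rows of A are genes)."""
    R = np.corrcoef(A)              # Pearson correlation between rows, as in Eq. (3)
    adj = np.abs(R) >= threshold    # assumption: threshold the absolute PCC
    np.fill_diagonal(adj, False)    # no self-loops
    return adj


# Synthetic demo: a rank-3 "expression" matrix with 5% sparse corruptions.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 100))
noise = np.zeros((100, 100))
mask = rng.random((100, 100)) < 0.05
noise[mask] = rng.uniform(-10, 10, mask.sum())
A, S = rpca_ialm(low_rank + noise)
print("relative error of A:", np.linalg.norm(A - low_rank) / np.linalg.norm(low_rank))
print("edges at |PCC| >= 0.9:", int(coexpression_adjacency(A, 0.9).sum()) // 2)
```

Passing `lam=0.003` would correspond to the value selected in Section 3.2. Note that `np.corrcoef` returns NaN for rows with zero variance, so constant genes would need to be filtered out before thresholding.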
3. Experiments
This section presents the experiments on network mining. Subsection 3.1 describes the sources of the cancer datasets. Subsection 3.2 covers the parameter selection for RPCA, and Subsection 3.3 the threshold selection. The comparative experiments and discussion are in Subsection 3.4.
3.1 Datasets
The experimental datasets are integrated gene expression data of three cancers from the TCGA database (RNA-seq data at level 3): pancreatic cancer (PAAD), esophageal cancer (ESCA) and head and neck squamous cell carcinoma (HNSC). There are more than 20,502 genes in each dataset, and the number of patients differs between datasets. Specifically, HNSC contains 398 tumor patients, ESCA contains 183 tumor patients, and PAAD contains 176 tumor patients. After removing the genes whose values are zero in all patients, we integrated the datasets into a matrix covering 757 patients and 20,261 genes for the correlation measure. The datasets were not normalized with any method [24].
3.2 Parameter Selection for RPCA
The rank of the low-rank matrix affects the measurement of the relations between nodes in the entire network. If the rank is too low, the nodes will appear more closely related and the networks will be sloppy. Of course, the node relationships must still be based on a real and reliable matrix. Recovering $A$ reasonably is therefore the key step of network construction. Here, we determine the matrix $A$ by adjusting the parameter $\lambda$.
In order to select an appropriate parameter, we tested the effect of different parameters by clustering the samples of the different cancers on the low-rank matrix (k-means). For this purpose, we carried out two experiments, shown in Table 1 and Fig. 2. The $\lambda$ of Table 1 is in the range of $10^{-3}$ to $10^{3}$. The clustering results are assessed by comparing the cluster labels with the true labels, using the accuracy (ACC) [26] as the evaluation function:
$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta(s_i, \mathrm{map}(r_i))}{n}, \tag{4}$$
where $n$ is the total number of samples, $\delta(x, y)$ is a delta function that equals 1 if $x = y$ and 0 otherwise, and $\mathrm{map}(r_i)$ is the mapping function that maps each cluster label $r_i$ to the corresponding original label $s_i$.
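Equation (4) can be illustrated with a small Python sketch. It assumes that the mapping function is the one-to-one assignment of cluster labels to true labels that maximizes the number of matches, and that there are no more clusters than true classes. For three cancers (k = 3) the best assignment can be found by trying every permutation; an assignment solver would be preferable for many clusters. The function name and the toy labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """ACC of Eq. (4): best one-to-one relabelling of clusters, then hit rate."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    best_hits = 0
    # Brute force over injective maps cluster id -> class id (fine for small k).
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[r] == s for s, r in zip(true_labels, cluster_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Toy example: six samples, one of which lands in the wrong cluster.
truth = ["PAAD", "PAAD", "ESCA", "ESCA", "HNSC", "HNSC"]
found = [2, 2, 0, 1, 1, 1]
print(clustering_accuracy(truth, found))
```

In the toy example the best mapping sends cluster 2 to PAAD, cluster 1 to HNSC and cluster 0 to ESCA, so five of the six samples are counted as correct.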
The clustering results indicate that RPCA has remarkable advantages and that its effect on data recovery and denoising is impressive. In contrast, the effect of wavelet denoising is significantly weaker (here, the clustering accuracy reported for wavelet denoising is the best result selected from many wavelet denoising tests). RPCA has good explanatory power for data reconstruction and provides reliable data for the subsequent network construction. Both RPCA and the wavelet transform have been applied in the field of image processing, but their denoising effects on gene expression data are quite different, because RPCA tends to act on the data as a whole, whereas the wavelet transform may focus on local processing. Gene expression data, as a two-dimensional matrix, are better served by the global effect. In addition, a very important factor in wavelet denoising is the selection of the threshold. For gene expression data, the choice of the wavelet threshold can cause the data to be excessively interfered with and over-controlled, resulting in distorted data reconstruction. The RPCA method denoises and reconstructs the data as a whole, following the principle of preserving most of the data and recovering a small part of it, so as to ensure the reliability of the data. Therefore, gene expression data seem to be more suitable for RPCA as a preprocessing step for denoising and reconstruction, and this is the reason we chose it as the network noise reducer.
In order to find better parameters for data denoising and recovery, we further look for the optimal parameter within the interval of 0.01, narrowing the range around 0.001 for optimization, as shown in Fig. 2.
In Fig. 2, we select a more precise parameter around the best value in Table 1. Fig. 2 shows that at $\lambda = 0.003$ the clustering effect is good. Fig. 2 and Table 1 show that the clustering accuracy of the data ranges from 0.45 to above 0.8 under different parameters, which indicates that a reasonable selection of the parameter has a great influence on data recovery and reconstruction. In our experiment, the parameter $\lambda$ is set to 0.003. In this case, the low-rank matrix $A$ is a better reconstruction of $D$: the noise is removed without losing much of the information in the original observation matrix $D$. Therefore, we can construct a co-expression network based on $A$ with confidence.
3.3 Threshold Selection
The correlation values of the PCC matrix are in the range of 0-1. The closer a value of the matrix is to 1, the higher the correlation. Here we sort the values, perform curve fitting and select the first inflection point as the overall threshold (0.9342). In this way, we ensure that the threshold adapts to the network, so that the network retains more relevant interactions.
3.4 Results and Discussion
After constructing the networks, the networks before and after denoising both retained about 200 nodes. Comparing the two, the network denoised by RPCA detects more pathways and more information than the network before denoising, as listed in Table 2 (the databases involved in our pathway analysis include KEGG and Reactome; original networks: 144; denoising networks by RPCA: 142; FDR: 0.1). Next, we took the pathways found in both the network before and the network after denoising for p-value analysis and comparison, and discovered that the p-values differ markedly between the two, as listed in Table 3. These results suggest that RPCA has distinct merits in denoising and reconstructing gene expression data.
In the denoised networks, the five pathways with the smallest p-values are in sharp contrast with those of the pre-denoising networks, as listed in Table 3. Among the common pathways, those enriched in the denoised networks show smaller p-values. This indicates that the biologically enriched pathways mapped by the genes are more accurate. In addition, Antimicrobial peptides is a pathway shared by multiple cancers.
We analyzed the connections of genes (nodes) in each module of the denoised networks. Compared with traditional connectivity, we find that the betweenness centrality of nodes can identify more useful genes. The betweenness of a node is the sum, over pairs of other nodes, of the proportion of shortest paths between them that pass through the node. It shows the role of a node in connecting other nodes in the network: the higher the betweenness, the more important the node is in maintaining the tight connectivity of the network. In the eight interoperable modules of the denoised network, we find the nodes with larger betweenness. Their annotations, taken from GeneCards [27] (http://www.genecards.org/), a gene annotation website, are listed in Table 4.
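The difference between degree and betweenness can be made concrete with a short sketch. The code below computes unnormalized betweenness on an unweighted, undirected graph with Brandes' algorithm, which is one standard way to evaluate the definition above; it is not necessarily the tool used in this work, and a graph library would normally be used in practice. The helper names and the toy graph (two triangles joined through a single node `x`) are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency as {node: set(neighbours)}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def betweenness_centrality(graph):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths to every node.
        dist = {s: 0}
        n_paths = {v: 0 for v in graph}
        n_paths[s] = 1
        preds = {v: [] for v in graph}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        dep = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += n_paths[v] / n_paths[w] * (1.0 + dep[w])
            if w != s:
                score[w] += dep[w]
    # Each unordered pair was visited from both of its endpoints.
    return {v: c / 2.0 for v, c in score.items()}


# Toy module: two triangles (a, b, c) and (d, e, f) bridged by node x.
toy = build_graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
                   ("x", "d"), ("d", "e"), ("d", "f"), ("e", "f")])
bc = betweenness_centrality(toy)
for node in sorted(toy, key=bc.get, reverse=True):
    print(node, "degree", len(toy[node]), "betweenness", bc[node])
```

In this toy graph the bridging node `x` has only two neighbours, fewer than `c` or `d`, yet every shortest path between the two triangles passes through it, so it receives the highest betweenness. This is the kind of low-degree connector that a degree-based ranking would miss.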
Among them, abnormal expression of BGN has been repeatedly detected in tumor development. BGN was found to be overexpressed in the extracellular matrix of PAAD samples compared with normal pancreas or chronic pancreatitis tissues [28]; the mRNA level of BGN can distinguish the cancer specificity of bladder cancer patients [29]; and the up-regulation of BGN is associated with poor prognosis and PTEN deficiency in prostate cancer patients [30]. SPARC has long been linked to malignant tumors and is abnormally expressed in PAAD, breast cancer, lung cancer and others [31-33]. S100A8 is also a malignant tumor-related gene [34], found in gastric cancer and prostate cancer [35, 36]. AGR2 is abnormally expressed in both prostate cancer and breast cancer [37, 38]. SFRP2 inhibits the transformation and invasion of cervical cancer cells through the Wnt signaling pathway and was found to be inactivated in gastric cancer [39, 40].
The roles of the above genes in the modules are displayed in Fig. 3. Each node in the figure represents a gene: the size of a node represents its degree, and the depth of its color indicates its betweenness. The five darkest nodes identified in the modules of the networks have been validated to be associated with multi-cancers: AGR2, S100A9, BGN, SPARC and SFRP2. As far as node mining is concerned, betweenness has unique advantages. For example, in Module 1, although S100A9 has little connectivity, it connects two gene clusters, which makes it important for the connectivity of the entire network. Similarly, BGN and SPARC in Module 3 have the same effect.
We apply RPCA to eliminate noise while ensuring that the internal information of the data is not destroyed, so that the constructed networks show better mining properties than those built from the original data. Moreover, the function of the gene co-expression network is to cluster genes with the same function, and the low-rank property of RPCA facilitates the recognition of similarities between genes. The discovery of more gene-related features makes the entire network more robust and complete. Therefore, RPCA enhances the identification of relationships between nodes in the denoised networks, which makes the networks more conducive to mining helpful factors than the original data. Moreover, the experiments show that RPCA is indeed better than the wavelet transform at denoising and reconstructing gene expression data. For node evaluation, applying betweenness centrality improves the recognition of some key nodes in the networks and finds some confirmed abnormal genes related to many cancers. Although no clear clinical study indicates a direct effect of OR10J5, OR6P1 and CELF3 on cancer, they can be regarded as candidate abnormal genes of multi-cancers for reference because of their fundamental properties in our networks. The genes we have discovered are collectively abnormally expressed in the three types of cancers. Related materials show that a large part of these genes are also related to other cancers, which further demonstrates that these genes are associated not only with the three types of cancers in this article and may be important clinical targets for other cancers. By building a co-expression network from the integrated data of three cancers, we also found several shared disease-causing genes for other cancers, which shows that our method is indeed effective.
4. Conclusions
In this paper, we introduce a low-rank and sparse method, RPCA, to reduce noise and reconstruct integrated gene expression data of multi-cancers from TCGA, and obtain more reliable and meaningful cancer data. Building on the integrity of the data, the network was constructed from the PCC between genes in the reconstructed expression data. Finally, some suspicious information about cancers was extracted from the gene co-expression network. We also compared the networks before and after denoising and found that the pathways in the denoised and reconstructed networks were more accurate and abundant. In addition, through the betweenness centrality of nodes, we discovered some abnormally expressed genes associated with multi-cancers in the denoised networks, which supports the reliability of RPCA denoising.
This paper provides a new view for network denoising analysis, but some parts remain immature. For example, the model does not consider the effect of the perturbation matrix; moreover, TCGA data are multi-omics data, and many types of data could be integrated into the new network after denoising. These are points we can improve in future work.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61872220, 61572284, 61702299, and 61701279.
References
[1] L.Y. Dai, C.M. Feng, J.X. Liu, C.H. Zheng, J. Yu, M.X. Hou, Complexity, 2017 (2017) 1-11.
[2] C.-M. Feng, Y.-L. Gao, J.-X. Liu, C.-H. Zheng, J. Yu.
[3] H. Han, BMC Medical Genomics, 7 (2014) S5.
[4] J.-X. Liu, D. Wang, Y.-L. Gao, C.-H. Zheng, J.-L. Shang, F. Liu, Y. Xu, Neurocomputing, 228 (2017) 263-269.
[5] C.-H. Zheng, L. Yuan, W. Sha, Z.-L. Sun, BMC Bioinformatics, 15 (2014) S3.
[6] Z. Chun-Hou, H. De-Shuang, Z. Lei, K. Xiang-Zhen, IEEE Transactions on Information Technology in Biomedicine, 13 (2009) 599-607.
[7] X. Ma, L. Yu, P. Wang, X. Yang, Comput Biol Chem, 69 (2017) 164-170.
[8] G. Wu, X. Feng, L. Stein, Genome Biology, 11 (2010) R53.
[9] H. Xie, J. Li, Z. Qiaosheng, Y. Wang, Comparison among dimensionality reduction techniques based on Random Projection for cancer classification, (2016).
[10] R.G.W. Verhaak, K.A. Hoadley, E. Purdom, V. Wang, Y. Qi, M.D. Wilkerson, C.R. Miller, L. Ding, T. Golub, J.P. Mesirov, G. Alexe, M. Lawrence, M. O'Kelly, P. Tamayo, B.A. Weir, S. Gabriel, W. Winckler, S. Gupta, L. Jakkula, H.S. Feiler, J.G. Hodgson, C.D. James, J.N. Sarkaria, C. Brennan, A. Kahn, P.T. Spellman, R.K. Wilson, T.P. Speed, J.W. Gray, M. Meyerson, G. Getz, C.M. Perou, D.N. Hayes, Cancer Cell, 17 (2010) 98-110.
[11] C.-H. Zheng, W. Yang, Y.-W. Chong, J.-F. Xia, Computers in Biology and Medicine, 72 (2016) 22-29.
[12] C. Yang, S.G. Ge, C.H. Zheng, Oncotarget, 8 (2017) 89021-89032.
[13] S.G. Ge, J.F. Xia, S. Wen, C.H. Zheng, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14 (2017) 1115-1121.
[14] M. Hofree, J.P. Shen, H. Carter, A. Gross, T. Ideker.
[15] P. Langfelder, S. Horvath, BMC Bioinformatics, 9 (2008) 559.
[16] Q. Zhang, J.E. Burdette, J.-P. Wang, BMC Systems Biology, 8 (2014) 1338.
[17] H. Kim, J. Watkinson, V. Varadan, D. Anastassiou, BMC Medical Genomics, 3 (2010) 51.
[18] G. Chen, Y. Lu, H. Yang, Computers & Applied Chemistry, (2011).
[19] Y.I. Bo, T. Wen, NAAU, Yantai, Shandong, Computer Engineering & Applications, 48 (2012) 146-149.
[20] E.J. Candes, X. Li, Y. Ma, J. Wright, Journal of the ACM, 58 (2009).
[21] J.X. Liu, Y.T. Wang, C.H. Zheng, W. Sha, J.X. Mi, Y. Xu, BMC Bioinformatics, 14 (2013) 1-10.
[22] C. Eckart, G. Young, Psychometrika, 1 (1936) 211-218.
[23] Y. Zhu, P. Qiu, Y. Ji, Nature Methods, 11 (2014) 599-600.
[24] H. Han, K. Men, Journal of Biomedical Informatics, 85 (2018) 80-92.
[25] Z. Lin, M. Chen, Y. Ma, Mathematics, (2010).
[26] W. Xu, X. Liu, Y. Gong, in: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, ACM, 2003, pp. 267-273.
[27] M. Safran, I. Dalah, J. Alexander, N. Rosen, T. Stein, M. Shmoish, N. Nativ, I. Bahir, T. Doniger, H. Krug, A. Sirota-Madi, T. Olender, Y. Guan-Golan, G. Stelzer, A. Harel, D. Lancet, GeneCards Version 3: The human gene integrator, (2010).
[28] C.K. Weber, G. Sommer, P. Michl, H. Fensterer, M. Weimer, F. Gansauge, G. Leder, G. Adler, T.M. Gress, Gastroenterology, 121 (2001) 657-667.
[29] C. Niedworok, K. Röck, I. Kretschmer, T. Freudenberger, N. Nagy, T. Szarvas, D.F. Vom, H. Reis, H. Rübben, J.W. Fischer, Plos One, 8 (2013) e80084.
[30] F. Jacobsen, J. Kraft, C. Schroeder, C. Hube-Magg, M. Kluth, D.S. Lang, R. Simon, G. Sauter, J.R. Izbicki, T.S. Clauditz, Neoplasia, 19 (2017) 707-715.
[31] G. Watkins, A. Douglas-Jones, R. Bryce, R. E Mansel, W.G. Jiang, Prostaglandins, Leukotrienes and Essential Fatty Acids, 72 (2005) 267-272.
[32] Huang, Jing, Zhang, Yuan-Yuan, Zhao, Jiang, Cong, Hong-Yun, Zhao, Yang, Chinese Journal of Cancer, 31 (2012) 541-548.
[33] M. Sinn, B.V. Sinn, J.K. Striefler, J.L. Lindner, J.M. Stieler, P. Lohneis, S. Bischoff, H. Bläker, U. Pelzer, M. Bahra, Annals of Oncology, 25 (2014) 1025-1032.
[34] C. Gebhardt, J. Németh, P. Angel, J. Hess, Biochemical Pharmacology, 72 (2006) 1622-1631.
[35] H.Y. Yong, A. Moon, Archives of Pharmacal Research, 30 (2007) 75-81.
[36] A. Hermani, J. Hess, S.B. De, S. Medunjanin, R. Grobholz, L. Trojan, P. Angel, D. Mayer, Clinical Cancer Research, 11 (2005) 5146-5152.
[37] F.R. Fritzsche, E. Dahl, S. Pahl, M. Burkhardt, J. Luo, E. Mayordomo, T. Gansukh, A. Dankof, R. Knuechel, C. Denkert, Clinical Cancer Research An Official Journal of the American Association for Cancer Research, 12 (2006) 1728-1734.
[38] J.S. Zhang, A. Gong, J.C. Cheville, D.I. Smith, C.Y. Young, Genes Chromosomes & Cancer, 43 (2005) 249-259.
[39] M.T. Chung, H.C. Lai, H.K. Sytwu, M.D. Yan, Y.L. Shih, C.C. Chang, M.H. Yu, H.S. Liu, D.W. Chu, Y.W. Lin, Gynecologic Oncology, 112 (2009) 646-653.
[40] Y.Y. Cheng, J. Yu, Y.P. Wong, E.P.S. Man, K.F. To, V.X. Jin, J. Li, Q. Tao, J.J.Y. Sung, F.K.L. Chan, British Journal of Cancer, 97 (2007) 895-901.
Figure Captions
Fig. 1. Denoising network construction and information mining. Denoising by RPCA is shown in Fig. 1(a), and the co-expression networks built on $A$ are shown in Fig. 1(b). On the co-expression networks, we can carry out node mining and pathway enrichment analysis.
Fig. 2. Fluctuations of $\lambda$ around the optimal value. Under the control of these parameters, the clustering effect generally goes from superior to inferior. When $\lambda = 0.003$, the clustering effect is better than with the other parameters.
Fig. 3. Modules with the confirmed cancer genes. According to betweenness, AGR2, S100A9, BGN, SPARC and SFRP2 are identified in the modules and have been validated to be associated with multi-cancers. These genes have a bridging effect in these modules.
Tables
Table 1. Comparison of RPCA with different parameters for cancer subtypes clustering
Wavelet Method
0.51
)
RPCA (different parameters for
Transform
0.001
0.01
0.1
1
10
100
0.55
0.57
0.82
0.50
0.45
0.45
0.45
ACC
original
(k=3, three cancers)
1000 0.45
Table 2. Comparison for pathways of networks before and after denoising
Mode | Number of pathways
Network before Denoise | 8
Network by RPCA | 18
Table 3. Comparison for p-values of pathways Network after RPCA
Network before denoise
P-value
P-value
Gene sets 6.66E-16
9.33E-14
5.28E-10
Striated Muscle Contraction(R)
1.27E-11
1.01E-09
4.66E-05
Protein digestion and absorption(K)
1.44E-11
1.01E-09
4.94E-13
Antimicrobial peptides(R)
2.40E-05
9.59E-04
0.0313
Fat digestion and absorption(K)
1.95E-04
6.75E-03
0.0275
3.43E-08 9.79E-04 6.42E-11 0.1878
Pancreatic secretion(K)
FDR
FDR
0.1878
Table 4. List of genes with higher betweenness values
Gene | GO
OR10J5, OR6P1 | OR10J5 (Olfactory Receptor Family 10 Subfamily J Member 5) and OR6P1 (Olfactory Receptor Family 6 Subfamily P Member 1) are Protein Coding genes. Among their related pathways are Signaling by GPCR and Olfactory Signaling Pathway.
BGN | Diseases associated with BGN include Spondyloepimetaphyseal Dysplasia, X-Linked and Meester-Loeys Syndrome.
SPARC | Diseases associated with SPARC include Osteogenesis Imperfecta, Type XVII and Sparc-Related Osteogenesis Imperfecta.
S100A8 | Diseases associated with S100A8 include Cystic Fibrosis and Duodenal Ulcer. Among its related pathways are Activated TLR4 signalling and Innate Immune System.
AGR2 | Diseases associated with AGR2 include Pancreatic Ductal Adenocarcinoma. Among its related pathways are Tyrosine Kinases / Adaptors and Adhesion.
CELF3 | Among its related pathways are Preimplantation Embryo.
SFRP2 | Diseases associated with SFRP2 include Esophageal Basaloid Squamous Cell Carcinoma and Colorectal Cancer. Among its related pathways are Signaling by Wnt and Wnt Signaling Pathway and Pluripotency.