Network analysis based on low-rank method for mining information on integrated data of multi-cancers

Accepted Manuscript

Title: Network Analysis based on Low-Rank Method for Mining Information on Integrated Data of Multi-Cancers
Authors: Mi-Xiao Hou, Ying-Lian Gao, Jin-Xing Liu, Ling-Yun Dai, Xiang-Zhen Kong, Junliang Shang
PII: S1476-9271(18)30781-3
DOI: https://doi.org/10.1016/j.compbiolchem.2018.11.027
Reference: CBAC 6971
To appear in: Computational Biology and Chemistry
Received date: 25 October 2018
Revised date: 30 November 2018
Accepted date: 30 November 2018

Please cite this article as: Hou M-Xiao, Gao Y-Lian, Liu J-Xing, Dai L-Yun, Kong X-Zhen, Shang J, Network Analysis based on Low-Rank Method for Mining Information on Integrated Data of Multi-Cancers, Computational Biology and Chemistry (2018), https://doi.org/10.1016/j.compbiolchem.2018.11.027

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Network Analysis based on Low-Rank Method

Network Analysis based on Low-Rank Method for Mining Information on Integrated Data of Multi-Cancers

Mi-Xiao Hou (a), Ying-Lian Gao (b), Jin-Xing Liu (a, c), Ling-Yun Dai (a), Xiang-Zhen Kong (a), Junliang Shang (a)

(a) School of Information Science and Engineering, Qufu Normal University, Rizhao, China
(b) Library of Qufu Normal University, Qufu Normal University, Rizhao, China
(c) Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China

Correspondence information: Jin-Xing Liu, School of Information Science and Engineering, Qufu Normal University, Rizhao, China; Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China; E-mail address: [email protected]; Tel.: +086-633-3981241.

Graphical abstract

Highlights

• We apply a sparse and low-rank method, Robust Principal Component Analysis (RPCA), to solve the noise problem for integrated data of multi-cancers from TCGA.
• Experiments show that after denoising by RPCA, the gene expression data tend to be more orderly and neat than before, and the effect is better than that of other denoising methods.
• We find that some abnormally expressed genes and pathways in the denoised network are associated with many cancers.

Abstract

The noise in cancer sequencing data is a problem that cannot be ignored. Reducing the noise in these data in a considered way is an important issue in the analysis of gene co-expression networks. In this paper, we apply a sparse and low-rank method, Robust Principal Component Analysis (RPCA), to solve the noise problem for integrated data of multi-cancers from The Cancer Genome Atlas (TCGA). We then build the gene co-expression network based on the integrated data after noise reduction. Finally, we perform node and pathway mining on the denoised networks. Experiments in this paper show that after denoising by RPCA, the gene expression data tend to be more orderly and neat than before, and the constructed networks contain more pathway enrichment information than those built from unprocessed data. Moreover, using the betweenness centrality of the nodes in the network, we find in the denoised network some abnormally expressed genes and pathways that have been proven to be associated with many cancers. The experimental results indicate that our method is reasonable and effective, and we also find some candidate suspicious genes that may be linked to multi-cancers.

Keywords: noise reduction; gene co-expression network; multi-cancers; integrated data; abnormally expressed genes

1. Introduction

With the development of various sequencing technologies, the analysis of sequencing data by machine learning methods has become more diverse [1-5]. Among the many fields of bioinformatics, cancer research has long been a hot topic [6-9]. In recent years, the study of multi-cancers has attracted wide attention from clinicians. Many cancers may share a common oncogenic mutation, which provides good clues for cancer research. In addition, a link between different cancers cannot be ruled out, and the study of the human body itself is still at an exploratory stage. With the development of gene chip technology and next-generation sequencing technology, a variety of cancer sequencing data are continuously generated, providing critical data support for cancer studies [10, 11]. For cancer-related data, research focused on the functional level of a single gene limits the scope and the process by which people explore the biological functions of living cells. In recent years, research on constructing networks from cancer data has been drawing much attention [12, 13]. A living organism is itself a complex network system, and graph theory and networks are well suited to explaining its internal principles. In addition, the pathogenesis of cancer is an extremely complicated process that involves more than one gene and more than one pathway. It is therefore particularly important to consider the interactions between genes. Gene co-expression network analysis clusters genes with the same status and function together; it is built on gene expression and other data to explore the relationships between genes. Constructing gene co-expression networks by means of systems biology to reveal the relationships between genes at the system level has become a major research direction [14-16]. It is also an effective way to mine information for multi-cancer analysis [17].

However, the gene expression profiles used to build co-expression networks inevitably contain noise, sometimes of considerable intensity, which biases the extracted information. The noise problem in gene expression data cannot be ignored. From the manufacturing of gene chips, through gene expression experiments, to the processing of gene chip images and the collection and conversion of expression data, noise can be introduced at every stage. Sometimes the noise points can completely cover the original data points. Noisy data bring numerous disadvantages to the construction of gene networks, such as repeated interference from unwanted nodes and the influence of noisy gene expression on the measurement of inter-gene relationships. Improper denoising, in turn, is likely to lose part of the original data and can even badly damage the information it carries. Therefore, it is necessary to properly remove the noise and reconstruct the original data matrix before constructing the network.

There has been little research on denoising gene expression data. Some scholars have used the wavelet transform to reduce the noise of tumor data and achieved impressive results [18]. The main idea of wavelet denoising is first to compare the coefficients of each layer after the wavelet transform with the corresponding threshold, and then to process separately the coefficients that are larger than the threshold and those that are smaller. Finally, the denoised signal is obtained by wavelet reconstruction from the two processed sets of coefficients [19]. However, the effect of the threshold is hard to control. In this paper, we use Robust Principal Component Analysis (RPCA) to reduce the noise of gene expression data and then build the network. RPCA was first successfully applied to background modeling in video surveillance and to shadow removal from face images [20], and in recent years it has also been used for feature gene selection [21]. In the related literature [22], the gene expression data are assumed to lie in a low-dimensional subspace, so that the non-differentially expressed gene data can be regarded as a low-rank matrix. In this article, however, we introduce RPCA from another perspective, that of network construction: our purpose in removing noise is to build the network. With RPCA, the original matrix can be decomposed into a low-rank matrix and a sparse matrix. A small amount of noise is separated from the original matrix as the sparse matrix. On the one hand, the low-rank matrix keeps the integrity of the original matrix; on the other hand, the low-rank property provides a good basis for the similarity measure between nodes in the co-expression network.

In this paper, focusing on the integrated gene expression datasets of three cancers from The Cancer Genome Atlas (TCGA) [16, 23, 24], we build the network and conduct information mining. Before the network construction, a low-rank and sparse method, RPCA, is introduced to reduce noise and reconstruct the data matrix. As a result, the internal relationships of genes in the constructed networks tend to be orderly and the network structure is clearer, so that more common oncogenic factors and oncogenic modules can be discovered.

The rest of this paper is organized as follows: Section 2 introduces the related methods and procedure; Section 3 presents the experiments based on the integrated multi-cancers data; and the last section concludes the paper.

2. Methods

The two most important stages of the whole process are denoising and network construction. Denoising mainly draws on RPCA. Network construction is based on the Pearson correlation coefficient (PCC) measured between gene nodes.

2.1 RPCA

The matrix recovery problem of RPCA can be described as follows. Let $D$ be the $m \times n$ gene expression matrix in the experiments, where $m$ and $n$ represent the numbers of genes and samples. Assume that $D$ is low-rank (or approximately low-rank) up to a sparse perturbation, namely:

$$D_0 = A_0 + S_0, \tag{1}$$

where $A_0$ is a low-rank matrix and $S_0$ is a sparse perturbation matrix. RPCA was proposed by Candes et al. and can recover low-rank matrices from a highly corrupted observation matrix [20]. The elements of $S_0$ may have large magnitude, but its support set is sparse. The corresponding steps are shown in Fig. 1(a).

The nuclear norm of a matrix $A$ is $\|A\|_* := \sum_i \sigma_i(A)$, where $\sigma_i(A)$ denotes the $i$-th singular value of $A$. The $L_1$ norm of a matrix $S$ is $\|S\|_1 := \sum_{ij} |S_{ij}|$. Assuming that the observed data matrix is given by (1), RPCA solves the following optimization problem:

$$\text{minimize } \|A\|_* + \lambda \|S\|_1, \quad \text{subject to } D = A + S, \tag{2}$$

where $\lambda$ is a regularization parameter. To solve this problem, we use the algorithm for RPCA optimization problems given by Lin et al., the inexact Augmented Lagrange Multipliers (IALM) method [25], which eliminates the equality constraint by introducing a Lagrange multiplier. We thereby obtain the denoised gene expression matrix $A$.

2.2 Network Construction

SC R

The correlation measure of two genes are commonly measured by PCC, the PCC between two genes X and Y of A is defined as follows:  i ( X i  X )(Yi Y ) , n n  i1( X i  X )2  i (Yi Y )2

U

n

(3)

N

R

A

where X i and Yi are the observations representing the genes X and Y on the i-th are the means of the

M

patient. n is the number of patients. And X and Y

ED

observations of the genes X and Y in all patients. There are many ways for threshold selection of correlation coefficients. After determining the threshold, the

PT

adjacency matrix with m  m can be obtained from the correlation coefficient matrix

CC E

R . And we can further build the co-expression networks, as shown in Fig. 1(b).
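The construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `pearson` and `adjacency_from_expression` and the toy matrix are hypothetical, an edge is assumed to be kept when the absolute PCC reaches the threshold, and a gene with constant expression is assumed to be uncorrelated with every other gene.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (Eq. 3)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant gene is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def adjacency_from_expression(A, threshold):
    """Build a symmetric 0/1 adjacency matrix from the rows (genes) of A."""
    m = len(A)
    adj = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            # assumption: an edge is kept when |PCC| reaches the threshold
            if abs(pearson(A[i], A[j])) >= threshold:
                adj[i][j] = adj[j][i] = 1
    return adj

# Toy denoised matrix: 3 genes (rows) x 4 patients (columns).
A_toy = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with gene 0
    [4.0, 1.0, 3.0, 2.0],  # only weakly related to the others
]
adj_toy = adjacency_from_expression(A_toy, threshold=0.9)
```

On the toy matrix only genes 0 and 1 are linked. For the full 20,261-gene matrix a vectorized routine would be needed, since this double loop is quadratic in the number of genes.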


3. Experiments

This section describes the experiments on network mining. Subsection 3.1 describes the source of the cancer datasets. Subsection 3.2 covers the parameter selection for RPCA, and Subsection 3.3 the threshold selection. The comparative experiments and discussion are given in Subsection 3.4.

3.1 Datasets

The experimental datasets are integrated gene expression data of three cancers from the TCGA database (RNA-seq data at level 3): pancreatic cancer (PAAD), esophageal cancer (ESCA) and head and neck squamous cell carcinoma (HNSC). There are more than 20,502 genes in each dataset, and the number of patients differs between datasets. Specifically, HNSC contains 398 tumor patients, ESCA contains 183 tumor patients, and PAAD contains 176 tumor patients. After removing the genes with zero values in all patients, we integrated the datasets into a single matrix covering 757 patients and 20,261 genes for the correlation measure. The datasets were not normalized with any method [24].
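The integration step can be illustrated with a minimal sketch. It assumes that the three datasets list the same genes in the same order and that "removing zero values" means dropping the genes whose expression is zero in every patient; the function name `integrate_datasets` and the toy values are hypothetical. The patient counts in the text are consistent with a column-wise merge, since 398 + 183 + 176 = 757.

```python
def integrate_datasets(datasets):
    """Concatenate gene-by-patient matrices column-wise and drop all-zero genes.

    `datasets` is a list of matrices (lists of rows) that share the same
    gene order; each row is one gene and each column is one patient.
    Returns the merged matrix and the indices of the genes that were kept.
    """
    n_genes = len(datasets[0])
    if any(len(d) != n_genes for d in datasets):
        raise ValueError("all datasets must list the same genes")
    # Join the patient columns of every cancer type, gene by gene.
    merged = [sum((d[g] for d in datasets), []) for g in range(n_genes)]
    # Keep a gene only if it is non-zero in at least one patient.
    kept = [g for g, row in enumerate(merged) if any(v != 0 for v in row)]
    return [merged[g] for g in kept], kept

# Toy example: 3 genes, with 2, 1 and 1 patients in the three cancer types.
hnsc = [[1.0, 0.0], [0.0, 0.0], [3.0, 2.0]]
esca = [[0.0], [0.0], [1.0]]
paad = [[2.0], [0.0], [0.0]]
D_toy, kept_genes = integrate_datasets([hnsc, esca, paad])
```

In the toy example the second gene is zero everywhere and is dropped, leaving a 2 x 4 matrix.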


3.2 Parameter Selection for RPCA

M

The rank of the low-rank matrix will affect the measurement of the node's relations in

ED

the entire network. If the rank is too low, the inter-node will be more closely, and the networks will be sloppy. Of course, the node relationship is still based on the real and

PT

reliable matrix. Recovery A reasonably is the key process of network construction.

CC E

Here, we determine the matrix A by adjusting the parameter  . In order to select the appropriate parameter, we tested the effect of different

A

parameters by clustering samples of different cancers on the low-rank matrix (k-means). For this reason, we carry out two experiments, as are shown in Table 1 and Figure 2. The  of Table 1 is in the range of 10-3 ~103 . The clustering effects are assessed by contrasting clustered labels with true label. The accuracy (ACC) [26] as evaluation

9

Network Analysis based on Low-Rank Method

function:

 ACC 

n i 1

  si , map  r i  

(4)

n

where n is the total number of samples,   x, y  is a delta function that equals to 1 if

IP T

x  y , equals to 0 otherwise. map  r i  is mapping function that maps each cluster

label ri to original label si .
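Equation (4) can be made concrete with a small sketch. It assumes that the mapping is the one-to-one relabeling of clusters that maximizes the number of matches, and it finds that relabeling by exhaustive search over permutations, which is practical only for a handful of clusters such as the k = 3 used here; an assignment solver such as the Hungarian algorithm is the usual choice for larger k. The function name `clustering_accuracy` and the toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """ACC of Eq. (4), with map() chosen as the best one-to-one relabeling."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    n = len(true_labels)
    best = 0
    # Try every injective assignment of cluster ids to class labels.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for s, r in zip(true_labels, cluster_labels)
                   if mapping[r] == s)
        best = max(best, hits)
    return best / n

# Toy example: six samples from three cancers, one ESCA sample misplaced.
true = ["PAAD", "PAAD", "ESCA", "ESCA", "HNSC", "HNSC"]
pred = [2, 2, 0, 1, 1, 1]
acc = clustering_accuracy(true, pred)
```

Here the best relabeling matches five of the six samples, so `acc` equals 5/6.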

The clustering results indicate that RPCA has remarkable advantages and that its effect on data recovery and denoising is very impressive. In contrast, the effect of wavelet denoising is significantly weaker (here, the clustering accuracy reported for wavelet denoising is the best result selected from many wavelet denoising tests). RPCA has good explanatory power for data reconstruction, and it provides reliable data for the subsequent network construction. Both RPCA and the wavelet transform have been applied in the field of image processing, but their denoising effects on gene expression data are quite different. RPCA tends to act on the data as a whole, whereas the wavelet transform may focus on local processing, and gene expression data, as a two-dimensional matrix, are better served by a global treatment. In addition, wavelet denoising involves a very important factor, the selection of the threshold. For gene expression data, the choice of the wavelet threshold can cause the data to be excessively altered, resulting in a distorted reconstruction. The RPCA method denoises and reconstructs the data as a whole, on the principle of preserving most of the data and recovering only a small part of it, so as to ensure the reliability of the data. Therefore, gene expression data seem better suited to RPCA as a preprocessing step for denoising and reconstruction, and this is the reason we chose it as the noise reducer for the network.
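The global character of the RPCA step can be seen by writing out the iteration that an inexact ALM solver performs for problem (2). The following is a sketch of the commonly stated form of that iteration, not a transcription of the authors' implementation; the multiplier $Y$, the penalty $\mu_k$ and the growth factor $\rho > 1$ are auxiliary quantities that are not named in the text above.

```latex
% Augmented Lagrangian of problem (2)
L(A, S, Y, \mu) = \|A\|_* + \lambda \|S\|_1
                + \langle Y,\, D - A - S \rangle
                + \frac{\mu}{2} \|D - A - S\|_F^2

% One iteration k -> k+1
(U, \Sigma, V) = \operatorname{svd}\!\left(D - S_k + \mu_k^{-1} Y_k\right), \qquad
A_{k+1} = U \, \mathcal{S}_{1/\mu_k}(\Sigma) \, V^{T}

S_{k+1} = \mathcal{S}_{\lambda/\mu_k}\!\left(D - A_{k+1} + \mu_k^{-1} Y_k\right)

Y_{k+1} = Y_k + \mu_k \left(D - A_{k+1} - S_{k+1}\right), \qquad
\mu_{k+1} = \rho \, \mu_k

% Entrywise soft-thresholding operator
\mathcal{S}_{\tau}(x) = \operatorname{sign}(x) \, \max\left(|x| - \tau,\, 0\right)
```

The update of $A$ thresholds the singular values of the whole matrix, so every gene and every sample contributes to it, while $\lambda$ sets the level $\lambda/\mu_k$ below which an entry of the residual is kept out of $S$.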

To find better parameters for data denoising and recovery, we further look for the optimal parameter in the interval of 0.01, narrowing the range to around 0.001 for optimization, as shown in Fig. 2.

In Fig. 2, we select a more precise parameter, focusing on the best value in Table 1. Fig. 2 shows that at $\lambda = 0.003$ the clustering effect is good. Fig. 2 and Table 1 show that the clustering accuracy of the data ranges from 0.45 to above 0.8 under different parameters, which indicates that a reasonable selection of the parameter has a great influence on data recovery and reconstruction. In our experiment, the parameter is set to 0.003. In this case, the low-rank matrix $A$ is a better reconstruction of $D$: the original observation matrix $D$ does not lose much information, and its noise is removed. Therefore, we can safely construct a co-expression network based on $A$.

3.3 Threshold Selection

The correlation values in the PCC matrix are in the range of 0-1. The closer a value is to 1, the higher the correlation. Here we sort the values, perform curve fitting and select the first inflection point as the overall threshold (0.9342). In this way, we ensure that the threshold adapts to the network, so that the network retains the more relevant interactions.

3.4 Results and Discussion

After construction, the networks before and after denoising each retained about 200 nodes. Comparing the two, the network denoised by RPCA detects more pathways and more information than the network before denoising, as listed in Table 2 (the databases involved in our pathway analysis include KEGG and Reactome; original networks: 144; networks denoised by RPCA: 142; FDR: 0.1). Next, we took the pathways common to the networks before and after denoising for p-value analysis and comparison, and found that the p-values differ markedly between the two, as listed in Table 3. These results suggest that RPCA does have distinct merits in denoising and reconstructing gene expression data.

In the denoised networks, the five pathways with the smallest p-values are in sharp contrast with those of the pre-denoising networks, as listed in Table 3. Among the common pathways, those enriched in the denoised networks generally show smaller p-values, so the biologically enriched pathways to which the genes are mapped are more accurate. In addition, Antimicrobial peptides is a pathway shared across the multi-cancers.

We analyzed the connections of genes (nodes) in each module of the denoised networks. Compared with the traditional connectivity (degree), we find that the betweenness centrality of nodes identifies more useful genes. The betweenness of a node is the sum, over pairs of other nodes, of the proportion of shortest paths between the pair that pass through the node. It reflects the role the node plays in connecting other nodes in the network: the higher the betweenness, the more important the node is in maintaining the tight connectivity of the network. In the eight interconnected modules of the denoised network, we find the nodes with larger betweenness. Their annotations, taken from the gene annotation website GeneCards [27] (http://www.genecards.org/), are listed in Table 4.
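The contrast between degree and betweenness can be illustrated with a short sketch. The text does not say which tool was used to compute betweenness, so the code below is an assumption: it implements Brandes' algorithm for an unweighted, undirected graph and returns unnormalized values. The function name and the toy graph, two triangles joined through a single node called `bridge`, are hypothetical.

```python
from collections import deque

def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    `adj` maps each node to a list of its neighbours. The result is
    unnormalized and counts each unordered pair of endpoints once.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both of its endpoints.
    return {v: b / 2.0 for v, b in bc.items()}

# Toy graph: two triangles connected only through the node "bridge".
graph = {
    "a1": ["a2", "a3"], "a2": ["a1", "a3"], "a3": ["a1", "a2", "bridge"],
    "bridge": ["a3", "b1"],
    "b1": ["bridge", "b2", "b3"], "b2": ["b1", "b3"], "b3": ["b1", "b2"],
}
bc = betweenness_centrality(graph)
degree = {v: len(nbrs) for v, nbrs in graph.items()}
```

In the toy graph, `bridge` has only two neighbours, fewer than `a3` or `b1`, yet it lies on every shortest path between the two triangles and therefore has the largest betweenness. This is the kind of node that degree alone would rank low.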

Among them, BGN has repeatedly been found to be abnormally expressed in tumor development. BGN was found to be overexpressed in the extracellular matrix of PAAD samples when compared with normal pancreas or chronic pancreatitis tissues [28]; the mRNA level of BGN can distinguish the cancer specificity of bladder cancer patients [29]; and the up-regulation of BGN is associated with poor prognosis and PTEN deficiency in prostate cancer patients [30]. SPARC has long been linked to malignant tumors and is abnormally expressed in PAAD, breast cancer, lung cancer and other cancers [31-33]. S100A8 is also a malignant tumor-related gene [34] and has been found in gastric cancer and prostate cancer [35, 36]. AGR2 is abnormally expressed in both prostate cancer and breast cancer [37, 38]. SFRP2 inhibits the transformation and invasion of cervical cancer cells through the Wnt signaling pathway and has been found to be inactivated in gastric cancer [39, 40].

The roles of the above genes in the modules are displayed in Fig. 3. Each node in the figure represents a gene: the size of a node represents its degree, and the depth of its color indicates its betweenness. The five darkest nodes identified in the modules of the networks, which have been validated as associated with multi-cancers, are AGR2, S100A9, BGN, SPARC and SFRP2. As far as node mining is concerned, the betweenness has unique advantages. For example, in Module 1, although S100A9 has low connectivity, it connects two gene clusters, which makes it important to the connectivity of the entire network. Similarly, BGN and SPARC in Module 3 have the same effect.

We apply RPCA to eliminate noise while ensuring that the internal information of the data is not destroyed, so that the constructed networks show excellent mining properties compared with those built from the original data. Moreover, the function of the gene co-expression network is to cluster genes with the same function, and the low-rank property of RPCA facilitates the recognition of similarities between genes. The discovery of more gene-related features makes the entire network more robust and complete. Therefore, RPCA enhances the identification of relationships between nodes in the denoised networks, which makes them more conducive to mining helpful factors than the original data. The experiments have also shown that RPCA is indeed better than the wavelet transform at denoising and reconstructing gene expression data. For node evaluation, applying the betweenness centrality improves the recognition of some key nodes in the networks and finds some confirmed abnormal genes related to many cancers. Although no clinical study has clearly indicated a direct effect of OR10J5, OR6P1 and CELF3 on cancer, they can be regarded as candidate abnormal genes of multi-cancers for reference because of their fundamental properties in our networks. These are the genes that we have found to be abnormally expressed in all three types of cancers. Related materials show that a large part of these genes are also related to other cancers, which further demonstrates that these genes are not only associated with the three types of cancers in this article. This is important for clinical targeting in other cancers. By building a co-expression network from the integrated data of three cancers, we also found several disease-causing genes shared with other cancers, which shows that our method is indeed effective.


4. Conclusions

In this paper, we introduce a low-rank and sparse method, RPCA, to reduce noise and reconstruct integrated gene expression data of multi-cancers from TCGA and to obtain more reliable and meaningful cancer data. Building on the integrity of the data, the network was constructed from the PCC between genes in the reconstructed expression data. Finally, some suspicious information about cancers was extracted from the gene co-expression network. We also compared the networks before and after denoising and found that the pathways in the denoised and reconstructed networks were more accurate and abundant. In addition, through the betweenness centrality of nodes, we discovered some abnormally expressed genes associated with multi-cancers in the denoised networks, which supports the reliability of RPCA denoising.

This paper provides a new view of network denoising analysis, but some parts remain immature. For example, the model does not consider the effect of the perturbation matrix; moreover, TCGA data are multi-omics data, and many types of data could be integrated in the new network after denoising. These are all points we can improve in future work.


Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61872220, 61572284, 61702299, and 61701279.

References

[1] L.Y. Dai, C.M. Feng, J.X. Liu, C.H. Zheng, J. Yu, M.X. Hou, Complexity, 2017 (2017) 1-11.
[2] C.-M. Feng, Y.-L. Gao, J.-X. Liu, C.-H. Zheng, J. Yu.
[3] H. Han, BMC Medical Genomics, 7 (2014) S5.
[4] J.-X. Liu, D. Wang, Y.-L. Gao, C.-H. Zheng, J.-L. Shang, F. Liu, Y. Xu, Neurocomputing, 228 (2017) 263-269.
[5] C.-H. Zheng, L. Yuan, W. Sha, Z.-L. Sun, BMC Bioinformatics, 15 (2014) S3.
[6] Z. Chun-Hou, H. De-Shuang, Z. Lei, K. Xiang-Zhen, IEEE Transactions on Information Technology in Biomedicine, 13 (2009) 599-607.
[7] X. Ma, L. Yu, P. Wang, X. Yang, Comput Biol Chem, 69 (2017) 164-170.
[8] G. Wu, X. Feng, L. Stein, Genome Biology, 11 (2010) R53.
[9] H. Xie, J. Li, Z. Qiaosheng, Y. Wang, Comparison among dimensionality reduction techniques based on Random Projection for cancer classification, (2016).
[10] R.G.W. Verhaak, K.A. Hoadley, E. Purdom, V. Wang, Y. Qi, M.D. Wilkerson, C.R. Miller, L. Ding, T. Golub, J.P. Mesirov, G. Alexe, M. Lawrence, M. O'Kelly, P. Tamayo, B.A. Weir, S. Gabriel, W. Winckler, S. Gupta, L. Jakkula, H.S. Feiler, J.G. Hodgson, C.D. James, J.N. Sarkaria, C. Brennan, A. Kahn, P.T. Spellman, R.K. Wilson, T.P. Speed, J.W. Gray, M. Meyerson, G. Getz, C.M. Perou, D.N. Hayes, Cancer Cell, 17 (2010) 98-110.
[11] C.-H. Zheng, W. Yang, Y.-W. Chong, J.-F. Xia, Computers in Biology and Medicine, 72 (2016) 22-29.
[12] C. Yang, S.G. Ge, C.H. Zheng, Oncotarget, 8 (2017) 89021-89032.
[13] S.G. Ge, J.F. Xia, S. Wen, C.H. Zheng, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14 (2017) 1115-1121.
[14] M. Hofree, J.P. Shen, H. Carter, A. Gross, T. Ideker.
[15] P. Langfelder, S. Horvath, BMC Bioinformatics, 9 (2008) 559.
[16] Q. Zhang, J.E. Burdette, J.-P. Wang, BMC Systems Biology, 8 (2014) 1338.
[17] H. Kim, J. Watkinson, V. Varadan, D. Anastassiou, BMC Medical Genomics, 3 (2010) 51.
[18] G. Chen, Y. Lu, H. Yang, Computers & Applied Chemistry, (2011).
[19] Y.I. Bo, T. Wen, NAAU, Yantai, Shandong, Computer Engineering & Applications, 48 (2012) 146-149.
[20] E.J. Candes, X. Li, Y. Ma, J. Wright, Journal of the ACM, 58 (2009).
[21] J.X. Liu, Y.T. Wang, C.H. Zheng, W. Sha, J.X. Mi, Y. Xu, BMC Bioinformatics, 14 (2013) 1-10.
[22] C. Eckart, G. Young, Psychometrika, 1 (1936) 211-218.
[23] Y. Zhu, P. Qiu, Y. Ji, Nature Methods, 11 (2014) 599-600.
[24] H. Han, K. Men, Journal of Biomedical Informatics, 85 (2018) 80-92.
[25] Z. Lin, M. Chen, Y. Ma, Mathematics, (2010).
[26] W. Xu, X. Liu, Y. Gong, in: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 2003, pp. 267-273.
[27] M. Safran, I. Dalah, J. Alexander, N. Rosen, T. Stein, M. Shmoish, N. Nativ, I. Bahir, T. Doniger, H. Krug, A. Sirota-Madi, T. Olender, Y. Guan-Golan, G. Stelzer, A. Harel, D. Lancet, GeneCards Version 3: The human gene integrator, (2010).
[28] C.K. Weber, G. Sommer, P. Michl, H. Fensterer, M. Weimer, F. Gansauge, G. Leder, G. Adler, T.M. Gress, Gastroenterology, 121 (2001) 657-667.
[29] C. Niedworok, K. Röck, I. Kretschmer, T. Freudenberger, N. Nagy, T. Szarvas, D.F. Vom, H. Reis, H. Rübben, J.W. Fischer, Plos One, 8 (2013) e80084.
[30] F. Jacobsen, J. Kraft, C. Schroeder, C. Hube-Magg, M. Kluth, D.S. Lang, R. Simon, G. Sauter, J.R. Izbicki, T.S. Clauditz, Neoplasia, 19 (2017) 707-715.
[31] G. Watkins, A. Douglas-Jones, R. Bryce, R. E Mansel, W.G. Jiang, Prostaglandins, Leukotrienes and Essential Fatty Acids, 72 (2005) 267-272.
[32] Huang, Jing, Zhang, Yuan-Yuan, Zhao, Jiang, Cong, Hong-Yun, Zhao, Yang, Chinese Journal of Cancer, 31 (2012) 541-548.
[33] M. Sinn, B.V. Sinn, J.K. Striefler, J.L. Lindner, J.M. Stieler, P. Lohneis, S. Bischoff, H. Bläker, U. Pelzer, M. Bahra, Annals of Oncology, 25 (2014) 1025-1032.
[34] C. Gebhardt, J. Németh, P. Angel, J. Hess, Biochemical Pharmacology, 72 (2006) 1622-1631.
[35] H.Y. Yong, A. Moon, Archives of Pharmacal Research, 30 (2007) 75-81.
[36] A. Hermani, J. Hess, S.B. De, S. Medunjanin, R. Grobholz, L. Trojan, P. Angel, D. Mayer, Clinical Cancer Research, 11 (2005) 5146-5152.
[37] F.R. Fritzsche, E. Dahl, S. Pahl, M. Burkhardt, J. Luo, E. Mayordomo, T. Gansukh, A. Dankof, R. Knuechel, C. Denkert, Clinical Cancer Research An Official Journal of the American Association for Cancer Research, 12 (2006) 1728-1734.
[38] J.S. Zhang, A. Gong, J.C. Cheville, D.I. Smith, C.Y. Young, Genes Chromosomes & Cancer, 43 (2005) 249-259.
[39] M.T. Chung, H.C. Lai, H.K. Sytwu, M.D. Yan, Y.L. Shih, C.C. Chang, M.H. Yu, H.S. Liu, D.W. Chu, Y.W. Lin, Gynecologic Oncology, 112 (2009) 646-653.
[40] Y.Y. Cheng, J. Yu, Y.P. Wong, E.P.S. Man, K.F. To, V.X. Jin, J. Li, Q. Tao, J.J.Y. Sung, F.K.L. Chan, British Journal of Cancer, 97 (2007) 895-901.

Figure Captions

Fig. 1. Denoising, network construction and information mining. Denoising by RPCA is shown in Fig. 1(a), and the co-expression networks built from $A$ are shown in Fig. 1(b). On the co-expression networks, we can carry out node mining and pathway enrichment analysis.

Fig. 2. Fluctuations of $\lambda$ around the optimal value. Under the control of these parameters, the clustering effect generally goes from superior to inferior. When $\lambda = 0.003$, the clustering effect is better than with the other parameters.

Fig. 3. Modules with the confirmed cancer genes. According to betweenness, AGR2, S100A9, BGN, SPARC and SFRP2, which have been validated as associated with multi-cancers, are identified in the modules. These genes act as bridges in these modules.

Tables

Table 1. Comparison of RPCA with different parameters for cancer subtypes clustering (k=3, three cancers)

| Method | original | Wavelet Transform | RPCA, $\lambda$=0.001 | RPCA, $\lambda$=0.01 | RPCA, $\lambda$=0.1 | RPCA, $\lambda$=1 | RPCA, $\lambda$=10 | RPCA, $\lambda$=100 | RPCA, $\lambda$=1000 |
|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.51 | 0.55 | 0.57 | 0.82 | 0.50 | 0.45 | 0.45 | 0.45 | 0.45 |

Table 2. Comparison for pathways of networks before and after denoising

| Mode | Number of pathways |
|---|---|
| Network before Denoise | 8 |
| Network by RPCA | 18 |

Table 3. Comparison for p-values of pathways

| Gene sets | Network after RPCA: P-value | Network after RPCA: FDR | Network before denoise: P-value | Network before denoise: FDR |
|---|---|---|---|---|
| Pancreatic secretion(K) | 6.66E-16 | 9.33E-14 | 5.28E-10 | 3.43E-08 |
| Striated Muscle Contraction(R) | 1.27E-11 | 1.01E-09 | 4.66E-05 | 9.79E-04 |
| Protein digestion and absorption(K) | 1.44E-11 | 1.01E-09 | 4.94E-13 | 6.42E-11 |
| Antimicrobial peptides(R) | 2.40E-05 | 9.59E-04 | 0.0313 | 0.1878 |
| Fat digestion and absorption(K) | 1.95E-04 | 6.75E-03 | 0.0275 | 0.1878 |

Table 4. List of genes with higher Betweenness values

| Gene | GO |
|---|---|
| OR10J5, OR6P1 | OR10J5 (Olfactory Receptor Family 10 Subfamily J Member 5) and OR6P1 (Olfactory Receptor Family 6 Subfamily P Member 1) are Protein Coding genes. Among their related pathways are Signaling by GPCR and Olfactory Signaling Pathway. |
| BGN | Diseases associated with BGN include Spondyloepimetaphyseal Dysplasia, X-Linked and Meester-Loeys Syndrome. |
| SPARC | Diseases associated with SPARC include Osteogenesis Imperfecta, Type Xvii and Sparc-Related Osteogenesis Imperfecta. |
| S100A8 | Diseases associated with S100A8 include Cystic Fibrosis and Duodenal Ulcer. Among its related pathways are Activated TLR4 signalling and Innate Immune System. |
| AGR2 | Diseases associated with AGR2 include Pancreatic Ductal Adenocarcinoma. Among its related pathways are Tyrosine Kinases / Adaptors and Adhesion. |
| CELF3 | Among its related pathways are Preimplantation Embryo. |
| SFRP2 | Diseases associated with SFRP2 include Esophageal Basaloid Squamous Cell Carcinoma and Colorectal Cancer. Among its related pathways are Signaling by Wnt and Wnt Signaling Pathway and Pluripotency. |