Prioritization of candidate disease genes by combining topological similarity and semantic similarity

YJBIN 2386 No. of Pages 5, Model 5G 13 July 2015 Journal of Biomedical Informatics xxx (2015) xxx–xxx 1 Contents lists available at ScienceDirect ...

Download PDF

675KB Sizes 1 Downloads 46 Views

Report

PDF Reader
Full Text

YJBIN 2386

No. of Pages 5, Model 5G

13 July 2015 Journal of Biomedical Informatics xxx (2015) xxx–xxx 1

Contents lists available at ScienceDirect

Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin 5 6

4

Prioritization of candidate disease genes by combining topological similarity and semantic similarity

7

Bin Liu, Min Jin ⇑, Pan Zeng

8

College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China

3

10 9 11 1 2 3 5 14 15 16 17 18 19 20 21 22 23 24

a r t i c l e

i n f o

Article history: Received 26 December 2014 Revised 1 July 2015 Accepted 6 July 2015 Available online xxxx Keywords: Disease genes Random walk Topological similarity Semantic similarity

a b s t r a c t The identiﬁcation of gene–phenotype relationships is very important for the treatment of human diseases. Studies have shown that genes causing the same or similar phenotypes tend to interact with each other in a protein–protein interaction (PPI) network. Thus, many identiﬁcation methods based on the PPI network model have achieved good results. However, in the PPI network, some interactions between the proteins encoded by candidate gene and the proteins encoded by known disease genes are very weak. Therefore, some studies have combined the PPI network with other genomic information and reported good predictive performances. However, we believe that the results could be further improved. In this paper, we propose a new method that uses the semantic similarity between the candidate gene and known disease genes to set the initial probability vector of a random walk with a restart algorithm in a human PPI network. The effectiveness of our method was demonstrated by leave-one-out cross-validation, and the experimental results indicated that our method outperformed other methods. Additionally, our method can predict new causative genes of multifactor diseases, including Parkinson’s disease, breast cancer and obesity. The top predictions were good and consistent with the ﬁndings in the literature, which further illustrates the effectiveness of our method. Ó 2015 Published by Elsevier Inc.

26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

43 44

1. Introduction

45

Because many diseases, such as cancer, diabetes, and cardiovascular diseases, result from gene mutations, exploring the relationships between diseases and their causative genes has become an important topic in contemporary systems biology. These gene mutation-caused diseases are very common in developed countries and are becoming increasingly common in developing countries [1]. Linkage analyses and association studies have been proposed to identify disease genes [2–4]. However, the efforts of these methods result in genomic intervals of 0.5–10 cM that are composed of hundreds of genes [5,6]. Whether these genes are disease-causing requires further investigation. In recent years, with the rapid accumulation of different types of genomic data, many calculation methods for prioritizing disease genes have been proposed. One remarkable advantage of these calculation methods is the reduction in manpower and material resources. Concretely, most of these methods are based on similarities between the genomic data of known disease genes and the

46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62

⇑ Corresponding author. E-mail address: [email protected] (M. Jin).

genomic data of the candidate gene. The genomic data include sequence-based features [7,8], gene ontology (GO) annotation information [9,10], expression patterns [11–13], and protein interaction data [14,15]. In most cases, multiple sources of genomic data are combined to ﬁnd causal genes, e.g., the combinations of GO annotation information with protein interaction data [16], GO annotation information with sequence-based features [17], and metabolic pathway data with protein interaction data [18]. Investigation of the interactions between the proteins that are encoded by genes in the human PPI network has become one of the primary and most powerful approaches for elucidating the molecular mechanisms that underlie complex diseases [19–21]. Such exploration has often been performed by comparing the network topology similarities of the nodes in the PPI network. There are many methods for measuring topological similarity, including calculating the number of common neighbors between two network nodes and calculating the distance between two network nodes. Due to incomplete data about the PPI network, some interactions between the proteins encoded by candidate gene and the proteins encoded by known disease genes are very weak. Thus, some candidate genes cannot be well identiﬁed. To achieve better prediction results, some studies have combined the PPI network with phenotype similarity information and reported good performances. However, we believe that the results could be further

http://dx.doi.org/10.1016/j.jbi.2015.07.005 1532-0464/Ó 2015 Published by Elsevier Inc.

Please cite this article in press as: B. Liu et al., Prioritization of candidate disease genes by combining topological similarity and semantic similarity, J Biomed Inform (2015), http://dx.doi.org/10.1016/j.jbi.2015.07.005

63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86

YJBIN 2386

No. of Pages 5, Model 5G

13 July 2015 2

B. Liu et al. / Journal of Biomedical Informatics xxx (2015) xxx–xxx

110

improved. In biological data resources, large amounts of data describe the molecular function of genes or the biological processes in which the genes are involved. These data form the semantic information of a gene. If a candidate gene and a disease gene share a high level of semantic similarity, we can compensate for the weak interaction between the genes in the PPI network by adding the semantic similarity. Some studies have shown that GO annotation information, which is used to predict disease genes [22], is a very effective semantic resource. Based on these two types of data sources, i.e., protein interaction data and GO annotation information, this paper proposes a new method for inferring gene–phenotype relationships. We use the semantic similarity value between the candidate gene and known disease genes to set the initial probability vector of the random walk with restart (RWR) algorithm and apply this algorithm to the PPI network. When the ﬁnal walk reaches a stable state, we predict new disease genes according to the candidate genes’ rankings in the vector. We used leave-one-out cross-validation to demonstrate the effectiveness of our method. Compared with other methods, our method achieved better performance. Additionally, new causative genes of multifactor diseases, including Parkinson’s disease, breast cancer, and obesity, are predicted with our method. The top predictions were good and consistent with the reports in the literature, which further illustrates the validity of our method.

111

2. Materials and methods

87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109

112

2.1. Data source

127

2.2. Summaries of the RWR and HRSS algorithms

128

135

The RWR is a sorting algorithm [14] that simulates a random walker that either starts from a seed node, or from a set of seed nodes, and moves to its direct neighbors randomly at each step. Finally, based on the probability of the random walker reaching a speciﬁc node, we ranked all of the nodes in the graph. We used P 0 to represent the initial probability vector, and Ps is a vector that represents the probability of the random walker reaching all nodes on the graph at step s. The probability vector at step s + 1 is given by

138

Psþ1 ¼ ð1 dÞMP s þ dP0

115 116 117 118 119 120 121 122 123 124 125

129 130 131 132 133 134

136

144 145

lute value of the difference between P s and P sþ1 falls below 106 .

148

jPsþ1 Ps j < 106

140 141 142 143

146

ð2Þ

150 151

152 154

The probability of the occurrence of the term c in a speciﬁc corpus is represented by component pðcÞ. The IC-based distance between the two terms u and v is deﬁned in Eq. (4), where v is a descendant of u.

155

distIC ðu; v Þ ¼ ICðv Þ ICðuÞ ¼ log pðuÞ log pðv Þ

156 157 158

159

ð4Þ

161

Then, the IC-based speciﬁcity of the most informative common ancestor (MICA) of any two terms termi and termj is

162

aIC ¼ distIC ðroot; MICAÞ ¼ log pðMICAÞ

163

164

ð5Þ

166

The distIC between a term and the most informative leaf nodes (MIL) descending from the term refers to the generality of a term. Component b represents the average of the generality values of termi and termj .

167

distIC ðtermi ; MILi Þ þ distIC ðtermj ; MILj Þ bIC ¼ 2

ð6Þ

¼ maxðHRSSðgoi ; goj ÞÞ goi 2tg1 goj 2tg2

170

171

173 174 175

176

179

ð8Þ

Let g1 and g2 be two genes of interest and tg1 and tg2 the sets of all of the GO terms assigned to gene g1 and g2, respectively.

HRSSGO MAX ðg1; g2Þ

169

178

where c is deﬁned as follows:

c ¼ distðMICA;termi Þ þ distðMICA;termj Þ

168

ð7Þ

180 182 183 184

185

ð9Þ

ð1Þ

The row-normalized adjacency matrix of the graph is represented by parameter M. The parameter d 2 ð0; 1Þ is the restart probability. At each step, the random walker can return to the seed nodes with probability d. After some steps, the vector P sþ1 will reach a steady state. This steady state is obtained by performing the iteration until the abso-

139

149

ð3Þ

1 aIC HRSSðtermi ; termj Þ ¼ 1 þ c aIC þ bIC

126

114

ICðcÞ ¼ log pðcÞ

The most informative leaf nodes of termi and termj are represented by MILi and MILj , respectively.

Gene ontology data (released in October 2013) and a human gene annotation dataset (released in October 2013) were from the Gene Ontology database [23]. The GO consists of three structured ontologies, i.e., biological process (BP), molecular function (MF), and cellular component (CC). The GO data contains 25,571 BP, 9661 MF, and 3386 CC terms. The gene annotation dataset contained 383,316 annotations of 18,911 genes. In this paper, PPI data were downloaded from the Human Protein Reference Database (HPRD). All of the information in the HPRD has been manually extracted from the literature by expert biologists who have read, interpreted, and analyzed the published data. Disease–gene association data were obtained from the Online Mendelian Inheritance in Man (OMIM) database [24].

113

This paper used the HRSS algorithm (Fig. 1) that was developed by Wu [25] to measure the semantic similarity. The information content (IC) is deﬁned by Eq. (3),

Fig. 1. A schematic illustration of the HRSS algorithm.

Please cite this article in press as: B. Liu et al., Prioritization of candidate disease genes by combining topological similarity and semantic similarity, J Biomed Inform (2015), http://dx.doi.org/10.1016/j.jbi.2015.07.005

187

YJBIN 2386

No. of Pages 5, Model 5G

13 July 2015 3

B. Liu et al. / Journal of Biomedical Informatics xxx (2015) xxx–xxx 188 189

2.3. Prediction algorithm based on topological similarity and semantic similarity

221

Different information contents of the genes’ GO terms and the positional relationship between the genes’ GO terms in the directed acyclic graph were used to measure the semantic similarity between two genes. A GO term describes gene information that includes molecular functions, biological processes, and cellular components. Therefore, the greater the semantic similarity shared by genes the more functional similarity they share. There is a relationship between two connected nodes in the PPI network. This type of relationship is reﬂected by factors related to chemical biology, signal transduction, genetic network, metabolism etc. Therefore, nodes that share topological similarity share synergy and similarity in the above aspects. Moreover, genes that share semantic similarity are similar to each other in terms of molecular functions, biological processes, and cellular components. Therefore, both the topological similarity and the semantic similarity reﬂect functional information and complement each other. Researchers have found that similar phenotypes are essentially caused by functionally related genes [26]. Based on this theory, we used the maximum value and the average value of the semantic similarity between the candidate gene and known disease genes to set the initial probability vector for the RWR. The initial probability vector of a disease was constructed in such a manner that we assigned the position value of the causative gene of the as 1. The position value of the candidate gene was between 0 and 1, and it was according to the semantic similarity value between the candidate gene and the causative gene. Next, we used the RWR algorithm in the PPI network to measure the topological similarity. In our method, the location value of the candidate gene in the initial probability vector P0 for a speciﬁc disease can be calculated with formula (10) or formula (11). The HRSS algorithm was used to measure the semantic similarity between the candidate gene and known disease genes.

224

MHRSSðgÞ ¼ maxðHRSSGO MAX ðg; g i ÞÞ

190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220

222

g i 2G

225 227

ð10Þ

Pn AHRSSðgÞ ¼

GO i¼1g i 2G; HRSSMAX ðg; g i Þ

n

ð11Þ

233

The component g is a candidate gene, and G is known disease genes set of the disease. Now that there are two ways to set the initial probability vector P 0 , our method has two according branches. The ﬁrst is the combination of the RWR and MHRSS (RWRMHRSS), and the second is the combination of the RWR and AHRSS (RWRAHRSS).

234

3. Results

235

In this section, we ﬁrst compare the performance of our method with other methods. Next, we explore the inﬂuence of d on our method. Finally, we use our algorithm to predict new causative genes for some multifactor diseases. The set of test genes was constructed in two ways. The ﬁrst was via the artiﬁcial linkage interval (ALI) approach that assumes that one of the real disease-causing genes does not interact with the disease and constructs the test genes’ set with this disease-causing gene and its nearest 99 genes in the chromosome. The second was via a validation against random genes (Rand) approach. With the same assumption mentioned for the ALI approach, the Rand approach selects one of the real disease-causing genes and 99 control genes randomly from all genes in the interactome as the test gene set. The performance of our method for recovering disease–gene relationships was examined with leave-one-out cross-validation. In each round of validation, we removed a disease–gene link. The remaining

228 229 230 231 232

236 237 238 239 240 241 242 243 244 245 246 247 248 249 250

causative genes for the disease were used as the seed nodes. Receiver operating characteristic (ROC) curves were used to compare the performance of our algorithm with those of other methods. ROC curve plot the sensitivity versus 1 – the speciﬁcity based on a threshold that separates the prediction classes [11]. The area under ROC curve (AUC) indicates the performance of the algorithm.

251

3.1. Comparison with other methods

258

To prove the validity of our method, we compared the performance of our method with those of two other methods, i.e., the RWR method [14] and the DP_LCC method [27]. The three methods used the same dataset (i.e., the HPRD and OMIM databases) and the same test method (leave-one-out cross-validation). In total, there were 34,364 interactions among the 8919 genes in the gene network. The phenotype similarity network included phenotype similarities for 5080 diseases. There were 1428 gene–phenotype links between 937 genes and 1216 phenotype entities. As we needed to use the semantic information of the seed genes, only the phenotypes associated with at least two disease genes were considered in our experiment. A total of 168 phenotypes associated with 470 disease genes were obtained. The largest AUC for the DP_LCC algorithm was found when the restart probability was set to 0.7, and the weight was set to 0.5 [27]. Therefore, we set d ¼ 0:7. Table 1 shows the AUC values of the four methods. These values indicated that the performance of our method was better than those of the RWR and DP_LCC methods. The table also shows a comparison between our two methods, i.e., RWRMHRSS and RWRAHRSS, and indicates that the latter performed better than the former.

259

3.2. Effects of parameters

280

There is one free parameter in our method, namely, the restart probability d in the RWR algorithm. We tested our algorithm with different values of d (from 0 to 1 in steps of 0.2) and found that the best performance was obtained with the restart parameter d ¼ 0:3. With the different ds; the difference between the best AUC value and the worst was less than 0.02. Therefore, d did not have a substantial inﬂuence on the results. Table 2 shows our experimental results. The ROC curves for our methods at d ¼ 0:3 are shown in Fig. 2. Fig. 2 shows that our RWRAHRSS method achieved slightly better results than the RWRMHRSS method in the two test gene sets. The results from Rand were always better than those from ALI because the test gene set constructed by ALI was not always included in the gene network. Together, these results demonstrate the following: (1) our algorithm was robust in the selection of the test gene set, and (2) increases in the numbers of known disease genes result in better reﬂections of all aspects of the disease.

281

3.3. Predicting new disease genes for some common diseases

297

Parkinson’s disease (PD), also known as quiver paralytic disease, is one of the most common neurodegenerative diseases. The

298

Table 1 Comparison of the RWR and DP_LCC methods. Method

ALI

Rand

RWR DP_LCC RWRAHRSS RWRMHRSS

0.8084 0.8255 0.8615 0.8455

0.8228 0.8387 0.8846 0.8701

Please cite this article in press as: B. Liu et al., Prioritization of candidate disease genes by combining topological similarity and semantic similarity, J Biomed Inform (2015), http://dx.doi.org/10.1016/j.jbi.2015.07.005

252 253 254 255 256 257

260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279

282 283 284 285 286 287 288 289 290 291 292 293 294 295 296

299

YJBIN 2386

No. of Pages 5, Model 5G

13 July 2015 4

B. Liu et al. / Journal of Biomedical Informatics xxx (2015) xxx–xxx

Table 2 AUC values of our method at different values of d. Method

0.1

0.3

0.5

0.7

0.9

RWRAHRSS ALI Rand

0.8722 0.8929

0.8737 0.8937

0.8689 0.8913

0.8615 0.8846

0.8531 0.8768

RWRMHRSS ALI Rand

0.8587 0.8780

0.8560 0.8795

0.8517 0.8738

0.8455 0.8701

0.8385 0.8616

Fig. 2. ROC curves of our methods at d = 0.3.

300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335

clinical expression of PD is characterized by sport bradycardia, muscle rigidity, resting tremor, and gait abnormalities. In the OMIM database, there are 4 genes associated with this disease that include GBA, ADH1C, TBP, and MAPT. We used our method (RWRAHRSS) to predict new disease genes for PD (MIM: 168600). The top 10 predictions for this disease were BAG5, TIMM9, TIMM10, S100A5, PSPH, ADIPOQ, ADH5, PCSK9, CDH13, and DNM1L. The ﬁrst prediction for PD was BAG5, which inhibits parkin. Loss-of-function mutations in the parkin gene, which encodes an E3 ubiquitin ligase, are the major causes of early-onset Parkinson’s disease (PD) [28]. BAG5 plays a potential role in the promotion of neurodegeneration in sporadic PD [29]. The last prediction was DNM1L, which is also known as DLP1, DRP1, and EMPF. According to the report, the mutations of this gene are associated with several neurological disorders [30]. DNM1L is a key protein in mitochondrial dynamics. Decreases in this protein may interfere with neuron survival in PD [31]. Mitochondrial dysfunction represents an important event in the pathogenesis of PD [32]. Moreover, DNM1L is one of the key regulators of mitochondrial ﬁssion [33]. Breast cancer is a common malignancy that often occurs in females. The clinical expression of breast cancer may include a lump in the breast, a change in the breast shape, dimpling of the skin, ﬂuid coming from the nipple, or a red scaly patch of skin. According to the OMIM record, breast cancer (MIM: 114480) has the following 23 validated causative genes: RAD54L, CASP8, BARD1, PIK3CA, HMMR, NQO2, ESR1, RB1CC1, SLC22A1L, TSG101, ATM, KRAS, BRCA2, XRCC3, AKT1, RAD51A, PALB2, CDH1, TP53, PHB, PPM1D, BRIP1, and CHEK2. When we applied our method (RWRAHRSS), the top 10 predictions for breast cancer were PINK1, MSH5, TOP3A, RAD54L, DMC1, MLH3, PCCB, NMNAT1, BLM, and CHEK2. The fourth prediction for breast cancer was RAD54L. However, RAD54L has been proven to be the cause of breast cancer in the latest OMIM database. This ﬁnding fully demonstrates the effectiveness of our approach. BLM was another prediction and is also a suspected causative gene for breast cancer

[34,35]. BLM plays key roles in homologous recombination repair and DNA replication. Some mutations in the BLM gene cause Bloom’s syndrome, which leads to multiple cancers including breast cancer [36]. The last prediction for breast cancer was CHEK2. However, CHEK2 has also been proven to be the cause of breast cancer in the latest OMIM database. This further illustrates the effectiveness of our approach. Obesity is a common, ancient metabolic syndrome. When the body absorbs more calories than it consumes, the excess calories are stored as fat. When fat exceeds the normal physiological requirements, it reaches a certain value and results in obesity. In the OMIM record, obesity (MIM: 601665) has the following 16 validated causative genes: NR0B2, SDC3, POMC, GHRL, PPARG, UCP1, CART, ADRB2, PPARGC1B, SIM1, ENPP1, ADRB3, UCP3, AGRP, PYY, and MC4R. The top-10 predictions of our algorithm (RWRAHRSS) for obesity were ADIPOQ, UCN, TYMS, NDP, CRH, GHRL, ADIPOR2, CRHBP, NMUR2, and GIP. The ﬁrst prediction was ADIPOQ, which plays an important role in regulating glucose levels and fatty acid oxidation [37]. The literature indicates that ADIPOQ is closely related to obesity [38,39]. Adiponectin is the protein that is encoded by the ADIPOQ gene. It has been reported that low circulating levels of adiponectin and single nucleotide polymorphisms of ADIPOQ are associated with central obesity [40]. Another prediction was CRH, which suppresses food intake and may be associated with obesity [41]. Some studies have found that elevated placental CRH is associated with childhood outcomes, such as increased adiponectin levels and increased central adiposity [42]. Moreover, we have described the relationship between adiponectin and central obesity above.

336

4. Conclusion

365

In the present paper, a method combining the topological similarity in the PPI network and gene semantic similarity was proposed for the prioritizing of candidate disease genes. The novelty of our method lies in the incorporation of the gene semantic similarity value to set the initial probability vector of the RWR. The results of leave-one-out cross-validation on a benchmark dataset revealed that the RWRAHRSS achieved a greater precision (as measured by the AUC) than the RWRMHRSS. This result indicates that the use of greater numbers of known disease genes results in better reﬂections of all aspects of the disease. Our method also outperformed the RWR and DP_LCC methods. Furthermore, we predicted new causative genes for some common diseases, including Parkinson’s disease, breast cancer, and obesity, with our algorithm and found that a portion of the predictions were corroborated by recent experimental reports. These results all prove the effectiveness of our approach. In our experiment, we found that the top (e.g., 20) identiﬁed genes varied widely between these methods. We have explored this phenomenon only from the perspective of information science. The biological principle of this phenomenon remains unclear. However, our team is applying the new method to the prediction of disease genes, and this phenomenon will be examined and analyzed based on additional experimental results, which is also a part of our future research. The PPI network suffers from both high false positive and false negative rates. As the PPI network becomes more perfect and GO annotation information becomes more abundant, the prediction results of our method will be better. Additionally, we can integrate some other genomic information to further improve our method, such as gene expression data and pathway membership data. Moreover, many studies have suggested that a PPI network can also be used to identify disease-related sub-networks. Thus, further attention should be given to the modularity of disease.

366

Please cite this article in press as: B. Liu et al., Prioritization of candidate disease genes by combining topological similarity and semantic similarity, J Biomed Inform (2015), http://dx.doi.org/10.1016/j.jbi.2015.07.005

337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364

367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398

YJBIN 2386

No. of Pages 5, Model 5G

13 July 2015 B. Liu et al. / Journal of Biomedical Informatics xxx (2015) xxx–xxx 399

Authors’ contributions

400

404

BL conceived the algorithm, performed the programming and drafted the manuscript. MJ participated in the algorithm design and analysis of the data and modiﬁed the manuscript. PZ also modiﬁed the manuscript. All authors have read and approved the ﬁnal manuscript.

405

Conﬂict of interests

401 402 403

406

The authors declare that they have no conﬂict of interests.

407

Acknowledgements

408

411

We would like to thank Dr. Wu from Hangzhou Normal University for providing the HRSS program. This work is supported by the National Natural Science Foundation of China (Grant No. 61374172).

412

Appendix A. Supplementary materials

413 414

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.jbi.2015.07.005.

415

References

416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454

[1] G. Gibson, Decanalization and the origin of complex disease, Nat. Rev. Genet. 10 (2) (2009) 134–140. [2] D. Altshuler, M.J. Daly, E.S. Lander, Genetic mapping in human disease, Science 322 (5903) (2008) 881–888. [3] H.J. Cordell, D.G. Clayton, Genetic association studies, Lancet 366 (9491) (2005) 1121–1131. [4] M.D. Teare, J.H. Barrett, Genetic linkage studies, Lancet 366 (9490) (2005) 1036–1044. [5] W. Anne, H. van Rensburg, J. Adams, H. Ector, F. Van de Werf, H. Heidbuchel, Ablation of post-surgical intra-atrial reentrant tachycardia. Predilection target sites and mapping approach, Eur. Heart J. 23 (20) (2002) 1609–1616. [6] D. Botstein, N. Risch, Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease, Nat. Genet. 33 (Suppl) (2003) 228–237. [7] E.A. Adie, R.R. Adams, K.L. Evans, et al., Speeding disease gene discovery by sequence based candidate prioritization, BMC Bioinformatics 6 (2005) 55. [8] F.S. Turner, D.R. Clutterbuck, Semple CAM: POCUS: mining genomic sequence annotation to predict disease genes, Genome Biol. 4 (11) (2003). [9] J. Freudenberg, P. Propping, A similarity-based method for genome-wide prediction of disease-relevant human genes, Bioinformatics 18 (Suppl) (2002) S110–S115. [10] C. Perez-Iratxeta, P. Bork, M.A. Andrade, Association of genes to genetically inherited diseases using data mining, Nat. Genet. 31 (3) (2002) 316–319. [11] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L.C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, et al., Gene prioritization through genomic data fusion, Nat. Biotechnol. 24 (5) (2006) 537–544. [12] L. Franke, L. Fokkens, H. van Bakel, E.D. de Jong, et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes, Am. J. Hum. Genet. 78 (6) (2006) 1011–1025. [13] M.A. van Driel, K. Cuelenaere, P.P. Kemmeren, J.A. Leunissen, H.G. Brunner, G. Vriend, GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases, Nucleic Acids Res. 33 (Web Server) (2005) W758–W761. [14] S. Köhler, S. Bauer, D. Horn, P.N. Robinsoh, Walking the interactome for prioritization of candidate disease genes, Am. J. Hum. Genet. 82 (4) (2008) 949–958. [15] J. Xu, Y. Li, Discovering disease–genes by topological features in human protein–protein interaction network, Bioinformatics 22 (22) (2006) 2800– 2805.

409 410

5

[16] C. Ortutay, M. Vihinen, Identiﬁcation of candidate disease genes by integrating Gene Ontologies and protein-interaction networks: case study of primary immunodeﬁciencies, Nucleic Acids Res. 37 (2) (2009) 622–628. [17] P.C. Iratxeta, M. Wjst, P. Bork, AndreadMA: G2D: a tool for Mining Genes associated with Disease, BMC Genet. 6 (3) (2005) 45–53. [18] Lian Chen, Yan Zhao, LiangCai Zheng, et al., CTFMining: a method to predict candidate disease genes based on the combined network topological features mining, in: The 3rd International Conference on Bioinformatics and Biomedical Engineering, Beijing, 11–13 June 2009. [19] T. Ideker, R. Sharan, Protein networks in disease, Genome Res. 18 (4) (2008) 644–652. [20] M. Oti, B. Snel, M.A. Huynen, H.G. Brunner, Predicting desease genes using protein-protein interactions, J. Med. Genet. 43 (2006) 691–698. [21] I. Feldman, A. Rzhetsky, D. Vitkup, Network properties of genes harboring inherited disease mutations, Proc. Natl. Acad. Sci. U.S.A. 105 (11) (2008) 4323– 4328. [22] L. Franke, H. Van Bakel, L. Fokkens, E.D. de jong, et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes, Am. J. Hum. Genet. 78 (6) (2006) 1011–1025. [23] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, et al., Gene ontology: tool for the uniﬁcation of biology, Nat. Genet. 25 (1) (2000) 25–29. [24] A. Hamosh, A.F. Scott, J.S. Amberger, C.A. Bocchini, V.A. McKusick, Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders, Nucleic Acids Res. 33 (Database) (2005) 514–517. [25] X. Wu, E. Pang, K. Lin, Z.M. Pei, Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method, PLoS ONE 8 (5) (2013) e66745. [26] J.G. Sanchez, C. Barton, V. David, Human disease genes, Nature 409 (15) (2001) 853–855. [27] Jie Zhu, Yufang Qin et al., Prioritization of candidate disease genes by topological similarity between disease and protein diffusion proﬁles, in: RECOMB-seq: Third Annual Recomb Satellite Workshop on Massively Parallel Sequencing, Beijing, China, 11–12 April 2013. [28] K.K. Chung, T.M. Dawson, Parkin and Hsp70 sacked by BAG5 [abstract], Neuron (2004). [29] S. Lee, P.D. Smith, L. Liu, et al., BAG5 inhibits parkin and enhances dopaminergic neuron degeneration, Neuron 44 (6) (2004) 931–945. [30] http://www.ncbi.nlm.nih.gov/gene/10059. [31] J.G.1. Hoekstra, T.J.2. Cook, T.1. Stewart, et al., Astrocytic dynamin-like protein 1 regulated neuronal protection against excitotoxicity in Parkinson disease, Am. J. Pathol. 185 (2) (2015) 536–549. [32] X. Wang, T.G. Petrie, Y. Liu, et al., Parkinson’s disease-associated DJ-1 mutations impair mitochondrial dynamics and cause mitochondrial dysfunction, J. Neurochem. 121 (5) (2012) 830–839. [33] J. Grohm, S.W. Kim, U. Mamrak, et al., Inhibition of Drp1 provides neuroprotection in vitro and in vivo, Cell Death Differ. 19 (9) (2012) 1446– 1458. [34] A. Sassi, M. Popielarski, E. Synowiec, et al., BLM and RAD51 genes polymorphism and susceptibility to breast cancer, Pathol. Oncol. Res. 19 (3) (2013) 451–459. [35] A.P.1. Sokolenko, A.G. Iyevleva, E.V. Preobrazhenskaya, N.V. Mitiushkina, S.N. Abysheva, E.N. Suspitsin, E.Sh. Kuligina, et al., High prevalence and breast cancer predisposing role of the BLM c.1642 C > T (Q548X) mutation in Russia, Int. J. Cancer 130 (12) (2012) 2867–2873. [36] A.P. Sokolenko, A.G. Iyevleva, E.V. Preobrazhenskaya, N.V. Mitiushkina, S.N. Abysheva, E.N. Suspitsin, E.Sh. Kuligina, et al., Transcriptomic and protein expression analysis reveals clinicopathological signiﬁcance of Bloom’s syndrome helicase (BLM) in breast cancer [abstract], Mol. Cancer Ther. (2015). [37] J. Wu, Z. Liu, K. Meng, L. Zhang, Association of adiponectin gene (ADIPOQ) rs2241766 polymorphism with obesity in adults: a meta-analysis, PLoS ONE 4 (2014) e95. [38] S. Bag, A. Anbarasu, Revealing the strong functional association of adipor2 and cdh13 with adipoq: a gene network study [abstract], Cell Biochem. Biophys. (2014). [39] J.F. Lu, Y. Zhou, G.H. Huang, H.X. Jiang, et al., Association of ADIPOQ polymorphisms with obesity risk: a meta-analysis [abstract], Hum. Immunol. (2014). [40] K. Ramya, K.A. Ayyappa, S. Ghosh, V. Mohan, V. Radha, Genetic association of ADIPOQ gene variants with type 2 diabetes, obesity and serum adiponectin levels in south Indian population, Gene 532 (2) (2013) 253–262. [41] L. Sominsky, S.J. Spencer, Eating behavior and stress: a pathway to obesity, Front. Psychol. 5 (2014) 434. [42] S.A. Stout, E.V. Espel, C.A. Sandman, L.M. Glynn, E.P. Davis, Fetal programming of children’s obesity risk, Psychoneuroendocrinology 53 (2015) 29–39.

455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529

Please cite this article in press as: B. Liu et al., Prioritization of candidate disease genes by combining topological similarity and semantic similarity, J Biomed Inform (2015), http://dx.doi.org/10.1016/j.jbi.2015.07.005

Prioritization of candidate disease genes by combining topological similarity and semantic similarity

Prioritization of candidate disease genes by combining topological similarity and semantic similarity

Recommend Documents