Expert Systems with Applications 35 (2008) 1115–1121
www.elsevier.com/locate/eswa
Similar genes discovery system (SGDS): Application for predicting possible pathways by using GO semantic similarity measure
Jung-Hsien Chiang *, Shing-Hua Ho, Wen-Hung Wang
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC
Abstract
This research analyzes gene relationships according to their annotations. We present a similar genes discovery system (SGDS), based upon a semantic similarity measure over gene ontology (GO) and Entrez gene, to identify groups of similar genes. To validate the proposed measure, we analyze the relationship between semantic similarity and expression correlation for pairs of genes. We explore a number of semantic similarity measures and compute the Pearson correlation coefficient for each. Highly correlated genes exhibit strong similarity in the ontology taxonomies. The results show that our proposed semantic similarity measure outperforms the others and seems better suited for use in GO. We use the MAPK homogenous genes group and the MAP kinase pathway as benchmarks to tune the parameters of our system for higher accuracy. We applied the SGDS to the RON and Lutheran pathways; the results show that it is able to identify a group of similar genes and to predict novel pathways based on a group of candidate genes.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Gene ontology; Entrez gene; Semantic similarity measure; Gene quantification; Pathway
* Corresponding author. Tel.: +886 6 2757575x62534; fax: +886 6 2747076. E-mail address: [email protected] (J.-H. Chiang).
0957-4174/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2007.08.007

1. Introduction
Biological information is available not only in the form of sequences but also in the form of annotations. Most of the time, an annotation is written in a natural-language form that is suitable for human reading but not particularly useful for machine processing (Kretschmann, Fleischmann, & Apweiler, 2001). The goal of our research is to examine the gene ontology (GO) hierarchy and devise a semantic similarity measure for quantifying annotations between gene pairs. GO is a controlled vocabulary used to describe the molecular function, biological process and cellular component of genes and gene products in a generic cell. GO terms and their relationships are represented in the form of directed acyclic graphs (DAGs).

The traditional method for measuring similarity between a pair of terms is to calculate the path distance between the two nodes associated with these terms. Edges are weighted according to their depth in GO. This approach assumes that nodes and links in an ontology are uniformly distributed, which is not accurate in GO. An alternative method is based on the amount of information two terms share (Budanitsky & Hirst, 2001; Couto, Silva, & Coutinho, 2003; Li, Bandar, & McLean, 2003; McHale, 1998). The more information two terms share, the more similar they are. The shared information is represented by the information content of the terms that subsume them in the DAG. The information content is derived from the frequency with which each term, or any of its children, occurs within a corpus. Less frequently occurring terms are more informative. This measurement has been shown to be significantly correlated with sequence similarity (Lord et al., 2003).
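As a minimal sketch of the information-content idea described above, the following Python fragment counts how often each term, or any of its descendants, occurs in an annotation corpus and converts the resulting probability into information content. The DAG, the term names and the corpus are hypothetical toy data, and the formula IC(t) = -log p(t) is the usual Resnik-style formulation, assumed here rather than taken from this paper.

```python
import math
from collections import Counter

# Hypothetical toy DAG: child -> list of parents (these are not real GO terms).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "kinase": ["catalysis"],
    "protein_kinase": ["kinase", "binding"],
}

def ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(annotations):
    """IC(t) = -log p(t), where p(t) counts t together with its descendants."""
    counts = Counter()
    for term in annotations:
        # An occurrence of a term also counts as an occurrence of each ancestor.
        for anc in ancestors(term):
            counts[anc] += 1
    total = len(annotations)
    return {t: -math.log(n / total) for t, n in counts.items()}

# Hypothetical annotation corpus: one entry per annotated occurrence.
corpus = ["protein_kinase", "protein_kinase", "kinase", "binding"]
ic = information_content(corpus)
```

In this toy corpus the root has an information content of zero because every annotation falls under it, while the rarest term, protein_kinase, is the most informative.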
J.-H. Chiang et al. / Expert Systems with Applications 35 (2008) 1115–1121
Various measures have been developed for quantifying semantic similarity in terms of an ontology (Zhong, Zhu, Li, & Yu, 2002). A first logical approximation to the computation of similarity simply considers the minimum number of edges that need to be traversed from one node to the other. The shorter the path between two nodes, the more similar we consider them to be. Rada, Mili, Bichnell, and Blettner (1989) show that edge-based methods can provide useful results when applied to a lexical medical domain (Medline and MeSH).

The information-theoretic method for semantic similarity was first proposed by Resnik (1995). He defines the similarity of two concepts as the maximum of the information content of the concepts that subsume them in the taxonomy hierarchy. Following the original work of Resnik, some modifications have been made to enhance the pure information content method. Jiang and Conrath (1997) proposed a different approach for measuring semantic distance between words and concepts, combining a lexical taxonomy with corpus statistical information. They take into account the fact that edges in the taxonomy may have unequal link strength, so the link strength of an edge is determined by local density, node depth, information content, and link type. Lin (1998) presents another information-theoretic definition of similarity, applicable to different domains provided that a probabilistic model is defined. His modification consists of normalizing by the combined information content of the compared concepts, assuming their independence.

There is a broad range of applications for gene functional annotations from Entrez gene. However, the relations among these annotated terms seem to be overlooked. To address this issue, we propose SGDS, which exploits these annotations to identify functionally coherent genes via the GO similarity measure. The system takes a gene name and searches for its functional annotations in Entrez gene. The semantic similarities in the three parts of GO are then computed separately. The overall score is obtained by combining these similarity values according to different weights. Finally, the system ranks the similar genes based on their overall scores.

We apply this idea to predict possible pathways. The hypothesis is that pairs of genes exhibiting similar functional annotations also tend to have high similarity. Given two groups of similar genes, the genes whose interactions are mentioned in the PubMed literature are termed candidate genes (Ono, Hishigaki, Tanigami, & Takagi, 2001; Raychaudhuri, Chang, Sutphin, & Altman, 2001; Raychaudhuri & Altman, 2003). The interactions among these candidate genes allow us to infer novel pathways based on a known pathway (Beasley & Planes, 2007; Fukuda & Takagi, 2003; Lei & Dai, 2006).

The rest of the paper is organized as follows. Section 2 introduces related work on GO applications in functional genomics. The similar genes discovery system is presented in Section 3. Section 4 describes the experimental designs and results. Finally, Section 5 concludes the paper and points out further research.
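The information-content measures reviewed in this section can be written compactly once the information content (IC) of each term is known. The sketch below uses the commonly cited IC-only forms of the Resnik, Lin, and Jiang and Conrath measures. The IC values and term names are hypothetical, and the Jiang and Conrath function is the simplified distance form, which omits the link-strength factors (local density, node depth, link type) mentioned above.

```python
def resnik(ic, subsumers):
    """Resnik (1995): IC of the most informative common subsumer."""
    return max(ic[s] for s in subsumers)

def lin(ic, t1, t2, subsumers):
    """Lin (1998): shared IC normalized by the ICs of both compared terms."""
    denom = ic[t1] + ic[t2]
    return 2.0 * resnik(ic, subsumers) / denom if denom > 0 else 0.0

def jiang_conrath_distance(ic, t1, t2, subsumers):
    """Jiang and Conrath (1997), IC-only form: a distance, smaller means closer."""
    return ic[t1] + ic[t2] - 2.0 * resnik(ic, subsumers)

# Hypothetical IC values, for illustration only.
ic = {"root": 0.0, "kinase": 1.2, "MAP kinase": 2.5, "tyrosine kinase": 3.0}
common = {"root", "kinase"}  # terms that subsume both compared terms

sim_resnik = resnik(ic, common)
sim_lin = lin(ic, "MAP kinase", "tyrosine kinase", common)
dist_jc = jiang_conrath_distance(ic, "MAP kinase", "tyrosine kinase", common)
```

Note that the first two functions return similarities, whereas the third returns a distance, so the Jiang and Conrath value has to be inverted or negated before it can be compared with the others.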
2. Related works
Ontologies have traditionally been used to improve database search applications. The GO may facilitate large-scale applications for functional genomics. GO annotations have recently been integrated with relevant genomic resources, including gene expression data. One such application is the FatiGO tool, a web-based interface for analyzing groups of genes and their associations with GO terms. It allows users to analyze differential distributions of GO terms for two sets of genes (Al-Shahrour, Diaz-Uriarte, & Dopazo, 2004). King et al. (2003) predicted gene–phenotype associations in yeast. They used decision trees to process phenotypic annotations extracted from the MIPS (Munich Information Center for Protein Sequences) database and GO annotations. Hvidsten, Lagreid, and Komorowski (2003) combined gene expression data with annotations from the GO biological process taxonomy. They used rough set theory to assign biological process terms to genes represented by expression patterns. Similarity between annotation and literature has been shown to augment sequence similarity searches (Chang, Raychaudhuri, & Altman, 2001; MacCallum, Kelly, & Sternberg, 2000). These authors augmented PSI-BLAST (Altschul et al., 1997) with similarity scores computed over the annotations and Medline references cited by entries retrieved by the sequence similarity search.

Although these approaches analyze GO annotations, they are not based on semantic similarity measures. Moreover, these methods do not apply information content models, which may significantly represent relevant patterns associated with the structure and relationships in the GO. They may therefore fail to identify the similarity between genes annotated with closely related yet distinct terms. One of the contributions of this study is to compute gene–gene similarity based on term–term similarity in the GO hierarchies.

3. Similar genes discovery system (SGDS)
The system consists of three modules, as shown in Fig. 1: data preprocessing, semantic similarity measure, and gene quantification. The system takes a gene name and searches for its functional annotations in the Entrez gene database. The semantic similarities between the query gene and human genes are then computed separately in the three parts of GO. The overall score is obtained by combining these similarity values according to different weights. Finally, the system outputs a gene list ordered by degree of semantic similarity.

[Fig. 1 diagram: input gene name (e.g., BNIP3L) → data preprocessing (Entrez Gene DB, GO annotations of human genes, GO trees for MF, BP and CC) → semantic similarity measure (similarity of BNIP3L with each human gene in MF, BP and CC) → gene quantification (overall similarity of BNIP3L with each human gene via weight adjustment) → similar genes output in order.]
Fig. 1. Schematic overview of the SGDS. For a query gene, the semantic similarities between the query gene and each human gene are computed separately in the three parts of GO. Higher similarities are combined to form an overall score based on different weights. The list of similar genes is then output in order.

3.1. Data preprocessing module
The Entrez gene database, previously known as LocusLink, maintains descriptive information about loci including nomenclature, database identifiers (ID), disease associations, map positions and sequence accessions. These associations are then used to compute connections to other NCBI resources, to both facilitate navigation and enhance opportunities for new discoveries. The data are continuously augmented and reviewed through the combined efforts of NCBI staff and successful collaborations with several organisms. Entrez gene queries can be based on text terms (e.g., a protein name or disease name), gene symbols, sequence accessions, database IDs (e.g., MIM or EC numbers) and so on. In this paper, we opted for Entrez gene because it is one of the more comprehensive genetic databases, with a rich set of gene name fields per record. The Entrez gene database provides a single query interface to obtain genomic sequences and genetic loci. It contains diverse genomic information including official nomenclature, aliases and PubMed references. It is the most popular search system for obtaining normalized gene names and gene ontology annotations. The SGDS is based upon approximately 20,000 entries for human genes with functional annotations available in Entrez gene, whose evidence codes are TAS, ISS and IEA (GO Consortium).

In the data preprocessing module, the annotations of each human gene are organized into GO trees according to their annotations in MF, BP, and CC. For a query gene, all of its annotations are obtained from Entrez gene. Then, in the semantic similarity measure module, we calculate the similarity of each pair of terms (each term corresponds to a node in the GO) between the query gene and each human gene.

3.2. Semantic similarity measure module
In the GO, semantic similarity is considered to be governed by the shortest path length as well as the depth of
the subsumer. Thus, the semantic similarity between two GO terms can be defined as follows:

\[ S(g_1, g_2) = f_1(l) \cdot f_2(h) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}} \qquad (1) \]
where l and h are the shortest path length and the depth of the subsumer in the hierarchical semantic nets, respectively, and α and β are parameters scaling the contributions of shortest path length and depth, respectively.

The path length between g1 and g2 can be determined from one of three cases. Case 1 implies that g1 and g2 are in the same concept. Case 2 represents that g1 and g2 are not in the same concept, but the concept for g1 and the concept for g2 contain one or more of the same terms. Case 3 indicates that g1 and g2 are not in the same concept, nor do their concepts contain the same terms. Thus, we set f1(l) to be a monotonically decreasing function of l, i.e., f1(l) = e^{−αl}, where α is a constant. The exponential form ensures that the value of the function lies within the range from 0 to 1.

The depth of the subsumer is derived by counting the levels from the subsumer to the top of the lexical hierarchy. Terms at upper levels have more general concepts, and less semantic similarity between them, than terms at lower levels. Thus, we need to scale down S(g1, g2) for subsuming terms at upper levels and scale up S(g1, g2) for subsuming terms at lower levels. As a result, f2(h) should be a monotonically increasing function of the depth h.

It is observed that the "positive–negative" relationships in MF and BP reach 99%. For example, the shortest path length between "anti-apoptosis" and "induction of apoptosis" in GO is 2, yet there is nothing in common between their semantic concepts. To solve this problem,
we use penalty values of +1 and −1 to represent the positive and negative relations, respectively. In this study, the positive keywords {positive, induction of, activator, initiation} and the negative keywords {negative, inhibition, anti-, non-, inhibitor, repressor} are collected. Finally, the similarity score between terms is given by

\[ S(g_1, g_2) = \mathrm{penalty} \cdot f_1(l) \cdot f_2(h) = \mathrm{penalty} \cdot e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}} \qquad (2) \]
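A minimal sketch of Eqs. (1) and (2) follows. It relies on the identity (e^{βh} − e^{−βh}) / (e^{βh} + e^{−βh}) = tanh(βh). The keyword lists are those given in the text, but the substring matching rule, the choice of a −1 penalty only when the two terms have opposite polarity, the function names, and the depth h = 5 used in the example are assumptions made for illustration. The defaults α = 0.7 and β = 0.2 are the tuned values reported in Section 4.2.

```python
import math

# Keyword lists quoted from the text; how they are matched is an assumption.
POSITIVE = ("positive", "induction of", "activator", "initiation")
NEGATIVE = ("negative", "inhibition", "anti-", "non-", "inhibitor", "repressor")

def polarity(term_name):
    """Return -1, +1, or 0 depending on which keyword list the name matches."""
    name = term_name.lower()
    if any(k in name for k in NEGATIVE):
        return -1
    if any(k in name for k in POSITIVE):
        return 1
    return 0

def penalty(name1, name2):
    """Assumed rule: -1 if the terms have opposite polarity, otherwise +1."""
    return -1 if polarity(name1) * polarity(name2) < 0 else 1

def term_similarity(l, h, name1="", name2="", alpha=0.7, beta=0.2):
    """Eq. (2): penalty * exp(-alpha * l) * tanh(beta * h)."""
    return penalty(name1, name2) * math.exp(-alpha * l) * math.tanh(beta * h)

# Path length 2 is the example from the text; depth 5 is a hypothetical value.
s = term_similarity(2, 5, "anti-apoptosis", "induction of apoptosis")
```

Under these assumptions the pair "anti-apoptosis" and "induction of apoptosis" receives a negative score even though its path length is only 2, which is the behavior the penalty term is meant to produce.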
In Eq. (2), the penalty takes the value +1 or −1, and l, h, α and β are as defined for Eq. (1). After the similarity scores have been computed, we obtain the individual similarities between the query gene and each human gene in MF, BP, and CC. To compute the degree of semantic similarity between two genes, the gene quantification module is used to quantify the semantic similarity further. We describe the module in the following section.

3.3. Gene quantification module
In the semantic similarity measure module, all pairs of GO terms in the three categories are scored individually. Each value represents the semantic similarity between two annotations. The larger this value, the higher the similarity. To decrease the computation cost, only the GO term similarity values higher than c (c = 0.6) are kept for each category. When gene products are annotated with multiple GO terms, the average GO term similarity is taken as the gene similarity. Thus, each GO category generates one score for a pair of gene products, and we have to combine the three scores to form a single semantic similarity score. Taking the sum or average of the three scores does not distinguish the case where two gene products have average scores in all ontologies from the case where they have a high score in one ontology and low scores in the others. In SGDS, the weights of the three categories, molecular function, biological process and cellular component, are set to 0.6, 0.3 and 0.1, respectively. Among these three categories, we give molecular function the largest weight to emphasize finding related genes with similar functional annotations. A high similarity score indicates that the two genes are functionally coherent. Finally, SGDS outputs a gene list ordered by degree of semantic similarity.

4. Experimental results

4.1. Experiment of gene expression profiles versus semantic similarities
Gene products with similar genomic expression might be functionally related (Lee, Hsu, Sajdak, Qin, & Pavlidis, 2004; Su et al., 2002; Stuart, Segal, Koller, & Kim, 2003). It is reasonable to think that, if two gene products are similar, their genetic expressions are similar, and that they are
similarly annotated in the GO. To illustrate the proposed semantic similarity measure, we used microarray data analysis to determine the expression levels of genes in experimental samples, and GARBAN (Martinez-Cruz et al., 2003) to gather their GO associations. For each pair of genes, the semantic similarity in each individual ontology was compared to the absolute expression correlation value. Expression correlation was computed using the well-known Pearson correlation coefficient. The hypothesis is that pairs of genes exhibiting similar expression levels also tend to have high similarity (Haiying, Francisco, Olivier, & Joaquin, 2004). This experiment was done separately on the three hierarchies of the GO in order to evaluate whether the hypothesis holds for CC and BP annotations as well as for MF annotations.

The MARSHA (Wills-Karp & Ewar, 2004) and RAD (RNA Abundance Database, 2004) data sets are used in this experiment. For the MARSHA data set, 754 genes are annotated (MF: 634 genes, BP: 597 genes, CC: 580 genes) among the 900 selected genes. For the RAD data set, 753 genes are annotated (MF: 630 genes, BP: 608 genes, CC: 616 genes) among the 900 selected genes. To obtain comparable results, we also use the measures of Resnik, Jiang, and Lin to compute the semantic similarity between two gene products. Table 1 summarizes the correlation coefficients obtained by the various semantic measures for both data sets. From these results, we conclude that our semantic similarity measure clearly outperforms both the Jiang and Lin measures, and slightly outperforms the Resnik measure in every case except CC on the RAD data set (0.578 versus 0.582). The results suggest that there is an underlying relationship between gene expression and GO annotation. They also validate the ability of our semantic measure to quantify GO annotations (Table 2).

Table 1
Pearson correlation coefficients between gene expression and semantic similarity

Data set   Ontology   Resnik   Jiang   Lin     Our method
MARSHA     MF         0.647    0.592   0.244   0.662
MARSHA     BP         0.726    0.318   0.128   0.753
MARSHA     CC         0.745    0.217   0.392   0.768
RAD        MF         0.482    0.164   0.284   0.624
RAD        BP         0.534    0.228   0.337   0.546
RAD        CC         0.582    0.143   0.409   0.578

4.2. Experiment for system parameter tuning
The experiment above demonstrated that our proposed semantic similarity measure is significant. We next tune the parameters of our system. For this purpose, we use the MAPK (mitogen-activated protein kinase) homogenous genes group to tune the α, β, and c parameters in SGDS. This group contains 12 genes and is a kind of Ser/Thr protein kinase (see Fig. 2). These kinases play an important role in several different signal transduction pathways. All MAPKs share high homogeneity in their eleven kinase subdomains. The system calculated the similarities between MAPK1 and each of the twenty thousand genes in Entrez gene individually. Different parameters were applied to the system, and it was apparent that the system performs well in identifying similar genes when α = 0.7, β = 0.2, and c = 0.6. The results, using the normalized gene names, are shown in Table 2. The homogenous genes were listed as the top eleven, except MAPK7. The results do not display the similarity score of MAPK7 and MAPK1 because MAPK7 has no available annotation in Entrez gene. This is a limitation of SGDS. We assume that SGDS has potential for identifying a group of functionally coherent genes when the α, β, and c parameters are 0.7, 0.2, and 0.6, respectively.

[Fig. 2 diagram. Members shown: P38, P38β/SAPK, P38γ/SAPK3, P38δ/SAPK4, ERK1, ERK2, ERK3, ERK4, ERK5, JNK1, JNK2, JNK3.]
Fig. 2. MAPK homogenous genes group.

Table 2
Top 15 genes similar to MAPK1 gene

Gene pair            Overall score
MAPK1 ↔ MAPK1        0.75247
MAPK1 ↔ MAPK14       0.52986
MAPK1 ↔ MAPK13       0.52081
MAPK1 ↔ MAPK6        0.52081
MAPK1 ↔ MAPK4        0.52081
MAPK1 ↔ MAPK12       0.4662
MAPK1 ↔ MAPK9        0.45961
MAPK1 ↔ MAPK10       0.45961
MAPK1 ↔ MAPK8        0.45961
MAPK1 ↔ MAPK11       0.45961
MAPK1 ↔ MAPK3        0.44737
MAPK1 ↔ DAPK2        0.44003
MAPK1 ↔ PRKR         0.43869
MAPK1 ↔ CDK2         0.4334
MAPK1 ↔ MAP3K7       0.41741

The accuracy rate of SGDS was evaluated with the MAP kinase pathway obtained from BioCarta. This pathway provides several groups of genes with similar functional annotations, together with the direction in which messages are transmitted between the groups. One group, containing MAP4K1, MAP4K2, MAP4K3, MAP4K4, and MAP4K5, was tested. Table 3 shows the top 10 genes similar to MAP4K1. It is evident from the test data set that the identified similar genes are indeed related. The result indicates that the system can obtain satisfactory similar genes.

Table 3
Top 10 genes similar to MAP4K1 gene

Gene pair            Overall score
MAP4K1 ↔ MAP4K1      0.6739
MAP4K1 ↔ MAP4K5      0.58598
MAP4K1 ↔ MAP4K3      0.45059
MAP4K1 ↔ MAP4K4      0.45059
MAP4K1 ↔ TNIK        0.45059
MAP4K1 ↔ CAMKK2      0.43891
MAP4K1 ↔ DAPK1       0.43433
MAP4K1 ↔ MAP4K2      0.42996
MAP4K1 ↔ MAPK11      0.41819
MAP4K1 ↔ MAPK13      0.41819

4.3. Experiments of the RON and Lutheran pathways
This experiment examines the performance of our system in discovering possible pathways. Fig. 3 shows some components of the MSP/RON pathway (Danilkovitch & Leonard, 1999). It is apparent that the message transmitted from PI3K to AKT induces anti-apoptosis. In our system, a pathway graph containing two genes is used as the query graph. Two groups of similar genes are identified, and the interactions between these two groups are considered as possible pathways. In this experiment, we use PI3K and AKT as query genes. PIK3CA and AKT1 are the normalized gene names for PI3K and AKT, respectively. Based on these two genes, the two groups of similar genes identified by SGDS are shown in Tables 4 and 5. Based on these 20 genes, Fig. 4 displays the given pathway {PIK3CA → AKT1} as a solid line and, as dotted lines, the interactions between the two groups that are mentioned in the PubMed literature.

[Fig. 3 diagram. Labels shown: MSP, RON, P, Src, PI3K, AKT, JNK, FAK, Rac/Rho, MAPK, unidentified components (?), induction of apoptosis, anti-apoptosis, mitogenic signal, adhesion motility.]
Fig. 3. Kinases involved in MSP/RON signaling. Interaction of MSP with RON receptor induces dimerization and consequent kinase autophosphorylation and activation. Active RON kinase initiates multiple signal transduction cascades and biological activities. Circles with question marks indicate unidentified components in MSP-induced signaling pathways.

Table 4
Top 10 genes similar to PIK3CA gene

Gene pair            Overall score
PIK3CA ↔ PIK3CA      0.74205
PIK3CA ↔ PIK3CD      0.74205
PIK3CA ↔ PIK3CB      0.63737
PIK3CA ↔ PIK3C2B     0.48747
PIK3CA ↔ PIK3C2A     0.38317
PIK3CA ↔ PIK3C2G     0.34492
PIK3CA ↔ PIK3R1      0.24538
PIK3CA ↔ ATM         0.22813
PIK3CA ↔ PIK3C3      0.2029
PIK3CA ↔ PIK3CG      0.2029

Table 5
Top 10 genes similar to AKT1 gene

Gene pair            Overall score
AKT1 ↔ AKT1          0.95404
AKT1 ↔ MAPK14        0.54881
AKT1 ↔ CSNK2A1       0.50195
AKT1 ↔ MAPKAPK3      0.47751
AKT1 ↔ MAPK11        0.43953
AKT1 ↔ MAPK13        0.43953
AKT1 ↔ PKN3          0.4277
AKT1 ↔ IRAK1         0.42555
AKT1 ↔ MET           0.4254
AKT1 ↔ RET           0.4254

[Fig. 4 diagram. Nodes shown: PIK3CA, PIK3CG, PIK3CB, ATM, AKT1, JNK, MAPK.]
Fig. 4. The diagram for gene–gene interactions between two groups of similar genes. The solid line displays the known pathway {PIK3CA → AKT1} and the dotted lines show that the following three gene–gene interactions, {PIK3CG → JNK}, {PIK3CB → JNK} and {ATM → MAPK}, are mentioned in the PubMed literature.

[Fig. 6 diagram. Nodes shown: LU, TNFRSF6, KLRC2, LRP8, Erk, JNK, MAPK.]
Fig. 6. The diagram for gene–gene interactions between two groups of similar genes. The solid line displays the known pathway {LU → Erk} and the dotted lines show that the following three gene–gene interactions, {TNFRSF6 → JNK}, {KLRC2 → MAPK} and {LRP8 → JNK}, are mentioned in the PubMed literature.
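The overall scores in the tables above and the dotted edges in Figs. 4 and 6 come from two simple steps: combining the per-ontology similarities into one weighted score, and then checking gene pairs across the two ranked groups against interactions reported in the literature. The sketch below illustrates both steps under stated assumptions. The threshold c = 0.6 and the weights 0.6, 0.3 and 0.1 are those of Section 3.3, while the function names, the treatment of an empty category as 0.0, the example similarity values, and the use of a plain Python set as a stand-in for a literature lookup are hypothetical.

```python
from itertools import product

WEIGHTS = {"MF": 0.6, "BP": 0.3, "CC": 0.1}  # weights stated in Section 3.3

def category_score(term_sims, c=0.6):
    """Average the term-pair similarities above c (assumed 0.0 if none remain)."""
    kept = [s for s in term_sims if s > c]
    return sum(kept) / len(kept) if kept else 0.0

def overall_score(sims_by_category, c=0.6):
    """Weighted sum of the MF, BP and CC category scores."""
    return sum(w * category_score(sims_by_category.get(cat, []), c)
               for cat, w in WEIGHTS.items())

def rank_similar(query_sims, top_k=10):
    """Sort genes by overall score, highest first, and keep the top_k names."""
    scored = {gene: overall_score(sims) for gene, sims in query_sims.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

def candidate_interactions(group_a, group_b, known_interactions):
    """Keep the cross-group pairs found in a set of literature interactions."""
    return [(a, b) for a, b in product(group_a, group_b)
            if (a, b) in known_interactions]

# Hypothetical term-pair similarities of two genes against one query gene.
query_sims = {
    "GENE_A": {"MF": [0.9, 0.7, 0.2], "BP": [0.8], "CC": [0.5]},
    "GENE_B": {"MF": [0.65], "BP": [0.3], "CC": [0.9]},
}
ranking = rank_similar(query_sims)

# Interactions named in the Fig. 4 caption, used here as the lookup set.
literature = {("PIK3CG", "JNK"), ("PIK3CB", "JNK"), ("ATM", "MAPK")}
group_a = ["PIK3CA", "PIK3CG", "PIK3CB", "ATM"]
group_b = ["AKT1", "JNK", "MAPK"]
candidates = candidate_interactions(group_a, group_b, literature)
```

Keeping the literature check separate from the scoring means the same ranked groups can be tested against any interaction source, which is why the lookup is passed in as an argument rather than built into the function.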
As Fig. 4 shows, the following three gene–gene interactions, {PIK3CG → JNK}, {PIK3CB → JNK} and {ATM → MAPK}, are mentioned in the PubMed literature. These participating genes are defined as the candidate genes.

Similarly, the same procedure was applied to another data set, the Lutheran pathway (see Fig. 5). We try to discover a novel pathway based on the known pathway {LU → Erk}. Fig. 6 shows that some candidate genes obtained from SGDS are also mentioned in the PubMed literature, such as {TNFRSF6 → JNK}, {KLRC2 → MAPK} and {LRP8 → JNK}. The results reveal that some gene–gene interactions exist between the two groups of similar genes.

[Fig. 5 diagram. Nodes recovered: Erk, RhoA, Rac1.]
Fig. 5. The Lutheran pathway.

5. Conclusions
SGDS is developed as a novel approach, based on a semantic similarity measure over gene ontology and Entrez gene, to identify a group of similar genes. The proposed method depends on the gene ontology hierarchical structure, and the semantic similarity between two GO terms is quantified by a nonlinear function of path length and depth. The performance of our semantic similarity measure was further verified through the MARSHA and RAD data sets. Compared with literature methods such as the Resnik, Jiang, and Lin measures, the results show that our semantic similarity measure is comparable to these methods. The system has been successfully tested on two pathways, namely the RON and Lutheran pathways. It is our belief that, by extending this concept to well-known pathways, SGDS can discover candidate genes with interactions described in the literature. The experimental results indicate that the system can identify the genes similar to the query gene. Based on these candidate genes and their interactions, it is possible to predict a novel pathway and/or make pathway studies more efficient.
References
Al-Shahrour, F., Diaz-Uriarte, R., & Dopazo, J. (2004). Fatigo: A web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics, 20, 578–580.
Altschul, S. F., Madden, T. L., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17), 389–402.
Beasley, J. E., & Planes, F. J. (2007). Recovering metabolic pathways via optimization. Bioinformatics, 23(1), 92–98.
Biocarta pathway web site: http://www.biocarta.com.
Budanitsky, A., & Hirst, G. (2001). Semantic distances in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics, pp. 29–34.
Chang, J., Raychaudhuri, S., & Altman, R. (2001). Including biological literature improves homology search. Pacific Symposium on Biocomputing, 6, 374–383.
Couto, F. M., Silva, M. J., & Coutinho, P. (2003). Implementation of functional semantic similarity measure between gene-products. FCUL Technical Report DI/FCUL TR 3-29, November.
Danilkovitch, A., & Leonard, E. J. (1999). Kinases involved in MSP/RON signaling. Journal of Leukocyte Biology, 65(March).
Fukuda, K., & Takagi, T. (2003). Knowledge representation and signal transduction pathways. Bioinformatics, 17(9), 829–837.
Gene Ontology web site: http://www.geneontology.org.
GO Consortium. http://www.geneontology.org/GO.evidence.shtml.
Haiying, W., Francisco, A., Olivier, B., & Joaquin, D. (2004). Gene expression correlation and gene ontology-based similarity: An assessment of quantitative relationships. CIBCB'2004, 25–31.
Hvidsten, T., Lagreid, A., & Komorowski, J. (2003). Learning rule-based models of biological process from gene expression time profiles using Gene Ontology. Bioinformatics, 19, 1116–1123.
Jiang, J. J., & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of international conference research in computational linguistics (pp. 19–33). Taiwan: Scandinavian University Press.
King, O. D., Lee, J. C., Dudley, A. M., Janse, D. M., Church, G. M., & Roth, F. P. (2003). Predicting phenotype from patterns of annotation. Bioinformatics, 19(Suppl. 1), 183–189.
Kretschmann, E., Fleischmann, W., & Apweiler, R. (2001). Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics, 17(10), 920–926.
Lee, H. K., Hsu, A. K., Sajdak, J., Qin, J., & Pavlidis, P. (2004). Coexpression analysis of human genes across many microarray data sets. Genome Research, 14, 1085–1094.
Lei, Z., & Dai, Y. (2006). Assessing protein similarity with gene ontology and its use in subnuclear localization prediction. BMC Bioinformatics, 7, 491.
Li, Y., Bandar, Z. A., & McLean, D. (2003). An approach for measuring semantic similarity between words using multiple information sources. IEEE Transaction on Knowledge and Data Engineering, 15(4) (July/August).
Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of 15th international conference on machine learning (pp. 296–304).
Lord, P. W., Stevens, R. D., Brass, A., & Goble, C. A. (2003). Semantic similarity measures as tools for exploring the gene ontology. In Proceedings of Pacific symposium on biocomputing, Vol. 8 (pp. 601–612).
Lord, P. W., Stevens, R. D., Brass, A., & Goble, C. A. (2003b). Investigating semantic similarity measures across the gene ontology: The relationship between sequence and annotation. Bioinformatics, 19(10), 1275–1283.
MacCallum, R. M., Kelly, L. A., & Sternberg, M. J. (2000). SAWTED: Structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics, 16(2), 125–129.
Martinez-Cruz, L. A., Rubio, A., Martinez-Chantar, M. L., Labarga, A., Barrio, I., Podhorski, A., et al. (2003). GARBAN: Genomic analysis and rapid biological annotation of cDNA microarray and proteomic data. Bioinformatics, 19, 2158–2160.
McHale, M. (1998). A comparison of WordNet and Roget's taxonomy for measuring semantic similarity. In Proceedings of COLING/ACL workshop usage of WordNet in natural language processing systems.
MeSH website: http://www.nlm.nih.gov/mesh/meshhome.html.
Ono, T., Hishigaki, H., Tanigami, A., & Takagi, T. (2001). Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics, 17(2), 155–161.
PubMed website: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi.
Rada, R., Mili, H., Bichnell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems Man and Cybernetics, 9(1), 17–30, Jan.
Raychaudhuri, S., & Altman, R. B. (2003). A literature-based method for assessing functional coherence of a gene group. Bioinformatics, 19(3), 396–401.
Raychaudhuri, S., Chang, J. T., Sutphin, P. D., & Altman, R. B. (2001). Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Research, 12, 203–214.
Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of 14th international joint conference on artificial intelligence.
RNA Abundance Database: http://www.cbil.upenn.edu/RAD/.
Stuart, J. M., Segal, E., Koller, D., & Kim, S. K. (2003). A gene-coexpression network for global discovery of conserved genetic modules. Science, 302(5643), 249–255.
Su, A. I., Cooke, M. P., Ching, K. A., Hakak, Y., Walker, J. R., Wiltshire, T., et al. (2002). Large-scale analysis of the human and mouse transcriptomes. Proceedings of National Academy of Science, 99, 4465–4470.
Wills-Karp, M., & Ewar, S. L. (2004). Time to draw breath: Asthma susceptibility genes are identified. Nature Reviews Genetics, 5, 376–387.
Zhong, J. W., Zhu, H. P., Li, J. M., & Yu, Y. (2002). Conceptual graph matching for semantic search. In Conceptual structures: Integration and interfaces (pp. 92–106). London: Springer.