NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity


G Model

ARTICLE IN PRESS

CBAC-6583; No. of Pages 9

Computational Biology and Chemistry xxx (2016) xxx–xxx

Contents lists available at ScienceDirect

Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Research Article

NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity

Chang Lu a, Jun Wang a, Zili Zhang a, Pengyi Yang b, Guoxian Yu a,∗

a College of Computer and Information Science, Southwest University, Chongqing 400715, China
b School of Mathematics and Statistics, The University of Sydney, New South Wales, Australia

Article info

Article history: Received 2 September 2016; Accepted 7 September 2016; Available online xxx

Keywords: Gene Ontology; GO annotations; Semantic similarity; Taxonomic similarity

Abstract

Gene Ontology (GO) provides GO annotations (GOA) that associate gene products with GO terms summarizing their cellular, molecular and functional aspects in the context of biological pathways. The GO Consortium (GOC) resorts to various quality assurances to ensure the correctness of annotations. Due to resource limitations, only a small portion of annotations are manually added or checked by GO curators, and a large portion of available annotations are computationally inferred. While computationally inferred annotations provide greater coverage of known genes, they may also introduce annotation errors (noise) that could mislead the interpretation of gene functions and their roles in cellular and biological processes. In this paper, we investigate how to identify noisy annotations, a rarely addressed problem, and propose a novel approach called NoisyGOA. NoisyGOA first measures taxonomic similarity between ontological terms using the GO hierarchy and semantic similarity between genes. Next, it leverages the taxonomic similarity and semantic similarity to predict noisy annotations. We compare NoisyGOA with alternative methods on identifying noisy annotations under different simulated cases of noisy annotations, and on archived GO annotations. NoisyGOA achieved higher accuracy than the alternative methods in comparison. These results demonstrate that both taxonomic similarity and semantic similarity contribute to the identification of noisy annotations. Our study shows that annotation errors are predictable and that removing noisy annotations improves the performance of gene function prediction. This study can prompt the community to study methods for removing inaccurate annotations, a critical step for annotating gene and pathway functions. © 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Gene Ontology (GO) is a major bioinformatics initiative to unify the annotation of genes across all species (Ashburner et al., 2000). GO structures ontological terms, defined by controlled vocabularies, in a hierarchy (or directed acyclic graph, DAG) to represent the core biological knowledge, and to enable automated and computer-based analysis (Blake, 2013). GO annotations, another component of GO, provide the associations between genes and GO terms, which describe functions of gene products. GO organizes the ontological terms in three sub-ontologies: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF), and it provides annotations with respect to these sub-ontologies.

Gene products are the workhorses of cells, tissues and organs. Precisely annotating their biological functions can elucidate their

∗ Corresponding author. E-mail address: [email protected] (G. Yu).

roles in molecular pathways, and has great implications for the therapeutic treatment of diseases and other applications (Radivojac et al., 2013; Kahanda et al., 2015). GO curators have been manually adding new annotations into GO from various sources, such as scientific literature and wet-lab experiments. Recent studies demonstrate that these manual annotations are biased by high-throughput experiments and by the research interests of biologists (Schnoes et al., 2013; Škunca et al., 2012; Thomas et al., 2012). More importantly, our ability to manually characterize and annotate the functional roles of genes is far outpaced by the rate at which new genes are discovered (Radivojac et al., 2013; Rhee et al., 2008). To overcome the deficit of functional annotations, computational models have been introduced to automatically predict gene function at large scale (Radivojac et al., 2013). To date, about 99% of annotations in UniProt (Apweiler et al., 2011), and more than 95% of annotations in GO, are inferred by algorithms that take advantage of different characteristics of genes (Huntley et al., 2014). For example, homology of amino acid sequences and structures (Lee et al., 2007; Deng, 2015) and protein–protein physical/genetic interactions (Sharan

http://dx.doi.org/10.1016/j.compbiolchem.2016.09.005 1476-9271/© 2016 Elsevier Ltd. All rights reserved.

Please cite this article in press as: Lu, C., et al., NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.09.005


et al., 2007). These inferred (not manually checked) annotations have broader coverage and occupy a much larger taxonomic range than manual ones. Although the quality of these computationally inferred annotations is improving, their accuracy still lags largely behind manually curated ones (Škunca et al., 2012; Huntley et al., 2014). In fact, the GO Consortium (GOC) has been aggregating annotations from many contributors (e.g., InterPro (Hunter et al., 2011), PANTHER (Mi et al., 2005)). Different contributors have different backgrounds and expertise, and provide annotations with different reliabilities. In addition, the annotations curated from wet-lab experiments also depend on the experimental protocols and research interests. Although GOC resorts to several quality control mechanisms (e.g., GO taxon restrictions and an annotation blacklist) to improve the quality of curated and inferred annotations (Huntley et al., 2014), some noisy annotations remain in GO.

Given the wide applications of GO annotations, predicting and removing noisy (or inaccurate) annotations is a critical and valuable task (Huntley et al., 2014). Nevertheless, this task is rarely investigated. To bridge this gap, we introduce an algorithm called NoisyGOA to remove noisy annotations. NoisyGOA first computes the taxonomic similarity between two ontological terms using the ontology structure, and the semantic similarity between two groups of terms annotated to pairwise genes. Next, it approximates an aggregated taxonomic score for each available annotation of a gene with respect to the annotations of the gene's most semantically similar neighbors. Then, it predicts the annotations with the smallest scores as noisy annotations of the gene.

Very little has been done on explicitly predicting noisy annotations. Furthermore, there are no off-the-shelf noisy GO annotations for validation.
To comparatively study the performance of NoisyGOA, we take several baseline approaches, which utilize different characteristics of the ontological structure and the distribution of available annotations of genes, as comparing methods. We assume the available annotations of genes are noise free and inject noisy annotations in two different configurations (see Section 3 for more details). In addition, we also collect archived GOA files (from May 2015 to December 2015) of S. cerevisiae, H. sapiens and A. thaliana to investigate the ability of NoisyGOA in predicting noisy annotations, and use historical GOA files, with the predicted noisy annotations removed, to predict gene functions. Our empirical study demonstrates that NoisyGOA can more accurately predict noisy annotations than the methods in comparison across these scenarios, and can also improve the prediction accuracy of gene functions.

2. Methodology and algorithm

Let A ∈ R^{N×|T|} be the gene-term association matrix, where N is the number of genes and |T| is the number of GO terms. A is defined as follows:

A(i, t) = 1 if gene i is annotated with term t, and A(i, t) = 0 otherwise    (1)
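As a minimal sketch (our own illustration; the function name and the toy gene/term identifiers are assumptions, not code from the paper), the association matrix of Eq. (1) can be built from a list of (gene, term) pairs:

```python
import numpy as np

def build_association_matrix(genes, terms, annotations):
    """Build the binary gene-term association matrix A of Eq. (1).

    genes/terms fix the row/column order; annotations is an iterable of
    (gene, term) pairs. All identifiers below are illustrative only.
    """
    gene_idx = {g: i for i, g in enumerate(genes)}
    term_idx = {t: j for j, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)), dtype=int)
    for g, t in annotations:
        A[gene_idx[g], term_idx[t]] = 1  # gene g is annotated with term t
    return A

# Toy usage with two genes and two GO terms mentioned in Fig. 1:
A = build_association_matrix(
    ["AFR1", "ENT2"],
    ["GO:0042995", "GO:0032059"],
    [("AFR1", "GO:0042995"), ("AFR1", "GO:0032059"), ("ENT2", "GO:0042995")],
)
# Rows are genes, columns are terms: A = [[1, 1], [1, 0]].
```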

Our goal is to identify noisy GO annotations and set their corresponding entries in A to 0. Predicting noisy annotations is different from replenishing the missing annotations of incompletely annotated genes (Tao et al., 2007; Yu et al., 2015), which updates some zero entries of A from 0 to 1. It is also different from negative example selection (Youngs et al., 2014; Fu et al., 2016), which updates some zero entries of A from 0 to −1, indicating that the related genes are clearly not annotated with a given term.

One intuitive and simple way to identify noisy annotations of a gene is to find the k most similar genes of that gene and regard these k genes as voters. Then, these voters vote on whether the studied gene should be annotated with a given term based on the term annotated to

themselves or not. In this way, each term annotated to the gene aggregates votes, and the term that receives the lowest vote (or largest disagreement) from these genes is deemed a noisy annotation of the gene. In practice, this idea is widely used to aggregate annotations and to resolve disagreement between annotators (Good et al., 2011; Good and Su, 2013). NoisyGOA also adopts this idea, but it goes further by taking advantage of the taxonomic similarity between GO terms and the semantic similarity between genes. The following two subsections elaborate the taxonomic similarity and semantic similarity, which serve as the basic components of NoisyGOA.

2.1. Taxonomic similarity between GO terms

GO is a collaborative project that provides core biological knowledge representation for modern biologists (Ashburner et al., 2000). It uses controlled vocabularies to define GO terms, each of which has a distinct alphanumeric identifier (e.g., 'GO:0008150', biological process), and a DAG to represent the hierarchical relationships (i.e., is a, part of and regulates) among terms. Therefore, the GO hierarchy can be used as a knowledge source to predict noisy annotations. In fact, the GO hierarchy has been shown to play a paramount role in predicting gene function (Tao et al., 2007; Yu et al., 2015), and also in measuring the semantic similarity between genes (Pesquita et al., 2009; Teng et al., 2013). To take advantage of the GO hierarchy, here we measure the taxonomic similarity between two terms based on a formula similar to Lin's similarity (Lin, 1998), for its simplicity and wide applications (Tao et al., 2007; Teng et al., 2013; Yu et al., 2015). The formula is defined as follows:

tsim(t1, t2) = 2 × IC(t*) / (IC(t1) + IC(t2))    (2)

where t* is the most specific shared ancestor (or minimum subsumer) of t1 and t2, and the ancestor terms of t include t itself. IC(t) is the information content of t, defined as:

IC(t) = 1 − log2(1 + |dest(t)|) / log2|T|    (3)
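As a concrete illustration of Eqs. (2) and (3), the following minimal Python sketch computes the structural information content and the taxonomic similarity over a toy DAG encoded as a child-to-parents map (the helper names and the toy terms are our own illustration, not the authors' code):

```python
import math

def ancestors(term, parents):
    """All ancestor terms of `term` in the GO DAG, including the term itself."""
    seen, stack = {term}, [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def ic(term, parents, all_terms):
    """Structural information content of Eq. (3)."""
    n_desc = sum(1 for t in all_terms
                 if t != term and term in ancestors(t, parents))
    return 1 - math.log2(1 + n_desc) / math.log2(len(all_terms))

def tsim(t1, t2, parents, all_terms):
    """Taxonomic similarity of Eq. (2): Lin-style, using the structural IC."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    if not common:
        return 0.0
    ic_star = max(ic(t, parents, all_terms) for t in common)  # most specific shared ancestor
    denom = ic(t1, parents, all_terms) + ic(t2, parents, all_terms)
    return 2 * ic_star / denom if denom > 0 else 0.0

# Toy DAG: root -> b, root -> c, b -> d (child-to-parents encoding).
parents = {"b": ["root"], "c": ["root"], "d": ["b"]}
terms = ["root", "b", "c", "d"]
```

On this toy DAG, tsim("d", "d") = 1, tsim("b", "c") = 0 because their only shared ancestor is the root, and tsim("d", "b") = 2/3.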

where dest(t) are the descendant terms of t and |T| is the cardinality of T. IC(t) is inversely proportional to the number of descendants of t, since the more descendants t has, the less specific it is and the less information it contains. Eq. (3) uses the ontological structure to define the information content of t, instead of the frequency of t among the N genes. The reason is that Eq. (3) is irrespective of the GO annotations of genes, and thus Eq. (2) suffers less from noisy annotations than the original Lin's similarity, which utilizes t's frequency to measure its information content.

Eq. (2) computes the taxonomic similarity of t1 and t2 in terms of the information content of their most specific shared ancestor in the hierarchy. Clearly, if t1 = t2, then tsim(t1, t2) = 1. tsim(t1, t2) = 0 if the most specific shared ancestor term is the root term (e.g., 'GO:0008150' of the BP sub-ontology). If the most specific shared ancestor is close to t1 and t2, then tsim(t1, t2) is large. On the other hand, if the most specific shared ancestor of t1 and t2 is far away from them and close to the root term, then tsim(t1, t2) is small. Empirically, we observe that NoisyGOA based on the taxonomic similarity in Eq. (2) performs better than NoisyGOA based on the original Lin's similarity (Lin, 1998), and it also achieves better performance than NoisyGOA based on a path-based taxonomic similarity (Benabderrahmane et al., 2010) (see Section 3).

2.2. Semantic similarity between genes

Semantic similarity between two genes can be approximated from the GO annotations of these genes. Semantic similarity is found



to be correlated with the sequence similarity between genes (Lord et al., 2003; Pesquita et al., 2009). It has also been applied to predict the missing annotations of incompletely annotated genes and to validate protein–protein interactions (Wu et al., 2006; Tao et al., 2007; Yu et al., 2015). Diverse semantic similarity measures have been proposed to compute the similarity between two sets of ontological terms, each set of terms annotated to a gene (Pesquita et al., 2009; Teng et al., 2013). Here, we adopt a vector-based semantic similarity measure for its simplicity and efficiency. The vector-based similarity is defined as:

psim(i, j) = A(i, ·)A(j, ·)^T / (||A(i, ·)|| ||A(j, ·)||)    (4)

where ||A(i, ·)|| is the ℓ2 (Euclidean) norm of the row vector A(i, ·). Obviously, psim(i, j) is always between 0 and 1.

Measuring the semantic similarity between two genes is a nontrivial task. To account for the specificity of different annotations, we weighted each annotation A(i, t) = 1 using the IC(t) defined by the frequency of t, rather than 1 or 0, and then used the weighted A(i, t) and Eq. (4) to compute the semantic similarity between genes. However, the results are similar to the unweighted method. The reason is that the effect of the weights is offset between the numerator and denominator of Eq. (4). In fact, we also studied a graph-based semantic similarity (Pesquita et al., 2008) and a term-overlap-based similarity (Mistry and Pavlidis, 2008) to experimentally investigate the influence of the semantic similarity on NoisyGOA (see Section 3).

2.3. Predicting noisy GO annotations

We can identify the k nearest neighbors of gene i based on psim(i, j) defined in Eq. (4). These k nearest neighborhood genes serve as voters and vote on whether A(i, t) = 1 should be kept as 1 or updated to 0 via the following equation:

V(i, t) = Σ_{j ∈ Nk(i)} A(j, t), if A(i, t) = 1; and V(i, t) = 0 otherwise    (5)

where Nk(i) contains the k nearest neighbors of the ith gene. Clearly, if term t is seldom annotated to the neighborhood genes of the ith gene, then A(i, t) is quite likely to be voted a noisy annotation of that gene. We name the minority-vote method in Eq. (5) LowestVote. However, if t has the same frequency as some other terms currently associated with the gene's neighborhood genes, but it is indeed a noisy annotation of the ith gene, we cannot identify t as a noisy annotation by only using the semantic similarity (see Eq. (4)) between genes. For example, in Fig. 1, 'GO:0032059' is a noisy annotation of gene 'AFR1'. Neither 'GO:0032059' nor 'GO:0001400' is annotated to the neighborhood genes 'ENT2' and 'WSC3'. By using Eq. (5), these two terms have equal probability of being predicted as noisy annotations of 'AFR1'.
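The cosine similarity of Eq. (4) and the voting scheme of Eq. (5) can be sketched with NumPy as follows (a minimal illustration under our own naming; the toy matrix is hypothetical):

```python
import numpy as np

def psim(A):
    """Cosine semantic similarity of Eq. (4) between all pairs of genes."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # guard genes without any annotation
    U = A / norms
    return U @ U.T

def lowest_vote(A, i, k):
    """Eq. (5): votes from the k nearest neighbors for every term of gene i."""
    S = psim(A)
    order = np.argsort(-S[i])          # genes sorted by similarity to gene i
    neighbors = [j for j in order if j != i][:k]
    # Only terms currently annotated to gene i (A[i] == 1) receive votes.
    return A[neighbors].sum(axis=0) * (A[i] == 1)

# Toy matrix: 3 genes x 3 terms; gene 0 is annotated with terms 0 and 1.
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 0, 1]])
votes = lowest_vote(A, 0, k=2)
```

The annotated term with the smallest vote (term 1 here, receiving zero votes from the two neighbors) is the LowestVote candidate for a noisy annotation.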


To mitigate this problem, we leverage the semantic similarity between genes and the taxonomic similarity between terms as follows:

WV(i, t) = Σ_{j ∈ Nk(i)} max_{s: A(j, s) = 1} tsim(t, s)    (6)
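A minimal sketch of the weighted vote in Eq. (6), assuming a taxonomic-similarity function tsim as in Eq. (2) is available (all names here are our own illustration, not the authors' implementation):

```python
def noisy_goa_score(A, i, neighbors, terms, tsim):
    """Aggregated taxonomic score WV(i, t) of Eq. (6) for each term of gene i.

    A is the gene-term matrix (lists of 0/1), `neighbors` the k nearest
    neighbors of gene i, and `tsim(t, s)` a taxonomic similarity as in Eq. (2).
    """
    scores = {}
    for t_idx, t in enumerate(terms):
        if A[i][t_idx] != 1:
            continue                   # score only terms annotated to gene i
        wv = 0.0
        for j in neighbors:
            annotated = [s for s_idx, s in enumerate(terms) if A[j][s_idx] == 1]
            if annotated:
                wv += max(tsim(t, s) for s in annotated)
        scores[t] = wv
    return scores

# Toy example: gene 0 carries terms "a" and "b"; its neighbor carries "b", "c".
terms = ["a", "b", "c"]
A = [[1, 1, 0],
     [0, 1, 1]]
sim = lambda t, s: 1.0 if t == s else 0.5   # stand-in for Eq. (2)
scores = noisy_goa_score(A, 0, [1], terms, sim)
candidate = min(scores, key=scores.get)      # term with the lowest aggregated score
```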

If A(i, t) = 1 is a noisy annotation and t has a frequency equal to (or even larger than) that of other terms, t can still be identified as a noisy annotation by Eq. (6), since t may have small taxonomic similarity with respect to the terms annotated to neighborhood genes. We name our proposed method based on Eq. (6) NoisyGOA. From Eq. (6), we can observe that the term with the lowest frequency and the smallest taxonomic similarity to the other terms currently annotated to the neighborhood genes of a gene is more likely to be selected as a noisy annotation of the gene than other terms.

Let us return to Fig. 1 to illustrate our idea. From Fig. 1, we can observe that all terms (except 'GO:0032059' and 'GO:0001400') annotated to 'AFR1' have the maximum taxonomic similarity (equal to 1) with the terms associated with 'ENT2' and 'WSC3'. 'GO:0032059' and 'GO:0001400' have the same frequency in the neighborhood genes, and the maximum taxonomic similarities between 'GO:0032059' (or 'GO:0001400') and the terms annotated to 'ENT2' and 'WSC3' are below 1. Particularly, the maximum taxonomic similarity between 'GO:0032059' and these terms is 0.4818 (with respect to 'GO:0042995'), and the maximum taxonomic similarity between 'GO:0001400' and these terms is 0.8466 (with respect to 'GO:0005937'). Since 'GO:0032059' has a smaller taxonomic similarity with the terms annotated to 'ENT2' and 'WSC3' than 'GO:0001400', it is more likely to be predicted as a noisy annotation of 'AFR1'. Here, we would like to note that, if t is a noisy annotation of gene i, and it has both a large frequency and a large taxonomic similarity with the terms annotated to the neighborhood genes of the ith gene, then t is difficult to identify as a noisy annotation by NoisyGOA. Indeed, this kind of noisy annotation is more challenging and remains for future work.

3. Experimental results and analysis

3.1.
Datasets and experimental setup

To investigate the performance of NoisyGOA in identifying noisy annotations, we carried out experiments on both simulated and real-world datasets. We studied the performance of NoisyGOA on the GO annotations of three species: S. cerevisiae, H. sapiens and A. thaliana. The GO file contains the hierarchical relationships between GO terms organized in the BP, CC and MF sub-ontologies. The GOA files provide the annotations that currently most appropriately and precisely describe the biological roles of genes. The annotations in GOA files are often viewed as 'direct' annotations. A protein annotated with a term is also annotated with that term's ancestors via any path of the GO DAG; this rule is recognized as the true path

Fig. 1. GO annotations of genes (‘AFR1’, ‘ENT2’ and ‘WSC3’) of S. cerevisiae. ‘ENT2’ and ‘WSC3’ are the two nearest neighborhood genes of ‘AFR1’. ‘GO:0032059’ is a noisy annotation of ‘AFR1’. Our task is to predict ‘GO:0032059’ as a noisy annotation of ‘AFR1’ by using the annotations of neighborhood genes and ontology structure.


Table 1
Statistics of GO annotations on S. cerevisiae, H. sapiens and A. thaliana. The number in parentheses beside each dataset is the number of genes, |T| is the number of involved terms for the dataset in each sub-ontology, and the last column is the number of noisy annotations from archived GOA files of the same species.

Dataset                 Sub-ontology    |T|        Noisy annotations
S. cerevisiae (6381)    BP              4816       8441
                        CC              970        1161
                        MF              2231       3414
H. sapiens (19170)      BP              12,564     33,785
                        CC              1580       6979
                        MF              3718       3849
A. thaliana (24523)     BP              4910       299,591
                        CC              745        4506
                        MF              2272       6742

rule (Ashburner et al., 2000; Valentini, 2011). We applied the true path rule to annotate all the ancestor terms of the direct annotations of a gene to the same gene.

There are no benchmark datasets that explicitly record noisy annotations. To mimic noisy annotations, we performed two sets of experiments. The first set of experiments assumes the available annotations of a gene are noise free, and then randomly appends m noisy annotations to the gene. The second set of experiments uses historical GOA files (archived date: 2015-05-26) and validates predicted noisy annotations using recent GOA files (archived date: 2015-12-07) of these species, respectively. If an annotation (including one appended by the true path rule) in the historical GOA file is not kept in the recent GOA file, then this annotation is viewed as a noisy one. For the simulated experiments, we excluded the annotations with evidence codes 'IEA' (Inferred from Electronic Annotation), 'NR' (Not Recorded), 'ND' (No biological Data available), and 'IC' (Inferred by Curator) to avoid circular prediction. For the experiments on archived GOA files, we made use of all the annotations, irrespective of their evidence codes. The statistics of these processed datasets are listed in Table 1. For example, there are 6381 genes in the S. cerevisiae GOA file; these genes are annotated with 4816 terms in the BP sub-ontology, and there are 8441 noisy annotations.

3.2. Methods comparison and evaluation metrics

As shown in Fig. 1, the terms annotated to a gene form a hierarchy by themselves. We take terms without any descendant in that hierarchy as leaf terms. Based on the hierarchical nature of these terms, we introduce some related methods: RandomLeaf, Deepest, Random, LowestFreq and LowestVote. The details of these methods are as follows:

(i) RandomLeaf randomly selects m leaf terms annotated to a gene and takes the selected terms as noisy annotations of the gene. Note, once an ancestor term has no descendant terms in the hierarchy, it can also be selected as a noisy annotation of the gene.
(ii) Deepest recursively selects terms in the deepest level of the hierarchy as noisy annotations.
(iii) Random randomly chooses a term and its descendants (if any) in T as noisy annotations.
(iv) LowestFreq randomly selects terms with the lowest frequency among the N genes as noisy annotations.
(v) LowestVote is a simple and widely adopted technique; its objective function is introduced in Eq. (5). Particularly, it first chooses the k nearest neighborhood genes of a gene based on the semantic similarity (see Eq. (4)) between genes. Then, these genes vote on whether a term should be annotated to the gene based on the terms annotated to themselves or not. Next, LowestVote summarizes the votes from these genes and takes the terms with the lowest votes as noisy annotations.

We adopt three metrics, Precision, Recall and F1-Score, to evaluate the performance of predicting noisy annotations. The formal definition of these metrics is as follows:

p_i = TP_i / (TP_i + FP_i),  r_i = TP_i / (TP_i + FN_i)    (7)

Precision = (1/N) Σ_{i=1}^{N} p_i,  Recall = (1/N) Σ_{i=1}^{N} r_i    (8)

F1-Score = (1/N) Σ_{i=1}^{N} (2 × p_i × r_i) / (p_i + r_i)    (9)

where TP_i is the number of correctly predicted noisy annotations of the ith gene, FP_i is the number of wrongly predicted noisy annotations, FN_i is the number of noisy annotations missed by the predictor, and the number of predicted noisy annotations is equal to m. p_i and r_i are the precision and recall on the ith gene; they evaluate the fraction of predicted noisy annotations that are true noisy annotations and the fraction of noisy annotations that are predicted correctly, respectively. Precision and Recall are the overall precision and recall on the N genes. It is rather difficult for a predictor to achieve both high Precision and high Recall. Thus, F1-Score, the average of the harmonic mean of p_i and r_i, is also adopted to evaluate the overall performance on predicting noisy annotations.

3.3. Experiments on simulated noisy annotations

In the simulated experiments, we assume the available GO annotations of genes are free of noise, and randomly inject a fixed number of noisy annotations into each annotated gene in two different configurations. Some genes do not have annotations in a sub-ontology; we do not inject noisy annotations into these genes. To account for the randomness, we repeat the simulated experiments 10 times for each fixed setting and report the average results. In the following simulated experiments, k for LowestVote and NoisyGOA is set to 5.

Here, we randomly annotate a gene with a term that is a direct child of the terms currently annotated to the gene. For example, suppose 'AFR1' in Fig. 1 is annotated with 7 terms, excluding 'GO:0032059'. 'GO:0032059' is a direct child of 'GO:0042995', which is already annotated to 'AFR1'. We can annotate 'AFR1' with 'GO:0032059' and take this annotation as a noisy one. Next, any direct child of 'GO:0032059' or of the other 7 terms can be annotated to 'AFR1' and used as another noisy annotation. We repeat the above process to append m = 1, 3 and 5 noisy annotations to each gene, and report the experimental results on S. cerevisiae in the BP sub-ontology in Table 2. Other results on these three species are given in Tables S1–S8 of the Supplementary file. From Tables 2 and S1–S8, we have the following observations:

(i) NoisyGOA outperforms the comparing methods across different evaluation metrics in most cases.
(ii) NoisyGOA performs significantly better than LowestVote, which only utilizes the semantic similarity between genes to vote for noisy annotations. This observation supports our motivation to integrate taxonomic similarity between terms and semantic similarity between genes for predicting noisy annotations.
(iii) NoisyGOA, LowestVote and LowestFreq get better results than the other comparing methods. This is because noisy annotations often correspond to terms with low frequency, and these three approaches utilize the frequency of terms in different manners. This observation also suggests that the frequency of a term


Table 2
Performance of predicting noisy annotations of S. cerevisiae in the BP sub-ontology by adding a fixed number (m) of noisy annotations to a gene. The numbers in boldface denote the statistically best performance; significance is checked by t-test at the 95% significance level; m is the number of added noisy annotations for a gene.

Metric      m   Deepest        LowestFreq     Random         RandomLeaf     LowestVote     NoisyGOA
Precision   1   17.07 ± 0.14   67.42 ± 0.49   16.50 ± 0.19   45.08 ± 0.46   57.21 ± 0.10   66.79 ± 0.30
            3   22.54 ± 0.10   66.69 ± 0.16   22.02 ± 0.05   65.61 ± 0.23   67.39 ± 0.20   70.41 ± 0.12
            5   27.10 ± 0.19   68.20 ± 0.16   27.15 ± 0.06   74.40 ± 0.12   70.93 ± 0.16   73.76 ± 0.12
Recall      1   17.07 ± 0.14   67.43 ± 0.49   38.71 ± 0.14   45.08 ± 0.46   59.26 ± 0.08   74.92 ± 0.36
            3   22.54 ± 0.10   67.46 ± 0.14   61.35 ± 0.36   62.13 ± 0.23   70.44 ± 0.16   78.59 ± 0.11
            5   27.10 ± 0.19   68.90 ± 0.17   72.47 ± 0.23   69.37 ± 0.19   73.89 ± 0.17   81.23 ± 0.15
F1-Score    1   17.07 ± 0.14   67.43 ± 0.49   19.45 ± 0.17   45.08 ± 0.46   57.83 ± 0.09   69.10 ± 0.29
            3   22.54 ± 0.10   67.00 ± 0.15   28.61 ± 0.10   63.39 ± 0.22   68.56 ± 0.19   72.99 ± 0.07
            5   27.10 ± 0.19   68.51 ± 0.16   35.61 ± 0.06   71.27 ± 0.17   72.16 ± 0.16   76.30 ± 0.13

can be used as an important feature for identifying noisy annotations.
(iv) LowestVote shows improved accuracy over LowestFreq. This improvement demonstrates the contribution of semantic similarity to predicting noisy annotations. Deepest is even outperformed by Random, because noisy annotations do not always correspond to terms located in the deepest level of the hierarchy. Since the added noisy annotations correspond to leaf terms of the hierarchy of a gene, RandomLeaf always works better than Random.
(v) Precision, Recall and F1-Score increase as the number of added noisy annotations rises. That is because the number of predicted noisy annotations increases with the number of added noisy annotations, and thus the number of correctly predicted noisy annotations also increases.

In addition, we perform another configuration of simulated noisy annotations. Following the same experimental protocol as for adding a fixed number of noisy annotations to a gene, we instead add noisy annotations to a gene at a fixed ratio. For example, if the ratio of noisy annotations is 20% and a gene is annotated with 15 terms, then 3 noisy annotations are added to the gene. If a gene is annotated with 11 terms, then 2 noisy annotations are added to the gene. The fractional number is rounded to the nearest integer. We increase p from 10% to 30% with step-size 10% and report the recorded results on S. cerevisiae in the BP sub-ontology in Table 3. Other results on S. cerevisiae are given in Tables S9 and S10 of the Supplementary file. As can be seen from these tables, the results yield similar observations, since the manner of injecting noisy annotations into a gene is the same as in the case of adding a fixed number of noisy annotations. However, the performance of NoisyGOA, LowestVote and LowestFreq degrades much more in the BP sub-ontology; the reason is that each fixed ratio of noisy annotations adds more noisy annotations than the previous setting. For example, for p = 20% on S. cerevisiae in the BP sub-ontology, approximately 8 noisy annotations are added to each gene. Even with more noisy annotations in this simulated case, NoisyGOA still gets better results than the other comparing methods. These results further support our idea of utilizing taxonomic similarity and semantic similarity for predicting noisy annotations.
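The noise-injection protocol above (repeatedly annotating a gene with a direct child of a term it already carries) can be sketched as follows; the children map and function names are our own illustration, not the authors' code:

```python
import random

def inject_noise(annotated, children, m, rng=random):
    """Simulate m noisy annotations for one gene (Section 3.3 protocol):
    each injected term is a direct child of a term the gene already carries.

    annotated: set of the gene's current terms (assumed noise free);
    children[t]: direct children of term t in the GO DAG (illustrative input).
    """
    current = set(annotated)
    noisy = []
    for _ in range(m):
        candidates = [c for t in current for c in children.get(t, ())
                      if c not in current]
        if not candidates:
            break                      # no direct child left to inject
        c = rng.choice(candidates)
        current.add(c)                 # later noise may be a child of earlier noise
        noisy.append(c)
    return noisy

# Toy chain a -> b -> c: the first noisy term must be "b", the second "c".
children = {"a": ["b"], "b": ["c"]}
noisy = inject_noise({"a"}, children, m=2, rng=random.Random(0))
```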

3.4. Experiments on archived GOA files

In this subsection, we conduct experiments on archived GOA files. We downloaded GOA files archived on two different dates for each species. We name the GOA file archived on 2015-05-26 the historical GOA file, and the GOA file archived on 2015-12-07 the recent GOA file. To measure the quality of these methods in identifying noisy annotations, we adopt another metric, namely the Matthews correlation coefficient (MCC) (Matthews, 1975). MCC is calculated as follows:

MCC = (1/N) Σ_{i=1}^{N} (TP_i × TN_i − FP_i × FN_i) / sqrt((TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i))    (10)

where TN_i is the number of correctly predicted annotations that are not noisy annotations of gene i. MCC lies between −1 and 1: 1 represents a perfect prediction, 0 corresponds to random prediction, and −1 indicates total disagreement between prediction and observation. NoisyGOA and the comparing methods are trained on all the annotations in the historical GOA file, including the ones appended by the true path rule. Next, these methods are validated against the noisy annotations, which were present in the historical GOA file but absent in the recent GOA file. The recorded results on S. cerevisiae are reported in Table 4, and the results on H. sapiens and A. thaliana are reported in Tables S11 and S12 of the Supplementary file.

From the results on archived GOA files, we can easily find that, although NoisyGOA does not always have a larger Precision (or Recall) than the other comparing methods, it achieves a larger F1-Score than the other approaches in most cases. This overall superior performance again supports leveraging semantic similarity and taxonomic similarity for predicting noisy annotations. However, the overall performance of NoisyGOA and LowestVote degrades considerably on archived GOA files. The possible reason is that noisy annotations do not always correspond to direct child terms of the terms currently annotated to a gene. Instead, they are descendants of one (or several) of the terms currently annotated to the gene.
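Eq. (10) can be sketched as follows (our own minimal illustration; the per-gene confusion counts are assumed to be given):

```python
import math

def mcc_per_gene(tp, fp, fn, tn):
    """Inner term of Eq. (10): MCC for a single gene."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def mcc(counts):
    """Eq. (10): MCC averaged over N genes; counts is a list of (TP, FP, FN, TN)."""
    return sum(mcc_per_gene(*c) for c in counts) / len(counts)

# A perfect prediction gives 1; a balanced, chance-level one gives 0.
perfect = mcc([(2, 0, 0, 3)])
chance = mcc([(1, 1, 1, 1)])
```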

Table 3
Performance of predicting noisy annotations of S. cerevisiae in the BP sub-ontology when adding a fixed ratio (p) of noisy annotations to a gene. The numbers in boldface denote the statistically best performance (checked by t-test at the 95% significance level); p is the percentage of added noisy annotations for a gene.

Metric     p(%)  Deepest       LowestFreq    Random        RandomLeaf    LowestVote    NoisyGOA
Precision  10    6.45 ± 0.25   52.70 ± 0.19  9.80 ± 0.08   54.28 ± 0.20  57.12 ± 0.16  59.11 ± 0.17
           20    12.69 ± 0.07  55.22 ± 0.12  16.34 ± 0.14  65.38 ± 0.12  64.61 ± 0.09  63.52 ± 0.13
           30    19.56 ± 0.16  57.07 ± 0.05  21.89 ± 0.10  68.79 ± 0.08  67.44 ± 0.15  66.43 ± 0.13
Recall     10    6.45 ± 0.25   53.31 ± 0.18  49.69 ± 0.45  54.28 ± 0.20  59.19 ± 0.12  65.58 ± 0.13
           20    12.69 ± 0.07  55.82 ± 0.13  65.20 ± 0.42  65.32 ± 0.12  67.44 ± 0.10  69.71 ± 0.09
           30    19.56 ± 0.16  57.61 ± 0.06  72.46 ± 0.23  68.66 ± 0.07  70.45 ± 0.16  72.05 ± 0.07
F1-Score   10    6.45 ± 0.25   52.96 ± 0.19  15.47 ± 0.10  54.28 ± 0.20  57.75 ± 0.15  61.13 ± 0.15
           20    12.69 ± 0.07  55.50 ± 0.13  25.43 ± 0.16  65.35 ± 0.12  65.61 ± 0.09  65.79 ± 0.12
           30    19.56 ± 0.16  57.33 ± 0.05  33.12 ± 0.12  68.72 ± 0.08  68.59 ± 0.15  68.66 ± 0.10

Please cite this article in press as: Lu, C., et al., NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.09.005


Table 4
Performance of predicting noisy annotations in S. cerevisiae on archived GOA files. The numbers in boldface denote the best performance.

     Metric     Deepest  LowestFreq  Random  RandomLeaf  LowestVote  NoisyGOA
BP   Precision  28.02    30.79       21.28   38.86       37.96       38.56
     Recall     28.02    31.10       65.22   25.38       52.12       52.72
     F1-Score   28.02    30.90       29.93   21.44       41.92       42.40
     MCC        12.16    16.96       7.46    17.15       20.98       23.42
CC   Precision  37.40    46.57       28.25   51.61       52.02       50.29
     Recall     37.40    46.66       73.63   27.90       82.26       83.04
     F1-Score   37.40    46.61       39.14   33.13       58.80       57.94
     MCC        20.96    33.61       9.87    23.89       35.48       38.56
MF   Precision  26.98    26.11       28.65   48.86       34.26       35.63
     Recall     26.98    26.11       50.49   37.27       56.17       56.77
     F1-Score   26.98    26.11       33.83   40.89       40.03       41.08
     MCC        13.92    11.80       5.47    31.02       6.83        11.98

LowestFreq and RandomLeaf often get larger Precision than the other comparing methods. The reason is twofold. (i) Noisy annotations often correspond to low-frequency terms and leaf terms of the hierarchy of a gene. (ii) Due to the true path rule in the GO hierarchy, leaf terms have lower frequency than their ancestor terms, so these two approaches prefer to select leaf terms of the hierarchy of a gene as noisy annotations, and thus the number of wrongly predicted noisy annotations is much smaller than that of the other approaches. In contrast, Random randomly selects terms currently associated with a gene as noisy annotations of that gene; furthermore, the descendants of these selected terms are also deemed noisy annotations. Therefore, Random obtains larger Recall than the other comparing methods and gets a better F1-Score than RandomLeaf. Another interesting observation is that NoisyGOA gets a smaller F1-Score than LowestFreq on A. thaliana in CC and MF. That is because each gene of A. thaliana in CC and MF is, on average, annotated with far fewer terms than in BP, and the semantic similarity between genes in CC and MF is therefore less reliable than that in BP. As a result, NoisyGOA is outperformed by some of the comparing algorithms in these two sub-ontologies.

From Tables 4 and S11–S12, we can see that the results with respect to MCC are similar to those for F1-Score. LowestFreq, LowestVote and NoisyGOA consistently have better results than the other methods, which indicates the importance of utilizing the frequency of terms. Compared with LowestVote and NoisyGOA, LowestFreq gets a larger MCC on H. sapiens and A. thaliana in the MF sub-ontology. That is principally because fewer terms annotate genes in MF than in BP, so the neighborhood genes of a gene are less reliable than in BP.

3.5. Influence of taxonomic similarity

Taxonomic similarity between ontological terms contributes to predicting noisy annotations, as shown in the previous subsections.
In this subsection, we investigate the influence of taxonomic similarity by substituting the taxonomic similarity of NoisyGOA (see Eq. (2)) with two other widely adopted taxonomic similarities: Lin's similarity (Lin, 1998) and a path-based similarity (Benabderrahmane et al., 2010). Lin's similarity is similar to the taxonomic similarity used by NoisyGOA, except that it defines IC(t) based on the frequency of t among the N genes. The path-based taxonomic similarity is defined as follows:

tsim_path(t1, t2) = 2 × depth(t*) / (MinSPL(t1, t2) + 2 × depth(t*))    (11)

where t* is the most specific common ancestor of t1 and t2, and depth(t*) is the depth of t* in the GO hierarchy. MinSPL(t1, t2) is the length of the shortest path between t1 and t2 passing through t*. The shorter the path between t1 and t2, and the deeper t* is, the more similar t1 and t2 are. Clearly, these two taxonomic similarities

Table 5
Performance of NoisyGOA using different taxonomic similarities in predicting noisy annotations of S. cerevisiae. The numbers in boldface denote the best performance.

     Metric     NoisyGOA-path  NoisyGOA-Lin  NoisyGOA
BP   Precision  18.56          37.80         38.56
     Recall     100.00         51.07         52.72
     F1-Score   27.96          41.39         42.40
     MCC        0.00           22.68         23.42
CC   Precision  47.29          50.18         50.29
     Recall     81.10          82.87         83.04
     F1-Score   55.33          57.79         57.94
     MCC        27.62          30.78         38.56
MF   Precision  25.53          38.77         35.63
     Recall     97.81          59.52         56.77
     F1-Score   35.32          44.09         41.08
     MCC        13.07          14.14         11.98

also employ the ontological structure. We name NoisyGOA based on Lin's similarity (or the path-based similarity) NoisyGOA-Lin (or NoisyGOA-path). The results on archived GOA files of S. cerevisiae are reported in Table 5; the results on archived GOA files of H. sapiens and A. thaliana are reported in Tables S13 and S14 of the Supplementary file, and the results of the simulated experiments on S. cerevisiae are provided in Table S15 of the Supplementary file.

From these tables, we can see that NoisyGOA has different performance ranks on different datasets and sub-ontologies. This observation suggests that the taxonomic similarity between ontological terms heavily affects the performance of NoisyGOA. Generally speaking, the performance of NoisyGOA-path is slightly lower than that of NoisyGOA-Lin and NoisyGOA in most cases. A possible reason is that the path-based taxonomic similarity only considers the distance between two terms, ignoring the differences between paths. One interesting observation is that NoisyGOA-path gets 100% Recall in the BP sub-ontology. The cause is that many more terms are involved in the BP sub-ontology than in the other sub-ontologies, and NoisyGOA-path prefers to choose terms close to the root as noisy annotations; in addition, descendants of these selected terms are also deemed noisy annotations. Thus, NoisyGOA-path gets large Recall but small Precision in BP. Although NoisyGOA and NoisyGOA-Lin often get similar results on archived GOA files, NoisyGOA obtains better results than NoisyGOA-Lin in the experiments with simulated noisy annotations. The cause is that there are many more noisy annotations in the simulated experiments, and NoisyGOA-Lin uses t's frequency among the N genes to define IC(t), which is distorted by the added noisy annotations. NoisyGOA defines IC(t) using the ontological structure; its IC(t) is thus independent of the frequency of t and robust to the added noisy annotations.
The results for MCC follow a similar pattern to those for F1-Score. A notable observation is that NoisyGOA-path has a zero MCC in BP; that is because NoisyGOA-path gets 100% Recall there.
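For concreteness, the path-based similarity of Eq. (11) and a Lin-style similarity can be sketched over a toy ontology as below. This is a simplified illustration: the toy hierarchy is a tree with a single parent per term (GO is a DAG, so real code must handle multiple parents), and the IC values passed to the Lin-style function are hypothetical rather than computed from a real GOA corpus:

```python
# Toy ontology as child -> parent (a tree for brevity; GO is a DAG).
PARENT = {"b": "root", "c": "root", "d": "b", "e": "b"}

def ancestors(t):
    """t and all of its ancestors, ordered from t up to the root."""
    chain = [t]
    while t in PARENT:
        t = PARENT[t]
        chain.append(t)
    return chain

def depth(t):
    return len(ancestors(t)) - 1  # the root has depth 0

def lca(t1, t2):
    """Most specific common ancestor t* of t1 and t2."""
    a2 = set(ancestors(t2))
    return next(t for t in ancestors(t1) if t in a2)

def tsim_path(t1, t2):
    """Path-based taxonomic similarity of Eq. (11)."""
    t_star = lca(t1, t2)
    min_spl = ancestors(t1).index(t_star) + ancestors(t2).index(t_star)
    d = 2 * depth(t_star)
    return d / (min_spl + d) if (min_spl + d) else 1.0

def tsim_lin(t1, t2, ic):
    """Lin-style similarity: 2*IC(t*) / (IC(t1) + IC(t2)),
    with a frequency-based IC supplied as a dict."""
    s = ic[t1] + ic[t2]
    return 2 * ic[lca(t1, t2)] / s if s else 1.0

print(tsim_path("d", "e"))  # siblings under "b": 2/(2 + 2) = 0.5
ic = {"root": 0.0, "b": 1.0, "c": 1.0, "d": 2.0, "e": 2.0}  # hypothetical IC values
print(tsim_lin("d", "e", ic))  # 2*1/(2+2) = 0.5
```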

Please cite this article in press as: Lu, C., et al., NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.09.005


Table 6
Performance of NoisyGOA using different semantic similarities in predicting noisy annotations of S. cerevisiae. The numbers in boldface denote the best performance.

     Metric     GIC-NoisyGOA  TO-NoisyGOA  NoisyGOA
BP   Precision  39.00         22.75        38.56
     Recall     53.55         48.98        52.72
     F1-Score   42.86         28.84        42.40
     MCC        23.26         9.47         23.42
CC   Precision  51.02         35.36        50.29
     Recall     79.85         71.22        83.04
     F1-Score   58.14         42.67        57.94
     MCC        32.55         23.12        38.56
MF   Precision  42.93         24.78        35.63
     Recall     61.50         28.42        56.77
     F1-Score   47.81         25.71        41.08
     MCC        25.45         −1.23        11.98

3.6. Influence of semantic similarity

The semantic similarity between genes is another component of NoisyGOA. To study the influence of semantic similarity on NoisyGOA, we introduce two variants of NoisyGOA: GIC-NoisyGOA and TO-NoisyGOA. GIC-NoisyGOA is based on a graph-based metric that measures the semantic similarity between genes (Pesquita et al., 2008), and TO-NoisyGOA is based on a term-overlap based metric (Mistry and Pavlidis, 2008). These two metrics are defined as:

sim_GIC(i, j) = Σ_{t ∈ Ti ∩ Tj} IC(t) / Σ_{t ∈ Ti ∪ Tj} IC(t)    (12)

sim_TO(i, j) = |Ti ∩ Tj| / min(|Ti|, |Tj|)    (13)

where IC(t) = −log2(Σ_{i=1}^{N} A(i, t)/N), and Ti is the set of terms annotated to the ith gene, including the terms derived from direct annotations.

The results of predicting noisy annotations on archived GOA files are presented in Tables 6 and S16–S17 of the Supplementary file. We also report the results of the simulated experiments on S. cerevisiae in Table S18 of the Supplementary file. From these tables, we can conclude that NoisyGOA and GIC-NoisyGOA achieve better results than TO-NoisyGOA in most cases, and that none of these methods always performs better than the others. This shows that the semantic similarity between genes also has an important influence on NoisyGOA. We would like to remark that choosing an effective semantic similarity metric between genes is a non-trivial job (Pesquita et al., 2009; Teng et al., 2013).

The experiments with simulated noisy annotations (Table S18 of the Supplementary file) show more consistent results: NoisyGOA gets better performance than the two variants in predicting noisy annotations of S. cerevisiae. In fact, we also compared these methods on the other two species under different values of m; the results are similar to those reported in Table S18. GIC-NoisyGOA always performs better than TO-NoisyGOA. The cause is that GIC-NoisyGOA takes advantage of both the frequency of terms and the ontological graph, whereas TO-NoisyGOA only considers the terms annotated to the two genes. GIC-NoisyGOA is outperformed by NoisyGOA in most cases; that is because GIC-NoisyGOA defines IC(t) based on the frequency of t, and this IC(t) is impacted by the added noisy annotations.

3.7. Noisy annotations in gene function prediction

To investigate whether removing noisy annotations can improve the performance of gene function prediction, we downloaded the protein–protein interaction (PPI) networks of these


three species from BioGrid (date: July 31, 2016) for experiments. We choose annotations whose scores equal zero in Eq. (6) as predicted noisy annotations, and remove these annotations from the historical GOA files. Then a network-based function prediction model, Majority vote (Sharan et al., 2000), which relies on the updated annotations of the interacting partners of a protein, is used to predict annotations of that protein. After that, we use the annotations in the recent GOA files to validate the predicted annotations. For comparison, we also apply Majority vote with the annotations in the historical GOA file directly. Seven evaluation metrics, namely MicroAvgF1, MacroAvgF1, HammLoss, RankLoss, AvgPrec, AvgROC and Fmax, which have been applied to evaluate gene function prediction (Radivojac et al., 2013; Fu et al., 2016), are used here for a comprehensive evaluation. To keep consistency with the other metrics, we use 1-RankLoss and 1-HammLoss instead of RankLoss and HammLoss, respectively; thus, the higher the value of each metric, the better the performance. These metrics measure performance from different aspects, and it is difficult for a method to consistently perform better than others across all of them. The formal definitions of these metrics are provided in the Supplementary file. The results are shown in Tables 7–9.

From Tables 7–9, we can clearly observe that removing noisy annotations improves the performance of gene function prediction compared with keeping them. The results on S. cerevisiae do not show significant improvement, however, because there are fewer noisy annotations in S. cerevisiae than in A. thaliana and H. sapiens.
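A minimal sketch of the Majority vote scheme is given below. It assumes predicted noisy annotations have already been removed from the annotation dictionary (mimicking the 'Removed' setting); the data and the parameter k (number of predicted terms) are illustrative, and Sharan et al.'s original formulation may differ in detail:

```python
from collections import Counter

def majority_vote(protein, ppi, annotations, k=3):
    """Predict GO terms for `protein` as the k terms most frequently
    annotated to its interaction partners (after Sharan et al., 2000).

    ppi: dict mapping a protein to the set of its interacting proteins.
    annotations: dict mapping a protein to its set of GO terms
    (with predicted noisy annotations already removed)."""
    votes = Counter()
    for partner in ppi.get(protein, ()):
        votes.update(annotations.get(partner, ()))
    return [term for term, _ in votes.most_common(k)]

# Hypothetical toy network and annotations.
ppi = {"p1": {"p2", "p3", "p4"}}
ann = {"p2": {"GO:a", "GO:b"}, "p3": {"GO:a"}, "p4": {"GO:a", "GO:c"}}
print(majority_vote("p1", ppi, ann, k=1))  # ['GO:a'] (3 of 3 partners vote for it)
```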

4. Discussion

Predicting noisy annotations differs from the extensively studied gene function prediction task, which mainly focuses on predicting functions of completely un-annotated genes. That task is one of the two subtasks of the Critical Assessment of Function Annotation challenge (CAFA2) (Radivojac et al., 2013; Wass et al., 2014), and it often utilizes homology information of genes/proteins to transfer annotations from annotated genes to un-annotated ones (Sharan et al., 2007; Lee et al., 2007; Radivojac et al., 2013). More recently, predicting additional functions of an incompletely annotated gene was set as another subtask of CAFA2 (Wass et al., 2014; Kahanda et al., 2015). Several algorithms have been proposed for this subtask (Tao et al., 2007; Yu et al., 2015) or for both subtasks (Yu et al., 2014, 2015).

Predicting noisy annotations also differs from selecting negative examples, i.e., genes (or their products) known not to carry out a given function (Youngs et al., 2014; Fu et al., 2016), which is another interesting problem that concentrates on selecting true negative annotations from unspecified annotations. An unspecified annotation means that the association between a gene and a given function is currently unknown, not yet checked by GO curators, or not included in the GOA file. Obviously, the number of unspecified annotations is much larger than that of specified annotations, since GOC almost always only provides specified annotations that reveal a gene carries out a given function. In contrast, predicting noisy annotations focuses on selecting false positive annotations from specified annotations, including both manual and inferred ones.

NoisyGOA takes advantage of the taxonomic similarity between ontological terms to identify noisy annotations. Measuring the taxonomic similarity between ontological terms is an interesting topic.
Various taxonomic similarity metrics, capturing different characteristics of the graph-structured ontology, have been proposed (Pesquita et al., 2009). These metrics take advantage of the



Table 7
Results of gene function prediction on S. cerevisiae. The numbers in boldface denote the better result. 'Historical' directly uses annotations in the historical GOA file to predict gene functions; 'Removed' removes predicted noisy annotations from the historical GOA file and then predicts gene functions.

              BP                      CC                      MF
Metric        Historical  Removed     Historical  Removed     Historical  Removed
MicroAvgF1    89.81       89.82       93.85       93.84       92.57       92.55
MacroAvgF1    75.13       75.12       80.76       80.53       86.84       86.84
AvgROC        87.20       87.20       86.96       86.88       93.01       93.00
1-HammLoss    99.84       99.84       99.77       99.77       99.93       99.93
1-RankLoss    99.32       99.32       99.66       99.66       99.31       99.31
AvgPrec       81.31       81.41       85.22       85.46       88.01       88.02
Fmax          89.34       89.41       91.55       91.68       91.50       91.49

Table 8
Results of gene function prediction on H. sapiens. The numbers in boldface denote the better result. 'Historical' directly uses annotations in the historical GOA file to predict gene functions; 'Removed' removes predicted noisy annotations from the historical GOA file and then predicts gene functions.

              BP                      CC                      MF
Metric        Historical  Removed     Historical  Removed     Historical  Removed
MicroAvgF1    87.87       88.17       91.40       91.75       91.45       91.77
MacroAvgF1    83.28       83.63       82.93       83.42       86.24       86.67
AvgROC        92.98       93.16       93.18       93.36       95.53       95.68
1-HammLoss    99.85       99.85       99.75       99.76       99.94       99.94
1-RankLoss    97.59       97.58       98.96       98.95       98.39       98.38
AvgPrec       83.64       84.09       86.62       87.17       89.15       89.53
Fmax          90.74       90.88       92.38       92.58       93.95       94.07

Table 9
Results of gene function prediction on A. thaliana. The numbers in boldface denote the better result. 'Historical' directly uses annotations in the historical GOA file to predict gene functions; 'Removed' removes predicted noisy annotations from the historical GOA file and then predicts gene functions.

              BP                      CC                      MF
Metric        Historical  Removed     Historical  Removed     Historical  Removed
MicroAvgF1    57.55       59.10       77.01       78.65       65.27       67.37
MacroAvgF1    48.65       50.54       67.35       68.83       54.37       58.19
AvgROC        82.15       83.30       80.13       81.17       75.27       76.69
1-HammLoss    98.96       98.99       98.77       98.86       99.38       99.42
1-RankLoss    97.42       97.36       98.79       98.80       98.83       98.78
AvgPrec       49.76       51.66       75.09       76.80       63.52       65.72
Fmax          74.08       74.77       93.01       93.32       91.35       91.79

properties of the considered terms themselves, their ancestors, their descendants and disjointness (Ferreira et al., 2013). Some metrics depend on the information content of a term, which is independent of the term's depth in the ontology, to define how specific (or informative) the term is, and measure the similarity between pairwise terms based on the defined information content (Jiang and Conrath, 1997; Lin, 1998; Yang et al., 2012). Other metrics make use of the distance (or path) between two terms, the distance between the lowest common ancestor of the pairwise terms and the root term, or the nearest leaf term (Hui et al., 2005; Wu et al., 2006). Hybrid metrics additionally take into account the types of relationships (is-a, part-of, regulates) between terms, the hierarchical structure, and the properties (i.e., information content) of terms (Wang et al., 2007; Othman et al., 2008). In this paper, we measure the taxonomic similarity between two terms based on the properties of the considered terms themselves and the ontology structure. We experimentally compared this taxonomic similarity with two other representative taxonomic similarities (Lin, 1998; Benabderrahmane et al., 2010). The experimental results show that these taxonomic similarities perform differently on different datasets and that the taxonomic similarity adopted by NoisyGOA gets better results than the others in most cases. We believe pursuing an improved taxonomic similarity can further boost the performance of NoisyGOA.

NoisyGOA also takes advantage of the semantic similarity between genes. Similar to taxonomic similarity, semantic similarity

is another ongoing research topic with various applications (Pesquita et al., 2009; Guzzi et al., 2011), e.g., gene/protein function prediction (Tao et al., 2007; Yu et al., 2015), data integration, and protein–protein interaction prediction (Guzzi et al., 2011). As with the taxonomic similarity discussed in the previous paragraph, a comprehensive discussion of semantic similarity is out of the scope of this paper; interested readers can refer to the surveys (Pesquita et al., 2009; Guzzi et al., 2011) and follow-up references (Benabderrahmane et al., 2010; Teng et al., 2013). Different from taxonomic similarity, which mainly accounts for pairwise GO terms, semantic similarity takes into account the two sets of terms annotated to a pair of genes. Pairwise semantic similarity often measures the similarity between genes by taking the average (maximum, or best-match average) taxonomic similarity between terms in the two respective sets (Pesquita et al., 2009). Groupwise semantic similarity does not depend on combining the taxonomic similarity between terms; it computes the similarity using set-based, graph-based, or vector-based techniques. In this paper, we use a vector-based semantic similarity. Different annotations provide biological knowledge of a gene at different granularity. To study the influence of semantic similarity, we substituted the vector-based semantic similarity of NoisyGOA with a graph-based one (Pesquita et al., 2008) and a set-based one (Mistry and Pavlidis, 2008), respectively. Our empirical study shows that NoisyGOA performs slightly better than the variants based on either of these two semantic similarities in the majority of cases. How to effectively measure the semantic similarity between genes is another important direction for future work.
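As a concrete illustration of groupwise measures, the set-based term-overlap similarity of Eq. (13) and the graph-information-content similarity of Eq. (12) can be sketched as follows. The data are a toy example, and the frequency-based IC here assumes every queried term is annotated to at least one gene:

```python
import math

def ic(term, annot):
    """Frequency-based information content: -log2 of the fraction of
    genes annotated with `term` (annot: gene -> set of terms).
    Assumes `term` annotates at least one gene."""
    freq = sum(term in terms for terms in annot.values())
    return -math.log2(freq / len(annot))

def sim_gic(ti, tj, annot):
    """Graph-information-content similarity, Eq. (12)."""
    union = sum(ic(t, annot) for t in ti | tj)
    inter = sum(ic(t, annot) for t in ti & tj)
    return inter / union if union else 0.0

def sim_to(ti, tj):
    """Term-overlap similarity, Eq. (13)."""
    if not ti or not tj:
        return 0.0
    return len(ti & tj) / min(len(ti), len(tj))

# Hypothetical annotation sets for four genes.
genes = {"g1": {"GO:a", "GO:b"}, "g2": {"GO:a", "GO:b", "GO:c"},
         "g3": {"GO:c"}, "g4": {"GO:d"}}
print(sim_to(genes["g1"], genes["g2"]))                 # 1.0 (g1 is fully covered)
print(round(sim_gic(genes["g1"], genes["g2"], genes), 3))  # 0.667
```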



We believe that our comparative study in this paper not only shows that noisy annotations are predictable, but can also drive the community to study the removal of inaccurate annotations.

5. Conclusions

In this paper, we studied how to remove inaccurate annotations and proposed a novel approach to tackle this problem. The empirical study on GO annotations of three species shows that NoisyGOA achieves better results in recognizing noisy annotations than other comparing methods in most cases. Our study not only shows that noisy annotations of genes can be predicted using both taxonomic similarities between ontological terms and semantic similarities between genes, but also indicates that removing noisy annotations can increase the prediction accuracy of gene functions. We will further explore the patterns of noisy annotations and model the missing annotations of genes to predict noisy annotations more accurately.

Acknowledgements

This work is supported by the Natural Science Foundation of China (No. 61402378), the Natural Science Foundation of CQ CSTC (Nos. cstc2014jcyjA40031 and cstc2016jcyjA0351), and the Fundamental Research Funds for the Central Universities of China (2362015XK07, XDJK2016B009, XDJK2016E076 and XDJK2016D021).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.compbiolchem.2016.09.005.

References

Apweiler, R., Martin, M., O'Donovan, C., Magrane, M., Alam-Faruque, Y., Antunes, R., Barrell, D., Bely, B., et al., 2011. Ongoing and future developments at the universal protein resource. Nucleic Acids Res. 39 (1), D214–D219, http://dx.doi.org/10.1093/nar/gkq1020.
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., et al., 2000. Gene ontology: tool for the unification of biology. Nat. Genet. 25 (1), 25–29, http://dx.doi.org/10.1038/75556.
Benabderrahmane, S., Smailtabbone, M., Poch, O., Napoli, A., Devignes, M., 2010. IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC Bioinform. 11, 588, http://dx.doi.org/10.1186/1471-2105-11-588.
Blake, J.A., 2013. Ten quick tips for using the gene ontology. PLoS Comput. Biol. 9 (11), e1003343, http://dx.doi.org/10.1371/journal.pcbi.1003343.
Deng, L., C.Z., 2015. An integrated framework for functional annotations of protein structural domains. IEEE/ACM Trans. Comput. Biol. Bioinform. 12 (4), 902–913.
Ferreira, J.D., Hastings, J., Couto, F.M., 2013. Exploiting disjointness axioms to improve semantic similarity measures. Bioinformatics 29 (21), 2781–2787.
Fu, G., Wang, J., Yang, B., Yu, G., 2016. NegGOA: negative GO annotations selection using ontology structure. Bioinformatics, http://dx.doi.org/10.1093/bioinformatics/btw366.
Good, B.M., Su, A.I., 2013. Crowdsourcing for bioinformatics. Bioinformatics 29 (16), 1925–1933.
Good, B.M., Clarke, E.L., Alfaro, L.D., Su, A.I., 2011. The gene wiki in 2011: community intelligence applied to human gene annotation. Nucleic Acids Res. 40 (1), D1255–D1261.
Guzzi, P.H., Mina, M., Guerra, C., Cannataro, M., 2011. Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinform. 13 (5), 569–585.
Hui, Y., Lei, G., Kang, T., Zheng, G., 2005. Broadly predicting specific gene functions with expression similarity and taxonomy similarity. Gene 352, 75–81.
Hunter, S., Jones, P., Mitchell, A., Apweiler, R., Attwood, T.K., Bateman, A., Bernard, T., Binns, D., Bork, P., Burge, S., 2011. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40 (1), D306–D312.
Huntley, R.P., Sawford, T., Martin, M.J., O'Donovan, C., 2014. Understanding how and why the gene ontology and its annotations evolve: the GO within UniProt. GigaScience 3, 1–9.
Jiang, J.J., Conrath, D.W., 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of International Conference on Research in Computational Linguistics (ROCLING), pp. 1–15.


Kahanda, I., Funk, C.S., Ullah, F., Verspoor, K.M., Ben-Hur, A., 2015. A close look at protein function prediction evaluation protocols. GigaScience 4, 41.
Lee, D., Redfern, O., Orengo, C., 2007. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 8 (12), 995–1005.
Lin, D., 1998. An information-theoretic definition of similarity. In: Proceedings of International Conference on Machine Learning (ICML), pp. 296–304.
Lord, P.W., Stevens, R.D., Brass, A., Goble, C.A., 2003. Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19 (10), 1275–1283.
Matthews, B.W., 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (BBA) – Protein Struct. 405 (2), 442–451, http://dx.doi.org/10.1016/0005-2795(75)90109-9.
Mi, H., Lazarevaulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., Guo, N., Muruganujan, A., Doremieux, O., Campbell, M.J., 2005. The panther database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 33 (1), 284–288.
Mistry, M., Pavlidis, P., 2008. Gene ontology term overlap as a measure of gene functional similarity. BMC Bioinform. 9 (327), 1–11.
Othman, R.M., Deris, S., Illiasb, R.M., 2008. A genetic similarity algorithm for searching the gene ontology terms and annotating anonymous protein sequences. J. Biomed. Inform. 41 (1), 65–81.
Pesquita, C., Faria, D., Bastos, H., Ferreira, A.E., Falcão, A.O., Couto, F.M., 2008. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinform. 9 (S5), S4.
Pesquita, C., Faria, D., Falcão, A.O., Lord, P., Couto, F.M., 2009. Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 5 (7), e1000443.
Protein–protein interactions network from BioGrid. http://thebiogrid.org/download.php (accessed 01.06.16).
Radivojac, P., Clark, W.T., Oron, T.R., Schnoes, A.M., Wittkop, T., Sokolov, A., Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., 2013. A large-scale evaluation of computational protein function prediction. Nat. Methods 10 (3), 221–227.
Rhee, S.Y., Wood, V., Dolinski, K., Draghici, S., 2008. Use and misuse of the gene ontology annotations. Nat. Rev. Genet. 9 (7), 509–515.
Schnoes, A.M., Ream, D.C., Thorman, A.W., Babbitt, P.C., Friedberg, I., 2013. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput. Biol. 9 (5), e1003063.
Sharan, R., Igor, U., Shamir, R., 2000. A network of protein–protein interactions in yeast. Nat. Biotechnol. 18 (12), 1257–1261.
Sharan, R., Igor, U., Shamir, R., 2007. Network-based prediction of protein function. Mol. Syst. Biol. 3 (88), 1–13.
Škunca, N., Altenhoff, A., Dessimoz, C., 2012. Quality of computationally inferred gene ontology annotations. PLoS Comput. Biol. 8 (5), e1002533.
Tao, Y., Li, J., Friedman, C., Lussier, Y.A., 2007. Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics 23 (13), i529–i538.
Teng, Z., Guo, M., Liu, X., Dai, Q., Wang, C., Xuan, P., 2013. Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics 29 (11), 1424–1432.
The gene ontology database. http://geneontology.org/page/download-ontology (accessed 02.12.15).
The Gene Ontology Annotation Files. http://geneontology.org/page/download-annotations (accessed 07.12.15).
Thomas, P.D., Wood, V., Mungall, C.J., Lewis, S.E., Blake, J.A., 2012. On the use of gene ontology annotations to assess functional similarity among orthologs and paralogs: a short report. PLoS Comput. Biol. 8 (2), 1454–1459.
Valentini, G., 2011. True path rule hierarchical ensembles for genome-wide gene function prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 8 (3), 832–847.
Wang, J., Du, Z., Payattakool, R., Yu, P., Chen, C., 2007. A new method to measure the semantic similarity of GO terms. Bioinformatics 23 (10), 1274–1281.
Wass, M.N., Mooney, S.D., Linial, M., Radivojac, P., Friedberg, I., 2014. The automated function prediction SIG looks back at 2013 and prepares for 2014. Bioinformatics 30 (14), 2091–2092.
Wu, X., Zhu, L., Guo, J., Zhang, D., Lin, K., 2006. Prediction of yeast protein–protein interaction network: insights from the gene ontology and annotations. Nucleic Acids Res. 34 (7), 2137–2150.
Yang, H., Nepusz, T., Paccanaro, A., 2012. Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics 28 (10), 1383–1389.
Youngs, N., Penfoldbrown, D., Bonneau, R., Shasha, D., 2014. Negative example selection for protein function prediction: the NoGO database. PLoS Comput. Biol. 10 (6), e1003644.
Yu, G., Rangwala, H., Domeniconi, C., Zhang, G., Yu, Z., 2014. Protein function prediction with incomplete annotations. IEEE/ACM Trans. Comput. Biol. Bioinform. 11 (3), 579–591.
Yu, G., Zhu, H., Domeniconi, C., Liu, J., 2015. Predicting protein function via downward random walks on a gene ontology. BMC Bioinform. 15, 217.
Yu, G., Rangwala, H., Domeniconi, C., Zhang, G., Yu, Z., 2015. Predicting protein functions using incomplete hierarchical labels. BMC Bioinform. 15, 1.
