Cross-organism learning method to discover new gene functionalities

Cross-organism learning method to discover new gene functionalities

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34 journal homepage: www.intl.elsevierhealth.com/j...

2MB Sizes 0 Downloads 47 Views

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

journal homepage: www.intl.elsevierhealth.com/journals/cmpb

Cross-organism learning method to discover new gene functionalities Giacomo Domeniconi a,∗, Marco Masseroli b, Gianluca Moro a, Pietro Pinoli b a b

DISI, Università degli Studi di Bologna, Via Venezia 52, 47521 Cesena, Italy DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, 20133 Milan, Italy

a r t i c l e

i n f o

a b s t r a c t

Article history:

Background: Knowledge of gene and protein functions is paramount for the understanding

Received 4 June 2015

of physiological and pathological biological processes, as well as in the development of new

Received in revised form

drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the

16 November 2015

availability of gene and protein functional feature descriptions expressed through controlled

Accepted 8 December 2015

terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In the last years, several databases of such annotations have become available; yet, these

Keywords:

valuable annotations are incomplete, include errors and only some of them represent highly

Biomolecular annotation prediction

reliable human curated information. Computational techniques able to reliably predict new

Knowledge discovery

gene or protein annotations with an associated likelihood value are thus paramount.

Data representation

Methods: Here, we propose a novel cross-organisms learning approach to reliably predict new

Discrete matrix completion

functionalities for the genes of an organism based on the known controlled annotations of

Transfer learning

the genes of another, evolutionarily related and better studied, organism. We leverage a

Gene ontology

new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict with good accuracy unknown gene annotations. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organisms learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. Results: We tested and compared our method with the equivalent single organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). Results show both the usefulness of the perturbation method of available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, without influence of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted. © 2015 Elsevier Ireland Ltd. All rights reserved.



Corresponding author. Tel.: +39 3495486309. E-mail addresses: [email protected] (G. Domeniconi), [email protected] (M. Masseroli), [email protected] (G. Moro), [email protected] (P. Pinoli) . http://dx.doi.org/10.1016/j.cmpb.2015.12.002 0169-2607/© 2015 Elsevier Ireland Ltd. All rights reserved.

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

1.

Introduction

High-throughput biotechnological methods are generating at increasing rates a progressively high amount of genomic and proteomic data about a growing number of different organisms [1–3]; these mostly regard viruses and bacteria, but also many eukaryotic organisms. They are studied mainly to better understand human patho-physiology or improve food and agricultural products; towards these goals, discovering and comprehension of biological functions of genes and proteins within an organism is paramount. Besides biological experiments, which are expensive, timeconsuming and not always possible, several computational approaches have been proposed to discover gene or protein functions through the processing of the literature or of different types of experimental data; the latter ones include biomolecular sequences, gene expressions, protein interactions and phylogenetic profiles, processed using different techniques (e.g., sequence homology, network based or data and text mining based methods) [4,5]. Some attempts to consider together multiple types of data, also from different species through comparative genomic approaches, have also been proposed in order to improve results (e.g., [6,7]); they are usually quite complex and computationally demanding. Among available data, those that describe in structured form existing knowledge about structural and functional properties of biomolecular entities (mainly genes and their protein products) are extremely valuable. They are called controlled biomolecular annotations and each of them consist of the association of a biomolecular entity with a controlled term, which defines a structural or functional biological property and is part of a terminology or ontology; such association states that the biomolecular entity has the property that the controlled term defines. Several biomedical terminologies and ontologies exist [8,9]; among them the Gene Ontology (GO) is the most considerable one [10]. It is composed of three sub-ontologies, overall including more than 40,000 concepts, which characterize species-independent Biological Processes (BP), Molecular Functions (MF) and Cellular Components (CC). These concepts are described through controlled terms and are hierarchically related, mainly through IS A or PART OF relationships, within a Directed Acyclic Graph (DAG), designed to capture orthogonal features of genes and proteins. In the GO DAG, each node represents a GO term (i.e., a concept) and each directed edge from a node A to a node B represents a relationship existing from a child term A to its parent term B. Controlled biomolecular annotations are profitably leveraged for biomedical interpretation of biomolecular test results, extraction of novel information useful to formulate and validate biological hypotheses, and also the discovery of new knowledge to classify biomedical literature [11]. They are particularly valuable for high-throughput and computationally intensive bioinformatics analyses. Several computational tools take advantage of these annotations, such as those for annotation enrichment analysis (e.g., [12–15]) or semantic similarity analysis [16–20] of genes and proteins; they strongly rely on coverage and quality of the available controlled

21

annotations. However, existing controlled biomolecular annotations are accurate only in part, contain errors (particularly those only derived computationally, without human curator supervision) and are incomplete; several biological properties and functions of genes and gene protein products are still to be discovered, especially for recently studied organisms. Furthermore, the available annotations are largely derived computationally, and often they do not have an associated significance level; only a few of them are reviewed by human curators and represent highly reliable information. Annotation curation is crucial for annotation quality; yet, the curation process is very time-consuming. To help and accelerate it, availability of prioritized lists of computationally predicted annotations is greatly effective. In this scenario, computational techniques able to reliably predict new biomolecular annotations with an associated likelihood value are paramount; towards this aim several different methods have been proposed. King et al. [21] suggested the use of decision trees and Bayesian networks to predict annotations by learning patterns from available annotations. Tao et al. [22] proposed a k-nearest neighbour (k-NN) classifier to associate a gene with new annotations common among its functionally nearest neighbour genes, where functional distance between genes is computed according to the semantic similarity of the GO terms that annotate the genes. Hidden Markov Model (HMM) techniques were also used to model gene function evolution [23] or predict gene function from sequential gene expression data [24]. Support Vector Machine (SVM) classifiers are also common in gene function prediction. Minneci and colleagues [25] leveraged them to predict GO annotations for several eukaryotic protein sequences, whereas Mitsakakis et al. [26] used them to predict potential functions for previously un-annotated Drosophila melanogaster genes, through the analysis of a large dataset from gene expression microarray experiments. Also biological network analysis is often used to predict gene functions. Warde-Farley and colleagues [27] developed GeneMANIA, a server for gene function prediction, where query gene sets are represented as networks with an associated weight, which is based on how well connected the genes in the query set are to each other compared with their connectivity to non-query genes. Differently, Li et al. [28] used a kernel-based learning method and proposed a labeled graph kernel which is able to predict functions of individual genes on the basis of the function distributions in their associated gene interaction networks. Using a latent semantic approach and basic linear algebra, Khatri and colleagues [29,30] proposed a prediction algorithm based on the Singular Value Decomposition (SVD) method of the gene-to-term annotation matrix; this is implicitly based on the count of co-occurrences between pairs of annotation terms in the available annotation dataset. Masseroli and colleagues enhanced this algorithm [31], by including gene clustering based on gene functional similarity computed on the GO annotations of the genes; then, they further extended it by automatically choosing the best SVD truncation level, according to the evaluated dataset [32]. The SVD has also been used with annotation weighting schemes, built upon the gene and term frequencies [30,33,34]. Based on simple matrix

22

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

decomposition operations, these methods are independent of both the considered organism and term vocabulary used in the annotation set; yet, obtained results showed limited accuracy. Other more sophisticated latent semantic analysis techniques, mainly related to Latent Semantic Indexing (LSI) [35], formerly used for Natural Language Processing, have been suggested to predict biomolecular annotations on the basis of available annotations; they include the probabilistic Latent Semantic Analysis (pLSA) [36], which leverages the latent model of a set of annotation terms to increase robustness of annotation prediction results. In a previous work, we used this technique [37], also enhanced with weighting schemes [38], and showed that it improves the SVD method of Khatri and colleagues [29]. Topic modeling has been leveraged also through the use of the Latent Dirichlet Allocation (LDA) algorithm [39]. Bicego et al. [40] and Perina et al. [41] applied LDA to separate gene expression microarray data in clusters. Lately, we took advantage of the LDA algorithm associated with the Gibbs sampling [42–44] to predict gene annotations to GO terms [45]. These advanced techniques greatly overcome those based on linear algebra, but the underlying model complexity and the training procedure slowness make them inappropriate when the evaluated dataset size increases. Recently, additional supervised methods were proposed also for gene annotation prediction. Cheng et al. [46] formulated the prediction of gene functions as a multi-label top-down classification problem. They developed a method using the hierarchical relationships in the GO structure to mitigate the quantitative difference between positive and negative training samples; yet, they obtained only poor accuracy. Additionally, Stojanova et al. [47] considered that relations between genes generate autocorrelation in gene annotations and violate the assumption that annotations are independently and identically distributed; by explicitly considering this autocorrelation they obtained better predictive accuracy. Novel gene annotations were also inferred by leveraging multiple data types or sources, also from different species. A general Bayesian framework was developed by Troyanskaya and colleagues [48] to integrate heterogeneous types of highthroughput biological data, and was applied for the prediction of Saccharomyces cerevisiae gene functions. Barutcuoglu et al. [49] used gene expression levels from microarray experiments to train a SVM classifier for each gene annotation to a GO term; then, they enforced consistency among predicted annotation terms by means of a Bayesian network mapped on the GO structure. Conversely, Raychaudhuri et al. [50] and Pérez et al. [51] used text mining techniques to extract form the literature gene associated keywords which are then mapped to GO concepts. These approaches provide improved results with respect to similar methods applied on a single data type; yet, they require a preparatory data integration step which adds complexity, reduces flexibility and slows the prediction process. Thus, previously proposed methods for biomolecular annotation prediction either are general and flexible, but provide only limited accuracy mainly due to the simple model used, or improve prediction performance by either leveraging a complex integrative analytical framework, which often is difficult and time consuming to be properly set up, or adopting a more complex model, which in turn significantly slows the

prediction process particularly in the usual case of many data to be evaluated. Very recently, we introduced both a novel representation of the annotation discovery problem and a random perturbation method of the available annotations [52]; this allows taking advantage of supervised algorithms to predict with good accuracy new controlled annotations for the genes of an organism based on the annotations available for the same genes. Taking inspiration from the cross-domain approach, applied in data mining domains such as text classification [53] and sentiment analysis [54], here we extend that initial method and apply it to reliably predict new functionalities for the genes of an organism based on the controlled annotations available for the genes of another organism; in this way, the available biological knowledge about a better studied organism can be leveraged to improve the knowledge about a more limitedly studied organism. In the next sections, we firstly present our approach for the prediction of new gene annotations in Section 2, together with the method defined for its evaluation. Then, in Section 3 we describe the annotation data used to evaluate our approach, and in Section 4 we illustrate the performed test results obtained. Finally, we discuss such results and report conclusions and future developments in Section 5.

2.

Methods

In this section we describe our techniques to effectively predict novel gene annotations of an organism based on available gene annotations, of the same organism or of another one evolutionarily related; we also illustrate the method that we employed to evaluate the accuracy of the predicted annotations.

Discovering gene annotations using supervised 2.1. algorithms The discovery of new gene annotations can be modeled as a supervised problem through an approach that we previously introduced [52]; for the reader’s convenience, here we report the main concepts of this approach before extending it and applying it to cross-organism annotation prediction. The annotation profile of a gene gi to a set T of controlled terms t consists in a binary vector whose jth entry corresponds to the jth term tj of the set T, and the value 1 (or 0) for such entry indicates that gi is (or is not) annotated to tj . Let us define as annotation matrix A(g, t) a binary matrix whose ith row A(gi , t) is the annotation profile of the gene gi to the terms t ∈ T; thus, a binary gene-term annotation matrix A(g, t) represents a set of known annotations of some genes g to some controlled terms t ∈ T. Accordingly, the discovery of novel annotations to these terms t for these genes g consists in predicting if any element A(gi , tj ) = 0 should likely be 1, or 0, based on the known annota/ tj , i.e., tions of the gene gi to the other terms tk ∈ T, with tk = based on the value of the other entries k of the gi ’s annotation profile, with k = / j. Thus, this problem can be modeled as a supervised multilabel classification, with the peculiarity of not having two distinct sets of features and labels for the classification, but

23

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

A0

GO terms (Outdated version)

GO terms (Updated version)

A1

GO:0043226

GO:0005575

GO:0005623



GO:005737

Gene1

0

1

1



1

0

Gene2

1

0

1



0



















1

Gene G n

1

1

1



1

GO:0043226

GO:0005575

GO:0005623



GO:005737

Gene1

0

1

0



0

Gene2

1

0

0











Gene G n

0

1

1

(b)

(a) GO terms

Mt=GO:005737 t GO 005737

features

label

GO:0043226

GO:0005575

GO:0005623



GO:005737

Gene1

0

1

0



1

Gene2

1

0

0



0













Genen

0

1

1



1

(c)

Fig. 1 – Illustrative diagram of the data representation. For each annotation term t (e.g., t = GO:005737), the data set Mt (c) is iteratively created with an older annotation version A0 (a) for the features (light grey) and an updated version A1 (b) for the labels (dark grey).

only an unique set of entries (i.e., all the values 1 and 0 of the matrix A) which alternatively act as labels or features (Fig. 1). We approach this problem by creating a single binary classification task for each term t, called class-term (tc ), in which the classification model predicts the presence or absence of the annotation of each considered gene to the class term tc , based on all the gene annotations to all the other terms t ∈ T, i.e., based on the features. Henceforth, we will refer to a single supervised task, concerning the discovery of gene annotations to the term tc (e.g., the term GO:005737 in Fig. 1), which is then repeated iteratively for all other terms t ∈ T. The goal of the prediction model can be viewed as the discovery of the annotations that are missing from an outdated version of the annotation matrix A and will be present in a more updated version of it. For this purpose, we create the labeled dataset using two versions of the annotation matrix, an outdated one (A0 ), with less annotations, and an updated one (A1 ). We assign as values of the class-term, i.e., as labels, the values in the updated matrix, since these are the goals of the predictions, and we assign as values of all the other terms, i.e., as features, the values in the outdated annotation matrix (Fig. 1). With this approach we train a set of prediction models, one for each annotation term, that are able to infer the annotations missing in the outdated version A0 and present in a more updated one A1 . Then, we use these models to predict novel annotations, based on the available ones. To evaluate the predictions, we create a validation in a similar way as we did for the training set, using the hypothetical current version of A as input to the prediction models and using a more updated version of A to validate the correctness of the predictions. The prediction model provides a probability p(g, t), called likelihood, of the presence of the annotation of the gene g to the term t. According to the hierarchical structure of an ontology, when a gene is annotated to an ontological term, it must be also annotated to all the ancestors of that term in its ontology; this constraint is also known as True Path Rule [55]. The supervised prediction models, however, work independently for each term, regardless of the predictions of the annotation of the same gene to other terms of the same ontology. This might generate possible anomalies when a gene is predicted

annotated to a term, but not to one or more of its ancestor terms, thus violating the True Path Rule. To obtain a likelihood value that takes into account the hierarchy of the annotation terms, once independently obtained the likelihood of each gene-term association, we proceed as follows. 1. For each novel gene-term annotation, we give a hierarchical semantic to its likelihood by first increasing it with the average of the likelihoods of all the annotations of the gene to all the ancestors of the term (which must be implicitly annotated to the gene) and then normalizing such hierarchical likelihood, as follows:

 ta ∈ ancestors(t)

H

p (g, t) =

p(g,ta )

|ancestors(t)|

+ p(g, t)

2

(1)

2. Then, to correct possible violations of the True Path Rule, an iterative process is carried on from the leaf terms to the root of the ontology hierarchy, upgrading each term annotation likelihood with the maximum value of the descendant term annotation likelihoods, if this latter one is greater than the term annotation modified likelihood pH (g, t), as follows: l(g, t) = max{pH (g, t),

max

tc ∈ children(t)

{pH (g, tc )}}

(2)

In such a way, for each ontology term t, the likelihood l(g, t) of annotating a gene g to the term t is always greater than, or equal to, the likelihood of annotating the gene g to any descendant of the term t. The supervised approach described above requires the availability of two versions of the annotation matrix to train the classification models. However, biologists typically have available only the most updated annotation version, not keeping stored the outdated versions for space reasons, given the large amount of data. To overcome this problem, in [52] we proposed a method to artificially create an older version of the annotation matrix. Results obtained in this previous work show that using the same annotation matrix to create the

24

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

training set, i.e., assuming A0 = A1 , leads to obtain good prediction effectiveness; however, better results are obtained by using an artificial older version A0 of the annotation matrix, created equal to the only available updated matrix A1 and then perturbed in a way that some known annotations present in A1 are randomly removed from A0 with probability p. Formally, for each gene g and term t, the perturbation is done as follows:

A0 (g, t) =

⎧ 0 if A1 (g, t) = 1 ∧ random ≤ p ⎪ ⎨ ⎪ ⎩

1

if A1 (g, t) = 1 ∧ random > p

0

if A1 (g, t) = 0

(3)

where random is a function that uniformly generates random decimal values in the range [0, 1). Once the perturbed matrix A0 is generated, to ensure its correctness with respect to the True Path Rule, for each gene-term annotation deleted, we remove also any annotation of the gene to the descendants of the term; we call this process perturbation unfolding. Generally, term weighting schemes improve the efficacy of text classification [56]; however, we proved in a previous paper [57] that better predictions in this biomedical context can be achieved with binary matrices, namely without weighting their terms, as we did in this work.

2.2.

Cross-organism learning method

The Gene Ontology unifies the representation of gene and gene product features across species (see Section 1), i.e., GO terms describe gene functions independently from the organism considered. This allows us to assume that the coannotations of genes to GO terms depend only on the terms themselves and not on the gene organism. Consequently, given a supervised model able to discover relationships among gene annotations to terms of the GO, or of any other ontology with cross-species characteristics equivalent to those of the GO, it is conceivable to train such model on a largely studied organism (source), and then apply the trained model to predict novel gene annotations of a scarcely analyzed organism (target). With such a training, the used supervised algorithm can leverage a greater number of genes with a larger number of annotations already known, i.e., it can take advantage of a bigger and more dense training annotation matrix than that it would use by considering only the annotation matrix of the target organism. On the other hand, in this way the obtained model is not able to predict annotations to terms that are specific for the target organism and are not used for the source organism, since it can only handle annotations to terms used in the source organism annotations. In order to create a cross-organism learning model, first it is necessary to select the genes and terms useful to predict novel gene annotations for the target organism. Let TS and TT be the two sets of feature terms used in the available gene annotations for the source and target organisms, respectively. We create the prediction model using the set T of those terms that the two organism annotations have in common, i.e., T = TS ∩ TT . Then, we also need to select the source genes to be used to train the model; in fact, there can be genes of the source organism that have no annotations

Fig. 2 – Workflow for novel annotation predictions using the cross-organism learning method.

to the terms in T, thus being useless for the model training. Hence, we introduce a new parameter, called M, to represent the minimum number of annotations to the terms in T that any source gene needs to have in order to be used in the training process. Defined the composition of the initial annotation matrix, as described at the end of Section 2.1, we can perturb it to create the training set and then proceed with the same prediction method used for a single organism; in principle, any supervised prediction algorithm can be used to this end. In Fig. 2 we show a simple example of the described process: first, we select the set of common terms between source and target organisms, namely T = {t1 , t4 , t5 , t6 }, and the source genes that have at least M = 2 annotations to the terms in T. Once created the first training annotation matrix A1 , an artificial older version A0 of it is created by randomly perturbing it; thus, we have the two matrices needed to train the prediction model, which is then applied to the genes of the target organism in order to predict novel annotations for them. If two original annotation versions of the source organism genes are available, they can be used to create the training annotation matrices, without needing any perturbation.

2.3.

Evaluation

Effectiveness of supervised methods in discovering new gene annotations based on the available ones can be fully evaluated by using gold standard datasets; yet, unfortunately, such biological datasets generally are not available, and are difficult and very time-consuming to be created in a useful significant size. An alternative option is to consider two gene annotation versions, available at a given temporal distance, and use the older one to train the model and generate the annotation predictions, and the newer one to evaluate the model predictions; in fact, generally, available gene annotations of an organism increase over time while the organism is studied. It is important to note that, although much more feasible, this option allows only a lower estimate of the method efficacy; the shorter the temporal distance between the available versions of gene annotations, the more the predictions that could be correct although not found simply because they are not yet included in the newer gene annotations available.

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

To measure the effectiveness of the defined approach, we considered the accuracy of the gene annotations predicted with likelihood greater than a threshold  (Ac ). We calculated it as the number of gene annotations predicted for the target organism with likelihood greater than  that are found confirmed in an available newer version of the target organism’s gene annotations, divided by the total number of target organism’s gene annotations predicted with likelihood greater than . The overall evaluation procedure that we defined for our cross-organism learning method is as follows: 1. Create the annotation matrices As1 and At1 for the source and target organisms, respectively, by considering only all their reliable gene annotations (e.g., excluding their less reliable inferred electronic annotations) available at a given time T1 . 2. Select the set of terms T in common between such source and target organism annotations, avoiding considering the source genes with less than M annotations to the terms in the target organism annotations. 3. Create a second, more complete, annotation matrix At2 for the target organism, by considering a newer version of all its gene annotations (both more and less reliable) available at a time T2 , with T2 > T1 . 4. Create a second, less complete, annotation matrix As0 for the source organism, by considering an older version of only all its reliable gene annotations available at a time T0 , with T0 < T1 . 5. If such older annotation version is not available, randomly perturb the source annotation matrix As1 to get a modified version As0 of it with some missing annotations. 6. Train a supervised prediction model for each term in T, by using the two annotation matrices As0 and As1 of the source organism. 7. Run the prediction algorithm on the annotation matrix At1 to get a list of predicted gene annotations for the target organism, ordered by their confidence value, i.e., by their likelihood. 8. Evaluate the obtained predictions by checking whether they are found confirmed in the updated version of the target organism’s annotation matrix At2 , and compute the Ac measure. 9. For each experiment with a perturbed training matrix As0 , to avoid possible biases due to the random perturbation seed, repeat steps 5, 6, 7 and 8 multiple times (e.g., 5 times) by varying the random seed of the perturbation; then, determine the effectiveness of each experiment by averaging the Ac obtained in each experiment repetition. Please note that, to discover new gene annotations for the target organism, the defined cross-organism approach strictly requires the availability of only one version of the source and target organisms’ gene annotations; only the defined validation procedure requires the availability of an additional version of the considered annotations for the target organism, in order to automatically evaluate the accuracy of the predictions.

3.

25

Annotation data

For the development and validation of the proposed method, we retrieved multiple gene annotation sets from the Genomic and Proteomic Data Warehouse (GPDW) [58] of the Genomic and Proteomic Knowledge Base (GPKB) [59]. GPKB adopts an innovative method to integrate very numerous genomic and proteomic controlled annotations data of many different species from multiple sources. Relevant features of the GPKB are periodical updates of the contained data and storage in the GPDW of some of their outdated versions. To evaluate on a comparative genomic strategy the effectiveness of our cross-organism computational approach for reliably gene annotation prediction, from the GPDW we extracted the GO annotations there available for the genes of multiple eukaryotic organisms; we selected which of these organisms to consider for our approach evaluation on the basis of their number of gene GO annotations, annotated genes and annotation terms, as well as on their evolutionary distance. Towards this purpose, we took into account the evolutionary divergent eukaryotic species shown in Fig. 3, which are considered in the Reactome pathway knowledge base project [60] for a similar orthology inference strategy. Such strategy aims at using available manually-curated human biomolecular reactions to electronically infer biomolecular reaction events in these evolutionary divergent eukaryotic species, for which high-quality comprehensive sequence data exist [61]. Out of these species, we selected the Homo sapiens, the most studied organism with the highest number of gene GO annotations available, and other four evolutionarily related model organisms: an evolutionary close organism with a high amount of gene GO annotations available (Mus musculus), an organism averagely distant evolutionarily (Bos taurus), another averagely distant organism (Gallus gallus), but much less studied, i.e., with much less gene GO annotations available, and an evolutionarily far related organism (Dictyostelium discoideum, a species of soil-living amoeba commonly referred as “slime mold”). Besides their different evolutionary distance, these organisms differ by the quantitative features of their gene GO annotations shown in Table 1. To perform the defined evaluation procedure of our approach, we selected two temporally distant versions of the GO annotations available in the GPDW for the selected organism genes: an older version, as of July 2009, and a more recent one, as of March 2013. Like most of the biomolecular databases, for each gene the GPDW stores only the available annotations to the most specific GO terms describing the gene functional features; all the other gene annotations to all the ancestor terms of these specific GO terms are left implicit and can be derived through the ontology structure. In building our annotation matrices we considered all gene GO annotations, including these implicit ones. Finally, we distinguished between more and less reliable annotations; this distinction has been based on the GO evidence code associated with each GO annotation and describing the procedure that produced the annotation. Each GO annotation can be associated with more than one evidence code, since there can be multiple procedures or assays that derived the annotation; we considered as less reliable all the GO annotations

26

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

Fig. 3 – Phylogenetic dendrogram of evolutionarily divergent eukaryotic species (left) for which biomolecular reaction events have been electronically inferred in the Reactome pathway knowledge base project [60], using manually curated human reactions available. Success rate of such orthology inference is shown (right) as percentage of eligible reactions in the available Reactome human dataset for which an event can be inferred in the model organism [61].

Table 1 – Counts of gene GO annotations (A) and their involved genes (G) and GO terms (T) used for the cross-organism approach evaluation. Only genes and GO terms shared between the July 2009 and March 2013 versions are counted. Both most specific and implicit gene GO annotations are counted. For the annotations of the July 2009 version (A09 ), only the more reliable human curated annotations that were used for model training are reported; for the annotations of the March 2013 version, both curated (A13 ) and total (At13 ) annotations are reported.

G T A09 A13 At13

Homo s.

Mus m.

Bos t.

Gallus g.

Dict. d.

9937 3322 345,259 353,679 955,341

9265 3366 319,402 606,239 826,033

646 749 21,305 26,194 47,237

321 403 8846 11,339 17,744

1762 1016 65,421 63,621 118,695

with only Inferred from Electronic Annotation (IEA) or No biological Data available (ND) evidence code, whereas we considered more reliable all the other GO annotations. We used the latter ones for the prediction model training and testing, whereas we used the newer available version (as of March 2013) of all the available annotations with any GO evidence code for the validation of the predicted annotations. The quantities of annotations and involved genes and terms for each organism and annotation version considered are listed in Table 1.

4.

Evaluation results

In this section we illustrate the performance results of our approach. First, we compare the performance obtained for single organism vs. cross-organism annotation prediction; following, we describe the assessment results of the usefulness of the perturbation method to better train the supervised

algorithms; we then show the evaluation results of tuning the few parameters that our approach requires, i.e., the supervised algorithm to be used and the minimum number of shared annotation terms between the source and target organism; finally, as an example, we illustrate the full results obtained for the prediction of the GO Cellular Component annotations of a gene.

4.1.

Single organism vs. cross-organism approach

We first evaluated the effective utility of the cross-organism learning method with respect to the single organism method. Results obtained in [52] show that using two equal annotation matrices to train the model (i.e., with A0 = A1 ) provides good effectiveness, avoiding the tuning of the perturbation parameter; accordingly, to obtain a first comparison between the single organism and cross-organism approaches, without perturbing the training annotation matrices we trained multiple prediction models with the gene annotations of the H. sapiens and the other four selected organisms (M. musculus, B. taurus, G. gallus and D. discoideum), and applied all the trained models to the gene annotations of each of such latter organisms. We focused the evaluation on the predicted annotations with good quality level, by setting the prediction likelihood threshold to the value  = 0.8. Results in Table 2 show that models trained with the H. sapiens gene GO annotations achieve much better accuracy than those obtained by using the gene GO annotations of the same organism both in training and testing (i.e., the single organism approach used in [52]). The accuracy (Ac=0.8 ) of the annotations predicted with likelihood greater than the threshold  = 0.8 shows an improvement almost always greater than 100% (except in the case of M. musculus, where the improvement is just above 40%). Moreover, it is notable that this accuracy improvement is due to the use of the most studied

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

Table 2 – Evaluation results of the cross-organisms learning method vs. the single organism one, obtained without perturbation of the training matrices and using IBk as supervised algorithm. In tests with different source and target organisms, only the source organism genes annotated to at least M = 5 terms of the target organism gene annotations are considered. Results show the accuracy (Ac=0.8 ) and number (N=0.8 ) of the novel gene GO annotations predicted with likelihood greater than  = 0.8, and their average level (L¯ =0.8 ) in the GO DAG (for the L¯ =0.8 computation, when the term of a predicted gene annotation belongs to multiple GO levels, only its lowest level was considered). The best accuracies in each group of tests regarding the same target organism are in bold. “–”: no gene annotations predicted with likelihood greater than  = 0.8. Target

Source

Ac=0.8

N=0.8

L¯ =0.8

Mus m.

Homo s. Mus m. Bos t. Gallus g. Dict. d.

0.573 0.407 0.312 0.221 0.305

6048 205 2184 2576 5285

2.82 2.93 2.73 2.78 2.81

Bos t.

Homo s. Mus m. Bos t. Gallus g. Dict. d.

0.538 0.137 0.250 0.314 0.273

234 388 4 35 220

3.71 3.55 3.00 3.45 3.27

Gallus g.

Homo s. Mus. m. Bos t. Gallus g. Dict. d.

0.390 0.100 0.281 – 0.205

136 281 32 0 156

3.64 3.68 3.26 – 2.97

Homo s. Mus m. Bos t. Gallus g. Dict. d.

0.675 0.217 0.209 0.284 0.297

765 327 187 81 37

3.99 3.96 3.10 3.96 4.18

Dict. d.

organism to train the models. Note also that, with no effect on the predicted annotation accuracy, the number (N=0.8 ) of predicted annotations with likelihood greater than  = 0.8 is mainly related to the size of the annotation matrices of both the source and target organisms. For instance, with the biggest annotation matrices of both the source and target organisms (H. sapiens and M. musculus, respectively) we obtained the highest amount of annotations predicted with likelihood greater than  = 0.8; conversely, with the smallest annotation matrix of both source and target organism (G. gallus), there was no annotation predicted with likelihood greater than  = 0.8. The average depth (i.e., GO level) of the gene GO annotations predicted with likelihood greater than  = 0.8 (L¯ =0.8 ) varied around levels 3 and 4 of the GO DAG; this shows that, on average, the predicted gene annotations were not very specific, but neither too generic (i.e., they were generally valuable). The not high L¯ =0.8 values obtained are mainly due to the fact that the annotations more reliably predicted for a gene (as those with likelihood greater than  = 0.8) are those to the GO terms that are children of terms of known GO annotations of the gene, whose far majority is at low GO levels (note that we consider also the implicit ontological annotations, not only the most specific ones).

27

4.2. Original and perturbed source organism annotation matrices Results in Table 2 show that the cross-organism approach using a prediction model trained on the most largely studied organism (i.e., the one with the highest number of known annotations) generally provides better predictions; thus, we further investigated the matter, testing whether the use of a random perturbation of the training annotation matrix improves the classification accuracy of the models. The four graphics in Fig. 4 show the predicted annotation accuracies for the four target organisms considered, depending on the perturbation probability of the training annotation matrix. Each graphic shows the prediction results using as training set the known gene GO annotations of the H. sapiens (depicted with red squares) or of the target organism (depicted with blue diamonds). In particular, the following measures are reported: (i) the accuracy Ac=0.8 of the novel gene GO annotations predicted with likelihood greater than  = 0.8 (solid lines); (ii) the percentage difference () in number of gene GO annotations in the perturbed and original training matrices (dashed lines); (iii) the percentage difference () in number of novel annotations in the updated (validation) matrix with respect to the outdated (testing) matrix of the target organism (black, dotted line). On these graphs it can be easily observed that, also perturbing the training matrices, better results are almost always achieved with the cross-organism approach, using models trained with H. sapiens annotations. Moreover, the highlighted vertical bars in Fig. 4 show the perturbation ranges where the  of annotations between the original and perturbed training annotation matrices and between the testing and validation annotation matrices are comparable; one can notice that the tests with these perturbation ranges almost always achieve best accuracy results, or very close to them. This confirms that greater accuracy of the model is achieved with a richer representation of the validation problem in the training set; in our scenario such richer representation can be measured by the similarity between the  of annotations in the training and in the testing/validation matrices.

4.3. Two original matrices of a well annotated source organism The above test results show a great improvement to the predicted annotation accuracies when prediction models were trained with H. sapiens gene annotations. Considering this enhancement, we could conceive a tool based on H. sapiens gene annotations, applicable to the discovery of novel annotations for the genes of target organisms of interest. Furthermore, having available two original versions of H. sapiens gene annotations at distinct times, it would be possible using them as source organism annotation matrices to train the prediction model without the need of the perturbation. To evaluate the efficacy of using two original annotation matrices, at a distance of time, of the H. sapiens as source organism, we performed several tests and compared their

28

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

Fig. 4 – Evaluation results of the gene GO annotations predicted by varying the probability of the random perturbation of the training annotation matrix. Each test is performed using the IBk supervised algorithm and M = 5 minimum common GO terms between source and target organism annotations. For the training with the known gene GO annotations of the H. sapiens (red, squares) or of the target organism (blue, diamonds), each target organism graph shows the accuracies of the gene GO annotations predicted with likelihood greater than  = 0.8 (solid lines) and the percentage difference () in number of annotations in the original and perturbed training matrices (dashed lines) and in the testing and validation sets of the target organism (black, dotted line). The highlighted vertical bar in each graph shows the perturbation range in which the s of annotations between the original and perturbed training matrices of the source organism and between the testing and validation matrices of the target organism are comparable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

results with those obtained with the perturbation approach. In Fig. 5 we show the trend of the predicted annotation accuracy Ac , while varying the threshold  on the likelihood of the annotations predicted using a model trained with two original matrices of H. sapiens gene GO annotations, as of July 2009 and March 2013, respectively, with a percentage difference in number of annotations of  = 84.15%. For each target organism, we report a graph with the accuracies (solid lines, left vertical axis) and the related number of predicted annotations (dotted lines, right vertical axis) in relation to the minimum likelihood of the predicted gene GO annotations (horizontal axis). Fig. 5 shows that the greater is the prediction likelihood threshold, the greater is the accuracy of the prediction model and the lower is the number of predicted annotations; thus, a biologist interested in very reliable, although few, novel annotations can raise the threshold as desired to obtain less, but comprehensively more reliable, predicted annotations. On the other hand, comparing results in Fig. 5 with those in Fig. 4, it appears that the perturbation approach provides generally better, or in few cases equivalent, results than using different

original annotation versions (which are rarely available). This behavior shows that the perturbation process allows the correct representation of the discovery of new annotations; besides giving the possibility to vary the difference of annotations () between the training matrices, this allows the tuning of the models to better represent the problem. We also evaluated the influence, on the predicted annotations, of the value of the parameter M, i.e., the minimum amount of overlapping between the annotation terms of the source and the target organisms. In Table 3 we show the results obtained by varying this minimum overlapping, i.e., by considering for the prediction model training only the source organism genes that are annotated to at least M of the GO terms in the target organism gene annotations. Results show that generally, independently on the evolutionary distance of the target organism from the H. sapiens and without influence on the average level in the GO DAG of the terms in the novel gene GO annotations predicted, the higher the minimum overlapping, the lower the prediction effectiveness. This suggests that even genes annotated to few terms in common with the

29

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

Fig. 5 – Evaluation results obtained by varying the threshold  on the likelihood of the gene GO annotations predicted using three different supervised algorithms (the IBk nearest neighbors with K = 3, the Random Forest decision tree and the LibSVM Support Vector Machine with radial basis kernel). Graphs report the numbers (#) of predicted gene GO annotations (dotted lines) and their accuracies Ac (solid lines) obtained by training the used prediction model with two original H. sapiens gene GO annotation versions (as of July 2009 and March 2013, respectively, with a percentage difference of new annotations  = 84.15%) and applying the trained model on the available gene GO annotations of each of the other considered target organisms, using M = 5 minimum common terms between the source and target organism annotations. Table 3 – Evaluation results of the predicted gene GO annotations obtained by varying the minimum number M of GO terms in common between source and target organism gene annotations. Results were obtained with two original H. sapiens gene GO annotation versions (as of July 2009 and March 2013, respectively, with a percentage difference of gene annotations  = 84.15%) as training set, and with each of the other considered model organisms as target, using IBk as supervised algorithm. For each target organism, the accuracy (Ac=0.8 ) and number (N=0.8 ) of novel gene GO annotations and the average level (L¯ =0.8 ) in the GO DAG of the terms in the novel gene GO annotations predicted with likelihood greater than  = 0.8 are shown. The best accuracies for each target organism are in bold. Organism

Measure

M = 100

M = 50

M = 25

M = 10

M=5

M=1

Mus m.

Ac=0.8 N=0.8 L¯ =0.8

0.400 12480 3.30

0.610 8040 3.22

0.672 5398 3.20

0.735 5911 3.12

0.742 5799 3.12

0.743 5818 3.11

Bos t.

Ac=0.8 N=0.8 L¯ =0.8

0.299 719 3.63

0.556 419 3.72

0.683 338 3.87

0.702 265 3.99

0.682 292 3.94

0.686 280 3.95

Gallus g.

Ac=0.8 N=0.8 L¯ =0.8

0.192 655 3.57

0.342 222 3.74

0.438 153 3.75

0.569 123 3.60

0.612 113 3.68

0.573 110 3.95

Dict. d.

Ac=0.8 N=0.8 L¯ =0.8

0.189 4998 3.40

0.426 3230 3.57

0.570 2266 3.62

0.593 1891 3.67

0.589 1854 3.68

0.589 1856 3.69

Average

Ac=0.8 N=0.8 L¯ =0.8

0.270 4713.0 3.47

0.484 2977.7 3.56

0.591 2038.7 3.61

0.650 2047.5 3.60

0.656 2014.5 3.60

0.648 2016.0 3.68

gene annotations of the target organism are useful to better train the prediction model; as mentioned above, an higher number of training genes, terms and annotations (i.e., a wider and more dense source organism annotation matrix) leads to the creation of a better prediction model. In the considered cases (i.e., H. sapiens as source organism and the four considered model organisms as target organisms), on average the best accuracy results are obtained by training the prediction model with all the source organism genes that are annotated to at least M = 5 GO terms in common with the target organism gene annotations.

4.4.

used to build the prediction model. To test which type of supervised algorithm can provide better accuracy results, we evaluated three different well-known supervised algorithms, namely Support Vector Machines (LibSVM, with radial basis kernel), nearest neighbors (IBk, with K = 3) and decision trees (Random Forest), using the implementations and default parameter values provided by Weka1 in its 3.7.9 version. In Fig. 5, we report the obtained results. It can be noted that the Random Forest algorithm always provides better accuracy; however, the time needed to build the classification models using Random Forest is extremely higher than that required by a nearest neighbor classifier like IBk. On the other hand, the LibSVM algorithm

Multiple supervised algorithm comparison

Our proposed cross-organism learning approach is totally independent from the specific supervised learning algorithm

1

Weka, http://www.cs.waikato.ac.nz/ml/weka/.

30

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

Fig. 6 – Directed Acyclic Graph of the Gene Ontology Cellular Component terms involved in the annotations known or predicted for the baculoviral IAP repeat containing 5 (BIRC5) gene of the B. taurus organism. The predicted annotations are obtained by training the prediction model with two original versions of H. sapiens gene GO annotations (as of July 2009 and March 2013, respectively), with M = 5 minimum number of GO terms in common between the training and target gene annotations and using IBk as supervised algorithm. The ellipses represent GO terms that were already associated with the BIRC5 gene in the input B. taurus annotations (as of July 2009). The rectangles indicate the GO terms of the predicted annotations with likelihood l ≥ 0.2, and assume darker shades of grey as the likelihood of the prediction increases; the double perimeter rectangles represent the GO terms of the novel predicted annotations for the B. taurus BIRC5 gene that were confirmed in the updated B. taurus GO annotations available (as of March 2013), and the triple perimeter rectangles indicate the GO terms of the annotations predicted with likelihood greater than  = 0.8. The octagons represent the remaining GO terms that were associated with the BIRC5 gene in the updated B. taurus annotations available (as of March 2013), but that were not predicted.

often gives a higher number of predicted annotations, but less reliable (i.e., with lower accuracy).

4.5.

Predicted gene annotation example

As an example, in Fig. 6 we report the Directed Acyclic Graph of the GO Cellular Component terms of all annotations predicted and known of the baculoviral IAP repeat containing 5 (BIRC5) gene of the B. taurus organism. For readability reasons, we show a GO Cellular Component DAG, which generally includes less nodes than other GO sub-ontology DAGs; furthermore, we do not show all the GO terms that have no known association or low predicted association likelihood l, i.e., with l < 0.2, with the B. taurus BIRC5 gene. The

predicted annotations shown were obtained by training the prediction model with two original versions of H. sapiens gene GO annotations (as of July 2009 and March 2013, respectively), with M = 5 minimum common terms between the source and target organism gene annotations and using IBk as learning algorithm. In the DAG, ellipses represent GO terms of the known annotations of the B. taurus BIRC5 gene as of July 2009, whereas rectangles indicate GO terms of the predicted annotations with likelihood l ≥ 0.2: the higher the likelihood, the darker the rectangle background. GO terms of predicted annotations confirmed in the validation dataset, which includes the GO annotations of the B. taurus BIRC5 gene as of March 2013, are in double perimeter rectangles, while in triple perimeter rectangles are the GO terms of the annotations

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

predicted with likelihood greater than  = 0.8 (all of them are confirmed in the validation dataset). The octagons represent the few remaining GO terms that were associated with the BIRC5 gene in the updated B. taurus annotations available (as of March 2013), but that were not predicted. The DAG in Fig. 6 suggests several, very interesting, novel GO Cellular Component annotations for the B. taurus BIRC5 gene, which is part of the inhibitor of apoptosis (IAP) gene family whose members encode negative regulatory proteins preventing apoptotic cell death. Some of its predicted annotations, i.e., 7 out of the 12 annotations (58.33%) predicted with likelihood l greater than  = 0.8 (involving the double and triple perimeter darker rectangles in the DAG), including the annotation to the nucleus (GO:0005634) term, were found confirmed in the 45 month later updated version of the B. taurus gene annotations; others, such as the annotations to the membraneenclosed lumen (GO:0031974) and to four of its descendant terms, are known for the human ortholog gene and could be confirmed for this B. taurus gene in the future. In the DAG, it can be also seen what would be the effect of lowering the threshold on the predicted annotation likelihood: GO terms within single and double perimeter rectangles might become included in the predicted annotations of the B. taurus BIRC5 gene, in accordance with their likelihood; a lower threshold could add many predicted annotations, but both confirmed (double perimeter rectangles) and not confirmed (single perimeter rectangles) ones, although the latter ones could be confirmed in a subsequent updated version of the known GO annotations of the gene. The high likelihood of the annotations confirmed, among those that the method predict with any likelihood, shows the high potential of our method in correctly predicting new annotations for a less studied organism by transferring them from a more studied one used to train the prediction model. The five GO annotations predicted for the B. taurus BIRC5 gene that have not (yet) been confirmed (single perimeter rectangles in Fig. 6) regard GO terms all hierarchically related and are induced by the annotation to the most specific of such terms (nuclear lumen – GO:0031981). Note that the same BIRC5 gene in human (i.e., the human ortholog) is directly annotated to a child of such most specific term (i.e., to nuclear chromosome – GO:0000228) with reliable Inferred from Direct Assay (IDA) evidence [62]; thus, the BIRC5 human ortholog gene is also indirectly annotated to all the GO terms involved in the annotations that our method predicts for the B. taurus BIRC5 gene which are not yet confirmed in the available B. taurus gene annotations.

5.

Discussion and conclusions

To improve gene function prediction, the hypothesis of evidence conservation in gene expression or proteomic data from multiple organisms has been previously leveraged [7,63]. In our work here illustrated, we applied this hypothesis on controlled gene annotations, using as input only the most reliable annotations available (i.e., those not electronically inferred), on the basis of the recognized assumption that genes sharing similar biomedical annotation profiles have similar functionalities [16]. This allows leveraging the knowledge available for well-studied and better-known organisms and transferring it

31

to less studied ones; in particular, it allows taking advantage of the many available and relevant structured descriptions of the existing knowledge, which leverage the several controlled terminologies and ontologies that have been developed and are used to express the existing valuable biological information and knowledge in a way that lets easily and effectively using them for new discoveries. Although complex methods have been proposed to improve gene function discovery by evaluating simultaneously multiple different types of data, analysis of only controlled gene annotations proved still useful and able to reliably provide relevant results [5]. We previously proved that supervised algorithms can be effectively used to reliably discover new annotations of an organism genes [52]. The rationale of our novel cross-organism approach is that the richer and denser is the training annotation set, the greater is the accuracy of the annotation prediction model. Through the several performed evaluations, we proved that the richer knowledge of an organism can be used to build a better prediction model for other less-known organisms, which provides more accurate results than an equivalent model built using the lesser information available for the organism to be predicted. This approach is feasible because of our proposed data representation: the classification models are guided by the co-occurrences of annotation terms in the considered gene annotation profiles, indiscriminately from the genes, and thus from the organisms, in which they occur. The proposed cross-organism approach allows training prediction models using any term involved in the annotation of the source organism; this implies the possibility of discovering new annotations for the target organism to terms representing biological concepts that have newer been associated with the target organism. Although such predicted annotations could be very valuable, guiding biologists to discover novel features completely unknown for the target organism, they could include annotations that are not meaningful for the target organism; furthermore, such predicted annotations could not be automatically validated using updated known annotations become available for the target organism. Thus, for the methodological purpose of this work, in the performed test we predicted new gene annotations only to those terms already included in the gene annotations of the target organism. The proposed method can be used with any source organism; this could lead to use more than one training organism, creating a supervised model for each source organism considered and using a voting algorithm to perform the predictions. In this work, we preferred not to take into consideration this possibility, both to not introduce additional system parameters (e.g., how to weight each vote) and to avoid the introduction in the training of not-well accurately annotated genes (e.g., outliers), which can create less accurate classification models. We tested our approach with both the same and different organisms as source and target organism, using gene annotation datasets of different sizes and densities and regarding organisms at different evolutionary distances. In all the conducted tests, the prediction models trained on a different more studied organism always achieved best accuracy of the predicted annotations, even when the source and target

32

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

organisms were distantly related evolutionarily. The accuracy obtained using the most studied organism, the H. sapiens, as source organism almost always overcomes by more than 100% the accuracy obtained using the same organism as source and target organism. Moreover, the performed tests show that our introduced perturbation method permits further improvement to prediction results, particularly when the difference in number of annotations between the original and perturbed training annotation matrices is comparable to the one between the test and validation annotation versions. Validation results also confirmed that even without perturbation, i.e., avoiding the tuning of the perturbation parameter, our cross-organism approach ensures good annotation predictions. We focused the design of our approach on the accuracy of the predicted annotations, since this is the most relevant aspect of interest for biologists; in fact, it is much more important to predict gene annotations that are correct and can contribute to advance the biomedical knowledge, although they do not represent all the still unknown gene features, than predict many annotations which cover more unknown gene features, but also include much more incorrect ones. Our approach provides ranked lists of predicted annotations that describe novel gene functionalities and have an associated likelihood value. They are very valuable, for instance, to complement available annotations for better coverage of the several gene functionalities, or to quicken the annotation curation process by focusing it on the prioritized novel annotations predicted. Although our approach cannot be applied to discover new features of completely unannotated genes, the evaluation results show that it can reliably predict novel annotations also of genes which are only very limitedly annotated to the terms included in the gene annotations of the well annotated source organism used. We focused the evaluation of our proposed cross-organism approach on Eukaryotic organisms; however, a priori it is not bound to these complex organisms and in principle it could be applied also on simple microbial organisms. Yet, in this case a thorough testing should be carried out to evaluate if the H. sapiens would still be the best choice of well-annotated source (i.e., training) genome to consider for the annotation prediction also of microbial genes. Finally, we point out that, despite the results presented in this work refer to Gene Ontology annotations of genes, the proposed approach can be equally applied also to protein annotations. Furthermore, it is not bound to Gene Ontology annotations, but it can be applied to any type of controlled annotations, from an ontology or even a flat terminology.

Conflict of interest The authors declare that they have no conflict of interest.

Acknowledgement This research is part of the “Data-Driven Genomic Computing (GenData 2020)” PRIN project (2013–2015), funded by the Italian Ministry of the University and Research (MIUR).

references

[1] EMBL-EBI Nucleotide Archive Statistics. URL http://www3.ebi.ac.uk/Services/DBStats/ (accessed on 30.5.15). [2] M.Y. Galperin, D.J. Rigden, X.M. Fernández-Suárez, The 2015 Nucleic Acids Research Database Issue and Molecular Biology Database Collection, Nucleic Acids Res. 43 (D1) (2015) D1–D5. [3] EMBL-EBI Nucleotide Archive Genomes. URL http://www.ebi.ac.uk/genomes/ (accessed on 30.5.15). [4] G. Pandey, V. Kumar, M. Steinbach, Computational approaches for protein function prediction: a survey, Tech. Rep. TR 06-028, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA, 2006. [5] A.K. Tiwari, R. Srivastava, A survey of computational intelligence techniques in protein function prediction, Int. J. Proteom. 2014 (2014) 845479. [6] M. Zitnik, B. Zupan, Matrix factorization-based data fusion for gene function prediction in baker’s yeast and slime mold, in: Pac. Symp. Biocomput., World Scientific, 2014, pp. 400–411. [7] M.A. Huynen, B. Snel, V. van Noort, Comparative genomics for reliable protein-function prediction from genomic data, Trends Genet. 20 (8) (2004) 340–344. [8] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology, Nucleic Acids Res. 32 (Database issue) (2004) D267–D270. [9] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L.J. Goldberg, K. Eilbeck, A. Ireland, C.J. Mungall, et al., The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration, Nat. Biotechnol. 25 (11) (2007) 1251–1255. [10] Gene Ontology Consortium, et al., Creating the gene ontology resource: design and implementation, Genome Res. 11 (8) (2001) 1425–1433. [11] P. Lena, G. Domeniconi, L. Margara, G. Moro, Gota: go term annotation of biomedical literature, BMC Bioinf. 16 (1) (2015) 346. [12] M. Masseroli, D. Martucci, F. Pinciroli, GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining, Nucleic Acids Res. 32 (Web Server issue) (2004) W293–W300. [13] M. Masseroli, Management and analysis of genomic functional and phenotypic controlled annotations to support biomedical investigation and practice, IEEE Trans. Inf. Technol. Biomed. 11 (4) (2007) 376–385. [14] F. Al-Shahrour, P. Minguez, J. Tárraga, I. Medina, E. Alloza, D. Montaner, J. Dopazo, FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments, Nucleic Acids Res. 35 (Web Server issue) (2007) W91–W96. [15] D.W. Huang, B.T. Sherman, R.A. Lempicki, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Res. 37 (1) (2009) 1–13. [16] C. Pesquita, D. Faria, A.O. Falcao, P. Lord, F.M. Couto, Semantic similarity in biomedical ontologies, PLoS Comput. Biol. 5 (7) (2009) e1000443. [17] A. Schlicker, T. Lengauer, M. Albrecht, Improving disease gene prioritization using the semantic similarity of Gene Ontology terms, Bioinformatics 26 (18) (2010) i561–i567. [18] S. Jain, G.D. Bader, An improved method for scoring protein–protein interactions using semantic similarity within the Gene Ontology, BMC Bioinf. 11 (1) (2010) 562.

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

[19] P.M. Tedder, J.R. Bradford, C.J. Needham, G.A. McConkey, A.J. Bulpitt, D.R. Westhead, Gene function prediction using semantic similarity clustering and enrichment analysis in the malaria parasite Plasmodium falciparum, Bioinformatics 26 (19) (2010) 2431–2437. [20] M. Falda, S. Toppo, A. Pescarolo, E. Lavezzo, B. Di Camillo, A. Facchinetti, E. Cilia, R. Velasco, P. Fontana, Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms, BMC Bioinf. 13 (Suppl 4) (2012) S14. [21] O.D. King, R.E. Foulger, S.S. Dwight, J.V. White, F.P. Roth, Predicting gene function from patterns of annotation, Genome Res. 13 (5) (2003) 896–904. [22] Y. Tao, L. Sam, J. Li, C. Friedman, Y.A. Lussier, Information theory applied to the sparse Gene Ontology annotation network to predict novel gene function, Bioinformatics 23 (13) (2007) 529–538. [23] H. Mi, A. Muruganujan, P.D. Thomas, PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees, Nucleic Acids Res. 41 (Database issue) (2013) D377–D386. [24] X. Deng, H. Ali, A hidden markov model for gene function prediction from sequential expression data, in: Proc IEEE Comput. Syst. Bioinf. Conf., IEEE, 2004, pp. 670–671. [25] F. Minneci, D. Piovesan, D. Cozzetto, D.T. Jones, FFPred 2.0: improved homology-independent prediction of Gene Ontology terms for eukaryotic protein sequences, PLOS One 8 (5) (2013) e63754. [26] N. Mitsakakis, Z. Razak, M.D. Escobar, J.T. Westwood, Prediction of Drosophila melanogaster gene function using Support Vector Machines, BioData Min. 6 (1) (2013) 8. [27] D. Warde-Farley, S.L. Donaldson, O. Comes, K. Zuberi, R. Badrawi, P. Chao, M. Franz, C. Grouios, F. Kazi, C.T. Lopes, et al., The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function, Nucleic Acids Res. 38 (Web Server issue) (2010) W214–W220. [28] X. Li, Z. Zhang, H. Chen, J. Li, Graph kernel-based learning for gene function prediction from gene interaction network, in: Proc. IEEE Int. Conf. Bioinformatics Biomed. (BIBM 2007), IEEE, 2007, pp. 368–373. [29] P. Khatri, B. Done, A. Rao, A. Done, S. Draghici, A semantic analysis of the annotations of the human genome, Bioinformatics 21 (16) (2005) 3416–3421. [30] B. Done, P. Khatri, A. Done, S. Draghici, Predicting novel human gene ontology annotations using semantic analysis, IEEE/ACM Trans. Comput. Biol. Bioinf. 7 (1) (2010) 91–99. [31] M. Masseroli, M. Tagliasacchi, D. Chicco, Semantically improved genome-wide prediction of Gene Ontology annotations, in: Proc IEEE Int. Conf. Intell. Syst. Design App. (ISDA 2011), IEEE, 2011, pp. 1080–1085. [32] D. Chicco, M. Masseroli, A discrete optimization approach for SVD best truncation choice based on ROC curves, in: Proc. IEEE Int. Conf. Bioinformatics Bioeng. (BIBE 2013), IEEE, 2013, 9, pp. 1-4. [33] B. Done, P. Khatri, A. Done, S. Draghici, Semantic analysis of genome annotations using weighting schemes, in: Proc. IEEE Symp. Comput. Intell. Bioinf. Comput. Biol. (CIBCB 2007), IET, 2007, pp. 212–218. [34] P. Pinoli, D. Chicco, M. Masseroli, Weighting scheme methods for enhanced genomic annotation prediction, in: Computational Intelligence Methods for Bioinformatics and Biostatistics, LNCS (Lecture Notes in Bioinformatics), vol. 8452, Springer, 2014, pp. 76–89. [35] S.T. Dumais, G.W. Furnas, T.K. Landauer, S. Deerwester, R. Harshman, Using latent semantic analysis to improve access to textual information, in: Proc. ACM SIGCHI Conf. Hum. Factors Comput. Syst., ACM, 1988, pp. 281–285.

33

[36] T. Hofmann, Probabilistic latent semantic indexing, in: Proc. Int. ACM SIGIR Conf. Res. Dev. Inf. Retr. (RDIR 1999), ACM, 1999, pp. 50–57. [37] M. Masseroli, D. Chicco, P. Pinoli, Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations, in: Proc. Int. Joint Conf. Neural Netw. (IJCNN 2012), IEEE, 2012, pp. 2891–2898. [38] P. Pinoli, D. Chicco, M. Masseroli, Enhanced probabilistic latent semantic analysis with weighting schemes to predict genomic annotations, in: Proc. IEEE Int. Conf. Bioinformatics Bioeng. (BIBE 2013), IEEE, 2013, 92, pp. 1–4. [39] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022. [40] M. Bicego, P. Lovato, B. Oliboni, A. Perina, Expression microarray classification using topic models, in: Proc. ACM Symp. Appl. Comput., ACM, 2010, pp. 1516–1520. [41] A. Perina, P. Lovato, V. Murino, M. Bicego, Biologically-aware latent Dirichlet allocation (balda) for the classification of expression microarray, in: Pattern Recognition in Bioinformatics, Springer, 2010, pp. 230–241. [42] T. Griffiths, Gibbs Sampling in the Generative Model of Latent Dirichlet Allocation, vol. 518(11), Standford University, 2002, pp. 1–3. [43] G. Casella, E.I. George, Explaining the Gibbs sampler, Am. Stat. 46 (3) (1992) 167–174. [44] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, M. Welling, Fast collapsed Gibbs sampling for latent Dirichlet allocation, in: Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDDM 2008), ACM, 2008, pp. 569–577. [45] P. Pinoli, D. Chicco, M. Masseroli, Latent Dirichlet Allocation based on Gibbs Sampling for gene function prediction, in: Proc. IEEE Symp. Comput. Intell. Bioinf. Comput. Biol. (CIBCB 2014), IEEE, 2014, pp. 1–8. [46] L. Cheng, H. Lin, Y. Hu, J. Wang, Z. Yang, Gene function prediction based on the Gene Ontology hierarchical structure, PLOS ONE 9 (9) (2014) e107187. [47] D. Stojanova, M. Ceci, D. Malerba, S. Dzeroski, Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction, BMC Bioinf. 14 (2013) 285. [48] O.G. Troyanskaya, K. Dolinski, A.B. Owen, R.B. Altman, D. Botstein, A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae), Proc. Natl. Acad. Sci. U. S. A. 100 (14) (2003) 8348–8353. [49] Z. Barutcuoglu, R.E. Schapire, O.G. Troyanskaya, Hierarchical multi-label prediction of gene function, Bioinformatics 22 (7) (2006) 830–836. [50] S. Raychaudhuri, J.T. Chang, P.D. Sutphin, R.B. Altman, Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature, Genome Res. 12 (1) (2002) 203–214. [51] A.J. Pérez, C. Perez-Iratxeta, P. Bork, G. Thode, M.A. Andrade, Gene annotation from scientific literature using mappings between keyword systems, Bioinformatics 20 (13) (2004) 2084–2091. [52] G. Domeniconi, M. Masseroli, G. Moro, P. Pinoli, Discovering new gene functionalities from random perturbations of known gene ontological annotations, in: Proc. Int. Conf. Knowl. Discov. Inf. Retr. (KDIR 2014), SCITEPRESS, 2014, pp. 107–116. [53] G. Domeniconi, G. Moro, R. Pasolini, C. Sartori, Iterative refining of category profiles for nearest centroid cross-domain text classification, in: Knowl. Disc., Knowl. Eng. Knowl. Manage., vol. 553 of Commun. Comput. Inf. Sci., Springer, 2015, pp. 50–67. [54] G. Domeniconi, G. Moro, A. Pagliarani, R. Pasolini, Markov chain based method for in-domain and cross-domain

34

[55] [56]

[57]

[58]

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 2 6 ( 2 0 1 6 ) 20–34

sentiment classification, in: Proc. Int. Conf. Knowl. Discov. Inf. Retr. (KDIR 2015), SCITEPRESS, 2015. J. Tanoue, M. Yoshikawa, S. Uemura, The GeneAround GO viewer, Bioinformatics 18 (12) (2002) 1705–1706. G. Domeniconi, G. Moro, R. Pasolini, C. Sartori, A study on term weighting for text categorization: a novel supervised variant of tf.idf, in: Proc. Int. Conf. Data Manage. Technol. Appl. (DATA 2015), SCITEPRESS, 2015, pp. 26–37. G. Domeniconi, M. Masseroli, G. Moro, P. Pinoli, Random perturbations of term weighted gene ontology annotations for discovering gene unknown functionalities, in: Knowl. Disc., Knowl. Eng. Knowl. Manage., vol. 553 of Commun. Comput. Inf. Sci., Springer, 2015, pp. 181–197. A. Canakoglu, G. Ghisalberti, M. Masseroli, Integration of genomic, proteomic and biomolecular interaction data to support biomedical knowledge discovery, in: Proc. Int. Meet Comput. Intell. Methods Bioinf. Biostat. (CIBB 2011), vol. 8, 2011, 6, pp. 1–10.

[59] Genomic and Proteomic Knowledge Base. URL http://www.bioinformatics.deib.polimi.it/GPKB/ (accessed on 30.5.15). [60] D. Croft, A.F. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, P. Garapati, M. Gillespie, M.R. Kamdar, et al., The Reactome pathway knowledgebase, Nucleic Acids Res. 42 (Database issue) (2014) D472–D477. [61] Reactome Project. Computational Inferred Events. URL http://www.reactome.org/pages/documentation/ electronically-inferred-events/ (accessed on 30.5.15). [62] Q.P. Vong, K. Cao, H.Y. Li, P.A. Iglesias, Y. Zheng, Chromosome alignment and segregation regulated by ubiquitination of survivin, Science 310 (5753) (2005) 1499–1504. [63] V. van Noort, B. Snel, M.A. Huynen, Predicting gene function by conserved co-expression, Trends Genet. 19 (5) (2003) 238–242.