Biochimica et Biophysica Acta 1854 (2015) 46–54
Biochimica et Biophysica Acta journal homepage: www.elsevier.com/locate/bbapap
Review
Functional annotation and biological interpretation of proteomics data
Carolina M. Carnielli, Flavia V. Winck, Adriana F. Paes Leme ⁎
Laboratório de Espectrometria de Massas, Laboratório Nacional de Biociências, LNBio, CNPEM, Campinas, Brazil
Article info
Article history: Received 11 September 2014; Received in revised form 7 October 2014; Accepted 21 October 2014; Available online 31 October 2014
Keywords: Bioinformatics; Ontology; Annotation; Network; Systems biology; Proteomics
Abstract
Proteomics experiments often generate a vast amount of data. However, the simple identification and quantification of proteins from a cell proteome or subproteome is not sufficient for the full understanding of the complex mechanisms occurring in biological systems. Therefore, the functional annotation analysis of protein datasets using bioinformatics tools is essential for interpreting the results of high-throughput proteomics. Although the amount of large-scale proteomics data has rapidly increased, the biological interpretation of these results remains a challenging task. Here we review basic concepts and different programs that are commonly used in the functional annotation of proteomics data, emphasizing the main strategies focused on the use of gene ontology annotations. Furthermore, we explore the characteristics of some tools developed for functional annotation analysis, concerning their ease of use and typical caveats of ontology annotations. The utility of the different tools and the variations between them were assessed by comparing the outputs generated for an example proteomics dataset. © 2014 Elsevier B.V. All rights reserved.
1. Introduction
Proteomics encompasses a broad range of high-throughput technologies that allow the identification and quantification of proteins in complex biological samples. Quantitative proteomics approaches rely on the ability to detect small changes in protein abundance in an altered state relative to a control or reference condition. Thus, the differences between two or more physiological states of a biological system can be expressed as an absolute protein quantification, by determining the exact protein amount or concentration, or as a relative quantification, in which the amount of a protein is defined as a fold change relative to the control sample, indicating the up- or down-regulation of that protein [1,2]. Proteomics approaches have been extensively applied in biomedical research for the understanding of diseases, including protein-based biomarker discovery for the early detection and monitoring of different types of cancer [3,4], the analysis of abnormal protein phosphorylation patterns associated with diseases [5,6], such as Alzheimer's [7], and the identification of therapeutic targets [8,9], among others. However, mass spectrometry-based proteomics often generates large lists of identified proteins whose interpretation is a challenging task in the field. In order to handle proteomics data, biostatistics and bioinformatics tools have become indispensable to interpret biological data and to extract the
⁎ Corresponding author at: Laboratório Nacional de Biociências (LNBio), Centro Nacional de Pesquisa em Energia e Materiais (CNPEM), 13083970 Campinas, Brazil. Tel.: +55 19 3512 1118; fax: +55 19 3512 1006. E-mail addresses: [email protected] (C.M. Carnielli), [email protected] (F.V. Winck), [email protected] (A.F. Paes Leme).
http://dx.doi.org/10.1016/j.bbapap.2014.10.019 1570-9639/© 2014 Elsevier B.V. All rights reserved.
biological relevance from the vast amount of identified proteins [10]. Thus, protein functional annotation through computational tools now occupies a place as important as the protein identification itself. Since the advent of shotgun proteomics, many Bioinformatics tools have been developed to provide methodologies for functional annotation of proteomics data. Typical approaches for data interpretation for organisms without an annotated genome include mainly the automated protein annotation as a first step in the data analysis workflow. Protein domains, protein family, subcellular localization and biological function are predicted based on sequence similarity searches [11–14]. Once the protein sequences are functionally annotated, several other tools must be applied to the search for functional patterns and overrepresentation of biological functions or processes in a protein dataset from qualitative or quantitative proteomics data. Further steps in the analysis usually include pathway analysis and the prediction of interaction networks, which are generated through integration of different biological layers of information, such as gene expression and co-expression patterns, protein–protein interactions and protein expression data. Moreover, visualization tools largely contribute to localize the presence of targeted proteins within cellular biological pathways, signaling cascades and metabolic pathways being the most represented ones in proteomic studies. A variety of commercial and open-source bioinformatics tools for the analysis of proteomics data and statistical tests have been developed. However, with the increased amount of proteomics data new challenges in data handling, analysis and visualization push forward the development of the field of computational proteomics. In order to give an overview of tools and approaches currently applied in proteomics functional annotation, we reviewed and discussed different approaches,
computational programs and strategies recently applied for data interpretation, and how different aspects of the analysis can modify the outcome of proteomics studies.
2. Biological meaning of large proteomics datasets through gene ontology-based annotation approaches
The prediction of the functional role of identified proteins in a biological event involves a first step of gathering information, a task that must be performed before the actual biological data interpretation is achieved and that may include genome and proteome annotations. Many tools have been developed to mine several databases of biological information and to predict a protein function based on sequence similarities. Detailed strategies for genomics and proteomics sequence annotation can be found in previous publications [11–17]. Once the genome and proteome are annotated, one of the most widespread strategies for the functional annotation of proteomics data is the use of ontologies, which can be understood as an explicit specification of a conceptualization [18]. Usually, ontologies are designed with hierarchical classes that communicate definitions with clarity and objectivity while remaining extensible. In Biology, the ontologies for genes and proteins usually describe the classification of the molecules according to their role in biological systems, using a controlled vocabulary, which permits the analysis of relationships between the ontology terms through data integration, retrieval and functional annotation of large datasets [19–22]. In this scenario, Ashburner et al. developed a controlled vocabulary applicable to all eukaryotes, giving rise to the Gene Ontology (GO) Consortium [23], with the aim of overcoming the lack of interoperability of genomic databases resulting from the divergent nomenclature of genes and proteins.
Every gene or protein can thus be described by a finite number of vocabulary terms, which are classified into one of the three GO categories or domains: biological process, molecular function or cellular component [23]. It is noteworthy that GO terms are organized in a hierarchy, with more general terms at the highest levels and more specific terms at lower levels. Moreover, a GO term of lower hierarchical level (child term) can be related to one or more terms of higher hierarchical levels (parent terms), which can be traced up to one or more of the GO root terms, which correspond to the three GO domains (biological process, cellular component or molecular function). For instance, if a gene is annotated to ‘actin filament bundle organization’, it is also implicitly annotated to the parent terms further up the hierarchy, which include ‘actin filament organization’, ‘cytoskeleton organization’ and ‘organelle organization’ (Fig. 1). Thus, more information can be retrieved from parent terms, which increases the knowledge available when making inferences about gene function. On the other hand, researchers must consider that GO annotations can be redundant, i.e., a term can be associated with one gene or gene product by more than one annotation. In a recent study, Gillis and Pavlidis [24] found that GO annotations are stable over short periods of time, with losses of semantic similarity for 3% of the genes annotated between monthly GO editions. Thus, some genes undergo changes in their ‘functional identity’ over time as a result of annotation updates, resulting in loss of semantic similarity matching. Additionally, they presented a way to quantify the stability of GO annotations over time and showed that, over a moderate period of time, many genes undergo changes in their annotated functionality.
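The parent and child relationships described in this section can be illustrated with a minimal Python sketch. The graph below is a deliberately simplified, hypothetical fragment built only from the term names mentioned in the text; the exact edges, the gene name `geneA` and the helper names `ancestors` and `propagate` are assumptions for illustration and do not reproduce the real GO graph, in which a term may have several parents.

```python
# Toy fragment of the GO graph: each term maps to its direct parent terms.
# The edges are a simplified assumption for illustration only.
PARENTS = {
    "actin filament bundle organization": ["actin filament organization"],
    "actin filament organization": ["cytoskeleton organization"],
    "cytoskeleton organization": ["organelle organization"],
    "organelle organization": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand direct annotations so each gene also carries all ancestor terms."""
    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t, parents)}
        for gene, terms in annotations.items()
    }


# Hypothetical gene with a single direct annotation.
direct = {"geneA": {"actin filament bundle organization"}}
expanded = propagate(direct)
```

In this toy example `expanded["geneA"]` contains the direct term plus its four ancestors, which is the sense in which a specific annotation also carries the more general ones.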
Thus, modifications of gene ontologies may influence the results of the functional annotation of experimental data [25]. Although changes in GO annotations are non-uniformly distributed over different branches of the ontology, the results of term-enrichment analyses were found to be stable [25]. In order to observe and to demonstrate how different versions of the GO annotations may affect the final interpretation of a proteomics
Fig. 1. Relationships in the hierarchical distribution of GO terms. An example of the hierarchical organization of GO annotations is shown for the GO term “actin filament bundle organization”, together with the relationships and intermediate GO terms from the root term (biological process) to the most specific child GO term (actin filament bundle organization) (Figure adapted from QuickGO, http://www.ebi.ac.uk/QuickGO).
dataset, we compared the results of GO term enrichment analysis performed with the BiNGO v.3.0.2 app [26] in Cytoscape v.3.0.1 [27] on a previously published example proteomics dataset [8]. We used identical parameters for every analysis, and changes in the significantly overrepresented terms were evaluated by comparing the lists of the Top 10 most significant overrepresented GO terms. The Top 10 lists retrieved using GO annotation files from 2011, 2012, 2013 and 2014 differed between annotation files (Supplemental Table 1). However, 60% of the GO terms appeared consistently in the Top 10 lists, which indicates some drift among versions of the GO annotation while most of the top-ranked annotations remain stable over time. Thus, it is imperative to perform functional annotation analysis with the most recent version of the GO annotation and ontology files. Moreover, it is important that novel approaches to functional annotation are integrated into dynamic data analysis, allowing timely updating of annotation files to facilitate and improve the interpretation of published proteomics datasets. A detailed description of the parameters and of the GO association and gene ontology files used in this comparative analysis is available in Supplemental Table 1. Furthermore, knowing what has been modified between different versions of the ontology can be very useful. The web service CODEX (Complex Ontology Diff Explorer) was developed to allow users to verify which changes were made in a given version of the ontology [28]. In any case, it is crucial to report, in an ontology-based study, which version of the gene annotation was used, in order to track alterations in functional annotation due to the time dependence of GO results.
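The Top 10 comparison across annotation releases described above amounts to a set intersection over ranked lists. The following Python sketch shows one way to compute it; the function names and the placeholder term names are hypothetical, and the two lists are invented so that the arithmetic is easy to follow. They are not the terms reported in Supplemental Table 1.

```python
def top_n_overlap(list_a, list_b, n=10):
    """Fraction of the first n terms of list_a also found in the first n of list_b."""
    top_a, top_b = set(list_a[:n]), set(list_b[:n])
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)


def consistent_terms(ranked_lists, n=10):
    """Terms present in the top n of every ranked list."""
    tops = [set(lst[:n]) for lst in ranked_lists]
    return set.intersection(*tops) if tops else set()


# Hypothetical ranked outputs for two annotation releases (placeholder names).
release_old = [f"term_{i}" for i in range(10)]      # term_0 .. term_9
release_new = [f"term_{i}" for i in range(4, 14)]   # term_4 .. term_13

shared = consistent_terms([release_old, release_new])
fraction = top_n_overlap(release_old, release_new)
```

With these placeholder lists, six of the ten top-ranked terms are shared, so `fraction` is 0.6. Reporting the annotation release alongside such a figure keeps the comparison reproducible.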
GO annotations can be applied to perform a functional profiling of processes which might be different in a particular set of genes, to predict gene function or to categorize genes in ontology terms [29]. Therefore, the identification of overrepresented categories or enrichment analysis can be performed based on ontologies, contributing to functionally
characterize protein sets. Bioinformatics tools for such analysis map the input data, a protein or gene list (experimental data), to the associated GO annotation and identify statistically overrepresented annotations [30]. The overrepresented GO terms are ranked by p-value and can then be biologically interpreted by the researcher. The GO database is constructed with the most recent version of the ontology, and new annotation files are contributed by members of the GO Consortium. Thus, the GO undergoes frequent revisions to add new relationships and terms or to remove obsolete ones. However, not all proteins have been completely annotated to GO terms. Thus, it is crucial to use the most recent available ontology versions when performing functional annotation analysis. Each new annotation in the GO is linked to its source, and a database entry is attributed to it. The source can be a literature reference, a database reference or a computational analysis [31]. However, the most important attribute of an annotation is the evidence code, which records the information supporting the annotation of a gene to a particular term. GO annotations include evidence codes from four categories: experimental, computational, indirectly derived from both, or unknown [23]. The vast majority of available GO annotations are assigned using computational methods (more than 95%, see [29]); however, several studies disregard annotations without manual curation, i.e., those that are inferred from electronic annotation (IEA). These IEA annotations are generally thought to be of lower quality than manually curated ones and should be used with caution, since they probably contain a higher proportion of false positives [29]. On the other hand, IEA annotations are useful to provide a first insight into which GO terms are associated with the analyzed dataset.
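Separating annotations by evidence code, as discussed above, is straightforward once each annotation is represented as a record. The Python sketch below is illustrative only: the gene names and the record layout are hypothetical, while the evidence codes and GO identifiers are those defined by the GO Consortium.

```python
from collections import namedtuple

# Minimal record for one annotation; real association files carry more fields.
Annotation = namedtuple("Annotation", ["gene", "go_term", "evidence"])

# Experimental evidence codes defined by the GO Consortium.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def split_by_evidence(annotations):
    """Partition annotations into experimental, electronic (IEA) and other."""
    groups = {"experimental": [], "electronic": [], "other": []}
    for ann in annotations:
        if ann.evidence in EXPERIMENTAL_CODES:
            groups["experimental"].append(ann)
        elif ann.evidence == "IEA":
            groups["electronic"].append(ann)
        else:
            groups["other"].append(ann)
    return groups


# Hypothetical annotations for illustration.
records = [
    Annotation("geneA", "GO:0005515", "IPI"),  # protein binding, experimental
    Annotation("geneA", "GO:0005634", "IEA"),  # nucleus, electronic
    Annotation("geneB", "GO:0006915", "IEA"),  # apoptotic process, electronic
    Annotation("geneB", "GO:0005634", "ISS"),  # nucleus, sequence similarity
]

groups = split_by_evidence(records)
curated_only = groups["experimental"] + groups["other"]
```

Running an enrichment analysis once with all records and once with `curated_only` is a simple way to see how strongly a result depends on electronic annotations.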
Considering that most annotations are inferred electronically, a methodology to systematically and quantitatively evaluate IEA annotations was previously developed [32]. Based on the releases of the UniProt Gene Ontology Annotation database (UniProt-GOA), the largest contributor of electronic annotations to the GO Consortium [33], the authors used experimental annotations in newer annotation releases to confirm or reject electronic annotations from older releases. Thus, for quality measurement, they evaluated the proportion of electronic annotations that were confirmed by new experimental annotations (reliability), the power of electronic annotations to predict experimental annotations (coverage), and how informative the predicted GO terms are (specificity). They found that the reliability and the specificity of IEA annotations have improved in recent years compared to the annotations performed by curators [32]. Furthermore, considering that experimental verification would be extremely expensive even for a small subset of annotations, this method makes it possible to identify the subset of electronic annotations that can be considered reliable in a study. Ontology-based functional annotation, therefore, constitutes a valuable approach for proteomics data analysis.
3. Functional enrichment analysis for identification of overrepresented biological mechanisms
In proteomic studies, the protein identity constitutes the most important information for further biological annotation of large data sets. Furthermore, proteins identified through shotgun proteomics approaches may be related to a broad diversity of biological functions and may have a role in many different biological pathways. Moreover, variations in protein expression levels may give indications about alterations of cellular mechanisms, such as changes resulting from the development of pathological processes.
To identify and prioritize these associations, a wide range of bioinformatics tools have been developed in recent years. Some of these tools use enrichment analysis to identify the biological information overrepresented in protein lists and also provide network visualization of the biological interactions. Enrichment analysis based on GO terms allows the functional classification and the detection of the most represented biological annotations of a protein set, based on statistical methods. This approach increases the probability of identifying the most pertinent biological processes related to the biological mechanism under study. The underlying idea is that, if a biological process is altered in a study, the genes involved in it should have a higher (enriched) potential to be selected as a biologically relevant group by high-throughput screening technologies. The goal of this approach is to summarize the biological processes and pathways that are most likely related to a given biological condition [34]. For further discussion of the aspects of ontologies and annotations that should be considered when performing GO enrichment analysis, see Rhee et al. [29]. Enrichment can be quantitatively measured by statistical methods, including the Chi-square test, Fisher's exact test, binomial probability and the hypergeometric distribution [34]. Since the significance test is performed for many groups, a multiple testing correction must be carried out in order to limit false positives, commonly known as type I errors. The expected number of false positives is proportional to the number of tests performed and to the critical significance level (p-value cutoff). Thus, the enrichment p-values for the annotations must be corrected with statistical methods, such as Bonferroni or Benjamini–Hochberg, which adjust the individual p-value of each test so that the overall error rate (the family-wise error rate for Bonferroni, the false discovery rate for Benjamini–Hochberg) is kept at or below the cutoff set by the user [35]. Several tools are available to accomplish GO-term analysis. An extensive list of GO-related tools can be found at http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools. Examples of computational tools for functional annotation of proteomics data are listed in Table 1. Huang et al.
[34] performed a survey of the principal aspects of enrichment analysis and identified 68 bioinformatics enrichment tools, classifying them into three classes according to the main algorithms implemented: Singular Enrichment Analysis (SEA), Gene Set Enrichment Analysis (GSEA), and Modular Enrichment Analysis (MEA) [34,36]. SEA is the most used strategy: the user selects a list of interesting proteins (e.g. proteins differentially expressed between two experimental conditions), and the tool generates a list of statistically significant overrepresented annotations. Of note, different tools for enrichment analysis may generate different results for the functional annotation of proteomics data. For comparative purposes, we performed the enrichment analysis for an example proteome dataset [8] using six different tools and compared the most significant overrepresented terms. The tools BiNGO, DAVID [37], GeneSetDB [38], FuncAssociate [39], GoMiner [40] and BioMart [41] were compared in this study. The Top 10 most significant terms are shown in Supplemental Table 2. All six tools are based on SEA algorithms, which is currently the most widespread type of analysis. Using the default parameter settings offered by each tool for the enrichment analysis, the resulting Top 10 lists of the most significant overrepresented GO terms were compared between the tools. No pair of tools produced identical Top 10 lists. The tools that presented the most similar results were BiNGO [26] and DAVID, which shared 80% of the GO terms in the Top 10 list, followed by DAVID and GoMiner, which shared 70% of the GO terms in the Top 10 lists. This is not an indication that one tool is better or worse than the others, but it implies that different tools may, and probably will, generate different outcomes, which may then modify the biological interpretation of the proteomics data.
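The SEA strategy and the multiple testing correction described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any of the tools compared above: the function names, the gene identifiers and the two terms are hypothetical, the test is a one-sided hypergeometric test, and the correction is Benjamini–Hochberg.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def singular_enrichment(study, term_to_genes, population):
    """Rank terms by hypergeometric p-value; returns (term, k, p, adjusted p)."""
    population = set(population)
    study = set(study) & population
    rows = []
    for term, genes in term_to_genes.items():
        genes = set(genes) & population
        k = len(study & genes)
        rows.append((term, k, hypergeom_pvalue(k, len(study), len(genes), len(population))))
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    return sorted(((t, k, p, q) for (t, k, p), q in zip(rows, adjusted)),
                  key=lambda row: row[2])


# Hypothetical toy data: 20 proteins, two terms, five proteins of interest.
population = [f"g{i}" for i in range(20)]
term_to_genes = {
    "term_x": [f"g{i}" for i in range(0, 5)],
    "term_y": [f"g{i}" for i in range(5, 15)],
}
study = ["g0", "g1", "g2", "g3", "g10"]

ranked = singular_enrichment(study, term_to_genes, population)
```

In this toy case four of the five study proteins carry `term_x`, which therefore ranks first. Differences between real tools often come from choices this sketch fixes arbitrarily, such as the background population, the test, and the correction method.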
Different from the SEA tools, the GSEA algorithms take into consideration quantitative experimental data, such as protein fold change without pre-selection of a candidate protein list. Finally, MEA tools inherit the basic enrichment calculation of SEA strategy, but also consider term–term and gene–gene relationships to determine the enrichment p-value. Therefore, this classification helps to select the tools that better integrate into the different strategies of proteomics functional annotation.
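To make the contrast with SEA concrete, the sketch below computes a running-sum enrichment score over a list of proteins ranked by fold change, without any pre-selection cutoff. It is a simplified, unweighted variant in the spirit of GSEA and is only an illustration: the function name, the protein identifiers and the fold-change values are hypothetical, and the permutation step that real tools use to assess significance is omitted.

```python
def enrichment_score(ranked_proteins, protein_set):
    """Unweighted running-sum score over a ranked list.

    The running sum rises at each member of protein_set and falls otherwise;
    the score is the largest deviation from zero along the list.
    """
    hits = [p in protein_set for p in ranked_proteins]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical log2 fold changes for six proteins.
log2_fc = {"p1": 2.5, "p2": 1.8, "p3": 0.4, "p4": -0.1, "p5": -1.2, "p6": -2.0}
ranked = sorted(log2_fc, key=log2_fc.get, reverse=True)

score_top = enrichment_score(ranked, {"p1", "p2"})     # set at the top of the list
score_bottom = enrichment_score(ranked, {"p5", "p6"})  # set at the bottom
```

A positive score indicates that the set is concentrated among the most up-regulated proteins and a negative score that it is concentrated among the most down-regulated ones.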
Table 1. Computational tools for functional annotation.
Tool | URL | Comments | Citation
Gene ontology
AmiGO | http://amigo.geneontology.org/cgi-bin/amigo/go.cgi | Search engine of Gene Ontology annotation | [68]
BiNGO | http://www.psb.ugent.be/cbd/papers/BiNGO/Home.html | Tool for enrichment analysis on Cytoscape | [26]
DAVID | http://david.abcc.ncifcrf.gov/ | Meta-tool for functional analysis of large gene lists | [69]
GeneMANIA | http://genemania.org/ | Tool to identify the most related genes to a query gene | [59]
Pathway analysis
KEGG | http://www.genome.jp/kegg/ | Database resource for pathway analysis | [55]
Reactome | http://www.reactome.org/ | Tool for pathway analysis | [70]
MetaCore | http://thomsonreuters.com/metacore/ | Commercial tool for pathway analysis and data mining |
Interaction networks
Cytoscape | http://www.cytoscape.org/ | Open-source software for integration, visualization and analysis of biological networks | [27]
IIS (Integrated Interactome System) | http://bioinfo03.ibi.unicamp.br/lnbio/IIS2/index.php | Integrative platform for the annotation, analysis and visualization of the interaction profiles of proteins/genes, metabolites and drugs of interest | [52]
STRING | http://string.embl.de | Database tool for direct or indirect protein interactions | [71]
Enrichr | http://amp.pharm.mssm.edu/Enrichr/index.html | Integrative software to rank enriched terms, interactive visualization for enrichment results display | [72]
Text mining
iHOP | http://www.ihop-net.org/ | Tool for searching textual information from PubMed for a given gene | [73]
Chilibot | http://www.chilibot.net/ | PubMed literature database about specific relationships between proteins and genes | [74]
BioText search engine | http://biosearch.berkeley.edu/index.php?action=logout | Search engine to access scientific literature | [75]
PPI Finder | http://liweilab.genetics.ac.cn/tm/ | Human Protein–Protein Interaction Mining Tool | [76]
PIE the search | http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/PIE/index.html | Tool for searching PubMed literature on protein interaction information | [77]
Moreover, a new annotation method, Protein Set Enrichment Analysis (PSEA), was developed for quantitative proteome analysis to assess pathway-level patterns in expression changes [42]. As a modification of GSEA, PSEA uses the complete quantitative profiles of all identified proteins with no arbitrary cutoffs. A previous study applied this method and not only found many confirmatory biological findings for breast epithelial malignancy, but also revealed significant changes in the downstream targets of a number of common transcription factors, suggesting a role for specific gene regulatory pathways in breast tumorigenesis [42]. Nevertheless, since enrichment analysis ranks a list of annotations based on p-values [34], the researcher must be critical in order to discriminate terms that are poorly related to the study. Significant overrepresented GO terms tend to be overestimated when using Fisher's exact test if the protein set has a high number of GO annotations. As Fisher's test relies on the assumption that all genes have an equal probability of being selected under the null hypothesis, a certain level of false positives is expected, with highly overlapping annotations being considered overrepresented in a dataset. New approaches accounting for this implicit bias have recently been proposed, including Annotation Enrichment Analysis, which accounts for the non-uniformity of the number of functions associated with individual genes and of the number of genes annotated to an individual function [43]. Furthermore, literature analysis using text mining tools for information retrieval [36,44] may be very useful to help researchers explore and access the scientific literature [45] by pre-selecting specific parameters (e.g., keywords, gene names) for data searching and analysis. For an outlook on literature retrieval and text mining, see Manconi et al. [46].
4. Integration of functional annotations through biological network analysis
In the past years, several models of biological networks have been generated by computer science in order to aid the visualization of results from the simulation of biological systems, such as biochemical reaction [47] and protein interaction [48] networks. The aims
of such models in biological research are to (1) systematically interrogate and experimentally verify knowledge of a pathway, (2) manage the complexity of cellular components and interactions, and (3) provide an outlook on the properties and consequences of pathway alterations [49]. With the large amount of “omics” data generated by experimental technologies, the challenge is now to integrate and deeply explore this information in order to provide a global understanding of biological functions and systems. Several tools have been developed in order to process and analyze the resulting large-scale datasets, such as proteome and transcriptome data. A common computational approach includes the integration of the overrepresented annotations calculated through enrichment analysis, with results usually displayed as a network in which nodes indicate the molecular entities (proteins, transcripts or genes) and edges represent the different types of relationships between the nodes (e.g. co-expression, co-localization, physical interactions); alternatively, the network may represent processes or hierarchical connections between ontology terms. Among the programs used for proteomics data interpretation, Cytoscape is one of the best-known open-source software platforms for the visualization of biological networks generated from high-throughput expression data. Several tools (apps) have been further developed to handle biological pathways in combination with the Cytoscape visualization structure, allowing the integration of networks containing functional annotations with gene expression data, protein abundance levels and other biological information [50]. Thus, the software platform provides features for data visualization and integration, with apps contributed by the community. The list of apps can be accessed and installed online from the Cytoscape App Store (http://apps.cytoscape.org/) or via the app manager within Cytoscape.
In this scenario, the applications ClueGO [51] and BiNGO [26] within the Cytoscape suite, the Integrated Interactome System (IIS) [52] and MetaCore (MetaCore TM, http://thomsonreuters.com/metacore/) appear as interesting tools and platforms for the functional annotation analysis of proteomics data. Each of these tools has a distinct set of features for functional annotation analysis of proteomics data and visualization of biological networks.
However, many software and web tools accept only specific types of protein identifiers as input data. Protein names do not have standardized identifiers and may differ between databases or between versions of the same database. Protein databases such as UniProt [53] and Ensembl [54] are commonly used in proteomic studies and are very useful for protein name normalization. Another possibility to overcome the ambiguity of protein identifiers is to use the gene names corresponding to the proteins of interest. Nevertheless, in some cases, the use of gene names in functional annotation may lead to ambiguous annotations. For a better understanding of bioinformatics services for the normalization of protein and gene identifiers from large-scale proteomics data sets, see Malik et al. [30]. Once the protein or gene list has been normalized and checked for redundancies, the proteome data can be analyzed using different software for data interpretation. The ClueGO app can improve the biological interpretation of large lists of genes by using Gene Ontology (GO) terms [23] and KEGG/BioCarta pathways [55] for statistical analysis, resulting in a functionally organized GO/pathway term network. In addition to GO term enrichment analysis, ClueGO offers the possibility to perform a depletion analysis, which can be carried out individually or together with the enrichment test. It is also possible to select which statistical method will be applied for p-value correction. For a deeper discussion of the different approaches used to determine significant enrichments and/or depletions of GO categories in a gene dataset, see Rivals et al. [56]. When GO terms are selected, users can set the interval of GO hierarchical levels to be considered in the analysis, depending on whether specific or general terms are desired in the interpretation of the study.
Other parameters found in ClueGO include the minimum number of genes required to consider a term as enriched in the input list, and the percentage that these genes must represent of the total number of genes annotated to that term. Among its features, it is also possible to analyze one or two gene lists for cluster comparison, with the functional terms visualized as groups in a network. One may also consider fusing related GO terms that have similar sets of associated genes in order to reduce the redundancy of the annotation, and grouping the terms based on the GO hierarchy or on the Kappa score (kappa statistics of the association strength between the terms). Here, we explored the effects of changing the parameter settings of the enrichment analysis performed by the ClueGO app v.2.1.2 on the determination of overrepresented GO terms in an example proteomics dataset from tumor tissues [8]. For this analysis, the cluster comparison mode was selected in order to visualize the enriched GO terms in the up- and down-regulated proteins of the dataset. Two intervals of GO hierarchical levels were considered separately (the default interval of GO levels 3–8 and the broader interval of GO levels 0–20), and the GO term fusion setting was applied while testing different minimum and maximum numbers of genes and gene percentages in the enrichment analysis. The number of significant (p < 0.05) overrepresented GO terms generated by ClueGO with the GO term fusion setting on was compared with the results of the analysis with this setting off (Supplemental Table 3). As expected, a reduction in the number of terms was observed when applying GO term fusion. Moreover, the resulting lists of the Top 10 most significant biological process and cellular component GO terms of these analyses are shown in Supplemental Table 4, while the Top 10 most significant molecular function GO terms are shown in Supplemental Table 5.
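The two kinds of parameters discussed above, a minimum gene count and percentage per term and a kappa score between terms, can be illustrated with the following Python sketch. It is not ClueGO's code: the function names and threshold values are hypothetical, and the kappa score is computed as Cohen's kappa on the binary membership of each gene in the two terms, which is assumed here to be a reasonable reading of "kappa statistics of the association strength between the terms".

```python
def passes_term_filter(study_genes, term_genes, min_genes=3, min_percent=4.0):
    """Keep a term only if enough study genes map to it.

    min_genes and min_percent are illustrative thresholds, not tool defaults.
    """
    term_genes = set(term_genes)
    overlap = len(set(study_genes) & term_genes)
    percent = 100.0 * overlap / len(term_genes) if term_genes else 0.0
    return overlap >= min_genes and percent >= min_percent


def kappa_score(genes_a, genes_b, universe):
    """Cohen's kappa for the agreement of two terms over a gene universe."""
    all_genes = set(universe)
    a_set = set(genes_a) & all_genes
    b_set = set(genes_b) & all_genes
    n = len(all_genes)
    both = len(a_set & b_set)
    only_a = len(a_set - b_set)
    only_b = len(b_set - a_set)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical universe of ten genes and two overlapping terms.
universe = [f"g{i}" for i in range(10)]
term_a = ["g0", "g1", "g2", "g3"]
term_b = ["g0", "g1", "g2", "g4"]

kappa = kappa_score(term_a, term_b, universe)
keep_a = passes_term_filter(["g0", "g1", "g2", "g7"], term_a)
```

Terms whose kappa score exceeds a chosen threshold could then be grouped together, which is one way redundancy between closely related terms is reduced.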
Considering the analysis performed using the same intervals of GO hierarchical levels, only a few GO terms were observed consistently in the Top 10 list of overrepresented GO terms, considering the comparison of the results obtained for the different enrichment parameters tested. However, for those terms which appeared in all Top 10 lists, the p-values were very similar, as expected. Another Cytoscape app used to determine Gene Ontology (GO) terms overrepresented in a gene set is BiNGO. This plugin maps the predominant functional GO terms of the tested gene set on the GO
hierarchy and displays the results as a network. An interesting feature of BiNGO is the possibility to use custom ontologies and annotations. We therefore explored this capability and performed a GO term enrichment analysis using a customized reference GO annotation file which included the GO annotation of the proteins identified in our example proteomics dataset [8]. We compared the results of the enrichment analysis performed by BiNGO using our customized reference GO annotation with the results of the enrichment analysis performed with the whole proteome GO annotation file as reference set. The input data used in this comparison was the list of the gene names of the proteins differentially expressed in our example proteomics study. The list of the Top 10 most significant GO terms of the analysis using the customized GO annotation file is shown in Supplemental Table 6. All terms found as overrepresented in the analysis with the custom annotation were also found in the analysis with the whole annotation. However, the p-values of the significant overrepresented GO terms were much higher when the customized GO annotation was used as reference set. This result was expected, since the p-value of the enrichment analysis is calculated relative to the subset of proteins taken as reference. On the other hand, we may infer that the enrichment analysis performed with the customized GO annotation indicates, through its p-values, that there seems to be no GO term significantly enriched in the dataset, contrasting with the very low p-values found in the enrichment analysis performed with the whole GO annotation as reference set. 
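The dependence of the enrichment p-value on the reference set can be made concrete with a hypergeometric test, one of the tests commonly used for GO overrepresentation. The sketch below uses only the Python standard library; the function name and all counts are hypothetical and were chosen for illustration, not taken from the example dataset.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: genes in the reference set
    K: reference genes annotated to the GO term
    n: genes in the study list
    k: study genes annotated to the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: the same 100-protein study list with 10 hits for a term.
# Whole-proteome reference: 400 of 20000 genes carry the term (2 hits expected).
p_whole = hypergeom_enrichment_p(k=10, n=100, K=400, N=20000)
# Custom reference of identified proteins: 80 of 1000 carry the term (8 expected).
p_custom = hypergeom_enrichment_p(k=10, n=100, K=80, N=1000)
print(f"whole proteome reference: p = {p_whole:.2e}")
print(f"custom reference:         p = {p_custom:.2e}")
```

With these invented numbers the term is strongly overrepresented against the whole proteome but not against the custom reference, because the term is already frequent among the identified proteins. This mirrors the behavior described above.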
These findings suggest that high-throughput data, such as those from transcriptome and proteome analyses, could be further included in the process of functional annotation; however, the number of proteins used to build the reference GO annotation file must be large enough to represent close to the totality of the proteins expressed at a specific moment and under a specific experimental condition. Another app available through Cytoscape, called GeneMANIA, can predict interactions in which a gene or protein cluster participates, and it generates networks from currently available data [57–59]. GeneMANIA creates an interaction network for the query list and, by a guilt-by-association approach, integrates similar genes that are identified using more than 800 annotated networks from six organisms available in databases [57,60]. Thus, researchers can generate an interaction network of proteins based on genetic interactions, shared protein domains, pathways, physical interactions, co-localization, co-expression and in silico predictions. GeneMANIA retrieves data from different sources, including individual studies and large databases such as BIOGRID [61], GEO [62], I2D [63] and Pathway Commons [64]. The analysis of a single gene or protein can also be useful, since it displays which genes/proteins are known to interact with it and with each other, which might help to obtain an overview of possible interactions. In addition to the integration of biological networks and gene function prediction, GeneMANIA can also be applied to gene prioritization for functional assays [59]. Further programs, such as the Integrated Interactome System (IIS), have been recently released, providing an integrative platform for the annotation, analysis and visualization of the interaction profiles of proteins/genes, metabolites and drugs of interest. 
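The guilt-by-association idea mentioned above for GeneMANIA can be illustrated with a deliberately simplified Python sketch: genes outside the query list are ranked by the summed weight of their edges to query genes. This is not the algorithm GeneMANIA uses, which integrates many networks; the function name, edge weights and gene labels below are hypothetical.

```python
from collections import defaultdict


def rank_by_association(edges, query, top_n=5):
    """Rank non-query genes by total edge weight to the query genes.

    edges: iterable of (gene_a, gene_b, weight) for an undirected network
    query: genes of interest (e.g. differentially expressed proteins)
    """
    query = set(query)
    scores = defaultdict(float)
    for a, b, weight in edges:
        if a in query and b not in query:
            scores[b] += weight
        elif b in query and a not in query:
            scores[a] += weight
    # Highest score first; ties broken alphabetically for reproducibility.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical network: Q1 and Q2 are query genes, A to C are candidates.
edges = [
    ("Q1", "A", 0.9),
    ("Q2", "A", 0.5),
    ("Q1", "B", 0.4),
    ("B", "C", 1.0),   # C has no direct link to the query, so it is not scored
    ("Q1", "Q2", 0.8),
]
for gene, score in rank_by_association(edges, ["Q1", "Q2"]):
    print(gene, round(score, 2))
```

A gene connected to several query genes accumulates evidence from each of them, which is the intuition behind using such networks for gene prioritization.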
The IIS platform was built with an easy-to-use web-based interface and is divided into different modules, starting with the input of raw sequencing data or of a list of gene or protein identifiers, followed by the annotation and interactome modules. Five parameter classes need to be selected by the user to perform the interactome analysis: the organism of study, the network configuration, the score cutoff, the two-hybrid parameters and the expression analysis. IIS works with datasets from diverse organisms and also offers the possibility to construct networks with interactions between different organisms or using orthologous relationships, a feature that can be applied as an alternative to analyze data from organisms without complete genome annotation by selecting a closely related organism that has already been annotated. In the expression analysis parameters, the user can set cutoff
Fig. 2. Interaction network of the proteins identified as differentially expressed in tumor tissues. The protein–protein network was built using IIS software and visualized using Cytoscape software for a previous proteomics dataset (Simabuco et al., 2014). (A) The enriched biological process (p ≤ 0.05) or (B) enriched KEGG pathways (p ≤ 0.05) among the up-regulated proteins (red), down-regulated proteins (green) and background intermediary proteins (gray) from IIS database are depicted in the network by clustering the proteins involved in each of the biological process/pathways with a circle layout. Proteins belonging to more than one process or pathway were assigned to the one with the best enrichment p-values. The node sizes of up-, down- and non-regulated proteins are depicted proportional to their fold change expression values.
values according to the fold change in expression/concentration levels in the dataset, which are used to define the input node sizes and to color the nodes according to the up- or down-regulation of the protein expression. Regarding the enrichment analysis, the program uses GO biological processes and KEGG pathways to calculate the enrichment in the generated network. For a detailed description of the statistical parameters available in IIS, please see [52]. We explored the applicability of IIS by uploading our example proteome dataset [8] to the first IIS module as a text file composed of UniProt IDs and the corresponding fold change values of relative expression, in order to obtain a protein–protein interaction network. The resulting network was then visualized using Cytoscape 3.0.2, and significantly enriched KEGG pathways or GO biological processes were assigned as clusters (Fig. 2). This network arrangement is useful to visually detect which proteins, here represented by nodes, participate in a certain biological process or pathway. Despite the smaller number of proteins annotated by KEGG pathways, several proteins from the input list were found enriched in the dataset based on the GO biological process category. Furthermore, based on database interactions, IIS networks provide information on proteins that might interact with those in the input list, which may contribute to a better understanding of the biological role of the proteins studied. Besides public and open source programs such as Cytoscape and IIS, commercial tools are also available for proteomics data interpretation. Among the tools broadly used for proteomics data interpretation is MetaCore. This program was developed to address a broad range of biological questions, offering pathway analysis, knowledge mining, disease pathway modeling, and target and biomarker assessment. 
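Returning to the IIS input described above, a text file of UniProt IDs with fold change values, the following Python sketch shows how such a list could be turned into node attributes like those in Fig. 2 (red for up-regulated, green for down-regulated, gray otherwise, with size scaled by the magnitude of change). The column layout, the 1.5-fold cutoff, the use of ratios rather than log values, and the accessions and values are all assumptions made for illustration; the actual IIS file specification is given in [52].

```python
import math


def classify_nodes(text, cutoff=1.5):
    """Parse 'accession <whitespace> fold-change ratio' lines into node attributes.

    A ratio >= cutoff is treated as up-regulated and a ratio <= 1/cutoff as
    down-regulated (assumed convention; ratios must be positive).
    """
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        accession, value = line.split()[:2]
        ratio = float(value)
        log2fc = math.log2(ratio)
        if ratio >= cutoff:
            color = "red"
        elif ratio <= 1.0 / cutoff:
            color = "green"
        else:
            color = "gray"
        nodes[accession] = {"log2fc": log2fc, "color": color, "size": abs(log2fc)}
    return nodes


# Placeholder accessions and invented fold changes, for illustration only.
example = "P00001\t2.0\nP00002\t0.5\nP00003\t1.1\n"
for accession, attrs in classify_nodes(example).items():
    print(accession, attrs["color"], round(attrs["size"], 2))
```

Using the absolute log2 fold change for node size keeps a two-fold increase and a two-fold decrease visually comparable, which a raw ratio would not.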
Enrichment analysis of one or more protein sets at once can be performed in MetaCore, whose results include pathway maps, process networks, diseases (based on biomarkers), metabolic networks and GO categories related to the input protein sets. Our example proteome dataset was uploaded to MetaCore and enrichment analysis was performed. A summary of the results was exported and saved as a detailed report with the Top 10 most significant pathway maps, process networks, diseases (by biomarkers), metabolic networks and GO processes (Supplemental Tables 7–12). No cancer-associated disease was found to be enriched, but the results showed possible metabolic alterations during tumor development, as indicated by pathway maps and GO processes. Landi et al. used the MetaCore and DAVID [34,65] software for the biological interpretation of proteomics data, and the results provided protein-interaction network maps and insights into biological responses and potential pathways in bronchoalveolar fluid [66]. Chen et al. identified potential bladder cancer markers through the use of MetaCore for the functional annotation of the proteomics alterations caused by the disease [67]. Functional enrichment analysis is a common starting point in analyzing proteomics data from one or more datasets, followed by protein interaction network and pathway analysis, which are now essential to extract meaningful biological information from proteomics data. Therefore, the interpretation of protein lists is not limited to the simple identification of proteins, and different software can provide meaningful biological insights on cellular processes, related diseases and biomarker candidates, among other features. However, one can encounter dissimilar results on functional annotation when using different software and approaches, which may generate multiple networks of interactions or pathways after performing enrichment analysis of one or more gene lists. 
For instance, a gene that is annotated and well described in the literature for a certain function might also be found in a new study to be involved in a previously unknown function. Thus, the results from a functional annotation may not include this possible role and may contribute poorly to the understanding of such an association, since network interactions and pathways generally describe common processes and functions, even when different tools are used for the sake of complementarity. Therefore, manual
curation and consulting the recent literature are actions that still need to be done in order to correctly assign proper knowledge to the proteomics functional annotation analysis. The development of new tools for dynamic integration of updated knowledge is indeed necessary and may contribute effectively to improving the functional annotation of proteomics data. 5. Conclusions Datasets generated in proteomics experiments are usually large lists of protein identifications. However, the extraction of biological meaning from these large datasets must be performed through functional annotation. Furthermore, pathway analysis and the generation of interaction networks based on previous data are fundamental in the visualization and interpretation of the biological processes involved in the conditions studied. Over recent years, bioinformatics tools have been developed to gather the biological knowledge accumulated in public databases, facilitating the analysis of large gene or protein lists and contributing to the generation of new hypotheses through significant biological information. In the present work we reviewed the application of some computational tools that can be used for the interpretation of proteomics data regarding functional annotation through enrichment analysis methods, biological network generation and visualization of biological data. Some of the programs and tools discussed here were tested with the analysis of an example proteomics dataset and can easily be applied to other protein or gene sets. These tools can greatly assist researchers in interpreting their proteomics data using different ontology annotations and biological knowledge, in order to reveal the identity of key biological mechanisms. Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.bbapap.2014.10.019. 
Acknowledgments This work was supported by FAPESP grants: 2009/54067-3, 2010/19278-0, 2009/52833-0 and CNPq grants: 470549/2011-4, 301702/2011-0 and 470268/2013-1 to AFPL and a CAPES fellowship to C.M.C. References [1] M.H. Elliott, D.S. Smith, C.E. Parker, C. Borchers, Current trends in quantitative proteomics, J. Mass Spectrom. 44 (2009) 1637–1660. [2] S.E. Ong, M. Mann, Mass spectrometry-based proteomics turns quantitative, Nat. Chem. Biol. 1 (2005) 252–262. [3] S. Gupta, A. Venkatesh, S. Ray, S. Srivastava, Challenges and prospects for biomarker research: a current perspective from the developing world, Biochim. Biophys. Acta 1844 (2014) 899–908. [4] B. Pesch, T. Bruning, G. Johnen, S. Casjens, N. Bonberg, D. Taeger, A. Muller, D.G. Weber, T. Behrens, Biomarker research with prospective study designs for the early detection of cancer, Biochim. Biophys. Acta 1844 (2014) 874–883. [5] B. Macek, M. Mann, J.V. Olsen, Global and site-specific quantitative phosphoproteomics: principles and applications, Annu. Rev. Pharmacol. Toxicol. 49 (2009) 199–221. [6] F.V. Winck, M. Belloni, B.A. Pauletti, L. Zanella Jde, R.R. Domingues, N.E. Sherman, A.F. Paes Leme, Phosphoproteome analysis reveals differences in phosphosite profiles between tumorigenic and non-tumorigenic epithelial cells, J. Proteomics 96 (2014) 67–81. [7] F. Di Domenico, R. Sultana, E. Barone, M. Perluigi, C. Cini, C. Mancuso, J. Cai, W.M. Pierce, D.A. Butterfield, Quantitative proteomics analysis of phosphorylated proteins in the hippocampus of Alzheimer's disease subjects, J. Proteomics 74 (2011) 1091–1103. [8] F.M. Simabuco, R. Kawahara, S. Yokoo, D.C. Granato, L. Miguel, M. Agostini, A.Z. Aragao, R.R. Domingues, I.L. Flores, C.C. Macedo, R. Della Coletta, E. Graner, A.F. Paes Leme, ADAM17 mediates OSCC development in an orthotopic murine model, Mol. Cancer 13 (2014) 24. [9] J.L. Paltridge, L. Belle, Y. Khew-Goodall, The secretome in cancer progression, Biochim. Biophys. Acta 1834 (2013) 2233–2241. 
[10] C. Kumar, M. Mann, Bioinformatics analysis of mass spectrometry-based proteomics data sets, FEBS Lett. 583 (2009) 1703–1712. [11] A. Valencia, Automatic annotation of protein function, Curr. Opin. Struct. Biol. 15 (2005) 267–274. [12] B. Rost, J. Liu, R. Nair, K.O. Wrzeszczynski, Y. Ofran, Automatic prediction of protein function, Cell. Mol. Life Sci. 60 (2003) 2637–2650.
[13] Y. Loewenstein, D. Raimondo, O.C. Redfern, J. Watson, D. Frishman, M. Linial, C. Orengo, J. Thornton, A. Tramontano, Protein function annotation by homology-based inference, Genome Biol. 10 (2009) 207. [14] A.S. Juncker, L.J. Jensen, A. Pierleoni, A. Bernsel, M.L. Tress, P. Bork, G. von Heijne, A. Valencia, C.A. Ouzounis, R. Casadio, S. Brunak, Sequence-based feature prediction and annotation of proteins, Genome Biol. 10 (2009) 206. [15] M.R. Brent, Genome annotation past, present, and future: how to define an ORF at each locus, Genome Res. 15 (2005) 1777–1786. [16] M. Yandell, D. Ence, A beginner's guide to eukaryotic genome annotation, Nat. Rev. Genet. 13 (2012) 329–342. [17] L. Stein, Genome annotation: from sequence to biology, Nat. Rev. Genet. 2 (2001) 493–503. [18] T. Gruber, Towards principles for the design of ontologies used for knowledge sharing, Int. J. Hum. Comput. Stud. 43 (1993) 907–928. [19] J.B. Bard, S.Y. Rhee, Ontologies in biology: design, applications and future challenges, Nat. Rev. Genet. 5 (2004) 213–222. [20] M. Gan, X. Dou, R. Jiang, From ontology to semantic similarity: calculation of ontology-based semantic similarity, ScientificWorldJournal 2013 (2013) 793091. [21] N.F. Noy, N.H. Shah, P.L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D.L. Rubin, M.A. Storey, C.G. Chute, M.A. Musen, BioPortal: ontologies and integrated data resources at the click of a mouse, Nucleic Acids Res. 37 (2009) W170–W173. [22] S. Schulze-Kremer, Ontologies for molecular biology and bioinformatics, In Silico Biol. 2 (2002) 179–193. [23] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat. Genet. 
25 (2000) 25–29. [24] J. Gillis, P. Pavlidis, Assessing identity, redundancy and confounds in Gene Ontology annotations over time, Bioinformatics 29 (2013) 476–482. [25] A. Gross, M. Hartung, K. Prufer, J. Kelso, E. Rahm, Impact of ontology evolution on functional analyses, Bioinformatics 28 (2012) 2671–2677. [26] S. Maere, K. Heymans, M. Kuiper, BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks, Bioinformatics 21 (2005) 3448–3449. [27] R. Saito, M.E. Smoot, K. Ono, J. Ruscheinski, P.L. Wang, S. Lotia, A.R. Pico, G.D. Bader, T. Ideker, A travel guide to Cytoscape plugins, Nat. Methods 9 (2012) 1069–1076. [28] M. Hartung, A. Gross, E. Rahm, CODEX: exploration of semantic changes between ontology versions, Bioinformatics 28 (2012) 895–896. [29] S.Y. Rhee, V. Wood, K. Dolinski, S. Draghici, Use and misuse of the gene ontology annotations, Nat. Rev. Genet. 9 (2008) 509–515. [30] R. Malik, K. Dulla, E.A. Nigg, R. Korner, From proteome lists to biological impact— tools and strategies for the analysis of large MS data sets, Proteomics 10 (2010) 1270–1283. [31] Creating the gene ontology resource: design and implementation, Genome Res. 11 (2001) 1425–1433. [32] N. Skunca, A. Altenhoff, C. Dessimoz, Quality of computationally inferred gene ontology annotations, PLoS Comput. Biol. 8 (2012) e1002533. [33] D. Barrell, E. Dimmer, R.P. Huntley, D. Binns, C. O'Donovan, R. Apweiler, The GOA database in 2009—an integrated Gene Ontology Annotation resource, Nucleic Acids Res. 37 (2009) D396–D403. [34] W. Huang da, B.T. Sherman, R.A. Lempicki, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Res. 37 (2009) 1–13. [35] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. 57 (1995) 289–300. [36] M. Krallinger, A. 
Valencia, Text-mining and information-retrieval services for molecular biology, Genome Biol. 6 (2005) 224. [37] G. Dennis Jr., B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, R.A. Lempicki, DAVID: Database for Annotation, Visualization, and Integrated Discovery, Genome Biol. 4 (2003) P3. [38] H. Araki, C. Knapp, P. Tsai, C. Print, GeneSetDB: a comprehensive meta-database, statistical and visualisation framework for gene set analysis, FEBS Open Biol. 2 (2012) 76–82. [39] G.F. Berriz, O.D. King, B. Bryant, C. Sander, F.P. Roth, Characterizing gene sets with FuncAssociate, Bioinformatics 19 (2003) 2502–2504. [40] B.R. Zeeberg, W. Feng, G. Wang, M.D. Wang, A.T. Fojo, M. Sunshine, S. Narasimhan, D.W. Kane, W.C. Reinhold, S. Lababidi, K.J. Bussey, J. Riss, J.C. Barrett, J.N. Weinstein, GoMiner: a resource for biological interpretation of genomic and proteomic data, Genome Biol. 4 (2003) R28. [41] A. Kasprzyk, BioMart: driving a paradigm change in biological data management, Database (Oxford) (2011) (2011) bar049. [42] S. Cha, M.B. Imielinski, T. Rejtar, E.A. Richardson, D. Thakur, D.C. Sgroi, B.L. Karger, In situ proteomic analysis of human breast cancer epithelial cells using laser capture microdissection: annotation by protein set enrichment analysis and gene ontology, Mol. Cell. Proteomics 9 (2010) 2529–2544. [43] K. Glass, M. Girvan, Annotation enrichment analysis: an alternative method for evaluating the functional properties of gene sets, Sci. Rep. 4 (2014) 4191. [44] J.J. Kim, D. Rebholz-Schuhmann, Categorization of services for seeking information in biomedical literature: a typology for improvement of practice, Brief. Bioinform. 9 (2008) 452–465. [45] K.B. Cohen, L. Hunter, Getting started in text mining, PLoS Comput. Biol. 4 (2008) e20. [46] A. Manconi, E. Vargiu, G. Armano, L. Milanesi, Literature retrieval and mining in bioinformatics: state of the art and challenges, Adv. Bioinform. 2012 (2012) 573846.
[47] W.W. Chen, M. Niepel, P.K. Sorger, Classic and contemporary approaches to modeling biochemical reactions, Genes Dev. 24 (2010) 1861–1875. [48] P. Aloy, R.B. Russell, Structural systems biology: modelling protein interactions, Nat. Rev. Mol. Cell Biol. 7 (2006) 188–197. [49] P. Shannon, A. Markiel, O. Ozier, N.S. Baliga, J.T. Wang, D. Ramage, N. Amin, B. Schwikowski, T. Ideker, Cytoscape: a software environment for integrated models of biomolecular interaction networks, Genome Res. 13 (2003) 2498–2504. [50] M.E. Smoot, K. Ono, J. Ruscheinski, P.L. Wang, T. Ideker, Cytoscape 2.8: new features for data integration and network visualization, Bioinformatics 27 (2011) 431–432. [51] G. Bindea, B. Mlecnik, H. Hackl, P. Charoentong, M. Tosolini, A. Kirilovsky, W.H. Fridman, F. Pages, Z. Trajanoski, J. Galon, ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks, Bioinformatics 25 (2009) 1091–1093. [52] M.F. Carazzolle, L.M. de Carvalho, H.H. Slepicka, R.O. Vidal, G.A. Pereira, J. Kobarg, G.V. Meirelles, IIS—Integrated Interactome System: a web-based platform for the annotation, analysis and visualization of protein–metabolite–gene– drug interactions by integrating a variety of data sources and tools, PLoS One 9 (2014) e100385. [53] The Universal Protein Resource (UniProt), Nucleic Acids Res. 35 (2007) D193–D197. [54] T.J. Hubbard, B.L. Aken, S. Ayling, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, L. Clarke, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S. Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N. Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, D. Rios, M. Schuster, G. Slater, D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham, V. Curwen, R. Durbin, X.M. 
Fernandez-Suarez, J. Herrero, A. Kasprzyk, G. Proctor, J. Smith, S. Searle, P. Flicek, Ensembl 2009, Nucleic Acids Res. 37 (2009) D690–D697. [55] M. Kanehisa, S. Goto, KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Res. 28 (2000) 27–30. [56] I. Rivals, L. Personnaz, L. Taing, M.C. Potier, Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 23 (2007) 401–407. [57] J. Montojo, K. Zuberi, H. Rodriguez, F. Kazi, G. Wright, S.L. Donaldson, Q. Morris, G.D. Bader, GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop, Bioinformatics 26 (2010) 2927–2928. [58] S. Mostafavi, D. Ray, D. Warde-Farley, C. Grouios, Q. Morris, GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function, Genome Biol. 9 (Suppl. 1) (2008) S4. [59] D. Warde-Farley, S.L. Donaldson, O. Comes, K. Zuberi, R. Badrawi, P. Chao, M. Franz, C. Grouios, F. Kazi, C.T. Lopes, A. Maitland, S. Mostafavi, J. Montojo, Q. Shao, G. Wright, G.D. Bader, Q. Morris, The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function, Nucleic Acids Res. 38 (2010) W214–W220. [60] W. Tian, L.V. Zhang, M. Tasan, F.D. Gibbons, O.D. King, J. Park, Z. Wunderlich, J.M. Cherry, F.P. Roth, Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function, Genome Biol. 9 (Suppl. 1) (2008) S7. [61] B.J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D.H. Lackner, J. Bahler, V. Wood, K. Dolinski, M. Tyers, The BioGRID interaction database: 2008 update, Nucleic Acids Res. 36 (2008) D637–D640. [62] T. Barrett, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, K.A. Marshall, K.H. Phillippy, P.M. Sherman, M. Holko, A. Yefanov, H. Lee, N. Zhang, C.L. Robertson, N. Serova, S. Davis, A. Soboleva, NCBI GEO: archive for functional genomics data sets—update, Nucleic Acids Res. 
41 (2013) D991–D995. [63] K.R. Brown, I. Jurisica, Online predicted human interaction database, Bioinformatics 21 (2005) 2076–2082. [64] PathwayCommons, http://www.pathwaycommons.org (Visited on November 2013, in). [65] W. Huang da, B.T. Sherman, Q. Tan, J.R. Collins, W.G. Alvord, J. Roayaei, R. Stephens, M.W. Baseler, H.C. Lane, R.A. Lempicki, The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists, Genome Biol. 8 (2007) R183. [66] C. Landi, E. Bargagli, L. Bianchi, A. Gagliardi, A. Carleo, D. Bennett, M.G. Perari, A. Armini, A. Prasse, P. Rottoli, L. Bini, Towards a functional proteomics approach to the comprehension of idiopathic pulmonary fibrosis, sarcoidosis, systemic sclerosis and pulmonary Langerhans cell histiocytosis, J. Proteomics 83 (2013) 60–75. [67] C.L. Chen, T.S. Lin, C.H. Tsai, C.C. Wu, T. Chung, K.Y. Chien, M. Wu, Y.S. Chang, J.S. Yu, Y.T. Chen, Identification of potential bladder cancer markers in urine by abundant-protein depletion coupled with quantitative proteomics, J. Proteomics 85 (2013) 28–43. [68] S. Carbon, A. Ireland, C.J. Mungall, S. Shu, B. Marshall, S. Lewis, AmiGO: online access to ontology and annotation data, Bioinformatics 25 (2009) 288–289. [69] W. Huang da, B.T. Sherman, R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources, Nat. Protoc. 4 (2009) 44–57. [70] D. Croft, A.F. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, P. Garapati, M. Gillespie, M.R. Kamdar, B. Jassal, S. Jupe, L. Matthews, B. May, S. Palatnik, K. Rothfels, V. Shamovsky, H. Song, M. Williams, E. Birney, H. Hermjakob, L. Stein, P. D'Eustachio, The Reactome pathway knowledgebase, Nucleic Acids Res. 42 (2014) D472–D477. [71] B. Snel, G. Lehmann, P. Bork, M.A. Huynen, STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene, Nucleic Acids Res. 28 (2000) 3442–3444. [72] E.Y. Chen, C.M. Tan, Y. 
Kou, Q. Duan, Z. Wang, G.V. Meirelles, N.R. Clark, A. Ma'ayan, Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool, BMC Bioinformatics 14 (2013) 128. [73] R. Hoffmann, A. Valencia, A gene network for navigating the literature, Nat. Genet. 36 (2004) 664.
[74] H. Chen, B.M. Sharp, Content-rich biological network constructed by mining PubMed abstracts, BMC Bioinformatics 5 (2004) 147. [75] M.A. Hearst, A. Divoli, H. Guturu, A. Ksikes, P. Nakov, M.A. Wooldridge, J. Ye, BioText Search Engine: beyond abstract search, Bioinformatics 23 (2007) 2196–2197.
[76] M. He, Y. Wang, W. Li, PPI finder: a mining tool for human protein–protein interactions, PLoS One 4 (2009) e4554. [77] S. Kim, D. Kwon, S.Y. Shin, W.J. Wilbur, PIE the search: searching PubMed literature for protein interaction information, Bioinformatics 28 (2012) 597–598.