Review
Special issue: Gene Ontology for microbiologists
Understanding animal viruses using the Gene Ontology
Fiona M. McCarthy1,2, Timothy J. Mahony4,5, Mark S. Parcells6 and Shane C. Burgess1,2,3
1 Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762, USA
2 Mississippi State University Institute for Digital Biology, Mississippi State University, Mississippi State, MS 39762, USA
3 Life Science and Biotechnology Institute, Mississippi Agriculture and Forestry Experiment Station, Mississippi State University, Mississippi State, MS 39762, USA
4 Queensland Agricultural Biotechnology Centre, St Lucia, QLD 4072, Australia
5 School of Veterinary Science, University of Queensland, St Lucia, QLD 4072, Australia
6 Department of Animal and Food Science, University of Delaware, Newark, DE, USA
Understanding the effects of viral infection has typically focused on specific virus-host interactions such as tissue tropism, immune responses and histopathology. However, modeling viral pathogenesis requires information about the functions of gene products from both virus and host, and how these products interact. Recent developments in the functional annotation of genomes using Gene Ontology (GO) and in modeling functional interactions among gene products, together with an increased interest in systems biology, provide an excellent opportunity to generate global interaction models for viral infection. Here, we review how the GO is being used to model viral pathogenesis, with a focus on animal viruses.
Genomic annotation for modeling gene function in pathogens
Over the last 3–5 years, bioinformatic resources and databases have been developed for livestock, poultry and, to a lesser extent, aquaculture species to facilitate biological modeling, including systems biology approaches [1]. An important paradigm underpinning many of these resources is the use of bio-ontologies, which capture biological information in a structured way. Bio-ontologies facilitate integrated data sharing between diverse biological databases and are used as a basis for deriving meaning from the large datasets produced by transcriptomic and proteomic analyses. Currently, the most commonly used bio-ontology (see Glossary) is the Gene Ontology (GO), which is used to represent the functions of gene products. In many species, annotation using GO is the de facto method for functional annotation [2], and has become a key tool for modeling functional genomics datasets. Although the use of GO has primarily focused on model eukaryote species, it can also be used to describe and model gene function in pathogens [3] and free-living microbes [4,5].
Corresponding author: McCarthy, F.M.
The interactions between gene products from the host and those of their pathogens are the fundamental basis of pathology and the main reason for studying pathogens. However, the use of functional genomics for
the identification of pathogenic mechanisms is presently hindered because (i) functional annotation for pathogen genes is often minimal, (ii) functional and network modeling tools rely on the availability of this type of data and (iii) many microbiologists do not know how to use GO for biological modeling of their data and the identification of host-pathogen interactions. For example, in species with a focused GO annotation effort, there is exponential growth in the use of GO to model functional datasets [6]. Given that GO annotation already exists for non-viral pathogens such as Candida albicans, Pseudomonas aeruginosa, Trypanosoma brucei and Plasmodium falciparum [3], we are now in an excellent position to apply GO to the
Glossary
Annotation: the process of attaching biologically relevant information to a sequence. Structural annotation identifies the functional elements within the sequence (including both genes and regulatory elements), whereas functional annotation associates functional information with these elements.
Biocurators: scientists who have received specialized training to develop and annotate to biomedical ontologies and controlled vocabularies. These scientists collect and extract data from original scientific literature and data sources, and organize these data using standardized biomedical ontologies and controlled vocabularies. This process, which is called ‘biocuration’, facilitates data sharing between biological databases and enables powerful database queries.
Biological Process (BP): one of the three aspects used by the GO to describe gene function. BP refers to one or more ordered assemblies of molecular functions that result in a transformation within a living organism.
Bio-ontology: a structured representation of biological information, which includes a set of defined terms (or vocabulary) to describe biological concepts (molecules, proteins, etc.), and how these various terms are related.
Cellular Component (CC): one of the three aspects used by the GO to describe gene function. CC describes the parts of a cell or its extracellular environment, similar to cellular localization, but might also be a gene product group (e.g. ribosome, proteasome, protein dimer).
Evidence codes: during GO annotation, a gene product is associated with a GO term representing function, and the evidence for making this assertion is described using a GO evidence code. GO evidence codes describe experimental evidence, computational analysis of sequence and structure, published author statements and biocurator judgment.
Expressed Sequence Tags (ESTs): short sequences produced by one cycle of sequencing of mRNA that represent portions of expressed genes.
Molecular Function (MF): one of the three aspects used by the GO to describe gene function. MF describes the elemental activities of a gene product at the molecular level, such as binding or catalysis. MFs typically correspond to activities that can be performed by individual gene products, although some activities are performed by assembled complexes of gene products.
0966-842X/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.tim.2009.04.006 Available online 2 July 2009
Trends in Microbiology Vol.17 No.7
modeling of virus infection. Other articles in this special issue of Trends in Microbiology discuss the use of GO in bacterial systems, including host-bacteria interactions. Here, we will briefly introduce GO with particular emphasis on how it has been used to understand agricultural systems. We outline key examples of the use of GO for the annotation of gene products from important viral pathogens, and explain how this can lead to descriptive and predictive models of viral pathogenesis. Because a main motivation for studying viruses, like all pathogens, is to understand pathogenesis in the host species, host-virus interaction and how this information can be used to intervene in virus-induced disease, we also discuss how GO can be used to model host-pathogen interactions.
Methods for functional annotation
With the development of high-throughput techniques (‘omics’) and the resulting increase in experimental data, genomic annotation is crucial to enable modeling [7,8]. Typically, genomic annotation is understood as ‘structural annotation’, or the process of identifying functional and structural elements within genome sequences. However, ‘functional annotation’ (i.e. linking functions to genes and gene products) is also essential. Without functional annotation, deriving relevant biological models from high-throughput functional genomics datasets such as microarrays is a time-consuming and difficult task.
Structural annotation begins during genome sequence assembly and is continually updated with new computational methods and/or experimental evidence. Functional annotation is typically not done during genome sequencing, although computational strategies can provide a rapid, broad overview of gene function based on sequence similarity to known genes and/or function of known structural motifs. For example, Clusters of Orthologous Groups (COGs) [9] is a computationally based strategy for describing gene function. COGs are computed using phylogenetic inference to determine gene orthology (functionally equivalent genes from different species) and paralogy (duplicated genes that are free to functionally diversify). Orthologous gene clusters are automatically generated and manually inspected. The resulting COGs are functionally classified into 23 groups based on the presence of genes with experimentally characterized functions and/or structures. COGs are used to functionally annotate bacterial genomes, and they have also been applied to 13 herpesviruses [10], demonstrating that these viral genomes have a core set of 25 highly conserved genes encoding DNA binding and capsid-related functions. Use of COGs addresses the need for rapid functional annotation of the increasing number of genomes being sequenced. However, the functional annotation provided is limited to broad concepts, such as ‘transcription’ (COG K), ‘signal transduction mechanisms’ (COG T) and ‘energy production and conversion’ (COG C). Another approach is to use bio-ontologies to describe gene function at the level of detail compatible with the type of data used to infer this gene function. Bio-ontologies
Table 1. Corresponding functional annotation for COGs and GO terms
COG | Functional category | GO term(s)
J | Translation, ribosomal structure and biogenesis | GO:0043037 translation; GO:0007046 ribosome biogenesis; GO:0003735 structural constituent of ribosome; GO:0005840 ribosome
K | Transcription | GO:0006350 transcription
L | DNA replication, recombination and repair | GO:0006261 DNA-dependent DNA replication; GO:0006310 DNA recombination; GO:0006281 DNA repair
D | Cell division and chromosome partitioning | GO:0000910 cytokinesis; GO:0007059 chromosome segregation
O | Posttranslational modification, protein turnover, chaperones | GO:0006464 protein modification; GO:0003754 chaperone activity; GO:0006508 proteolysis and peptidolysis
M | Cell envelope biogenesis, outer membrane | GO:0043165 outer membrane biogenesis (sensu Gram-negative Bacteria); GO:0043163 cell envelope organization and biogenesis
N | Cell motility and secretion | GO:0006928 cell motility; GO:0046903 secretion
P | Inorganic ion transport and metabolism | GO:0015698 inorganic anion transport; GO:0015674 di-, tri-valent inorganic cation transport; GO:0015103 inorganic anion transporter activity; GO:0015082 di-, tri-valent inorganic cation transporter activity
T | Signal transduction mechanisms | GO:0004871 signal transducer activity; GO:0007165 signal transduction
C | Energy production and conversion | GO:0006091 generation of precursor metabolites and energy
G | Carbohydrate transport and metabolism | GO:0008643 carbohydrate transport; GO:0005975 carbohydrate metabolism
E | Amino acid transport and metabolism | GO:0006520 amino acid metabolism; GO:0006865 amino acid transport
F | Nucleotide transport and metabolism | GO:0006862 nucleotide transport; GO:0009117 nucleotide metabolism
H | Coenzyme metabolism | GO:0006732 coenzyme metabolism
I | Lipid metabolism | GO:0006629 lipid metabolism
Q | Secondary metabolites biosynthesis, transport and catabolism | GO:0019748 secondary metabolism
S | Function unknown | GO:0005554 molecular function unknown
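The correspondence in Table 1 lends itself to a simple lookup structure. The following is a minimal sketch in Python, using a few rows copied from Table 1; the names COG_TO_GO and go_ids_for_cogs are illustrative assumptions and are not part of any COG or GO software.

```python
# A few rows copied from Table 1. The names COG_TO_GO and go_ids_for_cogs
# are illustrative and do not come from any COG or GO software.
COG_TO_GO = {
    "K": [("GO:0006350", "transcription")],
    "L": [
        ("GO:0006261", "DNA-dependent DNA replication"),
        ("GO:0006310", "DNA recombination"),
        ("GO:0006281", "DNA repair"),
    ],
    "T": [
        ("GO:0004871", "signal transducer activity"),
        ("GO:0007165", "signal transduction"),
    ],
    "C": [("GO:0006091", "generation of precursor metabolites and energy")],
}


def go_ids_for_cogs(cog_letters):
    """Return the sorted, de-duplicated GO IDs for a string of COG letters."""
    go_ids = set()
    for letter in cog_letters:
        # COG letters missing from the dictionary contribute nothing.
        for go_id, _name in COG_TO_GO.get(letter, []):
            go_ids.add(go_id)
    return sorted(go_ids)


print(go_ids_for_cogs("KT"))
```

A gene assigned to several COG categories (for example ‘KT’) simply receives the union of the corresponding GO terms.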
are descriptions of biological concepts (for example, gene functions) and how they are related. The best developed bio-ontology, the GO, provides functional annotations based on molecular function (MF), biological process (BP) and cellular component (CC) [11]. The GO term ‘GO:0006350 transcription’ (identical to COG K) is a biological process designated with the computer-readable identifier ‘GO:0006350’ and a short human-readable GO term name ‘transcription’. Equivalent GO terms for COG functional categories are shown in Table 1. However, the structure of the GO bio-ontology allows for much more detailed terms: for example, a more detailed term related to transcription is ‘GO:0006367 transcription initiation from RNA polymerase II promoter’. Providing structure and digital tags to identify functional terms allows the GO to be queried computationally at different levels of detail, providing rich detail for systems biology modeling from various and varied functional genomics data. Although COGs classify gene functions into only 23 broad groups, there are over 25 000 GO terms and this number is increasing.
Information about genes and gene products (that is, RNAs and proteins) is added to the GO using these GO terms and a mixture of computational and manual analysis of functional data, and the evidence is recorded as a three-letter code. Other articles in this special issue of Trends in Microbiology discuss the GO evidence codes in more detail; however, there are two broad types of GO evidence codes: direct experimental codes (used to describe function based upon experiments in published literature) and indirect evidence codes. Indirect evidence codes include function prediction based on sequence such as ‘inferred from sequence orthology’ (ISO), in which functional conservation is inferred for predicted orthologs, and ‘inferred from electronic annotation’ (IEA), which includes function predicted based on functional motifs and domains [12].
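Because GO identifiers and evidence codes are designed to be machine readable, both can be checked programmatically. The following minimal Python sketch assumes that a GO identifier is ‘GO:’ followed by seven digits (as in the identifiers shown here) and groups evidence codes into the two broad classes described above. The set of experimental codes lists codes used by the GO Consortium; treating every other code as ‘indirect’ is a simplification for illustration, and the function names are hypothetical.

```python
import re

# 'GO:' followed by exactly seven digits, as in GO:0006350.
GO_ID = re.compile(r"GO:\d{7}")

# Experimental evidence codes used by the GO Consortium. Treating every
# other code as 'indirect' is a simplification that follows the text.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def is_valid_go_id(term_id):
    """Return True if term_id has the form GO:nnnnnnn."""
    return GO_ID.fullmatch(term_id) is not None


def evidence_class(code):
    """Classify an evidence code as 'experimental' or 'indirect'."""
    return "experimental" if code.upper() in EXPERIMENTAL_CODES else "indirect"


print(is_valid_go_id("GO:0006367"), evidence_class("IEA"))
```

A format check of this kind catches truncated identifiers (six digits instead of seven) before they reach downstream modeling tools.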
Although IEA annotations are considered ‘inaccurate’ or ‘weaker’ because they are not individually checked by biocurators, these annotations are based on continually updated, manually-curated files that map external database information to existing GO terms [12]. The disadvantage of IEA annotations is that they typically provide only general functional annotations. However, the quality of IEA annotations is continually improving and these data are rapidly generated for a diverse range of species that probably would not otherwise have GO annotation. Using IEA-generated GO annotations provides preliminary functional data for developing experimentally-testable hypotheses about a biological system, exactly like any other high-throughput data.
The GO is used extensively for many eukaryotes, and to a much lesser extent for prokaryotic microbes. Initially, researchers used COGs to represent gene function in microbes [9]. However, the utilization of two different functional annotation systems (GO and COG) has hindered attempts to model interactions between different species (such as host and pathogen) at a genomic level. Moreover, for many species, there is no effort to provide GO annotation based on direct experimental evidence. For these species, indirect, sequence-based methods for deriving GO annotation provide valuable data. An example of rapidly providing GO annotation for a species with little GO annotation is
the functional study of a marine crustacean, Penaeus monodon, infected with white spot syndrome virus (WSSV) [13]. This study developed large scale P. monodon expressed sequence tags (ESTs) to observe global gene expression changes in WSSV-infected shrimp. ESTs were clustered and the sequences were assigned GO function based on annotation of the single ‘best hit’ match in the UniProtKB database. This enabled the authors to rapidly provide GO annotations to support their biological modeling. The same researchers have subsequently provided other functional information, including pathway and interaction data (Box 1). GO annotations based on sequence similarity to GO-annotated gene products are assigned the evidence code ‘Inferred from Sequence or Structural Similarity (ISS)’. There are several online tools that allow researchers to do exactly this; however, assigning ISS requires that the user review each sequence alignment (this can be done using a tool such as GOanna [14]; see http://www.agbase.msstate.edu/).
Using GO to describe viral gene function
The GO does not attempt to describe all types of biological data: it is limited to the three functions of gene products that are central to physiology (and thus pathophysiology): individual molecular functions, larger biological processes and cellular location. Moreover, GO does not directly describe gene processes, functions or components that are unique to mutants or diseases. For example, tumourigenesis is not considered a valid GO term because tumour formation is not a normal function of eukaryotic genes. However, tumourigenesis is the pathophysiological outcome of normal gene functions interacting in some ‘abnormal’ way. Thus the genetic networks of tumourigenesis are enriched for the GO biological processes ‘proliferation’, ‘apoptosis’, ‘differentiation’, ‘mitogenesis’ and ‘immune function’, all of which are key biological processes for tumourigenesis to occur [15].
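The ‘best hit’ transfer of GO terms described above for the P. monodon ESTs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the cited study: the input format (query, subject, e-value), the e-value cutoff and the function name are hypothetical, and the e-values in the usage example are invented. The accession and GO terms are those listed for Contig 1 in Box 1. Every transferred annotation is flagged for review, because an ISS code requires that the alignment be inspected.

```python
def transfer_go_from_best_hit(hits, subject_go, max_evalue=1e-10):
    """Assign each query the GO terms of its single best database hit.

    hits       -- iterable of (query_id, subject_id, evalue) tuples
    subject_go -- dict mapping subject_id to a list of GO IDs
    max_evalue -- arbitrary cutoff chosen for illustration only
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue  # too weak to support any transfer
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)

    annotations = {}
    for query, (subject, evalue) in best.items():
        annotations[query] = {
            "best_hit": subject,
            "evalue": evalue,
            "go_terms": sorted(subject_go.get(subject, [])),
            # ISS may only be assigned after the alignment is reviewed.
            "needs_review": True,
        }
    return annotations


# Hypothetical e-values; accession and GO terms as listed in Box 1.
example_hits = [
    ("Contig1", "Q6PFT1_BRARE", 1e-50),
    ("Contig1", "OTHER_HIT", 1e-20),
    ("Contig2", "WEAK_HIT", 1e-3),
]
example_go = {"Q6PFT1_BRARE": ["GO:0005975", "GO:0016787", "GO:0042578"]}
result = transfer_go_from_best_hit(example_hits, example_go)
print(result["Contig1"]["go_terms"])
```

Queries with no hit below the cutoff (Contig2 in this example) receive no annotation rather than a weak one.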
This approach of combining key GO terms for processes involved in a common pathophysiological outcome was used to describe and understand the formation of lymphomas in herpesvirus-infected tissues [16–18] (Box 2). In addition, many viral genes were originally host genes that viruses integrated into their genomes. The viral genes have similar (although often potentiated or repressed) biological processes and molecular functions, and act in the same cellular component, as the host ortholog from which they were derived. Examples are ‘oncogenes’ coding for host-derived transcription factors [19] and genes encoding modulators of host immunity, or ‘virokines’ [20,21].
Recently, a concerted effort focused on GO terms for plant pathogens and resulted in more than 700 new GO ‘Biological Process’ terms (most of them describing interactions between species) [22]. Although the Plant-Associated Microbe Gene Ontology (PAMGO) group initiated this expansion, the development of GO terms was done in close collaboration with plant pathologists. This was a noteworthy advancement towards using GO to model host-pathogen systems. Because GO terms are not species-specific, but rather must represent functions found in larger taxonomic groups, these terms are also relevant to animal pathogens. For example, a presumptive
Box 1. Generating functional genomic resources for poorly annotated species
White spot syndrome virus (WSSV) affects aquaculture of marine shrimp species; however, there is minimal information about both host and viral gene function and host-virus interactions. The Penaeus monodon Functional Genomic Database (http://140.129.151.97/pm/index.php) contains annotated Expressed Sequence Tags (ESTs) related to the shrimp response to WSSV infection. Where possible, ESTs are grouped into contiguous sequences (‘contigs’). Functional information is provided by linking to sequence homologs from the National Center for Biotechnology Information (NCBI) non-redundant (nr) database, assigning GO function, annotating these ESTs to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and providing predicted interaction data based on computational analysis of EST sequences. Notably, evidence codes are not indicated for the GO annotations. Details provided for each EST or contig include similar sequences in public databases (‘homologs’), KEGG pathways, GO terms and predicted interactions. The types of functional annotation are summarized in Table I using data from Contig 1 as an example. Predicted interaction data based on DIP interactions [38] are also available, and the interaction data for Contig 1 are shown in Table II.
This resource demonstrates how functional annotations can enhance biological understanding, even in species with no sequenced genome.
Table I. Functional annotation for Penaeus monodon Contig 1 (a)
ID | Contig1
Description | Similar to fructose-1,6-bisphosphatase 1, like (Danio rerio)
EST | Infected: PmTwI58C10; Normal: LON-03r-C05, PmTwN57A11
Homologs | NCBI GI: 34783829; UniProtKB: Q6PFT1_BRARE
KEGG pathways | Glycolysis / gluconeogenesis; pentose phosphate pathway; fructose and mannose metabolism; carbon fixation; insulin signaling pathway
Gene Ontology | GO:0005975 carbohydrate metabolism; GO:0016787 hydrolase activity; GO:0042578 phosphoric ester hydrolase activity
(a) Modified from http://140.129.151.97/pm/contig.php?id=Contig1 (as seen on 04/24/09).
Table II. Predicted interaction data for P. monodon Contig 1 (a)
Interactor | Description | JE (b) | Based on
Contig1 | Similar to fructose-1,6-bisphosphatase 1, like (Danio rerio) | 1.00E-51 | DIP:7191E
PmTwI33G08_Seeing_2nd_1398 | Similar to importin α 3 (Aplysia californica) | 2.45E-38 | DIP:6468E
Contig1218 | Similar to Ras-related protein Rab-1A | 8.94E-33 | DIP:6930E
Contig1275 | Similar to Rab7 (Aiptasia pulchella) | 6.32E-32 | DIP:6930E
(a) Modified from http://140.129.151.97/pm/ints.php?id=Contig1 (accessed on 04/24/09).
(b) JE: Joint E-value as the geometric mean of individual e-values [41].
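The joint E-value in Table II is defined as the geometric mean of the individual e-values. A minimal Python sketch of that calculation follows; the function name is hypothetical and the two input e-values are invented for illustration (they are not taken from the database). The mean is computed in log space, which avoids numerical underflow when many very small e-values are multiplied.

```python
import math


def joint_evalue(evalues):
    """Geometric mean of e-values, computed in log space."""
    if not evalues:
        raise ValueError("at least one e-value is required")
    if any(e <= 0 for e in evalues):
        raise ValueError("e-values must be positive to take logarithms")
    mean_log = sum(math.log(e) for e in evalues) / len(evalues)
    return math.exp(mean_log)


# Two hypothetical e-values for the two sequence matches behind one
# predicted interaction.
je = joint_evalue([1e-60, 1e-42])
print(f"{je:.2E}")
```

For two e-values the geometric mean is the square root of their product, so 1e-60 and 1e-42 give a joint value of about 1e-51.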
oncoprotein Meq of Gallid herpesvirus 2 (GaHV-2, Marek’s disease virus) is an ortholog of chicken c-Jun, binds to host DNA and has a direct effect on transcription of genes from both virus and host [23]. Hence, Meq is annotated to both ‘regulation of transcription’ (GO:0045449) and ‘modulation by virus of host transcription’ (GO:0019056). Moreover, the GO currently contains 162 virus-specific ‘biological process’ terms, 28 ‘cellular component’ terms relating to viral gene localization, and two ‘molecular function’ terms that are specific to viral function (as of March 2009). As yet, these terms are poorly organized within the GO structure, hindering their use, and GO biocurators are working to improve this section of the ontology (see: http://wiki.geneontology.org/index.php/Virus_terms). For this effort to be useful for the virology community, biocurators will require the feedback and involvement of virologists, exactly as plant pathologists assisted PAMGO biocurators with developing GO terms for microbe-host interactions [22].
The AgBase database and GO annotation
The AgBase database (http://www.agbase.msstate.edu/) is a curated, open-source, web-accessible resource for functional annotation and subsequent systems-biology modeling of gene products from agricultural plant and animal species, some of which are used as biomedical models. Currently, AgBase GO annotations exist for chicken, cat, cow, pig, sheep, horse, dog, salmon, trout and channel catfish. Many of these annotations are done in response to user requests to provide GO for microarrays or for differentially expressed gene lists from microarrays. As a result, AgBase biocurators have also provided GO annotation for genes from several viruses including GaHV-2, Human herpesvirus 1, Meleagrid herpesvirus 1 and Bovine herpesvirus 1 (BoHV-1), and for the bacterium Rhodococcus equi. These GO annotations are
complemented by efforts of the GO Annotation Project (GOA) at the European Bioinformatics Institute (EBI) to ensure that all proteins represented in the UniProtKB database have IEA annotation [24]. One unique innovation at AgBase that is particularly useful for molecular virologists is that we maintain two types of functional annotation files (both files are subject to equally stringent quality control before release). The first file is the usual GO Consortium file (containing annotations submitted to the GO Consortium by AgBase biocurators). The second file is our GO annotation ‘community file’ [25] containing GO annotations provided by biological experts in their fields, GO annotations for ESTs, and annotations that have not yet been submitted to the GO Consortium. Biocurators at AgBase collaborate with experimental virologists to add functional information to the AgBase Community file, and this file is a resource for both researchers and biocurators. GO annotations in the community file provide information regarding the identity of the scientist or group that has done the experimental work and submitted the annotations. However, not all of these experimental data have been published in a scientific journal. For example, the AgBase files contain GO annotations about BoHV-1 gene functions that have recently been published [26], and these data are transferred to the GO Consortium file as the experimental data are published. Data about localization of GaHV-2 gene products are present in the AgBase Community file and, although they await publication, they are available to the community. This system makes functional data available to researchers when it would not otherwise be. Users can contact the AgBase biocurators to request specific GO annotations for their respective gene products or datasets or to submit data, and we encourage the research community to submit annotations.
Box 2. Quantitative, hypothesis-driven modeling using the GO
Typically, modeling datasets using the GO focuses on categorizing differentially expressed gene products into functional groups. However, the GO can also be used to do hypothesis-driven modeling and hypothesis testing, including integration of quantitative data from gene expression experiments (measuring either RNA or protein levels), as in the following example. Marek’s disease is a herpesvirus-induced lymphoma of chickens that serves as a natural model of human CD30-overexpressing (CD30hi) lymphomas. In a recent publication [18], the GO was used to test if Marek’s disease CD30hi lymphoma cells had either a T-helper or a T-regulatory phenotype. There are several well studied cell antigens and cytokines that are routinely used to differentiate between the different classes of T cells. Expression of mRNA and protein levels of these differentiating molecules was assessed in purified CD30hi lymphoma cells to provide quantitative data. To determine if the overall phenotype for these cells was T-helper 1 (Th1), T-helper 2 (Th2) or T-regulatory (Treg), data from GO biological process terms describing these phenotypes were collected and assessed. For example, the terms used to describe Treg cells were: ‘GO:0045066 regulatory T cell differentiation’ and ‘GO:0045589 regulation of regulatory T-cell differentiation’ (together with its two child terms ‘GO:0045590 negative regulation of regulatory T-cell differentiation’ and ‘GO:0045591 positive regulation of regulatory T-cell differentiation’). Based on GO annotation data for chicken and orthologous gene products, each cell antigen or cytokine that was quantitatively measured was scored as ‘pro-phenotype: +1’, ‘anti-phenotype: –1’, ‘no effect: 0’ or ‘no data: blank’. This created a scoring table (Table I). When the quantitative real-time PCR data were applied to this scoring table, the net effect of each phenotype for CD30hi lymphomas was determined (Table II). Because the net effect of the Treg phenotype is +10.15, whereas Th1 and Th2 are down-regulated (indicated by a negative net effect), these results indicate that the predominant phenotype of CD30hi lymphoma cells is a Treg phenotype, a result consistent with host immune evasion. This example demonstrates how detailed GO terms can be used to ask specific questions during biological modeling of high-throughput datasets. Moreover, combining this functional information with quantitative experimental data provides a link between individual gene products and observed phenotypes.
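The scoring scheme described in this box can be sketched as a small calculation. The Python below is an illustration only: it assumes that the net effect for a phenotype is the sum, over gene products, of the GO-derived score (+1, –1 or 0) multiplied by the measured expression change, with ‘no data’ entries skipped. The gene names, scores and expression values are invented toy data, and the function name is hypothetical.

```python
def net_effect(scores, expression):
    """Sum score x measured change per phenotype.

    scores     -- {gene: {phenotype: +1, -1, 0 or None}}, None = no data
    expression -- {gene: measured expression change}
    """
    totals = {}
    for gene, per_phenotype in scores.items():
        change = expression.get(gene)
        if change is None:
            continue  # gene was not measured
        for phenotype, score in per_phenotype.items():
            if score is None:
                continue  # no GO annotation supports a score
            totals[phenotype] = totals.get(phenotype, 0.0) + score * change
    return totals


# Toy data: not the values reported in Tables I and II.
toy_scores = {
    "geneA": {"Th1": 1, "Treg": -1},
    "geneB": {"Th1": -1, "Treg": 1},
    "geneC": {"Th1": None, "Treg": 1},
}
toy_expression = {"geneA": -1.5, "geneB": 2.0, "geneC": 0.5}
totals = net_effect(toy_scores, toy_expression)
print(totals)
```

In this toy case a gene that promotes Th1 but is down-regulated (geneA) counts against Th1 and in favour of Treg, which is how the sign of the measurement and the sign of the score combine.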
Table I. Phenotype scoring based on GO annotation data (columns: Gene product, Th1, Th2, Treg; gene products scored: IL-2, IL-4, IL-6, IL-8, IL-10, IL-12, IL-13, IL-18, IFNgamma, TGFbeta, CTLA4, GPR83, SMAD7)
Table II. Addition of quantitative data to the phenotype scoring table
Gene product | Th1 | Th2 | Treg
IL-2 | 1.58 | 0.00 | 1.58
IL-4 | 0.00 | 0.00 | 0.00
IL-6 | 0.00 | –1.20 | 1.20
IL-8 | 0.00 | 0.00 | 1.18
IL-10 | 0.00 | 0.00 | 0.00
IL-12 | 0.00 | 0.00 | 0.00
IL-13 | 1.51 | –1.51 | 0.00
IL-18 | 0.91 | 0.91 | 0.91
IFNgamma | 0.00 | 0.00 | 0.00
TGFbeta | –1.71 | 0.00 | 1.71
CTLA4 | –1.89 | –1.89 | 1.89
GPR83 | –1.69 | –1.69 | 1.69
SMAD7 | 0.00 | 0.00 | 0.00
Net effect | –1.29 | –5.38 | 10.15
Modeling the biology of animal viruses using the GO
In addition to GO annotations, AgBase provides computational tools for adding GO terms to datasets and for viewing GO annotations. These tools are described elsewhere in detail [6] but briefly, users can retrieve existing GO annotations for their datasets (using GORetriever), assign additional functional annotation based on homology (GOanna) and summarize the GO function for their data (GOSlimViewer). These tools can be used independently or sequentially (Figure 1). The GO is routinely used to determine which classes of gene products are over-represented or under-represented in functional genomics datasets [27], and there are many examples of how this same approach is used to study host gene expression changes during viral infection [28–31]. In addition, there are several examples of microarrays that include pathogen genes: the Affymetrix Chicken Genome Array includes genes from 17 different avian viruses and the Agilent macaque oligonucleotide microarray contains 96 viral genes from 27 different viruses. Including probes for viral transcripts on microarrays enables researchers to measure how these viral genes are expressed along with host genes during infection, identifying pathogen genes involved in host-pathogen interactions [32]. However, the usefulness
of the GO is not limited to functionally classifying up- and down-regulated gene sets based on microarrays. A criticism of many functional genomics approaches is that these experiments are ‘fishing expeditions’, or purely hypothesis-generating rather than hypothesis-testing (or hypothesis-driven). Moreover, the models produced do not account for quantitative data. However, the GO is also being used for quantitative, hypothesis-testing and hypothesis-driven modeling and an example of this is a study of GaHV-2 transformed lymphomas [18] (Box 2). More recently, the GO has also been used to predict host-pathogen interactions (Box 3) [33,34]. Moreover, a study of available human-viral interaction data demonstrated that pathogens tend to interact selectively with genes coding for ‘hubs’ (proteins that have many interaction partners) or ‘bottlenecks’ (proteins that control multiple pathways) in the human interaction network [35], providing another way to filter computationally predicted host-pathogen interactions. Interestingly, virus-host interaction data are being annotated in several molecular interaction databases, most notably Molecular INTeraction (MINT) [36], Pathogen Interaction Gateway (PIG) [37] and the Database of Interacting Proteins (DIP) [38]. However, because functional annotation for most pathogens lacks a funded effort, current host-pathogen
Figure 1. GO-based tools available from AgBase. The tools are designed to assist with the analysis of experimental datasets using the GO (see: http://www.agbase.msstate.edu/). They may be used independently, or as a pipeline (shown on the right).
data are mostly based on computational analyses and remain to be tested in vivo. Nevertheless, the combined use of GO information and systems biology network modeling is a promising avenue of research for understanding host-pathogen interaction.
Concluding remarks and future directions
Functional annotation is required to make sense of the datasets generated by microarray and proteomics technologies, which are increasingly being used to investigate virus-host interactions [39]. Currently, GO annotations are available for many important animal hosts but only a few of their pathogens. A focused effort to provide functional annotations for viral gene products will greatly facilitate modeling of viral infection, pathogenesis and host responses in animals. The PAMGO group led a large scale development of GO biological process terms relating to pathogen gene function and this effort was a crucial first step towards providing GO annotations for pathogen genes [22]. In the future, however, focused efforts to annotate pathogen genes will allow the expansion of this initial effort. In addition to providing GO annotations for viral gene products, future annotation efforts must work closely with experts in relevant research communities. Although most researchers do not have the time, training or inclination to biocurate their own data to the standard required by the
GO Consortium, these researchers increasingly rely on annotated biological data to effectively model their datasets. At the same time, the amount of peer-reviewed biological literature is increasing exponentially while there are relatively few biologists trained to curate these data. By developing new methods for capturing community annotations, biocurators can more efficiently and precisely curate biological information. Moreover, this will allow researchers to minimize their frustration at the lack of GO annotations for all but a handful of organisms and to ensure that their experimental data are accurately represented in existing ontologies. One approach to facilitate curation of biological data is for biocurators to develop partnerships with scientific journals, ensuring that published data contain the information required to facilitate their annotation [40]. Another approach, as described earlier, is to develop a publicly available ‘community annotation file’ [14], which serves as an intermediate file for biocurators who can add the additional information and undertake the quality checks required to submit this information to the GO Consortium.

In addition to providing improved mechanisms for biocurating functional data and making GO annotations, we also need to address the current limitations in the use of GO for functional modeling of host-pathogen systems. Currently, available tools for gene expression analysis need both to be more flexible in allowing users to upload their own GO annotations and to handle multi-species analyses. Although there have been several interesting approaches for mining existing GO annotations to provide interaction data for systems biology analyses [33,34,37], this is still an under-exploited area that is likely to produce real benefits in the future.

References
1 Nanduri, B. and McCarthy, F.M. (2007) AgBase - a tool for systems biology in agricultural species. CAB Reviews. PAVNNR 2, 13–26
2 Lewis, S.E. (2005) Gene Ontology: looking backwards and forwards. Genome Biol. 6, 103
3 Rhee, S.Y. et al. (2008) Use and misuse of the gene ontology annotations. Nat. Rev. Genet. 9, 509–515
4 Nanduri, B. et al. (2005) Proteomic analysis using an unfinished bacterial genome: the effects of subminimum inhibitory concentrations of antibiotics on Mannheimia haemolytica virulence factor expression. Proteomics 5, 4852–4863
5 Severino, P. et al. (2007) Comparative transcriptome analysis of Listeria monocytogenes strains of the two major lineages reveals differences in virulence, cell wall, and stress response. Appl. Environ. Microbiol. 73, 6078–6088
6 McCarthy, F.M. et al. (2006) AgBase: a functional genomics resource for agriculture. BMC Genomics 7, 229
7 Gene Ontology Consortium (2006) The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 34, D322–D326
8 Reeves, G.A. and Thornton, J.M. (2006) Integrating biological data through the genome. Hum. Mol. Genet. 15, R81–R87
9 Tatusov, R.L. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41
10 Montague, M.G. and Hutchison, C.A., 3rd (2000) Gene content phylogeny of herpesviruses. Proc. Natl. Acad. Sci. U. S. A. 97, 5334–5339
11 Gene Ontology Consortium (2008) The Gene Ontology project in 2008. Nucleic Acids Res. 36, D440–D444
12 Barrell, D. et al. (2009) The GOA database in 2009–an integrated Gene Ontology Annotation resource. Nucleic Acids Res. 37, D396–D403
13 Leu, J.H. et al. (2007) Comparative analysis of differentially expressed genes in normal and white spot syndrome virus infected Penaeus monodon. BMC Genomics 8, 120
14 McCarthy, F.M. et al. (2007) AgBase: a unified resource for functional analysis in agriculture. Nucleic Acids Res. 35, D599–D603
15 Jiang, W. et al. (2008) Constructing disease-specific gene networks using pair-wise relevance metric: application to colon cancer identifies interleukin 8, desmin and enolase 1 as the central elements. BMC Syst. Biol. 2, 72
16 Buza, J.J. and Burgess, S.C. (2008) Different signaling pathways expressed by chicken naive CD4(+) T cells, CD4(+) lymphocytes

Box 3. Computational methods that use GO to predict host-pathogen interactions

Integrating phenotype, GO and pathway data
This approach relies on data from PhenoGO (http://www.phenogo.org/), a database that infers the anatomical and cellular context for gene products with GO annotations based upon literature and database mining. PhenoGO is used to extract human protein-disease relationships. Next, the Reactome knowledgebase is used to provide protein–protein interaction (PPI) data, which are statistically correlated with these diseases. The resulting data are qualitatively assessed by comparing PPIs shared by two diseases with the published experimental literature for these proteins. This approach is necessarily limited to species that are represented in both the PhenoGO and Reactome databases.

Predicting inter-species interactions from existing protein–protein interactions
This approach leverages existing PPI information from public databases. Due to the limited amount of host-pathogen data in public interaction databases, intra-species PPIs are analyzed to determine the likelihood of specific pairs of domains interacting.
These statistical data are used to predict host-pathogen interactions. In the absence of experimentally verified host-pathogen data, computational analyses are used to assess the predicted host-pathogen interactions. First, because a protein’s function is determined by the proteins it interacts with, host proteins predicted to interact with the same pathogen protein are examined to see if they are close to each other in the host PPI network. Second, gene expression data are analyzed to determine if expression profiles are similar for predicted host-pathogen interacting pairs. Third, GO annotations for host-pathogen interactions are examined to find out if these proteins have closely related functions. Although this method is more flexible, relying on less specialized databases, it is still limited by the lack of existing PPI data.

Methods for predicting host-pathogen interactions would benefit from a ‘gold standard’: a set of experimentally verified host-pathogen interactions that could be used for evaluating computational prediction methods and to provide information about interactions for closely related host-pathogen pairs.
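The domain-based approach described in Box 3 can be illustrated with a minimal sketch. It assumes a deliberately simple scheme: the likelihood of a domain pair interacting is estimated as the fraction of intra-species protein pairs carrying that domain pair that are known to interact, and a host-pathogen protein pair is scored by combining its domain-pair likelihoods with a noisy-OR. This is a simplification for illustration only and is not the statistical model of the cited studies [33]; all protein and domain names below are hypothetical toy data.

```python
from collections import Counter
from itertools import combinations, product


def pair_key(d1, d2):
    """Order-independent key for a pair of domains."""
    return tuple(sorted((d1, d2)))


def domain_pair_scores(domains, interactions):
    """Estimate, for each domain pair, the fraction of protein pairs
    carrying that domain pair that are known to interact."""
    known = {frozenset(pair) for pair in interactions}
    seen, interacting = Counter(), Counter()
    for p, q in combinations(sorted(domains), 2):
        # Count each domain pair once per protein pair.
        keys = {pair_key(a, b) for a, b in product(domains[p], domains[q])}
        for key in keys:
            seen[key] += 1
            if frozenset((p, q)) in known:
                interacting[key] += 1
    return {key: interacting[key] / seen[key] for key in seen}


def interaction_score(host_domains, pathogen_domains, scores):
    """Noisy-OR over all domain pairs; unseen domain pairs contribute 0."""
    keys = {pair_key(a, b) for a, b in product(host_domains, pathogen_domains)}
    p_none = 1.0
    for key in keys:
        p_none *= 1.0 - scores.get(key, 0.0)
    return 1.0 - p_none


# Hypothetical intra-species training data (identifiers are illustrative).
TRAIN_DOMAINS = {
    "A": {"SH3"},
    "B": {"PRM"},
    "C": {"SH3", "kinase"},
    "D": {"PRM"},
    "E": {"kinase"},
}
TRAIN_PPI = [("A", "B"), ("C", "D")]

scores = domain_pair_scores(TRAIN_DOMAINS, TRAIN_PPI)

# Hypothetical host and pathogen proteins to be scored.
host = {"host1": {"SH3", "kinase"}, "host2": {"PDZ"}}
pathogen = {"viral1": {"PRM"}}
for h, v in product(host, pathogen):
    print(h, v, round(interaction_score(host[h], pathogen[v], scores), 3))
```

In this toy dataset, host1 receives a higher score than host2 because its domains co-occur with the viral protein's domain in known intra-species interactions, whereas host2 carries only a domain never seen in the training data. In practice, predictions of this kind would still need the network, expression and GO-based checks described in Box 3, and ultimately experimental verification.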
activated with staphylococcal enterotoxin B, and those malignantly transformed by Marek’s disease virus. J. Proteome Res. 7, 2380–2387
17 Kumar, S. et al. Genotype dependent tumor regression in Marek’s Disease is mediated at the level of tumor immunity. Cancer Microenviron. (in press)
18 Shack, L.A. et al. (2008) The neoplastically transformed (CD30hi) Marek’s disease lymphoma cell phenotype most closely resembles T-regulatory cells. Cancer Immunol. Immunother. 57, 1253–1262
19 Klein, G. (2002) Perspectives in studies of human tumor viruses. Front. Biosci. 7, d268–d274
20 Klouche, M. et al. (2004) Virokines in the pathogenesis of cancer: focus on human herpesvirus 8. Ann. N. Y. Acad. Sci. 1028, 329–339
21 Smith, S.A. and Kotwal, G.J. (2001) Virokines: novel immunomodulatory agents. Expert Opin. Biol. Ther. 1, 343–357
22 Torto-Alalibo, T. et al. (2009) The Plant-Associated Microbe Gene Ontology (PAMGO) Consortium: community development of new Gene Ontology terms describing biological processes involved in microbe-host interactions. BMC Microbiol. 9 (Suppl 1), S1
23 Qian, Z. et al. (1995) Transactivation activity of Meq, a Marek’s disease herpesvirus bZIP protein persistently expressed in latently infected transformed T cells. J. Virol. 69, 4037–4044
24 Camon, E. et al. (2004) The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 32, D262–D266
25 McCarthy, F.M. et al. (2007) GOing from functional genomics to biological significance. Cytogenet. Genome Res. 117, 278–287
26 Robinson, K.E. et al. (2008) The essential and non-essential genes of Bovine herpesvirus 1. J. Gen. Virol. 89, 2851–2863
27 Khatri, P. and Draghici, S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21, 3587–3595
28 Blanchard, Y. et al. (2006) Cellular gene expression survey of PseudoRabies Virus (PRV) infected Human Embryonic Kidney cells (HEK-293). Vet. Res. 37, 705–723
29 Danaher, R.J. et al. (2008) Herpes simplex virus type 1 modulates cellular gene expression during quiescent infection of neuronal cells. Arch. Virol. 153, 1335–1345
30 Seo, M.J. et al. (2005) New approaches to pathogenic gene function discovery with human squamous cell cervical carcinoma by gene ontology. Gynecol. Oncol. 96, 621–629
31 Zilliox, M.J. et al. (2007) Gene expression changes in peripheral blood mononuclear cells during measles virus infection. Clin. Vaccine Immunol. 14, 918–923
32 Song, X.M. et al. (2008) Streptococcus pneumoniae early response genes to human lung epithelial cells. BMC Res. 1, 64
33 Dyer, M.D. et al. (2007) Computational prediction of host-pathogen protein-protein interactions. Bioinformatics 23, i159–i166
34 Sam, L. et al. (2007) Discovery of protein interaction networks shared by diseases. Pac. Symp. Biocomput. 76–87
35 Dyer, M.D. et al. (2008) The landscape of human proteins interacting with viruses and other pathogens. PLoS Pathog. 4, e32
36 Chatr-aryamontri, A. et al. (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res. 35, D572–D574
37 Driscoll, T. et al. (2009) PIG–the pathogen interaction gateway. Nucleic Acids Res. 37, D647–D650
38 Salwinski, L. et al. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 32, D449–D451
39 Forst, C.V. (2006) Host-pathogen systems biology. Drug Discov. Today 11, 220–227
40 Ort, D.R. and Grennan, A.K. (2008) Plant Physiology and TAIR partnership. Plant Physiol. 146, 1022–1023
41 Yu, H. et al. (2004) Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 14, 1107–1118