Ecological Informatics 6 (2011) 4–12
Implications of informatics approaches in ecological research

Kelsey J. Metzger a,⁎, Rebecca Klaper b, Michael A. Thomas c,1

a Center for Learning Innovation, University of Minnesota Rochester, 300 University Square, 111 South Broadway, Rochester, MN, 55904, United States
b Great Lakes WATER Institute, School of Freshwater Sciences, University of Wisconsin-Milwaukee, 600 E Greenfield Ave, Milwaukee, WI 53204-2944, United States
c Department of Biological Sciences, Idaho State University, 921 South 8th Avenue, Stop 8007, Pocatello, ID, 83209-8007, United States
Article info

Article history: Received 16 November 2010; Accepted 16 November 2010; Available online 26 November 2010

Keywords: Ecological informatics; Bioinformatics; Next-generation sequencing; Functional genomics; Gene expression; Multi-disciplinary
Abstract

Rapid advances in molecular methodologies, computational modeling, GIS applications, and innovations in other fields have influenced the scope and nature of ecological studies in recent decades. Techniques from genomics previously considered primarily useful in the realm of biomedical research have been adopted and adapted for use in ecological contexts, yielding insights into the underlying genetic structure of populations, environment/genome associations, the classification of biodiversity, the quantification of genetic variation within and between groups, and comparisons of genome structure and gene expression. The use of comparatively inexpensive next-generation sequencing (NGS) technology to rapidly produce a large quantity of sequence data will continue to propel the use of informatics in ecological studies, including metagenomics studies and studies utilizing non-model organisms for which whole genome sequences are not yet available. Here, we describe the multi-disciplinary and innovative nature of recent ecological studies with informatics components, review some of the ways that informatics methods are being utilized to answer ecologically motivated questions, and explore the implications of these approaches for ecological studies. © 2010 Elsevier B.V. All rights reserved.
Contents

1. Introduction
1.1. Informaticians as ecologists: changing roles in a changing world (an educational context)
1.2. The strength (and challenge) of integrated approaches for ecological studies in the informatics age
2. Current trends: questions being asked in ecological informatics studies
2.1. Diversity: composition and quantification of communities
2.2. Ecoinformatics, conservation and wildlife management
2.3. Effects of environmental perturbances: toxins
3. Genome-scale applications for ecological investigations
3.1. Functional genomics and NGS
3.2. Model organism databases
3.3. Value added approaches for microarray studies
4. Challenges facing ecoinformaticians
4.1. Lack of curated non-model organism data bases and gene annotation
4.2. Informatics infrastructure development
4.2.1. Visualization of ecological informatics data: an important challenge and need
4.3. Accessing data, data sharing: culture of use and reward, authorship and collaborative guidelines
5. Conclusions
Acknowledgements
References
Abbreviations: NGS, Next-generation sequencing; GSEA, Gene Set Enrichment Analysis. ⁎ Corresponding author. Tel.: + 1 507 258 8214. E-mail addresses:
[email protected] (K.J. Metzger),
[email protected] (R. Klaper),
[email protected] (M.A. Thomas). 1 Tel.: + 1 208 282 2396. 1574-9541/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.ecoinf.2010.11.003
K.J. Metzger et al. / Ecological Informatics 6 (2011) 4–12
1. Introduction

This review explores recent advances in the field of ecological informatics with an emphasis on informatics consequences for genome applications in ecology. Large amounts of genome sequence and gene expression data are increasingly produced and made available for ecological investigations in landscape genetics (Sork et al., 2010), biodiversity (Pfrender et al., 2010), and phylogenomics (Emerson et al., 2010). As a result, informaticians are rapidly becoming key team members in ecology groups and are helping to enable a renewed appreciation for and understanding of the genetic and evolutionary underpinnings of ecological phenomena. Here, we explore how tools and approaches from biomedical genomics and informatics (e.g. Venter et al., 2001) are carrying over to ecological applications. We also explore how ecologists tend to use (and sometimes repurpose) informatics tools, which in turn influences the use of these tools by biomedical bioinformatics researchers. We review recent genome-scale ecology articles, especially studies that have had substantial impact on the field of ecological informatics or landscape genetics, and examine the kinds of questions that are being asked, with emphasis on informatics-intensive applications, such as whole genome analyses for detecting diversity within or between groups.

1.1. Informaticians as ecologists: changing roles in a changing world (an educational context)

“We are not students of some subject matter, but students of problems. And problems may cut right across the borders of any subject matter or discipline.”2

A decade of integrative research in the biological sciences since the introduction of ‘systems biology’ approaches (Wolkenhauer, 2001; Kitano, 2002) has flourished thanks to the expertise and contributions of diverse experts joining forces to make significant headway on complex investigations.
This changing focus and need for multi-disciplinary teams is keenly exemplified in ecological and ecoinformatic investigations (Michener et al., 2001; Chon and Park, 2006; Baker and Chandler, 2008; Sork and Waits, 2010). Ecological research comprises many levels of organization; thus, a major challenge for ecological researchers is to design and implement research endeavors at the appropriate level for the question at hand, while not disregarding the influence of other aspects of the ecosystem or population of interest (Ferriere and Fox, 1995; Michener et al., 2001; Schnase et al., 2003; Jones et al., 2006; Kelling et al., 2009). Appropriately approaching biological complexity is a challenge in genome-scale ecological and landscape genetics research projects as well, and can be compounded by the multiple layers of genome organization and control that exist in addition to the relevant ecological levels of organization (Schnase et al., 2007; Manel et al., 2010).

The changing nature of research questions and methodology in the life sciences has led to updated education guidelines and objectives (e.g. NRC, 2009; AAAS, 2010; AAMC and HHMI, 2009; see also Labov et al., 2010; Woodin et al., 2010) with an emphasis on deep rather than superficial learning, and on the ability to integrate concepts from diverse areas of study, particularly between sub-disciplines of the biological sciences:

“A deeper understanding of biological systems emerges from the multifaceted thinking of experts from a variety of disciplines. This deeper understanding will advance biology from an era of observation and mechanism to one of deciphering design principles for biological processes, making them accessible to manipulation and eventually predictable.” (Labov et al., 2010; NRC, New Biology for the 21st Century, Figure 2.1).
2 Popper, K.R., 1963. Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Routledge and Kegan Paul. p. 88.
Undergraduate, graduate and post-graduate training programs in biology have noted the changing needs of new investigators and have taken measures to reshape curricula to emphasize cellular and molecular mechanisms, molecular data, computational techniques, and informatics competency in addition to ecological and organismal biology. A number of training programs now exist to address the need for educators, program directors, and researchers competent in both biomedical disciplines and informatics (Sahinidis et al., 2005; Altman and Klein, 2007; Severtson et al., 2007; Gerstein et al., 2007; Johnson and Friedman, 2007). In fact, the explosion in programs from 1990 to 2000 surpassed the need for biomedically-focused bioinformaticians in the job market in the early 2000s (Black and Stephan, 2005), providing an opportunity for bioinformaticians to refocus their specialties on ecological investigations. Funding guidelines and the awards made are changing too. Informaticians are increasingly becoming essential team members for maintaining, manipulating, and making use of large data sets of molecular information (e.g., NSF Interdisciplinary Research Program, http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503439).

1.2. The strength (and challenge) of integrated approaches for ecological studies in the informatics age

There is little doubt that multi-disciplinary teams have the potential to make more substantial inroads in investigating complex biological and ecological phenomena than less interdisciplinary teams, but such teams are not without their challenges: members of each discipline are trained in a culture with its own language, techniques, and standard practices, distinct even from those of colleagues in closely related fields. Research teams composed of diverse experts may therefore find that communication is an even more essential component of producing synergies than in homogeneous, single-discipline teams.
Indeed, it is recognized that training for informaticians should include not only disciplinary knowledge, but also training in working as part of multi-disciplinary teams (van Mulligen et al., 2008). Today's best graduate training programs for bioinformatics include coursework and lab rotations in ecological and organismal biology, rather than simply genetics and physiology. The field of landscape genetics provides a case study of success resulting from a research perspective incorporating multiple levels of biological organization, the utilization of new technology (GIS and computational modeling), and the integration of ecology, population genetics, landscape ecology, biological and spatial statistics, evolution, phylogeography, and other disciplines (Kronauer et al., 2005; Manel et al., 2010; Sork et al., 2010; Toon et al., 2010). The future of ecological informatics (and perhaps also the future of the populations and species being studied, especially those with conservation implications and pressing management decisions) will be shaped by the habits, patterns and practices of new multi-disciplinary research teams (Michener et al., 2001; Green et al., 2005; Chon and Park, 2006; Williams and Poff, 2006; Sork and Waits, 2010).

2. Current trends: questions being asked in ecological informatics studies

2.1. Diversity: composition and quantification of communities

Ecology has a long history of quantifying biodiversity, at scales from populations to ecosystems. Ecoinformatics and genome-scale approaches utilizing next-generation sequencing (NGS) methods improve the resolution of biodiversity estimates by providing an information-rich quantitative approach to catalog, quantify, and examine diversity (Michener et al., 2001; Thomas and Klaper, 2004; Yao et al., 2006; Goethals, 2007; Schnase et al., 2007; Hale and Hollister, 2009; Koopman et al., 2010) and genetic structure of populations (O'Corry-Crowe et al., 2006; Morin et al., 2010; Sork et al.,
2010; Perez-Gonzalez et al., 2010; Wooten et al., 2010). Prokaryotic and eukaryotic microorganisms have been the focus of many such recent biodiversity studies, on scales from communities within individual land plants (Koopman et al., 2010) to communities in rivers, lakes and oceans (Garneau et al., 2006; Johnson et al., 2006; Danovaro et al., 2009; Longhi and Beisner, 2010; Medinger et al., 2010). Next-generation sequencing methods used to assay species distribution and abundance still need to be optimized (for instance, by estimating abundance with single-copy genetic markers rather than markers with copy number variation, such as SSU rRNA genes). Nevertheless, some studies indicate that next-generation molecular approaches will improve estimates of species richness relative to traditional morphological species estimation methods (Medinger et al., 2010).

2.2. Ecoinformatics, conservation and wildlife management

The results of ecoinformatic studies can have far-reaching implications for conservation, management, and recommended land use practices. For example, measurements of taxonomic diversity and composition of macroinvertebrates in freshwater ecosystems are utilized as indicators of water quality and of the efficacy of remediation and restoration efforts (e.g., Pfrender et al., 2010), and thus influence decisions made by regulatory bodies regarding water quality, restoration, or remediation measures. In addition, species with pressing conservation and management issues are increasingly the focus of ecological informatics studies (Morin et al., 2010; O'Corry-Crowe et al., 2006; Wooten et al., 2010). Defining unique taxonomic groups based on genetic structure in an ecological and geographical context aids in developing appropriate management and/or conservation policies (Hale and Hollister, 2009; Costello, 2009; Goetz et al., 2010; Wooten et al., 2010).
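The diversity measurements discussed above (species richness, and abundance estimates that are distorted when a marker such as the SSU rRNA gene varies in copy number) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the taxon names, the read counts, the per-genome copy numbers, and the choice of the Shannon index as the diversity summary. It is not the method of any study cited here.

```python
import math

def copy_number_corrected(read_counts, copies_per_genome):
    """Divide each taxon's marker read count by its assumed marker copy number.

    Taxa missing from copies_per_genome are treated as single-copy.
    """
    return {taxon: count / copies_per_genome.get(taxon, 1)
            for taxon, count in read_counts.items()}

def richness(abundance):
    """Number of taxa with non-zero abundance."""
    return sum(1 for value in abundance.values() if value > 0)

def shannon_index(abundance):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with p_i > 0."""
    total = sum(abundance.values())
    if total == 0:
        return 0.0
    proportions = [value / total for value in abundance.values() if value > 0]
    return -sum(p * math.log(p) for p in proportions)

# Hypothetical marker read counts for one sample.
reads = {"taxon_A": 400, "taxon_B": 100, "taxon_C": 0}
# Hypothetical marker copies per genome (taxon_A carries four copies).
copies = {"taxon_A": 4, "taxon_B": 1}

corrected = copy_number_corrected(reads, copies)
print(richness(corrected))
print(shannon_index(reads), shannon_index(corrected))
```

In this toy sample the raw counts suggest that taxon_A dominates, while the corrected abundances are equal, so the corrected Shannon index is higher. Richness is unaffected by the correction; only relative abundance changes.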
As an example, killer whales (Orcinus orca) are currently recognized as a single, cosmopolitan species composed of multiple poorly characterized subgroups (listing of “Data Deficient” by the International Union for Conservation of Nature and Natural Resources, 2010: http://www.iucnredlist.org/; http://www.iucnredlist.org/apps/redlist/details/15421/0). This lack of taxonomic clarity complicates conservation status assignment and management actions. Recently, high-throughput sequencing methods were utilized to further establish the status of sympatric, but behaviorally, reproductively, and evolutionarily isolated ecotypes within this species (Morin et al., 2010). Such data should allow for the reevaluation and updating of the taxonomy of this genus, and subsequent appropriate conservation status listing for identified taxonomic units that are particularly vulnerable because of their unique ecological and behavioral attributes (e.g. predatory tendencies and availability/abundance of prey items). Similarly, Steller sea lion (Eumetopias jubatus) populations along the coast of Alaska and the Aleutian Islands have long been recognized as representing at least two distinct population segments (DPS, or ‘stock’ populations). The eastern population segment is currently listed as “threatened” while the western population segment is listed as “endangered.” Recent phylogeographic analyses provide evidence that supports an argument for the recognition of additional distinct population segments within the western DPS that occupy distinct ecological niche habitats and have become genetically differentiated. The recognition of additional significant population segments will inform future decisions about Endangered Species Act listing status and management (O'Corry-Crowe et al., 2006).
A primary emphasis of the field of conservation genetics is to identify, describe and protect unique evolutionary lineages (Shaffer et al., 2004); NGS approaches give investigators the ability to identify unique evolutionary lineages on an ever-finer scale, complicating the differentiation of ‘unique evolutionary lineages’ into manageable taxonomic units that are meaningful for biodiversity, environmental stability, and conservation. With the increased use of genomic
methods to inform conservation policies and practices, attention will need to be focused on updating guidelines and policies that address how ‘unique evolutionary lineages’ should be defined for management considerations. Integrated ecological informatics approaches combining landscape genetics, niche modeling, and adaptive hypothesis testing help tease out signals of divergent selection and unique lineages across environmental features that may be missed by traditional phylogenetic approaches (Emerson et al., 2010; Freedman et al., 2010). Such studies give indications as to which segments of populations are likely to be more genetically robust in the face of rapidly changing environmental conditions resulting from global climate change or from changes in land use affecting distribution ranges for species (Sork et al., 2010). In addition, tracking the evolution and spread of invasive species and monitoring biological control endeavors could also be enhanced through the use of genome informatics alongside current non-genome datasets and databases (Bai et al., 2010; Marsico et al., 2010).

2.3. Effects of environmental perturbances: toxins

Genomic data have provided a plethora of information on the impacts of potential toxins on humans and other organisms. Genomic data can provide information as to the mechanism by which a toxin may act, elucidate the types of effects of exposure and the impacts of different doses, and offer ways to measure exposure in natural populations or epidemiological studies (Klaper and Thomas, 2004). Microarray gene expression studies are now being used to diagnose certain types of cancers (Fan et al., 2010). In these assays, which have been developed using cancer tissues from hundreds of cases, combinations of genes have been teased out of patterns of thousands to provide molecular classifiers of cancers that accurately predict outcomes.
Similar studies have now shown that there are patterns visible among the gene expression signatures from exposures to various chemical classes, so that compounds can be grouped according to their mechanisms of action and microarray gene expression studies can provide a reliable way to detect effects (e.g. Sawle et al., 2010). Genomics provides a way to detect the threshold for a response. Instead of relying on classic endpoints of toxicity, genomic measures provide an early indication of how tissues are responding to a chemical and predict what will happen to those tissues over time (Steinberg et al., 2008; Zarbl et al., 2010). Genomics also provides new information as to alternative functions, genes and receptors that may be triggered by an exposure and that would not be measured otherwise (Iguchi et al., 2007). Although genomics has provided a more thorough way of examining the impacts of chemicals, problems remain to be solved before these technologies can be fully implemented in determining the risk of a chemical to humans and wildlife. Part of this problem is one of informatics. In order to fully develop gene expression patterns as indicators, there needs to be a larger validation that these patterns are predictive of endpoints of concern. Specifically, does a change in gene expression mean that a person will develop cancer or reproductive problems, and in a population of fish, does a gene expression change indicate that the population will crash? How do gene expression patterns relate to traditional endpoints? How much does the expression have to change to signal a more serious outcome? Several papers have called for a greater effort to validate these patterns and link traditional endpoints to gene expression endpoints (e.g. Van Aggelen et al., 2010). In part because of the large number of studies involved, informatics is needed to provide a platform to link the various endpoints. One such effort is the Comparative Toxicogenomics Database (see Davis et al., 2011).
Here curators comb the literature for papers that have made a link between either a chemical and a specific gene or a disease, or a gene-disease relationship. The curators then create networks using this information to try to predict existing and novel relationships among these categories. One of the benefits of
this centralized mechanism is that it is a well-curated database with controls placed on the vocabulary entered and on the ontologies of gene information. However, with this heavy curation comes the price of missing information from studies that do not conform to a search engine, and a lack of direct interaction with the scientists who create the data. In addition, global gene expression studies do not readily lend themselves to such a database analysis, as patterns can be more important than a single gene change. On the other end of the spectrum are datasets with little curation and no effort to link studies, such as the Gene Expression Omnibus (GEO) database. Another issue that arises is the annotation of genomic information, in particular for non-human model organisms. With the drop in the cost of sequencing technologies and the immense amount of sequence data generated using next-gen sequencing, scientists are not necessarily limited by the availability of sequence information but by the annotation of that information. Informatics could greatly help with the effort to provide functional information for sequences generated in these studies and therefore give more meaning to the endless stream of sequences generated.

3. Genome-scale applications for ecological investigations

3.1. Functional genomics and NGS

Genetic maps (primary genome sequences) are necessary, but not sufficient, to infer genome functionality (Werner, 2010). Future genome studies will focus not only on establishing the content and organization (large- and fine-scale) of an organism's genome, but also on functional genomics: how the genetic information is expressed in various developmental stages, specific tissues, or in a given environmental context. Microarray studies have been used to successfully investigate a variety of biological phenomena (Smith and Greenfield, 2003; Hu et al., 2007; Simon, 2008; Goetz and MacKenzie, 2008).
The disadvantage of microarray approaches is that, despite their cost efficiency, they are dependent on an informed concept of candidate genes. If the transcripts chosen for the array do not represent the transcripts expressed in the treatment (time, tissue, conditions, etc.), then important information will be missed. In addition, array-based approaches must overcome issues associated with background hybridization and variation in hybridization properties among probes. NGS transcriptome and epigenetic analysis of organisms from different geographic regions, in different developmental stages, or in different environmental conditions will be fruitful areas of biomedical and ecological informatics research (Rockman and Kruglyak, 2006; Goetz et al., 2010; Harris et al., 2010; Künstner et al., 2010; Werner, 2010). An alternative to the array-based approach is to use next-gen sequencing to assess the entirety of the transcriptome response to a given treatment (Wang et al., 2009a,b; Goetz et al., 2010; Vandegehuchte et al., 2010; Wang et al., 2010). RNA-Seq is preferable to microarray or other hybridization approaches because there is no requirement to identify genes of interest going into the study, so the genes detected are not biased toward previous studies or markers. Such an approach is particularly attractive in non-model organism systems for which such markers and reference genome sequences are not yet developed (Goetz and MacKenzie, 2008; Goetz et al., 2010). One such approach, High-Throughput SuperSAGE (Matsumura et al., 2010), is a next-generation sequencing-based SuperSAGE profiling approach that is adapted to the simultaneous analysis of multiple samples. RNA-Seq results are strongly correlated with array-based results: the same genes tend to be differentially expressed to similar degrees (Marioni et al., 2008). The RNA-Seq approach has several important advantages.
First, because RNA-Seq has little systematic error, fewer biological replicates are needed to achieve an acceptable signal-to-noise ratio (Marioni et al., 2008). Second, RNA-Seq finds
rare transcripts that are not present on any array, since arrays are constructed using well-described and observed transcripts. Third, RNA-Seq provides more information on alternative splicing, which is invaluable when comparing treatments that might induce different splicing isoforms (which an array would either conflate into a single form or partly ignore) (Wang et al., 2010). Fourth, although expression levels determined by RNA-Seq are strongly correlated with those determined by array-based methods, a given RNA-Seq measurement is less likely to be a false positive, because arrays are subject to systematic hybridization error. Next generation sequencing methods will continue to allow more non-model organisms to be studied at individual and population levels to a greater extent, and with greater specificity, than previous sequencing technologies have allowed. RNA-Seq approaches, however, raise informatics issues that have been largely solved for array approaches, for which mature analytical pipelines (Slonim and Yanai, 2009) and standards for publishing results (Brazma, 2009) already exist. To begin with, RNA-Seq requires substantial computational support (both personnel and hardware). A robust informatics plan is an essential component of a competitive NSF Major Research Instrumentation grant proposal. A researcher can choose next-gen sequencing platforms that feature short reads with a low cost per Mb (Illumina and SOLiD) or long reads with a higher cost per Mb (454). While shorter read platforms feature lower error rates and deeper coverage, the 454 platform is probably more appropriate for RNA-Seq studies involving non-model organisms, since these lack a reference genome for assembly (Goetz et al., 2010). The assembly process without a reference genome (which will be typical for ecological researchers) can involve pooling reads from all treatments for a comprehensive assembly, then tallying reads per megabase (RPMB) for each treatment.
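The tallying step for a reference-free assembly can be sketched in Python as follows. The sketch assumes one plausible reading of "reads per megabase": the number of reads assigned to a contig in a treatment, divided by the megabases sequenced for that treatment. The treatment names, contig names, counts, and the function `reads_per_megabase` are all hypothetical and are not drawn from the studies cited.

```python
def reads_per_megabase(contig_counts, total_bases_sequenced):
    """Scale per-contig read counts by sequencing depth in megabases.

    contig_counts: mapping of contig name to reads assigned in one treatment.
    total_bases_sequenced: total bases sequenced for that treatment's library.
    """
    if total_bases_sequenced <= 0:
        raise ValueError("total_bases_sequenced must be positive")
    megabases = total_bases_sequenced / 1_000_000
    return {contig: count / megabases for contig, count in contig_counts.items()}

# Hypothetical read tallies against contigs from a pooled assembly.
counts = {
    "control": {"contig_1": 50, "contig_2": 10},
    "exposed": {"contig_1": 200, "contig_2": 10},
}
# Hypothetical sequencing depth per treatment, in bases.
bases = {"control": 100_000_000, "exposed": 200_000_000}

normalized = {treatment: reads_per_megabase(counts[treatment], bases[treatment])
              for treatment in counts}

# Depth-normalized ratio for one contig between treatments.
fold_change = normalized["exposed"]["contig_1"] / normalized["control"]["contig_1"]
print(normalized)
print(fold_change)
```

Normalizing by depth matters here because the two hypothetical libraries differ twofold in size: the raw counts for contig_1 differ fourfold, but the depth-normalized ratio is twofold.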
Annotation of results can involve a BLAST search of each assembled transcript against the genome of a related species with a complete genome sequence, or against a functional database such as the Gene Ontology (GO) database. Comparative genomics can be facilitated by using the NGS approaches described above to investigate how differential regulation and expression of genes is related to the phenotypic diversity observed across populations and species (Künstner et al., 2010; O'Neil et al., 2010; Wolf et al., 2010). Investigating sequence differences between orthologous genes is a common approach for investigating evolutionary selection pressures. Quantifying synonymous (dS) and nonsynonymous (dN) substitution rates of genes can yield insights about selection pressures in genomes of different species, but can also be used for within-species comparisons. For example, in species that have chromosomal sex-determining systems, comparisons of substitution rates for genes on sex chromosomes vs. genes on autosomes can reveal interesting patterns of selection (Künstner et al., 2010). Genome-wide sequence coverage allows for the calculation of genome-wide mutation and recombination rates and subsequent comparisons of these parameters between organisms with contrasting life histories (Haubold et al., 2010). Uneven expansion of paralogous genes in different species can also yield insight into selection pressures that favor preservation of duplication events in gene families associated with behavioral differences between species (Zdobnov et al., 2002). These and other studies exemplify how NGS approaches and the production of large amounts of sequence data from closely related species will allow appropriate and biologically interesting comparisons to be made, yielding insight into genome evolution, the relationship between genotype and phenotype, mechanisms of speciation, and the generation and maintenance of biodiversity.
Further discussion of how next generation technology is being implemented, as well as some of the limitations that are becoming evident, can be found in other reviews (Metzker, 2010; Pfrender et al., 2010; Werner, 2010).
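The comparison of synonymous and nonsynonymous substitution rates described in this section is conventionally summarized as a single ratio. The block below states that ratio and its usual interpretation; the symbol ω and the numerical values in the example are introduced here for illustration and do not come from the studies cited. The `cases` environment and `\text` assume the amsmath package.

```latex
\[
\omega = \frac{d_N}{d_S},
\qquad
\begin{cases}
\omega < 1 & \text{purifying (negative) selection} \\
\omega \approx 1 & \text{neutral evolution} \\
\omega > 1 & \text{positive (diversifying) selection}
\end{cases}
\]

% Illustrative values only: a gene with d_N = 0.02 and d_S = 0.10
\[
\omega = \frac{0.02}{0.10} = 0.2
\]
```

With these illustrative values, ω = 0.2 would be read as evidence of purifying selection. The same ratio can be compared between gene classes, such as genes on sex chromosomes versus autosomes.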
3.2. Model organism databases

While biomedical researchers rely on genome databases for every stage of their research, most ecological researchers have traditionally used genome databases only as a waypoint. This is beginning to change, as ecologists generate genome data and informaticians create specialized, flexible, question-oriented platforms to organize and standardize data storage and to centralize data access and interpretation. There are several general types of databases that are of use for ecological investigations. First, literature references can be integrated with specific genetic elements and gene expression experiments based on text mining. The DictyBase (Dictyostelium database) reference library (Fey et al., 2006) provides an excellent example. Second, genome databases provide information to a community of users united in the use of several closely related species to address specific types of problems for which a tight integration of data and powerful relational tools are required. For example, Fleabase (http://wfleabase.org; Gilbert et al., 2005) provides genome information for the Daphnia research community; as such, it tends to cater to toxicological investigations. Its integration of genome structure, gene expression, individual fitness, and population-level responses to environmental change provides a useful model for researchers in other systems asking other types of questions. Similarly, the Sol Genomics Network (http://solgenomics.net; Mueller et al., 2005) is a database and website dedicated to the genomic information of the nightshade family, which includes a number of ecologically important species (such as tomato, potato, pepper, petunia and eggplant). Third, genome databases can provide information that is more directly question- or application-oriented.
The marine ecological genomics database (http://www.megx.net; Kottmann et al., 2010) is designed to address environmentally relevant questions in marine microbiology. Researchers use the database to investigate adaptations and regional differences in the microbial cycling of nutrients, coupling sequence data with geographical information and contextual information such as physical, chemical and biological data. Features of Megx.net include integration of a geographic information system, a database with precomputed phylogenetic reconstruction and diversity analyses, and a database and tool to classify metagenomic fragments based on oligonucleotide signatures. The creation of genome databases from scratch is not necessary given available resources, the most important of which is GMOD (Generic Model Organism Database, gmod.org; O'Connor et al., 2008). However, since GMOD is so strongly suited to data annotation projects rather than data integration initiatives, this review will concentrate on an example of the latter. InterMine (intermine.org) is a powerful open source data warehouse system used to create integrated relational databases of biological data accessed by sophisticated web query tools. InterMine is used to integrate multiple sources of data and includes an attractive, user-friendly web interface that works “out of the box” and can be easily customized for specific needs. Integrating data makes it possible to run sophisticated data mining queries that span domains of biological knowledge. An example of an InterMine database is FlyMine (www.flymine.org; Lyne et al., 2007), which integrates data from Drosophila, Anopheles, C. elegans, and other species so that such cross-domain queries can be run.
Such data integration techniques create connections between relevant information that can yield rich insights that would not have otherwise emerged (Kelling et al., 2009; Patterson et al., 2010; Peterson et al., 2010). 3.3. Value added approaches for microarray studies One of the issues inherent to microarray-based studies of relative gene expression is the tendency to generate lists of hundreds of genes
that are differentially expressed among the phenotypes examined. Absent sophisticated approaches for data analysis, this produces more data but not necessarily new knowledge. It is difficult to draw conclusions from a list of differentially expressed genes: aside from the fact that many of the genes may be aberrant (due to systemic multiple comparison error or experimental error), it is not obvious how the rest of the list should be considered, especially when the organism being studied has no reference genome in NCBI. This is a problem not encountered by the biomedical researchers who pioneered transcriptome analyses, and various approaches to solving this problem have been explored (Slonim and Yanai, 2009). Some researchers, for example, classify each of the hundreds of candidate genes by their Gene Ontology (GO) association (Al-Shahrour et al., 2007). GO provides a standardized vocabulary to describe gene product characteristics involving biological process, cellular component and molecular function. Researchers have curated genes from many organisms with GO terms, providing a powerful resource for comparing gene function across species lines. Additionally, a number of tools are available for assessing the preponderance of GO terms in a list of differentially expressed genes from a microarray experiment, providing a researcher with a snapshot of the functions that are expressed in the data set, rather than simply the genes (see http://www.geneontology.org/ for a list of software tools). While useful, GO analyses have three limitations. First, terms extracted from a list of genes tend to be overly generalized, making it difficult to draw conclusions about the perturbation of a specific function in a given experiment. Second, the number of GO terms extracted tends to be only slightly smaller than the initial list of hundreds of differentially expressed genes, leaving the researcher with a problem of similar scale.
Third, like the list of differentially expressed genes, the list of enriched GO terms reflects only a marginal sampling of the genes that are above an arbitrary expression threshold (e.g., three-fold up- or down-regulated) in the experiment at hand. In other words, it does not provide a system-wide assessment of the functional response to a given treatment, only a subanalysis of the most highly expressed genes for which one is assured significant differential expression. One can imagine a system or process controlled by several dozen genes, of which only one or two may be above the threshold (but which as a whole is significantly up-regulated); such an analysis would not adequately differentiate that function from another (composed of a similar number of genes) that is not generally up-regulated but has a single gene above the threshold by chance. Gene-class analysis, a relatively new approach for systems-based analysis, involves an examination of curated gene set enrichment; the leading method employing this approach is Gene Set Enrichment Analysis (GSEA; Subramanian et al., 2005; Thomas et al., forthcoming). The basic idea of GSEA (and other gene-class approaches) is to test whether a curated set of genes is enriched in expression in one treatment relative to other treatments. This approach takes into account all genes in a microarray experiment, not just those genes that are statistically above an arbitrary threshold of differential expression. GSEA gene sets consist of dozens to hundreds of genes associated with a given GO category, disease state, functional classification, pathway, genome location, or any other conceivable unifying feature used to group genes. The distribution of genes from a given set across the list of genes (ranked by correlation between two treatments) is tabulated, with shifts of genes towards the top or bottom of the list corresponding to a higher enrichment score for the set (relative to a random distribution).
The ranked list is permuted 1000+ times to generate a null distribution from which statistical significance can be determined. Some authors have criticized the GSEA approach as being a Rube Goldberg solution to a problem that could be solved with a simpler, more direct solution (Irizarry et al., 2009). Nonetheless, a GSEA approach is appealing for workers using non-model organisms.
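The gene-class idea described above can be sketched in a few lines. The code below is a simplified, unweighted running-sum statistic with a permutation test; it is an assumption-laden illustration and not the published GSEA implementation, which weights each step by the gene's correlation with the phenotype and normally permutes sample labels. The function names and the choice to shuffle gene order are ours.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up at gene-set members, step down otherwise.

    Returns the maximum deviation from zero (positive if the set is shifted
    towards the top of the ranked list, negative if towards the bottom).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Compare the observed score with scores from randomly shuffled gene orders."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)
```

Because every gene contributes to the running sum, a set whose members are all modestly shifted towards one end of the list can score highly even if no single member passes a fold-change threshold, which is the property that distinguishes this approach from threshold-based GO term counting.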
K.J. Metzger et al. / Ecological Informatics 6 (2011) 4–12
For an ecological study to use GSEA, several unique informatics issues need to be overcome. First, GSEA is built around human genome data; in order to be useful for non-human applications, non-trivial modifications must be made, depending on the end use. The cleanest option involves creating gene sets unique to the application at hand. These will represent the systems and pathways of interest in the given study and each will include 15+ genetic elements known to be associated with that system or pathway. These gene sets might be drawn from information already known about the organism used or composed of homologs drawn from a closely related species. Construction of curated gene sets requires substantial knowledge of both the organism and the system or pathway under consideration. Consequently, the number of gene sets that can be examined would be rather limited. Second, in addition to gene set curation, the user needs to annotate the microarray platform with a standard vocabulary of gene terms that link the curated gene sets to the elements on the array. Human-centric GSEA-formatted gene sets and platforms are cross-referenced with HUGO gene symbols (Eyre et al., 2006). Any new microarray platform used with GSEA needs to be annotated with standard gene identifiers. Doing so requires a BLAST search of the full-length gene element from the new platform against the human NCBI RefSeq database (using blastx with a scoring matrix and other parameters appropriate for the organism in question). RefSeq hits (filtered for E-value, coverage and other quality control parameters) can be easily linked to HUGO symbols. Some organisms (e.g., zebrafish) are already extremely well annotated and deriving the HUGO symbols is trivial. This is clearly a difficult task for non-model organisms, since many of the genetic elements on the array will be unknown, but it has been successfully accomplished (e.g., the fathead minnow: Thomas et al., forthcoming).
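The hit-filtering step described above might be sketched as follows, assuming the BLAST search was saved in the standard 12-column tabular format (`-outfmt 6`). The thresholds, the `query_lengths` dictionary and the `refseq_to_symbol` lookup table are hypothetical inputs introduced for illustration; appropriate cutoffs depend on the organism and array.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_TAB_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                     "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(blast_tab_text, query_lengths, max_evalue=1e-10, min_coverage=0.5):
    """Return {array element: best subject accession} after quality filtering.

    A hit is kept if its E-value is at most max_evalue and the aligned part of
    the query spans at least min_coverage of the query length; among the kept
    hits, the one with the highest bit score wins.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(blast_tab_text),
                            fieldnames=BLAST_TAB_COLUMNS, delimiter="\t")
    for row in reader:
        aligned = abs(int(row["qend"]) - int(row["qstart"])) + 1
        coverage = aligned / query_lengths[row["qseqid"]]
        if float(row["evalue"]) > max_evalue or coverage < min_coverage:
            continue
        score = float(row["bitscore"])
        if row["qseqid"] not in best or score > best[row["qseqid"]][1]:
            best[row["qseqid"]] = (row["sseqid"], score)
    return {query: subject for query, (subject, _) in best.items()}

def annotate_platform(hits, refseq_to_symbol):
    """Map each array element to a gene symbol via its best RefSeq hit, if one is known."""
    return {probe: refseq_to_symbol[acc]
            for probe, acc in hits.items() if acc in refseq_to_symbol}
```

Array elements with no hit passing the filters are simply absent from the result, which makes the unannotated fraction of a non-model platform easy to count.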
A shortcut could make use of the GO annotation that may already be completed and annotated onto the platform. (GO annotations are frequently assigned by sequence similarity searches using widely available tools and do not rely on preexisting knowledge about each genetic element.) In this approach, gene sets associated with given GO classifications are assembled and used in GSEA analysis of a microarray experiment. Dozens of such gene sets could be easily made, leveraging the GO annotation already performed. While this approach (creating new gene sets associated with a specific microarray platform and experiment) is probably the best approach for someone working with an organism that cannot be related to well-annotated species like mouse (e.g., Quercus), it does not leverage the full power of GSEA and the community of users that have created gene sets. Until recently, the primary applications of GSEA approaches involved biomedical studies (Zhang et al., 2009; Tamburini et al., 2009; Chen et al., 2010). For example, Tamburini et al. were able to compare chromosomal aberrations in lymphomas occurring in different canine breeds (Golden Retriever vs. non-Golden Retriever) and conclude differential expression of genes in lymphomas across breeds (Tamburini et al., 2009). For organisms more distantly related to humans, this is more difficult. Thomas et al. (forthcoming) demonstrated that gene expression patterns involved with liver damage in fathead minnows exposed to methylmercury at environmental concentrations were very similar to those of human liver damage associated with hepatocellular carcinoma and hepatitis infection. To accomplish this, the authors linked ~11 k of the 15 k elements on the fathead minnow array to human homologs, opening up the ability to conduct GSEA analyses using human-centric sets.
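The GO-based shortcut described above amounts to inverting an existing element-to-GO annotation table into GO-term-to-element gene sets. A minimal sketch follows, assuming the annotation is available as a dictionary (`probe_to_go_terms`, a hypothetical input) and writing the sets in the tab-delimited GMT layout (set name, description, then members) that GSEA reads. The minimum set size of 15 echoes the "15+ genetic elements" guideline given earlier in this section.

```python
from collections import defaultdict

def gene_sets_from_go(probe_to_go_terms, min_size=15):
    """Invert {array element: [GO terms]} into {GO term: sorted members}.

    Sets with fewer than min_size members are dropped.
    """
    sets = defaultdict(set)
    for probe, terms in probe_to_go_terms.items():
        for term in terms:
            sets[term].add(probe)
    return {term: sorted(members)
            for term, members in sets.items() if len(members) >= min_size}

def to_gmt(gene_sets, description="na"):
    """Serialize gene sets as GMT text: one set per line, tab-separated fields."""
    lines = ["\t".join([name, description] + members)
             for name, members in sorted(gene_sets.items())]
    return "\n".join(lines)
```

Because the sets are keyed by the platform's own element identifiers, no cross-species mapping is needed, at the cost noted above of not being able to reuse community-curated human gene sets.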
For ecologists using non-model organisms, GSEA provides a tool to identify the specific pathways and systems perturbed in response to a given treatment (e.g., trees growing above and below tree line, differential exposure to an environmental toxin, etc.), although making use of differential gene expression data is more difficult in organisms whose genomes lack annotation and perhaps even useful GO categories (e.g. invertebrates).
4. Challenges facing ecoinformaticians To take advantage of the potential that exists in the field of ecological informatics, several challenges will need to be surmounted. While not exhaustive, we propose three main areas to consider: 1. Lack of gene annotation information (and standardized, curated genome databases in general) for non-model organisms; 2. The development of a robust informatics infrastructure, including standard vocabulary and ontologies for ecological investigations; and 3. The development of recognized, accepted practices and culture with regard to authorship and collaboration guidelines, accessing data, data sharing and reuse. 4.1. Lack of curated non-model organism databases and gene annotation Most of the annotated genomes that exist currently are for humans or human models. An investigator focused on non-vertebrate taxa still faces great challenges in making sense of differential gene expression data, no matter how numerous or readily available, when no reference genome or gene ontology category can be used to shed light on novel genes of interest. While NGS methods facilitate the production of sequence data, the use of systematic GO categories and robust database curation will increase the usefulness of such sequence data for subsequent retrieval and comparison. 4.2. Informatics infrastructure development As ecological studies move increasingly into the realm of informatics, a great challenge will be the development of robust informatics infrastructure to facilitate the management and cataloging of data in standardized databases, along with accepted practices governing data curation, accessibility and use (Michener et al., 1995; Michener, 2006; Schnase et al., 2007; Baker and Chandler, 2008; Patterson et al., 2008; Leinfelder et al., 2010; Peterson et al., 2010).
The framework of ecological informatics will likely follow a similar course to that of biomedical informatics, with many of the same recommendations for data management, practice, and user accessibility for software and databases (e.g., robust infrastructure for informatics, development and adoption of tools accessible to a naïve user base, and development of standardization and compatibility across databases and other tools; Pearson and Söll, 1991), reiterated with some specificity for the particular concerns and challenges associated with informatics for ecological applications (Jones et al., 2006; Williams and Poff, 2006; Kelling et al., 2009). For data to be accessible and useful to a broad multi-disciplinary user community, the use of controlled vocabulary and a standardization of language, meta-data concepts, and ontology will be required (Michener et al., 1995; Michener, 2006; Remsen et al., 2006; Patterson et al., 2008; Leinfelder et al., 2010). In addition, novel data integration approaches will be necessary to make full use of related data (Leinfelder et al., 2010; Patterson et al., 2010; Pfrender et al., 2010). 4.2.1. Visualization of ecological informatics data: an important challenge and need Phylogeographic studies yield insight into genetic variation of organisms in the context of geographic landscape parameters (Emerson et al., 2010; Kronauer et al., 2005; Kidd and Ritchie, 2006; Toon et al., 2010). As unique informatics approaches are utilized in ecologically motivated studies, new tools will need to be developed that allow for manipulation and representation of data in ecologically meaningful ways. One example is the development and use of software for visualizing phylogeographic data. Biodiversity studies, in addition to informatics studies with other emphases, can benefit from
the use of additional tools that provide visualization of data in a geographic context (see Schnase et al., 2003 for a commentary on visualization challenges). The recent release of the visualization software GeoPhylo (Hill and Guralnick, 2010) and the novel use of Google Earth (Page, 2008) have allowed for the incorporation of geographically relevant information with phylogenetic information, and should facilitate a more complete representation and understanding of genetic and evolutionary processes for a given population, species, or community in studies utilizing genetic data in geographical contexts (Kronauer et al., 2005; O'Corry-Crowe et al., 2006; Emerson et al., 2010; Morin et al., 2010; Sork et al., 2010; Toon et al., 2010; Storfer et al., 2010). 4.3. Accessing data, data sharing: culture of use and reward, authorship and collaborative guidelines The culture of sharing data, although familiar in long term ecological research (LTER) projects, for example, could also see changes: novel data sharing and reuse may stimulate discussion and revision of guidelines and practices with regard to funding, collaboration, authorship, publishing, and other matters (Jones et al., 2006; Baker and Chandler, 2008). As increasing amounts of data are purposefully linked through stable informatics infrastructure, the number of researchers who may benefit from reuse of such data also increases. Data previously only available within a research community through a shared local repository or other medium would be more broadly accessible through an open digital repository, making such data available for use well beyond the scope of the original investigator and research question (Baker and Chandler, 2008). 5. Conclusions Informatics approaches are rapidly being adopted and adapted for use in non-biomedical investigations.
The use of genome-scale informatics approaches in ecological investigations has great potential to contribute to our understanding of organisms and habitats, biological complexity, organismal diversity, selection pressures, and evolutionary processes. As with any emerging field of research, the development of standard practices regarding data collection, use, and dissemination will greatly further the accessibility and usefulness of such approaches for would-be ecological informatics researchers. Multi-disciplinary teams composed of experts in landscape genetics, ecology, conservation genetics, informatics, computer science, GIS, genetics, and evolution are particularly well suited to conducting such investigations. Insights gleaned from such investigations will potentially inform future conservation and wildlife management decisions and subsequent regulatory guidelines, may influence land use recommendations, and will ultimately redefine the nature and scope of applied and basic ecological investigations. Next-generation sequencing approaches in particular will allow researchers utilizing non-model organisms and communities to probe the molecular basis of ecologically relevant phenotypes and examine the influence of selection and other evolutionary processes to better understand ecological phenomena. However, challenges still remain for researchers working with organisms for which reference genomes or gene ontology categories have yet to be developed. Ecological informatics will certainly bring novel tools, informatics approaches, and research questions designed by researchers to meet challenges and needs perhaps yet to be identified. Acknowledgements The authors would like to thank Friedrich Recknagel for inviting this review, and the PhRMA Foundation for a sabbatical grant to MT.
References AAAS, 2010. Vision and Change: A Call to Action. AAAS, Washington, DC. accessed 2 November, 2010 www.visionandchange.org/VC_report.pdf. Al-Shahrour, F., Minguez, P., Tarraga, J., Medina, I., Alloza, E., Montaner, D., Dopazo, J., 2007. FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. 35, W91–W96. Altman, R.B., Klein, T.E., 2007. Biomedical informatics training at Stanford in the 21st century. J. Biomed. Inform. 40, 55–58. Association of American Medical Colleges, Howard Hughes Medical Institute, 2009. Scientific Foundations for Future Physicians: Report of the AAMC-HHMI Committee, Washington, DC, and Chevy Chase, MD. accessed 2 November, 2010 www. hhmi.org/grants/pdf/08-209_AAMC-HHMI_report.pdf. Bai, X.D., Zhang, W., Orantes, L., Jun, T.H., Mittapalli, O., Mian, M.A.R., Michel, A.P., 2010. Combining Next-Generation Sequencing strategies for rapid molecular resource development from an invasive aphid species, Aphis glycines. PLoS ONE 5, 9. Baker, K.S., Chandler, C.L., 2008. Enabling long-term oceanographic research: changing data practices, information management strategies and informatics. Deep. Sea Res. II: Top Stud Oceanogr 55, 2132–2142. Brazma, A., 2009. Minimum information about a microarray experiment (miame) — successes, failures, challenges. Sci. World J. 9, 420–423. Black, G.C., Stephan, P.E., 2005. Bioinformatics training programs are hot but the labor market is not. Biochem. Mol. Biol. Educ. 33, 58–62. Chen, L.S., Hutter, C.M., Potter, J.D., Liu, Y., Prentice, R.L., Peters, U., Hsu, L., 2010. Insights into colon cancer etiology via a regularized approach to Gene Set Analysis of GWAS data. Am. J. Hum. Genet. 86, 860–871. Chon, T.S., Park, Y.S., 2006. Ecological informatics as an advanced interdisciplinary interpretation of ecosystems. Ecol. Inf. 1, 213–217. Costello, M.J., 2009. 
Distinguishing marine habitat classification concepts for ecological data management. Mar. Ecol. Prog. Ser. 397, 253–268. Danovaro, R., Corinaldesi, C., Luna, G.M., Magagnini, M., Manini, E., Pusceddu, A., 2009. Prokaryote diversity and viral production in deep-sea sediments and seamounts. Deep Sea Res. II: Top. Stud. Oceanogr. 56, 738–747. Davis, A.P., King, B.L., Mockus, S., Murphy, C.G., Saraceni-Richards, C., Rosenstein, M., Wiegers, T., Mattingly, C.J., 2011. Comparative toxicogenomics database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res. 37, D786–D792. Emerson, K.J., Merz, C.R., Catchen, J.M., Hohenlohe, P.A., Cresko, W.A., Bradshaw, W.E., Holzapfel, C.M., 2010. Resolving postglacial phylogeography using high-throughput sequencing. Proc. Natl Acad. Sci. USA 107, 16196–16200. Eyre, T.A., Ducluzeau, F., Sneddon, T.P., Povey, S., Bruford, E.A., Lush, M.J., 2006. The HUGO gene nomenclature database, 2006 updates. Nucleic Acids Res. 34, D319–D321. Fan, X., Lobenhofer, E.K., Chen, M., Shi, W., Huang, J., Luo, J., Zhang, J., Walker, S.J., Chu, T.M., Li, L., Wolfinger, R., Bao, W., Paules, R.S., Bushel, P.R., Li, J., Shi, T., Nikolskaya, T., Nikolsky, Y., Hong, H., Deng, Y., Cheng, Y., Fang, H., Shi, L., Tong, W., 2010. Consistency of predictive signature genes and classifiers generated using different microarray platforms. Pharmacogenomics J. 10, 247–257. Ferriere, R., Fox, G.A., 1995. Chaos and evolution. Trends Ecol. Evol. 10, 480–485. flymine.org [accessed 2 November, 2010] Fey, P., Gaudet, P., Pilcher, K.E., Franke, J., Chisholm, R.L., 2006. dictyBase and the dicty stock center. Methods Mol. Biol. 346, 51–74. Freedman, A.H., Thomassen, H.A., Buermann, W., Smith, T.B., 2010. Genomic signals of diversification along ecological gradients in a tropical lizard. Mol. Ecol. 19, 3773–3788. Garneau, M.E., Vincent, W.F., Alonso-Saez, L., Gratton, Y., Lovejoy, C., 2006. 
Prokaryotic community structure and heterotrophic production in a river-influenced coastal arctic ecosystem. Aquat. Microb. Ecol. 42, 27–40. geneontology.org [accessed 2 November, 2010]. Gerstein, M., Greenbaum, D., Cheung, K., Miller, P.L., 2007. An interdepartmental Ph.D. program in computational biology and bioinformatics: the Yale perspective. J. Biomed. Inform. 40, 73–79. Gilbert, D., Singan, V.R., Colbourne, J.K., 2005. wFleaBase: the Daphnia genomics information system. BMC Bioinform. 6, 45. Goethals, P.L.M., 2007. Special issue ‘Ecological informatics applications in water management’. Aquat. Ecol. 41, 371–372. Goetz, F., Rosauer, D., Sitar, S., Goetz, G., Simchick, C., Roberts, S., Johnson, R., Murphy, C., Bronte, C.R., Mackenzie, S., 2010. A genetic basis for the phenotypic differentiation between siscowet and lean lake trout (Salvelinus namaycush). Mol. Ecol. 19, 176–196. Goetz, F.W., MacKenzie, S., 2008. Functional genomics with microarrays in fish biology and fisheries. Fish Fish. 9, 378–395. Green, J.L., Hastings, A., Arzberger, P., Ayala, F.J., Cottingham, K.L., Cuddington, K., Davis, F., Dunne, J.A., Fortin, M.J., Gerber, L., Neubert, M., 2005. Complexity in ecology and conservation: Mathematical, statistical, and computational challenges. Bioscience 55, 501–510. Hale, S.S., Hollister, J.W., 2009. Beyond data management: how ecoinformatics can benefit environmental monitoring programs. Environ. Monit. Assess. 150, 227–235. Harris, R.A., Wang, T., Coarfa, C., Nagarajan, R.P., Hong, C., Downey, S.L., Johnson, B.E., Fouse, S.D., Delaney, A., Zhao, Y., Olshen, A., Ballinger, T., Zhou, X., Forsberg, K.J., Gu, J., Echipare, L., O'Geen, H., Lister, R., Pelizzola, M., Xi, Y., Epstein, C.B., Bernstein, B.E., Hawkins, R.D., Ren, B., Chung, W.-Y., Gu, H., Bock, C., Gnirke, A., Zhang, M.Q., Haussler, D., Ecker, J.R., Li, W., Farnham, P.J., Waterland, R.A., Meissner, A., Marra, M.A., Hirst, M., Milosavljevic, A., Costello, J.F., 2010. 
Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat. Biotechnol. 28, 1097–1105.
Haubold, B., Pfaffelhuber, P., Lynch, M., 2010. mlRho – a program for estimating the population mutation and recombination rates from shotgun-sequenced diploid genomes. Mol. Ecol. 19, 277–284. Hill, A.W., Guralnick, R.P., 2010. GeoPhylo: an online tool for developing visualizations of phylogenetic trees in geographic space. Ecography 33, 633–636. Hu, P., Greenwood, C.M.T., Beyene, J., 2007. Integrative analysis of gene expression data including an assessment of pathway enrichment for predicting prostate cancer. Cancer Inf. 2, 289–300. Iguchi, T., Watanabe, H., Katsu, Y., 2007. Toxicogenomics and ecotoxicogenomics for studying endocrine disruption and basic biology. Gen. Comp. Endocrinol. 153, 25–29. intermine.org [accessed 2 November, 2010]. International Union for Conservation of Nature and Natural Resources, 2010. http://www.iucnredlist.org/http://www.iucnredlist.org/apps/redlist/details/15421/02010 [accessed 2 November, 2010]. Irizarry, R.A., Wang, C., Zhou, Y., Speed, T.P., 2009. Gene set enrichment analysis made simple. Stat. Meth. Med. Res. 18, 565–575. Johnson, S.B., Friedman, R.A., 2007. Bridging the gap between biological and clinical informatics in a graduate training program. J. Biomed. Inform. 40, 59–66. Johnson, Z.I., Zinser, E.R., Coe, A., McNulty, N.P., Woodward, E.M.S., Chisholm, S.W., 2006. Niche partitioning among Prochlorococcus ecotypes along ocean-scale environmental gradients. Science 311, 1737–1740. Jones, M.B., Schildhauer, M.P., Reichman, O.J., Bowers, S., 2006. The new bioinformatics: integrating ecological data from the gene to the biosphere. Annu. Rev. Ecol. Evol. Syst. 37, 519–544. Kelling, S., Hochachka, W.M., Fink, D., Riedewald, M., Caruana, R., Ballard, G., Hooker, G., 2009. Data-intensive science: a new paradigm for biodiversity studies. Bioscience 59, 613–620. Kidd, D.M., Ritchie, M.G., 2006. Phylogeographic information systems: putting the geography into phylogeography. J.
Biogeogr. 33, 1851–1865. Kitano, H., 2002. Systems biology: a brief overview. Science 295, 1662–1664. Klaper, R., Thomas, M.A., 2004. At the crossroads of genomics and ecology: the promise of a canary on a chip. Bioscience 54, 403–412. Koopman, M.M., Fuselier, D.M., Hird, S., Carstens, B.C., 2010. The carnivorous pale pitcher plant harbors diverse, distinct, and time-dependent bacterial communities. Appl. Environ. Microbiol. 76, 1851–1860. Kottmann, R., Kostadinov, I., Duhaime, M.B., Buttigieg, P.L., Yilmaz, P., Hankeln, W., Waldmann, J., Glockner, F.O., 2010. Megx.net: integrated database resource for marine ecological genomics. Nucleic Acids Res. 38, D391–D395. Kronauer, D.J.C., Bergmann, P.J., Mercer, J.M., Russell, A.P., 2005. A phylogeographically distinct and deep divergence in the widespread Neotropical turnip-tailed gecko, Thecadactylus rapicauda. Mol. Phylogenet. Evol. 34, 431–437. Künstner, A., Wolf, J.B.W., Backström, N., Whitney, O., Balakrishnan, C.N., Day, L., Edwards, S.V., Janes, D.E., Schlinger, B.A., Wilson, R.K., Jarvis, E.D., Warren, W.C., Ellegren, H., 2010. Comparative genomics based on massive parallel transcriptome sequencing reveals patterns of substitution and selection across 10 bird species. Mol. Ecol. 19, 266–276. Labov, J.B., Reid, A.H., Yamamoto, K.R., 2010. Integrated biology and undergraduate science education: a new biology education for the twenty-first century? CBE Life Sci. Educ. 9, 10–16. Leinfelder, B., Tao, J., Costa, D., Jones, M.B., Servilla, M., O'Brien, M., Burt, C., 2010. A metadata-driven approach to loading and querying heterogeneous scientific data. Ecol. Inf. 5, 3–8. Longhi, M.L., Beisner, B.E., 2010. Patterns in taxonomic and functional diversity of lake phytoplankton. Freshw. Biol. 55, 1349–1366. 
Lyne, R., Smith, R., Rutherford, K., Wakeling, M., Varley, A., Guillier, F., Janssens, H., Ji, W.Y., McLaren, P., North, P., Rana, D., Riley, T., Sullivan, J., Watkins, X., Woodbridge, M., Lilley, K., Russell, S., Ashburner, M., Mizuguchi, K., Micklem, G., 2007. FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biol. 8, R129. doi:10.1186/gb-2007-8-7-r129. Manel, S., Poncet, B.N., Legendre, P., Gugerli, F., Holderegger, R., 2010. Common factors drive adaptive genetic variation at different spatial scales in Arabis alpina. Mol. Ecol. 19, 3824–3835. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., Gilad, Y., 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517. Marsico, T.D., Burt, J.W., Espeland, E.K., Gilchrist, G.W., Jamieson, M.A., Lindstrom, L., Roderick, G.K., Swope, S., Szucs, M., Tsutsui, N.D., 2010. Underutilized resources for studying the evolution of invasive species during their introduction, establishment, and lag phases. Evol. Appl. 3, 203–219. Matsumura, H., Yoshida, K., Luo, S.J., Kimura, E., Fujibe, T., Albertyn, Z., Barrero, R.A., Kruger, D.H., Kahl, G., Schroth, G.P., Terauchi, R., 2010. High-throughput SuperSAGE for digital gene expression analysis of multiple samples using next generation sequencing. PLoS ONE 5, 8. Medinger, R., Nolte, V., Pandey, R.V., Jost, S., Ottenwalder, B., Schlotterer, C., Boenigk, J., 2010. Diversity in a hidden world: potential and limitation of next-generation sequencing for surveys of molecular diversity of eukaryotic microorganisms. Mol. Ecol. 19, 32–40. megx.net [accessed 14 November 2010]. Metzker, M.L., 2010. Applications of next-generation sequencing technologies — the next generation. Nat. Rev. Genet. 11, 31–46. Michener, W.K., 2006. Meta-information concepts for ecological data management. Ecol. Inf. 1, 3–7. 
Michener, W.K., Baerwald, T.J., Firth, P., Palmer, M.A., Rosenberger, J.L., Sandlin, E.A., Zimmerman, H., 2001. Defining and unraveling biocomplexity. Bioscience 51, 1018–1023. Michener, W.K., Brunt, J.W., Helly, J., Kirchner, T.B., Stafford, S., 1995. Demystifying metadata. In: Gross, K.L., Pake, C.E., Allen, E., Bledsoe, C., Colwell, R., Dayton, P., Dethier, M., Helly, J., Holt, R., Morin, N., Michener, W., Pickett, S.T.A., Stafford, S.
(Eds.), Final Report of the Ecological Society of America Committee on the Future of Long-term Ecological Data (FLED): Volume I. Text of the Report. Ecological Society of America, Washington, DC, pp. 40–62. Morin, P.A., Archer, F.I., Foote, A.D., Vilstrup, J., Allen, E.E., Wade, P., Durban, J., Parsons, K., Pitman, R., Li, L., Bouffard, P., Nielsen, S.C.A., Rasmussen, M., Willerslev, E., Gilbert, M.T.P., Harkins, T., 2010. Complete mitochondrial genome phylogeographic analysis of killer whales (Orcinus orca) indicates multiple species. Genome Res. 20, 908–916. Mueller, L.A., Solow, T.H., Taylor, N., Skwarecki, B., Buels, R., Binns, J., Lin, C.W., Wright, M.H., Ahrens, R., Wang, Y., Herbst, E.V., Keyder, E.R., Menda, N., Zamir, D., Tanksley, S.D., 2005. The SOL Genomics Network. A comparative resource for solanaceae biology and beyond. Plant Physiol. 138, 1310–1317. National Research Council, 2009. A New Biology for the 21st Century: Ensuring the United States Leads the Coming Biology Revolution. National Academies Press, Washington, DC. accessed 2 November, 2010 www.nap.edu/catalog.php?record_id=12764. National Science Foundation Interdisciplinary Research (IDR) Program. http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503439 [accessed 2 November, 2010]. O'Connor, B.D., Day, A., Cain, S., Arnaiz, O., Sperling, L., Stein, L.D., 2008. GMODWeb: a web framework for the Generic Model Organism Database. Genome Biol. 9. O'Corry-Crowe, G., Taylor, B.L., Gelatt, T., Loughlin, T.R., Bickham, J., Basterretche, M., Pitcher, K.W., DeMaster, D.P., 2006. Demographic independence along ecosystem boundaries in Steller sea lions revealed by mtDNA analysis: implications for management of an endangered species. Can. J. Zool. 84, 1796–1809. O'Neil, S.T., Dzurisin, J.D.K., Carmichael, R.D., Lobo, N.F., Emrich, S.J., Hellmann, J.J., 2010. Population-level transcriptome sequencing of nonmodel organisms Erynnis propertius and Papilio zelicaon. BMC Genomics 11, 15. Page, R.D.M., 2008.
Towards realizing Darwin’s dream: setting the trees free. Nature Precedings. doi:10.1038/npre.2008.2217.1. Patterson, D.J., Faulwetter, S., Shipunov, A., 2008. Principles for a names-based cyberinfrastructure to serve all of biology. Zootaxa 153–163. Patterson, D.J., Cooper, J., Kirk, P.M., Pyle, R.L., Remsen, D.P., 2010. Names are key to the big new biology. Trends Ecol. Evol. 25 (12), 686–691. Pearson, M.L., Söll, D., 1991. The human genome project — a paradigm for information management in the life sciences. FASEB J. 5, 35–39. Perez-Gonzalez, J., Carranza, J., Torres-Porras, J., Fernandez-Garcia, J.L., 2010. Low heterozygosity at microsatellite markers in Iberian red deer with small antlers. J. Hered. 101, 553–561. Peterson, A.T., Knapp, S., Guralnick, R., Soberon, J., Holder, M.T., 2010. The big questions for biodiversity informatics. Syst. Biodivers. 8, 159–168. Pfrender, M.E., Hawkins, C.P., Bagley, M., Courtney, G.W., Creutzburg, B.R., Epler, J.H., Fend, S., Ferrington, L.C., Hartzell, P.L., Jackson, S., Larsen, D.P., Levesque, C.A., Morse, J.C., Petersen, M.J., Ruiter, D., Schindel, D., Whiting, M., 2010. Assessing macroinvertebrate biodiversity in freshwater ecosystems: advances and challenges in DNA-based approaches. Q. Rev. Biol. 85, 319–340. Remsen, D.P., Norton, C., Patterson, D.J., 2006. Taxonomic informatics tools for the electronic Nomenclator Zoologicus. Biol. Bull. 210, 18–24. Rockman, M.V., Kruglyak, L., 2006. Genetics of global gene expression. Nat. Rev. Genet. 7, 862–872. Sahinidis, N.V., Harandi, M.T., Heath, M.T., Murphy, L., Snir, M., Wheeler, R.P., Zukoski, C.F., 2005. Establishing a master's degree programme in bioinformatics: challenges and opportunities. Syst. Biol. 152, 269–275. Sawle, A.D., Wit, E., Whale, G., Cossins, A.R., 2010. An information-rich alternative, chemicals testing strategy using a high definition toxicogenomics and zebrafish (Danio rerio) embryos. Toxicol. Sci. 118, 128–139. 
Schnase, J.L., Cushing, J., Frame, M., Frondorf, A., Landis, E., Maier, D., Silberschatz, A., 2003. Information technology challenges of biodiversity and ecosystems informatics. Inf. Syst. 28, 339–345. Schnase, J.L., Cushing, J., Smith, J.A., 2007. Biodiversity and ecosystem informatics. J. Intell Inf. Syst. 29, 1–6. Severtson, D.J., Pape, L., Page Jr., C.D., Shavlik, J.W., Phillips Jr., G.N., Flatley Brennan, P., 2007. Biomedical informatics training at the University of Wisconsin-Madison. Yearb. Med. Inform. 149–156. Shaffer, H.B., Pauly, G.B., Oliver, J.C., Trenham, P.C., 2004. The molecular phylogenetics of endangerment: cryptic variation and historical phylogeography of the California tiger salamander, Ambystoma californiense. Mol. Ecol. 13, 3033–3049. Simon, R., 2008. Microarray-based expression profiling and informatics. Curr. Opin. Biotechnol. 19, 26–29. Slonim, D.K., Yanai, I., 2009. Getting started in gene expression microarray analysis. PLoS Comput. Biol. 5, 4. Smith, L., Greenfield, A., 2003. DNA microarrays and development. Hum. Mol. Genet. 12, R1–R8. Sork, V.L., Davis, F.W., Westfall, R., Flint, A., Ikegami, M., Wang, H.F., Grivet, D., 2010. Gene movement and genetic association with regional climate gradients in California valley oak (Quercus lobata Née) in the face of climate change. Mol. Ecol. 19, 3806–3823. Sork, V.L., Waits, L., 2010. INTRODUCTION: contributions of landscape genetics — approaches, insights, and future potential. Mol. Ecol. 19, 3489–3495. Steinberg, C.E., Sturzenbaum, S.R., Menzel, R., 2008. Genes and environment — striking the fine balance between sophisticated biomonitoring and true functional environmental genomics. Sci. Total Environ. 400, 142–161. Storfer, A., Murphy, M.A., Spear, S.F., Holderegger, R., Waits, L.P., 2010. Landscape genetics: where are we now? Mol. Ecol. 19, 3496–3514. 
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P., 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550.
Tamburini, B.A., Trapp, S., Phang, T.L., Schappa, J.T., Hunter, L.E., Modiano, J.F., 2009. Gene expression profiles of sporadic canine hemangiosarcoma are uniquely associated with breed. PLoS ONE 4, 12. Thomas, M.A., Klaper, R., 2004. Genomics for the ecological toolbox. Trends Ecol. Evol. 19, 439–445. Thomas, M.A., Yang, L., Carter, B.J., Klaper, R.D., forthcoming. Gene set enrichment analysis of microarray data from Pimephales promelas (Rafinesque), a nonmammalian model organism. BMC Genomics. Toon, A., Hughes, J.M., Joseph, L., 2010. Multilocus analysis of honeyeaters (Aves: Meliphagidae) highlights spatio-temporal heterogeneity in the influence of biogeographic barriers in the Australian monsoonal zone. Mol. Ecol. 19, 2980–2994. Van Aggelen, G., Ankley, G.T., Baldwin, W.S., Bearden, D.W., Benson, W.H., Chipman, J.K., Collette, T.W., Craft, J.A., Denslow, N.D., Embry, M.R., Falciani, F., George, S.G., Helbing, C.C., Hoekstra, P.F., Iguchi, T., Kagami, Y., Katsiadaki, I., Kille, P., Liu, L., Lord, P.G., McIntyre, T., O'Neill, A., Osachoff, H., Perkins, E.J., Santos, E.M., Skirrow, R.C., Snape, J.R., Tyler, C.R., Versteeg, D., Viant, M.R., Volz, D.C., Williams, T.D., Yu, L., 2010. Integrating omic technologies into aquatic ecological risk assessment and environmental monitoring: hurdles, achievements, and future outlook. Environ. Health Perspect. 118, 1–5 PubMed PMID: 20056575; PubMed Central PMCID: PMC2831950. van Mulligen, E.M., Cases, M., Hettne, K., Molero, E., Weeber, M., Robertson, K.A., Oliva, B., de la Calle, G., Maojo, V., 2008. Training multidisciplinary biomedical informatics students: three years of experience. J. Am. Med. Inform. Assoc. 15, 246–254. Vandegehuchte, M.B., Vandenbrouck, T., De Coninck, D., De Coen, W.M., Janssen, C.R., 2010. Gene transcription and higher-level effects of multigenerational Zn exposure in Daphnia magna. Chemosphere 80, 1014–1020. 
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., Gocayne, J.D., Amanatides, P., Ballew, R.M., Huson, D.H., Wortman, J.R., Zhang, Q., Kodira, C.D., Zheng, X.Q.H., Chen, L., Skupski, M., Subramanian, G., Thomas, P.D., Zhang, J.H., Miklos, G.L.G., Nelson, C., Broder, S., Clark, A.G., Nadeau, C., McKusick, V.A., Zinder, N., Levine, A.J., Roberts, R.J., Simon, M., Slayman, C., Hunkapiller, M., Bolanos, R., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Florea, L., Halpern, A., Hannenhalli, S., Kravitz, S., Levy, S., Mobarry, C., Reinert, K., Remington, K., Abu-Threideh, J., Beasley, E., Biddick, K., Bonazzi, V., Brandon, R., Cargill, M., Chandramouliswaran, I., Charlab, R., Chaturvedi, K., Deng, Z.M., Di Francesco, V., Dunn, P., Eilbeck, K., Evangelista, C., Gabrielian, A.E., Gan, W., Ge, W.M., Gong, F.C., Gu, Z.P., Guan, P., Heiman, T.J., Higgins, M.E., Ji, R.R., Ke, Z.X., Ketchum, K.A., Lai, Z.W., Lei, Y.D., Li, Z.Y., Li, J.Y., Liang, Y., Lin, X.Y., Lu, F., Merkulov, G.V., Milshina, N., Moore, H.M., Naik, A.K., Narayan, V.A., Neelam, B., Nusskern, D., Rusch, D.B., Salzberg, S., Shao, W., Shue, B.X., Sun, J.T., Wang, Z.Y., Wang,
A.H., Wang, X., Wang, J., Wei, M.H., Wides, R., Xiao, C.L., Yan, C.H., et al., 2001. The sequence of the human genome. Science 291, 1304. Wang, J., Qin, R., Ma, Y., Wu, H.Y., Peters, H., Tyska, M., Shaheen, N.J., Chen, X.X., 2009a. Differential gene expression in normal esophagus and Barrett's esophagus. J. Gastroenterol. 44, 897–911. Wang, L.G., Xi, Y.X., Yu, J., Dong, L.P., Yen, L.S., Li, W., 2010. A statistical method for the detection of alternative splicing using RNA-Seq. PLoS ONE 5, 8. Wang, Z., Gerstein, M., Snyder, M., 2009b. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63. Werner, T., 2010. Next generation sequencing in functional genomics. Brief. Bioinform. 11, 499–511. wfleabase.org/ [accessed 14 Nov 2010]. Williams, J.B., Poff, N.L., 2006. Informatics software for the ecologist's toolbox: a basic example. Ecol. Inf. 1, 325–329. Wolf, J.B.W., Bayer, T., Haubold, B., Schilhabel, M., Rosenstiel, P., Tautz, D., 2010. Nucleotide divergence vs. gene expression differentiation: comparative transcriptome sequencing in natural isolates from the carrion crow and its hybrid zone with the hooded crow. Mol. Ecol. 19, 162–175. Wolkenhauer, O., 2001. Systems biology: the reincarnation of systems theory applied in biology? Brief. Bioinform. 2, 258–270. Woodin, T., Carter, V.C., Fletcher, L., 2010. Vision and change in biology undergraduate education, a call for action–initial responses. CBE Life Sci. Educ. 9, 71–73. Wooten, J.A., Camp, C.D., Rissler, L.J., 2010. Genetic diversity in a narrowly endemic, recently described dusky salamander, Desmognathus folkertsi, from the southern Appalachian Mountains. Conserv. Genet. 11, 835–854. Yao, X., Liu, Y., Li, J., He, J., Frayn, C., 2006. Current developments and future directions of bio-inspired computation and implications for ecoinformatics. Ecol. Inf. 1, 9–22. Zarbl, H., Gallo, M.A., Glick, J., Yeung, K.Y., Vouros, P., 2010. The vanishing zero revisited: thresholds in the age of genomics.
Chem. Biol. Interact. 184, 273–278. Zdobnov, E.M., von Mering, C., Letunic, I., Torrents, D., Suyama, M., Copley, R.R., Christophides, G.K., Thomasova, D., Holt, R.A., Subramanian, G.M., Mueller, H.M., Dimopoulos, G., Law, J.H., Wells, M.A., Birney, E., Charlab, R., Halpern, A.L., Kokoza, E., Kraft, C.L., Lai, Z.W., Lewis, S., Louis, C., Barillas-Mury, C., Nusskern, D., Rubin, G.M., Salzberg, S.L., Sutton, G.G., Topalis, P., Wides, R., Wincker, P., Yandell, M., Collins, F.H., Ribeiro, J., Gelbart, W.M., Kafatos, F.C., Bork, P., 2002. Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science 298, 149–159. Zhang, J.G., Li, J., Deng, H.W., 2009. Identifying gene interaction enrichment for gene expression data. PLoS ONE 4.