Genomics for Key Players in the N Cycle


CHAPTER TWELVE

Genomics for Key Players in the N Cycle: From Guinea Pigs to the Next Frontier

Patrick S. G. Chain,*,†,1 Gary Xie,*,† Shawn R. Starkenburg,*,† Matthew B. Scholz,*,† Nicholas Beckloff,*,† Chien-Chi Lo,*,† Karen W. Davenport,*,† Krista G. Reitenga,*,† Hajnalka E. Daligault,*,† J. Chris Detter,*,† Tracey A. K. Freitas,*,† Cheryl D. Gleasner,*,† Lance D. Green,*,† Cliff S. Han,*,† Kim K. McMurry,*,† Linda J. Meincke,*,† Xiaohong Shen,*,† and Ahmet Zeytun*,†

Contents
1. Introduction: The Genomic Guinea Pigs
2. Want a Genome? Library Preparation First!
2.1. Sample and sequencer considerations
2.2. Creating libraries for different platforms
3. Sequencing a Genome from Start to Finish
3.1. Sequencing in the “Next-Gen” era
3.2. Assembling sequence reads
3.3. Genome closure and polishing
4. Après Sequencing
4.1. Annotation
4.2. Comparative analysis
5. Outlook: The Next Frontier
Acknowledgments
References

Abstract

While sequencing methods were available in the late 1970s, it was not until the human genome project and a significant influx of funds for such research that this technology became high throughput. The fields of microbiology and microbial ecology, among many others, have been tremendously impacted over the years, to such an extent that the determination of complete microbial genome sequences is now commonplace. Given the lower costs of next-generation sequencing platforms, even small laboratories from around the world will be able to generate millions of base pairs of data, equivalent to entire genomes' worth of sequence information. With this prospect just around the corner, it is timely to provide an overview of the genomics process: from sample preparation to some of the analytical methods used to gain functional knowledge from sequence information.

* Genome Biology Group, Los Alamos National Laboratory, Los Alamos, New Mexico, USA
† Microbial and Metagenomics Programs, Joint Genome Institute, Walnut Creek, California, USA
1 Corresponding author

Methods in Enzymology, Volume 496
ISSN 0076-6879, DOI: 10.1016/B978-0-12-386489-5.00012-9
© 2011 Elsevier Inc. All rights reserved.

1. Introduction: The Genomic Guinea Pigs

The first genome sequencing project of a free-living organism, Haemophilus influenzae, opened the door to a revolution in genome sciences (Fleischmann et al., 1995). In the 15 years since then, over 1000 bacterial and archaeal genomes have been deposited in genome repositories (Benson et al., 2010). While the initial focus was primarily on organisms of medical importance, the Department of Energy’s bold initiative to promote genomics of microbes of environmental importance has led to a multitude of advances in many different fields, and to the creation of the Joint Genome Institute to specifically tackle such environmentally important projects, including nitrifier genomics. While the major centers continue to dominate the sequencing landscape, next-generation sequencing (NGS) technologies have made the sequencing of genomes and metagenomes more commonplace, even in smaller centers and university laboratories, holding tremendous promise for every sphere of biological research.

Prior to the genomics era, knowledge of nitrifying microorganisms at the genetic and molecular level was rather limited. Given that nitrifiers are chemolithoautotrophs, most genetic investigations were focused only on a few key genes that enabled the use of ammonia and nitrite as energy sources (ammonia monooxygenase (amo), hydroxylamine oxidoreductase (hao), and nitrite oxidoreductase (nxr)). While targeted sequencing of ribosomal RNA and these key functional genes laid the foundation for understanding the distribution and abundance of ammonia- and nitrite-oxidizing microorganisms in natural systems, a more comprehensive understanding of the global molecular, biochemical, and physiological properties that define these microbes remained limited. The emergence of higher-throughput sequencing and automated annotation created new opportunities for many genome-based studies of nitrification.
In 2001, the ammonia oxidizer Nitrosomonas europaea (aka the Guinea Pig) was the first bacterial genome to be analyzed and published by the Joint Genome Institute (Chain et al., 2003). As sequencing capacity increased, more nitrification-centric genome projects were completed and, at present, most of the common nitrifier lineages have one or more sequenced representatives (Table 12.1). Many genome-based investigations followed (Klotz et al., 2006; Norton et al., 2008; Starkenburg et al., 2006, 2008; Stein et al., 2007; Walker et al., 2010), which yielded a number of novel discoveries about the molecular and evolutionary basis for nitrification, demonstrating the value and necessity of genome sequencing, annotation, and analysis.

Table 12.1 Genome projects for nitrifying microbes

| Organism | Habitat | Phylogeny(a) | Status (reference) | Sequence quality | Size (Mbp) |
|---|---|---|---|---|---|
| Ammonia oxidizers | | | | | |
| Nitrosomonas europaea | Sewage sludge | Beta | Complete (Chain et al., 2003) | Finished | 2.8 |
| Nitrosomonas sp. Is79A3 | Freshwater | Beta | In process (J. Norton, NP) | High-quality draft | ~3.7 |
| Nitrosomonas sp. AL212 | Sewage sludge | Beta | In process (J. Norton, NP) | High-quality draft | ~3.2 |
| Nitrosomonas marina C-113a | Marine | Beta | In process (M. Klotz, NP) | High-quality draft | ~3.6 |
| Nitrosomonas cryotolerans ATCC 49181 | Marine | Beta | To be completed (J. Norton, NP) | NA | NA |
| Nitrosomonas eutropha C91 | Sewage sludge | Beta | Complete (Stein et al., 2007) | Finished | 2.8 |
| Nitrosospira multiformis Surinam | Soil | Beta | Complete (Norton et al., 2008) | Finished | 3.2 |
| Nitrosospira briensis C-128 | Soil | Beta | To be completed (J. Norton, NP) | NA | NA |
| Nitrosococcus oceani AFC27 | Marine | Gamma | Complete (M. Klotz, NP) | Finished | ~3.5 |
| Nitrosococcus oceani ATCC 19707 | Marine | Gamma | Complete (Klotz et al., 2006) | Finished | 3.5 |
| Nitrosococcus halophilus | Marine | Gamma | Complete (M. Klotz, NP) | Finished | 4.1 |
| Nitrosococcus watsonii C-113 | Marine | Gamma | In process (M. Klotz, NP) | Finished | 3.4 |
| Nitrosopumilus maritimus | Marine | Thaum | Complete (Walker et al., 2010) | Finished | 1.6 |
| Cenarchaeum symbiosum A | Sponge symbiont | Thaum | Complete (Hallam et al., 2006) | Composite MF | 2.0 |
| Nitrosocaldus yellowstonii | Hot spring | Cren | In process (J. de la Torre, NP) | NA | NA |
| Nitrite oxidizers | | | | | |
| Nitrobacter winogradskyi Nb-255 | Soil | Alpha | Complete (Starkenburg et al., 2006) | Finished | 3.4 |
| Nitrobacter hamburgensis X14 | Soil | Alpha | Complete (Starkenburg et al., 2008) | Finished | 5.0 |
| Nitrobacter sp. Nb-311A | Marine | Alpha | Complete (J. Waterbury, NP) | High-quality draft | ~4.1 |
| Nitrococcus mobilis Nb-231 | Marine | Gamma | Complete (J. Waterbury, NP) | Finished | ~3.6 |
| Nitrospina sp. SCGC AAA288-L16 | Marine | Delta | To be completed (R. Stepanauskas, NP) | NA | NA |
| Candidatus Nitrospira defluvii | Sewage sludge | Nitrospirae | Complete (Lucker et al., 2010) | Composite MF | 4.3 |

(a) Alpha, alphaproteobacteria; Beta, betaproteobacteria; Gamma, gammaproteobacteria; Delta, deltaproteobacteria; Cren, crenarchaeota; Thaum, thaumarchaeota; NP, not published; NA, not available; Composite MF, composite metagenome finished (i.e., a composite genome of a population of like strains).

The source of DNA for the aforementioned genome sequencing projects came from pure cultures of common bacterial lab strains (Nitrosomonas, Nitrosospira, Nitrobacter), which were traditionally thought to be the major contributors to nitrification in natural systems. Targeted cloning and metagenomic sequencing of environmental samples forced a reassessment of this dogma. Ribosomal RNA surveys and shotgun sequencing of environmental samples demonstrated that many previously uncultivated nitrifiers play a significant role in both ammonia and nitrite oxidation in various habitats. For example, through metagenomic studies (Schleper et al., 2005; Treusch et al., 2004; Venter et al., 2004), we have learned that in addition to bacteria, members of the Crenarchaeota contribute to ammonia oxidation ubiquitously around the globe (Francis et al., 2005; Prosser and Nicol, 2008) and may even be a larger contributor to ammonia oxidation than bacteria in some environments (Leininger et al., 2006). Similarly, cultivation-independent methods have also revealed that Nitrospira is more abundant in wastewater treatment systems and some soil types than the classical nitrite oxidizer, Nitrobacter (Altmann et al., 2003; Daims et al., 2001; Freitag et al., 2005). Most recently, archaeal amoA and marine group I crenarchaeal 16S rRNA gene abundances were correlated with Nitrospina 16S rRNA gene abundance (Park et al., 2010; Santoro et al., 2010), suggesting a more important role for these understudied nitrite oxidizers as well.
Clearly, these discoveries illustrate the power of environmental metagenomics to decipher the functional importance and diversity of uncultured nitrifiers. Further genomic and metagenomic sequencing as well as comprehensive analysis of nitrifying communities are needed to help determine the ecological relevance of each nitrifier lineage. The application of NGS technologies can be brought to bear on such questions to gain access to the diversity present in most nitrifying environments. Traditional Sanger sequencing, used effectively to sequence many nitrifier genomes from pure cultures over the years, has become too expensive and time consuming compared with NGS, and does not provide enough sequence depth to adequately survey complex metagenomic environmental samples, or to efficiently sequence multiple new strains of the same species.

Herein, we present a technical overview of current sequencing methods used to prepare and sequence nitrifier genomic DNA samples, emphasizing NGS technologies. Furthermore, we also provide an overview of common tools and methods used to assemble, finish, annotate, and analyze microbial sequences (Fig. 12.1). Because the field of DNA sequencing is constantly reinventing itself, this overview is not meant to be comprehensive, but it will hopefully provide a taste of the benefits and limitations of utilizing NGS (and related methods and tools) to further research on microbial nitrification.

[Figure 12.1 workflow: Sample collection → DNA extraction → DNA fragmentation → 454 or Illumina library preparation → NGS sequencing → Data processing → Sequence assembly → Genome closure → Genome annotation → Comparative analysis]

Figure 12.1 Outline of genomics steps for nitrifiers. After sample collection, the genomic DNA of an isolate nitrifier is processed in the laboratory through extraction, fragmentation, genomic library preparation, and sequencing. Following these steps, the data are processed, assembled, and run through a series of finishing reactions and software to obtain a completed genome. Once finished, the genome is annotated and compared to other available nitrifier genomes.

2. Want a Genome? Library Preparation First!

2.1. Sample and sequencer considerations

Although having a target nitrifier or other organism in pure culture may be one of the more important steps in characterizing its genome, the genomic DNA sample quantity and quality continues to be extremely important for all sequencing processes, sometimes taking months to acquire. The ability to obtain high quality and sufficient quantities of sample for specific sequencing strategies can be challenging at times, particularly when the target organism is fastidious and genomic DNA recovery is poor or difficult. This is particularly relevant for assessing nitrifying communities in natural samples and even some pure cultures of nitrifiers, given their slow rates of growth, low biomass yields, and formation of extracellular polymeric substances that can house contaminants (Konneke et al., 2005; Lebedeva et al., 2005, 2008).


For traditional Sanger sequencing, high-molecular weight DNA is necessary for the construction of clone libraries of narrow insert size range. This is particularly true for larger insert libraries (>10 kb), which require minimal degradation and large amounts of starting material (10–20 µg of DNA). Degraded DNA is preferentially incorporated into cloning vectors and can ruin a library (Birren et al., 1997). The new requirements for constructing cloneless libraries for NGS platforms, such as Roche 454 and Illumina Genome Analyzer (and more recently HiSeq2000), are also quite substantial. Even though library fragment size is somewhat less important on these platforms (or rather, the requirement is to have smaller fragments amenable to the library construction and sequencing protocols), degraded samples often result in very poor-quality libraries (Roche, 2009b). The quantity requirement for starting material varies with the platform being used as well as with the type of sequencing being performed. In addition, the target library type (e.g., paired-end libraries) also dictates the sample quantity requirements up front. New methods for shearing DNA to smaller sizes, such as acoustic wave systems, have reduced sample requirements from 10 µg to 1–3 µg of starting material, due in part to the retention of most of the DNA within the sample (Kozarewa et al., 2009; Quail et al., 2008). Because such instruments are not capable of shearing genomic DNA in large size ranges (i.e., 10–20 kb), the advantages of these methods are lost in shearing large fragments for paired-end (“jumping”) libraries for current platforms, which thus require up to 10–15 µg. Recent advances by researchers and/or the availability of novel sample preparation kits have enabled NGS with as little as 5 ng of DNA or RNA (Wood et al., 2010). Another way to overcome low amounts of starting material is through whole genome amplification prior to library preparation.
Though biases in coverage may arise during this amplification process (e.g., some regions may be preferentially amplified over others; Pinard et al., 2006), the added throughput of NGS technologies can offset this issue and allow the recovery of the entire genome nonetheless. Thus, the coupling of such amplification strategies with NGS technologies is opening new avenues of genomic research for organisms where genomic DNA recovery is difficult and can help alleviate the challenges associated with nitrifying enrichment cultures.

2.2. Creating libraries for different platforms

With sufficient high-molecular weight purified genomic DNA in hand, a number of protocols exist to create a library of sequenceable fragments; these are highly dependent on the sequencing platform used (Fig. 12.1). Library preparation for traditional Sanger sequencing can take around 1 week to complete and consists of size selection (3, 8, and 40 kb are typical) of sheared fragments, followed by end repair and ligation into a sequencing vector. Transformation into Escherichia coli allows for selection of clones for storage of these random fragments for future use, which is a significant difference from NGS libraries.

A 454 shotgun library can be completed in 1 day following a slightly modified Roche protocol (Roche, 2009a). Fragmentation of genomic DNA occurs with all NGS procedures and can be achieved by any number of methods (nebulization, hydroshearing, sonication), although some new methods such as acoustic shearing may be advantageous for <1 kb fragments due to increased sample recovery. Following fragmentation, fragments anywhere from 500 to 800 bp are size selected by agarose gel electrophoresis and excision. A number of purification and quality control steps are carried out before and after end repair and ligation of library adaptors to these fragments. All primer dimers, short fragments, and fragments not ligated to both different adaptors are removed in a number of steps that result in single-stranded DNA fragments. These are diluted and immobilized onto DNA capture beads such that each bead carries a single unique library fragment. The capture beads are subjected to water-in-oil emulsion PCR amplification (Williams et al., 2006), resulting in several million copies of template fragment per bead. After breaking the emulsion, beads with amplified fragments are enriched by removing “null” beads, and the purified library beads can be loaded onto a 3.6 million well PicoTiterPlate for sequencing.

The library making process for 454 paired-end sequencing is substantially more demanding, requiring a greater amount of initial sample material and a far longer preparation time at the bench (3 days following a modified Roche method; Roche, 2009c). Following mechanical shearing to a designated size range, the end-repaired fragments are ligated to biotinylated adaptors with loxP sites, size selected, and gel purified. Cre recombinase is used for circularization of these linear molecules and plasmid-safe DNase is used to digest uncircularized fragments.
The resulting circular products carry a single biotinylated loxP adaptor sequence. These are sheared to a size of 500–600 bp, and end-repaired fragments carrying the biotinylated loxP adaptor are purified using streptavidin-coated magnetic beads. These bead-bound fragments undergo library adaptor ligation and PCR amplification, which also releases the templates into solution, followed by a final size selection of the fragments using AMPure size exclusion beads (Agencourt). Additional steps are included to remove the biotinylated strand; however, the remainder of the protocol proceeds as in the regular protocol to generate a library of amplified fragments on capture beads that can be loaded onto a PicoTiterPlate for sequencing.

Library creation for the Illumina Genome Analyzer can be completed in less than 1 day using Illumina’s library preparation kits and modified versions of the associated protocols (Illumina, 2008). Following genomic DNA shearing to 200–300 bp fragments, end repair is carried out to generate blunt-ended fragments. A 3′-end A-tailing step enables ligation of adaptors, the sequences of which vary depending on the library (single “fragment” reads, paired end, and multiplexing). Ligation products are then size selected and purified using either AMPure (Agencourt) magnetic beads or gel electrophoresis/excision. The final step in library preparation is the enrichment of DNA fragments via PCR with “tailed” primers that add sequences complementary to the oligonucleotides bound to the flow cell. When preparing indexed (or barcoded) libraries for multiplexed sequencing, a third primer is used to attach a unique “bar-coding” sequence. This PCR step serves to enrich for library fragments that contain adaptors ligated to both ends, a requirement for successful cluster generation on the flow cell. The library templates are denatured with NaOH, pooled if indexed, and diluted for hybridization to a flow cell. In order to generate sufficient fluorophore signal, templates are amplified on the flow cell to form a cluster of roughly 1000 template copies (Illumina, 2009).

The library template hybridization, amplification, linearization, blocking, and sequencing primer hybridization are performed on the flow cell using the Cluster Station. This is performed within 5–6 h using an onboard fluidics system with temperature control. First, the single-stranded library is hybridized to the flow cell, followed by “bridge” amplification using complementary oligonucleotides on the flow cell, where amplification products fold over such that the free end comes into contact with a complementary flow cell oligonucleotide and becomes bound to the surface (forming the “bridges”). The end result is a number of “clusters” of identical double-stranded library templates bound to the flow cell at both ends. Linearization then releases one end of each bridge amplicon from the surface of the flow cell by cleaving off one of the adaptors. A blocking group is then added to the free 3′-hydroxyl group to prevent nonspecific sequencing. Finally, the clusters are denatured to single-stranded templates (with one end remaining bound to the flow cell), and the sequencing primer is hybridized.
A typical flow cell for the Illumina platforms has millions of these library clusters spread across the surface (Table 12.2).

Greater detail on sample preparation for both 454 and Illumina, as well as protocols for other platforms not covered here, can be supplied by the vendor. A number of modifications to these protocols exist. As these two NGS technologies continue to mature, others are certain to encroach on the microbial sequencing space. It is clear that current NGS platforms require substantial man-hour investment in sample preparation. Future NGS platform winners will undoubtedly incorporate the majority of sample preparation on board the instruments, as part of the sequencing process.
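To illustrate how reads from a pooled, indexed run might be assigned back to their source libraries after sequencing, the following minimal Python sketch bins reads by their index sequence. The sample names, the 6-base index sequences, and the one-mismatch tolerance are illustrative assumptions and are not taken from the Illumina protocol; vendor software should be used for real data.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=1):
    """Assign each (index_seq, insert_seq) read to a sample.

    barcodes maps a hypothetical sample name to its index sequence.
    A read is assigned only if exactly one barcode lies within
    max_mismatches; otherwise it is binned as "undetermined".
    """
    bins = defaultdict(list)
    for index_seq, insert_seq in reads:
        hits = [
            sample
            for sample, barcode in barcodes.items()
            if len(barcode) == len(index_seq)
            and hamming(barcode, index_seq) <= max_mismatches
        ]
        sample = hits[0] if len(hits) == 1 else "undetermined"
        bins[sample].append(insert_seq)
    return dict(bins)


# Illustrative input: two pooled libraries and three reads.
example_barcodes = {"sample_A": "ATCACG", "sample_B": "CGATGT"}
example_reads = [
    ("ATCACG", "AAAA"),  # exact match to sample_A
    ("CGATGA", "CCCC"),  # one mismatch from sample_B
    ("GGGGGG", "TTTT"),  # matches neither index
]
print(demultiplex(example_reads, example_barcodes))
```

Tolerating a single mismatch is only safe when all index sequences in the pool differ from one another at several positions, which is why the sketch refuses to assign a read that is close to more than one barcode.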

3. Sequencing a Genome from Start to Finish

3.1. Sequencing in the “Next-Gen” era

The different sequencing platforms vary greatly in methodology, and understanding these differences is critical for correctly interpreting results. In addition, these platforms rely on different library construction processes and generate vastly different types of readouts and quantities of data (Table 12.2). Because NGS costs less per base pair, traditional Sanger sequencing has all but been replaced as the main throughput method by the newer NGS technologies. We therefore provide only a very cursory overview of the traditional Sanger method and focus mostly on the two currently dominant platforms for generating microbial genomic draft DNA sequence data.

Table 12.2 Sequencing platforms and associated attributes

| Sequencing technology | Read length (bp) | Approximate cost/Mbp | Maximum bases/run | Error rate | Potential biases/comments |
|---|---|---|---|---|---|
| Sanger | 600–1000 | $500 | 364 Kb | 10^-4–10^-5 | Cloning biases, random errors |
| 454 GS-FLX or GS-Junior | 400 | $10 / $20 | 450 Mb / 35 Mb | 10^-3–10^-4 | Biases at extreme %GC, homopolymer errors |
| Illumina GAIIX or HiSeq2000 | 75–125 / 100 | $0.10 / $0.05 | 95 Gb / 200 Gb | 10^-2–10^-3 | Biases at extreme %GC |
| SOLiD 5500 or 5500xl(a) | 75 | $0.04 | 105 Gb / 210 Gb | 10^-2–10^-3 | Biases at extreme %GC |
| Helicos Heliscope | 32 | $0.45–0.60 | 37 Gb | 10^-5 | Single molecule, very short reads |
| Polonator G.007 | 26 | <$0.45 | 12 Gb | 1:10^1–1:10^3 | Very short reads |
| Pacific Biosciences(a) | >900 | Unknown | 200 Mb | ~10^-1 | Single molecule, real time; strobe sequencing(b) |

(a) Estimates of output at release of instrument.
(b) Multiple linked reads, conceptually similar to paired-end reads.

Sanger sequencing relies on the detection of fluorescently labeled, dideoxynucleotide-terminated fragments that are generated in a DNA polymerase-catalyzed synthesis reaction, separated by size, and interrogated by a laser. Due to the ratio of the four labeled ddNTPs to regular dNTPs, and the random nature of incorporation by DNA polymerase, many fragments terminate at any given nucleotide. The detection of the nucleotide-specific fluorophore along with the size of the fragment allows the reconstruction (sequencing) of the template. High-throughput Sanger sequencing machines generate only up to 384 reads with an average read length of 800 bp within 8 h, a far cry from NGS platforms (Table 12.2).

454 pyrosequencing is a sequence-by-synthesis method that measures the release of pyrophosphate as a byproduct of DNA polymerase activity, using an enzyme cascade that ends with detection of luciferase chemiluminescent signals. The bead-based library (see Section 2.2), together with DNA polymerase, cofactors, and “packing” beads (that help keep the library beads in the wells), is loaded onto a PicoTiterPlate. The wells are manufactured to accommodate only a single library bead each, though not all wells contain beads. In addition, “enzyme beads,” which contain the necessary downstream enzymes for light emission, are added and covered with pyrophosphatase beads to help prevent chemical cross talk between the wells.
Once the PicoTiterPlate is loaded onto the sequencer, the fluidics system flows individual dNTPs in a fixed order across the wells with washing steps in between each dNTP, allowing for parallel sequencing of many templates that can incorporate the flowed nucleotide. Addition of one or more of the nucleotides is possible and the resulting signal is proportional to the number of nucleotides inserted. The resolution of greater than five or six nucleotide additions is problematic and results in so-called homopolymer errors (Table 12.2).

The Illumina GAIIX uses another sequence-by-synthesis method that produces short reads from the clusters generated by the library preparation (see above) and is driven by fluidics delivery of reagents and buffers directly on the flow cell. A DNA polymerase binds and remains bound to the primed, single-stranded templates throughout the process, as a mixture of fluorophore-labeled nucleotide analogs (all four nucleotides can be differentiated via their fluorophore) is flowed through the flow cell. Unincorporated nucleotides are washed away and the incorporated nucleotide is detected via fluorophore laser excitation and image capture and quantification of cluster brightness. The 3′ position of each nucleotide is modified with a reversible terminator, 3′-O-azidomethyl 2′-deoxynucleoside triphosphate, allowing only a single nucleotide to be incorporated into the growing complementary strand per “cycle.” This terminator, along with the fluorophore, is then cleaved to allow the addition of another nucleotide (cycle). Cleavage products are washed away and the next cycle of synthesis begins. Read lengths simply depend on the number of cycles allowed (Table 12.2). In multiplexed (barcoded) applications, depending on the method, the custom index is either sequenced as part of the first (or paired) read, or an additional set of cycles (equal to the length of the index) is initiated after denaturation of the first read and annealing of a specific index sequencing primer. For paired-end sequencing applications, the templates are flipped on the flow cell to expose the priming site for the second read. Hybridization of primer and cycles of nucleotide addition occur as in the first read. Due to the small fragments placed on the flow cell, unless special “jumping” libraries are created (similar to the 454 paired-end protocol), the two reads are generally spaced 100–400 bp apart.

A number of other technologies have surfaced since the introduction of 454 and Illumina sequencing; however, these two platforms are the dominant platforms used and are representative of their class of sequencing products. Specifics on Illumina, 454, and other platform products, output capabilities, and biases are listed in Table 12.2. Additional information can be gathered at vendor Web sites.
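The relationship between nucleotide flows and signal intensity in pyrosequencing, described earlier in this section, can be illustrated with a toy model. The Python sketch below computes an idealized flowgram for a read and then calls bases by rounding each signal to the nearest integer. The flow order "TACG", the function names, and the noise value of 5.4 are assumptions chosen for illustration; real instruments apply far more elaborate signal processing.

```python
def ideal_flowgram(read, flow_order="TACG"):
    """Return noise-free (nucleotide, signal) pairs for a read.

    Each flow delivers one nucleotide; the signal equals the number
    of consecutive bases of that type incorporated in that flow.
    """
    if set(read) - set(flow_order):
        raise ValueError("read contains bases that are never flowed")
    signals = []
    pos = 0
    flow = 0
    while pos < len(read):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Extend through the whole homopolymer run in a single flow.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, float(run)))
        flow += 1
    return signals


def call_bases(signals):
    """Convert signals back to sequence by rounding each intensity."""
    return "".join(base * int(round(signal)) for base, signal in signals)


read = "TTAGGGGGGC"
signals = ideal_flowgram(read)
print(signals)
print(call_bases(signals))  # the noise-free flowgram recovers the read

# Hypothetical noise: the 6-base G run is measured as 5.4 instead of 6.0,
# so rounding undercalls the homopolymer by one base.
noisy = [(b, 5.4 if s == 6.0 else s) for b, s in signals]
print(call_bases(noisy))
```

Because a fixed absolute error in signal becomes harder to distinguish as the run grows, the same rounding step that works well for single bases miscalls long runs, which is the intuition behind homopolymer errors.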

3.2. Assembling sequence reads

As no currently available sequencing technology is yet capable of sequencing an entire chromosome in a single pass, all genome sequencing projects require assembly of smaller sequenced reads into larger contiguous sequences (contigs). The method of assembly, however, depends on the sequencing technology used, or more specifically the size and number of reads (fold coverage), and on the availability of a sufficiently closely related genome for reference-based assembly. The screening of data for contaminants, vector or adaptor sequences, and low-quality regions prior to assembly has been a highly useful strategy when using only Sanger sequencing data, and provides superior results. The screening of contaminants and adaptor sequences continues to be beneficial for assemblies; however, less has been done with regard to screening low-quality NGS data, due in part to limited experience in estimating biases and errors with respect to quality scores. While screening low-quality bases at the ends or beginnings of reads has generally resulted in better assembly results for NGS data, sequence quality in regions of extremely high or low G + C content has been shown to be inferior, with respect to quality scores, to that of regions with a moderate G + C content.
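As a concrete illustration of screening low-quality bases at the beginnings and ends of reads, the following Python sketch trims both ends of a read until a base meets a quality threshold, and also reports the G + C fraction of the retained sequence. The threshold of Q20, the function names, and the example read are assumptions made for illustration; they do not represent a recommended setting for any particular platform.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Trim bases from both ends until one reaches min_q.

    seq is the read sequence and quals the per-base quality scores
    (same length). Internal low-quality bases are left untouched.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GCgc" for base in seq) / len(seq)


# Hypothetical read with poor-quality bases at both ends.
seq = "ACGTACGTAC"
quals = [5, 10, 30, 32, 35, 31, 30, 28, 12, 8]
trimmed_seq, trimmed_quals = trim_low_quality_ends(seq, quals)
print(trimmed_seq, trimmed_quals, gc_fraction(trimmed_seq))
```

Reporting G + C content alongside the trimmed read is one simple way to check whether reads lost to trimming are concentrated in regions of extreme base composition.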


For assembly with the use of a reference genome, much of the computational challenge is reduced to finding similar regions between the data and the reference, such as between closely related or evolved strains. However, these methods are less useful for regions that are divergent, and rarely extend into regions that are only present in the new dataset. This strategy therefore would best be used in conjunction with de novo assembly for divergent/unique regions.

While Sanger sequencing methods reigned, assembly of novel data was done using overlap algorithms that looked at alignments between reads. For example, the Phrap program (P. Green, Washington University) was long used for assembling Sanger reads into contigs even with low coverage (<10×). The lower costs of NGS have trumped the use of Sanger data for sequencing genomes; thus many new algorithms have recently emerged that are more adept at handling the substantially larger volume of data required for adequate assembly of the shorter read lengths. For 454 data, the Newbler assembler (Margulies et al., 2005) has been the primary program of choice, in part since it was developed explicitly for 454 pyrosequencing data. For short read (Illumina, SOLiD) data, a large number of assembly tools have now been developed (Table 12.3). The two predominant methods that have emerged either follow a more traditional all-by-all comparison to find overlaps and join reads together into contigs, or begin by reducing the complexity of the dataset by deconstructing reads into subsequences of length K (termed Kmers) and only tracking the overlaps between unique Kmers to build a large graph of possible contigs. The two methodologies have their own drawbacks. The “overlap” and associated methods require significant computational resources, which increase exponentially with the number of reads, and are also made more difficult with shorter read lengths.
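The Kmer-based strategy described above can be sketched in a few lines of Python. The example below decomposes reads into Kmers, builds a graph whose nodes are (K-1)-mers, and extends a contig from a chosen start node for as long as each node has exactly one successor. The reads, the value K = 4, and the function names are illustrative assumptions; production assemblers additionally handle reverse complements, sequencing errors, repeats, and paired-end information, none of which is modeled here.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping subsequences of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes.

    Duplicate Kmers from overlapping reads collapse into one edge,
    which is how the graph reduces the complexity of the dataset.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start while there is a single successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop around a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three hypothetical overlapping reads from the sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
print(walk_unambiguous(graph, "ATG"))
```

The walk stops at the first node with zero or several successors, which is where a repeat or a coverage gap would force a real assembler to break the contig or consult additional evidence.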
As more data are used for assembling genomes, assembly is increasingly performed using Kmer-based graph programs, despite their requirement for copious amounts of memory. In terms of accuracy, most assemblers function well unless there is a predominance of poor-quality data; however, their ability to adequately assemble reads into contigs is dependent on the program parameters used. Although the default settings may work well in some instances, sequencing fold coverage, repetitiveness of the genome, and length of reads all need to be taken into consideration. More specific details of the assembly methods and algorithms are reviewed more thoroughly elsewhere (Miller et al., 2010; Pop, 2009).

Despite recent advances in de novo genome assembly of NGS data, the majority of these newer short read assembly tools either cannot or do not adequately take advantage of the longer read lengths provided by Sanger or 454 sequencing. Similarly, pyrosequencing and Sanger sequencing assemblers were specifically designed for their respective sequencing methods. Therefore two solutions for combining data from different platforms have been implemented: (1) the direct merging of assemblies using software

Table 12.3 Next-generation sequencing assembly tools

Software

de Bruijn graphs ALLPATHS

EULER-SR

Velvet

SOAPdenovo

ABySS

Ray

Features

Disadvantages

Technologies

Read types recognized

Reference

Uses paired-end information Removes erroneous reads and corrects erroneous base calls Graph reduction and partition by scaffolds Uses paired-end information Removes erroneous reads and corrects erroneous base calls Graph construction by multiple values of Kmers, and graph reduction Uses paired-end information Removes erroneous reads Graph reduction Uses paired-end information Corrects erroneous base calls Graph reduction, and read partitioning Uses paired-end information Removes erroneous reads Distributed algorithm Uses paired-end information Distributed algorithm Hybrid assembly

Requires 40 coverage; requires specific libraries (including paired ends) Produces shorter contigs, less accurate than ALLPATHS and Velvet

Illumina

FASTA/QUAL

Butler et al. (2008)

454, Illumina

FASTA/QUAL

Chaisson et al. (2009)

Sanger, 454, Illumina, SOLiD Illumina

FASTA/ FASTQColor space FASTA/QUAL

Zerbino and Birney (2008)

Poor assembly quality compared to Velvet

Illumina

FASTA/QUAL

Simpson et al. (2009)

Fixed Kmer

454, Illumina

FASTA/ QUALFlow space

Boisvert et al. (2010)

RAM usage issues for larger datasets Sensitive to sequencing errors, produces shorter contigs

Li et al. (2010)

Overlap/consensus CABOG Uses paired-end information (Celera Removes erroneous reads and Assembler) corrects erroneous base calls, and robust to homopolymer run length uncertainty Graph reduction Edena Graph reduction

Specified input format

Sanger, 454, Illumina

FASTA/QUAL

Miller et al. (2008)

Illumina

FASTA/ FASTQ

Hernandez et al. (2008)

454

FASTA/ QUALFlow space FASTA/ QUALColor space

NA

Newbler

Uses paired-end information Graph reduction

Designed for unpaired reads of uniform length Does not work well for <150 bp

Shorty

Uses paired-end information Graph reduction

Required a few long reads as seeds

Illumina, SOLiD, Helicos

Hossain et al. (2009)

304

Patrick S. G. Chain et al.

dedicated to finding overlaps between large and small contigs and (2) adapting the results of one or more assemblies for use as input into one of the assemblers specifically designed for a different type of dataset (e.g., modifying the results of 454 or Illumina assemblies for input into Phrap). Therefore, the specific methodology used to assemble data for a given genome will be highly dependent on the amount (Newbler functions best with 35-fold coverage, while most short read assemblers function best with >100-fold coverage) and types (platform, paired ends) of reads from any given platform. Significant deviations from ideal coverage may result in either poor assemblies or even assembly failure; therefore, additional sequencing data are recommended for under-sequenced genomes, while reduction of data is recommended for oversequenced genomes. Complicating matters are the errors and associated ambiguities within the data (Table 12.2), which can result in many very small contigs or in highly similar contigs that differ only at erroneous bases. To mitigate the effects of sequencing errors, reads can either be trimmed for quality, or can be assembled and evaluated during the assembly process using quality scores as a metric. For nitrifier genomes, NGS assemblies can result in highly accurate contigs (with the possible exception of homopolymer mistakes when not using DNA polymerase terminators), the number of which depends heavily on the amount of data used, the input of paired-end reads capable of spanning repetitive regions, and the complexity of the target genome.
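The coverage guidance above lends itself to a quick calculation before an assembly is attempted. The sketch below is a minimal illustration, assuming that fold coverage is estimated as total sequenced bases divided by genome size. The genome size, read counts, and function names are hypothetical; only the 35-fold and 100-fold targets come from the text.

```python
def fold_coverage(num_reads, mean_read_length, genome_size):
    """Expected fold coverage: total sequenced bases divided by genome size."""
    return num_reads * mean_read_length / genome_size


def subsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep to reach the target (1.0 if already at or below it)."""
    if current_coverage <= 0:
        raise ValueError("coverage must be positive")
    return min(1.0, target_coverage / current_coverage)


genome_size = 3_000_000  # hypothetical 3 Mb genome

# Hypothetical short read run: 12 million reads of 75 bp
short_read_cov = fold_coverage(12_000_000, 75, genome_size)
keep_short = subsample_fraction(short_read_cov, 100)

# Hypothetical 454 run: 300,000 reads averaging 350 bp
pyro_cov = fold_coverage(300_000, 350, genome_size)
keep_pyro = subsample_fraction(pyro_cov, 35)
```

With these made-up numbers the short read dataset is over-sequenced relative to a 100-fold target, so roughly one third of the reads would be retained, whereas the 454 dataset already sits at the 35-fold target and is kept whole.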

3.3. Genome closure and polishing

Draft assemblies are often fragmented into numerous contigs and generally contain many issues and errors that are not readily addressed with assembly software, including unresolved repeated elements within a genome, misassemblies (generally at these repeats), gaps in these repeats as well as in unique regions, low-quality sequences, and sequencing errors. Repeat regions often constitute the largest burden in the finishing process, and nitrifier genomes are known to be riddled with repeated insertion elements, difficult tandem repeats, and duplications of key operons, such as the amoCAB gene cluster, involved in ammonia catabolism (Arp et al., 2007; Chain et al., 2003; Starkenburg et al., 2008). The task of “finishing” in order to completely close a genome addresses all of these issues: closing gaps, resolving misassemblies, and raising sequence quality with additional sequencing. Inspection of these regions can be manual or automated with the help of various software (most are center specific and custom built) and is most often performed using a combination of both strategies. Most genome centers develop their own specific finishing process, software, and tracking systems, custom built to complement center-specific sequencing processes, assembly tools, and finishing strategies.

When Sanger sequencing was the primary method for generating draft sequence, clones propagated in E. coli could be retrieved for targeted finishing reactions. With next-generation platforms, however, the requirement for clones is obviated, thus eliminating this resource. For regions without clones, a number of different PCR strategies are used to close gaps and raise sequence quality, with the resulting data incorporated into a new assembly. Finishing pipelines often require several rounds of this process to completely finish a microbial genome, depending on the genome complexity, as well as the amount and types of data used.

Whatever the sequencing technology, repetitive DNA has always caused the greatest havoc in finishing a genome. Without adequate mate pairs (paired reads of known or unknown distance from one another) that span repeats, or without software to take this information into account, the result is often misassemblies of some type. Complicating matters for finishing is the fact that for most new NGS technologies, paired-end sequencing methods are extremely difficult (see Section 2.2). Tandem repeats, in which a sequence is repeated end to end such that the total length exceeds that of a read or the span of paired reads, are the most difficult challenge, particularly if there is little variability between repeats and the total length is very large, making physical size estimates (e.g., running PCR products on a gel) difficult. Shorter reads exacerbate the problem of resolving repeats. Other costly and difficult regions include so-called hard-stops: regions where all reads appear to end abruptly, often due to secondary structure that impedes the sequencing process. Regions with exceptionally high GC content or highly palindromic sequences are often recalcitrant to most finishing reactions, including those with added DNA relaxants. One initial apprehension about using novel sequencing approaches was the fear of unreliable data in the form of high error rates and biases.
While the increased throughput allows for higher coverage and addresses in part any higher error rates, reproducible errors such as homopolymer base additions/deletions using pyrosequencing technology have required additional handling by finishers. These and other errors inherent in new technology processes (e.g., in emulsion PCR steps), or the low quality or lack of coverage found in some genomic regions, can all be addressed by complementing one of the NGS technologies with another (Aury et al., 2008; Goldberg et al., 2006). Many of these lessons in finishing are still being learned; thus finishing strategies and software are constantly being tailored to these NGS technologies.

In part due to these difficulties, the increased burden of finishing and closing more gaps, and the frenetic pace with which microbial genomes can now be sequenced, most sequencing centers have been forced to reevaluate their commitment to finishing the genomes they draft. This has in turn driven the formalization of new “categories” of genome products, from Draft, to High-Quality Draft, to Improved High-Quality Draft, to Noncontiguous Finished, to Finished (Chain et al., 2009). While Finished continues to be the gold standard, where there are no remaining gaps or misassemblies and the error rate is less than 1 in 100,000 bp, Draft now specifically means the raw assembled data from a sequencing procedure, with no additional processing. The additional categories cover various stages of genome improvement and are not meant to be technology specific. While the ideal target would be Finished for all genome projects, the costs associated with this step, particularly in comparison with draft genome sequencing, make this untenable. It is critical for downstream steps, however, to understand what level of finishing has been accomplished before performing and interpreting annotation and subsequent analysis.
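To put the Finished threshold of less than 1 error in 100,000 bp into more familiar units, it can be restated on the Phred quality scale. The Phred convention (Q = -10 log10 of the error probability) is widely used for sequencing quality values but is introduced here as an assumption, since the chapter does not define it. The 3 Mb genome size and the function names are likewise hypothetical.

```python
import math


def error_rate_to_phred(error_rate):
    """Phred-scaled quality: Q = -10 * log10(error probability)."""
    return -10.0 * math.log10(error_rate)


def phred_to_error_rate(q):
    """Inverse conversion: error probability for a Phred quality value."""
    return 10.0 ** (-q / 10.0)


def expected_errors(genome_size, error_rate):
    """Expected number of erroneous bases at a given per-base error rate."""
    return genome_size * error_rate


finished_rate = 1 / 100_000  # threshold quoted in the text
finished_q = error_rate_to_phred(finished_rate)
errors_in_genome = expected_errors(3_000_000, finished_rate)  # hypothetical 3 Mb genome
```

Under these assumptions the Finished threshold corresponds to a consensus quality of about Q50, or on the order of 30 residual errors in a 3 Mb genome.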

4. Après Sequencing

4.1. Annotation

Once one of the finishing levels is reached, the functional components of the genome are defined in order to assess metabolic and other functional capabilities as well as to provide higher-level comparisons (e.g., at the protein level). Annotation is essentially the extraction and assignment of biological knowledge to nucleotide sequences based on evolutionary principles. It is a multilevel process that can consist of prediction and labeling not only of genes but also of pseudogenes, promoter regions, RNA genes (rRNAs, tRNAs, and other small RNAs), and untranslated regions (Fig. 12.2). Further genomic details, such as origins of replication, mobile elements (including prophages), pathogenicity islands, etc., can also be detailed within annotation files, but this is not often done due to the scarcity of automated tools. Although gene finding has been automated, as has the annotation of called genes, all such tools are far from perfect. The difficulties in correcting gene calls and gene product functional assignments are beyond the scope of this overview and are partly covered elsewhere (e.g., Overbeek et al., 2007; Poptsova and Gogarten, 2010). Additional complications have arisen in the recent past from inadequate coupling of information on assembly quality with automated gene calls. For example, some NGS technologies result in inaccurate sequence in downstream assemblies due to reproducible errors (e.g., pyrosequencing has a well-known issue with homopolymeric regions, with reproducible small insertion or deletion errors), which can introduce apparent frameshifts into called genes.

There are two mainstream and complementary approaches to gene prediction, which is the first step in genome annotation (Fig. 12.2). Ab initio gene finding relies on intrinsic properties of DNA sequence (e.g., di-, tri-, and tetranucleotide composition) to facilitate discrimination between coding and noncoding regions.
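As a point of reference for what gene prediction must accomplish, the sketch below enumerates candidate open reading frames in all six reading frames. It is a deliberately naive baseline and not the method of any gene caller in Table 12.4a: it uses no compositional model, considers only ATG starts, and takes the first ATG in each frame. All names and the minimum-length parameter are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of ATG...stop stretches in one forward frame.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(seq, min_codons=2):
    """Scan all six reading frames and return (strand, start, end) tuples.

    Reverse-strand coordinates refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            hits.extend((strand, a, b) for a, b in orfs_in_frame(s, frame, min_codons))
    return hits


candidates = find_orfs("CCATGAAATAGCC")  # toy sequence, not real data
```

A practical gene finder would then score such candidates, for example by how closely their nucleotide composition resembles that of known coding regions, which is where the ab initio models come in.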
When ab initio prediction is combined with evidence-based gene prediction, where similarity to known genes is used for calling genes (e.g., using BLAST), the results are generally good in terms of identifying genes. Alone, evidence-based methods function poorly when interrogating genomes of novel lineages, and ab initio methods miss a number of genes with composition profiles different from the remainder of the genome (e.g., some regions of recent lateral gene transfer). A number of commonly used gene callers exist and are listed in Table 12.4a. In addition to identifying protein-coding genes, ribosomal RNA operons are generally found via similarity to known rRNA sequences, while transfer RNAs can be reliably identified using tRNAscan-SE (Lowe and Eddy, 1997). Other genomic features are not normally found or defined as part of automated pipelines, although there do exist a few implementable tools to do some of the required searches (Langille and Brinkman, 2009; Mantri and Williams, 2004).

Figure 12.2 Analyzing nitrifier genomes after sequencing is complete. Depending on the level of automation desired, there exist several possibilities and software packages for gene prediction, annotation, and whole genome analysis. [Flowchart not reproduced. Decision points: in-house pipeline wanted? independent programs wanted? reference genome available? Gene prediction and annotation options: Web-based gene prediction and annotation pipeline; installable prebuilt gene prediction and annotation pipeline (customizable); ab initio gene prediction; evidence-based gene prediction; transfer of gene predictions and annotation; customized annotation package. Analysis options: Web-based analysis package; locally installable package (customizable); custom analyses that may include any available tools, packages, and databases. Analysis topics: general features and repetitive elements; metabolic pathways, transporters, and protein secretion; promoter and regulon analysis; small RNAs, IS elements, prophages, and plasmids; comparative analysis and phylogenetic profiling; unique genes, SNPs, INDELs, pseudogenes, and reductive evolution analysis.]

Table 12.4 Software and Web servers for genome annotation

Name | URL | Reference
(a) Gene prediction and associated programs
CRITICA | http://www.ttaxus.com/software.html | Badger and Olsen (1999)
Prodigal | http://compbio.ornl.gov/prodigal | Hyatt et al. (2010)
Glimmer3 | http://www.cbcb.umd.edu/software/glimmer | Delcher et al. (2007)
Genemark | http://exon.biology.gatech.edu | Lukashin and Borodovsky (1998), Besemer et al. (2001)
MetaGene | http://metagene.cb.k.u-tokyo.ac.jp | Noguchi et al. (2006, 2008)
TMHMM | http://www.cbs.dtu.dk/services/TMHMM | Krogh et al. (2001)
SignalP | http://www.cbs.dtu.dk/services/SignalP | Bendtsen et al. (2004)
tRNAscan-SE | http://lowelab.ucsc.edu/tRNAscan-SE | Lowe and Eddy (1997)
(b) Annotation resources
COG | http://www.ncbi.nlm.nih.gov/COG | Tatusov et al. (1997), Tatusov et al. (2003)
eggnog | http://eggnog.embl.de | Jensen et al. (2008), Muller et al. (2010)
PHOG | http://phylofacts.berkeley.edu/orthologs | Datta et al. (2009)
Pfam | http://pfam.janelia.org | Finn et al. (2010)
TIGRfam | http://www.jcvi.org/cms/research/projects/tigrfams/overview | Haft et al. (2003)
KEGG | http://www.genome.jp/kegg | Kanehisa and Goto (2000), Kanehisa et al. (2010)
ProtClustDB | http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters | Klimke et al. (2009)
Interpro | http://www.ebi.ac.uk/interpro | Hunter et al. (2009)
GenBank Genomes | http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi | Cummings et al. (2002)
GO | http://www.geneontology.org/GO.downloads.annotations.shtml | Ashburner and Lewis (2002)
Prosite | http://www.expasy.ch/prosite | Sigrist et al. (2010)
STRING | http://string.embl.de | Jensen et al. (2009)
(c) Web-based annotation packages and pipelines
IMG-ER | http://img.jgi.doe.gov/cgi-bin/pub/main.cgi | Markowitz et al. (2009)
RAST | http://rast.nmpdr.org | Aziz et al. (2008)
JCVI | http://www.jcvi.org/cms/research/projects/annotation-service | NA
IGS | http://ae.igs.umaryland.edu/cgi/index.cgi | NA
xBASE | http://www.xbase.ac.uk/annotation | Chaudhuri et al. (2008)
BASys | http://basys.ca/basys/cgi/submit.pl | Van Domselaar et al. (2005)
Ergatis | http://ergatis.sourceforge.net | Orvis et al. (2010)
DIYA | http://gmod.org/wiki/DIYA | Stewart et al. (2009)
CG-Pipeline | http://wiki.jordan.biology.gatech.edu/index.php/CG-Pipeline | NA
PGAAP | http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html | NA

(a) Some widely used gene-finding and associated tools, (b) some popular databases for assigning product function/description, (c) some complete annotation packages and pipelines.

Once protein-coding genes have been found, a large number of similarity searches are generally performed in order to transfer “annotations” based on sequence similarity (e.g., product description, protein family membership, etc.). A number of databases and utilities for describing the products of protein-coding genes are listed in Table 12.4b. The coding regions found by gene prediction programs are searched against these resources and annotations are assigned based on a scoring threshold. This remains an imperfect process because each database maintains its own scoring matrices, many of the databases are populated with erroneous data, and there are often contradictions between annotation assignments for the same gene, making the selection process more difficult.

While every annotation pipeline differs in a number of aspects, many of the databases and the methods of query are similar; thus here we provide a brief overview of the process and provide options for using or building an annotation pipeline. Because annotation is a multistep process (Fig. 12.2) that requires the integration of a large number of tools for specific genomic features, the lack of interoperability among tools for gene finding and for assignment of functions and other descriptions, and the limited ability to visualize such annotations and associated evidence, have hampered many independent annotation efforts. Although past efforts were strictly within large genome centers, a number of publicly available tools or resources have surfaced (Table 12.4c) that begin to encompass three critical annotation components: (1) an integrated set of tools and information-rich databases for gene calling and function assignment; (2) a data management system and repository for storing and tracking annotations; and (3) an interactive graphical user interface for presentation of results, evidence, and context. As costs continue to decrease and sequencing becomes increasingly democratized, a growing number of investigators will rely on such pipelines for annotation of their favorite genomes.
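The threshold-based assignment of annotations described earlier in this section can be sketched as follows. This is a minimal illustration under a strong simplifying assumption, namely that scores from different resources are directly comparable, which the text notes is not generally the case. The class, the function, the score cutoff, and the example hits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    gene: str         # query gene identifier
    database: str     # resource that produced the hit (illustrative labels)
    description: str  # product description carried by the database entry
    bit_score: float  # assumed comparable across resources (a simplification)


def assign_annotations(hits, min_score):
    """Keep the best-scoring hit per gene, ignoring hits below min_score.

    Returns {gene: (description, [conflicting descriptions from other hits])}.
    Genes with no hit at or above the threshold are left unannotated.
    """
    passing = {}
    for hit in hits:
        if hit.bit_score >= min_score:
            passing.setdefault(hit.gene, []).append(hit)
    result = {}
    for gene, gene_hits in passing.items():
        gene_hits.sort(key=lambda h: h.bit_score, reverse=True)
        best = gene_hits[0]
        conflicts = sorted({h.description for h in gene_hits[1:]} - {best.description})
        result[gene] = (best.description, conflicts)
    return result


# Hypothetical search results for two genes
hits = [
    Hit("gene1", "Pfam", "ammonia monooxygenase subunit A", 310.0),
    Hit("gene1", "COG", "methane monooxygenase subunit A", 250.0),
    Hit("gene2", "KEGG", "hypothetical protein", 35.0),
]
annotations = assign_annotations(hits, min_score=50.0)
```

Reporting the conflicting descriptions alongside the chosen one keeps the contradictions between resources visible to a curator instead of silently discarding them.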
A few user-friendly Web-based platforms exist that only require formatted short reads, contigs, or chromosomes as input and provide reusable analytical pipelines that help users search a number of resources, execute downstream analyses (e.g., phylogenetic analysis), and visualize and store the results (Table 12.4c). A recent study compared the annotation capabilities of IMG-ER (Markowitz et al., 2009), RAST (Aziz et al., 2008; Meyer et al., 2008), and JCVI (http://www.jcvi.org/cms/research/projects/annotation-service/), and while all three were comparable in their abilities to identify genes (though certainly not identical), none of them provided an all-inclusive package for genome annotation and analysis (Bakke et al., 2009). Gene calling and annotation standards, such as those proposed for genome finishing, would certainly provide a more solid foundation for future annotation efforts. In the meantime, it may not always be obvious which platform to adopt, since metrics for annotation quality are not well defined. Some of the drawbacks of relying on Web services include turnaround time for annotation and the lack of flexibility in terms of programs, databases, and threshold cutoffs, among others.


As an alternative to Web services, there exist several semi-stand-alone pipeline packages or workflow systems available for local installation (Table 12.4c). The major drawback of this approach is the knowledge and expertise necessary to integrate the various components and auxiliary tools within the system, as well as hardware requirements. However, for laboratories that conduct bioinformatics research, it may be more beneficial to install a highly flexible pipeline that can integrate gene calling, annotation, analysis, and tools to link to specialized databases that do not necessarily apply to all projects. A number of packages are available through open-source licenses and are composed of multiple gene prediction and protein annotation programs that can be linked together within a computational pipeline (Table 12.4c). As the number of genomes and genome annotations continues to increase, and at a more rapid pace, it has become imperative to provide a reliable resource for future use in comparative genomics, phylogenetics and molecular evolution studies, and, ironically, better annotations.

4.2. Comparative analysis

While a thorough annotation makes use of comparisons with known proteins or protein features, a more detailed comparison is often sought after sequencing. The specific type or level of comparison is generally highly project specific, but here we discuss a number of ways to perform genome comparisons and why such approaches may be useful (Fig. 12.2). For nitrifier genomes, because a number of lineages have already been sequenced, many options are possible. It is still likely that a number of undiscovered lineages exist, and thus a number of future genome projects will have no near-neighbor reference genome to guide analyses. In the cases where no close relative’s genome is available, protein-based comparisons are used. Ideally, comparisons would be made on similarly annotated genomes (i.e., using similar or identical platforms/methods). When comparing genomes, the most interesting aspects are generally what is common or distinct between them; thus, finding orthologs (often using a bidirectional best BLAST hit approach) is one of the first objectives. One can define candidate orthologs, unique proteins, as well as potential gene family expansions (paralogs). Physiological and phenotypic capabilities are often “transferred” from the reference genome to the newly sequenced genome if the pathway components are present. Such transfers often facilitate detailed comparisons by allowing fast and accurate verification and validation of annotations and pathways obtained with automated annotation (see Section 4.1 above). In a similar fashion, by examining differences in protein profiles, inactive pathways or novel functions may be found. In the rare cases where a nitrifier is so novel that no direct comparisons can be made, several de novo pathway inference and metabolic reconstruction approaches can be applied.
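The bidirectional best hit approach mentioned above can be sketched in a few lines. The example assumes that similarity scores from two reciprocal searches (genome A against genome B, and B against A) have already been parsed into dictionaries; the gene identifiers, scores, and function names are hypothetical, and no particular BLAST output format is implied.

```python
def best_hits(scores):
    """For each query, return the subject with the highest score.

    scores: {(query, subject): score}. Ties keep the first subject seen.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def unpaired_genes(all_genes, pairs, index):
    """Genes absent from the ortholog pairs (index 0 for genome A, 1 for B)."""
    paired = {pair[index] for pair in pairs}
    return sorted(set(all_genes) - paired)


# Hypothetical reciprocal search scores
a_vs_b = {("a1", "b1"): 500, ("a1", "b2"): 120, ("a2", "b2"): 300, ("a3", "b2"): 310}
b_vs_a = {("b1", "a1"): 495, ("b2", "a3"): 305, ("b2", "a2"): 298, ("b3", "a1"): 90}

orthologs = bidirectional_best_hits(a_vs_b, b_vs_a)
only_in_a = unpaired_genes(["a1", "a2", "a3"], orthologs, 0)
only_in_b = unpaired_genes(["b1", "b2", "b3"], orthologs, 1)
```

In this toy case a2 hits b2 well but is not its reciprocal best hit, so it is left unpaired. Such genes are the ones to inspect further as candidate paralogs or unique proteins.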


If the newly sequenced genome has a close relative, more direct methods can be applied. Whole genome alignments can be performed using a number of different algorithms, though tBLASTx is among the most sensitive for comparing genome sequences, since it translates DNA sequence into all six reading frames of both reference and query before looking for protein similarities. This also has the advantage of not relying on annotated features, which may differ between genomes despite identical sequence. By comparing presence and absence of genes among all nitrifiers (phylogenetic profiling), it may be possible to discover divergences in metabolic pathways that do not correlate with phylogeny. For the ever-elusive hypothetical and conserved hypothetical genes, which have no known function, a functional prediction may be ascribed by focusing on gene-to-gene proximity, or a gene’s genomic neighborhood, since genes with related functions (e.g., enzymes of the same metabolic pathway) tend to cluster together on the chromosome. For highly similar genomes, other tools such as MUMmer (Kurtz et al., 2004) can be used to quickly align entire genomes. By processing the output data, genome rearrangements, insertions/deletions, and even single nucleotide polymorphisms (SNPs) can be derived in terms of precise coordinates and resulting effect(s) at the protein and pathway levels. Such detailed analyses will also reveal the movement of mobile elements, phage-induced deletions, and genomic rearrangements, as well as unique restriction modification systems. These could imply evolutionary pressures that led to species- or type-specific differences in physiologies. As is common in the field of genomics, the results of one stage of analysis are simply the starting point for a myriad of other inquiries; the same is true for comparative analysis, which highlights those regions of the genome that may be further interrogated in order to better understand specific functions.
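To illustrate how SNPs and small insertions/deletions can be reported with precise coordinates, the sketch below walks along a pair of already aligned sequences. It assumes that the alignment is supplied as two equal-length strings with gaps written as “-”; it does not parse the output of MUMmer or any other aligner, and the function name and example sequences are hypothetical.

```python
def call_variants(ref_aln, qry_aln):
    """List differences between two aligned sequences (gaps written as '-').

    Returns tuples (kind, ref_position, ref_base, query_base) with 1-based
    reference coordinates. An insertion is reported at the last reference
    base before it (0 if it precedes the first reference base).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r != "-":
            ref_pos += 1  # advance only on real reference bases
        if r == q:
            continue
        if r == "-":
            variants.append(("insertion", ref_pos, "-", q))
        elif q == "-":
            variants.append(("deletion", ref_pos, r, "-"))
        else:
            variants.append(("SNP", ref_pos, r, q))
    return variants


# Toy alignment (hypothetical): one insertion, one deletion, one SNP
variants = call_variants("ATGC-TAGGA", "ATGCCTA-GT")
```

Each reported coordinate can then be intersected with annotated gene positions to ask whether a change falls in a coding region and what its effect on the protein would be.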

5. Outlook: The Next Frontier

Many genomics methods to sequence nitrifier and other microbial genomes are currently in use, and due to improvements in cost, the output is increasing at a phenomenal pace. New sequencing technologies are already being used (e.g., Table 12.2), and more are yet to enter the sequencing market. All of these NGS technologies hold great promise for de novo microbial sequencing, as well as genome resequencing (i.e., sequencing closely related strains of the same species). While this type of “pan-genomic” effort (Medini et al., 2005) has not yet been applied to nitrifiers, the methods to quickly use reference genomes for the mapping of sequencing data have been well established (Li and Homer, 2010). As sequencing continues to become less expensive and faster, the bottleneck has shifted to obtaining sufficient biomass for adequate sample preparation. For nitrifiers, this often remains a difficult issue to resolve. Fortunately, NGS technologies and their associated throughput, combined with new developments in assembly of such vast amounts of data, have ushered in a new era of metagenomics, whereby entire communities can be sequenced (Qin et al., 2010). For some communities dominated by nitrifiers, such as in some sewage treatment plants, this approach could help delineate the genomes of entire populations of nitrifiers. A future focus for nitrifier research may thus be outside the confined context of the laboratory, in the natural environment.

Other technologies such as flow cytometry may be coupled with advances in molecular amplification of minute quantities of nucleic acids and NGS to overcome sample preparation difficulties or to tackle nitrifier populations that play an important environmental role yet are not dominant within mixed communities. Application of single cell/NGS technologies to nitrification is already underway: researchers at the Bigelow Laboratory for Ocean Sciences will be amplifying genomic DNA from an individual cell of a newly discovered marine nitrite oxidizer, Nitrospina sp. SCGC AAA288L16, for sequencing by the JGI (http://www.jgi.doe.gov/sequencing/why/mesopelagic-bacterioplankton.html). In addition, studies of the diversity of ammonia oxidizers that have used amo gene-targeted PCR (Norton et al., 2002; Ward and O’Mullan, 2002) are beginning to be adapted to the much higher-throughput NGS platforms using analytical methods originally designed to accommodate classification of 16S rRNA sequences. As new technologies continue to unfold, and as democratization of sequencing increases, it is inevitable that more and more laboratories will be performing one or a combination of the above methods, applied to one or more of the above samples, in order to gain greater insight into the nitrification process and the microbes that perform this essential function.

ACKNOWLEDGMENTS

We thank all other B6 members for their contributions to the establishment of standardized methods, development of software and processes, and genome projects described in this chapter. This study was supported in part by the U.S. Department of Energy Joint Genome Institute through the Office of Science of the U.S. Department of Energy under Contract Number DE-AC02-05CH11231 and grants from NIH (Y1-DE-6006-02), the U.S. Department of Homeland Security under contract number HSHQDC08X00790, and the U.S. Defense Threat Reduction Agency under contract numbers B104153I and B084531I.

REFERENCES

Altmann, D., Stief, P., Amann, R., De Beer, D., and Schramm, A. (2003). In situ distribution and activity of nitrifying bacteria in freshwater sediment. Environ. Microbiol. 5, 798–803. Arp, D. J., Chain, P. S., and Klotz, M. G. (2007). The impact of genome analyses on our understanding of ammonia-oxidizing bacteria. Annu. Rev. Microbiol. 61, 503–528.


Ashburner, M., and Lewis, S. (2002). On ontologies for biologists: The Gene Ontology—Untangling the web. Novartis Found. Symp. 247, 66–80, discussion 80–63, 84–90, 244–252. Aury, J. M., Cruaud, C., Barbe, V., Rogier, O., Mangenot, S., Samson, G., et al. (2008). High quality draft sequences for prokaryotic genomes using a mix of new sequencing technologies. BMC Genomics 9, 603. Aziz, R. K., Bartels, D., Best, A. A., DeJongh, M., Disz, T., Edwards, R. A., et al. (2008). The RAST Server: Rapid annotations using subsystems technology. BMC Genomics 9, 75. Badger, J. H., and Olsen, G. J. (1999). CRITICA: Coding region identification tool invoking comparative analysis. Mol. Biol. Evol. 16, 512–524. Bakke, P., Carney, N., Deloache, W., Gearing, M., Ingvorsen, K., Lotz, M., et al. (2009). Evaluation of three automated genome annotations for Halorhabdus utahensis. PLoS One 4, e6291. Bendtsen, J. D., Nielsen, H., von Heijne, G., and Brunak, S. (2004). Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340, 783–795. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Sayers, E. W. (2010). GenBank. Nucleic Acids Res. 38, D46–D51. Besemer, J., Lomsadze, A., and Borodovsky, M. (2001). GeneMarkS: A self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607–2618. Birren, B., Green, E. D., Klapholz, S., Myers, R. M., Riethman, H., and Roskams, J. (1997). Bacterial artificial chromosomes. In “Genome Analysis: A Laboratory Manual,” (B. Birren, ed.) Vol. 3, pp. 242–295. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. Boisvert, S., Laviolette, F., and Corbeil, J. (2010). Ray: Simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J. Comput. Biol. 17, 1519–1533. Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I. A., Belmonte, M. K., Lander, E. S., et al. (2008).
ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810–820. Chain, P., Lamerdin, J., Larimer, F., Regala, W., Lao, V., Land, M., et al. (2003). Complete genome sequence of the ammonia-oxidizing bacterium and obligate chemolithoautotroph Nitrosomonas europaea. J. Bacteriol. 185, 2759–2773. Chain, P. S., Grafham, D. V., Fulton, R. S., Fitzgerald, M. G., Hostetler, J., Muzny, D., et al. (2009). Genomics. Genome project standards in a new era of sequencing. Science 326, 236–237. Chaisson, M. J., Brinza, D., and Pevzner, P. A. (2009). De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res. 19, 336–346. Chaudhuri, R. R., Loman, N. J., Snyder, L. A., Bailey, C. M., Stekel, D. J., and Pallen, M. J. (2008). xBASE2: A comprehensive resource for comparative bacterial genomics. Nucleic Acids Res. 36, D543–D546. Cummings, L., Riley, L., Black, L., Souvorov, A., Resenchuk, S., Dondoshansky, I., and Tatusova, T. (2002). Genomic BLAST: Custom-defined virtual databases for complete and unfinished genomes. FEMS Microbiol. Lett. 216, 133–138. Daims, H., Nielsen, J. L., Nielsen, P. H., Schleifer, K. H., and Wagner, M. (2001). In situ characterization of Nitrospira-like nitrite-oxidizing bacteria active in wastewater treatment plants. Appl. Environ. Microbiol. 67, 5273–5284. Datta, R. S., Meacham, C., Samad, B., Neyer, C., and Sjolander, K. (2009). Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Res. 37, W84–W89. Delcher, A. L., Bratke, K. A., Powers, E. C., and Salzberg, S. L. (2007). Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23, 673–679. Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., et al. (2010). The Pfam protein families database. Nucleic Acids Res. 38, D211–D222.


Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512.
Francis, C. A., Roberts, K. J., Beman, J. M., Santoro, A. E., and Oakley, B. B. (2005). Ubiquity and diversity of ammonia-oxidizing archaea in water columns and sediments of the ocean. Proc. Natl. Acad. Sci. USA 102, 14683–14688.
Freitag, T. E., Chang, L., Clegg, C. D., and Prosser, J. I. (2005). Influence of inorganic nitrogen management regime on the diversity of nitrite-oxidizing bacteria in agricultural grassland soils. Appl. Environ. Microbiol. 71, 8323–8334.
Goldberg, S. M., Johnson, J., Busam, D., Feldblyum, T., Ferriera, S., Friedman, R., et al. (2006). A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc. Natl. Acad. Sci. USA 103, 11240–11245.
Haft, D. H., Selengut, J. D., and White, O. (2003). The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373.
Hallam, S. J., Konstantinidis, K. T., Putnam, N., Schleper, C., Watanabe, Y., Sugahara, J., et al. (2006). Genomic analysis of the uncultivated marine crenarchaeote Cenarchaeum symbiosum. Proc. Natl. Acad. Sci. USA 103, 18296–18301.
Hernandez, D., Francois, P., Farinelli, L., Osteras, M., and Schrenzel, J. (2008). De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res. 18, 802–809.
Hossain, M. S., Azimi, N., and Skiena, S. (2009). Crystallizing short-read assemblies around seeds. BMC Bioinform. 10(Suppl. 1), S16.
Hunter, S., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., et al. (2009). InterPro: The integrative protein signature database. Nucleic Acids Res. 37, D211–D215.
Hyatt, D., Chen, G. L., Locascio, P. F., Land, M. L., Larimer, F. W., and Hauser, L. J. (2010). Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119.
Illumina (2008). Preparing Samples for Sequencing Genomic DNA. Illumina, Inc., San Diego, CA.
Illumina (2009). Cluster Station Operations Guide. Illumina, Inc., San Diego, CA.
Jensen, L. J., Julien, P., Kuhn, M., von Mering, C., Muller, J., Doerks, T., and Bork, P. (2008). eggNOG: Automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 36, D250–D254.
Jensen, L. J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., et al. (2009). STRING 8—A global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 37, D412–D416.
Kanehisa, M., and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.
Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., and Hirakawa, M. (2010). KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355–D360.
Klimke, W., Agarwala, R., Badretdin, A., Chetvernin, S., Ciufo, S., Fedorov, B., et al. (2009). The National Center for Biotechnology Information’s Protein Clusters Database. Nucleic Acids Res. 37, D216–D223.
Klotz, M. G., Arp, D. J., Chain, P. S., El-Sheikh, A. F., Hauser, L. J., Hommes, N. G., et al. (2006). Complete genome sequence of the marine, chemolithoautotrophic, ammonia-oxidizing bacterium Nitrosococcus oceani ATCC 19707. Appl. Environ. Microbiol. 72, 6299–6315.
Konneke, M., Bernhard, A. E., de la Torre, J. R., Walker, C. B., Waterbury, J. B., and Stahl, D. A. (2005). Isolation of an autotrophic ammonia-oxidizing marine archaeon. Nature 437, 543–546.


Kozarewa, I., Ning, Z., Quail, M. A., Sanders, M. J., Berriman, M., and Turner, D. J. (2009). Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat. Methods 6, 291–295.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305, 567–580.
Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. L. (2004). Versatile and open software for comparing large genomes. Genome Biol. 5, R12.
Langille, M. G., and Brinkman, F. S. (2009). IslandViewer: An integrated interface for computational identification and visualization of genomic islands. Bioinformatics 25, 664–665.
Lebedeva, E. V., Alawi, M., Fiencke, C., Namsaraev, B., Bock, E., and Spieck, E. (2005). Moderately thermophilic nitrifying bacteria from a hot spring of the Baikal rift zone. FEMS Microbiol. Ecol. 54, 297–306.
Lebedeva, E. V., Alawi, M., Maixner, F., Jozsa, P. G., Daims, H., and Spieck, E. (2008). Physiological and phylogenetic characterization of a novel lithoautotrophic nitrite-oxidizing bacterium, ‘Candidatus Nitrospira bockiana’. Int. J. Syst. Evol. Microbiol. 58, 242–250.
Leininger, S., Urich, T., Schloter, M., Schwark, L., Qi, J., Nicol, G. W., et al. (2006). Archaea predominate among ammonia-oxidizing prokaryotes in soils. Nature 442, 806–809.
Li, H., and Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 11, 473–483.
Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., et al. (2010). De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272.
Lowe, T. M., and Eddy, S. R. (1997). tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.
Lucker, S., Wagner, M., Maixner, F., Pelletier, E., Koch, H., Vacherie, B., et al. (2010). A Nitrospira metagenome illuminates the physiology and evolution of globally important nitrite-oxidizing bacteria. Proc. Natl. Acad. Sci. USA 107, 13479–13484.
Lukashin, A. V., and Borodovsky, M. (1998). GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res. 26, 1107–1115.
Mantri, Y., and Williams, K. P. (2004). Islander: A database of integrative islands in prokaryotic genomes, the associated integrases and their DNA site specificities. Nucleic Acids Res. 32, D55–D58.
Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380.
Markowitz, V. M., Mavromatis, K., Ivanova, N. N., Chen, I. M., Chu, K., and Kyrpides, N. C. (2009). IMG ER: A system for microbial genome annotation expert review and curation. Bioinformatics 25, 2271–2278.
Medini, D., Donati, C., Tettelin, H., Masignani, V., and Rappuoli, R. (2005). The microbial pan-genome. Curr. Opin. Genet. Dev. 15, 589–594.
Meyer, F., Paarmann, D., D’Souza, M., Olson, R., Glass, E. M., Kubal, M., et al. (2008). The metagenomics RAST server—A public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform. 9, 386.
Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., et al. (2008). Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824.
Miller, J. R., Koren, S., and Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327.


Muller, J., Szklarczyk, D., Julien, P., Letunic, I., Roth, A., Kuhn, M., et al. (2010). eggNOG v2.0: Extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res. 38, D190–D195.
Noguchi, H., Park, J., and Takagi, T. (2006). MetaGene: Prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 34, 5623–5630.
Noguchi, H., Taniguchi, T., and Itoh, T. (2008). MetaGeneAnnotator: Detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 15, 387–396.
Norton, J. M., Alzerreca, J. J., Suwa, Y., and Klotz, M. G. (2002). Diversity of ammonia monooxygenase operon in autotrophic ammonia-oxidizing bacteria. Arch. Microbiol. 177, 139–149.
Norton, J. M., Klotz, M. G., Stein, L. Y., Arp, D. J., Bottomley, P. J., Chain, P. S., et al. (2008). Complete genome sequence of Nitrosospira multiformis, an ammonia-oxidizing bacterium from the soil environment. Appl. Environ. Microbiol. 74, 3559–3572.
Orvis, J., Crabtree, J., Galens, K., Gussman, A., Inman, J. M., Lee, E., et al. (2010). Ergatis: A web interface and scalable software system for bioinformatics workflows. Bioinformatics 26, 1488–1492.
Overbeek, R., Bartels, D., Vonstein, V., and Meyer, F. (2007). Annotation of bacterial and archaeal genomes: Improving accuracy and consistency. Chem. Rev. 107, 3431–3447.
Park, B. J., Park, S. J., Yoon, D. N., Schouten, S., Sinninghe Damste, J. S., and Rhee, S. K. (2010). Cultivation of autotrophic ammonia-oxidizing archaea from marine sediments in co-culture with sulfur-oxidizing bacteria. Appl. Environ. Microbiol. 76, 7575–7587.
Pinard, R., de Winter, A., Sarkis, G. J., Gerstein, M. B., Tartaro, K. R., Plant, R. N., et al. (2006). Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics 7, 216.
Pop, M. (2009). Genome assembly reborn: Recent computational challenges. Brief. Bioinform. 10, 354–366.
Poptsova, M. S., and Gogarten, J. P. (2010). Using comparative genome analysis to identify problems in annotated microbial genomes. Microbiology 156, 1909–1917.
Prosser, J. I., and Nicol, G. W. (2008). Relative contributions of archaea and bacteria to aerobic ammonia oxidation in the environment. Environ. Microbiol. 10, 2931–2941.
Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., et al. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65.
Quail, M. A., Kozarewa, I., Smith, F., Scally, A., Stephens, P. J., Durbin, R., et al. (2008). A large genome center’s improvements to the Illumina sequencing system. Nat. Methods 5, 1005–1010.
Roche (2009a). General library preparation method manual. In GS FLX Titanium. Roche Diagnostics GmbH, Mannheim, Germany.
Roche (2009b). Multi span paired end libraries: Guidelines and additional information. Technical Bulletin Number. Roche 454.
Roche (2009c). Paired end library preparation method manual—20 kb and 8 kb span. In GS FLX Titanium. Roche Diagnostics GmbH, Mannheim, Germany.
Santoro, A. E., Casciotti, K. L., and Francis, C. A. (2010). Activity, abundance and diversity of nitrifying archaea and bacteria in the central California Current. Environ. Microbiol. 12, 1989–2006.
Schleper, C., Jurgens, G., and Jonuscheit, M. (2005). Genomic studies of uncultivated archaea. Nat. Rev. Microbiol. 3, 479–488.
Sigrist, C. J., Cerutti, L., de Castro, E., Langendijk-Genevaux, P. S., Bulliard, V., Bairoch, A., and Hulo, N. (2010). PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 38, D161–D166.


Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. M., and Birol, I. (2009). ABySS: A parallel assembler for short read sequence data. Genome Res. 19, 1117–1123.
Starkenburg, S. R., Chain, P. S., Sayavedra-Soto, L. A., Hauser, L., Land, M. L., Larimer, F. W., et al. (2006). Genome sequence of the chemolithoautotrophic nitrite-oxidizing bacterium Nitrobacter winogradskyi Nb-255. Appl. Environ. Microbiol. 72, 2050–2063.
Starkenburg, S. R., Larimer, F. W., Stein, L. Y., Klotz, M. G., Chain, P. S., Sayavedra-Soto, L. A., et al. (2008). Complete genome sequence of Nitrobacter hamburgensis X14 and comparative genomic analysis of species within the genus Nitrobacter. Appl. Environ. Microbiol. 74, 2852–2863.
Stein, L. Y., Arp, D. J., Berube, P. M., Chain, P. S., Hauser, L., Jetten, M. S., et al. (2007). Whole-genome analysis of the ammonia-oxidizing bacterium, Nitrosomonas eutropha C91: Implications for niche adaptation. Environ. Microbiol. 9, 2993–3007.
Stewart, A. C., Osborne, B., and Read, T. D. (2009). DIYA: A bacterial annotation pipeline for any genomics lab. Bioinformatics 25, 962–963.
Tatusov, R. L., Koonin, E. V., and Lipman, D. J. (1997). A genomic perspective on protein families. Science 278, 631–637.
Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V., et al. (2003). The COG database: An updated version includes eukaryotes. BMC Bioinform. 4, 41.
Treusch, A. H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004). Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ. Microbiol. 6, 970–980.
Van Domselaar, G. H., Stothard, P., Shrivastava, S., Cruz, J. A., Guo, A., Dong, X., et al. (2005). BASys: A web server for automated bacterial genome annotation. Nucleic Acids Res. 33, W455–W459.
Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L., Rusch, D., Eisen, J. A., et al. (2004). Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66–74.
Walker, C. B., de la Torre, J. R., Klotz, M. G., Urakawa, H., Pinel, N., Arp, D. J., et al. (2010). Nitrosopumilus maritimus genome reveals unique mechanisms for nitrification and autotrophy in globally distributed marine crenarchaea. Proc. Natl. Acad. Sci. USA 107, 8818–8823.
Ward, B. B., and O’Mullan, G. D. (2002). Worldwide distribution of Nitrosococcus oceani, a marine ammonia-oxidizing gamma-proteobacterium, detected by PCR and sequencing of 16S rRNA and amoA genes. Appl. Environ. Microbiol. 68, 4153–4157.
Williams, R., Peisajovich, S. G., Miller, O. J., Magdassi, S., Tawfik, D. S., and Griffiths, A. D. (2006). Amplification of complex gene libraries by emulsion PCR. Nat. Methods 3, 545–550.
Wood, H. M., Belvedere, O., Conway, C., Daly, C., Chalkley, R., Bickerdike, M., et al. (2010). Using next-generation sequencing for high resolution multiplex analysis of copy number variation from nanogram quantities of DNA from formalin-fixed paraffin-embedded specimens. Nucleic Acids Res. 38, e151.
Zerbino, D. R., and Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829.