Journal of Cereal Science 54 (2011) 395–400
Journal homepage: www.elsevier.com/locate/jcs
DNA sequencing methods contributing to new directions in cereal research
M.A. Edwards (a), R.J. Henry (b,*)
(a) Southern Cross Plant Science, Southern Cross University, PO Box 157, Lismore, NSW 2480, Australia
(b) Queensland Alliance for Agriculture & Food Innovation, The University of Queensland, St Lucia 4072, Australia
Article info
Article history: Received 25 March 2011; received in revised form 29 July 2011; accepted 29 July 2011
Abstract
DNA sequencing allows the decoding of the genetic code, facilitating the discovery of the molecular genetic basis of biological systems. Knowledge of the sequence of genes in cereals provides a key to understanding many of their nutritional and functional properties. For many years DNA sequencing technology advanced slowly, with incremental advances in the di-deoxy or Sanger method. New technologies are now greatly increasing the power of DNA sequencing. Second (next) generation sequencing has dramatically reduced the cost of DNA sequencing and increased the amount of data that can be collected. Further advances exploiting third generation sequencing promise to accelerate these developments. The DNA sequences of cereal species, genotypes or even individual seeds can now be obtained, delivering a new level of capability in cereal science. © 2011 Elsevier Ltd. All rights reserved.
Keyword: DNA sequencing
* Corresponding author. Tel.: +61 2 6620 3010; fax: +61 2 6622 2080. E-mail address: [email protected] (R.J. Henry).
0733-5210/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.jcs.2011.07.006

1. Introduction
Second Generation Sequencing (SGS) technologies are generating an exponential increase in data, enabling genome-wide perspectives and supporting integrated ‘omics’ studies of an increasing range of food plants. The Third Generation Sequencing technologies under development promise major increases in read length and reductions in the cost and time required to generate results. Coupled with new marker-based selection and screening strategies, these advances provide a base for a further revolution in genetics and crop development.
While a few simple traits have been well characterized at the plant genome level, many others are poorly understood, particularly complex traits that are controlled by gene networks. The functions of many of the genes identified by genome sequencing remain unknown and the genetic control of the majority of agronomic traits is yet to be determined (Edwards and Batley, 2010a). If the threats of climatic instability and food shortages are to be mitigated, the power of current genomics methods needs to be directed toward unraveling complex traits to achieve yet further increases in cereal production.

2. Current sequencing platforms
The Sanger sequencing method (1977) currently produces read lengths up to 1000 base pairs (bp) with a per base error rate as low as 0.001% (Deschamps and Campbell, 2010). Plant genome sequencing has progressed rapidly since the first genome sequence (of Arabidopsis thaliana) was completed in 2000 (Arabidopsis Genome Initiative, 2000) and the first cereal genome sequence, the 389-Mb genome of rice, in 2004 (Sequencing Project International Rice, 2005). Nevertheless, most projects to date have still used the ‘traditional’ Sanger genome sequencing approaches.
While the massive increase in data generated by SGS has come with a drastic reduction in the cost of acquiring the sequence, the quality of the sequence is more variable than the ‘gold standard’ of Sanger sequencing. The assembly of plant genome sequences from short read data is often problematic because of sequence repeats within the genome, especially in large polyploid plant genomes (see Section 5: Bioinformatics) (Imelfort et al., 2009a). However, some experimental designs can facilitate sequence assembly, including longer reads, paired-end reads, mate-pair libraries/large insert libraries, RNA-Seq data, reduced representation libraries, optical mapping, and genetic mapping. The variability in sequence quality is primarily addressed by ensuring deep coverage (i.e. sequencing the equivalent of several whole genomes) (Deschamps and Campbell, 2010).
Existing SGS strategies all follow a similar pattern for library preparation that can be summarized in three major steps: (1) random shearing of DNA by nebulization (in which a small plastic device uses compressed air to force a DNA solution through a small aperture, causing the strands to fragment), sonication or enzyme digestion; (2) ligation of universal adapters at both ends of the sheared DNA fragments; and (3) immobilization and amplification of the adapter-flanked fragments to generate
clustered amplicons to serve as templates for the sequencing reactions (Shendure and Ji, 2008). These platforms allow up to hundreds of millions of sequencing reactions to be carried out in parallel by alternating cycles of base incorporation and image capture. They produce short DNA sequences ranging in size from 25 to 500 bases and generate up to several billion bases per run (Deschamps and Campbell, 2010; Kircher and Kelso, 2010).
In 2005, the 454 Life Sciences/Roche Diagnostics Genome Sequencer (http://454.com/products-solutions/45-portfolio.asp) introduced parallel pyrosequencing on a picotiter plate (Kircher and Kelso, 2010). The Roche/454 GS FLX Titanium sequencer is based on emulsion PCR and pyrosequencing using fragmented (300–800 bp) double-stranded DNA. At present, the GS FLX Titanium series allows the generation of more than a million single reads per run with an average read length of 400–500 bases. 454 sequencing produces the longest reads of the currently available platforms, but the total sequence output per run is comparatively low (Brautigam and Gowik, 2010; Magi et al., 2010). Although 454 sequencing cannot correctly interpret long stretches (>6) of the same nucleotide (homopolymer DNA segments), substitution errors are rarely encountered. Average raw error rates are in the order of 0.1%. Single base-pair deletions or insertions may result from signal-to-noise thresholding issues (the inability of image analysis software to correctly differentiate between a positive light signal and background light sources, or noise), but most of these problems can be resolved by higher coverage. As with Sanger sequencing, the error rate increases with the position in the sequence. This results from a reduction in enzyme efficiency or a loss of enzymes (reducing the signal intensities), with some molecules no longer being elongated, leading to a loss of synchrony in the sequencing (Imelfort et al., 2009a).
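The homopolymer limitation described above can be screened for before a read is trusted. The following is a minimal sketch in Python, not part of any vendor software: the function name `homopolymer_runs` and the default threshold of seven bases (that is, more than six) are illustrative assumptions taken from the ">6" figure quoted in the text.

```python
import re

def homopolymer_runs(seq, min_len=7):
    """Return (start, base, length) for every run of a single base
    that is at least min_len long (default 7, i.e. longer than 6)."""
    runs = []
    # Each alternative matches a maximal stretch of one nucleotide.
    for match in re.finditer(r"A+|C+|G+|T+", seq.upper()):
        length = match.end() - match.start()
        if length >= min_len:
            runs.append((match.start(), match.group()[0], length))
    return runs

# Hypothetical read: an 8-base T run (flagged) and a 6-base A run (not flagged).
read = "ACG" + "T" * 8 + "GC" + "A" * 6 + "CG"
print(homopolymer_runs(read))
```

Regions flagged in this way are the ones where insertion or deletion calls from pyrosequencing data deserve the most scrutiny, or additional coverage.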
The second type of high-throughput sequencer, the Illumina Genome Analyzer (GA) (http://www.illumina.com/systems.ilmn), which uses Solexa sequencing, was introduced in 2007. It is based on solid-phase bridge PCR. This ‘sequencing by synthesis’ (SBS) approach uses reversible terminator nucleotides labeled with fluorescent dye and generates sequences with accuracy greater than 98.5% (Brautigam and Gowik, 2010). Library and flow cell preparation include several in vitro amplification steps, which cause a high background error rate. The GA uses four fluorescent dyes to distinguish the four nucleotides. Of these, two pairs (A/C and G/T) are excited using the same laser and are similar in their emission spectra, and hence show only limited separation using optical filters. Therefore, the highest substitution error rates have been observed between A and C and between G and T (Kircher and Kelso, 2010). Insertion/deletion errors are much less common (Magi et al., 2010). The GAIIx can produce paired-end short reads, 35–120 bases long, yielding up to 50 gigabases per run (for 100 bp, paired-end). The HiSeq 2000 Sequencing System is capable of generating two billion paired-end reads, with a specification of 200–300 Gb of data per run (7–8 days, 2 × 100 Gb) (Peterson et al., 2010).
The ABI SOLiD (Sequencing by Oligo Ligation and Detection) sequencer (http://www.lifetechnologies.com) was developed in 2005 at Agencourt Personal Genomics. This approach is based on emulsion PCR and sequencing by sequential ligation, producing short reads (up to 50 bases) with sequence outputs of up to 30 gigabases per single-read run at 99.94% accuracy. The system features a two-base encoding mechanism that interrogates each base twice, providing built-in error detection (Brautigam and Gowik, 2010).
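The error-detecting property of two-base encoding can be illustrated with a small sketch. The text does not give the encoding table, so the scheme below is an assumption: it uses the commonly described color-space mapping in which each base is assigned a 2-bit code and each color is the XOR of two adjacent base codes. The names `encode_colors` and `decode_colors` are hypothetical.

```python
# Assumed 2-bit base codes; the color for a dinucleotide is the XOR of its two codes.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}

def encode_colors(seq):
    """Translate a base sequence into a list of colors (one per adjacent base pair)."""
    codes = [BASE_TO_CODE[base] for base in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a list of colors."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # each color gives the transition to the next base
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)

reference = "ATGGC"
variant = "ATCGC"  # single-base substitution at the third position
ref_colors = encode_colors(reference)
var_colors = encode_colors(variant)
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]
print(ref_colors, var_colors, changed)
print(decode_colors("A", ref_colors))
```

Under this scheme a true single-base substitution changes two adjacent colors, whereas an isolated measurement error changes only one, which is the basis of the built-in error detection. The same property means that a single uncorrected color error propagates to every downstream base when a read is decoded without a reference.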
Types and causes of sequence errors are diverse: amplification steps; beads carrying a mixture of sequences and beads in close proximity to one another, creating false reads and low quality bases; and signal decline, a small regular phasing effect and incomplete dye removal, which result in increasing error as the ligation cycles progress. There is a considerable reduction in the number of molecules participating in subsequent ligation reactions, and therefore substantial signal decline. Average error rates are, however, dependent on the availability of a reference genome for error correction. If no reference genome is available for error correction and no assembly and consensus calling is performed, then the average error rate is reported as higher than for the Illumina GA (Kircher and Kelso, 2010). The SOLiD 4 marks an improvement, producing 100 Gb of sequence data per run (300 K beads per panel versus 220 K on the SOLiD 3 Plus), while the 4hq upgrade package released in late 2010 increases the capability to 300 Gb/run (99.99% accuracy with five base encoding) (Peterson et al., 2010).
The Polonator (www.polonator.org) is a lower profile SGS system with an ‘open source’ development model, in which all aspects of the system are freely available and modifiable by its users. The Polonator in its present version generates paired-end reads of 28 bp using bead-based emulsion PCR and a ‘sequencing by ligation’ approach.

2.1. Summary of second generation sequencing systems
Sample preparation processes for SGS technology can take several days and often involve costly capital equipment, reagents, supplies and physical space. The ‘wash-and-scan’ sequencing process used in SGS involves sequentially flooding in reagents, incorporating nucleotides into the DNA strands, stopping the incorporation reaction, washing out the excess reagent, scanning to identify the incorporated bases and finally treating the newly incorporated bases to prepare the DNA templates for the next ‘wash-and-scan’ cycle. The numerous wash-and-scan cycles required can take many days.
As each base is added, the population of molecules loses synchronicity (called dephasing), resulting in an increase in noise and sequencing errors as the read extends, and limiting the read length. Also, to generate sufficient DNA molecules, PCR amplification is required. The amplification process can introduce errors in the template sequence as well as amplification bias. However, since being launched, these systems have been under constant development and improvement.
SGS technologies generate a huge amount of data that challenges storage and informatics operations (Schadt et al., 2010). Although SGS has a lower cost per base sequenced, the scale of projects has greatly increased. Data management and analysis costs are often not well accounted for, and current multiplexing and targeted genome capture methods are limited and expensive.
JP Morgan recently conducted a survey of directors of 30 academic and government SGS laboratories in the U.S. and Europe to identify trends in the sequencing market (Peterson et al., 2010). SGS-related activity was predicted to increase from ~37% of research today to 56% within two years. The most cited obstacle to expanding SGS usage was data storage, management, and informatics. The respondents were particularly enthusiastic about RNA-Seq and ChIP-seq technologies (see Section 4: Sequencing approaches). Messenger RNA expression profiling was the most frequently used application of SGS, followed by biomarker discovery and whole genome re-sequencing. Diagnostics and targeted re-sequencing also represented high proportions of SGS applications. Third generation platforms such as the Oxford Nanopore and NABsys systems (www.nabsys.com) are expected to replace ~47% of all SGS activity over the next three years. Microarray demand was expected to remain lukewarm, though supported by interest in genome-wide association studies (GWAS) (Peterson et al., 2010).
Since SGS remains more expensive than array-based analyses, the advantages of SGS should be considered in view of the higher costs.
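The way dephasing limits read length, as described in this section, can be illustrated with a deliberately simple model. The sketch below assumes that every template in a cluster independently fails to extend in a given cycle with a fixed probability, so the fraction still in phase decays geometrically. The failure rates and the 50% signal threshold are hypothetical values chosen for illustration, not platform specifications.

```python
def in_phase_fraction(cycles, p_fail):
    """Fraction of templates still synchronized after a number of cycles,
    assuming each template independently falls out of phase with
    probability p_fail in every cycle."""
    return (1.0 - p_fail) ** cycles

def max_cycles(p_fail, min_fraction):
    """Largest number of cycles for which the in-phase fraction stays
    at or above min_fraction."""
    if not (0.0 < p_fail < 1.0 and 0.0 < min_fraction <= 1.0):
        raise ValueError("p_fail must be in (0, 1) and min_fraction in (0, 1]")
    cycles = 0
    while in_phase_fraction(cycles + 1, p_fail) >= min_fraction:
        cycles += 1
    return cycles

# Hypothetical per-cycle failure rates; read length is capped where the
# in-phase signal drops below half of the cluster.
for p_fail in (0.005, 0.01, 0.02):
    print(p_fail, max_cycles(p_fail, 0.5))
```

Even in this idealized form, the model shows why small improvements in per-cycle chemistry translate into disproportionately longer usable reads.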
3. Third generation sequencing developments
Third generation sequencing technologies interrogate single molecules of DNA (Single Molecule Sequencing, SMS), so that no synchronization of multiple reactions is required (a limitation of SGS), thereby overcoming issues related to dephasing and to the biases introduced by PCR amplification. This has the potential to exploit more fully the high rates of operation of DNA polymerases to radically increase read length and decrease the time required (Schadt et al., 2010). Third generation DNA sequencing technologies are also characterized by direct inspection of single molecules using methods that do not require repetitive ‘wash-and-scan’ steps during DNA synthesis.
The Helicos system (Helicos BioSciences, Cambridge, MA; www.helicosbio.com) was the first commercially available SMS sequencing instrument. Because multiple cycles of single-base extension involving ‘wash-and-scan’ steps are still required, the time to sequence is high, and the read lengths obtained are only up to 32 nucleotides. The raw read error rates are approximately 5%, although the high parallelization of reactions can deliver high fold coverage and a consensus or finished read accuracy of >99% (Deschamps and Campbell, 2010). The DNA polymerase can also be replaced with a reverse transcriptase enzyme to sequence RNA directly (Schadt et al., 2010).
Ion Torrent (Life Technologies; www.iontorrent.com) is a semiconductor-based sequencer, costing less than US$500 per run, in which SBS is carried out in a high density array of micromachined wells. The CMOS-based platform requires no camera, fluorescence labeling or enzymes, providing fast sequencing useful in clinical applications (Peterson et al., 2010). However, the overall read length and throughput are limited by a ‘wash-and-scan’ process requiring PCR amplification of the DNA. As with the 454, there is difficulty processing homopolymers (Peterson et al., 2010; Schadt et al., 2010).
The single-molecule real-time (SMRT) technology (www.pacificbiosciences.com) directly observes a single molecule of DNA polymerase as it synthesizes a strand of DNA using zero-mode waveguide (ZMW) technology. The current system produces several thousand bases before the polymerase is denatured. Strobe sequencing also allows 4 × 250 bp across a 6 kb insert. The longest insert sizes being used now are 20 kb, but 50 kb is proposed (Deschamps and Campbell, 2010). The SMRT sequencing platform requires minimal amounts of reagent and sample preparation, and there are no wash-and-scan steps or PCR amplification. Detection of methyl modifications on the DNA is possible by analysis of the pulses at the software level. Direct RNA sequencing is also possible (Schadt et al., 2010). Raw read error rates can be in excess of 5%, with errors dominated by insertions and deletions. However, Pacific Biosciences has developed the SMRTbell sample preparation, which will make it possible to read both forward and reverse strands in one cycle, providing greater consensus accuracy (Peterson et al., 2010).
Most nanopore sequencing technologies rely on the transit of a DNA molecule or its component bases through a hole, detecting the bases by their effect on an electric current or optical signal (Schadt et al., 2010). Biological nanopores constructed from engineered proteins and synthetic nanopores are both under development. In particular, there is the potential to use atomically thin sheets of graphene as a matrix to support nanopores (Schadt et al., 2010). The Oxford Nanopore BASE technology (www.nanoporetech.com) involves three natural biological molecules: a modified alpha-hemolysin pore in a lipid bilayer, an exonuclease, and a synthetic cyclodextrin sensor, which are engineered to work as a system (Peterson et al., 2010; Schadt et al., 2010). There is again the ability to read methylation directly.
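The relationship between a raw read error rate of about 5% and a consensus accuracy above 99%, noted above for the single-molecule platforms, can be approximated with a binomial calculation. The sketch below is a simplification under stated assumptions: errors are treated as independent between reads, and a position is counted as correct only when a strict majority of the overlapping reads is correct. Real single-molecule errors are dominated by insertions and deletions and are not fully independent, so the figures are indicative only.

```python
from math import comb

def majority_correct(n_reads, raw_error):
    """Probability that more than half of n_reads independent reads
    report the correct base at a position."""
    p_correct = 1.0 - raw_error
    needed = n_reads // 2 + 1  # strict majority
    return sum(
        comb(n_reads, k) * p_correct ** k * raw_error ** (n_reads - k)
        for k in range(needed, n_reads + 1)
    )

# Consensus accuracy at a 5% raw error rate for increasing (odd) coverage.
for depth in (1, 3, 5, 9):
    print(depth, majority_correct(depth, 0.05))
```

Under these assumptions, threefold coverage is already sufficient to exceed 99% per-base consensus accuracy, which is consistent with the role that high fold coverage and two-strand reading play in the text.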
IBM (www-03.ibm.com/press/us/en/pressrelease/28558.wss) is developing transistor-mediated DNA sequencing, a nanostructured
sequencing device capable of electronically detecting individual bases in a single molecule of DNA, which promises much greater speed and read length. Starlight (real-time single molecule sequencing using fluorescence resonance energy transfer (FRET)), based on VisiGen technology (Life Technologies), incorporates quantum dots for continuous long read length (1000–1500 bp) sequencing (Peterson et al., 2010). Halcyon Molecular uses transmission electron microscopy (TEM) to directly image and chemically detect atoms that would uniquely identify the nucleotides on a planar surface, using annular dark-field imaging in an aberration-corrected scanning TEM (www.halcyonmolecular.com). ZS Genetics (www.zsgenetics.com) is developing another TEM-based DNA sequencing instrument using a high-resolution (sub-angstrom) electron microscope, which is also capable of producing 10,000–20,000 base reads.
Many of the third generation sequencing (TGS) systems are currently in early development and their success in the market is difficult to predict. TGS promises to deliver entire genome sequences in less than a day and at greatly reduced cost. However, as with SGS, even greater challenges are posed in mastering large-scale and diverse data types (Schadt et al., 2010).

4. Sequencing approaches
4.1. Transcriptomics
A transcriptome analysis of species with a finished genome reference is useful for gene discovery, genome annotation, expression analysis, determination of alternative splicing and SNP discovery (molecular markers for population analysis). For the assembly of transcriptomes, a single plant, preferably inbred for several generations, should be used as the source of RNA. The cDNA library can be normalized, which will increase the number of genes represented, though the quantitative information will largely be lost.
Non-normalized libraries can discriminate against transcripts of lower expression, such as channel genes or genes involved in signaling and regulation, compared to a normalized library from the same tissue (Brautigam and Gowik, 2010). It is possible to combine SGS with serial analysis of gene expression (SAGE), leading to much deeper sampling and broader coverage of the sequenced transcriptome for the same sequencing effort (Brautigam and Gowik, 2010).
SGS methods have also been applied to the analysis of small RNAs (sRNAs). Two predominant forms of plant sRNAs have been observed. The 21-nt microRNAs (miRNAs) mainly act post-transcriptionally by direct cleavage of a specific target mRNA. The 24-nt short interfering RNAs (siRNAs) typically direct de novo DNA methylation and regulate gene expression at the transcriptional level (Chen et al., 2010).
ChIP-seq: SGS using the ChIP-seq method (Chromatin Immuno-Precipitation) allows genome-wide mapping of DNA–protein interactions. Transcriptional regulation is mediated by the combinatorial interplay between cis-regulatory DNA elements and trans-acting transcription factors, and is perhaps the most important mechanism for controlling gene expression. ChIP-seq analyses can help to compile a comprehensive catalog of cis-regulatory elements and their interaction with given trans-acting factors under specific conditions, thus increasing a network-level understanding of transcription regulation (Park, 2009).

4.2. Epigenomics
Epigenomics refers to the large-scale study of epigenetic marks on the genome, which include covalent modifications of histone tails (acetylation, methylation, phosphorylation, and ubiquitination), DNA methylation and the sRNA machinery (Rival et al., 2010). The delineation of regional DNA methylation patterns, and broader DNA methylation profiles, has important implications for
understanding why certain regions of the genome can be expressed in specific developmental contexts and how epigenetic changes might enable aberrant expression patterns and disease (Laird, 2010). A greater understanding of the epigenetic modification of genomes and the impact of such modification on gene expression is likely to have applications in crop improvement. Methylation mapping using bisulphite-based sequencing methods tends to be fairly accurate and reproducible. However, bias and measurement error can arise from incomplete bisulphite conversion and from differential PCR efficiency for methylated versus unmethylated versions of the same sequence (Laird, 2010). The transcriptome of an organism is continually changing in response to developmental and environmental cues. Similarly, the epigenome is not static and can be molded by developmental signals, environmental perturbations, and disease states (Rival et al., 2010).

4.3. Sequencing project design
Researchers are attempting to exploit the huge data capacity offered by SGS platforms by developing methods to study more samples in each process ‘run’ to reduce overall costs. Various methods of sample preparation called ‘genome partitioning’, ‘enrichment’ or ‘genomic capture’ are being utilized. For example, megabase-size candidate genomic regions, identified through linkage studies, can be sequenced to identify the exact genes and mutations underlying a trait. In human genomics, all of the exons (the human ‘exome’) or entire gene families, such as kinases, can be targeted in large groups of individuals to identify disease variants. Roche NimbleGen solid-phase microarray hybridization (and liquid capture) and Agilent SureSelect bead capture are examples of current targeted capture methods. Some platforms allow the addition of sample-specific barcode (sometimes called ‘indexing’) sequences to the library molecules. Samples can then be separated computationally based on their barcode sequence.
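The computational separation of barcoded samples can be sketched as follows. This is a minimal illustration, not the procedure of any particular platform: it assumes that the barcode sits at the start of each read, that barcodes are matched exactly, and that no barcode is a prefix of another. The function name `demultiplex`, the barcode sequences and the sample names are all hypothetical.

```python
def demultiplex(reads, barcodes):
    """Sort reads into per-sample bins by their leading barcode.

    reads    -- iterable of read sequences with the barcode at the start
    barcodes -- dict mapping barcode sequence to sample name
    Returns a dict of sample name -> list of reads with the barcode removed;
    reads matching no barcode are collected under 'unassigned'.
    """
    bins = {sample: [] for sample in barcodes.values()}
    bins["unassigned"] = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched, for example because of a sequencing error.
            bins["unassigned"].append(read)
    return bins

barcodes = {"ACGT": "cultivar_1", "TGCA": "cultivar_2"}
reads = ["ACGTGGGGCC", "TGCATTTTAA", "NNNNAAAACC"]
print(demultiplex(reads, barcodes))
```

In practice a tolerance of one mismatch is often allowed, which requires barcodes that differ from one another at several positions; exact matching is used here only to keep the sketch short.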
The relatively high level of DNA methylation in repetitive regions of the genome has also been effectively used to target the gene-rich regions of several genomes using methyl-sensitive restriction enzymes (Edwards and Batley, 2010a).
Paired-end or mate-pair protocols help to overcome some of the limitations of SGS short read data by providing information about the relative location and orientation of a pair of reads. In paired-end sequencing, the actual ends of short DNA molecules (<1 kb) are determined, while mate-pair sequencing requires the preparation of special libraries of much longer size-selected molecules (inserts range from 3 to 20 kb) (Kircher and Kelso, 2010). If a reference genome sequence is available, less costly ‘re-sequencing’ and mapping (alignment) of reads is possible.
By generating up to 1× coverage of a crop genome sequence with short paired read sequence data, it is possible to identify numerous reads which correspond to homologous genes in related species. Designing polymerase chain reaction (PCR) primers from these reads enables the amplification and sequencing of the gene and the corresponding genomic region in the target species (Edwards and Batley, 2010a). An alternative strategy is to sequence pools of PCR amplicons.

5. Bioinformatics, data resources and challenges
Numerous plant sequencing projects have been undertaken by major government funded organizations such as JGI (USA) or by various consortia. Reports on the progress of the various plant sequencing projects worldwide, and genomics resources, can be located from useful links within http://www.arabidopsis.org/portals/genAnnotation/other_genomes/index.jsp; http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html; http://www.jgi.doe.gov/genome-projects/pages/projects.jsf?kingdom=Plant; http://www.phytozome.net/; http://www.ldl.genomics.cn/page/pa-plant.jsp.
A recent listing of ‘omics’ tools for the major crop species has been reported by Langridge and Fleury (2010). ‘Omics’ resources, including EST databases, transcriptomics, proteomics, metabolomics, long-insert library, high-throughput genotyping and whole genome sequence data, are available for maize (http://www.maizegdb.org); rice (http://rice.plantbiology.msu.edu); soybean (http://soybase.org/); tomato (http://solgenomics.net/genomes) and grape (http://www.vitaceae.org). Similar data resources are also rapidly accumulating for other economically important plants such as wheat (http://www.wheatgenome.org); barley (www.barleygenome.org); sorghum (http://www.phytozome.net/sorghum); potato (http://potatogenome.net); cassava (http://www.phytozome.net); cotton (http://www.cottonmarker.org); apple (http://www.rosaceae.org); rape seed (http://www.brassica.info); and banana (http://www.musagenomics.org/index.php). A comprehensive listing of recent advances in research platforms and resources in plant ‘omics’, together with related databases and advances in technology, is also provided by Mochida and Shinozaki (2010).
Bioinformatics costs are the major component of most projects, and projects also tend to have analysis bottlenecks (Anon Editorial, 2008). The high costs of methods and data management will necessitate wider collaborations and the pooling of resources, especially for developing countries. Currently, there is widespread development of bioinformatics software, leading to the publication of numerous programs within the last two years. However, few standard packages have been adopted, and the choice of application may depend on the number of reads, the read length and the genome complexity (http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics_application).

5.1. Progress in sequencing cereal genomes
The first cereal species to have a complete reference genome sequence published was rice (Sequencing Project International Rice, 2005), while sorghum (Paterson et al., 2009) and maize (Schnable et al., 2009) have been reported more recently. Significant efforts are now directed to completing reference genome sequences for wheat and barley. As the staple food for 35% of the world’s population and the most widely produced crop, wheat is one of the most important crop species. The development of a high-quality whole genome physical map and whole genome reference sequence of bread wheat remains a challenge. The IWGSC (the consortium generating the bread wheat genome reference sequence; http://www.wheatgenome.org/) is producing high-quality sequences of individual chromosomes before moving toward a gold standard reference sequence (Eversole, 2011). Recently, Berkman et al. (2011) used this strategy in wheat by shotgun sequencing the isolated chromosome arm 7DS and assembling the reads into contigs. In 2010, UK researchers (Keith Edwards et al.), funded by the Biotechnology and Biological Sciences Research Council (BBSRC), publicly released the first full sequence coverage of the wheat genome (of the line Chinese Spring), comprising sequence reads only (http://www.cerealsdb.uk.net).
The main challenge in obtaining a reference genome sequence with current technology is to assemble the short sequence reads obtained from current platforms. Once a reference sequence is obtained, the re-sequencing of other genotypes is relatively easy, with assembly being achieved by alignment with the reference sequence. Third generation sequencing technology promises to overcome these limitations. Organellar genome sequences of cereals may provide a new option for cereal identification (Nock et al., 2010).
6. Molecular marker discovery and SNPs
SGS will impact directly on molecular marker discovery, greatly enhancing rates of marker discovery. Since the 1980s, molecular markers in the form of restriction fragment length polymorphisms (RFLPs), randomly amplified polymorphic DNA (RAPD), amplified fragment length polymorphisms (AFLPs) and microsatellites have been used by plant breeding companies and in other research. SNPs (single nucleotide polymorphisms) have now become the marker of choice, though their use in most plants has been limited, due in part to the cost of development related to the need for high-quality sequence data from several genotypes. Most studies aim to identify SNPs in specific genes that are responsible for advantageous phenotypes. The development of SNPs in plants is complicated by the presence (especially in polyploid crops such as wheat) of homoeologous genes (Henry and Edwards, 2009). Simple sequence repeats (SSRs) are also used together with SNPs in marker-assisted selection (MAS) or for comparative genomics to introduce useful traits from wild related species (Edwards and Batley, 2010a). As the cost of genome sequencing continues to decrease, it may become routine to re-sequence the genomes of individual plants in place of the targeted genotyping with current SNP platforms (Imelfort et al., 2009b).
SNP analysis has certain critical requirements, for example adequate read depth and base quality for the detection of reliable SNPs, and the manual assessment of alignments and identified SNPs. SNP marker development can require several validation steps. Problems with successful SNP locus amplification, low frequency polymorphisms or gene duplicates can render the identification of reliable markers a non-trivial and potentially labor-intensive task (Renaut et al., 2010). Also, the variation discovered in a few individuals cannot be assumed to be representative of the variation across the range of a species.
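The read depth and base quality requirements mentioned above can be expressed as simple filters on the bases aligned to a single reference position. The sketch below is an illustrative toy, not a published SNP caller: the function name `call_snp` and the thresholds (minimum depth of 8, minimum Phred quality of 20, minimum alternative allele fraction of 0.2) are assumptions chosen for the example.

```python
from collections import Counter

def call_snp(ref_base, pileup, min_depth=8, min_qual=20, min_alt_frac=0.2):
    """Return the candidate alternative allele at one position, or None.

    pileup -- list of (base, phred_quality) tuples from reads aligned
              to this reference position
    """
    # Discard low-quality base calls before counting depth.
    usable = [base for base, qual in pileup if qual >= min_qual]
    if len(usable) < min_depth:
        return None  # not enough reliable evidence at this position
    alt_counts = Counter(base for base in usable if base != ref_base)
    if not alt_counts:
        return None
    alt_base, alt_count = alt_counts.most_common(1)[0]
    if alt_count / len(usable) >= min_alt_frac:
        return alt_base
    return None

# Hypothetical pileup: six reference reads and four reads carrying G.
pileup = [("A", 30)] * 6 + [("G", 30)] * 4
print(call_snp("A", pileup))
```

Filters of this kind reduce false positives arising from sequencing errors, but they cannot distinguish true allelic variation from paralogous or homoeologous sequence variants, which is why validation remains necessary in polyploid cereals.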
A recent study applying SGS to SNP discovery in humans reported false-positive error rates between 11% and 70% and false-negative error rates between 10% and 90% (Craig et al., 2008). False-positive errors can result from sequencing errors, alignment errors, or paralogous sequence variants (variants from duplicated regions of the genome). False-negative errors can result from too few overlapping contigs that include an SNP or from regions that are difficult to align or sequence (Metzker, 2010).
The main drawbacks to applying SNPs are that the development of numerous, informative markers may be labor-intensive and incorporate ascertainment bias, and that methods to genotype them can be expensive. Ascertainment bias, a term used in population genetics that is similar to sampling bias (where participants are not equally balanced or objectively represented), results when the samples used to ascertain (discover) SNPs or estimate their allele frequencies in specific populations are of finite or restricted size. Consequently, effort can be wasted with the introduction of uninformative markers. The importance of the SNP discovery strategy cannot be over-emphasized. Unfortunately, the proportion of potential SNPs that have been validated reveals that a major challenge for data generated from second-generation SNP discovery efforts is to select the SNPs that can be validated before conversion to a genotyping assay (Garvin et al., 2010).
Nevertheless, several very accurate and robust methods can be used to perform multiplex genotyping. The iPLEX (Sequenom, San Diego, CA), SNPstream (Beckman Coulter, Fullerton, CA), and SNaPshot (Applied Biosystems, Foster City, CA) assays can multiplex from 2 to about 50 SNP assays per reaction and typically use a 96 or 384 well format. Off-the-shelf arrays are available for model species and humans, but custom arrays must be designed for non-model species, which can make these methods expensive.
Bead-based assays such as Illumina’s GoldenGate assay offer a more flexible assay format.

MAS allows the breeder to select traits early in a breeding program. However, there is little if any benefit in using SGS during selection, as the vast majority of SNPs are not associated with agronomic traits. Furthermore, the integration of molecular marker data with genomic, proteomic and phenomic data allows researchers to relate sequenced genome data to observed traits, bridging the genome-to-phenome divide (Edwards and Batley, 2010a). Markers have gradually been integrated into breeding programs, not as a major revolution replacing conventional breeding, but as an additional tool. Marker effects can also be highly variable across populations (parentage, natural and artificial selection pressure), environments (locations, years, field, glasshouse) and types of agricultural systems. MAS is dependent on a strong, efficient and effective phenotypic evaluation (field) program. MAS requires greater resources and costs than phenotypic selection, but the trade-off is greater genetic gain. The greatest benefit of MAS occurs where the target traits are of low heritability, are recessive and require difficult and costly phenotyping, and where the pyramiding of genes is desired for traits such as disease and pest resistance. However, newer technologies based on high-throughput genotyping associated with new marker systems (e.g. SNPs), and new selection strategies such as AB-QTL, mapping-as-you-go, marker-assisted recurrent selection, gene pyramiding and genome-wide selection are required for the improvement of complex polygenic traits (Gupta et al., 2010). For example, genomic selection (GS) is an emerging alternative to MAS that uses all marker information to calculate genomic estimated breeding values (GEBVs) for complex traits. Selections are made directly on GEBVs without further phenotyping.
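The GEBV calculation described above can be sketched as follows, under simplifying assumptions that are not taken from the cited studies: marker effects are treated as already estimated from a phenotyped training population, the model is purely additive, and genotypes are coded as the number of copies (0, 1 or 2) of a reference allele. The function names and the example line names are hypothetical.

```python
from typing import Dict, List


def genomic_estimated_breeding_values(
        genotypes: Dict[str, List[int]],
        marker_effects: List[float]) -> Dict[str, float]:
    """Sum marker effects over all markers for each selection candidate.

    genotypes maps a candidate name to its allele counts (0, 1 or 2) per marker.
    marker_effects are assumed to have been estimated beforehand from a
    training population; how they are estimated is outside this sketch.
    """
    gebvs = {}
    for name, allele_counts in genotypes.items():
        if len(allele_counts) != len(marker_effects):
            raise ValueError(f"marker count mismatch for {name}")
        # Additive model: GEBV is the genotype-weighted sum of marker effects.
        gebvs[name] = sum(count * effect
                          for count, effect in zip(allele_counts, marker_effects))
    return gebvs


def select_top(gebvs: Dict[str, float], n_selected: int) -> List[str]:
    """Return the names of the n_selected candidates with the highest GEBVs."""
    return sorted(gebvs, key=gebvs.get, reverse=True)[:n_selected]
```

Because the candidates are ranked from genotype data alone, no phenotype is needed within the selection cycle itself, which is the property that allows GS to shorten the breeding cycle.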
GS can complete two to three cycles of selection in the time that it takes to complete one cycle of phenotypic selection. However, the application of genomic selection to crop species has not been reported. The impacts of GS on long-term gain should also be studied prior to its implementation (Jannink, 2010).

7. New directions in research

Currently, the main application of genomics is through a range of MAS strategies. Molecular markers are now used routinely in many well-resourced breeding programs, although their application has been slow in some public programs. The next major challenge will be the integration of the full omics datasets into the modeling, population structure and selection strategies (Langridge and Fleury, 2010). Metabolomics may provide an alternative phenotyping approach and may also be useful in the improvement of crops, reducing the time to the production of elite lines. Furthermore, phenomics allows automated and alternative methods for screening populations for particular traits and will facilitate the introgression of novel variation from wild germplasm. The new sequencing technologies will greatly accelerate these developments.

The wider deployment of genetic manipulation (GM) approaches will be needed for the introduction of novel genes and alleles from diverse sources, and particularly for traits that are absent from plant genomes. The first GM maize hybrids with a degree of drought tolerance are expected to be commercialized by 2012 in the USA. Stacked traits are also an increasingly important feature of biotech crops. For example, a GM maize line named SmartStax™ was released in the USA and Canada in 2010 with eight different genes coding for pest-resistance and herbicide-tolerance traits. Golden rice (enhanced with pro-vitamin A) is expected to be available in 2013 in the Philippines and thereafter in Bangladesh, Indonesia and Vietnam. According to a recent report from the International Service for the Acquisition of Agri-biotech
Applications (ISAAA), the two biggest threats facing the world population today are food insecurity and climate change (http://www.isaaa.org/resources/publications/briefs/42/highlights/default.asp). GM crops have the potential to be more productive, and may be more resistant to extreme climatic conditions. SGS technologies will allow more efficient evaluation of GM plants by defining the insertion position of the transgene, thereby facilitating regulatory approval.

Cereal chemistry is likely to be greatly advanced by the new sequencing technologies. The molecular basis of nutritional, processing and functional quality, and the impact of environment on quality, can now be studied at the whole genome and transcriptome level. Molecular tools for cereal variety identification (Nock et al., 2010) may be based upon SGS, and varietal purity testing (Pattemore et al., 2010) will become more precise with more DNA sequence data. SGS will also allow the complex biochemical interactions contributing to grain quality (Edwards et al., 2010b) to be defined at the molecular genetic level.
References

Berkman, P.J., Skarshewski, A., Lorenc, M.T., Lai, K.T., Duran, C., Ling, E.Y.S., Stiller, J., Smits, L., Imelfort, M., Manoli, S., McKenzie, M., Kubaláková, M., Šimková, H., Batley, J., Fleury, D., Doležel, J., Edwards, D., 2011. Sequencing and assembly of low copy and genic regions of isolated Triticum aestivum chromosome arm 7DS. Plant Biotechnology Journal. doi:10.1111/j.1467-7652.2010.00587.x.
Brautigam, A., Gowik, U., 2010. What can next generation sequencing do for you? Next generation sequencing as a valuable tool in plant research. Plant Biology 12, 831–841.
Chen, M., Meng, Y.J., Gu, H.B., Chen, D.J., 2010. Functional characterization of plant small RNAs based on next-generation sequencing data. Computational Biology and Chemistry 34, 308–312.
Craig, D.W., Pearson, J.V., Szelinger, S., Sekar, A., Redman, M., Corneveaux, J.J., Pawlowski, T.L., Laub, T., Nunn, G., Stephan, D.A., Homer, N., Huentelman, M.J., 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nature Methods 5, 887–893.
Deschamps, S., Campbell, M.A., 2010. Utilization of next-generation sequencing platforms in plant genomics and genetic variant discovery. Molecular Breeding 25, 553–570.
Edwards, D., Batley, J., 2010a. Plant genome sequencing: applications for crop improvement. Plant Biotechnology Journal 8, 2–9.
Edwards, M.A., Osborne, B.G., Henry, R.J., 2010b. Puroindoline genotype, starch granule size distribution and milling quality of wheat. Journal of Cereal Science 52, 314–320.
Eversole, K.A., 2011. Activity Report. In: IWGSC, E.A. (Ed.), The International wheat genome sequencing consortium (IWGSC) Bethesda, MD.
Garvin, M.R., Saitoh, K., Gharrett, A.J., 2010. Application of single nucleotide polymorphisms to non-model species: a technical review. Molecular Ecology Resources 10, 915–934.
Gupta, P.K., Langridge, P., Mir, R.R., 2010. Marker-assisted wheat breeding: present status and future possibilities. Molecular Breeding 26, 145–161.
Henry, R., Edwards, K., 2009. New tools for single nucleotide polymorphism (SNP) discovery and analysis accelerating plant biotechnology. Plant Biotechnology Journal 7, 311.
Imelfort, M., Batley, J., Grimmond, S., Edwards, D., 2009a. Genome sequencing approaches and successes. In: Somers, D.J., et al. (Eds.), Methods in Molecular Biology, Plant Genomics, vol. 513. Humana Press, pp. 345–358.
Imelfort, M., Duran, C., Batley, J., Edwards, D., 2009b. Discovering genetic polymorphisms in next-generation sequencing data. Plant Biotechnology Journal 7, 312–317.
Jannink, J.L., 2010. Dynamics of long-term genomic selection. Genetics Selection Evolution 42.
Kircher, M., Kelso, J., 2010. High-throughput DNA sequencing - concepts and limitations. Bioessays 32, 524–536.
Laird, P.W., 2010. Principles and challenges of genome-wide DNA methylation analysis. Nature Reviews Genetics 11, 191–203.
Langridge, P., Fleury, D., 2010. Making the most of ’omics’ for crop breeding. Trends in Biotechnology 29, 33–40.
Magi, A., Benelli, M., Gozzini, A., Girolami, F., Torricelli, F., Brandi, M.L., 2010. Bioinformatics for next generation sequencing data. Genes 1, 294–307.
Metzker, M.L., 2010. Sequencing technologies - the next generation. Nature Reviews Genetics 11, 31–46.
Mochida, K., Shinozaki, K., 2010. Genomics and bioinformatics resources for crop improvement. Plant and Cell Physiology 51, 497–523.
Nock, C.J., Waters, D.L.E., Edwards, M.A., Bowen, S.G., Rice, N., Cordeiro, G.M., Henry, R.J., 2010. Chloroplast genome sequences from total DNA for plant identification. Plant Biotechnology Journal 9, 328–333.
Park, P.J., 2009. ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics 10, 669–680.
Paterson, A.H., et al., 2009. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556.
Pattemore, J.A., Rice, N., Marshall, D.F., Waugh, R., Henry, R.J., 2010. Cereal variety identification using MALDI-TOF mass spectrometry SNP genotyping. Journal of Cereal Science 52, 356–361.
Peterson, T.W., Nam, S.J., Darby, A., 2010. Next Gen sequencing survey. In: North America Equity Research, vol. 2010. JP Morgan Chase & Co., New York.
Renaut, S., Nolte, A.W., Bernatchez, L., 2010. Mining transcriptome sequences towards identifying adaptive single nucleotide polymorphisms in lake whitefish species pairs (Coregonus spp. Salmonidae). Molecular Ecology 19, 115–131.
Rival, A., Beule, T., Bertossi, F.A., Tregear, J., Jaligot, E., 2010. Plant epigenetics: from genomes to epigenomes. Notulae Botanicae Horti Agrobotanici Cluj-Napoca 38, 9–15.
Schadt, E.E., Turner, S., Kasarskis, A., 2010. A window into third-generation sequencing. Human Molecular Genetics 19, R227–R240.
Schnable, P.S., et al., 2009. The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115.
Sequencing Project International Rice, 2005. The map-based sequence of the rice genome. Nature 436, 793–800.
Shendure, J., Ji, H.L., 2008. Next-generation DNA sequencing. Nature Biotechnology 26, 1135–1145.