CHAPTER 2.4
Next-Generation Sequencing for Gene and Pathway Discovery and Analysis in Autism Spectrum Disorders
Guiqing Cai*, Joseph D. Buxbaum†
*Seaver Autism Center for Research and Treatment, Laboratory of Molecular Neuropsychiatry, Department of Psychiatry, Friedman Brain Institute, Mount Sinai School of Medicine, New York, NY, USA
†As above, and Departments of Psychiatry, Neuroscience, and Genetics and Genomic Sciences, Friedman Brain Institute, Mount Sinai School of Medicine, New York, NY, USA
The Neuroscience of Autism Spectrum Disorders. http://dx.doi.org/10.1016/B978-0-12-391924-3.00011-9
Copyright © 2013 Elsevier Inc. All rights reserved.

OUTLINE
Introduction
Next-Generation Sequencing Technologies
    Pyrosequencing
    Cyclic Reversible Termination
    Sequencing by Ligation
    Single-Molecule Sequencing
Application of Next-Generation Sequencing Technologies in Human Disease
    Whole-Genome Sequencing
    Whole-Exome Sequencing
    Transcriptome Sequencing (RNA-seq)
    Epigenome Sequencing
Next-Generation Sequencing in Autism Spectrum Disorders
    Whole-Exome Sequencing in ASD
    Whole-Genome Sequencing in ASD
    Transcriptome and Epigenome Sequencing in ASD
Conclusion

INTRODUCTION
Multiple twin and family studies have shown that autism spectrum disorders (ASD) have a very strong genetic basis (Bailey et al., 1995; Jorde et al., 1991), and many genetic and genomic disorders caused by rare mutations are associated with ASD (Betancur, 2011). Fragile X syndrome is thought to be one of the most common monogenic forms of ASD, with an estimated 1–2% of subjects with ASD carrying an FMR1 mutation (Abrahams and Geschwind, 2008). Other more common monogenic forms of ASD include mutations in SHANK3 (Phelan–McDermid syndrome), TSC1 and TSC2 (tuberous sclerosis), NF1 (neurofibromatosis), UBE3A (Angelman syndrome), and MECP2 (Rett syndrome). Although such mutations are individually rare, finding them identifies novel therapeutic targets and points to biological pathways involved in ASD. For example, SHANK3 is located in the postsynaptic density, and mutation of this gene indicates a role for glutamate synaptic function in the etiology of ASD (Durand et al., 2007; Gauthier et al., 2010; Moessner et al., 2007; for more details see Chapter 5.6). Similarly, the X-linked genes neuroligin 4 and 3 (NLGN4 and NLGN3) point to disruption of excitatory transmission in ASD. Mutations of these genes were identified in patients with autism, Asperger syndrome, and/or intellectual disability in several families (Jamain et al., 2003; Laumonnier et al., 2004). Other examples of disrupted pathways and subcellular compartments in ASD can be found in Chapter 2.1.

Current estimates indicate that many hundreds of ASD loci remain undiscovered (see below), and each represents an opportunity for clinical dissection, pathway analysis, and even experimental therapeutics (Wink et al., 2010). However, traditional sequencing has been a bottleneck for finding mutations on a genomic scale. New sequencing technologies developed in recent years will bring an explosion of genetic findings to ASD, dramatically advance our understanding of ASD pathogenesis, and expand treatment options.
NEXT-GENERATION SEQUENCING TECHNOLOGIES
Traditional (‘first-generation’) sequencing came into widespread use following the methods developed in parallel by Sanger and by Maxam and Gilbert (Maxam and Gilbert, 1977; Sanger et al., 1977). Sanger sequencing uses a chain-termination method: during sequencing by synthesis, a proportion of the strands in each reaction is terminated by incorporation of one of four dideoxynucleotides (ddATP, ddGTP, ddCTP, or ddTTP), generating a ladder of products, each differing in length from the next by a single base. These fragments are separated by electrophoresis, and the sequence of the template DNA is read from the ladder. Combined with automated capillary electrophoresis, the chain-termination approach yielded vastly improved speed and accuracy. Technical simplicity, high raw-read accuracy, and read lengths of about 800–1000 bp made this the dominant technology for DNA sequencing for over two decades, and the successful completion of the human genome sequencing project was a tour de force that relied on it. However, the international 10-year effort and the high cost of sequencing the first human genome reflected its major limitations.

Next-generation sequencing (NGS) emerged in 2000, when Lynx Therapeutics published a massively parallel signature sequencing (MPSS) method in Nature Biotechnology (Brenner et al., 2000). In 2005, NGS approaches became more widely available with the publication of the sequencing-by-synthesis technology developed by 454 Life Sciences (Margulies et al., 2005) and of the multiplex polony sequencing protocol from George Church’s lab (Shendure et al., 2005). Additional NGS technologies have since been developed, and many are commercially available. These methods differ in template preparation, sequencing chemistry, sequencing platform, imaging, and data analysis. Four NGS technologies are briefly described in the following sections to highlight their respective advantages.
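The chain-termination principle described above for Sanger sequencing can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the function names are hypothetical, the input is taken to be the newly synthesized strand read 5' to 3', and every possible termination product is assumed to be present and perfectly resolved by electrophoresis.

```python
def chain_termination_fragments(synthesized):
    """Return, for each ddNTP reaction, the lengths of the terminated fragments.

    In the ddATP reaction a fragment can end wherever an A is incorporated,
    and likewise for the other three dideoxynucleotides.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)
    return lanes


def read_ladder(lanes):
    """Sort all fragments by length (as electrophoresis does) and read the bases."""
    ladder = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in ladder)


lanes = chain_termination_fragments("GATTACA")
print(lanes["A"])          # fragment lengths ending in A
print(read_ladder(lanes))  # sequence recovered from the ladder
```

Because each fragment length occurs in exactly one of the four reactions, sorting by length recovers the sequence one base at a time, which is the sense in which adjacent products differ by a single base.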
Pyrosequencing
Pyrosequencing is also based on the sequencing-by-synthesis principle, but it relies on detection of the pyrophosphate released on nucleotide incorporation rather than on chain termination with dideoxynucleotides (Ronaghi et al., 1996; 1998). One implementation of pyrosequencing is in use on the Roche/454 NGS platform. The DNA library is first generated by random fragmentation of genomic DNA. Emulsion polymerase chain reaction (PCR) is then used to prepare the template library: each bead–DNA complex is encapsulated in a single aqueous droplet, in which the template DNA is clonally amplified. The surface of the beads carries oligonucleotide probes with sequences complementary to the adaptors attached to the DNA fragments, so that thousands of copies of the same template sequence are amplified on each bead. Emulsion PCR beads can be chemically attached to a glass slide or deposited into PicoTiterPlate wells (Margulies et al., 2005). Incorporation of nucleotides complementary to the template strand, in enzymatic reactions that include DNA polymerase and luciferase, produces a chemiluminescent signal that is recorded by a CCD camera within the instrument. Software then identifies the location of the beads and correlates the light flashes with the type of nucleotide that was incorporated into the synthesized DNA. About 1 million reads with an average length of 400 bases can be generated per run at this time.
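The correlation of light flashes with incorporated nucleotides can be sketched as follows. This is an illustrative toy model, not the instrument's software: it assumes nucleotides are flowed one type at a time in a fixed cyclic order (the order `TACG` used here is an assumption), that the light signal is proportional to the number of identical bases incorporated in that flow, and that `sequence` denotes the synthesized strand.

```python
FLOW_ORDER = "TACG"  # assumed cyclic order in which nucleotides are flowed


def simulate_flowgram(sequence, n_flows):
    """Ideal signal per flow: the number of bases incorporated in that flow."""
    signals = []
    pos = 0
    for flow in range(n_flows):
        base = FLOW_ORDER[flow % len(FLOW_ORDER)]
        run = 0
        # A homopolymer run is incorporated within a single flow.
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        signals.append(float(run))
    return signals


def decode_flowgram(signals):
    """Round each (possibly noisy) signal to a base count and rebuild the read."""
    read = []
    for flow, signal in enumerate(signals):
        base = FLOW_ORDER[flow % len(FLOW_ORDER)]
        read.append(base * round(signal))
    return "".join(read)


ideal = simulate_flowgram("TTAGC", 8)
print(ideal)
print(decode_flowgram([1.9, 1.1, 0.1, 0.95, 0.0, 0.1, 1.05, 0.0]))
```

The rounding step hints at why long homopolymer runs are hard for this kind of chemistry: the difference between, say, seven and eight incorporations is a small relative change in signal.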
Cyclic Reversible Termination
Cyclic reversible termination was developed as the basis of a new non-Sanger DNA sequencing method. An important feature of this method is that DNA synthesis is terminated after the addition of a single nucleotide. The Illumina/Solexa system combines this method with ‘bridge PCR’ amplification, which yields thousands of copies of each fragment in a cluster on the surface of a flow cell. Template DNA is fragmented, and the fragments are ligated to adaptors and captured by the corresponding primers, which are covalently attached to a solid phase on the flow cell. Amplification then proceeds in cycles with the addition of unlabeled nucleotides and enzymes. The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate. Denaturation leaves single-stranded templates, with one end of each bridge tethered to the surface. Several million dense clusters of double-stranded DNA are
generated in each channel of the flow cell. During sequencing, each of the four deoxynucleotides is labeled with a different fluorescent dye, and in each cycle one fluorescently labeled nucleotide is added to the 3’ end of each growing strand. Fluorescence from each cluster on the flow cell is captured upon laser excitation. The dye and terminating group are then chemically cleaved, and a new cycle of nucleotide incorporation and fluorescent imaging is carried out (Metzker, 2010; Shendure et al., 2011). One limitation of this system is read length: the initial length per read was 36 nt, but 100 nt is now achieved on Illumina’s HiSeq2000. This platform is widely used for whole-genome and whole-exome re-sequencing, as well as for other applications such as RNA-seq, ChIP-seq, and Methyl-seq (discussed later in this chapter).
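The cycle-by-cycle read-out described above, one image per cycle with four dye channels per cluster, can be sketched as a simple base caller. This is a minimal illustration under stated assumptions: the function names, the channel order `ACGT`, and the purity threshold of 0.6 are all hypothetical, and real base callers also correct for dye cross-talk and signal decay across cycles.

```python
CHANNELS = "ACGT"  # assumed order of the four dye channels


def call_base(intensities, min_purity=0.6):
    """Call the base whose dye is brightest, or 'N' if the signal is ambiguous.

    Purity is taken here as brightest / (brightest + second brightest).
    """
    ranked = sorted(range(4), key=lambda i: intensities[i], reverse=True)
    brightest = intensities[ranked[0]]
    second = intensities[ranked[1]]
    total = brightest + second
    if total <= 0 or brightest / total < min_purity:
        return "N"
    return CHANNELS[ranked[0]]


def call_read(cycles, min_purity=0.6):
    """Call one base per sequencing cycle for a single cluster."""
    return "".join(call_base(cycle, min_purity) for cycle in cycles)


cluster = [
    (0.9, 0.1, 0.0, 0.1),   # clear A
    (0.1, 0.2, 0.8, 0.0),   # clear G
    (0.5, 0.45, 0.0, 0.0),  # mixed signal, left uncalled
    (0.0, 0.1, 0.1, 0.7),   # clear T
]
print(call_read(cluster))
```

Since one base is read per cycle, the read length equals the number of cycles, which is why read length on this platform grows only as more cycles can be run with acceptable signal quality.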
Sequencing by Ligation
Sequencing by ligation, preceded by emulsion PCR template preparation, is used on the Applied Biosystems (now Life Technologies) SOLiD platform. As in the 454 technology, the DNA template fragments are clonally amplified on beads; however, the beads are placed on the solid phase of a flow cell, so greater density is achieved than in other approaches. In sequencing by ligation, a mixture of different fluorescently labeled dinucleotide probes is pumped into the flow cell. When the correct dinucleotide probe hybridizes to the template DNA, it is ligated onto the pre-built primer on the solid phase. After the unincorporated probes are washed out, fluorescence is captured and recorded. Each fluorescence wavelength corresponds to a particular dinucleotide combination. The fluorescent dye is then removed and washed away, and the next sequencing cycle starts. Although 1 billion reads can be achieved from a single run and accuracy is very high, the limitation of this system is its short read length (50 nt).
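The link between fluorescence color and dinucleotide can be made concrete with the two-base ("color-space") encoding commonly described for this platform. The sketch below rests on an assumption that goes beyond the text above: that there are four colors, each shared by four dinucleotides, so that a color only determines the next base once the previous base is known. The function names are hypothetical.

```python
BASES = "ACGT"


def encode_colors(sequence):
    """Translate a base sequence into one color (0-3) per overlapping dinucleotide.

    With A=0, C=1, G=2, T=3, the color of a dinucleotide is the XOR of its two
    base codes, so AA/CC/GG/TT share color 0, AC/CA/GT/TG share color 1, etc.
    """
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Rebuild the bases from a known first base and the observed colors."""
    sequence = [first_base]
    for color in colors:
        previous = BASES.index(sequence[-1])
        sequence.append(BASES[previous ^ color])
    return "".join(sequence)


colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))
```

Because each base is covered by two overlapping dinucleotides, a true single-base variant changes two adjacent colors while an isolated measurement error changes only one, which is one commonly cited reason for the high accuracy noted above.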
Single-Molecule Sequencing
The recent emergence of single-molecule sequencing technologies in the NGS field expands sequencing capabilities enormously. Because it represents a qualitative improvement over the technologies noted above, some have taken to calling it third-generation sequencing. Unlike the second-generation sequencing technologies discussed above, single-molecule sequencing interrogates a single molecule of DNA or RNA template in real time. No clonal amplification of the DNA or RNA template is required, which avoids the biases introduced by PCR amplification. Single-molecule sequencing can also radically reduce sequencing costs and provide much longer reads; however, the most important advantages of this technology are the capacity for
real-time measurements of DNA or RNA composition and the detection of base modifications such as DNA methylation (Schadt et al., 2010).

The first commercially available single-molecule sequencer was from Helicos Biosciences and used a proprietary technology (Bowers et al., 2009). Sample preparation started with fragmentation of genomic DNA, addition of a 3’ poly(A) tail, and capture of these single-stranded templates on a flow cell surface carrying covalently bound 5’ dT oligonucleotides. The flow cell with these single-stranded molecules was loaded into the HeliScope sequencer, where specially designed fluorescently labeled nucleotide analogues were pumped in one at a time. Through sequencing by synthesis in the presence of polymerase, templates that had incorporated the nucleotide were identified by scanning the surface for the fluorescent label. The incorporated fluorescent moiety was then removed, the next fluorescent nucleotide was pumped in, and the process was repeated (Ozsolak et al., 2009; Shendure et al., 2011). Because this method uses a wash-and-scan process, the total sequencing time can be as long as that of second-generation sequencing technologies (Schadt et al., 2010).

A more widely deployed single-molecule technology is that of Pacific Biosciences, known as single-molecule real-time (SMRT) sequencing. A critical part of this system is the zero-mode waveguide (ZMW): a single DNA polymerase enzyme with a single molecule of DNA template is immobilized at the bottom of each ZMW detector. Nucleotides for the four DNA bases, each attached to a different fluorescent dye, are flooded above an array of ZMWs. When the correct nucleotide is detected by the polymerase, it is incorporated into the growing DNA strand, and the phospholinked fluorescent tag is cleaved off and diffuses out of the observation area of the ZMW. A detector registers the fluorescent signal of nucleotide incorporation, and the base call is made according to the corresponding fluorescence of the dye (Schadt et al., 2010). Importantly, the kinetics of nucleotide incorporation can provide information on chemical modification of the template, and thus epigenetic information (Song et al., 2011). The SMRT system shows many desirable features of single-molecule sequencing: much faster sequencing times, long reads, small amounts of starting material, low sequencing costs, and the ability to detect epigenetic modifications directly.

Other technologies for single-molecule sequencing are emerging. Oxford Nanopore technology uses a nanopore through which single-stranded DNA molecules are electrophoretically driven. As each nucleotide on the DNA molecule partially obstructs the nanopore, it alters the pore’s electrical properties to a different degree, and the change is recorded as corresponding to that particular nucleotide (Branton et al., 2008; Kasianowicz
et al., 1996). Halcyon Molecular uses transmission electron microscopy to directly image and chemically detect the atoms of individual nucleotides. Annular dark-field imaging in an aberration-corrected scanning transmission electron microscope has been shown to identify the chemical type of every atom in a monolayer of hexagonal boron nitride containing substitutional defects (Krivanek et al., 2010). Ion Torrent, by Life Technologies, uses a semiconductor circuit to perform non-optical DNA sequencing of genomes directly. Sequence data are obtained by directly sensing the ions produced by template-directed DNA polymerase synthesis, using all-natural nucleotides, on a massively parallel semiconductor-sensing device or ion chip (Rothberg et al., 2011). VisiGen Biotechnologies uses an engineered DNA polymerase tagged with a fluorophore. When the polymerase binds a fluorophore-labeled nucleotide, fluorescence resonance energy transfer occurs, and this signal is detected in real time. All of these technologies show promise, and further enhancements will soon lead to profound changes in NGS, providing a further revolution in gene and pathway discovery and analysis.
APPLICATION OF NEXT-GENERATION SEQUENCING TECHNOLOGIES IN HUMAN DISEASE
The rapid development of next-generation sequencing technologies has given unprecedented power to solve problems in multiple fields of molecular biology, resulting in many discoveries and new insights. With the emergence of new library preparation methods, computing pipelines for processing the huge volumes of sequencing data, and enhanced analysis strategies, NGS is being applied in many areas. In this section, some of the major applications of NGS are discussed in the context of human health.
Whole-Genome Sequencing
NGS demonstrated its power with the sequencing of James D. Watson’s genome on the Roche/454 NGS platform (Wheeler et al., 2008). The sequence was completed in two months, at approximately one-hundredth of the cost of traditional automated Sanger sequencing. This work was quickly followed by five additional genomes sequenced on different NGS platforms: a Yoruban individual (NA18507) was sequenced on the Illumina/Solexa platform (Bentley et al., 2008) and on the Applied Biosystems SOLiD platform (McKernan et al., 2009); a Chinese genome (YH) and two Korean genomes (AK1 and SJK) were sequenced on the Illumina/Solexa platform (Ahn et al., 2009; Kim et al., 2009; Wang et al., 2008); and one genome (Stephen R. Quake) was sequenced using the single-molecule method of Helicos Biosciences (Pushkarev et al., 2009).

One objective of genomic sequencing is to relate single nucleotide variants (SNVs), indels, and structural variations (SVs) to relevant phenotypes. When compared with the reference human genome, about 3–3.5 million SNVs were identified in each of the individual genomes mentioned above. The 1,000 Genomes Project, launched in 2008, aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype (1,000 Genomes Project Consortium, 2010). This project is now moving quickly to complete its current goal of sequencing and analyzing the genomes of 2,500 individuals from seven populations worldwide. More recently, the UK10K project, funded by the Wellcome Trust, was announced; it aims to sequence the whole genomes of 4,000 people from the general population, as well as the whole exomes of 6,000 individuals with extreme phenotypes or rare disease, including autism (http://www.uk10k.org/goals.html).

The first disease mutation identified by whole-genome sequencing was reported in a family with a recessive form of Charcot–Marie–Tooth disease (Lupski et al., 2010). Patient DNA was first sequenced by ligation on an Applied Biosystems SOLiD NGS platform, and about 3.4 million SNVs were identified. To find the mutation responsible for the phenotypes observed in this family, variants already present in public databases of single nucleotide polymorphisms or SNVs were filtered out. By further examining segregation of the remaining functional variants in all family members, two mutations in SH3TC2 (SH3 domain and tetratricopeptide repeats 2) were identified as being responsible for the separate subclinical phenotypes in this family. A similar strategy was used in another family with two children affected with Miller syndrome. Mutation in the DHODH gene was identified as the primary cause of Miller syndrome, while mutation in DNAH5 was identified as a cause of primary ciliary dyskinesia in the patients (Roach et al., 2010). A third report concerned the identification of mutations in SPR, which encodes sepiapterin reductase, in a 14-year-old fraternal twin pair diagnosed with DOPA-responsive dystonia (DRD). The discovery led to clinical interventions targeting both dopamine (with L-DOPA) and serotonin (with 5-hydroxytryptophan supplementation), which resulted in clinical improvements (Bainbridge et al., 2011). This list is growing rapidly and will continue to provide both etiological diagnoses and new therapeutic strategies in the care of patients.

Whole-genome sequencing has special importance for sequencing the cancer genome. By comparing the genome in cancer biopsies with that of healthy cells from the same patient, genes critical in the development
of the cancer can be identified. For example, Mardis and colleagues sequenced an acute myeloid leukemia genome and a matched normal skin genome from the same patient using the Illumina/Solexa platform, and identified recurring mutations that were relevant to cancer pathogenesis (Mardis et al., 2009). Welch and colleagues used whole-genome sequencing to aid in the diagnosis of a specific leukemia subtype, leading to treatment modifications in a patient (Welch et al., 2011). With the ultimate goal of developing much-improved strategies to diagnose and treat various cancers, the US National Institutes of Health (NIH) Cancer Genome Atlas project (http://cancergenome.nih.gov/) is characterizing more than 20 tumor types at genomic and other levels using NGS technology.

NGS-based whole-genome sequencing is also revolutionizing our ability to rapidly characterize microbial strains related to life-threatening infectious disease. In 2011, the origins of the strain causing a cholera outbreak in Haiti and of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany were quickly identified using NGS technology, including single-molecule sequencing (Chin et al., 2011; Mellmann et al., 2011; Rasko et al., 2011).
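The family-based strategy described earlier in this section for Mendelian disorders (discard variants already present in public databases, keep functional variants, then test segregation in the family) can be sketched as a small filter. Everything in the example is hypothetical: the `Variant` fields, the family member labels, and the deliberately simple segregation rule of "carried by every affected member and by no unaffected member". A real analysis of a recessive disorder would instead model homozygous and compound heterozygous genotypes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    functional: bool  # assumed annotation: predicted to alter the protein

    @property
    def key(self):
        return (self.chrom, self.pos, self.ref, self.alt)


def candidate_variants(patient_variants, known_keys, carriers, affected, unaffected):
    """Apply database filtering, a functional filter, and a segregation check.

    known_keys: set of variant keys found in public databases.
    carriers:   dict mapping a variant key to the set of family members carrying it.
    """
    candidates = []
    for variant in patient_variants:
        if variant.key in known_keys:
            continue  # already catalogued, unlikely to explain a rare disorder
        if not variant.functional:
            continue
        who = carriers.get(variant.key, set())
        if affected <= who and not (who & unaffected):
            candidates.append(variant)
    return candidates
```

Each step removes a large fraction of the millions of variants found per genome, which is how a list of about 3.4 million SNVs can be narrowed to a handful of candidates.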
Whole-Exome Sequencing
Whole-exome sequencing was developed as an efficient and inexpensive means of capturing the subset of the genome that corresponds to coding regions. Using target selection and enrichment approaches, only the protein-coding regions of the genome are sequenced on the NGS platform. As protein-coding regions constitute only ~1.5% of the human genome and cover only ~30–40 megabases (Mb) of sequence, this allows many more samples to be probed in a given NGS experiment. As a proof of concept, Ng and colleagues performed whole-exome sequencing in four unrelated individuals with a rare autosomal dominant disorder (Freeman–Sheldon syndrome) and showed that this approach was very cost-efficient as a means of identifying causal variants of rare disorders (Ng et al., 2009). Although a variety of methods for whole-exome library preparation have been developed, solution-based target enrichment is becoming the most prevalent because of its simplicity and ease of automation. Briefly, a pool of oligonucleotide probes (DNA or RNA) is synthesized to hybridize selectively to the targeted exonic regions of genomic DNA. The probes carry tags for pulldown with beads, so that the hybridized probe–target fragments can be isolated and the unbound regions of genomic DNA washed away. The targeted genomic
fragments are then sequenced on an NGS platform (Kahvejian et al., 2008). Whole-exome sequencing focuses on what is thought to be the most medically relevant part of the human genome, which reduces the overhead of both the molecular and the analytical work. This approach has been broadly applied to identifying the genes that underlie rare disorders. In the past two years alone, causal genes or alleles have been identified for more than two dozen Mendelian disorders, and the approach is now being used to identify rare, etiologically relevant variants underlying complex traits such as schizophrenia and autism (Bamshad et al., 2011). An example of early success is the whole-exome sequencing of 10 individuals with unexplained mental retardation, in whom pathogenic de novo mutations were identified (Vissers et al., 2010).
Transcriptome Sequencing (RNA-seq)
Understanding the transcriptome is an essential step in interpreting functional elements of the genome, revealing the molecular constituents of cells and tissues, and understanding changes associated with development and disease. Using NGS technology for transcriptome sequencing (RNA-seq) allows RNA to be directly sequenced in a high-throughput and quantitative manner (Wang et al., 2009). For most current RNA-seq applications, a population of RNA (e.g., mRNA) is converted to complementary DNA (cDNA), and double-stranded cDNA is then synthesized for NGS library preparation. However, the reverse transcription and library amplification steps introduce biases and artifacts (and increase costs). For this reason, there is great interest in using single-molecule sequencing technology, where RNA can be directly sequenced with reverse transcriptase (Schadt et al., 2010).
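To illustrate what "quantitative" means in practice, the sketch below converts raw read counts per transcript into transcripts per million (TPM), one widely used normalization that the text above does not itself describe. The gene names, counts, and lengths are made up for illustration, and the function name is hypothetical.

```python
def transcripts_per_million(read_counts, lengths_bp):
    """Normalize read counts by transcript length, then scale to sum to 1e6.

    Longer transcripts yield more fragments at the same abundance, so counts
    are first divided by length to give a per-base rate.
    """
    rates = {gene: read_counts[gene] / lengths_bp[gene] for gene in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in rates}
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


# Hypothetical example: equal read counts, but GENE_B is three times longer.
counts = {"GENE_A": 100, "GENE_B": 100}
lengths = {"GENE_A": 1000, "GENE_B": 3000}
print(transcripts_per_million(counts, lengths))
```

In this toy case the two genes have the same number of reads, yet the shorter one is estimated to be three times more abundant, which is the kind of correction any count-based expression estimate has to make.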
Epigenome Sequencing
The epigenome is the overall epigenetic state of a given genomic sample, and a given individual genome may have many hundreds of stable epigenomes, depending on the stability of the chromatin states (Park, 2009). DNA methylation and histone modification are epigenetic modifications that can be mapped genome-wide with NGS technologies at single-nucleotide resolution (Hawkins et al., 2010). Methods developed for genome-wide DNA methylation analysis include bisulfite conversion followed by capture and sequencing (BC-seq), methylated DNA immunoprecipitation-based sequencing (MeDIP), and methylation-sensitive restriction enzyme-dependent library preparation and sequencing (Methyl-seq, HELP-seq) (Laird, 2010). As noted above, single-molecule RNA
sequencing can also detect methylated DNA, as well as additional modifications of DNA (Ozsolak and Milos, 2011; Song et al., 2011). Methods for histone modification analysis include chromatin immunoprecipitation followed by sequencing (ChIP-seq), and DNase I hypersensitivity site footprinting coupled with NGS (DNase-seq). Specific DNA sites in direct physical interaction with transcription factors and other proteins can be isolated by immunoprecipitation and then subjected to NGS (Laird, 2010). NGS-based ChIP-seq can analyze the interaction pattern of any protein that binds DNA, allowing these interactions to be studied on a genomic scale (Johnson et al., 2007).
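The bisulfite-based methods mentioned earlier in this section rest on a simple read-out: bisulfite treatment converts unmethylated cytosine so that it is read as T after amplification, whereas methylated cytosine is protected and still reads as C. The sketch below is a minimal illustration with a hypothetical function name; it assumes the read is already aligned to the reference base for base, covers the forward strand only, and ignores incomplete conversion.

```python
def call_methylation(reference, bisulfite_read):
    """Return {position: methylated?} for every reference C covered by the read.

    C read as C -> methylated (protected from conversion)
    C read as T -> unmethylated (converted)
    anything else is treated as a mismatch and left uncalled.
    """
    if len(reference) != len(bisulfite_read):
        raise ValueError("read must be aligned to the reference base for base")
    calls = {}
    for position, (ref_base, read_base) in enumerate(zip(reference, bisulfite_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[position] = True
        elif read_base == "T":
            calls[position] = False
    return calls


print(call_methylation("ACGTCCA", "ACGTTCA"))
```

Because each cytosine is called individually from the sequence itself, this style of assay is the one that gives methylation status at single-nucleotide resolution.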
NEXT-GENERATION SEQUENCING IN AUTISM SPECTRUM DISORDERS
Whole-Exome Sequencing in ASD
ASD are common neurodevelopmental disorders, often associated with a large burden on the affected individual, the family, and society. Rare genetic or genomic mutations are clearly implicated in ASD (such variation is rare because purifying selection acts on deleterious variation), and there is also likely an as-yet-undefined role for common genetic variation in ASD (O’Roak and State, 2008). ASD are highly heterogeneous in genetic etiology, and only about 10–20% of individuals currently have an identified genetic cause. At the same time, more than 100 genetic and genomic loci have been reported in subjects with ASD, showing the success of ongoing efforts but also underscoring the fact that whole-exome and whole-genome sequencing will be critical approaches for identifying ASD genes and loci (Betancur, 2011).

The first whole-exome study of ASD was published in 2011. In this study, 20 individuals with sporadic ASD and their parents were sequenced to search for de novo mutations of major effect (O’Roak et al., 2011). Four potentially causative de novo events were identified, in FOXP1, GRIN2B, SCN1A, and LAMC3. The study showed that family-based exome sequencing is a powerful approach for identifying new candidate genes for ASD, especially when it focuses on de novo variation.

Several large-scale whole-exome sequencing projects have been initiated in the past two years. One collaborative project received support from the NIMH to sequence several hundred trios, as well as 1,000 cases and carefully matched controls. In the first stage of this project, whole-exome sequencing was performed in a total of 175 ASD trios (Neale et al., 2012). The overall rate of de novo mutation was only slightly higher than the expected rate; however, there was significantly enriched connectivity among the proteins encoded by genes harboring de novo missense or nonsense mutations, as well as excess connectivity to prior ASD genes of major effect, indicating that a subset of the observed events were relevant to ASD risk. Moreover, analysis of de novo variation in this and parallel studies (described below), together with the case-control data, provided evidence in favor of CHD8 and KATNAL2 as genuine autism risk genes.

A second ongoing large exome-sequencing project, supported by the Simons Foundation, is focused on the Simons Simplex Collection (SSC). A unique feature of the SSC is that each family comprises a proband, unaffected parents, and, in most kindreds, an unaffected sibling, which allows researchers to use the unaffected siblings as an important control group (Fischbach and Lord, 2010; Sanders et al., 2011). Three studies totaling ca. 750 SSC trios have been completed (Iossifov et al., 2012; O’Roak et al., 2012; Sanders et al., 2012). In one study (Sanders et al., 2012), in which the unaffected sibling was also sequenced, deleterious de novo mutations were significantly elevated in affected as compared with unaffected siblings, and this was more pronounced when the authors considered de novo mutations in brain-expressed genes. The authors conclude that ca. 45% of de novo deleterious variants in brain-expressed genes carry risk of ASD in these families. Moreover, based on two independent nonsense substitutions disrupting the same gene, this study identifies SCN2A (sodium channel, voltage-gated, type II, alpha subunit) as a bona fide ASD gene. A second completed SSC study identified CHD8 and NTNG1 as ASD genes (O’Roak et al., 2012). This study (as well as the other studies noted above) shows very clearly that the number of de novo point mutations is positively correlated with paternal age, consistent with an increased risk of developing ASD for children of older fathers. O’Roak and colleagues also show that de novo point mutations are overwhelmingly paternal in origin (5:1 bias), consistent with this hypothesis. Finally, Iossifov et al. (2012) show that there is an enrichment of likely ASD genes among genes regulated by the fragile X syndrome-associated FMR1 protein.

Importantly, examining all four studies together, one can estimate both the number of ASD genes that can be identified by these approaches (ca. 500) and the rate of gene discovery as a function of the number of trios sequenced (see Sanders et al., 2012). For example, using recurrent de novo loss-of-function variants as the criterion, analyses showed that for 6,000 families (approximately what is available now in US repositories) a threshold of three or more such variants in the same gene is sufficient to declare genome-wide significance, and 20–50 ASD genes would be identified by this criterion alone. Using two or more de novo loss-of-function variants as the criterion would
identify 60–120 likely ASD genes with a false discovery rate (FDR) of 0.1. In addition, other forms of discovery, such as analysis of recessive loci, would lead to additional gene discovery.

Additional whole-exome projects include one led by researchers at the Hospital for Sick Children in Toronto, which aims to create whole-exome sequences for a cohort of 1,000 Canadian patients (Walker et al., 2011). Meanwhile, the UK10K project is currently performing whole-exome sequencing of about 600 ASD subjects (http://www.uk10k.org/goals.html). Supported by an NIMH Grand Opportunity grant, the Broad Institute and Harvard Medical School are also conducting a study of whole-exome, and eventually whole-genome, sequencing of Middle Eastern patients with a recessive form of autism whose parents share a common ancestry. An analysis of the 1,000 cases and matched controls from the NIMH-funded initiative shows clearly that there is a roughly two-fold elevation in individuals with homozygous loss-of-function variation in ASD (Lim et al., submitted), and examination of recessive forms of ASD with NGS will likely identify many additional ASD loci.
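The trio-based logic behind these studies, finding variants present in the child but in neither parent and then asking which genes are hit in more than one family, can be sketched as follows. This is a toy illustration, not the pipeline of any study cited above: the variant identifiers, gene labels, and function names are hypothetical, genotypes are given as alternate-allele counts (0, 1, or 2), and sequencing error, coverage, and mutation-rate modeling are all ignored.

```python
from collections import Counter


def de_novo_variants(child, mother, father):
    """Variants carried by the child and confidently absent from both parents.

    A parent with no genotype at a site does not count as absent.
    """
    return sorted(
        variant
        for variant, alt_count in child.items()
        if alt_count > 0 and mother.get(variant) == 0 and father.get(variant) == 0
    )


def recurrent_genes(de_novo_by_trio, variant_gene, min_hits=2):
    """Genes with de novo variants in at least `min_hits` independent trios."""
    hits = Counter()
    for variants in de_novo_by_trio:
        for gene in {variant_gene[v] for v in variants}:
            hits[gene] += 1
    return {gene: n for gene, n in hits.items() if n >= min_hits}
```

Setting `min_hits` to 2 or 3 corresponds to the two recurrence criteria discussed above; the stricter threshold yields fewer genes but stronger statistical support for each.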
Whole-Genome Sequencing in ASD
Several whole-genome projects are ongoing at pilot scale. Moreover, in October 2011, Autism Speaks and the Beijing Genomics Institute (BGI) in China jointly announced a commitment to a collaborative project to perform whole-genome sequencing, over a two-year period, on more than 2,000 participating families who have two or more children on the autism spectrum. This project would create the world’s largest library of sequenced genomes of individuals with ASD and will add substantially to ongoing efforts in gene discovery in ASD.
Transcriptome and Epigenome Sequencing in ASD Transcriptome sequencing represents an invaluable means of exploring the biology of ASD, but there are current limitations due to the availability of sufficient numbers of high-quality postmortem brain samples. In one study in postmortem ASD brains, researchers made use of RNA-seq as validation data for gene expression arrays. RNA-seq validated changes in groups of genes identified by co-expression analysis, and provided further evidence for a convergence of transcriptional and alternative-splicing abnormalities in the synaptic and signaling pathogenesis of ASD (Voineagu et al., 2011). RNA-seq in postmortem brain samples, peripheral samples, and inducible pluripotent stem cells differentiated into neural cells will be an important avenue in ASD research.
To characterize epigenetic signatures of ASD in prefrontal cortex neurons, Shulha and colleagues performed genome-wide mapping of the histone mark H3K4me3 in neuronal and non-neuronal nuclei from postmortem ASD brain samples. ASD cases showed altered H3K4me3 peaks for numerous genes regulating neuronal connectivity, social behaviors, and cognition, often in conjunction with altered expression of the corresponding transcripts (Shulha et al., 2011). Although only a small number of samples were used in these studies, the findings show the power of NGS for ASD research. Again, this approach, when applied to additional postmortem brain samples, peripheral samples, and induced pluripotent stem cells differentiated into neural cells, will provide insights into the pathways that are disrupted in ASD.
CONCLUSION
Next-generation sequencing technologies allow us to interrogate the human genome base by base, both physically and functionally, in an efficient and affordable way. It is likely that whole-exome and whole-genome sequencing will be widely applied in the clinical setting to facilitate genetic diagnosis and inform therapy (Bamshad et al., 2011). However, there are many challenges in identifying the disease variants or genes responsible for clinical phenotypes. Characterizing the human genome at the individual and population levels is a fundamental requirement for clarifying the contribution of genetic variation to human phenotypic traits, including those that are disease-related. For example, it was recently estimated that an individual human genome typically contains about 100 loss-of-function variants, with about 20 genes completely inactivated, based on systematic investigation of whole-genome sequencing data from 180 subjects in four different populations (MacArthur et al., 2012). Careful application of NGS techniques will help us to identify rare genetic variants in ASD and assess their relative contribution, and will ultimately lead to improved diagnosis and treatment. As ASD involves core abnormalities in social cognition and language, both of which are central to what makes us human, these findings will have broad ramifications in the neurosciences (Geschwind, 2011).
References
1000 Genomes Project Consortium, 2010. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073.
Abrahams, B.S., Geschwind, D.H., 2008. Advances in Autism genetics: On the threshold of a new neurobiology. Nature Reviews Genetics 9, 341–355.
Ahn, S.M., Kim, T.H., Lee, S., Kim, D., Ghang, H., Kim, D.S., et al., 2009. The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Research 19, 1622–1629.
Bailey, A., Le Couteur, A., Gottesman, I., Bolton, P., Simonoff, E., Yuzda, E., et al., 1995. Autism as a strongly genetic disorder: Evidence from a British twin study. Psychological Medicine 25, 63–77.
Bainbridge, M.N., Wiszniewski, W., Murdock, D.R., Friedman, J., Gonzaga-Jauregui, C., Newsham, I., et al., 2011. Whole-genome sequencing for optimized patient management. Science Translational Medicine 3, 87re3.
Bamshad, M.J., Ng, S.B., Bigham, A.W., Tabor, H.K., Emond, M.J., Nickerson, D.A., et al., 2011. Exome sequencing as a tool for Mendelian disease gene discovery. Nature Reviews Genetics 12, 745–755.
Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., et al., 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59.
Betancur, C., 2011. Etiological heterogeneity in Autism spectrum disorders: More than 100 genetic and genomic disorders and still counting. Brain Research 1380, 42–77.
Bowers, J., Mitchell, J., Beer, E., Buzby, P.R., Causey, M., Efcavitch, J.W., et al., 2009. Virtual terminator nucleotides for next-generation DNA sequencing. Nature Methods 6, 593–595.
Branton, D., Deamer, D.W., Marziali, A., Bayley, H., Benner, S.A., Butler, T., et al., 2008. The potential and challenges of nanopore sequencing. Nature Biotechnology 26, 1146–1153.
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., et al., 2000. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnology 18, 630–634.
Chin, C.S., Sorenson, J., Harris, J.B., Robins, W.P., Charles, R.C., Jean-Charles, R.R., et al., 2011. The origin of the Haitian cholera outbreak strain. New England Journal of Medicine 364, 33–42.
Durand, C.M., Betancur, C., Boeckers, T.M., Bockmann, J., Chaste, P., Fauchereau, F., et al., 2007. Mutations in the gene encoding the synaptic scaffolding protein SHANK3 are associated with Autism spectrum disorders. Nature Genetics 39, 25–27.
Fischbach, G.D., Lord, C., 2010. The Simons Simplex Collection: A resource for identification of Autism genetic risk factors. Neuron 68, 192–195.
Gauthier, J., Champagne, N., Lafreniere, R.G., Xiong, L., Spiegelman, D., Brustein, E., et al., 2010. De novo mutations in the gene encoding the synaptic scaffolding protein SHANK3 in patients ascertained for schizophrenia. Proceedings of the National Academy of Sciences USA 107, 7863–7868.
Geschwind, D.H., 2011. Genetics of Autism spectrum disorders. Trends in Cognitive Sciences 15, 409–416.
Hawkins, R.D., Hon, G.C., Ren, B., 2010. Next-generation genomics: An integrative approach. Nature Reviews Genetics 11, 476–486.
Iossifov, I., Ronemus, M., Levy, D., Wang, Z., Hakker, I., Rosenbaum, J., et al., 2012. De novo gene disruptions in children on the autistic spectrum. Neuron 74, 285–299.
Jamain, S., Quach, H., Betancur, C., Rastam, M., Colineaux, C., Gillberg, I.C., et al., 2003. Mutations of the X-linked genes encoding neuroligins NLGN3 and NLGN4 are associated with Autism. Nature Genetics 34, 27–29.
Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B., 2007. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502.
Jorde, L.B., Hasstedt, S.J., Ritvo, E.R., Mason-Brothers, A., Freeman, B.J., Pingree, C., et al., 1991. Complex segregation analysis of Autism. American Journal of Human Genetics 49, 932–938.
Kahvejian, A., Quackenbush, J., Thompson, J.F., 2008. What would you do if you could sequence everything? Nature Biotechnology 26, 1125–1133.
Kasianowicz, J.J., Brandin, E., Branton, D., Deamer, D.W., 1996. Characterization of individual polynucleotide molecules using a membrane channel. Proceedings of the National Academy of Sciences USA 93, 13770–13773.
Kim, J.I., Ju, Y.S., Park, H., Kim, S., Lee, S., Yi, J.H., et al., 2009. A highly annotated whole-genome sequence of a Korean individual. Nature 460, 1011–1015.
Krivanek, O.L., Chisholm, M.F., Nicolosi, V., Pennycook, T.J., Corbin, G.J., Dellby, N., et al., 2010. Atom-by-atom structural and chemical analysis by annular dark-field electron microscopy. Nature 464, 571–574.
Laird, P.W., 2010. Principles and challenges of genomewide DNA methylation analysis. Nature Reviews Genetics 11, 191–203.
Laumonnier, F., Bonnet-Brilhault, F., Gomot, M., Blanc, R., David, A., Moizard, M.P., et al., 2004. X-linked mental retardation and Autism are associated with a mutation in the NLGN4 gene, a member of the neuroligin family. American Journal of Human Genetics 74, 552–557.
Lim, E.T., Raychaudhuri, S., Stevens, C., Sabo, A., Neale, B.M., Sanders, S.J., et al., 2012. Enrichment of low-frequency two-hit loss-of-function events in cases suggests recessive component in Autism.
Lupski, J.R., Reid, J.G., Gonzaga-Jauregui, C., Rio Deiros, D., Chen, D.C., Nazareth, L., et al., 2010. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. New England Journal of Medicine 362, 1181–1191.
MacArthur, D.G., Balasubramanian, S., Frankish, A., Huang, N., Morris, J., Walter, K., et al., 2012. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828.
Mardis, E.R., Ding, L., Dooling, D.J., Larson, D.E., McLellan, M.D., Chen, K., et al., 2009. Recurring mutations found by sequencing an acute myeloid leukemia genome. New England Journal of Medicine 361, 1058–1066.
Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., et al., 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380.
Maxam, A.M., Gilbert, W., 1977. A new method for sequencing DNA. Proceedings of the National Academy of Sciences USA 74, 560–564.
McKernan, K.J., Peckham, H.E., Costa, G.L., McLaughlin, S.F., Fu, Y., Tsung, E.F., et al., 2009. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research 19, 1527–1541.
Mellmann, A., Harmsen, D., Cummings, C.A., Zentz, E.B., Leopold, S.R., Rico, A., et al., 2011. Prospective genomic characterization of the German enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology. PLoS One 6, e22751.
Metzker, M.L., 2010. Sequencing technologies - the next generation. Nature Reviews Genetics 11, 31–46.
Moessner, R., Marshall, C.R., Sutcliffe, J.S., Skaug, J., Pinto, D., Vincent, J., et al., 2007. Contribution of SHANK3 mutations to Autism spectrum disorder. American Journal of Human Genetics 81, 1289–1297.
Neale, B.M., Kou, Y., Liu, L., Ma’ayan, A., Samocha, K.E., Sabo, A., et al., 2012. Patterns and rates of exonic de novo mutations in Autism spectrum disorders. Nature 485, 242–245.
Ng, S.B., Turner, E.H., Robertson, P.D., Flygare, S.D., Bigham, A.W., Lee, C., et al., 2009. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272–276.
O’Roak, B.J., Deriziotis, P., Lee, C., Vives, L., Schwartz, J.J., Girirajan, S., et al., 2011. Exome sequencing in sporadic Autism spectrum disorders identifies severe de novo mutations. Nature Genetics 43, 585–589.
O’Roak, B.J., State, M.W., 2008. Autism genetics: Strategies, challenges, and opportunities. Autism Research 1, 4–17.
O’Roak, B.J., Vives, L., Girirajan, S., Karakoc, E., Krumm, N., Coe, B.P., et al., 2012. Sporadic Autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250.
Ozsolak, F., Milos, P.M., 2011. Transcriptome profiling using single-molecule direct RNA sequencing. Methods in Molecular Biology 733, 51–61.
Ozsolak, F., Platt, A.R., Jones, D.R., Reifenberger, J.G., Sass, L.E., McInerney, P., et al., 2009. Direct RNA sequencing. Nature 461, 814–818.
Park, P.J., 2009. ChIP-seq: Advantages and challenges of a maturing technology. Nature Reviews Genetics 10, 669–680.
Pushkarev, D., Neff, N.F., Quake, S.R., 2009. Single-molecule sequencing of an individual human genome. Nature Biotechnology 27, 847–850.
Rasko, D.A., Webster, D.R., Sahl, J.W., Bashir, A., Boisen, N., Scheutz, F., et al., 2011. Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. New England Journal of Medicine 365, 709–717.
Roach, J.C., Glusman, G., Smit, A.F., Huff, C.D., Hubley, R., Shannon, P.T., et al., 2010. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636–639.
Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M., Nyren, P., 1996. Real-time DNA sequencing using detection of pyrophosphate release. Analytical Biochemistry 242, 84–89.
Ronaghi, M., Uhlen, M., Nyren, P., 1998. A sequencing method based on real-time pyrophosphate. Science 281, 363–365.
Rothberg, J.M., Hinz, W., Rearick, T.M., Schultz, J., Mileski, W., Davey, M., et al., 2011. An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348–352.
Sanders, S.J., Ercan-Sencicek, A.G., Hus, V., Luo, R., Murtha, M.T., Moreno-De-Luca, D., et al., 2011. Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with Autism. Neuron 70, 863–885.
Sanders, S.J., Murtha, M.T., Gupta, A.R., Murdoch, J.D., Raubeson, M.J., Willsey, A.J., et al., 2012. De novo mutations revealed by whole-exome sequencing are strongly associated with Autism. Nature 485, 237–241.
Sanger, F., Nicklen, S., Coulson, A.R., 1977. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences USA 74, 5463–5467.
Schadt, E.E., Turner, S., Kasarskis, A., 2010. A window into third-generation sequencing. Human Molecular Genetics 19, R227–R240.
Shendure, J.A., Porreca, G.J., Church, G.M., Gardner, A.F., Hendrickson, C.L., Kieleczawa, J., et al., 2011. Overview of DNA sequencing strategies. Current Protocols in Molecular Biology.
Shendure, J., Porreca, G.J., Reppas, N.B., Lin, X., McCutcheon, J.P., Rosenbaum, A.M., et al., 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732.
Shulha, H.P., Cheung, I., Whittle, C., Wang, J., Virgil, D., Lin, C.L., et al., 2011. Epigenetic signatures of autism: Trimethylated H3K4 landscapes in prefrontal neurons. Archives of General Psychiatry.
Song, C.X., Clark, T.A., Lu, X.Y., Kislyuk, A., Dai, Q., Turner, S.W., et al., 2011. Sensitive and specific single-molecule sequencing of 5-hydroxymethylcytosine. Nature Methods 9, 75–77.
Vissers, L.E., de Ligt, J., Gilissen, C., Janssen, I., Steehouwer, M., de Vries, P., et al., 2010. A de novo paradigm for mental retardation. Nature Genetics 42, 1109–1112.
Voineagu, I., Wang, X., Johnston, P., Lowe, J.K., Tian, Y., Horvath, S., et al., 2011. Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474, 380–384.
Walker, S., Prasad, A., Marshall, C.R., Pereira, S.L., Lau, L., Foong, J., et al., 2011. Exome sequencing in Autism spectrum disorder. ASHG annual meeting, poster #875T.
Wang, J., Wang, W., Li, R., Li, Y., Tian, G., Goodman, L., et al., 2008. The diploid genome sequence of an Asian individual. Nature 456, 60–65.
Wang, Z., Gerstein, M., Snyder, M., 2009. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics 10, 57–63.
Welch, J.S., Westervelt, P., Ding, L., Larson, D.E., Klco, J.M., Kulkarni, S., et al., 2011. Use of whole-genome sequencing to diagnose a cryptic fusion oncogene. Journal of the American Medical Association 305, 1577–1584.
Wheeler, D.A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., et al., 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876.
Wink, L.K., Plawecki, M.H., Erickson, C.A., Stigler, K.A., McDougle, C.J., 2010. Emerging drugs for the treatment of symptoms associated with Autism spectrum disorders. Expert Opinion on Emerging Drugs 15, 481–494.