A personal history of the echinoderm genome sequencing
R. Andrew Cameron*
Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, United States
*Corresponding author: e-mail address:
[email protected]
Abstract
At the most fundamental level, the genome is the basis for questions about the mechanisms of development: how it works. This perspective provides a brief historical review of the sequencing of the echinoderm genome and the progress in answering this complex question, which depends on technological advances as well as intellectual ones.
Methods in Cell Biology, Volume 151, ISSN 0091-679X, https://doi.org/10.1016/bs.mcb.2019.03.008 © 2019 Elsevier Inc. All rights reserved.

At the most fundamental level, the genome is the basis for questions about the mechanisms of development: how it works. The progress in answering this complex question depends on technological advances as well as intellectual ones. Echinoderms were employed as research models in these studies from the beginning, starting with experiments on DNA synthesis in the 1960s. For me, the echinoderm genomic era commenced with a training session in molecular biology sponsored by the National Science Foundation for current grantees. In 1985, I chose to go to Eric Davidson’s laboratory at the California Institute of Technology for a 6-month period that stretched into a 30-year stint there.

Several advances from the late 1970s onward were in play at the Davidson laboratory when I arrived. They set the direction of research into how the genome works in development. The first was the invention of DNA sequencing technology. Around that time, the Sanger chain-termination method of sequencing became the laboratory standard and DNA sequence information became available. The second was the ability to clone fragments of DNA in bacterial cells, which gave us sufficient amounts of known DNA fragments to sequence and manipulate. The last was the set of procedures for gene transfer into sea urchin zygotes (Colin, 1986; McMahon et al., 1985). In what soon became a canonical approach, these techniques allowed the investigator to assess the function of pieces of the genome efficiently.

Over the next 10 years, fragments of information about how gene activity translates genomic information into an organism accumulated. The methods also improved at a blistering rate. Lee Hood’s laboratory at Caltech perfected automated methods for the synthesis and sequencing of nucleic acids and proteins. The
sequencing machines were adopted by the human genome sequencing project launched in 1990. Another important technology perfected at Caltech during this time was a bacterial cloning vector. These bacterial artificial chromosomes (BACs) were designed to molecularly clone large DNA fragments of 150–350 kbp (Shizuya et al., 1992). These technologies shaped studies of the molecular biology of sea urchin development and guided further advances.

In parallel, the basis for experiments on development broadened. The cell lineages of early development first described by Horstadius (reviewed in Horstadius, 1939) were extended (Cameron, Hough-Evans, Britten, & Davidson, 1987). Whole-mount in situ hybridization methods revealed spatial expression patterns of developmentally regulated genes (Cox, DeLeon, Angerer, & Angerer, 1984). Quantitative PCR methods were rapidly adopted to measure transcript abundance over time. Early experiments to demonstrate how actin genes were regulated gave way to studies of the gene responsible for endoderm specification (Yuh & Davidson, 1996). The last chapter of that work still stands as the most complete analysis of the transcriptional control of a sea urchin gene (Yuh, Moore, & Davidson, 1996). The BAC approach was adopted to screen for genes of interest (Rast, Cameron, Poustka, & Davidson, 2002) from libraries of genomic sequences (Cameron et al., 2000). The next step, an antisense approach to inhibition of gene expression through the use of morpholino-substituted oligonucleotides (Howard, Newman, Oleksyn, Angerer, & Angerer, 2001), sidestepped the need for stable mutants of developmentally regulated molecules.

With a canonical procedure for ascribing function to putative cis-regulatory regions of the genome in place, the goal became to speed up the process of finding and characterizing the regions controlling gene expression.
After all, experimental developmental biology investigations were implicating tens of genes in the early specification of embryonic territories, in processes of morphogenesis and in the maintenance of the physiological machinery of the organism. Clearly, large sequence stretches around genes could be derived from individual BAC clones. Sequence comparisons between particular gene regions in the reference species, Strongylocentrotus purpuratus (SP), and the east coast sea urchin Lytechinus variegatus (LV) revealed short conserved sequences that were likely to be functional.

With these tools in hand, a rush of sequencing efforts and experiments from a number of laboratories took place in the 1990s. As BAC clones containing individual genes from SP and LV were identified, draft sequences were produced. The Sea Urchin Gene Library Facility was installed at Caltech with private foundation funding and later NIH support. It coordinated the work with BACs and kept track of genomic information (Cameron et al., 2000). Soon, there was enough information to connect gene activities into networks of regulatory function for early development. These individual efforts at sequencing and experimentation resulted in the first gene regulatory network for early cell specification in a sea urchin embryo (Davidson et al., 2002).

The remarkable description of a gene regulatory network, i.e., a fundamental description of how the genome works in development, was a compelling reason
to add to the rationale for sequencing the sea urchin genome. In 2002, this point was argued in a white paper that detailed a full slate of sea urchin research topics ranging from the ecological to the molecular (Human Genome Research Institute, 2002). The Human Genome Sequencing Center at Baylor College of Medicine joined a group of PIs from the sea urchin community to submit the white paper to the Human Genome Research Institute (HGRI), NIH. The HGRI instructed the Baylor center to produce a 6-fold draft coverage of the purple sea urchin genome using whole-genome shotgun sequences and a clonal array procedure (Sodergren et al., 2006). The Sea Urchin Gene Library Facility provided a large-fragment DNA sample from a single male purple sea urchin. It is the same DNA used to make the extensive BAC library now available.

The first sequence assembly, version 0.5, was submitted to NCBI in March 2005. The sequence redundancy of this version was estimated at 15%. The likely cause of the redundancy is the inclusion in the assembly of highly polymorphic sequence fragments, which assembly programs did not recognize as similar. To resolve this issue, the Baylor center employed a method to sequence BAC clones covering the genome (Cai, Chen, Gibbs, & Bradley, 2001; Sodergren et al., 2006). Since each BAC clone represents a single haplotype, assemblies based on BAC sequences have reduced redundancy. Indeed, the next assembly, version 2.1, contained only 5% redundant sequences and had an assembly length of 847 Mb. The latter measurement is close to the chemical estimate of DNA content (Hinegardner, 1974).

Using the version 0.5 assembly, a strategy combining several prediction programs was used to predict a total of 28,945 gene models (Sodergren et al., 2006). The Baylor center posted the gene models for a web-based annotation effort.
The echinoderm research community was organized into annotation work groups, and soon over 200 individuals were collaborating to annotate over 9000 of these models. The annotation results were added to the web site, and by 2005 a searchable database of gene information was publicly available. In 2006, the genome of the purple sea urchin, Strongylocentrotus purpuratus, was announced in an issue of Science (Sea Urchin Genome Sequencing Consortium, 2006) along with accompanying essays on the biology of this animal. At the same time, an entire issue of Developmental Biology (Volume 300, No. 1) described the annotation results.

Making use of the large amount of DNA extracted from an individual male purple sea urchin, the Baylor center continued to experiment with short-read sequencing platforms to improve the genome assembly. In addition to the Sanger sequences used to construct version 2.1, the SOLiD, Illumina and PacBio platforms were employed. Between version 0.5 (submitted 2005) and version 4.2 (submitted 2015, after the PacBio addition), the N50 metric of assembly contiguity increased from 55 to 431 kb. Most of the additional sequence was used for gap-filling, which means that assembly errors incorporated in the first version were not removed.

In about 2007, the data from the Baylor center web site was frozen and transferred to the first echinoderm genome web site, called SpBase (Cameron, Samanta, Yuan, He, & Davidson, 2008). The new web information system collated information on sequence, annotation and expression for a total of 29,948 purple sea urchin gene
models. A literature curation effort linked gene information to published papers through PubMed references. Eventually SpBase evolved into Echinobase as more species were added (see below; Kudtarkar & Cameron, 2017).

A modest transcriptome effort was mounted at the same time the genome was sequenced in order to estimate the completeness of the genome sequencing coverage and assembly. The transcript sequences were also used in the gene prediction pipelines. As next-generation sequencing methods became available, more extensive multi-sample transcriptome projects described nearly the full complement of genes in a genome. Today (2019) it is possible to mount a full description of a genome or a gene model set within the resources of an individual research grant. The datasets for echinoderm gene models have exploded, and the NCBI SRA repository lists over 3000 datasets for echinoderms. Given that many experiments are not submitted to GenBank and only about 10 genome projects are listed, this is only a rough estimate. It also illustrates the scattered nature of these data and how little broad curation is possible.

Protein sequence and complexity were also explored using the new genome information and sophisticated mass spectrometric methods. The first proteomes detailed the calcium signaling toolkit in the purple sea urchin (Roux, Townley, et al., 2006). Subsequent proteomic runs examined the proteins involved in skeletogenesis (Mann, Poustka, & Mann, 2008; Seaver & Livingston, 2015).

Comparative genomics was always in the sights of echinoderm developmental biologists and evolutionary biologists. First, a number of different species are used locally as developmental models, and the degree of genome similarity figures into comparisons of function. The utility of non-coding sequence comparisons between SP and LV to reveal conserved functional regions was a motive as well. In the years following 2006, a white paper requesting the sequencing of three additional species was sent to HGRI and approved.
The species finally chosen for sequencing were different from those proposed, as the community members voiced their opinions. The genomes of Lytechinus variegatus, Patiria miniata, Apostichopus parvimensis and Ophiothrix spiculata were sequenced from the single-individual DNA samples that were also used to prepare genomic BAC libraries in the Sea Urchin Gene Library Facility. The next-generation short-read Illumina sequencing produced draft genome assemblies that were good enough to predict gene models for these species. After the HGRI funding for comparative genomics outside vertebrates ended, additional PacBio sequencing was obtained and added to the Illumina data. The PacBio platform produces much longer reads that span repeat sequences and thus support improved assemblies. At this point, Echinobase contains genome data for six species of echinoderms: Strongylocentrotus purpuratus, Lytechinus variegatus, Eucidaris tribuloides, Patiria miniata, Apostichopus parvimensis and Ophiothrix spiculata. Transcriptome datasets for other species are also posted.

In one sense, this history places the start of the echinoderm genomic era about 20 years ago, with the installation of the genomic resources, the Sea Urchin Gene Library Facility and the beginnings of a genome information system that became Echinobase. The earliest assembly was sufficiently complete to predict gene models that could be annotated. Other aspects of the assembly, including the number of gaps and errors in assembly of non-coding regions, lagged even until the present
day. One view of this accuracy problem emerges when attempts to design sequence-based reagents from the reference genome assembly sometimes fail. Hopefully additional sequence and the resulting improvements in sequence assembly can eventually overcome this difficulty.

In the coming years, the dissection of how genomes work still holds much promise. After all, the genome is the blueprint of an organism that we are just learning to read. Of the new directions to take this work, two stand out. First, genetic engineering based on CRISPR-Cas9 sequence changes holds great promise. Its feasibility in sea urchin embryos has been shown (Cui, Lin, & Su, 2017; Lin & Su, 2016) and the procedures work well in echinoderms. Second is the application of single-cell sequencing studies to cell type specification in development (Sebé-Pedrós et al., 2018). Although no single-cell sequencing studies of echinoderms had been published at the time of this writing, several studies in progress were mentioned at a recent sea urchin development meeting and there is a chapter devoted to this topic in this volume (Oulhen, Foster, Wray, & Wessel, 2019, Chapter 4, this volume).

In my experience, the eager vision of a quality genome assembly and gene predictions is rendered imperfect in execution. Genome assembly programs have not completely overcome the difficulties presented by highly polymorphic genomes. It is important to remember that sequence assemblies are hypotheses, not facts. The enormously complicated series of experimental approaches needed to describe genomes and test their function carries many assumptions that are not tested. There is no perfect, final metric. Nevertheless, genome information has transformed our view of how the genome works in some detail and continues to do so.

APOLOGIA. It is with some trepidation that I present this limited view of genomics.
There is no space to acknowledge the very many people who have contributed along the way, both those who brought the work to the point where genomic studies were feasible and those who worked on the genome. Since 2005, almost 3000 echinoderm papers have been listed in the Textpresso literature database on Echinobase. I have noted a few individual works from people I have worked with or near. I have had the good fortune to interact with many more through Echinobase and the Sea Urchin Developmental Biology meeting. I count those interactions among my fondest memories and as a motivation to continue navigating through the intriguing landscapes of echinoderm genomics.
References
Cai, W. W., Chen, R., Gibbs, R. A., & Bradley, A. (2001). A clone-array pooled shotgun strategy for sequencing large genomes. Genome Research, 11, 1619–1623.
Cameron, R. A., Hough-Evans, B., Britten, R. J., & Davidson, E. H. (1987). Lineage and fate of each blastomere of the eight-cell sea urchin embryo. Genes & Development, 1, 75–85.
Cameron, R. A., Mahairas, G., Rast, J. P., Martinez, P., Biondi, T. R., Swartzell, S., et al. (2000). A sea urchin genome project: Sequence scan, virtual map, and additional resources. Proceedings of the National Academy of Sciences of the United States of America, 97, 9514–9518.
Cameron, R. A., Samanta, M., Yuan, A., He, D., & Davidson, E. (2008). SpBase: The sea urchin genome database and web site. Nucleic Acids Research, 37, D750–D754. https://doi.org/10.1093/nar/gkn887.
Colin, A. M. (1986). Rapid repetitive microinjection. Chapter 22, In T. E. Schroeder (Ed.), Vol. 27. Methods in cell biology (pp. 395–406). Academic Press.
Cox, K. H., DeLeon, D. V., Angerer, L. M., & Angerer, R. C. (1984). Detection of mRNAs in sea urchin embryos by in situ hybridization using asymmetric RNA probes. Developmental Biology, 101, 485–502.
Cui, M., Lin, C. Y., & Su, Y. H. (2017). Recent advances in functional perturbation and genome editing techniques in studying sea urchin development. Briefings in Functional Genomics, 16, 309–318.
Davidson, E., et al. (2002). A provisional regulatory gene network for specification of endomesoderm in the sea urchin embryo. Developmental Biology, 246, 162–190.
Hinegardner, R. (1974). Cellular DNA content of the Echinodermata. Comparative Biochemistry and Physiology, 49B, 219–226.
Horstadius, S. (1939). The mechanics of sea urchin development studied by operative methods. Biological Reviews, 14, 132–179.
Howard, E. W., Newman, L. A., Oleksyn, D. W., Angerer, R. C., & Angerer, L. M. (2001). SpKrl: A direct target of β-catenin regulation required for endoderm differentiation in sea urchin embryos. Development, 128, 365–375.
Human Genome Research Institute, Approved Sequencing Targets Archive. 2002: https://www.genome.gov/11008265/sea-urchin-genome-sequencing/.
Kudtarkar, P., & Cameron, R. A. (2017). Echinobase: An expanding resource for echinoderm genomic information. Database, 2017. https://doi.org/10.1093/database/bax074.
Lin, C. Y., & Su, Y. H. (2016). Genome editing in sea urchin embryos by using a CRISPR/Cas9 system. Developmental Biology, 409, 420–428.
Mann, K., Poustka, A. J., & Mann, M. (2008). The sea urchin (Strongylocentrotus purpuratus) test and spine proteomes. Proteome Science, 6, 22.
McMahon, P., Flytzanis, C., Hough-Evans, B., Katula, K., Britten, R., & Davidson, E. (1985). Introduction of cloned DNA into sea urchin egg cytoplasm: Replication and persistence during embryogenesis. Developmental Biology, 108, 420–430.
Oulhen, N., Foster, S., Wray, G., & Wessel, G. (2019). Identifying gene expression from single cells to single genes. Methods in Cell Biology, 151, 127–158.
Rast, J. P., Cameron, R. A., Poustka, A. J., & Davidson, E. H. (2002). Brachyury target genes in the early sea urchin embryo isolated by differential macroarray screening. Developmental Biology, 246, 191–208.
Roux, M. M., Townley, I. K., et al., Foltz, K. R. (2006). A functional genomic and proteomic perspective of sea urchin calcium signaling and egg activation. Developmental Biology, 300, 416–433.
Sea Urchin Genome Sequencing Consortium (2006). The genome of the sea urchin Strongylocentrotus purpuratus. Science, 314, 941–952.
Seaver, R. W., & Livingston, B. T. (2015). Examination of the skeletal proteome of the brittle star Ophiocoma wendtii reveals overall conservation of proteins but variation in spicule matrix proteins. Proteome Science, 13, 7.
Sebé-Pedrós, A., et al. (2018). Early metazoan cell type diversity and the evolution of multicellular gene regulation. Nature Ecology & Evolution, 2, 1176–1188.
Shizuya, H., Birren, B., Kim, U.-J., Mancino, V., Slepak, T., Tachiiri, Y., et al. (1992). Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proceedings of the National Academy of Sciences of the United States of America, 89, 8794–8797.
Sodergren, E., Shen, Y., Song, X., Zhang, L., Gibbs, R. A., & Weinstock, G. M. (2006). Shedding genomic light on Aristotle’s lantern. Developmental Biology, 300, 2–8.
Yuh, C.-H., & Davidson, E. H. (1996). Modular cis-regulatory organization of Endo16, a gut-specific gene of the sea urchin embryo. Development, 122, 1069–1082.
Yuh, C.-H., Moore, J. G., & Davidson, E. H. (1996). Quantitative functional interrelations within the cis-regulatory system of the S. purpuratus Endo16 gene. Development, 122, 4045–4056.