
Biophysical Perspective

Biophysics and the Genomic Sciences

David C. Schwartz1,*

1Department of Chemistry, Laboratory of Genetics, and Laboratory for Molecular and Computational Genomics, University of Wisconsin-Madison, Madison, Wisconsin

ABSTRACT It is now rare to find biological or genetic investigations that do not rely on the tools, data, and thinking drawn from the genomic sciences. Much of this revolution is powered by contemporary sequencing approaches that readily deliver large, genome-wide data sets; these not only provide genetic insights but also report molecular outcomes from experiments that biophysicists are increasingly using to potentiate structural and mechanistic investigations. In this perspective, I describe how biophysical thinking greatly contributed to this revolution, in ways that parallel advancements in computer science, through discussion of several key inventions described as "foundational devices." These discussions also point to how biophysics and the genomic sciences may become more finely integrated, empowering new measurement paradigms for biological investigations.

Considering the genomic sciences as the creation or analysis of large data sets for biological or genetic investigation (a personal and loose definition), it is evident that biophysical thinking has provided the cauldron for a remarkable level of invention and innovation in the genomic sciences. The field is now poised to be seamlessly integrated into the common scientific currency underlying most biological and genetic investigation. Although a major factor differentiating biophysics from the genomic sciences is the size and type of data sets, it was biophysical thinking that enfranchised genomicists to make discoveries using large data sets through advancements in molecular measurement and manipulation. These advancements included developments in single-molecule approaches, advanced microscopy techniques, molecular labeling modalities, and fluidics. These, in turn, laid the foundation for systems capable of routine acquisition of genome sequence and transcriptional profiles and the ability to gauge chromatin states on a very large scale. New statistical and computational approaches helped to build and interpret large data sets for hypothesis-driven research and, in a somewhat game-changing way, made discovery-based research routine and much more effective. The focus of this Biophysical Perspective is to follow a path of how biophysics enabled such large-scale measurements (data sets) and how these genomic data sets are now opening new directions for biophysical investigations.

Submitted April 24, 2019, and accepted for publication July 9, 2019.
*Correspondence: [email protected]
Editor: Massa Shoura.
https://doi.org/10.1016/j.bpj.2019.07.038

For this Biophysical Perspective, I hope to convey a sense of intellectual flow and to glimpse into the future of new biophysical research avenues likely to play a major role in comprehensively explaining biological and genetic systems in high-dimensional ways. At that point in time, the genomic sciences and biophysics become one.

Dissecting the experiment

An experiment, or even a large-scale study, usually begins with questions that need to be answered. An investigator may develop a hypothesis (idea) to be tested or validated through physical measurements and analysis. Alternatively, hypothesis generation may occur only after large data sets have been acquired and analyzed. We may consider these fundamentals more closely by teasing apart these actions into interdependent domains, with each domain comprising a series of physical, analytical, or computational operations. Fig. 1 shows this schema for discovery in large-scale experiments as domain-specific colored boxes, with the major domains representing investigator, physical ("wet"), computational, cluster computing, and active learning. Notably, these domains and their respective parts are portrayed here as supporting high-throughput measurements, which become usable data sets for analysis in ways that portend discovery or large-scale hypothesis generation and validation. These activities become optimally supported when different research components (e.g., microdroplet technologies, fluorescence measurements, and active learning) are seamlessly linked together within a system or pipeline. A challenge here is how to potentiate such low-level efforts (measurement,

© 2019 Biophysical Society.


FIGURE 1 Dissecting the discovery science parts box: experiments comprise five interdependent domains indicated by colored boxes. 1) Investigator domain (blue box): collaborations and hypothesis generation start here, and additional hypotheses are developed through the consideration of new discoveries. 2) Physical domain (red box): components for wet experiments are assembled and presented for interaction or measurement steps; presentation techniques are varied, predicating high-dimensional data sets; measurement outcomes are processed, creating data sets compiled within databases. 3) Computational/theory domain (yellow box): mathematical modeling and statistical and computational analysis operations are performed on information contained within databases; findings are directed back to databases, and discoveries are promoted as confirming hypotheses or provoking new ones. 4) Cluster computing domain (large green box): the cluster computing environment pervades all aspects of the workflow, delivering ample cycles for demanding jobs posed by the components within each domain. 5) Active learning (orange arrows): active learning algorithms foster cycles of hypotheses to be generated to update the workflow as new data are accumulated and analyzed. To see this figure in color, go online.

sample manipulation) through invention of specifically designed molecular components, or "foundational devices," predicated for enabling a highly integrative environment. I loosely define a foundational device as the most elemental and central component present within a system. Examples of foundational devices include the transistor, engineered polymerases, and nanopores for electronic detection; these will be further discussed.

A large-scale experiment

Contemporary genomic investigation runs on large data sets, acquired across populations (cells, individuals, etc.), and is now taking place at greater levels of comprehensiveness. For example, a parsimonious view of the human genome relies on whole human exome sequencing, which samples only the genic portions (2% of the genome); however, many researchers are now leveraging more informative analysis (1) by using whole genome sequencing data sets. In many ways, whole genome analysis provides the means for discovery of functionalities within the "dark matter" of the genome (2), the repetitive, nongenic portions long overlooked and previously labeled as "junk" DNA. This search defined the mission of the ENCODE (Encyclopedia of DNA Elements) project: to functionally elucidate the entire human genome. In groundbreaking ways, the ENCODE project (3) revealed biochemical functions for more than 80% of the human genome and, perhaps more significantly, presciently demonstrated the power of discovery using the large-scale integration of a diverse set of seven measurement approaches. This large-scale integration


enabled the analysis of histone modifications, transcription factor binding sites, chromosome-interacting regions, DNA methylation, and RNA expression, to name a few, performed across as many as 147 different cell types. The scale of the ENCODE project is massive: it involved more than 30 research groups and 400 researchers in 2012 and persists as ENCODE 4 to further the discovery of functionalities within the human and mouse genomes. Given the scope of this undertaking, local measurement pipelines and analyses were established at the participant sites, with high-level data integration and analysis centralized at a data analysis center, complemented by a data coordination center providing community access portals. Accordingly, we can envision separate versions of the discovery parts box depicted in Fig. 1, operating as discrete workflows at multiple sites, with high-level centralization focusing on data rather than on the fundamental measurement processes, namely, sequencing, bisulfite treatments, nuclease digestion, etc. In other words, ENCODE did not centralize the "physical" domain (red box) of the discovery schema so that a single, fully integrated workflow could perform all of the necessary measurement processes; rather, it leveraged the intrinsic advantages of data fungibility and portability for integrative analyses performed across participants' findings and data sets. Using ENCODE as a point of reference, can we imagine ways to advance genomic/biophysical systems to encompass remarkably broad palettes of unified measurement processes? And, importantly, can we imagine advancements enabling increased democratization of massive project undertakings in ways that do not require dedicated research centers?


Lessons from computer science

In trying to address these questions, we have much to learn from the history of computer science regarding the interplay between theory and the foundational devices that actually perform the computational steps within hardware systems. Crudely speaking, modern computer hardware can be considered a box of nimbly programmable switches. Arguably, the earliest programmable digital computer was envisioned in the 1840s by two mathematicians: Charles Babbage and Augusta Ada King (Countess of Lovelace) (4). Babbage and King not only laid the theoretical underpinnings for digital computation but also designed hardware and a computing infrastructure that would enable their ideas for building a general-purpose "analytical engine" capable of solving nonnumerical problems. Their thinking and engineering designs were based on Babbage's earlier work centered on the construction of a "difference engine" to automatically calculate mathematical tables. A major limitation in advancing from the difference engine to the analytical engine was, understandably, hardware. Their advanced and prescient ideas of early computer science had to grapple with the physical realities offered by bulky, insufficient brass gears (the foundational device) needed to build a computer, and, expectedly, the brass gears and ratchets won. Over the next hundred years, electrical devices, such as electromechanical relays, came into use and were later complemented by a purely electronic device (the vacuum tube) by the late 1930s. This speedy electronic "switch" was a great advance over cumbersome electromechanical devices, but vacuum tubes are bulky, fragile, power-hungry, toaster-like devices that are only marginally reliable. These attributes limited the number of tubes that could be housed within a computer system for achieving sustained operation and, more importantly, throttled the scope of computation that could be performed. The emergence of solid-state physics in the 1940s laid the basis for the invention of a nearly perfect switch: the transistor, the foundational device within microprocessors now boasting upward of 20 billion features patterned at a resolution as small as 7 nm (3 to 4 times the width of a DNA molecule). These advances have commoditized computation so that machine learning techniques, which thrive on large data sets, are now increasingly apparent in daily life and gaining widespread use in genomic (5) and biophysical investigations (6,7). Whereas internet traffic readily yields large data sets via analysis of "mouse clicks," sizable molecular data sets are created from concerted measurements performed as part of large-scale experiments. It will require new thinking and invention to increase the scope and pace of this process in ways that may parallel developments in computer science.

Miniaturize the analyte and the instrument: single-molecule analytes and foundational devices

Biophysical and genomic experimental data emerge from measurement, so the lower limits of detectability and sample size are characterized by single-molecule analytes. Single molecules may also serve as the materials for the creation of new types of highly miniaturized foundational devices. For example, devices based on individual DNA polymerase enzymes (8) and ion pore proteins (9) are now defining several contemporary DNA sequencing systems that are, importantly, commercially available (Pacific Biosciences, Oxford Nanopore). These systems are pointing the way forward to a more comprehensive understanding of genomic dark matter by enabling long read lengths that can span complex chromosomal regions such as a human centromere (10). Furthermore, ion pore proteins, such as heavily engineered α-hemolysins (11,12), are ushering in a new era of electronic, single-molecule sequencing riding on several major paradigm shifts. These shifts include 1) the detection of nanopore blockade currents supplanting optical fluorescence measurements and 2) the obviation of DNA polymerase actions for transducing unknown template sequence into measurables, such as those leveraged by venerable Sanger sequencing ladders via electrophoretic separations. (Such recent developments complement venerable single-molecule sequencing approaches that do depend on polymerase action by leveraging single-molecule fluorescence signals measured in a zero-mode waveguide (8) or fluorescence signals tabulated on "polony-like" (13) clusters of bridge PCR (14)-amplified templates in a flow cell undergoing sequencing-by-synthesis steps on an Illumina platform.) Ion pore proteins can be viewed within this context as foundational devices that emerged from biophysical studies of ion channels in excitable membranes, now harnessed as single-molecule detectors for sequencing individual DNA molecules. Developments are also moving ahead to similarly harness synthetic nanopores, fabricated out of silicon nitride (7,15) and graphene (16), for proteomic or DNA analysis (17). Other advantages offered by nanopore sequencing exploit electronic detection techniques, which readily foster a high degree of miniaturization when compared to optical ones. These advantages have enabled hand-sized sequencing systems (18), which are incredibly portable and usable in locations outside of laboratories. Although nanopore devices offer compelling advantages, sequencing by synthesis (19,20), or reading sequence through measurable actions of polymerases on DNA template strands, currently produces 90% of the world's sequencing data using Illumina platforms ((21); discussed in the next paragraph). By contrast, in a measurement tour de force, zero-mode waveguides (Pacific Biosciences, Menlo, CA) allow the real-time epifluorescence monitoring of an individual DNA polymerase performing sequencing-by-synthesis (22) steps on a single-molecule template using 5′-fluorescently labeled nucleotides. Zero-mode waveguides are windowed, with submicroscopic pores etched in aluminum on silica. Briefly, because the pore diameter is subwavelength in size, the illumination depth is largely confined to a 30-nm layer above the silica window where the polymerase resides. Sequence read accuracies are then greatly boosted by repeatedly sequencing the same circular template using a highly processive polymerase. Here, a long template molecule is virtually "amplified" through redundant reads instead of being physically amplified by PCR and then read across a population of amplicons. What makes the zero-mode waveguide a remarkable foundational device for genome analysis is that it solved the conundrum of how to image the enzymatic actions of an individual enzyme "engine" (DNA polymerase) during sequencing against an overwhelming background of fluorescently labeled substrate, required at micromolar concentrations for reliable biochemical activity.
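To make the redundancy argument concrete, the toy sketch below (plain Python, not any vendor's base caller) estimates how a majority vote across repeated, independent passes of the same circular template suppresses random per-base errors; the 10% per-pass error rate, the substitution-only error model, and the pass counts are illustrative assumptions rather than platform specifications.

```python
import random

def consensus_error(per_pass_error=0.10, passes=9, trials=20_000, seed=0):
    """Estimate the per-base error of a majority-vote consensus built from
    repeated reads of one circular template (toy model: errors are random
    substitutions among the three incorrect bases; the true base is 'A')."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        votes = {}
        for _ in range(passes):
            base = rng.choice("CGT") if rng.random() < per_pass_error else "A"
            votes[base] = votes.get(base, 0) + 1
        called = max(votes, key=votes.get)   # consensus call for this base
        wrong += called != "A"
    return wrong / trials

for n in (1, 3, 5, 9, 15):
    print(f"{n:2d} passes -> consensus error ~ {consensus_error(passes=n):.4f}")
```

Even in this crude model, the consensus error falls steeply with the number of passes, which is the sense in which redundant reads stand in for physical amplification.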


Given my discussions of genome analysis as viewed through the lens of foundational devices conceived by biophysical thinking, the Illumina sequencing platforms fall into a unique category. In the case of the Illumina platform, the foundational device may be the development of a highly integrated system, which at a low level obviated the traditional clone libraries needed for feeding large-scale Sanger sequencing efforts (23). Such projects required bacterial clones to be plated en masse on cafeteria-tray-sized plates, which were then robotically picked for subsequent production of sequencing templates; this was all performed within a large genome center before the actual sequencing. In contrast, for Illumina sequencing, a library now starts off in a "tube," and sequencing templates are then made and amplified in situ within a compact flow cell. This great advance in sequencing technology dramatically dropped costs and enfranchised a broad range of investigators to individually explore the genomic aspects of their organism(s) of choice through construction of reference genomes and use of discovery-laden transcriptional profiling (24). The emergence of novel sequencing applications may prove to be more impactful: inexpensive sequencing can now create general-purpose "barcodes" to uniquely identify, or map back to a reference genome, a population of molecular products resulting from large-scale experiments. Consequently, biophysicists are now working with very large, comprehensive molecular data sets to discern the details of chromosomal architecture (25–27) within the nucleus. A prime example of such developments is chromosome conformation capture experiments (28), which reveal contact points of chromatin within the nucleus by chemical cross-linking followed by sequencing of these products to assign the genomic locations of interacting pairs. Genome-wide contact maps are created using "Hi-C" approaches (29), which now explore chromatin organizational motifs across populations of single cells (30,31).
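As a schematic illustration of how sequenced interaction pairs become a contact map, the sketch below bins hypothetical pair coordinates along a single toy chromosome and accumulates a symmetric matrix; the bin size, chromosome length, and example pairs are invented for illustration, and real Hi-C pipelines add read mapping, pair filtering, and matrix normalization steps not shown here.

```python
import numpy as np

def contact_map(pairs, chrom_length, bin_size):
    """Build a symmetric contact matrix from (pos1, pos2) ligation pairs,
    binning each coordinate along one chromosome (toy sketch)."""
    n_bins = chrom_length // bin_size + 1
    m = np.zeros((n_bins, n_bins), dtype=int)
    for p1, p2 in pairs:
        i, j = p1 // bin_size, p2 // bin_size
        m[i, j] += 1
        if i != j:
            m[j, i] += 1   # contacts are unordered, so keep the matrix symmetric
    return m

# Hypothetical interacting pairs (base-pair coordinates) on a 1-Mb toy chromosome
pairs = [(120_000, 130_000), (120_500, 610_000), (600_000, 615_000), (880_000, 125_000)]
print(contact_map(pairs, chrom_length=1_000_000, bin_size=100_000))
```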


Accordingly, biophysicists have come full circle and are now reaping the benefits of a foundational source of large-scale molecular data sets via the contemporary sequencing infrastructures that they helped to establish. Taking a step back here begs revisiting the original question posed earlier (paraphrasing): "…do contemporary sequencing approaches advance genomic/biophysical systems to encompass remarkably broad palettes of unified measurement processes?" Using the concepts presented in Fig. 1 and contextualized by the large-scale efforts of ENCODE, the short answer is no. Instead, we view the discovery science parts box (Fig. 1) as being greatly augmented in throughput and scope through faster, more comprehensive sequencing and assays, but advanced sequencing approaches alone do not provide the necessary infrastructure for low-level integration. We still have dispersed measurement efforts spread over many platforms, investigators, and/or centers, albeit now on a much more productive and cost-effective level. To move forward, can we imagine or identify emerging foundational devices for greatly synergizing sequencing approaches in ways that would foster the low-level integration of measurement processes within large-scale investigations?

Microdroplets as foundational devices for measurement integration

Coming back to the computer science parallels in terms of foundational devices, we now have sequencing devices that can be as small as a single molecule (α-hemolysin pores are heptameric) and that even work with single-molecule analytes, characteristics that speak to great detector sensitivity and a level of miniaturization akin to microelectronic fabrication. But, for the most part, sequencing platforms simply sequence DNA, with notable exceptions in both the Pacific Biosciences (32) and Oxford Nanopore (33) systems' ability to directly measure base chemical modifications, such as CpG methylation. A major limitation in trying to chain together different measurement approaches, such as DNA/RNA sequencing and mass spectrometry, with experimental manipulations (separations, additions, labeling, etc.) is that doing so typically places severe restrictions on resolution or comprehensiveness, especially when miniaturization is paramount. This is due to the destruction or perturbation of analytes or living cells during the process of measurement. Although imaging and spectroscopic techniques (remote sensing) may limit or even obviate some of these issues, this inherent empirical limitation sets a high bar for a foundational device offering a highly integrative environment for many types of measurement. Many of these experimental limitations are being addressed by groundbreaking developments in high-throughput cell biology (high-content screening (HCS) (34)) that now


support a broad range of measurement approaches across large populations of cells through the automated analysis of hundreds of thousands of images, enabling machine learning approaches for high-dimensional data analysis. A great advantage here is that HCS realizes ensembles of single cells as the "ultimate biological test tube," one that also includes unknown or uncontrollable biological variables, whose effects may be mediated or, more interestingly, discovered through analysis of large data sets. Certainly not all biophysical or genomic questions are best served by cell-based systems. In many regards, microdroplet approaches (35–39) may offer some cell-like advantages, which include the ability to formulate and nimbly manipulate a wide array of experimental components within millions of miniature compartments. As such, the microdroplet may be considered a foundational device because it is now empowering a uniquely broad range of single-cell studies by readily enabling large-scale sequencing for the readout of experimental outcomes. Consequently, these advances have been widely embraced as a tool for transcriptional studies of single cells in place of bulked populations. "Drop-seq" (40), for example, encapsulates single cells and DNA-barcoded beads into droplets to capture transcripts for downstream sequencing. In this way, the cellular diversity inherent in tissues can be readily transcriptionally "dissected" and classified. A somewhat complementary microdroplet approach, "Perturb-seq" (41), actually composes experiments using multiple CRISPR-mediated perturbations that are then transcriptionally evaluated en masse at the level of the single cell. In addition, complex tissues can be analyzed using a computational strategy, "Seurat" (42), that enables single-cell transcriptional profiling to spatially locate each cell within tissue regions while also leveraging the advantages afforded by microdroplet technologies (43). Placing these discussions of HCS and microdroplet approaches into the context of Fig. 1 portends low-level integration of presentation and measurement processes, perhaps when other foundational devices (e.g., nanopores) can be directly integrated into, or more directly interfaced with, droplets and cells for sequencing and other measurements. Further impact will accrue as complex droplet manipulations (e.g., splitting, combining, etc.) become routine operations. Meanwhile, liposomes, or droplets composed to mimic living cells (synthetic cells or protocells (44)) by harboring active cellular components (45,46) or synthetic ones (47), may offer the best of both worlds (droplets and cells) by exact formulation of complex biological systems.
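The essential bookkeeping behind droplet approaches such as Drop-seq is demultiplexing: grouping sequenced reads by the cell barcode carried on each bead, collapsing duplicate molecules by their unique molecular identifier (UMI), and tallying a cell-by-gene count table. The sketch below shows that bookkeeping on made-up records; the barcode and UMI lengths and the reads themselves are illustrative assumptions, not the published Drop-seq pipeline.

```python
from collections import defaultdict

def count_matrix(reads):
    """Collapse (cell_barcode, umi, gene) records into a cell-by-gene count
    table, counting each UMI once per cell and gene (toy demultiplexer)."""
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for cell, umi, gene in reads:
        key = (cell, umi, gene)
        if key in seen:          # PCR duplicate: same molecule sequenced twice
            continue
        seen.add(key)
        counts[cell][gene] += 1
    return {cell: dict(genes) for cell, genes in counts.items()}

# Made-up reads: (12-bp cell barcode, 8-bp UMI, gene symbol)
reads = [
    ("AACGTGACTGCA", "AAGGTTCC", "Actb"),
    ("AACGTGACTGCA", "AAGGTTCC", "Actb"),   # duplicate of the read above
    ("AACGTGACTGCA", "CCTTAAGG", "Gapdh"),
    ("TTGCACGTACGT", "GGAACCTT", "Actb"),
]
print(count_matrix(reads))
```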

Crystal ball: outlook

I offer some thoughts here on what the future may hold for biophysics and the genomic sciences, with all of the sober caveats that such predictions must always carry.

1) Synergy: As foundational devices become progressively more sophisticated, they will offer new routes for low-level integration of an even broader array of measurement techniques. The systems that comprise them will be capable of performing increasingly complex investigations that will enable an unprecedented level of high-dimensional data analysis. It may follow that as the number, density, type, and speed of experimental operations rise within a system, biophysical investigations will increasingly meld data-driven approaches with mechanistic studies and computer simulations. Such developments, at some point, may then intimately link the genomic sciences and biophysics by shrinking the intellectual continuum spanning from genetics to the physical sciences.

2) Centralization: Coming back yet again to the computer science parallels previously discussed raises high-level comparisons involving personal computing (PCs on your desk and in your pocket) and cluster computing (linked computers) in regard to the points made above regarding synergy. In terms of hardware, cluster computing essentially links together a large number of highly specialized PCs using a very sophisticated software infrastructure. Given highly integrated systems capable of large-scale experiment assembly and measurement, based on miniature foundational devices, the pace and scope of scientific investigation at the benchtop level will be enhanced. Similarly, centralized facilities will hold large clusters of such systems, linked through sophisticated software infrastructures, and will perhaps precipitate the invention of radically new "sample" handling approaches (48–50) that will further synergize such connectivities. These facilities would offer individual investigators the means to conduct large-scale discovery-based investigations that are now only the province of "big pharma" and centers. Different scales of experimental throughput, breadth, and capabilities will likely coexist, interact, and thrive in ways similar to what we now see in computation and data handling.

3) "Robot Scientist": In 2004, King et al. (51) published a seminal study entitled "Functional Genomic Hypothesis Generation and Experimentation by a Robot Scientist." This bold work showed that scientific discoveries could be made by a computer linked to an ordinary pipetting robot and operated in a closed-loop fashion under the control of a suite of machine learning algorithms, all without human intervention. Simply stated, the system ratchets into discoveries by enabling iterative cycles of automatic hypothesis generation, experiment design, data collection from a physically assembled experiment, and analysis of the resulting experimental data. This iterative loop continues automatically until analysis returns a consistent hypothesis (a putative discovery; Fig. 2).


FIGURE 2 Robot Scientist schema showing the loop comprising hypothesis generation and experimentation; graphic adapted from Fig. 1 of (51).

King's thinking was further advanced by his construction of an elaborate suite of customized robotics unified as a huge system, some 5 m in length, named "Adam." This work, published in 2009, was entitled "The Automation of Science" (52,53). Although Adam orchestrated a greater number and variety of assays, we may be seeing an intellectual impedance mismatch with the physical limitations of laboratory robotics, akin to what Babbage and King encountered some 170 years earlier when faced with the task of actually building a digital computer using brass gears. As such, it is easy to imagine how microdroplet-based approaches, or elaborations of other foundational devices, could supplant robotics in similar applications, operated within the machine learning envelope or "operating system" created for Adam and the Robot Scientist. Referring back to Fig. 1, we see that King's ideas of machine learning techniques for automatic hypothesis generation might possibly control and unify the diverse set of operations depicted in the various domains without heavy reliance on robotics. Such integration and automation would surely impact future, ambitious projects akin to ENCODE by making them more routine and more feasibly launchable by small groups of investigators. One path toward this goal is for biophysicists to invent new classes of foundational devices expressly designed to operate under, and potentiate, machine learning principles at the molecular level (54–56).
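A minimal sketch of the closed-loop logic is given below, assuming a toy problem in which the "discovery" is an unknown threshold and the "experiment" is a simulated measurement; the function name automate_discovery and the bisection-style experiment selection are my own illustration of the iterate-until-consistent idea, not King's implementation.

```python
def automate_discovery(run_experiment, lo=0.0, hi=1.0, tolerance=0.01):
    """Closed-loop cycle: propose a hypothesis, design the most informative
    experiment, 'run' it, and refine until the hypothesis is consistent."""
    while hi - lo > tolerance:                 # stop when the hypothesis is pinned down
        hypothesis = (lo + hi) / 2             # hypothesis generation
        outcome = run_experiment(hypothesis)   # experiment execution and measurement
        if outcome:                            # analysis updates the hypothesis space
            hi = hypothesis
        else:
            lo = hypothesis
    return (lo + hi) / 2

# Toy 'wet lab': growth fails at or above an unknown inhibitor dose of 0.37
secret_threshold = 0.37
result = automate_discovery(lambda dose: dose >= secret_threshold)
print(f"inferred threshold ~ {result:.3f}")
```

The point of the sketch is only the loop structure: each cycle chooses the next experiment based on what has been learned so far, which is the behavior that foundational devices operating "within the machine learning envelope" would need to support.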

Final words

I humbly apologize to those investigators whose work I have not touched on or have inadequately portrayed. We are in the midst of an explosive convergence of computer science and statistics with nearly every aspect of scientific investigation bearing large data sets, and we are reaping these benefits through massively cross-disciplinary thinking. As such, I have had to keep my focus here unfortunately narrow by threading the discussion through the lens of contemporary DNA sequencing and a handful of foundational devices. The lack of discussion of other advances should in no way be considered judgmental.


ACKNOWLEDGMENTS

I thank Stephen Levene and Massa Shoura, as well as Sarah Harris and Julia Salzman, for organizing and for their invitation to attend the Biophysical Society Thematic Meeting, Genome Biophysics: Integrating Genomics and Biophysics to Understand Structural and Functional Aspects of Genomes. It was a meeting that fostered many new ideas and forward-thinking discussions. The National Cancer Institute is also thanked for financial support (5R33CA182360).

REFERENCES

1. Belkadi, A., A. Bolze, …, L. Abel. 2015. Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc. Natl. Acad. Sci. USA. 112:5473–5478.
2. Schwartz, D. C., and M. S. Waterman. 2010. New generations: sequencing machines and their computational challenges. J. Comput. Sci. Technol. 25:3–9.
3. ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature. 489:57–74.
4. Copeland, B. J. 2017. The modern history of computing. https://plato.stanford.edu/archives/win2017/entries/computing-history/.
5. Zou, J., M. Huss, …, A. Telenti. 2019. A primer on deep learning in genomics. Nat. Genet. 51:12–18.
6. Lobo, D., M. Lobikin, and M. Levin. 2017. Discovering novel phenotypes with automatically inferred dynamic models: a partial melanocyte conversion in Xenopus. Sci. Rep. 7:41339.
7. Kolmogorov, M., E. Kennedy, …, P. A. Pevzner. 2017. Single-molecule protein identification by sub-nanopore sensors. PLOS Comput. Biol. 13:e1005356.
8. Levene, M. J., J. Korlach, …, W. W. Webb. 2003. Zero-mode waveguides for single-molecule analysis at high concentrations. Science. 299:682–686.
9. Kasianowicz, J. J., E. Brandin, …, D. W. Deamer. 1996. Characterization of individual polynucleotide molecules using a membrane channel. Proc. Natl. Acad. Sci. USA. 93:13770–13773.
10. Jain, M., H. E. Olsen, …, K. H. Miga. 2018. Linear assembly of a human centromere on the Y chromosome. Nat. Biotechnol. 36:321–323.
11. Gu, L. Q., and H. Bayley. 2000. Interaction of the noncovalent molecular adapter, beta-cyclodextrin, with the staphylococcal alpha-hemolysin pore. Biophys. J. 79:1967–1975.
12. Astier, Y., O. Braha, and H. Bayley. 2006. Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5′-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J. Am. Chem. Soc. 128:1705–1710.
13. Mitra, R. D., and G. M. Church. 1999. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res. 27:e34.
14. Pemov, A., H. Modi, …, S. Bavykin. 2005. DNA analysis with multiplex microarray-enhanced PCR. Nucleic Acids Res. 33:e11.
15. Kennedy, E., Z. Dong, …, G. Timp. 2016. Reading the primary structure of a protein with 0.07 nm³ resolution using a subnanometre-diameter pore. Nat. Nanotechnol. 11:968–976.
16. Wilson, J., L. Sloman, …, A. Aksimentiev. 2016. Graphene nanopores for protein sequencing. Adv. Funct. Mater. 26:4830–4838.
17. Garaj, S., W. Hubbard, …, J. A. Golovchenko. 2010. Graphene as a subnanometre trans-electrode membrane. Nature. 467:190–193.
18. Watson, M., M. Thomson, …, M. Blaxter. 2015. poRe: an R package for the visualization and analysis of nanopore sequencing data. Bioinformatics. 31:114–115.
19. Fuller, C. W., L. R. Middendorf, …, D. V. Vezenov. 2009. The challenges of sequencing by synthesis. Nat. Biotechnol. 27:1013–1023.
20. Ardui, S., A. Ameur, …, M. S. Hestand. 2018. Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids Res. 46:2159–2168.
21. Illumina. 2019. At a glance. https://www.illumina.com/content/dam/illumina-marketing/documents/company/illumina-web-graphic-at-a-glance.pdf.
22. Ritz, A., A. Bashir, and B. J. Raphael. 2010. Structural variation analysis with strobe reads. Bioinformatics. 26:1291–1298.
23. Lander, E. S., L. M. Linton, …, J. Szustakowki; International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature. 409:860–921.
24. Wang, Z., M. Gerstein, and M. Snyder. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10:57–63.
25. Bascom, G. D., K. Y. Sanbonmatsu, and T. Schlick. 2016. Mesoscale modeling reveals hierarchical looping of chromatin fibers near gene regulatory elements. J. Phys. Chem. B. 120:8642–8653.
26. Chiariello, A. M., C. Annunziatella, …, M. Nicodemi. 2016. Polymer physics of chromosome large-scale 3D organisation. Sci. Rep. 6:29775.
27. Ozer, G., A. Luque, and T. Schlick. 2015. The chromatin fiber: multiscale problems and approaches. Curr. Opin. Struct. Biol. 31:124–139.
28. Dekker, J., K. Rippe, …, N. Kleckner. 2002. Capturing chromosome conformation. Science. 295:1306–1311.
29. Lieberman-Aiden, E., N. L. van Berkum, …, J. Dekker. 2009. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 326:289–293.
30. Nagano, T., Y. Lubling, …, P. Fraser. 2013. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature. 502:59–64.
31. Davies, J. O., A. M. Oudelaar, …, J. R. Hughes. 2017. How best to identify chromosomal interactions: a comparison of approaches. Nat. Methods. 14:125–134.
32. Flusberg, B. A., D. R. Webster, …, S. W. Turner. 2010. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods. 7:461–465.
33. Simpson, J. T., R. E. Workman, …, W. Timp. 2017. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods. 14:407–410.
34. Mattiazzi Usaj, M., E. B. Styles, …, B. J. Andrews. 2016. High-content screening for quantitative cell biology. Trends Cell Biol. 26:598–611.
35. Griffiths, A. D., and D. S. Tawfik. 2006. Miniaturising the laboratory in emulsion droplets. Trends Biotechnol. 24:395–402.
36. Kim, H., D. Luo, …, Z. Cheng. 2007. Controlled production of emulsion drops using an electric field in a flow-focusing microfluidic device. Appl. Phys. Lett. 91:133106.
37. Utada, A. S., E. Lorenceau, …, D. A. Weitz. 2005. Monodisperse double emulsions generated from a microcapillary device. Science. 308:537–541.
38. Brouzes, E., M. Medkova, …, M. L. Samuels. 2009. Droplet microfluidic technology for single-cell high-throughput screening. Proc. Natl. Acad. Sci. USA. 106:14195–14200.
39. Murphy, T. W., Q. Zhang, …, C. Lu. 2017. Recent advances in the use of microfluidic technologies for single cell analysis. Analyst (Lond.). 143:60–80.
40. Macosko, E. Z., A. Basu, …, S. A. McCarroll. 2015. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 161:1202–1214.
41. Dixit, A., O. Parnas, …, A. Regev. 2016. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell. 167:1853–1866.e17.
42. Satija, R., J. A. Farrell, …, A. Regev. 2015. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33:495–502.
43. Lake, B. B., S. Chen, …, K. Zhang. 2018. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat. Biotechnol. 36:70–80.
44. Adamala, K., and J. W. Szostak. 2013. Nonenzymatic template-directed RNA synthesis inside model protocells. Science. 342:1098–1100.
45. van Nies, P., I. Westerlaken, …, C. Danelon. 2018. Self-replication of DNA by its encoded proteins in liposome-based synthetic cells. Nat. Commun. 9:1583.
46. Tang, T. D., D. Cecchi, …, S. Mann. 2018. Gene-mediated chemical communication in synthetic protocell communities. ACS Synth. Biol. 7:339–346.
47. Niederholtmeyer, H., C. Chaggan, and N. K. Devaraj. 2018. Communication and quorum sensing in non-living mimics of eukaryotic cells. Nat. Commun. 9:5027.
48. Schwartz, D. C. 2011. Chemical screening system using strip arrays. US Patent US8034550B2, filed May 29, 2008, and published October 11, 2011.
49. Schwartz, D. C. 2004. The new biology. In The Markey Scholars Conference: Proceedings. National Research Council, The National Academies Press, pp. 73–79.
50. Michael, K. L., L. C. Taylor, …, D. R. Walt. 1998. Randomly ordered addressable high-density optical sensor arrays. Anal. Chem. 70:1242–1248.
51. King, R. D., K. E. Whelan, …, S. G. Oliver. 2004. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature. 427:247–252.
52. King, R. D., J. Rowland, …, A. Clare. 2009. The automation of science. Science. 324:85–89.
53. Waltz, D., and B. G. Buchanan. 2009. Computer science. Automating science. Science. 324:43–44.
54. Muggleton, S. H. 2006. 2020 computing: exceeding human limits. Nature. 440:409–410.
55. Yu, H., K. Jo, …, D. C. Schwartz. 2009. Molecular propulsion: chemical sensing and chemotaxis of DNA driven by RNA polymerase. J. Am. Chem. Soc. 131:5722–5723.
56. Riedel, C., R. Gabizon, …, C. Bustamante. 2015. The heat released during catalytic turnover enhances the diffusion of an enzyme. Nature. 517:227–230.