CHAPTER
1 From Mendelian Genetics to 4D Genomics
1.1 SUMMARY
The gene frames much of modern genetics by acting as an independent unit of genetic information. The gene-defined genotype-phenotype relationship has been demonstrated by classical studies linking genes to specific genetic traits and Mendelian diseases. However, it is now apparent that most genetic traits cannot be explained by single genes or even a combination of many. Genomics was positioned to solve this challenge by searching for more genetic variants and quantitatively illustrating their combinatorial mechanisms. Although this approach appears promising to many, genomics has failed to identify common mechanisms of most complex traits. Where then do genetics and genomics fall short? A review of the field reveals that most genes do not, in reality, have independent functions, leading to a great deal of confusion about the role of genes in determining the phenotype. One could say that Mendel’s original pea experiments, which formed the foundation of modern genetics, should have already generated such confusion upon close analysis. In this chapter, the transition from genetics to genomics is briefly reviewed, as reflected by how the concept of the gene has changed during the genomics era. The initial enthusiasm for, and subsequent disappointment with, the Human Genome Project are addressed, as well as the lack of fundamental progress despite overwhelming data accumulation, which slows down bio-industry and medicine. This journey has now brought us to an urgent need for a new biological paradigm, which focuses on genome and evolution-based genomics and incorporates both emergent properties and cytogenetic organization.
Genome Chaos https://doi.org/10.1016/B978-0-12-813635-5.00001-X
Copyright © 2019 Elsevier Inc. All rights reserved.
1.2 THE EMERGENCE OF GENOMICS
“Genetics” had already come a long way when British botanist William Bateson coined the term at the first International Congress on Genetics in 1906 to describe a new science that explored heredity and variation, as initiated by Mendel’s publication on heredity in peas (Mendel, 1866). In the past 150 years, to understand the mechanism of Mendelian inheritance, researchers have zoomed in from the nucleus to chromosomes, from chromosomes to genes, and then from genes to DNA motifs. Such reductionist analyses have triumphed, leading to our understanding of the physical and chemical properties and structure of the gene, the mechanisms by which genes encode RNAs and proteins, the various models of gene regulation, protein modification and degradation, macromolecule assembly, and the link between gene mutations and phenotypic variants, including many human diseases. We also understand how to identify and manipulate specific genes and apply this knowledge to produce genetically modified foods and improve human health through molecular medicine.
The introduction of the double-helix model of DNA in 1953 and recombinant DNA technology in 1972 changed genetics forever (Watson and Crick, 1953a; Jackson et al., 1972). Molecular genetics has become the go-to field for new generations of biologists. Many bio-disciplines that were not gene-based withered. Moreover, the power of the gene has become a cultural phenomenon by capturing the general population’s imagination, thanks to many popular ideas. Richard Dawkins’s The Selfish Gene marked the onset of the gene-era hype in which everything was apparently controlled by genes, from individual proteins to specific biological traits and from evolutionary history to current health and behavior (Dawkins, 1976). This mode of thought assumed that all biological systems, including humans, serve the gene masters. We are merely the unwitting vehicles of genes. Genes are dominant, powerful, selfish, and mysterious.
Such gene-centric concepts have shaped modern biology, generating a great deal of excitement and expectation within science, medicine, bio-industry, and society in general. If only the path of future genetics was as clear and simple as just following the gene!
1.2.1 A Brief History of Genomics
Naturally, the ultimate goal of human genetics became hunting down all “disease genes” by molecular cloning and then correcting them by genetic manipulation such as gene therapy or eliminating them through prenatal screening. Suddenly, gene-based molecular genetics became the flagship of science, and the success of identifying gene defects responsible for human diseases further validated gene-based genetic approaches.
Positional cloning initiated an exciting wave of gene hunting. Following the first gene cloning success in 1986 for X-linked chronic granulomatous disease by Harvard Medical School’s Stuart Orkin, gene after gene associated with important disorders has been cloned, including those for Duchenne muscular dystrophy (cloned by Louis Kunkel at Boston Children’s Hospital and Ronald Worton from the Hospital for Sick Children in Toronto), cystic fibrosis (cloned by Lap-Chee Tsui from the Hospital for Sick Children in Toronto in cooperation with Francis Collins from the University of Michigan), Huntington disease, adult polycystic kidney disease, certain forms of colorectal cancer, and breast cancer. By 1995, about 50 inherited disease genes had been identified, highlighting the triumphant era of human molecular genetics (Collins, 1995).
Interestingly, even before the gene hunting movement reached its peak in the late 1980s to early 1990s, there were increasing concerns about the gene-centric reductionist approach, which led to calls for genome-based research, notably by Barbara McClintock and a number of evolutionary biologists and scientists who questioned genetic determinism. McClintock, the Nobel laureate who clearly recognized the importance of the genome in biology, specifically emphasized this in her 1983 Nobel Prize acceptance lecture at the Karolinska Institute in Stockholm:
In the future, attention undoubtedly will be centered on the genome, with greater appreciation of its significance as a highly sensitive organ of the cell that monitors genomic activities and corrects common errors, senses unusual and unexpected events and responds to them, often by restructuring the genome. We know about the components of genomes that could be made available for such restructuring. We know nothing, however, about how the cell senses danger and instigates responses to it that often are truly remarkable.
McClintock, 1984
It gradually became obvious that most genes do not have dominant phenotypes that display high penetrance in populations. Researchers also realized that even though it is possible to identify specific gene mutations in many single-gene Mendelian diseases, this success might not be transferable to many common and complex diseases because of the large number of potential genes involved. Clearly, a better strategy was to search for more genes throughout the entire genome, which was the rationale for moving from single-gene hunting to whole genome searches. For many, the advantage of focusing on the genome was merely to include more genes. In the mid-1980s, some key technologies became capable of analyzing more genes, such as DNA panels of rodent-human somatic cell hybrids for physical mapping, restriction fragment length polymorphisms (RFLPs) as variation markers for genetic mapping, polymerase chain reaction, automated DNA sequencing, and partial sequencing or mapping of several small microbial genomes. These methodologies and the increased use of computers for data storage and analysis served as the
necessary platforms for this new frontier of genetics. Then, the “perfect storm” came. In May 1985, Robert Sinsheimer, the Chancellor of the University of California, Santa Cruz, held a workshop there titled “Can we sequence the human genome?” Sinsheimer organized this workshop to present a stronger argument that such a project was significant and feasible following an unsuccessful attempt to extract funding from his University. Many leading researchers attended, including David Botstein, George Church, Ron Davis, Walter Gilbert, Lee Hood, and John Sulston, and they discussed potential problems, technologies, and a timeline as well as costs for the genome project. Despite the success of this workshop, Sinsheimer still failed to obtain any funding for his project. However, the meeting initiated a chain reaction (Sinsheimer, 2006).
In March 1986, new on the job and eager to establish a novel megaproject to bolster the genetic programs within the US Department of Energy (DoE), Charles DeLisi, the Director of the Office of Health and Environmental Research of the DoE, organized a conference at Santa Fe. Influenced by Sinsheimer’s workshop, this meeting also sought to determine the complete sequence of the human genome and map the location of each gene. Most significantly, in addition to discussing the desirability and feasibility of implementing a Human Genome Project, this meeting was crucial to pushing the idea of a full genome sequence onto the national scientific stage and converting it into a reality. DeLisi and others were able to begin the key task of garnering support from the DoE, the Reagan administration, and Congress (DeLisi, 2008). At the same time, Renato Dulbecco, a Nobel laureate recognized for discoveries concerning the interaction between tumor viruses and the genetic material of the cell, published an influential editorial piece in Science arguing that sequencing the entire human genome was the best way to solve the puzzles of cancer.
His argument has often been used as the rationale for genome sequencing, especially in later cancer genome sequencing. Another meeting worth mentioning is the 1986 Cold Spring Harbor symposium “The Molecular Biology of Homo Sapiens,” where the Human Genome Project was also debated in a “rump session” moderated by Paul Berg and Walter Gilbert. Although more voices urged caution, the discussion among the many molecular geneticists in attendance was essential to maturing this idea (Robertson, 1986). Also in late 1986, the National Academy of Sciences/National Research Council formed a committee on mapping and sequencing the human genome. Collectively, all these events led to the Human Genome Project becoming a reality. The genome research center was established in 1987 and included three National Laboratories of the Energy Department. An Office of Human Genome Research at the NIH opened its doors in 1988. Finally, an international organization named the
Human Genome Organization (HUGO) was established in 1988, and the rest is history. It is interesting to ask what caused Sinsheimer to act. He says he was influenced by other “Big Science” projects outside biology:
... As Chancellor, I had been involved in the conception of several large-scale scientific enterprises, involving telescopes (the TMT project) and accelerators, which were “Big Science,” scientific projects requiring, in some instances, billions of dollars and the joint efforts of many scientists and engineers. It was thus evident to me that physicists and astronomers were not hesitant to ask for large sums of money to support programs they believed to be essential to advance their science. Biology was still very much a cottage industry, which was fine, but I wondered if we were missing some possibilities of major advances because we did not think on a large enough scale ...
Sinsheimer, 2006
Similarly, why did the DoE initially play the leading role rather than the NIH? The NIH was correctly concerned about the potential shift of money away from investigator-initiated proposals to this big science project. Despite the fact that the DoE had funded studies of the biological effects of radiation for years, perhaps its historical link to some big projects like the construction of the atomic bomb in the Manhattan Project influenced the Department to undertake this gigantic project. The idea of sequencing the human genome to bolster the DoE’s research program was already circulated before DeLisi’s arrival. The report titled “Technologies for Detecting Heritable Mutations in Human Beings” by the Office of Technology Assessment hinted at the idea of sequencing the whole genome. A new wave of big science was coming. Nevertheless, the birth of such an enormous initiative like the Human Genome Project meant that genetics and biology would never be the same. It certainly marked the maturation of genetics and it also transformed genetics into genomics. There are different opinions regarding the relationship between the birth of the Human Genome Project and genomics. Some believe the Human Genome Project spawned a new science called genomics, while others think the birth of genomics was a gradual process that began from earlier efforts of gene mapping and sequencing that led to the Human Genome Project as it represented a necessary preliminary step before considering the feasibility of the Human Genome Project. Just as genetic research predated the use of the term “genetics,” genomics research predated the creation of its official name. It is hard to determine a defined timeline for the official birth of genomics compared with the Human Genome Project as they are intimately intertwined in both research context and historical timing. One thing was certain, however: the Human Genome Project became the primary goal and a major challenge for the young field of genomics.
The journal Genomics was launched in 1987 by Victor McKusick, a medical genetics pioneer who published a catalog of all known genes and genetic disorders called Mendelian Inheritance in Man (MIM), and Frank Ruddle, a gene mapping pioneer. In their introduction to Genomics, “A new discipline, a new name, a new journal,” they stated the following:
For the newly developing discipline of mapping/sequencing (including analysis of the information) we have adopted the term genomics ... The new discipline is born from a marriage of molecular and cell biology with classical genetics and is fostered by computational science. Genomics involves workers competent in constructing and interpreting various types of genetic maps and interested in learning their biologic significance. Genetic mapping and nucleic acid sequencing should be viewed as parts of the same analytic process, a process intertwined with our efforts to understand development and diseases.
The initial focus of the journal reflected the focus of the field of genomics which was well-laid out by McKusick and Ruddle in their first editorial piece (McKusick and Ruddle, 1987). It included the following topics: chromosomal mapping of genes, DNA fragments and gene families; sequence characterization of cloned genes and/or other interesting portions of genomes; comparative analyses of genomes to understand structural, regulatory, functional, developmental or evolutionary mechanisms; methods for large-scale genomic cloning, restriction mapping, and DNA sequencing; computational platforms/methods and algorithms to illuminate DNA and protein sequence data; understanding the hierarchy of chromosome structure; analysis of genetic linkage data related to inherited disorders; development of a genomic database; and parallel studies on genomes from different organisms. Thomas Roderick of the Jackson Laboratory coined the term “genomics” that would become the name of the new journal as well as for the new scientific field. According to Roderick, while attending a 1986 meeting with future editors in chief, McKusick and Ruddle: One evening, about 10 of us were at a raw bar, drinking beer and discussing possible titles for the new journal. We were on our second or third pitcher when I suggested ’genomics’. Little did we know then that it would become such a widely used term. Keim, 2008
1.2.2 Genetics or Genomics?
Since the emergence of genomics, the terms “genetics” and “genomics” have been given diverse definitions in the literature. Despite some definitional overlap, “genetics” is generally defined as the science of individual genes, heredity, and variation in living organisms, whereas “genomics” is a new discipline that studies the genomes of organisms.
The main difference between genetics and genomics is that genetics scrutinizes the functions and composition of single genes to illustrate how individual traits are transmitted from parent to offspring, whereas genomics addresses the structure, organization, and function (inheritance) of a genome by dealing with a large number (or the complete set) of genes and noncoding sequences and their nuclear topological and/or biological interrelationships (see Chapters 2 and 4). As no gene is an island and most genetic traits involve multiple genes and their complex interactions within environments, the scope of genomic research is drastically increasing. In particular, because the genome is not just a bag of genes (see Chapter 2), genomics has expanded past its genetic roots. Now, genomic concepts and methodologies generally dominate biological science. Single-gene research no longer fits under the genomics umbrella unless the aim of a specific study is to incorporate the gene or its associated pathway and elucidate its effect on the entire genome’s network (Genome.gov). It would be safe to say that genomics represents a new phase of genetics. Some scholars even refer to genomics as 21st century genetics. Knowing that the future studying of genetic information will undoubtedly involve the genome system rather than individual genes in isolation, the holistic platform of genomics might someday replace genetics altogether. The emergence of genomics is of ultimate significance to genetic research. First and foremost, it turned traditionally highly selective genetic research into less selective genomic investigation. Such a transition is reflected at both the research subject level and the system used. New genomics research focuses on large regions of the genome or the entire genome rather than specific and isolated genes of interest. 
Equally important, genomics allows a new research approach more amenable to direct analyses of natural populations rather than traditional genetics studies that are mainly dependent on highly specific model systems under controlled laboratory conditions. In fact, most genetics laboratories focus on model organisms as experimental systems, including Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis, various inbred mice strains, and established cell lines. These model organisms/systems clearly lack the diversity and heterogeneity of natural populations. Although some classic genetics studies have analyzed some natural populations, the scale is incomparable with genomics studies in terms of the whole genome approach and the size of the natural populations that are studied. Second, the birth of “Big Science” in biology has transformed genetics, challenging the previous small-scale hypothesis-driven system that was best suited to studying causative relationships in a defined linear system. By revealing the true complexity of biological systems, researchers will likely begin to question the genetic traditions of searching for causal
isolated genes or well-defined molecular pathways. The “big science projects” approach has brought both enthusiasm and uneasiness to the scientific community, as this is a change from traditional biological science where individual researchers carve out their own unique niche, testing their own hypotheses, sometimes for many years. A key challenge of large-scale genomic projects is using the correct framework to best integrate technologies (Heng and Regan, 2017). These large-scale genomic projects will likely conflict with traditional genetic knowledge, as when increased variants are involved, different principles are often applied. For example, when dealing with a complex adaptive system involving many different factors, the correlation study becomes more important as it is too hard to identify true causation. Third, the biomedical industry requires pragmatic reality checks more often than traditional academic institutions require. They require vigorous reviews to select molecular targets derived from basic research. The failure of any clinical trial could be devastating to a company regardless of how solid the lab-based research is. Such reality checks, which are now fed back into the research community, influence the direction of basic research, as policy makers and researchers are increasingly paying attention. For instance, it was industrial researchers who reported that the majority of representative “high-quality” cancer research papers are unrepeatable (Mullard, 2011; Begley, 2012). There is no incentive for academic research to carry out such analyses. However, it is crucial for pharmaceutical companies to make sure that their billion dollar drug development effort has a solid basis. Finally, because of the scale of funding, public interest is now a key component in genomic research. It is no longer enough to just explore billion dollar hypotheses for curiosity’s sake. Many basic genetic researchers are not happy with this new trend. 
They firmly believe that basic research takes time and will ultimately pay off in the long run and that scientific progress should not be unduly influenced by factors outside of science. However, the good old days of doing science purely for the accumulation of academic knowledge will likely not return. The days of moderate research budgets supporting individual labs and their genetic discoveries have given way to the megaprojects of the genomics age. This large-scale approach requires more public support and brings associated scrutiny. Understanding the new reality of genomics is critical, as the research community must educate the general public and be careful to avoid harmful, overreaching promises. For this reason, many previously “off-topic” issues have become inseparable parts of genomics itself. Science policy, ethical issues, and public interest are now on the agenda of most scientific conferences on genomics.
1.2.3 Fundamental Limitations of Traditional Genetics
Throughout the history of modern genetics, a chain of many “brilliant experimental designs” has generated our core knowledge of genetics, which formed the backbone of the gene theory. An interesting “open secret” is, though, that most of these famous milestone experiments are actually based on exceptional cases that can only be effectively demonstrated using specific model systems and under well-defined experimental conditions. For example, it is well known that Mendel’s classic paper, which perfectly illustrated the genotype-phenotype relationship between parent and offspring, demonstrated that genes function as defined independent informational units. This is still the basis of current genetic theory. However, it is much less appreciated that there were many preconditions or limitations for his beautiful illustrations.
First, it is difficult to replicate Mendel’s clear-cut patterns using most other species. In fact, Mendel himself failed to confirm his hypothesis in his own hands when he used hawkweed (as suggested by Karl von Nageli, one of the leading scientists at that time who had read Mendel’s seminal paper) and beans. Rather, some upsetting data began to appear: only for certain characteristics did the flowers follow the same pattern as his peas. The drastically increased data diversity presented in these other systems clearly caused increased confusion for Mendel.
Second, Mendel only selected 7 traits among 34 initially studied traits in peas to demonstrate his points. The rationale for reporting seven selected traits was likely that only these seven traits produced the most appropriate results to support his concept. It would be interesting to know what the data looked like for the majority of the traits unreported by Mendel. We know today that the phenotypes of most genes do not follow the Mendelian 3:1 pattern because a majority of genes do not truly function as straightforward independent units.
Instead, the expression of a genotype often involves multiple genes and complicated genomic and environmental interactions. Now, Mendel’s seven genetic factors have been linked to seven genes with molecular characterization. These famous characteristics are likely involved in a range of genetic causes (including simple base substitutions, changes to splice sites, and the insertion of a transposon-like element). Interestingly, these seven genes were either not linked or if linked, possibly not subject to his analysis (Reid and Ross, 2011), which allowed Mendel to see a distinct pattern of segregation. Clearly, in contrast to the popular viewpoint, it was not by luck that Mendel chose these seven
“perfect” characteristics, but by extreme trait selection. Indeed, Mendel’s character selection was described in his paper:
Some of the characters noted do not permit of a sharp and certain separation, since the difference is of a “more or less” nature, which is often difficult to define. Such characters could not be utilized for the separate experiments ...
Mendel, 1866
Third, Mendel had a strict selection criterion for each sample. He purposely avoided collecting “average data” by using exceptional samples in his experiments. For example, to compare the difference in stem length (one of his 7 traits), a long axis of 6 to 7 ft was always crossed with a short one of 0.75 to 1.5 ft. By using extreme cases rather than average long and short populations, the certainty of the data becomes much more impressive. Paradoxically, however, the pattern he discovered based on selection will not represent the majority of the data he ignored.
Fourth, Mendel tried his best to reduce environmental variations that could influence the data, such as growth conditions, timing of experiments, and the effect of all foreign pollen, which invariably created ideal systems with minimal environmental influences.
Together, Mendel had created a perfect yet highly exceptional system: perfect for a manipulated linear model with reduced variants, exceptional for the reality of genetics, where most genetic traits are not contributed by a single gene and heterogeneity dominates within a population. Mendel’s approach might be the reason why many scientists have had trouble replicating the same simple ratios he reported for these carefully selected traits. For example, when the sweet pea (a closely related species of the garden pea that Mendel used) was examined, the pattern of heredity was considerably more complicated than Mendel’s results (Bateson and Saunders, 1902). In fact, the independence of genes can be diluted when they are passed through generations. Furthermore, the rationale for classifying genes as dominant or recessive has been challenged since the beginning of the last century, when data showed that genetic traits can be dominant, recessive, neither, both, or one of many statuses in between (Weldon, 1902; Radick, 2015).
The effect of a gene is constrained or defined by the hereditary background (ancestry) and environments, and the determinist’s viewpoint of the gene might be an illusion for the majority of species. After carefully analyzing data from Mendel and other well-known researchers working on related systems, Weldon concluded the following:
... I think we can only conclude that segregation of seed-characters is not of universal occurrence among cross-bred Peas, and that when it does occur, it may or may not follow Mendel’s law. The law of segregation, like the law of dominance, appears therefore to hold only for races of particular ancestry. In special cases, other formulae expressing segregation have been offered, especially by De Vries and by Tschermak for other plants, but these seem as little likely to prove generally valid as Mendel’s formula itself.
Weldon, 1902
Interestingly, the above paper systematically challenged the data presentation and legitimacy of Mendel’s theory immediately following the rediscovery of the laws of genetics. Based on his understanding of the pea varieties and their pedigrees, Weldon was convinced that Mendel’s law had no validity beyond the artificially purified experimental systems that had been created. He not only calculated that the odds of getting worse results were 16 to 1 (based on Mendel’s data) but also illustrated the challenge of classifying continuously variable characteristics of the pea (green or yellow for seed color, round or wrinkled for seed shape) using binary categories (dominant vs. recessive). His analyses hinted at the high possibility of cherry-picked results on Mendel’s part.
Ronald Fisher also thought that the data from Mendel were too good to be true. Given Fisher’s reputation in data analysis, his viewpoint has been more influential than Weldon’s. Three decades after Weldon, Fisher published a paper to elaborate on this issue. In his paper, Fisher argued that Mendel knew how his data should look according to his theory, and he carefully planned his experiments to support his theory. Fisher even guessed that some data must have been quietly removed to support the theoretical prediction (Fisher, 1936). This paper, in addition to some later and more direct accusations of data falsification, gave rise to the so-called Mendel-Fisher controversy. Still, it is now accepted that most accusations and suspicions have turned out to be groundless (Hartl and Fairbanks, 2007; Franklin et al., 2008). The editor of Classic Papers in Genetics, James A. Peters, wrote the following introduction to Mendel’s original paper that laid the foundation of modern genetics (Peters, 1959):
... There have been comments made that Mendel was either very lucky or tampered with his data, because his results are almost miraculously close to perfect ...
As to the second charge, that he might have arranged his data so as to shed the best possible light on his conclusions, I believe that the only way he might have manipulated his data is through omission of certain results that would have led to unnecessary complications ... Mendel probably knew of these interrelationships ... The fact he chose to utilize only those characteristics that fitted his concepts cannot be interpreted as an act of dishonesty on his part ...
Peters, 1959
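To make the statistical side of this debate concrete, the following is a minimal sketch (not Fisher’s or Weldon’s actual procedure) of how observed counts from a single-gene cross can be compared with the expected 3:1 ratio. It assumes a hypothetical Aa x Aa cross with complete dominance; the function names `simulate_monohybrid_f2` and `chi_square_3_to_1` and all counts are illustrative assumptions, not data from Mendel’s paper.

```python
import math
import random


def simulate_monohybrid_f2(n_offspring, seed=0):
    """Simulate a hypothetical Aa x Aa cross with complete dominance.

    Returns (dominant_count, recessive_count).
    """
    rng = random.Random(seed)
    dominant = 0
    for _ in range(n_offspring):
        # Each parent transmits 'A' or 'a' with equal probability.
        alleles = (rng.choice("Aa"), rng.choice("Aa"))
        if "A" in alleles:
            dominant += 1
    return dominant, n_offspring - dominant


def chi_square_3_to_1(dominant, recessive):
    """Goodness-of-fit test of observed counts against a 3:1 ratio.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    total = dominant + recessive
    expected_dom = 0.75 * total
    expected_rec = 0.25 * total
    stat = ((dominant - expected_dom) ** 2 / expected_dom
            + (recessive - expected_rec) ** 2 / expected_rec)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Illustrative run with simulated (not historical) counts.
    dom, rec = simulate_monohybrid_f2(1000, seed=1)
    stat, p = chi_square_3_to_1(dom, rec)
    print(f"dominant={dom} recessive={rec} chi2={stat:.3f} p={p:.3f}")
```

In this framing, a single result that sits close to 3:1 is unremarkable. The “too good to be true” argument concerns many experiments that all fit the expectation more closely than random sampling would typically allow, which is why repeated runs with different seeds are more informative than any one run.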
Judging by Mendel’s candid presentation in his publication with all the details of data selection and, in particular, knowing that he was increasingly puzzled when he worked on other species, it is clear to us that his extreme selective reporting was not because of his dishonesty but the natural unconscious bias that comes with science research, as these improbably “perfect” data can only be generated from highly selected artificial systems that he created. The dilemma Mendel faced was how to balance the art of selecting beautiful but exceptional data to unveil hidden scientific principles while
avoiding the fundamental misunderstanding that comes from ignoring the general features of the system under study. The majority of genetic researchers favor Mendel’s approach. They argue that it is absolutely necessary, and sometimes the only option, to select specific conditions or unique models to illustrate certain aspects of nature, which is the rationale for using models to simplify nature and eliminate variables. In fact, the selection of an appropriate system to address the right questions is a key to success in science. Only when we discover the mechanisms in a specific and often the simplest system can we add more elements to modify our theory so that it best fits reality. Of course, they often cite Mendel as an excellent example. Mendel’s law fits single-gene heredity best and thus provides the basis for understanding the heredity of multiple genes in more complex cases.
However, scientists must be aware of the fundamental limitations of this approach, as the more unique and elegant the model system used, the more limited the conclusions will be for generalization. For example, the following questions need to be addressed to understand the limitations of Mendel’s laws: should we classify genes based on a dominant-recessive relationship while knowing that a large number of genetic variants cannot be explained by such binary categories of genes? If there is no clear-cut relationship between most individual genes and phenotypes, should we still consider Mendel’s law the law of genetics? What if laws based on a simplified system (like the single gene-phenotype relationship detected by Mendel) are drastically different from real-world complex systems (the majority of biological cases where all genes are connected and environmental variation plays an important role)? Moreover, should we clearly point out that Mendel’s hypotheses only represent exceptions before crowning them as the law of genetics?
Such a dilemma has occurred throughout the modern history of genetics, yet most textbooks fail to warn readers that many well-accepted concepts of genetics are fundamentally limited because of key differences between simplified models and reality. By comparing classical model systems and their derived laws of genetics, common features can be summarized using single-factor analysis to link a single genetic element to limited phenotypes by ignoring links with high complexity. Often, the selected model system ideally illustrates a causative relationship of an investigator’s favorite concept, as these model systems become more or less linear with artificially enhanced certainty. In a sense, each model system offers some low-hanging fruit, but they are the exceptions in the real world as these pure systems artificially amplify given genetic contributions by eliminating other important factors that exist in natural systems. This way of thinking in genetics has lasted for over a century without any serious challenge. We often validate data using artificial models
rather than real-world situations. A major and unfortunate trend in the field is to publish “positive” data and not report “negative” data or data that do not make sense. Many “clear-cut” stories have been published. Although these stories are academically interesting, they have limited practical implications. With the advances of human genetics and medical genetics and the increased popularity of the gene in society, there has been high hope to fix gene mutations to fight many common and complex diseases. For the first time in the history of genetics, theories can be directly examined using various molecular genetic methods on many human diseases. Paradoxically, however, the progress has been slow, and the knowledge gap between genetic concepts and clinical realities has drastically increased. First, it has been challenging to link non-Mendelian diseases to specific genes. Second, the genetic heterogeneity is overwhelming. Third, environment–gene interaction plays dominant roles for disease phenotypes. Fourth, the disease progression/response to treatment is an adaptive system where the power of genetic prediction is drastically reduced. All of these features raise some profound questions: if Mendel’s law is correct, and if many diseases are caused by multiple genes, then why is it so difficult to identify these key gene mutations in most common and complex human diseases given our advanced molecular tools? Do most genes really serve as independent informational units (Pigliucci, 2010), given the fact that the function of individual genes does not divulge the emergent properties of a genetic network (Chouard, 2008)? Obviously, the time to rethink the laws of genetics and move the field forward (from Mendel’s extreme selection to the real world of genetic and environmental complexity) is long overdue. Such a challenging transition will likely generate much confusion, as it did for Mendel. 
It is thus interesting to know Mendel’s thoughts about his hypotheses following his unsuccessful experiments on hawkweed and other species, which ultimately might question his observations in pea plants. This issue might also relate to the pity that he did not continue his research after he published his milestone publication. A common explanation was that he interrupted his research because of more duties and issues from the monastery. Knowing his increased confusion when dealing with different species, it is not totally unreasonable to speculate that this confusion also contributed to the discontinuation of his remarkable experiments. Finally, the effort of discussing the key limitation of Mendel’s laws is not simply to discredit research based on simplified systems, as initially building knowledge based on low-hanging fruit is a common practice in science. That is why both Newton’s laws and Einstein’s theory of relativity represent keystones in physics. However, there is a crucial difference between many physical/chemical laws and genetic laws. Physical/ chemical laws are supported by the vast majority of experiments, with
limited exceptions, whereas genetic laws are only correct in exceptional cases. For example, Newton’s second law, F = ma (force = mass × acceleration), is supported by nearly all experimental data, except when velocity is close to the speed of light (when the special theory of relativity is needed). In contrast, as we just discussed, Mendel’s laws can only be supported by limited experimental data from very limited cases. Why is there such a huge difference between physical laws and Mendel’s laws of genetics in terms of application toward a majority of cases? This question is not only highly significant to rethink the future of genetics but also interesting in relation to the philosophy of science. In addition to the obvious feature of heterogeneity in biology, which could complicate the prediction power of genetic laws, we need to examine what Mendel had done to initially discover and then establish the laws of genetics. On the surface, Mendel followed similar patterns that other giants of science used when searching for his scientific theory: his initial surprised observation was that the same characteristics kept appearing with unexpected regularity when he crossed certain varieties. He thus designed systematic cross experiments with highly selected traits. His observations included the disappearance of the recessive phenotype from F1, the recovery of the recessive phenotype from F2, and a dominant to recessive phenotype relationship that closely matched a 3:1 ratio. He then introduced a model to explain how the independent genetic factors (both dominant and recessive) can be separated and recombined during the cross without dilution by their counterparts. By scoring the number of offspring, the genetic factors and their defined phenotypes can be illustrated simply by the numbers! His analyses thus validated his models, which led to the laws of genetics. What were the potential problems then? 
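The segregation model described above can be made concrete with a small sketch. It assumes the idealized textbook case the chapter is questioning: one factor with two alleles (written here as the hypothetical labels `A` and `a`), complete dominance, and equally likely gametes. The counts passed to the chi-square helper (6022 yellow and 2001 green seeds) are the widely cited F2 seed-color totals from Mendel's 1866 paper; they are supplied here as an outside assumption and are not stated in this chapter. The function names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent_a: str, parent_b: str) -> Counter:
    """Enumerate the equally likely offspring genotypes of a one-factor cross."""
    offspring = Counter()
    for gamete_a, gamete_b in product(parent_a, parent_b):
        # Sort so that 'aA' and 'Aa' are counted as the same genotype.
        offspring["".join(sorted(gamete_a + gamete_b))] += 1
    return offspring


def recessive_fraction(genotypes: Counter, dominant: str = "A") -> Fraction:
    """Fraction of offspring showing the recessive phenotype (no dominant allele)."""
    total = sum(genotypes.values())
    recessive = sum(n for g, n in genotypes.items() if dominant not in g)
    return Fraction(recessive, total)


def chi_square_3_to_1(dominant_count: int, recessive_count: int) -> float:
    """Pearson chi-square statistic of observed counts against a 3:1 expectation."""
    total = dominant_count + recessive_count
    expected = (0.75 * total, 0.25 * total)
    observed = (dominant_count, recessive_count)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


f1 = cross("AA", "aa")   # pure-breeding parents: every F1 plant is a hybrid
f2 = cross("Aa", "Aa")   # hybrids crossed with each other
print(dict(f1), recessive_fraction(f1))
print(dict(f2), recessive_fraction(f2))

# Assumed counts (Mendel 1866, seed color): 6022 yellow, 2001 green.
chi2 = chi_square_3_to_1(6022, 2001)
print(round(chi2, 3))
```

With one degree of freedom, a statistic below 3.84 is consistent with the 3:1 expectation at the conventional 5% level. The sketch only shows what the idealized model predicts once every plant has been forced into one of two classes; it says nothing about how the "in between" phenotypes should be scored.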
First, the phenotypes were not correctly classified (the initial observations were not very solid). There was no clear cut between dominant and recessive phenotypes; rather, there were many “in between” phenotypes. For example, in between the green and yellow seeds, there are many nontypical greens and nontypical yellows. If a careful classification is used, the data distribution would be far from a 3:1 ratio. The same is true for the seed shapes, as well as other traits, challenging the most basic assumption that phenotypes should be divided into dominant and recessive classifications. From the initial observation to the validation of the model, the data presentation was problematic. Without a solid factual basis, any “law” will inevitably fail. Second, it is possible that under specific conditions, some exceptional “strong” traits might display the pattern close to 3:1 ratio. However, we
should not generalize these into the general law. A more realistic model should be established to explain most biological cases. It should be pointed out that, in fact, Mendel did describe some inconsistencies in other species. However, these important discussions/confusions were ignored by other educators who were keen to tell the successful and easy-to-understand story of Mendel’s law. Again, in James Peters’ introduction for Mendel’s classical paper, he wrote: I have not included the last few pages of Mendel’s original paper, which dealt with experiments on hybrids of other species of plants, and with remarks on certain other questions of heredity. These paragraphs have little bearing on the principles Mendel proposed in this paper, and I have found from experience with my students that these pages serve primarily to confuse rather than to clarify.
That is potentially problematic. Many scientific concepts, in a clear-cut and well-designed system, are simple, precise, and even beautiful. However, when put into a broad context, or seen through the lens of reality, they can become confusing, conflicting, and hard to understand. We do need to show the real picture of science to students. It is a crucial way to illustrate the limitations of some beautiful theories; we cannot just retell rosy stories. That is partly why most researchers nowadays firmly believe they can finally identify key common genetic factors for most complex traits despite how difficult the task actually is. They say, “Mendel did it, why not us? It’s just a bit more complicated than his single gene traits.” Similar examples can be found in cancer research, evolutionary research, and many other fields of biology. Now, knowing the reality behind Mendel’s data, it is up to us to change our attitudes toward genetic approaches.
1.3 DIMINISHING POWER OF GENE-BASED GENOMICS
The past 30 years of genomic research has transformed biological research as well as increased interest and expectations of the general public toward science. This is particularly true once the sequencing of the human genome was successfully completed, an achievement that has been praised as the greatest scientific achievement of mankind, as the entire sequence represents the “book of humanity” and “language with which God created life.” During a joint announcement of the United States and United Kingdom on June 26, 2000, surrounded by two teams of scientists, US president Bill Clinton proudly announced that
“It is now conceivable that our children and our children’s children will know the term cancer only as a constellation of stars”. According to him, “Genome science [...] will revolutionize the diagnosis, prevention and treatment of most, if not all, human diseases.” The White House Office of the Press Secretary, 2000a
Headlines appeared all over the world following this announcement. The New York Times’ banner proclaimed, “Genetic code of human life is cracked by scientists.” Time Magazine made it their cover story. The Guardian called it, “The breakthrough that changes everything.” The Wall Street Journal opined, “This is truly big stuff,” and the Economist read, “The results are a huge step toward a proper understanding of how humans work.” Such hyperbole was not created by politicians in concert with the media but came directly from the scientific genomics community, particularly from many of the leaders who functioned as scientific advisors to politicians as well as spokespersons to the general public. All media information came from these scientists’ estimates of the impact that sequencing would have. The following are a few examples. Francis Collins, then the head of the US Genome Agency at the National Institutes of Health, said: “It is probably the most important scientific effort mankind has ever mounted, and that includes splitting the atom and going to the moon.” He predicted the genetic diagnosis for cancer would be achieved in 10 years and in another 5 years, the development of treatments. “Over the longer time, perhaps in another 15 or 20 years, you will see a complete transformation in therapeutic medicine” (The White House Press Release, 2000b). Roland Wolf of the Imperial Cancer Research Fund said: “The sequencing of the human genome represents one of the great achievements in human science. It really will be a landmark in the evolution of man.” Mike Stratton, head of the cancer genome project at the Sanger Center in Cambridge, stated: “Today is the day in which the scientific community hands over its gift of the human genome sequence to humanity. This is a gift that is very delicate, very fragile, very beautiful ...” The genome was picked by Science as “Breakthrough of the Year” in 2000. 
According to Science, compiling maps and sequences of genetic patterns “might well be the breakthrough of the Decade, perhaps even the Century, for all its potential to alter our view of the world we live in.” Nearly two decades have elapsed since these pronouncements and exciting predictions were made. During these hopeful 19 years of hard work, with the genome sequencing information in hand, many other large-scale methods have been developed and used in genome research and clinical studies (Heng et al., 2011a). Some examples include whole genome scanning to hunt for genetic loci responsible for human diseases; global gene expression profiling to identify the pattern of
diseases needed for diagnosis and treatment; copy number variation analysis to understand the genetic mysteries that gene mutations cannot explain; classification of noncoding sequences such as the ENCODE project (the Encyclopedia of DNA Elements) (Harrow et al., 2012); Human Epigenome projects; and other “omics” projects such as massive parallel sequencing and analysis of personal genomes. Because of these cutting-edge technologies, the future of medicine now appears much clearer to us. In the near future, the prediction is that each physician’s office will be equipped with desktop genetic diagnostic machines. A few drops of blood should offer a look into the future disease potential of each individual patient. Such documentation of personal genomic information will then be the basis of selecting target-specific drugs or even gene therapy. This ultimate goal of individualized medicine, referred to by the buzzword “precision medicine”, appears to be around the corner. Such expectations are obviously overstated as they are based solely on the belief that individual genes and proteins have a linear causative relationship with diseases and treatment response. And so there continues to be disappointment after disappointment that contests such linear relationships, and there is growing concern (that was previously dismissed) about the current direction of genomics. A typical argument is that the only way to overcome these obstacles is by being positive and continuing to work hard. Additional data will ultimately reveal the truth. In today’s world of positive outlooks, it is more fashionable to have a “half full” outlook rather than a “half empty” one. Clearly, incorrect approaches and irresponsible media promises have been interpreted in a positive and forward-looking light. The issue here is not whether scientific attitude is positive (half full) or negative (half empty). 
Rather, we must critically evaluate and choose the correct scientific framework when there is clear evidence that the current paradigm is not working. A more rational question is why, in spite of all these technological breakthroughs, have only limited knowledge and applications been achieved? Why do scientists not understand the full picture? By just amassing more data, will they be able to figure out the correct framework? From the Ptolemaic view to the Copernican concept or from Newton’s laws to the world of Einstein, it is known that the simple accumulation of data does not generate a revolution. Unfortunately, many biologists keep busy only with data collection. There is an entrenched mindset that if it is not the gene itself, then it must be something interconnected such as regulatory elements, or copy number variation, or noncoding RNA, or something that is on the horizon and we just have not dug deep enough yet to find it. This has led to wave after wave of popular approaches within the research community where
few have questioned the gene-based framework of thought. Grouping all these isolated concerns that represent different factions of both the academic and bio-industry communities sends a message that very powerfully questions the status quo. Researchers cannot afford to continue to ignore this message.
1.3.1 The Ignored Voice of Antigenetic Determinism
A broader view of genetic determinism (that genes determine biological fate) has been with us since long before we obtained knowledge of how genes code proteins. The narrow view of genetic determinism (that genes determine human behavioral phenotypes), in fact, was closely associated with the eugenics movement in the late 19th and early 20th century. For example, besides the infamous actions of Nazi Germany, the sterilization of people with so-called “bad genes” was encouraged. Many states in the United States had even adopted laws to reduce “unfit” populations. It is thus ironic to see that a similar but much broader idea once again comes into play with the onset of gene-centric molecular genetics, and in particular with the growing excitement of the Human Genome Project (Kevles and Hood, 1992). In the late 1980s, to lobby the US Congress into funding the Human Genome Project, James Watson declared that “We used to think our fate was in our stars. Now we know, in large measure, our fate is in our genes,” which highlighted the popular viewpoint of genetic determinism. The core tenet of genetic determinism is that genes determine how an organism works: if we know enough about what genes are and how genes “act,” we could understand all of biology (Keller, 1993); and the gene is an independent genetic information unit. No matter how complicated a given genetic process is, it can be dissected into its basic genes as its causative units. Despite the fact that there is no strong direct evidence to back it up, genetic determinism has become incredibly popular in the field of molecular genetics. According to Keller, “... it is hard to see what might be controversial in such claims. The attribution of agency, autonomy, and causal primacy to genes has become so familiar as to seem obvious, even self-evident” (Keller, 1993). 
Nevertheless, there have been a handful of scholars who continuously pointed out the fundamental flaws of genetic determinism. Their arguments include the following: organisms not only inherit genes made of DNA but also a complex structure of cellular machinery made of proteins (Lewontin, 1993); there are complex dynamics between nuclear and cytoplasmic elements; epigenetic features are also inheritable; and bioprocesses are adaptive systems with emergent properties.
Unfortunately, most of these credible points failed to transform the field (beyond the increased appreciation of the importance of epigenetics). The majority of epigenetic studies still focus on how to make sense of gene-centric frameworks rather than search for new concepts outside the “box of genes”. Interestingly, people do read these antigenetic determinism ideas, and some admit (often in private) that these points make sense, but few have changed their research methods. There is thus a big gap between the logical way of how science should be done and the way people practice science in their daily life. In particular, the excitement and high expectation of the Human Genome Project pushed genetic determinism into a new high. For example, genetic determinists have predicted that sequencing the human genome would solve multiple theoretical and practical problems within human biology (Pigliucci, 2010). When the Human Genome Project was completed 17 years ago, some well-known scientists cautioned against over-celebrating and overestimating the meaning of these findings as based on genetic determinism. Nobel laureate David Baltimore wrote the following in his piece “DNA is a reality beyond metaphor”: The drumbeats get louder as we approach the day when the first draft of the entire structure of the human genome is to be announced. Pundits appear on television shows, trying to tell the public what this means. Many are my good friends. But I must tell their dirty little secret. They are not telling the whole story. ... they tell the world that the genome is like a book, with words, sentences and chapters. ... the periodic table for biologists. But these and other metaphoric links miss the real story. The genome is like no other object that science has elucidated. No mere tool devised by humans has the complexity of representation found in the genome. Baltimore, 2000
After reflecting on why it is so challenging to understand sequencing information, he concluded that we should not mistake progress for a solution. According to Baltimore, it will take at least another 50 years to fully understand the meaning of these DNA sequences. In that time, we should “try harder, but with richer and more honest” analyses of the true nature of the genome. In fact, even before the drumbeat of celebration for the genome project began, there were serious concerns about the direction in which genomics was heading as influenced by genetic reductionism. The late Richard Strohman, an emeritus professor of molecular and cell biology at UC Berkeley, wrote numerous papers offering his criticism on the highly publicized genome project. In contrast to genetic determinism, Strohman proposed that the most common and complex human diseases and behaviors (including beliefs and desires) could not be reduced to purely
genetic influences. Moreover, he believed that genetic determinism was unable to reconcile the increasing findings of enormous biological complexity and that awareness of this fact demanded a new holistic scientific theory of living systems. He further called for a biological revolution using Thomas Kuhn’s criterion. Interestingly, in his view, biological science had not been on the cusp of a Kuhnian revolution until now. Additionally, he argued that the most acclaimed Watson-Crick era had, in fact, marked a departure from organismal biology and caused a wrong turn toward gene-dominated genetic research. The Watson-Crick era, which began as a narrowly defined and proper theory and paradigm of the gene, has mistakenly evolved into a revived and thoroughly molecular form of genetic determinism. Strohman, 1997
He realized that the current genetic paradigm no longer worked in light of the importance of epigenesis and bio-complexity. This is the first time in history of the life science that a single generation has been able to live through the rise and fall of a single dominant paradigm. It is a deeply disturbing experience, especially for those who have followed the radical change from a distance, and especially given the enormous investment our culture has made in ideas tied to a hopelessly ineffective, linear causality and determinism. Strohman, 1999 (permission from Wild Duck Review)
Unfortunately, he did not witness the onset of his revolution before his death. He probably understood the reasons why this did not happen. He listed a few key challenges for Kuhn’s paradigm shift: One is incommensurability, where the scientists on either side of the paradigmatic divide experience great difficulty in understanding the other’s point of view or reasons for adopting it. Second is the accumulation of anomalies, wherein “normal science” of the current paradigm unintentionally generates a body of observations which not only fails to support that paradigm, but also points to glaring weaknesses in its method and theoretical outlook. ... Third, paradigm shifts encounter resistance to change from the old guard, which is based not only upon a scientific incommensurability but on traditional ways of teaching and training the new generations in the (old) ways of research ... (permission from Wild Duck Review)
He then emphasized the importance of the arrival of a competent paradigm capable of replacing the old one. Clearly, key elements instrumental to arriving at a new and contending paradigm were not yet in place, as there is no new and competent paradigm in the field of genomics. Strohman had great interest in epigenesis and he had searched for a holistic theory of living systems. But he realized that the epigenesist approach provided only partial answers. He represented a fighter for a new paradigm before its arrival. But just what exactly is the new paradigm?
An interesting observation is that many scientists who vehemently criticize current reductionism-centered genomics and the “sequencing for the sake of sequencing” projects are senior scientists who are less directly involved in sequencing and many are close to retirement. While Kuhn stated that “a scientist cannot remain a scientist and at the same time be without a paradigm,” the reason that “scientists cannot at the same time practice and renounce the paradigm under which they work” is also a political and economic one. Senior, established scientists have the luxury of being able to critically review their own careers, and they may decide that the direction they took did not work despite a lifetime of effort. They are allowed this luxury because these senior scientists are no longer constrained by outside influences; they are free to speak the truth as they see it without worrying about academic politics in relation to funding, publications, or tenure. Perhaps, it is more interesting to note that in the postgenome era, even leading genomic researchers who highly praise the achievements of the Human Genome Project frankly convey to the audience that many of the basic concepts of genetics they have taught students in recent years have simply been deemed incorrect. While such true statements are often used to glorify their ongoing research, they evidently reveal that there is a big elephant in the lecture hall: if the new data disprove previous concepts, where is the new paradigm in genomics? Scientific revolution constantly brings down previous paradigms no matter how solid and valued they once appeared. Practically speaking, large portions of our current knowledge will eventually be found to be somewhat wrong or inaccurate. 
When looking back at the history of our current scientific era in the future, one intriguing question might be: why did genomics scientists in the 21st century not actively search for a new paradigm, given that they (1) knew the history of science, (2) were familiar with Karl Popper’s and Thomas Kuhn’s concepts of how science works, and (3) faced daily accumulating facts that did not make sense and contradicted the current paradigm?
1.3.2 The Rise and Fall of the Gene
Genetic determinism was spawned from the gene concept and has served to further cement veneration of this concept that scientists created. The gene is the foundation of genetics, and genetics itself can be described as the history of understanding the gene. The definition of the gene has been constantly refined during the history of genetics and genomics. What started as an abstract idea acquired the physical identity of coded DNA molecules. The idea developed from a one gene–one enzyme hypothesis to a one gene–one peptide idea and progressed to increasingly complex explanations that had increasing uncertainty. With the development of
genomics, however, the concept of the gene has been seriously challenged. First, the classic notion of genes as discrete units in the genome is no longer correct as there is separation between coding sequences and control regions that can even reside on different chromosomes. Based on the complex relationship among all genes and different splicing forms, the “unit of inheritance” is also questionable as most genes cannot be functionally isolated within a complex genome system. The general definition that “genes are functional regions within DNA molecules, and their mission is to code the instructions to make specific proteins” is also no longer correct. Gerstein et al. have briefly reviewed the major conceptual changes in the definition of the gene over time. From the 1860s to the 1960s, the gene was accepted as a discrete unit of inheritance. Specifically, during the 1910s, the gene was considered as a distinct genetic locus; during the 1940s, as what codes an enzyme; during the 1950s, as a DNA molecule; during the 1960s, as a transcribed code; from the 1970s to the 1980s, as an open reading frame sequence pattern; from the 1990s to the 2000s, as a genomic sequence defined by annotating methods; and finally in the post-ENCODE era, as a union of genomic sequences encoding a coherent set of potentially overlapping functional products (Gerstein et al., 2007). Clearly, the genome project has paradoxically only brought increased uncertainty to the concept of the gene. The definition of a gene is clearly influenced by concepts of inheritance and technology-defined experimental findings. There has been a struggle to balance data generation and synthesis. During synthesis, the framework used and types of data on which this framework is based are most crucial. The history of the rise and fall of the gene concept clearly reflects this. 
Despite drastic changes brought on by different technologies during the past 150 years regarding genetic material, what has not changed is the notion that the genotype determines phenotype, and genes are key information units that determine genotype. Like many scientific concepts, the definition of a gene has gone through a cycle of uncertainty, certainty, and then uncertainty. Before genetic elements were linked to chromosomes, the gene concept was uncertain. Once the DNA model was established, it became highly certain. But on completion of the human genome sequencing and particularly the ENCODE project, once again it became highly uncertain as there were many doubtful components within the current definition. For example, the post-ENCODE definition tries to comprehend the complexity revealed by sequencing analyses. In fact, far before the genome sequencing era, there were serious concerns about the gene and its definition from some well-known thinkers. For example, R. C. Lewontin suggested that the process of inheritance would be better understood by developing a “geneless” theory of
heredity (Lewontin and Lewontin, 1974). With the increased knowledge of the biology of genes, an even bigger challenge is that many generally accepted definitions of the gene do not address the key issue that “it’s rare that a gene can be determined to have caused any particular trait, characteristic or behavior” (Keller, 2002). According to Denis Noble: Relating genotypes to phenotypes is problematic not only owing to the extreme complexity of the interactions between genes, proteins and high-level physiological functions but also because the paradigms for genetic causality in biological systems are seriously confused. Noble, 2008
To address this confusion, researchers must reevaluate the gene theory that considers the gene as an independent informational unit and move to a new paradigm that considers genes to be parts or tools of a coding system rather than precisely defined informational entities. Such a new paradigm holds that the genotype is not simply a collection of individual genes that define traits and that given DNA sequences do not directly determine the defined function; instead, it sees the final function as an emergent property of the genome context (Ye et al., 2007; Heng, 2009; 2015; Heng et al., 2011a). With the rapid accumulation of genomic information, the significance of the individual gene has declined. It is interesting to note that 30 years after its publication, even the author of The Selfish Gene perceives the reduced star power of the gene. In 2006, in an event celebrating “The selfish gene: thirty years on,” Richard Dawkins stated: I can see how the title “The Selfish Gene” could be misunderstood, especially by those philosophers, not here present, who prefer to read a book by title only, omitting the rather extensive footnote which is the book itself. Alternative titles could well have been “The immortal gene,” “The Altruistic vehicle,” or indeed “The Cooperative Gene.” Edge, 2006 (with permission from Edge)
How the power of the gene has been altered in 30 years! The change from “Selfish Gene” to “Cooperative Gene” as suggested by the author himself is really a complete reversal of the basic premise of the Selfish Gene concept, a concept that was born amid the heyday of the gene. Its popularity was further enhanced in the nonscientific community by Dawkins’s book. The “Selfish Gene” was a brilliant title for capturing the attention of the public. The “Cooperative Gene” likely would not have enjoyed the same level of success. Daniel Dennett recorded his first impression on reading The Selfish Gene at the same 2006 meeting: When I first read The Selfish Gene ... I was struck by the very first paragraph, and by one of the chief sentences in it:
We are survival machines, robot machines, blindly programmed to preserve the selfish molecules known as genes. The author goes on to say, “This is a truth which still fills me with astonishment.” Thirty years on I think the question that can be raised is, are we still astonished by this remarkable inversion, this strange inversion of reasoning that we find in this claim?
Dennett has perhaps already found his answer in Dawkins’s statements regarding the book’s title. Interestingly, after 30 years of veritably worshiping the gene, there is more astonishment in the reality that the imagined superpower of the gene does not in fact exist. Discussion of the changes in prevailing ideas about genes is important. It signals a growing shift in the gene-centric outlook on biology. The selfish gene concept fits well with the gene-centric view, whereas the cooperative gene concept implies that the function of an individual gene does not translate directly to phenotypes and calls for a new theory beyond genes. There is extensive discussion in the literature regarding the relationship between competing selfish genes and cooperative systems, particularly when game theory is used to address the issue (Nowak, 2006). However, this subject needs to be reexamined from evolutionary and genome perspectives (see Chapters 2 and 4). Perhaps it is now time to write a new chapter on genomics that no longer relies on the power of individual genes. It was long thought that all DNA sequences that did not code for genes were junk DNA; the ENCODE project has since all but declared that the concept of junk DNA is no longer relevant (Pennisi, 2012). Because junk DNA was originally defined relative to the protein-coding concept of the gene, the death of the “junk DNA” idea also challenges the concept of the gene. For example, researchers have argued that the basic unit of heredity should be the transcript (the piece of RNA decoded from DNA) and not the gene. However, even the transcript cannot be considered an independent informational unit from a genome point of view. As for the majority of traits in the real world, only the genome package (not its parts) can serve as the platform for emergent genetic information.
Defining the function of genes and noncoding sequences using the genome “package” concept rather than as individual dissected parts is now needed, in particular within the context of micro and macroevolution (Chapters 6 and 7).
1.3.3 A Reality Check of the “Industry Gene” Concept
The birth of the biotech industry was surrounded by the excitement of various gene manipulation technologies. The first biotechnology company, Genentech Inc. (Genetic Engineering Technology), was founded in 1976 (the same year The Selfish Gene was published) by a venture capitalist and a pioneer in the field of recombinant DNA technology. Within a short
period, the biotech industry was booming. One incredible example is the success story of AMGen Inc (the company’s original name was Applied Molecular Genetics) in the 1980s. AMGen cloned the erythropoietin gene and marketed its gene product as “Epogen” (EPO, a glycoprotein hormone that controls erythropoiesis, with therapeutic uses in diseases such as anemia). It remains one of the most successful biotech products to date. Following the advice of bio-scientists (many of whom were company advisors), everything seemed to be going as planned. With the cloning of increasing numbers of disease genes, there were high hopes that such gene products would soon revolutionize medicine. Experts even introduced the “industry gene” concept and declared the gene an independent industry entity that could be defined, patented, owned, tracked, proven to be acceptably safe, shown to have uniform functions and useful effects, then sold, and possibly even recalled. This approach reached its ultimate level in the form of the Human Genome Project, which had hoped to catalog and develop a huge treasure trove for the biotech industry. The tens of thousands of products, including large numbers of disease genes, would ensure the success of many companies. Thus, the vigorous race and battle to patent genes was on. Unexpectedly for many, the Human Genome Project and the information derived from genomic research have been a total surprise to the biotech industry. An article published in The New York Times by Denise Caruso in 2007 outlined this surprise (Caruso, 2007).
The $73.5 billion global biotech business may soon have to grapple with a discovery that calls into question the scientific principles on which it was founded.
By describing a recent unexpected scientific finding that the human genome might not function as a “tidy collection of independent genes,” this author emphasized the challenges to the current concept of how individual genes work. The current theory links each gene to a single function, such as a predisposition to specific diseases like diabetes or heart disease. The finding instead indicated that genes operate in a complex network with overlapping functions and complicated relationships, which calls the current theory into question.
…these findings will challenge scientists “to rethink some long-held views about what genes are and what they do”.
In addition to patent and safety issues that are critical to any company (according to a 2005 report, over 4000 human genes had been patented in the United States alone), what does the future hold for the
gene-based bio-industry if the core belief in genes is no longer valid (the concept that assumes genes operate independently has been the foundation for many biotech companies)? In fact, it is much easier for academia to face such a surprise. Many academics already had their doubts, as their results are rarely as clear-cut as those in high-profile publications, especially the award-winning experiments cited in textbooks. They often doubt themselves and consider that maybe the conditions they used were not optimal. Maybe their data would have been “better” if they had used even higher resolution methods. Maybe they were not lucky or competent enough to get the “perfect” data. In light of the complexity revealed by genomics, they may finally be relieved that they were not that unlucky after all, because the perfect gene-based world of textbook genetics and the concept of the gene have been wrong all along when used to describe the general rules (Wilkins, 2007; Heng, 2008b) (see Section 1.2.3). It is interesting to point out that the US Supreme Court unanimously ruled in 2013 that human genes are not patentable, as the act of isolating DNA sequences does not make them sufficiently different from native DNA to be patent-eligible. In contrast, synthetic DNA, or cDNA, is patent-eligible because it does not occur naturally. This decision changed the US Patent and Trademark Office’s policy on this issue and was met with criticism from the biotech industry. In a statement from the president of the Biotechnology Industry Organization, the ruling is “a troubling departure from decades of judicial and Patent and Trademark Office precedent supporting the patentability of DNA molecules that mimic naturally-occurring sequences. In addition, the Court’s decision could unnecessarily create business uncertainty for a broader range of biotechnology inventions” (Genome web, 2013).
Ironically, in the future, the biotech industry will likely appreciate the US Supreme Court’s ruling, when the drastically decreased value of most individual genes is finally realized and accepted. There is yet another alarming phenomenon in the current basic biological research community: an increasing separation between reality and artificial, experimentally generated knowledge, particularly when a high degree of cherry-picking is used to create the perfect story. In private, scientists acknowledge the limitations of their experiments and are often frustrated by the uncertainty that they generate. In public lectures, publications, and grant applications, they prefer to tell a simpler, more clear-cut, and convincing story by ignoring the uncertainties. The accumulation of cherry-picked stories has had profound negative effects on the progress of science, as the current dominant paradigm favors most of these biased results even though they are not a reflection of reality. Science appears to be making daily progress using carefully selected model systems, but there seems to be no interest in what this knowledge
is based on or whether it reflects real natural systems. After all, this is basic research. In the world of basic research, curiosity trumps reality. The bio-industry has thus provided a key reality check to our genomic knowledge. Regardless of its textbook support, a company will fail if it cannot translate the popular gene concept into reality. Companies do not have the luxury of failure that the academic world has. A professor admitting the failure of a project can simply say, “well, science is much more complicated than I thought” and move on to a new hypothesis. For companies, it is rather a different story. Their investments are costly and their success relies on the correct concept. Before human genome sequencing was finished, there was high hope that pharmaceutical companies could use cutting-edge genomic knowledge for drug discovery. On paper, it made sense that gene-based drug discoveries would be superior to traditional drugs, and most of the academic community agreed. Many high-profile academics were recruited to head research branches at big pharmaceutical companies. This move, however, did not meet expectations: despite the drastic increase in spending on genomic research, the pipeline of new drugs has actually diminished significantly. The following is a quote from a feature article in The Economist in 2004, titled “Fixing the drug pipeline,” which discusses the challenges the pharmaceutical industry is facing (with annual sales of about $400 billion, it is one of the biggest and most lucrative industries in the world).
Drug design: The more pharmaceutical companies spend on research and development, the less they have to show for it. The “pipelines” of forthcoming drugs on which its future health depends have been drying up for some time … the sequencing of the human genome was expected to revolutionize the process of drug discovery.
It is undeniably a remarkable achievement, but looked at squarely, it represents a “parts list” of genes whose connection with disease is still obscure. … The flood of information has caused a kind of “paralysis by novelty” …
The Economist, 2004
Obviously, the flood of genomic information did not help Big Pharma and is a huge disappointment in light of the promises made by the Human Genome Project. Once again, currently popular concepts such as systems biology, networks, proteomics, and big data science offer the hope of delivering powerful drugs (Ideker et al., 2001). Their rationale is that if individual gene identification and characterization is not that useful, then there must be some other molecular targets. Unfortunately, without a new genome-based framework, disappointing results will continue to prevail. In fact, it is not just the pharmaceutical industry facing this challenge; academic communities have their own headaches. In recent
years, the issue of low reproducibility in cancer research has shocked the research community. The effort of systematically repeating many highly cited important experiments has revealed the sad fact that the majority of experiments cannot be duplicated (Mullard, 2011; Begley and Ellis, 2012)! Because the examined papers were published in top journals by laboratories with high reputations, and their conclusions are highly influential, the message is especially chilling. It obviously raises doubt about the majority of the bio-literature (if the best literature is not reliable, what about the average papers?). Interestingly, it was the pharma/bio-companies rather than academic institutions that performed such experiments, as they wanted to make sure that the scientific discoveries were solid before applying them to drug development pipelines. Heated discussions followed. Some cited possible cell line misuse, while others criticized scientific misconduct and dishonesty. The hidden answer for the majority of cases, however, might be the conceptual framework of genomics and the current methods used to study it.
1.3.4 Gene-Based 1D Genomics Is Not Enough
The frustration of knowing more about genes and other genetic details but understanding less about biology (with increasing uncertainty and reduced medical implications) applies not only to the biotech and pharma industries; increasing examples have struck at the very core of genomics research itself. One of the priorities of current human genomics is to identify defective genes related to diseases and to establish a comprehensive catalog of all human diseases. The logic seemed very straightforward and solid: by sequencing all normal and defective genes using large patient populations, we can hunt down disease genes by simply comparing the sequences. The reality is much more challenging. Scores of genetic abnormalities have been detected; however, most of them contribute to diseases with low penetration in populations, as many of them are not shared by the majority of patients. Even among patients who have the same genetic sequence profiles that influence a disease, only a portion will eventually develop the disease. At the same time, there are also high levels of diverse genetic variation among normal individuals, which makes the job of identifying disease genes extremely difficult for most common afflictions. On the one hand, in most solid cancers, sequencing projects have revealed so many gene mutations associated with the same cancer type that sorting out which mutations are important has become a challenge. On the other hand, it has been very difficult to identify “meaningful and/or useful” genetic loci that contribute to other common diseases such as obesity, diabetes, and an
array of neurogenetic diseases following large-scale whole genome scanning. For example, the ballyhooed success of identifying over 10,000 different genetic variants associated with schizophrenia in fact represents the biggest failure yet of genetic determinism, as each of these identified genetic variations is relatively rare and responsible for only a tiny increase in disease risk, rendering them clinically useless (Wade, 2009). The difficulty in identifying genetic elements in obvious genetic diseases has been blamed on so-called “genetic dark matter,” as a genetic link clearly exists but is hard to detect (Manolio et al., 2009). There has been extensive debate regarding genetic dark matter, or “missing heritability” (Pennisi, 2010). The so-called missing heritability of common traits refers to a continuing mystery in human genetics. A multitude of large-scale genetic studies, including genome-wide association studies, have identified many loci that are associated with common human diseases and traits, but together these results can explain only a small proportion of the “heritability” of the traits. For most traits, the majority of the heritability remains unexplained (Eichler et al., 2010). Some believe that it has not been detected because we have not been looking hard enough. Others think that targets exist beyond the gene, such as epigenetics, copy number variation, noncoding RNA, and genetic interaction. Some even argue that by improving mathematical models, these issues will eventually be solved. While there have been increasing reports in recent years that illustrate the involvement of copy number variation and DNA methylation correlated with some traits in defined systems, these contributions are clearly limited, and the so-called phantom heritability seems real (Zuk et al., 2012). Genome-level alterations have been suggested as a key component of the missing heritability (Heng, 2010) (see Chapters 4 and 7).
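The arithmetic behind “missing heritability” can be illustrated with a minimal sketch. Every number below is invented for illustration and is not taken from any of the studies cited above; the function name `explained_fraction` is likewise hypothetical, and the calculation assumes that the identified loci act independently and additively, which is exactly the assumption the debate calls into question.

```python
# Illustrative only: all values are hypothetical, not data from any study.

def explained_fraction(per_locus_variance, heritability):
    """Share of the estimated heritability accounted for by identified loci,
    assuming independent, purely additive effects (a simplifying assumption)."""
    if not 0 < heritability <= 1:
        raise ValueError("heritability must be in (0, 1]")
    return sum(per_locus_variance) / heritability

# Hypothetical trait: family studies suggest a heritability of 0.80, and
# 40 associated loci each explain 0.1% of the phenotypic variance.
loci = [0.001] * 40
heritability = 0.80

fraction = explained_fraction(loci, heritability)
missing = 1 - fraction
print(f"explained: {fraction:.1%}, missing: {missing:.1%}")
```

Under these made-up inputs, the 40 loci together account for only one-twentieth of the estimated heritability, which is the shape of the gap the text describes; adding more loci of similarly small effect closes it only slowly.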
The situation gets worse when faced with the additional challenges that come with applying hard-to-identify genetic markers in a clinical setting. For example, 101 of these genetic markers were found to be useless in predicting heart disease in a clinical setting. Application of these markers had no clinical value in forecasting disease among 19,000 women who had been monitored for 12 years, despite the fact that all 101 identified genetic variants had been statistically linked to heart disease in various genome-scanning studies. In fact, the old-fashioned way of asking about family history had better prediction success (Paynter et al., 2010). Questions have been raised regarding the new trend of sequencing everything, made possible by increasing technical capabilities. According to the Nature article “Human genome: Genomes by the thousand,” there were 2700 finished human genomes in October 2010, and 30,000 sequenced human genomes were projected by the end of 2011. The 1000 Genomes Project is an international collaboration to produce a comprehensive catalog of human genetic variation for medical research. The genomes of over 1000
unidentified individuals from around the world will be sequenced using the next generation of sequencing technologies. That venture, as well as the Cancer Genome Project and the Personal Genome Project, contributes to this trend (Mardis, 2010). A few years ago, the Personal Genome Project captured the general public’s imagination and stimulated new hopes. This approach is a way to jump-start the whole process of integrating human genomic data into clinical medicine, a process that failed to deliver in the first decade of the original sequencing project. The new logic is that if whole genome scanning has not worked in terms of finding the genetic dark matter, then we must sequence more samples, thousands of them; the answer must be there! Here researchers are again using the same reasoning and the same genetic determinism, only now in different clothes. The fundamental flaw of such an approach lies in the fact that most common diseases are not caused by Mendelian factors! The more samples we analyze, the more diversity we will detect because of the very nature of bio-system heterogeneity. Let us not forget that the main argument for personal genome sequencing is to provide genetic profiles for common diseases. As soon as the initial personal genome data became available (from a few celebrities in the genome field), they were at odds with the expectations behind the Personal Genome Project, as they raised more questions than answers. Some of these questions may shake the very core of genetics. James Watson had 310 gene mutations in his genome that could affect his health, including mutations in DNA repair genes linked to cancer. That he is 89 years old and currently without cancer illustrates the uncertainty of trying to predict diseases based on individual gene information.
In addition to the fact that there are no defined recommendations on how to improve his health based on the sequencing data, there is also some “unwanted information.” For example, Watson requested that he not be told about the status of his own ApoE gene to reduce potential anxiety. This gene is associated with late-onset Alzheimer’s disease, which affected one of his grandmothers. If a highly informed genetic scientist like Watson would lose sleep over this type of information, imagine the impact of this information on a 20-year-old without genomics training. What about the impact on making life decisions regarding marriage, kids, medical treatment, lifestyle, and even types of jobs and financial planning? Not to mention more complicated social issues regarding privacy and discrimination. Again, even a leading molecular geneticist such as James Watson expresses bias based on his own understanding of genetics. Such gene determinism has also generated many controversies regarding social issues. Many try to brush over his controversial points as a case of “smart people saying dumb things,” but such judgments are also made based on personal knowledge and scientific beliefs. It is Watson’s genetic determinist view that
made him voice his highly controversial version of genetics, in addition to his worry over a potential gene mutation. In fact, Watson’s support of the Human Genome Project began with his personal quest to find a treatment for his oldest son’s schizophrenia. So how can people prevent potential discrimination generated among nonscientific populations based on DNA sequences? Clearly, the technology will be available to the general population soon, but should it be applied in this way? These are all critical questions. More important, just sequencing without really understanding all of the repercussions is not a wise course of action. This difficult situation is most apparent in the cancer genome sequencing project. Cancer genomic research has been a front runner in using whole genome sequencing to attempt to pinpoint genetic causes. This approach was taken not solely because of the original prediction by Renato Dulbecco, who stated that sequencing the human genome is the key to solving the puzzles of cancer. It was also because of the generous funding available to the field and the public’s desire to win the war on cancer. In December 2005, the US National Institutes of Health announced plans to sequence every genetic mutation involved in cancer. According to Francis Collins, then Director of the National Human Genome Research Institute, such an effort was a natural extension of the Human Genome Project. Based on the estimation of Anna Barker, who was the US National Cancer Institute (NCI) Deputy Director, there were 5 to 15 identified genetic changes for each type of cancer at that time, and there are probably 100 or more such changes involved in the formation, growth, and metastasis of each type of cancer. This unrealistic view has faced serious challenges.
George Miklos, an Australian geneticist who anticipated the original Human Genome Project and is a widely recognized expert in genomics, stated in his article “The human cancer genome project: one more misstep in the war on cancer,” that:
No one doubts that primary tumors accumulate somatic mutations over time. However, the Achilles’ heel of cancer is not the mutational baggage train of the primary tumor, but the genomic imbalances and methylation changes of the deadly cohort of cells that metastasize in different genetic backgrounds. As a megaproject in advancing cancer research and ultimate cures, the human cancer genome project thus is fundamentally flawed.
Miklos, 2005
A few more articles supported Miklos’ voice. In the article “Cancer genome sequencing: the challenges ahead,” the major problem of this project is identified as existing at the conceptual level with regard to gene-based cancer research.
A major challenge … is solving the high level of genetic and epigenetic heterogeneity of cancer. For the majority of solid tumors, evolution patterns are stochastic and the end products are unpredictable, in contrast to the relatively predictable
stepwise patterns classically described in many hematological cancers … These features of cancer could significantly reduce the impact of the sequencing approach, as it is only when mutated genes are the main cause of cancer that directly sequencing them is justified. Many biological factors (genetic and epigenetic variations, metabolic processes) and environmental influences can increase the probability of cancer formation, depending on the given circumstances. The common link between these factors is the stochastic genome variations that provide the driving force behind the cancer evolutionary process within multiple levels of a biological system. This analysis suggests that cancer is a disease of probability and the most-challenging issue to the project, as well as the development of general strategies for fighting cancer, lie at the conceptual level.
Heng, 2007a (with permission from John Wiley & Sons)
Unfortunately, the big sequencing machine had already been set in motion, and the leadership at the NCI as well as the research community followed the rationale of sequencing them all and identifying the gene mutation patterns, because, according to the gene mutation theory of cancer, gene mutations are the driving force of cancer. With whole genome sequencing methods, scientists can leave no stone unturned, and thus it is believed that the patterns must eventually emerge after sequencing thousands of samples. Eleven years later, the cancer genome sequencing project has generated enough data to be realistically evaluated. Despite a high level of excitement and news reports claiming that the cancer genome has been cracked, the results are fundamentally disappointing. In most cancers, there are many gene mutations, most of them not commonly shared among patients. It is a major challenge to distinguish the important ones. For certain types of cancer, there are some highly penetrant genes, such as the known p53 mutations, but without knowing how to evaluate other diverse gene mutations, the field has been pushed back to square one. A much more serious challenge is that genome alterations are the general rule and not an exception in most cancers, and the meaning of the same gene mutation differs within different genome systems. The truth is we now have a big mess (see Chapter 8). The following is from a “news and view” piece published in Nature in 2010, 10 years after the Human Genome Project was finished.
… Bert Vogelstein, … has watched first-hand as complexity dashed one of the biggest hopes of the genome era: that knowing the sequence of healthy and diseased genomes would allow researchers to find the genetic glitches that cause disease, paving the way for new treatments. Cancer, like other common diseases, is much more complicated than researchers hoped.
By sequencing the genomes of cancer cells, for example, researchers now know that an individual patient’s cancer has about 50 genetic mutations, but that they differ between individuals.
Check Hayden, 2010
Vogelstein is a well-known leading cancer researcher. His concept of the stepwise accumulation of gene mutations causing cancer has been referred to as the Vogelgram and has had significant impact on current
cancer theory. Echoing Vogelstein’s new view, Harold Varmus, the director of the NCI, was quoted by The New York Times in 2012 as saying “Genomics is a way to do science, not medicine.” Remember, just a few years earlier in 2005, Varmus, a former director of the NIH, was quoted by The New York Times as believing that the cancer genome project could “completely change how we view cancer.” His prediction was right but with a very different twist. Indeed, the gene view of cancer is no longer working, as illustrated by high levels of diverse gene mutations, in contrast to the original hope of finding a handful of key gene mutations for each cancer type. For this reason, genomics is not applicable to medicine. It is interesting to point out that without a solid paradigm, opinions can frequently swing in opposite directions, even within the same scientist. For example, in an act that surprised many, Dr. Robert Weinberg from MIT recently published a notable piece that is critical of the current molecular reductionist approaches to cancer research (Weinberg, 2014). As he is a leading scientist behind the gene mutation theory of cancer, his candid and well-thought-out opinions regarding the fundamental limitations of current molecular research should receive much attention from the research community. Surprisingly, this has not been the case at all. While there are many discussions about Weinberg’s perspective among scientists who challenge the gene mutation theory of cancer (Horne et al., 2015a-c; Heng, 2015; Liu et al., 2014), no serious discussion is taking place among the majority of researchers who have followed him for decades. Ironically, as pointed out in private by a former trainee, Weinberg’s piece seems to have nothing to do with his current research. It is also noticeable that Vogelstein’s group has found renewed interest in how a few gene mutations can lead to cancer.
It is thus understandable that it might take a while for the field of cancer research to change. An earlier reality check happened on the 10th anniversary of the sequencing of the human genome (Check Hayden, 2010). In contrast to many promises, life is complicated. A 2010 New York Times piece titled “A decade later, genetic map yields few new cures” highlighted the needed debate (Wade, 2010). Responding to this piece and to increasing public dissatisfaction, genomic scientists reminded the general public that science needs time to deliver (they apparently forgot that it was scientists themselves who made the promise of rapidly delivering results in the first place). Increasing numbers of scientists are now questioning the direction of current research despite many surprising discoveries in basic genomics. Craig Venter’s viewpoint is worth mentioning given his status within the human genome sequencing project. (He is most famous for being one of the first to sequence the human genome using private funding.) In his 2010 interview with the International online edition of
Germany’s newsmagazine DER SPIEGEL, the surprising title was “we have learned nothing from the genome” (Spiegel, 2010). Excerpts:
SPIEGEL: So the significance of the genome isn’t so great after all?
Venter: Not at all … We couldn’t even be certain from my genome what my eye color was. Isn’t that sad? …
SPIEGEL: So the Human Genome Project has had very little medical benefits so far?
Venter: Close to zero to put it precisely … Because we have, in truth, learned nothing from the genome other than probabilities. How does a 1% or 3% increased risk for something translate into the clinic? It is useless information.
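Venter’s question about a 1% or 3% increased risk can be made concrete with a short worked calculation. This sketch reads the increase as a relative one applied to a baseline lifetime risk; both that reading and the 10% baseline are assumptions chosen purely for illustration, and `absolute_risk` is a hypothetical helper, not a clinical formula from the interview.

```python
# Illustrative only: the baseline risk is invented, and the "increase" is
# assumed to be relative to that baseline.

def absolute_risk(baseline, relative_increase):
    """Absolute risk after applying a relative increase to a baseline risk."""
    return baseline * (1 + relative_increase)

baseline = 0.10  # hypothetical 10% lifetime risk in the general population

for increase in (0.01, 0.03):
    risk = absolute_risk(baseline, increase)
    extra_per_1000 = (risk - baseline) * 1000
    print(f"+{increase:.0%} relative increase: {risk:.2%} absolute risk, "
          f"about {extra_per_1000:.1f} extra cases per 1000 carriers")
```

With these assumed inputs, a carrier’s risk moves from 10% to 10.1% or 10.3%, a difference of one to three cases per thousand people, which suggests why such probabilities are hard to act on for an individual patient.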
When a reporter from DER SPIEGEL stated that there are hundreds of hereditary diseases which can be linked to individual gene mutations, Venter simply responded: “There were false expectations.” Wow, what an interview! Nevertheless, Venter offered some very candid views about the disappointment over the lack of payoff of the Human Genome Project. Most interesting is that he had completely changed his opinion regarding the significance of genome sequencing since the early 2000s, soon after the completion of the human genome sequencing. At the time of the interview, few individual genome sequences were available, and many experts in genomics considered his viewpoint too extreme, expecting that the high value of sequencing would certainly become visible once more genomes were sequenced. Fast forward 9 years after his interview: individual genome sequencing has become a routine approach for many patients and normal individuals. It is about time to reevaluate the value of sequencing. As CEO of HLI, Human Longevity Inc., Venter has been pushing deep sequencing for large numbers of individuals. By October 2016, this company had sequenced over 10,000 individuals. At a coverage of 30× to 40×, the whole genome sequencing effort has revealed 150 million mutations. According to the company web page, “By combining the largest collection of genomic and phenotypic data, HLI is able to use machine learning and expert analysis to transform the data into meaningful and useful insights. This turns the information into new discoveries that can inform health decisions leading to new treatment options, personal health plans, and the potential for longer, healthier human lifespans.” HLI also established the Health Nucleus program to profile individual customers at a price tag of $25,000 per person.
The Health Nucleus platform uses whole genome sequence analysis, advanced clinical imaging and innovative machine learning, combined with a comprehensive curation of personal health history, to deliver the most complete picture of individual health. http://www.humanlongevity.com/about/overview
All these activities from HLI represent only the tip of the iceberg of the global sequencing movement. The level of academic and commercial interest in sequencing is overwhelming. Some cancer centers now offer whole genome sequencing services to all willing patients, promising that their gene mutation profile will certainly help medical treatment decisions. Judging by how popular the sequencing approach is, one logical conclusion is that experts like Venter must have become much more optimistic about how sequencing information can change medicine. Specifically, given current genome sequencing technology, we should by now understand much more about the relationship between DNA sequence and disease. Surprisingly, despite all these accumulated data, Venter still holds the same viewpoint as he did 7 years ago (without increased confidence that DNA information can predict human diseases). Last year, when discussing this question in a talk at Beijing University’s medical college, he said that we currently understand only about 1% (of the relationship between genes and diseases), even though the value of DNA sequencing should have become more and more obvious with the further development of computational capability. He is not alone. Early in 2011, the journal Science published a piece titled “deflating the genomic bubble,” borrowing the popular term used to describe the meltdown of our economy during the greatest recession in decades (Evans et al., 2011). ... Although it may be hard to overestimate the significance of that achievement, it is easy to misconstrue its meaning and promise. People argue about whether mapping the human genome was worth the investment. With global funding for genomics approaching $3 billion/year, some wonder what became of all the genomic medicine we were promised. ... Recent methodological progress in genomics has been breathtaking. ...
But claims of near-term applications are too often unrealistic and ultimately counterproductive. From the South Sea and dot-com “bubbles” to the ongoing housing market crisis, the world has seen its share of inflated expectations and attendant dangers. Science is immune to neither.
Little by little, increasing doubts and contradictions are emerging to challenge genetic determinism. The direct implication, that current gene-based genetic information translates poorly to medicine, is indeed the reality. In a perspective piece published by Nature Reviews
Genetics in 2010, titled “Viewpoint: Missing heritability and strategies for finding the underlying causes of complex disease”, Jason Moore, a leading expert, wrote: ... Such biomolecular interactions that depend on multiple genetic variations can substantially complicate the relationship between genotype and phenotype, making it impossible to explain phenotypic variation simply by adding together independent genetic effects. This hypothesis is completely consistent with the current results of GWA studies. ... High-throughput technology alone will not solve this problem. The time is now to philosophically and analytically retool for a complex genetic architecture or we will continue to underdeliver on the promises of human genetics. Indeed, life, and thus genetics, is complicated and some will soon ask, as seismologists have, whether we are trying to predict the unpredictable. Eichler et al., 2010
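Moore's point, that interacting variants cannot be captured by adding together independent effects, can be illustrated with a deliberately artificial two-locus model. The sketch below is not taken from the cited viewpoint: the XOR-like `phenotype` function, the equal genotype frequencies, and all variable names are assumptions chosen only to make the arithmetic transparent.

```python
from itertools import product
from statistics import mean

# Hypothetical toy model (an assumption, not from the cited paper):
# two loci, each coded 0 or 1, with a purely interactive phenotype.
def phenotype(a: int, b: int) -> int:
    """Trait appears only when exactly one of the two loci carries the variant."""
    return a ^ b

# All four genotype combinations, assumed to be equally frequent.
genotypes = list(product((0, 1), repeat=2))
values = [phenotype(a, b) for a, b in genotypes]

def marginal_effect(locus: int) -> float:
    """Mean phenotype with the variant minus mean phenotype without it."""
    with_variant = [phenotype(*g) for g in genotypes if g[locus] == 1]
    without_variant = [phenotype(*g) for g in genotypes if g[locus] == 0]
    return mean(with_variant) - mean(without_variant)

effect_a = marginal_effect(0)
effect_b = marginal_effect(1)

# Additive prediction for this balanced design: the overall mean plus each
# locus's marginal effect, applied as a deviation from its average coding (0.5).
baseline = mean(values)
additive = [baseline + effect_a * (a - 0.5) + effect_b * (b - 0.5)
            for a, b in genotypes]
residuals = [v - p for v, p in zip(values, additive)]

print("marginal effects:", effect_a, effect_b)
print("additive predictions:", additive)
print("residuals:", residuals)
```

In this toy case each locus has a marginal effect of zero, so the additive prediction is the same for every genotype, even though the two loci together determine the phenotype completely. Real traits are far messier; the sketch only shows why summing single-variant effects can miss interaction-driven variation.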
Even some highly optimistic leaders in the field have modified their position or unintentionally pointed out some key paradoxes within their own reasoning. Eric Lander, a major player in the Human Genome Project, in his Nature review “Initial impact of the sequencing of the human genome” summarized the achievements of the genome project in the decade since its publication (Lander, 2011). In addition to his routine praise of accelerated biomedical research, there were some surprising analyses worth quoting: It is important to distinguish between two distinct goals. The primary goal of human genetics is to transform the treatment of common disease through an understanding of the underlying molecular pathways. Knowledge of these pathways can lead to therapies with broad utility, often applicable to patients regardless of their genotype ... Some seek a secondary goal: to provide patients with personalized risk prediction. Although partial risk prediction will be feasible and medically useful in some cases, there are likely to be fundamental limits on precise prediction due to the complex architecture of common traits, including common variants of tiny effect, rare variants that cannot be fully enumerated and complex epistatic interactions, as well as many non-genetic factors.
Lander admits that, given the complexity and uncertainty of genetic information, genomics will not likely deliver the expected personal risk prediction for most common diseases. This is in sharp contradiction with the goal of the personal genome project! If risk prediction has limited value, why raise expectations of sequencing personal genomes on which to base medical prediction? Unfortunately, because of the same complexity and uncertainty, the same analysis also applies to the primary goal of transforming the treatment of common diseases through an understanding of the underlying molecular pathways. Based on the cancer genome sequencing data, the answers are already known. Using Lander’s own words, “Knowledge of these pathways can lead to therapies with broad utility,
often applicable to patients regardless of their genotype.” If this is true, why bother to search for genotypes? Clearly, there are some profound paradoxes amid the current popular genomic reasoning. A few years ago, the journal Science also published the editorial “Genomics is not enough,” which recommends that we move to other fields to develop clinically useful applications. ... Translating current knowledge into medical practice is an important goal for the public who support medical research, and for the scientists and clinicians who articulate the critical research needs of our time. However, despite innumerable successful gene discoveries through genomics, a major impediment is our lack of knowledge of how these genes affect the fundamental biological mechanisms that are dysregulated in disease. If genomic medicine is to prosper, we need to turn our attention to this gaping hole. ... The lessons from genome biology are quite clear. Genes and their products almost never act alone, but in networks with other genes and proteins and in context of the environment. ... Chakravarti, 2011
It should be further argued that it is not simply that genomics is not enough, but rather that gene-based 1D genomics is not enough. We cannot merely move away from genomics just yet, as we have not yet searched for and established the correct framework of the new genomics field. Moreover, to understand evolution (both organismal and cellular), genomics, which represents one of three key elements for evolution, should not be ignored. It is clear that a genome-based genomic revolution will arise as soon as the proper framework is established.
1.4 NEW GENOMIC SCIENCE ON THE HORIZON
In 2013, to the surprise of many, DNA pioneer James Watson stated that he had changed his position on the cancer genome sequencing project. “While I initially supported the Cancer Genome Atlas project getting big money, I no longer do so. Further 100 million dollar annual injections so spent are not likely to produce the truly breakthrough drugs that we now so desperately need” (Watson, 2013). Given Watson’s passion and initial key role in the human genome sequencing project (he was the first director of the US Human Genome Project), this change is not a trivial one. He clearly realizes that the current cancer genome sequencing approach is ineffective. However, he then proposed a “new” strategy of screening and targeting cancer genes based on various pathways. Unfortunately, while he has criticized the continuation of large-scale DNA sequencing, his newly favored approach is based on the same gene-centric framework and will likely be just as ineffective. In fact, the rationale for sequencing the cancer genome is to provide exactly the list of genes for molecular targeting that Watson wishes to obtain.
This case forcefully illustrates the ultimate importance of rethinking the genetic and genomic concepts based on the current cancer theory. There is no use in simply modifying our strategies without a solid conceptual basis. Fundamentally, the current limitations of disease research are rooted in genomic and evolutionary theories.
1.4.1 Time to Rethink Genetics and Genomics
The growing body of genomic data carries a common message: a large number of new genomic facts do not fit the traditional concepts of genetics/genomics. Surprisingly, however, the majority of publications have so far avoided spelling out this obvious conclusion. Various factions of the research community have discussed this issue, and some have called for new genetic concepts because new facts demand new conceptual frameworks. For example, following decades-long genomics studies coupled with the popularity of systems biology, it seems that the field of genetics has gradually accepted the general viewpoint that there are major limitations in the gene concept, specifically concerning how to define the gene and how to understand the relationship between individual genes and their phenotypes. In his recent book, Making sense of genes, Kostas Kampourakis stated that: More recent research has shown that it is impossible to structurally individuate genes, and that the best way we can do is to identify them on the basis of their functional products. In many cases, single genes cannot explain the variation observed for simple monogenic ones. ... genes “operate” in the context of developmental process only. This means that genes are implicated in development of characters but do not determine them. Kampourakis, 2017
That said, most researchers admit only the limitation of using a single gene to explain a complex phenotype. They insist that the quantitative deterministic power of genes is a fact, so that the polygenic model will ultimately reveal the genotype-phenotype relationship for complex phenotypes once enough samples are analyzed (all the while believing in the one-to-one correlation for single-gene-defined characteristics). That is the main reason why GWAS have become very popular in recent years: the research community firmly believed that common diseases are caused by common genetic alterations. Based on the nature of genomic and environmental heterogeneity, we have disagreed with this idea for years (Heng et al., 2006a; Heng, 2007a). Seven years ago, the limitations of GWAS were openly discussed
in the leading journal Cell. Based on the high levels of genetic heterogeneity in human diseases, McClellan and King analyzed why the strategy of identifying a fixed set of commonly shared genetic loci is challenging, supporting the view that the idea of common diseases being caused by common loci is incorrect for many common and complex diseases. They thus favored next-generation sequencing technologies over GWAS to find rare disease-causing mutations and the genes that harbor them (McClellan and King, 2010). As expected, this piece generated heated debate. The GWAS community firmly claimed success and promised to improve its strategies to deliver. In 2017, the journal Cell again published a notable piece entitled “An expanded view of complex traits: from polygenic to omnigenic.” In this perspective, Pritchard’s group reanalyzed recent GWAS of human height involving over 205,000 individuals. The original report, published in 2014, identified 700 variants that affect human height, which collectively explain just 16% of the variation in height in people of European ancestry. Compared with the general estimate that about 80% of all human height variation should be explained by genetic factors, the missing fraction seemed too big given the large number of individuals investigated, which triggered their curiosity and led to their reanalysis. This reanalysis led to the realization that more than 100,000 variants affect human height, although most of these variants contribute only one-seventh of a millimeter. Because of their tiny contribution, these variants are often considered statistical noise and are ignored. Furthermore, these variants are distributed across the entire genome, making them less useful for pattern identification (such as in GWAS) (Boyle et al., 2017). The authors further compared large genetic studies of rheumatoid arthritis, schizophrenia, and Crohn’s disease (the success stories).
Although some of the identified variants fit the mechanisms of the disease in question, the majority of genetic variants are not related to a given disease based on current knowledge. Some are detected in many different diseases (and many are linked to normal and basic function as well). They concluded that ... there is an extremely large number of causal variants with tiny effect sizes on height and, moreover, that these are spread very widely across the genome, ... implying that a substantial fraction of all genes contribute to variation in disease risk. These observations seem inconsistent with the expectation that complex trait variants are primarily in specific biologically relevant genes and pathways.
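The size of the gap described above follows directly from the two percentages quoted in the text. The short sketch below only restates that arithmetic; the variable names are illustrative assumptions, and the figures are the rounded values as cited here rather than values rederived from the underlying studies.

```python
# Figures as quoted in the text for the human height GWAS (Boyle et al., 2017).
heritability = 0.80        # share of height variation attributed to genetic factors
explained_by_hits = 0.16   # share explained by the roughly 700 identified variants

# "Missing" share: heritable variation not accounted for by the identified variants.
missing_share = round(heritability - explained_by_hits, 2)

# Fraction of the heritable component that the identified variants capture.
captured_fraction = round(explained_by_hits / heritability, 2)

print(f"missing share of variation: {missing_share:.2f}")
print(f"fraction of heritability captured: {captured_fraction:.2f}")
```

With these inputs the missing share is 0.64 of total height variation, and the identified variants capture 0.20 (one-fifth) of the estimated heritable component.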
After a series of data analyses, they realized that even though a central goal of genetics is to understand the links between genetic variation and
disease, and it is expected that disease-causing variants are clustered into key pathways that drive disease etiology, such common variants are hard to identify. In particular, it is known that for complex traits, association signals can spread across the entire genome, including near many genes which have no obvious connection to a given disease or phenotype (for example, most 100-kb windows can contribute to variance in human height). ... We propose that gene regulatory networks are sufficiently interconnected such that all genes expressed in disease-relevant cells are liable to affect the functions of core disease-related genes and that most heritability can be explained by effects on genes outside core pathways. We refer to this hypothesis as an “omnigenic” model (Boyle et al., 2017).
Clearly, a polygenic model is no longer sufficient. In other words, the quantitative relationship between a group of common genes and a given disease is hard to establish. While focusing on a gene network seems better than focusing on sets of individual genes or genetic loci, this hypothesis is rather vague at its current stage. For example, it does not explain why so many non-disease-related genetic loci exist, many of them shared among different diseases as well as with healthy individuals, or how somatic evolution impacts the gene-environment interaction, which often reduces predictability based on genetic potential, as coded by the germline. In particular, it is certainly challenging to apply this concept in clinical practice. Moreover, if (almost) every gene affects (almost) everything, how does genetics actually work? According to an interview with one of the authors, it is evident that they clearly understand the challenge. “It is a really hard problem.” “Historically, even understanding the role of one gene in one disease has been considered a major success. Now we have to somehow understand how combinations of seemingly hundreds or thousands of genes work together in very complicated ways. It’s beyond our current ability.” (Yong, 2017)
Nevertheless, the authors think that nature is telling us something profound about how our cells and genes work. As such, they are trying to shift the focus from common genes to complex gene networks. Others are willing to move further by considering factors other than genes. One main faction favors the world view of epigenetics (see Müller, 2007; Jablonka, 2012; 2013; Noble, 2013; Omholt, 2013; Strohman, 1997). This viewpoint has become increasingly popular following the failure to identify common gene mutations in many diseases, including cancer. Our group, in contrast, pushes for the genome theory where “chromosome coding or karyotype coding” defines the multiple levels of genetic/epigenetic and environmental interaction of cells, with the phenotypes representing emergent properties of such interactions within the
context of somatic cell evolution (Heng et al., 2006a-c; Heng, 2007a, 2009, 2015). Together, the critical voices regarding the current concepts are getting louder. For example, in their 2014 article “Chasing Mendel: five questions for personalized medicine,” Joyner and Prendergast state: We close this essay by postulating that there has been a pervasive influence of the gene centrism inherent in the Modern Synthesis in conjunction with the Central Dogma of Molecular Biology on biomedical thinking. We believe this influence has now become counterproductive. Thus, it is critical for new ideas stemming from evolutionary biology highlighted in this special issue of The Journal of Physiology and elsewhere to more fully inform biomedical thinking about the complex relationship between DNA and phenotype. The time has come to stop chasing Mendel. Joyner and Prendergast, 2014
Indeed, now is the time to search for a new genomic concept, the effective way to stop chasing Mendel. This new concept needs to realize the difference between gene and genome, as well as integrate environmental interaction within evolutionary processes. Specifically, individual genes and environmental factors should function as lower level agents, while the phenotypes should represent emergent properties of the genome as a whole. More and more genomic researchers explain away the challenges that current genomic research is facing by saying that “Mendelian simplicity belied true complexity.” But we should realize that this is not the complete reason: Mendel ignored the true nature of fuzzy inheritance when he incorrectly classified his data and proposed his theory in the first place.
1.4.2 Crisis Created New Opportunities for Future Genetics/Genomics
According to Thomas Kuhn, science is not a stepwise and cumulative acquisition of knowledge. The pattern of scientific progress can be described as a series of peaceful intervals of normal science punctuated by intellectually violent revolutions. While the main goal of the normal science phase is to bring the accepted theory and fact into closer agreement, the main goal of the revolution phase is to shatter the no-longer sufficient framework and establish a new paradigm. It is thus necessary to ask whether or not a paradigm shift is currently underway in genetics/genomics. One of the key preconditions and signals of a paradigm shift is the transition of a given scientific field from a routine progression stage to a crisis stage. In a real crisis stage, the dominant paradigm is losing its capability to explain fundamental facts (most of which are newly discovered), despite many superficial technical
achievements being made and a large amount of data being collected. In other words, the more data that are collected, the more confusion there is and the less we can comprehend it, as these new discoveries contradict the expectations of the current paradigm. Such accumulating anomalies reflect a poor fit between the existing theory and reality. These crises can be resolved in the following ways: (1) normal science advances and finally solves the crisis-provoking problems through new technologies or new realizations, which brings the field back to “normal”; (2) the crisis continues, as no key solution can be found, and the challenging problems are often passed to future generations; and (3) a competing paradigm emerges, battling the old paradigm for acceptance. Only the success of the new paradigm represents a paradigm shift. Such a process is very rare and can sometimes last a long period of time. “Successive transition from one paradigm to another via revolution is the usual developmental pattern of mature science.” (Kuhn, 1962). Being familiar with these definitions and characteristics is essential to judge our current status in scientific progression. Equally important, however, we must not confuse or even abuse key terms such as “paradigm shift,” “crisis,” and “revolution” in a scientific context. Unfortunately, “paradigm shift” has become an overused term in routine communications among scientists, as reflected by various research papers and seminar titles. Most scientific discoveries or technical improvements are referred to as “paradigm shifts” when clearly they belong to normal science progression and are not actual paradigm shifts. In fact, in Kuhn’s viewpoint, a paradigm is not just the basic theory of the field, but the entire worldview in which it exists, and such a worldview also shapes individuals’ knowledge, rationale, and way of thinking.
So to have a true paradigm shift, you must change the framework of a field as well as its related worldview, not just make a simple discovery based on the already existing paradigm. Similarly, one should not confuse crisis with “technical difficulties” in the normal science phase, which are commonly faced by most scientists in routine work. Although ample evidence strongly suggests that genetics is now in the crisis stage (new data actually change the very foundation of its theory), the majority of researchers are clearly not aware of this situation and will likely continue to practice science as usual because the paradigm they believe in prevents them from realizing its fundamental limitations. For example, one of the most popular viewpoints is that the recently discovered genomic heterogeneity in most common and complex diseases can be resolved by accumulating more data and eliminating more “noise,” even though the results say otherwise: the more samples used, the more heterogeneity is detected. The issue here is not about the sample size, but about the need for a better framework which can be used to correctly explain the data and understand the
mechanism of bio-heterogeneity. Only by changing the current way of thinking will we accept that (1) heterogeneity is the key functional feature of bio-systems rather than useless “noise” and (2) the majority of common and complex genetic traits often do not share common genetic loci, meaning that the strategy of sequencing more and more samples to identify a clinically useful pattern is flawed. When rational decisions cannot be made in a given field, increased efforts that are driven by emotion and political interests rather than scientific logic will not only waste valuable resources but also worsen the crisis. Nevertheless, no matter which direction genetics/genomics moves toward, the crisis status will certainly offer great opportunities for those who open their minds and actively search for new paradigms. Even researchers who still firmly believe in the current paradigm might be able to advance by recognizing the theoretical limitations of their scientific practice and perhaps even start to wonder if they too should change their views in the face of constant surprises and confusion. The following opportunities are on the horizon:
a. Potential new paradigms
When studying the history of bioscience, specifically when analyzing famous milestone experiments and the individual scientists behind them, most genetics students have mixed feelings: on one hand, the century-long accumulation of experimental data and theories seems very impressive, which inspires them. On the other hand, the great success of previous scientists overwhelms and discourages newcomers, as if they were born too late and have missed the one and only opportunity to become genetic pioneers. Facing these overwhelming achievements in genetics, what pathway should newcomers take? And how can they make contributions of equal historical importance?
A motivating fact is that there is currently a unique opportunity to rethink and redefine genetics/genomics and evolution, an essential step in searching for and establishing the new paradigm. Knowing the long history of genetic and evolutionary studies, readers should not take this historical opportunity lightly, as it represents a once-in-a-lifetime chance for all scientists to make their breakthrough. Most scientists are not lucky enough to witness and participate in a paradigm shift, as the normal phase of science is often much longer than the crisis stage. Within the crisis stage, even ordinary researchers suddenly have the same opportunities as the most brilliant researchers to be extraordinary by contributing to the new paradigm, which will certainly rewrite the history of bioscience. For example, some crucial steps lie ahead, which also represent exciting opportunities to advance the field of genomics.
First, it must be demonstrated that the current gene-based genetic paradigm is fundamentally limited and that merely continuing to push for more data accumulation will not provide satisfactory outcomes. This process is the hardest step toward establishing a new paradigm. It requires us to be honest with ourselves, as well as the following: clarity to see through the massive conflicting data, encouragement to challenge the status quo, personal sacrifice, support for new ideas and opportunities, and finally passion as well as patience to seek the truth.
Second, the new genomic paradigm needs to move from ultrareductionism to holistic genomics. The realization that decoding genes and decoding the genome are fundamentally different will certainly help such a transition (Chapters 2 and 4). It is essential that researchers acknowledge that the various levels of genetic organization follow distinct biological laws and that the knowledge gaps between these levels and laws cannot simply be bridged by data accumulation. For example, understanding how the genome works is very different from simply accumulating gene data. In particular, there is an urgent need to finish the transition from studies of 1D genes to the analysis of the 3D genome and then to 4D genomics, which includes time within the evolutionary concept, to achieve the new paradigm. System inheritance is ensured by 4D genomics, which is fundamentally different from inheritance of the parts as reflected by the function of genes (Chapter 7).
Third, the call to search for the new genome-based paradigm should not be simply viewed as anti-gene. It is meant to provide the evolutionary context of genomic research, including gene-genome integration. It is obvious that gene research has generated a great deal of knowledge of the parts of the system, including the mechanistic functions of cells and organisms at the molecular level.
However, when so many genes and environmental factors can be linked to the same cellular functions, the molecular understanding of each mechanism becomes limited. Paradoxically, the success of gene-based research also challenges the rationale of understanding the genome through the accumulation of knowledge at the gene level, as emergent properties have little to do with individual parts in the face of high levels of heterogeneity and complexity. This situation has been increasingly appreciated by genomic researchers when considering the genetic basis of human diseases. For example, in cancer research, a great deal is known regarding the individual molecular mechanisms of many cancer gene mutations. However, clinical predictions based on these mutations are extremely unreliable (Chapters 3 and 4). Clearly, a new paradigm that provides clinical relevance is urgently needed.
b. New scientific expectations
Traditional genetic analyses have favored strategies that search for patterns or bio-certainty. Examples can be traced back to Mendel, as discussed in an earlier section of this chapter. Assuming that the genetic factor functions as an independent unit provides the basis for tracing the pattern of how bio-systems pass genetic information between generations (such as 3:1 segregation). This search for a molecular pattern in the name of understanding a mechanism has become the most popular strategy following the arrival of molecular genetics. Because of the availability of various in vitro and in vivo model systems, as well as arrays of biotechnologies and molecular agents, many beautiful experimental systems can be designed and used to search for molecular patterns. Coupled with biostatistical methods, different patterns can be identified by disregarding bio-heterogeneity or “noise.” The success of bio-research using well-designed experimental systems has often been considered the art of bioscience, which has led to many milestone experiments and brought fame to many scientists. In the field of evolutionary research, in contrast, it is much more challenging to design clear-cut linear models, resulting in much less certainty. Remember the quote from Theodosius Dobzhansky: “Nothing in biology makes sense except in the light of evolution”? While this statement points out the ultimate importance of evolution in biology, it also portrays the frustration of searching for bio-uncertainty. Scientists cannot just confidently make predictions in biology based on “laws.” That is the reason why it is so hard to achieve certainty in biology (to make sense of many things according to the rules). Unfortunately, few have realized this hidden message. For most bioscientists, uncertainty is a weak and even bad word. With determination and data accumulation, they say, science (with certainty) will prevail.
It is thus necessary to mention Karl Popper’s viewpoint on this issue. He sharply distinguished truth from certainty. He believed that the search for truth is not the search for certainty. According to Popper, “All human knowledge is fallible and therefore uncertain.” (Popper, 1996). Surely, large-scale genomics and various -omics have revealed this overwhelming uncertainty. The continuation of “sequencing everything” with large numbers of heterogeneous samples will only increase the uncertainty. Bio-researchers must realize the limitations of certainty and appreciate uncertainty as reality. To do so, researchers need to reconsider their treatment of genetic heterogeneity as noise, which calls for the establishment of new technical platforms. Similarly, the expectation of simplified model
systems needs to be drastically modified, as any model will reveal only limited information and drastic simplification will reduce the value of their applications or clinical relevance. It often takes time to apply theories to practical matters. However, the future challenge in genomics goes beyond this tradition, as the gap between the parts (genes) and the system (genome) will most likely not be filled by an accumulation of more knowledge of the parts but will require new principles/knowledge of the genome. The key difference between genes and genomes and the way each is studied is intimately involved in this discrepancy (Chapters 2-5); yes, the knowledge of individual genes can be obtained by analyzing many defined experimental systems, but the task of assembling a functional genome represents a different type of research. Yes, many gene mutations can be linked to cancer using various in vitro and animal models and these genes may be detectable in a portion of tumor samples, but these genes are not commonly detectable in patients and possess minimal prediction value in the clinic for most cancer types (Chapter 4); yes, a large number of pathways have been characterized as potential targets for cancer therapy, but as soon as effective drugs are used, pathway switching reduces the effectiveness of these drugs, which effectively creates a moving target; yes, in a defined “pure” experimental system, a specific gene function can be understood within the fixed context of that system, but increasing research in different systems (such as a heterogeneous cell population with an altered genome) indicates that the function of the same gene becomes less certain, and the dynamic interaction among all genes alters its function (the function of a given gene is determined by genome context); yes, the importance of key development genes can be illustrated beautifully during the developmental process, but it is challenging to apply this knowledge of how genes work to disease conditions where genome
alterationemediated stochasticity dominates (Chapters 7, 8); yes, one can study the involvement of individual genes in various evolutionary stages, as most related species have similar key genes, but studies of these genes often do not reveal the mechanism of macroevolution, as the major force of evolution is the reorganization of the genome using similar gene sets rather than mainly by an accumulation of novel genes (Chapter 6). c. New approaches New approaches should meet the new expectations and play an important role in illustrating, testing, and even falsifying a new paradigm. For example, rather than pushing the development of methods to collect more data at the single cell level (yet another current hot topic), new approaches should focus more on the
integration of single cell profiles within the population dynamics, and in particular, how different single cells display heterogeneity-mediated emergent relationships with phenotypes under normal physiological and pathological conditions. The new approaches need to cover the following aspects:

1. We need a truly holistic and system-oriented approach. While further molecular characterization of agents (genes, regulatory elements, pathways, and different cellular parts) will continue, more effort should focus on system behavior, and in particular, the mechanism and dynamics of the emergent properties of the system. In systems biology, focusing only on the characterization of drastically increased numbers of parts and their distribution patterns is not a true system approach. In contrast, studying how “the topological relationship among genes” (the physical genetic interaction platform) defines the dynamics of network interaction and system behavior at higher levels (above agents) might be a key (Heng, 2013c; Heng et al., 2019).

2. Most platforms need to integrate the two key components of biological processes: genomics and evolution. Understanding why different cellular parts function in such complex ways requires an evolutionary understanding beyond molecular interactions. In this sense, evolutionary mechanisms can unify diverse molecular mechanisms. In disease studies, one promising approach is to watch evolution in action. When too many moving parts are involved, a better way is to analyze the final phenotype in the context of evolution.

3. We need to quantitatively compare the contribution of each type of genetic/epigenetic alteration in a given evolutionary process of a particular trait. Currently, different scientific groups often push the importance of their favorite types of variants, and there is a lack of studies illustrating which level is most important for which problem. Quantitative estimation of the contribution of the various levels is therefore important. This can be achieved by systematically comparing the gene, chromosome, and epigene in terms of their contribution to a given phenotype, using the same experimental system. With such knowledge, further integration of the correct levels of genetic/epigenetic organization becomes possible. For example, in the punctuated macrocellular phase, genome-level reorganization dominates (much more significant than gene mutations), whereas in the stepwise phase, gene mutation and epigenetic dysfunction play an obvious role.
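As a minimal illustration of what such a side-by-side comparison could look like, the sketch below ranks three levels of alteration by how much of the variation in one phenotype each explains on its own. Everything in it is an assumption made for illustration: the records are invented toy values, the field names (`gene_mutations`, `karyotype_changes`, `epigenetic_changes`, `phenotype`) and the helper functions are hypothetical, and the squared Pearson correlation is only one simple choice of measure, not a method taken from the studies cited here.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation: share of variance in y explained linearly by x alone."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series explains nothing
    return sxy * sxy / (sxx * syy)


def rank_levels(samples, phenotype_key="phenotype"):
    """Return (level, r_squared) pairs, highest single-level score first."""
    levels = [key for key in samples[0] if key != phenotype_key]
    y = [s[phenotype_key] for s in samples]
    scores = {lvl: r_squared([s[lvl] for s in samples], y) for lvl in levels}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy records: one row per clone measured in the same (hypothetical) system.
samples = [
    {"gene_mutations": 2, "karyotype_changes": 0, "epigenetic_changes": 1, "phenotype": 0.1},
    {"gene_mutations": 1, "karyotype_changes": 1, "epigenetic_changes": 3, "phenotype": 1.2},
    {"gene_mutations": 3, "karyotype_changes": 2, "epigenetic_changes": 2, "phenotype": 1.9},
    {"gene_mutations": 2, "karyotype_changes": 3, "epigenetic_changes": 1, "phenotype": 3.1},
    {"gene_mutations": 1, "karyotype_changes": 4, "epigenetic_changes": 3, "phenotype": 3.8},
    {"gene_mutations": 3, "karyotype_changes": 5, "epigenetic_changes": 2, "phenotype": 5.2},
]

ranked = rank_levels(samples)
for level, score in ranked:
    print(f"{level}: {score:.3f}")
```

A single-level score of this kind ignores interactions among levels, so it would serve at most as a first pass before any attempt at integration.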
d. New methods
Comparing the different levels of genetic/epigenetic contributions will lead us to adjust our rationale of always searching for the molecular methods with the highest resolution. It will allow us to realize that many methods that focus on the cellular level (with relatively lower molecular resolution) may be more appropriate for somatic evolutionary studies, where the unit of selection is a cell rather than a specific gene or molecular pathway. As we previously illustrated, pushing for higher molecular resolution will often lower the biological significance (chapter; debating Cancer). For example, monitoring karyotype changes will likely provide more predictive clinical information than DNA sequencing (Chapter 4). A similar argument can be made for studying the relationship between diseases and stress response pathways: despite the complex interactions among various stress response pathways, the most important outcome is the phenotype at the cellular level rather than at the level of a specific pathway (as so many pathways can contribute to a similar cellular phenotype). The evolutionary mechanism of cancer is the main rationale we have been pushing to unify the highly diverse molecular pathways (Heng et al., 2011b; Ye et al., 2009; Horne et al., 2015a-c; Heng, 2015, 2017a). Accordingly, more methods are needed to monitor the levels above genes or molecular pathways, such as the cellular, tissue, and individual patient levels. Karyotype analysis and its 3D representation/emergence represent a new frontier, for instance.

e. Big data, new challenges
Because of its importance in current genomics, the issue of big data generation and analysis needs to be separated from the above “new methods” category. In a departure from the traditional genetic era, recent large-scale genomic data require sophisticated computational tools to handle the overwhelming amount and variety of data. This challenges the ability of researchers to present their data in the most meaningful way. For example, what type and which portion of the large amount of data is best to report when this diverse information is often internally conflicting and does not make sense according to experimental assumptions? Accordingly, various bioinformatics platforms and mathematical models have been introduced. Now, there are high expectations for big data, as many excited biologists believe that big data will finally deliver, especially in overcoming the issue of high bio-heterogeneity and increasingly observed bio-nonspecificity. Despite the advances in today’s machine learning and the application of artificial intelligence to bio-data analysis, effective computational and bioinformatics platforms cannot replace
the theoretical framework or change the biological facts. Their success or failure entirely depends on correct biological concepts and data collection. Biologists should not simply depend on bioinformatics to reveal biological truths, as the validity of computational models does not equate to the reality of biological systems. A few key issues need the close attention of bioinformatics researchers.

1. For many models, the key assumptions and the projected goal do not have a solid biological basis. For example, when profiling a large number of cells, it is often assumed that cells of the same type share the same dominant source of variation. Similarly, when separating different clusters, it is assumed that each cluster is defined by a different small set of genes. Increased biological understanding has now challenged these assumptions.

2. More generally, most models aim to filter out bio-noise or heterogeneity, for which various statistical tools have been developed. Unfortunately, these efforts represent the wrong approach, as heterogeneity is the key feature of biological systems, particularly for the evolution of disease conditions.

3. Accordingly, simply increasing the sample size will neither solve the challenges of pattern recognition nor benefit individual patients. More heterogeneous data will not solve the issue of heterogeneity. In fact, it will worsen it, because the number of variants in the data set will increase as well. As long as new variants are continuously added to the database, no magical bioinformatics breakthrough can be expected to resolve the current crisis. Cancer genome sequencing is one such example.

4. There is a gap between the understanding of a population’s profile and the prediction of an individual. For example, listing most of the cancer gene mutations detected in a large number of breast cancer patients is one thing, but predicting the likelihood of a given individual acquiring breast cancer based on a few gene mutations is quite another.

5. “Chance” also contributes to biological processes, and it is even more difficult to predict. Because of the large number of elements involved, it is extremely challenging to predict the perfect storm.

6. Finally, it is necessary to remind bioinformatics researchers of the challenge of solving the three-body problem in science. In fact, most bio-systems display much more complexity than the classical three-body problem. Clearly, many bio-problems cannot be reduced to two-body problems. The complex interactions among large numbers of heterogeneous agents, and the
emergence of multiple system levels, will likely lead to nearly unlimited evolutionary potential.

Altogether, we need to acknowledge the limitations of the big data movement and put more effort into searching for a better biological framework. Bioinformatics will certainly speed up biological discoveries, as long as the data collection is at the correct level and the data presentation is based on solid bio-facts. Furthermore, genomics needs to integrate with other fields rather than simply relying on physics and mathematics. It is crucial to realize that because of the high level of heterogeneity and evolutionary processes, biological systems are as distinct from nonbiological systems as are the laws used to describe them. Although there are some successful examples of using mathematical models to predict evolution in laboratory settings, it is hard to apply these models to natural conditions. The search for a new framework should include the integration of various theories of evolution and complex systems. These important ideas include but are not limited to complex adaptive systems, collapsing chaos, network theory, self-organization, ordered heterogeneity, and various evolutionary concepts that focus on the dynamics and patterns of microevolution and macroevolution.

f. New materials or model systems
There has been over a century of genetic research using a dozen model systems (from Drosophila to yeast to mice) for the characterization of genetic parts (mostly used to understand genes and their functions). Now, at a new stage of data synthesis, it has become obvious that most of these simple-to-use systems differ from the reality of human systems, meaning that the knowledge obtained from these highly simplified model systems is difficult to apply to human systems, which contain a high degree of genomic and environmental heterogeneity (Heng, 2015). It is necessary to try to overcome such challenges and directly analyze human systems to the best of our ability, despite the difficulty involved. Yes, it is much easier to obtain highly reproducible results from model systems, but it will likely be more difficult to translate them into clinical meaning for real individuals. How to choose the most appropriate cellular system to analyze has also become important. Traditionally, to get consistent data, researchers have preferred homogeneous cellular populations (obtained by cell cloning). Now, knowing that outliers are the key to cancer evolution, the heterogeneous population instead becomes the key material for understanding the mechanism of cancer evolution.
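The practical consequence for data handling can be shown with a toy example. The sketch below applies a conventional z-score cutoff, a common cleaning step chosen here purely for illustration, to a hypothetical list of per-cell chromosome counts. All values and the function name `zscore_filter` are invented assumptions, not data or methods from any study.

```python
from statistics import mean, pstdev


def zscore_filter(values, threshold=3.0):
    """Split values into (kept, dropped) using a population z-score cutoff."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(values), []  # no spread, nothing to flag
    kept, dropped = [], []
    for v in values:
        target = dropped if abs(v - mu) / sigma > threshold else kept
        target.append(v)
    return kept, dropped


# Hypothetical population: mostly near-diploid cells plus two rare,
# near-tetraploid cells (invented numbers for illustration only).
chromosome_counts = [46] * 90 + [45] * 4 + [47] * 4 + [89, 92]

kept, dropped = zscore_filter(chromosome_counts)
print("cells kept:", len(kept))
print("cells dropped as outliers:", dropped)
```

In this toy population the filter removes only the two near-tetraploid cells, which are the very cells that an analysis of heterogeneous populations would want to keep.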
1.4.3 4D Genomics: the New Paradigm
By now, most readers should clearly see the growing conflicts in the current genomic field. On one hand, the mainstream research community declares that the victory of genomic medicine is around the corner: with impressive big data sets, precision medicine will finally be achieved by profiling genomic landscapes at the resolution of a single DNA base pair across the entire genome, coupled with the power of gene editing technology (such as CRISPR). Hopes are high: researchers plan to change the eating habits of insects by altering their DNA coding, cure most human diseases by DNA design/manipulation, and extend the human lifespan beyond our imagination. On the other hand, many fundamental limitations of genetics and genomics discussed in this chapter are equally obvious and hard to ignore, and they undeniably challenge the very basis of the genetic/genomic concepts we know. This is evident not only because many candid comments have come from highly respected scientists but also because the cases analyzed are easy to understand and agree with, especially through the lens of scientific revolution. Such growing conflicts (more will be discussed in the coming chapters) precisely reflect the collision between different “worldviews,” as suggested by Thomas Kuhn. Such a state of affairs in science calls for a paradigm shift. The genome theory represents an important “worldview” of future genomics. To articulate the genome theory, a genome evolution-based concept of inheritance, and its implications for biomedical science, the term “4D genomics” was proposed to distinguish it from the traditional “2D genetics” on which the gene theory is based. By combining 3D genome complexity with time, 4D genomics serves as the biological platform for passing genetic information and provides a selective landscape for evolution, including the somatic cell evolution that drives disease progression. The following quote illustrates the rationale for introducing 4D genomics:

Genes and genomes represent different levels of genetic organization with distinct genetic coding systems. According to the traditional gene theory, the DNA sequence codes for all the genetic information necessary for the life of an organism. Information transfers from DNA to RNA to proteins, and this exchange lies in the foundation of modern biology. However, under the genome theory, the information regarding assembly of parts is most likely not stored within the individual gene or genetic locus. DNA only encodes for the parts and some tools of the system (RNAs, proteins, regulatory elements). The complete interactive genetic network is coded by genome topology-mediated self-organization. Under genome theory, the genome is not merely the entire DNA sequence or the vehicle of all genes. Rather, the genome context or landscape (the genomic topologic relationship among genes and other sequences within three-dimensional nuclei) defines the genetic system and ensures system inheritance. Altered genomes yield altered genetic networks, and understanding the
pattern of genome dynamics provides key information to how the entire genomic system works. The concept of 4D-Genomics was formed based on the genomic reality that genetic information is preserved by the three-dimensional topology of the genome through time. This new concept calls for a departure from the less informative 1D gene-defined traditional genetics and recognizes that the genomic topology represents the framework for ‘system inheritance,’ which is distinctly different from ‘part inheritance’ (e.g., genetic information encoded in individual genes).
Horne et al., 2013a (with permission from Taylor & Francis)

By moving from the 1D gene to the 4D genome, the framework is changed not by a quantitative increase in the number of individual genes but by the introduction of an entirely new type of system inheritance with fuzziness, and by integrating how evolution works based on heterogeneous genotypes and their environmental interactions. This new adaptive paradigm requires many transformations, from attitudes and expectations to specific research strategies. Specifically, research priorities need to be drastically adjusted: from focusing on individual genes to analyzing genome dynamics, from mainly characterizing lower-level agents to monitoring the higher level of emergent properties (system behavior), from primarily tracing certain molecular pathways in isolation to profiling the evolutionary reality, which includes an almost unlimited combination of pathways, and finally, from appreciating only molecular certainty to embracing the real world with its inevitable mixture of certainty and uncertainty.