Medical Hypotheses (2001) 56(5), 646–652 © 2001 Harcourt Publishers Ltd doi: 10.1054/mehy.2000.1200, available online at http://www.idealibrary.com on
How many deleterious mutations are there in the human genome? J. A. Morris Consultant Pathologist, Royal Lancaster Infirmary, Lancaster, UK
Summary An estimate of the number of deleterious mutations in the human genome is made using data on the frequency of rare recessive disease in cousin marriages and in the general population. Sexual reproduction ensures that deleterious mutations are distributed at random in zygotes with an approximate Poisson distribution. The mean of this distribution is the sum of the mean number of deleterious mutations in zygotes which contribute to the next generation (Y) and the mean number of new mutations which arise in each human generation (X). The estimates are that X is between 1 and 2.6 and Y is between 12 and 32. A mathematical model based on redundancy is then used to predict how zygote survival will vary with the number of deleterious mutations. The form of this relationship is the same as that seen in experiments on cell survival following radiation-induced mutational damage and this provides independent support for this theoretical approach. The zygotes that survive to contribute to the next generation have a skewed distribution with a mean of Y. It is argued that the number of deleterious mutations in the genome is an important variable in health and disease. © 2001 Harcourt Publishers Ltd
INTRODUCTION A classical concept in genetics is that the integrity of the genome is maintained by strong selective pressure acting against deleterious mutations. There is, however, convincing evidence that the mean number of new deleterious mutations entering the human genome per generation is greater than one (1–4). This means that selection must operate against combinations of mutations rather than single mutations. If strong selective pressure did operate against single deleterious mutations and more than one deleterious mutation arises per human generation it is difficult to see how the human race could survive (5). The use of mathematical models to examine the relative benefits of sexual and asexual reproduction has led to the same conclusion (1,6). Sexual reproduction, which is universal in complex biological organisms, has an
Received 9 June 2000 Accepted 5 September 2000 Correspondence to : Professor J. A. Morris MA, MB, BChir, FRCPath, Consultant Pathologist, Royal Lancaster Infirmary, Ashton Road, Lancaster LA1 4RP, UK
important advantage over asexual reproduction in that deleterious mutations which are present in one generation can be absent from the next generation. The mathematical models show, however, that this only offers a survival advantage if single deleterious mutations have little effect on fitness and strong selection only operates against combinations of mutations which show synergistic interaction. This relationship between the number of deleterious mutations and survival can be understood in terms of biological complexity and redundancy (7,8). Mammals are complex biological systems and all complex systems need a high level of redundancy in order to function. If genes code for the components of a redundant system, however, a single deleterious mutation will have a negligible effect and there will be no selective pressure against that mutation. In redundant systems selection can only operate against combinations of mutations. Furthermore in redundant systems successive deleterious mutations interact synergistically. These ideas have important biological implications. If the successive build-up of deleterious mutations in the genome progressively impairs function and eventually causes genetic death, there must be an intermediate state in which performance is impaired short of a fatal outcome.
This will manifest as polygenic disease in which a large number of different combinations of mutations in a redundant system can lead to disease rather than a few specific mutations. Since sexual reproduction ensures that the number of deleterious mutations in the genome of the next generation is a random variable, this number could be an important determinant of health and disease. In this paper an attempt is made to estimate the number of deleterious mutations in the human genome and relate this number to the probability of zygote survival using a mathematical model based on redundancy. The cell survival curves derived theoretically are strikingly similar to those seen experimentally in work on radiation-induced mutations in vivo and in vitro (9). REDUNDANCY IN THE GENOME Biological organisms are highly complicated, highly redundant information processing systems coded by genes. The first and last parts of this statement are beyond dispute. Biological organisms are highly complex structures and this certainly applies to advanced life forms such as mammals. The information for this complexity is coded in the genome. The concept that biological organisms are information-processing systems is more contentious (10–13), but there is no doubt that the brain processes information and that the interaction between micro-organisms and the immune system can be analysed in terms of information theory (11). In more general terms, biological behaviour can be analysed as a series of reactions (decisions) made in an uncertain world, based on information (evidence) processed in noise and subject to error (10, 11). It is the concept of redundancy, however, which is central to the argument in this paper. If a complex system is made from fallible components there is a risk of error; as complexity is increased the risk of error will rise. The way to reduce error is to introduce redundancy, ie for every key component that might fail there is a back-up system.
As systems become more and more complicated, higher levels of redundancy are required to maintain function over a specified period of time. Mathematical models indicate that thereafter performance will slowly decline, ie the system ages gracefully (13). These ideas are best expressed and analysed in terms of information theory which leads to the following general principles: 1. information systems have a finite capacity; 2. information is processed in noise and subject to error; 3. the components of information processing systems are subject to the laws of entropy and performance will deteriorate with time; 4. redundancy reduces the error rate and, in general, the more complex the system and the more important the task, the higher the level of redundancy required.
A consequence of redundancy is that deleterious mutations will accumulate in the genome. In a highly redundant system a deleterious mutation in a single gene will have a negligible effect and selection can only operate against combinations of mutations. In sexual reproduction the deleterious mutations are distributed at random into gametes and the gametes then fuse at random to form zygotes. The result is that the number of deleterious mutations in zygotes is a random variable which will approximate to a Poisson distribution. A mathematical model of this process has been published previously (7). Assume that there are N separate highly redundant systems, the i-th of which carries $n_i$ deleterious mutations. If $n_i$ is less than or equal to a small number b ($n_i \le b$) there is no measurable effect on function. If $n_i$ is equal to or greater than a slightly larger number c ($n_i \ge c$) the zygote will not develop. If $n_i$ is greater than b but less than c ($b < n_i < c$) the zygote develops but there is some measurable effect on function which is expressed as polygenic disease or polygenic-induced impairment of function either during development or later in life. In each generation new deleterious mutations arise in germ cells and are passed on to the next generation of zygotes. Let Y equal the mean number of deleterious mutations in the zygotes of individuals who subsequently contribute to the next generation. Let X equal the mean number of new deleterious mutations which arise in germ cells and are passed to the next generation of zygotes. The mean number of deleterious mutations in zygotes will be Y + X. But the zygotes with the most mutations (ie those in which $n_i \ge c$ in at least one system) will not develop, and the mean number of deleterious mutations in the zygotes which do develop and contribute to the next generation will be Y + X − k, where k is the reduction in the mean brought about by this selection. The system is in balance if X = k.
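The threshold rule described above can be illustrated with a small Monte Carlo sketch. It is not the published model of reference 7: it assumes that mutations fall on the N systems uniformly at random, borrows N = 22 and c = 4 from the Figure 2 example, and uses b = 2 purely as a hypothetical value, since the text does not specify b. The function names are illustrative only.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw one Poisson variate by Knuth's multiplication method (adequate for modest means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def classify_zygote(total_mutations, n_systems, b, c, rng):
    """Scatter the mutations over the systems at random and apply the b and c thresholds."""
    loads = [0] * n_systems
    for _ in range(total_mutations):
        loads[rng.randrange(n_systems)] += 1
    worst = max(loads)
    if worst >= c:
        return "fails"      # at least one system has n_i >= c
    if worst > b:
        return "impaired"   # b < n_i < c in at least one system, none at or above c
    return "normal"         # every system has n_i <= b

def simulate(mean_load=25.0, n_systems=22, b=2, c=4, trials=20000, seed=1):
    """Return the fraction of zygotes in each class (b = 2 is a hypothetical choice)."""
    rng = random.Random(seed)
    counts = {"normal": 0, "impaired": 0, "fails": 0}
    for _ in range(trials):
        total = sample_poisson(mean_load, rng)
        counts[classify_zygote(total, n_systems, b, c, rng)] += 1
    return {name: n / trials for name, n in counts.items()}

fractions = simulate()
print(fractions)
```

Under these assumptions roughly half of the simulated zygotes fail to develop, and most of the survivors carry at least one system in the intermediate range, which is the class the text associates with polygenic impairment.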
DELETERIOUS MUTATIONS IN THE GENOME An estimate of the frequency of any single recessive gene in the general population is $Y/P$, where P is the number of genes in the human diploid genome. The frequency of any single recessive disease is then $(Y/P)^2$ and the total frequency of recessive disease in liveborn children is $\frac{P}{2}\left(\frac{Y}{P}\right)^2 = \frac{Y^2}{2P}$. The frequency of recessive diseases in the general population is approximately 2.5 per 1000 live births (14) and P is of the order of $1.2 \times 10^5$ (4, 5). Thus
$$Y^2 = 2.5 \times 10^{-3} \times 2.4 \times 10^{5}, \qquad Y^2 = 600, \qquad Y = 24.5$$
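As a check on the arithmetic, the first estimate can be reproduced in a few lines. This is only a sketch of the calculation stated above; the function name is illustrative and the inputs are the figures quoted in the text.

```python
import math

def mean_load_from_recessive_frequency(disease_freq, genes_diploid):
    """Solve Y^2 / (2P) = disease_freq for Y."""
    return math.sqrt(2.0 * genes_diploid * disease_freq)

# 2.5 recessive cases per 1000 live births, P = 1.2e5 genes in the diploid genome
Y = mean_load_from_recessive_frequency(2.5e-3, 1.2e5)
print(round(Y, 1))  # 24.5
```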
There are a number of factors, however, which make this approach too simple. It does not allow for the fact
that some recessive conditions are lethal in utero and some reduce the chance of live birth. It is also unlikely that case ascertainment of recessive disease in liveborn children is complete. Both of these factors will lead to an underestimate of Y in the above analysis. In fact the measured frequency of any single recessive disease will be $(Y/P)^2 f$, where f is a factor between 0 and 1 which reflects selection against the recessive disease in utero and underascertainment of cases after birth. There is also an important factor working in the opposite direction leading to overestimation of Y. A few recessive genes have risen to a high frequency in the general population due to genetic drift or heterozygote advantage and these conditions contribute disproportionately to the total number of recessive cases. If a recessive deleterious mutation doubles in frequency due to genetic drift the frequency of the recessive disease rises fourfold and as a result the majority of recessive disease is due to a relatively few conditions. If a rare recessive disease is defined as one in which the recessive gene has not risen due to genetic drift or heterozygote advantage then the totality of rare recessive conditions sum to less than 2.5 per 1000 live births.
Another approach is to consider the frequency of recessive disease in the general population and in consanguineous matings. The frequency of a single rare recessive disease in the general population is $(Y/P)^2 f$. (It is worth noting that the frequency of the recessive gene in zygotes is $(Y + X)/P$, the frequency in individuals who contribute to the next generation is $Y/P$, but recessive disease is measured as cases per live born child. Since most selection occurs in utero the frequency in liveborn children is closer to $Y/P$ than to $(Y + X)/P$.)
Consider the offspring of a cousin marriage shown in Figure 1. The probability that a specific rare recessive gene is present in generation A is $4Y/P$ (there are four copies of the gene and each has a probability of $Y/P$ of being a deleterious mutant). If a specific rare recessive gene is present in generation A the probability that it is present in B1 is $\frac{Y}{Y+X}\cdot\frac{1}{2}$. Half of the zygotes produced by generation A will contain the mutant gene, but there is further selection reducing the mean number of mutant genes from Y + X to Y. The probability that the recessive gene is passed from A → B1 → C1 → D is $\frac{Y^3}{(Y+X)^3}\cdot\frac{1}{2^3}$. The probability that the mutant gene is passed A → B1 → C1 → D and A → B2 → C2 → D to give a recessive disease is $\frac{Y^6}{(Y+X)^6}\cdot\frac{1}{2^6}$. Thus the measured frequency of a specific rare recessive in the child of a cousin marriage is
$$\frac{4Y}{P}\cdot\frac{Y^6}{(Y+X)^6}\cdot\frac{1}{2^6}\,f$$
Recessive diseases are more common in the offspring of cousin marriages than in the general population. For the more common recessive conditions such as albinism and phenylketonuria approximately 10% of cases are the products of first cousin marriages. For rare recessive conditions, however, such as microcephaly, over 50% occur in cousin marriages (15). In Western Europe and the USA less than 0.5% of marriages involve first cousins, although the frequency was higher in the recent past in some isolated rural areas (16). Therefore for a specific rare recessive disease
$$\frac{\text{freq in cousins}}{\text{freq in gen pop}} = \frac{\dfrac{4Y}{P}\cdot\dfrac{Y^6}{(Y+X)^6}\cdot\dfrac{1}{2^6}\,f}{\dfrac{Y^2}{P^2}\,f} = \frac{99.5}{0.5}$$
This simplifies to
$$\frac{(Y+X)^6}{Y^5} = 38 \qquad (1)$$
There is experimental evidence that X > 1 (3, 4, 5) and there are no solutions to the above equation if X > 2.6. Thus X is between 1 and 2.6 and Y between 1 and 32. The range of Y can be narrowed further by considering the frequency of recessive disease in the offspring of incestuous matings. Over 50% of the live born offspring of brother–sister matings have severe disease, including mental retardation and congenital abnormality as well as defined recessive conditions. The probability of avoiding recessive disease in the offspring of a brother/sister union is given by the following function:
$$\left(1 - \frac{Y^4}{(Y+X)^4}\cdot\frac{1}{2^4}\right)^{2Y}$$
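The two constraints on Y can be explored numerically. The sketch below assumes that equation (1) has the form (Y + X)^6 / Y^5 = 38 and that the brother/sister function is (1 − Y^4 / ((Y + X)^4 · 2^4))^(2Y); both are readings of the printed formulas, and the function names are illustrative. The left-hand side of equation (1) has its minimum at Y = 5X, so roots are sought by bisection on either side of that point.

```python
RATIO = 38.0  # right-hand side of equation (1), as quoted in the text

def eq1_lhs(y, x):
    """Left-hand side of equation (1): (Y + X)^6 / Y^5."""
    return (y + x) ** 6 / y ** 5

def max_x(ratio=RATIO):
    """Largest X with a solution: the minimum over Y is (6X)^6 / (5X)^5 = X * 6^6 / 5^5."""
    return ratio * 5 ** 5 / 6 ** 6

def solve_eq1(x, ratio=RATIO):
    """Return the (low, high) roots in Y for a given X, or None if there are none."""
    y_min = 5.0 * x
    if eq1_lhs(y_min, x) > ratio:
        return None

    def bisect(lo, hi):
        # eq1_lhs - ratio has opposite signs at lo and hi
        f_lo = eq1_lhs(lo, x) - ratio
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            f_mid = eq1_lhs(mid, x) - ratio
            if (f_mid > 0) == (f_lo > 0):
                lo, f_lo = mid, f_mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return bisect(1e-6, y_min), bisect(y_min, 1e4)

def sib_mating_unaffected(y, x):
    """Chance that a child of a brother/sister union avoids recessive disease."""
    per_gene = (y / (y + x)) ** 4 / 2 ** 4
    return (1.0 - per_gene) ** (2.0 * y)

for x in (1.0, 2.0, 2.5):
    low, high = solve_eq1(x)
    print(x, round(low, 1), round(high, 1),
          round(sib_mating_unaffected(low, x), 2),
          round(sib_mating_unaffected(high, x), 2))
print("largest X with a solution:", round(max_x(), 2))
```

With these readings, X = 2 gives roots near 5 and 23, and the largest admissible X comes out close to 2.5, which is consistent with the statement that no solutions exist above 2.6. Only the larger root gives a brother/sister value below 0.5.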
Fig. 1 D is the offspring of a cousin marriage between C1 and C2. A deleterious mutation in either of the great-grandparents in generation A can cause a recessive disease in D.
If Y is less than 12 this function takes values above 0.5 and is inconsistent with the above data. Thus the estimate of Y can be narrowed to between 12 and 32. For each value of X in equation 1 there are two possible values of Y. If X = 2, Y = 23 or Y = 5. In the following sections a working estimate of X = 2 and Y = 23 is used to simplify the presentation. In Drosophila melanogaster, 25% of protein coding genes are lethal recessives (17). This means that if the gene product is absent due to deleterious mutations in both genes of an allelic pair, then the organism does not develop. In one sense this is the opposite of redundancy in that the gene product is essential and its absence is incompatible with survival. The occurrence of two copies of the gene is, however, an example of redundancy as both must be lost before the organism dies. Let us assume that the same applies to humans, 25% of protein coding genes are lethal recessives and there is no selection against a deleterious mutation in the heterozygote. Thus for this subset of the genome the heterozygous condition is neutral and the homozygous condition is lethal. To simplify the calculations let us also assume that the human genome has 60 000 protein coding genes in the haploid state and two new deleterious mutations enter the human genome at random in each generation. The frequency of deleterious mutations in the population will rise until the prevalence is approximately 1 in 250 deleterious mutations per gene in the lethal subset under consideration. At this level in a population of 60 000 children two new mutations will arise in each gene and two will be lost as one child will die due to the specific lethal recessive disease caused by absence of the gene product. Each individual would have an average of 120 deleterious mutations (frequency of 1 in 250 in 25% of the genome). If this were the case there is only a 0.02 chance that a child of a cousin marriage would survive. 
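The figures in this thought experiment can be re-derived as follows. The sketch assumes simple mutation and selection balance for a fully recessive lethal, and it assumes that the survival figure for a cousin marriage is obtained by requiring that none of the mutations carried by the pair of great-grandparents becomes homozygous in the child, each with probability 1/2^6. The second step is a guess at the route taken, not something the text spells out.

```python
import math

GENE_LOCI = 60_000       # protein-coding genes, haploid count (from the text)
LETHAL_FRACTION = 0.25   # share of genes assumed to be recessive lethals
NEW_MUTATIONS = 2        # new deleterious mutations per child

def equilibrium_frequency(loci=GENE_LOCI, new_mutations=NEW_MUTATIONS):
    """Mutant frequency q per gene copy at which gains balance losses.

    Gain per locus per child: new_mutations / loci.
    Loss per locus per child: 2 * q^2 (a homozygous death removes two copies).
    """
    return math.sqrt(new_mutations / (2.0 * loci))

def lethal_load_per_person(q, loci=GENE_LOCI, lethal_fraction=LETHAL_FRACTION):
    """Mean number of recessive lethal mutations carried by one diploid individual."""
    return 2 * loci * lethal_fraction * q

def cousin_child_survival(load_per_person):
    """Assumed route: no mutation from the great-grandparent pair is homozygous in the child."""
    carried_by_ancestors = 2 * load_per_person
    return (1.0 - 0.5 ** 6) ** carried_by_ancestors

q = equilibrium_frequency()
load = lethal_load_per_person(q)
print(round(1 / q), round(load), round(cousin_child_survival(load), 3))
```

The unrounded values land close to the rounded ones used in the text (about 1 in 250, about 120 mutations per individual, and a survival chance of about 0.02).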
In fact fertility in cousin marriages would be vastly reduced and there is no evidence that this is the case. Thus in recessive disease which is lethal or causes severe disease there must be selection against the heterozygote to reduce the frequency of deleterious mutations from 1 in 250 to something closer to 1 in 6000 as calculated in the current paper. Selection must operate by the deleterious mutation in the heterozygote interacting with other deleterious mutations, ie the system is highly redundant. It is easy to see that this could occur during development as imprinting results in gene products of allelic pairs having different functions. The converse of a lethal recessive is a deleterious mutation which even in the recessive state does not cause disease or any measurable disadvantage. If there is no selection against the heterozygote or the homozygote then the gene will eventually disappear and the redundancy model does not apply. If the gene does survive then single mutations or perhaps the homozygous state
must interact with other deleterious mutations to reduce fitness. In that case the redundancy model does apply, as in the case of lethal recessives, and the same process will reduce the frequency of deleterious mutations. Thus the estimate in this paper applies to the great majority of protein coding genes, whether or not the recessive condition leads to severe disease. SELECTION During sexual reproduction deleterious mutations are distributed into gametes at meiosis according to a random binomial process. The gametes then fuse at random to produce zygotes. The result of this stochastic process is that the distribution of deleterious mutations in zygotes will be reasonably close to a Poisson distribution with a mean of Y + X. It is, however, possible that factors such as sexual selection could modify the distribution. If the selection of a marriage partner is related to the number of deleterious mutations in the genome, and there are strong theoretical reasons why this should be so, the resulting distribution of mutations in zygotes in the general population will be a modified Poisson distribution with an increased variance. The process of zygote selection, according to the mathematical model based on redundancy, is that zygotes will fail to develop if by chance any one redundant genetic system has c or more mutations ($n_i \ge c$). Figure 2 shows the distribution of mutations in zygotes assuming a Poisson distribution with a mean of 25 (Y + X = 25). The distribution of mutations in zygotes selected to contribute to the next generation is also shown. This distribution is generated using the mathematical model described in the section on redundancy with N = 22 and c = 4. The distribution is skewed with a mean of 23. In this example 53% of zygotes are selected by this process.
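The quoted figures for this example can be checked with a short exact calculation. The sketch assumes that a Poisson total with mean Y + X spread uniformly over N equal systems gives each system an independent Poisson count with mean (Y + X)/N, and that a zygote survives only if every system has fewer than c mutations. The function names are illustrative.

```python
import math

def poisson_cdf(k, mean):
    """P(n <= k) for a Poisson variable with the given mean."""
    return math.exp(-mean) * sum(mean ** j / math.factorial(j) for j in range(k + 1))

def selection_summary(total_mean=25.0, n_systems=22, c=4):
    """Return (fraction of zygotes selected, mean mutation count among those selected)."""
    per_system = total_mean / n_systems
    ok = poisson_cdf(c - 1, per_system)   # one system carries fewer than c mutations
    survival = ok ** n_systems            # all systems pass independently
    # mean of a Poisson truncated to n <= c-1 is mean * P(n <= c-2) / P(n <= c-1)
    surviving_mean = n_systems * per_system * poisson_cdf(c - 2, per_system) / ok
    return survival, surviving_mean

survival, surviving_mean = selection_summary()
print(round(survival, 2), round(surviving_mean, 1))
```

Under these assumptions the selected fraction is close to 0.53 and the mean among survivors is close to 23, so selection removes about two mutations per generation, matching X = k with X = 2.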
This estimate is of the right order for humans, since the chance of conception leading to a term delivery in any one reproductive cycle is approximately 0.25 for individuals engaging in regular (two to three times per week) unprotected sexual intercourse. Thus in any one cycle there is perhaps a 0.5 chance of a zygote forming and a further 0.5 chance of the zygote proceeding to term. Figure 3 shows the relationship between the probability of zygote survival and the mean number of deleterious mutations in the population. This theoretical curve is the same as that seen experimentally, both in vivo and in vitro, when cell survival is related to radiation-induced cell damage (9). Since radiation induces mutations at random in the genome it is clear why the curves should be the same. The form of this curve, with a shoulder followed by a linear fall, has troubled theorists for many years. It is possible to describe the curve mathematically
as a combination of two functions ie a linear quadratic curve. But the biological basis has remained unclear and controversial. Most attempts at a theoretical explanation are based on target theory, but they involve a phase in which multiple targets need to be hit to explain the shoulder, followed by a phase in which only a single target needs to be hit to explain the linear part of the curve. The redundancy model, however, provides a natural unitary explanation without the need for any arbitrary changes of phase.
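The shape of the survival curve can be tabulated from the same redundancy model. This is a sketch under the assumption that each of the N = 22 systems receives an independent Poisson number of mutations and fails at c = 4 or more; it is not the author's plotting code, and the function name is illustrative.

```python
import math

def survival_probability(mean_mutations, n_systems=22, c=4):
    """Probability that every system carries fewer than c mutations."""
    per_system = mean_mutations / n_systems
    ok = math.exp(-per_system) * sum(
        per_system ** j / math.factorial(j) for j in range(c)
    )
    return ok ** n_systems

# log10 of survival against the mean number of mutations, in steps of 10
for m in range(0, 71, 10):
    print(m, f"{math.log10(survival_probability(m)):.2f}")
```

On a log scale the values barely move over the first ten mutations and then fall by more than a decade per ten mutations at the high end, which is the shoulder followed by a near-linear fall described in the text. A single expression produces both phases.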
DISCUSSION The human diploid genome has $6.4 \times 10^9$ bp of DNA of which $1.6 \times 10^8$ bp code for proteins (3). The total number of protein-coding genes is between $1 \times 10^5$ and $1.6 \times 10^5$ (4, 5) and these form only a tiny fraction of the total genome. The function of the rest of the genome is a matter of dispute. Some of the DNA provides physical separation between coding elements, some has regulatory function, some might have essential function in the interaction between DNA strands in mitosis and meiosis, and some might have no function (so-called junk DNA) (18). If there are 23 deleterious mutations in the protein coding fraction of the genome then the total number of deleterious mutations in the entire genome could be many multiples of 23. There are, however, considerable difficulties if this is true. For instance if Y = 230 and X = 20 the degree of zygote selection as shown in Figure 2 becomes too extreme to be credible. There is, however, another way of looking at this problem using ideas from target theory in radiobiology. Genes in cells with a small genome are much more resistant to radiation-induced
mutation than similar genes in cells with a larger genome. In fact there is a linear relationship between sensitivity to radiation damage and genome size which covers many orders of magnitude from bacteria to mammals (19). One explanation is that the target for damage to a gene is much larger than the protein coding part but also includes regulatory elements and other elements of noncoding DNA. In a sense each gene has a hinterland which is much larger in the larger genome. The value of 23 deleterious mutations could therefore apply not only to protein coding genes but also to the entire genome. The values of Y=23 and X=2 should be regarded as working estimates. There is evidence that X lies between 1 and 2.6. The estimated value of Y, between 12 and 32, depends on a number of assumptions, one of which is the number of genes in the genome, which is estimated at between 50 000 and 80 000 per haploid genome. The frequency of cousin marriages and the percentage of rare recessive diseases which occur in cousin marriages are also estimates from the literature which are subject to some variance. There is also doubt about the frequency of recessive disease in the children of incestuous unions but in this case the value of 0.5 is probably an underestimate as it does not allow for death in utero. There is, however, some independent support for the values Y = 23 and X = 2, in that the probability of conception in a reproductive cycle fits with the selection criteria in Figure 2. The assumption that biological organisms are complex leads directly to the conclusion that genes must code for redundant systems. This is a theoretical conclusion but one which is difficult to refute. In addition, there is some direct evidence that mammalian systems are redundant. This applies to antibody diversity, T cell diversity and
Fig. 2 The probability distribution of deleterious mutations in zygotes (mean = Y + X = 25) and in the subset of zygotes which develop into individuals who contribute to the next generation (mean = Y = 23). Horizontal axis: number of deleterious mutations (0 to 40); vertical axis: frequency (0 to 0.1).
cytokine action (20). There is also evidence that the genomes of simple organisms, such as Saccharomyces cerevisiae, have a high level of redundancy (21). The model of redundancy used in this paper involves some simplification. Single deleterious mutations can have an observable effect as seen with dominant disease, and it is likely that deleterious mutations in general will vary from a measurable effect to no effect rather than all having a negligible effect. The idea that the genome can be divided into N independent systems is also a useful idea but again it is likely that systems overlap and interact. The number of mutations which lead to cell death also probably varies between systems and is not a constant. The redundancy model should, therefore, be regarded as a simple working model and it should be judged, as is always the case with mathematical models, on two counts: does it preserve the essence of the problem? and is it useful? The theoretical curve plotted in Figure 3 indicates that the model is
useful. The curve is precisely that found when cell survival is measured in response to radiation-induced mutation. As the dose of radiation is increased and more mutations are induced the probability of cell survival falls. There are two components on a log scale, an initial shoulder and then an approximate linear fall. This curve has caused considerable theoretical difficulty and hitherto there has been no single biological explanation which predicts both components of the curve. An explanation based on target theory is that multiple hits are required initially but then a single hit suffices to cause cell death, but the transition between the phases is arbitrary. The redundancy model, however, gives a unitary explanation for both phases of the curve and this is derived from first principles. There are important biological implications which follow from the idea of redundancy in the genome. A number of these implications have been explored in previous publications (7, 8, 22). The number of deleterious mutations in the genome is a random variable and in the population as a whole the number approximately doubles between the 2.5 and 97.5 percentiles. If the number of mutations in any one system exceeds a critical value that individual will not contribute to the next generation. In those individuals who do survive, systems in which there are several deleterious mutations will be compromised and their performance will be sub-optimal. The more deleterious mutations there are in the genome the more likely it is that one or more systems will be compromised in this way. The prediction is, therefore, that the number of deleterious mutations will correlate with polygenic disease and all aspects of impaired development, impaired performance and even impaired physical symmetry (7, 8, 22).
Fig. 3 The relationship between the probability of zygote survival and the mean number of deleterious mutations in the genome. This curve is derived using the mathematical model based on redundancy with N=22 and c=4. Horizontal axis: number of deleterious mutations (0 to 70); vertical axis: probability of survival on a log scale (1 to 10^-4).
REFERENCES
1. Kondrashov A. S. Sex and deleterious mutation. Nature 1994; 369: 99–100.
2. Crow J. F. The high spontaneous mutation rate: is it a health risk? Proc Nat Acad Sci USA 1997; 94: 8380–8386.
3. Drake J. W., Charlesworth B., Charlesworth D., Crow J. F. Rates of spontaneous mutation. Genetics 1998; 148: 1667–1686.
4. Eyre-Walker A., Keightley P. D. High genomic deleterious mutation rates in hominoids. Nature 1999; 397: 344–347.
5. Crow J. F. The odds of losing at genetic roulette. Nature 1999; 397: 293–294.
6. Redfield R. J. Male mutation rates and the cost of sex for females. Nature 1994; 369: 145–147.
7. Morris J. A. Genetic control of redundant systems. Med Hypotheses 1997; 49: 159–164.
8. Morris J. A. Fetal origins of maturity-onset diabetes mellitus: genetic or environmental cause? Med Hypotheses 1998; 51: 285–288.
9. Nias A. H. W. An Introduction to Radiobiology. Chichester, England: John Wiley and Sons, 1990.
10. Green D. M., Swets J. A. Signal detection theory and psychophysics. London: John Wiley and Sons, 1966.
11. Morris J. A. Autoimmunity: a decision theory model. J Clin Pathol 1987; 40: 210–215.
12. Morris J. A. The age incidence of multiple sclerosis: a decision theory model. Med Hypotheses 1990; 32: 129–135.
13. Morris J. A. Ageing, information and the magical number seven. Med Hypotheses 1992; 39: 291–294.
14. Weatherall D. J. The New Genetics and Clinical Practice. Oxford: Oxford University Press, 1991.
15. Reed S. Counselling in Medical Genetics, 3rd edn. New York: Alan Liss, 1980.
16. Fraser Roberts J. A. An Introduction to Medical Genetics. Oxford: Oxford University Press, 1967.
17. Ashburner M., Misra S., Roote J. et al. An exploration of the sequence of a 2.9-Mb region of the genome of Drosophila melanogaster: the Adh region. Genetics 1999; 153: 179–219.
18. Lewis B. Genes. Oxford: Oxford University Press, 1997.
19. Abrahamson S., Bender M. A., Conger A. D., Wolff S. Uniformity of radiation-induced mutation rates among different species. Nature 1973; 245: 460–462.
20. Roitt I. Essential Immunology. Oxford: Blackwell Sciences, 1997.
21. Oliver S. G. From DNA sequence to biological function. Nature 1996; 379: 597–600.
22. Morris J. A. Information and redundancy: key concepts in understanding the genetic control of health and intelligence. Med Hypotheses 1999; 53: 118–123.