J. Mol. Biol. (1990) 213, 203-206
Frequency of Abnormal Human Haemoglobins Caused by C→T Transitions in CpG Dinucleotides
M. F. Perutz
MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, U.K.

(Received 17 October 1989; accepted 17 January 1990)

A large part of human genetic disease apparently arises from deamination of cytosine residues in methylated CpG dinucleotides. Their mutation rate is known to be high when C is present as 5-methyl-cytosine, but is believed to be normal when it is unmethylated. The β-globin gene contains five, the γ-globin gene two, and each of the α-globin genes contains 35 CpG dinucleotides. The CpG dinucleotides in the β and γ-globin genes are methylated, while those in the α-globin genes are under-methylated. One would therefore have expected the CpG dinucleotides to be a frequent source of mutations in the β and γ-globin genes, but not in the α-globin genes. In fact, the evidence points to CpG dinucleotides being a frequent source of mutations in both the α and β-globin genes. This suggests either that the mutation rates of both methylated and unmethylated CpG dinucleotides are abnormally high, which conflicts with published evidence, or that there is a finite chance of some of these in the α-globin genes of certain individuals being methylated and therefore subject to mutation.
0022-2836/90/100203-04 $03.00/0 © 1990 Academic Press Limited

Sinsheimer (1955) appears to have been the first to draw attention to the under-representation of the dinucleotide CpG in mammalian DNA after he found that an enzymatic digest of calf thymus DNA contained only a low proportion of CpG dinucleotides and no 5-methyl CpG dinucleotides. His work was followed up by Schwartz et al. (1962), who showed that CpG dinucleotides occurred in the DNA of animal tissues at no more than one-third of the frequency expected from the base composition if the distribution of dinucleotides were random, while the GpC dinucleotides occurred at the expected frequency. After Roy & Weissbach (1975) had isolated a methylase from HeLa cells that specifically methylated CpG dinucleotides, Coulondre et al. (1978) found a methylase in Escherichia coli that methylates the second cytosine in the sequence CCAGG. In the lacI gene of E. coli these 5-methyl-cytosines acted as "hotspots", because deamination converted the 5-methyl-cytosines to thymines, which then paired with adenines. In a methylase-deficient mutant of E. coli these hotspots were absent, apparently because uracil derived from deamination of unmethylated cytosine was excised by uracil-DNA-glycosidase and replaced by cytosine, while thymine derived from deamination of 5-methyl-cytosine was not excised. Recently, Jones et al. (1987) discovered that E. coli do in fact have a system that specifically repairs T·G mismatches derived from deamination of CpG dinucleotides and replaces them by C·G pairs. They speculate that Coulondre et al. (1978) did not observe that repair, because they used F-lac episomes that were present in large numbers in the E. coli cell, and the number of repair enzymes in each cell was too small to cope with them.

In mammalian cells mismatched T·G pairs are subject to excision repair that is biased in favour of C·G: when circular simian virus 40 (SV40) DNA containing one mismatched T·G pair was cloned in monkey kidney cells, most of the mispairs were corrected to C·G rather than T·A pairs (Brown & Jiricny, 1987, 1988; Wiebauer & Jiricny, 1989). In monkey kidney cells deamination of 5-methyl-cytosine in a (5-mC·G/G·5m-C) pair to (T·G/G·5m-C) is followed by mismatch repair that preserves the methylated strand, thus ensuring restoration of the original (C·G/G·5m-C) pair (Hare & Taylor, 1985).

All the same, the CpG dinucleotide is present in mammalian genomes at only about 20% of its expected frequency, and the TpG/CpA dinucleotide occurs in a corresponding excess, thus testifying to a net loss of CpG during evolution (Ohno, 1988). That loss is not entirely random. Salser (1977) found 22 CpG dinucleotides in the mRNA for rabbit α-globin, but only three in the mRNA for β-globin. He suggested that most CpG dinucleotides might have been eliminated from the β-globin gene because they were methylated and therefore formed hotspots for mutations. This prediction was borne
Table 1
Nucleotide base changes caused by deamination of 5-methyl-cytosine in CpG dinucleotides

                    Wild-type    Mutant
DNA, coding         CpG          TpG       (CpA)
DNA, non-coding     CpG          (CpA)     TpG
mRNA                CpG          CpA       UpG
out by the discovery of methylated CpG dinucleotides in the human β and γ-globin genes. On the other hand, the human α1 and α2-globin genes and many other regions of mammalian chromosomes contained islands of unmethylated CpG dinucleotides where the frequency of this dinucleotide was higher than expected from the base composition (Bird et al., 1985; Bird, 1986; Brown & Bird, 1986; Bird et al., 1987). The human α1 and α2-globin genes each contain 35 CpG dinucleotides, compared to 26 expected on a random basis, while the β-globin gene contains only five and the γ-globin gene only two, consistent with the generally accepted notion that methylated CpG dinucleotides have been lost during evolution, while unmethylated ones have been preserved. If only methylated CpG dinucleotides gave rise to hotspots, then the amino acid substitutions to be expected from C→T transitions should have shown up frequently among the human haemoglobins with abnormal β or γ-chains, but only rarely among those with abnormal α-chains. I set out to discover if this is really true.
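The substitutions expected from C→T transitions follow mechanically from the mRNA changes listed in Table 1, and can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used for the tables: the function names `translate` and `cpg_substitutions` are hypothetical, "coding" and "non-coding" are used in the sense of Table 1 (deamination in the coding strand turns CpG in the mRNA into CpA, deamination in the non-coding strand turns it into UpG), and the input is assumed to be an in-frame mRNA fragment.

```python
# Standard genetic code, built from the canonical U, C, A, G ordering of codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate an in-frame mRNA fragment into one-letter amino acid codes."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    return [CODON_TABLE[mrna[k:k + 3]] for k in range(0, usable, 3)]


def cpg_substitutions(mrna):
    """List amino acid changes caused by C->T transitions at each CpG.

    Returns tuples (strand, codon_index, wild_type, mutant), where
    codon_index is 0-based within the fragment. Silent changes are omitted.
    """
    wild = translate(mrna)
    changes = []
    for i in range(len(mrna) - 1):
        if mrna[i:i + 2] != "CG":
            continue
        mutants = {
            # Deamination in the coding strand: mRNA CpG becomes CpA.
            "coding": mrna[:i + 1] + "A" + mrna[i + 2:],
            # Deamination in the non-coding strand: mRNA CpG becomes UpG.
            "non-coding": mrna[:i] + "U" + mrna[i + 1:],
        }
        for strand, mutant in mutants.items():
            for codon_index, (w, m) in enumerate(zip(wild, translate(mutant))):
                if w != m:
                    changes.append((strand, codon_index, w, m))
    return changes


if __name__ == "__main__":
    # Fragments taken from the wild-type mRNA columns of Tables 2 and 3.
    for fragment in ("GCCGUU", "CUCGGU", "CGG", "CCG"):
        print(fragment, cpg_substitutions(fragment))
```

For the β-chain fragment GCC-GUU (codons 10 and 11) the sketch reports a single change, Val→Ile in the second codon from the coding-strand transition, while the non-coding-strand transition falls in the third letter of codon 10 and is silent. For the α-chain codon CGG (Arg 92) it reports Arg→Gln for the coding strand and Arg→Trp for the non-coding strand.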
Table 1 shows that deamination of 5-methyl-C in the coding strand of DNA replaces CpG in the mRNA by CpA; deamination of 5-methyl-C in the non-coding strand of DNA replaces CpG in the mRNA by UpG. In the β-globin chain, C→T transitions in the coding strand would cause three valine residues to be replaced by methionine residues, one valine by an isoleucine, and one glycine by a serine. C→T transitions in the non-coding strand would cause no amino acid substitutions, because they all occur in the third letters of the codons. Table 2 shows the number of apparently independent occurrences of the five predicted replacements in the β-globin chains that have either been published or reported to the International Hemoglobin Information Center in Augusta, Georgia, U.S.A. Haemoglobin Köln, which causes the severest symptom, i.e. inclusion body anaemia, has been reported 32 times. Of the other four replacements, two cause high oxygen affinity, which is a mild symptom, and the other two are asymptomatic. They have been reported between two and five times. On the other hand, neither of the two electrophoretically silent substitutions predicted in the γ-chain has been found, probably because such mutants rarely manifest themselves symptomatically: only six electrophoretically silent mutants have been detected among the 44 abnormal foetal haemoglobins reported so far.

The α1 and α2 genes differ at only two codons, neither of which contains a CpG (Liebhaber et al., 1981); therefore we have to deal with only a single set of α-globin codons. Table 3 lists the number of apparently independent occurrences of the amino
Table 2
Abnormal human haemoglobins due to C→T transitions in CpG dinucleotides in the β and γ-globin genes

            Wild-type             Mutant                             Number of apparently
Position    mRNA       Residue    mRNA    Residue    Name            independent occurrences    Clinical manifestations

A. β-Chain
Coding strand
11          GCC-GUU    Val        AUU     Ile        Hamilton        5                          -
20          AAC-GUG    Val        AUG     Met        Olympia         5                          HOA†
69          CUC-GGU    Gly        AGU     Ser        City of Hope    4                          -
98          CAC-GUG    Val        AUG     Met        Köln            32                         Inclusion body anemia, HOA
109         AAC-GUG    Val        AUG     Met        San Diego       2                          HOA
Non-coding strand
10          GCC        Ala        GCU     Ala
19          AAC        Asn        AAU     Asn
68          CUC        Leu        CUU     Leu
97          CAC        His        CAU     His
108         AAC        Asn        AAU     Asn

B. γ-Chain
Coding strand
113         ACC-GUU    Val        AUU     Ile                        NO
118         UUC-GGC    Gly        AGC     Ser                        NO
Non-coding strand: all silent

† HOA, high oxygen affinity. NO, not observed.
Table 3
Abnormal human haemoglobins due to C→T transitions in CpG dinucleotides in the α-globin genes

            Wild-type α2-chain    Mutant                             Number of apparently
Position    mRNA       Residue    mRNA    Residue    Name            independent occurrences    Clinical manifestations

A. C·G→T·G transitions in coding strand
6           GCC-GAC    Asp        AAC     Asn        Dunn            4                          HOA†
23          GGC-GAG    Glu        AAG     Lys        Chad            5                          -
47          UUC-GAC    Asp        AAC     Asn        Arya            2                          Slightly unstable
64          GCC-GAC    Asp        AAC     Asn        Aida            8                          -
75          GAC-GAC    Asp        AAC     Asn        Matsui-Oki      7                          -
85          AGC-GAC    Asp        AAC     Asn        G-Norfolk       4                          -
92          CUU-CGG    Arg        CAG     Gln        Cape Town       4                          HOA
116         GCC-GAG    Glu        AAG     Lys        O-Indonesia     7                          -
141         UAC-CGU    Arg        CAU     His        Suresnes        2                          HOA
Plus 20 electrophoretically silent mutations, unobserved.

B. C·G→T·G transitions in non-coding strand
44          CCG        Pro        CUG     Leu        Milledgeville   1                          HOA
92          CGG        Arg        UGG     Trp                        NO
95          CCG        Pro        CUG     Leu        G-Georgia       6                          HOA
141         CGU        Arg        UGU     Cys        Nunobiki        1                          HOA
Plus 18 changes in 3rd codon and 7 Ala→Val not observed.

The codons for the globin genes in Tables 2 and 3 have been taken from Bunn & Forget (1986), and the abnormal haemoglobins from the list of the International Hemoglobin Information Center published in Hemoglobin, 12, no. 3 (1988).
† HOA, high oxygen affinity. NO, not observed.
acid substitutions predicted in the α-globin chain that have been either published or reported to the International Hemoglobin Information Center. Ten of the 11 predicted and electrophoretically distinct amino acid substitutions have been found, one eight times, one seven times, one five times, and three have been found four times. All occurred in unrelated families. Two of the electrophoretically silent ones have also been found. They involve replacements of proline at the α1β2 contact by leucine residues, which disturbs the allosteric equilibrium, raises the oxygen affinity and manifests itself clinically by polycythaemia (too many red cells). A total of 33 of the remaining 58 possible C→T transitions would produce no amino acid substitutions, which leaves 25 predicted amino acid substitutions that have not been reported. Only one of these would be visible electrophoretically: Arg92α→Trp; the rest includes 11 Ala→Val, three Val→Met, one Val→Ile and two Gly→Ser substitutions. All these substitutions are conservative. The three valine residues that are replaced by methionine residues all lie in surface crevices, so that the bulkier methionine side-chain can be accommodated without disturbing the structure. It is improbable, therefore, that any of these substitutions would manifest themselves clinically, and they are therefore likely to escape detection.

The reported frequency of an abnormal haemoglobin is its actual frequency in the population multiplied by the probability of its detection. For haemoglobin Köln, which causes severe haemolytic anaemia, that factor would be unity.
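The relation between reported and actual frequency can be made concrete with a small sketch. It assumes that the reported count equals the actual count scaled by the probability of detection, so that a detection factor of unity leaves the count unchanged. The function name `estimated_actual_occurrences` is hypothetical, and the detection probability of one-tenth used in the example is an assumed upper bound of the kind discussed for screening programmes, not a measured value.

```python
def estimated_actual_occurrences(reported, detection_probability):
    """Estimate actual occurrences from a reported count.

    Assumes reported = actual * detection_probability, so the estimate is
    reported / detection_probability. The probability must lie in (0, 1].
    """
    if not 0 < detection_probability <= 1:
        raise ValueError("detection_probability must be in the interval (0, 1]")
    return reported / detection_probability


if __name__ == "__main__":
    # Haemoglobin Koeln: 32 reports (Table 2), detection factor taken as unity.
    print(estimated_actual_occurrences(32, 1.0))
    # A mutant reported 5 times, with an assumed detection probability of 0.1.
    print(estimated_actual_occurrences(5, 0.1))
```

Under these assumptions a severe variant such as haemoglobin Köln is counted at face value, whereas a mildly symptomatic variant reported five times would correspond to roughly fifty actual occurrences in the same population, or more if the true detection probability is below one-tenth.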
Haemoglobins with abnormal electrophoretic mobility show up either in routine hospital tests or in large-scale screening programmes, but the fraction of the population caught by such programmes must be small, say not larger than one-tenth. Haemoglobins with high oxygen affinity rarely produce disease, but may manifest themselves by polycythaemia in routine blood counts. Again the fraction detected may not be more than one-tenth. Abnormal haemoglobins that are electrophoretically silent and asymptomatic are detected even more rarely. α-Chain mutants are intrinsically harder to detect than β-chain mutants because they affect only about a quarter of the haemoglobin in the red cell. For all these mutants, the probability of detection is low, and the real frequency is likely to be much larger than given in Tables 2 and 3.

The frequent occurrence of the amino acid substitutions predicted among the β-globins is consistent with the notion that methylated CpG dinucleotides form hotspots, but the discovery of ten out of the 11 predicted, electrophoretically distinct substitutions among the α-globins, and their repeated presence in unrelated individuals, are unexpected if the CpG dinucleotides in the α-globin genes are supposedly unmethylated. It has been suggested to me that the occurrence of the α-globin mutants may not necessarily imply abnormally high mutation rates at these CpG dinucleotides, because the world-wide search for abnormal haemoglobins has already led to the discovery of all possible electrophoretically distinct abnormal haemoglobins. However, this is not true. At the eight loci in the α-globin gene
shown at the top of Table 3, transitions and transversions could give rise to 51 other electrophoretically visible abnormalities, but only 20 of these have been found. The evidence of undermethylation of the α-globin genes comes from restriction mapping rather than sequencing of the germ-line DNA (Bird et al., 1987), whence some methylated CpG dinucleotides might have escaped detection, but this can hardly be true of all ten involved in the electrophoretically observed amino acid replacements.

On the evidence presented here, one would be led to conclude that unmethylated CpG dinucleotides also form hotspots, for example because proofreading by the polymerase-associated exonuclease is error-prone, but this would contradict previous work and raise the question why unmethylated CpG dinucleotides have not been largely eliminated from the α-globin genes in evolution, just as methylated ones have been eliminated from the β-globin genes. Alternatively, single CpG dinucleotides in the α-globin genes might occasionally become methylated at random, and these single methylated CpG dinucleotides might then act as hotspots. If that were true, then sequence analysis of germ-line DNA of different individuals should reveal single methylated CpG dinucleotides at different loci. Whatever the explanation, it is difficult to see what selective advantage, if any, can now be assigned to the unmethylated islands in the α-globin genes.

Cooper & Youssoufian (1988) estimate that nearly one-third of intragenic single base-pair mutations causing human inherited disease occurs in CpG dinucleotides, because CpG is up to 42 times more unstable than predicted from random mutations. My findings show that they are also responsible for a substantial fraction of haemoglobinopathies.

I thank Drs A. P. Bird and M. Radman for helpful criticism and Drs T. H. J. Huisman and B. B. Webber at the International Hemoglobin Information Center in Augusta, Georgia, U.S.A. for kindly searching their records in order to supply me with the frequencies of the independently reported abnormal haemoglobins listed in Tables 2 and 3. This work was supported by NIH grant no. HL 31461 and NSF grant no. DMB 8609842.
References
Bird, A., Taggart, M., Frommer, M., Miller, O. J. & MacLeod, D. (1985). Cell, 40, 91-99.
Bird, A. P. (1986). Nature (London), 321, 209-213.
Bird, A. P., Taggart, M. H., Nichols, R. D. & Higgs, D. R. (1987). EMBO J. 6, 999-1004.
Brown, T. C. & Jiricny, J. (1987). Cell, 50, 945-951.
Brown, T. C. & Jiricny, J. (1988). Cell, 54, 705-711.
Brown, W. R. A. & Bird, A. P. (1986). Nature (London), 322, 477-481.
Bunn, E. H. & Forget, B. G. (1986). In Hemoglobin: Molecular, Genetic and Clinical Aspects, pp. 182-183, W. B. Saunders Co., Philadelphia.
Cooper & Youssoufian (1988). Hum. Genet. 78, 151-155.
Coulondre, C., Miller, J. H., Farabough, P. J. & Gilbert, W. (1978). Nature (London), 274, 775-780.
Hare, J. T. & Taylor, J. H. (1985). Proc. Nat. Acad. Sci., U.S.A. 82, 7350-7354.
Jones, M., Wagner, R. & Radman, M. (1987). J. Mol. Biol. 194, 155-159.
Liebhaber, S. A., Goossens, M. & Kan, Y. W. (1981). Nature (London), 290, 26-29.
Ohno, S. (1988). Proc. Nat. Acad. Sci., U.S.A. 85, 9630-9634.
Roy, P. H. & Weissbach, A. (1975). Nucl. Acids Res. 2, 1669-1675.
Salser, W. (1977). Cold Spring Harbor Symp. Quant. Biol. 42, 985-1002.
Schwartz, M. N., Trautner, T. A. & Kornberg, A. (1962). J. Biol. Chem. 237, 1961-1967.
Sinsheimer, R. L. (1955). J. Biol. Chem. 215, 579-583.
Wiebauer, K. & Jiricny, J. (1989). Nature (London), 339, 234-236.
Edited by A. Klug