J. theor. Biol. (1997) 188, 355–360
Improving the Efficiency of the Genetic Code by Varying the Codon Length—The Perfect Genetic Code A J. D Department of Biomolecular Sciences, UMIST, P.O. Box 88, Manchester M60 1QD, UK (Received on 10 April 1997, Accepted in revised form on 3 June 1997)
The function of DNA is to specify protein sequences. The four-base “alphabet” used in nucleic acids is translated to the 20-letter amino acid alphabet of proteins (plus a stop signal) via the genetic code. The code is neither overlapping nor punctuated, but has mRNA sequences read in successive triplet codons until reaching a stop codon. The true genetic code uses three bases for every amino acid. The efficiency of the genetic code can be significantly increased if the requirement for a fixed codon length is dropped, so that the more common amino acids have shorter codons and rare amino acids have longer codons. More efficient codes can be derived using the Shannon–Fano and Huffman coding algorithms. The compression achieved using a Huffman code cannot be improved upon. I have used these algorithms to derive efficient codes for representing protein sequences using both two and four bases. The length of DNA required to specify the complete set of protein sequences could be significantly shorter if translation used a variable codon length. The restriction to a fixed codon length of three bases means that it takes 42% more DNA than the minimum necessary, and the genetic code is 70% efficient. One can think of many reasons why this maximally efficient code has not evolved: there is very little redundancy, so almost any mutation causes an amino acid change. Many mutations will be potentially lethal frameshift mutations, if the mutation leads to a change in codon length. It would be more difficult for the machinery of translation to cope with a variable codon length. Nevertheless, in the strict and narrow sense of coding for protein sequences using the minimum length of DNA possible, the Huffman code derived here is perfect.
© 1997 Academic Press Limited
Introduction

The principal function of DNA is to specify protein sequences. The four-base “alphabet” used in nucleic acids is translated to the 20-letter amino acid alphabet of proteins (plus a stop signal) via the genetic code. Seminal work in the 1960s demonstrated that the code is neither overlapping nor punctuated, with mRNA sequences read in successive triplet codons until a stop codon is reached. Several theories have been proposed regarding the origin of the code. The stereochemical theory proposes that each amino acid became linked to its appropriate triplet for stereochemical reasons, while the Frozen Accident theory proposes that the codon–amino acid relationship arose purely by chance (Crick, 1968). Wong (1975) suggested that the genetic code co-evolved with the metabolic pathways between amino acids. The code is almost universal, with only a few differences, first seen in mitochondria (Barrell et al., 1979), showing the intense pressure of natural selection in opposing changes.

A question I wish to address in this paper is how efficient DNA is at encoding protein sequences. A maximally efficient genetic code is able to encode protein sequences using the minimum number of bases. For the purpose of this work, efficiency is defined by eqn (1) (Haydock et al., 1991).

\[
\text{Efficiency} = \frac{\text{Minimum number of bases needed to encode protein sequences}}{\text{Number of bases actually used to encode protein sequences}} \tag{1}
\]

The true genetic code uses three bases for every amino acid, so the mean number of bases per amino acid [the denominator in eqn (1)] is three. The minimum number of bases required to encode protein sequences [the numerator in eqn (1)] will be derived using the Huffman coding algorithm (Huffman, 1952). I will consider what the mean number of bases per amino acid would be if there were two or six bases instead of four (only even numbers are considered, as I assume that base pairing is always maintained), and if the genetic code allowed a variable codon length for different amino acids instead of three every time. I show that the efficiency of the genetic code can be significantly increased if the requirement for a fixed codon length is dropped, so that more frequent amino acids have shorter codons and rare amino acids have longer codons.

Methods

The Shannon–Fano coding algorithm was applied to derive the binary code as follows: the amino acids plus stop signal are written in order of frequency, as found from codon usage tables in the yeast genome (http://alanine.gcg.com/techsupport/data/yeast high.cod). The amino acids are divided into two groups so that the summed frequency in each group is as close to 50% as possible. In this case, this gives the groups ALKGV and E (summed frequency 47%), and SIDTRNPFQYMHWC and End (summed frequency 53%). Each group is then subdivided in the same way, into two groups with summed frequencies as nearly equal as possible, until every amino acid is isolated. This yields a tree (not shown). Branches to the left are arbitrarily assigned the base G while those to the right are assigned the base U (Table 1). The Shannon–Fano algorithm is applied to a four-base coding in a similar way, with each group of amino acids divided into four groups with summed frequencies as nearly equal as possible (as close to 0.25 each as possible at the first split), giving a tree with four branches at each split, assigned the bases G, U, A and C (Fig. 1; Table 3).
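The binary Shannon–Fano splitting described above can be sketched in a few lines of Python. This is a minimal sketch rather than the author's own implementation: the names `FREQ` and `shannon_fano` are illustrative, the frequencies are copied from Table 1, and ties between equally good cut points (none seem to occur with these data) are broken by taking the earliest cut, which is an assumption.

```python
# Yeast amino acid frequencies (%) copied from Table 1, most frequent first.
FREQ = [
    ("Ala", 8.93), ("Leu", 8.26), ("Lys", 7.86), ("Gly", 7.68), ("Val", 7.21),
    ("Glu", 7.08), ("Ser", 6.47), ("Ile", 6.09), ("Asp", 6.02), ("Thr", 5.85),
    ("Arg", 4.20), ("Asn", 4.18), ("Pro", 4.15), ("Phe", 4.11), ("Gln", 3.35),
    ("Tyr", 2.95), ("Met", 2.00), ("His", 1.74), ("Trp", 0.85), ("Cys", 0.75),
    ("End", 0.29),
]


def shannon_fano(symbols, prefix=""):
    """Return {name: codon} for a frequency-sorted list of (name, freq) pairs."""
    if len(symbols) == 1:
        return {symbols[0][0]: prefix}
    total = sum(freq for _, freq in symbols)
    # Find the cut that makes the two summed frequencies as equal as possible.
    best_cut, best_diff, running = 1, float("inf"), 0.0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(2 * running - total)  # |left sum - right sum|
        if diff < best_diff:
            best_cut, best_diff = i, diff
    codes = shannon_fano(symbols[:best_cut], prefix + "G")         # left branch
    codes.update(shannon_fano(symbols[best_cut:], prefix + "U"))   # right branch
    return codes


sf_codes = shannon_fano(FREQ)
# Mean codon length: frequency-weighted sum of codon lengths (frequencies in %).
sf_mean = sum(freq * len(sf_codes[name]) for name, freq in FREQ) / 100
print(sf_codes["Ala"], sf_codes["Gly"], round(sf_mean, 2))
```

With these frequencies the first cut falls between Glu and Ser, as described in the text, and the resulting mean codon length agrees with the 4.19 reported in Table 1.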
The Huffman coding algorithm was applied to derive the binary code as follows: the amino acids plus stop signal are written in order of frequency, as found from codon usage tables in the yeast genome. The two least probable amino acids are grouped together (Cys and End in this case), their frequencies are summed and the amino acids are rewritten in order of frequency with the new group treated as a single entity. The two new least probable amino acids or amino acid groups are then grouped together. The process continues until only two groups remain. All of the amino acid lists produced are then written in the order in which they were derived. The codes for each amino acid are read from the lists by reading from the bottom and giving each amino acid a G when it is in the rightmost column and a U when it is in the second column from the right (Table 2). The Huffman algorithm is applied to a four-base coding by first grouping Trp, Cys and End together and treating that as a single entity. In subsequent lists the four least frequent groups are joined and treated as a single entity until only four groups remain. The codes for each amino acid are found by reading the lists from the bottom and giving each amino acid a C when it is in the rightmost column, an A when it is in the second column from the right, a G when it is in the third column from the right and a U when it is in the fourth column from the right. The results (Table 3) can similarly be presented as a tree diagram (Fig. 2).

Table 1
Shannon–Fano binary genetic code

Amino acid   Frequency (%)   Codon      Frequency × codon length
Ala          8.93            GGG        0.268
Leu          8.26            GGUG       0.331
Lys          7.86            GGUU       0.314
Gly          7.68            GUG        0.230
Val          7.21            GUUG       0.288
Glu          7.08            GUUU       0.283
Ser          6.47            UGGG       0.259
Ile          6.09            UGGU       0.244
Asp          6.02            UGUG       0.241
Thr          5.85            UGUU       0.234
Arg          4.20            UUGG       0.168
Asn          4.18            UUGUG      0.209
Pro          4.15            UUGUU      0.208
Phe          4.11            UUUGG      0.206
Gln          3.35            UUUGU      0.168
Tyr          2.95            UUUUGG     0.177
Met          2.00            UUUUGU     0.120
His          1.74            UUUUUG     0.104
Trp          0.85            UUUUUUG    0.060
Cys          0.75            UUUUUUUG   0.060
End          0.29            UUUUUUUU   0.023
Mean codon length                       Sum = 4.19

Table 2
Huffman binary genetic code

Amino acid   Frequency (%)a  Codon      Frequency × codon length
Ala          8.93            GGU        0.268
Leu          8.26            UUUU       0.331
Lys          7.86            UUGU       0.314
Gly          7.68            UUGG       0.307
Val          7.21            UGUU       0.288
Glu          7.08            UGUG       0.283
Ser          6.47            UGGG       0.259
Ile          6.09            GUUU       0.244
Asp          6.02            GUUG       0.241
Thr          5.85            GUGU       0.234
Arg          4.20            GGGU       0.168
Asn          4.18            GGGG       0.167
Pro          4.15            UUUGU      0.208
Phe          4.11            UUUGG      0.206
Gln          3.35            UGGUG      0.168
Tyr          2.95            GUGGU      0.148
Met          2.00            GUGGG      0.100
His          1.74            UGGUUG     0.104
Trp          0.85            UGGUUUG    0.060
Cys          0.75            UGGUUUUU   0.060
End          0.29            UGGUUUUG   0.023
Mean codon length                       Sum = 4.18

a Taken from codon usage tables for yeast genes (http://alanine.gcg.com/techsupport/data/yeast high.cod).

Fig. 1. Shannon–Fano four base code tree. [Tree diagram not reproduced; its branches are labelled by 1st, 2nd and 3rd base, and the codons it encodes are listed in Table 3.]

Results

Suppose there are only two possible bases (for example, G and U) used to encode the 20 amino acids plus end signal (21 symbols in all). If the requirement for a fixed codon length is maintained, the codon length must be 5, as 2^5 = 32 is the first power of two greater than 21. DNA would therefore need to be 5/3 times longer to encode proteins. If six bases were possible, a codon length of two would be all that was needed, as 6^2 = 36 > 21, and DNA could then be 2/3 of its present length (one third shorter). An alternative to the AT and GC base pairs would be necessary, with two further bases that bind strongly and specifically to each other.
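The fixed-length figures quoted here follow from finding the smallest codon length L for which the number of distinct codons, n^L, covers the 21 symbols (20 amino acids plus End). A minimal Python sketch of that arithmetic, in which the function name `fixed_codon_length` is illustrative and not taken from the paper:

```python
def fixed_codon_length(n_bases, n_symbols=21):
    """Smallest codon length L such that n_bases ** L >= n_symbols."""
    length = 1
    while n_bases ** length < n_symbols:
        length += 1
    return length


NATURAL_LENGTH = 3  # codon length of the real genetic code

for n_bases in (2, 4, 6):
    length = fixed_codon_length(n_bases)
    # Ratio of the DNA length needed to the length used by the natural code.
    print(n_bases, length, round(length / NATURAL_LENGTH, 2))
```

For two, four and six bases this gives codon lengths of 5, 3 and 2, that is, DNA 5/3 as long, unchanged, and 2/3 as long as with the natural triplet code.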
Binary code

A more interesting problem is what the maximally efficient genetic code would be if different amino acids could have different codon lengths. One could imagine that tRNAs could have a variable number of bases in their anticodon loops. Successive tRNAs would bind to the mRNA next to one another so that every base in the mRNA is read. Note that sequences which specify the reading frame remain essential.

A well-known example of a binary code is the Morse code for the English language, where each letter is represented by combinations of dots and dashes. Morse gave E and T the shortest representations, namely the single bits “dot” and “dash”, respectively, as these are the most commonly used letters in English. The least frequent letters were given four bits. By using a variable bit length, the total number of bits transmitted is greatly reduced. The development of computer data storage and the electronic transmission of information inspired the development of theory for the efficient compression of data. Shannon and Fano (Shannon & Weaver, 1949; Fano, 1949) proposed a general method for producing an efficient code, which assigns short codes to the most frequent characters (such as E, T and A in English) and the longest codes to the rarest characters (such as X, Z and Q in English). Huffman (1952) devised a similar algorithm which gives the optimum solution to the problem of determining an efficient code. Huffman codes cannot be bettered in terms of data compression and have found a wide range of applications (see Williams, 1991; Held, 1983; Storer, 1988 for further details).

I have applied the Huffman coding algorithm (see Methods) to the problem of coding the 20 amino acids plus stop signal by a binary (two-base) code, giving the results shown in Table 2. G and U are arbitrarily used as the two bases. In Table 2, the amino acids are written in order of frequency, thus showing that the code gives the more common amino acids the shorter codons. The code has the essential property of all such coding schemes that it can only be read in one way. For example, UUGUGGUGUGUUGGUUGGUUUUGGUUUUG codes for Lys–Ala–Thr–His–Ile–End unambiguously. It is not necessary to include the length of the codon to be read as an additional piece of information, as there is only one choice each time. If this code was used in a cell, there would be 20 tRNA molecules, each with an anticodon that base pairs with one of the codons listed in the table, plus a release factor for the End codon. Once the start position of a codon is specified there is only one of these tRNAs that can base pair to the next available codon. Note that it is still necessary to specify the reading frame by other means, such as using initiation codons and promoter sequences.

The code could be used mechanistically by having 20 tRNAs (plus a release factor that recognizes the stop codon) with anticodons of variable size. Each charged tRNA would bind successively adjacent to the previous tRNA and its amino acid would couple to the growing polypeptide chain (Fig. 3). There is only one possible tRNA that could base pair with the next mRNA bases each time.

The mean codon length is the sum of the codon lengths multiplied by the amino acid frequencies (taken from the yeast codon usage table). For this code the mean codon length is 4.18. This is clearly an improvement on five (the minimum length for a fixed-length binary code), showing that if there were only two bases, DNA would need to be only 4.18/5 = 84% as long (16% shorter) if a variable codon length was allowed [i.e. the efficiency of the fixed-length binary code from eqn (1) is 84%]. This code will give the proteins in the yeast genome using the minimum number of bases possible. It is the most efficient way to store or transmit protein sequences with any medium that uses bits. The Shannon–Fano algorithm (Table 1) gives a very similar code, but with a mean codon length of 4.19, illustrating that this algorithm gives very efficient coding schemes, almost as good as the optimal Huffman coding solution. The Shannon–Fano scheme uses a shorter code for Gly than the Huffman code and a longer code for Asn, Tyr and Met, giving a slightly longer mean codon length.

Table 3
Huffman and Shannon–Fano genetic code with four bases

Amino acid          Huffman codon   Shannon–Fano codon
Ala                 UU/UGAG         UU/GAC
Leu                 UA              UC
Lys                 UC              UA
Gly                 GU              UG
Val                 GG              CU
Glu                 GA              CC
Ser                 GC              CA
Ile                 AU              CG
Asp                 AG              AU
Thr                 AA              AC
Arg                 AC              AA
Asn                 CU              AG
Pro                 CG              GU
Phe                 CA              GC
Gln                 CC              GAU
Tyr                 UGU             GAA
Met                 UGG             GAG
His                 UGC             GGU
Trp                 UGAU            GGC
Cys                 UGAA            GGA
End                 UGAC            GGG
Mean codon length   2.11            2.12

Fig. 2. Huffman four-base code tree. [Tree diagram not reproduced; its branches are labelled by 1st, 2nd, 3rd and 4th base, and the codons it encodes are listed in Table 3.]
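The mean codon lengths of the Huffman codes in Tables 2 and 3 can be checked with a small n-ary Huffman routine. The following is a sketch under stated assumptions, not the list-based hand procedure of the Methods: it uses Python's standard `heapq` module, the names `FREQ` and `huffman_lengths` are illustrative, and for four bases one zero-frequency dummy symbol is added so that every merge joins exactly four groups (this mirrors the first step in Methods, where only Trp, Cys and End are grouped, and corresponds to the one unassigned codon). Only codon lengths are computed, since the assignment of bases to branches is arbitrary.

```python
import heapq
from itertools import count

# Yeast amino acid frequencies (%) copied from Table 2.
FREQ = {
    "Ala": 8.93, "Leu": 8.26, "Lys": 7.86, "Gly": 7.68, "Val": 7.21,
    "Glu": 7.08, "Ser": 6.47, "Ile": 6.09, "Asp": 6.02, "Thr": 5.85,
    "Arg": 4.20, "Asn": 4.18, "Pro": 4.15, "Phe": 4.11, "Gln": 3.35,
    "Tyr": 2.95, "Met": 2.00, "His": 1.74, "Trp": 0.85, "Cys": 0.75,
    "End": 0.29,
}


def huffman_lengths(freqs, n_bases=2):
    """Return {symbol: codon length} for an n_bases-ary Huffman code."""
    tie = count()  # tie-breaker so the heap never compares the symbol lists
    heap = [(freq, next(tie), [sym]) for sym, freq in freqs.items()]
    # Pad with zero-frequency dummies so each merge can take n_bases groups.
    while (len(heap) - 1) % (n_bases - 1) != 0:
        heap.append((0.0, next(tie), []))
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        merged_freq, merged_syms = 0.0, []
        for _ in range(n_bases):  # join the n_bases least frequent groups
            freq, _tie, syms = heapq.heappop(heap)
            merged_freq += freq
            merged_syms += syms
        for sym in merged_syms:   # every member moves one level deeper
            lengths[sym] += 1
        heapq.heappush(heap, (merged_freq, next(tie), merged_syms))
    return lengths


def mean_length(freqs, lengths):
    """Frequency-weighted mean codon length (frequencies are in %)."""
    return sum(freqs[sym] * lengths[sym] for sym in freqs) / 100


mean2 = mean_length(FREQ, huffman_lengths(FREQ, n_bases=2))
mean4 = mean_length(FREQ, huffman_lengths(FREQ, n_bases=4))
print(round(mean2, 2), round(mean4, 2), round(mean4 / 3, 2))
```

The two means agree with the values of 4.18 and 2.11 given in Tables 2 and 3, and the last figure printed is the efficiency of the natural triplet code under eqn (1).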
Four-base code

I have used the Shannon–Fano and Huffman algorithms to derive efficient codes for representing the protein sequences using four bases, as observed in all organisms (Table 3; Figs 1 and 2). As with the binary codes, every sequence can be read in only one way once the start position is specified. For example, UCUUAAUGCAUUGAC codes for Lys–Ala–Thr–His–Ile–End using the Huffman code. The Huffman code increases the efficiency by assigning the 15 most common amino acids a codon length of two, while Tyr, Met and His have a codon length of three and the rare Trp, Cys and End have codon lengths of four. The algorithm results in one codon (UGAG) being unassigned. In Table 3 this is given to Ala, as this is the most abundant amino acid. The mean codon length is 2.11, a considerable improvement on three, the mean codon length if all codon lengths are the same.

The Shannon–Fano algorithm gives a slightly less efficient solution with a mean codon length of 2.12. It differs from the Huffman code by giving Trp, Cys and End codon lengths of three instead of four, offset by having a codon length of three for Gln instead of two. The redundant codon from this algorithm (GAC) is given to Ala.

The efficiency of the natural genetic code is found from eqn (1) by comparing the minimum number of bases required to encode protein sequences, determined by application of the Huffman algorithm (2.11), to the natural length (3). It is 2.11/3 = 70%.
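The claim that each message can be read in only one way can be illustrated by decoding the two worked examples from the text with the Huffman codons of Tables 2 and 3. This is a minimal sketch: the names `BINARY`, `FOUR_BASE` and `decode` are illustrative, and the decoder simply extends the current codon one base at a time until it matches a table entry, which is sufficient because no codon is the beginning of another.

```python
AMINO_ACIDS = ("Ala Leu Lys Gly Val Glu Ser Ile Asp Thr Arg Asn Pro Phe "
               "Gln Tyr Met His Trp Cys End").split()

# Codon -> amino acid, in the order of Table 2 (binary Huffman code).
BINARY = dict(zip(
    ("GGU UUUU UUGU UUGG UGUU UGUG UGGG GUUU GUUG GUGU GGGU GGGG UUUGU "
     "UUUGG UGGUG GUGGU GUGGG UGGUUG UGGUUUG UGGUUUUU UGGUUUUG").split(),
    AMINO_ACIDS))

# Codon -> amino acid, in the order of Table 3 (four-base Huffman code).
FOUR_BASE = dict(zip(
    ("UU UA UC GU GG GA GC AU AG AA AC CU CG CA CC UGU UGG UGC "
     "UGAU UGAA UGAC").split(),
    AMINO_ACIDS))
FOUR_BASE["UGAG"] = "Ala"  # the spare codon, given to Ala in Table 3


def decode(message, codon_table):
    """Read a message left to right, emitting a residue at each codon match."""
    residues, codon = [], ""
    for base in message:
        codon += base
        if codon in codon_table:
            residues.append(codon_table[codon])
            codon = ""
    if codon:
        raise ValueError("message ends part-way through a codon: " + codon)
    return residues


binary_peptide = decode("UUGUGGUGUGUUGGUUGGUUUUGGUUUUG", BINARY)
four_base_peptide = decode("UCUUAAUGCAUUGAC", FOUR_BASE)
print("-".join(binary_peptide))
print("-".join(four_base_peptide))
```

Both messages decode to Lys-Ala-Thr-His-Ile-End. The four-base message uses 15 bases, against the 18 that six fixed triplets would need.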
Fig. 3. Translation scheme with a variable codon length. [Diagram not reproduced. Panels: binding of charged Val tRNA to the mRNA next to the Met tRNA that carries the Arg–Met chain; formation of the Met–Val peptide bond and release of uncharged Met tRNA, giving Arg–Met–Val; binding of charged Cys tRNA to the next codon; etc.]

Discussion

I have shown that the length of DNA required to specify the complete set of protein sequences could be significantly shorter if translation used a variable codon length. The restriction to a fixed codon length of three bases means that it takes 42% (3/2.11 = 1.42) more DNA than the minimum necessary. If efficiency is defined using the strict definition of eqn (1), the genetic code is only 70% efficient. One can, of course, think of many reasons why this maximally efficient code is not used. There is very little redundancy, so almost any mutation causes an amino acid change. In addition, very many mutations will be potentially lethal frameshift mutations, if the mutation leads to a change in codon length (Woese, 1967). With only two bases in the tRNA pairing with the mRNA for some codons, the binding energy between the tRNA and mRNA may be too low. In any case, to some extent the current genetic code is a “frozen accident”, with any further change in the code being lethal as it will cause too many simultaneous mutations. Nevertheless, in the strict and narrow sense of coding for protein sequences using the minimum length of DNA possible, the Huffman code derived here is perfect.

The only way to compress protein information into DNA further than the scheme presented here is to have overlapping genes, as first observed in bacteriophages (Barrell et al., 1976; Godson et al., 1978; Sanger et al., 1977), and then later also in animal viruses, mitochondria and bacteria (Normark et al., 1983). If DNA was to be used as a coding molecule in a non-biological situation, such as a DNA-based biocomputer, it would be most efficient to read it using a Huffman code.

It is not impossible that the present genetic code may evolve in the future towards the maximally efficient code derived here. For example, Val is coded by the four codons GUU, GUC, GUA and GUG only. As the third base conveys no information, it would be advantageous, in terms of saving one third of the DNA used to encode Val, if Val was coded by the two-base codon GU only, and the third base was used to give information on the identity of the next amino acid. Despite the advantage this change would have, the difficulties of making such a change are enormous. Switching from a three- to a two-base codon length for Val would act as a frameshift mutation, as the entire sequence following each Val would be scrambled. It would also be extremely difficult for the machinery of translation, which uses the ribosome, to cope with a move from a fixed codon length. In contrast, there are no problems in the transcription of DNA to mRNA if the codon length is altered.

I thank Terry Brown, Jeremy Derrick, Don Haynie, Simon Hubbard, Steve Oliver and Richard Walmsley for many helpful discussions. I thank the MRC (G9625094) for financial support.
REFERENCES

Barrell, B. G., Bankier, A. T. & Drouin, J. (1979). A different genetic code in human mitochondria. Nature 282, 189–194.
Barrell, B. G., Air, G. M. & Hutchison III, C. A. (1976). Overlapping genes in bacteriophage φX174. Nature 264, 34–40.
Crick, F. H. C. (1968). The Origin of the Genetic Code. J. Mol. Biol. 38, 367–379.
Fano, R. M. (1949). Massachusetts Institute of Technology, Cambridge, MA. Ph.D. Thesis.
Godson, G. N., Barrell, B. G., Staden, R. & Fiddes, J. C. (1978). Nucleotide sequence of bacteriophage G4 DNA. Nature 276, 236–247.
Haydock, R., D, S. & H, A. (1991). Information and Coding, The Schools Mathematics Project. Cambridge: Cambridge University Press.
Held, G. (1983). Data Compression. Chichester: John Wiley.
Huffman, D. A. (1952). A Method for the construction of minimum redundancy codes. Proceedings of the I.R.E. 40, 1098–1101.
Konopka, A. K. (1985). Theory of Degenerate Coding and Informational Parameters of Protein Coding Genes. Biochimie 67, 455–468.
Normark, S., Bergström, S., Edlund, T., Grundström, T., Jaurin, B., Lindberg, F. P. & Olsson, O. (1983). Overlapping genes. Ann. Rev. Genet. 17, 499–525.
Sanger, F., Air, G. M., Barrell, B. G., Brown, N. L., Coulson, A. R., Fiddes, J. C., Hutchison III, C. A., Slocombe, P. M. & Smith, M. (1977). Nucleotide sequence of bacteriophage φX174 DNA. Nature 265, 687–695.
Shannon, C. E. & Weaver, W. (1949). The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.
Storer, J. A. (1988). Data Compression. Rockville, Maryland: Computer Science Press.
Williams, R. N. (1991). Adaptive Data Compression. Boston: Kluwer Academic Publishers.
Woese, C. R. (1967). The Genetic Code. New York: Harper & Row.
Wong, J. T. (1975). Proc. Natl. Acad. Sci. U.S.A. 72, 1909–1912.