BioSystems 109 (2012) 133–136
The most probable number of blocks for the partitions of the set of codons could have determined the number of standard amino acids
Dino G. Salinas ∗, Mauricio O. Gallardo, Manuel I. Osorio
Facultad de Medicina, Universidad Diego Portales, Avda. Ejército 141, Santiago, Chile
Article info
Article history: Received 27 April 2011. Received in revised form 16 September 2011. Accepted 28 February 2012.
Keywords: Genetic code evolution; Standard amino acid number; Stirling number
Abstract
Given a genetic code formed by 64 codons, we calculate the number of partitions of the set of amino acid-encoding codons. When there are 0–3 stop codons, the results indicate that the most probable number of blocks in a partition is 19 and/or 20. Then, assuming that the genetic code could have undergone random variations during its early evolution, we suggest that the most probable number of blocks in the partitions of the set of amino acid-encoding codons determined the actual number, 20, of standard amino acids.
© 2012 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
The current genetic code has been evolving for about 3700 million years, probably starting within a very simple self-replicating metabolic system, such as a protocell (Chen et al., 2005; Deamer and Weber, 2010). The first changes in the genetic code could have occurred by chance (Koonin and Novozhilov, 2009) until the encoding of both amino acids and stop signals was stable enough to support life (Weber and Miller, 1981).
The genetic code acts as a function that assigns each codon to an amino acid or a stop signal. Thus, in the standard code, of the 64 possible codons, 61 codons encode 20 amino acids and 3 codons encode 1 stop signal. However, there are other genetic codes with minor variations relative to the standard (canonical) genetic code (Sammet et al., 2010), and some of them have been studied theoretically (Freeland et al., 2000; Novozhilov et al., 2007; Vetsigian et al., 2006). Here we study theoretically how the number of standard amino acids was determined in the course of code evolution.
In this work, we call a pattern of degeneration any partition of the set of amino acid-encoding codons into subsets such that all the codons in a subset encode the same amino acid. Depending on the number w of stop codons (w = 0, 1, 2 or 3), we calculate the probability of obtaining any of the partitions of the set of codons encoding k amino acids (1 ≤ k ≤ 64 − w) into k blocks, and we demonstrate that the maximum values of this probability occur at a value of k equal or close to 20 (the number of standard amino acids).
2. Theoretical framework
2.1. Probability of obtaining any of the partitions into k blocks of a set of n codons encoding k amino acids, P(n, k)
We define the following sets:
A = {a_1, ..., a_k}, a set of encoded amino acids.
C = {c_1, ..., c_n}, a set of encoding codons, such that n > k.
C_i ⊆ C, i ∈ {1, ..., k}, with C = ⋃_{i=1}^{k} C_i, C_i ≠ ∅, and C_i ∩ C_j = ∅ if i ≠ j.
P = {C_1, ..., C_k}, a partition of C into k subsets C_i.
That is, we have subdivided C, an n-codon set, into k disjoint subsets, such that each of them could be formed by the codons that are assigned to a particular amino acid in A in some genetic code. The number of different P partitions is given by S(n, k), the Stirling number of the second kind (Comtet, 1974):
\[ S(n, k) = \frac{k^n}{k!} \sum_{j=0}^{k} (-1)^j \binom{k}{j} \left(1 - \frac{j}{k}\right)^n. \qquad (1) \]
∗ Corresponding author. E-mail addresses: [email protected] (D.G. Salinas), [email protected] (M.O. Gallardo), [email protected] (M.I. Osorio).
0303-2647/$ – see front matter © 2012 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2012.02.007
For example, let n = 64 − ω, with ω the number of stop codons. Fig. 1 shows how, when there is one stop codon, P partitions are formed from a set of n = 63 codons by means of k indistinguishable blocks. Only two of the S(63, k) possible partitions are suggested.
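As an illustration of Eq. (1), the following Python sketch computes S(n, k) in two ways: with the standard recurrence S(n, k) = k·S(n − 1, k) + S(n − 1, k − 1), and with the explicit sum of Eq. (1) evaluated in exact rational arithmetic. The function names are our own and are not part of the original work; the explicit form assumes k ≥ 1.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """S(n, k) via the recurrence S(n, k) = k*S(n-1, k) + S(n-1, k-1)."""
    if n == k:
        return 1  # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def stirling2_explicit(n: int, k: int) -> int:
    """S(n, k) from the explicit sum of Eq. (1); assumes k >= 1."""
    total = sum((-1) ** j * comb(k, j) * (1 - Fraction(j, k)) ** n
                for j in range(k + 1))
    value = Fraction(k ** n, factorial(k)) * total
    assert value.denominator == 1  # the result must be an integer
    return int(value)


# Both routes should agree, e.g. for the 63-codon set split into 20 blocks.
print(stirling2(63, 20) == stirling2_explicit(63, 20))
```

Exact integers and fractions are used here because the alternating sum in Eq. (1) loses precision quickly in floating point for n near 64.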
D.G. Salinas et al. / BioSystems 109 (2012) 133–136
Fig. 1. The central figure shows a set of 64 codons. The gray codon encodes the stop signal and the remaining codons encode the amino acids. The figure shows how partitions of the 63-codon set are formed by rearranging the codons into k indistinguishable blocks (subsets), where all codons of a block encode the same amino acid. Only two partitions are shown (left and right); the total number of possible partitions into k blocks is the Stirling number of the second kind, S(63, k). We call each of these partitions a pattern of degeneration.
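The idea of a pattern of degeneration can be made concrete on a toy example. The sketch below enumerates every partition of a hypothetical four-codon set (the codon names are chosen only for illustration) and tallies the partitions by their number of blocks k. The tallies are the Stirling numbers S(4, k), and their sum is the total number of partitions of the set.

```python
from collections import Counter


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or open a new block that contains only `first`.
        yield [[first]] + partition


codons = ["UUU", "UUC", "UUA", "UUG"]  # hypothetical toy codon set
counts = Counter(len(p) for p in set_partitions(codons))
print(sorted(counts.items()))  # [(1, 1), (2, 7), (3, 6), (4, 1)]
print(sum(counts.values()))    # 15
```

Enumeration is only feasible for very small sets; for the 63-codon set the counts must be obtained from the Stirling numbers directly.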
The number of partitions of C, that is, the number of P partitions for all k values within {1, ..., n}, is called the Bell number (Comtet, 1974) and it is
\[ B_n = \sum_{r=1}^{n} S(n, r). \qquad (2) \]
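A minimal sketch of how the Bell number can be evaluated exactly: it sums one row of Stirling numbers of the second kind, as in Eq. (2), and cross-checks the result against the Bell triangle, an independent standard construction. The helper names are assumptions made for this example.

```python
def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)] via S(m, k) = k*S(m-1, k) + S(m-1, k-1)."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        row = [0] + [k * row[k] + row[k - 1] if k < m else 1
                     for k in range(1, m + 1)]
    return row


def bell(n):
    """B_n as the sum of S(n, r) over r = 1..n (Eq. (2)); assumes n >= 1."""
    return sum(stirling2_row(n)[1:])


def bell_triangle(n):
    """B_n from the Bell triangle, used here only as a cross-check."""
    row = [1]
    for _ in range(n - 1):
        new = [row[-1]]  # each row starts with the last entry of the previous row
        for x in row:
            new.append(new[-1] + x)
        row = new
    return row[-1]


print(bell(4), bell_triangle(4))        # 15 15
print(bell(63) == bell_triangle(63))    # True
```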
We assume that each of the B_n partitions has an equal probability of being selected. Considering all of them, the probability of obtaining any of the S(n, k) P partitions is given by
\[ P(n, k) = \frac{S(n, k)}{B_n}. \qquad (3) \]
2.2. Maximum values of P(n, k)
As k varies, the maxima of P(n, k) and S(n, k) are obtained at the same values k = k_n, because B_n does not depend on k. It is known that S(n, k) has only one or two maxima (Harper, 1967), all of them within the range of k determined by (Yu, 2009):
\[ \lfloor e^{\Omega_n} \rfloor - 2 \le k_n \le \lfloor e^{\Omega_n} \rfloor + 1, \qquad (4) \]
with Ω_n being the omega function (also called the Lambert W-function), the solution of
\[ n = \Omega_n e^{\Omega_n}, \qquad (5) \]
⌊x⌋ denoting the integer part of x, and n ≥ 2. Thus, it is possible to find k = k_n, corresponding to the maximum of P(n, k), by calculating P(n, k) (Eq. (3)) only within the range of k determined by Eqs. (4) and (5).
3. Results
Assuming only one stop codon among the 64 possible codons, the set C has n = 63 elements (Fig. 1). Solving Eq. (5) numerically, we have e^{Ω_63} ≈ 20.8. So ⌊e^{Ω_63}⌋ = 20 and, in accordance with Eq. (4), 18 ≤ k_63 ≤ 21. A numerical calculation of the P(63, k) values verifies that the maximum value is P(63, 20) = 0.1938. Therefore k_63 = 20 (Fig. 2). Table 1 shows the numerical calculation of k_n-values for 61 ≤ n ≤ 64.
Fig. 2. Probability of randomly obtaining any of the S(63, k) partitions of the 63-codon set into k subsets, considering that the number of all possible partitions is given by \sum_{r=1}^{63} S(63, r). The maximum value of P(63, k) is P(63, 20) = 0.1938.
Table 1
Values of P(n, k) and k_n for different n-values.
            n = 61    n = 62    n = 63    n = 64
P(n, 18)    0.164     0.150     0.136     –
P(n, 19)    0.197     0.192     0.184     0.174
P(n, 20)    0.184     0.190     0.194     0.195
P(n, 21)    0.136     0.149     0.161     0.172
P(n, 22)    –         –         –         0.121
k_n         19        19, 20    20        20
Values of k_n are the k-values that correspond to the maximum of P(n, k). The P(n, k)-values are for ⌊e^{Ω_n}⌋ − 2 ≤ k ≤ ⌊e^{Ω_n}⌋ + 1 (Eq. (4)), with ⌊e^{Ω_61}⌋ = ⌊e^{Ω_62}⌋ = ⌊e^{Ω_63}⌋ = 20 and ⌊e^{Ω_64}⌋ = 21 (from Eq. (5) for each n-value).
4. Discussion
A code is defined as a set G = {c_i → a_{f(i)}, i ∈ {1, ..., n}}, where f is a surjective mapping of {1, ..., n} to {1, ..., k}. Since n > k, the code contains redundancy, as in the genetic code, where some amino acids are assigned more than one codon (for formal purposes, the stop signal is treated as an amino acid). This property is known as degeneration, and we call each partition P a pattern of degeneration. There are S(n, k) · k! possible codes (Nieselt-Struwe and Wills, 1997). In the case of the genetic code, each of the 20 standard amino acids and the stop signal is encoded by at least one codon. Then 64 codons have to be
partitioned into 21 subsets and, therefore, the total number of possible genetic codes is S(64, 21) · 21! ≈ 1.51 × 10^84 (Nieselt-Struwe and Wills, 1997; Novozhilov et al., 2007). Thus, the standard genetic code is a rare event within the enormous set of possible genetic codes.
There is some evidence for variation of the partition size (k) of the codon set, including changes in the stop codons or in the number of encoded amino acids, none of which affects code viability. Thus, some prokaryotes and organelles, which lack the enzymatic machinery to attach the amino acids glutamine and asparagine to transfer RNA, encode only 18 amino acids (Bailly et al., 2006). In addition, proteins of methanogenic bacteria have two non-standard amino acids that are encoded by canonical stop codons using specific tRNAs: in some codes, the stop codon UGA encodes selenocysteine (Zhang et al., 2005) and the stop codon UAG encodes pyrrolysine (Blight et al., 2004).
Motivated by these observations, we have studied several cases with different numbers of stop codons. Assuming that the patterns of degeneration are random and that the stop codons have been fixed beforehand, we have calculated the probability of obtaining any partition into a number of subsets corresponding to a given number of amino acids. Table 1 covers codes having from 0 to 3 fixed stop codons, and it shows that the highest probability values correspond to 19 and 20 amino acids. In the case of 64 codons coding for amino acids, without a stop codon, we conclude that 20 is the most probable number of encoded amino acids.
In this work we have studied the cases in which the stop signal is fixed in the code before the amino acid signals. The reason is that, in the opposite case, a change of the stop signal could be much more disruptive to the cell.
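The figures quoted in the Results and in this section can be checked with a short exact computation. The sketch below is our own illustration, not the authors' original program: it evaluates P(n, k) from Eq. (3) with exact fractions, locates the maximizing k, computes the range of Eq. (4) by solving Eq. (5) with Newton's method, and counts the codes S(64, 21) · 21!. All function names are assumptions made for the example.

```python
from fractions import Fraction
from math import exp, factorial, floor, log


def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)] via S(m, k) = k*S(m-1, k) + S(m-1, k-1)."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        row = [0] + [k * row[k] + row[k - 1] if k < m else 1
                     for k in range(1, m + 1)]
    return row


def block_probabilities(n):
    """P(n, k) = S(n, k) / B_n for k = 1..n, as exact fractions (Eq. (3))."""
    row = stirling2_row(n)
    bell = sum(row)  # B_n from Eq. (2); S(n, 0) = 0 for n >= 1
    return {k: Fraction(row[k], bell) for k in range(1, n + 1)}


def most_probable_k(n):
    """Smallest k at which P(n, k) attains its maximum."""
    p = block_probabilities(n)
    return max(p, key=p.get)


def lambert_w(n, tol=1e-12):
    """Solve n = w * exp(w) (Eq. (5)) by Newton's method; assumes n >= 2."""
    w = log(n)
    for _ in range(100):
        step = (w * exp(w) - n) / (exp(w) * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def yu_bounds(n):
    """Range of k allowed by Eq. (4)."""
    base = floor(exp(lambert_w(n)))
    return base - 2, base + 1


for n in (61, 62, 63, 64):
    p = block_probabilities(n)
    lo, hi = yu_bounds(n)
    shown = ", ".join(f"P({n},{k})={float(p[k]):.3f}" for k in range(lo, hi + 1))
    print(f"n={n}: {lo} <= k_n <= {hi}, argmax k={most_probable_k(n)}; {shown}")

# Number of codes mapping 64 codons onto 21 signals (20 amino acids plus stop).
print(f"S(64, 21) * 21! = {float(stirling2_row(64)[21] * factorial(21)):.3e}")
```

Note that `most_probable_k` reports a single k even when two neighbouring values of P(n, k) are nearly equal, as for n = 62 in Table 1, so the printed probabilities should be inspected alongside it.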
The case w = 0 is equivalent to the case in which the stop signal has not been fixed before the amino acid signals, so that it must be treated as one more amino acid signal used to build the partitions of the set of codons. Thus, in the previous results for 64 encoding codons without a previously fixed stop codon (w = 0), the most probable partition size (k_64 = 20) really corresponds to 19 amino acid signals plus one stop signal.
There is much evidence for a gradual emergence of the genetic code. Thus, according to a classification scheme of the genetic code, it has been hypothesized that the genetic code started with a binary doublet code and developed via a quaternary doublet code into the contemporary triplet code (Wilhelm and Nikolajewa, 2004). Moreover, there are several theories to explain the incorporation of amino acids into the canonical code (Deamer and Weber, 2010; Lu and Freeland, 2006; Weber and Miller, 1981). A consensus temporal order of evolutionary appearance of amino acids and their respective codons has been reconstructed by integrating different criteria and rules (Trifonov, 2004). To explain the current genetic code, it has been proposed that the driving force during evolution was positive selection pressure for increased diversity and functionality of proteins (Higgs, 2009).
However, with regard to what determines the number of encoded amino acids, there are only studies of certain restrictions. Many amino acids are theoretically possible, but only a small number of them are encoded in the genetic code and used as building blocks to assemble proteins (Grützmann et al., 2010). It has been suggested that the triplet codons gradually evolved from two types of ambiguous doublet codons, depending on whether the first two or the last two bases of the triplet are read, and that this would explain why there are no more than 20 encoded amino acids (Wu et al., 2005).
On the other hand, the Shannon information associated with the known primary sequences of natural proteins imposes a lower limit on the number of amino acids of an early alphabet, and the original alphabet might have contained 7 or more amino acids (Fernandez, 2004). In addition, on the grounds of increased cell viability, and assuming random changes in the pattern of degeneration, we believe that the most probable (most stable) number k of encoded amino acids must be equal to the most probable number k of blocks in the partitions of the set of amino acid-encoding codons, with each partition representing a pattern of degeneration of the genetic code. Thus, our results suggest that at some early step of evolution, perhaps by means of thermodynamic mechanisms (Zhang, 2007), the most probable number of blocks in a partition determined the actual number of encoded amino acids.
Our results are consistent with a gradual emergence of the genetic code. Namely, at the beginning, the 20 amino acids would not have been employed simultaneously. Specifically, we propose a two-step evolution model for the origin of the genetic code.
(i) The entropic step: First, some metabolic system determined the partition of the set of codons into k subsets, which could be any of the S(n, k) possible partitions. We then say that there are S(n, k) microscopic states for each macroscopic k-state, the latter having probability P(n, k). At the end of an entropic evolution, the system reaches the maximum-entropy macroscopic k_n-state, corresponding to the maximum probability P(n, k_n). This entropic step of evolution would have been supported by some unknown primitive metabolic structure. The entropy for distinguishable entities in indistinguishable states has already been studied elsewhere, as well as its relationship with a particular pattern among all possible partitions, by means of the Stirling numbers of the second kind (Niven, 2007). Here, instead, we are interested in all the S(n, k) partitions (each one of the S(n, k) microstates that correspond to the macrostate k), and not only in partitions with a common pattern of block sizes (e.g., {1, 6, 3, 1, 2, 2, ...}).
(ii) The deterministic step: This phase determines a function between the set of k amino acids and the partitioned set of n codons described above, that is, the genetic code.
The process, driven by selection pressure, should have been increasingly complex: frameshift reorganization from the doublet code to the triplet code, as well as the gradual appearance of the amino acids in the genetic code, in accordance with the literature.
Since the second step is a deterministic event, the probability of obtaining the canonical genetic code, considering both the entropic and the deterministic evolutionary steps, corresponds to the maximum probability P(n, k_n) of the entropic step. The P(n, k_n) values are shown in Table 1, with k_n values very similar to the actual number of encoded amino acids. Interestingly, the aminoacyl tRNA synthetase has two functional protein domains that could be related to these two evolutionary steps. However, here both steps have been proposed to occur consecutively in a primitive metabolic system.
Obtaining the optimal number of encoded amino acids could have been one of the driving forces in the evolution of the genetic code. However, it is still necessary to define a molecular mechanism that explains how a primitive self-replicating metabolic system was able to discriminate between different patterns of degeneration.
Acknowledgment
We thank Eduardo Karahanian for critical reading of the manuscript.
References
Bailly, M., Giannouli, S., Blaise, M., Stathopoulos, C., Kern, D., Becker, H.D., 2006. A single tRNA base pair mediates bacterial tRNA-dependent biosynthesis of asparagine. Nucleic Acids Res. 34, 6083–6094, doi:10.1093/nar/gkl622, pii:gkl622.
Blight, S.K., Larue, R.C., Mahapatra, A., Longstaff, D.G., Chang, E., Zhao, G., Kang, P.T., Green-Church, K.B., Chan, M.K., Krzycki, J.A., 2004. Direct charging of tRNA(CUA) with pyrrolysine in vitro and in vivo. Nature 431, 333–335.
Chen, I.A., Salehi-Ashtiani, K., Szostak, J.W., 2005. RNA catalysis in model protocell vesicles. J. Am. Chem. Soc. 127 (September (38)), 13213–13219.
Comtet, L., 1974. Advanced Combinatorics. The Art of Finite and Infinite Expansions. Reidel Publishing Company, Dordrecht, Holland/Boston, USA.
Deamer, D., Weber, A.L., 2010. Bioenergetics and life's origins. Cold Spring Harb. Perspect. Biol. 2 (February (2)), a004929.
Fernandez, A., 2004. Lower limit to the size of the primeval amino acid alphabet. Z. Naturforsch. C 59, 151–152.
Freeland, S.J., Knight, R.D., Landweber, L.F., Hurst, L.D., 2000. Early fixation of an optimal genetic code. Mol. Biol. Evol. 17 (4), 511–518.
Grützmann, K., Böcker, S., Schuster, S., 2010. Combinatorics of aliphatic amino acids. Naturwissenschaften 98 (1), 79–86.
Harper, L.H., 1967. Stirling behavior is asymptotically normal. Ann. Math. Stat., 410–414.
Higgs, P.G., 2009. A four-column theory for the origin of the genetic code: tracing the evolutionary pathways that gave rise to an optimized code. Biol. Direct 4, 16, doi:10.1186/1745-6150-4-16.
Koonin, E.V., Novozhilov, A.S., 2009. Origin and evolution of the genetic code: the universal enigma. IUBMB Life 61, 99–111, doi:10.1002/iub.146.
Lu, Y., Freeland, S., 2006. On the evolution of the standard amino-acid alphabet. Genome Biol. 7 (1), 102.
Nieselt-Struwe, K., Wills, P.R., 1997. The emergence of genetic coding in physical systems. J. Theor. Biol. 187, 1–14, doi:10.1006/jtbi.1997.0404.
Niven, R.K., 2007. Combinatorial entropy for distinguishable entities in indistinguishable states. AIP Conf. Proc. 965, 96–103.
Novozhilov, A.S., Wolf, Y.I., Koonin, E.V., 2007. Evolution of the genetic code: partial optimization of a random code for robustness to translation error in a rugged fitness landscape. Biol. Direct 2, 24, doi:10.1186/1745-6150-2-24.
Sammet, S.G., Bastolla, U., Porto, M., 2010. Comparison of translation loads for standard and alternative genetic codes. BMC Evol. Biol. 10, 178, doi:10.1186/1471-2148-10-178.
Trifonov, E.N., 2004. The triplet code from first principles. J. Biomol. Struct. Dyn. 22 (1), 1–11.
Vetsigian, K., Woese, C., Goldenfeld, N., 2006. Collective evolution and the genetic code. Proc. Natl. Acad. Sci. U.S.A. 103 (28), 10696–10701.
Weber, A.L., Miller, S.L., 1981. Reasons for the occurrence of the twenty coded protein amino acids. J. Mol. Evol. 17, 273–284.
Wilhelm, T., Nikolajewa, S., 2004. A new classification scheme of the genetic code. J. Mol. Evol. 59 (5), 598–605.
Wu, H.L., Bagby, S., van den Elsen, J.M., 2005. Evolution of the genetic triplet code via two types of doublet codons. J. Mol. Evol. 61, 54–64, doi:10.1007/s00239-004-0224-3.
Yu, Y., 2009. Bounds on the location of the maximum Stirling numbers of the second kind. Discrete Math. 309, 4624–4627.
Zhang, H.Y., 2007. Exploring the evolution of standard amino-acid alphabet: when genomics meets thermodynamics. Biochem. Biophys. Res. Commun. 359, 403–405, doi:10.1016/j.bbrc.2007.05.115.
Zhang, Y., Baranov, P.V., Atkins, J.F., Gladyshev, V.N., 2005. Pyrrolysine and selenocysteine use dissimilar decoding strategies. J. Biol. Chem. 280, 20740–20751, doi:10.1074/jbc.M501458200.