Computational Biology and Chemistry 42 (2013) 35–39
Contents lists available at SciVerse ScienceDirect
Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem
A more accurate relationship between ‘effective number of codons’ and GC3s under assumptions of no selection Xiong’en Liu ∗ School of Computer and Information, Fujian Agriculture and Forestry University, Fuzhou 350002, China
a r t i c l e
i n f o
Article history: Received 21 August 2012 Received in revised form 7 November 2012 Accepted 14 November 2012 Keywords: Codon usage bias Effective number of codons (Nc) GC-content at third codon position (GC3s) Reference line
a b s t r a c t The ‘effective number of codons’ (Nc) introduced by Frank Wright in 1990 is one of the best measures to show the state of codon usage biases in genes and genomes. Although estimate methods of Nc have been improved by several investigators since then, no one noticed that the relationship between Nc and GC3s under assumptions of no selection given by Wright has a little but significant deviation. Since the curve showing such a relationship in Nc-plot is a useful reference line to display the main features of codon usage pattern for a number of genes, its high accuracy is important and necessary. Under ideal and ultimate conditions listed in this text a computational sample of Nc versus GC3s was derived and calculated. By nonlinear regression analysis, the relationship between Nc and GC3s without synonymous 2 codon selection can be approximated by: Nc = 2.5 − s + 29.5/(s2 + (1 − s) ), instead of Wright’s: Nc = 2 2 + s + 29/(s2 + (1 − s) ), where s denotes GC3s. The goodness of fit analysis of both confirmed that the new formula presented in this text is more accurate than the original one. In addition, in the case of using the same estimate method of Nc, the situation in overestimation is decreased to a certain extent by using the new reference line in comparison with Wright’s one. © 2012 Elsevier Ltd. All rights reserved.
1. Introduction Many statistical methods have been proposed and used to analyze synonymous codon usage bias (Ikemura, 1981; Sharp and Li, 1987; Wright, 1990; Comeron and Aguadé, 1998; Urrutia and Hurst, 2001; Suzuki et al., 2008; Fox and Erill, 2010). Among them, the ‘effective number of codons’ (Nc) introduced by Frank Wright has widely been used to measure codon usage evenness. Despite its easy calculation the Nc is considered one of the most useful measures to show the state of codon usage biases in a gene or a set of genes. The Nc quantifies codon usage bias in a range from 20 to 61, the value of 20 represents the extreme bias when only one synonymous codon is used to encode one amino acid, and 61 means no bias in the case of equal usage of synonymous codons, since there are 61 different codons encoding 20 different amino acids (with 3 stop codons). The Wright’s formula to calculate Nc was ˆc = 2 + N
9 1 5 3 + + + Fˆ3 Fˆ¯ 2 Fˆ¯ 4 Fˆ¯ 6
k
n Fˆ =
∗ Correspondence address: 15 Shangxiadian Road, Cangshan District, Fuzhou, Fujian, China. Tel.: +86 13665061023. E-mail address:
[email protected]
p2i − 1
i=1
n−1
(2)
where pi denotes the frequency of the ith synonymous codon, k is the number of synonymous codons of an amino acid, and n is the number of the translated amino acid in a gene. The average homozygosity of one degeneracy class was calculated by na
(1)
where Fˆ¯ m denotes the average homozygosity of one degenerate code class with m synonymous codons. The coefficients 9, 1, 5 and
1476-9271/$ – see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compbiolchem.2012.11.003
3 are the number of amino acids encoded by the different classes. Since methionine and tryptophan are specified by a single codon (AUG, UGG) each of them, their Nc value are always 1, so the first item is 2 without further calculation. By Wright’s definition, the homozygosity of synonymous codons for a given amino acid was calculated from the squared allele (codon) frequencies by
Fˆ¯ m =
i=1
na
Fˆi (3)
where na is the number of amino acids encoded by one degeneracy class, it may be 9, 1, 5, or 3. Even under assumptions of no selection the Nc value may depart from 61 because of variation in background nucleotide composition. Wright described the relationship between Nc and GC3s (the
36
X. Liu / Computational Biology and Chemistry 42 (2013) 35–39
2. Methods 2.1. Computational sample of Nc versus GC3s Previous simulation studies have shown that Nc values will be overestimated for gene lengths of less than several hundred bp, and the longer the length of a gene the truer and stabler the estimated Nc value tends to be (Wright, 1990; Comeron and Aguadé, 1998; Novembre, 2002; Fuglsang, 2006a). In theory, the relationship between Nc and GC3s under assumptions of no selection describes the ultimate Nc values at different GC3s varying from 0 to 1. So, to build such a relationship should be under ideal and ultimate conditions listed as below.
Fig. 1. View of relationships between observed Nc values and the values expected respectively by Wright’s formula and the new one presented in this text.
proportion of GC-content in the third codon position) under no selection by an approximate formula Nc = 2 + s +
29 s2 + (1 − s)2
(4)
where s denotes GC3s. Using Eq. (4) one can plot a bell-shaped curve which represents the expected Nc values versus GC3s under assumptions of no selection in a Nc-plot (see Fig. 2 on second page to last). The estimated Nc values of a set of genes with different GC3s content can be pointed out below or on the curve. Such a plot is a useful visual form to show the main features of codon usage pattern for a number of genes. According to Eq. (1), estimation of codon homozygosity is the key to the estimation of Nc. Several attempts have been made to improve estimating methods over last decade (Novembre, 2002; Marashi and Najafabadi, 2004; Fuglsang, 2004, 2005, 2006a,b, 2008; Banerjee et al., 2005). In contrast with Wright’s original method new estimators have taken background nucleotide composition, sequence length, absence of codons encoding individual amino acid, and bias discrepancy into account. Although of the improvements for estimate of codon homozygosity, no one noticed that the relationship between Nc and GC3s under assumptions of no selection described with Eq. (4) exists a little but significant deviation. Let’s examine the expected Nc value at the specific case of GC3s, 0.5, i.e. A = U = G = C. The Nc value in such a case must be the maximum according to its definition. However, the expected Nc value is not 61, but 60.5 by Eq. (4). In overall region, the discrepancy is obvious between observed Nc values (see sampling method in next section) and the values expected by Wright’s formula, as shown in Fig. 1. Different factors have been proposed to influence codon usage bias, including transcriptional selection, translation efficiency, gene expression level, GC composition, amino acid conservation, protein hydropathy, RNA stability and even adaptation to particular habitat (Ermolaeva, 2001; Lynn et al., 2002; Kliman et al., 2003). However, the main goal of this paper is a technique, rather than using this technique to do actual biological research to explain various issues related to codon usage. This study only focused on the relationship between codon usage bias, indicated by Nc, and GC-content at third codon position under assumptions of no selection forces. Since the curve described with such a relationship is the reference line in a Nc-plot, its high accuracy is important and necessary.
• There is a hypothetical gene with infinite length • All 20 different amino acids are included, and the number of each is infinite • Synonymous codon usage is under no selection, GC3s bias is due to mutation • No GC-content bias exist at other silent sites (the 1st and 2nd codon position) For twofold degenerate codons, the equivalent nucleotides are always either two purines (A/G) or two pyrimidines (C/U) at third position. Under the above-mentioned conditions, the probability of one synonymous codon is s or 1 − s, so their homozygosity can be calculated by (see detail derivation in Appendix A.1) F2 = s2 + (1 − s)2
(5)
The only one amino acid, isoleucine, is encoded by threefold degenerate codons: AUU, AUC, or AUA (AUG encodes methionine). Under the above conditions, the probabilities of these three synonymous codons are (1 − s)/(2 − s), s/(2 − s), and (1 − s)/(2 − s), thus, its homozygosity can be calculated by (see Appendix A.2) F3 =
s2 + 2(1 − s)2 (2 − s)2
(6)
For fourfold degenerate codons, all nucleotide substitutions at third position are synonymous. Under the above conditions, the probability of one synonymous codon is s/2 or (1 − s)/2, and their homozygosity can be calculated by (see Appendix A.3) F4 =
s2 + (1 − s)2 2
(7)
There are three amino acids specified by six synonymous codons: serine, leucine, and arginine. Their six synonymous codons are composed of a group of fourfold degenerate codons and another group of twofold degenerate codons without exception. Then, the probability of one synonymous codon is s/3 or (1 − s)/3. So, their homozygosity can be calculated by (see Appendix A.4) F6 =
s2 + (1 − s)2 3
(8)
A computational sample of Nc versus GC3s was collected by a program’s calculation by Eq. (5)–(8), and (1) (see Appendix B). Systematic sampling method was adopted in this text. In the sample (see Appendix C) there are 101 elements which were selected at regular GC3s interval of 0.01. 2.2. Nonlinear regression and goodness of fit analysis Despite discrepancy of observed Nc values and the values expected by Wright’s formula, the distribution pattern of observed values of the above computational sample is similar to the bellshaped curve described by Eq. (4), and nearly bilateral symmetric but its left side (GC3s below 0.5) a little higher than its right side.
X. Liu / Computational Biology and Chemistry 42 (2013) 35–39
So, a nonlinear regression model in which observed Nc value was treated as a dependent variable and GC3s as an independent variable was selected as
comparison of their goodness of fits four statistic measures were calculated, including summed square of residuals, labeled as SSE
SSE = Nc = ˛ − s +
ˇ s2
+ (1 − s)
2
(9)
37
n
2
ˆ i) , (Nci − Nc
i=1
coefficient of determination (R-square), labeled as R2 where ˛ and ˇ were unknown parameters, they were estimated by using nonlinear least squares approach, and making the function model to fit data of the above-mentioned computational sample. After curve fitting the goodness of fit was evaluated both for the new equation (Eq. (10)) and Wright’s original one (Eq. (4)). For
n
R2 = 1 −
ˆ i) (Nci − Nc
2
i=1
n
, ¯ (Nci − Nc)
2
i=1
Fig. 2. Nc-plots for E. coli (a and a ), S. cerevisiae (b and b ), and O. sativa J. (c and c ). The number of genes analyzed was 676, 2207, and 25,467, respectively. As Nc could not be calculated because of rarely used or missing amino acid, all genes analyzed are CDS with lengths of more than 102 codons, including all 20 amino acids with each number of more than 5, and downloaded from GenBank of NCBI. The Nc value of each gene was estimated by Eqs. (1)–(3). The bell-shaped curve in each of the figures represents the relationship between Nc and GC3s under assumptions of no selection, red curves in figures on left side were given by Wright’s formula (Eq. (4)) and blue curves on right side by the new one presented in this paper (Eq. (10)).
38
X. Liu / Computational Biology and Chemistry 42 (2013) 35–39
Table 1 Comparison in the number of genes in which estimated Nc values are above expected values in Fig. 2. Totala
Organism E. coli S. cerevisiae O. sativa J. T. aestivumd V. viniferad
676 2207 25,467 308 10,233
Above Wright’sb
Above this text’sc
19 136 557 17 229
13 70 374 16 130
a
Total number of genes analyzed. Number of genes in which Nc value is above the expected value as givens in Eq. (4), representing cross points over the red curve in Fig. 2. c Number of genes in which Nc value is above the expected value as given in Eq. (10), representing cross points over the blue curve in Fig. 2. d Not to be plotted in Nc-plots in Fig. 2 because of the limit of layout. b
adjusted R-square, labeled as adjusted R2 adjusted R2 = 1 −
(n − 1)(1 − R2 ) , n−p−1
RMSE =
n−p
,
where p denotes the number of fitted coefficients estimated, i.e. p = 3. 3. Results and conclusion 3.1. New relationship between Nc and GC3s under assumptions of no selection After fitting the data of the above-mentioned computational sample with the nonlinear function model (Eq. (9)), a new approximate formula was built up Nc = 2.5 − s +
29.5
(10)
(s2 + (1 − s)2 )
which can represent the relationship between Nc and GC3s under assumptions of no selection of synonymous codons. The new bell-shaped curve as given in Eq. (10) is shown in Figs. 1 (on previous page) and 2. As an example of the use of both the Wright’s formula (Eq. (4)) and the new one (Eq. (10)) in Nc-plots, consider the display of codon usage data from Escherchia coli (see a and a in Fig. 2), Saccharomyces cerevisiae (see b and b in Fig. 2), and Oryza sativa Japonica Group (see c and c in Fig. 2). By comparison, the number of genes in which estimated Nc values are above the expected values (ultimate values) given by Wright’s formula and the new one are computed and listed in Table 1. In general, the overestimation of Nc is due to short length of a gene (Wright, 1990; Comeron and Aguadé, 1998; Novembre, 2002; Fuglsang, 2006a). However, even using the same estimate method of Nc the situation in overestimation is different upon which reference line representing the relationship between Nc and GC3s under assumptions of no selection is adopted. As shown in Fig. 2 and Table 2 Comparisons in four statistic measures of goodness of fit with Wright’s formula. Equation 2
Nc = 2 + s + 29/(s + (1 − s) ) 2 Nc = 2.5 − s + 29.5/(s2 + (1 − s) ) 2
Table 1, by the new reference line as given in Eq. (10) overestimation of Nc is decreased to a certain extent in comparison with Wright’s original equation (see reason in next section). 3.2. Goodness of fit analysis
and standard error of the regression, labeled as RMSE
n 2 ˆ i) (Nci − Nc i=1
Fig. 3. Residuals for Nc values expected by Wright’s formula (× points) and the new ˆ c ). one (• points) presented in this text (where residual = Nc − N
SSE
R2
Adjusted R2
RMSE
41.51022 0.53594
0.99643 0.99995
0.99632 0.99995
0.65417 0.07433
As shown in Fig. 3, the residuals from the new formula are smaller than those of Wright’s. Meanwhile, the distribution features of them can explain why the situation in overestimation of Nc is more serious by using Wright’s reference line than that presented in this text. Table 2 lists the results of four numerical measures evaluating goodness of fit for both equations. Both Rsquare and adjusted R-square of Eq. (10) are much closer to 1 than Eq. (4), and standard error of the regression of the new function model is much closer to zero than Wright’s. So, the new equation as given in Eq. (10) fits the data of the computational sample better. The goodness of fit analysis confirmed that the relationship between Nc and GC3s under assumptions of no selection can be described much closer to the truth and the ultimate by the new equation (Eq. (10)) presented in this text than Wright’s one (Eq. (4)). The comparison of the use of the new formula and Wright’s one had shown that the deviation of original formula may partly cause the appearance of overestimate of observed Nc values, and the new one can reduce this case. The bell-shaped curve which is the reference line representing the expected and ultimate Nc values versus GC3s varying from 0 to 1 in Nc-plots can accurately be plotted by the new equation, instead of the Wright’s original one. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.compbiolchem. 2012.11.003. References Banerjee, T., et al., 2005. Towards a resolution on the inherent methodological weakness of the effective number of codons used by a gene. Biochemical and Biophysical Research Communications 330, 1015–1022. Comeron, J.M., Aguadé, M., 1998. An evaluation of measures of synonymous codon usage bias. Journal of Molecular Evolution 47, 268–274. Ermolaeva, M.D., 2001. Synonymous codon usage in bacteria. Current Issues in Molecular Biology 3, 91–97. Fox, J.M., Erill, I., 2010. Relative codon adaptation: a generic codon bias index for prediction of gene expression. DNA Research 17, 185–196. Fuglsang, A., 2004. The ‘effective number of codons’ revisited. Biochemical and Biophysical Research Communications 317, 957–964. Fuglsang, A., 2005. On the methodological weakness of ‘the effective number of codons’: a reply to Marashi and Najafabadi. Biochemical and Biophysical Research Communications 327, 1–3. Fuglsang, A., 2006a. Estimating the effective number of codons: the Wright way of determining codon homozygosity leads to superior estimates. Genetics 172, 1301–1307. Fuglsang, A., 2006b. Accounting for background nucleotide composition when measuring codon usage bias: brilliant idea, difficult in practice. Molecular Biology and Evolution 23, 1345–1351.
X. Liu / Computational Biology and Chemistry 42 (2013) 35–39 Fuglsang, A., 2008. Impact of bias discrepancy and amino acid usage on estimates of the effective number of codons used in a gene, and a test for selection on codon usage. Gene 410, 82–89. Ikemura, T., 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. Journal of Molecular Biology 151, 389–409. Kliman, R.M., et al., 2003. Selection conflicts, gene expression, and codon usage trends in yeast. Journal of Molecular Evolution 57, 98–109. Lynn, D.J., et al., 2002. Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Research 30, 4272–4277. Marashi, S.A., Najafabadi, H.S., 2004. How reliable re-adjustment is: correspondence regarding A. Fuglsang, the ‘effective number of codons’ revisited. Biochemical and Biophysical Research Communications 324, 1–2.
39
Novembre, J.A., 2002. Accounting for background nucleotide composition when measuring codon usage bias. Molecular Biology and Evolution 19, 1390–1394. Sharp, P.M., Li, W.-H., 1987. The codon adaptation index – a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research 15, 1281–1295. Suzuki, H., et al., 2008. Comparison of correspondence analysis methods for synonymous codon usage in bacteria. DNA Research 15, 357–365. Urrutia, A., Hurst, L., 2001. Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics 159, 1191–1199. Wright, F., 1990. The ‘effective number of codons’ used in a gene. Gene 87, 23–29.