Gene 400 (2007) 71 – 81 www.elsevier.com/locate/gene
Nature of selective constraints on synonymous codon usage of rice differs in GC-poor and GC-rich genes
Pamela Mukhopadhyay, Surajit Basak, Tapash Chandra Ghosh ⁎
Bioinformatics Centre, Bose Institute, P 1/12, C.I.T. Scheme VII M, Kolkata-700 054, India
Received 22 February 2007; received in revised form 28 April 2007; accepted 31 May 2007
Available online 16 June 2007
Received by M. Di Giulio
Abstract
Synonymous codon usage and cellular tRNA abundance are thought to have co-evolved to optimize translational efficiency in highly expressed genes. Taking advantage of publicly available gene expression data for rice and Arabidopsis, we demonstrate that tRNA gene copy number is not the only driving force favoring translational selection in all highly expressed genes of rice. We found that the forces favoring translational selection differ between the GC-rich and GC-poor classes of genes. In highly expressed genes of the GC-poor class there is a perfect correspondence between the majority of preferred codons and tRNA gene copy number, which confers translational efficiency on this group of genes. However, tRNA gene copy number is not fully consistent with models of translational selection in the GC-rich group of genes, where constraints on mRNA secondary structure play a role in optimizing codon usage in highly expressed genes. © 2007 Elsevier B.V. All rights reserved.
Keywords: Codon usage; tRNA abundance; mRNA folding stability; Translational efficiencies
Abbreviations: HEG, highly expressed genes; LEG, lowly expressed genes; RSCU, relative synonymous codon usage.
⁎ Corresponding author. Tel.: +91 33 2355 6626; fax: +91 33 2355 3886. E-mail address: [email protected] (T.C. Ghosh).
0378-1119/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2007.05.027

1. Introduction
In most species synonymous codons are not used with equal frequencies, a phenomenon known as codon usage bias. Codon bias is generally governed by a balance between mutation, genetic drift and natural selection (Bulmer, 1991; Sharp et al., 1993; Akashi and Eyre-Walker, 1998; Gupta and Ghosh, 2001; Basak and Ghosh, 2005). Codon usage bias varies considerably among organisms and even among the genes of the same organism (Grantham et al., 1980a,b, 1981; Gouy and Gautier, 1982). There is a positive correlation between gene expression level and the level of codon usage bias at synonymous sites in a number of eukaryotic genomes, such as C. elegans (Duret, 2000) and Drosophila (Akashi, 1995; Moriyama and Powell, 1997; Powell and Moriyama, 1997), consistent with models of translational selection on codon usage
(Akashi, 2001; Duret and Mouchiroud, 1999). The above findings emphasized that codon usage is generally biased towards “preferred” codons that generally correspond to the most abundant tRNA species (Ikemura, 1992). Even in unicellular organisms such as E. coli and S. cerevisiae, the codons translated by the most abundant tRNAs were found to be the most frequently used (Ikemura, 1981, 1982). However, a very recent analysis of the S. cerevisiae genome using experimentally determined gene expression data has challenged the translational selection hypothesis, which correlates codon usage bias with tRNA abundance in highly expressed genes (Kahali et al., 2007). This hypothesis has not been confirmed in higher eukaryotes (Kanaya et al., 2001). Codon usage in mammals is mainly determined by the spatial arrangement of genomic G + C content, i.e., the isochore structure (Sharp et al., 1995). Heterogeneity in base composition is a causative factor of codon usage bias in warm-blooded vertebrates, where natural selection drives the formation of GC-rich isochores (Bernardi and Bernardi, 1985; Bernardi, 2004). A similar elevation in GC content has also been reported in Gramineae (Carels and Bernardi, 2000), leading to the formation
P. Mukhopadhyay et al. / Gene 400 (2007) 71–81
of two classes of genes (GC-rich and GC-poor) in this family of plants (Montero et al., 1990; Wong et al., 2002; Guo et al., 2007). The existence of two different GC classes has been explained by (1) the selectionist hypothesis and (2) the mutational bias hypothesis. The selectionist hypothesis argues that selective advantages favored GC increments in the synonymous positions of genes (Bernardi, 2004), whereas the mutational bias hypothesis holds that the biases are due to mutations in the genes coding for repair, replication or recombination (Filipski, 1987; Wolfe et al., 1989; Eyre-Walker, 1990). However, analyses of gene expression data of rice genes have reported that synonymous codon usage is related to translational selection in rice (Liu et al., 2004; Wang and Roossinck, 2006). Recently, Shi et al. (2006), by analyzing paralogs of rice genes, suggested that the average synonymous substitution rate is lower in GC3-rich genes than in GC3-poor genes, indicating that the synonymous substitution rate is negatively correlated with GC content in the rice genome. The availability of gene expression data for both rice (Nakano et al., 2006) and Arabidopsis (Meyers et al., 2004) gives us a unique opportunity to investigate the influence of gene expression on codon usage in rice genes. Here we demonstrate that both mRNA stability and tRNA gene copy number are major evolutionary forces shaping the codon usage differences between highly and lowly expressed genes in the rice genome, depending on which of the two classes of genes, characterized by high and low GC content, is considered.

2. Materials and methods
The genomes of rice and Arabidopsis were downloaded, respectively, from the FTP server of NCBI and ftp://ftp.dna.affrc.go.jp/pub/RiceGAAS/current. Complete protein coding sequences from nuclear genes were retrieved. All hypothetical coding sequences, as well as sequences having fewer than 100 codons, were excluded from our dataset.
Also, genes containing internal stop codons were removed, leaving a dataset of 8387 genes for our analysis. Homologous genes between the rice and Arabidopsis genomes were identified using gapped BLASTP searches (Altschul et al., 1997) with an expect (E) value cut-off of 10.0 × 10⁻⁶. Pairs of coding sequences that have at least 30% amino acid positives and overlap over at least 80% of their length were retained for the analysis. The maximum gap size allowed between a pair of sequences was 5%. Owing to the presence of many multi-copy genes in both Arabidopsis and rice, some sequences from one species showed high levels of sequence similarity with more than one sequence from the other species. In those cases the sequence pair that produced the higher degree of sequence similarity was retained (Banerjee et al., 2006). Our final dataset comprised a total of 3525 homologous gene pairs.

The public domain MPSS (Massively Parallel Signature Sequencing) expression data for rice (Nakano et al., 2006) (http://mpss.udel.edu/rice/) and Arabidopsis (Meyers et al., 2004) (http://mpss.udel.edu/at/) present a more accurate estimation of gene transcript levels and are easily accessible (Meyers et al., 2004). The expression level of a gene expressed in a
single library is estimated by counting the number of individual 17-base signature sequences representing each gene (Ren et al., 2006). The expression level of a gene expressed in several expression libraries was estimated by averaging its expression values over all libraries considered. We sorted the expression values in each library in ascending order and then divided them into 5 groups, each containing 20% of the population (Ren et al., 2006). Individual genes were assigned an expression rank from 1 (low expression) to 5 (high expression) according to increasing average expression level.

Correspondence analysis (Greenacre, 1984), available in CodonW 1.4.2 (http://www.molbiol.ox.ac.uk/cu/), was used to investigate the major trend in relative synonymous codon usage variation among the genes.

For each native mRNA sequence, 50 random sequences were generated using the randomization protocol CodonShuffle (Katz and Burge, 2003), which randomly permutes synonymous codons within each degenerate codon family while preserving the exact count of each codon and the order of encoded amino acids of the original transcript. The Zipfold program, available at http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form4.cgi, was used to predict the folding free energy of each native mRNA sequence and of its corresponding shuffled sequences. The folding free energy difference between the native sequence and the corresponding random sequences was measured by the Z-score, $Z = (E_{\mathrm{native}} - E_{\mathrm{random}}) / \mathrm{STD}$, where $E_{\mathrm{native}}$ denotes the folding free energy of the native mRNA sequence, $E_{\mathrm{random}}$ denotes the average over a large number of random sequences generated from the native sequence and STD denotes its standard deviation. A positive Z-score indicates that the native sequence has a higher folding free energy than the average of the random sequences and is therefore thought to have a less stable secondary structure than the random sequences.
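The shuffling and Z-score steps described above can be sketched as follows. This is a minimal illustration, not the CodonShuffle or Zipfold code: the function names are hypothetical, the standard genetic code is assumed, the folding free energies are assumed to come from an external folding program and are simply passed in as numbers, and the sample standard deviation is assumed for STD.

```python
import random
import statistics

# Standard genetic code in TCAG order (assumed; the paper does not name a table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(cds):
    """Split a coding sequence into codons, ignoring a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def shuffle_synonymous(cds, rng):
    """Permute synonymous codons among the positions of each amino acid.

    Codon counts and the encoded protein are preserved by construction.
    """
    codons = split_codons(cds)
    positions = {}
    for index, codon in enumerate(codons):
        positions.setdefault(CODON_TABLE[codon], []).append(index)
    shuffled = list(codons)
    for indexes in positions.values():
        pool = [codons[i] for i in indexes]
        rng.shuffle(pool)
        for i, codon in zip(indexes, pool):
            shuffled[i] = codon
    return "".join(shuffled)


def z_score(e_native, e_random):
    """Z = (E_native - mean(E_random)) / STD(E_random), sample STD assumed."""
    return (e_native - statistics.mean(e_random)) / statistics.stdev(e_random)
```

In use, each native sequence would be shuffled 50 times, every sequence folded externally, and the resulting energies passed to `z_score`.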
The transfer RNA gene copy number necessary to determine the major codons (Kotlar and Lavner, 2006) for each amino acid in rice was taken from Xiyin et al. (2002) and tRNA copy number for Arabidopsis was taken from http://lowelab.ucsc.edu/GtRNAdb/Athal/. The Student's t-test was used to evaluate the significance of all the pair-wise differences. The statistical tests were performed using the SPSS (13.0) package.
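The sequence filters described in this section (at least 100 codons, no internal stop codons) can be sketched as below. The function name is hypothetical, and two details are assumptions not stated in the text: the terminal stop codon is not counted towards the 100-codon threshold, and sequences whose length is not a multiple of three are rejected.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_filters(cds, min_codons=100):
    """Return True if a coding sequence survives the length and stop-codon filters."""
    cds = cds.upper()
    if len(cds) % 3 != 0:  # assumption: incomplete reading frames are discarded
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]  # assumption: the terminal stop is not counted
    if len(codons) < min_codons:
        return False
    # Any stop codon left at this point is internal.
    return not any(codon in STOP_CODONS for codon in codons)
```

Exclusion of hypothetical coding sequences depends on the annotation and is not covered by this sketch.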
Fig. 1. Average scores of the positions of genes in five expression ranks of rice along the second major axis. Error bars indicate 99% confidence interval.
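The five expression ranks plotted in Fig. 1 follow from averaging each gene's signature counts over libraries and splitting the sorted genes into equal-sized groups. A minimal sketch, with hypothetical function names and an assumed input layout (a mapping from gene identifier to a list of per-library counts):

```python
def mean_expression(library_counts):
    """Average each gene's signature counts over all libraries considered."""
    return {gene: sum(counts) / len(counts) for gene, counts in library_counts.items()}


def expression_ranks(expression, n_groups=5):
    """Assign ranks 1..n_groups to genes sorted by ascending expression.

    Each group holds (as nearly as possible) the same number of genes;
    ties are broken by input order, which is an assumption of this sketch.
    """
    ordered = sorted(expression, key=expression.get)
    total = len(ordered)
    return {gene: (index * n_groups) // total + 1 for index, gene in enumerate(ordered)}
```

The same helper with `n_groups=11` would reproduce the kind of grouping used for expression breadth.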
3. Results
The main factor driving codon usage in rice is the strong mutational bias towards G and C, as revealed by the relative synonymous codon usage (RSCU) values, which supports the earlier observation of Liu et al. (2004). This is also evident when correspondence analysis (CoA) is applied to the RSCU values of the 8387 coding sequences of rice, classified into five ranks according to gene expression level. The first (Ax1) and second (Ax2) axes generated by CoA account for 51.41% and 4.61% of the total variation, respectively. The positions of the genes along the first major axis are strongly correlated with GC3 (R = 0.995; p < 10⁻⁴). Fig. 1 shows the average positions of genes in the five expression ranks along the second major axis. From Fig. 1 it is evident that the highly (expression rank 5) and lowly (expression rank 1) expressed genes are separated at the two opposite ends of axis 2. Analysis of variance (ANOVA) showed that the genes are significantly discriminated on the second axis (Ax2) according to their expression levels (F = 40.466, p < 10⁻⁶; Fig. 1). The distribution of codons along the first and second major axes (Fig. 2) shows that C- and T-ending codons are located on the positive side of the second major axis, indicating that natural selection might be related to translational and transcriptional efficiencies (Kawabe and Miyashita, 2003). We also assessed the effect of breadth of expression on the variation of codon usage in our dataset (data not
shown). Breadth of expression is the number of libraries in which transcription of a gene is detected (Urrutia and Hurst, 2001). The genes were sorted in ascending order of expression breadth and divided equally into 11 groups. Each group was assigned an expression rank from 1 (low breadth) to 11 (high breadth). CoA was performed on the dataset to inspect the effect of expression breadth on codon usage in rice. As predicted, all highly expressed genes belong to the highest breadth rank (Urrutia and Hurst, 2001; Subramanium and Kumar, 2004; Kotlar and Lavner, 2006). The result further emphasizes that detection of a gene's transcript under all conditions (highest breadth) is an important determinant of its expression level.

3.1. The nature of selective constraint on synonymous codon usage in rice
It has been discussed elsewhere that the two classes of genes in Gramineae, namely the GC-poor and GC-rich classes, are evolving under contrasting evolutionary forces (Montero et al., 1990; Carels and Bernardi, 2000; Wang et al., 2004). Considering the isochore organization of the rice genome, we therefore attempted to determine the nature of the evolutionary forces that shape codon usage in the GC-poor and GC-rich classes as a function of gene expression level alone. To minimize the effect of GC3 on codon usage, we sorted the dataset into two groups comprising highly and lowly expressed genes from both GC-poor
Fig. 2. The distribution of synonymous codons of all rice genes along the first and second axes of the correspondence analysis.
Table 1
The RSCU values of highly expressed (HEG) and lowly expressed genes (LEG) of rice

AA   Codon    RSCU (HEG)  RSCU (LEG)  tRNA copy no. of Oryza sativa
Phe  TTT      1.16(0.13)  1.2(0.11)   0
Phe  TTC      0.84(1.87)  0.8(1.89)   15
Tyr  TAT^R    1.23(0.21)  1.25(0.09)  0
Tyr  TAC      0.77(1.79)  0.75(1.91)  16
His  CAT      1.34(0.26)  1.39(0.27)  0
His  CAC      0.66(1.74)  0.61(1.73)  11
Asn  AAT      1.23(0.21)  1.28(0.2)   0
Asn  AAC      0.77(1.79)  0.72(1.8)   14
Asp  GAT^R    1.35(0.29)  1.41(0.21)  0
Asp  GAC⁎^P   0.65(1.71)  0.59(1.79)  28
Cys  TGT      1.1(0.08)   1.07(0.09)  0
Cys  TGC      0.9(1.92)   0.93(1.91)  10
Gln  CAA      0.89(0.17)  1.1(0.2)    16
Gln  CAG^P    1.11(1.83)  0.9(1.8)    13
Lys  AAA      0.9(0.06)   0.97(0.12)  10
Lys  AAG⁎^B   1.1(1.94)   1.03(1.88)  22
Glu  GAA      1(0.12)     1.11(0.16)  15
Glu  GAG⁎^P   1(1.88)     0.89(1.84)  29
Val  GTT⁎^P   1.72(0.25)  1.61(0.18)  21
Val  GTC^R    0.7(1.88)   0.65(1.69)  0
Val  GTA      0.61(0.02)  0.7(0.08)   4
Val  GTG      0.97(1.85)  1.04(2.05)  10
Pro  CCT      1.55(0.3)   1.47(0.31)  16
Pro  CCC^R    0.51(1.56)  0.47(1.1)   0
Pro  CCA      1.49(0.28)  1.66(0.31)  11
Pro  CCG      0.45(1.86)  0.39(2.28)  10
Thr  ACT      1.41(0.23)  1.4(0.19)   9
Thr  ACC^R    0.69(2.43)  0.7(1.8)    0
Thr  ACA      1.58(0.15)  1.57(0.2)   8
Thr  ACG      0.32(1.19)  0.33(1.81)  0
Ala  GCT⁎^P   1.72(0.21)  1.54(0.19)  25
Ala  GCC      0.67(1.99)  0.66(1.8)   0
Ala  GCA      1.26(0.19)  1.44(0.17)  11
Ala  GCG      0.35(1.6)   0.37(1.84)  13
Gly  GGT^B    1.4(0.37)   1.18(0.23)  0
Gly  GGC      0.77(2.27)  0.77(2.52)  24
Gly  GGA      1.13(0.34)  1.32(0.25)  13
Gly  GGG      0.7(1.02)   0.73(0.99)  8
Leu  TTA      0.65(0.07)  0.83(0.04)  7
Leu  TTG      1.38(0.3)   1.31(0.3)   9
Leu  CTT⁎^R   1.56(0.37)  1.52(0.23)  19
Leu  CTC^B    0.76(3.49)  0.68(3.14)  0
Leu  CTA      0.61(0.1)   0.74(0.13)  8
Leu  CTG^P    1.05(1.67)  0.92(2.15)  6
Ser  TCT⁎^R   1.57(0.47)  1.52(0.23)  17
Ser  TCC^R    0.73(2.49)  0.72(2.01)  0
Ser  TCA      1.42(0.23)  1.5(0.19)   10
Ser  TCG      0.39(1.54)  0.38(1.81)  7
Ser  AGT      1.06(0.18)  1.09(0.11)  0
Ser  AGC      0.83(1.09)  0.8(1.65)   13
Arg  CGT⁎^B   1.08(0.37)  0.84(0.21)  16
Arg  CGC^B    0.76(2.83)  0.56(2.38)  0
Arg  CGA      0.58(0.13)  0.69(0.16)  4
Arg  CGG^P    0.7(1.22)   0.57(1.69)  7
Arg  AGA      1.46(0.15)  1.89(0.19)  9
Arg  AGG      1.41(1.31)  1.45(1.38)  10
Ile  ATT      1.36(0.2)   1.37(0.24)  23
Ile  ATC^R    0.77(2.7)   0.76(2.54)  0
Ile  ATA      0.87(0.1)   0.88(0.21)  6

RSCU values within parentheses represent the GC-rich class of rice, while the values outside indicate the GC-poor class. Codons marked in bold show significantly (p < 0.05) higher preference in highly expressed genes (HEG) and codons marked with ⁎ hold a perfect correspondence with the most abundant tRNA gene copy number. The underlined codons indicate the presence of a perfectly matching tRNA. The codons marked with superscript R are optimal in the GC-rich class of rice, codons marked with superscript P are optimal in the GC-poor class, and codons marked with superscript B are optimal in both the GC-rich and GC-poor classes of rice.
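For readers unfamiliar with the quantity tabulated above, RSCU is conventionally the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below assumes that conventional definition and the standard genetic code; the function name is hypothetical, and Met, Trp and stop codons are left out, which gives the 59 codons listed in the table.

```python
from collections import Counter

# Standard genetic code in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(sequences):
    """Relative synonymous codon usage pooled over a set of coding sequences."""
    counts = Counter()
    for cds in sequences:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    families = {}
    for codon, amino_acid in CODON_TABLE.items():
        families.setdefault(amino_acid, []).append(codon)
    values = {}
    for amino_acid, family in families.items():
        if amino_acid == "*" or len(family) < 2:
            continue  # stop codons, Met and Trp carry no synonymous choice
        total = sum(counts[codon] for codon in family)
        if total == 0:
            continue  # amino acid absent from the input
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values
```

With this definition the values within one amino acid sum to its degeneracy, which matches the pairs in the table (for example 1.16 + 0.84 = 2 for Phe).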
(GC3 < 45%) (41 genes in highly expressed and 260 genes in lowly expressed) and GC-rich class (GC3 > 80%) (40 genes in highly expressed and 497 genes in lowly expressed), with 10% GC3 variation within each dataset of lowly and highly expressed genes. The co-adaptation of tRNA content and codon usage for the optimal translation of the pool of highly expressed genes is well known in C. elegans (Duret, 2000). To test for translational selection, we identified the optimal codons (Table 1) in both gene classes and investigated the correspondence between codon preferences and tRNA gene copy number in rice. tRNA abundance has been found to correlate strongly with the corresponding tRNA gene copy number in a number of prokaryotic and eukaryotic genomes (Duret, 2000; Kanaya et al., 2001). Optimal codons are those that show a statistically significant increase in frequency in highly expressed genes relative to lowly expressed genes (Sharp and Matassi, 1994). Optimal codons provide fitness benefits to highly expressed genes by enhancing translation efficiency, whereas preferred codons are those that generally correspond to the most abundant tRNA species (Ikemura, 1992). Interestingly, the two gene classes have different optimal codons (Table 1). This difference might be due to different selective constraints shaping codon usage in the highly expressed genes of the two classes. We matched the optimal codons on the basis of the same tRNA isoacceptor. In the GC-poor class, 9 of the 12 optimal codons show perfect correspondence with tRNA copy number, and among these 9 there are 6 preferred codons that correspond to the most abundant tRNA copy number, whereas in the GC-rich class only 4 preferred codons out of 15 optimal codons correspond to the most abundant tRNA gene copy number.
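The tally for the GC-poor class can be reproduced from Table 1 with a short script. This sketch makes two simplifying assumptions that are ours, not the authors' stated procedure: a codon is taken to have a perfectly matching tRNA when its tabulated gene copy number is non-zero, and it is taken to be preferred when that copy number is the highest within its amino acid family. The optimal codon list is read from the P and B marks in Table 1.

```python
# Codons and rice tRNA gene copy numbers in the row order of Table 1.
CODONS = (
    "TTT TTC TAT TAC CAT CAC AAT AAC GAT GAC TGT TGC CAA CAG AAA AAG "
    "GAA GAG GTT GTC GTA GTG CCT CCC CCA CCG ACT ACC ACA ACG GCT GCC "
    "GCA GCG GGT GGC GGA GGG TTA TTG CTT CTC CTA CTG TCT TCC TCA TCG "
    "AGT AGC CGT CGC CGA CGG AGA AGG ATT ATC ATA"
).split()
COPIES = [int(x) for x in (
    "0 15 0 16 0 11 0 14 0 28 0 10 16 13 10 22 15 29 21 0 4 10 16 0 11 10 "
    "9 0 8 0 25 0 11 13 0 24 13 8 7 9 19 0 8 6 17 0 10 7 0 13 16 0 4 7 9 10 "
    "23 0 6"
).split()]
# Family sizes in table order: nine 2-fold, five 4-fold, three 6-fold, then Ile.
FAMILY_SIZES = [2] * 9 + [4] * 5 + [6] * 3 + [3]

TRNA = dict(zip(CODONS, COPIES))
FAMILY_MAX = {}
start = 0
for size in FAMILY_SIZES:
    family = CODONS[start:start + size]
    best = max(TRNA[codon] for codon in family)
    for codon in family:
        FAMILY_MAX[codon] = best
    start += size

# Codons marked P or B in Table 1 (optimal in the GC-poor class).
GC_POOR_OPTIMAL = "GAC CAG AAG GAG GTT GCT GGT CTC CTG CGT CGC CGG".split()


def tally(optimal):
    """Count optimal codons with a matching tRNA and with the most abundant tRNA."""
    with_trna = [codon for codon in optimal if TRNA[codon] > 0]
    preferred = [codon for codon in with_trna if TRNA[codon] == FAMILY_MAX[codon]]
    return len(with_trna), len(preferred)


print(tally(GC_POOR_OPTIMAL))  # (9, 6)
```

Under these assumptions the script returns the 9 matching and 6 preferred codons quoted in the text for the GC-poor class.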
The tRNA gene copy number supporting preferred codons is in favor of translational selection in the GC-poor class in rice (Ikemura, 1985; Sharp and Devine, 1989). Translational selection driven by tRNA copy number has a greater influence in the GC-poor class than in the GC-rich class, where the majority of the optimal codons do not correspond to a perfectly matching tRNA.

3.2. Selective constraints on mRNA secondary structure are related to codon usage variation in GC-rich class
It has been demonstrated that there is selection for local RNA secondary structures in coding regions and that this nucleic acid structure resembles the folding profiles of the coded proteins (Biro, 2006). Further, it has been observed in E. coli that a decrease in the stability of mRNA structure contributes to an increase in mRNA expression (Jia and Li, 2005), suggesting possible relationships between synonymous codon usage and constraints on mRNA secondary structure that subsequently regulate gene expression levels. To investigate whether selection acts on mRNA secondary structure to optimize synonymous codon usage, we randomized the sequences of both the GC-rich and GC-poor classes of rice using the randomization protocol CodonShuffle (Katz and Burge, 2003). For each native mRNA sequence, 50 random sequences were generated. For the native sequences in the GC-poor and GC-rich classes, the average Z-score (Table 2) was calculated and a
Table 2
Average Z-score values of GC-rich and GC-poor classes of genes in highly expressed (HEG) and lowly expressed genes (LEG) of rice

                           HEG            LEG
GC-poor (average Z-score)  −0.00259022    −0.001249458
GC-rich (average Z-score)  −0.002119329   0.00252657
significantly lower (p < 0.05) Z-score has been observed only in the highly expressed genes of the GC-rich class, indicating an additional selection pressure from the mRNA secondary-structure-forming potential on codon usage in this group of genes. To test further whether selective constraint acts on the mRNA secondary structure of highly expressed genes of the GC-rich class to optimize codon usage, we performed a correlation analysis between the position of each gene on the second axis generated by CoA and the corresponding Z-score, for both the GC-poor and GC-rich classes of genes. Interestingly, a significant correlation (R = 0.115, p < 0.01) was observed only in the GC-rich class, which indicates a selective constraint on mRNA secondary structure that modulates codon usage variation in highly expressed genes of the GC-rich class.

3.3. The nature of selective constraint on synonymous codon usage in Arabidopsis
Our results suggest that the nature of the selective constraint shaping codon usage in highly expressed genes differs between the two gene classes of rice, namely the GC-poor and GC-rich classes. To support these results, we investigated whether the difference in selective constraint acting on synonymous codon usage in rice according to regional GC composition has arisen since the divergence of monocots and dicots. It is already known that the divergence in codon usage patterns among rice genes has occurred since the evolutionary divergence of the dicots and monocots approximately 200 million years (My) ago, i.e., over a relatively short evolutionary time, with an increment in the GC content of some rice genes (Bernardi, 2004; Wang and Hickey, 2007). To understand the evolutionary significance of the difference in the nature of selective constraint shaping codon usage in the two gene classes of rice, we compared these rice genes with their homologs in Arabidopsis.
To investigate translational selection optimizing codon usage in the Arabidopsis gene sets homologous to the GC-rich and GC-poor rice genes, we identified optimal codons in the two Arabidopsis homologous gene sets. The correspondence between codon preferences and tRNA gene copy number was then studied. The optimal codons were matched on the basis of the same tRNA isoacceptor. In the GC-rich homologous class (Table 3), 11 of the 13 optimal codons show perfect correspondence with tRNA copy number, and 7 preferred codons show perfect correspondence with the most abundant tRNA copy number. In the GC-poor class (Table 3), all 4 optimal codons show perfect correspondence with tRNA copy number, and among them 3 preferred codons correspond to the most abundant tRNA gene copy number. Thus the tRNA gene copy number
Table 3
The RSCU values of highly expressed (HEG) and lowly expressed genes (LEG) of Arabidopsis homologous gene set

AA   Codon    RSCU (HEG)  RSCU (LEG)  tRNA copy no. of Arabidopsis
Phe  TTT      1.13(0.75)  1.15(0.97)  0
Phe  TTC⁎^R   0.87(1.25)  0.85(1.03)  16
Tyr  TAT      1.11(0.7)   1.23(0.86)  0
Tyr  TAC⁎^B   0.89(1.3)   0.77(1.14)  76
His  CAT      1.39(1.07)  1.33(1.18)  0
His  CAC      0.61(0.93)  0.67(0.82)  10
Asn  AAT      1.17(0.82)  1.16(0.92)  0
Asn  AAC      0.83(1.18)  0.84(1.08)  16
Asp  GAT      1.43(1.21)  1.42(1.29)  0
Asp  GAC      0.57(0.79)  0.58(0.71)  26
Cys  TGT      1.17(1.23)  1.25(1.15)  0
Cys  TGC      0.83(0.77)  0.75(0.85)  15
Gln  CAA      1.02(0.97)  1.17(1.2)   8
Gln  CAG⁎^B   0.98(1.03)  0.83(0.8)   9
Lys  AAA      0.97(0.79)  1.02(0.97)  13
Lys  AAG⁎^R   1.03(1.21)  0.98(1.03)  18
Glu  GAA      1.1(0.85)   1.14(0.99)  12
Glu  GAG⁎^R   0.9(1.15)   0.86(1.01)  13
Val  GTT      1.72(1.6)   1.64(1.55)  15
Val  GTC      0.65(0.99)  0.75(0.92)  0
Val  GTA      0.64(0.37)  0.68(0.37)  7
Val  GTG      0.99(1.04)  0.93(1.15)  8
Pro  CCT      1.57(1.41)  1.41(1.42)  16
Pro  CCC      0.45(0.53)  0.47(0.5)   0
Pro  CCA      1.44(1.33)  1.59(1.26)  45
Pro  CCG      0.54(0.74)  0.53(0.81)  5
Thr  ACT      1.47(1.29)  1.41(1.17)  10
Thr  ACC^R    0.68(1.15)  0.7(0.9)    0
Thr  ACA      1.4(0.94)   1.38(1.25)  10
Thr  ACG      0.45(0.62)  0.51(0.69)  7
Ala  GCT⁎^R   1.74(1.79)  1.62(1.6)   16
Ala  GCC      0.54(0.82)  0.65(0.78)  0
Ala  GCA      1.32(0.8)   1.29(0.89)  10
Ala  GCG      0.39(0.58)  0.45(0.73)  7
Gly  GGT^R    1.32(1.52)  1.29(1.23)  1
Gly  GGC      0.56(0.59)  0.5(0.65)   23
Gly  GGA      1.49(1.38)  1.49(1.59)  12
Gly  GGG      0.63(0.51)  0.71(0.53)  5
Leu  TTA      0.72(0.63)  0.91(0.76)  6
Leu  TTG      1.31(1.2)   1.35(1.39)  10
Leu  CTT      1.73(1.57)  1.59(1.55)  12
Leu  CTC^R    0.79(1.55)  0.77(1.18)  1
Leu  CTA      0.61(0.48)  0.68(0.66)  10
Leu  CTG^P    0.84(0.56)  0.7(0.45)   3
Ser  TCT      1.8(1.69)   1.79(1.69)  37
Ser  TCC^R    0.65(1.11)  0.65(0.89)  1
Ser  TCA      1.28(0.86)  1.2(1.12)   9
Ser  TCG      0.53(0.62)  0.55(0.73)  4
Ser  AGT      1.03(0.83)  1.09(0.78)  0
Ser  AGC      0.72(0.89)  0.72(0.8)   13
Arg  CGT⁎^B   1.15(1.43)  0.91(1.11)  9
Arg  CGC      0.52(0.51)  0.41(0.35)  0
Arg  CGA      0.72(0.52)  0.82(0.68)  6
Arg  CGG      0.56(0.33)  0.48(0.51)  4
Arg  AGA      1.85(1.79)  2.21(2.26)  9
Arg  AGG^R    1.2(1.42)   1.18(1.09)  8
Ile  ATT      1.34(1.05)  1.31(1.13)  19
Ile  ATC^R    0.86(1.55)  0.84(1.22)  0
Ile  ATA      0.8(0.4)    0.85(0.64)  5

RSCU values within parentheses represent the GC-rich class of Arabidopsis, while the values outside indicate the GC-poor class. Codons marked in bold show significantly (p < 0.05) higher preference in highly expressed genes (HEG) and codons marked with ⁎ hold a perfect correspondence with the most abundant tRNA gene copy number. The underlined codons indicate the presence of a perfectly matching tRNA. The codons marked with superscript R are optimal in the GC-rich class of Arabidopsis, codons marked with superscript P are optimal in the GC-poor class, and codons marked with superscript B are optimal in both the GC-rich and GC-poor classes of Arabidopsis.
Table 4
The RSCU values of highly expressed (HEG) and lowly expressed genes (LEG) in rice homologous genes

AA   Codon  RSCU (HEG)  RSCU (LEG)  tRNA copy no. of Oryza sativa
Phe  TTT    0.72        0.72        0
Phe  TTC    1.28        1.28        15
Tyr  TAT    0.77        0.79        0
Tyr  TAC    1.23        1.21        16
His  CAT    0.89        0.93        0
His  CAC⁎   1.11        1.07        11
Asn  AAT    0.83        0.88        0
Asn  AAC⁎   1.17        1.12        14
Asp  GAT    0.96        0.92        0
Asp  GAC    1.04        1.08        28
Cys  TGT    0.62        0.65        0
Cys  TGC    1.38        1.35        10
Gln  CAA    0.67        0.7         16
Gln  CAG    1.33        1.3         13
Lys  AAA    0.57        0.65        10
Lys  AAG⁎   1.43        1.35        22
Glu  GAA    0.68        0.71        15
Glu  GAG⁎   1.32        1.29        29
Val  GTT⁎   1.04        0.94        21
Val  GTC    1.18        1.18        0
Val  GTA    0.36        0.38        4
Val  GTG    1.41        1.5         10
Pro  CCT⁎   1.01        0.89        16
Pro  CCC    0.88        0.86        0
Pro  CCA    1.05        1.01        11
Pro  CCG    1.06        1.23        10
Thr  ACT⁎   0.96        0.86        9
Thr  ACC    1.31        1.21        0
Thr  ACA    1           1.02        8
Thr  ACG    0.73        0.91        0
Ala  GCT⁎   0.96        0.82        25
Ala  GCC    1.26        1.33        0
Ala  GCA    0.78        0.77        11
Ala  GCG    1           1.08        13
Gly  GGT    0.86        0.74        0
Gly  GGC    1.44        1.54        24
Gly  GGA    0.8         0.78        13
Gly  GGG    0.89        0.93        8
Leu  TTA    0.33        0.38        7
Leu  TTG    0.89        0.9         9
Leu  CTT⁎   1.15        1.01        19
Leu  CTC    1.79        1.82        0
Leu  CTA    0.41        0.46        8
Leu  CTG    1.43        1.43        6
Ser  TCT⁎   1.01        0.95        17
Ser  TCC    1.31        1.34        0
Ser  TCA    1.02        1.01        10
Ser  TCG    0.85        0.87        7
Ser  AGT    0.64        0.67        0
Ser  AGC    1.17        1.16        13
Arg  CGT⁎   0.66        0.52        16
Arg  CGC    1.45        1.49        0
Arg  CGA    0.37        0.43        4
Arg  CGG    1.03        1.08        7
Arg  AGA    0.95        0.95        9
Arg  AGG    1.55        1.53        10
Ile  ATT    1.01        1.02        23
Ile  ATC    1.43        1.32        0
Ile  ATA    0.57        0.65        6

Codons marked in bold show significantly (p < 0.05) higher preference in highly expressed genes (HEG) and codons marked with ⁎ hold a perfect correspondence with the most abundant tRNA gene copy number. The underlined codons indicate the presence of a perfectly matching tRNA.
Table 5
The RSCU values of highly expressed (HEG) and lowly expressed genes (LEG) in Arabidopsis homologous genes

AA   Codon  RSCU (HEG)  RSCU (LEG)  tRNA copy no. of Arabidopsis
Phe  TTT    0.93        1.07        0
Phe  TTC⁎   1.07        0.93        16
Tyr  TAT    0.91        1.08        0
Tyr  TAC⁎   1.09        0.92        76
His  CAT    1.12        1.28        0
His  CAC⁎   0.88        0.72        10
Asn  AAT    0.94        1.06        0
Asn  AAC⁎   1.06        0.94        16
Asp  GAT    1.3         1.37        0
Asp  GAC⁎   0.7         0.63        26
Cys  TGT    1.15        1.22        0
Cys  TGC⁎   0.85        0.78        15
Gln  CAA    1           1.14        8
Gln  CAG⁎   1           0.86        9
Lys  AAA    0.86        1           13
Lys  AAG⁎   1.14        1           18
Glu  GAA    0.95        1.05        12
Glu  GAG⁎   1.05        0.95        13
Val  GTT    1.67        1.65        15
Val  GTC    0.82        0.71        0
Val  GTA    0.46        0.61        7
Val  GTG    1.05        1.02        8
Pro  CCT    1.6         1.55        16
Pro  CCC    0.45        0.43        0
Pro  CCA    1.31        1.31        45
Pro  CCG    0.64        0.71        5
Thr  ACT⁎   1.46        1.36        10
Thr  ACC    0.91        0.76        0
Thr  ACA    1.11        1.27        10
Thr  ACG    0.52        0.61        7
Ala  GCT⁎   1.86        1.71        16
Ala  GCC    0.69        0.59        0
Ala  GCA    0.98        1.12        10
Ala  GCG    0.47        0.58        7
Gly  GGT    1.44        1.31        1
Gly  GGC    0.53        0.51        23
Gly  GGA    1.47        1.47        12
Gly  GGG    0.56        0.7         5
Leu  TTA    0.63        0.86        6
Leu  TTG    1.34        1.39        10
Leu  CTT⁎   1.67        1.54        12
Leu  CTC    1.11        0.97        1
Leu  CTA    0.57        0.66        10
Leu  CTG    0.68        0.59        3
Ser  TCT    1.71        1.67        37
Ser  TCC    0.82        0.69        1
Ser  TCA    1.15        1.23        9
Ser  TCG    0.6         0.64        4
Ser  AGT    0.89        1.01        0
Ser  AGC    0.82        0.76        13
Arg  CGT⁎   1.2         0.91        9
Arg  CGC    0.44        0.38        0
Arg  CGA    0.6         0.7         6
Arg  CGG    0.47        0.54        4
Arg  AGA    1.93        2.25        9
Arg  AGG    1.35        1.21        8
Ile  ATT    1.28        1.25        19
Ile  ATC    1.18        0.99        0
Ile  ATA    0.54        0.76        5

Codons marked in bold show significantly (p < 0.05) higher preference in highly expressed genes (HEG) and codons marked with ⁎ hold a perfect correspondence with the most abundant tRNA gene copy number. The underlined codons indicate the presence of a perfectly matching tRNA.
supporting preferred codons is in favor of translational selection in both Arabidopsis gene sets corresponding to the homologous sets of rice genes.

4. Discussion
Considering the two gene classes of rice, namely the GC-poor and GC-rich classes, there is a perfect correspondence between the majority of preferred codons and tRNA gene copy number in highly expressed genes of the GC-poor class, which confers translational efficiency on this group of genes. In the GC-rich class, however, apart from tRNA gene copy number, constraints on mRNA secondary structure play a crucial role in optimizing synonymous codon usage. It is known that selection for optimal translation efficiency results in the preferential use of a subset of synonymous codons that generally correspond to abundant tRNA isoacceptors (Duret, 2000; Elf et al., 2003; Ikemura, 1982; Moriyama and Powell, 1997). This paradigm, which considers translational selection in highly expressed genes to be favored solely by tRNA gene copy number, does not hold for rice genes of different GC composition. The nature of the selective constraints differs between the two gene classes in rice, indicating that different selective forces shape codon usage in highly expressed genes of the GC-poor and GC-rich groups. On the other hand, the synonymous codon usage pattern in Arabidopsis is completely influenced by tRNA gene copy number. It is known that the divergence between dicots and monocots is the result of a compositional genome transition that has increased the GC level of a large percentage of genes in Gramineae, leading to two classes of genes (GC-rich and GC-poor), but not in dicots. In the homologous gene sets there is a huge variation of GC content in rice, which is absent in Arabidopsis. It should also be noted that the GC3 of Arabidopsis genes (43.26% GC3) is similar to that of rice GC-poor class genes (39.97% GC3) rather than GC-rich class genes (90.75% GC3).
When we compared the GC-poor rice genes with their Arabidopsis homologs, we observed that tRNA gene copy number is in favor of translational selection in both groups of genes. When comparing GC-rich rice genes with their Arabidopsis homologs, however, we observed that the selective forces influencing codon usage have changed in rice with the increment of GC. The main selective force influencing codon usage of GC-rich rice genes is the constraint on mRNA secondary structure to modulate expression level, while in the corresponding Arabidopsis homologs tRNA gene copy number is the major selective force influencing codon usage. Thus the important conclusion drawn from comparing homologous gene sets between rice and Arabidopsis is that the increment of GC is accompanied by a change in the nature of the selective constraint guiding codon usage of highly expressed genes in rice. It is possible that local mutational bias causes the different GC contents in rice and that translational selection is simply more effective in low GC content genes (which may have a GC content similar to that of Arabidopsis genes), while high GC content genes in rice should avoid excessively high GC content to achieve high expression. The evolutionary forces causing the increment of GC, and consequently its effect on codon usage in rice, have always been a
study of interest. Wang and Roossinck (2006), by analyzing patterns of optimal codons in 6 dicots and 5 monocots, observed that the use of optimal codons is well conserved in plants, with most of the optimal codons ending in C or G. They concluded that codon usage in plants is affected by translational selection. The same conclusion was suggested by Liu et al. (2004). These authors used EST and CAI, respectively, to assign expression values to the genes, whereas whole genome MPSS expression data for rice and Arabidopsis offer a more accurate estimation of gene expression (Wright et al., 2004; Wang and Roossinck, 2006). We have also compared the pattern of
Fig. 3. Histogram of GC3 levels of (A) 1647 highly expressed genes of rice and (B) 1798 lowly expressed genes of rice.
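The GC3 levels summarized in Fig. 3, and the GC-poor and GC-rich thresholds used in Section 3.1 (GC3 < 45% and GC3 > 80%), can be illustrated with the sketch below. The function names are hypothetical, and GC3 is computed here simply as the G + C fraction over all third codon positions; the exact variant used in the paper (for example, one restricted to synonymous codons) is not stated, so this is an assumption.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame sequence."""
    thirds = cds.upper()[2::3]
    return sum(base in "GC" for base in thirds) / len(thirds)


def gc_class(cds, poor_below=0.45, rich_above=0.80):
    """Label a gene GC-poor, GC-rich or intermediate from its GC3 level."""
    value = gc3(cds)
    if value < poor_below:
        return "GC-poor"
    if value > rich_above:
        return "GC-rich"
    return "intermediate"
```

Genes labelled intermediate fall outside both classes analysed in Table 1.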
optimal codons in the homologous set of genes (3525 homologous genes) between rice and Arabidopsis, and observed that there are 9 T-ending codons followed by 4 C-ending codons among the 15 optimal codons in rice. Of these 15 optimal codons in rice (Table 4), 11 are preferred codons that match the most abundant tRNA copy number. In Arabidopsis (Table 5), there are 14 C-ending codons followed by 6 T-ending codons among the 25 optimal codons, and 13 preferred codons that correspond to the most abundant tRNA copy number. Thus the observed correspondence between the majority of optimal codons and tRNA copy number in rice and Arabidopsis is consistent with models of translational selection. We found a preference for both C-ending and T-ending optimal codons in the homologous genes of rice and Arabidopsis, which is consistent with the hypothesis that CT richness, i.e. the increased use of pyrimidines in plants, is associated with natural selection that might be related to translational and transcriptional efficiencies (Kawabe and Miyashita, 2003). However, it has recently been argued that the large variation in synonymous codon usage in rice is not related to selection acting on translational efficiency (Wang and Hickey, 2007). Reaching such a conclusion, however, requires exploration of whole genome expression data. In the present study we have used whole genome expression data to infer the presence of translational selection in the rice genome. Similar to Sharp and Li (1987), we also observed a significant reduction in synonymous substitution rate for highly expressed genes compared with lowly expressed genes (p < 0.001) (data not shown). This result indicates the influence of translational selection on synonymous codon usage according to the expression level of genes. Fig. 3 clearly shows that both highly and lowly expressed genes of rice have a wide variation in GC content.
Possibly, the comparatively weaker effect of translational selection is swamped by the strong mutational bias present in the rice genome, which prevented Wang and Hickey (2007) from observing the influence of translational selection on synonymous codon usage in the rice genome.
Acknowledgement
The authors are thankful to the Department of Biotechnology, Government of India for the financial help.
References
Akashi, H., 1995. Inferring weak selection from patterns of polymorphism and divergence at “silent” sites in Drosophila DNA. Genetics 139, 1067–1076. Akashi, H., 2001. Gene expression and molecular evolution. Curr. Opin. Genet. Dev. 11, 660–666. Akashi, H., Eyre-Walker, A., 1998. Translational selection and molecular evolution. Curr. Opin. Genet. Dev. 8, 688–693. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. Banerjee, T., Gupta, S.K., Ghosh, T.C., 2006. Compositional transitions between Oryza sativa and Arabidopsis thaliana genes linked to the functional change of encoded proteins. Plant Sci. 170, 267–273. Basak, S., Ghosh, T.C., 2005. On the origin of genomic adaptation at high temperature for prokaryotic organisms. Biochem. Biophys. Res. Commun. 330, 629–632.
Bernardi, G., 2004. Structural and Evolutionary Genomics, Natural Selection in Genome Evolution. Elsevier, The Netherlands. Bernardi, G., Bernardi, G., 1985. Codon usage and genome composition. J. Mol. Evol. 22, 363–365. Biro, J.C., 2006. Indications that “codon boundaries” are physico-chemically defined and that protein-folding information is contained in the redundant exon bases. Theor. Biol. Med. Model. 3, 28. Bulmer, M., 1991. The selection-mutation-drift theory of synonymous codon usage. Genetics 129, 897–907. Carels, N., Bernardi, G., 2000. Two classes of genes in plants. Genetics 154, 1819–1825. Duret, L., 2000. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet. 16, 287–289. Duret, L., Mouchiroud, D., 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. U. S. A. 96, 4482–4487. Elf, J., Nilsson, D., Tenson, T., Ehrenberg, M., 2003. Selective charging of tRNA isoacceptors explains patterns of codon usage. Science 300, 1718–1722. Eyre-Walker, A., 1990. Recombination and mammalian genome evolution. Proc. R. Soc. Lond., B. 252, 237–243. Filipski, J., 1987. Correlation between molecular clock ticking, codon usage, fidelity of DNA repair, chromosome banding and chromatin compactness in germline cells. FEBS Lett. 217, 184–186. Gouy, M., Gautier, C., 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10, 7055–7074. Grantham, R., Gautier, C., Gouy, M., 1980a. Codon frequencies in 119 individual genes confirm consistent choices of degenerate bases according to genome type. Nucleic Acids Res. 8, 1893–1912. Grantham, R., Gautier, C., Gouy, M., Mercier, R., 1980b. Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 8, 49r–62r. Grantham, R., Gautier, C., Gouy, M., Jacobzone, M., Mercier, R., 1981. 
Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res. 9, 43r–74r. Greenacre, M.J., 1984. Theory and Applications of Correspondence Analysis. Academic Press, London, UK. Guo, X., Bao, J., Fan, L., 2007. Evidence of selectively driven codon usage in rice: Implications for GC content evolution of Gramineae genes. FEBS Lett. 581, 1015–1021. Gupta, S.K., Ghosh, T.C., 2001. Gene expressivity is the main factor in dictating the codon usage variation among the genes in Pseudomonas aeruginosa. Gene 273, 63–70. Ikemura, T., 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J. Mol. Biol. 146, 1–21. Ikemura, T., 1982. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. Differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs. J. Mol. Biol. 158, 573–597. Ikemura, T., 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2, 13–34. Ikemura, T., 1992. In: Hatfield, D.L., Lee, B.J., Pirtle, R.M. (Eds.), Transfer RNA in Protein Synthesis. CRC, Boca Raton, FL, pp. 87–111. Jia, M., Li, Y., 2005. The relationship among gene expression, folding free energy and codon usage bias in Escherichia coli. FEBS Lett. 579, 5333–5337. Kahali, B., Basak, S., Ghosh, T.C., 2007. Reinvestigating the codon and amino acid usage of S. cerevisiae genome: a new insight from protein secondary structure analysis. Biochem. Biophys. Res. Commun. 354, 693–699. Kanaya, Y., Yamada, Y., Kinouchi, M., Kudo, Y., Ikemura, T., 2001. Codon usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J. Mol. Evol. 53, 290–298. Katz, L., Burge, C.B., 2003. 
Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res. 13, 2042–2051. Kawabe, A., Miyashita, N.T., 2003. Patterns of codon usage bias in three dicot and four monocot plant species. Genes Genet. Syst. 78, 343–352.
Kotlar, D., Lavner, Y., 2006. The action of selection on codon bias in the human genome is related to frequency, complexity, and chronology of amino acids. BMC Genomics 7, 67. Liu, Q., Feng, Y., Zhao, X., Dong, H., Xue, Q., 2004. Synonymous codon usage bias in Oryza sativa. Plant Sci. 167, 101–105. Meyers, B.C., Tej, S.S., Vu, T.H., Haudenschild, C.D., Agrawal, V., Edberg, S.B., Ghazal, H., Decola, S., 2004. The use of MPSS for whole-genome transcriptional analysis in Arabidopsis. Genome Res. 14, 1641–1653. Montero, L.M., Salinas, J., Matassi, G., Bernardi, G., 1990. Gene distribution and isochore organization in the nuclear genome of plant. Nucleic Acids Res. 18, 1859–1867. Moriyama, E.N., Powell, J.R., 1997. Synonymous substitution rates in Drosophila: mitochondrial versus nuclear genes. J. Mol. Evol. 45, 378–391. Nakano, M., Nobuta, K., Vemaraju, K., Tej, S.S., Skogen, J.W., Meyers, B.C., 2006. Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res. 34, D731–D735. Powell, J.R., Moriyama, E.N., 1997. Evolution of codon usage bias in Drosophila. Proc. Natl. Acad. Sci. U. S. A. 94, 7784–7790. Ren, X.-Y., Vorst, O., Fiers, M.W.E.J., Stiekema, W.J., Nap, J.-P., 2006. In plants, highly expressed genes are the least compact. Trends Genet. 22, 528–532. Sharp, P.M., Li, W.H., 1987. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol. Biol. Evol. 4, 222–230. Sharp, P.M., Devine, K.M., 1989. Codon usage and gene expression level in Dictyostelium discoideum: highly expressed genes do ‘prefer’ optimal codons. Nucleic Acids Res. 17, 5029–5039. Sharp, P.M., Matassi, G., 1994. Codon usage and genome evolution. Curr. Opin. Genet. Dev. 4, 851–860. Sharp, P.M., Stenico, M., Peden, J.F., 1993. Codon usage: mutational bias, translational selection, or both? Biochem. Soc. Trans. 21, 835–841.
Sharp, P.M., Averof, M., Lloyd, A.T., Matassi, G., Peden, J.F., 1995. DNA sequence evolution: the sounds of silence. Philos. Trans. R. Soc. Lond., B Biol. Sci. 349, 241–247. Shi, X., Wang, X., Li, Z., Zhu, Q., Tang, W., Ge, S., Luo, J., 2006. Nucleotide substitution pattern in rice paralogues: implication for negative correlation between the synonymous substitution rate and codon usage bias. Gene 376, 199–206. Subramanium, S., Kumar, S., 2004. Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome. Genetics 168, 373–381. Urrutia, A.O., Hurst, L.D., 2001. Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics 159, 1191–1199. Wang, H.C., Hickey, D.A., 2007. Rapid divergence of codon usage patterns within the rice genome. BMC Evol. Biol. 7. Wang, L., Roossinck, M.J., 2006. Comparative analysis of expressed sequences reveals a conserved pattern of optimal codon usage in plants. Plant Mol. Biol. 61, 699–710. Wang, H.C., Singer, G.A., Hickey, D.A., 2004. Mutational bias affects protein evolution in flowering plants. Mol. Biol. Evol. 21, 90–96. Wolfe, K.H., Gouy, M., Yang, Y., Sharp, P.M., Li, W.H., 1989. Date of the monocot–dicot divergence estimated from chloroplast DNA sequence data. Proc. Natl. Acad. Sci. U.S.A. 86, 6201–6205. Wong, G.K., Wang, J., Tao, L., Tan, J., Zhang, J., Passey, D.A., Yu, J., 2002. Compositional gradients in Gramineae genes. Genome Res. 12, 851–856. Wright, S.I., Yau, C.B., Looseley, M., Meyers, B.C., 2004. Effects of gene expression on molecular evolution in Arabidopsis thaliana and Arabidopsis lyrata. Mol. Biol. Evol. 21, 1719–1726. Xiyin, W., Xiaoli, S., Bailin, H., 2002. The transfer RNA genes in Oryza sativa L. ssp. Indica. Sci. China, Ser. C: Life Sci. 45, 504–511.