Gene 463 (2010) 41–48
Contents lists available at ScienceDirect
Gene j o u r n a l h o m e p a g e : w w w. e l s e v i e r. c o m / l o c a t e / g e n e
On nucleotide solvent accessibility in RNA structure Yumlembam H. Singh a, Munazah Andrabi b, Bratati Kahali c, Tapash Chandra Ghosh c, Kenji Mizuguchi b, Alex V. Kochetov d,e,⁎, Shandar Ahmad b,⁎ a
Bio-informatics Centre, North Eastern Hill University, Shillong-22, Meghalaya, India National Institute of Biomedical Innovation, 7-6-8, Saito-asagi, Ibaraki, Osaka 5670085, Japan Bioinformatics Centre, Bose Institute, P 1/12, C.I.T. Scheme, VII M, Kolkata 700 054, India d Institute of Cytology and Genetics, 10, Lavrentieva ave., Novosibirsk, Russia e Novosibirsk State University, 2, Pirogova str., Novosibirsk, Russia b c
a r t i c l e
i n f o
Article history: Received 16 March 2010 Received in revised form 3 May 2010 Accepted 4 May 2010 Available online 12 May 2010 Received by Leonardo Marino-Ramirez Keywords: Solvent accessibility RNA structure Prediction Surface propensity Codon usage
a b s t r a c t Sequence dependence of solvent accessibility in globular and membrane proteins is well established. However, this important structural property has been poorly investigated in nucleic acids. On the other hand investigation of structural determinants of transcriptional and post-transcriptional processes in gene expression are also in a primitive stage and there is a need to explore novel sequence and structural features of both DNA and RNA, which may explain both basic and regulatory mechanisms at various stages of expression. We have recently shown that the nucleotide accessibility in double-stranded DNA molecules strongly depends on sequence context and can be predicted using neighbor information. In this work, we investigate statistics, neighbor-dependence and predictability of nucleotide solvent accessibility for various types of RNA molecules (single-stranded, double-stranded, protein-unbound and protein-bound). It was found that average solvent accessibility of different RNA trinucleotides varies considerably. Interestingly, important translational signals (initiatory AUG codon, Shine-Dalgharno site) were characterized by high solvent accessibility that could be important for its selection in evolution. We also analyzed a relationship between nucleotide accessibility and synonymous codon usage bias in some genomes and find that the two properties are directly related. We believe that the analysis and prediction of nucleotide solvent accessibility opens new avenues to explore more biologically meaningful relationship between RNA structure and function. © 2010 Elsevier B.V. All rights reserved.
1. Introduction RNA molecules perform many vitally essential cellular functions (e.g. translation, splicing, expression control, etc.) taking place through interaction with other biological macromolecules. RNAlocated functional sites form steric conformations specifically adapted for interaction with a corresponding protein or with other molecules. This conformation is provided by certain combinations of nucleotides (either in the linear form or as a part of the RNA secondary structure— e.g., Iron Responsive Element (Volz, 2008) (for review, see (Abaza and Gebauer, 2008; Sonenberg and Hinnebasch, 2009; Van Der Kelen et al., 2009)). In most cases the nucleotides in positions of the proteinbinding site differ in their functional importance—mutations in some positions abolish binding whereas mutations in other positions are Abbreviation: ASA, Accessible surface area; PDB, Protein Data Bank; SS, Single stranded; DS, Double stranded; MAE, Mean Absolute Error. ⁎ Corresponding authors. A.V. Kochetov is to be contacted at Institute of Cytology and Genetics, 10, Lavrentieva ave., Novosibirsk, Russia. S. Ahmad, National Institute of Biomedical Innovation, Osaka, Japan. E-mail addresses:
[email protected] (A.V. Kochetov),
[email protected] (S. Ahmad). 0378-1119/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2010.05.001
either neutral or influence the functional activity quantitatively (Chatterjee and Pal, 2009). This results in a well-known phenomenon of functional sites variability. In the majority of cases RNA (or DNA) segments interacting with a certain protein are not identical and their organization can be described by some consensus sequence or weight matrix. In many cases, computational analyses of nucleotide sequences alone are not sufficient to reveal the actual structure of the functional site because currently available information on RNA sequence–structure relationships is rather limited. Physicochemical features (including the solvent accessibility, as proposed in this work) can influence the efficiency of RNA interaction with proteins or other molecules. Thus, their detailed investigation is needed to reveal the biologically relevant structure-function relationships. One of the prospective ways to reveal these relations is an analysis of 3dimensional structures of RNA and RNA–protein complexes. This approach has been successfully used for analysis of conformational features of interacting surfaces in proteins (e.g., (Gong et al., 2009)). Solvent accessibility is an important feature of biological polymers. First of all, it almost exclusively determines the hydrophobic energy of a molecule. Secondly, the interactions between two folded molecules take place via their exposed surfaces and therefore this parameter has
42
Y.H. Singh et al. / Gene 463 (2010) 41–48
been used in predicting binding sites in proteins (Ahmad et al., 2004). While solvent accessibility of proteins as well as its relationship with sequence and role in function has been extensively studied (e.g., (Ahmad et al., 2003; Andrabi et al., 2009)), solvent accessibility of nucleic acids has not received much attention, partly because much fewer unique structures of nucleic acids have been solved. An analysis of solvent accessibility of nucleic acids in double-stranded DNA was recently reported, where its sequence-dependence was demonstrated (Ahmad, 2009). There are many examples where prediction of RNA-located sites or their functional activities is not yet resolved. This concerns both the mRNA general translational properties and specific posttranscriptional control of gene expression patterns (Kochetov, 2008; Chatterjee and Pal, 2009; Liu et al., 2009). RNA is a very flexible molecule and prediction of its structure is a complex task (Shajani and Varani, 2007; Shapiro et al., 2007). Population of RNA molecules in a solution can exist as a set of dynamically balanced conformations and interactions with proteins (e.g., ubiquitous translation factors, poly(A)-binding protein, specific regulatory proteins, etc.) can further (dynamically) change their structural organization. This also hampers the prediction of RNA-located functional sites. Further accumulation of data on biologically meaningful relationship between RNA structure and function is needed for the progress in this field (Laederach, 2007; Montange and Batey, 2008). In our opinion, prediction of solvent accessibility of RNA segments could make a valuable contribution to the investigation of this problem. Difference in the solvent accessibility of nucleotides is an intrinsic feature, characteristic of nucleotide combinations. Higher solvent accessibility could correlate with a higher probability of interaction with other molecules. Further investigations will reveal the significance of this feature for activity of various RNA-located functional sites. In this work, we present a statistical analysis of nucleotide solvent accessibilities at a single nucleotide level as well as in various sequence-neighbor environments. We characterize unique trinucleotides by the accessibility distribution patterns, revealed from threedimensional structures and explore their biological relevance. We further investigate the predictability of nucleotide accessibility from successively larger windows of N-mer segments in a manner it has been traditionally done for proteins. Results indicate that nucleotide solvent accessibility is an important local structure descriptor and is likely to provide important insights into RNA structure and function. 2. Materials and methods 2.1. Data sets A total of 379 coordinate files of RNA, solved by X-ray crystallography at resolution better than 2.5 Å, were downloaded from Protein Data Bank as of August 18, 2009 (Berman et al., 2000). These data contained 882 RNA chains. RNA chains, with shorter than 20 nucleotides were removed to have sufficient data from each structure, leaving 331 RNA chains. Redundancy was removed by clustering resulting structures at 25% sequence identity using BLASTCLUST program, using 15 as word size. This resulted in a total of 87 RNA chains. Each of these 87 chains was manually annotated using Nucleic Acid Database (NDB) (Berman et al., 1992), Structural Classification of RNA (SCOR) (Klosterman et al., 2002), PDB coordinate files (Berman et al., 2000) and other sources. Some of these structures have been solved in complex with a protein, while others are in the unbound (free) state, which may bias analysis and hence separate lists corresponding to each category were also delimited. This resulted in overlapping lists of unbound, free, single-stranded and doublestranded RNA sequences as listed in supplementary file data-list.xls. Some of the entries could not be annotated into single- and doublestranded categories due to lack of information in the source databases and were therefore included only in the overall data lists.
2.2. Solvent accessibility and accessible surface area (ASA) calculation All accessible surface area (ASA) calculations were performed using publicly available program NACCESS (Hubbard and Thornton, 1993), with default options of van der Waals and probe radii. Although, solvent accessibility data corresponding to all-atom, sidechain, main-chain, polar and non-polar atomic groups were calculated, only all-atom data is used in this work, as no additional significant conclusions could be derived from higher resolution analysis. It may be noted that solvent accessibility is generally a dimensionless value of surface area relative to some reference values such as a hypothetical extended state in Ala–X–Ala conformation (Singh and Ahmad, 2009). However, an extended state in the case of nucleotides is not well recorded and in this work, absolute values of ASA have been used to conduct all the statistical analysis.
2.3. Prediction of accessible surface area (ASA) All predictions were made using neural network, which was trained as follows. First each nucleotide was represented by a four dimensional binary vector whose each dimension denotes its identity (A, C, G and U respectively). For each nucleotide, a single binary value corresponding to the identity of that nucleotide is set to one, whereas all other units are set to zero. To include sequence neighbor environment, a sliding window of 3, 5, 7, and 9 nucleotides corresponding to one, two, three and four neighboring nucleotides was encoded in the same way, forming 12-, 20-, 28- and 36dimensional binary vectors (called input vectors) representing a single nucleotide and its sequence neighbors. Target values of prediction are the all-atom ASA values, scaled to [0,1] range by dividing all values by 500 (to be rescaled to real values after the model training is completed). Prediction model is a neural network with single hidden layer containing two units connected to the input layer by arctan activation function and to the target values by a sigmoidal function.
2.4. Cross validation All data are trained and models are validated using leave-one-out cross validation i.e. one RNA structure is left out of training, while the remaining data are allowed to over learn, without giving any information about the left out RNA. After the training is completed, performance is evaluated on the left out data. Each RNA sequence in a data set was left out once in this way and performance was averaged after running through the entire list.
2.5. Performance evaluation Prediction performance in each training/ validation cycle is measured by comparing predicted values of the left out data with their actual (target) values of nucleotide ASA. Two scores were used to evaluate this performance, similar to our previous works (Ahmad, 2009) viz. Mean absolute error and coefficient of correlation, given as
MAE =
1 N 1 n ∑i = 1 ∑j i= 1 jPij −Oij j N ni
Where N is the number of RNA chains in the data set, and ni is the number of nucleotides in ith RNA sequence and Pij and Oij are the predicted and observed solvent accessibilities in jth nucleotide of ith RNA chain.
Y.H. Singh et al. / Gene 463 (2010) 41–48
Coefficient of correlation for each chain was calculated as: ni ∑Pij Oij −∑Pij ∑Oij Ci = qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiqffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 ni ∑Pij −ð∑Pij Þ2 ni ∑O2ij −ð∑Oij Þ2 For evaluating overall performance, correlation coefficients calculated above are averaged over all chains in the data set. This average correlation is what is referred to as simply correlation in this work, whenever overall performance of a model is considered. 2.6. Trinucleotide ASA For the purpose of investigating a relationship between RNA function and accessibility, trinucleotide ASA histograms were generated for each of the 64 combinations, by treating RNA sequences as a set of overlapping triplets, whose ASA is available from the structural data. Whole trinucleotide ASA is defined by either (i) taking the average of the three constituent nucleotides in the given triplets or (ii) considering the ASA of the central nucleotide in each case. 2.7. Relationship with codon usage Frequency of codon usage has been taken from Kazusa codon data base (http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606). For each amino acid, average ASA values of a trinucleotide corresponding to each codon are taken from data set of all RNA chains. A numerical rank is assigned to each codon corresponding to the amino acid in the order of its ASA. Thus, all codons are assigned a numerical rank which signifies their ASA in the context of competing codons for the same amino acid. To find a relationship between codon usage bias and ASA, coefficient of correlation between this rank and frequency of usage is used. 3. Results and discussion 3.1. Nucleotide ASA distribution Overall ASA distribution of four nucleotides (A, C, G, and U) is shown in Fig. 1. More detailed statistics within each RNA structure type and the statistical significance of difference (using χ2 test) is shown in Table 1. Following observations are made from this simple statistics: 1. ASA distributions are close to normal. All four nucleotides have close to normal (Guassian-like) ASA distribution, with their mean, median and mode values coming
43
close to 180 Å2 and number of nucleotides on both sides of the mean value falling almost symmetrically. This is in sharp contrast to ASA distributions in proteins (comparison with DNA to be discussed later in this article). In case of proteins, most residues are buried in the interior and due to the nature of protein-folding, ASA distribution has a sharp peak close to zero. In the case of RNA molecules, there are no hydrophobic residues or hydrophobic core with buried residues and hence much fewer nucleotides are found in lower ASA range. In the biological sense, this implies that almost all RNA nucleotides are at least partially exposed as their primary function is achieved by exposed bases, allowing them to interact with other molecules. This is also consistent with the well-known facts that RNA structure is quite flexible and can adopt several conformations (Shajani and Varani, 2007). 2. Purines are more exposed than pyrimidines, as expected. Fig. 1 shows that overall mean ASA of G and A (Table 1: average ASA 196.1 Å2 and 189.2 Å2, respectively) is higher than that of C and U (Table 1: average ASA 181.3 Å2 and 185.3 Å2, respectively). Obviously, this is likely caused by larger nucleic acid side chains of purines compared to pyrimindines. However, this is in contrast to what we observed in double-stranded DNA in which T was found to be more exposed than all other bases despite being a pyrimidine. The difference between DNA and RNA in this case may be attributed to the fact that nucleic acid base is often buried in DNA due to double helical conformation and therefore the size of the purine or pyrimidine ring makes little difference on exposed surface area, which is not so in RNA. A small difference between the four nucleotide ASA averages suggests that the information of single bases cannot be used as such to detect solvent accessibility based signals, unless their sequence or structure context is revealed. 3. Nucleotides in double-stranded RNA are more exposed than singlestranded RNA. Table 1b gives the p-values of comparisons between overall ASA in various RNA structure types. The most significantly different pair of structures is double and single-stranded RNA (p-value = 8.1e-26). Similar differences are observed between unbound double-stranded RNA with unbound single-stranded RNA, as well as complex double-stranded RNA and complex single-stranded RNA. To look at it more closely, a distribution of nucleotide ASA in double-stranded and single-stranded RNA chains was plotted in Fig. 2. Distribution for double-stranded DNA from our earlier work was also shown for comparison (Ahmad, 2009). It is quite clear that double-stranded RNA is made of more exposed nucleotides than single-stranded counterparts. This may look counter-intuitive at the first glance as double-stranded RNA would be imagined to have more basepairing and hence less exposed. However, in reality single RNA chains typically fold onto themselves and form complex internal secondary structures (Lescoute and Westhof, 2006). This type of folding leads to much tighter contacts and packing at the folding interface than pairing of bases from complementary strands in doublestranded RNA because it does not leave terminal bases exposed as a two-strand system does and this makes doublestranded RNA more exposed than its single-stranded counterpart. A comparison with distribution of ASA in double-stranded DNA shows that the latter is less exposed than any form of RNA. This is apparently because base-pairing in DNA is almost always perfect and leaves very few nucleotides unpaired (and exposed) as is observed in RNA molecules. Presumably, even the internal secondary structure of single-stranded RNA does not surpass the overall side-chain burial of DNA. 3.2. Trinucleotide solvent accessibility
Fig. 1. Nucleotide accessible surface area (ASA) distribution of each base in all RNA structures.
We showed above that the four nucleotides are very similar as far as their ASA distribution is concerned. We try to see if the nucleotide
44
Y.H. Singh et al. / Gene 463 (2010) 41–48
Table 1 Basic statistics of RNA nucleotide ASA in various structural environments: (a) mean and standard deviation of each nucleic acid type and the number of data in each category and (b) Statistical significance (p-values of a χ2 test) of differences in main structural regions. (a)
All Double stranded (DS Single stranded (SS) Bound (complex) Unbound (free) SS-bound SS-free DS-bound DS-free
All-bases (Mean)
Adenine (A) Mean (stdev)
Cytosine (C) Mean (stdev)
Guanine (G) Mean (stdev)
Uracyl (U) Mean (stdev)
N(A)
N(C)
N(G)
N(U)
188.0 223.1 183.0 191.2 182.8 188.8 172.3 215.7 229.1
189.2 219.8 184.9 193.2 183.3 191.2 174.7 211.0 228.3
181.3 219.5 175.1 183.7 177.4 181.0 163.3 213.3 223.1
196.1 245.2 189.5 198.3 192.6 195.1 179.7 235.0 252.5
185.3 208.0 182.4 189.5 177.9 188.1 171.4 203.5 212.4
892 112 780 536 356 481 299 55 57
1023 143 880 640 383 587 293 53 90
1225 144 1081 752 473 692 389 60 84
706 81 625 450 256 410 215 40 41
All
DS
SS
Complex
Unbound
SS-bound
SS-free
DS-bound
DS-free
1.00E+00 1.04E−02 7.14E−01 8.17E−01 7.05E−01 9.50E−01 2.52E−01 4.33E−02 2.74E−03
1.87E−02 1.00E+00 8.09E−26 3.24E−02 6.96E−03 2.18E−02 6.65E−04 6.19E−01 6.90E−01
7.11E−01 3.00E−03 1.00E+00 5.45E−01 1.00E+00 6.64E−01 1.00E+00 1.56E−02 6.55E−04
8.18E−01 2.08E−02 5.54E−01 1.00E+00 5.54E−01 8.67E−01 1.00E+00 7.60E−02 6.11E−03
7.01E−01 2.87E−03 9.90E−01 5.36E−01 1.00E+00 6.55E−01 4.36E−01 1.50E−02 6.22E−04
9.50E−01 1.26E−02 6.69E−01 8.66E−01 6.60E−01 1.00E+00 2.28E−01 5.08E−02 3.43E−03
2.31E−01 1.07E−04 4.15E−01 1.50E−01 4.22E−01 2.07E−01 1.00E+00 9.40E−04 1.51E−05
5.93E−02 6.13E−01 2.59E−02 9.49E−02 2.51E−02 6.76E−02 3.11E−03 1.00E+00 3.62E−01
6.65E−03 6.94E−01 2.32E−03 1.23E−02 2.24E−03 7.88E−03 1.75E−04 3.77E−01 1.00E+00
(66.1) (56.9) (66.2) (68.5) (62.0) (69.6) (59.0) (54.9) (57.9)
(54.8) (52.0) (52.7) (55.0) (54.4) (54.5) (46.8) (51.1) (52.5)
(61.2) (53.6) (59.2) (61.1) (61.4) (60.5) (55.5) (55.8) (51.0)
(61.6) (52.4) (62.1) (64.4) (55.5) (65.2) (54.0) (54.2) (50.9)
(b)
All Double stranded (DS Single stranded (SS) Bound (complex) Unbound (free) SS-bound SS-free DS-bound DS-free
neighbors impart any specificity to this distribution. If so, we would be able to predict exposed nucleotides by looking at the identity of neighbors. First neighbor effect is rather easy to analyze as we simply compile the statistics of 64 trinucleotide combinations and plot their ASA averages. Fig. 3 shows a plot of average trinucleotide ASA values. Two independent plots are drawn, first by taking the ASA of a “central” nucleotide in all possible 3' and 5' neighbor environments (Fig. 3a, plotted for the case of single-stranded, as well as other RNAtypes), and then by using ASA of all three nucleotides for each of the 64 possibilities (Fig. 3b, plotted with error bars, only for singlestranded RNA for better clarity) (For details, see Materials and methods). Our aim is to see whether trinucleotides are better separated in their overall mean values than the single nucleotides. Indeed, the graph shows that in all three categories of RNA structure, average ASA of 64 trinucleotides has a wider range then the average ASA of 4 single nucleotides. Specifically, trinucleotide ASA range (difference between the highest and lowest of 64 average values) is about 60 Å2 compared to 15 Å2 single nucleotide ASA range (difference between highest and lowest average values of 4 nucleo-
Fig. 2. Nucleotide ASA distribution in double and single-stranded RNA molecules.
tides). On the other hand average standard deviation in trinucleotide accessibilities is ∼53 Å2 compared to ∼61 Å2 in the case of single nucleotides (data for trinucleotide standard deviation is plotted as error bar in Fig. 3b). This shows that the immediate nucleotide neighbors can significantly constraint the solvent accessibility of a nucleotide and thereby may be used to predict its value with some confidence. To evaluate the exact degree of predictability, we will use a more complex and rigorous model (next section). Here, we discuss some of the most significant results, derived from data in Fig. 3. Before comparing the actual data, it may be noted that there are several important cellular processes where RNA features can influence their functional properties. Translation is based on the interaction of mRNAs with ribosomes, aminoacylated tRNAs and a number of general translation factors (Volkova and Kochetov, 2010). The first steps of interaction of eukaryotic mRNA and translation initiation factors are based on specific structures (combination of Shine–Dalgarno site and start codon in prokaryotic mRNAs, 5'-cap or internal ribosome entry site in eukaryotic mRNAs). Complementary interaction of anticodon of initiatory tRNA with a start codon within mRNA is a key step of translational initiation. It may be assumed that the bases interacting with other molecules should be more solvent-accessible than other bases in the RNA structure. In this context, the most striking observation here is that AUG-trinucleotide ASA ranks on 5th top position out of all 64 possibilities in single-stranded ASA plots (Fig. 3b), whereas ASA of trinucleotide corresponding to an alternate start codon in Escherichia coli, GUG, ranks 6th in the same data set. ASA of the central nucleotide in these environments (Fig. 3a) also ranks highly (9th and 6th out of 64, respectively). A closer look at the detailed statistics of AUG trinucleotides in comparison to other RNA segments is provided in Fig. 4, which clearly shows that AUG trinucleotides have significantly more exposed cases than expected (a statistical test of significance, made by Student's t-test yields a p-value = 0.01). Due to the scarcity of data (only 48 cases of AUG are seen in the entire data), an analysis of all ASA values in each RNA molecules is not possible; however, the overall statistics is quite striking. The Shine–Dalgarno site (AGGAGG-similar sequence) also functions through complementary interaction with a corresponding motif in 16S rRNA. It is well known that the unfolded state of this site is important for its activity. One may see that some bases in Shine-
Y.H. Singh et al. / Gene 463 (2010) 41–48
45
Fig. 3. Trinucleotide ASA distribution in RNA structure (a) ASA of central nucleotide “X” in various neighbor environments “Y–X–Z”. (b) Average per nucleotide ASA in singlestranded RNA (error bars represent standard deviation).
Dalgharno (S-D) site are characterized by high ASA; especially it concerns GAG (rank 12 out of 64). Finally, trinucleotide UGA is characterized by the highest ASA value (Fig. 3b). This triplet serves as a stop codon and this is the most frequently used stop codon in many organisms. Interestingly, regulated termination (readthrough and synthesis of C-end extended protein isoforms) is also mostly associated with UGA stop codon. Reasons for the selection of exactly these functional nucleotide combinations in evolution are not clear; however, appropriate solvent accessibility could be one of the contributing features. In our opinion, evaluation of the solvent accessibility provides a promising approach for further analysis of RNA-located signals of various types (splicing sites, poly(A)-signals, etc.). In particular, ASA profiles could be helpful to reveal the RNA bases important for interactions with other macromolecules.
3.3. Sequence-based prediction using neural network A true estimate of neighbor effect can be made by taking variously sized windows and looking at their ASA statistics. However, as the window size goes beyond 3 nucleotides, the data within each class becomes too insufficient to determine reliable statistics. Thus, a prediction model in which the number of unique patterns increases linearly rather than exponentially can be created for larger window sizes as described in Materials and methods. We developed neural network models with various sized windows for the entire data set of RNA sequences (only the overall and single-stranded categories were considered for separate prediction models due to a small data set in other RNA types). Results from such predictions at various choices of window size are shown in Table 2. These results show that the ASA of nucleotides in all types of RNA may be predicted with a mean absolute
Fig. 4. Distribution of per nucleotide ASA of AUG trinucleotides in single-stranded RNA structures: We note that AUG trinucleotides are significantly more exposed than others. Applying a Student's t-test resulted in a P-value = 0.01. Also, AUG is the fifth most exposed trinucleotide in all 64 possible combinations (see Fig. 4(b)), suggesting a role for solvent accessibility in a recognition of translation start sites (TSS).
46
Y.H. Singh et al. / Gene 463 (2010) 41–48
Table 2 Variation of nucleotide ASA prediction performance with window size.
All SS SS-Bound SS-Free
Table 3 Chain-wise prediction performance in predicting nucleotide ASA in all RNA structures.
Mean Absolute Error (MAE) Å2 At various window size inputs
Correlation coefficient (CC) At various window size inputs
PDBID
MAE (Å2)
CC
PDBID
MAE (Å2)
CC
PDBID
MAE (Å2)
CC
3 base
5 base
7 base
3 base
5 base
7 base
47.0 43.0 45.4 38.4
46.3 42.3 44.8 37.4
46.4 42.3 44.9 37.4
0.541 0.494 0.501 0.482
0.563 0.520 0.527 0.505
0.560 0.514 0.521 0.497
1a9nQ 1c0aB 1c9sW 1d4rA 1dulB 1ec6C 1ehzA 1et4A 1f7uB 1feuC 1ffk9 1ffyT 1g59B 1gtfW 1gtnW 1j1uB 1j7tA 1jbrD 1jbsC 1k8wB 1l2xA 1lngB 1m5kA 1m5kB 1mjiC 1msyA 1nlcA 1ooaC 1q93A
42.0 39.7 87.3 54.7 21.7 38.0 45.4 45.8 49.8 53.8 31.5 34.8 35.2 121.8 135.9 35.6 66.1 31.8 49.6 33.2 47.0 27.7 76.9 33.2 33.7 26.3 54.5 26.0 28.5
0.647 0.349 0.825 0.800 0.578 0.723 0.343 0.458 0.349 0.844 0.314 0.352 0.356 0.746 0.662 0.341 0.763 0.579 0.586 0.642 0.476 0.382 0.679 0.396 0.526 0.648 0.740 0.657 0.624
1q96A 1q9aA 1qtqB 1r9fB 1rpuC 1sj3R 1u0bA 1u8dA 1urnP 1wsuE 1xp7A 1xpeA 1y26X 2a43A 2ab4B 2annB 2bh2C 2bs0R 2dlcY 2draB 2ez6C 2fqnA 2gcsA 2gcsB 2gozA 2gozB 2hojA 2hvyE 2hw8B
28.9 30.2 36.2 53.5 68.6 37.4 31.0 30.6 52.1 20.5 34.6 33.9 26.2 48.3 36.7 36.8 44.6 44.3 50.3 23.8 16.1 64.3 57.5 39.3 34.2 103.2 42.9 36.6 30.8
0.623 0.626 0.378 0.799 0.749 0.358 0.412 0.419 0.552 0.721 0.682 0.670 0.476 0.445 0.626 0.552 0.555 0.712 0.260 0.626 0.722 0.777 0.782 0.277 0.476 0.789 0.331 0.414 0.565
2i82E 2jeaC 2nz4P 2pjpB 2pn4A 2pn4B 2qusA 2quxC 2r8sR 2vplB 2zkoC 2zy6A 397dA 3b31A 3b5aB 3b5sB 3bbiB 3bnnA 3bnoA 3bnqA 3boyD 3d2gA 3dd2B 3dilA 3e5cA 3egzB 3fozC 3fu4A 3hjwD Mean
34.3 126.9 34.5 29.1 54.2 85.1 27.5 36.6 37.9 33.9 57.6 34.9 51.6 26.7 53.6 53.8 52.4 54.8 64.7 59.5 104.2 50.2 71.8 30.3 27.4 32.8 35.4 32.3 34.1 46.3
0.637 0.601 0.252 0.626 0.803 0.696 0.468 0.548 0.248 0.440 0.805 0.480 0.736 0.596 0.623 0.622 0.620 0.711 0.671 0.686 0.719 0.333 0.708 0.261 0.484 0.404 0.368 0.619 0.477 0.563
error (MAE) ranging from 37.4 to 46.3 Å2 depending on the type of RNA structure. For the overall data, the prediction MAE is higher because the model is trained on a larger variety of samples. Comparing this MAE with the standard deviations in trinucleotide ASA, we find that the neural network performance is about 7 Å2 lower than the standard deviation in the data (note that standard deviation is an overall data statistic and contains no cross-validation information, which is required to accurately estimate performance for a new RNA chain). We also observed from Table 2 that MAE improvement with window size is rather limited. However, correlation coefficients do improve from 0.541 for 0.563 when window size is increased from 3 to 5. This improvement is seen in all types of RNA. Similarly, all RNA data show a decrease in performance, when window size is increased to 7. This can be explained by the fact that as the window size increases, there is an overfitting of training data and hence cross-validation performance falls. Hopefully, in the future when many more RNA structures become available, performance can be further improved without overfitting. In this work, we use 5-nucleotide window results for further analysis. Higher correlation coefficients and relatively modest MAE values suggest that within the RNA, it is easier to rank exposed and buried nucleotides with better confidence than estimating exact ASA values. This is so, because the exact ASA values depend on a number of properties of RNA chain, not available for the prediction model. In general, prediction correlation being 0.541 shows that the sequence-based models of ASA prediction performance in RNA is comparable to what we observed in proteins (although performance in proteins can reach a correlation as high as 0.74, it comes from many complex sequence and evolutionary features, not used here; for prediction with only single sequence feature, performance between RNA and protein structures is comparable). Further evaluation of performance is made by looking at the distribution of correlations in all RNA sequences. Fig. 5 shows that ASA in 85% RNA chains could be predicted with a correlation better than 0.4, whereas about 5% RNA chains have unusual accessibility distribution and cannot be predicted from the data from other RNA chains. Detailed performance for each RNA sequence is shown in Table 3. Poorest correlations between predicted and observed ASA are
seen in RNA chains identified by 2r8s (chain R), 3dil (chain A), 2dlc (chain Y) and 2gcs (chain A). A closer look shows that all these structures are unusually long and have complex folding patterns. Thus, long range contacts and unusual folding patterns may alter nucleotide ASA significantly making its prediction harder to perform. On the other hand RNA chains from PDB IDs 1d4r (chain A), 2pn4 (chain A) and 2zko (chain C) are the best predicted. A closer inspection reveals that all three of these RNA chains form almost a double helix in structure and exist in an extended helical state, with almost no long-range intra-chain contacts. Thus, we conclude that nucleotide ASA of extended state regions, belonging to shorter RNA chains and with more regular base-pairing can be better predicted than others and these factors need to be considered for further improving prediction methods in future.
3.4. Codon usage bias
Fig. 5. Distribution of prediction performance for all RNA structures.
There are a number of biological applications involving RNA, where the above analysis and prediction of nucleotide ASA may be useful. As a test case, we tried to see if the synonymous codon usage in some of the most widely studied species has any correspondence with their solvent accessibility. We assigned to all 64 codons, ASA averages of trinucleotides obtained above. Assigning ranks to synonymous codons based on these ASA values, we calculated correlation coefficient between ASA rank and codon usage (see Materials and methods for details). We find that trinucleotide ASA averages derived from RNA structures without consideration of species correlate well with synonymous codon usage in several species (R = 0.185 to 0.522 for single-stranded RNA values). This suggests that there is an overall trend of using accessible codons for genetic information in all species, which is probably further biased and fine-tuned by the speciesspecific environmental signals.
Y.H. Singh et al. / Gene 463 (2010) 41–48
To examine how the species may provide further bias to codon usage in terms of their ASA ranks, we explored a relationship of the correlation coefficients, defined above and the GC content of the respective genomes. Fig. 6 shows these correlations plotted against GC contents for the seven genomes considered (Detailed values are in (Table 4). It is striking to observe that there is a very good correspondence between the GC content of an organism and relationship between ASA and synonymous codon bias; organisms with higher GC contents show much higher correlation than those with lower GC content, implying that GC-rich organisms preferentially utilize more accessible codons compared to others. Correlation between (a) GC content and (b) ASA versus codon usage correlations, is found to be as high as 0.6, establishing a clear relationship between the two. The biological meaning of this observation is not fully clear. Largescale analysis of codon usage in bacteria, archeas and fungi demonstrated that across all studied organisms the identity of favored codons tracks the GC content of the genomes. Thus, GC rich organisms tend to have GC rich favored codons while AT rich organisms tend to have AT rich favored codons (Hershberg and Petrov, 2009). Probably, higher ASA of favored synonymous codons relates to more stable secondary structure of GC-rich mRNAs and also some additional factors (e.g., the organism's typical temperature) (Basak et al., 2007). Further investigations are needed to reveal the biological relevance of correlation between ASA, GC content and synonymous codons preferences. In this manuscript we used this example to demonstrate that solvent accessibility could be an important feature of functional RNA sites and taking this parameter into account can reveal some new interesting relationships (Table 4).
47
Table 4 Non-specific synonymous codon bias: Rank-based correlation (R) between speciesspecific synonymous codon usage and purely structure-based (species-independent) nucleotide ASA in seven most widely studied genomes. Negative sign obtained with ASA ranks (lower rank value for higher ASA) are reversed to indicate that codon usage has a positive correlation with actual ASA values. Species
GC content (%)
R (SS RNA)
R (All RNA)
Homo sapiens (mammal) E. coli (bacteria) S. cerevisiae (yeast) A. thaliana (plant) Mimivirus (virus) C. elegans (nematode) P. aethiopicus (vertebrae)
52.2 51.5 39.8 44.6 28.8 42.9 44.6
0.522 0.309 0.220 0.199 0.212 0.262 0.185
0.475 0.341 0.223 0.228 0.260 0.353 0.206
application of this technique will be useful to reveal the functionally important elements of RNA-located sites (splicing sites and poly(A)signals, various translational enhancers, microRNA targets, etc.). Acknowledgments A.V.K. was supported by the Program of Russian Academy of Sciences (Molecular and Cellular Biology), RFBR (08-04-00525, RFBRDST 09-04-92653), and Russian Ministry of Science & Education (FAE: P721; 2.1.1/6382; FASI: 02.740.11.0705). Appendix A. Supplementary Data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.gene.2010.05.001.
4. Conclusions References We show that the nucleotide ASA of RNA chains depends significantly on sequence and structural environments, and can be predicted with a promising accuracy. A test application on synonymous codon usage bias shows that these statistics can be used for investigation of such phenomena. Results are likely to be helpful in studying other regulatory and evolutionary aspects of gene expression. In our opinion, this parameter can be useful as a measure of “interaction activity” of RNA segments. We hope that further
Fig. 6. Correlation between species-specific synonymous codon usage and non-specific accessible surface area (ASA) ranks in various genomes is directly related to their GC content (GC rich organisms utilize more accessible codons). Error bars are obtained by bootstrap method i.e. generating 100 lists of RNA structures (each with 68 unique RNA chains from PDB) and calculating standard deviations in the resulting correlations. Detailed ASA data for each trinucleotide ASA is attached in supplementary file codonusage.xls.
Abaza, I., Gebauer, F., 2008. Trading translation with RNA-binding proteins. RNA 14, 404–409. Ahmad, S., 2009. Sequence-dependence and prediction of nucleotide solvent accessibility in double stranded DNA. Gene 428, 25–30. Ahmad, S., Gromiha, M.M., Sarai, A., 2003. Real-value prediction of solvent accessibility from amino acid sequence. Proteins 50, 629–635. Ahmad, S., Gromiha, M.M., Sarai, A., 2004. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics 20, 477–486. Andrabi, M., Mizuguchi, K., Sarai, A., Ahmad, S., 2009. Prediction of mono- and dinucleotide-specific DNA-binding sites in proteins using neural networks. BMC Struct. Biol. 9, 30. Basak, S., Roy, S., Ghosh, T.C., 2007. On the origin of synonymous codon usage divergence between thermophilic and mesophilic prokaryotes. FEBS Lett. 581, 5825–5830. Berman, H.M., Olson, W.K., Beveridge, D.L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.H., Srinivasan, A.R., Schneider, B., 1992. The Nucleic Acid Database: A Comprehensive Relational Database of Three-Dimensional Structures of Nucleic Acids. Biophys. J. 63, 751–759. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., 2000. The Protein Data Bank. Nucleic Acids Res. 28, 235–242. Chatterjee, S., Pal, J.K., 2009. Role of 5'- and 3'-untranslated regions of mRNAs in human diseases. Biol. Cell 101, 251–262. Gong, S., Worth, C.L., Bickerton, G.R., Lee, S., Tanramluk, D., Blundell, T.L., 2009. Structural and functional restraints in the evolution of protein families and superfamilies. Biochem. Soc. Trans. 37, 722–733. Hershberg, R., Petrov, D.A., 2009. General rules for optimal codon choice. PLoS Genet. 5, e1000556. Hubbard, S.J., Thornton, J.M., 1993. NACCESS. Department of Biochemistry and Molecular Biology, University College London. . Klosterman, P.S., Tamura, M., Holbrook, S.R., Brenner, S.E., 2002. SCOR: a Structural Classification of RNA database. Nucleic Acids Res. 30, 392–394. Kochetov, A.V., 2008. Alternative translation start sites and hidden coding potential of eukaryotic mRNAs. Bioessays 30, 683–691. Laederach, A., 2007. Informatics challenges in structured RNA. Brief. Bioinform. 8, 294–303. Lescoute, A., Westhof, E., 2006. The interaction networks of structured RNAs. Nucleic Acids Res. 34, 6587–6604. Liu, Y., Wimmer, E., Paul, A.V., 2009. Cis-acting RNA elements in human and animal plus-strand RNA viruses. Biochim. Biophys. Acta 1789, 495–517. Montange, R.K., Batey, R.T., 2008. Riboswitches: Emerging Themes in RNA Structure and Function. Annu. Rev. Biophys. 37, 117–133. Shajani, Z., Varani, G., 2007. NMR studies of dynamics in RNA and DNA by 13C relaxation. Biopolymers 86, 348–359.
48
Y.H. Singh et al. / Gene 463 (2010) 41–48
Shapiro, B.A., Yingling, Y.G., Kasprzak, W., Bindewald, E., 2007. Bridging the gap in RNA structure prediction. Curr. Opin. Struct. Biol. 17, 157–165. Singh, H., Ahmad, S., 2009. Context-dependent reference state of solvent accessibility derived from native protein structures and assessed by predictability analysis. BMC Struct. Biol. 9, 25. Sonenberg, N., Hinnebasch, A.G., 2009. Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell 136, 731–745.
Van Der Kelen, K., Beyaert, R., Inze, D., De Veylder, L., 2009. Translational control of eukaryotic gene expression. Crit. Rev. Biochem. Mol. Biol. 44, 143–168. Volkova, O.A., Kochetov, A.V., 2010. Interrelations between the nucleotide context of human start AUG codon, N-end amino acids of the encoded protein and initiation of translation. J. Biomol. Struct. Dyn. 27, 611–618. Volz, K., 2008. The functional duality of iron regulatory protein. Curr. Opin. Struct. Biol. 18, 106–111.