Motif prediction in ribosomal RNAs Lessons and prospects for automated motif prediction in homologous RNA molecules


Biochimie 84 (2002) 961–973

Original article

N.B. Leontis a,*, J. Stombaugh a, E. Westhof b

a Chemistry Department and Center for Biomolecular Sciences, Overman Hall, Bowling Green State University, Bowling Green, OH 43403, USA
b Institut de biologie moléculaire et cellulaire du CNRS, UPR 9002, modélisation et simulations des acides nucléiques, Université Louis-Pasteur, 15, rue René-Descartes, 67084 Strasbourg cedex, France

Received 5 July 2002; accepted 9 July 2002

Abstract The traditional way to infer RNA secondary structure involves an iterative process of alignment and evaluation of covariation statistics between all positions possibly involved in basepairing. Watson–Crick basepairs typically show covariations that score well when examples of two or more possible basepairs occur. This is not necessarily the case for non-Watson–Crick basepairing geometries. For example, for sheared (trans Hoogsteen/Sugar edge) pairs, one base is highly conserved (always A or mostly A with some C or U), while the other can vary (G or A and sometimes C and U as well). RNA motifs consist of ordered, stacked arrays of non-Watson–Crick basepairs that in the secondary structure representation form hairpin or internal loops, multi-stem junctions, and even pseudoknots. Although RNA motifs occur recurrently and contribute in a modular fashion to RNA architecture, it is usually not apparent which bases interact and whether it is by edge-to-edge H-bonding or solely by stacking interactions. Using a modular sequence-analysis approach, recurrent motifs related to the sarcin–ricin loop of 23S RNA and to loop E from 5S RNA were predicted in universally conserved regions of the large ribosomal RNAs (16S- and 23S-like) before the publication of high-resolution, atomic-level structures of representative examples of 16S and 23S rRNA molecules in their native contexts. This provides the opportunity to evaluate the predictive power of motif-level sequence analysis, with the goal of automating the process for predicting RNA motifs in genomic sequences. The process of inferring structure from sequence by constructing accurate alignments is a circular one. The crucial link that allows a productive iteration of motif modeling and realignment is the comparison of the sequence variations for each putative pair with the corresponding isostericity matrix to determine which basepairs are consistent both with the sequence and the geometrical data. 
© 2002 Société française de biochimie et biologie moléculaire / Éditions scientifiques et médicales Elsevier SAS. All rights reserved. PII: S0300-9084(02)01463-3

Keywords: RNA motif; Non-Watson–Crick basepair; Sugar-edge; Hoogsteen edge

Abbreviations: W.C., Watson–Crick; S.E., Sugar-edge

* Corresponding author. Tel.: +1-419-372-8663; fax: +1-419-372-9809. E-mail address: [email protected] (N.B. Leontis).

1. Introduction

The new high-resolution structures of the ribosomal subunits confirm that the ribosomal RNA molecules (5S, 16S, and 23S) comprise a number of recurrent, modular motifs that mediate RNA–RNA, RNA–protein, and even RNA–drug interactions [1,2]. For our purposes, RNA motifs are ordered, stacked arrays of non-Watson–Crick basepairs that in the secondary structure representation form hairpin or internal loops, multi-stem junctions, and even pseudoknots. We predicted that certain motifs, such as the sarcin/ricin loop motif of 23S rRNA and the bacterial loop E motif of 5S rRNA, occur autonomously in a variety of other contexts within the ribosome [3,4]. The method that we employed to make these predictions entails the following steps:

(1) The secondary structure is used to identify regions forming internal, hairpin, or junction loops.
(2) A consensus profile is constructed for each strand in these loops and the consensus is checked against the consensus for the given motif (here, the sarcin/ricin motif).
(3) In promising regions, putative pairs are identified and the sequence variations in homologous sequences are compiled for those positions, paying attention to the possibility of misalignment of sequences in the database.
(4) The sequence variations for each putative basepair are checked against the variations observed for known occurrences of the motif as well as isosteric pairs obtained by computer modeling (the isostericity matrix for the corresponding basepair family).

Seven occurrences of the sarcin/ricin motif were predicted to occur in conserved regions of 23S rRNA and two in 16S rRNA, in addition to the parent motif in Domain VI of 23S rRNA (Table 1). Recently, crystal structures at atomic resolution of the large and small ribosomal subunits have been published [1,2,5]. The new structures (obtained from the PDB and NDB) prove correct all but one of these predictions. Moreover, additional examples and variants of these and related motifs are evident in the new structures. These occur in variable or composite regions of the molecules.

The new data provide an opportunity to evaluate the prospects for using a motif-driven approach to sequence analysis to predict 3D structure in large RNAs for which homologues are available. Furthermore, an evaluation of the data provides insights into how accurate structural alignments of the RNA sequence databases can be generated.
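Step (4) of the procedure, checking the sequence variation at a putative pair against the base combinations allowed for a basepair family, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pair_consistency`, the toy alignment columns, and the `SHEARED_ALLOWED` set are hypothetical. The set is a placeholder loosely modeled on the variation described in the abstract for sheared pairs; it is not the published isostericity matrix.

```python
from collections import Counter

# Placeholder set of (base_i, base_j) combinations treated as isosteric for a
# sheared (trans Hoogsteen/Sugar-edge) pair. Illustrative only; it is NOT the
# published isostericity matrix.
SHEARED_ALLOWED = frozenset({("A", "G"), ("A", "A"), ("A", "C"), ("A", "U")})


def pair_consistency(column_i, column_j, allowed_pairs):
    """Return (fraction consistent, observed counts) for two alignment columns.

    column_i and column_j hold the base found at each putative pairing
    position, one character per homologous sequence. Sequences with a gap
    ("-") at either position are skipped.
    """
    if len(column_i) != len(column_j):
        raise ValueError("alignment columns must have the same length")
    observed = Counter(
        (a, b)
        for a, b in zip(column_i.upper(), column_j.upper())
        if a != "-" and b != "-"
    )
    total = sum(observed.values())
    consistent = sum(n for pair, n in observed.items() if pair in allowed_pairs)
    fraction = consistent / total if total else 0.0
    return fraction, observed


# Toy columns (hypothetical data): five sequences plus one gapped sequence.
fraction, observed = pair_consistency("AAAAC-", "GGACGG", SHEARED_ALLOWED)
print(f"{fraction:.2f} of sequences are consistent with the placeholder set")
print(observed.most_common())
```

A low fraction for a candidate pair would suggest either a different basepair geometry or a misalignment, which is the signal that prompts realignment in the iterative procedure.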

2. Materials and methods

Three-dimensional structures were downloaded from the Protein Data Bank (PDB) or the Nucleic Acid Database (NDB) and viewed using Swiss PDB Viewer [6]. Motifs were annotated in Canvas 8.0 (Deneba Software) with the recently proposed annotation symbols [7]. Sequences of aligned RNA molecules were downloaded from the European Ribosomal RNA Database [8] and analyzed using COSEQ (C. Massire, in preparation).

3. Results

For the purposes of the present appraisal we can divide the predicted and observed sarcin/ricin motifs into three groups:

1. Correctly predicted sarcin/ricin motifs (8 of 12);
2. Motifs incorrectly predicted to be sarcin/ricin motifs (1);
3. Sarcin/ricin motifs that do occur in the ribosomal crystal structures but which were not predicted (4 of 12).
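The three group counts can be condensed into two summary ratios. These ratios are not reported in the paper; they follow arithmetically from the counts above, and the function name `prediction_summary` is hypothetical.

```python
def prediction_summary(correct, incorrect, missed):
    """Summarize prediction outcomes from the three group counts."""
    predicted = correct + incorrect   # motifs that were predicted
    observed = correct + missed       # motifs present in the crystal structures
    return {
        "predicted": predicted,
        "observed": observed,
        "precision": correct / predicted,  # share of predictions that were right
        "recall": correct / observed,      # share of real motifs that were found
    }


# Counts taken from the three groups listed above.
summary = prediction_summary(correct=8, incorrect=1, missed=4)
for name, value in summary.items():
    print(name, round(value, 3))
```

With these counts, 8 of 9 predictions were correct and 8 of the 12 observed motifs were found.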

Table 1. Sarcin and sarcin-related motifs predicted in the ribosomes. The predictions were made with reference to the E. coli sequences for 16S and 23S rRNA [4]; the E. coli entries show the predicted basepairs. The corresponding positions in the crystal structures of H. marismortui and D. radiodurans are provided. For eukaryal 5S rRNA, the X. laevis sequence was used for prediction, and for archaeal 5S rRNA, the H. vannielli sequence. The 16S rRNA of T. thermophilus is numbered according to E. coli 16S rRNA, as in the crystal structure. Incorrectly predicted basepairs are in bold. U1345–U1376 is trans W.C./W.C. in the 16S crystal structure.

N.B. Leontis et al. / Biochimie 84 (2002) 961–973

Fig. 1. Generic sarcin/ricin motif (left) and parent motif from the 23S rRNA of the archaeon, H. marismortui (right).

A discussion of motifs related to the bacterial loop E is also included. One bacterial loop E motif was correctly predicted in 16S rRNA and another in the conserved signal recognition particle (SRP) RNA using this same approach [4].

3.1. The sarcin/ricin motifs

The paradigmatic sarcin/ricin motif occurs in a very conserved region of Domain VI of 23S rRNA, corresponding to residues 2690–2694 and 2701–2704 in the 23S rRNA of the archaebacterium H. marismortui, for which the highest resolution (2.3 Å) crystal structure (e.g., NDB file RR0033) currently exists [2]. The motif is also present in the 3.1 Å structure of the 23S rRNA of D. radiodurans [9]. The corresponding positions in the E. coli, H. marismortui, and D. radiodurans 23S rRNA sequences are shown in Table 1 for the parent motif and each of the predicted motifs. The complete motif comprises the following ordered assembly of non-Watson–Crick basepairs, listed with the corresponding basepairs from the H. marismortui motif:

1. Trans Hoogsteen/Sugar-edge (A2694–G2701)
2. Trans Watson–Crick/Hoogsteen (U2693–A2702)
3. Cis Hoogsteen/Sugar-edge (U2693–G2692)
4. Trans Hoogsteen/Hoogsteen (A2691–A2703)
5. Trans Sugar-edge/Hoogsteen (U2690–C2704)
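For automated analysis, the ordered list of basepairs above can be held in a small data structure. The sketch below is one possible encoding, not the authors' representation: the `BasePair` class and `shared_residues` helper are hypothetical names, and it assumes that the order of the two edges in each basepair name follows the order of the two residues.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BasePair:
    index: int        # position of the pair in the ordered motif (1 to 5)
    orientation: str  # "cis" or "trans"
    edge_a: str       # interacting edge of the first residue (assumed order)
    edge_b: str       # interacting edge of the second residue (assumed order)
    residue_a: str
    residue_b: str


# Parent sarcin/ricin motif, H. marismortui 23S rRNA numbering, as listed above.
SARCIN_RICIN_PARENT = (
    BasePair(1, "trans", "Hoogsteen", "Sugar-edge", "A2694", "G2701"),
    BasePair(2, "trans", "Watson-Crick", "Hoogsteen", "U2693", "A2702"),
    BasePair(3, "cis", "Hoogsteen", "Sugar-edge", "U2693", "G2692"),
    BasePair(4, "trans", "Hoogsteen", "Hoogsteen", "A2691", "A2703"),
    BasePair(5, "trans", "Sugar-edge", "Hoogsteen", "U2690", "C2704"),
)


def shared_residues(pairs):
    """Return residues that take part in more than one basepair of the motif."""
    counts = Counter(r for p in pairs for r in (p.residue_a, p.residue_b))
    return sorted(r for r, n in counts.items() if n > 1)


print(shared_residues(SARCIN_RICIN_PARENT))
```

In this encoding U2693 is the only residue that appears in two pairs (basepairs 2 and 3).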


The motif is shown schematically in Fig. 1 using recently proposed conventions to indicate unambiguously the nature of each non-Watson–Crick basepair [7]. A5 of the left-hand strand and G1 of the right-hand strand form basepair 1, corresponding to A2694–G2701 in the parent motif in the 23S rRNA of H. marismortui. Note that the right-hand strand in every case is shown 5’ to 3’ going down the page. The red arrows indicate local changes of strand orientation [7]. Also note that in our original analysis, the fifth non-Watson–Crick basepair was not included in the motif. No crystal structures were then available and that basepair was not well characterized in either of the NMR structures on which we relied for the structure of the motif [10,11].

Each sarcin/ricin motif is designated in this paper by the number of the residue corresponding to G2701 in the parent motif of 23S rRNA in H. marismortui (see Fig. 1). To designate the motifs, the numbering of the best available crystal structures will be used (RR0033 for 23S and 5S rRNAs and RR0030 for 16S rRNA). G2701 forms a trans Hoogsteen/Sugar-edge [7] (sheared) basepair. The parent motif assembles an internal loop, as it is flanked on both sides by Watson–Crick pairs, formed by bases belonging to the same strands with no interruptions. In fact, only five of the correctly predicted occurrences of the motif (three in 23S rRNA and one each in 16S and 5S rRNA) form internal loops in the 2D structure. The others occur at junction loops (three in 23S and one in 16S rRNA), a strong indication of the autonomous character of the motif.

3.1.1. Group 1: correctly predicted sarcin/ricin motifs

Two types of motifs were predicted, those occurring as internal loops and those occurring at junctions. Three of the correctly predicted 23S rRNA sarcin/ricin motifs occur in universally conserved (Archaea, Eubacteria, and Eukarya) internal loops (Fig. 2). All five non-Watson–Crick basepairs of the parent motif are present at these sites.
The motifs are flanked by Watson–Crick or wobble pairs (i.e. cis W.C./W.C. GU, UG, or UU). As shown in Table 1, all pairs were predicted correctly for these motifs, although the exact

Fig. 2. Internal loop sarcin motifs in 23S rRNA (H. marismortui): The G159 motif occurs in Helix 11, the G225 motif in Helix 13, and the G2053 motif in the central domain of 23S rRNA (Table 1). The symbols indicate the interacting edges for each non-Watson–Crick basepair, using a recently introduced convention [7]: Circles indicate Watson–Crick edges, squares, Hoogsteen edges, and triangles, Sugar edges. Open symbols indicate trans basepairs and closed symbols, cis basepairs.


Fig. 3. Internal loop sarcin motif in the “switch helix” of 16S rRNA (T. thermophilus, E. coli numbering). The predicted basepairing [3] is shown on the left and the crystallographically observed basepairing, on the right (NDB file RR0030).

nature of the fifth basepair was not known. Interestingly, the third of these motifs occurs in the central domain of 23S rRNA. A sarcin motif was predicted to occur in the universally conserved Helix 27 of 16S rRNA (Fig. 3), also known as the switch helix [12,13]. The 3D structures of the small (30S) ribosomal subunit proved this prediction correct [1,5]. Nonetheless, a small error was made in the prediction because we did not recognize that C893 (E. coli numbering) is involved in a tertiary interaction and not paired to G906. This resulted in a misalignment of the two strands and predicted a bulge at G888, as shown in Table 1 and Fig. 3. In the correct structure, G906 pairs with A892 to form basepair 1 of the motif and G888 pairs with A909 to form basepair 5. More careful attention to the sequence variations at positions G888 and G906 would have allowed us to correctly assign the basepairing. Both G888 and G906 are substituted by A or U in all three phylogenetic domains, whereas A892, A907 and A909 are invariant. Three sarcin/ricin motifs were predicted at conserved junction loops in the secondary structure of 23S rRNA, two in Domain I and one in Domain II (Fig. 4). The first (motif G406) was correctly predicted to comprise only four of the five non-Watson–Crick basepairs of the parent motif. The second (motif G475) was incorrectly predicted to include the fifth basepair (C456–A472 in the E. coli sequence, corresponding to A462–C478 in H. marismortui). The other

basepairs were correctly predicted. In the observed motif, C478 (corresponding to E. coli A472) is stacked below basepair 4 (A463/A477) but is not paired. The third junction motif (G911) is a complex composite motif. Thus only the first three basepairs were correctly predicted, even though the motif comprises all five non-W.C. basepairs of the parent motif. However, these are the result of the complex interaction of four different sequence elements, as shown by the colors in the third panel of Fig. 4. In spite of the complex topology of sarcin/ricin motif G911, its 3D structure (shown in yellow in Fig. 5) superimposes well on that of a classic internal loop motif, such as the 23S G225 motif (shown in mauve in Fig. 5). The only parts of the two motifs that do not superimpose well belong to the backbone, which in the composite motif connects to other parts of 23S rRNA.

A sarcin/ricin motif was also correctly predicted to occur at a 3-way junction in 16S rRNA, adjacent to Helix 43 in domain 3. The motif comprises four of the five non-W.C. basepairs of the parent motif (see Fig. 6). The fifth basepair is replaced by a trans W.C./W.C. pair and is not stacked on basepair 4 in the standard way.

3.1.2. Group 2: motif incorrectly predicted to be a sarcin/ricin motif

The only motif that was incorrectly predicted to be a sarcin motif was the internal loop between Helices 31 and 32 in 23S rRNA. Nonetheless, one basepair was correctly predicted (trans W.C./Hoogsteen C896/A766 in the H. marismortui sequence; see Table 1). The predicted motif (for the H. marismortui sequence) and the observed motifs from H. marismortui and D. radiodurans are shown in Fig. 7. This motif is conserved in bacterial, archaeal, eukaryal and most mitochondrial 23S rRNAs and bears similarity to other motifs found in ribosomal RNAs (Leontis, unpublished observations). Several factors contributed to the error.
First and most importantly, the deduced secondary structure was incorrect [14], leading to the incorrect assumption that G898 is paired to C764 (H. marismortui numbering). Secondly, the nature of the cis Sugar-edge/Hoogsteen (platform) GU pair was not well understood from the NMR structures and thus the base

Fig. 4. Observed sarcin/ricin motifs occurring in junction loops in 23S rRNA (H. marismortui). The composite character of the third motif (G911) is indicated by the color coding of the strands.


Fig. 5. Superposition of crystal structures (NDB file RR0033) of composite sarcin motif, 23S G911 (yellow), and internal loop sarcin motif, 23S G225 (mauve).

Fig. 6. Sarcin motif occurring in a conserved three way junction (3WJ) of 16S rRNA (T. thermophilus, NDB file RR0030, with E. coli numbering, as in crystal structure).

variations at positions 895 and 896 could not be interpreted properly. Thirdly, the nature of the fifth basepair in sarcin/ricin motifs was not understood and so base variations at positions 893 and 768 also could not be interpreted. In fact C893/U768 cannot form a trans S.E./Hoogsteen basepair unless the C and U are reversed. By and large, the cis Sugar-edge/Hoogsteen GU pair is conserved at sarcin/ricin motifs. A small amount of variation is seen for this basepair in the conserved sarcin motifs, especially in motif G159 in 23S rRNA. The first base can be A, C, or U, in addition to the canonical G, while the second base is almost always U in the three primary phylogenetic domains. However, in mitochondrial 23S-like rRNA additional covariations are observed, the principal one being AA. We now have examples from high-resolution structures or models of all base combinations that can form cis Sugar-edge/Hoogsteen pairs (Leontis et al. In press). The

Fig. 7. Conserved motif from Domain II of 23S rRNA that was incorrectly predicted to be a sarcin/ricin motif. The predicted structure is shown on the left and the observed structures in the middle and right panels.


Fig. 8. Sarcin/ricin motifs observed in the crystal structures of H. marismortui 23S rRNA that were not predicted. On the left is a conserved composite motif found in all phylogenetic groups. The middle panel shows a sarcin motif only found in archaeal sequences and on the right a motif that is unique to H. marismortui.

combinations that are observed all form isosteric pairs in the cis Sugar-edge/Hoogsteen geometry.

3.1.3. Group 3: sarcin/ricin motifs that do occur but were not predicted

Three sarcin/ricin motifs occur in the 23S rRNA of H. marismortui that were not predicted. These motifs are shown in Fig. 8. The first is a composite motif in a conserved region of Domain II (motif A1012). The fourth basepair of this motif is trans H./H. and is assembled by a tertiary interaction involving two rRNA domains, Domain V (A2302) and Domain II (A1014), bringing together three rRNA strands. Thus, only one rRNA strand continuously forms the motif (C1011 to U1016). In addition, the motif lacks the fifth basepair of the parent motif. In the original work, we carefully scrutinized the motif but rejected it because it comprises a (locally) symmetrical internal loop. The pattern of conserved and variable bases, however, supported the existence of the first three basepairs of the sarcin/ricin motif.
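The "pattern of conserved and variable bases" used in this kind of analysis can be tabulated directly from an alignment. The following is a minimal sketch, assuming a gapped alignment stored as equal-length strings; the function names, the toy alignment, and the 0.95 conservation threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def column_profile(alignment, position):
    """Count the bases in one alignment column, ignoring gap characters."""
    return Counter(
        seq[position].upper() for seq in alignment if seq[position] != "-"
    )


def is_conserved(profile, threshold=0.95):
    """True if the most common base reaches the (assumed) frequency threshold."""
    total = sum(profile.values())
    if total == 0:
        return False
    return profile.most_common(1)[0][1] / total >= threshold


# Toy alignment (hypothetical data): four homologous sequences, four columns.
ALIGNMENT = ["GAAU", "AAAU", "UAAC", "GAA-"]

for pos in range(len(ALIGNMENT[0])):
    profile = column_profile(ALIGNMENT, pos)
    label = "conserved" if is_conserved(profile) else "variable"
    print(pos, dict(profile), label)
```

Columns flagged as variable are the ones whose base combinations would then be compared with the combinations allowed for the proposed basepair geometry.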

The second sarcin/ricin motif that was not predicted is an internal loop found in the variable Helix 25 found between Domains I and II. The motif is conserved among archaea. The crystal structure of H. marismortui shows that this motif comprises all five non-Watson–Crick basepairs of the paradigmatic sarcin motif. A sequence analysis of archaeal 23S rRNAs would have revealed this motif. This example illustrates the importance of comparisons within phylogenetic groups to identify modular motifs that may be peculiar to a particular lineage. The third motif comprises an internal loop in Domain I that is highly variable. Obviously, this could not have been established by sequence comparisons. A sarcin/ricin motif occurs in Domain I of 16S rRNA that appears to be conserved and specific to eubacteria. It occurs in Helix 17 and comprises all but the fifth basepair of the motif as shown in Fig. 9 (left and center panels). Besides lacking the fifth basepair, the eubacterial 16S motif has two additional peculiarities. First, the cis Hoogsteen/Sugar-edge UG “platform” pair, universally conserved in most sarcin loop motifs, covaries with UU. Cis Hoogsteen/Sugar-edge

Fig. 9. Sarcin/ricin motif observed in the crystal structure of the 16S rRNA of T. thermophilus that was not predicted. The corresponding motif in 16S rRNA of E. coli is also shown. The motif occurs in a variable region and is specific to eubacteria. Third panel: conserved composite sarcin/ricin-like motif in Domain IV of 23S rRNA.


UU pairs have been observed in other motifs (e.g. U831–U832 in 23S rRNA, RR0033) but were not known at the time of the analysis. Second, basepair 4, the trans Hoogsteen pair, is G–C, which is rare for this geometry [15]. Additional sub-motifs of the sarcin/ricin motif are found in conserved regions of the ribosomal RNAs. The best example occurs in Domain IV of 23S rRNA (Fig. 9, right panel). It is also a composite motif, resulting from the assembly of four strands, but it lacks basepairs 4 and 5 of the intact sarcin/ricin motif.
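The relationship between the intact motif and its sub-motifs can be expressed as a subset test on basepair indices. The sketch below uses only the occurrences for which the text states which basepairs are present; the dictionary keys are informal labels chosen here, and `classify` and `missing_pairs` are hypothetical helper names.

```python
FULL_MOTIF = frozenset({1, 2, 3, 4, 5})

# Basepairs of the parent motif reported as present at each site in the text.
OBSERVED = {
    "23S G225 (internal loop)": frozenset({1, 2, 3, 4, 5}),
    "23S A1012 (composite)": frozenset({1, 2, 3, 4}),
    "16S Helix 17 (eubacterial)": frozenset({1, 2, 3, 4}),
    "23S Domain IV (composite)": frozenset({1, 2, 3}),
}


def classify(pairs):
    """Label a set of basepair indices relative to the five-pair parent motif."""
    if pairs == FULL_MOTIF:
        return "complete motif"
    if pairs and pairs < FULL_MOTIF:
        return "sub-motif"
    return "not a sub-motif"


def missing_pairs(pairs):
    """Return the parent-motif basepairs absent from this occurrence."""
    return sorted(FULL_MOTIF - pairs)


for name, pairs in OBSERVED.items():
    print(f"{name}: {classify(pairs)}, missing {missing_pairs(pairs)}")
```

A search that accepts such proper subsets, rather than demanding all five basepairs, would be needed to recover sub-motifs like the Domain IV example.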

4. Role of sarcin-like motifs in RNA packing and recognition

The crystal structures help us appreciate the role of these motifs. Almost universally, they serve as sites for specific RNA–RNA, RNA–protein, and in a few cases RNA–drug interactions. Critical for understanding the interactions of sarcin motifs is the recognition that the conserved adenosines that correspond to A5 (A2694) and A2 (A2702) in the generic (parent) sarcin motif (see Fig. 1) are paired within the motif by their Hoogsteen edges. Moreover, they are cross-strand stacked on each other and thus present their W.C. and Sugar-edges in an ideal manner to interact with the shallow groove of an RNA helix. In addition, the base corresponding to A3 (A2703), which is usually an A or C, is stacked beneath A2 (A2702) and can also participate in shallow groove interactions.

4.1. 23S G159 sarcin motif

This motif in Domain I of 23S rRNA interacts both with protein L15E and a helical region in Domain II. A161, equivalent to A2702 in the parent motif, makes a trans Sugar-edge/Sugar-edge pair with G892 in Domain II, and A160, equivalent to A2703, makes a cis Sugar-edge/Sugar-edge pair with C770 of the adjacent basepair in Domain II. This is shown in Fig. 10 (upper left panel).

Fig. 10. Interactions of sarcin and sarcin-related motifs in the ribosome.

4.2. 23S G225 sarcin motif

The second motif in Domain I interacts with G393, which is stacked flat against the W.C. and sugar edges of the cross-strand stacked A215 and A226 of the motif. This is shown in Fig. 10 (upper right panel).

4.3. 23S G406 sarcin motif

A407 and A408 form distorted trans W.C. pairs with A429 and A430, respectively. In addition, the sugar edge of A408 interacts with C412(O2’). The Hoogsteen edge of unpaired U409 interacts with Lys13 of protein L15E. See the center left panel of Fig. 10.

4.4. 23S G475 sarcin motif

A462 stacks against the W.C. edge of A477, which corresponds to A3 in Fig. 1. A463, which is trans H./H. paired to A477, forms a trans S.E./S.E. pair with G458. See Fig. 10, center right panel.

4.5. 23S G911 composite sarcin motif

The cross-strand stacked A1294 and A912 form cis S.E./S.E. interactions with U1041 and C930, respectively, and the S.E. of A913 (N3 and O2’) H-bonds to U1042(O2’). See Fig. 10, lower left panel.

4.6. 23S A1012 composite sarcin motif and 5S rRNA loop E

Quite remarkably, these two motifs interact via a complex set of interactions that cannot be displayed in a 2D figure. Note that in archaea, as in eukarya, the loop Es of 5S rRNA actually comprise an intact sarcin motif, distinct from bacterial loop E motifs (see below). This is shown in the left lower panel of Fig. 10. The A1012 composite motif and the 5S loop E/sarcin motif interact at an angle to each other so that the bases of the motifs are positioned almost perpendicularly to one another. The interactions thus involve single H-bonds between bases and multiple H-bonds involving the ribose hydroxyl groups, including the following:

A955(N3)–C81(O2’)
G956(O2’)–A80(N3); G956(O2’)–A80(O2’)
G956(N3)–A103(C2)
A957(O2’)–A104(N3); A957(O2’)–A104(O2’)
A1013(C2)–C81(O2)
A1014(O2’)–G100(N2); A1014(O2’)–U82(O2); A1014(O2’)–U82(O2’)

4.7. 23S G2053 sarcin motif

This motif, located in the central domain of the molecule, interacts with protein L22 and with two stacked uridines that interact with the stacked A2054 and A2055 of the motif (see Fig. 11, upper panel). A2054 forms a trans W.C. pair with U2648 and also interacts with Arg128 of protein L22 through its sugar edge. A2055 forms a cis S.E./W.C. pair with U840. The W.C. edge of A1372 interacts with Trp136 of L22 and the Hoogsteen edge of G1370 interacts with Ser24 of L22. This example illustrates the versatility of the sarcin motif as an RNA and protein binder.

Fig. 11. First row: motifs comprising a single trans Hoogsteen/Sugar-edge (sheared) basepair and a (variable) bulge 3’ to the Sugar-edge base. Second row: examples of “tandem sheared” motifs comprising two trans Hoogsteen/Sugar-edge basepairs. Third row: an example of an RNA tertiary interaction involving a tandem sheared motif.

4.8. 23S G2009 composite sarcin motif

The cross-strand stacked A2010 and A1973 of the motif interact in the shallow groove of a helix that includes basepairs C1888 = G2014 and C1889 = G2013 and form trans S.E./S.E. pairs with C1889 and G2014, respectively, as shown in Fig. 11 (center panel).

4.9. 23S G2701 sarcin/ricin parent motif

The W.C. edge of A2694 H-bonds to G2567(O2’), which forms a trans S.E./S.E. pair with G2700. The W.C. or Sugar edges of A2702, A2703, and C2704 make van der Waals contacts with the backbone of protein L6.

4.10. 16S G906 (switch helix) motif

As mentioned above, C893 forms a tertiary cis W.C. pair with U244. A909 forms a cis S.E./S.E. pair with A1413 in the crucial, long penultimate helix of 16S.

4.11. 16S G1373 three-way junction motif

The motif is embedded in a tightly packed region and makes interactions with RNA as well as protein, as shown in Fig. 11 (lower panel). Note the interactions with basic groups on the Hoogsteen edges of G1373, G1347, and U1376, and the cis S.E./S.E. motif formed by A1375 and U1376.

This discussion illustrates the complex, crucial interactions of sarcin motifs with proteins and RNA in the ribosome.

5. Motif morphology: relationships of sarcin/ricin motifs with other motifs

The sarcin–ricin motif is one of a number of related RNA motifs that comprise trans Hoogsteen/Sugar-edge (“sheared”) and trans Watson–Crick/Hoogsteen basepairs as integral components. A single sheared pair cannot be inserted into a normal A-type helix and thus a distortion necessarily arises [16]. Thus, the simplest motif of this type comprises a single sheared basepair (typically A•G) and a bulge consisting of one or more nucleotides. The bulge always occurs in the same place, 3’ to the G of the sheared

Fig. 12. First row: Motifs comprising a single trans Hoogsteen/Sugar-edge (sheared) basepair and a variable bulge 3’ to the Sugar-edge base. Second row: Examples of ‘tandem sheared’ motifs comprising two trans Hoogsteen/Sugar-edge basepairs. Third row: An example of an RNA tertiary interaction involving a tandem sheared motif.


A•G pair. Four examples are shown from the crystal structures of 16S and 23S rRNA in Fig. 12 (upper panel). These examples illustrate that the size of the loop of unpaired bases can vary without affecting the stacking between the sheared pair and the adjacent pairs and that the sheared pair does not have to be A•G. In fact an A•C sheared pair is isosteric to A•G and can replace it without changing the motif [15]. While a single sheared pair cannot be inserted into a helix, a tandem pair can. The new crystal structures show that the ribosome has several examples of tandem sheared pairs (Fig. 13 middle panel), as was anticipated also by sequence analysis. These occur within a normal A-type duplex and feature cross-strand stacking of the two conserved adenosines of the motif (note C, A, or U can substitute for G in the trans Hoogsteen/Sugar-edge pairs) so that both bases face into the shallow groove with their Watson–Crick and Sugar edges. As in the sarcin motif, the cross-strand stacked As are ideally positioned to interact with the shallow groove of another helix as shown by the example from 23S rRNA in the third row of Fig. 13 (only the cross-strand stacked A’s are shown). Note the symmetry of the interaction. The next elaboration of the motif involves the insertion of a single nucleotide between the sheared pairs of one of

the strands to form a “Tandem-Sheared/Bulged” motif. The inserted nucleotide is almost always an A, although an example with U occurs in the 23S rRNA structure. Some examples of this motif are shown in the lower panel of Fig. 13. A key point is that this motif retains the cross-strand stacking motif and adds a third adenosine stacked on the first two but in parallel. These bases are shown in outline in the examples of Tandem-Sheared/Bulged motifs shown below. Other unpaired nucleotides can occur at different positions in the interacting strands, but these are looped out of the motif (e.g., G681, U2726, C1618), while the conserved base is inserted in the helix and stacked between the As of the conserved sheared pairs. The flanking sheared pairs can be substituted by trans W.C./Hoogsteen Y/A pairs (Y = U or C) and additional As can be inserted as in the fourth example in Fig. 13, which has four stacked As, A1581, A1580, A1615, and A1616. Having three stacked bases available for interactions allows for more complex interactions. An example of an interaction of one of these motifs is shown next. Note that only the stacked adenosines of the motif are shown (in outline font) and that the cross-strand stacked adenosines (2799 and 2775) form cis rather than trans Sugar-edge/Sugar-edge pairs with the bases of pair C2575=G2558.

Fig. 13. First and second row. Examples of motifs comprising tandem sheared pairs and bulges from the T. thermophilus 16S rRNA (left panel, first row) and H. marismortui 23S rRNA (other motifs). Third row: An example of the RNA tertiary interactions involving one of these motifs.


The interactions of A2775 and A2776 of the tandem sheared/bulged motif with the shallow groove are identical to those usually exhibited by GNRA loops [17–20]. The ribosome has many examples of this interaction. The important point is that parallel stacked bases (usually adenosines, but not necessarily) in a variety of motifs can interact in the shallow groove in the same way as GNRA loops. Examples from the 50S ribosome are shown in Fig. 14 (upper panels). A second type of interaction in the shallow groove is also possible, but this is much rarer. Examples are shown in Fig. 14 (lower panels). These interactions involve the Watson–Crick edge of the bases. Adenosine is favored as the 3’ nucleotide of the interacting stacked bases.


5.1. Bacterial loop E motifs

Continuing with the consideration of more elaborate motifs, we turn to the motifs related to bacterial loop E of 5S rRNA. As previously discussed, loop E consists of two isosteric submotifs related by 180° rotation [4]. We will refer to these submotifs as “bacterial loop E”. This is an internal loop consisting of three basepairs:

1. Trans Hoogsteen/Sugar-edge
2. Trans Watson–Crick/Hoogsteen or Trans Sugar-edge/Hoogsteen
3. Cis Bifurcated or Trans Sugar-edge/Hoogsteen

The structural principle that unites these motifs is that they comprise a core of three basepairs which position three

Fig. 14. (Upper six panels) Examples from the 50S ribosomal subunit (RR0033) of RNA tertiary interactions in the shallow groove comprising cis and trans S.E./S.E. basepairs. (Lower four panels) Examples from the 50S ribosomal subunit (RR0033) of RNA tertiary interactions in the shallow groove comprising cis and trans W.C./S.E. basepairs. The numbering is that of H. marismortui 23S rRNA.

972

N.B. Leontis et al. / Biochimie 84 (2002) 961–973

Fig. 15. Examples of bacterial 5S rRNA loop E motifs found in the ribosome.

bases (usually but not exclusively As) to face the shallow groove with their Watson–Crick and Sugar-edges, just as in the tandem-sheared/bulged motifs and in the sarcin/ricin motifs. As described above, the related sarcin/ricin motifs (which also occur in the loop E of eucaryal or archaeal 5S rRNA and thus have also been referred to as “loop E motifs”) also retain this core stacking arrangement. Thus the bacterial loop E motifs also comprise motifs for interactions with other molecules, which can be RNA, protein, or small molecules. These bases are shown in outline in the examples shown in Fig. 15.
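As an illustration of the three-basepair core listed in Section 5.1, the following minimal Python sketch encodes the ordered pair geometries as a small data structure and checks whether an annotated internal loop fits the pattern. The names `PairGeometry`, `BACTERIAL_LOOP_E_CORE` and `matches_core` are hypothetical and introduced here only for illustration. Treating the edge order in each label as significant, and requiring exactly three pairs in the listed order, are simplifying assumptions rather than part of the analysis described in this paper.

```python
from typing import NamedTuple, Sequence, Tuple, FrozenSet


class PairGeometry(NamedTuple):
    """One basepair geometry, labeled as in the text (hypothetical type)."""
    orientation: str  # "cis" or "trans"
    edges: str        # interacting edges, e.g. "Hoogsteen/Sugar-edge"


# Ordered core of the "bacterial loop E" submotif, taken from the list in
# Section 5.1. Each position holds the set of geometries the text allows there.
# Assumption: plain ASCII hyphens are used in the labels for convenience.
BACTERIAL_LOOP_E_CORE: Tuple[FrozenSet[PairGeometry], ...] = (
    frozenset({PairGeometry("trans", "Hoogsteen/Sugar-edge")}),
    frozenset({
        PairGeometry("trans", "Watson-Crick/Hoogsteen"),
        PairGeometry("trans", "Sugar-edge/Hoogsteen"),
    }),
    frozenset({
        PairGeometry("cis", "Bifurcated"),
        PairGeometry("trans", "Sugar-edge/Hoogsteen"),
    }),
)


def matches_core(
    observed: Sequence[PairGeometry],
    core: Tuple[FrozenSet[PairGeometry], ...] = BACTERIAL_LOOP_E_CORE,
) -> bool:
    """Return True if the observed pairs match the core position by position."""
    if len(observed) != len(core):
        return False
    return all(pair in allowed for pair, allowed in zip(observed, core))


# Usage: a candidate internal loop annotated with three pair geometries.
candidate = [
    PairGeometry("trans", "Hoogsteen/Sugar-edge"),
    PairGeometry("trans", "Sugar-edge/Hoogsteen"),
    PairGeometry("cis", "Bifurcated"),
]
print(matches_core(candidate))  # True
```

A pattern of this kind captures only the geometric ordering of the core. It does not by itself test whether the observed sequence variation at each position is compatible with the corresponding isostericity matrix.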

5.2. Prediction of loop E motifs

On the basis of sequence analysis of eubacterial 5S rRNA loop E motifs, we correctly predicted that a loop E motif occurs in 16S rRNA at positions 580–584/757–723, as shown in the second panel in Fig. 15, and that the conserved internal loop of SRP RNA also comprises a loop E motif [4].

6. Conclusions

The present comparative analysis leads to the following major conclusions concerning the origins of errors in the assignments of motifs: (1) The correct secondary structure must be used. We use the term ‘secondary structure’ to refer to the set of cis Watson–Crick basepairs. The next two points are related to the first and underline the embedding of the various levels of complexity. (2) High quality sequence alignments reflecting the correct secondary structure must have been deduced. (3) In order to create accurate structural alignments, one must consider the potential insertion of recurrent motifs. (4) The crucial link that allows a productive iteration of motif modeling and realignment is the comparison of the sequence variations for each putative pair with the corresponding isostericity matrix. Indeed, points 1 to 3 indicate that the process of inferring structure from sequence and of constructing accurate alignments is in fact a circular one. To resolve the circularity, one must refer to isostericity matrices to determine which basepairs are consistent with the sequence data [15,21]. (5) Finally, one must recognize that further difficulties are presented by composite motifs made up of several discontinuous strands. Thus, RNA motifs behave like Russian dolls where larger motifs comprise smaller motifs [22]. This strengthens the case for recurrence of motifs and underlines the importance of specific base stacking geometries, besides the central roles of joining the sugar-phosphate backbone ends.

Acknowledgements

This work was supported by NSF REU Grant CHE9732563 and NIH Grant 2R15-GM55898. The authors acknowledge fruitful discussions with Luc Jaeger.

References

[1] B.T. Wimberly, et al., Structure of the 30S ribosomal subunit, Nature 407 (2000) 327–339.
[2] N. Ban, et al., The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution, Science 289 (2000) 905–920.
[3] N.B. Leontis, E. Westhof, A common motif organizes the structure of multi-helix loops in 16S and 23S ribosomal RNAs, J. Mol. Biol. 283 (1998) 571–583.
[4] N.B. Leontis, E. Westhof, The 5S rRNA loop E: chemical probing and phylogenetic data versus crystal structure, RNA 4 (1998) 1134–1153.
[5] F. Schluenzen, et al., Structure of functionally activated small ribosomal subunit at 3.3 Å resolution, Cell 102 (2000) 615–623.
[6] N. Guex, M.C. Peitsch, Swiss-model and the Swiss-PdbViewer: an environment for comparative protein modeling, Electrophoresis 18 (1997) 2714–2723.
[7] N.B. Leontis, E. Westhof, Geometric nomenclature and classification of RNA base pairs, RNA 7 (2001) 499–512.
[8] J. Wuyts, et al., The European large subunit ribosomal RNA database, Nucleic Acids Res. 29 (2001) 175–177.
[9] J. Harms, et al., High-resolution structure of the large ribosomal subunit from a mesophilic eubacterium, Cell 107 (2001) 679–688.
[10] B. Wimberly, G. Varani, I. Tinoco Jr, The conformation of loop E of eukaryotic 5S ribosomal RNA, Biochemistry 32 (1993) 1078–1087.
[11] A.A. Szewczak, et al., The conformation of the sarcin/ricin loop from 28S ribosomal RNA, Proc. Natl. Acad. Sci. USA 90 (1993) 9581–9585.
[12] I.S. Gabashvili, et al., Major rearrangements in the 70S ribosomal 3D structure caused by a conformational switch in 16S ribosomal RNA, EMBO J. 18 (1999) 6501–6507.
[13] J.S. Lodmell, A.E. Dahlberg, A conformational switch in Escherichia coli 16S ribosomal RNA during decoding of messenger RNA, Science 277 (1997) 1262–1267.
[14] R.R. Gutell, et al., A story: unpaired adenosine bases in ribosomal RNAs, J. Mol. Biol. 304 (2000) 335–354.
[15] N.B. Leontis, E. Westhof, Conserved geometrical base-pairing patterns in RNA, Q. Rev. Biophys. 31 (1998) 399–455.
[16] D. Gautheret, D. Konings, R.R. Gutell, A major family of motifs involving GA mismatches in ribosomal RNA, J. Mol. Biol. 242 (1994) 1–8.
[17] F. Michel, E. Westhof, Modelling of the three-dimensional architecture of group I catalytic introns based on comparative sequence analysis, J. Mol. Biol. 216 (1990) 585–610.
[18] H.W. Pley, K.M. Flaherty, D.B. McKay, Three-dimensional structure of a hammerhead ribozyme, Nature 372 (1994) 68–74.
[19] J.H. Cate, et al., RNA tertiary structure mediation by adenosine platforms, Science 273 (1996) 1696–1699.
[20] E.A. Doherty, et al., A universal mode of helix packing in RNA, Nature Struct. Biol. 8 (2001) 339–343.
[21] N. Leontis, J. Stombaugh, E. Westhof, The non-Watson–Crick base pairs and their associated isostericity matrices, Nucleic Acids Res. (2002) in press.
[22] E. Westhof, V. Fritsch, RNA folding: beyond Watson–Crick pairs, Structure Fold. Des. 8 (2000) R55–R65.