RNA Secondary Structure Prediction Based on Free Energy and Phylogenetic Analysis

RNA Secondary Structure Prediction Based on Free Energy and Phylogenetic Analysis

Article No. jmbi.1999.2801 available online at http://www.idealibrary.com on J. Mol. Biol. (1999) 289, 935±947 RNA Secondary Structure Prediction Ba...

324KB Sizes 0 Downloads 31 Views

Article No. jmbi.1999.2801 available online at http://www.idealibrary.com on

J. Mol. Biol. (1999) 289, 935±947

RNA Secondary Structure Prediction Based on Free Energy and Phylogenetic Analysis Veronica Juan and Charles Wilson* Department of Biology and Center for the Molecular Biology of RNA, Sinsheimer Laboratories, University of California at Santa Cruz, Santa Cruz, CA 95064, USA

We describe a computational method for the prediction of RNA secondary structure that uses a combination of free energy and comparative sequence analysis strategies. Using a homology-based sequence alignment as a starting point, all favorable pairings with respect to the Turner energy function are identi®ed. Each potentially paired region within a multiple sequence alignment is scored using a function that combines both predicted free energy and sequence covariation with optimized weightings. High scoring regions are ranked and sequentially incorporated to de®ne a growing secondary structure. Using a single set of optimized parameters, it is possible to accurately predict the foldings of several test RNAs de®ned previously by extensive phylogenetic and experimental data (including tRNA, 5 S rRNA, SRP RNA, tmRNA, and 16 S rRNA). The algorithm correctly predicts approximately 80 % of the secondary structure. A range of parameters have been tested to de®ne the minimal sequence information content required to accurately predict secondary structure and to assess the importance of individual terms in the prediction scheme. This analysis indicates that prediction accuracy most strongly depends upon covariational information and only weakly on the energetic terms. However, relatively few sequences prove suf®cient to provide the covariational information required for an accurate prediction. Secondary structures can be accurately de®ned by alignments with as few as ®ve sequences and predictions improve only moderately with the inclusion of additional sequences. # 1999 Academic Press

*Corresponding author

Keywords: covariation; algorithm; pseudoknots; SRP; tRNA

Introduction Knowledge of the secondary structure of biological RNAs is essential for understanding their basis for function. Much effort has thus been directed towards predicting secondary structure from RNA primary sequence. Computational strategies have generally relied upon either free energy calculation or comparative sequence analysis to form the basis for structure predictions (James et al., 1989; Winker et al., 1990; Zuker, 1989). Automated free energy calculations can be readily carried out for a single sequence and often succeed in identifying stable duplexes within relatively short RNA sequences (e.g. transfer RNA; Pipas & McMahon, 1975). Covariational analysis of large sequence phylogenies can provide very strong constraints on folding and has been used to establish the secondary structures E-mail address of the corresponding author: [email protected] 0022-2836/99/240935±13 $30.00/0

of most known folded RNAs (Gutell et al., 1992; Larsen & Zwieb, 1991; Williams & Bartel, 1996). While each strategy has its strengths, both have fundamental limitations, outlined below, that restrict their ability to provide accurate, speci®c predictions of secondary structure for individual RNAs. Free energy prediction methods presume that the biological structure of an RNA corresponds to the global free energy minimum structure, and that the free energy of an RNA conformation may be calculated by summing the individual base-pairing interactions which de®ne the secondary structure. Using dynamic programming methods, a large fraction of conformation space may be sampled, evaluated, and low energy structures identi®ed (Zuker, 1989). Several factors intrinsically limit the accuracy of such methods. Many biological RNAs associate with protein factors which may alter the sequence-dependent energy landscape for an RNA. RNAs may be trapped in kinetically accessible, biologically relevant metastates that differ from the # 1999 Academic Press

936 true ground state. Probably the largest problem for these methods is the energy function itself, which often fails to capture many of the details that ultimately de®ne the energy difference between alternative conformations. Careful experimental measurements of duplex formation have yielded an elaborate set of estimates for the individual energetic contributions to folding, and improvements in these estimates over the last 20 years have increased the accuracy of free energy prediction methods (Freier et al., 1986; Serra & Turner, 1995; Turner, 1996; Turner & Sugimoto, 1988; Turner et al., 1987; Wu et al., 1995). While relatively short RNAs such as tRNA may be accurately predicted by such methods, the lowest energy structures predicted for longer RNAs invariably differ from the experimentally de®ned structures (Le et al., 1988; Pipas & McMahon, 1975). As the length of a sequence grows, so does the probability that multiple, low energy pairings exist for each nucleotide, and thus the number of predicted stable foldings grows exponentially with size. In an effort to minimize this problem, many computational methods are now designed to incorporate speci®c pairwise constraints deduced from phylogenetic, enzymatic, and chemical modi®cation studies (Fink et al., 1996; Gutell et al., 1992; Harris et al., 1997). The addition of these constraints can dramatically improve prediction accuracy (Zuker & Stiegler, 1981; Zuker et al. 1991; Fink et al. 1996). Zuker and others have shown that suboptimal foldings (those with predicted energies near the minimum) of larger RNAs often contain elements of the correct structure (80 % of the basepairs within the correct structure can be found in structures within 2 % of the lowest energy structure; Zuker et al., 1991). Results can be improved somewhat by averaging predicted base-pair probabilities over a set of related sequences (Luck et al., 1996). In doing so, a proposed paired region must be capable of forming simultaneously in several different sequences. Despite these improvements to the basic free energy minimization approach, the ability of these methods to make a unique, accurate prediction for a given primary sequence has proven limited. An alternative, frequently used method for RNA structure prediction is based upon covariational analysis (Felden et al., 1997; Han & Kim, 1993; James et al., 1989; Massire et al., 1998; Winker et al., 1990). In contrast to free energy methods which rely on a single RNA sequence, covariational analysis uses a set of homologous, aligned sequences to identify evolutionarily conserved structure. Presumptive evidence for RNA folding is provided by a sequence alignment when substitutions at pairs of nucleotide positions are readily tolerated yet the ability to form stable base-pairs is maintained. This property can be expressed generally in the form a mutual information metric which measures all correlated changes at pairs of positions without presuming stable pairings for only certain combinations of nucleotides (in doing so, non-cannonical

RNA Secondary Structure Prediction

tertiary interactions can be detected; Gutell et al., 1992). While virtually all recognized secondary structures have been determined using the covariational approach, it nonetheless suffers from several practical drawbacks. To completely de®ne a structure by covariational means requires a very large database of aligned homologous sequences. At least 30 sequences were used to de®ne the secondary structure of ribonuclease P RNA (Brown et al., 1996). In many cases, such primary sequence information may be dif®cult or impossible to obtain. If the sequence of a duplex is important for biological function, it will tend not to vary through evolution and covariational evidence for it will be weak or absent. Accurate recognition of sequence covariation requires that nucleotides be aligned on the basis of structure rather than primary sequence (O'Brien et al., 1998). In the absence of covariational data, however, the structure is incompletely known and such an alignment cannot be computed. Finally, covariational analysis has historically been interpreted on an ad hoc basis by experimenters (often in combination with other physical data) rather than by computer algorithms (e.g. Noller & Woese, 1981; Williams & Bartel, 1996). As such, formalized rules that would allow one to unambiguously de®ne an optimal structure in the general case have not been ®rmly established. Given this background, we have sought to develop a computational approach that combines the relative strengths of free energy prediction and covariational analysis. Speci®cally, we have implemented an algorithm that generates a unique secondary structure prediction from a minimum of primary sequence information. A matrix of all lowenergy pairings within a homology-based multiple sequence alignment is ®rst calculated, and then ®ltered to search for a globally stable overall structure. Potential pairings are ranked and accepted according to a set of functions based on free energy, covariation, and the size of the intervening loop that separates the halves of each potential duplex. To evaluate the performance of the algorithm, we have considered RNAs with secondary structures well de®ned by phylogenetic and experimental data, including transfer RNA, 5 S rRNA, SRP RNA, tmRNA, and 16 S rRNA. Analysis of the consensus structures shows that approximately 80 % of the correct helices are recognized. A series of control simulations has further de®ned the essential elements of the algorithm and its minimum sequence requirements for accurate structure prediction.

Results The algorithm presented here has been implemented as part of X2s, an interactive program written using the XForms toolkit for graphical user interface design (Reichard & Johnson,

937

RNA Secondary Structure Prediction

1996). Described in detail in Methods, the essence of the method can be summarized as follows (Figure 1). A homology-based multiple sequence alignment generated by the program ClustalW (version 1.7) serves as the input data (Higgins et al., 1996; Thompson et al., 1994). Empirically weighted contributions for covariation, energy, and loop terms for each potentially paired region in the alignment are calculated and used to predict the secondary structure. Covariation is scored by iteratively evaluating all pairs of sequences and tabulating the frequency with which one potential base-pair (Watson-Crick or G:U wobble) is substituted by a different pair. Simultaneous variation at both sites in a base-pair is required (e.g. a A U ! G:U substitution does not contribute to the score) and only covariation in the context of a potentially paired region is considered (outlined in detail in Methods). The energy score for a paired region is derived as the free energy predicted for the concatenated sequence de®ned by both halves of the pairing, corrected for the energy predicted for each of the separated half sequences. As such, a pairing of two regions i and j that requires the disruption of a local pairing within either i or j is energetically disfavored. The energy term does not in any way account for the entropic cost of closing the loop between the two paired sequences. As such, an additional term that penalizes large loop formation is computed that acts to favor local pairings over long-range interactions. Each potential paired region is thus assigned a ®nal score in which covariation, energy, and loop terms are weighted (wC, wE, wL) and summed. These scores de®ne the order in which paired regions are incorporated into the secondary structure. Potential paired regions are progressively incorporated into the secondary structure as long as they do not overlap with regions that have already been added to the structure. In the case of overlap, the lower scoring region is trimmed back and incorporated into the structure if it remains larger than a de®ned minimal helix size (nmin helix). Pseudoknots can be explicitly forbidden by testing each potential paired region to determine whether it lies in the loop of a previously de®ned helix. Without such ®ltering, pseudoknots are by default allowed. To allow an objective and quantitative assessment of the accuracy of the prediction algorithm, we prepared a set of test cases consisting of RNAs whose secondary structures have been well established by both phylogenetic analysis and direct biochemical experiments. For each of the test RNAs, a reference structure was de®ned on the basis of these previous studies. Table 1 lists the sequences used and indicates the source of the reference structure (shown in Figure 2). This test set includes tRNA, 5 S rRNA, SRP RNA, tmRNA, and 16 S rRNA. These RNAs cover a wide range of sequence lengths (74-1500 nt), sequence similarities (30-90 % conservation), and contain a mix of both short-range, long-range, and pseudoknot foldings.

Figure 1. Flowchart outlining the X2s secondary structure prediction algorithm starting with a multiple sequence alignment. Details for each step are provided in Methods.

Prediction results A set of global parameters, listed in Table 2, was found to yield optimal results for all RNAs in the test set. Slightly different results were obtained if pseudoknots were either allowed or forbidden (accuracy decreases slightly for SRP, 5 S rRNA, and 16 S rRNA, and increases somewhat for tmRNA when pseudoknots are permitted). Agreement between the predicted and reference structures for each test RNA are considered below. tRNA The clover-leaf secondary structure of tRNA has been established by a wide range of experimental techniques, including, most de®nitively, X-ray crystallography (Kim et al., 1973). Over 2700 tRNA sequences are available in the database and phylogenetic data alone can accurately de®ne the location of all helices (Sprinzl et al., 1996). tRNA is the shortest RNA in our test set and, with the exception of the acceptor stem, consists entirely of local hairpins. We prepared a sequence alignment containing ten tRNAAla sequences, drawn from archeal, eubacterial, and viral lineages (Table 1). The structure of the phage T5 tRNA as de®ned by Sprinzl et al. (1996) served as a reference for measuring prediction accuracy. All of the correct helices were identi®ed and no additional helices were found. The predicted

938

RNA Secondary Structure Prediction

Table 1. Sequences used for each test case RNA

Sequences

Pairwise identity (%)

tRNA

Phage T5 Methanobacter therm. Methanotherm fer. Archaeglobus fulg. Methanospir hung.

Halobacterium halobium Thermococcus celer. Thermoprot. tenax. Mycoplasma pneu. Acholeplasma laid.

71.8

5 S rRNA

Escherichia coli Alcaligenes faecalis Pseudomonas cepacia

Aquaspirillum serpens Acinetobacter calcoaceticus

70.4

SRP RNA

Homo sapiens Canis species Rattus rattus

Caenorhabditis elegans Drosophila melanogaster Xenopus laevis

72.2

tmRNA

Escherichia coli Haemophilus influenza Aeromonas salmonicida Vibrio cholerae

Pseudomonas aeruginosa Pseudoalteromonas haloplank. Marinobacter hydrocarbonocla Actinobacillus actinomycete.

67.8

16 S rRNA

Escherichia coli Thermotoga maritima Borrelia burgdorferi Pirellula marina Clostridium botulinum Microcystis aeruginosa Micrococcus luteus Chlamydia pneumoniae

Methanothermus fervidus Methanoplanus limicola Methanobacterium bryantii Thermus thermophilus Mycoplasma pneumoniae Staphylococcus aureus Fibrobacter succinogenes

71.2

The reference sequence used to determine the accuracy of prediction is indicated in bold. Structures for the reference sequence were obtained from the following sources: tRNA, Sprinzl et al. (1996); 5 S rRNA, Noller (1984); SRP RNA, Larsen & Zwieb (1993); tmRNA, Williams & Bartel (1996); 16 S rRNA, Gutell et al. (1992). Pairwise identity corresponds to the probability that a pair of nucleotides at any position within the alignment will be identical.

structure differs from the reference by the omission of a single base-pair at the bottom of the anticodon stem and the addition of single base-pairs in the anticodon and T C loops.

5 S rRNA The secondary structure of the 120 nt Escherichia coli 5 S rRNA consists of three interrupted helical arms that meet at a three-way

Table 2. Parameters that were varied during the optimization process Effect of modification Parameters A. Weights Covariation (wC) Energy (wE) Loop (wL) B. Other Nucleotide window size (nhelixwin) Minimum allowed helix size (nminhelix) No. of flanking positions To test base-pairing In covariation calculation (nflank test) Min. no. acceptable Flanking base-pairs In covaration test (nflank thres) Optimal loop size (nloop0) Loop term drop-off rate (nloop1)

Default values

Optimal range

9.0 0.5 2.0

55 0.2-1.0 45

Modified value

Overall accuracy (82.0)a

See Figure 4(a) See Figure 4(b) See Figure 4(c)

8 2

1-5

2

2

3

1-3

1 7 0 1 3

78.8 77.0 79.9 72.2 80.5

1 2 4

79.9 81.1 76.2

15 10

Parameters that were varied during the optimization process are listed above. Default value, optimized parameter values, used for all predictions unless indicated otherwise. Optimal range, range of parameter values that yield an overall accuracy within 95 % of that obtained with the default values when individually varied. Effect of modi®cation, speci®c changes in overall prediction accuracy that result when non-default values are used (all other parameters held at the default values). a Indicates average % accuracy using default parameter values.

RNA Secondary Structure Prediction

939

Figure 2. Comparison of the predicted and reference RNA secondary structures. Each panel shows a squiggle plot (generated within X2s using the NAVIEW program; Bruccoleri & Heinrich, 1988) color coded by accuracy. (a) Secondary structures for the reference sequences predicted using the default parameters listed in Table 2. (b) Reference secondary structures, de®ned by phylogenetic and experimental data (see Results). (c) Secondary structure for tmRNA generated when pseudoknots are allowed during the prediction process. Individual nucleotides are colored as follows: green, correct; yellow, overprediction; blue, underprediction; red, misprediction. Helices are labeled following established conventions for each molecule with the following tRNA abbreviations: AA, aminoacyl acceptor stem; AC, anticodon stem; D, D stem; T, TcC stem.

940 junction (Noller, 1984). Several non-canonical base-pairings are embedded within stems IV/V and contribute signi®cantly to its folding. The predicted structure includes all of the correct helices. Differences between the reference E. coli structure and the predicted structure correspond to the precise de®nition of helix ends and the speci®c pairings within stem II and V. SRP RNA Experimental data support a characteristic ``dog bone'' structure for the metazoan signal recognition particle (SRP) RNA (Larsen & Zwieb, 1993). The central helix in this structure (helix 5) consists of a series of short duplexes interrupted by asymmetric bulges (Larsen & Zwieb, 1991). Long-range interactions bring the 50 and 30 -ends of the RNA to

RNA Secondary Structure Prediction

create this major duplex. The inclusion of this RNA in our test set allows us to assess the effects of free energy penalties associated with asymmetric bulges and the preference for short-range over long-range interactions in the process of structure prediction. All of the helices de®ning the SRP structure are predicted, with the exception of the two base-pair helix 7 (Figure 3). Almost all of the unpredicted bases make up the base-pair ends of the short duplexes found in helix 5. tmRNA The proposed secondary structure of tmRNA has been largely determined by phylogenetic analysis (Williams & Bartel, 1996). This structure is dominated by local pairings as well as several pseudoknots, in contrast to other RNAs in the test

Figure 3. The effects of omitting each term from the prediction algorithm on the prediction of the SRP RNA secondary structure. Secondary structures for the SRP RNA were predicted when (a) the covariation weight, (b) the energy weight, or (c) the loop weight were set to zero. Omission of the loop term has no effect on the prediction and thus (c) also represents the default prediction. Coloring corresponds to that used in Figure 2. (d) The lowest energy structure for SRP RNA identi®ed by LRNA (Zuker, 1989) using default parameters.

941

RNA Secondary Structure Prediction

set. While the overall pairwise identity among the different eubacterial and archeal tmRNA sequences in the test set is over 60 %, the alignment is punctuated by islands of complete conservation, separated by regions of high sequence divergence. A total of 71.8 % of the base-pairings de®ned in the E. coli structure, de®ning 11 of the 19 helices, are correctly predicted by the algorithm when pseudoknots are forbidden. When pseudoknots are permitted, the predicted structure is substantially altered such that 12 of the 19 reference helices are identi®ed (accuracy increases to 86 %). Regions corresponding to reference helices 3, 8, 1, 2, and 4 are predicted to base-pair but are mispaired in the predicted structure. 16 S rRNA Perhaps one of the most widely studied RNAs is the E. coli 16 S rRNA. Its secondary structure has been deduced from phylogenetic methods and con®rmed by a wide range of biochemical probing experiments (Noller & Woese 1981; Gutell et al. 1992). The secondary structure of the RNA is generally comprised of small local duplexes ranging from two to six nucleotides in length. A set of 12 eubacterial and three archeal sequences was used to generate a 16 S rRNA alignment. A total of 87 % of all of the helices in E. coli 16 S rRNA are predicted using the optimized global parameters (Table 3). Most of the unpredicted helices correspond to three or four nucleotide pairings embedded within the loops of other helices. Algorithmic requirements for accurate structure prediction Covariation, energy, and loop terms are independently evaluated for each potential pairing and ultimately determine which helices are predicted to form. Table 3 lists the relative contribution of each of these terms to the total score among those pairings accepted into the predicted structures. The covariation term consistently contributes approximately two-thirds of the total score, suggesting that it is the major determining factor in making an accurate prediction. In comparison, the energy

term is relatively minor, contributing only approximately 20 % of the total score. The loop term accounts for the remaining 7-15 % of the total, rising in those structures such as tmRNA in which local pairings dominate. To further understand what aspects of the prediction strategy are essential for accurate prediction of secondary structure, we systematically adjusted parameters and measured their effect (Table 2). In doing so we sought to determine how well de®ned the parameters are for any given test case, and what factors considered by the algorithm play the strongest role in making a prediction. Optimal parameter ranges were de®ned as those parameter values which only marginally reduced the overall prediction accuracy (speci®cally, that allowed at least 95 % of the accuracy obtained with optimized parameters). As shown in Table 3, relatively large deviations in each of the parameters can be tolerated without severely compromising the results. Given these results, we conclude that the parameters determined for the RNAs in the current test set are robust and unlikely to be signi®cantly improved upon for other unknown cases. To assess the relative contribution of the different factors that de®ne the accuracy of a prediction, we varied individual parameters over a wide range and measured their effect on overall accuracy. Figure 4 shows the results obtained when either the covariation, free energy, or loop terms are adjusted from their default values of 9, 0.5, and 2, respectively. Elimination of the covariation term has the strongest effect, reducing the overall accuracy to 36 % (Figure 4(a)). Prediction accuracy improves linearly as the covariation weight is increased from 0 to 5 and changes little at higher values. Filtering instances of covariation on the basis of helical context (i.e. only scoring covariation when the covarying nucleotides are ¯anked by potentially basepairing nucleotides) has only a marginal effect on the accuracy (the overall prediction rate drops to 79.9 % when ®ltering is not used). Elimination of the free energy term has a statistically signi®cant but relatively limited effect on accuracy, reducing the prediction rate for all test cases to 73.3 %. Accuracy improves to optimal

Table 3. Summary of the prediction results using default parameters for each RNA in the test set Prediction accuracy RNA tRNA 5 S rRNA SRP tmRNA 16 S rRNA

PK disallowed

PK allowed

Correct helices identified

95.0 82.4 89.2 71.8 71.8

95.0 82.4 86.4 86.8 65.1

4/4 5/8 16/17 12/19 79/91

% Contribution of individual terms Covar. Energy Loop 68.6 68.0 71.1 61.1 68.5

17.2 17.9 22.4 24.4 23.7

14.2 14.1 6.5 14.5 7.8

Predictions for each test RNA were generated with pseudoknots (PK) either disallowed (default) or allowed. For each prediction, the number of helices in the predicted structure that overlap with those in the reference structure were tabulated (#correct helices identi®ed). For each predicted nucleotide, the covariation, energy, and loop terms assigned for the corresponding helix it lies in were tabulated and then normalized to the total score. The overall contribution of each term on a nucleotide basis is thus computed.

942

RNA Secondary Structure Prediction

Figure 4. Effects of altering the empirical weights for covariation, energy, and loop terms. Except as indicated, the weights correspond to the default values (wC ˆ 9, wE ˆ 0.5, wL ˆ 2). Overall prediction accuracy is plotted as a function of varying (a) covariation weight, (b) energy weight, or (c) loop weight. Variation of the loop term (c) was carried out with either pseudoknots disallowed (default, ®lled circles) or with pseudoknots allowed (open circles). (d) Data in (a), (b) and (c) superimposed after normalizing each varying weight to its default value. Normalized covariation weight (®lled circles), energy weight (open squares), and loop weight (®lled triangles) are shown.

levels as the energy weight increases to 0.5-1.0, but falls dramatically as the weight is increased to higher values (Figure 4(b)). We have tested the prediction method computing the energy score as either the lowest energy value for each potentially paired region (the default case) or as the value obtained by averaging the free energy calculated for all sequences in the alignment. Accuracy is somewhat reduced (78.9 %) and computation time signi®cantly increased when the average value is used instead of the lowest energy value. The loop term has the weakest effect on the prediction. With wL ˆ 0, overall prediction accuracy drops from 82 % to 80 % (Figure 4(c)). Prediction is marginally improved as wL increases, but drops off slightly at values greater than 5. Only the 5 S rRNA prediction accuracy improves with wL > 5, increasing from 82.4 % to 86.8 %. When pseudoknots are permitted, the loop term becomes a more important factor in making accurate predictions and we ®nd a higher relative weight, wL ˆ 3, is optimal (by favoring local interactions, the loop term effectively disfavors pseudoknots which are largely absent from the test case RNAs).

To understand the nature of errors introduced when individual terms are omitted, we examined the predicted structures for SRP RNA (Figure 3). When covariation is neglected, only three of eight helices, corresponding to the Alu domain and helix 6, are correctly predicted (25.0 % accuracy, Figure 3(a)). The long central helix of SRP is broken into a series of shortened helices and the highly evolutionarily conserved helix 8 is misaligned. When the energy term is eliminated, all eight helices are identi®ed but small regions within the helices are either misaligned or not identi®ed (86.9 % accuracy, Figure 3(b)). In the absence of an energy term, paired regions lacking covariational support are not considered. The net result is that the correct helices are identi®ed but their ends are not fully de®ned. The structure predicted for SRP when the loop weight is omitted is identical with the default structure (89.2 % accurate, Figure 3(c)). This is perhaps unsurprising given that SRP contains the highest fraction of long-range pairings among the RNAs in our test set and the loop term will favor down-weighting of these pairings. In those RNAs where the loop term has an effect on

RNA Secondary Structure Prediction

the prediction (e.g. tmRNA), omission of the loop term causes a decrease in prediction accuracy. Primary sequence requirements Given the importance of covariational information for the algorithm, we explored its primary sequence requirements, speci®cally addressing how accuracy changes as a function of the number of sequences in the input alignment. A set of tRNA alignments (nine) and 16 S rRNA alignments (14) were generated in which sequences from the original test alignments were progressively deleted. As shown in Figure 5, accuracy for both the tRNA and 16 S rRNA predictions is not drastically altered until the alignments are reduced to approximately ®ve sequences. With alignments of less than ®ve sequences, the tRNA prediction ¯uctuates between highly accurate and inaccurate results. These ¯uctuations appear to be averaged across the much longer rRNA sequence. The 16 S rRNA prediction accuracy decreases monotonically as the alignment size is reduced to a single sequence (at which point covariation cannot be calculated). Similar trends were obtained for the other test case RNAs with the shorter sequences showing signi®cant ¯uctuations in accuracy with small alignment sets (data not shown).

Discussion We have described a computational method which combines covariational and free energy methods to accurately predict RNA secondary structure from a relatively small set of sequences. The likelihood that a paired region exists is factored into separable components (covariation, free energy, loop terms), making it possible to judge the relative role of each in constructing an accurate prediction. Surprisingly, the same relative

943 weighting for these components appears optimal for most of the RNAs considered in the test cases, and covariation is consistently the dominant contributor to prediction accuracy. We show that for a wide range of test cases, a minimum of ®ve sequences suf®ces to yield reliably accurate secondary structure predictions and that predictions improve modestly with increasing additional sequences. Covariational analysis of multiple sequences provides substantially more predictive information than that obtained by simply averaging free energy predictions across multiple sequences. Luck et al. (1996) have described an algorithm that essentially uses this latter approach, computing free energybased base-pair probabilities for individual sequences, and then averaging probabilities across sequences on the basis of a sequence alignment. While our formalism is somewhat different, we ®nd that when covariation is omitted and an energy score is computed by averaging the free energy predicted for all sequences in an alignment, overall accuracy drops from 82 % to 58 %. It is worth noting that when covariation is ignored, deriving the energy for a paired region by averaging across sequences is a better strategy than simply using the most stable sequence energy (58 % versus 36 % accuracy). In examining the predicted and reference structures, most prediction errors correspond to the assignment of the ends of helices. This problem can, in part, be attributed to real variation in the sequences within each of the aligned sets. If, for instance, a reference structure contains longer helices than most other RNAs in an alignment, the prediction algorithm will tend to systematically underestimate the length of helices for the reference. In addition, the lack of sequence conservation at the helix ends leads to errors in the alignment that prevent their accurate prediction. In practice, these

Figure 5. The dependence of accurate structure prediction upon alignment size. For (a) 16 S rRNA and (b) tRNA, alignments were generated using ClustalW (Higgins et al., 1996; Thompson et al., 1994) with a varying number of primary sequences (extracted from the sets shown in Table 2). X2s was used to predict the secondary structure for each alignment. Prediction accuracy is plotted as a function of the number of sequences in each alignment.

944 errors can force the paired region of one sequence to become out-of-register with respect to the homologous paired region in other sequences. One of the surprising results from this work is the observation that a simple alignment based on the local match to a consensus (not taking potential long-range pairings into account) is able to provide useful covariational information that can accurately de®ne the structure. Ideally, an alignment based on potential base-pairing across paired regions of the alignment should more accurately re¯ect covariation. In all of the alignments reported, the average pairwise identity (the likelihood that any pair of nucleotides at the same position in an alignment will be identical with each other) is approximately 70 %. In general, alignments with this level of identity can be accurately constructed (i.e. structurally homologous nucleotides are placed at equivalent positions in the alignment). The tmRNA alignment is notable for the fact that it has distinct islands of sequence divergence and sequence conservation. In the regions of high divergence, the automatically generated ClustalW alignment is incorrect with regard to the reference structural alignment. Regions unique to one sequence may be improperly aligned with regions of other sequences containing minimal primary sequence homology, which may result in misprediction of the unique region. On the other hand, automatic alignment algorithms may, in principle, add gaps arti®cially to regions of poor primary sequence conservation to maximize ®t to a consensus. As a result, covarying nucleotides may be no longer positioned at equivalent positions in the alignment. In practice, this potential problem does not seem to impact upon prediction accuracy. The loop term is the only factor in the prediction scheme that explicitly favors local pairings over long-range interactions. It is thus somewhat surprising that while most of the RNAs in the test cases examined are dominated by local pairings, prediction accuracy does not signi®cantly depend upon the loop term. Allowing pseudoknots increases the sampling space of paired regions, and thus makes accurate prediction more dif®cult in those molecules in which pseudoknots are absent. Correspondingly, it is interesting to note that accurate predictions require a stronger loop term when pseudoknots are permitted than when they are disallowed (Figure 4(c)). It is striking how poorly secondary structure is predicted when the covariation term is omitted. While our approach is somewhat different from the dynamic programming algorithm of Zuker (1989) used in LRNA, the accuracy obtained using only the energy term is similar for both approaches (e.g. for SRP, X2s accuracy ˆ 25.0 %, LRNA ˆ 26.4 %, see Figure 3). It could be argued that biologically de®ned RNA structures depend, in part, upon protein-RNA interactions that cannot be inferred by sequence analysis. In the speci®c case of the SRP RNA, however, chemical probing suggests that the native structure is largely formed

RNA Secondary Structure Prediction

in the absence of SRP proteins (Andreazzoli & Gerbi, 1991). In this case, free energy prediction identi®es only the Alu domain helices and helix 6 in the signaling domain. Ongoing oligonucleotide melting experiments directed at determining folding free energies for non-cannonical pairings and loops continue to make improvements to the force ®eld terms (e.g. Xia et al., 1997). Using a computational framework for predicting secondary structure such as that presented here, it would seem that an alternative strategy would be to use the wealth of structural information available from biological RNAs as the constraints that de®ne force ®eld terms. By iteratively adjusting energetic parameters and measuring the corresponding effect on prediction accuracy, it may prove possible to improve the precision of the energy function and reduce the requirement for covariational information.

Methods Automated structure prediction The algorithm presented here has been implemented as part of a computer program, X2s, which can be used either interactively or non-interactively to de®ne secondary structure from a set of RNA sequences. Individual steps in the prediction process are outlined in Figure 1 and described in more detail below. Default values for the adjustable parameters listed below are presented in Table 1. Step 1 A multiple sequence alignment is generated that optimizes the ®t of all sequences to a single consensus while minimizing a gap penalty function. In all of the tests carried out in the work described, alignments were prepared using the ClustalW program (version 1.7; executed with the default parameters; Higgins et al., 1996; Thompson et al., 1994). The X2s program is able to read, display, and edit the ClustalW-generated alignment. In the test cases described, however, the initial unedited alignment served as the starting point for automated structure prediction. Step 2 A predicted folding energy matrix is calculated for each primary sequence. For a sequence consisting of N nucleotides, the free energy of folding for each possible pairing centered at nucleotides i and j (j < i, i < N) is computed using a dynamic programming algorithm based on the work by Zuker & Stiegler (1981) and McCaskill (1990) and implemented as the standalone ``fold'' subroutine within the Vienna RNA package (Hofacker et al., 1994). For each matrix element calculation, a window of nhelixwin nucleotides surrounding both the ith and jth nucleotides are extracted and merged

945

RNA Secondary Structure Prediction

(separated by a loop of up to four unde®ned nucleotides if i > j ‡ nhelixwin) to create a test sequence for evaluation.

correction effectively down-weights long-range pairings which require the disruption of local structures.

Step 3

Step 6

The elements in the energy matrix are ranked and the lowest energy pairings stored in a ®nal listing. In the test cases considered here, we have analyzed the 5000 lowest energy pairings for each sequence. In practice, this may be reduced to 2000 without eliminating correct pairings from consideration. Steps 4-6 are carried out for each stored low energy pairing. Step 4 A covariation score, Cij, is computed at each position in the proposed pairing between i and j according to the formula: ! kÿ1 X numseq nwindowÿ1 X X bpi;j;k;n  …bpi;j;l;n  difi;j;k;l;n † 2 lˆ1 Cij ˆ nˆ0 kˆ2 n window  numseq  …numseq ÿ 1† where bpi,j,k,n indicates the existence of a base-pair between the (i ‡ n)th and (j ÿ n)th nucleotides in the kth sequence (Watson-Crick or wobble (G:U) base-pair ˆ 1, no pairing ˆ 0) and difi,j,k,l,n indicates lack of nucleotide conservation at the both the (i ‡ n)th and (j ÿ n)th positions in the alignment for sequences k and l (if nucleotides in the kth and lth sequences are identical at either position, difi,j,k,l,n ˆ 0; otherwise difi,j,k,l,n ˆ 1). This score may be ®ltered such that covariation is considered only at positions in which ¯anking nucleotides have the potential to form some minimum number of base-pairs. This ®ltering is enabled by computing for each sequence how many base-pairs exist in a window n¯ank test around the test pair. If this number (n¯ank) fails to reach a de®ned threshold n¯ank thres, bpi,j,n,k is set to 0, regardless of whether the (i ‡ n)th and ( j ÿ n)th nucleotides can pair. Step 5 An energy score, Eij, is assigned to each pairing. For consistency with other parameters, the energy is multiplied by ÿ1 (thus a favorable energy yields a highly positive score). Two alternative measures of folding energy may be used: (a) lowest energy (default): the predicted energy of the most stable helix among all sequences in a multiple sequence alignment is used directly; or (b) average energy: the predicted energy of all sequences at a given region in the multiple sequence alignment is calculated and averaged to give the energy score. The energy calculated for each potential pairing centered at nucleotides i and j is corrected by subtracting the predicted energy for each half of the proposed pairing (i.e. Eij ˆ Ei ‡ j ÿ (Ei ‡ Ej)). This

A cost function for loop size is calculated as: 8 > < 1; if …i ÿ j†4nloop0   Lij ˆ nloop0 ÿ …i ÿ j† > exp ; if…i ÿ j†5nloop0 : nloop1 where nloop0 is the optimal loop size (loops shorter than or equal to this length earn the maximum score) and nloop1 effectively controls the rate at which the loop score drops off for longer loops. This term favors local pairings over long-range pairings and corrects for the fact that the entropic cost of long-range pairings is ignored in the free energy calculation (described above). Step 7 Computed loop, energy and covariation scores for each region are combined to yield an overall score: scoreij ˆ wc Cij ‡ wE Eij ‡ wL Lij where wC, wE, and wL are empirically determined weights for the covariation, energy, and loop terms, respectively. Step 8 After computing overall scores for all potential paired regions, they are ranked and the top 1000 are then considered further as candidates for the structure. High scoring regions are added progressively (working from the highest scoring to lowest scoring region) to build up the structure. The highest scoring duplex is automatically added to the structure. Subsequent duplexes are added if they do not overlap with previously added (higher scoring) duplexes. If overlap exists, the ends of the lower scoring duplex are adjusted to shorten the helix and eliminate the overlap. If overlap editing reduces the length of the helix to less than nminhelix nucleotides, it is eliminated from further consideration. Otherwise, the paired region becomes part of the de®ned secondary structure. Step 9 After considering all potential pairings and building the bulk of the secondary structure, regions immediately ¯anking helices are searched for additional paired regions of nminhelix nucleotides within which a minimal fraction, fmin bp ˆ 0.6, of nucleotides are capable of forming base-pairs. The ¯anking pairing with the highest base-pairing potential is added to the structure if it satis®es the minimum base-pair content requirement.

946

RNA Secondary Structure Prediction

Computation time varies according to the number and length of sequences in a given multiple sequence alignment ®le. When pseudoknots are disallowed, prediction on an SGI R8000 workstation ranges from 16 seconds (5 S rRNA) to 13 minutes (16 S rRNA) for alignments in the test set, and varies approximately with O(N2) dependence. The range of prediction times increases from two minutes (5 S rRNA) ! two hours (16 S rRNA) when pseudoknots are considered in the prediction. Evaluating results A predicted structure can be directly compared to a reference structure within the X2s program. Instances where the base-pairing partner for a given nucleotide are identical in the two structures are scored as correct. Instances where a given nucleotide pairs in the predicted structure but remains unpaired in the reference structure are considered overpredictions while the converse (unpaired in the predicted structure but paired in the reference) are considered underpredictions. If a nucleotide pairs in both the predicted and reference structure but the pairing partners are different, the instance is recorded as a misprediction. The accuracy of the algorithm is de®ned as the number of correct predictions normalized by the maximum possible number of correct predictions: accuracy ˆ

100%  ]correct ]correct ‡ ]underpredicted ‡ ]mispredicted

Overall accuracy is calculated by simply averaging the individual accuracies measured for each of the test case RNAs.

Acknowledgments This work is supported by grants from the NIH (GM52707) and the Packard Foundation to CW. V.J. was supported by a GAANN award to UC Santa Cruz. Source code and executables for X2s are freely available at http://tyrant.ucsc.edu/X2s or by contacting C.W. at [email protected]

References Andreazzoli, M. & Gerbi, S. A. (1991). Changes in 7SL RNA conformation during the signal recognition particle cycle. EMBO J. 10, 767-777. Brown, J. W., Nolan, J. M., Haas, E. S., Rubio, M. A. T., Major, F. & Pace, N. R. (1996). Comparative analysis of ribonuclease P RNA using gene sequences from natural microbial populations reveals tertiary structural elements. Proc. Natl Acad. Sci, USA, 93, 3001-3006. Bruccoleri, R. E. & Heinrich, G. (1998). An improved algorithm for nucleic acid secondary structure display. Comput. Appl. Biosci. 4, 167-173. Felden, B., Himeno, H., Muto, A., McCutcheon, J. P., Atkins, J. F. & Gesteland, R. F. (1997). Probing the

structure of the Escherichia coli 10Sa RNA (tmRNA). RNA, 3, 89-103. Fink, D. L., Chen, R. O., Noller, H. F. & Altman, R. B. (1996). Computational methods for de®ning the allowed conformational space of 16S rRNA based on chemical footprinting data. RNA, 2, 851-866. Freier, S. M., Kierzek, R., Jaeger, J. A., Sugimoto, N., Caruthers, M. H., Neilson, T. & Turner, D. H. (1986). Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl Acad. Sci. USA, 83, 9373-9377. Gutell, R. R., Power, A., Hertz, G. Z., Putz, E. J. & Stormo, G. D. (1992). Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucl. Acids Res. 20, 5785-5795. Han, K. & Kim, H. J. (1993). Prediction of common folding structures of homologous RNAS. Nucl. Acids Res. 21, 1251-1257. Harris, M. E., Kazantsev, A. V., Chen, J. L. & Pace, N. R. (1997). Analysis of the tertiary structure of the ribonuclease P ribozyme-substrate complex by sitespeci®c photoaf®nity crosslinking. RNA, 3, 561-576. Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266, 383-402. Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M. & Schuster, P. (1994). Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 167-188. James, B. D., Olsen, G. J. & Pace, N. R. (1989). Phylogenetic comparative analysis of RNA secondary structure. Methods Enzymol. 180, 227-239. Kim, S. H., Quigley, G. J., Suddath, F. L., McPherson, A., Sneden, D., Kim, J. J., Weinzierl, J. & Rich, A. (1973). Three-dimensional structure of yeast phenylalanine transfer RNA: folding of the polynucleotide chain. Science, 179, 285-288. Larsen, N. & Zwieb, C. (1991). SRP-RNA sequence alignment and secondary structure. Nucl. Acids Res. 19, 209-215. Larsen, N. & Zwieb, C. (1993). The signal recognition particle database (SRPDB). Nucl. Acids Res. 21, 30193020. Le, S. Y., Chen, J. H., Nussinov, R. & Maizel, J. V., Jr. (1988). An improved secondary structure computation method and its application to intervening sequence in the human alpha-like globin mRNA precursors. Comput. Appl. Biosci. 4, 337-344. Luck, R., Steger, G. & Riesner, D. (1996). Thermodynamic prediction of conserved secondary structure: application to the RRE element of HIV, the tRNAlike element of CMV and the mRNA of prion protein. J. Mol. Biol., 813-826. Massire, C., Jaeger, L. & Westhof, E. (1998). Derivation of the three-dimensional architecture of bacterial ribonuclease P RNAs from comparative sequence analysis. J. Mol. Biol. 279, 773-793. McCaskill, J. S. (1990). The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105-1119. Noller, H. F. (1984). Structure of ribosomal RNA. Annu. Rev. Biochem. 53, 119-162. Noller, H. F. & Woese, C. R. (1981). Secondary structure of 16 S ribosomal RNA. Science, 212, 403-411. O'Brien, E. A., Notredame, C. & Higgins, D. G. (1998). Optimization of ribosomal RNA pro®le alignments. Bioinformatics, 14, 332-341.

RNA Secondary Structure Prediction Pipas, J. M. & McMahon, J. E. (1975). Method for predicting RNA secondary structure. Proc. Natl Acad. Sci. USA, 72, 2017-2021. Reichard, K. & Johnson, E. F. (1996). Using XForms. UNIX Rev. 14, 83. Serra, M. J. & Turner, D. H. (1995). Predicting thermodynamic properties of RNA. Methods Enzymol. 259, 242-261. Sprinzl, M., Steegborn, C., Huebel, F. & Steinberg, S. (1996). Compilation of tRNA sequences and sequences of tRNA genes. Nucl. Acids Res. 24, 68-72. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-speci®c gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673-4680. Turner, D. H. (1996). Thermodynamics of base-pairing. Curr. Opin. Struct. Biol. 6, 299-304. Turner, D. H. & Sugimoto, N. (1988). RNA structure prediction. Annu. Rev. Biophys. Biophys. Chem. 17, 167-192. Turner, D. H., Sugimoto, N., Jaeger, J. A., Longfellow, C. E., Freier, S. M. & Kierzek, R. (1987). Improved parameters for prediction of RNA structure. Cold Spring Harbor Symp. Quant. Biol. 52, 123-133.

947 Williams, K. P. & Bartel, D. P. (1996). Phylogenetic analysis of tmRNA secondary structure. RNA, 2, 13061310. Winker, S., Overbeek, R., Woese, C. R., Olsen, G. J. & P¯uger, N. (1990). Structure detection through automated covariance search. Comput. Appl. Biosci. 6, 365-371. Wu, M., McDowell, J. A. & Turner, D. H. (1995). A periodic table of symmetric tandem mismatches in RNA. Biochemistry, 34, 3204-3211. Xia, T., McDowell, J. A. & Turner, D. H. (1997). Thermodynamics of nonsymmetric tandem mismatches adjacent to G. C base-pairs in RNA. Biochemistry, 36, 12486-12497. Zuker, M. (1989). On ®nding all suboptimal foldings of an RNA molecule. Science, 244, 48-52. Zuker, M. & Stiegler, P. (1981). Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. 9, 133148. Zuker, M., Jaeger, J. A. & Turner, D. H. (1991). A comparison of optimal and suboptimal RNA secondary structures predicted by free energy minimization with structures determined by phylogenetic comparison. Nucl. Acids Res. 19, 2707-2714.

Edited by D. E. Draper (Received 24 February 1999; received in revised form 13 April 1999; accepted 17 April 1999)