Evolutionary rate variation and RNA secondary structure prediction

Evolutionary rate variation and RNA secondary structure prediction

Computational Biology and Chemistry 28 (2004) 219–226 Evolutionary rate variation and RNA secondary structure prediction B. Knudsen a,1 , E.S. Anders...

166KB Sizes 0 Downloads 57 Views

Computational Biology and Chemistry 28 (2004) 219–226

Evolutionary rate variation and RNA secondary structure prediction B. Knudsen a,1 , E.S. Andersen b , C. Damgaard b , J. Kjems b , J. Gorodkin a,c,∗ a Bioinformatics Research Center, Høegh Guldbergsgade 10, University of Aarhus, DK-8000 Århus C, Denmark Department of Molecular Biology, University of Aarhus, C. F. Møllers Allé, Building 130, DK-8000 Århus C, Denmark Division of Genetics, Institute of Animal Science and Animal Health, The Royal Veterinary and Agricultural University, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark b

c

Received 30 January 2004; received in revised form 19 April 2004; accepted 19 April 2004

Abstract Predicting RNA secondary structure using evolutionary history can be carried out by using an alignment of related RNA sequences with conserved structure. Accurately determining evolutionary substitution rates for base pairs and single stranded nucleotides is a concern for methods based on this type of approach. Determining these rates can be hard to do reliably without a large and accurate initial alignment, which ideally also has structural annotation. Hence, one must often apply rates extracted from other RNA families with trusted alignments and structures. Here, we investigate this problem by applying rates derived from tRNA and rRNA to the prediction of the much more rapidly evolving 5 -region of HIV-1. We find that the HIV-1 prediction is in agreement with experimental data, even though the relative evolutionary rate between A and G is significantly increased, both in stem and loop regions. In addition we obtained an alignment of the 5 HIV-1 region that is more consistent with the structure than that currently in the database. We added randomized noise to the original values of the rates to investigate the stability of predictions to rate matrix deviations. We find that changes within a fairly large range still produce reliable predictions and conclude that using rates from a limited set of RNA sequences is valid over a broader range of sequences. © 2004 Elsevier Ltd. All rights reserved. Keywords: RNA; Secondary structure; Prediction; Evolution; HIV

1. Introduction Within the last few years the role of RNA function and structure as a key player in cellular and gene regulatory mechanisms has become more evident than ever. With the discovery of many novel short RNAs, RNA gene regulation has shown to be a common mechanism that can bind highly specific targets, two examples being micro RNA and small nucleolar RNA Eddy (2001). Clearly the structure is important for the functional role of the RNA molecules. Predicting RNA secondary structure is a field of rapid growth in the research community. Over the past decade many approaches have been proposed and in particular energy based folding approaches been popular (Hofacker et al., 1994; Zuker et al., 1999). For a set of evolutionary related sequences, one of the most fundamental contributions was ∗

Corresponding author. Tel.: +45-3528-3578; fax: +45-3528-3042. E-mail address: [email protected] (J. Gorodkin). 1 Present address: Department of Zoology, University of Florida, Gainesville, FL 32611-8525, USA. 1476-9271/$ – see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2004.04.001

given by Sankoff, who simultaneous optimize alignment and structure (Sankoff, 1985). A large number of such prediction tasks have later been effectively handled by stochastic context-free grammars (SCFGs) which deal with the prediction problem in a probabilistic framework (Eddy and Durbin, 1994; Sakakibara et al., 1994; Eddy, 2002). In addition, an evolutionary model using information about single nucleotide substitution as well as base pair substitution has been explicitly incorporated into an SCFG, leading to significantly improved performance and accuracy (Knudsen and Hein, 1999, 2003). A method that takes thermodynamic stability into consideration has also been applied (Hofacker et al., 2002). A key issue, when using evolutionary information in the prediction of RNA structure, is that the base pair substitution rates for a given family of RNA sequences will remain unknown as long as the structure is unknown. Thus, the initial prediction must be based on rates extracted from other families of typically unrelated RNAs. Estimating model parameters from a trusted set of structural alignments have been applied in many other respects.

220

B. Knudsen et al. / Computational Biology and Chemistry 28 (2004) 219–226

Both qrna and rsearch, for example, implicitly employ evolutionary assumptions and use parameter estimates from ribosomal alignments (Rivas and Eddy, 2001; Klein and Eddy, 2003). The same situation is known from sequence alignments where for example substitution matrices and hidden Markov models are constructed from representative sets of proteins that implicitly covers certain structural features (Thorne et al., 1996). These parameter estimates are also applied to comparisons of sequences not represented in the sets used for parameter estimation. Here we address the usage of such estimates for RNA secondary structure, but in the context of an explicit evolutionary model that includes base pair substitutions, and we show that the pfold implementation (Knudsen and Hein, 2003), is robust to rate variations. We in particular investigate the extreme case where evolutionary rates extracted from trusted alignments of tRNA and rRNA (Sprinzl et al., 1998; De Rijk et al., 1998) are applied to structure prediction of an unrelated and much faster mutating RNA: the 5 -region of HIV-1, i.e. the 5 -leader and Gag open reading frame (ORF). The original rates are also changed (randomized) and a structure prediction carried out for each set of rates. We further found that the main part of the HIV-1 structure predicted by pfold remains almost identical within a broad range of variations in the evolutionary rates. The evolutionary rates of HIV-1 is found to be within this stability interval, relative to the rates of tRNA and rRNA. There is an excellent correlation between the predictions from pfold and the enzymatic probing data obtained for these structures (Damgaard et al., 1998, 2004; Abbink and Berkhout, 2003). Finally, the structural alignment of the 5 -region of HIV-1 could iteratively be improved by applying the prediction approach. This alignment is more consistent than the one currently in the HIV-1 database (Kuiken et al., 2002).

2. Material and methods 2.1. Sequence data HIV-1 sequences from the Los Alamos HIV Sequence Database (Kuiken et al., 2002) were extracted: the 5 -region (5 -leader and Gag ORF sequence) nucleotide positions 1–646; (database position 457–1102) was downloaded (December 2001) and resulted in 112 sequences that were complete with no unknown regions in the range considered (position numbers are relative to HXB2 throughout this work). This database alignment was essentially based on HMMER Eddy (1998) taking codon bias into account. Distantly related sequences is a good input to pfold as it provides the most amount of phylogenetic information. Also, keeping the data set small is an advantage when considering computation time and manual editing of the alignment.

Table 1 The representative set of the 5 -region of the HIV-1 sequences Name

Accession number

ADU CD.MAL B FR.LAI B GB.CAM1 B GB.MANC B US.JRCSF B US.WEAU160 B KR.WK B NL.3202A21 C BW.96BW06J4 C IN.93IN101 D CD.ELI D KE.MB2059 G BE.DRCBL N CM.YBF106 O CM.ANT70 CRF01 AE TH.95TNIH047 CRF02 AG GH.97GHAG1 CRF06 cpx AU.BFP90 CRF12 BF UY.URTR35 CRF12 BF AR.ARMA159

X04415 K02013 D10112 U23487 M38429 U21135 AF224507 U34604 AF290028 AB023804 K03454 AF133821 AF084936 AJ271370 L20587 AB032741 AB049811 AF064699 AF385935 AF385936

The left column indicates names from the HIV database and the right column the corresponding GenBank (Benson et al., 2003) accession numbers. Sequence names (left) are indicated by subtype, country, and sequence name. CRF stands for ‘circulating recombinant form’.

Hence, a set of representative sequences were selected from a sequence similarity based phylogeny. Using the algorithm by Hobohm and Sander (1994) with a cut-off at 84% in sequence similarity yields a similar reduction. We obtained the 20 sequences listed in Table 1. As a control test we also apply the rate variation analysis on a set of 9 ribosomal 16S sequences (Hofacker et al., 2002), which share a well-defined common secondary structure. 2.2. Basic cleanup of the initial HIV-1 alignment The region of the HIV-1 alignment had a few obvious errors that were corrected before any further analyses were made [hairpin structure designations according to Berkhout (1996)]: (1) the trans-activating region (TAR) of group N and O was mis-aligned (Berkhout, 2000) and (2) the database alignment reveals two possible placements of the duplication immediately downstream of the primer binding site (PBS). This 24 nt duplication was re-aligned to the upstream position with a 3 nt deletion 14 nt’s downstream (Chang et al., 1999; Andersen et al., 2003), (3) HIV-1 group O contains a 25 nt insertion downstream of the packaging signal (PSI) that also caused mis-alignments, which were corrected. 2.3. Structure determination with pfold As mentioned, stochastic context-free grammars (SCFGs) have been shown to capture RNA structural information in a framework which facilitates statistical analyses (Eddy and Durbin, 1994; Sakakibara et al., 1994). Furthermore, meth-

B. Knudsen et al. / Computational Biology and Chemistry 28 (2004) 219–226

ods that explicitly takes the evolutionary information into consideration have also been proposed (Knudsen and Hein, 1999; Akmaev et al., 2000). Here, we use the pfold, which combines an SCFG with evolutionary information from an alignment of RNA sequences through an explicit evolutionary model. The SCFG defines a prior probability distribution of RNA structures, and the evolutionary information is incorporated to give a structure prediction from the data in the alignment. This optimal structure is found from the probabilities of single columns being unpaired, and the probabilities of pairs of columns forming base pairs. The structure prediction of each position in the alignment has a probability of being correct under this statistical model. We refer to this probability as the reliability, as it tells us how much each part of the prediction can be trusted. Evolutionary information is included as follows. First a phylogenetic tree is estimated for the sequences using the neighbor joining method by Saitou and Nei (1987), while the branch lengths are maximum likelihood estimated. The evolutionary process is described by a rate matrix, R, that defines the substitution of nucleotides. The modeling of unpaired regions is done by a 4 × 4 matrix, while modeling the evolution of base pairs requires a 16 × 16 matrix. The second matrix allows double substitutions, for instance, the base pair AU can change to GC in a single evolutionary event (Tillier and Collins, 1998). The rate matrix can be written as:   r11 r12 · · · r1n  r21 r22 · · · r2n    R= . .. ..  , ..  .. . . .  rn1

rn2

···

rnn

where rij for i = j defines the substitution rate of nucleotide i with j, or base pair i with j. The diagonal is calculated as  rii = − rij , j=i

which insures that the evolutionary process can be stated as a single differential equation. The equilibrium distribution of nucleotides and base pairs can be written as a vector: π = [π1 , π2 , . . . , πn ] (n = 4 for nucleotides and n = 16 for base pairs). These equilibrium distributions are estimated from given trusted alignments (here the tRNA and rRNA alignments, as described by Knudsen and Hein (1999)). Using that the process is assumed to be time reversible, the flux between two nucleotides (or base pairs) has to be the same in each direction, so πi rij = πj rji , implying that the equilibrium distribution and the elements above (or below) the diagonal in the rate matrix R, determines all elements in R.

221

The evolutionary matrix for base pairs have additional constraints, e.g., the base pairs AU and UA should have the same equilibrium probability, and the rate of substitution of AU with GC should be equal to the rate of substitution of UA with CG. The probability of a column that does not pair can be found from the 4 × 4 matrix (Felsenstein, 1981), and the probability of two columns forming a base pair can be found from the 16 × 16 matrix in a similar way (Knudsen and Hein, 1999). 2.4. Rate matrix randomization As the equilibrium frequencies for single nucleotides and base pairs prescribe the structure prediction of individual sequences, we are interested in preserving the equilibrium frequencies in the randomization process of the evolutionary rates. 2.4.1. Randomizing a single rate Rates are often viewed relative to each other, since the relative values are often more important than absolute values. Notice that increasing all rates by a constant factor gives the same evolutionary behavior, except that everything occurs faster by that factor. This behavior suggests that the rates should be considered on a logarithmic scale. Furthermore, as we are interested in generating new rates from the original values, these new rates should be drawn from a distribution around the original value, with some variance. Thus, when randomizing, doubling a rate should be as likely as cutting it in half. This is the case when drawing the logarithm of a rate from a distribution that is symmetrical around the logarithm of the original rate. A simple distribution is the normal distribution, which we will use to obtain a new rate rij : log(rij ) ∼ N(log(rij ), σ 2 ). The higher the variance, σ 2 , the wider the distribution and the further away the new rate is from the original. For a given factor, k, we can choose σ so that the new rate is within a factor k of the original with probability 50%, see Fig. 1. Referring to this k-value can be more intuitive than the variance. Writing Φ(x) for the standard normal distribution function (µ = 0, σ 2 = 1), the new variance can now be calculated from k as  2 log(k) σ2 = , Φ−1 (0.75) since for a standard normal distribution, 75% of the density is below Φ−1 (0.75) = 0.67449. As the distribution is symmetric around zero, 50% of the density is between −Φ−1 (0.75) and Φ−1 (0.75), while having the mean centered. Since the expected value of rij is E[rij ] = e(1/2) σ rij , 2

222

B. Knudsen et al. / Computational Biology and Chemistry 28 (2004) 219–226

For a set of randomized rates, let Nt (c > c0 ) be the number of base pairs predicted with reliability c higher than some threshold c0 , that also are predicted base pairs using the original rates. Let Ntot be the total number of base pairs predicted using the randomized rates. If the fraction of correct base pairs Nt (c > c0 )/Ntot is high, the prediction using the original rates is stable to rate change. This fraction is also called the positive predictive value at the reliability level c0 . Notice that ‘correct’ refers to the structure predicted using the original rates, not the actual structure.

3. Results

Fig. 1. The new rate is drawn from a log normal distribution around the original rate. (A) The plot uses a log scale for the x-axis, so it looks like a normal distribution. The shaded area under the graph represents half its area. This means that there is a 50% probability that the new rate is within a factor k of the original rate. It does, however, turn out that it is necessary to shift the distribution to the left in order to keep the average value of the rate unchanged. (B) Rate distributions for k = 1.1, 1.2, and 1.5.

we need to normalize the randomized rate with 1/e(1/2) σ to get a new rate of the same expected magnitude. This means that the center of the distribution of log rij is moved from 2

the original value of log rij to log(rij /e(1/2) σ ), but we still have a window of a factor k above and below the new center, where 50% of the density lies, see Fig. 1. 2

In this section we first present the results concerning the stability of rate changes, then we relate the results to the structure assignment in the literature. The alignment of the 20 HIV-1 sequences has a length of 713 positions and pfold made predictions on 637 of them (the positions with at most five gaps). 3.1. Rate variation and base pair stability When k increases, the fraction of base pairs in agreement with the original structure decreases. Depending on the reliability threshold for acceptance, k-values up to two can be quite acceptable, see Fig. 2. For example, for the curve with reliability level of 0.9, the fraction of correct base pairs falls from 100% to a little less than 90% for k = 2. Even for low threshold curves the fraction of correct base pairs is about 85%, for k-values up to about 1.5. For the very large k-value of 4, the threshold curve for level 0.9 has a 70% fraction of correct base pairs.

2.4.2. Preserving constraints When randomizing the rate matrices, we wish to keep the equilibrium distribution of nucleotides unchanged. This can be accomplished by randomizing only the rates above the diagonal and calculating the remaining rates using πi rij = πj rji . When randomizing the base pair matrix, randomizations are done so that XY to WZ substitutions still occur with the same rates as YX to ZW substitutions. Again randomizations are done only for rates above the diagonal and the rest are calculated from there. 2.5. Evaluation For each randomization we are interested in measuring the stability of the structure prediction. This in particular concerns measuring whether base pairs (and unpaired alignment positions) predicted with high confidence remains highly confident. Hence, we measure the fraction of predicted base pairs and unpaired positions that remain being predicted with a given level of reliability.

Fig. 2. The impact of applying randomized rates. Each value is an average of 100 randomizations. The upper curve is for base pairs accepted only when their reliability value exceeds a threshold of 0.9. Then follows the corresponding curve for reliability level 0.8, and so on. The standard deviations for the points on the graphs are on average 0.010.

B. Knudsen et al. / Computational Biology and Chemistry 28 (2004) 219–226

Fig. 3. The number of base pairs in the original structure (Ref. str.) as a function of the reliability threshold. The curves for the different k-values report the average number of base pairs of the 100 randomizations.

When considering the reliability of bases not involved in pairing, similar results are obtained (not shown). We also performed the same analyses multiple times on a number of random subsets of 10 of the 20 HIV-1 sequences. This gave close to the same results, showing that the conclusions are also valid when only a more limited dataset is available. When the reliability threshold increases, the total number of predicted base pairs drop, as shown in Fig. 3. The reference structure has 104 base pairs predicted and 429 positions predicted not to form pairs. Of these 637 positions with predictions, 340 (73 base pairs and 194 unpaired) had a prediction reliability of at least 90%, while 533 positions (90 base pairs and 353 unpaired) had a prediction reliability of at least 70%. For k values up to 1.8, the number of predicted base pairs remains fairly constant up to a threshold of around 0.4. The numbers of predicted base pairs then falls with increasing k and threshold levels. The reason for this drop is not entirely clear, but as k becomes larger, the base pair model will tend to fit the data less well, and the reliabilities drop. We performed the same type of analysis on the ribosomal 16S sequences (Hofacker et al., 2002) and obtained the same observations with even higher confidence. The data corresponding to Fig. 2 have values higher than 0.88 for threshold 0.1 through 0.95 (data not shown), showing the robustness with respect to varying rates. 3.2. Cleaning up the structural alignment In addition to the initial alignment adjustments (see Section 2), pfold was used in an iterative cleanup approach. Visual inspection of the pfold output revealed obvious mis-alignments that could be improved by hand. An interesting observation was made in the Poly(A) hairpin where a single nucleotide bulge was alternating between two positions (nt 95–96). The sliding bulge could be incor-

223

porated in the alignment by re-aligning these nucleotides in double and single stranded columns. This resulted in increased reliability values when re-calculated by the pfold algorithm. Likewise, several minor corrections were made that gave rise to more consistency between the assignment of base pairs for individual sequences and for the alignment. The alignment was further improved by iteratively applying pfold and rnadbtools (Gorodkin et al., 2001). The corrected and extended database can be found at: http://www.daimi.au.dk/ compbio/pfold/hiv-1. As pfold was used to clean up the alignment, a number of improvements were made. The resulting alignment consists of 710 positions with 636 positions having a prediction (107 base pairs and 422 unpaired), and 347 positions (77 base pairs and 193 unpaired) had predictions with reliability of at least 90%. For reliability level 70% there are 533 positions (92 base pairs and 349 unpaired) with predictions. The rate stability was also improved when repeating the analysis: the fraction of correct base pairs (from Fig. 2) generally increase (data not shown). 3.3. Rate comparison There are 14 free parameters in the evolutionary rate matrix for base pairs, when only considering the six standard base pairs (AU, GC, and GU and reverse). Because of the many free parameters, the 74 base pair substitutions observed in the 20 HIV-1 sequences are not sufficient for a reliable estimate of new rate matrices. Even with the 112 complete 5 -region HIV-1 sequences in the database, only a few extra base pair substitutions would have been observed, since the 20 sequences used here are spread out over the phylogeny. However, it is possible to estimate the factor k between the original rates and the HIV-1 alignment. Doing this (see Appendix A) we obtain k = 1.6. In agreement with the stability results, the structure prediction can be trusted, with only a few false predictions to be expected. Beside considering the rates, we can compute the expected numbers of substitutions between nucleotides and base pairs, and compare to the observations drawn from the alignment, as shown in Table 2. There are 513 observed substitutions of unpaired nucleotides. The distribution of these substitutions is significantly different from what is expected using the rate matrix of Knudsen and Hein (1999). The most significant difference is 72% more substitutions between A and G than expected (207 versus 120). The second most significant difference is 48% more substitutions between A and C than expected (88.9 versus 60.1). There are 74 observed base pair substitutions, with only one kind being significantly over-represented (by 177%), namely the one between AU and GU (and correspondingly between UA and UG) base pairs (37.5 versus 13.5). Together with the high rate between A and G in unpaired regions this suggests an overall high rate between A and G, in agreement

224

B. Knudsen et al. / Computational Biology and Chemistry 28 (2004) 219–226

Table 2 Observed substitutions compared to expected substitutions, with stem and unpaired regions treated individually Substitution

HIV-1

Expected

Single substitutions A↔C A↔G A↔U C↔G C↔U G↔U

88.89 207.39 61.39 24.22 99.22 31.89

60.13 120.25 97.70 37.41 144.98 52.53

513.00

513.00

0.00 12.67 37.50 0.33 1.83 4.67 2.67 12.33 2.00

6.01 25.03 13.52 1.00 4.51 4.88 1.50 17.27 0.28

74.00

74.00

Total Double substitutions AU ↔ CG AU ↔ GC AU ↔ GU AU ↔ UG AU ↔ UA CG ↔ GC CG ↔ GU CG ↔ UG GU ↔ UG Total

The expected substitutions are scaled to give the same totals as observed. Less substitutions are observed in HIV-1 stems relative to HIV-1 unpaired regions, corresponding to a 46% reduction in stem evolutionary rate in HIV-1 relative to the model of Knudsen and Hein (1999).

with previous observations (Bourara et al., 2000; Berkhout et al., 2001). The distribution of unpaired nucleotides is a little different in the 5 -region of HIV-1 with G over-represented and U under-represented (A: 37.4% versus 36.4%; C: 16.3% versus 15.1%; G: 26.1% versus 21.2%; U: 20.2% versus 27.3%). This over-representation of G may be related to the high frequency of A in conjunction with the high substitution rate between A and G. The base pair frequencies in the HIV-1 sequences are not significantly different from the estimates from the data sets used in Knudsen and Hein (1999). This may have to do with base pairing being governed by the energy of pairing, which will be the same for all RNAs (depending on their environment). The last important difference between the evolution of the 5 -region of HIV-1 and the model of Knudsen and Hein (1999) is that stems are more conserved in HIV-1 than expected (46% reduced evolutionary rate in stems). The functional importance of the HIV-1 stems and/or the many overlapping functional sequence constraints of the 5 -region of HIV-1 may be responsible for this observation.

4. Discussion A general problem in RNA structure prediction is the usage of parameters estimated from trusted alignments of some families of RNA when making structure predictions

for sequences from unrelated RNA families. Whereas some methods, such as qrna and rsearch make implicit use of such estimates (Rivas et al., 2001; Klein and Eddy, 2003), we have addressed the problem for rates explicitly applied in the secondary structure prediction of unrelated RNAs. We did this by applying rates extracted from tRNA and rRNA to predict the secondary structure of the unrelated 5 -region (leader and Gag ORF) of HIV-1 RNA. We investigated to which extent the rates could be randomized without having a notable impact on the structure prediction using pfold (Knudsen and Hein, 2003). The rates were randomized such that a rate had 50% chance of being within a factor k (times bigger or smaller) of the original rate. We found that k-values up to two did not have any major impact in predicting the significant base pairs. These conclusions even hold for limited sizes of the dataset being analysed. Furthermore, we also noticed that the structure prediction becomes slightly better if the quality of the alignment is improved. The predicted structure of the 5 -region of HIV-1 was used to estimate a number of single and pairwise substitutions, and a k-factor was estimated to 1.60. Hence, we conclude that rates extracted from tRNA and rRNA very well could be applied to the the 5 -region of HIV-1. There is indeed an excellent correlation between the structural predictions by pfold and the biochemical analysis of the 5 region of HIV-1. In fact, pfold suggested a novel stem region, the LDI (long distance interaction), which we have supported experimentally (Abbink and Berkhout, 2003; Damgaard et al., 2004). We conclude that RNA structure predictions are quite stable toward variations in the evolutionary rate estimates. This applies for pfold, but is also likely to hold for qrna and rsearch, which use similar techniques for structure prediction (Rivas et al., 2001; Klein and Eddy, 2003).

Acknowledgements We thank Rienk E. Jeeninga for helpful discussions. This work was supported by the Danish Technical and Natural Science Research Councils, the Ministry of food, agriculture and fisheries, Danish Center for Scientific Computing, the Danish AIDS Foundation, and the Carlsberg Foundation.

Appendix A. k value for HIV-1 rates Given the counts of substitutions in HIV, we can estimate which k-value the HIV-1 rate matrix represents. The rate of observing a substitution of residue (or base pair) i with j is rij , given that we start in residue i (r  refers to the rate in HIV). If we consider a sequence with the equilibrium distribution of nucleotides (base pairs), with fraction πi of residue (base pair) i, the rate of observing a substitution from i to j is πi rij . There are six such terms for single residues for a reversible model. For base pairs, there are

B. Knudsen et al. / Computational Biology and Chemistry 28 (2004) 219–226

15 terms, but due to symmetry in the pairs, only nine are different. This gives a total of 15 (9 + 6) different events that we can observe. If we knew the rates for HIV, we could give the overall rates of observing these events. Let us denote these rates fi , and the corresponding counts ci , 1 ≤ i ≤ 15 (the prime again refers to HIV). It is a good assumption that the events are Poisson distributed, since we are modeling the substitutions as a continuous time Markov process that is close to equilibrium. The probability of c observations in a Poisson distribution of intensity λ is: Pλ (c) =

λc e−λ . c!

As the process has intensity tf i , the probability of observing the ci ’s over a time, t, is: 15

i=1

Ptf i (ci ) =

 15

(tf  )ci e−tf i

i

i=1

ci !

.

Since the fi ’s are unknown for HIV-1 we cannot compute this probability directly. However, we can use that they are log-normal distributed around the rates in our original evolutionary model: log(fi )∼N(log(fi ), σ 2 ). The (log-normal) density function for fi is: φi (x) =

1 √

e−(1/2)((log x−log fi )/σ) . 2

xσ 2π

Now the probability of observing the ci ’s, given the fi ’s, t, and σ is (we integrate the fi ’s out): P(c1 . . . c15 |f1 . . . f15 , t, σ) = =

15

15

i=1 0



i=1 0



Ptx (ci )φi (x)dx

(tx)ci e−tx 1 2 √ e−(1/2)((log x−log fi )/σ) dx. ci ! xσ 2π

Using this, the maximum likelihood estimate of t and σ can be found. From σ it is estimated that k = 1.60 for the HIV-1 sequences.

References Abbink, T.E., Berkhout, B., 2003. A novel lon distance base-pairing interaction in human immunodeficiency virus type 1 RNA occludes the gag start codon. J. Biol Chem. 278, 11601–11611. Akmaev, V.R., Kelley, S.T., Stormo, G.D., 2000. Phylogenetically enhanced statistical tools for RNA structure prediction. Bioinformatics 16, 501–512. Andersen, E.S., Jeeninga, R.E., Damgaard, C.K., Berkhout, B., Kjems, J., 2003. Dimerization and template switching in the 5 untranslated region between various subtypes of human immunodeficiency virus type 1. J. Virol. 77, 3020–3030.

225

Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L., 2003. Genbank. Nucl. Acids Res. 31, 23–27. Berkhout, B., 1996. Structure and function of the human immunodeficiency virus leader RNA. Prog. Nucleic Acid. Res. Mol. Biol. 54, 1–34. Berkhout, B., 2000. Multiple biological roles associated with the repeat (R) region of the HIV-1 RNA genome. Adv. Pharmacol. 48, 29–73. Berkhout, B., Das, A.T., Beerens, N., 2001. HIV-1 RNA editing, hypermutation, and error-prone reverse transcription. Science 249, 505–510. Bourara, K., Litvak, S., Araya, A., 2000. Generation of G-to-A and C-to-U changes in HIV-1 transcripts by RNA editing. Science 289, 1564–1566. Chang, S.Y., Apichartpiyakul, C., Kuiken, C.L., Essex, M., Lee, T.H., 1999. Sequence features downstream of the primer-binding site of HIV type 1 subtype E shared by subtype G and a subset of subtype A. AIDS Res. Hum. Retroviruses 15, 1703–1706. Damgaard, C.K., Andersen, E.S., Knudsen, B., Gorodkin, J., Kjems, J., 2004. RNA interactions in the 5 region of HIV-1 genome. J. Mol. Biol. 336, 369–379. Damgaard, C.K., Dyhr-Mikkelsen, H., Kjems, J., 1998. Mapping the RNA binding sites for human immunodeficiency virus type-1 gag and NC proteins within the complete HIV-1 and -2 untranslated leader regions. Nucl. Acids Res. 66, 4144–4153. De Rijk, P., Caers, A., Van de Peer, Y., De Wachter, R., 1998. Database on the structure of large ribosomal subunit RNA. Nucleic Acids Res. 26 (1), 183–186. Eddy, S., Durbin, R., 1994. RNA sequence analysis using covariance models. Nucl. Acids Res. 22, 2079–2088 (http://genome.wustl.edu/eddy/#cove). Eddy, S.R., 1998. Profile hidden markov models. Bioinformatics 14, 755– 763, http://hmmer.wustl.edu/. Eddy, S.R., 2001. Non-coding RNA genes and the modern RNA world. Net Rev. Genet 2, 919–929. Eddy, S.R., 2002. Computational genomics of noncoding RNA genes. Cell 109, 137–140. Felsenstein, J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17 (6), 368–376. Gorodkin, J., Knudsen, B., Zwieb, C., 2001. Semi-automated update and cleanup of structural RNA databases. Bioinformatics 17, 642–645. Hobohm, U., Sander, C., 1994. Enlarged representative set of protein structures. Prot. Sci. 3, 522–524. Hofacker, I.L., Fekete, M., Stadler, P.F., 2002. Secondary structure prediction for aligned RNA sequences. J. Mol. Biol. 319, 1059–1066. Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M., Schuster, P., 1994. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie 125, 167–188 (http://www.tbi.univie.ac.at/∼ivo/RNA/). Klein, R.J., Eddy, S.R., 2003. RSEARCH: Finding homologs of single structured RNA sequences. BMC Bioinformatics 4, 44. Knudsen, B., Hein, J., 1999. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15 (6), 446–454, http://www.daimi.au.dk/∼compbio/pfold. Knudsen, B., Hein, J.J., 2003. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucl. Acids Res. 31, 3423– 3428. Kuiken, C.L., Foley, B., Hahn, B., Korber, B., McCutchan, F., Marx, P.A., Mellors, J.W., Mullins, J.I., Sodroski, J., Wolinksy, S., 2002. Human retroviruses and aids 2000. http://hiv-web.lanl.gov/seq-db.html. Rivas, E., Eddy, S.E., 2001. Noncoding rna gene detection using comparative sequence analysis. BMC Bioinformatics 2, 8. Rivas, E., Klein, R.J., Jones, T.A., Eddy, S.R., 2001. Computational identification of noncoding RNAs in E. coli by comparative genomics. Curr. Biol. 11, 1369–1373. Saitou, N., Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4 (4), 406– 425.

226

B. Knudsen et al. / Computational Biology and Chemistry 28 (2004) 219–226

Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjölander, K., Underwood, R.C., Haussler, D., 1994. Stochastic context-free grammars for tRNA modeling. Nucl. Acids Res. 22, 5112– 5120. Sankoff, D., 1985. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl Math 45, 810– 825. Sprinzl, M., Horn, C., Brown, M., Ioudovitch, A., Steinberg, S., 1998. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 26 (1), 148–153.

Thorne, J.L., Goldman, N., Jones, D.T., 1996. Combining protein evolution and secondary structure. Mol. Biol. Evol. 13 (5), 666–673. Tillier, E.R., Collins, R.A., 1998. High apparent rate of simultaneous compensatory base-pair substitutions in ribosomal RNA. Genetics 148 (4), 1993–2002. Zuker, M., Mathews, D.H., Turner, D.H., 1999. Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In: Clark, J.B.B. (Ed.), RNA Biochemistry and Biotechnology. NATO ASI Series, Kluwer Academic Publishers, The Netherlands, pp. 11–43, (http://bioinfo.math.rpi.edu/∼mfold/rna/form1.cgi).