In silico methods for co-transcriptional RNA secondary structure prediction and for investigating alternative RNA structure expression

Accepted Manuscript In silico methods for co-transcriptional RNA secondary structure prediction and for investigating alternative RNA structure expres...

Download PDF

5MB Sizes 0 Downloads 41 Views

Report

Full Text

Accepted Manuscript In silico methods for co-transcriptional RNA secondary structure prediction and for investigating alternative RNA structure expression Irmtraud M. Meyer PII: DOI: Reference:

S1046-2023(16)30447-9 http://dx.doi.org/10.1016/j.ymeth.2017.04.009 YMETH 4185

To appear in:

Methods

Please cite this article as: I.M. Meyer, In silico methods for co-transcriptional RNA secondary structure prediction and for investigating alternative RNA structure expression, Methods (2017), doi: http://dx.doi.org/10.1016/j.ymeth. 2017.04.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

In silico methods for co-transcriptional RNA secondary structure prediction . and for investigating alternative RNA structure expression / Irmtraud M. Meyer ∗, Laboratory of Bioinformatics of RNA Structure and Transcriptome Regulation, Berlin Institute for Medical Systems Biology, Max Delbrueck Center for Molecular Medicine, Robert-R¨ossle-Str. 10, 13125 Berlin-Buch, Germany and Institute of Chemistry and Biochemistry, Free University, Thielallee 63, 14195 Berlin, Germany 24th March 2017

Abstract RNA transcripts are the primary products of active genes in any living organism, including many viruses. Their cellular destiny not only depends on primary sequence signals, but can also be determined by RNA structure. Recent experimental evidence shows that many transcripts can be assigned more than a single functional RNA structure throughout their cellular life and that structure formation happens co-transcriptionally, i.e. as the transcript is synthesised in the cell. Moreover, functional RNA structures are not limited to non-coding transcripts, but can also feature in coding transcripts. The picture that now emerges is that RNA structures constitute an additional layer of information that can be encoded in any RNA transcript (and on top of other layers of information such as protein-context) in order ∗ To whom correspondence should be addressed. Tel: +49–30–9406–3292; Fax: +49–30– 9406–3291, Email: [email protected]

1

to exert a wide range of functional roles. Moreover, different encoded RNA structures can be expressed at different stages of a transcript’s life in order to alter the transcript’s behaviour depending on its actual cellular context. Similar to the concept of alternative splicing for protein-coding genes, where a single transcript can yield different proteins depending on cellular context, it is thus appropriate to propose the notion of alternative RNA structure expression for any given transcript. This review introduces several computational strategies that my group developed to detect different aspects of RNA structure expression in vivo. Two aspects are of particular interest to us: (1) RNA secondary structure features that emerge during co-transcriptional folding and (2) functional RNA structure features that are expressed at different times of a transcript’s life and potentially mutually exclusive.

Keywords: RNA secondary structures, RNA structure prediction, co-transcriptional folding, RNA folding pathways, prediction algorithms

1

Introduction . and biological background /

RNA sequences, or transcripts, are the primary products of all DNA genomes. They and their functional products define the active state of a cell in a living organism. Knowing them and understanding the mechanisms regulating their expression is thus key to understanding cellular life and how it regulates (or mis-regulates) itself. . The primary sequence of any RNA transcript can in principle contain information on a range of functional roles that are required by the transcripts at different times of its life. One prominent functional role is information on the protein product(s) that can synthesised from the transcript. There are by now numerous studies of alternative splicing from a wide range of organisms confirming that the expression of a specific protein product requires both, (A) the corresponding protein to be encoded in the transcript itself and (B) the specific cellular environment to allow the expression of that specific protein product. Alternative splicing is thus one prominent aspect of gene expression in vivo that depends as much on intrinsic features (e.g. the range of potential protein products encoded in the naked primary sequence of the transcript itself ) as on extrinsic features (e.g. the specific details of the in vivo environment that the transcript encounters at different stages of its life). / . The set of transcripts that a genome can express is typically sub-divided into those that yield a protein product (protein-coding transcripts) and those that do not (non-coding transcripts). If and when a protein-coding transcript expresses a protein depends, however, both, on its intrinsic features (i.e. the set of potential proteins

2

product encoded in its sequence) as well as on a range of extrinsic features (i.e. details of the specific cellular environment in which the transcript is expressed). / . Another important way of encoding functional information into a transcript is via so-called RNA structures. An RNA structure is formed when a transcript interacts with itself. These interactions primarily involve hydrogen bonds between so-called complementary base-pairs. For many purposes, it suffices to study RNA structures on the level of so-called RNA secondary structure where one only retains information on the set of base-paired nucleotide positions in the transcript (the consensus, complementary base-pairs are {C, G}, {A, U} and {G, U}). An RNA secondary structure can either be global, i.e. engage most of the transcript, or local, i.e. be confined to a sub-sequence of the transcript. / Introducing the notion of alternative RNA structure expression . Investigations of transcripts encoding functional RNA secondary structures have historically focused on non-coding transcripts such as transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), i.e. well-studied cases with prominent functional roles in the cell. More recent research has, however, yielded a wealth of evidence that otherwise protein-coding transcripts can also encode local RNA secondary structure features that can influence, for example, its localisation, degradation and translation initiation. Moreover, due to the degeneracy of the genetic code, information on RNA secondary structure can be embedded in otherwise protein-coding regions. This was first discovered in viral genomes and subsequently also in eubacteria and yeast, mouse and human [1, 2, 3, 4, 5, 6, 7]. / . Conceptually, it was key to realise that one transcript can encode more than a single functional RNA secondary structures, for example two mutually exclusive, switch-like RNA secondary structure elements that engage overlapping sub-sequences of a transcript, typically in the 5’-UTRs of these transcripts (riboswitches). Similar to alternative splicing, the transcript itself encodes more than a single potential RNA secondary structure and extrinsic features of the cellular environment such the temperature [8, 9, 10], level of pH [11] or ligand binding [12] determine if and when a specific RNA structure is expressed. On top of these select individual cases, more recent studies provide strong transcriptome-wide evidence that alternative splicing is likely to be mediated to some extent by alternative RNA structure expression [13]. Mazloomian et al. [13] investigated ADAR-mediated RNA editing in several, high-throughput libraries for ten tissue types in Drosophila melanogaster. First, they find that editing sites are three times more likely to occur in exons with multiple known donor/acceptor splices than in exons with unique splice

3

sites (2879 identified editing sites, p-value < 2 · 10−15 ). Second, they identified 244 edited regions where RNA editing and alternative splicing are likely to influence each other. For 96 out of these 244 regions (39 %), evolutionarily conserved RNA secondary-structures near the splice sites can be readily detected, suggesting an underlying molecular mechanisms where changes in alternative splicing are initiated via ADAR-induced changes of local RNA secondary structure elements (ADAR-editing has the overall effect of loosening RNA structure elements by destroying certain base-pairs via A-to-I edits). The results of Mazloomian et al. provide yet an other reason for looking beyond the one-sequence-one-structure dogma in RNA secondary structure prediction and for introducing the notion of alternative RNA structure expression. / The need for taking co-transcriptional RNA structure formation into account . Historically, early computational methods for RNA secondary structure prediction address an optimisation problem where the goal is to find a thermodynamically most stable RNA secondary structure configuration for a input given RNA sequence (minimumfree energy structure or MFE-structure which need not be unique). These methods are also called MFE-methods for RNA secondary structure prediction. These methods consider a fully extended RNA sequence as input and completely ignore the process by which the sequence folds into the final, most stable RNA secondary structure. / . In reality, most transcripts in vivo emerge sequentially and primarily during the process of transcription, with the 5’-end emerging first and the 3’-end emerging last. This is conceptually a very different setting from the one that most existing MFE-methods assume, including well-known and frequently used methods such as MFold, RNAfold and related programs [14, 15, 16, 17]. / . There is by now surmounting evidence, both from the experimental and the theoretical side, that RNA structure formation can happen co-transcriptionally, i.e. during transcription, and that it is advisable to take the resulting effects into account. The main reason is that it is reasonable to expect that the directionality of the transcript’s synthesis results in corresponding asymmetries during the co-transcriptional formation of the RNA structure. This cleverly reduces the search space of potential RNA secondary structure (and folding pathways) which is a key insight whose importance cannot be overestimated. / . As the main purpose of this review paper is to introduce the reader to computational tools and how to employ them (rather than the detailed experimental and theoretical findings that have originally motivated the creation of these tools), the following only provides a concise summary of the key experimental and theoretical findings

4

in support of co-transcriptional RNA structure formation, for more details see Lai et al. [18]. / . The earliest evidence for co-transcriptional folding dates from the early 1980s, where it was found that RNA folding in vivo and transcription happen on similar timescales [19, 20, 21, 22, 23]. Based on these results, it was clear that any events during the synthesis of the transcript have the potential to influence RNA structure formation. It was subsequently shown that the transcription speed influences the corresponding RNA structure outcome [24, 25, 26] and, importantly, that a change of the natural transcription speed can result in mis-folded, non-functional RNA structures [27, 28]. To complicate matters further, pausing at specific transcription sites was discovered to fine-tune RNA structure formation in vivo [29, 30, 31] and to even determine the choice between multiple potential folding pathways depending on cellular circumstances (flavin mono-nucleotide (FMN)dependent riboswitch in Bacillus subtilis) [30]. / . One of the key features of co-transcriptional RNA structure formation is that is happens sequentially. So, any base-pairs involving only the 5’-end of the transcript can form first, whereas those involving the 3’-end of the transcript can only form once transcription is more or less complete [32, 33]. The second key feature is that cotranscriptional RNA structure formation may involve so-called transient RNA structure elements. These are not necessarily part of the final RNA structure and may only be present during a specific time-span [20, 34]. Transient features may not only help to efficiently guide the emerging RNA structure towards “good” folding pathways that yield the desired structural configuration(s) [35], but can also play dedicated, additional functional roles [34, 36]. Recent in silico analyses show that transient RNA structure features can be evolutionarily as conserved as the final RNA secondary structure and that these transient features can be computationally predicted [37, 38]. The succession of RNA secondary structure for a given transcript as function of time defines a so-called folding pathway. As far as we now, co-transcriptional folding pathways in vivo need not be unique [39, 40] and the choice of one folding pathway over others can be influenced both by intrinsic features [41, 42] as well as extrinsic features of the cellular environment [43, 44, 45, 46, 47, 48]. / Almost all existing methods for RNA secondary structure predictions ignore co-transcriptional structure formation and the influence it may have on the predicted structures. . Technically, this implies that these computational methods search a too large space of potential RNA secondary structure which the transcript in vivo never explores as co-folding efficiently decreases the number of available folding pathways and attainable RNA secondary structures. We expect this effect, i.e. the ratio between the search space of all potential RNA secondary structure and those 5

RNA secondary structure that are actually attainable during many instances of co-transcriptional folding in vivo, to be more pronounced as the length of the transcript increases. Whereas the folding of a longish transcript in vivo is nicely kept on track via the effects of co-transcriptional folding, computational attempts of detecting the functionally relevant RNA secondary structure within an increasingly vast search space (of largely unattainable) potential RNA secondary structure become increasingly vain. This is exactly what Morgan and Higgs [49] suspected in 1996 when observing significant differences between the evolutionarily conserved RNA secondary structure and the respective predicted MFE structures for long RNAs (16S rRNAs, 23S rRNAs, RNAseP). These differences, they concluded, “cannot simply be put down to errors in the free energy parameters used in the model”, but should be due to the effects of kinetic folding in vivo. So, put simply, Levinthal’s paradox [50] does not apply to RNA structures forming co-transcriptionally in vivo. / . A purely theoretical analysis of 409 RNAs with known functional structure, however, showed that these transcripts not only encode information on the respective final RNA secondary structure, but also on how to reach them via co-transcriptional folding pathways [35]. Moreover, the study managed to identify potential transient RNA structure elements and to judge their potential beneficial or detrimental effect on the co-transcriptional formation of the known final RNA structure. There is thus justified hope that we should be able to investigate many intrinsic features influencing cotranscriptional RNA structure formation by carefully studying only the transcripts themselves and – with a bit of luck – turning our insight into better methods for RNA secondary structure structure prediction. In contrast to this, it will be much more challenging to computationally capture the multitude of extrinsic features, i.e. the details of cellular environment, which determine which of the potential, encoded RNA secondary structures is expressed when. / . As we explain in the following, new methods for RNA secondary structure structure prediction that explicitly capture effects of cotranscriptional folding in vivo are only starting to emerge. As we explain in the discussion, creating and further improving these methods will require to overcome several challenges, ranging from conceptual to more technical ones. So, in a way, it is thus early days. Yet, as the results below show, it is well worth it, both in terms of improved prediction accuracy and new biological insights gained. / . The following explains how RNA secondary structure predictions can benefit from taking co-transcriptional folding explicitly into account. For this, three computational strategies for investigating different aspects of co-transcriptional RNA secondary structure formation are introduced. Goal 1 is currently the most challenging to achieve. We present initial results showing that transient RNA struc6

ture features are evolutionarily conserved and can be computationally predicted in Zhu et al. [37, 38]. The underlying computational pipeline is concisely described below in order to be easily reproducible. Goals 2 and 3 are more straightforward to achieve which is reflected in a correspondingly shorter description. For each goal, particular care was taken to highlight the limitations of each strategy and to explicitly list the underlying assumptions. /

2

Computational strategies for investigating cotranscriptional RNA structure formation in vivo

In the following, the term “RNA secondary structure” will refer to a single set of mutually compatible RNA structure features (e.g. helices) that can be made or expressed by the underlying RNA sequence at the same time. During cotranscriptional folding, a single RNA thus undergoes a time-ordered series of different RNA secondary structures and a transcript with a known riboswitch thus has at least two functional RNA secondary structures. The term “final RNA secondary structure” will refer to the RNA secondary structure at the end of a single co-transcriptional folding pathway (note that there may be several potential pathways for any given transcript). The term “helix” will be used to denote an RNA structure feature corresponding to a contiguous stretch of base-paired nucleotides, i.e. without internal loops and bulges.

2.1

Goal 1: Predicting conserved transient and final RNA secondary structure features of co-transcriptional folding pathways

Computational methods for predicting entire co-transcriptional folding pathways have started to emerge since the mid-1980s. These methods take a single sequence as input and predict the folding kinetics of the emerging sequence as function of simulated time (RNA folding pathway prediction methods). Their raw output typically consists of a list of structural configurations encountered during the simulated folding pathway. These methods typically work in a nondeterministic way [51, 52, 53, 54, 55, 56, 57, 58], although exceptions exist [59], see table 1. Different simulations for the same input sequence thus typically result in different predicted folding pathways. Folding pathway prediction methods model structural changes during the kinetic folding pathway typically on the level of entire helices rather than individual base-pairs. On extending the RNA sequence, corresponding randomised structural changes are proposed (i.e. new helices are created or existing ones destroyed). These changes are accepted with a probability which corresponds to the theoretical rate of the corresponding . configuration change / . As the errors of these folding pathway prediction methods are . cumulative / , 7

Program Kinwalker ´fold Kine RNAKinetics

algorithm deterministic (MFE structures) stochastic simulation (helices) stochastic simulation (helices)

pseudo knots no yes no

max seq. length (nt) 1000 200 200

Table 1: Key features of the three investigated folding pathway prediction methods. For all three methods, a transcription rate r can be specified by the user for each individual input sequence. We use a simulation time of t = 2 · L/r equal to twice the transcription time for an input sequence of length L to allow the program to converge. To determine the number of simulated trajectories for each input sequence of length L, we use a quadratic function of ´fold and RNAKinetics. L as recommended by the authors of Kine Program Kinwalker [59] ´fold Kine [55, 56, 57] RNAKinetics [51, 52, 53]

raw output list of struct. configurations over simulated time (one trajectory per input sequence) list of struct. configurations over simulated time (multiple trajectories per input sequence) aggregated data across all simulated trajectories (for each helix, probabilities over simulated time points)

Table 2: Type of predictions generated as raw output by the three folding pathway prediction methods.

stochastic simulation methods tend to have a length limitation of around 200 nt. This constraint precludes the analysis of many naturally occurring transcripts. A more recent addition from 2008, Kinwalker [59], significantly extends the length limitation from 200 nt to around 1000 nt by employing a deterministic strategy to model distinct secondary structure configurations of the co-transcriptional folding pathway. For this, it combines the repeated execution of a deterministic free energy minimisation (MFE) algorithm with heuristic considerations that judge the kinetic feasibility and speed of potential structural transitions between two deterministically calculated structural configuration. For a given input sequence, Kinwalker returns as output exactly one predicted folding trajectory and the corresponding series of encountered RNA secondary structures, whereas simulation-based methods typically yield different folding pathways for multiple executions of the program for the same input sequence, see table 2. Conceptual limitations of folding pathway prediction programs: • the length of input sequence limited to 200 nt or 1000 nt, in case of Kinwalker • the transcription speed is assumed to be constant 8

• any cis-interaction partners (proteins, ligands, RNAs) as well as their potential effects on co-transcriptional folding are ignored. We recently presented the first systematic performance benchmark for folding pathway prediction methods for a non-redundant dataset of 32 sequences from six functional RNA families and showed that homologous RNA sequences not only fold into similar RNA structures, but also use similar co-transcriptional folding pathways [37]. For this, we investigated and compared the performance of three folding pathway prediction methods, RNAKinetics [51, 52, 53], Kin´fold [55, 56, 57]. These methods are freely available walker [59] and Kine and representative of the different underlying algorithms being employed in the field, see table 1 and table 2 for an overview of their key features. The six functional RNA families which constitute our test set were selected to comprise known final RNA secondary structure as well as known transient structural features, as we were keen to explore how well these could be predicted by the different folding pathway prediction methods. The six functional RNA families are: (1) Levivirus maturation gene (Levivirus), (2) bacterial ribonuclease P type A (RNase P Type A), (3) Hepatitis delta virus ribozyme (HDV ribozyme), (4) Bacterial signal recognition particle 4.5S RNA (SRP 4.5S RNA), (5) Tryptophan operon leader (Trp operon) and (6) S-adenosylmethionine riboswitch (SAM riboswitch). We first compiled high-quality multiple-sequence alignment for each of the six families and then extracted a total of 32 non-redundant sequences that constitute of test set, see table 3 for an overview of different quality measures. Step 1: In order to predict evolutionarily conserved transient and final RNA secondary structure features encountered during co-transcriptional folding pathways, we first need to establish a high-quality alignment of multiple homologous RNA sequences. This alignment will later be used to map and compare the RNA structure features predicted for individual sequences. From this multiple-sequence alignment, we first extract representative sequences to be used for individual analysis with one of the folding pathway prediction methods. General recommendations for high-quality alignments to be used for subsequent analysis with folding pathway prediction methods are: • The primary sequences in the alignment have to extend up to the transcription start site on the 5’ end. This is key for simulating co-transcriptional folding pathways. • The sequences in the alignment should have an average pairwise primary sequence identity that is neither too low nor too high (ideally, between 50% and 85%) to best prepare for comparative analysis. . If the sequences are too similar in terms of primary sequence identity, it is hard for the algorithms to distinguish between RNA structureencoding regions and non-structure-encoding regions. If, on the 9

contrary, the sequences are too dissimilar, it becomes challenging to establish a high-quality input alignment as most alignment methods only consider primary sequence conservation, but not RNA structure conservation. / • The selected representative sequences have to have a good fit to the known RNA structure features and the alignment quality within known structured regions ought to be high. Manual adjustments of the multiplesequence alignment can, for example, be made with the help of 4SALE [60]. • The quality of any sub-alignment between known structured regions has to be sufficiently high, for example using MUSCLE [61] which is guided by primary sequence conservation only. Step 2: Once a high-quality alignment has been established, several so-called representative sequences are extracted from it. The goal is to identify a sub-set of sequences that are all (1) good representatives of the reference RNA secondary structure encoded in the multiple-sequence alignment and (2) non-redundant in terms of pairwise primary sequence identity. This can be achieved as follows: • First, order all sequences in the alignment based on the quality of the structural fit to the known reference structure, e.g. the number of consensus base-pairs fitting the base-pairs of the known RNA secondary structure and the number and type of gaps within structured regions (in particular, one-sided versus two-sided gaps). From the sub-set of resulting sequences with a good structural fit, select the top-scoring sequence as first representative sequence. . The known RNA secondary structure comprises all officially known RNA structure features irrespective of whether these are transient or not, mutually compatible or not. / • Continue selecting representative sequences from the structure-fit-ranked list of sequences as long as the pairwise primary sequence identity with respect to every already selected representative sequences is sufficiently low. A decent lower threshold value for the maximum pairwise sequence identity (max. PSI) between any two representative sequences from the same multiple-sequence alignment is 50%. Values used in our study [37] range from 55% to 85% depending on the overall level of primary sequence conservation in each alignment and the desired target number of representative sequences, see table 3 for details. The above procedure yields a non-redundant set of representative sequences for any given alignment. These representative sequences are used as input sequences for the subsequent analysis with folding pathway prediction methods to detect evolutionarily conserved transient and final RNA secondary structure features. 10

alignment

N

R

L

cons.

Levivirus RNase P Type A HDV ribozyme SRP 4.5S RNA trp operon SAM riboswitch

7 24 10 10 10 15

4 7 3 5 5 7

158 391 152 141 107 215

0.645 0.698 0.819 0.726 0.694 0.532

max. PSI 0.80 0.70 0.85 0.80 0.80 0.55

gaps

TTL

BPs

0.154 0.084 0.070 0.024 0.068 0.227

1.181 4.279 0.499 1.240 1.253 4.310

54 122 61 38 29 66

canon. BPs 0.939 0.966 0.941 0.976 0.969 0.895

Table 3: Overview of different features and quality measures for the six alignments [37]. N refers to the number of sequences in the alignment, R to the respective number of representative sequences, L to the alignment length (nt). The conservation (cons.) indicates the average pairwise primary sequence conservation, max. PSI the maximum pairwise sequence identity (used for selecting representative sequences from the respective multiple-sequence alignment), gaps the fraction of gaps and TTL the total tree length. BPs specifies the number of base-pairs and canon. BPs the fraction of canonical base-pairs within all base-paired alignment columns. The covariation (cov.) ranges from -2 to +2 and measures the relative frequency of compensatory mutations that retain the base-pairing ability within pairs of base-paired alignment columns. Its value is 0 when there is no variation in paired alignment columns, positive when they comprise compensatory mutations that retain the base-pairing ability and negative when they contain invalid base-pairs.

Step 3: Once the set of representative sequences has been determined for a given alignment of homologous transcripts, each individual representative sequence is used as input to one of the three folding pathway prediction methods. As table 1 and ´fold and RNAKinettable 2 explain, the three methods Kinwalker, Kine ics differ considerably in their underlying algorithms and type of raw output predicted. All three folding pathway prediction methods allow the user to specify a (constant) transcription speed for each individual input sequence. This parameter is key for influencing the corresponding simulated co-transcriptional folding pathways as a change of transcription speed can have a major impact on cotranscriptional RNA secondary structure formation. This is because RNA structure formation can happen on the same time scale as transcription [23]. There is ample experimental support that a change of transcription speed can yield different RNA structure outcomes [27, 28, 24, 25, 26] and that it need not be constant along the entire transcript (transcriptional pausing sites) [29, 30, 31]. The latter is an effect that none of the three folding pathway prediction methods can currently model. For our data sets, we choose different values for the transcription speed depending on the evolutionary domain of each representative sequence [22], see

11

cov. 0.302 0.441 0.007 0.340 0.195 0.276

table 4. alignment Levivirus RNase P Type A HDV ribozyme SRP 4.5S RNA trp operon SAM riboswitch

r [nt/s] 30.0 22.5 20.0 22.5 22.5 75.0

Table 4: Overview of different values of transcription speed r for each RNA family. These values are used as parameters for the three folding pathway prediction methods. Note that for Levivirus, the speed corresponds to the speed of replication of the positive RNA. Depending on the polymerase, the transcription speed can range from 200 nt/s for phages, 20–80 nt/s for bacteria to 10–20 nt/s for human pol II [22].

´fold and RNAKinetics, we use a simulation time of t = 2 · L/r For Kine equal to twice the transcription time for an input sequence of length L to allow the program to converge. To determine the number of simulated trajectories for each input sequence of length L, we use a quadratic function of L as recom´fold and RNAKinetics. mended by the authors of Kine Step 4: The previous step generates a range of raw output data for all individual representative sequences. For each alignment, the predictions of each method for all representative sequences are then aggregated and converted into scores for individual, predicted base-pairs. This is done as follows: • Kinwalker As this method works in a deterministic manner and predicts a single co-transcriptional folding pathway for each representative sequence, each predicted base-pair is assigned a score which is equal to the fraction of representative sequences for which this base-pair was predicted as part of the structural features encountered during the predicted co-transcriptional folding pathway. Based on our MCC-optimised procedure for known structural features, see figure 2 in [37], we recommend a cut-off value of 0.43 for these base-pair specific scores, i.e. any predicted base-pair with a score below 0.43 is discarded. ´fold This method employs stochastic simulations to predict individ• Kine ual co-transcriptional folding pathways. For each representative sequence, we sample N folding pathways which depends quadratically on the length L of the input sequence. For any predicted base-pair, we first calculate the fraction of pathways that feature this base-pair. We then average this fraction across all representative sequences of the same alignment to arrive at 12

the final score assigned to the predicted base-pair. The MCC-optimised ´fold is cut-off value we recommend for base-pairs predicted with Kine 0.755, see figure 2 in [37] for details. • RNAKinetics The raw output of this methods already corresponds to a summary of helices predicted during any of the simulated co-transcriptional ´fold, the number N of simulated pathways folding pathways. As for Kine depends quadratically on the length L of the input sequence. For each predicted helix, a probability value is specified for points in simulation time. For a predicted base-pair, we first pick the maximum probability over time as probability for that base-pair and then assign the average value of these probabilities for all representative sequences as final score. The MCC-optimised cut-off value we recommend for base-pairs predicted with RNAKinetics is 0.0082, see figure 2 in [37]. Prediction accuracy for known transient and final RNA secondary structures ´fold have an equally high Matthews’ As table 5 shows, Kinwalker and Kine correlation coefficient (MCC) of 0.676 and 0.656, respectively, for all known transient and final structural features. The MCC is a combined measure of the sensitivity and the specificity of the prediction, see the caption of table 5 for its definition. For Kinwalker, the high MCC value is due to equally high values for the true positive rate (TPR) and the positive predictive value (PPV), ´fold with a significantly whereas these values are more unbalanced for Kine higher PPV (0.885) than TPR (0.501). RNAKinetics has a markedly lower overall MCC value of 0.263 which can be primarily attributed to significantly lower values for the PPV, both for known transient and known final RNA structure features. These low values cannot be rescued by the comparatively high values for the TPR. Typically, the TPR for known transient RNA structure features is significantly lower than for known features of the final RNA secondary structure, RNAKinetics being the exception with a high TPR of 0.722 for known transient features which is even high than the TPR of 0.652 for known final structural features. As we showed in the original paper, see figure 2 in [37], the performance values shown in table 5 are not sensitive to the precise choice of cutoff values. Our MCC-optimised cutoff values, see table table 5, can thus be viewed as robust, general recommendation unless a dedicated data set of known structures is available for training. Summary Using the computational pipeline described above, individual folding pathway prediction methods can thus be used in combination with a comparative analysis strategy to reliably predict evolutionarily conserved RNA secondary structure features of co-transcriptional folding pathways. Due to conceptual constraints

13

Program Kinwalker ´fold Kine RNAKinetics

Cutoff 0.430 0.755 0.0082

Known transient TPR PPV 0.428 0.318 0.183 0.378 0.722 0.191

Known final TPR PPV 0.762 0.648 0.586 0.874 0.652 0.210

TPR 0.693 0.501 0.678

All known PPV MCC 0.667 0.676 0.885 0.656 0.231 0.263

Table 5: Performance figures for known RNA structure features for the three folding pathway predictionmethods [37]. Average true positive rate (TPR) and positive predictive value (PPV) for known transient and known final RNA secondary structure features using the three folding pathway prediction programs at MCC-optimised cutoff values optimised over known features (Cutoff) for the three folding pathway prediction methods Kin´fold, and RNAKinetics. Matthews’ correlation coefficient walker, Kine (MCC) is a measure of both p sensitivity and specificity and is defined as M CC = (T P · T N − F P · F N )/ (T P + F P ) · (T P + F N ) · (T N + F P ) · (T N + F N ). TPR is a measurement of sensitivity on base-pair level, and is defined as T P R = T P/(T P + F N ). PPV is a measurement of specificity on base-pair level, and is defined as P P V = 1 − (F P/(T P + F P )). See also figure 5 for the ROC curve showing the TPR as function of the FPR for all three folding pathway prediction methods.

of the folding pathway prediction methods, this strategy is currently limited to transcripts of 200 nt length (1000 nt in case of Kinwalker). Known transient structure features can be predicted with roughly the same accuracy as structural features of the final RNA secondary structure. Furthermore, the prediction accuracy of the computational analysis pipeline is robust with respect to the recommended, MCC-optimised cutoff values. When dedicated data sets with known RNA secondary structures are available, there is potential to further improve the prediction accuracy by deriving dedicated cutoff values. If desired, the specificity of the analysis can be further increased by combining the predictions of more than a single folding pathway prediction method to significantly increase the overall PPV while only slightly lowering the TPR, see table 3 in [37] for details.

2.2

Goal 2: Predicting novel, conserved RNA secondary structure features of co-transcriptional folding pathways

The computional pipeline described above for goal 1 gives a sense of the expected prediction accuracy for known transient and final RNA secondary structure features. One limitation of the above comparative analysis pipeline, however, is that it can only be applied to rather short transcripts. In order to detect RNA structure features that may play important functional roles during any time of the RNA transcripts’ life in its cellular environment in transcripts of

14

arbitrary length, we require a technically and conceptually different approach. We devised a computational prediction method called Transat [62] in order to address this situation. This method works in a comparative way by taking a multiple-sequence alignment and a corresponding evolutionary tree linking the sequences in the multiple-sequence alignment as input. It predicts as output a set of evolutionarily conserved helices with calculated log-likelihood values and corresponding estimated p-values. Transat thus implicitly captures the hypothesis that structural features that have been conserved during certain evolutionary times (as specified by the input tree) are likely to be of functional importance. In devising Transat, we deliberate avoided conceiving a method for predicting RNA secondary structures, i.e. sets of helices that could all be present at one point in time. Instead, Transat only aims to identify individual, evolutionarily conserved helices. By deliberately ignoring the grouping of predicted helices into distinct RNA secondary structures and leaving the interpretation of the raw Transat predictions to the user, however, we gain the ability to: • identify pseudo-knotted configurations of helices that most RNA secondary structure prediction methods cannot identify • not having to model the details . of / the cellular environment (such as ion concentrations, temperature, potential trans-interaction partners, transcription speed and potential transcription pausing sites) and the influence they may have on in vivo RNA structure formation. If any RNA structure feature is conserved during evolution, Transat will be able to identify it irrespective of the cellular circumstances and molecular mechanisms that lead to their conservation. • find mutually exclusive helices such as those involved in the two distinct structural configurations of riboswitches • predict potential transient helices that may not be part of the final RNA secondary structure configuration, but may have distinct functional roles at some time during the RNA transcripts’ life in the cell, e.g. by functioning as local RNA secondary structure features that regulate alternative splicing via RNA editing [13] or as helices interacting with trans-acting molecules during co-transcriptional folding in the cell [18]. • generate predictions without having to assume that any input sequence of the input multiple-sequence alignment folds into a global RNA secondary structure spanning the entire transcript. The latter is usually the implicit assumption of any RNA secondary structure prediction method that is based on the principle of free energy minimisation. These so-called MFE methods cannot be used to identify regions devoid of conserved RNA structure features as they will aim to predict additional base-pairs in order to lower the overall free energy of the predicted structure.

15

2.3

How Transat works:

Transat is a probabilistic method, both in terms of the underlying algorithms and the predictions being made as output. It takes as input a multiple-sequence alignment of homologous RNA sequences and a corresponding evolutionary tree. This tree relates the sequences in the input alignment quantitatively in terms of topology and branch distances. In its first step, Transat calculates potential helices for each individual, ungapped RNA sequence in the multiple-sequence alignment using a fast dynamic programming procedure that depends quadratically on the sequence length. In a second step, these sequence-specific helices are mapped onto the input multiple-sequence alignment without altering the multiple-sequence alignment itself. This results in a list of candidate helices along the alignment which can each be uniquely specified by a corresponding pair of base-paired outer alignment columns x1 and y 1 as well as a helix length L. Each candidate helix along the alignment can thus be viewed as a list of 5’-alignment columns, x = (x1 , x2 , . . . , xL ), that constitute the “left” arm of the helix, and a corresponding list of consecutive 3’-alignment columns, y = (y L , . . . , y 2 , y 1 ), that constitute the “right” arm of the helix. A candidate helix of length L thus forms basepairs between pairs of alignment columns starting with (x1 , y 1 ) as the outer base-pair up to (xL , y L ) as the inner-most base-pair. In the third step, Transat calculates a log-likelihood score Λ(h) for each candidate helix h along the alignment. For this, it computes the likelihood P (x, y|θpaired ) that the “left” alignment columns x are indeed base-paired with the corresponding “right” alignment columns y as well as the likelihood P (x, y|θunpaired ) that they are all unpaired. The first hypothesis is captured by a probabilistic evolutionary model which spells out how pairs of base-paired nucleotides evolve as function of evolutionary time. The second hypothesis corresponds to a different probabilistic evolutionary model which specifies how unpaired nucleotides evolve over time, see [62] for more details. Both likelihood values are calculated using the Felsenstein algorithm [63]. For a given candidate helix h along the alignment, this algorithm takes the nucleotides observed in the actual alignment columns of the multiple-sequence alignment (i.e. at time t = now) and evolves them back in time along the input tree using the evolutionary model of the respective hypothesis. The log-likelihood ratio assigned to each candidate helix h along the alignment can then be expressed as Λ(h) = log(P (x, y|θpaired )/P (x, y|θunpaired )) · 1/L One key difference with respect to the usual way that these probabilistic models of evolution are employed in the context of RNA secondary structure prediction is that we interpret one-sided gaps within candidate pairs of alignment columns not as missing data, but explicitly penalise them by treating them as non-consensus base-pairs. This correctly captures the observation that helices evolve over time by acquiring or losing entire base-pairs and significantly contributes to Transat’s ability to correctly distinguish base-paired form unpaired

16

alignment columns. In the fourth and last step of Transat, each candidate helix h along the alignment is assigned an estimated p-value. This quantifies the statistical significance of the corresponding log-likelihood value Λ(h). This step addresses the following issue. An input multiple-sequence alignment consisting primarily of G and C nucleotides has a generally higher propensity to form spurious helices than a multiple-sequence alignment with a different nucleotide, di-nucleotide and gap composition. In order to be able to distinguish the different propensities of different input multiple-sequence alignment to form spurious helices, we thus have to assign estimated p-values to the calculated log-likelihood values Λ. These p-values then allow the user to rank predicted helices for a given alignment and, more importantly, allow the ranking of helices deriving from different alignments. P-values are estimated by first re-aligning the original multiple-sequence alignment using T-Coffee [64]. The resulting alignment is then carefully randomised by swapping entire alignment column within bins of similar primary sequence conservation and gap composition [65]. Each randomised multiple-sequence alignment is then assumed to no longer contain any real helices. By default, Transat generates 500 randomised versions for a given input multiple-sequence alignment. The log-likelihood values of all candidate helices “detected” in these randomised multiple-sequence alignment are then aggregated to form a null-distribution of log-likelihood scores from which the p-values of the log-likelihood values for the original multiple-sequence alignment are derived, see pages 5 in [62] for more details.

2.4

Key features of Transat and summary:

First, Transat’s prediction accuracy is terms of sensitivity, PPV and FPR is almost independent of the length of the input alignment, see figure 5 in [62]. For an individual sequence, the number of potential bi-secondary RNA secondary structures grows exponentially with the sequence length [66]. For any noncomparative structure prediction method, we thus expect a marked decrease in PPV as function of increasing sequence length. As Transat works in a comparative way by taking a multiple-sequence alignment rather than a single RNA sequence as input, it alleviates this problem as there is a priori no reason to expect the number of evolutionarily conserved helices to grow quadratically with the sequence length. That said, the performance of any method used for automatically generating input alignments for Transat may very well (and strongly) depend on the length of the sequences and the alignment. In this case, one also expects a corresponding decrease in the prediction accuracy of Transat as Transat keeps input alignments fixed. Transat has been devised to tolerate some amount of alignment errors, see [62] and step one and two above, but it technically cannot fix any errors in the input alignment. Second, Transat works best for input alignment that correspond to a total tree length (TTL) of 2 or more, see figure 1. Below a TTL of 1, the primary sequences in the input multiple-sequence alignments tend to be too closely re17

lated both in terms primary sequence identity and lack of co-variation within base-paired regions, resulting in a sub-optimal performance of Transat. In practice, there is no doubt also an upper limit for the TTL beyond which we expect a decrease in predictive accuracy. This can be attributed to either (1) too many RNA structure variations between the sequences to render the comparative approach meaningful or (2) too significant deviations in terms of primary sequence identity to enable the assembly of a trustworthy input alignment for Transat. Both potential complications have to be considered on a case-by-case basis as they strongly depend on the cellular constraints on the primary RNA sequence and on the RNA secondary structure of the specific RNA family being studied. Cases of RNA structure variation in the setting of viral sequences are, for example, shown in [6]. Third, Transat can successfully identify known transient helices and mutually exclusive helices as shown for the hok and trp-attenuator data sets for which more than a single functional RNA secondary structure is known, see figure 2 and page 13 in [62] for more details. As each predicted helix is assigned an estimated p-value, the user of Transat can easily focus on those with the highest statistical significance. Fourth, Transat can be employed to also identify regions of the input multiple-sequence alignment that are devoid of any conserved RNA structure features, see e.g. Rfam alignments RF00018 (which is known to be bound by multiple copies of the CsrA protein that binds single-stranded regions) and RF00023 (which contains an open reading frame (ORF) that has to remain single-stranded), see figure 4. This cannot be readily achieved using MFE-based methods. Fifth, Transat can be used to identify potential novel, evolutionarily conserved transient features of the co-transcriptional folding pathways of homologous RNAs, see table 6 for a comparison of key alignment features and figure figure 5 for the predictive performance for folding pathway prediction methods ´fold, RNAKinetics and Kinwalker for new transient features identiKine fied by Transat. ´fold (blue) and RNAKinetics (green) have a similar prediction perKine ´fold) formance and manage to detect 76.5% (RNAKinetics) and 67% (Kine of the new transient features with a false positive rate (FPR) of 7% (both). The outlier is Kinwalker (red) which can detect only 28.1% of the new fea´fold and tures with a FPR of 4%. The performance plot shows that Kine RNAKinetics are capable of detecting truly novel transient features identified by Transat provided the default cutoff values (that were determined based on known features, see table 5) are relaxed and higher FPRs tolerated. Finally, despite predicting only individual, conserved helices, these Transat predictions can often be readily interpreted to suggest a time-ordered potential co-transcriptional folding pathway, see the example of the Cripavirus internal ribosomal entry site in figure 3.

18

Figure 1: Transat performance for helices as function of the total tree lengths (TTLs) for a test set of 990 artificially generated, structure encoding multiple-sequence alignments [62]. Different values of the TTL ranging between 0.5 and 16 are indicated by different colours, see the legend. The quality of helices predicted by Transat are measured in terms of MCC as function of the p-value threshold, see caption of table 5 for the MCC definition. The peak MCC-values are highest for TTLs of around 2 or more.

2.5

Goal 3: Detecting final RNA structures formed during co-transcriptional folding

Any of the three folding pathway prediction methods described for goal 1 can be employed in order to predict the final RNA secondary structures that are formed as the result of co-transcriptional folding pathways, see the text, tables and figures of section 2.1 above. All three methods work in a non-comparative way and take a single RNA as input. One major drawback of these methods, however, is that they can only handle rather short input sequences, up to 200 nt in case of ´fold and up to 1000 nt in case of Kinwalker. This is RNAKinetics and Kine primarily due to conceptual reasons: As any of these methods simulate/calculate an actual co-transcriptional folding pathway, any errors made in the early stages of the prediction (i.e. while the RNA is still relatively short) are magnified as the transcript is elongated and the prediction/simulation progresses. This limitation prevents their use on many naturally occurring RNA transcripts. We thus set out to devise a new prediction method CoFold [69] with the following goals. The new method should: • take as input a single RNA, 19

Figure 2: Conserved helices predicted by Transat for the hok data set for a p-value threshold value of right 10−3 [62]. The horizontal axis corresponds to the hok alignment that was used as input to Transat. Each arc corresponds to a single base-pair linking the corresponding pair of columns in the input alignment. Arcs above the x-axis correspond to known base-pairs, those below to newly predicted base-pairs (false positives). Known base-pairs that are missing from the prediction are shown as black arcs (false negatives). All predicted base-pairs are shown as arcs in a colour that corresponds to their respective, estimated p-values: < 10−5 green, < 10−4 blue and < 10−3 orange. All helices of the known structure are predicted well. In addition, Transat predicts three new helices with statistically significant p-values that may guide the co-transcriptional formation of the final RNA secondary structure. Figure made with R-chie [67, 68].

• predict as output a single RNA secondary structure, • take some effects of co-transcriptional folding explicitly into account, however, without predicting or simulating an actual co-transcriptional folding pathway, • be also guided to some degree by free energy minimisation and, hopefully, • outperform the prediction accuracy of existing MFE methods that do not take co-transcriptional effects into account, especially for longer sequences (> 200 nt) for which the prediction accuracy of non-comparative MFE methods tends to significantly decrease [49]. For this, we used the widely used, non-comparative MFE methods Mfold [17] and RNAfold [14] as a starting point. Both methods take a single RNA sequence as input and predict as output an RNA secondary structure. Both are guided by free energy minimisation alone and do not take any effects of cotranscriptional folding into account.

20

Figure 3: Transat predictions for the Cripavirus internal ribosomal entry site (IRES), Rfam family RF00458, for a p-value threshold of 10−3 [62]. The helices that render the known structure pseudo-knotted are correctly predicted by Transat. In addition, Transat predicts a few distinct helices that clash with helices of the known, final structure and that may help to guide the overall co-transcriptional structure formation, see the numbers next to the respective helices. The numbering of helices is not part of the Transat predictions, but an interpretation by the user. New helix 4 may yield to final helix 8, new helix 6 to final helix 7 and new helix 10 to final helix 12. These novel helices may thus serve as guiding transient helices during the co-transcriptional formation of the known, final RNA structure. Figure made with R-chie [67, 68].

2.6

How CoFold works:

Transat employs a modified version of the Zuker-Stiegler algorithm [14]. . The Zuker-Stiegler algorithm is also / employed by Mfold and RNAfold to calculate the thermodynamically most favourable, pseudo-knot-free RNA secondary structure for any given input RNA. For this, the algorithm (1) decomposes the overall free energy of any possible (pseudo-knot free) RNA secondary structure into a sum of free-energy contributions from various structural legolike building blocks (such as stacking interactions between pairs of adjacent base pairs, unpaired nucleotides and hairpin loops) and (2) employs a dynamic programming procedure to derive the RNA secondary structure that minimises the overall free energy for a given input RNA sequence. The latter takes O(L3 ) time, where L denotes the length of the input sequence. RNA structures predicted by Mfold and RNAfold are correspondingly called minimum-free energy (MFE) structures and the methods themselves referred to as MFE methods. Our modification of the Zuker-Stiegler algorithm consists of • altering the default free-energy contribution associated with any base-pair by a scaling function γ(d) whose value only depends on the distance d of the two base-paired nucleotides along the input sequence.

21

Figure 4: Transat predictions for a bacterial tmRNA (Rfam family RF00023 with alignment length 655 bp) for a p-value threshold value of 10−4 [62]. Transat captures the helices of the known pseudo-knotted RNA secondary structure well, see arcs above the horizontal line. The known open reading frame (ORF) is predicted to be completely devoid of statistically significant helices. This indicates that this region has to remain single-stranded for the tmRNA to function properly in the cell. The arcs below the horizontal line correspond to new base-pairs predicted by Transat. Arcs with a p-value < 10−5 are shown in green green, those with p-values < 10−4 in blue. Figure made with R-chie [67, 68].

Technically, this is achieved via a scaling function which is defined as d

γ(d) := α · (e− τ − 1) + 1 This is an exponential decay function with two free parameters α ∈]0, 1] and τ > 0 [nt−1 ]. As limd→∞ γ(d) = 1 − α, the value of α determines the range of values of γ: γ(d) ∈]1 − α, 1]. For example, setting α = 0.3 modifies energy values to vary within 70% to 100% of their original values. Parameter τ defines the steepness of the exponential decay as function of the distance d between base-pairing nucleotides. More precisely, τ is the nucleotide distance at which the max-value of γ(d = 0) = 1 is lowered by α(1 − 1/e). Nucleotides close to each other along the input sequence are thus more likely to base-pair than nucleotides further apart as a higher value of d implies a correspondingly lower scaling factor γ(d). This is what we expect as one overall effect of co-transcriptional folding: . 22

New transient Known transient Known final

cons.

gaps

0.915 0.767 0.758

0.009 0.023 0.021

canon. BPs 0.954 0.909 0.963

cov. 0.036 0.100 0.305

Table 6: Comparison of alignment features for known and novel RNA structure features of co-transcriptional folding pathways of homologous RNAs [37]: novel transient features (“New transient”), known transient features (“Known transient”) and known features of the final RNA secondary structure (“Known final”). As before, see caption of table 3, the conservation (cons.) indicates the average pairwise primary sequence conservation and gaps the fraction of gaps in the alignments. Canon. BPs specifies the fraction of canonical base-pairs within all base-paired alignment columns. Values for the covariation (cov.) range from -2 to +2 and measure the relative frequency of compensatory mutations that retain the base-pairing potential within pairs of base-paired alignment columns. The covariation is 0 when there is no variation in paired alignment columns, positive when they comprise compensatory mutations that retain the base-pairing ability and negative when they contain invalid base-pairs. The features for new transient helices are based on the six statistically most significant helices predicted by Transat for each alignment in [37] that have less than 50% overlap with the known structural features. New transient helices are more highly conserved than known transient and final helices, both in terms of primary sequence conservation (see cons. and gaps) and valid base-pairs (see canon. BPs) which is in line with their lower value of covariation.

Due to the speed of transcription / , nucleotides close to each other along the sequence have less difficulty “finding each other” than nucleotides further apart. We expect this effect to be more pronounced the faster the transcription speed is. This secondary effect can actually be captured via different values of α and τ , as was shown for a sub-set of viral sequences [69].

2.7

Key features of CoFold:

One goal in devising CoFold was to come up with an MFE method that also performs well for longer sequences, in particular those longer than 200 nt. We thus compiled a corresponding data set of 248 sequences with known functional RNA secondary structure with a large fraction of long sequences (average sequence length 776 nt, minimum length 110 nt, maximum length 3578 nt), see table 8 for details. Based on our detailed examinations [69], we conclude that: • Capturing one overall effect of co-transcriptional folding via the scaling function γ(d) leads to a significantly improved prediction performance of CoFold w.r.t. the state-of-the-art MFE method RNAfold, especially for sequences longer than 1000 nt, see table 7. Compared to RNAfold, Co23

1.0 0.0

0.2

0.4

TPR

0.6

0.8

1.0 0.8 0.6 TPR 0.4 0.2 0.0 0.00

0.02

0.04

0.06

0.08

0.10

FPR

0.00

0.025

0.05

0.075 FPR

0.10

0.125

0.15

Figure 5: ROC curves showing the predictive performance of the three ´fold (blue line), RNAKifolding pathway prediction methods: Kine netics (green line) and Kinwalker (red line), once for the known transient and final RNA structure features (left figure) and once for the new potential, transient helices identified by Transat (right figure) [37]. The true positive rate (TPR) is shown on the vertical axis and the false positive rate (FPR) on the horizontal axis. Note that the axes use different scales. The new transient features in the right figure correspond to the six statistically most significant helices predicted by Transat for each of the six multiple-sequence alignments. Each candidate helix has to have less than 50 % overlap with any of the known structural features.

Fold predicts 7% more base-pairs with 6% higher specificity, corresponding to an 6% increase in MCC. A further increase in PPV, TPR and MCC of 4% can be achieved with CoFold when switching from the default energy parameters by Mathews [70] to those derived by Andronescu [71] via a joint computational tweaking of 363 free energy parameters, compare the respective performance figures of CoFold and CoFold-A in table 7. When zooming into the sub-set of 23S RNAs of the data set (av. length 3069 nt, min. length 2882 nt, max. length 3578 nt), CoFold and CoFold-A increase the MCC of RNAfold on average by 8% and 12%, respectively, which is considerable. Figure 6 shows a comparative illustration of the two RNA secondary structures predicted by RNAfold and CoFold-A for the 23S RNA of gamma-proteobacteria Pseudomonas aeruginosa of length 2893 nt. • The free energy differences between the energies of the RNA secondary structures predicted by CoFold and CoFold-A and those predicted by RNAfold do not correlate with an improved prediction accuracy.

24

RNAfold RNAfold-A CoFold CoFold-A

TPR 0.4630 0.5202 0.5283 0.5780

FPR 0.000176 0.000160 0.000159 0.000145

PPV 0.3974 0.4476 0.4579 0.5006

MCC 0.4281 0.4817 0.4910 0.5370

Table 7: Prediction accuracy of CoFold, CoFold-A, RNAfold and RNAfold-A for base pairs in the sub-set of long sequences (length > 1000 nt) [69]. Performance specified in terms of true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV) and Matthews correlation coefficient (MCC), see the caption of table 5 for definitions.

Moreover, CoFold improves the prediction accuracy without significantly modifying the corresponding free energy of the respective RNA secondary structure predicted by RNAfold for the same RNA. The improvements in predictive power can thus be attributed to capturing effects of cotranscriptional folding. This is a conceptually important insight as it confirms the earlier hypothesis by Morgan and Higgs from 1996. They conjectured that the observed discrepancies between the free energies of evolutionarily conserved RNA secondary structure and those predicted by MFE methods for longer sequences “cannot simply be put down to errors in the free energy parameters used in the model”, but are likely due to effects of kinetic structure formation [49]. • The new algorithm underlying CoFold effectively only depends on one free parameter as the two free parameters α and τ of the scaling function γ turn out to be strongly correlated. This is evident when investigating the optimal prediction accuracy in terms of average MCC for different combinations of α and τ values, see figure 1 in [69]. The observed correlation between α and τ can be well described by a linear function α = a·τ +b with slope a = 6.1·10−4 ±2·10−5 and intercept b = 0.105±0.016 (R2 = 98.4%). Cross-validation experiments validate the robustness of parameter training and the overall approach. For optimal average MCC values, we determine α = 0.50 and τ = 640. These are the parameters we recommend for general use with CoFold and CoFold-A.

3

Availability

Transat, CoFold, CoFold-A and the RNA structure visualisation program R-chie are freely available for use via our web-server at www.e-rna.org. This is also where you can download the respective software for local use.

25

clade Bacteria Eukaryotes Virus Archea Chloroplast sum sequence length (nt) average minimum maximum

short seqs. (≤ 1000 nt) 54 97 20 16 0 187

long seqs. (> 1000 nt) 15 15 0 17 14 61

combined data set 69 112 20 33 14 248

247 110 628

2397 1245 3578

776 110 3578

Table 8: Key features of the data set used for evaluating CoFold and CoFold-A [69]. The set consists of 248 sequences with known RNA secondary structures. One special focus is the sub-set of 61 long sequences (length > 1000 nt; average 2397 nt, min length 1245 nt, max length 3578 nt) which consists of 16S and 23S rRNAs only (27 sequences of type 16S RNAs and 34 of type 23S RNAs). These sequences were extracted from multiple-sequence alignments from the Comparative RNA Web site (CRW) [72]. The 187 sequences corresponding to the sub-set of comparatively short sequences (length < 1000 nt) were derived from 21 biologically diverse Rfam families [73]. When compiling both sub-sets, great care was taken to ensure the resulting sequences are nonredundant (maximum pairwise sequence identity of 85%), have an RNA secondary structure that is well supported in terms of evolutionary evidence and are diverse in terms of represented clades.

26

Figure 6: Comparison of the two RNA secondary structure predicted by RNAfold (top) and CoFold-A (bottom) for the 23S RNA of the gamma-proteobacteria Pseudomonas aeruginosa of 2893 nt length [69]. Black arcs indicate base-pairs that are part of the known reference structure, but missing in the prediction (false negatives), red arcs correspond to incorrectly predicted base-pairs (false positives) and blue arcs to correctly predicted base-pairs (true positives). Neither RNAfold nor CoFold-A can handle pseudo-knotted RNA secondary structures. Base-pairs that render the RNA secondary structure pseudo-knotted are indicated by orange arcs. The multitude of red arcs on top of the horizontal line imply that RNAfold predicts many incorrect base-pairs spanning 100 nt or more. These mostly disappear with CoFold-A, see arcs shown below the horizontal line. Using CoFold-A rather than RNAfold, the MCC rises significantly from 43% to 58% (+15%) which can be attributed to a simultaneous increase of the true positive rate 27 (45% to 61%) and the positive predictive value (41% to 56%). The false positive rate of 0.01% is equally low for both prediction programs. Arc-plot made with R-chie [67, 68].

4

Summary and outlook

In summary, it quite remarkable that it is possible to detect conserved RNA secondary structure features of co-transcriptional folding pathways based on “naked” RNA sequences alone. Much of it is due to the power of the comparative approach . which can harness a variety of intrinsic sequence signals influencing co-transcriptional RNA structure formation. This implies that there is considerable evolutionary pressure to harness the benefits of co-transcriptional RNA structure formation. One of the key benefits of co-transcriptional RNA structure formation is to efficiently reduce the search space of attainable, potential RNA secondary structure that the co-folding transcript explores in vivo. This is particularly important for the folding of long transcripts. We are thus fortunate to conclude that Levinthal’s paradox [50] simply does not apply to RNA structures forming in vivo. It will thus be particularly exciting to see if future computational methods can harness the power of cotranscriptional RNA structure formation in the context of challenging cases such as long, protein-coding transcripts (where many interactions and processes are known to happen co-transcriptionally). It will also be exciting to see if the concept of co-translational protein structure formation [74, 75, 76] can be harnessed to significantly and conceptually improve protein structure prediction. / . None of the computational strategies introduced here aim to capture the extrinsic features as these can depend on minute details of the respective cellular environment which, in many cases, cannot easily be measured in vivo. Our approaches, however, allow us to discover the range of encoded RNA secondary structure features which can be alternatively expressed depending on the cellular context. The testing of specific hypotheses regarding the context-dependent expression of alternative RNA secondary structure is thereby significantly facilitated. / . Below, several ideas are introduced with the hope of inspiring future improvements of existing computational methods and of conceptually novel strategies. / As we showed for goal 1, computational methods for predicting actual co-transcriptional folding pathways can be combined with a comparative analysis to successfully identify transient and final RNA secondary structure features that have been conserved during evolution. Our main result, namely that homologous RNA transcripts not only assume similar final RNA secondary structures, but also reach their respective target structure via similar co-transcriptional folding pathways, opens the possibility to study these conserved transient structure features in more detail. This could, for example, by achieved in more fine-grained computational studies that examine the time-wise ordering of transient structure features to investigate: • (1) how conserved transient structure features guide the co-transcriptional formation of the final RNA secondary structure, in particular 28

– (1a) if and how they prevent mis-folded structure intermediates and/or – (1b) whether or not they serve as anchors for potential trans-acting interaction partners. The latter project will require a decent number of experimentally confirmed test cases of known trans-interactions, e.g. with proteins or other RNA transcripts, to compile a test set and a potential training set for purely computational analyses, see [77] for an example of how known trans-interactions can be captured using folding pathway prediction methods. As the existing folding pathway prediction methods can only be applied to rather short transcripts (< 200 nt length, or < 1000 nt in case of Kinwalker), there is also ample scope to improve the existing folding pathway prediction methods or, alternatively, to invent a conceptually new: • (2) The primary predictive power of the computational pipeline presented above derives from employing a comparative approach. This power could, for example, be better harnessed by devising the first comparative method for folding pathway prediction. Rather than taking a single RNA as input, it would take an (ideally, un-aligned) set of homologous RNAs as input and predict/simulate the corresponding co-transcriptional folding pathways in a comparative way. This approach is not only likely to result in a superior prediction accuracy, but could also remove the conceptual limitation of handling only rather short transcripts. There already exists a Markov-chain Monte-Carlo-based method SimulFold [78] for co-estimating homologous RNA secondary structures, multiplesequence alignments and evolutionary trees. This method works without considering co-transcriptional structure formation. This alignment-free method could inspire a similar, comparative method that could also capture some aspects of co-transcriptional RNA structure formation. We described how Transat can be used to successfully identify conserved helices in given multiple-sequence alignments of homologous transcripts that most methods for RNA secondary structure prediction miss due to technical constraints. Examples include mutually exclusive helices such as those in riboswitches, transient helices that are not part of the final reference structure or helices involved in pseudo-knotted configurations. Moreover, Transat does not force helices into a single, global RNA secondary structure which almost all methods for RNA secondary structure implicitly do. One conceptual bottle-neck of Transat (as well as most comparative RNA secondary structure prediction methods) is that it requires a high-quality input alignment in order to perform well. • (3) One idea to further improve Transat is thus to convert it into an alignment-free method while keeping (A) the probabilistic nature of the underlying algorithm and its predictions and (B) the time-and-memory efficiency of the algorithm. As Transat internally operates on helices 29

(rather than individual nucleotides or base-pairs) once the candidate helices for individual sequences have been predicted, see step 1 of its description in 2.3, this can be viewed as a realistic goal. CARNAC [79] was the first to apply the alignment-free idea in the context of RNA secondary structure prediction, albeit outside a probabilistic framework and without knowing the theoretical time-and-memory requirements of the underlying, heuristic algorithm. CoFold was started as a conceptual idea to see if the state-of-the-art method RNAfold could be improved by capturing a single, basic overall effect that co-transcriptional folding has on structure formation, namely the reachability of potential pairing partners along the RNA sequence. This works surprisingly well. There is thus ample scope for further improvements: • (4) One obvious effect of co-transcriptional folding that is not yet captured is the 5’-to-3’ directionality. This is one of the key features of co-transcriptional folding. Many state-of-the-art methods for RNA secondary structure use a dynamic programming algorithm to generate their predictions and CoFold is no exception. These dynamic programming algorithms, however, have inherent no notion of left and right, 5’ and 3’, but are symmetric w.r.t. both sequence ends. This makes it conceptually challenging to incorporate 5’-to-3’ biases into them. Recent technological advances allow us to investigate RNA structures in vivo with unprecedented detail and on a transcriptome-wide scale [80, 81, 82, 83]. . These new experimental protocols are capable of giving us a global and largely unbiased overview of the helices formed by RNA structures and trans RNA-RNA interactions for a specific point in time (there are some biases regarding the length of the helices and the base-pairs involved). The raw duplex reads corresponding to these helices still require considerable computational dis-entangling and interpretation as they do not directly translate into information on the RNA structures or trans-interactions of individual transcripts. Yet, these data should provide valuable input for testing a range of exciting hypotheses regarding the molecular mechanisms and functional roles of RNA structures and trans RNA-RNA interactions in vivo. /

Acknowledgements I thank past and present group members and collaborators for the privilege of working with them.

Funding This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. 30

References [1] A. Tuplin, J. Wood, D. J. Evans, A. H. Patel, P. Simmonds, Thermodynamic and phylogenetic prediction of RNA secondary structures in the coding region of hepatitis C virus, RNA 8 (6) (2002) 824–841. [2] A. Tuplin, D. J. Evans, P. Simmonds, Detailed mapping of RNA secondary structures in core and NS5B-encoding region sequences of hepatitis C virus by RNase cleavage and novel bioinformatic prediction methods, Journal of General Virology 85 (2004) 3037–3047. [3] S. You, D. D. Stump, A. D. Branch, C. M. Rice, A cis-acting replication element in the sequence encoding the NS5B RNA-dependent RNA polymerase is required for hepatitis C virus RNA replication, Journal of Virology 78 (3) (2004) 1352–1366. [4] L. Katz, C. B. Burge, Widespread selection for local RNA secondary structure in coding regions of bacterial genes, Genome Res 13 (9) (2003) 2042– 2051. [5] J. Pedersen, R. Forsberg, I. Meyer, J. Hein, An evolutionary model for protein-coding regions with conserved RNA structure, Mol. Biol. Evol. 21 (10) (2004) 1913 – 1922. [6] J. Pedersen, I. Meyer, R. Forsberg, P. Simmonds, J. Hein, A comparative method for finding and folding RNA secondary structures within proteincoding regions, Nucleic Acids Res. 32 (16) (2004) 4925 – 4936. [7] I. M. Meyer, I. Mikl´os, Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs, Nucleic Acids Research 33 (19) (2005) 6338–6348. [8] J. Johansson, P. Mandin, A. Renzoni, C. Chiaruttini, M. Springer, P. Cossart, An RNA thermosensor controls expression of virulence genes in Listeria monocytogenes, Cell 110 (5) (2002) 551–561. [9] F. Narberhaus, Translational control of bacterial heat shock and virulence genes by temperature-sensing mRNAs, RNA Biology 7 (1) (2010) 84–89. [10] A. Giuliodori, F. D. Pietro, S. Marzi, B. Masquida, R. Wagner, P. Romby, C. Gualerzi, C. Pon, The cspA mRNA Is a Thermosensor that Modulates Translation of the Cold-Shock Protein CspA, Molecular Cell 37 (1) (2010) 21–33. [11] G. Nechooshtan, M. Egrably-Weiss, A. Sheaffer, E. Westhof, S. Altuvia, A pH-responsive riboregulator, Genes & Development 23 (2009) 2650–2662. [12] K. L. Frieda, S. M. Block, Direct observation of cotranscriptional folding in an adenine riboswitch, Science 338 (6105) (2012) 397–400.

31

[13] A. Mazloomian, I. M. Meyer, Genome-wide identification and characterization of tissue-specific RNA editing events in D. melanogaster and their potential role in regulating alternative splicing, RNA Biology 12 (12) (2015) 1391–1401. [14] M. Zuker, P. Stiegler, Optimal computer folding of large RNA sequences using thermodynamic and auxiliary information, Nucleic Acids Research 9 (1981) 133–148. [15] I. Hofacker, W. Fontana, P. Stadler, S. Bonhoeffer, M. Tacker, P. Schuster, Fast Folding and Comparison of RNA Secondary Strutures, Monatshefte f¨ ur Chemie (Chemical Monthly) 125 (1994) 167–188. [16] D. Mathews, J. Sabina, M. Zuker, D. Turner, Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure, J. Mol. Biol. 288 (1999) 911–940. [17] M. Zuker, Mfold web server for nucleic acid folding and hybridization prediction, Nucleic Acids Research 31 (13) (2003) 3406–3415. [18] D. Lai, J. R. Proctor, I. M. Meyer, On the importance of co-transcriptional RNA structure formation, RNA 19 (2013) 1461–1473. [19] J. Boyle, G. Robillard, S. Kim, Sequential folding of transfer RNA. A nuclear magnetic resonance study of successively longer tRNA fragments with a common 5’ end, Journal of Molecular Biology 139 (1980) 601–625. [20] F. Kramer, D. Mills, Secondary structure formation during RNA-synthesis, Nucleic Acids Research 9 (19) (1981) 5109–5124. [21] S. Brehm, T. Cech, Fate of an intevening sequence ribonucleic-acid — excision and cyclization of the Tetrahymena ribosomal ribonucleic-acid intervening sequence in vivo, Biochemistry 22 (10) (1983) 2390–2397. [22] T. Pan, T. Sosnick, RNA folding during transcription, Annual Review of Biophysics and Biomolecular Structure 35 (2006) 161–175. [23] H. M. Al-Hashimi, N. G. Walter, RNA dynamics: it is about time, Current Opinion in Structural Biology 18 (2008) 321–329. [24] T. Pan, X. Fang, T. Sosnick, Pathway modulation, circular permutation and rapid RNA folding under kinetic control, Journal of Molecular Biology 286 (13) (1999) 721–731. [25] S. Heilman-Miller, S. Woodson, Effect of transcription on folding of the Tetrahymena ribozyme, RNA 9 (6) (2003) 722–733. [26] S. Heilman-Miller, S. Woodson, Perturbed folding kinetics of circularly permuted RNAs with altered topology, Journal of Molecular Biology 328 (2) (2003) 385–394. 32

[27] B. Lewicki, T. Margus, J. Remme, K. Nierhaus, Coupling of rRNA transcription and ribosomal assembly in vivo – formation of active ribosomalsubunits in Escherichia coli requires transcription of RNA genes by host RNA polymerase which cannot be replaced by T7 RNA polymerase, Journal of Molecular Biology 231 (1993) 581–593. [28] M. Y. Chao, M. Kan, S. Lin-Chao, RNAII transcribed by IPTG-induced T7 RNA polymerase is non-functional as a replication primer for ColE1-type plasmids in Escherichia coli, Nucleic Acids Research 23 (1995) 1691–1695. [29] F. Toulme, C. Mosrin-Huaman, I. Artsimovitch, A. Rahmouni, Transcriptional pausing in vivo: A nascent RNA hairpin restricts lateral movements of RNA polymerase in both forward and reverse directions, Journal of Molecular Biology 351 (1) (2005) 39–51. [30] J. Wickiser, W. Winkler, R. Breaker, D. Crothers, The speed of RNA transcription and metabolite binding kinetics operate an FMN riboswitch, Molecular Cell 18 (1) (2005) 49–60. [31] T. Wong, T. Sosnick, T. Pan, Folding of noncoding RNAs during transcription facilitated by pausing-induced nonnative structures, Proceedings of the National Academy of Science of the USA 104 (46) (2007) 17995–18000. [32] E. Mahen, J. Harger, E. Calderon, M. Fedor, Kinetics and thermodynamics make different contributions to RNA folding in vitro and in yeast, Molecular Cell 19 (1) (2005) 27–37. [33] E. Mahen, P. Watson, J. Cottrell, M. Fedor, mRNA Secondary Structures Fold Sequentially But Exchange Rapidly In Vivo, PLoS Biology 8 (2) (2010) e1000307. [34] D. Repsilber, S. Wiese, M. Rachen, A. Schroder, D. Riesner, G. Steger, Formation of metastable RNA structures by sequential folding during transcription: Time-resolved structural analysis of potato spindle tuber viroid ()-stranded RNA by temperature-gradient gel electrophoresis, RNA 5 (1999) 574–584. [35] I. M. Meyer, I. Mikl´os, Co-transcriptional folding is encoded within RNA genes, BMC Molecular Biology 10 (2004) 5. [36] T. Ro-Choi, Y. Choi, Structural elements of dynamic RNA strings, Molecules and Cells 16 (2) (2003) 201–210. [37] J. Y. A. Zhu, A. Steif, J. R. Proctor, I. M. Meyer, Transient RNA structure features are evolutionarily conserved and can be computationally predicted, Nucleic Acids Research 41 (12) (2013) 6273–6285. [38] J. Zhu, I. Meyer, Four RNA families with functional transient structures, RNA Biology 12 (1) (2015) 5–20.

33

[39] B. Sclavi, M. Sullivan, M. Chance, M. Brenowitz, S. Woodson, RNA folding at millisecond intervals by synchrotron hydroxyl radical footprinting, Science 279 (5358) (1998) 1940–1943. [40] S. Jackson, S. Koduvayur, S. Woodson, Self-splicing of a group I intron reveals partitioning of native and misfolded RNA populations in yeast, RNA 12 (12) (2006) 2149–2159. [41] S. Koduvayur, S. Woodson, Intracellular folding of the Tetrahymena group I intron depends on exon sequence and promoter choice, RNA 10 (10) (2004) 1526–1532. [42] S. Chauhan, S. Woodson, Tertiary interactions determine the accuracy of RNA folding, Journal of the American Chemical Society 130 (4) (2008) 1296–1303. [43] K. Semrad, R. Schroeder, Ribosomal function is necessary for efficient splicing of the T4 phage thymidylate synthase intron in vivo, Genes & Development 12 (9) (1998) 1327–1337. [44] S. Granneman, S. Baserga, Crosstalk in gene expression: coupling and coregulation of rDNA transcription, pre-ribosome assembly and pre-rRNA processing, Current Opinion in Cell Biology 17 (3) (2005) 281–286. [45] S. Udem, J. Warner, Cytoplasmic maturation of a ribosomal precursor ribonucleic-acid in yeast, Journal of Biological Chemistry 248 (4) (1973) 1412–1416. [46] M. Oakes, I. Nogi, M. Clark, M. Nomura, Structural alterations of the nucleolus in mutants of Saccharomyces-Cerevisiae defective in RNA polymerase-I, Molecular and Cellular Biology 13 (4) (1993) 2441–2455. [47] M. Kos, D. Tollervey, Yeast Pre-rRNA Processing and Modification Occur Cotranscriptionally, Molecular Cell 37 (6) (2010) 809–820. [48] R. Alexander, S. Innocente, J. Barrass, J. Beggs, Splicing-dependent RNA polymerase pausing in yeast, Molecular Cell 40 (4) (2010) 582–593. [49] S. Morgan, P. Higgs, Evidence for kinetic effects in the folding of large RNA molecules, Journal of Chemical Physics 105 (16) (1996) 7152–7157. [50] C. Levinthal, How to Fold Graciously, Mossbauer Spectroscopy in Biological Systems: Proceedings of a meeting held at Allerton House, Monticello, Illinois (1969) 22–24. [51] A. Mironov, L. Dyakonova, A. Kister, A kinetic approach to the prediction of RNA secondary structures, Journal of Biomolecular Structure & Dynamics 2 (5) (1985) 953–962. [52] A. Mironov, V. Lebedev, A kinetic model of RNA folding, Biosystems 30 (1-3) (1993) 49–56. 34

[53] L. Danilova, D. Pervouchine, A. Favorov, A. Mironov, RNAkinetics: a web server that models secondary structure kinetics of an elongating RNA, Journal of Bioinformatics and Computational Biology 4 (2) (2006) 589–596. [54] C. Flamm, W. Fontana, I. L. Hofacker, P. Schuster, RNA folding at elementary step resolution, RNA 6 (3) (2000) 325–38. [55] H. Isambert, E. D. Siggia, Modeling RNA folding paths with pseudoknots: application to hepatitis delta virus ribozyme, Proceedings of the National Academy of Science of the USA 97 (12) (2000) 6515–20. [56] A. Xayaphoummine, T. Bucher, F. Thalmann, H. Isambert, Prediction and statistics of pseudoknots in RNA structures using exactly clustered stochastic simulations, Proceedings of the National Academy of Science of the USA 100 (26) (2003) 15310–5. [57] A. Xayaphoummine, T. Bucher, H. Isambert, Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots., Nucleic Acids Res 33 (Web Server issue) (2005) W605–10. [58] A. Gultyaev, F. von Batenburg, C. Pleij, The computer-simulation of RNA folding pathways using a genetic algorithm, Journal of Molecular Biology 250 (1) (1995) 37–51. [59] M. Geis, C. Flamm, M. T. Wolfinger, A. Tanzer, I. L. Hofacker, M. Middendorf, C. Mandl, P. F. Stadler, C. Thurner, Folding kinetics of large RNAs., Journal of Molecular Biology 379 (1) (2008) 160–173. [60] P. Seibel, T. M¨ uller, T. Dandekar, J. Schultz, M. Wolf, 4SALE: A tool for synchronous RNA sequence and secondary structure alignment and editing, BMC Bioinformatics 7 (2006) 498. [61] R. Edgar, MUSCLE: Multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32 (5) (2004) 1792–1797. [62] N. J. P. Wiebe, I. M. Meyer, Transat — method for detecting the conserved helices of functional RNA structures, including transient, pseudoknotted and alternative structures, PLoS Computational Biology 6 (6) (2010) e1000823. [63] J. Felsenstein, Evolutionary trees from DNA sequences: a maximum likelihood approach, Journal of Molecular Evolution 17 (6) (1981) 368–376. [64] C. Notredame, D. Higgins, J. Heringa, T-Coffee: A novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology 302 (1) (2000) 205–217. [65] S. Washietl, I. Hofacker, Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics, Journal of Molecular Biology 342 (1) (2004) 19–30. 35

[66] C. Haslinger, P. F. S. PF, RNA structures with pseudo-knots: graphtheoretical, combinatorial, and statistical properties, Bulletin of Mathematical Biology 61 (3) (1999) 437–467. [67] D. Lai, J. R. Proctor, J. Y. Zhu, I. M. Meyer, R-CHIE: a web server and R package for visualizing RNA secondary structures, Nucleic Acids Research 40 (12) (2012) e95. [68] D. Lai, I. M. Meyer, e-RNA: a collection of web servers for comparative RNA structure prediction and visualisation, Nucleic Acids Research 42 (Web Server Issue) (2014) W373–W376. [69] J. R. Proctor, I. M. Meyer, CoFold: an RNA secondary structure prediction method that takes co-transcriptional folding into account, Nucleic Acids Research 41 (9) (2013) e102. [70] D. H. Mathews, J. Sabina, M. Zuker, D. H. Turner, Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure, J Mol Biol 288 (5) (1999) 911–940. [71] M. Andronescu, A. Condon, H. H. Hoos, D. H. Mathews, , K. P. Murphy, Efficient parameter estimation for RNA secondary structure prediction, Bioinformatics 23 (13) (2007) I19–I28. [72] J. Cannone, S. Subramanian, M. Schnare, J. Collett, L. D’Souza, Y. Du, B. Feng, N. Lin, L. Madabusi, K. Muller, N. Pande, Z. Shang, N. Yu, R. Gutell, The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs, BMC Bioinformatics 3 (2002) 2. [73] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, A. Bateman, Rfam: annotating non-coding RNAs in complete genomes, Nucleic Acids Research 33 (2005) D121–D124. [74] I. Krasheninnikov, A. Komar, I. Adzhubei, The role of redundancy of the genetic-code in determining the mode of cotranslational protein folding, Biochemistry-Moscow 54 (2) (1989) 131–143. [75] I. Krasheninnikov, A. Komar, I. Adzhubei, Nonuniform size distribution of nascent globin peptides, evidence for pause localization sites, and cotranslational protein-folding model, Journal of Protein Chemistry 10 (5) (1991) 445–453. [76] A. Fedorov, T. Baldwin, Cotranslational protein folding, Journal of Biological Chemistry 272 (52) (1997) 32715–32718. [77] R. J. W. Schoemaker, A. P. Gultyaev, Computer simulation of chaperone effects of Archael C/D box sRNA binding on rRNA folding, Nucleic Acids Research 34 (7) (2006) 2015–2026.

36

[78] I. M. Meyer, I. Mikl´os, Simulfold: Simultaneously Inferring an RNA Structure Including Pseudo-Knots, a Multiple Sequence Alignment and an Evolutionary Tree Using a Bayesian Markov Chain Monte Carlo Framework, PLoS Computational Biology 3 (8) (2007) e149. [79] H. Touzet, O. Perriquet, CARNAC: folding families of related RNAs., Nucleic Acids Research 32 (2004) W142–145. [80] M. Zubradt, P. Gupta, S. Persad, A. M. Lambowitz, J. S. Weissman, S. Rouskin, DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo, Nature 14 (1) (2016) 75–82. [81] J. G. Aw, Y. Shen, A. Wilm, M. Sun, X. N. Lim, K. L. Boon, S. Tapsin, Y. S. Chan, C. P. Tan, A. Y. Sim, T. Zhang, T. T. Susanto, Z. F. Z2, N. Nagaraj, Y. Wan, In Vivo Mapping of Eukaryotic RNA Interactomes Reveals Principles of Higher-Order Organization and Regulation, Molecular Cell 62 (4) (2016) 603–617. [82] Z. Lu, Q. C. Zhang, B. Lee, R. A. Flynn, M. A. Smith, J. T. Robinson, C. Davidovich, A. R. Gooding, K. J. Goodrich, J. S. Mattick, J. P. Mesirov, T. R. Cech, H. Y. Chang, RNA Duplex Map in Living Cells Reveals HigherOrder Transcriptome Structure, Cell 165 (5) (2016) 1267–1279. [83] E. Sharma, T. Sterne-Weiler, D. O’Hanlon, B. J. Blencowe, Global Mapping of Human RNA-RNA Interactions, Molecular Cell 62 (4) (2016) 618–626.

37

In silico methods for co-transcriptional RNA secondary structure prediction and for investigating alternative RNA structure expression

In silico methods for co-transcriptional RNA secondary structure prediction and for investigating alternative RNA structure expression

Recommend Documents