Computational Biology and Chemistry 83 (2019) 107118
Contents lists available at ScienceDirect
Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/cbac
Research Article
Ant colony optimization for predicting RNA folding pathways Seira Takitou a b
a,1
b,
, Akito Taneda *
T
Course of Electronics and Information Technology, Graduate School of Science and Technology, Hirosaki University, Hirosaki, Aomori 036-8561, Japan Graduate School of Science and Technology, Hirosaki University, Hirosaki, Aomori 036-8561, Japan
ARTICLE INFO
ABSTRACT
Keywords: Energy barrier height Secondary structure RNA switch MAX-MIN ant system Folding dynamics Pseudoknot
RNA folding dynamics plays important roles in various functions of RNAs. To date, coarse-grained modeling has been successfully employed to simulate RNA folding dynamics on the energy landscape composed of secondary structures. In such a modeling, the energy barrier height between metastable structures is a key parameter that crucially affects the simulation results. Although a number of approaches ranging from the exact method to heuristic ones are available to predict the barrier heights, developing an efficient heuristic for this purpose is still an algorithmic challenge. We developed a novel RNA folding pathway prediction method, ACOfoldpath, based on Ant Colony Optimization (ACO). ACO is a widely used powerful combinatorial optimization algorithm inspired from the food-seeking behavior of ants. In ACOfoldpath, to accelerate the folding pathway prediction, we reduce the search space by utilizing originally devised structure generation rules. To evaluate the performance of the proposed method, we benchmarked ACOfoldpath on the known nineteen conformational RNA switches. As a result, ACOfoldpath successfully predicted folding pathways better than or comparable to the previous heuristics. The results of RNA folding dynamics simulations and pseudoknotted pathway predictions are also presented.
1. Introduction It is well known non-coding RNAs that fold into characteristic secondary structures play crucial roles in living organisms. Some of those functional RNAs, there exist those mediate important biological processes, such as gene expression, through their conformational dynamics (Al-Hashimi and Walter, 2008). We can find various reports on natural functional non-coding RNAs (e.g. riboswitches (Breaker, 2011)) utilizing conformational changes in literature. Moreover, recent progress in the field of synthetic biology has realized tailor-made RNA sequences that function in vitro/in vivo (Isaacs et al., 2006; Chang et al., 2012). In this synthetic biological context, predicting the changes in the secondary structures, or the folding pathway, is expected to be useful not only for analyzing the functions of the RNAs, but also for designing finetuned artificial RNA switches (Findeißet al., 2018). Mathematically modeling the conformational change of RNA secondary structure is an important issue to understand the functionality of the RNA molecules. If we consider the conformational change as a trajectory, we can construct the folding pathway between given start and end structures. The energy barrier height is defined as the energy difference between the start structure and that with the highest energy
on the optimal folding pathway. The barrier height is a useful quantity for modeling the conformational kinetics of RNA secondary structure, since the transition rate between two secondary structures separated by an energy barrier can be calculated based on the barrier height. In the coarse-grained model of RNA folding kinetics, the transition rate between macrostates (basins) is expressed as the function of the energy barrier height between the secondary structures corresponding to the local minima of the macrostates. Furthermore, once a barrier tree, which is composed of energy barrier heights between the basins, is constructed, we can simulate folding dynamics, i.e. the time course of secondary structure distribution, based on a master equation (Flamm et al., 2000; Wolfinger et al., 2004). Such a coarse-grained model has been successfully applied to simulate co-transcriptional folding of relatively short RNA molecules (Hofacker et al., 2010; Wolfinger et al., 2018). Recently, the basin hopping graph has been proposed to approximately model the coarsegrained RNA folding landscape (Kucharík et al., 2014). A comprehensive review of the prediction methods for the kinetic folding of RNAs is given (Flamm and Hofacker, 2008) in which a variety of methods ranging from those operating at the resolution of single base pair to helix-based ones are described.
Corresponding author. E-mail address:
[email protected] (A. Taneda). 1 His contribution to this study was made while he was at Hirosaki University. ⁎
https://doi.org/10.1016/j.compbiolchem.2019.107118 Received 22 February 2019; Received in revised form 10 June 2019; Accepted 26 August 2019 1476-9271/ © 2019 Elsevier Ltd. All rights reserved.
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Energy barrier height is useful not only for simulation of RNA landscapes but also for RNA switch design. It has been proposed to utilize the predicted energy barrier height as the objective function of computational riboswitch design (Flamm et al., 2001; Rodrigo et al., 2012; Taneda, 2015), since the performance of riboswitches is thought to be affected by the transition rate between ON/OFF secondary structures. Thus, predicting RNA folding pathways and energy barrier heights is a fundamental technique for analyzing and designing functional RNAs with a conformational change. To predict RNA folding pathways, Morgan and Higgs have proposed a heuristic algorithm (Morgan and Higgs, 1998). Since their heuristic is a greedy algorithm that does not utilize a global optimization technique, we have to develop a more elaborate approach to predict the pathway with a lower energy barrier height. BARRIERS is the exact algorithm that can predict the accurate energy barrier height between the basins (Flamm et al., 2002). However, the application of BARRIERS has been limited to short RNA sequences due to large computational resources needed to run BARRIERS. In literature, we can find the other state-of-the-art methods for predicting energy barrier heights and RNA folding pathways: e.g. RNAtabupath (Dotu et al., 2010), RNAEAPath (Li and Zhang, 2012), and BHGbuilder (Kucharík et al., 2014). These energy barrier height prediction methods except for BARRIERS are heuristics. Since RNA energy barrier height prediction is known to be NP-hard (Maňuch et al., 2009), it is reasonable to utilize heuristics to solve the RNA folding pathway prediction problem instead of using exact methods. Ant colony optimization (ACO) (Dorigo and Stützle, 2004) is a bio-inspired heuristic which has successfully been applied to combinatorial optimal path problems such as the traveling salesman problem. Since the RNA folding pathway prediction is a kind of optimal path problem, it is expected that ACO is a good approach for solving RNA folding pathway prediction problems. The aim of the present study is to develop a novel ACO-based algorithm for finding the RNA folding pathway with a low energy barrier height and to evaluate its prediction performance through the benchmark test on natural RNA sequence data.
Fig. 1. An example for explaining structure distance. Red parentheses indicate non-common base pairs. The structure distance between these two structures is 4 bp.
parameters. We evaluate the free energies of the pseudoknot-free secondary structures by using RNAeval in Vienna RNA Package 2.4.12 (Lorenz et al., 2011). ACOfoldpath can use the current version (the 2004 version (Mathews et al., 2004)) and the old one (the 1999 version (Mathews et al., 1999)) of the Turner parameters. If a version is not mentioned explicitly, we use the 2004 version throughout the present paper. The ACOfoldpath algorithm was developed based on the 1999 version of the Turner parameters, then tested with the 2004 version of the parameters. In accordance with the previous papers of RNA folding pathway prediction, the option -d1 is used with the 1999 version of the Turner parameters (Dotu et al., 2010; Li and Zhang, 2012). For pseudoknotted folding pathway prediction, computeEnergy (we use the default setting) in HotKnots 2.0 (Ren et al., 2005) is used to evaluate the free energy of generated structures. 2.2. RNA folding pathway An RNA folding pathway between the start structure A and end structure B is represented by a sequence of secondary structures, A = S0, S1, .. ., Sl = B, where Si(i = 1, .. ., l − 1) are intermediate structures. In the present study, we consider pathways in which the distance between neighboring structures, Si and Si+1, is one base pair. The pathways in which all intermediate structures are composed of the base pairs included in structures A or B are referred to as direct pathways (an example is shown in Fig. 2). The other pathways are called indirect pathways (e.g. Fig. 3). The saddle point is the structure with the highest free energy in the optimal pathway between two secondary structures (throughout the present paper, we also refer to the approximate saddle point, which is the structure with the highest energy in the predicted pathway between two secondary structures, as a “saddle point” for short). The energy barrier height of the pathway between the structures A and B is calculated as E(Ssp) − E(A), where E(Ssp) and E(A) are the free energies of the saddle point Ssp and the start structure A, respectively. In accordance with the coarse-grained model of RNA landscape, the transition rate between two states can be approximated by the Arrhenius equation k ∝ e−(E(Ssp)−E(A))/RT. Examples of RNA folding pathway are shown in Figs. 2 and 3 . The energy barrier heights of these pathways are (-10.80) - (-17.90) = 7.10 [kcal/mol] and (-11.30) - (-17.90) = 6.60 [kcal/mol], respectively. As can be seen from these examples, there are cases where inclusion of the base pairs not included in start and end structures lowers the energy barrier height. Therefore, examining indirect pathways, whereas they widen the search space and increase the computational costs, is an effective approach for finding the pathway with a low barrier height. In the present study, we explore the conformational space to find the pathway with the lowest barrier height.
2. Methods 2.1. RNA secondary structure An RNA sequence with a length of n nucleotides is represented as a string, s = s1s2. .. sn, composed of four types of letters (si ∈ {A, C, G, U}, 1 ≤ i ≤ n). By forming base pairs, RNA sequences can have various secondary structures. In the present study, we consider canonical (A·U and C·G) and wobble (G·U) base pairs and ignore the other types of base pairs. We denote a base pair by (i, j), where i and j (1 ≤ i < j ≤ n, j − i > 3) are nucleotide positions in the RNA sequence. An RNA secondary structure S is represented as a set of base pairs (e.g. S= {(1,8),(2,7),(3,6)}). Unpaired nucleotides are called loop nucleotides. A stem is defined as a sequence of consecutive base pairs, (i, j), (i + 1, j − 1), (i + 2, j − 2), .. ., (i′, j′), and denoted by [i, i′, j′, j]. A stem [i, i′, j′, j] is said to be maximal if neither (i − 1, j + 1) nor (i′ + 1, j′ − 1) exists. Two base pairs (i, j) and (k, l) are said to be pseudoknotted if i < k < j < l or k < i < l < j. A secondary structure is called a pseudoknotted secondary structure if it includes a pseudoknotted base pair. A base pair (i, j) is said to clash with a base pair (k, l), if (i, j) and (k, l) form a pseudoknot or share the same nucleotide position (i.e. i = k or i = l or j = k or j = l). Moreover, the notion of the base pair clashing can be extended to that between a base pair and a set of base pairs (Dotu et al., 2010): a base pair x is said to clash with a set of base pairs, A, if there exists a base pair y ∈ A that clashes with x. The structure distance between two secondary structures of an RNA sequence is defined as the number of the base pairs that are not shared by the two secondary structures. An example of the structure distance is shown in Fig. 1. Stability of the secondary structure S is evaluated by the free energy E(S) based on the nearest neighbor thermodynamic
2.3. Ant colony optimization ACO, which is one of the swarm intelligence algorithms, is a graph search algorithm mimicking the food-seeking behavior of ants (Dorigo and Stützle, 2004). In ACO, multiple ant agents iteratively walk on the graph and set pheromones, which evaporate as time passes, onto their pathways in order to search the given graph for an optimal pathway. Here we briefly describe the most basic ACO algorithm, Ant System (AS) (Dorigo and Stützle, 2004), with an example of the traveling 2
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 2. An example of a direct pathway (taken from the predicted pathway of s15). The energy value in bold is that of the saddle point, E(Ssp) [kcal/mol]. This pathway was obtained with the 1999 version of the Turner parameters (Mathews et al., 1999) and the option -d1.
salesman problem (TSP). For other practical applications of ACO, see (Dorigo and Stützle, 2004). In TSP, our aim is to find the shortest tour among all possible tours such that a salesman passes all cities and finally returns to its starting point. After the search graph construction, ACO iteratively performs the solution construction and the pheromone update until satisfying given stopping criterions. Stopping criterions include the maximum number of iterations, the maximum run time, and convergence of the solution quality. Search graph construction: This step constructs the graph to be searched by the ant agents. In TSP, nodes and edges correspond to cities and the roads between the cities, respectively. We set the length and the initial pheromone τ0 on each edge. Solution construction: Ant agents generate solutions by walking on the graph. Initially, ants are located at randomly selected nodes. Each ant repeatedly moves to a linked node that has not been visited yet; after visiting all nodes, each ant finally returns to the start node. The node to be moved to is stochastically selected by using the transition probability Pij calculated based on the pheromone values and heuristic information values on the edges:
[ ij] [ ij]
Pij =
l Ni
[ il] [ il]
(j
Ni ),
(1)
where τij is the pheromone value on the edge between node i and j, ηij is the inverse of the length of the edge, Ni is the set of the neighboring nodes that have not been visited yet by the ant, α and β are the parameters for determining the relative influence of the pheromone and heuristic information values. In the case of TSP, edges with a high pheromone value and short edge length have large transition probabilities. Pheromone update: Here pheromones evaporate and then we increase pheromone values in accordance with the pathways generated by the ant agents. The updating equations are as follows: m
(1
ij
)
ij
k ij ,
+ k=1
k ij
=
1/ Ck , if (i, j) T k; 0, otherwise;
(2)
(3)
where ρ is a pheromone evaporation rate (0 < ρ ≤ 1), m is the number of ant agents, Ck is the length of the pathway generated by ant k, Tk is
Fig. 3. An example of an indirect pathway (taken from the predicted pathway of s15). The parentheses in red are base pairs that are included neither in the start structure nor in the end structure. This pathway was obtained with the 1999 version of the Turner parameters (Mathews et al., 1999) and the option -d1.
3
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
the set of edges included in the pathway generated by ant k. By using these updating equations, pheromone values on less-favored pathways gradually decrease and larger pheromone values are assigned to the edges that are selected by many ants and are included in good pathways. In the present study, we developed a novel RNA folding pathway prediction algorithm ACOfoldpath based on the MAX-MIN Ant System (MMAS) (Stützle and Hoos, 2000), which is a variant of AS. The main differences between MMAS and AS are as follows: (1) In the pheromone update, MMAS adds pheromones only to the best-so-far tour or the best tour in each iteration. (2) In MMAS, there are upper and lower limits of the pheromone value on each edge (τmax and τmin); if a pheromone value exceeds the limits, the pheromone value is set to the value of the corresponding limit. (3) Reinitialization of pheromone values to the upper limit value is utilized; reinitialization is typically triggered when the search shows the stagnation behavior. MMAS updates pheromone values based on a single solution (the best-so-far solution or the best solution in each iteration) as described in (1). This might cause quick convergence to a local optimum. The lower and upper limits are expected to prevent trapping to local optima in such a situation. MMAS uses an estimate of 1/ρC* to define τmax, where C* is the length of the optimal tour; τmax is used as τ0. According to the performance comparison based on TSPLIB, MMAS is one of the bestperforming variants of AS (Dorigo and Stützle, 2004).
end structure is Send ∖ Scommon. The common base pairs will be used as the initial secondary structure Sinitial in the subsequent “structure generation” step. Examples of common and non-common base pairs are shown in Figs. 5 and 6 . The base pair index is the serial number of the base pairs; the base pair index values are determined in such a way that the base pairs are sorted in ascending order of the 5’-side nucleotide positions of the base pairs; the base pair index values are separately determined in the start and end structures (an example is shown in Fig. 6). The stem list of the non-common base pairs is defined in this preprocessing (an example is shown in Fig. 7). Generally speaking, a long loop raises the free energy of RNA secondary structure, while a long stem makes the structure stable. Hence, to avoid formation of a lonely base pair, which easily makes a long loop and a very short (a length of one base pair) stem, it is reasonable that non-common base pairs are added to the initial structure in order of extending a common stem if possible. The order of non-common base pair addition is determined in accordance with the positional relationships between the non-common and the common stems. One of the following three types of the order of non-common base pair addition is assigned to each non-common stem. (Type 1) approach type: for the non-common stems closely located inside a common stem; this type adds base pairs from the outside to the inside. (Type 2) expansion type: for the non-common stems closely located outside a common stem; this type adds base pairs from the inside to the outside. (Type 3) enumeration type: for the non-common stems other than the cases of types 1 and 2; this type enumerates all combinations of partial stems. Examples of these types are shown in Figs. 8 and 9 .
2.3.1. Graph representation of RNA folding pathways To apply ACO to RNA folding pathway prediction, we define a search graph G = (V, E). Each node in V corresponds to one secondary structure. E is a set of edges each of which links two nodes with a structure distance of one. Although pheromone values are assigned to the edges in the original form of ACO, we assign pheromone values on the nodes in our ACOfoldpath algorithm. This is because, in our algorithm, free energies are not assigned to edges, but assigned to nodes. This pheromone assignment method means that “which node should be visited” is more important than “which edge should be passed through” to find good folding pathways.
2.4.3. Computation of energy upper bound To obtain the upper bound of the free energy, direct pathways are generated by using the following simple procedures for addition and deletion of the non-common base pairs: (simpleDirect 1) first, start base pairs are deleted, then end base pairs are added; (simpleDirect 2) first, end base pairs are added; if a clash between a start base pair and an end base pair occurs, delete the start base pair then add the end base pair; finally, remaining start base pairs are deleted. These simple direct pathways are consistent with the StemCounter representation described in the next “Structure generation” subsection, i.e. any pathway generated by simpleDirect 1 or 2 can be expressed by StemCounter values. In addition to the direct pathways generated by simpleDirect 1 and 2, we consider the pathway predicted by the findpath program taken from the Vienna RNA Package (Lorenz et al., 2011). Energy barrier heights of the generated pathways are obtained by evaluating the free energies of the secondary structures. The pathway with the lowest barrier height in the generated direct pathways is stored as the best-so-far pathway. Then, the highest free energy value in the best-so-far pathway is adopted as the energy upper bound that is used in the next structure generation procedure.
2.4. ACOfoldpath algorithm 2.4.1. Flowchart of ACOfoldpath For a given input triplet (an RNA sequence s, start secondary structure Sstart, and end secondary structure Send), the present algorithm explores the search graph to find the best folding pathway with the lowest energy barrier height. The flowchart of ACOfoldpath is shown in Fig. 4. ACOfoldpath is composed of the following steps: (1) preprocessing; (2) computation of the energy upper bound; (3) direct pathway prediction; (4) expansion of the search space; (5) pathway prediction for the expanded search space. Steps (4) and (5) are iteratively performed and this repeat between Steps (4) and (5) is terminated when no improved folding pathway is found for a given number of continuous iterations or when the search space expansion fails. During the run, to efficiently explore the search space by reducing the nodes of the search graph, the energy upper bound is updated (decreased) whenever a pathway with a lower barrier height is found, while we try to expand the search space in Step (4). In Step (5), we can predict indirect pathways by considering the base pairs other than those of the start and end structures.
2.4.4. Structure generation For a given set of the initial structure Sinitial and the stem list, we generate secondary structures of which the V of the search graph G is comprised. This structure generation process utilizes an integer array, StemCounter, like an |L|-digit counter, where |L| is the number of stems in the stem list. By setting appropriate stem state index values to the digits of StemCounter, a secondary structure is represented as Sinitial ∪ BP (StemCounter), where BP(StemCounter) is a set of the base pairs that corresponds to the stem state index values of StemCounter. In the example of Fig. 10, “Sinitial = (structure A) and StemCounter = (10, 0, 2) for the stem list” give the structure F. The StemCounter value of the maximal stem that is obtained by extending stem x is denoted as MaxStemCounter(x): e.g. MaxStemCounter(stem1) = 10 in this example; this is because the stem 1 is the enumeration-type stem and has a maximum length of 4 bp; such a stem has the eleven stem states as shown in Fig. 10 (c). Secondary structures are generated in accordance with the
2.4.2. Preprocessing First, the free energies of the start and end structures are computed. If the free energy value of the end structure is lower than that of the start structure, these structures are swapped. Then, we determine the “common base pairs” Scommon = Sstart ∧ Send and the “non-common base pairs” belonging to the start structure or the end structure. A set of the “non-common base pairs” belonging to the start structure is expressed as Sstart ∖ Scommon and that belonging to the 4
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 4. The flowchart of ACOfoldpath algorithm. Direct pathway prediction (Step (3)), expansion of the search space (Step (4)), and pathway prediction for the expanded search space (Step (5)) are indicated in blue, red, and green, respectively. getSaddle(κ, Π) scans the best-so-far pathway Π from κ (= the start structure, a saddle point, or a spike point) to the end structure to find the next saddle point or spike point. If any next structure is not found in Π, the function returns a null. λ is easily derived from the κ structure. It is noted that a new stem (i.e. the stems that do not include common/start/end base pairs) can be added to the stem list only when we failed to add a common stem to the stem list; this switching of the stem addition schemes is realized by the first ‘A stem was added?” if statement. Loop A is the loop for retrying another κ value; Loop B is that for resetting κ and λ for the improved best-so-far pathway; Loop C is that for counting the number of the ACO steps to catch the stagnation behavior.
Fig. 5. Examples of the common and non-common structures (taken from the secondary structures of s15). A: the common structure; B: the start structure; C: the end structure; D: the non-common base pairs belonging to the start structure; E: the non-common base pairs belonging to the end structure.
following steps: (Step 1) push the start structure into a stack P; (Step 2) pop a structure p from the top of the P (if no structure exists in P, stop this structure generation procedure); then generate a set of the secondary structures, Xneighbor, neighboring the structure p, where the neighboring structures are defined in accordance with the stem type of each stem; (Step 3) if x in Xneighbor satisfies one of the following conditions, delete x from Xneighbor: (Condition 1) x has the energy higher than the energy upper bound; (Condition 2) x has already appeared in
V; we check this by using a hash table that has the StemCounter values as keys; (Condition 3) x violates structure generation rules described below; (Step 4) push all structures included in Xneighbor into P; go to Step 2. In Step 2, we can efficiently detect neighboring structures, since a move is defined as addition/deletion of a single base pair in the present study, and any such move changes only a single digit of StemCounter. To reduce the search space, for the non-common base pairs of the start structure, we allow only deletion; for those of the end structure, only 5
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
•
Fig. 6. Examples of the non-common base pairs for the case of Figure 5. This list is called the base pair list.
•
addition is allowed. In the case of stem types 1 and 2 (approach and expansion), at most, it is adequate to check an increment and decrement of each digit of StemCounter. (e.g. in Fig. 10 (c), the neighboring structures of stem state 2 of approach type are stem states 1 and 3). In stem type 3 (enumeration), in addition to such two structures, two other structures can be neighboring structures. (e.g. in Fig. 10 (c), the neighboring structures of stem state 5 of enumeration type are stem states 2, 4, 6, and 9). Structure generation rules: During the structure generation procedure, we reject some generated structures to exclude structures with a high free energy and to reduce the size of V according to the following rules based on our assumptions and experiences. We call them structure generation rules.
•
•A
stem x is said to be short if 0 < StemCounter (x) < MaxStemCounter(x). In general, it is likely that short stems increase the free energy of a secondary structure, while maximal stems are favorable to achieve a low free energy value (a stem x is said to be maximal if StemCounter(x) = MaxStemCounter(x)). The maximal stem of a start or end stem is defined for each stem in the stem list (i.e. the common base pairs are not used in the maximal stems of the start and end structures). From this point of view, a folding pathway including a secondary structure containing multiple short stems (i.e. non-maximal stems) could have a high energy barrier height (e.g. Pathway 1 in Fig. 11). In contrast, a folding pathway whose secondary structures have only a single short stem is expected to have a low energy barrier height (Pathway 2 in Fig. 11). Moreover, in many cases start stems overlap with end stems, where stem A is said to overlap with stem B if a base pair of the maximal
stem of stem A clashes with a base pair of the maximal stem of stem B. These overlapping start and end stems are closely located in the RNA sequence. In such a case, the short start stem and the short end stem could cooperatively form an energetically favorable structure. So that, we assume that a secondary structure in which both start and end stems are simultaneously short stems does not have a very high free energy value. (Pathway 3 in Fig. 11). By summarizing these rules, we impose the rule (the main rule) such that “multiple short stems belonging to the same structure (the start or end structure) do not appear simultaneously in a secondary structure” to our structure generation process. In other words, only a single short stem is allowed in the stems belonging to the same structure (the start or end structure). As an exception, we do not count the short stem in the rule mentioned above if the short stem overlaps with an extra stem, where the extra stems are the stems other than the start and end stems (extra stems are added to the stem list in the “expansion of the search space” step described below). The maximal stem of the extra stem is defined for each stem in the stem list. This exception rule (Exception rule 1) was derived from our observation in which such an exception appears in the folding pathway with a low barrier height (e.g. Fig. 12). Another exception rule is the case of neighboring two short stems that belong to the same structure (start or end), i.e. such short stems in a structure are allowed in the structure generation procedure. A stem A is said to neighbor a stem B if b3A + 1 = b5B or b3B + 1 = b5A , where b3A is the outermost 3’-side nucleotide position of the maximal stem of stem A (b5B , b3B , and b5A are those of stem B and 5’-side, stem B and 3’-side, and stem A and 5’-side, respectively); the maximal stems are defined in accordance with the stem list. This exception rule (Exception rule 2) allows only one set of two neighboring stems per one structure (e.g. Fig. 13). It is noted that Exception rule 2 is applied to the short stems that do not follow Exception rule 1. That is, the short stems that follow Exception rule 1 or 2 are not counted as short stems (since they are “exceptions”) when we check whether the main rule is violated or not. The secondary structures having a clashing base pair are not allowed in pseudoknot-free folding pathway prediction.
In addition to the structures generated by structure generation rules, we add all the secondary structures included in the pathway predicted by the findpath program in the Vienna RNA Package to the generated structures. In this structure generation step, we use the maximum number of structures specified by STRmax. When the total number of generated structures is larger than the maximum number of structures determined by STRmax, we terminate this structure generation step to reduce the computational costs. 2.4.5. Search graph construction Here we construct the graph structure to be searched by ACO. By Fig. 7. Examples of the non-common stems for the case of Figure 5. This list is called the stem list. The stem type column indicates the base pair addition scheme used in the structure generation procedure. The MaxStemCounter value (used in the structure generation procedure) is defined for each stem in accordance with the corresponding stem type.
6
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 8. Examples of the positional relationships between stems. The positional relationships distinguish the three base pair addition schemes (approach, expansion, and enumeration). Parentheses in black are common base pairs; those in red are non-common base pairs.
linking the nodes of V with the edges each of which has a distance of 1 bp, the search graph is obtained. This linking can easily be done by using the property of StemCounter such that the structures with a structure distance of 1 bp can be found by incrementing or decrementing each StemCounter value by one in the case of type-1 and type-2 stems, since this increment or decrement adds or removes one base pair; in the case of type-3 stems, for a given node, we can easily find the StemCounter values of the neighboring nodes in O(1) time by using a two-dimensional array containing the neighboring information among the stem state index values. By utilizing these properties of StemCounter and a hash table in which a string obtained by concatenating all element values of a StemCounter is used as a key, we efficiently find the edges of the graph G for a given set of nodes V. The linking between the structures predicted by the findpath program and those generated by structure generation rules is performed in an all-against-all manner.
where MaxE is the maximum free energy value of the nodes in V, Estart and Eend are the energies of the start and end structures, respectively, and ρ (0 < ρ ≤ 1) is an evaporation rate. Pathway construction: In this step, multiple ant agents walk on the graph G. The ants continue to move to the linked nodes until the ants reach the end node. The next nodes are randomly selected, where the transition probability is calculated based on the free energy value and the pheromone value of each node. The transition probability Pij from node i to node j is defined as follows:
Pij =
l Ni
= MaxE
Estart + Eend / , 2
[ l] [ l ]
(j
Ni ),
(5)
where τl is the pheromone value on node l, ηl = 0.1+MaxE - El, El is the free energy of node l, Ni is a set of the nodes that have not been visited and are linked to node i, and α and β are the parameters for determining the relative influence of the pheromone and energy values. To efficiently search the graph, we impose the following constraints in our ACO: (1) do not move to the nodes already visited by the ant agent during the ant's walk; (2) do not add the non-common base pairs of the start structure; do not remove the non-common base pairs of the end structure; both addition and deletion are allowed for common stems and extra stems; (3) do not move to the node with a free energy value higher than the energy upper bound found before each ant's walk starts. Pathway evaluation: The pathways obtained by the ant agents are evaluated by the following three criterions. The smaller number indicates higher priority: (i) a lower energy barrier height, (ii) a shorter pathway length, and (iii) a lower “average free energy over the pathway”. When the energy barrier heights tie, the one with a shorter
2.4.6. Performing ACO calculation One iteration of the ACO is comprised of pathway construction, pathway evaluation, and pheromone update. If no improvement has ACO been observed during continuous Nstop iterations, the ACO procedure is terminated. Otherwise, we continue the ACO run up to the maximum ACO number of iterations, Niteration . In the present study, we do not use the reinitialization of MMAS. On all the nodes, we assign the initial pheromone τ0: 0
[ j] [ j ]
(4)
Fig. 9. Examples of the base pair addition order in the three base pair addition schemes. From left to right, the MaxStemCounter values of these examples are 4, 4, and 10, respectively. 7
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 10. An example for explaining StemCounter. (a) An RNA sequence and structures. Structure F is an example of the generated structure in the structure generation step. Base pairs in red are the non-common ones. (b) A stem list. (c) Stem states. In these stem states, a stem is depicted as a hairpin structure regardless of its actual structure.
pathway length is a better pathway. Moreover, if the pathway lengths also tie, the one with a lower “average free energy over the pathway” is better. If we find a pathway better than the best-so-far pathway, the best-so-far pathway is updated. Pheromone update: The pheromone values are updated by the following equations: i
(1 best, i
=
) i+
MaxE 0,
best, i
AvEbest , if i Tbest otherwise.
where E(Si) is the energy of the i-th structure in the best-so-far pathway; (2)the structure with the highest sum of the energy differences between neighboring structures in the best-so-far pathway (a spike point):
= argmax 0 < i < n | Si
E (Si 1)
E (Si + 1).
(9)
Based on Sκ, we determine the subpoint Sλ in such a way that Sλ is the structure with the smallest number of base pairs in Sκ−1, Sκ, and Sκ+1. If a tie occurs, λ = κ − 1 is adopted. κ and λ are determined in getSaddle(κ, Π) (see Fig. 4). getSaddle(κ, Π) scans the best-so-far pathway Π from κ (= the start structure, a saddle point, or a spike point) to the end structure to find the next saddle point or spike point. If we start with the start structure or a saddle point, this scan first searches for the next saddle point; if it does not exist, we rescan the Π from the start structure to the end structure to find a spike point. If we start with a spike point (i.e. getSaddle(κ, Π) is called with κ = a spike point), this scan searches for the next spike point. If the saddle point or spike point is found, this function returns the saddle point or spike point as a κ structure. If any next saddle or spike structure is not found in Π, the function returns a null. λ is easily derived from the κ structure (e.g. Fig. 14). If the end structure has the highest energy in the best-so-far pathway, ACOfoldpath is terminated, since there is no folding pathway that has a barrier height lower than that, E(end) − E(start), of the bestso-far pathway.
(6) (7)
Tbest is a set of the nodes included in the best-so-far pathway, and AvEbest is the average energy of the nodes in Tbest. If the pheromone value is higher than τmax = τ0 or lower than τmin = τ0/60, the pheromone value is reset to the corresponding value. 2.4.7. Pathway analysis We analyze the best-so-far pathway to determine the stem (a base pair is dealt with a stem with a length of 1 bp) to be added to the stem list to consider more secondary structures in the search graph. This best-so-far pathway analysis is performed in the getSaddle function. We detect structures with a high energy (saddle points) and those with a locally high energy (spike points) to find a possible bypass that avoids these energetically unfavorable structures: (1)the structure with the highest energy in the best-so-far pathway (a saddle point):
= argmax 0 < i < nE (Si ),
saddlepoints2E (Si )
(8)
Fig. 11. Pathway construction examples for the case of two stems. 8
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 12. An example of indirect pathway (taken from the predicted pathway of dsrA). The yellow stem is the extra stem; the green stem is the start structure stem overlapping with the extra stem; the red stem is another start structure stem. In this pathway, we can see that the two start structure stems in green and red are simultaneously short. This pathway was obtained with the 1999 version of the Turner parameters (Mathews et al., 1999) and the option -d1.
2.4.8. Expansion of search space Before each ACO step (indicated in green in Fig. 4) in the main loops of ACOfoldpath, we try to add extra stems to the stem list to expand the search space explored by the ACO step. This expansion procedure is highlighted in red in Fig. 4. Extra stems can include (i) the common stems and (ii) the stems not included in the common, start, and end stems. Addition of a base pair is dealt with as addition of a stem with a length of 1 bp. Extra stems are determined by the five procedures described below. The procedures from (1) to (3) add common stems to the stem list, while the procedures (4) and (5) add a stem for indirect pathways to the stem list. The extra stems added here are dealt with as type-3 (enumeration type) stems. (1) FindCommonPairA adds the common base pair b such that (i) Estart,−b < Estart ∨ Eend,−b < Eend and (ii) b = argminc∈ZEstart,−c + Eend,−c, where Estart,−b is the free energy value of Sstart ∖ {b}, Eend,−b is that for the end structure, Z = Scommon ∖ BPextra, and BPextra is a set of the base pairs of the extra stems already contained in the stem list. The base pair b selected in this procedure is removed from Sinitial and added to BPextra. When start and end structures are not local minima, where a local minimum is the structure such that any addition or deletion of one base pair does not decrease the energy, removal of a common base pair may lower the energies of the folding pathway. (2) FindCommonStem adds the common stem σ such that (i)
Eκ,−σ < Eκ, (ii) |σ|≥ 2 bp, and (iii) σ = argminξ∈ΨEκ,−ξ, where Eκ,−σ is the free energy value of Sκ ∖ σ, Ψ = MAXSTEM(Z ∩ Sκ), MAXSTEM(Y) is a set of all maximal stems included in a set Y of base pairs, and Sκ is a set of base pairs of the κ structure. The base pairs of the stem selected in this procedure are removed from Sinitial and added to BPextra. Removal of a common stem may lower the energies of the structures around Sκ. (3) FindCommonPairB adds the common base pair b such that (i) Eκ,−b < Eκ and (ii) b = argminc∈XEκ,−c, where X = Z ∩ Sκ. The base pair selected here is removed from Sinitial and added to BPextra. Removal of a common base pair may lower the energies of the structures around Sκ. (4) FindNewPair first finds the base pair b such that (i) Eλ,+b < Eκ and (ii) b = argminc∈WEλ,+c, where Eλ,+b is the free energy value of Sλ ∪ {b}, W = {w BPall (Sstart Send BPextra S )|w does not clash with S } (instead of this, we use W = {w BPall (Sstart Send BPextra S )} for pseudoknotted folding pathway prediction), and BPall is a set of all possible base pairs of the input RNA sequence s. Then, this procedure adds the stem σ (where σ ∈ MAXSTEM(W)) that includes the base pair b. The base pairs of the selected stem σ are added to BPextra. Here we add a stem that can be included in indirect pathways. In general, unstable intermediate structures of a stem, such as a lonely base pair, can raise the energy. To prevent the raise of the energy, we select the stem that has the base pair with a small energy increase (we expect that the base
Fig. 13. Examples of allowed or not allowed neighboring short stems. These short stems belong to the start (or end) structure; these short stems do not overlap with any extra stem. Example 1 has two neighboring short stems; Example 2 violates structure generation rules since this structure contains more than one set of neighboring two short stems. 9
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 14. An example of the best-so-far folding pathway containing a spike point and two saddle points. Step numbers 1 and 12 correspond to the start and end structures, respectively. The two saddle points have a degenerate value of free energy. For this folding pathway, first, the getSaddle function returns the saddle point structure at step number 8 as a κ structure. Then the getSaddle function returns that at step number 10. After that, if the getSaddle function is called again for the folding pathway, the getSaddle function returns the spike point at step number 4. In this figure, the energies are shifted to set the energy of the start structure to zero.
Fig. 15. Schematic illustrations for explaining the pseudoknot filter. In each figure, a part of a whole secondary structure is depicted as an RNA arc diagram. (a) and (c) are allowed pseudoknots; (b) and (d) are not allowed cases. (a) An H-type pseudoknot that does not contain any recursive structure. (b) An example of an H-type pseudoknot that contains a recursive structure. In this example, the dotted stem enclosed by the solid one is a recursive structure. (c) A K-type pseudoknot. (d) An L-type pseudoknot. L-type pseudoknots do not pass our pseudoknot filter.
Table 1 The comparison of the predicted energy barrier heights in kcal/mol. For RNAEAPath and ACOfoldpath, the best one in five independent runs is shown for each RNA. The best values are indicated in bold. RNA
Findpath
RNAEAPath
ACOfoldpath Default Enhanced param. param.
BARRIERS
rb1 rb2 rb3 rb4 rb5 hok SL attenuator s15 s-box leader thiM leader ms2 HDV dsrA ribD leader amv alpha operon HIV-1 leader SV-11
24.2 9.1 20.9 15.4 24.2 26.5 11.6 8.2 8.2 4.3 9.1 1.6 16.2 9.6 7.8 6.3 3.7 8.5 42.2
23.5 8.0 15.3 15.4 20.8 24.5 11.6 8.2 8.2 4.3 6.6 1.6 16.2 8.2 7.4 6.3 3.7 3.5 42.3
20.8 7.8 14.3 15.4 20.8 24.5 11.6 8.2 8.2 4.3 7.5 1.6 15.1 8.1 7.3 5.2 3.7 3.5 42.2
7.8 11.6 8.2 6.7 4.3 1.6 8.1 5.2 3.7 -
20.8 7.8 14.3 15.4 20.8 24.5 11.6 8.2 8.2 4.3 7.3 1.6 15.1 8.1 7.3 5.2 3.7 3.5 42.2
Table 2 The run time t (seconds) of ACOfoldpath (default param.). In addition to run times, the length len (nt) of an RNA sequence, the iteration number # esteps of the “expansion of the search space” steps, the mean number mn of the nodes in one ACOfoldpath run, the structure distance dist (bp) between start and end structures, and temporary file size fs (bytes) are also shown. For t, # esteps, and mn, the averages of the results of five independent ACOfoldpath runs are shown; fs is the maximum value in independent five runs.
10
RNA
t
len
# esteps
mn
dist
fs
rb1 rb2 rb3 rb4 rb5 hok SL attenuator s15 s-box leader thiM leader ms2 HDV dsrA ribD leader amv alpha operon HIV-1 leader SV-11
1.32 2.61 32.59 0.18 1.33 1966.21 2.70 2.85 0.26 5.31 106.90 0.38 96.77 4.90 55.32 6.07 0.66 22.38 249.41
148 113 141 146 201 396 56 73 74 247 165 73 153 85 304 145 130 280 115
4.0 13.0 18.4 2.0 4.0 6.0 12.0 12.0 2.0 12.0 19.0 2.0 24.6 15.0 14.0 17.8 12.0 13.0 12.0
1015.2 571.8 4966.0 76.0 1015.2 499985.2 256.1 277.9 67.0 517.9 9714.9 64.0 7045.8 514.1 8960.9 426.9 74.2 4677.5 27599.9
31 27 25 8 31 81 29 21 11 19 38 11 38 19 32 18 11 30 86
150533 228746 2381749 14297 198550 416487829 75317 61419 7937 400731 4405187 7998 5335181 132167 6505053 195964 72911 5831277 10203010
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Table 3 The comparison of the run times in seconds. For RNAEAPath and ACOfoldpath, the mean of five independent runs is shown. In parentheses, standard deviations are shown. These times were measured on a PC with Intel Xeon CPU E5-2665 (2.40GHz) and 132 GB of RAM. RNA
len
Findpath
RNAEAPath
ACOfoldpath default param.
enhanced param.
rb1 rb2 rb3 rb4 rb5 hok SL attenuator s15 s-box leader thiM leader ms2 HDV dsrA ribD leader amv alpha operon HIV-1 leader SV-11
148 113 141 146 201 396 56 73 74 247 165 73 153 85 304 145 130 280 115
0.69 0.10 0.32 0.02 0.78 20.84 0.18 0.04 0.02 0.03 1.03 0.02 0.78 0.04 0.62 0.02 0.01 0.35 16.75
1210.03 (134.78) 480.84 (130.43) 879.26 (164.92) 0.93 (0.40) 744.70 (438.89) 7557.00 (2097.72) 329.48 (81.24) 314.24 (104.42) 31.36 (24.72) 574.12 (132.49) 1450.59 (230.40) 289.69 (72.27) 655.02 (178.67) 296.02 (111.97) 1551.64 (247.05) 365.96 (121.80) 219.23 (72.71) 1165.71 (242.54) 1377.29 (233.00)
1.32 (0.14) 2.61 (0.24) 32.59 (12.00) 0.18 (0.05) 1.33 (0.14) 1966.21 (30.77) 2.70 (0.78) 2.85 (0.90) 0.26 (0.00) 5.31 (1.19) 106.90 (1.78) 0.38 (0.00) 96.77 (11.37) 4.90 (1.08) 55.32 (1.32) 6.07 (1.18) 0.66 (0.02) 22.38 (0.08) 249.41 (2.48)
2.36 (0.13) 10.30 (0.05) 226.47 (79.48) 0.18 (0.00) 2.69 (0.10) 3827.20 (99.04) 3.58 (0.03) 4.63 (0.03) 0.27 (0.01) 10.11 (0.04) 352.14 (62.46) 0.39 (0.01) 168.74 (79.71) 5.73 (0.07) 201.82 (34.39) 6.40 (0.67) 1.51 (0.03) 63.80 (0.15) 1108.21 (28.15)
pair plays the role as a seed upon the formation of the stem). (5) FindNewStem adds the stem σ such that (i) Eλ,+σ < Eκ, (ii) |σ|≥ 2 bp, and (iii) σ = argminξ∈ΦEλ,+ξ, where Φ = MAXSTEM(W). The base pairs of the stem selected in this procedure are added to BPextra. As the result of these procedures, the selected stems are added to the end of the stem list. If a common stem is added to the stem list, the corresponding base pairs are removed from Sinitial (it is noted that if the common stem added here is invalidated later, Sinitial recovers the base pairs of the common stem). In each procedure, if a stem to be added is not found, no stem is added. The base pairs that have been added to the stem list once by one of the five procedures are no longer selected as the base pairs of an extra stem at the subsequent processing in ACOfoldpath; i.e. we never remove any base pair from BPextra and never remove any stem from the stem list. Not to add too many stems to the stem list at once (this can cause a huge search graph), we do not use all of the five procedures mentioned above simultaneously. If procedure (1), (2), or (3) finds a stem to be added, (4) and (5) are not performed; otherwise, one of (4) and (5) is performed; if the procedure (one of (4) and (5)) also does not find a stem to be added, another one of (4) and (5) is performed. The order of invoking (4) and (5) is changed at each iteration. At most, this addition scheme for the five procedures simultaneously adds two base pairs and a stem to the stem list. In this “expansion of the search space” step, our aim is to add stems by which the ant agents detour around the saddle point of the best-sofar pathway to the search graph G (a local search heuristic for such a detour has been proposed (Takeda et al., 2005)).
BARRIERS 831.72 21.44 13.27 61.18 747.20 1.84 4.75 1463.97 6.28 -
grained model of RNA folding dynamics. Once transition rates between local minimum structures are given, we can numerically simulate the dynamics of secondary structure distribution based on a master equation. In this study, we utilize the Arrhenius equation to compute the transition rate kxy from macrostate y to macrostate x:
k xy = exp( Eybarrier x /RT),
(10)
where is the energy barrier height, R is the gas constant, and T is the absolute temperature. The local minimum structures for the RNA folding dynamics simulation are generated by RNASLOpt (Li and Zhang, 2011). We randomly select ten secondary structures from the local minimum structures generated by RNASLOpt for efficient benchmarking. We relax each of the ten structures in a steepest descent manner to ensure the local optimality of the ten structures. By using the generated local minima, we compute the switching time (Michálik et al., 2017) between the structure with the lowest energy and that with the highest energy in each set of ten structures by treekin (Wolfinger et al., 2004). We calculated kxy for all pairs in ten local minima to construct the transition rate matrix. For more sophisticated way where adjacent basins are considered for computing the transition rate matrix, see (Kucharík et al., 2014). Simulation by treekin is performed by setting the initial population of the structure with the highest energy as P10 = 1 and those of the other macrostates to 0. The switching time between the structure with the lowest energy and that with the highest energy is obtained by computing the shortest time at which the structure with the lowest energy achieves a higher population than that of the structure with the highest energy (i.e. when P1 > P10 is achieved).
Eybarrier x
2.4.9. Invalidation of extra stems If no improved folding pathway is found by an ACO step, the extra stems that were added to the stem list just before the ACO step are invalidated. The base pairs of the invalidated stems are no longer used at the subsequent processing for the structure generation, since such base pairs in the extra stems failed to improve the folding pathway in the past trial. It is noted that if a common stem is invalidated, Sinitial recovers the base pairs of the invalidated common stem.
2.4.11. Pathway prediction allowing pseudoknots Pseudoknots can be included in our pathway prediction by extending the structure generation step of ACOfoldpath with respect to the following two points: (i) using a pseudoknot prediction method to evaluate the free energies of generated structures; in this study, we adopt computeEnergy in HotKnots 2.0 (Ren et al., 2005) for this purpose; (ii) replacing the clashing base pair check in the structure generation rules by a pseudoknot filter. This pseudoknot filter examines whether a newly added base pair in the structure generation process forms a pseudoknot with the other base pairs in the whole secondary structure. In our structure generation procedure, secondary structures
2.4.10. Performance evaluation by switching time RNA folding dynamics can be modeled as a Markov process among macrostates, where the local minimum structures are used as the macrostates. Energy barrier height is a key parameter of such a coarse11
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 16. An exact folding pathway of SL. The base pairs in red (that are included neither in the start structure nor in the end structure) clash with the blue base pairs in Fig. 17. This pathway was obtained with the 1999 version of the Turner parameters (Mathews et al., 1999) and the option -d1.
12
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 17. The folding pathway predicted by ACOfoldpath for SL. The base pairs included in the saddle point (indicated by square brackets in blue) clash with the red base pairs of Fig. 16 (for ease of comparison, a structure taken from Fig. 16 is shown below the pathway by ACOfoldpath and is indicated by an asterisk). In this situation, we cannot add the ‘good’ red stems of Fig. 16 to the stem list, since the blue ‘saddle point’ base pairs clash with the red stems. This pathway was obtained with the 1999 version of the Turner parameters (Mathews et al., 1999) and the option -d1.
are generated one by one; by applying the pseudoknot filter every time a new structure is generated we can rigorously eliminate pseudoknots belonging to not allowed pseudoknot classes. Schematic illustrations of the pseudoknot filter are shown in Fig. 15. The pseudoknot filter allows H-type and K-type pseudoknots (Kucharík et al., 2016), where recursive structures in pseudoknot are not allowed. The findpath program of Vienna RNA Package is not used in the pseudoknotted pathway prediction.
Table 4 The switching times in an aribtray unit obtained by Findpath and ACOfoldpath. RNA
Findpath
ACOfoldpath
rb2 SL attenuator s15 ms2 dsrA amv alpha operon
4029.9 660.6 6.3 257.4 83.3 128.8 8.7 2094.0
4.8 586.0 0.6 178.7 5.4 40.9 4.1 219.0
2.4.12. Software Non-profit academic users can obtain the ACOfoldpath software at 13
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
among the three heuristic methods. In eleven RNA sequences, ACOfoldpath gave the results comparable to the best values of the previous heuristic methods. Except for s15, all the energy barrier values obtained by BARRIERS are reproduced by ACOfoldpath (default param.). For thiM leader, RNAEAPath predicted the lowest barrier height. The run times of ACOfoldpath are shown in Table 2, which were measured on a PC with Intel Xeon CPU E5-2665 (2.40GHz) and 132 GB of RAM. As can be seen from the table, ACOfoldpath quickly outputs the optimized pathway except for hok; the run time is noticeably long for hok. This high computational cost is mainly due to the large number of the nodes in G; such a large number of nodes can be generated if the structure distance between start and end structures is large. In addition, a slow convergence, which can be seen from a large value of the iteration number of the expansion steps, # esteps, in Table 2, and a long RNA sequence length can also cause long run times. The comparison of run times is shown in Table 3. As clearly shown in Table 3, ACOfoldpath is much faster than RNAEAPath and BARRIERS. To examine the parameter dependence of the results, we ran ACOfoldpath with enhanced parameter settings. This set of enhanced parameters has larger values (the number of ants = 600, the maximum ACO = 10; these number of ACO iterations = 400, Ntrials = 20, and Nstop values are twice of the default settings) and a smaller value (ρ = 0.1) compared to the default settings; a smaller ρ was set to avoid to be trapped in local minimum solutions. As shown in Table 1, these enhanced parameter settings slightly improved the result for thiM leader, and the other RNA types showed no improvement. These results indicate that the default parameter settings give well-converged results with smaller computational costs. In these performance comparisons, we do not include the results of RNAtabupath (Dotu et al., 2010), since RNAtabupath is implemented with the old Turner parameters (Mathews et al., 1999) and the predicted barrier heights were poor (data not shown); no barrier height is better than those predicted by ACOfoldpath (default param.) with the 1999 version of the Turner parameters and the option -d1; only four RNA types have the same barrier heights with those by ACOfoldpath (default param.). RNAtabupath was tested with an upper bound value = 30, 50, or 100 kcal/mol. The best barrier height in 1,000 independent runs is adopted as the result of each upper bound value. RNAtabupath (1,000 independent runs) was 1,000 times or more slower than ACOfoldpath in average. As an additional computational cost, the current version of ACOfoldpath outputs temporary output files. These temporary files can have large sizes in accordance with the input RNA sequence and structures; for example, the maximum file sizes of the temporary files for the nineteen RNAs are shown in Table 2. Thus, an adequate disk space is necessary to run the current version of ACOfoldpath. In the folding pathways predicted by ACOfoldpath, those of seven RNAs (rb1, rb2, rb3, HDV, dsrA, ribD leader, and amv), which have the energy barrier heights lower than the energy barrier heights by the previous heuristics, are indirect pathways or the pathway including deletion of a common base pair. These results imply that the extra stems added by “expansion of the search space” step successfully lower the energies of the pathways around saddle points. However, if the saddle point structure has a base pair that clashes with the stem included in the pathway with a lower energy barrier height, the search space expansion procedure of ACOfoldpath cannot add the extra stem necessary for improving the pathway including the saddle point. In such a case, it is difficult to improve the pathway by ACOfoldpath. As an example, Fig. 16 and 17 show an exact pathway and a pathway predicted by BARRIERS and by ACOfoldpath (the 1999 version of the Turner parameters (Mathews et al., 1999) and the option -d1 are used), respectively. In Fig. 16, the base pairs in red are those appear only the exact pathway. It can be seen from Fig. 17 that these base pairs in red cannot be added to the pathway obtained by ACOfoldpath since the stems in red overlap with the stem in blue. The switching times computed based on the predicted barrier
Table 5 The barrier height b (kcal/mol), run time t (seconds) and the mean number mn of nodes in one ACOfoldpath run obtained by the pseudoknotted version of ACOfoldpath. For t and mn, the mean of five independent ACOfoldpath runs is shown. The maximum temporary file size in the predictions in this table is approximately 44 MBytes. RNA
b
t
mn
rb1 rb2 rb3 rb4 rb5 hok SL attenuator s15 s-box leader thiM leader ms2 HDV dsrA ribD leader amv alpha operon HIV-1 leader SV-11
20.93 7.20 17.70 16.86 21.43 29.22 9.04 9.28 6.50 5.21 10.99 6.75 15.81 8.05 9.09 5.66 6.05 7.50 73.94
411.1 828.5 12282.2 4.3 1092.4 74208.4 698.1 48.7 42.7 229.0 37795.6 11.3 1730.7 286.2 8198.2 346.5 46.8 397.6 17306.2
2154.0 824.8 6295.9 86.0 10606.5 49910.0 551.2 881.0 83.0 1461.7 11983.5 188.0 1634.5 188.0 4789.0 430.4 287.0 2173.7 49905.0
the ACOfoldpath website.2 3. Results and discussion We benchmarked ACOfoldpath on natural RNA sequences, start and end structures used in Dotu et al. (2010), Li and Zhang (2012) (eighteen RNAs including the sequences and structures taken from Mandal et al. (2003), Voss et al. (2004), Wakeman et al. (2007)) and Höner Zu Siederdissen (2013) (a single RNA:SV-11 Biebricher and Luce, 1992). In total, nineteen sets are included in the benchmark dataset. The parameters we used are α = β = 1, an evaporation rate ρ = 0.4, the number of artificial ants = 300, the maximum number of ACO iterations = 200, STRmax = 50,000 (for hok, we used an increased value STRmax = 500,000 since more than 50,000 structures are generated), the number of trials for “expansion of the search space” step Ntrials = 10, and the ACO number of ACO iterations for the stopping criterion Nstop = 5. For pseudoknot-free pathway prediction, we used the 2004 version of the Turner energy parameters (Mathews et al., 2004) with the option -d2. The predicted energy barrier heights for the nineteen RNA sequences are summarized in Table 1. For the performance comparison, the results of BARRIERS, which is the exact method, Findpath and RNAEAPath, which are state-of-the-art heuristics, are also included in the table. Since BARRIERS predicts the pathway between local minima, first we relax the start and end structures to each basin in a steepest descent manner, then predict the pathway between the basins. Computational costs of BARRIERS are very high, hence we focus on the relatively short RNA sequences. A ‘-’ indicates that BARRIERS was not employed to the corresponding RNA. In the table, each energy barrier height obtained by ACOfoldpath (or RNAEAPath) is the best one in independent five runs; for all the nineteen RNAs ACOfoldpath always gave the same energy barrier height (i.e. the same value was obtained for five independent runs), where only exceptions are a single run of ACOfoldpath (enhanced param.) for thiM leader and two runs for rb3. RNAEAPath showed slight instability for seven RNAs (rb1, rb2, rb3, rb5, thiM leader, ribD leader, and SV-11). As can be seen from Table 1, ACOfoldpath (default param.) successfully found the folding pathways with the best barrier heights for seven RNA sequences (rb1, rb2, rb3, HDV, dsrA, ribD leader, and amv) 2
http//rna.eit.hirosaki-u.ac.jp/acofoldpath/ 14
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
Fig. 18. The pseudoknotted folding pathway predicted by ACOfoldpath for SL.
heights are summarized in Table 4. To avoid large computational costs, we focus on relatively short RNA sequences in these simulations. For comparison, the results of ACOfoldpath (default param.) and Findpath are shown. As can be seen from the table, ACOfoldpath outperformed Findpath; i.e. the switching times obtained by ACOfoldpath are smaller than those of Findpath. The barrier heights, run times, and the mean number of nodes for the pseudoknotted pathway prediction by ACOfoldpath are shown in Table 5. Due to the higher computational costs of pseudoknot prediction, pseudoknotted pathway prediction is much more demanding calculation compared to pseudoknot-free pathway prediction (for hok, we used STRmax = 50,000 to reduce the computational cost of pseudoknotted pathway prediction). In these predictions, we obtained only one example (SL) whose predicted pathway has pseudoknots (Fig. 18). This example implies that pseudoknotted pathway prediction may find better folding pathways compared to pseudoknot-free pathways. However, we obtained very high energy barrier values for some RNAs, e.g. SV-11. These results could be due to a small STRmax value, since inclusion of pseudoknots drastically widens the conformational space to
be searched for and a small STRmax value cuts off a large part of the conformational space. In the present paper, we do not compare BHGbuilder (Kucharík et al., 2016) that can predict pseudoknotted folding pathways with ACOfoldpath, since the energy models adopted in these folding pathway prediction methods are different from each other. The present method has a great flexibility, since it is a heuristic algorithm based on evolutionary computation. Although the shift move of base pair (Flamm et al., 2000) is also an attractive extension of ACOfoldpath, the StemCounter representation of stems is not suitable for the shift move. Conclusion We have developed ACOfoldpath, which is a heuristic algorithm for predicting RNA folding pathways. We predicted the folding pathways with low energy barriers for nineteen RNA conformational switches and compared the performance of ACOfoldpath with those of other existing methods. As a result, in seven cases, our method outperformed the 15
Computational Biology and Chemistry 83 (2019) 107118
S. Takitou and A. Taneda
existing state-of-the-art methods. In eleven cases, our method found the pathways with the barrier height comparable to the best value of the existing methods. Thus, through the benchmarking on the natural RNA sequences that have been used to assess the performance of the RNA folding pathway prediction methods, we successfully showed that our heuristic algorithm can obtain good folding pathways comparable to or better than those by the previous methods. In addition, the switching times based on an RNA folding dynamics simulation were measured to assess the predicted barrier heights in the realistic application. In this assessment, ACOfoldpath outperformed Findpath for eight RNAs. Pseudoknotted folding pathways can also be predicted by ACOfoldpath. We tested the pseudoknotted pathway prediction against the nineteen RNAs, and obtained a pseudoknotted pathway as an optimal result of SL. The start and end structures used in the pseudoknotted pathway prediction of the present study contain no pseudoknotted base pair. Validating the prediction performance for pseudoknotted start and end structures is an interesting future research issue. In ACOfoldpath, we select the base pairs/stems that most lower the free energy in “expansion of the search space” step. Although this approach for the search space expansion works well in the present benchmarking, it has an ad hoc nature and there is room for improvement in this strategy. By using STRmax, we arbitrarily cut off a part of possible nodes if the number of generated nodes reaches the maximum number determined by STRmax during the structure generation step. A better graph generation technique is favorable not to cut off a part of nodes. The research in these directions may be an interesting future challenge in the field of RNA folding pathway prediction.
4731–4741. Hofacker, I.L., Flamm, C., Heine, C., Wolfinger, M.T., Scheuermann, G., Stadler, P.F., BarMap:, 2010. RNA folding on dynamic energy landscapes. RNA 16 (7), 1308–1316. Wolfinger, M.T., Flamm, C., Hofacker, I.L., 2018. Efficient computation of co-transcriptional RNA-ligand interaction dynamics. Methods 143, 70–76. Kucharík, M., Hofacker, I.L., Stadler, P.F., Qin, J., 2014. Basin Hopping Graph: A computational framework to characterize RNA folding landscapes. Bioinformatics 30 (14), 2009–2017. Flamm, C., Hofacker, I.L., 2008. Beyond energy minimization: approaches to the kinetic folding of RNA. Monatshefte für Chemie - Chemical Monthly 139 (4), 447–457. Flamm, C., Hofacker, I.L., Maurer-Stroh, S., Stadler, P.F., Zehl, M., 2001. Design of multistable RNA molecules. RNA 7 (2), 254–265. Rodrigo, G., Landrain, T.E., Jaramillo, A., 2012. De novo automated design of small RNA circuits for engineering synthetic riboregulation in living cells. Proc. Natl. Acad. Sci. U. S. A. 109 (38), 15271–15276. Taneda, A., 2015. Multi-objective optimization for RNA design with multiple target secondary structures. BMC Bioinformatics 16 (1), 280. Morgan, S.R., Higgs, P.G., 1998. Barrier heights between ground states in a model of RNA secondary structure. J. Phys. A 31, 3153–3170. Flamm, C., Hofacker, I.L., Stadler, P.F., Wolfinger, M.T., 2002. Barrier Trees of Degenerate Landscapes. Zeitschrift für Physikalische Chemie 216, 155–173. Dotu, I., Lorenz, W.A., Van Hentenryck, P., Clote, P., 2010. Computing folding pathways between RNA secondary structures. Nucl. Acids Res. 38 (5), 1711–1722. Li, Y., Zhang, S., 2012. Predicting folding pathways between RNA conformational structures guided by RNA stacks. BMC Bioinformatics 13 (Suppl 3), S5. Maňuch, J., Thachuk, C., Stacho, L., Condon, A., 2009. NP-completeness of the direct energy barrier problem without pseudoknots. In: Deaton, R., Suyama, A. (Eds.), Proc. 15th International Meeting on DNA Computing and Molecular Programming, Vol. 5877 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 106–115. Dorigo, M., Stützle, T., 2004. Ant colony optimization. MIT Press, Cambridge, MA. Mathews, D.H., Sabina, J., Zuker, M., Turner, D.H., 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288 (5), 911–940. Lorenz, R., Bernhart, S.H., Höner Zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P.F., Hofacker, I.L., 2011. ViennaRNA Package 2.0. Algorithms Mol. Biol.: AMB 6, 26. Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M., Turner, D.H., 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl. Acad. Sci. U. S. A. 101 (19), 7287–7292. Ren, J., Rastegari, B., Condon, A., Hoos, H.H., 2005. HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA (New York, N.Y.) 11 (10), 1494–1504. Stützle, T., Hoos, H.H., 2000. MAX-MIN Ant System. Future Gen. Comput. Syst. 16 (8), 889–914. Takeda, T., Ono, H., Sadakane, K., Yamashita, M., 2005. A Local Search Based Barrier Height Estimation Algorithm for DNA Molecular Transitions. Lecture Notes in Computer Science, Vol. 3892 359–370. Li, Y., Zhang, S., 2011. Finding Stable Local Optimal RNA Secondary Structures. Bioinformatics (Oxford, England) 27 (21), 2994–3001. Michálik, J., Touzet, H., Ponty, Y., 2017. Efficient approximations of RNA kinetics landscape using non-redundant sampling. Bioinformatics 33 (14), i283–i292. Kucharík, M., Hofacker, I.L., Stadler, P.F., Qin, J., 2016. Pseudoknots in RNA folding landscapes. Bioinformatics 32 (2), 187–194. Mandal, M., Boese, B., Barrick, J.E., Winkler, W.C., Breaker, R.R., 2003. Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell 113 (5), 577–586. Voss, B., Meyer, C., Giegerich, R., 2004. Evaluating the predictability of conformational switching in RNA. Bioinformatics 20, 1573–1582. Wakeman, C.A., Winkler, W.C., Dann, C.E., 2007. Structural features of metabolite-sensing riboswitches. Trends Biochem. Sci. 32 (9), 415–424. Höner Zu Siederdissen, C., Hammer, S., Abfalter, I., Hofacker, I.L., Flamm, C., Stadler, P.F., 2013. Computational design of RNAs with complex energy landscapes. Biopolymers 99 (12), 1124–1136. Biebricher, C.K., Luce, R., 1992. In vitro recombination and terminal elongation of RNA by Q beta replicase. EMBO J. 11 (13), 5129–5135.
Conflicts of interest The authors declare that they have no conflicts of interest. Acknowledgements This research was partially supported by JSPS KAKENHI Grant Number 15K00392 and 18K11519. References Al-Hashimi, H.M., Walter, N.G., 2008. RNA dynamics: It is about time. Curr. Opin. Struct. Biol. 18 (3), 321–329. Breaker, R.R., 2011. Prospects for Riboswitch Discovery and Analysis. Mol. Cell 43 (6), 867–879. Isaacs, F.J., Dwyer, D.J., Collins, J.J., 2006. RNA Synth. Biol. Nature biotechnology 24 (5), 545–554. Chang, A.L., Wolf, J.J., Smolke, C.D., 2012. Synthetic RNA switches as a tool for temporal and spatial control over gene expression. Curr. Opin. Biotechnol. 23 (5), 679–688. Findeiß, S., Hammer, S., Wolfinger, M.T., Kühnl, F., Flamm, C., Hofacker, I.L., 2018. In silico design of ligand triggered RNA switches. Methods 143, 90–101. Flamm, C., Fontana, W., Hofacker, I.L., Schuster, P., 2000. RNA folding at elementary step resolution. RNA 6 (3), 325–338. Wolfinger, M.T., Svrcek-Seiler, W.A., Flamm, C., Hofacker, I.L., Stadler, P.F., 2004. Efficient computation of RNA folding dynamics. J. Phys. A: Math. Gen. 37 (17),
16