NCHB: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets based on Height Function and Binarization
Journal Pre-proof
NCHB: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets based on Height Function and Binarization Hadi Poormohammadi, Mohsen Sardari Zarchi, Hossein Ghaneai PII: DOI: Reference:
S0022-5193(19)30514-4 https://doi.org/10.1016/j.jtbi.2019.110144 YJTBI 110144
To appear in:
Journal of Theoretical Biology
Received date: Revised date: Accepted date:
1 June 2018 27 December 2019 30 December 2019
Please cite this article as: Hadi Poormohammadi, Mohsen Sardari Zarchi, Hossein Ghaneai, NCHB: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets based on Height Function and Binarization, Journal of Theoretical Biology (2019), doi: https://doi.org/10.1016/j.jtbi.2019.110144
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.
Highlight • Constructing an optimal rooted phylogenetic network from a given set of rooted triplets is a NP-hard problem. • NCHB is introduced to construct a rooted phylogenetic network from an arbitrary given set of rooted triplets. • The height function and binarization concepts are considered innovatively. • NCHB is applied on biologically and randomly generated data to show its efficiency. • The experimental results show that NCHB has high performance with triplets as input.
1
NCHB: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets based on Height Function and Binarization Hadi Poormohammadia,b,∗, Mohsen Sardari Zarchia , Hossein Ghaneaia a Department b School
of Computer Engineering, Meybod University, Meybod, Iran of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Abstract Phylogenetics is a field that studies and models the evolutionary history of currently living species. The rooted phylogenetic network is an important approach that models non-tree-like events between currently living species. Rooted triplets are one type of inputs in constructing rooted phylogenetic networks. Constructing an optimal rooted phylogenetic network that contains all given rooted triplets is a NP-hard problem. To overcome this challenge efficiently, a novel heuristic method called NCHB is introduced in this paper. NCHB produces an optimal rooted phylogenetic network that covers all given rooted triplets. The NCHB optimality criterions in building a rooted phylogenetic network are minimizing the number of reticulation nodes, and minimizing the level of the final network. In NCHB, the two concepts: the height function and the binarization of a network are considered innovatively. In order to study the performance of NCHB, our proposed method is compared with the three state of the art algorithms that are LEV1ATHAN, SIMPLISTIC and TripNet in two scenarios. In the first scenario, triplet sets are generated under biological presumptions and our proposed method is compared with SIMPLISTIC and TripNet. The results show that NCHB outperforms TripNet and SIMPLISTIC ∗ Corresponding
author Email addresses:
[email protected] (Hadi Poormohammadi),
[email protected] (Mohsen Sardari Zarchi),
[email protected] (Hossein Ghaneai)
Preprint submitted to Journal of LATEX Templates
January 3, 2020
according to the optimality criterions. In the second scenario, we designed a software for generating level-k networks. Then all triplets consistent with each network are obtained and are used as input for NCHB, LEV1THAN, SIMPLISTIC, and TripNet. LEV1ATHEN is just applicable for level-1 networks while the other algorithms can be performed to obtain higher level networks. The results show that the NCHB and LEV1ATHAN outputs are almost the same when we are restricted to level-1 networks. Also the results show that NCHB outperforms TripNet and SIMPLISTIC. Moreover NCHB outputs are very close to the generated networks (that are optimal) with respect to the criterions. Keywords: Bioinformatics, Rooted phylogenetic network, Rooted triplet, Density, Consistency, Height function, Reticulation node, NP-hard, living species
1. Introduction Analyzing the relation between currently living species and providing valid models to investigate this relation, is an important challenge in bioinformatics. Phylogenetics is a branch of bioinformatics that investigates and models 5
the relation among currently living species or groups of living species [1, 2]. Phylogenetic trees are the simplest possible model that studies such relations [3] . Phylogenetic networks are complex structures that generalize phylogenetic trees. In phylogenetic trees or networks, leaves are distinctly labeled by a given set of taxa [1, 4, 5]. Phylogenetic networks are considered as well-known tools to
10
represent the whole relation between species in a unique model [6, 7, 8]. These networks allow the representation of non-tree-like underlying events such as recombination, horizontal gene transfer, and hybridization, known as reticulate events [3, 9, 10, 11, 12, 13, 14]. Phylogenetic trees are the simplest possible models of phylogenetic networks with no non-tree-like events (figure 1a) [1].
15
Generally, phylogenetic networks divide into two groups i.e. rooted and unrooted [15]. A rooted phylogenetic network contains a common ancestor for all given taxa which is called root. These taxa are considered as the leaves of the
3
given rooted phylogenetic network [15]. In rooted phylogenetic networks, each vertex achieved from the root by at least a directed path. Rooted phylogenetic 20
networks are directed graphs with no cycle. In this paper, we are interested in rooted phylogenetic networks. The goal in using rooted phylogenetic networks, is to represent reticulate events explicitly [15]. In rooted phylogenetic networks, vertices can have more than one parent. Reticulate nodes, which are the nodes with two parents are used to explain
25
reticulate events. More precisely, a rooted phylogenetic network is a rooted directed acyclic graph in which the in-degree of each node is at most 2. The out-degree of nodes with in-degree 2 is 1. The nodes with in-degree 2 and outdegree 1 called reticulation nodes. In rooted phylogenetic networks, leaves are nodes with in-degree 1 and out-degree 0 and are distinctly labeled by a set of
30
given taxa (figure 1b).
(a)
(b)
(c)
Figure 1: (a) A rooted phylogenetic tree on the set of taxa X = {a, b, c, d, e, f, g}. (b) A level-1 rooted phylogenetic network on the set of taxa X = {a, b, c, d, e, f, g} as its leaves and t is its reticulation node. (c) A rooted triplet ab|c with three leaves on the set of taxa X = {a, b, c}.
There are different types of input for constructing phylogenetic networks. Generally, the inputs are in the forms of biological sequences, distance matrixs, triplets, quartets, binets and trinets, and clusters [16, 17, 18, 19, 20, 21, 22, 23, 15, 24]. Based on the inputs and the goal of the problem, various types
4
35
of phylogenetic networks are introduced. This paper focuses on constructing rooted phylogenetic networks. The inputs for constructing rooted phylogenetic networks can be in the forms of trinets and binets, triplets, and clusters. These three types of inputs are obtained in different ways and can be used for different purposes. The outputs obtained from all these three inputs are as rooted
40
phylogenetic networks. Although the outputs related to these three inputs have the same structure, their usage condition are different. In this paper, we are interested to use rooted triplets as standard inputs to obtain rooted phylogenetic networks. A rooted triplet is a rooted binary tree with three taxa (figure 1c) [15].
45
The symbol ab|c is used to represent the triplet of figure 1c. For a given set of triplets as input, constructing an optimal rooted phylogenetic network in which each input rooted triplet is its induced subgraph, is our main purpose. Formally, all given rooted triplets are consistent with a rooted phylogenetic network if each rooted triplet is an induced subgraph of the rooted phylogenetic
50
network. For example the set of triplets = bc|g, ab|e, cd|b, ef |g, f g|e are consistent with the network of figure 1b. [15, 24]. Two optimality criterions are considered for rooted phylogenetic networks: (i) minimizing the level and (ii) minimizing the number of reticulation nodes [15, 24]. The level of a rooted phylogenetic network is defined as the maximum number of reticulation nodes,
55
which is contained in each biconnected component of such network. Figure 2 shows a level-2 phylogenetic network with 3 reticulation nodes. According to the definition, the most optimal rooted phylogenetic networks are level-0 rooted phylogenetic networks i.e. rooted binary trees. Therefore, constructing a rooted tree consistent with all given rooted triplets is the first
60
goal. BUILD is a well-known algorithm which constructs a rooted tree (level-0 networks) consistent with a collection of rooted triplets in polynomial time if such a tree exists [25]. When there is no rooted tree for a given set of rooted triplets, BUILD algorithms halts. In this condition, the main goal is to produce an optimal rooted phylogenetic network (a level-1 network) consistent with all
65
the rooted triplets [15, 24]. 5
Figure 2: A level-2 network with the reticulation nodes t1 , t2 , t3 .
Level-1 rooted phylogenetic networks that also known as galled trees are the simplest possible types of rooted phylogenetic networks which contain at least one reticulation node (Figure 1b) [26]. The studies on level-1 rooted phylogenetic networks show that in general, it is NP-hard to decide whether there 70
exists a level-1 rooted phylogenetic network that contains a collection of rooted triplets [16, 26, 27]. LEV1ATHAN is an algorithm for constructing a level-1 rooted phylogenetic network consistent with input rooted triplets as many as possible. But it does not consider any optimality criterion [16]. Density is a critical concept for computing the complexity of constructing a rooted phylogenetic
75
network from a collection of rooted triplets. A set of rooted triplets is called dense if for each subset of three taxa there is at least one information in the input triplets set i.e. there is at least one rooted triplet for each subset of three taxa [27]. For example for a given set of taxa X = a, b, c, d the set of triplets τ = {bc|a, ab|d, cd|b, ac|d, ab|c} is dense. For dense sets of rooted triplets, it
80
is possible to construct a level-1 rooted phylogenetic network that contains all given rooted triplets in polynomial time if such a network exists [16, 27]. Hence, constructing rooted phylogenetic networks from dense sets of rooted triplets has received more attention recently. Iersel and et al. proposed an algorithm which constructs at most a level-2
85
rooted phylogenetic network that considers both optimality criterions. If such a network does not exist, this algorithm will halt [28, 29]. They proved that 6
for a dense set of rooted triplets τ , if τ is precisely equal to the set of rooted triplets consistent with some rooted phylogenetic network, then such a rooted phylogenetic network can be constructed with the smallest possible level in time 90
O(|τ |k+1 ) , where k is a fixed upper bound on the level of the network [29]. Moreover, according to the ideas described in [29] the authors proposed SIMPLISTIC algorithm, which always returns some rooted phylogenetic network consistent with a given dense set of rooted triplets, τ . But it does not give any minimality guarantees.
95
For non-dense sets of rooted triplets, TripNet and RPNCH algorithms are introduced [24, 30]. RPNCH and TripNet firstly assign a nonnegative integer to each pairs of input taxa and call it a height function. The height function is a helping tool to obtain the structure of the final network. Then RPNCH obtain a tree that contains all input taxa and finally based on input rooted
100
triplets, it adds some edges to the tree in an almost optimal way that the final network be consistent with all given rooted triplets. In TripNet algorithm, after assigning height function, SN-sets are found and the size of the problem is reduced. Then TripNet finds reticulation nodes and removes them. For the remaining taxa TripNet finds a tree structure. Finally reticulation nodes
105
are added to the tree structure to obtain a network consistent with all given rooted triplets [24, 30]. On average RPNCH is faster than TripNet but generally TripNet is more accurate [30]. Therefore, in this paper, TripNet algorithm is considered in experimental comparisons. According to our best knowledge, TripNet and SIMPLISTIC are the most important algorithms that try to build
110
an optimal rooted phylogenetic network consistent with a given set of rooted triplets. In summary, SIMPLISTIC just works for dense sets of rooted triplets, while TripNet works for arbitrary sets of rooted triplets. Almost all methods that work for constructing rooted phylogenetic networks use the SN-set concept to reduce the size of the problem. Let τ be a set of
115
triplets on a set of taxa X. A subset S of X is an SN-set if there is no triplet ij|k ∈ τ such that i ∈ / S and j, k ∈ S. These methods try to find maximal SN-sets which partition X and contract each SN-set into a single node. Then 7
they return a rooted phylogenetic network for SN-sets, and replace each SN-set with its related network. The final rooted phylogenetic network is consistent 120
with all given rooted triplets [26, 27] . If τ is dense, the maximal SN-sets are obtained in polynomial time [26, 27] . The methods, which work for dense sets of rooted triplets, use density concept for finding SN-sets. Note for a dense set of rooted triplets, the final network should be binary and satisfies all the constraint related to a rooted phylogenetic network because for
125
each subset of three taxa there is at least one rooted triplet in the input triplets set. For arbitrary non-dense sets of rooted triplets, some edges or nodes of the final rooted phylogenetic network can be contracted without any distortion in the consistency of the all input rooted triplets with the new network. It means that for non-dense sets of rooted triplets, the information is not sufficient
130
for returning a (binary) rooted phylogenetic network and in some steps; the resulting network may not be binary. So using the intelligent methods to make a non-binary rooted network into a (binary) rooted phylogenetic network are important. This paper introduces a method for constructing rooted phylogenetic net-
135
works with the minimum number of reticulation nodes, consistent with an arbitrary given set of rooted triplets. Since in the proposed method, there is no constraint on the input triplets, (i) proposing effective methods for obtaining SN-sets and (ii) binarization a non-binary network, are very important challenges. In our method, we use the height function as an effective tool to ob-
140
tain SN-sets in polynomial time and an intelligent method for binarization a non-binary network. We name the new algorithm, Network Construction with Height and Binarization, algorithm NCHB for short. Note that in this area, TripNet is the only existing method and NCHB improved TripNet. This paper is organized as follows. Section 2 presents the definitions and notation. In
145
section 3, NCHB algorithm is introduced. In section 4, the NCHB results are presented and they are compared with the LEV1ATHAN, TripNet, and SIMPLISTIC results. Finally, the discussion about the performance of NCHB is presented in section 5. 8
2. Definitions and notations 150
A rooted phylogenetic tree (tree for short) on a given set of taxa X is a rooted unordered leaf labeled tree in which root has in-degree 0 and at least out-degree 2. Leaves have in-degree 1 and out-degree 0. Every node except the root and leaves, has in-degree 1 and at least out-degree 2. In trees, leaves are distinctly labeled by X (Figure 1a) [15, 24].
155
A directed acyclic graph (DAG) is a directed graph that contains no cycles. For example, figure 3a shows a non-acyclic directed graph while figure 3b shows a DAG that is obtained from figure 3a. A DAG G is connected if there is an undirected path between every pair of nodes and, is bi-connected if by removing of each node, the graph will remain connected. A bi-connected component (also
160
known as a 2-connected component) of a graph G is a maximal bi-connected subgraph of G [15, 24].
(a)
(b)
Figure 3: (a) A non-acyclic directed graph (b) A directed acyclic graph that is obtained from the graph of figure 3a.
A rooted phylogenetic network (network for short) on X is a rooted DAG that. Root has in-degree 0 and out-degree 2. Reticulation nodes have in-degree 2 and out-degree 1. Tree nodes have in-degree 1 and out-degree 2. Leaves have 165
in-degree 1 and out-degree 0 and are distinctly labeled by X (Figure 1b). A level-k network is a network in which the number of reticulation nodes in each 9
bi-connected component is at most k. For example in figure 2, the value of k is 2. A rooted binary tree is a rooted tree in which root has in-degree 0 and out170
degree 2. Leaves have in-degree 1 and out-degree 0. Every node except the root and leaves has in-degree 1 and out-degree 2. A rooted binary unordered tree with three leaves is the simplest possible rooted binary tree structure that contains information and is called a rooted triplet (triplet for short). The symbol ij|k denotes a triplet with taxa i and j on one side and k on the other side of
175
the root (Figure 1c) [24]. From here a set of triplets is indicated by τ . A set τ is called dense if for each subset of three taxa, there is at least one triplet in τ . A triplet ij|k is consistent with a network N or equivalently N is consistent with ij|k if the leaf set of ij|k is the subset of the leaf set of N and N contains distinct nodes u
180
and v and pairwise internally node-disjoint paths u → i, u → j, v → u, v → k. Figure 2 shows an example of a network that is consistent with ef |g because the nodes u and v satisfies the above condition. A network N is consistent with τ or equivalently τ is consistent with a network N if all the triplets in τ are consistent with N . The symbols τ (N ) and LN denote the set of all triplets that
185
are consistent with N and the set of labels of its leaves respectively. For any set τ of triplets L(τ ) = ∪t∈τ Lt . τ is called a set of triplets on X if L(τ ) = X [24]. A binarization of a given tree is defined as follows. Let T be a tree and x be a node of T with x1 , x2 , . . . , xk , k ≥ 3 as its children. Consider a new
node y. Construct T 0 by removing the edges (x, x1 ), (x, x2 ), . . . , (x, xk−1 ) from 190
T and adding the edges (x, y), (y, x1 ), (y, x2 ), . . . , (y, xk−1 ) to T . Continuing the same method for each node with out-degree more than 2, a binary tree is obtained, and is called a binarization of T . Obviously, one can obtain different binarization of T . If τ is consistent with a tree T1 , and T2 be a binarization of T1 , then τ is consistent with T2 [24]. Figure 4a shows a non-binary tree and
195
figures 4b and 4c shows two different binarization of the tree of figure 1a. Gτ the directed graph related to τ , is defined by V (Gτ ) = {{i, j} : i, j ∈ L(τ ), i 6= j} ({i, j} is denoted by ij for short) and E(Gτ ) = {(ij, ik) : ij|k ∈ 10
(a)
(b)
(c)
Figure 4: (a) A non-binary tree (b) A binarization of figure 4.a. (c) Another binarization of figure 4.a.
τ } ∪ {(ij, jk) : ij|k ∈ τ } [24].
The height function of a tree and network is defined as follows. Let
denotes the set of all subsets of a finite set X of size 2. A function X →N h: 2
X 2
(1)
is called a height function on X [24]. Let T be a rooted tree with the root r, cij
200
be the lowest common ancestor of the leaves i and j, and lT denotes the length of the longest path started from r. For any two nodes x and y, let dT (x, y) denotes the number of edges of the path between x and y. The height function of T , hT is defined as hT (i, j) = lT − dT (r, cij ) where i and j are two distinct leaves of T and cij is the lowest common ancestor of i and j [24]. For example
205
in figure 4c, lT = 6 and hT (a, b) = lT − dT (r, cab ) = 6 − 2 = 4. For a given τ , let Gτ be a DAG and lGτ denotes the length of the longest path in Gτ . Since Gτ is a DAG, the set of nodes with out-degree zero is nonempty. Assign lGτ + 1 to the nodes with out-degree zero and remove them from Gτ . Assign lGτ to the nodes with out-degree zero in the resulting graph and continue this procedure until
210
all nodes are removed. For any two distinct i, j ∈ L(τ ) , hGτ (i, j) is defined as 11
the value that is assigned by the above procedure to the node ij and is called the height function related to Gτ . For example figure 5 shows the process of calculating the height function related to Gτ for τ = ab|c, cd|e, de|a, ce|b, ce|a
(a)
(b)
(c)
Figure 5: (a) Gτ for τ = {ab|c, cd|e, de|a, ce|b, ce|a} which is a DAG and lGτ = 2 and h(a, d) = h(a, e) = h(b, e) = h(a, c) = h(b, c) = h(b, d) = 3. (b) Removing the nodes with outdegree = 0 from Gτ which yields h(d, e) = h(c, e) = h(a, b) = 2. (c) Removing the nodes with outdegree = 0 from the previous step which yields h(c, d) = 1.
If τ is consistent with a tree then Gτ is a DAG and hGτ is well-defined [24]. 215
Let N be a network with the root r and lN be the length of the longest directed path from r to the leaves. For each node u consider d(r, u) as the length of the longest directed path from r to u. For any two nodes u and v, u is called an ancestor of v, if there exists a directed path from u to v. If u is an ancestor of v then v is lower than u. Let i and j be two leaves of N . c is called a lowest
220
common ancestor of i and j in N , if c is a common ancestor of i and j and there is no common ancestor of i and j lower than c. For any two leaves i and j, let Cij denote the set of all lowest common ancestors of i and j. For each pair of leaves i and j, hN (i, j) = min{lN − d(r, c) : c ∈ Cij } and is called the height function of N [24]. For example in figure 1b, hN (e, f ) = 3.
225
For a given τ , a subset S of L(τ ) is an SN-set if there is no triplet ij|k ∈ τ such that i ∈ / S and j, k ∈ S. If τ is dense then the maximal SN-sets are obtained in polynomial time and partition L(τ ). One can contract each of the SN-set to a single node and assign a common ancestor for all the leaves in one SN-set. By contracting each SN-set to a single node, the size of the problem
230
reduced (Figure 6).
12
Figure 6: The SN-sets of figure 2. The sets S1 , S2 , S3 are SN-sets and, S3 , S4 are maximal SN-sets.
A quartet is a unrooted binary tree with four leaves. We use the symbol ij|kl for the quartet on four leaves i, j, k, and l for which i and j are a pair and k and l are another pair. Each quartet has a unique edge for which two endpoints are not leaves, and called the inner edge of a quartet (Figure 7).
Figure 7: The quartet ij|kl on the set of leaves {i, j, k, l}.
235
3. Method As mentioned in the introduction, the two concepts, height function and binarization are considered innovatively for constructing a consistent network with a given τ in NCHB. At first for a given τ , NCHB finds a height function h on L(τ ). 13
240
3.1. Height function A triplet ij|k is consistent with a tree T if and only if hT (i, j) < hT (i, k) or hT (i, j) < hT (j, k). Also for a given network N and its three distinct leaves i, j, and k, if hN (i, j) < hN (i, k) or hN (i, j) < hN (j, k) then ij|k is consistent with N [24]. Based on these concepts the following Integer Programming IP (τ, s) is
245
considered for a given τ with |L(τ )| = n and an integer s. M aximize X
h(i, j),
1≤i,j≤n
subject to :
(2) h(i, k) − h(i, j) > 0
ij|k ∈ τ,
h(j, k) − h(i, j) > 0
ij|k ∈ τ,
0 < h(i, j) ≤ s 1 ≤ i, j ≤ n. In the above IP (formula 2) two inequalities are assigned to each triplet in τ . A combinatorial method is used to solve the IP. The IP has a feasible solution if and only if Gτ is a DAG. In this case, the height function related to Gτ is a feasible solution, which maximizes the IP [24]. For example, the height function 250
of figure 5 is the optimal solution to the IP. If Gτ is not a DAG the proposed method removes minimum number of edges to lose minimum information to obtain a DAG called G0 [24]. The problem of removing minimum number of edges from a directed graph to make it a DAG is known as the Minimum Feedback Arc Set (MFAS) problem. MFAS problem
255
is NP-hard [31]. In order to obtain G0 a novel heuristic algorithm is introduced in the following subsection. 3.1.1. Obtaining DAG The nodes with out-degree zero cannot participate in any directed cycle. So remove them and continue this process in the remaining graph until there is no
260
node with out-degree zero. Similarly, this process is performed for the nodes 14
with in-degree zero in the remaining graph. Figure 8 illustrates an example of removing the nodes with outdegree = 0 (figure 8a to 8c) and indegree = 0 (figure 8d).
(a)
(b)
(c)
(d)
Figure 8: The process of removing the nodes with indegree = 0 or outdegree = 0 to obtain a graph with no nodes with indegree = 0 or outdegree = 0. (a) Original graph G (b-c) Removing the nodes with outdegree = 0. (d) Removing the nodes with indegree = 0.
In the resulting graph, first assign color white to each node. Then for each 265
node v ∈ V (G) do the following. Suppose that the out-degree of v is m and vv1 , vv2 , . . . , vvm be m such directed edges with v as their tail. Remove these m edges from the resulting graph. Convert the color of v and v1 , v2 , . . . , vm to black. For each vi , 1 ≤ i ≤ m remove the vi u edge that the color of u is white. Then convert the color of each node u that is the head of some vi , 1 ≤ i ≤ m
270
to black. Continue the process of removing the edges in a way that the color of all nodes becomes black. Define the value of v as the number of remaining edges in the resulting graph. The remaining edges related to the node with the minimum value is removed from G and the resulting graph G0 is a DAG. Figure 9 shows an example of this process.
275
3.1.2. Assigning the height function to DAG Each removed edge in the process of obtaining DAG, is equivalent to an inequality (or equivalently to a constraint) in the IP of subsection 3.1. Each inequality in this IP is concluded from only one triplet in τ i.e. at most, the information related to these triplets is removed. Remove these triplets from τ 15
(a)
(b)
(d)
(e)
(c)
(f) Figure 9: The process of assigning black and white color to the nodes that is obtained from the graph G of figure 8d. (a) The first step of nodes coloring, by considering the node b as the starting point. (b-d) The next three steps of the nodes coloring. (e) A DAG obtained from figure 9a by removing the edges that are determined in the graph of figure 9d. (f) The DAG G0 that is obtained from the graph G of figure 8a by removing the edges that are determined in the graph of figure 9d.
16
280
and call the resulting set, τ 0 . The height function hG0 is a solution to the IP in which some inequalities related to the removed edges are removed. For example figure 10a illustrates Gτ for τ = {ab|c, cd|e, de|a, ce|b, ce|a, bc|d}. Removing
the edge (bc, cd) makes G0 with τ 0 = {ab|c, cd|e, de|a, ce|b, ce|a}. See figure 10b.
Calculating the height function of G0 is similar to figure 5. So its height function 285
is hG0 (a, d) = hG0 (a, e) = hG0 (b, e) = hG0 (a, c) = hG0 (b, c) = hG0 (b, d) = 3, hG0 (d, e) = hG0 (c, e) = hG0 (a, b) = 2, and hG0 (c, d) = 1.
(a)
(b)
Figure 10: (a) The graph Gτ for τ = {ab|c, cd|e, de|a, ce|b, ce|a, bc|d}. (b) G0 is obtained from Gτ by removing the edge (bc, cd).
3.2. The weighted complete graph The weighted complete graph (G, h) is defined where V (G) is the set of taxa and each edge ij has the weight hG0 (i, j). Figure 11 illustrates an example of 290
constructing a complete graph based on the height function of figure 10. The next step (subsection 3.3) presents the process of obtaining a tree from the weighted complete graph if such a tree exists. Otherwise, in subsection 3.4, the SN-sets are recognized to construct a SN-sets tree. 3.3. Obtaining a tree
295
For the obtained τ 0 and the weighted complete graph, a tree T 0 is obtained and is used as the input for the binarization step (subsection 3.5) if such a tree exists. For example for the graph of figure 11, a tree is obtained by the following procedure. To obtain a tree from the weighted complete graph, maximum edges are removed in each step. By removing the edges with maximum weight, if the 17
Figure 11: The weighted complete graph based on the height function of figure 10b.
300
resulting graph becomes disconnected, this process goes to the next step. This process continues until all edges are removed. For example, figure 12, shows the graph segmentation of figure 11. Note that in each step of figure 12, removing the maximum edges; make each connected component, disconnected.
(a)
(b)
(c)
(d)
Figure 12: Graph segmentation based on the height function. (a) The original weighted complete graph. (b) Removing the edges with weight = 3. (c) Removing the edges with weight = 2. (d) Removing the edges with weight = 1.
In order to obtain the tree, the steps of graph segmentation are reversed 305
and a tree T 0 consistent with τ 0 is built. Figure 13, presents the steps of tree construction using segmented graphs of figure 12. If in one step of graph segmentation, by removing the maximum edges from a connected component, the resulting graph remains connected, the algorithm stops and searches for a SN-sets tree as is described in the following subsection.
310
For example, in figure 14, a tree does not exist and the algorithm halts.
18
(a)
(b)
(c)
Figure 13: Construction a tree based on the segmented graph of figure 12. (a) The nodes c and d are merged because they are separated in the last step of the graph segmentation( figure 12d) (b) The node e is merged to the common ancestor of the nodes c and d, and the nodes a and b are merged according to the reverse order of the graph segmentation (figure 12c). (c) Constructing the final tree according to figure 12b.
3.4. Obtaining a SN-sets tree If there is no tree for the given τ 0 and weighted complete graph, the following process is performed. At first, SN-sets are recognized to determine reticulation nodes. The determined reticulation nodes are removed and a tree is constructed 315
based on obtained SN-sets which is called SN-sets tree. The details are described as follow. 3.4.1. SN-sets recognition When the process halts in finding a tree, the algorithm searches for SN-sets by using the resulting connected components. For example for the figure 14,
320
the SN-sets recognition starts from the components of figure 14d. For each connected component if it is an SN-set, it contracted to a single node. Otherwise, the maximum edges are removed until the non-SN-set components become disconnected. Then the same process is performed on the resulting components until all components become SN-sets which are shown by S = {S1 , S2 , . . . , Sk }.
325
Here, the τ is replaced by τS = {Si Sj |Sk : ∃ xy|z ∈ τ, x ∈ Si , y ∈ Sj , z ∈ Sk } and a weighted complete graph (GS , wS ) with V (GS ) = S and wS (Si , Sj ) = min{h(x, y) : x ∈ Si , y ∈ Sj } is constructed for finding reticulation nodes. 19
(a)
(b)
(d)
(c)
(e)
Figure 14: (a) The graph Gτ for τ = {ab|c, bc|a, ac|d, ac|e, de|a, de|b}. (b) G0 : A DAG that is
obtained from Gτ and by removing the edge (bc, ab). The height function of G0 is hG0 (a, d) = hG0 (a, e) = hG0 (b, d) = hG0 (b, e) = hG0 (c, d) = hG0 (c, e) = 4, hG0 (a, c) = hG0 (d, e) = 3, hG0 (b, c) = 2, and hG0 (a, b) = 1. (c) The weighted complete graph for the obtained height function. (d) First step of the graph segmentation. (e) Second step of the graph segmentation in which the algorithm halts because by removing the edge ac the resulting component remains connected.
3.4.2. Constructing the SN-set tree based on reticulation nodes In order to detect reticulation nodes, the following process is performed. For 330
each node Si ∈ S, 1 ≤ i ≤ k, remove it from (GS , wS ) and in the resulting graph perform the process of recognizing SN-sets (as was described previously). The value vi of each removed node Si is equal to the number of recognized SN-sets. The node with the minimum value, which is the optimum reticulation node, is removed. Then, the obtaining tree algorithm of subsection 3.3 is used to
335
check if there is a tree for the resulting graph. If there is a tree the process of
20
detecting and removing reticulation nodes stops and the resulting SN-set tree which is called TS0 is used as the input for the binarization step (subsection 3.5). Otherwise, if there is no tree, the process of detecting and removing reticulation nodes is done recursively on the resulting graph until TS0 is obtained. 340
Note that if there are more than one minimum value for SN-sets, define wmin = min{wS (Si , Sj ) : 1 ≤ i, j ≤ k} and select the node in which the number of its edges that have wmin , is maximum. 3.5. Binarization For a given τ , as is described in 3.3 and 3.4, a tree T 0 or a SN-set tree TS0 is
345
constructed. Usually in all cases, these two kinds of trees are not binary. The binarization is performed to achieve a structure that is more consistent with the input triplets. To obtain a binary tree, the following binarization process is used [32]. For a given τ let Vi and Vj be two arbitrary subsets of L(τ ) such that
350
Vi ∩ Vj = ∅. Let W (Vi , Vj ) = {vi vj |v ∈ τ : vi ∈ Vi , vj ∈ Vj and v ∈ / Vi ∪ Vj } , P (Vi , Vj ) = {vi v|vj ∈ τ or vj v : vi ∈ τ |vi ∈ Vi , vj ∈ Vj and v ∈ / Vi ∪ Vj }, and let T (Vi , Vj ) = {vi vj |v ∈ τ : vi ∈ Vi , vj ∈ Vj }. Let w(Vi , Vj ), p(Vi , Vj ), and t(Vi , Vj ) be the cardinality of W (Vi , Vj ), P (Vi , Vj ), and T (Vi , Vj ) respectively [33, 34].
355
According to [32], nine different measures are defined based on the three parameters w, p, and t to find the best configuration for the binary tree. The measures M = {m1 , m2 , . . . , m9 } are defined as: m1 = t(Vi , Vj ), m2 = w(Vi , Vj ), m3 = (w − p)(Vi , Vj ) , m4 = m7 =
360
(w−p) (Vi , Vj ), t
w (w+p) (Vi , Vj )
m8 = (w − p +
(w−p) w t (Vi , Vj ) ,m6 = (w+p) (Vi , Vj ), w m9 = ( (w−p) + (w+p) )(Vi , Vj ). t
,m5 =
w t )(Vi , Vj ),
If for a given τ the tree T 0 consistent with τ 0 ⊆ τ is binary, the goal is
achieved. Otherwise, there is at least a node x ∈ V (T 0 ) with c1 , c2 , . . . , ck , k ≥ 3
as its children. Based on a measure mi ∈ M, 1 ≤ i ≤ 9, children of x partitioned into two disjoint sets Ux and Vx with the maximum mi (Ux , Vx ). The edges (x, c1 ), (x, c2 ), . . . , (x, ck ) are removed from T 0 . Create two new nodes ux and 365
vx and connect x to them. Then connect ux to all nodes in Ux and vx to all 21
nodes in Vx . Continue this process on the resulting tree recursively until the tree becomes binary. For simplicity, the new binary tree is called T 0 as before. If instead of T 0 , there is Ts0 , the above process is performed to obtain a binary tree, which is called TS0 as before. By using the nine different measures, nine 370
binary trees or SN- set trees are achieved. 3.6. Constructing network If TS0 is obtained in the previous steps, the deleted reticulation nodes should be added to TS0 in a way that the resulting network is consistent with the maximum number of triplets in τS . Let x1 , x2 , . . . , xm be the sequence of m
375
reticulation leaves which are removed . Now add these m nodes in the reverse order (xm to x1 ) to TS0 as what follows: Let e1 and e2 be two edges of TS0 . Create two new nodes y1 and y2 in the middle of e1 and e2 . Connect y1 and y2 to a new node y3 and connect the reticulation leaf xi , 1 ≤ i ≤ m to y3 (Figure 15).
(a)
(b)
Figure 15: (a) Tree TS0 . (b) Adding the removed reticulation node xi to TS0 to obtain a network with one reticulation node.
380
Do this procedure for all pairs of edges and choose a pair such that the resulting network is consistent with the maximum number of triplets in τS . Continue this procedure until all the reticulation nodes are added and the network NS is achieved. Each SN-set in NS is replaced with its corresponding network to obtain the network N 0 . 22
385
The network N 0 may not be consistent with some triplets of τ that is called τincons . In order to make the network consistent, some edges are added based on τincons gradually [24].Let τ (a, b) = {ab|c such that ab|c ∈ τincons } and |τ (a, b)| be its size. Choose a, b ∈ L(τ ) in a way that |τ (a, b)| is maximum. Let pa and pb be the parent of a and b. Two new nodes are created between {pa , a} and
390
{pb , b}. These two new nodes are connected with a new edge. By adding this edge, the resulting network becomes consistent with τ (a, b). The set τincons is updated and this process will continue until τincons = ∅. Similarly for T 0 which may be obtained in subsection 3.3, the above process is used to obtain a network consistent with τ .
395
Note that the binarization process generates nine different outputs. Hence, nine different networks are achieved. The network with the minimum number of reticulation nodes is reported as the result of the algorithm. Figure 16 illustrates the general flowchart of the proposed algorithm.
4. Results 400
In order to study the performance of NCHB two scenarios are designed. In the first scenario, NCHB is compared with TripNet and SIMPLISTIC on the triplets sets that are obtained from generated sequences data. In this scenario sequences data are obtained under biological assumptions and an standard method is used to obtain triplets from the generated sequences. In the second
405
scenario, we randomly generate level-k networks. Then for each randomly generated network, all triplets consistent with it are obtained. Each obtained set of triplets is considered as input for NCHB, LEV1ATHAN, SIMPLISTIC, and TripNet. In this scenario an optimal network corresponds to a set of triplets that are obtained from a randomly generated network, is the randomly gener-
410
ated network itself. This scenario displays the proximity of the output of the NCHB, LEV1ATHAN, SIMPLISTIC, and TripNet to the optimal networks. These two scenarios are explained in subsections 4.1 and 4.2.
23
24
Figure 16: The flowchart of the NCHB algorithm.
4.1. Experimental results on biological sequence based triplets TripNet and SMPLISTIC are the well-known methods for constructing net415
works from given triplets. SIMPLISTIC just works for dense sets of triplets, while in NCHB and TripNet there is no constraint on the input triplets. In order to study the performance of NCHB the following scenario is designed. Triplets can be obtained randomly or by using sequence data. In this research, triplets are obtained from biological sequences data by using TREEVOLVE.
420
TREEVOLVE is a software that simulates the evolution of DNA sequences [35]. In TREEVOLVE various parameters should be set such as: the Number of samples, the Number of sequences, and the Length of sequence. In the experiments, the Number of sequences is set to 10, 20, 30, and 40, and the Length of sequence is set to 100, 200, 300, and 400. In each case, the Number of samples is set to
425
10. For the other parameters of TREEVOLVE, the default values are used. Maximum Likelihood is usually used to obtain triplets from biological sequences [29]. PhyML is a software that uses Maximum Likelihood criterion to build unrooted weighted binary trees based on the input sequences and can be used to produce triplets. For each set of sequences, an outgroup is added to all
430
the sets consisting of three sequences in the input and construct a quartet using PhyML. Finally, triplets corresponding to each of these groups are obtained by removing the outlier sequence. For a quartet ij|ko where o is the outgroup, if the inner edge weight is zero, it means that the quartet contains no information and in fact is a star. Equivalently, the triplets correspond to these kinds of
435
quartets contains no information and are removed. In these experiments, NCHB is compared with SIMPLISTIC and TripNet based on the constructed triplets. As SIMPLISTIC just works for dense sets of triplets, some triplets are added randomly to the constructed sets of triplets in order to make them dense. Since obtaining the output from input triplets might
440
be time consuming, the running time restriction 6 hours is considered [24]. Let Nf inite be the set of networks with the running time less than 6 hours. Let Ssequence indicates the Number of sequences where Ssequence ∈ {10, 20, 30, 40}. The comparison between NCHB, TripNet and SIMPLISTIC results are shown 25
in Tables 1 to 4. 445
Table 1 shows the number of triplets that belong to Nf inite and Table 2 indicates the average running time results. According to the experimental results, when the number of input taxa is 10, all methods always give an output in less than one second. When Ssequence = 20, in 7.5% of cases, SIMPLISTIC has no results in less than 6 hours. For the remaining 92.5% of the cases, the
450
SIMPLISTIC running time is on average 300 seconds. On average the NCHB and TripNet running time is not more than 2 seconds for all cases. By increasing the Number of sequences to 30, in 72.5% of the cases, SIMPLISTIC cannot return a network in less than 6 hours. For the remaining 27.5% of the cases, on average SIMPLISTIC outputs a network in 2724 seconds, while the TripNet
455
and NCHB running time in all cases are on average 217 and 220 seconds, respectively. Moreover when the Number of sequences is set to 40, in all cases SIMPLISTIC fails to return any network in less than 6 hours, while on average TripNet and NCHB output a network in 750 seconds. Totally for all 160 input triplets sets on average TripNet and NCHB outputs a network in less than
460
244 seconds, while on average in 55% of the SIMPLISTIC networks belong to Nf inite , the running time is near to 346 seconds. Number of sequences (ssequence )
10
20
30
40
Number of the NCHB networks ∈ Nf inite
40
40
40
40
Number of the TripNet networks ∈ Nf inite
40
40
40
40
Number of the SIMPLISTIC networks ∈ Nf inite
40
37
11
0
Table 1: The number of NCHB, TripNet, and SIMPLISTIC sets of triplets that belong to Nf inite
Table 3 shows the average number of reticulation nodes results and Table 4 shows the average level results. Note that tables 2 to 4 show the results for the networks ∈ Nf inite . For these two optimality criterions i.e. the number of 465
reticulation nodes and the level of the network, when the number of inputs is 10,
26
Number of sequences (ssequence )
10
20
30
40
NCHB avg runtime for networks ∈ Nf inite (Sec)
1
2
220
750
TripNet avg runtime for networks ∈ Nf inite (Sec)
1
1.9
217
750
SIMPLISTIC avg runtime for networks ∈ Nf inite (Sec)
1
300
2724
-
Table 2: NCHB, TripNet, and SIMPLISTIC average running time results on the sets of triplets that belong to Nf inite
the number of reticulation nodes and the level of the SIMPLISTIC networks are 2.6 and 2.3, respectively. In this case, both optimality criterions for the TripNet results are 0.9, while for the NCHB networks both optimality criterions are 0.7. On average, by increasing the number of input to 20, SIMPLISTIC networks 470
belongs to Nf inite contain 7.3 reticulation nodes and their level is 4.6. In this case, TripNet results contain on average 2.7 reticulation nodes and their level is 2.3 while for the NCHB results, the number of reticulation nodes is 2.2 and the level is 1.7 on average. For the number of input 30, the SIMPLISTIC results belongs to Nf inite , contain on average 12 reticulation nodes and their level is 7.1.
475
For this input, the number of reticulation nodes and the level of the network for the TripNet results on average are 8.1 and 7.2 respectively while NCHB results on average contains 7 reticulation nodes and their level is 6. For the last case, SIMPLISTIC gives no output. However, on average the TripNet results contains 16.5 reticulation nodes and their level is 16 while the NCHB results contain 15.2 reticulation nodes and their level is 15. Number of sequences (ssequence )
10
20
30
40
NCHB avg number of reticulations for networks ∈ Nf inite
0.7
2.2
7
15.2
TripNet avg number of reticulations for networks ∈ Nf inite
0.9
2.7
8.1
16.5
SIMPLISTIC avg number of reticulations for networks ∈ Nf inite
2.6
7.3
12
-
Table 3: NCHB, TripNet, and SIMPLISTIC average number of reticulation nodes results on the sets of triplets that belong to Nf inite
27
Number of sequences (ssequence )
10
20
30
40
NCHB avg level for networks ∈ Nf inite
0.7
1.7
6
15
TripNet avg level for networks ∈ Nf inite
0.9
2.3
7.2
16
SIMPLISTIC avg level for networks ∈ Nf inite
2.3
4.6
7.1
-
Table 4: NCHB, TripNet, and SIMPLISTIC average level results on the sets of triplets that belong to Nf inite 480
4.2. Experimental results on randomly generated networks In the second scenario, we design a software for generating networks. The software gets the number of taxa (n) as input. Then it randomly generates a binary rooted tree with n leaves (equivalent to taxa). Finally by adding k edges, a 485
network with k reticulation nodes is generated. Then all triplets consistent with the generated network are obtained and are used as input for NCHB, TripNet, and SIMPLISTIC. Also NCHB results are compared with LEV1ATHAN results on the generated level-1 networks. Let n be the number of taxa and k be the level of the randomly generated network. The first scenario’s results show that,
490
for the number of taxa n= 10 all three methods (NCHB, TripNet and SIMPLISTIC) output a network in an appropriate time. But for n = 20 in some cases SIMPLISTIC has not the ability to return a network in time less than 6 hours. Also by increasing the number of taxa to n = 30 in more than 70% of the cases SIMPLISTIC has not the ability to return a network in time less than 6 hours.
495
In the second scenario the goal is to obtain sets of triplets in a way that all three methods have the ability to return a network in an appropriate time. For this reason in this scenario we generate networks for the number of taxa n = 10, 15 rather than n = 10, 20, 30, 40. Let n = 10, 15 and 1 ≤ k ≤ 5. For each case 20 samples are generated. Then NCHB is compared with TripNet, SIMPLISTIC,
500
and LEV1ATHAN on these generated data. The results are shown in Tables 5 and 6. The results show that when k = 1 all methods return a level-1 network. For 28
level of the generated network on 10 taxa
1
2
3
4
5
NCHB average level for generated networks
1
2
3.1
4.2
5.4
TripNet average level for generated networks
1
2
3.2
4.3
5.6
SIMPLISTIC average level for generated networks
1
2
3.7
5.1
6.5
LEV1ATHAN average level for generated networks
1
-
-
-
-
Table 5: NCHB, TripNet, SIMPLISTIC and LEV1ATHAN average level results on the generated networks on 10 taxa
level of the generated network on 15 taxa
1
2
3
4
5
NCHB average level for generated networks
1
2
3.5
4.6
6
TripNet average level for generated networks
1
2
4
5.1
6.5
SIMPLISTIC average level for generated networks
1
2
5.1
6.2
7.5
LEV1ATHAN average level for generated networks
1
-
-
-
-
Table 6: NCHB, TripNet, SIMPLISTIC and LEV1ATHAN average level results on the generated networks on 15 taxa
k = 2, NCHB, TripNet, and SIMPLISTIC return a level-2 network. For 3 ≤ k ≤ 5, none of the three methods, on average can not return an optimal network, 505
but in all cases NCHB outperforms TripNet and SIMPLISTIC. Moreover when n = 10 on average the resulting networks are less complicated compared to the networks corresponds to n = 15. Generally the results show that NCHB outputs are very close to the generated optimal networks.
5. Discussion 510
This paper introduced NCHB algorithm, which produces a network by considering the minimum number of reticulation nodes as an optimality criterion. In order to study the performance of NCHB, two different scenarios were designed. In the first scenario, 160 different sets of triplets that obtained from 160
29
different generated sets of biological sequences under biological presumptions 515
was used. To generate sets of sequences TREVOLVE software was employed. Then PhyML software, which works based on Maximum Likelihood criterion was used to obtain triplets from the generated biological sequences. The results of the proposed methods was compared with the SIMPLISTIC and TripNet results on these data. The comparison shows that in all cases and for the input
520
taxa with the cardinality 10, 20, 30, and 40, NCHB and TripNet on average returned a network that contains all given triplets in the appropriate time 243.475 and 242.25 seconds, respectively. While just in 55% of the cases SIMPLISTIC has the ability to return a network in less than 6 hours. Furthermore, in all cases by increasing the number of input taxa to 40, SIMPLISTIC could not
525
find any results in less than 6 hours. It means that by increasing the number of input taxa, the running time of SIMPLISTIC exceeds exponentially while NCHB and TripNet outputs a network in an appropriate time. Therefore for large size data, SIMPLISTIC is not a practical method and just works for small size data. Moreover for the small size input data with the cardinality 10, 20,
530
and 30, on average the number of reticulation nodes and the level of the final network for SIMPLISTIC results are higher than NCHB and TripNet results. So on average NCHB and TripNet are more effective compared to SIMPLISTIC and outperforms SIMPLISTIC in both optimality criterions and running time. The results for NCHB and TripNet show that the running time of both
535
methods are nearly the same, while in all cases, NCHB outperforms TripNet in both optimality criterions. More precisely the results show that for small size data and for number of taxa 10 for both methods, the two optimality criterions are close to each other and on average the differences between each optimality criterion for both methods is not more than 0.2. This happens since on average
540
the final networks have one reticulation node. By increasing the number of input taxa to 20, on average the differences between the two optimality criterions for both methods exceeds to 0.5. Moreover, when the numbers of input taxa are 30 and 40, on average the difference between the two optimality criterions for the both methods exceeds to 1. This fact concludes that NCHB outperforms 30
545
TripNet when the size of input data is large. In the second scenario we designed a software for generating random networks. All triplets consistent with each network were obtained and were used as input for NCHB, LEV1ATHAN, TripNet, and SIMPLISTIC. This scenario shows the constructed network proximity to the optimal network. The results
550
show that in all cases NCHB outperform the other methods and generally NCHB is a valuable method for constructing networks very close to the optimum networks.
References [1] J. Felsenstein, J. Felenstein, Inferring phylogenies, Vol. 2, Sinauer associates 555
Sunderland, MA, 2004. [2] R. Bouckaert, P. Lemey, M. Dunn, S. J. Greenhill, A. V. Alekseyenko, A. J. Drummond, R. D. Gray, M. A. Suchard, Q. D. Atkinson, Mapping the origins and expansion of the indo-european language family, Science 337 (6097) (2012) 957–960.
560
[3] C. Scornavacca, J. C. P. Mayol, G. Cardona, Fast algorithm for the reconciliation of gene trees and lgt networks, Journal of theoretical biology 418 (2017) 129–137. [4] H. Poormohammadi, An ew h euristic algorithm for mrtc problem, Journal of Emerging Trends in Computing and Information Sciences 3 (7).
565
[5] S. Jahangiri, S. N. Hashemi, H. Poormohammadi, New heuristics for rooted triplet consistency, Algorithms 6 (3) (2013) 396–406. [6] S. J. Willson, Reconstruction of certain phylogenetic networks from the genomes at their leaves, Journal of Theoretical Biology 252 (2) (2008) 338– 349.
31
570
[7] O. Canko, F. Ta¸skın, K. Argın, Phylogenetic tree and community structure from a tangled nature model, Journal of theoretical biology 382 (2015) 216– 222. [8] V. Moulton, C. Semple, M. Steel, Optimizing phylogenetic diversity under constraints, Journal of theoretical biology 246 (1) (2007) 186–194.
575
[9] C. R. Linder, B. M. Moret, L. Nakhleh, T. Warnow, Network (reticulate) evolution: biology, models, and algorithms, in: The Ninth Pacific Symposium on Biocomputing (PSB), 2004. [10] T. Sang, Y. Zhong, Testing hybridization hypotheses based on incongruent gene trees, Systematic Biology 49 (3) (2000) 422–434.
580
[11] M. Bordewich, C. Semple, Computing the minimum number of hybridization events for a consistent evolutionary history, Discrete Applied Mathematics 155 (8) (2007) 914–928. [12] M. Hallett, J. Lagergren, A. Tofigh, Simultaneous identification of duplications and lateral transfers, in: Proceedings of the eighth annual interna-
585
tional conference on Resaerch in computational molecular biology, ACM, 2004, pp. 347–356. [13] R. R. Hudson, Properties of a neutral allele model with intragenic recombination, Theoretical population biology 23 (2) (1983) 183–201. [14] R. B. Lyngsø, Y. S. Song, J. Hein, Minimum recombination histories by
590
branch and bound, in: International Workshop on Algorithms in Bioinformatics, Springer, 2005, pp. 239–250. [15] D. H. Huson, R. Rupp, C. Scornavacca, Phylogenetic networks: concepts, algorithms and applications, Cambridge University Press, 2010. [16] K. T. Huber, L. van Iersel, S. Kelk, R. Suchecki, A practical algorithm for
595
reconstructing level-1 phylogenetic networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 8 (3) (2011) 635–649. 32
[17] J. Oldman, T. Wu, L. Van Iersel, V. Moulton, Trilonet: piecing together small networks to reconstruct reticulate evolutionary histories, Molecular biology and evolution 33 (8) (2016) 2151–2162. 600
[18] K. T. Huber, L. Van Iersel, V. Moulton, C. Scornavacca, T. Wu, Reconstructing phylogenetic level-1 networks from nondense binet and trinet sets, Algorithmica 77 (1) (2017) 173–200. [19] J. Wang, M. Guo, L. Xing, K. Che, X. Liu, C. Wang, Bimlr: a method for constructing rooted phylogenetic networks from rooted phylogenetic trees,
605
Gene 527 (1) (2013) 344–351. [20] D. Bryant, Neighbornet: an agglomerative method for the construction of planar phylogenetic networks, Molecular Biology and Evolution 21 (2004) 255–265. [21] S. Gr¨ unewald, K. Forslund, A. Dress, V. Moulton, Qnet: An agglomera-
610
tive method for the construction of phylogenetic networks from weighted quartets, Molecular Biology and Evolution 24 (2) (2006) 532–538. [22] A. R. Templeton, K. A. Crandall, C. F. Sing, A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and dna sequence data. iii. cladogram estimation., Genetics
615
132 (2) (1992) 619–633. [23] H.-J. Bandelt, P. Forster, B. C. Sykes, M. B. Richards, Mitochondrial portraits of human populations using median networks., Genetics 141 (2) (1995) 743–753. [24] H. Poormohammadi, C. Eslahchi, R. Tusserkani, Tripnet: a method for
620
constructing rooted phylogenetic networks from rooted triplets, PloS one 9 (9) (2014) e106531. [25] A. V. Aho, Y. Sagiv, T. G. Szymanski, J. D. Ullman, Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions, SIAM Journal on Computing 10 (3) (1981) 405–421. 33
625
[26] J. Jansson, W.-K. Sung, Algorithms for combining rooted triplets into a galled phylogenetic network, Encyclopedia of Algorithms (2016) 48–52. [27] J. Jansson, W.-K. Sung, Inferring a level-1 phylogenetic network from a dense set of rooted triplets, Theoretical Computer Science 363 (1) (2006) 60–68.
630
[28] L. Van Iersel, J. Keijsper, S. Kelk, L. Stougie, F. Hagen, T. Boekhout, Constructing level-2 phylogenetic networks from triplets, IEEE/ACM transactions on computational biology and bioinformatics (TCBB) 6 (4) (2009) 667–681. [29] L. van Iersel, S. Kelk, Constructing the simplest possible phylogenetic net-
635
work from triplets, Algorithmica 60 (2) (2011) 207–235. [30] M. H. Reyhani, H. Poormohammadi, Rpnch: A method for constructing rooted phylogenetic networks from rooted triplets based on height function, Journal of Paramedical Sciences 8 (4) (2017) 14–20. [31] R. M. Karp, Reducibility among combinatorial problems, in: Complexity
640
of computer computations, Springer, 1972, pp. 85–103. [32] H. Poormohammadi, M. S. Zarchi, Cbth: a new algorithm for mrtc problem, submitted. [33] B. Y. Wu, Constructing the maximum consensus tree from rooted triples, Journal of Combinatorial Optimization 8 (1) (2004) 29–39.
645
[34] K. Maemura, J. Jansson, H. Ono, K. Sadakane, M. Yamashita, Approximation algorithms for constructing evolutionary trees from rooted triplets, in: Proceedings of 10th Korea-Japan Joint Workshop on Algorithms and Computation, 2007, pp. 56–63. [35] N. Grassly, A. Rambaut, Treevole: a program to simulate the evolution of
650
dna sequences under different population dynamic scenarios. 1.3. wellcome centre for infectious disease, department of zoology (1997). 34
Author Contribution Hadi Poormohammadi: Conceptualization, Methodology, Validation, Writing - Original Draft, Writing - Review & Editing, Project administration, Su655
pervision. Mohsen Sardari Zarchi: Methodology, Software, Investigation, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization. Hossein Ghaneai: Software, Formal analysis, Investigation, Writing - Review & Editing.
35