Discrete Applied Mathematics 180 (2015) 126–134
Weight-constrained and density-constrained paths in a tree: Enumerating, counting, and k-maximum density paths

Chia-Wei Lee a, Pin-Liang Chen a, Sun-Yuan Hsieh a,b,c,∗

a Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan
b Institute of Medical Informatics, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan
c Institute of Manufacturing Information and Systems, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan

∗ Corresponding author at: Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan. E-mail addresses: [email protected] (C.-W. Lee), [email protected] (P.-L. Chen), [email protected] (S.-Y. Hsieh).
http://dx.doi.org/10.1016/j.dam.2014.07.024
Article info
Article history: Received 25 July 2013; Received in revised form 25 June 2014; Accepted 21 July 2014; Available online 11 August 2014.
Keywords: Counting mode; Design and analysis of algorithms; Feasible paths; k-maximum density path problem; Network design; Trees
Abstract
Let T be a tree of n nodes in which each edge is associated with a value and a weight that are a real number and a positive integer, respectively. Given two integers wmin and wmax, and two real numbers dmin and dmax, a path P in T is feasible if the sum of the edge weights in P is between wmin and wmax, and the ratio of the sum of the edge values in P to the sum of the edge weights in P is between dmin and dmax. In this paper, we first present an O(n log² n + h)-time algorithm to find all feasible paths in a tree, where h = O(n²) if the output of a path is given by its end-nodes. Then, we present an O(n log² n)-time algorithm to count the number of all feasible paths in a tree. Finally, we present an O(n log² n + h)-time algorithm to find the k feasible paths whose densities are the k largest of all feasible paths.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction

Given a sequence S of n number pairs (ai, wi), where wi > 0 for i ∈ {1, . . . , n}, and two weight bounds wmin and wmax, the maximum-density segment problem involves finding a consecutive subsequence S(i, j) of S, where 1 ≤ i < j ≤ n, such that wmin ≤ wi + wi+1 + · · · + wj ≤ wmax and the density of S(i, j), that is, (ai + ai+1 + · · · + aj)/(wi + wi+1 + · · · + wj), is maximum over all consecutive
subsequences S(i′, j′) of S, where 1 ≤ i′ < j′ ≤ n, satisfying wmin ≤ wi′ + wi′+1 + · · · + wj′ ≤ wmax. This problem arises in the investigation of the non-uniformity of nucleotide compositions within genomic sequences; it was first identified in thermal melting and gradient centrifugation experiments [16,23]. In molecular biology and genetics, the GC-content (or guanine–cytosine content) is the percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine (out of the four possible bases, which also include adenine and thymine). It may refer to a specific fragment of DNA or RNA, or to the whole genome. Researchers have observed that the compositional heterogeneity of nucleotides is highly correlated with the GC-content of genomic sequences [25,28], and this motivates the search for GC-rich segments.

For the maximum-density segment problem, Goldwasser, Kao, and Lu [11] proposed an O(n log(wmax − wmin + 1))-time algorithm. Subsequently, Chung and Lu [6] designed an O(n)-time algorithm that bypasses the complicated preprocessing step required in [11]. More research related to the special case in which wi = 1 for all indices i can be found in [11,15,18,20,25,26]. Other research fields that consider the concept of "density" are discussed in [1,4,9,17].

In this paper, we consider densities on a tree topology. Let T = (V, E) be a tree of n nodes in which each edge e ∈ E is associated with a pair (v(e), w(e)), where v(e) is a real number that represents the value of e, and w(e) is a positive integer
that represents the weight of e. Given a path P = ⟨v0, e1, v1, . . . , ek, vk⟩, the weight of P, denoted by w(P), is defined as w(e1) + w(e2) + · · · + w(ek); the value of P, denoted by v(P), is defined as v(e1) + v(e2) + · · · + v(ek); and the density of P, denoted by d(P), is defined as v(P)/w(P). Given two integers wmin and wmax, and two real numbers dmin and dmax, a path P is feasible if wmin ≤ w(P) ≤ wmax and dmin ≤ d(P) ≤ dmax. Given a tree T of n nodes, two integers wmin and wmax, and two real numbers dmin and dmax, we consider the following three problems:
• All Weight-constrained and Density-constrained Paths problem (AWDP): find all feasible paths in T.
• Counting mode of the All Weight-constrained and Density-constrained Paths problem (CAWDP): count the number of all feasible paths in T.
• k-Maximum Density Path problem (k-MDP): for an integer k, find the k feasible paths in T whose densities are the k largest of all the feasible paths.

These problems arise in several practical applications. For example, consider a network design application. Given a tree network in which the value and the weight of a link (edge) represent the traffic load and the upgrade cost, respectively, we may wish to upgrade the network by replacing a path whose upgrade cost and profit (traffic load/upgrade cost) are reasonable, i.e., within some specific range. After finding such a path, we can replace it with a path containing high-speed links to increase the profit. In another application, Chao et al. [5] considered constrained alignments consisting of aligned pairs in nearly optimal alignments. The algorithms developed in this paper can help select the subalignment that has a reasonable cumulative average score from among all high-scoring alignments.

When the input tree is a path, the AWDP and CAWDP problems are equivalent, respectively, to the density range query problem and the counting mode of the density range query problem defined in [22]. The k-MDP problem is equivalent to the k maximum densities problem proposed in [22]. However, the algorithm presented in [22] is randomized, whereas our algorithm is deterministic.

Several related works have been reported in the literature. For example, given a tree of n nodes in which each node is associated with a value and a weight, and two integers wmin and wmax, Hsieh and Chou [13] proposed an O(nwmax)-time (respectively, O(nwmax²)-time) algorithm for locating a maximum-density path (respectively, subtree) whose weight is between wmin and wmax. Under the assumption that the values associated with the edges are restricted to integers, Lau et al. [19] presented an O(n log² n)-time algorithm for the above problem, and Su et al. [29] developed an O(nwmax log n)-time algorithm for locating a maximum-density subtree. Given a tree of n nodes in which each edge is associated with a value and a weight, two integers wmin and wmax, and a lower bound L, Hsieh and Cheng [12] presented an O(nwmax L)-time algorithm for finding a maximum-density path P such that wmin ≤ w(P) ≤ wmax and the length of P is at least L. More related results can be found in [18,21,31].

In this paper, we first show that AWDP can be solved in O(n log² n + h) time, where h = O(n²) if the output of a path is given by its end-nodes. Based on this result, we also show that CAWDP can be solved in O(n log² n) time. Finally, we show that k-MDP can be solved in O(n log² n + h) time. To the best of our knowledge, these results have not been reported before. A summary of previous related results and our results is provided in Table 1.

The remainder of this paper is organized as follows. In Section 2, we present a method that serves as a subroutine of our algorithms for AWDP, CAWDP, and k-MDP, which we discuss in Sections 3–5, respectively. Section 6 contains some concluding remarks.

2. Splitting a tree

In this section, we present a method for splitting a given tree into three subtrees. The method serves as a subroutine for our algorithms (described in Sections 3–5). We begin with some definitions.

A graph G = (V, E) is comprised of a node (vertex) set V and an edge set E, where V is a finite set and E is a subset of {(u, v) | (u, v) is an unordered pair of V}. We also use V(G) and E(G) to denote the node set and edge set of G, respectively.
If (u, v) ∈ E(G), we say that u and v are adjacent, and that u (or v) is incident to the edge (u, v). A subgraph of G = (V, E) is a graph (V′, E′) such that V′ ⊆ V and E′ ⊆ E. Given U ⊆ V(G), the subgraph of G induced by U is defined as G[U] = (U, {(u, v) ∈ E(G) | u, v ∈ U}). A tree T is a connected and acyclic graph, and a rooted tree T is a tree in which one node is chosen as the root. For a node v in a rooted tree T with root r, the parent of v, denoted by par(v), is its neighbor on the unique path from v to r; and v's children, denoted by child(v), are its other neighbors. A path with end-nodes v0 and vk, denoted by P[v0, vk], is a sequence ⟨v0, v1, . . . , vk⟩ of distinct nodes in which any two consecutive nodes are adjacent. For a given node v in a rooted tree, the paths starting from v can be classified into two types: a downward path, which stretches downward to v's children only; and an upward path, which includes v's parent.

For any tree T = (V, E) containing n nodes, a centroid of T is a node c ∈ V such that if we delete c and the edges incident to c, each resulting subtree contains no more than ⌊n/2⌋ nodes. Every tree has at least one centroid [10,14,31]. A centroid of a tree is also called a 1-median for unit-weighted nodes [10,14], and it can be found in linear time [10,14,31] (if T has two centroids c1 and c2, we can choose either one). After finding a centroid c of T, we can split T into three subtrees T1, T2 and T3, all rooted at c, such that V(Ti) ∩ V(Tj) = {c} and |V(Tk)| ≤ ⌊n/2⌋ + 1, where i, j, k ∈ {1, 2, 3} and i ≠ j. Moreover, if T can be partitioned into two parts such that the size of each part is at most ⌈n/2⌉, then T is split into only two subtrees; that is, |V(T3)| = 0.
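Finding a centroid takes linear time, as noted above. The following is a minimal Python sketch of one way to do this, under the assumption that the tree is stored as an adjacency list indexed by 0, . . . , n−1; the function name and representation are our own illustrative choices, not taken from the paper.

    def find_centroid(adj, root=0):
        """Return a centroid of a tree given as an adjacency list adj[u].
        Deleting the returned node leaves components of at most n // 2 nodes."""
        n = len(adj)
        parent = [None] * n
        order = []                      # nodes in DFS discovery order
        visited = [False] * n
        visited[root] = True
        stack = [root]
        while stack:                    # iterative DFS
            u = stack.pop()
            order.append(u)
            for v in adj[u]:
                if not visited[v]:
                    visited[v] = True
                    parent[v] = u
                    stack.append(v)
        size = [1] * n
        for u in reversed(order):       # subtree sizes, children before parents
            if parent[u] is not None:
                size[parent[u]] += size[u]
        for u in order:
            # Components left after deleting u: each child's subtree and the
            # "upward" component of n - size[u] nodes.
            pieces = [size[v] for v in adj[u] if parent[v] == u]
            if parent[u] is not None:
                pieces.append(n - size[u])
            if max(pieces, default=0) <= n // 2:
                return u
        return root                     # unreachable for a valid tree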
Table 1
Summary of related results for trees.

Problem description                                      Weight constraint        Density constraint      Time complexity     Reference
Finding a maximum-sum path                               w(P) ≤ wmax              None                    O(n log² n)         [31]
Finding a maximum-density path                           wmin ≤ w(P)              None                    O(nwmin)            [21]
Finding a maximum-density path                           wmin ≤ w(P) ≤ wmax       None                    O(nwmax)            [13]
Finding a maximum-density subtree                        wmin ≤ w(P) ≤ wmax       None                    O(nwmax²)           [13]
Finding a maximum-density path                           wmin ≤ w(P) ≤ wmax       None                    O(n log² n)         [19]
Finding a maximum-density subtree                        wmin ≤ w(P) ≤ wmax       None                    O(nwmin²)           [19]
Finding a maximum-density path with length at least L    wmin ≤ w(P) ≤ wmax       None                    O(nwmax L)          [12]
Finding a maximum-density subtree                        wmin ≤ w(P) ≤ wmax       None                    O(nwmax log n)      [29]
Finding a longest path                                   wmin ≤ w(P)              None                    O(n log n)          [18]
Finding all feasible paths                               wmin ≤ w(P) ≤ wmax       dmin ≤ d(P) ≤ dmax      O(n log² n + h)     This paper
Counting all feasible paths                              wmin ≤ w(P) ≤ wmax       dmin ≤ d(P) ≤ dmax      O(n log² n)         This paper
Finding k maximum density paths                          wmin ≤ w(P) ≤ wmax       dmin ≤ d(P) ≤ dmax      O(n log² n + h)     This paper
Hereafter, we use n1, n2, and n3 to denote the number of nodes in T1, T2 and T3, respectively. Next, by traversing all the nodes in Ti in a top-down fashion for i = 1, 2, 3, we construct lists Ai such that Ai[x] = (vi(x), wi(x)) for each node x ∈ V(Ti) \ {c}, where vi(x) and wi(x) are, respectively, the value and the weight of the path from the centroid c to x in Ti. After constructing A1, A2 and A3, we sort the entries in each Ai in non-decreasing order according to their weights.

Fig. 1 shows an example of Algorithm Tree_Split. The centroid of the tree T is A, and its children are B, F, I and K. Let TB, TF, TI and TK be the subtrees of T rooted at B, F, I and K, respectively, and let |TB|, |TF|, |TI| and |TK| be their numbers of nodes. We have |TB| = 4, |TF| = 3, |TI| = 2, and |TK| = 3. Since |T| = 13, Algorithm Tree_Split splits T into subtrees such that the number of nodes in each subtree is at most ⌊n/2⌋ + 1 = 7. After executing Algorithm Tree_Split, the tree T is split into three subtrees T1, T2 and T3, and we obtain the resulting lists A1, A2 and A3.

The above steps are described in the following algorithm.

Algorithm Tree_Split(T, v, w)
Input: A tree T = (V, E) in which each edge e ∈ E is associated with a pair (v(e), w(e)).
Output: Three sorted sequences A1, A2 and A3.
1: find a centroid c in T and split T into three subtrees T1, T2 and T3
2: for each tree Ti, where 1 ≤ i ≤ 3, do
3:     for each node x ∈ V(Ti) do
4:         construct Ai[x]
5: sort the entries in A1, A2 and A3 in non-decreasing order according to their weights

Since Lines 1–4 can be implemented in linear time and Line 5 can be implemented in O(n log n) time by using a sorting algorithm, we have the following result.

Lemma 1. Algorithm Tree_Split(T, v, w) can be implemented to run in time O(n log n), where n is the number of nodes in T.

3. Enumerating all feasible paths

In this section, we describe our algorithm for solving AWDP. First, we invoke Algorithm Tree_Split(T, v, w) to split a tree T into three subtrees T1, T2 and T3. Recall that, for any two subtrees Ta and Tb with a, b ∈ {1, 2, 3} and a ≠ b, V(Ta) ∩ V(Tb) = {c}, where c is the centroid of T. Moreover, a feasible path is either a downward path contained in a single subtree or an upward path crossing two of the subtrees. To find all feasible paths containing c, we execute the following steps for each pair (Ta, Tb), where 1 ≤ a < b ≤ 3.

For each node x ∈ V(Ti), 1 ≤ i ≤ 3, we construct two sequences Bi and Ci, where Bi[x] = vi(x) − wi(x) · dmax and Ci[x] = vi(x) − wi(x) · dmin. To use these sequences, we need the following technical lemma.

Lemma 2. Let x ∈ V(Ta) and y ∈ V(Tb), where 1 ≤ a < b ≤ 3. Then the path P[x, y] is feasible if and only if Bb[y] ≤ −Ba[x], Cb[y] ≥ −Ca[x], and wmin ≤ wa(x) + wb(y) ≤ wmax.

Proof. Since Ba[x] + Bb[y] = (va(x) + vb(y)) − (wa(x) + wb(y)) · dmax and d(P[x, y]) = (va(x) + vb(y))/(wa(x) + wb(y)), we conclude that d(P[x, y]) ≤ dmax if and only if Bb[y] ≤ −Ba[x]. Similarly, d(P[x, y]) ≥ dmin if and only if Cb[y] ≥ −Ca[x]. Together with the weight condition wmin ≤ w(P[x, y]) = wa(x) + wb(y) ≤ wmax, the lemma follows. □
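As a small illustration of the sequences Bi, Ci and of the test in Lemma 2, consider the following sketch; the function names and the dictionary representation of the lists Ai are our own choices.

    def build_B_C(A, dmin, dmax):
        """A maps each node x of one subtree to (v_i(x), w_i(x)), the value and
        weight of the path from the centroid c to x.  Returns the sequences
        B[x] = v_i(x) - w_i(x)*dmax and C[x] = v_i(x) - w_i(x)*dmin."""
        B = {x: v - w * dmax for x, (v, w) in A.items()}
        C = {x: v - w * dmin for x, (v, w) in A.items()}
        return B, C

    def is_feasible_pair(x, y, Aa, Ab, Ba, Ca, Bb, Cb, wmin, wmax):
        """Lemma 2: the path P[x, y] through the centroid is feasible iff
        Bb[y] <= -Ba[x], Cb[y] >= -Ca[x], and wmin <= wa(x) + wb(y) <= wmax."""
        w_total = Aa[x][1] + Ab[y][1]
        return Bb[y] <= -Ba[x] and Cb[y] >= -Ca[x] and wmin <= w_total <= wmax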
Fig. 1. Illustration of the construction of the lists A1, A2 and A3. (a) The tree T, whose root is the centroid A. (b) The tree T is split into T1, T2 and T3 by executing Step 1, where V(T1) = {A, B, C, D, E}, V(T2) = {A, F, G, H, I, J}, and V(T3) = {A, K, L, M}. (c) The resulting lists A1, A2 and A3, obtained by executing Steps 2–3.
Fig. 2. A min-heap viewed as (a) a binary tree and (b) an array. The number in the circle at each node in the tree is the value stored at that node. The number above a node is the corresponding index in the array. The lines above and below the array show the parent–child relationships; parents are always to the left of their children.
The (binary) heap data structure [7,30] is an array object that can be regarded as a nearly complete binary tree, as shown in Fig. 2. Each node of the tree corresponds to an element of the array that stores the value in the node. The tree is completely full on all levels except possibly the lowest, which is filled from the left up to a point. There are two kinds of binary heaps: max-heaps and min-heaps. In a max-heap, the value of a node is at most the value of its parent. Thus, the largest element in a max-heap is stored at the root, and the subtree rooted at a node contains values no larger than that contained in the node. In a min-heap, the value of a node is at least the value of its parent. Thus, the smallest element in a min-heap is stored at the root, and the subtree rooted at a node contains values no smaller than that contained in the node.

To solve the AWDP problem, we utilize a data structure called a priority search tree [24]. This is a hybrid data structure that combines a min-heap with a balanced binary tree: the y-coordinates of the points are maintained by a min-heap and the x-coordinates of the points are maintained by a balanced binary tree.
Fig. 3. Illustration of a priority search tree. The point (x, y) stored in each node is the data that will be searched. The circles store the x-coordinates of each point, which are maintained by a balanced binary tree; and the squares store the y-coordinates of each point, which are maintained by a min-heap.
It supports each insertion and deletion of a point in O(log n) time and each search in O(log n + r) time, where n is the number of nodes in the priority search tree and r is the number of reported points. Fig. 3 shows an example of a priority search tree.

Let V(T1) = {x1, x2, . . . , xn1} and V(T2) = {y1, y2, . . . , yn2} such that w(x1) ≥ w(x2) ≥ · · · ≥ w(xn1) and w(y1) ≤ w(y2) ≤ · · · ≤ w(yn2). We define Q = {q1, q2, . . . , qn1}, where qi = (−B1[xi], −C1[xi]) for 1 ≤ i ≤ n1, and Fj = {qi | 1 ≤ i ≤ n1 and wmin ≤ w1(xi) + w2(yj) ≤ wmax} for 1 ≤ j ≤ n2. Next, we describe how the pair (T1, T2) is processed to find all feasible paths crossing T1 and T2; the other pairs (T1, T3) and (T2, T3) can be processed in a similar manner.

We can view each element in Fj as a point represented by an x-coordinate and a y-coordinate. Moreover, we utilize a priority search tree to maintain Fj because it supports quick searches as well as the insertion and deletion of points. We execute n2 iterations from j = 1 to n2 such that, in the jth iteration, we obtain Fj from Fj−1 using a priority search tree PT as follows. Let F0 = ∅. In the first iteration (j = 1), we scan the points in Q starting from i = 1 and insert qi into PT if w1(xi) + w2(y1) ≥ wmin; otherwise, we go to the next iteration (if w1(xi) + w2(y1) < wmin, we need not examine qi+1 either, because w(x1) ≥ w(x2) ≥ · · · ≥ w(xn1) implies w1(xi+1) + w2(y1) < wmin). In this manner, we obtain F1 from F0. Now, suppose that we have obtained Fj−1. We first delete from the current tree PT the points qk with w1(xk) + w2(yj) > wmax. Then, we continue to scan the remaining points in Q and insert qi into PT if w1(xi) + w2(yj) ≥ wmin, or go to the next iteration if w1(xi) + w2(yj) < wmin. Hence, we obtain Fj from Fj−1. By finding the set of points in Fj that satisfy B2[yj] ≤ −B1[xi] and C2[yj] ≥ −C1[xi], based on Lemma 2, we can generate all feasible paths P[xi, yj] containing the centroid c.

After executing the above steps, we recursively call the algorithm on the subtrees T1, T2 and T3 to find all feasible paths within each Ti for 1 ≤ i ≤ 3. The steps of our algorithm are summarized below.

Algorithm Finding_All_Feasible_Paths(T, v, w)
Input: A tree T = (V, E) in which each edge e ∈ E is associated with a pair (v(e), w(e)).
Output: All feasible paths.
1: call Algorithm Tree_Split(T)
2: for each pair (Ta, Tb) with 1 ≤ a < b ≤ 3 do
3:     construct Q = {q1, q2, . . . , qna}, where qi = (−Ba[xi], −Ca[xi])
4:     F0 = ∅
5:     for j ← 1 to nb do
6:         delete points qi from Fj−1 if wa(xi) + wb(yj) > wmax
7:         insert points qi into Fj−1 if wa(xi) + wb(yj) ≥ wmin
8:         generate all feasible paths by finding the set of points in Fj
9: recursively call Algorithm Finding_All_Feasible_Paths on each subtree T1, T2 and T3
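The following sketch illustrates the sweep over the pair (T1, T2) described above. For brevity it keeps the active set Fj in a plain Python list and filters it directly, rather than in a priority search tree, so it does not achieve the stated time bound; it is meant only to make the window maintenance (Lines 6–7) and the Lemma 2 test (Line 8) concrete. All names are our own.

    def feasible_paths_crossing(A1, A2, dmin, dmax, wmin, wmax):
        """A1, A2: lists of (node, value, weight) for T1 and T2, where (value, weight)
        describes the path from the centroid c to the node.  Returns end-node pairs
        (x, y) of feasible paths crossing T1 and T2.  Simplified sketch: the active
        window F is a plain list, not a priority search tree."""
        xs = sorted(A1, key=lambda t: -t[2])   # T1 by non-increasing weight
        ys = sorted(A2, key=lambda t: t[2])    # T2 by non-decreasing weight
        result, F, i = [], [], 0
        for (y, vy, wy) in ys:
            B2, C2 = vy - wy * dmax, vy - wy * dmin
            # Line 6: drop points whose combined weight now exceeds wmax.
            F = [(x, vx, wx) for (x, vx, wx) in F if wx + wy <= wmax]
            # Line 7: admit new points once their combined weight reaches wmin.
            while i < len(xs) and xs[i][2] + wy >= wmin:
                F.append(xs[i])
                i += 1
            # Line 8: report the points of F that pass the Lemma 2 test.
            for (x, vx, wx) in F:
                B1, C1 = vx - wx * dmax, vx - wx * dmin
                if B2 <= -B1 and C2 >= -C1 and wmin <= wx + wy <= wmax:
                    result.append((x, y))
        return result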
Lemma 3. Algorithm Finding_All_Feasible_Paths finds all feasible paths correctly.

Proof. Every feasible path either contains the centroid c or it does not. If a feasible path contains c, then, based on Lemma 2, Lines 1–8 of the algorithm will find it. Moreover, Line 9 finds all feasible paths that do not contain c by executing a recursive call on each subtree. □

Let T(n) be the time complexity of Algorithm Finding_All_Feasible_Paths. By Lemma 1, the subroutine call described in Line 1 can be implemented in O(n log n) time. Using a priority search tree and the nodes recorded in the arrays A1–A3 obtained from Algorithm Tree_Split, all feasible paths containing the centroid c can be found in Lines 2–8 in O(n log n + hf) time, where hf is the number of feasible paths containing c. Thus, the time complexity of the algorithm can be formulated
with the following recurrence: T(n) = T(n1) + T(n2) + T(n3) + O(n log n + hf), where max{n1, n2, n3} ≤ ⌊n/2⌋ + 1. By solving the above recurrence, we obtain T(n) = O(n log² n + h). Based on the above time complexity analysis and Lemma 3, we have the following result.

Theorem 1. The problem AWDP can be solved in time O(n log² n + h), where h = O(n²) if the output of a path is given by its end-nodes.

4. Counting all feasible paths

We now present an algorithm to solve the CAWDP problem. In the first step, our algorithm invokes Algorithm Tree_Split(T, v, w) to split a tree T into three subtrees T1, T2 and T3. Then, it performs the following steps to process the pairs (Ta, Tb), where 1 ≤ a < b ≤ 3. Without loss of generality, we only consider how to process the pair (T1, T2), as the other pairs (T1, T3) and (T2, T3) can be processed in a similar manner. For 1 ≤ j ≤ n2, let Xj = {qi ∈ Fj | B2[yj] > −B1[xi]} and Yj = {qi ∈ Fj | C2[yj] < −C1[xi]}.

We utilize a data structure called an order-statistic tree [7]. The ith order statistic of a set of n elements, where i ∈ {1, 2, . . . , n}, is simply the element in the set with the ith smallest key. An order-statistic tree is a data structure that supports fast order-statistic operations. It is simply a red–black tree (i.e., a balanced binary search tree) with additional information stored at each node. Besides the usual red–black tree fields key[u], color[u], parent[u], left[u], and right[u] of a node u (key[u] stores the key value associated with u; parent[u], left[u], and right[u] point to its parent, left child, and right child, respectively; and color[u] indicates whether u is red or black), we maintain a field size[u] that stores the number of nodes in the subtree rooted at u. The field size[u] is defined as size[u] = size[left[u]] + size[right[u]] + 1, where left[u] and right[u] are the left and right children of node u, respectively; and size[u] = 1 if u is a leaf. The order-statistic tree supports each insertion, deletion and search of a point in O(log n) time. Moreover, with the aid of the binary-search-tree order and the field size[u] associated with each node u, an order-statistic tree can count the number of elements whose key values fall within a specified range.

We construct two order-statistic trees, OT and OT′. The nodes in OT (respectively, OT′) are associated with the key values −B1[xi] (respectively, −C1[xi]) of the points qi in Fj, where 1 ≤ i ≤ n1 and 1 ≤ j ≤ n2. Then, we execute n2 iterations from j = 1 to n2 such that, in the jth iteration, we search OT and find the node u whose associated key value equals max_{1≤i≤n1} {−B1[xi] | B2[yj] > −B1[xi]}. Based on the properties of the order-statistic tree, |Xj| can be computed by the following lemma.

Lemma 4. Let r be the root of OT, let u be the node in OT whose associated key value equals max_{1≤i≤n1} {−B1[xi] | B2[yj] > −B1[xi]}, and let ⟨v0 = r, e1, v1, . . . , ek, vk = u⟩ be the unique path P[r, u] from r to u. For 1 ≤ j ≤ n2, let q = Σ_{0≤l≤k−1} (size[right[vl]] + 1), where the sum is taken over the nodes vl on P[r, u] whose associated key value is larger than B2[yj]. Then, |Xj| = size[r] − q − size[right[u]].

Proof. By the property of a binary search tree [7], if the key value associated with vl is larger than B2[yj], then the key values associated with vl and with the nodes in the subtree rooted at right[vl] must be larger than B2[yj]. Moreover, the key values associated with the nodes in the subtree rooted at right[u] must also be larger than B2[yj]. Clearly, these nodes do not belong to Xj. Hence, the result holds. Similarly, we can compute |Yj| for 1 ≤ j ≤ n2. □
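To make the counting step concrete, the following sketch supports the query behind Lemma 4, namely "how many keys currently stored are strictly smaller than a threshold". For brevity it uses a Fenwick (binary indexed) tree over rank-compressed keys instead of an augmented red–black tree; both give O(log n) insertion, deletion, and counting, but the class below is our own illustrative substitute, not the structure used in the paper.

    import bisect

    class OrderStatistic:
        """Counts, among currently inserted keys, how many are strictly smaller
        than a query value.  A Fenwick tree over rank-compressed keys stands in
        for the augmented red-black tree described in Section 4."""
        def __init__(self, all_keys):
            self.sorted_keys = sorted(set(all_keys))   # keys must be known up front
            self.bit = [0] * (len(self.sorted_keys) + 1)

        def insert(self, key, delta=1):
            i = bisect.bisect_left(self.sorted_keys, key) + 1
            while i <= len(self.sorted_keys):
                self.bit[i] += delta
                i += i & (-i)

        def delete(self, key):
            self.insert(key, -1)

        def count_less_than(self, threshold):
            """Number of stored keys strictly smaller than `threshold`."""
            i = bisect.bisect_left(self.sorted_keys, threshold)
            total = 0
            while i > 0:
                total += self.bit[i]
                i -= i & (-i)
            return total

    # Usage in the spirit of Section 4: OT holds the keys -B1[x_i] of the points
    # currently in F_j, and |X_j| = OT.count_less_than(B2[y_j]).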
Thus, |Xj| and |Yj| can be determined in O(log n) time. Next, we execute n2 iterations from j = 1 to n2 such that, in the jth iteration, we delete the points with w1(xi) + w2(yj) > wmax and insert the points with w1(xi) + w2(yj) ≥ wmin to obtain Fj from Fj−1. Then, the set Dj of feasible paths generated in the jth iteration satisfies

|Dj| = |Fj| − |Xj| − |Yj|.

After executing the above steps, we recursively call the algorithm on the subtrees T1, T2 and T3 to count the number of feasible paths in each Ti for 1 ≤ i ≤ 3. The correctness of the algorithm follows directly from the above discussion. The time complexity T(n) can be analyzed as follows. By Lemma 1, Algorithm Tree_Split can be implemented in O(n log n) time. Moreover, by utilizing two order-statistic trees, all feasible paths containing the centroid c can be counted in O(n log n) time. This means that the time complexity of the algorithm can be formulated by the following recurrence: T(n) = T(n1) + T(n2) + T(n3) + O(n log n), where max{n1, n2, n3} ≤ ⌊n/2⌋ + 1. By solving this recurrence, we obtain T(n) = O(n log² n). Hence, the following theorem holds.

Theorem 2. The CAWDP problem can be solved in time O(n log² n).

5. Finding k-maximum density paths

In this section, we present our solution to the k-MDP problem. Our algorithm first calls Algorithm Tree_Split(T, v, w) to split a tree T into three subtrees T1, T2 and T3. It then executes the following steps to process the pairs (Ta, Tb), where 1 ≤ a < b ≤ 3. Without loss of generality, we only consider how the pair (T1, T2) is processed, as the other pairs (T1, T3) and (T2, T3) can be processed in a similar manner.
Fig. 4. Illustration of the insertion operation in Iheap.
First, we construct a max-heap keyed by the densities of the feasible paths; recall that in a max-heap the largest element is stored at the root, and the subtree rooted at a node contains values no larger than that of the node. Then, we execute n2 iterations from j = 1 to n2 so that, in the jth iteration, we insert the densities of all the feasible paths into the max-heap as the paths are generated by finding the set of points in Fj that satisfy B2[yj] ≤ −B1[xi] and C2[yj] ≥ −C1[xi].

The main objective of our algorithm is to insert the densities of the feasible paths into a data structure and then select the k largest densities from among them. Hence, the data structure used to solve the problem must support quick insertion operations combined with a fast selection algorithm. Sleator and Tarjan [27] proposed a self-adjusting binary heap called a skew heap, which can support heap construction as well as search, insertion, and meld operations. However, the skew heap can do much more than we require in the algorithm. Therefore, we use a simpler version of the skew heap, namely Iheap, proposed by Brodal and Jørgensen [3]. The essential properties of Iheap are that it is represented by a heap-ordered binary tree and that insertions are supported in amortized constant time. When we need to insert an element into Iheap, we first traverse the rightmost path in a bottom-up manner until a larger element is found or the root is passed. Then, we insert the new element as the right child of the newly found larger element, or as the new root. The original element, which was the right child of the larger discovered element or was the root, becomes the left child of the inserted element (see Fig. 4, and the sketch after the example below). Thus, a new element is inserted into Iheap.

Lemma 5 ([3]). The insert operation in Iheap can be performed in amortized constant time.

After constructing the Iheap, which contains the densities of all the feasible paths, we need to select the k largest densities from it. The last step of our algorithm uses the heap selection algorithm proposed by Frederickson [8], which extracts the k largest elements from a max-heap in O(k) time. The step starts at the root, and explores a node only if its parent has been explored previously. The purpose of this step is to locate an element e with k ≤ rank(e) ≤ Ck for some constant C (the largest element has rank one). After this element has been identified, the input heap is traversed and all elements larger than e are extracted. Standard selection [2] is then used to obtain the k largest elements from the O(k) extracted elements.

To find e, the elements of the heap are organized into appropriately sized groups called clans. Clans are represented by their smallest elements, which are managed in binary heaps. By setting the clan size to ⌊log k⌋, we obtain an O(k log log k)-time algorithm. The steps are as follows. We construct the first clan by locating the ⌊log k⌋ largest elements, and we initialize a clan-heap with the representative of this clan. The children of the elements in the clan are associated with it and called its offspring. A new clan is constructed from a set of ⌊log k⌋ nodes in O(log k log log k) time using a heap. However, not all the elements in an offspring set are necessarily put into the new clan; the remaining elements are associated with the newly created clan and called the clan's poor relations.

Fig. 5 illustrates clans, offspring, representatives, and poor relations. Suppose k = 8; then the size of each clan is 3. The set of clan 1 is {15, 14, 13}, the representative of clan 1 is 13, the offspring set of clan 1 is {12, 4, 6, 9}, and the set of poor relations of clan 1 is empty. Clan 2 is then created from the offspring of clan 1: the set of clan 2 is {12, 11, 10}, the representative of clan 2 is 10, the offspring set of clan 2 is {5, 8}, and the set of poor relations of clan 2 is {4, 6, 9}.
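Returning to the Iheap: the insertion rule described before Lemma 5 (walk up the rightmost path, splice the new node in, and demote the displaced node to its left child) can be sketched as follows. The class and field names are ours, and the sketch assumes max-heap order on the stored densities.

    class Node:
        __slots__ = ("key", "item", "left", "right", "parent")
        def __init__(self, key, item=None):
            self.key, self.item = key, item
            self.left = self.right = self.parent = None

    class IHeap:
        """Heap-ordered binary tree with the insertion rule described above
        (max-heap order: every node's key is at most its parent's key)."""
        def __init__(self):
            self.root = None
            self.rightmost = None          # bottom node of the rightmost path

        def insert(self, key, item=None):
            node = Node(key, item)
            if self.root is None:
                self.root = self.rightmost = node
                return
            # Walk up the rightmost path until a larger (or equal) key is found
            # or the root is passed.
            cur = self.rightmost
            while cur is not None and cur.key < key:
                cur = cur.parent
            if cur is None:                 # passed the root: node becomes new root
                node.left, self.root.parent = self.root, node
                self.root = node
            else:                           # displace cur's right child
                node.left = cur.right
                if cur.right is not None:
                    cur.right.parent = node
                cur.right, node.parent = node, cur
            self.rightmost = node           # the new node ends the rightmost path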
Fig. 5. Illustration of the clan, offspring, representatives, and poor relations.
We delete the maximum clan from the clan-heap iteratively. In each iteration, we construct two new clans from the offspring and the poor relations of the deleted clan, and we insert their representatives into the clan-heap. After ⌈k/⌊log k⌋⌉ iterations, an element of rank at least k has been found, because the representative of the last clan deleted is the smallest of ⌈k/⌊log k⌋⌉ representatives. Since at most 2⌈k/⌊log k⌋⌉ + 1 clans are created and each takes O(log k log log k) time, the total time is O(k log log k). By applying this idea recursively and then bootstrapping it, a linear-time algorithm can be derived.

Lemma 6 ([8]). The heap selection algorithm proposed by Frederickson can find the k largest elements in time O(k).

By Lemmas 5 and 6 and a technique similar to that used in the proof of Theorem 1, we have the following result.

Theorem 3. The k-MDP problem can be solved in time O(n log² n + h), where h = O(n²) if the output of a path is given by its end-nodes.

6. Concluding remarks

In this paper, we have extended previous work from sequences to trees. We have solved the weight-constrained and density-constrained path problem in time O(n log² n + h), where h = O(n²) if the output of a path is given by its end-nodes, and we have solved the counting mode of the weight-constrained and density-constrained path problem in time O(n log² n). Moreover, we have solved the k-maximum density path problem in time O(n log² n + h). The proposed algorithms can be extended easily to deal with the case where nodes, as well as edges, are associated with value–weight pairs. In our future work, we will try to find other data structures that can be utilized to reduce the time complexity of the proposed algorithms. We will also attempt to find weight-constrained and density-constrained subtrees in a tree.

References

[1] R.M. Aliguliyev, Performance evaluation of density-based clustering methods, Inform. Sci. 179 (2009) 3583–3602.
[2] M. Blum, R.W. Floyd, V.R. Pratt, R.L. Rivest, R.E. Tarjan, Time bounds for selection, J. Comput. System Sci. 7 (4) (1973) 448–461.
[3] G.S. Brodal, A.G. Jørgensen, A linear time algorithm for the k maximal sums problem, in: Proceedings of the 32nd International Symposium on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science, vol. 4708 (2007) pp. 442–453.
[4] J. Bukor, L. Mišík, J.T. Tóth, Dependence of densities on a parameter, Inform. Sci. 179 (2009) 2903–2911.
[5] K.M. Chao, R.C. Hardison, W. Miller, Recent developments in linear-space alignment methods: a survey, J. Comput. Biol. 1 (1994) 271–291.
[6] K.M. Chung, H.I. Lu, An optimal algorithm for the maximum-density segment problem, SIAM J. Comput. 34 (2) (2004) 373–387.
[7] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, third ed., MIT Press, Cambridge, MA, 2009.
[8] G.N. Frederickson, An optimal algorithm for selection in a min-heap, Inform. and Comput. 104 (2) (1993) 197–214.
[9] A. Ghosh, A. Halder, M. Kothari, S. Ghosh, Aggregation pheromone density based data clustering, Inform. Sci. 178 (2008) 2816–2831.
[10] A.J. Goldman, Optimal center location in simple networks, Transp. Sci. 5 (2) (1971) 212–221.
[11] M.H. Goldwasser, M.Y. Kao, H.I. Lu, Fast algorithms for finding maximum-density segments of a sequence with applications to bioinformatics, J. Comput. System Sci. 70 (2) (2005) 128–144.
[12] S.Y. Hsieh, C.S. Cheng, Finding a maximum-density path in a tree under the weight and length constraints, Inform. Process. Lett. 105 (2008) 202–205.
[13] S.Y. Hsieh, T.Y. Chou, Finding a weight-constrained maximum-density subtree in a tree, in: Proceedings of the 16th International Symposium on Algorithms and Computation, Lecture Notes in Computer Science, vol. 3827 (2005) pp. 944–953.
[14] L.K. Hua, Applications of mathematical models to wheat harvesting, Chin. Math. 2 (1961) 77–91.
[15] X. Huang, An algorithm for identifying regions of a DNA sequence that satisfy a content requirement, Comput. Appl. Biosci. 10 (3) (1994) 219–225.
[16] R.B. Inman, A denaturation map of the λ phage DNA molecule determined by electron microscopy, J. Mol. Biol. 18 (1966) 464–476.
[17] Z. Jiang, Y.X. Huang, Parametric calibration of speed–density relationships in mesoscopic traffic simulator with data mining, Inform. Sci. 179 (2009) 2002–2013.
[18] S.K. Kim, Finding a longest nonnegative path in a constant degree tree, Inform. Process. Lett. 93 (2005) 275–279.
[19] H.C. Lau, T.H. Ngo, B.N. Nguyen, Finding a length-constrained maximum-sum or maximum-density subtree and its application to logistics, Discrete Optim. 3 (2006) 385–391.
[20] Y.L. Lin, T. Jiang, K.M. Chao, Algorithms for locating the length-constrained heaviest segments, with applications to biomolecular sequences analysis, J. Comput. System Sci. 65 (3) (2002) 570–586.
[21] R.R. Lin, W.H. Kuo, K.M. Chao, Finding a length-constrained maximum-density path in a tree, J. Combin. Optim. 9 (2) (2005) 147–156.
[22] T.C. Lin, D.T. Lee, Algorithmic studies of sequence manipulation and related problems (Ph.D. thesis), National Taiwan University, Taiwan, 2007.
[23] G. Macaya, J.P. Thiery, G. Bernardi, An approach to the organization of eukaryotic genomes at a macromolecular level, J. Mol. Biol. 108 (1976) 237–254.
[24] E.M. McCreight, Priority search trees, SIAM J. Comput. 14 (2) (1985) 257–276.
[25] A. Nekrutenko, W.H. Li, Assessment of compositional heterogeneity within and between eukaryotic genomes, Genome Res. 10 (2000) 1986–1995.
[26] P. Rice, I. Longden, A. Bleasby, EMBOSS: The European molecular biology open software suite, Trends Genet. 16 (6) (2000) 276–277.
[27] D.D. Sleator, R.E. Tarjan, Self-adjusting heaps, SIAM J. Comput. 15 (1) (1986) 52–69.
[28] N. Stojanovic, L. Florea, C. Riemer, D. Gumucio, J. Slightom, M. Goodman, W. Miller, R. Hardison, Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions, Nucleic Acids Res. 27 (1999) 3899–3910.
[29] H.H. Su, C.L. Lu, C.Y. Tang, An improved algorithm for finding a length-constrained maximum-density subtree in a tree, Inform. Process. Lett. 109 (2) (2008) 161–164.
[30] J.W.J. Williams, Algorithm 232: heapsort, Commun. ACM 7 (6) (1964) 347–348.
[31] B.Y. Wu, K.M. Chao, C.Y. Tang, An efficient algorithm for the length-constrained heaviest path problem on a tree, Inform. Process. Lett. 69 (1999) 63–67.