An O(n) algorithm for finding an optimal position with relative distances in an evolutionary tree

Information Processing Letters 63 (1997) 263-269, Elsevier

B.Y. Wu, C.Y. Tang *
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, ROC
Received 1 November 1996
Communicated by T. Asano

Abstract

An O(n) algorithm for finding an optimal position with relative distances in an evolutionary tree is presented in this paper. The optimality of a position is defined by the minimum incremental distance under the $L^\infty$-norm. The algorithm can also be used to solve similar problems with alternative criteria, such as the $L^1$-norm or minimum tree size. © 1997 Elsevier Science B.V.

Keywords: Evolutionary tree; Computational biology; Algorithms

1. Introduction

Constructing an evolutionary tree (phylogenetic tree) from pairwise dissimilarities has been a useful model for analyzing evolutionary relations between species. An evolutionary tree is a rooted tree with species as leaves and weights on edges. The distance between two species is the total weight of the path between the two leaves. A distance matrix, whose elements are dissimilarities between species, is said to be additive if there is an evolutionary tree realizing it, that is, the distance on the tree equals the corresponding entry of the matrix for any pair of nodes. If the distance data are additive, there are efficient algorithms for reconstructing the tree [1,6-8]. However, observed distances are hardly ever additive. In this case, most of the tree reconstruction problems with minimum error are NP-complete or still open [2-5,7], and there is no available algorithm for finding an optimal position in an evolutionary tree with relative distances.

The relative distances of a species x form a vector $D_x = (a_1, a_2, \ldots, a_n)$ in which each nonnegative $a_i$ stands for the observed distance between x and species i. Given an evolutionary tree in a database, it is an important problem to find the most suitable position in the tree for another species x with observed distance vector $D_x$. There might be several ways to define the term "most suitable". Just like the reconstruction problems defined in [5], the criterion may be the $L^\infty$-norm, the $L^1$-norm or minimum tree size. In this paper, we consider the $L^\infty$-norm problem, while the developed algorithm can be easily extended to the $L^1$-norm or minimum tree size problem. Therefore, the problem is: given an evolutionary tree and a distance vector, find a position in the tree such that the distance (on the tree) from the position to any leaf is no less than the given one and the difference between the resulting and given distances is minimized. The reason why the distances of the solution should be no less than the given ones is that the distances obtained by sequence alignments are believed to be lower bounds [5,8]. In [8], explanations of the phenomenon were also given.

The remaining sections are organized as follows: We define some notations in Section 2. In Section 3, the algorithm for the $L^\infty$-norm is presented. In Section 4, a conclusion is given.

Fig. 1. A position ((r, s), 4, 3) can be thought of as inserting a new edge into an evolutionary tree. The attached edge (r, s) is split into (r, y) and (y, s), and a new edge (y, x) is inserted.

* Corresponding author. Email: [email protected].

0020-0190/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0020-0190(97)00109-9

2. Definitions and notations

Definition (Farach et al. [5]). A phylogenetic (or evolutionary) tree $T = (V, E, w)$ is a rooted tree with node set $V$, edge set $E$, and nonnegative edge weight $w$. Every leaf of $T$ stands for a species.

We shall assume that there is no vertex with degree 2 in a tree. Let $i, j$ be two nodes of tree $T$; $d(T, i, j)$ denotes the distance between $i$ and $j$, that is, the total edge weight of the unique path between $i$ and $j$ on $T$.

Definition. Let $T = (V, E, w)$ be an evolutionary tree and $x \notin V$ be a new species. A position for $x$ in $T$ is a triple $(e, h, k)$, in which $e \in E$, $h$ and $k$ are values, $h \ge 0$, and $w(e) \ge k \ge 0$.

The meaning of a position $(e, h, k)$ for $x$ can be thought of as inserting a new edge $(y, x)$ into the tree, where $y$ is a point on an edge $e = (r, s)$ with $w(r, y) = k$ and $w(y, x) = h$. The attached edge $e$ is split into two edges $(r, y)$ and $(y, s)$ (Fig. 1). Thus, for any $i \in V$, we define the distance from $i$ to a position to be the distance to the newly inserted leaf $x$. Formally:

Definition. Let $T = (V, E, w)$ be an evolutionary tree, $i \in V$, and $f = ((r, s), h, k)$ be a position in which $r$ is the parent of $s$. Define

$$d(T, f, i) = d(T, i, f) = \min\{d(T, i, r) + k,\ d(T, i, s) + w(r, s) - k\} + h.$$

Problem MIP$^\infty(T, D_x)$ (Minimum Increment for Position under $L^\infty$-norm): Given an evolutionary tree $T$ with leaves $\{1, 2, \ldots, n\}$ and a distance vector $D_x = (a_1, a_2, \ldots, a_n)$, find a position $f$ for $x$ in $T$ such that $d(T, f, i) \ge a_i$ for all $1 \le i \le n$ and $\max_{1 \le i \le n}(d(T, f, i) - a_i)$ is minimized. We use $C(T, D_x)$ to denote the minimum solution, that is,

$$C(T, D_x) = \min_f \{\max_{1 \le i \le n} \{d(T, f, i) - a_i\}\}.$$
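The definitions above can be illustrated with a minimal Python sketch. It assumes the distances d(T, i, r) and d(T, i, s) from a leaf to the two endpoints of the attached edge are already known; the function names `position_distance` and `max_increment` and the numbers in the example are hypothetical and do not come from the paper.

```python
def position_distance(d_ir, d_is, w_rs, h, k):
    """d(T, f, i) for the position f = ((r, s), h, k), where w(r, y) = k."""
    return min(d_ir + k, d_is + w_rs - k) + h


def max_increment(tree_dists, observed):
    """Objective of the problem, or None when some tree distance is too short."""
    gaps = [d - a for d, a in zip(tree_dists, observed)]
    if min(gaps) < 0:
        return None  # the constraint d(T, f, i) >= a_i is violated
    return max(gaps)


# Hypothetical data: edge (r, s) of weight 5 and the position ((r, s), 4, 3).
# Leaf 1 hangs below s, leaf 2 is reached through r.
d1 = position_distance(d_ir=6, d_is=1, w_rs=5, h=4, k=3)
d2 = position_distance(d_ir=2, d_is=7, w_rs=5, h=4, k=3)
```

With observed distances (6, 9) this position is feasible and `max_increment([d1, d2], [6, 9])` is its increment; with (8, 9) the first constraint fails and the sketch returns `None`.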

3. An O(n) algorithm

For any edge e in T, there exists a feasible solution (e, h, k), since the constraints can always be satisfied when h is sufficiently long. We define the subproblem with a specified edge as follows:

Problem. Given an evolutionary tree $T$ with leaves $\{1, 2, \ldots, n\}$, an edge $e$ in $T$, and a distance vector $D_x = (a_1, a_2, \ldots, a_n)$, find a position $f = (e, h, k)$ for $x$ such that $d(T, f, i) \ge a_i$ and $\max_{1 \le i \le n}\{d(T, f, i) - a_i\}$ is minimized. The problem is denoted as MIPE$^\infty(T, D_x, e)$ and $C_e(T, D_x)$ is the minimum solution, that is,

$$C_e(T, D_x) = \min_{f = (e, h, k)} \{\max_{1 \le i \le n} \{d(T, f, i) - a_i\}\}.$$

Since there exists a solution of MIPE$^\infty(T, D_x, e)$ for each $e$ and at least one of the solutions of MIPE$^\infty(T, D_x, e)$ must be a solution of MIP$^\infty(T, D_x)$, we have the next lemma:

Lemma 1. $C(T, D_x) = \min_{e \in E} \{C_e(T, D_x)\}$.

The next lemma shows how to find $C_e(T, D_x)$. We first define notations to simplify the formula.

Definition. For any edge $e = (s, r)$ of $T$, deleting $e$ will result in two subtrees. We denote the subtree containing $s$ by $T_s$, and the other by $T_r$. The leaf sets of $T_s$ and $T_r$ are $V_s$ and $V_r$ respectively. Let $p_i = d(T, s, i)$ if leaf $i \in V_s$ and $p_i = d(T, r, i)$ if leaf $i \in V_r$. We also define

$$p_{\max}(e, s) = \max\{a_i - p_i : i \in V_s\}, \qquad p_{\max}(e, r) = \max\{a_i - p_i : i \in V_r\},$$

$$p_{\min}(e, s) = \min\{a_i - p_i : i \in V_s\}, \qquad p_{\min}(e, r) = \min\{a_i - p_i : i \in V_r\}.$$

Lemma 2. Let $e = (s, r)$. $C_e(T, D_x)$ can be computed in constant time if $p_{\max}(e, s)$, $p_{\max}(e, r)$, $p_{\min}(e, s)$ and $p_{\min}(e, r)$ are given.

Proof. Observe that any feasible solution $(e, h, k)$ splits $e = (s, r)$ into $(s, y)$ and $(y, r)$ and inserts an edge $(y, x)$. ($y$ may coincide with $s$ or $r$.) The problem of finding $C_e(T, D_x)$ is to select $h$ and $k$ which minimize $\max_i (d(T, x, i) - a_i)$ subject to $w(e) \ge k \ge 0$, $h \ge 0$ and $d(T, x, i) \ge a_i$ for all $i$. Since

$$d(T, x, i) = \begin{cases} h + k + p_i & \text{if } i \in V_s, \\ h + w(e) - k + p_i & \text{if } i \in V_r, \end{cases}$$

we can rewrite the objective function as

$$\max_i \{d(T, x, i) - a_i\} = \max \begin{cases} h + k + p_i - a_i & \forall i \in V_s \\ h + w(e) - k + p_i - a_i & \forall i \in V_r \end{cases} = \max\{h + k - p_{\min}(e, s),\ h + w(e) - k - p_{\min}(e, r)\}$$

under the constraints

$$h + k \ge p_{\max}(e, s), \qquad h + w(e) - k \ge p_{\max}(e, r), \qquad w(e) \ge k \ge 0, \qquad h \ge 0.$$

If $p_{\max}(e, s)$, $p_{\max}(e, r)$, $p_{\min}(e, s)$ and $p_{\min}(e, r)$ are given, this is just a linear program with 2 variables and 5 constraints. It is not hard to write down its closed form, which can be computed in constant time. □
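The proof leaves the closed form to the reader. One possible way to evaluate it in constant time is sketched below in Python; this is an illustration under stated assumptions, not the authors' formula. It assumes the four p-values are finite, measures k from the endpoint s as in the proof, and uses the hypothetical name `solve_edge`. For fixed k the best h is the smallest feasible one, and the resulting cost is convex and piecewise linear in k, so only a fixed number of candidate values of k need to be checked.

```python
def solve_edge(w, pmax_s, pmax_r, pmin_s, pmin_r):
    """Return (C_e, h, k) for one edge of weight w; k is measured from s."""

    def h_of(k):
        # Smallest h with h >= 0, h + k >= pmax_s and h + w - k >= pmax_r.
        return max(0.0, pmax_s - k, pmax_r - (w - k))

    def cost(k):
        # Objective max(h + k - pmin_s, h + w - k - pmin_r) at the smallest h.
        return h_of(k) + max(k - pmin_s, (w - k) - pmin_r)

    # cost is convex and piecewise linear in k, so its minimum over [0, w]
    # lies at an endpoint or where two of the linear pieces cross.
    candidates = [0.0, w, (pmax_s - pmax_r + w) / 2, pmax_s, w - pmax_r,
                  (w + pmin_s - pmin_r) / 2]
    k = min((min(max(c, 0.0), w) for c in candidates), key=cost)
    return cost(k), h_of(k), k


# Hypothetical edge of weight 2 with p_max = (4, 3) and p_min = (3, 2).
print(solve_edge(2, 4, 3, 3, 2))  # (1.0, 2.5, 1.5)
```

Checking six candidates keeps the work per edge constant, which is all that Lemma 2 requires.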

In order to find $C(T, D_x)$, all we need to do is to compute $p_{\max}(e, s)$, $p_{\min}(e, s)$, $p_{\max}(e, r)$, and $p_{\min}(e, r)$ for every edge $e$. In the following, we concentrate on $p_{\max}(e, s)$ and $p_{\max}(e, r)$; computing $p_{\min}(e, s)$ and $p_{\min}(e, r)$ is similar. The next lemma comes directly from the definitions.

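Lemma 3. Let $e = (s, r)$, in which $s$ is either parent or son. Then

$$p_{\max}(e, s) = \begin{cases} a_s & \text{if } s \text{ is a leaf,} \\ \max_{y \ne r} \{p_{\max}((y, s), y) - w(y, s)\} & \text{otherwise,} \end{cases}$$

where $y$ ranges over the neighbours of $s$ other than $r$.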

Considering an evolutionary tree $T$, we say $e_1 = (v_1, v_2)$ is the parent edge of $e_2 = (v_2, v_3)$ if $v_1$ is the parent of $v_2$ and $v_2$ is the parent of $v_3$. Similarly, edges with the same parent are called siblings. In the following, when an edge is written as $(v_1, v_2)$, we shall assume $v_1$ is the parent and $v_2$ is the son.

Lemma 4.

$$p_{\max}((v_1, v_2), v_2) = \begin{cases} a_{v_2} & \text{if } v_2 \text{ is a leaf,} \\ \max_{(v_2, v_3)} \{p_{\max}((v_2, v_3), v_3) - w(v_2, v_3)\} & \text{otherwise,} \end{cases}$$

$$p_{\max}((v_1, v_2), v_1) = \max \left\{ \max_{(v_1, v_3)} \{p_{\max}((v_1, v_3), v_3) - w(v_1, v_3)\},\ p_{\max}((v_0, v_1), v_0) - w(v_0, v_1) \right\},$$

in which $v_2$ and $v_3$ are assumed to be different nodes.

Proof. Let $e = (v_1, v_2)$. Since $v_2$ is the son, the set $\{(y, v_2), (v_2, y) \mid \forall y \ne v_1\}$ is equivalent to $\{(v_2, v_3) \mid \forall v_3\}$. From Lemma 3,

$$p_{\max}((v_1, v_2), v_2) = \begin{cases} a_{v_2} & \text{if } v_2 \text{ is a leaf,} \\ \max_{(v_2, v_3)} \{p_{\max}((v_2, v_3), v_3) - w(v_2, v_3)\} & \text{otherwise.} \end{cases}$$

Since $v_1$ is the parent, the set $\{(y, v_1), (v_1, y) \mid \forall y \ne v_2\}$ is equivalent to $\{(v_1, v_3) \mid \forall v_3 \ne v_2\} \cup \{(v_0, v_1)\}$. Then the second equation follows from Lemma 3 in the same way. □

From Lemma 4, $p_{\max}((v_1, v_2), v_2)$ can be computed when the values of all its sons, that is $p_{\max}((v_2, v_3), v_3)$, have been computed. $p_{\max}((v_1, v_2), v_1)$ can be found when the values of its parent, $p_{\max}((v_0, v_1), v_0)$, and all its siblings, $p_{\max}((v_1, v_3), v_3)$, are done. We can compute the values of all edges in two phases, bottom-up and then top-down. In phase 1, we visit the edges in a postorder sequence. Thus, when visiting an edge, all its sons have been visited, and $p_{\max}((v_1, v_2), v_2)$ can be computed. In phase 2, the edges are visited in a preorder sequence such that parents are visited before sons. When visiting an edge $(v_1, v_2)$, $p_{\max}((v_0, v_1), v_0)$ has been computed. Since $\max_{(v_1, v_3)} \{p_{\max}((v_1, v_3), v_3) - w(v_1, v_3)\}$ can be computed in phase 1, we can find $p_{\max}((v_1, v_2), v_1)$ when visiting $(v_1, v_2)$. The algorithm is stated in Fig. 2.


Algorithm PMAX
Input: a tree T with root r, a distance vector D_x.
Output: p_max(e, v) for every edge e.
/* We only show p_max(e, v), while p_min(e, v) can be computed in a similar way */
Step 1: set p_max(e, v) = -∞ for each edge e and each endpoint v.
Step 2 (Phase 1):
    (v1, v2) = first-edge-in-postorder-sequence
    while (not-complete-the-traversal) do
        if (v2 is a leaf) then p_max((v1,v2), v2) = a_{v2};
        for its parent (v0, v1) do
(2.1)       p_max((v0,v1), v1) = max{p_max((v0,v1), v1), p_max((v1,v2), v2) - w(v1,v2)};
        for each sibling (v1, v3) do
(2.2)       p_max((v1,v3), v1) = max{p_max((v1,v3), v1), p_max((v1,v2), v2) - w(v1,v2)};
        (v1, v2) = next-edge-in-postorder-sequence;
    endwhile
Step 3 (Phase 2):
    (v1, v2) = first-edge-in-preorder-sequence
    while (not-complete-the-traversal) do
        for each son (v2, v3) do
            p_max((v2,v3), v2) = max{p_max((v2,v3), v2), p_max((v1,v2), v1) - w(v1,v2)};
        (v1, v2) = next-edge-in-preorder-sequence;
    endwhile

Fig. 2. Algorithm PMAX.

Theorem 1. Algorithm PMAX finds $p_{\max}(e, v)$ for every edge $e$ with time complexity O(n) for an evolutionary tree in which the degree of every vertex is bounded.

Proof. The correctness of the algorithm follows from the above lemmas. For the time complexity, the dominant cost is in step 2, where the values of one edge result from all its siblings. So, the worst-case time complexity is $O(\sum_{v_i \in V} \deg(v_i) \times (\deg(v_i) - 1))$, which is linear if the degree of every vertex is bounded. □

In the case of a tree with unbounded degree, the time complexity may be as high as $O(n^2)$. We shall show a method to overcome this problem and get an algorithm running in O(n) time for any tree. To present this method, let us first look at the following problem:

Richest Siblings Problem: If $e_1, e_2, \ldots, e_n$ are brothers, each with money $x_1, x_2, \ldots, x_n$ respectively, how can we let every brother know who is the richest one of his brothers (excluding himself)?

The direct method is that everyone asks all his brothers and compares their money. However, this method requires that everyone performs n - 1 comparisons, and n × (n - 1) comparisons in total. This is just the method used in the above algorithm. A good method is as follows: First, the parent asks everyone, finds the richest and second richest brothers, and then announces the results. Assume that $e_i$ and $e_j$ are the richest and second richest brothers respectively. Now, everyone except $e_i$ knows $e_i$ is the richest brother, and $e_i$ knows that $e_j$ is the richest brother excluding himself. In the first step, the parent performs 2n comparisons. In the second step, everyone performs just one comparison. Thus only 3n comparisons are needed.

We shall show how this method can be utilized in the original algorithm. For any edge $(v_1, v_2)$, there is no difficulty in computing $p_{\max}((v_1, v_2), v_2)$ in linear time. For $p_{\max}((v_1, v_2), v_1)$, it depends on the maximum among its parent (in step 3) and all its siblings (in step 2.2). The value from its parent, $p_{\max}((v_0, v_1), v_0)$, will be ready and can be compared in phase 2. The only problem is to find the maximum of its siblings for every edge. It is just the "Richest Siblings Problem" described above. So, if we keep the largest and the second largest members in step 2.1, we can get the data from its siblings at step 3. The next property and lemma show the method formally.
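The "good method" for the Richest Siblings Problem can be sketched in a few lines of Python. The function name `richest_other` is hypothetical; the sketch tracks the position of the richest brother rather than comparing amounts, so ties are handled as in a multiset.

```python
def richest_other(money):
    """For each brother, the largest amount held by any other brother."""
    best = second = float("-inf")
    best_index = None
    for i, x in enumerate(money):  # single pass made by the "parent"
        if x > best:
            best, second, best_index = x, best, i
        elif x > second:
            second = x
    # Everyone but the richest sees `best`; the richest sees `second`.
    return [second if i == best_index else best for i in range(len(money))]


print(richest_other([3, 9, 5]))     # [9, 5, 9]
print(richest_other([3, 9, 5, 9]))  # [9, 9, 9, 9]
```

The work is one pass to find the two largest amounts plus one comparison per brother, instead of n - 1 comparisons per brother.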

Property. Let $\{x_1, x_2, \ldots, x_n\}$ be a multiset of numbers, $x_{\max}$ be the largest member, and $x_{\max 2}$ be the second largest member in the set; then

$$\max_{j \ne i} \{x_j\} = \begin{cases} x_{\max} & \text{if } x_i \ne x_{\max}, \\ x_{\max 2} & \text{if } x_i = x_{\max}. \end{cases}$$

With this property and Lemma 4, we can get the next lemma. For convenience, if $r$ is the root of $T$, a dummy edge $(r_0, r)$ is added such that every edge in $T$ has a parent edge.

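Lemma 5. Let $x_{\max}$ and $x_{\max 2}$ be the largest and the second largest members in the multiset $\{p_{\max}((v_1, v), v) - w(v_1, v) \mid \forall (v_1, v) \in T\}$, and suppose $v_1$ is not a leaf; then

$$p_{\max}((v_1, v_2), v_1) = \max \left\{ \begin{cases} x_{\max} & \text{if } x_{\max} \ne p_{\max}((v_1, v_2), v_2) - w(v_1, v_2) \\ x_{\max 2} & \text{otherwise} \end{cases},\ p_{\max}((v_0, v_1), v_0) - w(v_0, v_1) \right\}$$

and $x_{\max} = p_{\max}((v_0, v_1), v_1)$.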

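Proof. $x_{\max} = p_{\max}((v_0, v_1), v_1)$ comes directly from Lemma 4. Also from Lemma 4,

$$p_{\max}((v_1, v_2), v_1) = \max \left\{ \max_{v \ne v_2} \{p_{\max}((v_1, v), v) - w(v_1, v)\},\ p_{\max}((v_0, v_1), v_0) - w(v_0, v_1) \right\}.$$

From the above property,

$$\max_{v \ne v_2} \{p_{\max}((v_1, v), v) - w(v_1, v)\} = \begin{cases} x_{\max} & \text{if } x_{\max} \ne p_{\max}((v_1, v_2), v_2) - w(v_1, v_2), \\ x_{\max 2} & \text{otherwise,} \end{cases}$$

which completes the proof. □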

Based on Lemmas 4 and 5, algorithm PMAX can be modified to find all $p_{\max}(e, v)$ in O(n) time for any evolutionary tree. Then, from Lemma 2, MIP$^\infty(T, D_x)$ can also be solved in O(n) time. Since the trivial lower bound of the problem is $\Omega(n)$, the algorithm is optimal in time complexity.

Theorem 2. The MIP$^\infty(T, D_x)$ problem can be solved in O(n) time, which is optimal.
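A minimal Python sketch of the modified two-phase computation follows. It is one possible reading of Lemmas 4 and 5, not the authors' code. The tree encoding (`children` maps a node to a list of `(son, weight)` pairs), the function name `two_phase`, the `sign` parameter and the example tree are all assumptions made for illustration. Running the same sweep with `sign=-1` yields the negated $p_{\min}$ values, because $\min_i(a_i - p_i) = -\max_i(p_i - a_i)$.

```python
import math


def two_phase(children, root, a, sign=1):
    """Return (down, up), both keyed by the son v2 of each edge (v1, v2).

    With sign=+1: down[v2] = p_max((v1, v2), v2), up[v2] = p_max((v1, v2), v1).
    With sign=-1 the same sweep yields -p_min at the same endpoints.
    down[root] belongs to the dummy edge (r0, r).
    """
    order, parent_w, stack = [], {}, [root]
    while stack:                       # preorder: parents before sons
        v = stack.pop()
        order.append(v)
        for c, w in children.get(v, []):
            parent_w[c] = w
            stack.append(c)

    down = {}
    for v in reversed(order):          # phase 1: sons before parents
        kids = children.get(v, [])
        if kids:
            down[v] = max(down[c] - sign * w for c, w in kids)
        else:
            down[v] = sign * a[v]

    up = {}
    for v1 in order:                   # phase 2: parents before sons
        from_parent = -math.inf if v1 == root else up[v1] - sign * parent_w[v1]
        best = second = -math.inf      # x_max and x_max2 among the sons of v1
        best_son = None
        for c, w in children.get(v1, []):
            x = down[c] - sign * w
            if x > best:
                best, second, best_son = x, best, c
            elif x > second:
                second = x
        for c, _ in children.get(v1, []):
            up[c] = max(from_parent, second if c == best_son else best)
    return down, up


# Hypothetical tree: R has sons X (weight 2), leaf "3" (4) and leaf "4" (2);
# X has the leaves "1" (weight 1) and "2" (weight 3).
children = {"R": [("X", 2), ("3", 4), ("4", 2)], "X": [("1", 1), ("2", 3)]}
a = {"1": 5, "2": 6, "3": 7, "4": 4}
down_max, up_max = two_phase(children, "R", a, sign=1)
neg_down, neg_up = two_phase(children, "R", a, sign=-1)
p_min_up = {v: -x for v, x in neg_up.items()}
```

Each node is visited a constant number of times and each son costs a constant number of comparisons, so the sweep is linear in the size of the tree regardless of vertex degree. For every edge, the four resulting numbers are the quantities that Lemma 2 takes as given.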

4. Conclusion

In this paper, we examine the problem of finding an optimal position with relative distances in an evolutionary tree. An optimal algorithm is developed for the $L^\infty$-norm. In addition to the $L^\infty$-norm, there are some other criteria when considering an evolutionary tree. Two other important criteria are the $L^1$-norm and minimum tree size. The problems are similar to MIP$^\infty(T, D_x)$ but with different objective functions, which are $\sum_i (d(T, x, i) - a_i)$ and $h$ respectively. It is not hard to show that the two problems can also be solved in O(n) time using the same technique as in Section 3.


References

[1] J. Culberson and P. Rudnicki, A fast algorithm for constructing trees from distance matrices, Inform. Process. Lett. 30 (1989) 215-220.
[2] W.H.E. Day, Computationally difficult parsimony problems in phylogenetic systematics, J. Theoret. Biology 103 (1983) 429-438.
[3] W.H.E. Day, D.S. Johnson and D. Sankoff, The computational complexity of inferring rooted phylogenies by parsimony, Math. Bioscience 81 (1986) 33-42.
[4] W.H.E. Day, Computational complexity of inferring phylogenies from dissimilarity matrices, Bull. Math. Biology 49 (4) (1987) 461-467.
[5] M. Farach, S. Kannan and T. Warnow, A robust model for finding optimal evolutionary tree, Algorithmica 13 (1995) 155-179.
[6] S. Kannan, E. Lawler and T. Warnow, Determining the evolutionary tree, in: Proc. 1st Ann. ACM-SIAM Symp. on Discrete Algorithms, San Francisco, CA (1990) 475-484.
[7] M. Krivanek, The complexity of ultrametric partitions on graph, Inform. Process. Lett. 27 (5) (1988) 265-270.
[8] M.S. Waterman, T.F. Smith, M. Singh and W.A. Beyer, Additive evolutionary tree, J. Theor. Biology 64 (1977) 199-213.