Information Sciences 121 (1999) 367±386
www.elsevier.com/locate/ins
Identifying approximately common substructures in trees based on a restricted edit distance Jason T.L. Wang a
a,* ,
Kaizhong Zhang b, Chia-Yo Chang
a
Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA b Department of Computer Science, The University of Western Ontario, London, Ont., Canada N6A 5B7
Received 21 May 1997; received in revised form 11 December 1998; accepted 1 February 1999
Abstract In this paper we present a dynamic programming algorithm for identifying the largest approximately common substructures (LACSs) of two ordered labeled trees. We consider a substructure of a tree T to be a connected subgraph of T . Given two trees T1 ; T2 and an integer d, the LACS problem is to ®nd a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within distance d of U2 and where there does not exist any other substructure V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance restriction and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2 . The distance measure considered in the paper is a restricted edit distance originated from Selkow. The LACS problem is motivated by the studies of program and document comparison. The proposed algorithm solves the problem in time O
d 2 jT1 j jT2 j, which is as fast as the best known algorithm for calculating Selkow's distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. Ó 1999 Elsevier Science Inc. All rights reserved.
*
Corresponding author. Fax: +1-973-596-5777. E-mail addresses:
[email protected] (J.T.L. Wang),
[email protected] (K. Zhang), changc @homer.njit.edu (C.-Y. Chang). 0020-0255/99/$ - see front matter Ó 1999 Elsevier Science Inc. All rights reserved. PII: S 0 0 2 0 - 0 2 5 5 ( 9 9 ) 0 0 1 0 0 - 0
368
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
1. Introduction Ordered labeled trees are trees in which each node has a label and the left-toright order among siblings matters. Such trees ®nd many applications in web management, programming languages and natural language processing, including the representation of structured documents (e.g. HTML) [2], grammar parses [1] and sentences [5]. Identifying structural similarities in such trees helps in version management and change detection. For example, in software maintenance where programs are represented as parse trees [10], comparing the parse trees helps detect the changes to the code. As another example, a user of the World Wide Web may be interested in knowing changes in an HTML document. Such changes can be detected by representing the document as a tree based on its underlying markup language [2] and by comparing the old and new version of the document (tree). In hypertext authoring, a user may wish to ®nd the common portions in the history list of an HTML document or a set of documents. This may be accomplished by ®nding the common substructures in the corresponding trees. 1.1. Related work A large amount of work has been performed for comparing trees based on various distance measures. Tai [13], for example, proposed to measure the distance between two trees by the minimum cost sequence of edit operations needed to transform one tree to the other. The set of allowable edit operations includes: (1) changing one node into another node (changing the label of the node); (2) deleting one node from a tree; and (3) inserting a node into a tree. Tai gave an algorithm for computing the distance of two 2 2 trees, which runs in time O
jT1 j jT2 j
depth
T1
depth
T2 . Zhang and Shasha [21] later presented a faster algorithm that runs in time O
jT1 j jT2 j min
depth
T1 ; leaves
T1 min
depth
T2 ; leaves
T2 . In a more recent paper, Zhang, Shasha and Wang [22] extended the tree matching work to allow one of the trees (the pattern tree) to contain variable length don't cares (VLDCs). When matching the pattern tree with a data tree, the VLDCs may substitute for a portion of the data tree at no cost. Their algorithm runs as fast as the best algorithm for tree matching without VLDCs [21]. The authors implemented the algorithms into a graphics toolkit for querying lexical databases [17]. Cheng and Lu [3] studied a dierent distance measure based on node splitting and merging; the distance between two trees is de®ned as the minimum number of such operations needed to transform one tree to the other. The authors presented a tree matching algorithm based on this metric and applied it to waveform correlation.
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
369
Tanaka and Tanaka [14] considered a restricted edit distance with the condition that two disjoint subtrees must be matched with two disjoint subtrees. They argued that such a structure preserving matching is more meaningful when comparing two classi®cation trees. The authors presented an algorithm to calculate the restricted distance of trees in time O
jT1 j jT2 j min
leaves
T1 ; leaves
T2 . Zhang [20] gave a faster algorithm that runs in time O
jT1 j jT2 j. Recently, Jiang et al. [7] proposed tree alignment as an alternative of tree matching. Two trees are aligned by inserting a minimum number of extra nodes to the two trees such that the resulting trees are topologically isomorphic. The metric is an upward extension of sequence alignment [12] commonly used in biocomputing. Other relevant distance measures for trees and ecient algorithms for computing the distance measures can be found in [6,8]. In ®nding common portions of trees, our work is closely related to [19]. In [19], Yang proposed an algorithm to identify the dierences between two programs. The author represented a program by a parse tree and found the largest common substructure of two parse trees under the condition that the sibling, ancestor and parent±child relationships are preserved. Here, we extend Yang's work to ®nd the largest approximately common substructures of two trees by allowing node mismatches and insertions/deletions to exist in the common substructures. The distance measure used in the paper is the restricted edit distance originated from [9,11]. Yang argued that this metric is most suitable for hierarchically structured data such as programs and structured documents, since it exploits the knowledge of grammars and respects the parent±child relationships. 1.2. Our algorithmic strategy As noted above, the paper presents an algorithm that identi®es, or discovers, the largest approximately common substructures (patterns) in pairs of trees. The algorithm bears a spiritual kinship with the previous algorithms surveyed in Section 1.1 that we and others have proposed for tree pattern matching. Nevertheless, fundamental dierences exist because of the dierences in the problem goals themselves. In matching problems, we are given a pattern and need to ®nd a distance between the pattern and one or more objects; in discovery problems by contrast, we are given two objects and a `target' distance and are asked to ®nd the largest portions of the objects that dier by at most that distance. Specializing the discovery problem to a pair of trees, we want to ®nd the largest connected component from each tree such that the distance between them is under the target distance value. Let us consider the connected component in one of the trees. Since it is connected, it must be rooted at a node n and is generated by cutting o some subtrees in the subtree rooted at n. This means that a naive algorithm for
370
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
pattern discovery would have to consider all the subtree pairs and for each subtree pair all the possible cuts of its subtrees. Since the number of such possible cuts is exponential, the naive algorithm is clearly impractical. Instead we use a compound form of dynamic programming called selective memorization. By compound, we mean that dynamic programming is applied 1. to compute sizes of common patterns between two subtree pairs given a set of cuts; 2. to ®nd the cuttings that yield distances less than or equal to the target one; 3. to compute the optimal cuttings for distance k; 1 6 k 6 d, given the optimal cuttings for distances 0 to k ÿ 1. The memorization is selective in the sense that we only memorize the necessary values. For example we do not compute the size values for cuttings that have a distance larger than the target distance. For cuttings that give distance values within the target value, we do not explicitly compute every one of them. In the computation of an optimal solution for distance value k, we also have to solve a problem which is unique for trees. Consider a pair of subtrees s1 and s2 whose roots map to one another in the optimal solution for distance value k. Then, in general, there are several (more than one) subtrees of s1 and s2 pairs which map to one another. We have to determine how the distance value k should be distributed to these several subtree pairs so that we can obtain the optimal solution. We solve this problem by partitioning the subtrees of s1 , respectively s2 , into a forest and a subtree. We then compute the distance and size values from forest to forest and from subtree to subtree. The rest of the paper is organized as follows. In Section 2 we review Selkow's distance measure and formalize the problem at hand. Section 3 presents our algorithm and its complexity. Section 4 concludes the paper and reports some implementation eorts. 2. Preliminaries In this section we ®rst de®ne the edit operations, the edit distance and mapping between two trees. Then we review Selkow's restricted edit distance, and establish the relationship between this restricted distance and the edit distance. Finally we de®ne the problem of identifying the largest approximately common substructures of two trees and introduce the terms and notation used in the paper. 2.1. Edit operations for trees The trees with which we are concerned are rooted, ordered, and labeled ones. Each node in the trees has a label and the left-to-right order of its children, if it has any, is ®xed. The edit distance for trees is measured in terms of
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
371
three types of edit operations: relabel a node, delete a node, and insert a node [13,21]. Let X represent the alphabet of node labels. We represent the edit operations as u ! v, where each of u and v is either a node label in X or the null label k 62 X. We call u ! v a relabeling operation if u 6 k and v 6 k, a delete operation if u 6 k and v k, or an insert operation if u k and v 6 k. Let T2 be the tree that results from the application of an edit operation u ! v to tree T1 ; this is written as T1 ) T2 via u ! v. Fig. 1 illustrates the edit operations. Let S be a sequence s1 ; s2 ; . . . ; sk of edit operations. S transforms tree T to tree T 0 if there is a sequence of trees T0 ; T1 ; . . . ; Tk such that T T0 ; T 0 Tk and Tiÿ1 ) Ti via si , for 1 6 i 6 k. Our de®nition of edit operations is really a shorthand notation for the speci®cation. Here is the speci®cation in full detail. Consider a single edit operation, e.g., one that transforms Tiÿ1 to Ti . If it is a relabeling operation, we specify the node to be relabeled in Tiÿ1 . The same
Fig. 1. Examples illustrating the edit operations. (a) Relabeling to change one node label b to another c. (b) Deletion of a node. All children of the deleted node with label b become children of the parent labeled r. (c) Insertion of a node. A consecutive sequence of siblings among the children of the node labeled r, namely those with labels a; e; f , become the children of the newly inserted node labeled c.
372
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
holds for a delete operation. An insert operation is a little more complicated. We must specify the parent p of the node n to be inserted and which consecutive sequence of siblings among the children of p will be the children of n. If that consecutive sequence is empty, then we need to specify the position of n among the children of p. However, we continue to use our shorthand notation, because these other speci®cations are clear from the mapping structure de®ned below. Let c be a cost function that assigns to each edit operation u ! v a nonnegative real number c
u ! v. For the purpose of this work, we assume c
u ! v 1 if u 6 v and 0 otherwise. We P extend c to a sequence of edit opk erations S s1 ; s2 ; . . . ; sk by letting c
S i1 c
si . The edit distance from 0 0 tree T to tree T , denoted edist
T ; T , is de®ned to be the cost of a minimum cost sequence of edit operations transforming T to T 0 [13,21]. 2.2. Mappings The edit operations correspond to a mapping that is a graphical speci®cation of which edit operations apply to each node in the two trees. For example, the mapping in Fig. 2 shows a way to transform tree T1 to tree T2 . It corresponds to the sequence: delete the nodes with labels a and m and then insert the nodes with labels a and m. Let ti represent the node of tree T whose position in the left-to-right postorder traversal of T is i. Formally, a mapping from tree T1 to tree T2 is a triple
Me ; T1 ; T2 (or simply Me if the context is clear) where Me is any set of ordered pairs of integers
i; j satisfying the following conditions: 1. 1 6 i 6 jT1 j and 1 6 j 6 jT2 j where j j represents the size, i.e. the number of nodes, of the indicated tree; 2. for any pair of
i1 ; j1 and
i2 ; j2 in Me , (a) i1 i2 i j1 j2 (one-to-one condition);
Fig. 2. A mapping from tree T1 to tree T2 . A dashed line from a node u in T1 to a node v in T2 indicates that u should be changed to v if u 6 v, or that u remains unchanged if u v. The nodes of T1 not touched by a dashed line are to be deleted and the nodes of T2 not touched by a dashed line are to be inserted.
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
373
(b) t1 i1 is to the left of t1 i2 i t2 j1 is to the left of t2 j2 (sibling order preservation condition); (c) t1 i1 is an ancestor of t1 i2 i t2 j1 is an ancestor of t2 j2 (ancestor order preservation condition). Thus, for example, the mapping in Fig. 2 is {(1, 1), (2, 2), (3, 3), (4, 5), (6, 6), (8, 8)}. Let l
ti represent the label of node ti. Let Me be a mapping from T1 to T2 . Let I and J be the sets of nodes in T1 and T2 , respectively, not touched by any mapping line in Me . Then we can de®ne the cost of Me : c
Me
X
c
l
t1 i ! l
t2 j
X
c
l
t1 i ! k
i2I
i;j2Me
X
c
k ! l
t2 j:
j2J
Given S, a sequence of edit operations from T1 to T2 , it can be shown that there exists a mapping Me from T1 to T2 such that c
Me 6 c
S; conversely, for any mapping Me , there exists a sequence of edit operations S such that c
S c
Me [21, Lemma 2]. Hence, we have edist
T1 ; T2 min fc
Me j Me is a mapping from T1 to T2 g: 2.3. Selkow's distance measure Selkow [11] proposed to impose the following restriction on the mapping conditions described in Section 2.2: if
i; j is in Me where neither t1 i nor t2 j is the root, then
par
i; par
j 1 is also in Me where par
i represents the postorder number of the parent of t1 i and par
j represents the postorder number of the parent of t2 j (parent±child order preservation). Due to its nature, we will refer to the resulting mapping as a top-down edit mapping. Fig. 3 shows an example of a top-down edit mapping Mt from tree T1 to tree T2 given in Fig. 2. Three notes follow. First, the de®nition of top-down edit mappings implies that
jT1 j; jT2 j must be in Mt , i.e., the root of T1 must be mapped to the root of T2 . Second, if a node n is to be deleted (inserted, respectively) in Mt , then the subtree rooted at n, if any, is to be deleted (inserted, respectively). Third, a topdown edit mapping is a mapping whereas a mapping may not satisfy the conditions of a top-down edit mapping. For example, the mapping in Fig. 2 is not a top-down edit mapping since the nodes with labels a and m in T1 are deleted while their children remain in the tree. 1
par
i is unde®ned if ti is the root.
374
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
Fig. 3. A top-down edit mapping from tree T1 to tree T2 .
Let Mt be a top-down edit mapping from tree T1 to tree T2 . Let I and J be the sets of nodes in T1 and T2 , respectively, not touched by any mapping line in Mt . Then we can de®ne the cost of Mt : X X c
Mt c
l
t1 i ! l
t2 j c
l
t1 i ! k i2I
i;j2Mt
X
c
k ! l
t2 j:
j2J
The top-down edit distance from tree T1 to tree T2 , denoted t dist
T1 ; T2 , is the cost of a minimum cost top-down edit mapping from T1 to T2 . This restricted distance measure was originally proposed by Selkow [11] and will also be referred to as Selkow's distance, or simply the distance, throughout the paper. Theorem 1 establishes the relationship between the edit distance of two trees T1 and T2 and their top-down edit distance. Theorem 1. e dist
T1 ; T2 6 t dist
T1 ; T2 . Proof. By de®nition, e dist
T1 ; T2 is the cost of a minimum cost mapping from T1 to T2 and t dist
T1 ; T2 is the cost of a minimum cost top-down edit mapping from T1 to T2 . Since any top-down edit mapping is also a mapping, the result follows. Intuitively, the top-down edit distance gives more weight to upper level nodes than to lower level nodes, i.e., the upper level nodes are more important than lower level ones. For example, in Fig. 3, if we delete the node labeled a in T1 , then the subtree with the nodes labeled b; c; d; e must be removed, too. However, if we delete the node labeled, for example, d, then only d is removed. This is reasonable when comparing two hierarchically structured documents or programs. For example, deleting a section of a document implies that one also removes all the paragraphs within the section whereas deleting a paragraph simply removes that paragraph, leaving the other paragraphs in the section intact.
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
375
2.4. Cut operations We de®ne a substructure U of tree T to be a connected subgraph of T . That is, U is rooted at a node n in T and is generated by cutting o some subtrees in the subtree rooted at n. Formally, let T i represent the subtree rooted at ti. The operation of cutting at node ti means removing T i (see Fig. 4). A set S of nodes of T k is said to be a set of consistent subtree cuts in T k if (i) ti 2 S implies that ti is a node in T k, and (ii) ti; tj 2 S implies that neither is an ancestor of the other in T k. Intuitively, S is the set of all roots of the removed subtrees in T k. We use Cut
T ; S to represent the tree T with subtree removals at all nodes in S. Let Subtrees
T be the set of all possible sets of consistent subtree cuts in T . Given two trees T1 and T2 and an integer d, the size of the largest approximately common root-containing substructures within distance k; 0 6 k 6 d, of T1 i and T2 j, denoted t size
T1 i; T2 j; k (or simply t size
i; j; k when the context is clear), is maxfjCut
T1 i; S1 j jCut
T2 j; S2 jg subject to t dist
Cut
T1 i; S1 ; Cut
T2 j; S2 6 k S1 2 Subtrees
T1 i S2 2 Subtrees
T2 j: LACS Problem. The LACS problem for T1 i and T2 j is to identify the largest approximately common substructure (LACS), within distance d, of T1 i and T2 j, that is, to calculate max1 6 u 6 i;1 6 v 6 j ft size
T1 u; T2 v; dg and locate the Cut
T1 u; Su and Cut
T2 v; Sv , Su 2 Subtrees
T1 u; Sv 2 Subtrees
T2 v that achieve the maximum size. The LACS problem for T1 and T2 is to identify the largest approximately common substructure, within distance d, of T1 and T2 , that is, to calculate max1 6 i 6 jT1 j;1 6 j 6 jT2 j ft size
T1 i; T2 j; dg and locate the
Fig. 4. Cutting at node t8 to get T 0 from T .
376
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
substructure U1 of T1 and the substructure U2 of T2 that achieve the maximum size. We will focus on computing the maximum size in solving the LACS problem. By memorizing the size information during the computation and by a backtracking technique, one can ®nd both the maximum size and one of the corresponding substructure pairs yielding the size in the same time and space complexity. 2.5. Notation We use F i to represent the ordered forest obtained by deleting ti from T i (Fig. 5). jF j is the size of forest F . A set S of nodes of F is said to be a set of consistent subtree cuts in F if (i) tp 2 S implies that tp is a node in F , and (ii) tp; tq 2 S implies that neither is an ancestor of the other in F . We use Cut
F ; S to represent the subforest F with subtree removals at all nodes in S. For example, consider Fig. 5 again. S ft2; t5; t7g is a set of consistent subtree cuts in F 9. Cut
F 9; S is obtained by cutting at nodes t2; t5 and t7 from F 9. Let Subtrees
F be the set of all possible sets of consistent subtree cuts in F . Let f dist
F1 ; F2 be the top-down edit distance from forest F1 to forest F2 . We de®ne the size of the largest approximately common root-containing substructures, within distance k, of F1 and F2 , denoted f size
F1 ; F2 ; k, as maxfjCut
F1 ; S1 j jCut
F2 ; S2 jg subject to f dist
Cut
F1 ; S1 ; Cut
F2 ; S2 6 k S1 2 Subtrees
F1 S2 2 Subtrees
F2
Fig. 5. (a) A tree T . (b) Deleting node t9 to get F 9 from T . (c) Cutting at nodes in S ft2; t5; t7g from F 9.
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
377
3. Our algorithm We will ®rst describe some basic properties and then present the algorithm for solving the LACS problem. 3.1. Basic properties Let t1 i be a node in tree T1 and let t2 j be a node in tree T2 . We use t1 i1 ; t1 i2 ; . . . ; t1 imi to represent the children of t1 i and use t2 j1 ; t2 j2 ; . . . ; t2 jnj to represent the children of t2 j. F1 is ; it ; 1 6 s 6 t 6 mi , represents the forest consisting of the subtrees T1 is ; . . . ; T1 it . F1 i1 ; imi F1 i; F1 ip ; ipÿ1 ; for all 1 6 p 6 mi . Note F1 ip ; ip T1 ip 6 F1 i. Lemmas 1 and 2 below summarize the formulas used for initializing t size and f size for all k; 0 6 k 6 d. Lemma 1. For all i; j; 1 6 i 6 jT1 j and 1 6 j 6 jT2 j, and for any p; q; s; t; 1 6 p 6 s 6 mi and 1 6 q 6 t 6 nj , (i) (a) t size
;; ;; 0 0; (b) f size
;; ;; 0 0; (ii) (a) t size
T1 i; ;; 0 0; (b) t size
;; T2 j; 0 0; (iii) (a) f size
F1 ip ; is ; ;; 0 0; (b) f size
;; F2 jq ; jt ; 0 0: Proof. Immediate from de®nitions.
Lemma 2. For all i; j; 1 6 i 6 jT1 j and 1 6 j 6 jT2 j, and for any p; q; s; t; 1 6 p 6 s 6 mi and 1 6 q 6 t 6 nj , and for any k; 1 6 k 6 d, (i) (a) t size
;; ;; k 0, (b) f size
;; ;; k 0; (ii) (a) t size
T1 i; ;; k f size
F1 i; ;; k ÿ 1 1, (b) t size
;; T2 j; k f size
;; F2 j; k ÿ 1 1; (iii) (a) f size
F1 ip ; is ; ;; k minfjF1 ip ; is j; kg, (b) f size
;; F2 jq ; jt ; k minfjF2 jq ; jt j; kg. Proof. (i) Obvious, since the largest approximately common root-containing substructure of two empty trees or forests is empty. For (ii), in (a), the largest approximately common root-containing substructure of F1 i and ; plus t1 i is the largest approximately common root-containing substructure of T1 i and ;. This means that t size
T1 i; ;; k f size
F1 i; ;; k ÿ 1 1 where the 1 is obtained by including the node t1 i. (b) is proved similarly as for (a). For (iii), in (a), if k 6 jF1 ip ; is j, since the allowed distance is at most k, we have to remove some subtrees from F1 ip ; is until only k nodes are left in the forest. If k > jF1 ip ; is j, then all the nodes in F1 ip ; is constitute the largest
378
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
approximately common root-containing substructure of F1 ip ; is and ;. Thus the result follows. (b) is proved similarly as for (a). Lemma 3. For all i; j; 1 6 i 6 jT1 j and 1 6 j 6 jT2 j, and for any s; t; 1 6 s 6 mi and 1 6 t 6 nj , 8 f size
F1 i1 ; isÿ1 ; F2 j1 ; jt ; 0; > > > < f size
F i ; i ; F j ; j ; 0; 1 1 s 2 1 tÿ1 f size
F1 i1 ; is ; F2 j1 ; jt ; 0 max > i ; i ; f size
F > 1 1 sÿ1 F2 j1 ; jtÿ1 ; 0 > : t size
is ; jt ; 0 Proof. Suppose S1 2 Subtrees
F1 i1 ; is and S2 2 Subtrees
F2 j1 ; jt are two smallest sets of consistent subtree cuts that maximize jCut
F1 i1 ; is ; S1 j jCut
F2 j1 ; jt ; S2 j where f dist
Cut
F1 [i1 ; is ; S1 ), Cut
F2 j1 ; jt , S2 0. Then at least one of the following cases must hold: Case 1: t1 is 2 S1 (i.e., the subtree T1 is is removed). So, f size
F1 i1 ; is ; F2 j1 ; jt ; 0 f size
F1 i1 ; isÿ1 ; F2 j1 ; jt ; 0. Case 2: t2 jt 2 S2 (i.e., the subtree T2 jt is removed). So, f size
F1 i1 ; is ; F2 j1 ; jt ; 0 f size
F1 i1 ; is ; F2 j1 ; jtÿ1 ; 0. Case 3: t1 is 62 S1 and t2 jt 62 S2 (i.e., neither T1 is nor T2 jt is removed) (Fig. 6). Let Mt be the top-down edit mapping (with cost 0) from Cut
F1 i1 ; is ; S1 to Cut
F2 j1 ; jt ; S2 . In Mt ; T1 is must be mapped to T2 jt because otherwise we cannot have distance zero between Cut
F1 i1 ; is ; S1 and Cut
F2 j1 ; jt ; S2 . Therefore f size
F1 i1 ; is ; F2 j1 ; jt ; 0 f size
F1 i1 ; isÿ1 ; F2 j1 ; jtÿ1 ; 0 t size
is ; jt ; 0. Since these three cases exhaust all possible mappings yielding f size
F1 i1 ; is ; F2 j1 ; jt ; 0, we take the maximum of the corresponding sizes, which gives the formula asserted by the lemma. Lemma 4. For all i; j; 1 6 i 6 jT1 j and 1 6 j 6 jT2 j, f size
F1 i; F2 j; 0 2 if l
t1 i l
t2 j; t size
i; j; 0 0 otherwise: Proof. First, consider the case where l
t1 i l
t2 j. Suppose S1 2 Subtrees
T1 i and S2 2 Subtrees
T2 j are two smallest sets of consistent subtree cuts that maximize jCut
T1 i; S1 j jCut
T2 j; S2 j where t dist
Cut
T1 i; S1 ; Cut
T2 j; S2 0. Let Mt be the top-down edit mapping (with cost 0) from Cut
T1 i; S1 to Cut
T2 j; S2 (Fig. 7). In Mt , t1 i must be mapped to t2 j. Here is why. Suppose
i; k and
h; j are in Mt . Since t1 i is the root of T1 i, t1 h must be a proper descendant of t1 i. This implies that t2 j is a proper de-
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
379
Fig. 6. Illustration of the case in which one of F1 i1 ; is and F2 j1 ; jt is a forest and neither T1 is nor T2 jt is removed.
scendant of t2 k by the ancestor order preservation condition on mappings described in Section 2.2. This is impossible because t2 j is the root of T2 j. So, h i. By symmetry, k j and hence
i; j 2 Mt . Next, note that the largest common root-containing substructure of F1 i and F2 j plus t1 i (or t2 j) must be the largest common root-containing substructure of T1 i and T2 j. This means that t size
i; j; 0 f size
F1 i; F2 j; 0 2, where the 2 is obtained by including the two nodes t1 i and t2 j. Finally consider the case where l
t1 i 6 l
t2 j (i.e., the labels of the roots of the two trees T1 i and T2 j dier). In order to get distance zero between the two trees after applying cut operations to them, we have to remove both trees entirely. Thus, t size
i; j; 0 0. The following two lemmas show the recurrence equations for calculating f size and t size when the allowed distance k > 0.
Lemma 5. For all i; j; 1 6 i 6 jT1 j and 1 6 j 6 jT2 j, and for any s; t; 1 6 s 6 mi and 1 6 t 6 nj , and for any k; 1 6 k 6 d,
Fig. 7. Illustration of the case in which we have two trees T1 i and T2 j and l
t1 i l
t2 j.
380
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
f size
F1 i1 ; is ; F2 j1 ; jt ; k 8 f size
F1 i1 ; isÿ1 ; F2 j1 ; jt ; k; > > > > < f size
F1 i1 ; is ; F2 j1 ; jtÿ1 ; k; max max0 6 h 6 k ff size
F1 i1 ; isÿ1 ; F2 j1 ; jt ; k ÿ h t size
T1 is ; ;; hg; > > max0 6 h 6 k ff size
F1 i1 ; is ; F2 j1 ; jtÿ1 ; k ÿ h t size
;; T2 jt ; hg; > > : max0 6 h 6 k ff size
F1 i1 ; isÿ1 ; F2 j1 ; jtÿ1 ; k ÿ h t size
is ; jt ; hg: Proof. Suppose S1 2 Subtrees
F1 i1 ; is and S2 2 Subtrees
F2 j1 ; jt are two smallest sets of consistent subtree cuts that maximize jCut
F1 i1 ; is ; S1 j jCut
F2 j1 ; jt ; S2 j where f dist
Cut (F1 i1 ; is ], S1 ), Cut
F2 j1 ; jt ], S2 )) 6 k. Then, at least one of the following cases must hold: Case 1: t1 is 2 S1 (i.e., T1 is is removed). So, f size
F1 i1 ; is ; F2 j1 ; jt ; k f size
F1 i1 ; isÿ1 ], F2 j1 ; jt ], k. Case 2: t2 jt 2 S2 (i.e., T2 jt is removed). So, f size
F1 i1 ; is ; F2 j1 ; jt ; k f size
F1 i1 ; is ], F2 j1 ; jtÿ1 ], k. Case 3: t1 is 62 S1 and t2 jt 62 S2 (i.e., neither T1 is nor T2 jt is removed). Let Mt be a minimum cost top-down edit mapping from Cut
F1 i1 ; is ; S1 to Cut
F2 j1 ; jt ; S2 . There are three subcases to be considered: (a) t1 is is not touched by a line in Mt . So, T1 is must be mapped to an empty tree. We may remove nodes from T1 is until only h nodes are left where 0 6 h 6 k. Therefore f size
F1 i1 ; is ; F2 j1 ; jt ; k max0 6 h 6 k ff size
F1 i1 ; isÿ1 ; F2 j1 ; jt ; k ÿ h t size
T1 is ; ;; h}. (b) t2 jt is not touched by a line in Mt . So, f size
F1 i1 ; is F2 j1 ; jt ; k max0 6 h 6 k {f size
F1 i1 ; is , F2 j1 , jtÿ1 ], k ÿ h t size
;; T2 jt ; h}. (c) t1 is and t2 jt are both touched by lines in Mt . Then, T1 is must be mapped to T2 jt . The allowed distance h between T1 is and T2 jt can vary from 0 to k. So, f size
F1 i1 , is ; F2 j1 ; jt ; k max0 6 h 6 k {f size
F1 i1 ; isÿ1 ], F2 j1 ; jtÿ1 ], k ÿ h t size
is ; jt ; h}. Lemma 6. For all i; j; 1 6 i 6 jT1 j and 1 6 j 6 jT2 j, and for any k; 1 6 k 6 d, t size
i; j; k f size
F1 i; F2 j; k ÿ c 2; where c
0 1
if l
t1 i l
t2 j; otherwise:
Proof. We ®rst show that removing either T1 i or T2 j would not yield the maximum size. There are three cases to be considered:
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
381
Case 1: Both T1 i and T2 j are removed. Then t size
i; j; k 0. However, since k P 1, cutting at just the children of t1 i and t2 j would cause t size
i; j; k P 2. Therefore removing both T1 i and T2 j cannot yield the maximum size. Case 2: Only T1 i is removed. Then t size
i; j; k t size
;; T2 j; k. Assume without loss of generality that jT2 jj P k. The above equation implies that we have to remove some subtrees from T2 j so that there are no more than k nodes left in T2 j. Thus, t size
i; j; k t size
;; T2 j; k k. On the other hand, if we just cut at the children of t1 i and leave t1 i in the tree, we would map t1 i to t2 j. This would lead to t size
i; j; k P f size
;; F2 j; k ÿ 1 2 k 1. Thus, removing T1 i alone cannot yield the maximum size. Case 3: Only T2 j is removed. The proof is similar to that in Case 2. The above arguments lead to the conclusion that in order to obtain the maximum size, neither T1 i nor T2 j can be removed. Now suppose S1 2 Subtrees
T1 i and S2 2 Subtrees
T2 j are two smallest sets of consistent subtree cuts that maximize jCut
T1 i; S1 j jCut
T2 j; S2 j where t dist
Cut
T1 i; S1 ; Cut
T2 j; S2 6 k. Let Mt be a minimum cost top-down edit mapping from Cut
T1 i; S1 to Cut
T2 j; S2 . By the mapping conditions, t1 i must be mapped to t2 j in Mt . Thus, if l
t1 i l
t2 j, t size
i; j; k f size
F1 i; F2 j; k 2; otherwise t size
i; j; k f size
F1 i; F2 j; k ÿ 1 2: 3.2. The algorithm We use dynamic programming to compute t size
i; j; k. The f size values computed and used in the algorithm are stored in a temporary array that is freed once the corresponding t size is computed. The t size values are stored in a permanent array. Fig. 8 summarizes the procedure for computing t size
i; j; k for 0 6 k 6 d. Note that t size
i; j; k represents the size of the largest approximately common root-containing substructures, within distance k, of T1 i and T2 j. To calculate the size of the largest approximately common substructure (LACS), within distance d, of T1 i and T2 j, we build, in a bottom-up fashion, another array b
i; j; d; 1 6 i 6 jT1 j; 1 6 j 6 jT2 j, using t size
i; j; d as follows. Let L max1 6 u 6 mi b
iu ; j; d where i1 . . . imi are the postorder numbers of the children of t1 i or L 0 if t1 i is a leaf. Let R max1 6 v 6 nj b
i; jv ; d where j1 . . . jnj are the postorder numbers of the children of t2 j or R 0 if t2 j is a leaf. Calculate b
i; j; d maxft size
i; j; d; L; Rg. The size of the largest approximately common substructure (LACS), within distance d, of T1 i and T2 j is b
i; j; d. The size of the largest approximately common substructure (LACS), within distance d, of T1 and T2 is b
jT1 j; jT2 j; d. Theorem 2. The algorithm presented above correctly solves the LACS problem. The space complexity of the algorithm is O
d jT1 j jT2 j and the time complexity of the algorithm is O
d 2 jT1 j jT2 j.
382
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
Fig. 8. Procedure for computing t size
i; j; k.
Proof. The correctness of the algorithm follows by observing that Procedure Find-Largest correctly computes t size
i; j; k; 1 6 i 6 jT1 j; 1 6 j 6 jT2 j; 0 6 k 6 d, and the bottom-up procedure described above correctly computes b
i; j; d. We use an array to hold f size; t size and b, respectively. These arrays require O
d jT1 j jT2 j space. By Lemma 3, computing f size
F1 i1 ; is ; F2 j1 ; jt ; 0 takes O
1 time. By Lemma 4, computing t size
i; j; 0 also takes O
1 time. So the time required for the k 0 case is jT1 j X jT2 j X i1
O
mi nj 6 O jT2 j
j1
jT1 j X
! mi 6 O
jT1 j jT2 j:
i1
By Lemma 5, computing f size
F1 i1 ; is ; F2 j1 ; jt ; k takes O
k time. By Lemma 6, computing t size
i; j; k takes O
1 time. So the time required for the k > 0 case is jT1 j X jT2 j d X X k1 i1
j1
O
k mi nj 6 O jT2 j
jT1 j d X X
! k mi
k1 i1
6 O jT1 j jT2 j
d X k1
! k 6 O
d 2 jT1 j jT2 j:
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
383
Finally computing the array b takes O
jT1 j jT2 j time. Therefore the total time used by our algorithm is O
d 2 jT1 j jT2 j. When d is a constant, the complexity is the same as that of the best known algorithm for computing the top-down edit distance of two trees due to Selkow [11], even though the LACS problem addressed here appears to be harder than ®nding the distance. It should be pointed out that by memorizing the size information during the computation and by a backtracking technique, one can ®nd both the maximum size and one of the corresponding substructure pairs yielding the size in the same time and space complexity. To see this, notice that in considering t size
i; j; k during the backtracking, one checks l
t1 i and l
t2 j and determines f size
F1 i; F2 j; k ÿ c (cf. Lemmas 4 and 6). When considering f size
F1 i; F2 j; k, one re-calculates the recurrence formulas in Lemmas 3 and 5 and ®nds out which case yields the maximum value, thus determining the mapping corresponding to that case. 4. Summary and implementation In this paper we presented a dynamic programming algorithm for identifying approximately common substructures of two trees. The `approximation' here is measured by the top-down edit distance originally proposed by Selkow [11]. This distance is most suitable for comparing hierarchical data since it respects the parent±child relationships [19]. Using selective memorization, our algorithm runs as fast as the best known algorithm [11] for comparing two trees using the restricted top-down edit distance when the distance allowed in the common substructures is a constant independent of the input trees. We also established the relationship between the restricted distance and the edit distance described in [13,21]. The substructures de®ned in the paper are generated by applying the `cut' operations to nodes in the trees. Cutting at node n from tree T means removing n and all its descendants, i.e., removing the subtree rooted at n. In [21], a relevant operation, called `prune', was de®ned. Pruning at node n from tree T means removing only the descendants of n; n itself remains in T . Thus, a pruning never eliminates the entire tree. The formulas and algorithms presented in the paper generalize easily to the `prune' case, though the resulting substructures dier as the de®nitions for the two operations are dierent. We have implemented the proposed algorithm into a graphics toolkit for comparing structured documents such as SGML and HTML [2]. Our approach is to transform the documents into ordered labeled trees based on the underlying markup language and then compare the documents using the proposed algorithm. Speci®cally, we represent the paragraph content of a document as a
384
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
leaf node in the corresponding tree structure. 2 The contents of the paragraphs are encoded into signatures. Each signature is an integer obtained by hashing the content of the corresponding paragraph. Thus, if two paragraphs are the same, their signatures remain the same. When the hash function chosen is perfect [4], two equivalent signatures also imply the equivalence of the corresponding paragraphs. Likewise, the section and subsection titles associated with the non-leaf nodes of the tree structure are encoded into signatures. These signatures become the labels of the nodes. We then use the proposed algorithm to compare two trees (documents), and display/highlight their common portion through the graphical interface. Based on the top-down edit distance, our system allows paragraph deletes, inserts and mismatches to exist when ®nding the common portion of two documents. In addition, the system is able to detect a paragraph move, which is an important operation in document editing [18]. The underlying heuristic works by ®rst ®nding the mapping between two trees. For nodes not touched by mapping lines, we check their signatures. Two paragraphs are related and labeled as `moved paragraphs' if they have the same signature. Our work requires a grammar based on which documents or programs can be parsed into trees. Some electronic documents such as line-based documents may not hold structural tags as those of SGML. Some documents, such as Latex, have very liberal syntax rules, rendering the document comparison dicult. In such situations, we create an arti®cial grammar embeddable to the documents, so that they are parsed based on the same rules. The UNIX `di' program is a good tool that complements our system. We have demonstrated the system in combination with `di' in several conferences [15,16,18]. The executable code has been used by several researchers in their labs and is available from the authors.
Acknowledgements We thank the anonymous reviewers for their thoughtful comments, which helped improve both the quality and presentation of the paper. We also thank Professor Dennis Shasha of New York University for useful conversations during the preparation of this work. The work was partially supported by NSF
2 There are trade-os concerning the leaf representation in terms of running-time, mappings obtained, and semantics of the comparison. The choice is application-dependent and we feel that representing paragraphs as leaves is suitable for document editing. In our implementation, when the user wants to see a more detailed comparison between two paragraphs, he/she can invoke the UNIX `di' program to do so.
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
385
grant IRI-9531548 and by the Natural Sciences and Engineering Research Council of Canada under Grant OGP0046373.
References [1] A.V. Aho, R. Sethi, J.D. Ullman, Compilers: Principles, Techniques, and Tools, AddisonWesley, Reading, MA, 1986. [2] G.J.S. Chang, G. Patel, L. Relihan, J.T.L. Wang, A graphical environment for change detection in structured documents, in: Proceedings of the 21st Annual International Computer Software and Application Conference, Washington DC, August 1997, pp. 536±541. [3] Y.C. Cheng, S.Y. Lu, Waveform correlation by tree matching, IEEE Trans. Pattern Anal. Machine Intell. 7 (1985) 299±305. [4] E.A. Fox, L.S. Heath, Q.F. Chen, A.M. Daoud, Practical minimal perfect hash functions for large databases, Commun. ACM 35 (1) (1992) 105±121. [5] R. Grishman, Computational Linguistics, Cambridge University Press, New York, 1986. [6] J. Hein, T. Jiang, L. Wang, K. Zhang, On the complexity of comparing evolutionary trees, Discrete Appl. Math. 71 (1996) 153±169. [7] T. Jiang, L. Wang, K. Zhang, Alignment of trees ± an alternative to tree edit, in: M. Crochemore, D. Gus®eld (Eds.), Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1994, pp. 75±86. [8] D. Keselman, A. Amir, Maximum agreement subtree in a set of evolutionary trees ± metrics and ecient algorithms, in: Proceedings of the 35th IEEE Annual Symposium on Foundations of Computer Science, Santa Fe, NM, 1994, pp. 758±769. [9] A.S. Noetzel, S.M. Selkow, An analysis of the general tree-editing problem, in: D. Sanko, J. B. Kruskal (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, 1983, pp. 237±252. [10] C.V. Ramamoorthy, W.T. Tsai, Advances in software engineering, IEEE Comput. 29 (10) (1996) 47±58. [11] S.M. Selkow, The tree-to-tree editing problem, Inform. Process. Lett. 6 (6) (1977) 184±186. [12] T.F. Smith, M.S. Waterman, Identi®cation of common molecular subsequences, J. Mol. Biol. 147 (1) (1981) 195±197. [13] K.-C. Tai, The tree-to-tree correction problem, J. ACM 26 (3) (1979) 422±433. [14] E. Tanaka, K. Tanaka, The tree-to-tree editing problem, Int. J. Pattern Recog. Arti®cial Intell. 2 (2) (1988) 221±240. [15] J.T.L. Wang, G.-W. Chirn, C.-Y. Chang, G. Chang, A. Noriega, K. Pysniak, An integrated toolkit for pattern matching and pattern discovery in scienti®c, program and document databases, in: Proceedings of the 7th International Conference on Software Engineering and Knowledge Engineering, Rockville, MD, June 1995, p. 497. [16] J.T.L. Wang, D. Shasha, G.J.S. Chang, L. Relihan, K. Zhang, G. Patel, Structural matching and discovery in document databases, in: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, AZ, May 1997, pp. 560±563. [17] J.T.L. Wang, K. Zhang, K. Jeong, D. Shasha, A system for approximate tree matching, IEEE Trans. Knowledge and Data Eng. 6 (4) (1994) 559±571. [18] J.T.L. Wang, K. Zhang, D. Shasha, Pattern matching and pattern discovery in scienti®c, program, and document databases, in: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, May 1995, p. 487. [19] W. Yang, Identifying syntactic dierences between two programs, Software ± Practice and Experience 21 (7) (1991) 739±755.
386
J.T.L. Wang et al. / Information Sciences 121 (1999) 367±386
[20] K. Zhang, Algorithms for the constrained editing distance between ordered labeled trees and related problems, Pattern Recog. 28 (3) (1995) 463±474. [21] K. Zhang, D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput. 18 (6) (1989) 1245±1262. [22] K. Zhang, D. Shasha, J.T.L. Wang, Approximate tree matching in the presence of variable length don't cares, J. Algorithms 16 (1) (1994) 33±66.