Journal of Algorithms 40, 1–23 (2001), doi:10.1006/jagm.2001.1160, available online at http://www.idealibrary.com
Digital Access to Comparison-Based Tree Data Structures and Algorithms¹
Salvador Roura
Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, E-08028 Barcelona, Catalonia, Spain
E-mail:
[email protected] Received July 6, 1999; published online June 9, 2001
This paper presents a simple method of building tree data structures, which only requires visiting log N nodes and comparing D digits per search or update, where N is the number of keys and D is the length of the keys. These bounds hold independently of the order of the updates and of the digits of the keys. The additional space required by the method is asymptotically dismissable when compared to the space used by the keys and pointers. The proposed method applies either to fixed-length base-2 keys or to variable-length string keys and permits saving space for common prefixes. The same ideas can be applied to achieve algorithms that have the best bounds of quicksort or of mergesort and radixsort together. © 2001 Academic Press
¹ This research was partially supported by the projects ALCOM-IST-1999-14186 and DGES PB98-0926 (AEDRI).
0196-6774/01 $35.00. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.

1. INTRODUCTION

There are two general frameworks for building a tree data structure with a given set of keys. In the first framework we consider keys as units and perform comparisons between them as a whole. Given two keys y and z, we are not concerned about the particular value of their ith bit; we are only interested in whether y is smaller than, equal to, or larger than z. A direct application of this idea produces the binary search tree data structure (BST, for short), as well as several variants of balanced binary search trees (see [11], for instance). We call this kind of tree a comparison-based search tree.

The second framework uses the digital representation of the keys to guide searches. Loosely speaking, at the ith level of the tree we either go left or right depending on whether the ith bit is 0 or 1. Two of the most important variants of trees built under this approach are tries [3] and patricia tries [10]. These trees are called radix-search trees or digital trees.

Let N and D denote the number of keys in the tree and the number of digits (bits, for the moment) of the keys, respectively. We assume the common situation where $D = o(N)$ and $N = o(2^D)$. This paper presents a simple method for achieving data structures with the best properties of balanced search trees and of radix-search trees. In other words, the data structures will have the following characteristics:

(a) log N visited nodes per search or update, independent of the order of the insertions or deletions;
(b) D bit comparisons per search or update, independent of the set of digits for the keys.

Note that comparison-based search trees do not satisfy property (b), while radix-search trees fail to fulfill property (a). For the former, we can guarantee visiting only log N nodes per search by balancing the tree. However, if the keys have long common prefixes, then each full-key comparison requires D bit comparisons, which amounts to D log N bit comparisons per search, on the average. On the other hand, radix-search trees make D bit comparisons per search, but keys with long common prefixes may induce visiting D nodes. Figure 1 includes an example of a trie for which a search visits D nodes on the average. Moreover, any patricia trie with the same set of keys suffers a similar effect.
FIG. 1. Worst-case instance for a trie.
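The cost of full-key comparisons on keys with long common prefixes can be made concrete with a small C sketch. It is only an illustration under stated assumptions: the names `full_compare`, `search_sorted`, and `demo_prefix_cost` are hypothetical, a binary search over a sorted array stands in for a balanced comparison-based tree, and the 16 keys are chosen so that they share a 28-bit prefix.

```c
#include <assert.h>
#include <stdint.h>

#define D 32 /* key length in bits (assumption of this sketch) */

/* i-th bit of x, where bit 1 is the most significant one. */
static int bit(uint32_t x, int i) { return (int)((x >> (D - i)) & 1u); }

/* Full-key comparison carried out bit by bit. Adds the number of
   inspected bit positions to *cost and returns -1, 0, or 1. */
static int full_compare(uint32_t y, uint32_t z, long *cost) {
    for (int i = 1; i <= D; i++) {
        ++*cost;
        if (bit(y, i) != bit(z, i)) return bit(y, i) < bit(z, i) ? -1 : 1;
    }
    return 0;
}

/* Binary search in a sorted array, used here as a stand-in for a
   balanced BST. Returns the index of x, or -1 if x is absent. */
static int search_sorted(const uint32_t *keys, int n, uint32_t x,
                         long *cost, int *steps) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = full_compare(x, keys[mid], cost);
        ++*steps;
        if (c == 0) return mid;
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

/* Hypothetical demo: 16 keys sharing their first 28 bits.
   Returns the total number of bit comparisons of one search. */
static long demo_prefix_cost(void) {
    uint32_t keys[16];
    for (int i = 0; i < 16; i++) keys[i] = 0xABCDEF00u + (uint32_t)i;
    long cost = 0;
    int steps = 0;
    int pos = search_sorted(keys, 16, keys[10], &cost, &steps);
    assert(pos == 10);
    /* Every full-key comparison rescans the shared 28-bit prefix. */
    assert(cost >= 29L * steps);
    return cost;
}
```

In this setting every visited key costs at least 29 bit comparisons, so the total exceeds D after the second step of the search; this is the behavior that the digital access method is designed to avoid.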
FIG. 2. Main binary tree data structures and their worst-case search cost.
Our method, which will be presented in the next sections, applies to BSTs and also to the problem of sorting an array of elements. We call it digital access, as each digit needs to be compared at most a constant number of times. Figure 2 presents a summary of the cost per successful search or deletion of the main binary tree data structures. The cost is computed for the worst case of each data structure and averaged over the N keys in the tree. The second column of the table reproduces the number of visited nodes, while the third column gives the number of bit comparisons. The costs labeled with “always” do not depend on the insertion order nor on the digits of the keys. Those labeled with “insertion order” or “digits” achieve the given cost with a worst-case insertion order (say, increasing order) or with a worst-case set of digits for the keys, for instance, the one in Fig. 1. The number of visited nodes for tries or patricia tries with random keys is a double average: it is averaged over the keys in the tree and also over every possible set of digits for the keys. The value log N could become D only with dismissable probability. But in practical situations where keys are not random, with high probability a search or an update in a radix-search tree would need to visit significantly more than log N nodes. The discussion about unsuccessful searches and insertions is deferred to Sections 2–4, where we will see that the digital access to BSTs is also very competitive. We end this introductory section by briefly comparing the results in this paper with related works. Mehlhorn [9, p. 284] was probably the first to present a data structure that combines the best of comparison-based and radix methods. He showed that a trie with nodes implemented as weighted dynamic trees can perform searches, insertions, and deletions in amortized cost D + log N. Our method is simpler, more general, and in the worst case achieves the same cost. 
The suffix array [8] is a data structure for performing efficient string searches in static arrays. A major difference between our method and suffix arrays is that we deal with dynamic sets of keys. The same remarks apply to the suffix binary search tree [6], which was devised to deal with static sets of suffixes. In [7], an AVL version of the suffix BST is presented as a competitor of suffix trees and suffix arrays, in the context of string processing. It is worth noting that, when applied to BSTs, our method amounts to a suffix BST together with a mechanism to perform insertions and deletions at the leaves of the tree, as well as rotations. We will see that this improvement allows us to incorporate the digital access into most comparison-based balancing strategies. In [4], a new data structure called multidimensional balanced BST is introduced, which allows for searches, insertions, and deletions in D + log N time. The searching algorithm presented there is equivalent to the one in this paper. However, our approach is more general, as it applies to basically all comparison-based BSTs and some of the most important sorting methods, and it is less space-consuming. Another interesting data structure with connections to our work is the string B-tree [2], a combination of a suffix array and a B-tree. Finally, our results share some similarities with the work of Grossi and Italiano [5], although both papers are independent. There, the authors provide a general framework for combining string techniques with a variety of linked data structures. In contrast, the emphasis in our paper is set on search trees and on sorting methods (quicksort and mergesort).

The next sections are organized as follows. Section 2 covers successful and unsuccessful digital searches in BSTs. Section 3 deals with digital updates in BSTs, i.e., insertions at leaves, deletions of leaves, and rotations. Section 4 extends the fixed-length base-2 algorithms designed in Sections 2 and 3, so they work for variable-length keys in any base larger than 2.
Section 5 presents a straightforward mechanism for incorporating the digital access into the code of comparison-based balancing algorithms. Section 6 adapts some of the most important sorting methods to the digital framework of this paper, obtaining excellent cost bounds for the number of memory exchanges and digit comparisons. The paper ends with some conclusions and remarks.

2. DIGITAL SEARCHES IN BINARY SEARCH TREES

Given a D-bit key y, let $y[1..D]$ denote the digital representation of y, where $y_1$ is the most significant bit. For two distinct keys y and z, let $\delta(y, z)$ be the largest integer i such that $y[1..i-1] = z[1..i-1]$; in other words, $\delta(y, z)$ is the position of the first bit at which y and z differ. If $y_i = 0$ and $z_i = 1$ for $i = \delta(y, z)$, then we assume that y < z, that is, that key comparisons are defined as a direct extension of the underlying bit comparisons.

Given a BST T and two keys z and y, we say that z is an ancestor of y if and only if z appears in the search path for y in T. We define $p(y)$, the path predecessor of y, as the largest ancestor of y which is smaller than y. Similarly, we define $s(y)$, the path successor of y, as the smallest ancestor of y which is larger than y. Note that $p(y)$ or $s(y)$ could be undefined.

Through the remainder of the paper, c will always denote the key of the current node. So, $p(c)$ and $s(c)$ will denote the path predecessor and the path successor of the current key. The algorithms in this paper use $\delta(c, p(c))$ and $\delta(c, s(c))$; in fact, the crucial value is the largest of the two (note that these two values cannot be equal). In our algorithms, together with every key c we store $d(c) = \max\{\delta(c, p(c)), \delta(c, s(c))\}$ and a boolean flag $\mathit{is\_p}(c)$ that indicates which of the two is stored: it is true when $d(c) = \delta(c, p(c))$ and false when $d(c) = \delta(c, s(c))$. When c has no smaller ancestors we set $\mathit{is\_p}(c)$ = false. Similarly, when c has no larger ancestors we set $\mathit{is\_p}(c)$ = true. For the root node we arbitrarily set $\mathit{is\_p}(c)$ to true or false and define $d(c) = 0$, as if c differed in a 0th virtual bit with a nonexistent path predecessor or path successor above the root of the tree.

Figure 3 shows an example of BST with 8-bit keys. For each key c, the bits between parentheses do not need to be stored explicitly; we shall come back to this point at the end of the section. Apart from its key c and the pointers to its children, every node has two additional fields: the left one is $d(c)$, the right one is $\mathit{is\_p}(c)$. For instance, the key c = 10000000 differs at its first bit with $p(c)$ = 00101100 and at its third bit with $s(c)$ = 10100101. So $d(c) = \max\{1, 3\} = 3$ and $\mathit{is\_p}(c)$ = false. Note that the shape of the tree is completely independent of the set of digits for the keys.

Let x be the search key. The digital search algorithm visits exactly the same nodes as a standard search but performs bit comparisons instead of full key comparisons. Let y and z be ancestors of x such that z > y > x.
FIG. 3. An example of BST with the information to perform digital accesses.
Then, on the one hand, z is an ancestor of y, and on the other hand $\delta(x, z) \le \delta(x, y)$. Therefore, when searching for x on the way down the tree, after x and z have been compared it is enough to compare x with y starting at the $(\delta(x, z) + 1)$th bit to find out that $x_{\delta(x,y)} = 0 < y_{\delta(x,y)} = 1$. For instance, suppose that we search for x = 00000001 in the BST in Fig. 3. At the root node we only need to compare $x_1$ with $c_1$ to conclude that x < c = 10100101; therefore we follow the left branch. At the next node we know that x and c start with the same prefix (this can be deduced through $d(c)$ and $\mathit{is\_p}(c)$, as we will see below), so we compare $x_2$ with $c_2$, $x_3$ with $c_3$, and finally $x_4$ with $c_4$. The argument above "telescopes" to show that during the whole search and regarding ancestors larger than x, only the bits in $x[1..\delta(x, s(x))]$ need to be inspected, and just once. In our example, a final comparison involving $x[5..7]$ is performed at the last node with c = 00000010. Altogether, we have examined $x[1..\delta(x, s(x))]$, where $s(x)$ = 00000010 and $\delta(x, s(x)) = 7$. A similar argument applies to ancestors smaller than x, thus allowing us to conclude that only the bits in $x[1..\max\{\delta(x, p(x)), \delta(x, s(x))\}]$ are helpful in guiding the search.

Now, suppose that we are given two keys, the search key x and the key of the current node c, and have to decide whether x is smaller than, equal to, or larger than c. Assume that $\delta(c, p(c)) > \delta(c, s(c))$; this implies that the first $\delta(c, s(c)) - 1$ bits of c, $p(c)$, $s(c)$, and x are equal. Moreover, $c_{\delta(c,s(c))} = p(c)_{\delta(c,s(c))} = 0$ and $s(c)_{\delta(c,s(c))} = 1$. Let j (respectively, k) be the maximum bit position examined so far by comparisons with the ancestors of x which are on the path to c and smaller than x (respectively, larger than x). Since these are exactly the ancestors of c smaller than c (respectively, larger than c), we have $j = \delta(x, p(c))$ (respectively, $k = \delta(x, s(c))$). There are three possibilities:

(a) If $j < \delta(c, p(c))$, then x > c, because $x[1..j-1] = c[1..j-1]$ and $x_j = 1 > c_j = 0$.

(b) If $j > \delta(c, p(c))$, then x < c, because $x[1..\delta(c, p(c))-1] = c[1..\delta(c, p(c))-1]$ and $x_{\delta(c,p(c))} = 0 < c_{\delta(c,p(c))} = 1$.

(c) If $j = \delta(c, p(c))$, we have $x[1..j-1] = c[1..j-1]$ with $x_j = c_j = 1$. Moreover, $j > \delta(c, s(c))$ implies $x_{\delta(c,s(c))} = 0$ and $k = \delta(c, s(c)) < j$. Hence, the comparisons at previous steps with ancestors of x larger than x provide no additional information about the relation between x and c. This situation requires comparing $x_{j+1}$ with $c_{j+1}$, $x_{j+2}$ with $c_{j+2}$, and so on, until either a difference between x and c is found or the end of the keys is reached. In contrast, the possibilities (a) and (b) require no bit comparisons.

The digital search algorithm in Fig. 4 makes use of all the observations above with a further refinement; instead of recursively keeping $j = \delta(x, p(c))$ and $k = \delta(x, s(c))$, we settle for the maximum of the two. This requires less computational work and, as shown below, is enough to guide the searches. We will use two variables, $x_p$ and $x_s$, following this invariant:

1. If j > k, then $x_p = j$ and $x_s \le k$; if j < k, then $x_s = k$ and $x_p \le j$.
2. So far only the first $\max\{x_p, x_s\}$ bits of x have been compared, and each of them just once.

As initial values we set $x_p = x_s = 0$. In Fig. 4, $j'$ and $k'$ denote the values of j and k at the next step of the search. Similarly, $x_p'$ and $x_s'$ denote the new values chosen for the variables $x_p$ and $x_s$. The algorithm considers all possible cases for a digital search when $\mathit{is\_p}(c)$ = true. Figure 5 includes all these cases. We denote the current key in the next step of the search by $c'$. The positions marked with ******, ######, and xxxxxx denote common subsequences of bits, which could be empty. The positions marked with ?????? are common subsequences that need to be compared to continue the search. In Case 5, $\delta(c, p(c))$ and $\delta(x, s(c))$ are not related at all. Note that only Cases 3 and 4 require updating $x_p$ or $x_s$. When $\mathit{is\_p}(c)$ = false, the situation is symmetric.

FIG. 4. Algorithm to compare x with c when $\mathit{is\_p}(c)$ = true.

FIG. 5. Possible cases for a digital search when $\mathit{is\_p}(c)$ = true.

For example, suppose that we search for x = 01111000 in the BST of Fig. 3. Initially $x_p = x_s = 0$ and c is the root key. The field $d(c)$ stores $\delta(c, p(c))$ because $\mathit{is\_p}(c)$ = true. Since $x_p = \delta(c, p(c)) = 0$, we compare $x_1 = 0$ with $c_1 = 1$ to conclude that x < c (Case 4). Therefore we set $x_s = 1$ and take the left branch. In the second node $\mathit{is\_p}(c)$ = false. In such a case we have to use the symmetric of the algorithm described in Fig. 4. Since $x_s = \delta(c, s(c)) = 1$, we compare $x_2 = 1$ with $c_2 = 0$ and conclude that x > c (Case 3's symmetric). So we set $x_p = 2$ and follow the right branch. In the third node, with $\mathit{is\_p}(c)$ = true, we have $x_p = 2 < \delta(c, p(c)) = 3$; we know without inspecting any bits that x > c and take the right branch (Case 1). At the fourth node we have $x_s = 1 < \delta(c, s(c)) = 3$; we know that x < c (Case 5's symmetric) and follow the left branch. In the fifth and last node we have $x_p = \delta(c, p(c)) = 2$; so we start comparing $x_3$ with $c_3$, until we find that $x_6 = 0 < c_6 = 1$. Since x < c (Case 4), we set $x_s = 6$, take the left branch to reach an empty tree, and stop the search.

It is not difficult to compute the time and space costs of a search with our method. If we assume the tree to be balanced, then the number of visited nodes per successful or unsuccessful search is log N. A successful search requires comparing each of the D bits just once; an unsuccessful search requires comparing each of the first $\max\{\delta(x, p(x)), \delta(x, s(x))\}$ bits just once. The exact number of bit comparisons in this last case depends on the digits of the keys or on their probability distribution, and sometimes it is quite difficult to compute. Fortunately, a digital search in a binary search tree compares exactly the same bits as a search in a trie; therefore we can directly use all mathematical analyses existing in the literature for tries. For instance, we can conclude that the expected number of bit comparisons is just $\log_2 N + o(\log N)$ when the keys are uniformly distributed [11, p. 618]. For other probability distributions of the digits, any cost in the range $[\log_2 N, D]$ is possible.
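The definitions of $\delta$, $d(c)$, and $\mathit{is\_p}(c)$ used throughout this section can be checked on the 8-bit examples of the text with a few lines of C. This is a minimal sketch: the names `delta`, `fields`, and `Fields` are hypothetical, returning D + 1 for equal keys is a convention of the sketch only, and a missing path predecessor or successor is modeled as $\delta = 0$, following the virtual 0th bit mentioned for the root.

```c
#include <assert.h>
#include <stdint.h>

#define D 8 /* 8-bit keys, as in the examples of this section */

/* i-th bit of x, where bit 1 is the most significant one. */
static int bit(uint8_t x, int i) { return (x >> (D - i)) & 1; }

/* delta(y, z): position of the first bit where y and z differ.
   Returns D + 1 when y == z (a convention of this sketch only). */
static int delta(uint8_t y, uint8_t z) {
    int i = 1;
    while (i <= D && bit(y, i) == bit(z, i)) i++;
    return i;
}

typedef struct {
    int d;    /* max of the two deltas */
    int is_p; /* nonzero iff d is the delta with the path predecessor */
} Fields;

/* Computes d(c) and is_p(c) from the path predecessor p and the path
   successor s of c. has_p / has_s say whether p / s exist; a missing
   one contributes delta = 0. */
static Fields fields(uint8_t c, int has_p, uint8_t p, int has_s, uint8_t s) {
    int dp = has_p ? delta(c, p) : 0;
    int ds = has_s ? delta(c, s) : 0;
    Fields f;
    f.d = dp > ds ? dp : ds;
    f.is_p = dp > ds; /* for the root (dp == ds == 0) the flag is arbitrary */
    return f;
}
```

For c = 10000000 with $p(c)$ = 00101100 and $s(c)$ = 10100101, the call `fields(0x80, 1, 0x2C, 1, 0xA5)` yields d = 3 and is_p = false, in agreement with the values given for Fig. 3.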
We need $\log_2 D + 1$ bits for the space overhead of the fields $d(c)$ and $\mathit{is\_p}(c)$, an amount which is asymptotically negligible compared to the D bits of c. Moreover, if we have keys with long common prefixes, we can save space by storing only the suffix of c that starts at the position $d(c) + 1$. The rest of the bits are implicitly defined by $d(c)$, $\mathit{is\_p}(c)$, and the ancestors of c, and can be computed while going down the tree searching for c. For instance, in the example in Fig. 3 it would suffice to store the last two bits of the key 00000100. Since $d(c) = 6$ and $\mathit{is\_p}(c)$ = true, c has to start with the prefix 00000, which is common to $p(c)$ = 00000010, and the sixth bit of c is 1 because $\mathit{is\_p}(c)$ = true. As extreme examples, the key 10100101 at the root has to be stored in full, while no bits of the key 10000001 need to be explicitly stored. Observe that this strategy saves the same space as in a trie with full keys at leaves replaced by suffixes that are not computable from the trie structure. Furthermore, our method does not use extra nodes with null links. Recall that a trie with random keys requires about $N / \ln 2 \approx 1.44N$ nodes on the average [11, p. 620].

3. DIGITAL UPDATES IN BINARY SEARCH TREES

There are just three fundamental operations to change the shape of a BST: inserting a key at a leaf, deleting the key of a leaf, and rotating a key with its parent. Most balancing strategies can be written in terms of these operations. In this section we show how to perform them while keeping consistent the information for performing digital accesses.

We first observe that the values $\delta(c, p(c))$ and $\delta(c, s(c))$ depend only on the ancestors of c. Therefore, apart from the nodes directly involved, an update can only affect the fields $d(c)$ and $\mathit{is\_p}(c)$ of the nodes in their children's subtrees. A consequence of the previous observation is the digital deletion algorithm of a key stored in a leaf: We first digitally search for it; afterward, we delete the leaf where it was stored. No other action is needed.
The digital insertion algorithm of a key x is also very simple: First, we digitally search for x. If x was already in the tree, we do nothing. Otherwise we will reach an empty tree, where we must place a new node with c = x. Suppose that $x_p > x_s$ at the end of the search. Then $x_p = \delta(x, p(c)) > \delta(x, s(c))$ or, equivalently, $x_p = \delta(x, p(x)) > \delta(x, s(x))$. Therefore, in this case we must store the values $x_p$ and true in the fields $d(x)$ and $\mathit{is\_p}(x)$. The case $x_p < x_s$ is symmetric. For instance, in the BST in Fig. 3, at the end of the digital search for x = 01111000 we had $x_p = 2$ and $x_s = 6$. Since $x_p < x_s$, the correct values for $d(x)$ and $\mathit{is\_p}(x)$ are 6 and false.

We now consider how the contents of the fields $d(c)$ and $\mathit{is\_p}(c)$ must be updated after the rotation in Fig. 6. Recall that the ancestors of y and z cannot be affected by the rotation. The keys in B have exactly the same ancestors in both trees, so they are not affected either. Each key c in C gains y as a new ancestor, which, in principle, could imply updating $d(c)$ or $\mathit{is\_p}(c)$. However, as y < z < c, y cannot be the new path predecessor nor the new path successor of c. Similarly, the keys in A lose z as an ancestor, but z could neither be the path predecessor nor the path successor of any of those keys. Therefore, we only need to update consistently the fields $d(y)$, $\mathit{is\_p}(y)$, $d(z)$, and $\mathit{is\_p}(z)$.

FIG. 6. Possible cases for a rotation, together with the algorithm to update the fields $d(y)$, $\mathit{is\_p}(y)$, $d(z)$, and $\mathit{is\_p}(z)$.

Figure 6 shows the five possible cases for a rotation in terms of the digits of y and z. Let $p(z)$ and $s(z)$ denote the path predecessor and path successor of z before the rotation, where $p(y) = p(z)$ and $s(y) = z$. After the rotation we have $p'(y) = p(z)$, $s'(y) = s(z)$, $p'(z) = y$, and $s'(z) = s(z)$. In Case 3, $\delta(y, p(z))$ could be smaller than, equal to, or larger than $\delta(z, s(z))$. Observe that only Cases 2 and 4 require updating the information for digital accesses. The inverse rotation follows a symmetric pattern.

In the same way as in the case of searches, the number of visited nodes during an insertion or deletion of a leaf depends on the shape of the tree, and it is log N provided that the tree is balanced. Regarding bit comparisons, inserting or deleting x requires a previous search for x. Again, by taking into account the analyses for tries we can conclude that a random deletion of a nonpresent key or a random insertion of a new key requires only about $\log_2 N$ bit comparisons on the average.

In most balancing strategies, rotations are performed before or after an insertion or deletion. The number of visited nodes is proportional to the number of rotations, and it is usually constant on the average. Note that no bits from the involved keys are compared during a rotation. However, if common prefixes are stored just once to save storage space, then a rotation may require moving a common subsequence of bits from the old parent to the new parent. In such a situation, we should decide if the space savings compensate for the time overhead in the rotations.

4. EXTENSIONS TO LARGER BASES

In this section we consider variable-length keys in a base b > 2. Without loss of generality, we think of a key as a string in C, i.e., an array of characters finished with '\0', a character smaller than any other. Figure 7 shows a BST storing a set of English words together with the information for performing digital accesses.

There are several important differences for keys in a base b > 2. First, a base-b key c can differ at the same digit with $p(c)$ and $s(c)$. For instance, the key "archive" in Fig. 7 differs at the fourth digit with both "arc" and "arcs". In such a situation we arbitrarily set $\mathit{is\_p}(c)$ to true or false. Second, knowing that $\delta(y, z) = i$ and that $y_i < z_i$ does not give us enough information to deduce the values of $y_i$ and $z_i$.
Recall that, in the case of base-2 keys, we know that $y_i = 0$ and $z_i = 1$. Consequently, every key c has to be explicitly stored from the $d(c)$th digit, since now $c_{d(c)}$ is not computable from $\mathit{is\_p}(c)$. In Fig. 7 we can see that the redundant prefix of every key is always $c[1..d(c)-1]$. Thus the appropriate value for the field $d(c)$ at the root is 1. Finally, a digital search may require comparing some digits more than once, though at most a constant number of times. For instance, if in the tree in Fig. 7 we digitally search for "home", then its first and second characters are only compared at the root, while its third character has to be compared at the root and also at the node with the key "hopefully".
FIG. 7. A BST with strings and the information to perform digital accesses.
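For string keys, the position at which two keys first differ can be computed as in the following C sketch, which treats the terminating '\0' as a digit smaller than any other. The function name `str_delta` is hypothetical, and returning 0 for equal strings is a convention of the sketch, not something taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* 1-based position of the first digit (character) at which y and z
   differ, counting the terminating '\0' as a regular digit.
   Returns 0 when both strings are equal (convention of this sketch). */
static size_t str_delta(const char *y, const char *z) {
    size_t i = 0;
    while (y[i] == z[i]) {
        if (y[i] == '\0') return 0; /* reached the end of both keys */
        i++;
    }
    return i + 1;
}
```

With this function, "archive" differs from both "arc" and "arcs" at position 4, which is the situation where a base-b key differs at the same digit with its path predecessor and its path successor.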
Taking into account all of the above observations, the search invariant for a string key x must be slightly different from the one for base-2 keys. Let $j = \delta(x, p(c))$ and $k = \delta(x, s(c))$. The new search invariant is:

1. If j > k, then $x_p = j$ and $x_s \le k$; if j < k, then $x_s = k$ and $x_p \le j$; and if j = k, then $x_p = x_s = j$.
2. So far only the first $\max\{x_p, x_s\}$ digits of x have been compared, perhaps more than once.

As initial values we set $x_p = x_s = 1$. When $\mathit{is\_p}(c)$ = true, the possible cases of a digital search are also those in Fig. 5, if we think of 0s and 1s as any pair of digits such that the first is smaller than the second. Base-b keys present two additional cases that have to be considered. The first one is similar to Case 5, but with $p(c)_{\delta(c,s(c))} = c_{\delta(c,s(c))} < x_{\delta(c,s(c))} < s(c)_{\delta(c,s(c))}$. In this case $x_p = x_s = \delta(c, s(c)) < \delta(c, p(c))$, and as in Case 5 we deduce that x is larger than c. Note that no updates for $x_p$ or $x_s$ are needed. The second case is new; it occurs when c differs with $p(c)$ and $s(c)$ at the same digit $d(c)$. If $x_p = x_s = d(c)$, then we do not have enough information to decide which of x or c is the largest, so we must compare their digits from the $d(c)$th position. If $x_p > x_s = d(c)$ (if $x_s > x_p = d(c)$), then we can affirm that x is smaller (respectively, larger) than c without comparing any digits, and we do not need to update $x_p$ or $x_s$.

Altogether, the search algorithm is almost identical to that in Section 2: The situations $x_p < \delta(c, p(c))$ and $x_p > \delta(c, p(c))$ can be handled exactly as in the binary search. The situation $x_p = \delta(c, p(c))$ is different, since now $p(c)_{x_p} < c_{x_p}$ and $p(c)_{x_p} < x_{x_p}$ do not imply that $c_{x_p} = x_{x_p}$. In conclusion, the algorithm in Section 2 works perfectly for base-b keys, provided that we start comparing $x_i$ and $c_i$ at the position $i = x_p$.

Regarding rotations with base-b keys, it is easy to see that the new cases where y or z (or both) differ at the same position with its path predecessor and its path successor can be handled exactly like the cases in Fig. 6. Therefore, the algorithm given in Section 3 works correctly without further changes.

We end the section by computing the time and space costs of our method when we apply it to base-b keys. If the tree is balanced, a digital search visits log N nodes. But as some digits may be compared more than once, the analysis of the number of digit comparisons per search is not immediate. Assume that all keys have D digits. After comparing two digits, if they were different we move down in the tree, and if they were equal we move to the right in the search key. Therefore, the number of digit comparisons is at most the number of visited nodes plus the length of the inspected prefix. Note that this sum is only an upper bound because we may move down the tree without comparing any digits. Thus, we conclude that the number of digit comparisons per successful search is between D and D + log N, which is roughly D. The number of digit comparisons per unsuccessful search depends on the digits of the keys. For instance, for random digits the average length of the common prefixes is about $\log_b N$ [11, p. 636]. Hence, a digital unsuccessful search in any kind of balanced BST with random keys performs log N digit comparisons on the average.

For the fields $d(c)$ and $\mathit{is\_p}(c)$ we require log D additional space per node, which is asymptotically negligible even if at the nodes we store pointers to keys instead of full keys, because each pointer requires at least log N space.
On the other hand, by storing each common prefix just once, we save exactly the same space as when building a ternary search tree [1] (TST, for short) by inserting the keys in any of the orders used to build the BST (say, left-preorder), since all of these orders produce the same TST. (Recall that, unlike tries, the shape of a TST depends on the insertion order of the keys.) For example, Fig. 8 shows the unique TST associated to the BST of Fig. 7. (Note that it is not true that a TST has in general a unique associated BST. For instance, from the TST in Fig. 8 we cannot deduce if "artist" has been inserted before or after "hopefully", so BSTs different from the one in Fig. 7 could be obtained from the TST in Fig. 8.) Observe that the nonredundant digits in the BST are indeed those stored in the TST. Moreover, it is not difficult to see that the digit comparisons of our method are exactly those that would be made in the associated TST. Thus, we could argue that our method captures the best properties of TSTs, without suffering from their sources of unbalancing, i.e., digits of the keys and insertion order. Moreover, in the worst case TSTs need three links per digit, while our method requires only two links per key.

FIG. 8. The ternary search tree associated to the BST in Figure 7.

5. IMPLEMENTATION ISSUES

This section shows how to translate the algorithmic ideas in previous sections into compact code, which can be easily and efficiently incorporated into the code of comparison-based algorithms for trees.

Figure 9 presents the type definitions for fixed-length base-2 keys. We assume that a key is a word of D = 32 bits. As is usual, a tree is identified with a pointer to its root node. Each node has, apart from its key w and two pointers l and r to the left and right children, the additional fields d and is_p. Using a 16-bit integer for the type Diff in the field d, we can codify any value in the range $[0, 2^{16} - 1]$, which would suffice for most practical situations with large keys. The macro bit(X,I) returns the Ith bit of the word X. Bits and Booleans are defined as the same basic type (for simplicity, a full integer whose value is always 0 or 1), but it is preferable to distinguish them because of their conceptual difference.

Figure 9 also includes the function Compare, which is a direct translation of the digital comparison algorithm in Section 2 and its symmetric. Given the search key x, a pointer t to the root of the current subtree, and pointers to xp and xs, the function Compare returns the result of the comparison of x against the key at the root of t, updating *xp or *xs if necessary. The four possible results are codified with the type Comp.

FIG. 9. Type definitions in C and the Compare function.

Making use of the function Compare allows us to adapt most comparison-based procedures over BSTs to the digital framework. For instance, Fig. 10 presents the recursive function R_Cla_Search, which implements the classic recursive search for a given key into a given BST. Obtaining the digital version R_Dig_Search is trivial if the function Compare is used. The function Dig_Search performs the first call to R_Dig_Search.

It is also a simple matter to obtain the digital counterpart of the classic insertion algorithm of a key x into a new leaf of a given tree t. Figure 11 includes the function R_Dig_Insert, which returns the tree after the insertion. The variables xp and xs are first used to guide the search. If x was not already in the tree, then we fill a new node with x as the key, empty subtrees as children, and the appropriate values for the fields d and is_p, which are easily computed from the current values of xp and xs. Observe that we have arbitrarily chosen to set n->is_p to true when xp equals xs, which happens on the first insertion.
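The description of Compare, Dig_Search, and R_Dig_Insert can be illustrated with the following C sketch for 32-bit base-2 keys. It is an approximation written from the text of Sections 2, 3, and 5, not a reproduction of Figs. 9 to 11: the field and function names follow the text, but this version of Comp has only three values (the empty tree is tested by the callers), the search is iterative rather than recursive, and the allocation-failure handling is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define D 32
typedef uint32_t Key;
typedef struct Node {
    Key w;               /* key */
    struct Node *l, *r;  /* children */
    uint16_t d;          /* max(delta(c,p(c)), delta(c,s(c))) */
    int is_p;            /* nonzero iff d == delta(c,p(c)) */
} Node;
typedef enum { SMALLER, EQUAL, LARGER } Comp;

/* i-th bit of x, where bit 1 is the most significant one. */
static int bit(Key x, int i) { return (int)((x >> (D - i)) & 1u); }

/* Digital comparison of x against t->w (t must not be NULL).
   Updates *xp or *xs only when bits are actually compared. */
static Comp Compare(Key x, const Node *t, int *xp, int *xs) {
    int m = t->is_p ? *xp : *xs;  /* counter matching the stored delta */
    if (m < t->d) return t->is_p ? LARGER : SMALLER;
    if (m > t->d) return t->is_p ? SMALLER : LARGER;
    for (int i = t->d + 1; i <= D; i++) {
        if (bit(x, i) != bit(t->w, i)) {
            if (bit(x, i) < bit(t->w, i)) { *xs = i; return SMALLER; }
            *xp = i;
            return LARGER;
        }
    }
    return EQUAL;
}

/* Iterative digital search; returns the node holding x, or NULL. */
static Node *Dig_Search(Node *t, Key x) {
    int xp = 0, xs = 0;
    while (t != NULL) {
        Comp c = Compare(x, t, &xp, &xs);
        if (c == EQUAL) return t;
        t = (c == SMALLER) ? t->l : t->r;
    }
    return NULL;
}

/* Recursive digital insertion at a leaf; returns the updated tree. */
static Node *R_Dig_Insert(Node *t, Key x, int *xp, int *xs) {
    if (t == NULL) {
        Node *n = malloc(sizeof *n);
        if (n == NULL) abort();  /* assumption: give up when out of memory */
        n->w = x;
        n->l = n->r = NULL;
        n->is_p = (*xp >= *xs);  /* true on ties, i.e., on the first insertion */
        n->d = (uint16_t)(n->is_p ? *xp : *xs);
        return n;
    }
    Comp c = Compare(x, t, xp, xs);
    if (c == SMALLER) t->l = R_Dig_Insert(t->l, x, xp, xs);
    else if (c == LARGER) t->r = R_Dig_Insert(t->r, x, xp, xs);
    return t;  /* x was already present when c == EQUAL */
}

static Node *Dig_Insert(Node *t, Key x) {
    int xp = 0, xs = 0;
    return R_Dig_Insert(t, x, &xp, &xs);
}
```

A caller would start from an empty tree (a NULL pointer) and call Dig_Insert once per key. No balancing is attempted in this sketch, so the number of visited nodes depends on the insertion order; only the bit comparisons are handled digitally.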
FIG. 10. Code for classic and digital searches in a BST.
FIG. 11. Code for digital insertions in a BST.
Figure 12 presents two C functions to implement the right and left rotations. The function Rotate_Right is a direct translation of the algorithm in Section 3; the function Rotate_Left is its symmetric.

FIG. 12. The Rotate_Right and Rotate_Left functions.

Finally, Fig. 13 includes C definitions for string keys. The digits of the keys are not bits but bytes (characters). Since the keys may be variable-length, we assume that dynamic memory is available and define a key to be a pointer to its first character. Figure 13 also includes the Compare function for strings, which has two main differences with that in Fig. 9. First, at the end of each iteration of the for loop we check if the keys have reached their end. Second, we cannot start comparing bytes at the position t->d + 1, but rather at the position t->d.

FIG. 13. New C types and the Compare function for strings.
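The following C sketch illustrates one way to realize the rotation updates of Section 3 for base-2 keys. It is derived here from the definitions of d and is_p and from the relations between path predecessors and successors before and after the rotation; it is not claimed to coincide line by line with Fig. 12. The condition used to detect the updating cases (the child stores its delta with the parent, and that delta is larger than the one stored at the parent) is an assumption of this derivation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Node {
    uint32_t w;          /* key */
    struct Node *l, *r;  /* children */
    uint16_t d;          /* max(delta(c,p(c)), delta(c,s(c))) */
    int is_p;            /* nonzero iff d == delta(c,p(c)) */
} Node;

/* Right rotation at z: its left child y becomes the subtree root.
   Before: p(y) = p(z), s(y) = z. After: s(y) = s(z), p(z) = y. */
static Node *Rotate_Right(Node *z) {
    Node *y = z->l;
    z->l = y->r;
    y->r = z;
    /* Only when y stored delta(y,z) and it exceeds d(z) do the
       fields change: the two d values are exchanged. */
    if (!y->is_p && y->d > z->d) {
        uint16_t tmp = y->d;
        y->d = z->d;
        z->d = tmp;
        y->is_p = z->is_p;  /* y inherits the old role of z */
        z->is_p = 1;        /* z now stores delta(z, y), y = p(z) */
    }
    return y;
}

/* Left rotation at y: its right child z becomes the subtree root.
   Symmetric to Rotate_Right. */
static Node *Rotate_Left(Node *y) {
    Node *z = y->r;
    y->r = z->l;
    z->l = y;
    if (z->is_p && z->d > y->d) {
        uint16_t tmp = z->d;
        z->d = y->d;
        y->d = tmp;
        z->is_p = y->is_p;  /* z inherits the old role of y */
        y->is_p = 0;        /* y now stores delta(y, z), z = s(y) */
    }
    return z;
}
```

Note that, as stated in Section 3, no bits of the keys are inspected: the update only exchanges and relabels the stored fields. Applying Rotate_Left to the result of Rotate_Right restores the original fields in this sketch.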
18
salvador roura 6. DIGITAL SORTING
In this section we adapt the previous ideas to sort digitally a set of N keys. Figure 14 presents the well-known quicksort algorithm. Given two integers 1 ≤ l ≤ r ≤ N, the function Quicksort sorts in two steps the global array K between the lth position and the rth position. First, using the former value in K[r] as a pivot, K[l r] is partitioned into a subarray with keys smaller than or equal to the pivot, followed by the pivot and a subarray with keys greater than or equal to the pivot. Then, the subarrays are recursively sorted. Figure 14 also includes the function that partitions K[l, , r] w.r.t. K[r]. The only noticeable feature is that K[r] is compared exactly once with every other key. This guarantees that there is an implicit BST associated to the sorting process (with K[r] at the root) and allows us to incorporate the digital access into the partitioning step. The macro exch(A,B) exchanges K[A] and K[B], and it is used to isolate one of the two pieces that have to be modified to get the digital version of quicksort. The only other change is obviously the comparison between keys. Figure 15 presents the digital counterpart of the partitioning algorithm. It is a version for fixed length base-2 keys. Apart from the array of keys
FIG. 14. Classic quicksort for base-2 keys.
FIG. 15. Digital partition for base-2 keys.
K, we need two additional arrays P and S, which store, for every key, the equivalent of the variables xp and xs in the digital access to a BST. At each stage, these values are used to compare the keys against K[r], which behaves like their parent. When two keys are exchanged, their corresponding values in P and S also have to be exchanged. The macro dig exch(A,B) does this. The arrays P and S must be initialized to 0. The function Compare is similar to the one in Fig. 9, with two differences: the EMPTY case does not make sense now, and the function needs to know which side of the array is being scanned. For instance, if we are scanning the left side and a key K[k] equal to K[r] is found,
then K[k] will be placed to the right of K[r], which will be the new path predecessor of K[k]; consequently P[k] is updated.

The cost of digital quicksort for base-2 keys is not difficult to compute. First, we observe that with high probability we can have a well-balanced BST associated with the sorting process. This can be achieved by partitioning the array w.r.t. the median of a small sample of keys. The expected number of key movements in that case is N log N. On the other hand, we perform exactly the same bit comparisons as in the process of inserting the keys into a trie. Therefore, the number of bit comparisons varies from about N log_2 N in the case of random keys to about N · D in the worst case. Altogether, the cost of digital quicksort for base-2 keys is competitive w.r.t. the cost of radixsort.

Let us consider the cost of digital quicksort for strings. The algorithm is obtained by combining the digital quicksort algorithm for base-2 keys of Fig. 15 and the digital mergesort algorithm for strings of Fig. 17. If the BST related to the quicksort execution is well balanced, the number of pointers moved around is N log N. The number of times each character is compared depends on the digits of the keys and on the shape of the associated BST. If the BST is well balanced, then in the worst case the number of digit comparisons is about N · D and is N log N in the case of random keys. Note that, with a worst-case set of digits for string keys, radixsort compares about N · D digits as well, but it moves around N · D keys. (When the base of the digits is larger than 2, practical algorithms to partition an array with N keys w.r.t. the ith digit require N key movements, and i can vary from 1 to D.)

We now show how to adapt mergesort to sort a set of string keys digitally. Figure 16 includes the classic version of mergesort, which first recursively sorts the left and right halves of the subarray K[l..r] and afterwards merges them by calling the function Cla Merge.
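As a point of reference, the classic scheme can be sketched as follows. This is an illustrative reconstruction, not the code of Fig. 16: the names cla_merge and cla_mergesort, the explicit auxK parameter, and the 0-based inclusive bounds are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Merges the sorted runs K[l..m] and K[m+1..r] (inclusive bounds),
 * using auxK as scratch space of the same size as K. */
static void cla_merge(const char **K, const char **auxK,
                      size_t l, size_t m, size_t r)
{
    size_t a = l, b = m + 1, k = l, i;

    assert(l <= m && m < r);
    while (a <= m && b <= r) {
        /* strcmp restarts at the first byte of both keys each time;
         * taking the left key on ties keeps the sort stable. */
        if (strcmp(K[a], K[b]) <= 0)
            auxK[k++] = K[a++];
        else
            auxK[k++] = K[b++];
    }
    while (a <= m)
        auxK[k++] = K[a++];
    while (b <= r)
        auxK[k++] = K[b++];
    for (i = l; i <= r; i++)
        K[i] = auxK[i];
}

/* Sorts K[l..r] by sorting both halves and merging them. */
static void cla_mergesort(const char **K, const char **auxK,
                          size_t l, size_t r)
{
    if (l < r) {
        size_t m = l + (r - l) / 2;
        cla_mergesort(K, auxK, l, m);
        cla_mergesort(K, auxK, m + 1, r);
        cla_merge(K, auxK, l, m, r);
    }
}
```

Only pointers are moved; the strings themselves stay in place.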
The call strcmp(K[a], K[b]) compares the keys from the beginning. To avoid redundant digit comparisons, we consider each of the two sorted subarrays as a BST where the keys have been inserted in increasing order and merge them digitally. Figure 17 presents the digital version of the merge algorithm of Fig. 16. Compared to digital quicksort, we can make two observations. First, the keys have no path successor; hence, we only need the additional arrays P, initialized to 1, and auxP. Second, when we find two keys K[a] and K[b] that are equal, we always copy K[a] into the auxiliary array in order to achieve a stable version of mergesort. Therefore, no boolean flag is needed for the equality case. The cost of digital mergesort for string keys can be computed as follows. The number of pointer movements is about N log_2 N, or about 2N log_2 N if we count each movement from auxK to K. Assume that every key has D digits. Comparing two given keys K[a] and K[b] requires
FIG. 16. Classic mergesort for string keys.
performing digit comparisons only when P[a] and P[b] are equal. In this case, every comparison between equal digits acts like moving to the right in the larger key; therefore the number of such comparisons is at most N · D. We only need a single comparison between different digits to know the result of the current comparison. Since the sorting process has about log_2 N levels of recursion and each one performs at most N key comparisons, the number of comparisons between different digits is at most about N log_2 N. Altogether, only about N · D digit comparisons are performed in the worst case. The number of digit comparisons with random keys is difficult to compute with precision, but we certainly know that it is N log N, because on average the common prefixes have length log N.

Let us end the section by computing the cost of the digital mergesort algorithm for base-2 keys, which is easily obtained by combining the algorithms in this section. The number of keys moved around is again about N log_2 N or 2N log_2 N. The shape of the related BST has no influence on which bits are compared, since the bit comparisons are always those in the insertion of the keys into a trie. Hence, as in the case of quicksort, the number of bit comparisons ranges from about N log_2 N with random keys to about N · D in the worst case.
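The digital merge for string keys discussed in this section can be sketched as follows. This is a hedged reconstruction, not the code of Fig. 17. It assumes 0-based offsets, so that P[i] holds the number of leading bytes that K[i] shares with its predecessor in its sorted run and P is initialized to 0 rather than 1. The names lcp_merge, lcp_sort and digital_mergesort are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Merges the sorted runs K[l..m] and K[m+1..r]. Invariant: P[a] and
 * P[b] are the prefix lengths shared with the last key written out. */
static void lcp_merge(const char **K, size_t *P, const char **auxK,
                      size_t *auxP, size_t l, size_t m, size_t r)
{
    size_t a = l, b = m + 1, k = l, i;

    assert(l <= m && m < r);
    while (a <= m && b <= r) {
        int take_a;
        if (P[a] != P[b]) {
            /* The longer shared prefix wins; no byte is inspected. */
            take_a = P[a] > P[b];
        } else {
            const unsigned char *y = (const unsigned char *)K[a];
            const unsigned char *z = (const unsigned char *)K[b];
            size_t d = P[a];
            while (y[d] == z[d] && y[d] != '\0')
                d++;
            take_a = y[d] <= z[d];   /* ties go left: stable merge */
            if (take_a) P[b] = d; else P[a] = d;
        }
        if (take_a) { auxK[k] = K[a]; auxP[k++] = P[a++]; }
        else        { auxK[k] = K[b]; auxP[k++] = P[b++]; }
    }
    while (a <= m) { auxK[k] = K[a]; auxP[k++] = P[a++]; }
    while (b <= r) { auxK[k] = K[b]; auxP[k++] = P[b++]; }
    for (i = l; i <= r; i++) { K[i] = auxK[i]; P[i] = auxP[i]; }
}

static void lcp_sort(const char **K, size_t *P, const char **auxK,
                     size_t *auxP, size_t l, size_t r)
{
    if (l < r) {
        size_t m = l + (r - l) / 2;
        lcp_sort(K, P, auxK, auxP, l, m);
        lcp_sort(K, P, auxK, auxP, m + 1, r);
        lcp_merge(K, P, auxK, auxP, l, m, r);
    }
}

/* Sorts the n NUL-terminated keys in K; returns 0 on success and -1
 * if the auxiliary arrays cannot be allocated. */
static int digital_mergesort(const char **K, size_t n)
{
    size_t *P, *auxP;
    const char **auxK;
    int ok;

    if (n < 2)
        return 0;
    P = calloc(n, sizeof *P);            /* all prefix lengths are 0 */
    auxP = malloc(n * sizeof *auxP);
    auxK = malloc(n * sizeof *auxK);
    ok = (P != NULL && auxP != NULL && auxK != NULL);
    if (ok)
        lcp_sort(K, P, auxK, auxP, 0, n - 1);
    free(P);
    free(auxP);
    free(auxK);
    return ok ? 0 : -1;
}
```

When P[a] and P[b] differ, the key with the longer shared prefix is the smaller one and the other key's value stays valid, which is why bytes are inspected only in the equal case.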
FIG. 17. Digital merge for string keys.
7. CONCLUSIONS

We have presented a simple method to build tree data structures that require visiting only log N nodes and comparing only D digits per search or update. The idea is to provide every key c with a boolean flag and an integer, which tell us which of the keys above c shares the longest common prefix with c, and how long that prefix is. This additional information allows us to search for any given key by comparing each digit at most a constant number of times. Moreover, after a deletion or insertion of a leaf, or after a rotation, this information can be updated consistently in constant time. Therefore, most balancing strategies for trees can incorporate the digital access presented in this paper, thus adding digital access to their comparison-based properties. Furthermore, we have shown that it is straightforward to translate the existing code for classic balancing strategies into code for the digital framework. Finally, applying similar ideas to the problem of sorting an array of
elements, we have achieved algorithms with the best theoretical properties of comparison-based sorting methods and radixsort. It remains an open problem to evaluate empirically how the theoretical ideas developed in this paper perform in practice.

ACKNOWLEDGMENTS

I thank Josep Díaz and Conrado Martínez for their many useful suggestions, Roberto Grossi and Giuseppe Italiano for their references to previous works, and an anonymous referee for the notation used.
REFERENCES

1. J. Bentley and R. Sedgewick, Fast algorithms for sorting and searching strings, in “Proceedings of the 8th ACM–SIAM Symposium on Discrete Algorithms (SODA),” pp. 360–369, 1997.
2. P. Ferragina and R. Grossi, The string B-tree: A new data structure for string search in external memory and its applications, J. Assoc. Comput. Mach., to appear. A preliminary version appeared in Proceedings, 27th ACM Symposium on Theory of Comp., pp. 693–702, 1995.
3. E. Fredkin, Trie memory, Comm. Assoc. Comput. Mach. 3 (1960), 490–500.
4. T. F. Gonzalez, Simple algorithms for the on-line multidimensional dictionary and related problems, Algorithmica 28 (2000), 255–267.
5. R. Grossi and G. F. Italiano, Efficient techniques for maintaining multidimensional keys in linked data structures, in “ICALP: Annual International Colloquium on Automata, Languages and Programming” (J. Wiedermann, P. van Emde Boas, and M. Nielsen, Eds.), Lecture Notes in Computer Science, Vol. 1644, pp. 372–381, Springer, New York/Berlin, 1999.
6. R. W. Irving, Suffix binary search trees, Technical report, Department of Computer Science, University of Glasgow, April 1997.
7. R. W. Irving and L. Love, The suffix binary search tree, in “British Colloquium for Theoretical Computer Science, 2000.”
8. U. Manber and G. Myers, Suffix arrays: A new method for on-line string searches, SIAM J. Comput. 22 (Oct 1993), 935–948.
9. K. Mehlhorn, “Data Structures and Algorithms. Vol. 1. Sorting and Searching,” Springer-Verlag, New York/Berlin, 1984.
10. D. R. Morrison, Patricia—practical algorithm to retrieve information coded in alphanumeric, J. Assoc. Comput. Mach. 15 (1968), 514–534.
11. R. Sedgewick, “Algorithms in C” (3rd ed.), Addison–Wesley, Reading, MA, 1998.