Efficient and compact indexing structure for processing of spatial queries in line-based databases


Available online at www.sciencedirect.com. Data & Knowledge Engineering 64 (2008) 365–380. www.elsevier.com/locate/datak



Hung-Yi Lin

Department of Logistics Engineering and Management, National Taichung Institute of Technology, 129, Sanmin Rd., Sec. 3, Taichung, Taiwan, ROC

Received 8 August 2006; received in revised form 3 August 2007; accepted 4 September 2007. Available online 22 September 2007

Abstract

Points, lines and regions are the three basic entities for constituting vector-based objects in spatial databases. Many indexing schemes have been widely discussed for handling point or region data. These traditional schemes can efficiently organize point or region objects in a space into a hashing or hierarchical directory, and they provide efficient access methods for accurate retrievals. However, two difficulties arise when applying such methods to line segments: (1) the spatial information of line segments may not be precisely expressed in terms of that of points and/or regions, and (2) traditional methods for handling line segments can generate a large amount of dead space and overlapping areas in internal and external nodes in the hierarchical directory. The first problem impedes high-quality spatial conservation of line segments in a line-based database, while the second degrades the system performance over time. This study develops a novel indexing structure of line segments based on compressed $B^+$-trees. The proposed method significantly improves the time and space efficiencies over those of the R-tree indexing scheme.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Spatial database; Line segments; Indexing structure; GIS; Compressed $B^+$-tree; R-tree

1. Introduction

Large spatial databases have been extensively adopted in the recent decade, and various methods [3,8,19,21] have been presented to store, browse, search and retrieve spatial objects. A good spatial database can preserve and arrange explicit and implicit information of spatial objects. Explicit information of an object includes its location, extent, orientation, size and circumference. Implicit information includes the spatial relationship between distinct objects, the distribution and density of objects in a specific area and the coverage for some objects. Gaede and Gunther [4] classified spatial objects in $d$-dimensional Euclidean space ($E^d$) into $d + 1$ types. For each $k$ ($0 \le k \le d$), the set of $k$-dimensional polyhedra forms a data type. For instance, a

This research was supported by the National Science Council of ROC under Grant 96-2221-E-025-013. Tel.: +886 4 2219 6769; fax: +886 4 2219 6161. E-mail address: [email protected]

0169-023X/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2007.09.009


H.-Y. Lin / Data & Knowledge Engineering 64 (2008) 365–380

two-dimensional graph may contain zero-dimensional polyhedra (points), one-dimensional polyhedra (lines, polylines) and two-dimensional polyhedra (regions, polygons). An efficient spatial database has a well-organized indexing structure that enables easy retrieval of the objects of interest for user queries.

The construction of spatial databases differs from that of traditional databases in many respects. First, the data model representing spatial objects must be determined before an index scheme is built. A spatial object may be a single point, a line segment, a curve, a polygonal segment, a 2D polygon or a multidimensional polygon, and its spatial information needs to be preserved precisely. Second, insertions and deletions are interleaved with updates, since spatial objects are often dynamic. The data structures adopted in this context need to support this dynamic behavior without deteriorating over time. Third, spatial databases tend to be large, making the integration of secondary and tertiary memory essential for efficient processing.

Two-dimensional objects can be categorized according to their space occupancies. Points with zero spatial occupancy are generally depicted by explicit coordinates, which are handled and queried by many traditional methods. Polygons, circles, ellipses and rectangles are regional data with nonzero spatial occupancy, and are generally depicted by rectangular objects, which are also indexed and queried accordingly by many traditional methods. A line segment does not enclose any area, so it cannot be categorized as a nonzero-size object. However, in many previous studies [9,16,22], line segments are enclosed and represented by grids, cells, rectangles or MBRs (minimum bounding rectangles). Such methods represent line segments using nonzero-size objects, and the corresponding index entries include much redundant information in their representations. Such redundancy at the leaf level of the hierarchical directory propagates upward to the root and aggravates the redundancy at higher levels. The resulting indexing structure suffers from many problems, such as a heavy overhead for building the index, poor execution performance, low retrieval accuracy and the inability to process certain types of query.

The rest of this paper is organized as follows. Section 2 reviews pertinent literature on point and region indexing methods. The inefficiency of the traditional methods for handling line objects is then addressed. Section 3 then presents a variant of the $B^+$-tree to facilitate line indexing. Section 4 describes the construction of an indexing structure for line segments, and demonstrates the process by an example. Section 5 presents the algorithms for insertion and deletion of a line segment, and three query processing methods. Section 5 also analyzes the time complexity for all query processes. Section 6 discusses the performance of the proposed indexing structures. Section 7 summarizes the experimental results of a GIS application, in which the storage requirement and retrieval performance of the proposed system are compared with those of the R-tree indexing scheme. Conclusions are finally drawn in Section 8.

2. Overview of previous methods

Many methods for handling point and region data have been proposed during the last two decades. The grid file [13] and its variants [2,20] adopt the point access method based on hashing. The KD-tree [1] is a binary search tree that stores points in $k$-dimensional space; at each intermediate node, the KD-tree divides the $k$-dimensional space into two parts by a $(k-1)$-dimensional hyperplane. The K–D–B tree [17] and the G-tree [10] are the typical structures for indexing point data. Grid files proposed in [13] can handle spatial objects with points or regions.
The R-tree proposed by Guttman [5] is probably the most popular structure for indexing nonzero-size objects. An R-tree adopts MBRs to enclose nonzero-size objects, and then represents them as indexed entities. R-trees have been widely employed to index the spatial objects in large pictorial databases such as GIS applications [15]. Although the above-mentioned structures can efficiently process a large number of point objects, uniform grid data and non-uniform MBR data, they do not index line segments well. Applying these structures to line segments causes three major problems. First, using only the endpoints or some specific points of a line does not preserve the full spatial information, since a line comprises infinitely many points and can span a wide space. Second, adopting MBRs for slender or winding objects can easily introduce dead space. Dead space is the redundant space outside an object and inside its MBR. Such redundant space easily leads to serious overlaps between MBRs at each level in an indexing tree. Fig. 1 shows $MBR_1$ and $MBR_2$ enclosing $l_1$ and $l_2$, respectively. Lines $l_1$ and $l_2$ do not intersect, yet $MBR_1$ and $MBR_2$ overlap each other. Third, various spatial relationships exist among line segments. For instance, a line may have no joint with others; a line may be a part of a polygonal line or a polygon; many lines may appear radially with a common central joint, or a line can cross its underlying space and intersect with many other objects. Fig. 2 shows examples of these relationships.

Fig. 1. Dead space and overlapping area derived from MBRs.

Many researchers have utilized data transformations to deal with line segments. Orenstein and Merrett [14] proposed to map spatial objects, including line segments, into points in the lower-dimensional space. One serious problem with this transformation is that it does not preserve proximity. Another method [18] conceptualizes line segments into points in the same multidimensional space. This mapping works for storage purposes, but not for spatial operations involving search. For instance, difficulties may be encountered in detecting how close two line segments are to each other, or finding the nearest line to a given point or line. This is because the proximity in the two-dimensional space from which the line segments are drawn is not preserved in the four-dimensional space.

Hoel and Samet [6,7] performed a qualitative comparison on the performance of three spatial indexing methods, R*-tree, $R^+$-tree and PMR quad-tree, in processing spatial queries in large line segment databases. They believed that adopting data structures to preserve spatial occupancy is the best way to overcome the above-mentioned problems. However, since a line segment does not enclose any area, it does not have any spatial occupancy, and its representation as any nonzero-size geometric shape still introduces dead space. Lindenbaum et al. [12] addressed the geometric properties inherent in the line segments generated from the proposed random image model when analyzing the relative qualitative behavior of the various quad-trees (MX, PM, bucket PMRq and PMRq). Their detailed quantitative and analytic comparison facilitates choosing between various options in a way that is neither experimental nor domain-dependent.

Fig. 2. Different spatial relationships among line segments in a plane.


This study is restricted to two-dimensional line-based spatial databases. A new splitting policy for the traditional $B^+$-tree [11] is presented to compact the inserted data. Rather than resorting to other geometric shapes or spatial objects to index line segments, a new technique based on this new version of the $B^+$-tree is proposed to handle line segments directly. The proposed indexing structure uses storage space economically and performs fast queries over large-scale line-based databases.

3. Compressed $B^+$-trees

Handling and organizing a tremendous number of endpoints of line segments in a large-scale line-based database requires efficient indexing. A traditional $B^+$-tree with low storage utilization (50%–70%) not only consumes a lot of time and space in the building process, but also suffers from poor retrieval performance. This study first proposes a variant of the $B^+$-tree to facilitate indexing the endpoints of line segments. This new $B^+$-tree, called the compressed $B^+$-tree, adopts a novel node splitting policy. A compressed $B^+$-tree has high space efficiency in data storage and high time efficiency in data retrieval.

The structure of a $B^+$-tree is sensitive to data insertion order. An improper data insertion order results in a badly organized index directory. The compressed $B^+$-tree reduces the frequency of splits at the leaf level by packing as much data as possible into leaves before they become full. This technique prevents the $B^+$-tree from allocating additional nodes to the index directory. When a data entry is inserted into a target node, the load status of the target may become under-full, full or overflowed. No adjustment is required after insertion for the "under-full" and "full" cases. If the target becomes overflowed, instead of splitting immediately, the unused space in the sibling nodes is taken to disperse the target's overload.
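The overflow policy just described can be illustrated at the leaf level with a minimal sketch. It is not the paper's implementation: internal nodes and parent fan-out are omitted, "sibling" is simplified to mean an adjacent leaf, keys are assumed distinct, and the names `find_leaf`, `insert` and `build` are hypothetical.

```python
import bisect

def find_leaf(leaves, key):
    """Index of the leaf that receives key; the largest key of a leaf acts as separator."""
    for i, leaf in enumerate(leaves):
        if key <= leaf[-1]:
            return i
    return len(leaves) - 1

def insert(leaves, key, capacity, compressed=True):
    """Insert key into the ordered list of leaves; return 1 if a leaf split occurred."""
    if not leaves:
        leaves.append([key])
        return 0
    i = find_leaf(leaves, key)
    leaf = leaves[i]
    bisect.insort(leaf, key)
    if len(leaf) <= capacity:
        return 0                      # under-full or full: no adjustment needed
    if compressed:
        # Overflow: first try to shift one key into an adjacent leaf with free space.
        if i > 0 and len(leaves[i - 1]) < capacity:
            leaves[i - 1].append(leaf.pop(0))
            return 0
        if i + 1 < len(leaves) and len(leaves[i + 1]) < capacity:
            leaves[i + 1].insert(0, leaf.pop())
            return 0
    # No free space nearby (or traditional policy): split the overflowed leaf.
    mid = (len(leaf) + 1) // 2
    leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]
    return 1

def build(keys, capacity, compressed):
    """Insert all keys in order; return the final leaves and the number of leaf splits."""
    leaves, splits = [], 0
    for key in keys:
        splits += insert(leaves, key, capacity, compressed)
    return leaves, splits
```

With a capacity of 2 and the sequence 8, 5, 1, 7, 3, 12, 9, 10, this sketch ends with four leaves under the compressed policy and five under the traditional one, because the insertion of 12 shifts a key to a neighbouring leaf instead of splitting.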
An example is given below to illustrate the differences between constructing a traditional $B^+$-tree and a compressed $B^+$-tree. Suppose a data sequence {8, 5, 1, 7, 3, 12, 9, 10} is adopted to build a traditional $B^+$-tree and a compressed $B^+$-tree. Fig. 3 shows the procedure and resulting structure of the traditional $B^+$-tree. Fig. 4 shows the building procedure of the compressed $B^+$-tree. Both structures have a node capacity of 2 and a fan-out of 3. First, the root accommodates data 8 and 5; then datum 1 makes the root overflow, forcing it to split. Consider the following four scenarios (A to D):

A. [Attempting to insert a new datum into an under-full target leaf.] At step 2 in Fig. 3, datum 7 is to be inserted into the right child of the root, which makes the target full. No adjustment is required.

B. [Attempting to insert a new datum into a full target leaf with all its siblings full, while the fan-out of the parent node is less than 3.] At step 3 in Fig. 3, datum 3 is to be inserted into the left child of the root, but that child is full. According to the index scheme of the traditional $B^+$-tree, the target must split, as shown at step 4. The fan-out of the root thus becomes 3.

C. [Attempting to insert a new datum into a full target leaf while at least one of its siblings has free space.] Datum 12 is to be inserted into the rightmost child of the root at step 4 in Fig. 3, and step 4′ in Fig. 4. Since that target child is already full, it must split under the traditional $B^+$-tree scheme. This split propagates upward to the root, and the system generates a new root, increasing the level by 1, as shown at step 5 in Fig. 3. However, the unused space found in the target's sibling can be taken for dispersing the target's overload. As shown at step 4′ in Fig. 4, datum 7 is shifted to the sibling, giving the target free space for datum 12. Additionally, datum 5 in the parent is replaced by datum 7.

D. [Attempting to insert a new datum into a full target leaf while all its siblings are full, as well as the parent node.] The target splits, and this split propagates upward to the root as shown at step 5′ in Fig. 4. The system generates a new root, and increases the level by 1, producing the structure shown at step 6′ in Fig. 4.

The most significant improvement resulting from the use of compressed $B^+$-trees is a better management of data entries at the leaf level. Compressed $B^+$-trees remedy the problem caused by data insertion order, and let leaves hold much more data before resorting to splits. (The traditional $B^+$-tree requires 5 splits and 8 nodes, as indicated in Fig. 3, while the compressed $B^+$-tree shown in Fig. 4 only requires 4 splits and 7 nodes.) With their compact arrangement of data entries, compressed $B^+$-trees offer better search performance than traditional $B^+$-trees. Hence, space and time efficiencies are both achieved by compressed $B^+$-trees. Improvements in storage space and retrieval performance are essential in the design of an index scheme. The following sections reveal the superior performance of compressed $B^+$-trees in indexing line objects.

Fig. 3. A traditional $B^+$-tree built by the data sequence of {8, 5, 1, 7, 3, 12, 9, 10}.

Fig. 4. A compressed $B^+$-tree built by the data sequence of {8, 5, 1, 7, 3, 12, 9, 10}.

4. New indexing method

A line-based database (denoted by LDB) comprises a collection of line segments. A line segment is defined as $l(p_i, p_j)$, where $p_i = (x_i, y_i)$ and $p_j = (x_j, y_j)$, $i \neq j$, are the endpoints of $l$ in two-dimensional Euclidean space. For simplicity, denote $l(p_i, p_j)$ by $l_{ij}$. Assume that line segments are non-directional, so that $l_{ij} = l_{ji}$. Fig. 5 shows a simple LDB with 10 line segments, $LDB = \{l_{12}, l_{15}, l_{23}, l_{34}, l_{37}, l_{45}, l_{48}, l_{59}, l_{68}, l_{78}\}$, for the running example. The proposed design extracts the x- and y-coordinates of the endpoints of all line segments into sets $I_x$ and $I_y$, respectively, which contain linearly ordered indexing numbers. Based on the new inserting technique of compressed $B^+$-trees, the elements in $I_x$ and $I_y$ are adopted to build two independent compressed $B^+$-trees, called the compressed $B^+_x$-tree and the compressed $B^+_y$-tree, respectively. These two hierarchical directories jointly arrange all line segments in a LDB into a well-organized index structure. The term compressed $B^+_x$ & $B^+_y$-trees is used to denote the joint hierarchy.

Fig. 5. (a) The indexing numbers in $I_x$ and (b) the indexing numbers in $I_y$ generated from a LDB with 10 line segments.


The two distinct x-coordinates of each line $l_{ij}$ in a LDB are collected into $I_x$. If $l_{ij}$ is vertical, then its endpoints have the same x-coordinate, which is not included in $I_x$. Denote by $l^x_{ij}$ the vertically projected interval of $l_{ij}$. Then $I_x$ is defined as follows:

$$I_x = \{ x_k \mid x_k \text{ is the x-coordinate of } l_{ij},\ |l^x_{ij}| \neq 0 \text{ and } \forall l_{ij} \in LDB \}$$

Similarly, $I_y$ is created by the same operations in the y-direction. In Figs. 5(a) and (b), 10 line segments share 9 endpoints in common. The point $p_4$ is a common endpoint shared by lines $l_{34}$, $l_{48}$ and $l_{45}$. Since $l_{34}$, $l_{48}$ and $l_{45}$ are not parallel to the y-axis, numbers $x_3$, $x_4$, $x_5$ and $x_8$ are collected into $I_x$. Notably, $l_{68}$ is horizontal, and the number $y_6$ in $I_y$ is derived from $l_{48}$ or $l_{78}$ but not from $l_{68}$. These 10 line segments generate 9 numbers in $I_x$ and 7 numbers in $I_y$. That is, $|I_x| = 9$ and $|I_y| = 7$.

When (1) no line segment is vertical or horizontal, (2) no two line segments share the same x-coordinate or y-coordinate, and (3) all line segments have nonzero lengths, we can infer $|I_x| \cong |I_y| \cong 2|LDB|$. Nevertheless, an endpoint can be shared by several line segments, in which case $|I_x|$ and $|I_y|$ are generally far below $2|LDB|$. As a result, $|I_x|$ and $|I_y|$ are likely to lie between $|LDB|$ and $2|LDB|$. To maintain the correct orders in $I_x$ and $I_y$, the elements are sorted every time new elements are appended to the sets.

Suppose that the insertion order of line segments in Fig. 5 is $\{l_{12}, l_{15}, l_{23}, l_{34}, l_{37}, l_{45}, l_{48}, l_{59}, l_{68}, l_{78}\}$. Consequently, the data insertion orders for the compressed $B^+_x$-tree and the compressed $B^+_y$-tree are $\{x_1, x_2, x_5, x_3, x_4, x_7, x_8, x_9, x_6\}$ and $\{y_2, y_1, y_5, y_3, y_4, y_7, y_6\}$, respectively. Figs. 6 and 7 show the resulting compressed $B^+_x$-tree and compressed $B^+_y$-tree, respectively.

In Fig. 6, the compressed $B^+_x$-tree involves 5 splits, and requires 8 nodes to complete the index. The tree has 13 data entries and its space utilization is $13/(8 \times 2) = 81.25\%$. Conversely, a traditional $B^+$-tree with the same data insertion order involves 7 splits, and allocates 10 nodes to complete the index, as shown in Fig. 8. This tree has 14 data entries and its space utilization is $14/(10 \times 2) = 70\%$. As with the compressed $B^+_x$-tree, the compressed $B^+_y$-tree involves only 4 splits, and uses only 7 nodes and 10 entries to complete the index. A traditional $B^+$-tree with the same data insertion order involves 5 splits, and requires 8 nodes and 11 entries to complete the index, as shown in Fig. 9. The ratio of space utilization between them is $10/(7 \times 2) : 11/(8 \times 2) = 71.43\% : 68.75\%$.

Let $x_i$ denote an arbitrary number on the x-axis, which may be an element in $I_x$. Terms $x_i^-$ and $x_i^+$ are defined as the numbers before and after $x_i$ in $I_x$, respectively, so that $x_i^- < x_i$ and $x_i < x_i^+$. Namely, no number $x_m \in I_x$ exists where $x_i^- < x_m < x_i$ or $x_i < x_m < x_i^+$. The same argument is applied on the y-axis. In our design, an entry associated with the indexed number $x_k$ in a leaf node of a compressed $B^+_x$-tree has the form $[x_k, bp]$, where $bp$ denotes an array pointer referring to the array $C(x_k)$. Line segments are collected into $C(x_k)$ if their vertical projections overlap $[x_k, x_k^+]$. Similar arguments hold for a compressed $B^+_y$-tree. Restated, $C(x_k)$ and $C(y_k)$ can be defined as follows:

Fig. 6. The compressed $B^+_x$-tree built by the data sequence of $\{x_1, x_2, x_5, x_3, x_4, x_7, x_8, x_9, x_6\}$.

Fig. 7. The compressed $B^+_y$-tree built by the data sequence of $\{y_2, y_1, y_5, y_3, y_4, y_7, y_6\}$.

Fig. 8. The traditional $B^+$-tree built by the data sequence of $\{x_1, x_2, x_5, x_3, x_4, x_7, x_8, x_9, x_6\}$.

Fig. 9. The conventional $B^+$-tree built by the data sequence of $\{y_2, y_1, y_5, y_3, y_4, y_7, y_6\}$.

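$$C(x_k) = \{ l_{ij} \mid [x_k, x_k^+] \cap l^x_{ij} \neq \emptyset, \text{ where } \forall l_{ij} \in LDB \}$$

$$C(y_k) = \{ l_{ij} \mid [y_k, y_k^+] \cap l^y_{ij} \neq \emptyset, \text{ where } \forall l_{ij} \in LDB \}$$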

Taking $p_4$ in Fig. 5 for illustration, the spatial proximity near coordinate $(x_4, y_4)$ can be obtained from the information preserved in $C(x_4^-)$, $C(x_4)$, $C(y_4^-)$ and $C(y_4)$. That is,

$$(C(x_4^-) \cup C(x_4)) \cap (C(y_4^-) \cup C(y_4)) = \{l_{15}, l_{23}, l_{34}, l_{48}, l_{45}\} \cap \{l_{12}, l_{34}, l_{45}, l_{48}, l_{59}\} = \{l_{34}, l_{48}, l_{45}\}$$
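The construction of the indexing numbers and their arrays can be sketched in a few lines. The coordinates of Fig. 5 are not given in the text, so the four segments below are hypothetical. The sketch also assumes a half-open reading of the definition, in which a segment belongs to $C(x_k)$ when its projection starts at or before $x_k$ and ends after $x_k$; this reading is an assumption chosen to agree with the worked arrays of the running example, and the function names are illustrative only.

```python
def index_numbers(segments, axis):
    """Sorted endpoint coordinates on one axis (0 = x, 1 = y).

    Segments whose projection on that axis has zero length
    (vertical segments for x, horizontal ones for y) contribute nothing.
    """
    values = set()
    for p, q in segments.values():
        if p[axis] != q[axis]:
            values.update((p[axis], q[axis]))
    return sorted(values)

def coverage_arrays(segments, axis):
    """Map each indexing number v to the identifiers of segments covering [v, v+).

    Assumed half-open reading: min coordinate <= v < max coordinate.
    """
    arrays = {}
    for v in index_numbers(segments, axis):
        arrays[v] = {
            name
            for name, (p, q) in segments.items()
            if min(p[axis], q[axis]) <= v < max(p[axis], q[axis])
        }
    return arrays

# Hypothetical data: "c" is vertical and "d" is horizontal.
segments = {
    "a": ((0, 0), (4, 2)),
    "b": ((4, 2), (6, 5)),
    "c": ((4, 2), (4, 7)),
    "d": ((2, 5), (6, 5)),
}
x_arrays = coverage_arrays(segments, 0)
y_arrays = coverage_arrays(segments, 1)
```

In this toy data set the vertical segment "c" adds no number to the x-axis index and the horizontal segment "d" adds none to the y-axis index, mirroring the treatment of $l_{68}$ in the running example.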

The number of identifiers preserved in arrays grows significantly when the data size of the LDB grows. Fortunately, the difference in contents between two neighboring arrays is small, as a line segment generally covers several consecutive indexing numbers on the x- or y-axis, and its identifier is therefore included in several consecutive arrays. To save storage space, the duplicate information between neighboring arrays is eliminated by the following difference encoding method. The full information is stored in the array referred to by the leftmost entry in every leaf of the compressed $B^+_x$ & $B^+_y$-trees. The subsequent arrays only store the difference from their predecessor arrays. Using the LDB shown in Fig. 5, and taking $x_1$ and $x_2$ for illustration, $C(x_2) = \{l_{23}, l_{15}\}$ is replaced by the tuning array $C'(x_2) = \{-l_{12}, +l_{23}\}$, where $-l_{12}$ means that $l_{12}$ is excluded from $C(x_1)$, and $+l_{23}$ means that $l_{23}$ is included into $C(x_1)$. The other arrays referred to by the compressed $B^+_x$ & $B^+_y$-trees are shown as follows:

Arrays of the compressed $B^+_x$-tree:
$C(x_1) = \{l_{12}, l_{15}\}$, $C'(x_2) = \{-l_{12}, +l_{23}\}$
$C(x_4) = \{l_{23}, l_{34}, l_{48}, l_{45}, l_{15}\}$, $C'(x_3) = \{-l_{23}, -l_{34}, +l_{37}\}$
$C(x_6) = \{l_{37}, l_{68}, l_{48}, l_{45}, l_{15}\}$, $C'(x_5) = \{-l_{45}, -l_{15}, +l_{59}\}$
$C(x_9) = \{l_{37}, l_{68}, l_{48}\}$
$C(x_7) = \{l_{78}, l_{68}, l_{48}\}$

Arrays of the compressed $B^+_y$-tree:
$C(y_3) = \{l_{23}, l_{34}, l_{37}\}$, $C'(y_7) = \{-l_{37}, +l_{78}\}$
$C(y_2) = \{l_{12}, l_{34}, l_{78}\}$, $C'(y_6) = \{-l_{78}, +l_{48}\}$
$C(y_4) = \{l_{12}, l_{45}, l_{59}\}$
$C(y_1) = \{l_{15}, l_{45}, l_{59}\}$

The complexity analysis presented in Section 5, and the experimental results presented in Section 6, indicate that the difference encoding method can keep all tuning arrays within a fixed size. Notably, calculating the actual content of any array referred to by a leaf entry of the compressed trees requires traversing at most $M$ entries from the leaf's leftmost entry, where $M$ denotes the maximum capacity of a leaf node. Fig. 10(a) and (b) show the final compressed index trees for the LDB shown in Fig. 5. The endpoints of line segments are completely indexed at the leaf level of the compressed $B^+_x$ & $B^+_y$-trees.
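The difference encoding within one leaf can be sketched as follows. This is an illustration under assumptions, not the paper's storage layout: each tuning array is modelled as a pair of Python sets (identifiers removed, identifiers added), and the names `encode_leaf` and `decode_entry` are hypothetical. Decoding entry $k$ replays at most $k$ tuning arrays from the leaf's leftmost entry, which is the bounded traversal mentioned above.

```python
def encode_leaf(arrays):
    """Difference-encode the arrays of one leaf, given left to right as sets.

    The leftmost array is stored in full; every later array is stored as
    (removed, added) relative to its predecessor.
    """
    if not arrays:
        return []
    encoded = [("full", set(arrays[0]))]
    for prev, cur in zip(arrays, arrays[1:]):
        encoded.append(("diff", prev - cur, cur - prev))
    return encoded

def decode_entry(encoded, k):
    """Rebuild the full array of the k-th entry (0-based) of an encoded leaf."""
    current = set(encoded[0][1])
    for _, removed, added in encoded[1:k + 1]:
        current = (current - removed) | added
    return current

# The leaf holding x1 and x2 in the running example.
leaf = [{"l12", "l15"}, {"l15", "l23"}]
encoded = encode_leaf(leaf)
```

For this leaf the second entry is stored as "remove l12, add l23", which corresponds to the tuning array $C'(x_2)$ of the running example.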


Fig. 10. The compressed $B^+_x$ and $B^+_y$-trees for the LDB shown in Fig. 5.

5. Maintaining indexing structures and query processing

Adding a new line segment with nonzero length to a LDB may insert a new index record into the compressed $B^+_x$-tree and/or the compressed $B^+_y$-tree. Inserting such a record into either tree is similar to inserting one into a compressed $B^+$-tree: the new index record is added to one leaf; an overflowed target may appeal to its siblings to share its overload or split itself, and splits may propagate up the tree. One or several arrays may be accessed to include the identifier of the new line segment in their contents.

When inserting a segment with one or two x-coordinates not listed in $I_x$, the missing coordinates are inserted into the compressed $B^+_x$-tree, and some arrays referred to by the leaves of that tree are accessed for updating. Similar handling applies in the y-direction. No adjustment of the compressed $B^+_x$ & $B^+_y$-trees is required when inserting a segment whose two x-coordinates (and two y-coordinates) are already found in $I_x$ (and $I_y$); only the related arrays are accessed for updating in this case. The insertion algorithm is described as follows:

Algorithm. Insertion
Input: A segment $l_{ab} = l(p_a, p_b)$ with $p_a = (x_a, y_a)$, $p_b = (x_b, y_b)$, $a \neq b$, to be inserted into the compressed $B^+_x$ & $B^+_y$-trees.
Output: The roots of the new compressed $B^+_x$ & $B^+_y$-trees.
1. Let $x_1 \leftarrow \min(x_a, x_b)$, $x_2 \leftarrow \max(x_a, x_b)$, $y_1 \leftarrow \min(y_a, y_b)$, $y_2 \leftarrow \max(y_a, y_b)$.
2. Search the compressed $B^+_x$ & $B^+_y$-trees for $x_1$, $x_2$, $y_1$ and $y_2$. If $x_1$ and/or $x_2$ are/is not found, and $x_1 \neq x_2$, then insert $x_1$ and/or $x_2$ into the compressed $B^+_x$-tree. Apply similar operations in the y-direction (i.e. $y_1$ and/or $y_2$ are/is not found and $y_1 \neq y_2$) to the compressed $B^+_y$-tree.
3. In the compressed $B^+_x$-tree,
   (a) If $x_1$ is the leftmost entry in a leaf, then add "$l_{ab}$" to $C(x_1)$. Else add "$+l_{ab}$" to $C'(x_1)$.
   (b) For each $x_k$ in $I_x$ that satisfies $x_1 < x_k < x_2$ and is the leftmost entry in a leaf, add "$l_{ab}$" to $C(x_k)$.
   (c) If $x_2$ is not the leftmost entry in a leaf, then add "$-l_{ab}$" to $C'(x_2)$.
4. Apply the same operations as those in step 3 in the y-direction to the compressed $B^+_y$-tree.
5. Return the roots of the new compressed $B^+_x$ & $B^+_y$-trees.

A line is said to be dangling if it does not share endpoints with other lines. Removing a dangling line from a 2D space causes some indexed entries to be deleted from the compressed $B^+_x$ & $B^+_y$-trees. Such a deletion may cause re-organization of the index structure. Additionally, the leaves between the two targets in the compressed $B^+_x$ & $B^+_y$-trees are involved in updating the contents of the related arrays. Conversely, if a segment shares endpoint(s) with other segments, then deleting it only affects the contents of some arrays, and no re-organization is required for the compressed $B^+_x$ & $B^+_y$-trees. The deletion algorithm is described as follows:

Algorithm. Deletion
Input: A segment $l_{ab} = l(p_a, p_b)$ with $p_a = (x_a, y_a)$ and $p_b = (x_b, y_b)$, $a \neq b$, to be removed from the compressed $B^+_x$ & $B^+_y$-trees.
Output: The roots of the new compressed $B^+_x$ & $B^+_y$-trees.
1. Let $x_1 \leftarrow \min(x_a, x_b)$, $x_2 \leftarrow \max(x_a, x_b)$, $y_1 \leftarrow \min(y_a, y_b)$, $y_2 \leftarrow \max(y_a, y_b)$.
2. Search the compressed $B^+_x$ & $B^+_y$-trees for $x_1$, $x_2$, $y_1$ and $y_2$. If they are not found, then return "Object not found" and exit.
3. In the compressed $B^+_x$-tree,
   (a) If $(|C(x_1)| = 1)$ or $(|C(x_1^-) - C(x_1)| = 1)$, /* $l_{ab}$ is the unique segment whose x-projection begins from $x_1$. */ return the corresponding array of $x_1$ to the system. Delete $x_1$ from the compressed $B^+_x$-tree, and re-organize the tree.
   (b) If $x_1$ is the leftmost entry in a leaf, then remove "$l_{ab}$" from $C(x_1)$. Else remove "$+l_{ab}$" from $C'(x_1)$.
   (c) For every $x_k$ in $I_x$ that satisfies $x_1 < x_k < x_2$, if $x_k$ is the leftmost entry in a leaf, then remove "$l_{ab}$" from $C(x_k)$.
   (d) If $x_2$ is not the leftmost entry in a leaf, then remove "$-l_{ab}$" from $C'(x_2)$.
   (e) If $(|C(x_2^-) - C(x_2)| = 1)$, then /* $l_{ab}$ is the unique segment ending at $x_2$. */ return the corresponding array of $x_2$ to the system. Delete $x_2$ from the compressed $B^+_x$-tree, and re-organize the tree.
4. Apply the same operations as those in step 3 to the compressed $B^+_y$-tree in the y-direction.
5. Return the roots of the new compressed $B^+_x$ & $B^+_y$-trees.

Queries on a LDB are categorized into three types: point, interval and window queries. A point query retrieves the segments near a user-designated location $P$. A threshold value $\kappa$ is chosen to test the candidates. A segment $l_{ij}$ is regarded as a candidate if $D(l_{ij}, P) \le \kappa$, where $D(l_{ij}, P)$ denotes the Euclidean distance between $l_{ij}$ and $P$. A query interval $QI(x_1, x_2)$ (or $QI(y_1, y_2)$) is adopted to identify the candidates whose vertical (or horizontal) projections intersect the query interval. In other applications, users may scan a narrow band across the entire range along the x- or y-direction to accumulate all objects of interest. An interval query is easily implemented by performing two rounds of point query on the compressed $B^+_x$-tree or on the compressed $B^+_y$-tree. For window query processing, two rounds of interval query are applied separately on the compressed $B^+_x$-tree and the compressed $B^+_y$-tree. The final output of a window query is obtained by intersecting the outputs of the two interval queries.

Algorithm. Point_Query(P)
Input: A point $P(x_a, y_a)$ and a threshold $\kappa$.
Output: A set $S$ containing the segments that are near $P$.
1. $S \leftarrow \emptyset$.
2. Search the compressed $B^+_x$-tree for $x_a$. If found, then set $S_1 = C(x_a^-) \cup C(x_a)$. Otherwise, find $x_a^- \in I_x$ and let $x_p = x_a^-$. Then set $S_1 = C(x_p^-) \cup C(x_p) \cup C(x_p^+)$.
3. The same operations as step 2 regarding the y-axis are applied in the compressed $B^+_y$-tree. If $y_a$ is found, let $S_2 = C(y_a^-) \cup C(y_a)$. Otherwise, find $y_a^- \in I_y$ and let $y_p = y_a^-$. As a result, $S_2 = C(y_p^-) \cup C(y_p) \cup C(y_p^+)$.
4. $S_3 \leftarrow S_1 \cap S_2$.
5. For each $l_{ij} \in S_3$, if $D(l_{ij}, P) \le \kappa$, then $S \leftarrow S \cup \{l_{ij}\}$.
6. Return $S$.
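The steps of Point_Query can be sketched as follows. This is a minimal illustration under stated assumptions: a sorted list searched with `bisect` stands in for each compressed tree, arrays are kept in full rather than difference-encoded, membership in an array uses the half-open reading (projection start ≤ indexing number < projection end), and the three segments and all function names are hypothetical.

```python
import bisect
import math

def build_axis_index(segments, axis):
    """Return (sorted indexing numbers, parallel list of arrays) for one axis."""
    numbers = sorted({p[axis] for seg in segments.values()
                      if seg[0][axis] != seg[1][axis] for p in seg})
    arrays = [{name for name, (p, q) in segments.items()
               if min(p[axis], q[axis]) <= v < max(p[axis], q[axis])}
              for v in numbers]
    return numbers, arrays

def candidates(numbers, arrays, value):
    """Steps 2 and 3: union of the arrays around value on one axis."""
    i = bisect.bisect_left(numbers, value)
    if i < len(numbers) and numbers[i] == value:
        picks = [i - 1, i]              # value found: C(v-) and C(v)
    else:
        p = i - 1                       # predecessor of value
        picks = [p - 1, p, p + 1]       # C(p-), C(p) and C(p+)
    found = set()
    for j in picks:
        if 0 <= j < len(arrays):
            found |= arrays[j]
    return found

def distance(point, a, b):
    """Euclidean distance from point to the segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    dx, dy = bx - ax, by - ay
    length2 = dx * dx + dy * dy
    t = 0.0 if length2 == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_query(segments, x_index, y_index, point, kappa):
    """Steps 4 to 6: intersect both candidate sets, then filter by distance."""
    s3 = candidates(*x_index, point[0]) & candidates(*y_index, point[1])
    return {name for name in s3 if distance(point, *segments[name]) <= kappa}

# Hypothetical data set.
segments = {"a": ((0, 0), (4, 4)), "b": ((4, 4), (8, 0)), "c": ((10, 10), (14, 12))}
x_index = build_axis_index(segments, 0)
y_index = build_axis_index(segments, 1)
```

Querying the point (3, 3) with a threshold of 0.5 keeps only segment "a"; raising the threshold to 1.5 also admits "b", whose nearest point is the shared endpoint (4, 4).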

Algorithm. Interval_Query(QI)
Input: An interval $QI(x_a, x_b)$ along the vertical direction, $x_a < x_b$.
Output: A set $S$ containing the segments whose vertical projections intersect $QI$.
1. $S \leftarrow \emptyset$.
2. Search the compressed $B^+_x$-tree for $x_a$. If found, then let $x_s = x_a$. Otherwise, find $x_a^- \in I_x$ and then let $x_s = x_a^-$.
3. Search the compressed $B^+_x$-tree for $x_b$. If found, then let $x_e = x_b$. Otherwise, find $x_b^- \in I_x$ and then let $x_e = x_b^-$.
4. For every $x_i \in I_x$ that satisfies $x_s \le x_i \le x_e$, $S \leftarrow S \cup C(x_i)$.
5. Return $S$.

Algorithm. Window_Query(W)
Input: A window $W$ with the query intervals $QI(x_a, x_b)$ and $QI(y_a, y_b)$, $x_a < x_b$ and $y_a < y_b$.
Output: A set $S$ containing the segments that intersect $W$.
1. $S \leftarrow \emptyset$.
2. $S_1 \leftarrow$ Interval_Query($QI(x_a, x_b)$).
3. $S_2 \leftarrow$ Interval_Query($QI(y_a, y_b)$).
4. $S \leftarrow S_1 \cap S_2$.
5. Return $S$.
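Interval_Query and Window_Query can be sketched in the same spirit. As an assumption for illustration, each compressed tree is replaced by a sorted list of indexing numbers with a parallel list of full (not difference-encoded) arrays, and the index contents below are hypothetical values for three segments "a", "b" and "c".

```python
import bisect

def interval_query(numbers, arrays, lo, hi):
    """Union of the arrays of every indexing number between x_s and x_e.

    x_s is lo itself if indexed, else its predecessor; x_e likewise for hi.
    """
    start = max(bisect.bisect_right(numbers, lo) - 1, 0)
    end = bisect.bisect_right(numbers, hi) - 1
    found = set()
    for i in range(start, end + 1):
        found |= arrays[i]
    return found

def window_query(x_index, y_index, xa, xb, ya, yb):
    """Intersect the results of one interval query per axis."""
    return interval_query(*x_index, xa, xb) & interval_query(*y_index, ya, yb)

# Hypothetical index contents: (indexing numbers, arrays) per axis.
x_index = ([0, 4, 8, 10, 14], [{"a"}, {"b"}, set(), {"c"}, set()])
y_index = ([0, 4, 10, 12], [{"a", "b"}, set(), {"c"}, set()])
```

Because the result is the intersection of two projection tests, it should be read as a candidate set: a segment whose x- and y-projections both meet the window is returned even if the segment itself passes outside the window corner, so an exact geometric check may still be needed afterwards.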

þ Assume the node capacity of the compressed Bþ x -tree and the compressed By -tree is M, and there are N line segments will be indexed by the proposed method. Because the distribution of data objects in a LDB can influence the contents of I x and I y at the same time, similar data and entry organizations are arranged at all levels of þ the compressed Bþ x -tree and the compressed By -tree. The numbers of leaf nodes in the two compressed trees

376

H.-Y. Lin / Data & Knowledge Engineering 64 (2008) 365–380

 have the same complexity of O MN . The depths of the two compressed trees (denoted by Dx & Dy ) have the same complexity of OðlogM N Þ. As well-known, both compressed trees can be preserved in the main memory when the quantity of data is small. Nevertheless, the hierarchies of two compressed trees should reside in the auxiliary storage when the data size grows significantly. The search performance is mainly dominated by the þ number of involved pages (visited nodes) in the compressed Bþ x & By -trees. þ One round of single-path searching is performed on each of the compressed Bþ x -tree and the compressed By tree for a point query. The number of nodes involved is the sum of the depths of the two compressed trees, that is ðDx þ Dy Þ. Hence, the time complexity of a point query is OðlogM N Þ. For an interval query processing along the x-axis, two rounds of point queries are performed on the compressed Bþ x -tree, causing two target leaf nodes to be retrieved. All leaves between these two targets are then accessed sequentially. Assume that there are C leaves between two target leaf nodes in the tree. The ideal case when a query is performed on a narrow interval involves 2Dx þ C nodes in the compressed Bþ x -tree, and the time complexity of this interval query is OðlogM N Þ. In the worst case, with a wild interval query, an exhaustive search may be inevitable, as all leaves of the tree  need to be examined. The time complexity is O MN in this case. Similar situations occur along the y-axis. Window queries can be completed by applying one round of interval query on each of the compressed Bþ x -tree and the compressed Bþ -tree, and the time complexity is similar to that of an interval query. y 6. Experimental results Line-based databases were tested in six sizes ðN ¼ 210 ; 211 ; 212 ; 213 ; 214 and 215 Þ. Ten sets of objects on a 1000 · 1000 digital map and a 5000 · 5000 digital map were randomly generated for each data size. 
Each individual measurement was stored for averaging the experimental results. All generated line segments in every test set were indexed by the compressed $B_x^+$- and $B_y^+$-trees with a node capacity of $M = 20$. Tables 1 and 2 show the structural information of the resulting index structures for all cases. The results in these two tables comprise the magnitude of $|I_x|$ ($|I_y|$), the number of nodes at each level of the two compressed trees, the depths of the two compressed trees, and the average number of elements in each array ($\bar{N}_b$).

Assume that the underlying space of data objects is given by X × Y, where X and Y represent the horizontal and vertical resolutions, respectively. First, the "dense" case, in which a large number of line segments is distributed over a limited 2D domain with a low resolution, was considered. In a dense case, under our index scheme, most coordinates on the x-axis and y-axis were registered as indexing numbers and accumulated into $I_x$ and $I_y$, respectively. The magnitude of $|I_x|$ ($|I_y|$) is almost the same as the value of X (Y). Conversely, the "sparse" case, in which a small number of line segments is distributed over a large 2D domain with a high resolution, was also considered. In a sparse case, $I_x$ ($I_y$) collects only a small part of the coordinates on the x-axis (y-axis). The second column of Tables 1 and 2 reveals that all cases of Table 1 and the second half of Table 2 are dense cases, whereas the first half of Table 2 consists of sparse cases.

Interestingly, the tree organization changed little when the data size N was doubled. This can be observed by examining the changes in fan-outs, depths, and node numbers in Tables 1 and 2. In particular, in dense cases, the changes were very small or even nonexistent. This indicates that data from the later half had a lower impact on, and caused fewer adjustments to, the hierarchies than data from the earlier half.
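The saturation of $|I_x|$ described above can be illustrated with a small simulation. The sketch below is not the code used in the experiments; it assumes that $I_x$ collects the distinct x-coordinates of the segment end-points and that these coordinates are drawn uniformly and independently. The function names `simulated_distinct` and `expected_distinct` are illustrative.

```python
import random

def simulated_distinct(n_segments, x_resolution, rng):
    """Count distinct end-point x-coordinates of n_segments random segments.

    Assumption: every segment contributes two independent, uniformly
    distributed x-coordinates in the range [0, x_resolution).
    """
    seen = set()
    for _ in range(n_segments):
        seen.add(rng.randrange(x_resolution))  # x of the first end-point
        seen.add(rng.randrange(x_resolution))  # x of the second end-point
    return len(seen)

def expected_distinct(n_segments, x_resolution):
    """Expected number of distinct values among 2N uniform draws from X values."""
    miss = (1.0 - 1.0 / x_resolution) ** (2 * n_segments)
    return x_resolution * (1.0 - miss)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    for exponent in range(10, 16):
        n = 2 ** exponent
        for x in (1000, 5000):
            print(f"N=2^{exponent} X={x}: "
                  f"simulated={simulated_distinct(n, x, rng)} "
                  f"expected={expected_distinct(n, x):.0f}")
```

Under these assumptions, for $N = 2^{10}$ the closed form gives about 871 on the 1000-wide map and about 1681 on the 5000-wide map, close to the 857 and 1668 reported in Tables 1 and 2. Once $|I_x|$ is close to X, doubling N adds almost no new indexing numbers, which is consistent with the nearly unchanged node counts.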
This phenomenon remains evident as N grows very large. If the size of the underlying data space is fixed, we conclude that the overhead cost of the storage space used in maintaining the compressed $B_x^+$- and $B_y^+$-trees for future data is inversely proportional to the magnitude of N.

Table 1. On a 1000 × 1000 digital map, the structural information of compressed $B_x^+$-trees (compressed $B_y^+$-trees) with a node capacity $M = 20$

| Data size (N) | $\lvert I_x \rvert$ ($\lvert I_y \rvert$) | Number of nodes at various levels | Depth | $\bar{N}_b$ |
|---|---|---|---|---|
| $2^{10}$ | 857 | 1, 3, 47 | 3 | 2.4 |
| $2^{11}$ | 984 | 1, 3, 54 | 3 | 4.2 |
| $2^{12}$ | 999 | 1, 3, 55 | 3 | 8.2 |
| $2^{13}$ | 1000 | 1, 3, 55 | 3 | 16.5 |
| $2^{14}$ | 1000 | 1, 3, 55 | 3 | 32.8 |
| $2^{15}$ | 1000 | 1, 3, 55 | 3 | 65.5 |

Table 2. On a 5000 × 5000 digital map, the structural information of compressed $B_x^+$-trees (compressed $B_y^+$-trees) with a node capacity $M = 20$

| Data size (N) | $\lvert I_x \rvert$ ($\lvert I_y \rvert$) | Number of nodes at various levels | Depth | $\bar{N}_b$ |
|---|---|---|---|---|
| $2^{10}$ | 1668 | 1, 6, 91 | 3 | 1.2 |
| $2^{11}$ | 2782 | 1, 9, 152 | 3 | 1.5 |
| $2^{12}$ | 4043 | 1, 13, 220 | 3 | 2.0 |
| $2^{13}$ | 4818 | 1, 15, 262 | 3 | 3.5 |
| $2^{14}$ | 4984 | 1, 16, 272 | 3 | 6.6 |
| $2^{15}$ | 4999 | 1, 16, 272 | 3 | 13.1 |

For further comprehension of the entire indexing structures, the number of line segments preserved in each array was also measured. As mentioned in Section 4, $N_b$ denotes the number of data items preserved in each array after applying the difference encoding method. For all arrays referred to by the entries in all leaves of a compressed $B_x^+$-tree (or compressed $B_y^+$-tree), all $N_b$ values were averaged to obtain the measurement $\bar{N}_b$, as shown in the final column of Tables 1 and 2. The result $\bar{N}_b = 2.4$ in the case of $N = 2^{10}$ in Table 1 signifies that about 2 or 3 line segments differed from one array to its neighbor. The experimental results indicate that very little extra information is added to the arrays, even when the data size N grows significantly. Elementary probability demonstrates that $\bar{N}_b$ is approximately equal to 2N divided by $|I_x|$ (or $|I_y|$) when all end-points of line segments are evenly distributed over the 2D domain. That is

$$\bar{N}_b \approx \frac{2N}{|I_x|} \approx \frac{2N}{|I_y|},$$

where $|I_x| \le X$ and $|I_y| \le Y$ on an X × Y space.
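As a plausibility check, the estimate $2N/|I_x|$ can be compared with the averages reported in Tables 1 and 2. The sketch below only re-evaluates the formula on the published figures; the names `estimated_nb` and `largest_gap` are illustrative and not part of the original implementation.

```python
# (N, |I_x|, reported average N_b), transcribed from Tables 1 and 2.
TABLE_1 = [(2**10, 857, 2.4), (2**11, 984, 4.2), (2**12, 999, 8.2),
           (2**13, 1000, 16.5), (2**14, 1000, 32.8), (2**15, 1000, 65.5)]
TABLE_2 = [(2**10, 1668, 1.2), (2**11, 2782, 1.5), (2**12, 4043, 2.0),
           (2**13, 4818, 3.5), (2**14, 4984, 6.6), (2**15, 4999, 13.1)]

def estimated_nb(n_segments, index_count):
    """Estimate the average array size as 2N / |I_x|."""
    return 2 * n_segments / index_count

def largest_gap(rows):
    """Largest absolute difference between the estimate and the reported value."""
    return max(abs(estimated_nb(n, size) - reported)
               for n, size, reported in rows)

if __name__ == "__main__":
    for name, rows in (("Table 1", TABLE_1), ("Table 2", TABLE_2)):
        for n, size, reported in rows:
            print(f"{name}: N={n:5d} |I_x|={size:4d} "
                  f"estimate={estimated_nb(n, size):6.2f} reported={reported}")
        print(f"{name}: largest gap = {largest_gap(rows):.3f}")
```

For every row of both tables the estimate stays within about 0.12 of the reported average, which supports the approximation for randomly generated data.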

In sparse cases, $\bar{N}_b$ is normally between 1 and 2, since the magnitude of $|I_x|$ ($|I_y|$) is likely to be between N and 2N. However, if the data quantity N rises, or if dense cases are considered, then the magnitude of $|I_x|$ ($|I_y|$) approaches X (Y). Consequently, $\bar{N}_b$ tends toward $2N/X$ ($2N/Y$), which bounds the estimate from below because $|I_x| \le X$ and $|I_y| \le Y$. The tendency of $\bar{N}_b$, shown in Fig. 11, verifies the correctness of this analysis.

7. Application to GIS database

Several experiments were also performed on a GIS map to evaluate the performance of the proposed system and to demonstrate its improvement over a system based on the R-tree indexing scheme. One thousand line segments were chosen from the traffic network of a digital map of Taipei city in Taiwan, as shown in Fig. 12. The size of this map was 774 × 618, and all 1000 line segments were stored in data files so that both the proposed index method and the R-tree indexing scheme could use the same set of line segments for performance comparison. The application was written in MATLAB and C.

Fig. 11. The tendency of $\bar{N}_b$.


Fig. 12. A digital map with 774 × 618 pixels.

A personal computer with an Intel Pentium 1400 MHz processor was used, with the CPU executing at 1000 ticks per second. Table 3 shows the simulated results from the first experiment, where the total number of line segments (N) was 1000 and the maximum number of entries in a node (M) varied from 4 to 10. Since N and M were both small, the R-tree and the compressed $B_x^+$- and $B_y^+$-trees could be stored in main memory instead of on disk. The number of nodes generated at each level in all cases indicates that the compressed $B_x^+$-tree and the compressed $B_y^+$-tree were smaller than the R-tree. This is because an R-tree adopts 1000 MBRs to enclose 1000 segments, and needs to preserve exactly 1000 entries in its leaves. However, different segments may contribute the same indexing numbers to $I_x$ or $I_y$. Hence, the magnitudes of $|I_x|$ and $|I_y|$ are far below 1000 ($|I_y| = 450$ and $|I_x| = 465$), as shown in the second column of Table 3. The compressed trees had many fewer leaves than the R-tree. Our compressed $B_x^+$-tree and compressed $B_y^+$-tree were built in a similar manner at each level in all three cases. Notably, the tiny difference between the compressed $B_x^+$-tree and the compressed $B_y^+$-tree was caused by the different resolutions of X and Y.

Table 3. The number of nodes generated at each level

| Index structure | Data size | M = 4 | M = 6 | M = 10 |
|---|---|---|---|---|
| R-tree | 1000 | 1, 2, 5, 17, 53, 160, 475 | 1, 2, 6, 2, 98, 318 | 1, 5, 30, 192 |
| Compressed $B_x^+$-tree | $\lvert I_x \rvert = 465$ | 1, 3, 10, 36, 127 | 1, 3, 16, 85 | 1, 6, 51 |
| Compressed $B_y^+$-tree | $\lvert I_y \rvert = 450$ | 1, 3, 10, 35, 123 | 1, 3, 16, 82 | 1, 6, 49 |

Table 4 shows the overall storage space needed for an indexing structure in the cases of M = 4, 6 and 10. In an R-tree, each entry corresponding to an MBR in an external or internal node adopts four values to describe an indexed data item (a pair of x-coordinate values and a pair of y-coordinate values). Conversely, each entry in an external or internal node of the compressed $B_x^+$- and $B_y^+$-trees adopts only one value. The combined storage space of the compressed $B_x^+$- and $B_y^+$-trees was significantly smaller than that of the R-tree. The size ratio of the R-tree to the compressed $B_x^+$- and $B_y^+$-trees was about 8:1.

Table 4. Scalability (bytes)

| | M = 4 | M = 6 | M = 10 |
|---|---|---|---|
| R-tree | 11,408 | 10,728 | 9120 |
| Compressed $B_x^+$- and $B_y^+$-trees | 1396 | 1242 | 1140 |
| Size ratio | 8.2:1 | 8.6:1 | 8.0:1 |

To measure query performance, the number of nodes involved, the number of segments compared and the access time during the query process were evaluated. Tables 5 and 6 show the measurement results. The measurements were averaged after applying 100 point queries over the digital map. The R-tree involved more nodes than the compressed $B_x^+$- and $B_y^+$-trees for point queries, because searches over an R-tree follow long and multiple paths. In contrast, each point query in the compressed $B_x^+$- and $B_y^+$-trees has a short and unique search path. An upper MBR can enclose many lower MBRs, and a multiple-path search retrieves several target MBRs; thus an R-tree retrieves many more candidate segments than the compressed $B_x^+$- and $B_y^+$-trees. The compressed $B_x^+$- and $B_y^+$-trees not only take less time to finish their search, but also take much less time to compare the candidate segments with the query position. These two significant improvements make the proposed method better than an R-tree in search performance. As expected, in the case of $M = 4$ in Table 5, the R-tree took 966.6 ticks on average to complete a point query, while the compressed $B_x^+$- and $B_y^+$-trees took only 142 ticks.

Table 5. The performance of point queries

| Index scheme | Comparison items | M = 4 | M = 6 | M = 10 |
|---|---|---|---|---|
| R-tree | Average number of nodes involved | 65.3 | 34.3 | 23.5 |
| R-tree | Average number of segments compared | 78.4 | 57.6 | 43.2 |
| R-tree | Average access time (ticks) | 966.6 | 744.9 | 760.3 |
| Compressed $B_x^+$- and $B_y^+$-trees | Average number of nodes involved | 10 | 8 | 6 |
| Compressed $B_x^+$- and $B_y^+$-trees | Average number of segments compared | 10.5 | 11.5 | 10.8 |
| Compressed $B_x^+$- and $B_y^+$-trees | Average access time (ticks) | 142 | 166 | 193.2 |
| Compressed $B_x^+$- and $B_y^+$-trees | Time efficiency ratio | 6.8:1 | 4.5:1 | 3.9:1 |

Table 6 shows the final experimental results for 100 window queries. The size of each query window was 1/25 of the whole digital map. Both the average number of nodes involved and the average number of segments compared in the compressed $B_x^+$- and $B_y^+$-trees were smaller than those in the R-tree. Significantly, performing a window query on the compressed $B_x^+$- and $B_y^+$-trees requires the leaves between the two targets to be involved in the comparison. This is why window queries on the compressed $B_x^+$- and $B_y^+$-trees show a lower time efficiency than point queries.

Table 6. The performance of window queries

| Index scheme | Comparison items | M = 4 | M = 6 | M = 10 |
|---|---|---|---|---|
| R-tree | Average number of nodes involved | 135.8 | 77.6 | 50.0 |
| R-tree | Average number of segments compared | 168.5 | 192.6 | 351.2 |
| R-tree | Average access time (ticks) | 2032 | 1934.4 | 2654.8 |
| Compressed $B_x^+$- and $B_y^+$-trees | Average number of nodes involved | 57.8 | 40.5 | 25.8 |
| Compressed $B_x^+$- and $B_y^+$-trees | Average number of segments compared | 56.6 | 56.6 | 56.6 |
| Compressed $B_x^+$- and $B_y^+$-trees | Average access time (ticks) | 804.4 | 833.9 | 871.4 |
| Compressed $B_x^+$- and $B_y^+$-trees | Time efficiency ratio | 2.5:1 | 2.3:1 | 3:1 |
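The ratios quoted in Tables 4 to 6, and the point-query node counts predicted by $D_x + D_y$, can be recomputed directly from the published figures. The following sketch does only that arithmetic; the variable and function names are illustrative, and the values are transcribed from Tables 3 to 6 rather than measured anew.

```python
# Values transcribed from Tables 4-6; tuples are ordered as (M=4, M=6, M=10).
RTREE_BYTES = (11408, 10728, 9120)
COMPRESSED_BYTES = (1396, 1242, 1140)
RTREE_POINT_TICKS = (966.6, 744.9, 760.3)
COMPRESSED_POINT_TICKS = (142.0, 166.0, 193.2)
RTREE_WINDOW_TICKS = (2032.0, 1934.4, 2654.8)
COMPRESSED_WINDOW_TICKS = (804.4, 833.9, 871.4)

# Nodes per level from Table 3, keyed by node capacity M.
LEVELS_BX = {4: [1, 3, 10, 36, 127], 6: [1, 3, 16, 85], 10: [1, 6, 51]}
LEVELS_BY = {4: [1, 3, 10, 35, 123], 6: [1, 3, 16, 82], 10: [1, 6, 49]}

def ratios(baseline, proposed):
    """Return baseline / proposed per node capacity, rounded to one decimal."""
    return [round(b / p, 1) for b, p in zip(baseline, proposed)]

def point_query_nodes(levels_x, levels_y):
    """Nodes visited by a point query: one root-to-leaf path per tree (D_x + D_y)."""
    return {m: len(levels_x[m]) + len(levels_y[m]) for m in levels_x}

size_ratio = ratios(RTREE_BYTES, COMPRESSED_BYTES)
point_ratio = ratios(RTREE_POINT_TICKS, COMPRESSED_POINT_TICKS)
window_ratio = ratios(RTREE_WINDOW_TICKS, COMPRESSED_WINDOW_TICKS)
point_nodes = point_query_nodes(LEVELS_BX, LEVELS_BY)

if __name__ == "__main__":
    print("size ratios:", size_ratio)
    print("point-query time ratios:", point_ratio)
    print("window-query time ratios:", window_ratio)
    print("point-query nodes (D_x + D_y):", point_nodes)
```

The recomputed ratios agree with the rounded values in the tables, and the depths read from Table 3 give 10, 8 and 6 visited nodes for M = 4, 6 and 10, matching the point-query node counts in Table 5.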

8. Conclusion

Compressed $B^+$-trees, in collaboration with new indexing techniques, can efficiently arrange line segments into compressed hierarchical directories. The application of compressed $B^+$-trees is far more successful in dealing with line segments than traditional methods, since it provides the following advantages: (1) more accurate preservation of spatial information for line segments; (2) lower storage requirements for the indexing structure; (3) more efficient retrieval performance; and (4) more predictable retrieval overhead. Additionally, the proposed scheme maintains the indexing structures with a stable tree size, which is not significantly influenced by the size of the data.



Hung-Yi Lin has been an assistant professor in the Department of Logistics Engineering and Management at National Taichung Institute of Technology in Taichung, Taiwan, since February 2007. He received his Bachelor's degree from the Department of Applied Mathematics of National Chung Hsing University in 1992 and his Master's degree from the Department of Information Management of National Taiwan University of Science and Technology in 1994. In winter 2005, he earned his Ph.D. from the Department of Applied Mathematics of National Chung Hsing University. His current research topics include spatial data mining, spatial databases, mobile databases, and knowledge discovery in databases.