Information Systems Vol. 19, No. 6, pp. 447-465, 1994
Copyright © 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved.

Invited Project Review

ON INDEXING SPATIAL AND TEMPORAL DATA†

BETTY SALZBERG

College of Computer Science, Northeastern University, Boston MA 02115, USA

(Received in final form 8 July 1994)
1. INTRODUCTION

Large databases are stored on magnetic disks, optical jukeboxes and tape libraries. Indexing for such large collections must take into account the physical characteristics of the medium on which the data resides. This makes the criteria for efficiency different from those for indexing in main memory.

As an example, let us suppose that our data is stored on magnetic disks. Data is fetched from magnetic disk drives in pages of a minimum size (in 1994, minimum size pages are usually 4K). Each fetch of a page takes about 10 milliseconds on average on the fastest disk drives. This is the time it takes for the disk arm to move to the correct cylinder (seek time), for the disk to revolve until the head is over the correct place in the track (rotational latency), and for the data to be transferred to the main memory of the computer when the head moves over it. In addition, the CPU cost of performing a disk page I/O is usually over 1000 instructions. Therefore, when we design an index for a large database stored on magnetic disks, we have to be careful that the number of pages accessed for any operation is as small as possible.

For example, a simple linked list of data pages is not efficient for search. To find a given piece of data in a linked list of database pages, we would have to fetch each page on the list until we found the data we wanted. On average, this would be half the pages in the database, and each page might require a seek. A big improvement over following a linked list is using no structure at all. Then we can do sequential reading on the disk, which is more efficient than repeated seeks. However, we can do much better than this. A good index structure requires only one or two page accesses to find any data item in a very large database (after the top part of the index is loaded into memory).

In addition, a good index structure clusters within pages data that is used at the same time. If we can put all the records which satisfy a common query on the same page, we will only need to fetch that one page, instead of having to fetch a separate page for each record.

Also, disk space should be used wisely. Although magnetic disk space is less expensive than main memory space, it is possible to err spectacularly in placing data in pages. For example, imagine having records which represent customers. The key of a record is the first six digits of the customer's social security number. Suppose a page is reserved for each possible six-digit number. This is one million pages. Many of these numbers may not correspond to actual customers. The pages with numbers which do not correspond to customers will be empty.

These few observations, that clustering data within pages is important, that the number of disk accesses needed for search should be small, and that disk space should be used wisely, guide the design of good access methods for data stored on magnetic disks. For data organized by one key, there is an excellent example of an access method which satisfies most requirements for efficient indexing. This is the B+-tree [1].

†This work was partially supported by NSF grant IRI-93-03403.
Fig. 1. A B+-tree. The data is in the leaves. Search by finding the largest key smaller than or equal to the search value, or use the first address in the node if all keys in the node are larger than the search value.
J’
11I-
100 red berry
509
159 blue berry
1100
Key
1000 ellow Leans
red beans
1489 green celerv
2090 red pear
700 green beans
1078 purple grapes
1209 purple turnip
1590 green pear
2300 green guava
L-J
,
1300 orange carrot
t 200 yellow papaya 345 yellow squash
Fig. 2.
A B+-tree.
Here the new record “yellow papaya” with key 200 has been added to the B+-tree of Figure 1.
A B+-tree can be made wide and shallow. Each node is a page and each index node can have many hundreds of children. When used as a primary index, with all the data in the leaves, the B+-tree's large fan-out ensures that the index (the non-leaf pages) is small relative to the data, so most of the index pages can be kept in main memory. After the upper levels of the index are loaded in memory, finding one record among gigabytes of data usually takes only one or two disk accesses. The records are clustered in the order of the search key, so key range queries do not require many page accesses. A B+-tree is illustrated in Figure 1.

The B+-tree is also an efficient dynamic search structure. Insertion and deletion of records is simple and usually affects only one or two index pages. For example, in the most common case, to insert a record once we have found the leaf it belongs in, we simply write the record in the page. On average, pages remain about 69% full, so usually there is room in the leaf for the new record. In case the leaf is full, the leaf is split: half of the contents goes to a new leaf and half remains in the old leaf, and a new term is posted to the index. This is illustrated in Figure 2. After insertions or deletions, records remain clustered within pages by key.

Unfortunately, the B+-tree cannot be used efficiently for spatial or temporal data. Spatial data is best clustered in pages by nearness in multidimensional space. That is, spatial indexing maps k-dimensional space to the one-dimensional space of disk pages. Keeping "nearby" k-dimensional points clustered in disk pages while inserting data is a difficult problem. Temporal indexes have other requirements. If finding all records valid at a given time is a likely query, the records should be clustered on disk pages by time. If it is likely that key ranges as well as time ranges are to be queried, clustering by both time and key is advantageous. It is also desirable to separate the current data, which will be accessed more often, from the historical data.

Since 1986, I have been working with Dave Lomet on designing indexes for spatial and temporal data with the good properties of B+-trees: large fan-out, good disk space utilization, clustering tailored to the canonical queries, fast single-record search, and incremental insertion algorithms. We have assumed that our data is stored on magnetic or on optical disks, which have the same random access capabilities and similar costs for page retrieval. This is the story of our results in this area and the lessons we learned along the way.
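To make the insertion and leaf-split behaviour just described concrete, here is a minimal sketch in Python. This is my own illustration, not code from the paper: the tiny page capacity, the sentinel separator, and the single index level are simplifying assumptions.

```python
import bisect

PAGE_CAPACITY = 4   # assumed number of records per leaf page (tiny, for illustration)

def insert(index, key, record):
    """index: sorted list of (separator_key, leaf) pairs; each leaf is a sorted
    list of (key, record) pairs. Find the leaf, write the record, and split the
    leaf in half if it overflows, posting a new term to the index."""
    separators = [sep for sep, _ in index]
    # rightmost separator <= key (the sentinel -infinity guarantees one exists)
    pos = bisect.bisect_right(separators, key) - 1
    leaf = index[pos][1]
    bisect.insort(leaf, (key, record))
    if len(leaf) > PAGE_CAPACITY:
        mid = len(leaf) // 2
        new_leaf = leaf[mid:]                                # half the contents go to a new leaf
        del leaf[mid:]                                       # half remain in the old leaf
        bisect.insort(index, (new_leaf[0][0], new_leaf))     # post a new index term

# usage: a single empty leaf reachable under a sentinel separator
index = [(float("-inf"), [])]
for k, r in [(509, "green apple"), (700, "green beans"), (100, "red berry"),
             (1100, "red beans"), (200, "yellow papaya")]:
    insert(index, k, r)
```

The same splitting discipline, applied recursively to index nodes as well, is what keeps a real B+-tree wide, shallow and at least half full.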
2. THE BEGINNINGS

In 1982, I was one of a half dozen tenured mathematics professors at Northeastern University who were teaching hundreds of students how to program in Pascal and in VAX/VMS assembly language. We were invited to join the new College of Computer Science and offered a raise in pay. We were asked to teach undergraduates subjects we had never learned ourselves (up to six different subjects each year), to develop a master's degree program, to represent our College on numerous University committees, to hire four or five new faculty each year, and to begin doing research in some area of computer science (the less mathematical the better).

After a year of this, I took a leave of absence and worked in a software company producing a CASE tool. I worked in the database group of the company. Unfortunately, no one in the company knew anything about database management systems, and it was one of the courses I had not taught. After looking at several textbooks, I picked Ullman [21] as the one that made the most sense. But my coworkers could not read it (too much math). So I started writing something they could read. When I returned to Northeastern the next year, I had a draft of a textbook on databases [16]. The research I did for that text led me to look at normalization, concurrency and recovery, and indexing as possible research topics. I finished the text and started to write a few research papers.

Meanwhile, my colleague John Casey was teaching the file structures course from the text by Wiederhold [22]. Wiederhold had exactly the right idea about the subject: measuring indexing methods by seek time, rotational latency and data transfer time. But it was a little too advanced for our undergraduates. Casey suggested writing a more elementary but still very analytical text on file structures. By 1986, I was half-way through writing a new file structures textbook [17] and I had been looking at papers on concurrency in B+-trees and also on multiattribute access methods. I had
attempted to publish some results in this area with varying degrees of success. One of these attempts was a modification of a result of Robinson [15], the k-d-B-tree or k-dimensional B-tree. The k-d-B-tree is a spatial index. Each page in the index represents a rectangle in two-dimensional space (or a brick in k-dimensional space). In the k-d-B-tree, splitting an index page sometimes requires splitting a number of its children, which might require splitting their children, and so forth. Insertion of a new record could involve many pages of the index.

I suggested (in a submission to the 1986 SIGMOD conference) a variation which made the splitting simpler, but left the space utilization without any guarantees. The k-d-B-tree split and my split are illustrated in Figure 3. My idea was just to split at the first split line, which never causes any splits at lower levels. It is also very easy to calculate, as you can keep track of the splits using a binary tree. Splitting at the root of this tree corresponds to using the first split line. The resulting binary trees are the two subtrees just below the root. The subtrees you would need for the k-d-B-tree split are more complicated, requiring suppressing the node corresponding to the split line and duplicating some of the other nodes, as shown in Figure 3c.
Fig. 3. The split I suggested, using the first split line (root of the tree), and the k-d-B-tree split: (a) original index page; (b) my split; (c) the k-d-B-tree split.
The paper was rejected, but Dave Lomet, who was on the 1986 SIGMOD program committee, and who had recently given a colloquium talk at Northeastern, liked the basic ideas I had described. He had some suggestions on how to change it to make it both simple and efficient. Dave had a Ph.D. in Computer Science from the University of Pennsylvania, with a thesis in parsing programming languages. His research had been in machine architecture, programming languages, concurrency control and access methods. He had worked for many years at IBM Yorktown Heights and had just moved to the Wang Institute of Graduate Studies in Tyngsboro, Massachusetts, a little north of Boston.
Dave and I started to work together on improving my modification of the k-d-B-tree. It grew into a structure which we called the holey brick tree (hB-tree), because the nodes represented bricks with holes in them.
Fig. 4. A holey brick split: (a) original index page; (b) holey brick split. Both the full path (from the root of the original tree to the root of the extracted tree) and the condensed path are shown. Only the condensed path is needed to accurately describe the split. The "ext" marker at the bottom right shows where the subtree was extracted to create a sibling.
3. THE HOLEY BRICK TREE
I had used a binary tree in each page of my structure to describe the decomposition on the next level. It was a k-d-tree [3]. Each node had an attribute value and an indicator of which attribute it was. Search for a point involved comparing its coordinates with the appropriate ones in the k-d-tree. This is illustrated in Figure 3 in the trees on the right-hand side of the diagram.

Many structures proposed for multiattribute search (for example, the R-tree family [7]) use unordered collections of boundaries to describe the next lower level. This is generally slower than a binary tree, since the search is linear. It also takes up more room. The space needed for boundaries of rectangular bricks increases linearly with the dimension of the bricks. The space needed for k-d-tree nodes does not depend on k.

The trouble with my k-d-tree algorithm was that it always split at the root. If the binary tree was skewed (as in the example in Figure 3), index page utilization could not be guaranteed for the resulting pages. One page might get only a small number of the binary tree nodes and the other page might get a very large number. We decided to allow splits at other parts of the tree. We saw that we could always split off a subtree with between one third
and two thirds of the k-d-tree nodes (more accurately, between floor(n/3) and ceiling(2n/3) nodes, where n is the total number of nodes before the split). This implies that index page utilization is always good. Figure 4 shows an example where four nodes are split from a tree with ten nodes. I will be using this example several times.

The pages no longer corresponded to simple rectangles or bricks in k-dimensional space. Now they were bricks with holes. Instead of posting the one value in the root to describe a simple split by hyperplane, we had to post something that described the hole. This turned out to be part of the path from the root of the original tree to the root of the subtree that was split off. Figure 4 shows what the full path is, and how to post only a subset of this path and still correctly describe the split. In general, we showed that we could describe the hole represented by the extracted k-d-tree by using, in the worst case, only the least upper bound and greatest lower bound for each attribute. Sometimes one additional node had to be posted to "glue" the new information in the right place in the parent's k-d-tree. We called the minimal set of nodes which had to be posted for a split the condensed path. The condensed path in Figure 4 consists of only two k-d-tree nodes.

3.1. k-d-Tree Boundaries

We submitted the paper on the hB-tree to Transactions on Database Systems in the summer of 1987. At this time, I began a sabbatical year in Iowa and California, and Dave began work at Digital Equipment Corporation. I was still working on the final draft of my File Structures textbook [17]. Dave was working full time learning the internals of the DEC relational DBMS, Rdb. We temporarily stopped work on spatial access methods.

In retrospect, we should have spent more time programming and testing the hB-tree. I had had one of our best undergraduate students at Northeastern, Dave Flaherty, program it for a reading course one term (ten weeks). He found no particular difficulties, but he also did not have time to complete the work to cover all cases.

The academic year 1987-88 (while I was on sabbatical) was the first year of the Ph.D. program in the College of Computer Science. I did not realize how having Ph.D. students can improve and extend your research. A couple of years after my sabbatical, Dave and I began work with Georgios Evangelidis, my first computer science Ph.D. student. Georgios was to modify the hB-tree to add concurrency and recovery and node deletion. When Georgios started to work on these modifications, we found an error in the original hB-tree paper: the "k-d-tree boundaries" problem.

The k-d-tree boundaries problem arose from the fact that boundary lines could appear at higher levels of the hB-tree which were not boundary lines at lower levels. This is illustrated in Figure 5, which is the example Georgios found. The theorems we proved in the paper (that you can always split a data or index node with at worst a one-to-two ratio, and that the k-d-tree nodes we specified on the path would describe the hole correctly) are true. We ran into trouble when part of the description of a hole was in one parent and another part was in a sibling parent. Then it is unclear how to post information about new splits and still have correct search. The description on the grandparent level may not reflect boundaries at lower levels, as Figure 5 shows.

We found two solutions to this problem. One option is to post the full path from the extracted tree to the root of the original k-d-tree.
Then, when there is a later split, new k-d-tree nodes are only posted at the leaves of trees which have already been posted. Often, only some of the labels are changed and no new nodes are posted. This is illustrated in Figure 6.

The second option is to restrict the splits of index nodes to places where the k-d-tree nodes describing one split on a lower level are not separated. This is easy to support once we label each posted node with the address of the child from which it is copied. Then we cannot split a parent if two k-d-tree nodes with the same address would be placed in different pages. We say that a k-d-tree node is decorated with the child address, and only the highest such k-d-tree node needs to keep the decoration. Thus we only allow splits at decorations. Then each child has exactly one parent and index term postings are simple. This is illustrated in Figure 7.

Either of these options means that boundaries seen at higher levels of the hB-tree exist at all lower levels of the hB-tree. But splitting at decorations means you can no longer guarantee good index space utilization.
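The one-third/two-thirds split itself can be sketched as a simple walk down the intra-node k-d-tree. The following is an illustrative reconstruction under my own assumptions about node layout, not the implementation from the paper:

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class KDNode:
    attribute: int                     # which attribute this node discriminates on
    value: float                       # split value for that attribute
    left: Optional["KDNode"] = None    # subtree with coordinate <  value
    right: Optional["KDNode"] = None   # subtree with coordinate >= value

def size(node: Optional[KDNode]) -> int:
    # recomputing sizes is quadratic, but it keeps the sketch short
    return 0 if node is None else 1 + size(node.left) + size(node.right)

def choose_split_subtree(root: KDNode) -> KDNode:
    """Walk from the root toward the larger child until the current subtree
    holds at most ceiling(2n/3) nodes; by then it still holds roughly
    floor(n/3), since each step at most halves the remaining subtree."""
    n = size(root)
    upper = math.ceil(2 * n / 3)
    current = root
    while size(current) > upper:
        current = (current.left
                   if size(current.left) >= size(current.right)
                   else current.right)
    return current                     # root of the subtree to extract into a sibling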
Fig. 5. The k-d-tree boundaries problem. We used the condensed path when we split B from A, and then we made a split at the parent level which put part of the condensed path in one parent and part in another. The boundaries seen at the grandparent level do not match boundaries at the data level. When we split C off from A, we must post the "x=4" k-d-tree node at the parent level. Where do we post it?
Posting the full path means you cannot guarantee an upper limit of 2k + 1 k-d-tree nodes posted at each split. However, both the space guarantees and the posting guarantee (a maximum of k k-d-tree nodes posted in this case) hold when making splits at the leaf level. Since the leaf level (where the data is) tends to comprise around 99% of the total space taken by the structure, it is not surprising that the total space used and the data page utilization are still very good with the new variations of the hB-tree. It is perhaps surprising that the index page utilization is also very good (see Figure 8).

3.2. The hBΠ-tree
Science often consists of two steps forward and one step backward; note that this still moves you in a positive direction. The discovery of the k-d-tree boundaries problem was disappointing to us, but it led us to a better understanding of the ways of creating spatial indexes. The new variations are clearly simpler than the original. In addition, all experiments run so far, using computer-generated data which is skewed, data from the Berkeley Sequoia project [19], as well as data which is uniformly distributed, show the modified algorithm to have excellent performance. This is as well as other methods have done and better than most. In particular, the experiments show that the modified hB-tree is insensitive to dimension (see Figure 8). Often, only one k-d-tree node with one value of one attribute is needed to describe a split.
Fig. 6. Solving the k-d-tree boundaries problem with full path posting. This time, when B was split from A, the full path was posted to the parent level. The parent level can then split anywhere and no boundaries appear which are not boundaries at the data level. To split C from A later, one label is changed from A to C in P and one from A to C in Q.
Fig. 7. Solving the k-d-tree boundaries problem by splitting only at decorations. Here the condensed path was posted when B split from A. But at the parent level, a split was not allowed in the middle of the condensed path that was posted. Decorations (marked with italic upper case letters) indicate the beginning of a posted condensed path. When you split "at decorations," each child has only one parent. Then it is clear where the next split information is posted.
Fig. 8. Data and index page utilization and complete size of structure. For pages of size 2K bytes and larger, data and index page utilization and the size of the index do not vary significantly as the number of dimensions increases. For smaller page sizes, index page utilization decreases because the page header contains the boundaries of the space when the page was first created. Computer-generated highly-skewed data was used. In all cases, the same 150,000 24-byte records were inserted. Full path posting and splitting at decorations was used.
The size of the tree does not grow significantly as the number of dimensions indexed increases. This is unlike the R-tree [7], for example, where all k boundaries must be kept for each child of an index node. In spite of lacking worst-case guarantees on space utilization above the leaves, the figures are about the same as they are for the B+-tree. B+-tree nodes split evenly, giving a worst-case utilization of 50% and an average of about 69%. Our average space utilization is also about 69%. This is true of the Sequoia data (Figure 9) as well as the computer-generated data (Figure 8).
Fig. 9. Node space utilization (NSU) for the Sequoia 2000 Storage Benchmark point data: 62,584 points representing California place names, record format (int, int, var-string), average record size 29 bytes. Two splitting/posting strategies are shown: "A/fp" (split anywhere and use full path posting) and "D/fp" (split at decorations and use full path posting). Data page splits are the same for all strategies. The figure also tabulates, for node sizes from 512 to 4096 bytes, the number of tree levels and the number of index and data nodes. Slightly lower index page utilization for 4K page sizes is due to the small number of index nodes. (Here the root node has only five children, hence is sparsely populated and also accounts for one sixth of all index nodes.)
Figure 10 shows range search performance on Sequoia point data. Range search performance depends on how the query region intersects data page boundaries. We made 104 range searches for each of the four page sizes. We varied the selectivity (the proportion of the total number of records in the query window) in the different range searches by varying the size of the query window. For each range search, the query window is rectangular and its center is a randomly chosen existing point. The extent of the window for each attribute is a random ratio of the domain range for that attribute.

Note that when the query selectivity is approximately equal to the average number of records in a data node, 25% of the records retrieved satisfy the query. This is as expected, because in this case the query window is likely to overlap on average four data page boundaries. When query selectivity was at least one percent, the proportion of retrieved records which satisfied the query was quite good. This shows that the new variations of the hB-tree perform well for the canonical spatial queries.

Further, we have been able to adapt the concurrency, recovery and deletion algorithms Dave and I developed for index trees in [12]. Since the tree in [12] is called a Π-tree, the new variation, a combination of the hB-tree and the Π-tree, is called the hBΠ-tree [6]. It solves the problem of the k-d-tree boundaries, is simpler to understand, and supports concurrency, recovery and deletion.
Fig. 10. Range search performance in terms of the ratio of retrieved points that satisfy the query over the total number of retrieved points per range search, under various node sizes (0.5, 1, 2, and 4 Kbytes) and query selectivities. (A marker in the figure indicates the selectivity corresponding to a number of records equal to the average number of records that a data node can hold.)
4. TEMPORAL INDEXES
Before Georgios started working with Dave and me, when we thought we were done with the hB-tree, Dave and I started a project on temporal indexing. The properties desired for a good temporal index include most of those any good index would have. The use of disk space should be efficient. One should be able to answer canonical queries quickly. The index should be much smaller than the data collection. Records which are likely to be answers to the same query should be clustered in disk pages. The canonical temporal queries are:

1. The time-slice: find all record versions valid at a given time t,
2. The key-range time slice: find all records in a given key range which are valid at a given time t,
3. The exact-match: find the version of the record with key k at time t, and
4. Past versions: find the past versions of a given record.

Temporal data is modeled with collections of line segments. These segments have a begin time, an end time and a time-invariant key. This is illustrated in Figure 11.
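To pin down the semantics of these four queries, here is a small sketch over a plain list of versions. This is my own illustration; the Version layout, with a None end time marking a still-current version, is an assumption, and a real temporal index exists precisely to avoid this kind of full scan.

```python
from typing import NamedTuple, Optional

class Version(NamedTuple):
    key: str
    begin: int              # time the version became valid
    end: Optional[int]      # None means the version is still current

def alive_at(v: Version, t: int) -> bool:
    return v.begin <= t and (v.end is None or t < v.end)

def time_slice(versions, t):                          # query 1
    return [v for v in versions if alive_at(v, t)]

def key_range_time_slice(versions, low, high, t):     # query 2
    return [v for v in versions if low <= v.key <= high and alive_at(v, t)]

def exact_match(versions, key, t):                    # query 3
    return next((v for v in versions if v.key == key and alive_at(v, t)), None)

def past_versions(versions, key):                     # query 4
    return sorted((v for v in versions if v.key == key), key=lambda v: v.begin)
```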
Fig. 11. A rectangle in time-key space with line segments representing distinct record versions. When the time scale begins for this rectangle, records with keys b, c, f and h are alive. At time instant 5, a new version of the record with key c is created. At instant 6, a record with key g is created. At instant 7, a new version of the record with key b is created. At instant 8, both c and f have new versions and record h is deleted. At instant 9, a new version of record g is created. At instant 10, a record with key e is created.
Fig. 12. Entering by begin or by end time: (a) entering versions in pages by start time; (b) entering versions in pages by end time. When record versions are placed in a page as they are created (until the page is full), as in (a), queries as of more recent times must visit most pages in the database, and in many pages only a few records will be valid as of the search time. On the other hand, if records are entered into pages in a historical database when they "die", as in (b), queries as of past times will have to visit most pages in the database and will hit only small numbers of versions valid at the search time in each page.
In Figure 11, one begins with several records with different keys. As time moves forward, new records with different keys are inserted, new versions of already existing records are created by update operations, and records are deleted. However, as no history is erased, a deletion is "logical." No already created versions are erased from the collection. A version which is no longer valid has an end time before the current time.

The challenge is to cluster this data in disk pages for efficient retrieval of answers to temporal queries. There are many stumbling blocks to achieving this goal.

First, if record versions are entered into pages as they are created (by begin time only), some of the versions will last a long time and others will end sooner, either because the record is deleted
or because an update causes a new version to be valid. This is illustrated in Figure 12(a). Note that queries as of the time just after the page is allocated are likely to be efficient, picking up many valid record versions. Queries as of later times might be less efficient if a large number of versions are no longer valid. (One mitigating factor is that most records in most collections are not deleted or updated. Unfortunately, this implies that for a current query, almost all pages of the database will have to be visited, since they are likely to contain at least one current record.)

On the other hand, suppose, as in Postgres [18], current versions are kept on one disk and versions which are no longer current are moved to another disk as they "die" (are updated or deleted). This causes the other extreme: records are entered into pages of the historical collection by their end times. Pages can contain records with widely varying begin times, as in Figure 12(b). In this case, search for records valid in the past may not be efficient. Also, all past-time searches must scan the current database, as there may be many long-lived records which are still current.

Using end time or begin time to place records in pages, and using the boundaries of the time-key rectangle for search, means that search involves overlapping rectangles. Even if the exact time and key is known, many pages may have to be visited. In fact, keeping record versions corresponding to line segments in pages with rectangular time-key boundaries and with a fixed capacity (as disk pages have) implies that there will be overlapping. Suppose there is one very long-lived record version and one other record with a distinct key in a page which covers the key range defined by the two keys and the very long time range. Then all other record versions which might be added in that time and key range might not fit in the page. When the page fills up, new versions in the time and key range must go to another page. The two pages will overlap. This is illustrated in Figure 13.
Fig. 13. Placing record versions in pages which define time-key boundaries requires overlapping if only one copy of each version is kept: (a) two record versions in a page which can hold ten record versions; (b) the eleventh version in this time-key rectangle does not fit. After two records with distinct keys are entered, all versions of the second record which overlap the time interval of the first record either belong to this page or to an overlapping page.
Two ways to avoid overlapping pages when dealing with data which has some extent, such as the line segments we envision here, are to do some controlled replication (sometimes called "clipping") or to map segments into higher-dimensional points (begin time, end time and key). Dave and I chose clipping. First, I shall explain what we did, and then I shall discuss mapping as a solution.

4.1. The WOBT

We started with a paper by Easton [4]. Easton had invented an access method to index historical records which resided entirely on write-once optical disks (WORMs). He called his method the Write-Once B-tree or WOBT. On WORMs, one must burn into the disk an entire page with a checksum (the error rate is high, so a very long error-correcting code must be appended to each page). Thus, once a page is written, it cannot be updated. With these constraints, Easton made a modification of the B-tree. Each index entry was an entire page. Each new record version was an entire page. Nodes were
collections of pages, for example a track on a disk. When a track filled up, it could be split by time or split first by time and then by key. The old node was left in place. (There is no other choice.) The record versions valid at the current time were copied to a new node. There would be space for new versions in the new node. If there were many current versions, the current data could be copied into two new nodes and separated by key. This design enabled clustering of the records in nodes by time and key. But it wasted a lot of space.

The basic properties of WORM disks include random access, that is, the ability to read or write any block on the disk. There is also a disk arm, this time with a laser, and the disk rotates, so one has to account for seek time and rotational latency. Several tracks can be read without moving the arm. The only really different feature is the fact that once written, a WORM disk page cannot be written again. Thus any method suggested for WORM disks can also be used on magnetic disks. This is true of the WOBT. If the WOBT is implemented on a magnetic disk, space utilization is immediately improved, as it is not necessary to use an entire page for one entry. Pages can be updated.

But the WOBT used on a magnetic disk still wastes space. The WOBT always makes a time split before making a key split. This creates one historical page and two current pages where previously there was only one current page. A B+-tree split creates two current pages where there was only one. No historical pages are created. Dave's essential new idea was to keep the current data in an erasable medium such as a magnetic disk and migrate the data to another disk (which could be magnetic or optical) when a time split was made. Also, he suggested that pure key splits be made, just as in the B+-tree. This partitioned the data in nodes by time and key as the WOBT did, but it was much more space efficient. It also separated the current records from most of the historical records. Since current search is more likely, it is advantageous to have current data grouped together on a fast expensive disk. Historical data can be on a slower cheaper disk. We called the new index the Time-Split B-tree or TSB-tree.
4.2. The TSB-tree

Like the WOBT, the TSB-tree pages partition time and key space. That is, every point in time-key space is covered by exactly one page. This clustering by time and key within pages is necessary for efficient response to the canonical temporal queries.
4.2.1. Data node splits

When a data page is full and there are fewer than some threshold number of distinct keys, the TSB-tree will split the page by time only. This is the same as what the WOBT did, except that now times other than the current time can be chosen. For example, the split time for a data page could be the "time of last update," after which there were only insertions of records with new keys and no updates creating new versions of already existing records. The new insertions, after the time chosen for the split, need not have copies in the historical node. A time split by current time and by time of last update is illustrated in Figure 14. Time splitting, whether by current time or by time of last update, enables a graceful migration of older versions of records to a separate historical database.

Full data pages with a large number of distinct keys are split by key only in the TSB-tree. The WOBT splits first by time and then by key. Key splits are illustrated in Figure 15. A sketch of this split policy is given below.

Recently, I have become convinced by the work of several authors [2, 8, 20] that it is probably desirable to do a time split some of the time before doing a key split. Suppose you have a database where most of the changes are insertions of records with a new key. As time goes by, in the TSB-tree, only key splits are made. After a while, queries as of a past time will become inefficient. Every time-slice query will have to visit every node of the TSB-tree, since they are all current nodes. Queries as of now, or of recent times, will be efficient, since every node will have many live records. But queries as of the distant past will be inefficient, since many of the current nodes will not contain records which were valid at that distant past time. This is illustrated in Figure 15(b).
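Here is my reading of the basic data-node split policy as a sketch, not the paper's code. The Version record is as in the earlier sketch, and KEY_SPLIT_THRESHOLD is an assumed tuning parameter: the minimum number of distinct keys that makes a key split worthwhile.

```python
from typing import NamedTuple, Optional

class Version(NamedTuple):
    key: str
    begin: int
    end: Optional[int]        # None = still current

KEY_SPLIT_THRESHOLD = 8       # assumed: minimum number of distinct keys for a key split

def time_split(page, t):
    """Versions that existed before t go to the historical node (versions still
    alive at t are copied into both nodes); versions inserted at or after t stay
    only in the new current node."""
    historical = [v for v in page if v.begin < t]
    current = [v for v in page if (v.end is None or v.end > t) or v.begin >= t]
    return historical, current

def key_split(page):
    """A pure key split, as in the B+-tree: no historical node, no copies."""
    keys = sorted({v.key for v in page})
    mid = keys[len(keys) // 2]
    return ([v for v in page if v.key < mid],
            [v for v in page if v.key >= mid])

def split_full_data_node(page, now):
    if len({v.key for v in page}) < KEY_SPLIT_THRESHOLD:
        # mostly updates: migrate old versions with a time split. Splitting at
        # `now` is the WOBT choice; the TSB-tree may instead pick an earlier
        # time, such as the time of last update, so later insertions are not copied.
        return time_split(page, now)
    return key_split(page)
```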
Fig. 14. Time splits by current time and by time of last update: (a) the WOBT splits at current time, copying current records into a new node; (b) the TSB-tree can choose other times to split. The page in Figure 11 is split in (a) by current time, as done in the WOBT. In (b), it is split by time of last update, so that new insertions need not have copies in the old node.
Fig. 15. Time-and-key splits and pure key splits: (a) the WOBT splits data nodes first by time, then sometimes also by key; (b) the TSB-tree can split by key alone. The WOBT must split by time first since it cannot alter the old node. Then it copies the current versions into two new nodes. The TSB-tree can make a pure key split since it is not on a write-once medium. This avoids making extra copies of record versions. However, note in this example that the new nodes after the pure key split have only two record versions alive at the beginning of their time range.
The methods in [2, 8] solve this problem by always doing time splits before key splits, as in the WOBT. What they have pointed out is that this can be used to guarantee that, for any query time contained in the time interval spanned by a page, a minimum number of versions in the page will be alive at that time. As in the WOBT, however, there is more replication; disk space usage is not as good as in the TSB-tree.

The access method in [2], "Optimal Multiversion Access Structures," can be looked at in two ways. First, it is simply the WOBT with merging of sparse nodes supported. Second, it is the method in [8], "Fully Persistent B+-trees," restricted to the case where versions do not branch, and treating timestamps as version numbers.

Both [2, 8] treat deletion of records and merging of nodes. The WOBT and the TSB-tree merely post deletion markers and do not merge sparse nodes. If no merging of current nodes is done, and there are many deletions of records, a current node may contain few current records. This could make current search slower than it should be. However, the solution in [2, 8] makes time splits of two sibling current nodes and then merges them to create a current node (which may have to be key split). Two new historical nodes are created. Many extra copies of records are made.

The "snapshot index" in [20] was designed for optimizing the time-slice query and does not paginate with respect to key range. Thus when records are deleted, it is not necessary to do node merging. Since this complication is avoided, the snapshot index is able to support record deletion by making time splits of sparse nodes. Its disk space utilization should be comparable to the TSB-tree, but it cannot efficiently support key range queries.

Probably the best variation of the WOBT is to use some parameters to decide whether to time split, time-and-key split, key split, time-split and merge, or time-split, merge and key-split. These parameters will depend on the minimum number of versions alive at any time in the interval spanned by the page. All of the policies pit disk space usage against query time. A pure key split creates one new page. A time-and-key split creates two new pages: one new historical page and one new current page. The historical page will have copies of the current records, so more copies are made than when pure key splits are allowed.
4.2.2. Index Node Splits
Index nodes in the TSB-tree are treated differently from data nodes. The children of index nodes are rectangles in time-key space. So making a time split or a key split of an index node may cause a lower-level node to be referred to by two parents. In the TSB-tree, index pages which refer to current pages can only be split at times which allow the historical index page to refer only to historical nodes. This way, current nodes (the only ones where insertions and updates occur) have only one parent. Similarly, if a key split is made using a key boundary of a current node, only historical nodes have more than one parent. This is illustrated in Figure 16.

This way, unlike the WOBT (or [2, 8]), the TSB-tree can move the old node to another location in a separate historical database. No node which might be split in the future has more than one parent. If it does a time split, the new address of the data from the old node can be placed in its unique parent and the old address can be used for the new current data. If it does a key split, the new key range for the old page can be posted along with the new key range and address.

There is an interesting analogy with the k-d-B-tree illustrated in Figure 3c. The k-d-B-tree forced lower-level nodes to split when upper-level splits crossed their boundary lines. This was so that splits caused by insertions on lower levels needed only to be posted to one parent. With the TSB-tree, this is not needed, as the lower-level nodes which are crossed by a split line are always historical nodes and historical nodes are never split. You can use k-d-trees within TSB-tree index nodes just as in the k-d-B-tree or in the hB-tree (although we did not do so in [10, 11, 13, 14]).
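The two index-node split constraints can be stated compactly in code. This is an illustrative sketch only; the ChildEntry layout is my assumption, not a structure from the paper.

```python
from typing import NamedTuple, Optional

class ChildEntry(NamedTuple):
    low_key: str             # low key boundary of the child's time-key rectangle
    begin: int               # start of the child's time range
    end: Optional[int]       # None = current child

def time_split_upper_bound(children) -> int:
    """An index-node time split may use any time before the begin time of the
    oldest current child, so that the historical index page refers only to
    historical nodes and every current node keeps a single parent."""
    return min(c.begin for c in children if c.end is None)

def legal_key_split_boundaries(children):
    """A key split may only use a key boundary of a current child, so that only
    historical children can end up with two parents."""
    return sorted({c.low_key for c in children if c.end is None})
```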
Fig. 16. TSB-tree index node splits: (a) original index page; (b) split by begin time of oldest current child; (c) split by a current key boundary. A time split can be done at any time before the begin time of the oldest current child. A key split can be done at any current key boundary. Node merging is not done in the TSB-tree, so key divisions get more refined as time goes forward. Note the similarity with the k-d-B-tree in Figure 3.
4.2.3. Copies of Record Versions

We did program the TSB-tree. Dave took a preliminary version written by an M.S. student at Northeastern, Madhav Anand, and rewrote it, running extensive experiments. I programmed an iterative matrix eigenvalue calculation needed to provide results for our Markov chain model. The results of the experiments and of the analysis showed that the TSB-tree had on average less than two copies of each record, no matter what the mix of updates versus insertions. So the space needed was about twice what would be needed for a method which has only one copy of each version of a record.

Methods which keep only one copy of temporal items (as we saw in Figure 13) run into the "long-lived record" problem. You cannot easily partition the time-key space into non-overlapping rectangles as the TSB-tree does. Once you allow a key range containing more than one key, the number of versions of records in that key range other than the long-lived one may not fit into one disk page. Thus, you are reduced to overlapping time-key rectangles or to mapping the versions of records to points in the three-dimensional space (begin time, end time, and key). Overlapping rectangles lead to backtracking algorithms. This backtracking will be especially inefficient in the presence of significant numbers of long-lived records: each time-slice query will have to visit a page containing a long-lived record even if that record is the only one alive in the key range of the page during the query time. Let us look briefly at mapping as an option for temporal indexing.
4.3. Multiattribute point search structures
Suppose that the begin time, end time and database key are used as a triplet key for a multiattribute point structure. If this structure clusters records in disk pages by closeness in several attributes, one can obtain efficient time-slice queries and efficient time-key range queries, using only one copy of each record. Records with similar values of begin time, end time and key will be clustered together in disk pages. Having both a similar begin time and a similar end time means that long-lived records will be in the same page as other long-lived records. These records will be answers to many time-slice queries. Short-lived records will only be on the same pages if their short lives are close in time. These pages will contain many correct answers to time-slice queries with time values in the short interval their entries span. Every time-slice query will access some of the long-lived record pages and a small proportion of the short-lived record pages. Individual time-slice queries will not need to access most of the short-lived record pages, as they will not intersect the time slice.

There are some subtle problems with this. Suppose a data page is split by begin time. In one of the pages resulting from the split, all the record versions whose begin time is before the split time are stored. This page has an upper bound on begin time, implying that no new record versions can be inserted. All new record versions will have a begin time after "now," which is certainly after the split time. Further, if there are current records in this page, their end time will continue to rise, so the lengths of the time spans of records in this page will be variable. Some will be long and others short. Queries as of current time may only retrieve a few (or no) records from a page which has been limited by an upper bound on begin time. This is somewhat like the situation in Figure 12(a). Also, when a new version is created, its begin time is often far from the begin time of its predecessor (the previous version with the same key). So consecutive versions of the same record are unlikely to be on the same page if begin-time splits are used.

Now suppose we decide that splitting by begin time is a bad idea and we split only by key or by end time. We get the same problem illustrated in Figure 12(b). A query as of a past time may only retrieve a small number of records if the records are placed only by the requirement of having an end time before some cut-off value. Current pages (which have been split by key) can contain versions whose lifetimes are very long and versions whose lifetimes are very short. This also makes past-time queries inefficient.

All of these subtle problems come from the fact that many records are still current and have growing lifetimes, and all new record versions have increasing begin times. Perhaps if we use a point-based multiattribute index like the hB-tree for dead versions only, efficient clustering may be possible. Here newly dead record versions can be inserted in a page with an upper limit on begin time, because the begin times were long ago. Items can be clustered in pages by key, nearby begin times and nearby end times. No guarantees can be made that a query as of a given time will hit a minimum number of record versions in a page, however. For example, imagine a page with record versions with very short lifetimes, all of which are close by, but none of which overlap.
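For concreteness, here is a minimal sketch of the mapping itself. It is my own illustration, not a design from the paper; in particular, how a still-growing end time is represented is one of the problems discussed above, and the sketch simply uses +infinity.

```python
import math
from typing import NamedTuple, Optional

class Version(NamedTuple):
    key: str
    begin: int
    end: Optional[int]       # None = still current

def to_point(v: Version):
    """Map a version to a point in (begin time, end time, key) space."""
    return (v.begin, v.end if v.end is not None else math.inf, v.key)

def in_time_slice(point, t: int) -> bool:
    """A time-slice query at t becomes the unbounded box begin <= t < end over
    any key; a point structure such as the hB-tree would be searched with this
    region rather than testing points one by one."""
    begin, end, _key = point
    return begin <= t < end
```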
So using a mapping of record versions to triplets of begin time, end time and key value will give us an access method with only one copy of each record version, but we cannot make the same guarantees of having a time slice hit a certain proportion of valid record versions in each data page it visits, as we can with the variations on the WOBT. Having an integrated current and historical index, with graceful migration of historical data to the historical database, as we have with the TSB-tree, also seems problematic with mapping.
5. CONCLUSION

The last word on spatial and temporal indexing has not yet been written. It is difficult to cluster data efficiently in pages to answer the canonical queries quickly without sacrificing index page utilization or making too many copies of records. Guarantees of any kind are hard to come by. However, I am convinced that the TSB-tree and the hB-tree have been steps in the right direction. We took care to make disk space usage efficient. We made sure that insertions involved
updating a minimal number of disk pages. We tried to make our algorithms simple, with no special cases. We were always aware of the importance of clustering in disk pages. And I had a wonderful time working on these problems.

Acknowledgements: I would like to thank Dennis Shasha for many patient suggestions on the style and the content of this article. Also, Vassilis Tsotras' comments were useful and welcome.
REFERENCES

[1] R. Bayer and M. Schkolnick. Concurrency of operations on B-trees. Acta Informatica 9, 1-21 (1977).
[2] B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. On optimal multiversion access structures. Workshop on Advances in Spatial Databases, 123-141 (1993).
[3] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509-517 (1975).
[4] M. Easton. Key-sequence data sets on indelible storage. IBM J. of R&D 30(3), 230-241 (1986).
[5] G. Evangelidis and B. Salzberg. Using the Holey Brick Tree for spatial data in general purpose DBMSs. IEEE Database Engineering Bulletin (1993).
[6] G. Evangelidis, D. Lomet, and B. Salzberg. The hBΠ-tree: A concurrent and recoverable multi-attribute indexing method. Technical Report NU-CCS-19-1993.
[7] A. Guttman. R-trees: A dynamic index structure for spatial searching. Proc. ACM SIGMOD Conf., Boston, MA, 47-57 (1984).
[8] S. Lanka and E. Mays. Fully persistent B+-trees. Proc. ACM SIGMOD Conf., 426-435 (1991).
[9] D. Lomet and B. Salzberg. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Trans. Database Systems 15(4), 625-657 (1990).
[10] D. Lomet and B. Salzberg. Access methods for multiversion data. Proc. ACM SIGMOD Conf., Portland, OR, 315-324 (1989).
[11] D. Lomet and B. Salzberg. The performance of a multiversion access method. Proc. ACM SIGMOD Conf., Atlantic City, NJ, 354-363 (1990).
[12] D. Lomet and B. Salzberg. Access method concurrency with recovery. Proc. ACM SIGMOD Conf., San Diego, 351-360 (1992).
[13] D. Lomet and B. Salzberg. Exploiting history for database backup. Proc. Very Large Databases Conf., Dublin, Ireland (1993).
[14] D. Lomet and B. Salzberg. Transaction-time databases. In Temporal Databases: Theory, Design and Implementation, Benjamin/Cummings, Redwood City (1993).
[15] J.T. Robinson. The K-D-B-tree: A search structure for large multi-dimensional dynamic indexes. Proc. ACM SIGMOD Conf., New York, NY, 10-18 (1981).
[16] B. Salzberg. An Introduction to Data Base Design, Academic Press, Orlando, Florida (1986).
[17] B. Salzberg. File Structures: An Analytic Approach, Prentice-Hall, Englewood Cliffs, New Jersey (1988).
[18] M. Stonebraker. The design of the Postgres storage system. Proc. Very Large Databases Conf., Brighton, UK, 289-300 (1987).
[19] M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. The Sequoia 2000 benchmark. Proc. ACM SIGMOD Conf., Washington, D.C., 2-11 (1993).
[20] V.J. Tsotras and N. Kangelaris. The snapshot index, an I/O-optimal access method for snapshot queries. CATT Tech. Report 93-68, Polytechnic University (1993).
[21] J. Ullman. Principles of Database Systems, Computer Science Press, Rockville, Maryland (1982).
[22] G. Wiederhold. Database Design, McGraw-Hill, New York (1983).