Indexing bitemporal databases as points



ELSEVIER

Information and Software Technology 40 (1998) 327-337

Beng Chin Ooi, Cheng Hian Goh, Kian-Lee Tan*

Department of Information Systems & Computer Science, National University of Singapore, Lower Kent Ridge Road, Singapore 119260, Singapore

Received 29 September 1997; received in revised form 24 March 1998; accepted 25 March 1998

Abstract

A bitemporal database supports both the valid and transaction time dimensions. Each record in such a database can be viewed as a rectangle in a 2-dimensional space (corresponding to the valid and transaction time dimensions). Hence, a spatial access method can be employed to facilitate speedy retrieval of the database. In this paper, we re-examine the issue of designing efficient access methods for bitemporal databases. In particular, we transform a record into a point in a multi-dimensional space, where the valid time and transaction time are each mapped to a 2-dimensional coordinate. A temporal selection operation can then be implemented as a region search operation. This allows us to tap into the many point access methods that are commercially available without modification. We implemented and evaluated three R-tree based methods on key-range time-slice queries: the naive Point R-tree, the Dual Point R-tree and the Dual Spatial R-tree. Our experimental results show that while the simple Point R-tree is inferior to the Dual Spatial R-tree, the Dual Point R-tree has the best performance of the three. © 1998 Elsevier Science B.V.

Keywords: Bitemporal databases; Valid time; Transaction time; Spatial access methods; R-trees

* Corresponding author.
0950-5849/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII S0950-5849(98)00054-8

1. Introduction

Traditionally, databases store and manipulate only one version - the most recent - of any data. Updates on a record are handled by replacing the old copy of the record with the new one. However, there is an increasing need to maintain the history of data in many applications. For example, historical information is useful for control purposes and for mining knowledge having significant implications for new business strategies. In some applications (for instance, in the insurance industry), it is necessary for historical records (e.g. past insurance claims) to be kept while still allowing retroactive changes.

Temporal database systems have been proposed to support these applications by providing explicit representation of temporal information associated with data records. Temporal data can be associated with two time dimensions: valid time and transaction time. Valid time represents the time at which a database fact is true in the real world, whereas transaction time defines the time at which a transaction is committed. A database that supports both time dimensions is known as a bitemporal database. A bitemporal database makes it possible to view records as being valid at some moment relative to some other moment.

One of the key challenges for temporal databases is to support efficient query retrieval based on time as well as on the attribute values associated with a data record. To allow temporal queries to be answered efficiently, a temporal index can be used to cluster data on their temporal relationships. Like most indexing structures, the desirable properties of a temporal index include efficient usage of disk space and speedy evaluation of queries. However, most of the access methods proposed in the literature have focused on a single time dimension, either the valid time [1-5] or the transaction time [6-8]. Valid time intervals of a time-invariant object can overlap, but each interval is usually closed.⁴ On the other hand, transaction time intervals of a time-invariant object do not overlap, and the last interval is usually not closed. Both properties present unique problems to the design of indexes for bitemporal databases.

⁴ There are some applications for which this may not be true. For our purposes here, we consider all valid time intervals to be closed, as is the case in [9].

In this paper, we address the issue of designing efficient access methods for bitemporal databases. Because records in such a database can be viewed as rectangles in a 2-dimensional space (corresponding to the valid and transaction time dimensions), a spatial access method can be useful for indexing the records in the database. We, however, advocate a more conventional approach, in which a record is transformed into a point in a multi-dimensional space where the valid time and transaction time are each mapped to a 2-dimensional coordinate. Under this transformation, a temporal selection operation can then be implemented as a region search operation. This allows us to tap into the many point access methods that are commercially available without modification.

We implemented and evaluated three R-tree based methods on key-range time-slice queries. The first method, called the Dual Spatial R-tree (DSR-tree), is a variation of the Dual R-tree approach proposed by Kumar, Tsotras and Faloutsos [9]. The DSR-tree treats each record as a rectangular block in a 3-dimensional space - the rectangle formed by the valid-transaction dimensions and the domain of a predefined search attribute. As in Ref. [9], the DSR-tree uses two R-tree structures, a front R-tree to index records with open transaction end-time and a back R-tree to index records with closed transaction time. The second method, the Point R-tree (PR-tree) method, is a straightforward approach that represents the key and the two time dimensions as a 5-dimensional point, enabling us to use a single R-tree to index the points. The last method, the Dual Point R-tree (DPR-tree), is a counterpart of the DSR-tree which transforms rectangles into points in a higher-dimensional space. The front R-tree of the DPR-tree is built on 4-dimensional points while the back R-tree indexes 5-dimensional points. Our experimental results show that while the simple PR-tree is inferior to the DSR-tree, the DPR-tree has the best performance of the three. This is contrary to the commonly held view that indexing methods that rely on the transformation of regions to points do not yield better performance than spatial indexing methods that index the native space directly.
The remainder of this paper is organized as follows. In the next section, we present some background on (bi)temporal databases, as well as previous work on bitemporal indexing. Section 3 presents the two access methods designed to support bitemporal data as points in a higher-dimensional space. In Section 4, we report on the experimental study conducted and the results. Section 5 summarizes our contributions and presents suggestions for further work.

2. Preliminaries

In this section, we briefly review the concepts of (bi)temporal databases. For a complete list of terms and their definitions, please refer to Ref. [10]. We also review related work on indexing bitemporal databases.


2.1. Bitemporal databases

In a bitemporal database, each record is associated with two time dimensions: valid time and transaction time. The valid time captures the time at which a record is true in the real world. It is typically represented as an interval [VT_s, VT_e], which means that the record is valid from time VT_s to time VT_e in the real world. Transaction time refers to the time at which a new value is posted to the database by a transaction [10], and constitutes the interval during which the value reported by a particular transaction is current in the database (i.e. the interval bounded by the time at which a transaction is committed and the time when another update arrives).

There are two approaches to representing transaction time. The first is to model transaction time as a single time instant. In this case, when a record R_1 is newly inserted at time TT_1, its transaction time is TT_1. Should this record be updated at transaction time TT_2, a new version R_2 will be created with transaction time TT_2. The bitemporal database should be able to determine that R_1 has been "deleted" at time TT_2. An alternative approach is to model a record's transaction time explicitly with an interval which corresponds to its actual transaction time lifespan. In this case, the transaction time of the newly inserted record R_1 is given by [TT_1, NOW], where NOW indicates the open-ended time space. Once the new version R_2 is posted, R_1's transaction time becomes [TT_1, TT_2 - 1], and R_2's transaction time is [TT_2, NOW]. Further updates to R_2 will change its transaction time interval to a closed one. An update to a version with lifespan [T_i, NOW], say at time T_j, is treated as a deletion which changes the lifespan to [T_i, T_j - 1], followed by the creation of a new version with lifespan [T_j, NOW].

To illustrate the concepts just described, consider an employee movement bitemporal relationship that monitors the movement of employees. Table 1 shows the bitemporal relationship at time 0 when it is first created, and Table 2 shows the relationship at time 7. We have represented the transaction time as intervals.

Table 1
The employee movement bitemporal relationship at time 0

Record  ENo  City         Valid time  Transaction time
t1      e1   New York     [0, 3]      [0, NOW]
t2      e2   Washington   [0, 7]      [0, NOW]
t3      e3   New York     [0, 5]      [0, NOW]

Table 2
The employee movement bitemporal relationship at time 7

Record  ENo  City         Valid time  Transaction time
t1      e1   New York     [0, 2]      [0, 2]
t6      e1   Seattle      [2, 3]      [3, 6]
t2      e2   Washington   [0, 7]      [0, NOW]
t3      e3   New York     [0, 5]      [0, NOW]
t4      e1   Los Angeles  [3, 9]      [7, NOW]
t5      e6   Seattle      [3, 6]      [3, NOW]
t7      e5   Washington   [4, 6]      [3, 4]
t8      e5   Washington   [?, 8]      [5, NOW]

We note that bitemporal relations facilitate predictive updates (where a record can be posted before it is valid) and retroactive updates (where a record can be inserted after it is valid). For example, in Table 1, we note that the valid times of all records end after their beginning transaction time, and in Table 2, at transaction time 3, there is an update for e1 that changes its valid time range to [2, 3]. We also note that records t7 and t8, with the same ENo and City values, bear overlapping valid times. This is possible because the two record versions have different transaction time values.

Valid time and transaction time are two orthogonal dimensions in semantics, i.e. the transaction time when a new version is created is independent of the valid time of the record version (it can be earlier than, during or later than the valid time interval). However, it has been argued that the two time dimensions are correlated in some way in most real world applications [10]. In particular, the transaction time may be bounded by a time interval after the fact was effective in valid time. For example, in an employee relationship, a promotion of an employee may be required to be posted to the database no later than 10 days after the assignment is valid. Such a "constraint" is known as the retroactive bound constraint. The constraint becomes a strong retroactive bound constraint if a past fact is not to be retroactively updated in the database. Similarly, we can have a strong predictive bound constraint to ensure that a future valid time cannot be too far away from the posted time. A bitemporal relationship is strongly bounded if the transaction can only be posted within the upper and lower bounds from the valid time, i.e. ∃Δt such that for each record of a relationship, TT_s - Δt ≤ VT_s ≤ TT_s + Δt, where VT_s and TT_s are the starting times of the valid and transaction time ranges.

2.2. Bitemporal queries

Various types of query for temporal databases have been discussed in the literature [4,5,11]. Like any other application, temporal indexing structures must be able to support a common set of simple and frequently used queries efficiently. The following constitutes the common set of bitemporal queries:

(Q1) Bitemporal time-slice queries. Find all versions that are valid during the given time interval [T_s, T_e] as of a given transaction time T_t. This query comes in four types, depending on the search operation:
(T1) Intersection queries. Given a time interval [T_s, T_e] and a time point T_t, retrieve all the versions whose valid time intervals intersect the given interval as of transaction time T_t.
(T2) Inclusion queries.
Given a time interval [T_s, T_e] and a time point T_t, retrieve all the versions whose valid time intervals are included in the given interval as of transaction time T_t.
(T3) Containment queries. Given a time interval [T_s, T_e] and a time point T_t, retrieve all the versions whose valid time intervals contain the given interval as of transaction time T_t.
(T4) Point queries. Given two specific time points T_v and T_t, retrieve all the versions whose valid intervals contain T_v as of transaction time T_t. Point queries can be viewed as the special case of intersection queries or containment queries where the time interval [T_s, T_e] is reduced to a single time instant T_v.
(Q2) Bitemporal key-range time-slice queries. Find all versions which are in the given key range [k_s, k_e] and that are valid during the given time interval [T_s, T_e] as of a given transaction time T_t. As with the time-slice query, the time-slice part of the query can assume any one of the predicates described in (T1) to (T4).

2.3. Related work on bitemporal indexing

While many indexing methods have been proposed for temporal databases, they are largely focused on supporting temporal databases with one time dimension only. A survey of some early work in temporal indexing can be found in Ref. [12]. In this section, we review some recent structures that have been proposed for bitemporal databases. These approaches can be categorized based on the representation of transaction time in bitemporal databases.

When transaction time is represented as time points, a bitemporal database can be visualized as a history of time-slices, each at a transaction time t, with each time-slice containing a set of records with valid time intervals as of t. The straightforward solution to this problem is to build a two-level index, where the first level indexes on transaction time, while the second level indexes on valid time. However, the main drawback of this simple method is the large amount of storage required. This is because records may be valid at different transaction times and hence some data redundancy cannot be avoided. To overcome this deficiency, Nascimento et al.
proposed a Multiple Incremental Valid Time Tree (M-IVTT) [13], which has a two-level structure that trades off processing time for storage. In the M-IVTT, the top level is a transaction time tree that indexes on transaction time points. Instead of building a valid-time tree structure to index all records valid at each distinct transaction time, only a small set of such trees is constructed at selected transaction time points (at every D transaction points, say) and the remaining transaction time points are associated with patches. Patches contain information on update sequences that can facilitate reconstruction of the valid-time trees corresponding to the transaction time points of the patches. Patches can be applied backward or forward to the neighboring valid-time tree. While the storage requirements can be reduced, the additional overhead in reconstructing the valid-time trees can be significant. This method may be viewed as a special case of the multiversion index methods [6,14-16] which were developed for transaction-time databases. The evolution of a set of interval objects is kept, so that it is possible to roll back a bitemporal database to a particular state as it was stored in the past, as of a specified transaction time.

When transaction times are modeled as intervals, we can view the two kinds of time interval as a 2-dimensional rectangle with sides parallel to the axes. Thus, we can adopt an existing spatial access method such as the R-tree [17] to index bitemporal data, and a bitemporal query can be mapped into a spatial search operation. However, indexing bitemporal data with a single R-tree can lead to poor performance, caused by excessive overlap of rectangles corresponding to records whose transaction time intervals are open. This requires multiple subtrees of an R-tree index to be searched for a given query, therefore causing performance to degrade significantly [18]. Moreover, when a record is updated, "closing" its transaction time interval will affect the rectangle's boundary, which can lead to reorganization of the R-tree.

Kumar et al. [9] proposed two methods to solve the above problem. In the Dual R-tree method, two R-trees are employed to index a bitemporal database. Records in the database are split into two sets: those with known ending transaction times (closed versions) and those with unknown ending transaction times (current versions). A back R-tree is used for the former and a front R-tree for the latter. When a version r is newly inserted into a database, its ending transaction time is unknown. Such a version is inserted into the front R-tree. Because all bitemporal objects in the front R-tree have an ending point NOW, only the starting points in the transaction time dimension need to be kept. Therefore, in the front R-tree, a record's transaction and valid time are represented as a line that is parallel to the valid time axis in a 2-dimensional space.
When r is closed and its ending transaction time becomes known, it is deleted from the front R-tree and inserted into the back R-tree. This strategy allows the rectangle overlap to be reduced significantly by making use of the front R-tree. Note that no reorganization of the back R-tree is needed, because records are only inserted and never deleted.

The Bitemporal Interval Tree makes use of the Interval Tree [19] to index a finite set U that contains V valid time points. An interval tree consists of a full binary tree and a number of doubly-linked lists. The V time points are stored in the leaves, and each internal node contains the middle value of its two immediate children. If the starting point of an interval falls in the left subtree of an internal node and the ending point falls in the right subtree, the interval is stored in the doubly-linked lists associated with this internal node. The left and right lists contain the starting and ending points, respectively. In the Bitemporal Interval Tree, the lists are transformed into "conceptual" lists of pages to facilitate the splitting policies of the MVBT [14] so as to answer bitemporal pure time-slice queries.
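The version bookkeeping that underlies the Dual R-tree method reviewed above (open versions at the front, closed versions at the back) can be sketched in a few lines. This is only an illustration under stated assumptions: the names `Version`, `DualStore`, `post` and `as_of` are hypothetical, a dictionary and a list stand in for the two R-trees, and each key is assumed to have at most one current version. The sample updates replay the transaction times of employee e1 from Tables 1 and 2.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Version:
    key: str
    vt_start: int
    vt_end: int
    tt_start: int
    tt_end: Optional[int] = None  # None plays the role of NOW (still open)


class DualStore:
    """Toy stand-in for the front/back pair; no real R-tree is built here."""

    def __init__(self) -> None:
        self.front: Dict[str, Version] = {}  # current versions (open transaction time)
        self.back: List[Version] = []        # closed versions (only ever appended)

    def post(self, key: str, vt_start: int, vt_end: int, now: int) -> None:
        """Record a new version of `key` at transaction time `now`."""
        old = self.front.pop(key, None)
        if old is not None:
            old.tt_end = now - 1             # [TT_i, NOW] becomes [TT_i, now - 1]
            self.back.append(old)            # closed version migrates to the back
        self.front[key] = Version(key, vt_start, vt_end, now)

    def as_of(self, t: int) -> List[Version]:
        """Versions that were current in the database at transaction time t."""
        current = [v for v in self.front.values() if v.tt_start <= t]
        closed = [v for v in self.back if v.tt_start <= t <= v.tt_end]
        return current + closed


store = DualStore()
store.post("e1", 0, 3, now=0)  # first version of e1
store.post("e1", 2, 3, now=3)  # update posted at transaction time 3
store.post("e1", 3, 9, now=7)  # update posted at transaction time 7
print([(v.tt_start, v.tt_end) for v in store.back])      # closed lifespans
print([(v.vt_start, v.vt_end) for v in store.as_of(4)])  # state as of time 4
```

The sketch only shows why the back structure never needs deletions: a version moves there exactly once, at the moment its transaction interval is closed.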

3. Point access methods for indexing bitemporal databases

A spatial index such as the R-tree cannot handle intervals with open end-time directly. An entry in the internal node of the R-tree contains a minimum bounding rectangle (MBR) that describes the data space of its child node. When data intervals are not closed, the MBR cannot be defined properly, and this affects the splitting algorithm, which makes use of space coverage to distribute the data into two groups. It is, however, possible to use the current time NOW or the largest time due to the proactive insertion (if any) as an estimate during node splitting and data insertion. The approximation is likely to have some effect on the performance.

In this section, we present two R-tree based access methods for indexing bitemporal data which have been transformed into points in a 5-dimensional space. The methods are designed to facilitate key-range time-slice queries. Each record in the database has a time-invariant key, a valid time represented as a closed interval, and a transaction time represented as an interval which may have an open ending time. There may be other attributes but these are not used for the purpose of indexing.

A 2-dimensional rectangle can be mapped into a point in 4-dimensional space [20,21]. One way to achieve this is the simple corner representation [21], which takes the lower and upper bounds of the rectangle as the x- and y-coordinates for each of the two dimensions (endpoint transformation). Another way of mapping is based on the centroid and two additional parameters for the extension of the rectangle in x- and y-axes (midpoint transformation). For a rectangle aligned with the axes, the four coordinates obtained from either transformation method are enough to uniquely determine it. Thus, the transformation technique maps a database of rectangle objects onto a database of point objects. For our purposes here, we have chosen the simple corner transformation (endpoint transformation), which represents the lower and upper bounds in each dimension of a rectangle as x- and y-coordinates. For instance, a 1-dimensional interval [x, y] can be interpreted as a point (x, y) in a 2-dimensional space. Similarly, k-dimensional intervals can be interpreted as points in 2k-dimensional space. In Fig. 1 we transform several 1-D intervals into points in a 2-D space.

Fig. 1. The endpoint transformation of line intervals.

Some properties of this transformation can be found with respect to the transformed higher-dimensional representation. For each dimension of a rectangle, the upper bound is always greater than or equal to the lower bound of a line segment [QT_s, QT_e], i.e. QT_e ≥ QT_s. Therefore, all points are mapped into the shaded triangular subspace which lies above the diagonal. Intervals of equal length, e.g. l2 and l4 in Fig. 1, are transformed into points that lie on a line parallel to the diagonal. For short intervals, the corresponding points are located in a strip of area close to the diagonal. Conversely, longer intervals are transformed to points which are located farther away from the diagonal, e.g. l3.

The original 2-dimensional query region for rectangles is transformed into a 4-dimensional query region for points. Since it is difficult to illustrate 4-dimensional space, we use a mapping of line intervals (1-dimensional) to points in a 2-dimensional space as our example. Fig. 2 shows the transformation of the query regions with the query range [QT_s, QT_e] for the four search operations - intersection, containment, inclusion and point. As an example, consider the intersection search. Let the query (valid) time interval be [QT_s, QT_e].
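To make the endpoint transformation concrete, the sketch below maps 1-dimensional intervals to 2-dimensional points and expresses each of the four search operations as a test on the point coordinates. The function names `to_point` and `in_region` and the sample intervals are hypothetical and serve only as an illustration; a real index would evaluate these regions against R-tree MBRs rather than scan a list.

```python
def to_point(start, end):
    """Endpoint transformation: interval [start, end] becomes the point (x, y)."""
    if end < start:
        raise ValueError("interval end precedes its start")
    return (start, end)  # always on or above the diagonal y = x


def in_region(point, qs, qe, op):
    """Test whether a transformed point lies in the query region for [qs, qe]."""
    x, y = point  # x = interval start, y = interval end
    if op == "intersection":  # interval overlaps the query interval
        return x <= qe and y >= qs
    if op == "inclusion":     # interval lies inside the query interval
        return x >= qs and y <= qe
    if op == "containment":   # interval covers the query interval
        return x <= qs and y >= qe
    if op == "point":         # query interval reduced to the instant qs
        return x <= qs <= y
    raise ValueError(f"unknown search operation: {op}")


# Hypothetical sample data, not taken from the paper.
intervals = [(1, 3), (2, 8), (5, 6), (7, 9)]
points = [to_point(s, e) for s, e in intervals]


def search(op, qs, qe):
    return [p for p in points if in_region(p, qs, qe, op)]


print(search("intersection", 4, 7))
print(search("inclusion", 4, 7))
print(search("containment", 4, 7))
print(search("point", 7, 7))
```

Each region is bounded by at most two axis-parallel half-planes, which is why a point access method can answer all four operations with an ordinary range search.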
For an interval in the database to intersect the query interval, either its end time must be in the interval or its start time must be in the interval. Thus, no record with end time less than QT_s needs to be considered, and no record with start time after QT_e needs to be examined. We then have the query region as indicated by the shaded portion. Note that the area of the transformed query region can be reduced if the greatest duration of all the intervals is known. However, the problem of transformation improvement is an orthogonal issue, and we will not address it here.

Fig. 2. Query regions for interval [QT_s, QT_e]: (a) intersection search; (b) inclusion search; (c) containment search; (d) point search.

3.1. Naive Point R-tree

In a naive Point R-tree (PR-tree), each record is transformed into a point in a 5-dimensional space - two dimensions for transaction time, two for valid time and one for the attribute which is being indexed. Since transaction time can be a closed or open interval, there are also two categories of points in the 5-dimensional space. The first category are points with the transaction time coordinates (TT_s, NOW). Since NOW should always be assigned the value of the present transaction time, the y-coordinates of these points change along with the transaction time. For example, at transaction time TT_1, a record's first version τ is inserted in the R-tree. At this time, NOW's value is TT_1, so the coordinates of τ's transaction time dimension should be (TT_1, TT_1). At time TT_2, the next transaction of the database occurs. At this time, since NOW's value is TT_2, τ's transaction time coordinates change to (TT_1, TT_2). We call these points floating points. The movement of τ doesn't stop until its next version arrives, say, for example, at TT_n. Then, τ's transaction lifespan will change to a closed one, [TT_1, TT_n - 1]. As a result, the point's transaction time coordinates are fixed as (TT_1, TT_n) and will never change again. We call the kind of points which have closed transaction time ranges fixed points.

Since the PR-tree stores both the floating points and fixed points, a bounding rectangle (MBR) containing floating points will have an upper bound (along the transaction time dimension) that changes dynamically; in this sense, MBRs encompassing floating points are said to be unbounded. As shown in Fig. 3, all floating points are on the upper bound of the transaction time dimension because the right endpoint of their transaction time range is NOW. This may result in having a large region of space that has no points at all. The large extension of empty space in such rectangles affects the query efficiency negatively. This is because, to search for the records with transaction time interval intersecting with the query predicate, a large number of such MBRs will be checked. Fig. 3 shows such an MBR. Notice however that an MBR may change its status from "unbounded" to "bounded" only when all floating points it encompasses are changed to fixed points. When this happens the boundaries of an MBR must be updated to reflect this change, thus making sure that the MBRs are as tightly bounded as possible.

Fig. 3. Mix of floating points and fixed points.

3.2. Dual Spatial R-tree

The Dual Spatial R-tree (DSR-tree) is a variation of the Dual R-tree first proposed in Ref. [9], where each record is a 3-dimensional rectangular block in a Euclidean space defined by valid time, transaction time and the search attribute. As is the case with Dual R-trees, the DSR-tree employs two R-trees: a front R-tree and a back R-tree, where the first is used for indexing records with open transaction intervals, and the latter is used only for indexing records with closed transaction intervals. An update to a record in the front R-tree requires the deletion of the updated record from the front R-tree and its insertion into the back R-tree.

3.3. Dual Point R-tree

To solve the problem caused by mixing floating points and fixed points in the same MBR in a PR-tree, we adopted the Dual Spatial R-tree approach of separating the two kinds of points and keeping them in two separate R-trees. We refer to the resultant structure as the Dual Point R-tree (DPR-tree), which employs two point R-trees. The back PR-tree for fixed points is still 5-dimensional, whereas the front PR-tree for floating points is 4-dimensional since there is no need to index the right endpoints of transaction time ranges. Thus, delete operations are performed only on one R-tree, while eliminating the unexpected large empty space extension in the MBRs.

4. A performance study

We have implemented the three R-tree based techniques. In this section, we present the experimental studies, and report our findings.

4.1. Experiment set-up

In our experiment, we employed a strongly bounded bitemporal relation. We model our bitemporal database as follows. Suppose a car rental company maintains a database of car rental information. The relation, Car-Rent, has attributes {CarNo, CustomNo, ClerkNo, VT_s, VT_e, TT_s, TT_e}. When a customer (denoted by CustomNo) rents a car from the company, the clerk records the rental information into the database. CarNo is a time-invariant key of the car rented. The duration of the car rental is given by the closed valid time range [VT_s, VT_e]. The transaction time of a record also has a lifespan. If a record is a current version, it has a transaction lifespan [TT_s, NOW], where NOW is the current transaction time; if it has been updated at time TT_u (thus becoming an "old" version), its transaction time range changes to [TT_s, TT_u - 1]. We assume that the transaction ending time of a record version is the instant immediately preceding the transaction starting time of its successive version.

For each time-invariant key, the transaction arrival of its version, i.e. the transaction duration for a certain car, follows a Poisson distribution with mean μ. Since a transaction is normally posted around the time when an event is true in reality, relation Car-Rent is a strongly bounded temporal relation. For each record, TT_s - 20%μ ≤ VT_s ≤ TT_s + 20%μ. In other words, a car rental transaction must be recorded in the database no later than duration 20%μ after the car is rented out and no earlier than duration 20%μ before the car is rented out.

To test the performance of the indexes under different lengths of time intervals, we varied the mean length of transaction time duration to generate different databases. In our study, μ is set to 20, 100, 200 and 400. Each database contains 1000000 record versions. The page size for our indexes is 4K bytes.

Different time-invariant keys may have various first transactions, i.e. the first event of key K_1 may occur at transaction time 5, whereas the first version of key K_2 can start at time 50. The first arrival of transaction for each car is modeled using a Poisson distribution. The mean interarrival transaction time of the key's first version is λ = 20. The time-invariant key, CarNo, is in the range [1, 10000]. The number of versions per key follows the Zipfian distribution. In our study, we model the distribution of versions per key with 14 ranks: 100, 80, 110, 60, 130, 40, 150, 20, 170, 10, 200, 5, 250 and 300. With most keys having 100 versions, the databases also contain keys that may have more or fewer versions. This allows us to model the skewness of data on the number of versions. Fig. 4 illustrates the distribution of key versions determined by the Zipf-like distribution [22] with θ = 0.2.

Fig. 4. Zipfian distribution of keys with respect to the number of versions per key.

In all of our query experiments, we use the bitemporal query "Return all records whose valid time intersects with a range [VT_s, VT_e] and the key is in the range [K_s, K_e] as of time TT". In each experiment, the query ranges are generated randomly, 100 searches are performed for each range length, and the average value of the results is recorded.

4.2. Dual Spatial R-tree vs Point R-tree

In this experiment, we compare the performance of the Dual Spatial R-tree with the naive Point R-tree. Since the relative performance of the two is similar in most cases, we present only the result for the database whose mean valid and transaction time lifespan for all records is 100.

Fig. 5 shows the update performance of the two methods. It can be observed that the PR-tree yields more I/O accesses per update than the DSR-tree. In Fig. 6, we fixed the length of the key range at 2500 and compare the query performance of the PR-tree and the DSR-tree under different valid time query interval lengths. From the figure, we can see that when the query range increases, the number of page accesses for the PR-tree increases sharply. The poor performance of the PR-tree can be attributed to the large extension of the MBRs that contain both floating points and fixed points. As expected, by using two trees, the DSR-tree can reduce the number of I/Os for overlapping rectangles.
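The strongly bounded constraint used to generate the test data in Section 4.1, TT_s - 20%μ ≤ VT_s ≤ TT_s + 20%μ, can be checked mechanically. The sketch below is an assumption-laden illustration rather than the generator used in the study: the function names are hypothetical, times are taken to be integers, and the valid start time is drawn uniformly inside the bound, a choice the text does not specify.

```python
import random


def bound_width(mu, percent=20):
    """Half-width of the allowed window around TT_s, i.e. percent% of mu."""
    return mu * percent // 100  # integer arithmetic avoids float rounding


def is_strongly_bounded(tt_start, vt_start, mu, percent=20):
    """Check TT_s - percent%*mu <= VT_s <= TT_s + percent%*mu."""
    delta = bound_width(mu, percent)
    return tt_start - delta <= vt_start <= tt_start + delta


def sample_valid_start(tt_start, mu, rng, percent=20):
    """Draw VT_s inside the bound (uniform offset: an assumption of this sketch)."""
    delta = bound_width(mu, percent)
    return tt_start + rng.randint(-delta, delta)


rng = random.Random(0)  # fixed seed so repeated runs draw the same values
for mu in (20, 100, 200, 400):  # the four mean durations used in the study
    starts = [sample_valid_start(1000, mu, rng) for _ in range(5)]
    print(mu, all(is_strongly_bounded(1000, s, mu) for s in starts))
```

With μ = 100, for example, a rental posted at transaction time 100 may carry a valid start time anywhere from 80 to 120.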

Fig. 5. Average number of pages accessed per update (DSR-tree vs PR-tree; time duration = 100; x-axis: number of updates, ×1000).

Fig. 6. Query performance under different lengths of valid time ranges (time duration = 100; x-axis: time query length, ×100).

4.3. Dual Spatial R-tree vs Dual Point R-tree

In these experiments, we compare the Dual Spatial R-tree and the Dual Point R-tree. We carried out our experiments on four databases, D20, D100, D200 and D400, which have mean transaction and valid time durations of 20, 100, 200 and 400 respectively. Figs. 7 and 8 show the average number of pages accessed for each update in the different databases. For both methods, the number of pages accessed increases as the mean interval length μ increases. This is because when the mean length of intervals becomes larger, more MBRs need to be searched to accommodate a record in insert or delete operations. Also, it can be observed that in each database there is no significant difference between the update performance of the DSR-tree and that of the DPR-tree after the initial loading, when the trees reach almost the same height.

Fig. 7. Average number of pages accessed per update in DSR-tree (curves for mean = 20, 100, 200 and 400; x-axis: number of updates, ×1000).

Fig. 8. Average number of pages accessed per update in DPR-tree (x-axis: number of updates, ×1000).

We also evaluated both methods on their query performance. We first fix the length of the valid time query range at 5000 and compare the performance under different lengths of key query ranges. Fig. 9 shows the number of page accesses per query for each database. We observe that the DPR-tree is superior in all cases. We also notice that as the mean time duration of a database increases, the number of page accesses also increases, because the presence of more overlaps makes the search operation less efficient.

Fig. 9. Query performance under different lengths of key range: (a) Database D20; (b) Database D100; (c) Database D200; (d) Database D400 (DSR-tree vs DPR-tree; x-axis: key query length).

We also compare the performance under different lengths of valid time query ranges when fixing the length of the key query range at 2500. The results are shown in Fig. 10. As the query length increases, the DSR-tree yields a larger number of page accesses in each database, making it inferior to the DPR-tree.

We also evaluated both methods in terms of storage cost. Fig. 11 shows the results, which indicate that the DPR-tree is more storage efficient. In fact, the number of pages used in the DPR-tree is about 20% less than that in the DSR-tree.

Fig. 10. Query performance under different lengths of valid time ranges: (a) Database D20; (b) Database D100; (c) Database D200; (d) Database D400 (DSR-tree vs DPR-tree; x-axis: time query length, ×100).

Fig. 11. Storage cost in DSR-tree and DPR-tree (x-axis: databases D20, D100, D200 and D400).

5. Conclusion

In this paper, we have advocated the use of point access methods for indexing bitemporal databases. Instead of representing a bitemporal record as a spatial object, we suggested that it can be transformed into a point in a higher-dimensional space. This transformation allows us to tap into the large number of point access methods to facilitate ease of implementation. We proposed and evaluated three methods that are based on the R-tree structure. The Dual Spatial R-tree method employs two spatial R-trees and indexes the records as rectangles. The naive Point R-tree represents each record as a point in a 5-dimensional space. Finally, the Dual Point R-tree approach overcomes the limitation of the Point R-tree method by employing two Point R-trees.

It is interesting to note that a similar conclusion has been reported in Ref. [23], in which an indexing approach called 2-Rp is proposed. The Dual Point R-tree, while similar in spirit, differs from the 2-Rp approach in that valid-time intervals are transformed to points in the back R-tree. Given the recentness of these results, it remains to be seen whether either index offers a performance gain over the other.

Our experimental results showed that the Dual Spatial R-tree is an efficient structure and outperforms the Point R-tree. However, the Dual Point R-tree is superior in all cases in terms of query performance and storage cost.

We are currently extending this work in several directions. First, we noted that the search on the front R-tree for the DPR-tree is expensive, since all transaction times before time T must be examined for a given time-slice query as of T. It is possible to prune the search space for this query by using the interval of valid time and the key range. This can be accomplished by implementing the front R-tree as a multi-dimensional R-tree. The performance of such a variation is currently being studied. Second, the existing schemes are being evaluated on a larger and wider set of queries, which includes containment, point and inclusion queries. Their performance is being compared with that of other competing techniques such as the Bitemporal Interval tree. Finally, we are looking into how our approach can be adapted to handle situations where the valid time is not closed, thus making our indexing approach applicable to more disparate situations.
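The interval-to-point idea behind this work can be illustrated for a single time dimension: a closed interval [s, e] is stored as the point (s, e), and "intersects the query interval [q_lo, q_hi]" becomes membership in the rectangular region s ≤ q_hi, e ≥ q_lo. The following is only a sketch of that idea, assuming closed integer intervals within a known time domain [t_min, t_max]; the function names are hypothetical and no particular point access method, nor the authors' code, is implied.

```python
from typing import Tuple

Interval = Tuple[int, int]  # closed interval (start, end), start <= end
Point = Tuple[int, int]     # the same interval viewed as a 2-D point
Region = Tuple[Tuple[int, int], Tuple[int, int]]  # ((x_lo, x_hi), (y_lo, y_hi))


def to_point(iv: Interval) -> Point:
    """Map an interval to a point: x = start, y = end."""
    start, end = iv
    return (start, end)


def query_region(q_lo: int, q_hi: int, t_min: int, t_max: int) -> Region:
    """Search rectangle equivalent to 'intersects [q_lo, q_hi]'.

    A stored interval intersects the query iff start <= q_hi and
    end >= q_lo, i.e. start in [t_min, q_hi] and end in [q_lo, t_max].
    """
    return ((t_min, q_hi), (q_lo, t_max))


def in_region(p: Point, region: Region) -> bool:
    """Plain rectangle containment test, as a point index would apply it."""
    (x_lo, x_hi), (y_lo, y_hi) = region
    return x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi


def intersects(iv: Interval, q_lo: int, q_hi: int) -> bool:
    """Direct interval intersection test, used here only for comparison."""
    return iv[0] <= q_hi and iv[1] >= q_lo
```

Any point access method that supports rectangular range search could evaluate `query_region` directly; the direct `intersects` test is included only to check that the two formulations agree.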

Acknowledgements

We would like to thank Peng Jiang for setting up the experimental study. He also read an earlier draft of this paper.

References

[1] C.H. Ang, K.P. Tan, The Interval B-tree, Information Processing Letters 53 (2) (1995) 85-89.
[2] R. Elmasri, G.T.J. Wuu, V. Kouramajian, The Time Index: An access structure for temporal data, in: Proceedings of the 16th International Conference on Very Large Data Bases, 1990, pp. 1-12.
[3] C.H. Goh, H. Lu, B.C. Ooi, K.L. Tan, Indexing temporal data using B+-tree, Data and Knowledge Engineering 18 (1996) 147-165.
[4] H. Gunadhi, A. Segev, Efficient indexing methods for temporal relation, IEEE Transactions on Knowledge and Data Engineering 5 (3) (1993) 496-509.
[5] H. Shen, B.C. Ooi, H. Lu, The TP-index: A dynamic and efficient indexing mechanism for temporal databases, in: Proceedings of the 10th International Conference on Data Engineering, 1994, pp. 274-281.
[6] D. Lomet, B. Salzberg, Access methods for multiversion data, in: Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, 1989, pp. 315-324.
[7] D. Lomet, B. Salzberg, The performance of a multiversion access method, in: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, 1990, pp. 353-363.
[8] D. Lomet, B. Salzberg, Transaction time databases, in: Temporal Databases: Theory, Design and Implementation, Benjamin/Cummings, 1993, Chap. 16, pp. 388-417.
[9] A. Kumar, V.J. Tsotras, C. Faloutsos, Access methods for bi-temporal databases, in: Proceedings of the International Workshop on Temporal Databases, 1995, pp. 235-254.
[10] A consensus glossary of temporal database concepts, ACM SIGMOD Record 23 (1) (1994) 52-64.
[11] B. Salzberg, On indexing spatial and temporal data, Information Systems 19 (6) (1994) 447-465.
[12] B. Salzberg, V.J. Tsotras, A comparison of access methods for time-evolving data, ACM Computing Surveys, 1994 (forthcoming). (Also available as Technical Report NU-CCS-94-21 at Northeastern University.)
[13] M.A. Nascimento, M.H. Dunham, R. Elmasri, M-IVTT: An index for bitemporal database, in: Proceedings DEXA'96, 1996, pp. 779-790.
[14] B. Becker, S. Gschwind, T. Ohler, B. Seeger, P. Widmayer, On optimal multiversion access structures, in: Proceedings of the 3rd International Symposium on Large Spatial Database Systems, 1993, pp. 123-141.
[15] J.R. Driscoll, N. Sarnak, D.D. Sleator, R.E. Tarjan, Making data structures persistent, Journal of Computer and System Sciences 38 (1989) 86-124.
[16] S. Lanka, E. Mays, Fully persistent B+-trees, in: Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data, 1991, pp. 426-435.
[17] A. Guttman, R-trees: A dynamic index structure for spatial searching, in: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, 1984, pp. 47-57.
[18] S. Berchtold, D.A. Keim, H.-P. Kriegel, The X-tree: An index structure for high-dimensional data, in: Proceedings of the 22nd International Conference on Very Large Data Bases, 1996.
[19] H. Edelsbrunner, A new approach to rectangle intersections, International Journal of Computer Mathematics 13 (1983) 209-229.
[20] K. Hinrichs, Implementation of the grid file: Design concepts and experience, BIT 25 (1985) 569-592.
[21] B. Seeger, H. Kriegel, Techniques for design and implementation of efficient spatial access methods, in: Proceedings of the 14th International Conference on Very Large Data Bases, 1988, pp. 360-371.
[22] D.E. Knuth, Fundamental Algorithms: The Art of Computer Programming, Volume 1, Addison-Wesley, 1973.
[23] A. Kumar, V.J. Tsotras, C. Faloutsos, Designing access methods for bitemporal databases, IEEE Transactions on Data and Knowledge Engineering (forthcoming). (Also appeared as Technical Report CS-TR-3764, published by Department of Computer Science, University of Maryland.)
