A data model for approximate query processing of real-time databases

Data & Knowledge Engineering 21 (1997) 79-102

ELSEVIER

S.V. Vrbsky

Department of Computer Science, The University of Alabama, Tuscaloosa, AL 35487-0290, USA

Received 10 April 1995; revised 14 August 1995; accepted 7 February 1996

Abstract

A real-time system has specific time constraints for the processing of a transaction as well as temporal consistency constraints for its temporal data. If it is not possible to produce an exact answer to a database query within the specified time constraints, for many applications it may be better to produce an approximate answer than to produce no answer at all or to wait for an exact answer and miss a deadline. Approximate query processing can be used to provide approximate answers to database queries for such applications. However, approximate query processing does not address the time dimension of the data. In this paper we extend the theoretical basis of approximate query processing to include the temporal dimension of real-time databases and present a temporal data model for approximate query processing. We describe an approximation that is designed to include the temporal data and address the temporal consistency constraints of a real-time database. We present monotone approximate relational algebra operations that are redefined to include the temporal dimension of the data. We also describe the semantic support of an implementation of an approximate query processor for temporal data that is based on this data model.

Keywords: Approximation; Approximate query processing; Monotone; Temporal consistency constraints; Temporal data

1. Introduction

Real-time database systems, such as those used for network management, automated factory management, radar tracking and air traffic control, have temporal data to model the real world as well as specific time constraints for processing transactions or queries [24,27]. Temporal data is defined as data that varies with time and is used to capture the current and previous states of the world. Consistency constraints of the temporal data in a real-time database require the data to present a state of the real world that is up-to-date and also require data that represent past states of the real world to have values that are close in time. The time constraints of a real-time database system require that an answer to a query be produced by a specific deadline. For example, a real-time system may require an analysis of the most recently acquired weather data by the time 1300 h, or a quick response to a changing condition in a factory based on previous and consistent states at the factory.

0169-023X/97/$15.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0169-023X(96)00019-5

It is not always possible to produce an exact answer to a database query within the time constraints specified in a real-time system. There may not be enough time to acquire all the locks and retrieve all the data needed to answer a query by a specified deadline. In a distributed database, a network partition or a host failure can cause some needed data to become inaccessible. In a hard-real-time environment, a result produced too late can be of little or no use and failure to produce a result in time can have disastrous consequences. For many applications, it may be better for a database to produce an approximate answer than to produce no answer at all or to wait for an exact answer and miss a deadline. For example, if a pilot requires a list of available runways at airports within landing distance, having a partial list of those runways may be better than no list at all. Even for applications that have softer deadlines and less dramatic consequences resulting from a missed deadline, approximate answers can be made available. For some applications, a user may simply want approximate answers displayed during query processing to determine if query processing should continue. Making approximate answers to database queries available when exact answers cannot be obtained, either due to the lack of time or the inaccessibility of data, enhances the fault tolerance and availability of a database system.

Approximate query processing [26,29] provides useful approximate answers to database queries if exact answers cannot be produced in time, or if all of the needed data are not available. The approximate answer improves as more data become available and, if there is time to continue with processing and all of the data are available, converges to the exact answer when all the needed data are retrieved and processed.
We have designed and implemented a query processor, called APPROXIMATE [29], that provides approximate answers to database queries and can be used in a real-time environment. APPROXIMATE assumes that the data in the database are precise. The approximate answers of APPROXIMATE are produced because some of the data needed for the exact answer were not retrieved and processed, either because there was not enough time or the data were not available. Although approximate query processing provides a strategy for satisfying the time constraints specified by the query, it does not address the temporal dimension or temporal consistency constraints of the data in a real-time database. In this paper, we describe how the theoretical basis of approximate query processing is extended to include the temporal dimension of the data in a real-time system and we present a temporal data model for approximate query processing. We describe an approximation that includes temporal data and the temporal consistency constraints of a real-time database. We redefine the approximate relational algebra operations to include the temporal dimension and we show they are monotonic. We also describe how the semantic support of APPROXIMATE can be expanded to include temporal data and address the temporal consistency constraints of a real-time database.

1.1. Approximate answers

The problem of defining and improving approximate answers for conventional databases that do not have temporal data has been addressed by researchers [2,17,19]. Buneman, Davidson and Watters [1] introduced the approximation semantics where approximations of a set-valued answer are defined in terms of the subsets and supersets that sandwich the exact answer. The precision of the approximations improves as rules describing subsets and supersets are applied. The approximations are monotonically improving as the subsets and supersets approach the exact answer, and an approximate answer produced earlier is never contradicted by an approximate answer produced later. Ozsoyoglu et
al. have developed a relational database system called CASE-DB that produces approximate answers to queries with deadlines or time quotas [11,21,22]. The approximate answers of a query are subsets of the exact answer that increase in size, and, thus, are monotonically improving. Segments of the data are processed one at a time to produce these subsets.

For an intensional answer [3,20], general characteristics about the data objects in the exact answer are provided instead of the data objects themselves. An intensional answer provides an approximation that is derived without accessing the physical data in the database. The data is organized into type hierarchies and information is maintained about the data of each type, such as the domains of the attributes, to provide the general characteristics.

Liu and Sunderraman [17] introduced an I-table that is capable of representing indefinite and 'maybe' kinds of information in the database. Queries involving relations represented by I-tables produce indefinite and 'maybe' answers. The relational algebra operations of union, difference, Cartesian product, project and select are extended to operate on I-tables.

An approximate answer in approximate query processing has semantics similar to the approximation proposed by Buneman, Davidson and Watters in [1]; a subset and superset of the exact answer are also provided. An approximate answer provided by approximate query processing differs from an approximate answer in [1] by allowing the user to distinguish the data objects processed thus far from those not yet processed, and also by producing the actual data objects each time an approximate answer is improved. General characteristics about the data objects in the exact answer that have not yet been processed are provided by APPROXIMATE, and, thus, are similar to intensional answers.
An approximate answer of APPROXIMATE differs from an intensional answer because it also provides actual data objects in addition to general characteristics. The approximation semantics and approximate operations of approximate query processing are similar to those of I-tables [17], but I-tables do not consider monotonicity and approximate query processing does not consider disjunctive information. Similar to CASE-DB, segments of the database are processed at a time in approximate query processing to achieve the monotonicity of approximate answers. However, an approximate answer in CASE-DB differs from an approximate answer in APPROXIMATE because CASE-DB provides only a subset of the data objects of the exact answer, and no information about the data not yet processed is provided.

1.2. Temporal data models

Many temporal data models have been proposed [5,8,18] that use the relational algebra framework. While these data models do not associate time constraints with the processing of the data, they do address the issues of representing and retrieving data with a temporal dimension. A temporal data model includes time-varying data, and can provide many snapshots of the states of the database. Attributes may be set-valued when behaviour over time is recorded. A temporal data model can support three different metrics of time [13,18]: (1) valid time, which is the time a fact recorded in the database is true in the real world, (2) transaction time, which is the time when a fact is recorded in the database (is current in the database) and may be retrieved, and (3) user time, which is an uninterpreted attribute domain of date and time, and is provided by the user or the application program.

The Historical Relational Data Model [5] assigns valid times, called lifespans, to attribute values and to tuples. A lifespan is defined as a discrete unit of time. The tuple lifespan is the period of time during which the tuple's values are valid and the lifespan of the attributes is the period of time during
which the attribute values are valid. An attribute lifespan must fall within the lifespan of the tuple. Relational algebra operations are extended to operate on data that contain lifespans and new time-oriented operations are introduced, such as the when and time-slice operations, to restrict the lifespan of the tuple to some portion of its original lifespan. In Gadia's model [8], lifespans are defined as the finite union of disjoint time intervals. All attributes in a tuple are required to have the same time domain, a requirement that is called homogeneity. The set of all lifespans is closed under finite union, intersection and complementation, and forms a Boolean algebra. A relational algebra and query language for the homogeneous temporal data model is presented. The temporal data in a real-time system are comprised of different collections of data objects in order to model real-world objects [24,27]. Each of these different collections is described in detail in Section 3. A real-time database is assumed to mirror the physical world, and frequent interactions with the physical world are needed to obtain accurate and timely data. Real-time database systems are comprised of both periodic and sporadic transactions with real-time constraints, as well as a flow of information which is gathered by devices interacting directly with the real-world and written to the database. A real-time database must also satisfy temporal consistency constraints. Data objects are said to be temporally consistent if they satisfy both relative and absolute consistency constraints. In other words, as will be described in Section 3, the ages and dispersion of the times of the temporal data meet the requirements of the application [27]. The remainder of this paper is organized as follows. Section 2 describes how the timing constraints of a query are addressed by approximate query processing. Section 3 describes the temporal data in real-time databases. 
In Section 4 we present the temporal data model for approximate query processing. In Section 5 we demonstrate the monotonicity of the operations that include temporal data. Section 6 describes the semantic support of an implementation of approximate query processing to include temporal data. Section 7 presents conclusions and describes future work.

2. Time constraints of queries

The imprecise computation approach [4,14,16,25] is applied in approximate query processing. The basic strategy taken in the imprecise computation approach is to make a usable, approximate result available when the exact result cannot be produced in time. The results produced using this approach sacrifice precision for timeliness, and, therefore, they are referred to as imprecise results. Computations that are not complete are imprecise, and computations that produce the imprecise results are imprecise computations. An imprecise computation is also called an 'anytime' computation [7]. One approach used to implement imprecise computation is to save the results produced by a computation at certain points in time during the computation [15]. The assumption is that the accuracy of the intermediate result increases monotonically as the computation progresses. This assumption is valid if the underlying algorithm is monotonic.

Approximate query processing is monotonic. It produces approximate answers which are non-decreasing in accuracy with increasing amounts of data retrieved and time spent to produce them. An initial approximation of the exact answer is provided when query processing begins. The approximate answer produced improves as more data are retrieved to answer the query according to the partial-order relation defined by the approximate relational model [26], which is described in Section 2.1. The exact answer is returned if all of the needed data are available and if there is enough time to complete the processing. The latest best available approximate
answer is returned if the user demands an answer before query processing is completed. We begin by describing the semantics of approximation of approximate query processing.

2.1. Approximate relational model

Approximate query processing is based on the approximate relational model [26] which asserts that there is a natural way to approximate answers of set-valued queries. An exact answer E to such a query is a set of data objects. All the data objects in a subset of E certainly belong to E, and a data object in a superset of E is possibly also a member of E. Therefore, a meaningful approximation of any exact answer E can be defined in terms of a subset and a superset of E. Specifically, an approximation A of an exact answer E is the union of two sets of data objects: a certain set C, where $C \subseteq E$, and a possible set P, where $(C \cup P) \supseteq E$. C is the set of data objects certainly in E; it is produced from the stored data processed thus far. P is the set of data objects that may be in E; it is produced based on information about the data that is maintained by the query processor. P is the data to be retrieved and processed as query processing continues. An approximation is denoted by the 2-tuple $A = (C, P)$. An approximation allows the user to distinguish the data objects processed thus far, the set C, from those not yet processed, the set P.

We consider a database described in [24]. This real-time database system for an automated factory consists of the floor of the factory containing robots, assembly stations and assembled parts, as well as computer and human interfaces to manage and coordinate the activities occurring on the floor of the factory. (We will ignore the temporal dimension of this database for now.) The database contains information about the conditions, such as temperature and pressure, at different stations on the floor of the factory. Each of the stations is managed by a specific employee (Jones, Lee or Smith). Fig. 1a illustrates an exact answer E to a query requesting the id, manager and temperatures of all stations on the factory floor where the temperatures are greater than or equal to 80.

(a) The exact answer E

id  Manager  Temp
 7  Smith    90
 9  Smith    95
10  Jones    80

(b) Approximation $A_1$

id  Manager  Temp
 1  Lee      70
 6  Jones    85
 7  Smith    90
 9  Smith    95
10  Jones    80
13  Smith    75
27  Lee      75

(c) Approximation $A_2$

id  Manager  Temp
 6  Jones    85
10  Jones    80
-----------------
 1  Lee      70
 7  Smith    90
 9  Smith    95
13  Smith    75
27  Lee      75

(d) Improved approximation $A_3$

id  Manager  Temp
 6  Jones    85
10  Jones    80
-----------------
 7  Smith    90
 9  Smith    95
13  Smith    75
27  Lee      75

Fig. 1. Approximations of the exact answer.

Fig. 1b is an approximation $A_1$ of E. It is a superset of E containing data objects that are possibly in E, and, hence, can be considered an approximation of E. $A_1$ contains the temperatures at all stations in all areas. $A_2$ in Fig. 1c is also an approximation of the exact answer. The horizontal line in the approximation distinguishes the certain set and the possible set. The first two data objects have been identified as certainly in the exact answer; they are a subset of the exact answer and are the set C. The last five data objects are possibly in the exact answer and are the set P. The data objects in Fig. 1c are a superset of those in Fig. 1a. $A_3$ in Fig. 1d is another approximation of E. The data object with an id of 1 has been identified as a data object certainly not in the exact answer and has been deleted from the approximation. $A_3$ in Fig. 1d is a superset of the exact answer and is a subset of the approximation in Fig. 1c.

Any exact answer E has many approximations. Given a set of approximations of E, a partial order relation $\ge$ for comparing them can be defined over the set as follows. One approximation $A_i = (C_i, P_i)$ is better than or equal to another $A_j = (C_j, P_j)$, denoted as $A_i \ge A_j$, if $C_i \supseteq C_j$, $P_i \subseteq P_j$ and $A_i \subseteq A_j$. (We say $A_i \subseteq A_j$ if $(C_i \cup P_i) \subseteq (C_j \cup P_j)$.) In other words, $C_i$ has at least as many objects in E as $C_j$. $A_i$ is better than $A_j$ because: (1) the fact that the data objects in $C_i - C_j$ are in E becomes certain, and/or (2) the fact that the data objects in $A_j - A_i$ are not in E becomes certain.

This partially ordered set of all approximations of E is a lattice. In the lattice, $A_0 = (\emptyset, v)$ is the least element and the worst possible approximation of E, where $v$ is the Cartesian product of all the domains in the schema of E. It is the set of all possible data objects which could be in E. $\emptyset$ denotes the null set. A key assumption is that the number of elements in $v$ is finite and, at least theoretically, can be computed without accessing any data.
This is necessary because the possible part is the data objects which have not been processed yet. The greatest element of the lattice is the best possible approximation, which is E itself, represented by $(E, \emptyset)$.

This partial-order relation is compatible with a common sense notion of which approximations are better and which are worse. If the approximation $A_i$ has more certain data objects and fewer possible data objects than the approximation $A_j$, it is known that more tuples have been retrieved and processed to produce $A_i$. Thus, it makes sense to say that $A_i \ge A_j$. It is also important to compare only those approximate answers that have a subset/superset relation, so that we are comparing answers that can be obtained at different stages of processing. This semantics of approximation can also be used to approximate queries with single-valued answers [28].
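The definitions in this section can be illustrated with a minimal Python sketch. The class name `Approximation` and its method names are assumptions made for this illustration and are not part of APPROXIMATE. The station rows are those listed in Fig. 1b, and the exact answer is computed here directly from the query predicate (temperature greater than or equal to 80).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approximation:
    """An approximation A = (C, P) of a set-valued answer (illustrative only)."""
    certain: frozenset   # C: objects known to be in the exact answer
    possible: frozenset  # P: objects that may still be in the exact answer

    def objects(self) -> frozenset:
        """All objects in the approximation, C union P."""
        return self.certain | self.possible

    def approximates(self, exact: frozenset) -> bool:
        """True if C is a subset of E and C union P is a superset of E."""
        return self.certain <= exact <= self.objects()

    def better_or_equal(self, other: "Approximation") -> bool:
        """Partial order: C_i contains C_j, P_i within P_j, A_i within A_j."""
        return (self.certain >= other.certain
                and self.possible <= other.possible
                and self.objects() <= other.objects())


# Station rows (id, manager, temp) as listed in Fig. 1b.
stations = frozenset({
    (1, "Lee", 70), (6, "Jones", 85), (7, "Smith", 90), (9, "Smith", 95),
    (10, "Jones", 80), (13, "Smith", 75), (27, "Lee", 75),
})
# Exact answer, computed here directly from the predicate temp >= 80.
exact = frozenset(row for row in stations if row[2] >= 80)

jones = frozenset(row for row in stations if row[1] == "Jones")
a1 = Approximation(frozenset(), stations)                        # nothing processed
a2 = Approximation(jones, stations - jones)                      # Jones rows certain
a3 = Approximation(jones, stations - jones - {(1, "Lee", 70)})   # station 1 ruled out
best = Approximation(exact, frozenset())                         # (E, empty set)
```

Under these assumptions `a3.better_or_equal(a2)` and `a2.better_or_equal(a1)` both hold, while `a2.better_or_equal(a3)` does not, which mirrors the ordering of the approximations discussed above.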

2.2. Semantic support

APPROXIMATE [29] is an approximate query processor that is based on the approximate relational data model described in the previous section; it derives approximate answers to standard relational algebra queries. The data is stored as relations and the user's view remains that of the relational model. This query processor can be implemented on a relational database system, requiring little or no change to the underlying relational architecture. APPROXIMATE can be one of several query processors available in the system.

To produce an approximate answer, APPROXIMATE maintains the following semantic information [10,23] for an efficient implementation. Similar to the approach used to produce intensional answers, data objects are categorized into classes [9,12]. A class corresponds to a relation or a segment of a relation. Classes can be organized into a class hierarchy, such as a specialization hierarchy. A class in a class hierarchy is described by the class attributes and methods that are
specified by the database manager when the classes and class hierarchy are designed. The information supplied by the classes in a class hierarchy is accessed along with the base relations during query processing. Maintaining the semantic information provides an alternative to processing the possible data objects P during query processing. The classes provide information so that templates of the possible data objects can be used instead. This information is provided in the form of class attributes such as: the number of instances of the class, the attributes of the instance variables and the ranges of values for the attributes of the instances of a class.

The value of an approximation is stored in an approximate object. The object has two variables: the certain-part and the possible-part. The value of the certain-part is a certain set C containing all the data objects that are known to be in the exact answer. The value of its possible-part is a set P of possible classes. (A possible class is a class in a class hierarchy.) The set of all instances of the classes in P is the possible set in an approximation (C,P) of the exact answer.

Fig. 2 illustrates an approximate object that corresponds to the approximation in Fig. 1c. The certain sets in both figures are the same. The factory station data objects have been categorized into classes according to the value of the attribute Manager. Instead of the possible data objects, the possible set of the approximation in Fig. 2 is the two classes 'Smith Stations' and 'Lee Stations'. The instances of these possible classes are the data objects in the possible set of stations that are managed by Smith and Lee. In this particular example, an approximation can provide the information needed to react to the conditions at the stations that are in the certain set. It is certainly known that stations 6 and 10 have exceeded the specified temperature.
The semantic information maintained for the possible classes, such as the domains of the attributes of the possible classes, can provide to the user an additional insight about the exact answer to a query. For example, considering the approximation in Fig. 2, the range of temperatures for the Smith station indicates if there are instances of the class with temperatures greater than or equal to 80. Similarly, if a class does not appear as a possible class, and none of its instances are in the certain set, it is known that no stations of that class had a temperature greater than or equal to 80. Hence, semantic information is used during approximate query processing to facilitate the processing of the select operation and such aggregate functions as minimum and maximum, and to eliminate possible classes for processing. For example, if the smallest value of an attribute of a possible class is greater than the current minimum, no value of the possible class can possibly be the exact answer to a query requesting the minimum.
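The use of class-level semantic information to eliminate possible classes can be sketched as follows. This is a simplified illustration and not the implementation of APPROXIMATE: the `ClassInfo` record and both function names are hypothetical, and the temperature ranges are taken from the values listed in Fig. 1b purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassInfo:
    """Hypothetical per-class semantic information for one attribute."""
    name: str
    count: int      # number of instances of the class
    temp_min: int   # smallest Temp value among the instances
    temp_max: int   # largest Temp value among the instances


def prune_for_select_ge(classes, threshold):
    """Keep only classes that may contain an instance with Temp >= threshold."""
    return [c for c in classes if c.temp_max >= threshold]


def prune_for_minimum(classes, current_min):
    """Drop classes whose smallest Temp is greater than the current minimum."""
    return [c for c in classes if c.temp_min <= current_min]


# Ranges derived from the rows of Fig. 1b (an assumption for illustration).
possible = [
    ClassInfo("Smith Stations", count=3, temp_min=75, temp_max=95),
    ClassInfo("Lee Stations", count=2, temp_min=70, temp_max=75),
]

still_possible = prune_for_select_ge(possible, 80)
```

With these ranges, only 'Smith Stations' remains possible for the selection Temp >= 80, because no instance of 'Lee Stations' can reach the threshold. The minimum case works the same way with the comparison reversed.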

id  Manager  Temp
 6  Jones    85
10  Jones    80
-----------------
{Smith Stations, Lee Stations}

Fig. 2. An approximate object.

2.3. Approximate query processing strategy

The basic approximate query processing algorithm of APPROXIMATE works as follows. It begins by representing a query with a query tree. Each node in the query tree represents a relation that is the result of a relational algebra operation. An initial approximation, that is, an approximate object, is assigned to every node in the query tree. Initially, the certain set is empty for every approximate object. The initial value of the possible-part in an approximate object at a node gives a template of all tuples that can possibly be in the base relation represented by that node.

The complete set of relational algebra operations is extended to accept approximate objects as operand(s) and to produce an approximate object as a result. Each of these approximate operations is monotonic and will be further described in Section 4. An additional primitive is an approximate-read that returns a subset of the data objects, or a segment, of the requested base relation at a time. (Information on which segments can be returned by an approximate-read is given by a class hierarchy.) The leaf nodes of the query tree represent the result of an approximate-read operation.

During traditional query processing, a node at a higher level in the query tree can be evaluated only when all operands represented by its children are available. No answer is produced unless all the read requests are granted and all operations performed. During approximate query processing, segments of a relation are retrieved and processed at a time, with the results repeatedly propagated up the query tree. The value of the leaf node improves as more segments are returned from the approximate-read. As each approximate-read of a leaf node is carried out, each returned segment causes additional certain tuples to be added to, and possible classes to be deleted from, the current approximation of the base relation. An improvement in a leaf node is propagated upward to the root node by reevaluating the nodes in the query tree. At each node upward an approximate relational algebra operation is applied to the improved approximation.
The value of the root node is updated with better values each time the root node is reevaluated. If there is enough time to process all of the data, the exact answer is returned; otherwise, the most recent value of the approximate object at the root node is used for the approximate answer. During approximate query processing, the processing of the certain set(s) is done incrementally. Processing of the possible classes is deferred until an answer must be produced. The possible part is maintained as a symbolic expression which contains the class names of the possible data objects. If query processing terminates normally, there are no possible classes to process. If query processing terminates and an approximate answer is provided, the symbolic expression of the possible classes is evaluated to produce the values of the possible tuples in the approximate object. An approximate answer produced by approximate query processing is a set of certain tuples and the domain values of the attributes of the possible classes. Additional I/O and processing time are required by approximate query processing due to the class at a time processing of an approximate read, the approximate relational algebra operations, and the evaluation and printing of the domains of the possible classes if an approximate answer is produced. Some of the factors affecting the increase in execution time of approximate query processing were shown in [30] to be the size of the relation, the number of classes into which the relation is categorized, and the size of the domains of the possible classes. Techniques used in traditional database management systems, such as clustering indexes and hash files, are used to implement the approximate-read and approximate relational algebra operations, respectively. 
For example, to implement an approximate-read the data objects are clustered by class membership to minimize the I/O overhead, which can grow to become large when the number of classes of a relation is increased. In [30] the overhead of approximate query processing is examined and simulation results comparing the performance of approximate query processing to traditional query processing for several queries are presented. When an exact answer was produced by approximate query processing,
the increase in total time for approximate query processing was less than 5% for a relation of size 10 000 tuples and 10 classes, and less than 2% for relations of size 100 000 tuples or larger. When an approximate answer was produced for relations of size 100 000 tuples or larger with 25-100 classes, approximate query processing processed more than 95% of the data during the same amount of time as required to produce an exact answer with traditional query processing. (For smaller size relations the percentage of data processed was 85% or less because the time to evaluate and print the domains of the possible classes for the approximate answer contributed more to the time required to process the query. However, the number of classes can be chosen to maximize the percentage of data processed.) For some real-time systems where deadlines cannot be met, providing an approximate answer based on processing up to 95% of the data is certainly better than providing no answer at all.
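The segment-at-a-time strategy described in this section can be sketched for a single select over one base relation. This is a minimal illustration under stated assumptions, not the algorithm of APPROXIMATE: there is no query tree, the dictionary `segments` stands in for the approximate-read, and the names `approximate_select` and `answer_within` are hypothetical. The deadline is modelled as a budget of segment reads.

```python
from itertools import islice


def approximate_select(segments, predicate):
    """Yield successive approximations (certain rows, possible class names).

    `segments` maps a class name to its rows; fetching one entry stands in
    for one approximate-read (an assumption of this sketch).
    """
    certain = []
    possible = list(segments)            # classes not yet read
    yield list(certain), list(possible)  # initial approximation: C is empty
    for name in list(segments):
        rows = segments[name]            # one approximate-read
        certain.extend(r for r in rows if predicate(r))
        possible.remove(name)            # this class is no longer possible
        yield list(certain), list(possible)


def answer_within(segments, predicate, max_reads):
    """Return the latest approximation available after at most max_reads reads."""
    result = None
    for result in islice(approximate_select(segments, predicate), max_reads + 1):
        pass
    return result


# Station rows from Fig. 1b, clustered by manager as in Fig. 2.
segments = {
    "Jones Stations": [(6, "Jones", 85), (10, "Jones", 80)],
    "Smith Stations": [(7, "Smith", 90), (9, "Smith", 95), (13, "Smith", 75)],
    "Lee Stations": [(1, "Lee", 70), (27, "Lee", 75)],
}


def hot(row):
    """Query predicate: temperature greater than or equal to 80."""
    return row[2] >= 80


after_one_read = answer_within(segments, hot, max_reads=1)
```

After one read, `after_one_read` holds stations 6 and 10 as certain and the classes 'Smith Stations' and 'Lee Stations' as possible, which corresponds to the approximate object of Fig. 2. With a budget of three reads the possible part is empty and the certain part is the exact answer.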

3. Real-time databases

The temporal data and the temporal consistency requirements of a real-time database are now described. Real-time databases are comprised of different collections of temporal data: image objects, derived objects and invariant objects [24,27]. A real-time database requires the specification and enforcement of integrity constraints that are found in conventional databases as well as temporal constraints of the different collections of temporal data that must be specified and enforced to maintain the integrity of the temporal dimension of the data.

3.1. Temporal data

For data that are obtained directly from sensors monitoring real-world objects in a real-time database system, the value of a real-world object is sampled periodically and is written to what is called an image object. Associated with an image object is the most recent sampling time, or valid time. The value of an image object is usually not altered once the value is recorded in the database. Instead, a new value that is sampled from the corresponding real-world object at a later time is written to a new image object. The sampling and writing of the value of an image object can be modelled as a periodic real-time job. Archival relations of image objects are typically maintained in a real-time system, so that different snapshots of the data at different points in time are available. An example of an image object is the temperature at time $t_i$ as sampled by the sensors at different stations on the factory floor. A more complex real-world object, such as the weather, requires sampling the ground temperature, the humidity and the direction of the wind during a time period ending at $t_i$.

A derived object is computed from a set of image objects and/or other data objects during execution of a transaction. The time associated with the derived object is the oldest valid time or transaction time of the data objects used to derive it. Unlike image objects, the value of a derived object in the database may be updated. Archival relations of derived objects may or may not be maintained. An example of a derived object is the projected heading of the aircraft that is computed from the wind speed and other parameters in an auto pilot system or the rate at which a reaction is progressing in the factory. The latter derived object depends on past conditions, such as temperature and pressure trends, to determine the amount of chemicals or coolant to be added to the reaction.

An invariant object is a value that is constant with time.
An example of an invariant object is the weight of an empty aircraft or the number of assembling stations on the floor of the factory. An

88

S.V. Vrbskv / Data & Knowledge Engineering 21 (1997) 79-102

invariant object can be considered either as temporal or non-temporal data. If an invariant object is considered to be temporal data, it is considered to be a special case of temporal data that does not vary with time and the time associated with an invariant object is always the current time.

3.2. Temporal consistency

A real-time database should maintain absolute consistency [24,27] so that the data in the database provide a direct and immediate representation of the physical world. Any changes in the real world should be reflected in the database. If the time of a data object is within some specified threshold of the current time, the object is absolutely consistent. In other words, the time of the data object must be close enough to the actual time so that it reflects the state of the real world. The current time will be denoted as $t_{now}$ and the time of data object $x_i$ will be denoted as $t_{x_i}$. (It is assumed for now that the time of an object is a single point in time.) The age of object $x_i$, denoted as $a(x_i)$, is defined as $a(x_i) = t_{now} - t_{x_i}$. A database object $x_i$ is said to be absolutely consistent if $a(x_i) \le T_a$, where $T_a$ is the absolute threshold.

Invariant objects that are considered as a special case of temporal data will always be temporally consistent. Since the valid time or transaction time associated with the value of an invariant object is always the current time $t_{now}$, for any positive threshold value an invariant object $x_i$ will always be absolutely consistent: $a(x_i) = t_{now} - t_{x_i} = 0$, where $t_{x_i} = t_{now}$. An invariant object will also always be relatively consistent. Given any two invariant objects $x_i$ and $x_j$: $d(x_i, x_j) = |t_{x_i} - t_{x_j}| = 0$, where $t_{x_i} = t_{x_j} = t_{now}$.

4. Temporal data model

In this section, the semantics of approximation used in approximate query processing is extended to serve as a basis for meaningful approximations of queries involving temporal data, called temporal data queries. This semantics includes the temporal dimension of the data and the temporal consistency constraints of the temporal data described above.

A database $DB$ is defined as $DB = \{I_1, I_2, \ldots, I_n, D, V\}$, where $I_n$ is the most recent set of image objects and $I_1, I_2, \ldots, I_{n-1}$ are the archival sets of image objects that comprise the database. An archival set $I_i$ of image objects can have one or more relative consistency sets $RC$, where $I_i = \{RC_{i1}, RC_{i2}, \ldots, RC_{im}\}$. $D$ is the set of derived objects. We assume that only the most recent states of the derived objects are of interest and no archival sets are maintained. $V$ is the set of invariant objects, and no archival sets need be maintained. $T_a$ and $T_r$ are the absolute and relative thresholds, respectively.

The database $DB$ is said to have absolute consistency if:
• (1) $a(x_i) \le T_a$, $\forall x_i \in I_n$,
• (2) $a(x_i) \le T_a$, $\forall x_i$ used to derive $D$,
• (3) $a(x_i) \le T_a$, $\forall x_i \in V$.

In other words, the database is absolutely consistent if the ages of the most recent image objects and the ages of the data objects used to derive the derived objects do not exceed the specified threshold. The invariant objects are always considered to be absolutely consistent.

The database $DB$ is said to have relative consistency if:
• (1) $d(x_i, x_j) \le T_r$, $\forall x_i, x_j \in RC_{mk}$, $\forall RC_{mk} \in I_m$, $\forall I_m \in DB$,
• (2) $d(x_i, x_j) \le T_r$, $\forall x_i, x_j$ used to derive $D$,
• (3) $d(x_i, x_j) \le T_r$, $\forall x_i, x_j \in V$.

In other words, the database is relatively consistent if the dispersion of all of the relative consistency sets of each set of image objects and the dispersion of all of the data objects used to derive the set of derived objects do not exceed the specified threshold. The invariant objects are always considered to be relatively consistent.
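The two consistency definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each data object is reduced to a single integer time stamp, and the function names, sample time stamps and threshold values are hypothetical rather than taken from the paper. Invariant objects are left out because their time is always the current time, so they satisfy both conditions trivially.

```python
from itertools import combinations

def age(t_now, t_x):
    """a(x) = t_now - t_x for an object with a single time stamp."""
    return t_now - t_x

def dispersion(t_xi, t_xj):
    """d(x_i, x_j) = |t_xi - t_xj|."""
    return abs(t_xi - t_xj)

def absolutely_consistent(times, t_now, T_a):
    return all(age(t_now, t) <= T_a for t in times)

def relatively_consistent(times, T_r):
    return all(dispersion(a, b) <= T_r for a, b in combinations(times, 2))

def database_consistent(recent_images, rc_sets, derived_inputs, t_now, T_a, T_r):
    """Return (absolute, relative) consistency flags for the database."""
    # Absolute: most recent image objects and the inputs of derived objects.
    absolute = absolutely_consistent(recent_images + derived_inputs, t_now, T_a)
    # Relative: every relative consistency set, plus the inputs of derived objects.
    relative = (all(relatively_consistent(rc, T_r) for rc in rc_sets)
                and relatively_consistent(derived_inputs, T_r))
    return absolute, relative

# Hypothetical sample data (integer time stamps).
t_now = 1400
recent_images = [1395, 1398, 1399]
rc_sets = [[1300, 1302], [1395, 1398, 1399]]
derived_inputs = [1394, 1398]

result = database_consistent(recent_images, rc_sets, derived_inputs, t_now, T_a=10, T_r=5)
stale = database_consistent([1380], [[1380]], [], t_now, T_a=10, T_r=5)
```

With these sample values, `result` reports both kinds of consistency, while `stale` fails the absolute test because its only image object is 20 time units old.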
Since it is important to reflect the state of the real world in a real-time system, it is assumed that the difference between the valid time and the transaction time is small. Therefore, only the valid time of temporal data objects is considered here. A finite, discrete universe of time instances is assumed, where an interval is a sequence of consecutive time instances which can be represented as integers. $t_1 < t_2$ means that $t_1$ is earlier than $t_2$. The valid time associated with each temporal data object in the database will be called the lifespan [5] of the data object and denoted as $l$. (We refer to image, derived and invariant objects collectively with the more general term of data objects.) A lifespan $l$ of a data object is defined as a finite union of closed intervals. A closed interval is represented as $[t_s - t_e]$, where $t_s$ and $t_e$ are the time instants when the interval starts and ends, respectively, and $t_s \le t_e$. An example of a lifespan is: [10-15] ∪ [20-25]. (These intervals that are lifespans are closed under union, intersection and complementation [8].) A time instant $t \in [t_s - t_e]$ if $t_s \le t \le t_e$. A single instance of time is represented by the interval $[t_s - t_e]$, where $t_s = t_e$. The lifespan of an invariant object is $[t_s - t_e]$, where $t_s = t_e = t_{now}$.

A lifespan can also be associated with a set of data objects or the set of instances of a class. We define a polymorphic function $LS$ that can return the lifespan of a data object $x_i$, the lifespan of a set of data objects $P$, or the lifespan of the instances of a class $O$. $LS(x_i)$ is the lifespan $l$ of data object $x_i$. The lifespan of a set $P$ of data objects is defined as: $LS(P) = LS(x_1) \cup LS(x_2) \cup \ldots \cup LS(x_n)$, $\forall x_i \in P$. The lifespan of the union of two sets of data objects is defined as: $LS(C \cup P) = LS(C) \cup LS(P)$. The lifespan of the instances of a class $O$ is defined as: $LS(O) = LS(x_1) \cup LS(x_2) \cup \ldots \cup LS(x_n)$, $\forall x_i \in inst(O)$, where $inst(O)$ is the set of instances of class $O$.
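As a rough illustration of lifespans as finite unions of closed intervals over integer time instants, the sketch below stores a lifespan as a list of `(t_s, t_e)` pairs. The representation and the helper names (`normalize`, `ls_union`, `contains`, `LS`) are assumptions made for this example only; the paper does not prescribe a data structure. Because time is discrete, two intervals that are adjacent (for example [1-3] and [4-6]) cover the same instants as one merged interval, and the sketch merges them.

```python
def normalize(intervals):
    """Merge overlapping or adjacent closed integer intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def ls_union(*lifespans):
    """Union of any number of lifespans."""
    return normalize([iv for lifespan in lifespans for iv in lifespan])

def contains(lifespan, t):
    """True if t_s <= t <= t_e for some interval of the lifespan."""
    return any(t_s <= t <= t_e for t_s, t_e in lifespan)

def LS(objects):
    """Lifespan of a set of data objects given as (value, lifespan) pairs."""
    return ls_union(*(lifespan for _value, lifespan in objects))

# Hypothetical data objects.
x1 = ("x1", [(10, 15), (20, 25)])   # the example lifespan [10-15] U [20-25]
x2 = ("x2", [(14, 18)])
lifespan_of_set = LS([x1, x2])
```

Here `lifespan_of_set` is `[(10, 18), (20, 25)]`: the intervals [10-15] and [14-18] overlap and collapse into [10-18], while [20-25] stays separate.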

4.1. An approximation

An approximation of an exact answer to a temporal data query can be viewed as occupying a two-dimensional space containing both objects and time. Fig. 3a illustrates the set of temporal data objects that are an exact answer to a temporal data query. In Fig. 3, a set of temporal data objects in our data model is assumed to be similar to homogeneous temporal relations [8], where the lifespans of all the attributes of a data object are the same. It is also assumed that for any lifespan $l$, there is only one data object in the set containing the value of the attributes and whose lifespan is $l$.

A superset of the exact answer can be based on the object dimension. One such approximation to this temporal data query is a superset of objects with the same lifespan as the objects of the exact answer, as illustrated in Fig. 3b. This approach is similar to that used for approximations of exact answers to queries with non-temporal data. Alternatively, an approximation to the exact answer can be based on the time dimension. Such an approximation contains values for the same objects that are in the exact answer, but includes the values of these objects for a superset of lifespans, as illustrated in Fig. 3c. It is emphasized that the objects in the approximation are not a superset of the objects in the exact answer to the temporal data query, but the lifespans are a superset. An approximation can also be based on both the time and object dimensions, as illustrated in Fig. 3d. Such an approximation is a superset of values of the objects of the exact answer for a superset of lifespans.

Fig. 3. Approximations.

(a) The exact answer E

| id | Temp | Lifespan |
|----|------|----------|
| 36 | 33 | [1300-1359] |
| 18 | 33 | [1300-1359] |

(b) A superset of objects

| id | Temp | Lifespan |
|----|------|----------|
| 36 | 33 | [1300-1359] |
| 18 | 33 | [1300-1359] |
| 9 | 40 | [1300-1359] |
| 27 | 40 | [1300-1359] |

(c) A superset of time

| id | Temp | Lifespan |
|----|------|----------|
| 36 | 25 | [1200-1259] |
| 18 | 27 | [1200-1259] |
| 36 | 33 | [1300-1359] |
| 18 | 33 | [1300-1359] |
| 36 | 35 | [1400-1459] |
| 18 | 36 | [1400-1459] |

(d) A superset of objects and time

| id | Temp | Lifespan |
|----|------|----------|
| 36 | 25 | [1200-1259] |
| 18 | 27 | [1200-1259] |
| 9 | 38 | [1200-1259] |
| 27 | 38 | [1200-1259] |
| 36 | 33 | [1300-1359] |
| 18 | 33 | [1300-1359] |
| 9 | 40 | [1300-1359] |
| 27 | 40 | [1300-1359] |
| 36 | 35 | [1400-1459] |
| 18 | 36 | [1400-1459] |
| 9 | 45 | [1400-1459] |
| 27 | 48 | [1400-1459] |

An approximation of an exact answer $E$ to a temporal data query is defined to have a certain part $C$, a possible part $P$ and a lifespan $L$, and is denoted by the 3-tuple $(C,P,L)$. The lifespan of an approximation describes the lifespans of the certain and possible data objects of the approximation. In other words, the lifespan $L$ of an approximation is defined as the union of the lifespans of the data objects in its certain part and possible part, where $L = LS(C) \cup LS(P)$. An approximation $A = (C,P,L)$ is the union of the certain set $C$, where $C \subseteq E$, and the possible set $P$, where $(C \cup P) \supseteq E$, for the lifespan $L$, where $L \supseteq L_E$. $L_E$ is the lifespan of the data objects in the exact answer to the query. Thus, $L$ is a superset of the lifespan of the data objects in the exact answer to a temporal data query. The set $L$ has a minimum value (the smallest $t_s$ value of the intervals) and a maximum value. If $L$ is omitted, it is assumed that the approximation does not involve temporal data, and the only lifespan is the current one.
Given a set of approximations $(C,P,L)$, a partial-order relation is defined over the set as follows. One approximation is considered to be better than another if it has more certain data objects, fewer possible objects, and a shorter lifespan. In other words, $A_i \ge A_j$ if $(C_i \cup P_i) \subseteq (C_j \cup P_j)$, $C_i \supseteq C_j$, $P_i \subseteq P_j$ and $L_i \subseteq L_j$. This partially ordered set of all approximations of an exact answer $E$ is a lattice. In the lattice, $A_0 = (\emptyset, \mathcal{V}, L_0)$ is the least element and worst possible approximation of $E$, where $\mathcal{V}$ is the same as described in Section 2 and $L_0$ is the lifespan of all the possible data objects in $\mathcal{V}$. The greatest element of the lattice is the best possible approximation, which is $(E, \emptyset, L_E)$. While $L_0$ provides the lifespans of all of the data objects of the worst approximation, ultimately $L_E$ will provide the data objects for the subset of those lifespans that are the exact answer. In other words, as query processing progresses, the lifespan of an approximation converges to the lifespan of the exact answer. Thus, if $A_i \ge A_j$, then $LS(C_i) \supseteq LS(C_j)$, $LS(P_i) \subseteq LS(P_j)$ and $LS(C_i \cup P_i) \subseteq LS(C_j \cup P_j)$.
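The partial order can be checked mechanically. The sketch below is a minimal illustration, not the paper's implementation: a data object is a hypothetical `(id, temp, lifespan)` tuple, a lifespan is a frozenset of integer time instants, and an approximation is a `(C, P)` pair whose lifespan $L$ is derived as $LS(C) \cup LS(P)$. The objects are those of Fig. 3; which of them are placed in the certain part of the intermediate approximation is an assumption made for the example.

```python
def period(start, end):
    """Closed interval [start-end] as a frozenset of integer time instants."""
    return frozenset(range(start, end + 1))

def LS(objects):
    """Lifespan of a set of (id, temp, lifespan) data objects."""
    instants = set()
    for _id, _temp, lifespan in objects:
        instants |= lifespan
    return frozenset(instants)

def better_or_equal(a_i, a_j):
    """A_i >= A_j: more certain objects, fewer possible objects, shorter lifespan."""
    c_i, p_i = a_i
    c_j, p_j = a_j
    return ((c_i | p_i) <= (c_j | p_j) and c_i >= c_j and p_i <= p_j
            and LS(c_i | p_i) <= LS(c_j | p_j))

noon, one, two = period(1200, 1259), period(1300, 1359), period(1400, 1459)

# Fig. 3a: the exact answer E.
E = frozenset({(36, 33, one), (18, 33, one)})
# Fig. 3d: a superset of objects and time, used here as the worst approximation.
fig3d = frozenset({
    (36, 25, noon), (18, 27, noon), (9, 38, noon), (27, 38, noon),
    (36, 33, one), (18, 33, one), (9, 40, one), (27, 40, one),
    (36, 35, two), (18, 36, two), (9, 45, two), (27, 48, two),
})

a_worst = (frozenset(), fig3d)
# Hypothetical intermediate state: object 36 is already known to be certain.
a_mid = (frozenset({(36, 33, one)}),
         frozenset({(18, 33, one), (9, 40, one), (27, 40, one)}))
a_best = (E, frozenset())
```

Under these assumptions `a_best >= a_mid >= a_worst` holds in the sense of `better_or_equal`, and the reverse comparisons fail.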

4.2. Temporal consistency constraints of approximations

The age of a set of data objects $P$ is defined as: $a(P) = t_{now} - \min(LS(P))$, where $\min(LS(P)) = \min(\min(LS(x_i)), \forall x_i \in P)$ and $\min(LS(x_i))$ returns the minimum $t_s$ value of the intervals of the lifespan of data object $x_i$. The age of the instances of a class $O$ is similarly defined: $a(O) = t_{now} - \min(LS(O))$, where $\min(LS(O)) = \min(\min(LS(x_i)), \forall x_i \in inst(O))$. The dispersion of a set of data objects $P$ is defined as: $d(P) = \max(LS(P)) - \min(LS(P))$. The dispersion of a set of instances of a class $O$ is defined as: $d(O) = \max(LS(O)) - \min(LS(O))$.

Given an exact answer $E$, the data objects of $E$ will always be temporally consistent. However, all of the data objects of an approximation may not be temporally consistent. An approximation may provide a superset of the lifespan of the exact answer as the approximate answer, and it may not be temporally consistent. We only know that, given an approximation $A = (C,P,L)$ of $E$, the data objects in the certain part $C$ of an approximate answer will always be temporally consistent. Given a set of approximations of $E$, we can prove that as the approximate answers improve monotonically, the age and dispersion decrease and converge to the age and dispersion of the exact answer. The age of an approximation $A_i$ is defined as: $a(A_i) = t_{now} - \min(LS(C_i) \cup LS(P_i)) = t_{now} - \min(L_i)$, and the dispersion is defined as: $d(A_i) = \max(L_i) - \min(L_i)$.

Theorem. If $A_i \ge A_j$, then $a(A_i) \le a(A_j)$.

Definition. $a(A_i) = t_{now} - \min(L_i)$ and $a(A_j) = t_{now} - \min(L_j)$. To show that $a(A_i) \le a(A_j)$, we must show that $\min(L_i) \ge \min(L_j)$. We consider $m_i = \min(L_i)$ and $m_j = \min(L_j)$.

Proof. Given $A_i \ge A_j$ and, by definition, $L_i \subseteq L_j$, we know that if $m_i \in L_i$ then $m_i \in L_j$. If $m_i = \min(L_j)$, then $m_i = m_j$, $t_{now} - m_i = t_{now} - m_j$, and $a(A_i) = a(A_j)$. If $m_i \ne \min(L_j)$ then $\{\exists m_j \mid m_j = \min(L_j), m_j \ne m_i$ and $m_j \in L_j\}$. Since $m_i \in L_i$ and $L_i \subseteq L_j$, if $m_i \in L_j$, $m_j \in L_j$ and $m_j = \min(L_j)$, then it is impossible for $m_j > m_i$, and, instead, $m_i > m_j$. Therefore, if $A_i \ge A_j$, then $a(A_i) \le a(A_j)$. □

Theorem. If $A_i \ge A_j$, then $d(A_i) \le d(A_j)$.

Proof. By definition, $d(A_i) = \max(L_i) - \min(L_i)$ and $d(A_j) = \max(L_j) - \min(L_j)$. We consider $m_i = \min(L_i)$, $m_j = \min(L_j)$, $M_i = \max(L_i)$ and $M_j = \max(L_j)$. Given $A_i \ge A_j$ and, by definition, $L_i \subseteq L_j$, we know that if $m_i \in L_i$ then $m_i \in L_j$, and if $M_i \in L_i$ then $M_i \in L_j$. If $m_i = \min(L_j)$ then $m_i = m_j$, and if $M_i = \max(L_j)$ then $M_i = M_j$ and $M_i - m_i = M_j - m_j$, which implies $d(A_i) = d(A_j)$. If $m_i \ne \min(L_j)$ then $\{\exists m_j \mid m_j = \min(L_j), m_i \ne m_j$ and $m_j \in L_j\}$. Since $m_i \in L_i$ and $L_i \subseteq L_j$, if $m_i \in L_j$, $m_j \in L_j$ and $m_j = \min(L_j)$, then it is impossible for $m_j > m_i$, and, instead, $m_i > m_j$. Similarly, if $M_i \ne \max(L_j)$ then $\{\exists M_j \mid M_j = \max(L_j), M_i \ne M_j$ and $M_j \in L_j\}$. Since $M_i \in L_i$ and $L_i \subseteq L_j$, if $M_i \in L_j$, $M_j \in L_j$ and $M_j = \max(L_j)$, then it is impossible for $M_j < M_i$, and, instead, $M_i < M_j$. Since we have shown that $M_i \le M_j$ and $m_i \ge m_j$, it follows that $M_i - m_i \le M_j - m_j$. Therefore, if $A_i \ge A_j$, then $d(A_i) \le d(A_j)$.

Hence, for a sequence of monotonically improving approximations $A_0, A_1, \ldots, A_n$ of $E$, $a(A_0) \ge a(A_1) \ge \ldots \ge a(A_n)$, where $a(A_n) = a(E)$, and $d(A_0) \ge d(A_1) \ge \ldots \ge d(A_n)$, where $d(A_n) = d(E)$. □
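A small numeric illustration of the two results, written as a sketch: lifespans are sets of integer time instants, and the three nested lifespans and the value of `t_now` are hypothetical sample values (loosely modelled on the sampling periods of Fig. 3), not data from the paper.

```python
def age(lifespan, t_now):
    """a(A) = t_now - min(L)."""
    return t_now - min(lifespan)

def dispersion(lifespan):
    """d(A) = max(L) - min(L)."""
    return max(lifespan) - min(lifespan)

t_now = 1500

# Lifespans of three successively better approximations: L_2 <= L_1 <= L_0.
lifespans = [
    set(range(1200, 1460)),   # L_0: worst approximation
    set(range(1300, 1460)),   # L_1: intermediate approximation
    set(range(1300, 1360)),   # L_2 = L_E: lifespan of the exact answer
]

nested = all(later <= earlier for earlier, later in zip(lifespans, lifespans[1:]))
ages = [age(l, t_now) for l in lifespans]
dispersions = [dispersion(l) for l in lifespans]
```

Because each lifespan is a subset of the previous one, both `ages` and `dispersions` are non-increasing sequences, as the theorems state.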

4.3. Extended approximate relational algebra operations

The monotone approximate relational algebra operations have been redefined to include the temporal dimension. Each approximate operation accepts one or more approximations with temporal data $A = (C,P,L)$ as operands and produces an approximation with temporal data as its result. The complete set of operations (union, select, project, Cartesian product and difference) is redefined and appears in Table 1. (An approximate join operation can be obtained by performing an approximate Cartesian product followed by an approximate select operation.) This table lists the values of the certain-part $C_T$, the possible-part $P_T$ and the lifespan $L_T$ resulting from each of these operations.

Table 1
Approximate relational algebra operations

| Approximate operation $A_T$ | $C_T$ | $P_T$ | $L_T$ |
|---|---|---|---|
| $A_T = A_1 \cup A_2$ | $C_T = C_1 \cup C_2$ | $P_T = (P_1 \cup P_2) - C_T$ | $L_T = LS(C_1 \cup C_2) \cup LS(P_1 \cup P_2)$ |
| $A_T = \sigma_{att=val} A_1$ | $C_T = \sigma_{att=val} C_1$ | $P_T = \sigma_{att=val} P_1$ | $L_T = LS(\sigma_{att=val} C_1) \cup LS(\sigma_{att=val} P_1)$ |
| $A_T = \pi_{att} A_1$ | $C_T = \pi_{att} C_1$ | $P_T = \pi_{att} P_1$ | $L_T = LS(C_1) \cup LS(P_1)$ |
| $A_T = A_1 \times A_2$ | $C_T = C_1 \times C_2$ | $P_T = ((C_1 \cup P_1) \times (C_2 \cup P_2)) - C_T$ | $L_T = LS(C_1 \cup C_2) \cup LS(P_1 \cup P_2)$ |
| $A_T = A_1 - A_2$ | $C_T = C_1 - (C_2 \cup P_2)$ | $P_T = (P_1 - (C_2 \cup P_2)) \cup (P_2 \cap (C_1 \cup P_1))$ | $L_T = LS((C_1 \cup P_1) - C_2) \cup LS(P_2 \cap (C_1 \cup P_1))$ |
| $A_T = \tau_{@L} A_1$ | $C_T = \tau_{@L} C_1$ | $P_T = \tau_{@L} P_1$ | $L_T = LS(\tau_{@L} C_1) \cup LS(\tau_{@L} P_1)$ |
| $A_T = \Omega A_1$ | $C_T = \Omega C_1$ | $P_T = \Omega P_1$ | $L_T = LS(C_1) \cup LS(P_1)$ |
| $A_T = A_1 -_T A_2$ | $C_T = C_1 -_T (C_2 \cup P_2)$ | $P_T = (P_1 -_T (C_2 \cup P_2)) \cup (P_2 \cap^{*} (C_1 \cup P_1))$ | $L_T = LS((C_1 \cup P_1) -_T C_2) \cup LS(P_2 \cap^{*} (C_1 \cup P_1))$ |
| $A_T = \mu A_1$ | $C_T = \mu(C_1 -^{*} P_1)$ | $P_T = \mu(P_1 \cup (C_1 \cap^{*} P_1))$ | $L_T = LS(C_1) \cup LS(P_1)$ |

The $\times$, $\cup$, $\cap$ and $-$ in the $C_T$, $P_T$ and $L_T$ columns denote the set theoretical operations of Cartesian product, union, intersection and set difference, respectively. Additional approximate operations that specifically address the temporal dimension of the data are also needed, and their definitions are included in Table 1. They are the operations: temporal select $\tau$, temporal project $\Omega$, temporal set difference $-_T$ and temporal merge $\mu$. The complete set of approximate relational algebra operations for approximations that do not have temporal data was shown to be monotonic in [26,29]. In Section 5 we expand the proofs to include lifespans, and we also show that the additional approximate temporal operations of Table 1 are monotonic.

For the set theoretical operations of $\times$, $\cup$, $\cap$ and $-$ in the $C_T$, $P_T$ and $L_T$ columns in Table 1, both the value and the lifespan of a data object are considered in the computation of an operation. For instance, two data objects are considered to intersect if their values and their lifespans are identical. When the $*$ appears following a set theoretical operation, only the value of the data object is considered and the lifespan of the data object is ignored. For example, we assume the two sets $R$ and $S$, and the exact answer $E$ for $E = R -^{*} S$. Given two data objects $x_1$ and $x_2$, where $x_1 \in R$, $x_2 \in S$ and $x_1 = x_2$, then $x_1 \notin E$ regardless of the values of $LS(x_1)$ and $LS(x_2)$. For the $\cap^{*}$ operation, two data objects are considered to intersect if their values are the same, regardless of the values of their lifespans.

The approximate union operation $\cup$ in Table 1 is a union operation that eliminates any duplicate data objects whose values and lifespans are the same. An approximate temporal union operation is described at the end of this section.
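To make the Table 1 definitions concrete, the following sketch implements the approximate union and the approximate difference over Python sets. It is an illustration under assumptions: a data object is a hypothetical `(value, (t_s, t_e))` pair, so that set operations compare both value and lifespan as the text requires, and a lifespan is kept as a set of intervals. The station names and periods are invented sample data.

```python
def LS(objects):
    """Lifespan of a set of data objects, kept as a set of (t_s, t_e) intervals."""
    return {lifespan for _value, lifespan in objects}

def approx_union(a1, a2):
    """A_T = A_1 U A_2 following the union row of Table 1."""
    (c1, p1), (c2, p2) = a1, a2
    c_t = c1 | c2
    p_t = (p1 | p2) - c_t
    l_t = LS(c1 | c2) | LS(p1 | p2)
    return c_t, p_t, l_t

def approx_difference(a1, a2):
    """A_T = A_1 - A_2 following the difference row of Table 1."""
    (c1, p1), (c2, p2) = a1, a2
    c_t = c1 - (c2 | p2)
    p_t = (p1 - (c2 | p2)) | (p2 & (c1 | p1))
    l_t = LS((c1 | p1) - c2) | LS(p2 & (c1 | p1))
    return c_t, p_t, l_t

one, two = (1300, 1359), (1400, 1459)

# Hypothetical operands: (certain part, possible part).
a1 = ({("s1", one)}, {("s2", one)})
a2 = ({("s2", one)}, {("s3", two)})
a3 = ({("s2", one)}, {("s1", one)})

union_result = approx_union(a1, a2)
diff_result = approx_difference(a1, a2)
# s1 is only possibly in a3, so it can only possibly remain after subtraction.
diff_uncertain = approx_difference(a1, a3)
```

In `diff_uncertain`, object `s1` moves from the certain part of the first operand to the possible part of the result, because it is only a possible member of the subtrahend.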
The approximate select operation $\sigma$ in Table 1 selects the data objects that satisfy the selection criteria specified for its attribute values. The lifespans of these data objects are unaffected by the operation. The approximate temporal select operation, denoted as $\tau_{@L}$ in Table 1, is similar to the time-slice and select-when [5] temporal operations. The temporal select operation $\tau_{@L}$ restricts the data objects to those times specified by $L$, where $L$ is a Boolean expression on the lifespan $l$. This operation can be used to retrieve those data objects having a specified value as a lifespan. An example of this operation is a request for the conditions at stations during the time period 1100-1130, which uses the select condition $l \ge 1100$ and $l \le 1130$.
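One possible reading of the approximate temporal select is sketched below. It assumes time-slice behaviour, meaning each data object's lifespan is cut down to the requested window and objects with no instant in the window are dropped; the paper's exact semantics may differ. The object layout `(value, frozenset_of_instants)`, the function names and the station data are hypothetical.

```python
def period(start, end):
    return frozenset(range(start, end + 1))

def temporal_select(objects, start, end):
    """Restrict each (value, lifespan) object to the window [start-end]."""
    window = period(start, end)
    selected = set()
    for value, lifespan in objects:
        kept = lifespan & window
        if kept:                      # drop objects with no instant in the window
            selected.add((value, kept))
    return selected

def LS(objects):
    instants = set()
    for _value, lifespan in objects:
        instants |= lifespan
    return frozenset(instants)

def approx_temporal_select(a1, start, end):
    """Apply the select to C_1 and P_1 separately, as in Table 1."""
    c1, p1 = a1
    c_t = temporal_select(c1, start, end)
    p_t = temporal_select(p1, start, end)
    return c_t, p_t, LS(c_t) | LS(p_t)

# Hypothetical approximation of "conditions at stations".
a1 = ({("station 6", period(1100, 1159))},
      {("station 10", period(1200, 1259)), ("station 7", period(1115, 1145))})

c_t, p_t, l_t = approx_temporal_select(a1, 1100, 1130)
```

For the window 1100-1130, station 10 falls outside and is dropped, while the lifespans of stations 6 and 7 are trimmed to the window.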

The approximate project operation $\pi$ in Table 1 projects the specified attributes; the lifespans of the data objects remain associated with the corresponding projected attributes. The temporal project operation, also called the when operation in [5] and denoted in Table 1 as $\Omega$, is similar to the traditional project operation except that only the lifespans of the temporal data are projected. The approximate temporal project operation is defined to return the lifespans of the data objects of an approximation, where $\Omega A = LS(C) \cup LS(P)$.

The approximate Cartesian product $\times$ in Table 1 is performed on any two data objects, regardless of the values of their lifespans. The temporal Cartesian product described in [13] is performed on two data objects only if the intersection of their lifespans is not null. An approximate temporal Cartesian product can be obtained by performing an approximate Cartesian product followed by an approximate temporal select operation, to select only those data objects with a non-null intersection of their two lifespans. Hence, no separate approximate temporal Cartesian product operation is provided.

For the approximate difference operation, $-$ in Table 1, the values and lifespans of two data objects must be exactly the same for the subtraction to succeed. For the approximate temporal difference operation, denoted as $-_T$ in Table 1, a data object is subtracted if its value is equal to, and its lifespan is an improper subset of, that of another data object in the set to be subtracted. The value of the resulting lifespan is also changed by this operation. For example, we assume two sets $R$ and $S$, and the exact answer $E$ for $R -_T S$. Given the two data objects $x_1$ and $x_2$, where $x_1 \in R$, $x_2 \in S$ and $x_1 = x_2$, then $x_1 \notin E$ if $LS(x_1) \subseteq LS(x_2)$. If $x_1 = x_2$ and $LS(x_1) \supset LS(x_2)$ or $LS(x_1) \cap LS(x_2) = \emptyset$, then $x_1 \in E$, but the value of the lifespan of $x_1$ is appropriately altered in the former case to $LS(x_1) = LS(x_1) - LS(x_2)$.
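The temporal difference on exact sets can be sketched as follows. The sketch assumes that both operands are already merged, so each value appears once, and models them as hypothetical dictionaries from value to a set of integer time instants. The text does not spell out the case of a partial overlap; this sketch simply removes the overlapping instants in that case, which is an assumption.

```python
def period(start, end):
    return set(range(start, end + 1))

def temporal_difference(r, s):
    """E = R -_T S for merged operands given as {value: set_of_instants}."""
    result = {}
    for value, lifespan in r.items():
        # Equal values: subtract the lifespan found in S (empty if the value is absent).
        remaining = lifespan - s.get(value, set())
        if remaining:
            # LS(x1) not contained in LS(x2): x1 stays, possibly with a reduced lifespan.
            result[value] = remaining
        # Otherwise LS(x1) is a subset of LS(x2), and x1 is removed entirely.
    return result

# Hypothetical temperature readings keyed by value.
r = {
    "temp=33": period(1300, 1359) | period(1400, 1459),  # superset of the lifespan in S
    "temp=40": period(1300, 1359),                        # subset of the lifespan in S
    "temp=25": period(1200, 1259),                        # value not present in S
}
s = {
    "temp=33": period(1300, 1359),
    "temp=40": period(1200, 1459),
}

e = temporal_difference(r, s)
```

The value `temp=33` survives with its lifespan reduced to [1400-1459], `temp=40` is removed, and `temp=25` is unchanged.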
Given two data objects with the same value but with different lifespans, the two data objects can be viewed as separate entities, corresponding to two different real-world objects. It may be desirable to maintain the integrity of image objects that have the same value for different lifespans. However, the semantics of an application may require that these two data objects be viewed as the same data object and their lifespans merged into one lifespan. We provide an additional temporal operation, called a temporal merge and denoted as $\mu$ in Table 1. The temporal merge operation applied to a set of data objects will union the lifespans of any two or more data objects with the same value so that only one data object results. For example, given the data object $x_1$ occurring twice in a set of data objects with lifespans $l_1$ and $l_2$, after applying the merge operation, $x_1$ will occur once in the set with lifespan $l_1 \cup l_2$.

The temporal merge operation can be used to obtain an approximate temporal union. The temporal union operation requires the merging of data objects that have the same values but different lifespans. An approximate temporal union can be obtained by applying an approximate union operation followed by the approximate temporal merge operation. It should also be noted that the approximate temporal merge operation should be applied to both operands of an approximate temporal difference operation if the operands are not already merged.
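The merge on an exact set of data objects amounts to grouping by value and taking the union of the lifespans. A minimal sketch, assuming data objects are hypothetical `(value, set_of_instants)` pairs:

```python
def period(start, end):
    return set(range(start, end + 1))

def temporal_merge(objects):
    """Union the lifespans of all data objects that share the same value."""
    merged = {}
    for value, lifespan in objects:
        merged.setdefault(value, set()).update(lifespan)
    return merged

# Hypothetical readings: the same (id, temp) value recorded in two sampling periods.
objects = [
    ((36, 33), period(1300, 1359)),
    ((36, 33), period(1400, 1459)),
    ((18, 27), period(1200, 1259)),
]

merged = temporal_merge(objects)
```

After the merge, the value `(36, 33)` occurs once with lifespan [1300-1359] ∪ [1400-1459]. A temporal union of two exact sets could then be obtained, along the lines described above, by concatenating the two object lists and applying `temporal_merge` to the result.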

5. Monotonicity of operations

This section presents the proofs of the monotonicity of the approximate relational algebra operations defined in Table 1 in Section 4.

Theorem A1. The approximate union operation is monotonic.

Proof. The approximate union, $A_T = A_1 \cup A_2$, is defined as follows:

$C_T = C_1 \cup C_2$
$P_T = (P_1 \cup P_2) - C_T$
$L_T = LS(C_1 \cup C_2) \cup LS(P_1 \cup P_2)$

where '$\cup$' and '$-$' in the right hand side of the equations denote standard set union and standard set difference, not the approximate operations being defined here. To show that the approximate union operation is monotonic, we let $A_1$, $A_1'$, $A_2$, $A_2'$, $A_T$ and $A_T'$ be approximations. Also, let $A_1' \ge A_1$ and $A_2' \ge A_2$, $A_T = A_1 \cup A_2$, and $A_T' = A_1' \cup A_2'$. We show below that $A_T' \ge A_T$ by showing that: (1) $C_T' \supseteq C_T$, (2) $P_T' \subseteq P_T$, and (3) $L_T' \subseteq L_T$.

To show that $C_T' \supseteq C_T$, we consider an arbitrary tuple $t$ in $C_T$. By definition, $t \in C_1$ or $t \in C_2$. Since $A_1' \ge A_1$ and $A_2' \ge A_2$, we know that $C_1' \supseteq C_1$ and $C_2' \supseteq C_2$. Therefore, $t \in C_1'$ or $t \in C_2'$, and $t \in C_T'$. In other words, an arbitrary tuple in $C_T$ is always also in $C_T'$, and, hence, $C_T' \supseteq C_T$.

Similarly, to show that $P_T' \subseteq P_T$, we consider an arbitrary tuple $t$ in $P_T'$. $t \in P_1' \cup P_2'$ and $t \notin C_T'$ by definition. Since $A_1' \ge A_1$ and $A_2' \ge A_2$, we know that $P_1' \subseteq P_1$ and $P_2' \subseteq P_2$. Therefore, $t \in P_1$ or $t \in P_2$. We have already shown that $C_T' \supseteq C_T$, and, since $t \notin C_T'$, it follows that $t \notin C_T$. Therefore, $t \in (P_1 \cup P_2) - C_T$, which is $P_T$. Since an arbitrary tuple in $P_T'$ is always also in $P_T$, we know that $P_T' \subseteq P_T$.

Finally, to show that $L_T' \subseteq L_T$, we use the function $LS$ described in Section 4. By definition, $L_1' = LS(C_1' \cup P_1')$, $L_1 = LS(C_1 \cup P_1)$, $L_2' = LS(C_2' \cup P_2')$, and $L_2 = LS(C_2 \cup P_2)$. Since $A_1' \ge A_1$ and $A_2' \ge A_2$, we know that $L_1' \subseteq L_1$ and $L_2' \subseteq L_2$, and we can say that $LS(P_1' \cup C_1') \subseteq LS(P_1 \cup C_1)$ and $LS(P_2' \cup C_2') \subseteq LS(P_2 \cup C_2)$. Therefore, $LS(P_1' \cup C_1') \cup LS(P_2' \cup C_2') \subseteq LS(P_1 \cup C_1) \cup LS(P_2 \cup C_2)$, which is $L_T' \subseteq L_T$. Hence, we have shown that $A_T' \ge A_T$. □
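The statement of Theorem A1 can also be exercised empirically. The sketch below generates random pairs of a worse and a better approximation of the same exact answer and checks the three inclusions for the approximate union. It is a sanity check under assumptions, not a substitute for the proof: data objects are hypothetical `(id, period)` pairs, a lifespan is simplified to a set of period labels, and all names are invented for the example.

```python
import random

def LS(objects):
    """Simplified lifespan: the set of period labels of the objects."""
    return {period for _id, period in objects}

def approx_union(a1, a2):
    (c1, p1), (c2, p2) = a1, a2
    c_t = c1 | c2
    p_t = (p1 | p2) - c_t
    l_t = LS(c1 | c2) | LS(p1 | p2)
    return c_t, p_t, l_t

def random_subset(rng, items):
    return {x for x in items if rng.random() < 0.5}

def random_pair(rng, universe):
    """Return (worse, better) approximations of one random exact answer."""
    exact = random_subset(rng, universe)
    c = random_subset(rng, exact)                               # C is a subset of E
    p = (exact - c) | random_subset(rng, universe - exact)      # C u P covers E
    c_better = c | random_subset(rng, exact - c)                # more certain objects
    p_better = (exact - c_better) | random_subset(rng, p - exact)  # fewer possible objects
    return (c, p), (c_better, p_better)

def union_is_monotone(trials=500, seed=0):
    rng = random.Random(seed)
    universe = {(i, per) for i in range(6) for per in (12, 13, 14)}
    for _ in range(trials):
        a1, a1_better = random_pair(rng, universe)
        a2, a2_better = random_pair(rng, universe)
        c, p, l = approx_union(a1, a2)
        c_b, p_b, l_b = approx_union(a1_better, a2_better)
        if not (c_b >= c and p_b <= p and l_b <= l):
            return False
    return True

holds = union_is_monotone()
```

The same harness could be pointed at the other rows of Table 1 by swapping `approx_union` for another operation.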

Theorem A2. The approximate select operation is monotonic.

Proof. The approximate select, $A_T = \sigma_{att=val} A_1$, is defined as follows:

$C_T = \sigma_{att=val} C_1$
$P_T = \sigma_{att=val} P_1$
$L_T = LS(\sigma_{att=val} C_1) \cup LS(\sigma_{att=val} P_1)$

where $\sigma_{att=val}$ denotes the traditional selection operator that treats $C_1$ and $P_1$ as traditional, exact relations. To show that the approximate select operation is monotonic, let $A_1$, $A_1'$, $A_T$ and $A_T'$ be approximations. Let $A_1' \ge A_1$, $A_T = \sigma_{att=val} A_1$, and $A_T' = \sigma_{att=val} A_1'$ for some attribute 'att' and value 'val'. Since $A_1' \ge A_1$, we know that $C_1' \supseteq C_1$, and, by definition of the select operation, $\sigma_{att=val} C_1' \supseteq \sigma_{att=val} C_1$. Thus, $C_T' \supseteq C_T$. A similar proof can be given for $P_T' \subseteq P_T$. Since $A_1' \ge A_1$, we know that $L_1' \subseteq L_1$ and $LS(P_1' \cup C_1') \subseteq LS(P_1 \cup C_1)$. By definition of the select operation, $LS(\sigma_{att=val} C_1') \cup LS(\sigma_{att=val} P_1') \subseteq LS(\sigma_{att=val} C_1) \cup LS(\sigma_{att=val} P_1)$, which is $L_T' \subseteq L_T$. Therefore, we have shown that $A_T' \ge A_T$. □

Theorem A3. The approximate project operation is monotonic.

Proof. The approximate project, $A_T = \pi_{att} A_1$, is defined as follows:

$C_T = \pi_{att} C_1$
$P_T = \pi_{att} P_1$
$L_T = LS(C_1 \cup P_1)$

where $\pi_{att}$ denotes the traditional project operation on $C_1$ and $P_1$ as if these subsets were exact relations. To show that the approximate project operation is monotonic, let $A_1$, $A_1'$, $A_T$ and $A_T'$ be approximate relations. Let $A_1' \ge A_1$, $A_T = \pi_{att} A_1$, and $A_T' = \pi_{att} A_1'$ for some attribute 'att'. Projection is calculated in two steps: first, the columns of attributes not desired are removed from the original relation; second, any duplicate tuples are removed to form a new relation. To derive $C_T'$ and $C_T$, we apply the first step to $C_1'$ and $C_1$, yielding intermediate forms for each. Since $C_1' \supseteq C_1$, the intermediate form of $C_1'$ consists of two parts: one equal to the intermediate form of $C_1$ and one derived from tuples in $C_1' - C_1$. If we remove duplicate tuples from each part, the result consists of two sets: one equal to $C_T$ and one equal to the second part with its duplicates removed. $C_T'$ is the union of these two sets, therefore $C_T' \supseteq C_T$. That $P_T' \subseteq P_T$ can be shown in a similar manner. Since $A_1' \ge A_1$, we know that $L_1' \subseteq L_1$, and by the definition of lifespan, $LS(C_1' \cup P_1') \subseteq LS(C_1 \cup P_1)$, which is $L_T' \subseteq L_T$. Therefore, we have shown that $A_T' \ge A_T$. □

Theorem A4. The approximate Cartesian product operation is monotonic.

Proof. The approximate Cartesian product, $A_T = A_1 \times A_2$, is defined as follows:

$C_T = C_1 \times C_2$
$P_T = ((C_1 \cup P_1) \times (C_2 \cup P_2)) - C_T$
$L_T = LS(C_1 \cup C_2) \cup LS(P_1 \cup P_2)$

where '$\times$' denotes the traditional Cartesian product of sets. To show that the approximate Cartesian product is a monotonic operator, we consider three pairs of approximations: $A_1$, $A_1'$, $A_2$, $A_2'$, $A_T$ and $A_T'$. Again, let $A_1' \ge A_1$ and $A_2' \ge A_2$, $A_T = A_1 \times A_2$, and $A_T' = A_1' \times A_2'$. We want to show that $A_T' \ge A_T$. We note that $C_T'$ has four parts: $(C_1 \times C_2)$, $(C_1' - C_1) \times C_2$, $C_1 \times (C_2' - C_2)$, and $(C_1' - C_1) \times (C_2' - C_2)$. Since the first part by itself equals $C_T$, $C_T' \supseteq C_T$. Since $A_1' \ge A_1$ we know that $(C_1' \cup P_1') \subseteq (P_1 \cup C_1)$, and, since $A_2' \ge A_2$ we know that $(C_2' \cup P_2') \subseteq (P_2 \cup C_2)$. Therefore, $((C_1' \cup P_1') \times (C_2' \cup P_2')) \subseteq ((C_1 \cup P_1) \times (C_2 \cup P_2))$, because $\times$ is monotonic in the traditional sense. Since $C_T' \supseteq C_T$, a tuple in $P_T'$ (that is, in $((C_1' \cup P_1') \times (C_2' \cup P_2')) - C_T'$) is also in $P_T$, and $P_T' \subseteq P_T$. We have proven that $L_T' \subseteq L_T$ in Theorem A1. Hence, $A_T' \ge A_T$. □

Theorem A5. The approximate set difference operation is monotonic.

Proof. The approximate set difference, $A_T = A_1 - A_2$, is defined as follows:

$C_T = C_1 - (C_2 \cup P_2)$
$P_T = (P_1 - (C_2 \cup P_2)) \cup (P_2 \cap (C_1 \cup P_1))$
$L_T = LS((C_1 \cup P_1) - C_2) \cup LS(P_2 \cap (C_1 \cup P_1))$

where '$\cup$', '$-$' and '$\cap$' in the right hand side of the equations denote standard set union, set difference and set intersection. To show that the approximate difference operator defined here is monotonic, we let $A_1$, $A_1'$, $A_2$, $A_2'$, $A_T$ and $A_T'$ be approximations, where $A_1' \ge A_1$ and $A_2' \ge A_2$, $A_T = A_1 - A_2$ and $A_T' = A_1' - A_2'$.

To show that $C_T' \supseteq C_T$, let $t$ be an arbitrary tuple in $C_T$. $t \in C_1$ and $t \notin (C_2 \cup P_2)$ by definition of $A_T$. Since $A_1' \ge A_1$ we know that $C_1' \supseteq C_1$, and, since $A_2' \ge A_2$ we know that $(C_2' \cup P_2') \subseteq (C_2 \cup P_2)$. If $t \in C_1$, then $t \in C_1'$, and if $t \notin (C_2 \cup P_2)$, then $t \notin (C_2' \cup P_2')$. It follows that $t \in C_1' - (C_2' \cup P_2')$, which is $C_T'$. Since any arbitrary tuple in $C_T$ is also in $C_T'$, we have $C_T' \supseteq C_T$.

To show that $P_T' \subseteq P_T$, we simplify the above $P_T$ expression to $P_T = ((P_1 - C_2) \cup P_2) \cap (C_1 \cup P_1)$ and $P_T' = ((P_1' - C_2') \cup P_2') \cap (C_1' \cup P_1')$. Since $A_1' \ge A_1$ and $A_2' \ge A_2$ we know that $P_1' \subseteq P_1$ and $C_2' \supseteq C_2$, and, hence, $(P_1' - C_2') \subseteq (P_1 - C_2)$. We also know that $P_2' \subseteq P_2$, and, therefore, $((P_1' - C_2') \cup P_2') \subseteq ((P_1 - C_2) \cup P_2)$. As described above, we know that $(C_1' \cup P_1') \subseteq (C_1 \cup P_1)$, and, therefore, $(((P_1' - C_2') \cup P_2') \cap (C_1' \cup P_1')) \subseteq (((P_1 - C_2) \cup P_2) \cap (C_1 \cup P_1))$ and we have $P_T' \subseteq P_T$.

We now show that $L_T' \subseteq L_T$. Since $A_1' \ge A_1$ and $A_2' \ge A_2$ we know that $LS(C_1' \cup P_1') \subseteq LS(C_1 \cup P_1)$, $LS(C_2') \supseteq LS(C_2)$ and $LS(P_2') \subseteq LS(P_2)$. Hence, $LS((C_1' \cup P_1') - C_2') \subseteq LS((C_1 \cup P_1) - C_2)$ and $LS(P_2' \cap (C_1' \cup P_1')) \subseteq LS(P_2 \cap (C_1 \cup P_1))$. Therefore, $LS((C_1' \cup P_1') - C_2') \cup LS(P_2' \cap (C_1' \cup P_1')) \subseteq LS((C_1 \cup P_1) - C_2) \cup LS(P_2 \cap (C_1 \cup P_1))$, which is $L_T' \subseteq L_T$. We have shown that $A_T' \ge A_T$. □

Theorem A6. The approximate temporal select operation is monotonic.

Proof. The approximate temporal select, $A_T = \tau_{@L} A_1$, is defined as follows:

$C_T = \tau_{@L} C_1$
$P_T = \tau_{@L} P_1$
$L_T = LS(\tau_{@L} C_1) \cup LS(\tau_{@L} P_1)$

where $\tau_{@L}$ denotes the temporal select operation on $C_1$ and $P_1$ as if they were exact relations. The temporal select operation is similar to the select operation; the difference is that a select condition is specified for the value of a lifespan. Since $A_1' \ge A_1$, we know that $C_1' \supseteq C_1$, and by definition of the temporal select operation, $\tau_{@L} C_1' \supseteq \tau_{@L} C_1$. Thus, $C_T' \supseteq C_T$. A similar proof can be given for $P_T' \subseteq P_T$ and $L_T' \subseteq L_T$ to show that $A_T' \ge A_T$. □

Theorem A7. The approximate temporal project operation is monotonic.

Proof. The approximate temporal project, $A_T = \Omega A_1$, is defined as follows:

$C_T = \Omega C_1$
$P_T = \Omega P_1$
$L_T = LS(C_1 \cup P_1)$

where $\Omega$ denotes the temporal project operation on $C_1$ and $P_1$ as if they were exact relations. To show that the approximate temporal project operation is monotonic, let $A_1$, $A_1'$, $A_T$ and $A_T'$ be approximations. Let $A_1' \ge A_1$, $A_T = \Omega A_1$ and $A_T' = \Omega A_1'$. By definition, $\Omega C_1' = LS(C_1')$, $\Omega C_1 = LS(C_1)$ and $LS(C_1') \supseteq LS(C_1)$. Hence, $C_T' \supseteq C_T$. Also by definition, $\Omega P_1' = LS(P_1')$, $\Omega P_1 = LS(P_1)$ and $LS(P_1') \subseteq LS(P_1)$. Hence, $P_T' \subseteq P_T$. To show that $L_T' \subseteq L_T$, we know that by definition $LS(C_1' \cup P_1') = L_1'$ and $LS(C_1 \cup P_1) = L_1$. Since $L_1' \subseteq L_1$, it follows that $L_T' \subseteq L_T$. Therefore, we have shown that $A_T' \ge A_T$. □

Theorem A8. The approximate temporal set difference operation is monotonic.

Proof. The approximate temporal set difference, $A_T = A_1 -_T A_2$, is defined as follows:

$C_T = C_1 -_T (C_2 \cup P_2)$
$P_T = (P_1 -_T (C_2 \cup P_2)) \cup (P_2 \cap^{*} (C_1 \cup P_1))$
$L_T = LS((C_1 \cup P_1) -_T C_2) \cup LS(P_2 \cap^{*} (C_1 \cup P_1))$

where '$\cup$' in the right hand side of the equations denotes standard set union, and '$-_T$' and '$\cap^{*}$' are defined in Section 4.3. The proof is the same as for the approximate set difference, i.e. Theorem A5. □

Theorem A9. The approximate merge operation is monotonic.

Proof. The approximate merge, $A_T = \mu A_1$, is defined as follows:

$C_T = \mu(C_1 -^{*} P_1)$
$P_T = \mu(P_1 \cup (C_1 \cap^{*} P_1))$
$L_T = LS(C_1) \cup LS(P_1)$

where '$\cup$' in the right hand side of the equations denotes standard set union, and '$-^{*}$' and '$\cap^{*}$' are described in Section 4.3. To show that the approximate merge operator defined here is monotonic, we let $A_1$, $A_1'$, $A_T$ and $A_T'$ be approximations, where $A_1' \ge A_1$, $A_T = \mu A_1$ and $A_T' = \mu A_1'$. Again, we want to show that $A_T' \ge A_T$ by showing that: (1) $C_T' \supseteq C_T$, (2) $P_T' \subseteq P_T$, and (3) $L_T' \subseteq L_T$. Since $A_1' \ge A_1$ we know that $C_1' \supseteq C_1$ and $P_1' \subseteq P_1$, and it follows that $(C_1 -^{*} P_1) \subseteq (C_1' -^{*} P_1')$. By the definition of the merge operation, we know that if $(C_1 -^{*} P_1) \subseteq (C_1' -^{*} P_1')$, then $\mu(C_1 -^{*} P_1) \subseteq \mu(C_1' -^{*} P_1')$, which is $C_T' \supseteq C_T$. To show that $P_T' \subseteq P_T$, since $A_1' \ge A_1$ we again know that $P_1' \subseteq P_1$ and $C_1' \supseteq C_1$. Hence, $(C_1' \cap^{*} P_1') \subseteq (C_1 \cap^{*} P_1)$ and $(P_1' \cup (C_1' \cap^{*} P_1')) \subseteq (P_1 \cup (C_1 \cap^{*} P_1))$. By definition, $\mu(P_1' \cup (C_1' \cap^{*} P_1')) \subseteq \mu(P_1 \cup (C_1 \cap^{*} P_1))$ and we have $P_T' \subseteq P_T$. We have already shown that $L_T' \subseteq L_T$ in Theorem A3. Thus, we have shown that $A_T' \ge A_T$. □

6. Semantic support

The semantic support in the implementation of APPROXIMATE allows the time dimension of the data in a real-time database to be easily included. The data objects in the database can be grouped into classes not only according to characteristics as described in [29], but also according to the temporal dimension. In addition, an approximate object of APPROXIMATE is expanded to include the time dimension of the data.


6.1. Class hierarchies

The classes in a class hierarchy can be categorized along the object dimension, the time dimension, or both the object and the time dimensions. If the temporal dimension is used, the data objects in the database can be categorized by lifespans and/or temporal consistency. As an example of categorizing data objects along the time dimension, in Fig. 4 we consider the values of temperature and pressure at factory stations managed by three different employees, Jones, Lee and Smith; these are sampled and written to image objects. The classes in Fig. 4a are categorized along the object and time dimensions. Each hierarchy represents a different sampling period of the image objects. Each class has only one instance of the temperature and pressure at a particular station. Fig. 4b presents an alternative design of the class hierarchy with classes categorized along only the time dimension. In Fig. 4b, the lifespans of each class span all of the sampling periods for a 24 h period (1 day). Hence, several temperature and pressure values for the same station are members of the same class. An approximation using a class in this class hierarchy could provide a superset of time for a temporal data query. Categorizing the image objects as illustrated in Fig. 4b maintains the integrity of temporal values by treating the temporal objects as first-class values so that the history of these values can be referenced over time [6]. This approach is useful to answer a query such as 'What is the change in the temperature during the last 5 h today?', by providing the changes in the values over some time period or providing the changes in the values as a function of time. The semantic support of APPROXIMATE also provides a convenient framework for categorizing these image objects into classes according to consistency constraints such as a relative consistency set, so that the instances of a class are relatively and/or absolutely consistent.
The classes are defined so that the data objects of a relative consistency set are the instances of the same class and satisfy relative consistency constraints. For example, the instances of the classes in Fig. 4a show relative consistency. The instances of a class representing the most recent snapshot of the state of the real world satisfy absolute consistency. (Depending on the threshold $T_a$, the instances of more than one class may be absolutely consistent.) Given a class $O_i$ whose instances are the consistency set $RC_i$, $inst(O_i) = RC_i$, if $a(O_i) \le T_a$ and $d(O_i) \le T_r$ we say that the instances of class $O_i$ are absolutely and relatively consistent, respectively, and, as shown in the next section, they can be used to provide an approximation that is temporally consistent.

[Fig. 4. Categorization of classes. (a) Class hierarchies, one per sampling period. (b) Class hierarchy.]

[Fig. 5. An approximate object, showing tuples with id 6 and 10 (Manager Jones; Temp 85 and 80; Lifespan [1400-1459]), the classes {Smith Stations, Lee Stations}, and [1200-1259] ∨ [1400-1459].]
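The consistency test for the instances of a class can be sketched as follows. This is only an illustration under stated assumptions: the age of a set is taken to be the age of its oldest member, and the dispersion of a set is taken to be the largest gap between any two sampling times. The names `ImageObject`, `age`, `dispersion` and `class_consistency`, and all values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageObject:
    """One sampled value; `timestamp` is in minutes (assumed unit)."""
    station: str
    temp: float
    timestamp: int


def age(instances, now):
    """Assumed definition: the age of a set is the age of its oldest member."""
    return max(now - obj.timestamp for obj in instances)


def dispersion(instances):
    """Assumed definition: the largest gap between any two sampling times."""
    stamps = [obj.timestamp for obj in instances]
    return max(stamps) - min(stamps)


def class_consistency(instances, now, t_a, t_r):
    """Return (absolutely consistent, relatively consistent) for inst(O_i)."""
    return age(instances, now) <= t_a, dispersion(instances) <= t_r


# Hypothetical class: one sampling period covering three stations.
period = [
    ImageObject("s1", 85.0, 840),
    ImageObject("s2", 80.0, 845),
    ImageObject("s3", 78.0, 850),
]
result = class_consistency(period, now=870, t_a=60, t_r=15)
```

With these values the oldest sample is 30 minutes old and the samples are at most 10 minutes apart, so both thresholds are met; tightening `t_r` or letting time advance makes the same class fail one or both tests.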

6.2. Approximate object

The value of an approximation $A=(C,P,L)$ is stored in an approximate object $A=(C,P,L)$. An approximate object has three variables: the certain-part, the possible-part and the lifespan. The value of the certain-part is a certain set $C$ containing all the image objects that are known to be in the exact answer. The value of its possible-part is a set $P$ of possible classes, where $P = \{O_1, O_2, \dots, O_n\}$. Fig. 5 illustrates an approximate object $A$. The lifespan $L$ of an approximate object is the union of the lifespans of the certain set and of each possible class $O_i$ of $P$:

$$L = LS(C) \cup LS(O_1) \cup LS(O_2) \cup \dots \cup LS(O_n).$$

An approximate object $A=(C,P,L)$ is relatively consistent if: $d(RC_i)$ [...] $a(A_1) \ge \dots \ge a(A_n)$, where $a(A_n) = a(E)$, and $d(A_0) \ge d(A_1) \ge \dots \ge d(A_n)$, where $d(A_n) = d(E)$. As shown in Section 4.2, the approximate answers converge to the age and dispersion of the exact answer. If all of the classes of the set $P = \{O_1, O_2, \dots, O_n\}$ of an approximate object are relatively consistent and absolutely consistent, an approximate answer can be provided that is also temporally consistent. An approximate object is temporally consistent if it satisfies the following two criteria:

(1) $a(C) \le T_a \wedge a(O_1) \le T_a \wedge \dots \wedge a(O_n) \le T_a$, and
(2) $d(RC_i) \le T_r, \forall RC_i \in C \wedge d(O_1) \le T_r \wedge \dots \wedge d(O_n) \le T_r$.

This new semantics of approximation allows us to provide an approximate answer that converges to the age and dispersion of the exact answer and, when the above two criteria are satisfied, to provide an approximate answer that is both relatively and absolutely consistent.
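A minimal sketch of an approximate object, its lifespan, and the two temporal consistency criteria is given below. It is not the APPROXIMATE implementation: the class names, the representation of a lifespan as a set of (start, end) intervals, and the precomputed age and dispersion fields are assumptions made for illustration. The example values are loosely modeled on Fig. 5, but the lifespans, ages and dispersions assigned to each possible class are invented.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Interval = Tuple[int, int]  # (start, end), e.g. (1400, 1459)


@dataclass(frozen=True)
class PossibleClass:
    """A possible class O_i with assumed, precomputed a(O_i) and d(O_i)."""
    name: str
    lifespan: FrozenSet[Interval]
    age: int
    dispersion: int


@dataclass(frozen=True)
class ApproximateObject:
    certain: Tuple[Tuple[int, str, int], ...]   # C as (id, manager, temp) tuples
    certain_lifespan: FrozenSet[Interval]       # LS(C)
    certain_age: int                            # a(C), assumed precomputed
    certain_dispersions: Tuple[int, ...]        # d(RC_i) for each RC_i in C
    possible: Tuple[PossibleClass, ...]         # P = {O_1, ..., O_n}

    @property
    def lifespan(self) -> FrozenSet[Interval]:
        """L = LS(C) U LS(O_1) U ... U LS(O_n)."""
        result = set(self.certain_lifespan)
        for cls in self.possible:
            result |= cls.lifespan
        return frozenset(result)

    def temporally_consistent(self, t_a: int, t_r: int) -> bool:
        """Check criterion (1) on ages and criterion (2) on dispersions."""
        absolute = self.certain_age <= t_a and all(
            cls.age <= t_a for cls in self.possible)
        relative = all(d <= t_r for d in self.certain_dispersions) and all(
            cls.dispersion <= t_r for cls in self.possible)
        return absolute and relative


obj = ApproximateObject(
    certain=((6, "Jones", 85), (10, "Jones", 80)),
    certain_lifespan=frozenset({(1400, 1459)}),
    certain_age=30,
    certain_dispersions=(10,),
    possible=(
        PossibleClass("Smith Stations", frozenset({(1200, 1259)}), 150, 10),
        PossibleClass("Lee Stations", frozenset({(1400, 1459)}), 30, 10),
    ),
)
```

In this example the hypothetical class "Smith Stations" is 150 minutes old, so the object fails criterion (1) for an age threshold of 60 but satisfies both criteria once the threshold is relaxed to 180.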

7. Conclusions and future work

A real-time database has timing constraints for the processing of its data as well as temporal consistency constraints for its temporal data. Approximate query processing provides a strategy for satisfying the timing constraints of the query, but it does not address the temporal consistency of the temporal data. This paper has shown how the theoretical basis of approximate query processing can be extended to include the temporal dimension and temporal consistency of the data, allowing the presentation of a data model for approximate query processing of real-time databases. The semantics of approximation that provides a subset and a superset of the exact answer was shown to be meaningful for queries when the approximation is based on the object or time dimension. It was shown that approximate answers can be provided that converge to the age and dispersion of the exact answer. Depending upon the design of the class hierarchies, approximate answers can also be provided that are temporally consistent. The approximate relational algebra operations were redefined to include the temporal dimension, and new approximate temporal operations were included. These operations were shown to be monotonic. The semantic support of an implementation that addresses the temporal dimension and the relative and absolute consistency of the data was also described. The overall result of this work is to make an approximate query processing system available that will address the temporal constraints associated with the temporal data as well as the timing constraints of the queries in a real-time database system. Future research will provide an analysis to confirm the feasibility of this system by determining whether the system satisfies the timing constraints of the queries and maintains the temporal consistency requirements of the data in its approximate answers. The limitations of the approximate query processing scheme for real-time databases will also be identified.

Acknowledgments

This work was supported in part by NSF grant IRI-9308494.

References

[1] P. Buneman, S.B. Davidson and A. Watters, A semantics for complex objects and approximate queries, Proc. 7th Symp. on the Principles of Database Systems (1988) 305-314.
[2] S. Chaudhuri, Generalization and a framework for query modification, 6th Int. Conf. on Data Engineering (1990) 138-145.
[3] W.W. Chu, R.-C. Lee and Q. Chen, Using type inference and induced rules to provide intensional answers, Proc. 7th Int. Conf. on Data Engineering (1991) 396-403.
[4] J.Y. Chung, J.W.S. Liu and K.J. Lin, Scheduling periodic jobs that allow imprecise results, IEEE Trans. on Computers 39(9) (Sept. 1990) 1156-1174.
[5] J. Clifford and A. Croker, The historical relational data model (HRDM) revisited, Temporal Databases (Benjamin/Cummings, CA, 1993) pp. 6-26.
[6] J. Clifford, A. Croker and A. Tuzhilin, On the completeness of query languages for grouped and ungrouped historical data models, Temporal Databases (Benjamin/Cummings, CA, 1993) pp. 496-533.
[7] T. Dean and M. Boddy, An analysis of time-dependent planning, Proc. 7th Int. Conf. on AI (1988) 49-54.
[8] S.K. Gadia, A homogeneous temporal database, ACM Trans. on Database Systems 13(4) (1988) 418-488.
[9] M. Hammer and D.M. McLeod, Database description with SDM: A semantic database model, in S.B. Zdonik and D. Maier (eds.), Readings in Object-Oriented Database Systems (Morgan-Kaufman Publishers, CA, 1990) pp. 123-140.
[10] M. Hammer and S.B. Zdonik Jr., Knowledge-based query processing, Proc. 6th Int. Conf. on Very Large Data Bases (1980) 137-147.
[11] W.-C. Hou, G. Ozsoyoglu and B.K. Taneja, Processing aggregate relational queries with hard time constraints, Proc. 5th Int. Conf. on Statistical and Scientific Database Management (1990) 183-199.
[12] R. Hull and R. King, Semantic database modelling: Survey, applications and research issues, ACM Computing Surveys 19(3) (Sept. 1987) 201-260.
[13] C.S. Jensen et al., A consensus glossary of temporal database concepts, SIGMOD Record 23(1) (1994) 52-63.
[14] V.R. Lesser, J. Pavlin and A. Durfee, Approximate processing in realtime problem solving, AI Magazine 9(1) (Spring 1988) 49-61.
[15] K.J. Lin, S. Natarajan and J.W.S. Liu, Imprecise results: Utilizing computations in real-time systems, Proc. IEEE Real-Time Systems Symp. (1987) 210-217.
[16] K.J. Lin, S. Natarajan, J.W.S. Liu and T. Krauskopf, Concord: A system of imprecise computations, Proc. COMPSAC '87 (1987) 75-81.
[17] K. Liu and R. Sunderraman, On representing indefinite and maybe information in relational databases, Proc. 4th Int. Conf. on Data Engineering (1988) 250-257.
[18] L.E. McKenzie Jr. and R.T. Snodgrass, Evaluation of relational algebras incorporating the time dimension in databases, ACM Computing Surveys 23(4) (Dec. 1991) 501-544.
[19] A. Motro, VAGUE: A user interface to relational databases that permits vague queries, ACM Trans. on Office Information Systems 6(3) (July 1988) 187-214.
[20] A. Motro, Using integrity constraints to provide intensional answers to relational queries, Proc. 15th Int. Conf. on Very Large Data Bases (1989) 237-246.
[21] G. Ozsoyoglu, Z.M. Ozsoyoglu and W.-C. Hou, Research in time- and error-constrained database query processing, Workshop on Real-Time Operating Systems and Software, VA (1990) 32-38.
[22] G. Ozsoyoglu, K. Du, S. Guruswamy and W.-C. Hou, Processing real-time non-aggregate queries with time-constraints in Case-DB, 8th Int. Conf. on Data Engineering (1992) 410-417.
[23] J. Peckham and F. Maryanski, Semantic data models, ACM Computing Surveys 20(3) (Sept. 1988) 153-189.
[24] K. Ramamritham, Real-time databases, Int. J. of Distributed and Parallel Databases (1993) 199-226.
[25] S.J. Russell and S. Zilberstein, Composing real-time systems, Proc. 12th Int. Conf. on AI (1991).
[26] K.P. Smith and J.W.S. Liu, Monotonically improving approximate answers to relational algebra queries, Proc. of COMPSAC '89 (1989) 234-241.
[27] X. Song and J.W.-S. Liu, Performance of multiversion concurrency control algorithms in maintaining temporal consistency, Proc. IEEE 14th Annual COMPSAC (1990) 132-139.
[28] S.V. Vrbsky and J.W.S. Liu, Producing approximate answers to set-valued and single-valued queries, 1st Int. Conf. on Information and Knowledge Management (1992) 405-412.
[29] S.V. Vrbsky and J.W.S. Liu, APPROXIMATE: A query processor that produces monotonically improving approximate answers, IEEE Trans. on Knowledge and Data Engineering 5(6) (Dec. 1993) 1056-1068.
[30] S.V. Vrbsky and N. Jukic, Analysis of approximate query processing overhead, Proc. of ISMM Intelligent Information Systems (1995).

Susan V. Vrbsky is an Assistant Professor of Computer Science at the University of Alabama. She received her Ph.D. in Computer Science in 1993 from the University of Illinois, Urbana-Champaign and an M.S. from Southern Illinois University, Carbondale, IL. Her research interests include real-time systems, object-oriented and temporal database systems.