ARTICLE IN PRESS
Information Systems 30 (2005) 133–149
Decision support queries on a tape-resident data warehouse$ Damianos Chatziantonioua,*, Theodore Johnsonb a
Department of Management Science and Technology, Athens University of Economics and Business, Evelpidon 47A, Lefkados, Athens 11362, Greece b AT&T Research Labs, 180 Park Ave, P.O. Box 971, Florham Park, NJ 07932-0971, USA Received 7 June 2003; received in revised form 28 October 2003; accepted 3 November 2003
Abstract Data warehouses collect masses of operational data, allowing analysts to extract information by issuing decision support queries on the otherwise discarded data. In many application areas (e.g. telecommunications), the warehoused data sets are multiple terabytes in size. Parts of these data sets are stored on very large disk arrays, while the remainder is stored on tape-based tertiary storage (which is one to two orders of magnitude less expensive than on-line storage). However, the inherently sequential nature of access to tape-based tertiary storage makes the efficient access to taperesident data difficult to accomplish through conventional databases. In this paper, we present a way to make access to a massive tape-resident data warehouse easy and efficient. Ad hoc decision support queries usually involve large scale and complex aggregation over the detail data. These queries are difficult to express in SQL, and frequently require self-joins on the detail data (which are prohibitively expensive on the disk-resident data and infeasible to compute on tape-resident data), or unnecessary multiple passes through the detail data. An extension to SQL, the extended multi feature SQL (EMF SQL) expresses complex aggregation computations in a clear manner without using self-joins. The detail data in a data warehouse usually represents a record of past activities, and therefore is temporal. We show that complex queries involving sequences can be easily expressed in EMF SQL. An EMF SQL query can be optimized to minimize the number of passes through the detail data required to evaluate the query, in many cases to only one pass. We describe an efficient query evaluation algorithm along with a query optimization algorithm that minimizes the number of passes through the detail data, and which minimizes the amount of main memory required to evaluate the query. These algorithms are useful not only in the context of taperesident data warehouses but also in data stream systems which require similar processing techniques. r 2003 Elsevier Ltd. All rights reserved. Keywords: Tape databases; Decision support; Data warehousing; Query processing
1. Introduction The rapidly declining cost of digital data storage has encouraged the creation of data warehouses. In a typical scenario, an organization stores a detailed record of its operations in a database, which is then analyzed to improve efficiency, detect sales opportunities, and so on. In many applications, the warehoused $
Recommended by F. Carino Jr. *Corresponding author. Tel.: +30-210-613-7689; fax: +30-210-613-7889. E-mail addresses:
[email protected] (D. Chatziantoniou),
[email protected] (T. Johnson).
0306-4379/$ - see front matter r 2003 Elsevier Ltd. All rights reserved. doi:10.1016/j.is.2003.11.003
ARTICLE IN PRESS 134
D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
data sets can become very large. For example, AT&T creates data warehouses of network operations data. This information is vital for optimizing network operations, debugging network problems, fraud detection, verification of billing procedures, and so on. These data sets are collected at the rate of tens of gigabytes per day, and rapidly accumulate into multiple terabytes. Storing these data sets on-line is prohibitively expensive, in spite of the rapid decline in the cost of magnetic disk drives. Tape-resident tertiary storage can provide a high-performance and low-cost storage alternative. A typical installation uses an automated tape library, which consists of a storage rack for a collection of tapes, a set of tape drives, and a robot arm that transfers tapes between the storage rack and the drives. Tape-resident tertiary storage is one to two orders of magnitude less expensive than on-line storage on a per-byte basis [1,2]. Therefore, only the most recent detail data and summaries of older data is stored on-line. Older detail data is stored on tape, often in robotic storage libraries. Traditionally, the tape-resident data is difficult to access (because it is exported from the database), which inhibits experimentation and analysis. The cost advantages of tertiary storage make it very desirable to integrate tertiary storage into the storage hierarchy managed by a database. However, integrating a general purpose (i.e., SQL) database with tertiary storage (especially tape-based tertiary storage) is a very difficult problem because of the unusual access characteristics (very large access latencies and inherently sequential data access) of tape resident data. We observe that one should be able to execute queries on tape-resident data. An automated tape library with multiple high-performance tape drives can provide very high-sequential data access rates (e.g., 50 Mbytes=s or greater) to tens of terabytes of data at a moderate cost [3]. Most decision support queries (e.g., ‘‘report the monthly sales at each store for the year 2002’’) can be computed with a few (often only one) sequential passes through the detail data. So, the queries are well matched to the equipment. However, more complex queries can also be computed in a single pass. Lets consider the following example (which we use throughout the paper). Suppose that we have a data warehouse of records for packet-based communications (which can be voice, data, or both). Each tuple in the database is a summary of a packet that traversed the network (e.g., source, destination, size, protocol information, timestamp, etc.). The tuples are sorted by their timestamps (they are collected in sorted order). Some of the packets are control packets i.e., they affect the state of the network. In particular, some of the packets set up a connection (in which case the StartCall field has the value 1), and others tear down a connection (EndCall is 1). We would like to compute a summary of the connections observed in the data (for example, ‘‘for each connection, report the number of packets sent’’). If the database is sorted by timestamp (as it usually is), it is easy to see a one-pass query plan: scan through the data, look for tuples with the StartCall field set, then count the number of tuples with matching source and destination fields until a matching tuple with the EndCall field set. The contribution of this paper is to show that a large set of decision support queries (including queries that relate sequences of tuples, as in the above query) can be computed efficiently on very large data sets, even when the data resides on tape. To achieve this, we make use of the properties of an extension to SQL, the extended multi feature query language (EMF SQL) [4–6]. Because the EMF SQL query processing algorithm makes scans through the detail data set, it will efficiently process data whether stored on disk, tape or modeled as data streams. We propose two additional optimizations to enable very large scale data processing. The first optimization minimizes the number of scans made over the data set. The second optimization minimizes the size of the active temporary storage required to process the query. We conclude with a discussion of future work.
2. SQL and tape-resident data Because SQL is a standard, general purpose database query language, several researchers have investigated approaches to querying tape-resident data using an SQL database (e.g., [7]). However, doing so
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
135
is difficult. One major problem is that it is easy to ask queries that are prohibitively expensive to compute, if not optimized properly. A particular problem is that one can express ‘‘cross-product’’ joins between tape-resident tables. Evaluating this type of query requires many passes through the tape-resident tables, and the number of required passes increases as the size of the tape-resident table increases [1]. Making a single pass through a large tape-resident table is expensive, making dozens of passes is prohibitive. In addition, SQL is not good at expressing decision support queries [8]. To express such queries in SQL, one has to join/correlate several views and/or subqueries, grouped by one or more sets of grouping attributes. Many reasonable queries are cumbersome to write, understand, and maintain. Furthermore, these queries exhibit a high degree of redundancy, making them difficult to optimize. Succinctness is a desirable property not only from the user’s, but also from the optimizer’s perspective. Let us consider a couple of examples. Example 2.1. The following relation stores the sales of some company. Category attribute denotes the category of the product. SalesðCustomer; Product; Category; Day; Month; Year; QuantityÞ Assume that a data analyst is interested to find for each customer the average quantity of ‘‘toys’’ category and the average quantity of ‘‘clothing’’ category. In standard SQL, one has to define two views, one that contains for each customer the average quantity of the ‘‘toys’’ category and one that contains for each customer the average quantity of the ‘‘clothing’’ category, and get the equijoin of these two views on the Customer attribute: create view Temp1 as select Customer, avgToys=avg(Quantity) from Sales where Category = ‘‘toys’’ group by Customer create view Temp2 as select Customer, avgClothing=avg(Quantity) from Sales where Category = ‘‘clothing’’ group by Customer select Temp1.Customer, Temp1.avgToys, Temp2.avgClothing from Temp1, Temp2 where Temp1.Customer = Temp2.Customer While the SQL expression of the query seems to require two passes over the data set and then an outer join (to retrieve information about customers who purchased toys but not clothing, or vice versa), it is evident that the query can be answered using a single pass over the base table. Example 2.2. Let us recall the scenario in which the data warehouse contains records describing packets that traverse a communications network. This relation has the following schema: PktðSource; Dest; StartCall; EndCall; Length; tsÞ
ARTICLE IN PRESS 136
D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
Suppose that a network engineer wants to know how many packets are involved in each connection, where a connection is a sequence of packets (in ts order) with the same Source and Dest attribute values, starting with a packet where StartCall is 1, and ending with a packet where EndCall is 1. Packets of different connections may be interleaved. We can assume that every packet has a unique value of ts. A direct expression of this query in SQL is create view Temp as select R.Source, R.Dest, R.ts, EndTime=min(S.ts) from Pkt R, Pkt S where R:Source ¼ S:Source AND R:Des ¼ S:Dest AND R:tspS:ts AND R.StartCall=1 AND S.EndCall=1 group by R.Source, R.Dest, R.ts select T.Source, T.Dest, T.ts, count() from Pkt R, Temp T where R.Source = T.Source AND R.Dest=T.Dest AND R:tsXT:ts AND R:tspT:EndTime group by T.Source, T.Dest, T.ts This query is expressed as one cross-product self-join, the creation of a temporary table, and another join. Even on disk-resident data, this query would be difficult to evaluate if current optimization techniques are employed. The astute reader will notice that one can rephrase the query to ensure a faster evaluation, e.g. by creating views instead of using S directly, or to write it in a single block. However, the logic of the query becomes opaque, and the user is burdened with query optimization. Further, multiple passes through the Pkt table would be required. Yet we have seen that only one pass through the Pkt table is needed. As the examples show, SQL not only allows but encourages one to express queries that involve multiple joins and groupby’s and thus are very expensive to evaluate, if they are not optimized properly. Note that both of these queries can be computed by making one pass through the detail data. However, it is very difficult to make this optimization given the expression of the SQL queries.1
2.1. Desired language features We argue that in order to effectively use tape-based tertiary storage in a data warehouse, the queries should be written in a declarative language that expresses even complex queries in a way that is easy to optimize. The declarative aspect of the language frees the user from specifying how the query will be computed. In addition, queries written in a declarative language are portable and easy to reuse. As a side benefit, one can use the same language to query disk-resident data, permitting a seamless integration of the two data stores. The queries can be optimized to make the best use of available resources, and use optimized query processing algorithms. In a previous paper [3], we present an in-depth measurement study of the performance of tertiary storage devices. Tape-based tertiary storage has very large access latencies and has slow random access, but provides high rates of sequential access. Therefore, the query should be evaluated by using a few sequential scans through the tape-resident data, preferably one scan. 1 Note for the reviewer: Our experiments have shown that common SQL databases (e.g., Oracle and Sybase) cannot optimize these queries properly. Please read Section 4.2 for a brief performance comparison between Oracle and our implementation of EMF.
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
137
A query language should encourage the user to express even complex queries in a way that naturally translates into an efficient query plan (e.g., requiring k scans, where k is small). This requires a close mapping between the query language and an efficient query plan. The extended multi-feature syntax presented in the following section allows most decision support queries to be expressed in a succinct, concise, and intuitive way, making user’s job much easier. More important, these queries translate directly to an efficient and easily optimizable implementation which evaluates in at most k scans. 3. Introducing extended multi-feature syntax On-line analytical processing has attracted a lot of attention because large enterprises want to analyze the warehouses of their collected data. Ad hoc decision support queries present two key features that can be exploited either in developing query optimization techniques [9–11], or in equipping a language with appropriate syntactic extensions [8,12–14]: *
*
First, OLAP queries group the relation(s) on a set of attributes and perform some complex (or simple) operation within each group. Although SQL handles simple operations within each group well (e.g. compute avg(salary)), it requires a high degree of redundancy (joins, correlated subqueries) to express more complex operations within each group. This has been addressed in [11,13]. Second, OLAP queries may correlate results of groupings on different sets of attributes. These sets are usually related somehow (e.g. the one is subset of the other.)
Extended multi-feature SQL [4–6] introduces a minimal, yet powerful extension of SQL, based on the concept of grouping variable [13]. EMF SQL has proven useful in expression and evaluation of complex OLAP queries in various domains (e.g. medical/clinical data warehouses [15]), has been used as a reference to a major commercial DBMS [16,17], has been implemented in a query tool prototype [6] and is very similar to and has influenced the specification of the WINDOW construct, now part of ANSI SQL3 [18]. 3.1. Extended multi-feature queries The idea is similar to tuple variables in SQL in which one can define for each row of a table—by using nested sub-queries—one or more relations and compare them appropriately (e.g. membership). In EMF SQL, one can define for each group (specified by the group by clause) one or more subsets of the entire table and perform aggregation (min, max, average, etc.) on top of these. These subsets are called grouping variables. This syntactic extension not only does allow the succinct and declarative expression of decision support queries, but also maps directly to an efficient, easily optimizable implementation that is particularly effective for very large data warehouses, as it is shown in Section 4. 3.1.1. Syntax and semantics The extended multi-feature syntax modifies standard SQL in the following ways: * *
From, Where clauses: There are no changes. Group by clause: The declaration of the grouping variables follows the grouping attributes, separated by a semicolon: group by G1 ; y; Gm ; X1 ; y; Xn :
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
138 *
Such that clause: This newly introduced clause defines the range of the grouping variables. It has the following format: such that C1 ; C2 ; y; Cn ;
*
*
where each Ci is a potentially complex expression (similar to a where clause), used to define Xi ; i ¼ 1; 2; y; n: It involves comparisons between an attribute of Xi and either a constant, a group’s aggregate, a previously defined grouping variable’s aggregate, or one of the grouping attributes. Grouping attributes have the same value for all the tuples of a group. As a result they can be considered as constants as far as the such that clause is concerned. A big fraction of the flexibility and power of extended multi-feature syntax is based on that observation. Select clause: It may contain the grouping attributes, aggregates of the group, and attributes and aggregates of the grouping variables. Having clause: It is an expression involving the grouping attributes, constants, the aggregates of the group, and the aggregates of the grouping variables.
The semantics of EMF can be captured in the relational algebra, by taking a 2n þ 1-way join for each value v of the grouping variables, for an EMF query with n grouping variables. The naive implementation of these semantics may be very expensive. However, Section 4 proposes an efficient evaluation algorithm that captures the gist of extended multi-feature queries. 3.1.2. Examples
Example 3.1. Assume that one is interested to find for each product the ratio of each month’s total sales over the year’s total sales. This query can be expressed using the extended multi-feature syntax as select Product, Month, Year, sum(X.Quantity)/sum(Y.Quantity) from Sales group by Product, Month, Year ; X, Y such that (X.Product=Product and X.Month=Month and X.Year=Year), (Y.Product=Product and Y.Year=Year) The Sales relation is grouped by Product, Month, Year. The grouping variable X denotes all the tuples in that group (the grouping variable X is not necessary, since sum(X.Quantity) could be replaced by sum(Quantity).) The grouping variable Y denotes all the tuples of Sales relation of that group’s product and year. Note that Y may contain tuples not in the group. Example 3.2. A data analyst wants to know for each customer and product, how many purchases of the product by the customer had quantity more than the average quantity sale of that product. This is an extended multi-feature query and can be written as select Customer, Product, count(Y.) from Sales group by Customer, Product ; X,Y such that (X.Product=Product), (Y.Customer=Customer and Y.Product=Product and Y:Quantity > avgðX:QuantityÞÞ
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
139
The Sales relation is grouped by Customer, Product. Then for each (customer,product) value (i.e. for each group) two sets are defined, X and Y ; where X contains all the sales of the current product and Y contains all the purchases of that product by the current customer that exceeded the average quantity sale of the product (Y.Quantity > avg(X.Quantity).) Example 3.3. Assume that we are interested to find for each product the average quantity sold before and after each month of 2001. select Product, Month, avg(X.Quantity), avg(Y.Quantity) from Sales where Year=‘‘2001’’ group by Product, Month ; X, Y such that ðX:Product ¼ Product and X:MonthoMonthÞ; ðY:Product ¼ Product and Y:Month > MonthÞ For each product and month of ‘‘2001’’ we define two sets, X and Y : X contains all the sales of the current group’s product before the current group’s month (X.Month o Month) and Y the sales of that product following that month (Y.Month > Month.)
4. Evaluation and optimization of EMF queries In this section, we present a direct implementation of extended multi-feature queries and optimizations of that implementation. All aggregate functions are presumed to be distributive. We start with two definitions. Definition 4.1. A grouping variable Y depends on grouping variable X ; if some aggregate value of X appears in the defining condition of Y : This is denoted as Y -X : If the defining condition of a grouping variable Y contains aggregates of the group or grouping attributes, then the group is denoted as a grouping variable X0 and we write Y -X0 : The directed acyclic graph that is formed from the grouping variables’ interdependencies is called EMF-dependency graph. Definition 4.2. The output of a grouping variable X ; denoted as outp(X) is the set of the aggregates of X that appear in either the such that clause, or the select clause. 4.1. Evaluation We assume that the select clause of an extended multi-feature query contains only the grouping attributes, aggregates of the group, and aggregates of the grouping variables, i.e. it does not contain attributes of the grouping variables. Under this assumption, a final join between the grouping variables is not necessary and the proposed evaluation algorithm guarantees the answer in at most n þ 1 scans of the relation, where n is the number of the grouping variables. This guarantee is fundamental for tape data warehouses. Let H be a special table, called the mf-structure of an extended multi-feature query, with the following structure. Each row of H; called entry; corresponds to a group. The columns consist of the value of the grouping attributes, the aggregates of the group, and the aggregates of the grouping variables. Let X1 ; y; Xn be the grouping variables of the query, ordered by a topological sort of the EMF-dependency graph. The following algorithm computes correctly all the entries of H:
ARTICLE IN PRESS 140
D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
Algorithm 4.1. Evaluation of extended multi-feature queries: for sc ¼ 0 to n { for each tuple t on scan sc { for all entries of H; check if the defining condition of grouping variable Xsc is satisfied. If yes, update Xsc ’s aggregates appropriately. X0 denotes the group itself. } } This algorithm performs n þ 1 scans of the base relation. On scan i it computes the aggregates of Xi grouping variable (X0 denotes the group.) As a result, if Xj depends on Xi (i.e. if aggregates of Xi appear in the defining condition of Xj ), this algorithm makes sure that Xi ’s aggregates will have been calculated before the jth scan. Note also that given a tuple t on scan i; all entries of table H must be examined, since t may belong to grouping variable Xi of several groups, as in Example 3.1: a tuple t affects several groups with respect to grouping variable Y ; namely those that agree on year with t’s year. Algorithm 4.1 represents an efficient, self-join free direct implementation of the extended multi-feature syntax. This is the main contribution of the extended multi-feature syntax: complex decision support queries not only can be expressed intuitively and declaratively, but also there exists a direct mapping between the representation (syntax) and an evaluation algorithm that guarantees the answer in few scans of the base data, i.e. an algorithm suitable for tape databases. As was mentioned earlier, self-joins are prohibitively expensive for tape data warehouses and large relations in general. The only permitted operation is scanning. Observation 4.1. Given an extended multi-feature query containing n grouping variables (with no attributes of the grouping variables in the select clause), the answer can be computed in at most n þ 1 passes of the base relation. Proof. Straightforward.
&
4.2. Optimization The EMF query evaluation algorithm in Algorithm 4.1 is a naive implementation—many optimizations can be made (presented in length in [5]) that greatly improve performance. For example, Algorithm 4.1 can become very expensive if the mf-structure has a large number of entries, since on each scan, for every tuple, all H’s entries are examined, resulting in an implicit nested-loop join. However, this is not necessary since one can identify, given a tuple t and a grouping variable Xi ; a small number of entries (i.e. groups) that this tuple ‘‘affects’’. Consider Example 3.1 and grouping variable X : Given a tuple t; we know that only one entry of the mf-structure will be affected during the evaluation of grouping variable X (the one that agrees on t’s Product, Month, Year values) because X denotes the group. As a result, a tuple t can belong only to X of one group. Therefore, only one entry is updated. A syntactic analysis shows which indices on H should be built to accelerate the inner loop of Algorithm 4.1. Although tapes have high sequential throughput, it is essential to reduce the number of passes over the base relation as much as possible. A scan of a large relation can take several hours, or even days. In data stream applications we can only afford one scan. In Example 3.3 we defined two grouping variables, X and Y : Algorithm 4.1 would require three passes to evaluate that query. However, since X and Y are independent, two passes would suffice. Therefore, some dependency analysis should be applied to
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
141
determine the order of evaluation of the grouping variables and the minimum number of passes. Section 5.2 will introduce the concept of pass-reducibility, used for dependency analysis.
5. The temporal nature of tape data warehouses A well-designed data warehouse that uses tertiary storage will also have a large on-line component. Typically, a decision support data warehouse stores summaries of the detail data for fast query response. While most queries can be answered by using the summary tables, some queries can only be answered by using the detail data [19]. The bulk of the detail data will be stored on tape, so answering the query requires that tape-resident data be used. In most applications, warehoused data is temporal [20]. Individual tuples in the relation represent activities that occurred in the past. Recalling the context of the activity requires that each tuple include a timestamp attribute. We assume that the tables are sorted in timestamp order. This assumption is reasonable because the tuples are collected in timestamp order (or are almost sorted when collected). In many detail databases, subsets of the tuples can be thought of as sequences. It is often desirable to make an analysis of tuple sequences. One example is the query described in Example 2.2. Other examples include analyzing web clicks before a link was followed, analyzing shopping behavior in a credit card transaction database, computing weekly sales trends, and so on. In the previous section, we showed how common decision support queries can be phrased in EMF SQL and efficiently evaluated. In this section, we discuss further optimizations that are critical for tape-resident databases. This section also shows the suitability of extended multi-feature syntax to OLAP queries involving sequences and streams, and discusses specific optimizations of these queries. Such optimizations are essential for tape data warehousing as the following examples show. These optimizations would be very difficult to identify if standard SQL or some programming language has been used for querying. 5.1. Two representative examples
Example 5.1. Let us recall the query of Example 2.2. Using the same schema, we are trying to find the number of packets in each connection (i.e., a connection starts with a StartCall packet and ends with a EndCall packet). This query can be expressed in the extended multi-feature syntax as select Source, Dest, ts, count(Y.) from Pkt group by Source, Dest, StartCall, ts ; X, Y such that ðX:Source ¼ Source and X:Dest ¼ Dest and X:EndCall ¼ 1 and X:ts > tsÞ ðY:Source ¼ Source and Y:Dest ¼ Dest and Y:tsXts and Y:tspminðX:tsÞÞ having StartCall=1 A group is defined by a connection’s opening packet (which is the only tuple in the group). The grouping variable X is the collection of all later packets which end calls between this source and destination. Therefore, the earliest member of X ends the connection. The grouping variable Y is the set of all tuples from the source to the destination between the start of the connection and the end of the connection.
ARTICLE IN PRESS 142
D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
The number of entries in the mf-structure H of this example may be very large, forcing it to the disk. Furthermore, since all entries of H must be examined, the standard evaluation algorithm may be way too expensive.2 However, two basic and related observations improve the performance tremendously. The optimizer can exploit the fact that Pkt is ordered on ts and produce the answer in one pass, instead of three. Therefore it is important to have a principled way to reduce the number of passes as much as possible. An entry in the mf-structure (i.e. a group, a session) ‘‘opens’’ during the scan and later ‘‘closes’’. In Example 5.1, an entry opens with a StartCall packet and closes with an EndCall packet. As a result, the mf-structure H will contain at any time only a few entries (the maximum number of concurrently open connections is much smaller than the total number of open connections). Therefore it is important to be able to identify when an entry ‘‘closes’’, since this reduces the size of the mf-structure and makes the output of that group available as soon as possible.
*
*
These two optimizations are discussed in length in the following sections. Example 5.2. This example shows the optimizations that can be made on a complex query on a time sequence. Let us assume that R is a relation with attributes c,k,l,m,ts, and that R is ordered in ascending order on ts. Consider the following extended multi-feature query: select c, ts, avg(Z.m), avg(S.m) from R group by c, ts ; X, Y, Z, W, S such that X:ts ¼ ts; Y:ts > tsandY:k > avgðX:kÞ; Z:ts > tsandZ:tspminðY:tsÞ; W:ts > minðY:tsÞandW:l ¼ 1; S:ts > minðY:tsÞandS:tspminðW:tsÞ This query expresses the following idea: Compute for tuples of the same ts the average value of k; and find the first tuple t1 following ts with k greater than the computed average. Output the average value of m in the interval ðts; t1 :ts: Then find the first tuple t2 after t1 :ts that has l=1 and find the average value of m in the interval ðt1:ts; t2:ts: Fig. 1(a) shows ‘‘schematically’’ this query. Using the standard evaluation algorithm described in Section 4, four scans of R would be required to answer that query, since the degree of the EMF-dependency graph, shown in Fig. 1(b), is 4. However, a closer examination of that query reveals that one scan of R would suffice to compute the answer: Given a ts; as the scan proceeds, avgðX :kÞ is computed. When it is time for Y to ‘‘use’’ avgðX :kÞ; we know that avgðX :kÞ has been computed, because Y ’s tuples are later than X ’s tuples. Z keeps aggregating, starting from the tuple with ts greater than the current ts, until the first tuple of Y has been detected. W ‘‘opens’’ after the first tuple of Y and S is similar to Z: All these observations may affect several entries of the mf-structure
* *
*
* *
2
Optimizations described in [5] would let the standard evaluation algorithm avoid examining all entries.
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149 X0
R avg
X
avg
Z
avg
S
(a)
143
ts ts ts ts+1 ts+1 ts+1 ts+2 ts+2 ts+2 ts+2 ts+3 ts+3 ts+3 ts+3 ts+4
X Y min(Y.ts) Z
W
min(W.ts) S
(b)
Fig. 1. Query graphs: (a) the query ‘‘schematically’’; (b) the EMF-dependency graph.
Note that grouping variables Y and W are used merely as time limits for Z and S: Note also, that although c is one of the grouping attributes, it does not affect the ability to compute the answer in one pass. The important point is for ts to be in the grouping attributes.
5.2. Pass-reducibility In this section, we describe the concept of pass-reducibility and give a criterion to identify which grouping variables can be computed on the same pass. Definition 5.1. Assume that Y -X ; Y -X1 ; y; Y -Xk are the dependencies of Y and CY is the defining condition of Y : We say that a grouping variable Y is pass-reducible to X ; if for all groups and all database instances it is possible (i.e. there exists an algorithm) to compute in the same pass the aggregates of X involved in CY and the aggregates of Y ; given that the aggregates of X1 ; y; Xk used in CY have already been calculated. This is denoted as Y ) X : A grouping variable Y may be pass-reducible to X due to another grouping variable Z: In Example 5.2, Y is pass-reducible to X because the tuples needed for Y strictly follow the tuples of X for each group. But this can be deduced from the Y.ts > ts subexpression of CY (which involves the grouping variable X0 :) As a result to detect (syntactically) the pass reducibility of Y to X ; the entire CY must be considered (in the general case) and not just the subexpression(s) related to X : However, we want to check pass-reducibility only between Y and X ; independently of the other grouping variables X1 ; y; Xk involved in CY : Thus, in Definition 5.1 we assume that the aggregates of X1 ; y; Xk used in CY have been already calculated. Note that if a grouping variable Y depends on another grouping variable X ; it does not necessarily mean that Y is not pass-reducible to X : We may have additional information or syntactic conditions that ensure that Y can be evaluated on the same pass with X ’s aggregates in CY : If we do not have any additional information, if X depends on Y ; then X is not pass-reducible to Y : Example 5.3. Grouping variable Y is pass-reducible to X in Example 5.1, because for each group, all tuples of Y precede the first tuple of X : Therefore, we can keep counting for Y until the first tuple of X is detected. Observation 5.1. There is a subtle point of the definition that is, however, important for what follows. Assume that Y -X ; Z-X ; and Y -Z: Further assume that Z is not pass-reducible to X : To check if Y is pass-reducible to X ; Z’s aggregates of CY are considered calculated. That means that some of X ’s aggregates have already been computed at an earlier pass, since Z-X and ZRX : However, X ’s aggregates mentioned in CY are not considered as already evaluated, when pass-reducibility of Y to X is checked.
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
144
We give here an algorithm to determine the minimum number of passes required to evaluate the answer of an extended multi-feature query. Assume that the grouping variables are X0 ; X1 ; y; Xn : Algorithm 5.1. Algorithm to determine the minimum number of passes required. * *
*
Build the EMF-dependency graph G among the grouping variables according to Definition 4.1. Proceed according the reverse topological order of G: Xi is the grouping variable next in that order, iAf1; 2; y; ng: Assume that Xi depends on Y1 ; Y2 ; y; Yk ; where Y1 ; Y2 ; y; Yk in fX0 ; y; Xi 1 ; Xiþ1 ; y; Xn g; ðkX0Þ: Then: degreeðXi Þ ¼ maxf1; fðY1 Þ; y; fðYk Þg; where fðYj Þ ¼
degreeðYj Þ;
if Xi is pass-reducible to Yj ;
degreeðYj Þ þ 1;
otherwise:
A grouping variable X is evaluated on pass degreeðX Þ: An extended multi-feature query is answered in maxfdegreeðX1 Þ; y; degreeðXn Þg passes. This number is called the degree of the extended multi-feature query. Example 5.4. The degree of the queries of Examples 3.1 and 3.3 is 2. The degree of the query of Example 3.2 is 3. It is apparent from the examples of the previous section that we must have a principled way (e.g. some syntactic criteria) to identify pass-reducibility among grouping variables. Furthermore, evaluation Algorithm 4.1 must be changed (e.g. keep some additional info on mf-structure’s entries) to reflect the fact that a grouping variable Y can be computed in the same pass with another grouping variable X : Note that these modifications of the standard algorithm will depend on these syntactic criteria. While pass-reducibility is still an open issue for research, we give here a simple syntactic criterion that covers many extended multi-feature queries similar to 5.1 and 5.2 and explains how evaluation Algorithm 4.1 should be modified to reflect pass-reducibility due to that criterion. Criterion 5.1. Assume that a relation R is ordered (ascending) on one of its attributes t, Q is an extended multifeature query and CX denotes the defining condition of grouping variable X in the such that clause. Further suppose that t is one of the grouping attributes of Q and all defining conditions are conjunctions. Then: (a) If Y -X and one of the subexpressions of CY has the format Y.t op min(X.t)+c, where opAf>; X; ¼g and c a non-negative integer and no other aggregates of X are mentioned in CY ; then Y ) X : X can be X0 (i.e. the group.) (b) If Y -X and one of the subexpressions of CY has the format Y.t op min(X.t)+c, where opAfo; pg and c a non-negative integer and no other aggregates of X are mentioned in CY ; then Y ) X : X can not be X0 : (c) If Y -X and CY involves X ’s aggregates other than min(X.ts), then if syntactically can be proven that X ’s tuples are preceding Y ’s tuples, then Y ) X :
Proof. Case (a) says that Y depends on X and the only aggregate of X ’s required for the definition of Y is min(X.t). That means, since R is ordered on t, that only one tuple of X ; the first one, is required for Y ’s
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
145
computation (with respect to X ). Note that a condition like Y.t > t translates to Y.t > min(X0 :t) and by using the above reasoning, Y ) X0 : Case(b) says that Y depends on X and likewise only the first tuple of X is needed for the computation of Y : However this is not the case when X is X0 ; since given a tuple t; t may affect entries of the mf-structure not identified yet. Case (c) is straightforward. & Example 5.5. Consider Example 5.1. The grouping variable X is pass-reducible to X0 because of case (a) of Criterion 5.1. Y is pass-reducible to X0 and X due to cases (a) and (b), respectively. Applying Algorithm 5.1 we find that the degree of the query is 1, i.e. it can be answered in one pass of the base relation. Example 5.6. Consider Example 5.2. The grouping variable X is pass-reducible to X0 because of case (a). Y is pass-reducible to X0 due to case (a) and pass-reducible to X because of case (c), since all tuples of X have ts equal with group’s ts and Y ’s tuples have ts strictly greater than group’s ts (this can be checked syntactically.) Applying the criterion for the remaining grouping variables and then using Algorithm 5.1, we find that this query has also degree 1. We present now the necessary modifications of Algorithm 4.1 to incorporate pass-reducibility due to Criterion 5.1. The basic idea of Algorithm 4.1 remains the same; the difference is that in one pass, more than one grouping variables may be calculated.3 Algorithm 5.2. Modified algorithm to reflect Criterion 5.1. Q is an extended multi-feature query.
for sc=1 to degreeðQÞ { Y1 ; y; Yksc are the grouping variables with degree sc: Initially, all entries of the mf-structure H have the value NULL for the aggregates of these grouping variables, except for minðX :tÞ; X aX0 ; which have the value þN: for each tuple r on scan sc { for each entry of H; check the defining conditions of Y1 ; y; Yksc according to the order of their dependencies. If some are satisfied, update their aggregates appropriately. } } Example 5.7. Consider Example 5.1, some tuple t1 with t1 :EndCalla1; and an entry h of the mf-structure with h:Source ¼ t1 :Source; h:Dest ¼ t1 :Dest; h:tspt1 :ts: Also assume that the first tuple satisfying X ’s condition for that entry has not been found yet. As a result, minðX :tsÞ of entry h is still þN: Note that Y ’s condition for that entry is satisfied (since t:tso þ N) and thus Y ’s sum is updated (as it should). The first time that X ’s condition is satisfied for entry h; minðX :tsÞ is replaced by the correct value, t:ts: Consider Example 5.2, an entry h; and grouping variable W : until the first tuple of Y is found and minðY :tsÞ set to its correct value, minðY :tsÞ is þN: As a result, W ’s condition can not be satisfied (because W :ts can not be greater than þN) until the first tuple of Y is found. 3 Case (a) of the criterion changes slightly: the op can be only >; except for X0 in which op can still be >; X; ¼ : Otherwise the algorithm contains details that are not in the scope of this paper.
ARTICLE IN PRESS 146
D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
5.3. Active and completed mf-structure entries There are cases, as in Example 5.1 where a group can ‘‘close-out’’ before the end of the scan(s). If it is known somehow that tuples further in the current scan will not affect an entry’s grouping variable’s output that is being calculated in that scan, then that entry is ‘‘completed’’ with respect to that grouping variable. If an entry becomes ‘‘completed’’ with respect to all grouping variables, then it can be removed from the mf-structure, reducing its size. For one-pass queries such as Example 5.1, if output is computed and entries removed as soon as possible, it is likely that the mf table can fit into memory. For multiple pass queries, ‘‘completed’’ entries can be transferred to disk, allowing the ‘‘active’’ part of the mf-structure to be in memory. Definition 5.2. An entry h of the mf-structure H of an extended multi-feature query Q is said to be completed with respect a grouping variable X ; if it is known that tuples further in the current or later scans will not affect the output of X with respect to h: If an entry h is completed with respect to all grouping variables, the entry is called completed. Otherwise it is called active. Once again we must have some syntactic criteria to determine when a grouping variable X closes-out. One way is to know that tuples further in the scan do not affect the output of the grouping variable X of an entry (Criterion 5.2 (a)); a different way is to know that tuples further in the scan will not belong to grouping variable X of an entry (Criterion 5.2 (b).) Criterion 5.2. Assume that a relation R is ordered (ascending) on one of its attributes t, Q is an extended multi-feature query and CX denotes the defining condition of grouping variable X in the such that clause. Further suppose that t is one of the grouping attributes of Q and all defining conditions are conjunctions. (a) If X ’s output consists entirely of min(X.t), then the first time that CX is satisfied for an entry h (on pass degreeðX Þ), h becomes completed with respect to X : (b) If one of the subexpressions of CX has the format X.t op min(Z.t)+c, where opAfo; p; ¼g; c a nonnegative integer, and Z a grouping variable, then the first time that this subexpression is violated for an entry h on pass degreeðX Þ; after the first tuple of X has been identified, h becomes completed with respect to X : Case (a): X ’s output consists entirely of min(X.t), so we are interested in computing just the minimum value of t for X ’s tuples. Assume that r is a tuple on pass degreeðX Þ such that CX is satisfied for an entry h for the first time (i.e. r is the first tuple belonging to X of this entry h). Since R is ordered on t, subsequent tuples satisfying CX will have t’s value greater than or equal to r.t. As a result, min(X.t) = r.t, which means that after locating r, the output of X can not change. Case (b): Since R is ordered on t, the first time that the subexpression X.t op min(Z.t)+c evaluates to false (after it had been evaluated to true at least once) implies that this subexpression will continue to evaluate to false for all the remaining tuples on this pass (because X.t can only increase). Since this subexpression is part (conjunctively) of CX ; we know that there are no tuples further in the pass satisfying CX : Example 5.8. X ’s output in Example 5.1 consists only of min(X.ts). When the first tuple of X for an entry h has been found, then we can ‘‘close-out’’ X with respect to h because of Criterion 5.2(a): the output of X has been found for entry h: Example 5.9. One of the subexpressions of CY in Example 5.1 is Y.ts p min(X.ts). The first time that this subexpression is violated for an entry h means that we can ‘‘close-out’’ Y with respect to h because of
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
147
Criterion 5.2(b): no tuples later in the scan can belong to Y for h; since their ts will be greater or equal than the tuple’s ts that caused the violation. Consider X grouping variable of Example 5.2. X ’s defining condition is X.ts = min(X0 :ts), which means that Criterion 5.2(b) can be applied. Note that X closes-out when CX is violated, after the first tuple for X has been found. Attribute t could be anything on which the relation is ordered. For example, sales of a company may not be ordered on date, but may be ordered on month (due to the recording system of the company—batch updates.) Queries having month in the grouping attributes can exploit Criterias 5.1 and 5.2. Note that these Criteria assume ascending order on attribute t. In case of descending order these criteria are still applicable by replacing min(Xi :t) to max(Xi :t) and using the inversed operators. The þN in Algorithm 5.2 changes to N:
6. Related work The problem of integrating tertiary storage into database management systems has become an active research problem [21]. Sarawagi [22] proposes an architecture by which SQL queries on tape resident data can be processed. A relation is divided into partitions, where a partition is stored contiguously on tertiary media. A database access, such as a join, is transformed into a sequence of operations on disk-resident partitions. In [7], Sarawagi and Stonebraker give query optimization techniques for complex SQL queries. Myllymaki and Livny [1] have investigated disk-to-tape and tape-to-tape join algorithms that do not make use of indices, which they show are prohibitively expensive. Carino et. al. in [23,24] discuss a commercial database system called StorHouse/relational manager that uses and exploits an active storage hierarchy, meaning that SQL queries can be directly posted and executed on all hierarchical storage. This is the main difference between StorHouse and traditional hierarchical storage management systems, which require some post processing of a file or DBA intervention. Tribeca is a database for processing very large network traffic datasets [25]. Because the processing is sequential, querying from tape is supported, but the query language is highly restrictive. Several papers discuss query optimization techniques in the context of standard SQL that are particularly applicable in the processing of complex OLAP queries [9,26,27]. A running example in this paper involves the analysis of sequences of tuples. Sequence databases [28] have been proposed as a more convenient and more efficient way to make these queries. Our concern is with a broader range of decision support queries. The optimizations presented in this paper are applicable in the context of querying data stream systems [29] and the mf-structure can be used as a structure supporting continuous queries [30,31].
7. Conclusions Tertiary storage is one to two orders of magnitude less expensive than on-line storage, making it an attractive option for storing very large databases. However, integrating tertiary storage with general purpose databases is a difficult problem. Tertiary storage databases generally have been designed for applications that fetch large objects, such as multimedia and scientific databases. Very large multi-terabyte data warehouses can be inexpensively implemented by using tertiary storage to store the detail data. We observe that most decision support queries on tape-resident data can be implemented by making a small number of scans. However, queries expressed in conventional database languages such as SQL are difficult to optimize for efficient evaluation on tape-resident data.
ARTICLE IN PRESS 148
D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
Extended multi-feature syntax is an extension of SQL that allows the concise representation of many common decision support queries. These queries require an implicit loop over the values of some grouping attributes and ask for some complex operation for each such value. EMF syntax represents this idea declaratively and succinctly. Aside the obvious user’s benefits, extended multi-feature queries translate directly to an implementation that is particularly attractive for tape-resident data warehouses (and large relations in general) because it computes the answer in a small number of passes over the tape relation. Furthermore, it is relatively easy to optimize this implementation by checking syntactically the such that clause. The optimizations presented in this paper are focused on scalable querying very large data sets: handling the mf-structure in an efficient manner, reducing its size, and minimizing the required number of passes. We note that the methods described in this paper also apply to querying very large disk resident data sets and data streams. In the later, the pass-reducibility optimization can be used to derive whether a complex query can be evaluated in one pass, an essential requirement in these applications. The mfstructure is continuously updated as new data arrives and completed entries can be written out to a database or fed to another system as a stream.
References [1] J. Myllymaki, M. Livny, Relational joins for data on tertiary storage, in: Proceedings of the International Conference on Data Engineering, 1997. [2] J. Gray, G. Graefe, The five-minute rule ten years later, and other computer storage rules of thumb, SIGMOD Record 26 (4) (1997) 63–68. [3] T. Johnson, E. Miller, Benchmarking tape system performance, in: Proceedings of the IEEE Conference on Mass Storage Systems/NASA Goddard Conference on Mass Storage Systems and Technologies, 1998. [4] D. Chatziantoniou, Ad hoc OLAP: expression and evaluation, in: IEEE International conference on Data Engineering, 1999, p. 250. [5] D. Chatziantoniou, Evaluation of ad hoc OLAP: in-place computation, in: ACM/IEEE International Conference on Scientific and Statistical Database Management, 1999. [6] D. Chatziantoniou, The PanQ Tool and EMF SQL for complex data management, in: ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 1999. [7] M.S.S. Sarawagi, Reordering query execution in tertiary memory databases, in: Proceedings of the 22nd Very Large Database Conference, 1996. [8] R. Kimball, K. Strehlo, Why decision support fails and how to fix it, SIGMOD Record 24 (3) (1995) 92–97. [9] S. Chaudhuri, K. Shim, Including group-by in query optimization, in: VLDB Conference, 1994, pp. 354–366. [10] Y. Zhao, P. Deshpande, J. Naughton, A. Shukla, Simultaneous optimization and evaluation of multiple dimensional queries, in: ACM SIGMOD, Conference on Management of Data, 1998, pp. 271–282. [11] D. Chatziantoniou, K. Ross, Groupwise processing of relational queries, in: 23rd VLDB Conference, 1997. [12] J. Gray, A. Bosworth, A. Layman, H. Pirahesh, Datacube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals, in: IEEE International Conference on Data Engineering, 1996, pp. 152–159. [13] D. Chatziantoniou, K. Ross, Querying multiple features of groups in relational databases, in: 22nd VLDB Conference, 1996, pp. 295–306. [14] H. Wang, C. Zaniolo, Database system extensions for decision support: the AXL approach, in: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000, pp. 11–20. [15] S. Johnson, D. Chatziantoniou, Extended SQL for manipulating clinical warehouse data, Annual Symposium of American Medical Informatics Association, 1999. [16] Oracle. Analytic Functions for Oracle 8i. 1999. http://otn.oracle.com/products/oracle8i/pdf/8ifunctions.pdf. [17] S. Bellamkonda, T. Borzkaya, B. Ghosh, A. Gupta, J. Haydu, S. Subramanian, A. Witkowski, Analytic functions in Oracle 8i. Available at: http://wwwdb.stanford.edu/dbseminar/Archive/SpringY2000/speakers/agupta/paper.pdf. [18] F. Zemke, K. Kulkarni, A. Witkowski, B. Lyle, Introduction to OLAP Functions, 1999. ISO/IEC JTC1/SC32 WG3: YGJ-068 = ANSI NCITS H2-99-154r2. [19] P. O’Neil, D. Quass, Improved query performance with variant indices, in: Proceedings of the ACM SIGMOD, 1997. [20] D. Schneider, The ins and outs (and everything inbetween) of data warehousing, in: Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, pp. 1–32. (in Tutorials).
ARTICLE IN PRESS D. Chatziantoniou, T. Johnson / Information Systems 30 (2005) 133–149
149
[21] M. Carey, L. Haas, M. Livny, Tapes hold data too: challenges of tuples on tertiary storage, in: Proceedings of the ACM SIGMOD, 1993, pp. 413–418. [22] S. Sarawagi, Query processing in tertiary memory databases, in: Proceedings of the 21st Very Large Database Conference, 1995, pp. 585–596. [23] F. Carino, J. Burgess, StorHouse/relational manager (RM)—active storage hierarchy database system and applications, in: 17th IEEE Mass Storage Systems Symposium, 2000, pp. 179–186. [24] F. Carino, P. Kostamaa, A. Kaufmann, J. Burgess, StorHouse metanoia—new applications for database, storage and data warehousing, in: ACM SIGMOD, Conference on Management of Data, 2001, pp. 521–531. [25] M. Sullican, A. Heybey, Tribeca: a system for managing large databases of network traffic, Technical Report, Bellcore, 1996. [26] G. Graefe, Query evaluation techniques for large databases, ACM Comput. Surveys 25 (2) (1993) 73–170. [27] W.P. Yan, P.-A. Larson, Eager aggregation and lazy aggregation, in: VLDB Conference, 1995, pp. 345–357. [28] P. Seshadri, M. Livny, R. Raghu, The design and implementation of a sequence database system, in: Proceedings of the 22nd VLDB Conference, 1996. [29] B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and issues in data stream systems, in: Symposium on Principles of Database Systems (PODS), 2002, pp. 1–16. [30] D. Terry, D. Goldberg, D. Nichols, B. Oki, Continuous queries over append only databases, in: ACM SIGMOD, Conference on Management of Data, 1992, pp. 321–330. [31] S. Babu, J. Widom, Continuous queries over data streams, Sigmod Record 30 (3) (2001) 109–120.