Data & Knowledge Engineering 61 (2007) 114–136 www.elsevier.com/locate/datak
Using grouping variables to express complex decision support queries Damianos Chatziantoniou
*
Department of Management Science and Technology, Athens University of Economics and Business, Evelpidon 47A & Lefkados 33, 11362 Athens, Greece Received 2 January 2006; received in revised form 24 March 2006; accepted 16 May 2006 Available online 15 June 2006
Abstract Performing complex analysis on top of massive data stores is essential to most modern enterprises and organizations and requires simple, flexible and powerful syntactic constructs to express naturally and succinctly complex decision support queries. In addition, these linguistic features have to be coupled by appropriate evaluation and optimization techniques in order to efficiently compute these queries. In this article we review the concept of grouping variable and describe a simple SQL extension to match it. We show that this extension enables the facile expression of a large class of practical data analysis queries. Besides syntactic simplicity, grouping variables can be neatly modeled in relational algebra via a relational operator, called MD-join. MD-join combines joins and group-bys (a frequent case in decision support queries) into one operator, allowing novel evaluation and optimization techniques. By making explicit how joins interact with group bys, we provide the optimizer with enough information to use specific algorithms and employ appropriate optimization plans, not easily detectable previously. Several experiments demonstrate substantial performance improvements, in some cases of one or two orders of magnitude. The work on grouping variables have influenced at least one commercial system and the standardization of ANSI SQL and implementations of it have been studied in the context of telecom applications, medical and bioinformatics, finance and others. Finally, current work studies the potential of grouping variables in formulating decision support queries over streams of data, one of the latest research trends in database community. 2006 Elsevier B.V. All rights reserved. Keywords: Decision support queries; OLAP; Query processing; SQL; Relational algebra
1. Introduction Data warehousing, decision support, OLAP and data mining, what many people frequently call business intelligence (BI), has reached a maturity height with abundance of systems, platforms and methods. It has evolved from a niche area for large and highly sophisticated corporations few years ago to an essential *
Tel.: +30 210 613 7689; fax: +30 210 613 7889. E-mail address:
[email protected]
0169-023X/$ - see front matter 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2006.05.001
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
115
operation of any modern organization, small or large. A crucial element of any BI system is the concise and succinct expression of decision support queries and efficient evaluation. In relational database systems an SQL query is translated to an expression in relation algebra, which is then optimized, annotated with evaluation algorithms for each relational operator and executed. In this article we deal with all the stages involved in query processing: review grouping variables, an SQL extension to facilitate expression of complex decision support queries; present a relational operator to map this syntactic construct into algebraic terms; discuss interactions with other relational operators and present transformation rules; and propose evaluation algorithms for this operator. These concepts have been used in numerous real-life applications, such as telecommunications, finance, direct marketing, medical- and bio-informatics. Section 2 presents the concept of grouping variable. A grouping variable provides SQL with an implicit iterator over the values of one or more columns: for each value of the column(s), a set of tuples is defined and one or more aggregates are computed over this. For example, for each customer, get his/her sales in ‘‘NY’’ and compute the total amount; for each product, get its sales on January and compute the average; for each month, get the sales of the previous month and compute the average. The best place to put a grouping variable is next to a group-by clause, which lists the distinct values of one or more attributes (the grouping attributes). We show through a series of examples that grouping variables simplify expression of complex decision support queries compared to standard SQL. In Section 3 we describe a relational operator, called MD-join, used to map grouping variables in relational algebra. It consists of a condition h and two relations B and R, and for each value v of B (i.e. for each record of B) a subset of R is defined and aggregated. B is called the base-values table and R is the detail table. In fact, this operator can express higher abstractions than grouping variables: since B can be any set of values, grouping variable’s computations can be defined over group-bys, cube-bys, or any other relational expression. This operator is a restricted form of a join combined with aggregation which leads to efficient implementations. Interactions with other relational operators and transformation rules are also discussed in this section. MD-joins have a simple implementation, described in Section 4. This evaluation algorithm scans the detail data and performs aggregation on the fly on a specialized structure. Indexing this structure may improve performance significantly. The algebraic transformations of Section 3 allow for more efficient versions of the simple implementation, employing parallelism and in-memory computations. Selections can be pushed past an MD-join, as in traditional joins. Furthermore, several MD-joins can be combined in one ‘‘generalized’’ MD-join and computed simultaneously. Some indicative performance results are also presented in this section. Grouping variables and MD-joins have been exploited in telecommunication applications, medical informatics, bioinformatics and finance to express common data analysis queries. We present typical queries in these application areas along with their formulation in extended SQL in Section 5. We also show that in some cases (e.g. telecommunications), additional information on the detail data (e.g. sorted on time) lead to better implementations of the MD-join operator. We argue that this kind of optimizations is easy to be made exactly due to the structure of the MD-join. In Section 6 we discuss related and current work. Data stream systems, an emerging technology in data management, is currently the focus of intense research activity. In data streams we usually have ‘‘continuous’’ rather than ‘‘one-time’’ queries. The answer to a continuous query is produced over time, reflecting the stream data seen so far. Current work examines whether grouping variables and MD-joins can be used to express and model a practical class of continuous data analysis queries. We conclude this paper with our final remarks in Section 7. All examples in this paper are over the relation: Sales(cust, prod, day, month, year, state, amount) which stores the purchases of customer cust of product prod on day, month, year in state state for some amount. 2. Syntax of grouping variables A set variable is a variable containing rows of a table, i.e. denotes a subset of the table. It is usually the result of a selection operation. It is frequent in data analysis to define a set variable for each distinct value
116
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
of a column (or combination of columns) and then compute some aggregated value. For example, we may say: ‘‘for each state, get the sales in that state on January’’. For each distinct value s of column state we should define a set variable as the rows of table Sales having state=s and month=1. A grouping variable, first introduced in [17] and then refined in [10,11,28], depicts precisely this idea. A grouping variable is attached to a group by clause and for each distinct value of the grouping attributes a new set variable is instantiated. The definition of the grouping variable is given with the newly introduced clause ‘‘such that’’. The previous example could be expressed as group by state; X such that X.state=state and month=1 Formally, the syntactic extensions are: • Group by clause. The group by clause is the same as in standard SQL, with the following addition: after specifying the grouping columns, it may contain a list of grouping variables. For example, we may write: group by product; X1, X2, . . . , Xn1 • Such that clause. This newly introduced clause contains one defining condition for each grouping variable, separated by commas, as shown below: such that C1, C2, . . . , Cn Each Ci is a (potentially complex) condition used to define Xi grouping variable, i = 1, 2, . . . , n, similar to a where clause in standard SQL, i.e. a boolean expression used to select rows from the relation. The terms of Ci may involve attributes of the relation and constants as in standard where clauses and, in addition, grouping columns, aggregates of the group and aggregates of the X1, . . . , Xi1 grouping variables. Part (v) implies that aggregates of grouping variables appearing earlier in the list can be used to define grouping variables later in the list. This is semantically sound, since earlier computed aggregates can be considered as constants later in the process. • Select clause. The select clause is the same as in standard SQL, with the following addition: attributes and aggregates of the grouping variables can also appear in the select clause. • Having clause. The having clause is extended to contain aggregates of the grouping variables. With these syntactic extensions, the group by clause acts as an implicit iterator over the values of the grouping attributes. The group itself can be considered as another grouping variable, denoted as X0. Aggregates of the group are considered as aggregates of the X0 grouping variable. Each of the following examples over the Sales table is a typical representative of a well-known class of decision support queries: with pivoting we can ‘‘transpose’’ a table and view row data as columns; with hierarchical aggregation we compare finer granularity to coarser granularity aggregates; with cumulative aggregation we can identify trends or compute aggregates over a moving window; with correlated aggregation we are able to refine a selection of rows by using aggregates computed earlier in the process. These cases cover some of the most frequent decision support queries. Interested readers are directed to [12], where more examples can be found. All standard SQL formulations of these examples are cumbersome to express in SQL. In all cases, repeated joins and group-bys and/or subqueries have to be employed. Example 2.1 (Pivoting – data become columns). Assume that we want to find for each customer the average sale amount in ‘‘NY’’, the average sale amount in ‘‘CT’’ and the average sale amount in ‘‘NJ’’, but only if New York’s average is greater than the other two. The having clause can be used to select the appropriate groups.
1
To be precise, the list of the grouping variables has the form: group by prod; X1(R1), X2(R2), . . . , Xn(Rn), where Ri, i = 1, 2, . . . , n is a relation not necessarily the same with the one in the from clause. In other words, grouping variables can aggregate data from different sources, possibly distributed and in different platforms and formats. For example, a ‘‘group by prod; X(Sales),Y(Orders), Z(Web)’’ uses table Sales to compute aggregates of X, uses table Orders to compute aggregates of Y and uses table Web to compute aggregates of Z. Sales may be a flat file stored in comma delimited format, Web may be another flat file piped in through a remote connection and Orders may be an Oracle table stored locally.
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
117
select cust, avg(x.amount), avg(y.amount), avg(z.amount) from Sales where year=2004 group by cust ; x,y,z such that x.cust=cust and x.state=‘NY’, y.cust=cust and y.state=‘CT’, z.cust=cust and z.state=‘NJ’ having avg(x.sale) > avg(y.sale) and avg(x.sale) > avg(z.sale) In this example, for each customer cust, x grouping variable contains the rows of table Sales that agree on cust and have state equal to ‘‘NY’’ (i.e. the sales of customer cust in ‘‘NY’’), y grouping variable contains the sales of customer cust in ‘‘CT’’ and z grouping variable contains the sales of customer cust in ‘‘NJ’’. For each customer, we just want to print the average of sale amount of x, y and z subsets, if the first is greater than the other two. Example 2.2 (Hierarchical queries – aggregate along some hierarchy/roll-up). Suppose that we want to identify ‘‘good’’ and ‘‘bad’’ months of a product’s sales, as well as interesting irregularities: ‘‘For each product show each month’s total sales as percentage of this product’s year-long total sales’’ select prod, month, sum(x.amount)/sum(y.amount) from Sales where year=2004 group by prod, month ; x,y such that x.prod=prod and x.month=month, y.prod=prod In this example, for each combination of product and month (i.e. group by prod, month), x grouping variable contains the rows of table Sales that agree on prod and month, i.e. the sales of product prod on month month and y grouping variable contains the rows of Sales that agree only on prod, i.e. all the sales of the product. For each product and month, we want to output the total amount of x subset divided by the total amount of y subset. Example 2.3 (Trends – moving aggregate along some time sequence). One may want to identify those months of 2004 that were ‘‘significant’’ for the sales of a product, e.g. those months that signify an increase on the average sale of the product compared to the previous period: ‘‘For each product and each month of 2004, show the product’s average sale of sales preceding the month and the product’s average sale of sales following the month (i.e. two columns per product and month)’’ select cust, month, avg(x.amount), avg(y.amount) from Sales where year=2004 group by cust, month ; x, y such that x.cust=cust and x.month
month Once again, for each combination of customer and month (i.e. group by cust, month), x contains the sales of customer cust (x.cust=cust) before month month (x.month
118
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
this list. Assume that we want for each product and each month of 2004 to count the sales with sale amount between previous month’s and following month’s average sale amount (of that product). select prod, month, count(z.*) from Sales where year=2004 group by prod,month ; x,y,z such that x.prod=prod and x.month=x.month1, y.prod=prod and y.month=y.month+1, z.prod=prod and z.month=month and z.amount > avg(x.amount) and z.amount < avg(y.amount) Example 2.5 (Inter-dimensional comparisons). One can discover customers who have the potential to increase their purchases in one or more products by comparing for each customer and product, the customer’s average sale amount on this product versus the average sale amount of the other customers on this product. select cust, prod, avg(x.amount), avg(y.amount) from Sales group by cust, prod ; x, y such that x.cust=cust and x.prod=prod, y.cust<>cust and y.prod=prod In [12] we have presented a prototype tool (PanQ tool) extending SQL with grouping variables. We have called this version of SQL, EMF SQL (Extended Multi-Feature SQL). 3. Algebraic formulation In traditional relational database systems an SQL query translates to a relational algebra expression which is then optimized by the query engine and an execution plan is produced. As a result, it is essential to be able to express grouping variables in relational algebra. In this section we present an operator, called MD-join [13,4–6], which generalizes the conventional notion of group-by [21]: it clearly distinguishes between the definition of the ‘‘base values’’ used to aggregate-by and the actual computation of aggregates of these. Grouping variables represent the latter. Several algebraic transformations are also presented to be used during the optimization phase. 3.1. Definition A grouping variable X expresses the following idea: for each distinct value v in grouping attribute(s) C of relation R, define a subset Xv using a condition h and compute one or more aggregated values over Xv. Then attach these aggregated values next to v to formulate the output row of the resulting table. Example 3.1. Consider the query: ‘‘for each state show the total sales on January’’. We should define for each distinct value s of column state in table Sales a set Xs as rstate=s and month=1(Sales), get the sum(Xs.amount), concatenate it with s to form a new output row and proceed with the next value of state. This process is what we try to model algebraically with the MD-join operator. The above-mentioned example is not easily represented by traditional relational operators: it requires repeated use of aggregates and joins. A well-known aggregate operator [21], F, is defined as hgrouping columnsiF½h function listiðRÞ where hgrouping_columnsi is the list of grouping attributes of the relation R, and hfunction_listi is a list of the aggregate functions (min, max, count, average, sum), accompanied by an appropriate attribute of the relation specified in R.
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
119
While this formulation is sufficient to capture simple group-by semantics, it has two shortcomings that we try to overcome with the MD-join operator: 1. The aggregate-by values (the values on which aggregation is based) are always the result of a select distinct hgrouping-columnsi from R. It is not difficult to conceive examples where the aggregation is based on values resulted by an arbitrary relational expression, e.g. the cube-by values. Of course we can always express this by using joins, but succinctness and conciseness in algebra usually lead to better execution plans. 2. Aggregation takes place once. We cannot express successive aggregations over different – and possibly correlated – subsets of the relation on each succession. Once again multiple joins can be employed to achieve this, but there are optimization and implementation opportunities (simultaneous aggregate computation, main-memory evaluation, parallel processing) that are difficult to identify this way. Definition 3.1. Let B and R be relations, h a condition involving attributes of B and R and l a list of aggregate functions (f1, f2, . . . , fn) over c1, c2, . . . , cn of R. We define a new relational operator between B and R, called the MD-join, denoted as MDðB; l; h; RÞ with the following semantics: 1. Table B is augmented with as many columns as the number of aggregate functions in l. Each column is named as fi_R_ci, i = 1, . . . , n (e.g. avg_Sales_sale). If a duplicate name is generated, the table R must be renamed. 2. For each row r of table B we find the set S of tuples in R that satisfy h with respect to r, i.e. when B’s attributes in h are replaced by the corresponding r’s values. Then, the value of column fi_R_ci of row r is the fi(ci) computed over tuples of S, i = 1, . . . , n. B is called the base-values relation (or table) and R is called the detail relation (or table). This definition is quite general to capture grouping variables’ computation and more. B represents the group-by structure of an EMF SQL query; condition h corresponds to the defining condition of the grouping variable in the such that clause; the list of aggregate functions l corresponds to the grouping variable’s aggregates mentioned in the select, having or such that clauses. Example 3.1 is represented in relational algebra as MDðB; Sales; sumðamountÞ; Sales:state ¼ state and Sales:month ¼ 1Þ where B contains the distinct values of state column (i.e. select distinct state from Sales). The definition of the MD-join operator allows the user a tremendous amount of flexibility in defining an aggregation query, as B and R can be arbitrary relational expressions and h can be an arbitrary join predicate. Note however that the row count of the result of the MD-join is the same as the row count of B (i.e., the MDjoin performs an outer join). This semantics more accurately captures the user’s intentions than the standard aggregation does. In addition, this property is valuable for efficient implementation and optimization. Note also that the MD-join operator can be considered as a shortcut for a somewhat more complex expression. However, the expression that the MD-join represents occurs very often in OLAP queries. By understanding the properties of the operator, we can easily obtain many query transformations leading to efficient evaluation plans, as will be shown in Section 3.2. Many decision support queries ask for something more complex than a simple average or total. Examples 2.1–2.5 are instances of queries requiring complex computations. The semantics of the MD-join operator has been defined in such a way that several MD-joins can be combined in a sequence to carry out most complex computations. The MD-join does not change the size of the base-values table (in terms of row count), it only adds columns. This is the case when we have more than one grouping variables in a query.
120
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
Example 3.2. Example 2.1 (without the having condition) can be expressed in relational algebra using MDjoins as:2 MDðMDðMDðB; Sales; avgðamountÞ; h1 Þ; Sales; avgðamountÞ; h2 Þ; Sales; avgðamountÞ; h3 Þ where h1 h2 h3
Sales.cust=cust and Sales.state=‘‘NY’’, Sales.cust=cust and Sales.state=‘‘CT’’, Sales.cust=cust and Sales.state=‘‘NJ’’,
and B is the resulting table of a select distinct cust from Sales. The fact that the base relation B can be an arbitrary relational expression and not just the distinct values of one or more columns – as in the case of group by – allows the concept of grouping variable to be extended and used in more general contexts. In essence, B can be any set of values. For example B can be a collection of interesting points for a data mining algorithm [24], a series of time intervals, or all the possible base value combinations of a cube-by operation [25]. Example 3.3. Suppose that we want to count how many sales were above the average sale, when table Sales is broken down by all possible combinations of prod, month, state. The last sentence implies a data cube structure over prod, month and year attributes; however, we want to compute something more complex than a simple aggregate. Even if the cube by syntax [25] is available in our version of SQL, it can not be used. Rather, the user has to define eight group bys, join each one with the Sales table and perform eight new group bys. In [38], Ross, Srivastava and Chatziantoniou argued that the grouping variables and the cube by syntaxes should be merged for succinctness and performance reasons (presented briefly in Section 5.1). By defining MD-join as in Definition 3.1 we are able to express this computation as follows: MDðMDðB; Sales; avgðamountÞ; h1 Þ; Sales; countðÞ; h2 Þ where h1 h2
Sales.prod=prod and Sales.month=month and Sales.state=state, Sales.prod=prod and Sales.month=month and Sales.state=state and Sales.amount > avg_Sales_ amount,
and B is the data cube of prod, month and state attributes.3 By allowing B to be an arbitrary set of values, grouping variables’ computation is conceptually detached from the group by structure and its meaning is generalized. We show that this is useful not only in terms of expression but also in terms of evaluation via a set of algebraic transformations. 3.2. Algebraic transformations In this section we discuss how the MD-join operator interacts with other relational operators and present several algebraic transformations useful for generating optimized query plans. Thus we show that the MD-join contributes to the efficient evaluation of decision support queries. The MD-join can be immediately incorporated into present cost- and algebraic-based query optimizers. Theorems 3.1–3.5 discuss a series of algebraic transformations related to unions, selections, other MD-joins, joins and projections respectively. While we defer discussion and consequences of these transformations in evaluation algorithms to Section 4, we note that these theorems set the ground for parallel evaluations, memory-based processing and incorporation of wellknown data cube algorithms in a generic relational optimization framework. 2 We omit for simplicity several relational algebra details in this formulation. Each application of the MD-join should be preceded by renaming of the Sales table. The same holds for the remaining examples. 3 To model multiple granularity aggregates within a single table, we use the value ‘‘ALL’’ in the base-values table. An atomic predicate involving ‘‘ALL’’ always evaluates to TRUE (except for a test for equality or inequality to a constant ‘‘ALL’’). Conversely, atomic predicates involving ‘‘NULL’’ always evaluate to FALSE, except for a test for equality or inequality to a constant ‘‘NULL’’.
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
121
Theorem 3.1 (MD-joins and union). If B and R are relations, B = B1 [ B2 [ . . . [ Bm, l is a list of aggregate functions over columns of R and h is a condition involving attributes of B and R, then: MDðB; R; l; hÞ ¼ MDðB1 ; R; l; hÞ [ MDðB2 ; R; l; hÞ [ . . . [ MDðBm ; R; l; hÞ Proof. Appears in [13,3]. h This theorem demonstrates how the MD-join and union operators interact. Specifically, it shows the basevalues table partitioning transformation which can be used in developing query plans for large-scale computations and for parallel query evaluation. The MD-join operator interacts with selections in a similar fashion to joins. Theorem 3.2 (MD-joins and selections). Let B and R be relations, l be a list of aggregate functions over columns of R and h be a condition involving attributes of B and R. If h = h1 and h2 and h2 involves only attributes of R, then: MDðB; R; l; h1 and h2 Þ ¼ MDðB; rh2 ðRÞ; l; h1 Þ Proof. Tuples of R not satisfying h2 will not be considered by the MD-join and as a result this selection can be pushed to R. h The MD-join operator interacts with selections in a similar fashion to joins. Conditions can be pushed to R relation. Another set of important algebraic transformations applies when there is a sequence of MD-join operators, a common case in decision support queries (almost all of the examples of Section 2 involve series of MD-join operators, i.e. multiple grouping variables). Theorem 3.3 (Commutation of MD-joins). If B, R1, R2 are relations, l1 and l2 are list of aggregate functions and h1, h2 conditions, then: MDðMDðB; R1 ; l1 ; h1 Þ; R2 ; l2 ; h2 Þ ¼ MDðMDðB; R2 ; l2 ; h2 Þ; R1 ; l1 ; h1 Þ if h1 involves only attributes of B and R1 and h2 involves only attributes of B and R2. Proof. Appears in [13,3]. h Theorem 3.3 allows the commutation of two MD-join operations under certain conditions. It can be used to combine the computation of several MD-joins in ‘‘one shot’’. In addition to combining a sequence of MD-joins into a single operation, it is possible to express one as a join of two separate MD-joins. Theorem 3.4 (Joins and MD-joins). Assume that B, R1, and R2 are relations, l1 and l2 are lists of aggregate functions over columns of R1 and R2 respectively, and h1 and h2 are conditions involving attributes of B and respectively R1 and R2. MDðMDðB; R1 ; l1 ; h1 Þ; R2 ; l2 ; h2 Þ ¼ MDðB; R1 ; l1 ; h1 Þ fflB MDðB; R2 ; l2 ; h2 Þ Proof. Appears in [13,3]. h Because the MD-join does not change the rows of B, it is possible under suitable conditions to perform the equijoin very efficiently. An implication of the theorem is that one can move the computation of the MD-join to the source of a relation R, and in fact perform several such joins in parallel. Consider Example 2.1. Suppose that the Sales table is a distributed relation, and data for New Jersey is stored in Trenton, data for New York in Albany, and so on. It is likely to be more efficient to move the base-value relation to the three data stores, perform local MD-joins, then equijoin the results to obtain the answer. Note that we make use of Theorem 3.2 here. The following theorem shows how projections and MD-joins can be used to express the roll-up property of data cubes – that is, subcubes can be constructed from their drill-down cubes – as an algebraic transformation,
122
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
and discuss how this allows us to algebraically express and generalize more specialized cube computation algorithms. Theorem 3.5 (Projections and MD-joins). Let S and R be two relations, X and Y be attribute sets of S. Let pX ;ALLi ðBÞ be consist of the unique tuples of S projected to attributes in X, and with i copies of ‘ALL’ attached. Let l be a list of distributive aggregates, and let h be a predicate that is a conjunction of predicates that test the equality of an attribute of X or Y with an attribute in R, for all attributes in X ¨ Y. Then, MDðpX ;ALLjY j ðSÞ; R; l; hÞ ¼ MDðpX ;ALLjY j ðSÞ; MDðpX ;Y ðSÞ; R; l; hÞ; l; hÞ where l 0 is the set of distributive aggregates in l modified appropriately (e.g. a count in l becomes a sum in l 0 ). Proof. Appears in [13]. h Theorem 3.5 simply states that a coarser granularity cuboid can be computed by a finer granularity cuboid. The efficient cube computation algorithms of Agarwal et al. [1] are based largely on appropriate partitions of the data cube, on pushing selections on the base-values table partitions to the detail table partitions, and the roll up property of data cubes. Using Theorem 3.5 (in conjuction with the preceding theorems), we can express the algorithms of [1] algebraically, and thus generalize their application within a standard optimizer. The same holds for Ross and Srivastava’s cube computation algorithm [37]. Details can be found in [13]. 4. Evaluation, optimization and performance In this section we discuss how to implement the MD-join operator, optimized versions of the basic evaluation algorithm and performance results. 4.1. Evaluation algorithm We give below one simple algorithm to implement MD-join.4 Note that although Definition 3.1 states that for each row r of B we identify a set of tuples of R, in this algorithm we follow the reverse direction (as in hashbased approaches): we scan the detail table R and loop over all tuples of B to identify matches based on condition h. If a match is detected, we update the aggregate columns appropriately. This kind of processing has several implications in performance and is appropriate for data stream processing, discussed in Section 6. Algorithm 4.1. Evaluation of the MD-join operator: scan R and for all tuples t in R { for all rows r of B, check if condition h is satisfied with respect to r and t. If yes, update r’s aggregate columns appropriately. } The conventional group-wise aggregation algorithm cannot in general be applied, because the tuples in B do not necessarily represent groups in R. Neither can a conventional hash aggregation algorithm be applied, as a tuple from R might join with many tuples from B. However, by using the appropriate transformations and optimizations one can develop efficient evaluation plans the generalize the group-wise and the hash aggregation algorithms. Example 4.1. Example 2.2 of Section 2 can be expressed in relational algebra using MD-joins as MDðMDðB; Sales; sumðamountÞ; h1 Þ; Sales; sumðamountÞ; h2 Þ 4 This algorithm works only for distributive and algebraic aggregates; holistic aggregates can be processed by a similar algorithm that handles memory allocation issues (e.g., see [43]). However, some holistic aggregates can be made algebraic by using approximation, e.g. approximate medians [33].
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
where h1 h2
123
Sales.prod=prod and Sales.month=month, Sales.prod=prod,
and B is the resulting table of a select distinct prod, month from Sales. This example involves a sequence of two MD-join operators. Algorithm 4.1 will be invoked two times, once for each MD-join.
B:
B:
Prod
Month
Sum(x.amount)
A012
2
156
A012
5
212 233 343
Prod
Month
A012
2
A012
7
A012
5
A012
9
190
A012
7
A012
10
122
A012
11
128
A231
3
342
A231
4
438
A231
6
179
…
…
…
…
…
…
A012
9
A012
10
A012
11
A231
3
A231
4
A231
6
…
…
…
…
t: Cust
Prod
Day
Month
Year
State
Amount
1233
A012
12
5
2004
NY
21
(b)
(a) B: Prod
Month
Sum(x.amount)
A012
2
460
Sum(y.amount)
A012
5
544
1122 1143 1122 1143
A012
7
586
1122
A012
9
239
1122
A012
10
678
1122
A012
11
566
1122
A231
3
982
1782
A231
4
1022
1782
A231
6
546
1782
…
…
…
…
…
…
…
…
1143 1143 1143 1143
t: Cust
Prod
Day
Month
Year
State
Amount
1233
A012
12
5
2004
NY
21
(c) Fig. 1. Evaluating Example 2.2. (a) Base-values table B, (b) first MD-join (x grouping variable) and (c) second MD-join (y grouping variable).
124
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
The base-values table B of this query is given in Fig. 1(a) and corresponds to a select distinct prod, month from Sales query. To compute the inner MD-join, Algorithm 4.1 extends B’s schema by one column (Sum(x.amount)) and scans the detail table Sales. For each read tuple t, all rows of B satisfying condition h1 are identified and updated. Fig. 1(b) shows a currently scanned tuple t, the row of B that satisfies h1 (x’s defining condition) with respect to t and the update that takes place. Similarly, to compute the outer MD-join, Algorithm 4.1 extends B’s schema once again by one column (Sum(y.amount)) and scans the detail table Sales. For each read tuple t, all rows of B satisfying condition h2 are identified and updated. Fig. 1(c) shows a currently scanned tuple t, the rows of B that satisfy h2 (y’s defining condition) with respect to t and the update that takes place. 4.2. Optimizations In this section we examine how the theorems of Section 3.2 can lead to efficient implementations of Algorithm 4.1. 4.2.1. Keeping base-values table in main memory One interpretation of Theorem 3.1 is that the base-values table B can be partitioned in any arbitrary way in m partitions and the MD-join can be computed in m scans of the detail table R instead of one – each scan ‘‘computes’’ one partition of B. This theorem has a profound effect in performance of Algorithm 4.1 because it guarantees that the basevalues table can always be in memory, a crucial performance requirement. As a result, we can always devise a query evaluation plan in which Algorithm 4.1 operates on memory-resident data. Regardless of the indices constructed on B for use in the MD-join (discussed in Section 4.2.5), in-memory evaluation will usually be significantly faster than an out-of-core evaluation. The cost is a well-defined increase in the number of scans of R. 4.2.2. Parallel search of the base-values table A different interpretation of Theorem 3.1 is some form of intra-operator parallelism. It is easy to imagine a predicate h in which a significant portion of the rows of B join with an arbitrary row of R. For example, consider the following query which compares sales of a product on a month to the sales of every other product on the same month: MDðMDðB; Sales; avgðamountÞ; h1 Þ; Sales; avgðamountÞ; h2 Þ where h1 h2
Sales.prod=prod and Sales.month=month, Sales.prod<>prod and Sales.month=month
and B is the resulting table of a select distinct prod, month from Sales. The processing of the second MD-join is expensive because each tuple from Sales affects 1/12 of the basevalue tuples, on average. Using Theorem 3.1, we can partition the base-values table B to m independent processing nodes N1, N2, . . . , Nm, where node Ni keeps partition Bi, i = 1, . . . , m. Algorithm 4.1 proceeds as before but upon scanning of a tuple t, this tuple is distributed to all m nodes and the search of the partition Bi is carried out locally on each node. When the scan has been completed, the MD-join result is represented by the union of the individual outputs. Fig. 2 depicts this processing. 4.2.3. Pushing selections The MD-join operator interacts with selections in a similar fashion to joins. Theorem 3.2 says that the MD-join operator can be implemented via an indexed instead of a full scan of R. This is a very important optimization, especially if R is very large (e.g. internet/web/call logs). Example 4.2. One may be interested to compare for each product the total sales of the period 2000–2002 versus the total sales of 2004. This can be expressed in relational algebra using MD-joins as
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
125
R t
B1
t
Node Nm
Node N1 output
Bm
t
output m
1
union output
Fig. 2. Parallel search of the base-values table.
MDðMDðB; Sales; sumðsaleÞ; h1 Þ; Sales; sumðsaleÞ; h2 Þ where h1 h2
Sales.prod=prod and Sales.year P 2000 and Sales.year 6 2002, Sales.prod=prod and Sales.year=2004
and B is the resulting table of a select distinct prod from Sales. If there is a clustered index on the date (day, month, year) set of attributes, there is no need to perform two full scans of the Sales relation. Instead, the first scan will involve sales of 2000–2002 and the second scan will involve sales of 2004. The same transformation can be applied in Example 2.1, if there is an index on state. We note that Theorem 3.2 can work in conjunction with Theorem 3.1 to limit the range of R that must be MD-joined with B. 4.2.4. Combination of MD-joins A scan of a relation can be very expensive, especially if the data set is large. It is therefore essential to reduce the number of scans as much as possible. If there is a series of k consecutive MD-joins as usually is the case – Algorithm 4.1 requires one pass over the detail relation(s) for each operator, for a total of k scans. However, this is not always necessary. Consider Example 2.1. Although this query is expressed as a series of three MD-joins, obviously it can be evaluated with just one scan. Upon reading tuple t of Sales, search base-values table B for rows that satisfy condition h1 and update column (avg(x.amount)) appropriately, then search B for rows that satisfy condition h2 and update column (avg(y.amount)) and then search B for rows that satisfy condition h3 and update column (avg(z.amount)). The key observation in this example is that the theta conditions of the MD-joins are not related to each other. To be exact, there are no augmented columns generated by an MD-join used in a theta condition of a subsequent MD-join. Dependency analysis between correlated aggregates has been treated in [11,29,16] and many ideas are common. However it is valuable to formulate this concept as an algebraic transformation. One approach is to generalize the MD-join operator to incorporate a vector of theta conditions and a vector of lists of aggregate functions (and possibly a vector of names to rename aggregate functions). This is denoted as MDðB; R; ðl1 ; l2 ; . . . ; lk Þ; ðh1 ; h2 ; . . . ; hk ÞÞ
126
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
with the obvious semantics’ extensions (for each tuple of B we define k subsets of R instead of one). Using this new generalized MD-join, two consecutive MD-join operators can be combined in one, if (i) the condition of the second MD-join does not mention any column generated by the first and (ii) the detail relation is the same for both MD-joins. Strictly speaking, the second requirement is not necessary. For example, if the detail relations reside in different disks, then it is possible to simultaneously scan both relations and compute the answer in one pass. Generalized MD-joins have been discussed in [4–6]. A possible impediment to using the series MD-join optimization is that we might be provided with the algebraic expression in an inappropriate order. Fortunately, MD-joins commute under suitable conditions as shown in Theorem 3.3. 4.2.5. Indexing Algorithm 4.1 can become very expensive if the base-values relation B has a large number of rows (even though B is memory resident), since for each scanned tuple t of the detail table R, all B’s rows are examined, resulting in a nested-loop join. However, this is not always necessary since, given a tuple t, one can identify a small number of B’s rows that may be updated with respect to t during the evaluation of the MD-join operator. For instance, consider Examples 2.1 and 3.2. During the evaluation of the first MD-join, given a tuple t of Sales relation, there is only one row of B that may be updated, the one that has the same value in the cust attribute with t’s cust value. If B is indexed on cust attribute, searching B becomes fast. Now consider Example 2.4. This query can be expressed by three MD-joins with theta conditions as described in the such that clause of this example. During the evaluation of the first MD-join, given a tuple t of Sales relation, there is one row of B that may be updated, the one that has the same value in the cust attribute with t’s cust value and its month value equals t’s month value +1. An index on (prod, month) on B would reduce searching of B significantly (even though an index on prod would be efficient enough). Definition 4.1. The set of rows of B that are updated during the evaluation of an MD-join given a tuple t (i.e. the set of rows updated in the innermost loop body of Algorithm 4.1) is called the relative set of B with respect to t, denoted as Rel(t). Sometimes it is possible to index B once and use this indexing for the evaluation of all MD-joins in a query, as in the examples described above. In other cases we may have to create an additional index (for example, we may have to index on a column generated by a previous MD-join). 4.2.6. Other optimizations Other optimizations relate to reducing the size of the base-values table B and minimizing further the number of scans required to compute the answer. For example, in the presence of additional information we may be able to deduct that a row of B has been ‘‘completed’’ before evaluation terminates (i.e. further processing will not change the row) and thus remove it immediately from B and present it to the user. This is usually the case with sorted detail tables on some timestamp, such as log files and data streams. We present two such optimizations in the context of telecom applications in Section 5.2. 4.3. Performance It is natural to wonder whether Algorithm 4.1 has an efficient implementation, even in the presence of the optimizations described in Section 4.2. In [12] we presented a tool (the PanQ tool) which implements the EMF-SQL language for expressing complex OLAP queries with most MD-join optimizations implemented, namely those of Sections 4.2.2, 4.2.4 and 4.2.5. In [10], the performance of an EMF-SQL query to compute the answer to Example 2.4 is measured and compared to the performance of an equivalent SQL query executed on a popular commercial database system. Fig. 3 shows the performance of the DBMS when the query of Example 2.4 is expressed in standard SQL and there are no underlying indices. Each group has size 10 (i.e. 10 sales per customer). We measured the performance at 5 datapoints, ranging from 1000 customers to 5000 customers. The Sales relation size ranges from
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
127
50 Commercial DB system MD-join and Opts
45
Time (secs)
40 35 30 25 20 15 10 5 0
0
1000
2000
3000
4000
5000
Size (number of groups)
Fig. 3. Performance of a commercial DBMS and an EMF-SQL tool.
10,000 tuples (250 KB) to 50,000 tuples (1.4 MB). Fig 3 also shows the performance of the PanQ tool for the same sets of data. The result is computed in two scans of the Sales relation and B, the base-values table, is indexed (in memory) on cust. For larger detail relations, the performance difference is much higher, due to the scalability offered by transformation 4.2.2. The issue of performance is revisited in Section 5.2 by providing experimental results over an example drawn from the telecommunications industry. 5. Applications In this section we describe typical applications that have exploited the concepts of grouping variables and MD-joins. 5.1. Cubes and grouping variables In [38], Ross, Srivastava and Chatziantoniou proposed using grouping variables on datacube queries to perform complex aggregation. A datacube query [25] computes a simple aggregate when the relation is grouped by all possible subsets of grouping attributes. For example, a typical datacube query would be: grouping by all subsets of cust, prod, month, find the total amount among all sales of 2004. This query would be expressed as select cust, prod, month, sum(amount) from Sales where year=2004 cube by cust, prod, month Now assume that we want to execute a query similar to Example 2.1 when Sales is grouped by all possible subsets of cust, prod, month, i.e. compute the average sale amount of ‘‘NY’’, ‘‘CT’’ and ‘‘NJ’’ at groupings of {cust}, {prod}, {month}, {cust, prod}, {cust, month}, {month, prod}, {cust, month, prod} and {}. Then we are performing an operation analogous to the datacube, which we call a multi-feature cube. This example can be expressed as select cust, prod, month, sum(amount) from Sales where year=2004 cube by cust, prod, month; x, y, z such that x.cust=cust and x.prod=prod and x.month=month and x.state=‘NY’, y.cust=cust and y.prod=prod and y.month=month and y.state=‘CT’, z.cust=cust and z.prod=prod and z.month=month and z.state=‘NJ’
128
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
We use the value ‘‘ALL’’ to fill missing group-by attributes, as in [25]. An atomic predicate involving ‘‘ALL’’ always evaluates to TRUE (except for a test for equality or inequality to a constant ‘‘ALL’’). Conversely, atomic predicates involving ‘‘NULL’’ always evaluate to FALSE, except for a test for equality or inequality to a constant ‘‘NULL’’. The generic syntax of multi-feature cubes is shown below and is similar to the one in Section 2 with treatment of ‘‘ALL’’ values as mentioned above. select B1, . . . , Bk, f1(A1), . . . , fm(Am) from R1, . . . , Rk where Cond cube by B1, . . . , Bk ; X1, . . . , Xn such that C1, . . . , Cn In [38], Ross et al. classified multi-feature cubes based on the extent to which fine granularity results could be used to compute coarse granularity results. This classification included distributive, algebraic and holistic multi-feature cubes. They also provided syntactic sufficient conditions to determine when a multi-feature cube was either distributive or algebraic. This distinction was important because existing datacube evaluation algorithms could be used to compute multi-feature cubes that were distributive or algebraic, without any increase in I/O complexity. Note that the MD-join operator is used to algebraically model multi-feature cubes. Simply the base-values relation B corresponds to the cube-by structure. 5.2. Telecom applications Many processes in telecommunications (e.g., network monitoring) generate very large amounts (terabytes) of data. This data is stored in a data warehouse and used for data mining and analysis. Many analyses require the join of several very large data sets. Conventional methods for performing these joins are prohibitively expensive. However, one can often exploit the temporal nature of the data and the join conditions to obtain fast algorithms that operate entirely in memory. The application that has motivated this research [27,16] is the analysis of network performance data. A common feature of the problems encountered is the aggregation of sequences of tuples into larger units (for example from packets to sessions). Let us consider a scenario in which the data warehouse contains records describing packets that traverse a communications network. This relation has the following schema: Pkt (Source, Dest, StartCall, EndCall, Length, ts) Example 5.1. Suppose that a network engineer wants to know how many packets are involved in each connection, where a connection is a sequence of packets (in ts order) with the same Source and Dest attribute values, starting with a packet where StartCall is 1, and ending with a packet where EndCall is 1. Packets of different connections may be interleaved. We can assume that every packet has a unique value of ts. A direct expression of this query in SQL is quite complex involving self-joins, temporary tables, additional joins, etc. [27]. This type of query is difficult to optimize, and in fact our experience is that common commercial SQL databases cannot optimize this type of query properly. However, when a programmer is asked to compute the number of packets per connection, she will usually write an efficient program that scans the Pkt table and builds a hash table of aggregates indexed by the Source, Dest pairs. It is clear that a packet with StartCall=1 defines a new hash table entry, and a packet with EndCall=1 closes the entry. If we increment a counter for every packet that hashes to an entry, when we close the entry we have computed the output. Finally, closed entries can be deleted from the hash table to minimize space use. As the example shows, there is a large class of queries that are difficult to express and optimize in SQL, but which have simple and natural programs for computing their result. Our goal is to express these queries simply
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
129
and identify efficient evaluations, as the one mentioned above. Example 5.1 can be expressed using grouping variables as select Source, Dest, ts, count(Y.*) from Pkt group by Source, Dest, StartCall, ts ; X, Y such that (X.Source=Source and X.Dest=Dest and X.EndCall=1 and X.ts > ts) (Y.Source=Source and Y.Dest=Dest and Y.ts P ts and Y.ts 6 min(X.ts)) having StartCall=1 A group is defined by a connection’s opening packet (which is the only tuple in the group). The grouping variable X is the collection of all later packets which end calls between this source and destination. Therefore, the earliest member of X ends the connection. The grouping variable Y is the set of all tuples from the source to the destination between the start of the connection and the end of the connection. The number of entries in the base-values table B of this example may be very large, forcing it to the disk and greatly increasing the cost of computing the query. However, two basic and related observations improve the performance tremendously. • The optimizer can exploit the fact that Pkt is ordered on ts and produce the answer in one pass, instead of three. • An entry in B (i.e. a group, a session) ‘‘opens’’ during the scan and later ‘‘closes’’. In Example 5.1, an entry opens with a StartCall packet and closes with an EndCall packet. The base-values table B needs to store only the currently open entries (the maximum number of concurrently open connections is much smaller than the total number of open connections). By using this optimization, we can usually avoid out-of-core processing B. Definition 5.1. Assume that Y ! X, Y ! X1, . . . , Y ! Xk are the dependencies of Y and CY is the defining condition of Y. We say that a grouping variable Y is pass-reducible to X, if for all groups and all database instances it is possible (i.e. there exists an algorithm) to compute in the same pass the aggregates of X involved in CY and the aggregates of Y, given that the aggregates of X1, . . . , Xk used in CY have been already calculated. This is denoted as Y ) X. Definition 5.1 simply generalizes the optimization discussed in Section 4.2.4, where several MD-joins can be computed in one pass. In [27,16] we introduced syntactic criteria that infer pass reducibility between grouping variables in the presence of additional information (e.g. sort order of a relation). For instance in Example 5.1, although grouping variable Y depends on X, we can deduce that Y is pass-reducible to X 5 because for each group, all tuples of Y precede the first tuple of X. Definition 5.2. A row b of base-values table B of an MD-join operator is said to be completed if it is somehow known that tuples further in the scan of Algorithm 4.1 will not affect the output columns with respect to b. Once again we must have some syntactic criteria to determine when a grouping variable X closes-out. One way is to know that tuples further in the scan do not affect the output of X; a different way is to know that tuples further in the scan will not belong to X. These criteria are discussed in [27,16]. These two optimizations are very useful also in the context of data stream processing, where only one-pass algorithms are allowed [8,31] and entries must be instantiated and deleted dynamically. We compared the performance of Example 5.1 when written and executed in the standard SQL component of a leading commercial system versus the corresponding EMF SQL query of the PanQ tool. The platform we used was a 2 GHz P4 Intel PC with a 256 DDR main memory and a 7200 rpm HD running Win2000. All con5
To be accurate, the MD-join that corresponds to Y grouping variable can be computed along with the MD-join that corresponds to X grouping variable, as stated in Section 4.2.4.
130
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136 250 Standard SQL - NoIndx Standard SQL - Indx EMF SQL
Time (secs)
200
150
100
50
0
0
10000
20000 30000 40000 Size (number of groups)
50000
Fig. 4. Performance of a commercial DBMS and an EMF-SQL tool.
nections in the log file have 100 packets each and the number of connections ranges in our experiments from 10,000 to 50,000 with a step of 10,000. The size (in records) of the log file ranges from 1 M to 5 M records. Each record is 32 bytes long and the total size of the log file (in MB) ranges from 32 MB to 160 MB. In all experiments the memory was flushed before running the tests in order to clear the cache. We run the standard SQL query twice, once on indexed and once on non-indexed data (indexing on (Source, Dest, ts)). Fig. 4 compares graphically the performance of the three approaches. The performance of EMF query is linear (except a small start-up cost) which is expected due to the nature of the implementation: a pass over the data. As a result, the cost is proportional to the size of the log file. On the other hand, the performance of standard SQL query is super-linear, due to the nature of SQL formulation and optimization: joins and group-bys. Note that 160 MB (50,000 connections) could fit in main memory. We run the same tests for a log file of size 650 MB (200,000 connections) and the measurements changed significantly (standard SQL: 1398 s, EMF: 38 s). 5.3. Medical informatics In [27], Johnson and Chatziantoniou discussed how EMF SQL could be used in typical queries over a clinical warehouse. Health care institutions are beginning to collect large amounts of clinical data through patient care applications. Clinical data warehouses make these data available for complex analysis across patient records, benefiting administrative reporting, patient care and clinical research. Typically, clinical data are distributed across several tables, and have generic columns with coded values. The transformation of raw warehouse data requires considerable intellectual effort, as well as intensive computer processing. The principal difficulty of generic database designs is the use of coded values instead of specific column names. This results in a schema that is conceptually difficult for users to manipulate and which performs very poorly for queries involving multiple patient records. A classic example of a generic schema is shown in Fig. 5(a), in which lab data are identified by codes NA, K, and CL. In contrast, a specific schema employs distinct column names for each attribute of interest. Fig. 5(b) shows the same data as (a), but arranged in a table in which the value of each lab test can be identified by its column name. This schema is far easier to use in decision support and data analysis applications. EMF SQL was used to construct a number of queries which occur frequently in analysis of clinical data, but which are known to be difficult and inefficient to formulate in standard SQL. The queries were run on data in the Clinical Data Warehouse at Columbia-Presbyterian Medical Center, which contains fifteen years worth of administrative and ancillary data. The clinical queries that were studied were the following: • Pivoting: transforming lab data in a generic schema to a specific schema (Fig. 5). • Flattening: transforming nested radiology data (findings with modifiers that in turn have modifiers) in a generic schema into a specific schema with named columns.
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
MRN 123 123 123 123 123 123 123 123 123 700 700 700 700 700 700
Date 03-12-2003 03-12-2003 03-12-2003 06-01-2003 06-01-2003 06-01-2003 11-23-2003 11-23-2003 11-23-2003 02-17-2003 02-17-2003 02-17-2003 09-22-2003 09-22-2003 09-22-2003
Test NA K CL NA K CL NA K CL NA K CL NA K CL (a)
Value 135 3.7 109 140 5.1 104 145 4.2 110 139 4.2 106 146 4.1 108
MRN 123 123 123 700 700
DATE 03 -12 -2003 06 -01 -2003 11 -23 -2003 02 -17 -2003 09 -22 -2003
NA 135 140 145 139 146
131
K 3.7 5.1 4.2 4.2 4.1
CL 109 104 110 106 108
(b)
Fig. 5. Schematic discrepancies in clinical warehouses.
• Sequencing: finding cardiology procedures in a generic table of longitudinal data, and returning a specific schema in which rows have a particular temporal sequence. • Aggregation: determining the median value of a series of lab values. Each of these queries was expressed in both standard and extended SQL, and executed on data in the warehouse. For example, the query to transform table lab of Fig. 5(a) to that of (b) is: select mrn, date, any(x.value), any(y.value), any(z.value) from lab group by mrn, date; x,y,z such that x.mrn = mrn and x.date=date and x.test = ‘NA’, y.mrn = mrn and y.date=date and y.test = ‘K’, z.mrn = mrn and z.date=date and z.test = ‘CL’
5.4. Bioinformatics In [35], Rafiq, O’Connor and Das are developing and validating a novel software method for temporal pattern discovery in biomedical genomics. In this paper, they present an efficient and flexible query algorithm (called Temporal EMF – TEMF) to extract statistical patterns from time-oriented relational databases. They show that TEMF can express a wide range of complex temporal aggregations without the need for data processing in a statistical software package. They demonstrate the expressivity of TEMF using example queries from the Stanford HIV Database. Chronus II is a temporal querying method that has been designed to express complex queries over time-oriented values. It uses a querying language, called Chronus Querying Language (CQL), and is based on the proposed temporal standard TSQL2 for the relational query language SQL, which has few temporal features. Chronus II is implemented in Java and is designed to operate as a modular program above existing relational databases using JDBC as its access layer. The Chronus II interface takes a CQL command, generates standard SQL statements for the non-temporal part of the command, and completes the processing of temporal operations in memory. Rafiq et al. have extended the EMF syntax to express temporal aggregation in a generalized and succinct manner. This extended EMF syntax has been incorporated into CQL, taking advantage of Chronus II’s rich ability to handle time values. Temporal EMF required the following functionalities to be added on EMF SQL:
132
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
• Supporting multiple database tables in the from clause (the original EMF syntax only allows for a single table to be queried). • Implementing the temporal equivalent of the basic aggregate operators available in SQL to specify arbitrarily complex temporal aggregate queries. • Defining grouping variables over rows of data based on temporal conditions, such as duration operators, over stored time values. • Allowing nested aggregation to be used in defining temporal aggregate queries. The nested aggregates can support combinations of both duration operators and basic temporal aggregate operators defined above. The implementation of TEMF involves additional data processing in the Chronus II program. A table structure is created in memory for which the columns are the aggregates of each grouping variable. Chronus II parsing of a TEMF specification in a CQL query separates the two constituents, the CQL syntax and the EMF syntax. The CQL syntax is processed by Chronus II as before. The output of this query fills the rows of a memoryresident table structure (called mf-table, similar to base-values table) the rows of which correspond to the user-defined grouping attributes. Then, the output is computed with an evaluation strategy similar to Algorithm 4.1. 5.5. Financial applications In this section we briefly describe how extended SQL has been used to express decision support queries in a financial environment. In this scenario we have a relation that stores the opening price, the closing price and the volume of each stock traded on day, month, year. Transactions (stock-id, day, month, year, open-price, close-price, volume) Example 5.2. Assume that we would like to find the average opening and closing price of January, February and March of each stock for 2004. This is similar to Example 2.1 and can be expressed as select stock-id, avg(x.open-price), avg(x.close-price), avg(y.open-price), avg(y.close-price), avg(z.open-price), avg(z.close-price) from Transactions where year=2004 group by stock-id; x,y,z such that x.stock-id=stock-id and x.month=1, y.stock-id=stock-id and y.month=2, z.stock-id=stock-id and z.month=3
Example 5.3. Now suppose that we are interested in finding for each stock those days of 2004 that the opening price of the previous day was less than the opening price of the following day. This can be expressed in extended SQL as select stock-id, day, month from Transactions where year=2004 group by stock-id, day, month; x, y such that x.stock-id=stock-id and date(x.day,x.month)=date(day,month)1, y.stock-id=stock-id and date(y.day,y.month)=date(day,month)+1 having any(x.open-price) < any(y.open-price)
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
133
Note that although x and y grouping variables contain only one row, we cannot mention relation’s attributes in the select or having clauses. The aggregate function any solves this problem by selecting randomly a tuple among the rows being aggregated (the selection is implementation specific) and the value of the argument column is returned. We have also expressed continuous financial queries (queries continuously evaluated over data streams [9,41]) using grouping variables, similar to Traderbot’s web engine.6 6. Related and current work SQL syntactic extensions to accommodate decision support queries have been proposed in the past by several authors and systems (e.g. [36,30,23,39,42,44]). MD-join is similar to SegmentApply operator of [22] and Gapply of [19]. Galindo-Legaria and Joshi proposed in [22] two operators called Apply and SegmentApply to model parameterized query execution in an algebraic manner. Chaudhuri et al. proposed the GApply operator in the context of XML queries, based on the Apply operator. The evaluation algorithm of this operator is to partition the input tuple stream based on the grouping attributes and execute on each group – which binds to a relation-valued variable – the per-group-query. All operators (MD-join, SegmentApply and Gapply) could be unified under certain conditions into one framework [18]. Another similarity exists between the MD-join and the nest-Join operator proposed by Steenhagen et al. [40] for complex object models. The nestjoin is a generalized form of the outer-join. The interaction between aggregation and join has been studied in several papers (e.g. [45,26,20]), but the research direction was in identifying which operator to execute first (push a group-by past a join or pull a group-by above a join). MD-join combines join and aggregation in a generalized fashion into a new operator. Other relational operators to model decision support queries have been proposed in the past by several authors, considering mainly the problem of modelling the data cube (e.g. [32,2,24]). Current work focus on data stream applications, one of the most active research areas in databases. These applications involve data streams, data items that arrive on-line from multiple sources in a continuous, rapid and time-varying fashion [8,31]. Examples include financial streams (such as stock tickers and news feeds), network packet streams, sensor networks, web logs, etc. In addition, queries over data streams are quite different than traditional ones. In data streams we usually have ‘‘continuous’’ queries [41,9]. The answer to a continuous query is produced over time, reflecting the stream data seen so far. Answers may be stored and updated as new data arrives or may be produced as data streams themselves. Continuous queries can be used for monitoring, alerting, security, personalization, etc. An open question is whether grouping variables and the MD-join operator could be used to express and evaluate continuous queries without significant modification of the existing framework. As mentioned briefly in Section 2 one can define the range of a grouping variable by specifying it within a parenthesis next to the grouping variable’s definition, as below: group by prod; X(Sales), Y(Web) X grouping variable denotes subsets of Sales relation, while Y denotes subsets of Web. Expressions like this translate to a series of MD-joins over different detail relations that are evaluated by combining and parallelizing scans according to the optimizations of Section 4.2. Now assume that we extend grouping variables definition so they can range over data stream sources, for example X(Packets), where Packets is a stream of data. Due to the ‘‘sequential’’ nature of Algorithm 4.1 and the optimizations of Section 4.2, few changes are required in the framework discussed so far. The detail argument of the MD-join operator can be a stream source, scanned ‘‘indefinitely’’ during the evaluation of Algorithm 4.1. Since in data stream processing only
6 Traderbot (www.traderbot.com) is a web-based financial search engine that evaluates queries (both traditional and continuous) over real-time streaming data (e.g. ‘‘find all stocks between $20 and $200 where the spread between the high tick and the low tick over the past 30 min is greater than 3% of the last price and in the last 5 min the average volume has surged by more than 300%.’’)
134
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
one-pass algorithms are allowed and memory has to be managed very carefully, dependency analysis between multiple MD-joins and creating/destroying in-memory items on-time become crucial optimizations. The work on [27,16] seems appropriate toward that direction. Another research path is getting from the concept of grouping variable to the concept of grouping (or stream) object. We argued in [14,15] that stream applications have complex semantics, similar to objectoriented applications. As a result, stream objects have names, instances, methods, constructors and destructors, form hierarchies, etc. A prototype based on stream objects called NESTREAM has been presented in [15].
7. Conclusions In this paper we presented a class of data analysis queries called extended multi-feature (EMF) queries. This class contains many useful, practical ad hoc decision support queries not easily expressed with standard SQL. We reviewed a simple and intuitive syntactic extension of SQL in order to express EMF queries. It equips SQL with a looping construct over the values of one or more columns in a declarative way. We argued that such an extension is essential in expressing ad hoc OLAP queries. The concept of grouping variable has influenced at least one commercial DBMS [34] and the OLAP Amendment of ANSI SQL [7]. Furthermore, it has been used in several application areas, such as telecommunications and medical informatics. Grouping variables are coupled in relational algebra with a generic operator called MD-join. The operands of the MD-join include the set of base values and the relation to be aggregated; these operands can be the result of arbitrary relational expressions. As a result, the concept of grouping variable is extended beyond group-by queries, providing greater flexibility in expressing OLAP. In addition, we showed that expressing decision support queries in terms of MD-joins leads to efficient and optimized query plans. We presented an efficient algorithm for implementing MD-joins. We next showed a variety of algebraic transformations of expressions involving MD-joins to have parallel execution and partitioned in-memory computations. By using the roll-up transformation we can express the efficient data cube computation algorithms of [1,37] as transformed MD-join expressions. As a result, we can generalize these cube computation algorithms and apply them to new settings.
References [1] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J.F. Naughton, R. Ramakrishnan, S. Sarawagi, On the computation of multidimensional aggregates, in: 22nd VLDB Conferenceit, 1996, pp. 505–521. [2] R. Agrawal, A. Gupta, S. Sarawagi, Modeling multidimensional databases, in: IEEE International Conference on Data Engineering, 1997. [3] M. Akinde, Complex and Distributed On-line Analytical Processing. PhD thesis, Department of Computer Science, Aalborg University, 2004. [4] M. Akinde, M. Bohlen, Generalized MD-Joins: Evaluation and reduction to SQL, in: Databases in Telecommunications, 2001, pp. 52–67. [5] M. Akinde, M. Bohlen, J.T., L. Lakshmanan, D. Srivastava, Efficient OLAP query processing in distributed data warehouses, in: International Conference on Extending Database Technology, 2002, pp. 336–353. [6] M. Akinde, M. Bohlen, J.T.L. Lakshmanan, D. Srivastava, Efficient OLAP query processing in distributed data warehouses, Information Systems 28 (1–2) (2003) 111–135. [7] ANSI SQL Standards Committee. SQL AMENDMENT 1: On Line Analytical Processing (SQL/OLAP), Supplement to ISO/IEC 9075:1999. American National Standards Institute, 2001. [8] B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in: Symposium on Principles of Database Systems (PODS), 2002, pp. 1–16. [9] S. Babu, J. Widom, Continuous queries over data streams, SIGMOD Record 30 (3) (2001) 109–120. [10] D. Chatziantoniou, Ad hoc OLAP: expression and evaluation, in: IEEE International Conference on Data Engineering (ICDE), 1999, p. 250. [11] D. Chatziantoniou, Evaluation of ad hoc OLAP: in-place computation, in: ACM/IEEE International Conference on Scientific and Statistical Database Management (SSDBM), 1999, pp. 34–43. [12] D. Chatziantoniou, The PanQ tool and EMF SQL for complex data management, in: ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 1999.
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136
135
[13] D. Chatziantoniou, M. Akinde, T. Johnson, S. Kim, The MD-join: an operator for complex OLAP, in: IEEE International Conference on Data Engineering, 2001, pp. 524–533. [14] D. Chatziantoniou, A. Anagnostopoulos, Hierarchical stream aggregates: querying nested stream sessions, in: International Conference on Scientific and Statistical Database Management (SSDBM), 2004, pp. 439–441. [15] D. Chatziantoniou, A. Anagnostopoulos, NESTREAM: querying nested streams, SIGMOD Record 33 (3) (2004) 71–78. [16] D. Chatziantoniou, T. Johnson, Decision support queries on a tape-resident data warehouse, Information Systems 30 (2) (2005) 133–149. [17] D. Chatziantoniou, K. Ross, Querying multiple features of groups in relational databases, in: 22nd International Conference on Very Large Databases (VLDB), 1996, pp. 295–306. [18] D. Chatziantoniou, K. Ross, Partitioned optimization of complex queries, Information Systems, accepted for publication. [19] S. Chaudhuri, R. Kaushik, J. Naughton, On relational support for XML publishing: beyond sorting and tagging, in: ACM SIGMOD, Conference on Management of Data, 2003, pp. 611–622. [20] S. Chaudhuri, K. Shim, Optimizing queries with aggregate views, in: Extending Database Technology, 1996, pp. 167–182. [21] R. Elmasri, S. Navathe, Fundamentals of Database Systems, Addison-Wesley Publishing, 2000. [22] C. Galindo-Legaria, M. Joshi, Orthogonal optimization of subqueries and aggregation, in: ACM SIGMOD, Conference on Management of Data, 2001, pp. 571–581. [23] F. Gingras, L. Lakshmanan, nD-SQL: a multi-dimensional language for interoperability and OLAP, in: VLDB Conference, 1998, pp. 134–145. [24] G. Graefe, U. Fayyad, S. Chaudhuri, On the efficient gathering of sufficient statistics for classification from large SQL databases, in: International Conference on Knowledge Discovery and Data Mining, 1998, pp. 204–208. [25] J. Gray, A. Bosworth, A. Layman, H. Pirahesh, Datacube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals, in: IEEE International Conference on Data Engineering, 1996, pp. 152–159. [26] A. Gupta, V. Harinarayan, D. Quass, Aggregate-query processing in data warehousing environments, in: VLDB Conference, 1995, pp. 358–369. [27] S. Johnson, D. Chatziantoniou, Extended SQL for manipulating clinical warehouse data, in: Annual Symposium of American Medical Informatics Association, 1999. [28] T. Johnson, D. Chatziantoniou. Extending complex ad hoc OLAP, in: ACM Conference on Information and Knowledge Management (CIKM), 1999. [29] T. Johnson, D. Chatziantoniou, Joining very large data sets, in: Databases in Telecommunications 1999 (VLDB co-located), Lecture Notes in Computer Science, vol. 1819, Springer, 2000. [30] R. Kimball, K. Strehlo, Why decision support fails and how to fix it, SIGMOD Record 24 (3) (1995) 92–97. [31] N. Koudas, D. Srivastava, Data stream query processing: a tutorial, in: 29th International Conference on Very Large Databases (VLDB), 2003, p. 1149. [32] C. Li, S.W. Wang, A data model for supporting on-line analytical processing, in: International Conference on Information and Knowledge Management, 1996, pp. 81–88. [33] G. Manku, S. Rajagopalan, B. Lindsay, Approximate medians and other quantiles in one pass and with limited memory, in: ACM SIGMOD, Conference on Management of Data, 1998, pp. 426–435. [34] Oracle. Analytic Functions for Oracle 8i. Oracle Corporation, 1999. Available from: . [35] M. Rafiq, M. O’Connor, A. Das, Computational Method for Temporal Pattern Discovery in Biomedical Genomic Databases, in: IEEE Computational Systems Bioinformatics Conference (CSB2005), (2005). [36] C. Red Brick Systems, Los Gatos. RISQL Reference Guide, Red Brick Warehouse VPT Version 3, 1994. [37] K. Ross, D. Srivastava, Fast computation of sparse datacubes, in: 23nd VLDB Conference, 1996, pp. 116–125. [38] K. Ross, D. Srivastava, D. Chatziantoniou, Complex aggregation at multiple granularities, in: Extending Database Technology (EDBT), Valencia, 1998, pp. 263–277. [39] S. Sarawagi, S. Thomas, R. Agrawal. Integrating mining with relational database systems: alternatives and implications, in: ACM SIGMOD, Conference on Management of Data, 1998, pp. 343–354. [40] H. Steenhagen, P. Apers, H. Blanken, Optimization of nested queries in a complex object model, in: International Conference on Extending Database Technology (EDBT), Cambridge, 1994, pp. 337–350. [41] D. Terry, D. Goldberg, D. Nichols, B. Oki, Continuous queries over append only databases, in: ACM SIGMOD, Conference on Management of Data, 1992, pp. 321–330. [42] H. Wang, C. Zaniolo, Database system extensions for decision support: the AXL approach, in: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000, pp. 11–20. [43] H. Wang, C. Zaniolo, User defined aggregates in object-relational systems, in: International Conference on Data Engineering (ICDE), 2000, pp. 135–144. [44] A. Witkowski, S. Bellamkonda, T. Bozkaya, G. Dorman, N. Folkert, A. Gupta, L. Sheng, S. Subramanian, Spreadsheets in RDBMS for OLAP, in: ACM SIGMOD, Conference on Management of Data, 2003, pp. 52–63. [45] W.P. Yan, P.-A. Larson, Eager aggregation and lazy aggregation, in: VLDB Conference, 1995, pp. 345–357.
136
D. Chatziantoniou / Data & Knowledge Engineering 61 (2007) 114–136 Damianos Chatziantoniou received his B.Sc. in Applied Mathematics from the University of Athens (Greece) and continued his studies in Computer Science at New York University (M.S., 1993) and Columbia University (Ph.D., 1997). His research interests include OLAP, decision support systems, data streams and query processing. He has published more than 20 articles at VLDB, ICDE, EDBT, SIGKDD, SIGMOD, Information Systems Journal and elsewhere. His research work has influenced Microsoft SQL Server (query processor), Oracle 8i and 9i (OLAP functions), and ANSI SQL standard (OLAP amendment). He is currently an Assistant Professor at Athens University of Economics and Business (AUEB), Department of Management Science and Technology. Prior to AUEB, Damianos has taught at Columbia University as an Adjunct Professor and at Stevens Institute of Technology as a tenure-track Assistant Professor. He has collaborated with AT&T Research and Columbia Medical Informatics Department. Besides academia, Damianos has been involved in a couple of technology startup companies, one based in New York (Panakea Software Inc., OLAP tools & consulting, founder) and one based in Athens (VoiceWeb, speech & telecom applications, founder).