Detecting summarizability in OLAP


Data & Knowledge Engineering 89 (2014) 1–20

Tapio Niemi a,⁎, Marko Niinimäki a,b, Peter Thanisch c, Jyrki Nummenmaa c

a Helsinki Institute of Physics, Technology Programme, CERN, CH-1211 Geneva 23, Switzerland
b HEPIA, University of Applied Sciences of Western Switzerland, Rue de la Prairie 4, CH-1202 Geneva, Switzerland
c School of Information Sciences, University of Tampere, FIN-33014 Tampere, Finland


Article history: Received 17 February 2012; received in revised form 13 November 2013; accepted 13 November 2013; available online 27 December 2013.

Keywords: Business intelligence; Database design; Modelling and management; Summarizability; Additivity; OLAP

Abstract

The industry trend towards self-service business intelligence is impeded by the absence, in commercially available information systems, of automated identification of potential issues with summarization operations. Research on statistical databases and research on data warehouses have both produced widely accepted categorisations of measure attributes, the former based on general summarizability properties and the latter based on measures' additivity properties. We demonstrate that neither of these categorisations is an appropriate basis for precise identification of measure types, since they are incomplete, ambiguous and insufficiently refined. Using a new categorisation of dimension types and multidimensional structures, we derive a measure categorisation which is a synthesis and a refinement of the two aforementioned categorisations. We give formal definitions for our summarizability types, based on the relational model of data, and then construct rules for correct summarization by using these definitions. We also give a method to detect whether a given MDX OLAP query conforms to those rules.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

The current trend for Business Intelligence (BI) to become increasingly pervasive [39] has led to an increasing reliance on self-service BI. Spahn et al. [36] note that business users face considerable challenges when trying to mine data in a self-service manner. One of these challenges is finding what data should be summarised and how. Here, we aim at answering this question by studying the OLAP cube, the multidimensional structure used in practical BI. More specifically, we study the measures (such as units sold or distance travelled) that appear in the cube.

In the past, there have been two main sets of ideas about categorising measures:

1. The Lenz and Shoshani [16] categorisation of flow, stock and value-per-unit has been adopted by several groups of researchers.
2. The Kimball and Ross [13] categorisation of additive, semi-additive and non-additive has also been adopted by various groups of researchers.

In our paper, we examine how these two categorisations relate to each other and we develop a new categorisation which is a synthesis of the two earlier categorisations.

When a user's query extracts information from a multidimensional database (“cube”, for short), the output is typically in summary form. The user's query specifies:

(a) the aggregation operation (e.g. a summation, a mean or a count),
(b) the measure which is to be aggregated, and
(c) either an aggregation level or a member for each dimension.


The query can only produce meaningful results if:

1. the aggregation operation is appropriate for the measure, and
2. the measure is appropriate for the aggregation levels in the cube's dimensions.

When both of the above conditions are true, we say that the selected data is “summarizable” with respect to the aggregation operation. The first condition is determined by the inherent statistical properties of the measure. For example, if our data comprises wind directions taken at time intervals and we want to determine the prevailing wind during some period of time, then it is not appropriate to compute the sum or the arithmetic mean, but it is appropriate to compute the mode. By contrast, the second condition is determined by the semantics of the selected data. For example, while it is appropriate to sum the inventory levels of a particular product at a particular time over all of the stores, it is not appropriate to sum inventory levels for the product at one particular store over a number of points in time.

The problem of detecting summarizability has attracted the attention of researchers for many years. An indication as to why detecting summarizability is such a hard research problem can be gleaned from our use of the word “appropriate” in both of the above conditions. In particular, for the second of the above pair of conditions, what is “appropriate” is sensitive to a heterogeneous collection of contextual elements. As is well known, such elements include the semantics of the dimensional data and the kind of measure involved. However, as we demonstrate in the present paper, whether or not a particular measure is appropriate can also depend on the kind of cube. Indeed, one of the contributions of the present paper is to provide categorisations of measures and cubes and to demonstrate that it is the combination of the measure and the cube category which determines summarizability.

In this paper we propose a method that can be used to detect when a query intended to summarise data violates one or both of the above conditions. We focus on additivity, i.e. cases in which the sum operation is used. In order for our method to work, it is necessary for the cube design to represent the units of measurement and the statistical scale of each of the cube's measures. We present a model of multidimensional databases which allows us to provide decision criteria for summarizability that are provably correct. Our model places restrictions on cube designs; for example, we only allow simple hierarchical relationships in dimensions and we do not model many-to-many relationships. Furthermore, our model places an extra responsibility on the cube designer, since he or she must include the units of measurement for each of the measures as separate attributes in the schema for the cube. However, the benefit of our model is that we can provide provably correct conditions for determining additivity. Our model is sufficiently detailed to permit the automatic detection of additivity. Our approach requires that we model the types of measures in a more formal manner than has been done in earlier research. To demonstrate our method, we show how the correctness of summarizability can be checked when using the MDX [38] query language for expressing OLAP queries.

The paper is organised as follows. We illustrate the problems of existing work on summarizability in Section 2. In Section 3, we give a brief review of related work and then present a running example in Section 4.
We present a formal model of OLAP based on relational calculus in Section 5. The core of the paper is presented in Sections 6 and 7, where we introduce the types of measures (tally, semi-tally, reckoning, and snapshot) and the definition of additivity. In Section 8, we describe how the designer can determine the summarizability types of measure attributes and apply this information in OLAP design. We also show how the additivity rules are applied to OLAP queries expressed in the MDX language. In Section 9, we give a brief evaluation of the method. Conclusions and future work are given in Section 10.

2. Motivation

According to Lenz and Shoshani [16], correct summarizability requires the following three conditions to hold:

1. disjointness of attribute groups,
2. completeness of grouping, and
3. the combination of the types of the attribute, the dimension, and the aggregation function must be consistent.

The first rule means that an attribute value may roll up to only one group on the higher level in the hierarchy, while the completeness rule means that each value must roll up to some group. These two rules are quite straightforward to check (a sketch of such a check is given below), but some real-world cases do not necessarily fit them. The classical examples are Russia belonging to both Europe and Asia, and Washington D.C. being without a state. The third rule is not so straightforward, since semantics plays an important role in it. Lenz and Shoshani's suggested solution is to divide dimensions into temporal and non-temporal ones and summary attributes (“measures”) into three groups:

• Flow, the “cumulative effect over a period”; the unit is a value per period of time, such as Euro/month, e.g. monthly sales.
• Stock, the “state at a specific point in time”; the unit is a simple unit such as Euro, meter or kilogramme, e.g. inventory.
• Value-per-unit, with a unit of the form x/y, such as price/item, e.g. item price or exchange rate.

The flow type refers to a period of time and is recorded at the end of the period, while the stock type is recorded at a particular point in time. The difference between the stock and the value-per-unit type is the unit. For example, the unit of the product price is e.g. Euro/product, while the unit of the daily sales, i.e. the total sales of all products sold on a day, is Euro or the number of products. Horner and Song [9] give a more detailed description of this topic and provide a classification of aggregation functions based on their applicability to summarization. In short, the flow type describes the change during a period, the stock type the value recorded at a specific point in time, and the value-per-unit type a fraction or rate that can be used to convert units.
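The first two Lenz and Shoshani conditions lend themselves to a mechanical check on the dimension data. The following sketch is our own illustration, not from the paper; the country-to-continent data is illustrative only.

```python
# Sketch: checking disjointness and completeness of a roll-up mapping.
# The data below is illustrative only.
rolls_up_to = {
    "Finland":         {"Europe"},
    "Switzerland":     {"Europe"},
    "Russia":          {"Europe", "Asia"},  # violates disjointness
    "Washington D.C.": set(),               # violates completeness
}

def check_roll_up(mapping):
    problems = []
    for member, parents in mapping.items():
        if len(parents) == 0:
            problems.append(f"{member}: no parent (completeness violated)")
        elif len(parents) > 1:
            problems.append(f"{member}: several parents (disjointness violated)")
    return problems

for p in check_roll_up(rolls_up_to):
    print(p)
```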


Based on this classification, Lenz and Shoshani give rules for detecting the semantic correctness of aggregation operations. Briefly, they claim that it is never correct to use the value-per-unit type with the sum operation, and that the stock type can be used with the sum operation only over non-temporal dimensions. The other aggregation functions (min, max, avg, and range) are correct with both temporal and non-temporal dimensions.

Kimball and Ross [12] categorise measures as follows:

• Additive: measures that can be added across any dimension.
• Non-additive: measures that cannot be added across any dimension.
• Semi-additive: measures that can be added across some dimensions.

We find it useful to identify the following two special cases of the “semi-additive” category:

• Temporally additive: the measure can only be aggregated across the time dimension.
• Non-temporally additive: the measure cannot be aggregated across the time dimension (but can be aggregated across one or more of the other dimensions) [8].

Table 1 combines these categories. While temporal summarizability of flows is semantically meaningful, temporal summarizability for stocks and values-per-unit depends on the statistical function being applied. For instance, monthly income can be summed to annual income, but it makes no sense to sum monthly inventories of cars (at the end of each month) into yearly figures.

Lenz and Shoshani's categorisation of measures into flow, stock and value-per-unit has been widely adopted. In the present paper, we raise several issues with the categorisation, which we summarise as follows.

1. There is a presumption that the data has been pre-aggregated for some time period, such as a day.
2. There is a presumption that a measure whose unit of measurement is a value-per-unit is non-additive.
3. Lenz and Shoshani's categorisation is not exhaustive, since there is a useful class of summarizable attributes which do not fall into any of their categories.
4. The categorisation is not precisely defined and thus it can be ambiguous.

We now discuss these issues in detail. Our first issue is that there is a presumption in Lenz and Shoshani that the data has been pre-aggregated for some time period, such as a day. However, in many applications, the raw (i.e. non-pre-aggregated) data is used. In this case, a fact record contains the details of an individual event, such as a sale, rather than an aggregated flow for a period of time. This creates a problem for Lenz and Shoshani because there can be additive measures in the fact record, yet the measure value does not represent a cumulative effect over a period. Lenz and Shoshani's categorisation could be defended on the grounds that such a value is a cumulative effect over an extremely short period. Such an argument is not valid because there is nothing to prevent two records containing exactly the same timestamp. In view of this, for the case where the data has not been pre-aggregated for a time period, we need to introduce a new class of additive measure.

Our second issue is that a measure with a unit of measurement which is a value-per-unit is not necessarily non-additive. Example 1 demonstrates that such a measure might be semi-additive.

Example 1. Suppose that the fact records contain details of pumps, including the capacity of each pump, for which measure the unit of measurement is litres per minute. If we wish to compute the aggregate capacity of the pumps, we can simply add the individual capacities.
Thus the measure is semi-additive, despite the fact that the unit of measurement makes the attribute a value-per-unit in Lenz and Shoshani's categorisation. Example 1 establishes that we cannot categorise a measure just by reference to its unit of measurement.

Table 1. Measure categories [16,12].

Lenz and Shoshani's category | Description | Kimball and Ross' category | Lenz and Shoshani's examples
Flow (also called 'rate') | A flow records the cumulative effect over a period and is recorded at the end of the period. | Additive | Monthly number of births, annual income
Stock (also called level) | A stock measures and records a simple value at a particular point in time. | Semi-additive | Inventory of cars, number of citizens
Value-per-unit | A value-per-unit measures and records the ratio of two values at a particular point in time. | Non-additive | Item price, cost per unit manufactured, exchange rate

Our third issue is that Lenz and Shoshani's categorisation is not exhaustive. Their notion of a “flow”, i.e. a measurement of a cumulative effect over a period of time, means that the sum operator can be used to aggregate this data in any dimension, including the non-temporal ones. However, some useful measures are cumulative effects over a period, yet it makes no sense to use the sum operator in any dimension other than the temporal one. An example of such a measure is as follows.

Example 2. With weather data, rainfall is measured using a rain gauge. This shows the number of millimeters of precipitation captured by the gauge. Typically, rainfall is measured daily and the rain gauge is emptied after the rainfall has been measured. If we want to know the aggregate rainfall in one location over some number of days, we simply add the data obtained from the rain gauge.


However, if we have one rain gauge in Tampere and another rain gauge in Geneva, it makes no sense to add the rainfall captured by the two rain gauges on a particular day in order to make an assertion about the total rainfall in Tampere and Geneva.

Example 2 establishes that there are useful flow-like measures for which the use of the sum operator is limited to the temporal dimension. These are not captured by Lenz and Shoshani's categorisation.

Finally, Lenz and Shoshani's categorisation is not precisely defined and is thus ambiguous. Consider the following two examples.

Example 3. There is a requirement to measure the flow of water through a pipe. Two different kinds of meters are installed for this purpose. Meter A is a piston meter which measures the flow, in terms of litres per minute, by counting the number of times that a one-litre piston is filled each minute. Meter B measures the same flow, in terms of litres per minute, by measuring the speed at which the water travels through a point in the pipe with a known aperture area. Typically, this will not be based on a single sample, but rather on the root-mean-square of several samples. We have the following table with four columns:

Date | Time | Meter A | Meter B
1/4/13 | 9:01 | 200 | 200

Into which of Lenz and Shoshani's categories should we put the measures Meter A and Meter B? The unit of measurement for Meter B is litres-per-minute, so should it be classified as a value-per-unit? On the other hand, it records a state at a specific point in time, so it could also be classified as a stock. The unit of measurement for Meter A is litres. It is presumably categorised as a flow because it measures the cumulative effect over a period. It seems counter-intuitive to categorise the reading from Meter A as a flow but not that from Meter B; they are just different technologies for measuring what is called “flow” in everyday parlance.

Example 4. A wind speed can be recorded as 60 kph or as 7 on the Beaufort scale. In the first case the speed is a value-per-unit: the unit is an hour and the value is 60 km. But in the second case it is a stock, because it is recorded at a particular instant in time and its unit is not a value-per-unit. Contrast this speed reading with the reading from Meter B in Example 3. In some systems, the flow is computed from the speed with which water travels through a point in the pipe with an aperture of a known size. This is an instantaneous measure of intensity and hence it is the same class of measurement as the wind speed.

These issues also cause challenges when detecting the summarizability features of measures. For example, temperature is not a value-per-unit, but neither can it be a stock or a flow, since temperatures cannot be summed along any dimension. However, if the measure represents a temperature change which has occurred in a period of time, then the data can be summed along the time dimension. There are other problems with the value-per-unit category. For example, the item price is very different from the exchange rate, but both are classified as values-per-unit. The reason why item prices are not summarizable over non-temporal dimensions is not always obvious: if we have a set of item prices (e.g. prices of sold second-hand cars), it is entirely correct to sum them.

3. Related work

Studying summarizability, in the sense we understand it in this paper, started in statistical database research around 30 years ago. Both in statistical databases and in OLAP research, most of the work on summarizability focuses on the multidimensional structure of the OLAP cube, for example [32,25,10,19,21,2,24]. The paper by Mazon et al. [22] is a recent review of the field.

The importance of semantics in aggregation computation has been mentioned in several studies since the 1980s. For example, Johnson [11] studies aggregation computing in relational databases to clarify the semantics of aggregation functions, while Shoshani [34] gives an idea of “automatic aggregation”. It means that the user does not need to define the aggregation operation in detail, but the system can “guess” it based on meta information. More recently, Meo-Evoli et al. [23] give a model in which the summary type is a part of the definition of the semantics of the measure attribute. This summary type indicates which summary function has been used to generate the macro data from the micro data. Further, Cherfi and Prat [35] give a model in which applicable aggregation functions are given by the user for each measure and for each dimension level. Gomes et al. [6] study spatial aggregation and introduce a formal model for integrating geographical and non-spatial data, while Garcia-Garcia and Ordonez [4] focus on aggregations over databases having integrity errors.

Other related work includes, for example, Pu's [31] functional approach to OLAP databases. This approach helps to solve problems related to summarizability; it is in a sense a formalisation of the rule that the function computed from the macro data must be the same as it would be had it been computed from the micro data. Similarly, Lenz and Thalheim [17,18] present an impressive formalism in which summarizability is defined as associativity of the aggregation function. They divide aggregation functions into distributive (count, sum, min and max) and algebraic (average and covariance). Further, they remark that e.g. an average of logarithmic scale values can be


misleading. They also note that “Summarizability cannot be maintained for functions after application of which identification of objects is lost.” This means that it is impossible to detect whether some instances have been taken into account multiple times if the identity of instances is no longer known.

Lehner et al. [15] present a multidimensional normal form that is aware of the summarizability rules of Lenz and Shoshani [16] but does not actually support determining the summarizability types of attributes. Lechtenbörger and Vossen [14], too, study multidimensional normal forms. Pedersen and Jensen [27] categorise data according to the set of functions which can legitimately be used in aggregation operations. In Pedersen and Jensen's model, there is assumed to be an aggregation type associated with each category type in a dimension. Pedersen et al. [29] keep track of which aggregate functions can meaningfully be applied to measures for each dimension by associating an aggregation type with each combination of a measure and a dimension. They use these aggregation types both to prohibit semantically incorrect aggregation and to prevent aggregation when irregular hierarchies may lead to incorrect results. Ariyan and Bertossi [1] focus on the aspect of summarizability relevant to determining whether an aggregation computed at a low level in an aggregation hierarchy can be used to compute an aggregation at a higher level in that hierarchy, thereby avoiding accessing the raw data. Ariyan and Bertossi propose extending Hurtado et al.'s multidimensional model in a way which facilitates the adaptation of some problematic dimension schemata so that the properties of their aggregation hierarchies become compatible with summarizability. Prat et al. [30] introduce a UML model for multidimensional databases and use a rule language to enforce constraints on the use of aggregation functions.

Horner et al. [8] study additivity, i.e. aggregations using the sum operation, in OLAP. Instead of using Lenz and Shoshani's types, they classify measurements in the same way as Kimball and Ross [13], i.e. as additive, semi-additive, and non-additive. They also explain why some indirect measurements, such as temperature, are non-additive, or why some others, such as account balance, are semi-additive. By indirect measurements they mean values that are indirect measurements of additive base attributes; for example, the temperature is a measurement of kinetic energy. Finally, they give a class of categorically non-additive measures, such as basket counts. Another work not using the summary attribute types of Lenz and Shoshani is a paper by Dittrich et al. [3]. They study the shortcomings of SQL database systems in OLAP usage. They list, for example, the lack of understanding of measure units, and describe the use of functional dependencies in detecting summarizability.

Kimball and Ross [12] have been credited with the additive/semi-additive/non-additive categorisation of measures. From their wide experience of implementing data warehouses, they recommend that the designer must specify, in the metadata, what the measure type of each attribute is [13]. We follow Kimball and Ross' recommendation in the present paper.

In view of this, we conclude this section by enumerating the issues in these approaches which we believe require further attention. Those that are addressed in this paper are marked with an X below; issues remaining open are marked with an O and are discussed in Section 10.
1. X The summary attribute types do not have formal definitions, which makes it difficult to detect what the correct type of an attribute is.
2. X There is no work combining the measure units and statistical scales of summary attributes.
3. X The temporal dimension plays a special role in most of the existing work, but it need not be so. Some other dimension that defines a linear total ordering can play the same role with respect to cumulative measure attributes. For example, a spatial dimension can be used in a similar way in some applications.
4. O In addition to the common aggregation functions (sum, count, average, min/max), there are also some other commonly used operations, such as taking the last item. The existing research does not cover these.
5. O It is a common practice to create new measures by combining existing measures in arithmetic expressions. Summarizability of these new measures is not covered in existing research.
6. O There is no proof that any of the various categorisations of summary types is complete or minimal.

4. Running example

Our running example is a simplified OLAP analysis database on real supply chain data of a sporting goods company with several international brands. The data comes from a transactional database containing, for example, sales orders, order lines, deliveries, and inventory levels. In the case at hand, the aim is to study how external conditions such as the weather affect supply chain performance. Therefore, we focus on three tables: order lines (Table 2), inventory (Table 3), and weather conditions (Table 4). The inventory table has two measure columns and a row for each warehouse transaction: “change” for the size of the current transaction and “total count” storing data on how many products exist in the warehouse. The latter is a cumulative measure over the date dimension.

Table 2. Order lines table.

Time | Country | Customer | Product | Count | Price | Currency
Jan 02 2011, 12:00:00 | FI | Sport store | Football | 100 | 10 | EUR
Jan 02 2011, 12:00:10 | CH | Ski store | Alpine ski | 200 | 20 | CHF
Jan 02 2011, 12:00:30 | US | Golf store | Golf ball | 400 | 100 | USD


Table 3. Inventory table.

Date | Product | Group | Change | Total count
Jan 02 2010 | Cross-country skis | Skiing | −100 | 1000
Jan 02 2011 | Football | Ball sports | −200 | 1000
Mar 02 2011 | Cross-country skis | Skiing | −500 | 2400
Jun 02 2010 | Golf balls | Ball sports | 1000 | 4000
Aug 02 2011 | Alpine ski | Skiing | 1000 | 2000

Table 4. Weather conditions table.

Date | Location | Temperature | Precipitation | Snow level
Jan 02 2010 | Davos-CH | −5 | 5 | 20
Jan 02 2011 | Zermatt-CH | −10 | 0 | 50
Mar 02 2011 | Geneva-CH | 15 | 5 | 0

From the transactional database we form a data cube for each product group by aggregating the data to day level (a sketch of this aggregation step follows the list below). The resulting cube, with six measure attributes, is illustrated in Table 5:

• Inventory, the total value of products in the inventory.
• Sold items, the number of sold items.
• Sales in Euro, the total sales.
• Temperature, precipitation, and snow level indicate the weather on the given day.
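As an illustration of this aggregation step, the following sketch (our own, not from the paper) rolls order-line facts up to a day/country level. The field names loosely follow Table 2, and the sketch keeps sales in the local currency, so the measure unit depends on the country, as discussed below.

```python
from collections import defaultdict

# Sketch: pre-aggregating order lines (cf. Table 2) to a day/country cube.
# Field names and data are illustrative.
order_lines = [
    {"day": "2011-01-02", "country": "FI", "count": 100, "price": 10,  "currency": "EUR"},
    {"day": "2011-01-02", "country": "CH", "count": 200, "price": 20,  "currency": "CHF"},
    {"day": "2011-01-02", "country": "US", "count": 400, "price": 100, "currency": "USD"},
]

cube = defaultdict(lambda: {"sold_items": 0, "sales": 0.0})
for line in order_lines:
    cell = cube[(line["day"], line["country"])]
    cell["sold_items"] += line["count"]             # tally measure: number of sold items
    cell["sales"] += line["count"] * line["price"]  # tally measure; unit (currency) depends on country

for key, measures in sorted(cube.items()):
    print(key, measures)
```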

The OLAP dimensions in the example are the date and the country. Sold items and sales measure how many product items have been sold and what the total value of the sales is on a day in a country. These values can be summarised along the time or country dimension by using the sum function. Inventory, i.e. the total number of a product in the inventory, comes from the Inventory table and is defined as the sum of all manufactured products minus the sum of all sold products from the beginning of the season until the current day. (We assume the inventory level of the products is zero at the start of the season, which is normal for seasonal products.) It is a cumulative measure similar to an account balance. It is not additive along the time dimension. The table also has three weather measure columns. The precipitation values can be summed over time but not over the location dimension, while the temperature and snow level can only be aggregated using min/max or average functions. One additional feature in the example is the area dimension. The same products can be available in different countries and thus in different currencies. In this way, the currency is a characteristic of the measure and it depends on the country. An obvious consequence is that prices cannot be directly summarised; the currencies must be converted first. After doing this it is correct, for example, to calculate the average price of the products sold in different countries.

Table 5. A part of an example OLAP cube as a relation (measure type and scale columns omitted for brevity).

Date | Country | Inventory | Sold items | Sales | Temp. | Pre. | Snow lev.
11-01-02 | FI | 1200 | 1000 | 3400 | −5 | 10 | 200
11-01-02 | CH | 4000 | 10 | 1400 | 5 | 5 | 0
11-03-02 | US | 5000 | 500 | 5000 | 10 | 0 | 0

5. Relational OLAP model

Before giving formal definitions for measure types and summarization rules, we need a formal model for representing the OLAP cube. Based on a formalised OLAP model [26], we define the OLAP schema and the cube instance using the relational database model. This model is only used as an abstract model and is not specific to any OLAP implementation. Using the relational model is a common practice in data warehousing. For example, Kimball and Ross state that, in a typical data warehouse, the data from operational databases is first landed in a star schema structure, from which it may subsequently be loaded into an OLAP cube in order to perform complex data calculations [13]. Our idea of also representing measure types using the relational model is inspired by Kimball and Ross [13] as well, since they state, as the first of their criteria for what makes a data warehouse dimensional, that the measure type for each measure should be recorded in the metadata.

The measure type declaration should (1) distinguish a dimension entity from a measurement entity and (2) declare the fact as either additive, semi-additive or non-additive. The declaration also associates a default aggregation technique with the fact.

Our model can be seen as a denormalised star schema having both facts and dimensions in the same table. Denormalisation is commonly used in data warehousing [33], but in this study the model is used only as an abstract formalism: our results are not bound to any particular implementation technique. A similar model is used, for example, by Malvestudo et al. [20].

Definition 1 (OLAP cube schema and instance). The OLAP cube and its instance are defined as follows:

1. A dimension schema Di (1 ≤ i ≤ n) and a measure set M are disjoint sets of attributes.
2. An OLAP cube schema C = D1 ∪ D2 ∪ … ∪ Dn ∪ M, where D1, …, Dn are dimension schemata and M is the set of measure attributes with their units and statistical scales, m1, m1u, m1s, …, mp, mpu, mps.
3. Dimension level attribute values in each dimension must form a complete and disjoint hierarchy. This means that there is a many-to-one relationship from the lower-level attribute to the higher-level attribute. The dimension levels form a linear ordering.
4. An OLAP cube instance c is a relation over the OLAP cube schema C, and a relation d over a dimension schema D is called a dimension instance.
5. It is assumed that the measure set M is not empty and that the cube has at least one dimension.

The definition follows the relational database model, assuming that duplicate tuples do not exist. As discussed in Section 6.1, in practical data warehouses it is invariably better to aggregate tuples with duplicate dimension keys during the ETL process. Further, we highlight the roles of the measure unit mu and the statistical scale ms of the measure in the definition, since OLAP tuples represent measurements of some real-world phenomenon and it is naturally important to know what the unit of the measurement is and what the scale of the measure is. Both of these also have essential roles when detecting the summarizability of aggregation operations. The measure unit can be a common unit such as meter, kilogramme, or Euro, but it can also depend on a dimension, as the currency depends on the country.

If the dimension contains all attribute values that may exist (in the domain), then we say that the dimension is domain-complete. We do not generally assume that dimensions are domain-complete. If there is a functional dependency between a dimension and a measure unit, we say that the measure unit depends on the dimension. For example, the currency of the sales value can depend on the country.

Definition 2. The measure unit mu depends on the dimension schema Di if there is a functional dependency Di → mu in the OLAP cube c.

Additionally, we require the attribute set formed by taking the union of the attributes on the lowest hierarchy level in each dimension schema to be a key for C. This guarantees that all measures depend on all dimensions [3]. This also affects summarizability, since it prevents situations in which we would summarise values over a dimension that does not determine the measure. For example, the properties of the product would depend only on the product name, not on the customer who bought it. Since our current work focuses on summarizability issues other than dimension-hierarchy-related issues, we assume that dimension hierarchies are complete and disjoint. Hence they do not cause incorrect summarizability [16]. This requirement means that in each dimension, the levels can be ordered in such a way that each member on any but the highest level is associated with exactly one member on the next higher level.

Since the OLAP cube according to our definition is a relation, it can be manipulated by using relational algebra. Regarding summarizability, we are interested in operations eliminating dimension attributes. We focus on two operations: 1) roll-up and 2) slice. In a roll-up operation we remove the most detailed dimension level and start to observe the data at a more general level. For example, instead of daily data, we want to analyse monthly data. Slicing in some dimension Di means taking a single member of some level in Di and using that member as a filter. This is analogous to the select operation of relational algebra. For example, we can select only values associated with a particular year. Both of these operations can affect summarizability. The slice operation can make it possible to aggregate values by selecting only compatible values, such as values in the same currency. In our model, we assume that a roll-up operation is performed by a series of aggregation operations, defined below. Each aggregation operation removes the most detailed level of one dimension. By a series of aggregation operations, it is possible to go up as far as is required, in several dimensions. We define the aggregation operation in the style of relational algebra.

Definition 3 (Aggregation). Let c be an OLAP cube over C and Di ⊂ C a dimension schema over which we aggregate. The lowest level A ∈ Di is eliminated by aggregation. This is denoted ∑Di(c). The aggregation operation returns an OLAP cube c′ over the schema C′ = (D1 ∪ … ∪ Di−1 ∪ (Di − A) ∪ … ∪ Dn ∪ M) such that c′ = {t′ = (d1 d2 … di−1 d′i … dn m′1, m′1u, m′1s, …, m′p, m′pu, m′ps)}, where d′i means a tuple over (Di − A), and for each t′ ∈ c′ it holds that

1. there exists some t ∈ c such that t[D1 ∪ … ∪ (Di − A) ∪ … ∪ Dn] = (d1 d2 … d′i … dn),
2. m′iu = miu, i = 1, …, p,
3. m′is = mis, i = 1, …, p, and
4. t′[m′i] = ∑ {t[mi] | t ∈ c and t[D1 ∪ … ∪ (Di − A) ∪ … ∪ Dn] = (d1 d2 … d′i … dn)}.

Definition 3 assumes that (a) the type of the measure, (b) the unit of measurement and (c) the statistical scale of the measure attribute remain the same after the aggregation function has been applied.

Example 5. Suppose an OLAP cube c over a schema (day, group, product, value):

Day | Group | Product | Value | Unit | Scale
11-Jan-1 | Alpine skis | P-11 | 10 | Euro | ratio
11-Jan-1 | Alpine skis | P-12 | 10 | Euro | ratio
11-Jan-2 | Cross-country skis | P-21 | 20 | Euro | ratio
11-Jan-2 | Cross-country skis | P-22 | 10 | Euro | ratio
11-Jan-3 | Golf balls | P-31 | 20 | Euro | ratio
11-Jan-3 | Footballs | P-32 | 30 | Euro | ratio

After applying the aggregation c′ = ∑Product(c), we get:

Day | Group | Value | Unit | Scale
11-Jan-1 | Alpine skis | 20 | Euro | ratio
11-Jan-2 | Cross-country skis | 30 | Euro | ratio
11-Jan-3 | Golf balls | 20 | Euro | ratio
11-Jan-3 | Footballs | 30 | Euro | ratio
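A minimal sketch of the aggregation operation of Definition 3, reproducing Example 5, is given below. This is our own illustration, implemented as a group-by-sum; it assumes, as the definition does, that unit and scale are unchanged by the aggregation.

```python
from collections import defaultdict

# Sketch: Definition 3 as a group-by-sum that removes the lowest dimension
# level (here, product) and carries the unit and scale along unchanged.
cube = [
    ("11-Jan-1", "Alpine skis",        "P-11", 10, "Euro", "ratio"),
    ("11-Jan-1", "Alpine skis",        "P-12", 10, "Euro", "ratio"),
    ("11-Jan-2", "Cross-country skis", "P-21", 20, "Euro", "ratio"),
    ("11-Jan-2", "Cross-country skis", "P-22", 10, "Euro", "ratio"),
    ("11-Jan-3", "Golf balls",         "P-31", 20, "Euro", "ratio"),
    ("11-Jan-3", "Footballs",          "P-32", 30, "Euro", "ratio"),
]

def aggregate_out_product(cube):
    grouped = defaultdict(int)
    meta = {}
    for day, group, _product, value, unit, scale in cube:
        grouped[(day, group)] += value        # sum the measure values per remaining key
        meta[(day, group)] = (unit, scale)    # unit and scale assumed unchanged
    return [(day, group, value, *meta[(day, group)])
            for (day, group), value in grouped.items()]

for row in aggregate_out_product(cube):
    print(row)
```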

It should be noted that our definition of aggregation does not guarantee the correctness (summarizability) of the results of aggregation. In the example above, we may have summed elements of c that have different units, without converting them first. Identifying such problems is discussed in detail in the next sections.

6. Measure types

In statistics, the scales of variables can be classified as “nominal”, “ordinal”, “interval”, and “ratio” scales [37]. The statistical scale limits the usage of aggregation functions. For example, the sum usually requires the ratio scale, the average is meaningful for interval and ratio scales, and the mode (the most common value) can be computed for all scales. The difference between the interval and the ratio scale is that for the ratio scale there is a meaningful zero. For instance, temperature in Celsius is interval and not ratio, as the zero point is artificial and there is no point in saying that 10 degrees is twice as warm as 5 degrees.
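One way to make the scale restriction operational is a simple lookup from scale to permitted aggregation functions. The sketch below is our own reading of the discussion above; treat the table as illustrative rather than exhaustive.

```python
# Sketch: which aggregation functions are meaningful for which statistical scale.
ALLOWED = {
    "nominal":  {"count", "mode"},
    "ordinal":  {"count", "mode", "min", "max"},
    "interval": {"count", "mode", "min", "max", "avg"},
    "ratio":    {"count", "mode", "min", "max", "avg", "sum"},
}

def scale_permits(scale: str, function: str) -> bool:
    return function in ALLOWED.get(scale, set())

print(scale_permits("interval", "sum"))  # False: e.g. temperatures in Celsius
print(scale_permits("ratio", "sum"))     # True: e.g. sales in Euro
```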

6.1. Cube and dimension terminology

Before giving definitions for the different types of measure attributes, we define the terminology to be used as a base for further definitions.

Definition 4 (Interval scale ordering dimension). A dimension is an interval scale ordering dimension (“ordering dimension” for short) if, for each level in the dimension, the set of all members on that level forms a linear order.

In many cubes, the ordering dimension is the time dimension. But it is possible for the linear ordering to be something other than time. For example, if the subject matter of the cube is earthquake damage, the linear ordering could be the (linear) distance from the epicentre of the earthquake. In this example the linear ordering is spatial, rather than temporal. In this sort of application, there may also be a time dimension, but it merely provides timestamps for earthquake events so that records relating to the same earthquake event can be aggregated. We consider it very rare that there would be more than one ordering dimension. Thus, we assume that a cube contains at most one ordering dimension.

Definition 5 (Ordered cube). An ordered cube is a cube in which at least one of the dimensions is an ordering dimension.

A cube on which the user may want to perform aggregations is initially computed from tabular or relational data, and the measure values come from tables called fact tables. If the measure values in the cube are taken from the fact tables directly, without aggregating them, we say that the cube is a non pre-aggregated cube; otherwise it is a pre-aggregated cube. Clearly, pre-aggregation and ordering are independent properties of a cube. Further, if a measure uniquely records quantitative information which is specific to the event associated with the fact record, we call the measure a simple event measure. Thus, for instance, monthly store inventory counts are


generally not simple event measures, because the same items are included in the counts every month as long as they are in the store.

Let D be a dimension of cube c (pre-aggregated or not). Then, the lowest-level attribute of D is called the granularity of c (for D). For some of our discussion, it is convenient to refer directly to the dimension over which the user is aggregating or intending to aggregate.

Definition 6 (Aggregation dimension). An aggregation dimension is a dimension over which the user intends to perform or performs an aggregation operation.

Example 6. Consider the following non pre-aggregated ordered cube c with one (ordering) dimension which has two levels, month and day:

Month | Day | Value
Jan | 11 | 10
Jan | 12 | 20
Feb | 21 | 30
Feb | 22 | 40

Suppose this ordering dimension is also an aggregation dimension. After an aggregation operation removing the day level, we get a pre-aggregated cube c′ as follows:

Month | Value
Jan | 30
Feb | 70

Example 7. In our sport company example, consider the following two approaches that a cube designer might take.

Design 1: Non pre-aggregated ordered cube. In the relation r over the cube schema, there is a record for every occurrence of a sales order of a product. So, for example, if there were ten sales orders for alpine skis in Switzerland on a particular day, then r will contain ten separate records with the details. Obviously, each sales order can involve many products.

Design 2: Pre-aggregated ordered cube. In the ETL process, all of the individual sales orders for a particular product in a particular sales area on a particular day are aggregated up to a daily total. In this design, there is just one record for all ten of the separate sales orders.

Both of the cubes are ordered, having time as their ordering dimension, but Design 1 is not a pre-aggregated cube, since it contains the original fact table data. Design 2, however, is a pre-aggregated ordered cube, because it does not contain the original fact table data: instead, it contains a tally of all of the events that occur in an interval in the total ordering associated with the ordering dimension; in our case, the measurement is the total of all sales of alpine skis in Switzerland at any point in time during a particular day.

Typically, when designing a pre-aggregated ordered cube, the designer will have in mind a specific granularity for the time dimension (a day, in our example). That granularity will be the lowest level in the hierarchy associated with the time dimension. If fact records in r include tally measures which reference a day value in the time dimension, there is a presumption that this refers to the time interval, rather than to a point in time.

Based on the definitions above, we define three main types of measure attributes: 1) the tally measure, 2) the reckoning measure, and 3) the snapshot measure. Our definitions specify how measures of different types in a non pre-aggregated ordered cube are mapped to measures in a pre-aggregated ordered cube.

6.2. Tally measure

In a non pre-aggregated cube, fact records are associated with occurrences of a specific class of events, the recording of which is the purpose of the cube design. Each fact record is associated with the occurrence of a specific event in that event class. Some of the measures in the fact record will contain the quantitative information which is intrinsic to the event. We refer to such measures as simple event measures. For example, in a sales cube, simple event measures may include the quantity sold and the revenue received from the sale in a specific sales event. Other measures in the fact table might record additional information which, though not intrinsic to this class of events, is of potential interest. Such measures could include, for example, environmental conditions at the time of the sale or the prevailing currency exchange rate. We shall categorise such supplementary measures separately.


Definition 7 (Semi-tally measure). Let c be an ordered non pre-aggregated cube over a schema C with an ordering dimension Di, and let m be a simple event measure. Let A ∈ Di be the lowest level of the hierarchy in Di and c′ = ∑Di(c). Now c′ is a pre-aggregated ordered OLAP cube over a schema C′ = C − A, and m′ ∈ C′ is a semi-tally measure composed from the values of m over the ordering dimension Di. If b′ = ∑Dj(b) is a pre-aggregated ordered cube with an ordering dimension Dj and a semi-tally measure q, then b″ = ∑Dj(b′) is also a pre-aggregated ordered cube and q′ ∈ b″ is a semi-tally measure over the ordering dimension Dj (in other words, further aggregations preserve a semi-tally measure in the cube).

As the definition indicates, a semi-tally measure represents a change at a point in time or during a fixed period, or a value measured over some period of time, e.g. daily sales. A semi-tally measure can only exist in an ordered cube. The measurements are independent, that is, the value of a measurement does not depend on the previous value. In Example 2, we demonstrated that some measures, even semi-tally measures, are additive over time but can be non-additive over non-time dimensions. This is because in Example 2 the geographical dimension is complete but not domain-complete.

Definition 8 (Tally measure). If m is a semi-tally measure in a cube in which all the dimensions are domain-complete, then m is a tally measure.

6.3. Reckoning measure

The second special measure type of interest, the reckoning measure, can exist in unordered, ordered non pre-aggregated, or ordered pre-aggregated cubes. A reckoning measure is a measurement such as an inventory level. Naturally, it can be measured using various units, such as the number of items, monetary value, volume, weight, etc. Before giving the definition of the reckoning measure, we give the following definition:

Definition 9 (Ordering self-closure). Let r be a cube over a schema C, let Di be an ordering dimension, A the lowest level of Di, and m a measure in C. The ordering self-closure of r with respect to Di and m is r⁺(Di, m) = π(Di ∪ m′)∖m (r ⋈A ≤ A′ ρ(A′/A, m′/m)(r)), where ⋈ is the θ-join and ρ is the rename operation of relational algebra. Even though m is, for clarity, replaced by an attribute name m′ in the schema, in practice we would still rename m′ back into m.

Definition 10 (Reckoning measure). Suppose c is a non pre-aggregated ordered cube over a schema C with an ordering dimension Di and a measure m. Then, c′ = ∑Di(c⁺(Di, m)) is a cube with a reckoning measure m′.

Example 8. Suppose the following relation r:

A | m
1 | 10
2 | 20
3 | 30
4 | 40

Now the ordering self-closure of r with respect to A and m, r⁺(A, m), is the projection onto A and m′ of the following relation:

A | m | A′ | m′
1 | 10 | 1 | 10
2 | 20 | 1 | 10
2 | 20 | 2 | 20
3 | 30 | 1 | 10
3 | 30 | 2 | 20
3 | 30 | 3 | 30
4 | 40 | 1 | 10
4 | 40 | 2 | 20
4 | 40 | 3 | 30
4 | 40 | 4 | 40


And after applying the aggregation operation ∑A′(r⁺), we get the reckoning measure m′ as follows:

A | m | m′
1 | 10 | 10
2 | 20 | 30
3 | 30 | 60
4 | 40 | 100
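The self-closure-and-aggregate construction amounts to a running total over the ordering dimension. The sketch below is our own illustration; it reproduces Example 8 both literally via the θ-join of Definition 9 and via a direct cumulative sum.

```python
# Sketch: the reckoning measure of Example 8, computed (a) via the ordering
# self-closure (join with A' <= A) and (b) as a running total.
r = [(1, 10), (2, 20), (3, 30), (4, 40)]  # tuples (A, m)

# (a) ordering self-closure, then aggregate m' per A
closure = [(a, a2, m2) for (a, _m) in r for (a2, m2) in r if a2 <= a]
reckoning = {a: sum(m2 for (a1, _a2, m2) in closure if a1 == a) for (a, _m) in r}
print(reckoning)  # {1: 10, 2: 30, 3: 60, 4: 100}

# (b) equivalent running total over the ordering dimension
total, running = 0, {}
for a, m in sorted(r):
    total += m
    running[a] = total
print(running)    # same result
```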

In practice, it is often the case that inventory levels must be checked independently of the sum of tally measures. For example, the actual amounts received and dispatched might not match the recorded amounts, or some stock might be stolen. Regardless of whether the measure is computed as a cumulative sum or obtained as an independent check, its additivity properties are identical, and hence both are referred to as a reckoning. As mentioned earlier, there is a connection between tally and reckoning measures: a tally measure can be defined as the difference between two reckoning measures measuring the same thing in successive time periods, and a reckoning measure can be defined as a sum of tally measures. For example, daily profit = balance in the evening − balance in the morning, and balance in the evening = ∑ (revenue from sales events in the day) + balance in the morning.

6.4. Snapshot measure

Our final measure type, the snapshot measure, covers all measures which are not included in the previous categories. Usually a snapshot is an instantaneous measurement, such as temperature or air pressure. According to Horner et al. [8], such measures of intensity are usually indirect measurements of base data, but it is not practical to store the base data (e.g. temperature represents the movement of molecules, and speed represents distances and times). A snapshot is always a measurement represented by independent tuples in the database. It has the same form in both non pre-aggregated and pre-aggregated cubes. We define the snapshot measure as follows:

Definition 11 (Snapshot measure). A measure is a snapshot measure if it is defined to be non-additive over all dimensions in the cube.

Conversion factors have a special role in expressions involving measures. A conversion factor is a snapshot measure which is used, primarily, in multiplication operations in order to convert the unit of measurement of some other measure. A conversion factor can be thought of as representing a quotient of two measurements (which are not necessarily explicitly stored) and it can be used in multiplications with tally or snapshot measures to convert measure units. An example of this is converting currencies using exchange rates. In some cases the required conversions can be done in the ETL (Extract-Transform-Load) process, but sometimes this is not possible, or it is more practical to operate with the original values.

7. Additivity

In this section we focus on additivity, that is, on using the sum as the sole aggregation function. The application of other aggregation functions is discussed in Section 8. Horner et al. [8] define additivity as follows: “A measure is additive along a dimension if the sum operator can be used to meaningfully aggregate values along all hierarchies in that dimension.” Later in this section, we extend the definition by giving precise rules indicating when the results are meaningful, but first we list situations in which the sum operation cannot be applied. Based on the literature (e.g. [22]), we can identify five main problems with additivity:

1. The measure does not depend on all dimensions. Example: in the sport company realm, having a measure like “the number of product units” that does not depend on the sales area. If we aggregate over the sales area dimension, the same products can be counted several times.
2. Not all values are taken into account, or some values are counted multiple times, due to incomplete or overlapping dimension hierarchies.
Example: a product belongs to two different product groups.
3. Some values are counted multiple times due to the cumulative nature of a measure. Example: summarising the daily inventory values over the time dimension.
4. Incorrectness because of different units. Example: adding prices that are expressed in different currencies (a sketch of a check for this is given below).
5. Using aggregation operations that are not suitable for the scale of the measure. Example: using the sum operation for zip codes.

The first two problems are prevented by our definition of the OLAP cube, which requires that the dimensions form a key for the OLAP cube and that the hierarchies be complete and disjoint. The third problem is related to cumulative measures, the fourth one to measure units, and the final one to the statistical scales of measures. Additivity here means the correct use of the sum function. Being non-additive does not mean that the result cannot be computed in some other way. In the case of the reckoning measure, using the last-child operation gives a correct result; for example, the inventory of the month usually means the same as the inventory of the last day of the month. In the case of a measure composed as a quotient of two measures, we should compute the numerator and denominator separately.
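Problem 4 above can be caught mechanically by inspecting the measure-unit attribute before applying the sum. The sketch below is our own illustration, with made-up exchange rates serving as conversion factors; it refuses to add values whose units differ unless they have been converted first.

```python
# Sketch: guarding against problem 4 (mixed units) before applying SUM.
# Exchange rates here are made up and serve only as illustrative conversion factors.
rows = [("FI", 3400, "EUR"), ("CH", 1400, "CHF"), ("US", 5000, "USD")]
TO_EUR = {"EUR": 1.0, "CHF": 1.05, "USD": 0.9}   # illustrative conversion factors

def safe_sum(rows):
    units = {unit for _dim, _value, unit in rows}
    if len(units) > 1:
        raise ValueError(f"mixed units {units}: convert before summing")
    return sum(value for _dim, value, _unit in rows)

converted = [(dim, value * TO_EUR[unit], "EUR") for dim, value, unit in rows]
print(safe_sum(converted))   # fine: all values now share the unit EUR
# safe_sum(rows) would raise, because EUR, CHF and USD are mixed
```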


7.1. Requirements for additivity

Additivity means that aggregations using the sum function are correct. Additivity, according to our definition given later in this section, depends on the dimension and the measure. We define an aggregation operation in an OLAP cube to be correct if:

1. The result of the aggregation on each hierarchy level is the same as that calculated directly from the micro-data.
2. The measure unit of each value of the measure attribute participating in the aggregation is the same.
3. The statistical scale of the measure attribute is the ratio scale.

The first rule guarantees that in aggregations each tuple in the micro-data is counted exactly once. The second rule guarantees that each measure value in the aggregation operation has the same unit; for example, we cannot sum meters and feet. Finally, the third rule guarantees that the addition operation is defined for the measure attribute; for example, we cannot calculate the sum of zip codes.

We can assume that for any pre-aggregated cube c there exists a non pre-aggregated cube, say c′, with an identical set of attributes except for a finer-grained attribute, say A, at the bottom level of the hierarchy in a dimension Di. Each tuple in c maps onto a unique set of tuples in c′, and we obtain the pre-aggregated cube c from c′ such that ∑Di(c′) = c. Based on this, we give a more formal definition of additivity. For this definition, we need a function that indicates the values needed when calculating the measure values in aggregation, as defined in Definition 3.

Definition 12 (Additivity). Let c be a cube over schema C, D an aggregation dimension in C, and m a measure in c. Let c′ be a cube obtained by the aggregation ∑D(c). Then assume the function f(c′,D,t,m), which returns the set of measure values from c that are used to calculate t[m] ∈ c′. We say that the measure m is additive in c for dimension D if:

1. c is either a non pre-aggregated cube or a cube produced from a cube c″ such that m was additive in c″ for dimension D,
2. for each tuple t ∈ c′, t[m] = ∑ (f(c′,D,t,m)),
3. the measure units of the measure values returned by f(c′,D,t,m) are the same, and
4. the scale of each measure value returned by f(c′,D,t,m) is the ratio scale.

We know that the function f(c′,D,t,m) must always exist, since we have the pre-aggregated cube c. For a tally measure, f(c′,D,t,m) returns the set of measure values associated with the given granularity of the ordering dimension. For a reckoning measure, we must use the ordering self-closure of c instead of c itself.

Example 9. Consider the following cubes c and c′. The cubes have only one dimension, namely time, and two measures, amount and total amount. Amount is a tally measure and total amount is a reckoning measure. Non pre-aggregated cube c (micro-data):

Month | Day | Amount | Total amount
Jan | Jan-1 | 10 | 10
Jan | Jan-1 | 20 | 30
Jan | Jan-1 | 10 | 40
Jan | Jan-2 | 5 | 45
Feb | Feb-2 | 15 | 60

Pre-aggregated cube c′ = ∑Time(c):

Month | Day | Amount | Total amount
Jan | Jan-1 | 40 | 80
Jan | Jan-2 | 5 | 125
Feb | Feb-2 | 15 | 185

Now we see that for all values of amount, ∑ (f(c′,D,t,m)) is the same as the corresponding value in the cube c′. For example, let t1 be the first row of c′ above. Then f(c′,Time,t1,Amount) returns the values 10, 20 and 10, whose sum is 40. This means that amount in c is additive over the time dimension. By contrast, the reckoning measure total amount is not additive over the time dimension. Let t2 be the second row in c′ above. Now f(c′,Time,t2,TotalAmount) returns the values 10, 20, 10 and 5, whose sum is 45, but in c′ the corresponding value is 125. This indicates that total amount is not additive over the time dimension.
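Definition 12 suggests a direct test: recompute each aggregated cell from the micro-data and compare. The following sketch is our own illustration, not the paper's algorithm; it applies that test to the data of Example 9 and flags total amount as non-additive.

```python
from collections import defaultdict

# Sketch: testing additivity (Definition 12) by recomputing each cell of the
# pre-aggregated cube from the micro-data and comparing. Data from Example 9.
micro = [  # (month, day, amount, total_amount)
    ("Jan", "Jan-1", 10, 10), ("Jan", "Jan-1", 20, 30), ("Jan", "Jan-1", 10, 40),
    ("Jan", "Jan-2", 5, 45), ("Feb", "Feb-2", 15, 60),
]
pre_aggregated = {  # day -> (amount, total_amount), as in the second table
    "Jan-1": (40, 80), "Jan-2": (5, 125), "Feb-2": (15, 185),
}

def additive(measure_index):
    sums = defaultdict(int)
    for row in micro:
        sums[row[1]] += row[2 + measure_index]   # sum the measure per day
    return all(sums[day] == pre_aggregated[day][measure_index]
               for day in pre_aggregated)

print(additive(0))  # True:  amount is additive over the time dimension
print(additive(1))  # False: total amount (a reckoning measure) is not
```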


Theorem 13 (Measure types and additivity). Based on the definition of additivity, we can show that if the rules 2 and 3 are valid, the measure types have the following properties:

Measure type | Ordering dimension | Non-ordering dimension
Tally | Additive | Additive
Semi-tally | Additive | Not known
Reckoning | Non-additive | Not known
Snapshot | Non-additive | Non-additive

Both semi-tally and reckoning measures are additive over a domain-complete non-ordering dimension.

Proof. Let c be an OLAP cube over a schema C, D1 ⊂ C an ordering dimension, D2 ⊂ C a non-ordering dimension, and m a measure.

1. First, assume that m is a tally measure. Based on Definition 1, the dimension hierarchies are complete and disjoint. This means that each level attribute, apart from the highest one, maps to exactly one attribute on the next higher level. Thus, each value is counted exactly once. There is no difference whether the aggregation dimension is ordering or not, since non-ordering dimensions are domain-complete according to Definition 8. Thus, m is additive over both ordering and non-ordering dimensions.
2. Next, assume that m is a semi-tally measure. First of all, if D2 is domain-complete, then the case is similar to the tally measure and m is additive over D2. As m is a semi-tally measure, D2 may not be domain-complete, which may mean that m is not additive over D2, as in Example 2. Thus, m is additive over a domain-complete non-ordering dimension, but we cannot generally know whether m is additive over a non-ordering dimension.
3. Now, let m be a reckoning measure. By the definition of the reckoning measure, cube c is a pre-aggregated cube, and there is a non pre-aggregated cube b such that c = ∑D1(b⁺(D1, m)). We will show that when calculating c, values of m are generally counted several times. Based on Definition 10, the reckoning measure is constructed by a θ-join where the join condition is D1 ≤ D1. Now, if the ordering dimension D1 has more than one value, there exists at least one tuple t such that t[D1] < the maximum value of t′[D1]. Thus, the tuple t is joined at least to itself and to t′. This means that it is counted more than once in an aggregation, thereby violating additivity. The situation with D2 is the same as with the semi-tally measure: if D2 is domain-complete, then m is additive over D2, but we cannot generally know whether m is additive over non-ordering dimensions.
4. Finally, assume that m is a snapshot measure. Then, by the definition of the snapshot measure, m is non-additive. □

Although the table in Theorem 13 is reminiscent of the table presented earlier by Lenz and Shoshani [16], there are two clear differences: 1) our categorisation of measure types is well-defined and cannot be mapped to theirs, as explained in Section 3, and 2) their temporal dimension is only a sub-class of our ordering dimension.
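The additivity properties of Theorem 13 can be encoded as a small lookup. The sketch below is our own illustration; "unknown" stands for the "Not known" entries, which resolve to additive only when the non-ordering dimension is domain-complete.

```python
# Sketch: Theorem 13 as a lookup table from (measure type, dimension kind)
# to an additivity verdict.
ADDITIVITY = {
    ("tally", "ordering"): "additive",         ("tally", "non-ordering"): "additive",
    ("semi-tally", "ordering"): "additive",    ("semi-tally", "non-ordering"): "unknown",
    ("reckoning", "ordering"): "non-additive", ("reckoning", "non-ordering"): "unknown",
    ("snapshot", "ordering"): "non-additive",  ("snapshot", "non-ordering"): "non-additive",
}

def additivity(measure_type, dimension_kind, domain_complete=False):
    verdict = ADDITIVITY[(measure_type, dimension_kind)]
    if verdict == "unknown" and domain_complete:
        return "additive"   # semi-tally and reckoning measures are additive over a
                            # domain-complete non-ordering dimension
    return verdict

print(additivity("reckoning", "ordering"))                              # non-additive
print(additivity("semi-tally", "non-ordering", domain_complete=True))   # additive
```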

7.2. Additive OLAP cube

To conclude the requirements of additivity, we give the following corollary:

Corollary 14 (Additive OLAP cube). Let c be a valid OLAP cube over a schema D1 ∪ … ∪ Dn ∪ M ∪ Mu ∪ Ms, where Di is the set of dimension attributes, M is the set of measure attributes, and Mu and Ms are the sets of measure units and scales. The cube c is additive over any dimension and measure combination if:

1. There are no snapshot type measures in c.
2. If there are any measures of type reckoning, then there is no ordering dimension and the non-ordering dimensions are domain-complete in c.
3. If there are any semi-tally measures, then the only dimension in c is the ordering dimension.
4. For each measure m in c, all of the values for m share an identical measure unit.
5. The scales of all measures are ratio scales.

Proof. The proof follows from Definition 12 and Theorem 13. □
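For illustration, the five conditions of Corollary 14 can be checked mechanically against cube metadata. The sketch below assumes that each measure record carries its type, the set of units observed for it and its scale, and that each dimension record states whether it is an ordering dimension and whether it is domain-complete; these field names are our own assumptions.

# A sketch of checking the five conditions of Corollary 14.
def cube_is_additive(measures, dimensions):
    types = {m["type"] for m in measures}
    non_ordering = [d for d in dimensions if not d["ordering"]]
    if "snapshot" in types:                                    # condition 1
        return False
    if "reckoning" in types:                                   # condition 2
        if any(d["ordering"] for d in dimensions):
            return False
        if not all(d["domain_complete"] for d in non_ordering):
            return False
    if "semi-tally" in types and non_ordering:                 # condition 3
        return False
    if any(len(set(m["units"])) != 1 for m in measures):       # condition 4
        return False
    return all(m["scale"] == "ratio" for m in measures)        # condition 5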



8. Detecting summarizability

We first focus on additivity, i.e. the correctness of the sum operation. Based on this, we extend our analysis to other basic aggregation functions.


8.1. Detecting measure types

Using our formalism and definitions based on the relational database model, we can determine the measure types of the measure attributes. This, however, requires that the data be presented in the right multidimensional format. In practical applications this is not feasible, since, for example, the transactional level data is not stored in the data warehouse and some aggregation is already done in the ETL process. Therefore, we assume that the designer can detect the measure types based on knowledge of the application area and the applied ETL processes. We can give the following classification to help the designer determine the correct measure types:

Tally measure:
• Non-pre-aggregated cube:
  – independent events measured at a particular point in time (e.g. the value of the sold item).
• Pre-aggregated cube:
  – a value measured over a period of time (e.g. daily sales), or
  – a change in a fixed period (e.g. daily changes in inventory).
• In our running example, sold items and sales belong to this group.

Semi-tally measure:
• Similar to the tally measure but clearly not additive over some non-ordering dimensions such as geography.
• In our running example, daily precipitation belongs to this group.

Reckoning measure:
• Non-pre-aggregated cube:
  – an observed value of the same phenomenon at a particular point in time (e.g. daily inventory values).
• Pre-aggregated cube:
  – a cumulative sum of changes (e.g. inventory level).
• In our running example, inventory belongs to this group.

Snapshot measure:
• Indirect measurements [8] (e.g. temperature and speed).
• In our running example, temperature belongs to this group.

Conversion factor measure:
• A conversion factor is the quotient of two measures (e.g. 1.28 USD/EUR).
• In our running example, currency exchange rates belong to this group.

8.1.1. Tally versus semi-tally measures

In Example 2 in Section 2, we used the absurdity of finding the sum of the heights registered by two rain gauges in order to support our argument that this form of precipitation measurement is an example of a semi-tally.

Example 10. Suppose that we have two leaks in a pipe and we want to discover whether the leaks are getting worse over time. We could place two rain gauges under the two leaks and, once per day, note the heights, empty the gauges and re-position them under the leaks. In order to compare the total leakage from day to day, we must add the two heights. Thus we are using the same measurement technique, collection method and unit of measurement as in precipitation measurement, but now our measure is a tally, rather than a semi-tally.

Example 10 and Example 2 illustrate the point that the type of a measure cannot be determined by reference to how the data are collected and measured. We have to know the purpose of the measurement. In Example 10, the purpose of the measurement is to measure a volume directly. By contrast, in Example 2 the purpose of the measurement is to act as a representative of the precipitation experienced in the geographical vicinity of the rain gauge.

Example 11. In a hydro-geological model, we need to estimate the total volume of precipitation in the catchment area of a river. We divide the catchment area into zones and place a rain gauge in each zone. Each day, for each zone, the height of precipitation recorded by the rain gauge is multiplied by the surface area of the zone to find an estimate of the volume of precipitation for the zone. Then the estimated volumes for each zone are added together to find an estimate of the total precipitation in the catchment area.

Example 11 illustrates the point that the height in the rain gauge is a representation of a volume, but that volume represents nothing other than itself. Hence the height is not a tally, whereas the volume is a tally.

8.1.2. Snapshot versus reckoning measures

The following example is intended to illustrate the differences between reckoning and snapshot measures.


Example 12. The Sunspot Count is issued daily and computed by counting the number of sunspots on the surface of the Sun facing the Earth. A sunspot's duration can be hours or weeks. As a measure, it is a reckoning. Today's count is equal to yesterday's count minus the sunspots which have ceased to be visible since yesterday, plus the sunspots that are newly visible today.

Example 13. A pollen count is computed by counting the number of pollen grains of a certain type per cubic meter of air sampled, averaged over 24 h. As a measure, it is a snapshot measure.

Pollen count is a snapshot measure, whereas sunspot count is a reckoning. The difference between these two counts is that the sunspot count simply reports the number of items counted, whereas the pollen count is a sample which purports to be somehow representative of the geographic vicinity of the sample collection device. If we introduce a second pollen grain collection device near the first one, we might collect twice the number of pollen grains, but we would still report the pollen count as the number of grains from just one of the collection devices. By contrast, if our solar system had binary stars, we would take the sum of the sunspots on the surfaces of the two stars.

A snapshot measurement is always a sample value that is intended to be representative. Even if the scale of a snapshot permits measure values to be added, summing snapshot values never makes sense. If two or more such values are associated with identical dimensional information, then summing the values simply alters the basis of the sample, whereas if the values are associated with differing dimensional information, then summing the values invalidates the representative nature of the sample.

8.2. Aggregations in OLAP queries

First we must analyse the OLAP query in order to find the aggregation operations with their measures and dimensions. Then the additivity of these operations can be detected, based on the properties of measures and dimensions as explained in the previous sections. Since selection operations can drastically reduce the data to be included in the aggregation calculation, the additivity testing should be done after the selections. For example, calculating the total sales of all countries violates additivity requirement 2 in Definition 12, but if we limit the data to one country and thus to one currency, the aggregation would be correct.

We focus on checking additivity using the MDX query language. First we must identify the parts of a query that cause aggregation to be computed during query evaluation. Generally, there are two ways in which aggregations can occur:

1. an aggregation removes a dimension, or
2. an aggregation removes a hierarchy layer in a dimension, i.e. rolls up a dimension.

The first case means that the dimension is eliminated in the resulting cube and thus it is not listed in the selection part (e.g. the SELECT clause) of the query. It can still be used as a selection condition to limit the values to be included in the result. The second case is slightly more complicated, since the dimension is now included in the selection part and we must find the possible roll-up operations. For example, when summarising daily values to monthly values, the aggregation dimension is the time dimension. Now it is important whether the most detailed level of the dimension, that is the key of the dimension, is included or not. If it is not included, aggregation computation is needed along this dimension.
Since MDX includes many features that are not relevant to our analysis, we have defined a limited MDX-like query language whose syntax can be found at http://www.cern.ch/hiptek/mdxlike. In MDX (and our simplified version of it), a basic query has the following form:

SELECT [<axis_specification> [, <axis_specification>…]]
FROM [<cube_specification>]
[WHERE [<slicer_specification>]]

The axis specifications of the SELECT clause are dimensions, and the WHERE clause can be used to limit the data taken into the result set. MDX handles measures and other dimensions symmetrically. The FROM clause specifies the OLAP cube and is trivial in the case of only one cube. In MDX, one axis specification can also contain several dimensions (by using a cross-join operation), but in our language each axis specification contains only one dimension. For example, the following query retrieves the prices of ski products for the year 2010:

SELECT [Measures].[Price], [Product].[Skis]
FROM Sales
WHERE [Date].[2010]

The SELECT clause contains the dimensions and measures to be included in the resulting OLAP cube, while the WHERE clause gives restrictions on how these should be tied together. It can also give other restrictions if only some subsets of values are to be included in the result.
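Before the dimensions and measures can be identified (next paragraph), the query text itself has to be split into its SELECT, FROM and WHERE parts. The following rough sketch does this for the simplified language; the helper name and the regular expression are illustrative assumptions, not the grammar published at the URL above.

import re

# A rough sketch of splitting a query in the simplified MDX-like language
# into its SELECT, FROM and WHERE parts.
def split_query(query):
    pattern = re.compile(
        r"SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<cube>\S+)"
        r"(?:\s+WHERE\s+(?P<slicer>.+))?\s*$",
        re.IGNORECASE | re.DOTALL)
    m = pattern.match(query.strip())
    axes = [a.strip() for a in m.group("select").split(",")]
    return axes, m.group("cube"), m.group("slicer")

axes, cube, slicer = split_query(
    "SELECT [Measures].[Price], [Product].[Skis] FROM Sales WHERE [Date].[2010]")
print(axes)    # ['[Measures].[Price]', '[Product].[Skis]']
print(cube)    # 'Sales'
print(slicer)  # '[Date].[2010]'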


Let As be the set of attributes in the SELECT clause. We can find the dimensions and the measures by using the OLAP cube schema: the set of selection dimensions, i.e. the dimensions mentioned in the SELECT clause, Ds, contains the attributes belonging to some dimension in the set of all dimensions D of the OLAP cube schema. The measures are then simply As − Ds. By aggregation dimensions we mean the dimensions along which the aggregation is done. The set of aggregation dimensions consists of two parts: eliminated dimensions and roll-up dimensions. The eliminated dimensions are not mentioned at all in the SELECT clause, so they can be found easily: De = D − Ds. A roll-up dimension is one whose most detailed level is not included in the SELECT clause. If a dimension is not mentioned in the SELECT part, there is an aggregation along it. For example, in the above query, there will be an aggregation along the product dimension, since its most detailed level is not included.

8.3. Checking summarizability of aggregation operations

8.3.1. Additivity

After the aggregation dimensions and measures are identified in the OLAP query, the summarizability can be checked by using Algorithm 1. The algorithm is based on the definition of additivity (Section 7.1). For example, in the example query of the previous section, the only selection dimension is product and the measure is price. This means that the aggregation dimensions are product, date, and country. The date and country dimensions are eliminated from the result, while product rolls up to the product group level. According to our definitions, the measure type of price is snapshot. Thus, we can detect that price cannot be aggregated by the sum function along any dimension, that is, the result of the query is not additive. (For illustration, a small Python sketch of this algorithm is given after Example 16 below.)

Algorithm 1. Additivity of the OLAP query

D = the set of aggregation dimensions.
M = the set of measures.
for all m ∈ M do
  for all d ∈ D do
    if (m depends on d) OR
       (the measure type of m is snapshot) OR
       (the measure type of m is reckoning and d is an ordering dimension) OR
       (the measure type of m is reckoning and d is a non-domain-complete non-ordering dimension) OR
       (the measure type of m is semi-tally and d is a non-domain-complete non-ordering dimension) OR
       (the scale of m is not ratio) then
      Return False.
    end if
  end for
end for
Return True.

Proof. The algorithm iterates over all measure and aggregation dimension pairs used in the query definition and checks whether the rules given in Theorem 13 are violated. False is returned if any of these pairs violates a rule. Since there are finite sets of both measures and dimensions, there is also a finite number of pairs to check; thus the algorithm always returns an answer in finite time. □

8.3.2. Conversion factors and summarizability

The rule which applies to conversion factors is analogous to the rules on additivity and semi-additivity which apply to tally and snapshot measures. Whereas tally and snapshot measures fall into the categories of additive, semi-additive and non-additive, conversion factors are always non-additive. However, conversion factors have an analogous property which determines the point in the evaluation of an expression at which the conversion factor must be applied. The following examples illustrate this property.

Example 14. Value-added tax ("VAT") rate. The VAT rate only rarely changes over time. However, in some countries the VAT rate varies according to the product category.
Suppose the measures include Gross Sales Revenue and VAT Rate, and the categorical attributes include product category and date of sale. If we aggregate over time and we need to find the tax, then it does not matter whether we multiply Gross Sales Revenue by VAT Rate before or after the aggregation. However, if we aggregate by product category, then we must multiply Gross Sales Revenue by VAT Rate before the aggregation.

Example 15. Sales tax rate. In the USA, each state can levy a sales tax that is a percentage of the value of a retail transaction. When computing the total of all sales tax for a day, in an expression we can multiply the gross sales revenue by the sales tax rate before or after
the aggregation, so long as we are only aggregating up to the state level. If, however, we are aggregating up to the national level, then we have to multiply gross sales revenue by sales tax rate before the aggregation.

Example 16. Currency exchange rates vary all the time. In some applications, however, it is assumed that there was one prevailing rate on any given day. In that case, to convert the gross sales revenue from one currency to another, the expression must perform the multiplication before the aggregation if the aggregation includes transactions from more than one day.
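As noted in Section 8.3.1, the following is a small, runnable sketch of Algorithm 1. The record fields (name, type, scale, depends_on, ordering, domain_complete) are our own assumptions about how the cube metadata might be represented; they are not part of the formalism of the paper.

# A runnable sketch of Algorithm 1 (Section 8.3.1).
def query_is_additive(agg_dimensions, measures):
    for m in measures:
        for d in agg_dimensions:
            incomplete_non_ordering = (not d["ordering"]) and (not d["domain_complete"])
            if (d["name"] in m["depends_on"]
                    or m["type"] == "snapshot"
                    or (m["type"] == "reckoning" and d["ordering"])
                    or (m["type"] == "reckoning" and incomplete_non_ordering)
                    or (m["type"] == "semi-tally" and incomplete_non_ordering)
                    or m["scale"] != "ratio"):
                return False
    return True

# The example query of Section 8.2: the measure price is a snapshot,
# so the check fails regardless of the aggregation dimensions.
price = {"type": "snapshot", "scale": "ratio", "depends_on": set()}
dims = [{"name": "Product", "ordering": False, "domain_complete": True},
        {"name": "Date",    "ordering": True,  "domain_complete": True},
        {"name": "Country", "ordering": False, "domain_complete": True}]
print(query_is_additive(dims, [price]))   # False

Applied to the example query, the check fails on the first measure and dimension pair because price is a snapshot measure, matching the analysis given above.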

8.3.3. Period and scope rules for snapshot measures

Snapshot measures are non-additive, but each such measure has its own set of rules which determine how its values must be handled when aggregation operations are performed on the cube. We refer to these rules as the "period" in the case of the ordering dimension and as the "scope" for each of the non-ordering dimensions. For each snapshot measure, the period and scope rules must be specified by the cube designer.

First, we deal with the case of the ordering dimension. For some snapshot measures, the values which appear in the cube are obtained from instantaneous measurements. Examples of such measures are wind speed and temperature. Other snapshot measures can only be taken after some period of time has elapsed, an example being the pollen grain count, which requires a one-day collection period. In either case, there will be a period associated with the measure value. For example, the temperature might be taken at noon each day and that measurement will be taken as the representative temperature for the entire day. When the granularity of the cube is no higher up the ordering dimension's hierarchy than the period, the raw measure value can be used. If, however, the granularity of the cube is higher up the ordering dimension's hierarchy than the period, then the measure value must be obtained by a suitable aggregation function, such as min, max or mean.

For non-ordering dimensions, the cube designer must specify the "scope" of the snapshot measure. As with the period, the scope is an attribute in the level hierarchy associated with the dimension. So long as the cube's aggregation level in this dimension is no higher than the scope, the measure value can be used (possibly in aggregated form, depending on the current aggregation level in the ordering dimension). If, however, the required aggregation level is higher than the scope, then it is not valid to include this snapshot measure in the cube. For example, with the pollen count, we assume that there is just one pollen count issued for each city once per day. It does not make sense to aggregate pollen counts between cities, partly because the results would be meaningless, but also because the results may be incomparable due to differing vegetation (only one kind of pollen is used in each count). Thus, in the geography dimension, the scope for this measure is the city level in the aggregation hierarchy. If the aggregation level is no higher than city, then the pollen count measure can be included in the aggregated cube. If, however, the aggregation hierarchy level is higher than city, then the measure cannot be included in the cube, even in aggregated form. If the validity of the snapshot measure is unaffected by the aggregation level in a dimension, then the cube designer simply makes the scope the highest level in the aggregation hierarchy for that dimension.

8.3.4. Other aggregation functions

If the OLAP query is additive, then we can correctly use the sum operation. Additionally, the min and max functions can always be applied when the statistical scale of the measure attribute is at least the ordinal scale and all measure units are the same. For example, the results of min/max functions are not meaningful if the values cannot be compared (e.g. product names) or if the product prices are in different currencies. Unfortunately, the average function is more complicated. Since the average is defined as the sum divided by the count, the results of both functions must exist.
This means that additivity is a necessary condition for computing averages. However, it is not necessarily a sufficient condition, as illustrated by the following example. Intuitively, it seems correct to compute monthly averages of daily rainfall. If we have two months each with only one rainy day with 10 mm of rain, the monthly averages should obviously be the same. However, if the first month happens to be January and the second one February, the first value is 10 mm/31 and the second value 10 mm/28, indicating that the value is over 10% larger in February than in January. In this sense, we can argue that the average can be correctly used only if the cardinality of all groups is the same. We can formulate this as follows: an aggregation operation using the average is valid if (1) the statistical scale of the measure attribute is at least the interval scale, (2) all measure units are the same, and (3) the cardinalities of all groups are the same.

9. Evaluation

Our current study has two different objectives: 1) presenting a new categorisation for summarizability types, and 2) demonstrating how this categorisation can be applied to guarantee correct aggregation results. Since these objectives are different in nature, we evaluate our research by using two methods: for the first objective we use the descriptive method and for the second one the analytical method [7]. Furthermore, the limitations of our method are stated in Section 10.

Our work is motivated by the fact that the commonly-used categorisation of Lenz and Shoshani is ambiguous, as we demonstrated in Section 2. Although a direct comparison between our categorisation and Lenz and Shoshani's categorisation is not possible, we give here some key observations. Table 6 gives an overview of the measure categories. Kimball and Ross's categorisation is straightforward, since the categories are defined based on their summarizability properties. Lenz and Shoshani's starting point is similar to ours: first defining the categories based on general properties of variables and after that describing or showing what kind of summarizability features these categories have.

Table 6
Measure type comparison.

Description | Our category | Lenz and Shoshani's category [16] | Kimball and Ross's category [12] | Examples
A cumulative effect over a period. Additive over any dimension. | Tally | Flow | Additive | Monthly number of births, annual income, daily sales
A cumulative effect over a period. Additive only over a temporal dimension. | Semi-tally | does not exist | does not exist | Measured rain in different locations
A stock value records the state at a specific point in time. Additive over non-temporal dimensions. | Reckoning | Stock | Semi-additive | Inventory of cars, number of citizens
The state at a specific point in time. | Snapshot | Value-per-unit 1) or Stock 2) | Non-additive | 1) item price, cost per unit manufactured; 2) temperature, pressure, speed
A factor used to convert measure units. | Conversion factor | Value-per-unit | Non-additive | Exchange rate

We can further compare Lenz and Shoshani's categories to ours as follows.

Flow vs. Tally: Lenz and Shoshani define flow as follows: "flows refer to periods of time and they are recorded at the end of these periods." This measure type is quite similar to our tally measure. We, however, recognise a measure type, semi-tally, that is additive only over the temporal dimension.

Stock vs. Reckoning: Lenz and Shoshani: "stocks are measured at particular points of time." Similarly to the tally measure, we divide "stock" measures into two groups. Lenz and Shoshani's stock measure covers our reckoning and also partially our snapshot, although their summarization properties are different: only the reckoning is semi-additive while the snapshot is non-additive.

Value-per-unit vs. Snapshot: Lenz and Shoshani: "Value-per-units are determined for a fixed time, too, but are not similar to stocks in so far as their unit is different." The value-per-unit covers a part of our snapshot measure (depending on units) and our conversion factor measure. Using the unit to differentiate the measure types is problematic, since the same thing can be measured using different units. This, obviously, should not affect summarizability as long as there is a bijective mapping between these units (e.g. knots vs. km/h).

Additionally, our categorisation can handle linear ordering dimensions other than just the temporal dimension. Kimball and Ross's and Lenz and Shoshani's categorisations cannot recognise these.

For the analytical evaluation, we built a proof-of-concept implementation of a summarizability checker by using our MDX-like query language. Our implementation takes the OLAP model and a query as inputs and checks the summarizability of the query based on Algorithm 1. The implementation is straightforward since it only needs to study those dimensions and measures that appear in the query. The summarizability checker is integrated in an open source OLAP engine [5] and it correctly identifies the summarizability of the query.

10. Conclusions and future work

In this paper, we studied summarization, especially additivity, based on the relational data model. The motivation for this work comes from our earlier paper, in which we presented preliminary ideas on checking summarizability based on RDF ontologies [27]. The approach in the current work is very different and considerably broader, offering a new contribution based on the relational data model: 1) a distinct summarization categorisation for measure attributes in different OLAP cubes, 2) a formal connection between the measure types and dimension types, and 3) rules with a proof for correct summarization.

We now briefly assess the questions of summarizability that we felt were left open by the earlier research.

1. The summary attribute types do not have formal definitions. – This has been covered by our definitions in Section 6, Definitions 7–11.
2. There is no work combining measure units and the statistical scale of summary attributes. – This has been covered in Sections 6 and 7. Specifically, our definition of additivity utilises both statistical scales and measure units.
3. The temporal dimension plays a special role in most of the existing work, but this is not necessarily so. – Our Definition 4 indicates that a dimension other than time can be used as an ordering dimension if it defines a linear total ordering. For example, a geography dimension can be used in a similar way in some applications.
4. In addition to the common aggregation functions (sum, count, average, min/max), there are also some other commonly used types, such as the last item. – We only demonstrated that the last item type corresponds to the cumulative snapshot measure. This will belong to our future research.


5. Summarizability of measures created by combining existing measures using arithmetic operations is not included in the existing research. – Although this issue has been partially addressed in our treatment of conversion factor measures, it is not covered by the current work. It will, however, belong to our future research.
6. There is no proof that the set of our summary types is complete or minimal. – This question mostly remains open. We identified a new measure type, semi-tally, that is additive only temporally. We also showed that our summarizability types can be partially defined using other types. In this sense the set is not minimal.

In addition to the items mentioned above, our method has two obvious limitations: 1) snapshot and semi-tally measures: our characterisation is not sufficient to distinguish between non-additive and additive measures in these categories, since some of the measures in these categories can still be additive; and 2) implementation: the proof-of-concept implementation is rudimentary and, to be actually useful, it should be integrated with an OLAP design tool.

In our future work, we shall

• apply our summarizability rules to UML OLAP models such as [30]. In a similar way to Prat et al., we can construct UML classes based on our model. The main similarities to Prat et al.'s UML model in [30] are the relations between aggregation function, aggregation type and hierarchies. The main difference is that we do not use pre-aggregated measures in our model.
• investigate the rules which determine the correctness of expressions in which several measures, of differing measurement types, appear as the operands.
• investigate the rules which determine the stage in the computation of an expression at which a conversion factor must be applied.
• study whether the snapshot and semi-tally categories can be further divided into meaningful sub-categories which could have different summarization behaviour.

References

[1] S. Ariyan, L. Bertossi, A multidimensional data model with subcategories for flexibly capturing summarizability, Proceedings of the 25th International Conference on Scientific and Statistical Database Management, SSDBM, ACM, New York, NY, USA, 2013, pp. 6:1–6:12.
[2] M. Caniupan, L. Bravo, C.A. Hurtado, Repairing inconsistent dimensions in data warehouses, Data Knowl. Eng. 79&80 (0) (2012) 17–39.
[3] J. Dittrich, D. Kossmann, A. Kreutz, Bridging the gap between OLAP and SQL, Proceedings of the 31st International Conference on Very Large Data Bases, VLDB'05, VLDB Endowment, 2005, pp. 1031–1042.
[4] J. García-García, C. Ordonez, Extended aggregations for databases with referential integrity issues, Data Knowl. Eng. 69 (1) (2010) 73–95.
[5] M. Golfarelli, Open source BI platforms: a functional and architectural comparison, Data Warehousing and Knowledge Discovery, LNCS 5691, 2009, pp. 287–297.
[6] L. Gomez, S. Haesevoets, B. Kuijpers, A.A. Vaisman, Spatial aggregation: data model and implementation, Inf. Syst. 34 (6) (2009) 551–576.
[7] A.R. Hevner, S.T. March, J. Park, S. Ram, Design science in information systems research, MIS Q. 28 (1) (2004) 75–105.
[8] J. Horner, I. Song, P. Chen, An analysis of additivity in OLAP systems, Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP, DOLAP'04, ACM, New York, NY, USA, 2004, pp. 83–91.
[9] J. Horner, I.-Y. Song, A taxonomy of inaccurate summaries and their management in OLAP systems, Proceedings of the 24th International Conference on Conceptual Modeling, ER'05, Springer-Verlag, Berlin, Heidelberg, 2005, pp. 433–448.
[10] C.A. Hurtado, C. Gutierrez, A.O. Mendelzon, Capturing summarizability with integrity constraints in OLAP, ACM Trans. Database Syst. 30 (3) (2005) 854–886.
[11] R. Johnson, Modelling summary data, SIGMOD'81: Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, 1981, pp. 93–97.
[12] R. Kimball, M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, John Wiley & Sons, 2002.
[13] R. Kimball, M. Ross, The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence, Wiley, 2010.
[14] J. Lechtenbörger, G. Vossen, Multidimensional normal forms for data warehouse design, Inf. Syst. 28 (5) (2003) 415–434.
[15] W. Lehner, J. Albrecht, H. Wedekind, Normal forms for multidimensional databases, SSDBM'98, IEEE Computer Society, Washington, DC, USA, 1998, pp. 63–72.
[16] H.-J. Lenz, A. Shoshani, Summarizability in OLAP and statistical data bases, SSDBM'97, IEEE Computer Society, Washington, DC, USA, 1997, pp. 132–143.
[17] H.-J. Lenz, B. Thalheim, OLAP databases and aggregation functions, SSDBM'01: Proceedings of the 13th International Conference on Scientific and Statistical Database Management, IEEE Computer Society, Washington, DC, USA, 2001, pp. 91–100.
[18] H.-J. Lenz, B. Thalheim, A formal framework of aggregation for the OLAP-OLTP model, J. Univ. Comput. Sci. 15 (1) (2009) 273–302.
[19] E. Malinowski, E. Zimányi, Hierarchies in a multidimensional model: from conceptual modeling to logical representation, Data Knowl. Eng. 59 (2) (2006) 348–377.
[20] F.M. Malvestuto, M. Mezzini, M. Moscarini, An analytical approach to the inference of summary data of additive type, Theor. Comput. Sci. 385 (13) (2007) 264–285.
[21] J. Mazón, J. Lechtenbörger, J. Trujillo, Solving summarizability problems in fact-dimension relationships for multidimensional models, DOLAP'08: Proceedings of the ACM 11th International Workshop on Data Warehousing and OLAP, ACM, New York, NY, USA, 2008, pp. 57–64.
[22] J. Mazón, J. Lechtenbörger, J. Trujillo, A survey on summarizability issues in multidimensional modeling, Data Knowl. Eng. 68 (12) (2009) 1452–1469.
[23] L. Meo-Evoli, F.L. Ricci, A. Shoshani, On the semantic completeness of macro-data operators for statistical aggregation, SSDBM'1992: Proceedings of the 6th International Working Conference on Scientific and Statistical Database Management, Lawrence Berkeley Laboratory, Berkeley, CA, USA, 1992, pp. 239–258.
[24] B. Neumayr, M. Schrefl, B. Thalheim, Hetero-homogeneous hierarchies in data warehouses, Proceedings of the Seventh Asia-Pacific Conference on Conceptual Modelling, Volume 110, APCCM'10, Australian Computer Society, Inc., Darlinghurst, Australia, 2010, pp. 61–70.
[25] T. Niemi, J. Nummenmaa, P. Thanisch, Logical multidimensional database design for ragged and unbalanced aggregation hierarchies, Proceedings of the 3rd International Workshop on Design and Management of Data Warehouses, 2001.
[26] T. Niemi, J. Nummenmaa, P. Thanisch, Normalising OLAP cubes for controlling sparsity, Data Knowl. Eng. 46 (1) (2003) 317–343.
[27] T. Niemi, M. Niinimäki, Ontologies and summarizability in OLAP, Proceedings of the Semantic Web and Applications (SWA), A Technical Track of the 25th Annual ACM Symposium on Applied Computing, Sierre, Switzerland, March 2010.
[28] T. Pedersen, C.S. Jensen, Multidimensional data modeling for complex data, Proceedings of the 15th International Conference on Data Engineering, 1999, pp. 336–345.
[29] T. Pedersen, D. Pedersen, K. Riis, On-demand multidimensional data integration: toward a semantic foundation for cloud intelligence, J. Supercomput. 65 (1) (2013) 217–257.
[30] N. Prat, I. Comyn-Wattiau, J. Akoka, Combining objects with rules to represent aggregation knowledge in data warehouse and OLAP systems, Data Knowl. Eng. 70 (8) (2011) 732–752.


[31] K.Q. Pu, Modeling, querying and reasoning about OLAP databases: a functional approach, DOLAP'05: Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, ACM, New York, NY, USA, 2005, pp. 1–8.
[32] M. Rafanelli, A. Shoshani, STORM: a statistical object representation model, SSDBM'90, Springer-Verlag, New York, NY, USA, 1990, pp. 14–29.
[33] S.K. Shin, G.L. Sanders, Denormalization strategies for data retrieval from data warehouses, Decis. Support Syst. 42 (1) (2006) 267–282.
[34] A. Shoshani, Statistical databases: characteristics, problems, and some solutions, VLDB'82: Proceedings of the 8th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1982, pp. 208–222.
[35] S. Si-Said Cherfi, N. Prat, Multidimensional schemas quality: assessing and balancing analyzability and simplicity, ER (Workshops), 2003, pp. 140–151.
[36] M. Spahn, J. Kleb, S. Grimm, S. Scheidl, Supporting business intelligence by providing ontology-based end-user information self-service, Proceedings of the First International Workshop on Ontology-Supported Business Intelligence, ACM, 2008.
[37] S.S. Stevens, On the theory of scales of measurement, Science 103 (2684) (1946) 677–680.
[38] E. Thomsen, G. Spofford, D. Chase, Microsoft OLAP Solutions, Wiley, 1999.
[39] H.J. Watson, B.H. Wixom, The current state of business intelligence, Computer 40 (9) (2007) 96–99.