Accepted Manuscript

Multidimensional Query Reformulation with Measure Decomposition

Claudia Diamantini, Domenico Potena, Emanuele Storti

PII: S0306-4379(17)30374-5
DOI: 10.1016/j.is.2018.05.002
Reference: IS 1308

To appear in: Information Systems

Received date: 13 June 2017
Revised date: 20 March 2018
Accepted date: 8 May 2018

Please cite this article as: Claudia Diamantini, Domenico Potena, Emanuele Storti, Multidimensional Query Reformulation with Measure Decomposition, Information Systems (2018), doi: 10.1016/j.is.2018.05.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

• Approach to support performance comparisons in a federation of data marts
• Multidimensional queries over a global model integrating local sources
• An extension to the multidimensional model with mathematical formulas for indicators
• A query reformulation approach exploiting aggregation and indicator decomposition
• Computational analysis of the reformulation algorithm and proof of correctness


Multidimensional Query Reformulation with Measure Decomposition

Claudia Diamantini, Domenico Potena, Emanuele Storti∗

Dipartimento di Ingegneria dell’Informazione, Università Politecnica delle Marche, via Brecce Bianche, 60131 Ancona

Abstract


Measurement and comparison of performance in networked organisations is particularly critical because of the heterogeneity and sparsity of data. In particular, each organization is autonomous in the definition of which measures to use and of their calculation formulas, i.e. the mathematical expressions stating how a measure is calculated from others. Hence, full integration of data marts requires a reconciliation of such heterogeneous definitions in order to support the evaluation of cross-organization performance and to produce meaningful comparisons. To address this issue, this paper proposes (1) an extension of the traditional multidimensional model with an explicit representation of the semantics of measure formulas and, on top of this model, (2) a novel query reformulation approach for a scenario of federated data warehouses. The approach exploits both aggregation and, unlike traditional approaches, measure decomposition through the calculation of measure formulas. This extends the usual features of query rewriting based on views, making it possible to overcome heterogeneities at measure level among data mart schemas and enabling meaningful comparisons among values of different autonomous data marts. A formalization of the rewriting algorithm is proposed, together with a computational analysis, proofs of correctness and termination, and an evaluation of effectiveness showing how the approach can lead to a significant increase in the capability of integrating indicators to answer queries in a federated scenario.


Keywords: Federated data warehouses; query rewriting; multidimensional model; indicator formulas; materialized views


1. Introduction


Monitoring of performance represents a valuable means for an enterprise to recognize and address critical aspects. In traditional enterprises, but even more in collaborative networks (e.g., Virtual Enterprises, Business Ecosystems), which involve the temporary cooperation of different partners, the evaluation of the identified Performance Indicators (PIs) can depend on data produced, organized and maintained in different information systems with a certain degree of heterogeneity. Such heterogeneity represents an obstacle to the integration of data and prevents the evaluation of cross-organization (or even cross-department) PIs. Physical integration of local sources in a global data warehouse is feasible only for single organizations or networks that are established for long-term collaborations. Distributed and collaborative contexts, like the one we envisage in this work, are inherently dynamic and temporary, as well as characterised by a certain degree of autonomy and privacy requirements. In these contexts, federated architectures were first proposed in the database field [1] and then extended to data warehouses; they include, for each source, a local schema, a component schema that expresses the former in a canonical data model, and an export schema that includes only the fragment of the component schema that is shared among the partners. Finally, on top

∗ Corresponding author. Email addresses: [email protected] (Claudia Diamantini), [email protected] (Domenico Potena), [email protected] (Emanuele Storti)

Preprint submitted to Elsevier, May 17, 2018


of several export schemas a federated (global) schema is defined [2]. A so-called federated data warehouse (FDWH) is therefore meant, in a collaborative scenario, as a logical integration of data warehouses from various organizations. A specific source of complexity that is peculiar to this scenario, and that has received little attention in the Literature so far, is related to heterogeneities at indicator (or measure) level. This issue stems from the fact that indicators often have a compound nature, as their values are calculated by applying formulas defined over other indicators. Unawareness of the dependencies among indicators leads to evaluations that are prone to errors. Let us consider for instance two different definitions of Return On Investment (ROI), both commonly adopted in business, e.g. Net Income/Investment versus (Revenue − Cost of goods sold)/Cost of goods sold. Such definitions (and the corresponding values) are comparable only if Net Income is equal to Revenue − Cost of goods sold, and if Investment is equal to Cost of goods sold; but this cannot be taken for granted, and it is apparent that no agreement upon which definition of ROI must be used at federated level is feasible in advance. Hence, firstly, a solution for the integration of all indicators and their formulas in a common model is required to reconcile heterogeneous definitions. Secondly, a query answering mechanism should be capable of answering queries by exploiting indicator formulas, in order to carry out correct interpretations of indicators and meaningful comparisons among performance values calculated by different definitions. These requirements stem from the FP7 European research project BIVEE¹, which aimed to perform monitoring of indicators over a federation of data warehouses in the presence of distribution of sources, autonomy in the choice of PIs and heterogeneities at measure level.
The research goal of this paper is to address the above-mentioned issues by discussing a formula-aware approach for multidimensional query reformulation over a federation of data marts. The contributions of the paper are:

• an extension of the model of the global schema with the compositional semantics of indicators, i.e. mathematical formulas stating how to calculate an indicator from others;


• a novel query answering approach exploiting the extended multidimensional model and involving the following reformulation process: (1) at first, as integrated data are not physically stored, queries formulated on the global schema are adapted to each source; this adds some complexity, but also allows a more flexible architecture, as a new source can be dynamically inserted [3]. Then, (2) each local query is rewritten using the materialized views of the local data mart, by exploiting the indicator formulas in the extended multidimensional model; finally, (3) the answer to the user query is calculated from the local results. We provide a computational analysis of the algorithm and proofs of correctness and termination.
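The three-step reformulation process above can be illustrated with a toy, self-contained sketch. All data structures and names (SOURCES, rewrite, answer_global) are invented for illustration and are not the paper's API; the mappings are trivialized so that only the formula-based decomposition of step (2) is visible.

```python
# Toy federation: each source exposes a global-to-local mapping, its local
# indicator formulas, and the indicators it actually stores.
SOURCES = [
    {   # resembles ACME1: Costs is not stored, but a formula derives it
        "name": "ACME1",
        "mapping": {"Costs": "Costs"},
        "formulas": {"Costs": ("+", "TravelCosts", "PersonnelCosts")},
        "data": {"TravelCosts": 100.0, "PersonnelCosts": 400.0},
    },
    {   # resembles ACME2: Costs is stored directly
        "name": "ACME2",
        "mapping": {"Costs": "Costs"},
        "formulas": {},
        "data": {"Costs": 900.0},
    },
]

def rewrite(indicator, source):
    """Step (2): answer an indicator from stored data, expanding its
    formula into sub-queries when it is not stored directly."""
    if indicator in source["data"]:
        return source["data"][indicator]
    if indicator in source["formulas"]:
        op, left, right = source["formulas"][indicator]
        lv, rv = rewrite(left, source), rewrite(right, source)
        if lv is None or rv is None:
            return None
        return lv + rv if op == "+" else lv * rv
    return None  # indicator not answerable by this source

def answer_global(indicator):
    """Steps (1) and (3): adapt the query to each source, then combine."""
    partials = []
    for source in SOURCES:
        local = source["mapping"].get(indicator)      # step (1): adapt
        value = rewrite(local, source) if local else None
        if value is not None:
            partials.append(value)
    return sum(partials)                              # step (3): SUM at VE level

print(answer_global("Costs"))  # 100 + 400 (derived) + 900 (stored) = 1400.0
```

ACME1 answers through the formula expansion, ACME2 directly from its stored cube, and the two partial answers are summed at the federated level.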


This work builds on some earlier contributions (including [4, 5, 6]), which share with this paper the applicative scenario and the general idea of extending the multidimensional model with the semantics of indicator formulas, for a variety of applications including the management of a shared library of PIs and the support of data mart integration. This work elaborates upon those previous results by addressing multidimensional query rewriting in a federated scenario. To the best of our knowledge, the mathematical manipulation of indicator formulas as a support for this task is novel with respect to the state of the art. By enabling an alternative solution for the dynamic calculation of virtual views from materialized ones, this approach provides valuable support to overcome heterogeneities at measure level, which are particularly critical in a dynamic, temporary and collaborative environment.

1.1. Motivating example
Hereafter, we present a case study that will be used as a motivating example throughout the paper. We consider a collaborative environment where two enterprises, ACME1 and ACME2, each with a data mart, cooperate in a project. In the paper we use the term Virtual Enterprise (VE) to refer to this kind of temporary association of enterprises. As shown in Figure 1, the two data marts include a set of dimensions related to time, location

¹ https://cordis.europa.eu/project/rcn/100275_en.html



Figure 1: Case study. A representation of the content of the data marts for enterprises (a) ACME1 and (b) ACME2. Indicators are shown in bold and linked to their analysis dimensions. For each dimension, the corresponding hierarchy of levels is shown.

Indicator | Description | Calculation | Aggregator | Source
Costs | Total costs for employees | travel costs plus personnel costs | SUM | ACME2
TravelCosts | Expenses for travels | - | SUM | ACME1
PersonnelCosts | Expenses for personnel | number of working hours times the hourly cost | SUM | ACME1
NumTrainingHours | Number of hours for training | - | SUM | ACME2
HourlyCost | Cost per hour | - | AVG | ACME2

Table 1: Case study: indicators measured by ACME1 and ACME2.


and product, and five measures related to costs. Several structural heterogeneities can be identified: the product dimension is missing in ACME1, and the other dimensions have different granularities (e.g., there are 5 levels for the time dimension in ACME2, but only 3 for ACME1; in the location dimension, town and city are synonyms, as are country and nation). Finally, the sets of indicators also differ, as detailed in Table 1 together with the corresponding aggregation functions. According to our approach, a user query over the whole VE is posed on the global federated schema. Let us consider a user query Q asking for the indicator Costs by year and country/nation. By exploiting the semantic mappings between export schema and federated schema (see Subsection 3.2), the global query can be solved by decomposition into two subqueries asking for Costs over the two sources ACME1 and ACME2. However, ACME1 does not include the indicator at hand in its data mart. As a consequence, we can answer the query only for ACME2. Let us now suppose that a formula for Costs is defined as Costs = TravelCosts + PersonnelCosts. In this way, it is possible to reformulate the query for ACME1 as two subqueries, one for TravelCosts and another for PersonnelCosts. As both indicators are available in ACME1, the original query can now be answered by exploiting the mathematical structure of the indicator formula (i.e., in this case by summing the results of the two subqueries). Furthermore, in this work we also consider the hypothesis that some materialized views with preaggregated data are available, while others are missing. Let us suppose that views V1(TravelCosts, year, region)


and V2(PersonnelCosts, year, country) are defined for ACME1 and view V3(Costs, year, nation) is defined for ACME2. While for ACME2 V3 already provides the needed data, for ACME1 the query corresponds to (1) an aggregation of V1 over the attribute region, in order to generate a virtual view VV with the same level of granularity as the query (i.e., year and country), followed by (2) the computation of the formula for Costs from VV and V2. Finally, the two new queries can be joined to produce a final aggregated value at VE level, by referring to the global schema. The interaction of formula calculation with the availability of materialized views is also a subject of study of this work. To the best of our knowledge, there is currently no approach in the Literature capable of automatically generating and answering this type of queries by reasoning on the mathematical structure of indicator formulas while exploiting materialized views. In the next Sections, we show how to represent a global model for the federation and how to automatically generate rewritings like these by exploiting knowledge about dimension hierarchies, indicator formulas and mappings, for each source of the federation.

1.2. Structure of the paper
The rest of the paper is structured as follows:

• Section 2 surveys related work in the fields of data integration and extensions to the multidimensional model;
• Section 3 introduces background definitions on the multidimensional model and its extension with mathematical formulas for indicators, as well as the notion of multidimensional query, and reports the two main rewriting rules, namely the traditional drill-down rule and a novel indicator expansion rule, which are exploited to determine which views can be used for rewriting;
• Section 4 formalizes the research problem and discusses under which conditions the rewriting rules can be applied to drive the rewriting process;
• Section 5 proposes a query reformulation procedure that adapts the global query to the schemas of the sources and implements a rewriting algorithm capable of exploring the space of possible rewritings to find a solution, according to the two rules; proofs of correctness and termination of the algorithm are provided in Appendix A;
• Section 6 provides an analysis of some computational aspects of the rewriting algorithm and an experimental evaluation of the efficacy of the approach.

Finally, in Section 7 we conclude the paper discussing implementation issues and possible solutions for optimization.

2. Related Work


In this Section we survey the works in the Literature that are most relevant to our research, both in the field of data integration in federations of data warehouses (Subsection 2.1) and in that of semantic extensions to the multidimensional model for performance monitoring and analysis (Subsection 2.2).

2.1. Data integration and query answering in federated data warehouses
The sharing of data coming from distributed, autonomous and heterogeneous sources calls for methods to integrate them in a unified view, which requires addressing several issues, including the different structures and terminologies adopted, and several syntactic, structural, and semantic conflicts [7, 8]. A common approach is to rely on a global schema to obtain an integrated and virtual view, either with a global-as-view (GAV) or a local-as-view (LAV) approach [9]. After all sources have been integrated and mappings have been defined, the main task of a data integration system is processing queries expressed on the global schema. This involves a reformulation step, in which the query expressed in terms of the global schema is rewritten in terms of queries over the sources. According to the typology of the chosen approach, queries are answered


differently, namely by unfolding the global query, replacing each term with the local one (according to GAV), or by query answering/rewriting (according to LAV), which involves decomposing the query into equivalent subqueries over the local schemas (see e.g. [10] for a survey in the context of distributed database systems, and [11] for data warehouse integration). More precisely, query rewriting uses view definitions to produce a new rewritten query, which is equivalent to the original one. The answer is obtained by using the rewritten query and the instances in the database. On the other hand, query answering exploits both definitions and instances to determine the best possible answer (typically a subset of the optimal answer). Given that the latter is very inefficient in case of large amounts of data, the former approach is more appropriate for OLAP queries [12]. Integration is however far more complex when the data models of the sources are multidimensional. Heterogeneities hindering the integration of independent data marts may be classified on the basis of conflicts that occur at dimension or measure level [13] and can be solved through proper mappings between layers, as in [2], where the authors propose an FDWH framework in which Dimension/Fact algebras are exploited to define mappings between global and local schemas. One of the first works [14] to consider aggregation and grouping proposed a rewriting solution based on syntactic transformations of the query into equivalent subqueries. The limits of this approach are addressed by semantic approaches. Among others, [15] extends the notion of query containment, showing that in some cases an algorithm for rewriting is complete, i.e. if a solution exists it is found, while [16] aims at rewriting OLAP queries using materialized views by exploiting knowledge about dimension hierarchies. The possible alternative rewritings obtained through the algorithm can be sorted by a greedy heuristic according to their estimated cost.
Semantic data integration relies on a conceptual and machine-understandable representation of the data and their relationships to avoid heterogeneities and allow interoperability. In the present work we also refer to a semantic approach, in which the rewriting algorithm exploits knowledge of the global/local models. Unlike previous work, this includes not only knowledge about dimension hierarchies, but also knowledge about indicator formulas and, for each source, details about materialized views and local hierarchies.


2.2. Extensions to the multidimensional model for monitoring and analysis
The multidimensional model takes the aggregative aspect into account, defining a data cube as a multilevel, multidimensional database with aggregate data at multiple granularities [17]. The definition of powerful OLAP operators like drill-down comes directly from this model. Recently, semantic representations of the multidimensional model have been proposed [18, 19], mainly with the aim of reducing the gap between the high-level business view of indicators and the technical view of data cubes, and of simplifying and automatising the main steps of design and analysis. The former relies on a standard OWL-DL ontology and reasons on it to check the multidimensional model and its summarizability, while the latter relies on Datalog inference to implement the abstract structure and semantics of multidimensional ontologies as rules and constraints. In particular, in [20] a proprietary script language is exploited to define formulas, with the aim of supporting data cube design and analysis, while in [21] the authors define an indicator ontology based on MathML to define PI formulas, in order to allow the automatic linking of calculation definitions to specific data warehouse elements. Although the complex nature of indicators is well known, their compound nature is far less explored. The proposals in [22, 20, 21, 23, 24] include in their representations of indicators' properties some notion of formula, in order to support the automatic generation of customised data marts, the calculation of indicators [20, 23], or the interoperability of heterogeneous and autonomous data warehouses [22]. However, in the above proposals formula representation does not rely on formal languages, hence formula manipulation is limited to ad-hoc software modules. Formal, logic-based representations of dependencies among indicators are proposed in [24, 25].
[24] introduces a goal-oriented meta-modeling framework in the context of the performance-oriented view of organizations. The proposed framework is based on the idea that goals should be defined by means of organizational performance indicators, both represented by referring to a logical language. In this way the approach provides means to evaluate the consistency of goal structures both when designing a new organization and when specifying an existing one, also enabling the reuse, exchange and alignment of knowledge and activities between organizations, as in supply chains. Relations among indicators are represented by logical predicates (e.g. isCalculated [25], correlated [24]), while reasoning allows the inference


of implicit dependencies among indicators for a variety of tasks, including organisation modeling and design, as well as the reuse, exchange and alignment of business knowledge. However, no manipulation of formulas is exploitable in this way. An ontological representation of indicator formulas is also proposed in [21], in order to exchange business calculation definitions and to infer their availability on a given data mart through semantic reasoning. Semantic multidimensional models have also been addressed in previous work of ours. In particular, we proposed an ontology named KPIOnto, formally describing indicators and their formulas, dimensions and members [4]. As such, it has been used for a variety of applications, ranging from ontology-based data exploration of indicators to the development of ambient assisted living environments [26]. On top of it, a set of reasoning services have been defined, capable of reasoning over formulas with the aim of supporting advanced functionalities for various purposes, ranging from performance monitoring in the context of collaborative organizations [4], to data mart reconciliation [5], to serving as a knowledge model supporting ontology-based data exploration of indicators [6]. To the best of our knowledge, automatic mechanisms based on a formal representation of formulas as a support for data mart analysis, integration or query answering are provided neither by other existing approaches in the Literature nor by commercial tools. The approach proposed in the next sections advances previous results by exploiting for the first time the compositional semantics of PI formulas and their mathematical manipulation (see Subsection 5.2) in the context of query rewriting using materialized views in a federated scenario.

3. Background


The canonical data model used in this work as a common model for the export and federated schemas refers to an extended multidimensional model based on the formal representation of dimensions, indicators and their formulas, as discussed in the next subsection. Afterwards, in Subsection 3.2 we provide an overview of the architecture, while in Subsections 3.3 and 3.4 we discuss the notion of multidimensional query and the rewriting rules on which the approach relies.

3.1. Canonical model


Definition 1. (Dimension)
A Dimension is the coordinate/perspective along which an indicator is analysed. Referring to the multidimensional model used in data warehouses [27], a dimension D is a set {L^D_1, ..., L^D_n}, where each L^D_i is a label² representing the name of the i-th level of D. Given two dimensions D_1 and D_2, D_1 ∩ D_2 = ∅.


For each level L_i, we refer to its domain by α(L_i). The elements of α(L_i) are called members of L^D_i, e.g. α(Country) = {Portugal, Italy, Greece, Spain}. Each level represents a different way of grouping the members of the dimension. Finally, a partial order ≤_L is defined on {L_1, ..., L_n} as a hierarchy of levels for a dimension. By exploiting the partial order among levels, the traditional roll-up operation can be defined to perform aggregation of data from a lower level to an upper level.


Definition 2. (partOf relation)
The transitive relation that maps a member of L^D_i to a member of L^D_j, with L_i ≤_L L_j, is defined as partOf ⊆ α(L_i) × α(L_j).

For example, partOf(Valencia, Spain) means that Valencia is part of Spain. The partOf relation defines a partition of the members of level L_i: indeed, the partOf relation is such that if L_i ≤_L L_j, then each member of L_i is in a partOf relation with only one member of L_j; in turn, each member of L^D_j is composed of at least one member of L^D_i. Finally, for each dimension we introduce a fictitious level L_ALL

² In the following, the subscript and the superscript for levels will be shown only when necessary.



Figure 2: Dimensional hierarchies for (a) Product and (b) Location dimensions, and (c) the corresponding dimensional lattice.


as the most abstract level, including only the member ALL. This concept is useful during the execution of drill-down/roll-up operations for aggregation and disaggregation, as it represents the top level to which every dimension can be aggregated. In Figures 2a and 2b we show the graph representation of the hierarchies of levels for the (a) Product and (b) Location dimensions. In order to give an overall and synthetic representation of the whole dimensional schema, we refer to the dimensional lattice as introduced in [28], obtained by considering the product of all the dimension hierarchies. To build the dimensional lattice we first introduce the notion of pattern.

PT

Definition 3. (Pattern) Dn 1 A pattern is a tuple pi = hLD x , . . . , Ly i ∈ D1 × . . . × Dn = P . Dn D1 Dn 1 Given px , py ∈ P , where px = hLD x1 , . . . , Lxn i and py = hLy1 , . . . , Lyn i, then px < py if and only if D

D

AC

CE

j j Di i ∃i ∈ 1 . . . n : LD xi ≤L Lyi and ∀j 6= i, Lxj = Lyj . From the notion of dimensional hierarchy and the corresponding partial order relation ≤L , we define a covering relation ≺∗ among patterns, as an irreflexive, antisymmetric and non-transitive binary relation over the set P , such that: px ≺∗ py if and only if @ pz such that px < pz < py . A partial-order relation is defined by extending the notion of partial order among levels, i.e. given px Di ∗ i and py , px ≤P py if and only if ∀i, LD xi ≤L Lyi . As a consequence, if px ≺ py then px ≤P py . Moreover, if ∗ ∗ px ≺ py and py ≺ pz , then px ≤P pz . A dimensional lattice hP, Ai is hence a directed acyclic graph where P = {p1 , . . . , pn } is the set of patterns and A is the set of arcs among them representing the partial order relation. Figure 2c shows an example of dimensional lattice generated by Product and Location dimensions. Given that a pattern identifies a view in a data mart, as also proposed by [28], the dimensional lattice can represent dependence relations among views. As such, it can be exploited to support query reformulation and answering mechanisms, as discussed in Section 4.
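Under the simplifying assumption that each dimension's levels form a total order from finest to coarsest (a special case of the hierarchies of Definition 1), patterns, the order ≤_P and the covering relation can be sketched as follows; all names are illustrative.

```python
from itertools import product as cartesian

# Each dimension: levels listed from finest to coarsest (assumed total order).
PRODUCT = ["product", "category", "all"]
LOCATION = ["city", "region", "country", "all"]
DIMS = [PRODUCT, LOCATION]

def leq_level(dim, a, b):
    """a <=_L b: level a rolls up to level b within dimension dim."""
    return dim.index(a) <= dim.index(b)

def leq_pattern(px, py):
    """px <=_P py iff every component of px is <=_L the matching one of py."""
    return all(leq_level(d, a, b) for d, a, b in zip(DIMS, px, py))

def covers(px, py, patterns):
    """px covers py (an arc of the lattice): px below py with no distinct
    pattern strictly in between w.r.t. <=_P."""
    if px == py or not leq_pattern(px, py):
        return False
    return not any(pz not in (px, py)
                   and leq_pattern(px, pz) and leq_pattern(pz, py)
                   for pz in patterns)

# The pattern set P is the cartesian product of the dimension hierarchies.
PATTERNS = list(cartesian(PRODUCT, LOCATION))

print(leq_pattern(("product", "city"), ("category", "country")))    # True
print(covers(("product", "city"), ("product", "region"), PATTERNS))  # True
print(covers(("product", "city"), ("category", "region"), PATTERNS)) # False
```

The last call is False because, for instance, ⟨product, region⟩ lies strictly between the two patterns, so there is no covering arc between them in the lattice.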

Definition 4. (Formula)
Given a set {ind_1, ..., ind_n} of symbols of atomic indicators and a set {op^{a_1}_1, ..., op^{a_n}_n} of operators (where a_j is the arity of the j-th operator), we define a well-formed indicator formula as a finite construction obtained from the recursive definition given below:

• ind_j is a well-formed indicator formula;
• if {ind_1, ..., ind_k} are well-formed indicator formulas then op^k(ind_1, ..., ind_k) is a well-formed indicator formula.

Costs = TravelCosts + PersonnelCosts
PersonnelCosts = NumHours * HourlyCost
PersonnelTrainingCosts = HourlyCost * NumTrainingHours
TeachCost = NumTrainingHours * HourRate
InvestmentInEmpDev = PersonnelTrainingCosts + TeachCosts

Figure 3: Example of formula graph for 10 indicators.


As we refer to algebraic operators like +, −, ∗, /, ^, well-formed indicator formulas can be represented by mathematical expressions in prefix notation, e.g. +(−(A, B), ∗(C, D)). Formulas and the dependencies among them can be represented graphically in a formula graph, as in Figure 3, where the formulas for the federated schema of the case study are shown: they include both indicators defined in the data marts and other indicators that may be interesting to analyse at VE level. Together with the dimensional lattice, formulas are exploited in Section 4 as a support for query reformulation. We refer to [4] for further details about formulas and other properties of indicators.
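A minimal sketch of prefix-notation formulas represented as nested tuples, with a recursive evaluator that also expands a compound indicator into its own formula (in the spirit of the indicator expansion discussed later). The formula set mirrors part of Figure 3; the leaf values are invented for illustration.

```python
import operator

# Algebraic operators admitted in well-formed formulas.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv, "^": operator.pow}

# Formula graph fragment (prefix notation as nested tuples).
FORMULAS = {
    "Costs": ("+", "TravelCosts", "PersonnelCosts"),
    "PersonnelCosts": ("*", "NumHours", "HourlyCost"),
}

def evaluate(formula, atomic_values):
    """Recursively evaluate a prefix formula: a string is an indicator
    (atomic, or expanded through its own formula); a tuple applies an
    operator to its evaluated sub-formulas."""
    if isinstance(formula, str):
        if formula in atomic_values:
            return atomic_values[formula]
        return evaluate(FORMULAS[formula], atomic_values)  # expand indicator
    op, *args = formula
    vals = [evaluate(a, atomic_values) for a in args]
    result = vals[0]
    for v in vals[1:]:
        result = OPS[op](result, v)
    return result

# Costs expands to +(TravelCosts, *(NumHours, HourlyCost)).
values = {"TravelCosts": 120.0, "NumHours": 160.0, "HourlyCost": 25.0}
print(evaluate("Costs", values))  # 120 + 160*25 = 4120.0
```

The recursion follows exactly the two clauses of Definition 4: the base case handles a bare indicator symbol, the inductive case an operator applied to sub-formulas.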


Definition 5. (Indicator)
An indicator ind is a pair ⟨aggr, f⟩, where aggr is an aggregation function and f(ind_1, . . . , ind_n) is a formula. Hereafter we denote by τ(ind) the set of values the indicator ind assumes, i.e. its domain.


Definition 6. (Aggregate function)
A function that represents how to summarize the values of an indicator. According to the classification usually adopted in the data warehouse Literature [29], aggregation functions can be categorized as follows, based on how a group can be aggregated from subgroups:

• Distributive. If a set of data is partitioned into n sets, applying the function to each set yields n aggregate values. A function F is distributive if the aggregate value of the whole data set d can be computed by applying the same function to the previously aggregated values d_s, i.e. F(d) = F(d_s). Hence, the value at a certain level for an indicator with a distributive aggregator (e.g., SUM, MIN, MAX, COUNT) can be directly computed through a roll-up operation from the lower level; e.g., the TravelCosts for the 1st semester of 2016 can be computed by summing the values of TravelCosts for the first six months of 2016.


• Algebraic. An aggregate function F is algebraic if, for a group g, F(g) can be computed from a known number M of intermediate aggregations of sub-groups of g by some aggregate functions. Hence, these aggregators cannot be computed from the values at the next lower level unless a set of other indicators is also provided, which transforms the algebraic indicator into a distributive one; a classical example is AVG, which can be computed from the next lower level if the number of elements on which the AVG is computed is provided as well. For instance, let us suppose that an indicator “Average Revenue per product” has been computed on 10 stores in the Italian market and on 4 stores in the German market. Then, the Average Revenue at the European level can be obtained by weighting the values of the indicator at country level by the number of stores, summing them and dividing by the total number of stores.

• Holistic. There is no fixed number of intermediate aggregate values that can aggregate a group for a holistic function (e.g., MEDIAN, MODE). Hence, for indicators with these functions, there exists no computation method other than computing the aggregation directly from the raw data.
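The difference between distributive and algebraic aggregators can be made concrete with a small numeric sketch (all values invented): SUM rolls up directly from sub-aggregates, while AVG must carry the sub-group counts along.

```python
# Distributive: the SUM over a group equals the SUM of sub-group sums.
monthly_travel_costs = [10, 12, 9, 11, 10, 8]   # Jan..Jun 2016
semester_total = sum(monthly_travel_costs)       # roll-up for the 1st semester

# Algebraic: AVG needs (average, count) pairs from each sub-group.
italy = {"avg_revenue": 50.0, "stores": 10}
germany = {"avg_revenue": 80.0, "stores": 4}

weighted_sum = (italy["avg_revenue"] * italy["stores"]
                + germany["avg_revenue"] * germany["stores"])
total_stores = italy["stores"] + germany["stores"]
european_avg = weighted_sum / total_stores       # NOT (50 + 80) / 2

print(semester_total)   # 60
print(european_avg)     # (500 + 320) / 14 ≈ 58.57
```

Carrying the store counts along is exactly the extra indicator that turns the algebraic AVG into a distributive computation over (sum, count) pairs.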


We further introduce a special aggregation function, namely NONE, which means that no aggregation is performed when moving along dimensions, e.g. Return on Investment = ⟨NONE, TotalRevenue/Investment⟩. For the sake of simplicity and without loss of generality, in this work we assume that each indicator has a unique aggregation function for every dimension, and that different aggregations of the same data define different indicators (for instance, total revenue and average revenue). The type of aggregation function affects the way the rewriting can be performed, as discussed in the next Section. Given a dimensional schema for an indicator (i.e. a set of dimensions along which it can be measured), the instance is given by cells that contain data aggregated with respect to certain levels of the dimensions. We introduce this structure through the notion of cubeElement.


Definition 7. (Cube and cube element) A cubeElement is a storage element defined as a tuple ⟨ind, m1, ..., mn, s, v⟩, where ind ∈ I, mi ∈ α(L_Di) (with Di ∈ ds(ind)), s is the data mart of a source, and v ∈ τ(ind) is the value of ind. A cube(ind, s) is the logical grouping of all cubeElements provided for an indicator ind by the source s. Finally, the set of all cubes for a Virtual Enterprise is the instance of the global data warehouse.
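Definition 7 can be rendered as a small data structure (names and types are ours, for illustration only):

```python
# Minimal sketch of Definition 7: a cubeElement and the cube(ind, s) grouping.
from dataclasses import dataclass
from typing import Tuple, List

@dataclass(frozen=True)
class CubeElement:
    ind: str                  # indicator name, e.g. "PersonnelCosts"
    members: Tuple[str, ...]  # one member per dimension, e.g. ("2016_S1", "Spain")
    source: str               # data mart providing the element
    value: float              # the measured value

def cube(elements: List[CubeElement], ind: str, source: str) -> List[CubeElement]:
    """cube(ind, s): all cubeElements for indicator ind provided by source s."""
    return [e for e in elements if e.ind == ind and e.source == source]

elems = [
    CubeElement("PersonnelCosts", ("2016_S1", "Spain"), "ACME1", 20.0),
    CubeElement("PersonnelCosts", ("2016_S2", "Spain"), "ACME1", 30.0),
    CubeElement("TravelCosts", ("2016", "Spain"), "ACME1", 35.0),
]
pc_cube = cube(elems, "PersonnelCosts", "ACME1")  # the two PersonnelCosts cells
```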


3.2. Architecture

Here we refer to the classic architecture for a FDWH structured in layers, among which we focus on the federation and export ones, and on the mappings among them. The federation layer includes the global schema, built by integration of the export schemas of the local sources, which are represented through the canonical data model, i.e. the semantic multidimensional model. The federation layer includes mapping assertions between local and global concepts, as shown in Figure 4. The same mappings are used to know, at the federated level, which levels and members are defined in each source. In the following we report on the mappings defined to link indicators, dimensions, levels and members of an export schema l to the federated schema G. Within the set of mapping predicates proposed by [22], we refer to the following ones:


• same(indl, indG) to state the equivalence between a local indicator indl and a global one indG;

• sameDim(Dl, DG) to state the equivalence between a local dimension Dl and a global one DG;

• equi-level(Ll, LG) to declare the equivalence between a local level Ll and a level LG in the global schema;

• we also introduce equi-member(ml, mG) to declare that a local member ml is equivalent to a global one mG.

Annotations can be attached to these predicates to state syntactical transformations where needed (e.g., substring extraction or similar). Other predicates introduced in [22] are left out of the discussion, as they allow one to specify roll-up and drill-down relations, which are not needed in this case because the global schema includes all the knowledge


Figure 4: Case study: example of mappings between dimensional hierarchies and members of the federated layer and the export layer of ACME1.


about dimensional hierarchies. Also the possible specification of a mathematical expression, attached by the authors to the same predicate, is replaced in this work by the explicit representation of formulas in the global model. As for heterogeneities related to measures with different units of measurement, involving scaling or similar issues, we assume that new indicators are defined with formulas capable of performing the needed transformation, e.g. temp_Celsius = 5/9 * (temp_Fahrenheit − 32). In order to reduce the effort of processing a federated query, here we take the approach proposed by several works [30, 31, 2], consisting in including, at this level, a model of the export schemas to specify which dimensions and indicators are defined by the data mart of each source, and which views are materialized:


• we represent the dimensional schema for an indicator ind and a data mart s as dss(ind), which is always a subset of ds(ind). In case a dimension is not defined for a data mart, we assume it is aggregated at the highest level, to make the cubeElements of the source compliant with the global schema.


• Materialized views are specified at the federation layer through tuples ⟨ind, p, s⟩, which declare that, for data mart s and indicator ind, the view specified by the pattern p = ⟨L_Di, ..., L_Dj⟩ is materialized, where {Di, ..., Dj} ⊆ dss(ind). For example, the view identified by ⟨PersonnelCosts, ⟨Month, Region⟩, ACME1⟩ means that all cubeElements for PersonnelCosts aggregated at the Month and Region levels are available in ACME1's data mart.
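The materialized-view declarations kept at the federation layer can be sketched as a simple lookup (a toy representation of our own, not the paper's implementation):

```python
# Sketch of the federation-layer registry of materialized views <ind, p, s>.
# Each entry: (indicator, pattern of levels, data mart) -- hypothetical values.
MATERIALIZED = {
    ("PersonnelCosts", ("Month", "Region"), "ACME1"),
    ("TravelCosts", ("Year", "Region"), "ACME1"),
}

def is_materialized(ind, pattern, source):
    """True iff the view for ind at the given aggregation pattern is stored in source."""
    return (ind, tuple(pattern), source) in MATERIALIZED
```

Such a registry is what allows the reformulation procedure to decide, without contacting the sources, whether a query pattern can be answered directly.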

Example. In Figure 4 we depict the dimension hierarchies in the federated schema for the Time and Location dimensions of the case study. The dimensional schema for all indicators is ds(Costs) = ... = ds(HourRate) = {TimeDimension, ProductDimension, LocationDimension}, while for ACME1 we have dsACME1(TravelCosts) = dsACME1(PersonnelCosts) = {TimeDimension, LocationDimension}. In this case, we consider the missing ProductDimension as implicitly aggregated at the fictitious level LALL. Please note that, for the sake of simplicity, in the following we refer to global terms, implicitly hiding the specific terminology stated by the above-mentioned mappings. Given that the focus of this work is on query


reformulation, which is performed within the canonical data model, we do not delve into further details about other typologies of mappings, for instance between export and local schemas. The setup of a comprehensive mapping framework, capable of integrating local schemas complying with various models in a FDWH architecture, is a complex issue that has only recently received attention. We refer the interested reader to works dealing with these aspects, such as [2].

3.3. Multidimensional queries


In this work we focus on multidimensional OLAP queries that can be expressed as conjunctive queries with aggregate functions. Hereafter we provide the definition of a multidimensional query in an abstract syntax. For the sake of simplicity and without loss of generality, the definition refers to a single indicator.

Definition 8. A multidimensional query q over a Virtual Enterprise VE is a tuple ⟨ind, W, K, VE, θ⟩, where:

• ind ∈ I is the requested indicator;

• W is a tuple of levels ⟨L_D1, ..., L_Dn⟩ on which the result must be aggregated;


• K includes sets Ki = {mi1, ..., mik} ⊆ α(L_Di) of members on which to filter, where L_Di ∈ W. If Ki is empty, all members of the corresponding level are considered;

• VE is a set of sources {s1, ..., sm};

• θ = true if the result must be aggregated at the Virtual Enterprise level; otherwise the query will return a value for each source in VE.


The result set R(q) for the query q is defined as the set of cubeElements that satisfy the query, i.e. {⟨ind, m1, ..., mn, s, v⟩} where (m1, ..., mn) ∈ K1 × ... × Kn, s ∈ VE and v ∈ τ(ind). Finally, a multidimensional query over a single source s is defined as ⟨ind, W, K, s⟩.
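Definition 8 and the result-set condition can be sketched in a few lines (our own simplified representation, with empty filter sets meaning "all members"):

```python
# Sketch of a multidimensional query (Definition 8) and the membership test
# that defines its result set R(q). Structures and names are illustrative.
from dataclasses import dataclass
from typing import Tuple, FrozenSet

@dataclass(frozen=True)
class MDQuery:
    ind: str                       # requested indicator
    W: Tuple[str, ...]             # roll-up levels, e.g. ("Year", "Region")
    K: Tuple[FrozenSet[str], ...]  # member filters, one set per level (empty = all)
    sources: Tuple[str, ...]       # the Virtual Enterprise
    theta: bool                    # aggregate at VE level?

def satisfies(query: MDQuery, members, source) -> bool:
    """True iff a cubeElement with these members/source belongs to R(q)."""
    if source not in query.sources:
        return False
    return all(not k or m in k for m, k in zip(members, query.K))

q = MDQuery("Costs", ("Year", "Region"),
            (frozenset({"2016"}), frozenset({"Marche", "Toscana"})),
            ("ACME1", "ACME2"), False)
```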


Example. By referring to the case study in Subsection 1.1, a valid multidimensional query over the global schema is ⟨Costs, ⟨Year, Region⟩, ⟨{2016}, {Marche, Toscana}⟩, {ACME1, ACME2}, false⟩, as 2016 ∈ α(Year) and Marche, Toscana ∈ α(Region).


The definition extends the classical notion of an aggregated OLAP query; in particular, W and ind are the elements of the target list. W defines the desired roll-up levels; in the following it will be referred to as the "pattern" of the query, since it identifies the view over which the query is posed. Finally, K allows slice-and-dice operations (suitable selections of a portion of the cube).


3.4. Rewriting rules

In the following we discuss two rules that are used in the next Section to rewrite a query over a non-materialized view using materialized ones, either by exploiting aggregation (drill-down) or calculation (indicator decomposition, or expansion).


Definition 9. (Drill-down expansion) Let cube(ind, s) be the cube for indicator ind and data mart s, and let Lx, Ly be levels of the dimension D ∈ ds(ind) such that Ly ≤L Lx. A cubeElement ⟨ind, m1, ..., mx, ..., mn, s, a⟩ can be calculated from the set of cubeElements ⟨ind, m1, ..., myi, ..., mn, s, ai⟩, for all myi ∈ α(Ly) that roll up to mx. The value a is calculated by aggregating all ai under the aggregation function of ind. This corresponds to the well-known notion of query rewriting using aggregated views.

Example. Let us consider levels Semester ≤L Year, member 2016 of Year and members {2016_S1, 2016_S2} of Semester, which are all and only the members that roll up to 2016. If these cubeElements exist:

• ⟨PersonnelCosts, 2016_S1, Spain, ACME1, 20⟩


• ⟨PersonnelCosts, 2016_S2, Spain, ACME1, 30⟩,

then the following cubeElement can be computed, assuming that the aggregation function of PersonnelCosts is SUM: ⟨PersonnelCosts, 2016, Spain, ACME1, 50⟩.

Rule 1. (Rewriting through drill-down expansion) Given a query q = ⟨ind, W, K, s⟩ over s, and the conditions of Definition 9, the query q′ = ⟨ind, W′, K′, s⟩ can be used to answer q, where:


• W′ = (W \ {Ly}) ∪ {Lx}

• K′ = (K \ {m1, ..., mn}) ∪ {mi}, where mi ∈ α(Lx) and is valid for s, and partOf(mj, mi) for all mj ∈ α(Ly) in {m1, ..., mn}.

This rule exploits the classical roll-up OLAP operation of data warehouse systems.
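The drill-down expansion of Definition 9 amounts to a roll-up computation over lower-level cells; a minimal sketch, assuming a SUM aggregator and a toy part-of relation:

```python
# Sketch of Rule 1 / Definition 9: computing a higher-level cubeElement by
# aggregating lower-level ones. PART_OF is a hypothetical roll-up mapping.
PART_OF = {"2016_S1": "2016", "2016_S2": "2016"}  # Semester members roll up to Year

def drill_down_rollup(cells, target_member, agg=sum):
    """cells: (member, value) pairs at the lower level; returns the rolled-up value."""
    return agg(v for m, v in cells if PART_OF.get(m) == target_member)

year_value = drill_down_rollup([("2016_S1", 20.0), ("2016_S2", 30.0)], "2016")
# Reproduces the example: <PersonnelCosts, 2016, Spain, ACME1, 50>
```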


Definition 10. (Indicator expansion) Let cube(ind, s) be the cube for indicator ind = ⟨aggr, f(ind1, ..., indk)⟩ and source s ∈ VE. If the set of cubeElements {⟨indj, m1, ..., mn, s, vj⟩} with j = 1, ..., k exists in s, then the cubeElement ⟨ind, m1, ..., mn, s, f(v1, ..., vk)⟩ can be computed.

Example. Let us consider the indicator Costs with formula Costs = TravelCosts + PersonnelCosts. Assuming that the following cubeElements are available:

• ⟨TravelCosts, 2016, Spain, ACME1, 35⟩

• ⟨PersonnelCosts, 2016, Spain, ACME1, 45⟩

then the cubeElement ⟨Costs, 2016, Spain, ACME1, 80⟩ can be computed.


Rule 2. (Rewriting through indicator expansion) Given a query q = ⟨ind, W, K, s⟩ over s and the conditions of Definition 10, the set of queries q′j = ⟨indj, W, K, s⟩, with j = 1, ..., k, can be used to answer q.


This rule introduces a novel operation made possible by the formal representation of formulas. As such, it constitutes one of the basic building blocks needed to reformulate a global query in the procedure discussed in Section 5.
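Indicator expansion can be sketched as evaluating a formula over component cells at the same coordinates (toy formula registry and data of our own):

```python
# Sketch of Rule 2 / Definition 10: deriving an indicator's cubeElement from
# its components via its formula. FORMULAS is a hypothetical formula registry.
FORMULAS = {"Costs": (lambda tc, pc: tc + pc, ("TravelCosts", "PersonnelCosts"))}

def expand_indicator(ind, cells_by_ind, coords):
    """cells_by_ind: indicator -> {coords: value}; returns the computed value."""
    f, components = FORMULAS[ind]
    return f(*(cells_by_ind[c][coords] for c in components))

cells = {"TravelCosts": {("2016", "Spain"): 35.0},
         "PersonnelCosts": {("2016", "Spain"): 45.0}}
costs = expand_indicator("Costs", cells, ("2016", "Spain"))
# Reproduces the example: <Costs, 2016, Spain, ACME1, 80>
```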


4. Proposed approach for query reformulation


Computing an answer to a query expressed in terms of the global schema consists in (1) reformulating the query as valid queries expressed in terms of export schemas, with the goal of providing an answer to the user while hiding the implementation details of the federated architecture, and (2) rewriting each local query by using materialized views of the sources. Given a query, if the corresponding view is materialized in the data mart, then the search is over. Otherwise, we need to apply one of the rewriting rules (drill-down or indicator expansion) to rewrite the query, and then repeat the check recursively until a materialized view is found such that the query can be answered. In this Section we discuss the criteria according to which the rules introduced in the previous Section can be applied and how they can be used to generate the space of possible rewritings.

4.1. Validity and usage of rewriting rules

In this work, the fundamental problem related to rewriting is to define under which conditions the two rewriting rules discussed in the previous Section can be applied, i.e. how aggregability and indicator formulas affect each other. As introduced in the previous Section, the aggregability of an indicator depends on its aggregation function. As a consequence, the rewriting rules can be applied differently according to the aggregator. The following major results come from the definition of distributive aggregation functions and from the fact that summation commutes with distributive aggregation functions.


Aggr. function | Additivity | Applicability rule 1 | Applicability rule 2
distributive   | true       | yes                  | yes
distributive   | false      | yes                  | yes*
algebraic      | true/false | yes+                 | yes*+
holistic       | true/false | no                   | no
none           | true/false | N/A                  | yes

Table 2: Conditions for applicability of expansion rules according to the aggregation function of the indicator and the additivity of its formula (* = only for the lowest levels for every dimension; + = under the condition that the algebraic function is transformed into a distributive one, as discussed in the text).

Definition 11. (Distribution and additivity) Given an indicator ind, with aggregation function aggr and formula f(indα, ..., indω), the following property holds: if aggr belongs to the class of distributive functions and f is additive (i.e., it includes only summations and differences of indicators), then aggr(f(indα, ..., indω)) = f(aggr(indα), ..., aggr(indω)).
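As a quick numerical check of Definition 11 (toy per-semester figures of our own): with a distributive aggregator (SUM) and an additive formula (Costs = TravelCosts + PersonnelCosts), aggregating the composite values equals composing the aggregates.

```python
# Numerical check of Definition 11 on made-up data.
travel = [10.0, 25.0]     # e.g. per-semester TravelCosts
personnel = [20.0, 30.0]  # e.g. per-semester PersonnelCosts

# aggr(f(ind_alpha, ind_omega)): sum the per-cell composite values
lhs = sum(t + p for t, p in zip(travel, personnel))
# f(aggr(ind_alpha), aggr(ind_omega)): aggregate components first, then apply f
rhs = sum(travel) + sum(personnel)
# lhs == rhs; with a non-additive formula (e.g. t * p) the equality fails.
```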


In other words, this property states that aggregating the indicator after the calculation of the formula (i.e., aggr(ind) = aggr(f(indα, ..., indω))) is equivalent to first aggregating the components and then calculating the formula (i.e., f(aggr(indα), ..., aggr(indω))), under the specified conditions. As also summarized in Table 2, from this definition and from the definitions of the aggregation functions, we derive in the following the usage of the two rules for each type of aggregation function:


• Indicators with distributive aggregators: rule 1 can always be used, by definition of distributive aggregation function (see Subsection 3.1). In case the formula is additive, rule 2 can be applied according to Definition 11. Otherwise, in general it cannot be applied, with the exception of the case in which the query does not require further aggregation, i.e. if W includes only levels that are the lowest for each dimension hierarchy.


• Indicators with algebraic aggregators: by definition, an algebraic indicator can be transformed into a distributive one by replacing the formula of the indicator (see Subsection 3.1). Under this condition, rule 1 can be applied according to Definition 11, while rule 2 can be applied under the same conditions as for distributive aggregators. For instance, if ind = ⟨AVG, f⟩ then the indicator can be rewritten as ind′ = ⟨SUM, f/CountM⟩, where CountM is the number of cubeElements of the lowest levels over which the SUM is performed. For example, if the query asks for HourlyCost aggregated at 2016 (level Year), then CountM is 365 (i.e. the number of members of level Day that are partOf 2016).


• Indicators with holistic aggregators: by definition (see Subsection 3.1), neither rule 1 nor rule 2 can be applied. As an example, if a query asks for the Median amount sold (with aggregation function MEDIAN) for year 2016, and the cubeElement is not available, mathematically there is no way to roll up lower-level cubeElements (e.g. at Day level) to obtain the required value.


• Indicators with the aggregator NONE: rule 1 cannot be applied, by definition (see Subsection 3.1). As a consequence, given that no aggregation can be performed, rule 2 can be used with no constraints.

4.2. The state space

In order to have an overall and synthetic representation of all possible combinations of views useful for rewritings of the global query, in this section we introduce the state space of the rewriting. The problem of finding a proper rewriting for a query can indeed be expressed as a search over the space of all possible rewritings[3]. The state space is built in an incremental way, starting from a root state which is the pattern of

[3] This is related to the view usability problem in query rewriting, i.e. the problem of identifying the views that can be useful in the rewriting process.


the query. Then, for every valid rule (drill-down or indicator expansion) that can be applied at each state, a new state is defined accordingly, which includes the pattern(s) of the queries obtained from the application of the rule. The state space is hence generated by combining the dimensional lattice and the formula graph, and contains all states that can be used to rewrite the query. Please note that the range of validity of each rule depends, as discussed in the previous subsection, on the specific aggregator of the indicator at hand. This means that, for instance, if the aggregator is AVG it is not valid to drill down to the next level; rather, we need to drill down to the lowest level. Although we have taken this aspect into account in building the state space, for the sake of simplicity we will show the basic case with a distributive aggregator and an additive formula. More formally, we first define a state Si as a set {pαx, ..., pωz} of views, where each pαk represents a view ⟨indα, pk⟩ and pk is a pattern in the dimension lattice of s for indα. By extending the definition of ≺* in Section 3, given two states Sx, Sy and given indα = f(indα1, ..., indαn), the relation Sx ≺* Sy holds if either:

• (drill-down expansion) Sx = Sy \ {pαi} ∪ {pαj}, where pαj ≺* pαi and they refer to the same indicator.


• (indicator expansion) Sx = Sy \ {pαi} ∪ {pα1 i, ..., pαn i}, where pαi refers to the same level of aggregation as pα1 i, ..., pαn i.


The state space is therefore defined as a partially ordered set S = (S, ≤S), where S is the state set and ≤S is a partial order relation (reflexive, antisymmetric and transitive), defined by extending the covering relation, i.e. if Sx ≺* Sy ≺* Sz, then Sx ≤S Sy, Sy ≤S Sz and Sx ≤S Sz. The space can be represented as a directed acyclic graph ⟨S, A⟩, where A is the set of arcs representing rules 1 and 2. As an example, in Figure 5 we show a state space built for a fragment of the dimensional lattice in Figure 2c and for an indicator with a distributive aggregator and formula indα = indβ + indγ. From the Figure we see that {pβ2, pγ4} ≤S {pβ2, pγ2} because, according to the drill-down rule, pγ4 ≺* pγ2 and the other view pβ2 is unchanged. Similarly, {pβ1, pγ1} ≤S {pα1} because indα = indβ + indγ and the indicator expansion rule holds. A state in the state space is materialized if all its views are materialized (they are shown in grey in the Figure). A path in the state space is a totally ordered set of states from the root state (which corresponds to the objective of the search, i.e. the pattern/view required by the multidimensional query) to a materialized state. The length of a path is always greater than 1. As such, a path defines a possible way to compute the global query by rewriting it in terms of other queries. In order to avoid cycles, each formula can be used only once for each path. The following property holds for the state space (a proof is in Appendix A).


Lemma 1. (Finiteness of the state space) Given cube(ind, s) and a finite set of indicator formulas, the number of states of the state space is finite. In Section 6 we analytically discuss its size.


5. Query answering procedure

Given a query q = ⟨ind, W, K, VE, θ⟩ over a VE, the procedure for query answering is composed of the following macro-processes, applied to each source, as discussed below:

a) adaptation of the global query to the dimensional schema of ind for s: from a global multidimensional query, a customised query for s is obtained as output;

b) search over the state space of s to derive a path to a materialized state;

c) rewriting of the query according to the path: starting from the query obtained at step (a) and the path at step (b), a set of subqueries is generated, capable of retrieving all the needed data;



Figure 5: (a) A fragment of the dimensional lattice of Figure 2c and (b) the corresponding state space for indα =indβ +indγ . With r1 and r2 we refer to the application of corresponding rules. The states filled in grey are materialized.


d) answer integration and calculation of final results: from all the subqueries obtained in step (c), an integrated final query is obtained and then executed over s. In case θ is true, as a last step the results obtained for each s ∈ VE are aggregated in order to provide users with a single result.


a) Adaptation of the query to the export schema

We can adapt a query ⟨ind, W, K, VE, θ⟩ to a given source cube only under the following conditions:


• (c1) the indicator is provided by the source at hand, or it can be obtained from a formula containing indicators of the cube. More formally, either same(ind*, ind) holds, or ind = f(ind1, ..., indn) ∧ ∀indi, same(ind*i, indi).


• (c2) the dimensions requested in the query are a subset of dss(ind). Furthermore, if a dimension D is not included in the query, then we assume it is requested at the highest level, i.e. L_ALL^D.

Finally, if the query includes a dimension that is not in the cube, it cannot be adapted. Under these conditions, the multidimensional query qs over dss(ind), with s ∈ VE, is defined as ⟨ind*, W′, K′, s⟩, where W′ and K′ in (each) qs are as follows:

• (a) W′ = W, K′ = K, if W contains a level for each dimension in dss(ind);

• (b) W′ = W ∪ {L_ALL^Dx}, K′ = K ∪ {ALL}, for every Dx ∈ dss(ind) without a corresponding level in W;


where mappings equi-level and equi-member are properly taken into account. Otherwise, qs cannot be defined.

Example. Hereafter, in some cases, for convenience of notation we will refer to Costs, TravelCosts and PersonnelCosts as C, TC and PC, respectively. Let us consider a query ⟨C, ⟨Year, Country⟩, ⟨{2016}, {Italy}⟩, {ACME1, ACME2}, false⟩ over the global schema and the two sources of the example in Subsection 1.1. The adaptation of the query to the export schemas of the sources is as follows:

• qACME1 = ⟨C, ⟨Year, Country⟩, ⟨{2016}, {Italy}⟩, ACME1⟩


• qACME2 = ⟨C, ⟨Year, Country, ProductALL⟩, ⟨{2016}, {Italy}, {ALL}⟩, ACME2⟩

Please note that ACME1 does not include Costs, but a formula for it is available, namely Costs = TravelCosts + PersonnelCosts, such that the components are both provided by ACME1. As for ACME2, note that dsACME2(Costs) also includes a dimension ProductDimension, which is not included in the query. For this reason, the rewriting of the query for ACME2 includes a new level ProductALL, i.e. the fictitious level for ProductDimension, and a corresponding member ALL.
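The padding of under-specified dimensions with the fictitious ALL level, as in the qACME2 case, can be sketched as follows (simplified structures of our own):

```python
# Sketch of step (a): adapting a global query to a source's dimensional schema
# by appending the fictitious ALL level/member for every extra dimension.
def adapt_query(W, K, extra_all_levels):
    """W, K: parallel lists of levels and member filters from the global query.
    extra_all_levels: ALL-level names for the source's extra dimensions,
    e.g. ["ProductALL"] (hypothetical naming)."""
    W2, K2 = list(W), list(K)
    for all_level in extra_all_levels:
        if all_level not in W2:
            W2.append(all_level)   # aggregate the extra dimension away
            K2.append({"ALL"})
    return W2, K2

W2, K2 = adapt_query(["Year", "Country"], [{"2016"}, {"Italy"}], ["ProductALL"])
# W2 gains "ProductALL", K2 gains the member {"ALL"}, as in qACME2
```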


b) Determination of a path to a materialized state

After the adaptation of the global query to the specific dimensional schema of the source at hand, the next step is to determine how the query can be solved. Indeed, if the pattern of the query does not correspond to a materialized view for s, no answer can be given directly. For this reason, the query is rewritten by exploiting the drill-down and indicator expansion rules. Query rewriting can be modeled as a search problem over the state space S with a starting state Sin. The procedure implements a breadth-first search (BFS), exploring neighboring states first before moving to the next level. A solution to the problem is a path ⟨Sin, ..., Sfin⟩ from Sin to a state Sfin that is materialized. The procedure to find a solution is described in pseudo-code in Algorithm 1. At first, given a query over a source s, a new path is created containing only Sin. Then, it is pushed into a FIFO list C, which will contain both the partial paths Xi that are to be analyzed and, for each path, the set Fi of formulas already used for the path. At the beginning, the set V of visited states is empty. For every new iteration (line 4) a path Xi (a LIFO list of states) is popped from C, and its head Sh is considered. If this state has been visited before (line 5), the path is discarded. Otherwise, if Sh is materialized, the procedure ends and the path Xi is returned as solution (line 7). If Sh is not materialized, the find_rewritings procedure is called to find alternative ways to rewrite it (line 9). This procedure takes into account every element pαx ∈ Sh that is not materialized. For each of them, rules 1 and 2 are separately executed, according to what is defined in Subsection 4.1. As for the former (line 15), every drill-down for the pattern is considered, and for each of them a new state S′ is created as the head of a new path that is then pushed into C (line 17).
As for the latter (line 19), every formula for indα is retrieved, and a new state is created by replacing pαx with all the components of the formula f(indα1, ..., indαn). Such a new state is the head of a new path that is pushed into C (line 21). To keep track of the specific transformation (drill-down or indicator expansion) applied from Sh to S′, the procedure also stores in C a list T (see line 17 and line 21). Each element of T follows the template [r1|r2, ind, expr], where r1 or r2 represents which rule has been applied, ind is the indicator at hand, and expr is the expression specifying the applied transformation:

• in case of r1: expr = Lx → Ly [, merge];

• in case of r2: expr = f(indα1, ..., indαn) [, merge].

where the optional annotation [merge] removes duplicates, to avoid following the same path more than once. It is applied whenever the application of the rule generates a view that is identical to another one in the state. This can happen when there are nodes with multiple parents in the formula graph. Indeed, if an indicator node in the graph has multiple parents, it means that it is a component of more than one formula; e.g., in Figure 3 NumTrainingHours is included in two formulas. As a consequence, in the space, the


Function query_answering(Sin)
  Result: A path X = ⟨Sin, ..., Sfin⟩, where Sfin is materialized
1   V ← ∅;
2   C ← [Xin = {Sin}, Tin = ∅, Fin = ∅];
3   while C is not empty do
4       C → [Xi = [Sh | Xt], Ti, Fi];
5       if Sh ∉ V then
6           if Sh is materialized then
7               return Xi;
8           end
9           find_rewritings(Xi, Ti, Fi);
10          V ← Sh;
11      end
12  end
13 end

Procedure find_rewritings(Xi = [Sh | Xt], Ti, Fi)
  Data: A path Xi, the list Ti of transformations and the set Fi of formulas already used for such a path
14  for every view pαx ∈ Sh that is not materialized do
15      for every pαy ≺* pαx do
16          S′ = (Sh \ {pαx}) ∪ {pαy};
17          C ← [Xi ← S′, Ti ← [r1, indα, pαx → pαy], Fi];
18      end
19      for every formula fα = f(indα) = f(indα1, ..., indαn) such that fα ∉ Fi do
20          S′ = (Sh \ {pαx}) ∪ {pα1 x, ..., pαn x};
21          C ← [Xi ← S′, Ti ← [r2, indα, f(indα1, ..., indαn)], Fi ← fα];
22      end
23  end

Algorithm 1: Procedure to find a valid solution.


repeated execution of rule 2 will lead to states containing two or more views referring to the same indicator but to different patterns. As an example, let us consider the case in Figure 3. By starting from a root state about InvestmentInEmpDev, if rule 2 is executed we obtain a state with two views, about indicators PersonnelTrainingCost and TeachCost. By applying rule 2 again to the first, we get to a state with three views, about HourlyCost, NumTrainingHours and TeachCost. Finally, by applying rule 2 to TeachCost, the final state will include views about HourlyCost, NumTrainingHours and HourRate (we assume that duplicate views are removed). If rule 1 had been used on TeachCost before the last application of rule 2, we would have got a state with two views related to NumTrainingHours but with different patterns. Note also that, at this step, the list Fi of formulas already used is considered and updated, as every formula can be exploited only once per path. Finally (line 10), the analysis of the state at hand is over, and it is added to the list of visited states. The procedure goes on as long as there are items in C.

Example. Let us consider the query qACME1 = ⟨C, ⟨Year, Country⟩, ⟨{2016}, {Italy}⟩, ACME1⟩, obtained by adapting q for ACME1 in the previous example. Let us also assume that the materialized views for ACME1 are the following:

• ⟨TC, ⟨Year, Region⟩, ACME1⟩
• ⟨PC, ⟨Semester, Region⟩, ACME1⟩
• ⟨PC, ⟨Semester, City⟩, ACME1⟩
• ⟨PC, ⟨Month, City⟩, ACME1⟩

The procedure starts from a state S1 = {⟨C, ⟨Year, Country⟩⟩}. Given that it is not materialized, the find_rewritings procedure is called. Applying rule 1, new states S21 and S22 are created by drill-down of Country to Region, or by drill-down of Year to Semester. Applying rule 2, a new state S23 is created by expanding C into TC and PC. Corresponding paths for the three new states are created and put in the list C, and the corresponding transformations are stored in T. In the second iteration, let us focus on S21, which is not materialized. Hence, a set of rewritings is generated by rule 1 and rule 2. Among all of them, let us consider the application of rule 2, which leads to the creation of a new state S31 = {⟨TC, ⟨Year, Region⟩⟩, ⟨PC, ⟨Year, Region⟩⟩}, with transformation t2 = [r2, C, TC + PC]. In the third iteration, S31 is further rewritten, as it is not materialized. In particular, it includes a materialized view ⟨TC, ⟨Year, Region⟩⟩, and therefore by using rule 1 the other view is derived by drill-down of Year to Semester. Finally, we obtain a state S41 = {⟨TC, ⟨Year, Region⟩⟩, ⟨PC, ⟨Semester, Region⟩⟩} that turns out to be completely materialized. In short, the final path that is returned is given by:

• S1 = {⟨C, ⟨Year, Country⟩⟩}
• S21 = {⟨C, ⟨Year, Region⟩⟩}
• S31 = {⟨TC, ⟨Year, Region⟩⟩, ⟨PC, ⟨Year, Region⟩⟩}
• S41 = {⟨TC, ⟨Year, Region⟩⟩, ⟨PC, ⟨Semester, Region⟩⟩}

with the following sequence of transformations T:

• t1 = [r1, C, Country → Region]
• t2 = [r2, C, TC + PC]

M

• t3 = [r1, PC, Year → Semester]
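The search of Algorithm 1 can be sketched as a short executable BFS, with states as frozensets of (indicator, pattern) views. The drill-down and formula tables below are a toy fragment mirroring the running example; all names are ours.

```python
# Simplified sketch of Algorithm 1: BFS over states until a fully
# materialized state is found. Not the paper's implementation.
from collections import deque

DRILL = {("Year", "Country"): [("Year", "Region"), ("Semester", "Country")],
         ("Year", "Region"): [("Semester", "Region")]}
FORMULAS = {"C": ("TC", "PC")}          # C = TC + PC
MATERIALIZED = {("TC", ("Year", "Region")), ("PC", ("Semester", "Region"))}

def find_path(start_view):
    queue = deque([([frozenset([start_view])], frozenset())])  # (path, used formulas)
    visited = set()
    while queue:                                   # FIFO list C
        path, used = queue.popleft()
        state = path[-1]
        if state in visited:                       # line 5: discard visited heads
            continue
        if all(v in MATERIALIZED for v in state):  # line 6-7: solution found
            return path
        visited.add(state)
        for ind, pat in state:
            if (ind, pat) in MATERIALIZED:
                continue
            for pat2 in DRILL.get(pat, []):        # rule 1: drill-down
                queue.append((path + [state - {(ind, pat)} | {(ind, pat2)}], used))
            if ind in FORMULAS and ind not in used:  # rule 2: expansion, once per path
                s2 = state - {(ind, pat)} | {(c, pat) for c in FORMULAS[ind]}
                queue.append((path + [s2], used | {ind}))
    return None

path = find_path(("C", ("Year", "Country")))
# Reaches the example's final state {TC@(Year,Region), PC@(Semester,Region)}
```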


c) Rewriting of the query over the sources

This step rewrites the query over the export schema (obtained after step a) as a set of queries over materialized views, by exploiting the path obtained from Algorithm 1. Given a path {Sin, ..., Sfin} of length n in the state space, for a certain s, let us consider the sequence of transformations T = {t1, ..., tn−1}. Such a sequence determines the application of transformations ti (drill-down or indicator expansion) capable of moving from Sin to Sfin. States and queries are mutually related concepts. Indeed, as mentioned, the W in the user query qs = ⟨indα, W, K, s⟩ corresponds to the root state pαx. In this sense, the arcs between states also have the meaning of dependence relations among queries: given two states Si+1 ≺* Si, and given the corresponding queries qi+1 and qi, then qi can be answered using only the results of query qi+1. Hence, qi is dependent on qi+1. As such, given that the sequence of transformations defines how to operationally derive the final state from the initial state, the same sequence can be used to derive the set of subqueries over the sources from the starting query. Let us define a rewriting operator ρt capable of rewriting a query according to a transformation t. A transformation can represent either (a) a drill-down between two levels or (b) an indicator expansion. In the first case, the elements W and K of the query are drilled down accordingly. In the second case, the query is replaced by a set of queries, one for each component of the indicator at hand.

Definition 12. (Rewriting operator ρ) Given a set of queries Q = {q1, ..., qn} over the export schema of s and given a transformation t defined for a certain indα, then ρt(Q) is given by the application of ρt to each query in Q, i.e. {ρt(q1), ..., ρt(qn)}. In turn, given a qi = ⟨indi, Wi, Ki, s⟩, if indi ≠ indα then ρt(qi) = qi, else:

• if t = [r1, indα, L_Dx → L_Dy], then ρt(qi) is equal to ⟨indα, W′, K′, s⟩ according to rewriting rule 1;

• if t = [r2, indα, f(indα1, ..., indαm)], then ρt(qi) is equal to {qi1, ..., qim}, where ∀indαj, qij = ⟨indαj, W, K, s⟩ according to rewriting rule 2.
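The operator ρ of Definition 12 can be sketched over simplified query tuples (ind, W, K); transformation encodings below are our own toy representation, with the drill-down case carrying an explicit roll-up map to rewrite the member filters.

```python
# Sketch of the rewriting operator rho (Definition 12), not the paper's API.
# r1 transformations: ("r1", ind, (Lx, Ly, rollup_map))
# r2 transformations: ("r2", ind, components)
def rho(t, queries):
    kind, ind_t = t[0], t[1]
    out = []
    for q in queries:
        ind, W, K = q
        if ind != ind_t:
            out.append(q)                          # transformation does not apply
        elif kind == "r1":                         # drill-down Lx -> Ly
            lx, ly, rollup = t[2]
            i = W.index(lx)
            W2 = W[:i] + (ly,) + W[i + 1:]
            K2 = K[:i] + (frozenset(m for m, up in rollup.items() if up in K[i]),) + K[i + 1:]
            out.append((ind, W2, K2))
        else:                                      # "r2": indicator expansion
            out.extend((c, W, K) for c in t[2])
    return out

q1 = ("C", ("Year", "Country"), (frozenset({"2016"}), frozenset({"Italy"})))
t1 = ("r1", "C", ("Country", "Region", {"Marche": "Italy", "Toscana": "Italy"}))
q2 = rho(t1, [q1])          # Country drilled down to {Marche, Toscana}
q3 = rho(("r2", "C", ("TC", "PC")), q2)  # two queries, for TC and PC
```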


Please note that the application of ρ, as shown above, can generate multiple queries (when the transformation involves the indicator expansion rule). If the transformation t includes the merge operation, the duplicated queries are removed, as explained in step (b) of this procedure. In general, given a list T of transformations {t1, . . . , tn−1}, we can rewrite the query q1 (i.e. the set Q1 = {q1}) for a state S2 as Q2 = ρt1({q1}). In turn, this means that Qi = ρt_{i−1}(ρt_{i−2}(. . . ρt1({q1}) . . .)).

Example. Given the query q and the solution {S1, S21, S31, S41} reported in the previous example, the query to be executed over the export schema of ACME1 is Q4 = ρt3(ρt2(ρt1(Q1))). Hence:

- Q1 = {⟨C, ⟨Year, Country⟩, ⟨{2016}, {Italy}⟩, ACME1⟩} and t1 = [r1, C, Country → Region] implies:

- Q2 = {⟨C, ⟨Year, Region⟩, ⟨{2016}, {Marche, Toscana}⟩, ACME1⟩} and t2 = [r2, C, TC + PC] implies:

- Q3 = {q31, q32} = {⟨TC, ⟨Year, Region⟩, ⟨{2016}, {Marche, Toscana}⟩, ACME1⟩, ⟨PC, ⟨Year, Region⟩, ⟨{2016}, {Marche, Toscana}⟩, ACME1⟩} and t3 = [r1, PC, Year → Semester] implies:

- Q4 = {q41, q42} = {⟨TC, ⟨Year, Region⟩, ⟨{2016}, {Marche, Toscana}⟩, ACME1⟩, ⟨PC, ⟨Semester, Region⟩, ⟨{2016_S1, 2016_S2}, {Marche, Toscana}⟩, ACME1⟩}
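The rewriting steps above can be sketched programmatically. The following is a minimal illustration, not the authors' implementation: the tuple encoding of queries and transformations, and the `part_of` sample data, are our own devices.

```python
# Illustrative sketch of the rewriting operator rho (Definition 12).
# A query is a tuple (indicator, W, K, source); part_of maps a member of the
# higher level to its members at the lower level (hypothetical sample data).

def rho(t, queries, part_of=None):
    """Rewrite each query in `queries` according to transformation t."""
    out = []
    for ind, W, K, s in queries:
        if ind != t[1]:                       # rho only touches ind_alpha
            out.append((ind, W, K, s))
        elif t[0] == "r1":                    # rule 1: drill-down L_x -> L_y
            lx, ly = t[2]
            W2 = tuple(ly if lvl == lx else lvl for lvl in W)
            K2 = tuple(
                tuple(m for parent in members for m in part_of[parent])
                if W[i] == lx else members
                for i, members in enumerate(K))
            out.append((ind, W2, K2, s))
        elif t[0] == "r2":                    # rule 2: expand ind into components
            out.extend((comp, W, K, s) for comp in t[2])
    # merge step: drop duplicated queries
    seen, dedup = set(), []
    for q in out:
        if q not in seen:
            seen.add(q)
            dedup.append(q)
    return dedup

part_of = {"Italy": ("Marche", "Toscana"),
           "2016": ("2016_S1", "2016_S2")}
Q1 = [("C", ("Year", "Country"), (("2016",), ("Italy",)), "ACME1")]
Q2 = rho(("r1", "C", ("Country", "Region")), Q1, part_of)
Q3 = rho(("r2", "C", ("TC", "PC")), Q2)
Q4 = rho(("r1", "PC", ("Year", "Semester")), Q3, part_of)
```

Running the three transformations in sequence reproduces the Q1 to Q4 derivation of the example.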

d) Answer integration and calculation of the final result. The purpose of this last step is to integrate the result sets R(qi) of the queries defined above over materialized views, so as to calculate the final result for the original query. To this aim, let us introduce the calculation operator ζt(R(Q)), which calculates/aggregates the results of a set of queries Q by using the inverse of the transformation t. This means applying all the transformations in reverse order to the output of the previous step. If (a) the i-th transformation represents a drill-down between two levels, then the result is aggregated by roll-up; otherwise, (b) if it represents an indicator expansion, then the result is calculated through the formula used to expand the indicator at hand.


Definition 13. (Calculation operator ζ) Given a result R(Q) with Q = {q1 , . . . , qn } and a transformation t:


• if t = [r1, indα, Lx → Ly], then ζt(R(Q)) is the set ζt(R(q1)) ∪ . . . ∪ ζt(R(qn)), with qi = ⟨indi, Wi, Ki, s⟩, such that: – if indi ≠ indα, then ζt(R(qi)) = R(qi);


– otherwise, if indi = indα, then ζt(R(qi)) is obtained by the following calculation: (i) first, determine the set A of tuples ⟨mi, mj⟩ such that partOf(mi, mj), for all mi ∈ Ly and mj ∈ Lx that are valid for s; then (ii) group (R(qi) ⋈_Ly A) with respect to W \ {Ly} ∪ {Lx}; finally, (iii) aggregate the values through aggr(indα).


• if t = [r2, indα, f(indα1, . . . , indαn)], then ζt(R(Q)) is calculated by joining together all the R(qi) such that qi refers to an indicator indαi, and by applying the formula f. All other R(qj), if any, are left unchanged.

Please note that if the transformation t also included a merge operation, as discussed in step (b) of the procedure, a duplicated view was previously removed for optimization purposes. At this stage, in order to restore such a view, the corresponding R(qi) must be duplicated before applying any other operation. By generalizing the calculation operation, given a sequence of transformations T = {t1, . . . , tn−1}, we can calculate the result for R(Q1) as ζt1(R(Q2)). In turn, this means that R(Q1) = ζt1(ζt2(. . . ζt_{n−1}(R(Qn)) . . .)).
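The inverse calculation can be illustrated with two small helpers, one per rule. This is a sketch under our own dict-based encoding of result sets, not the authors' code; it reproduces the roll-up and formula steps of the running example.

```python
from collections import defaultdict

def rollup(rows, part_of, agg=sum):
    """Roll up rows (member_tuple, value) by mapping the drilled level's
    members to their parents via part_of, then aggregating (inverse of rule 1)."""
    groups = defaultdict(list)
    for members, value in rows:
        key = tuple(part_of.get(m, m) for m in members)
        groups[key].append(value)
    return {k: agg(v) for k, v in groups.items()}

def combine(f, *results):
    """Join component results on their coordinates and apply f (inverse of rule 2)."""
    return {k: f(*(r[k] for r in results)) for k in results[0]}

# Worked example: roll PC up from Semester to Year, add TC, roll up to Country.
A = {"2016_S1": "2016", "2016_S2": "2016"}
r_q42 = [(("2016_S1", "Marche"), 90), (("2016_S2", "Marche"), 160),
         (("2016_S1", "Toscana"), 108), (("2016_S2", "Toscana"), 145)]
r_q32 = rollup(r_q42, A)                     # {("2016","Marche"): 250, ("2016","Toscana"): 253}
r_q31 = {("2016", "Marche"): 50, ("2016", "Toscana"): 67}
r_Q2 = combine(lambda tc, pc: tc + pc, r_q31, r_q32)
A2 = {"Marche": "Italy", "Toscana": "Italy"}
r_Q1 = rollup(list(r_Q2.items()), A2)        # {("2016","Italy"): 620}
```

The intermediate values match the step-by-step tables of the example that follows.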


Example. Let us consider the data mart for ACME1 reported in Figure 1 and the queries of the previous example, and let us assume that the result R(Q4) includes the following tuples, obtained by executing both q41 and q42:

R(q41):
  Indicator  TimeDimension  LocationDimension  value
  TC         2016 (Year)    Marche (Region)    50
  TC         2016 (Year)    Toscana (Region)   67

R(q42):
  Indicator  TimeDimension       LocationDimension  value
  PC         2016_S1 (Semester)  Marche (Region)    90
  PC         2016_S2 (Semester)  Marche (Region)    160
  PC         2016_S1 (Semester)  Toscana (Region)   108
  PC         2016_S2 (Semester)  Toscana (Region)   145

The result for R(Q1) can be calculated as ζt1(ζt2(ζt3(R(Q4)))). In detail, we show the calculation step by step. Let us begin with R(Q3) = ζt3(R(Q4)). Given that t3 = [r1, PC, Year → Semester], in this case R(Q3) = {R(q31) = ζt3(R(q41)), R(q32) = ζt3(R(q42))}. Since q41 does not deal with PC, the transformation does not apply to it, while for the second we need to determine the set A which puts semesters in relation to years. The set A is obtained directly from the model and in this case is equal to {⟨2016_S1, 2016⟩, ⟨2016_S2, 2016⟩}. We can calculate R(q42) ⋈_Semester A and group over the levels Year and Region. Finally, R(Q3) is as follows:

R(q31):
  Indicator  TimeDimension  LocationDimension  value
  TC         2016 (Year)    Marche (Region)    50
  TC         2016 (Year)    Toscana (Region)   67

R(q32):
  Indicator  TimeDimension  LocationDimension  value
  PC         2016 (Year)    Marche (Region)    250
  PC         2016 (Year)    Toscana (Region)   253

Now, R(Q2) = ζt2(R(Q3)), with t2 = [r2, C, TC + PC]. According to the semantics of ζ, we need to join the two sets of R(Q3), i.e. R(q31) and R(q32), and calculate the formula TC + PC. Hence:

R(Q2):
  Indicator  TimeDimension  LocationDimension  value
  C          2016 (Year)    Marche (Region)    300
  C          2016 (Year)    Toscana (Region)   320

Finally, R(Q1) = ζt1(R(Q2)). Similarly to the previous step, being t1 = [r1, C, Country → Region], we retrieve from the model the set A that relates regions to countries: A = {⟨Marche, Italy⟩, ⟨Toscana, Italy⟩}. Then, the result for Q1 is obtained by grouping the result of the join R(Q2) ⋈_Region A over the attributes Country and Year, as follows:

R(Q1):
  Indicator  TimeDimension  LocationDimension  value
  C          2016 (Year)    Italy (Country)    620

5.1. Properties of the reformulation procedure

Here we discuss some specific properties of the reformulation procedure. The procedure for query reformulation is correct if it satisfies the following requirements:

• soundness: every returned solution (i.e., a path from the query to a materialized view) is guaranteed to be valid, where a solution ⟨Sin, . . . , Sfin⟩ is valid if, given a starting state Sin:

– Sfin is materialized;


– ∀(Si , Si+1 ), Si+1 ≺∗ Si (the solution is composed of valid steps);

– ∀Si, ∄Sj = Si with j ≠ i (it does not contain cycles).

• termination: the procedure is guaranteed to terminate in a finite number of steps. In Appendix A we report the proofs for the following two theorems. Theorem 1. (Soundness) The algorithm to find a reformulation is sound.


Theorem 2. (Termination) The query reformulation procedure is guaranteed to terminate.

Please note that, in general, the completeness of the procedure, i.e. the property of finding every possible solution, is not guaranteed, as other solutions could be retrieved by following alternative paths. For example, if an algebraic indicator is replaced by a distributive one, as shown in Subsection 4.1, alternative solutions could be found. However, as far as solutions that are expressible as sequential applications of the two rewriting rules are concerned, the procedure is complete by definition: indeed, the state space includes all and only those states that are reachable through all possible applications of the two rules.
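The validity conditions for a solution can be checked mechanically. The following is a hypothetical sketch (the state encoding and the `derives` predicate are our own illustrations, not the paper's implementation):

```python
# Sketch of the validity check on a solution path (soundness requirements).
def is_valid(path, materialized, derives):
    """path: list of states; derives(s_next, s_prev) encodes S_{i+1} <* S_i."""
    if not path or path[-1] not in materialized:
        return False                      # S_fin must be materialized
    if len(set(path)) != len(path):
        return False                      # no cycles allowed
    # every consecutive pair must be a valid step
    return all(derives(path[i + 1], path[i]) for i in range(len(path) - 1))

# toy state space: S1 -> S2 -> S3, with only S3 materialized
edges = {("S2", "S1"), ("S3", "S2")}
derives = lambda nxt, prev: (nxt, prev) in edges
assert is_valid(["S1", "S2", "S3"], {"S3"}, derives)
assert not is_valid(["S1", "S2"], {"S3"}, derives)   # S2 is not materialized
```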


5.2. Mathematical reasoning on formulas


In the previous section we discussed how the rewriting algorithm determines, at each step of the process, whether the indicator at hand has a formula and how it can be decomposed accordingly. In this work we assume that rewriting rule 2 can be applied to formulas that are defined in the global schema. However, if an indicator is not provided with a formula (i.e., it is atomic), no decomposition can be applied. In the literature, approaches for mathematical reasoning on formulas have been proposed to manipulate symbolic expressions for a variety of goals. Such approaches can greatly enhance the effectiveness of the rewriting mechanism: by means of mathematical reasoning, it is possible to derive formulas that are not defined at the global schema level. The basic mechanism upon which these systems operate is equation solving. Indeed, given the set of all defined formulas, the problem is to solve them with respect to the indicator at hand; each formula that includes the indicator as a component is reformulated accordingly. As regards the example in Figure 3, for instance, a formula for NumTrainingHours is not defined. However, two formulas include that indicator as a component, namely:

• PersonnelTrainingCosts = HourlyCost ∗ NumTrainingHours


• TeachCost = NumTrainingHours ∗ HourRate

Hence, by solving them w.r.t. NumTrainingHours, it is possible to derive the following formulas:


• NumTrainingHours = PersonnelTrainingCosts/HourlyCost

• NumTrainingHours = TeachCost/HourRate
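The derivation above amounts to solving each product-form formula for the target indicator. A minimal sketch follows; the tuple encoding of formulas is our own, and only the two product formulas of the example are covered:

```python
# Formulas are encoded as (lhs, (operand1, operand2)), meaning lhs = op1 * op2.
formulas = [
    ("PersonnelTrainingCosts", ("HourlyCost", "NumTrainingHours")),
    ("TeachCost", ("NumTrainingHours", "HourRate")),
]

def solve_for(target, formulas):
    """Return the alternative formulas derivable for `target`, as strings."""
    derived = []
    for lhs, (a, b) in formulas:
        if target == a:                  # lhs = target * b  ->  target = lhs / b
            derived.append(f"{target} = {lhs}/{b}")
        elif target == b:                # lhs = a * target  ->  target = lhs / a
            derived.append(f"{target} = {lhs}/{a}")
    return derived

print(solve_for("NumTrainingHours", formulas))
# ['NumTrainingHours = PersonnelTrainingCosts/HourlyCost',
#  'NumTrainingHours = TeachCost/HourRate']
```

A full implementation would delegate this to a computer-algebra system or to the logic-based framework of [4]; this fragment only illustrates the equation-solving principle.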


It is apparent that the mechanism is more effective when the formula graph is tightly connected: the more operands are included as components in a formula for an indicator, the more alternative formulas can be found for it. In this work we refer to the approach followed in our previous work [4], in which we discuss a logic-based framework relying on Logic Programming, where formulas are represented as facts and specific predicates for equation solving and formula manipulation are defined.

6. Analysis and evaluation

In this Section we provide an analysis of the computational aspects of the approach and an evaluation of its effectiveness.


6.1. Time and space complexity


At first, we discuss the time and space complexity of the reformulation procedure. Given the breadth-first approach in the search for a reformulation, the time complexity of the procedure is O(|S| + |A|), where |S| is the cardinality of the set of states (i.e., vertices of the graph) and |A| is the number of edges in the state space. Space complexity, on the other hand, can be formulated as O(|S|). Since state transitions are enabled by rules 1 and 2, in the following we separately discuss the main contributions of the two rules to |S| and |A|.

We consider |S| first. Through rule 1 it is possible to drill-down a pattern to a lower pattern. Let us first take into account just the dimensional lattice of a single indicator and evaluate the number of states reachable from a given view. Given an indicator indα and the set Pα of patterns of its dimensional lattice, the number of states reachable from a given view pαx is: Np = ∏_{i=1}^{n} (h(L_i^{D_i}) + 1), where h(L_i^{D_i}) is the distance between the level L_i^{D_i} of pαx and the bottom level of the same dimension, and n is the number of dimensions. Obviously, for a given indα, Np ≤ |Pα|. For instance, in the example of Figure 2c, the levels Product and City are the bottom levels of the ProductDimension and the LocationDimension respectively. Hence, h(Product) = h(City) = 0, h(Category) = h(Region) = 1 and h(Country) = 2. Np assumes its maximum value for the pattern ⟨ALL, ALL⟩: Np = (3 + 1) ∗ (2 + 1) = 12 = |Pα|.

Conversely, through rule 2 it is possible to expand an indicator into its components by referring to a formula. Each component can in turn be expanded. The number of states that originate from the indicator indα depends on the number of its alternative formulas as well as on the number of its components⁴. We introduce Φ(indα) = {φ1, . . . , φn} as the set of alternative ways in which indα can be decomposed by applying rule 2.
In particular, for each alternative formula fi of indα, φi is the set of components in fi. For instance, if the following formulas are given:

1) indα = indβ + indγ


2) indγ = indδ / indε


then the alternative formulas for indβ are indα − indγ (by manipulation of formula 1) and indα − indδ/indε (by manipulating formulas 1 and 2). Therefore, Φ(indβ) = {{indβ}, {indα, indγ}, {indα, indδ, indε}}. In the worst-case scenario, given a query ⟨indα, W, K, s⟩, every formula related to indα is exploited and every level lower than those of W is evaluated. In the above-mentioned example with two formulas, if we assume to have only 2 dimensions and 2 levels (i.e., considering also the level ALL, a total of 9 patterns for Pβ), then the total number of states in the state space is given by the 9 states of Pβ, plus all the combinations of Pα and Pγ (in case rule 2 is used once to pass from indβ to indα − indγ), plus the combinations of Pα, Pδ and Pε (in case rule 2 is used twice). Therefore, the maximum number of states is Ns = 9 + 9² + 9³ = 819. Hence, given a query q (represented as a pattern pαx), the number of states |S| that can be analysed by the algorithm in the worst-case scenario is equal to Ns. In general, |S| is a summation of |Φ(indα)| terms, where the i-th term is Np raised to the power |φi| (the cardinality of the i-th element of Φ(indα)): |S| = Σ_{i=1}^{|Φ(indα)|} Np^{|φi|}.

Let us now consider an evaluation of |A|. If we take into account a single execution of the find_rewritings subprocedure, the average number of possible ways in which the input path can be expanded is the branching factor of the state space. This number depends on: (a) the number of patterns inside the state to be expanded, and (b) the number of possible expansions (by execution of rules 1 and 2). As for the former, it varies according to the specific state; for such a reason, let us consider an average number of patterns per state, nf. As for the latter, the maximum number of applications of rule 1 is D, as it is possible to apply the rule once for each dimension.
Finally, rule 2 can be applied at most a number of times equal to the number of formulas for the indicator that have not been used yet. In total, an upper bound on the number of expansions is given by: Ns ∗ nf ∗ (D + |Φ(indα)|).

⁴ See our previous work [4, 6] for details on mathematical reasoning mechanisms for formula manipulation.

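The worst-case counts above can be checked numerically. A small sketch under our own encoding (dimension heights and decomposition sets; not the paper's code):

```python
from math import prod

def n_p(heights):
    """States reachable from a view via drill-down alone (rule 1):
    product over dimensions of (distance to the bottom level + 1)."""
    return prod(h + 1 for h in heights)

def n_states(heights, decompositions):
    """Worst-case |S|: one N_p^{|phi_i|} term per decomposition phi_i."""
    np_ = n_p(heights)
    return sum(np_ ** len(phi) for phi in decompositions)

# Example from the text: 2 dimensions, top pattern 2 levels above the bottom
# (two levels plus ALL -> height 2 per dimension), so N_p = 9; with
# Phi(ind_beta) = {{b}, {a, c}, {a, d, e}} we get 9 + 81 + 729 = 819.
heights = [2, 2]
phi = [{"b"}, {"a", "c"}, {"a", "d", "e"}]
print(n_p(heights), n_states(heights, phi))   # 9 819
```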


6.2. Effectiveness (at data source level)


Every data source typically complies with a materialization policy, according to which the subset of views to be kept materialized is chosen. A way to evaluate the effectiveness of the proposed approach is to consider, for a certain dimensional lattice and formula graph, how many more views can be computed by exploiting rules 1 and 2, starting from the set of views already materialized. Indeed, the more views can be computed, the more queries can be answered using those views. For this reason, we compare hereby the number of materialized views with the number of views that are computable through rule 1 alone, and through both rules. Without the rewriting rules proposed in this work there would be no computable view other than the materialized ones; in other terms, a query could be solved only if it corresponds to a materialized state (and hence to a set of materialized views). By enabling rules 1 and 2, however, the number of computable views can be greater than the number of materialized ones. Since rule 1 actually corresponds to the well-known drill-down/roll-up operator, we are interested in highlighting the relative advantage of the novel indicator expansion mechanism; we thus refer to the Computation Gain (CG) as an effectiveness measure that evaluates the relative gain in the number of computable views when using both rules instead of rule 1 alone. In other terms, CG measures the increase in the number of queries that can be answered by our framework with respect to a framework endowed only with drill-down/roll-up mechanisms. The measure is calculated as CG = (Comp_both − Comp_rule1)/Comp_rule1, where Comp_rule1 (Comp_both) is the ratio between the number of views that can be computed using rule 1 (both rules) and the total number of views.


Case study. For the case study at hand, we focus on the indicators used in the previous examples for ACME1, namely Costs (C), TotalCosts (TC) and PersonnelCosts (PC). Let us consider only one formula, namely C = TC + PC, and a dimensional lattice built over two dimensions, Time and Place, with 3 levels each: Year, Semester, Month and Country, Region, City. Considering the three indicators at hand, the total number of views (i.e., the total number of possible queries) is therefore 9 ∗ 3 = 27. Let us consider two materialized views, namely V1 = ⟨PC, ⟨Month, City⟩⟩ and V2 = ⟨TC, ⟨Semester, Region⟩⟩. From V1, by using rule 1, it is possible to compute all 9 views of the dimensional lattice for PC, as Month and City are the bottom levels of their corresponding dimensions and all the other views can be computed through roll-up. As for V2, 4 views can be calculated through rule 1. In total, 13 views are computable, i.e. Comp_rule1 = 13/27 = 0.48. On the other hand, by executing rule 2, views for C can be computed from the available views for PC and TC. For instance, from ⟨PC, ⟨Semester, Region⟩⟩ and ⟨TC, ⟨Semester, Region⟩⟩ the view ⟨C, ⟨Semester, Region⟩⟩ is derivable. In total, 4 further views can be computed through rule 2, for a total of 17 views, i.e. Comp_both = 17/27 = 0.63. Finally, CG = (0.63 − 0.48)/0.48 = 0.31. In the following we show a set of experiments aimed at quantitatively assessing CG in scenarios with dimensional lattices and formula graphs of increasing size.
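The case-study arithmetic can be verified with a short script. The lattice encoding below (level indices, bottom = 0) is our own illustration, not the authors' implementation:

```python
from itertools import product

# 2 dimensions, 3 levels each (bottom -> top): 9 patterns per indicator.
patterns = list(product(range(3), range(3)))

def rollup_closure(p):
    """All patterns reachable from p by rolling up (rule 1)."""
    return {q for q in patterns if all(qi >= pi for qi, pi in zip(q, p))}

# Materialized views: V1 = <PC,(Month,City)>, V2 = <TC,(Semester,Region)>.
computable = {"PC": rollup_closure((0, 0)),   # bottom levels: all 9 views
              "TC": rollup_closure((1, 1)),   # 4 views
              "C": set()}
comp_rule1 = sum(len(v) for v in computable.values())      # 13
# Rule 2: C = TC + PC is computable wherever both components are; the
# intersection of roll-up-closed sets is itself roll-up-closed.
computable["C"] = computable["TC"] & computable["PC"]
comp_both = sum(len(v) for v in computable.values())       # 17
total = 3 * len(patterns)                                  # 27
cg = (comp_both / total - comp_rule1 / total) / (comp_rule1 / total)
print(comp_rule1, comp_both, round(cg, 2))   # 13 17 0.31
```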

AC

Experiments. A set of experiments has been run considering an increasing number of dimensions, levels and formulas. Formulas have been generated automatically, fixing the operator type to sum and considering two operands each. The experimental setup is as follows: (1) a formula graph containing a number of indicators and formulas is generated; then (2) a dimensional lattice is generated with a set number of dimensions and levels, and (3) the percentage of materialized views is set. Finally, the algorithm is run starting from any possible pattern, i.e. for any possible query (we did not consider filtering on specific members), to determine the number of views that can be computed through both rules. The experiments considered 3 to 4 dimensions and 3 to 4 levels, resulting in dimensional lattices containing from 9 to 256 views. We set the percentage of materialized views from 1% to 50%. As such views are chosen randomly in the initial setup, to avoid any bias related to this choice we executed the whole process 100 times and averaged the results. The graph of formulas contained from 1 to 15 formulas (with a number of indicators from 3 to 30), resulting in a total number of possible views from 81 to 7680, depending on the size of the lattice.


  Dim  Lev  mat   Comp_rule1  Comp_both  CG
  3    3    0.01  0.08        0.1        0.25
  3    3    0.02  0.15        0.19       0.27
  3    3    0.05  0.32        0.44       0.37
  3    3    0.1   0.5         0.67       0.34
  3    3    0.15  0.62        0.79       0.27
  3    3    0.2   0.7         0.86       0.23
  3    3    0.3   0.82        0.94       0.15
  3    3    0.4   0.89        0.97       0.09
  3    3    0.5   0.93        0.99       0.06
  3    4    0.01  0.14        0.19       0.36
  3    4    0.02  0.25        0.35       0.4
  3    4    0.05  0.47        0.63       0.34
  3    4    0.1   0.66        0.82       0.24
  3    4    0.15  0.77        0.89       0.16
  3    4    0.2   0.83        0.94       0.13
  3    4    0.3   0.91        0.97       0.07
  3    4    0.4   0.94        0.99       0.05
  3    4    0.5   0.97        0.99       0.02
  4    3    0.01  0.14        0.19       0.36
  4    3    0.02  0.25        0.35       0.4
  4    3    0.05  0.47        0.64       0.36
  4    3    0.1   0.66        0.82       0.24
  4    3    0.15  0.76        0.89       0.17
  4    3    0.2   0.83        0.94       0.13
  4    3    0.3   0.91        0.97       0.07
  4    3    0.4   0.95        0.99       0.04
  4    3    0.5   0.97        0.99       0.02
  4    4    0.01  0.28        0.38       0.36
  4    4    0.02  0.45        0.59       0.31
  4    4    0.05  0.67        0.81       0.21
  4    4    0.1   0.82        0.92       0.12
  4    4    0.15  0.89        0.96       0.08
  4    4    0.2   0.93        0.98       0.05
  4    4    0.3   0.96        0.99       0.03
  4    4    0.4   0.98        1          0.02
  4    4    0.5   0.99        1          0.01

Table 3: Computation Gain with 15 formulas, dimensions from 3 to 4 and levels from 3 to 4, and percentages of materialization from 1% to 50%. Values averaged over 100 random materializations. (mat, Comp_rule1 and Comp_both are in [0,1].)


In Table 3 we report the results for the graph with 15 formulas. For lack of space, the complete set of results is available in our repository⁵. As the results show, the execution of both rules always leads to a higher Computation Gain. CG is affected by two parameters in particular. On the one hand, it increases as the lattice's size decreases. This is easily explained: the bigger the lattice, the more chances there are to apply rule 1 by performing a roll-up to a higher pattern, whereas with a smaller dimensional lattice rule 1 can be applied only to a limited extent. This behaviour can be seen in Figure 6, where we depict how CG varies according to different materialization percentages and different lattices, keeping the same number of formulas. A second influencing parameter is the number of formulas and indicators. Indeed, as the number of formulas increases, CG increases as well, up to a certain extent, as shown in Figure 7, where we kept the same dimensional lattice with formula graphs of various sizes. With higher values of materialization, instead, CG decreases, as the probability of having a view for each indicator at low levels is higher, and this enables the exploitation of rule 1 to derive upper-level views through aggregation without the need for rule 2.

6.3. Effectiveness (at federated level)

As the reformulation approach is ultimately meant to support query answering at the Virtual Enterprise level, we assess here how much the exploitation of rule 2 can increase the number of indicators that can be computed at VE level, given the indicators available at source level.

⁵ https://github.com/KDMG/Datawarehouse/blob/master/Results




[Plot omitted: CG (y-axis) vs. views materialization from 0 to 0.5 (x-axis), one curve per lattice configuration D3-L3, D3-L4, D4-L3, D4-L4.]

Figure 6: Computation Gain with 15 formulas and variable number of dimensions and levels.

[Plot omitted: CG (y-axis) vs. views materialization from 0 to 0.5 (x-axis), one curve each for 1, 5, 10 and 15 formulas.]

Figure 7: Computation Gain with 4 dimensions and 3 levels, and a variable number of formulas.


[Plot omitted: percentage of computable indicators at VE (y-axis) vs. Shared from 0 to 0.2 (x-axis), one curve per Lav ∈ {0.2, 0.25, 0.3, 0.35, 0.4}.]

Figure 8: Percentage of computable indicators at Virtual Enterprise level, in a scenario with two data sources, each with Shared ∈ [0, 1] and Locally available ∈ [0, 1]. The dotted line represents the case without using rule 2.


The experimental setting includes a global set of indicators and a graph of formulas, and two data sources, ACME1 and ACME2, each with a certain random number of available indicators. We refer to Sh (shared) as the fraction of indicators shared by the two data sources over the total number of indicators available at VE level, and to Lav (locally available) as the fraction of indicators available at each source but not in common, over the total number of indicators (for simplicity we consider the same Lav for both sources). We fixed the number of formulas at VE level to 15 and defined increasing percentages for Sh and Lav. Note that Sh and Lav cannot vary independently, as the sum of the Lav of ACME1, the Lav of ACME2 and Sh is bounded by 1, i.e. 2 ∗ Lav + Sh ≤ 1. In these experiments, dimensional patterns are considered homogeneous for ACME1 and ACME2 in order to stress the impact of rule 2 on integration capabilities. Each test for a certain value of Sh and Lav has been run 100 times, each time with a random selection of shared and available indicators. The number of computable indicators, i.e. the number of indicators that turn out to be in common after the execution of rule 2 on each source, is calculated and averaged.

From the results, shown in Figure 8, it can be seen that the number of computable indicators at VE level is positively affected by both Sh and Lav, and this happens also for small values of Sh. Even in the worst case, when Sh = 0, i.e. no indicator is shared between the sources, a significant percentage of indicators can be computed, up to 30% when Lav is equal to 40%.

7. Conclusion
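The effect of rule 2 in this setting can be sketched by closing each source's indicator set under the global formulas and intersecting the results. A minimal illustration with our own encoding and a single formula (not the experimental harness used for Figure 8):

```python
def computable_at_ve(shared, local1, local2, formulas):
    """Indicators computable at both sources, hence comparable at VE level,
    after closing each source's set under the formulas (rule 2)."""
    def closure(avail):
        avail = set(avail)
        changed = True
        while changed:
            changed = False
            for ind, components in formulas:
                if ind not in avail and components <= avail:
                    avail.add(ind)       # all components available: derive ind
                    changed = True
        return avail
    return closure(shared | local1) & closure(shared | local2)

# Global formula: C = TC + PC, encoded as (indicator, set of components).
formulas = [("C", {"TC", "PC"})]
# Sh = 0: nothing shared, yet C becomes comparable across the two sources.
out = computable_at_ve(set(), {"TC", "PC"}, {"C"}, formulas)
print(out)   # {'C'}
```

The sketch only closes formulas top-down (deriving an indicator from its components); the full framework also exploits formula manipulation, as discussed in Subsection 5.2.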


In this work, we propose a formula-based approach to query rewriting using materialized views in a scenario of federated data warehouses, with the aim of supporting the comparison of performance indicators in cross-organization activities in a Virtual Enterprise scenario. In particular, we discuss how enriching the multidimensional model with the formal description of indicator formulas extends the usual features of view-based query rewriting, making it possible to overcome heterogeneities at measure level among data mart schemas. Experimental results discussed in the previous Section show how the approach can lead to a significant increase in the capability of integrating indicators at VE level, and how extending the multidimensional model with indicator formulas improves the Computation Gain compared to using rule 1 (the drill-down operator) alone. The development of a full-fledged implementation of the reformulation process is scheduled as future work. Current and future efforts are devoted to addressing optimization issues of the overall approach. In


particular, in a controlled scenario in which data marts do not change frequently, the aggregation levels for each source are known in advance, as are the materialized states. In this case, such knowledge can be used to drive the procedure more effectively in the expansions. Optimization strategies could thereby be adopted to reduce the number of paths that are concurrently taken into account for expansion. These may include:


• pre-calculation of all formulas that can be inferred for a given indicator, following the approach of Subsection 5.2. Mathematical manipulation of formulas can be achieved through a proper logic-based representation and specific libraries, as shown in previous work [4];

• pre-calculation of valid execution plans for frequent queries, by an off-line computation for any view and source. In this scenario, the total number of solutions to store is O(Σ_{i=0}^{I} |Pi|);


• pre-calculation and storage, for any source, of the state space for a root indicator, with respect to the most abstract pattern. In such a case, indeed, the solution for any subpattern is included in the already created graph. In the same way, the solution for any indicator that belongs to the formula of the root indicator is included as well.


On the other hand, in more general uncontrolled scenarios it is possible to consider strategies that can nevertheless reduce the number of steps of the procedure, for instance (a) caching or (b) indexing. As for the former, the cubeElements that are dynamically computed for a query could be materialized; in such a way, if the same query is asked again, its value is immediately available. Obviously, the specific caching policy can be defined only after a cost/benefit analysis. Alternatively, a different strategy could be implemented through indexing structures capable of realizing an index of the materialized cubeElements. For instance, in [32] an indexing structure called cubettes is proposed to represent which sets of cubeElements are materialized. Following this approach, it could be possible to devise a batch procedure that scans the data warehouse and periodically generates such cubettes, which are then checked at query time in order to find out whether the query can be solved in the first place and, if so, to retrieve a path in the state space.

A further topic we believe is worth investigating regards the analysis of the most effective trade-off between the storage costs of maintaining materialized views and the computational costs of the run-time evaluation of a solution. Moreover, so far the query answering algorithm applies either rule 1 or rule 2 at random at each step of the find_rewriting procedure. However, weighting the two rewriting rules differently could help to reflect the fact that such rules involve operations with different hidden costs. As an example, each roll-up requires aggregating a set of distinct values, whose cardinality has an impact on the final cost (e.g., the cost of aggregating all 20 regions in Italy at Country level is certainly lower than that of aggregating several hundred cities at level Region). Similarly, breaking down an indicator into its components through rule 2 involves the creation of states which include multiple patterns. For instance, in Section 5 (step c) at a certain point the procedure generates this state: {⟨TC, ⟨Year, Region⟩, ⟨{2016}, {Marche, Toscana}⟩, ACME1⟩, ⟨PC, ⟨Year, Region⟩, ⟨{2016}, {Marche, Toscana}⟩, ACME1⟩}. States like this inherently involve additional computational cost, because checking their computability involves performing two queries (one per pattern) instead of a single query. This information could be helpful to define heuristics aimed at more effectively driving the choice between rules and, even within a single rule, between alternative paths.

References

[1] A. P. Sheth, J. A. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Comput. Surv. 22 (1990) 183–236.
[2] S. Berger, M. Schrefl, From federated databases to a federated data warehouse system, in: Proceedings of the 41st Annual Hawaii International Conference on System Sciences, 2008, pp. 394–394.
[3] S. Rizzi, Business Intelligence: First European Summer School, eBISS 2011, Paris, France, July 3-8, 2011, Tutorial Lectures, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 186–205.


[4] C. Diamantini, D. Potena, E. Storti, Sempi: A semantic framework for the collaborative construction and maintenance of a shared dictionary of performance indicators, Future Generation Computer Systems 54 (2016) 352–365.
[5] C. Diamantini, D. Potena, E. Storti, Data mart reconciliation in virtual innovation factories, in: Advanced Information Systems Engineering Workshops: CAiSE 2014 International Workshops, Thessaloniki, Greece, June 16-20, 2014, Proceedings, Springer International Publishing, 2014, pp. 274–285.
[6] C. Diamantini, D. Potena, E. Storti, Extended drill-down operator: Digging into the structure of performance indicators, Concurrency and Computation: Practice and Experience 28 (2016) 3948–3968.
[7] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, The VLDB Journal 10 (2001) 334–350.
[8] C. Lee, C.-J. Chen, H. Lu, An aspect of query optimization in multidatabase systems, SIGMOD Rec. 24 (1995) 28–33.
[9] M. Lenzerini, Data integration: A theoretical perspective, in: Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '02, ACM, New York, NY, USA, 2002, pp. 233–246.
[10] A. Y. Halevy, Answering queries using views: A survey, The VLDB Journal 10 (2001) 270–294.
[11] S. Cohen, W. Nutt, Y. Sagiv, Rewriting queries with arbitrary aggregation functions using views, ACM Trans. Database Syst. 31 (2006) 672–715.
[12] J. Goldstein, P. Larson, Optimizing queries using materialized views: A practical, scalable solution, SIGMOD Rec. 30 (2001) 331–342.
[13] F. S. Tseng, C.-W. Chen, Integrating heterogeneous data warehouses using XML technologies, J. Inf. Sci. 31 (2005) 209–229.
[14] A. Gupta, V. Harinarayan, D. Quass, Aggregate-query processing in data warehousing environments, in: Proceedings of the 21st International Conference on Very Large Data Bases, VLDB '95, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1995, pp. 358–369.
[15] S. Cohen, W. Nutt, A. Serebrenik, Rewriting aggregate queries using views, in: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '99, ACM, New York, NY, USA, 1999, pp. 155–166.
[16] C.-S. Park, M. H. Kim, Y.-J. Lee, Rewriting OLAP queries using materialized views and dimension hierarchies in data warehouses, in: Proceedings of the 17th International Conference on Data Engineering, IEEE, 2001, pp. 515–523.
[17] L. V. S. Lakshmanan, J. Pei, Y. Zhao, Efficacious data cube exploration by semantic summarization and compression, in: Proceedings of the 29th International Conference on Very Large Data Bases, VLDB '03, VLDB Endowment, 2003, pp. 1125–1128.
[18] B. Neumayr, S. Anderlik, M. Schrefl, Towards ontology-based OLAP: Datalog-based reasoning over multidimensional ontologies, in: Proceedings of the Fifteenth International Workshop on Data Warehousing and OLAP, DOLAP '12, ACM, New York, NY, USA, 2012, pp. 41–48.
[19] N. Prat, I. Megdiche, J. Akoka, Multidimensional models meet the semantic web: Defining and reasoning on OWL-DL ontologies for OLAP, in: Proceedings of the Fifteenth International Workshop on Data Warehousing and OLAP, DOLAP '12, ACM, New York, NY, USA, 2012, pp. 17–24.
[20] G. Xie, Y. Yang, S. Liu, Z. Qiu, Y. Pan, X. Zhou, EIAW: Towards a business-friendly data warehouse using semantic web technologies, in: The Semantic Web, volume 4825 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2007, pp. 857–870.
[21] M. Kehlenbeck, M. H. Breitner, Ontology-based exchange and immediate application of business calculation definitions for online analytical processing, in: Proc. of the 11th International Conference on Data Warehousing and Knowledge Discovery, DaWaK '09, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 298–311.
[22] M. Golfarelli, F. Mandreoli, W. Penzo, S. Rizzi, E. Turricchia, OLAP query reformulation in peer-to-peer data warehousing, Inf. Syst. 37 (2012) 393–411.
[23] J. Horkoff, D. Barone, L. Jiang, E. Yu, D. Amyot, A. Borgida, J. Mylopoulos, Strategic business modeling: representation and reasoning, Software & Systems Modeling (2012).
[24] V. Popova, A. Sharpanskykh, Modeling organizational performance indicators, Information Systems 35 (2010) 505–527.
[25] A. Del-Río-Ortega, M. Resinas, A. Ruiz-Cortés, Defining process performance indicators: An ontological approach, in: Proceedings of the 2010 International Conference on On the Move to Meaningful Internet Systems - Volume Part I, OTM'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 555–572.
[26] C. Diamantini, A. Freddi, S. Longhi, D. Potena, E. Storti, A goal-oriented, ontology-based methodology to support the design of AAL environments, Expert Systems with Applications 64 (2016) 117–131.
[27] M. Golfarelli, S. Rizzi, Data Warehouse Design: Modern Principles and Methodologies, McGraw-Hill, Inc., New York, NY, USA, 1st edition, 2009.
[28] V. Harinarayan, A. Rajaraman, J. Ullman, Implementing data cubes efficiently, in: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996, pp. 205–216.
[29] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, H. Pirahesh, Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, Data Min. Knowl. Discov. 1 (1997) 29–53.
[30] M. O. Akinde, M. H. Böhlen, T. Johnson, L. V. Lakshmanan, D. Srivastava, Efficient OLAP query processing in distributed data warehouses, Information Systems 28 (2003) 111–135.
[31] J. Bernardino, P. Furtado, H. Madeira, DWS-AQA: A cost effective approach for very large data warehouses, in: International Database Engineering & Applications Symposium, IDEAS'02, July 17-19, 2002, Edmonton, Canada, Proceedings, pp. 233–242.
[32] C. E. Dyreson, Information retrieval from an incomplete data cube, in: T. M. Vijayaraman, A. P. Buchmann, C. Mohan, N. L. Sarda (Eds.), VLDB, Morgan Kaufmann, 1996, pp. 532–543.



Appendix A. Theorem proofs

Lemma 2. (Finiteness of the state space) The state space explored by the query reformulation procedure is finite.


Proof. Since the set of dimension hierarchies in dss(ind) is finite and the number of levels of each dimension is finite, the dimensional lattice Pind of an indicator is finite. The maximum cardinality of a state is given by the number of all possible views, |I| × |Pind|, i.e. all combinations of all indicators with all possible patterns. By definition, a state is a set and hence cannot include duplicated views. Therefore, the cardinality of the state space is also finite: an upper bound is given by the number of possible combinations of views, i.e. 2^(|I|×|Pind|).

Theorem 1. (Soundness) The algorithm to find a reformulation is sound.

Proof. Let us assume that the algorithm is not sound, i.e. that it returns at least one non-valid solution. According to the definition of valid solution, given a starting state Sin, a solution ⟨Sin, . . . , Sfin⟩ is not valid if one of the following conditions holds:

1) the ending state Sfin is not materialized;

2) there exists a transition (Si−1, Si) such that Si ≰S Si−1;

3) the returned solution contains cycles.

Condition 1. The condition is contradicted because the procedure can terminate successfully only when a materialized state is found (lines 6–7). Therefore, the last state of the path that is yielded as solution is always materialized.

Condition 2. At the beginning, the path contains only Sin (line 2). Then, a new state can be added to a path either by application of rule 1 (line 17) or of rule 2 (line 21), according to the validity conditions for such rules defined in Subsection 4.1. As for the former, S′ is created by replacing a single pattern pjk with a pattern phk such that phk ≺∗ pjk. By the definition of ≺∗ given in Subsection 4.2, S′ ≺∗ S. As for rule 2, S′ is created by replacing a single pattern pjk with a set of patterns such that pjk = f(pjk1, . . . , pjkn), given a formula fk = f(indk1, . . . , indkn) for indk. By the same definition, S′ ≺∗ S. Therefore, given that there is no other way to add a state to a path, every transition (Si−1, Si) is such that Si ≺∗ Si−1, and the condition is contradicted.

Condition 3. At the beginning, the path includes only Sin and thus contains no cycles. Suppose that a state Sfin, not previously visited, is added to a path. When such a path is popped from C (line 4), its lastly added state is considered and the find_rewritings procedure is called to expand the path; after it returns, Sfin is added to the set of states already visited. Suppose now that, after some steps, a new state S′ = Sfin is added to a path (line 17 or 21) that already contains it, thus creating a non-simple path X′. When X′ is popped from C to be analysed, the procedure checks whether its lastly inserted state S′ has already been visited (line 5). Given that S′ = Sfin and that Sfin ∈ V, the path is discarded. Therefore, a path with two identical states (i.e. with a cycle) can never be evaluated and, as a consequence, can never be yielded in output as a solution. Hence, the condition is contradicted.
We reach a contradiction in all cases. Therefore, the procedure is sound.

Theorem 2. (Termination) The query reformulation procedure is guaranteed to terminate.

Proof. The termination criteria are as follows: if a materialized state is reached, the procedure ends successfully (line 7), whereas it ends with a failure if the queue C is empty (line 12). Given that each state of the state space can be analysed at most once (line 5), the maximum number of iterations of the whole procedure is given by the number of states in the state space, which is finite by Lemma 2.
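The control flow these proofs rely on — the queue C of paths, the visited set V, the materialization check (lines 6–7) and the expansion by rules 1 and 2 (lines 17 and 21) — can be illustrated by the following sketch. It is not the paper's implementation: the function names, the state representation and the expand callback are assumptions made for illustration only.

```python
from collections import deque

def reformulate(initial_state, is_materialized, expand):
    """Breadth-first search over the (finite) state space.

    initial_state: a frozenset of views (indicator/pattern pairs).
    is_materialized(state): True if every view of the state is
        available in some source (successful termination).
    expand(state): successor states S' with S' <* S, each obtained
        by applying rule 1 or rule 2 to a single view of the state.
    """
    C = deque([[initial_state]])    # queue of candidate paths
    V = set()                       # states already visited
    while C:                        # failure when the queue empties
        path = C.popleft()
        state = path[-1]            # lastly added state of the path
        if state in V:              # cycle guard: discard the path
            continue
        V.add(state)
        if is_materialized(state):  # materialized tail: solution found
            return path
        for succ in expand(state):  # expansion by rules 1 and 2 only
            C.append(path + [succ])
    return None                     # no valid reformulation exists
```

On a toy state space the procedure returns the first materialized path in breadth-first order; a path whose last state was already visited is silently discarded, which mirrors the cycle-freeness argument of Condition 3.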
