Information Systems 37 (2012) 393–411
Contents lists available at ScienceDirect
Information Systems journal homepage: www.elsevier.com/locate/infosys
OLAP query reformulation in peer-to-peer data warehousing M. Golfarelli a, F. Mandreoli b,c, W. Penzo a,c, S. Rizzi a,, E. Turricchia a a
DEIS - University of Bologna, V.le Risorgimento 2, 40136 Bologna, Italy DII - University of Modena and Reggio Emilia, Via Vignolese 905/b, Modena, Italy c IEIIT - BO/CNR, V.le Risorgimento 2, 40136 Bologna, Italy b
a r t i c l e i n f o
abstract
Available online 14 June 2011
Inter-business collaborative contexts prefigure a distributed scenario where companies organize and coordinate themselves to develop common and shared opportunities, but traditional business intelligence systems do not provide support to this end. To fill this gap, in this paper we envision a peer-to-peer data warehousing architecture based on a network of heterogeneous peers, each exposing query answering functionalities aimed at sharing business information. To enhance the decision making process, an OLAP query expressed on a peer needs to be properly reformulated on the local multidimensional schemata of the other peers. To this end, we present a language for the definition of mappings between the multidimensional schemata of peers and we introduce a query reformulation framework that relies on the translation of mappings, queries, and multidimensional schemata onto the relational level. Then, we formalize a query reformulation algorithm and prove two properties: correctness and closure, that are essential in a peer-to-peer setting. Finally, we discuss the main implementation issues related to the reformulation setting proposed, with specific reference to the case in which the local multidimensional engines hosted by peers use the standard MDX language. & 2011 Elsevier Ltd. All rights reserved.
Keywords: Business intelligence OLAP Distributed databases Query reformulation
1. Introduction Business intelligence (BI) transformed the role of computer science in companies from a technology for passively storing data into a discipline for timely detecting key business factors and effectively solving strategic decisional problems. However, in the current changeable and unpredictable market scenarios, the needs of decision makers are rapidly evolving as well. To meet the new, more sophisticated user needs, a new generation of BI systems (often labeled as BI 2.0) has been emerging during the last few years.
Corresponding author.
E-mail addresses:
[email protected] (M. Golfarelli),
[email protected] (F. Mandreoli),
[email protected] (W. Penzo),
[email protected] (S. Rizzi),
[email protected] (E. Turricchia). 0306-4379/$ - see front matter & 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.is.2011.06.003
One of the key features of BI 2.0 is the ability to become collaborative and extend the decision-making process beyond the boundaries of a single company [40]. Indeed, collaborative BI has been predicted to be the main BI trend for 2011 [30]. Users need to access information anywhere it can be found, by locating it through a semantic process and performing integration on the fly. This is particularly relevant in inter-business collaborative contexts where companies organize and coordinate themselves to share opportunities, respecting their own autonomy and heterogeneity but pursuing a common goal. In such a complex and distributed business scenario, traditional BI systems—that were born to support stand-alone decision making—are no longer sufficient to maximize the effectiveness of monitoring and decision making processes. Accessing local information is no more enough, users need to transparently and uniformly access information scattered across several heterogeneous BI platforms [25].
394
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
peer in this neighborhood in turn sends q to its neighborhood, and so on. In this way, q undergoes a chain of reformulations along the peers it reaches, and results are collected from any peer that is connected to p through a path of semantic mappings. The approach outlined above is reflected by the internal architecture of each peer, sketched in the right side of Fig. 1, whose components are:
Fig. 1. Envisioned architecture for a BIN.
To fill this gap, we envision a peer-to-peer data warehousing architecture called Business Intelligence Network (BIN) and sketched in Fig. 1. A BIN is an architecture for sharing BI functionalities across a dynamic and collaborative network of heterogeneous and autonomous peers. Each peer is equipped with an independent data warehouse system, that relies on a local multidimensional schema to represent the peer’s view of the business and exposes OLAP query answering functionalities (based for instance on the MDX language, a de-facto standard for querying multidimensional databases [36]) aimed at sharing business information, in order to enhance the decision making process and create new knowledge. The main benefits the BIN approach aims at delivering to the corporate world are the possibility of building new interorganizational relationships and coordination approaches, and the ability to efficiently manage inter-company processes and safely sharing management information besides operational information [17]. The core idea of a BIN is that of enabling users to transparently access business information distributed over the network. A typical interaction sequence is the following: 1. A user formulates an OLAP query q by accessing the local multidimensional schema exposed by her peer, p. 2. Query q is processed locally on the data warehouse of p. 3. At the same time q is forwarded to the network. 4. Each involved peer locally processes the query on its data warehouse and returns its results to p. 5. The results are integrated and returned to the user. The local multidimensional schemata of peers are typically heterogeneous; so, before a query issued on a peer can be forwarded to the network, it must be first reformulated according to the multidimensional schemata of the destination peers. In line with the approach adopted in Peer Data Management Systems (PDMSs) [24], query reformulation in a BIN is based on semantic mappings that mediate between the different multidimensional schemata exposed by two peers, i.e., they describe how the concepts in the multidimensional schema of one peer map onto those of another peer. Direct mappings cannot be realistically defined for all the possible couples of peers. So, to enhance information sharing, a query q issued on p is forwarded to the network by first sending it to the neighborhood of p; then, each
1. User interface. A web-based component that manages bidirectional interaction with users, who use it to visually formulate OLAP queries on the local multidimensional schema and explore query results. 2. Query handler. This component receives an OLAP query from either the user interface or a neighboring peer on the network, sends that query to the OLAP adapter to have it locally answered, reformulates it onto the neighborhood (using the available semantic mappings), and transmits it to the peers in that neighborhood. 3. Data handler. When the peer is processing a query that was locally formulated, the data handler collects query results from the OLAP adapter and from the peers, integrates them, and returns them to the user interface. When the peer is processing a query that was formulated on some other peer p, the data handler just collects local query results from the OLAP adapter and returns them to p. 4. OLAP adapter. This component adapts queries received from the query handler to the querying interface exposed by the local multidimensional engine. 5. Multidimensional engine. It manages the local data warehouse according to the multidimensional schema representing the peer’s view of the business, and provides MDX-like query answering functionalities. Interactions between peers are based on a messagepassing protocol. Query answering in a BIN architecture poses several research challenges, ranging from languages and models for semantic mediation to query reformulation issues and proper techniques and data structures for the query processing phase. Much work has been done on these issues in the context of PDMSs (e.g., [32–34]) and relational databases [38], however those results are not directly applicable in the OLAP scenario presented by the BIN. In this paper we introduce the foundations for enabling query reformulation in a BIN. In particular, we focus on the definition of a language for semantic mapping specification to cope with the semantic heterogeneity of peers’ schemata and on the specification of rules for reformulating OLAP queries. Reformulating an OLAP query is a challenging task due to the presence of aggregation and to the possibility of having information represented at different granularities and under different perspectives in each peer. The original contributions we give are:
We present a language for the definition of semantic mappings between the schemata of peers, using predicates that are specifically tailored for the multidimensional model (Section 4). To overcome possible differences in data formats, mappings can be associated
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
with transcoding functions. Mappings are classified according to the accuracy they allow in query reformulation and a set of requirements for the query reformulation algorithm is introduced when dealing with mapping of different accuracies. We introduce a framework for OLAP query reformulation that relies on the translation of mappings and queries towards the underlying relational schemata (Section 5). For simplicity, we will use standard star schemata to this end. We propose a query reformulation algorithm and show that it is correct for compatible reformulations, that it satisfies all the requirements, and that our language for expressing OLAP queries is closed under reformulation (Section 6). Remarkably, this means that our reformulation algorithm can be safely used to implement chains of reformulations as required by the BIN setting. We briefly discuss the main implementation issues related to the reformulation setting proposed, with specific reference to the case in which the multidimensional engines hosted by peers use the standard MDX language (Section 7).
The paper is completed by Section 2 that reviews the related literature, Section 3 that formalizes multidimensional schemata and OLAP queries, and Appendix A that gives the proofs of the theorems. 2. Related works In this section we discuss the related works, distinguishing those specifically placed in the data warehouse context from those dealing with operational databases. 2.1. Data warehouse integration and federation Only a few works are specifically focused on strategies for data warehouse integration and federation. Indeed, in this context, problems related to data heterogeneity are usually solved by ETL processes that read data from several data sources and load them in a single repository. While this centralized architecture may fit the needs of standalone companies, it is hardly feasible in the context of a BIN, where the dynamic nature of the business network, together with the independence and autonomy of peers, call for more sophisticated solutions. See [1] for a discussion of the benefits of a peer-to-peer architecture for data warehousing and [5] for a description of the issues arising in collaborative BI systems. In the context of a federated data warehouse architecture, [42] describes two methods for integrating dimensions belonging to different data marts and provide a set of rules to check their compatibility. The problem of how to define mappings between concepts is not considered. The work proposed in [9] presents a complete algorithm for matching multidimensional structures, with a strong emphasis on the process of calculating similarity between complex concepts. However, the data-related aspects—that could enrich the matching definitions—are not considered, and no model is provided to formalize the mapping predicates.
395
Another work centered on interoperability issues among heterogeneous data warehouses is the one by [28], that emphasizes the importance of a semantic layer to enable communication among different entities. This approach supports the exchange of business calculation definitions and allows for their automatic linking to specific data warehouses through semantic reasoning. The approach is flexible because it could be implemented by adopting different formalisms for each layer; however, the work proposes specific techniques to deal with measures only, so it cannot be used to completely solve a typical aggregate query. As concerns performance aspects in distributed data warehouse systems, in [27] the authors propose a peer-topeer architecture for supporting OLAP queries focusing on the definition of a caching model to make the query rewriting process more efficient. They also define adaptive techniques that dynamically reconfigure the network structure in order to minimize the query cost. In [26] a hybrid approach between the centralized and the full-federation configuration is described, suggesting the use of a federation server to store aggregated data from different remote sources. Both works cannot be compared to ours because they focus on improving OLAP query performance and do not address the issues related to query reformulation models. In some approaches, the key idea to reduce the response time of OLAP queries in distributed contexts is to provide early approximate results, trying to maximize their accuracy, rather than late precise answers. In this direction, a hybrid approach taking advantage of sampling techniques and approximate query processing to optimize performances for aggregate queries has been proposed in [44]. Though the authors rely on a peer-to-peer network, they neglect the issues related to the semantic heterogeneity that characterizes federated data warehouse systems. Other works focus on how to use aggregate functions to retrieve approximate results. Note how, in OLAP distributed contexts, correctly calculating aggregates from pre-aggregates is particularly relevant due to the heterogeneity in the data granularities of the involved data warehouses. In particular, the impact of aggregation operators on query performance is investigated in [7], where a query cost model for distributive operators such as SUM and COUNT is presented. Finally, the discussion is extended to nondistributive aggregate functions in [4], where an algorithm to minimize the inter-peer information exchange when reconciling non-distributive aggregates are devised. Here, however, a simplified federated architecture is considered where data are distributed among different nodes but the data warehouse schema is shared. The work that is most closely related to ours is [43]. Though most goals are shared with our approach, some important differences exist:
Network architecture. In [43] the underlying peer-to-peer architecture is based on a fully connected network. In a BIN, a peer does not need to be connected to all the other peers, which has great advantages in terms of better scalability in maintaining mappings, preservation of peers’ autonomy in choosing their neighbors, and easier integration with existing networks.
396
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
Global vs. local knowledge. In [43], the peers whose
dimensions have the same semantics are logically clustered. This requires a global knowledge of the network, which is not necessary in a BIN—where each peer only needs local knowledge about his neighboring peers. Reformulation and expressiveness. In [43] reformulation is a one-step process, i.e., information loss due to the multiple reformulations that typically take place in peer-to-peer architectures is neglected. Besides, the problem of reconciling different data formats between heterogeneous peers is not considered. Both issues are taken into account in our approach. While in [43] queries involve one measure only, our approach enables peers to deal with expressions of measures.
2.2. Mediation-based information sharing Generally, information sharing in distributed contexts has to deal with the problem of mutually inconsistent data sources, since sources are usually autonomous. In the field of inconsistent databases, when data is integrated from multiple sources, each single source may satisfy some constraint, but the merged data may not. Thus, queries need to be rewritten to retrieve consistent answers [14]. In data integration systems, data from different sources are combined to give users a unified view [31]; in this way, users are freed from having to locate individual sources, learn their specific interaction details, and manually combine the data [23]. The unified view that reconciles the sources is represented by a global schema. Even in this context query processing requires a reformulation step: a query over the global schema has to be reformulated in terms of a set of queries over the sources. Another research area facing similar issues is data exchange, although some differences exist. In data exchange, data structured under one source schema must be restructured and translated into an instance of a different target schema [12]. This means that the instance of the target schema is materialized—while in the case of a virtual view expressed by a global schema in a data integration system, the data remain at the sources. Moreover, in data exchange scenarios, the target schema is often independently created and comes with its own constraints that have to be satisfied. Decentralized sharing of data between autonomous sources has also been deeply studied in the context of Peer Data Management Systems (PDMSs) that were born as an evolution of mediator systems in the data integration field [24] and generalize peer data exchange settings [15]. In a PDMS there is no distinction between source and target, since a peer may simultaneously act as a distributor of data (thus, a source peer) and a recipient of data (thus, a target peer). However, as in the case of data integration systems, in a PDMS data remain at the sources and queries processing entails query reformulation over the source schemata. In all these contexts, modeling the relationships—called mappings—between source and target schemata is a crucial aspect. Research in the data integration area has provided rich and well-understood schema mediation languages [31]
to this end. The two commonly used formalisms are the global-as-view (GAV) approach, in which the mediated schema is defined as a set of views over the data sources, and the local-as-view (LAV) approach, in which the contents of data sources are described as views over the mediated schema. Depending on the kind of formalism adopted, GAV or LAV, queries posed to the system are answered differently, namely by means of query unfolding or query rewriting techniques [21], respectively. In a data exchange setting, assertions between a source query and a target query are used to specify what source data should appear in the target and how. These assertions can be represented neither in the LAV nor in the GAV formalisms, but rather they can be thought of as GLAV (global-and-local-as-view) [12]. A structural characterization of schema mapping languages is provided in [41], together with a list of the basic tasks that all languages ought to support. A lot of research work has been devoted to schema mapping specifications, as deeply investigated in [37,39,6]. Other studies in the field have focused on integrating the computation of core solutions in the mapping generation process, aimed at determining redundancy-free mappings in data exchange settings [35,13]. We emphasize that our work is neither concerned with schema mapping generation nor with the computation of core schema mappings, and that we assume that mapping specifications between source and target schemata are correct and minimal. Query answering using views relies on the notion of certain answers [2]. In data integration, the source instances are used to compute the certain answers of queries over the global schema, whereas in data exchange certain answers are computed by using the materialized target instances [12]. The semantics of query answering in a PDMS is defined in [22], where an efficient reformulation algorithm that returns only (and under specific conditions, all) certain answers is presented. The approach deals with unions of conjunctive queries, but it does not face the problem of rewriting aggregate queries. As to this research issue, the work in [8] introduces range semantics as the semantics of aggregate queries in the context of inconsistent databases. Range semantics is a form of certain answers semantics for aggregate queries and it guarantees that, in every possible world, the value of the aggregate query will be in a given interval. Ref. [3] adopts range semantics and studies the complexity of algorithms for answering aggregate queries in peer data exchange settings. Other algorithms have been proposed in the literature, but none of them can be directly applied in a BIN since they all assume either manipulation of data (repairs of the inconsistent databases, e.g., [14]) or availability of materialized source data at the target (e.g., [3] in data exchange). In [11] the problem of rewriting aggregate queries using views in relational databases is deeply investigated. In order to rewrite a query q involving an aggregation function a, the work introduces two types of queries that use count-views and possibly a-views for which the unfolding property holds, thus reducing the rewriting problem to one of query equivalence. Unfortunately, the results obtained in [11] cannot be applied in an OLAP context since they require the computation of the cardinalities of the measure values, a functionality that is
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
generally supported in a relational setting but is not in multidimensional query engines. While in the literature the problem of differences in data formats has only marginally been considered, declaring useful mappings in the OLAP context necessarily requires also the level of instances to be taken into account. A seminal paper regarding this topic is [10], where constraint queries are translated across heterogeneous information sources taking into account differences in operators and data formats. However, the proposal in [10] was conceived for translating conjunctive queries, and it does not consider the problem of translating aggregation components which characterize OLAP queries. The work we present in this paper is based on [16], where a PDMS-based data warehousing architecture is introduced, a basic language for the definition of inter-peer schema mappings is proposed, and a query reformulation framework for answering aggregate queries is outlined. Here we extend and improve our previous work under several substantial aspects. A complete formalization of the elements necessary to support data format heterogeneity between peers is given; to this end, we define transcodings as functions for interdomain mapping, and we expand our query language to admit the presence of transcoding components in the groupby set of a query, thus making reformulation closed under the query language. We also extend the query language to enable the specification of expressions involving one or more measures, where each measure can be aggregated using a set of aggregation operators, thus making query rewriting more flexible. Then, the expressiveness of the mapping language is augmented by annotating measure mappings with selection predicates, so as to restrict the validity of rewritings. Finally, the query reformulation algorithm is formalized and a proof of its correctness is given. 3. Preliminaries In this section we introduce a basic formal setting to manipulate and query multidimensional data, and we introduce a running example. In particular, our formalization of md-schemata is an extension of the one used in [18]. Without loss of generality, we will assume that all measures are defined on the same domain of real numbers, R. Definition 1 (Md-Schema). An md-schema is a triple D ¼ /A,H,MS where:
A ¼ fa1 , . . . ,ap g is a finite set of attributes, each defined on a categorical domain Domðai Þ;
H ¼ fh1 , . . . ,hn g is a finite set of hierarchies, each charac-
terized by (1) a subset Attrðhi Þ DA of attributes (such that the Attrðhi Þ’s for i ¼ 1, . . . ,n define a partition of A) and (2) a roll-up tree-structured partial order khi of Attrðhi Þ; M ¼ fm1 , . . . ,ml g is a finite set of measures, each aggregable through a set of one or more aggregation operators, Aggðmi Þ.
For each hierarchy hi, the root attribute of the order is called dimension, denoted by DIMi, and determines the finest aggregation level for the hierarchy. A pair m ¼ /mi , aj S such that mi 2 M and aj 2 Aggðmi Þ is called a metric of D.
397
Example 1. In the running example we adopt in this paper, a set of local health-care departments participate in a collaborative network to integrate their data about admissions so as to enable more effective analysis of epidemics and health-care costs by the Ministry. For simplicity we will focus on two peers: the first, located in Rome, hosting data on hospitalizations at the most detailed level; the second, located in Florence, hosting data on admissions grouped by patient gender, residence city, and birth year. The underlying md-schemata for Rome and Florence are called HOSPITALIZATION and ADMISSIONS, respectively; their roll-up orders are shown in Fig. 2. Assuming that each hierarchy is named after its finest-level attribute, but capitalized, relationships DIM Patient ¼ patient and citykPatient region hold. The HOSPITALIZATION md-schema includes measures cost and durationOfStay; ADMISSIONS includes measures totStayCost, totExamCost, totLength, maxLength, and numAdmissions. The aggregation operators exposed by the two md-schemata are as follows: AggðcostÞ ¼ fsum,avgg AggðdurationOfStayÞ ¼ fsum,avg,min,maxg AggðtotStayCostÞ ¼ AggðtotExamCostÞ ¼ AggðtotLengthÞ ¼ fsum,avgg AggðmaxLengthÞ ¼ fmaxg AggðnumAdmissionsÞ ¼ fsumg Note that the HOSPITALIZATION md-schema stores data at the maximum detail, so all its measures can in principle be aggregated using any operator. On the other hand, ADMISSIONS stores pre-aggregated data, so the (additive) measures totStayCost, totExamCost, and totLength can be also averaged thanks to the presence of numAdmissions, that acts as a support measure for the avg operator. A projection is a subset of A, and defines a possible way to aggregate data. Definition 2 (Projection). Given D ¼ /A,H,MS, a projection of D is a subset of attributes P DA. The domain of P is DomðPÞ ¼ ai 2P Domðai Þ; each value of Dom(P) is called a coordinate of P. Definition 3 (Transcoding). Given D, let Dom be a generic domain of values. A transcoding of D is a function f : DomðPÞ-Dom, where P is a projection of D, that maps each coordinate of P onto a value of Dom. Note that Dom can be a compound domain (e.g., Dom ¼ DomðweekÞ DomðregionÞ); in this case, the transcoding is made of components, each mapping onto a simple domain. Example 2. Examples of projections of HOSPITALIZATION and ADMISSIONS are P ¼ fweek,regiong and P0 ¼ fdate, patientCityg, respectively. A transcoding of ADMISSIONS is f : DomðP0 Þ-DomðPÞ whose components are week ¼ weekOfðdateÞ, region ¼ regionOfðpatientCityÞ, where weekOfðÞ is a common SQL function and regionOfðÞ is a user-defined function that returns the region a city belongs to by accessing a CITIES relational table stored at the Florence peer.
398
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
Fig. 2. Roll-up orders for the hierarchies in the HOSPITALIZATION and ADMISSIONS md-schemata (LHD stands for local health department).
We will consider a simple form of OLAP queries characterized by an aggregation and a selection (GPSJ— generalized projection/selection/join query, [20]), where transcodings can be applied to attributes and measures can appear within expressions. To avoid getting burdened with the details of a specific multidimensional query language, we will express queries using an abstract syntax. Definition 4 (BIN query). A BIN query is a 5-tuple q ¼ /D,G,p,expr,TS where: 1. D ¼ /A,H,MS is the md-schema q is formulated on; 2. G is the query group-by set, and it is a set where each element is either an attribute of D or a component of a transcoding of D; 3. p is an (optional) selection predicate; it is a conjunction of Boolean predicates, each involving either an attribute of D or a component of a transcoding of D; 4. expr is the expression computed by q; it is a numerical expression involving measures in M; 5. T is a list of metrics of D, one for each measure used in expr, expressing the operators that will be used for aggregation. Consistently with the behavior of the MDX language, the semantics we assume for BIN queries is that aggregation is executed first. This means that q returns an expression of aggregates (rather than an aggregation of expressions). Example 3. The query q1 ¼ /HOSPITALIZATION,fregion,yearg, ðgender ¼ ‘Female’Þ, cost,//cost,sumSSS computes, at the Rome peer, the total hospitalization cost of female patients for each region and year. The query q2 ¼ /ADMISSIONS, fyear,regionOfðpatientCityÞg,F, totLength,//totLength,avgSSS
computes, at the Florence peer, the yearly average admission length for each patient region. 4. Mapping language In this section we describe the language we devised for the definition of semantic mappings between the mdschemata of peers. As mentioned in the Introduction, these mappings play a key role in a BIN because, as we will show in Section 5, they enable query reformulation. After introducing a set of mapping predicates in Section 4.1, in Section 4.2 we informally discuss how the mapping predicates introduced can lead to query reformulations at different levels of accurateness, and we derive a set of requirements for our reformulation algorithm accordingly. 4.1. Mapping predicates We preliminarily note that, according to the PDMS terminology, data are extracted from a source peer and are mapped onto the schema of a target peer. Accordingly, we will name the peer on whose md-schema a BIN query q is originally formulated as target peer, and the one on whose md-schema q has to be reformulated as source peer. The language we propose to express how the md-schema Ds of a source peer s maps onto the md-schema Dt of a target peer t includes five mapping predicates, namely same, equilevel, roll-up, drill-down, and related that will be discussed in detail below. In general, a mapping establishes a semantic relationship from one or more concepts (either measures or attributes) of Ds to one or more concepts of Dt , and enables a BIN query formulated on Dt to be reformulated on Ds . Optionally, a mapping involving attributes can be annotated with a transcoding that specifies how values of the target concepts can be obtained from values of the source concepts. If this function is available, it is used to increase the reformulation effectiveness.
same predicate: mt sameexpr,p Ns , where mt ¼ /mt , at S is a metric of Dt , Ns is a subset of measures of Ds , and expr
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
is an expression involving the measures in Ns. This mapping predicate is used to state that whenever mt is asked in a query on Dt using at , it can be rewritten as expr on Ds . The same mapping predicate can be annotated with a conjunctive Boolean predicate p involving one or more attributes in Dt , to restrict the validity of the rewriting for metric mt . equi-level predicate: Pt equilevelf Ps , where Pt and Ps are projections of Dt and Ds , respectively. This predicate is used to state that Pt has the same semantics and granularity as Ps. Optionally, it can be annotated with an injective transcoding f : DomðPs Þ-DomðPt Þ that establishes a one-to-one relation between coordinates of Ps and Pt, and is used to integrate data returned by the source and target peers. roll-up predicate: Pt rollupf Ps . This predicate states that Pt is a roll-up of (i.e., it aggregates) Ps. Optionally, it can be annotated with a non-injective transcoding f : DomðPs Þ-DomðPt Þ that establishes a many-to-one relation between coordinates of Ps and Pt, and is used to aggregate data returned by the source peer and integrate them with data returned by the target peer. drill-down predicate: Pt drilldownf Ps . This predicate is used to state that Pt is a drill-down of (i.e., it disaggregates) Ps. Optionally, it can be annotated with a non-injective transcoding f : DomðPt Þ-DomðPs Þ that establishes a one-to-many relation between coordinates of Ps and Pt. The transcoding f cannot be used to integrate data returned by t and s because this would require disaggregating data returned by s, which obviously cannot be done univocally; however, it can be used to reformulate selection predicates expressed at t onto s. related predicate: Pt related Ps . This predicate is used to state that Pt coordinates have a many-to-many relationship with Ps coordinates.
Example 4. The complete set of mappings and annotations for our health-care example is reported in Tables 1 and 2. Mappings o1 and o2 state that measure cost in Rome can be derived by summing measures totStayCost and totExamCost in Florence. Mappings from o1 to o5 are also annotated with predicate segment in {‘NH’,‘EU’}, to state that the Florence peer only stores National Health and European patients. On the other hand, as shown in Fig. 3, mapping o12 states that the diagnosis codes used in Florence are obtained by concatenating the fixed-length disease and organ codes used in Rome, and mapping o11 states that weeks are an aggregation of dates. Finally, mapping o15 uses as transcoding a completeGenderðÞ function that converts values ‘M’ and ‘F’ (Florence vocabulary) into ‘Male’ and ‘Female’ (Rome vocabulary). 4.2. Mapping accuracy We start this section by classifying mappings according to their accuracy. Definition 5. We say mapping o is exact iff it is either an equi-level or a roll-up mapping and it has an associated transcoding, or it is a same mapping. A mapping is
399
Table 1 Mappings from Florence (source peer) to Rome (target peer).
o1 o2 o3 o4 o5 o6 o7 o8 o9 o10 o11 o12 o13 o14 o15 o16 o17 o18 o19
/cost,sumS same {totStayCost, totExamCost } /cost,avgS same {totStayCost, totExamCost } /durationOfStay,sumS same {totLength} /durationOfStay,avgS same { totLength} /durationOfStay,maxS same{maxLength} { LHD } roll-up { unit } { ward } equi-level { ward } { year } equi-level { year } { month } equi-level { month } { date } equi-level { date } { week } roll-up { date } { disease,organ } equi-level { diagnosis } { disease } drill-down { category } { patient } drill-down {patientGender,patientCity,patientBirthYear } { gender } equi-level { patientGender } { segment } related { patientGender, patientCity,patientBirthYear } { birthDate } drill-down { patientBirthYear } { city } equi-level { patientCity } { region } roll-up { patientCity }
Table 2 Annotations to the mappings in Table 1.
o1 o2 o3 o4 o5 o6 o7 o8 o9 o10 o11 o12 o13 o14 o15 o16 o17 o18 o19
cost ¼ totStayCostþ totExamCost, segment in {‘NH’,‘EU’ } cost ¼ totStayCostþ totExamCost, segment in {‘NH’,‘EU’ } durationOfStay ¼ totLength, segment in {‘NH’,‘EU’ } durationOfStay ¼ totLength, segment in {‘NH’,‘EU’ } durationOfStay ¼ maxLength, segment in {‘NH’,‘EU’ } LHD ¼ ‘LHD39 - Florence’ ward ¼ ward year ¼ year month ¼ month date ¼ date week ¼ weekOf(date) disease ¼ substring(diagnosis, 1, 40), organ ¼ substring(diagnosis, 41, 80) categoryOf(disease) ¼ category – gender ¼ completeGender(patientGender) – yearOf(birthDate) ¼ patientBirthYear city ¼ patientCity region ¼ regionOf(patientCity)
said to be loose when it is either a drill-down or a related mapping. An attribute mapping is said to be approximate when it has no associated transcoding. In our example, mappings o14 and o16 are approximate and loose; mappings o13 and o17 are loose (but not approximate); all the other mappings are exact. Let q be a BIN query formulated at peer t. The accuracy of a reformulation of q on peer s depends on the accuracy of the mappings involved. In the following, we focus on this aspect and provide a set of requirements for the query reformulation algorithm when dealing with mappings of different accuracies. Intuitively, when (i) for each attribute mentioned in q there is an exact mapping from Ds , and (ii) for each metric required by q there is a same mapping from Ds , there exists a compatible reformulation of q on s, i.e., one that fully preserves the semantics of q. When a compatible
400
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
Fig. 3. Transcoding examples.
reformulation is used, the results returned by s do exactly match with q so they can be seamlessly integrated with those returned by t. For instance, query q1 formulated at the Rome peer in Example 3 has a compatible reformulation at the Florence peer1: q01 ¼ /ADMISSIONS, fregionOfðpatientCityÞ,yearg, ðcompleteGenderðpatientGenderÞ ¼ ‘Female’Þ, totStayCostþtotExamCost,//totStayCost,sumS, /totExamCost,sumSSS Of course, when a compatible reformulation exists for a query, it must be correctly generated by the reformulation algorithm. In all the other cases, the results returned by s match with q with some approximation. Three (possibly overlapping) situations can be distinguished: 1. For at least one of the attributes mentioned in q, there is an approximate mapping from Ds . This is a very common situation in real applications, for instance when proprietary encoding are used for attributes, because building a reliable transcoding would require a huge effort. However, the data returned by the source peer can be understood and useful, so we require that a reformulation is generated though this will lead to a value mismatch, meaning that the data returned by s and t will not be integrated. For instance, if mapping o8 were not annotated with a transcoding for year, query q1 would still be reformulated at the Florence peer as q01 , but there would be no guarantee that the year values returned can be integrated with those returned from the Rome peer. 1 As a matter of fact, the reformulation generated for q1 by the algorithm proposed in Section 6 includes one more predicate that constrains source data based on mapping o13 . Here we do not report this predicate for simplicity.
2. For at least one of the attributes mentioned in q, either there is a loose mapping from Ds or even there is no mapping at all. Also this situation is common, because independent md-schemata often have different granularities and hierarchies have different attributes. We require that a reformulation is generated, knowing that this will lead to a granularity mismatch, meaning that the data returned by s and t will have different aggregation levels. Although an integration is not possible, users can still exploit the results coming from s, for example by comparing them with those returned by t at the finest common aggregation level. For instance, if no mapping were defined for year, q1 would be reformulated as q001 ¼ /ADMISSIONS, fregionOfðpatientCityÞg, ðcompleteGenderðpatientGenderÞ ¼ ‘Female’Þ, totStayCost þ totExamCost,//totStayCost,sumS, /totExamCost,sumSSS which returns data with a different aggregation level than the one required by q1. 3. For at least one of the metrics mentioned in q, there is no mapping from Ds . In this case, we believe that no meaningful reformulation can be done. 5. A reformulation framework In the BIN architecture, queries are formulated on a peer md-schema and answers can come from any other peer connected to the queried peer through a chain of semantic mappings. The key step to this end is reformulating a peer’s query over its immediate neighbors, then over their immediate neighbors, and so on. More precisely, reformulation takes as input a BIN query on a target md-schema Dt as well as the mappings O between Dt and the schema of one of its neighbors, the source
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
401
Fig. 4. Reformulation framework.
md-schema Ds , to output a BIN query that refers only to Ds .2 Our approach takes advantage of the well-established research results in the OLTP context, with specific reference to distributed semantic data sharing systems [22]. The reformulation framework we propose is based on a relational setting, as depicted in Fig. 4, where md-schemata, BIN queries, and semantic mappings at the OLAP level are translated to the relational model. As to mdschemata, without loss of generality we assume that they are stored at the relational level as star schemata. As to queries, a classic logic-based syntax is adopted to express them at the relational level. As to mappings, their representation at the relational level uses a logical formalism typically adopted for schema mapping languages, i.e., source-to-target tuple generating dependencies (s-t tgd’s) [41]. A BIN query is then reformulated starting from its relational form on a star schema, using the mappings expressed as s-t tgd’s. To simplify the query reformulation task, we translate mappings following an approach founded on the semantics of the transformations data are subjected to along the reformulation process. In this way, we are not tied to the syntax of the mapping language presented in Section 4, which enables users to express their specification needs through powerful predicates. In particular, the translation of a mapping depends on the mapping accuracy as follows:
5.1. Translating schemata
cile source data values to target data values according to either the expression or the transcoding specified. Loose but not approximate mappings are translated into s-t tgd’s that relate source data values with target data values using a transcoding. Though this transcoding is useful for the reformulation of selection predicates expressed at the target peer, it cannot be used for
Let D ¼ /A,H,MS be an md-schema. We assume that D is stored as a standard star schema: one dimension table dt i ðDIM i ,ai1 , . . . ,aiv Þ for each hierarchy hi, where fDIM i , ai1 , . . . ,aiv g ¼ Attrðhi Þ, and a fact table ftðDIM 1 , . . . ,DIMn , m1 , . . . ,ml Þ where each DIMi is a foreign key, thus enforcing inclusion dependencies between the fact table and the dimension tables. For example, the star schema corresponding to the HOSPITALIZATION and ADMISSION md-schemata are shown in Fig. 5. It is worth noting that, while at the OLAP level both the mapping and query languages directly use the attributes and measures names, their counterpart borrowed from the relational model use star schema tables under the unnamed perspective, i.e., the specific attribute names are ignored, and only the number of attributes of each relation schema is available. These languages, both stemming from mathematical logic, view a database schema as a tuple R ¼ ðr1 , . . . ,rn Þ of relation symbols, each of which has a fixed arity. Given an md-schema D and the corresponding star schema, we switch from the named perspective of the OLAP level to the unnamed one of the relational level through (a) n dimension table-encoding functions di : Attrðhi Þ-N, each associating every attribute a 2 Attrðhi Þ with the corresponding position in dti, and (b) a fact table-encoding function j : fDIM1 , . . . ,DIMn g [ M-N that associates each measure and dimension with the corresponding position in ft.
2
5.2. Translating queries
Exact mappings are translated into s-t tgd’s that recon
data reconciliation since it is only applicable to transform target values into source values. Approximate mappings do not describe how source values should be transcoded to be compatible with the target domains. Thus, the corresponding s-t tgd’s can only specify syntactic constraints that define how attributes in the source are related with attributes in the target.
Hereinafter, we will safely assume that O is consistent. Consistency checking is part of mapping management, that is out of the scope of the present paper. Interested readers can refer to [29] for an in-depth discussion on this topic.
In a classical logic-based syntax, a PSJ query on a database schema R is expressed as a conjunction of
402
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
Fig. 5. Star schemata for the Rome (top) and Florence (bottom) peers.
relational atoms and Boolean predicates having the following rule-based form: qðzÞ’r1 ðx1 Þ, . . . ,rn ðxn Þ,p1 ðy1 Þ, . . . ,pu ðyu Þ where each ri is a relation of R, each xi is a tuple of variables and/or constants of the same arity of ri, each pj is an atomic Boolean predicate involving variables in X (where X is the union of the variables appearing in the xi ’s), and all the variables of tuple z belong to X. Given tuple x, in the following we will use dot notation x:k to refer to the k-th component of x. A GPSJ query is then represented as a conjunctive query with one or more aggregate terms in its head:
dt1 ðx1 Þ, . . . ,dt c ðxc Þ to introduce the variables associated with the measures and the attributes involved in q. Their positions in ftðxÞ are determined by j, while the di ’s are used to place the variables associated with the attributes of q in dt 1 ðx1 Þ, . . . ,dt c ðxc Þ. Formally: qðnðe1 Þ, . . . , nðeg Þ,exprða1 ðnðm1 ÞÞ, . . . , av ðnðmv ÞÞÞÞ ’ftðxÞ,dt 1 ðx1 Þ, . . . ,dt c ðxc Þ, nðp1 Þ, . . . , nðpu Þ where
nðei Þ denotes the substitution of each attribute a in ei with the corresponding variable nðaÞ;
the tuple x is such that x:jðDIM1 Þ ¼ nðDIM1 Þ, . . . ,
qðz, a1 ðw1 Þ, . . . , av ðwv ÞÞ’r1 ðx1 Þ, . . . ,rn ðxn Þ,p1 ðy1 Þ, . . . ,pu ðyu Þ where each ai is an aggregation operator, no variable in z occurs in w1 ,y, wv , and all variables in z and w1 ,y, wv belong to X. The answer to a GPSJ query q on a database instance I, q(I), is a relation obtained in three steps: (1) project the tuples of I satisfying the body (i.e., the righthand part) of q on z,w1 , . . . ,wv ; (2) group the tuples that agree on z; and (3) aggregate the values assigned to w1 , . . . ,wv within each group. Interested readers can refer to [11] for a formal definition. Now, let q ¼ /D,G,p,expr,TS be a BIN query, where G ¼ fe1 , . . . ,eg g, p ¼ p1 4 4pu , and T ¼ f/m1 , a1 S, . . . , /mv , av Sg and let h1 , . . . ,hc be the hierarchies involved in G and p. The translation of q to the relational level relies on a variable-assignment function n that associates each measure m in expr with a free variable nðmÞ as well as each attribute a in G and in p with a free variable nðaÞ. The join between the involved dimension tables and the fact table requires the introduction of one more free variable nðDIMi Þ for the dimension DIMi of each involved hierarchy hi when DIMi is not explicitly involved in q. The translation of q is then a GPSJ query based on the star join between the fact table ft and the dimension tables dt 1 , . . . ,dt c of the hierarchies h1 , . . . ,hc . The variable-assignment function n is used in the free tuples of the atoms ftðxÞ,
x:jðDIMc Þ ¼ nðDIM c Þ and x:jðm1 Þ ¼ nðm1 Þ, . . . ,x:jðmv Þ ¼ nðmv Þ. The other variables in x are anonymous (and denoted with the symbol _Þ; each tuple xi is such that xi :di ðaÞ ¼ nðaÞ for each attribute a involved from the i-th hierarchy and xi :di ðDIMi Þ ¼ nðDIMi Þ. The other variables in xi are anonymous; nðpi Þ denotes the substitution of each attribute a in pi with the corresponding variable nðaÞ.
Example 5. The query q1 shown in Example 3 translates onto the HOSPITALIZATION star schema to q1 ðR,Y,sumðCÞÞ’HospFTð_,_,D,_,P,C,_Þ, DateDTðD,_,_,YÞ, PatientDTðP,_,_,R,_,GÞ,G ¼ ‘Female’ while q2 translates onto the ADMISSIONS star schema to q2 ðY,regionOfðPÞ,avgðTÞÞ’AdmFTð_,D,_,P,_,_,_,_,T,_,_Þ, DateDTðD,_,YÞ, PatientCityDTðP,_Þ
5.3. Translating mappings We represent mappings as s-t tgd’s [41]. Given a source star schema S and a target star schema T, an s-t tgd has
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
the form
The roll-up mapping o11 translates to 8D,WðS:DateDTðD,_,_Þ,AdmFTð_,D,_, . . . ,_Þ, W ¼ weekOfðDÞ
8xðfðxÞ-(y cðx,yÞÞ where fðxÞ is a conjunction of atomic formulas over S and cðx,yÞ is a conjunction of atomic formulas over T. Every s-t tgd expresses the containment of one conjunctive query fðxÞ in another conjunctive query cðx,yÞ; informally, it asserts that if a pattern of facts appears in the source, then another pattern of facts must appear in the target. In spite of their syntactic simplicity, s-t tgd’s can express many data interoperability tasks arising in applications. They are also known as global-and-local-as-view (GLAV) dependencies, and can accommodate both GAV and LAV formalisms: in a GAV, the right-hand side of the implication consists of a single atomic formula 8xðfðxÞ-Uðx0 ÞÞ where the variables in x0 are among those in x, while in a LAV dependency the left-hand side of the implication consists of an atomic formula 8xðrðxÞ-(y cðx,yÞÞ A schema mapping is then a triple M ¼ /S,T, SS where S is a set of s-t tgd’s. The semantics of M is given in terms of star schema instances: Given an S-instance I and a T-instance J, we say that J is consistent with I w.r.t. M if (I, J) satisfies every s-t tgd in S. In the following subsections, we present the translation of the mapping predicates proposed in Section 4 into s-t tgd’s; in case of ambiguity, we will use prefixes S and T to distinguish source tables from target ones. 5.3.1. Exact mappings We start from exact mappings, that is, those assertions that either contain expressions or transcoding functions to be used for translating source values to the target domains (sameexpr ,3 equilevelf , and rollupf ). In this case, the proposed s-t tgd’s express the constraints between tuples induced either by the expression expr that relates measure values in case of same, or by the transcoding f which relates attribute values in case of equi-level and roll-up. Example 6. With reference to Example 4, we report some examples of mapping translations:
The same mapping o1 translates to
403
-(D0 ðT:DateDTðD0 ,W,_,_Þ,HospFTð_,_,D0 ,_, . . . ,_ÞÞÞ
Noticeably, two different predicates, equi-level and roll-up, translate to the same s-t tgd form. Indeed, an equi-level predicate states that two projections have the same granularity, whereas a roll-up is used to express a one-to-many relationship; this means that, differently from an equi-level predicate, a roll-up predicate requires to aggregate source data in order to make them compatible with the target domain. On the other hand, the proposed translation focuses on the kind of tuple dependencies each mapping defines, which is solely dependent on the existence of a transcoding that maps each source value into exactly one target value. The specific properties of transcodings (equi-level transcodings are injective, roll-up transcodings are not) have an impact on query reformulation but they do not affect the translation of mappings. For instance, consider mapping o12, which translates to 8D,D0 ,OðDiagnosisDTðD,_Þ,AdmFTðD,_, . . . ,_Þ, D0 ¼ substringðD,1,40Þ,O ¼ substringðD,41,80Þ -DiseaseDTðD0 Þ,OrganDTðOÞ,HospFTðO,D0 ,_, . . . ,_ÞÞ When a query requires only disease, the diagnosis values obtained from the source peer must be grouped by their disease values, because f projected on the first co-domain is no longer injective. This example also shows that, when a mapping involves no target attributes other than dimensions, the existential quantifier is not used because the join between the fact table and dimension tables is already specified by the attributes involved. The formal translation of this first class of predicates is shown in Table 3. Without loss of generality, we assume that the first l attributes of fat1 , . . . ,atj g and l0 attributes of fas1 , . . . ,ask g are dimensions, and that dti is the dimension table ai belongs to. Tuples are then defined as follows (all the unspecified variables are anonymous, and we recall that n is a variable assignment function that associates a free variable to each attribute and measure involved in the mapping assertions):
8S,E,CðAdmFTð_, . . . ,_,S,E,_,_,_Þ,C ¼ S þ E -HospFTð_, . . . ,_,C,_ÞÞ
the (fact table) tuple ys is such that ys :js ðmsi Þ ¼ nðmsi Þ for
This translation disregards the involved aggregation function, sum, that will be dealt with in the query reformulation algorithm. The equi-level mapping o8 translates to
8D,Y,Y 0 ðS:DateDTðD,_,YÞ,AdmFTð_,D,_, . . . ,_Þ,Y 0 ¼ Y -(D0 ðT:DateDTðD0 ,_,_,Y 0 Þ,HospFTð_,_,D0 ,_, . . . ,_ÞÞÞ
3
As to the same predicate, for presentation simplicity, here we assume that it is annotated with no predicates. The way predicates are dealt with, both in mappings and in queries, will be discussed in Section 6.1.
i ¼ 1, . . . ,k, i.e., the only assigned variables are those corresponding to measures; the (fact table) tuple yt is such that yt :jt ðmt Þ ¼ nðmt Þ; the (fact table) tuple xs is such that xs :js ðasi Þ ¼ nðasi Þ for i ¼ 1, . . . ,l0 and xs :js ðDIM si Þ ¼ nðDIM si Þ for i ¼ l0 þ1, . . . ,k, i.e., the only assigned variables are those corresponding to attributes; the (fact table) tuple xt is such that xt :jt ðati Þ ¼ nðati Þ for i ¼ 1, . . . ,l and xt :jt ðDIMti Þ ¼ nðDIM ti Þ for i ¼ l þ1, . . . ,j; s the (dimension table) tuple xsi is such that xsi :di ðasi Þ ¼ s s s s 0 s nðai Þ. Moreover, xi :di ðDIMi Þ ¼ nðDIMi Þ for i ¼ l þ 1, . . . ,k; t the (dimension table) tuple xti is such that xti :di ðati Þ ¼ nðati Þ. Moreover, xti :dti ðDIMti Þ ¼ nðDIMti Þ for i ¼ l þ 1, . . . ,j.
404
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
Table 3 Translation of the exact mapping predicates; superscripts t and s denote elements of the target and source schemata, respectively. Predicate
Translation
/mt , aS sameexpr nfms1 , . . . ,msk g
8nðms1 Þ, . . . , nðmsk Þ, nðmt Þ ðS:ftðys Þ, nðmt Þ ¼ exprðnðms1 Þ, . . . , nðmsk ÞÞ-T:ftðyt ÞÞ
fat1 , . . . ,atj gequilevelf fas1 , . . . ,ask g fat1 , . . . ,atj grollupf fas1 , . . . ,ask g
8nðat1 Þ, . . . , nðatj Þ, nðas1 Þ, . . . , nðask Þ, nðDIM sl0 þ 1 Þ, . . . , nðDIM sk Þ s
s
ðS:dt 1 ðxs1 Þ, . . . ,S:dtk ðxsk Þ,S:ftðxs Þ, ðat1 Þ ¼
n
-(n
f ðn
ðas1 Þ,
ðDIM tl þ 1 Þ,
. . . , nðask ÞÞ:1, . . . , nðatj Þ ¼ f ðnðas1 Þ, . . . , nðask ÞÞ:j t
t
. . . , nðDIM tj ÞðT:dt1 ðxt1 Þ, . . . ,T:dtj ðxtj Þ,T:ftðxt ÞÞ)
Table 4 Translation of the loose/approximate mapping predicates. Predicate
Translation
fat1 , . . . ,atj gdrilldownf fas1 , . . . ,ask g
8nðat1 Þ, . . . , nðatj Þ, nðas1 Þ, . . . , nðask Þ, nðDIM sl0 þ 1 Þ, . . . , nðDIM sk Þ s
s
ðS:dt 1 ðxs1 Þ, . . . ,S:dtk ðxsk Þ,S:ftðxs Þ,
nðas1 Þ ¼ f ðnðat1 Þ, . . . , nðatj ÞÞ:1, . . . , nðask Þ ¼ f ðnðat1 Þ, . . . , nðatj ÞÞ:k t
t
-(nðDIM tl þ 1 Þ, . . . , nðDIMtj ÞðT:dt1 ðxt1 Þ, . . . ,T:dt j ðxtj Þ,T:ftðxt ÞÞ) fat1 , . . . ,atj gequilevelfas1 , . . . ,ask g
8nðas1 Þ, . . . , nðask Þ, nðDIMsl0 þ 1 Þ, . . . , nðDIM sk Þ s
s
fat1 , . . . ,atj grollupfas1 , . . . ,ask g
ðS:dt 1 ðxs1 Þ, . . . ,S:dtk ðxsk Þ,S:ftðxs Þ
fat1 , . . . ,atj gdrilldownfas1 , . . . ,ask g
-(nðat1 Þ, . . . , nðatj Þ, nðDIM tl þ 1 Þ, . . . nðDIMtj Þ
fat1 , . . . ,atj grelatedfas1 , . . . ,ask g
ðT:dt 1 ðxt1 Þ, . . . ,T:dt j ðxtj Þ,T:ftðxt ÞÞ)
Note that each dimension table dti is related to its fact table ft through dimension DIMi to enforce the inclusion dependency between the two tables. 5.3.2. Loose/approximate mappings For a loose but not approximate drilldown mapping, a transcoding function f exists but, unlike the case of exact mappings, it cannot be used to translate source values to the target domains (but rather to translate target values to the source domains). The translation of drilldownf is shown in the upper part of Table 4. The lower part of Table 4 shows the translation of approximate mappings. Importantly, in this case there is no dependency between the source and the target tuples. Since the resulting assertions do not express any relationship between the right-hand and the left-hand variables, using s-t tgd’s forces a little abuse of notation.
t
6. Query reformulation in a BIN Reformulation is the key step for query answering in a BIN as it actually represents a means to mediate queries among heterogeneous md-schemata. Two important issues arise:
The inter-peer reformulation algorithm takes as input a
Example 7. The non-approximate drill-down mapping
o17 translates to 8D,YðPatientBirthYearDTðYÞ,AdmFTð_,_,_,_,Y,_, . . . ,_Þ, Y ¼ yearOfðDÞ -(PðPatientDTðP,D,_, . . . ,_Þ,HospFTð_,_,_,_,P,_, . . . ,_ÞÞÞ while the approximate drill-down mapping o14 translates to 8C,G,BðPatientCityDTðC,_Þ,PatientGenderDTðGÞ, PatientBirthYearDTðBÞ,AdmFTð_,_,_,C,B,G,_, . . . ,_Þ -(PðPatientDTðP,_, . . . ,_Þ,HospFTð_,_,_,_,P,_, . . . ,_ÞÞÞ
t
target query and produces a source query. The fact that the BIN query language introduced in Definition 4 is closed under reformulation is an essential property which ensures that chains of reformulations can take place in a BIN. On the other hand, multidimensional data are accessible through multidimensional engines using query languages whose expressive power is typically different from ours (for instance, MDX does not allow a group-by set to include a transcoding). This means that, to access its local data, each queried peer may have to carry out an intra-peer reformulation that translates a source BIN query into a local query. A local query relies on a view V corresponding to a query that can be directly executed on the local engine, and contains a declarative description of how to derive the data required by the BIN query from V. As pointed out in Section 4, the results returned by a peer may match with the original query with some approximation. We recall that a non-compatible reformulation occurs every time the available mappings do not allow the values of (some of) the target attributes to be derived from the source ones. In this case, the source query does not fully preserve the semantics of the
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
target query as it refers to attributes which are merely syntactically related with the target ones, so the returned values cannot be integrated in the target attribute columns. We believe that any information about the reasons of a non-compatible reformulation could considerably help users understand the returned results. To this end, when a query is formulated on a peer, the answers this peer receives from any other peer should at least be equipped with the knowledge about which attributes found an exact mapping and which ones did not.
405
(Oattr D O). Depending on the mapping predicates involved, attribute mappings are further distinguished into three types: TypeðoÞ 2 ft1 , t2 , t3 g, where t1 ¼ fequilevelf ,rollupf g (exact mappings), t2 ¼ fdrilldownf g (loose and nonapproximate mappings), and t3 ¼ fequilevel, rollup, drilldown,relatedtog (approximate mappings). The selection procedure is shown in Algorithm 1. It selects and returns a set Oq of mappings based on two policies: Algorithm 1. Mapping selection.
We are now ready to define the query reformulation problem in a BIN. Definition 6 (Query reformulation problem). Given a BIN query qt on (target) md-schema Dt , a (source) md-schema Ds , and a set of mappings O from Ds to Dt : 1. An inter-peer reformulation of qt on Ds is a BIN query qs that refers only to Ds , together with some information on the mapped attributes. 2. An intra-peer reformulation of qt or qs is a relational query whose body contains a view V directly computable by the local multidimensional engine. In the remainder of this section we focus on inter-peer reformulation, while intra-peer reformulation will be discussed in Section 7 with specific reference to the MDX query language. 6.1. The inter-peer reformulation algorithm This section presents an algorithm that, consistently with the framework proposed in Section 5, reformulates a target BIN query qt onto a source peer at the relational level in three steps: 1. The mappings to be used for reformulation are selected from O. In presence of predicates annotating same mappings, qt is expanded to include those predicates. 2. The relational translation of qt, q, is reformulated into a relational query q0 that only refers to the source star schema S. 3. q0 is expanded to include all the constraints stated by the transcodings used, then translated into a (source) BIN query qs. In the following we show each single step, using as an example the query asking for the average cost of female patients for each disease and segment: qt ¼ /HOSPITALIZATION, fdisease,segmentg,ðgender ¼ ‘Female’Þ, cost,//cost,avgSSS Finally, we discuss how to deal with selection predicates. 6.1.1. Step 1: mapping selection In this step we perform a selection of the mappings in O that are relevant for reformulating qt. To this end, we follow a syntactic approach that relies on the distinction between measure mappings (Omeas D O) and attribute mappings
Require: O ¼ Omeas [ Oattr : mappings, qt ¼ /Dt ,G,p,expr,TS: target query 1: Oq ¼ | 2: for all m ¼ /m, aS 2 T do 3: for all o 2 Omeas , o : mt sameexpr,pred Ns do 4: if mt ¼ m then 5: add o to Oq 6: p ¼ p4pred 7: if p is not satisfiable then 8: Abort reformulation 9: end if 10: Go to Line 2 11: end if 12: end for 13: if m did not find a mapping then 14: Abort reformulation 15: end if 16: end for 17: for all attributes a appearing in G do 18: for all o 2 Oattr , o : Pt /mappPredS Ps do 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29 : 30: 31:
if a 2 Pt then add o to Oq end if end for end for for all attributes a appearing in p but not in G do for all o 2 Oattr such that TypeðoÞ ¼ T1 or TypeðoÞ ¼ T2 do if a 2 Pt then add o to Oq end if end for end for return Oq , qt;
As to metrics (Lines 2–16), we require that a source
query can correctly compute the aggregate values specified in the target query with reference to its mdschema Ds ; therefore, the existence of a mapping for each metric m 2 T in qt is mandatory (Line 14). The conjunctive Boolean predicate that may annotate a same mapping is added to the query selection predicate p (Line 6); this modification could make p unsatisfiable, in which case reformulation is aborted as in [22]. As to attributes, we distinguish those in the group-by set G (Lines 17–23) from those in the selection predicate p (Lines 24–30). This distinction is necessary because in the first case we match the attributes against all the mappings in O whereas, in the second case, we focus our search only on those mappings that contain a transcoding (because, in the absence of a transcoding, translating a selection predicate is not possible).
Example 8. The measure mapping to be used for translating qt is o2 , so initially Oq ¼ fo2 g. Since o2 is annotated
406
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
with one predicate, at Line 6 qt becomes qt ¼ /HOSPITALIZATION, fdisease,segmentg,ðgender ¼ ‘Female’Þ 4ðsegment in f‘NH’,‘EU’gÞ, cost,//cost,avgSSS Then the other two steps add mappings o12 , o13 , o15 , and o16 to Oq . Note that, if the predicate annotating o2 were gender¼‘Male’, reformulation would be aborted. 6.1.2. Step 2: query reformulation This step takes place at the relational level, and it reformulates the relational translation of qt, q, using Sq , i.e., the set of s-t tgd’s that translate the mappings in Oq onto the relational level. To this end, each s-t tgd s 2 Sq is matched with the body of q. As already mentioned, the s-t tgd’s translating approximate mappings do not state any dependency between the left-hand variable and the right-hand one; in other words, none of the right-hand variables is used in the left-hand of such s-t tgd’s. In these cases, we keep track of the syntactic relationships between the target variables and the source ones as derived from the mappings by defining a partial 0 variable-set mapping function Approx : 2VarðqÞ -2Varðq Þ that relates sets of head variables in q with sets of variables in q0 . This function will be used to select the head variables of the source query q0 and will annotate the results returned to the querying peer. This step includes three phases: first we define the body of q0 , then we build Approx, and finally we define the head of q0 . For simplicity, we first consider the case in which neither the same mappings nor q include Boolean predicates. The first phase is described by Algorithm 2, where each s-t tgd having the form s : 8xðfðxÞ-(y cðx,yÞÞ is turned into a pair of tgd’s sLAV : 8x0 ðVðx0 Þ-(y : cðx0 ,yÞÞ and sGAV : 8xðfðxÞ-Vðx0 ÞÞ, where the former is expressed in LAV style, the latter in GAV style, and x0 is the tuple of variables shared by the two sides of s. Essentially, the algorithm builds the body of q0 , body, by merging the left sides of the s-ttgd’s in Sq , and returns the set of views used during this process. More specifically, for each selected s-t tgd s, unifyðsLAV ,qÞ (Line 5) matches the atomic formulas in the body of sLAV , fs , to that of q, fq, by finding a variable mapping Z such that Zðfs Þ ¼ Zðfq Þ, if possible. The result of unifying sLAV with q, Z, is then applied to the sLAV counterpart, sGAV (Line 6). Noticeably, these steps are performed only if the head of sLAV and that of sGAV share some variables, because only in this case unification affects sGAV . Moreover, some of the distinguished variables in x0 may become anonymous in ZðsGAV Þ. Finally, the body of sGAV is merged with body. Algorithm 2. body Setup. Require: Sq : relevant s-t tgd’s 1: body ¼ e 2: views ¼ fg 3: for all s 2 Sq do 4: if x0 ae then 5: Z ¼ unifyðsLAV ,qÞ
6: 7: 8: 9: 10: 11:
sGAV ¼ ZðsGAV Þ views ¼ views [ ZðVðx0 ÞÞ end if merge ðbody, sGAV Þ end for return body, views
Example 9. The query q shown in Example 8 translates at the relational level as follows: qðD,S,avgðCÞÞ’HospFTð_,D,_,_,P,C,_Þ, DiseaseDTðDÞ, PatientDTðP,_,_,_,S,GÞ,G ¼ ‘Female’, S in f‘NH’,‘EU’g The set Sq corresponding to Oq includes the st-tgd’s listed below:
s2 : 8S,E,CðAdmFTð_, . . . ,_,S,E,_,_,_Þ,C ¼ S þE -HospFTð_, . . . ,_,C,_ÞÞ
s12 : 8D,D0 ,OðDiagnosisDTðD,_Þ,AdmFTðD,_, . . . ,_Þ, D0 ¼ substringðD,1,40Þ,O ¼ substringðD,41,80Þ -DiseaseDTðD0 Þ,OrganDTðOÞ,HospFTðO,D0 ,_, . . . ,_ÞÞ
s13 : 8D,D0 ,CðDiagnosisDTðD,CÞ,AdmFTðD,_, . . . ,_Þ, C ¼ categoryOfðD0 Þ -DiseaseDTðD0 Þ,HospFTð_,D0 ,_, . . . ,_ÞÞ
s15 : 8G,PðPatientGenderDTðPÞ,AdmFTð_,_,_,_,_,P,_, . . . ,_Þ, G ¼ completeGenderðPÞ -(P 0 ðPatientDTðP 0 ,_, . . . ,_,GÞ,HospFTð_,_,_,_,P 0 ,_,_ÞÞÞ
s16 : 8G,C,YðPatientCityDTðC,_Þ,PatientBirthYearDTðYÞ, PatientGenderDTðGÞ,AdmFTð_,_,_,C,Y,G,_, . . . ,_Þ -(P,SðPatientDTðP,_, . . . ,_,S,_Þ,HospFTð_,_,_,_,P,_,_ÞÞÞ The output of the body setup phase is body ¼ ðAdmFTðD0 ,_,_,T,Y,P,S0 ,E,_,_,_Þ,C 0 ¼ S0 þ E, DiagnosisDTðD0 ,CÞ, D ¼ substringðD0 ,1,40Þ,C ¼ categoryOfðDÞ, PatientGenderDTðPÞ, PatientCityDTðT,_Þ,PatientBirthYearDTðYÞ, G ¼ completeGenderðPÞÞ Let us focus on the s-t tgd s12 , that corresponds to
sLAV : 8O,D0 ðV12 ðO,D0 Þ -DiseaseDTðD0 Þ,OrganDTðOÞ, HospFTðO,D0 ,_, . . . ,_ÞÞ
sGAV : 8D,D0 ,OðDiagnosisDTðD,_Þ,AdmFTðD,_, . . . ,_Þ, D0 ¼ substringðD,1,40Þ,O ¼ substringðD,41,80Þ-V12 ðO,D0 ÞÞ
The result of unification, unifyðsLAV ,qÞ, is a variable mapping Z that maps O into an anonymous variable and ZðV12 ðO,D0 ÞÞ ¼ V12 ð_,DÞ. Therefore, the predicate _ ¼ substringðD,41,80Þ is not added to body. Moreover, note that none of the returned views, V2 ðCÞ, V12 ð_,DÞ, V13 ðDÞ, and V15 ðGÞ, contains the query head variable S as the mapping of segment has type t3 and, therefore, the corresponding s-t tgd does not introduce any relationship between the left-hand variables and the righthand ones.
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
The second phase is shown in Algorithm 3, that outputs the partial variable set mapping Approx by leveraging on the properties of the selected mappings. If a target attribute is related to the source schema by means of a nonexact mapping, then its values cannot be derived from the values of the related source attributes. For this reason, each time we use a mapping o of type t2 or t3 , we extend Approx with a mapping from the set of variables corresponding to the attributes in o that have not found an exact mapping yet, i.e., I ¼ ðq \ Pt Þ\Exact, to the set of variables corresponding to Ps (Line 6). On the contrary, if o is exact we delete from Approx all the variables corresponding to the attributes in o and add them to the attribute set Exact, to prevent their inclusion into Approx during a later step (Lines 8–10). Algorithm 3. Approx Setup. Require: Oq ¼ Omeas [ Oattr q q : relevant mappings, q: target query, nq : variable-assignment function for q, body: body of q0 , nbody : variableassignment function for body 1: Approx ¼ fg 2: Exact ¼ fg 3: for all o 2 Oattr q , o : Pt /mappPredS Ps do 4: 5: 6: 7: 8: 9: 10: 11: 12: 13:
I ¼ ðq \ Pt Þ\Exact if TypeðoÞ ¼ t2 or TypeðoÞ ¼ t3 then Add nq ðIÞ/nbody ðPs Þ to Approx else Add I to Exact Delete from the domain of Approx the variables in nq ðIÞ Delete from Approx each pair V/V 0 such that V ¼ fg end if end for return Approx
Example 10. The partial variable set mapping Approx for the reference example consists of the pair fSg/fT,Y,Pg. The third phase consists in deriving the head of q0 , q0 ðz0 ,expr 0 ða01 ðw1 0 Þ, . . . , a0v0 ðwv0 ÞÞ, from the head of q, qðz,exprða1 ðw1 Þ, . . . , av ðwv ÞÞ:
As to the head variables in z0 , we first compare the set of
variables y in views with the set of head variables z of q. Indeed, the only head variables of q that found a reformulation are those contained in at least the head of one view. Therefore, z00 D z0 where z00 ¼ ðy \ xÞ\ DomðApproxÞ. For any head variable in z that is not contained in z00 , either it did not find any mapping or it found approximate or loose mappings only. For the latter case, we exploit the image of the Approx function since it contains the set of q0 variables that are syntactically related with the head of q. Such variables are added to z0 in place of the head variables in z that found approximate and loose mappings only. Therefore z0 ¼ z00 [ ImgðApproxÞ. As to aggregate terms ai ðwi Þ, each wi is necessarily contained in the body of q0 and is involved in a comparison predicate wi ¼ expri ðv1 , . . . ,vn Þ, where v1 , . . . ,vn are variables of q0 , that states the relationship between one target measure and the source ones. Therefore, wi is replaced with expr i ðv1 , . . . ,vn Þ in the aggregate term and the comparison predicate is deleted from body. Then, the aggregate function ai is distributed inside expri, thus obtaining expri ðai ðv1 Þ, . . . , ai ðvn ÞÞ.
407
As to the last point note that, obviously, expr i ðai ðv1 Þ, . . . , ai ðvn ÞÞ is equal to ai ðexpri ðv1 , . . . ,vn ÞÞ only if ai is distributive over expri. Ensuring that the established mappings are consistent with the semantics of measures and with the operators through which each measure can be aggregated, is the designer’s responsibility. Example 11. The head of our reference query is qðD,S,avgðC 0 ÞÞ. By following the steps described above, we have that D 2 fV12 ,V13 g whereas S= 2views. Indeed, S found an approximate mapping only, so ðfSg/fT,Y,PgÞ 2 Approx. Therefore, D,T,Y,P are head variables. Moreover, avgðC 0 Þ translates into avgðS0 Þ þavgðEÞ. Summing up, the reformulated query q0 is q0 ðD,T,Y,P,avgðS0 Þ þavgðEÞÞ ’AdmFTðD0 ,_,_,T,Y,P,S0 ,E,_,_,_Þ, DiagnosisDTðD0 ,CÞ, D ¼ substringðD0 ,1,40Þ,C ¼ categoryOfðDÞ, PatientGenderDTðPÞ, PatientCityDTðT,_Þ,PatientBirthYearDTðYÞ, G ¼ completeGenderðPÞ 6.1.3. Step 3: query expansion Once q has been reformulated into q0 according to the s-t tgd’s in Sq , it may be necessary to expand q0 to include further constraints on the sources variables stated by the s-t tgd’s in S that contain transcodings. This is done by merging the body of q0 with a set G of atomic formulas. Since G is independent of q0 , it can be computed off-line as explained below. Let Oc D O be the set of mappings of types t1 and t2 , i.e., that express transcoding constraints, and let Sc be the corresponding set of s-t tgd’s. First, the s-t tgd’s in Sc are merged to obtain a single s-t tgd, sc . Then, the closure of the predicates appearing on the left-hand side of sc is computed. Among the predicates in the closure, only those involving distinguished variables belonging only to the left-hand part of sc must be kept. To this end, we denote with D the set of these variables, and we introduce Z as a variable mapping that maps each variable in D into itself, and makes all the remaining variables anonymous. The set G is finally obtained by applying Z to the left-hand side of sc . Example 12. In our reference example, we have
G ¼ ðAdmFTðD0 ,_, . . . ,_Þ,DiagnosisDTðD0 ,CÞ, C ¼ categoryOfðsubstringðD0 ,1,40ÞÞÞ Note that several predicates translating the transcodings in Table 1 do not appear here; for instance, the predicate C¼P corresponding to o18 is discarded because C and P are not both defined on the left-hand side of sc : the former is defined in the relational atom PatientDTð_,_,C,_, . . .Þ of the right-hand side of sc , while the latter is defined in the relational atom PatientCityDTðP,_Þ of the left-hand side of sc . Obviously, when G and q0 are merged, predicate C ¼ categoryOfðDÞ is discarded because, if coupled with D ¼ substringðD0 ,1,40Þ, it is equivalent to C ¼ categoryOfðsubstringðD0 ,1,40ÞÞ. On the other hand, D ¼ substringðD0 ,1,40Þ cannot be discarded because D is a head variable.
408
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
6.1.4. Incorporating selection predicates Selection predicates provide a very useful mechanism for specifying constraints on attribute values. The BIN framework supports the specification of selection predicates in both queries and same mappings, and Step 1 exploits them to prevent inconsistent reformulations (see Algorithm 1). On the other hand, when Step 1 succeeds, it outputs a BIN query qt whose predicate p includes both the query predicates and the mapping ones. At the beginning of Step 2, such predicates are encoded in the relational query q and they must reformulated on the source schema. Nevertheless, some of the atomic predicates in q may not undergo reformulation because of the head variables of q that did not find any exact mapping. More specifically, to decide which predicates of q can be incorporated in q0 , Step 2 must be modified as follows. At the end of Algorithm 2, for each atomic predicate c in q we check if the set of variables in c is contained in the set of variables in views, in which case c is merged with body. Generally speaking, the presence of predicates could lead to inconsistent reformulation on the source side too. To this end, each time in Step 2 we include any predicate in body, we also check that body is still satisfiable. Example 13. Given the query qt shown in Example 8, its relational translation q contains two atomic predicates: G ¼ ‘Female’ and S in f‘NH’,‘EU’g. Since G 2 V15 , the corresponding predicate is added to the body of q0 , whereas S does not belong to views and thus the corresponding predicate is discarded. The final reformulated query q0 is q0 ðD,T,Y,P,avgðS0 Þ þ avgðEÞÞ ’AdmFTðD0 ,_,_,T,Y,P,S0 ,E,_,_,_Þ, DiagnosisDTðD0 ,CÞ, D ¼ substringðD0 ,1,40Þ, C ¼ categoryOfðsubstringðD0 ,1,40ÞÞ, PatientGenderDTðPÞ,PatientCityDTðT,_Þ, PatientBirthYearDTðYÞ, completeGenderðPÞ ¼ ‘Female’ 6.2. Properties of the inter-peer reformulation algorithm The query reformulation algorithm shows two properties that are essential for its working in a distributed setting (the proofs are given in Appendix A). First, the algorithm outputs a query q0 that finds a corresponding query at the BIN level. For instance, the BIN query corresponding to our reference query q0 is qs ¼ /ADMISSION, fsubstringðdiagnosis,1,40Þ, patientCity,patientBirthYear,patientGenderg, ðcompleteGenderðpatientGenderÞ ¼ ‘Female’ 4category ¼ categoryOfðsubstring ðdiagnosis,1,40ÞÞÞ,totStayCostþ totExamCost, //totStayCost,avgS,/totExamCost,avgSSS In other words, the BIN query language is closed under reformulation. The practical impact of this is that our query reformulation algorithm can be used by each peer in a BIN to implement chains of reformulations. In this way, any query formulated over a peer schema can be safely
distributed across the network, and answers can come from any other peer in the network which is connected to the queried peer through a chain of semantic mappings. Theorem 1. Let qt be a (target) BIN query and q0 be the output of the query reformulation algorithm when qt is given as input. Then, there exists a (source) BIN query qs such that q0 is the relational translation of qs. Secondly, the reformulation algorithm is sound and complete with respect to the semantics of query answering, that in data sharing settings is usually given in terms of certain answers. Definition 7 (Certain answers). Let M ¼ ðS,T, SÞ be a schema mapping and q be a query over the target star schema T. If I is a source instance, then the certain answers of q on I with respect to M, denoted certainM ðqÞðIÞ, is the set [41] \ fqðJÞ : J is a solution for I w:r:t: Mg certainM ðqÞðIÞ ¼ The computational complexity of finding all certain answers is well understood for the data integration context with a two-tiered architecture of a mediator and a set of data sources [2]. The same problem has been deeply investigated also in data exchange settings [3]. In both cases, computing the certain answers of unions of conjunctive queries has been proved to be done with polynomial time data complexity. The same computational results have been proved for conjunctive queries in the context of PDMSs, under specific constraints on mappings [22]. Given a BIN query qt, the following theorem shows that the algorithm we have proposed for reformulating qt into qs is sound and complete, because evaluating qs always produces all and only certain answers to qt projected on the compatible part of the reformulation. Theorem 2. Let qt be a BIN query, qt1 be the output of Step 1 on qt, qðz,aggrExprÞ be the relational translation of qt1 , and z00 be the head variables of q that find a reformulation. The reformulation algorithm guarantees to find all certain answers of qðz00 ,aggrExprÞ. The proof of the theorem above (Appendix A) states that the main difference between our algorithm and the reference algorithms for query reformulation (e.g., [22]) lies in the selection of mappings. Indeed, differently from the logical approach usually adopted, we implement a policy for mapping selection that is syntactically driven (Step 1). The reason why we adopt a different approach is that the BIN framework allows for the specification of approximate mappings (type t3 ), i.e., mappings that induce no dependencies between the source and the target tuples. Such mappings are neglected by traditional algorithms because the heads of their views are empty. The syntactic approach we adopt, instead, allows not only the selection of mappings of types t1 and t2 , as in traditional algorithms, but justly also those of type t3 as they state syntactic transformations that will be exploited in the actual reformulation phase. In this way, besides being correct, our algorithm also satisfies all the requirements listed in Section 4.
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
7. Implementation issues and future work Extending the PDMS paradigm to the BI context is a challenging task that lays the foundations for BI 2.0. After introducing BINs as an architecture to support sharing of information for decision-making processes, in this paper we made a first, significant step in that direction by addressing the core task in a BIN framework: query reformulation. In particular, we introduced a mapping language and we proposed a correct reformulation algorithm. Remarkably, though reformulation obviously introduces some overhead, this is definitely negligible as compared to the costs related to retrieving data from local multidimensional engines, and transmitting them across the network. To give a complete picture, here we discuss the main implementation issues raised by reformulation in a BIN, namely: how to bridge the language and expressiveness gap between the query handler and the local multidimensional engine (i.e., how to deal with intra-peer reformulation) and how to manage transcodings and share them among peers. As to the first problem, we observe that in general a BIN query (either directly formulated by a user or reformulated across the network) cannot be directly executed on the peer local multidimensional engine. The OLAP adapter is in charge of bridging this gap by supporting intra-peer reformulation of BIN queries. Assuming that the de-facto standard MDX is the querying language of the local multidimensional engine, intra-peer reformulation must deal with the presence of transcodings in the query group-by set, and must properly manage non-distributive aggregation operators. From the reformulation point of view, this amounts to solving a problem of query rewriting using views [21], where the set of views is made of all the possible queries that the engine supports. In particular, given the relational translation q on a BIN query, we have to find a local query ql that refers to one view and is equivalent to q. Differently from classical approaches, in our case the required view V is constructed on-the-fly based on the body of q and on the properties of the involved aggregation operators. To this end we recall that, while for a distributive aggregation operator (like sum) an aggregation can be correctly computed from pre-aggregates using one aggregation operator, for an algebraic operator (like avg) this requires the computation of pre-aggregates using a tuple of aggregation operators and a function to combine them [19].4 More precisely, the group-by set of V is the projection obtained by ‘‘flattening’’ the group-by set of q (which means eliminating the transcodings), and the aggregations component of V computes the pre-aggregate values. Besides V, the body of ql contains the predicates that express transcodings, while its head computes the final aggregates by properly combining the pre-aggregates. As to the second problem, we saw that the OLAP adapter of a source peer may have to apply transcodings (i.e., functions) to the locally retrieved data, in order to express them in the target peer format and encoding. A trivial solution to make transcodings available to peers consists in 4 Holistic operators, such as mode and median, are not considered in this work because there is no way to correctly compute aggregates from pre-aggregates using these operators, which makes reformulation impossible in most practical cases.
409
assuming the existence of a shared library of transcodings. Unfortunately this solution is hardly feasible because the size of such library would be proportional to the number of mappings, thus decreasing the BIN scalability; furthermore, it does not comply with the security policies of the network since it would entail the existence of a knowledge shared by all the peers. For this reasons, we assume that transcoding functions can be classified into public and protected. Public transcodings are standard database functions (e.g., substring and dateOf) that are shared by all peers. Protected transcodings (e.g., regionOf) are owned by a peer, that will make them available to its neighboring peers by attaching them to query messages. If protected transcodings are expressed as procedures, a shared programming language must be available in the BIN. Otherwise, transcodings can be expressed as look-up tables to be applied by a relational engine. In this case, an obvious drawback is the quantity of information to be transmitted over the network. To reduce such overhead, look-up tables for protected transcodings of mappings from p1 to p2 should be stored in p2 (i.e., the peer playing the role of the target in query reformulation) so that, when multiple transcodings are nested (e.g., nationOfðregionOfðÞÞ) due to chain reformulations, the composition can be solved in p2 and only the resulting lookup table is sent across the network. Example 14. The query q2 shown in Example 3 requires the involved measure to be averaged by year and regionOf(patientCity). Computing the average of pre-aggregates requires that both a sum and a count are calculated. Considering that measure numAdmission actually stores a count, view V is VðY,P,sumðTÞ,sumðAÞÞ’AdmFTð_,D,_,P,_,_,_,_,T,_,AÞ, DateDTðD,_,YÞ, PatientCityDTðP,_Þ where the transcoding predicate R ¼ regionOfðPÞ is not included in V because it cannot be directly expressed in MDX. The corresponding local query is ql ðY,R,sumðsTÞ=sumðsAÞÞ’VðY,P,sT,sAÞ, R ¼ regionOfðPÞ To complete this example we show the MDX query corresponding to view V: SELECT
FROM
{[Measures].[totLength],[Measures]. [numAdmissions]} ON COLUMN, {NonEmptyCrossJoin([Date]. [year].Members,[Patient].[patientCity]. Members)} ON ROWS [Admissions]
and the SQL query corresponding to ql: SELECT FROM WHERE GROUP BY
RS.year, LT.region, sum (RS.totLength)/ sum(RS.numAdmissions) AS avgLength ResultSet RS, LookupTable LT RS.patientCity¼ LT.city RS.year, LT.region;
where ResultSet stores the result of the MDX query, whereas LookupTable corresponds to the regionOf() transcoding.
410
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
We close this section by mentioning our future work on BINs. Though query reformulation is at the core of a distributed approach, several other issues must be tackled to complete the BIN framework, namely: how to efficiently process queries across the network by applying routing strategies that select a subset of neighboring peers for reformulation, how to reconcile multidimensional data returned by different peers through object fusion techniques, how to rank peer results depending on how compliant they are with the original local query, and how to deal with security depending on the degree of trust between the BIN participants.
t1 and t2 (i.e., those that define z00 ) it can be easily shown that Step 1 follows the same selection principles adopted by usual approaches for answering queries using views [21], thus making the mapping selection phase equivalent. Finally, the computation of G at Step 3 has the objective of determining additional constraints between source variables induced by the mappings of types t1 and t2 . Without the merging of G with the body of the reformulated query q0 , q0 would return a superset of the answers, also including results that do not satisfy all the implicit constraints derived in G. Thus Step 3 is fundamental for determining the certain answers. &
Appendix A. Theorem proofs
References
Theorem 1. Let qt be a (target) BIN query and q0 be the output of the query reformulation algorithm when qt is given as input. Then, there exists a (source) BIN query qs such that q0 is the relational translation of qs. Proof. Let us consider q0 ðz0 ,expr 0 ða01 ðw1 0 Þ, . . . , a0v0 ðw0v0 ÞÞ’ body. Note that the way attribute mappings have been defined ensures that body follows a star form, i.e., it contains the star join that is necessary to relate the source fact table with the involved dimension tables. This implies that a BIN query must exist whose encoding is q0 . Then, without loss of generality, we assume that q0 is equipped with a variable assignment function n that can be easily derived from the variable assignment functions of q and of the involved mappings. With a little abuse of notation we will apply the inverse of n, n1 , also to atomic formulas, meaning that n1 is applied to its arguments. We define qs ¼ /Ds ,Gs ,ps ,expr s ,Ts S on the source schema Ds as follows: Gs : for each head variable v 2 z0 , check whether there is a comparison predicate v ¼ f ðv0 Þ in the body. If it exists, then add f ðn1 ðv0 Þ to Gs and delete the predicate from body, otherwise add n1 ðvÞ to Gs. ps : for each left predicate p in body, add n1 ðpÞ to ps. expr s : is the expression expr0 ðn1 ðw1 0 Þ, . . . , n1 ðw0v0 ÞÞ. Ts : is the list //n1 ðw1 0 Þ, a01 S, . . . ,/n1 ðw0v0 Þ, a0v0 SS Then, it can be easily shown that the relational translation of qs is q0 modulo variable mappings. & Theorem 2. Let qt be a BIN query, qt1 be the output of Step 1 on qt, qðz,aggrExprÞ be the relational translation of qt1 , and z00 be the head variables of q that find a reformulation. The reformulation algorithm guarantees to find all certain answers of qðz00 ,aggrExprÞ. Proof. In the following we show that our algorithm reduces to the algorithm for query reformulation shown in [22], which is proved to find all and only certain answers. Our s-t tgd’s are in GLAV-style. As in [22], we transform them in pairs of LAV and GAV s-t tgd’s; Step 2 deals with the set of GAV and LAV s-t tgd’s by first matching the LAV s-t tgd’s and then the GAV ones. The main difference lies in the selection of mappings. Indeed, our algorithm performs a syntactic selection whereas the algorithm in [22] refers to the logical form of the mappings. Nevertheless, for mappings of types
[1] S. Abiteboul, Managing an XML warehouse in a P2P context, in: Proceedings of the CAiSE, Klagenfurt, Austria, 2003, pp. 4–13. [2] S. Abiteboul, O.M. Duschka, Complexity of answering queries using materialized views, in: Proceedings of the PODS, Seattle, Washington, USA, 1998, pp. 254–263. [3] F.N. Afrati, P.G. Kolaitis, Answering aggregate queries in data exchange, in: Proceedings of the PODS, Vancouver, BC, Canada, 2008, pp. 129–138. ¨ [4] M.O. Akinde, M.H. Bohlen, T. Johnson, L.V.S. Lakshmanan, D. Srivastava, Efficient OLAP query processing in distributed data warehouses, Information Systems 28 (1–2) (2003) 111–135. [5] J. Albrecht, W. Lehner, On-line analytical processing in distributed data warehouses, in: Proceedings of the IDEAS, Cardiff, Wales, UK, 1998, pp. 78–85. [6] Y. An, A. Borgida, R.J. Miller, J. Mylopoulos, A semantic approach to discovering schema mapping expressions, in: Proceedings of the ICDE, Istanbul, Turkey, 2007, pp. 206–215. [7] B. Arai, G. Das, D. Gunopulos, V. Kalogeraki, Efficient approximate query processing in peer-to-peer networks, IEEE Transactions on Knowledge and Data Engineering 19 (7) (2007) 919–933. [8] M. Arenas, L.E. Bertossi, J. Chomicki, X. He, V. Raghavan, J. Spinrad, Scalar aggregation in inconsistent databases, Theoretical Computer Science 296 (3) (2003) 405–434. [9] M. Banek, B. Vrdoljak, A.M. Tjoa, Z. Skocir, Automated integration of heterogeneous data warehouse schemas, International Journal of Data Warehousing and Mining 4 (4) (2008) 1–21. [10] K.C. Chang, H. Garcia-Molina, Mind your vocabulary: query mapping across heterogeneous information sources, in: Proceedings of the SIGMOD, Philadelphia, PA, USA, 1999, pp. 335–346. [11] S. Cohen, W. Nutt, Y. Sagiv, Rewriting queries with arbitrary aggregation functions using views, ACM Transactions on Database Systems 31 (2) (2006) 672–715. [12] R. Fagin, P.G. Kolaitis, R.J. Miller, L. Popa, Data exchange: semantics and query answering, in: Proceedings of the ICDT, Siena, Italy, 2003, pp. 207–224. [13] R. Fagin, P.G. Kolaitis, L. Popa, Data exchange: getting to the core, ACM Transactions on Database Systems 30 (1) (2005) 174–210. [14] A. Fuxman, E. Fazli, R. Miller, Conquer: efficient management of inconsistent databases, in: Proceedings of the SIGMOD, Baltimore, MD, USA, 2005, pp. 155–166. [15] A. Fuxman, P.G. Kolaitis, R.J. Miller, W.-C. Tan, Peer data exchange, in: Proceedings of the PODS, Baltimore, MD, USA, 2005, pp. 160–171. [16] M. Golfarelli, F. Mandreoli, W. Penzo, S. Rizzi, E. Turricchia, Towards OLAP query reformulation in peer-to-peer data warehousing, in: Proceedings of the DOLAP, Toronto, Ontario, Canada, 2010, pp. 37–44. [17] M. Golfarelli, F. Mandreoli, W. Penzo, S. Rizzi, E. Turricchia, BIN: Business intelligence networks, in: Business Intelligence Applications and the Web: Models, Systems and Technologies, IGI Global, 2011, doi:10.4018/978-1-61350-038-5. [18] M. Golfarelli, S. Rizzi, P. Biondi, myOLAP: an approach to express and evaluate OLAP preferences, IEEE Transactions on Knowledge and Data Engineering 23 (7) (2011) 1050–1064. [19] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, H. Pirahesh, Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub totals, Data Mining and Knowledge Discovery 1 (1) (1997) 29–53. [20] A. Gupta, V. Harinarayan, D. Quass, Aggregate-query processing in data warehousing environments, in: Proceedings of the VLDB, Zurich, Switzerland, 1995, pp. 358–369.
M. Golfarelli et al. / Information Systems 37 (2012) 393–411
[21] A. Halevy, Answering queries using views: a survey, VLDB Journal 10 (4) (2001) 270–294. [22] A. Halevy, Z. Ives, D. Suciu, I. Tatarinov, Schema mediation for largescale semantic data sharing, VLDB Journal 14 (1) (2005) 68–83. [23] A.Y. Halevy, Technical perspective—schema mappings: rules for mixing data, Communications of the ACM 53 (1) (2010). [24] A.Y. Halevy, Z.G. Ives, J. Madhavan, P. Mork, D. Suciu, I. Tatarinov, The piazza peer data management system, IEEE Transactions on Knowledge and Data Engineering 16 (7) (2004) 787–798. [25] T.A.D. Hoang, T.B. Nguyen, State of the art and emerging rule-driven perspectives towards service-based business process interoperability, in: Proceedings of the International Conference on Computer and Communication Technology, Danang City, Vietnam, 2009, pp. 1–4. [26] H. Jiang, D. Gao, W.-S. Li, Exploiting correlation and parallelism of materialized-view recommendation for distributed data warehouses, in: Proceedings of the ICDE, Istanbul, Turkey, 2007, pp. 276–285. [27] P. Kalnis, W.S. Ng, B.C. Ooi, D. Papadias, K.-L. Tan, An adaptive peerto-peer network for distributed caching of OLAP results, in: Proceedings of the SIGMOD, Madison, Wisconsin, USA, 2002, pp. 25–36. [28] M. Kehlenbeck, M.H. Breitner, Ontology-based exchange and immediate application of business calculation definitions for online analytical processing, in: Proceedings of the DaWaK, Linz, Austria, 2009, pp. 298–311. [29] A. Kementsietsidis, M. Arenas, R. Miller, Mapping data in peer-topeer systems: semantics and algorithmic issues, in: Proceedings of the SIGMOD, San Diego, California, USA, 2003, pp. 325–336. [30] J. Lachlan, Top 13 business intelligence trends for 2011. /http:// www.japan.yellowfin.biS, 2011. [31] M. Lenzerini, Data integration: a theoretical perspective, in: Proceedings of the PODS, Madison, WI, USA, 2002, pp. 233–246.
411
[32] F. Mandreoli, R. Martoglia, W. Penzo, S. Sassatelli, SRI: Exploiting semantic information for effective query routing in a PDMS, in: Proceedings of the WIDM, Arlington, Virginia, USA, 2006, pp. 19–26. [33] F. Mandreoli, R. Martoglia, W. Penzo, S. Sassatelli, Data-sharing P2P networks with semantic approximation capabilities, IEEE Internet Computing 13 (5) (2009) 60–70. [34] F. Mandreoli, R. Martoglia, W. Penzo, S. Sassatelli, G. Villani, SRI@work: efficient and effective routing strategies in a PDMS, in: Proceedings of the WISE, Nancy, France, 2007, pp. 285–297. [35] G. Mecca, P. Papotti, S. Raunich, Core schema mappings, in: Proceedings of the SIGMOD, Providence, RI, USA, 2009, pp. 655–668. [36] Microsoft. MDX reference. /http://msdn.microsoft.com/en-us/ library/ms145506.aspxS, 2009. [37] R.J. Miller, L.M. Haas, M.A. Herna´ndez, Schema mapping as query discovery, in: Proceedings of the VLDB, Cairo, Egypt, 2000, pp. 77–88. [38] W. Penzo, Rewriting rules to permeate complex similarity and fuzzy queries within a relational database system, IEEE Transactions on Knowledge and Data Engineering 17 (2) (2005) 255–270. [39] L. Popa, Y. Velegrakis, R.J. Miller, M.A. Herna´ndez, R. Fagin, Translating web data, in: Proceedings of the VLDB, Hong Kong, China, 2002, pp. 598–609. [40] N. Raden, Business intelligence 2.0: simpler, more accessible, inevitable /http://intelligent-enterprise.informationweek.comS, 2007. [41] B. ten Cate, P.G. Kolaitis, Structural characterizations of schemamapping languages, Communications of the ACM 53 (1) (2010) 101–110. [42] R. Torlone, Two approaches to the integration of heterogeneous data warehouses, Distributed and Parallel Databases 23 (1) (2008) 69–97. [43] A. Vaisman, M.M. Espil, M. Paradela, P2P OLAP: data model, implementation and case study, Information Systems 34 (2) (2009) 231–257. [44] S. Wu, S. Jiang, B. Ooi, K.-L. Tan, Distributed online aggregation, in: Proceedings of the VLDB, Lyon, France, vol. 2, 2009, pp. 443–454.