The data richness estimation framework for federated data warehouse integration


Rafał Kern, Adrianna Kozierkiewicz, Marcin Pietranik*

Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wroclaw, Poland

*Corresponding author. E-mail addresses: [email protected] (R. Kern), [email protected] (A. Kozierkiewicz), [email protected] (M. Pietranik).

Article history: Received 10 December 2018; Revised 16 August 2019; Accepted 25 October 2019

Keywords: Data warehouse integration; Knowledge management; Consensus theory

https://doi.org/10.1016/j.ins.2019.10.046

Abstract

A federated data warehouse is a tool that provides an end-user with a unified perspective on a finite set of independent data warehouses. This requires creating a global schema from partial schemas, which remains purely virtual and is a result of the iterative integration of the participating data warehouses. It is then used to present the aforementioned set of participating warehouses as effectively one "super" data warehouse exposed to the end-user. In this paper, the authors present a framework that can be used to evaluate the profitability of adding a new data warehouse to an existing federation, in terms of increased data richness and expressiveness. Solid formal foundations are provided, along with heuristic algorithms, an experimental verification (which involved two different experimental procedures) and a statistical analysis of the obtained results.

1. Introduction

In the modern age, where the most valued resource is information, companies build their assets and management strategies around data and data processing. The data may describe many aspects that need to be analyzed, such as current market trends, the state of competing companies, a variety of key performance indicators, etc. Every management board wants some kind of dashboard where any statistical summary can be displayed. If the presented data are not valid or up to date, then the given summaries may entail drawing wrong conclusions and, therefore, wrong decisions concerning the future of the company.

If a particular company is small, and the market of interest is not differentiated, then the task of gathering valuable data can be simple. The data may be stored in one database that serves as a single source of truth, while their validity and topicality can be assured using basic means, such as database normalization or transactional operations. If the company is more diversified, processing the data created during its functioning becomes more demanding, which entails some difficulties. For example, it involves dealing with a situation in which interesting data are spread across several relational databases. This issue is closely related to creating data warehouses using Extract-Transform-Load (ETL) procedures. However, as a company's complexity increases (e.g. during the merging of two brands), another issue appears. While creating a single data warehouse from independent databases is a quite common procedure, the task of taking a set of data warehouses and creating a single, virtual data warehouse (referred to as a federation of data warehouses) is still an open topic.


Fig. 1. Example of three warehouses’ schemas.

Most importantly, it entails dealing with problems related not to providing a single source of truth but, due to the distributed character of such structures, a single version of truth. Such an approach, while tempting during theoretical considerations, may entail a series of difficulties. In our previous research [13], we created a framework for integrating a set of data warehouses into such a federation. The provided method was based on a greedy algorithm that iteratively adds members of the processed set of warehouses into the unified view of a federation. Although it was experimentally verified in terms of providing valid data, it requires a tremendous amount of time and computational resources to assert its correctness. This entails that using it in an enterprise environment may be too expensive, which in consequence may forfeit the possibility of completely utilizing all available data. Moreover, our framework is based on the assumption that it is necessary to iteratively integrate the whole set of participating data warehouses into the final federation. So is it possible to have some kind of indicator that could be used as a stop condition during the integration, when the obtained partial result (a federation built only from a subset of the available component warehouses) is good enough? Such an indicator could also answer questions about the order in which subsequent participating warehouses should be added: which of them is more profitable to integrate first in terms of gained data?

Example 1. Consider three data warehouse schemas (referred to as A, B, and C) shown in Fig. 1. Let us assume that warehouse A is an initial schema, to which we can add warehouses B and C. Intuitively, adding warehouse B is more profitable due to the fact that two of its tables (Date, Product) are repetitions of tables available in warehouse A, but two tables (SalesRepresentative, CompanyClient) can deliver completely new data. Having in mind the number of tables in both warehouses, one could estimate the increase of schema complexity at 25%. From that point of view, warehouse C offers only one table with new data (Manufacturer) and is repetitive in two (Date, Product). Therefore, the data gain achieved by integrating it with warehouse A is much lower than for warehouse B, and one could estimate it at 15%. The values above are purely intuitive estimations with no mathematical backbone whatsoever. However, they clearly illustrate the need for such a tool - this information can be used twofold. Firstly, to designate the order in which subsequent data warehouses will be added to the federation, to achieve the maximal gain of schema complexity during the initial phases of creating a federation. Secondly, it can be used to exclude from the integration warehouses that are redundant in terms of their schemas.


Our other research [14-16] was focused on creating a knowledge increase estimation framework used during ontology integration. Its main purpose is answering a question concerning what has been achieved thanks to the performed integration. In other words - was it worthwhile to allocate resources in a time- and cost-consuming ontology merging procedure? We achieved this goal by developing a set of real-valued functions that can serve as indicators of the integration profitability. Thus, we have adapted the idea of estimating the knowledge increase that appears during the integration to the area of federated data warehouses. Obviously, we cannot use the term "knowledge" in this context. Therefore, we have developed the notion of "data richness" that could be used in the context of data integration in federated data warehouses to reflect the increasing expressivity and, literally, the richness of the processed data. To the best of our knowledge, this problem is not widely investigated.

The task of estimating the potential data richness increase during the described integration can be formally defined as follows: for a given set of data warehouses participating in a federation, denoted as $H = \{H_1, \ldots, H_N\}$, and a schema integration algorithm $Int(F_{i-1}, H_i)$ that returns the $i$-th state of the federation's schema (denoted as $F_i$, with $F_0$ denoting an empty federation with an empty schema) after integrating the $(i-1)$-th state with the $i$-th warehouse from the set $H$, one should determine two functions: (i) $\delta_D: H \to \mathbb{R}$, which represents the increase of data richness gained as a result of the performed $i$-th step of the integration of dimensions, and (ii) $\delta_M: H \to \mathbb{R}$, which represents the increase of data richness after the integration of measures.

The integration algorithm is based on an iterative addition of subsequent warehouses to the created structure. It is based on comparing the measures and dimensions used in the schemas of the warehouses involved in the federation. The similarity between these elements (calculated using auxiliary tools) designates which of the measures or dimensions present in the warehouse added to the federation are new to this federation and which are a repetition of already existing components. Due to the limited space, we are not able to present this procedure in detail. Obviously, data warehouses are built not only from schemas but also from the actual data they contain. However, we claim that creating a framework operating on the level of schemas should be the first step in developing a generic tool for calculating the increase of data richness in federated data warehouses. We will address the issue of the increase of data richness on the level of the content of data warehouses in upcoming publications.

The article is structured as follows. In the next section, an overview of other research that has been done in the field is given. Section 3 contains all of the basic notions that will be used throughout the subsequent parts of the article. Section 4 includes a description of the general idea of federated data warehouse integration and the main contribution of the following article. The task at hand must be decomposed into subtasks concerning different elements of data warehouses: measures, dimensions, and facts. However, the following paper is devoted only to the schema-related elements of data warehouses, which are measures and dimensions. Therefore, in parts 4.1 and 4.2 of the article we will describe functions $\delta_D$ and $\delta_M$ respectively.
Their usefulness has been verified using an experimental procedure described in Section 5, which also contains a statistical analysis of the collected results. Section 6 serves as a summary and sheds some light on the authors' upcoming research plans.

2. Related works

To obtain practical business value, many companies need to deal with processing, integrating and managing large sets of data and information [25]. However, the problem of managing data quality in federated data warehouses is still open and has not been widely researched. Our preliminary research has raised the subject of estimating a data richness increase in the case of data integration [14-16]. The mentioned papers have been devoted to an easy and cheap method of estimating the profitability of the ontology integration process, which could be based on any of the semantic similarity calculation methods (such as [26]). A novel measure that allows estimating a potential growth of knowledge after the integration of a set of ontologies on a concept, an instance and a relation level has been proposed. The conducted experiments (followed by a statistical analysis of the obtained results) demonstrated how the developed measures reflect the way people evaluate the considered knowledge increase. The obtained results, referring to ontology integration, demonstrated to us that our research direction is promising and worthy of detailed examination for other knowledge structures like data warehouses.

To the best of our knowledge, the problem of estimating the profitability of adding a new data warehouse to an existing federation has not been raised in the literature. The most popular topic focuses on quality issues in a data warehouse. In [23] the authors mentioned eight different factors affecting the data quality in such an environment that could be considered to estimate the usefulness of data for further analysis and decision making. In [17] the authors developed twenty-three quality metrics used in data warehouses. They include functionality, reliability, usability, efficiency, maintainability, portability, accessibility, accuracy, consistency, security, compliance, recoverability, analyzability, changeability, testability, installability, implementation efficiency, system availability, currency, volatility, completeness, credibility, and data interpretability. The paper [11] focuses on practical applications and proposes a probability-based metric that allows measuring data quality in terms of its semantic consistency. The developed framework is based on statistical tests and has a clear interpretation. However, most of the mentioned quality metrics refer directly to the data stored in data warehouses and do not consider the schema level.

The schema level has been considered in other works such as [6,10,21]. In [6] the metrics were divided into three different categories: table, star, and schema level. For the first level, the number of attributes of a table and the number of foreign keys in a table were proposed. Eight star metrics were designed, such as: the number of dimension tables of a star, the number of tables of a star, the number of attributes of a dimension table of a star, the number of attributes plus the number of foreign keys of a fact

table, the number of attributes of a star, the number of foreign keys of a star, and the ratio of star attributes to foreign keys. The last category contained fourteen schema metrics, which were considered in detail in [21]. In the further work of Serrano [22], only two of the metrics have been verified: NFT, the number of fact tables of the schema, and NDT, the number of dimension tables of the schema. This set of validated metrics for measuring data warehouse quality can be used for choosing the best option if more than one alternative data warehouse design is considered. In [10] the authors used and validated the following metrics: NFC, the number of fact classes in the schema; NDC, the number of dimension classes in the schema; NBS, the number of base classes present in the schema; NC, the total number of classes in the schema; RBC, the total number of base classes divided by the total number of dimension classes; NAFC, the number of fact attributes of fact classes of the schema; NADC, the number of dimension attributes of dimension classes of the schema; NABC, the number of attributes of base classes of the schema; NA, the total number of attributes, calculated as NA = NAFC + NADC + NABC; NH, the number of hierarchies present in the schema; DHP, the maximal depth of the hierarchy relations of the schema; RSA, the number of fact attributes divided by the number of dimensions and their attributes. Empirical validation of these metrics showed a good relationship between the metrics and the selected quality attributes: effectiveness and understandability. Although the mentioned measures are closely related to the scope of this paper, they are determined separately by counting a number of facts, attributes, schemas, tables, etc., and do not give a total view of the profitability of the integration process.

The results of research on data quality in data warehouses have been applied in some systems. AQUAWARE [2] is a computational environment used as an infrastructure for feeding quality information to data warehousing client tools. Queries launched through AQUAWARE enable the end-user to evaluate to what extent it is safe to trust the returned information, providing more reliability to the decision-making process. AQUAWARE applies some typical quality measures like syntactic accuracy, semantic accuracy, completeness, and currency. DWQ [12] provides a framework and modeling formalisms - these development and maintenance tools can be tuned to particular application domains and levels of expected quality. It gives data warehouse designers some assistance by linking the main components of a data warehouse's reference architecture to a formal model of data quality. The solutions described above cannot be simply used to estimate the quality of a federated data warehouse and the accompanying integration process. The first papers devoted to this problem were created in the late '90s. Both systems concentrate on stored data quality, and the schema level is omitted. On the other hand, the problem of estimating data richness for a federated data warehouse should be considered in conjunction with data integration. As mentioned before, many of the described measures refer to the quality of data stored in a ready data warehouse or federation and do not take into account the situation when a warehouse is added to or removed from the federation.
Some papers like [4,24] touch upon the issue of data warehouse integration by introducing the desirable properties of coherence, soundness and consistency that "good" matchings between dimensions should satisfy. However, the presented properties refer to dimension compatibility, in order to guarantee the correctness of aggregate operations, and not to estimating the data enrichment equated with the profit from the integration process. The paper [8] is closely related to the problem raised in this work; it is devoted to source selection concerning data fusion. The authors proposed algorithms to estimate fusion accuracy and to select sources that maximize profit. Similar results have been obtained by Gertz [9]. The author proposed a basic quality dimension that consists of attributes like accuracy, completeness, timeliness, availability, reliability and data volume. These tools were used for specifying the data quality related to query goals for global applications and to dynamically integrate data from component databases that may differ in their quality. The mentioned metrics like accuracy, reliability or completeness are typical in data integration problems. However, as was mentioned before, determining the value of these measures requires instance-level information.

The subjective quality of data integration has been deliberated in [1]. Three quality dimensions were defined, namely: accuracy, defined as the extent to which information is correct and reliable; believability, defined as the extent to which information is regarded as true and credible; and added value, defined as the extent to which information is beneficial and provides advantages for its use. However, these dimensions typically require a user's opinion and do not have a clear mathematical technique for finding their value. General research about information quality analysis in data integration systems, particularly in the context of the integrated schema, is presented in [3]. The paper discussed the evaluation of schema quality focusing on the minimality aspects. In [20] the problem of data quality evaluation in a data integration system is also addressed. The authors mentioned a federated data warehouse as an example of such a system and defined some metrics. They include: currency, the time elapsed since data was extracted from the source; obsolescence, the number of update operations on a source since data extraction time; freshness rate, the percentage of tuples in the view that are up-to-date; and timeliness, the time elapsed from the last update to a source. Data freshness and accuracy have also been raised in [18]. Despite many advantages and a certain simplicity, these functions have one serious defect - all of them require the integration to be performed beforehand, and only afterward can they be used to evaluate the obtained results. In many cases the integration is very complicated and time- and cost-consuming. Data richness on the schema level directly affects data richness on the instance level. While adding new columns during the integration process is not yet expensive, populating many records is already costly. Therefore, working out methods that allow estimating the potential gain of the integration process is desirable.

The authors of [5] provided an architecture for data quality enhancement and awareness in the complex processes of integration and warehousing of biomedical data. They identified several criteria of information quality assigned to the data extracted from biological databases.
These criteria were classified into three sets: (i) bio-knowledge-based quality criteria

such as originality or the domain authority of the authors who submitted the sequence; (ii) schema-based quality criteria such as local and global completeness, level of detail, and intra- and inter-record redundancy; (iii) contextual quality criteria such as freshness or consolidation degree. Some of the mentioned measures are taken from the information retrieval field and do not consider the specific problems of the data warehouse integration process. There are some publications [13,19] focusing on similarity measures that are used in the database or warehouse integration process. However, none of these works allows estimating the data richness in a situation of adding a new warehouse to a federation or removing one. In this paper, we would like to propose algorithms for data richness estimation without the flaws described above.

3. Definition of a federated data warehouse and its core components

We describe a data warehouse's schema as:

$$H_i = \{D_0^i, D_1^i, \ldots, D_{\alpha_i}^i\} \qquad (1)$$

where $i$ is a data warehouse's identifier, and $D_0^i$ denotes the fact table of the $i$-th data warehouse, defined as $D_0^i = \{m_1^i, \ldots, m_{\beta_i}^i\}$, where $\beta_i$ is the number of measures in $D_0^i$ and $m_z^i$ is the $z$-th measure in the data warehouse $H_i$. For $j \in [1, \alpha_i]$ we denote the $j$-th dimension in the data warehouse $H_i$ as $D_j^i = \{a_{j_1}^i, \ldots, a_{j_{\gamma_i}}^i\}$, where $\gamma_i$ is the number of attributes in $D_j^i$. We denote the $z$-th attribute of the $j$-th dimension of data warehouse $H_i$ as $a_{j_z}^i$.

The federation of data warehouses described according to Eq. (1) is defined as:

$$\hat{F} = (F, H, U, q, l) \qquad (2)$$

where $F$ is the federation's schema, $H$ is a set of participating data warehouses (such that $|H| = N$), $U$ is a user's or application's interface, $q$ is a query decomposition procedure, and $l$ is a responses' integration procedure. Moreover, a query language $L_Q$ is defined as a part of $U$. The data warehouse federation's schema is defined similarly to the data warehouse schema:

$$F = \{D_0, D_1, \ldots, D_n\} \qquad (3)$$

where $D_0$ is the schema of the fact table and $D_j$ denotes the schema of the $j$-th dimension, for $j \in [1, n]$. Between every dimension $D_j$ and the fact table $D_0$ there is a relation of type $1{-}\infty$, which means that one row from the dimension table may be associated with more than one fact, but one fact is associated with exactly one row from the given dimension. Each attribute from any dimension, and any measure from the fact table of the federation, must come from at least one local data warehouse. A dimension's attribute is described as:

$$a = (name_a, ref) \qquad (4)$$

where $name_a$ is an attribute's identifier in the federation's schema and $ref$ is a list of pairs of the form $(D_j^i, name)$ that match the local dimensions' attributes: $D_j^i$ identifies a local warehouse dimension and $name$ its local name. Due to the fact that we do not consider the level of warehouses' content, but only the level of their schemas, we do not define the federation's measures.

We assume the existence of the following similarity functions:

• $MS: M_F \times M_H \to [0, 1]$ is a similarity between a measure taken from the federation's schema (where $M_F$ denotes the set of all measures in a federation) and a measure of some participating data warehouse (where $M_H$ denotes the set of all measures of participating warehouses).
• $DS: \{D_1, \ldots, D_n\} \times \{D_1^i, \ldots, D_{\alpha_i}^i\} \to [0, 1]$ is a similarity between a dimension taken from the federation's schema $F$ and a dimension of the $i$-th participating data warehouse.

Both of these functions, MS and DS, are built on top of comparing auxiliary annotations bound to particular measures and dimensions. These annotations can be understood as additional descriptions of these components of the data warehouse. For the simplest case, it can be assumed that these similarity functions are based on comparing sets of attributes' names of dimensions and measures. Due to the limited scope available for this paper, for their formal definitions please refer to our other publications [13]. For simplicity, we also assume the following:

• $m \equiv m' \iff MS(m, m') = 1$, which denotes that two measures are equivalent
• $D_j \sim D_x^i \iff DS(D_j, D_x^i) \geq t$, which denotes that two dimensions are sufficiently similar, where $t$ is some assumed threshold such that $t \in [0, 1]$

$F_0$ denotes an empty federation (in its initial state) that contains an empty schema. $\widetilde{D}_i$ is the set of dimensions of a federation in the $i$-th iteration of the integration algorithm. For $i = 0$ (in the initial state) $\widetilde{D}_0$ is an empty set ($\widetilde{D}_0 = \emptyset$). $H = (H_1, \ldots, H_N)$ denotes the set of participating data warehouses that will be added to the federation.
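To make the notions above concrete, the sketch below models warehouse schemas and the two similarity functions in Python. It is a minimal illustration under stated assumptions, not the authors' implementation from [13]: MS is reduced to name equality and DS to a Jaccard overlap of attribute-name sets (the "simplest case" mentioned above), and all class and function names are ours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dimension:
    name: str
    attributes: frozenset  # attribute names a^i_{j_z}

@dataclass(frozen=True)
class Warehouse:
    facts: frozenset       # D^i_0: names of the measures m^i_z
    dimensions: tuple      # D^i_1, ..., D^i_{alpha_i}

def MS(m_fed: str, m_wh: str) -> float:
    """Measure similarity; simplest case: name equality (m equivalent to m' iff MS = 1)."""
    return 1.0 if m_fed == m_wh else 0.0

def DS(d_fed: Dimension, d_wh: Dimension) -> float:
    """Dimension similarity; simplest case: Jaccard overlap of attribute names."""
    union = d_fed.attributes | d_wh.attributes
    return len(d_fed.attributes & d_wh.attributes) / len(union) if union else 1.0

T = 0.6  # threshold t for the criterion D_j ~ D^i_x (value taken from Table 1, Section 5)

def sufficiently_similar(d_fed: Dimension, d_wh: Dimension) -> bool:
    return DS(d_fed, d_wh) >= T
```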

4. The warehouse integration profitability estimation framework

As stated in the first section, our main goal is developing functions $\delta_D^i$, $\delta_M^i$ that could be used to represent the increase of data richness, on the level of dimensions and measures respectively, that has been gained as a result of adding a new data warehouse to an already existing federation. This division is caused by the fact that it is impossible to develop one omniscient method that could process and analyze all of the main elements of data warehouses (measures, dimensions) in a unified way. Therefore, the next two subsections of the paper will cover the definitions and characteristics of these functions. Eventually, we will present an experimental procedure that we have conducted to evaluate and verify our ideas.

4.1. Estimating an increase of data richness on the level of dimensions

The increase of data richness on the level of dimensions can be calculated using a function with the following signature:

$$\delta_D^i: 2^{\widetilde{D}_{i-1}} \times 2^{\{D_1^i, \ldots, D_{\alpha_i}^i\}} \to [0, 1] \qquad (5)$$

It estimates the gain of data richness that is achieved after the current state of the federation (in particular, its current set of dimensions, denoted as $\widetilde{D}_{i-1}$) is integrated with a data warehouse $H_i$ taken from the set of participants $H$ and its set of dimensions $\{D_1^i, \ldots, D_{\alpha_i}^i\}$. Such a function must meet the following postulates:

• P1. $\delta_D^i(\emptyset, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = 1$
• P2. $\delta_D^i(\{D_1^i, \ldots, D_{\alpha_i}^i\}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = 0$
• P3. $\forall d^i \in \{D_1^i, \ldots, D_{\alpha_i}^i\}\ \exists \tilde{d} \in \widetilde{D}_{i-1}: d^i \sim \tilde{d} \Rightarrow \delta_D^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) \leq t$
• P4. $\forall d^i \in \{D_1^i, \ldots, D_{\alpha_i}^i\}\ \exists \tilde{d} \in \widetilde{D}_{i-1}: DS(d^i, \tilde{d}) = 1 \Rightarrow \delta_D^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = 0$
• P5. $\forall d^i \in \{D_1^i, \ldots, D_{\alpha_i}^i\}\ \neg\exists \tilde{d} \in \widetilde{D}_{i-1}: d^i \sim \tilde{d} \Rightarrow \delta_D^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = 1$

P1 states that the gain of data richness after integrating some set of dimensions into an empty schema is always maximal and equal to 1. P2 describes that the gain of data richness after integrating the same set of dimensions twice is minimal and equal to 0. P3 states that the gain of data richness after integrating some set of dimensions into a set that contains somehow similar dimensions cannot be higher than the used threshold $t$. P4 tells that the gain of data richness after integrating some set of dimensions into a set that contains dimensions with maximal similarity is minimal and equal to 0. P5 states that the gain of data richness after integrating some set of dimensions into a set that does not contain any similar dimensions is maximal and equal to 1.

Having the above characteristics in mind, the developed function $\delta_D^i$ must be defined in a way which allows solving the problems appearing when one of two situations occurs during the addition of a new warehouse to the federation:

1. Adding a completely new dimension into the federation, thus adding a completely new set of data. In such a situation, the gain of knowledge should be maximal according to postulates P1 and P5. These added dimensions are a pure gain of data richness - their interpretation is straightforward and can be reflected using the following set:

$$X_D^{i+} = \left\{ d^i \in \{D_1^i, \ldots, D_{\alpha_i}^i\} \mid \neg\exists \tilde{d} \in \widetilde{D}_{i-1}: d^i \sim \tilde{d} \right\} \qquad (6)$$

2. Extending a dimension that already existed within the federation with some new attributes, according to postulates P3 and P4. In such a case, the function $\delta_D^i$ needs to include a measure of similarity between two dimensions to provide a precise interpretation of the considered integration. These dimensions are included in a set defined as:

$$X_D^{i-} = \left\{ d^i \in \{D_1^i, \ldots, D_{\alpha_i}^i\} \mid \exists \tilde{d} \in \widetilde{D}_{i-1}: d^i \sim \tilde{d} \right\} \qquad (7)$$

Obviously, both sets $X_D^{i+}$ and $X_D^{i-}$ can be mutually designated using a simple set difference, e.g. $X_D^{i-} = \{D_1^i, \ldots, D_{\alpha_i}^i\} \setminus X_D^{i+}$. By analogy, we decomposed the eventual function $\delta_D^i$ into two components that operate on the sets defined in Eqs. (6) and (7).

$$\delta_{SD+}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = \frac{|X_D^{i+}|}{|\{D_1^i, \ldots, D_{\alpha_i}^i\}|} \qquad (8)$$

Eq. (8) defines what we consider a subjective gain of data richness, obtained by the $i$-th warehouse. In other words, it reflects how profitable joining the federation is for the $i$-th warehouse.

$$\delta_{OD+}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = 1 - \frac{|\widetilde{D}_{i-1}|}{|\widetilde{D}_{i-1}| + |X_D^{i+}|} \qquad (9)$$


Eq. (9) above defines an objective gain of data richness, gained by the federation thanks to the integration of the $i$-th partial warehouse. In other words, it reflects what the federation has gained from adding a new member to its structure. The second factor, concerning extending a dimension that already existed within the federation, can be defined as follows:

$$\delta_{D-}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = \frac{\sum_{d^i \in X_D^{i-}} \left(1 - \max_{d^j \in \widetilde{D}_{i-1}} DS(d^j, d^i)\right)}{|X_D^{i-}|} \qquad (10)$$

Obviously, for the maximal threshold $t$ (used to check if two dimensions are sufficiently similar), the function $\delta_{D-}^i$ is always equal to 0. The definition incorporates the function DS introduced in Section 3, which calculates the similarity between dimensions and is based on comparing the additional annotations that describe them.

Theorem 1. For any federation the following condition is true:

$$\widetilde{D}_{i-1} \subseteq \{D_1^i, \ldots, D_{\alpha_i}^i\} \iff \delta_{SD+}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = \delta_{OD+}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) \qquad (11)$$

where $i \in \{2, \ldots, N\}$.

Proof. Let us assume that $\delta_{SD+}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = \delta_{OD+}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\})$. Then

$$\frac{|X_D^{i+}|}{|\{D_1^i, \ldots, D_{\alpha_i}^i\}|} = \frac{|X_D^{i+}|}{|\widetilde{D}_{i-1}| + |X_D^{i+}|} \iff |\{D_1^i, \ldots, D_{\alpha_i}^i\}| = |\widetilde{D}_{i-1}| + |X_D^{i+}| \iff |\widetilde{D}_{i-1}| = |\{D_1^i, \ldots, D_{\alpha_i}^i\}| - |X_D^{i+}| = |X_D^{i-}|$$

It means that $\widetilde{D}_{i-1} \subseteq \{D_1^i, \ldots, D_{\alpha_i}^i\}$. In the opposite direction, the proof of the theorem is done in an analogical way. □

Theorem 1 tells us that if the set of dimensions of the $i$-th data warehouse includes all dimensions of the federation in the $(i-1)$-th iteration, $\widetilde{D}_{i-1}$, then the objective and the subjective gains of data richness are the same. In other words, the integration brings the same benefits both for the $i$-th data warehouse and for the federation. This theorem allows us to conduct calculations for only one of the mentioned measures.

Theorem 2. For any federation the following condition is always true:

$$\delta_{D-}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) \leq (1 - t) \qquad (12)$$

Proof. We assume that $DS(D_j, D_x^i) \geq t$ denotes that two dimensions are sufficiently similar, where $t$ is some accepted threshold such that $t \in [0, 1]$. Therefore,

$$\delta_{D-}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\}) = \frac{\sum_{d^i \in X_D^{i-}} \left(1 - \max_{d^j \in \widetilde{D}_{i-1}} DS(d^j, d^i)\right)}{|X_D^{i-}|} \leq \frac{\sum_{d^i \in X_D^{i-}} (1 - t)}{|X_D^{i-}|} = \frac{|X_D^{i-}| \cdot (1 - t)}{|X_D^{i-}|} = 1 - t \qquad \square$$

Theorem 2 gives us the upper limit of the factor $\delta_{D-}^i(\widetilde{D}_{i-1}, \{D_1^i, \ldots, D_{\alpha_i}^i\})$. As we can see, it depends on the assumed threshold $t$ used to check if two dimensions are sufficiently similar. For a higher threshold $t$ we obtain a lower value of $\delta_{D-}^i$; in other words, if we accept a smaller number of dimensions as similar, then $\delta_{D-}^i$ is smaller. Eventually, no matter which function (from Eq. (8) or Eq. (9)) is chosen, the final form of $\delta_D^i$ is given as a weighted sum of its two components:

$$\delta_D^i = w^+ \cdot \delta_{D+}^i + w^- \cdot \delta_{D-}^i \qquad (13)$$

Obviously $w^+, w^- \in [0, 1]$ and $w^+ + w^- = 1$. These weights can be used to easily adjust the importance of both components, which clearly depends on the different requirements that may appear in particular applications of the considered federated data warehouse model. The impact they may have on the eventual schema integration is beyond the scope of this paper. In the further parts (especially in Section 5) we will assume that $w^+ = w^- = 0.5$.

Up until now, we dealt only with the increase of data richness gained through the integration of a single data warehouse into the federation, during subsequent steps of the integration. The eventual increase of the data richness from the perspective of the whole federation can be calculated using Algorithm 1, which constitutes a straightforward approach for the level of dimensions of a federation.

Algorithm 1 The increase of data richness on the level of dimensions.
Require: H (a set of data warehouses participating in a federation, |H| = N)
1: F_0 := ∅
2: D := 0
3: for i ∈ [1, N] do
4:     D := D + δ_D^i(D̃_{i-1}, {D_1^i, ..., D_{α_i}^i})
5:     F_i := Int(F_{i-1}, H_i)
6: end for
7: return D/N
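A direct Python transcription of Algorithm 1 might look as follows. This is a sketch of the control flow only: the integration operator Int and the per-step function delta_D (assembled from Eqs. (8)/(9), (10) and (13)) are assumed to be supplied by the caller, and the federation object is assumed to expose its current dimension set.

```python
def average_dimension_gain(warehouses, delta_D, Int, empty_federation):
    """Algorithm 1: the increase of data richness on the level of dimensions,
    averaged over the N subsequent integration steps."""
    F = empty_federation            # F_0: empty federation with an empty schema
    D = 0.0
    for H_i in warehouses:          # i in [1, N]
        D += delta_D(F.dimensions, H_i.dimensions)  # gain of the i-th step
        F = Int(F, H_i)             # F_i := Int(F_{i-1}, H_i)
    return D / len(warehouses)      # average gain over the N steps
```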


The resulting estimation of the increase of the data richness is obtained as an average gain originating from adding subsequent warehouses to the federation. The partial gains can be calculated using the subjective or the objective approach, by substituting the component $\delta_{D+}^i$ within the usage of $\delta_D^i$ in Line 4 with Eq. (8) or Eq. (9). Eventually, no matter which strategy is used, the final shape of the algorithm is the same.

Example 2. Let us suppose that we want to integrate the schemas of warehouses A and B presented in Fig. 1. Based on Algorithm 1, initially $F_0 = \emptyset$. In the first iteration, $F_0 = \emptyset$ is integrated with the schema of the warehouse $H_1$; therefore, we obtain $F_1 = $ (InternetSales, Date, IndividualClient, Product). From the point of view of the increasing content, the second iteration is more interesting. We need to combine $F_1$ with $H_2 = $ (SalesWithSalesRepresentative, Date, Product, SalesRepresentative, CompanyClient); thus $X_D^{2+} = \{SalesRepresentative, CompanyClient\}$ and $X_D^{2-} = \{Date, Product\}$. We can calculate: $\delta_{SD+}^2(\widetilde{D}_1, \{D_1^2, \ldots, D_4^2\}) = \frac{2}{4} = \frac{1}{2}$, $\delta_{OD+}^2(\widetilde{D}_1, \{D_1^2, \ldots, D_4^2\}) = 1 - \frac{3}{3+2} = \frac{2}{5}$, and $\delta_{D-}^2(\widetilde{D}_1, \{D_1^2, \ldots, D_4^2\}) = \frac{1}{4}$. Finally, if we assume $w^+ = w^- = \frac{1}{2}$, we obtain $\delta_{OD}^2 = \frac{1}{2} \cdot \frac{2}{5} + \frac{1}{2} \cdot \frac{1}{4} = \frac{13}{40} = 0.325$ and $\delta_{SD}^2 = 0.25$, which is the final value of the data richness increase on the level of dimensions. In other words, it represents that creating a federation from the independent warehouses from Fig. 1 resulted in 25% new content. Performing the same calculations for the integration of warehouses A and C results in $\delta_{OD}^2 = 0.16$ and $\delta_{SD}^2 = 0.125$, which is convergent with the intuitive approach presented in Section 1.
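The components of Eqs. (8)-(10) are small enough to sketch directly. In the snippet below, the DS values are assumptions chosen to reproduce the component values of Example 2 (the two Date dimensions identical, the two Product dimensions half-similar, and a threshold t = 0.5 assumed for this example); only the component values are checked, not the weighted totals.

```python
def delta_SD_plus(fed_dims, wh_dims, is_new):
    """Eq. (8): subjective gain -- the share of the incoming dimensions that are new."""
    return sum(map(is_new, wh_dims)) / len(wh_dims)

def delta_OD_plus(fed_dims, wh_dims, is_new):
    """Eq. (9): objective gain -- the relative growth of the federation's dimension set."""
    x_plus = sum(map(is_new, wh_dims))
    return 1 - len(fed_dims) / (len(fed_dims) + x_plus)

def delta_D_minus(fed_dims, wh_dims, is_new, DS):
    """Eq. (10): averaged attribute-level novelty of the repeated dimensions."""
    x_minus = [d for d in wh_dims if not is_new(d)]
    return sum(1 - max(DS(f, d) for f in fed_dims) for d in x_minus) / len(x_minus)

# Example 2 (warehouses A and B from Fig. 1); the DS values are illustrative guesses.
fed = ["Date", "IndividualClient", "Product"]
wh = ["Date", "Product", "SalesRepresentative", "CompanyClient"]
sim = {("Date", "Date"): 1.0, ("Product", "Product"): 0.5}
DS = lambda f, d: sim.get((f, d), 0.0)
is_new = lambda d: all(DS(f, d) < 0.5 for f in fed)   # assumed threshold t = 0.5

print(delta_SD_plus(fed, wh, is_new))       # 0.5  (= 2/4)
print(delta_OD_plus(fed, wh, is_new))       # 0.4  (= 2/5)
print(delta_D_minus(fed, wh, is_new, DS))   # 0.25 (= 1/4)
```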

4.2. Estimating an increase of data richness on the level of measures

According to Eq. (1), the schemas of a federation and of any participating data warehouse, beside sets of dimensions, contain distinguished fact tables (denoted as $D_0$ and $D_0^i$), which are built with measures: $D_0^i = \{m_1^i, \ldots, m_{\beta_i}^i\}$, where $\beta_i$ is the number of measures in $D_0^i$. As stated in the first subsection of this chapter, we assume the existence of the similarity function $MS: M_F \times M_H \to [0, 1]$ between a measure taken from a federation's fact table and a measure from the $i$-th warehouse that is being integrated in some iteration of the integration algorithm. For clarity, we assume that $M_F^i$ denotes the set of measures in a federation in the $i$-th iteration, and by $M_H^i$ we denote the set of measures in the $i$-th warehouse. Similarly to the level of dimensions, the initial set of measures in a federation is empty ($M_F^0 = \emptyset$). The increase of data richness on the level of measures is a function with the following signature:

$$\delta_M^i: 2^{M_F^{i-1}} \times 2^{M_H^i} \to [0, 1] \qquad (14)$$

which must meet the following postulates:

• P1. $\delta_M^i(\emptyset, M_H^i) = 1$
• P2. $\delta_M^i(M_H^i, M_H^i) = 0$
• P3. $\forall m^i \in M_H^i\ \exists m \in M_F^{i-1}: m^i \equiv m \Rightarrow \delta_M^i(M_F^{i-1}, M_H^i) = 0$
• P4. $\forall m^i \in M_H^i\ \neg\exists m \in M_F^{i-1}: m^i \equiv m \Rightarrow \delta_M^i(M_F^{i-1}, M_H^i) = 1$

P1 states that the gain of data richness after integrating some set of measures into an empty schema is always maximal and equal to 1. This postulate concerns the initial state of the federation; it obviously does not carry any substantial application from a practical point of view, but must be included for the sake of formal completeness of the presented framework. P2 describes that the gain of data richness after integrating the same set of measures twice is minimal and equal to 0. P3 tells that the gain of data richness after integrating some set of measures into a set that contains only equivalent measures is minimal and equal to 0. P4 states that the gain of data richness after integrating some set of measures into a set that does not contain any equivalent measures is maximal and equal to 1. Since we do not consider extending measures (which would imply giving them some kind of formal semantics to perform an inference about relationships between them), we only need to consider a situation in which we add new measures to the schema of the federation. Thus, we introduce an auxiliary notion below:

$$X_M^{i+} = \left\{ m^i \in M_H^i \mid \neg\exists m \in M_F^{i-1}: m^i \equiv m \right\} \qquad (15)$$

The set defined above contains the measures from the $i$-th warehouse that do not have equivalent measures in the previous state of the federation's schema. Therefore, they can simply be added to the federation, eventually creating a new state of the schema. By analogy to the level of dimensions, we developed functions that represent the subjective and objective increase of data richness within the federation. The first function, presented in Eq. (16), reflects what we consider a subjective approach to the data richness, gained from the perspective of the $i$-th warehouse that will be integrated.

$$\delta_{SM}^i(M_F^{i-1}, M_H^i) = \frac{|X_M^{i+}|}{|M_H^i|} \qquad (16)$$


On the other hand, Eq. (17) defines an objective gain of data richness, gained by the federation thanks to the integration of the $i$-th partial warehouse in the $i$-th iteration of the integration algorithm.

$$\delta_{OM}^i(M_F^{i-1}, M_H^i) = 1 - \frac{|M_F^{i-1}|}{|M_F^{i-1}| + |X_M^{i+}|} \qquad (17)$$
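Eqs. (16) and (17) translate directly into code; a short sketch follows, using plain name equality as the measure-equivalence test (the MS = 1 case) and, as a usage example, the measure sets that appear in Example 3 further below.

```python
def delta_SM(fed_measures: set, wh_measures: set) -> float:
    """Eq. (16): subjective gain -- the share of the warehouse's measures that are new."""
    x_plus = wh_measures - fed_measures      # X_M^{i+}: measures with no equivalent
    return len(x_plus) / len(wh_measures)

def delta_OM(fed_measures: set, wh_measures: set) -> float:
    """Eq. (17): objective gain -- the relative growth of the federation's measure set."""
    x_plus = wh_measures - fed_measures
    return 1 - len(fed_measures) / (len(fed_measures) + len(x_plus))

M_F1 = {"Value", "CountOfClients"}    # M_F^1, cf. Example 3
M_H2 = {"Value", "CountOfProducts"}   # M_H^2
print(delta_SM(M_F1, M_H2))  # 0.5
print(delta_OM(M_F1, M_H2))  # 0.333... (= 1/3)
```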

Theorem 3. For any federation the following condition is true: $M_F^{i-1} \subset M_H^i \iff \delta_{OM}^i(M_F^{i-1}, M_H^i) = \delta_{SM}^i(M_F^{i-1}, M_H^i)$, where $i \in \{2, \ldots, N\}$.

Proof. Let us assume that $\delta_{OM}^i(M_F^{i-1}, M_H^i) = \delta_{SM}^i(M_F^{i-1}, M_H^i)$. Then

$$1 - \frac{|M_F^{i-1}|}{|M_F^{i-1}| + |X_M^{i+}|} = \frac{|X_M^{i+}|}{|M_H^i|} \iff \frac{|X_M^{i+}|}{|M_F^{i-1}| + |X_M^{i+}|} = \frac{|X_M^{i+}|}{|M_H^i|} \iff |M_F^{i-1}| = |M_H^i| - |X_M^{i+}|$$

It means that $M_F^{i-1} \subset M_H^i$. In the opposite direction, the proof of the theorem is done in an analogical way. □

Theorem 3 tells us that if the set of measures of the $i$-th data warehouse includes all measures of the federation in the $(i-1)$-th iteration, then the objective and the subjective gains of data richness are the same. It means that the integration brings the same benefits both for the $i$-th data warehouse and for the federation. This theorem allows us to reduce the calculation to only one function: $\delta_{OM}^i(M_F^{i-1}, M_H^i)$ or $\delta_{SM}^i(M_F^{i-1}, M_H^i)$.

The eventual increase of the data richness on the level of measures from the perspective of the whole federation can be calculated using the procedure presented in Algorithm 2. Similarly to Algorithm 1, the final result is calculated as an average value of the data richness increases appearing in subsequent steps of the procedure. Moreover, the considered procedure is strategy agnostic, meaning that the factor $\delta_M^i$ in Line 4 may be filled with Eq. (16) or (17).

Algorithm 2 The increase of data richness on the level of measures.
Require: H (a set of data warehouses participating in a federation, |H| = N)
1: F_0 := ∅
2: M := 0
3: for i ∈ [1, N] do
4:     M := M + δ_M^i(M_F^{i-1}, M_H^i)
5:     F_i := Int(F_{i-1}, H_i)
6: end for
7: return M/N

Example 3. Let us again consider warehouses A and B presented in Fig. 1. $D_0^1 = \{Value, CountOfClients\}$ and $D_0^2 = \{Value, CountOfProducts\}$. As was done for the dimension level, in the first step $M_F^0 = \emptyset$ and $M_F^1 = \{Value, CountOfClients\}$. For the second warehouse, the set $X_M^{2+} = \{CountOfProducts\}$ is designated. Then, $\delta_{SM}^2(M_F^1, M_H^2) = \frac{1}{2}$ and $\delta_{OM}^2(M_F^1, M_H^2) = 1 - \frac{2}{2+1} = \frac{1}{3}$, which reflects that the federation is 33% more expressive on the level of measures than any of its parts.

5. The experimental analysis

This section of the article describes an experimental verification of the developed data richness estimation framework for federated data warehouse integration. It was conducted to confirm the correctness of our assumptions and, eventually, the usefulness of the proposed methods. In general, it was based on investigating how different parameters that can be used to describe a federation affect the increase of data richness available in a federation on its different levels. To achieve this goal, a comparison of the following responses to a prepared query is performed:

• a response $R_p$ of a pattern data warehouse $H_p$
• a response $R_f$ of a federation $\hat{F}_b$
• a response $R_{pn}$ of a pattern data warehouse $H_n$, which is the pattern data warehouse $H_p$ with an additional partial data warehouse added during the process
• a response $R_{fn}$ of a federation $\hat{F}_n$, which is built on top of the federation $\hat{F}_b$ with a new partial data warehouse added

Eventually, the gathered responses are confronted with the calculated increases of data richness to find potential correlations. The whole process is illustrated in Fig. 2 and described in detail in the next section. The chapter is split into two parts: the first contains an overview of the base assumptions, the accepted procedure and the experimental setup, and the second presents the results of a statistical analysis of the gathered data.

5.1. The base assumptions and setup

We have designed two types of experiments. Both of them were conducted using a dedicated environment. Each experiment consists of two phases: (i) a data preparation phase and (ii) an execution phase.

Fig. 2. Experimental procedure.

Table 1. Parameters used in the experiment.

Parameter's name | Parameter's value
Dimensions number median | 8
Dimensions number deviation | 3
Dimensions dictionary size | 20
Dimensions attributes number median | 6
Dimensions attributes number deviation | 4
Dimensions attributes dictionary size | 40
Measures number mean | 3
Measures number deviation | 3
Measures dictionary size | 12
Threshold used in criterion for $D_j \sim D_x^i$ | 0.6
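One possible reading of how the parameters of Table 1 drive the generation of the synthetic source warehouses used below is sketched here; the rounded-normal draws and the name dictionaries are our assumptions, not the authors' generator.

```python
import random

# Values taken from Table 1.
DIM_MEDIAN, DIM_DEV, DIM_DICT = 8, 3, 20
ATTR_MEDIAN, ATTR_DEV, ATTR_DICT = 6, 4, 40
MEAS_MEAN, MEAS_DEV, MEAS_DICT = 3, 3, 12

def draw_count(center, dev, upper):
    """Rounded normal draw, clipped to [1, upper] so sampling stays valid."""
    return min(max(1, round(random.gauss(center, dev))), upper)

def generate_warehouse():
    dim_names = [f"Dim{k}" for k in range(DIM_DICT)]
    attr_names = [f"Attr{k}" for k in range(ATTR_DICT)]
    meas_names = [f"Meas{k}" for k in range(MEAS_DICT)]
    dimensions = {
        name: set(random.sample(attr_names, draw_count(ATTR_MEDIAN, ATTR_DEV, ATTR_DICT)))
        for name in random.sample(dim_names, draw_count(DIM_MEDIAN, DIM_DEV, DIM_DICT))
    }
    facts = set(random.sample(meas_names, draw_count(MEAS_MEAN, MEAS_DEV, MEAS_DICT)))
    return facts, dimensions

sources = [generate_warehouse() for _ in range(8)]  # Step 1 of the data preparation
```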

First, we have performed an experiment with a pattern data warehouse. The goal of this approach is to find out if there is any relation between the federation's response accuracy and the data richness metrics. The data preparation phase consists of several steps:

1. Generate 8 source data warehouses according to the parameters from Table 1.
2. Create a federation from the data warehouses from Step 1. Mark it as a base federation, denoted further as $\hat{F}_b$.
3. Create a pattern data warehouse containing all the data from the source data warehouses. Mark it as a base pattern data warehouse, denoted further as $H_p$.
4. Execute a test query on the base federation and the base pattern data warehouse to obtain baseline results.
5. Generate a number of test source data warehouses as in Step 1. The names of dimensions, attributes and measures were taken from predefined dictionaries. This is necessary because the similarity functions are mostly based on items' names.

The execution phase is a set of iterations run with reference to the base federation. Each iteration uses one of the previously generated test source data warehouses and consists of five major steps:

1. Calculate the data richness measures for the given base federation and the schema of the test data warehouse, denoted as $H_t$.
2. Create a new federation from the source schemas associated with the base federation and the test schema, denoted further as $\hat{F}_n$.
3. Create a pattern data warehouse containing all the data from the federation from the previous step. Mark it as a pattern data warehouse for the particular iteration, denoted as $H_n$.
4. Execute a test query on the new federation and the new pattern data warehouse to obtain a test result. The collected responses from the pattern data warehouse and the federation are further denoted as $R_{pn}$ and $R_{fn}$ respectively.

5. Compare the gathered results.

Each experiment was executed using the parameters presented in Table 1. Their values were chosen based on the results of previous experiments, where very wide ranges of possible values were checked. The dictionaries' sizes must be related to the range of the generated number of attributes. If the attribute names' dictionary size were equal to the number of attributes, it would be very probable that each data warehouse would have a very similar, or even identical, set of attributes in its dimensions. This means that the increase of data richness would be almost equal to 0. If the size of the dimensions' dictionary is close to the median of the dimensions' number, the example becomes trivial too. If the standard deviation of the number of dimensions is too small, the experiment will work only for a small number of data warehouses. In other words, if the number of possible configurations is too low in comparison with the number of chosen data warehouses, it is hard to add new, useful data to the federation. On the other hand, if the dimensions' dictionary size is too big, it is hard to randomly choose data warehouses that could create a useful federation. Lowering the threshold is not a good idea, because it leads to a situation in which federations have a very low schema coverage level. During query processing, such a federation uses a very small number of its component data warehouses, because their schemas have a very small intersection.

The queries executed on federations were based on their schemas and built in the following steps:

1. Select one of several query templates, which contain standard SQL keywords like SELECT, WHERE, JOIN, HAVING, GROUP BY, etc.
2. Identify the dimensions, attributes, and measures from the federation schema which have the largest representations in the component data warehouses (with the largest mappings dictionaries).
3. Fill the query template with the items identified in the previous step.

Using the most representative dimensions allows us to consider the experiment's input data as complete. In about 40% of iterations, the newly generated test schema was not involved in the query execution process. This was caused by a too small matching level between the test schema and the base federation's schema. The general problem concerns the remark that the test schema should somehow be simultaneously both similar to and different from the base schema. If the test schema were an ideal subset of the base schema, the case would be trivial. If the test schema had no parts similar to the base schema, it would not be involved in the query execution process and the difference between the baseline and the test would be equal to 0. Iterations for which the test data warehouse schema was not involved in the query processing were removed from the data analysis.

The comparison is based on an iterative check of each pair of corresponding rows in the pattern response and the test federation response. If all attribute values in those rows match, the difference between the measured values acts as the difference between the rows. If at least one combination of attribute values appeared in only one of the compared responses, then this experiment iteration was removed from further considerations. The average difference between the response from the test federation and the pattern data warehouse is calculated as an average distance between corresponding rows. After that, we have executed the second type of experiment - with a simplified ACCU method taken from [7] and [8].
Its goal was to compare our approach with an external metric. The data preparation phase consists of three steps:

1. Generate 8 source data warehouses according to the parameters from Table 1.
2. Create a federation from the data warehouses from Step 1. Mark it as $\hat{F}_b$.
3. Generate a number of test source data warehouses as in Step 1. The names of dimensions, attributes and measures were taken similarly to the first experiment.

The execution phase is also a set of iterations run with reference to the base federation. Each iteration uses one of the previously generated test source data warehouses and consists of six major steps:

1. Calculate the data richness measures for the given base federation $\hat{F}_b$ and the schema of the test data warehouse, denoted as $H_t$.
2. Fill the current test data warehouse (denoted as $H_t$) with data.
3. Create a new data warehouse based on the source data warehouses associated with the base federation $\hat{F}_b$ extended by $H_t$, and mark it as $H_n$.
4. Modify the test query by removing aggregation functions and GROUP BY clauses, so that it returns just the number of rows fulfilling the conditions in the WHERE clause. Execute it both on $H_n$ and $H_t$.
5. Calculate the simplified ACCU metric.
6. Compare the results from Steps 1 and 5.

The benchmark metric is a simplified implementation of the ACCU algorithm described in [7]. It compares the numbers of rows fulfilling the requirements defined in the user query from $H_t$ and $H_n$. The number of rows in $H_t$ cannot exceed the number of rows in $H_n$, because $H_t$ is only one of many data sources used to construct $H_n$. According to [7], each row from any data warehouse is an object which carries some kind of data; originally these were true/false values. In our approach the measures from fact tables have a far bigger universe of possible values; therefore a row is considered in the context of a certain query, where it may or may not fulfill the given conditions. Moreover, the data warehouses are independent by design (by the postulate of component autonomy in the federation). Finally, each of them has equal reliability; therefore the mentioned simplification was possible and justified.
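Reduced to code, the simplified ACCU of one iteration boils down to comparing the counts of rows that satisfy the modified query's WHERE conditions in Ht and Hn. The ratio below is our hedged reading of that comparison; the original ACCU of [7] is a probabilistic accuracy model and considerably more involved.

```python
def simplified_accu(rows_ht, rows_hn, satisfies):
    """Share of the qualifying rows of the combined warehouse H_n that the test
    warehouse H_t already delivers (H_t is one of the sources of H_n, so its
    count can never exceed that of H_n)."""
    hits_ht = sum(1 for row in rows_ht if satisfies(row))
    hits_hn = sum(1 for row in rows_hn if satisfies(row))
    return hits_ht / hits_hn if hits_hn else 0.0
```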

5.2. The statistical analysis

5.2.1. Multiple linear regression

As was mentioned in the previous section, the experiment is built on top of generating a query which is sent to the created base federation $\hat{F}_b$ and to the pattern data warehouse $H_p$. Next, $dist(R_p, R_f)$ is calculated using a function defined in our previous publication [13]:



$$dist(R_p, R_f) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{s} \sum_{j=1}^{s} \frac{R_p^{ij} - R_f^{ij}}{R_p^{ij}} \right) \qquad (18)$$

where $R_p$ and $R_f$ are the pattern warehouse's and the base federation's responses, respectively, $R_p^{ij}$ and $R_f^{ij}$ are the values of the $j$-th element in the $i$-th row of the pattern data warehouse response and of the base federation response, respectively, $s$ is the number of columns in a single row of the response, and $n$ is the number of rows in the pattern data warehouse response. In a single row, each element (which may be associated with columns in the schema) may be a simple chain of characters. Then, the expression $\frac{R_p^{ij} - R_f^{ij}}{R_p^{ij}}$

is replaced by the value of the Hamming distance between the two words divided by the length of the longer word. After that, a new data warehouse is added to the base federation, creating a federation $\hat{F}_n$. All of the designed data richness measures are calculated to estimate the profitability of the integration process. Eventually, assuming that $R_{fn}$ denotes a response of the federation $\hat{F}_n$ and $R_{pn}$ is a response of the new pattern warehouse, the value
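Eq. (18), together with the Hamming fallback for textual cells, can be sketched as follows. Treating positions beyond the shorter word as mismatches is our assumption; the equation itself also presupposes non-zero numeric values in the pattern response.

```python
def cell_diff(p, f):
    """Per-cell difference: Eq. (18)'s relative difference for numbers,
    length-normalized Hamming distance for chains of characters."""
    if isinstance(p, (int, float)) and isinstance(f, (int, float)):
        return (p - f) / p                     # assumes p != 0
    p, f = str(p), str(f)
    longer = max(len(p), len(f))
    mismatches = sum(a != b for a, b in zip(p, f)) + abs(len(p) - len(f))
    return mismatches / longer if longer else 0.0

def dist(R_p, R_f):
    """Eq. (18): mean over the n rows of the mean per-cell difference over the
    s columns (R_p: pattern response, R_f: federation response, equal shapes)."""
    row_means = [sum(cell_diff(p, f) for p, f in zip(rp, rf)) / len(rp)
                 for rp, rf in zip(R_p, R_f)]
    return sum(row_means) / len(row_means)
```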

$dist(R_p, R_f) - dist(R_{pn}, R_{fn})$ is calculated. We expected that adding a new data warehouse to the base federation should improve the response of such a federation to the same query. However, for a sample of 913 experiments, we have obtained an average difference $dist(R_p, R_f) - dist(R_{pn}, R_{fn})$ very near to 0. It means that adding a new data warehouse to the federation does not always improve its response to the given query. Therefore, we divided our analysis into two parts. At first, we verified the hypothesis about a correlation of our measures with the difference $dist(R_p, R_f) - dist(R_{pn}, R_{fn})$ in the situation where adding a new data warehouse to the base federation improves the response to the generated query. The second analysis was done in the other direction.

For our analysis, we used Multiple Linear Regression. The constructed linear regression model allows analyzing the influence of many independent variables on one dependent variable. In our case, we want to test the influence of our four data richness measures on $dist(R_p, R_f) - dist(R_{pn}, R_{fn})$. Multiple Linear Regression is the most frequently used variety of multiple regression. It is an extension of a linear regression model based on Pearson's linear correlation coefficient, and it presumes the existence of a linear relationship between the studied variables. Multiple Linear Regression allows us to identify the strength of the effect that our defined data richness measures have on $dist(R_p, R_f) - dist(R_{pn}, R_{fn})$. Additionally, the results of the test will allow us to predict the effectiveness of the integration process based only on the calculated data richness measures. To construct a model of linear regression studying the influence of the four independent variables, we accepted the following auxiliary notions:

• sd denotes the subjective increase of data richness on the level of dimensions, which is the output of Algorithm 1 using the strategy defined in Eq. (8)
• od denotes the objective increase of data richness on the level of dimensions, which is the output of Algorithm 1 using the strategy defined in Eq. (9)
• sm denotes the subjective increase of data richness on the level of measures, which is the output of Algorithm 2 using the strategy defined in Eq. (16)
• om denotes the objective increase of data richness on the level of measures, which is the output of Algorithm 2 using the strategy defined in Eq. (17)

The dependent variable was defined as dist(Rp, Rf) − dist(Rpn, Rfn). As a result, the coefficients of the regression equation were estimated, along with measures which allowed us to evaluate the quality of the model. As mentioned above, in the first step we prepared a model where both dist(Rp, Rf) > 0 and dist(Rpn, Rfn) > 0. Our sample had a size equal to 223. To obtain a correct regression model, the basic assumptions concerning the model residuals have to be checked. In our model's residuals, there are no outliers (all residuals deviate by less than 3 standard deviations from the mean value). Additionally, the distribution of residuals deviates only slightly from the normal distribution (the p-value of the Lilliefors test is equal to 0.02712); however, a small difference between the residuals' distribution and the normal distribution is acceptable. The homoscedasticity has been confirmed by inspecting the following charts: the residuals with respect to predicted values, the squared residuals with respect to predicted values, the residuals with respect to observed values, and the squared residuals with respect to observed values. Due to the length limitations of this paper, these charts are not included. Additionally, the variance inflation factor VIF is equal to 1.18, which allows us to conclude that the multicollinearity is very weak. Thus, all requirements for Multiple Linear Regression are satisfied and the results of the analysis are presented in Tables 2 and 3.
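The regression and the diagnostics above can be reproduced with standard tooling; a minimal sketch using statsmodels follows. The paper does not name its statistical software, so the library choice, the DataFrame layout, and the column name dist_diff (standing for dist(Rp, Rf) − dist(Rpn, Rfn)) are our assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import lilliefors
from statsmodels.stats.outliers_influence import variance_inflation_factor


def analyze(df: pd.DataFrame):
    """Fit the regression described in the text and rerun its
    residual checks. `df` holds one row per experiment with columns
    sd, od, sm, om and dist_diff (hypothetical names)."""
    model = smf.ols("dist_diff ~ sm + om + sd + od", data=df).fit()

    # Lilliefors normality test on the residuals
    # (the paper reports p = 0.02712 for its sample of 223).
    _, p_normality = lilliefors(model.resid, dist="norm")

    # Variance inflation factors for the predictors
    # (the paper reports a VIF of 1.18, i.e., weak multicollinearity).
    exog = model.model.exog
    vif = {name: variance_inflation_factor(exog, i)
           for i, name in enumerate(model.model.exog_names)
           if name != "Intercept"}

    return model, p_normality, vif
```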


Table 2
The results of the statistical analysis.

Function                            Value
r                                   0.3934
r²                                  0.1543
standard error of the estimation    0.261
F-value                             9.94
p-value                             <0.000001

Table 3
The multiple linear regression model.

Variable     b coeff.   t stat.   p-value
Intercept    8.15       2.58      0.0104
sm           −0.303     −0.803    0.423
om           −0.51      −0.51     0.61
sd           −0.439     −0.761    0.45
od           −7.47      −2.55     0.011

Table 4 Pearson’s linear correlation coefficient between the objective data richness on the level of dimensions. Function

Value

r r2 standard error of an estimation t statistic p-value

0.3934 0.15 0.062 −6.22 <0.000001

On the basis of the estimated values of the b coefficients, the relationship between dist(Rp, Rf) − dist(Rpn, Rfn) and all independent variables can be described by means of the equation:

dist(R_p, R_f) - dist(R_{pn}, R_{fn}) = 8.15 - 0.303 \cdot sm - 0.51 \cdot om - 0.439 \cdot sd - 7.47 \cdot od + 0.26 \qquad (19)

Additionally, the acquired model is fairly well fitted, which is confirmed by:

• the small standard error of the estimation, SEe = 0.261, and r = 0.3934;
• the result of the F-test of the analysis of variance: p-value < 0.000001.
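Applied in practice, Eq. (19) yields a direct estimate of the integration effect from the four measures alone; a literal transcription follows. The trailing +0.26 term is kept exactly as printed above, and the function name is ours.

```python
def predicted_dist_diff(sm_val, om_val, sd_val, od_val):
    """Literal transcription of Eq. (19): predicts
    dist(Rp, Rf) - dist(Rpn, Rfn) from the four data richness
    measures; the trailing +0.26 is kept as printed."""
    return (8.15 - 0.303 * sm_val - 0.51 * om_val
            - 0.439 * sd_val - 7.47 * od_val + 0.26)
```

Because od carries by far the largest (negative) coefficient, it dominates the prediction, which matches the t-test results reported next.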

The results of the t-test for each variable have shown that only the objective increase of the data richness on the level of dimensions (od) has a significant influence on the dependent variable. The interpretation of the obtained results, for the different models we have tested, allows us to assume that some of the variables do not have a significant effect on the profit and may be superfluous. Therefore, we have checked the linear correlation between the variable od and dist(Rp, Rf) − dist(Rpn, Rfn). We have obtained a statistically significant linear dependence between the tested variables, which is presented in Table 4. The strength of the linear relation between the tested variables was equal to r = −0.384, which can be interpreted as a moderate negative correlation. It implies that if the objective increase of the data richness measure od grows, then the value of the difference dist(Rp, Rf) − dist(Rpn, Rfn) decreases. For the other variables, the strength of the linear relations was very poor (less than 0.1).

The second analysis has been made for the case when dist(Rp, Rf) < 0 and dist(Rpn, Rfn) < 0. Our sample had a size equal to 688. For all possible models, we have obtained a value of the multiple correlation coefficient r ≤ 0.14, which means that the strength of the effect of the set of independent variables on the dependent variable is very weak in this case. The obtained result is not essential; therefore, we do not present all of the performed calculations.

5.2.2. Comparison with the ACCU method

The second experiment has been devoted to comparing our approach with an external metric based on the ACCU algorithm [7]. We have conducted 530 experiments and, based on the initial analysis, we found that only sd, the subjective increase of the data richness measure on the level of dimensions, is statistically correlated with the value obtained by the ACCU method. We have split our samples into two groups: in the first one, the values obtained by the ACCU method are smaller than the sd value, and in the second one otherwise. We have obtained sample sizes equal to 149 and 381 elements, respectively. The samples are large enough that, recalling the central limit theorem, we can accept that the samples' distributions are approximately normal. Thus, for further analysis, we have chosen the ICC test. The intraclass correlation coefficient (ICC) test measures the strength of interjudge reliability, i.e., the degree of assessment concordance. In all cases, we have obtained a p-value of less than 0.000001; therefore, the obtained dependencies are statistically significant. The ICC(2,k) calculated for consistency in both cases was around 0.45 and ICC(2,1) around 0.29; for absolute agreement the values were lower: 0.24 and 0.14, respectively. Such results show that the absolute concordance of the assessments made by both measures is not high; however, some convergence can be noticed. The values calculated by the ACCU method and the sd values are fairly consistent, however not identical. This corresponds with our intuitions, because we compared two different measures and we do not expect total concordance. The sd measure considers only the schemas of dimensions in data warehouses, while the ACCU algorithm requires a more holistic analysis. Firstly, the ACCU measure can be calculated only after a partial data warehouse has been integrated within a federation. Therefore, it can judge whether or not adding a data warehouse was profitable for a federation only after it has been done; it cannot be used to evaluate the candidate data warehouse beforehand. What is most important, this addition involves adding not only the dimensions' schemas, but all of the data warehouse's content. Secondly, using ACCU requires processing the warehouses as whole units, which involves their global topology, the schemas of dimensions and measures, and eventually (and most importantly) the data they contain. This entails more complex calculations, which are more cost-demanding. The measure of the increase of data richness on the level of dimensions presented in this article requires only processing the schemas of dimensions, which is very light in terms of calculations. Even when hundreds of dimensions with a large number of attributes are involved, the algorithms from Section 4.1 only need to compare several fixed sets. Therefore, it can be a very convenient tool for an initial analysis of data warehouses joining some federation. After pre-checking whether or not they can be valuable, more complex calculations can be performed.
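The ICC comparison above can be reproduced with standard tooling; a minimal sketch using the pingouin library follows. The library choice is an assumption (the paper does not name its software), and the two measures are simply treated as two "judges" scoring the same experiments.

```python
import pandas as pd
import pingouin as pg


def icc_between_measures(accu_vals, sd_vals):
    """ICC between the ACCU metric and the sd measure, treating the
    two measures as two raters scoring the same experiments.

    Pingouin expects long format: one row per (experiment, rater).
    Its output table contains single-score and average-score
    coefficients (rows ICC2 and ICC2k, among others) together with
    p-values; the consistency vs. absolute-agreement variants
    discussed in the text map onto different rows of this table.
    """
    n = len(accu_vals)
    long_df = pd.DataFrame({
        "experiment": list(range(n)) * 2,
        "rater": ["ACCU"] * n + ["sd"] * n,
        "score": list(accu_vals) + list(sd_vals),
    })
    return pg.intraclass_corr(data=long_df, targets="experiment",
                              raters="rater", ratings="score")
```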
6. Future works and summary

Federated data warehouses, by design, are built on top of highly efficient data integration methods that allow processing large amounts of data originating from source data warehouses and serving them as a single warehouse to the end-user. Such a process may be very time- and cost-consuming. This can entail slow processing of big federations, which may, intuitively, be better in terms of the data richness they provide. The framework presented in this article allows estimating the potential growth of data richness when a new data warehouse is added to the federation. This tool may be used to decide whether or not it is profitable to launch the aforementioned, expensive integration methods.

The results of the statistical analysis provided us with some important conclusions. First of all, the data richness measures are moderately linearly correlated with the effectiveness of data warehouse integration. The proposed methodology (especially the objective measure on the level of dimensions) allows us to answer the question of whether a potential integration is valuable with regard to the potential growth of data richness. For all analyzed measures we have obtained a negative correlation, but only for the objective data richness measure on the level of dimensions is this correlation statistically significant.

However, we have developed a multiple linear regression model which allowed us to claim that if all of the data richness measures increase, then the value of the difference dist(Rp, Rf) − dist(Rpn, Rfn) decreases. This means that after integrating a new warehouse, the quality of answers to the sent queries improves when dist(Rp, Rf) < dist(Rpn, Rfn). Therefore, the developed multiple linear regression model makes it possible to predict the effectiveness of the integration process (obviously, based on the calculated data richness measures). Thus, it makes it possible to avoid a non-effective integration of a data warehouse which would not result in improving the overall quality of the federation.

The second experiment was performed to compare the developed framework with similar tools found in the literature. We have chosen a simplified ACCU method that serves a close purpose: evaluating the quality of the created federated data warehouse after the addition of a partial data warehouse. The obtained results show some degree of convergence between the compared tools. Obviously, full cohesion is impossible to achieve, because both methods use different kinds of data and differ in terms of their complexity. However, the tool proposed in this article is much simpler to calculate and can be determined before the partial data integration has been conducted. In our opinion, this unequivocally proves the usefulness of our framework.

In the future, we plan to extend our framework with methods of estimating the increase of data richness on the level of the contents of tables, due to the fact that the tools presented in this article are designed to process only the schemas of warehouses. This may result in interesting remarks concerning the relationships between the data richness of raw dimensions' schemas and the richness of facts. We also intend to conduct more experiments using data originating from real-world business scenarios and applications of data warehouses.

Declaration of Competing Interest

None.

References

[1] A. Abu Halimeh, M.E. Tudoreanu, Subjective information quality in data integration: evaluation and principles, in: Information Quality and Governance for Business Intelligence, 2013, pp. 44–65.
[2] G.C.M. Amaral, M.L.M. Campos, AQUAWARE: a data quality support environment for data warehousing, in: Proceedings of the Simpósio Brasileiro de Bancos de Dados, 2004, pp. 121–133.



[3] M.C.M. Batista, A.C. Salgado, Minimality quality criterion evaluation for integrated schemas, in: Proceedings of the 2nd International Conference on Digital Information Management, ICDIM, 1, 2007, pp. 436–441.
[4] D. Beneventano, M.O. Olaru, M. Vincini, Analyzing dimension mappings and properties in data warehouse integration, in: Proceedings of the OTM Conferences, 2013, pp. 616–623.
[5] L. Berti-Equille, F. Moussouni, Quality-aware integration and warehousing of genomic data, in: Proceedings of the 10th International Conference on Information Quality (IQ'05), MIT, Cambridge, USA, 2005.
[6] C. Calero, M. Piattini, C. Pascua, M.A. Serrano, Towards data warehouse quality metrics, in: Proceedings of the International Workshop on Design and Management of Data Warehouses, 2001.
[7] X.L. Dong, L. Berti-Equille, D. Srivastava, Integrating conflicting data: the role of source dependence, in: Proceedings of the VLDB, 2, 2009, pp. 550–561.
[8] X.L. Dong, B. Saha, D. Srivastava, Less is more: selecting sources wisely for integration, Proc. VLDB Endow. 6 (2) (2012) 37–48.
[9] M. Gertz, Managing data quality and integrity in federated databases, in: Integrity and Internal Control in Information Systems, Springer, US, 1998, pp. 211–229.
[10] A. Gosian, S. Mann, Empirical validation of metrics for object oriented multidimensional model for data warehouse, Int. J. Syst. Assur. Eng. Manag. 5 (3) (2014) 262–275.
[11] B. Heinrich, M. Klier, A. Schiller, G. Wagner, Assessing data quality - a probability-based metric for semantic consistency, Decis. Support Syst. 110 (2018) 95–106.
[12] M. Jarke, Y. Vassiliou, Data warehouse quality: a review of the DWQ project, in: Proceedings of the 2nd Conference on Information Quality, MIT, Cambridge, 1997, pp. 299–313.
[13] R. Kern, Data warehouses federation as a single data warehouse, in: Proceedings of the International Conference on Computational Collective Intelligence, Springer, Cham, 2016, pp. 356–366.
[14] A. Kozierkiewicz-Hetmanska, M. Pietranik, The knowledge increase estimation framework for ontology integration on the concept level, J. Intell. Fuzzy Syst. 32 (2) (2017) 1161–1172.
[15] A. Kozierkiewicz-Hetmanska, M. Pietranik, B. Hnatkowska, The knowledge increase estimation framework for ontology integration on the instance level, in: Proceedings of the Asian Conference on Intelligent Information and Database Systems, Springer, Cham, 2017, pp. 3–12.
[16] A. Kozierkiewicz-Hetmanska, M. Pietranik, The knowledge increase estimation framework for ontology integration on the relation level, in: Proceedings of the Conference on Computational Collective Intelligence Technologies and Applications, Springer, Cham, 2017, pp. 44–53.
[17] S. Kumar, U. Kumar, Various data quality issues in data warehousing, Int. J. Res. 1 (10) (2014) 210–213.
[18] A. Marotta, R. Ruggia, Managing source quality changes in a data integration system, in: Proceedings of the CAiSE (Doctoral Consortium), 2005.
[19] N.T. Nguyen, Advanced Methods for Inconsistent Knowledge Management, Springer, London, 2008.
[20] V. Peralta, et al., A framework for data quality evaluation in a data integration system, in: Proceedings of the XIX Simpósio Brasileiro de Bancos de Dados, 2004, pp. 134–147.
[21] M. Serrano, C. Calero, M. Piattini, Validating metrics for data warehouses, IEE Proc. Softw. 149 (5) (2002) 161–166.
[22] M. Serrano, C. Calero, M. Piattini, Experimental validation of multidimensional data model metrics, in: Proceedings of the 36th Hawaii International Conference on System Science, 2003, pp. 7–14.
[23] J. Sheoran, Issues of data quality in data warehouses, in: Proceedings of the International Conference on Advances in Computer Engineering and Applications, ICACEA, 6, 2014, pp. 6–8.
[24] R. Torlone, Two approaches to the integration of heterogeneous data warehouses, Distrib. Parallel Databases 23 (1) (2008) 69–97.
[25] V.H. Trieu, Getting value from business intelligence systems: a review and research agenda, Decis. Support Syst. 93 (2017) 111–124.
[26] G. Zhu, C.A. Iglesias, Exploiting semantic similarity for named entity disambiguation in knowledge graphs, Expert Syst. Appl. 101 (2018) 8–24.
