Validation of mappings between schemas
Guillem Rull, Carles Farré, Ernest Teniente, Toni Urpí
Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, 1-3 Jordi Girona, Barcelona 08034, Spain
E-mail addresses: [email protected] (G. Rull), [email protected] (C. Farré), [email protected] (E. Teniente), [email protected] (T. Urpí).
This work was supported in part by Microsoft Research through the European Ph.D. Scholarship Programme, and in part by the Spanish Ministerio de Educación y Ciencia under project TIN2005-05406.


Article history: Received 18 December 2007; received in revised form 29 March 2008; accepted 26 April 2008; available online 17 May 2008.

Keywords: Validation; Data models; Schema mappings; Query liveliness

Abstract

Mappings between schemas are key elements in several contexts such as data exchange, data integration and peer data management systems. In all these contexts, the mapping design process requires the participation of a mapping designer, who needs to be able to validate the mapping that is being defined, i.e., check whether the mapping is in fact what the designer intended. However, to date very little work has directly focused on the effective validation of schema mappings. In this paper, we present a new approach for validating schema mappings that allows the mapping designer to ask whether they have certain desirable properties. We consider four properties of mappings: mapping satisfiability, mapping inference, query answerability and mapping losslessness. We reformulate these properties in terms of the problem of checking whether a query is lively over a database schema. A query is lively if there is a consistent instance of the database for which the query gives a non-empty answer. To show the feasibility of our approach, we use an implementation of the CQC method and provide some experimental results.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Mappings between schemas are key elements in any system that requires interaction of heterogeneous data and applications. Data exchange [14,22] and data integration [23] are two well-known contexts in which mappings play a central role, as are peer data management systems [7,20]. In all these cases, mappings are specifications that model a relationship between schemas.

Several formalisms are used to define mappings. In data exchange, tuple-generating dependencies (TGDs) and equality-generating dependencies (EGDs) are widely used [14]. In the context of data integration, global-as-view (GAV), local-as-view (LAV), and global-and-local-as-view (GLAV) [17,23] are the most common approaches. A further formalism that has recently emerged is that of nested mappings [18], which extends previous formalisms for relational and nested data. Model management [3,5,27] is also a widely known approach which establishes a conceptual framework for handling schemas and mappings generically and provides a set of generic operators such as the composition of mappings [4,15,25]. In the context of conceptual modeling, QVT (query/view/transformation) [31] is a standard language defined by the OMG (object management group) to specify transformations between models (i.e., schemas). Data transformation languages like XSLT, XQuery and SQL are also used to specify mappings in several existing tools that help engineers to build them [26,38].

One example of a system for producing mappings is Clio [19,28,32], which can be used to semi-automatically generate a schema mapping from a set of correspondences between schema elements (e.g., field names). This set of inter-schema correspondences is usually called a matching.


In fact, finding a matching between the schemas is the first step towards developing a mapping. A significant amount of work on schema-matching techniques can be found in the literature; see [13,33,36] for a survey. Nevertheless, the process of designing a mapping always requires feedback from a human engineer. The designer guides the generation process by choosing among candidates and successively refining the mapping. The designer thus needs to check whether the mapping produced is in fact what was intended, that is, the designer must find a way to validate the mapping. However, to date very little work has focused directly on the effective validation of schema mappings.

In this paper, we present a new approach for validating schema mappings that allows the designer to ask whether the mappings have certain desirable properties. Answers to these questions provide information on whether the mapping adequately matches the intended needs and requirements. We consider two properties identified in [24] as important properties of mappings: mapping inference and query answerability. We likewise consider another property, mapping satisfiability (strong and weak), and introduce a new property: mapping losslessness.

Mapping satisfiability is aimed at detecting incompatibilities, either within the mapping or between the mapping and the constraints in the schemas. Mapping inference can be used to check for redundancies in the mapping or to check the equivalence of two mappings. Regarding query answerability and mapping losslessness, it is worth noting that, in general, mappings lose information and can be partial or incomplete, since they rarely map all the concepts from one schema to another. Query answerability and mapping losslessness are aimed at detecting whether these lossy mappings still preserve some relevant information, which is represented in terms of a query. Specifically, query answerability informs the designer whether two mapped schemas are equivalent with respect to a given query, in the sense that the query would obtain the same answer on both sides. Mapping losslessness is aimed at detecting whether or not the mapping loses certain information from one schema to another, that is, whether a given query over the first schema can also be computed over the second one. However, it is not necessary for the answer obtained from the second schema to be exactly the same as that obtained from the first one.

In summary, the main idea of our approach to validating these properties of mappings is to define a distinguished derived predicate (a query) that describes the fulfillment of the current property. We define this predicate over a new schema, which integrates the two mapped schemas and a set of integrity constraints that makes the relationship modeled by the mapping explicit. The distinguished predicate will be lively [12,21,41] over this new schema if and only if the property holds. A derived predicate is lively if the schema admits at least one fact about it. Hence, the four properties we consider in this paper are expressed in terms of the problem of checking whether a derived predicate is lively over a database schema. This allows us to use one single method to check all the properties. It is worth noting that any existing method for checking the liveliness of a predicate can be used, as long as it is able to deal with the classes of queries and integrity constraints that appear in the database schema. To show the feasibility of our approach, we use our CQC method [16].

The CQC method is able to deal with a broad range of database schemas; thus, we are able to consider schemas and mappings with a high degree of expressiveness. However, this high expressiveness makes the problem semidecidable in the general case. Hence, according to the completeness results in [16], the CQC method terminates when either there are one or more finite schema instances for which the distinguished predicate is lively or there is no instance (finite or infinite). A practical solution for dealing with this semidecidability is discussed in Section 4.

In this paper, we consider mappings to be relationships between two relational schemas. We specify mappings as sets of mapping formulas in a similar way to the GLAV approach.

Summarizing, the main contributions of the paper are the following:

– We propose a new approach for validating schema mappings. This approach consists in defining a set of desirable properties for mappings and reformulating them in terms of predicate liveliness. The approach is independent of the method used to check liveliness.
– We show how our approach can be used to check two important properties of mappings that have already been identified in the literature [24]: mapping inference and query answerability.
– We define a new property, mapping losslessness, and propose the checking of an additional one, mapping satisfiability (strong and weak), to capture validation information that is not covered by the other two properties.
– We show that our mapping losslessness property is in fact an extension of query answerability. We can use losslessness to try to obtain useful validation information when query answerability does not hold.
– We show how our CQC method [16] can be used to check the four mapping properties. The CQC method is able to deal with schemas that have integrity constraints, safe-negated EDB and IDB literals, and comparisons. Although the problem is semidecidable in the general case, we show a pragmatic solution that ensures the CQC method's termination.

The paper is organized as follows. Section 2 introduces the basic concepts and notation used in the paper. Section 3 presents our approach, the properties and their formulation in terms of predicate liveliness. Section 4 discusses some decidability and complexity issues and argues the practical usefulness of the approach. Section 5 describes the experimental evaluation carried out using the CQC method. Section 6 addresses related work and Section 7 presents our conclusions and suggestions for further research.


2. Schemas and mappings

Mappings define a relationship on the sets of instances of the mapped schemas. In this paper, we focus on relational schemas and on binary mappings, i.e., mappings between two schemas. We express schemas and mappings using a logic notation, as is the case in many other mapping formalisms (e.g., TGDs [14] and GLAV [17]). This section introduces the basic concepts.

A schema S is a tuple (DR, IC) where DR is a finite set of deductive rules and IC a finite set of constraints. A deductive rule has the form

p(X) ← r1(X1) ∧ ... ∧ rn(Xn) ∧ ¬rn+1(Y1) ∧ ... ∧ ¬rm(Ys) ∧ C1 ∧ ... ∧ Ct

and a constraint takes one of the following two forms:

r1(X1) ∧ ... ∧ rn(Xn) → C1 ∨ ... ∨ Ct,
r1(X1) ∧ ... ∧ rn(Xn) ∧ C1 ∧ ... ∧ Ct → ∃V1 rn+1(Z1) ∨ ... ∨ ∃Vs rm(Zs).

Symbols p, ri are predicates. Tuples X, Xi, Yi contain terms, which are either variables or constants. Each Vi is a tuple of distinct variables. Each Ci is a built-in literal in the form of t1 θ t2, where t1 and t2 are terms and the operator θ is <, ≤, >, ≥, = or ≠. Atom p(X) is the head of the rule, and ri(Xi), ¬ri(Yi) are positive and negative ordinary literals (those that are not built-in). Every rule and constraint must be safe, that is, every variable that occurs in X, Yi or Ci must also appear in some Xj. Each variable in Zi is either from Vi or from some Xj. Variables in Xi are assumed to be universally quantified over the whole formula.

Predicates that appear in the head of a deductive rule are derived predicates, which are also called intensional database (IDB) predicates. They correspond to views or queries. The rest are base predicates, which are also called extensional database (EDB) predicates. They correspond to tables.

For a schema S = (DR, IC), a schema instance D is an EDB, that is, a set of ground facts about the base predicates of S (the tuples stored in the database). We use DR(D) to denote the whole set of ground facts about the base and derived predicates that are inferred from an instance D, i.e., the fix-point model of DR ∪ D. An instance D violates a constraint L1 ∧ ... ∧ Ln → Ln+1 ∨ ... ∨ Lm if the implication (L1 ∧ ... ∧ Ln → Ln+1 ∨ ... ∨ Lm)σ is false over DR(D) for some ground substitution σ. An instance D is consistent with schema S if it does not violate any of the constraints in IC.

A query is a finite set of deductive rules that define the same n-ary predicate. Let D be an instance of schema S = (DR, IC) and Q a query that defines predicate q. The answer to Q over D, written as AQ(D), is the set of all the ground facts about q obtained by evaluating the deductive rules from both Q and DR over D, i.e., AQ(D) = {q(a) | q(a) ∈ (Q ∪ DR)(D)}.

A mapping M between schemas A = (DRA, ICA) and B = (DRB, ICB) is a triplet (F, A, B), where F is a finite set of formulas {f1, ..., fn}. Each formula fi takes either the form QAi = QBi or the form QAi ⊆ QBi, where QAi and QBi are queries over schemas A and B, respectively. Obviously, the queries must be compatible, that is, the predicates must have the same number and type of arguments. We will assume that the deductive rules for these predicates are in either DRA or DRB. In this paper, we consider the IDB predicates (which include the queries used in the definition of the mapping formulas) under the exact view assumption [23], i.e., the extension of a view/query is exactly the set of tuples that satisfies its definition over the database.

We say that schema instances DA and DB are consistent under mapping M = (F, A, B) if all the formulas in F are true. We say that a formula QAi = QBi is true if the tuples in the answer to QAi over DA are the same as the ones in the answer to QBi over DB. In more formal terms, the formula is true when the following holds: qAi(a) ∈ AQAi(DA) if and only if qBi(a) ∈ AQBi(DB) for each tuple of constants a, where qAi and qBi are the predicates defined by the two queries in the formula. Similarly, the formula QAi ⊆ QBi is true when the tuples in the answer to QAi over DA are a subset of those in the answer to QBi over DB, i.e., qAi(a) ∈ AQAi(DA) implies qBi(a) ∈ AQBi(DB).
This way of defining mappings is inspired by the framework for representing mappings presented in [24]. In that general framework, mapping formulas have the form e1 op e2, where e1 and e2 are expressions over the mapped schemas, and the operator op is well defined with respect to the output types of e1 and e2. Other similar formalisms are the GLAV approach [17] and TGDs [14]. GLAV mappings consist of formulas of the form QA ⊆ QB, where QA and QB are conjunctive queries, and TGDs consist of formulas of the form ∀X(φ(X) → ∃Y ψ(X, Y)), where φ(X) and ψ(X, Y) are conjunctions of atomic formulas. Note that, with respect to GLAV and TGDs, we allow the use of a more expressive class of queries.

The formalization of schemas has been taken from [16]. A similar formalization, but considering only referential and implication constraints, is used in [41]. The following example illustrates the representation of schemas and mappings and motivates the need for validation.

Example 2.1. Consider the schemas and the mapping that are represented graphically in Fig. 1. Schemas A and B are formalized as follows:

[Fig. 1. Graphical representation of the schema mapping in Example 2.1. Schema A contains tables employee(ssn, name, department, salary) and department(dept-id, max-salary), a referential constraint between them, and the constraint salary ≤ max-salary. Schema B contains table emp(ssn, name, salary) and the constraint salary ≥ 2500. The mapping formula is QA1 ⊆ QB1, with queries:
QA1: qA1(X, Y, U) ← employee(X, Y, Z, U) ∧ department(Z, W) ∧ W ≤ 1000
QB1: qB1(X, Y, Z) ← emp(X, Y, Z)]

Schema A = (DRA, ICA), where constraints

ICA = { employee(X, Y, Z, U) → ∃W department(Z, W),
        employee(X, Y, Z, U) ∧ department(Z, W) → U ≤ W }

and deductive rules

DRA = { qA1(X, Y, U) ← employee(X, Y, Z, U) ∧ department(Z, W) ∧ W ≤ 1000 }.

Schema B = (DRB, ICB), where constraints

ICB = { emp(X, Y, Z) → Z ≥ 2500 }

and deductive rules

DRB = { qB1(X, Y, Z) ← emp(X, Y, Z) }.

Mapping M is formalized as follows:

M = (F, A, B), where mapping formulas F = {QA1 ⊆ QB1}.

The deductive rules for the queries in F are those defined in the schemas. Schema A consists of tables employee(ssn, name, department, salary) and department(dept-id, max-salary). The employee table is related to the department table by means of a referential constraint from employee.department to department.dept-id. A constraint states that the salary of an employee may not be greater than the maximum allowed by the corresponding department. Schema B has one table: emp(ssn, name, salary). It also has one constraint, which forces the tuples in the emp table to have a value for column salary greater than or equal to 2500. The mapping formula states that all the employees in schema A assigned to a department with a maximum salary no greater than 1000 must also be emps in schema B.

Consider the following instances of schemas A and B:

DA = {employee(1, 'Joan', 5, 700), department(5, 900)},
DB = {emp(1, 'Joan', 2500)}.

Note that instance DA is consistent with schema A, i.e., it satisfies the integrity constraints of schema A, and that instance DB is consistent with schema B. Both schema A and schema B are thus satisfiable, that is, they have non-empty consistent instances. However, instances DA, DB are not consistent under mapping M. The answer to QA1 over DA is not a subset of the answer to QB1 over DB, i.e.,

AQA1(DA) = {qA1(1, 'Joan', 700)} but AQB1(DB) = {qB1(1, 'Joan', 2500)}.

In fact, no pair of instances of schemas A and B exists such that query QA1 has a non-empty answer, the instances are consistent under mapping M, and they do not violate the integrity constraints of the schemas. That is because constraint


salary ≤ max-salary in schema A and the mapping formula (which maps all employees in departments with a maximum salary ≤ 1000) are in conflict with constraint salary ≥ 2500 in schema B. With our approach, the designer could detect this problem and then refine the mapping to solve it. For the sake of the example, let us suppose that the designer realizes that condition W ≤ 1000 in query QA1 is wrong and decides to replace it with condition W > 3000:

qA1(X, Y, U) ← employee(X, Y, Z, U) ∧ department(Z, W) ∧ W > 3000.

Consider now the following instances:

DA = {employee(1, 'Joan', 5, 3000), department(5, 3500)},
DB = {emp(1, 'Joan', 3000)}.

They do not violate the integrity constraints, and they are consistent under the new mapping, i.e.,

AQA1(DA) = {qA1(1, 'Joan', 3000)} and AQB1(DB) = {qB1(1, 'Joan', 3000)}.

This is an example of how being able to check certain desirable properties may help the designer to get better mappings.
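As an illustration of the consistency check that Example 2.1 performs by hand, the following Python sketch (illustrative only; the function names and the direct evaluation of the comparison are our own choices, not a general rule engine) tests whether two given instances satisfy the original formula QA1 ⊆ QB1:

```python
# Checking whether two instances are consistent under the mapping of
# Example 2.1, i.e., whether QA1 is contained in QB1. Illustrative sketch.

def q_a1(d_a):
    """QA1: employees whose department has a maximum salary of at most 1000."""
    return {(ssn, name, sal)
            for (ssn, name, dept, sal) in d_a["employee"]
            for (dept_id, max_sal) in d_a["department"]
            if dept == dept_id and max_sal <= 1000}

def q_b1(d_b):
    """QB1: simply the emp table."""
    return set(d_b["emp"])

def consistent_under_m(d_a, d_b):
    """Mapping M has the single formula QA1 subseteq QB1."""
    return q_a1(d_a) <= q_b1(d_b)

d_a = {"employee": {(1, "Joan", 5, 700)}, "department": {(5, 900)}}
d_b = {"emp": {(1, "Joan", 2500)}}
print(consistent_under_m(d_a, d_b))  # False: qA1(1,'Joan',700) has no counterpart
```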

3. Checking desirable properties of mappings

In this section, we describe our approach to validating mappings, which consists in checking whether they have certain desirable properties. First, we formalize the desirable properties; second, we explain how the fulfillment of each property can be expressed in terms of predicate liveliness.

A predicate p is lively over schema S if there is a consistent instance of S in which at least one fact about p is true [12,21,41]. In our case, we define schema S in such a way that mapped schemas A and B and mapping M are considered together. We assume that the two original schemas have different predicate names; otherwise, the predicates can simply be renamed. In general, schema S is built by grouping the deductive rules and integrity constraints of the two schemas, and then adding new constraints to make the relationship stated by the mapping explicit. Formally, this is defined as

S = (DRA ∪ DRB, ICA ∪ ICB ∪ ICM),

where ICM is the set of additional constraints that enforce the mapping formulas. For each mapping formula of the form QAi = QBi, the following two constraints are needed in ICM:

qAi(X) → qBi(X),   qBi(X) → qAi(X).

These constraints state that the two queries in the formula must give the same answer, that is, both QAi ⊆ QBi and QBi ⊆ QAi must be true. For the formulas of the form QAi ⊆ QBi, only the first constraint is required:

qAi(X) → qBi(X).

Having defined schema S, we define a new derived predicate p for each property, such that p will be lively over S if and only if the property holds. Below, we describe each property in detail together with its specific reformulation in terms of predicate liveliness.

3.1. Mapping satisfiability

As stated in [23], when constraints are considered in the global schema in a data integration context, it may be the case that the data retrieved from the sources cannot be reconciled in the global schema in such a way that both the constraints of the global schema and the mapping are satisfied. In general, whenever we have a mapping between schemas that have constraints, there may be incompatibilities between the constraints and the mapping, or even between the mapping formulas themselves. Therefore, checking whether there is at least one case in which the mapping and the constraints are satisfied simultaneously is clearly a validation task that should be performed, and this is precisely the aim of this property.

Definition 3.1. We consider a mapping M = (F, A, B) to be satisfiable if there are at least two non-empty instances DA, DB such that they are consistent with schemas A and B, respectively, and are also consistent under M.

Note that the above definition explicitly avoids the trivial case in which DA and DB are both empty sets. However, the formulas in F can still be satisfied trivially. We say that a formula QAi = QBi is satisfied trivially when both AQAi(DA) and AQBi(DB) are empty sets. A formula QAi ⊆ QBi is satisfied trivially when AQAi(DA) is the empty set. Therefore, to really validate the mapping, we may ask whether all of its mapping formulas, or at least one of them, can be satisfied non-trivially.
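The set ICM introduced above can be derived mechanically from the mapping formulas. The following is a hedged Python sketch of that derivation; the string-based constraint representation and the function name are illustrative, not the paper's implementation:

```python
# Building the set ICM of constraints that enforce the mapping formulas
# (Section 3). Constraints are kept as plain strings in the paper's notation.

def mapping_constraints(formulas):
    """formulas: list of (qa, qb, op) with op in {"=", "subseteq"}."""
    icm = []
    for qa, qb, op in formulas:
        icm.append(f"{qa}(X) -> {qb}(X)")       # needed for both = and subseteq
        if op == "=":
            icm.append(f"{qb}(X) -> {qa}(X)")   # the converse, only for =
    return icm

# The mapping of Example 3.3: F = {QA1 = QB1, QA2 = QB2}
print(mapping_constraints([("qA1", "qB1", "="), ("qA2", "qB2", "=")]))
# ['qA1(X) -> qB1(X)', 'qB1(X) -> qA1(X)', 'qA2(X) -> qB2(X)', 'qB2(X) -> qA2(X)']
```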


Definition 3.2. A mapping M = (F, A, B) is strongly satisfiable if all formulas in F are satisfied non-trivially. The mapping is weakly satisfiable if at least one formula in F is satisfied non-trivially.

Example 3.3. Consider the schemas and the mapping shown graphically in Fig. 2. The formalization of the mapped schemas is the following:

Schema A = (DRA, ICA), where constraints

ICA = { category(C, S) → S ≤ 100,
        employee(E, C, H) → ∃S category(C, S) }

and deductive rules

DRA = { qA1(E, H) ← employee(E, C, H) ∧ H > 10,
        qA2(E, S) ← employee(E, C, H) ∧ category(C, S) }.

Schema B = (DRB, ICB), where constraints

ICB = { happy-emp(E, H) ∧ emp(E, S) → S > 200,
        happy-emp(E, H) → ∃S emp(E, S) }

and deductive rules

DRB = { qB1(E, H) ← happy-emp(E, H),
        qB2(E, S) ← emp(E, S) }.

The formalization of the mapping is as follows:

M = (F, A, B), where mapping formulas F = {QA1 = QB1, QA2 = QB2}.

The deductive rules for the queries in F are those defined in the schemas. Schema A has two tables: employee(emp-id, category, happiness-degree) and category(cat-id, salary). The employee table is related to the category table through a referential constraint from employee.category to category.cat-id. The category table has a constraint on salaries, which may not exceed 100. Schema B also has two tables: emp(id, salary) and happy-emp(emp-id, happiness-degree). It has a referential constraint from happy-emp.emp-id to emp.id, and a constraint that states that all happy employees must have a salary of more than 200.

Mapping M links those instances of A and B, say DA and DB, in which the employees in DA with a happiness degree greater than 10 are the same as the happy-emps in DB, and the employees in DA are the same and have the same salary as the emps in DB. We can see that the first formula, i.e., QA1 = QB1 (where qA1(E, H) ← employee(E, C, H) ∧ H > 10 and qB1(E, H) ← happy-emp(E, H)), can only be satisfied trivially. Thus, mapping M is not strongly satisfiable. There are two reasons for this. The first is that all happy-emps in schema B must have a salary of over 200, while all employees in schema A, regardless of their happiness degree, may have a salary of at most 100. The second is that formula

[Fig. 2. Graphical representation of the schema mapping in Example 3.3. Schema A contains tables employee(emp-id, category, happiness-degree) and category(cat-id, salary), a referential constraint between them, and the constraint salary ≤ 100. Schema B contains tables happy-emp(emp-id, happiness-degree) and emp(id, salary), a referential constraint between them, and the constraint happy-emp(E, H) ∧ emp(E, S) → S > 200. The mapping formulas are QA1 = QB1 and QA2 = QB2, with queries:
QA1: qA1(E, H) ← employee(E, C, H) ∧ H > 10
QB1: qB1(E, H) ← happy-emp(E, H)
QA2: qA2(E, S) ← employee(E, C, H) ∧ category(C, S)
QB2: qB2(E, S) ← emp(E, S)]


QA2 = QB2 (where qA2(E, S) ← employee(E, C, H) ∧ category(C, S) and qB2(E, S) ← emp(E, S)) dictates that all employees should have the same salary on both sides of the mapping. In contrast, the second formula is non-trivially satisfiable. Thus, M is weakly satisfiable. This is because there may be employees in A with a happiness degree of 10 or lower, and emps in B that are not happy-emps. It should also be noted that if we removed the second formula from the mapping and just kept the first one, the resulting mapping would be strongly satisfiable. For example, instances

DA = {employee(joan, sales, 20), category(sales, 30)} and
DB = {emp(joan, 300), happy-emp(joan, 20)}

are consistent and satisfy the first mapping formula. The mapping would also become strongly satisfiable if we removed the first formula and kept the second. This is an example of how the satisfiability of a mapping formula may be affected by the rest of the formulas in the mapping.

The mapping satisfiability property for a given mapping M = (F, A, B) can be formulated in terms of predicate liveliness as follows. First, we build the schema that groups schemas A and B and mapping M:

S = (DRA ∪ DRB, ICA ∪ ICB ∪ ICM).

The intuition is that from a consistent instance of S we can get one consistent instance of A and one consistent instance of B, such that they are also consistent under M. Then, we define the distinguished derived predicate in order to check whether it is lively over S. As there are two types of satisfiability, strong and weak, we need to define a predicate for either case. Assuming that F = {f1, ..., fn} and that we want to check strong satisfiability, we define the derived predicate strong_sat as follows:

strong_sat ← qA1(X1) ∧ ... ∧ qAn(Xn),

where the terms in X1, ..., Xn are distinct variables, and qA1, ..., qAn are the predicates defined by QA1, ..., QAn. Intuitively, each QAi query in the mapping must have a non-empty answer in order to satisfy this predicate. The same applies to the QBi queries because of the constraints in ICM that enforce the mapping formulas. Similarly, in the case of weak satisfiability, we would define the derived predicate weak_sat as follows:

weak_sat ← qA1(X1) ∨ ... ∨ qAn(Xn).

However, because the bodies of the rules must be conjunctions of literals, the predicate should be defined using the following deductive rules:

weak_sat ← qA1(X1),
...
weak_sat ← qAn(Xn).

Proposition 3.4. Derived predicate strong_sat/weak_sat is lively over schema S if and only if mapping M is strongly/weakly satisfiable.

Fig. 3 shows the schema we obtain when Example 3.3 is expressed in terms of predicate liveliness. Note that the rule that defines the distinguished predicate strong_sat has been added to the resulting schema S.

Fig. 3. Example 3.3 in terms of predicate liveliness.
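The deductive rules for strong_sat and weak_sat can be generated mechanically from the QAi queries. The following Python sketch (illustrative; rules are plain strings and the arities are the queries' numbers of arguments) shows one way to do it for the mapping of Example 3.3:

```python
# Generating the distinguished-predicate rules of Section 3.1.

def strong_sat_rule(queries):
    """One rule whose body conjoins every qAi with fresh, distinct variables."""
    body, n = [], 0
    for name, arity in queries:
        vars_ = ", ".join(f"X{n + i}" for i in range(arity))
        body.append(f"{name}({vars_})")
        n += arity
    return "strong_sat <- " + " AND ".join(body)

def weak_sat_rules(queries):
    """One rule per qAi, since rule bodies must be conjunctions of literals."""
    return [f"weak_sat <- {name}({', '.join(f'X{i}' for i in range(arity))})"
            for name, arity in queries]

queries = [("qA1", 2), ("qA2", 2)]   # the QAi queries of Example 3.3
print(strong_sat_rule(queries))      # strong_sat <- qA1(X0, X1) AND qA2(X2, X3)
print(weak_sat_rules(queries))
```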


3.2. Mapping inference

The mapping inference property was identified in [24] as an important property of mappings. It consists in checking whether a mapping entails a given mapping formula, that is, whether or not the given formula adds new mapping information. One application of the property is checking whether a formula in the mapping is redundant, that is, whether it is entailed by the other formulas. Another application is checking the equivalence of two different mappings: two mappings are equivalent if the formulas in the first mapping entail the formulas in the second, and vice versa.

The results presented in [24] are in the context of conjunctive queries and schemas without integrity constraints. They show that checking the property in this setting involves finding a maximally contained rewriting and checking two equivalences of conjunctive queries. Here, we consider a broader class of queries and the presence of constraints in the schemas (see Section 2).

Definition 3.5 (see [24]). Let a formula f be defined over schemas A and B. Formula f is inferred from a mapping M = (F, A, B) if every pair of instances of A and B that is consistent under M also satisfies formula f.

Example 3.6. Consider again the schemas from Example 3.3, but without the salary constraints:

Schema A = (DRA, ICA), where constraints

ICA = { employee(E, C, H) → ∃S category(C, S) }

and deductive rules

DRA = { qA1(E, H) ← employee(E, C, H) ∧ H > 10,
        qA2(E, S) ← employee(E, C, H) ∧ category(C, S) }.

Schema B = (DRB, ICB), where constraints

ICB = { happy-emp(E, H) → ∃S emp(E, S) }

and deductive rules

DRB = { qB1(E, H) ← happy-emp(E, H),
        qB2(E, S) ← emp(E, S) }.

Consider a new mapping:

M2 = (F2, B, A), where F2 contains just the mapping formula QB2 = QA2, and queries QB2, QA2 are those already defined in the schemas.

Let f1 be the mapping formula Q1 ⊆ Q2, where Q1 and Q2 are queries defined over schemas B and A, respectively:

q1(E) ← happy-emp(E, H),
q2(E) ← employee(E, C, H).

The referential constraint in schema B guarantees that all happy-emps are also emps, and if the mapping formula in M2 holds, they are also employees in the corresponding instance of schema A. Thus, the information contained in formula f1 is true, namely, the employees' identifiers in the happy-emp table are a subset of those in the employee table. Therefore, mapping M2 entails formula f1. Now, let f2 be the formula Q3 ⊆ Q4, where Q3 and Q4 are defined as follows:

q3(E, H) ← happy-emp(E, H),
q4(E, H) ← employee(E, C, H).

We can see that mapping M2 does not entail formula f2. The difference is that we are not just projecting the employee's identifier as before, but also the happiness degree. Given that the formula in the mapping disregards the happiness degree, we can build a counterexample to show that the entailment of f2 does not hold. This counterexample consists of a pair of instances, say DA and DB, that satisfy the mapping formula in M2 but do not satisfy f2; for instance, the following EDBs:

DA = {employee(0, 0, 5), category(0, 50)} and DB = {emp(0, 50), happy-emp(0, 10)}.

It is not difficult to prove that the formula in mapping M2 holds over DA and DB:

AQB2(DB) = {qB2(0, 50)} and AQA2(DA) = {qA2(0, 50)}.


However, f2 does not hold:

AQ3(DB) = {q3(0, 10)} and AQ4(DA) = {q4(0, 5)}.

Expressing the mapping inference property in terms of predicate liveliness is best done by checking the negation of the property (i.e., the lack of inference) instead of checking the property directly. The negated property states that a certain formula f is not inferred from a mapping M = (F, A, B) if there are two schema instances DA, DB that are consistent under M and do not satisfy f. Therefore, the derived predicate for checking liveliness must be the negation of f. When f has the form Qa = Qb, we define the derived predicate map_inf with the following two deductive rules:

map_inf ← qa(X) ∧ ¬qb(X),
map_inf ← qb(X) ∧ ¬qa(X).

Otherwise, when f has the form Qa ⊆ Qb, only the first deductive rule is needed:

map_inf ← qa(X) ∧ ¬qb(X).

We define the schema S in the usual way: by putting the deductive rules and constraints from schemas A and B together, and by considering additional constraints to enforce the mapping formulas. Formally,

S = (DRA ∪ DRB, ICA ∪ ICB ∪ ICM).
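On concrete instances, the map_inf rules amount to simple set operations on the answers to Qa and Qb. The following Python sketch (illustrative only; the CQC method searches for such instances rather than being given them) replays the counterexample of Example 3.6:

```python
# Testing the map_inf condition on given answer sets: map_inf is satisfied
# exactly when formula f fails on the instances at hand.

def map_inf_holds(answer_qa, answer_qb, op):
    """True iff the instances witness that f is NOT satisfied."""
    if op == "=":
        # map_inf <- qa(X) AND NOT qb(X);  map_inf <- qb(X) AND NOT qa(X)
        return bool(answer_qa - answer_qb) or bool(answer_qb - answer_qa)
    # op == "subseteq": only the first rule is needed
    return bool(answer_qa - answer_qb)

# Counterexample of Example 3.6 for formula f2 = (Q3 subseteq Q4):
a_q3 = {(0, 10)}   # answer to Q3 over DB
a_q4 = {(0, 5)}    # answer to Q4 over DA
print(map_inf_holds(a_q3, a_q4, "subseteq"))  # True: f2 is not inferred
```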

Proposition 3.7. Derived predicate map_inf is lively over schema S if and only if formula f is not inferred from mapping M.

Proof. Let us assume that f takes the form Qa = Qb and that map_inf is lively over S. Then, there is a consistent instance of S for which map_inf is true. It follows that there are two consistent instances DA, DB of schemas A and B, respectively, such that they are also consistent under mapping M. Given that map_inf is true, we can infer that there is a tuple that either belongs to the answer to Qa but not to the answer to Qb, or belongs to the answer to Qb but not to the answer to Qa. Therefore, this pair of instances does not satisfy f.

Conversely, let us assume that there are two consistent instances DA, DB that are also consistent under mapping M, but that do not satisfy formula f. It follows that there is a consistent instance of S for which formula f does not hold. That means Qa ≠ Qb, i.e., either Qa ⊈ Qb or Qb ⊈ Qa. We can therefore conclude that map_inf is true over this instance of S, and so map_inf is lively over S. The proof for the case in which f takes the form Qa ⊆ Qb can be directly obtained from this. □

Fig. 4 shows the schema that results from the reformulation of Example 3.6 in terms of predicate liveliness, for the case of testing whether formula f1 is inferred from mapping M2. Note the presence of the deductive rules that define the queries from formula f1 (Q1 and Q2) and the rule that defines the distinguished predicate map_inf.

[Fig. 4. Example 3.6 in terms of predicate liveliness.]

3.3. Query answerability

In this section, we consider the query answerability property, which was also described in [24] as an important property of mappings. The reasoning behind this property is that a mapping that is partial or incomplete may nevertheless be successfully used for certain tasks. These tasks are represented by means of certain queries. The property checks whether the mapping enables the answering of these queries over the schemas being mapped. While the previous two properties are intended to validate the mapping without considering its context, this property validates the mapping with regard to the use for which it has been designed.



As with mapping inference, the results presented in [24] are in the context of conjunctive queries without constraints on the schemas. They show that the property can be checked by means of the existence of an equivalent rewriting. As in the previous case, we consider a broader class of queries and the presence of constraints in the schemas.

The intuition behind the property is that, given a mapping M = (F, A, B) and a query Q defined over schema A, it checks whether every consistent instance of B uniquely determines the answer to Q over A. In other words, if the property holds for query Q, and DA, DB are two instances consistent under M, we may compute the exact answer to Q over DA using only the tuples in DB.

Definition 3.8 (see [24]). Let Q be a query over schema A. Mapping M = (F, A, B) enables query answering of Q if for every consistent instance DB of schema B, AQ(DA) = AQ(D'A) for every pair DA, D'A of consistent instances of schema A that are also consistent under M with DB.

Example 3.9. Consider again the schemas from the previous example:

Schema A = (DRA, ICA), where constraints

ICA = { employee(E, C, H) → ∃S category(C, S) }

and deductive rules

DRA = { qA1(E, H) ← employee(E, C, H) ∧ H > 10,
        qA2(E, S) ← employee(E, C, H) ∧ category(C, S) }.

Schema B = (DRB, ICB), where constraints

ICB = { happy-emp(E, H) → ∃S emp(E, S) }

and deductive rules

DRB = { qB1(E, H) ← happy-emp(E, H),
        qB2(E, S) ← emp(E, S) }.

Consider also the mapping M from Example 3.3:

M = (F, A, B), where mapping formulas F = {QA1 = QB1, QA2 = QB2},

and a query Q defined over schema A:

q(E) ← employee(E, C, H) ∧ H > 5.

We can see that mapping M does not enable the answering of query Q. The mapping only deals with those employees in schema A who have a happiness degree greater than 10, while the evaluation of query Q must also have access to the employees with a happiness degree of between 5 and 10. Thus, we can build a counterexample that will consist of three consistent instances: one instance DB of schema B and two instances DA, D'A of schema A. Instances DA, D'A will be consistent under mapping M with DB, but the answer to Q will not be the same in both instances, i.e., AQ(DA) ≠ AQ(D'A). These criteria are satisfied, for instance, by the following EDBs:

DB = {emp(0, 150), emp(1, 200), happy-emp(1, 15)},
DA = {employee(0, 0, 6), employee(1, 1, 15), category(0, 150), category(1, 200)} and
D'A = {employee(0, 0, 4), employee(1, 1, 15), category(0, 150), category(1, 120)}.

We can easily state that this is indeed a counterexample because

AQ(DA) = {q(0), q(1)} but AQ(D'A) = {q(1)}.

The previous example illustrates that query answerability is also easier to check by means of its negation. Therefore, as two instances of schema A must be found in order to build a counterexample, we must extend the definition of schema S as follows:

S = (DRA ∪ DR'A ∪ DRB, ICA ∪ IC'A ∪ ICB ∪ ICM ∪ IC'M),

where A' = (DR'A, IC'A) is a copy of schema A = (DRA, ICA) in which we rename all the predicates (e.g., q is renamed q') and, similarly, M' = (F', A', B) is a copy of mapping M = (F, A, B) in which we rename the predicates in the mapping formulas that are predicates from schema A. The intuition is that having two copies of schema A allows us to get from one single instance of schema S the two instances of A that are necessary for building the counterexample.
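Constructing A' is a purely syntactic renaming. The following is a small Python sketch of it; the rule encoding and the "_p" priming convention are our own illustrative choices:

```python
# Building the renamed copy A' of schema A for the query answerability test.
# A rule is ((head, head_args), [(pred, args), ...]).

def rename_predicate(pred, a_predicates):
    return pred + "_p" if pred in a_predicates else pred

def copy_schema(rules, a_predicates):
    """Return the rules of A' given the rules of A."""
    return [((rename_predicate(h, a_predicates), hargs),
             [(rename_predicate(p, a_predicates), args) for p, args in body])
            for (h, hargs), body in rules]

# Query Q of Example 3.9: q(E) <- employee(E, C, H) AND H > 5 (comparison omitted)
rules = [(("q", ["E"]), [("employee", ["E", "C", "H"])])]
print(copy_schema(rules, {"q", "employee", "category"}))
# [(('q_p', ['E']), [('employee_p', ['E', 'C', 'H'])])]
```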


We then check the liveliness of the derived predicate q_answer, which we define using the following deductive rule:

q_answer ← q(X) ∧ ¬q'(X),

where q is the predicate defined by query Q and q' is its renamed version over schema A'. Intuitively, this predicate can only be satisfied by an instance of S from which we can get two instances of A that do not have the same answer for query Q, i.e., there is a fact q(a) that is true on one instance but not on the other.

Proposition 3.10. Derived predicate q_answer is lively over schema S if and only if mapping M does not enable query answering of Q.

Proof. Let us assume that q_answer is lively over S. This means that there is a consistent instance of S in which q_answer is true. It follows that there is an instance DB of B, an instance DA of A, and an instance D'A of A', such that they are all consistent and DA and D'A are also both consistent with DB under mappings M and M', respectively. As q_answer is true, we can infer that there is a ground fact q(a) such that q(a) ∈ AQ(DA) and q'(a) ∉ AQ'(D'A). Given that schema A' is in fact a copy of schema A, we can conclude that for each instance of schema A' there is an identical one of schema A. Thus, D'A can be seen as an instance of A, in such a way that if it was previously consistent with DB under M', it is now also consistent with DB under M and, for each previous q'(a) in the answer to Q', there is now a q(a) in the answer to Q. Therefore, we have found two instances DA, D'A of schema A such that AQ(DA) ≠ AQ(D'A), both consistent under mapping M with a given instance DB of schema B. We can thus conclude that M does not enable query answering of Q. The other direction can easily be proved by inverting this line of reasoning. □

Fig. 5 shows the schema that we get when we reformulate Example 3.9 in terms of predicate liveliness. Note that the deductive rule that defines the distinguished predicate q_answer has been added to the resulting schema S, and that the rules corresponding to queries Q and Q' have been added to DRA and DR'A, respectively.

[Fig. 5. Example 3.9 in terms of predicate liveliness.]

3.4. Mapping losslessness

As we have seen, query answerability determines whether two mapped schemas are equivalent with respect to a given query, in the sense that the query would obtain the same answer on both sides. However, in certain contexts this may be too restrictive. Consider data integration [23], for instance, and assume that for security reasons it must be known whether any sensitive local data is exposed by the integrator system. Clearly, in such a situation, the query intended to retrieve such sensitive data from a source is not expected to obtain the exact answer that would be obtained if the query were directly executed over the global schema. Therefore, such a query is not answerable under the terms specified above. Nevertheless, sensitive local data are in fact exposed if the query can be computed over the global schema. Thus, a new property that is able to deal with this situation is needed.

In fact, when a mapping contains formulas of the form QAi ⊆ QBi, checking query answerability does not always provide the designer with useful information. Let us illustrate this with an example.

Example 3.11. Consider the schemas used in the previous two examples:



Schema A = (DRA, ICA), where constraints

ICA = { employee(E, C, H) → ∃S category(C, S) }

and deductive rules

DRA = { qA1(E, H) ← employee(E, C, H) ∧ H > 10,
        qA2(E, S) ← employee(E, C, H) ∧ category(C, S) }.

Schema B = (DRB, ICB), where constraints

ICB = { happy-emp(E, H) → ∃S emp(E, S) }

and deductive rules

DRB = { qB1(E, H) ← happy-emp(E, H),
        qB2(E, S) ← emp(E, S) }.

Consider also the following query Q:

q(E) ← employee(E, C, H).

It is not difficult to see that the mapping M from Example 3.3 enables query answering of Q.

M = (F, A, B), where mapping formulas F = {QA1 = QB1, QA2 = QB2}.

The parameter query Q selects all the employees in schema A, and mapping M states that the employees in A are the same as the emps in B. Thus, when an instance of schema B is given, the extension of the employee table in schema A is uniquely determined, as is the answer to Q. Consider now the following mapping:

M3 = (F3, A, B), where mapping formulas F3 = {QA1 ⊆ QB1, QA2 ⊆ QB2}.

Note that mapping M3 is similar to the previous mapping M, but the = operator has been replaced with the ⊆ operator. If we consider mapping M3, the extension of the employee table is not uniquely determined when an instance of B is given. In fact, it can be any subset of the emps in the given instance of B. For example, let DB be an instance of schema B such that

DB = {emp(0, 70), emp(1, 40)}

and let DA and D'A be two instances of schema A such that

DA = {employee(0, 0, 5), category(0, 70)} and D'A = {employee(1, 0, 5), category(0, 40)}.

We have come up with a counterexample and may thus conclude that mapping M3 does not enable query answering of Q.

The above example shows that when the mapping contains formulas of the form QAi ⊆ QBi, an instance of schema B does not generally determine the answer to a certain query Q over schema A. This is because, over a given instance of B, there is just one possible answer for each query QB1, ..., QBn in the mapping but, due to the ⊆ operator, more than one possible answer to the queries QA1, ..., QAn. A similar result would be obtained if Q were defined over schema B. Thus, query answerability does not generally hold for mappings of this type. Intuitively, we can say that this is because query answerability holds when we are able to compute the exact answer for Q over an instance DA using only the tuples in the corresponding mapped instance DB. However, if any of the mapping formulas contain the ⊆ operator, we cannot know which tuples are also in DA just by looking at DB, and we are therefore unable to compute an exact answer.

To deal with this, we define the mapping losslessness property, which, informally speaking, checks whether all pieces of data that are needed to answer a given query Q over schema A are captured by the mapping M = (F, A, B), in such a way that they have a counterpart in schema B. In other words, if DA and DB are two instances consistent under mapping M and the property holds for query Q, an answer to Q may be computed using the tuples in DB, although it is not necessarily the same answer we would obtain by evaluating Q directly over DA.

Definition 3.12. Let Q be a query defined over schema A. We say that a mapping M = (F, A, B) is lossless with respect to Q if, for every pair of consistent instances DA, D'A of schema A, both the existence of an instance DB that is consistent under M with DA and with D'A and the fact that AQAi(DA) = AQAi(D'A) for each query QAi in the mapping imply that AQ(DA) = AQ(D'A).

Example 3.13. Consider once again the schemas A and B used in the previous examples, and the mapping M3 and the query Q from Example 3.11:


Schema A = (DRA, ICA), where constraints

ICA = { employee(E, C, H) → ∃S category(C, S) }

and deductive rules

DRA = { qA1(E, H) ← employee(E, C, H) ∧ H > 10,
        qA2(E, S) ← employee(E, C, H) ∧ category(C, S) }.

Schema B = (DRB, ICB), where constraints

ICB = { happy-emp(E, H) → ∃S emp(E, S) }

and deductive rules

DRB = { qB1(E, H) ← happy-emp(E, H),
        qB2(E, S) ← emp(E, S) }.

Mapping M3 = (F3, A, B), where mapping formulas F3 = {QA1 ⊆ QB1, QA2 ⊆ QB2}.

Query Q = { q(E) ← employee(E, C, H) }.

We saw that the query answerability property does not hold for mapping M3. Let us now check the mapping losslessness property. Assume that we have two consistent instances DA, D'A of schema A, and a consistent instance DB of schema B that is consistent under M3 with both instances of A. Assume also that the answers to QA1 and QA2 are exactly the same over DA and D'A. Now suppose that Q obtains q(0) over DA but not over D'A. According to the definition of Q, it follows that DA contains at least one employee tuple, say employee(0, 0, 12), which D'A does not contain. Since DA is consistent with the integrity constraints, it must also contain the corresponding category tuple, say category(0, 20). Therefore, according to the definition of QA2, the answer qA2(0, 20) would be obtained over DA but not over D'A. This clearly contradicts our initial assumption. Mapping M3 is thus lossless with respect to Q.

To formulate the mapping losslessness property in terms of predicate liveliness, we define the schema S in a similar way as we did for query answerability:

S = (DRA ∪ DR'A ∪ DRB, ICA ∪ IC'A ∪ ICB ∪ ICM ∪ ICL),

where schema A' = (DR'A, IC'A) is a copy of schema A = (DRA, ICA) in which the predicates are renamed, and ICM is the set of constraints that enforce the formulas in mapping M = (F, A, B). We use ICL to denote the set of constraints that force A and A' to share the same answers for the QAi queries in the mapping:

ICL = { qA1(X1) → q'A1(X1), q'A1(X1) → qA1(X1),
        ...
        qAn(Xn) → q'An(Xn), q'An(Xn) → qAn(Xn) }.

Let Q be the query over schema A that we wish to use for checking losslessness, and let Q' be the copy of Q over schema A'. We define the derived predicate map_loss as follows:

map_loss ← q(X) ∧ ¬q'(X).

The intuition is that this predicate can only be satisfied by an instance of S from which we can get two instances of A that have the same answers for the QAi queries (because of ICL) but not for the query Q (because of the deductive rule). We are thus checking for the existence of a counterexample.

Proposition 3.14. Derived predicate map_loss is lively over schema S if and only if mapping M is not lossless with respect to query Q.

Proof. Let us assume that map_loss is lively over S. Hence, there is an instance of S in which map_loss is true. This means that the answer to Q has a tuple that is not in the answer to Q'. Based on the instance of S, we can thus build an instance DA for schema A, an instance D'A for schema A', and an instance DB for schema B. Given that A and A' are in fact the same schema with different predicate names, and that queries Q and Q' are the same query, we can conclude that D'A is also an instance of schema A, and that query Q evaluated over DA returns a tuple that is not returned when it is evaluated over D'A. Thus, we have two instances of schema A, both of which are consistent under mapping M with a third instance of schema B. Furthermore, the two instances of schema A have the same answer for the queries in the mapping but not for query Q. According to the definition of mapping losslessness, M is not lossless with respect to Q. The other direction can be proved by inverting the line of reasoning. □

Fig. 6 shows the reformulation of Example 3.13 in terms of predicate liveliness.

[Fig. 6. Example 3.13 in terms of predicate liveliness.]

The mapping losslessness property is the result of adapting the property of view losslessness, or determinacy [10,30,35]. Under the exact view assumption, a set of views V is lossless with respect to a query Q if every pair of database instances that has the same extension for the views in V also has the same answer for Q. Therefore, we can say that mapping losslessness checks whether the set of queries V = {QA1, ..., QAn} is lossless with respect to query Q, with the additional requirement that the extensions for the queries in V must be such that there is a consistent instance of schema B that is also consistent under the mapping with them.
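Like ICM, the set ICL can be generated mechanically from the QAi queries. A brief Python sketch under the same illustrative string-based representation as before:

```python
# Building the set ICL that forces schema A and its copy A' to agree on the
# answers to the mapping queries QA1, ..., QAn (Section 3.4). The priming
# convention is illustrative.

def losslessness_constraints(query_preds):
    icl = []
    for q in query_preds:
        icl.append(f"{q}(X) -> {q}'(X)")   # A's answer contained in A''s
        icl.append(f"{q}'(X) -> {q}(X)")   # and vice versa
    return icl

# For mapping M3 of Example 3.13, with queries qA1 and qA2 on the A side:
print(losslessness_constraints(["qA1", "qA2"]))
# ["qA1(X) -> qA1'(X)", "qA1'(X) -> qA1(X)", "qA2(X) -> qA2'(X)", "qA2'(X) -> qA2(X)"]
```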



It can be seen that in the cases in which all the formulas in the mapping take the form QAi = QBi, mapping losslessness and query answerability (Section 3.3) are equivalent properties.

Proposition 3.15. Let Q be a query over schema A, and let M = (F, A, B) be a mapping where F = {f1, ..., fn} and fi is QAi = QBi for 1 ≤ i ≤ n. Mapping M is lossless with respect to Q if and only if M enables query answering of Q.

Proof. Let us assume that mapping M is lossless with respect to query Q, and let us also assume that mapping M does not enable query answering of Q. By the negation of query answerability, there is an instance DB of schema B and two instances DA, D'A of schema A such that DA and D'A are both consistent under M with DB but AQ(DA) ≠ AQ(D'A). Given that all mapping formulas take the form QAi = QBi, AQAi(DA) = AQAi(D'A) is true for 1 ≤ i ≤ n. Hence, instances DA, D'A and DB form a counterexample for mapping losslessness, and a contradiction is thus reached.

Let us now assume that mapping M enables query answering of Q, and let us also assume that M is not lossless with respect to Q. By the negation of losslessness, there are two instances DA, D'A of A such that AQAi(DA) = AQAi(D'A) for 1 ≤ i ≤ n, and there is also an instance DB of B that is consistent under M with both instances of A. It must also be true that AQ(DA) ≠ AQ(D'A). In this case, the three instances form a counterexample for query answerability. Thus, a contradiction is again reached. □

4. Decidability and complexity issues

The high expressiveness of the queries and schemas considered in this paper makes the problem of query liveliness undecidable in the general case (this can be shown by reduction from query containment [21]). Possible sources of undecidability are the presence of recursively-defined derived predicates, and the presence of "axioms of infinity" [9] or "embedded TGDs" [34]. For this reason, if we use the CQC method to check the properties of mappings defined in Section 3, the method may not terminate. However, we propose a pragmatic solution that ensures the method's termination and makes the approach useful in practice.

Intuitively, the aim of the CQC method is to construct an example that proves that the query being checked is lively. In [16], it is proved that the CQC method is sound and complete in the following terms:

– Failure soundness: If the method terminates without building any example, then the tested query is not lively.
– Finite success soundness: If the method builds a finite example when queries contain no recursively-defined derived predicates, then the tested query is lively.
– Failure completeness: If the tested query is not lively, then the method terminates reporting its failure to build an example, when queries contain no recursively-defined derived predicates.
– Finite success completeness: If there is a finite example, the method finds it and terminates, either when recursively-defined derived predicates are not considered or when recursion and negation occur together in a strict-stratified manner.

Therefore, the CQC method does not terminate only when there are no finite examples but there are infinite ones. If there is a finite example, the CQC method finds it and terminates, and if the tested query is not lively, the method fails finitely and terminates. One way to always ensure termination is to directly prevent the CQC method from constructing infinite EDBs.
This can be done by restricting the maximum number of constants used in the method's execution (see Section 5.1). Once that maximum is reached, the EDB currently under construction is considered "failed", since it probably does not lead to a finite example, and the next relevant EDB is tried. At the end of the process, if no solution has been found, the method reports to the user that no example requiring fewer constants than the maximum exists. The designer can then choose to repeat the test with a greater maximum; if the situation persists, this may be an indicator that there is no finite database in which the tested query is lively.
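A minimal sketch of this bound-and-prune strategy, assuming a generic depth-first exploration of candidate EDBs; expand, is_solution and the fact encoding are hypothetical stand-ins, not the SVT's actual interface:

```python
# Bound the number of distinct constants an EDB may use; treat any branch
# that would exceed the bound as failed. Illustrative sketch only.

def bounded_search(expand, is_solution, start, max_constants):
    """Depth-first search over candidate EDBs, pruned at the constant bound.

    expand(edb) yields the relevant successor EDBs; is_solution(edb) tests
    whether edb satisfies the goal without violating any constraint.
    Facts are (predicate, args) tuples."""
    exhausted_by_bound = False
    stack = [start]
    while stack:
        edb = stack.pop()
        constants = {c for _, args in edb for c in args}
        if len(constants) > max_constants:
            exhausted_by_bound = True      # probably leads to an infinite EDB
            continue                       # prune: try the next relevant EDB
        if is_solution(edb):
            return edb
        stack.extend(expand(edb))
    if exhausted_by_bound:
        print(f"no example with at most {max_constants} constants exists; "
              "retry with a greater maximum")
    return None

# Toy use: look for an EDB containing a fact r(c) with c > 5, where expand
# proposes one new constant at a time.
def expand(edb):
    yield frozenset(edb | {("r", (len(edb),))})

def is_solution(edb):
    return any(p == "r" and args[0] > 5 for p, args in edb)

print(bounded_search(expand, is_solution, frozenset(), max_constants=3))
# prints the bound-exhausted message and None; max_constants=10 finds {r(6), ...}
```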


Such a solution could be regarded as some kind of "trickery"; however, it is pragmatic in the sense that no "real" database is supposed to store an infinite number of tuples.

Another key point regarding complexity is showing that, for the class of schemas we consider (see Section 2), expressing the desirable properties of mappings in terms of predicate liveliness does not increase their complexity. For instance, the liveliness problem can be reduced to that of mapping losslessness. Let us assume that we want to check whether p is lively over A. Let A' be a copy of A, and let p' be the copy of p over A'. Let us define q over A as a copy of p, but with a contradiction (e.g., the built-in literal 1 = 0) added to the body of all its deductive rules (q is thus not lively). Let M be a mapping between A and A', with one single mapping formula: q ⊆ p'. If p is lively, there is a consistent instance D of schema A in which a fact about p is true. Instance D together with the empty instances of A and A' forms a counterexample for losslessness of M with respect to p. If p is not lively, then no counterexample exists: if there were one, we would have two instances of A giving different answers for p, so p would be lively and we would reach a contradiction. Similar reductions can be made from predicate liveliness to query answerability, mapping inference, and mapping satisfiability.

5. Experimental evaluation

The main goal of this section is to show the feasibility of our approach by means of a number of experiments. We used the CQC method [16], specifically its implementation in the Schema Validation Tool (SVT) prototype [39], to perform these tests in different situations. We first provide a brief overview of the CQC method and the SVT and explain how they can be used to perform liveliness tests and thus to check the desirable properties we defined in Section 3. We then describe the experiments and comment on the results.

5.1. CQC method and SVT

The CQC (constructive query containment) method [16], originally defined for query containment, performs a validation test by trying to build a consistent instance for a database schema in order to satisfy a given goal (a conjunction of literals). It is able to deal with database schemas that have integrity constraints, safe-negated EDB and IDB literals, and comparisons.

The CQC method starts from the empty instance and uses different variable instantiation patterns (VIPs), based on the syntactic properties of the views/queries and constraints in the schema, to generate only the relevant facts that are to be added to the instance under construction. If the method is able to build an instance that satisfies all literals in the goal and does not violate any of the constraints, then that instance is a solution and it proves that the goal is satisfiable. The key point is that the VIPs guarantee that if the variables in the goal are instantiated using the constants they provide and the method does not find any solution, then no solution is possible. The two major VIPs are the Negation VIP, which is applied when all built-in literals in the schema are = or ≠ comparisons, and the Order VIP, which is applied when the schema contains order comparisons. The Negation VIP works as follows: a given variable X can be instantiated with one of the constants already used (those in the schema definition and those provided by previous applications of the VIP) or with a fresh constant.
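The following Python sketch illustrates the Negation VIP choices just described (our own illustrative encoding, not the SVT code); it reproduces the instantiations of Example 5.1 below:

```python
from itertools import count

_fresh = count(0)   # global supply of fresh constants

def negation_vip(used_constants):
    """Yield the relevant instantiations for one variable under the
    Negation VIP: every constant already used, plus one fresh constant."""
    yield from sorted(used_constants)
    yield next(c for c in _fresh if c not in used_constants)

# Instantiating R(X, Y) starting from no used constants, as in Example 5.1:
used = set()
x = next(negation_vip(used))        # no used constants yet: fresh constant 0
used.add(x)
for y in negation_vip(used):        # reuse 0, or take fresh constant 1
    print(f"R({x}, {y})")           # prints R(0, 0) and R(0, 1)
```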
The Order VIP also gives the choice of reusing a constant or using a fresh one. However, in the latter case, the fresh constant may be either greater or lower than all those previously used, or it may fall between two previously used constants.

Example 5.1. Let us assume that the method must instantiate the relational atom R(X, Y) using the Negation VIP, and that the set of used constants is initially empty. The possible instantiations would be R(0, 0) and R(0, 1). As variable X is instantiated first, the only option is to use a fresh constant, e.g., the constant 0. There are then two possibilities for instantiating variable Y: using the constant 0 again, or using a fresh constant, e.g., the constant 1.

As proven in [16], the CQC method always terminates when the deductive rules have no recursion and either the goal is unsatisfiable or there is a finite consistent instance that satisfies the goal.

The SVT [39] is a prototype tool designed to perform validation tests on database schemas, in particular the liveliness tests in which we are interested here. It accepts the logic formalism we defined in Section 2 and the following subset of the SQL language:

– Primary key, foreign key and Boolean check constraints.
– SPJ views, negation, subselects (exists, in) and union.
– Data types: integer, real, string.

The current implementation of the SVT assumes a set semantics for views and queries and does not allow null values, aggregates or arithmetic functions. The SVT implements the CQC method as a backtracking algorithm: it adds facts to the EDB under construction in order to make the literals in the goal true. After adding a new fact, it checks whether the EDB violates any constraint. When it detects that a constraint has been violated or that a literal in the goal evaluates to false (e.g., a comparison), it backtracks and reconsiders the last decision.


Some constraints, like foreign keys, can be repaired by adding new literals to the goal; no backtracking is required in these cases.

Example 5.2. Let us assume that we are executing the SVT with the following goal:

R(X, Y) ∧ Y > 5 ∧ ¬R(X, Y).

The first step would be to instantiate R(X, Y) and check that no constraint has been violated. Let us assume that the Negation VIP applied to X and the Order VIP applied to Y provide three instantiations, R(0, 4), R(0, 5) and R(0, 6), and that none of them violates any of the constraints. When one of the first two instantiations is used, the SVT finds that the second literal is false; it must therefore backtrack and try the next instantiation. When it finally tries the last one, i.e., R(0, 6), the comparison 6 > 5 is true; consequently, only one literal remains, i.e., ¬R(0, 6). As it is a negated literal, the SVT turns it into a new constraint, i.e., R(X, Y) → X ≠ 0 ∨ Y ≠ 6, which is obviously violated by the current instance D = {R(0, 6)}. The SVT must now reconsider the last decision, but as it has already tried all the instantiations for R(X, Y), the search ends and it can be concluded that there is no solution.
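To make the search concrete, here is a toy re-enactment of Example 5.2 (a sketch of ours; the actual SVT search is far more general). It tries the three candidate instantiations and fails on all of them, reproducing the "no solution" outcome:

```python
# Candidate instantiations for R(X, Y) as in Example 5.2: the Negation VIP
# supplies 0 for X, and the Order VIP supplies 4, 5 and 6 for Y (around the
# constant 5 that appears in the goal).
candidates = [(0, 4), (0, 5), (0, 6)]

for x, y in candidates:
    edb = {("R", x, y)}     # add the fact that makes the positive literal true
    if not y > 5:           # built-in literal Y > 5 is false: backtrack
        continue
    if ("R", x, y) in edb:  # the negated literal, turned into a constraint,
        continue            # is violated by the fact just added: backtrack
    print("solution:", edb)
    break
else:
    print("no solution")    # all instantiations tried: the goal is unsatisfiable
```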
Using the CQC method, and thus the SVT, to check the properties of mappings is easy once we have redefined them in terms of predicate liveliness. We must merely provide a schema and a goal. The schema will be schema S; we explained how to construct this schema in Section 3. The goal will be just one literal, which corresponds to the derived predicate that we defined in Section 3 for each property.

5.2. Experiments

We experimentally evaluated the behavior of our approach for validating mappings by means of a number of experiments using the SVT. We executed the experiments on an Intel Core 2 Duo, 2.16 GHz machine with Windows XP (SP2) and 2 GB of RAM. Each experiment was repeated three times and we report the average of the three trials. The experiments were designed to measure the influence of two parameters: (1) the number of formulas in the mapping, and (2) the number of relational atoms in the queries (positive ordinary literals with a base predicate). We focused on the setting in which the two mapped schemas are of similar difficulty (i.e., a similar number and size of tables with a similar number and class of constraints), as are the queries on each side of the mapping.

We designed the scenario for the experiments using the relational schema of the Mondial database [29], which models geographic information. The schema consists of 28 tables with 38 foreign keys. We consider each table with its corresponding primary key, unique and foreign key constraints. The scenario consists of two copies of the Mondial database schema that play the roles of schema A and schema B, respectively. The fact that both schemas are copies of a single schema has no real effect on the performance of the SVT; what is actually relevant is the difficulty of the schemas. The mapping between the two schemas varies from experiment to experiment, but mapping formulas always take the form Q_i^A = Q_i^B. It must be remembered that equality formulas are expressed by means of two constraints, while formulas with the ⊆ operator require only one. Thus, we are considering the most general setting.

Fig. 7 compares the performance of the properties strong satisfiability, mapping losslessness, mapping inference and weak satisfiability when the number of formulas in the mapping varies. The property of query answerability (not shown in the graphic) would show the same behavior as mapping losslessness (this generally happens when the two mapped schemas are of similar difficulty). Fig. 7 focuses on the case in which the distinguished predicate that describes the fulfillment of the corresponding property (see Section 3) is lively. It must be remembered that the fact that the tested predicate is lively has a different meaning depending on the property under consideration: for the two satisfiability properties, it means that they hold, while for the other three properties it means that they do not.

In this experiment, the queries in the mapping take the form q_i^A(X) ← R^A(X) and q_i^B(X) ← R^B(X), where R^A is a table randomly selected from schema A and R^B is its counterpart in schema B. The two variants of mapping satisfiability, strong and weak, can be checked without any changes in the mapping because both already hold. Instead, in order to ensure that mapping inference does not hold, we tested the property with respect to the following formula: Q1 = Q2. Queries Q1 and Q2 are defined as follows: q1(X) ← R^A(X) ∧ X_i ≠ K_1 and q2(X) ← R^B(X) ∧ X_i ≠ K_2, where X_i ∈ X, K_1 and K_2 are different constants, R^A is one of the tables that appear in the definition of the Q_i^A queries, and R^B is the counterpart of R^A. We built this formula by taking one of the formulas in the mapping and adding the inequalities to make it non-inferable. We added an inequality on each side of the formula to ensure that both sides were equally difficult. In the case of mapping losslessness, we used a parameter query Q that selects all the tuples from a table T randomly selected from schema A. To ensure that the mapping is not lossless with respect to Q, we modified the formula that maps T to its counterpart in B in such a way that the two queries in the formula projected all the columns but one.

We can see in Fig. 7 that the strong version of mapping satisfiability is slower than the weak one. This is expected, since strong satisfiability requires that all formulas be checked to ensure that all of them can be satisfied non-trivially, whereas weak satisfiability can stop after finding one. It can also be seen that strong satisfiability has clearly higher running times than mapping losslessness and mapping inference. This is because these two properties have an additional parameter (a query and a formula, respectively), and in order to check them the SVT only has to deal with the fragment of the schemas and mapping that is "mentioned" by the parameter query/formula, whereas strong satisfiability has to deal with the whole part of the schema that participates in the mapping.


[Fig. 7: line chart of running time (sec) versus the number of mapping formulas, with one relational atom in each query, for the case in which there is a solution for the liveliness test; series: strong sat., map. losslessness, map. inference, weak sat.]

Fig. 7. Comparison in performance of the properties when the number of mapping formulas varies and the distinguished predicate is lively.

Fig. 7 also shows that mapping losslessness has higher times than mapping inference. This is expected, given the difference in the size of schema S between the two cases.

Fig. 8 shows the same experiment as the previous figure, but now for the case in which the distinguished predicate is not lively. To make the two satisfiability properties fail in this second experiment, we added, to each table in schema A, a check constraint requiring that one of the columns be greater than a fresh constant. We added the same constraint to the counterpart of the table in schema B. We also modified the mapping formulas in such a way that the two queries in each formula were forced to select the tuples that violated the check constraints. In the case of mapping inference, we used one of the formulas already in the mapping as the parameter formula, and in the case of mapping losslessness, one of the Q_i^A queries in the mapping as the parameter query.

The first thing we can observe in Fig. 8 is the overall increase in all computing times. The reason is that the SVT must try all the relevant instances provided by the VIPs before concluding that there is no solution, whereas in the previous experiment the search stopped as soon as a solution was found. It is worth noting that strong satisfiability and weak satisfiability have exchanged roles: the weak version of the property is now slower than the strong one. Intuitively, this is because strong satisfiability may stop as soon as it finds a formula that cannot be satisfied non-trivially, while weak satisfiability must continue to search until all the formulas have been considered.

Figs. 9 and 10 compare the performance of the properties when the number of relational atoms in each query varies. Fig. 9 focuses on the case in which the liveliness tests have a solution, and Fig. 10 on the case in which they do not. In both experiments, the mapping has 14 formulas and its Q_i^A queries follow the pattern q_i^A(X) ← R_1^A(X_1) ∧ … ∧ R_n^A(X_n), where R_1^A, …, R_n^A are n tables randomly selected from schema A, and each query Q_i^B is the counterpart over schema B of the corresponding query Q_i^A.

[Fig. 8: line chart of running time (sec) versus the number of mapping formulas, with one relational atom in each query, for the case in which there is no solution for the liveliness test; series: strong sat., map. losslessness, map. inference, weak sat.]

Fig. 8. Comparison in performance of the properties when the number of mapping formulas varies and the distinguished predicate is not lively.


To ensure that in Fig. 9 the liveliness tests had a solution and that in Fig. 10 they did not, we proceeded in a similar way as in the previous two experiments. To ensure that mapping inference in Fig. 9 did not hold, we used one of the formulas in the mapping as the parameter formula but added one inequality on each side. In the case of mapping losslessness, we modified one of the formulas in the mapping in such a way that its queries projected all the columns but one, and we then used the original query Q_i^A from the formula as the parameter query. In Fig. 10, to make the mapping unsatisfiable, we added a check constraint (column X must be greater than a constant K) to each table, and modified the mapping formulas so that their queries selected the tuples that violated the check constraints. To ensure that mapping inference and mapping losslessness held in Fig. 10, we used one of the formulas/queries in the mapping as the parameter formula/query.

Fig. 9 shows similar results to those in Fig. 7 for all the properties. However, Fig. 10 shows a rapid degradation of time for mapping losslessness and mapping inference, while weak and strong satisfiability show similar behavior to that in Fig. 8 (only weak satisfiability is shown in the graphic). As mentioned above, times are expected to be higher when there is no solution, since all possible instantiations for the literals in the definition of the tested predicate must be checked. The reason why mapping inference and losslessness are so sensitive and grow so quickly with the addition of new literals can be seen by looking at the bodies of the rules that define their derived predicates (see Section 3), which follow the pattern q(X) ∧ ¬p(X). In order to determine whether or not the predicate map_inf/map_loss is lively, the unfolding of q(X) must first be fully instantiated using the constants provided by the VIPs, e.g., q(ā) ← r_1(ā_1) ∧ … ∧ r_n(ā_n), and it must then be checked whether or not ¬p(ā) is true. If it is false, we must try another instantiation, and so on until all possible ones have been checked. Thanks to the VIPs there is a finite number of possible instantiations for the unfolding of q(X), but this number is exponential with respect to the number of literals in the definition of q. However, this is a very straightforward implementation of the CQC method. Many of these instantiations could be avoided by using the knowledge that can be obtained from the violations that occur during the search. We therefore believe that further research on optimizing the CQC method would reduce these running times considerably.
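To get a feel for the growth, consider a simplified model in which each of the n variables in the unfolding of q(X) either reuses one of the constants introduced so far or takes a fresh one, as the Negation VIP prescribes. The following sketch is ours and counts per variable rather than per literal, ignoring constraints and comparisons:

```python
def count_instantiations(n, used=0):
    """Number of instantiation patterns for n variables when each variable
    either reuses one of the `used` constants or introduces a fresh one
    (a simplified model of the Negation VIP)."""
    if n == 0:
        return 1
    # reuse any of the constants introduced so far, or add a fresh one
    return used * count_instantiations(n - 1, used) \
        + count_instantiations(n - 1, used + 1)

for n in range(1, 8):
    print(n, count_instantiations(n))  # 1, 2, 5, 15, 52, 203, 877
```

Even under this simplification the counts are the Bell numbers, which grow faster than exponentially; this is consistent with the rapid degradation observed in Fig. 10.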

[Fig. 9: line chart of running time (sec) versus the number of relational atoms in each query (1 to 4), mapping with 14 formulas, for the case in which there is a solution for the liveliness test; series: strong sat., map. losslessness, map. inference, weak sat.]

Fig. 9. Comparison in performance of the properties when the number of relational atoms in each query varies and the distinguished predicate is lively.

[Fig. 10: line chart of running time (sec) versus the number of relational atoms in each query (1 to 4), mapping with 14 formulas, for the case in which there is no solution for the liveliness test; series: map. losslessness, map. inference, weak sat.]

Fig. 10. Comparison in performance of the properties when the number of relational atoms in each query varies and the distinguished predicate is not lively.


[Fig. 11: line chart of running time (sec) versus the number of relational atoms in each query (1 to 4), mapping with 14 formulas, strong satisfiability with a solution for the liveliness test; series: 3 negations, 2 negations, 1 negation, no negations.]

Fig. 11. Effect on strong satisfiability of increasing the number of negated atoms in each query.

In Fig. 10, the mapping satisfiability properties are not greatly affected by the addition of new literals because not all the literals have to be instantiated before concluding that the corresponding predicate is not lively: once a contradiction is found, only the literals that have been instantiated so far must be reconsidered.

Fig. 11 studies the effect on strong satisfiability of adding negated literals to each query in the mapping, for the setting in which the distinguished predicate is lively and the number of positive relational atoms in each query varies. To perform this experiment, we added a conjunction of s negated literals ¬T_1^A(Y_1) ∧ … ∧ ¬T_s^A(Y_s) to each mapping query, where the T_i^A are tables randomly selected from schema A that do not already appear in the definition of the corresponding query Q_i^A, and each variable in Y_i appears in a positive atom of Q_i^A. As the CQC method turns the negations in the goal into integrity constraints, adding negated literals makes it more probable that a violation occurs during the search; the SVT thus has to perform more backtracking, which results in an increase in computing time.

Fig. 12 studies the effect on strong satisfiability of adding built-in literals (instead of the negations added in the previous experiment) to each query in the mapping. The built-in literals added to the queries are comparisons of the form X > K, where X is a variable corresponding to a column of one of the tables in the query's definition, and K is a constant.
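Each such constant K enlarges the set of candidate values that the Order VIP must consider when instantiating a variable. As a rough illustration (our own sketch, following the description of the Order VIP in Section 5.1; the SVT's actual constant selection may differ), the candidates for a single variable could be enumerated as follows:

```python
def order_vip_candidates(used):
    """Candidate values for one variable under the Order VIP: reuse a used
    constant, or take a fresh value lower than all used constants, greater
    than all of them, or falling between two consecutive ones."""
    cs = sorted(used)
    if not cs:
        return [0]                    # no constants used yet: any fresh value
    candidates = list(cs)             # reuse an already-used constant
    candidates.append(cs[0] - 1)      # fresh, lower than all used constants
    candidates.append(cs[-1] + 1)     # fresh, greater than all used constants
    for a, b in zip(cs, cs[1:]):      # fresh, between two consecutive constants
        candidates.append((a + b) / 2)
    return candidates

print(order_vip_candidates({5}))      # [5, 4, 6]
```

With {5} as the set of used constants, this yields the values 5, 4 and 6, i.e., the candidates used for Y in Example 5.2; every additional constant in the schema adds further candidates.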

[Fig. 12: line chart of running time (sec) versus the number of relational atoms in each query (1 to 4), mapping with 14 formulas, strong satisfiability with a solution for the liveliness test; series: 3 comparisons, 2 comparisons, 1 comparison, no comparisons.]

Fig. 12. Effect on strong satisfiability of increasing the number of comparisons in each query.

Table 1
Summary of experiments

| Tested properties | Liveliness tests | # mapping formulas | # relational atoms per query | # deductive rules | # constraints | # negated literals in the schema | # order comparisons in the schema |
|---|---|---|---|---|---|---|---|
| strong sat., weak sat., map. inference, map. losslessness | Have solution | Varies from 1 to 28 | 1 | Between 133 and 229 | Between 192 and 397 | 0, 1 or 2 (depending on the property) | 0 |
| strong sat., weak sat., map. inference, map. losslessness | Have no solution | Varies from 1 to 28 | 1 | Between 133 and 229 | Between 192 and 397 | 0, 1 or 2 (depending on the property) | Between 0 and 112 |
| strong sat., weak sat., map. inference, map. losslessness | Have solution | 14 | Varies from 1 to 4 | Between 105 and 159 | Between 218 and 341 | 0, 1 or 2 (depending on the property) | 0 |
| weak sat., map. inference, map. losslessness | Have no solution | 14 | Varies from 1 to 4 | Between 108 and 159 | Between 218 and 341 | 1 or 2 (depending on the property) | Between 0 and 112 |
| strong sat. | Have solution | 14 | Varies from 1 to 4 | 105 | 218 | Varies from 0 to 84 | 0 |
| strong sat. | Have solution | 14 | Varies from 1 to 4 | 105 | 218 | 0 | Varies from 0 to 84 |

Fig. 12 shows a clear increase in times when there are more than two comparisons per query. The reason for this increase is the growth in the number of constants provided by the VIPs: the VIPs use the constants in the schema to determine the relevant instantiations, so adding comparisons with constants increases the number of possible instantiations and, consequently, the running time. Moreover, the more comparisons there are, the more pronounced the increase in time.

Table 1 summarizes the experiments performed in this section. It indicates which properties have been checked in each case, whether or not the corresponding distinguished predicates were lively, and other relevant features of the different scenarios.

6. Related work

A debugger for schema mappings is presented in [1,11]. The approach of [11] is based on the idea of routes, which describe the relationships between source and target data with the schema mapping. For the sake of an example, let us suppose that the target instance contains a tuple t2. A route for this tuple might be that t2 has been generated by the mapping formula m1 (a TGD) when applied over a tuple t1 belonging to the source instance. The authors present algorithms for computing one route or all routes for a selected set of target tuples. This feature is intended to allow the user to explore and understand a given schema mapping. The main difference with respect to our approach is that we do not need a source instance and a target instance in order to perform the validation. In our approach, only the mapping and the mapped schemas need to be provided, and we are therefore able to reason over the mapping itself rather than relying on a specific instance that may not reveal all the potential pitfalls. Another difference that should be highlighted is that we deal with a more general class of mappings than TGDs: we consider mappings defined by means of queries that may have negations and comparisons, and schemas that may have views and integrity constraints more expressive than TGDs or EGDs (see Section 2).

Nevertheless, the work in [11] can be seen as complementary to our approach. Indeed, if the method used to check liveliness not only provides a Boolean answer but also an example of an instance in which the tested predicate is lively, as the CQC method does, it would be well worth using the routes to allow the designer to understand the example. This would be especially useful for those properties checked by means of their negation, as the counterexample provided would help the designer to figure out why the checked property does not hold. The following example illustrates this.

Example 6.1. Let us consider a schema mapping taken from [11]. It consists of a set of source-to-target tuple-generating dependencies (TGDs) between two relational schemas. Fig. 13 shows a graphical representation of the scenario. The source schema contains the tables Cards(cardNo, limit, ssn, name, maidenName, salary, location) and SupplementaryCards(accNo, ssn, name, address), with a referential constraint (a TGD) from SupplementaryCards.accNo to Cards.cardNo. The target schema has two tables, Accounts(accNo, limit, accHolder) and Clients(ssn, name, maidenName, income, address), with a referential constraint from Accounts.accHolder to Clients.ssn, and another one from Clients.ssn to Accounts.accHolder.


[Fig. 13: source schema with tables Cards(cardNo, limit, ssn, name, maidenName, salary, location) and SupplementaryCards(accNo, ssn, name, address) linked by a referential constraint; target schema with tables Accounts(accNo, limit, accHolder) and Clients(ssn, name, maidenName, income, address) linked by referential constraints; mapping formulas m1 and m2 drawn from source to target.]

Fig. 13. Graphical representation of a schema mapping taken from [11] and used in Example 6.1.

The mapping has the following TGDs:

m1: Cards(cn, l, s, n, m, sal, loc) → ∃A (Accounts(cn, l, s) ∧ Clients(s, m, m, sal, A)),
m2: SupplementaryCards(an, s, n, a) → ∃M ∃I Clients(s, n, M, I, a).

Let us suppose that we want to be sure that the mapping captures all the information about cards. We can achieve this by using our approach to check whether the mapping is lossless with respect to the query Q: q(X) ← Cards(X). In this case, the answer will be negative and, if we are using the CQC method as the underlying liveliness method, we will obtain a counterexample that consists of two instances for the source schema and one instance for the target. Let us assume that we get the following instances:

D_source = {s1: Cards(0, 0, 0, 'A', 'A', 0, 'A')},
D′_source = {s2: Cards(0, 0, 0, 'B', 'A', 0, 'A')},
D_target = {t1: Accounts(0, 0, 0), t2: Clients(0, 'A', 'A', 0, 'A')}.

We select the tuples in the target instance (t1 and t2) and compute their routes for the case in which the source instance is D_source and for the case in which it is D′_source. We get the routes s1 → m1 → {t1, t2} and s2 → m1 → {t1, t2}, and therefore we realize that m1 is generating the same output for two different source tuples. As s1 and s2 differ in their fourth attribute, we discover that Cards.maidenName has been incorrectly mapped to Clients.name. We correct m1 as follows:

m1: Cards(cn, l, s, n, m, sal, loc) → ∃A (Accounts(cn, l, s) ∧ Clients(s, n, m, sal, A)).

Now, as we have modified the mapping, we should validate it again. Let us suppose that we check the losslessness of the mapping with respect to query Q once more. For this example, the answer will again be negative, and we will get the following instances as a counterexample:

D_source = {s1: Cards(0, 0, 0, 'A', 'A', 0, 'A')},
D′_source = {s2: Cards(0, 0, 0, 'A', 'A', 0, 'B')},
D_target = {t1: Accounts(0, 0, 0), t2: Clients(0, 'A', 'A', 0, 'A')}.

The routes for t1 and t2 are as in the previous case: s1 → m1 → {t1, t2} and s2 → m1 → {t1, t2}. Thus, we discover that m1 still has an error: Cards.location is disregarded by m1. The corrected version of the TGD would be:

m1: Cards(cn, l, s, n, m, sal, loc) → Accounts(cn, l, s) ∧ Clients(s, n, m, sal, loc).

A new check of the losslessness property would reveal that the mapping now captures all the information about cards.

Verification of schema mappings has also been addressed in [8], where a system for verifying the quality of mappings in the data exchange context is presented. The approach of [8] is based on the structural analysis of candidate mappings (TGDs) in order to choose the ones that represent better transformations of the source into the target. Instances of the source and target schemas are required. A candidate mapping is executed over the source instance, in such a way that a new instance for the target schema is obtained. This instance is compared with the available target instance by transforming both instances into electrical circuits, collecting a small number of descriptive parameters from these circuits, and using these parameters to obtain a measure of the similarity of the two data structures. At the end of the process, the user gets a ranked list of mappings, suggesting which ones are believed to better reproduce the target.

The user could apply our approach to check the validation properties of Section 3 over the candidate mappings in the ranked list provided by the system of [8]. The validation information provided by the desirable properties, combined with the


similarity measure, may help the user to choose and refine the final mapping. For example, the designer might be interested in focusing on the best-scored candidates that preserve some information that is relevant for the intended use of the mapping.

In [40], the authors propose a framework for understanding and refining mappings in Clio. It aims to make it easier for users to build a mapping by means of examples: samples of a given data source that are carefully selected to help users understand the mapping and choose between alternative mappings. Our approach can be used in conjunction with this framework because it is always difficult for users to be acquainted with all possible questions; it would therefore be useful if users were able to confirm that certain desirable properties hold for a given mapping. The main difference between our work and the approach of [40] is that we focus on checking certain specific desirable properties, not necessarily in the data exchange context. As we said before, we do not assume that a data source is available.

The methodology proposed in [37] for the integration of spatio-temporal databases includes a step that validates the inter-schema mappings against the source schemas. Both schemas and mappings are expressed in Description Logics (DL), and the reasoning services of DL are used to ensure that all the concepts (unary predicates) in the DL models describe non-empty sets of instances. This check is therefore similar to our strong satisfiability property. However, as [37] is based on DL, it cannot use comparisons and allows only a limited form of negation, i.e., negation can only be applied to unary predicates. Other differences with respect to our approach are that [37] does not consider weak satisfiability or any other property, and that our reformulation in terms of predicate liveliness allows us to use one single method to check all the validation properties defined in Section 3.

In [2], the authors address the problem of information preservation in XML-to-relational mapping schemes. A mapping scheme consists of a procedure for storing XML documents in a relational database and a procedure for recovering the documents. Compared with our mappings, these mapping schemes are mappings between models (the XML model and the relational model in this case), while our mappings are between two specific relational schemas. Our approach is related to this work in that the authors define two properties of information preservation for mapping schemes. They define validating mapping schemes as those in which valid documents can be mapped into legal databases and in which all legal databases are mappings of valid documents. They also define lossless mapping schemes as those that preserve the structure and content of the XML documents, that is, those for which any document can be reconstructed from its relational image. Note that despite its name, this property is not the same as our mapping losslessness property; in fact, our losslessness property assumes that mappings may be partial or incomplete, and thus not lossless in the sense of [2]. The authors show that testing these two properties is undecidable for the class of mapping schemes they address (XDS mapping schemes). They also propose an XML-to-relational mapping scheme that is both lossless and validating.

Also in the context of information preservation, [6] focuses on mappings between XML data models (DTDs).
It addresses the query preservation of XML mappings as one of the criteria for information preservation. A mapping is query preserving with respect to a query language if all the possible queries on the source schema that can be expressed in this language can also be computed on the target schema. The results in [6] show that query preservation is undecidable for XML mappings defined in a simple fragment of XQuery or XSLT. This notion of query-preserving mappings is very similar to that of lossless mapping schemes in [2] and, as we already discussed for the latter, it is not the same property as our mapping losslessness: query preservation is defined with respect to a query language, while our mapping losslessness is defined with respect to a given query. That is, it checks whether the mapping preserves a given query (not all the possible ones), and it assumes that mappings may be partial or incomplete, and so not query preserving.

Other related work regarding the validation properties has already been described in the corresponding sections.

7. Conclusions and future work

We have proposed a new approach for validating schema mappings, which relies on determining the fulfillment of certain desirable properties of these mappings. We considered two properties that had already been identified in the literature [24]: mapping inference and query answerability. We also considered another property, mapping satisfiability (strong and weak), and introduced a new one: mapping losslessness. We have shown how all these properties may be established by checking the liveliness of a distinguished derived predicate in a new schema that integrates the two mapped schemas and the mapping. We have also shown that our mapping losslessness property is an extension of the query answerability property, and that it can be used to obtain useful validation information in those cases in which query answerability does not hold. Moreover, we have shown that when all the mapping formulas are of a certain type, the two properties are actually equivalent.

We have shown that our CQC method can be used to check the four validation properties of mappings. We have also discussed a pragmatic solution to deal with the semidecidability of the predicate liveliness problem in the class of schemas and mappings we consider; this solution ensures the termination of the CQC method. Finally, we have described the results of a number of experiments performed using an implementation of the CQC method to validate the four properties in different scenarios.


In the future, it would be interesting to find new desirable properties of mappings that capture validation information not covered by the four properties discussed here and that would be helpful for mapping designers. Moreover, further research on optimizing the implementation of the CQC method also proves to be of interest. We also envisage extending our approach to mappings beyond the relational-to-relational setting by considering additional classes of schemas (e.g., XML, object-oriented, etc.).

References

[1] B. Alexe, L. Chiticariu, W.-C. Tan, SPIDER: a Schema mapPIng DEbuggeR, in: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), 2006, pp. 1179–1182.
[2] D. Barbosa, J. Freire, A.O. Mendelzon, Information preservation in XML-to-relational mappings, in: Proceedings of the Database and XML Technologies, Second International XML Database Symposium (XSym'04), LNCS 3186, Springer, 2004, pp. 66–81.
[3] P.A. Bernstein, Applying model management to classical meta data problems, in: First Biennial Conference on Innovative Data Systems Research (CIDR'03), 2003.
[4] P.A. Bernstein, T.J. Green, S. Melnik, A. Nash, Implementing mapping composition, in: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), 2006, pp. 55–66.
[5] P.A. Bernstein, S. Melnik, Model management 2.0: manipulating richer mappings, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007, pp. 1–12.
[6] P. Bohannon, W. Fan, M. Flaster, P.P.S. Narayan, Information preserving XML schema embedding, in: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05), 2005, pp. 85–96.
[7] A. Bonifati, E.Q. Chang, T. Ho, L.V.S. Lakshmanan, R. Pottinger, HePToX: marrying XML and heterogeneity in your P2P databases, in: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05), 2005, pp. 1267–1270.
[8] A. Bonifati, G. Mecca, A. Pappalardo, S. Raunich, G. Summa, Schema mapping verification: the spicy way, in: Proceedings of the 11th International Conference on Extending Database Technology (EDBT'08), 2008, pp. 85–96.
[9] F. Bry, R. Manthey, Checking consistency of database constraints: a logical basis, in: Proceedings of the 12th International Conference on Very Large Data Bases (VLDB'86), 1986, pp. 13–20.
[10] D. Calvanese, G. De Giacomo, M. Lenzerini, M.Y. Vardi, Lossless regular views, in: Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'02), 2002, pp. 247–258.
[11] L. Chiticariu, W.-C. Tan, Debugging schema mappings with routes, in: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), 2006, pp. 79–90.
[12] H. Decker, E. Teniente, T. Urpí, How to tackle schema validation by view updating, in: Proceedings of the 5th International Conference on Extending Database Technology (EDBT'96), LNCS 1057, Springer, 1996, pp. 535–549.
[13] H.H. Do, S. Melnik, E. Rahm, Comparison of schema matching evaluations, in: Web, Web-Services, and Database Systems, LNCS 2593, Springer, 2003, pp. 221–237.
[14] R. Fagin, P.G. Kolaitis, R.J. Miller, L. Popa, Data exchange: semantics and query answering, Theoretical Computer Science 336 (1) (2005) 89–124.
[15] R. Fagin, P.G. Kolaitis, L. Popa, W.-C. Tan, Composing schema mappings: second-order dependencies to the rescue, ACM Transactions on Database Systems 30 (4) (2005) 994–1055.
[16] C. Farré, E. Teniente, T. Urpí, Checking query containment with the CQC method, Data & Knowledge Engineering 53 (2) (2005) 163–223.
[17] M. Friedman, A.Y. Levy, T.D. Millstein, Navigational plans for data integration, in: Proceedings of the 16th National Conference on Artificial Intelligence and 11th Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI'99), 1999, pp. 67–73.
[18] A. Fuxman, M.A. Hernández, H. Ho, R.J. Miller, P. Papotti, L. Popa, Nested mappings: schema mapping reloaded, in: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), 2006, pp. 67–78.
[19] L.M. Haas, M.A. Hernández, H. Ho, L. Popa, M. Roth, Clio grows up: from research prototype to industrial tool, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2005, pp. 805–810.
[20] A.Y. Halevy, Z.G. Ives, J. Madhavan, P. Mork, D. Suciu, I. Tatarinov, The piazza peer data management system, IEEE Transactions on Knowledge and Data Engineering 16 (7) (2004) 787–798.
[21] A.Y. Halevy, I.S. Mumick, Y. Sagiv, O. Shmueli, Static analysis in datalog extensions, Journal of the ACM 48 (5) (2001) 971–1012.
[22] P.G. Kolaitis, Schema mappings, data exchange, and metadata management, in: Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'05), 2005, pp. 61–75.
[23] M. Lenzerini, Data integration: a theoretical perspective, in: Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'02), 2002, pp. 233–246.
[24] J. Madhavan, P.A. Bernstein, P. Domingos, A.Y. Halevy, Representing and reasoning about mappings between domain models, in: Proceedings of the 18th National Conference on Artificial Intelligence and 14th Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI'02), 2002, pp. 80–86.
[25] J. Madhavan, A.Y. Halevy, Composing mappings among data sources, in: Proceedings of the 29th International Conference on Very Large Data Bases (VLDB'03), 2003, pp. 572–583.
[26] Altova MapForce.
[27] S. Melnik, P.A. Bernstein, A.Y. Halevy, E. Rahm, Supporting executable mappings in model management, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2005, pp. 167–178.
[28] R.J. Miller, L.M. Haas, M.A. Hernández, Schema mapping as query discovery, in: Proceedings of the 26th International Conference on Very Large Data Bases (VLDB'00), 2000, pp. 77–88.
[29] The Mondial database.
[30] A. Nash, L. Segoufin, V. Vianu, Determinacy and rewriting of conjunctive queries using views: a progress report, in: Proceedings of the 11th International Conference on Database Theory (ICDT'07), LNCS 4353, Springer, 2007, pp. 59–73.
[31] Object Management Group, MOF 2.0 QVT Final Adopted Specification, 2007.
[32] L. Popa, Y. Velegrakis, R.J. Miller, M.A. Hernández, R. Fagin, Translating web data, in: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), 2002, pp. 598–609.
[33] E. Rahm, P.A. Bernstein, A survey of approaches to automatic schema matching, VLDB Journal 10 (4) (2001) 334–350.
[34] Y. Sagiv, Optimizing datalog programs, in: Foundations of Deductive Databases and Logic Programming, Morgan Kaufmann, 1988, pp. 659–698.
[35] L. Segoufin, V. Vianu, Views and queries: determinacy and rewriting, in: Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'05), 2005, pp. 49–60.
[36] P. Shvaiko, J. Euzenat, A survey of schema-based matching approaches, Journal on Data Semantics IV, LNCS 3730, Springer, 2005, pp. 146–171.
[37] A. Sotnykova, C. Vangenot, N. Cullot, N. Bennacer, M.-A. Aufaure, Semantic mappings in description logics for spatio-temporal database schema integration, Journal on Data Semantics III, LNCS 3534, Springer, 2005, pp. 143–167.
[38] Stylus Studio.
[39] E. Teniente, C. Farré, T. Urpí, C. Beltrán, D. Gañán, SVT: schema validation tool for Microsoft SQL Server, in: Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04), 2004, pp. 1349–1352.


[40] L. Yan, R.J. Miller, L.M. Haas, R. Fagin, Data-driven understanding and refinement of schema mappings, in: Proceedings of the ACM SIGMOD Conference, 2001, pp. 485–496.
[41] X. Zhang, Z.M. Özsoyoglu, Implication and referential constraints: a new formal reasoning, IEEE Transactions on Knowledge and Data Engineering 9 (6) (1997) 894–910.

Guillem Rull is currently a Ph.D. student at the Department of Software of the Technical University of Catalonia in Barcelona. He received a scholarship from the Microsoft Research European Ph.D. Scholarship Programme in 2006. His current research interests include schema and mapping validation.

Carles Farré is currently an associate lecturer at the Department of Software of the Technical University of Catalonia in Barcelona. He received his Ph.D. degree from the Technical University of Catalonia in 2003. His research interests include conceptual modelling, schema validation, abduction and query containment.

Ernest Teniente is currently an associate professor at the Department of Software of the Technical University of Catalonia in Barcelona. He received his Ph.D. degree from the Technical University of Catalonia in 1992. He has also been a visiting researcher at the Politecnico di Milano and at the Università di Roma Tre. He has worked on deductive databases, database updates and integrity constraint maintenance. His current research interests include conceptual modelling, schema validation, abduction and query containment.

Toni Urpí is currently an associate professor at the Department of Software of the Technical University of Catalonia in Barcelona. He received his Ph.D. degree from the Technical University of Catalonia in 1993. He has worked on deductive databases, database updates and integrity constraint maintenance. His current research interests include conceptual modelling, schema validation, abduction and query containment.