Attribute equivalence in global schema design for heterogeneous distributed databases


Inform. Systems Vol. 9, No. 3/4, pp. 237-240, 1984. Printed in the U.S.A.
0304-4379/84 $3.00 + .00 Pergamon Press Ltd.

ATTRIBUTE EQUIVALENCE IN GLOBAL SCHEMA DESIGN FOR HETEROGENEOUS DISTRIBUTED DATABASES

WOLFGANG EFFELSBERG
IBM Scientific Center, Tiergartenstrasse 15, 6900 Heidelberg, West Germany

and

MICHAEL V. MANNINO
Computer and Information Sciences Dept., University of Florida, Gainesville, FL 32611, U.S.A.

(Received 5 September 1983; in revised form 6 February 1984)

Abstract: A global schema is a single, connected view of heterogeneous databases. Past research into the problem of global schema design has demonstrated the use of generalization for connecting disjoint schemas. Before generalization can be applied, the common attributes of the local schemas must be identified. We use the term attribute equivalence for the identification of such common attributes. This paper defines four types of attribute equivalences. The distinction between local and global equivalence is explained, and special attention is given to key attributes. We also discuss the placement of locally equivalent attributes in a global schema.

1. INTRODUCTION

In recent years, there has been increasing interest in heterogeneous database management systems (HDBMSs). An HDBMS such as Multibase [1] supports data retrieval from a network of heterogeneous, distributed databases by providing an interface layer on top of existing database management systems. A user interacts with the network using a single query language and a single global schema. Thus, an HDBMS provides a uniform way to access heterogeneous databases, while protecting an organization’s investment in existing software.

A global schema is an ordinary schema formatted according to the rules of a common or unifying data model (UDM). The underlying local schemas are transformed into equivalent schemas in the UDM to eliminate data model differences. Mapping rules define how global schema objects such as entity types and attributes are derived from the local UDM schema objects. In this context, GLOBAL SCHEMA DESIGN is the process of designing the global schema and the mapping rules from a given set of local UDM schemas (Fig. 1).

Previous work on global schema design by Motro and Buneman [2, 3] and Dayal and Hwang [4] demonstrated the use of generalization for connecting local schemas. Motro and Buneman use equivalence assertions about attributes as input to their algorithms, which infer when to apply generalization. Dayal and Hwang devised a view definition language for constructing a global schema and mapping rules. They require a common key before applying any of their generalization operators. Neither group attempted to define the concept of attribute equivalence.

The primary purpose of this paper is to define attribute equivalence so that the global schema design process can be better understood. We define four types of attribute equivalences with an emphasis on the distinction between local and global equivalence. We also present strategies for placing locally equivalent attributes in a global schema. Both the notion of attribute equivalence and the placement strategies are incorporated into our global schema design methodology [5, 6].

This paper is organized as follows. Section 2 defines the four types of attribute equivalences. Section 3 presents strategies for placing locally equivalent attributes in a global schema. Section 4 summarizes and discusses future extensions to this work.

[Fig. 1. High-level view of global schema design. Diagram labels: DESIGNER, FEEDBACK, DESIGN.]

2. ATTRIBUTE EQUIVALENCE

The problem of attribute equivalence can be understood in the context of a simple example (Fig. 2). Assume entity types EMP1 and EMP2 are in databases 1 and 2, respectively. In order to form the supertype EMP, the subtypes EMP1 and EMP2 must have a common key so that we can determine whether two employee occurrences from different databases describe the same real-world employee. In addition, we have to determine whether any attributes common to the subtypes EMP1 and EMP2 should be assigned to EMP. So the critical design questions are: (1) What local attributes are semantically equivalent? (2) What attributes can be merged in the global schema? and (3) What is a common key? These three questions are the essence of attribute equivalence.

Determining whether two attributes are semantically equivalent is a decision of the designer. There is no way to deduce the equivalence of two attributes by using simple rules such as similar names or formats. For example, consider the attributes ETHNIC-GROUP from database 1 and RACE from database 2. ETHNIC-GROUP is defined as an integer, while RACE is defined as a character string. Despite their different formats, they can still be semantically equivalent. As a counter-example, consider the attributes FAMILY-CODE from Fig. 2. Even if these attributes have the same name and type (e.g. a two character field), they may not be equivalent. Only a skilled designer with knowledge of the databases of interest can determine the semantic equivalence of two attributes. Semantic equivalence will be the minimum requirement for further consideration of a pair of attributes in this paper.

The second question involves the scope property (global or local). Scope in a global schema environment involves the range of an attribute’s meaning just as scope in a programming context involves the range of a variable’s meaning. An attribute has meaning at least within its defining entity type. An attribute’s meaning can also extend to other entity types within the same database and to other databases. For example, the two NAME attributes from Fig. 2 have the same scope because a person’s name is the same in both databases. The two SAL attributes have different scopes because each represents the salary of an employee within one organization. We say that two attributes are GLOBALLY EQUIVALENT if they are semantically equivalent and have the same scope. The attributes are LOCALLY EQUIVALENT if they are semantically equivalent but have different scopes.

The third question addresses the problem of finding unique identifiers (keys) in the global schema. Since the same real-world object can exist in several local databases, we need a mechanism to identify such common entity occurrences.

The semantic equivalence, scope and key properties are applied in the following two definitions. These two definitions imply four types of attribute equivalences: globally equivalent keys, locally equivalent keys, globally equivalent non-keys and locally equivalent non-keys. The four types are based on the type of attribute (key, non-key) and the range of meaning (global, local). Semantic equivalence is taken as a minimum condition for considering a pair of attributes. The definition for key attributes is presented first, followed by the definition for non-key attributes. Note that an attribute is considered “key” if it is a primary or candidate key and non-key otherwise.

DEFINITION 1

Two semantically equivalent key attributes are GLOBALLY EQUIVALENT if the primary key property holds over the extension of both attributes after domain conversions; otherwise they are locally equivalent.

[Fig. 2. Employees example. Supertype EMP with subtypes EMP1 (ENO, NAME, SAL, ETHNIC GROUP, FAMILY CODE) and EMP2 (ENO, SSNO, NAME, SAL, RACE, FAMILY CODE).
ENO: employee number
SSNO: Social Security Number
NAME: name of employee
SAL: local salary (for one location)
ETHNIC GROUP / RACE: different names, same meaning
FAMILY CODE: same name, different meanings]
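Definition 1 can be read as a test over pairs of entity occurrences drawn from the two databases. The following Python sketch is only an illustration under stated assumptions: the function `keys_globally_equivalent`, the `same_entity` oracle (standing in for the designer's knowledge of which occurrences describe the same real-world object), the sample data, and the presence of a social security number in both databases are all hypothetical and not part of the paper. A test over current data can refute global equivalence but cannot prove it for all future extensions; that remains the designer's assertion.

```python
from itertools import product


def keys_globally_equivalent(ext1, ext2, conv1, conv2, same_entity):
    """Test the primary key property of Definition 1 across two extensions.

    ext1, ext2  -- lists of (entity, local key value) pairs, one per database
    conv1/conv2 -- domain conversions into a common key domain
    same_entity -- hypothetical designer-supplied oracle: do two occurrences
                   describe the same real-world object?
    Each attribute is assumed to already be a key within its own database.
    """
    for (ent1, key1), (ent2, key2) in product(ext1, ext2):
        keys_match = conv1(key1) == conv2(key2)
        # Identical entities must share a converted key value, and
        # different entities must have different converted key values.
        if same_entity(ent1, ent2) != keys_match:
            return False
    return True


def identity(value):
    return value


def dashed_to_int(text):
    # Hypothetical domain conversion: "987-65-4321" -> 987654321
    return int(text.replace("-", ""))


def same_person(a, b):
    return a == b


# Employee numbers that are unique only within each database.
eno_db1 = [("alice", 1), ("bob", 2)]
eno_db2 = [("bob", 1), ("carol", 2)]

# A universal identifier stored as an integer in one database and as a
# dashed string in the other (assumed to exist in both for this sketch).
ssno_db1 = [("alice", 123456789), ("bob", 987654321)]
ssno_db2 = [("bob", "987-65-4321"), ("carol", "555-44-3333")]

eno_global = keys_globally_equivalent(
    eno_db1, eno_db2, identity, identity, same_person)
ssno_global = keys_globally_equivalent(
    ssno_db1, ssno_db2, identity, dashed_to_int, same_person)

print("ENO globally equivalent:", eno_global)
print("SSNO globally equivalent:", ssno_global)
```

In this sample the employee numbers fail the test because ENO 1 denotes different people in the two databases, while the converted social security numbers pass.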

For the primary key property to hold, every pair of identical entities must have the same converted key value. Likewise, every pair of different entities must have different converted key values. The clause “after domain conversions” implies that attributes may use different representations in their local databases. For example, ETHNIC GROUP of EMP1 may be defined as an integer, while RACE of EMP2 is defined as a character string. Such differences can be resolved by applying the operators defined in [5] or [4].

Few attribute pairs will meet the conditions of a global key unless some previous emphasis was given to industry-wide or universal identification. Such attributes include social security numbers, universal product codes, etc. In the employee example, the ENO attributes would not meet the definition of a global key if the employee numbers were unique only within their respective databases.

As another example of these difficulties, consider the case of attributes in the format of old Computing Surveys reference identifiers. A reference id uniquely identifies a citable reference such as a book or journal article. A reference id is formed by taking the first four characters of the lead author’s last name, the last two digits of the publication year, followed by a lower case letter if the first two components are not unique by themselves. Two reference id attributes from different databases are probably not globally equivalent even though by definition they are constructed by the same rules, e.g., SMIT81a in one database may not identify the same reference as SMIT81a in another database. The reference id value of a given entity occurrence partially depends on the existing reference id values at the time the entity was created.

Two non-key attributes can also be globally or locally equivalent as stated in the following definition.
DEFINITION 2 Two semantically equivalent non-key attributes are globally equivalent if every pair of identical entities has identical values for the attributes after domain conversions; otherwise they are locally equivalent.

Let us return to Fig. 2 for examples that clarify the previous definition. For the ETHNIC GROUP/RACE attributes, every employee common to EMP1 and EMP2 should have the same converted value. Therefore, these two attributes are globally equivalent. Such is not the case for the salary attributes if EMP1 and EMP2 are entity types in databases of different companies. Employees common to both companies will likely have different salary values. Therefore, these attributes are locally equivalent. This concept of attribute equivalence is an important input to our global schema design process.
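The non-key case of Definition 2 can be sketched in the same spirit. The Python fragment below is a minimal illustration, not the authors' procedure: the function `classify_nonkey`, the code table `ETHNIC_CODE_TO_RACE`, and all sample values are hypothetical. It compares converted values only for employees assumed to be present in both EMP1 and EMP2. If no common employees exist, the loop finds no counter-example, so the data alone cannot settle the question and the designer's judgment about scope still decides.

```python
# Hypothetical code table converting the integer ETHNIC GROUP codes of
# EMP1 into the character-string domain used by RACE in EMP2.
ETHNIC_CODE_TO_RACE = {1: "A", 2: "B", 3: "C"}


def classify_nonkey(common_pairs, conv1, conv2):
    """Classify a semantically equivalent non-key attribute pair.

    common_pairs -- (value in database 1, value in database 2) for each
                    entity known to occur in both databases
    conv1/conv2  -- domain conversions into a common domain
    """
    for value1, value2 in common_pairs:
        if conv1(value1) != conv2(value2):
            # One common entity with differing values is enough to rule
            # out global equivalence.
            return "locally equivalent"
    return "globally equivalent"


# Values for two employees assumed to be common to EMP1 and EMP2.
ethnic_vs_race = [(1, "A"), (3, "C")]
sal_vs_sal = [(30000, 12000), (25000, 25000)]

race_class = classify_nonkey(ethnic_vs_race, ETHNIC_CODE_TO_RACE.get, str)
sal_class = classify_nonkey(sal_vs_sal, float, float)

print("ETHNIC GROUP / RACE:", race_class)
print("SAL / SAL:", sal_class)
```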


Assertions about attribute pairs are used by our schema merge algorithm in selecting integration operators to apply. For example, in a simple case such as Fig. 2, the merge algorithm will create the supertype EMP and assign the globally equivalent attributes to it. The merge algorithm does not assign locally equivalent attributes to the supertype. Instead, the designer must determine their placement by following the strategies described in the next section.
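The division of labor described above, where globally equivalent attributes go to the supertype and locally equivalent ones are left to the designer, can be sketched as follows. This is not the merge algorithm of the methodology, which is not given in this paper; the function `merge_into_supertype` and the assertion tuples are hypothetical and only illustrate how equivalence assertions could steer attribute placement for the example of Fig. 2.

```python
def merge_into_supertype(assertions):
    """Split equivalence assertions by type.

    assertions -- (attribute in EMP1, attribute in EMP2, "global" | "local")
    Returns the attribute pairs assigned to the supertype and the pairs
    whose placement is left to the designer.
    """
    supertype_attrs = []
    designer_decides = []
    for attr1, attr2, kind in assertions:
        if kind == "global":
            supertype_attrs.append((attr1, attr2))
        elif kind == "local":
            designer_decides.append((attr1, attr2))
        else:
            raise ValueError(f"unknown equivalence type: {kind!r}")
    return supertype_attrs, designer_decides


# Designer assertions for Fig. 2 as discussed in the text. The FAMILY CODE
# attributes are not semantically equivalent, so no assertion covers them.
fig2_assertions = [
    ("EMP1.NAME", "EMP2.NAME", "global"),
    ("EMP1.ETHNIC GROUP", "EMP2.RACE", "global"),
    ("EMP1.SAL", "EMP2.SAL", "local"),
]

emp_attrs, designer_todo = merge_into_supertype(fig2_assertions)
print("assigned to EMP:", emp_attrs)
print("left to the designer:", designer_todo)
```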

3. PLACEMENT OF LOCALLY EQUIVALENT ATTRIBUTES

The difficulty with locally equivalent attributes is that they cannot easily be combined in the global schema because their meanings have different ranges. One design option is, of course, to ignore the semantic similarities and leave such attributes in their local entity types. But in some cases, strategies can be devised that create a global attribute from two or more local attributes. In this section, we first discuss such strategies for locally equivalent key attributes and then for locally equivalent non-key attributes.

If two key attributes are locally equivalent, their entity types cannot be integrated into a supertype since it is impossible to determine whether or not two entity occurrences describe the same real-world object. But in many cases, an integration into a supertype is desirable. Three strategies can be used to solve this problem:

(1) Add attributes to one or more local databases such that there will be an attribute with the global key property. This implies assigning globally unique values to this attribute over all local databases. In the employee example of Fig. 2, one of the local databases does not have a social security number. The designer could add this attribute with the cooperation of the local database administrator. The attractiveness of this strategy depends on the willingness of the organizations to modify their databases so that a better merge may be achieved.

(2) Add a new attribute to the global schema only, and create a conversion table that maps existing local key values into a global key value. Local key values from more than one database can be associated with the same global key value, meaning that two local entities describe the same real-world object. Like strategy 1, user input is required to define and maintain the conversion table. The main difference between the two strategies is that strategy 2 does not affect local databases. In some cases this strategy is not practical because of the large effort to define and maintain the conversion table. For example, creating a conversion table for two databases which each contain 20,000 employees would be a very large effort. In addition, the administrators of the local databases would have to update the conversion table to reflect additions and deletions to the local databases.


(3) Augment the locally equivalent keys with a database identifier to make them globally unique. A prefix containing the database identifier can be created using a simple concatenation operator (see [5]). The prefix string would be a constant such as “DB1”.

The choice between strategies 1 and 2 on one hand and strategy 3 on the other hand depends on the potential overlaps of the local entity extensions and the cost of the strategies. If the intersection of the entity extensions is non-null, strategies 1 or 2 are normally appropriate. Both strategies require input of additional information to uniquely identify entity occurrences. If the semantics of the local databases guarantee an empty intersection, strategy 3 is appropriate. Strategy 3 can also be appropriate where either the cost of strategies 1 or 2 is prohibitive, or the global users do not care about identifying common entity occurrences. In these cases, strategy 3 represents a compromise between the cost of strategies 1 or 2 and the benefit of identifying common entity occurrences.

For locally equivalent non-key attributes, we can identify two strategies.

(1) If we know that the intersection of the entity extensions is empty or if strategy 3 was used for the key attributes of the owning entity types, then we can create a global attribute for the supertype that refers to the entity’s property in its local environment. For example, if the intersection of EMP1 and EMP2 is empty, then we can create the global attribute EMP.SAL that represents the salary that the employee makes in his/her company. We know that this is the employee’s only salary, and we do not care where he/she earns it.

(2) If common entity occurrences can be identified, then we can create one or more global attributes with new meaning; the values of these attributes are derived from the local attribute values. For example, EMP could have new attributes such as AVGSAL and TOTALSAL which are derived from the local salaries.
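The two placement strategies for SAL can be sketched in Python, each paired with the key strategy it relies on. Everything in this fragment is a hypothetical illustration: the conversion table contents, the global key values, the `:` separator in the prefix, the salary figures, and the function names are assumptions, not taken from the paper. The two scenarios are alternatives; a real design would pick one based on whether common employees can be identified.

```python
# Local salary attributes, keyed by local employee number (ENO).
LOCAL_SAL = {
    "DB1": {1: 30000, 2: 25000},
    "DB2": {1: 12000},
}


def prefixed_key(db_id, local_key):
    """Key strategy 3: prefix a local key with a constant database id.

    The ':' separator is an assumption; the paper only asks for a
    simple concatenation.
    """
    return f"{db_id}:{local_key}"


def derive_salary_attributes(local_salaries):
    """Non-key strategy 2: global attributes derived from local salaries."""
    total = sum(local_salaries)
    return {"TOTALSAL": total, "AVGSAL": total / len(local_salaries)}


# Scenario A: the extensions are assumed disjoint. Prefixed keys identify
# every employee, and EMP.SAL is simply the single local salary.
emp_sal = {
    prefixed_key(db_id, eno): sal
    for db_id, table in LOCAL_SAL.items()
    for eno, sal in table.items()
}

# Scenario B: a conversion table (key strategy 2) maps local keys to a
# global key, so common occurrences can be identified and combined.
CONVERSION_TABLE = {
    ("DB1", 1): "G1",
    ("DB1", 2): "G2",
    ("DB2", 1): "G2",  # same real-world employee as ENO 2 in DB1
}

salaries_by_global_key = {}
for (db_id, eno), global_key in CONVERSION_TABLE.items():
    salaries_by_global_key.setdefault(global_key, []).append(
        LOCAL_SAL[db_id][eno])

derived = {
    global_key: derive_salary_attributes(salaries)
    for global_key, salaries in salaries_by_global_key.items()
}

print("Scenario A, EMP.SAL:", emp_sal)
print("Scenario B, derived attributes:", derived)
```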
This approach is attractive because we can now refer to a global property without deriving its value explicitly in a query, an idea consistent with the philosophy of a global schema.

In both cases, the local attributes are still attributes of the subtypes; the mapping into a global attribute does not imply the loss of information. We do not suggest that the global attribute replaces the locally equivalent attributes, since this would restrict the information available to global queries. For example, an ad-hoc query can access EMP.TOTALSAL, but also EMP1.SAL and EMP2.SAL, allowing the computation of maximum salary, minimum salary, etc. in the query.

For all global attributes derived from locally equivalent attributes, a format and a set of mapping rules must be defined after the placement strategy is determined. The techniques discussed by Mannino [5] and by Dayal and Hwang [4] apply.

4. CONCLUSION

We demonstrated the importance of attribute equivalence for integrating disjoint schemas via generalization. We identified four types of attribute equivalences and discussed strategies for placing locally equivalent attributes in the global schema. The four types of attribute equivalences are based on the type of attributes (key, non-key) and the range of meaning (local, global).

The concept of attribute equivalence is part of a global schema design methodology [5, 6]. Two algorithms of the methodology make extensive use of the equivalence assertions. One algorithm analyzes a set of assertions for inconsistencies and omissions. Another algorithm performs a merge of disjoint schemas by considering several types of assertions including equivalence assertions. This algorithm is embedded in an interactive design program. Extensions to this methodology are currently under development at the University of Florida.

REFERENCES

[1] T. Landers and R. Rosenberg: An overview of Multibase. Proc. 2nd Int. Symp. Distributed Data Bases, Berlin, pp. 153-184 (Sept. 1982).
[2] A. Motro and P. Buneman: Constructing superviews. Proc. ACM SIGMOD Conf., Ann Arbor, Michigan (May 1981).
[3] A. Motro: Virtual merging of databases. Ph.D. Thesis, Univ. of Penn., Moore School of Elec. Eng., Tech. Rep. MS-CIS-80-39 (1981).
[4] U. Dayal and H. Hwang: View definition and generalization for database integration in Multibase: a system for heterogeneous, distributed databases. Proc. 6th Berkeley Workshop Distributed Data Mgmt. and Communication, (LBL-1345), pp. 203-238 (Feb. 1982).
[5] M. Mannino: A methodology for global schema design. Ph.D. Dissertation, Dept. of Mgmt. Info. Syst., Univ. of Arizona (June 1983).
[6] M. Mannino and W. Effelsberg: Matching techniques in global schema design. Proc. IEEE COMPDEC, Los Angeles, CA (April 1984).