Multidatabase Systems: An Advanced Concept in Handling Distributed Data

A. R. HURSON and M. W. BRIGHT
Computer Engineering Program, Department of Electrical Engineering, The Pennsylvania State University, University Park, Pennsylvania

Advances in Computers, Vol. 32 (Academic Press, 1991)

Contents

1. Introduction
2. What Is a Multidatabase?
   2.1 Taxonomy of Global Information-Sharing Solutions
   2.2 Definition of a Multidatabase
3. Multidatabase Issues
   3.1 Site Autonomy
   3.2 Differences in Data Representation
   3.3 Heterogeneous Local Databases
   3.4 Global Constraints
   3.5 Global Query Processing
   3.6 Global Query Optimization
   3.7 Concurrency Control
   3.8 Security
   3.9 Local Node Requirements
4. Multidatabase Design Choices
   4.1 Global-Schema Approach
   4.2 Multidatabase-Language Approach
5. Analysis of Existing Multidatabase Systems
   5.1 Amount of Multidatabase Function
   5.2 Missing Database Function
   5.3 Performance
   5.4 Cost
6. The Future of Multidatabase Systems
   6.1 User Interfaces
   6.2 Effective Utilization of Resources
   6.3 Increased Semantic Content
   6.4 A Proposed Solution
   6.5 New Database-Management-System Function
   6.6 Integration of Other Data Sources
7. Summary and Future Developments
Appendix A. Review of Multidatabase Projects
   A.1 Global-Schema Multidatabase Projects
   A.2 Federated Database Projects
   A.3 Multidatabase-Language-System Projects
   A.4 Homogeneous Multidatabase-Language-System Projects
References



1. Introduction

The business, government, and academic worlds are replete with groups that have computerized all or part of their daily functions. This computerization often includes databases to model the real-world entities involved in these functions. In today's information age, there is a major requirement for most of these systems to share the information they manage. An example is the French Teletel system, which allows 1.8 million users to access over 1500 separate databases (Litwin and Abdellatif, 1987). However, users cannot be expected to remember multiple different access methods and access paradigms in order to use these separate databases. Nor can all these databases be expected to be converted to a single common model with a single access method. Multidatabases provide users with a common interface to multiple databases with minimal impact on the existing function of these databases.

Database systems often serve critical functions and represent significant capital investment for organizations. Many organizations have multiple different computers and database systems. This existing environment must be preserved in many cases, yet there is a need to share information on an organizationwide or regionwide basis. There is a need to provide integrated access to similar information that has different data representations and access operators and that is located at different nodes. Multidatabases typically integrate the data from preexisting, heterogeneous local databases in a distributed environment and present global users with transparent methods to use the total information in the system. A key feature is that the individual databases retain their autonomy to serve their existing customer set. This preservation of local autonomy protects an organization's existing investment in local-database-management software, existing applications, and user training.

The existence of multiple, autonomous local databases within an organization can lead to the problem of "islands of information" (Andrew, 1987). This means that globally important information exists in separate local database management systems (DBMSs) that are incompatible, thus making the existing data inaccessible to remote users. Even when the host computers are placed on a common network, the remote data may still be inaccessible if users are not familiar with the access language and data model. One possible solution is to provide translators from a user's local access language/data model to remote languages/models (Demurjian and Hsiao, 1987; Hsiao and Kamel, 1989). However, simple translations still require users to access remote databases individually. A multidatabase system provides integrated access to multiple databases with a single query.

Multidatabases are an important area of current research, as evidenced by the number of projects in both academia and industry. The need for user-friendly global information sharing is also well documented in the trade press.


The next level of computerization is global, distributed systems that can share information from all participating sites. Billions of dollars are at stake, and the winners will be those enterprises that most effectively utilize the new technology. The problems and issues faced by multidatabase architects and designers are numerous. Some solutions are available in research prototypes, and a few limited functional systems are starting to enter the commercial marketplace. Currently, there is little standardization of requirements or solutions. In fact, there is still considerable disagreement over major architectural issues-such as whether to use a global schema to integrate data or to allow users to do their own integration through multidatabase language features. (See Section 4.) Many known problems remain unsolved. Also, future requirements, such as integration of knowledge-base systems with traditional database systems, are still open issues. Integration of knowledge-base systems is particularly important since they represent semantically rich information that is increasingly important to today's sophisticated applications.

This chapter explores issues associated with multidatabases and reviews the current work in the field. The reader is assumed to be familiar with traditional, centralized database-system concepts (Date, 1983, 1985) and the relational data model (Codd, 1970; Date, 1986). An appreciation of the problems related to distributed systems is also helpful (Ceri and Pelagatti, 1984).

Section 2 defines what a multidatabase is and where it lies in the spectrum of global information-sharing solutions. It also gives examples of the two main types of multidatabases. The issues and problems associated with multidatabases are discussed in Section 3. Existing projects are cited as they relate to particular issues. Section 4 explores the two major approaches in designing a multidatabase and the associated problems. An analysis of existing multidatabase systems is presented in Section 5. Section 6 discusses the future of multidatabase systems. Existing problems and future requirements are considered. Section 7 concludes the chapter. Appendix A reviews the existing multidatabase projects that have been reported in the open literature.

2. What Is a Multidatabase?

A multidatabase is a particular type of distributed system that allows global users to easily access information from multiple local databases (Staniszkis, 1986; Veijalainen and Popescu-Zeletin, 1988). Since there are many possible solutions to global information sharing, this section will first examine where multidatabases are positioned in the spectrum of solutions. Then a more complete definition of a multidatabase environment will be given along with some examples.


2.1 Taxonomy of Global Information-Sharing Solutions

There is a wide range of solutions for global information sharing in a distributed system. User requirements, existing hardware and software, and the amount of investment (time, money, and resources) available will determine which solution is appropriate in any given environment. A wide range of terms is used in the literature to describe various solutions, including distributed databases, multidatabases, federated databases, and interoperable systems. The distinction between terms sometimes varies from paper to paper, but the most common definitions are used here (Daisy, 1988; Litwin and Zeroual, 1988; Staniszkis, 1986). The aforementioned terms are intended to describe a distributed system that has a global component with access to all globally shared information and multiple local components that only manage information at that site. The distinction is in the structure of the global component and how it interacts with local components.

Our taxonomy defines the solutions according to how tightly the global system integrates the local DBMSs. A tightly coupled system means the global functions have access to low-level, internal functions of the local DBMSs. This allows close synchronization among sites and efficient global processing. However, it also implies that the global functions may have priority over local functions, so local DBMSs do not have full control over local resources. In a more loosely coupled system, the global functions access local functions through the DBMS external user interface. Global synchronization and efficiency are not as high as in the tightly coupled case, but local DBMSs have full control over local data and processing (site autonomy). In a cooperative spirit, local systems may voluntarily give up some local control and agree to give specified global functions priority over local functions. This cooperation is at the local system's discretion. In the most loosely coupled system, there are few global functions (just simple data exchange via request/response-type messages), and the local interface to global information is through applications residing above the local DBMS user interface. Global synchronization and efficiency are minimal, and again the local system has full control over local data and processing.

The following definitions are presented in order from the most tightly coupled to the most loosely coupled. A summary of the taxonomy is shown in Fig. 1.

2.1.1 Distributed Databases

A distributed-database system is the most tightly coupled information-sharing system. Global and local functions share very low-level, internal interfaces and are so tightly integrated that there is little distinction between them. Distributed databases, therefore, should typically be designed in a top-down fashion, and the entire system, global and local functions, should be implemented at the same time. The local DBMSs are typically homogeneous, i.e., they use the same data model and present the same functional interfaces at all levels, even though they may be implemented on different hardware/system-software platforms. The global system has control over local data and processing. The system typically maintains a global schema, a structured description of all the information available in the system. This global schema is created by integrating the schemas of all the local DBMSs. Global users access the system by submitting queries over the global schema.

Because they are so tightly integrated, distributed databases can closely synchronize global processing. Furthermore, since the global functions have complete control over local functions, processing can be optimized for global requirements. As a result, distributed databases have the best performance of all the information-sharing solutions presented here, but at the cost of significant local modification and loss of control. Ceri and Pelagatti have provided a good introduction to distributed databases, and they review some existing systems (Ceri and Pelagatti, 1984).

FIG. 1. Taxonomy of information-sharing systems, from the most tightly coupled to the most loosely coupled:
- Distributed database: global access to local internal functions; local nodes are typically homogeneous databases; full global database function.
- Global-schema multidatabase: global access through the local DBMS user interface; heterogeneous local databases; full global database function.
- Federated database: DBMS user interface; heterogeneous local databases; full global database function.
- Multidatabase language system: DBMS user interface; heterogeneous local databases; full global database function.
- Homogeneous multidatabase language system: DBMS user interface plus some internal functions; homogeneous local databases; full global database function.
- Interoperable system: application on top of the local DBMS; any information source; no full global database function.

2.1.2 Global-Schema Multidatabases

Global-schema multidatabases are more loosely coupled than distributed databases because global functions access local information through the external user interface of the local DBMS (Landers and Rosenberg, 1982). However, the global system still maintains a global schema, so there must be close cooperation between local sites to maintain the global schema. Global-schema multidatabases are typically designed bottom-up and can integrate preexisting local DBMSs without modifying them. Global-schema multidatabases also normally integrate heterogeneous local DBMSs. This heterogeneity may mean different data models or different implementations of the same data model. Thus, creating the global schema is a more difficult problem than in a distributed database, where the local DBMSs are homogeneous and the global database administrator (DBA) has control over the local schema input to the global schema. The global system must provide mappings between different local schemas and the common global schema.

2.1.3 Federated Databases

Federated databases are a more loosely coupled subset of global-schema multidatabases (Heimbigner and McLeod, 1985). There is no single global schema. Each local system maintains its own partial global schema, which contains only the global information descriptions that will be used at that node (as opposed to all the information in the system). So each node must cooperate closely only with the specific nodes it accesses. User queries are restricted to the locally maintained partial global schema.

2.1.4 Multidatabase Language Systems

Multidatabase language systems are more loosely coupled than global-schema multidatabases or federated databases because no global schema is maintained (Litwin and Zeroual, 1988). The global system supports full database function by providing query language tools to integrate information from separate databases. User queries can specify data from any local schema in the distributed system. Language tools include a global name space and special functions to map information from different models and representations to a model and representation meaningful to the user. Like global-schema multidatabases, multidatabase language systems integrate preexisting, heterogeneous local DBMSs without modifying them.

2.1.5 Homogeneous Multidatabase Language Systems

Homogeneous multidatabase language systems are a degenerate form of multidatabase language systems. This subset merits its own class because there are a number of existing multidatabase projects that currently only support homogeneous local DBMSs (Litwin and Zeroual, 1988). This class is also important because it contains some of the first commercially available multidatabase systems. The commercial products tend to have very limited language functions relative to the projects in the previous class. Some of these systems are actually rather tightly coupled because they allow some global/local interaction below the standard user interface. However, these exceptions are usually minimal. Because of the exceptions, members of this class are close to being distributed databases rather than multidatabases and may display attributes of both classes.

2.1.6 Interoperable Systems

Interoperable systems are the most loosely coupled information-sharing systems. Global function is limited to simple message passing and does not support full database functions (query processing, for example). Standard protocols are defined for communication among the nodes. The local interface is supported by an application above the local DBMS user interface. Because the global system is not database-oriented, local systems may include other types of information repositories such as expert systems or knowledge-based systems. Interoperable systems are still mainly in the research stage (Daisy, 1988; Eliassen and Veijalainen, 1988; Litwin and Zeroual, 1988; Mark and Roussopoulos, 1987).

2.2 Definition of a Multidatabase

For the purposes of this chapter, the generic term multidatabase will include the classes of global-schema multidatabase, federated database, multidatabase language system, and homogeneous multidatabase language system. A multidatabase differs from a distributed database because the global interface to local functions is through the local DBMS external user interface. This means that the local DBMS has full site autonomy. A multidatabase differs from an interoperable system because it provides full database function to global users. Because of the looser coupling, multidatabases cannot synchronize operations as tightly as a distributed database, nor can they optimize as well for global requirements (because the local sites control local resources). However, a multidatabase requires no modification to existing local databases and typically can integrate heterogeneous local systems.

A multidatabase is a distributed system that acts as a front end to multiple local DBMSs or is structured as a global system layer on top of the local DBMSs. Although the local node must maintain some global function in order to interface with the global system, the local DBMS participates in the multidatabase without modification. The local DBMS retains full control over local data and processing. Cooperating with the global system and servicing global requests is strictly voluntary. The global system provides some means (global schema or multidatabase language) of resolving the differences in data representation and function between local DBMSs. This resolution capability is necessary because the same information may be maintained at multiple locations in differing forms. The global user can access information from multiple sources with a single, relatively simple request.

Examples of the two major classes of multidatabases are given next. These examples illustrate many common features of multidatabases.

2.2.1 Example of a Global-Schema Multidatabase: Multibase

Multibase (Dayal, 1983; Landers and Rosenberg, 1982; Smith et al., 1981) was developed at the Computer Corporation of America and provides a uniform global retrieval interface for information in multiple, preexisting, heterogeneous local databases. The global user interface is a single global schema that can be accessed through a single query language. The global schema uses the functional data model (note that most other multidatabase systems use the relational data model or the Entity-Relationship model (Chen, 1976)), and the query language is DAPLEX, which is a functional data language (Shipman, 1981). Local DBMSs can use the relational or network data model and can participate in the global system without any modification.

The schema structure of Multibase is shown in Fig. 2. Each local DBMS presents a local schema to the global system. This local schema corresponds to a regular DBMS user view containing the information the local system wants to share. The local schema is defined in terms of the local DBMS data model and is accessed via the local DBMS query language. Multibase maintains a mapping at each node from the local schema to an equivalent schema using DAPLEX and the functional data model. The DAPLEX local schema has the same information as the local schema, but it now has a globally common representation and access method (DAPLEX queries). The DAPLEX global schema represents the combined information from all the local DAPLEX schemas. This mapping may not be a simple union, however. Information contained in separate databases may overlap; i.e., the same real-world object may be represented numerous times with differing database representations. The mapping from DAPLEX local schemas to the global schema must resolve these differences so each real-world entity and relationship among entities modeled anywhere in the system has a single representation at the global level. In addition to the local DAPLEX schemas, there is also a DAPLEX auxiliary schema representing additional information necessary to global functions. Examples are global data not available at any local node, procedures for resolving incompatibilities between local data representations, and relationships between information in separate nodes. This auxiliary schema is also mapped into the global schema. Multibase provides a schema design aid to help the database administrator create the DAPLEX local schemas and the global schema.

Users access global data by submitting a DAPLEX query defined over the global schema. The global query processor decomposes this global query into multiple subqueries. Each subquery references only information found in a single DAPLEX local schema. A local database interface module translates the appropriate subquery into the local data model and language. The local DBMS services the subquery and returns the local result to the local interface for translation into a local DAPLEX result. The global system combines the local results into a single global result to be returned to the user. Global result processing includes processing any relevant information in the DAPLEX auxiliary schema. Global query optimization is based on parameters such as the most efficient place to retrieve each requested data item, the most efficient way to process and combine results, network costs of transferring data, and the functions available at each local node. Local DBMSs may support different sets of database functions, and the global system may need to compensate for missing function at some nodes. Multibase also provides local query optimization to provide the most efficient subquery processing, given local DBMS capabilities.

Global queries are read only; updates must be performed by local users through the regular local access method. The typical local DBMS user interface does not provide access to any type of locking or time-stamp mechanism for individual data items. Without such mechanisms, the global system cannot provide adequate concurrency control, so only retrievals are allowed through the global interface.

Global-schema multidatabases allow simple, integrated global access to information because the global DBAs have done much of the integration work by creating the global schema. To the user, the multidatabase looks like a large, centralized database.

FIG. 2. Schema structure of Multibase. (Each of the local schemas 1 through N is mapped to a DAPLEX local schema; the DAPLEX local schemas, together with a DAPLEX auxiliary schema, are integrated into the single DAPLEX global schema.)

2.2.2 Example of a Multidatabase Language System: MRDSM

MRDSM (Multics Relational Data Store Multiple) is a research project at the INRIA research center in France (Litwin, 1985b; Litwin and Abdellatif, 1987; Litwin and Vigier, 1986; Wong and Bazex, 1984). The basic system is an extension of Honeywell's centralized database system, MRDS, although support for other local DBMSs is possible. The multidatabase languages, MDSL (MultiDatabase SubLanguage) (Litwin and Abdellatif, 1987) and its successor MSQL (Litwin et al., 1987), are similar to SQL (Structured Query Language) (Date, 1987) in their basic functions, but have many extensions for manipulating multiple databases. A global user is aware that multiple data sources exist, but the access language features make it easy for the user to manipulate these multiple sources. Most of the reported research has concentrated on the new language functions required for dealing with multiple databases (Litwin, 1984a, b).

The structure of MRDSM is similar to Multibase (Fig. 2), but without the global-schema level. Local DBMSs present an external schema of data to share with the global system. If the local schema is not defined in the global model/language, there is a local mapping to the common model/language. The global system also allows auxiliary schemas to keep global data and information about interdatabase relationships. MDSL has functions to define such relationships, and the query processor automatically uses the appropriate portions of the auxiliary database for each query. Global queries are decomposed into subqueries to be submitted to the appropriate local DBMSs. The global system collects the local results and processes them to obtain the global result for the user.

Because there is no prior integration of global data, global queries must contain information about data sources, about resolution of differences in data representations from separate sources, and about how to process the resulting data. MDSL has a number of built-in language functions to accomplish these objectives. The query processor will make some implicit assumptions about how to process information (Litwin, 1985a). This automatic processing allows the user to specify requests in a more nonprocedural fashion. MRDSM does allow global updates. MDSL supports reverse transformations of dynamically created objects so updates can be unambiguously applied to the base data objects (Litwin and Vigier, 1986; Vigier and Litwin, 1987). In a multidatabase language system, the user is responsible for most of the information-integration work, rather than a global DBA. The user is aware of different data sources and must dynamically map these sources into a single, logically integrated whole that matches his or her information requirements. The multidatabase language provides functions to help with this task.
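To make the contrast with the global-schema approach concrete, the following is a minimal sketch, written in Python rather than MDSL or MSQL, of the user-driven integration a multidatabase language system expects: the user names the data sources and supplies the mappings that reconcile their representations. The table contents, attribute names, and conversion factors are illustrative assumptions, not taken from MRDSM.

    # Two autonomous local databases, modeled here as in-memory tables.
    # One stores an annual salary and a single name field; the other
    # stores monthly pay and split name fields.
    payroll_db = [{"name": "A. Smith", "annual_salary": 60000}]
    personnel_db = [{"first": "B.", "last": "Jones", "monthly_pay": 4500}]

    def global_employee_list():
        # The user (not a global DBA) maps each source onto the single
        # representation he or she wants to see.
        result = []
        for row in payroll_db:
            result.append({"employee": row["name"],
                           "annual_salary": row["annual_salary"]})
        for row in personnel_db:
            result.append({"employee": row["first"] + " " + row["last"],
                           "annual_salary": 12 * row["monthly_pay"]})
        return result

    print(global_employee_list())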

3. Multidatabase Issues

Multidatabases inherit many of the problems associated with distributed systems in general and distributed databases in particular. The literature has fully discussed solutions to distributed-system problems, as well as distributed-database problems (see Ceri and Pelagatti, 1984, in particular). Therefore, this chapter will concentrate on the issues that are specific to multidatabases. Litwin has discussed some of these issues in a more limited sense, since he has only considered multidatabase language systems (Litwin, 1988; Litwin and Abdellatif, 1986; Litwin and Zeroual, 1988).


3.1 Site Autonomy

A key aspect of multidatabases, as opposed to distributed databases, is that each local DBMS retains complete control over local data and processing. This is referred to as site autonomy (Abbott and McCarthy, 1988; Garcia-Molina and Kogan, 1988; Veijalainen and Popescu-Zeletin, 1988). Each site independently determines what information it will share with the global system, what global requests it will service, when it will join the multidatabase, and when it will stop participating in the multidatabase. The DBMS itself is not modified by joining the multidatabase. Global changes, such as addition and deletion of other sites, or global optimization of data structures and processing methods, do not have any effect on the local DBMS. Local DBAs are free to optimize local data structures, access paths, and query-processing methods to satisfy local user requirements rather than global requirements. Since the global system interfaces with the local DBMS at the user level, the local DBMS sees the global system as just another local user. Therefore, the local DBMS has as much control over the global system as it does over local users. Note that site autonomy applies to the local DBMS rather than the local system as a whole (see Section 3.9).

The multidatabase approach of preserving site autonomy may be desirable for a number of reasons. Some local databases may have critical roles in an organization, and it may be impossible from an economic standpoint to change these systems (Holtkamp, 1988). Site autonomy means the local DBMS can add global access without changing existing local function. Another economic factor is that an organization may have significant capital invested in existing hardware, software, and user training. All of this investment is preserved when joining a multidatabase since existing local applications can continue operating unchanged. Site autonomy can also act as a security measure because the local DBMS has full control over who accesses local resources through the multidatabase interface and what processing options will be allowed. In particular, a site can protect information by not including it in the local schema that is shared with the global system. An organization's requirement for global access may be minimal or sporadic. Site autonomy allows the local DBMS to join and quit the multidatabase with minimal local impact.

Despite the desirable aspects of site autonomy, it places a large burden on global DBAs. Each site has independent local requirements and makes independent local optimizations to satisfy those requirements. Because of this independence and the possibly large number of participating sites, global requirements and desirable global optimizations (of global data structures, access paths, query-processing methods, etc.) are likely to conflict with local ones. The global DBA must work around these conflicts in initial global system design and ongoing global maintenance. Global performance suffers relative to a tightly coupled distributed database because of the lack of global control over local resources. Because of the heterogeneity of local DBMSs, the global system may have to dedicate global resources to compensate for any missing local function or information. Some of these problems may be alleviated to a degree if the local DBAs agree to cooperate and conform to some global standards. Site autonomy ensures that this cooperation is not enforced by the system, but organizational policies can be used to force cooperation.

3.2 Differences in Data Representation

There are many ways to model a given real-world object (or relationships to other objects) depending on how the model will be used. Because local databases are developed independently with differing local requirements, a multidatabase system is likely to have many different models, or representations, for similar objects (Breitbart et al., 1986; DeMichiel, 1989). However, a global user desires an integrated presentation of global information without duplications or heterogeneity. The same real-world object in different local databases should map to a single global representation, and semantically different objects should map to different global representations (i.e., no erroneous integrations). The style of representation should be consistent at the global level. Moreover, the global information should have the representation most useful to the particular user or application, or at least be easily convertible to a more useful form. This section discusses the various possible differences in local representations.

3.2.1 Name Differences

Local databases may have different conventions for naming objects, leading to the problems of synonyms and homonyms. Synonym means the same data item has different names in different databases. The global system must recognize the semantic equivalence of the items and map the differing local names to a single global name. Homonym means different data items have the same name in different databases. The global system must recognize the semantic difference between items and map the common names to different global names.

3.2.2 Format Differences

Format differences include differences in data type, domain, scale, precision, and item combinations. An example of a data-type difference is a part number that is defined as an integer in one database and as an alphanumeric string in another. A data item's type may have different domains in different databases. For example, temperatures in one database may be rounded off to the nearest ten, while another keeps exact integer readings. An example of scale differences is the area of a plot of land measured in square feet in one database and acres in another. One database (or machine) may use single-precision floating-point numbers for a given quantity while another uses double precision. Sometimes data items are broken into components in one database while the combination is recorded as a single quantity in another. For example, dates can be kept as a single string, such as 012290, or as separate quantities for month, day, and year.

Multidatabases typically resolve format differences by defining transformation functions between the local and global representations. Some functions may be simple numeric calculations such as converting square feet to acres. Some may require table conversions. For example, temperatures may be recorded as hot, warm, cold, or frigid in one place and as exact degrees in another. A table can be used to define what range of degree readings correspond to hot, warm, etc. Others may require calls to software procedures that implement an algorithmic transformation. A problem in this area is that the local-to-global transformation may be simple, but the inverse transformation (required if updates are supported) may be very complex.
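The following short Python sketch illustrates transformation functions of the kinds just described: a synonym table that maps differing local names to one global name, a numeric scale conversion, and a table conversion from exact readings to categories. The attribute names, the category thresholds, and the 43,560 square feet per acre factor are details added here for illustration only.

    # Synonyms: different local attribute names map to one global name.
    GLOBAL_NAME = {"emp_no": "employee_id", "staff_id": "employee_id"}

    def square_feet_to_acres(square_feet):
        # Scale difference: one database records square feet, another acres.
        return square_feet / 43560.0

    def degrees_to_category(degrees_f):
        # Table conversion: exact readings mapped to hot/warm/cold/frigid.
        if degrees_f >= 85:
            return "hot"
        elif degrees_f >= 60:
            return "warm"
        elif degrees_f >= 32:
            return "cold"
        return "frigid"

    print(GLOBAL_NAME["staff_id"])          # employee_id
    print(square_feet_to_acres(87120))      # 2.0
    print(degrees_to_category(90))          # hot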

3.2.3 Structural Differences

Depending on what an object is used for in a database, it may be structured differently in different local databases. Of course, there are basic differences between different data models-the relational model has relations with tuples and attributes while the hierarchical model has records, fields, and links. Even assuming the multidatabase maps all local schemas to a common relational model, there may still be structural differences. A data item may have a single value in one database and multiple values in another. For example, one database may simply record telephone number while another records a home phone number and an office number. An object may be represented as a single relation in one place or as multiple relations in another. This will be common if different databases are in different normal forms (Date, 1985). An example is a database with all employee attributes in the same relation as opposed to a database that has separate relations for employee financial attributes, health attributes, job-related attributes, etc. The same item may be a data value in one place, an attribute in another, and a relationship in a third place. For example, the color of a car may be recorded as green in the color attribute of the dealer's database. Green may be an attribute in the car shipping department since there may be several shades of green. In the paint department, green may be a whole relation with tuples being the different shades and attributes relating to the type of paint, cost of the paint, how it is mixed, etc. The relationships between objects may differ from database to database. For example, two people may be linked as manager and employee in their company database, as cousins in a genealogy database, or as separate individuals in a census database. The dependencies between objects may differ between databases. For example, a company may automatically update a manager's salary when an employee's income is updated, while another company may not have such an update dependency.

Recognizing semantically equivalent objects despite structural differences can be a difficult task and is almost always a manual process for the global DBA or user. Resolving structural differences can also be difficult, but much work has been done to ease the task in this area.

3.2.4 Abstraction Differences

Different local users may be interested in different levels of detail about the same object. One database may record all the options and details about a particular car, another may merely record the particular car's existence, while a third may only note that cars as a general class of transportation exist. Differing levels of abstraction can be integrated through the use of generalization hierarchies. (See Section 4.1.1 and Smith and Smith, 1977.) A generalization hierarchy combines objects based on their common attributes. An object at one level of the hierarchy represents the collection of the common attributes of its immediate descendants at the next level of the hierarchy. Local objects can be mapped into the global hierarchy at the appropriate level of detail.
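A minimal sketch of the generalization idea, assuming two hypothetical local records for the same car: the generic object keeps the attributes the records have in common, and each local record retains only its unique attributes.

    # Two local representations of the same real-world car.
    dealer_car = {"vin": "1A2B3C", "color": "green", "price": 15000}
    paint_shop_car = {"vin": "1A2B3C", "color": "green", "shade": "forest"}

    # The generalized object holds the attributes common to both.
    common = set(dealer_car) & set(paint_shop_car)
    generic_car = {attr: dealer_car[attr] for attr in common}

    # Each descendant keeps only its unique attributes.
    dealer_only = {a: v for a, v in dealer_car.items() if a not in common}
    paint_only = {a: v for a, v in paint_shop_car.items() if a not in common}

    print(generic_car)   # common attributes: vin, color
    print(dealer_only)   # {'price': 15000}
    print(paint_only)    # {'shade': 'forest'}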

3.2.5 Missing or Conflicting Data

Local databases may have embedded information that is not explicitly recorded. Embedded data is information that is assumed by local users, so it does not have to be spelled out with explicit data values. For example, a company's personnel data are unlikely to explicitly record the company name as an attribute of every object since the database is wholly within the company's computer. Yet in a global environment, such information must be included to distinguish between personnel data from different companies.

Databases that model the same real-world object may have conflicts in the actual data values recorded. One system may not have some information recorded due to incomplete updates, system error, or insufficient demand to maintain such data. More serious is the case where two databases record the same data item, but the values are different. The values may differ because of an error, or there may be valid underlying semantics to explain the difference. The value in one place may have been valid at one time, but may now be outdated. The values may actually represent semantically different data items for the same object. The values may represent the same data item for different objects. An example is two employee objects with different salary values (Ceri et al., 1987). One or both salaries may just be wrong. One may represent the salary from a previous job. The employee may actually hold two different jobs and draw two valid salaries. The salaries may mean there are actually two different employees.

An integrated representation of an object can normally only record one value for a particular data item. Many policies are possible for resolving conflicting data. The multidatabase can average the recorded values, take the maximum or minimum, take the most frequently recorded value, take the value from the closest node in the network, take the value from the most reliable node, return the range of values, return all the values, etc. Any of these policies may be appropriate in a given situation.
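Several of the policies listed above are simple enough to state directly; the following sketch, with invented salary values, shows a few of them. Which policy is appropriate remains a system or application decision, as the text notes.

    from collections import Counter
    from statistics import mean

    def resolve(values, policy):
        # values: the readings reported for one data item by different nodes.
        if policy == "average":
            return mean(values)
        if policy == "maximum":
            return max(values)
        if policy == "most_frequent":
            return Counter(values).most_common(1)[0][0]
        if policy == "range":
            return (min(values), max(values))
        if policy == "all":
            return list(values)
        raise ValueError("unknown policy: " + policy)

    salaries = [52000, 52000, 55000]   # same data item from three nodes
    print(resolve(salaries, "most_frequent"))   # 52000
    print(resolve(salaries, "range"))           # (52000, 55000)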

3.3 Heterogeneous Local Databases

Many multidatabases claim to support heterogeneous data models at the local level. Generally these are the network, hierarchical, and relational models. The support mainly consists of providing local translation capability from the local model to the common global model, usually relational. Some systems materialize temporary relational databases representing the local information in order to make global query processing simpler. Even systems that only support relational DBMSs may be heterogeneous to some degree. The relational data model is a theoretical model. Different implementations of a relational DBMS may interpret the theoretical model differently. For example, there are several different versions and variations of relational data definition/manipulation languages. In fact, most relational DBMSs do not implement the full function of the theoretical model (Codd, 1986; Date, 1986). Thus, systems that only support relational DBMSs may still have to deal with heterogeneity of implementation.

The problem with supporting local DBMS heterogeneity is due to the tradeoff between writing translation code and limiting participation. If the multidatabase developers are willing to write enough translation code (considering development costs and execution efficiency), the multidatabase can accept a wide variety of local DBMSs. Another consideration here is that any local functional deficiencies must be programmed around with global system software. If minimizing translation-code cost is important, then the variety of DBMSs allowed to join the multidatabase will be limited to those with interfaces close to the global standard. Some systems are designed so the translation code can be automatically developed from a definition of the local access language grammar (Rusinkiewicz and Czejdo, 1985). This allows easy addition of appropriate DBMSs.
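One common way to organize such translation code is as one adapter (or wrapper) per class of local DBMS, each mapping a simple common request onto the local access method. The sketch below is only illustrative; the class names and the navigational pseudocode returned for the network-model node are assumptions, not any particular system's interface.

    class LocalAdapter:
        """Common interface the global system sees for every local DBMS."""
        def retrieve_all(self, object_name):
            raise NotImplementedError

    class RelationalAdapter(LocalAdapter):
        def retrieve_all(self, object_name):
            # A relational node receives a query in its own SQL dialect.
            return "SELECT * FROM " + object_name

    class NetworkModelAdapter(LocalAdapter):
        def retrieve_all(self, object_name):
            # A network-model node needs a navigational program; only a
            # descriptive placeholder is generated here.
            return "FIND FIRST " + object_name + "; GET; FIND NEXT ..."

    adapters = {"site_a": RelationalAdapter(), "site_b": NetworkModelAdapter()}
    for site, adapter in adapters.items():
        print(site, "->", adapter.retrieve_all("employee"))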

3.4 Global Constraints

Since different local databases may represent semantically equivalent data or semantically related data, the global system needs some method for specifying and enforcing integrity constraints on interdatabase dependencies and relationships (global constraints). An example is equivalent data maintained in several nodes-value changes at one node may need to be propagated to the others. Another example is an aggregate privacy dependency-combining independent data from several sources may reveal more information than the simple sum of the individual data items. This could be a security violation, and concurrent access to these data items may need to be restricted. These constraints may also represent additional semantic information about the data items involved. A user accessing a particular item may want to know about semantically related items, and the defined constraints can be used to identify the related items.

These global integrity constraints are sometimes defined as part of the global schema. Other multidatabases keep separate auxiliary databases specifying global constraints. The query processor checks these auxiliary databases during query execution to enforce the constraints. Like resolving representation differences, global constraints suffer from a lack of automation. However, some work has been done on automatic consistency checking and optimization once the constraints have been defined (Casanova and Vidal, 1983; Mannino and Effelsberg, 1984).

Global constraints require a thorough system policy statement to define how they are to be managed. An example is an interdatabase update dependency (updating an object in one database should cause an equivalent object in another to be updated). If the update is to be propagated, the site autonomy of the second node may be compromised. If the update is not to be propagated, the first site must either reject updates (restrict function) or accept them (violate the constraint and cause data inconsistency). The decision on what to do is a policy position since any of the alternatives is technically feasible.
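The update-dependency example above can be made concrete with a small sketch. The constraint table, item names, and the three policy choices (propagate, reject, accept) mirror the alternatives discussed in the text; everything else is an illustrative assumption.

    # Interdatabase dependencies: an update to the source item should be
    # reflected in the equivalent target item at another node.
    CONSTRAINTS = [
        ("hr_db", "employee.salary", "payroll_db", "employee.salary"),
    ]

    def apply_update(db, item, new_value, policy="propagate"):
        actions = [("update", db, item, new_value)]
        for src_db, src_item, tgt_db, tgt_item in CONSTRAINTS:
            if (src_db, src_item) == (db, item):
                if policy == "propagate":
                    # Propagation may compromise the target node's autonomy.
                    actions.append(("update", tgt_db, tgt_item, new_value))
                elif policy == "reject":
                    # Rejecting restricts global function.
                    raise RuntimeError("update rejected: dependent data exists")
                # policy == "accept": the constraint is violated and the
                # databases become inconsistent.
        return actions

    print(apply_update("hr_db", "employee.salary", 58000))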

3.5 Global Query Processing

The basics of global query processing are consistent across most multidatabases (Gligor and Luckenbaugh, 1984; Rusinkiewicz and Czejdo, 1987). A user submits a global query to be processed using the global schema, or in the multidatabase language case the query itself contains all the information necessary for retrieving local data. The query is decomposed into a set of subqueries-one for each local DBMS that will be involved in query execution. The query optimizer (see Section 3.6) creates an access strategy that specifies which local DBMSs are to be involved, what each will do, how the intermediate results will be combined, and where the global processing will occur. Then the access strategy is executed-the subqueries are performed, and the individual results are combined to form a response to the original global query. Global constraints must also be checked and enforced during query execution. Initial query processing usually occurs at the node where the query is submitted, although some systems pass queries to designated servers for processing. Query execution is distributed to nodes across the system.

During global query execution, queries may be translated several times as they travel through the various system layers. Translations are used to allow different languages and representations at different layers as well as to resolve representation differences. For example, some multidatabases such as DQS (Belcastro et al., 1988) and Mermaid (Templeton et al., 1987a) use an internal database-manipulation language that is different from the external user language. An internal language may allow more efficient processing than an external language that must sacrifice some efficiency for user-friendliness. Another translation example is the Multibase translation from local schemas in the local language to equivalent DAPLEX schemas.

Query decomposition and optimization in a distributed system have been studied in the distributed-database environment, and a number of solutions are available for those problems (Ceri and Pelagatti, 1984). Multidatabase systems must also handle interdatabase dependencies, manage global resources, and support additional language features (for multidatabase languages). Interdatabase dependencies may cause functions to cascade to many databases beyond the immediate scope of a query. The system must identify pertinent dependencies by checking the global dependencies database, then expand the submitted query to include consequences of the dependencies. The query processor must manage global resources, such as the global constraints database, local work spaces, and the software modules responsible for global processing. These resources are generally distributed across the system. Finally, multidatabase language systems provide many new language features that must be handled by the query processor. All these demands on the query processor must be handled in an efficient manner, despite its dynamic, distributed nature and the lack of control over local DBMSs.
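A highly simplified sketch of the decompose/execute/combine cycle described above, with the local DBMSs reduced to in-memory tables; the site names, data, and the aggregation performed at the global level are all illustrative assumptions.

    # Each local DBMS is reduced to an in-memory table it is willing to share.
    LOCAL_SITES = {
        "site_a": [{"part": "p1", "qty": 10}],
        "site_b": [{"part": "p1", "qty": 4}, {"part": "p2", "qty": 7}],
    }

    def decompose(part):
        # One subquery per local DBMS involved in query execution.
        return {site: part for site in LOCAL_SITES}

    def execute_subquery(site, part):
        return [row for row in LOCAL_SITES[site] if row["part"] == part]

    def global_query(part):
        partial_results = [execute_subquery(site, p)
                           for site, p in decompose(part).items()]
        combined = [row for partial in partial_results for row in partial]
        # Global processing combines the partial results into one answer.
        return {"part": part, "total_qty": sum(r["qty"] for r in combined)}

    print(global_query("p1"))   # {'part': 'p1', 'total_qty': 14}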

3.6 Global Query Optimization

A global query optimizer deals with information about the state of the distributed system, the capabilities of individual nodes, communication link costs, the data requirements of a query (e.g., location and approximate amount), and the processing requirements of the query. The optimizer applies a cost function that weights these factors and produces an efficient strategy for satisfying the query. The cost function implements global policy and reflects global processing requirements in the weights it gives to the various factors. Most systems consider only a subset of the factors listed. Since all subqueries and some result combination processing are performed at separate nodes, the optimizer can schedule these to operate in parallel. Some systems restrict the optimization problem by making simplifying assumptions or applying heuristics to make it computationally tractable (Staniszkis et al., 1984). Mermaid, for example, explicitly solves for a good, but not optimal, solution in order to cut optimization time and expense (Templeton et al., 1987b).

Often there are three levels of query optimization. The global-level optimizer considers what operations should be performed at each node and how partial results should be combined. Each node that is involved in the operation is sent a subquery for execution. Then a local module, which is still part of the multidatabase software, optimizes the subquery in terms of how to submit it to the local DBMS. The local optimizer must consider what functions are requested in the subquery, what functions the local DBMS provides, and the most efficient method for requesting services from the DBMS. If the local DBMS does not support some requested function, the local optimizer will have to do additional processing to make up for the local deficiency. For example, if the DBMS does not support an average function on an attribute, the local optimizer will initiate a program to sequentially retrieve the attribute values, keep a running total, and calculate the average independently of the local DBMS. Finally, the local DBMS normally has an optimizer to process requests from the multidatabase as a local user. The local DBMS optimizer is independent of, and transparent to, the multidatabase system.

Once the required system and query information is gathered, a number of good optimization algorithms exist for applying a cost function (Brill et al., Deen et al., 1984; Mackert and Lohman, 1986). However, how the system information is gathered and where it is stored are not clear in existing multidatabase descriptions. Examples of such information are network link costs and node processing capacity. This is static information based on rated capabilities of the hardware and/or system software. Static or even outdated system information will likely lead to suboptimal solutions in a dynamic distributed environment. For example, if the optimizer chooses a network link that is currently down or a processing node that has been deleted from the system, the query strategy will obviously be inefficient (if it can complete at all). Optimizer input information should be gathered dynamically to best utilize a distributed system (Norrie and Asker, 1989). However, the data-gathering process must not require too much processing, or else system performance will be degraded.
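The weighted cost function can be sketched as follows. The factors shown (data transfer and local processing), the weights, and the per-site statistics are invented for illustration; as noted above, a real optimizer must also cope with dynamic and possibly stale system information.

    # Static per-site statistics of the kind an optimizer might consult.
    SITE_STATS = {
        "site_a": {"transfer_cost": 1.0, "processing_cost": 0.5, "rows": 10000},
        "site_b": {"transfer_cost": 4.0, "processing_cost": 0.2, "rows": 2000},
    }
    WEIGHTS = {"transfer": 0.6, "processing": 0.4}

    def strategy_cost(site, selectivity=0.1):
        stats = SITE_STATS[site]
        rows_moved = stats["rows"] * selectivity
        return (WEIGHTS["transfer"] * stats["transfer_cost"] * rows_moved
                + WEIGHTS["processing"] * stats["processing_cost"] * stats["rows"])

    best_site = min(SITE_STATS, key=strategy_cost)
    print({site: round(strategy_cost(site), 1) for site in SITE_STATS})
    print("retrieve from:", best_site)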

3.7 Concurrency Control

The traditional concept of a transaction as short-lived and atomic is unsuited to the multidatabase environment. Multidatabase transactions will typically involve multiple, separate local DBMSs and several layers of data/query translations. More importantly, local DBMSs have site autonomy, so global control does not in fact have control of the actual data items. Multidatabase transactions are characterized as relatively long-lived and nonatomic. New models of transaction management have been (and are being) developed (Eliassen et al., 1988; Elmagarmid and Leu, 1987; Litwin and Tirri, 1988; Pu, 1987). Some make use of semantic knowledge about the transaction to break up the transaction into subtransactions that are atomic, while the transaction as a whole is not (Alonso et al., 1987). If a transaction aborts and some subtransactions have already been committed, a compensating subtransaction is run to reverse the effects of the initial subtransaction. As stated previously, operations that cause updates must also be concerned with global constraints defined on target data, since the effects of the update may propagate to other databases (Gamal-Eldin et al., 1988).

Concurrency control schedules concurrent transaction data accesses to be serializable (Bernstein et al., 1987). To do this, however, it requires knowledge of all the currently active transactions and the ability to control access to data items. A standard DBMS user interface does not normally provide information about other users' transactions or access to data item locks, time stamps, etc., depending on the local concurrency-control scheme. Moreover, different DBMSs may use different local concurrency-control schemes. The global system has enough information to provide concurrency control for global transactions, but it does not have information about local transactions. Since local and global transactions may conflict, the global system does not have enough knowledge or control to provide total concurrency control. Also, implementing individual global transaction synchronization, such as the two-phase commit protocol, may imply some slight loss of local autonomy (Mohan et al., 1986). Consequently, many multidatabases, particularly global-schema multidatabases, restrict global information access to retrieval only. Updates must be performed through the local DBMS interface on a node-by-node basis.

Part of the update problem is related to the standard problem of updating views (Fagin et al., 1983). A global schema is just a big view defined over all the local databases, and multidatabase language queries usually create a similar view of the particular data being accessed. Creating a view means creating a transformation function from existing base information to the representation desired in the view. Retrieving information means applying the transformation function. Updating information requires a reverse transformation-an inverse of the transformation function. Defining inverse functions can be extremely difficult (Vigier and Litwin, 1987).

The other part of the update problem, lack of global concurrency control, could be easily solved if local DBMSs provided more information and control at their user interface (Gligor and Popescu-Zeletin, 1984). However, recent research has led to some possible solutions for the existing multidatabase environment. For example, the ADDS system has developed a site graph algorithm to allow updates with existing local DBMS user interfaces (Breitbart and Silberschatz, 1988; Breitbart et al., 1987). Another approach is to assume that conflicts between global and local updates occur very infrequently, so synchronizing global and local concurrency control is unnecessary. There is doubt as to the general applicability of this assumption (Breitbart et al., 1987). Using a tree protocol for locking eliminates some of the problems with standard two-phase locking (Vidyasankar, 1987). Finally, a new paradigm for assigning value dates (a form of time stamp) to transactions and data items provides a better fit to the concurrency-control requirements of autonomous local nodes (Litwin and Tirri, 1988). The value-date method includes more complex options for transactions that encounter blocking of requested data items, such as requesting an alternate data item or accepting the risk of reading a nonserialized value.
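The compensation idea mentioned above can be sketched as follows: a global transaction runs as a sequence of locally committed subtransactions, and if a later subtransaction fails, compensating subtransactions are run in reverse order to undo the ones already committed. The functions below are stand-ins for real local operations, not any published protocol.

    def run_global_transaction(subtransactions):
        # subtransactions: list of (do, compensate) pairs, run in order.
        committed = []
        for do, compensate in subtransactions:
            try:
                do()
                committed.append(compensate)
            except Exception:
                # Undo already-committed subtransactions in reverse order.
                for compensate_earlier in reversed(committed):
                    compensate_earlier()
                return "aborted (compensated)"
        return "committed"

    log = []

    def debit_site_a():
        log.append("debit at site_a committed")

    def compensate_site_a():
        log.append("compensating credit at site_a")

    def credit_site_b():
        raise RuntimeError("site_b unavailable")

    def compensate_site_b():
        log.append("compensating debit at site_b")

    outcome = run_global_transaction([(debit_site_a, compensate_site_a),
                                      (credit_site_b, compensate_site_b)])
    print(outcome, log)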

3.8 Security

Providing security in any distributed system is a difficult task at best. Some of the problems include nonsecure communication links and varying levels of security provided at different nodes. Multidatabases must rely on the underlying hardware and system software for most of their security requirements. Site autonomy provides some measure of local security. Local DBAs can restrict the information available to global users by not including it in the local schema for the multidatabase interface. Also, the local DBMS can monitor and control the incoming requests and requesting user identifications at the multidatabase interface. For example, one version of Mermaid requires each global user to have separate identification and authorization codes on each local system to be accessed (note that issuing these codes is mostly automatic-the user does not repeatedly enter all codes) (Templeton et al., 1987b). The use of views for global users is also an important security measure (Bertino and Haas, 1988; Wang and Spooner, 1987). In a global-schema system, the global DBAs may only allow each global user a limited view of the global schema.


3.9 Local Node Requirements Multidatabases require global data structures and global software modules to implement global functions. Although site autonomy guarantees that local DBMSs will be unchanged by joining a multidatabase, the local machine will have to share some of the global storage and processing requirements. Many multidatabases spread the load evenly over all participating sites. Some, such as EDDS (Bell et al., 1987) and Odu (Omololu et al., 1989), impose minimal requirements on smaller machines and use larger machines nearby (in the network) to pick up the slack. Others, such as Multibase (Landers and Rosenberg, 1982; Smith et al., 1981) and Mermaid (Templeton et ul., 1986; Templeton et a/., 1987a), designate specific server machines to perform the bulk of global work. Proteus (Oxborrow, 1987) performs most of the global function at a central node. Global data structures and global software functions vary among multidatabase systems. Common data structures include the global schema, auxiliary databases for global constraints, space for intermediate query results, and temporary work spaces for global functions. Common software functions include translation between local and global languages, transformation functions between local and global information representations, query processing and optimization, and global system control (e.g., concurrency control and management of global data structures). The distribution of global structures and processing is a major factor in determining global performance. Usage patterns and machine capacities are likely to vary widely across different system participants and over time. Maintaining a full complement of global data and software modules on a personal computer or a heavily loaded mainframe may be difficult or impossible. Deciding the most efficient resource distribution is a systemdesign-optimization problem. It may be beneficial to make this distribution a dynamic system attribute, although this does not seem to be implemented in any current systems. A final problem associated with local node requirements is the need to port global resources to multiple hardware and system software configurations. Once ported, the structures and code must be verified for consistency across all local configurations. Again, uneven distribution of resources may alleviate this problem if the main global servers can be placed on similar machines. 4.

4. Multidatabase Design Choices

There are two major approaches to designing a multidatabase system: the global-schema approach and the multidatabase-language approach. The global-schema approach was the first to be used in multidatabase design and continues to be a popular choice in many projects. The multidatabase-language approach was partly inspired by the problems inherent in the global-schema approach and partly by the simpler overall system architecture. Both approaches must deal with the issues discussed in Section 3, but the solution to some problems will vary with the design approach.

4.1 Global-Schema Approach

The global-schema approach to multidatabases is a direct outgrowth of distributed databases. The global schema is just another layer, above the local external schemas, that provides additional data independence. Consequently, some of the work in distributed database research is applicable, particularly in the area of global-schema design. A major difference, however, is the lack of global control over local decisions, i.e., the global system cannot force local systems to conform to any sort of standard schema design (local schemas are developed independently), nor can it control changes to the local schemas.

Despite the issues discussed here and in Section 3, the global-schema approach does make global access quite user-friendly. Global users essentially see a single, albeit large, integrated database. The global interface is independent of all the heterogeneity in local DBMSs and data representations. Because most global-schema multidatabases use the relational model (or some variant such as the entity-relationship model), users are presented with a familiar and intuitive paradigm for accessing the system. For specific users and applications, views may be defined on top of the global schema to tailor the interface. The global query language is normally SQL or some variant of it. The global schema is usually replicated at each node for efficient user access, although some systems only keep copies at specified server nodes.

4.1.1 Global-Schema Design

Global-schema design takes the independently developed local schemas, resolves semantic and syntactic differences between them, and creates an integrated summary of all the information from the union of the local schemas. Global-schema design is also referred to as view integration. This process is much more difficult than just taking a union of the input schemas for several reasons. Information about the same real-world object may occur in multiple local databases and have completely different representations. (See Section 3.2.) The information stored about the same real-world object in different databases may overlap, with each database having a different portion of the data for that object. Information in separate databases may have many interdependencies that are not applicable locally, but that must be considered when the databases are linked together. A review of existing methods for schema integration is given in Batini et al. (1986).

There are a number of common techniques for integrating multiple, distinct schemas. Analyzing similarities and conflicts between objects and relationships in separate schemas must be done before they can be integrated (Batini et al., 1983; Ceri et al., 1987). Some methods use special data models and design languages with special constructs to resolve representation differences, constructs similar to those found in multidatabase languages. The entity-relationship and functional models are frequently used to describe design methodologies (D'Atri and Sacca, 1984; Dayal and Hwang, 1984; Elmasri and Navathe, 1984; Motro, 1987). It is also important to define the interdependencies between objects and relationships in different databases to know what to integrate. Several methods have been proposed to aid in defining these interdependencies, and algorithms have been developed to ensure that the definition statements are consistent and optimal (Casanova and Vidal, 1983; Mannino and Effelsberg, 1984).

Generalization hierarchies (Smith and Smith, 1977) are often used to classify similar objects from different schemas (Ceri et al., 1987; Dayal and Hwang, 1984; Elmasri and Navathe, 1984). A generalization hierarchy takes similar objects and creates a new, generic object that has all the common properties of the original objects. The original objects are modified so they only retain the properties that were unique to themselves. The new object is called a generalization of the originals and is placed above them in a hierarchy of objects (a simple sketch of this step is given below).

Finally, there may be hundreds or thousands of schemas to integrate, and the sheer size of the job complicates the design process. This complexity can be eased by initially integrating only two of the schemas and then integrating the rest, one by one, into the running total (Batini and Lenzerini, 1983; Ceri et al., 1987).

Despite the methodologies, algorithms, and heuristics that have been defined to help automate parts of the schema-integration process, this process is still very labor-intensive. In fact, it may be theoretically impossible to automate the whole task (Convent, 1986). Because of the size of the task and the lack of automated decisions, it is possible to create many different, but valid, global schemas from a given set of input schemas. Global DBAs are required to design the global schema. These designers must have extensive knowledge of all the input schemas and the user requirements of the global system to decide how to integrate the inputs, i.e., which of the many possible global schemas to create (Navathe and Gadgil, 1982). Each of the local schemas is assumed to be optimized to local requirements. An optimal design for global requirements will likely conflict with some local optimizations, but the global DBA cannot change local optimizations because the nodes are autonomous. Therefore, the global DBA must also understand all the local optimizations and consider them when trying to create efficient global structures. The amount of global knowledge required about what is being integrated and how to integrate it is a major problem with the global-schema approach. In fact, a large enough number of local schemas may make the global-schema approach impossible due to the knowledge requirements and the development time associated with integration.
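To make the generalization step concrete, the following is a minimal Python sketch of how a generic object could be derived from two local objects; the schema contents and names are hypothetical and purely illustrative, not drawn from any of the systems cited above.

    def generalize(local_schemas):
        """Derive a generalization-hierarchy node from similar local objects.

        local_schemas maps each local object name to its set of attribute names.
        The generic object keeps the attributes common to all inputs; each
        original object retains only the attributes unique to itself.
        """
        common = set.intersection(*local_schemas.values())
        specialized = {name: attrs - common for name, attrs in local_schemas.items()}
        return common, specialized

    # Hypothetical local objects describing the same real-world entity.
    local_schemas = {
        "db1.EMPLOYEE": {"id", "name", "salary", "union_code"},
        "db2.STAFF":    {"id", "name", "salary", "grant_number"},
    }
    common, specialized = generalize(local_schemas)
    # common keeps id, name, salary (the new generic object, e.g. PERSONNEL);
    # db1.EMPLOYEE retains only union_code and db2.STAFF only grant_number.

In an actual integration, adding or deleting a local attribute changes the computed intersection, which is exactly why local schema changes can ripple through multiple levels of a generalization hierarchy, as discussed in the next subsection.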

4.1.2 Global-Schema Maintenance

A global schema can be a very large data object. The sheer size can make it a problem to replicate at nodes with limited storage facilities. The popularity of personal computers and small DBMSs that may want to join the multidatabase system makes this an important problem. Some systems get around this problem by only replicating the global schema at specified server nodes. However, this means queries cannot be processed at all query-origin nodes.

Global DBAs must also maintain the global schema in the face of arbitrary (since the local DBMSs are autonomous) changes to local schemas. The literature is largely silent on how this is done. Changes to local schemas must be reflected by corresponding changes in the global schema. Addition and deletion of whole nodes can mean massive amounts of change. The integration techniques used in global-schema design and the changes in local data representations at the global level can make the mapping of changes to the global schema a complex problem. Local changes may force the DBA to reconsider many design decisions made during the initial integration process, with wide-reaching consequences. Again, the DBA must have extensive global knowledge of all the input schemas, the global schema, and what design decisions were made initially. For example, a generalization hierarchy at the global level is based on combining common attributes of local objects at higher levels of the hierarchy. Adding or deleting local attributes can affect multiple levels of the hierarchy by changing the intersection of common attributes.

Because the global schema is replicated at multiple nodes, changes to it must be synchronized. An atomic update (change all copies at the same time) is quite expensive in terms of making the global schema unavailable while the update propagates to all nodes of the system (or all server nodes). A nonatomic update means that some copies of the global schema will be out of date for short periods. As a result, queries may be processed against invalid information.

4.2 Multidatabase-Language Approach

The multidatabase-language approach is an attempt to resolve some of the problems associated with global schemas, such as the up-front knowledge required of DBAs, the up-front development time to create the global schema, large maintenance requirements, and the processing/storage requirements placed on local nodes. A multidatabase language system puts most of the integration responsibility on the user, but alleviates the problem by giving the user many functions to ease the task and providing a great deal of control over the information. Examples of such functions are given shortly. Most multidatabase languages are relational, similar to SQL in their standard capabilities but with significantly extended function. Litwin and his colleagues have argued persuasively for multidatabase languages and performed much research in this area (Frankhauser et al., 1988; Litwin, 1984a,b, 1988; Litwin and Abdellatif, 1986, 1987). Note that global-schema multidatabases may require some sort of multidatabase language, if only for use by the global DBAs to create and maintain the global schema.

A basic requirement for a multidatabase language is to define a common name space across all participating schemas. The most straightforward way to accomplish this is to allow data-item names to be qualified with the associated database name and node identifier. A common name space can still provide some measure of location independence in the face of data-item movement if object names are independent of the node at which they currently reside (Lindsay, 1987). However, the global user is still aware that multiple data sources exist.

Most of the language extensions beyond standard database capabilities are involved with manipulating information representations. (See Section 4.2.1.) Since representation differences exist when the user submits a query, the language must have the ability to transform source information into the representations most useful to the user. It is particularly desirable in this context to make the language very nonprocedural. The multidatabase system should be capable of making good implicit decisions in interpreting what the user wants to accomplish and providing many functions by default. The more complexity the system can automatically take care of, the easier the system will be to use. Examples are the capability to iterate operations over multiple, slightly varying objects (Litwin, 1984a) and the capability to do implicit joins (Litwin, 1985a). The user may be working with multiple equivalent objects with slightly varying attributes. If the system can apply a single operation to all the objects with consistent and intuitive results, then the user can give the objects a group name and invoke the operation with a single command (see the sketch below). In a regular join query, the user specifies the result format and the data sources. Implicit join capability means the user just has to specify the result format, and the system figures out which relations to join in order to produce that result.

Multidatabase language system users must have a means to display what information is available from various sources. The user is assumed to have well-defined ideas about what information is required and where it probably resides. Otherwise the sheer size of the information available globally will make finding necessary data an overwhelming task. The language should provide the ability to limit the scope of a query to the pertinent local database.
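The following Python sketch illustrates, under hypothetical catalog contents and naming conventions, how a multidatabase language might resolve qualified or partially qualified data-item names against a global name space and fan a single operation out over all semantically equivalent objects.

    # A minimal sketch: the catalog entries and names below are hypothetical.
    CATALOG = {
        ("nodeA", "sales",  "CUSTOMER"): ["cust_id", "name", "balance"],
        ("nodeB", "orders", "CUSTOMER"): ["cust_id", "name", "credit"],
    }

    def resolve(name):
        """Match 'relation', 'db.relation', or 'node.db.relation' against the catalog."""
        parts = name.split(".")
        matches = []
        for (node, db, rel), attrs in CATALOG.items():
            qualified = [node, db, rel]
            if qualified[-len(parts):] == parts:
                matches.append((node, db, rel, attrs))
        return matches

    def apply_to_all(name, operation):
        """Apply one operation to every object the name denotes (possibly several)."""
        return [operation(entry) for entry in resolve(name)]

    # 'CUSTOMER' denotes both local relations; 'nodeA.sales.CUSTOMER' denotes only one.
    print(apply_to_all("CUSTOMER", lambda entry: ".".join(entry[:3])))

The fully qualified form preserves the user's awareness of multiple sources, while the unqualified form lets one command operate over a group of equivalent objects.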


Being forced to do basic representation transformations for every query can make the system burdensome to use. However, views defined on commonly used information from multiple databases can be stored as basic building blocks for users to work with (Fankhauser et al., 1988). These views can create a richer environment by providing information representations closer to the user's actual needs.

In summary, the multidatabase-language approach shifts the burden of integration from global DBAs (the global-schema approach) to users and local DBAs. User queries may have to contain some programming to achieve the desired results. However, the results and processing methods can be individually tailored. Users must have some global knowledge of representation differences and data sources, but only about the information actually used. Multidatabase language systems trade a level of data independence (the global schema hides duplication, heterogeneity, and location information) for a more dynamic system and greater control over system information. The amount of function provided by the multidatabase language and its ease of use are crucial in making this a good approach.

4.2.1 Examples of Language Features

Most of the features in this section are taken from MDSL, the multidatabase language for MRDSM (Litwin, 1985a; Litwin and Abdellatif, 1987; Litwin and Vigier, 1986; Vigier and Litwin, 1987). These features can mostly be defined and invoked dynamically. A key aspect of MDSL is that query constructs remain valid as the dynamic environment (which local databases are currently open) changes. For example, an operation defined on an object is valid while local databases that contain information about the object are being opened and closed. At different points in the query execution, the operation may be applied to a single object instance, multiple object instances in different databases, or to an empty set (all pertinent databases are closed). The results of the operation should be consistent and intuitive in all cases.

The ability to define aliases and abbreviations for data-item names is important for resolving name differences between databases. A name should be allowed to refer to multiple objects from different sources if the objects are semantically equivalent. Thus, an operation on a named object may actually cause multiple operations to occur.

A user query may have to create temporary structures to hold intermediate results or to hold new representations of local information. Particularly important in this respect is the ability to define dynamic attributes. A dynamic attribute is a temporary attribute defined by a mapping from existing attributes. There are several important uses for this capability. Dynamic attributes can be used to accomplish transformations in data format. (See Section 3.2.2.) They can be used to abstract attribute values from multiple sources into a single set of values. They can be used to create a column for joins with other relations. The transformation used to create a dynamic attribute must have an inverse transformation defined if the system supports updates. A query that updates a value in the dynamic attribute will use the inverse transformation to unambiguously map the update back to corresponding updates on base data values. MDSL has been extended with the ability to define some inverse transformations automatically (Vigier and Litwin, 1987).

The last example is the capability to present results in a variety of formats. When multiple data sources are used, multiple different results may be possible for an operation depending on which sources are used and how they are combined. A user may want to see all the results and which sources were used to produce them so he or she can make a personal judgment about the reliability or applicability of the result values. A user may want the system to screen the results and only present a subset, such as the maximum and minimum values or the values from the three closest nodes in the network. Finally, the user may want the system to calculate the best result based on some specified criteria and only present a single result value.
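As a rough illustration of dynamic attributes, the following Python sketch defines a temporary attribute by a mapping from a base attribute together with an inverse mapping so that updates can be pushed back to the base data; the relation, attribute names, and conversion factor are hypothetical, and this is not MDSL syntax.

    class DynamicAttribute:
        """A temporary attribute defined by a mapping from an existing attribute."""
        def __init__(self, source, forward, inverse):
            self.source, self.forward, self.inverse = source, forward, inverse

        def read(self, row):
            return self.forward(row[self.source])

        def update(self, row, new_value):
            # Use the inverse transformation to map the update back to base data.
            row[self.source] = self.inverse(new_value)

    # Hypothetical local relation stores salaries in francs; the user works in dollars.
    salary_usd = DynamicAttribute(
        "salary_frf",
        forward=lambda frf: frf / 5.0,   # illustrative exchange rate only
        inverse=lambda usd: usd * 5.0,
    )
    row = {"name": "Dupont", "salary_frf": 250000.0}
    print(salary_usd.read(row))          # 50000.0
    salary_usd.update(row, 52000.0)      # writes 260000.0 back to salary_frf

If no inverse is defined, the dynamic attribute can still be read, but, as noted above, updates through it cannot be mapped unambiguously back to the base values.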

5. Analysis of Existing Multidatabase Systems

This section analyzes the current state of multidatabase research. Existing multidatabase projects are reviewed in Appendix A and summarized in Table I. Litwin has also provided some analysis of the current state of multidatabase research and the most pressing current issues (Litwin, 1988; Litwin and Zeroual, 1988).

5.1 Amount of Multidatabase Function

The amount of function attempted, the amount actually supported, and the system requirements vary widely in current multidatabase projects. HD-DBMS represents the ambitious end of the spectrum. It attempts to preserve all existing user interfaces while supplementing them with multidatabase function. It also attempts to provide good global performance by tightly integrating the interface to local databases through a global-schema layer with specific access-path information. As a result of its ambitious goals, HD-DBMS is still largely in the research stage. The commercially available multidatabase systems, Empress and Sybase in particular, represent significantly less function and less transparency (data-location transparency and data-representation transparency), but the attempted functions have been fully implemented.


TABLE I

SUMMARY OF MULTIDATABASE PROJECTS (Grouped by Class)

[Table I lists each project reviewed in Appendix A, grouped into global-schema multidatabases, federated databases, multidatabase-language systems, and homogeneous multidatabase-language systems, together with its developing organization, its implementation status, and a primary reference. The table body is not reproduced here.]

Key: research = no implementation yet, still in the research phase; limited = a prototype exists but does not support full function; prototype = a full-function prototype has been implemented; commercial = the product is commercially available.

Some of the multidatabase projects concentrate on only a few (or a single) aspect of multidatabase function. Research on SCOOP deals mainly with the mappings of the required functions and data at various levels of the system. Since the global-schema level is structured below the existing external user interfaces of the local DBMSs, SCOOP needs algorithms to translate the external interfaces to the global schema, as well as the normal global-schema multidatabase mappings from the global schema to the local DBMSs.

SIRIUS-DELTA work concentrates on the pivot-system concept and a layered architectural approach. Like SCOOP, SIRIUS-DELTA allows users to continue using their existing database interface. The various user schemas and access languages are translated to an internal global schema and an internal global language. These internal components are the pivot system. Multiple, heterogeneous external interfaces are mapped to the common pivot system, and the pivot system is mapped to multiple, heterogeneous local DBMSs (with different data models and access languages). Global function in SIRIUS-DELTA is layered to better isolate and control the various aspects of distributed data management.

The network portion of the distributed system is an important focus of several systems. The access languages of XNDM concentrate on network functions and transformations. LINDA relies heavily on information-exchange protocols to achieve global database function. Although most multidatabases assume point-to-point networks or a variety of networks in the system, JDDBS is based on a broadcast network. This allows a different paradigm for designing global query-processing functions. The existing SWIFT network uses defined protocols to exchange data. Current research is aimed at providing multidatabase function by implementing advanced transaction concepts (Section 3.7). Transactions are broken up into subtransactions, and only the subtransactions are required to be executed atomically. In some cases, the semantics of a subtransaction may allow successful completion even if only part of the function is executed or if the function had to be retried on a different system. If the whole transaction is to be aborted, completed subtransactions are rolled back by running a compensating subtransaction (see the sketch below).

The logic capabilities of Prolog applied to multidatabase problems are the focus of VIP-MDBS. Prolog can be used to map data representations, specify name aliases, set up triggers, create dynamic attributes, and set up implicit joins. Rules and inferencing are also available.

Other projects attempt to produce well-rounded solutions to most multidatabase issues. ADDS, Mermaid, Multibase, PRECI*, and MRDSM are particularly well documented systems that support most multidatabase functions. These projects have implemented all, or at least major portions, of their stated function.
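The compensating-subtransaction pattern referred to above can be sketched as follows in Python; the subtransactions and compensators are hypothetical callables, and this illustrates the general idea rather than the SWIFT protocol itself.

    def run_global_transaction(steps):
        """steps is a list of (subtransaction, compensating_subtransaction) pairs."""
        completed = []
        try:
            for do, undo in steps:
                do()                      # executed and committed by one local DBMS
                completed.append(undo)
        except Exception:
            for undo in reversed(completed):
                undo()                    # roll back by compensation, not by locks
            raise

The appeal of the pattern is that each local system commits its subtransaction in its normal way, so no global locks are held across autonomous sites; the price is that intermediate results are briefly visible and every subtransaction must be semantically undoable.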

5.2 Missing Database Function

Several functional areas required for a commercial system are poorly represented in existing research. These areas have either been ignored or explicitly put off until more basic system function was developed. Hopefully, these deficiencies will be corrected in the near future.

The concurrency-control issue, particularly global update capability, is recognized, and some recent work has begun to address the problem (Section 3.7). Several of the systems reviewed in Appendix A claim to have global concurrency control, yet fail to adequately support the claim with a full description of how it is achieved. Updates in global-schema multidatabases are forced to deal with the well-known problem of updating views. This problem is exacerbated in multidatabase systems because of the increased levels of transformations and the possible one-way mappings of local data representations to global representations. Multidatabase language systems should have a somewhat easier task in allowing global updates since queries have more control over the local-to-global data transformations.

Security in a multidatabase system has been almost completely ignored, with Mermaid and DATAPLEX being minor exceptions (Section 3.8). It is not clear that traditional centralized-database security concepts are adequate for the multidatabase environment. There is a need for more theoretical work in this area.

The maintenance requirements of multidatabase systems are not clear from any of the existing literature. This is of particular concern in global-schema multidatabase systems because of the maintenance required by the global-schema structure. Potential customers must know what the ongoing costs will be in order to evaluate the usefulness of a multidatabase system. It is not clear how much customization will be required or desired when installing a multidatabase system. A related issue, also not covered in the literature, is the administrative burden and cost associated with a multidatabase system.

Finally, error control and recovery are vital functions of any commercial product. The complexity of the multidatabase environment and the need to maintain site autonomy complicate these issues. Each local system may have different procedures and responses for each class of system errors. A multidatabase must not only integrate varying data representations into a uniform global view, it must also integrate various types of error reporting and recovery methods into a uniform global function. Again, this area has received very little attention in the literature.

5.3 Performance

With the exception of ADDS and System R*, none of the literature on existing multidatabase systems presents concrete performance information (Breitbart et al., 1984; Mackert and Lohman, 1986). Even the aforementioned projects present such information on only limited portions of the system function. The ADDS testing simulated remote data requests and collected information on response time and data-transmission time. The System R* study modeled query-optimizer performance in selecting efficient access plans for local queries. There is a pressing need for more performance evaluation of existing systems and more comprehensive testing of system function.

Multidatabase systems operate in a large, complex environment that presents difficult challenges to system testing and the creation of adequate performance-evaluation models. However, such testing and evaluation are necessary to sort out the various proposed solutions to different multidatabase issues. There is currently little basis of comparison between systems as to the effectiveness of their design. Hard benchmarks are needed to evaluate equivalent function between systems, compare the merits of differing functions, and estimate the costs of implementing various capabilities. Part of this lack of data is attributable to the fact that many existing projects are still wholly or partly in the research stage of development. Even many of the systems that claim to have prototypes have implemented only limited function or have attained full function only recently. Hopefully, hard performance data will begin to appear in the literature as the development groups gain experience with existing systems.

5.4 Cost

There is no information in the literature about the relative costs of multidatabase functions. Section 3.9 reviews some of the processing and storage overheads required for participating nodes, but the cost of these overheads is unknown. Like performance-evaluation results, the cost of various functions needs to be known in order to compare the merits of different systems. Necessary cost information includes the monetary expense of buying and maintaining a multidatabase system, as well as data on the impact of adding global function on local resources and performance. Costs related to the development of individual multidatabase functions will also be useful in evaluating which global functions should be developed.

6. The Future of Multidatabase Systems

As existing issues are resolved in various ways, multidatabase designers should begin to consider future requirements and directions for multidatabase systems. As multidatabases come into more general use, the needs of new sets of users and increasingly sophisticated application requirements must be met. The prerequisites for multidatabase systems, namely networks and local DBMSs, are proliferating rapidly. The expectations of the computer industry and computer users are also growing rapidly. Multidatabases should play a significant role in future computer systems.

6.1 User Interfaces

Multidatabases represent large collections of data that must be effectively managed and easily accessed by systems designers and users. Traditional DBMSs have assumed that the schemas that describe available data are relatively small and easily understandable by users. Today, however, even large centralized databases are discovering that their schemas (which represent a concise view of the information in the database) are too large and complex to be understood without assistance (D'Atri and Tarantino, 1989). Because of their size and extra layer of complexity (the global integration of local data), multidatabases will require automated aids to help users find and access global information.

Three interactive techniques have been proposed to help users navigate through large, complex data sources (D'Atri and Tarantino, 1989). Browsing allows a user to see a portion of the database schema (or actual data) and find specific data items by moving the viewing window around the schema. This technique is of limited value because it is tedious and requires the user to wade through levels of detail that may not be appropriate to his or her needs. A second technique, connection under logical independence, allows the user to partially specify the information desired, and the system does its best to interpret the user's request and return the available information that best matches the interpreted request. This technique is more nonprocedural, thus user-friendly, but requires the system to interpret a user's intention based on partial information. Finally, generalization allows the user to specify a sample of the information desired, and the system generalizes the sample to include the full scope of information that the user actually wants. This technique requires the user to be very knowledgeable about the information desired (in order to provide an appropriate sample), and the system must still make some interpretation of the user's intent in order to generalize properly.

Another issue for large system-user interfaces is the vocabulary problem. A system must use specific names (access terms) to keep track of specific data items. Traditionally, the user must know the precise name for a desired data item in order to access it. Furnas has shown that without prior coordination between designers and users, the two groups are quite unlikely to select the same name for any given entity (Furnas et al., 1987). The suggested solution is to allow the user to enter a data name meaningful to himself or herself and have the system map the user-specified name to the closest semantic match among actual system names. Again, this forces the system to perform some interpretation of the user's request. Motro has surveyed some user interfaces capable of handling vague (imprecise) requests (Motro, 1989).

A problem closely related to the vocabulary problem is the difficulty of determining the semantic equivalence or distinction between entities in different local DBMSs. Differing data representations, particularly different naming conventions, may obscure the semantic relations between entities. Global-schema designers need to be able to determine semantic equivalence in order to map multiple local entities to a single, common global entity. Multidatabase language system users need to be able to determine semantic equivalence to ensure that all appropriate data are processed by a query. Automatic semantic disambiguation is a pressing need now.

Because multidatabases will serve a large, varied user set, they must allow for novice users and users who are not precisely sure of the data they need. Some system interpretation of imprecise user requests will be necessary to support nonprocedural access to these large data sources. The ability to suppress unwanted details and to view the available information at various levels of abstraction will also be required.

6.2 Effective Utilization of Resources

Current computer networks may link computer systems of vastly different sizes and capabilities. Personal computers and desktop workstations can be linked to the most powerful supercomputers. Multidatabase systems should be designed to allow any computer to participate in the system and to make optimum use of available processing resources. The common practice of distributing global function evenly among all nodes is not practical when small computers are in the system. The system should recognize that some hardware units have limited processing power and storage space and assign these machines minimal global-processing responsibility. As mentioned in Section 3.9, a few existing systems do cater to smaller machines, or use more powerful computers as servers. This unequal distribution of global function should become the norm in multidatabase-system design.

On the other end of the scale, a multidatabase system should effectively harness powerful machines to perform global processing more efficiently. Global-system administrators may want to boost performance by adding specialized database machines to relieve the burden on other nodes (Hurson et al., 1989, 1990). Today's query optimizers typically try to minimize communication overhead at the cost of all other considerations. With today's rapidly increasing network performance and a possible wide range in node processing speed, query optimizers must also balance the processing overhead against the communications overhead. Some optimizers, such as that of System R*, already incorporate such considerations.
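To illustrate the kind of balance involved, the following Python sketch computes a plan cost that weighs communication against local processing; the weights and estimates are purely illustrative and are not the actual System R* cost formulas.

    def plan_cost(msgs, bytes_shipped, cpu_instructions, io_pages,
                  w_msg=1.0, w_byte=0.0001, w_cpu=1e-6, w_io=0.01):
        """Combine communication and processing estimates into one plan cost."""
        communication = w_msg * msgs + w_byte * bytes_shipped
        processing = w_cpu * cpu_instructions + w_io * io_pages
        return communication + processing

    # With non-zero processing weights, a plan that ships a little more data to a
    # fast server node can beat one that minimizes bytes shipped at any price.
    print(plan_cost(msgs=4, bytes_shipped=2000, cpu_instructions=5000000, io_pages=120))

Tuning such weights per node is one concrete way a system could account for the wide range of machine capabilities discussed above.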

6.3 Increased Semantic Content

There is much research in the area of data models intended to capture more semantic information about the data stored in a database. The entity-relationship model (Chen, 1976), the functional data model (Date, 1983), the extended relational model (Codd, 1979; Date, 1983), and object-oriented models (Dittrich, 1986) are all examples of this trend. Increasingly sophisticated users and applications require more powerful methods for expressing and manipulating information. Most multidatabase systems are based on the relational data model. The relational model is attractive because of its simplicity and strong mathematical foundation (Date, 1983, 1986). Also, SQL has become an ANSI standard, so designers have an accepted, common basis for the global language (Date, 1987). However, the forces that have motivated the aforementioned research will force multidatabase systems to consider more sophisticated global data models and languages.

One problem with a powerful global model is that a powerful request must still be translated back to less powerful models at the local level. Therefore, a significant processing burden would be shifted to the global system to make up for local functional deficiencies. This problem already occurs to some extent, but will be exacerbated if the global model becomes more powerful.

6.4 A Proposed Solution

One possible solution to the problems of powerful user interfaces, effective distribution of processing, and increased semantic content is the summary-schema model (Bright and Hurson, 1990). The summary-schema model groups a multidatabase system into a logical hierarchy of nodes. An online dictionary/thesaurus is used to summarize the local DBMS schemas into increasingly more abstract and more compact representations of the information available in the system. This summarization allows global data representation with varying levels of detail and allows users to submit imprecise queries. Nodes at the lowest level of the hierarchy participate in the multidatabase by providing an export schema of data to share with the global system and supporting a minimal subset of global function. Nodes at higher levels of the hierarchy summarize the data of the lower-level schemas and support most of the global processing. This distribution of function allows smaller computers to participate with minimal overhead and powerful machines to utilize their capabilities as higher-level nodes in the hierarchy.

Each name in an export schema (at the lowest level of the hierarchy) is considered an access term for some local data item (or attribute). Each local access term is associated with an entry in the online dictionary/thesaurus in order to precisely capture its semantics. Higher-level nodes in the hierarchy summarize the access terms from their child nodes' schemas by using the broader-term/narrower-term (hypernym/hyponym) links in the online thesaurus. Multiple specific access terms at one level will map to a single, more general term at the next higher level. Therefore, a summary schema contains fewer access terms than the underlying summarized schemas, but retains the essential information (albeit in a more abstract form).

Different system names that are semantically similar will be mapped to the same term at some level of the summary-schema hierarchy. Therefore, summary schemas can be used to determine how close two access terms are semantically. A user can submit a query with imprecise data references, and the system will use the summary schemas and the information in the online thesaurus to determine the available system reference that is semantically closest to the user's request.

The summary-schema model captures global information at varying levels of detail. This provides most of the function of a global schema with less overhead for most nodes. The increased semantics available for query processing are a significant aid to multidatabase languages. The ability to submit imprecise queries makes the system significantly more user-friendly.
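A minimal Python sketch of the term-matching idea follows; the thesaurus links, export terms, and distance measure are hypothetical illustrations of the mechanism, not the actual summary-schema algorithms.

    # Hypothetical hypernym links: narrower term -> broader term.
    HYPERNYM = {
        "sedan": "automobile", "automobile": "vehicle", "truck": "vehicle",
        "wages": "compensation", "salary": "compensation",
    }

    def ancestors(term):
        """Follow hypernym links upward, mimicking higher summary-schema levels."""
        chain = [term]
        while chain[-1] in HYPERNYM:
            chain.append(HYPERNYM[chain[-1]])
        return chain

    def semantic_distance(a, b):
        """Number of hypernym links to the closest term both map to, if any."""
        chain_a, chain_b = ancestors(a), ancestors(b)
        common = [t for t in chain_a if t in chain_b]
        if not common:
            return None
        return chain_a.index(common[0]) + chain_b.index(common[0])

    def closest_access_term(user_term, export_terms):
        scored = [(semantic_distance(user_term, t), t) for t in export_terms]
        scored = [(d, t) for d, t in scored if d is not None]
        return min(scored)[1] if scored else None

    # An imprecise user term 'sedan' is routed to the local access term 'truck'
    # because both summarize to 'vehicle' at some level of the hierarchy.
    print(closest_access_term("sedan", ["truck", "salary"]))

The important property is that the matching uses only the summarized terms and the thesaurus links, so no node needs a full copy of every local schema to answer an imprecise reference.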

6.5 New Database-Management-System Function

Just as there is continual research on new data models, there is a great deal of current research on new DBMS functions. The POSTGRES system contains several good examples of current research directions (Stonebraker and Rowe, 1986). POSTGRES has support for complex objects such as images or nested relations. Complex objects can usually be represented in a standard relational system, but the representation may be large and clumsy. Adequate access performance can only be achieved if more sophisticated support is integrated in the DBMS. POSTGRES also has support for triggers, alerts, and inferencing. These functions make the DBMS a more active component in information management. Finally, POSTGRES includes support for user-defined data types, operations, and access methods. Data types can be procedural, i.e., the basic data item is a software procedure rather than a specific data item. This support allows users to customize the DBMS to some extent and create an environment that is easier to write applications for and that provides enhanced performance. The Starburst project at IBM's Almaden Research Center also allows significant user-defined extensions to the basic DBMS (Haas et al., 1988). Both POSTGRES and Starburst are built on top of the standard relational model.

As centralized databases become increasingly more sophisticated, multidatabase systems will be forced to follow suit. Local users accustomed to sophisticated information management locally will demand a similar environment globally. Multidatabase researchers must be aware of current centralized DBMS developments and begin to plan similar extensions to their systems. Again, implementing powerful global functions may be difficult if some local DBMSs are more primitive.

6.6 Integration of Other Data Sources

Computer systems may have many other data sources than just standard databases. For example, some multidatabase systems (Multibase, for example) recognize the need to access flat files. These may contain application information or important system information needed by the global function to plan efficient processing.

A more important future consideration is the integration of knowledge-base systems (Stonebraker, 1989). Knowledge-based systems are increasing in popularity and importance as a source of data. The need for knowledge-base systems is related to the need for more powerful semantic data representation and processing, as discussed previously. One way to couple databases and knowledge bases is in an interoperable system (Section 2.1.6). However, many applications will require tighter integration between them. Even the global function of the multidatabase itself may require some expert-system help to handle the complex, dynamic problems brought on by the complex, dynamic nature of distributed systems. An example is ADZE (ADaptive query optimizing systEm), which uses a dynamic learning component to fine-tune the query optimizer itself (Norrie and Asker, 1989).

Rule-based systems and logic programming are attractive user paradigms. VIP-MDBS illustrates the power of Prolog in handling multidatabase functions. Thus, multidatabase designers may want to consider a knowledge-based-system interface for the global user interface.

7. Summary and Future Developments

Multidatabases are an important tool for meeting current and future information-sharing needs. They preserve existing investments in computers and training, yet provide integrated global information access. They are an effective solution to pressing requirements for such global access.

This chapter presented a taxonomy of global information-sharing systems and discussed the position of multidatabase systems in the taxonomy. Multidatabase systems can be implemented on top of existing, heterogeneous local DBMSs without modifying them, while distributed databases cannot. Multidatabase systems provide significantly more powerful global function than interoperable systems. Multidatabases were defined and two representative systems, Multibase and MRDSM, were discussed in detail.

Multidatabase issues were reviewed and current problems and solutions were presented. Key issues include site autonomy, differences in data representation, and concurrency control. Site autonomy is a major strength of multidatabases, yet it is also a major constraint on global-system design. Differences in data representation are many and varied. Resolving these differences is a major concern of global-system design. Global updates are a major current restriction in many systems, and the focus of much current research. A related topic is the search for more effective paradigms for global transaction management.


The majority of multidatabase systems being studied are global-schema multidatabases. The close ties to distributed databases allow some synergy in solving related issues in the two fields. The single-data-source user paradigm is also appealing. However, the size and complexity problems of global-schema multidatabases make them impractical for large distributed systems. Since the trend today is toward more interconnection, i.e., larger systems, multidatabase language systems (the other major design approach) seem more practical for most future requirements. Multidatabase language systems have no constraints on system size, but the tradeoff for achieving this is a multiple-data-source user paradigm and a more complex user interface, i.e., the user language has more features.

The simplicity of multidatabase language systems relative to global-schema multidatabases is evidenced by the fact that homogeneous multidatabase language systems are the first class of multidatabases to produce commercial products. Currently these commercial products have limited multidatabase function and characteristics that make them more like distributed databases than multidatabase systems, i.e., local databases do not have full site autonomy. Hopefully these deficiencies will be corrected soon and full-function multidatabases will become commercially available.

Strengths and weaknesses of existing multidatabase projects were discussed. Existing projects are summarized in Appendix A. Areas in need of further research include concurrency control, security, maintenance, error control and recovery, performance evaluation, and cost information.

Future requirements for multidatabase systems were presented. User interfaces must provide more function to display available information in a simpler form, do more implicit processing, and allow for queries specified in the user's terms rather than the system's. Since future distributed systems are likely to be increasingly heterogeneous with respect to size, power, and function of participating computer systems, multidatabase systems must make more effective use of existing resources. The global system must provide more semantic power. The summary-schema model was presented as a possible solution to several current and future system requirements (Hurson and Bright, 1991). Advances in centralized DBMS function should be reflected in multidatabase function. Integration of nontraditional data sources must be provided, particularly integration of knowledge-based systems.

There are a number of open problems that should be solved in order to make multidatabase systems more useful and efficient. Some deficiencies in existing prototypes must be resolved before they can evolve into commercial products. These deficiencies include global concurrency control, global security, maintenance and administration, and error control and recovery. These issues may not have the excitement and need for extensive theory that some other functions require, but they are requirements for a production environment. Existing projects are also deficient in reporting performance-evaluation data and cost information. This information is required to compare systems and evaluate which theories and functions are most productive.

Multidatabase systems did not exist, even in theory, in the 1970s. Since that time, they have made great strides in theory and in practice. Many problems have been solved, yet many known problems remain. Multidatabase systems should begin to have a large impact in the information-processing world as more powerful systems become generally available in the near future.

Appendix A. Review of Multidatabase Projects

This section reviews most of the current multidatabase projects reported in the open literature. Because of the dynamic nature and wide scope of the field, we cannot assure that all applicable projects have been included. These projects come from a wide variety of countries and institutions. Some are mainly research vehicles to study some specific problem area; others are full-blown commercial systems destined for the open market. The range of organizations involved and the number of projects reported indicate the importance of this field. This review concentrates on the highlights and significant features of each system. More complete descriptions are available in the referenced materials. A summary of this review is shown in Table I.

A.1 Global-Schema Multidatabase Projects

A.1.1 ADDS

ADDS (Amoco Distributed Database System) was developed at the Amoco Production Company Research Center (Breitbart et al., 1986; Breitbart and Silberschatz, 1988; Breitbart et al., 1987; Breitbart and Tieman, 1984). It is one of the most functionally complete multidatabase projects. Each global user has a view defined on a global schema, as well as access to a multidatabase language. The global schema uses an extended relational model (Date, 1986), but the user interface also supports a universal relational model paradigm (Maier et al., 1984). Methods are provided to transform representation differences and to define global constraints. The user interface is well designed. It supports menu or command-level processing and keeps a profile of environment-control options for each user. Global updates are not currently supported, but an update scheme has been developed and will be included in the future.


A.1.2 DATAPLEX

DATAPLEX was developed at the General Motors Research Laboratories (Chung, 1989, 1990). The global schema is relational and the global language is SQL. The global schema seems to be just a union of the local schemas, since no mention is made of any support for data integration. Much of the work concentrates on query decomposition and the translation of subqueries from the global relational model to local nonrelational models. The query optimizer collects local statistics on relations involved in a query and makes use of semijoins in the query-execution plan. Security is provided through required global authorizations to local data items and through the use of views. The system is designed to use two-phase locking and two-phase commit to support global updates. Since existing DBMSs do not provide support for these functions at the user interface, research is ongoing in this area.

A.1.3 DQS

DQS (Distributed Query System) was developed by CRAI, an Italian company (Belcastro et al., 1988). The global schema is relational and the user language is SQL. DQS translates SQL user queries to an internal language for more efficient global processing. Part of the global auxiliary database contains statistical information about local data structures for use by the global optimizer. When accessing local DBMSs that are not relational, the query processor will first materialize a local relation equivalent to the local representation of the requested data before processing the information. Nodes are allowed to join the system as data servers only, i.e., no query processing occurs at that site. This allows those sites to minimize the local node storage and processing requirements. DQS global queries are read only.

A.1.4 EDDS

EDDS (Experimental Distributed Database System) was developed at the University of Ulster and Trinity College, both in Ireland (Bell et al., 1987, 1989). The global schema is relational and the global language is SQL. Auxiliary schemas are used to store data-representation transformation information. EDDS has global concurrency control. Small machines may join the system and are only required to maintain a minimal global-function module. Queries from these nodes are passed to the nearest full-function site for processing.

A.1.5 HD-DBMS

HD-DBMS (Heterogeneous Distributed DataBase Management System) is an ambitious long-range multidatabase project at UCLA that started in the 1970s (Cardenas, 1987; Cardenas and Pirahesh, 1980). The global schema uses the entity-relationship model. There is an extra global-schema level, i.e., the global internal level, that has access-path information that is tightly integrated with local DBMSs. A major system goal is to provide external views in multiple data models and with multiple data-manipulation languages. This allows users to use whatever access paradigm is most comfortable for them. The development effort for HD-DBMS is taking a very layered and modular approach. Some parts of the system are in the process of being implemented, while others are still being worked out on a theoretical level.

A.1.6 JDDBS

JDDBS (Japanese Distributed DataBase System) was developed at the Japan Information Processing Development Center (Takizawa, 1983). The global schema is relational and supports external user views. The global query processor assumes a broadcast network. JDDBS supports global updates.

A.1.7 Mermaid

Mermaid was developed at System Development Corporation, which later became part of UNISYS (Brill et al., 1984; Templeton et al., 1986, 1987a, 1987b). The global schema is relational and supports two query languages: ARIEL and SQL. A major emphasis in Mermaid is query-processing performance, and much of the research has gone into developing a good query optimizer. User queries are translated into an internal data-manipulation language, DIL (Distributed Intermediate Language). This internal language is optimized for interdatabase processing. The query optimizer uses a combination of two processing-distribution methods called the semijoin algorithm and the replicate algorithm. The optimizer is designed to reach a good distribution solution very quickly, rather than make an exhaustive search for the optimal solution. The cost difference between the good and optimal solutions is assumed to be less than the extra cost associated with an exhaustive search for the optimal solution.
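As a rough illustration of the semijoin tactic mentioned above, the following Python sketch ships only the join-column values of one relation, filters the remote relation with them, and returns just the matching tuples; the relations and column names are hypothetical, and this is not Mermaid's actual algorithm.

    def semijoin(local_rows, remote_rows, key):
        """Reduce a remote relation using only the local join-column values."""
        join_values = {row[key] for row in local_rows}          # small message sent over
        return [r for r in remote_rows if r[key] in join_values]  # only matches shipped back

    employees = [{"dept": 10, "name": "Ng"}, {"dept": 30, "name": "Roy"}]
    departments = [{"dept": 10, "budget": 5}, {"dept": 20, "budget": 7},
                   {"dept": 30, "budget": 2}, {"dept": 40, "budget": 9}]
    # Only departments 10 and 30 cross the network; the final join runs locally.
    print(semijoin(employees, departments, "dept"))

The replicate alternative would instead copy one whole relation to the other site, which can win when the relation is small or the join is not selective.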

A.1.8 Multibase

Multibase was developed by Computer Corporation of America (Dayal, 1983; Landers and Rosenberg, 1982; Smith et al., 1981). It was one of the first major multidatabase projects and contained many common ideas used in later systems. Its features are discussed in Section 2.2.1.

A.1.9 NDMS

NDMS (Network Data Management System) was developed at CRAI, an Italian company (Staniszkis et al., 1983, 1984). The global schema is relational and the query language is a variant of SQL. The global schema is created by applying aggregate functions to local information structures. An aggregate function abstracts common details of input information and is similar in nature to a generalization function (used to create generalization hierarchies) (Smith and Smith, 1977). When a query requests information from a local database that is not relational, the global module at that site will retrieve the data and create relations that contain equivalent data. The query optimizer keeps track of statistical information about local data structures to help create better execution plans. Global concurrency control is provided, so updates are allowed.

A.1.10 PRECI*

PRECI* was developed at the University of Aberdeen in Scotland (Deen, 1981; Deen et al., 1984, 1985, 1987a,b). The global schema is relational and the query language is PAL (PRECI* Algebraic Language). Local DBMSs must present a relational interface to join the system and support a minimum subset of PAL functions. An important goal of PRECI* is support for replicated data, and the system provides two modes for handling it. Replicated data means a local database can keep copies of information from other databases, and the copies will be synchronized in some fashion (Alonso and Barbara, 1989). PRECI* also supports different levels of participation by local nodes. Inner nodes contribute to the global schema and support full multidatabase function. Outer nodes do not contribute to the global schema and support a lower level of global function. External databases can exchange data with PRECI* via defined protocols. PRECI* supports global updates.

A.1.11 Proteus

Proteus is a research prototype developed by a group of British universities (Oxborrow, 1987; Stocker et al., 1984). The global data model is the Abstracted Conceptual Schema and the global language is the Network Transfer Language. The user submits queries in his or her local data language, and the queries are translated to the global data language/data model. The system architecture is based on a star network structure, and only the central node runs the core of the system function. However, this global function is designed so that it could be distributed. Proteus does not currently support global updates.


A.1.12 SCOOP

SCOOP (Systeme de Cooperation Polyglotte) was a joint effort between the University of Paris 6 and Turin University in Italy (Spaccapietra et al., 1981, 1982). The global schema uses the entity-relationship model. The global schema is actually structured between the local DBMS external-user schemas and the local DBMS conceptual schema (Date, 1985). Normally a global schema is a layer above the local external-user schemas. This means the global system is responsible for intercepting local queries, mapping them to the global schema, and then mapping them back to the local DBMS conceptual schema for processing. Consequently, a major goal of the SCOOP project is to study mapping algorithms.

A.1.13 SIRIUS-DELTA

SIRIUS-DELTA was developed at the INRIA research center in France (Esculier, 1984; Ferrier and Stangret, 1982). It is an extension of an existing homogeneous distributed database. The key part of the SIRIUS-DELTA design is the definition of a common global level, called the pivot system, which includes the data model (relational), global schema, data-manipulation language (PETAL), and global functions. On the user side of the pivot system, users keep their existing interfaces, and the pivot system translates queries down to the common layer. On the processing side, the pivot system translates queries in the common language to subqueries in the local languages of the databases involved with the query. Results are translated back to the pivot system and then to the original user's language. The emphasis in SIRIUS-DELTA is on the layered-architecture approach.

A.1.14 UNIBASE

UNIBASE was developed at the Institute for Scientific, Technical, and Economic Information in Warsaw, Poland (Brzezinski et al., 1984). It has a relational global schema and a relational query language. There is support for global constraints. UNIBASE is still in a research phase.

A.1.15 XNDM

XNDM (eXperimental Network Data Manager) was developed at the National Bureau of Standards (Kimbleton et al., 1981). It has a relational global schema and two relational user languages, both of which are similar to SQL. One language, XNQL, is used strictly for reading, and the other, XNUL, is used for updates. These functions are separated because the global requirements for reading versus updating are significantly different. Query processing is performed at server machines rather than at the query-origin node. Much of the XNDM work concentrates on data mappings and translations.
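
A rough sketch of the read/update split follows. XNQL and XNUL are the real language names, but the routing logic, server names, and classify function below are invented for illustration only.

```python
# Hedged sketch: reads and updates follow different global paths because
# their consistency requirements differ, and processing happens at server
# machines rather than at the origin node.

def classify(statement: str) -> str:
    head = statement.strip().split()[0].upper()
    return "update" if head in {"INSERT", "UPDATE", "DELETE"} else "read"

def submit(statement: str, servers: dict) -> str:
    kind = classify(statement)
    target = servers[kind]          # pick the server handling this kind of traffic
    language = "XNUL" if kind == "update" else "XNQL"
    return f"sent via {language} path to {target}"

servers = {"read": "query-server-A", "update": "update-server-B"}
print(submit("SELECT name FROM parts", servers))
print(submit("UPDATE parts SET qty = 0", servers))
```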

A.2 Federated Database Projects

A.2.1 Heimbigner

Heimbigner and McLeod (1985) provide the standard definition of a federated database. A small prototype based on the definition was implemented. Each site defines an export schema of local data it will share with the system and an import schema of all the information that the site will access from other system nodes. The import schema is similar to a view on a global schema, but in this case the view is defined directly on local export schemas. (There is no intervening global schema.) To create an import schema, a local DBA checks the federated dictionary to find out what remote nodes are active in the system. Then the DBA uses defined protocols to communicate with remote nodes in order to browse over their export schemas. There is a negotiation protocol for gaining access rights so remote information can be included in an import schema. The architecture uses an object-oriented data model (Dittrich, 1986). The global language provides functions for data-representation transformations to be defined in the mappings from export to import schemas. Global update is automatically allowed for transformations with defined inverses.
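
The export/import arrangement and the inverse-transformation rule can be sketched as follows. This is not the prototype's code; the relation, attribute, and conversion names are hypothetical, and the sketch only shows that a mapping is treated as updatable when its inverse is defined.

```python
# Illustrative sketch of the federated idea: one site publishes an export
# schema, another site builds an import schema directly on it, and a
# transformation with a defined inverse can be mapped back, so updates
# through it are permissible.

miles_to_km = lambda m: m * 1.609344
km_to_miles = lambda k: k / 1.609344        # defined inverse

export_schema = {            # what the owning site agrees to share
    "TRIP": {"attributes": ["trip_id", "distance_miles"]},
}

import_schema = {            # remote view defined directly on the export schema
    "TRIP_METRIC": {
        "source": ("TRIP", "distance_miles"),
        "transform": miles_to_km,
        "inverse": km_to_miles,              # presence of an inverse => updatable
    },
}

def updatable(view: dict) -> bool:
    return view.get("inverse") is not None

print(updatable(import_schema["TRIP_METRIC"]))   # True
```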

A.2.2 Ingres/Star

Ingres/Star is a commercial product that provides a multidatabase layer on top of local Ingres DBMSs (Andrew, 1987; Litwin and Zeroual, 1988). Other relational DBMSs will be supported in the future. Ingres/Star allows users to define multiple import schemas at a node (a slight deviation from a strict federated database). Multidatabase queries can only be submitted through a previously defined import schema. Global constraints are not supported; global updates are planned for a future release.

A.3 Multidatabase-Language-System Projects

A.3.1 Calida

Calida was developed at GTE Research Laboratories (Litwin and Zeroual, 1988). The global language is relational and is called DELPHI. The language supports a global name space and implicit joins. Query optimization is emphasized. Global updates are supported.
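
A global name space with implicit joins can be illustrated roughly as below. DELPHI syntax is not reproduced; the catalog contents and the rule of joining on identically named attributes are assumptions made for the example.

```python
# Sketch of a global name space with implicit joins: unqualified relation
# names are resolved against all member databases, and a join condition is
# inferred from attributes the two relations share. Names are invented.

catalog = {
    "sales_db":   {"CUSTOMER": ["cust_id", "name", "region"]},
    "billing_db": {"INVOICE":  ["inv_id", "cust_id", "amount"]},
}

def resolve(relation: str) -> str:
    # Global name space: find which database owns an unqualified name.
    for db, rels in catalog.items():
        if relation in rels:
            return db
    raise KeyError(relation)

def implicit_join(r1: str, r2: str) -> list:
    db1, db2 = resolve(r1), resolve(r2)
    common = set(catalog[db1][r1]) & set(catalog[db2][r2])
    return [f"{db1}.{r1}.{a} = {db2}.{r2}.{a}" for a in sorted(common)]

print(implicit_join("CUSTOMER", "INVOICE"))
# ['sales_db.CUSTOMER.cust_id = billing_db.INVOICE.cust_id']
```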

A.3.2 Hetero

Hetero was developed by Felipe Carino, Jr., in Sunnyvale, California (Carino, 1987). Hetero is a front-end system only, and all query processing is performed by local DBMSs. The global data model is similar to the extended relational model RM/T (Codd, 1979), and the global query language is an extension of SQL. The system automatically creates virtual relations for all the local data managed by nonrelational DBMSs. Global concurrency control is maintained through an emulated two-phase commit protocol with time stamps. Thus, global updates are supported. Global security is maintained through a matrix authorization and validation scheme. The global catalog contains all the local schemas (they are not integrated together) and is fully replicated at each node. A key feature of Hetero is the sophisticated user interface. A fourth-generation language is used to support querying, report writing, graphics, and a Unix-like shell. The shell supports pipes and filters between database queries.
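
The emulated commit protocol can be pictured with a minimal coordinator sketch. This is not Hetero's code: the Participant class, time-stamp handling, and node names are invented, and the sketch only shows the prepare/commit voting pattern layered over local DBMSs.

```python
# Minimal sketch of a coordinator emulating two-phase commit over local
# DBMSs that offer no global commit interface of their own.
import time

class Participant:
    def __init__(self, name: str):
        self.name = name

    def prepare(self, ts: float) -> bool:
        # A real participant would write a prepare record and vote yes/no.
        print(f"{self.name}: prepared (ts={ts:.0f})")
        return True

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def global_commit(participants: list) -> bool:
    ts = time.time()                                  # time stamp ordering the global transaction
    if all(p.prepare(ts) for p in participants):      # phase 1: prepare and collect votes
        for p in participants:                        # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                            # any "no" vote aborts the global transaction
        p.abort()
    return False

global_commit([Participant("ingres_node"), Participant("ims_node")])
```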

A.3.3 LINDA

LINDA (Loosely INtegrated DAtabase system) was developed at the Technical Research Centre of Finland (Wolski, 1989). LINDA has many features of interoperable systems, as opposed to multidatabases. The system emphasis is on the use of defined message protocols to exchange information. However, LINDA does provide some global database function, so it is included in this review. The global language is SQL. Each site has a client unit and/or a server unit. All information-integration tasks are handled by the user.

A.3.4 MRDSM

MRDSM (Multics Relational Data Store Multiple) was developed at the INRIA research center in France (Litwin, 1984a, 1985a,b; Litwin and Abdellatif, 1987; Litwin and Vigier, 1986; Wong and Bazex, 1984). It is one of the most extensively documented multidatabase language systems and includes many key features of this class. Its features are discussed in Section 2.2.2.

A.3.5 Odu

Odu was developed at the University of Wales (Omololu, 1989). The global data model is similar to the entity-relationship model. Odu is tailored for a distributed system that has a variety of machine sizes. The global function is unevenly distributed. Small local nodes, which support limited global function, are connected to larger machines with full global function. The large machines are connected in a star topology. Global queries are read only.

A.3.6 SWIFT

SWIFT (Society for Worldwide Interbank Financial Telecommunications) is an organization that manages an international network for banking transactions (Eliassen et al., 1988; Holtkamp, 1988; Veijalainen and Popescu-Zeletin, 1988). The existing system just transfers messages, but an extension to the system has been proposed to add multidatabase capabilities. The global language is an extension of SQL. Much of the research has concentrated on developing an appropriate model for global transactions.

A.3.7 VIP-MDBS

VIP-MDBS (Vienna Integrated Prolog MultiDataBase System) was developed at the Vienna Technical University, Austria (Kuhn and Ludwig, 1988a,b). VIP-MDBS uses Prolog, a logic-programming language, as its global language. Local DBMSs are relational, but the logic capabilities of Prolog may allow future versions of the system to integrate knowledge-based systems as well as traditional databases. Prolog constructs are used to support a global name space, global constraints, information-representation transformations, and implicit joins. The user interface is nonprocedural.

A.4 Homogeneous Multidatabase-Language-System Projects

A.4.1 Empress

Empress V2 was developed by Rhodnius Inc., a Canadian company (Litwin and Zeroual, 1988). The global language is an extension of SQL and supports a global name space, opening multiple remote databases, and defining multidatabase views. Empress provides global concurrency control, so global updates are supported.

A.4.2 Sybase

Sybase was developed by Sybase Inc. in California (Cornelius, 1988; Litwin and Zeroual, 1988). The global language is an extension of SQL called Transact-SQL. It supports a global name space, implicit joins, and global constraints. Multidatabase views can be defined. Global function is split between front-end sites and server sites. Note that both front-end and server modules may reside on the same machine.

A.4.3 System R*

System R* was developed at IBM's San Jose Laboratories (Bertino and Haas, 1988; Lindsay, 1987; Mackert and Lohman, 1986; Mohan et al., 1986). System R* is an extension of the experimental, centralized DBMS System R. It could be classified as a distributed database system because the global system has internal interfaces at the local level, but site autonomy is stressed throughout the system design, so System R* is included here. SQL is the global language. System R* supports a global name space and global updates. There is support for distributed security, and the query optimizer has been well tested.

REFERENCES

Abbott, K. R., and McCarthy, D. R. (1988). Administration and Autonomy in a Replication-Transparent Distributed DBMS. Proceedings of the 14th International Conference on Very Large Data Bases, pp. 195-205.
Alonso, R., and Barbara, D. (1989). Negotiating Data Access in Federated Database Systems. Proceedings of the 5th International Conference on Data Engineering, pp. 56-65.
Alonso, R., Garcia-Molina, H., and Salem, K. (1987). Concurrency Control and Recovery for Global Procedures in Federated Database Systems. Database Engineering 10, 129-135.
Andrew, D. (1987). INGRES/STAR: A Product and Application Overview. Colloquium on Distributed Database Systems, pp. 21-27.
Batini, C., and Lenzerini, M. (1983). A Conceptual Foundation for View Integration. "System Description Methodologies" (D. Teichroew and G. David eds.), pp. 417-432, North-Holland.
Batini, C., Lenzerini, M., and Moscarini, M. (1983). Views Integration. In "Methodology and Tools for Data Base Design" (S. Ceri ed.), pp. 57-84, North-Holland.
Batini, C., Lenzerini, M., and Navathe, S. B. (1986). A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys 18, 322-364.
Belcastro, V., Dutkowski, A., Kaminski, W., Kowalewski, M., Mallamici, C. L., Mezyk, S., Mostardi, T., Scrocco, F. P., Staniszkis, W., and Turco, G. (1988). An Overview of the Distributed Query System DQS. Proceedings of the International Conference on Extending Database Technology-EDBT '88, pp. 170-189.
Bell, D. A., Grimson, J. B., and Ling, D. H. O. (1987). EDDS-A System to Harmonize Access to Heterogeneous Databases on Distributed Micros and Mainframes. Information and Software Technology 29, 362-370.
Bell, D. A., Ling, D. H. O., and McClean, S. (1989). Pragmatic Estimation of Join Sizes and Attribute Correlations. Proceedings of the 5th International Conference on Data Engineering, pp. 76-84.
Bernstein, P. A., Hadzilacos, V., and Goodman, N. (1987). "Concurrency Control and Recovery in Database Systems." Addison-Wesley, Reading, Massachusetts.
Bertino, E., and Haas, L. M. (1988). Views and Security in Distributed Database Management Systems. Proceedings of the International Conference on Extending Database Technology-EDBT '88, pp. 155-169.
Breitbart, Y., and Silberschatz, A. (1988). Multidatabase Update Issues. Proceedings of the SIGMOD International Conference on Management of Data, pp. 135-142.
Breitbart, Y. J., Kemp, L. F., Thompson, G. R., and Silberschatz, A. (1984). Performance Evaluation of a Simulation Model for Data Retrieval in a Heterogeneous Database Environment. Proceedings of the Trends and Applications Conference, pp. 190-197.
Breitbart, Y., Olson, P. L., and Thompson, G. R. (1986). Database Integration in a Distributed Heterogeneous Database System. Proceedings of the 2nd International Conference on Data Engineering, pp. 301-310.
Breitbart, Y., Silberschatz, A., and Thompson, G. (1987). An Update Mechanism for Multidatabase Systems. Database Engineering 10, 136-142.


Breitbart, Y. J., and Tieman, L. R. (1984). ADDS-Heterogeneous Distributed Database System. Proceedings of the 3rd International Seminar on Distributed Data Sharing Systems, pp. 7-24.
Bright, M. W., and Hurson, A. R. (1990). Summary Schemas in Multidatabase Systems. Technical Report TR-90-076. Pennsylvania State University, University Park, Pennsylvania.
Brill, D., Templeton, M., and Yu, C. (1984). Distributed Query Processing Strategies in Mermaid, A Frontend to Data Management Systems. Proceedings of the 1st International Conference on Data Engineering, pp. 211-218.
Brzezinski, Z., Getta, J., Rybnik, J., and Stepniewski, W. (1984). UNIBASE-An Integrated Access to Databases. Proceedings of the 10th International Conference on Very Large Data Bases, pp. 388-396.
Cardenas, A. F. (1987). Heterogeneous Distributed Database Management: The HD-DBMS. Proceedings of the IEEE 75, 588-600.
Cardenas, A. F., and Pirahesh, M. H. (1980). Data Base Communication in a Heterogeneous Data Base Management System Network. Information Systems 5, 55-79.
Carino, Jr., F. (1987). Hetero: Heterogeneous DBMS Frontend. "Office Systems: Methods and Tools" (G. Bracchi and D. Tsichritzis eds.), pp. 159-172, North-Holland.
Casanova, M. A., and Vidal, V. M. P. (1983). Towards a Sound View Integration Methodology. Proceedings of the 2nd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pp. 36-47.
Ceri, S., and Pelagatti, G. (1984). "Distributed Databases: Principles and Systems." McGraw-Hill, New York.
Ceri, S., Pernici, B., and Wiederhold, G. (1987). Distributed Database Design Methodologies. Proceedings of the IEEE 75, 533-546.
Chen, P. P. (1976). The Entity-Relationship Model-Toward a Unified View of Data. ACM Transactions on Database Systems 1, 9-36.
Chung, C. W. (1989). Design and Implementation of a Heterogeneous Distributed Database Management System. Proceedings of the IEEE INFOCOM '89 8th Annual Joint Conference of the IEEE Computer and Communications Societies, pp. 356-362.
Chung, C. W. (1990). DATAPLEX: An Access to Heterogeneous Distributed Databases. Communications of the ACM 33, 70-80.
Codd, E. F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13, 377-387.
Codd, E. F. (1979). Extending the Database Relational Model to Capture More Meaning. ACM Transactions on Database Systems 4, 397-434.
Codd, E. F. (1986). An Evaluation Scheme for Database Management Systems That Are Claimed to be Relational. Proceedings of the 2nd International Conference on Data Engineering, pp. 719-729.
Convent, B. (1986). Unsolvable Problems Related to the View Integration Approach. Proceedings of the International Conference on Database Theory-ICDT '86, pp. 141-156.
Cornelius, R. (1988). Site Autonomy in a Distributed Database Environment. Compcon Spring '88: Intellectual Leverage, 33rd IEEE Computer Society International Conference, pp. 440-443.
Daisy Working Group (1988). Distributed Aspects of Information Systems. Proceedings of the European Teleinformatics Conference-EUTECO '88 on Research into Networks and Distributed Applications (M. Hatzopoulos ed.), pp. 1029-1049.
Date, C. J. (1983). "An Introduction to Database Systems," vol. 2. Addison-Wesley, Reading, Massachusetts.
Date, C. J. (1985). "An Introduction to Database Systems," vol. 1, 4th ed. Addison-Wesley, Reading, Massachusetts.
Date, C. J. (1986). "Relational Database." Addison-Wesley, Reading, Massachusetts.
Date, C. J. (1987). "A Guide to the SQL Standard." Addison-Wesley, Reading, Massachusetts.


D'Atri, A., and Sacca, D. (1984). Equivalence and Mapping of Database Schemes. Proceedings of the 10th International Conference on Very Large Data Bases, pp. 187-195.
D'Atri, A., and Tarantino, L. (1989). From Browsing to Querying. Data Engineering 12, 46-53.
Dayal, U. (1983). Processing Queries over Generalization Hierarchies in a Multidatabase System. Proceedings of the 9th International Conference on Very Large Data Bases, pp. 342-353.
Dayal, U., and Hwang, H. (1984). View Definition and Generalization for Database Integration in a Multidatabase System. IEEE Transactions on Software Engineering 10, 628-644.
Deen, S. M. (1981). A General Framework for the Architecture of Distributed Database Systems. Proceedings of the 2nd International Seminar on Distributed Data Sharing Systems, pp. 153-171.
Deen, S. M., Amin, R. R., and Taylor, M. C. (1984). Query Decomposition in PRECI*. Proceedings of the 3rd International Seminar on Distributed Data Sharing Systems, pp. 91-103.
Deen, S. M., Amin, R. R., Ofori-Dwumfuo, G. O., and Taylor, M. C. (1985). The Architecture of a Generalised Distributed Database System-PRECI*. Computer Journal 28, 157-162.
Deen, S. M., Amin, R. R., and Taylor, M. C. (1987a). Implementation of a Prototype for PRECI*. Computer Journal 30, 157-162.
Deen, S. M., Amin, R. R., and Taylor, M. C. (1987b). Data Integration in Distributed Databases. IEEE Transactions on Software Engineering 13, 860-864.
DeMichiel, L. G. (1989). Performing Operations over Mismatched Domains. Proceedings of the 5th International Conference on Data Engineering, pp. 36-45.
Demurjian, S. A., and Hsiao, D. K. (1987). The Multi-Lingual Database System. Proceedings of the 3rd International Conference on Data Engineering, pp. 44-51.
Dittrich, K. R. (1986). Object Oriented Database Systems. Proceedings of the 5th International Conference on Entity-Relationship Approach, pp. 51-66.
Eliassen, F., and Veijalainen, J. (1988). A Functional Approach to Information System Interoperability. Proceedings of the European Teleinformatics Conference-EUTECO '88 on Research into Networks and Distributed Applications, pp. 1121-1135.
Eliassen, F., Veijalainen, J., and Tirri, H. (1988). Aspects of Transaction Modelling for Interoperable Information Systems. Proceedings of the European Teleinformatics Conference-EUTECO '88 on Research into Networks and Distributed Applications, pp. 1051-1067.
Ellinghaus, D., Hallmann, M., Holtkamp, B., and Kreplin, K. (1988). A Multidatabase System for Transnational Accounting. Proceedings of the International Conference on Extending Database Technology-EDBT '88, pp. 600-605.
Elmagarmid, A. K., and Leu, Y. (1987). An Optimistic Concurrency Control Algorithm for Heterogeneous Distributed Database Systems. Database Engineering 10, 150-156.
Elmasri, R., and Navathe, S. (1984). Object Integration in Logical Database Design. Proceedings of the 1st International Conference on Data Engineering, pp. 426-433.
Esculier, C. (1984). The SIRIUS-DELTA Architecture: A Framework for Co-operating Database Systems. Computer Networks 8, 43-48.
Fagin, R., Ullman, J. D., and Vardi, M. Y. (1983). On the Semantics of Updates in Databases. Proceedings of the 2nd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pp. 352-365.
Fankhauser, P., Litwin, W., Neuhold, E. J., and Schrefl, M. (1988). Global View Definition and Multidatabase Languages-Two Approaches to Database Integration. Proceedings of the European Teleinformatics Conference-EUTECO '88 on Research into Networks and Distributed Applications, pp. 1069-1082.
Ferrier, A., and Stangret, C. (1982). Heterogeneity in the Distributed Database Management System SIRIUS-DELTA. Proceedings of the 8th International Conference on Very Large Data Bases, pp. 45-53.
Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. (1987). The Vocabulary Problem in Human-System Communication. Communications of the ACM 30, 964-971.


Gamal-Eldin, M. S., Thomas, G., and Elmasri, R. (1988). Integrating Relational Databases with Support for Updates. Proceedings of the International Symposium on Databases in Parallel and Distributed Systems, pp. 202-209.
Garcia-Molina, H., and Kogan, B. (1988). Node Autonomy in Distributed Systems. Proceedings of the International Symposium on Databases in Parallel and Distributed Systems, pp. 158-166.
Gligor, V. D., and Luckenbaugh, G. L. (1984). Interconnecting Heterogeneous Database Management Systems. IEEE Computer 17, 33-43.
Gligor, V. D., and Popescu-Zeletin, R. (1984). Concurrency Control Issues in Distributed Heterogeneous Database Management Systems. Proceedings of the 3rd International Seminar on Distributed Data Sharing Systems, pp. 43-56.
Haas, L. M., Freytag, J. C., Lohman, G. M., and Pirahesh, H. (1988). Extensible Query Processing in Starburst. IBM Research Report RJ 6610. IBM, Yorktown Heights, New York.
Heimbigner, D., and McLeod, D. (1985). A Federated Architecture for Information Management. ACM Transactions on Office Information Systems 3, 253-278.
Holtkamp, B. (1988). Preserving Autonomy in a Heterogeneous Multidatabase System. Proceedings of the 12th International Computer Software and Applications Conference, COMPSAC 88, pp. 259-266.
Hsiao, D. K., and Kamel, M. N. (1989). Heterogeneous Databases: Proliferations, Issues, and Solutions. IEEE Transactions on Knowledge and Data Engineering 1, 45-62.
Hurson, A. R., Miller, L. L., Pakzad, S. H., Eich, M. H., and Shirazi, B. (1989). Parallel Architectures for Database Systems. "Advances in Computers," vol. 28 (M. C. Yovits ed.), pp. 108-151, Academic Press.
Hurson, A. R., Miller, L. L., Pakzad, S. H., and Cheng, J. B. (1990). Specialized Parallel Architectures for Textual Databases. "Advances in Computers," vol. 30 (M. C. Yovits ed.), pp. 1-37, Academic Press.
Kimbleton, S. R., Wang, P., and Lampson, B. W. (1981). Applications and Protocols. "Distributed Systems-Architecture and Implementation" (B. W. Lampson, M. Paul, and H. J. Siegert eds.), pp. 308-370, Springer-Verlag.
Kuhn, E., and Ludwig, T. (1988a). Extending Prolog by Multidatabase Features. Proceedings of the European Teleinformatics Conference-EUTECO '88 on Research into Networks and Distributed Applications, pp. 1107-1119.
Kuhn, E., and Ludwig, T. (1988b). VIP-MDBS: A Logic Multidatabase System. Proceedings of the International Symposium on Databases in Parallel and Distributed Systems, pp. 190-201.
Landers, T., and Rosenberg, R. L. (1982). An Overview of Multibase. Proceedings of the 2nd International Symposium on Distributed Databases, pp. 153-183.
Lindsay, B. G. (1987). A Retrospective of R*: A Distributed Database Management System. Proceedings of the IEEE 75, 668-673.
Litwin, W. (1984a). Concepts for Multidatabase Manipulation Languages. Proceedings of the 4th Jerusalem Conference on Information Technology, pp. 309-317.
Litwin, W. (1984b). MALPHA: A Relational Multidatabase Manipulation Language. Proceedings of the 1st International Conference on Data Engineering, pp. 86-94.
Litwin, W. (1985a). Implicit Joins in the Multidatabase System MRDSM. Proceedings of the 9th International Computer Software and Applications Conference, COMPSAC 85, pp. 495-504.
Litwin, W. (1985b). An Overview of the Multidatabase System MRDSM. Proceedings of the ACM Annual Conference, pp. 524-533.
Litwin, W. (1988). From Database Systems to Multidatabase Systems: Why and How. Proceedings of the 6th British National Conference on Databases, pp. 161-188.
Litwin, W., and Abdellatif, A. (1986). Multidatabase Interoperability. IEEE Computer 19, 10-18.
Litwin, W., and Abdellatif, A. (1987). An Overview of the Multidatabase Manipulation Language MDSL. Proceedings of the IEEE 75, 621-632.


Litwin, W., and Tirri, H. (1988). Flexible Concurrency Control Using Value Dates. INRIA Research Report no. 845. INRIA, Le Chesnay, France.
Litwin, W., and Vigier, P. (1986). Dynamic Attributes in the Multidatabase System MRDSM. Proceedings of the 2nd International Conference on Data Engineering, pp. 103-110.
Litwin, W., and Zeroual, A. (1988). Advances in Multidatabase Systems. Proceedings of the European Teleinformatics Conference-EUTECO '88 on Research into Networks and Distributed Applications, pp. 1137-1151.
Litwin, W., Abdellatif, A., Nicolas, B., Vigier, P., and Zeroual, A. (1987). MSQL: A Multidatabase Language. INRIA Research Report no. 695. INRIA, Le Chesnay, France.
Mackert, L. F., and Lohman, G. M. (1986). R* Optimizer Validation and Performance Evaluation for Local Queries. Proceedings of ACM-SIGMOD International Conference on Management of Data, pp. 84-95.
Maier, D., Ullman, J. D., and Vardi, M. Y. (1984). On the Foundations of the Universal Relation Model. ACM Transactions on Database Systems 9, 283-308.
Mannino, M. V., and Effelsberg, W. (1984). Matching Techniques in Global Schema Design. Proceedings of the 1st International Conference on Data Engineering, pp. 418-425.
Mark, L., and Roussopoulos, N. (1987). Information Interchange between Self-Describing Databases. Database Engineering 10, 170-176.
Mohan, C., Lindsay, B., and Obermarck, R. (1986). Transaction Management in the R* Distributed Database Management System. ACM Transactions on Database Systems 11, 378-396.

Motro, A. (1987). Superviews: Virtual Integration of Multiple Databases. IEEE Transactions on Software Engineering 13, 785-798.
Motro, A. (1989). A Trio of Database User Interfaces for Handling Vague Retrieval Requests. Data Engineering 12, 54-63.
Navathe, S. B., and Gadgil, S. G. (1982). A Methodology for View Integration in Logical Database Design. Proceedings of the 8th International Conference on Very Large Data Bases, pp. 142-162.
Norrie, M., and Asker, L. (1989). Learning Techniques for Query Optimization in Federated Database Systems. Proceedings of the International Workshop on Industrial Applications of Machine Intelligence and Vision, pp. 62-66.
Omololu, A. O., Fiddian, N. J., and Gray, W. A. (1989). Confederated Database Management Systems. Proceedings of the 7th British National Conference on Databases, pp. 51-70.
Oxborrow, E. A. (1987). Distributing a Database across a Network of Different Database Systems. Colloquium on Distributed Database Systems, pp. 51-57.
Pu, C. (1987). Superdatabases: Transactions across Database Boundaries. Database Engineering 10, 143-149.

Rusinkiewicz, M., and Czejdo, B. (1985). Query Transformation in Heterogeneous Distributed Database Systems. Proceedings of the 5th International Conference on Distributed Computing Systems, pp. 300-307.
Rusinkiewicz, M., and Czejdo, B. (1987). An Approach to Query Processing in Federated Database Systems. Proceedings of the Twentieth Annual Hawaii International Conference on System Sciences, pp. 430-440.
Shipman, D. (1981). The Functional Data Model and the Data Language DAPLEX. ACM Transactions on Database Systems 6, 140-173.
Smith, J. M., and Smith, D. C. P. (1977). Database Abstractions: Aggregation and Generalization. ACM Transactions on Database Systems 2, 105-133.
Smith, J. M., Bernstein, P. A., Dayal, U., Goodman, N., Landers, T. A., Lin, W. T. K., and Wong, E. (1981). Multibase-Integrating Heterogeneous Distributed Database Systems. AFIPS Conference Proceedings, National Computer Conference, vol. 50, pp. 487-499.
Spaccapietra, S., Demo, B., DiLeva, A., Parent, C., Perez De Celis, C., and Belfar, K. (1981). An Approach to Effective Heterogeneous Databases Cooperation. Proceedings of the 2nd International Seminar on Distributed Data Sharing Systems, pp. 209-218.
Spaccapietra, S., Demo, D., DiLeva, A., and Parent, C. (1982). SCOOP-A System for Cooperation between Existing Heterogeneous Distributed Data Bases and Programs. Database Engineering 5, 288-293.
Staniszkis, W. (1986). Integrating Heterogeneous Databases. CRAI State of the Art Report P229-47. Pergamon Infotech, Maidenhead, Berkshire, England.
Staniszkis, W., Kowalewski, M., Turco, G., Krajewski, K., and Saccone, M. (1983). Network Data Management System General Architecture and Implementation Principles. Proceedings of the 3rd International Conference on Engineering Software, pp. 832-846.
Staniszkis, W., Kaminski, W., Kowalewski, M., Krajewski, K., Mezyk, S., and Turco, G. (1984). Architecture of the Network Data Management System. Proceedings of the 3rd International Seminar on Distributed Data Sharing Systems, pp. 57-75.
Stocker, P. M., Atkinson, M. P., Gray, P. M. D., Gray, W. A., Oxborrow, E. A., Shave, M. R., and Johnson, R. G. (1984). PROTEUS: A Heterogeneous Distributed Database Project. "Databases-Role and Structure" (P. M. Stocker, P. M. D. Gray, and M. P. Atkinson eds.), pp. 125-150, Cambridge University Press.
Stonebraker, M. (1989). Future Trends in Database Systems. IEEE Transactions on Knowledge and Data Engineering 1, 33-44.
Stonebraker, M., and Rowe, L. A. (1986). The Design of POSTGRES. Proceedings of ACM-SIGMOD International Conference on Management of Data, pp. 340-355.
Takizawa, M. (1983). Heterogeneous Distributed Database System: JDDBS. Database Engineering 6, 58-62.
Templeton, M., Brill, D., Chen, A., Dao, S., and Lund, E. (1986). Mermaid-Experiences with Network Operation. Proceedings of the 2nd International Conference on Data Engineering, pp. 292-300.
Templeton, M., Brill, D., Dao, S. K., Lund, E., Ward, P., Chen, A. L. P., and MacGregor, R. (1987a). Mermaid-A Front-End to Distributed Heterogeneous Databases. Proceedings of the IEEE 75, 695-708.
Templeton, M., Lund, E., and Ward, P. (1987b). Pragmatics of Access Control in Mermaid. Database Engineering 10, 157-162.
Veijalainen, J., and Popescu-Zeletin, R. (1988). Multidatabase Systems in ISO/OSI Environment. "Standards in Information Technology and Industrial Control" (N. E. Malagardis and T. J. Williams eds.), pp. 83-97, North-Holland.
Vidyasankar, K. (1987). Non-Two-Phase Locking Protocols for Global Concurrency Control in Distributed Heterogeneous Database Systems. Proceedings of CIPS '87: Intelligence Integration, pp. 161-166.
Vigier, P., and Litwin, W. (1987). New Functions for Dynamic Attributes in the Multidatabase System MRDSM. INRIA Research Report no. 724. INRIA, Le Chesnay, France.
Wang, C., and Spooner, D. L. (1987). Access Control in a Heterogeneous Distributed Database Management System. Proceedings of the Sixth Symposium on Reliability in Distributed Software and Database Systems, pp. 84-92.
Wolski, A. (1989). LINDA: A System for Loosely Integrated Databases. Proceedings of the 5th International Conference on Data Engineering, pp. 66-75.
Wong, K. K., and Bazex, P. (1984). MRDSM: A Relational Multidatabases Management System. Proceedings of the 3rd International Seminar on Distributed Data Sharing Systems, pp. 77-85.