Four reference architectures for distributed database management systems


James A. LARSON

Corporate Systems Development Division, Honeywell Inc., 1000 Boone Avenue North, Golden Valley, MN 55427, U.S.A.

Five requirements for a centralized database management system and five additional requirements for a distributed database management system suggest definitions of software processors and schemas. These software processors and schemas can be organized into four reference architectures for distributed database management systems which permit comparisons of their advantages and disadvantages.

Keywords: Reference Architecture, Database Management System, Distributed Database Management System, Federated database system, Heterogeneous database system, Replication independence, Location independence, Data independence, Database integration.


James A. Larson is a senior research fellow at the Honeywell Corporate Systems Development Division in Golden Valley, Minnesota, and an adjunct professor in the Computer Science Department of the University of Minnesota. In addition to writing numerous technical papers and articles, Larson is the co-editor of IEEE Computer Society Tutorials Database Management Systems and Distributed Database Management Systems.

North-Holland Computer Standards & Interfaces 8 (1988/89) 209-221

Introduction

Database Management Systems (DBMSs) allow an enterprise to centralize its data on a single computer system. With the widespread use of mini and microcomputers throughout an enterprise, groups within the enterprise have begun collecting and storing data specific to their own needs. This has resulted in a trend toward dispersing data to the groups. Yet the groups often need to access data of other groups as well as to access the data on the enterprise's large computer system. The need to share data that exists on several computers forms the primary motivation for distributed DBMSs.

A distributed DBMS maintains data on several individual computers, permitting the systematic sharing of data. A distributed DBMS is a collection of individual, centralized DBMSs, each existing on a separate computer, a communications facility that allows centralized DBMSs to communicate with each other, and some additional facilities that enforce a strategy for sharing the data.

The purpose of this paper is to define requirements for a distributed DBMS and present reference architectures that meet those requirements. These reference architectures partition the task of distributed database management into components that permit systematic analysis. Many of the components of a distributed DBMS reference architecture are also found in a reference architecture for centralized DBMSs. The reference architectures for distributed DBMSs differ in the amount of control that the local database administrator has over local data. Enterprises considering implementing a distributed DBMS should first determine the degree of control that local database administrators have over local data, and then choose the appropriate distributed DBMS reference architecture.

This paper is organized as follows. Section 1 presents requirements for centralized DBMSs. Section 2 provides additional requirements for distributed DBMSs. Section 3 discusses reference architectures. It describes a reference architecture which meets the requirements for centralized DBMSs. Components from this architecture, together with additional components, are used to form reference architectures which meet the requirements for distributed DBMSs. The advantages of these four different reference architectures are summarized. Section 4 presents our conclusions.

0920-5489/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)

J.A. Larson / Four Reference Architectures for DBMS

1. Reasons for Database Management

DBMSs are popular because they provide useful features for maintaining and accessing data. Some of these features are supported by both centralized and distributed DBMSs; others are supported primarily by distributed DBMSs. Ideally, both centralized and distributed systems should support five significant features: multiple user interfaces, semantic integrity constraint enforcement, program-data independence, concurrency transparency, and transaction atomicity.

(1) Multiple User Interface. DBMS users should be able to select from among various styles of describing and manipulating data, depending on familiarity and application. Just as programmers may choose from various programming languages, database users should be able to choose among various user interfaces such as menus, form fill-in, keyword command languages (such as SQL and relational algebra), and procedural languages (such as the CODASYL data manipulation language).

(2) Semantic Integrity Constraint Enforcement. Semantic integrity constraints define criteria for valid data values that may exist in a database. These constraints are desirable because they decrease the possibility that users may enter invalid data into the database. DBMSs should contain facilities for describing and enforcing semantic integrity constraints. Software in the DBMS monitors transactions to detect violations of these constraints. If a violation occurs, appropriate action is taken. Such actions may include rejecting the transaction, reporting the violation, or possibly correcting the violation. This facility can also be used to enforce security constraints, that is, restrictions on what operations users may perform on data.

(3) Program-data Independence. Some changes to a database require that application programs that access the database be modified by a programmer, which is frequently an expensive task. An application program has program-data independence (sometimes simply called data independence) if it does not require modification when the database is restructured or reorganized. Program-data independence is provided by hiding the physical placement and organization of data in the database from the application program.

(4) Concurrency Transparency. Concurrency transparency enables users to behave as if each one has sole access to the database even though the DBMS can support several users simultaneously. Occasionally, two or more users may attempt to access the same portion of the database at the same time. Concurrent access presents a special problem in that the values of data objects may change unexpectedly, or worse, an update may be lost. Database management systems should be able to resolve the problem of concurrent updates, allowing several users to access the same data without losing any database updates.

(5) Transaction Atomicity. Transaction atomicity means that each transaction is either completely executed with all modifications permanently recorded in the database, or the transaction is aborted with no permanent change to the database. When transaction atomicity is supported, the user does not have to restore the database if a transaction aborts.
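As a rough illustration of features (2) and (5), the sketch below uses Python's standard `sqlite3` module, which is not mentioned in the paper and serves only as a convenient present-day stand-in for a centralized DBMS. The table `emp`, its columns, and the helper names are assumptions made for this example. A CHECK clause plays the role of a semantic integrity constraint, and a transaction that violates it is rejected as a whole.

```python
import sqlite3


def build_database() -> sqlite3.Connection:
    """Create an in-memory database with one semantic integrity constraint."""
    conn = sqlite3.connect(":memory:")
    # The CHECK clause is the constraint: an age may never be negative.
    conn.execute(
        "CREATE TABLE emp (name TEXT PRIMARY KEY, age INTEGER CHECK (age >= 0))"
    )
    return conn


def run_transaction(conn: sqlite3.Connection, rows: list[tuple[str, int]]) -> bool:
    """Insert all rows or none of them; return True if the transaction committed."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if the block raises.
        with conn:
            for name, age in rows:
                conn.execute("INSERT INTO emp (name, age) VALUES (?, ?)", (name, age))
        return True
    except sqlite3.IntegrityError:
        # The constraint was violated; the rollback left no partial update.
        return False


def employee_names(conn: sqlite3.Connection) -> list[str]:
    """Return the stored employee names in alphabetical order."""
    return [row[0] for row in conn.execute("SELECT name FROM emp ORDER BY name")]
```

If `run_transaction` is given one valid row followed by a row with a negative age, the valid row is not kept either, so the user does not have to restore the database by hand.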

2. Requirements for Distributed Database Management

Because of their increased complexity and high cost of communications, distributed DBMSs are not used as widely as centralized DBMSs. However, the high cost of communication is decreasing and the complexity of distributed DBMSs is becoming better understood. Distributed DBMSs promise to solve many of the problems that face users of centralized DBMSs. Reliability, faster response time at a lower cost, location independence, and configuration independence are frequently suggested benefits of distributed database management systems. Although many of these benefits may also be those of centralized DBMSs, distributed DBMSs provide these benefits in ways different from those of centralized systems. Depending on the architecture for the distributed DBMS, some of these potential benefits can only be realized by sacrificing other potential benefits.

(1) Reliability. Reliability can be increased by duplicating components. If one component fails, another component can assume its tasks. Replicated systems and multiprocessors can provide improved reliability for centralized systems. However, disasters such as fire or floods can render these centralized systems inaccessible. By replicating the data at different sites, a disaster at one site does not render the database inaccessible; users may access copies of the data at the remaining sites.

(2) Faster response time and lower cost. Accessing a centralized database from a remote site may involve time-consuming and expensive communications. Much of this expense can be eliminated by copying or moving data from the centralized site to the sites where it is frequently referenced. By partitioning data into segments of interest to separate classes of users and by storing segments at sites convenient to the targeted user classes, each segment can be stored at the site in a manner most satisfactory to that user class. This can lead to faster response times and lower costs for database retrievals at the remote sites than if remote accesses of a centralized DBMS were used.

(3) Location and replication independence. Location independence enables a user to access data without knowing at what site the data resides. The geographical location of the data is hidden from the user. Some distributed DBMSs provide this kind of independence. When more than one copy of the data exists, one copy must be chosen when retrieving data, and all copies must be updated. Choosing one copy of the data for retrieval and always ensuring that all copies of the data are updated can be a burden on users. A DBMS that hides the existence of multiple data copies from the user is said to have replication independence. Although the data may be replicated at several sites in the network, the user may behave as if there is only one single copy at one single site.

(4) Configuration independence. When a centralized computer system becomes saturated, it must either be replaced by a more powerful computer or some of the applications must be moved to another computer. In the former case, upgrading can be a frustrating experience, especially if it involves new hardware from a different vendor. In the latter case, not only is conversion difficult, but applications on either computer cannot access data stored on the other machine. In a distributed system, these problems can be minimized by integrating a new computer into the distributed system. Because of the underlying communications system, the new computer can access data stored on the old computer. Configuration independence enables the enterprise to add or replace hardware without changing the existing software components of the distributed DBMS. Configuration independence results in a system that is expandable when its current hardware is saturated.

(5) Heterogeneous DBMSs. It is often desirable to integrate databases maintained by different DBMSs on different computers. Frequently the DBMSs are supplied by different vendors and may support different data models. One approach to integrating these databases is to provide a single user interface that can access the data maintained by the heterogeneous DBMSs. The different user interfaces supported by the heterogeneous DBMSs are hidden from the user by this single, systemwide interface [1,2].

The above requirements lead to several reference architectures for distributed DBMSs. Before describing four distributed DBMS architectures, we describe why reference architectures are desirable and what constitutes a useful reference architecture for distributed DBMSs.
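The rule stated under requirement (3), that one copy is chosen for retrieval and all copies are updated, can be sketched in a few lines of Python. This is a minimal illustration under assumptions of our own: the class `ReplicatedStore`, its `catalog` argument, and the site names are hypothetical, each site is modeled as an in-memory dictionary, and site failures are ignored.

```python
import random


class ReplicatedStore:
    """Hypothetical facade that hides data location and replication from callers."""

    def __init__(self, catalog: dict[str, list[str]]):
        # catalog maps each data item to the sites that hold a copy of it.
        self.catalog = catalog
        # One dictionary per site stands in for that site's local database.
        all_sites = {site for sites in catalog.values() for site in sites}
        self.sites: dict[str, dict[str, object]] = {s: {} for s in all_sites}

    def write(self, item: str, value: object) -> None:
        """Update every copy of the item so that the copies stay consistent."""
        for site in self.catalog[item]:
            self.sites[site][item] = value

    def read(self, item: str) -> object:
        """Retrieve from a single copy; any site holding the item will do."""
        site = random.choice(self.catalog[item])
        return self.sites[site][item]
```

A caller writes `store.write("emp", value)` and `store.read("emp")` without naming a site or knowing how many copies exist, which is the behavior that location and replication independence ask for.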

3. Reference Architectures

A reference architecture clarifies the various issues and choices to be made by implementors of a distributed DBMS. Each component of the reference architecture deals with only one or two of the important issues of distributed database management and allows people to ignore details irrelevant to those issues. People may concentrate on a small number of issues at a time by analyzing a single component. Such a reference architecture is helpful in understanding distributed DBMSs, categorizing and comparing distributed DBMSs, converting data and programs from one distributed DBMS to another, and supporting a flexible distributed DBMS.

The components of a reference architecture are processors and schemas. A processor is an executable software module that is responsible for performing one or more tasks. A schema is a description of some aspect of the database that is used to direct one or more processors in performing their tasks. A processor accepts input, manipulates that input based on information from zero or more schemas, and produces output which is passed on to another processor. This structuring technique permits us to view a distributed DBMS as a collection of processors, each interfacing with adjacent processors.

We can use the following criteria when partitioning a large collection of tasks into sets of tasks to be performed by different processors.
• Separability: If one task can be separated from another, then the two tasks should be performed by different processors.
• Understandability: The collection of tasks assigned to a processor should be small enough so that the manner in which the processor performs those tasks can be easily understood by the human mind.
• Independence: The manner in which a task is performed by one processor does not affect the manner in which other tasks are performed by other processors.

An interface defines how one processor may invoke and communicate with another processor and the set of predefined functions that the invoked processor can perform on behalf of the invoking processor. An interface thus consists of rules that describe the format and content of communications between processors that (1) instruct the invoked processor what to perform and (2) inform the invoking processor of the results. Examples of interfaces include query languages, rules used to describe requests to a DBMS, the specification of the internal representation of a compiled command that is produced by a compiler and executed by a file system, and the specification of a collection of bit strings that instructs the device controller how to access data on external storage.

The reader should note that when implemented, a system may merge two or more reference architecture processors into a single implemented processor, or several implemented processors may correspond to a single processor of the reference architecture. Conceptually, however, the functions supported by each processor of the reference architecture should be supported by the implemented system.

First, we describe a reference architecture for centralized DBMSs. This model will then be extended to reference architectures for distributed DBMSs.

Fig. 1. Overview of centralized DBMSs.

3.1. Reference Architecture for a Centralized DBMS

A centralized DBMS can be partitioned into two main processors, which we call a client processor and a data processor (Fig. 1). The purpose of a client processor is to translate user commands that are expressed in one of several different user interface languages into a format more suitable for machine processing. We will use the term canonical command to refer to this internal form. The client processor also translates data retrieved from the data processor into a format understandable by the user. The canonical commands produced by the client processor can either be stored and executed by the data processor at a later time, or the data processor can be invoked at the time the user request is submitted and executed.

The client processor can be divided into several subprocessors, as illustrated in Fig. 2. These subprocessors support two of the requirements of both centralized and distributed DBMSs: multiple user interfaces and semantic integrity constraints.

Fig. 2. Components of client processor.

Multiple User Interfaces. User command translators and user result formatters support multiple user interfaces. A user command translator exists for each user interface supported by the DBMS. The user command translator accepts user commands which are expressed using the data manipulation language of the user interface and converts these commands into the same canonical format that is produced by all other user command translators. The user result formatter accepts data in a canonical format and converts the data to the format required by the data description language of the user interface. Some DBMSs do not have user result formatters. In these systems, the canonical commands generated by the user command translator contain the instructions to translate the canonical data to the format needed by the user.

Semantic integrity constraints. These describe the valid values of data in a database. One way of enforcing semantic integrity constraints is for each update program to verify that each piece of data entered into the database satisfies the constraints. Another approach is to rely upon a constraint enforcer to guarantee that the semantic integrity constraints that have been specified using the data description language are enforced. The constraint enforcer shown in Fig. 2 performs two functions: (1) it enforces semantic integrity constraints that can be enforced without accessing the database (for example, modifying a person's age to be a negative integer can be detected by the semantic integrity constraint enforcer without accessing the database), and (2) for semantic integrity constraints that can be enforced only by accessing the database, the semantic integrity constraint enforcer modifies canonical commands in such a way that the semantic integrity constraints are automatically enforced during the execution of the commands. The constraint enforcer may also modify the canonical commands in such a way that security constraints are also enforced. Many centralized DBMSs do not have a constraint enforcer in the client processor. In these DBMSs, constraints are enforced by the data processor.

The data processor is responsible for storing and accessing data in a database. The data processor can be divided into the subprocessors illustrated in Fig. 3. These processors support three more requirements of both centralized and distributed DBMSs: program-data independence, concurrency transparency and transaction atomicity.

Fig. 3. Components of data processor.

Program-data independence. The canonical command translator and canonical result formatter support program-data independence. The canonical command translator converts the canonical commands into physical commands to be executed by the run-time support processor. The canonical command translator is responsible for choosing an optimal or nearly optimal access path to the physical data. If the physical data structures or access paths are changed, the canonical command translator must be invoked to choose a new optimal or nearly optimal access path. Thus, physical data structures and access paths can be fine-tuned to make the overall DBMS perform as efficiently as possible without having to modify existing application programs or queries. The canonical result formatter converts the physical data into the canonical format required by each of the user result formatters. Some DBMSs do not have a canonical result formatter. In these systems, the physical commands produced by the canonical command translator contain instructions which convert data directly from the physical format to the canonical formats required by the client processor.

Concurrency transparency. The run-time support processor accepts physical commands from the canonical command translator, performs the necessary accesses to the database, and returns the physical data to the canonical result formatter. Many DBMSs can support several users simultaneously. The run-time support processor is responsible for accepting several requests, scheduling the processing of those requests, and returning the results. The run-time support processor must resolve the problem of concurrent updates. It provides concurrency control which enables each user to believe that he/she has sole access to the database and is isolated from the actions of other concurrent users.

Transaction atomicity. The run-time support processor is also responsible for guaranteeing that the changes to the database made by a transaction are made permanently to the database. If for some reason a transaction aborts, the run-time support processor is responsible for guaranteeing that any database changes are either removed from, or are not written into, the database.

Schema levels. All of the processors in the reference architectures need access to descriptions of data. A description of data is called a schema. The reference architectures for distributed DBMSs contain several different types of schemas, each describing a different aspect of data.
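To make the relationship between a processor and the schema that directs it more concrete, the following Python sketch models the constraint enforcer described earlier in this section, covering its two functions. Everything in it is an assumption made for illustration: the `CanonicalUpdate` structure, the `CONSTRAINTS` table standing in for part of a schema, and the textual run-time condition are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalUpdate:
    """Hypothetical canonical command: set one attribute of a record type."""
    table: str
    attribute: str
    new_value: object
    # Conditions the data processor must verify against stored data at run time.
    runtime_checks: list[str] = field(default_factory=list)


class ConstraintViolation(Exception):
    """Raised when a command violates a statically checkable constraint."""


# Hypothetical constraint descriptions, keyed by (table, attribute).
CONSTRAINTS = {
    ("emp", "age"): {"static": lambda v: isinstance(v, int) and v >= 0},
    ("emp", "dept"): {"runtime": "new_value IN (SELECT name FROM dept)"},
}


def enforce(command: CanonicalUpdate, constraints=CONSTRAINTS) -> CanonicalUpdate:
    """Apply both constraint-enforcer functions to one canonical command."""
    rule = constraints.get((command.table, command.attribute), {})

    # Function (1): reject violations detectable without accessing the database.
    static = rule.get("static")
    if static is not None and not static(command.new_value):
        raise ConstraintViolation(
            f"{command.table}.{command.attribute} = {command.new_value!r}"
        )

    # Function (2): modify the command so that the remaining constraint is
    # checked by the data processor while the command executes.
    runtime = rule.get("runtime")
    if runtime is not None:
        return CanonicalUpdate(
            command.table,
            command.attribute,
            command.new_value,
            command.runtime_checks + [runtime],
        )
    return command
```

A negative age is rejected before any data is touched, whereas a department name, which can only be validated against stored data, travels with the command as an attached condition.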
We adopt the three schema types suggested by ANSI/X3/SPARC, a national committee of computer users and vendors which recommends areas for potential standardization. The ANSI/X3/SPARC framework [3] contains three levels of data descriptions:
• One or more external schemas (Fig. 2), each containing a description of a part of the database, for use by a user or class of users who wish to access the database. An external schema will contain descriptions of data in the format required for the user interface.
• The conceptual schema (Fig. 2 and Fig. 3) contains a logical description of the entire database for use by the database administrators in designing and modifying a global, conceptual view of the database. The conceptual schema also contains semantic integrity constraints to be enforced by the DBMS.
• The internal schema (Fig. 3) contains a description of the placement and format of physical data structures used to represent each data type described in the conceptual schema. This schema is used by the database administrator to fine-tune the database in order to make the DBMS perform efficiently.

These three schemas, and the mappings between them, guide the various processors in performing their respective tasks. The external schema, the conceptual schema, and the mapping between these two schemas guide the user command translator in converting user commands into canonical commands. These same schemas guide the user result formatter in translating canonical data into the format required by the user interface. The semantic integrity constraints within the conceptual schema guide the semantic integrity constraint enforcer in performing its tasks. The conceptual schema, the internal schema, and the mappings between these two schemas guide the canonical command translator in choosing the appropriate physical commands. These schemas also guide the canonical result formatter in translating the physical data into the canonical format.

The external, conceptual, and internal schemas are often stored in a special database called a data dictionary. The data dictionary may contain other information in addition to the data description language [4,5,6]. This additional information may include cardinality information (how much data), frequency information (how often it is accessed), access patterns (who accesses the data), and the other information needed to control and to administer the DBMS.

The reference architecture for centralized DBMSs described above may be extended to several reference architectures for distributed DBMSs. First, we examine in detail one of the reference architectures for distributed DBMSs. Then we compare the several reference architectures.
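The way the three schema levels guide the two translators can be sketched as follows. This is a simplified illustration, not the framework's actual notation: the schema dictionaries, the view name `Staff`, the file name `emp.dat`, and the dictionary-shaped commands are all hypothetical.

```python
# Hypothetical schemas for a single "emp" record type.
EXTERNAL_SCHEMA = {"Staff": {"record": "emp", "fields": {"Name": "name", "Years": "age"}}}
CONCEPTUAL_SCHEMA = {"emp": ["name", "age", "dept"]}
INTERNAL_SCHEMA = {"emp": {"file": "emp.dat", "indexes": ["name"]}}


def user_command_translator(view, field_name, value,
                            external=EXTERNAL_SCHEMA, conceptual=CONCEPTUAL_SCHEMA):
    """Turn a user-level lookup into a canonical command (external to conceptual)."""
    mapping = external[view]
    record = mapping["record"]
    attribute = mapping["fields"][field_name]
    if attribute not in conceptual[record]:
        raise KeyError(f"{record}.{attribute} is not in the conceptual schema")
    return {"op": "select", "record": record, "where": (attribute, value)}


def canonical_command_translator(command, internal=INTERNAL_SCHEMA):
    """Turn a canonical command into a physical command (conceptual to internal)."""
    storage = internal[command["record"]]
    attribute, value = command["where"]
    # Access-path choice: use an index when one exists, otherwise scan the file.
    path = "index_lookup" if attribute in storage["indexes"] else "sequential_scan"
    return {"file": storage["file"], "access_path": path, "key": attribute, "value": value}
```

If the database administrator later adds an index on `age` to the internal schema, rerunning `canonical_command_translator` on the same canonical command yields a different access path, while the user command and the canonical command stay unchanged. That is the program-data independence the schema levels are meant to provide.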

Fig. 4. Components of distributed DBMS.

3.2 A Reference Architecture for a Distributed DBMS Data in a distributed DBNS may be distributed at several hardware processors, called sites. Sites are connected by a communications system. To support distributed DBMSS, the following extensions to the reference architecture of centralized OBMSS are necessary: 1. A client processor exists at each site where users may submit requests to the distributed OBMS. 2. A data processor exists at each site where data resides, 3. A global database control and communication system is needed to support communication and to control distributed execution. Portions of the global database control and communication system exist at each site. The resulting reference architecture, illustrated in Fig. 4, supports three of the remaining requiremerits of distributed DBMSS. (1) Reliability. Data can be replicated at multipie data processors to increase data availability and reliability. (2) Faster response time at lower cost. A large amount of time and cost may be needed for a client processor and a data processor to communicate if they are not located at the same site.

215

If the data processor and the client processor are at the same site, then communication time and cost are nonexistent. Data may be moved or copied from data processor to data processor so that communication time and cost are decreased. If several different client processors frequently access the same data, it may be economical to replicate the data across the data processors that reside at the same sites where each of the client processors reside. However, updating replicated copies at separate data processors can result in a large overhead so that any savings realized during retrievals can be lost if frequent updating is required. (3) Configuration independence. The distributed DB~S should be able to grow and evolve with the enterprise. New data processors can be added to store larger amounts of data. New client processors can be added to support more users in accessing the distributed DBNS. New types of hardware can replace existing hardware. Specialized database computers can replace general purpose computers serving as data processors. Specialized workstations can replace general purpose computers serving as client processors. The architecture of Fig. 4 is very flexible in that it can support the addition of new data processors and client processors, the removal of existing data processors and client processors, and the upgrading of existing data and client processors, all without interfering with the users' activities. The global database control and communications system in the architecture of Fig. 4 consists of several processors (Fig. 5): • The decomposer translates a request from a client processor into a distributed execution strategy consisting of several commands to be executed at one or more data processors. • The merger combines results from several data processors. • The distributed execution monitor is responsible for the correct execution of the execution strategy as well as for transaction atomicity in the distributed environment. 
• The communication subsystem is responsible for transporting requests and data between sites. • The local execution monitor is responsible f o r the local execution of a piece of the execution strategy at a data processor. It notifies the global execution monitor whenever a piece of the execution strategy executing at a data

216

J.A. Larson / Four Reference Architectures for DBMS

Distributed Execution Monitor

Global Database Control and Communication

Communication Subsystem Local Execution Monitor

System

Fig. 5. Components of global databases control and communication system. processor either completes or fails. (Sometimes the run-time support processor is extended to support the functions of the local execution monitor), The decomposer, merger and distributed execution monitor may physically reside with a client processor, while the local execution monitor may reside with a data processor. Both client processors and data processors also have subprocessors to build and to interpret messages to be sent between client processors and data processors, The decomposer and merger support another of the requirements of a distributed DBMS. (4) Location and replication independence. The data needed by a single command may reside at several different sites. A command, therefore, may need to be "decomposed" or split apart into several subcommands, each to be processed at a different site. The collection of subcommands, the order of their execution, and the data processors on which each subcommand is executed are called the distributedexecution strategy. If multiple copies of the data exist, the copies to be accessed must be chosen. For retrieval, a single copy is chosen; for updating, all copies are chosen. The processor that

creates a distributed execution strategy is called the decomposer. The data accessed by the distributed execution strategy at different sites need to be combined before they are handed to the client processor. This is the responsibility of the merger. The decomposer and merger support location independence by hiding the site location of data from the client processors. Some distributed I~B~lss do not have a merger. In these systems, the distributed execution strategy produced by the decomposer contains instructions to be executed by the run-time support processor within the distributed DB~S for merging the canonical data from several data processors. The decomposer must know where data resides in order to decompose a single c o m m a n d into several subcornmands, each to be processed at different sites. A special type of schema, called a distribution schema (Fig. 5) is used to store the location of each site containing data that m a y be accessed by users at the local site. The decomposer and merger hide the location of data from the client processor. Similar to a centralized system, a distributed system supports concurrency transparency. This

217

J.A. Larson / FourReference Architecturesfor DBMS

means that every user of a distributed DBMS can act as if he/she has sole access to the database even though the database is accessed s i m neously by several users. Because data may be distributed across several data processors, the complexity of controlling concurrency in a distributed system is increased. Data may also be replicated to increase data availability and system reliability. Replication independence is the ability of a user to believe that there is only one copy of the data in the distributed system. A mechanism is needed, however, to keep all copies mutually consistent. The distributed execution monitor is responsible for providing replication independence. It is responsible for the execution of distributed execution strategy produced by the decomposer, including updating all data copies. The distributed execution monitor is resonsible for distributed concurrency control. (Note that the run time support processor is still responsible for local concurrency control in a data processor.) The distributed execution monitor is also responsible for supporting transaction atomicity at the global level. If any part of the distributed execution strategy is aborted, then the distributed execution monitor aborts all parts of the distributed execution strategy so that no change is made to any of the databases, A communication subsystem provides for intersite information transfer. To provide this, a communication processor resides at each site and interfaces the processors residing at the site to the communication subsystem. The communication subsystem may be a point-to-point public network, a local area network, or a combination of these two. The communication processors use a set of communication protocols to make proper use of the communication facility and to provide error-free and reliable communication support for distributed systems, The final requirement of a distributed DBMS can be met as follows: (5) Heterogeneous DBMSS. 
If each of the data processors in Fig. 4 is replaced by a centralized DBMS, the client processors of Fig. 4 present a simplified interface to nonhomogeneous DBMSs. Fig. 6 illustrates centralized DBMSs acting as data processors of a distributed DBMS. This architecture meets the remaining requirement of supporting access to nonhomogeneous DBMSs. The major difference between the two figures is the presence of the external schema and user command translator beneath the Global Database Communication System in Fig. 6 and their absence from that position in Fig. 4. If all data processors support the same data model, then the architecture of Fig. 4 suffices. If the data processors support different data models, then the additional external schema and the user command translator are necessary to hide the different data models from the client processors.

Fig. 6. Heterogeneous DBMS.

3.3 Alternative Reference Architectures

The reference architecture of Fig. 6 is Option 3 of the four alternative reference architectures for distributed DBMSs illustrated in Fig. 7. These alternative architectures differ only in the placement of the decomposer and the distributed execution monitor.

Fig. 7. Four reference architectures (Option 1, Option 2, Option 3, Option 4).

Option 1: Connected by Programmer

Fig. 7, Option 1, illustrates the highest placement of the distributed execution monitor in a reference architecture. When using this option the programmer specifies the distributed execution strategy. For example, to move a record from SITE 1 to SITE 2, a programmer might explicitly provide the name of the site as part of the data manipulation command:

    FETCH EMP-RECORD AT SITE1
    STORE EMP-RECORD AT SITE2
    DELETE EMP-RECORD AT SITE1

This architecture lacks location independence because the application program is sensitive to the site at which the data resides.
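The three commands above can be mimicked in a minimal Python sketch. The in-memory `SITES` dictionary and the `fetch`, `store`, `delete`, and `move_record` helpers are hypothetical stand-ins introduced only for illustration; they are not part of any real DBMS interface. The point is that every call names a site, so the program has to change whenever the data moves.

```python
# Hypothetical in-memory stand-in for two independent sites (illustration only).
SITES = {
    "SITE1": {"EMP-RECORD": {"name": "Smith", "dept": "R&D"}},
    "SITE2": {},
}


def fetch(record_name, site):
    """FETCH <record> AT <site>: read a record from the named site."""
    return SITES[site][record_name]


def store(record_name, record, site):
    """STORE <record> AT <site>: write a copy of the record to the named site."""
    SITES[site][record_name] = dict(record)


def delete(record_name, site):
    """DELETE <record> AT <site>: remove the record from the named site."""
    del SITES[site][record_name]


def move_record(record_name, source, target):
    """The programmer spells out the distributed execution strategy by hand."""
    record = fetch(record_name, source)
    store(record_name, record, target)
    delete(record_name, source)


move_record("EMP-RECORD", "SITE1", "SITE2")
```

If the record were later relocated to a third site, every call to `move_record` that names `SITE2` would have to be edited, which is the sensitivity to location described above.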

Option 2: Federated DBMS

Option 2 of Fig. 7 represents a special type of loosely coupled distributed database architecture called a federated database. In a federated database, the decomposer and the distributed execution monitor occur above the conceptual schema and below the external schema used by the federated database management system user. In federated databases, no integrity constraints can be expressed across sites; global control does not exist. Because the constraint enforcer, which is responsible for access control, exists at each site, the local data administrator has complete power to determine who may access data residing at his/her local site. A federated database is a natural intermediate step between completely independent databases, as in Option 1, and a distributed database with centralized control, as in Option 3.
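The local control described for the federated option can be sketched in Python under some loudly stated assumptions: the `Site` class, its access list, and the `federated_query` function are hypothetical names chosen for illustration, and a real constraint enforcer would do far more than check a user name. The sketch only shows that each site decides for itself whether to answer, with no global authority involved.

```python
class Site:
    """One member of the federation; its access list is owned by the local administrator."""

    def __init__(self, name, records, allowed_users):
        self.name = name
        self._records = records              # local data
        self._allowed = set(allowed_users)   # maintained locally, not globally

    def grant(self, user):
        self._allowed.add(user)

    def revoke(self, user):
        self._allowed.discard(user)

    def query(self, user):
        """Local constraint enforcer: the site itself decides whether to answer."""
        if user not in self._allowed:
            raise PermissionError(f"{user} may not read data at {self.name}")
        return list(self._records)


def federated_query(user, sites):
    """Send the request to every site and merge whatever the sites agree to return."""
    results, refused = [], []
    for site in sites:
        try:
            results.extend(site.query(user))
        except PermissionError:
            refused.append(site.name)
    return results, refused


site1 = Site("SITE1", ["emp-1", "emp-2"], allowed_users={"alice", "bob"})
site2 = Site("SITE2", ["emp-3"], allowed_users={"alice"})

alice_rows, alice_refused = federated_query("alice", [site1, site2])
bob_rows, bob_refused = federated_query("bob", [site1, site2])
```

Here the administrator of `SITE2` has excluded `bob` without consulting anyone else, so `bob` sees only the data held at `SITE1`.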

Option 3: Tightly Coupled Distributed DBMS

This is the reference architecture described earlier in this paper. In this architecture, semantic integrity constraints involving data residing at several sites can be specified in the conceptual schema. This is not possible in Options 1 and 2.
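As an illustration of a constraint that spans sites, suppose every employee stored at one site must refer to a department stored at another. The Python sketch below assumes exactly that constraint; the site contents and the `check_emp_dept` and `insert_employee` functions are hypothetical and stand in for a check that would be declared in the conceptual schema.

```python
# Hypothetical data: employees are stored at SITE1, departments at SITE2.
site1_employees = [{"emp": "Smith", "dept": "R&D"}]
site2_departments = {"R&D", "Sales"}


def check_emp_dept(employees, departments):
    """Cross-site constraint: every employee's department must exist at the other site."""
    return all(e["dept"] in departments for e in employees)


def insert_employee(employee):
    """Accept the insert only if the cross-site constraint still holds afterwards."""
    candidate = site1_employees + [employee]
    if not check_emp_dept(candidate, site2_departments):
        raise ValueError(f"unknown department: {employee['dept']}")
    site1_employees.append(employee)


insert_employee({"emp": "Jones", "dept": "Sales"})       # accepted
try:
    insert_employee({"emp": "Brown", "dept": "Legal"})   # rejected: no such department
    rejected = False
except ValueError:
    rejected = True
```

The check needs to see data from both sites at once, which is why it has to live in a schema that describes both.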

Option 4: Centralized DBMS

This is a centralized DBMS architecture that permits data to be physically stored at several sites. Users may access the data only through the central site. Like Option 3, Option 4 can support semantic integrity constraints that cross sites.

The choice of decomposer and distributed execution monitor placement represents a trade-off between the following two factors:

• The ability to specify integrity constraints across sites.
• The ability of a local data administrator to control the description and usage of data stored at his/her site in the computer network.

Options 3 and 4 of Fig. 7 allow the expression of semantic integrity constraints on data across sites, while the remaining options allow such constraints only on data within a single site.

3.4 Limited Integration of Data

We need flexible alternatives for the limited integration of data: alternatives that span the range from complete privacy of data to total sharing, and from absolute local autonomy to none at all. Such alternatives depend on the nature of the distribution of both the data and the enterprise for which the database exists.

Limited integration of separate databases may result from the historical evolution of separate, previously defined databases. The content and form of databases residing at different host computers which are to be integrated may range from homogeneous (as in the case of personnel databases at several sites of the same company) to heterogeneous (as in the case of merging personnel and billing databases of the same company). Limited integration may be the result of a choice to make database decentralization the rule from the outset, so that local administrators may develop and maintain the definitions of local databases. Limited integration is also useful in preserving the integrated view of data for programs developed before an integrated, centralized database is dispersed to several sites.

Fig. 8 illustrates a logical view of how users can access data at various sites. Each site that contains data to be shared with other sites has one or more local schemas. A local schema describes a subset of the data at a site that may be accessed by users. The local schema corresponds to the highest schema below the distribution schema in each of the reference architectures. In this approach, the administrators of each site are able to control who may access the local data using a mechanism other than security constraints specified as part of the semantic integrity constraints within the conceptual schema. This is done by creating a local schema that describes only the data to be accessed by a class of users.
Different subsets of the same data may be described by different local schemas, for use by different classes of users. Each site that allows users to access data contains one or more global schemas. A global schema describes data from one or more sites that a user may access. The global schema is the lowest schema above the distribution schema in each of the reference architectures. Several different global schemas may exist at a site, one for each class of users.

Fig. 8. Limited integration of database (Site 1, Site 2, Site 3, Site 4).

This strategy is very flexible in that it may allow one class of users to access all of the data, and it may allow several classes of users to access different subcollections of the data. It provides a high degree of reliability, i.e. if a particular site is down, only the data described by the local schemas at that site are unavailable. It is possible for (1) some user classes to access a completely integrated database, (2) other user classes to access a limited integrated database, and (3) still other user classes to access only individual centralized databases.

If a user class has a global schema that describes only a single database, then the user has access to a single centralized database. For example, in Fig. 8, USER CLASS 1 can access only the database at SITE 1. If a user class has a global schema that describes all of the databases at all sites, then the databases are completely integrated with respect to the user class. For example, in Fig. 8, USER CLASS 3 can access all of the databases. If a class of users has a global schema that describes only some of the databases, then there is limited integration of the databases with respect to the user class. For example, in USER CLASSES 2 and 4 in Fig. 8, the user can access two of the three databases.

The three distributed DBMS architectures (Options 1, 2, and 3 in Fig. 7) can each support both partial and total integration of multiple databases. Table 1 illustrates how the global and local schemas of Fig. 8 relate to three of the four alternative architectures of Fig. 7.

Table 1

Distributed architecture option   Local schema of Fig. 8 corresponds to   Global schema of Fig. 8 corresponds to
1                                 External schema                         Higher level external schema
2                                 Conceptual schema                       External schema
3                                 Internal schema                         Conceptual schema
4                                 Not applicable                          Not applicable

There are other reference architectures for the analysis of distributed DBMSs. Gligor and Luckenbaugh [8] describe a model similar to the above. Decitre [9] describes a model of a distributed DBMS that concentrates on the Global Database Control and Communication System. Ceri and Pelagatti [7] replace our distribution schema with two schemas: (1) the fragmentation schema describing the horizontal and vertical partitioning of files into file fragments, and (2) the allocation schemas indicating the site or sites where each fragment is stored. Bachman and Ross [10] describe a more comprehensive reference architecture for DBMSs and communication.
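The split of the distribution schema into a fragmentation schema and an allocation schema, as attributed to [7] above, can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the `EMP` relation, the fragment names, the predicates, and the site assignments are hypothetical, and only horizontal partitioning is shown.

```python
# Hypothetical global relation (illustration only).
EMP = [
    {"emp": "Smith", "city": "Minneapolis"},
    {"emp": "Jones", "city": "Phoenix"},
    {"emp": "Brown", "city": "Minneapolis"},
]

# Fragmentation schema: each horizontal fragment is defined by a predicate on rows.
FRAGMENTATION = {
    "EMP_MPLS": lambda row: row["city"] == "Minneapolis",
    "EMP_OTHER": lambda row: row["city"] != "Minneapolis",
}

# Allocation schema: the site or sites at which each fragment is stored.
ALLOCATION = {
    "EMP_MPLS": ["SITE1", "SITE3"],   # a replicated fragment
    "EMP_OTHER": ["SITE2"],
}


def fragment(relation, fragmentation):
    """Split a relation into the fragments named by the fragmentation schema."""
    return {
        name: [row for row in relation if predicate(row)]
        for name, predicate in fragmentation.items()
    }


def sites_holding(row, fragmentation, allocation):
    """Return the sites that store the fragment containing this row."""
    for name, predicate in fragmentation.items():
        if predicate(row):
            return allocation[name]
    return []


fragments = fragment(EMP, FRAGMENTATION)
```

Keeping the two mappings separate means a fragment can be moved or replicated by editing `ALLOCATION` alone, without redefining how the relation is partitioned.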


Flexible alternatives are needed for integration of data in a distributed DBMS. These alternatives range from limited integration to complete integration. A distributed database administrator can control the degree of integration by specifying how descriptions of local databases are combined to form a global description of data that can be accessed by a user class. Different user classes may have different degrees of database integration.
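One way to picture how an administrator combines local descriptions into a global description per user class is the following Python sketch. The schema names, the site assignments, and the `degree_of_integration` and `available` helpers are hypothetical; in particular, which databases each user class sees is an assumption and is not read off Fig. 8.

```python
# Hypothetical local schemas: each names the site that holds the data it describes.
LOCAL_SCHEMAS = {
    "personnel_1": "SITE1",
    "personnel_2": "SITE2",
    "billing_3": "SITE3",
}

# One global schema per user class, formed by combining local schemas.
GLOBAL_SCHEMAS = {
    "USER CLASS 1": ["personnel_1"],                               # single database
    "USER CLASS 2": ["personnel_1", "personnel_2"],                # limited integration
    "USER CLASS 3": ["personnel_1", "personnel_2", "billing_3"],   # complete integration
}


def degree_of_integration(user_class):
    """Classify a user class by how many local schemas its global schema covers."""
    covered = len(GLOBAL_SCHEMAS[user_class])
    if covered == 1:
        return "single database"
    if covered == len(LOCAL_SCHEMAS):
        return "complete integration"
    return "limited integration"


def available(user_class, down_sites):
    """Local schemas a user class can still reach when some sites are down."""
    return [
        schema
        for schema in GLOBAL_SCHEMAS[user_class]
        if LOCAL_SCHEMAS[schema] not in down_sites
    ]
```

Changing the degree of integration for a user class is then a matter of editing one entry in `GLOBAL_SCHEMAS`, and a site failure removes only the local schemas held at that site.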

4. Conclusion

Both centralized and distributed systems need to support multiple user interfaces, semantic integrity constraint enforcement, program-data independence, concurrency transparency, and transaction atomicity. Reliability, replication independence, location independence, faster response time and lower cost, and configuration independence are frequently suggested objectives of distributed DBMSs. In most systems, some of these objectives are realized at the cost of sacrificing other objectives. A reference architecture of a DBMS is necessary to clarify the various issues and choices of a distributed DBMS so that the objectives that are important can be deliberately chosen. Each component of the reference architecture deals with one of the important issues of distributed data management, and each allows people to ignore other issues dealt with by other components of the architecture.

We have presented four reference architectures consisting of client processors, data processors, and a global database control and communication system. The choice of the placement of the decomposer and of the distributed execution monitor is a trade-off between the ability to specify semantic integrity constraints across sites and the ability of a local data administrator to control the description and usage of local data.

References

[1] Durchholz, R., and Will, H.J., The Impact of Database Management Systems (DBMSs) Standardization on Auditing, C&S Vol. 1, No. 1, pp. 44-59.
[2] Larson, J.A., Johnson, H.R., and Wilson, T.B., Data Independence Categorizations for the Database Computer, Proceedings 13th International Conference on System Sciences, Vol. 1, 1980, pp. 221-230.
[3] Study Group on Database Management Systems: Interim Report, FDT 7:2 (ACM, New York).
[4] Curtice, R.M., Data Dictionaries: An Assessment of Current Practice and Problems, Proceedings, 7th VLDB (Cannes, France) (AC Order No. 471810), September 1982.
[5] McCarthy, J.L., Metadata Management for Large Statistical Databases, Proceedings, 8th VLDB (Mexico City, September, 1982) pp. 234-243.
[6] Meyer, M.E. et al. (eds) Information Resource Dictionary System (IRDS) Requirements Document, ANSI/X3/H4 IRDS Technical Committee DRAFT Version 3.0, September, 1981.
[7] Ceri, S. and G. Pelagatti, Distributed Databases: Principles and Systems (McGraw-Hill, 1984).
[8] Gligor, V.D., and Luckenbaugh, G.L., A Model of Describing Distributed Database Management Systems Architecture, Computer, January 1984, pp. 33-34.
[9] Decitre, P., A Model for Describing Distributed Database Management Systems Architecture, Proceedings of the 16th Annual Electronics and Aerospace Conference and Exposition, September 1983, pp. 367-376.
[10] Bachman, Charles W., and Ross, Ronald G., Towards a More Complete Reference Model of Computer-Based Information Systems, Computers and Standards 1:1, pp. 35-48.