Considerations for determining the degrees of centralization or decentralization in the computing environment


Techniques

Jacob Slonim, Dave Schmidt, and Paul Fisher
Department of Computer Science, Kansas State University, Manhattan, Kansas 66506, USA

The advent of distributed data base systems has introduced a bewildering assortment of terms, measurements, and descriptions for managers and users. The complexity which a distributed system introduces in hardware, software, and data allocation provides the major source of misunderstanding and confusion that is reflected in current jargon. This paper proposes a method of definition and measurement which alleviates the terminology and measurement problem. The methodology provides a standardized view of distributed systems and promotes an objective, quantified approach to the classification and selection of such a system.

Keywords: Distributed data base, operational, performance, update, retrieval, economic, distribution.

Jacob Slonim received a B.S. in Computer Science and Mathematics from the University of Western Ontario in 1971, an M.S. in Computer Science in 1973, and a Ph.D. in the same field was awarded him by Kansas State University in 1978. His professional experience includes the following: system designer, programmer, and project manager for Canadian Jurimetrics Limited; international project manager for the National Center of Scientific and Technological Information, Israel; and research assistant and instructor at Kansas State University. He is a member of the ACM, SIGMINI, and SIGIR honorary professional societies, and his current research field is data base management.
Paul S. Fisher received a B.A. in Mathematics from the University of Utah in 1963, an M.A. in the same field in 1964, and a Ph.D. in Computer Science was awarded him by Arizona State University in 1967. From 1967 to 1972, he worked as an Assistant Professor in the Department of Computer Science at Kansas State University and was thereafter advanced to the position of Associate Professor. His current position as Head of that Department was presented him in 1973, and he was awarded a full professorship in July, 1978. Dr. Fisher has served as a reviewer for Computer Reviews ACM, CACM Programming Systems Section, Wiley, and McGraw-Hill. He is currently serving as consultant to the Computer Systems Command, U.S. Army. He has authored over fourteen publications since 1970 and in that time has also received over a dozen research grants, most notably one each for software portability issues and a back-end DBMS communication system. Dr. Fisher is a member of the ACM, SIGMINI, SIGPLAN, and SIGACT honorary professional societies, and is presently on sabbatical in Israel, where he is a Visiting Professor at the University of Tel-Aviv and working as a research consultant to ELBIT.

David A. Schmidt was born in Colby, Kansas on May 10, 1953. He received the B.A. degree (Mathematics) from Fort Hays (Kansas) State University in 1975 and the M.S. degree (Computer Science) from Kansas State University in 1977, where he is currently working towards the Ph.D. degree. His research interests include denotational semantics and computational complexity. Mr. Schmidt is a member of the IEEE Computer Society and the Association for Computing Machinery.

1. Introduction

There has long been a need for a standard for defining and comparing distributed organizations. Since there are now a large variety of organizational options in terms of geography, hardware, and software, this need becomes crucial. The methodology presented for filling this need assists in selection as well as measurement of a system; managers need assistance in deciding, based on factors important to them, which alternatives of a distributed system are best for their implementation. In addition, communication between technicians and managers is eased. Technicians view a system in terms of performance while management sees a system through economic factors; the proposed tool reconciles these views by providing a common meeting ground for determining priorities and trade-offs. Such a unifying approach is crucial in reducing future communication gaps.

2. Composition of the tool

Presently, the largest problem in comparing and rating distributed systems lies in the large number of characteristic features to be considered. The proposed tool organizes these factors into seven groups of system considerations. They are as follows:
1. Operational Characteristics
2. Performance Characteristics
3. Update Characteristics
4. Retrieval Characteristics
5. Economic Characteristics
6. Data Base Size
7. Number and Distribution of Users
Each group contains a number of subitems which, as a whole, constitute the essence of the property under consideration. As some subitems may carry more weight than others in influencing the overall rating of the group, a five-point weighting system is introduced to balance their relative importance. The subitems for each group will be discussed later. With the distributed system's characteristic features organized as indicated, objective ratings may now be stated for a given system architecture in each area.

This allows different distribution alternatives to be compared in the light of their overall ratings.

3. Types of systems

The basic forms of distributed systems are described, where each will be rated against the seven groups. These alternatives represent a current view of what distributed systems can become. The reader who desires more information on these is referred to [7].

3.1. Centralization (C)

This organization maintains the entire data base at a single, central location. This organization is utilized in most systems today and is particularly well-suited for applications where requests are alike from all nodes.

3.2. Decentralization

Decentralization, Partitioned (DP). A partitioned data base is often described as multiple logical data bases; i.e., the formerly centralized data is divided across several computers. Data bases are typically partitioned according to required accessibility; that is, files are positioned on the machine where they are likely to receive heaviest usage.

Decentralization, Partitioned, Heterogeneous Software (DPHS). Here a number of different systems are employed in the network. The fundamental issue then becomes the development of a control structure. The integration of different DBMS's involving different data models, data definition languages (DDL), data manipulation languages (DML), and data formats requires a large effort in cross data systems translation technologies. Schemes for global control of the system (to achieve transparency) and global addressing techniques (master directories, schemas) are also needed components.

Decentralization, Partitioned, Heterogeneous Hardware (DPHH). One may use the same data base management system on different computer architectures, usually those of different hardware manufacturers.

Decentralization, Partitioned, Homogeneous Software, Heterogeneous Hardware and Data Compatibility (PHHDC). Mixed vendors may present one problem not otherwise encountered, that is, compatibility differences between one machine and another involving the basic codes used for representation of information. Fortunately, the commonly used information codes for internal representation (i.e., ASCII) allow code conversion at this basic level to be feasible and possible, and the frequent use of code conversion tables involves little use of storage and processing power.

Decentralization, Partitioned, Heterogeneous Software/Hardware and Data Compatibility (PHSHDC).

This configuration presents a very complex situation, as the distributed system embodies a truly distributed data base. Some doubt has been expressed whether organizations need such systems, but there is clearly interest. One insurance company interviewed by P.J. Down [12] is contemplating a distributed system in which each branch office will have its own computer, software, and data. The majority of the transaction processing will thus be made in each specific office, which will be linked to others so that transactions for data not held at that office can be routed to the appropriate location.

Decentralization, Replication (DR). A distributed data base may be primarily based upon the duplication of certain files at some or all the information processors. Duplication may be needed to increase access to the file by providing more paths to it; to provide rapid backup in case of the failure of a device, channel, or information processor accessing the file; or to decrease communications volume and/or dependencies on the communications facilities between information processors. A distributed system in which there is duplication of data between different locations raises the problem of maintaining consistency of the duplicated data, particularly after a system failure. No available software package covers this situation.

Decentralization, Replication, Heterogeneous Software (DRHS). A replicated data base could be supervised by different DBMS software (based upon different data models; i.e., network, hierarchical, relational). This approach involves very complex updating procedures.

Decentralization, Replication, Heterogeneous Hardware (DRHH). Such configurations have different machines, and thus physical data independence becomes extremely important; i.e., the data and the application programs which use it must remain unaffected (except for performance) by changes made to the physical storage structure. Cost/benefit trade-offs might make certain types of physical independence very expensive.

Decentralization, Replication, Heterogeneous Hardware and Data Compatibility (RHHDC). The problem that faces the designer of a distributed DBMS composed of multiple software and hardware systems on a heterogeneous network is data incompatibility. The problem of disparate internal data representation is complicated by different physical sites running different data base management machines.

Decentralization, Replication, Heterogeneous Software/Hardware and Data Compatibility (RHSHDC). Here data is replicated at multiple heterogeneous sites in order to minimize hardware and transmission costs. Such a system must cope with multiple copies of the same data in one or more logical and/or physical formats. Data compatibility is concerned with translating data from one format (logical or physical) to another. This issue must be addressed when transferring data from one DBMS to another. With different sites running different data base systems, effective data translation becomes critically important.
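The code-conversion tables mentioned for the data-compatibility configurations can be illustrated with a small sketch. The table entries below are hypothetical (not a real vendor mapping), chosen only to show why a 256-entry translation table is cheap in both storage and processing:

```python
# Sketch of table-driven code conversion between two machine character
# sets, as used when moving data between heterogeneous hardware.
# The mapping below is illustrative only, not a real EBCDIC/ASCII table.

# Start from an identity table: each of the 256 byte values maps to itself.
table = list(range(256))

# Hypothetical entries: suppose the source machine stores 'A'..'C'
# at codes 0xC1..0xC3 (as EBCDIC does) and we map them to ASCII.
table[0xC1] = ord('A')
table[0xC2] = ord('B')
table[0xC3] = ord('C')

conversion = bytes(table)          # the whole table is only 256 bytes

def convert(record: bytes) -> bytes:
    """Translate one record from the source code to the target code."""
    return record.translate(conversion)

print(convert(b'\xC1\xC2\xC3'))    # -> b'ABC'
```

A single table lookup per byte is all the processing required, which is why the paper can say such conversion "involves little use of storage and processing power."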

4. Construction and use of the tool

The system configurations are now evaluated against each of the seven characteristic categories. As each category contains many subitems, a tabular format is used. Each subitem is represented as a row in the table; system configurations appear as the columns. A five-point rating scale is used to rank each system on each specific subitem. A five-point weighting factor is then introduced to balance the subitems in proportion to their importance to the group as a whole. The results for each system configuration are totaled, and an overall rating results for each characteristic. As the discussion of system characteristics follows, the reader should remember that the table includes many subitems which may be of little or no consideration to a specific situation. The manager may therefore include only items that are relevant to the environment. The tables can therefore be used as a guide; hardware and software options can be easily compared for performance and economic improvement.

The ratings for each characteristic subitem/system configuration are subjective. Information for the ratings stems from the authors referenced on each of the subitems. The reader may disagree with ratings on several of the items and is welcome to do so. Due to the large number of subitems, the error on each subitem is minimized. Overall evaluation is therefore insensitive to smaller points.
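The tabular computation just described can be sketched in a few lines. The subitem names, weights, and ratings below are invented for illustration, not the paper's actual values; the normalized figure computed at the end corresponds to the evaluation index (weighted total divided by the sum of the weights) used in the tables:

```python
# Sketch of the rating scheme: each subitem carries a weight (1-5) and,
# per configuration, a rating (1-5); the weighted total, normalized by
# the sum of weights, ranks the configurations within one category.
# Subitems, weights, and ratings are made up for illustration.

subitems = [            # (name, weight, {configuration: rating})
    ("data locality",        5, {"C": 1, "DP": 5, "DR": 4}),
    ("machine independence", 3, {"C": 1, "DP": 3, "DR": 3}),
    ("security",             4, {"C": 5, "DP": 3, "DR": 2}),
]

def evaluate(config: str) -> float:
    """Weighted total for one configuration, divided by the weight sum."""
    total = sum(weight * ratings[config] for _, weight, ratings in subitems)
    return total / sum(weight for _, weight, _ in subitems)

for config in ("C", "DP", "DR"):
    print(config, round(evaluate(config), 2))
```

Because the index is normalized by the weight sum, categories with different numbers of subitems remain comparable on the same one-to-five scale.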

5. System evaluation characteristics

The subitems that constitute the seven categories of the distributed system evaluation are:
1. Operational Characteristics - Those factors which encourage easy access, flexibility, and expansion of the system
2. Performance Characteristics - Those factors at the hardware level that influence throughput
3. Update Characteristics - Approaches to the updating of data and how they affect system performance
4. Retrieval Characteristics - The types and distribution of user queries which affect system performance
5. Economic Characteristics - Elements of present and future cost benefits
6. Data Base Size - The number of schemas and the size of each data base
7. User Characteristics - The geographic distribution of users of the system
The subitems of these seven categories must now be considered.

5.1. Operational

Data locality [10,11]. At a single site, data may be spread across multiple storage units to improve load leveling. Multiple sites may allocate data among the sites, with each subset of data being allocated to the site which uses it most. This latter subdivision makes remote access methods essential.

Data distributed by machines [28]. Data may be divided among multiple machines, with each subset of data kept locally while summarized data is held at remote machines. This method of distributing data must be enforced managerially within the organization in large enterprises.

Data standards [7]. To manage the data consistently, standards must be developed and implemented throughout the company. Such standards will include the specification of data names and their descriptions. These can be monitored and enforced by the data administrator.

Expertise [12]. People within the data administration function must become expert in using the DBMS and its associated software for solving physical layout problems and extracting logical data.

Available data [30]. The ability to reach many different data bases can provide a terminal user with a rich array of capabilities. To provide high data availability, the system designer must store vital data in duplicate in more than one machine (to allow access during partial system failure).

Resource-sharing [3]. Resource-sharing encompasses a myriad of operational issues that directly affect the operation of distributed data bases. The configuration and homogeneity of the system determine, to a large degree, the technology required. Homogeneous systems will naturally require less effort than heterogeneous systems.

Machine independence [31]. Machine independence is concerned with the change from one set of hardware to another, usually of another manufacturer. Such a change constitutes a radical, painful transformation in most environments.

Evolutionary growth of the data base [23]. The most obvious form of data base growth is the increase in the volume of data, a "horizontal expansion." Such growth leads to a need to spread the data across machines or storage devices.

Diverse requirements of users [23]. Disregarding the changes to the data base, the addition of new users usually means that they need to see their unique subset. New access paths may need to be established. Additional users may impose more stringent performance or timeliness requirements.

Security [29]. The best configuration is one in which the DBMS is the only system executing on the hardware. Where that is not possible, the DBA should review all hardware and software configurations to determine the threat which might be applied for the specific environment.

Concurrency [16]. The sharing of data among concurrent processes can adversely affect data base integrity and consistency. A large data base must usually allow concurrent processing. The user should not be concerned with other processes that are executing simultaneously.

Localized management [20,12]. A factory or district office may desire to keep its own data. The data are nevertheless used elsewhere, possibly by means of telecommunication links. Localized management and control of data can have advantages; the local organization is fully responsible for accuracy and safekeeping and cannot blame malfunctions on some far distant group.

5.2. System performance

Integration [34]. Among homogeneous data base systems, the level of integration effort is small in relation to that of heterogeneous ones. The integration of different DBMS's involving different data models, data definition languages, and data formats requires a large effort in data translation.

System complexity [25]. In general, data base software becomes more complex when it permits logical files to be split. Little software is available to help.

Overlap execution [22]. Simulation studies show that a backend computer can overlap its execution with the host machine, allowing more throughput in the larger computer.

System maturity [15]. A fairly complete analysis of distributed file system designs has been made, but actual implementation is dependent upon technological advances in the area of computer networking. On the other hand, the centralized DBMS is mature, and this promotes easier implementation and usage.

System overhead [24]. Overhead is defined as nonproduction effort when the system and its programs are performing administrative (i.e., non-user related) tasks.

Program compatibility [20]. Centralized control is needed to ensure that transfer of applications will be possible without burdensome reprogramming. The data item formats should be centrally controlled, and the same data dictionary should be used everywhere. Only one data description language (DDL) must be used and all schemas reviewed centrally. Only with centralized control is it possible to avoid the crippling problems resulting from piecemeal development.

Data redundancy [32]. Some redundancy exists in order to give improved access, reduced transmission, simple addressing methods, and better recovery from accidental loss of data. Uncontrolled redundancy involves the extra cost of storing multiple copies, serious problems in updating, and probable inconsistencies.

System throughput [18]. In distributed systems, application programs will generally execute faster due to the reduction in traffic. On the other hand, the overhead of the network can reduce the response time for programs initiated across nodes or requesting data from several locations.

Reorganization of data [26]. The process of rearranging the relative physical placement of data units in the data base constitutes reorganization. The reorganization of data in distributed networks is much more difficult than in a centralized DBMS application.

Translation of data [21,24,33]. One of the problems that faces the distributed system is data incompatibility. The problem of disparate internal data representations is complicated by the different logical structures of the data base system. Since differences are a fact of life in data processing, a method of data base translation is necessary for the case of distributed networking.

Response time overhead [8]. Overhead due to the backend approach with respect to response time consists of transmission time of the command to the backend, task queueing delay, possible conversion overhead associated with character sets and data format, and transmission time of the result back to the host.
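The backend response-time overhead itemized above is simply the sum of its components. A toy accounting, with invented timings (the figures below are assumptions for illustration, not measurements from the paper):

```python
# Components of backend response-time overhead as itemized in the text,
# with invented timings (seconds) purely for illustration.
overhead_components = {
    "command transmission to backend": 0.004,
    "task queueing delay":             0.010,
    "character-set/format conversion": 0.002,
    "result transmission to host":     0.006,
}

total_overhead = sum(overhead_components.values())
print(f"response-time overhead: {total_overhead:.3f} s")
```

Such an additive breakdown makes it easy to see which component dominates; here the queueing delay would be the first target for tuning.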

5.3. Economic

Cost [20]. If a system stores its data close to the locations where they originate or are used, there is less transmission of data, with a subsequent reduction in telecommunication cost. On the other hand, economies of scale often favor the use of large centralized data storage facilities. There is a cost trade-off between these factors, but the cost of small localized data storage facilities is dropping much faster than the cost of data transmission.

System development [20]. The problems associated with excessively large system development can be alleviated if local organizations develop local data bases and make them work, albeit under centralized constraints, such as data definitions, formats, and schemas.

Hardware cost [6]. Local development and storage of local files has gained popularity and economic viability with the spread of minicomputers with data transmission capability.

Improved utilization of resources [12]. By carefully dividing the data between sites, one improves the utilization of system resources. However, overhead is incurred when remote access is necessary.

Data communication cost [12]. In distributed data bases, a new cost factor is communication, e.g., tariffs. Such tariffs may increase in the future rather than decrease.

Software cost [12]. The cost of distributed network DBMS software is increasing with time as a result of the rising cost of software development personnel and the increasing complexity of network software. There has been little success in providing better software engineering methodology to compensate for higher personnel cost.

Hardware cost [12]. The cost/performance ratio of hardware changes rapidly; mini- and microprocessors have been developed to be used in distributed data environments.

5.4. Update

Update from different locations [10,11]. If data is to be updated by transactions from different locations, it should remain in one place so that the updating process, with its potential conflicts and deadlocks, can be controlled. It is undesirable to have more than one copy of data being updated in different places at the same time.

Dynamic update versus static update [14]. When an elaborate search of the data is necessary in order to respond to spontaneous queries, the data must be structured to facilitate searching. With such structures, it is complex and expensive to update or insert new records. It is sometimes, however, difficult to avoid this complexity. Sometimes it is possible to carry out the updating later in off-line operations.

Redundancy [13,5]. In a replicated data system, multiple updating operations are necessary. Redundancy is therefore expensive for volatile files. Also, because different copies may be in different stages of updating, the system may give inconsistent information.

Deadlock [1]. The possibility of a deadlock in distributed networks is greater than in centralized systems.

5.5. Retrieval

Batch query. A batch is a collection of transactions taken over a period of time which are retained for later sequential processing.

On-line query. In an on-line operation, a user has direct and immediate access to the computer system via terminal devices.

On-line query for more than one file. Multiple file capabilities allow direct and immediate access to more than one schema.

On-line query from DBMS distributed geographically. The user has direct and immediate access to a data base management system which connects geographically separated computers together via transmission lines.

Restricted query. The retrieval of the data is predefined by the DBMS.

Unrestricted query. In a system that permits spontaneous queries or allows a nonprogrammer user to explore the data base and produce reports, unrestricted queries may be said to occur.

On-line query from homogeneous DBMS. This form presents direct and immediate access to more than one data base management system of the same type.

On-line query from heterogeneous DBMS. Similar to the above, this class of query permits direct and immediate access to more than one DBMS, with different sites running different DBMS's.

5.6. Data base size

Data base size [27] is the total number of characters required for a representation of the data. This can influence the choice of the data base organization. An organizational method that requires a large amount of overhead per entity would be an unlikely candidate for a large data base. The size also affects the type and number of backend machines selected. We shall divide the data base into four categories:
1. Small: up to 1 Mbyte
2. Medium: 1 M to 50 M bytes
3. Large: 50 M to 100 M bytes
4. Very large: over 100 M bytes
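The four size categories can be expressed as a small classifier. Where a size falls exactly on a boundary, the paper does not say which category wins, so the inclusive-upper-bound treatment below is an assumption:

```python
# The paper's four data base size categories as a classifier.
# Boundary handling (upper bound inclusive) is an assumption;
# the paper does not specify which side a boundary value falls on.
MBYTE = 1_000_000

def size_category(chars: int) -> str:
    """Classify a data base by its total character count."""
    if chars <= 1 * MBYTE:
        return "small"          # up to 1 Mbyte
    if chars <= 50 * MBYTE:
        return "medium"         # 1 M to 50 M bytes
    if chars <= 100 * MBYTE:
        return "large"          # 50 M to 100 M bytes
    return "very large"         # over 100 M bytes

print(size_category(30 * MBYTE))   # -> medium
```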

5.7. User

The most important consideration in defining the need for expansion of data processing and data accessibility is the number of users. The following is our categorization of number of users:
1. Small number locally: 2-16; small number distributed: 1-8
2. Medium number locally: 16-32; medium number distributed: 8-16
3. Large number locally: 32-128; large number distributed: 16-64
4. Very large number locally: 128 and up; very large number distributed: 64 and up

The tables for each of the seven categories are displayed in Tables 1 through 7. The "Description" column contains abbreviations. "W" denotes the weighting factor. The weighted evaluation of a subitem for a particular configuration is the product of the rating shown multiplied by the weighting factor (as given at the left of the row). The sums of the weighted results are presented in the "total" row of each table. The final row gives the evaluation index: the total weighted score divided by the sum of the subitems' weights. An overall evaluation of each configuration is exhibited in Table 8, where the seven characteristic factors are themselves weighted and summed.
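The overall comparison, in which the seven category indices are themselves weighted and summed, can be sketched as follows. The category weights and the per-category indices below are invented values for illustration, not the paper's actual figures:

```python
# Sketch of the overall evaluation: a configuration's seven category
# evaluation indices are themselves weighted and summed.
# Category weights and the sample indices are invented values.

category_weights = {
    "operational": 4, "performance": 5, "update": 3, "retrieval": 3,
    "economic": 5, "db size": 2, "users": 2,
}

def overall(indices: dict[str, float]) -> float:
    """Weighted sum of the seven category evaluation indices."""
    return sum(category_weights[c] * indices[c] for c in category_weights)

# Hypothetical category indices for one configuration.
dp_indices = {"operational": 4.2, "performance": 3.4, "update": 3.8,
              "retrieval": 3.3, "economic": 4.0, "db size": 3.6, "users": 3.2}

print(round(overall(dp_indices), 1))
```

Raising the weight of a category a manager cares about (say, economics) shifts the final ordering without re-rating any subitem, which is the flexibility the tool is designed around.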

6. Evaluation

The operational table (table 1) shows that good results occur in systems with homogeneous hardware and software. Such an effect is due to the maturity of system development in those areas; heterogeneous networks are still in their infancy. Also note that the centralized system fails to show good results, due to the lack of data locality and machine independence in its singular configuration. Data replication versus partitioning has no effect upon the scores. It should be noted that even in centralized systems the need to "distribute" data over several files may be needed to improve reliability and timing.


[Table 2. System performance: ratings (W = weight) for Integration, System Complexity, Overlap Execution, System Maturity, System Overhead, Program Compatibility, Data Redundancy, System Throughput, and Response Time Overhead across the configurations C, DP, DPHS, DPHH, PHHDC, PHSHDC, DR, DRHS, DRHH, RHHDC, and RHSHDC, with totals and means; individual values are not legible in this copy.]

One implication drawn from these results is that, from a purely operational point of view, a homogeneous distributed partitioned system presents a definite step up from a centralized system.

Totals from the system performance table (table 2) show that centralized systems score well due to the total lack of compatibility problems. In contrast, distributed systems suffer due to their lack of maturity; data translation and program compatibility problems are as yet unsolved. Note that replication provides better performance than data partitioning due to the locality of data, an implementation which is truly an extension of the centralized philosophy.

Update considerations in table 3 show that centralizing versus partitioning gives almost the same results.

[Table 3. Update: ratings for Update from Different Locations, Dynamic versus Static Update, Redundancy of Data, and Deadlock across the eleven configurations, with totals and means; individual values are not legible in this copy.]


[Table 4. Retrieval: ratings for the query types of section 5.5 (batch query, on-line query, on-line query for more than one file, on-line query from a DBMS distributed geographically, restricted and unrestricted query, and on-line query from homogeneous and heterogeneous DBMS's) across the eleven configurations, with totals and means; individual values are not legible in this copy.]

The other end of data access, retrieval, in table 4, shows that the replicated distributed system provides an excellent query environment. Centralization fares poorly, due to the absence of data locality.

Replication of data requires more update overhead, and consequently such systems do poorly. Note also that performance is reduced significantly when there are hardware/software incompatibilities.

[Table 5. Economic: ratings for Data Communication Cost, Expansion Costs, Update Cost (Per Unit), and Retrieval Cost (Per Unit) across the eleven configurations, with totals and means; individual values are not legible in this copy.]


Tables 3 and 4 are really complementary views; a comparison of the tables shows that some form of a partitioned system gives a good balance between data retrieval and update. Replication should only be considered when retrievals constitute a large portion of the user requests.

[Table 6. Data base size: ratings for the four size categories (up to 1 Mbyte; 1 M to 50 M bytes; 50 M to 100 M bytes; over 100 M bytes) across the eleven configurations, with totals and means; individual values are not legible in this copy.]

Table 5 shows the totals for the economic factors, which suggest that the replicated systems are not economical: extra copies of the data mean extra cost for storage and maintenance. Note also that usage of heterogeneous hardware and software plays a positive role, in that one may obtain a better price on a system configuration by choosing different vendors; however, one "pays" for this in other categories.

The effect of the size of the data base and the number of schemas is shown in table 6. Only a partitioned system performs well when the size and number of users increase. The choice of replication ruins any hope of reasonable expansion, and a centralized configuration simply cannot cope with growth.

Table 7 shows that as the consideration of the users' needs increases, a replicated system appears to be

[Table 7. User characteristics: ratings (W = weight) for small (2-16), medium (16-32), large (32-128), and very large (128 and up) numbers of local users, for small (1-8), medium (8-16), large (16-64), and very large (64 and up) numbers of distributed users, and for combinations, across the eleven configurations; individual values are not legible in this copy.]

best. A replicated system is responsive because data can be located wherever there is a demand. The issue of homogeneous versus heterogeneous hardware/software is thus minimized.

As a general aid in considering the categories together, a trade-off summary is presented in table 8. One notes that the partitioned system presents the best overall balance, primarily due to its strong performance in the areas of operations, economics, user considerations, and update features. Replicated systems make up the next general group. The centralized system falls to the bottom, primarily because it is not flexible to growth, i.e., it is not cost effective. This becomes apparent when one considers the economic factors of expansion.

Figure 1 gives a graphical comparison of the system configurations in the general performance areas, viz., performance, retrieval, and update. One notes that the performance graph alone provides a good measure of a balanced retrieval- versus update-based system. One must also conclude that there are other important factors which contribute to performance characteristics; hence the discrepancy in the curves. There is definitely room for improvement in the retrieval and update areas. Results similar to the performance ratings must be realized to produce a truly well-rounded system.

A grouping of major "trade-off" areas is shown in figure 2. The effect of performance on economics is especially noticeable. Although it might appear that a system should be selected which minimizes these discrepancies in the graphs (and thus eliminates the trade-off decision), the better solution is to choose a scheme which presents high scores for all important areas. The simple partitioned configuration then becomes a good choice. Replication is definitely the most volatile; the distributed replicated heterogeneous hardware system eliminates the trade-off factor. This oddity occurs because the balance between performance and cost is maintained by the amount of duplicated data in the system. However, one can see that such a configuration is definitely not generally the "best" choice.

The user and size factors are compared in figure 3; apparently, no matter how good the performance, a replicated system is a potential disaster if the data base has a possibility for extensive growth. Only the partitioned system presents a safe choice when the data base size is volatile.

Table 8. Trade-off summary: each data organization (centralized, partitioned, and the distributed partitioned and replicated variants on homogeneous or heterogeneous hardware) is scored on the operational, performance, economic, user, update, and data base size/schema categories, with a mean per configuration. (Numeric entries are illegible in this copy.)

Fig. 1. Query, update, and overall system performance scores versus data organization.

Fig. 2. Economics versus system performance by data organization.

Fig. 3. Data base size, users, and system performance by data organization.

data base size is volatile: one negative factor has a definite influence over all others. Finally, we can validate our results by looking at other relevant work. Most current and earlier research papers have asked: "What is the optimal data file allocation scheme (in terms of cost) for a system with a given degree of replication?" These papers consider the factors of system node distribution, communication line speeds, data storage costs, level of data sharing, and behavior of access patterns. Mathematical models for answering this question for a given system have been developed [17,19]. However, only [10,11] and [9] present generalized results. They conclude that simple partitioning is recommended for systems whose traffic consists of a majority of updates. Centralization is also an acceptable alternative. A replicated system becomes viable only when updates drop below the ten percent level. Tables 3 and 4 of this paper agree with these results. When updates become a primary consideration, table 3 indicates that the centralized or partitioned schemes provide good results. However, when retrievals provide the primary traffic, the replicated system performs better, as shown in table 4.
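The ten-percent threshold can be made concrete with a toy per-transaction cost model in the spirit of the allocation studies cited above ([9,10,11]): in a replicated system reads are served by a local copy but every update must be propagated to all copies, while a partitioned system pays a roughly flat remote-access cost regardless of the mix. All coefficients below are hypothetical assumptions, not taken from those models; with these particular values the break-even point happens to fall near ten percent updates, but the exact threshold depends on the number of copies and the propagation cost.

```python
# Toy cost model: replicated vs. partitioned placement.
# Every coefficient here is an illustrative assumption.

def partitioned_cost(remote_access=1.0):
    # One copy of each fragment: any access, read or write, may have
    # to travel to the single node holding the data, so the cost is
    # roughly flat in the update mix.
    return remote_access

def replicated_cost(update_fraction, n_copies=5, local_read=0.1, propagate=2.0):
    # Reads are served by the local copy; each update must be
    # propagated to every one of the n copies.
    reads = (1.0 - update_fraction) * local_read
    writes = update_fraction * n_copies * propagate
    return reads + writes

def prefer_replication(update_fraction):
    return replicated_cost(update_fraction) < partitioned_cost()

for pct in (5, 10, 25, 50):
    print(f"{pct:3d}% updates -> replicate: {prefer_replication(pct / 100)}")
```

With these assumed coefficients replication wins at 5% updates and loses from 10% upward, illustrating why the cited results recommend partitioning or centralization for update-heavy traffic.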


7. Conclusion

The evaluation tool presented in this paper has proved to be easy to use and an effective device for comparing data base system configurations. One can measure trade-off factors in developing a system and use them to view progress in the improvement of the state of the art. One must obviously ask what conclusions can be drawn from the tables and graphs in terms of a "best" configuration. First, a partitioned system has become a realizable, attractive replacement for existing centralized systems. There no longer need be fear about building and using such a configuration. Second, distributed systems which replicate data should be considered for use only in specialized situations, e.g., exclusive on-line retrieval with batch updating. The potential for ruin in a replicated system directly increases with the amount of growth projected. Finally, problems still exist in the use of heterogeneous hardware and software in a system. The lack of maturity and use of such systems remains the culprit. Hopefully, research in this area will bring their use up to the level of success currently enjoyed by homogeneous systems.

References

[1] P.A. Alsberg and J.D. Day, "A Principle for Resilient Sharing of Distributed Resources," Report from the Center for Advanced Computation, University of Illinois at Urbana-Champaign, Urbana, IL, 1976.
[2] P.A. Alsberg, G.G. Belford, S.R. Bunch, J.D. Day, E. Grapa, D.C. Healy, E.J. McCauley, and D.A. Willcox, "Synchronization and Deadlock," CAC Document Number 185, CCTC-WAD 6503, Center for Advanced Computation, University of Illinois at Urbana-Champaign, Urbana, IL, March 1, 1976.
[3] J.L. Berg, Editor, "Data Base Directions: The Next Steps," Proceedings of the Workshop of the National Bureau of Standards (Pub. 451) and ACM, Ft. Lauderdale, FL, October 29-31, 1975.
[4] P.A. Bernstein, N. Goodman, J.B. Rothnie, and C.A. Papadimitriou, "Analysis of Serializability in SDD-1: A System for Distributed Data Bases (The Fully Distributed Case)," Proceedings First International Conference on Computer Software and Applications, IEEE Computer Society, Chicago, IL, November 1977. (Also available from Computer Corporation of America, 575 Technology Square, Cambridge, MA 02139, as Technical Report No. CCA-77-05.)
[5] P.A. Bernstein, N. Goodman, J.B. Rothnie, and D.W. Shipman, "The SDD-1 Redundant Update Algorithm: The General Case," Computer Corporation of America, 575 Technology Square, Cambridge, MA 02139, Technical Report No. CCA-77-49, August 1, 1977.
[6] P.B. Berra, "Data Base Machines," ACM SIGIR Forum, Volume XII, Number 3, Winter 1977, pp. 4-23.
[7] G.M. Booth, "Distributed Data Bases: Their Structure and Use," Infotech State of the Art Report, Distributed Systems, pp. 201-213, 1976.
[8] R.E. Canaday, et al., "A Back-end Computer for Data Base Management," CACM, Volume 17, Number 10, October 1974, pp. 575-582.
[9] R.G. Casey, "Allocation of Copies of a File in an Information Network," Proceedings AFIPS SJCC, Volume 40, 1972, pp. 618-623.
[10] W.W. Chu, "Optimal File Allocation in Computer Networks," in Computer Communication Networks, F.F. Kuo, Editor, Computer Applications in Electrical Engineering Series, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1973.
[11] W.W. Chu, "Performance of File Directory Systems for Data Bases in Star and Distributed Networks," Proceedings AFIPS NCC, Volume 45, June 1976, pp. 585-593.
[12] P.J. Down and F.E. Taylor, "Why Distributed Computing," NCC Publications, 1978.
[13] C.A. Ellis, "A Robust Algorithm for Updating Duplicate Data Bases," Proceedings 1977 Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Laboratory, University of California, Berkeley, CA, May 1977, pp. 146-158.
[14] K.P. Eswaran, J.N. Gray, R.A. Lorie, and I.L. Traiger, "The Notions of Consistency and Predicate Locks in a Data Base System," CACM, Volume 19, Number 11, November 1976, pp. 624-633.
[15] J.P. Fry and E.H. Sibley, "Evolution of Data Base Management Systems," Computing Surveys, Volume 8, Number 1, March 1976, pp. 7-42.
[16] P.F. King and A.J. Collmeyer, "Data Base Sharing - An Efficient Mechanism for Supporting Concurrent Processes," in Data Base Management, B. Shneiderman, Editor, 1978, pp. 110.
[17] K.D. Levin and H.L. Morgan, "Optimizing Distributed Data Bases - A Framework for Research," Proceedings AFIPS NCC, Volume 44, June 1975, pp. 472-478.
[18] M.S. Loomis and G.J. Popek, "A Module for Data Base Distribution," Symposium on Trends and Applications 1976: Computer Networks, IEEE, pp. 162-169, 1976.
[19] S. Mahmoud and J.S. Riordon, "Optimal Allocation of Resources in Distributed Information Networks," ACM Transactions on Database Systems, Volume 1, Number 1, pp. 65-83.
[20] J. Martin, "Principles of Data Base Management," Prentice-Hall, Inc., Englewood Cliffs, NJ, 1975.
[21] F.J. Maryanski, "A Survey of Developments in Distributed Data Base Management Systems," Computer, Volume 11, Number 2, February 1978, pp. 28-38.
[22] F.J. Maryanski and V.E. Wallentine, "A Simulation Model of a Backend Data Base Management System," Proceedings Pittsburgh Modeling and Simulation Conference, April 1976, pp. 252-257.
[23] H.S. Meltzer, "Data Base Concepts and an Architecture for a Data Base System," Proceedings of SHARE XXXIII, Boston, August 1969, pp. 315-470.
[24] A.G. Merten and J.P. Fry, "A Data Description Language Approach to File Translation," Proceedings ACM SIGMOD Workshop, May 1974, pp. 191-205.
[25] R.L. Nolan, "Computer Data Bases: The Future is Now," Harvard Business Review, September-October 1975, p. 101.
[26] I. Palmer, "Data Base Systems: A Practical Reference," Q.E.D. Information Sciences, Inc., Wellesley, MA, 1976.
[27] N.S. Prywes, "Structure and Organization of Very Large Data Bases," in Critical Factors in Data Management, F. Gruenberger, Editor, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1969.
[28] C.V. Ramamoorthy, C.S. Ho, T. Krishnarao, and B.W. Wah, "Architectural Issues in Distributed Data Base Systems," Proceedings Third International Conference on Very Large Data Bases, October 1977, pp. 121-126.
[29] B. Ruder, "Security Mechanisms for Protecting Data in a DBMS Environment," Auerbach Information Management Series 22-03-11, Auerbach Publishers Inc., 1976.
[30] V.L. Shatz, "Computer Network for Retail Stores," Computer, 1973, pp. 21-25.
[31] L.B. Smith, "Data Independence in DBMS," Auerbach Information Management Series 22-03-08, Auerbach Publishers, Inc., 1976.
[32] M. Stonebraker and E. Neuhold, "A Distributed Data Base Version of INGRES," Proceedings 1977 Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Laboratory, University of California, Berkeley, CA, May 1977, pp. 19-36.
[33] S.Y.W. Su and H. Lam, "A Semi-Automatic Data Translation Scheme for Achieving Data Sharing in a Network Environment," Proceedings ACM SIGMOD Workshop, May 1974, pp. 227-241.
[34] D. Tsichritzis, "Features for a Conceptual Schema," CSRG Report 56, University of Toronto, 1975.