INFORMATION SCIENCES 29, 89-91 (1983)
Introduction: Special Issue on Database Management

SAKTI P. GHOSH
IBM Research Laboratory, San Jose, California 95193
The transition to logical processing of nonnumerical data in a computer started in the early sixties. In the early days of this fascinating branch of information processing (almost ninety percent of human communication is conveyed by nonnumeric information), research was confined to the use of computers for indexing literature. The Library of Congress and the major journals could foresee the information explosion and turned to the computer for help. The goal of the early researchers was to process natural language using computers; though we have crossed many milestones since then, that journey is far from complete.

In the mid sixties, the computer manufacturers took the bold step of introducing information management systems. These systems had very limited capabilities for logically manipulating nonnumeric information, as very little research had been performed in this area at that time. It is difficult to assess how much benefit these early systems brought their customers, but they surely did stimulate the minds of many researchers. Two new areas of research started incubating simultaneously, namely, research in access methods and research in modeling databases. Considerable progress has been made in both of these areas, and in the late seventies a new dimension was added to this field of research, namely distributed databases.

The pioneer models which have survived the test of time are Childs' entity model, Codd's relational model, and Bachman's network model. Many hundreds of researchers have contributed to the understanding and implementation of these models. Like any imprecise branch of knowledge, these models are only a framework for explaining and understanding the events of the real world. They have served the purpose of advancing research and implementing database management systems for practical purposes. Each of these models has its limitations in expressing real-world phenomena, developing user-friendly languages, providing fast access to information, deriving intelligent information from stored information, etc. Thus, a second generation of models was invented to fill the gaps left by the pioneer models.

In this special issue on database management, we have tried to give you a snapshot of the state of the research that is currently being performed in the leading universities and industrial research centers. Some of these papers will relate to further advances within the second generation of models, some will discuss the evaluation of query languages, and some will examine the new area of distributed databases.
In the early sixties, queries were directed to databases using CALLs; the concept of query languages evolved in the seventies after the invention of the pioneering database models. As computer systems evolved into interactive systems, query languages came to be supported by such systems. Thus, query languages were exposed to casual users, and human factors became a major consideration in query language development. Fred Lochovsky and Dennis Tsichritzis, in their paper, discuss some systematic criteria for evaluating interactive query languages. Their approach consists of dividing the querying process into three parts: request, reply, and dynamics. They identify certain desirable characteristics that describe the parameters of user interaction. These parameters provide the framework for comparison of different interactive query languages. Six different query types (keyword, by example, natural language, menu, graphic, and multimedia) are used for comparison purposes. Each query type is described in terms of the interaction parameters, and the tradeoffs among the desirable characteristics for the different query types are exposed. The importance of human factors analysis in such interactive query languages is emphasized.

Gio Wiederhold, in his paper, discusses a fundamental theory for modeling databases. The issues he raises are vital, not only for understanding database models, but also for developing new models and implementing them on computer systems. He touches on issues such as what the model is supposed to do, what the relation is between data and the real world, how knowledge and the real world complement each other, how to use knowledge and data, the concept of desiderata, the objectives of the database model, issues related to independence of database implementation, application independence issues, application semantics, and issues related to performance prediction of implementations.

The entity-relationship model is a second-generation model, introduced by Peter Chen. This model is based on the good features of two pioneer models: the entity set model and the relational model. It has received considerable attention from many researchers. Peter Chen, in his paper, demonstrates how the entity-relationship model can be used to create an entity-relationship diagram for English sentence structures. His paper provides some elegant rules for translating English sentences, used for describing the information requirements of an information system, into database schemas. This is the first step in designing a complex information management system. His paper studies the correspondence between English sentence structures and entity-relationship diagrams, showing how the basic structures of English sentences (nouns, verbs, adjectives, adverbs, and clauses) have counterparts in the entity-relationship diagrammatic technique.
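To make this kind of correspondence concrete, the following is a minimal sketch of such a mapping; the example sentence, the rule correspondences, and all class and variable names below are illustrative assumptions, not Chen's actual notation or translation rules:

```python
# A sketch of translating an English requirements sentence into
# entity-relationship constructs, in the general spirit of Chen's paper:
# common nouns map to entity types, transitive verbs to relationship
# types, and "has a" phrases to attributes. The data structures and the
# sentence are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class Relationship:
    name: str
    participants: tuple  # the entity types the relationship connects

# Sentence: "A department employs employees; an employee has a salary."
dept = Entity("DEPARTMENT")                      # common noun -> entity type
emp = Entity("EMPLOYEE", attributes=["salary"])  # "has a salary" -> attribute
employs = Relationship("EMPLOYS", (dept, emp))   # transitive verb -> relationship

print(f"{dept.name} --{employs.name}--> {emp.name}, "
      f"attributes of {emp.name}: {emp.attributes}")
```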
Database models, up to now, have been developed for databases in general; only recently have researchers focused on developing database models for specific types of databases. Stanley Su, in his paper, outlines the semantics associated with corporate and scientific-statistical databases. The model uses concepts and associations which are characteristic of the data found in these types of environments. Seven general association types are defined and distinguished on the basis of their structural properties, operational characteristics, and semantic constraints. These are used as the basic constructs for modeling complex semantic relationships in these types of databases. The model is characterized by its expressive power, its recognition of complex data types, its support of the principle of relativism, and its special treatment of statistical attributes. The network structure and tabular structure for representing the conceptual and implementation design of databases based on this model are also included in the paper. Some algebraic operations for some of the relations have also been defined.

One of the great features of Codd's relational algebra was the concept of normalization of an arbitrary database relation. Normalization is often used to eliminate redundancies caused by dependency constraints in relational databases. Normalization of relations has led to the development of many simple query languages and has eliminated many ambiguities in the algebraic manipulation of relational data. Algebraic manipulation of unnormalized relations is difficult because of the underlying complex mathematical structures. Yahiko Kambayashi, Katsumi Tanaka, and Koichi Takeda, in their paper, examine the problem of how to synthesize unnormalized relations that carry more semantics while at the same time reducing the redundancies caused by dependency constraints. They introduce four basic operations (group-by, row-nest, column-nest, and relation-nest) for deriving unnormalized relations from a relation in normal form. They show that functional dependencies can be enforced using the group-by operation; that a join dependence can be represented by a combination of group-by, row-nest, and relation-nest operations; and that column-nest operations can be used for a hierarchy of related attribute sets. The paper also provides procedures for representing an important class of constraints consisting of one join dependence.
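As a rough illustration of the flavor of such operations, the sketch below folds a flat, normalized relation into an unnormalized relation carrying a collection-valued column. The nest function, the attribute names, and the sample data are generic assumptions for illustration; they do not reproduce the authors' four operations:

```python
# A generic sketch of a "nest"-style operation: fold a flat (normalized)
# relation into an unnormalized relation with a collection-valued column.
# This is illustrative only; it is not the group-by/row-nest/column-nest/
# relation-nest definitions of Kambayashi, Tanaka, and Takeda.

def nest(relation, group_attrs, nested_attr):
    """Group tuples (dicts) on group_attrs, collecting nested_attr values."""
    groups = {}
    for row in relation:
        key = tuple(row[a] for a in group_attrs)
        groups.setdefault(key, set()).add(row[nested_attr])
    return [dict(zip(group_attrs, key), **{nested_attr: sorted(vals)})
            for key, vals in groups.items()]

# Flat relation: the manager is repeated for every employee, a redundancy
# induced by the functional dependency dept -> manager.
flat = [
    {"dept": "sales",   "manager": "Ann", "employee": "Bo"},
    {"dept": "sales",   "manager": "Ann", "employee": "Cy"},
    {"dept": "service", "manager": "Di",  "employee": "Ed"},
]

# Unnormalized form: one tuple per department, employees as a collection.
for row in nest(flat, ["dept", "manager"], "employee"):
    print(row)
# {'dept': 'sales', 'manager': 'Ann', 'employee': ['Bo', 'Cy']}
# {'dept': 'service', 'manager': 'Di', 'employee': ['Ed']}
```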
Distributed databases are a relatively new area of database research. Security and data autonomy become critical issues when data are distributed over many computers and can be accessed by many users. Selinger et al., in their paper, discuss the site autonomy issues in a distributed database management system being developed at the IBM Research Laboratory in San Jose. They discuss the economics of shared databases versus sole ownership. They define site autonomy as a site's control over its own data, i.e., who may access which data, who may modify which data, etc. The paper discusses some of the issues that the requirement of site autonomy raises in the implementation of a DDBMS. It also discusses how vital functions such as authorization, query compilation and binding, catalog management, and transaction commit protocols are affected by the notion of site autonomy.
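A toy sketch can suggest what such local control means in practice; the Site class below, its methods, and the sample data are invented for illustration and say nothing about the actual interfaces of the system described in the paper:

```python
# A toy sketch of site autonomy: each site owns its catalog and makes its
# own authorization decisions, with no central authority. Names and
# structure are illustrative assumptions only; they do not reflect the
# distributed DBMS discussed by Selinger et al.

class Site:
    def __init__(self, name):
        self.name = name
        self.catalog = {}   # table name -> data owned by this site
        self.acl = {}       # table name -> set of users allowed to read

    def create_table(self, table, rows, readers):
        # Only this site records the table and decides who may read it.
        self.catalog[table] = rows
        self.acl[table] = set(readers)

    def read(self, user, table):
        # Authorization is checked locally at the owning site.
        if user not in self.acl.get(table, set()):
            raise PermissionError(f"{self.name}: {user} may not read {table}")
        return self.catalog[table]

san_jose = Site("san_jose")
san_jose.create_table("parts", [("bolt", 100)], readers={"alice"})
print(san_jose.read("alice", "parts"))   # [('bolt', 100)]
# san_jose.read("bob", "parts") would raise PermissionError
```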