Use of the SQL and RDA standards in data management

Use of the SQL and RDA standards in data management

199 Information Systems Engineering Division David K. JEFFERSON, Chief Use of the SQL and RDA Standards in Data Management Leonard GALLAGHER, Joan ...

616KB Sizes 0 Downloads 44 Views

199

Information Systems Engineering Division David K. JEFFERSON,

Chief

Use of the SQL and RDA Standards in Data Management Leonard GALLAGHER, Joan SULLIVAN and Joseph COLLICA

1 Introduction

ICST, US Department of Commerce, NIST, Gaithersburg, MD 20899, USA

In the 1990s standards for interoperability, and for portability of data and applications, will become critical as new and complex information systems are developed. These systems will have logically integrated databases of diverse data under the management and control of heterogeneous database management systems (DBMS) with physical storage in geographically separated data banks. Data complexity and data volume will be enormous. Data objects will include documents,

This paper discusses how existing and proposed Database Language SQL and Remote Database Access (RDA) standards can fulfill critical operational requirements for data management. SQL and RDA are but two components of a family of data management standards needed to meet requirements for emerging distributed processing systems. Requirements for needed capabilities such as object-oriented database, distributed database, and database export/import are discussed.

Keywords: Database languages; Distributed database; Objectoriented databases; (RDA); SQL.

Remote

Database

Access

Leonard Gallagher is a computer scientist i'.. the Information Systems Engineering Division responsible for data models, database standardization, and integration of database technology with new approaches to information management such as knowledge and object-oriented systems and hypertext. He has been a member of the ANSI/X3 technical committee on Database, X3H2, since 1979 and chairs the international development group for follow-on enhancements to the ISO SQL standard. At NIST, he also leads an inter-division project for research and experimentation into hypertext and hypermedia systems. Dr. Gallagher received the B.A. degree in mathematics from St. John's University of Minnesota in 1965 and the Ph.D. in mathematics from the University of Colorado in 1972. He taught mathematics at the Catholic University of America for 6 years and has been involved in database research at NIST for the past 12 years. North-Holland Computer Standards & Interfaces 10 (1990) 199-205 0920-5489/90/$03.50

Joan M. Sullivan is project manager for SQL Validation Tests at the National Institute of Standards and ,~ Technology (NIST). Joan is a Computer Scientist in the National Com~ puter Systems Laboratory of NIST. She received a B.S. in Mathematics ~ from Trinity College in Washington, D.C. and attended the University of Virginia 1969-1970 in Theoretical Mathematics. Joan has over 15 years experience in designing, programming -- ~ and managing computer applications for NIST, NOAA and NSRDC. As a member of the database committee X3H2, she participated in the development of the current SQL standard. Joan is currently enhancing the NIST SQL Test Suite and formalizing testing procedures for validating commercial products for conformance to the SQL standards. Joseph C. Collica is Manager of the Database and Graphics Group in the National Institute of Standards and Technology's National Computer Systems Laboratory. He has over twenty years of experience in research, development, and standardization of programming languages, database management systems, computer graphics, and conformance testing of software systems. The scope of the database and Graphics Group's activities includes research and the development of guidelines and standards in database management systems, expert systems, hypertext systems, and support for the Information System Engineering Division's Database, KnowledgeBased, Computer Graphics, Geographic Information System, and Validation Laboratories.

© 1990 - Elsevier Science Publishers B.V. (North-Holland)

200

L. Gallagher et al. / SQL and RDA Standards in Data Management

graphics, alphanumeric records, images, voice, and video, with each object type likely created and manipulated by a different type of processing system and stored in different databases. Description and management of these multiple databases will be achieved using a three-schema architecture under the administration and control of a logically integrated Information Resource Dictionary System (IRDS). An example of such an application is the US Department of Defense's Computer-aided Acquisition and Logistic Support Program. NIST is assisting the CALS effort in developing and testing standards for digital technical information. The goal is to move from paper-intensive acquisition and logistic processes to a highly automated and integrated mode of operation for the weapon systems of the 1990s. CALS will facilitate data integration, exchange, and access among government and industry maintained databases and eliminate the development of duplicate data. Traditional DBMS data types (alphanumeric and numeric) will be inadequate to represent CALS data. New DBMSs will be needed to encompass new data types, such as graphics, documents, images, voice, and video. DBMS functionality will need to be accessible from (integrated with) specialized processors and editors which store and manipulate these new data types. In addition, DBMSs will need to support integrity rules and algorithmic manipulations for specialized data types such as geographic location, date/time, geometric properties, temporal and spatial properties and a rich variety of user-defined data types. Another requirement will be object-oriented facilities to create and manipulate objects with complex structures governed by numerous interrelated rules. These facilities are critical to advanced logistics applications, but they are also needed by traditional business applications, in order to provide greater data integrity and higher-order (less procedural, data-independent) data manipulation. Still another requirement will be knowledgebase facilities, encompassing artificial intelligence rule-based algorithms for inferencing and decision-making. DBMS functionality will need to be broadly accessible by other software processes, for example user-interface software, such as hypertext processors, Fourth Generation Languages (4 GLs) and interactive query processors.

2 Database Language SQL Database Language SQL [1] is one important component of the CALS family of data management standards. Database Language SQL is appropriate for the definition and management of one class of data, herein called 'repetitive data.' Repetitive data includes any of the diverse data elements mentioned above that are structured into repeated occurrences having a common data structure definition. A large proportion of CALS data is or will be repetitive. IRDS data is repetitive, so the IRDS may be implemented as an application over an SQL database management system. SQL is appropriate for logistics applications which are business-oriented. Such systems traditionally collect data in forms (or their automated equivalent - formatted screens) and produce periodic reports, direct data lookup and a variety of cross-references. SQL is appropriate for storing CAD/CAM-generated data and for relating graphical data (such as a design drawing) to nongraphical properties (such as materials, tolerances, and configuration data). SQL is appropriate for storing facts and possibly rules for knowledgebased systems. SQL may be used to store documents and the associated management information. Database Language SQL was first adopted in 1986 by the American National Standards Institute (ANSI). Later, in 1987, an identical specification was adopted by the International Organization for Standardization (ISO) and by the NIST as a Federal Information Processing Standard (FIPS). A revised standard, with optional integrity enhancements, was adopted by ISO in May 1989, and ANSI in 1989. An Embedded SQL specification, for embedding SQL statements into six programming languages, was adopted by ANSI in April 1989. The integrity enhancement and the embedded SQL specification are included in the revised FIPS [5]. These standards are mutually compatible and are referenced in the following paragraphs as SQL-1989. Certain complex objects, such as documents, hypertext, and graphics, may best be created and manipulated using non-SQL systems. However, the descriptions and storage of such data are often repetitive and thus may properly occur in SQL databases or be integrated with (appear to the user to be in) SQL systems. Other complex objects (e.g.

L. Gallagher et al. / SQL and RDA Standards in Data Management

object-oriented generalization or specialization hierarchies) may best be handled by extensions to SQL. Unfortunately, needed extensions go beyond the current proposed extensions to SQL. The proposed SQL2 standard [7], expected in 1992, will greatly enhance the ability to handle dynamic data and will give data administrators greater control over exception handling and schema management. A substantial upward compatible enhancement to SQL-89, SQL2 will standardize a number of SQL features not included in the original specification because they were not commonly available in SQL products. The technical specification for SQL2 is quite stable; only a very few features are controversial and subject to modification. An emerging SQL3 standard [8], expected in 1994, will provide additional facilities to manage object-oriented data and will include support for recursive data, as well as 'triggers' and 'assertions' to form the basis of 'intelligent' database management systems. This standard is under development with publication expected in the 1994 time frame. The SQL3 features are preliminary and subject to substantial modification before final adoption. 3 Benefits of S Q L Standards

The primary goals in standardizing SQL are: • portability of user applications (standard-lan-

guage programs accessing data via a relational database) across heterogeneous hardware, including replacement computers, and • portability of programmer training and end-user training in data definition and manipulation. Beyond the obvious user applications which SQL was designed to support are numerous surprising and creative uses of SQL where there was a need for a data definition and manipulation language. Examples are: • SQL is a data specification language in its own right. Data modelers use the schema definition language of SQL to express data structures and integrity constraints. Where the current standard is deficient in expressing integrity constraints, complex SQL retrievals can be used to express constraint violations. Additionally, SQL retrievals or views can be used in activity modelling and physical database design to express data abstractions needed for tasks and transactions.

201

• SQL is used as the data access sublanguage of m a n y 4 GLs. Although most 4 GLs differ greatly in their functionality and syntax, m a n y share a core of standard commands - SQL. The demand for SQL among users is so great, that most 4 GLs lacking SQL are in the process of being upgraded to offer that interface. • SQL language is used between databases (often 'front end' and ' b a c k end') to request data extractions. SQL SELECT statements, sent as messages, are answered with flat-file representations of tables. There is a clear opportunity for SQL to serve as a data specification language in a distributed database environment. • SQL is used ' u n d e r the covers,' by developers of off-the-shelf software, as a replacement for internal application-specific file structures. This has the advantage of making the application code more data-independent, less procedural and easier to write. An additional advantage is that the data is accessible to other applications and to end users. There are other automatic benefits in data integrity, security, concurrent update, backup, recovery, etc. • SQL is a data specification language which can be used for communications between software processes which are not even databases. In the Automated Manufacturing Research Facility at the National Institute of Standards and Technology (NIST), a subset of SQL is used, in lieu of developing a new language, to communicate information among the various controllers in the automated manufacturing system. That is, SQL is used to extract and populate engineering data structures among C A D design workstations, the shop floor management process, robots, machine tools (e.g. milling machines), robot carts, material transfer devices, and other components on the shop floor. SQL is a very popular and pervasive database language. Its place in history and its persistence in software engineering are assured by its success, its versatility, the investments by implementors who offer SQL, the investments by software developers who write applications using SQL, and the investment in training for programmers and end-users. 4 R e m o t e Database A c c e s s Standard

The Remote Database Access (RDA) [2,3] standard is another important component of the

202

L. Gallagher et al. / SQL and RDA Standards in Data Management

CALS family of data management standards [4]. The emerging generic Remote Database Access (Generic RDA) standard will provide standard protocols for establishing a remote connection between one local data management system acting as a 'client' and a second system acting as a 'server.' The Generic R D A standard consists of service elements for association control, for transfer of database operations and parameters from client to server, for transfer of resulting data from server to client, and for 'pairwise' transaction management, including two-phase commit. Remote Database Access (RDA) is under development by Joint Technical Committee One (JTC 1) of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). It is specified in two pieces, a Generic R D A for arbitrary database connection and an SQL Specialization for connecting databases conforming to Database Language SQL. The initial specifications for both pieces have been completed, but active refinements are still underway. The formal review process began in early 1989 with final adoption expected in late 1991. Remote Database Access provides standard protocols for establishing a remote connection between a database client and a database server. The client is acting on behalf of an application program or remote process, while the server is interfacing to a process that controls data transfers to and from a database. The communications protocols are defined in terms of Open Systems Interconnection (OSI) standards for Association Control (ACSE), Remote Operations (RO), and Transaction Processing (TP). The goal is to promote the interconnection of database applications among heterogeneous environments. R D A is appropriate for remote access to a database in any context where OSI protocol standards for RO, ACSE, and TP have already been established. It is expected that R D A will become the basis for all interconnection among SQL database management products from different vendors. Interconnection among database products from the same vendor will likely continue to use vendor specific communication and interchange forms. Multiple data management systems, including SQL and I R D S as well as non-standard document processing or graphics systems, can use Generic

R D A for remote data access, provided that both client and server understand the non-standard data management statements that cross the interface. An SQL Specialization, proposed as part of the Remote Database Access (RDA) standard, will provide R D A specializations for connecting two heterogeneous SQL database management systems, for passing SQL schema manipulation and SQL data manipulation statements from client to server, and for returning SQL query results from server to client. The R D A / S Q L Specialization is what makes possible a standardized 'distributed processing' environment for SQL databases.

5 Object-Oriented Database Management The object-oriented approach to database management has emerged from the concepts of object-oriented p r o g r a m m i n g languages and database technology. These two approaches are natural complements of each other. The programruing language approach emphasizes the processing of complex structures and local data while the database approach focuses .on data sharing, persistence, and support for large amounts of data. A need exists to extend database management technology to support broader application areas such as C A D / C A M and graphics applications. Characteristics from object-oriented programming languages that are appearing in object-oriented database are encapsulation, data abstraction, dynamic binding, and inheritance. Encapsulation is the packaging of data and procedures into one entity - the object. Data abstraction is a programming technique that hides a data structure behind a set of procedures that the programmer does not need to know to manipulate the data. Dynamic binding is a feature that allows the programmer to send a message to an object without knowing the class or type of the object to take advantage of data abstraction. Binding occurs at run time. Entities with a c o m m o n definition and behavior are grouped in a class. For inheritance, a class may inherit attributes and operations from 'superclasses' and may have its attributes and operations inherited by 'subclasses.' Recently, a lot of interest in object-oriented database has been shown by the database community. Several database management systems

L. Gallagher et al. / SQL and RDA Standards in Data Management

based on the object-oriented data model have become available in the marketplace. In the object-oriented database approach, objects are entities that can be grouped into classes. Classes can be related hierarchically forming superclasses or subclasses. Attributes of the object can have a single value or multiple values. A database object corresponds to, represents, or models a real world object. Each object is unique and identified independent of attribute values, independent of record types in the database, and independent of location. The object-oriented approach allows the creation of arbitrarily complex and dynamic objects built up from other objects. Object-oriented database also allows the definition of operations in conjunction with the data structures.

6 Distributed Database Management Distributed database management means a totally integrated and fully functional distributed database with heterogeneous SQL systems residing at each local node. This requirement cannot be met with existing or proposed S Q L / R D A standards. However, new standards for distributed database management can be built on top of existing SQL and R D A standardization efforts. Therefore, the SQL and R D A efforts are vital building blocks for future distributed database management standards. R D A provides for 'distributed transaction processing' in a client-server environment. R D A specifies a two-way connection between a client and a server, as well as transfer syntax and semantics for SQL database operations. R D A does not specify the syntax for invoking R D A services from an application program or in schema definition. Also, R D A does not specify how a global transaction manager might handle multiple outstanding connections. The schema statements that need to be specified include distributing a global schema into local components or combining previously defined local components into a global schema. Some of this work will be included in the SQL3 specification, but much more work needs to be done.

203

7 SQL Database Export/Import Database export is the act of unloading a database definition and the contents of an existing database into an external form, representable on various media, for the purpose of later automatic re-generation. Database import is the act of loading a database definition and contents from an external source. These activities are needed for the efficient management of any database, but are especially critical to support data administration functions such as backup, recovery, archiving, and database interchange in a heterogeneous distributed processing environment. If database e x p o r t / i m p o r t occurs under the control of a single homogeneous database management system, then there is very little need for standardization of the external representation. All that would be required is that the vendor specific representation of database objects be structured from primitive elements (e.g. 8-bit bytes, ASCII characters, etc.) that are representable on the desired exchange medium (e.g. tape, disks, OSI). The meaning of the transfer syntax would be known at both source and target, so the only standardization requirement might be for c o m m a n d syntax (e.g. LOAD, U N L O A D ) to convey a user's request to the system. This is the current state of affairs in existing commercial SQL conforming systems as the SQL standard is silent on external representations. This approach requires a single vendor's product at each node of the distributed network. Most existing SQL systems are able to unload database contents into ASCII format in human readable tables, with data elements represented in their 'literal' form as character strings and with elements, rows, and tables separated by some user specified delimiters. A crude form of database interchange can then be achieved. This is usually accomplished by interchanging schema definitions as character strings via the existing standard SQL schema definition language. Then, with human intervention to execute the proper load and unload commands, the database contents can be moved from source to target system in ASCII tabular format. This approach requires either a user developed global data manager or direct human intervention. To move toward the goal that all data be interchangeable among multiple vendors and products,

204

L. Gallagher et al. / SQL and RDA Standards in Data Management

all data under the control of any standard conforming SQL data managers should be readily accessible and mutually interchangeable. This requirement is partially addressed by the forthcoming RDA standard which provides standardized access to conforming SQL databases. The transfer syntax section of the R D A / S Q L Specialization [3] is particularly important because it specifies an OSI/ASN.1 external exchange representation for sequences composed of any data type defined in the SQL standard. This enables the result of an SQL query to be exchanged among conforming systems. Unfortunately, this section of the R D A / S Q L Specialization is not as completely specified in the current document as it should be, leaving several opportunities for different representations that may require expensive conversions during data interchange. It also does not yet address the new data types defined in the forthcoming SQL2 standard. The R D A / S Q L Specialization is not of itself sufficient for the unrestricted e x p o r t / i m p o r t of SQL databases. It does not provide for the exchange of whole schema definitions or multiple tables as one exchange object. What is needed in addition to RDA is a standard specification of a complete E x p o r t / I m p o r t facility, with syntax for invoking load and unload functions and with options to include or not include definitions for domains, assertions, or other integrity constraints associated with each exported object. It should be possible for a user to specify that a collection of tables and table definitions satisfying specific conditions and including specific integrity constraints be exported from System A to System B and that specific data occurrences be included as part of the package. For example, it should be possible to specify that all schema objects and all data occurrences associated with a given project be exported from System A and imported into System B, all as part of a single SQL transaction. The need for a standard E x p o r t / I m p o r t facility has already been recognized by I S O / I E C JTC1/SC21 with the approval of a new project titled 'Data Management: E x p o r t / I m p o r t . ' The intent is that this specification consist of one generic standard with separate specializations for IRDS and SQL. In addition to its obvious support for data administration, it is likely that e x p o r t / import facilities defined in this standard will provide the basis for specification of 'fragmentation'

or 'replication' schemas in a subsequent distributed DBMS standard.

8 NIST Activities NIST has represented the interests of Federal users on the Database Committee X3H2 since its inception in 1978, and has contributed to the technical content of the SQL standard. The NIST representative to X3H2 holds the office of International Representative for the committee and additionally serves as the Rapporteur on the international committee for database languages, I S O / I E C J T C 1 / S C 2 1 / W G 3 DBL Rapporteur Group. NIST has sponsored numerous enhancements to the various levels of SQL standardization, including proposals for user-defined functions, outer join, recursive tables, data type and case conversion functions. NIST has written papers to refine features such as d a t e / t i m e and referential integrity. NIST has performed extensive research into distributed database management. NBS Special Publication 500-154, Guide to Distributed D a t a base M a n a g e m e n t [6] defines and describes the characteristics and options of distributed database management. NIST has an extensive in-house laboratory-based research program that evaluates current developments and limitations in the stateof-the art of commercial distributed database management systems. Finally, NIST provides leadership in national and international standardization efforts in SQL and RDA. An SQL test suite has been developed by the NIST. PUB 127 [5] requires Federal agencies to implement SQL in relational database management system applications acquired or developed after August 3, 1988. The technical goal of the test suite is to help evaluate the conformance of implementations to FIPS SQL. Since there is no formal testing service in place, the NIST SQL Test Suite is available to vendors and users to use in a variety of ways. Despite the fact that formal testing is not required, the test suite has had an immediate impact on existing SQL products: Vendors are changing their products to pass more conformance tests. Previously, conformance to FIPS SQL was a general goal and a marketing assertion for SQL products. But now, it is a testable assertion. US government agencies

L. Gallagher et al. / S Q L and RDA Standards in Data Management

are asking vendors to provide test results a n d are i n c o r p o r a t i n g the test suite into their acceptance criteria for p r o c u r e m e n t s . N I S T expects to offer a formal testing service. This service will issue a certificate of v a l i d a t i o n to i m p l e m e n t a t i o n s that p e r f o r m satisfactorily o n the N I S T SQL Test Suite. This certificate w o u l d docum e n t , to the extent tested, the i m p l e m e n t a t i o n ' s c o n f o r m a n c e to F I P S SQL. N I S T currently provides eight test suite types: e m b e d d e d (preprocessor) C O B O L , e m b e d d e d F O R T R A N , e m b e d d e d Pascal, e m b e d d e d C, m o d u l e language C O B O L , m o d u l e language F O R T R A N , m o d u l e language Pascal, a n d m o d u l e language C. N I S T will soon be offering a suite of interactive SQL tests. A d d i t i o n a l test suite types, i n c l u d i n g m o d u l e language Ada, will be offered as resources b e c o m e available for their development.

References [1] ANSI, American National Standard - Database Language SQL with Integrity Enhancement, Number X3.135-1989,

205

American National Standards Institute, 1989. Revision of ANSI Standard X3.135-1986. [2] ISO, Information Processing Systems - Open Systems Interconnection - Generic Remote Database Access Service and Protocol, ISO DP 9579, Document ISO/JTC1/SC21 N3606, June 1, 1989. [3] ISO, Information Processing Systems - Open Systems lnterconnection - Remote Database Access - SQL Specialization, ISO Working Draft, Document ISO/JTC1/SC21/ WG3 N844, June 2, 1989. [4] NIST, Use of the IRDS Standard in CALS, D. Jefferson and C. Furlani, April 3, 1989. [5] Federal Information Processing Standard - Database Language S Q L , FIPS PUB 127 (US Department of Commerce, February 1990). [6] NIST, NBS Special Publication 500-154, Guide to Distributed Database Management (April 1988). [7] ISO, Database Language SQL2, ISO/IEC Draft Proposal 9075 Revised, Document JTC1/SC21 N3155, January 1989. Proposed revision of ISO 9075: 1989, the international equivalent of [1] above. [8] ISO, Database Language SQL3, Working Draft, Committee document J T C 1 / S C 2 1 / W G 3 / D B L CAN-3b, May 1989, also referenced as X3H2-89-151.