INFMMATIDN
ELSEVIER
Information and Software Technology 38 (1996) 467-475
AND SOFTWARE TEcHN(wocy
Migration from relational to object-oriented databases Simon Monk”‘*, John A. Marianib, Beshir Elgalalb, Helen Campbell” aComputing Department, University of Central Lancashire, Preston PRl 2TQ. UK bComputing Department, Lancaster University, Lancaster LA1 4YR, UK
Received 26 June 1995; revised 22 October 1995; accepted 10 November 1995
Abstract This paper examines the issues involved in the migration of database technology towards object-orientation. This includes issues of schema translation, data migration and the consideration of the whole database system including client programs. In addition, this paper reports some early results from a long-term project to provide support for the migration of data and meta-data from a relational to an object-oriented database. The initial tool processes the schema of a relational database and presents it to the user
as the equivalent object-oriented schema. The result of this processing is an initial version of the object schema and requires user intervention in the naming of new abstract classes built by the system. Keywords; Schema migration; Schema transformation;
Database migration
1. Introduction
Relational database technology has been in widespread use since the early 1970s. In the decades since then, this technology has been applied in a number of areas, some of which have stretched the technology to its limits; indeed, we would suggest some of these areas have used this technology inappropriately. In addition, database schema design suffers from many of the same problems as software design. An initial design may appear to meet the user requirements but once the system is in place it may be found that some requirements have been inadequately matched, incorrectly stated, or simply have been missing. This leads to the problem of schema maintenance and, just like software, schemata can be ‘hacked’ to meet these ‘new’ requirements. These considerations lead us to two similar problems. Firstly, we have the well-known legacy system problem [l] where users are unwilling to convert existing data and software to a new data model and associated DBMS (Database Management System) simply because they have a massive investment in the existing system. Clearly, we require automated support for such data migration, and indeed this is a growing area of research [2,3]. The second related problem is one of maintenance; if a schema is forced to ‘grow like Topsey’ over a number * email:
[email protected] 0950-5849/96/$15.00 0 1996 Elsevier Science B.V. All rights reserved 0950-5849(95)01090-4
SSDI
of years as new requirements for the use of the data are uncovered, this aggravates the job of maintaining the schema for the database administrator [4,5]. There is a need for automated support for the presentation of existing relational schemata in such a way that the administrator is able to understand the entities, relationships and attributes stored in the system. An obvious way to tackle this problem is to reverse engineer an ER (Entity Relationship) model [6] from the stored schema and present this to the user. This paper reviews the issues involved in the adoption of object-orientation by database systems and the different approaches that have been taken by system developers. This includes extensions of the relational model; the use of object-oriented front-ends to relational technology and systems for conversion of relational systems to object-oriented technology (system migration). We also describe some early results from a long-term project to provide support for both these problems. The initial tool processes the schema of a relational database and presents it to the user as the equivalent objectoriented schema. This serves to illustrate some of the issues involved in the system migration process and the nature of software tools that could be used to support the process. At the moment, this work is limited to the translation of a schema from object to relational and does not address the problem of transferring the data.
468
S. Monk et al./Information and Software Technology 38 (1996) 467-475
The remainder of this paper begins with a review of the different approaches that have been taken by system developers to the object-orientation of relational systems. The section after that looks at the proposed tool for supporting system migration and provides a simple example of its use. Subsequently future developments of the tool are explored and the paper concluded.
gained in making a full migration. It may offer suggestions for the reorganization of the relational schema. In the wider consideration, by examining existing relational schemata, we can arrive at some feel for whether object technology (at the schema level) offers us that much more than relational technology. 2.2. Extending the relational model
2. Technology migration This section reviews the issues involved in the adoption of object-orientation by database systems and the different approaches that have been taken by system developers. This includes attempts to extend the relational model and the use of object-oriented frontends to relational technology and systems for conversion of relational systems to object-oriented technology (system migration). 2.1. Horses for courses There continues to be a great deal of interest in the design and use of OODB (Object-Oriented Database) systems. Such systems find application in areas which require the storage of complex objects of many different types. This implies few objects, many types. This is diametrically opposed to the application of the successful relational systems, where there tends to be many objects, few types. Moreover, the storage of complex objects results in the use of several relations to store the component parts, while the use of symbolic pointers (or foreign keys) to link the parts into a whole leads to large and unwieldy collections of relations. OODBs, then, seem best suited to areas which can best be categorized as design environments, where there are many complex objects to be stored and manipulated by groups of people. Areas such as CAD/CAM (Computer Aided Design/Computer Aided Manufacture), Software Engineering Environments, Office Information Systems, etc. where there are complex artefacts being operated on by teams of designers. Although semantically richer than the relational model, the object model is not necessarily the best technology for all applications. Indeed, for a significant class of relational applications, there is no real advantage in a migration to object technology. The above arguments make the strong assumption that, in certain cases, object technology is superior to relational technology [7]. A further aim of our work is to consider the suitability of applying the object model to pre-existing relational schemata. The tool described later in this paper is capable of examining relational schemata and converting them to the equivalent object schema, making full use of inheritance. This process allows the user to assess whether there actually is anything to be
We must not lose sight of the fact that the benefits of adopting object-orientation lie in the enhanced data modelling capabilities of this model. Such capabilities are also largely present in the extensions to the relational model exemplified in extended relational systems such as POSTGRES [8,9]. Such systems have the advantage that since they are derived from relational technology and retain some degree of backwards compatibility they can offer a smoother migration path to object-orientation. In purely pragmatic terms it is the object-oriented and not the extended relational model that has become widely accepted as the successor to relational databases. 2.3. Object-oriented front-ends An approach adopted in many applications is to retain the underlying layer of relational technology for persistence, and add a layer of object-orientation on top. This object layer may be at a number of different levels. 2.3.1. Object-oriented DBMS layer A fairly common practice in the early days of OODBMS (Object-Oriented Database Management System) was to layer the object model over some existing database technology. The benefits of layering are essentially those of prototyping. With an existing storage module, the implementors do not need to waste any time or effort in what was the less exciting part of the work. Effort could be concentrated on the higher-levels addressing the questions: What kind of languages could be supported? What benefits does the model offer the end user? What kind of applications become possible? The IRIS systems, [ 10,l l] amongst others, [ 12,131 implements an object-oriented DBMS on top of an existing relational DBMS. An advantage of this is that the relational data is still accessible as relational data (albeit with much of the semantics factored out). The disadvantage of the system lies in the inefficiency of having to translate DML (Data Manipulation Language) commands between the layers. The severity of this problem to some extent depends on how the objects are mapped onto the underlying relations. However, it is clear that in, for instance, finding the parts of a composite object, pointer chasing an object-oriented
S. Monk et aLlInformation and Software Technology 38 (1996) 467-475
469
the ODBMS layer or considerable efforts must be made to introduce additional relations to model these extra intentional concepts. 2.3.2. Object-oriented application layer. Another way of providing object-orientation, used by a number of systems, is to provide an object-oriented application layer on top of a relational data model. In this case, no OODBMS layer is explicit but rather the application provides an object-oriented appearance to the underlying relational system. Using this architecture, it is relatively straightforward to provide a new user interface and extra functionality to an existing relational database. Languages (or rather environments) such as VisualWorks [18] extending the Smalltalk language provide automatic links to relational DBMSs such as Oracle and Ingres. Relations in these systems are mapped onto class definitions in the Smalltalk language. In this way, they can be queried either by SQL or directly as instances of a relation/class. Such systems work well for extending the life of an existing system, but are not as appropriate for new applications, since commercial OODBMSs such as Itasca [19] and Gemstone [20] can provide the same persistence an OOPL (Object-Oriented Programming Language) provides in a more tightly coupled package.
Fig. 1. Mapping objects to relations.
system is likely to be far more efficient than performing multiple joins in a hybrid system. The alternatives in object layering fall on a continuum that lies between the extremes of simple object representation and complete semantic translation (Fig. 1). In a primitive object representation, there is no attempt to reflect the semantic complexity of the ODBMSs schema, but rather the objects may for instance be mapped onto records of two relations called Object and Attribute that could have a structure similar to that shown in Fig. 2. Clearly the schema of Fig. 2 has little to do with the actual data model of the application system. A similar approach has been used in the Oggetto system [14] to map the objects onto a triple (binaryrelational model). Oggetto, never intended as an industrial strength system, provides the basis for a number of related research areas. For example, it has been used to investigate the appearance and functionality of relational-style algebraic operations [ 151,as the target for a number of novel database browsing mechanisms [ 16,171 and as the high-level end of a system utilizing a database engine. The triple store offers some attractive features. Meta data is stored as data, and can thus be manipulated by the same facilities. Many of the problems of schema evolution (without versions) are eased, as the disappearance of attributes from instances map down onto triple deletion/insertion. At the other extreme (complete semantic translation) each class will be mapped onto a relation bearing the same name and attributes. Since the relational model does not directly support inheritance, methods, object identity or set-valued attributes then either there must be compromises in the data modelling capabilities of
2.4. Schema translation and reverse engineering We believe that some data currently stored using relational technology could benefit from migration to objects and that for the systems that use this data to operate most efficiently it will in the long run be necessary to translate the existing databases to objectorientation. l
Attribute
Object
El
Some older ‘design environment’ systems have, as there were no viable alternatives at that time, been implemented using relational technology. Clearly, a migration to the better suited object technology would prove of benefit. As a simple example, the retrieval of a complex object should be more efficiently handled (and therefore faster) from the object base as opposed to the relational base which requires the execution of ‘select-project-join’ operations to form the complex object required [7].
1
Fig. 2. Primitive representation of objects.
410 l
S. Monk et al./Information and Software Technology 38 (1996) 467-475
Some systems may end up as hybrids, where older data has been stored as relations, and newer data as objects. This is a problem area addressed by issues of interoperability of heterogeneous database systems and technology [21,22]. The final solution (not suitable for all cases) is to use migration to convert all data to the single, currently supported technology (in our case objects).
The first step in this process is the derivation of an object-oriented schema from the relational schema of the existing system. There has been a significant amount of work published on the reverse engineering of existing relational databases. Normally this involves deriving an Entity Relationship (ER) or Enhanced Entity Relationship (EER) model from the relational schema [2,23,24]. A natural consequence of being able to produce an EER model is that the EER model can relatively easily be transformed into an object-oriented schema. 2.5. System migration The simple creation of an object-oriented schema neglects the important issues of recording the mapping of the data between the models and the issues relating to the greater system that comprises client programs with embedded queries that act on a database. We have termed this bigger problem ‘System Migration’.
3. Tool support for system migration Providing tool support for system migration requires an architecture that is capable of transforming the schema to an object-oriented form, converting the data to the new format and where possible providing interpretation facilities for the client programs of the database system. Fig. 3 shows the proposed architecture for our system. The converters which create the new schema and data and the translators which allow existing client programs to use the new database, can both be built by the system
migration workbench since during the design/reverse engineering process, the relationships between the new and old schemata can be recorded in a ‘conversion dictionary’. Thus far, the prototype only supports the creation of an inheritance hierarchy in a way that is configurable by the user. Once created, the hierarchy can be edited. This section describes both the algorithm for inferring the inheritance hierarchy and the prototype system used to explore its working. 3.1. Generating a default inheritance hierarchy To generate a default inheritance hierarchy, the relations of the schema are first copied onto a new object-oriented schema as class definitions. These class definitions are initially all subclasses of the root class of the object-oriented schema. That is, there is no use of inheritance to remove the redundancy of multiple identical attribute definitions in the different classes. This ‘unnormalized’ object-oriented schema is then processed to infer successive layers of the inheritance hierarchy. The algorithm that accomplishes this can be represented as the iterative process described below. This algorithm is based on the algorithm given in [25]. (1) Find the most numerous set of class definitions S who have the same parent and also have one or more attributes in common A. (2) Derive from that set of class definitions S a class definition C containing the common attributes. (3) Remove the common attributes from each class of S. (4) Make each of the classes in S a subclass of C. (5) Then make C a subclass of the common parent of S. (6) Repeat steps (1) to (5) until no new sets can be found containing attributes in common. This algorithm is perhaps best illustrated with an example. Fig. 4(i) shows the initial set of classes, all of which are subclasses of the root class.
Fig. 3. Proposed architecture.
S. Monk et al.lInformation and Software Technology 38 (1996) 467-475
(ii)
(0
471
(iii)
Fig. 4. Creating an inheritance hierarchy.
Step (1). Possible sets of classes with attributes in common are {X, Y}, {X, Z}, {Y, Z} and {X, Y, Z}. Of these, {X, Y, Z} is the largest and so it is chosen. The common attributes A will be {a b}. Steps (2)-(5). The new class definition is now inserted into the class hierarchy as shown in Fig. 4(ii). Step (6). The process starts again with Step (1) for the modified class lattice. Step (1). This time, there is only one set with common attributes {Y, Z}, here A = {d}. Steps (2)-(5). The new class is inserted into the class hierarchy to produce the final inheritance hierarchy shown in Fig. 4(iii). Step (6). There are no further sets of classes with attributes in common, and so the operation is completed. Where there are a number of possible sets of classes that have common attributes, it is by no means certain that the best choice is the set with the largest number of members. It may well be that the choice should be made for the biggest number of common attributes, or some combination of the two such as the number of classes in the set multiplied by the number of common attributes. Notice also that the class Y has no attributes. It may be that the class Y-Z should in fact be the class Y. This description has ignored the problem of attributes that are keys. In fact, we make the simplifying assumption that any relation with more than one primary key is the result of normalizing a 1 : N or M : N relationship and should not therefore be considered when inferring superclasses. Also, to simplify matters, the algorithm does not allow multiple inheritance. The problem of dealing with relations that are used in 1 : N and M : N relationships requires a more sophisticated approach than the one adopted here. We need first to identify such relations and ensure that they are incorporated into the object-oriented schema as multivalued attributes. Where the data model supports it, these relations could also be defined as inverse relationships. However, since the goal of this work is to produce a first pass at an object-oriented schema and not
to produce a complete schema, such results are acceptable. Slightly less acceptable may be the problem of apparently meaningless supertypes being created. We are at present working on a comparison of the algorithms results with results produced by hand. 3.2. The prototype To illustrate this method of deriving an inheritance hierarchy, a prototype has been developed. This system is intended to prove some of the concepts behind a possible schema migration toolkit that would help a databases designer convert an existing set of relations into an OODB class hierarchy and ultimately to support the migration of data. The system has been developed in the VisualWork@ Smalltalk environment. This environment has the advantages of being highly platform independent (UNIX, Macintosh and PC). It also supports good links to databases and the visual programming style is very suitable to prototyping. The system that we have developed allows the representation of both the relational schema to be converted and the new OODB schema. Relational schemata can be loaded into the system in one of three ways: l
l
l
Relational schemata filed-out in standard text format can be parsed by the system. Relational definitions can be made and altered from within the system. Relational schemata can be loaded from an efficient saved format.
Having been entered, a schema can be saved in the system’s fast-loading format which is more efficient than parsing the neutral text format schema file. Fig. 5 shows a relational schema that has been loaded into the system and an attribute definition window. When a relation is selected in the left-hand column, its attributes appear in the right-hand attributes column. From here, they can be edited, deleted or new attributes created.
412
S. Monk et aLlInformation and Software Technology 38 (1996) 467-475
Fig. 5. Relational schema and attribute editor.
Fig. 6 shows the OODB schema window. Most of this window is occupied by a view of the inheritance hierarchy. Only the names of classes in the hierarchy are normally shown, but individual nodes can be inspected by using a menu that pops up over the entity. In fact, two options are available, either the node can be expanded so that its attributes appear in the inheritance hierarchy view, or the class may be opened in a separate window where it can be edited. The class definition’s name and attributes can also be edited. The ‘Process Options’ button reveals a number of options that affect the schema transformation, including the criteria for
selecting the group of classes to be made into a superclass and how the names of inferred superclasses should be derived. For instance, the default mechanism for deriving superclass names is to concatenate the names of its subclasses; however this leads to very long names in a deep hierarchy and so a threshold can be set such that if the name’s length exceeds this threshold then the name is abbreviated. This threshold is set along with other options in the ‘Process Options’ dialog box (Fig. 7). The ‘Process’ button actually carries out the transformation and redisplays the inheritance hierarchy after it has been generated. Data dictionary entries for
Fig. 6. Object-oriented schema.
S. Monk et al./Information and Software Technology 38 (1996) 467-475
413
Fig. 7. Conversion options.
the relations are also stored in the system and are shown when a class is opened or expanded. 3.3. An example This example is taken from a database designed by a final year project student for a stock control system. The initial relational database is shown in Fig. 9. This is then copied to a new OODB schema shown in Fig. 10. This is accomplished by selecting the ‘Copy’ option on the Schema Migration Manager window (Fig. 8) which controls the loading and transfer of meta-data between the two data models. The four relations are now treated as subclasses of the class ‘Root’. At present, there may be attributes shared by the different classes that should result in the creation of new superclasses. To accomplish this, the Process button is pressed. The resultant hierarchy is the one shown in Fig. 6. To see the attributes of these classes, the expand menu option is used on all the classes. The effect of this is shown in Fig. 11. From Fig. 11 it is apparent that there are two common attributes (POST-CODE and PHONE_NUM) of the classes CUSTOMER and SUPPLIER. However, it is also apparent that the attributes associated with address are also equivalent (e.g. CUST ADDRl and SUPPL_ADDRl). This highlights a li&ation of the existing system. At present, the user would now edit
Fig. 8. Schema migration manager window.
Fig. 9. Stock control example, relational schema.
by hand the attributes of the three classes concerned, moving the attributes to the superclass and renaming them. A useful extension of the system would be to catch such situations by ‘fuzzy’ matching of the attribute names and flagging them for the user’s inspection.
4. Future work
The effect of using different strategies for choosing the groups of classes from which to create a superclass has not yet been adequately investigated. It may be that adjusting this algorithm may reduce the number of inappropriate superclasses being created. The system described in this paper is an early prototype intended to illustrate the use of the algorithm described. There are a number of improvements to the system that suggest themselves. To begin with, however, one drawback of the current system is that it only analyses the schema information. Depending on the completeness of this information, there may be a requirement to examine actual instances of the data to uncover which attributes form part of the primary keys of the relations and which act as foreign keys within other relations. This is clearly needed to assist with uncovering the storage of 1: N and M: N relationships within the data, information which is required in order to map these relationships correctly onto the subject schema. To assist with the understanding of the schema, we can investigate alternative ways of displaying the objectoriented schema. At the moment flat 2-D DAGs (Directed Acyclic Graphs) are used; we could move to a 3-D cone tree display [26], which enables more information to
474
S. Monk et aLlInformation and Software Technology 38 (1996) 467-475
Fig. 10. Stock control example, object-oriented schema.
be displayed on the screen than the equivalent flat display. It may be possible to generate further 3-D displays such as those reported in [17,27,28]. It might be useful to employ direct manipulation interfaces to the schema display, allowing the user to perform some schema evolution operations, thus altering the system generated schema to suit the administrator’s needs better. 5. Conclusions This paper has described a number of potential strategies for the migration of information systems away from relational technology and towards object-
orientation. This indicates that there are a number of alternatives to system migration that are more appropriate in certain circumstances. For instance, the provision of object-oriented front-ends to existing relational systems. The architecture of a proposed tool to support the migration of existing systems to an object-oriented base is described in some detail and this paper has also presented an algorithm which, by processing the schema of a relational database, automatically .generates an initial attempt at an object-oriented version of the same schema. The user can then interact with the system to provide meaningful names for any abstract classes generated by the algorithm.
Fig. 11. Inheritance hierarchy for stock control example.
S. Monk et aLlInformation
and Software
This system provides us with a platform for further work, targeting visualization and full database system migration, while raising issues of code, as well as data, migration. This is an essential component of any system which seriously expects to assist with the legacy system problem.
References 111 M.L. Brodie, The promise of distributed
computing and the challenges of legacy systems, Proc. 10th British Nat. Conf. on Databases, BNCOD 10, Aberdeen, Scotland 1992, P.M.D. Gray and R.J. Lucas (eds.), Springer-Verlag, Berlin, pp. l-28. PI R.H.L. Chiang, T.M. Barron and V.C. Storey, Reverse engineering of relational databases: extraction of an EER model from a relational database, Data & Knowl. Eng., 12 (1994) 107-142. [31 W.A. Gray, G.N. Wikramanayake and N.J. Fiddian, Assisting legacy database migration, Proc. IEE Colloquium on Legacy Information Systems-Barriers to Business Process Re-Engineering 1994, pp. 5/l-5/6. [41 S.R. Monk and I. Sommerville, A model for versioning of classes in object-oriented databases, Proc. BNCOD 10, Aberdeen, 1992, P.M.D. Gray and R.J. Lucas (eds.) Springer-Verlag, (1992) pp. 42-58. 151 S.R. Monk, Schema evolution in object-oriented databases using class versioning, SIGMOD Record, 22, (1993) 16-22. 161P.P. Chen, The entity-relationship model-towards a unified view of data, ACM Trans. Database Sys., 1 (1976) 9-36. [71 J. Duhl and C. Damon, A performance comparison of object and relational databases using the sun benchmark, Proc. of OOPSLA’881988, pp. 153-163. PI M. Stonebraker, L.A. Rowe and M. Hirohama, The implementation of POSTGRES, IEEE Trans. on Knowl. and Data Eng., 2 (1990) 1255141. [91 M. Stonebraker and G. Kemnitz, The POSTGRES nextgeneration database management system, Comm. ACM, 34 (1991) 78-92. 1101 K. Wilkinson, P. Lyngbcek and W. Hasan, The Iris architecture and implementation, IEEE Trans. on Knowl. and Data Eng., 2 (1990) 63-75. Pll D. Fishman, D. Beech and H. Cate, Iris: an object-oriented database management system, ACM Trans. on Office Inf. Systems, 5 (1987) 48-69. [121M.K. Crowe, Object systems over relational databases, Inf. and Soft. Technol., 35 (1993) 449-461.
Technology
38 (1996)
467-475
415
1131 M. Hardwick and L. Spooner, The ROSE data manager: using
object technology to support interactive engineering applications, IEEE Trans. on Know]. and Data Eng., 1 (1989) 285-289. u41 J.A. Mariani, Oggetto: an object oriented database layered on a triple store, Computer J., 35 (1991). I51 J.A. Mariani, Realizing relational style operators and views in the Oggetto object-oriented database system, Inf. and Soft. Technol., 35 (1993) 207-216. [I61 J.A. Mariani, TripleSpace: an experiment in a 3-D graphical interface to a binary relational database, Interacting with Computers, 4 (1992). [I71 S. Benford and J.A. Mariani, Virtual environments for sharing and visualization-populated information terrains, Proc. 2nd Int. Conf. on User Interfaces to Databases, in P. Sawyer (ed.), Ambleside, UK, Springer-Verlag, 1994. 1181 ParcPlace, VisualWorks: ObjectWorks Smalltalk User’s Guide, ParcPlace Systems, 999 E. Arques Ave, Sunnyvale, California 94086-4593, 1992. t191 T.R. Ayers, D.K. Barry, J.D. Dolejsi, J.R. Galameau and R.V. Zoeller, Development of ITASCA, JOOP (1991) 46-49. WI P. Butterworth, A. Otis and J. Stein, The Gemstone object database management system, Comm. ACM, 34 (1991) 64-77. 1211 A.P. Sheth, Semantic issues in multidatabase systems, SIGMOD Record, 20 (1991) 5-9. 1221 S.D. Urban and J. Wu, Resolving semantic heterogeneity through the explicit representation of data model semantics, SIGMOD Record, 20 (1991) 55-58. 1231J. Hainaut, Database reverse engineering, models, techniques and strategies, Proc. 10th Int. Conf. on Entity-Relationship Approach, San Mateo, California, 1991, pp. 729-741. v41 W.J. Premerlani and M.R. Blaha, An approach for reverse engineering of relational databases, Comm. ACM, 37 (1994) 42-49. v51 ‘E.M. Elgalal, Minimally-redundant data structures and reasonable hypotheses: some general heuristic methods of knowledge processing, Ph.D. Thesis, University of Strathclyde, Glasgow, UK, 1985. WI G.G. Robertson, S.K. Card and J.D. Mackinlay, Cone trees: animated 3D visualization of hierarchical information, Proc. ACM CH1’91, 1991, pp. 184-194. 1271M.H. Rapley and J.B. Kennedy, Three dimensional interface for an object-oriented database, in P. Sawyer (ed.), Proc. 2nd Int. Conf. on User Interfaces to Databases, Ambleside, UK, Springer-Verlag, 1994. WI J. Boyle, J. Fothergill and P. Gray, Amaze: a 3-D graphical user interface for an object-oriented database, in P. Sawyer (ed.), Proc. 2nd Int. Conf. on User Interfaces to Databases, Ambleside, UK, Springer-Verlag, 1994.