The Journal of Systems and Software 81 (2008) 764–771 www.elsevier.com/locate/jss
Extracting entity-relationship diagram from a table-based legacy database

Dowming Yeh a,*, Yuwen Li b, William Chu c,*

a National Kaohsiung Normal University, Department of Software Engineering, 116 Hou-Ping 1st Road, Kaohsiung, Taiwan
b National Sun Yat-Sen University, Department of Information Management, 70 Lien-hai Road, Kaohsiung, Taiwan
c Computer Science and Information Engineering, Tunghai University, Taiwan

Received 30 January 2007; received in revised form 24 June 2007; accepted 1 July 2007
Available online 26 July 2007
Abstract

Current database reverse engineering research presumes that the information regarding the semantics of attributes, primary keys, and foreign keys in database tables is complete. However, this may not be the case. In a recent DBRE effort to derive a data model from a table-based database system, we found that the data content of many attributes is not related to their names at all. In this paper, we present a process that extracts an extended entity-relationship diagram from a table-based database with few descriptions for the fields in its tables and no descriptions of keys. The primary inputs of our approach are system display forms, table schemas, and data instances. First, we utilize screen displays to construct form instances. Second, code analysis and data analysis involving comparisons of fields and decomposition of fields are applied to extract attribute semantics from forms and table schemas, followed by the determination of the primary keys, foreign keys, and constraints of the database system. In the final step of conceptualization, with the processes of table mergence and relationship identification, an extended ER diagram is successfully extracted in a case study.
© 2007 Published by Elsevier Inc.

Keywords: Database; Reverse engineering; Entity-relationship diagram; Data model; Case study
1. Introduction

Software reengineering is the modification of the functionalities or structures of a software system in order to improve the quality of the software (Snelting, 2000). Software reengineering includes two parts: reverse engineering and forward engineering, i.e., traditional software engineering. Reverse engineering analyzes the implementation of a legacy system and then abstracts such information into high-level design representations in order to obtain the design specification of the original system. With the specification, the meaning of the source code can be comprehended, and future maintenance or replacement work may be performed more easily.
Most reverse engineering works focus on uncovering the programming logic and functions of application software. Data reverse engineering, on the other hand, concentrates on data model recovery of legacy systems (Aiken, 1996, 1998). In many cases, the major goal of data reverse engineering efforts is to reconstruct the conceptual data model of a database system in the form of an entity-relationship diagram (ERD) or extended ERD (Davis, 2001). Such efforts are therefore also called database reverse engineering (DBRE) (Blaha, 1999). The thoroughness of semantics acquisition is one of the criteria that must be considered when designing a DBRE process. The semantics of a recovered conceptual model must incorporate more domain semantics than mere structural meaning, namely how data entities relate to each other. Therefore, the sources of DBRE should not be limited to the database schema (Chiang et al., 1997). Another way to obtain semantics is through analysis of data instances in a database and of the query and view manipulation statements in program code related to the database.
Data analysis relies on the quantity and quality of the data instances in the database. Advanced techniques such as knowledge discovery and random sampling have been suggested to improve the data analysis process (Fayyad et al., 1996). Although the focus is on the database, recovering a data model may require analyzing source code, program control language, and data description language (Tan et al., 2002). Semantics concerning data dependencies in particular, such as functional dependencies and inclusion dependencies, are often extracted by program analysis (Tan and Zhao, 2003). Domain knowledge of the database system is also important for DBRE (Ghannouchi et al., 1998). Users may be consulted to resolve some of the problems, but the process can be time-consuming; moreover, many design details cannot be observed from the user's point of view. More recently, form analysis has been used to gather information extracted from both form structures and form instances and to generalize such information into database semantics through an inference process (Malki et al., 2001; Mfourga, 1997). A form is any structured collection of variables that are appropriately formatted to communicate with the database, particularly for data retrieval and display. Forms are obtained by examining the output displayed by executing application systems. Finally, the schemas resulting from these analyses need to be integrated into a complete schema.

Much DBRE research, however, is based on relational databases and presumes that the information regarding the semantics of attributes, primary keys, and foreign keys in database tables is complete (Pedro-de-Jesus and Sousa, 1999). Such work acknowledges that this information may not be well documented, but assumes it can be derived directly from the database schema; for example, the name of an attribute could provide the semantics of the data items in the attribute. However, this may not be the case. In a recent DBRE effort to derive a data model from a dBase III database system, we find that although it is also based on tables, it lacks some important characteristics found in relational databases, such as key annotations in the schema and SQL support. Moreover, the data content of many attributes is not related to the names of these attributes at all, and some fields in the database are actually composed of subfields. The database system is maintained by a software vendor outside our organization. Because the system documentation offered by the developer is inadequate and out of date, our technicians were unable to construct a complete database model from which a modern relational database system could be built to replace this legacy system.

Hence, the aim of this paper is to present a complete DBRE experience and techniques for a table-based legacy database that lacks important characteristics of modern relational databases and lacks the documentation needed to resolve the semantics of many database fields. The primary inputs of our DBRE case study are screen displays, which are essentially forms, form instances, and table schemas. Some routines are devised to automate the form analysis process.
Further analyses demanding more human intervention are augmented with sources from the real data and the associated program code. Finally, after the processes of table mergence and relationship identification, we apply some domain knowledge to construct the complete conceptual structure.

Some related work is discussed in Section 2. The general DBRE process is described in Section 3. The techniques that we adopt are detailed in Sections 4 and 5. Section 6 compares our method with other reported techniques, and finally the conclusion is given.
2. Related works

Most of the research in the field is based on using the relational schema as the basic input, aiming at the analysis of individual relations in the schema and the extraction of an entity-relationship schema. However, the maintenance of an existing database depends on the depth of understanding of its characteristics. The semantic information of attributes is viewed as a critical part of the characteristics of a database. It helps to discover how a system works, which often comes with inadequate or non-existent documentation.

There is some research on database reverse engineering using forms (Malki et al., 2001; Mfourga, 1997). Mfourga gathers structural information and constraints among data, extracted statically and dynamically, to construct form model schemas. These schemas, supplemented with low-level database schemas, can help rebuild entity-relationship schemas (Mfourga, 1997). The whole process consists of three steps: (1) acquisition of both structures and instances of forms, (2) extraction of the static aspect of the legacy database, and (3) extraction of the dynamic aspect of the legacy application. The information extracted from both form structures and instances is used as input for database application reverse engineering, with interaction from a user. In that setting the database fields are well named, so the mapping between form fields and database fields is straightforward; moreover, it is not clear how composite database fields would be handled.

The work by Alhajj addresses database reverse engineering by extracting the extended entity-relationship schema from the relational schema (Alhajj, 2002). He developed algorithms that investigate characteristics of an existing legacy database in order to identify candidate keys and foreign keys of all relations, and to decide on the appropriate links between relations. Two options are provided in his implementation: the user may feed the system with the required information if he or she can provide it, or leave the system to recover the characteristics of the relational database with minimum user involvement. However, a relational table schema is the required input for this reverse engineering process. Other, earlier reverse engineering methods require similar constraints, such as tables satisfying third normal form (Pedro-de-Jesus and Sousa, 1999). Our approach, on the other hand, is not subject to such constraints.
3. DBRE process

The dBase system is a table-based database system under DOS. The data structure of dBase is defined in DOS files, and different file name extensions denote different types of information. For example, the extension '.DBF' is for database records and the extension '.NDX' is for database index pointers. In other words, a '.DBF' file is a table that contains a number of fields and all the data entered into the database. There are 58 tables in our database, so certain schemes need to be developed to analyze them systematically. The primary inputs of our DBRE case study are screen displays, which are essentially forms, form instances, and table schemas. We follow the three DBRE steps described in Hainaut et al. (1996):

1. Project preparation.
2. Data structure extraction.
3. Data structure conceptualization.

In the step of project preparation, we utilize screen displays to construct form instances. To automate most of the process, we transplant the original legacy data into a relational database system, SQL Server, so that SQL commands may be applied to analyze the data. Second, code analysis and data analysis involving comparisons of fields and decomposition of fields are applied to extract attribute semantics from the forms and table schemas, followed by the determination of the primary keys, foreign keys, and constraints of the database system. In the final step of conceptualization, with the processes of table mergence and relationship identification, we derive the first-cut entity-relationship schema. Finally, we apply some domain knowledge to construct the complete conceptual structure.

The first step of DBRE is the preparation of system profiles for the DBRE project. Before starting the DBRE work, we have to collect and organize a complete description of all data assets in the database. The description can be divided into three types:

1. Explicit data description language (DDL): for instance, the name, domain, and length of fields in database tables.
2. Implicit structure: for instance, primary keys, foreign keys, constraints, and the program logic for accessing the database.
3. Other parts in programs: such as constraints on data structures in application programs.

We list all the forms and classify them according to their characteristics, and form instances are designed based on the form fields. The designed data are then input through the operation of the dBase system and stored in the legacy database. Finally, all form fields and database fields are listed for data analysis in the following steps, and the legacy data is transformed and loaded into a SQL Server database.
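The transfer of the legacy data into SQL Server can be scripted in several ways. The fragment below is only a sketch of one possibility, assuming the dBase tables are first exported to comma-separated files; the field lengths, the file path, and the restriction to a few fields of StDatNow are assumptions made for illustration, not details reported in the paper.

-- Target table for one legacy dBase table (only a few of its fields shown);
-- no keys are declared, mirroring the legacy schema.
create table StDatNow (
  St_Id varchar(10),
  St_Pass varchar(20),
  St_Dyb varchar(20),
  St_Dyc varchar(10)
)

-- Load the exported records into SQL Server.
bulk insert StDatNow
from 'C:\legacy\StDatNow.csv'
with (fieldterminator = ',', rowterminator = '\n')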
The following points should be noted during the process:

• Values of the related fields should be consistent when we design the instances. For example, if we construct a value for the registration field of a student, we also have to construct instances of his or her course enrollment and other related information.
• Form instances are best designed with the help of users. Users who are familiar with the system and the related information possess rich domain knowledge. If users participate in designing, classifying, and interpreting form instances, their knowledge of the extracted fields will contribute to more accurate classification and interpretation. As a consequence, the data structure extraction that follows will be more complete and accurate.
• The classification and interpretation of extracted fields should be based on domain knowledge. Related data may be dispersed over various display forms for different users. The design of the information classification is somewhat similar to the design of a database, where different data requirements are produced by different users. Classifying the information in this way can reduce the load of designing instances.

In order to automate the extraction procedure, we further develop four SQL Server tables in this phase to store the information of form instances, form fields, database tables, and table fields. Table "Entity" contains the classification information of the forms. Table "Attribute" stores the form fields and the related explanation of the instances. Table "TableSchema" holds the table descriptions of the database, and Table "FieldSchema" stores the information of all the fields in the database tables. The SQL statements creating the four tables are shown in Table 1. The legacy data is then loaded into the SQL Server through a transformation procedure.
Table 1
Schemas for form and table

Create table Entity (
  EntityNo int identity (1,1),
  EntityName varchar (20),
  primary key (EntityNo)
)

Create table TableSchema (
  TableName varchar (20),
  Meaning varchar (30),
  primary key (TableName)
)

Create table Attribute (
  EntityNo int,
  AttributeNo int,
  AttributeName varchar (20),
  TestValue varchar (100),
  primary key (EntityNo, AttributeNo)
)

Create table FieldSchema (
  TableName varchar (20),
  FieldName varchar (20),
  Meaning varchar (30),
  Data varchar (100),
  primary key (TableName, FieldName)
)
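To make the use of these metadata tables concrete, the statements below sketch how one designed form instance and the corresponding legacy fields might be recorded. The captions and test values are those that appear later in Fig. 2; the entity name 'Student data form' and the value EntityNo = 1 are assumptions made only for this illustration.

-- One form (entity) and the captions observed on it, with the unique test
-- values entered through the form during project preparation.
insert into Entity (EntityName) values ('Student data form')
insert into Attribute (EntityNo, AttributeNo, AttributeName, TestValue) values (1, 1, 'Student number', '11111111')
insert into Attribute (EntityNo, AttributeNo, AttributeName, TestValue) values (1, 2, 'Password', '2222')
insert into Attribute (EntityNo, AttributeNo, AttributeName, TestValue) values (1, 3, 'Subject, grade, class', 'ch01101')

-- One legacy table and its fields; Meaning and Data are left empty here and
-- are filled in by the comparison procedures of Section 4.
insert into TableSchema (TableName, Meaning) values ('StDatNow', null)
insert into FieldSchema (TableName, FieldName, Meaning, Data) values ('StDatNow', 'St_Id', null, null)
insert into FieldSchema (TableName, FieldName, Meaning, Data) values ('StDatNow', 'St_Pass', null, null)
insert into FieldSchema (TableName, FieldName, Meaning, Data) values ('StDatNow', 'St_Dyb', null, null)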
4. Data structure extraction

The second step, data structure extraction, extracts the complete database schema. If there is a formal DDL description of the database, this process can be greatly expedited. Otherwise, a fair amount of data analysis, program analysis, and form analysis needs to be performed. Data analysis examines the content of a database to uncover properties such as uniqueness and to test hypotheses such as foreign keys. In this step, the users' domain knowledge and judgment are the decisive elements for semantics. Program analysis provides extra information about integrity constraints. This extraction process comprises three steps: attribute extraction, key extraction, and constraint extraction. The aim of attribute extraction is to extract semantic information for database fields through field comparison, character comparison, data analysis, and code analysis.

4.1. Attribute extraction

Field comparison compares form fields and database fields through instances in order to obtain the true meaning of each database field. This is necessary because the meanings of some database fields cannot be inferred from their names. First, we find the corresponding entries of form fields in database fields, and then use the captions of the form fields and the context of the form to derive the meaning of the database fields. In the process of comparison, form instances are the medium. Since the value of each field in the form instances is different, by comparing values in form fields and database fields we can readily identify the corresponding database fields. In this way, we are able to extract the semantics of most attributes in the system. The rule is formulated as follows:

Let a unique value v be input in a display form under the caption c. If there exists a table t that contains v under a field f, that is, v ∈ Πf(t), where Π is the projection operation of relational algebra, then the meaning of the field f is the same as the meaning of the caption c.
declare @str nvarchar(500), @Tname varchar(20), @Fname varchar(20)

-- Visit every legacy field whose representative value has not been collected yet.
declare field cursor for
  select TableName, FieldName from FieldSchema where Data is null
open field
fetch next from field into @Tname, @Fname
while (@@fetch_status = 0)
begin
  -- Copy a value of the legacy field into FieldSchema.Data for comparison.
  set @str = ' update Fieldschema set Data=' + @Fname + ' from ' + @Tname
  set @str = @str + ' where TableName=''' + @Tname + '''' + ' and FieldName=''' + @Fname + ''''
  exec sp_executesql @str
  fetch next from field into @Tname, @Fname
end
close field
deallocate field

-- A field whose collected value equals a designed test value takes the caption
-- of the corresponding form field as its meaning.
update FieldSchema set Meaning = AttributeName
from Attribute
where FieldSchema.Data = Attribute.TestValue
Fig. 1. Field comparison procedure.
Database table FieldSchema (collected data values):

  TableName   FieldName   Data
  StDatNow    St_Id       11111111
  StDatNow    St_Pass     * MA 2222
  StDatNow    St_Dyb      ch01101

Form table Attribute (designed test values):

  AttributeName           TestValue
  Student number          11111111
  Password                2222
  Subject, grade, class   ch01101

Resulting FieldSchema document:

  TableName   FieldName   Meaning
  StDatNow    St_Id       Student number
  StDatNow    St_Pass     * (not found)
  StDatNow    St_Dyb      Subject, grade, class

Fig. 2. Field comparison example.
The SQL Server procedure implementing the field comparison rule above is shown in Fig. 1. With the tables defined in Table 1, field comparison is performed automatically for our case. The values of FieldSchema.Data and Attribute.TestValue are compared. If the values of the two fields are the same, the two fields match, and Attribute.AttributeName should stand for FieldSchema.Meaning. For example, in Fig. 2, the procedure resolves the meaning of the cryptic database field St_Dyb to be the subject, grade, and class of a student, since the input value, ch01101, is stored in the data field associated with that field in table StDatNow.

After the process of field comparison, there are still some form fields with no corresponding database fields. After further investigation, we find that this is because certain database fields contain data from several form fields. The reason for packing different sources of input data into one field is possibly a quick-and-dirty solution that accommodates additional data items by stuffing them into fields with spare space, thereby avoiding the trouble of restructuring the table schema. Such packing is possible only when the data type of the field is a fixed-size character string. The database field is, as a result, overloaded with more than one meaning, and these meanings may not be related in the sense of a composite attribute in ERD design. That is, there are actually several attributes inside such a field. The previous rule can be extended to handle such overloaded fields as follows:

Let a unique string s be input in a display form under the caption c. If there exists a table t that contains s as a substring under a field f, that is, ∃v ∈ Πf(t) such that s is a substring of v, then the field f is an overloaded field and the meaning of the field f contains the meaning of the caption c.

Therefore, besides simple field comparison, we need to perform character comparison (also implemented in SQL, as shown in Fig. 3), which employs character searching to find the corresponding database fields for these form fields.
declare @str nvarchar(500), @Fname varchar(20), @FData varchar(20)

-- Visit every form field whose test value was not matched by simple field comparison.
declare TestData cursor for
  select AttributeName, TestValue from attribute
  where TestValue not in (select Data from FieldSchema)
open TestData
fetch next from TestData into @Fname, @FData
while (@@fetch_status = 0)
begin
  -- Record the character positions where the test value occurs inside an
  -- overloaded field, together with the caption of the form field.
  set @str = ' update Fieldschema set Meaning=str(patindex(''%' + @FData + '%'', Data))+ ''-''+ str(patindex(''%' + @FData + '%'', Data)+len(''' + @FData + ''')-1)+' + '''' + @Fname + ''''
  set @str = @str + ' where patindex(''%' + @FData + '%'', Data)<>0 and Meaning is null '
  exec sp_executesql @str
  fetch next from TestData into @Fname, @FData
end
close TestData
deallocate TestData
Fig. 3. Character comparison procedure.
For example, the password form field in Fig. 2 does not have a corresponding database field. Character comparison is performed to find out which field's FieldSchema.Data contains '2222'. The matching field is St_Pass, and '2222' is stored in characters 7–10 of the field. Consequently, the database field St_Pass might contain several form fields, one of which is the password.

After the processes of field comparison and character comparison, the semantics of the database fields that can be directly related to captions of form fields through unique form instances are extracted. For those fields that do not store unique form instance values, data analysis and code analysis are applied manually to extract the remaining attributes from the database fields. Data analysis classifies the values of fields based on the existing content of the database and then applies domain knowledge to identify the differences between the resulting categories. By comparing the differences among categories, combined with domain knowledge, we might be able to understand the meaning of the field. If there is insufficient knowledge to resolve the meaning, some code analysis is needed. A code analysis procedure usually searches the source code for the field name in question, analyzes the related processing of the field, and then resolves the meaning of the field (Henrard and Hainaut, 2001; Tan et al., 2002; Tan and Zhao, 2003). Such analysis can be augmented with form fields, for a database field may be related to a form field through some computation. Some database fields that cannot be observed in forms are produced automatically by programs without reference to any other fields. Such fields are considered of no semantic importance in this case study and are discarded in subsequent analyses.
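To illustrate what can be done with the positions recovered by character comparison (this is only a sketch, not one of the procedures used in the case study), an overloaded field such as St_Pass can be decomposed with a SUBSTRING expression once the start position and length of a subfield are known; the offsets below follow the password example above (characters 7–10).

-- Expose the password subfield of the overloaded St_Pass field as a separate
-- attribute; offset 7 and length 4 are taken from the character comparison result.
select St_Id,
       substring(St_Pass, 7, 4) as Password
from StDatNow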
4.2. Key extraction

The second step of data structure extraction is key extraction, which obtains primary keys and foreign keys for the database tables. With the knowledge of primary keys and foreign keys, the referential associations among tables can be constructed. Primary keys and foreign keys are declared in a relational database system, but such information is not explicit in dBase systems.

The dBase system usually maintains an index file to speed up data searching for a table. Usually, primary keys are included in such an index file; therefore, index files are a good starting point for extracting primary keys. If the number of distinct entries in an index file is the same as the row count of the indexed table, the indexed fields can be selected as a candidate key. If there is more than one candidate key, the one containing fewer fields is chosen as the primary key. For example, in our case study there is a table StDatNow with three indexes: the first on St_Id, the second on St_Dyc, and the third on St_Id+St_Dyc. The row count of the table, calculated with a SQL count command, is 4704. The distinct-value counts of the three indexes are 4704, 158, and 4704 for St_Id, St_Dyc, and St_Id+St_Dyc, respectively. Therefore, St_Id and St_Id+St_Dyc are candidate keys, and St_Id is chosen as the primary key because it contains fewer fields. As for tables without an index file, we have to examine the values of each field in the table for uniqueness, as suggested by Alhajj (2002). However, fields containing null values can be eliminated to reduce the effort of checking uniqueness. Finally, the same criterion is followed to pick a primary key.

After the primary keys are established, we can apply them to extract foreign keys in order to identify associations among tables. During this process, every primary key is checked to see whether it is referenced by fields of other tables (Alhajj, 2002); such a referring field is a foreign key. The criterion is that the domains of the referring and referred fields must match and that the values of the referring field must be a subset of those of the referred field. Such criteria can easily be tested by designing appropriate SQL commands.
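The uniqueness and subset criteria above translate directly into SQL. The following sketch uses the case-study names StDatNow, StNua, St_Id, and St_Dyc and assumes the key fields are character strings (as in dBase); it is an illustration of the tests, not the exact scripts used in the project.

-- Candidate-key test: a field (or field combination) is a candidate key when
-- its number of distinct values equals the row count of the table.
select count(*) from StDatNow                          -- 4704 in the case study
select count(distinct St_Id) from StDatNow             -- 4704: candidate key
select count(distinct St_Dyc) from StDatNow            -- 158: not a key
select count(distinct St_Id + St_Dyc) from StDatNow    -- 4704: candidate key, but wider

-- Foreign-key test: every value of the referring field must also appear in the
-- referred field; a count of zero satisfies the subset criterion.
select count(*) from StNua
where St_Id not in (select St_Id from StDatNow)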
4.3. Constraint extraction

The primary objective of constraint extraction is to obtain the association cardinality between primary keys and foreign keys. If a value of the primary key in a table appears in only one record of another table holding the associated foreign key, the mapping cardinality is inferred to be one to one; otherwise, the mapping cardinality is considered one to many. In fact, once the foreign key relationship is established, one only needs to check the uniqueness of the values in the foreign key field to determine the cardinality. The procedure for automating this classification is shown in Fig. 4. If the result of the query is not zero, there exists at least one occurrence in TA whose primary key value matches multiple foreign key values in TB; therefore, it is a one to many mapping. Otherwise, it is a one to one mapping.

select count(*)
from (select s.A
      from TA s join TB n on s.A = n.B
      group by s.A
      having count(*) > 1) as multi

Fig. 4. Constraint extraction process.
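For the case-study example discussed next, the template in Fig. 4 can be instantiated with TA = StDatNow, TB = StNua, and A = B = St_Id. This is a sketch of that instantiation, not the authors' original script.

-- A non-zero count means some StDatNow.St_Id value matches several StNua rows,
-- so the mapping is one to many; a zero count would indicate one to one.
select count(*)
from (select s.St_Id
      from StDatNow s join StNua n on s.St_Id = n.St_Id
      group by s.St_Id
      having count(*) > 1) as multi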
Though such a procedure serves most purposes, cardinality constraints on relationships involving more than two tables should be determined with the algorithm proposed by Soutou (1998). As an example, we apply this procedure to calculate the cardinality between the primary key St_Id of table StDatNow and the foreign key St_Id of table StNua in our case study. Since the result of the query is not zero, we determine that the mapping between StDatNow.St_Id and StNua.St_Id is one to many. At the end of this process, a logical schema of the database takes shape. Such a logical schema may be represented with a graph similar to the RID graph in Alhajj (2002).

5. Data structure conceptualization

The conceptualization step concerns the formalization of the conceptual model and its refinement. For the extended ER diagram, the focus is on the identification of entities and of the relationships among these entities from the logical schema gathered in the previous step. In the process of designing a table-based database, an entity might be transformed into several tables depending on its characteristics. For example, a multi-valued attribute is usually separated from the entity to become another table; we call such a separation a vertical division. Another way of splitting entities, called horizontal division, involves dividing a generalized entity in a generalization hierarchy into different tables representing the specialized entities of that hierarchy. As for a relationship, if its cardinality is many to many, it is usually transformed directly into a table. On the other hand, if the cardinality of a relationship is one to one or one to many, it is generally folded into a table representing the related entity (Silberschatz et al., 2002). Moreover, there are other extended relationships to consider, such as generalization. Hence, table mergence and relationship identification are the essential reverse processes for reconstructing entities and relationships for an extended ER diagram.

The clue for merging tables produced by vertical division is that the attributes comprising the primary key of the separated table form a strict superset of the primary key of the table representing the main entity. The primary key of the former table consists of the primary key of the main entity and the attribute with multiple values. The same characteristics may exist between tables transformed from a weak entity and its associated strong entity. In this case, the primary key of the table representing the weak entity consists of both the key of the strong entity and the key of the weak entity (Silberschatz et al., 2002). However, a weak entity has some other attributes that are not part of the primary key. Therefore, the case of multi-valued attributes may be further distinguished from a weak entity by checking whether the primary key comprises all the attributes of the table. These heuristic rules are formalized as follows:
Given two tables, t1 and t2, let T1, T2 and k1, k2 be their schemas and primary keys, respectively, and let k1a be a strict subset of k1 that is a foreign key referencing k2.

1. If k1 = T1, then t1 may represent a multi-valued attribute of the entity represented by t2.
2. If k1 ≠ T1, then t1 may represent a weak entity of the strong entity represented by t2.

However, domain knowledge may still be needed to decide whether or not to merge those tables, according to the meaning of their attributes and the naming of the entity.

The coincidence of the same primary key in multiple tables provides clues for horizontal division. A horizontal division results from classifying a generalized entity into several specialized entities and transforming each specialized entity into a table. Therefore, the resulting tables comprise overlapping attributes shared by every entity in the hierarchy, including the primary keys, and some disjoint attributes existing only in a certain entity. Reversing the process, we first identify tables containing the same primary key and other similar attributes. Then we analyze whether those candidate tables are divided based on a classification. After the identification, we can merge the qualified tables to produce the generalized entity with the shared attributes and the specialized entities with the disjoint attributes. Note that it is possible to observe tables with the same primary key and no other attributes in common. This may also signify a generalization hierarchy, but one in which the generalized entity as well as its specialized entities are all transformed into tables. There is no need for table mergence here; only the generalization relationship is marked. It is hard to formalize the rule for horizontal division, since recognizing the similarity of attributes requires judgment based on domain knowledge.

The final step of data structure conceptualization is relationship identification, which identifies the relationships between the entities uncovered in the previous step. To recover one-to-one or one-to-many relationships, we again resort to the foreign keys extracted in the data structure extraction step. When a many-to-many relationship is transformed directly into a table, the primary key of the table is usually composed of the primary keys of the two entities associated with the relationship. Hence, if the primary key of a table is composed of two foreign keys referencing two other tables that each represent an entity, the essence of the table is very likely a relationship. More formally, given any three tables, t1, t2, t3, with t2 and t3 representing entities from the previous step, let T1, T2, T3 and k1, k2, k3 be their schemas and primary keys, respectively, and let k1a and k1b be two disjoint strict subsets of k1. If k1a and k1b are foreign keys referencing k2 and k3, respectively, then t1 may represent a relationship between the entities t2 and t3. Note that this rule overlaps somewhat with the previous rules for multi-valued attributes and weak entities. To be precise, the previous rule should be refined to state that there is no other foreign key besides the one discussed.
[Fig. 5. Partial EER: the Student and Course entities linked by the Take relationship, an Obligatory course entity linked to Student by the Satisfy relationship, and an ISA hierarchy specializing Student into Undergrads, Freshman, Graduate, Reserved, Suspended, Dropout, and Transferred.]
Readers are encouraged to reformulate the rule as an exercise. Other foreign keys in a table may imply a many-to-one or one-to-one relationship if the associated table and the table itself represent entities. The naming of the relationships is again determined by domain knowledge.

Now, take our case study as an example. From the analysis of attribute extraction, we find that seven tables have the same primary key and similar attributes, namely StDatNow, StDatNew, StDatRes, StDatOut, StDatTur, StDatGra, and StDatHol. The seven tables are found to form a hierarchy of various kinds of student entities; StDatGra, for instance, stands for the graduate student entity. In identifying relationships, the primary key of table StChk consists of "St_Id" and "Cs_No", which reference the primary key of the student entity and table Grader, respectively. Hence, StChk actually represents a relationship, named Satisfy, between the student entity and the obligatory course entity. Similarly, the primary key of table StNua consists of "St_Id" and "St_Key", where "St_Id" and "St_Key" are the primary keys of the student entity and of Ssf (representing the course entity), respectively. A partially reconstructed extended ER diagram is shown in Fig. 5.

6. Evaluation of DBRE techniques

Chiang et al. proposed three criteria for the evaluation of DBRE methods: completeness and robustness, validation, and performance efficiency (Chiang et al., 1997). Completeness and robustness are evaluated against common problems related to the executable schema and data instances, showing how well the process improves the quality of a DBRE method's result. Validation focuses on information preservation between the source database schema and the target database schema over all database instances, and on whether the result accurately satisfies the DBRE method's purpose. Finally, in order to evaluate the automation of a DBRE process for performance efficiency, the process's operations are classified into fully automated operations and user-assisted operations, indicating how many of the operations can be implemented by a computer-based system.
All the criteria mentioned above are used to evaluate our DBRE techniques. According to the level of automation defined by Chiang et al., our method can be classified into two levels, medium and high, depending on the sources of input and the techniques applied during the data structure extraction phase: a highly automatic method that is based on form analysis only and applies little domain knowledge except for designing form instances, and a semi-automatic method that utilizes all the techniques introduced and requires manual assistance throughout the process.

A small test case consists of 12 database tables in which a total of 36 attributes are supposed to be uncovered by DBRE. Our form analysis extracts 29 attributes automatically, while data analysis and code analysis extract five more attributes from the remaining seven with manual assistance. Hence, we obtain 34 attributes at the semi-automatic level (95%) and 29 attributes at the highly automatic level (81%). For the semi-automatic level, we compare our technique to Signore et al.'s method (Signore et al., 1994), which can handle erroneous data but needs enormous manual assistance from users who have sufficient knowledge about the legacy system and the application domain. Supposing users have sufficient domain knowledge, their method can extract as many attributes as ours, and can find a hidden foreign key in seven tables. However, their method assumes that all attribute semantics are known and that there are no overloaded fields in the database, so it may produce ambiguous field names and therefore a poorly readable ERD. As for the highly automatic level, we compare our technique to Alhajj's method (Alhajj, 2002), which minimizes the dependency on users but assumes clean data and Third Normal Form relations. Although that method is nearly fully automatic, it ignores the issues of unknown attribute semantics, overloaded fields, and hidden foreign keys. Therefore, it can only extract 31 attributes, and the ambiguous attributes also lead to poor readability of the ERD. The comparison is shown in Table 2.

The purpose of the DBRE process for our dBase III system is to rebuild appropriate documentation that describes the system in detail and makes maintenance easier. To validate the DBRE result, we collect confidence data from users and database administrators to verify the accuracy, as suggested by Chiang et al. for validation. After completing the comparison and a questionnaire, we found that there is no significant difference between the two schemas and that the users with domain knowledge have 89% faith in our result. Consequently, our conceptual schema is accurate enough to represent the semantics of the application domain.
Table 2
Comparison between semi-automatic and automatic levels

                      Semi-automatic                                        Automatic
                      Signore et al.'s method   Our method                  Our method                Alhajj's method
Input                 Schema, data, code        Schema, data, code, form,   Schema, form,             Schema (3NF), clean data
                                                form instances              form instances
Attribute semantics   Clear                     Ambiguous                   Ambiguous                 Clear
Attributes extracted  34                        34 (95%)                    29 (81%)                  29
Automation level      Low                       Medium                      High                      High
Primary keys          12                        12                          12                        12
Foreign keys          11                        11                          11                        4
ERD readability       Poor                      Fine                        Fine                      Poor
7. Conclusions

In this paper, we presented a DBRE approach that supports extracting an extended entity-relationship diagram from a legacy database based on tables. The database is not a relational database, and the tables do not have primary keys declared. Moreover, the descriptions for some of the database fields were poor. Most DBRE methods are not applicable under these constraints. For this purpose, it is necessary to understand the semantic information of the attributes in the existing database. Our approach uses display forms, table schemas, and instances as the database reverse engineering input. After analyzing the forms, organized data are input through the forms into database fields. Attribute instances in the legacy database are transplanted into a relational database system, and SQL commands are used to automate most of the analysis work. Various procedures, field comparison, character comparison, data analysis, and code analysis, are applied to determine the semantics of attributes, followed by the identification of primary keys, foreign keys, and cardinality constraints. In the conceptualization phase, entities are determined through table mergence. Finally, relationships are identified, and an extended entity-relationship diagram is reconstructed. A case study is presented along with the DBRE process, and some of the analysis results are shown.

References

Aiken, P.H., 1996. Data Reverse Engineering: Slaying the Legacy Dragon. McGraw-Hill.
Aiken, P.H., 1998. Reverse engineering of data. IBM Systems Journal 37 (2), 246–269.
Alhajj, R., 2002. Extracting the extended entity-relationship model from a legacy relational database. Information Systems. Elsevier Science Ltd.
Blaha, M., 1999. An industrial example of database reverse engineering. In: Proceedings of the Sixth Working Conference on Reverse Engineering, October. Atlanta, Georgia.
Chiang, R.H.L., Barron, T.M., Storey, V.C., 1997. A framework for the design and evaluation of reverse engineering methods for relational databases. Data & Knowledge Engineering 21, 57–77.
Davis, K.H., 2001. Lessons learned in data reverse engineering. In: Proceedings of the Eighth Working Conference on Reverse Engineering, October. Stuttgart, Germany, pp. 323–327.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.), 1996. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA.
Ghannouchi, S.A., Ghezala, H.H.B., Kamoun, F., 1998. A generic approach for data reverse engineering taking into account application domain knowledge. In: Proceedings of the Second Euromicro Conference on Software Maintenance and Reengineering, March. Palazzo degli Affari, Italy, pp. 21–28.
Hainaut, J., Henrard, J., Roland, D., Englebert, V., Hick, J., 1996. Structure elicitation in database reverse engineering. In: Proceedings of the Third Working Conference on Reverse Engineering, November. Monterey, CA, pp. 131–140.
Henrard, J., Hainaut, J., 2001. Data dependency elicitation in database reverse engineering. In: Proceedings of the Fifth Euromicro Conference on Software Maintenance and Reengineering, March. Lisbon, Portugal, pp. 11–19.
Malki, M., Flory, A., Rahmouni, M.K., 2001. Static and dynamic reverse engineering of relational database applications: a form-driven methodology. In: Proceedings of the International Conference on Computer Systems and Applications (AICCSA'01), June 2001, Beirut, Lebanon.
Mfourga, N., 1997. Extracting entity-relationship schemas from relational databases: a form-driven approach. In: Proceedings of the Fourth Working Conference on Reverse Engineering (WCRE'97). IEEE.
Pedro-de-Jesus, M.L., Sousa, P.M.A., 1999. Selection of reverse engineering methods for relational databases. In: Proceedings of the Third European Conference on Software Maintenance and Reengineering, March. Amsterdam, The Netherlands, pp. 194–197.
Signore, O., Loffredo, M., et al., 1994. Using procedural patterns in abstracting relational schemata. In: Proceedings of the 13th International Conference on Entity-Relationship Approach.
Silberschatz, A., Korth, H., Sudarshan, S., 2002. Database System Concepts, 4th edition. McGraw-Hill.
Snelting, G., 2000. Software reengineering based on concept lattices. In: Proceedings of the Conference on Software Maintenance and Reengineering, February. Zurich, Switzerland, pp. 3–12.
Soutou, C., 1998. Relational database reverse engineering: algorithms to extract cardinality constraints. Data & Knowledge Engineering 28, 161–207.
Tan, H.B.K., Ling, T.W., Goh, C.H., 2002. Exploring into programs for the recovery of data dependencies designed. IEEE Transactions on Knowledge and Data Engineering 14 (4).
Tan, H.B.K., Zhao, Y., 2003. Automated elicitation of inclusion dependencies from the source code for database transactions. Journal of Software Maintenance and Evolution: Research and Practice 15 (6), 79–392.