The Journal of Systems and Software 81 (2008) 764–771 www.elsevier.com/locate/jss
Extracting entity-relationship diagram from a table-based legacy database

Dowming Yeh a,*, Yuwen Li b, William Chu c,*

a National Kaohsiung Normal University, Department of Software Engineering, 116 Hou-Ping 1st Road, Kaohsiung, Taiwan
b National Sun Yat-Sen University, Department of Information Management, 70 Lien-hai Road, Kaohsiung, Taiwan
c Computer Science and Information Engineering, Tunghai University, Taiwan

Received 30 January 2007; received in revised form 24 June 2007; accepted 1 July 2007
Available online 26 July 2007
Abstract

Current database reverse engineering research presumes that the information regarding the semantics of attributes, primary keys, and foreign keys in database tables is complete. However, this may not be the case. In a recent DBRE effort to derive a data model from a table-based database system, we found that the data content of many attributes is not related to their names at all. In this paper, we present a process that extracts an extended entity-relationship diagram from a table-based database with few descriptions for the fields in its tables and no descriptions of keys. The primary inputs of our approach are system display forms, table schemas, and data instances. First, we utilize screen displays to construct form instances. Second, code analysis and data analysis involving comparisons of fields and decomposition of fields are applied to extract attribute semantics from forms and table schemas, followed by the determination of the primary keys, foreign keys, and constraints of the database system. In the final step of conceptualization, with the processes of table mergence and relationship identification, an extended ER diagram is successfully extracted in a case study.
© 2007 Published by Elsevier Inc.

Keywords: Database; Reverse engineering; Entity-relationship diagram; Data model; Case study
1. Introduction

Software reengineering is the modification of the functionalities or structures of a software system in order to improve the quality of the software (Snelting, 2000). Software reengineering includes two parts: reverse engineering and forward engineering, i.e., traditional software engineering. Reverse engineering analyzes the implementation of a legacy system and then abstracts such information into high-level design representations in order to obtain the design specification of the original system. With the specification, the meaning of the source code can be comprehended, and future maintenance or replacement work may be performed more easily.
Most reverse engineering works focus on uncovering the programming logic and functions of application software. Data reverse engineering, on the other hand, concentrates on data model recovery of legacy systems (Aiken, 1996, 1998). In many cases, the major goal of data reverse engineering efforts is to reconstruct the conceptual data model of a database system in the form of an entity-relationship diagram (ERD) or extended ERD (Davis, 2001). Such efforts are therefore also called database reverse engineering (DBRE) (Blaha, 1999). The thoroughness of semantics acquisition is one of the criteria that must be considered when designing a DBRE process. The semantics of a recovered conceptual model must incorporate more domain semantics than mere structural meaning, namely how data entities relate to each other. Therefore, the sources of DBRE should not be limited to the database schema (Chiang et al., 1997). Another way to obtain semantics is through analysis of data instances in a database and of the query and view manipulation statements in program code related to the database.
Data analysis relies on the quantity and quality of the data instances in the database. Advanced techniques such as knowledge discovery and random sampling have been suggested to improve the data analysis process (Fayyad et al., 1996). Although the focus is on the database, recovering a data model may require analyzing source code, program control language, and data description language (Tan et al., 2002). Semantics concerning data dependencies in particular, such as functional dependencies and inclusion dependencies, are often extracted by program analysis (Tan and Zhao, 2003). Domain knowledge of the database system is also important for DBRE (Ghannouchi et al., 1998). Users may be consulted to resolve some of the problems, but the process can be time-consuming; moreover, many design details cannot be observed from the user's point of view. More recently, form analysis has been used to gather information extracted from both form structures and form instances and to generalize such information into database semantics through an inference process (Malki et al., 2001; Mfourga, 1997). A form is any structured collection of variables that are appropriately formatted to communicate with the database, particularly for data retrieval and display. Forms are obtained by examining the output displayed by executing application systems. Finally, the schemas resulting from these analyses need to be integrated into a complete schema.

Much DBRE research, however, is based on relational databases and presumes that the information regarding the semantics of attributes, primary keys, and foreign keys in database tables is complete (Pedro-de-Jesus and Sousa, 1999). Such work acknowledges that this information may not be well documented, but assumes it can be derived directly from the database schema; for example, the name of an attribute could provide the semantics of the data items in the attribute. However, this may not be the case. In a recent DBRE effort to derive a data model from a dBase III database system, we find that although it is also based on tables, it lacks some important characteristics found in relational databases, such as key annotations in the schema and SQL support. Moreover, the data content of many attributes is not related to the names of these attributes at all, and some fields in the database are actually composed of subfields. The database system is maintained by a software vendor outside our organization. Because the system documentation offered by the developer is inadequate and out of date, our technicians were unable to construct a complete database model from which a modern relational database system could be built to replace this legacy system.

Hence, the aim of this paper is to present a complete DBRE experience and techniques for a table-based legacy database that lacks important characteristics of modern relational databases and lacks the documentation needed to resolve the semantics of many database fields. The primary inputs of our DBRE case study are screen displays, which are essentially forms, form instances, and table schemas. Some routines are devised to automate the form analysis process.
Further analyses demanding more human intervention are augmented with sources from the real data and the associated program code. Finally, after the processes of table mergence and relationship identification, we apply some domain knowledge to construct the complete conceptual structure.

Some related work is discussed in Section 2. The general DBRE process is described in Section 3. The techniques that we adopt are detailed in Sections 4 and 5. Section 6 compares our method with other reported techniques, and finally the conclusion is given.
2. Related works

Most of the research in the field is based on using the relational schema as the basic input, aiming at the analysis of individual relations in the schema and the extraction of an entity-relationship schema. However, the maintenance of an existing database depends on the depth of understanding of its characteristics. The semantic information of attributes is viewed as a critical part of the characteristics of a database. It helps to discover how a system works, which often comes with inadequate or non-existent documentation.

There is some research on database reverse engineering using forms (Malki et al., 2001; Mfourga, 1997). Mfourga gathers structural information and constraints among data, extracted statically and dynamically, to construct form model schemas. These schemas, supplemented with low-level database schemas, can help rebuild entity-relationship schemas (Mfourga, 1997). The whole process consists of three steps: (1) acquisition of both structures and instances of forms, (2) extraction of the static aspect of the legacy database, and (3) extraction of the dynamic aspect of the legacy application. The information extracted from both form structures and instances is used as input for database application reverse engineering, with interaction from a user. In that setting the database fields are well named, so the mapping between form fields and database fields is straightforward; moreover, it is not clear how composite database fields would be handled.

The work by Alhajj addresses database reverse engineering by extracting the extended entity-relationship schema from the relational schema (Alhajj, 2002). He developed algorithms that investigate characteristics of an existing legacy database in order to identify candidate keys and foreign keys of all relations, and to decide on the appropriate links between relations. Two options are provided in his implementation: the user may feed the system with the required information if he or she can provide it, or leave the system to recover the characteristics of the relational database with minimum user involvement. However, a relational table schema is the required input for this reverse engineering process. Other, earlier reverse engineering methods require similar constraints, such as tables satisfying third normal form (Pedro-de-Jesus and Sousa, 1999). Our approach, on the other hand, is not subject to such constraints.
3. DBRE process

The dBase system is a table-based database system under DOS. The data structure of dBase is defined in DOS files, and different file name extensions denote different types of information. For example, the extension '.DBF' is for database records and the extension '.NDX' is for database index pointers. In other words, a '.DBF' file is a table that contains a number of fields and all the data entered into the database. There are 58 tables in our database, so certain schemes need to be developed to analyze them systematically. The primary inputs of our DBRE case study are screen displays, which are essentially forms, form instances, and table schemas. We follow the three DBRE steps described in Hainaut et al. (1996):

1. Project preparation.
2. Data structure extraction.
3. Data structure conceptualization.

In the step of project preparation, we utilize screen displays to construct form instances. To automate most of the process, we transplant the original legacy data into a relational database system, SQL Server, so that SQL commands may be applied to analyze the data. Second, code analysis and data analysis involving comparisons of fields and decomposition of fields are applied to extract attribute semantics from the forms and table schemas, followed by the determination of the primary keys, foreign keys, and constraints of the database system. In the final step of conceptualization, with the processes of table mergence and relationship identification, we derive the first-cut entity-relationship schema. Finally, we apply some domain knowledge to construct the complete conceptual structure.

The first step of DBRE is the preparation of system profiles for the DBRE project. Before starting the DBRE work, we have to collect and organize a complete description of all data assets in the database. The description can be divided into three types:

1. Explicit data description language (DDL): for instance, the name, domain, and length of fields in database tables.
2. Implicit structure: for instance, primary keys, foreign keys, constraints, and the program logic for accessing the database.
3. Other parts in programs: such as constraints on data structures in application programs.

We list all the forms and classify them according to their characteristics, and form instances are designed based on the form fields. The designed data are then input through the operation of the dBase system and stored in the legacy database. Finally, all form fields and database fields are listed for data analysis in the following steps, and the legacy data is transformed and loaded into a SQL Server database.
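The transfer of the legacy data into SQL Server can be scripted in several ways. The fragment below is only a sketch of one possibility, assuming the dBase tables are first exported to comma-separated files; the field lengths, the file path, and the restriction to a few fields of StDatNow are assumptions made for illustration, not details reported in the paper.

-- Target table for one legacy dBase table (only a few of its fields shown);
-- no keys are declared, mirroring the legacy schema.
create table StDatNow (
  St_Id varchar(10),
  St_Pass varchar(20),
  St_Dyb varchar(20),
  St_Dyc varchar(10)
)

-- Load the exported records into SQL Server.
bulk insert StDatNow
from 'C:\legacy\StDatNow.csv'
with (fieldterminator = ',', rowterminator = '\n')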
The following points should be noted during the process:

• Values of the related fields should be consistent when we design the instances. For example, if we construct a value for the registration field of a student, we also have to construct instances of his or her course enrollment and other related information.
• Form instances are best designed with the help of users. Users who are familiar with the system and the related information possess rich domain knowledge. If users participate in designing, classifying, and interpreting form instances, their knowledge of the extracted fields will contribute to more accurate classification and interpretation. As a consequence, the data structure extraction that follows will be more complete and accurate.
• The classification and interpretation of extracted fields should be based on domain knowledge. Related data may be dispersed over various display forms for different users. The design of the information classification is somewhat similar to the design of a database, where different data requirements are produced by different users. Classifying the information in this way can reduce the load of designing instances.

In order to automate the extraction procedure, we further develop four SQL Server tables in this phase to store the information of form instances, form fields, database tables, and table fields. Table "Entity" contains the classification information of the forms. Table "Attribute" stores the form fields and the related explanation of the instances. Table "TableSchema" holds the table descriptions of the database, and Table "FieldSchema" stores the information of all the fields in the database tables. The SQL statements creating the four tables are shown in Table 1. The legacy data is then loaded into the SQL Server through a transformation procedure.
Table 1
Schemas for form and table

Create table Entity (
  EntityNo int identity (1,1),
  EntityName varchar (20),
  primary key (EntityNo)
)

Create table TableSchema (
  TableName varchar (20),
  Meaning varchar (30),
  primary key (TableName)
)

Create table Attribute (
  EntityNo int,
  AttributeNo int,
  AttributeName varchar (20),
  TestValue varchar (100),
  primary key (EntityNo, AttributeNo)
)

Create table FieldSchema (
  TableName varchar (20),
  FieldName varchar (20),
  Meaning varchar (30),
  Data varchar (100),
  primary key (TableName, FieldName)
)
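To make the use of these metadata tables concrete, the statements below sketch how one designed form instance and the corresponding legacy fields might be recorded. The captions and test values are those that appear later in Fig. 2; the entity name 'Student data form' and the value EntityNo = 1 are assumptions made only for this illustration.

-- One form (entity) and the captions observed on it, with the unique test
-- values entered through the form during project preparation.
insert into Entity (EntityName) values ('Student data form')
insert into Attribute (EntityNo, AttributeNo, AttributeName, TestValue) values (1, 1, 'Student number', '11111111')
insert into Attribute (EntityNo, AttributeNo, AttributeName, TestValue) values (1, 2, 'Password', '2222')
insert into Attribute (EntityNo, AttributeNo, AttributeName, TestValue) values (1, 3, 'Subject, grade, class', 'ch01101')

-- One legacy table and its fields; Meaning and Data are left empty here and
-- are filled in by the comparison procedures of Section 4.
insert into TableSchema (TableName, Meaning) values ('StDatNow', null)
insert into FieldSchema (TableName, FieldName, Meaning, Data) values ('StDatNow', 'St_Id', null, null)
insert into FieldSchema (TableName, FieldName, Meaning, Data) values ('StDatNow', 'St_Pass', null, null)
insert into FieldSchema (TableName, FieldName, Meaning, Data) values ('StDatNow', 'St_Dyb', null, null)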
4. Data structure extraction

The second step, data structure extraction, extracts the complete database schema. If there is a formal DDL description of the database, this process can be greatly expedited. Otherwise, a fair amount of data analysis, program analysis, and form analysis needs to be performed. Data analysis examines the content of a database to uncover properties such as uniqueness and to test hypotheses such as foreign keys. In this step, the users' domain knowledge and judgment are the decisive elements for semantics. Program analysis provides extra information about integrity constraints. This extraction process comprises three steps: attribute extraction, key extraction, and constraint extraction. The aim of attribute extraction is to extract semantic information for database fields through field comparison, character comparison, data analysis, and code analysis.

4.1. Attribute extraction

Field comparison compares form fields and database fields through instances in order to obtain the true meaning of each database field. This is necessary because the meanings of some database fields cannot be inferred from their names. First, we find the corresponding entries of form fields in database fields, and then use the captions of the form fields and the context of the form to derive the meaning of the database fields. In the process of comparison, form instances are the medium. Since the value of each field in the form instances is different, by comparing values in form fields and database fields we can readily identify the corresponding database fields. In this way, we are able to extract the semantics of most attributes in the system. The rule is formulated as follows:

Let a unique value v be input in a display form under the caption c. If there exists a table t that contains v under a field f, that is, v ∈ Πf(t), where Π is the projection operation of relational algebra, then the meaning of the field f is the same as the meaning of the caption c.
declare @str nvarchar(500), @Tname varchar(20), @Fname varchar(20)

-- Visit every legacy field whose representative value has not been collected yet.
declare field cursor for
  select TableName, FieldName from FieldSchema where Data is null
open field
fetch next from field into @Tname, @Fname
while (@@fetch_status = 0)
begin
  -- Copy a value of the legacy field into FieldSchema.Data for comparison.
  set @str = ' update Fieldschema set Data=' + @Fname + ' from ' + @Tname
  set @str = @str + ' where TableName=''' + @Tname + '''' + ' and FieldName=''' + @Fname + ''''
  exec sp_executesql @str
  fetch next from field into @Tname, @Fname
end
close field
deallocate field

-- A field whose collected value equals a designed test value takes the caption
-- of the corresponding form field as its meaning.
update FieldSchema set Meaning = AttributeName
from Attribute
where FieldSchema.Data = Attribute.TestValue
Fig. 1. Field comparison procedure.
Database table FieldSchema (collected data values):

  TableName   FieldName   Data
  StDatNow    St_Id       11111111
  StDatNow    St_Pass     * MA 2222
  StDatNow    St_Dyb      ch01101

Form table Attribute (designed test values):

  AttributeName           TestValue
  Student number          11111111
  Password                2222
  Subject, grade, class   ch01101

Resulting FieldSchema document:

  TableName   FieldName   Meaning
  StDatNow    St_Id       Student number
  StDatNow    St_Pass     * (not found)
  StDatNow    St_Dyb      Subject, grade, class

Fig. 2. Field comparison example.
The SQL Server procedure implementing the field comparison rule above is shown in Fig. 1. With the tables defined in Table 1, field comparison is performed automatically for our case. The values of FieldSchema.Data and Attribute.TestValue are compared. If the values of the two fields are the same, the two fields match, and Attribute.AttributeName should stand for FieldSchema.Meaning. For example, in Fig. 2, the procedure resolves the meaning of the cryptic database field St_Dyb to be the subject, grade, and class of a student, since the input value, ch01101, is stored in the data field associated with that field in table StDatNow.

After the process of field comparison, there are still some form fields with no corresponding database fields. After further investigation, we find that this is because certain database fields contain data from several form fields. The reason for packing different sources of input data into one field is possibly a quick-and-dirty solution that accommodates additional data items by stuffing them into fields with spare space, thereby avoiding the trouble of restructuring the table schema. Such packing is possible only when the data type of the field is a fixed-size character string. The database field is, as a result, overloaded with more than one meaning, and these meanings may not be related in the sense of a composite attribute in ERD design. That is, there are actually several attributes inside such a field. The previous rule can be extended to handle such overloaded fields as follows:

Let a unique string s be input in a display form under the caption c. If there exists a table t that contains s as a substring under a field f, that is, ∃v ∈ Πf(t) such that s is a substring of v, then the field f is an overloaded field and the meaning of the field f contains the meaning of the caption c.

Therefore, besides simple field comparison, we need to perform character comparison (also implemented in SQL, as shown in Fig. 3), which employs character searching to find the corresponding database fields for these form fields.
declare @str nvarchar(500), @Fname varchar(20), @FData varchar(20)

-- Visit every form field whose test value was not matched by simple field comparison.
declare TestData cursor for
  select AttributeName, TestValue from attribute
  where TestValue not in (select Data from FieldSchema)
open TestData
fetch next from TestData into @Fname, @FData
while (@@fetch_status = 0)
begin
  -- Record the character positions where the test value occurs inside an
  -- overloaded field, together with the caption of the form field.
  set @str = ' update Fieldschema set Meaning=str(patindex(''%' + @FData + '%'', Data))+ ''-''+ str(patindex(''%' + @FData + '%'', Data)+len(''' + @FData + ''')-1)+' + '''' + @Fname + ''''
  set @str = @str + ' where patindex(''%' + @FData + '%'', Data)<>0 and Meaning is null '
  exec sp_executesql @str
  fetch next from TestData into @Fname, @FData
end
close TestData
deallocate TestData
Fig. 3. Character comparison procedure.
For example, the password form field in Fig. 2 does not have a corresponding database field. Character comparison is performed to find out which field's FieldSchema.Data contains '2222'. The matching field is St_Pass, and '2222' is stored in characters 7–10 of the field. Consequently, the database field St_Pass might contain several form fields, one of which is the password.

After the processes of field comparison and character comparison, the semantics of the database fields that can be directly related to captions of form fields through unique form instances are extracted. For those fields that do not store unique form instance values, data analysis and code analysis are applied manually to extract the remaining attributes from the database fields. Data analysis classifies the values of fields based on the existing content of the database and then applies domain knowledge to identify the differences between the resulting categories. By comparing the differences among categories, combined with domain knowledge, we might be able to understand the meaning of the field. If there is insufficient knowledge to resolve the meaning, some code analysis is needed. A code analysis procedure usually searches the source code for the field name in question, analyzes the related processing of the field, and then resolves the meaning of the field (Henrard and Hainaut, 2001; Tan et al., 2002; Tan and Zhao, 2003). Such analysis can be augmented with form fields, for a database field may be related to a form field through some computation. Some database fields that cannot be observed in forms are produced automatically by programs without reference to any other fields. Such fields are considered of no semantic importance in this case study and are discarded in subsequent analyses.
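To illustrate what can be done with the positions recovered by character comparison (this is only a sketch, not one of the procedures used in the case study), an overloaded field such as St_Pass can be decomposed with a SUBSTRING expression once the start position and length of a subfield are known; the offsets below follow the password example above (characters 7–10).

-- Expose the password subfield of the overloaded St_Pass field as a separate
-- attribute; offset 7 and length 4 are taken from the character comparison result.
select St_Id,
       substring(St_Pass, 7, 4) as Password
from StDatNow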
4.2. Key extraction

The second step of data structure extraction is key extraction, which obtains primary keys and foreign keys for the database tables. With the knowledge of primary keys and foreign keys, the referential associations among tables can be constructed. Primary keys and foreign keys are declared in a relational database system, but such information is not explicit in dBase systems.

The dBase system usually maintains an index file to speed up data searching for a table. Usually, primary keys are included in such an index file; therefore, index files are a good starting point for extracting primary keys. If the number of distinct entries in an index file is the same as the row count of the indexed table, the indexed fields can be selected as a candidate key. If there is more than one candidate key, the one containing fewer fields is chosen as the primary key. For example, in our case study there is a table StDatNow with three indexes: the first on St_Id, the second on St_Dyc, and the third on St_Id+St_Dyc. The row count of the table, calculated with a SQL count command, is 4704. The distinct-value counts of the three indexes are 4704, 158, and 4704 for St_Id, St_Dyc, and St_Id+St_Dyc, respectively. Therefore, St_Id and St_Id+St_Dyc are candidate keys, and St_Id is chosen as the primary key because it contains fewer fields. As for tables without an index file, we have to examine the values of each field in the table for uniqueness, as suggested by Alhajj (2002). However, fields containing null values can be eliminated to reduce the effort of checking uniqueness. Finally, the same criterion is followed to pick a primary key.

After the primary keys are established, we can apply them to extract foreign keys in order to identify associations among tables. During this process, every primary key is checked to see whether it is referenced by fields of other tables (Alhajj, 2002); such a referring field is a foreign key. The criterion is that the domains of the referring and referred fields must match and that the values of the referring field must be a subset of those of the referred field. Such criteria can easily be tested by designing appropriate SQL commands.
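The uniqueness and subset criteria above translate directly into SQL. The following sketch uses the case-study names StDatNow, StNua, St_Id, and St_Dyc and assumes the key fields are character strings (as in dBase); it is an illustration of the tests, not the exact scripts used in the project.

-- Candidate-key test: a field (or field combination) is a candidate key when
-- its number of distinct values equals the row count of the table.
select count(*) from StDatNow                          -- 4704 in the case study
select count(distinct St_Id) from StDatNow             -- 4704: candidate key
select count(distinct St_Dyc) from StDatNow            -- 158: not a key
select count(distinct St_Id + St_Dyc) from StDatNow    -- 4704: candidate key, but wider

-- Foreign-key test: every value of the referring field must also appear in the
-- referred field; a count of zero satisfies the subset criterion.
select count(*) from StNua
where St_Id not in (select St_Id from StDatNow)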
4.3. Constraint extraction

The primary objective of constraint extraction is to obtain the association cardinality between primary keys and foreign keys. If a value of the primary key in a table appears in only one record of another table holding the associated foreign key, the mapping cardinality is inferred to be one to one; otherwise, the mapping cardinality is considered one to many. In fact, once the foreign key relationship is established, one only needs to check the uniqueness of the values in the foreign key field to determine the cardinality. The procedure for automating this classification is shown in Fig. 4. If the result of the query is not zero, there exists at least one occurrence in TA whose primary key value matches multiple foreign key values in TB; therefore, it is a one to many mapping. Otherwise, it is a one to one mapping.

select count(*)
from (select s.A
      from TA s join TB n on s.A = n.B
      group by s.A
      having count(*) > 1) as multi

Fig. 4. Constraint extraction process.
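For the case-study example discussed next, the template in Fig. 4 can be instantiated with TA = StDatNow, TB = StNua, and A = B = St_Id. This is a sketch of that instantiation, not the authors' original script.

-- A non-zero count means some StDatNow.St_Id value matches several StNua rows,
-- so the mapping is one to many; a zero count would indicate one to one.
select count(*)
from (select s.St_Id
      from StDatNow s join StNua n on s.St_Id = n.St_Id
      group by s.St_Id
      having count(*) > 1) as multi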
Though such a procedure serves most purposes, cardinality constraints on relationships involving more than two tables should be determined with the algorithm proposed by Soutou (1998). As an example, we apply this procedure to calculate the cardinality between the primary key St_Id of table StDatNow and the foreign key St_Id of table StNua in our case study. Since the result of the query is not zero, we determine that the mapping between StDatNow.St_Id and StNua.St_Id is one to many. At the end of this process, a logical schema of the database takes shape. Such a logical schema may be represented with a graph similar to the RID graph in Alhajj (2002).

5. Data structure conceptualization

The conceptualization step concerns the formalization of the conceptual model and its refinement. For the extended ER diagram, the focus is on the identification of entities and of the relationships among these entities from the logical schema gathered in the previous step. In the process of designing a table-based database, an entity might be transformed into several tables depending on its characteristics. For example, a multi-valued attribute is usually separated from the entity to become another table; we call such a separation a vertical division. Another way of splitting entities, called horizontal division, involves dividing a generalized entity in a generalization hierarchy into different tables representing the specialized entities of that hierarchy. As for a relationship, if its cardinality is many to many, it is usually transformed directly into a table. On the other hand, if the cardinality of a relationship is one to one or one to many, it is generally folded into a table representing the related entity (Silberschatz et al., 2002). Moreover, there are other extended relationships to consider, such as generalization. Hence, table mergence and relationship identification are the essential reverse processes for reconstructing entities and relationships for an extended ER diagram.

The clue for merging tables produced by vertical division is that the attributes comprising the primary key of the separated table form a strict superset of the primary key of the table representing the main entity. The primary key of the former table consists of the primary key of the main entity and the attribute with multiple values. The same characteristics may exist between tables transformed from a weak entity and its associated strong entity. In this case, the primary key of the table representing the weak entity consists of both the key of the strong entity and the key of the weak entity (Silberschatz et al., 2002). However, a weak entity has some other attributes that are not part of the primary key. Therefore, the case of multi-valued attributes may be further distinguished from a weak entity by checking whether the primary key comprises all the attributes of the table. These heuristic rules are formalized as follows:
Given two tables, t1 and t2, let T1, T2 and k1, k2 be their schemas and primary keys, respectively, and let k1a be a strict subset of k1 that is a foreign key referencing k2.

1. If k1 = T1, then t1 may represent a multi-valued attribute of the entity represented by t2.
2. If k1 ≠ T1, then t1 may represent a weak entity of the strong entity represented by t2.

However, domain knowledge may still be needed to decide whether or not to merge those tables, according to the meaning of their attributes and the naming of the entity.

The coincidence of the same primary key in multiple tables provides clues for horizontal division. A horizontal division results from classifying a generalized entity into several specialized entities and transforming each specialized entity into a table. Therefore, the resulting tables comprise overlapping attributes shared by every entity in the hierarchy, including the primary keys, and some disjoint attributes existing only in a certain entity. Reversing the process, we first identify tables containing the same primary key and other similar attributes. Then we analyze whether those candidate tables are divided based on a classification. After the identification, we can merge the qualified tables to produce the generalized entity with the shared attributes and the specialized entities with the disjoint attributes. Note that it is possible to observe tables with the same primary key and no other attributes in common. This may also signify a generalization hierarchy, but one in which the generalized entity as well as its specialized entities are all transformed into tables. There is no need for table mergence here; only the generalization relationship is marked. It is hard to formalize the rule for horizontal division, since recognizing the similarity of attributes requires judgment based on domain knowledge.

The final step of data structure conceptualization is relationship identification, which identifies the relationships between the entities uncovered in the previous step. To recover one-to-one or one-to-many relationships, we again resort to the foreign keys extracted in the data structure extraction step. When a many-to-many relationship is transformed directly into a table, the primary key of the table is usually composed of the primary keys of the two entities associated with the relationship. Hence, if the primary key of a table is composed of two foreign keys referencing two other tables that each represent an entity, the essence of the table is very likely a relationship. More formally, given any three tables, t1, t2, t3, with t2 and t3 representing entities from the previous step, let T1, T2, T3 and k1, k2, k3 be their schemas and primary keys, respectively, and let k1a and k1b be two disjoint strict subsets of k1. If k1a and k1b are foreign keys referencing k2 and k3, respectively, then t1 may represent a relationship between the entities t2 and t3. Note that this rule overlaps somewhat with the previous rules for multi-valued attributes and weak entities. To be precise, the previous rule should be refined to state that there is no other foreign key besides the one discussed.
[Fig. 5. Partial EER: the Student and Course entities linked by the Take relationship, an Obligatory course entity linked to Student by the Satisfy relationship, and an ISA hierarchy specializing Student into Undergrads, Freshman, Graduate, Reserved, Suspended, Dropout, and Transferred.]
Readers are encouraged to reformulate the rule as an exercise. Other foreign keys in a table may imply a many-to-one or one-to-one relationship if the associated table and the table itself represent entities. The naming of the relationships is again determined by domain knowledge.

Now, take our case study as an example. From the analysis of attribute extraction, we find that seven tables have the same primary key and similar attributes, namely StDatNow, StDatNew, StDatRes, StDatOut, StDatTur, StDatGra, and StDatHol. The seven tables are found to form a hierarchy of various kinds of student entities; StDatGra, for instance, stands for the graduate student entity. In identifying relationships, the primary key of table StChk consists of "St_Id" and "Cs_No", which reference the primary key of the student entity and table Grader, respectively. Hence, StChk actually represents a relationship, named Satisfy, between the student entity and the obligatory course entity. Similarly, the primary key of table StNua consists of "St_Id" and "St_Key", where "St_Id" and "St_Key" are the primary keys of the student entity and of Ssf (representing the course entity), respectively. A partially reconstructed extended ER diagram is shown in Fig. 5.

6. Evaluation of DBRE techniques

Chiang et al. proposed three criteria for the evaluation of DBRE methods: completeness and robustness, validation, and performance efficiency (Chiang et al., 1997). Completeness and robustness are evaluated against common problems related to the executable schema and data instances, showing how well the process improves the quality of a DBRE method's result. Validation focuses on information preservation between the source database schema and the target database schema over all database instances, and on whether the result accurately satisfies the DBRE method's purpose. Finally, in order to evaluate the automation of a DBRE process for performance efficiency, the process's operations are classified into fully automated operations and user-assisted operations, indicating how many of the operations can be implemented by a computer-based system.
All the criteria mentioned above are used to evaluate our DBRE techniques. According to the level of automation defined by Chiang et al., our method can be classified into two levels, medium and high, depending on the sources of input and the techniques applied during the data structure extraction phase: a highly automatic method that is based on form analysis only and applies little domain knowledge except for designing form instances, and a semi-automatic method that utilizes all the techniques introduced and requires manual assistance throughout the process.

A small test case consists of 12 database tables in which a total of 36 attributes are supposed to be uncovered by DBRE. Our form analysis extracts 29 attributes automatically, while data analysis and code analysis extract five more attributes from the remaining seven with manual assistance. Hence, we obtain 34 attributes at the semi-automatic level (95%) and 29 attributes at the highly automatic level (81%). For the semi-automatic level, we compare our technique to Signore et al.'s method (Signore et al., 1994), which can handle erroneous data but needs enormous manual assistance from users who have sufficient knowledge about the legacy system and the application domain. Supposing users have sufficient domain knowledge, their method can extract as many attributes as ours, and can find a hidden foreign key in seven tables. However, their method assumes that all attribute semantics are known and that there are no overloaded fields in the database, so it may produce ambiguous field names and therefore a poorly readable ERD. As for the highly automatic level, we compare our technique to Alhajj's method (Alhajj, 2002), which minimizes the dependency on users but assumes clean data and Third Normal Form relations. Although that method is nearly fully automatic, it ignores the issues of unknown attribute semantics, overloaded fields, and hidden foreign keys. Therefore, it can only extract 31 attributes, and the ambiguous attributes also lead to poor readability of the ERD. The comparison is shown in Table 2.

The purpose of the DBRE process for our dBase III system is to rebuild appropriate documentation that describes the system in detail and makes maintenance easier. To validate the DBRE result, we collect confidence data from users and database administrators to verify the accuracy, as suggested by Chiang et al. for validation. After completing the comparison and a questionnaire, we found that there is no significant difference between the two schemas and that the users with domain knowledge have 89% faith in our result. Consequently, our conceptual schema is accurate enough to represent the semantics of the application domain.
Table 2
Comparison between semi-automatic and automatic levels

                      Semi-automatic                                        Automatic
                      Signore et al.'s method   Our method                  Our method                Alhajj's method
Input                 Schema, data, code        Schema, data, code, form,   Schema, form,             Schema (3NF), clean data
                                                form instances              form instances
Attribute semantics   Clear                     Ambiguous                   Ambiguous                 Clear
Attributes extracted  34                        34 (95%)                    29 (81%)                  29
Automation level      Low                       Medium                      High                      High
Primary keys          12                        12                          12                        12
Foreign keys          11                        11                          11                        4
ERD readability       Poor                      Fine                        Fine                      Poor
7. Conclusions

In this paper, we presented a DBRE approach that supports extracting an extended entity-relationship diagram from a legacy database based on tables. The database is not a relational database, and the tables do not have primary keys declared. Moreover, the descriptions for some of the database fields were poor. Most DBRE methods are not applicable under these constraints. For this purpose, it is necessary to understand the semantic information of the attributes in the existing database. Our approach uses display forms, table schemas, and instances as the database reverse engineering input. After analyzing the forms, organized data are input through the forms into database fields. Attribute instances in the legacy database are transplanted into a relational database system, and SQL commands are used to automate most of the analysis work. Various procedures, field comparison, character comparison, data analysis, and code analysis, are applied to determine the semantics of attributes, followed by the identification of primary keys, foreign keys, and cardinality constraints. In the conceptualization phase, entities are determined through table mergence. Finally, relationships are identified, and an extended entity-relationship diagram is reconstructed. A case study is presented along with the DBRE process, and some of the analysis results are shown.

References

Aiken, P.H., 1996. Data Reverse Engineering: Slaying the Legacy Dragon. McGraw-Hill.
Aiken, P.H., 1998. Reverse engineering of data. IBM Systems Journal 37 (2), 246–269.
Alhajj, R., 2002. Extracting the extended entity-relationship model from a legacy relational database. Information Systems. Elsevier Science Ltd.
Blaha, M., 1999. An industrial example of database reverse engineering. In: Proceedings of the Sixth Working Conference on Reverse Engineering, October. Atlanta, Georgia.
Chiang, R.H.L., Barron, T.M., Storey, V.C., 1997. A framework for the design and evaluation of reverse engineering methods for relational databases. Data & Knowledge Engineering 21, 57–77.
Davis, K.H., 2001. Lessons learned in data reverse engineering. In: Proceedings of the Eighth Working Conference on Reverse Engineering, October. Stuttgart, Germany, pp. 323–327.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.), 1996. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA.
Ghannouchi, S.A., Ghezala, H.H.B., Kamoun, F., 1998. A generic approach for data reverse engineering taking into account application domain knowledge. In: Proceedings of the Second Euromicro Conference on Software Maintenance and Reengineering, March. Palazzo degli Affari, Italy, pp. 21–28.
Hainaut, J., Henrard, J., Roland, D., Englebert, V., Hick, J., 1996. Structure elicitation in database reverse engineering. In: Proceedings of the Third Working Conference on Reverse Engineering, November. Monterey, CA, pp. 131–140.
Henrard, J., Hainaut, J., 2001. Data dependency elicitation in database reverse engineering. In: Proceedings of the Fifth Euromicro Conference on Software Maintenance and Reengineering, March. Lisbon, Portugal, pp. 11–19.
Malki, M., Flory, A., Rahmouni, M.K., 2001. Static and dynamic reverse engineering of relational database applications: a form-driven methodology. In: Proceedings of the International Conference on Computer Systems and Applications (AICCSA'01), June 2001, Beirut, Lebanon.
Mfourga, N., 1997. Extracting entity-relationship schemas from relational databases: a form-driven approach. In: Proceedings of the Fourth Working Conference on Reverse Engineering (WCRE'97). IEEE.
Pedro-de-Jesus, M.L., Sousa, P.M.A., 1999. Selection of reverse engineering methods for relational databases. In: Proceedings of the Third European Conference on Software Maintenance and Reengineering, March. Amsterdam, The Netherlands, pp. 194–197.
Signore, O., Loffredo, M., et al., 1994. Using procedural patterns in abstracting relational schemata. In: Proceedings of the 13th International Conference on Entity-Relationship Approach.
Silberschatz, A., Korth, H., Sudarshan, S., 2002. Database System Concepts, 4th edition. McGraw-Hill.
Snelting, G., 2000. Software reengineering based on concept lattices. In: Proceedings of the Conference on Software Maintenance and Reengineering, February. Zurich, Switzerland, pp. 3–12.
Soutou, C., 1998. Relational database reverse engineering: algorithms to extract cardinality constraints. Data & Knowledge Engineering 28, 161–207.
Tan, H.B.K., Ling, T.W., Goh, C.H., 2002. Exploring into programs for the recovery of data dependencies designed. IEEE Transactions on Knowledge and Data Engineering 14 (4).
Tan, H.B.K., Zhao, Y., 2003. Automated elicitation of inclusion dependencies from the source code for database transactions. Journal of Software Maintenance and Evolution: Research and Practice 15 (6), 79–392.