Domain model-driven software reengineering and maintenance


J. SYSTEMS SOFTWARE 1993; 20:37-51

Stan Jarzabek
CSA Research Pte., Ltd., and Department of Information Systems and Computer Science, The National University of Singapore

This article describes a method for reengineering business systems. The method’s objective is to produce a library of reusable software components based on the analysis of existing code. The library contains conceptual data models, procedural pseudocode, screens, and reports that can be fed into a CASE repository and code generators. The domain model plays a central role in this approach. By grouping procedures around data model entities and attaching procedures to screens and reports, the domain model creates an object-oriented view of a system. In business applications, many domain objects correspond to entities in a conceptual data model. Therefore, we first reconstruct a data model and then derive the first-cut domain model from it. Data definitions in a program are directly linked to domain objects. The programmer inspects a program via the domain model. Combinations of pattern matching, program slicing, and other static program analysis techniques help the programmer to identify and recover procedural code related to system functionalities. Recovered design abstractions are linked to the domain model and form a library of reusable components. By integrating the original system with the library, the domain model ensures that software components recovered from code are kept up to date with changes made to the system during ongoing maintenance.

1. INTRODUCTION

Current software methodologies concentrate on development of new systems. But, in reality, few systems are developed from scratch; rather, existing programs are modified and extended to meet new requirements. Market research shows that companies spend 70% of their budgets on maintaining existing software systems [1, 2].

Address correspondence to Dr. Stan Jarzabek, Dept. of Information Systems and Computer Science (DISCS), The National University of Singapore, Lower Kent Ridge Road, Singapore 0511.

© Elsevier Science Publishing Co., Inc., 655 Avenue of the Americas, New York, NY 10010. 0164-1212/93/$5.00

Among the main problems that DP departments face are user backlog, inability to merge old applications with new technology (both hardware and software), and inability to integrate old applications into strategic information systems required today [3]. The first issue can be addressed by providing maintenance methodologies and tools that reduce the cost of day-to-day maintenance of applications. This is a short-term goal. The next two issues are of strategic importance and target a long-term solution, which is to migrate old applications to new architectures (new computers and languages, CASE, relational data bases, etc.) and integrate functionalities of existing systems into the strategic information systems required today. Possible strategies include using conversion technology (e.g., translators from COBOL 74 to COBOL 85), reimplementing from scratch, or reengineering, i.e., redevelopment with reuse of information encoded in old systems.

As far as terminology is concerned, we shall follow definitions proposed by Chikofsky and Cross [4]: “reengineering, also known as both renovation and reclamation, is the examination and alteration of a subject system to reconstitute it in a new form and subsequent implementation of the form.” We understand that in view of this definition, reengineering covers a wide spectrum of tactics ranging from conversion technology to redevelopment, based on information (design or code) from existing programs. Reengineering may have ambitious objectives and may involve all of the following: system migration, improving the design, and adding new functionalities to a system. We shall use the word “reengineering” in this sense.

In the reengineering scenario, code is “cleaned,” and then reverse engineering techniques are used to transform it into higher-level specifications which, after modifications, can be used to rebuild new applications, possibly using forward engineering techniques. Ideally, we would like to generate applications for multiple platforms from a single copy of logical system specifications. Whether system reengineering proves to be a feasible and cost-effective solution to today’s IT problems depends much on the level of automation we can provide. We need a methodological framework as well as tools to support reengineering processes.

The type of system (such as scientific system, business application, or compiler) and the objectives of a software improvement program have much to do with defining the right reengineering approach. Scientific programs with complex process logic and relatively simple data require a different set of techniques than data processing applications. Our project is concerned with business applications. We distinguish three groups of issues which collectively form a reengineering framework for business systems, namely:

• information strategy planning
• a methodology for defining an optimal reengineering strategy
• a methodology (i.e., techniques and tools) supporting reengineering processes.

During information strategy planning [5], we identify business objectives, critical success factors, and main areas of business operations. Based on this information, we can decide what information systems should be built to support business operations. Business objectives may change in the future. The deliverables of information strategy planning, in particular, a strategic data model, enable us to ensure the stability and maintainability of systems in the future. Information strategy planning defines the target for eventual reengineering efforts. The role of information strategy planning in system migration and reengineering is comprehensively treated in Haine [6].

Having identified strategic business systems needed in an organization, we need to define the most cost-effective way of implementing them. We must assess the value of currently used systems, evaluate the quality of code, take into account availability of system documentation, etc. Based on the analysis of such factors, we can decide on the overall strategy (i.e., whether we should maintain programs as they are, rewrite programs, reengineer, or, perhaps, use conversion tools just to port existing code to another platform) and decide which specific reengineering techniques and tools are most appropriate for the task in hand [6, 7].

Finally, we need a methodology supporting system reengineering. This includes techniques such as restructuring, reverse engineering, design recovery, program analysis, code generation, and testing. The project described in this article is concerned with a reengineering methodology. Our starting point is the situation when a company has already decided to reengineer some of the crucial applications; now the question is how to do it.

Many authors agree that we cannot understand a program without intimate knowledge of the application’s domain [8-10]. Also, difficult problems of reusability seem to have more realistic solutions within specific domains rather than in general [11]. Program understanding and reusability are key issues in system reengineering. To better support the reengineering process, we should capture some of the domain knowledge and express it within a formally built domain model.

In our approach, maintenance and reengineering are guided by a domain model. User-level concepts form the core of a domain model. Concepts may refer to objects (e.g., EMPLOYEE and SALARYRECORD in payroll systems), procedures (how is an EMPLOYEE salary computed?), business rules (under which conditions is an EMPLOYEE not entitled to a bonus?) and transactions (a bonus should be paid at the end of a year). Within a domain model, concepts are represented by structures that help programmers understand how a given concept has been implemented. By integrating the domain knowledge with program knowledge, a domain model may serve the following purposes:

• the domain model forms a conceptual level window that explains all known aspects and features of the system (i.e., known requirements, design abstractions, implementation, etc.);
• the domain model specifies which system features are yet to be recovered from code (i.e., analyzed, understood, documented, extracted from the old system, etc.);
• the domain model provides links to a library of specification components and code which can be reused in building similar systems;
• the domain model forms a bridge between ongoing maintenance and system reengineering.

What issues do we address in domain modeling?

The domain model must have enough expressive power to highlight all aspects of a real situation relevant to understanding and maintaining systems. Within the domain model, we talk about business operations, business rules, environmental objects involved in business operations (people, documents, etc.) and relationships among them. We also talk about computer programs that support business operations and users of these programs. As far as computer programs are concerned, we are interested in issues such as the role of a program in conducting business operations, which objects and operations are pertinent to various programs, how users interact with programs, and which functionalities should be available to various classes of users. It is difficult to understand a business system without understanding the system’s user interfaces. Although the domain model concentrates on concepts, we also include some crucial design elements, such as screens and reports, in it. We devise the domain model in such a way that we can conveniently link the design information (and eventually code) to the domain model, so that the domain model can play the role of a guide and integrator of all the knowledge pertinent to program maintenance and reengineering.

How do we build a domain model?

Our basic vocabulary for modeling business systems consists of objects, relationships, procedures, business rules, and transactions. We call them features because of some similarity between our domain model and software features described in Heart and Shiling [12]. Consider a library system such as one described in Baab and Kieburtz [13]. Examples of objects are a book and a borrower. Certain relationships may hold between objects, for example, a book may be borrowed by a certain borrower. Objects are described by attributes and by any other features (for example, procedures) related to them. A procedure describes a meaningful action that may involve one or more objects, e.g., purchasing a new book, checking out a copy of a book, or computing the total money spent by a library on purchasing computer science books in 1990. Business rules express conditions to be obeyed during system operation or specify procedures to be activated when a certain condition arises, for example, “if a book is overdue more than a week, send a reminder to borrower.” Transactions are business rules that are enforced at certain time intervals or when certain events occur, e.g., “at the end of each working day, print a list of all currently borrowed books” or “when a user returns a book, check whether or not he/she has any overdue books.” Each feature type (i.e., object, procedure, etc.) has a description template associated with it that specifies how a given feature is to be documented.

Domain features are organized into an inheritance hierarchy. Predefined classes (object, procedure, business rule, and transaction) constitute the top-level classes of the inheritance hierarchy, and other features are derived from them. Inheritance simplifies definitions of features by reducing redundancies. Inheritance is also important in resolving queries and patterns expressed in terms of the domain model. For example, the search space for a query referring to PERSON (Figure 1) includes PERSON and all the objects derived from PERSON.

Figure 1. Classification of domain model features. (Node labels in the diagram: top, object-class, PERSON, USER, design-class, screen/report, procedure-class, method-class, business rule, transaction, other.)

An entity-relationship (conceptual) data model [14] contains much of the domain knowledge: entities, attributes, and relationships in a data model reflect sound domain-level concepts. Therefore, we build a first-cut domain model on top of the data model. The domain model is initially incomplete. It is refined in the process of application domain analysis and further evolves as maintenance and reengineering progress. Application domain analysis may involve study of system requirement documents and interviews with domain experts. By maintaining explicit links between a domain model and a data model, we gain a number of advantages:

• we can include state-of-the-art methodologies for system construction in the reengineering cycle;
• we can automate more front-end reengineering processes, as well as code generation;
• a domain model evolving around well-known notions may be easier for programmers to accept;
• it is easier for a programmer to do domain modeling by completing and refining some predefined structures than to start from scratch.

The domain model is explained in terms of (and is directly linked to) a design model. The design model contains design abstractions derived from code such as structure charts, control flow graphs, program slices, etc. (Program slices represent executable code segments that approximate functionalities of a program [15].) Many of these design abstractions can be extracted automatically from code using reverse engineering techniques. Conceptual design abstractions, such as design decisions, are reconstructed manually or semiautomatically. As design abstractions have direct links to code structures (e.g., boxes in structure charts are linked to corresponding routines), we can trace information from the domain model concepts down to code (Figure 2).

Figure 2. Domain model-driven system reengineering. (Labels in the diagram: domain model, design model, structure chart, data model, code.)

It is well known that most existing systems are undocumented; often, code is the only definition of what a program does. Our notations for the domain model allow one to specify unknown system features yet to be analyzed, understood, recovered, and documented during system maintenance and reengineering. The objective is to extract functionalities from the existing system and convert them into reusable components suitable for regeneration of a new system. One of the problems with applying the reengineering scenario in practice is that reengineering is a long process and, at the same time, the systems have to be maintained. The potential danger is that the specifications and reusable components recovered from code become inconsistent with the base system. The domain model creates a formal link between old code and artifacts of reverse engineering processes, so that the impact of changes can be traced and potential inconsistencies detected.

Today, state-of-the-art methodologies for business systems such as information engineering [5] are data driven. CASE tools support data-driven methodologies, and data-driven code generators produce executable applications from design specifications. Data-driven software construction leads to stable, maintainable systems. To be consistent with current trends, we decided to use data-driven forward engineering techniques in the reengineering cycle. The domain model based on a conceptual data model forms a natural bridge between reverse engineering processes and data-driven forward engineering techniques.

Data-driven methods have two important benefits. First, by using the same data model, systems can communicate with one another. This helps in system integration. The second benefit comes with improved code generation technology: based on the analysis of a data model, some of the standard functionality of a system (such as screens, data validations, and dialog navigations) can be obtained automatically. This reduces the implementation effort and simplifies the developer’s perception of a system. But data-driven methods preserve a dual view of a system, as far as data and functions are concerned. Lack of a formal link between data and functions makes it difficult to reuse functions and ensure uniformity of functions across systems. This has a negative impact on maintenance. The object-oriented approach unifies functions with data and in that way alleviates some of these problems. Already today we observe evolution of data-driven methods toward object orientation [16, 17].

Our domain model provides an object-oriented view of a system, as structures representing concepts group together procedures and rules pertinent to entities. We designed the domain model in that way not only because we feel that it is a natural way to present information about a system, but also to be prepared for the transition to object-oriented forward engineering techniques and tools, which are likely to evolve soon. In the rest of the article we describe our reengineering method and some of the techniques related to our approach.

2. A DOMAIN MODEL-DRIVEN REENGINEERING SCENARIO

The decision to use data-driven forward engineering techniques together with the concept of a domain model suggests the following domain model-driven reengineering scenario (Figure 3):

1. Rationalize and standardize the names of data elements across programs.
2. Capture data definitions and recreate the company-wide conceptual data model.
3. Verify the data model against the results of strategic information planning and modify the data model to be consistent with the strategic data model.
4. Derive the first-cut domain model from the conceptual data model (roughly speaking, this is done by mapping each entity into a domain object).
5. Restructure code (optional) and compute process logic design abstractions (e.g., structure charts, program control flow graphs, etc.).
6. Recover and respecify procedures, business rules, and transactions; attach them to related domain objects to form an object-oriented view of a system.
7. Generate the relational data model and design screens and reports.
8. Recover and respecify procedures related to screens and reports.
9. Reuse the information aggregated within the domain model to formulate a specification for a new system.
10. Complete the reengineering cycle by regenerating a new system using a data-driven (in future, object-driven) code generator.

Figure 3. System reengineering process model. (Labels in the diagram: strategic data plan; processes, screens; a library of reusable components; generate code; new systems.)

2.1 The Starting Point

Statistics show that most old systems are poorly structured, contain many patches, use outdated data bases, have data names that are inconsistent across programs, etc. [1, 2]. When maintaining systems in their current state becomes virtually impossible, reengineering is one option. We start reengineering with the analysis of data because data is central in understanding business systems and because we want to rebuild systems in a data-driven way.

2.2 Recreating a Conceptual Data Model

To build a domain model and reimplement systems using data-driven methods, we need a stable, robust data model. Analysis of data definitions extracted from old programs can help in building a data model. But we also have to do strategic data planning and company-wide data modeling exercises. A robust data model will result from these activities. Tools have been developed to help a programmer in both reverse engineering of data and conceptual modeling [18-20]. Some of these tools read data definitions from programs and use expert knowledge to transform data definitions into a conceptual data model. As this transformation cannot be fully automated, additional input is obtained from a business analyst during an online question-answer session. The conceptual data model recreated from data definitions is further modified to meet the requirements of the strategic data plan. From normalized data models, we can generate relational tables and data base schemas for a number of commercial data base systems. With normalized data models, we can use CASE screen painters and report generators to redesign user interfaces. During reverse engineering of data it is important to maintain links between entities and corresponding file structures and between attributes and corresponding data elements. As we shall see in the following sections, these links enable us to provide high-level query facilities for inspecting and analyzing code at the conceptual level.
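The links just described can be pictured as a small lookup structure. The following Python sketch is only an illustration under our own assumptions: the class names, the helper `data_elements_for`, and the COBOL-style data names (`BK-TITLE`, `BOOK-MASTER-FILE`, etc.) are hypothetical and do not come from any of the tools cited above. It shows how a conceptual name (an entity or one of its attributes) could be resolved to the program-level names that a query facility would then search for.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntityLink:
    """Links one conceptual entity to the file structures and data elements behind it."""
    entity: str
    file_structures: List[str]
    # attribute name -> data elements (program-level names) that implement it
    attribute_elements: Dict[str, List[str]] = field(default_factory=dict)

    def link_attribute(self, attribute: str, data_elements: List[str]) -> None:
        self.attribute_elements[attribute] = list(data_elements)


def data_elements_for(links: Dict[str, EntityLink], entity: str,
                      attribute: Optional[str] = None) -> List[str]:
    """Resolve an entity (or a single attribute of it) to program data names."""
    link = links[entity]
    if attribute is not None:
        return sorted(link.attribute_elements[attribute])
    names = set()
    for elements in link.attribute_elements.values():
        names.update(elements)
    return sorted(names)


# Hypothetical links for the library example (all data names are invented).
book = EntityLink("BOOK", file_structures=["BOOK-MASTER-FILE"])
book.link_attribute("Title", ["BK-TITLE", "WS-TITLE"])
book.link_attribute("Status", ["BK-STAT"])
links = {"BOOK": book}

print(data_elements_for(links, "BOOK", "Title"))
print(data_elements_for(links, "BOOK"))
```

Keeping the mapping explicit in both directions of abstraction is what lets a question posed about BOOK be answered in terms of the fields that actually occur in the code.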

2.3 Recovering Domain Objects

We build the first-cut domain model by mapping each entity into a domain object, each entity attribute into an object attribute, and each entity relationship into an object relationship. If entity subtyping is used, then subtyping is mapped into object inheritance. Cases for object inheritance can also be identified by analyzing entities and attributes. The domain model must reflect two, not necessarily identical, data models: the original data model recreated from programs and the refined data model, conforming to a strategic data model, which is used in construction of new systems. There should be no substantial differences as far as entities are concerned, but some entity attributes may differ from one model to another. Here is a description template for domain objects (keywords are in bold face type; double slash marks indicate comments):

domain object-entity BOOK { // this domain object corresponds to entity BOOK
  informal description: // rationale, design constraints, etc. are described here
  attributes: // entity attributes derived from the original data model M and from the refined data model M’ are listed here:
    common: // entity attributes common to M and M’ are listed here:
      Author
      Title
      #copies // number of copies
      Status // borrowed or available
      ...
    original: // entity attributes that appear in M but do not appear in M’ are listed here:
      Location // at the time when the system was originally designed, the library was located in two different buildings; now, the whole library is located in one place
    refined: // entity attributes that appear in M’ but do not appear in M are listed here:
      #checkouts // number of times a book has been checked out
  known features: // a list of references to features already recovered from code:
    procedures:
      RegisterNewBook( )
      CheckOut( )
      CheckIn( )
      IsReserved( )
      ...
    business rules:
      There should be no more than 10 copies of any title in a library.
      If a book is overdue more than one week, send a reminder to a borrower.
      If a book is not borrowed for 5 consecutive years, remove the book from the library.
    transactions:
      When a user returns a book, inform him/her about any other overdue books he/she may have.
  other features: // any other features that help to bridge user’s view with the program view are listed here
    Reserved: What does it mean that a book is reserved? For example, we may find the following explanation:
      • a reserved book can only be borrowed by a person who reserved it
      • a book cannot be reserved for more than a week
  screens and reports: // references to screens and reports related to the object are listed here
  unknown features: // features to be analyzed and recovered are listed here:
    procedures:
      SortByTitle( )
    business rules:
      Is there a limit to a number of books in a given discipline that may be available at the reference desk?
      Can more than one copy of a given title remain on the reserved shelf?
      Under what circumstances can a book be removed from the library?
}

Attributes are listed in three sections, namely common, original, and refined, so that we can tell the original data model (i.e., existing systems view) from the refined data model (i.e., the library of reusable components and new system view). Domain objects are linked via the original data model to related files and data definitions in existing systems. Procedures, business rules, and transactions already recovered are listed in the known features section. Procedures are documented by procedure templates which will be described later. The other features section describes any other concepts that help explain how a given domain object is reflected and manipulated in a program. Whenever possible, informal features are explained in terms of attributes, already recovered procedures, and business rules, and in terms of the design model (e.g., by pointing to program modules responsible for handling various aspects of a given feature).

Recovery of unknown features is done by analyzing code related to a given feature, respecifying it, and converting it into a reusable component. Reusable components are data models (both conceptual and relational) with associated procedures and global procedures. Depending on the objectives, procedures may be either in the form of executable (e.g., COBOL) programs or in the form of logical level, language-independent pseudocode. Based on data models and with the help of CASE tools, we can also produce screens and reports, recover procedures related to them, and store them in the library. Domain objects contain all the necessary information to enable a code generator to produce data base schemas and data declarations, and to convert pseudocode into code. Reusable components can be viewed as data abstractions assembled from data and procedures related to a given domain object.

Domain objects described so far have been derived from a conceptual data model. In general, domain objects aggregate information about any domain or design concepts. The role of a domain object is to help a programmer understand how a given concept has been implemented. This information helps a programmer in future maintenance of a program. In particular, it is useful to document explicitly delocalized concepts [2]. (A delocalized concept spans a range of program modules.) Apart from informal descriptions, domain objects characterize concepts in terms of other domain objects, data, code patterns, slicing criteria, program modules, and any other information that may help programmers to understand a concept and to trace its implementation. In the example of the object-entity BOOK, procedures, business rules, and transactions are described informally.
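To make the mapping from entities to domain objects concrete, the following Python sketch derives a first-cut domain model from two data models, M (original) and M’ (refined), and fills the common, original, and refined attribute sections by set operations. It is a minimal illustration under our own assumptions, not the notation or tooling of the method itself: the classes `Entity` and `DomainObject`, the functions `first_cut_domain_model` and `search_space`, and the BORROWER subtype of PERSON are hypothetical. The second function illustrates how entity subtyping, once mapped into inheritance, widens the search space of a query to all derived objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entity:
    """An entity of a conceptual data model (hypothetical, simplified)."""
    name: str
    attributes: Set[str]
    supertype: Optional[str] = None


@dataclass
class DomainObject:
    name: str
    parent: Optional[str]          # entity subtyping mapped into inheritance
    common: Set[str]               # attributes in both M and M'
    original: Set[str]             # attributes only in M
    refined: Set[str]              # attributes only in M'
    known_features: List[str] = field(default_factory=list)
    unknown_features: List[str] = field(default_factory=list)


def first_cut_domain_model(m: Dict[str, Entity],
                           m_refined: Dict[str, Entity]) -> Dict[str, DomainObject]:
    """Map each entity into a domain object and split its attributes into sections."""
    objects = {}
    for name in sorted(set(m) | set(m_refined)):
        old, new = m.get(name), m_refined.get(name)
        a = old.attributes if old else set()
        b = new.attributes if new else set()
        source = new or old
        objects[name] = DomainObject(name, source.supertype, a & b, a - b, b - a)
    return objects


def search_space(objects: Dict[str, DomainObject], root: str) -> Set[str]:
    """Return root together with every domain object derived from it."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for obj in objects.values():
            if obj.parent in found and obj.name not in found:
                found.add(obj.name)
                changed = True
    return found


# Library example; BORROWER as a subtype of PERSON is an assumption of this sketch.
M = {
    "BOOK": Entity("BOOK", {"Author", "Title", "#copies", "Status", "Location"}),
    "PERSON": Entity("PERSON", {"Name"}),
    "BORROWER": Entity("BORROWER", {"CardNo"}, supertype="PERSON"),
}
M_REFINED = {
    "BOOK": Entity("BOOK", {"Author", "Title", "#copies", "Status", "#checkouts"}),
    "PERSON": Entity("PERSON", {"Name"}),
    "BORROWER": Entity("BORROWER", {"CardNo"}, supertype="PERSON"),
}
objects = first_cut_domain_model(M, M_REFINED)
print(objects["BOOK"].original, objects["BOOK"].refined)
print(search_space(objects, "PERSON"))
```

The known and unknown feature lists start empty here because, as described above, the first-cut model is initially incomplete and is filled in as recovery proceeds.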
Informal specifications are important because they appeal to intuition, but they lack precision and lead to ambiguities. We are exploring some elements of formal specifications for procedures and business rules based on notations proposed in SF language by Berztiss [21] and McBrien et al. [22].

2.4 Recovering Procedural Code

By a procedure, we mean program statements related to some well-defined functionality of a program. A procedure may represent data base operations, data entry validation rules, or any other meaningful computations (such as computation of an employee’s salary in a payroll system). Analyzing, understanding, and recovering the procedural code is a challenge: in old programs, procedures most often are delocalized (i.e., they do not form syntactically cohesive code segments). We try to reduce the complexity of this task by gradually limiting the amount of code a programmer has to deal with at one time. The domain model helps a programmer to focus attention on one feature at a time. Code related to features is extracted and further transformed to the logical level in a semiautomatic way. Tools help a programmer to find the right information and interpret it. A programmer interacts with tools via the domain model.

Not all procedures must be recovered. We recover only complex computations (simple ones can simply be reimplemented). Computations related to physical implementation and platform dependencies are isolated rather than recovered. For example, if the original system uses a hierarchical data base, then much code will relate to data base accesses. Of course, this code would be irrelevant to new systems implemented using a relational data base. Identifying and filtering physical code is especially important if our objective is to produce specifications from which we can generate code for multiple target platforms.

Recall that we can trace data definitions in programs from domain objects. This traceability helps in program understanding, but it is also crucial in recovery steps that involve analysis and respecification of procedures. A programmer identifies procedures in code by querying program information, searching for code patterns, slicing the program, inspecting related design abstractions (e.g., structure charts), and by using other static program analysis methods. Each recovery step is supposed to produce a better approximation of the computation in question. At the end, all the code related to a functionality in question (and only code related to this functionality) is extracted.
After some clean-up, code is entered into a tool called a design recovery assistant (DRA). The DRA helps a programmer to further analyze and respecify the process logic. Procedures are documented in the following way:

procedure CheckOut (BOOK b) {
  informal description: // rationale, design constraints, etc.
  objects: // domain objects involved in the procedure are listed here
    BOOK, BORROWER
  preconditions: // conditions that must hold before a procedure is executed
    ¬ IsReserved
  postconditions: // conditions that hold just after a procedure has been executed
    ¬ IsAvailable
  design abstractions: // links to design abstractions related to a procedure are specified here
    modules: M1, M2 // an explicit list of modules
    patterns: // patterns identifying related modules
      Module M; find M such that Ref(M,BOOK)
  code views: // links to code related to a procedure are specified here
    patterns: // code patterns identifying certain statements in a program are listed here
      Statement s; find s such that WriteToFile(BOOK)
    slicing criteria: // slicing criteria for identifying code related to a procedure are listed here
      slice(s, BOOK.Status)
}

Recovered procedures are placed in a library of reusable components. We also attach to each procedure a set of patterns and slicing criteria that identify code related to a given procedure (explained in more detail in the next section).

2.4.1 Identifying code related to procedures. We will now illustrate how we can help a programmer identify code related to a given procedure. Having selected a procedure P (or a group of related procedures listed in the unknown features section of a domain object), we identify domain objects that may be involved in P. Say we reengineer programs computing taxes and we are interested in a procedure that computes taxes for nonresidents. We may suspect that this procedure involves objects such as PERSON, TAX-FORM, INCOME-RECORD. (It may be that we already have NON-RESIDENT as a special type of PERSON. If not, we may consider including it in the inheritance network of domain objects.) From the identified objects we obtain a set of data definitions, D(P), which are potentially involved in computation of P. We use set D(P) to zoom into code segments related to procedure P. First, we may want to find out which program modules are related to TAX-FORM and NON-RESIDENT:

  Module M; find M such that Ref(M,TAX-FORM) & Ref(M,NON-RESIDENT)

The result of this query is all the modules that reference data related to objects TAX-FORM and NON-RESIDENT. (The query system and pattern resolution mechanism are discussed in section 2.7.) Program modules can be displayed in graphical form, using CASE editors [23], with highlighted boxes corresponding to the modules in question. We can further reduce the number of modules by specifying that we are interested only in modules that write information to a file connected to object TAX-FORM.

Program modules identified in that way may still contain much code that is irrelevant to the computation in question. To find a better approximation of the computation, we use techniques of pattern matching and program slicing. First, we try to find program points that play an important role in the computation. Continuing our example with taxes, important points may involve the statements that assign the amount of tax to be paid by a nonresident to a certain field in a tax form record. We can express this in terms of a domain model pattern:

Statement s1, s2; find s1 such that AssignTo(s1, TAX-FORM.TotalTax) & exists s2 such that Write(s2, TAX-FORM.TotalTax) & Affect(s1, s2)

The above pattern specifies that we want to find all statements s1 that assign a value to data related to TAX-FORM.TotalTax, such that the value computed in s1 is subsequently written (i.e., printed to a file or displayed). Having identified program points of interest, we can try to find other statements in a program that have to do with computing the tax. The next step is to extract from a program only those statements that really affect the value of attribute TAX-FORM.TotalTax for nonresidents. This is done by computing program slices with respect to statements that match our pattern. The amount of code extracted from a program can be further reduced by specifying that we are really interested in slices that involve nonresidents (i.e., which involve data structures linked to NON-RESIDENT objects).
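The pattern-then-slice procedure described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the DRA's implementation: the program is straight-line code (no branches or loops), each statement is reduced to the variables it defines and uses, and the names `Stmt`, `DATA_LINKS`, `PROGRAM`, and all variable names are hypothetical stand-ins for the links between domain objects and data definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    sid: int                          # statement number
    kind: str                         # "read", "assign", or "write"
    defs: frozenset = frozenset()     # variables defined by the statement
    uses: frozenset = frozenset()     # variables used by the statement


# Hypothetical links from domain-model features to program variables.
DATA_LINKS = {
    "TAX-FORM.TotalTax": {"total_tax"},
    "NON-RESIDENT": {"nr_income", "nr_rate"},
}

# Hypothetical straight-line program (an assumption of this sketch).
PROGRAM = [
    Stmt(1, "read", defs=frozenset({"nr_income"})),
    Stmt(2, "read", defs=frozenset({"res_income"})),
    Stmt(3, "assign", defs=frozenset({"nr_rate"})),
    Stmt(4, "assign", defs=frozenset({"res_tax"}), uses=frozenset({"res_income"})),
    Stmt(5, "assign", defs=frozenset({"total_tax"}),
         uses=frozenset({"nr_income", "nr_rate"})),
    Stmt(6, "write", uses=frozenset({"total_tax"})),
    Stmt(7, "write", uses=frozenset({"res_tax"})),
]


def backward_slice(program, criterion, variables):
    """Statements up to `criterion` that may influence `variables` there."""
    needed = set(variables)
    sliced = []
    start = program.index(criterion)
    for stmt in reversed(program[: start + 1]):
        if stmt.sid == criterion.sid or stmt.defs & needed:
            sliced.append(stmt)
            needed -= stmt.defs   # this definition satisfies the need
            needed |= stmt.uses   # but its operands are now needed
    return list(reversed(sliced))


def assign_to(stmt, feature):
    return stmt.kind == "assign" and bool(stmt.defs & DATA_LINKS[feature])


def write(stmt, feature):
    return stmt.kind == "write" and bool(stmt.uses & DATA_LINKS[feature])


def affect(program, s1, s2):
    """s1 affects s2 if s1 lies in the backward slice taken at s2."""
    if s1.sid == s2.sid:
        return False
    return s1.sid in {s.sid for s in backward_slice(program, s2, s2.uses)}


def find_assignments(program, feature):
    """find s1 such that AssignTo(s1, f) & exists s2: Write(s2, f) & Affect(s1, s2)"""
    return [s1 for s1 in program
            if assign_to(s1, feature)
            and any(write(s2, feature) and affect(program, s1, s2) for s2 in program)]


def involves(sliced, domain_object):
    """True if any statement of the slice touches data linked to the object."""
    return any((s.defs | s.uses) & DATA_LINKS[domain_object] for s in sliced)


matches = find_assignments(PROGRAM, "TAX-FORM.TotalTax")
tax_slice = backward_slice(PROGRAM, matches[0], DATA_LINKS["TAX-FORM.TotalTax"])
print([s.sid for s in tax_slice], involves(tax_slice, "NON-RESIDENT"))
```

In this toy program the slice keeps only the statements that feed `total_tax` and drops the resident-tax computation. A slicer for real programs would also have to follow control dependences, which this sketch deliberately ignores.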
2.4.2 The design recovery assistant. Pattern matching and program slicing techniques help to identify procedures. Isolated code can be automatically restructured to clean the program control flow and enhance readability by providing a uniform programming style. Now the programmer's task is to analyze and respecify code. If the objective of reengineering is to produce logical level specifications from which code can be generated for multiple platforms, a programmer must isolate code related to physical implementation and other platform dependencies. In particular, physical data base access sequences should be replaced by logical operations expressed in terms of a conceptual data model. Conversion and reverse engineering tools can help a programmer in this task.

We are implementing a tool called a design recovery assistant (DRA). The role of the DRA is to help a programmer analyze and respecify code. The DRA displays code using language-independent control structures and supports a range of static program analysis methods. In particular, a programmer can trace control and data flow relations in a program, compute slices, and search for code patterns. For a reasonably structured program, the function abstraction method [24] can be used to analyze and recover the meaning of code.

The DRA supports an annotation system to respecify procedures. Annotations capture design abstractions recovered from code that can be directly attached to program structures. Annotations may have various degrees of formality; they form a simple bridge between formal structures and their (mostly informal) conceptualizations. The DRA supports typed annotations to address the variety of design abstractions one may want to represent. In general, annotations may refer to the meaning of a program unit, contain design information (e.g., explanation of the role of a program module in the overall system architecture), describe system requirements (e.g., system behavior or performance constraints), or contain assertions.
The DRA supports two types of annotations: standard and programmer-defined.

Standard annotations. Standard annotations have a fixed meaning and must be used in a predefined way. Assertions are an example of standard (and formal) annotations. Assertions support the process by which a programmer acquires program understanding: the programmer states hypotheses about what he or she believes program behavior should be, and then verifies them. Two kinds of assertions, namely preconditions and postconditions, can be defined. The precondition expresses the programmer's belief as to the system state properties just before a program unit is executed. The postcondition characterizes the system state just after a program unit has been executed. Standard annotations can also encourage (or even enforce) standard ways of respecifying procedures. Each program unit may have a standard template of annotations that should be filled in as procedures are recovered. For example, a procedure declaration might have the following standard annotations:

PURPOSE: explains what a procedure does
REQUIRES: specifies restrictions on arguments (if any)
COMMENTS: additional information
CALLS: procedures called by this procedure
CALLED-BY: procedures that call this procedure

Software complexity metrics can be used to identify the program units that require annotations. Some of the standard annotations can be computed automatically. For example, the CALLS and CALLED-BY annotations can be retrieved from the design model.

Programmer-defined annotations. Programmer-defined annotations allow a programmer to exercise full control over classification of the information that is produced during the design recovery procedure. Each programmer-defined annotation has a tag identifying the type of information the annotation contains. Examples of possible tags are MEANING, PURPOSE, DESIGN-SPEC, and REQUIREMENT-SPEC. For example, it might be possible to generate a requirement specification document by extracting the information from the REQUIREMENT-SPEC annotations. The DRA is aware of the tagged annotation mechanism and can selectively retrieve the information contained in annotations.
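A minimal sketch of such a tagged annotation mechanism is given below. It is not the DRA's actual data model; the class names, the `call_graph` dictionary standing in for the design model, and the library procedure names are all assumptions made for illustration. The sketch shows three things mentioned above: retrieval by tag, automatic derivation of CALLS and CALLED-BY from direct call relations, and a check of which standard template entries are still unfilled.

```python
from collections import defaultdict
from dataclasses import dataclass

# Standard template for a procedure declaration, as listed in the text.
STANDARD_TAGS = {"PURPOSE", "REQUIRES", "COMMENTS", "CALLS", "CALLED-BY"}


@dataclass
class Annotation:
    unit: str   # program unit the annotation is attached to
    tag: str    # type of information, e.g. PURPOSE or REQUIREMENT-SPEC
    text: str


class AnnotationStore:
    def __init__(self):
        self._items = []

    def add(self, unit, tag, text):
        self._items.append(Annotation(unit, tag, text))

    def by_tag(self, tag):
        """Selectively retrieve all annotations carrying a given tag."""
        return [a for a in self._items if a.tag == tag]

    def derive_call_annotations(self, call_graph):
        """Compute CALLS / CALLED-BY from direct call relations."""
        callers = defaultdict(set)
        for caller, callees in call_graph.items():
            for callee in callees:
                callers[callee].add(caller)
        for unit in sorted(set(call_graph) | set(callers)):
            callees = sorted(call_graph.get(unit, ()))
            if callees:
                self.add(unit, "CALLS", ", ".join(callees))
            if unit in callers:
                self.add(unit, "CALLED-BY", ", ".join(sorted(callers[unit])))

    def missing_standard(self, unit):
        """Standard template entries not yet filled in for a unit."""
        present = {a.tag for a in self._items if a.unit == unit}
        return sorted(STANDARD_TAGS - present)


# Hypothetical call relations, standing in for the design model.
call_graph = {
    "Checkout": {"CheckReservation", "UpdateStatus"},
    "Return": {"UpdateStatus"},
}

store = AnnotationStore()
store.add("Checkout", "PURPOSE", "lends a book to a borrower")
store.add("Checkout", "REQUIREMENT-SPEC", "a reserved book cannot be checked out")
store.derive_call_annotations(call_graph)

# A requirement document is then just the REQUIREMENT-SPEC annotations.
requirements = "\n".join(f"{a.unit}: {a.text}"
                         for a in store.by_tag("REQUIREMENT-SPEC"))
print(requirements)
print(store.missing_standard("Checkout"))
```

Keeping the tag as plain data rather than as a class hierarchy means new programmer-defined tags need no code changes, which matches the open-ended classification the text describes.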

2.5 Analyzing Business Rules and Transactions

Business rules and transactions are documented by templates similar to those for procedures. They are also recovered using techniques similar to those used for recovering procedures. First, objects and data involved in a business rule are identified. Next, techniques of pattern matching and program slicing are used to zoom into the code related to a given business rule. Recovering procedures and business rules requires much programmer involvement.

2.6 Recovering User Interfaces

To understand and maintain a business program we need to understand its user interfaces. Typical user interfaces are screens and reports. Some important procedures are often linked to screens and reports. For example, each menu option has an associated procedure that is activated on selection of a given option from the menu; procedures that compute salaries for various groups of employees are associated with a salary report, etc. Understanding of user interfaces in existing programs is useful in program maintenance, but also helps in designing user interfaces for reengineered applications. Screens and reports for reengineered applications can be designed using CASE screen/report painters based on a relational data model generated from a normalized conceptual data model. (Screen and report prototypes are sometimes designed using a conceptual data model.) Screens for typical user interface operations such as creating/searching/updating records can be inferred from the data model, automatically generated, and linked to reengineered applications. Analysis of user interfaces in the base system may contribute to designing new user interfaces in two ways: by providing clues about some specific types of screens and reports that are required and by identifying computations related to screens and reports.

User interfaces in existing programs are analyzed in the following way. First, a program is run for sample inputs to get a basic understanding of system behavior. The documentation and reports are studied to understand user interfaces. Nontrivial screens and reports (i.e., ones that are not just data base browsers or involve complex computations) are identified and analyzed by searching for patterns and slicing programs. Using code patterns, we can find a statement s that displays (or writes) specific information on the screen (or to a file). To find out how a specific field in a report is computed, we slice the program from s with respect to a data element linked to this field.
To analyze code that may be activated from the screen (e.g., on selection of a menu option), we trace the control flow and compute a forward slice from the corresponding statement in a program. Recovering user interfaces differs from recovering domain objects in that user interfaces are usually analyzed in a bottom-up fashion, while domain objects are recovered mainly in a top-down way.

2.7 Querying the Program Information

A programmer can inspect the domain model, design abstractions, and code via a query system. Queries are formulated in terms of domain objects, attributes, procedures, design objects (e.g., modules), control flow information, data flow information, code patterns, etc. Here are some examples:

Statement s; find s such that WriteToFile(s, PERSON)

Meaning: find all the statements that write information to a file connected to object PERSON (or to a file connected to any object derived from PERSON).

Statement s; find s such that IsLoop(s) & Contain(s, ReadFromFile(s, PERSON))

Meaning: find loops that read information from a file connected to object PERSON.

Module M, M1; find M such that Name(M1, ModName) & Calls(M1, M) & Ref(M, GlobVar)

Meaning: find modules M that are called (directly or indirectly) from module ModName and that reference a variable GlobVar.

We will first describe how we store all the program information, i.e., domain models, design abstractions, and code, in the program knowledge base (PKB). The PKB consists of a PROLOG data base and attribute syntax trees. (In the future we are planning to replace PROLOG with a frame-based expert system shell.) Design abstractions such as structure charts, control flow graphs, cross-reference lists, etc., are computed by parsers and stored in the data base. Associated with the data base are PROLOG rules that greatly help in supporting queries. For example, based on direct procedure call relations computed by the parser and recorded in the data base, rules can define what it means when a procedure indirectly calls another procedure.

In the prototype that we are implementing, queries are resolved in the following way. Domain model features (objects, procedures, etc.) are supported by syntax-directed editors. We generate editors using a system supporting extended attribute grammar formalism [25]. Editors represent domain model features as a forest of interrelated attribute syntax trees. Relationships between features reflect inheritance, ownership, as well as domain-specific dependencies between features (e.g., the fact that a given BORROWER borrowed a BOOK). The attribute evaluation mechanism allows us to resolve queries that involve inheritance. For example, the query

Statement s; find s such that AssignTo(s, PERSON)

asks about statements that modify the value of any data item associated with object PERSON or of data associated with any object derived from PERSON. (Note that when resolving queries involving objects, the inheritance hierarchy is used in the opposite direction from that used when resolving messages passed to an object: a query descends from an object to the objects derived from it, whereas message resolution ascends toward ancestors.) Having identified classes of objects, queries are further resolved by finding object instances that satisfy a query (e.g., all the modules called from module M or all the statements that read data from a given file).

As shown in the above examples, query conditions are written using predicates. Predicates such as Calls(M1, M) and Ref(M, Var) are evaluated by the PROLOG backtracking mechanism. Predicates such as WriteToFile(Stat, Obj) and AssignTo(Stat, Obj) are computed by pattern matching. Program statements are classified into semantic groups such as WriteToFile and AssignTo. Names of these groups are used in queries to identify types of statements to be searched for. Internally, we represent source programs in the same way as a domain model: parsers translate programs into attribute syntax trees. Pattern matching is done on syntax trees.
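Two of the resolution steps described in this section can be illustrated with a short Python sketch: deriving indirect calls from recorded direct calls, and expanding a domain object to the objects derived from it before matching statements. The prototype described here uses PROLOG rules and attribute syntax trees; the Python below only mimics the logic, and every fact in it (module names, `GlobVar` references, statement numbers) is a hypothetical example rather than data from a real program.

```python
# Facts a parser might record in the program knowledge base (hypothetical).
DIRECT_CALLS = {("Main", "Payroll"), ("Payroll", "TaxCalc"), ("Main", "Report")}
REFS = {("TaxCalc", "GlobVar"), ("Report", "GlobVar"), ("Payroll", "Rate")}

# Inheritance network of domain objects: derived object -> parent object.
DERIVED_FROM = {"NON-RESIDENT": "PERSON", "EMPLOYEE": "PERSON"}

# Statements classified into the AssignTo group: (statement number, object).
ASSIGNS = [(10, "PERSON"), (20, "NON-RESIDENT"), (30, "TAX-FORM")]


def calls(caller):
    """Modules called directly or indirectly from `caller`.

    Plays the role of a recursive rule defined over direct-call facts.
    """
    reached, frontier = set(), [caller]
    while frontier:
        module = frontier.pop()
        for src, dst in DIRECT_CALLS:
            if src == module and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached


def find_modules(mod_name, variable):
    """Module M, M1; find M such that Name(M1, mod_name) & Calls(M1, M) & Ref(M, variable)"""
    return sorted(m for m in calls(mod_name) if (m, variable) in REFS)


def with_descendants(obj):
    """The object itself plus every object derived from it."""
    result = {obj}
    changed = True
    while changed:
        changed = False
        for child, parent in DERIVED_FROM.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def assign_to(obj):
    """Statement s; find s such that AssignTo(s, obj), following inheritance downward."""
    targets = with_descendants(obj)
    return [sid for sid, target in ASSIGNS if target in targets]


print(find_modules("Main", "GlobVar"))
print(assign_to("PERSON"))
```

The `reached` set keeps the closure finite even if the call graph contains recursion, which is the same concern a PROLOG rule for indirect calls has to address.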

2.8 Further Automation of System Reengineering

We described a semi-automatic method for recovering design components and object-oriented views from programs. Most of the techniques involved in the recovery process can be efficiently supported by tools, but the design recovery procedure is actually masterminded by the programmer. In particular, the programmer has to decide on patterns and slicing criteria, which may be relevant to recovering various features. It would be interesting to relax the programmer's control by further automating the procedure of building links between domain model objects and related code [8]. To attempt this, a rich library of code patterns and slicing criteria is needed, as well as a generic way of specifying patterns and querying program information. We plan to address these issues in the future.

2.9 Role of the Domain Model in Program Maintenance

Often, system reengineering must be done in parallel with ongoing maintenance of the original system. In such a case, system maintenance should be done via the same domain model that is used for reengineering. There are several reasons why this is important. First, the domain model and design abstractions recovered during reengineering contain information that helps one understand a program and is invaluable for maintenance. Second, program modifications made during ongoing maintenance can affect some of the information recovered from code so far. As the domain model contains information about both the reusable specifications recovered from code and the code itself, it is possible to trace the impact of changes via the domain model. Finally, ongoing maintenance can contribute new facts to our knowledge of a system and it may be worth communicating those facts to the reengineering team. This is also done via the domain model.

3. EVALUATION OF AND EXPERIENCES WITH THE METHOD

How can we reuse some of the information encoded in old systems to rebuild new systems? How can we make the reengineering option a workable, cost-effective solution? It is not easy to answer these questions because there are so many factors that we need to take into account. There seems to be no universal approach to reengineering that will work well in all situations. Rather, we need a flexible methodological framework that allows us to define the most cost-effective reengineering strategy based on the analysis of objectives, system quality, manpower, etc. [7]. Then we need a rich set of techniques and tools to support project teams in maintenance and reengineering.

An important prerequisite for system reengineering is efficient software construction methods that result in robust, easy-to-maintain systems. If we are not sure whether new systems are going to be substantially better than old ones, why should we bother to rebuild them? System reengineering is very expensive and we need some guarantee that, after a few years, we will not find ourselves in the same situation we are in today, fighting with problems of spaghetti code. Therefore, our starting point is a methodology for system construction. Our forward engineering techniques are based on data-driven methods. The essence of a data-driven approach is that the backbone structure and basic functionality of an information system are inferred from a quality data model. Our sister company in Quebec City set up experiments to compare productivity of structured methods versus the Datarun methodology. (Datarun is a data-driven methodology, a descendant of the Merise [26] methodology developed in France in the 1970s.) The results of experiments showed productivity improvements by a factor of 7 to 10 in favor of the data-driven approach. From 1981 to 1987, 11 commercial applications of substantial complexity (22,097 function points) were developed using the Datarun methodology. Both in experiments and commercial development, systems were coded manually using a fourth-generation language. But the information concentrated around data models can be fed into a data-driven code generator to make system construction more efficient.

Data-driven construction methods give us a clear picture of what design abstractions we need to reverse engineer from old code and how we should represent them so that they can be reused in generating new systems. These are conceptual data models, relational data models, and procedures grouped around entities, screens, and reports. We already have tools that allow us to use data definitions from existing programs to create data models. We discussed the techniques of program slicing and pattern matching that lead to recovering object-oriented views from programs. Implementation of and experimentation with those techniques is the subject of a master's project at the National University of Singapore. We are also working on a methodological framework for system reengineering and on a reengineering expert, a tool to help a manager analyze the situation and define an optimal reengineering or maintenance strategy.

Automation of the whole system reengineering life cycle requires many tools. Fortunately, many useful tools already exist. In particular, CASE tools that offer editors to view/manipulate design abstractions, restructuring tools, conversion tools, and data reverse engineering tools are well established. Using bridge technology, it is possible to integrate foreign tools into a reengineering environment. In our project, we concentrated on techniques and tools related to the domain model.
This includes the following issues: using reverse engineering tools to capture information from existing programs and build the design model (both conceptual data models and procedure models), using CASE notations and editors for viewing and manipulating design models, support for manipulating/querying the domain model, a mechanism to maintain all the program information in a consistent state, techniques for recreating object-oriented views from programs (in particular, static program analysis techniques based on program slicing and pattern matching), and the design recovery assistant.

Currently, we have the following tools: CASE tools supporting all major notations for planning, analysis, and design [23], and a tool assisting programmers in building quality data models [20]. We have implemented a syntax-directed editor generator based on extended attribute grammars [25]. Generated editors support multilingual formalisms and can propagate information across interrelated components via an incremental attribute-evaluating mechanism. Editors interface to a data base to store the information about objects being edited. We shall use editors to support the domain model as well as source programs. We are concentrating on building a set of tools that will allow us to experiment with recovering object-oriented views from real COBOL programs, using techniques of pattern matching and program slicing.

4. RELATED WORK

The importance
of domain knowledge in program understanding and the need for explicit representation of domain models in systems supporting design recovery and program reengineering have been pointed out by many authors [8-10, 27]. The DESIRE system [8] addresses real-time applications implemented in C. In DESIRE, the domain model is built first and then code patterns are used to find instances of domain concepts in code. Design abstractions are linked to code and a programmer can view them in graphical form and query the program information. The objective of LaSSIE [9] is to explain the design of a complex application and identify code that has a potential for reuse. LaSSIE uses a frame-based expert system to store domain models, design abstractions, feature views, and code views. The expert system provides a number of important facilities for modeling a program knowledge base, namely classification by inheritance, inferencing, and a database. Because of the use of a frame-based expert system, LaSSIE's queries support semantic retrieval of information. In LaSSIE, domain models are built manually while the design abstractions are computed by parsers. Liu and Wilde [28] describe a method for identifying objects in C programs. Candidate objects are selected based on the analysis of type definitions; next, procedures that have arguments of a given type or return a value of a given type are identified as candidate methods.

One reason why it is difficult to understand and maintain programs is the lack of online (and up-to-date) program documentation with explicit links between documentation components and code. Documentation systems based on the concept of hypertext [29] and systems using cross-reference lists to organize the program information for ease of comprehension [30] try to alleviate some of these problems. Most
experimental design recovery and program information systems use explicit links between domain models, design abstractions, and code to improve the traceability of information across various levels of system descriptions.

Static program analysis methods based on control and data flow information and, in particular, program slicing [15] have been found useful in program maintenance [31, 32]. Slices represent executable code segments that approximate functionalities of a program. Experimental studies show that programmers look for program slices when analyzing programs during debugging and maintenance. Program slices can be computed automatically and there are tools on the market that support program slicing [33].

Reengineering scenarios for business systems have been described [2, 3, 5, 6]. All authors stress the need to capture system specifications in a CASE repository and to use data-driven methods to rebuild systems. Many techniques required in system reengineering are supported by tools already on the market. Tools help to clean and unify data definitions across programs and reconstruct conceptual data models from data definitions [18-20]; tools can restructure procedural code by eliminating GOTOs and reformatting code into a standard form [34]; rule-based tools help in language translation and in migrating programs from one data base to another [19, 35]; CASE tools offer editors supporting a wide range of graphical notations suitable for viewing design abstractions [23]; CASE tools also offer facilities to paint screens and design reports based on data models. Finally, work is under way to further reduce the implementation effort involved in data-driven system construction by inferring much of the system functionality from quality data models [20].

Current methods and tools for data-driven system construction are evolving toward object orientation. Rumbaugh et al. [16] describe the use of object-oriented analysis and design methods for business systems. The authors demonstrate that even if we implement systems in a data-driven way and use relational data base technology and data-driven code generators, we can still do object-oriented analysis and design and observe the benefits of an object-oriented approach.

5. CONCLUSIONS

System reengineering as described here is guided by a domain model. The objective of our method is to recreate object-oriented views of a system; the domain model captures those views. The domain model also integrates information about old systems with the specifications from which new, similar systems can be regenerated. We observe that in the domain of business systems, many interesting domain concepts are reflected by entities, attributes, and relationships of the conceptual data model. Therefore, we derive the first-cut domain model from the conceptual data model. During reverse engineering, a programmer identifies procedures related to entities, screens, and reports. Techniques of pattern matching and program slicing help a programmer to analyze and extract code related to various procedures.

The reengineering method described in this article offers the following benefits. We can include state-of-the-art methods and tools for system construction in the reengineering cycle; this is the only way to guarantee that new systems will be more stable and easier to maintain than the original ones. To some extent, program maintenance and reengineering are done at the conceptual level. By integrating ongoing maintenance with reengineering efforts, our method offers an incremental migration path from old to new systems. The domain model revolves around well-known notions, which may make it easier for programmers to accept. The method provides a framework for dealing with data and procedural code.

ACKNOWLEDGMENTS

The author thanks the management of the CSA Research Pte. Ltd., particularly Robert Fu, for support and creating an encouraging atmosphere for this research project. Thanks are also due Dr. Tan Chew Lim, Kelly Tham, Tan Wui Gee, and Tang Siong Tiing for many interesting discussions. Insightful comments from these people and the anonymous reviewers were very useful and helped to clarify many concepts described here.

REFERENCES

1. R. Rock-Evans and K. Hales, Reverse Engineering: Markets, Methods and Tools, Ovum Report vol. 1, Ovum Ltd., England, 1990.
2. H. Sneed and G. Jandrasics, Software Recycling, in Proceedings of the Conference on Software Maintenance, 1987, pp. 82-90.
3. W. Ulrich, Re-development Engineering: Formulating an Information Blueprint for the 1990's, CASE Outlook 15-21 (1990).
4. E. Chikofsky and J. Cross, II, Reverse Engineering and Design Recovery: A Taxonomy, IEEE Software, 13-18 (1990).
5. J. Martin, Information Engineering, vol. 1, 1986.
6. P. Haine, Migration Strategies for Effective Information Systems, SAVANT, 1991.
7. S. Jarzabek and K. Tham, Towards automating software maintenance, in Lecture Notes in Computer Science, No. 498, Springer-Verlag (Proceedings of the Third International Conference on Advanced Information System Engineering CAiSE '91, Trondheim, Norway, 1991), pp. 336-355.
8. T. Biggerstaff, J. Hoskins and D. Webster, DESIRE: A System for Design Recovery, MCC Technical Report Number STP-081-89, April 1989.
9. P. Devanbu, R. J. Brachman, P. G. Selfridge and B. W. Ballard, LaSSIE: A Knowledge-Based Software Information System, Commun. ACM, 34, 34-49 (1991).
10. W. Kozaczynski, The "Catch 22" of Re-Engineering, in 12th International Conference on Software Engineering, 1990, Nice, France, p. 199.
11. G. Arango, Domain analysis: From art form to engineering discipline, in Proceedings of the Fifth International Workshop on Software Specification and Design, 1989, pp. 152-159.
12. C. Hart and J. Shiling, An environment for documenting software features, in Proceedings of the 4th ACM SIGSOFT Symposium on Software Development Environments, Calif. 1990, pp. 120-132.
13. R. Baab, II, and R. Kieburtz, Workshop on Models and Languages for Software Specification and Design, IEEE Computer 103-108 (1985).
14. P. Chen, The Entity-Relationship Model-Toward a Unified View of Data, ACM Trans. Database Syst. 1, 9-36 (1976).
15. M. Weiser, Program Slicing, IEEE Trans. Software Eng., 10, 352-357 (1984).
16. J. Rumbaugh, M. Blaha, W. Premerlani, F. Eddy, and W. Lorensen, Object-Oriented Modeling and Design, Prentice-Hall, Englewood Cliffs, New Jersey, 1991.
17. P. Coad and E. Yourdon, Object-Oriented Analysis, Prentice-Hall International, Inc., 1990.
18. P. Benedusi, V. Benvenuto and M. G. Caporaso, Maintenance and prototyping at the entity-relationship level: A knowledge-based support, in Proceedings of Conference on Software Maintenance, 1990, pp. 161-169.
19. Data Analyst, Bachman Information Systems.
20. SILVERRUN, CSA Research Ltd.
21. A. Beztiss, The Specification and Prototyping Language SF, Report No. 78, SYSLAD, The Royal Institute of Technology, Sweden, 1990.
22. P. McBrien, et al., A rule language to capture and model business policy specifications, in Lecture Notes in Computer Science, No. 498, Springer-Verlag (Proceedings of the Third International Conference on Advanced Information System Engineering CAiSE '91, Trondheim, Norway), 1991, pp. 307-318.
23. POSE, CSA Research Pte. Ltd.
24. P. Hausler, et al., Using Function Abstraction to Understand Program Behavior, IEEE Software, 55-64 (1990).
25. S. Jarzabek, Specifying and Generating Multilanguage Software Development Environments, IEEE Software Eng. J. 5, 125-137 (1990).
26. A. Rochfeld and H. Tardieu, MERISE: An Information System Design and Development Methodology, Info. Manage., 6 (1983).
27. C. Rich and L. Wills, Recognizing Program's Design: A Graph-Parsing Approach, IEEE Software, 82-89 (1990).
28. S. Liu and N. Wilde, Identifying objects in a conventional procedural language: An example of data design recovery, in Proceedings of the Conference on Software Maintenance, 1990, pp. 266-271.
29. B. Blum, Documentation for maintenance: A hypertext design, in Proceedings of Conference on Software Maintenance, 1988, pp. 23-31.
30. P. Oman and C. Cook, The Book Paradigm for Improved Maintenance, IEEE Software, 39-45 (1990).
31. J. Keables et al., Data flow analysis and its application to software maintenance, in Proceedings of the Conference on Software Maintenance, 1988, pp. 335-347.
32. K. Gallagher, Using Program Slicing in Software Maintenance, TR CS-90-05, Ph.D. Thesis, University of Maryland, 1990.
33. VU/CENTER, VIASOFT.
34. RECODER, Language Technology, Inc.
35. EXECOM Conversion Technology.
36. D. Card and R. Glass, Measuring Software Design Quality, Prentice-Hall, Englewood Cliffs, New Jersey, 1990.
37. Y. Chen, M. Nishimoto, and C. Ramamoorthy, The C Information Abstraction System, IEEE Trans. Software Eng., 16, 325-334 (1990).