Recovering structured data types from a legacy data model with overlays


Information and Software Technology 51 (2009) 1454–1468


Mariano Ceccato (a), Thomas Roy Dean (b), Paolo Tonella (a,*)

(a) Fondazione Bruno Kessler, FBK-IRST, Software Engineering, Via Sommarive 18, 38050 Povo Trento, Italy
(b) Queen's University, Kingston, ON K7L 3N6, Canada

Article info

Article history: Available online 10 May 2009.
Keywords: Program transformations; Reengineering; Legacy systems

Abstract

Legacy systems are often written in programming languages that support arbitrary variable overlays. When migrating to modern languages, the data model must adhere to strict structuring rules, such as those associated with an object oriented data model, supporting classes, class attributes and inter-class relationships. In this paper, we deal with the problem of automatically transforming a data model which lacks structure and relies on the explicit layout of variables in memory as defined by programmers. We introduce an abstract syntax and a set of abstract rewrite rules to describe the proposed approach in a language neutral formalism. Then, we instantiate the approach for the proprietary programming language that was used to develop a large legacy system we are migrating to Java. © 2009 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +39 0461 314577. E-mail addresses: [email protected] (M. Ceccato), [email protected] (T.R. Dean), [email protected] (P. Tonella). 0950-5849/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.infsof.2009.04.017

1. Introduction

Migration projects are initiated for a number of technical and non-technical reasons. Among the technical motivations for migrating a legacy system to a modern platform is the obsolescence of the technology or programming language in use. Domain-specific, proprietary programming languages are a typical instance of this problem. Swanson [1] classified such interventions as adaptive maintenance and investigated their characteristics. Obsolete platforms and languages may become a major impediment to future development and innovation, especially when they are tied to a proprietary development, compilation and execution environment. In such cases, supporting the language and the associated production and execution environment may require even more resources than those dedicated to the core business of the software company.

On the non-technical side, customers are increasingly aware of the technologies underlying the products they buy. Hence, they also have requirements about the adoption of modern environments, frameworks and languages. At the same time, they may have concerns about the use of non-standard solutions, which have no consolidated support and only a limited user community. This is especially true in the age of open source software development. This mixture of technical and non-technical arguments is usually the trigger for a migration project aimed at replacing the old technology with a modern one.

Translation of the data model is central to any migration effort. Decisions made about the representation of the data structures will

have strong implications for the rest of the translation. While modern languages have a structured approach to data modeling, enforcing relationships such as attribute ownership and aggregation, old languages are often biased toward the underlying physical model of the data. Hence, they typically support views that are tied to the memory layout, consisting of a sequence of bytes. Programmers define their data structures as a set of views and overlays on top of a flat sequence of bytes available from memory. Types may be superimposed on such a sequence, but the relative displacement of the various parts of a data structure is left to the programmers' responsibility, as it is not explicitly supported by the language.

In this paper we consider how source transformations can be used to infer structure from a set of data declarations, based on notions such as their size in memory and their position relative to other declared variables. One advantage of a source transformation approach, as opposed to approaches that recover structure from binary objects, is that the grouping of the fields as given by the programmer provides important hints as to the final structure of the data model. Our approach attempts to keep these fields together as much as possible, to maximize the programmer's familiarity with the result.

We present our approach abstractly, to increase its reusability in different contexts. For this purpose we define an abstract syntax for the starting, unstructured model and for the target, structured one, making minimal assumptions on the surface appearance of any language instantiating the abstract syntax. On top of such abstract syntax, we provide our data structuring algorithm as a set of abstract rewrite rules. We instantiated our approach to data structuring for the programming language used in a legacy system we are migrating


to Java. The language used by the legacy system is BAL, an acronym for Business Application Language. BAL is a proprietary, BASIC-like language that contains unstructured data elements as well as unstructured control statements (e.g., GOTO). Variables in BAL are declared with an explicit size in bytes and can be displaced in memory at any arbitrary position, relative to some previously declared variable (relocations).

The system that we are migrating is a production banking application which supports all functionalities necessary to operate a bank, including account management, financial products management, front-desk operations, communications to the central bank and other authorities, inter-bank communications, statistics and report generation. The user interface is character-oriented and the overall architecture is client-server, with the client operating mostly as a character terminal. The execution environment is a proprietary platform called B2U. The application is quite large, consisting of around 8 MLOC (million lines of code) and 500,000 declarations involving relocations. Results show that the proposed abstract heuristics to handle the structuring problem work quite well in practice, with only 1% of the cases left for manual intervention.

The rest of the paper is organized in six sections. Section 2 gives a short introduction to the BAL data model. Sections 3 and 4 discuss the transformations in more detail. Some experimental results about the effectiveness of the proposed heuristics are given in Section 5. Related work is presented in Section 6, followed by conclusions (Section 7).

2. Data model

We start by presenting the unstructured data model provided by the BAL language, and some of the difficulties it presents to the translation process. We then present an abstract model that will be used to define the formal specifications for the set of transformations that generate a structured model, to be used in the translation to Java.


Data relocation is accomplished with the FIELD=M,VAR statement, which is analogous to the .= variable address directive available in most assembly languages. An example is shown in Fig. 1. Indentation is used in the example to indicate programmer intention, but is not mandatory in the language. The variable a is a string variable (indicated by a type specifier of ‘$’) and is ten bytes long. The following FIELD statement resets the current variable position (i.e., the memory location of the next declared variable) to the beginning of the variable a. As a result, the variable b, a string variable of length five, occupies the same five bytes as the first five bytes of a. The variable c, also a string variable, takes the next five bytes. Thus all ten bytes of the variable a are shared with the two variables b and c. The variable b is further redefined by the variables d, e, and f. The variables d and e are byte variables (the type specifier ‘#’) while f is an array of strings. The size of the array is 3, while the size of its string components is 1 (so, it contains three strings of length 1). A second relocation starts from the beginning of b. It represents an alternative view on the five bytes allocated in memory for variable b. Such a view consists of variable g, a short (‘%’ type specifier) that occupies two bytes, and the string h, of size 3. The FIELD=M that concludes these declarations refers to no variable. Its meaning is that the next declared variable will be allocated starting from the first free location in memory (i.e., immediately past the end of c). Fig. 1(right) shows how the example would be translated to Java. Native BAL short, byte and string types are translated to the corresponding Java types, while BCDs are translated to BigDecimals. To preserve the semantics in the translated Java, strings and BCDs are annotated with the original BAL size (@Field(size=5)). 
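The layout semantics just described can be captured in a few lines of Python. This is our own illustrative model, not part of the authors' toolchain; the variable names and sizes mirror the Fig. 1 example:

```python
# Minimal model of BAL-style allocation with FIELD=M relocations.
# Illustrative sketch only; decls follow the Fig. 1 example.

def layout(decls):
    """decls: list of ('VAR', name, size) or ('FIELD', target_or_None).
    Returns {name: (offset, size)}."""
    offsets = {}
    cursor = 0   # position of the next declared variable
    high = 0     # high-water mark = first free memory location
    for d in decls:
        if d[0] == 'FIELD':
            target = d[1]
            # FIELD=M,var resets the cursor to var's start;
            # a bare FIELD=M jumps past all memory used so far.
            cursor = offsets[target][0] if target else high
        else:
            _, name, size = d
            offsets[name] = (cursor, size)
            cursor += size
            high = max(high, cursor)
    return offsets

decls = [
    ('VAR', 'a', 10),                                            # a$10
    ('FIELD', 'a'), ('VAR', 'b', 5), ('VAR', 'c', 5),            # FIELD=M,a
    ('FIELD', 'b'), ('VAR', 'd', 1), ('VAR', 'e', 1),            # FIELD=M,b
    ('VAR', 'f', 3),                                             # f$(3)1
    ('FIELD', 'b'), ('VAR', 'g', 2), ('VAR', 'h', 3),            # FIELD=M,b
    ('FIELD', None),                                             # FIELD=M
]
offs = layout(decls)
```

Running this places b and c over the ten bytes of a, d/e/f and g/h over the five bytes of b, and the final bare FIELD=M moves the cursor to the high-water mark (byte 10), so the next declaration starts just past c, as described above.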
Whenever a variable is the subject of a relocation (FIELD=M,a), that variable becomes a Java class (class A) and all the contained variables (declared in the same memory region, e.g., b and c) become fields of this class. When multiple overlays are defined

2.1. Concrete syntax

An appropriate translation of the data model is central to any migration effort, and our project is no exception. This section gives a short overview of the BAL data model; it is discussed in more depth in a different paper [2]. While BAL contains some structured control flow statements such as IF...ENDIF and WHILE...WEND, the data model is very unstructured and similar to that found in structured assembly languages (e.g., that of IBM mainframes).

The data model is byte-oriented, and the language provides only four basic data types: byte, short, binary coded decimal (BCD) and string. The first two are the same as those available in most languages, representing one- and two-byte signed integer values. Variables of the BCD and string data types can be of various lengths, and the developer must specify the length in bytes if she wants something different from the default length (eight and 16 bytes, respectively). Unlike languages such as C, there is no dynamic allocation, and the length of all variables (including arrays and strings) is known at compile time.

The BAL data model is based on the notion of memory relocation. Each newly declared variable can either be allocated on the next available region in memory, or it can be a relocation of a previously defined variable. The relocation data model allows for arbitrary aliasing among variables as well as for arbitrary definition of multiple views (unions) insisting on the same underlying memory region (i.e., sequence of bytes). Such views are not constrained to be within the boundaries of the relocated region. On the contrary, they can span multiple memory regions that were previously regarded as separate data structures.

Fig. 1. Variable declarations in BAL (left) and their translation in Java (right).



for the same variable, a class with union semantics is produced (B_Union for b). A separate class is generated for each union variant (B_Var1 and B_Var2). B_Union implements all interfaces of all union variants, so that it can be accessed through any alternative view. Some code to switch among views when needed is automatically generated (it implements the copy on read/write protocol and is not shown in Fig. 1 for space reasons).

There are several consequences to the approach taken by the BAL language. The first is that records do not introduce any additional lexical scope. The second is that it is the developers' responsibility to ensure that the sizes of all of the variables are correct. For example, in Fig. 1, the size declared for variable a could incorrectly be nine. This has no consequences on the memory layout of the other variables, except that the last byte of h would not be accessible from a. Moreover, no change would occur for variables declared after a, even in the absence of the final FIELD=M statement, since ten bytes are consumed for this data structure anyway (e.g., due to the total size of b and c). This indicates how the BAL data model is robust with respect to mismatches involving the size of container variables (such as a). Such robustness is sometimes "abused" by BAL programmers, who do not care too much about the container size not matching the containees' size.

Finally, there are many ways of expressing the exact same layout of variables within memory. In the example in Fig. 1, the declaration of c could be the last of this data structure, without affecting the memory layout of variables in any respect. In a language such as COBOL or C, the variables d, e, and f would normally be declared as part of the aggregate field that they form. In BAL they could be located anywhere within the same scope of declarations (i.e., global or local).
Size mismatches and variations in memory layout declaration make the recovery of a structured record from a sequence of BAL declarations quite difficult. All these issues must be addressed before starting the actual translation to Java. In the rest of the paper, the problem of inferring a containment relationship among variables will be addressed by the bracketing algorithm. In the case of variable size mismatches, some heuristics will be used to identify the correct size values and to suggest size fixes.

The developers usually keep the definitions of complex data structures in separate files which are included by the preprocessor. One special case is that of the data structures used to access persistent ISAM (Indexed Sequential Access Method, a data file format common in legacy systems) tables. These record definitions are stored in a data dictionary and contain full hierarchical storage information. Include files containing BAL declarations are generated from these dictionary entries, losing some of the hierarchical information in the process. Since the information exists in the dictionary, these files are treated specially by our migration process, as the Java classes can be generated from the dictionary directly. One wrinkle is that the data dictionary contains a few size mismatches [2] and, as a result, so do the generated include files. Such size mismatches are basically similar to those mentioned above for the example in Fig. 1. While not affecting the correctness of the final code, these problems complicate the process of recovering the data model and sometimes require manual fixes.

The data structures for other temporary data files and internal storage do not have entries in a dictionary, and the structure of these records must be inferred solely from the BAL code. However, when analyzing structures that are local to the program, we have considerably more flexibility in how the fields can be reorganized in order to infer reasonable data structures.

2.2. Abstract syntax

In order to describe the data structuring process that we applied to the BAL programming language in a general, reusable form, we use an abstract syntax for the starting and for the target data model. The abstract syntax does not contain any reference to a particular surface syntax that may be used to instantiate the underlying data model. This means that by adding proper syntactic sugar, it is possible to make this abstract syntax match the surface syntax of BAL. On the other hand, the data model of other programming languages supporting memory relocations and multiple memory overlays may be obtained by properly instantiating this abstract syntax, which makes our results generalizable to any programming language supporting some notion of relocation and memory overlay. Hence, the transformations described with reference to the abstract syntax become a general theory of data model transformation, from unstructured to structured.

Fig. 2 shows the abstract representation of the unstructured BAL model. A declaration sequence is a set of zero or more declarations, each of which is a tuple consisting of a mem-id (which in turn is an identifier), a size (a number), and an optional relocation (which is another mem-id). This is a straightforward abstraction of the BAL syntax, removing the type of the variable as well as details of the size specification, such as arrays and preprocessor macros, and terminal symbols, such as FIELD=M. One constraint not visible in Fig. 2 prescribes that for each reloc there must exist one mem-id appearing in a declaration (i.e., the relocation must refer to some allocated memory area). Another obvious constraint not in Fig. 2 is that each mem-id should appear in one declaration only (no redeclaration allowed).

Fig. 3 shows a model of the final structured data. In this model, a declaration is either a structure declaration, a union declaration or a variable declaration. A structure declaration is a possibly empty sequence of declarations, which may in turn be plain variables or nested structures/unions. A union is a sequence of at least two variants. Each variant is, in turn, a structure.

This model represents the standard tree representation of the structured data models provided by most structured languages such as C, Pascal or Ada. Note that we deliberately decided to distinguish between structures and unions, with a union containing at least two variants, which means that the associated memory area is accessible according to two or more alternate views. Another possibility in the definition of the abstract syntax for the structured data model would be to unify the notions of union and structure, by also allowing zero or one variant in a union. We prefer to keep the two separate because they are subject to a different treatment during transformation, and this can be described more clearly if we make the distinction between structure and union apparent at the syntactic level.

Fig. 2. Abstract syntax for the unstructured data model.

Fig. 3. Abstract syntax for the structured data model.



Fig. 4. Abstract syntax for the merged data model.

Fig. 4 shows the merge of the two models into a single model. The merged model is needed to be able to describe the transformation as a set of rewrite rules operating on a single language. During the transformation process, part of the model will already be transformed, while part of it will still be under transformation, which requires the capability to manage a mixed abstract syntax tree. Hence the merge of the two data models. In the merged model, declarations are again structures, unions or variable declarations, but now they are followed by a size and optionally by a relocation. Simple variables are represented by an identifier, which is now a memory identifier (mem-id), for consistency with the unstructured data model. Unions remain the same (a sequence of two or more variants, each of which is a structure), but they now have a name (mem-id), since they are associated with relocated variables. Similarly, a structure in a declaration (decl) also has an identifier (mem-id), while structures that define union variants do not have any identifier (it is the whole union that has one). The extra information (optional reloc) in the structure and union syntax is used to keep track of the structure as it is built. Figs. 5 and 6 show the abstract syntax trees for the example in Fig. 1 before and after structuring the data model. In the next few sections we will provide the rewrite rules necessary to bridge the gap between the two.

Fig. 6. Abstract syntax tree for the example program in Fig. 1 after structuring the data model.

3. Data transformations

We begin this section by using the abstract model described in the previous section to specify the desired transformations. We then show how the abstract rules can be implemented using TXL [3].

3.1. Abstract rewrite rules

Fig. 5. Abstract syntax tree for the example program in Fig. 1 before structuring the data model.
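For readers who prefer code to grammar figures, the merged abstract syntax of Fig. 4 could be encoded roughly as follows. This is a sketch in Python dataclasses; all names are ours, not the paper's TXL grammar:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union as U

# One possible encoding of the merged abstract syntax of Fig. 4.

@dataclass
class Var:
    mem_id: str
    size: int
    reloc: Optional[str] = None   # mem-id this declaration relocates

@dataclass
class Struct:
    mem_id: Optional[str]         # None when the struct is a union variant
    size: int
    fields: List['Decl'] = field(default_factory=list)
    reloc: Optional[str] = None

@dataclass
class Union_:
    mem_id: str
    size: int
    variants: List[Struct] = field(default_factory=list)  # at least two
    reloc: Optional[str] = None

Decl = U[Var, Struct, Union_]
```

Note how the encoding mirrors the constraints stated above: union variants are anonymous structures (mem_id is None), while top-level structures and unions carry both a name and an optional relocation.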

We can express the transformation of the ideal case (no size mismatches) in two abstract rules. The first rule identifies the extent of a structure declaration, while the second groups together all the redefinitions of the same variable and generates a union when it identifies more than one redefinition of the same variable. Fig. 7 shows the first of these rules.

The first two lines of the rule state that we are looking for a single declaration (decl0) and for a sequence of one or more declarations (decl1...declk). There may be one or more declarations separating the first declaration (decl0) from the set of declarations we are interested in. The third line of the rule states that the first declaration of the set (decl1) is a relocation of decl0, while the fourth line states that none of the other declarations in the set (decl2...declk) specify a relocation. The fifth line requires that the cumulative size of the set of declarations is the same as the size of the field that is being redefined. The remaining four preconditions specify a new structure declaration (declnew) with the same name as the original declaration (decl0) and containing the set of declarations (decl1...declk) as fields. If the original declaration is a relocation of another



Fig. 7. Rewrite rule handling the case of exact size match.
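As an executable approximation of the rule in Fig. 7 (and of the union rule of Fig. 9, discussed below), the following Python sketch applies exact-size-match structuring to a flat declaration list. It is our own simplification: it scans only top-level declarations and assumes no size mismatches, whereas the paper's abstract rules (implemented in TXL) also work on nested structures:

```python
def struct_rule(decls):
    """One application of the exact-size-match rule (cf. Fig. 7), or None."""
    for i, d0 in enumerate(decls):
        for j in range(i + 1, len(decls)):
            if decls[j].get('reloc') != d0['name']:
                continue
            # Extend the run: first decl relocates d0, the rest have no reloc.
            total, k = 0, j
            while (k < len(decls) and total < d0['size'] and
                   (k == j or decls[k].get('reloc') is None)):
                total += decls[k]['size']
                k += 1
            if total != d0['size']:      # require exact size match
                continue
            new = {'name': d0['name'], 'size': d0['size'],
                   'reloc': d0.get('reloc'), 'fields': decls[j:k]}
            rest = decls[:j] + decls[k:]
            p = rest.index(d0)
            if 'fields' not in d0:
                rest[p] = new            # first view: plain structure
            else:
                rest.insert(p + 1, new)  # further view: future union variant
            return rest
    return None

def union_rule(decls):
    """Merge adjacent same-named structures into a union (cf. Fig. 9)."""
    out, i = [], 0
    while i < len(decls):
        j = i + 1
        while (j < len(decls) and 'fields' in decls[i]
               and 'fields' in decls[j]
               and decls[j]['name'] == decls[i]['name']):
            j += 1
        if j - i >= 2:
            out.append({'name': decls[i]['name'],
                        'size': decls[i]['size'],
                        'union': decls[i:j]})
        else:
            out.append(decls[i])
        i = j
    return out

# Declarations for variable b of Fig. 1 (two overlays: d/e/f and g/h).
ds = [
    {'name': 'b', 'size': 5},
    {'name': 'd', 'size': 1, 'reloc': 'b'},
    {'name': 'e', 'size': 1},
    {'name': 'f', 'size': 3},
    {'name': 'g', 'size': 2, 'reloc': 'b'},
    {'name': 'h', 'size': 3},
]
while (step := struct_rule(ds)) is not None:
    ds = step
structured = union_rule(ds)
```

On this fragment the structure rule fires twice (once per overlay of b), leaving the two structures adjacent, and the union rule then folds them into a single union with two variants, matching the behavior described for Figs. 7-9.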

declaration, then the new structure is also a relocation of the same declaration. The rule gives two possible results. If the main declaration (decl0) currently has no children (first result line), i.e., it is not a structure, then it is replaced with the new structure. If the main declaration already has children, i.e., it is a structure, then the current one is an additional (second or successive) relocation, hence we have a union. In this transformation step, when a second (or successive) union variant is discovered, we place the structures sequentially (decl0 declnew), so that the union rule (Fig. 9) can easily find and merge the variants.

Fig. 8 shows the result of three consecutive applications of the rewrite rule in Fig. 7 to the example in Fig. 5. First, the declaration of a is transformed into a structure, due to the relocation in decl1, followed by decl2. Then, the declaration of b becomes a structure. Finally, the second relocation of b (decl6) produces a structure which is added just after the previous one, instead of replacing it. In fact, such a relocation is the second relocation of b, hence it is the second variant of a union, which the current step transforms into an additional structure following the previous one.

The union rule, shown in Fig. 9, takes a sequence of two or more structure declarations (decl0...declk) that represent variants of the same variable and combines them into an abstract union (declnew.union). The constraints on this transformation are that:

- all the declarations in the sequence have the same name (mem-id);
- all the declarations have been transformed into structures by the rule in Fig. 7;
- no declarations outside of the sequence have the same name as the structures in the sequence.

The result of the union rule is to replace the sequence of structure declarations with a single union node that has each structure as a variant. If the union rule in Fig. 9 is applied to our running example (Fig. 1) after structuring the data based on the exact size match (rule in Fig. 7), we obtain the abstract syntax tree shown in Fig. 6. These abstract rules provide a good starting point for understanding the transformation task in the presence of an exact size match. We now present the concrete implementation of these rules. Next, we consider the possible cases of size mismatch.

3.2. Concrete program transformation overview

At the core of the concrete data structuring transformation is a source-to-source transformation which we call bracketing. Its purpose is to infer a containment relationship among variables, which is later used in the actual translation to Java. A variable that acts as

Fig. 8. Abstract syntax tree for the example program in Fig. 1 after applying the rewrite rule in Fig. 7 three times.


Fig. 9. Rewrite rule producing unions.

container eventually becomes a class and its content is translated into class fields. Several other phases of the migration process precede bracketing, providing generic transformations that are of use to the entire process, including bracketing. We present some of these transformations (shown in Fig. 10), although we limit the discussion to those that are directly useful for bracketing. The process is composed as follows:

- in a normalization phase, some syntactic ambiguities (e.g., overloaded semantics of symbols such as brackets) are resolved, in order to make the subsequent steps easier (e.g., brackets used with arrays are turned into square brackets);
- a number of facts are extracted from the code and stored in a database;
- persistent table (ISAM) definitions, which are translated separately, are identified and marked;
- some difficult cases of size inference are addressed separately; for example, two phases are devoted to resolving errors in persistent tables and function parameters that admit variable (i.e., non-constant) size;
- at this point the actual bracketing (folding) can run, with all its internal sub-steps;
- multiple redefinitions of the same variable could be spread all over the code. The purpose of the last step (moving) is to group them together with the redefined variable, generating a union when their multiplicity is greater than one.

In the following sections, we describe each of the steps in Fig. 10 using the BAL code of Fig. 1 as a running example. TXL was used for most of the transformations, including bracketing itself.

3.3. Normalization

The normalization phase is the gateway to the entire migration process and performs several transformations designed to make the rest of the process easier. The normalization transforms related


to bracketing are preprocessing, unique naming and persistent file identification.

The BAL compiler produces a partially preprocessed file during the compilation process. In producing this file, the BAL compiler resolves all inclusion statements as well as any conditional compilation. Fig. 11 gives a short example of the inclusion of a data definition. Fig. 11a shows the including file, while Fig. 11b shows the included file. Fig. 11c shows the output of the BAL compiler. Each line of the file ends with an annotation that gives the line number and name of the original source file (e.g., the name of the include file that contains the statement). In addition, the BAL processor removes macro definitions and comments, although the uses of macros in the code are not expanded in this partially preprocessed file. Since the macros will be translated separately to Java constants and methods, the presence of the original names in the code will allow us to use them in the final translation.

The first set of transformations is to uniquely name [4] all of the identifiers in the programs. Unique identifiers are strings which contain the names of each of the parent scopes of the identifier. So the global variable A in program P has the unique identifier "P A", while the variable B in the function C in the program P has the unique name "P C B". Since program names are unique, and functions and variables may not have the same name, this approach gives a globally unique name to all identifiers in the system. We use agile parsing [5] techniques to extend the grammar to include the unique identifiers as annotations on all occurrences of the identifiers (definitions and uses). These annotations take the form:

["UID" # identifier]

Since BAL has simpler naming rules than Java, the unique naming process is also simpler, requiring only the resolution of macro names from include files (since the instances of the macros are not expanded) and of the external functions defined in libraries. For clarity and ease of reading, the examples in this paper do not show the unique naming annotations.

As mentioned earlier, some of the include files are generated from the data dictionary. Each ISAM file in the data dictionary is used to generate a single persistent data definition file. Since some extra structural information is present in the data dictionary (and lost when the BAL include files are generated), it makes sense to generate the Java classes from the dictionary. However, we must still recognize where such data structures are used in the code so that instances of the classes may be generated during migration. They also complicate the bracketing process, since they may be included as a subrecord of a larger record. Thus the last transform of the normalization process uses the annotations inserted by the compiler at the end of each line to group all of the fields for a given persistent file structure into a single unit. In this way, a single object allocation can be generated in Java, replacing the entire included persistent data structure. The file 'tabx.bal' in Fig. 11b is an example of one of these files.
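The naming scheme can be summarized by a one-line sketch (ours, not the project's TXL transform); it simply joins the program name, the enclosing scopes and the identifier:

```python
# Sketch of the scope-qualified unique-naming convention described above.

def unique_name(program, scopes, ident):
    """Concatenate program name, enclosing scopes, and the identifier."""
    return " ".join([program, *scopes, ident])
```

With this convention, the global variable A in program P gets "P A", and the variable B inside function C of program P gets "P C B", as in the examples above.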

Fig. 10. Bracketing process.



Fig. 11. Include file preprocessing.

Fig. 12 shows the preprocessed file from Fig. 11c after we have annotated the definitions (PERSISTENT_SD "tabx" [[...]]).

3.4. Fact extraction

Facts in RSF format [6] are extracted from the source code as part of the migration process. These include facts about the type of variables, their size, the structure of the BAL source and the names of externally visible entities. Facts are triples of the form:

factName uid (uid or value)

The UIDs created by unique naming during normalization form a link between the facts and the source code, and allow the facts to be read and used by transforms later in the migration process. The main set of facts relevant to the bracketing process are the size facts. Indeed, the layout of variables in memory depends on where they are allocated (i.e., whether or not they are relocations of other variables) and on their size. The size fact gives the size of each variable and constant in the code. This is accomplished by first extracting constant declarations from the code. The transform then extracts all variables and their associated size expressions, inserting default sizes where appropriate. Since named constants may be used when specifying the size of variables, a constant propagation algorithm is run on the size expressions to determine the size of each variable.
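A possible reading of this step, in Python (our own sketch, not the project's fact extractor): default sizes follow the BAL types of Section 2.1, and symbolic sizes are resolved by a simple constant-propagation loop.

```python
# Hypothetical sketch of size-fact extraction. Defaults per Section 2.1:
# byte 1, short 2, BCD 8, string 16.

DEFAULT_SIZE = {'byte': 1, 'short': 2, 'bcd': 8, 'string': 16}

def size_facts(constants, decls):
    """constants: name -> int or another constant name;
    decls: list of (uid, bal_type, size_expr or None).
    Returns uid -> size in bytes."""
    def resolve(expr):
        seen = set()
        while isinstance(expr, str):          # follow constant references
            if expr in seen:
                raise ValueError('cyclic constant: ' + expr)
            seen.add(expr)
            expr = constants[expr]
        return expr
    return {uid: DEFAULT_SIZE[ty] if expr is None else resolve(expr)
            for uid, ty, expr in decls}

facts = size_facts({'LEN': 5, 'N': 'LEN'},
                   [('P a', 'string', 'LEN'), ('P b', 'bcd', None),
                    ('P g', 'short', None), ('P x', 'string', 'N')])
```

Here 'P a' resolves through the named constant LEN, 'P x' through a chain of constants, and the BCD and short variables fall back to their default sizes.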

Fig. 12. Annotating persistent data definitions.

3.5. Bracketing

Bracketing is the core transform used to add structure to the legacy BAL data model. It instantiates the abstract rewrite rules described above to the specific features of the BAL programming language. The bracketing algorithm consists of finding the full extent of a FIELD=M and enclosing it between square brackets. It starts at a FIELD=M statement and adds one or more declarations (DCL), possibly with nested FIELD=M. In addition to the square brackets for the FIELD=M instruction, we need a second kind of brackets, used to group together all the redefinitions of the same declaration (DCL). We call this second kind of brackets square-angle brackets (i.e., [⟨...⟩]). They start at a declaration and enclose all FIELD=M,var statements that insist on the same declared variable. When a DCL has no FIELD=M statement insisting on it, the square-angle brackets contain an empty sequence; hence, they are omitted.

Square bracketing corresponds to the application of the abstract rewrite rule in Fig. 7, which recognizes structures based on exact size match. Consequently, square bracketing determines the scope of a FIELD=M based on exact size match. Such a scope determines the structure (sequence of declarations) recognized in this step. Square-angle bracketing corresponds to the abstract rewrite rule in Fig. 9. Whenever one or more FIELD=M insist on the same DCL, all of these structures, identified as square-bracketed FIELD=M, are put within the square-angle brackets. After square-angle bracketing, union identification amounts just to determining the cases where square-angle brackets surround two or more FIELD=M.

Fig. 13 shows the result of applying bracketing to the example in Fig. 1. The first declaration (variable a) is associated with a single FIELD=M within square-angle brackets, hence it is recognized as a structure, not a union. The structure associated with variable a contains two data members, b and c. They are determined by the square brackets surrounding the FIELD=M,a statement. In turn, b is recognized as a union, since its declaration has two FIELD=M,b within square-angle brackets. Each of the two FIELD=M,b is a structure playing the role of the respective union variant. Data members of variants are again available within square brackets.

During the translation to Java, square brackets and square-angle brackets play different roles. Square brackets identify the boundaries of classes, with nested FIELD=M representing cases of the composition relationship between classes. Square-angle brackets discriminate between classes and unions. When more than one FIELD=M insists on the same variable declaration, a union must be generated instead of a class.

Fig. 13. Square bracketing and square-angle bracketing of the example in Fig. 1.

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

Since unions are not available in Java, we resort to an equivalent code organization, which implements a copy-on-read/write protocol. The generated class implements a set of interfaces, associated with the different union variants, and declares a set of data members, each referencing one union variant. At each point in time during execution, at most one variant is active (i.e., at most one data member is non-null). When the code switches from one view to another (e.g., it sets an attribute available in one view and then reads an attribute available in another view), the active variant is changed and data are copied from the previous active variant to the new one.

3.6. Bracketing algorithm

The algorithm used for bracketing is shown in Fig. 14. We discuss each element of the algorithm separately.

3.6.1. Persistent data

The first step of the bracketing algorithm consists of the identification of the persistent data definitions (coming from the dictionary), which are delimited and separated from the user data definitions within the preprocessed files. A single persistent data definition file corresponds to the definition of an ISAM file which, in turn, can contain more than one table. During normalization, each included file is delimited as a whole by inserting proper code annotations. However, the included file may contain more than one persistent data structure, each associated with one table declared in the file. In the original, unpreprocessed BAL source files, each table definition is delimited by its own preprocessor macro, giving the programmer fine-grained control over the inclusion of table definitions in the code. In particular, each table definition is enclosed between #ifdef table_name and #endif preprocessor lines. Fig. 11b is an example of one of these files. The file tabx.bal contains three tables (x1, x2 and x3).
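The copy-on-read/write protocol used to emulate unions (Section 3.5) can be sketched in Java as follows. All names (U, VariantA, VariantB) and the 4-byte layout are illustrative assumptions; they are not taken from the generated code.

```java
// Hedged sketch of the copy-on-read/write protocol emulating a BAL union:
// at most one variant is active; switching views copies the data across.
import java.nio.ByteBuffer;

interface Variant {
    byte[] toBytes();            // serialize this view to the shared byte layout
    void fromBytes(byte[] raw);  // load this view from the shared byte layout
}

// Variant A: one 4-byte integer overlaying the whole buffer.
class VariantA implements Variant {
    int value;
    public byte[] toBytes() { return ByteBuffer.allocate(4).putInt(value).array(); }
    public void fromBytes(byte[] raw) { value = ByteBuffer.wrap(raw).getInt(); }
}

// Variant B: two 2-byte shorts overlaying the same 4 bytes.
class VariantB implements Variant {
    short hi, lo;
    public byte[] toBytes() { return ByteBuffer.allocate(4).putShort(hi).putShort(lo).array(); }
    public void fromBytes(byte[] raw) {
        ByteBuffer b = ByteBuffer.wrap(raw);
        hi = b.getShort();
        lo = b.getShort();
    }
}

// The generated union class: at most one variant is active (non-null) at a time.
class U {
    private Variant active;

    // Switch the active variant, copying data through the shared byte layout.
    private <T extends Variant> T activate(T target) {
        if (active != null) target.fromBytes(active.toBytes()); // copy on view switch
        active = target;
        return target;
    }

    VariantA asA() { return active instanceof VariantA ? (VariantA) active : activate(new VariantA()); }
    VariantB asB() { return active instanceof VariantB ? (VariantB) active : activate(new VariantB()); }
}
```

For example, writing the int view and then reading the short view triggers one copy, after which both halves of the integer are visible as shorts.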

Fig. 14. Bracketing algorithm.


As these BAL files are generated, the only conditional preprocessor lines are the ones used to handle tables. We use a Java program to record the extent (in line numbers) of each table. In Fig. 11b, the extents of the tables x1, x2 and x3 are (1:4), (5:8) and (9:12), respectively. We then augment the annotations inserted by the normalizer to identify the tables, as shown in Fig. 15 (PERSISTENT_SD "tabx" "tablename" [[...]]).

3.6.2. Persistent data size

As mentioned previously, persistent data structures contain a few size mismatches. The BAL data model is in fact quite robust with respect to such mismatches, which usually have no impact on the correctness of the program execution. The pointer to the next free location in memory can easily be reset to a legal position through a FIELD=M with no parameter. Moreover, container size mismatches are usually not an issue: provided the container has all the information needed by the computation based on it, its size can be larger or smaller than that of the containees without any visible effect. However, from the point of view of recovering a structured data model, size mismatches represent a major impediment to proper bracketing. In order to fix the known size mismatches, a table of fixes is read in the next step (Step 2 in Fig. 14) and then propagated to the definitions of the variables bracketed by the previous step. Examples of such corrections are changing the size of a field (usually increasing it to include all contained fields) or inserting a missing container for a sequence of fields. The table of fixes was produced manually, addressing the warnings raised by a size checker developed as part of the bracketing tool. The fixes have been validated by the programmers.

3.6.3. Function parameter with variable size

The third step addresses the formal parameters of functions. When a variable is passed by reference, the compiler does not enforce that the type in the signature of the function corresponds to the type of the actual argument, and often the sizes and types do not match. Programmers declare the formal parameters without type and size information (allowed by the BAL language). Since the parameter is passed by reference, its actual location in memory starts exactly where the actual argument is located. Just as global and local variables may be redefined, so may parameters. In particular, when a variable parameter is redefined with a FIELD=M statement, all successive variables are defined relative to the actual argument. The only way to reset the storage to the memory space for local variables is to use a FIELD=M statement that relocates another local variable, or a FIELD=M statement without a variable, which resets the allocation pointer to the next available location in local variable memory. Thus, the transform for identifying the structure of a pass-by-reference

Fig. 15. Annotating tables in persistent data definitions.



parameter is slightly different from that for global or local variables or for pass-by-value parameters. We use the information provided by the redefinitions to correct the function signature and indicate the correct size of each parameter that is passed by reference. On some occasions, more than one redefinition of the same parameter specifies different sizes. In this case, the maximum among the candidates is used for the correction. For example, in Fig. 16 function f1 declares two parameters: the first (i.e., p1) is passed by value, while the second (i.e., buffer) is passed by reference (using the keyword VAR). The size of buffer is redefined twice with two different lengths (500 and 1000 bytes). Thus, in this case, we assume the maximum size (1000 bytes).

Fig. 16. Formal parameter size.

3.6.4. Bracketing

Once annotations and fixes are done, the actual bracketing can start. The general approach consists of folding overlaying fields into the parent field until they equal the size of the parent field. The bracketing step is composed of four sub-steps. First, size information is read from the fact base and an in-memory database is initialized. As explained previously, the unique identifiers provide the link between the size facts and the declarations they represent. We have extended the base grammar of the BAL language to include our bracketing syntax, as shown in Fig. 17. The first definition adds the square bracketing syntax to the FIELD statement. The grammar for the square bracketing includes a number: this is where the transform stores the amount of memory that is left in the redefinition (initially, the size of the variable that is being redefined). The second grammar definition adds the optional redefinition (square-angle bracketing) to variable declarations. A redefinition of the bal_declaration non-terminal adds both of these non-terminals as possible definitions.

Fig. 17. Bracketing grammar extensions.

The bracketing transform (Steps 4.1 and 4.2 in Fig. 14) starts by adding an empty extent to all FIELD redefinitions. The size field of the extent is initialized to the size of the variable that is overlaid. Fig. 18 shows the result of this transformation. Empty square brackets have been added to the redefinitions of a and b. The initial size associated with the square brackets is 10 for a and 5 for b. The transform proceeds by checking whether the next entity following a bracketed field can be moved inside the brackets. This is achieved by comparing the size of the entity with the available space left in the field. Depending on the type of the entity that follows the bracketed extent, a different transformation applies. The possible transformations are Steps 4.3.1.1, 4.3.1.2, 4.3.1.3 and 4.3.2 from Fig. 14.

Fig. 18. Bracketing initialization.

3.6.4.1. Fold declarations (Step 4.3.1.1)

In case the entity to fold is a variable declaration (DCL), the transform folds in the variable if there is space left in the container. Fig. 20 shows the results of this transform. In Fig. 20a, the first declaration following each of the square-bracketed extents from Fig. 18 is folded. In particular, the variable b, whose size is 5, is folded into the redefinition of a, reducing the size of that redefinition from 10 to 5. The variable d, whose size is 1, is folded into the redefinition of b, reducing the available space from 5 to 4. This transformation is applied until no other variable declarations can be folded into the appropriate overlay. After folding declaration f into the redefinition of b, no other declaration folding is possible; the result is shown in Fig. 20c.

3.6.4.2. Fold constants (Step 4.3.1.2)
This is a special case of folding declarations which applies when the currently bracketed extent is followed by the declaration of a constant. Depending on the type of the constant, some space can be required in the variable pool. Tests revealed that byte and short constants do not require space and are substituted into expressions by the compiler at compile time. However, BCD and string constants are allocated in the variable pool and they interact with the other (non-constant) variables. As an aside, this means that they are not really constants: their values can be changed by assigning to an alias created by an overlay. Thus, when folding a constant into a bracketed redefinition, the available space must be updated only for BCD and string constants. In Fig. 21, only the string constants consume space when bracketed (3 bytes each), while the byte and short constants do not impact the available space.

3.6.4.3. Fold redefinitions (Step 4.3.1.3)

The last case is where the item following a redefinition is another redefinition (i.e., another FIELD statement). A redefinition may only be folded, or nested, when:

Fig. 19. Folding redefinitions.



Fig. 20. Folding declarations.
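The declaration-folding loop of Step 4.3.1.1 can be sketched as follows. The Decl record and the flat list of following entities are simplifications of the actual TXL transform, used here only to make the fitting criterion explicit.

```java
// Hedged sketch of Step 4.3.1.1: declarations following a bracketed FIELD=M
// are folded in while they fit in the remaining space of the extent.
import java.util.ArrayList;
import java.util.List;

class Folder {
    record Decl(String name, int size) {}

    // Fold successive declarations into an extent with `available` bytes left,
    // stopping at the first declaration that no longer fits (a possible size
    // mismatch, to be resolved by the heuristics of Section 4).
    static List<Decl> fold(List<Decl> following, int available) {
        List<Decl> extent = new ArrayList<>();
        for (Decl d : following) {
            if (d.size() > available) break;
            extent.add(d);
            available -= d.size();
        }
        return extent;
    }
}
```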

Fig. 21. Folding constants.

1. The redefinition is fully bracketed; that is, the remaining space is zero; and,
2. The redefinition applies to a field already present in the bracketed extent.

In Fig. 19a, the two redefinitions of b can be folded into the redefinition of a. When redefinitions are folded, the available space is not updated, since they apply to space that has already been accounted for. Thus, in the figure, the available space of a is not updated, because the redefinition of b does not actually consume space within a. In fact, the redefinition of b takes exactly the amount of space already reserved for field b (i.e., 5 bytes). After this transformation, the first rule (Fold declarations, Step 4.3.1.1) re-applies and the last declaration is finally folded into a, whose available space is eventually reduced to 0 (Fig. 19b).

The last case (Fold persistent structure, Step 4.3.2) applies when the current redefinition is followed by a persistent data definition. In this case, the persistent data definition cannot be broken into parts, because it describes the structure of an entire ISAM table. Either it is moved entirely into the extent of the redefinition or it is left outside of it. The size of a table definition is computed as the sum of the fields contained in its definition. In order to get this calculation right, overlay extents must have already been computed for the whole table, otherwise the same variable space could be counted multiple times (redefinitions do not require new memory space). This is the reason for applying this transformation only after all the others have been successfully applied.
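The two folding conditions for nested redefinitions can be expressed as a simple guard; the method and parameter names below are ours, not taken from the transform.

```java
// Hedged sketch of the guard for folding nested redefinitions (Step 4.3.1.3):
// both conditions listed above must hold for the fold to be legal.
import java.util.Set;

class RedefFold {
    static boolean canFoldRedef(int remainingSpace, String overlaidVar,
                                Set<String> fieldsInExtent) {
        // 1. fully bracketed (no space left); 2. overlays a field already inside
        return remainingSpace == 0 && fieldsInExtent.contains(overlaidVar);
    }
}
```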

Fig. 22. Moving transformation: (a) result of bracketing, (b) initialization and (c) move redefinitions. Final Java code (d).



After folding the persistent data structures, repeated iterations of folding declarations, constants and redefinitions may be required to complete the transformation.

3.6.5. Move definitions

Assuming that the square bracketing steps have been completed, the algorithm ends by moving redefinitions next to the declarations of the variables to which they apply (Step 6 in Fig. 14, Move redefinitions). This is done in three steps:

1. Redefinitions are removed from the code and put in a separate data structure.
2. From the redefinitions, a list is created with the names of all the variables for which an overlay has been found. The declarations of the variables in this list are initialized with empty square-angle brackets.
3. Redefinitions are moved into the square-angle brackets of the variable for which they provide an overlay.

When this step is over, every field declaration is followed (in square-angle brackets) by all its possible alternative overlays. If there is more than one overlay, a union is recognized. Fig. 22 shows an example of the results produced by the moving step. Fig. 22a shows the output of the bracketing transformation for fields a and b. In Fig. 22b, the three redefinitions are removed and the declarations of a and b are initialized with empty square-angle brackets. Finally, redefinitions are moved under the proper declaration (see Fig. 22c). Notice that variable a has only one overlay, so it is naturally translated into a normal Java class, while variable b has two alternative redefinitions, thus it corresponds to a union in the final Java code (see Fig. 22d).
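The move step and the subsequent class/union decision can be sketched as a grouping operation; the class name and the pair encoding of overlays are illustrative assumptions.

```java
// Hedged sketch of the move step: each bracketed redefinition is grouped under
// the variable it overlays; a variable with more than one overlay becomes a
// union, one with a single overlay becomes a plain class.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Mover {
    // overlays: pairs of (overlaid variable, overlay identifier)
    static Map<String, List<String>> group(List<String[]> overlays) {
        Map<String, List<String>> byVar = new LinkedHashMap<>();
        for (String[] o : overlays)
            byVar.computeIfAbsent(o[0], k -> new ArrayList<>()).add(o[1]);
        return byVar;
    }

    static boolean isUnion(List<String> overlaysOfVar) {
        return overlaysOfVar.size() > 1;
    }
}
```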

In the ideal situation, a perfect match should be found between the size of the container field and the size of its overlay. In such a case, the folding procedure stops when the available space of the overlaid variable reaches zero. However, since the BAL compiler does not enforce any size control among overlays, size mismatches can happen, and they must be handled by the bracketing algorithm. This is the topic of the next section.

4. Heuristics to handle data size mismatches

In this section we describe the heuristics that we have defined to manage the case where the size of the relocated variable and the total size of the declaration sequence following the relocation do not match exactly, as assumed by the transformations presented in the previous section. In the presence of size mismatches, the problem of structuring the data is intrinsically ambiguous and admits multiple solutions. Our heuristics find a solution which captures the programmer's intention, whenever there are hints pointing at such an intention. However, in some cases no heuristic is applicable and we consider the case as not solvable automatically. In such cases we resort to a manual intervention. The amount of such interventions was the subject of the empirical assessment of the proposed heuristics.

4.1. Abstract heuristics to handle data size mismatches

The first case of size mismatch occurs when the size of the redefinition is smaller than the available space. Usually, in this case an explicit stopping condition is inserted by the developer. A stopping condition is a programming idiom that makes it possible to determine that the current variable relocation is finished and a structure or union variant can be identified.

Fig. 23. Rewrite rule handling the case of insufficient size followed by relocation.

Fig. 24. Rewrite rule handling the case of insufficient size at the end of the program’s declarations.


Abstract stopping conditions are:

- A relocation of a variable starts before the full size of the relocated variable is reached.
- The end of the declarations is reached before the full size of the relocated variable is reached.

Figs. 23 and 24 provide the rewrite rules associated with these stopping conditions. Each rule triggers the identification of a structure even though the total size of the relocating variables is lower than the size of the relocated variable, based on the occurrence of the two stopping conditions described above.

The second case of size mismatch consists of a redefinition which is greater than the available space. When the declaration that causes the size overflow is followed either by another relocation or by the end of the declarations in the program, we resolve the size mismatch by adding the last declaration to the structure being identified, under the assumption that the container declaration size is wrong (lower than needed). Specifically, the rewrite rules used for the cases of exceeding size in the relocating declarations are given in Figs. 25 and 26.

The abstract heuristics provided in this section do not cover all possible cases of size mismatch in relocations. However, we built them based on our experience with a real legacy system heavily based on relocation for its data model. We therefore think that the rules reported in this section are the most important ones, and that the cases not handled by them are a minority; this is confirmed by our experimental findings. More heuristics may of course be added, especially to deal with


different target languages and systems, but we think that the approach and the cases provided in this paper are a good starting point.

4.2. Concrete algorithm to handle data size mismatches

In this section we instantiate the stopping conditions described before for the general case, so as to take into account the specific features of the BAL programming language. These, in turn, give rise to additional stopping conditions, associated with special constructs provided by the language (e.g., FIELD=M with no parameter). Concrete stopping conditions for the BAL language are:

- The redefinition is explicitly closed by a FIELD=M statement, which resets the memory pointer to the next free available position.
- Another redefinition of the same field starts before the full size is reached.
- A redefinition of another field starts before the full size is reached.
- The declarations in the code are not enough to fit the size, but the end of the declaration code section is reached.

If neither a stopping condition is found nor an exact size match is reached, the second case of mismatch occurs: the redefinition is greater than the available space. In this case, the last entity to fold in the redefinition extent is the one that causes the extent to exceed the available space. We considered this a

Fig. 25. Rewrite rule handling the case of exceeding size followed by relocation.

Fig. 26. Rewrite rule handling the case of exceeding size at the end of the program’s declarations.



deliberate intention of the developer when the entity that crosses the boundary is followed by one of the previous stopping conditions. If no stopping condition occurs, an error is raised and manual intervention is required (Step 5, Report errors) to fix the size mismatch instance.
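The concrete stopping-condition check can be sketched as a small classification step. The enum and the string encoding of entity kinds below are our own simplification of the transform's pattern matching, not the actual implementation.

```java
// Hedged sketch: classifying what follows a size-mismatched redefinition,
// mirroring the concrete stopping conditions listed above. NONE corresponds
// to the error case that requires manual intervention (Step 5 in Fig. 14).
enum Stop { FIELD_RESET, SAME_VAR_REDEF, OTHER_VAR_REDEF, END_OF_DECLS, NONE }

class StopCheck {
    // nextKind: kind of the entity that follows ("FIELD", "FIELD_NO_ARG", "DCL",
    // or null at the end of the declaration section); var: the overlaid variable.
    static Stop classify(String nextKind, String nextTarget, String var) {
        if (nextKind == null) return Stop.END_OF_DECLS;
        if (nextKind.equals("FIELD_NO_ARG")) return Stop.FIELD_RESET;
        if (nextKind.equals("FIELD"))
            return nextTarget.equals(var) ? Stop.SAME_VAR_REDEF : Stop.OTHER_VAR_REDEF;
        return Stop.NONE; // plain declaration: no stopping condition, error raised
    }
}
```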

5. Experimental results

In this section we present some data collected during the data structure recovery phase. As an example, we discuss two instances of size mismatches that were reported as errors by the bracketing algorithm and required a manual fix.

5.1. Heuristics

The algorithm described in Fig. 14 has been successfully applied to the legacy system being migrated (8 MLOC). Specifically, we have been able to add structure to 510,108 variable declarations for which one or more overlays were defined in the original code. Correspondingly, 510,108 Java classes have been generated. Among them, 29,394 are unions (i.e., contain multiple variants, deriving from multiple alternative overlays).

The stopping conditions used to automatically handle the occurrence of size mismatches were quite effective. They allowed us to solve 81,900 cases automatically, leaving only a few size mismatches to be fixed manually. The effort involved in this manual intervention accounted for less than one person working week.

Table 1 shows the frequency of the cases considered during the inference of an object model from the existing flat memory model. The table is split into two columns, associated with data model inference for user code vs. dictionary. Case err occurs whenever none of the case-based heuristics applies and manual intervention is required. Manual fixes are performed by the engineers who developed the original legacy system and are BAL experts. As is apparent from Table 1, most of the cases, both in user code and dictionary, can be handled by the simplest case in our case analysis: exact size match. Insufficient size prevails over Exceeding size, both in user code and dictionary. Whenever a size mismatch occurs, the presence of a successive redefinition of the same or another variable can be exploited to infer the data structure boundaries in most situations. For the Exceeding size case, an explicit FIELD=M statement is used to indicate the end of the data structure being defined. The small number of cases not handled by the inference heuristics allows for a manual decision for the data structures of the dictionary. However, their number is still prohibitive for a manual intervention in the case of the user code: further automated analysis needs to be developed, at least to support and simplify the manual job.

Table 1
Occurrences of structuring cases.

Case                                                             User code    Dictionary
Exact size match                                                   425,988          8279
Insufficient size                                                   27,100            65
Exceeding size                                                      13,850             4
Insufficient size followed by FIELD=M                                 1371             0
Insufficient size followed by redefinition of same variable         13,121            36
Insufficient size followed by redefinition of another variable      11,642            11
Insufficient size at end of declarations                               966            18
Exceeding size followed by FIELD=M                                    5207             0
Exceeding size followed by redefinition of same variable              1645             1
Exceeding size followed by redefinition of another variable           4508             3
Exceeding size at end of declarations                                 2490             0
Remaining cases                                                       5286            45

5.2. Identified errors

The 45 error cases remaining in the dictionary have been fixed by means of a tool that allows the user to specify the dictionary transformations required to fix the problems. The tool accepts instructions such as "change field length" or "insert container field before a given field". By manually providing such instructions to the tool, we have been able to produce a version of the dictionary that can be processed completely automatically and from which it is possible to generate all the Java classes required to represent the persistent data. This manual intervention required five working days (inclusive of the development of the dictionary fixing tool). A similar code fixing tool is under development for the 5286 error cases remaining in the user code. Although the cases to solve are numerous, further investigation showed that they are not independent: they all refer to just 442 recurring variables. So we expect to be able to address them in a reasonable amount of time (we estimate a couple of man-months, which is acceptable for the software company involved in the project). It should be noticed that the unhandled cases are a small fraction (around 1%) of the total number of declarations with relocations (around 500,000), which empirically validates the effectiveness of the proposed heuristics.

An error often encountered in BAL is exemplified by the code in Fig. 27(left); the corresponding memory layout is in Fig. 27(top). BAL does not enforce the presence of a container for variables even in the presence of multiple overlays. Thus, a variable (a in the example) is often used as a pointer to the position in memory where a redefinition should start. In this way two different structures, [a, b] and [c, d], are overlaid in memory. This is a case where the bracketing algorithm reports an error: it tries to fold c and d within a, terminating with a size mismatch that is not followed by a stopping condition (declaration of x). A manual intervention is required here to: (1) add a new variable, i.e., cont, that acts as an explicit container; and, (2) change any redefinition of a to a redefinition of the new container. On the manually fixed code, shown in Fig. 27(right), the bracketing algorithm reports a perfect size match.

Fig. 27. Error identified by bracketing: original (left) and manually fixed (right) code.

Another case often found among the errors reported by bracketing is shown in Fig. 28. It can be considered a particular case of the error presented above, but a better fix can be applied here. In this case a sequence of variables (b and c) is defined without any container; later, programmers added a container (a) by using the address of the first variable in the sequence for the relocation of the container (FIELD=M,b). Here the bracketing would try to bracket a inside b, causing a size mismatch that is followed by no stopping condition, since a new variable definition (i.e., x) is encountered in the code. The most appropriate fix in this case is shown in Fig. 28(right). It consists of: (1) moving the container (a) to the top of the variable declaration sequence; and, (2) adding a relocation to the container before the first variable in the sequence (FIELD=M,a).

Fig. 28. Error identified by bracketing: original (left) and manually fixed (right) code.

5.3. Migration results

After structuring the data model, Java classes have been automatically generated for dictionary and user code, providing an object data model to the legacy system being migrated. Table 2 shows the number of classes, interfaces and unions (also counted as classes) generated for user code and dictionary. Interfaces are generated only to support the proper definition of unions in Java. The numbers indicate that in most cases unions are not necessary, in that the given data structure is accessible through a single view. In the dictionary, the total number of unions is relatively small (263), so that it is reasonable to plan their complete manual elimination from the generated Java code.

Table 2
Generated Java code.

            User code    Dictionary
Classes       510,108        12,402
Interfaces    148,621           753
Unions         29,394           263

5.4. Optimizations

Our first, simplistic implementation of the square bracketing transform took almost 7 days to execute over the entire BAL code base (8 MLOC). Two simple optimizations were applied. The first was an application of agile parsing: the non-terminal that represents a declaration was split into two non-terminals, one for foldable declarations (variable declarations, constant declarations and field statements with variable redefinitions) and one for declarations that are not foldable (field statements without variable declarations). This allows the parser to provide a first categorization and allows a simple pattern match in TXL to guard the rule. The second optimization was to stratify the bracketing rules into a one-pass rule that identifies potential folding locations using the grammar change described above, checks that space remains in the extent and then invokes a subrule which uses the TXL skipping statement to limit the rule to folding only at that location.
A grand parent rule applies the one-pass rule until no further changes are made. These two optimizations reduced the transformation time over the entire BAL code base from 7 days to less than 6 h, a significant improvement.

6. Related work

The problem of migrating a legacy software system to a novel technology has been widely addressed in the literature by different


approaches. The different strategies have been classified by [7] into (1) redevelopment from scratch; (2) wrapping; and, (3) migration. In their view, even the migration strategy requires substantial redevelopment. Our contribution belongs to the third class and consists of a set of automatic transformations. Migration to object oriented programming and extraction of an object oriented data model from procedural code are the topics of several works (e.g., [8–12]). Class fields originate from persistent data, user interface, files, records and function parameters, while class operations come from the segmentation of the program according to branch labels, in the migration of legacy procedural code to an object-oriented design described by Sneed and Nyary [10]. For similar purposes, data flow analysis and the classification of data elements into constants, user inputs/outputs and database records are used in the augmented object model by Tan et al. [11]. Sneed [13] migrated Cobol code to Object-oriented Cobol. Other works on object identification rely on the analysis of global data and of the code accessing them [8,14,15]. Since a record is too large and often contains unrelated data, cluster analysis was used [12] to identify groups of related fields within a record. Concept analysis is then applied to group together data and functionalities into candidate classes. In order to decide which data and which routines should be grouped together into classes, object-oriented design metrics (Chidamber and Kemerer) are also used to guide the migration [16,9], so as to avoid a poor design quality in the resulting system that would pose maintenance problems. Classes are still based on persistent data stores and routines are assigned to classes, such that the final result minimizes the object coupling metric. 
Type inference was used to acquire information about variables in legacy applications that goes beyond that conveyed by the declared type, so as to simplify migration toward a programming language with a richer and stronger type system [17,18]. For instance, type inference was applied to Cobol [19,20] to determine subtypes of existing types and to check for type equivalence. Static analysis and model checking have been used on Cobol to determine when a scalar type should be better regarded as a record type [21] and to determine unions the variants of which are consistently accessed through discriminators [22,23].

Analysis of low-level languages has been addressed by Cifuentes [24] and Balakrishnan [25]. Cifuentes [24] addresses the problem of translating assembly to a high-level language (e.g., to C), with the purpose of implementing a decompiler. In this work both control-flow and data-flow analysis have been used. The former is required to remove unstructured control flow statements (i.e., jumps), while the latter is used to group together multiple assembly statements that use registers for intermediate results into a single, higher-level C statement. Simple type conversion is also performed to get rid of register accesses. An approach to analyze binary code has been presented by Balakrishnan [25], with the goal of debugging a piece of executable code when the enclosed symbol table cannot be trusted, e.g., in a suspicious virus. The algorithm used, namely value-set analysis, computes an over-approximation of the address values and integer values that each data object can possibly hold at each point of the program. Such an approximation represents a viable alternative to an untrusted symbol table. Both of these approaches attempt to recover the structure of blocks of memory by analyzing their use in the code.

Similar to our paper, a theoretical framework is used to analyse Cobol data structures in the work by Ramalingam et al. [26].
While our objective is to group variables into classes, the approach described by Ramalingam et al. [26] is devoted to splitting records and arrays (aggregates) into a set of disjoint atoms. Structuring is then propagated to the other program variables participating in assignments. Such an atomization of variables leads to a more compact dependency representation (e.g., a PDG) that makes program slicing simpler. The work presented in this paper differs from the existing literature in that it deals with a starting data model permitting arbitrary overlays in memory. The approach taken, bracketing fields as subfields of their parents by folding them based on size and offset, provides an equivalent structure while keeping as much of the original source code organization as possible. This paper is the companion of another work [2], where the target data model and the migration path from the legacy to the new data model are described in detail. In the present paper, we do not repeat those details; we focus instead on the automated program transformations that have been defined to actually implement the migration strategy described in the companion paper [2].

7. Conclusions and future work

We have proposed an approach to deal with an unstructured data model in migration projects where the target data model is necessarily a structured (e.g., object oriented) one. To gain generality, we have described the problem by means of an abstract syntax, and our approach as a set of abstract rewrite rules over that syntax. Since size mismatches are expected to occur in practice when an unstructured data model is used, we have also provided a set of abstract heuristics, still in the form of rewrite rules, to handle automatically the cases that we think are most common. We have instantiated the proposed approach for the BAL programming language and applied it to a large legacy system being migrated to Java. Results indicate that the proposed heuristics are quite effective and handle most of the cases of size mismatch occurring in the analyzed system. For the remaining cases we estimated an effort for manual intervention which is compatible with the migration project plan.
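To make the folding step discussed in the comparison with related work concrete, here is a minimal sketch under simplifying assumptions (a hypothetical flat layout of (name, offset, size) declarations; the real BAL transformation must in addition handle the size mismatches covered by the heuristics): each field is nested under the closest preceding declaration whose memory interval contains it.

```python
# Hedged sketch: fold a flat list of (name, offset, size) declarations into
# a nesting where each field becomes a subfield of the closest declaration
# whose [offset, offset + size) interval contains it.
def fold(fields):
    # Sort by offset, then by decreasing size, so parents precede children.
    fields = sorted(fields, key=lambda f: (f[1], -f[2]))
    root = {"name": "<record>", "offset": 0, "size": float("inf"), "children": []}
    stack = [root]
    for name, offset, size in fields:
        # Pop until the top of the stack fully contains the new field.
        while not (stack[-1]["offset"] <= offset
                   and offset + size <= stack[-1]["offset"] + stack[-1]["size"]):
            stack.pop()
        node = {"name": name, "offset": offset, "size": size, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

# Hypothetical layout: NAME and ID overlay CUST; YEAR overlays ID.
layout = [("CUST", 0, 30), ("NAME", 0, 20), ("ID", 20, 10), ("YEAR", 20, 4)]
tree = fold(layout)  # CUST -> [NAME, ID -> [YEAR]]
```

The stack-based containment check is one natural way to realize offset/size bracketing; the actual transformation operates on the language's declaration syntax rather than on tuples.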
Overall, around 1% of the cases could not be handled by the proposed heuristics. Our ongoing work is devoted to completing the migration from BAL to Java, addressing the problems associated with the translation of the program structure and of the statements (which also involves GOTO elimination). With regard to the migration of the data model, we will investigate the possibility of inferring inheritance relationships among classes, based on the identification of memory overlays that redefine existing data structures with specializations (e.g., additional variable declarations).

References

[1] E.B. Swanson, The dimensions of maintenance, in: ICSE, 1976, pp. 492–497.
[2] M. Ceccato, T.R. Dean, P. Tonella, D. Marchignoli, Data model reverse engineering in migrating a legacy system to Java, in: Proceedings of the 15th Working Conference on Reverse Engineering (WCRE'08), 2008, pp. 177–186, doi:10.1109/WCRE.2008.27.
[3] J. Cordy, The TXL source transformation language, Science of Computer Programming 61 (3) (2006) 190–210.

[4] X. Guo, J.R. Cordy, T.R. Dean, Unique renaming of Java using source transformation, in: Proceedings of the 3rd IEEE International Workshop on Source Code Analysis and Manipulation (SCAM), IEEE Computer Society, Amsterdam, The Netherlands, 2003.
[5] T. Dean, J. Cordy, A. Malton, K. Schneider, Agile parsing in TXL, Journal of Automated Software Engineering 10 (4) (2003) 311–336.
[6] K. Wong, The Rigi User's Manual – Version 5.4.4, 1998.
[7] J. Bisbal, D. Lawless, B. Wu, J. Grimson, Legacy information systems: issues and directions, IEEE Software 16 (5) (1999) 103–111, doi:10.1109/52.795108.
[8] G. Canfora, A. Cimitile, M. Munro, An improved algorithm for identifying objects in code, Software: Practice and Experience 26 (1996) 25–48.
[9] A. De Lucia, G. Di Lucca, A. Fasolino, P. Guerra, S. Petruzzelli, Migrating legacy systems towards object-oriented platforms, in: Proceedings of the International Conference on Software Maintenance, 1997, pp. 122–129, doi:10.1109/ICSM.1997.624238.
[10] H. Sneed, E. Nyary, Extracting object-oriented specification from procedurally oriented programs, in: Proceedings of the Second Working Conference on Reverse Engineering, 1995, pp. 217–226, doi:10.1109/WCRE.1995.514710.
[11] H.B.K. Tan, T.W. Ling, Recovery of object-oriented design from existing data-intensive business programs, Information and Software Technology 37 (1995) 67–77.
[12] A. van Deursen, T. Kuipers, Identifying objects using cluster and concept analysis, in: Proceedings of the 1999 International Conference on Software Engineering, 1999, pp. 246–255, doi:10.1109/ICSE.1999.841014.
[13] H. Sneed, Migration of procedurally oriented Cobol programs in an object-oriented architecture, in: Proceedings of the Conference on Software Maintenance, 1992, pp. 105–116, doi:10.1109/ICSM.1992.242552.
[14] S.-S. Liu, N. Wilde, Identifying objects in a conventional procedural language: an example of data design recovery, in: Proceedings of the Conference on Software Maintenance, 1990, pp. 266–271, doi:10.1109/ICSM.1990.131371.
[15] S. Pidaparthi, G. Cysewski, Case study in migration to object-oriented system structure using design transformation methods, in: Proceedings of the First Euromicro Conference on Software Maintenance and Reengineering (EUROMICRO 97), 1997, pp. 128–135, doi:10.1109/CSMR.1997.583021.
[16] A. Cimitile, A. De Lucia, G.A. Di Lucca, A.R. Fasolino, Identifying objects in legacy systems using design metrics, Journal of Systems and Software 44 (1999) 199–211.
[17] R. O'Callahan, D. Jackson, Lackwit: a program understanding tool based on type inference, in: ICSE, 1997, pp. 338–348.
[18] G. Ramalingam, R. Komondoor, J. Field, S. Sinha, Semantics-based reverse engineering of object-oriented data models, in: ICSE, 2006, pp. 192–201.
[19] A. van Deursen, L. Moonen, Understanding Cobol systems using inferred types, in: IWPC, 1999, p. 74.
[20] A. van Deursen, L. Moonen, Exploring legacy systems using types, in: WCRE, 2000, pp. 32–41.
[21] R. Komondoor, G. Ramalingam, S. Chandra, J. Field, Dependent types for program understanding, in: TACAS, 2005, pp. 157–173.
[22] R. Jhala, R. Majumdar, R.-G. Xu, State of the union: type inference via Craig interpolation, in: TACAS, 2007, pp. 553–567.
[23] R. Komondoor, G. Ramalingam, Recovering data models via guarded dependences, in: WCRE, 2007, pp. 110–119.
[24] C. Cifuentes, D. Simon, A. Fraboulet, Assembly to high-level language translation, in: Proceedings of the International Conference on Software Maintenance, 1998, pp. 228–237, doi:10.1109/ICSM.1998.738514.
[25] G. Balakrishnan, T. Reps, Analyzing memory accesses in x86 binary executables, in: Proceedings of the 13th International Conference on Compiler Construction (LNCS 2985), Springer-Verlag, 2004, pp. 5–23.
[26] G. Ramalingam, J. Field, F. Tip, Aggregate structure identification and its application to program analysis, in: POPL'99: Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM, New York, NY, USA, 1999, pp. 119–132, doi:10.1145/292540.292553.