Recovering structured data types from a legacy data model with overlays

Information and Software Technology 51 (2009) 1454–1468 Contents lists available at ScienceDirect Information and Software Technology journal homepa...

Download PDF

3MB Sizes 0 Downloads 55 Views

Report

Full Text

Information and Software Technology 51 (2009) 1454–1468

Contents lists available at ScienceDirect

Information and Software Technology journal homepage: www.elsevier.com/locate/infsof

Recovering structured data types from a legacy data model with overlays Mariano Ceccato a, Thomas Roy Dean b, Paolo Tonella a,* a b

Fondazione Bruno Kessler, FBK-IRST, Software Engineering, Via Sommarive 18, 38050 Povo Trento, Italy Queen’s University, Kingston, Canada, ON K7L 3N6

a r t i c l e

i n f o

Article history: Available online 10 May 2009 Keywords: Program transformations Reengineering Legacy systems

a b s t r a c t Legacy systems are often written in programming languages that support arbitrary variable overlays. When migrating to modern languages, the data model must adhere to strict structuring rules, such as those associated with an object oriented data model, supporting classes, class attributes and inter-class relationships. In this paper, we deal with the problem of automatically transforming a data model which lacks structure and relies on the explicit layout of variables in memory as deﬁned by programmers. We introduce an abstract syntax and a set of abstract rewrite rules to describe the proposed approach in a language neutral formalism. Then, we instantiate the approach for the proprietary programming language that was used to develop a large legacy system we are migrating to Java. Ó 2009 Elsevier B.V. All rights reserved.

1. Introduction Migration projects are initiated due to a number of technical and non technical reasons. Among the technical motivations for migrating a legacy system to a modern platform is the obsolescence of the technology or programming language in use. Domain speciﬁc, proprietary programming languages represent a typical instance of this problem. Swanson [1] classiﬁed such interventions as adaptive maintenance and investigated its characteristics. Obsolete platforms and languages may become a major impediment to future developments and innovation, especially when they are tied to a proprietary development, compilation and execution environment. In such cases, supporting the language and the associated production and execution environment may require even more resources than those dedicated to the core business of the software company. On the non technical side, customers are increasingly aware of the technologies underlying the products they buy. Hence, they have also requirements about the adoption of modern environments, frameworks and languages. At the same time, they may have concerns about the usage of non-standard solutions, having no consolidated support and limited user community. This is becoming especially true in the days of the open source software development. This mixture of technical and non technical arguments is usually the trigger for a migration project, aimed at replacing the old technology with a modern one. Translation of the data model is central to any migration effort. Decisions made about the representation of the data structures will * Corresponding author. Tel.: +39 0461 314577. E-mail addresses: [email protected] (M. Ceccato), [email protected] (T.R. Dean), [email protected] (P. Tonella). 0950-5849/$ - see front matter Ó 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.infsof.2009.04.017

have strong implications for the rest of the translation. While modern languages have a structured approach to data modeling, enforcing relationships such as attribute ownership and aggregation, old languages are often biased toward the underlying physical model of the data. Hence, they typically support views that are tied to the memory layout, consisting of a sequence of bytes. Programmers deﬁne their data structures as a set of views and overlays deﬁned on top of a ﬂat sequence of bytes available from the memory. Types may be superimposed on such a sequence, but the relative displacement of various parts of a data structure is left to the programmers’ responsibility, being not explicitly supported by the language. In this paper we consider how source transformations can be used to infer structure from a set of data declarations based on notions such as their size in memory and their position, relative to other declared variables. One advantage of a source transformation approach as opposed to approaches that recover structure from binary objects is that the grouping of the ﬁelds as given by the programmer provides important hints as to the ﬁnal structure of the data model. Our approach attempts to keep these ﬁelds together as much as possible to maximize the programmer’s familiarity with the result. We present our approach abstractly, to increase its reusability in different contexts. For this purpose we deﬁne an abstract syntax for the starting, unstructured model and for the target, structured one, making minimal assumptions on the surface appearance of any language instantiating the abstract syntax. On top of such abstract syntax, we provide our data structuring algorithm as a set of abstract rewrite rules. We instantiated our approach to data structuring for the programming language used in a legacy system we are migrating

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

to Java. The language used by the legacy system is BAL, an acronym for Business Application Language. BAL is a proprietary, BASIC like language that contains unstructured data elements as well as unstructured control statements (e.g., GOTO). Variables in BAL are declared with an explicit size in bytes and can be displaced in memory at any arbitrary position, relative to some previously declared variable (relocations). The system that we are migrating is a production banking application which supports all functionalities necessary to operate a bank, including account management, ﬁnancial products management, front-desk operations, communications to central bank and other authorities, inter-bank communications, statistics and report generation. The user interface is character-oriented and the overall architecture is client-server, with the client operating mostly as a character terminal. The execution environment is a proprietary platform called B2U. The application has a quite large size, consisting of around 8 MLOC (million lines of code) and 500,000 declarations involving relocations. Results show that the proposed abstract heuristics to handle the structuring problem work pretty well in practice, with only 1% of the cases left for manual intervention. The rest of the paper is presented in six sections. Section 2 gives a short introduction to the BAL data model. Sections 3 and 4 discuss the transformations in more detail. Some experimental results about the effectiveness of the proposed heuristics are given in Section 5. Related work is presented in Section 6, followed by conclusions (Section 7).

2. Data model We start by presenting the unstructured data model provided by the BAL language, and some of the difﬁculties it presents to the translation process. We then present an abstract model that will be used to deﬁne the formal speciﬁcations for the set of transformations that generate a structured model, to be used in the translation to Java.

1455

Data relocation is accomplished with the FIELD=M,VAR statement, which is analogous to the .= variable address directive available in most assembly languages. An example is shown in Fig. 1. Indentation is used in the example to indicate programmer intention, but is not mandatory in the language. The variable a is a string variable (indicated by a type speciﬁer of ‘$’) and is ten bytes long. The following FIELD statement resets the current variable position (i.e., the memory location of the next declared variable) to the beginning of the variable a. As a result, the variable b, a string variable of length ﬁve, occupies the same ﬁve bytes as the ﬁrst ﬁve bytes of a. The variable c, also a string variable, takes the next ﬁve bytes. Thus all ten bytes of the variable a are shared with the two variables b and c. The variable b is further redeﬁned by the variables d, e, and f. The variables d and e are byte variables (the type speciﬁer ‘#’) while f is an array of strings. The size of the array is 3, while the size of its string components is 1 (so, it contains three strings of length 1). A second relocation starts from the beginning of b. It represents an alternative view on the ﬁve bytes allocated in memory for variable b. Such a view consists of variable g, a short (‘%’ type speciﬁer) that occupies two bytes, and the string h, of size 3. The FIELD=M that concludes these declarations refers to no variable. Its meaning is that the next declared variable will be allocated starting from the ﬁrst free location in memory (i.e., immediately past the end of c). Fig. 1(right) shows how the example would be translated to Java. Native BAL short, byte and string types are translated to the corresponding Java types, while BCDs are translated to BigDecimals. To preserve the semantics in the translated Java, strings and BCDs are annotated with the original BAL size (@Field(size=5)). Whenever a variable is the subject of a relocation (FILED=M,a), such a variable becomes a Java class (class A) and all the contained variables (declared in the same memory region,e.g., b and c) become ﬁelds of this class. When multiple overlays are deﬁned

2.1. Concrete syntax An appropriate translation of the data model is central to any migration effort, and our project is no exception. This section gives a short overview of the BAL data model. The data model is discussed in more depth in a different paper [2]. While BAL contains some structured control ﬂow statements such as IF. . .ENDIF and WHILE. . .WEND, the data model is very unstructured and similar to that found in structured assembly languages (e.g., that of IBM mainframes). The data model is byte oriented, and the language only provides four basic data types: byte, short, binary coded decimal (BCD) and string. The ﬁrst two are the same as those available in most languages, representing single and two byte signed integer values. Variables of the BCD and string data types can be of various lengths, and the developer must specify the length in bytes if she wants something different than the default length (eight and 16 bytes, respectively). Unlike languages such as C, there is no dynamic allocation and the length of all variables (including arrays and strings) is known at compile time. The BAL data model is based on the notion of memory relocation. Each newly declared variable can be either allocated on the next available region in memory, or it can be a relocation of a previously deﬁned variable. The relocation data model allows for arbitrary aliasing among variables as well as for arbitrary deﬁnition of multiple views (unions) insisting on the same underlying memory region (i.e., sequence of bytes). Such views are not constrained to be within the boundaries of the relocated region. On the contrary, they can span multiple memory regions that were previously regarded as separate data structures.

Fig. 1. Variable declarations in BAL (left) and their translation in Java (right).

1456

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

for the same variable, a class with union semantics is produced (B_Union for b). A separate class is generated for each union variant (B_Var1 and B_Var2). B_Union implements all interfaces of all union variants, so that it can be accessed through any alternative view. Some code to switch among views when needed is automatically generated (it implements the copy on read/write protocol and is not shown in Fig. 1 for space reasons). There are several consequences to the approach taken by the BAL language. The ﬁrst consequence is that records do not introduce any additional lexical scope. The second is that it is the developers’ responsibility to ensure that the sizes of all of the variables are correct. For example, in Fig. 1, the size declared for variable a could be incorrectly nine. This has no consequences on the memory layout of the other variables, except that the last byte of h would be not accessible from a. Moreover, no change would occur for variables declared after a, even in the absence of the ﬁnal FIELD=M statement, since 10 bytes are anyway consumed for this data structure (e.g., due to the total size of b and c). This indicates how the BAL data model is robust with respect to mismatches involving the size of container variables (such as a). Such a robustness is sometimes ‘‘abused” by BAL programmers, who do not care too much about the container size not matching the cointainees’ size. Finally, there are many ways of expressing the exact same layout of variables within memory. In the example in Fig. 1, the declaration of c could be the last of this data structure, without affecting the memory layout of variables in any respect. In a language such as COBOL or C, the variables d, e, and f would normally be declared as part of the aggregate ﬁeld that they form. In BAL they could be located anywhere within the same scope of declarations (i.e., global or local). Size mismatches and variations in memory layout declaration make the recovery of a structured record from a sequence of BAL declarations quite difﬁcult. All these issues must be addressed before starting the actual translation to Java. In the rest of the paper, problem of inferring a containment relationship among variables will be addressed by the bracketing algorithm. In case of variable size mismatches, some heuristics will be used to identify the correct size values and to suggest size ﬁxes. The developers usually keep the deﬁnitions of complex data structures in separate ﬁles which are included by the preprocessor. One special case is that of the data structures used to access persistent ISAM (Indexed Sequential Access Method – a data ﬁle format common in legacy systems) tables. These record deﬁnitions are stored in a data dictionary and contain full hierarchical storage information. Include ﬁles containing BAL declarations are generated from these dictionary entries, losing some of the hierarchical information in the process. Since the information exists in the dictionary, these ﬁles are treated specially by our migration process as the Java classes can be generated from the dictionary directly. One wrinkle is that the data dictionary contains a few size mismatches [2] and as a result, so do the generated include ﬁles. Such size mismatches are basically similar to those mentioned above for the example in Fig. 1. While not affecting the correctness of the ﬁnal code, these problems complicate the process of recovering the data model and sometimes require manual ﬁxes. The data structures for other temporary data ﬁles and internal storage do not have entries in a dictionary and the structure of these records must be inferred solely from the BAL code. However, when analyzing any structures that are local to the program, we have considerably more ﬂexibility in how the ﬁelds can be reorganized in order to infer reasonable data structures.

target data model. The abstract syntax does not contain any reference to a particular surface syntax that may be used to instantiate the underlying data model. This means that by adding proper syntactic sugar, it is possible to make this abstract syntax match the surface syntax of BAL. On the other hand, the data model of other programming languages supporting memory relocations and multiple memory overlays may be obtained by properly instantiating this abstract syntax, which makes our results generalizable to any programming language supporting some notion of relocation and memory overlay. Hence, the transformations described with reference to the abstract syntax become a general theory of data model transformation, from unstructured to structured. Fig. 2 shows the abstract representation of the unstructured BAL model. A declaration sequence is a set of zero or more declarations, each of which is a tuple consisting of a mem-id (which in turn is an identiﬁer) a size (a number) and an optional relocation (which is another mem-id). This is a straightforward abstraction of the BAL syntax, removing the type of the variable as well as details of the size speciﬁcation, such as arrays and preprocessor macros, and terminal symbols, such as FIELD=M. One constraint not visible in Fig. 2 prescribes that for each reloc there must exist one mem-id appearing in a declaration (i.e., the relocation must refer to some allocated memory area). Another obvious constraint not in Fig. 2 is that each mem-id should appear in one declaration only (no redeclaration allowed). Fig. 3 shows a model of the ﬁnal structured data. In this model, a declaration is either a structure declaration, a union declaration or a variable declaration. A structure declaration is a possibly empty sequence of declarations, which may in turn be plain variables or nested structures/unions. A union is a sequence of at least two variants. Each variant is a structure, in turn. This model represents the standard tree representation of the structured data models provided by most structured languages such as C, Pascal or Ada. It should be noticed that we deliberately decided to distinguish between structures and unions, with a union containing at least two variants, which means that the associated memory area is accessible according to two or more alternate views. Another possibility in the deﬁnition of the abstract syntax for the structured data model would be to unify the notion of union and structure, by allowing also zero or one variant in a union. We prefer to keep the two separate because they are subject to a different treatment during transformation and this can be described more clearly if we make the distinction between structure and union apparent at the syntactical level.

Fig. 2. Abstract syntax for the unstructured data model.

2.2. Abstract syntax In order to describe the data structuring process that we applied to the BAL programming language in a general, reusable form, we use an abstract syntax for the starting and for the

Fig. 3. Abstract syntax for the structured data model.

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

1457

Fig. 4. Abstract syntax for the merged data model.

Fig. 4 shows the merge of the two models into a single model. The merged model is needed to be able to describe the transformation as a set of rewrite rules operating on a single language. During the transformation process, part of the model will be already transformed, while part of it will be still under transformation, which requires the capability to manage a mixed abstract syntax tree. Hence the merge of the two data models. In the merged model, declarations are again structures, unions or variable declarations, but now they are followed by size and optionally by a relocation. Simple variables are represented by an identiﬁer, which is now a memory identiﬁer (mem-id), for consistency with the unstructured data model. Unions remain the same (a sequence of two or more variants, each of which is a structure), but they now have a name (mem-id), since they are associated with relocated variables. Similarly, a structure in a declaration (decl) has also an identiﬁer (mem-id), while structures that deﬁne union variants do not have any identiﬁer (it is the whole union that has one). The extra information (optional reloc) in the structure and union syntax is used to keep track of the structure as it is built. Figs. 5 and 6 show the abstract syntax trees for the example in Fig. 1 before and after structuring the data model. In the next few sections we will provide the rewrite rules necessary to bridge the gap between the two. 3. Data transformations We begin this section by ﬁrst using the abstract model described in the previous section to specify the desired transforma-

Fig. 6. Abstract syntax tree for the example program in Fig. 1 after structuring the data model.

tions. We then show how the abstract rules can be implemented using TXL [3]. 3.1. Abstract rewrite rules

Fig. 5. Abstract syntax tree for the example program in Fig. 1 before structuring the data model.

We can express the transformation of the ideal case (no size mismatches) in two abstract rules. The ﬁrst rule identiﬁes the extent of a structure declaration, while the second one groups together all the redeﬁnitions for the same variable and it generates a union when it identiﬁes more than one redeﬁnition for the same variable. Fig. 7 shows the ﬁrst of theses rules. The ﬁrst two lines of the rule state that we are looking for a single declaration (decl0) and for a sequence of 1 or more declarations (decl1 . . .declk). There may be one or more declarations separating the ﬁrst declaration (decl0) from the set of declarations we are interested in. The third line of the rule states that the ﬁrst declaration of the set (decl1) is a relocation of decl0, while the fourth line states that none of the other declarations in the set (decl2 . . .declk) specify a relocation. The ﬁfth line requires that the cumulative size of the set of declarations is the same as the size of the ﬁeld that is being redeﬁned. The remaining four preconditions specify a new structure declaration (declnew) with the same name as the original declaration (decl0) and containing the set of declarations (decl1 . . .declk) as ﬁelds. If the original declaration is a relocation of another

1458

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

Fig. 7. Rewrite rule handling the case of exact size match.

declaration, then the new structure is also a relocation of the same declaration. The rule gives two possible results. If the main declaration(decl0) currently has no children (ﬁrst result line), i.e., it is not a structure, then it is replaced with the new structure. If the main declaration has already children, i.e., it is a structure, then the current one is an additional (second or successive) relocation, hence we have a union. In this transformation step, when a second (or successive) union variant is discovered, we place the structures sequentially (decl0 declnew), so that the union rule (Fig. 9) can easily ﬁnd and merge the variants. Fig. 8 shows the result of three consecutive applications of the rewrite rule in Fig. 7 to the example in Fig. 5. First, the declaration of a is transformed into a structure, due to the relocation in decl1, followed by decl2. Then, the declaration of b becomes a structure. Finally, the second relocation of b (decl6) produces a structure which is added just after the previous one, instead of replacing it. In fact, such a relocation is the second relocation of b, hence it is the second variant of a union, which the current step transforms into an additional structure following the previous one. The union rule, shown in Fig. 9 takes a sequence of two or more structure declarations (decl0 . . .declk) that represent variants of the same variable and combines them into an abstract union (declnew.union). The constraints on this transformation are that:

all the declarations in the sequence have the same name(memid); all the declarations have been transformed into structures by the rule in Fig. 7; no declarations outside of the sequence have the same name as the structures in the sequence. The result of the union rule is to replace the sequence of structure declarations with a single union node that has each structure as a variant. If the union rule in Fig. 9 is applied to our running example (Fig. 1) after structuring the data based on the exact size match (rule in Fig. 7), we obtain the abstract syntax tree shown in Fig. 6. These abstract rules provide a good starting point for understanding the transformation task in the presence of exact size match. We now present the concrete implementation of these rules. Next, we consider the possible cases of size mismatch. 3.2. Concrete program transformation overview At the core of the concrete data structuring transformation is a source-to-source transformation which we call bracketing. Its purpose is to infer a containment relationship among variables, that is later used in the actual translation to Java. A variable that acts as

Fig. 8. Abstract syntax tree for the example program in Fig. 1 after applying the rewrite rule in Fig. 7 three times.

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

Fig. 9. Rewrite rule producing unions.

container eventually becomes a class and its content is translated into class ﬁelds. Several other phases of the migration process precede bracketing, providing generic transformations that are of use to the entire process, including bracketing. We present some of these transformations (shown in Fig. 10), although we limit the discussion to those that are directly useful for bracketing. The process is composed as follows: in a normalization phase, some syntactic ambiguities (e.g., overloaded semantics of symbols such as brackets) are resolved, in order to make the subsequent steps easier (e.g., brackets used with arrays are turned into square brackets); a number of facts are extracted from the code and stored into a data base; persistent table (ISAM) deﬁnitions, which are translated separately, are identiﬁed and marked; some difﬁcult cases of size inference are addressed separately; for example, two phases are devoted to resolving errors in persistent tables and function parameters that admit variable (i.e., non constant) size; at this point the actual bracketing (folding) can run, with all its internal sub-steps; multiple redeﬁnitions of the same variable could be spread all over the code. The purpose of the last step (moving) is to group them together with the redeﬁned variable, generating a union when their multiplicity is greater than one. In the following sections, we describe each of the steps in Fig. 10 using the BAL code of Fig. 1 as a running example. TXL was used for most of the transformations, including bracketing itself. 3.3. Normalization The normalization phase is the gateway to the entire migration process and performs several transformations designed to make the rest of the process easier. The normalization transforms related

1459

to bracketing are preprocessing, unique naming and persistent ﬁle identiﬁcation. The BAL compiler produces a partially preprocessed ﬁle during the compilation process. In producing this ﬁle, the BAL compiler resolves all inclusion statements as well as any conditional compilation. Fig. 11 gives a short example of the inclusion of a data deﬁnition. Fig. 11a shows the including ﬁle while Fig. 11b shows the included ﬁle. Fig. 11c shows the output of the BAL compiler. Each of the lines of the ﬁle end with an annotation that gives the line number and name of the original source ﬁle (e.g., the name of the include ﬁle that contains the statement). In addition, the BAL processor removes macro deﬁnitions and comments, although the use of macros in the code is not expanded in this partially preprocessed ﬁle. Since the macros will be translated separately to Java constants and methods, the presence of the original names in the code will allow us to use them in the ﬁnal translation. The ﬁrst set of transformations is to uniquely name [4] all of the identiﬁers in the programs. Unique identiﬁers are strings, which contain the names of each of the parent scopes of the identiﬁer. So the global variable A in program P has the unique identiﬁer ‘‘P A”, while the variable B in the function C in the program P has the unique name ‘‘P C B”. Since program names are unique, and functions and variables may not have the same name, this approach gives a globally unique name to all identiﬁers in the system. We use agile parsing [5] techniques to extend the grammar to include the unique identiﬁers as annotations on all occurrences of the identiﬁers (deﬁnitions and uses). These annotations take the form:

½\UID" # identifier Since BAL has simpler naming rules than Java, the unique naming process is also simpler, requiring only the resolution of macro names from include ﬁles (since the instances of the macros are not expanded) and the external functions deﬁned in libraries. For clarity and ease of reading, the examples in this paper do not show the unique naming annotations. As mentioned earlier, some of the include ﬁles are generated from the data dictionary. Each ISAM ﬁle in the data dictionary is used to generate a single persistent data deﬁnition ﬁle. Since some extra structural information is present in the data dictionary (and lost when the BAL include ﬁles are generated), it makes sense to generate the Java classes from the dictionary. However, we must still recognize where such data structures are used in the code so that instances of classes may be generated during migration. They also complicate the bracketing process since they may also be included as a subrecord of a larger record. Thus the last transform of the normalization process is to use the annotations inserted by the compiler at the end of each line to group all of the ﬁelds for a given persistent ﬁle structure into a single unit. In this way, a single object allocation can be generated in Java, replacing the entire persistent data structure included. In the example in Fig. 11b, the ﬁle ‘tabx.bal’ is an example of one of these ﬁles.

Fig. 10. Bracketing process.

1460

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

Fig. 11. Include ﬁle preprocessing.

Fig. 12 shows the preprocessed ﬁle from Fig. 11c after we have annotated the deﬁnitions (PERSISTENT_SD ‘tabx” [[. . . ]]). 3.4. Fact extraction Facts in RSF format [6] are extracted from the source code as part of the migration process. These include facts about the type of variables, their size, the structure of the BAL source and the names of externally visible entities. Facts are triples of the form:

factName uid uid or value The UIDs created by unique naming during normalization form a link between the facts and the source code, and allow the facts to be read and used by transforms later in the migration process. The main set of facts relevant to the bracketing process are the size facts. In fact, the layout of variables in memory depends on where they are allocated (i.e., whether they are relocations of other variables or not) and on their size. The size fact gives the size of each variable and constant in the code. This is accomplished by ﬁrst extracting constant declarations from the code. The transform then extracts all variables and their associated size expressions, inserting default sizes were appropriate. Since named constants may be used when specifying the size of variables, a constant propagation algorithm is run on the size expressions to determine the size of each variable.

language. The bracketing algorithm consists of ﬁnding out what is the full extent of a FIELD=M and in enclosing it between square brackets. It starts at a FIELD=M statement and it adds one or more declarations (DCL), possibly with nested FIELD=M. In addition to the square brackets for the FIELD=M instruction, we need a second kind of brackets used to group together all the redeﬁnitions of the same declaration (DCL). We call this second kind of brackets square-angle brackets (i.e., [h. . . i]). They start at a declaration and enclose all FIELD=M,var statements that insist on the same declared variable. When a DCL has no FIELD=M statement insisting on it, the square-angular brackets contain an empty sequence. Hence, they are omitted. Square bracketing corresponds to the application of the abstract rewrite rule in Fig. 7, which recognizes structures based on exact size match. Consequently, square bracketing determines the scope of a FIELD=M based on exact size match. Such a scope determines the structure (sequence of declarations) recognized in this step. Square-angle bracketing corresponds to the abstract rewrite rule in Fig. 9. Whenever one or more FIELD=M insist on the same DCL, all of such structures, identiﬁed as square bracketed FIELD=M, are put within the square angle brackets. After square angle bracketing, union identiﬁcation accounts just for determining the cases where square angle brackets surround two or more FIELD=M. Fig. 13 shows the result of applying bracketing to the example in Fig. 1. The ﬁrst declaration (variable a) is associated with a single FIELD=M within square-angle brackets, hence it is recognized as a structure, not a union. The structure associated with variable a contains two data members, b and c. They are determined by the square brackets surrounding the FIELD=M,a statement. In turn, b is recognized as a union, since its declaration has two FIELD=M,b within square-angle brackets. Each of the two FIELD=M,b is a structure playing the role of the respective union variant. Data members of variants are again available within square brackets. During the translation to Java, square-brackets and squareangular brackets play different roles. Square brackets identify the boundaries of classes, with nested FIELD=M representing cases of the composition relationship between classes. Square-angular brackets discriminate between classes and unions. When more than one FIELD=M insists on the same variable declaration, a union must be generated instead of a class.

3.5. Bracketing Bracketing is the core transform used to add structure to the legacy BAL data model. It instantiates the abstract rewrite rules described above to the speciﬁc features of the BAL programming

Fig. 12. Annotating persistent data deﬁnitions.

Fig. 13. Square bracketing and square-angle bracketing of the example in Fig. 1.

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

Since unions are not available in Java, we resort to an equivalent code organization, which implements the copy on read/write protocol. The generated class implements a set of interfaces, associated with the different union variants, and declares a set of data members, each referencing one union variant. At each point in time during execution, at most one variant is active (i.e., at most one data member is non-null). When the code switches from one view to another one (e.g., it sets an attribute available in one view and then it reads an attribute available in another view) the active variant is changed and data are copied from the previous active variant to the new one. 3.6. Bracketing algorithm The algorithm used for bracketing is shown in Fig. 14. We discuss each element of the algorithm separately. 3.6.1. Persistent Data The ﬁrst step of the bracketing algorithm consists of the identiﬁcation of the persistent data deﬁnitions (coming from the dictionary), which are delimited and separated from the user data deﬁnitions, within the preprocessed ﬁles. A single persistent data deﬁnition ﬁle corresponds to the deﬁnition of an ISAM ﬁle which, in turn, can contain more than one table. During normalization, each included ﬁle is delimited as a whole by inserting proper code annotations. However, the included ﬁle may contain more than one persistent data structure each associated with one table declared in the ﬁle. In the original, unpreprocessed BAL source ﬁles, each table deﬁnition is delimited by its own preprocessor macro, giving the programmer ﬁne grain control over the inclusion of table deﬁnitions in the code. In particular, each table deﬁnition is enclosed between #ifdef table_name and #endif preprocessor lines. Fig. 11b is an example of one of these ﬁles. The ﬁle tabx.bal contains three tables (x1, x2 and x3).

Fig. 14. Bracketing algorithm.

1461

As these BAL ﬁles are generated, the only conditional preprocessor lines are the ones used to handle tables. We use a Java program to record the extent(in line numbers) of each table. In Fig. 11b, the extents of each table, x1, x2 and x3, are (1:4), (5:8) and (9:12) respectively. We then augment the annotations inserted by the normalizer identifying the tables as shown in Fig. 15 (PERSISTENT_SD ‘‘tabx” ‘‘tablename” [[. . .]]). 3.6.2. Persistent data size As mentioned previously, persistent data structures contain a few size mismatches. In fact, the BAL data model is quite robust with respect to such mismatches, which usually have no impact on the correctness of the program execution. Actually, the pointer to the next free location in memory can be easily reset to a legal position through FIELD=M with no parameter. Moreover, container size mismatches are usually not an issue. Provided the container has all the information needed by the computation based on it, its size can be larger or shorter than the containees without any visible effect. However, from the point of view of recovering a structured data model, size mismatches represent a major impediment to proper bracketing. In order to ﬁx the known size mismatches, a table of ﬁxes is read in the next step (Step 2 in Fig. 14) and then propagated to the deﬁnitions of the variables bracketed by the previous step. Examples of such corrections are changing the size of a ﬁeld (usually increasing it to include all contained ﬁelds) or inserting a missing container for a sequence of ﬁelds. The table of ﬁxes was produced manually, addressing the warnings raised by a size checker developed as part of the bracketing tool. The ﬁxes have been validated by the programmers. 3.6.3. Function parameter with variable size The third step addresses the formal parameters of functions. When a variable is passed by reference, the compiler does not enforce that the type in the signature of the function corresponds to the type of the actual argument, and often the size and types do not match. Programmers declare the formal parameters without type and size information (allowed by the BAL language). Since the parameter is passed by reference, its actual location in memory starts exactly where the actual argument is located. Just as global and local variables may be redeﬁned, so may parameters. In particular when a variable parameter is redeﬁned with a FIELD=M statement, all successive variables are deﬁned relative to the actual argument. The only way to reset the storage to the memory space for local variables is to use a FIELD=M statement that relocates another local variable, or a FIELD=M statement without a variable, which resets the allocation pointer to the next available location in local variable memory. Thus the transform for identifying the structure of a pass by reference

Fig. 15. Annotating tables in persistent data deﬁnitions.

1462

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

parameter is slightly different than that for global or local variables or for pass by value parameters. We use the information provided by the redeﬁnitions to correct the function signature and indicate the correct size of each parameter that is passed by reference. On some occasions, more than one redeﬁnition on the same parameter specify different sizes. In this case, the maximum among the candidates is used for the correction. For example in Fig. 16 function f1 declares two parameters, the ﬁrst (i.e., p1) is passed by value, while the second (i.e., buffer) is passed by reference (using the keyword VAR). The size of buffer is redeﬁned twice with two different lengths (500 and 1000 bytes). Thus in this case, we assume the maximum size (1000 bytes). 3.6.4. Bracketing Once annotations and ﬁxes are done, the actual bracketing can start. The general approach consists of folding overlaying ﬁelds into the parent ﬁeld until they equal the size of the parent ﬁeld. The bracketing step is composed of four sub-steps. First, size information is read from the fact base and an in-memory data-base is initialized. As explained previously, the unique identiﬁers provide the link between the size facts and the declarations they represent. We have extended the base grammar of the BAL language to include our bracketing syntax as shown in Fig. 17. The ﬁrst deﬁnition adds the square bracketing syntax to the FIELD statement. The grammar for the square bracketing includes a number. This is where the transform stores the amount of memory that is left in the redeﬁnition (initially the size of the variable that is being rede-

Fig. 16. Formal parameter size.

Fig. 17. Bracketing grammar extensions.

Fig. 18. Bracketing initialization.

ﬁned). The second grammar deﬁnition adds the optional redeﬁnition (square-angle bracketing) to variable declarations. A redeﬁnition of the bal_declaration non-terminal adds both of these non-terminals as possible deﬁnitions. The bracketing transform (Steps 4.1 and 4.2 in Fig. 14) starts by adding an empty extent to all FIELD redeﬁnition. The size ﬁeld of the extent is initialized to the size of the variable that is overlaid. Fig. 18 shows the result of this transformation. Empty square brackets have been added to the redeﬁnitions of a and b. The initial size associated with the square brackets is 10 for a and 5 for b. The transform proceeds by checking if the next entity following a bracketed ﬁeld can be moved inside the brackets. This is achieved by comparing the size of the entity with the available space left in the ﬁeld. Depending on the type of the entity that follows the bracketed extent, a different transformation applies. The possible transformations are Steps 4.3.1.1, 4.3.1.2, 4.3.1.3, 4.3.2 from Fig. 14. 3.6.4.1. Fold declaration (Step 4.3.1.1). In case the entity to fold is a variable declaration (DCL), the transform folds in the variable if there is space left in the container. Fig. 20 shows the results of this transform. In Fig. 20a, the ﬁrst declaration following each of the square bracketed extents from Fig. 18 is folded. In particular, the variable b, whose size is 5, is folded into the redeﬁnition of a, reducing the size of that redeﬁnition from 10 to 5. The variable d, whose size is 1, is folded into the redeﬁnition of b reducing the available space from 5 to 4. This transformation is applied until no other variable declarations can be folded into the appropriate overlay. After folding declaration f into the redeﬁnition of b, no other declaration folding is possible, the result of which is shown in (Fig. 20c). 3.6.4.2. Fold constants (Step 4.3.1.2). This is a special case of folding declarations which applies when the currently bracketed extent is followed by the declaration of a constant. Depending on the type of the constant, some space can be required in the variable pool. Tests revealed that byte and short constants do not require space and are substituted into expressions by the compiler at compile time. However, BCD and string constants are allocated on the variable pool and they interact with the other (non-constant) variables. As an aside, this means that they are not really constants, and their values can be changed by assigning to an alias created by an overlay. Thus, when folding a constant into a bracketed redeﬁnition, the available space must be updated when a BCD or string constant is folded. In Fig. 21 only constant strings x1 and x2 consume space when bracketed (3 bytes each), while x2 and x3 (byte and short constants) do not impact the available space. 3.6.4.3. Fold redeﬁnitions (Step 4.3.1.3). The last case is where the item following a redeﬁnition is another redeﬁnition (i.e., another FIELD statement). A redeﬁnition may only be folded, or nested, when:

Fig. 19. Folding redeﬁnitions.

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

1463

Fig. 20. Folding declarations.

Fig. 21. Folding constants.

1. 2.

The redeﬁnition is fully bracketed. That is, the remaining space is zero; and, The redeﬁnition applies to a ﬁeld already present in the bracketed extent.

In Fig. 19a, the two redeﬁnitions of b can be folded into the redeﬁnition of a. When redeﬁnitions are folded, the available space is not updated, since they apply to space that has already been accounted for. Thus in the ﬁgure, the available space of a is not updated, because the redeﬁnition of b does not actually consume space within a. In fact, the redeﬁnition of b takes exactly the amount of space already reserved for ﬁeld b (i.e., 5 bytes). After this transformation, the ﬁst rule (Fold declarations, Step 4.3.1.1) re-applies and the last declaration is ﬁnally folded into a, whose available space is eventually reduced to 0 (Fig.19b). The last case (Fold persistent structure, Step 4.3.2) applies when the current redeﬁnition is followed by a persistent data deﬁnition. In this case, the persistent data deﬁnition cannot be broken into parts, because it describes the structure of an entire ISAM table. Either it is moved entirely in the extent of the redefinition or it is left outside of it. The size of a table deﬁnition is computed as the sum of the ﬁelds contained in its deﬁnition. In order to get this calculation right, overlay extents must have been already computed for the whole table, otherwise the same variable space could be counted multiple times (redeﬁnitions do not require new memory space). This is the reason for applying this transformation only after all the others have been successfully applied.

Fig. 22. Moving transformation: (a) result of bracketing, (b) initialization and (c) move redeﬁnitions. Final Java code (d).

1464

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

After folding the persistent data structures, repeated iterations of folding declarations, constants and redeﬁnitions may be required to complete the transformation. 3.6.5. Move deﬁnitions Assuming that the square bracketing steps have been completed, the algorithm ends by moving redeﬁnitions next to the declarations of the variables for which they apply (Step 6 in Fig. 14, Move redeﬁnitions). This is done in three steps. 1. Redeﬁnitions are removed from the code and put in a separate data structure. 2. From the redeﬁnitions, a list is created with the names of all the variables for which an overlay has been found. The declarations of variables in this list are initialized with an empty squareangle brackets. 3. Redeﬁnitions are moved into the square-angular brackets of the variable for which they provide an overlay. When this step is over, every ﬁeld declaration is followed (in square-angular brackets) by all its possible alternative overlays. If there are more than one overlay, a union is recognized. Fig. 22 shows an example of results produced by the moving step. Fig. 22a shows the output of the bracketing transformation for ﬁelds a and b. In Fig. 22b the three redeﬁnitions are removed and the declarations of a and b are initialized with empty square-angle brackets. Finally, redeﬁnitions are moved under the proper declaration (see Fig. 22c). It can be noticed that variable a has only one overlay, so it would be naturally translated into a normal Java class, while variable b has two alternative redeﬁnitions, thus it corresponds to a union in the ﬁnal Java code (see Fig. 22d).

In the ideal situation, a perfect match should be found between the size of the container ﬁeld and the size of its overlay. In such a case, the folding procedure stops when the available space of the overlaid variable reaches zero. However, since the BAL compiler does not enforce any size control among overlays, size mismatches could happen, so mismatches must be handled by the bracketing algorithm. This is the topic of the next section. 4. Heuristics to handle data size mismatches In this section we describe the heuristics that we have deﬁned to manage the case where the size of the relocated variable and the total size of the declaration sequence following the relocation do not match exactly, as assumed by the transformations presented in the previous section. In the presence of size mismatches the problem of structuring the data is intrinsically ambiguous and admits multiple solutions. Our heuristics ﬁnd a solution which captures the programmer’s intention, whenever there are hints pointing at such intentions. However, in some cases no heuristic is applicable and we consider the case as not solvable automatically. In such cases we resort to a manual intervention. The amount of such interventions was the subject of the empirical assessment of the proposed heuristics. 4.1. Abstract heuristics to handle data size mismatches The ﬁrst case of size mismatch occurs when the size of the redeﬁnition is smaller than the available space. Usually, in this case an explicit stopping condition is inserted by the developer. A stopping condition is a programming idiom that allows to determine that the current variable relocation is ﬁnished and a structure or union variant can be identiﬁed.

Fig. 23. Rewrite rule handling the case of insufﬁcient size followed by relocation.

Fig. 24. Rewrite rule handling the case of insufﬁcient size at the end of the program’s declarations.

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

Abstract stopping conditions are: A relocation of a variable starts before reaching the full size of the relocated variable. End of declarations is reached before reaching the full size of the relocated variable. Figs. 23 and 24 provide the rewrite rules associated with these stopping conditions. Each rule triggers the identiﬁcation of a structure even though the total size of the relocating variables is lower than the size of the relocated variable, based on the occurrence of the two stopping conditions described above. The second case of size mismatch consists of a redeﬁnition which is greater than the available space. When the declaration that causes the size overﬂow is followed either by another relocation or by the end of the declarations in the program, we resolve the size mismatch by adding the last declaration to the structure being identiﬁed, under the assumption that the container declaration size is wrong (lower than needed). Speciﬁcally, the rewrite rules used for the cases of exceeding size in the relocating declarations are given in Figs. 25 and 26. The abstract heuristics provided in this section do not cover all possible cases of size mismatch in relocations. However, we built them based on our experience with a real legacy system heavily based on relocation for its data model. So, we think that the rules reported in this section are the most important ones and the cases not handled through them are expected to be the minority. This is conﬁrmed by our experimental ﬁndings. More heuristics may be of course added, especially to deal with

1465

different target languages and systems, but we think that the approach and the cases provided in this paper are a good starting point. 4.2. Concrete algorithm to handle data size mismatches In this section we instantiate the stopping conditions described before for the general case so as to take into account the speciﬁc features of the BAL programming language. These, in turn, give raise to additional stopping conditions, associated with special construct provided by the language (e.g., FIELD=M with no parameter). Concrete stopping conditions for the BAL language are: The redeﬁnition is explicitly closed by a FIELD=M statement, that resets the memory pointer to the next free available position. Another redeﬁnition of the same ﬁeld starts before the full size is reached. A redeﬁnition of another ﬁeld starts before the full size is reached. Declarations in the code are not enough to ﬁt the size, but the end of the declaration code section is reached. If neither a stopping condition is found nor an exact size match is reached, the second case of mismatch occurs: the redefinition is greater than the available space. In this case, the last entity to fold in the redeﬁnition extent is the one that causes the extent to exceed the available space. We considered this a

Fig. 25. Rewrite rule handling the case of exceeding size followed by relocation.

Fig. 26. Rewrite rule handling the case of exceeding size at the end of the program’s declarations.

1466

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

deliberate intention of the developer when the entity that crosses the boundary is followed by one of the previous stopping conditions. If no stopping condition occurs, an error is raised and manual intervention is required (Step 5, Report errors) to ﬁx the size mismatch instance.

their number is still prohibitive for a manual intervention in case of the user code. Further automated analysis needs to be developed, at least to support and simplify the manual job.

5. Experimental results

The 45 error cases remaining in the dictionary have been ﬁxed by means of a tool that allows the user to specify the dictionary transformations required to ﬁx the problems. The tool accepts instructions such as ‘‘change ﬁeld length” or ‘‘insert container ﬁeld before a given ﬁeld”. By manually providing such instructions to the tool, we have been able to produce a version of the dictionary that can be processed completely automatically and from which it is possible to generate all Java classes required to represent the persistent data. This manual intervention required ﬁve working days (inclusive of dictionary ﬁxing tool development). A similar code ﬁxing tool is under development for the 5286 error cases remaining in the user code. Even if the cases to solve are quite a lot, a further investigation showed that they are not independent, they all refer to just 442 recurring variables. So we expect to be able to address them in a reasonable amount of time (estimate in a couple of man-months, which is acceptable for the software company involved in the project). It should be noticed that the unhandled cases are a small fraction (around 1%) of the total number of declarations with relocations (around 500,000), which validates empirically the effectiveness of the proposed heuristics. An example of an error that has been often encountered in BAL is exempliﬁed in the code of Fig. 27(left). The corresponding memory layout is in Fig. 27(top). BAL does not enforce the presence of a container for variables even in the presence of multiple overlays. Thus, often a variable (in the example a) is used as a pointer to the position in memory where to start a redeﬁnition. In this way two different structures, [a, b] and [c, d], are overlaid in memory. This is a case where the bracketing algorithm reports an error. In fact, it tries to fold c and d within a, terminating with a size mismatch that is not followed by a stopping condition (declaration of x). A manual intervention in required here to: (1) add a new variable, i.e., cont, that would act as an explicit container; and, (2) change any redeﬁnition of a to a redeﬁnition of the novel container. On the manually ﬁxed code, shown in Fig. 27(right), the bracketing algorithm would report a perfect size match. Another case that was often found among the errors reported by bracketing is shown in Fig. 28. It can be considered a particular case of the same error presented above, however a better ﬁx can be applied here. In this case a sequence of variables (b and c) are deﬁned without any container. However, later, programmers added a container (a) by using the address of the ﬁrst variable in the sequence for the relocation of the container (FIELD=M,b). Here the bracketing would try to bracket a inside b causing a size mismatch that is followed by no stopping condition, since a new variable deﬁnition (i.e., x) is encountered in the code. The most

In this section we presents some data collected during the data structure recovering phase. As an example we discuss two instances of size mismatches that have been reported as errors by the bracketing algorithm that required manual ﬁx. 5.1. Heuristics The algorithm described in Fig. 14 has been successfully applied to the legacy system being migrated (8 MLOC). Speciﬁcally, we have been able to add structure to 510,108 variable declarations for which one or more overlays were deﬁned in the original code. Correspondingly, 510,108 Java classes have been generated. Among them, 29,394 are unions (i.e., contain multiple variants, deriving from multiple alternative overlays). The stopping conditions used to automatically handle the occurrence of size mismatches were pretty effective. They allowed us to solve automatically 81,900 cases, leaving only a few size mismatches to be ﬁxed manually. The effort involved in such manual intervention accounted for less than a working week of a person. Table 1 shows the frequency of the cases considered during the inference of an object model from the existing ﬂat memory model. The table is split into two columns, associated with data model inference for user code vs. dictionary. Case err occurs whenever none of the case-based heuristics applies and manual intervention is required. Manual ﬁxes are performed by the engineers who developed the original legacy system and are BAL expert. As apparent from Table 1, most of the cases, both in user code and dictionary, can be handled by the simplest of the cases in our case analysis: exact size match. Insufﬁcient size prevails on Exceeding size, both in user code and dictionary. Whenever a size mismatch occurs, the presence of a successive redeﬁnition of the same or another variable can be exploited to infer the data structure boundaries in most situations. For the Exceeding size case, an explicit FIELD=M statement is used to indicate the end of the data structure being deﬁned. Inference heuristics, allows for a manual decision for the data structures of the dictionary. However,

Table 1 Occurrences of structuring cases. Case

User code

Dictionary

Exact size match Insufﬁcient size Exceeding size Insufﬁcient size followed by FIELD=M Insufﬁcient size followed by redeﬁnition of same variable Insufﬁcient size followed by redeﬁnition of another variable Insufﬁcient size at end of declarations Exceeding size followed by FIELD=M Exceeding size followed by redeﬁnition of same variable Exceeding size followed by redeﬁnition of another variable Exceeding size at end of declarations Remaining cases

425,988 27,100 13,850 1371 13,121

8279 65 4 0 36

11,642

11

966 5207 1645

18 0 1

4508

3

2490 5286

0 45

5.2. Identiﬁed errors

Fig. 27. Error identiﬁed by bracketing: original (left) and manually ﬁxed (right) code.

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

Fig. 28. Error identiﬁed by bracketing: original (left) and manually ﬁxed (right) code.

Table 2 Generated Java code.

Classes Interfaces Unions

User code

Dictionary

510,108 148,621 29,394

12,402 753 263

appropriate ﬁx in this case is shown in Fig. 28(right). It consists of: (1) moving the container (a) on top of the variable declaration sequence; and, (2) adding a relocation to the container before the ﬁrst variable in the sequence (FIELD=M,a). 5.3. Migration results After structuring the data model, Java classes have been automatically generated for dictionary and user code, providing an object data model to the legacy system being migrated. Table 2 shows the number of classes, interfaces and unions (counted also as classes) generated for user code and dictionary. Interfaces are generated only to support proper deﬁnition of unions in Java. Numbers indicate that in most of the cases, unions are not necessary, in that the given data structure is accessible through a single view. In the dictionary, the total number of unions is relatively small (263), so that it is reasonable to plan their complete manual elimination from the generated Java code. 5.4. Optimizations Our ﬁrst simplistic implementation of the square bracketing transform took almost 7 days to execute over the entire BAL codebase (8 MLOC). Two simple optimizations were applied. The ﬁrst of these was an application of agile parsing. The non-terminal that represents a declaration was split into two non-terminals, one for foldable declarations (variable declarations, constant declarations and ﬁeld statements with variable redeﬁnitions) and one for declarations that are not foldable (ﬁeld statements without variable declarations). This allows the parser to provide the ﬁrst categorization and allows a simple pattern match in TXL to guard the rule. The second optimization was to stratify the bracketing rules to a one pass rule that identiﬁes potential folding locations using the grammar change described above, checks that space remains in the extent and then invokes a subrule which uses the TXL skipping statement to limit the rule to folding only at that location. A grand parent rule applies the one pass rule until no further changes are made. The result of these two optimizations reduced the transformation time of the entire BAL code base to less than 6 h (from 7 days), a signiﬁcant improvement. 6. Related work The problem of migrating a legacy software system to a novel technology has been widely addressed in the literature by different

1467

approaches. The different strategies have been classiﬁed by [7] into (1) redevelopment from scratch; (2) wrapping; and, (3) migration. In their view, even the migration strategy requires substantial redevelopment. Our contribution belongs to the third class and consists of a set of automatic transformations. Migration to object oriented programming and extraction of an object oriented data model from procedural code are the topics of several works (e.g., [8–12]). Class ﬁelds originate from persistent data, user interface, ﬁles, records and function parameters, while class operations come from the segmentation of the program according to branch labels, in the migration of legacy procedural code to an object-oriented design described by Sneed and Nyary [10]. For similar purposes, data ﬂow analysis and the classiﬁcation of data elements into constants, user inputs/outputs and database records are used in the augmented object model by Tan et al. [11]. Sneed [13] migrated Cobol code to Object-oriented Cobol. Other works on object identiﬁcation rely on the analysis of global data and of the code accessing them [8,14,15]. Since a record is too large and often contains unrelated data, cluster analysis was used [12] to identify groups of related ﬁelds within a record. Concept analysis is then applied to group together data and functionalities into candidate classes. In order to decide which data and which routines should be grouped together into classes, object-oriented design metrics (Chidamber and Kemerer) are also used to guide the migration [16,9], so as to avoid a poor design quality in the resulting system that would pose maintenance problems. Classes are still based on persistent data stores and routines are assigned to classes, such that the ﬁnal result minimizes the object coupling metric. Type inference was used to acquire information about variables in legacy applications that goes beyond that conveyed by the declared type, so as to simplify migration toward a programming language with a richer and stronger type system [17,18]. For instance, type inference was applied to Cobol [19,20] to determine subtypes of existing types and to check for type equivalence. Static analysis and model checking have been used on Cobol to determine when a scalar type should be better regarded as a record type [21] and to determine unions the variants of which are consistently accessed through discriminators [22,23]. Analysis of low-level languages has been address by Ciﬂuentes [24] and Balakrishnan [25]. Ciﬂuentes [24] addresses the problem of translating assembly to a high level language (e.g., to C), with the purpose of implementing a decompiler. In this work both control-ﬂow and data-ﬂow analysis have been used. The former is required to remove unstructured control ﬂow statements (i.e., jumps), while the latter is used to group together multiple assembly statements that use registers for intermediate results into a single, higher-level C statement. Simple type conversion is also performed to get rid of register accesses. An approach to analyze binary code has been presented by Balakrishnan [25], with the goal of debugging a piece of executable when the enclosed symbol table cannot be trusted, e.g., in a suspicious virus. The algorithm used, namely value-set analysis, computes an over-approximation of the address values and integer values that each data object can possibly hold at each point of the program. Such approximation represents a viable alternative to an untrusted symbol table. Both of these approaches attempt to recover the structure of blocks of memory by analyzing their use in the code. Similar to our paper, a theoretical framework is used to analyse Cobol data structures in the work by Ramalingam et al. [26]. While our objective is to group variables into classes, the approach described by Ramalingam et al. [26] is devoted to splitting records and arrays (aggregates) into a set of disjoint atoms. Structuring is then propagated to the other program variables participating in assignments. Such an atomization of variables leads to a more com-

1468

M. Ceccato et al. / Information and Software Technology 51 (2009) 1454–1468

pact dependency representation (e.g., PDG) that makes program slicing simpler. The work presented in this paper differs from the existing literature in that it deals with a starting data model permitting arbitrary overlays in memory. The approach taken, bracketing ﬁelds as subﬁelds of their parents by folding them based on size and offset, provides an equivalent structure while attempting to keep as much of the original source code organization as possible. This paper is the companion of another work [2], where the target data model and the migration path from the legacy to the new data model are described in detail. On the contrary, in the present paper, we do not provide details about the target data model or the migration process. We focus instead on the automated program transformations that have been deﬁned to actually implement the migration strategy described in the companion paper [2]. 7. Conclusions and future work We have proposed an approach to deal with an unstructured data model in migration projects where the target data model is necessarily a structured (e.g., object oriented) one. To gain generality, we have described the problem by means of an abstract syntax and we have described our approach as a set of abstract rewrite rules over the abstract syntax. Since size mismatches are expected to occur in practice when an unstructured data model is used, we have also provided a set of abstract heuristics, still in the form of abstract rewrite rules, to handle automatically the cases that we think are most common. We have instantiated the proposed approach for the BAL programming language and we have applied it to a large legacy system being migrated to Java. Results indicate that the proposed heuristics are quite effective and manage most of the cases of size mismatch occurring in the analyzed system. For the remaining cases we estimated an effort for a manual intervention which is compatible with the migration project plan. Overall, around 1% of the cases could not be handled by the proposed heuristics. Our ongoing work is devoted to completing the migration from BAL to Java, addressing the problems associated with the translation of the program structure and of the statements (which involves also GOTO elimination). With regards to the migration of the data model, we will investigate the possibility of inferring inheritance relationships among classes, based on the identiﬁcation of memory overlays that redeﬁne with specialization (e.g., adding more variable declarations) existing data structures. References [1] E.B. Swanson, The dimensions of maintenance, in: ICSE, 1976, pp. 492–497. [2] M. Ceccato, T.R. Dean, P. Tonella, D. Marchignoli, Data model reverse engineering in migrating a legacy system to Java, Reverse Engineering, in: WCRE’08, 15th Working Conference on 2008, pp. 177–186, doi:10.1109/ WCRE.2008.27, 2008. [3] J. Cordy, The TXL source transformation language, Science of Computer Programming 61 (3) (2006) 190–210.

[4] X. Guo, J.R. Cordy, T.R. Dean, Unique renaming of Java using source transformation, in: Proceedings of the 3rd IEEE International Workshop on Source Code Analysis and Manipulation (SCAM), IEEE Computer Society, Amsterdam, The Netherlands, 2003. [5] T. Dean, J. Cordy, A. Malton, K. Schneider, Agile parsing in TXL, Journal of Automated Software Engineering 10 (4) (2003) 311–336. [6] K. Wong, The Rigi User’s Manual – Version 5.4.4, 1998. [7] J. Bisbal, D. Lawless, B. Wu, J. Grimson, Legacy information systems: issues and directions, Software, IEEE 16 (5) (1999) 103–111, doi:10.1109/52.795108. [8] G. Canfora, A. Cimitile, M. Munro, An improved algorithm for identifying objects in code, Software: Practice and Experience 26 (1996) 25–48. [9] A. De Lucia, G. Di Lucca, A. Fasolino, P. Guerra, S. Petruzzelli, Migrating legacy systems towards object-oriented platforms, software maintenance, in: Proceedings of the International Conference on October 1–3, 1997, pp. 122– 129, doi:10.1109/ICSM.1997.624238. [10] H. Sneed, E. Nyary, Extracting object-oriented speciﬁcation from procedurally oriented programs, reverse engineering, in: Proceedings of the Second Working Conference on July 14–16, 1995, pp. 217–226, doi:10.1109/ WCRE.1995.514710. [11] H.B.K. Tan, T.W. Ling, Recovery of object-oriented design from existing dataintensive business programs, Information and Software Technology 37 (1995) 67–77. [12] A. van Deursen, T. Kuipers, Identifying objects using cluster and concept analysis, software engineering, in: Proceedings of the 1999 International Conference on 1999, pp. 246–255, doi:10.1109/ICSE.1999.841014. [13] H. Sneed, Migration of procedurally oriented cobol programs in an objectoriented architecture, software maintenance, in: Proceerdings of the Conference on November 9–12, 1992, pp. 105–116, doi:10.1109/ ICSM.1992.242552. [14] S.-S. Liu, N. Wilde, Identifying objects in a conventional procedural language: an example of data design recovery, software maintenance, in: Proceedings of the Conference on 26–29 November, 1990, pp. 266–271, doi:10.1109/ ICSM.1990.131371. [15] S. Pidaparthi, G. Cysewski, Case study in migration to object-oriented system structure using design transformation methods, Software Maintenance and Reengineering, in: EUROMICRO 97, First Euromicro Conference on March 17– 19, 1997, pp. 128–135, doi:10.1109/CSMR.1997.583021. [16] A. Cimitile, A. De Lucia, G.A. Di Lucca, A.R. Fasolino, Identifying objects in legacy systems using design metrics, Journal of Systems and Software 44 (1999) 199–211. [17] R. O’Callahan, D. Jackson, Lackwit: A program understanding tool based on type inference, in: ICSE, 1997, pp. 338–348. [18] G. Ramalingam, R. Komondoor, J. Field, S. Sinha, Semantics-based reverse engineering of object-oriented data models, in: ICSE, 2006, pp. 192–201. [19] A. van Deursen, L. Moonen, Understanding cobol systems using inferred types, in: IWPC, 1999, p. 74. [20] A. van Deursen, L. Moonen, Exploring legacy systems using types, in: WCRE, 2000, pp. 32–41. [21] R. Komondoor, G. Ramalingam, S. Chandra, J. Field, Dependent types for program understanding, in: TACAS, 2005, pp. 157–173. [22] R. Jhala, R. Majumdar, R.-G. Xu, State of the union: type inference via craig interpolation, in: TACAS, 2007, pp. 553–567. [23] R. Komondoor, G. Ramalingam, Recovering data models via guarded dependences, in: WCRE, 2007, pp. 110–119. [24] C. Cifuentes, D. Simon, A. Fraboulet, Assembly to high-level language translation, software maintenance, in: Proceedings of the International Conference on 1998, pp. 228–237, doi:10.1109/ICSM.1998. 738514. [25] G. Balakrishnan, T. Reps, Analyzing memory accesses in x86 binary executables, in: Proceedings of the 13th International Conference on Compiler Construction (LNCS 2985), Springer-Verlag, 2004, pp. 5–23. [26] G. Ramalingam, J. Field, F. Tip, Aggregate structure identiﬁcation and its application to program analysis, in: POPL’99: Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM, New York, NY, USA, 1999, pp. 119–132, doi:acm.org/10.1145/292540. 292553.

Recovering structured data types from a legacy data model with overlays

Recovering structured data types from a legacy data model with overlays

Recommend Documents