Case study: Re-engineering C++ component models via automatic program transformation

Information and Software Technology 49 (2007) 275–291
doi:10.1016/j.infsof.2006.10.012

Robert L. Akers a, Ira D. Baxter a, Michael Mehlich a, Brian J. Ellis b, Kenn R. Luecke b

a Semantic Designs Inc., USA
b The Boeing Company, USA

Available online 13 December 2006

Abstract

Automated program transformation holds promise for a variety of software life cycle endeavors, particularly where the size of legacy systems makes manual code analysis, re-engineering, and evolution difficult and expensive. But constructing highly scalable transformation tools supporting modern languages in full generality is itself a painstaking and expensive process. This cost can be managed by developing a common transformation system infrastructure re-useable by derived tools that each address specific tasks, thus leveraging the infrastructure costs. This paper describes the Design Maintenance System (DMS), a practical, commercial program analysis and transformation system, and discusses how it was employed to construct a custom modernization tool being applied to a large C++ avionics system. The tool transforms components developed in a 1990s-era component style to a more modern CORBA-like component framework, preserving functionality.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Software transformation; Software analysis; C++; Migration; Component architectures; Legacy systems; Compilers; Reengineering; Abstract syntax trees; Patterns; Rewrite rules

1. Introduction

Organizations have vast resources invested and vast institutional wealth represented in software systems. Changes in business requirements and hardware support platforms require that these software systems undergo major change during their lifetimes. Often these changes must have sweeping impact across the system, such as when a system is migrated to a new technological platform or is reconfigured to a modern design. Though these modifications may be individually easy, their sheer number and scope represent a huge barrier to modernization. Gartner Group estimates that a well-trained programmer can migrate 160 lines of code per day from one platform to another [24]. With software systems containing millions of lines of code, the cost of a manual migration or other sweeping system change can be prohibitively high, resulting in financial stress or early system obsolescence.

For reengineering efforts that can be formulated for the most part as a series of mechanical transformations and code synthesis operations, automated source code transformation systems offer a solution. The source code languages in use can be embedded in a rewrite rule and pattern language to codify the required transformations, and these rules can be applied to the source code base by a transformation engine under the control of a custom process.

Constructing an automated reengineering tool for a real application, particularly one coded in a language as complex as C++, is non-trivial and requires a transformation infrastructure that is built for scale and is fully attuned to the semantic complexities of the source languages.

But economy of scale quickly makes this effort worthwhile for reengineering tasks on large systems, even if the automated migration requires some hand-finishing.

DMS is a developmental base for building semantically aware automated analysis and transformation tools. It supports virtually all conventional software languages and can be applied to systems built from multiple coding and design languages. DMS has been employed on numerous large-scale industrial software analysis and transformation projects. The Boeing Migration Tool (BMT), built using the DMS infrastructure, automatically transforms the component framework of a large C++ avionics system from a 1990s-era model to one based on a proprietary variant of the Common Object Request Broker Architecture (CORBA), preserving functionality but introducing regular interfaces for inter-component communication.

In this paper we describe the DMS infrastructure and the BMT application itself to provide insight into how transformation technology can address software analysis and evolution problems where scale, complexity, and custom needs are barriers. We illustrate some of the kinds of syntheses and transformations required and some of the issues involved with transforming industrial C++ code. We also discuss the development experience, including the strategies for approaching the scale of the migration, the style of interaction that evolved between the tool-building company and its industrial customer, and how the project adapted to changing requirements. We present Boeing's assessment of the project, assess the costs and benefits of the automated migration strategy, and present some reflections on the experience to guide others considering large-scale code reengineering projects.

2. The DMS software reengineering toolkit

DMS provides an infrastructure for building software analysis and transformation tools that employ deep semantic understanding of programs. Programs are internalized via DMS-generated parsers that exist for virtually all conventional languages. Analyses and manipulations are performed on abstract syntax tree (AST) representations of the programs, and transformed programs are printed with prettyprinters for the appropriate languages. The Toolkit can accept and simultaneously utilize definitions of multiple specification and implementation languages (domains) and can apply analyses and transformations to source code written in any combination of defined domains. Transformations may be either written as procedural code or expressed as source-to-source rewrite rules in an enriched syntax for the defined domains. Rewrite rules may optionally be qualified by arbitrary semantic conditions with reference to the parameters of the rule. The DMS Toolkit can be considered extremely generalized compiler technology. It presently includes the following tightly integrated facilities:

• A hypergraph foundation for capturing program representations (e.g., ASTs, flow graphs) in a form convenient for processing.
• Interfaces for procedurally manipulating general hypergraphs and ASTs.
• A means for defining language syntax descriptions for any context-free language and deriving parsers and prettyprinters to convert domain instances (e.g., source code) to and from internal forms.
• Support for name and type analysis, defining and updating flexibly constructed namespaces containing name, type, and location information, and implementing arbitrary scoping rules.
• Attribute evaluation for encoding arbitrary analyses over ASTs, with rules tied to grammar elements for expressive and organizational convenience.
• An AST-to-AST rewriting engine that understands algebraic properties (e.g., associativity, commutativity).
• The ability to specify and apply source-to-source program transformations based on language syntax. Such transforms can operate within a language or across language boundaries.
• A procedural framework for connecting these pieces and adding conventional code.

The DMS architecture is illustrated in Fig. 1. Notice that the infrastructure supports multiple domain notations (source code languages), so that multiple languages can be handled or generated by a given tool. DMS is implemented in a parallel programming language, PARLANSE [29], which enables DMS and DMS-based tools to run on commodity symmetric-multiprocessing workstations. DMS was designed for scalability of problem space and extensibility of domain coverage.

C++ is among the many domains implemented within DMS, and the system contains complete preprocessors, parsers, name and type resolvers, and prettyprinters for the ANSI, GNU, Visual C++ 6.0, and Visual Studio 2005 dialects. Unlike a compiler preprocessor, the DMS C++ preprocessor preserves both the original form and the expanded manifestation of the directives within the AST, so that programs can be manipulated, transformed, and printed with preprocessor directives preserved or transformed accordingly. The C++ name and type resolver has been extended to fully support preprocessor conditionals, creating a symbol table with conditional entries for symbols and conditional relationships between lexical scopes containing such symbols. Since all conditionals are retained in symbol table entries and can be subsequently evaluated with respect to any (context-dependent) settings of conditional variables, a system view can be constructed for any particular collection of configuration variable settings, but the ASTs for the system can be manipulated in full generality and source code re-generated with conditional directives intact.
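As a small illustration of what conditional symbol-table entries mean in practice, consider the following C++ sketch (identifiers invented); the DMS front end keeps both branches and the governing condition, rather than committing to one configuration:

// Both typedefs are retained in the symbol table, each entry qualified
// by the #ifdef condition; the type of buf therefore has two conditional
// views, and regenerated source keeps the directives intact.
#ifdef WIDE_CHARS
typedef wchar_t char_t;
#else
typedef char char_t;
#endif
char_t buf[64];   // resolves differently under each configuration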

[Fig. 1. DMS Software Reengineering Toolkit. The architecture diagram shows source files (in a domain notation) flowing through DMS-generated lexers/parsers into compiler data structures (ASTs, symbol tables, declarations); analyzers and attribute evaluation produce static analysis results; a transformation engine applies sequenced transforms; and prettyprinters regenerate revised source files. A parser definition is read by the domain definition reader, and language descriptions plus analysis and transform descriptions together constitute a tool definition.]

DMS has been under development for ten years. As presently constituted, it has been used for a variety of large-scale commercial activities, including cross-platform migrations, domain-specific code generation, preprocessor code clean-up, and construction of a variety of conventional software engineering tools implementing tasks such as dead/clone code elimination, test code coverage, execution profiling, source code browsing, and static metrics analysis. A more complete overview of DMS is presented in [8], including discussion of much of its machinery and how DMS was extensively used to create itself. For example, the DMS lexer generator, prettyprinter generator, and its name and type resolution analyzers for various languages are all tools created with DMS. Various other DMS-based tools are described on the Semantic Designs Inc. (SD) web site [34]. The bulk of the BMT consists of DMS infrastructure, with the remainder being the DMS domain definitions for C++ and custom code, written in PARLANSE and the DMS rewrite specification language, implementing tasks specific to the component re-engineering.

3. The Boeing Migration Tool – Overview

Boeing's Bold Stroke avionics component software architecture is based on the best practices of the mid-1990s [35]. Component technology has since matured, and the CORBA component model has emerged as a standard. The U.S. Government's Defense Advanced Research Projects Agency's Program Composition for Embedded Systems (DARPA-PCES) program and the Object Management Group (OMG) are sponsoring development of a CORBA-inspired standard real-time embedded system component model (CCMRT) [14], which offers standardization, improved interoperability, superior encapsulation, and interfaces for ongoing development of distributed, real-time, embedded systems like Bold Stroke. Standardization also provides a base for tools for design and analysis of such systems, and for easier integration of newly developed technologies such as advanced schedulers and telecommunication bandwidth managers.

Boeing wishes to upgrade its avionics software to a more modern architecture, a proprietary CCMRT variant known as PRiSm (Product line Real Time embedded System). This will allow more regular interoperability, standardization across flight platforms, and opportunities for integrating emerging technologies that require CORBA-like interfaces. Yet since the legacy software is operating in mature flight environments, maintaining functionality is critical. The modernization effort, then, must not alter functionality as it melds legacy components into the modern component framework.

The task of converting components is straightforward and well understood, but a great deal of detail must be managed with rigorous regularity and completeness. Since Bold Stroke is implemented in C++, the complexity of the language and its preprocessor requires careful attention to semantic detail. With thousands of legacy components now fielded, the sheer size of the migration task is an extraordinary barrier to success. With the use of C++ libraries, approximately 250,000 lines of C++ source contribute to a typical component, and a sound understanding of a component's name space requires comprehension of all this code.

To deal with the scale, semantic sensitivity, and regularity issues, DARPA, Boeing, and SD decided to automate the component migration using a custom DMS-based tool. Automating the migration process would assure regularity of the transformation across all components and allow the examination of transformation correctness to focus primarily on the general transforms rather than on particular, potentially idiosyncratic examples. It also would ensure uniform treatment of code style in a variety of ways, including name conventions for new entities, commenting conventions, code layout, and file organization. DMS offered scalability suitable for the size of the problem, and the ability to write transformation rules in C++ syntax allowed easier review by Boeing programmers. Its C++ front end provided several important features, including the preprocessor's ability to both preserve and expand macros and #include directives. Its full name and type resolution guaranteed semantically accurate treatment of symbols and program entities, including templates and entities defined under preprocessing conditionals. DMS was a uniquely qualified substrate for constructing a migration tool that blended code synthesis with code reorganization for a large C++ application.

Fig. 2 diagrams the relationship of communicating CORBA components. Facets are classes that provide related methods implementing a service, i.e., an area of functional concern. Event sinks provide entry points for signaling changes of state upstream, to which the component may wish to react. A similar interface exists for outgoing calls, with event sources being a standard route through which the demand for service is signaled to other components, and receptacles providing connections to the instances of facets of other components. Components are wired together at configuration time, prior to system execution.
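As a rough sketch of what such configuration-time wiring looks like in code, using the generated receptacle type and Connect method that appear later in Fig. 7 (the facet accessor is invented for illustration):

// Configuration code hands the caller's receptacle a pointer to the
// serving component's facet (through its wrapper); after wiring, calls
// flow through the receptacle without compile-time knowledge of the peer.
BM__Facet *facet = position1Component->GetFacetReference();   // invented accessor
receptacle->ConnectAMV__LogicalPosition_StateDevice(facet);   // Connect method from Fig. 7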

[Fig. 2. CORBA Distributed Component Architecture. The diagram shows components wired together at runtime: method calls flow from a component's receptacles to other components' facets (facet1 ... facetN), and events flow from event sources to event sinks (event sink1 ... event sinkM). Legend: Facet = offered service; Receptacle = used service; Event Sink = trigger; Event Source = signal.]

While event sinks and facets are very similar ideas with respect to data flow, they are distinct in the CORBA model for several reasons. The functionality of a facet is specific to each facet and component and to the services it offers, and facets share little commonality in form or function with each other or with facets of other components. Event sinks, on the other hand, implement a standard protocol for inter-component event signaling. Though the code specifics vary with the components' functional interfaces, the function, style, and structure of event sinks are consistent across all components, and hence they are given distinct, stylized identity and treatment; likewise for event sources and receptacles.

The legacy component structure is essentially flat, with all a component's methods typically collected in a very few classes (often just one), each defined with .h and .cpp files. One principal piece of the migration involves factoring a component into facets, each forming a distinct class populated with methods reflecting a particular area of concern. Other classes, like the event sinks and receptacles, must be synthesized at a more fine-grained level, extracting code fragments or connection information from the legacy system.

Facets as conceived in this Boeing architecture present the collected interface for a service. Determining which methods form this interface requires human understanding, since it is a reflection of system concepts and requirements and is not directly deducible from the system code.

One of the BMT's tasks is to sort the legacy component's interface methods into bins corresponding to the facets and to give indicative names to the new facet classes. To provide a clean specification facility for the Boeing engineers using the BMT, SD developed a simple facet specification language. For each component, an engineer names the facets and uniquely identifies which methods (via simple name, qualified name, or signature if necessary) comprise its interface. The bulk of the migration engineer's coding task is formulating facet specifications for all the components to be migrated, a very easy task for a knowledgeable engineer. The following simple rule is used: copy the legacy façade's interface and remove everything except those methods that are accessed by outside legacy components. An engineer familiar with the component can easily identify which methods those are.

The facet language itself is defined as a DMS domain, allowing DMS to automatically generate a parser from its grammar and to define specification-processing tools as attribute evaluators over the facet grammar. Fig. 3 shows a facet specification for a simple, two-facet component. The list of legacy classes prescribes which classes are to be transformed as part of the component. Façade declarations specify the legacy classes from which the facet interface methods are drawn. Facet method parameters need only be supplied when the method name is overloaded.

COMPONENT AV_LogicalPosition_StateDevice
  FACET AV_LogicalPosition_InternalStatus
  FACET AV_LogicalPosition_StateMode
  LEGACYCLASSES AV_LogicalPosition_StateDevice
END COMPONENT

FACET AV_LogicalPosition_InternalStatus
  FACADE AV_LogicalPosition_StateDevice
    "IsInitDataRequested"
    "IsAlmanacRequested"
    "IsDailyKeyInUseVerified"
    "IsDailyKeyInUseIncorrect"
    "IsGUV_User"
    "IsReceiverContainKeys"
    "GetMissionDuration"
    "IsRPU_Failed"
    "GetMemoryBatteryLow"
    "GetReceiverLRU_Fail"
END FACET

FACET AV_LogicalPosition_StateMode
  FACADE AV_LogicalPosition_StateDevice
    "bool GetBIT_Status ()"
    "GetPosition_StateRequested"
    "GetPosition_StateAchieved"
    "GetEventSupplierReference"
END FACET

Fig. 3. Facet specification for a component.

Fig. 4 illustrates the high-level view of this factoring of a flat-structured legacy component by the BMT, under direction from a facetization specification. Fig. 5 shows a more detailed view of this process in terms of a stripped-down legacy component being refactored into a CORBA-like facet configuration. (It does not reflect the activity related to event sinks, receptacles, or other new PRiSm classes.) The methods identified in the engineer's facet specification are relocated into new facet classes. References from within those methods to outside entities are modified with additional pointers to resolve to the originally intended entities. Furthermore, references in the rest of the system to the relocated methods are also adjusted via an indirection to point into the new facet classes. Declarations of all these new pointers must appear in the appropriate contexts, along with #include directives to provide the necessary namespace elements. Classes h1 and h2 are unchanged.

[Fig. 4. Bold Stroke component conversion. A legacy Bold Stroke component (a monolithic C++ class with co-mingled application methods, embedded event-handling code, and irregular interfaces) is processed by the BMT under an engineer-specified facetization, yielding a PRiSm Bold Stroke component with standard interfaces: a configuration facet class, a locking facet class, a receptacle class, an event sink class, application facet classes 1..n, a component (data) class, and a component representative class. Application code is factored by functionality; event-handling code is extracted and isolated.]

[Fig. 5. Factoring legacy classes into CORBA/PRiSm facets.]

The BMT translates components one at a time. Input consists of the source code, the facet specification for the component being translated, and facet specifications for all components with which it communicates, plus a few bookkeeping directives. Conversion input is succinct.

The BMT's migration process begins with the parsing of all the facet specifications relating to the component. A DMS-based attribute evaluator over the facet domain traverses the facet specifications' abstract syntax trees and collates them into a database (or symbol table) of facts for use during component transformation. In general, attribute evaluator rules associate computational actions with the rules of a language grammar. In this case, attribute evaluator rules, attached to the grammar rules defining the facet language, collectively build and populate a symbol table associating facets to components and facet methods to facets, as per the facet specification. For example, consider the facet grammar rule:

  signatures = signatures native_signature ;

This rule is essentially a list constructor, declaring that a signature list may take the form of a facet signature appended to another signature list. An attribute evaluator rule attached to this grammar rule causes the facet method's signature (or name, if that is all that is supplied) found in the native_signature to be added to the list of facet signatures accumulated for the signatures on the right-hand side of the grammar rule. Similarly, the evaluator rule for the grammar rule for a facet declaration places the list of accumulated facet method signatures in the symbol table entry associating the facet with its component. The attribute evaluator is a simple, language-based method for building the bridge between the declarative form of the facet specification and the internal structures to be used by the BMT. Associating computation rules with grammar rules factors computations like this one intuitively, and DMS's automated synthesis of algorithm control structure, populated with the computation rules, reduces programming time and handles dependency ordering.
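DMS's attribute-evaluator notation itself is not reproduced in the paper, but the action attached to this list-constructor rule computes something like the following C++ analogy (all types and names here are invented for illustration):

#include <string>
#include <vector>

struct Signature { std::string text; };          // a facet method name or signature

// Synthesized attribute for:  signatures = signatures native_signature ;
// The list built for the left-hand side is the right-hand side's list
// with the signature from native_signature appended.
std::vector<Signature> eval_signatures(std::vector<Signature> accumulated,
                                       const Signature &native_signature) {
  accumulated.push_back(native_signature);       // accumulate facet method signatures
  return accumulated;
}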

After processing the facet specifications, the BMT parses and does full name and type resolution on all the C++ component's classes, the façade classes of the neighboring components, and all files referenced via #include transitively from parsing this code. The DMS C++ name and type resolver constructs a symbol table for this entire source base, allowing lookup of identifiers and methods with respect to any lexical scope. Only by internalizing the code base in this manner can symbol lookups, and the transformations depending on them, be guaranteed sound. (For example, introducing a new symbol without knowing that the symbol is defined elsewhere in the system, even outside the scope of a local analysis, may change the semantics of the larger system. The complexity of proper C++ name and type resolution and of the resulting symbol spaces is one key point that defeats scripting languages for writing C++ transformers. Lexical processing alone is inadequate to guarantee soundness, and the lack of high-performance symbol table infrastructure hampers the use of scripting languages for handling both language complexity and industrial application scale.)

After digesting the source base, the BMT can then check the validity of the facet specification, for instance checking that relevant classes are present, that the specified facet methods are in fact methods defined in the specified façade class, and that their specifications are unambiguous in the presence of overloading (otherwise requiring specification by signature rather than simply by name).

Four particular kinds of transformations typify what the BMT does to perform the component migration:

• New classes for facets and their interfaces are generated based on the facet specifications. The BMT generates a base class for each facet, essentially a standard form, and a ''wrapper'' class inheriting from the facet and containing one method for each method in the functional facet's interface (i.e., for each method listed in the facet specification). These wrapper methods simply relay calls to the corresponding method in the component's legacy classes (a sketch of such a wrapper appears at the end of this section). Constructing the wrapper methods involves replicating each method's header and utilizing its arguments in the relayed call. Appropriate #include directives must be generated for access to entities incorporated for these purposes, as well as to access standard component infrastructure. A nest of constructor patterns expressed in the DMS pattern language pulls the pieces together into a class definition, thus synthesizing the code from patterns. The change from a pure CORBA scheme like that illustrated in Fig. 5 to a wrapper scheme was a major mid-project design change that allowed modernized components to interoperate with untranslated legacy components during the long-term modernization transition.

• After constructing the facets and wrappers, the BMT transforms all the legacy code calls to any of the facets' methods, redirecting original method calls on the legacy class to instead call the appropriate wrapper method via newly declared pointers. The pointer declarations, their initializations, and their employment in access paths are all inserted using source-to-source transforms, the latter with conditionals to focus their applicability. An example of one such transform appears in Fig. 6.

-+ Rewrites legacy cross-component method references, e.g.,
-+   FooComp->methname(arg1, arg2) rewrites to
-+   FooCompBarFacetPtr_->methname(arg1, arg2)
private rule adjust_component_access(class_pointer:identifier,
      method:identifier, arguments:expression_list):
      postfix_expression -> postfix_expression
  = "\class_pointer->\method(\arguments)"
  -> "\concatenate_identifiers\(\concatenate_identifiers\(
         \get_facet_for_method_with_legacy_pointer\(\class_pointer\,\method\)\,
         Facet_\)\,\class_pointer\)->\method(\arguments)"
  with side-effect remove_identifier_table_entries(class_pointer, method)
  if pointer_refers_to_class_of_foreign_component(class_pointer).

Fig. 6. DMS rewrite rule to modify an access path.

The various arguments are typed by their corresponding C++ grammar non-terminal names, and the rule transforms one postfix_expression to another. Within the body of the rule, argument names are preceded by a backslash. The rule uses concatenation to construct the symbol for the new pointer and adds the indirection. It will be applied only when the ''pointer_refers'' predicate establishes that the class_pointer and method name correspond to a specified facet method. If the rule does fire, it will also trigger some tool-internal bookkeeping as a side effect.

DMS patterns and rewrite rules are parameterized by typed ASTs from the domain (C++). Quoted forms within the definitions are in the domain syntax, but within quotes, backslashes can precede non-domain lexemes. For example, the references to the class_pointer, method, and arguments parameters are preceded with backslashes, as are references to the names of other patterns (concatenate_identifiers and get_facet_for_method_with_legacy_pointer) and syntactic artifacts of other pattern parameter lists (parentheses and commas). The application of the rewrite rule can be made subject to a Boolean condition, as it is here by the external function pointer_refers_to_class_of_foreign_component, and may also, upon application, trigger a side effect like the external procedure remove_identifier_table_entries. In this case both the condition and the side effect are externally defined with PARLANSE code. DMS conditions can be coded within the domain-specific syntax and employ internal pattern matching and Boolean combinators, but they may seamlessly branch out to arbitrary conventional computation.

• ''Receptacle'' classes provide an image of the outgoing interface of a component to the other components whose methods it calls. Since a particular component's connectivity to other components is not known at compile time, the receptacles provide a wiring harness through which dynamic configuration code can connect instances into a runtime flight configuration. Constructing the receptacles involves searching all of a component's classes for outgoing method calls, identifying which facet the called method belongs to, and generating code to serve each connection accordingly. Fig. 7 illustrates a portion of a receptacle .cpp file, showing one of the Connect methods, which sets up connection logic between the called component (AMC_Position1) and the caller (AMV_LogicalPositionStateDevice) for the remote facet specified by the item parameter. The Connect methods are always of the same form, but draw specifics from the calling environment and the specifics of the legacy method. The BMT creates #include directives appropriate to the calling and called components and as required by other symbols appearing in the method headers. Standard receptacle comments are inserted.

// File incorporates PRiSm Component Model (Wrapper version)
// file generated by the BMT tool for the PCES II Program
#include "AMV__LogicalPosition_StateDevice/AMV__LogicalPosition_StateDevice.h"
#include "AMV__LogicalPosition_StateDevice/AMC__Position1Receptacle.h"
#include "AMV__LogicalPosition_StateDevice/AMC__Position1Wrapper.h"

AMC__Position1Receptacle::AMC__Position1Receptacle(
    AMV__LogicalPosition_StateDevice * theAMV__LogicalPosition_StateDevicePtr)
  : theAMV__LogicalPosition_StateDevicePtr_(theAMV__LogicalPosition_StateDevicePtr)
{
  // There is nothing to instantiate.
}

AMC__Position1Receptacle::~AMC__Position1Receptacle()
{
  // Nothing needs to be destructed.
}

bool AMC__Position1Receptacle::ConnectAMV__LogicalPosition_StateDevice(
    BM__Facet *item)
{
  // Cast the parameter from a BM__Facet pointer to a wrapper pointer
  if (AMC__Position1Wrapper *tempCompPtr = platform_dependent_do_dynamic_cast(item))
  {
    theAMV__LogicalPosition_StateDevicePtr_->AddAMC__Position1Connection(tempCompPtr);
    return true;
  }
  ...
}

Fig. 7. A portion of a generated receptacle .cpp file.

• Event sinks are classes that are among the communication aspects of a component, representing an entry point through which an event service can deliver its product. They are uniform in style and structure for all components in a way that is more or less independent of the components' functionality, but their content is nevertheless driven by the legacy code's particular interconnectivity behavior. Since the essential code fragments for event processing already exist in the legacy classes (though their locations are not specified to the BMT and cannot generally be characterized), synthesizing event sinks involves having the BMT identify idiomatic legacy event-handling code throughout the legacy component by matching against DMS patterns for those idioms. Code thus identified is moved into the new event sink class, which is synthesized with a standard framework of constructive patterns. Control structures are then merged to consolidate the handling of each specific kind of event. One necessary complication of this extensive movement of code from one lexical environment to another is that all relevant name space declarations must be constructed in the new event sink. Definitions, declarations, and #include directives supporting the moved code must be constructed in the event sink class, and newly irrelevant declarations must be removed from the original environment. Doing all this requires extensive use of the DMS symbol table for the application, which among other things retains knowledge of the locations within files of the declarations and definitions of all symbols. Event sink code extraction and synthesis combines pattern-based recognition of idiomatic forms with namespace and code reorganization and simplification via semantically informed transformation.

The BMT, then, significantly modifies the legacy classes of the component by extracting code segments, modifying access paths, removing component pointer declarations and initializations, adding facet pointer declarations and initializations, reconfiguring the namespace as necessary, and making other miscellaneous modifications. It also introduces new classes for the component's facets, facet wrappers, receptacles, event sinks, and an ''equivalent interface'' class desired by Boeing component designers. Each kind of new class has a regular structure, but the details vary widely, based on the characteristics of the component being translated. Some classes, like the event sinks, are populated with code fragments extracted from the legacy class and assembled into new methods under some collation heuristics.

In our experience, the amount of new or modified code the BMT produces in a converted component amounts to over half the amount of code in the legacy component and replaces a more or less equivalent amount of code. Since a typical component-based system involves a large number of components, this makes a clear economy-of-scale argument for using automatic transformation in component re-engineering.
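To make the first kind of transformation concrete, a generated facet wrapper might look roughly like the sketch below, reusing the facet and method names from Fig. 3. The paper does not show the exact generated shape (base classes, constructors, infrastructure hooks), so the details here are illustrative assumptions:

// Hypothetical BMT-generated wrapper: one relay method per method named
// in the facet specification, each forwarding to the legacy façade class.
#include "AV_LogicalPosition_StateDevice.h"       // legacy façade (path assumed)
#include "AV_LogicalPosition_InternalStatus.h"    // generated facet base (assumed)

class AV_LogicalPosition_InternalStatusWrapper
    : public AV_LogicalPosition_InternalStatus {
public:
  explicit AV_LogicalPosition_InternalStatusWrapper(
      AV_LogicalPosition_StateDevice *legacy)
    : legacy_(legacy) {}

  // Relay methods (return types inferred from the method names):
  bool IsInitDataRequested() { return legacy_->IsInitDataRequested(); }
  bool IsAlmanacRequested()  { return legacy_->IsAlmanacRequested(); }
  // ... one such method for each entry in the facet specification

private:
  AV_LogicalPosition_StateDevice *legacy_;        // target of the relayed calls
};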


4. Application of enabling technologies

The DMS infrastructure provided a variety of mechanisms enabling the BMT to be developed in a timely and cogent manner. In this section we describe how some of these were employed in particular settings, and in so doing provide technical detail of how the migration was accomplished.

4.1. Use of patterns, rewrite rules, and AST manipulation

By looking at a few of the steps the BMT takes in the collection and consolidation of event-handling code, we can provide a taste of how a migration tool can use pattern-based recognition, code construction, and rewriting to perform code movement. As previously stated, constructing an event sink class involves identifying various event-handling code fragments idiomatically in the legacy classes, extracting them, and using them to populate a generic framework specified by the PRiSm model. The idiomatic recognition of these fragments is by application of recognizer patterns to the abstract syntax trees of the legacy classes. An example of one such pattern is in Fig. 8.

Patterns are essentially parameterized pieces of abstract syntax from the domain (programming language) in question, optionally qualified by a condition. DMS allows these pieces to be declared in terms of the native syntax of the domain, relieving the engineer from having to build the corresponding abstract syntax tree himself to write the rule. Within the quoted form that is the body of the pattern, parameters are preceded by a backslash. (References to the names of other patterns and their parameter lists within the quoted form are also preceded by backslashes, but there are no such cases in this pattern. Embedding patterns in this manner allows for the construction of very large and elaborate patterns.) The types of the parameters and of the pattern itself are syntax trees directly corresponding to non-terminal productions in the domain (in this case C++) grammar and are used both for type-checking the pattern itself and for appropriately focusing its applicability. The peculiar syntactic structure of this pattern, in particular with reference to the type_ identifier, which is application-specific, is sufficient to guarantee that any code in the legacy application that matches this pattern is event-processing code that should be moved into a particular method in the new event sink class. DMS pattern matching combines normal pattern-matching technology with the ability to express patterns in native domain syntax, possibly in mixed domains, possibly explicitly handling syntactic ambiguities in the domain(s), and possibly in terms of patterns elaborated with conventional coding (in PARLANSE) and procedural conditionals.

Several such patterns are used to identify all the candidate event-handling code in the legacy system. These idiomatic structures were enumerated by Boeing engineers, who believed them to be an exhaustive covering of cases. Since all the ported code was examined after migration, any missed or spurious cases would have been noticed during the quality assurance process, and new transformations added to the pool for subsequent re-translation. Compilation provides added assurance, since mistakes in transformation would likely manifest as compile-time errors. No such iteration was required for the components translated in this project.

One of the few points in the migration where Boeing wished to manually intervene was with the migration of this specific construct into an event sink. In some cases, they wished to make an essentially subjective judgment as to whether the statement sequence within any case from the switch statement should be factored into a new method or methods. There were no fixed rules about how this judgment could be made, and therefore no way to encode the judgment in any recognizer patterns. However, there were syntactic complexity measures that could be used to at least set a threshold beyond which manual inspection of the migrated code would be desired. These were encoded as a set of DMS patterns that, if satisfied, resulted in the replacement of the candidate statement sequence with an identical one annotated with a message directing the migration engineers' attention to the spot. These annotations were in the form of an #ifdef referencing a spectacularly descriptive identifier that would grab the engineer's attention. A constructor pattern was used to create the replacement code, and one such pattern is shown in Fig. 9. (#ifdef was used for what is effectively a comment in this pattern to ensure later manual intervention: if the case was overlooked and the #ifdef not removed, the resulting code would fail to compile.) A set of rewrite rules like the one shown in Fig. 10 replaces the legacy code with the annotated version in cases where the complexity threshold is met.

Code fragments matching the previously exhibited switch pattern can be drawn from all over the legacy component, and their various case statements are all collected and placed within the same method in the new event sink class.

public pattern EventSink_push_switch(event_set:identifier,
      index_expression:expression, cases:statement):selection_statement =
  "switch(\event_set[\index_expression].type_) \cases".

Fig. 8. An idiomatic event-handling recognizer pattern.

private pattern history_attention_3(stmt_seq:pp_statement_seq):statement_seq =
  "#ifdef \splice_ifdef_identifier_2\(ATTENTION_CONVERTER_The_legacy_push_handler_from\,
           contains_complex_code__You_must_decide_how_to_handle_it\)\&n
   #endif\&n
   \stmt_seq".

Fig. 9. Constructor pattern for annotating migrated code.
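For example, a legacy fragment shaped like the following (enclosing function and identifiers invented) would match the Fig. 8 pattern, binding event_set to myEvents, index_expression to i, and cases to the brace-enclosed case list:

// Idiomatic legacy event-handling code of the kind the recognizer locates:
void ProcessEvents(const UUEventSet &myEvents, int i) {
  switch (myEvents[i].type_) {
    case NAV_UPDATE: { HandleNavUpdate(); break; }   // invented event handling
    case GPS_LOST:   { HandleGpsLost();   break; }
  }
}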


private rule Annotate_EventSink_push_cases_3(test:constant_expression,
      stmt_seq:statement_seq, pp_stmt:pp_statement_seq):
      labeled_statement -> labeled_statement
  = "case \test : { \stmt_seq \pp_stmt break; }"
  -> "case \test : { \history_attention_3\(\stmt_seq\:pp_statement_seq\)\:pp_statement_seq
        \pp_stmt break; }".

Fig. 10. A rewrite rule to invoke the annotation in context.
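The combined effect of the Fig. 9 pattern and the Fig. 10 rule on a case judged too complex is roughly the following before/after (statements invented; the exact spliced identifier differs):

// before:
case 7 : { RecomputeHistory(); PurgeStaleEntries(); break; }

// after: an empty, loudly named #ifdef flags the spot for the engineer;
// per the paper, a leftover marker is intended to surface at build time.
case 7 : {
#ifdef ATTENTION_CONVERTER_The_legacy_push_handler_from_contains_complex_code__You_must_decide_how_to_handle_it
#endif
  RecomputeHistory(); PurgeStaleEntries(); break; }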

Constructor patterns are used to build this method, called push, and one pattern to create such a method is illustrated in Fig. 11. Having been built as the various event sink switch statements were accumulated via pattern matching over the legacy code, the collected case statements are inserted via the case_statements parameter. In this pattern are embedded references to other parameterized patterns, such as id_expression_of_identifier and declarator_id_of_id_expression, each of which coerces the creation of additional abstract syntax around the identifier to make it appropriate for this context, and pp_statement_seq_of_macro_invocation, which does a similar coercion for macro calls. Concatenate_identifiers, on the other hand, is a constructor pattern that forms a new identifier out of pieces.

public pattern EventSink_push_definition(class_name:identifier,
      eventSet_name:identifier, case_statements:statement_seq):function_definition =
  "void \concatenate_identifiers\(\class_name\,EventSink\)::Push
        (const UUEventSet & \declarator_id_of_id_expression\(\eventSet_name\))
   {
     \id_expression_of_identifier\(TRACE\)(\"Push (const UUEventSet& myEventSet)\");
     for (int \declarator_id_of_id_expression\(i\) = 0;
          i < \eventSet_name.GetLength(); i++)
     {
       switch (\eventSet_name[i].type_)
       {
         \case_statements
         default:
         {
           \pp_statement_seq_of_macro_invocation\(\%MacroName assert\,0\)
           // terminate gracefully for unrecognized event
           break;
         }
       }
     }
   }".

Fig. 11. Constructor pattern for building an event sink Push method.

Having gathered the relocated push code into a single method, the next task is to consolidate its structure by collapsing and unifying the case statements. A set of rewrite rules performs this task. One such rule is illustrated in Fig. 12. Two cases which have exactly the same test, as determined by using a single parameter to represent both case conditions, may be unified by concatenating their associated actions (statement_seq's), a rewrite which Boeing engineers judged to be sound within the limited realm of event-handling code (an illustrative before/after sketch appears at the end of this subsection). (The tool must be careful to apply the rewrite specifically to the body of the new push method, since this rewrite would not be sound in general use.) This rule applies to first and last cases within a switch, separated by other non-matching cases; other rules must be formulated for contiguous cases and for cases surrounded on one end, the other, or both by other cases within the switch. This multiplicity of similar rewrite rules within a rewrite rule set is a common phenomenon, but it can sometimes be mitigated within DMS by designating a sequence of elements within a grammar rule to be associative and/or commutative, thereby instructing the rewrite machinery to try all combinations of matches. An excellent commutative list-matching system incorporating possibly empty lists would simplify the expressive messiness here, but as a matter of development, SD has not yet done this. Since the semantics of switch would not allow this, we must resort to manually coding an exhaustive set of rules.

private rule merge_push_case7(test:constant_expression,
      stmt_seq1:statement_seq, stmt_seq2:statement_seq,
      stmts1:statement_seq): statement_seq -> statement_seq
  = "case \test : { \stmt_seq1 }
     \stmts1 \:pp_statement_seq
     case \test : { \stmt_seq2 }"
  -> "case \test : { \stmt_seq1 \stmt_seq2 \:pp_statement_seq }
      \stmts1 \:pp_statement_seq".

Fig. 12. Rewrite rule to do idiomatic merging of cases within a Push method.

In most instances where push code was merged into a single case of this switch statement, the merge was more or less trivial and was performed under heuristics certified as sound by Boeing component engineers. All merges that did not fall into one of the easy cases were flagged with special comments requiring the engineer to examine the resulting code and resolve any ordering anomalies. More elaborate heuristics could have been developed to guide the tool in these merges, but for the number of components ported in this demonstration, and for the few cases where this complexity was expected to arise, the additional cost of development was not warranted. A treatment of thousands of components would surely warrant this investment. Other clean-up steps were also formulated for the push method, such as the elimination of break statements within the switch structure. Finally, pattern-based rewriting was also used to eliminate the relocated event-handling code from the legacy classes, either by cleanly removing it or by leaving it in remnant form as code conditionally defined under an #ifdef for an undefined symbol.

The event sink class contains a number of methods that must be formulated in ways similar to the push method, i.e., by recognizing and extracting code from the legacy classes and then relocating and consolidating it within the event sink.
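An illustrative before/after of the case consolidation (event tags invented): duplicate labels arise because cases are gathered from several legacy switch statements into the single new Push switch, and rules like merge_push_case7 then unify them:

// before consolidation (accumulated from multiple legacy switches):
case NAV_UPDATE : { UpdatePosition(); }
case GPS_LOST   : { LogOutage(); }
case NAV_UPDATE : { NotifySubscribers(); }

// after merge_push_case7 and its sibling rules:
case NAV_UPDATE : { UpdatePosition(); NotifySubscribers(); }
case GPS_LOST   : { LogOutage(); }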


4.2. C++ parsing, name resolution, and symbol table

One of the enabling technologies of this effort was the investment made in parsing and name and type resolution for C++. The C++ grammar is not LALR(1). The DMS parser uses a GLR parsing strategy [36] as realized by [38], retaining in the abstract syntax trees representations of all possible cases when ambiguity arises. Resolving these ambiguities requires name and type resolution, since their nature determines the syntactic forms in which they can be embedded. C++ parsing is notoriously difficult when the strategy is to attempt ambiguity resolution as it arises during parsing by doing partial name and type resolution on demand. The DMS strategy allows full name and type resolution to occur after parsing, when the symbol table is built (via attribute evaluation over the abstract syntax trees resulting from the parse). The presence of the symbol table and the extent of the information saved there (including the locations within the source base of the definitions of all symbols) was one of the primary pieces of infrastructure supporting the reconfiguration of name spaces in the BMT refactoring.

In particular, retaining within the symbol table the locations of the definitions of all symbols was invaluable in reconstructing the #include file configuration after code movement. The name space of the destination class, for example an event sink, must be constructed to support the declaration of all the moved entities, and the name spaces of the legacy classes must be adjusted to reflect the absence of need for these entities in cases where no remaining references exist. As code is being moved into an event sink, an abstract syntax tree scan locates all symbols that must be declared and determines whether these should be defined locally or have their definitions incorporated via #include. For symbols that are to be declared locally, the tool also gathers the symbols required by their declarators, e.g., types. When the full list of externally defined symbols is accumulated, the tool looks up the definition location for each in the symbol table (being cognizant of which of these symbols might themselves have been moved during facet refactoring) and develops a list of files that are to be included. This list is collated, and then constructor patterns are used to build a sequence of #include directives that are passed to the constructor pattern for the .h file for the class (e.g., the event sink). (2)

(2) Refining this treatment to produce an optimal set of #include declarations, for example by exploiting the indirections inherent in the component's #include structure, would be possible, but was non-essential and was postponed for this iteration of tool development as production deadlines neared. This kind of pragmatic cost/benefit tradeoff evaluation is typical of automated migrations and illustrates how flexibility can be applied in managing the effort to achieve best return on investment.

Also of value was the retention of macro calls in the abstract syntax trees. Since the product of this effort was maintainable source code rather than compiled images, we could not afford to sacrifice code artifacts such as macro definitions and their calls. The DMS C++ parser retains the definitions of macros and places them in the symbol table, and at macro call sites it retains both the macro call form and its expansion. When the migrated source code is produced, then, macro calls may remain intact, and where they have been moved, the symbol table references back to the macro definitions allow the necessary #include directives to be generated.
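A small illustration of this macro-retention point (the macro's header is invented; TRACE itself appears in Fig. 11):

// Legacy class, before migration:
#include "Trace.h"                         // defines the TRACE macro (assumed)
TRACE("Push (const UUEventSet& myEventSet)");

// After the statement moves into the event sink, the regenerated source
// still contains the TRACE(...) call form rather than its expansion, and
// the event sink's files gain #include "Trace.h" because the symbol
// table records where TRACE is defined.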

5. Experience

Early in the project, SD and Boeing considered placing DMS in the Boeing environment and training Boeing's engineers in tool-building. Time constraints did not allow for this approach. Instead, SD gave Boeing sufficient instruction to be able to read and understand rewrite rules developed for the migration, so that engineers could review them for intent and accuracy.

Placing the tool building solely in SD's hands forced the creation of an operational framework that dealt with several fundamental issues. Boeing has extensive expertise in avionics and component engineering, but only a nascent appreciation of transformation technology. The tool builder, Semantic Designs, understands transformation and the mechanized semantics of C++, but had only cursory prior understanding of CORBA component technology and avionics. Other operational issues in the conversion were the strict proprietary nature of most of the source code, uncertainty about what exotic C++ features might turn up in the large source code base, Boeing's evolving understanding of the details of the target configuration, and the geographical separation between Boeing and SD.

To deal with most of these issues, Boeing chose a particular non-proprietary component and performed a hand conversion, thus providing SD with a concrete image of source and target and a benchmark for progress. The hand conversion forced details into Boeing's consideration. Being unburdened by application knowledge, SD was able to focus purely on translation issues, removing from the conversion endeavor the temptation to make application-related adjustments that could add instability. Electronic communication of benchmark results provided a basis for ongoing evaluation, and phone conferences supported development of sufficient bilateral understanding of tool and component technologies, minimizing the need for travel.

SD's lack of access to the full source code base required the tool builders to prepare for worst-case scenarios of what C++ features would be encountered by the BMT in the larger source base. This forced development of SD's C++ preprocessing and name resolution infrastructure to handle the cross product of preprocessing conditionals, templates, and macros. These improvements both hardened the tool against unanticipated stress and strengthened the DMS infrastructure for future projects.

New requirements were developed in mid-project, modifying the target (on one occasion very significantly and broadly affecting the coding target). Had a manual conversion altered its course at the same point, the cost of re-coding code that had already been ported would have been very high. The automated transformation approach, however, allowed the changed requirements to be codified with manageable reworking of transformation rules and wiring rather than massive recoding.

6. Evaluation

Boeing planned to use the BMT to initially convert 60 components handling navigation and launch area region computations to the PRiSm model for use in their unmanned aerial vehicle (UAV) avionics systems. Two large components were converted and used in flight as part of the successful DARPA PCES Flight Demonstration in April 2005.

Tool input was extremely easy to formulate. One script, which could be shared among all components, set environment variables guiding the tool into the source code base and then invoked the tool. Formulating component-specific input generally took around 20 minutes per component. This time commitment could be less for smaller components and for engineers more experienced with the BMT. The BMT converts one component at a time (a batch file could very easily be written to convert multiple components at once). Usually the tool would execute for approximately ten minutes on a dual-processor Dell Precision 610 desktop, producing a re-engineered component ready for post-processing by the human engineer.

The generated and modified code conformed to Boeing's style standards and was free of errors and stylistic irregularities. Human engineers did a modest amount of hand finishing, mostly in situations understood in advance to require human discretion. In these cases, the tool highlighted the generated code with comments recommending some particular action and sometimes generated candidate code. Roughly half the time, this candidate code was sufficient; other cases required hand modification.

The most difficult part of the conversion was the event sink mechanization. For input components that had their event mechanization spread across several classes, the BMT correctly moved the code fragments into the event sinks, but the human engineer was required to update some code inside the transitioned code fragments, making judgments, for instance, about whether some aggregations of code should be encapsulated in new functions. These judgment-based decisions would have been necessary with a hand conversion as well. More extensive engineering of the BMT might have eliminated this manual requirement by applying heuristics.

Testing consisted of very close code review and testing of the benchmark component, plus visual inspection and conventional testing of other components. Among the advantages of using the BMT were:

• All code was generated on the basis of a simple facet specification. The facet specifications were extremely easy and quick to write. The generated code was always complete with respect to the specifications.
• No negative side effects were generated by the tool. The tool did not generate code in places where it was not desired, nor did it move any code fragments to places that were not desired, mistakes that scripting approaches are more prone to making.
• The tool's name resolution capability proved to be a major advantage over the scripting approaches usually used, which often create unpredictable side effects that the BMT is capable of avoiding.

By using the BMT, we reduced the time necessary to convert the two components used for the DARPA PCES Demonstration by approximately half, compared to the time required to convert the components by hand. This time commitment is based on the entire conversion process, which consists of converting the component, post-processing the component based on the BMT-generated comments, writing the component's dynamic configuration manager by hand, and testing and integrating the converted component in the resultant software product. For a large project with hundreds or even thousands of components, a 50% reduction in total conversion time enables a tremendous cost and time reduction. The tool-based approach would represent the difference between machine-hours and man-centuries of code development labor, between feasibility and infeasibility of mass conversion.


7. Cost-benefit tradeoffs Developing a custom migration tool takes a significant effort. Not including the DMS environment contribution, the BMT required 13,000 lines of new tool code. This development cost must be balanced against the alternative cost of doing the tool’s work by hand, so the tradeoff for mechanized migration depends predominately on the amount of code being ported. For small applications, the effort is not worthwhile, but with even a modest sized legacy system, the economics quickly turn positive. One benchmark legacy component conversion gives an idea of the scale of this conversion. The legacy component, typical in size and complexity, contained 9931 lines of source code. The BMT-converted component contained 9456 lines, including 2209 lines of code in newly generated classes and 2222 modified lines in the residual legacy classes. Scaling these numbers, to convert a mere 60 components would require revision or creation of over 250,000 lines of code. Porting four such components would cause more new code to be written than went into the BMT itself. With commercial and military avionics systems typically containing thousands of components, the economic advantage of mechanized migration is compelling. The BMT automates only the coding part of the migration. Requirements analysis and integration are also significant factors and are fixed costs independent of translation method. Likewise partly for testing, though automatic translation errors, once found and corrected, will not recur, while coding errors can appear and recur at any point in a manual process at a rate more or less linear with the amount of code ported. Although some hand polishing of the BMT output was required, coding time was reduced to near zero relative to a manual effort, and the savings in coding time alone allowed a reduction of approximately half the total time required to migrate the components used for the DARPA PCES demonstration. The regularity of style in automatically migrated code provides further value, since there is no review and enforcement cost regarding coding standards. One might posit equations quantifying costs for either migration strategy, but too few components were actually ported either by hand or by the BMT to accurately specify the constant coefficient factors in such equations for this code base. Absent experiential data, Boeing estimated a least a man-month of effort to migrate a single component manually. As another data point, Gartner Group estimates that an engineer can migrate 160 lines of code per day. Our component refactoring does not have the complexity of a crossplatform migration, but Gartner’s estimate provides an upper bound for cost comparison. The measure of economic success is not whether a migration tool achieves 100% automation, but whether it saves time and money overall. Boeing felt that converting 75% of the code automatically would produce significant cost savings, a good rule of thumb for modest-sized projects. Anything less puts the benefit in a gray area. The code pro-


The code produced by the BMT was 95–98% finished. This figure could have been driven higher, but the additional tool development cost did not justify the dwindling payoff in this pilot project. Huge projects would easily justify greater refinement of a migration tool, reducing the need for hand polishing the results and driving coding costs ever lower.

Cost-benefit tradeoffs should be considered when scoping the task of a migration tool, even while the project is in progress. In this project, for example, we could have developed elaborate heuristics for consolidating event sink code, but we judged the expense not worthwhile for the pilot project. Any project of this kind would face similar issues. Mechanization can mean the difference between feasibility and infeasibility of even a medium-sized project.

8. Technological barriers

One inherent technical difficulty is the automatic conversion of semantically informal code comments. Though comments are preserved through the BMT migration, what they say may not be wholly appropriate for the newly modified code. Developing an accurate interpretation of free text discussing legacy code, and modifying those legacy comments to reflect code modifications, would challenge the state of the art in both natural language and code understanding. So while new documentation can be generated to accurately reflect the semantics of new code, legacy documentation must be viewed as subject to human revision.

Though the DMS C++ preprocessor capability, with its special treatment of conditionals, was up to the task for this migration, extremely extensive use of the C/C++ preprocessor (exploiting dialect differences, conditionals, templates, and macros) can lead to an explosion of possible semantic interpretations of system code and a resource problem for a migration tool. Preserving all these interpretations, however, is necessary for soundness. Furthermore, since macro definitions and invocations must be preserved as ASTs through migration, macros that do not map cleanly to native language constructs (e.g., producing only fragments of a syntactic construct, or fragments that partially overlap multiple constructs) are very difficult to maintain; an illustration follows below. These unstructured macro definitions cause no problem for compilers, since they are expanded away before semantic analysis in any single compilation, but preserving them in the abstract representation of a program for all cases is quite difficult. All these factors suggest that for some projects using languages involving preprocessors, a cleanup of preprocessor code prior to system migration is in order. For reasons of scale and complexity, this is a separate problem that could be tackled with another automated, customized tool.
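To illustrate (the macros below are invented for exposition, not drawn from the Boeing code base), consider macro definitions whose expansions are fragments of syntactic constructs rather than complete ones:

    /* Hypothetical unstructured macros: neither expansion corresponds to a
       complete statement or block, so neither maps to a single AST node. */
    static osal_mutex m;                     /* hypothetical lock object       */
    #define BEGIN_CRITICAL  { osal_lock(&m);   /* opens a block ...            */
    #define END_CRITICAL    osal_unlock(&m); } /* ... that another macro closes */

    void update() {
        BEGIN_CRITICAL        /* a compiler is untroubled: after preprocessing */
        refresh();            /* the braces balance; but a tool that must      */
        END_CRITICAL          /* re-emit the macro invocations in the          */
    }                         /* transformed source has no well-formed         */
                              /* subtree to attach either invocation to       */

Because the migrated source must regenerate the original macro invocations rather than their expansions, such macros have to be represented in the AST even though no grammar nonterminal matches them.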


9. Related work

The architecture community is extremely interested in the relationship between architectures and code. Model-driven architecture (MDA) [26] is, generally speaking, the abstract specification of models for a computing system in a standard notation, in its more ambitious forms combined with the automatic generation of compliant code from the models. Architecture-driven modernization (ADM) [27] is the process of developing and using an understanding of the architecture of existing systems to better enable the evolution of software assets, in practice by first extracting from legacy systems the kinds of models used for MDA. The BMT accomplishes an ADM-like task, but it is neither MDA nor ADM. The difficulty of applying MDA/ADM techniques to this task is that one must somehow first extract the architecture, determine the architectural change, and then regenerate the code. For the BMT, the kind of information that can be derived via conventional architecture extraction techniques would not be sufficiently tied to the legacy code to enable automated reengineering. Instead, we acquired the target architecture information directly via the engineer-authored facetization specifications, used idiomatic recognition of code fragments of interest to match existing code to the target components, and used program transformations to map the low-level code while preserving its functionality.

Program transformations come in two forms: procedural (arbitrary functions applied to compiler data structures) and source-to-source (maps between concrete surface-syntax forms of code). Procedural transformations have enjoyed enormous success for some 50 years in classical compilers [1], both for translating source code to binary and in optimizing code generators. Traditionally, transformations have been considered interesting only if they were correctness preserving, and compilers traditionally construct and cache crucial auxiliary information, such as symbol tables and control and data flow analysis results, to support complex context-dependent transformations that preserve correctness. More recently, procedural transforms have been proposed to support mechanical refactoring [25] as a technique for restructuring programs. Manual versions of this have been popularized [13] as a methodology with suggestions for tool support. Tools for refactoring Smalltalk [32] and Java have emerged; the better Java tools [42,44] perform some sophisticated refactorings such as ExtractMethod, while others on the market require some manual validation of the steps. Some experimental work has been done in refactoring C++ [37], and some commercial C++ refactoring tools are available [49]. All of these tools can apply only predefined sets of procedurally encoded transforms, and their results must be manually (re)validated because most of them do not construct the auxiliary information needed to ensure that the transforms are truly correctness preserving.
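The contrast between the two forms is easy to see in miniature. A source-to-source rule pairs a surface-syntax pattern with a replacement, as in the following sketch written in a DMS-like rule notation (schematic only; the rule name and predicate are invented, and the exact rule-language syntax is documented with DMS [8]):

    rule route_through_facet(e: expression): statement -> statement
      = " legacy_post(\e); "
     -> " event_sink_.push(\e); "
      if is_event_expression(e).

A procedural transform with the same effect would instead be hand-written code that walks the AST, matches call nodes for legacy_post, and splices in the replacement subtree.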

OpenC++ [28] combines a GNU C++ parser front end having name and type resolution with customizable procedural transforms, enabling transforms on C++ abstract syntax trees. The Puma system [46], like OpenC++, is a library of C++ classes for manipulating C++ sources; it provides its own C++ parser and some name resolution logic, and the libraries support procedural transformation of the Puma parse tree. Elsa [41] provides a GLR parser for preprocessed C++, produces a C++ abstract syntax tree, and does substantial type checking. Many tools have been constructed that analyze source code by parsing to compiler data structures and applying procedurally encoded static analysis; the SUIF [47] compiler infrastructure, for example, has been used in research for more than a decade as a framework for custom procedural analyses and program transformations, mostly focused on FORTRAN and C.

Source-to-source program transformations were originally conceived as a method of program generation in the 1970s [4]. The seminal Draco system [23] introduced the notion of domain analysis to support domain-specific language definition and transformational program generation. Draco accepted language ("domain") definitions and true source-to-source transformations for each language, using recursive-descent parsing. This type of technology has continued to develop, moving into scientific computing with the Sinapse system [17], which has evolved into a commercial system for synthesizing sophisticated financial instruments [43].

The idea that program transformations could be used for software maintenance and evolution, by changing a specification and re-synthesizing, was suggested by Balzer [5] in the early 1980s. Porting software and carrying out changes using transformations was suggested and demonstrated in the mid 1980s [2]. Feather [16] suggested simple source-to-source "evolution" transforms to change specification code. Evolution transforms led to the key idea of using correctness-preserving transforms to change non-functional properties of software and non-correctness-preserving transforms to change its functional properties. A theory of how to modify programs incrementally using "maintenance delta" transformations and previously captured design information was proposed in 1990 [6,7]; this work defined an array of transformation support technology that inspired the DMS Software Reengineering Toolkit.

Source-to-source transformation as a serious tool for software evolution is largely unrealized in practice, although there have been some very interesting applications. Following the Draco lead, Refine [11,30] was a groundbreaking software transformation engine originally intended for program generation from high-level specifications, but it found modest success only when used for early reengineering projects on legacy code [22]. Refine tended to be limited by its LALR parser generator, which does not handle most legacy languages very well. Refine was the basis for the original work at The Software Revolution Incorporated [24]; that company has since re-implemented its transformation foundation as JANUS and used it to perform numerous successful cross-platform migrations.


We do not know whether JANUS is a procedural or a source-to-source transformation system.

The TXL transformation system [12] was used to find and repair Y2K flaws in reportedly 4 billion lines of legacy code, much of it COBOL, and is still available [48]. TXL uses a backtracking parser, which makes it somewhat more robust for parsing legacy languages, but it does not handle ambiguities well, and it does not directly support the construction of symbol tables. The TXL transformation philosophy attempts to avoid these issues by partitioning massive transformations into separate stages, each operating on an abstracted, and therefore less complex, grammar specific to that stage.

The ASF+SDF system [9] enables the specification of arbitrary syntax using the Syntax Definition Formalism (SDF) [15] and the parsing of these languages using GLR parsing technology, which can handle arbitrary context-free grammars, as enhanced in [31]. ASF+SDF parsers work at the character level, allowing convenient composition of arbitrary sets of grammar rules; transformations are specified in terms of concrete surface syntax [18]. The Stratego system [40] also uses SDF and the same parsing technology as ASF+SDF, along with a domain-specific language for specifying complex traversals of the abstract syntax tree and the sequencing of transforms over either abstract or concrete syntax. Both ASF+SDF and Stratego have been used as porting and mass source-code change tools in research and commercial contexts [19].

The ATX reengineering environment L-CARE [20] is a commercial tool based on parsing to compiler data structures and applying procedures to extract program metrics and slices. L-CARE apparently has the ability to specify transformations as a combination of a source pattern and a transformation "patch", though it is not clear from the L-CARE documentation how these transformations are represented. L-CARE has been used for large-scale COBOL reengineering.

Two key stumbling blocks for transforming C++ are parsing it and the even more difficult engineering task of constructing an accurate symbol table, both because of the complexity of the language semantics and because of the differences among the C++ dialects in use. CodeBoost [3] combines the OpenC++ front end with Stratego to research optimizing transforms for PDE library calls in C++ code; transformation rules are expressed in concrete syntax inside the C++ source code itself. The Rose system [33] combines the EDG C++ front end with recursive calls on that front end to enable writing C++ source patterns used by C++ procedural transforms, to help optimize scientific codes written in C++. Besides being a production C++ parser, the EDG front end provides a symbol table, crucial to implementing sophisticated context-sensitive transforms.


AspectC++ [21,45] uses Puma as a foundation to implement AspectJ-like transforms ("weaving") on C++ code; C++ source patterns are included in aspect advice inserted at pointcuts specified in C++ terms. AspectC++ is not used for general C++ transformation.

We find many projects that claim parsing is the central issue for C++, only to discover that the parsing problem is dwarfed by that of constructing the symbol tables correctly. The only production-quality C++ parsers known to us to be available as system components are the GNU hand-coded parser, the EDG parser, and the one for DMS. All three are successful, in our opinion, not only because of their ability to parse, but especially because of the significant effort invested in engineering production-quality symbol tables. While GNU is very good as a compiler, it is not designed as a general program transformation system. EDG's front end is just that: it builds an AST and a symbol table, but provides nothing beyond that point to support transformation. ASF+SDF's parsing machinery, like that of DMS, is GLR and can arguably be used to parse C++ directly, but unlike DMS it does not subsequently produce a symbol table.

One approach to producing a symbol table with GLR parsers is to post-process the C++ parse tree according to the complex name and type resolution rules of C++. A significant complication is handling the highly ambiguous parse trees produced for C++ (a classic instance is sketched below). One approach to ambiguity handling is to use inconsistencies in a type analysis to eliminate impossibly typed subtrees, leaving behind only a legitimate parse and its legitimate interpretation. This can be accomplished with rewrites [10], but attribute grammars were considered better in a study of toy examples [39]. The DMS C++ domain definition, covering several popular dialects of C++, validates this view by implementing a full name and type resolver in some 250,000 lines of attribute grammar rules, including automatic elimination of incorrectly typed ambiguous subtrees. The DMS C++ grammar itself is only several thousand lines, demonstrating clearly that name and type resolution completely dominates the cost of constructing the parser.

Of all these systems, only two can realistically process C++: DMS and Rose. Rose is limited to C and C++ because of its tight tie to the EDG front end, whereas DMS's source-to-source engine is independent of any particular language and parameterized by the actual grammar, so it can work with arbitrary language syntax. The DMS parsers, being reengineering parsers, capture macros, preprocessor conditionals, include files, comments, and lexical formatting information in a regenerable way, which none of the other tools accomplish. Most other tools are rather purist, offering only procedural transforms or only rewrites. One atypical aspect of the DMS toolkit is its explicit encouragement to mix source-to-source and procedural transformation and analysis. While this may sound awkward, combining the succinctness of source-to-source transforms with the control and Turing power of procedural transforms is remarkably practical.
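The classic ambiguity mentioned above: some C++ phrases cannot be parsed to a single tree without knowing what the names denote. A minimal illustration (the identifiers are arbitrary):

    a * b;     // if 'a' names a type: declares b as a pointer to a;
               // otherwise: an expression statement multiplying a by b
    x y(z);    // if 'z' names a type: declares a function y taking a z;
               // if 'z' names a variable: declares an object y initialized with z

A GLR parser carries both readings forward as an ambiguous forest; the name and type resolver then discards the impossibly typed subtrees, which is precisely the disambiguation role played by the attribute grammar rules described above.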


We believe the Boeing Migration Tool would not have been practical without the robust DMS C++ front end, the reengineering parsing machinery, and the combined power of the mixed rewrites.

10. Concluding remarks

Above all, this effort illustrates that automatic transformation technology is now up to the task of performing significant re-engineering tasks on high-integrity industrial C++ systems, a goal that has been elusive due to the complexity of the C++ language. A few over-arching observations apply to this and other mass transformation projects, regardless of language platform:

• Mass migrations are best not mingled with changes in business logic, optimization, or other software enhancements. Entangling tasks muddies requirements, induces extra interaction between tool builders and application specialists, and makes evaluation difficult, at the expense of soundness, time, and money. Related tasks may be considered independently, applying new transformation tools where appropriate.

• Automating a transformation task helps deal with changing requirements. Modifying a few rewrite rules, constructive patterns, and organizational code is far easier, and yields a more consistent product, than revising a mass of hand-translated code. Changes implemented in the tool can be manifested in all previously migrated code simply by re-running the modified tool on the original sources. This allows the requirements-definition timeframe to blend into the implementation timeframe, which can significantly shorten the whole project.

• Cleanly factoring a migration task between tool builders and application specialists allows proprietary information to remain within the application owner's organization while forcing tool builders toward optimal generality. Lack of access to proprietary sources, or more generally lack of full visibility into a customer's project, induces transformation engineers to anticipate problems and confront them in advance by building robust tools. Surprises therefore tend to be less overwhelming.

• Automated transformation allows the code base to evolve independently during the migration tool development effort. To get a final product, the tool may be re-run on the most recent source code base at the end of tool development. There is no need for parallel maintenance of both the fielded system and the system being migrated.

• Using a mature infrastructure makes the construction of transformation-based tools not just economically viable but advantageous; building them without such an infrastructure is infeasible. Language front ends and analyzers, transformation engines, and other components are all very significant pieces of software. The BMT contains approximately 1.5 million lines of source code, but most of that is DMS infrastructure; only 13,000 lines are BMT-specific. Furthermore, off-the-shelf components are inadequate to the task. For example, lex and yacc do not produce ASTs suitable for manipulation. Only a common parsing infrastructure can produce AST structures that allow a rewrite engine and code generation infrastructure to function over arbitrary domain languages and combinations of languages.

• Customers can become transformation tool builders. There is a significant learning curve in building transformation-based tools, and a customer seeking a single tool can save money by letting transformation specialists build it. But transformation methods are well suited to a range of software life cycle tasks, and engineers can be trained to build tools themselves and incorporate the technology into their operations with great benefit and cost savings.

11. Future directions

The PRiSm and CORBA component technologies impose computational overhead, as service requests are routed through several new layers of component communication protocol. Essentially, the extra layers exist to provide separation of concerns in design and coding and to provide plug-and-play capability at configuration time. Mechanized static partial evaluation could relieve this overhead: with semantic awareness of the component wiring, a transformation tool could be developed to statically evaluate the various communication indirections, sparing run-time overhead. In this highly performance-sensitive environment, the effort could be well justified.

Semantics-based analysis can also be applied to deeper partial evaluation of the code resulting from the dynamic assembly of components into a flight configuration. For example, code that supports configurations not in play, and conditionals that dynamically test for those configurations, can be eliminated. Indirectly called procedures can be in-lined, avoiding indirection and call overhead (a sketch follows below). Combining automated analysis and code transformation at build time should enhance performance.
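A small sketch of the kind of indirection such a build-time partial evaluator could remove (the interface and class names, including INavData, NavSensor, and Position, are invented for illustration; this is not code from the Boeing system):

    // Before: a request routed through a component facet interface.
    struct INavData {
        virtual Position get_position() const = 0;    // dispatched at run time
    };
    Position poll(const INavData& facet) {
        return facet.get_position();                  // virtual call overhead
    }

    // After partial evaluation of the wiring: once the flight configuration
    // is fixed, the concrete receiver is known, so the call can be
    // devirtualized and in-lined away.
    Position poll(const NavSensor& sensor) {
        return sensor.cached_position();              // direct, statically bound
    }

The transformation is sound only because the component wiring is static once a configuration is chosen, which is exactly the information a semantics-aware build-time tool would have.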

Acknowledgements

We give our thanks to our collaborator in this effort, The Boeing Company, and to the DARPA PCES program for funding. We are also grateful to the reviewers for their invaluable feedback and the consequent improvements to this paper.

References

[1] A. Aho, R. Sethi, J. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley, Reading, MA, 1986.
[2] G. Arango, I. Baxter, C. Pidgeon, P. Freeman, TMM: software maintenance by transformation, IEEE Software 3 (3) (1986) 27–39.
[3] O. Bagge, K. Kalleberg, M. Haveraaen, E. Visser, Design of the CodeBoost transformation system for domain-specific optimization of C++ programs, in: Third IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'03), Amsterdam, IEEE Computer Society Press, September 2003.
[4] R.M. Balzer, N.M. Goldman, D.S. Wile, On the transformational implementation approach to programming, in: Proceedings, 2nd International Conference on Software Engineering, Oct. 1976, pp. 337–344.
[5] R. Balzer, A 15 year perspective on automatic programming, IEEE Transactions on Software Engineering 11 (1985) 1257–1268.
[6] I. Baxter, Transformational Maintenance by Reuse of Design Histories, Ph.D. Thesis, Information and Computer Science Department, University of California at Irvine, Nov. 1990, TR 90-36.
[7] I. Baxter, Design maintenance systems, Communications of the ACM 35 (4) (1992).
[8] I.D. Baxter, C. Pidgeon, M. Mehlich, DMS: program transformations for practical scalable software evolution, in: Proceedings of the 26th International Conference on Software Engineering, 2004.
[9] M. van den Brand, A. van Deursen, J. Heering, H. de Jong, M. de Jonge, T. Kuipers, P. Klint, L. Moonen, P.A. Olivier, J. Scheerder, J.J. Vinju, E. Visser, J. Visser, The ASF+SDF meta-environment: a component-based language development environment, in: R. Wilhelm (Ed.), Compiler Construction 2001 (CC'01), Springer-Verlag, 2001, pp. 365–370.
[10] M. van den Brand, S. Klusener, L. Moonen, J. Vinju, Generalized parsing and term rewriting – semantics directed disambiguation, in: Third Workshop on Language Description Tools and Applications, Electronic Notes in Theoretical Computer Science, 2003.
[11] S. Burson, G.B. Kotik, L.Z. Markosian, A program transformation approach to automating software reengineering, in: Proceedings of the 14th Annual International Computer Software and Applications Conference (COMPSAC 90), IEEE Publishers, 1990.
[12] J.R. Cordy, TXL – a language for programming language tools and applications, in: Proc. LDTA 2004, ACM 4th International Workshop on Language Descriptions, Tools and Applications, Barcelona, Spain, April 2004, pp. 1–27; Electronic Notes in Theoretical Computer Science 110 (December 2004) 3–31.
[13] M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley, Reading, MA, 1999.
[14] V. Gidding, B. Beckwith, Real-time CORBA tutorial, in: OMG's Workshop on Distributed Object Computing for Real-Time and Embedded Systems, 2003. www.omg.org/news/meetings/workshops/rt_embedded2003.
[15] M. van den Brand, A. van Deursen, J. Heering, H. de Jong, M. de Jonge, T. Kuipers, P. Klint, L. Moonen, P.A. Olivier, J. Scheerder, J.J. Vinju, E. Visser, J. Visser, The ASF+SDF meta-environment: a component-based language development environment, in: R. Wilhelm (Ed.), Compiler Construction 2001 (CC'01), Springer-Verlag, 2001, pp. 365–370.
[16] W.L. Johnson, M.S. Feather, Using evolution transforms to construct specifications, in: M. Lowry, R. McCartney (Eds.), Automating Software Design, AAAI Press, 1991.
[17] E. Kant, F. Daube, E. MacGregor, J. Wald, Scientific programming by automated synthesis, in: M.R. Lowry, R.D. McCartney (Eds.), Automating Software Design, MIT Press, 1991.
[18] P. Klint, A meta environment for generating programming environments, ACM Transactions on Software Engineering and Methodology 2 (2) (1993) 176–201.
[19] S. Klusener, R. Laemmel, C. Verhoef, Architectural modifications to deployed software, Science of Computer Programming 54 (2005).
[20] G. Koutsoukos, L-CARE: a legacy reengineering environment, ATX Software S.A., Portugal. Available from www.atxsoftware.com and www.choose.s-i.ch/Events/forum2005/LCARE_CHOOSE05.pdf.
[21] D. Lohmann, G. Blaschke, O. Spinczyk, Generic advice: on the combination of AOP with generative programming in AspectC++, in: Proceedings of GPCE'04, October 24–28, 2004, Vancouver, Canada.
[22] L. Markosian et al., Using an enabling technology to reengineer legacy systems, Communications of the ACM 37 (5) (1994) 58–70.
[23] J. Neighbors, Draco: a method for engineering reusable software systems, in: T. Biggerstaff, A. Perlis (Eds.), Software Reusability, ACM Press, 1989.
[24] P. Newcomb, R. Doblar, Automated transformation of legacy systems, CrossTalk, December 2001.
[25] W.F. Opdyke, Refactoring Object-Oriented Frameworks, Ph.D. Thesis, University of Illinois at Urbana-Champaign. Also available as Technical Report UIUCDCS-R-92-1759, Department of Computer Science, University of Illinois at Urbana-Champaign.
[26] Object Management Group, MDA: The Architecture of Choice for a Changing World. http://www.omg.org/mda/.
[27] Object Management Group, Architecture-Driven Modernization (ADM). http://adm.omg.org/.
[28] OpenC++ website. opencxx.sourceforge.net.
[29] PARLANSE Reference Manual, Semantic Designs, 1998.
[30] Reasoning Systems, Palo Alto, CA, Refine Language Tools, 1993.
[31] J. Rekers, Parser Generation for Interactive Environments, Ph.D. Dissertation, University of Amsterdam, 1992.
[32] D. Roberts, J. Brant, R. Johnson, W. Opdyke, An automated refactoring tool, in: Proceedings of ICAST '96: 12th International Conference on Advanced Science and Technology, Chicago, Illinois, April 1996.
[33] M. Schordan, D. Quinlan, Specifying transformation sequences as computation on program fragments with an abstract attribute grammar, in: Fifth International Workshop on Source Code Analysis and Manipulation (SCAM'05), Amsterdam, IEEE Computer Society Press, 2005, pp. 97–106.
[34] Semantic Designs, Inc. www.semanticdesigns.com.
[35] D.C. Sharp, Reducing avionics software cost through component based product line development, in: Proceedings of the 1998 Software Technology Conference.
[36] M. Tomita, Efficient Parsing for Natural Languages, Kluwer Academic Publishers, 1985.
[37] L. Tokuda, D. Batory, Evolving object oriented designs with refactoring, in: Proceedings of the Conference on Automated Software Engineering, IEEE, 1999.
[38] T. Wagner, S. Graham, Incremental analysis of real programming languages, in: Proceedings of the ACM SIGPLAN '97 Conference on Programming Language Design and Implementation (PLDI), SIGPLAN Notices 32 (5), May 1997, pp. 31–43.
[39] C. Vasseur, Semantics driven disambiguation: a comparison of different approaches, 2004. www.lrde.epita.fr/dload/20041201-Seminar/vasseur1201_disambiguation_report.pdf.
[40] E. Visser, Program transformation with Stratego/XT: rules, strategies, tools, and systems in StrategoXT-0.9, in: C. Lengauer (Ed.), Domain-Specific Program Generation, Lecture Notes in Computer Science, vol. 3016, Springer-Verlag, 2004, pp. 216–238.
[41] Elsa: a GLR-based C++ parser. www.cs.berkeley.edu/~smcpeak/elkhound.
[42] IDEA refactoring tool for Java. www.intellij.com.
[43] SciFinance pricing model development environment. www.scicomp.com.
[44] jFactor refactoring tool for Java. www.instantiations.com.
[45] AspectC++ website. www.aspectc.org.
[46] Puma C++ transformation infrastructure. www-ivs.cs.uni-magdeburg.de/~puma.
[47] SUIF website. suif.stanford.edu.
[48] TXL transformation tool website. www.txl.ca.
[49] Xrefactory C++ refactoring tool. www.xref-tech.com/xrefactory/main.htm.