Increasing clone maintenance support by unifying clone detection and refactoring activities



Information and Software Technology 54 (2012) 1297–1307


Information and Software Technology journal homepage: www.elsevier.com/locate/infsof

Robert Tairas a,⇑, Jeff Gray b

a AtlanMod, INRIA & École des Mines de Nantes, Nantes, France
b University of Alabama, Department of Computer Science, Tuscaloosa, AL, United States

Article info

Article history:
Received 16 August 2011
Received in revised form 21 June 2012
Accepted 23 June 2012
Available online 6 July 2012

Keywords: Maintenance; Code clones; Refactoring

Abstract

Context: Clone detection tools provide an automated mechanism to discover clones in source code. On the other hand, refactoring capabilities within integrated development environments provide the necessary functionality to assist programmers in refactoring. However, we have observed a gap between the processes of clone detection and refactoring.

Objective: In this paper, we describe our work on unifying the code clone maintenance process by bridging the gap between clone detection and refactoring.

Method: Through an Eclipse plug-in called CeDAR (Clone Detection, Analysis, and Refactoring), we forward clone detection results to the refactoring engine in Eclipse. In this case, the refactoring engine is supplied with information about the detected clones, from which it can then determine those clones that can be refactored. We describe the extensions to Eclipse’s refactoring engine to allow clones with additional similarity properties to be refactored.

Results: Our evaluation of open source artifacts shows that this process yields considerable increases in the instances of clone groups that may be suggested to the programmer for refactoring within Eclipse.

Conclusion: By unifying the processes of clone detection and refactoring, in addition to providing extensions to the refactoring engine of an IDE, the strengths of both processes (i.e., more significant detection capabilities and an established framework for refactoring) can be garnered.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Code clones can be defined as two or more sections of code that are duplicates of each other based on some type of similarity measurement. Automated code clone detection techniques and tools utilize different similarity measurements to find clones in code by evaluating different representations of the code (i.e., string [15], token [28], parse tree [26], and program dependence graph [33]). After clones have been detected, the maintenance of certain clones can be performed. Clone maintenance relates to the activity of updating the sections of code that are duplicates of each other. One specific activity is removing the duplication associated with the clones by modularizing the code. This results in a single copy of the originally duplicated sections of code, which may simplify future maintenance in one location and limit the possibility of bug propagation.

The activity of refactoring [20], which changes the structure of the code but keeps the behavior of the code the same, can be used to eliminate the duplication associated with clones in a clone group (i.e., clones that represent the same duplicated code). Integrated development environments (IDEs) such as Eclipse [18] provide a refactoring framework with the necessary functionality to assist programmers in the refactoring task. However, support for finding clones to refactor is still limited in such IDEs. Clone detection-related tools [25,32] can assist in finding potential clones for refactoring. However, the actual activity of refactoring is often delegated to the programmer, who must manually forward information about the clones to a refactoring engine. We have observed that a gap exists between clone detection and clone refactoring. In this paper, we introduce our approach to unify the processes of clone detection, analysis, and refactoring, which is realized in CeDAR (Clone Detection, Analysis, and Refactoring) [43], a plug-in for the Eclipse IDE.

The rest of the paper is organized as follows: the next section further describes the process of clone maintenance as it relates to clone elimination and current state-of-the-art tool support. Certain limitations are noted that provide the motivation for our contributions to the process. Section 3 details the capabilities of CeDAR with regard to clone refactoring and Section 4 describes specifically the extensions to the Eclipse refactoring engine. Section 5 evaluates the instances of clone refactoring made available by CeDAR. Section 6 describes threats to validity on the evaluation of CeDAR. Section 7 summarizes related work and Section 8 concludes the paper and offers future work considerations.

⇑ Corresponding author. E-mail addresses: [email protected] (R. Tairas), [email protected] (J. Gray).
0950-5849/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.infsof.2012.06.011

Fig. 1. Clone maintenance process support. [Figure: three phases and their supporting mechanisms: Clone Detection (automated clone detection tool), Clone Analysis (analyze clones for refactoring opportunities), and Clone Refactoring (IDE refactoring engine).]

2. Clone maintenance process through clone elimination

In general, the process of removing the duplication associated with clones can be considered in three phases: detection, analysis, and refactoring, as seen in Fig. 1. The figure also lists current state-of-the-art tools that can be utilized in each phase, which can assist a programmer in that specific phase. The following paragraphs provide further details concerning these supporting mechanisms and the connections between the phases that are either automatic or manual. Limitations with current capabilities are summarized, which motivate the work described in this paper.

Clone detection – Consider a scenario where a programmer attempts to eliminate code clones through refactoring. The programmer can utilize an automated clone detection tool to look for clones in the source code that he or she maintains. Such tools utilize various searching techniques to determine sections of code that are duplicates and report these clones to the user. These tools include CCFinder [28], CloneDR [5], Deckard [26], and SimScan [41], among others.¹ Each of these tools provides varying types of reports to the programmer. Most provide a textual file containing details about the detected clones (e.g., Deckard and Simian [40]), while others incorporate their results within an IDE (e.g., SimScan). Some provide a graphical representation of the clones (e.g., CCFinder provides a scatterplot). However, information about clones from these tools, such as the location of the clones, is currently not automatically forwarded to the clone refactoring step.

Clone analysis – The reports from clone detection tools can be used by the programmer to analyze and determine clone candidates for refactoring with the purpose of removing their duplication. However, depending on the size of the software and level of cloning in the source code, the amount of cloning data returned by a clone detection tool can be large. For example, 8% of the 2418 K lines of code in the JDK 1.4.2 were found to be duplicated code [26]. Manually finding opportunities for refactoring among the clones represented by such a large number of lines of code can be a daunting task. Several automated techniques can assist the maintainers of code in deciding potential refactoring opportunities for a large number of clones. Such techniques evaluate properties, such as the location of clones in the class hierarchy of a software system (SUPREMO) [32] and offer metrics related to the number of variable references and assignments (ARIES) [25]. A limitation found in these techniques is that they only provide clones that have potential to be refactored, which may include clones that cannot be refactored. In addition, the actual refactoring of the clones is delegated to the maintainer of the code. That is, after suggesting specific clones for refactoring, these techniques pass the responsibility of refactoring to the maintainer, who will need to perform the manual refactoring task based on the informed analysis.

¹ For a listing of code clone tools and a collection of code clone literature, please visit [9].

Clone refactoring – After deciding which clones should be refactored, the programmer must actually perform the activity of refactoring. IDEs such as Eclipse provide refactoring support through their refactoring engines. This support allows programmers to delegate the code structure changes to the refactoring engine, which reduces errors that may occur if the code were changed manually. However, current tools do not accommodate the clone refactoring step that provides the clone information as input to the refactoring engine. After obtaining information from the clone detection tools in the detection phase, or even information about clones that potentially could be refactored from the analysis phase, this information must still be passed to the refactoring engine. This is due to the input mechanism of the refactoring engine, which requires the programmer to select the individual code section that needs to be refactored within the source code editor.

The limitations described in the detection, analysis, and refactoring phases provide the motivation for our work. There is a need to introduce a mechanism that unifies the phases in such a way that information from a preceding phase is utilized in the subsequent phase. Thus, our goal is to streamline the clone maintenance process as it relates to clone elimination. The contributions of the work described in this paper, which are realized through our CeDAR prototype Eclipse plug-in for Java programs, are given below:

• Incorporating clone detection tool results in the clone maintenance process by communicating these results to the refactoring engine of an IDE to allow more steps to be automated in the clone maintenance process.
• Utilizing the checking of pre-conditions for clone refactoring as extended in the refactoring engine to filter clone groups that can be refactored from clone groups that have potential to be refactored using Extract Method.
• A preliminary effort in extending the refactoring capabilities of an IDE to allow more types of clones to be refactored, including clones in multiple classes.

3. Combining clone detection, analysis, and refactoring

In this section, we provide details of our contributions to the process of clone maintenance as it relates to clone refactoring to eliminate duplication. The phases of detection and analysis can be considered as the preprocessing steps before the refactoring of the clones. That is, the ultimate goal is the refactoring of the clones, but initial steps that need to take place before this include the detection and analysis of the clones. This process is realized through the CeDAR prototype plug-in for Eclipse. A demonstration of CeDAR is available at http://www.cis.uab.edu/softcom/cedar/demo/ (refactoring demonstration starts at 2m05s).

3.1. Clone detection tool results as input

Clone detection tools present their results in various ways. For the majority of the freely available clone detection tools listed in Table 1, a textual output is available in which clones that represent the same duplicated code are grouped together (i.e., in clone groups). The layout of the results file differs, but for the majority of the tools two properties are always present: (1) location information of each clone, including information about their groupings, and (2) a standard delimiter that separates each clone and clone group instance (e.g., end of line, semicolon). With these two properties, CeDAR can parse the results file from a tool and obtain the necessary information regarding the clones detected by the tool. The execution of the clone detection tool is considered a separate task. The information obtained includes the file location where a clone exists, the starting and ending lines of the code associated with a clone, and the group in which the clone is a member.

Table 1. Clone detection tool results availability.

Parseable text file results: ConQAT [10], CloneDR [5], CPD [11], Deckard [26], Duplo [17], JCD [13], JCCD [2], NiCad [12], Scorpio [24], Simian [40], SimScan [41]
Text files requiring additional processing: CCFinder [28], CloneDigger [7]
No text file output: DMF [16], SDD [35]

Table 1 also identifies whether additional processing of the output is needed in order to obtain the reported clones. For example, CCFinder requires additional processing, because the reported clones are identified with their token ranges rather than line ranges. From the tools listed in Table 1, CeDAR can currently parse CCFinder, CloneDR, Deckard, Simian, and SimScan output results, which mostly represent token-based and tree-based clone detection tools. Semantic-based tools (i.e., those evaluating program dependence graphs) are currently not supported. The results are utilized within Eclipse for further analysis and maintenance purposes. In a separate work, we have investigated standardizing the results of clone detection tools using model-driven engineering (MDE) techniques [42]. Other related standardization efforts include Rich Clone Format (RCF), which aims to standardize the exchange of clone data [23]. The capability to obtain clone information from different detection tools allows for the use of the results in subsequent clone maintenance activities.

With the availability of more than one clone detection tool, a question that may arise is which tool provides the best results.
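As an illustration of the parsing step described in this subsection (recovering the file location, line range, and group membership of each clone from delimited text), the following is a minimal sketch. The record layout "groupId;file;startLine;endLine" and the names CloneRecord and CloneResultParser are assumptions made for this example; they are not the output format of any particular detection tool, nor CeDAR's actual parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One reported clone: where it lives and which clone group it belongs to. */
final class CloneRecord {
    final int groupId;
    final String file;
    final int startLine;
    final int endLine;

    CloneRecord(int groupId, String file, int startLine, int endLine) {
        this.groupId = groupId;
        this.file = file;
        this.startLine = startLine;
        this.endLine = endLine;
    }
}

final class CloneResultParser {
    /**
     * Parses records of the hypothetical form "groupId;file;startLine;endLine",
     * one per line, and returns the clones keyed by clone group.
     */
    static Map<Integer, List<CloneRecord>> parse(List<String> lines) {
        Map<Integer, List<CloneRecord>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines carry no clone information
            }
            String[] fields = trimmed.split(";");
            if (fields.length != 4) {
                throw new IllegalArgumentException("Malformed clone record: " + line);
            }
            CloneRecord clone = new CloneRecord(
                    Integer.parseInt(fields[0].trim()),
                    fields[1].trim(),
                    Integer.parseInt(fields[2].trim()),
                    Integer.parseInt(fields[3].trim()));
            groups.computeIfAbsent(clone.groupId, k -> new ArrayList<>()).add(clone);
        }
        return groups;
    }
}
```

A tool whose output reports token ranges instead of line ranges (as noted for CCFinder) would need an extra conversion step before records of this shape could be built.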
The best tool can be interpreted as the one providing the most complete result set or the one reporting the fewest false positives or negatives. Comparisons of clone detection tools have been performed (e.g., [6,39]), but a consensus on the best tool has not been reached. When considering clone refactoring, appropriate clone regions for refactoring should consist of well-formed blocks of code or syntactically meaningful blocks of code. Tree-based detection tools such as CloneDR, Deckard, and SimScan can provide such results because the results are represented by a sub-tree within the parse tree or abstract syntax tree (AST) of the code. As a tree-based clone detection tool, Deckard is used as the tool of choice for the evaluation in Section 5. Text-based or even token-based detection tools can sometimes include clone regions that straddle block or statement boundaries. However, some of these tools try to alleviate this problem. For example, the detection settings for CCFinder include a shaper level setting, which when set to “hard” will only allow clone candidates whose token sequences are enclosed by a block.

3.2. Analysis for refactoring opportunities

Tools such as ARIES [25] and SUPREMO [32] provide a categorization of clone groups to filter the clone groups that have potential to be refactored. ARIES uses several metrics that can be adjusted manually by the user to identify clones that potentially can be refactored using Extract Method or Pull Up Method. For example, Extract Method opportunities are based on the Dispersion in Class Hierarchy (DCH) metric being zero (i.e., the clones are in the same class) and the Number of Assigned Variables (NAV) metric being one or less. SUPREMO observes the location of the clones within classes in object-oriented code to determine the type of refactoring that can be performed on the clones. For example, refactorings proposed for clones contained in sibling classes (i.e., classes with the same parent class) include Extract Method, Pull Up Method, Form Template Method, and Extract Superclass.

It should be noted that the filtered clone groups identified by the aforementioned tools may include groups that cannot be refactored. That is, a clone group that is identified as being a candidate for Extract Method by these tools may in fact not be refactorable. Hence, the determination of the groups that actually can be refactored from the listing of potentially refactorable groups must be done manually. For example, the results of ARIES were evaluated manually to determine which groups actually can be refactored [25].

CeDAR can determine whether a clone group that has potential to be refactored actually can be refactored. This is done by utilizing the feature of the Eclipse refactoring engine that analyzes a section of code to determine whether it meets the pre-conditions for a specific refactoring activity. With our incorporation of extensions for clone refactoring, pre-conditions can be determined for a clone group to identify whether it can be refactored. This reduces the number of clone groups that a programmer needs to evaluate and confirm for refactoring. In Section 5, such an activity is performed, albeit with a different purpose of determining how many more refactorings can be supported with the extensions available in CeDAR. For example, when evaluating Extract Method, each clone group with clones in the same file was run through the pre-condition checking step to determine whether the group actually can be refactored.

3.3. Clone refactoring in the Eclipse IDE

The Eclipse IDE refactoring engine provides support for clone refactoring through Extract Method refactoring for clones with parameterized local variables. Fig. 2a illustrates the process of such refactoring within Eclipse. The user highlights a section of code to be refactored in the source editor and activates the Extract Method process. The AST of the selected code is determined and, using sub-tree matching on the AST of the class where the code is located, matching sub-trees are identified. Differences among the matches are limited to local variable names (i.e., different values of the SimpleName AST node). In addition, in this scenario the programmer only learns of the existence of clones after confirming and invoking the refactoring of one section of code. That is, the programmer does not know before selecting the section of code whether the code has duplicates in another location.

Fig. 2b introduces changes to the process that incorporate the results from a clone detection tool. The results (i.e., detected clones in clone groups) are displayed and accessible within the IDE. The clone information display in CeDAR includes Eclipse views that display clone groups and the location of the clones, and that link directly to the source code associated with the clones. In effect, the programmer has prior knowledge of the clones before initiating any refactoring. In addition, parameterized differences are represented in a central manner on one of the clones of a clone group and differences among associated clones are displayed on top of that clone [45]. It should be noted that, in the end, it is the decision of the programmer to refactor any group of clones. Confirmation before performing a refactoring is included in the process in Fig. 2b, similar to the confirmation of Fig. 2a.

Fig. 3 provides a screenshot of CeDAR. The process of clone refactoring starts with a clone group being selected by the programmer for refactoring.
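As a concrete picture of the kind of clone pair that the existing Extract Method refactoring in Eclipse is described as handling (clones in the same class that differ only in local variable names), consider the hedged sketch below. All class, method, and variable names are invented for illustration, and the "after" version is written by hand to show the expected shape of the result rather than captured from the tool.

```java
/** Two methods containing clones that differ only in one local variable name. */
final class LocalsBefore {
    static String describeOrder(int quantity, int unitPrice) {
        int orderTotal = quantity * unitPrice;
        // Clone 1: uses the local variable orderTotal.
        int fee = orderTotal / 10;
        if (fee < 5) {
            fee = 5;
        }
        return "order fee " + fee;
    }

    static String describeRefund(int items, int itemPrice) {
        int refundTotal = items * itemPrice;
        // Clone 2: same statements, but the local variable is refundTotal.
        int fee = refundTotal / 10;
        if (fee < 5) {
            fee = 5;
        }
        return "refund fee " + fee;
    }
}

/** After Extract Method: the differing local variable becomes a parameter. */
final class LocalsAfter {
    static String describeOrder(int quantity, int unitPrice) {
        int orderTotal = quantity * unitPrice;
        return "order fee " + computeFee(orderTotal);
    }

    static String describeRefund(int items, int itemPrice) {
        int refundTotal = items * itemPrice;
        return "refund fee " + computeFee(refundTotal);
    }

    // Single remaining copy of the formerly duplicated statements.
    private static int computeFee(int total) {
        int fee = total / 10;
        if (fee < 5) {
            fee = 5;
        }
        return fee;
    }
}
```

Both versions behave identically for every input; only the structure changes, which is the defining property of a refactoring.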
When a clone group is selected, at least one of the clones in the group is set as the default clone. In Fig. 3, Clone 1 is the default clone and is denoted with a dark circle marker (i.e., in blue) in the “Clone Group” view in the top-right. The section of code associated with Clone 1 is also highlighted (i.e., in blue) in the source editor and bordered by horizontal lines. The remaining clones of the group are marked with a light circle marker (i.e., in grey in the top-right) and are highlighted in a different color (i.e., also in grey) in the source editor (not viewable in the figure). One of the purposes of specifying a default clone is to select the clone with its corresponding code that will be kept as the single remaining copy of the duplicated code being refactored. For example, the code associated with the default clone is used in the newly extracted method containing the duplicated code. This process replaces the task of selecting a snippet of code when activating a refactoring in the current mechanism in Eclipse.

Fig. 2. Clone maintenance processes. [Figure: panel (a), the Eclipse process, shows Original Source Code, Selected Code, Sub-tree Matching Process, Detected Clones of Selected Code, Clone Refactoring, and Refactored Source Code; panel (b), the CeDAR process within the CeDAR plug-in for Eclipse, shows Original Source Code, Clone Detection Tool, the Clone Information Display with clone groups (Clone 1, Clone 2, Clone 3), Selected Clone Group, Clone Refactoring, and Refactored Source Code.]

Fig. 3. Screenshot of CeDAR showing information on a selected clone group.

4. Clone refactoring extensions

In this section, we provide details of enabling further clone refactoring capabilities within Eclipse both in a single class and in multiple classes. This is done by updating the various steps in the Eclipse refactoring framework to include consideration of new parameterized differences in the clones.

4.1. Refactoring clones in a single class

Code clones can be categorized into different types based on their similarity properties [6]. Among these types are Type I and Type II clones. Type I clones are clones that exactly match each other. Type II clones provide for variations in the variable, type, or function identifiers between the clones. Clones with parameterized differences can be considered Type II clones where there exists a one-to-one correspondence among the identifier differences of the clones [1]. The Eclipse refactoring engine accepts parameterized local variable names among detected clones and can refactor these clones with Extract Method. However, Type II clones can consist of other parameterized differences. In the following paragraphs, we describe our work of incorporating additional parameterized differences for Extract Method refactoring related to clones. Our current focus is on Type I and Type II clones.

Parameterized non-local variables – We consider incorporating other variable accesses in addition to the local variables currently allowed by the Eclipse refactoring engine. These include incorporating fields in the same class of the clones and from external classes.
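To make the notion of a parameterized field concrete, the sketch below shows two clones in the same class that read different fields, together with a hand-written refactored version in which the field is passed to the extracted method as an argument. The names (FieldsBefore, FieldsAfter, field1, field2, scaled) are assumptions chosen for the example, and the result is illustrative rather than actual CeDAR output.

```java
/** Two clones in one class that differ only in the field they read. */
final class FieldsBefore {
    private final int field1 = 3;
    private final int field2 = 7;

    int firstReport() {
        // Clone 1 reads field1.
        int doubled = field1 * 2;
        return doubled + 1;
    }

    int secondReport() {
        // Clone 2 reads field2; the statements are otherwise identical.
        int doubled = field2 * 2;
        return doubled + 1;
    }
}

/** After refactoring: the differing field is passed as an argument. */
final class FieldsAfter {
    private final int field1 = 3;
    private final int field2 = 7;

    int firstReport() {
        return scaled(field1);
    }

    int secondReport() {
        return scaled(field2);
    }

    // A field would normally be visible here without being passed, but the
    // clones disagree on which field to use, so it must become a parameter.
    private int scaled(int value) {
        int doubled = value * 2;
        return doubled + 1;
    }
}
```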
Fields in the same class of the clones are identified by extending the local variable identification process currently available in the Eclipse refactoring engine. For fields from external classes, an additional process is used to compare QualifiedName nodes between the clones.

Parameterized method calls – Modularizing code clones with parameterized method calls (i.e., MethodInvocation) involves requirements similar to those of parameterized variables. For example, the parameterized method calls within the clones should return the same type. This will allow the calls to be passed to the newly extracted method with the same argument type in the newly extracted method’s signature. However, additional requirements that are more specific to parameterized methods must be considered. These include instances where methods do not return a value (i.e., void return type). Such calls cannot be passed as arguments in the call to the newly extracted method. A possible solution for this is to pass a flag that determines which part of an If-Statement is executed, where the If-Statement contains separate calls for each parameterized method. Such a situation can be considered “Control Coupling” [34], as one method is determining the execution of another method. However, such instances reduce the independence of the methods and introduce coupling between the methods, which can be less desirable in practice.

Based on refactoring pre-conditions, not all clone groups with parameterized methods that are reported by a clone detection tool can be refactored. Fig. 4 details the filtering done among the candidate clones with parameterized methods. It is worth noting that the restriction to methods with no arguments was made to limit the scope of the investigation of the technique for clone refactoring. It should also be noted that currently the evaluation of parameterized methods does not analyze whether a method call is side-effect free or not. Hence, the programmer will need to confirm this property before considering the refactoring.

Fig. 5 outlines the class responsible for performing Extract Method refactoring (i.e., ExtractMethodRefactoring).
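The filtering of parameterized method calls described above and summarized in Fig. 4 (same return type, non-void return type, no arguments) can be sketched as a simple predicate. CallInfo and ParameterizedCallFilter are hypothetical names introduced for this example, return types are modeled as plain strings rather than AST bindings, and the side-effect check that the text leaves to the programmer is deliberately not modeled.

```java
import java.util.List;

/** Hypothetical summary of one method call found inside a clone. */
final class CallInfo {
    final String methodName;
    final String returnType;   // "void" when the method returns nothing
    final int argumentCount;

    CallInfo(String methodName, String returnType, int argumentCount) {
        this.methodName = methodName;
        this.returnType = returnType;
        this.argumentCount = argumentCount;
    }
}

final class ParameterizedCallFilter {
    /**
     * correspondingCalls holds the call found at the same position in each
     * clone of a group. The group passes only if every call has the same
     * non-void return type and takes no arguments.
     */
    static boolean isRefactorable(List<CallInfo> correspondingCalls) {
        if (correspondingCalls.isEmpty()) {
            return false;
        }
        String expectedType = correspondingCalls.get(0).returnType;
        for (CallInfo call : correspondingCalls) {
            if (!call.returnType.equals(expectedType)) {
                return false; // return types differ between clones
            }
            if (call.returnType.equals("void")) {
                return false; // nothing to pass as an argument value
            }
            if (call.argumentCount != 0) {
                return false; // calls with arguments are out of scope
            }
        }
        return true;
    }
}
```

Whether each surviving call is free of side effects still has to be confirmed separately, as noted in the text.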
Eclipse’s refactoring framework consists of several steps that perform tasks such as initially checking the section of code being prepared for refactoring to determine whether pre-conditions are met and actually changing the code associated with the refactoring. This allows the framework to provide features such as user input, preview of the refactoring, and an undo mechanism if an executed refactoring must be reversed. In Fig. 5, method checkInitialConditions performs tasks such as finding sections of code that are duplicates of the initially selected code, in addition to checking the pre-conditions for refactoring. These duplicates can consist of parameterized local variables. Method createChange performs the refactoring, which includes the generation of the newly extracted method and replacement of all identified duplicates with a call to the newly extracted method.

Fig. 4. Filtering of clones with parameterized methods. [Figure: a filtering chain from Clone Group through All parameterized method calls, Method calls with the same return type, Method calls with non-void return types, and Method calls with no arguments, to Refactorable Clone Group.]

Fig. 5. Outline of ExtractMethodRefactoring class in Eclipse.

As can be seen in Fig. 5, a framework to eliminate duplicate sections of code through Extract Method refactoring is available in Eclipse. However, the refactoring of clones to include additional parameterized differences requires extensions to the current process. These extensions within the associated tasks in Fig. 5 correspond to changes that are described in the following paragraphs.

Identify duplicates – Section 3.1 and Fig. 2b describe the incorporation of an external clone detection tool as the input to identify the duplicated sections of code. In this case, a clone group reported from the tool and observed by the programmer can be selected for refactoring. This process also includes finding the parameterized elements of the clones, which in addition to local variables will also include non-local variables and method calls. Both local variables and fields in the same class are identified by comparisons of SimpleName nodes. Because initially only local variables are included, the comparison of SimpleName rejects those nodes that are associated with fields. This rule is removed and additional tasks are included in subsequent steps (i.e., in “Analyze code to be extracted”) to incorporate fields. New comparisons also are included for MethodInvocation and QualifiedName nodes to find parameterized method calls and fields from external classes. The filtering of method calls as illustrated in Fig. 4 is done at this point. All parameterized elements (i.e., SimpleName, MethodInvocation, and QualifiedName) are associated among the clones in the group by mapping the elements in the default clone with the elements of the other clones. Fig. 6 illustrates this mapping where, for example, the variable bool1 in the default clone (i.e., Clone 1) is mapped with the variables bool2 in Clone 2 and bool3 in Clone 3. We extended the mechanism that originally mapped only local variables to also map method calls and fields from external classes. The example in Fig. 6 is not a real clone group per se, but is used to illustrate the mappings of the supported parameterized elements in CeDAR.

Analyze code to be extracted – This task, among other things, determines the arguments that will need to be passed to the newly extracted method. Initially, such arguments only include local variables declared outside the scope of the selected snippet of code, but still within the block of the method body. Field variables were not included, because they are globally accessible in the class and thus do not need to be passed. However, clones may use different fields in their corresponding code. For example, in Fig. 7, two sections of code contain cloned statements that include references to field1 and field2, respectively. This leads to the existence of parameterized fields within the clones. In order to refactor the clones, the fields must be passed to the extracted method (as seen in the right side of Fig. 7). To incorporate parameterized fields, Algorithm 1 extends the argument list to include any fields that were in the list of parameterized variables as identified in previous steps. Each parameterized variable is evaluated to determine if it is already included in the original list of arguments (i.e., ContainsVariable). In addition, IsSingleField checks whether the variable is a field and if the field binding differs between at least two of the clones in the group. This latter task is done by observing the mappings of elements, as seen in Fig. 6. If the variable is not already in the original list of arguments and is a field that represents more than one variable binding, then the variable is included in the list of arguments. The output of the function in Algorithm 1 is a new list of arguments that takes into account any parameterized fields.

Initialize parameter information – Information related to identified parameters for the method to be extracted is stored for usage during the actual refactoring of the code. The identified parameters relate to the local variables and newly included fields obtained from the previous step. These parameters are supplemented with additional parameters representing fields from external classes and method calls that are identified in the first step. With the identification of parameterized variables and method calls, a scenario may arise where both elements are passed separately. For example, in Fig. 8, the variable p and the call p.call() potentially could be passed separately in the call to the new extracted method (middle column of Fig. 8). To avoid this situation, the list of parameters must be evaluated to remove unnecessary passing of duplicate variables (right column of Fig. 8). Algorithm 2 outlines the steps to remove the duplicate variables in the parameters list. The first step is to obtain all instance variables that are associated with a method call within the code range of the default clone (line 2). For each of these variables, we determine if their corresponding method call parameters can be removed from the parameters list (lines 4–20). In line 5, the method calls associated with the instance variable are retrieved. IsUniformCall looks at all of the corresponding method calls in the other clones to determine if the calls are the same. If all calls are the same, then these method calls can be removed from the parameters list as the associated variable can be used instead. This is done by including each method call (line 11) in the original parameters list in the new parameters list (i.e., NPL) (line 14), if the method call is not associated with the variable V (line 13). All other parameters that are not method calls are added automatically in the new parameters list (line 17), which includes the variable V.
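The two parameter-list adjustments described above (adding parameterized fields to the argument list, then dropping method-call parameters that are made redundant by their receiver variable) can be sketched as follows. This is a simplified reading of the prose descriptions of Algorithms 1 and 2, not a transcription of them: the Element model, the use of plain strings in place of AST bindings, and the method names addParameterizedFields and removeRedundantCalls are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Hypothetical model of one element mapped across the clones of a group. */
final class Element {
    enum Kind { LOCAL, FIELD, CALL }

    final Kind kind;
    final String name;           // name in the default clone (receiver for a CALL)
    final List<String> perClone; // binding, or called method name, seen in each clone

    Element(Kind kind, String name, String... perClone) {
        this.kind = kind;
        this.name = name;
        this.perClone = Arrays.asList(perClone);
    }

    /** True when the clones do not all agree on this element. */
    boolean differsAcrossClones() {
        return new HashSet<>(perClone).size() > 1;
    }
}

final class ParameterLists {
    /** In the spirit of Algorithm 1: add fields whose binding differs between clones. */
    static List<Element> addParameterizedFields(List<Element> arguments,
                                                List<Element> parameterized) {
        List<Element> result = new ArrayList<>(arguments);
        for (Element e : parameterized) {
            boolean alreadyPassed = containsVariable(result, e.name);
            if (!alreadyPassed && e.kind == Element.Kind.FIELD && e.differsAcrossClones()) {
                result.add(e);
            }
        }
        return result;
    }

    /**
     * In the spirit of Algorithm 2: drop a method-call parameter when the same
     * method is called in every clone and its receiver variable is passed anyway.
     */
    static List<Element> removeRedundantCalls(List<Element> parameters) {
        List<Element> result = new ArrayList<>();
        for (Element e : parameters) {
            boolean uniformCall = e.kind == Element.Kind.CALL && !e.differsAcrossClones();
            if (uniformCall && containsVariable(parameters, e.name)) {
                continue; // the receiver variable can be used instead
            }
            result.add(e);
        }
        return result;
    }

    private static boolean containsVariable(List<Element> list, String name) {
        for (Element e : list) {
            if (e.kind != Element.Kind.CALL && e.name.equals(name)) {
                return true;
            }
        }
        return false;
    }
}
```

In the situation described for Fig. 8, passing both p and p.call() would correspond to a LOCAL element named p and a CALL element with receiver p; the second method keeps only the variable.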


Create new method/Replace duplicates – The parameter information from the previous steps is used to create the newly extracted method and replace the clones with a call to the extracted method. In both tasks, the additional parameterized differences were incorporated in the process. For example, within the newly extracted method body, the method calls and fields are replaced with a new name as determined by the programmer during the refactoring confirmation process. Similarly, these same elements are included in the new method calls that replace each clone instance.

4.2. Refactoring of clones in multiple classes

Clones that reside in multiple classes require a different refactoring approach. In this subsection, we describe the features of CeDAR that can be used to assist in the refactoring of clones across multiple classes. The descriptions are grouped into two types of refactoring approaches: Pull-up Method and Extract to Utility-type Class. In both cases, the results from clone detection tools provide information about the clones, which will include clones in multiple classes.

Pull-up Method – Method-level clones in classes that extend the same super class could be pulled up in order to remove the duplication. The Eclipse refactoring engine can identify methods in sibling classes that have the same signature if a method was selected for Pull-up Method refactoring. However, if the clones in the group represent code below the method level (i.e., fragments inside a method body), then Extract Method refactoring must be performed first before the newly extracted method can be pulled up. In CeDAR, clones can be selected/de-selected for refactoring. In the case of a Pull-up Method refactoring that requires an initial Extract Method refactoring, the inclusion of all clones in the group (i.e., Fig. 9a) can be updated to include only a selection of the clones (i.e., Fig. 9b). These selected clones in the same class can then be refactored (i.e., Extract Method) first, followed by the refactoring of the remaining two clones. Then, the standard Pull-up Method refactoring can be activated on the newly extracted methods in the two classes. Currently, the refactoring cannot be done in a single step, but the programmer does not need to select any section of code because the clone information is available and can be forwarded to the refactoring engine.

Fig. 6. Mapping of parameterized elements.

Fig. 7. Including parameterized fields.

Fig. 8. Removing duplicate variables.

Fig. 9. Filtering clone selection: (a) all clones selected; (b) only clones in same class selected.

Extract to Utility-type class – For clones that are scattered in unassociated classes, a possible solution is to extract a method containing the duplicated code of the clones into a separate class. More specifically, a method is extracted into a static utility-type class, where the functionality that is extracted performs some logic that is not as tightly coupled to the inner workings of an existing class. Such methods resemble standard Java library APIs such as those found in the predefined Math class. In CeDAR (as seen in Fig. 2b), the clone detection results provide information about clones located in multiple classes. This information can be used by the programmer to perform the refactoring. Extracting a method into a utility-type class can be considered a combination of several atomic refactoring activities. A snippet of code is initially extracted into a new method and a call to the newly extracted method replaces the code in the original location (i.e., Extract Method refactoring). This new method is then extracted into a new class and the call to the original method is replaced by a call to the method in the new class (i.e., Extract Class refactoring). When clones are involved, a programmer would need to select one of the clone instances and perform the steps described above. The programmer could then manually replace the remaining sections of duplicated code associated with

the clones (i.e., such as those in Fig. 2b) with the call to the method in the new class.

5. Evaluation

In this section, we evaluate the inclusion of refactoring clones with additional parameterized differences in several open source software artifacts. We want to observe the potential increase of refactoring capabilities with the incorporation of additional parameterized differences described in Section 4.1 as compared to the current capabilities in Eclipse.

Table 2 summarizes experiments on open source Java projects where we performed clone detection using Deckard. CeDAR supports several clone detection tools, but Deckard is a tool that is tree-based and reports syntactically meaningful clones. Syntactically meaningful clones are important because refactoring should be performed on clones that represent proper syntactic blocks. For the detection settings in Deckard, the “minimum tokens,” which sets the minimum size of the clones, was set to 50. The “similarity” was set to 0.95. “Stride,” which determines how large code fragments are encoded together, was set to 0.

The column “#Cand. CG” reports the number of clone groups that satisfy the general property of Extract Method (i.e., all clones in one file). The table shows that the varying totals of candidate groups are not based on the program size. For example, JFreeChart had the largest number of groups (i.e., 291 groups), but did not have the most lines of code. For each of these candidate groups, the Extract Method option in Eclipse was executed to determine how many of these groups satisfied the requirements for refactoring and thus can be refactored. To satisfy the requirements, activating refactoring on a clone group must not return any error messages from the refactoring engine (i.e., “OK” status returned). This is the step that can determine which clone groups actually can be refactored from the results of clone detection or analysis tools, as mentioned in Section 3.2.

The column “Eclipse” reports the number of groups that are refactorable using Extract Method in standard Eclipse. This number filters out non-refactorable clone groups such as those with multiple assigned variables within the cloned code. The column “CeDAR” reports the number of groups that are refactorable with the addition of other parameterized differences. These totals include the instances that were refactorable by the standard Eclipse technique. It can be seen that in about half of the artifacts evaluated (i.e., five of the nine, based on the totals in the last column), the number of refactorings at least doubled using CeDAR. This demonstrates a considerable increase in instances that CeDAR provides in terms of assistance to programmers during clone maintenance.
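The percentages and differences reported in Table 2 can be recomputed from the raw counts. The short Python sketch below is added only as a cross-check: the rows are copied from Table 2, while the function and key names are illustrative assumptions and not part of CeDAR.

```python
# Rows copied from Table 2: (project, #Cand. CG, Eclipse, CeDAR)
TABLE_2 = [
    ("Apache Ant 1.7.0", 120, 14, 28),
    ("Columba 1.4", 88, 13, 30),
    ("EMF 2.4.1", 149, 8, 14),
    ("Hibernate 3.3.2", 177, 15, 18),
    ("Jakarta-JMeter 2.3.2", 68, 3, 11),
    ("JEdit 4.2", 157, 15, 20),
    ("JFreeChart 1.0.10", 291, 29, 62),
    ("JRuby 1.4.0", 81, 23, 23),
    ("Squirrel-SQL 3.0.3", 75, 8, 20),
]


def summarize(row):
    """Derive the share of refactorable candidate groups and the gain from CeDAR."""
    project, candidates, eclipse, cedar = row
    return {
        "project": project,
        "eclipse_pct": round(100 * eclipse / candidates),
        "cedar_pct": round(100 * cedar / candidates),
        "delta": cedar - eclipse,
        "at_least_doubled": cedar >= 2 * eclipse,
    }


summaries = [summarize(row) for row in TABLE_2]
doubled = [s["project"] for s in summaries if s["at_least_doubled"]]

for s in summaries:
    print(f'{s["project"]}: {s["eclipse_pct"]}% -> {s["cedar_pct"]}% (+{s["delta"]})')
print(f"{len(doubled)} of {len(summaries)} projects at least doubled")
```

Under these counts, the projects in which the number of refactorable groups at least doubled are Apache Ant, Columba, Jakarta-JMeter, JFreeChart, and Squirrel-SQL.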
CeDAR provides the additional step needed to remove clone groups reported by tools such as ARIES and SUPREMO that actually cannot be refactored using Extract Method. In [25], the results of ARIES were manually grouped into five “Extract Method Groups.” These groups are based, for example, on whether a clone can be extracted with or without any parameters in the newly extracted method. The fifth group represented clones that required too much effort to refactor and thus could not be refactored simply with Extract Method. In this case, CeDAR can filter out such clones and report only those that can be refactored with Extract Method. Similarly, SUPREMO will only report clones with potential for Extract Method refactoring based on their locations within the class hierarchy; hence, no evaluation is done on whether the refactoring could actually be performed.

Table 2. Additional Extract Method refactorings by CeDAR.

| Project | KLoC | #Cand. CG | Eclipse | CeDAR | Δ |
| Apache Ant 1.7.0 | 67 | 120 | 14 (12%) | 28 (23%) | +14 |
| Columba 1.4 | 75 | 88 | 13 (15%) | 30 (34%) | +17 |
| EMF 2.4.1 | 118 | 149 | 8 (5%) | 14 (9%) | +6 |
| Hibernate 3.3.2 | 209 | 177 | 15 (8%) | 18 (10%) | +3 |
| Jakarta-JMeter 2.3.2 | 54 | 68 | 3 (4%) | 11 (16%) | +8 |
| JEdit 4.2 | 51 | 157 | 15 (10%) | 20 (13%) | +5 |
| JFreeChart 1.0.10 | 76 | 291 | 29 (10%) | 62 (21%) | +33 |
| JRuby 1.4.0 | 101 | 81 | 23 (28%) | 23 (28%) | 0 |
| Squirrel-SQL 3.0.3 | 141 | 75 | 8 (11%) | 20 (27%) | +12 |

Table 3. Parameterized differences in arguments list of extracted method.

| Project | LV | IF | EF | MC | S |
| Apache Ant 1.7.0 | 10 | 8 | 2 | 8 | 6 |
| Columba 1.4 | 14 | 7 | 7 | 7 | 5 |
| EMF 2.4.1 | 6 | 2 | 0 | 2 | 4 |
| Hibernate 3.3.2 | 3 | 0 | 0 | 2 | 2 |
| Jakarta-JMeter 2.3.2 | 8 | 1 | 1 | 2 | 7 |
| JEdit 4.2 | 4 | 1 | 1 | 1 | 2 |
| JFreeChart 1.0.10 | 34 | 19 | 11 | 13 | 5 |
| Squirrel-SQL 3.0.3 | 12 | 6 | 3 | 9 | 4 |

LV = local variable; IF = internal field; EF = external field; MC = method call; S = string.

The clone groups found only by CeDAR (i.e., representing the difference in the last column in Table 2) were evaluated to determine the occurrence of the types of parameterized differences that were included in the arguments list of newly extracted methods, which is given in Table 3. JRuby was not included in the table because CeDAR did not show any new instances. In addition to local variables (i.e., LV), internal and external fields (i.e., IF and EF), and method calls (i.e., MC), Table 3 also contains the number of occurrences of parameterized strings (i.e., S), which were also part of the refactoring extensions. It should be noted that one clone group can be represented multiple times for an observed project in Table 3, because an arguments list can include one or more types of parameterized differences. For example, in Apache Ant, of the 14 additional clone groups identified by CeDAR for refactoring, 10 of these groups contained at least one local variable parameter in the arguments of their respective newly extracted methods. As can be seen from the table, local variable parameters remain the most frequent type of difference, as for each artifact they represent the most instances. For the remaining parameterized differences, which were added to the clone refactoring process, the next largest number of instances varies. From the artifacts evaluated, fields from external classes tend to represent the lowest number of instances compared to the other three elements. However, the evaluation of the artifacts demonstrates that parameterized differences other than local variables were utilized in varying numbers of the observed Extract Method refactoring activities.
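The counting rule behind Table 3 can be illustrated with a minimal Python sketch. The three clone groups below are hypothetical toy data, not taken from the evaluation; they only show why the per-type counts of a project can add up to more than its number of clone groups (for Apache Ant, the five counts in Table 3 sum to 34 although only 14 groups were added).

```python
from collections import Counter

# Hypothetical clone groups: each set lists the kinds of parameterized
# differences found in the arguments list of the group's extracted method.
groups = {
    "group_a": {"LV"},
    "group_b": {"LV", "MC"},
    "group_c": {"IF", "S"},
}


def tally(groups):
    """Count, per kind of difference, the clone groups that contain it."""
    counts = Counter()
    for kinds in groups.values():
        counts.update(kinds)  # a group adds 1 to every kind it contains
    return counts


counts = tally(groups)
print(counts["LV"], counts["MC"], counts["IF"], counts["S"])  # 2 1 1 1
print(sum(counts.values()), "occurrences across", len(groups), "groups")
```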

6. Threats to validity

The purpose of the evaluation in Section 5 was to identify the extent of clone groups on which Extract Method refactoring can be performed, as determined by CeDAR. However, it should be noted that a programmer may not refactor all instances that were identified. This poses a threat to the validity of how many of the refactorings actually would be applied by the programmer. However, the instances represent clone groups where refactorings have been identified by CeDAR, thus removing a considerable number of clone groups that must be evaluated by the programmer to at least determine refactoring opportunities. Related to this, our process does not promote a fully automatic mechanism of refactoring, either for individual clones or for all clones identified as refactorable. Some clones may not necessarily need to be refactored [29]; hence, the programmer must first decide whether refactoring is the appropriate solution. CeDAR’s objective is to reduce the number of clone groups that need to be evaluated by the programmer for refactoring before he or she decides to refactor.

A separate threat to validity on the results of CeDAR is the percentage of clone groups that were identified as refactorable with Extract Method in Table 2. In the largest case (i.e., Columba), only 34% of the clone groups were considered refactorable, which raises the concern that the number of clone groups identified for refactoring is low. A manual evaluation of the clone groups not selected by CeDAR revealed several cases that hindered a particular clone group from being refactorable. One case is the parameterized differences of the clones that are not currently supported by CeDAR. We plan to investigate further differences as future work. Another case consists of parameterized differences that cannot simply be refactored using Extract Method. For example, two clones may contain two different method calls in the same syntactic location. This requires a more elaborate refactoring framework. A separate case is related to the results from the clone detection tool that was used. From the results of Deckard, the clones in some clone groups contained more complex differences. For example, in one code location, a clone consisted of a string constant, whereas in another clone, the same location consisted of a concatenation of a string variable and a string constant. This case represents clones that are not syntactically similar even with parameterization and again requires a more elaborate refactoring framework or a more manual process.

The clone groups reported as refactorable by CeDAR are dependent on the clone detection tool that is used. That is, CeDAR evaluates the clones reported by the tool that it uses as the front end. Clones that are not reported by the tool will not be evaluated by CeDAR. However, we selected Deckard for our evaluation because its reported clones are syntactically meaningful and because of its popularity as a clone detection tool.

7. Related work

Fanta and Rajlich performed clone removal through processes that include function insertion, function encapsulation, variable renaming, and argument reordering [19]. The candidate clones for removal must be selected and determined manually, while the removal of clones is automated. In CeDAR, candidate clones for refactoring can be determined by filtering properties of the clones. Further filtering can be done to determine clones that can be refactored from the clones that potentially can be refactored. Balazinska et al. also proposed an automated refactoring technique in which the Strategy design pattern [21] is used to transform the clones into a more manageable format [3]. The refactoring was highly automated, and the authors concluded in a subsequent paper that a fully automated refactoring approach is less effective compared to one with some user interaction [4]. Juillerat and Hirsbrunner evaluated clones based on predefined constraints [27]. However, only parameterized local variables are considered. The refactoring of clones in the C language was considered by Baxter et al. [5], Komondoor and Horwitz [31], and Liu [38]. Baxter et al. replaced detected clones with macros. Komondoor and Horwitz suggested approaches to extract a function from the clones, which is similar to Extract Method. These works provide extensive mechanisms for function extraction in procedural languages, but are currently not incorporated within an IDE that can provide a centralized location for source code maintenance. Li and Thompson proposed code clone removal for the functional language Erlang [37]. This technique is incorporated within a refactoring environment for the language. The refactoring is performed one at a time on each detected duplicate, rather than as a combined process as given in CeDAR.

Along the theme of the ARIES and SUPREMO tools, Choi et al. [8], Lee et al. [36], and Zibran and Roy [47] have also considered techniques for finding refactoring opportunities for clones. Lee et al. and Zibran and Roy also seek to determine the best sequence or schedule for clone refactoring. As stated in Section 3.2, these clone analysis tools for refactoring can complement and potentially be included in the clone maintenance process of CeDAR, as more appropriate clones for refactoring can be identified. A more recent work by Zibran and Roy proposes an IDE-based clone management system that includes the processes of clone detection, management, and refactoring [48]. The work does not discuss the mechanism of refactoring that will be used. We have endeavored to connect CeDAR’s process of refactoring as much as possible with the refactoring framework that is already available in Eclipse. Hence, CeDAR is able to benefit from the already established framework in the IDE. It should be noted that Zibran and Roy also consider refactoring of Type III clones, which are clones that can differ by a few additions or deletions of statements. CeDAR currently is only focused on Type I and II clones.

It is worth noting that a separate approach to clone maintenance keeps the clones in place and performs maintenance where the clones are located [14,46]. In this case, the duplication of the clones is not removed. This approach links the sections of code that are duplicates of each other, allowing the programmer to edit just one instance and have all other instances edited accordingly. A motivation for this approach is the observation that some clones are short-lived and over time become harder to refactor [30]. It should be noted that the refactoring of all clones is not necessarily the ideal solution. However, when instances arise in which a clone group can be refactored, the programmer can decide to refactor the group with the aid of CeDAR.

RCF [23] provides a data model for results from clone detection tools. Currently, RCF does not support CloneDR and Deckard, the latter of which is the tool we utilized in our evaluation. As RCF is aimed at a standardized representation of clone data, we will consider as a future feature the incorporation of RCF in our clone maintenance process by contributing the functionality of obtaining RCF-formatted clone data for tools such as CloneDR and Deckard.

8. Conclusion and future work

This paper described our work on unifying the phases related to code clone maintenance, specifically the elimination of duplication through refactoring activities. By unifying the processes of clone detection and refactoring, the strengths of both processes can be garnered. From the clone detection side, we include the ability to obtain results from a selection of clone detection tools. These results can serve as input into the refactoring engine to determine actual refactoring opportunities from potential ones.
From the refactoring side, extensions to the Eclipse refactoring engine allowed more parameterized differences among the clones to be incorporated, enabling additional accepted refactorings. The number of refactoring instances on clones was seen to double in many of the software artifacts that were evaluated.

We will consider additional parameterized differences for refactoring in an effort to increase the types of clones that can be refactored by CeDAR. Such differences include node types in the same syntactic location that are different. For example, where one clone uses a local variable, another clone may use a method call in the same syntactic location. Future work on the refactorings described in Section 4.2 includes adding more automation to the process of pulling up a method that represents a section of code that needs to be extracted first. We will also investigate the support for other types of refactorings for clones in addition to the refactorings described in this paper. However, when considering any extension, an evaluation must be performed to determine the prevalence of instances that can use such a refactoring. The evaluation in Section 5 showed mostly a doubling of refactoring instances for detected clone groups. The inclusion of further parameterized differences should be evaluated first to determine the extent to which refactoring support increases. In addition, we will further evaluate the results of CeDAR with the maintainer of a code base to determine to what extent the proposed refactorings actually would be performed. Finally, clone refactoring that does not include the entire clone has been observed in [22,44]. A mechanism to allow sub-clone refactoring will also be considered.

Acknowledgment

This material is based upon work supported by the National Science Foundation under Grant No. CCF-0702764.


References

[1] B. Baker, A Program for Identifying Duplicated Code, in: Computing Science and Statistics: Proceedings of the Symposium on the Interface, College Station, TX, 1992, pp. 49–57.
[2] B. Biegel, S. Diehl, Highly Configurable and Extensible Code Clone Detection, in: Proceedings of the Working Conference on Reverse Engineering, Beverly, MA, 2010, pp. 237–241.
[3] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, K. Kontogiannis, Partial Redesign of Java Software Systems Based on Clone Analysis, in: Proceedings of the Working Conference on Reverse Engineering, Atlanta, GA, 1999, pp. 326–336.
[4] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, K. Kontogiannis, Advanced Clone-analysis to Support Object-oriented System Refactoring, in: Proceedings of the Working Conference on Reverse Engineering, Brisbane, Australia, 2000, pp. 98–107.
[5] I. Baxter, A. Yahin, L. Moura, M. Sant’Anna, L. Bier, Clone Detection using Abstract Syntax Trees, in: Proceedings of the International Conference on Software Maintenance, Bethesda, MD, 1998, pp. 368–377.
[6] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, E. Merlo, Comparison and Evaluation of Clone Detection Tools, Transactions on Software Engineering 33 (9) (2007) 577–591.
[7] P. Bulychev, An Evaluation of Duplicate Code Detection using Anti-unification, in: Proceedings of the International Workshop on Software Clones, Kaiserslautern, Germany, 2009.
[8] E. Choi, N. Yoshida, T. Ishio, K. Inoue, T. Sano, Extracting Code Clones for Refactoring using Combinations of Clone Metrics, in: Proceedings of the International Workshop on Software Clones, Waikiki, HI, 2011, pp. 7–13.
[9] Code Clones Literature, 2011.
[10] Continuous Quality Assessment Toolkit (ConQAT), Clone Detection, 2011.
[11] Copy/Paste Detector (CPD), 2011.
[12] J. Cordy, C. Roy, The NiCad Clone Detector, in: Proceedings of the International Conference on Program Comprehension, Kingston, Canada, 2011, pp. 219–220.
[13] I. Davis, M. Godfrey, From Whence It Came: Detecting Source Code Clones by Analyzing Assembler, in: Proceedings of the Working Conference on Reverse Engineering, Beverly, MA, 2010, pp. 242–246.
[14] E. Duala-Ekoko, M. Robillard, Tracking Code Clones in Evolving Software, in: Proceedings of the International Conference on Software Engineering, Minneapolis, MN, 2007, pp. 158–167.
[15] S. Ducasse, M. Rieger, S. Demeyer, A Language Independent Approach for Detecting Duplicated Code, in: Proceedings of the International Conference on Software Maintenance, Oxford, United Kingdom, 1999, pp. 109–118.
[16] Duplication Management Framework (DMF), 2011.
[17] Duplo, 2011.
[18] Eclipse Integrated Development Environment, 2011.
[19] R. Fanta, V. Rajlich, Removing clones from the code, Journal of Software Maintenance: Research and Practice 11 (4) (1999) 223–243.
[20] M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley, Reading, MA, 1999.
[21] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns, Addison-Wesley, Boston, MA, 1995.
[22] N. Göde, Clone Removal: Fact or Fiction?, in: Proceedings of the International Workshop on Software Clones, Cape Town, South Africa, 2010, pp. 33–40.
[23] J. Harder, N. Göde, Efficiently Handling Clone Data: RCF and Cyclone, in: Proceedings of the International Workshop on Software Clones, Waikiki, HI, 2011, pp. 81–82.
[24] Y. Higo, S. Kusumoto, Significant and Scalable Code Clone Detection with Program Dependency Graph, in: Proceedings of the Working Conference on Reverse Engineering, Lille, France, 2009, pp. 315–316.
[25] Y. Higo, S. Kusumoto, K. Inoue, A metric-based approach to identifying refactoring opportunities for merging code clones in a Java software system, Journal of Software Maintenance and Evolution: Research and Practice 20 (6) (2008) 435–461.


[26] L. Jiang, G. Misherghi, Z. Su, S. Glondu, DECKARD: Scalable and Accurate Tree-based Detection of Code Clones, in: Proceedings of the International Conference on Software Engineering, Minneapolis, MN, 2007, pp. 96–105.
[27] N. Juillerat, B. Hirsbrunner, An Algorithm for Detecting and Removing Clones in Java Code, in: Proceedings of the Workshop on Software Evolution through Transformations, Natal, Brazil, 2006, pp. 63–74.
[28] T. Kamiya, S. Kusumoto, K. Inoue, CCFinder: a multilinguistic token-based code clone detection system for large scale source code, Transactions on Software Engineering 28 (2) (2002) 654–670.
[29] C. Kapser, M. Godfrey, “Cloning considered harmful” considered harmful: patterns of cloning in software, Empirical Software Engineering 13 (6) (2008) 645–692.
[30] M. Kim, V. Sazawal, D. Notkin, G. Murphy, An Empirical Study of Code Clone Genealogies, in: Proceedings of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, Lisbon, Portugal, 2005, pp. 187–196.
[31] R. Komondoor, S. Horwitz, Effective, Automatic Procedure Extraction, in: Proceedings of the International Workshop on Program Comprehension, Portland, OR, 2003, pp. 33–42.
[32] G. Koni-N’Sapu, SUPREMO: A Scenario Based Approach for Refactoring Duplicated Code in Object Oriented Systems, Diploma Thesis, University of Bern, Switzerland, 2001.
[33] J. Krinke, Identifying Similar Code with Program Dependence Graphs, in: Proceedings of the Working Conference on Reverse Engineering, Stuttgart, Germany, 2001, pp. 301–309.
[34] S. Lawrence, J. Atlee, Software Engineering: Theory and Practice, Prentice Hall, Upper Saddle River, NJ, 2006.
[35] S. Lee, I. Jeong, SDD: High Performance Code Clone Detection System for Large Scale Source Code, in: Companion to the International Conference on Object-Oriented Programming, Systems, Languages, and Applications, San Diego, CA, 2005, pp. 140–141.
[36] S. Lee, G. Bae, H. Chae, D-H. Bae, Y. Kwon, Automated scheduling for clone-based refactoring using a competent GA, Software: Practice and Experience 41 (5) (2011) 521–550.
[37] H. Li, S. Thompson, Clone Detection and Removal for Erlang/OTP within a Refactoring Environment, in: Proceedings of the Workshop on Partial Evaluation and Semantics-Based Program Manipulation, Savannah, GA, 2009, pp. 169–178.
[38] Y. Liu, Semi Automatic Removal of Duplicated Code, Diploma Thesis, University of Stuttgart, Germany, 2004.
[39] C. Roy, J. Cordy, R. Koschke, Comparison and evaluation of code clone detection techniques and tools: a qualitative approach, Science of Computer Programming 74 (7) (2009) 470–495.
[40] Simian – Similarity Analyser, 2011.
[41] SimScan, 2010.
[42] Y. Sun, Z. Demirezen, F. Jouault, R. Tairas, J. Gray, Tool Interoperability through Model Transformations, in: Proceedings of the International Conference on Software Language Engineering, Toulouse, France, 2008, pp. 178–187.
[43] R. Tairas, J. Gray, Get to Know Your Clones with CeDAR, in: Companion to the International Conference on Object-Oriented Programming, Systems, Languages, and Applications, Orlando, FL, 2009, pp. 817–818.
[44] R. Tairas, J. Gray, Sub-clone Refactoring in Open Source Software Artifacts, in: Proceedings of the Symposium on Applied Computing, Sierre, Switzerland, 2010, pp. 2373–2374.
[45] R. Tairas, F. Jacob, J. Gray, Representing Clones in a Localized Manner, in: Proceedings of the International Workshop on Software Clones, Waikiki, HI, 2011, pp. 54–60.
[46] M. Toomim, A. Begel, S. Graham, Managing Duplicated Code with Linked Editing, in: Proceedings of the Symposium on Visual Languages – Human Centric Computing, Rome, Italy, 2004, pp. 173–180.
[47] M. Zibran, C. Roy, Conflict-aware Optimal Scheduling of Code Clone Refactoring: A Constraint Programming Approach, in: Proceedings of the International Conference on Program Comprehension, Kingston, Canada, 2011, pp. 266–269.
[48] M. Zibran, C. Roy, Towards Flexible Code Clone Detection, Management, and Refactoring in IDE, in: Proceedings of the International Workshop on Software Clones, Waikiki, HI, 2011, pp. 75–76.