Collaborative or individual identification of code smells? On the effectiveness of novice and professional developers


Information and Software Technology 120 (2019) 106242

Contents lists available at ScienceDirect

Information and Software Technology journal homepage: www.elsevier.com/locate/infsof

Collaborative or individual identification of code smells? On the effectiveness of novice and professional developers

Roberto Oliveira a,b,∗, Rafael de Mello a, Eduardo Fernandes a, Alessandro Garcia a, Carlos Lucena a

a Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil
b State University of Goiás (UEG), Posse-GO, Brazil

Article info

Keywords: Code smell identification; Collaboration; Empirical study

Abstract

Context: Code smell identification aims to reveal code structures that harm software maintainability. Such identification usually requires a deep understanding of multiple parts of a system. Unfortunately, developers who identify code smells individually can struggle to identify, confirm, and refute code smell suspects. Developers may reduce this struggle by identifying code smells in pairs, i.e., through collaborative smell identification.

Objective: The current knowledge on the effectiveness of collaborative smell identification remains limited. Previous work on the effectiveness of collaborative versus individual smell identification left some scenarios unexplored. In this paper, we address a particular scenario that reflects various organizations worldwide. We also compare our study results with recent studies.

Method: We carefully designed and conducted a controlled experiment with 34 developers. We exploited a particular scenario that reflects various organizations: novices and professionals inspecting systems they are unfamiliar with. With this design, we expect to minimize some critical threats to the validity of previous work. Additionally, we interviewed 5 project leaders to understand the potential adoption of collaborative smell identification in practice.

Results: Statistical testing suggests 27% more precision and 36% more recall through collaborative smell identification for both novices and professionals. These results partially confirm previous work in a scenario not previously exploited. Additionally, the interviews showed that leaders are strongly inclined to adopt collaborative smell identification. However, some organization and tool constraints may limit such adoption. We derived recommendations for organizations concerned about adopting collaborative smell identification in practice.

Conclusion: We recommend that organizations allocate novice developers to identify code smells in collaboration. In this way, organizations can promote knowledge sharing and correct smell identification. We also recommend allocating developers who are unfamiliar with the system to identify smells, so that organizations can allocate more experienced developers to more critical tasks.

1. Introduction

Software maintenance implies constantly changing the code elements that constitute a system, such as classes and methods, in order to address the users' needs [1]. Unfortunately, maintaining those code elements often degrades the internal code structure of systems [2,3]. A practical implication of such degradation is that poor code structures are difficult for developers to maintain [4,5]. In other words, these structures usually make it difficult to understand and change a system. To minimize the negative impact of code structure degradation, developers should identify and eliminate poor code structures affecting a system whenever possible [2].

Code smells are basic symptoms of code structure degradation [6]. Each smell type represents a pattern of poor code structure that, if not minimized or eliminated, can hamper the system maintenance. We exemplify this scenario with the Large Class smell type [6,7]. A Large Class instance is realized by a too large and complex class, which should be split into many for separating features. Developers usually associate a Large Class instance with a high complexity to understand and change the aﬀected class [4,5]. Thus, identifying and eliminating Large Class instances potentially increases the system maintainability. In summary, software developers should identify and eliminate this smell type, as well as many other smell types, from their systems [8,9].

∗ Corresponding author: R. Oliveira.

https://doi.org/10.1016/j.infsof.2019.106242
Received 18 December 2018; Received in revised form 2 November 2019; Accepted 10 December 2019; Available online 13 December 2019
0950-5849/© 2019 Elsevier B.V. All rights reserved.

R. Oliveira, R. de Mello and E. Fernandes et al.

Software contributions affected by code smell instances are often rejected for integration in open source systems [10]. This suggests that code smells affect system development in practice. As a means to eliminate code smells, software developers often apply refactoring to their systems [11–15]. By changing specific code elements, developers usually expect to either minimize or eliminate the existing poor code structures. However, refactoring requires effective support for developers in identifying the poor code structures that need it [16–19]. In this context, various studies have investigated the identification of code smells in different scenarios [20–23].

1.1. Current knowledge and limitations

In this study, we refer to developers who identify alone the code smells affecting a system as single developers. A recent study [21] investigates the ineffectiveness of single developers in identifying code smells, in the so-called individual smell identification. Such ineffectiveness is mostly due to the subjective nature of code smell identification [4,5,21,24] combined with the limited tooling support for identifying code smells [25], especially regarding precision and recall [26]. Thus, collaborative smell identification emerges as a promising solution to such ineffectiveness [22,23]. It consists of two developers working as collaborators to identify, confirm, and refute code smell suspects, i.e., code smell instances not yet confirmed by developers. Common wisdom would hold that, in contrast with individual smell identification, collaborative smell identification is more effective, especially in yielding a high rate of identified code smells. We recently investigated this common wisdom through empirical studies [22,23]. Nevertheless, the current knowledge on the effectiveness of collaborative smell identification does not suffice to convince development companies to adopt it in practice.
The first study [23] investigates the effectiveness of novice developers arranged individually (one developer), in pairs (two developers), and in groups (three or more developers). All developers are unfamiliar with the inspected systems. In other words, the developers know little or nothing about the code elements and possible poor code structures affecting the systems. This scenario emulates real-world companies in which developers have to maintain a system they did not originally develop. The second study [22] investigates the effectiveness of professional developers in the same three arrangements. All developers are familiar with the inspected systems and, therefore, they might be more likely to identify code smells than other developers. The two aforementioned studies [22,23] conclude that collaborative smell identification (performed by pairs and groups) is more effective than individual smell identification. They then suggest, to some extent, that companies adopt collaborative smell identification. Nevertheless, their findings are not exactly comparable because they involve different smell types, inspected systems, and working experience. Although we have crossed the data of both studies in a recent work [27], the comparison has several threats to validity. This has two major practical implications: (1) it remains difficult to assure that the familiarity of developers with a system has affected effectiveness, and (2) it might be hard to convince companies to allocate either novices or professionals to identify code smells collaboratively. In particular, professional developers tend to hold most of the knowledge about their systems [28]. Thus, companies would prefer to allocate novices rather than professionals to identify code smells.

1.2. Expanding the current knowledge: an unexploited scenario

In this paper, we aim to extend the scope of our analysis while avoiding the threats to validity inherent in combining observations from different studies [27]. We present a controlled experiment on the effectiveness of collaborative smell identification in a scenario not exploited before: novice and professional developers inspecting systems they are unfamiliar with. This scenario is quite interesting in practice for the following reasons. First, the effectiveness of novice and professional developers may vary significantly, which would justify allocating developers with a specific working experience to identify code smells. Second, the lack of familiarity with a system is recurring in practice due to (1) the high developer turnover [29] and (2) the need to allocate professional developers to address the clients' needs rather than to eliminate poor code structures affecting a system [15].

A total of 34 novice and professional developers participated in our controlled experiment. Following empirical study guidelines [30], we asked developers to identify code smells either as single developers or as collaborators (two developers only). We selected a software system affected by a sufficient and varied number of code smell suspects. The respective smell types are those that developers usually perceive as harmful to system maintainability [4,5]. We submitted our data to quantitative analysis via precision and recall computation [26] and the application of statistical methods [31] for hypothesis testing. As a complement to our quantitative analysis, we interviewed five software project leaders. Our major goal was to understand whether leaders are likely to adopt collaborative smell identification in their teams.

1.3. Study results and practical implications

Our study results suggest that both novice and professional developers reach 27% more precision and 36% more recall by identifying code smells collaboratively. By applying statistical tests, we confirmed a significant difference between the effectiveness of developers working as single developers and as collaborators in the identification of code smells.
These results are interesting by themselves, since they provide empirical evidence that allocating developers to work collaboratively can help identify more code smells while keeping the misidentification rate low when compared to individual smell identification. However, our study reveals even more insightful conclusions when combined with our previous findings about collaborative smell identification. In our first study [23], we investigated the effectiveness of novices identifying code smells individually by simply computing the average number of identified code smells. Unfortunately, this computation represents a threat to the study validity because the average number does not reveal important aspects of code smell identification, such as how frequently novice developers misidentify code smells when allocated to work collaboratively. By addressing this threat, we have now confirmed that novice developers working collaboratively obtain a higher precision and recall than those working individually, thereby reducing the number of misidentified code smells. Thus, organizations can allocate novice rather than professional developers to identify code smells collaboratively, so that professional developers can focus on addressing the clients' needs.

In our second study [22], we investigated the effectiveness of collaborative smell identification from the perspective of professional developers who are familiar with the inspected system, this time by computing precision and recall. The familiarity of developers with the system might have facilitated the identification of code smells, which implies a threat to the study validity. We addressed this threat by allocating developers with varied working experience to inspect a system they are unfamiliar with. Surprisingly, we observed that developers working collaboratively are more effective than those working individually at identifying code smells, regardless of their familiarity with the inspected system. Thus, organizations can simply allocate developers who are unfamiliar with their systems to identify code smells. This enables the other developers to address the clients' needs based on their extensive knowledge about the system.

Through the interviews with software project leaders, we derived various interesting findings. First, most leaders agreed that collaboration is essential to support code smell identification. In particular, leaders expect that collaboration may support the identification of code smells depending on their granularity and complexity. Second, leaders are ultimately open to employing collaborative smell identification in their development teams. Leaders mentioned many motivations behind such adoption, such as the improvement of code quality and the enhancement of identification accuracy. However, as expected, limited budget and time may hinder this adoption in practice.

2. Background and related work

This section provides the background information of the paper. Section 2.1 discusses the characteristics of code smells and their impact on system maintainability. Section 2.2 discusses the limitations of both automated and manual identification of code smells. We justify our investigation of collaborative smell identification as a manual and human-centered activity. Section 2.3 discusses the limited knowledge about the effectiveness of collaborative smell identification. Section 2.4 motivates our study with a practical example.

2.1. Code smells: characteristics and impact on maintainability

The internal code structure of a system tends to degrade along successive changes [2,8,32]. Code smells are symptoms of poor code structures that realize such degradation [5,6]. They can reveal, at least partially, a problem of system maintainability [8,9], such as the difficulty to understand and change certain code elements. To identify code smells, developers need to understand two basic characteristics that help reveal the impact of a code smell instance on system maintainability. These characteristics are: (1) the smell type and (2) the smell granularity. We explain each characteristic as follows.

The smell type characterizes how the poor code structure manifests in the source code of a system. Let us return to the Large Class smell type mentioned in Section 1. Large Class consists of a too large and complex class, which developers might find challenging to understand and change [6,7]. This smell type is usually realized by several methods, which sum up to many lines of code and implement complex system features.
Previous work [4,5] points out Large Class as one of the most critical smell types from the viewpoint of developers. Besides Large Class, many other smell types have been cataloged [6,7,33].

Due to the different manifestations of code smells, each code smell instance affects a particular set of code elements in a system. This characterizes the smell granularity. In this study, we relied on previous work [6,34] and classified smell granularity into two categories. The first category is called intra-class smells and consists of poor code structures that usually require inspecting a single class to be identified. Examples are Large Class [6] instances, which locally affect a class. The second category is called inter-class smells and consists of poor code structures whose identification usually requires reasoning about multiple classes together. Examples are Message Chain [6] instances, each requiring the analysis of various classes whose method calls form a call chain. We selected four smell types for inspection.

• Data Clumps (Inter-class): A group of data items that often appear together in the system, either as class members or in the parameter lists of method signatures. A class could encapsulate them [6].
• Large Class (Intra-class): A too large and complex class. It usually implements various unrelated features, which could be distributed to other classes [6,7].
• Long Method (Intra-class): A too large and complex method. It usually either has many lines of source code, with several conditional branches, or implements complex features [6,7].
• Message Chain (Inter-class): A long sequence of method calls composed across various system classes [6].

We selected these smell types for three reasons. First, each selected smell type is frequent in the inspected systems and typically associated with maintainability problems by developers [2,4,5]. Second, these smells are conceptually interrelated, which can impose different difficulty levels on their identification. For instance, by definition, a Large Class instance might be a composition of various Long Method instances. Third, identifying instances of these smell types might require the analysis of various code elements and reasoning about their multiple characteristics [6,7], such as cohesion and coupling [35].


2.2. Automated versus manual smell identification

The identification of code smells consists of searching for poor code structures that potentially realize maintainability problems in a system [6,7]. Usually, such identification relies on inspecting the code elements that constitute a system. Each code element represents a basic decomposition unit of the system [36]. In this study, we focus on inspecting code smells that affect two types of code elements: methods and classes. Whenever we identify a code smell affecting a specific code element, we refer to this code element as a code smell suspect. After code smell suspects are identified, developers have to either confirm or refute each suspect as actually harmful to system maintainability.

Developers might identify the code smell suspects affecting the source code of a system in two ways: (1) by relying on the support of automated tools [25] and (2) by manually inspecting the source code. Examples of automated tools are JDeodorant [37] and Stench Blossom [38]. However, in spite of the various existing tools, a previous study [25] observes that these tools tend to have low effectiveness in terms of precision and recall. More critically, the tools vary significantly in effectiveness, especially because each tool adopts a different identification strategy (e.g., based on software metrics). Due to the high number of misidentified code smell suspects [25], which causes the usually low precision rates of existing tools, developers still need to reason about the validity of each code smell suspect after using an automated tool. As a response to the lack of effectiveness of automated supporting tools, companies can use the manual identification of code smells as the main practice, and the automated tools as a complement. Thus, developers can automate the identification of several code smell suspects and then manually confirm or refute them.
There are two scenarios for the manual identification of code smells [22]. The first scenario allocates single developers to identify code smells alone, in the so-called individual smell identification. The second scenario allocates two or more developers as collaborators to identify code smells together, in the so-called collaborative smell identification. Our previous studies [22,23] suggest that adding more than two collaborators does not significantly increase the effectiveness of collaborative smell identification. Therefore, unlike our previous studies, the current study assesses collaborative smell identification in pairs only.
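The combination of automated flagging and manual confirmation described in this section can be sketched as follows. This is only an illustrative sketch and not the strategy of any tool cited above: the element names, the size thresholds, and the helper functions `flag_suspects` and `manual_review` are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeElement:
    name: str
    kind: str  # "class" or "method"
    lines_of_code: int


# Hypothetical size thresholds, chosen only for illustration.
THRESHOLDS = {
    "class": ("Large Class", 500),
    "method": ("Long Method", 50),
}


def flag_suspects(elements, thresholds=THRESHOLDS):
    """Automated step: flag every element whose size exceeds its threshold."""
    suspects = []
    for element in elements:
        smell_type, limit = thresholds[element.kind]
        if element.lines_of_code > limit:
            suspects.append((element.name, smell_type))
    return suspects


def manual_review(suspects, confirmed):
    """Manual step: keep only the suspects that developers confirmed."""
    return [suspect for suspect in suspects if suspect in confirmed]


# Hypothetical code elements of an inspected system.
elements = [
    CodeElement("ReportManager", "class", 1200),
    CodeElement("ReportManager.export", "method", 140),
    CodeElement("GeneratedParser", "class", 900),
    CodeElement("Point", "class", 30),
]

suspects = flag_suspects(elements)

# In this made-up review, developers refute the GeneratedParser suspect
# and confirm the other two.
confirmed = manual_review(
    suspects,
    confirmed={
        ("ReportManager", "Large Class"),
        ("ReportManager.export", "Long Method"),
    },
)
```

In this sketch the automated step yields three suspects and the manual step keeps two of them, which mirrors why a tool with many misidentified suspects still leaves the final decision to developers.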

2.3. The current knowledge about collaborative smell identification

The collaboration of developers has been applied to and assessed in various software engineering tasks [39,40]. For instance, a previous study [40] shows that developers working collaboratively during code review reveal defects and maintenance problems affecting their software systems more easily and frequently. The benefits come mostly from the knowledge exchange promoted by collaboration [22,40]. Another study [39] shows that collaboration reduces the difficulty faced by developers in identifying defects in software systems, a task that requires a careful inspection of the source code. Consequently, one could expect that collaboration potentially improves the developers' effectiveness in identifying code smells, which also requires source code inspection.

This paper presents an empirical study aimed at assessing the effectiveness of collaborative versus individual smell identification. However, this is not the first study with a similar purpose. Table 1 compares the study design of this paper with the design of two previous studies of ours [22,23] with a similar purpose. The first column lists the study design characteristics that we consider relevant for comparison across studies. The second column characterizes our first study in the context of collaborative smell identification [23] in terms of those characteristics. The third column characterizes our second study [22]. The fourth column characterizes the combined analysis of both studies [27]. The fifth column characterizes our current study design. This table lists only the smell types inspected in our current study. We discuss the differences among studies, and justify our current study, as follows.

Table 1. Comparison between this study and previous studies.

Characteristic | Study 1 [23] | Study 2 [22] | Study 1+2 [27] | This study
---------------------------------------------------------------------------
Nature of the study performed | Controlled experiment | Controlled experiment | Controlled experiment | Controlled experiment plus interviews
Number of subjects | 28 developers | 13 developers | 41 developers | 34 developers + 5 team leaders
Subject working experience | Novices | Professionals | Novices, professionals | Novices, professionals
Arrangement of the subjects | Individuals, pairs, groups | Individuals, pairs, groups | Individuals, pairs, groups | Individuals, pairs
Number of inspected systems | 3 | 5 | 8 | 2
Familiarity with the system | Unfamiliar | Familiar | Unfamiliar, Familiar | Unfamiliar
Effectiveness metrics | Precision, recall | Average number of code smells | Precision | Precision, recall
Smell types | Long Method and 6 others | Large Class, Long Method, Message Chain, and 9 others | Large Class, Long Method, Message Chain, and 9 others | Data Clumps, Large Class, Long Method, Message Chain

The data of Table 1 indicates the different design characteristics across studies. Notably, this is the first study that complements the quantitative data obtained from a controlled experiment with interviews with project leaders. We aimed to know whether leaders are likely to adopt collaborative smell identification in practice. Additionally, our study differs from the others in the working experience of the participating developers. In this study, we counted on the participation of both novice and professional developers rather than developers with a single level of working experience. Our major goal was to enable a fair effectiveness comparison of novices and professionals through a shared context: the same systems affected by the same smell types. Thus, we aim at minimizing threats from our previous studies [22,23] that hinder concluding whether novices are as effective as professional developers in identifying code smells collaboratively.

2.4. Motivating example

Existing evidence [22] indicates that the developers' consensus is essential to the identification of code smells, especially regarding the confirmation or refutation of code smell suspects. Based on this evidence, we motivate our study by showing how difficult the identification might be for single developers. We then illustrate how developers working collaboratively may benefit from knowledge exchange during the identification of code smells. The developer's knowledge may include information about the purpose of the software system, details on the code structure, or definitions of smell types. In turn, knowledge exchange means sharing knowledge between developers, especially when identifying code smells together. Suppose that two hypothetical developers, Bill and Suzi, must carry out the identification of code smells (e.g., Data Clumps) in a software system not implemented by them. Bill and Suzi are responsible for inspecting the Util and System components, respectively. Through the identification of code smells, developers can obtain indications of which code elements need to be modified to improve the longevity of the software system.

2.4.1. Scenario 1: Individual smell identification

Each developer (i.e., Bill and Suzi) analyzes a particular component of the software system. Thereby, during the identification of code smells, Bill only inspects the Util component while Suzi only inspects the System component. Bill initially suspects that there is a Data Clumps smell affecting Class A. He has this suspicion because the foo and bar parameters appear together in the parameter list of two different methods (see code fragments in blue color). In order to confirm the existence of a Data Clumps smell, Bill inspects the parameter lists of the Class B methods. However, Bill concludes there is no Data Clumps smell because the foo and bar parameters appear together only in the two methods of Class A.

2.4.2. Scenario 2: Collaborative smell identification

During collaborative smell identification, Bill suspects there is a Data Clumps smell affecting Class A. He has this suspicion because the foo and bar parameters appear together in the parameter list of two different methods. To either refute or confirm the existence of a Data Clumps smell, Bill and Suzi verify whether the foo and bar parameters also appear together in the parameter lists of methods in the System component. During the discussion, Suzi points out problems related to Class D and confirms that the foo and bar parameters are repeated in the parameter lists of three different methods of this class. Thus, both Bill and Suzi broaden their knowledge about the code elements of the whole system. Accordingly, they agree there is a Data Clumps smell affecting the Util and System components.

Therefore, the exemplified Data Clumps smell would not have been found if Bill and Suzi had each inspected only a particular component. Instead, the smell was confirmed after the developers had exchanged knowledge regarding the inspected components. This knowledge could only be gathered by inspecting the parameter lists of methods across the whole system, in this scenario, across the Util and System components. The presented scenario exemplifies how important the exchange of information about code elements is for spreading knowledge between developers. Based on this exchange, the developers can more reliably: (i) confirm the code smell, (ii) define the code changes necessary to improve the code quality, and (iii) infer the actual impact on the design of the component hosting the smell. Especially for (i) and (iii), the existing tools are still insufficient to replace developers in the identification of code smells [8,25]. Moreover, for (ii), the existing tools do not dispense with the involvement of developers [22].
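The two scenarios can be sketched in code as follows. This is a minimal sketch under stated assumptions, not the procedure followed by the study subjects: the method names, the extra parameters, the function `clump_candidates`, and the threshold `MIN_METHODS` are hypothetical, and the threshold is set to 4 only so that neither component alone reaches it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical method signatures: (component, class, method) -> parameters.
# Class A has two methods and Class D three methods taking foo and bar,
# as in the example; method names and extra parameters are made up.
SIGNATURES = {
    ("Util", "A", "load"): ["foo", "bar", "path"],
    ("Util", "A", "save"): ["foo", "bar"],
    ("Util", "B", "reset"): ["flag"],
    ("System", "D", "open"): ["foo", "bar", "mode"],
    ("System", "D", "close"): ["foo", "bar"],
    ("System", "D", "sync"): ["bar", "foo", "timeout"],
}

# Assumed minimum number of methods sharing a parameter pair before the
# pair is treated as a Data Clumps suspect worth confirming.
MIN_METHODS = 4


def clump_candidates(signatures, components, min_methods=MIN_METHODS):
    """Count parameter pairs that co-occur in methods of the given components."""
    counts = Counter()
    for (component, _cls, _method), params in signatures.items():
        if component not in components:
            continue
        # Sort so that (foo, bar) and (bar, foo) count as the same pair.
        for pair in combinations(sorted(set(params)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_methods}


# Scenario 1: each developer only sees one component.
bill_alone = clump_candidates(SIGNATURES, {"Util"})
suzi_alone = clump_candidates(SIGNATURES, {"System"})

# Scenario 2: the pair combines knowledge about both components.
together = clump_candidates(SIGNATURES, {"Util", "System"})
```

Under these assumptions, neither developer alone gathers enough evidence, while the combined view finds the foo and bar pair in five methods.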
Therefore, the collaboration or isolation of developers during smell identification may affect the task effectiveness. In particular, working with collaborators may help developers perform specific actions that contribute to improving effectiveness in the identification of code smells.

3. Controlled experiment settings

This section describes the settings of our study aimed at understanding whether collaborators are more effective than single developers when identifying code smells. Section 3.1 presents the study goal, the research question, and associated hypotheses. Section 3.2 presents the characterization of subjects. Section 3.3 describes the target software systems and data sources used in the experiment. Section 3.4 presents the data analysis procedure. Section 3.5 describes the experiment procedure steps.

3.1. Research goal

This study aims at comparing collaborative and individual smell identification. Our goal was to understand whether collaborators are more effective than single developers when identifying code smells. By relying on a well-known guideline [30], we refined and structured the study goal as follows:

• Analyze the collaborative smell identification when compared to individual smell identification,
• For the purpose of assessing the developers' effectiveness,
• With respect to precision and recall of the identification of code smells,
• From the perspective of novice developers and professional developers,
• In the context of Java software systems that novice and professional developers are unfamiliar with, i.e., systems about which developers have no previous knowledge.

From our study goal, we designed the following research question (RQ1).

RQ1: Is collaborative smell identification more effective than individual smell identification?

Our empirical study addresses RQ1 as follows. First, we assess collaborative smell identification from the viewpoint of developers who are unfamiliar with the inspected systems, in order to avoid biases in the identification of code smells. In other words, the developers have no previous knowledge about the software system in which they identify code smells. Second, we compute two metrics to compare the developers' effectiveness in code smell identification: precision and recall [26]. Basically, precision measures the correctness of the identified code smells, and recall measures the completeness of the identification with respect to all existing code smells, as determined by a specialist in software development or code smells. To compute these metrics, we built a reference list of code smells, i.e., an itemization of code smells thoughtfully identified in the software systems [41]. Details on how we built the reference list can be found in Section 3.4. Third, we derived the following null and alternative hypotheses from RQ1.

• HP0. There is no difference in precision between collaborators and single developers in smell identification.
• HP1. There is a difference in precision between collaborators and single developers in smell identification.
• HR0. There is no difference in recall between collaborators and single developers in smell identification.
• HR1. There is a difference in recall between collaborators and single developers in smell identification.
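The two metrics behind these hypotheses can be sketched with the standard definitions of precision and recall computed against a reference list. The function `precision_recall` and the smell instances below are hypothetical and serve only as an illustration; they are not data from the experiment.

```python
def precision_recall(identified, reference):
    """Compute precision and recall of identified smells against a reference list.

    precision = correctly identified smells / all identified smells
    recall    = correctly identified smells / all smells in the reference list
    """
    identified, reference = set(identified), set(reference)
    true_positives = identified & reference
    precision = len(true_positives) / len(identified) if identified else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference list: (code element, smell type).
reference_list = {
    ("A", "Data Clumps"),
    ("D", "Data Clumps"),
    ("C", "Large Class"),
    ("C.run", "Long Method"),
}

# Hypothetical answer of one subject (or one pair of collaborators).
identified_smells = {
    ("A", "Data Clumps"),
    ("C", "Large Class"),
    ("B", "Large Class"),  # not in the reference list, so a misidentification
}

precision, recall = precision_recall(identified_smells, reference_list)
```

Here two of the three identified smells are in the reference list, and two of the four reference smells were found, so precision is 2/3 and recall is 1/2.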

We discuss each hypothesis as follows. With the null hypotheses, we assume that the number of developers working on the identification of code smells does not make it more or less effective, with respect to precision (HP0) or recall (HR0). On the other hand, the alternative hypotheses indicate that there is a difference between the effectiveness of code smell identification by collaborators and single developers, with respect to precision (HP1) or recall (HR1).

3.2. Characterization of the participants

This study involved 34 developers as subjects. They were classified according to two levels of working experience, namely novice developers and professional developers. This classification aimed at helping to understand whether each level benefits differently (or not) from collaborative smell identification. It also aimed at supporting the generalization of our results, since we consider developers with different working experience. We introduce each level as follows.

• Novice developers are subjects with little or no experience in industrial software development. We selected these subjects from a software engineering course of a Brazilian undergraduate program in Computer Science.
• Professional developers are subjects currently working in industrial software development who hold at least one year of experience. We adopted this threshold mostly because the high developer turnover in the selected organizations made it unfeasible to select developers with many more years of experience. We selected subjects from different organizations, some of which are multinational; these companies typically perform maintenance tasks, and their developers are concerned about identifying and eliminating code smells.

Our experiment consists of two sessions: one with novice developers in an academic laboratory, and another with professional developers in their working environment. In each session, subjects first performed smell identification in isolation and, after that, they performed the same task in collaboration. To participate in the study, all subjects signed an informed consent form. The subjects also filled out a characterization questionnaire with closed questions about their expertise in four topics related to the study: programming, Java, Pair Programming (PP), and code smells. We chose PP, which is well known in both the literature and the industry [42,43], as a reference for measuring experience with collaborative work. Table 2 presents the data collected from the subject characterization questionnaire with respect to all subjects. The first column lists the two levels of working experience, i.e., novice developers (16 subjects in total) and professional developers (10 subjects in total). The second column provides a label for each subject, aimed at keeping their identity anonymous. The remaining columns present the experience reported by each subject regarding each aforementioned topic related to our study. We ranked the knowledge of subjects per topic (Table 2) in four categories: none means I never had contact with it; low means I had contact with it in classes or instructional material; medium means I had contact with it in the context of academic systems; and high means I had contact with it for at least one year in industrial systems.

In the case of Java, the knowledge of the subject is: none when the subject never had contact with the Java programming language; low when the subject had contact with Java only through classes or by reading instructional material; medium when the subject had contact with Java only in the context of academic systems developed in academic courses and laboratories; and high when the subject had contact with Java for at least one year in industrial software systems. Note that professionals might not be experienced with Java in the industry but with another language. A similar rationale was applied to the other topics, such as code smells. We analyze Table 2 to identify subjects with a high degree of knowledge about our topics of interest, collected via the characterization questionnaire, when compared to other subjects. The table highlights these subjects in boldface. We say that a subject has sufficient knowledge about the study topics when the subject has at least medium knowledge in at least two topics, since this corresponds to at least half of the related topics. We observe that: 28 out of 34 subjects have medium to high knowledge in programming and Java; 27 out of 34 subjects have medium to high knowledge in Pair Programming; 25 out of 34 subjects have medium to high knowledge in code smells; and no subject lacks knowledge in all four topics. We then conclude that our subjects met the minimum requirements to take part in our experiment.

3.3. Target software systems and data sources

To investigate collaborative smell identification, we selected a set of target software systems and data sources for analysis. We present both as follows.

3.3.1. Target software system

This study focuses on the identification of code smells by the subjects. Thus, we selected a set of target systems for the subjects to use during such identification. For this purpose, we selected two industry systems,


Table 2. Characterization of subjects for the controlled experiment.

Working experience       Subject  Programming  Java    Pair Programming  Code Smells
Novice developers        s1       Medium       Medium  Medium            Low
                         s2       Medium       Medium  Medium            Medium
                         s3       Low          Low     Medium            Low
                         s4       Low          Low     Medium            Low
                         s5       Low          Low     Medium            Low
                         s6       Low          Low     Medium            Low
                         s7       Low          Low     Medium            Low
                         s8       Medium       Medium  Medium            Medium
                         s9       Medium       Medium  Medium            Medium
                         s10      Low          Low     Medium            Low
                         s11      Medium       Medium  Medium            Medium
                         s12      Medium       Medium  Medium            Low
                         s13      Medium       Medium  Medium            Medium
                         s14      Medium       Medium  Medium            Medium
                         s15      Medium       Medium  Medium            Low
                         s16      Medium       Medium  Medium            Medium
                         s17      High         High    High              High
                         s18      Medium       Medium  Low               Medium
Professional developers  s19      High         High    High              Medium
                         s20      High         High    None              High
                         s21      High         High    High              High
                         s22      High         High    High              High
                         s23      High         High    High              High
                         s24      High         High    Low               Medium
                         s25      High         High    Medium            High
                         s26      High         High    Low               Medium
                         s27      High         High    Low               Medium
                         s28      High         High    Low               High
                         s29      High         High    Medium            High
                         s30      High         High    Low               High
                         s31      High         High    Medium            Medium
                         s32      High         High    High              High
                         s33      High         High    High              High
                         s34      High         High    Medium            High
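As an illustrative cross-check of the counts reported in Section 3.2, the sketch below tallies the Table 2 entries per topic. The one-letter encoding (N, L, M, H for none, low, medium, high) and the function names are our own assumptions for illustration; they are not part of the original study materials.

```python
# Ordinal ranking of the four knowledge categories (encoding assumed for this sketch).
RANK = {"N": 0, "L": 1, "M": 2, "H": 3}

# One character per subject, s1..s34, in Table 2 order (18 novices, then 16 professionals).
TABLE2 = {
    "Programming": "MMLLLLLMMLMMMMMMHM" + "H" * 16,
    "Java": "MMLLLLLMMLMMMMMMHM" + "H" * 16,
    "Pair Programming": "M" * 16 + "HL" + "HNHHHLMLLLMLMHHM",
    "Code Smells": "LMLLLLLMMLMLMMLMHM" + "MHHHHMHMMHHHMHHH",
}


def medium_or_higher(topic):
    """Count subjects whose knowledge of a topic is medium or high."""
    return sum(RANK[level] >= RANK["M"] for level in TABLE2[topic])


def has_sufficient_knowledge(subject_index):
    """Apply the rule of at least medium knowledge in at least two topics."""
    topics_ok = sum(
        RANK[levels[subject_index]] >= RANK["M"] for levels in TABLE2.values()
    )
    return topics_ok >= 2


for topic in TABLE2:
    print(f"{topic}: {medium_or_higher(topic)} of {len(TABLE2[topic])}")
```

The tallies can be compared directly with the per-topic counts stated in the text; `has_sufficient_knowledge` takes a zero-based subject index (0 for s1).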

namely Java IO and Java Print, which belong to the Java Core project (http://openjdk.java.net/groups/core-libs). This selection relies on the following criteria. First, each system has to be open source, which allows the study replication. Second, each software system has to enable the identification of code smells using the Stench Blossom tool in its default settings [38], which relies on well-known detection strategies for code smells [7] and provides a visualization of the code smell suspects. Third, each software system has to be affected by multiple code smells. To support the generality of our findings, we selected systems in which we identified different smell types, with varying granularity. We focus on four smell types [6]: Data Clumps, Large Class, Long Method, and Message Chain. Both Large Class and Long Method are intra-class smells, which locally affect a single class of the software system. In turn, Data Clumps and Message Chain are inter-class smells, which affect multiple classes. The selected smell types affect different code structures in a software system, such as methods and classes [6]. All selected smell types are reportedly very frequent in software systems [4,5].

Both source code files selected for inspection were developed within the context of the same project. As a consequence, we could assume that the implementation decisions that might lead to the occurrence of code smells are similar among files. This assumption is supported by the smell types identified in each code file. We observe that almost all code smell types occur in equal proportion in both files: 1 Large Class instance per file, 2 Data Clumps instances per file, and 1 Message Chain instance per file. The exception is Long Method, with 7 instances in Java IO against 6 instances in Java Print.

We have some observations about the complexity of both source code files. Even though the files differ in terms of Number of Lines of Code [44], this metric has been shown to be inappropriate for measuring file complexity [45], which led us to use another complexity metric. For instance, by computing the complexity of both files via Weighted Methods per Class (WMC) [35], we obtained similar results: WMC equals 6 for the Java Print file, against 4 for the Java IO file. These results suggest again that both files have a quite similar complexity for the analysis and identification of code smells. We then conclude that comparing both files in our study (Section 4) is acceptable.

3.3.2. Data sources

We collect experimental data of the subjects from different data sources, namely: the subject characterization questionnaire, the list of smells identified by developers, and the post-experiment questionnaire. We combined the data obtained via these data sources to balance their strengths and limitations. We describe each data source as follows.

• Subject characterization questionnaire: it is composed of questions aimed at characterizing each subject in terms of their knowledge of the topics of interest.
• List of smells identified by developers: it is a form aimed at collecting the list of code smells identified by each subject and pair during the experiment.
• Post-experiment questionnaire: it is composed of questions aimed at collecting the perception of subjects regarding the code smell identification conducted in the experiment.

3.4. Data analysis procedures

As aforementioned, we conducted an empirical study which uses multiple data sources for data analysis. Thus, we carefully designed our data analysis procedures. We present each procedure as follows.


3.4.1. Creation of a code smell reference list

We built a code smell reference list to support the data analysis. For this purpose, we recruited two researchers, who are PhD students with experience in software maintenance research and knowledge of software development and the identification of code smells. The researchers identified code smells in the selected projects in a complementary way. That is, one researcher conducted the manual smell identification, without tool support, and the other used the Stench Blossom tool [38]. Each researcher then obtained a list of code smell suspects; the lists were not exactly the same due to the subjectiveness of the identification of code smells. To reach a consensus, we computed the agreement between the lists of code smell suspects reported by both researchers. For each smell suspect, we have an agreement whenever both researchers confirmed or both refuted the suspect. Conversely, we have a disagreement whenever the researchers diverged in opinion. The researchers then conducted an open discussion to resolve the disagreements. Finally, we built the final code smell reference list.

3.4.2. Quantitative data analysis

Our study assesses the effectiveness of both collaborators and single developers on the identification of code smells. For this purpose, we compute the developers' effectiveness in terms of two well-known metrics, namely precision and recall [26]. Precision measures the correctness of the identified code smells. Recall measures the completeness of the identified code smells with respect to all code smells which occur in a system [26]. To compute these metrics, we used the aforementioned code smell reference list, which is an itemization of code smells identified in a system [41]. Precision and recall were calculated based on the number of code smells marked as true positive (TP), false positive (FP), and false negative (FN).

TP means that the developer identifies a code smell that appears in the code smell reference list. FP means that the developer identifies a code smell that does not appear in the code smell reference list. FN means that a code smell appears in the code smell reference list but the developer was unable to identify it. Both precision and recall range from 0 to 1. High precision values (close to 1) mean that the developer reported, proportionally, only a few occurrences of FP in the software system. High recall values (close to 1) mean that the developer was able to identify a representative number of occurrences of TP in the software system. Eqs. (1) and (2) present the formulas.

\[ \mathit{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \mathit{Recall} = \frac{TP}{TP + FN} \tag{2} \]
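To make Eqs. (1) and (2) concrete, the following minimal sketch computes both metrics from a developer's reported smells and a reference list. The function name and the sample smell instances are hypothetical and invented purely for illustration; they are not data from the experiment.

```python
def precision_recall(identified, reference):
    """Return (precision, recall) of reported smells against a reference list."""
    identified, reference = set(identified), set(reference)
    tp = len(identified & reference)  # reported smells that are in the reference list
    fp = len(identified - reference)  # reported smells that are not in the reference list
    fn = len(reference - identified)  # reference smells that were not reported
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical (smell type, location) pairs, for illustration only.
reference = {
    ("Long Method", "m1"),
    ("Long Method", "m2"),
    ("Large Class", "C1"),
    ("Message Chain", "m3"),
}
identified = {
    ("Long Method", "m1"),
    ("Large Class", "C1"),
    ("Data Clumps", "m4"),
}

p, r = precision_recall(identified, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case, two of the three reported smells are in the reference list (TP = 2, FP = 1) and two reference smells are missed (FN = 2). Returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the study specifies.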

We applied the two-tailed Mann-Whitney test, which is a non-parametric statistical test, aimed at rejecting our null hypotheses. The reason for selecting a non-parametric test is as follows. Based on the normality test, we observed that the distributions of both precision and recall are normal. We consider a confidence level of 95%, i.e., a significance level (alpha) of 5% (p-value < 0.05), to compare the data distributions. However, after applying Levene's test [46], we observed that the distribution of recall is not homoscedastic, which requires the application of a non-parametric test. To avoid applying different statistical tests for precision and recall, and due to the limited sample of our study, we decided to apply the non-parametric test to both. We used the Minitab tool [47] to apply the statistical test.

3.4.3. Complementary discussion

As aforementioned, we conducted a quantitative analysis on the effectiveness of both collaborative and individual smell identification. In addition, we conducted a complementary analysis based on the follow-up questionnaire, which was applied after the experiment execution with the developers. This complementary analysis aimed at understanding the feedback of subjects regarding the experiment, mainly focused on the difficulties faced by the subjects to identify code smells. The analysis also aimed at understanding the subjects' viewpoint on the identification of code smells, especially collaborative smell identification.

3.5. Experiment design

We designed four steps to guide our controlled experiment, which we conducted in two sessions, each with a group of participants classified according to their level of working experience. The first session was conducted with the novice developers, while the second session was conducted with the professional developers. We asked all subjects to first fill out and sign a consent questionnaire. After that, the subjects engaged in the experiment.

3.5.1. Step 1. Apply the subject characterization questionnaire

The subject characterization questionnaire aims to characterize the experiment subjects. The responses obtained through this questionnaire allowed us to identify some key characteristics of each subject, as presented in Section 3.2.

3.5.2. Step 2. Training of subjects

After characterizing the subjects, we provided them with a training session. This training aimed at supporting subjects to properly understand and execute the experiment. The training was organized in two parts. First, for 25 minutes, we explained the technical concepts and terminology related to this study. Second, we took 10 minutes to conduct a discussion about the concepts. Regarding code smells, we provided explicit definitions and practical examples. This training was provided to both novice and professional developers.

3.5.3. Step 3. Smell identification task

The experiment tasks were conducted in four rounds. In the first round, 26 subjects (s1 to s16 and s25 to s34) individually inspected Java IO. In the second round, the same subjects collaboratively inspected Java Print. In the third round, 8 subjects (s17 to s24) individually inspected Java Print. In the fourth round, the same subjects collaboratively inspected Java IO. The teams of subjects were allocated randomly.

In each round, the subjects were asked to annotate the identified code smells in the code smell report questionnaire. This procedure allowed us to compute the number of true positives and false positives. All subjects performed the experiment simultaneously under the supervision of the researchers. Each round lasted 60 minutes, dedicated exclusively to the identification of code smells. We did not swap the order of individual and collaborative tasks because previous studies [22,27] show that such order does not significantly affect the effectiveness of code smell identification.

3.5.4. Step 4. Answer the follow-up questionnaire

After participating in the experiment, the participants filled out a follow-up questionnaire. This questionnaire aimed at collecting the perception of each subject regarding the experiment. We aimed at understanding their opinion about the identification of code smells and the experience of working collaboratively to identify code smells.

4. Results of the controlled experiment

This section answers RQ1: Is collaborative smell identification more effective than individual smell identification? Section 4.1 analyzes the distribution of precision and recall to confirm or refute the hypotheses of Section 3.1. Section 4.2 complements the findings based on the analysis per subject.

4.1. Analysis of distribution for precision and recall

At first, we analyzed the distribution of precision for collaborators and single developers. We aimed at understanding whether the collaborators tend to obtain a higher precision in the code smell identification when compared to single developers. We also computed the average


Fig. 1. Distribution of precision for developers.

precision for collaborators and single developers, by summing their precision regardless of their working experience and dividing this value by the total number of participants. Thus, we investigate the alternative hypothesis HP1 as follows. Fig. 1 presents the distribution of precision for collaborators and single developers, respectively. The figure also indicates the average precision for both collaborators and single developers. Overall, we observe an average precision of 0.76 (76%) for collaborators against 0.49 (49%) for single developers. Our results suggest that the average precision of collaborators was 27 percentage points higher than that of single developers. By applying the Mann-Whitney test, we observed a significant difference between precision values (p-value = 0.001). In summary, our results lead us to reject the null hypothesis HP0 and accept HP1.

We then analyze the distribution of recall for collaborators and single developers. Similarly to precision, we compute the average recall as follows. First, we sum the recall of developers regardless of their working experience. Second, we divide this value by the total number of developers, which results in the average recall for both collaborators and single developers. Thus, we investigate the alternative hypothesis HR1 as follows. Fig. 2 presents the distribution of recall. We observed an average recall of 0.63 (63%) for collaborators against 0.27 (27%) for single developers. Our results suggest that the average recall of collaborators was 36 percentage points higher than that of single developers. By applying the Mann-Whitney test, we observed a significant difference between recall values (p-value = 0.001). Consequently, our results lead us to reject the null hypothesis HR0 and accept HR1.

In summary, our results for precision and recall suggest that developers tend to identify more smells when working collaboratively. Particularly, we observed that collaborators obtained higher precision and recall than single developers. These results have two main implications. First, collaborators tend to make fewer mistakes when identifying code smells, i.e., they obtain higher precision. Second, collaborators are able to identify a more representative number of code smells in the software systems than single developers. These results lead us to Finding 1.

Finding 1: Collaborators tend to be more effective than single developers when identifying code smells in software systems.

Table 3 presents a complementary analysis per working experience (novices and professionals), aimed at assessing any biases on the results of precision and recall caused by the working experience of the developers. The first column lists the working experience. The second column lists the experiment groups (individual and collaborators). The third and fourth columns present the average and median precision. The fifth and sixth columns present the average and median recall.

Fig. 2. Distribution of recall for developers.
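For readers unfamiliar with the statistical comparison used in this section, the sketch below illustrates a two-tailed Mann-Whitney U test with the standard large-sample normal approximation, using only the Python standard library. It is an illustration under stated assumptions, not the procedure run in Minitab: the function names and the sample values are hypothetical, and no tie or continuity correction is applied, so statistical packages may report somewhat different p-values.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-tailed p-value) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u  # z <= 0 because u is the smaller statistic
    return u, min(2 * NormalDist().cdf(z), 1.0)


# Hypothetical per-participant values, invented for illustration only.
single = [0.3, 0.4, 0.5]
pairs = [0.7, 0.8, 0.9]
u_stat, p_value = mann_whitney_u(single, pairs)
print(f"U={u_stat}, p={p_value:.4f}")
```

With samples as small as in this toy input, the normal approximation is rough; an exact test would be preferable in practice.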


Table 3. Comparison of precision and recall per working experience.

Working Experience  Group           Precision           Recall
                                    Average   Median    Average   Median
Novice              Individual      0.61      0.50      0.25      0.27
                    Collaborators   0.74      0.76      0.72      0.76
Professional        Individual      0.38      0.39      0.29      0.28
                    Collaborators∗  0.77      0.83      0.54      0.50

∗ Two pairs have one novice and one professional; we counted these pairs here.
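The gaps between novices and professionals can be recomputed from Table 3 with a few lines of code. The sketch below is only an arithmetic illustration; the dictionary layout and function name are our own assumptions, and the values are copied from the table.

```python
# Values copied from Table 3, in the order:
# (average precision, median precision, average recall, median recall).
TABLE3 = {
    ("Novice", "Individual"): (0.61, 0.50, 0.25, 0.27),
    ("Novice", "Collaborators"): (0.74, 0.76, 0.72, 0.76),
    ("Professional", "Individual"): (0.38, 0.39, 0.29, 0.28),
    ("Professional", "Collaborators"): (0.77, 0.83, 0.54, 0.50),
}
METRICS = ("avg precision", "median precision", "avg recall", "median recall")


def gap_in_points(group):
    """Absolute novice-vs-professional gap per metric, in percentage points."""
    novice = TABLE3[("Novice", group)]
    professional = TABLE3[("Professional", group)]
    return {
        metric: round(abs(n - p) * 100)
        for metric, n, p in zip(METRICS, novice, professional)
    }


for group in ("Individual", "Collaborators"):
    print(group, gap_in_points(group))
```

The results are absolute differences expressed in percentage points, which is how the gaps between the two experience levels are discussed in the text.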

In general, we observe that the results of the complementary analysis for both working experiences confirm Finding 1, i.e., they show that collaborators tend to have higher precision and recall than single developers. However, by comparing both working experiences, we observed a non-negligible difference. In the case of single developers, novices and professionals differ in average and median precision (by 23 and 11 percentage points, respectively), and in average and median recall (by 4 and 1 percentage points). In the case of collaborators, they differ in average and median precision (by 3 and 7 percentage points, respectively), and in average and median recall (by 18 and 26 percentage points). Thus, although working experience somehow affects Finding 1, it remains valid that collaboration improves precision and recall regardless of the working experience. This discussion leads us to Finding 2.

Finding 2: Regardless of the working experience of software developers, collaborators have performed more effectively than single developers in the identification of code smells.

4.2. Comparing precision and recall of collaborators and single developers

After analyzing the distribution of precision and recall presented in Section 4.1, we conducted a more detailed analysis. We aimed at deeply understanding the effectiveness of collaborative smell identification per developer, given that each developer performed the identification of code smells both as a collaborator and as a single developer. First, we analyze precision as follows. Fig. 3 presents the precision of collaborators and single developers per subject. For each set of three consecutive bars (two gray and one black bar), we compare the precision of two developers as single developers with the precision of both developers as collaborators. Overall, 12 of the 17 sets (70.59%) obtained higher precision as collaborators.

In addition, for 4 of the 5 remaining sets (80% of the remaining sets), at least one developer improved their precision when working as a collaborator (by comparing the gray bars of the single developers with the black bar of the corresponding collaboration). It implies that collaboration improves the effectiveness of at least one developer involved in the collaboration (as observed for 4 out of the 5 cases), by reducing the number of incorrectly identified code smells. By analyzing the follow-up questionnaire, we may draw additional conclusions on the benefits of collaborative smell identification. All subjects stated that the collaboration minimized their frustrations and improved their confidence during the identification of code smells. For example, subject s16 said: "The collaboration has strengthened the communication between members and the possibility of a more precise analysis because four eyes see more than two... consequently we were more confident in our work." In turn, subject s28 said: "The discussions with my partner were essential for understanding the long chaining of methods and for confirming the existence of Message Chain."

Next, we analyzed recall as follows. Fig. 4 presents the recall values obtained by collaborators and single developers. Each set of three consecutive bars (two gray and one black bar) compares the recall of two subjects working as single developers with the recall of both together working as collaborators. Overall, 13 of the 17 sets (76.47%) obtained higher recall as collaborators. In addition, for 2 of the 4 remaining sets (50% of the remaining sets), at least one developer improved their recall when working as a collaborator; this result is observed by comparing the gray bars of single developers with the black bar of the corresponding collaboration. It reinforces our findings of Fig. 2 and suggests that collaborators tend to identify a more representative number of code smells in the software systems, when compared with single developers. By analyzing the follow-up questionnaire, we draw the following observations.

We found that, during the collaborative smell identification, one collaborator was usually responsible for selecting a code smell suspect, after which both collaborators started discussing the suspect. We have also found that collaborators have more confidence to confirm a code smell suspect when compared with single developers. Consequently, collaborators are able to identify a larger number of code smells in software systems. This observation is reinforced by the opinion of the subjects, such as s7, who said: "The greatest potential of working collaboratively was the possibility of adding different strategies to determine a code smell. This fact was only possible thanks to different experiences that each one of us has." Finally, the follow-up form revealed the certainties and uncertainties of the subjects about each code smell suspect. For instance, when the collaborators were uncertain about a code smell suspect, both ended up not confirming that suspect as an actual code smell. Thus, by relying on the comments of subjects like s7 and s28, we conclude that collaborators may exchange information and,

Fig. 3. Precision of collaborators and single developers per subject.


Fig. 4. Recall of collaborators and single developers per subject.

consequently, improve their effectiveness when compared with single developers, which leads us to Finding 3.

Finding 3: The exchange of information among collaborators has the potential to improve the effectiveness of the code smell identification.

5. Interviews with software project leaders

This section reports the interviews performed with software project leaders of real-world development teams. Section 5.1 introduces our interview goals and the procedures performed for collecting and analyzing data. Section 5.2 presents the interviewee characterization. Section 5.4 discusses the project leaders' perceptions on the need for collaborating along the identification of code smells. Finally, Section 5.5 explores the leaders' perceptions on the potential to adopt collaborative smell identification in practical settings.

5.1. Interview goal

As a complement to the quantitative analysis presented in Sections 3 and 4, we decided to perform interviews with software project leaders. Our major goal was understanding whether, based on our empirical evidence that collaborative smell identification can benefit development teams in different manners (e.g., by enhancing the identification precision), leaders: (1) acknowledge the need for collaborating and (2) would employ collaborative smell identification in their respective teams. Based on a well-known guideline [30], we systematically designed our interview goal as follows.

• Analyze the perceptions of software project leaders,
• For the purpose of understanding the likeliness of leaders to adopt collaborative smell identification in practice,
• With respect to the need for collaborating and the potential adoption of collaborative smell identification in the leaders' teams,
• From the perspective of project leaders of varied organizations,
• In the context of Java software systems that novice and professional developers are unfamiliar with, i.e., systems about which developers have no previous knowledge.

From the interview goal, we designed the following research question (RQ2).

RQ2. Are software project leaders likely to adopt collaborative smell identification in their development teams?

Our study addresses RQ2 as follows. First, we summarized the results of our quantitative study reported in Sections 3 and 4. Our goal was to familiarize participants with key evidence on the benefits of both collaborative and individual smell identification. Because the studies that provide insights on collaborative versus individual smell identification are quite recent, our subjects were probably unaware of the effects that collaboration may have on the identification effectiveness. Asking participants about the use of either collaborative or individual smell identification without the proper background would provide us with insufficient results; participants would guess rather than reason about their responses. Second, we carefully designed a structured interview [48] in order to capture the perceptions of software project leaders on collaborative smell identification. Third, we performed a thematic synthesis [49] for aggregating the qualitative interview data by topic, aiming to properly answer RQ2.

5.2. Characterization of the interviewees

Table 4 summarizes the background of the five software project leaders that we recruited for interviewing. The first column identifies each leader, from L1 to L5. The second column reports the highest education level of each leader. All leaders have at least a Master's degree in Computer Science or related areas; one leader has a PhD and another has a postdoc. The third column reports the leadership experience of each leader in years. Four out of the five leaders have at least one year of experience in leading development teams. The fourth column reports

Table 4. Characterization of software project leaders.

ID  Education  Years as Leader  Avg. Size of Teams  Adopted Development Processes  Familiarity w/ Smells
L1  PhD        2                6                   Scrum                          Totally agree
L2  MSc        2                6                   Agile, MPS.BR                  Totally agree
L3  Postdoc    1                5                   Scrum                          Totally agree
L4  MSc        5                6                   Agile, Kanban, Lean, Scrum∗    Partially agree
L5  MSc        15               5                   CMMI, MPS.BR, Scrum            Totally agree

∗ Plus one development process that is less popular and that we were not able to cite.


the average team size led by each software project leader. All teams are composed of five or six developers on average. The fifth column lists the software development processes adopted by the teams led by each project leader. All projects led by the interviewees adopt agile development practices, such as Scrum [50] and Lean [51]. The sixth column reports the familiarity of software project leaders with the concept of code smells. All leaders at least partially agreed that they are familiar with code smell definitions.

5.3. Interview design

We carefully designed and performed a four-step interview protocol. We describe each step in sequence.

5.3.1. Step 1: Recruit software project leaders.

Our first step consisted of recruiting software project leaders from different development organizations. As discussed in Section 5.2, five project leaders kindly volunteered to participate as interviewees. We first contacted these leaders via email or video-conference, and they agreed to have their interview data analyzed and reported anonymously.

5.3.2. Step 2: Instruct leaders about the interview questionnaire.

We decided to perform each interview separately in order to make the project leaders as comfortable as possible with answering the questions. Before applying the questionnaire, we instructed each leader with respect to (1) the general interview goal, (2) the privacy policies adopted with respect to the interview data, and (3) the basics of code smells and code smell identification.

5.3.3. Step 3: Apply the questionnaire.

We applied the interview questionnaire with each software project leader. The questionnaire is composed of two questions, which we describe in sequence.

• Question 1: Do you believe that discussion among development team members is necessary to either confirm or refute the occurrence of a code smell? Please justify your answer. This question captures the leaders' opinions on the need for collaborating while validating smell suspects.
• Question 2: As a team leader, would you allocate two or more development team members for either confirming or refuting the occurrence of code smells? Please justify your answer. Complementarily, this question aims to capture the potential adoption of collaborative smell identification in real settings, also from a project leader's perspective.


5.3.4. Step 4: Perform the thematic synthesis. This step consisted of applying thematic synthesis procedures [49] to understand the interviewees’ answers. Since both questions posed to the project leaders are open-ended, the leaders were free to answer whatever they wanted. Thus, we decided: (1) to tabulate all answers by question; (2) for each question, to extract the main discussion topics that emerged from each answer; and (3) to derive the themes by grouping similar discussion topics. We performed all three procedures in a pair to avoid biases and missing data. After performing the thematic synthesis, we built visual models that represent the data (see Sections 5.4 and 5.5 for details).

5.4. On the need for collaborating along the identification of code smells

We present below the full answers of all five leaders to Question 1.

Yes! Code smells are subjective, thus a development team has to agree on the validity of code smell suspects. Discussions help in deciding which suspects to eliminate based on the code quality. – L1

I think so. Some code smells affect different modules of a software project. One or more developers may be responsible for managing each single module. Thus, discussions could make it easier to identify code smells. – L2

There may be some gain in collaboration to identify code smells. Similarly to techniques like code inspection, collaborators may be more effective than single developers, especially when the task is reasonably complex. – L3

Yes! My personal leadership experience suggests that brainstorms with divergent opinions are better for reaching consensus. – L4

I believe that discussion may benefit development teams, since developers typically have different experiences. – L5

Fig. 5 visually represents the qualitative data extracted from the Question 1 answers via thematic synthesis procedures. The top box summarizes the interview question.
The boxes immediately below (Leader Attributes and Code Smell Properties) represent the major discussion themes. The boxes below them (Response Confidence, Smell Subjectivity, and Smell Structure) represent discussion topics. Finally, the bottom boxes (Experience, Consensus, etc.) represent the discussion subtopics. The lines connecting two boxes represent either a theme-topic or a topic-subtopic relationship. We discuss the obtained results as follows.

• All leaders agreed that collaboration is important for validating, i.e., confirming or refuting, smell suspects (cf. Response Confidence in Fig. 5). However, three leaders (L2, L3, and L5) were not strongly confident in their answers. This observation relies on quotes like "I think so" (L2), "There may be some gain..." (L3), and "may benefit" (L5).
• Leaders have mentioned subjective aspects that may positively affect the collaborative smell identification (cf. Smell Subjectivity). These aspects are Experience ("developers typically have different experiences" by L5), Consensus ("reaching consensus" by L4), and Authorship ("be responsible for managing each single module" by L2).
• Leaders also mentioned structural aspects of code smells that may be better perceived by developers working in collaboration rather than individually. These aspects are Granularity ("Some code smells affect different modules" by L2) and Complexity ("the task is reasonably complex" by L3). The complexity of code smell detection is often associated with the complexity of the poor code structure underlying the code smell instance [4,5].

Finding 4: The interviewed project leaders agree that collaboration is important when validating code smell instances. Their responses typically associate a potential benefit of collaboration with the inherent subjectivity of code smell identification and the underlying code smell structure.

Fig. 5. Model of themes and topics of Question 1.

Fig. 6. Model of themes and topics of Question 2.
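The topic and subtopic structure reported for Question 1 can be sketched as a small data model. The snippet below is only an illustration and not the tooling used in the study: the tuple layout and the function name `build_topic_model` are hypothetical, while the topics, subtopics, and leader labels come from the discussion above.

```python
from collections import defaultdict

# Coded excerpts for Question 1 as reported in the text:
# (topic, subtopic, leader). The tuple layout is a hypothetical encoding.
CODED_EXCERPTS = [
    ("Smell Subjectivity", "Experience", "L5"),
    ("Smell Subjectivity", "Consensus", "L4"),
    ("Smell Subjectivity", "Authorship", "L2"),
    ("Smell Structure", "Granularity", "L2"),
    ("Smell Structure", "Complexity", "L3"),
]


def build_topic_model(excerpts):
    """Group coded excerpts into topic -> subtopic -> list of leaders."""
    model = defaultdict(lambda: defaultdict(list))
    for topic, subtopic, leader in excerpts:
        model[topic][subtopic].append(leader)
    # Convert to plain dicts so the result is easy to print and compare.
    return {topic: dict(subtopics) for topic, subtopics in model.items()}


topic_model = build_topic_model(CODED_EXCERPTS)
for topic, subtopics in topic_model.items():
    print(topic, subtopics)
```

Grouping by topic first mirrors the third synthesis procedure (deriving themes by grouping similar discussion topics); a further theme level could be added in the same way.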

5.5. On the adoption of the collaborative smell identification

We present below the full answers of all five leaders to Question 2.

I would love to. Discussions can help teams in enhancing the code quality, but first I would opt for asynchronous discussions via code review tools (so that developers use in-line code comments to document their viewpoints). – L1

Yes! Merging opinions can leverage the accuracy in the identification of certain code smell types, mainly those that affect different modules. – L2

I would first identify code smell suspects via static analysis. Depending on project criticality, desired code quality, etc., I would not hesitate to promote collaboration towards enhanced precision, mainly to validate suspects. – L3

Yes, because I suppose that collaborators are more likely to identify and validate smell suspects than individuals. Each individual has his own opinion and, therefore, discussions among collaborators can reveal varied viewpoints. – L4

I would but, unfortunately, both budget and time are usually scarce to allocate developers for inspecting source code. – L5

Fig. 6 visually represents the qualitative data extracted from the Question 2 answers via thematic synthesis procedures. The top box summarizes the interview question. The boxes immediately below (Leader Attributes, Organization Attributes, Collaboration Goal, Developers Arrangement) represent the major discussion themes. The boxes below them (Response Confidence, Availability of Resources, etc.) represent discussion topics. Finally, the bottom boxes (Consensus, etc.) represent the discussion subtopics whenever applicable. The lines connecting two boxes represent either a theme-topic or a topic-subtopic relationship. We discuss the obtained results as follows.

• Surprisingly, all leaders said they would adopt the collaborative smell identification in their development teams (cf. Response Confidence in Fig. 6). In this case, only two leaders (L3 and L5) were not strongly confident in their answers. This observation relies on quotes like "Depending on project criticality..." (L3) and "I would but..." (L5).
• Leaders have mentioned organization attributes that are decisive to the practical adoption of collaborative smell identification (cf. Organization Attributes). These aspects are Availability of Resources ("both budget and time are usually scarce" by L5) and Availability of Tools ("via code review tools" by L1).
• Leaders also mentioned various goals behind allocating developers to collaborate along the code smell identification. These goals are Code Quality Improvement ("enhancing the code quality" by L1), Validation Consensus ("can reveal varied viewpoints" by L4), and Accuracy Enhancement ("leverage the accuracy" by L2).
• Finally, leaders mentioned different ways to arrange collaborators: asynchronous ("I would opt for asynchronous discussions..." by L1) and synchronous identification. We considered the latter category for L2, L4, and L5 because they did not make explicit the need for asynchronous discussions among developers (we then suppose that they would arrange collaborators to work at the same time and, sometimes, at the same location).

Finding 5: All interviewed project leaders said they would adopt the collaborative smell identification for different reasons, including the accuracy enhancement observed through our quantitative study. However, budget, time, and tool constraints can limit such adoption in practice.

6. Study comparison-based discussions

With respect to our controlled experiment, we partially reused the design of our previous studies [22,23] to better understand the effectiveness of collaborative smell identification. Thus, we conducted an empirical study with novice and professional developers, who had to identify code smells in software systems they are unfamiliar with. Due to the considerable number of changes applied to the existing study designs, we do not characterize our study purely as a replication. In particular, we have added a qualitative study based on interviews [49] with software project leaders. Our empirical results suggest that both novice and professional developers obtained 27% more precision and 36% more recall through the collaborative smell identification. Besides that, we have found a significant difference between the effectiveness of single developers and collaborators.

Information about our first previous study. In our first study [23], we provided empirical evidence on the effectiveness of collaborative smell identification by conducting an empirical study with novice developers, who had to identify code smells in software systems they are unfamiliar with. As a result of this study, we observed that collaborators identify a higher average number of code smells because they almost always had to consider the knowledge of both developers when revealing scattered, complementary symptoms associated with a single smell type. We also derived empirical evidence that adding more than two developers to the task of collaborative smell identification does not necessarily improve their identification effectiveness.


Also, we observed that novice developers perform several collaborative activities during the identification of code smells.

Comparison between this study and our first study [23]. In our previous study, we assessed the collaborative smell identification from the perspective of novice developers who are unfamiliar with the software systems under analysis. In other words, the developers had no previous knowledge about the software system in which they identified code smells. To compare the effectiveness of collaborators and single developers, we computed the average number of identified code smells. However, we know that considering only the average number of identified smells was a threat to validity in that study, since this average does not reveal important aspects of the effectiveness of smell identification, which we have been able to capture through other metrics such as precision and recall. Therefore, in this study, we evaluated precision and recall and, surprisingly, the results confirm that collaboration benefits smell identification even in terms of these metrics. There are two implications of our study results. The first implication is that novice developers working in collaboration have shown themselves competent enough to perform the identification of smells under different perspectives of effectiveness. The second implication is that organizations do not need to allocate professional developers, who should focus on implementing new software features, to identify code smells. Instead, organizations can allocate novice developers for identifying code smells in collaboration, thereby promoting knowledge sharing and the correct identification of code smells. Our interview data reinforce that leaders are concerned about knowledge exchange among team members (as we can see from the answers of L4 and L5, for instance). The following is the recommendation:

Recommendation 1: Organizations can save resources by allocating novice developers rather than professional developers to effectively identify code smells.

Information about our second previous study. In the second study [22], we conducted two sessions of an exploratory case study aimed at understanding the effectiveness of collaborative smell identification in industry. In other words, we aimed at understanding the effectiveness of smell identification from the viewpoint of professional developers who are familiar with the inspected software systems. As a result of this study, we observed that collaborators are also more effective than single developers in the identification of code smells. This observation confirms our previous findings in [23]. In addition, we observed that collaborators often benefited from knowledge exchange to identify certain smell types. Also, we observed that developers require several types of information in order to confirm or refute a code smell suspect.

Comparison between this study and our second study [22]. In our second previous study, we assessed the collaborative smell identification from the viewpoint of professional developers who are familiar with the inspected software systems. We used precision and recall metrics to compare the developers’ effectiveness in both scenarios (individually and collaboratively). The difference between our previous study [22] and this study lies in two aspects: working experience and familiarity with the inspected systems. Prior to this study, there was a lack of knowledge on the effectiveness of collaborative identification from the viewpoint of professional developers who are unfamiliar with the software systems under analysis. Surprisingly, we have found that, even when professional developers are unfamiliar with a system, they perform better in identifying code smells collaboratively rather than individually.
One immediate implication of our results is that organizations do not need to allocate developers familiar with the system, who have multiple fundamental responsibilities in software development, to simply identify code smells. Thus, these developers can concentrate their efforts on addressing the clients’ needs. In this case, we recommend that organizations allocate developers who are unfamiliar with the system to identify code smells, since they will perform as effectively as developers with higher familiarity with the system. On the other hand, professional developers familiar with the system could be allocated only to determine how critical the identified smells are. This observation is reinforced by the leaders’ viewpoints on the criticality of code smells (L3) and the authorship of program modules by different developers (L2). The following is the recommendation:

Recommendation 2: Organizations can save resources by allocating developers that are unfamiliar with the system rather than the familiar ones to effectively identify code smells.

Information about our third previous study. In the third study [27], we aggregated the data points related to novice developers unfamiliar with the inspected software systems [23] with the data points related to professional developers that are familiar with the systems [22]. We then proposed a classification of subjects by professional background, i.e., the participant’s knowledge on identification of code smells, software development experience, and other related topics: no, little, middle, and high background. As a result, we found evidence that collaboration significantly improves the precision of the identification tasks for both novice and professional developers regardless of the professional background. We also found evidence that, when working collaboratively, the precision reached by professional developers is slightly higher than the one reached by novice developers, especially when identifying smell types that are more complex and require inspecting multiple code elements.

Comparison between this study and our third study [27]. Similarly to our third study [27], we observed that the collaborative smell identification is more effective than the individual smell identification regardless of the working experience.
It is worth mentioning that the interviewed project leaders share this concern about enhancing the identification accuracy (L2) and promoting the smell validation consensus (L2, L4). The novelty of our current study is that our observations are based on the identification of code smells performed on the same software system, in order to identify the same set of smell types, rather than in different settings whose comparison has various threats to validity. As an implication, we now have more reliable results that suggest the adoption of collaborative smell identification by organizations, especially those that (1) lack enough professional developers to work on addressing the clients’ needs and also identify code smell suspects for validation and elimination if necessary but (2) have sufficient novice developers to allocate for working collaboratively.

Recommendation 3: Organizations can mostly allocate novice developers for identifying code smells collaboratively, except when there is a need for identifying too complex and scattered smell types, which might require the inspection by professional developers.

7. Threats to validity

7.1. Construct validity

We have restricted our study to the analysis of a limited set of code smell types, which may have affected our findings. However, we reduce this threat by selecting four diverse types of code smells. These types occur in different code elements and are reportedly common in software systems (Section 3.3). Regarding the creation of the reference list of code smells, we recruited two PhD students with knowledge in software development and the identification of code smells. Thus, we mitigate possible threats by engaging researchers that are sufficiently qualified for such creation (Section 3.4).
Regarding the absence of static analysis tools, we highlight that our goal was not to investigate the impact of a specific tool on smell identification tasks; instead, our goal was to investigate how developers collaboratively identify code smells. Nevertheless, if we had used a tool during the experiment, the results could be completely dependent on the intricacies of this particular tool. In other words, the use of a tool would introduce significant bias to the experiment. With respect to the different backgrounds of the subjects, we mitigate this threat by selecting subjects with at least a minimum knowledge on topics of interest, such as Java and code smells (Section 3.2). In addition, all subjects underwent the training sessions to normalize their background. We followed strict guidelines [30,48] in order to elaborate the interview protocol and artifacts. All procedures were double-checked by two paper authors to mitigate threats such as invalid and useless interview questions. As discussed in Section 5.1, we trained the interviewees about the key benefits of both collaborative and individual smell identification prior to the interview execution. We avoided biasing the interviewees’ opinions by comparing both approaches equally and providing evidence on the applicability of each approach.

7.2. Internal validity

Regarding the communication among subjects during the experiment execution, we mitigate threats by limiting such communication with little interference on their answers. We also explained the experimental tasks to all subjects, aiming at avoiding misunderstandings and reducing the communication among subjects. As far as the experiment execution is concerned, we had in mind that developers would have enough time available to finish the identification of code smells. We have performed some experiment simulations, in which a set of participants (some of them working in pairs, others working individually) ran the experiment. Sixty minutes were enough for completing the experiment on time in both cases, since no participant complained about it. We did not inform the total time of the experiment, but after completing 60 minutes the participants were asked to conclude their participation.
This decision has prevented participants from worrying about time during the identification of code smells. Due to the aforementioned observations, we decided to keep the predefined experiment time limit. Through these simulations, we also identified opportunities for improving the experiment. We did not swap the order of individual and collaborative tasks during the controlled experiment (Section 3.5). Although we did not employ a cross-over design [52] in terms of individual and collaborative smell identification, this decision was taken because previous studies [22,27] observed that such a swap does not significantly affect code smell identification. We tried to minimize threats related to the difference between target systems by swapping the order of the target systems across experiment rounds. Rounds 1 and 3 counted on the inspection of Java IO first; Rounds 2 and 4 counted on the inspection of Java Print first. However, we did not swap the order of the target systems within each experiment round, e.g., splitting the 26 participants of Round 1 in two groups, one for inspecting Java IO first, and the other inspecting Java Print first. Not swapping systems may have favored code smell identification in the system analyzed in Rounds 2 and 4. This possible bias can eventually occur because participants may have learned how to identify smells while inspecting a different system in Rounds 1 and 3. Future research could address this threat by using a single target system in different rounds. We performed each interview with the software project leaders in isolation in order to make leaders as comfortable as possible with answering the interview questions. We also instructed the leaders and answered their general questions before they filled in the interview questionnaire. Thus, we expected to assure that all leaders understood each interview question.

7.3. Conclusion validity

To conduct the data analysis, we carefully selected the most appropriate statistical tests.
We also paid special attention to avoid violating assumptions of the selected statistical tests. To answer our research question, we applied the Mann-Whitney test [31] as discussed in Section 3.4.
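As a rough illustration of the statistic behind the Mann-Whitney test, the sketch below counts, over all cross-group pairs, how often a value from one group exceeds a value from the other (ties count one half). It is a minimal sketch under stated assumptions: the function name `mann_whitney_u` and the precision values are hypothetical and are not data from the experiment, and the p-value computation that a statistics package would provide is omitted.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs (a, b) in which a exceeds b; ties add 0.5.
    U_a + U_b always equals len(sample_a) * len(sample_b).
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical precision values, for illustration only.
collaborators = [0.7, 0.8, 0.6, 0.9]
single_developers = [0.5, 0.6, 0.4, 0.7]

u_collab, u_single = mann_whitney_u(collaborators, single_developers)
print(u_collab, u_single)
```

A value of U close to the product of the two sample sizes for one group indicates that this group tends to score higher; the test itself then judges whether such a difference is likely under the null hypothesis.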


Furthermore, we believe that our questionnaires fit our expectations with the empirical study and support answering our research question. For instance, they allowed us to characterize the experienced and inexperienced developers. Thus, we mitigate possible threats related to the data analysis through the exclusive analysis of data collected from the questionnaires. As previously discussed in Internal Validity, the inspection in different rounds without properly swapping the systems may have influenced our study results. Aiming at capturing confounding factors, we scrutinized the precision and recall results for Java Print and Java IO regardless of the inspection approach, i.e., individual and collaborative smell identification. Mean precision was quite similar across systems: 0.60 for Java IO against 0.55 for Java Print. Thus, the target system had little influence on the precision results. Conversely, the results changed considerably across systems in terms of recall: mean recall equals 0.26 for Java IO against 0.57 for Java Print. To understand this difference, we highlight that, in spite of the two systems exhibiting similar complexity in several aspects, Java IO is two times larger in size than Java Print. Additionally, unlike precision, recall considers false negatives. Since Java IO is larger, the probability of missing a code smell instance in this system is indeed greater than in Java Print. Thus, one could consider that the target system may have somehow influenced our key study result (i.e., does collaboration indeed outperform individual identification?). However, we made a comparison of the recall between collaborators and single developers (based on Fig. 4), when both cases are using Java IO (i.e., from pairs 17&19 to pairs 23&24). This comparison reveals there is only a minor difference in recall for most pairs (against their individual counterparts from 17 to 24), except for the pair 21&22 against the developer 21.
In any case, as one considers the balance between precision and recall, the collaborators still clearly outperform single developers in the Java IO case, including the pair 21&22. This can be confirmed by combining the results of Figs. 3 and 4. We noticed along the experiment that the reason for this behavior is that collaborators tend to focus much more on precision than recall when modules become larger in size (i.e., in the Java IO case). They tend to discuss more and be more reflective along the decision of confirming whether a code fragment has a smell. In fact, the precision of the aforementioned collaborators (Fig. 3) is much higher than that of the respective single developers for the Java IO case. Therefore, collaborative smell identification tends to be much superior in precision even for larger modules, possibly to the detriment of recall improvement. This reflects what developers would do very often while identifying opportunities for code refactoring [24,53]. In practice, developers tend to focus on a subset of smells in each code revision session, i.e., those smells for which they are absolutely sure the affected code should be refactored [22,54–58]. Finally, as far as the interviews are concerned, we carefully applied thematic synthesis procedures [49] in order to avoid biases in the qualitative data analysis. Two of the paper authors have performed the procedures in a pair for identifying and fixing missing and incorrect data.
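The asymmetry between precision and recall discussed above can be made concrete with a short sketch. It assumes that each smell instance is represented as a (code element, smell type) pair and that effectiveness is measured against a reference list; the function name `precision_recall`, the element names, and the smell type labels are hypothetical and do not come from the experiment.

```python
def precision_recall(identified, reference):
    """Compute precision and recall of identified smells against a reference list."""
    identified = set(identified)
    reference = set(reference)
    true_positives = len(identified & reference)
    # Precision penalizes false positives: reported smells not in the reference.
    precision = true_positives / len(identified) if identified else 0.0
    # Recall penalizes false negatives: reference smells that were missed.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical smell instances, written as (code element, smell type).
reported = {("ClassA", "type_1"), ("ClassB", "type_2")}
small_reference = {("ClassA", "type_1"), ("ClassC", "type_3")}
large_reference = small_reference | {("ClassD", "type_1"), ("ClassE", "type_4")}

small_result = precision_recall(reported, small_reference)
large_result = precision_recall(reported, large_reference)
print(small_result, large_result)
```

With the same reported set, precision is identical for both reference lists, while recall drops for the larger one. This is the kind of effect that a larger system with more smell instances can have on recall without affecting precision.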

7.4. External validity

Our study has some possible threats related to the generalization of findings. First, we applied the study with Brazilian developers, which may not represent all development scenarios. In addition, although we have spent a period of one month to engage novice developers and professional developers in our study, the set of subjects is limited to 18 novice developers and 16 professional developers. We minimize possible threats regarding the set of subjects by involving developers with varied backgrounds and levels of working experience. We also focused on developers with minimum experience with topics of interest, such as code smells and pair programming.


Even though someone could consider our population as limited, we did our best to involve the novice and professional developers in the identification of code smells. Moreover, many experiments in software engineering have even fewer participants than our experiment due to the difficulty in finding participants. Such limitation is a consequence of a lack of adequate sampling frames available in the field, making the identification of representative samples difficult [59]. Regarding the interviews, we recruited as many software project leaders as possible. Unfortunately, it is quite challenging to recruit leaders from different organizations with availability for the interview. We acknowledge that our qualitative results are not representative of most leaders. However, the five leaders that participated in our interviews not only helped us to confirm previous assumptions, but also revealed interesting insights on the potential application of the collaborative smell identification in practical settings.

8. Final remarks

This paper expands the current knowledge about collaborative smell identification through an empirical study conducted with 34 developers in a previously unexplored scenario: novices and professionals inspecting systems they are unfamiliar with. We summarize our study findings as follows.

• The average precision of collaborators was 27% higher than the average of single developers on the identification of code smells. Thus, a larger share of the smells reported by collaborators are actual code smells.
• The average recall of collaborators was 36% higher than the average of single developers on the identification of code smells. That is, the identification of code smells performed by collaborators has a higher coverage than that performed by single developers.
• The exchange of information allowed by collaboration is essential to improve the effectiveness of the code smell identification. We observed that collaborators share knowledge and complement each other. Consequently, collaboration improves their confidence in confirming a code smell suspect.
• All interviewed project leaders agree that collaboration is important to validate smell suspects. From a leader’s perspective, collaboration can help in handling the subjective and complex nature of code smells. Complementarily, leaders are unanimous in stating that they would adopt the collaborative smell identification in practice, except in particular cases when resources are too limited.

8.1. A recommendation to development organizations

Organizations may allocate novice developers for identifying code smells in collaboration. The major benefit would be promoting knowledge sharing and the correct identification of code smells. An advantage of allocating novice rather than professional developers to such identification is that professional developers can focus on addressing the clients’ needs, by adding new features to the system, for instance. In addition, we observe that the effectiveness of identifying code smells is similar for developers either familiar or unfamiliar with the system under inspection. Thus, organizations can simply allocate the unfamiliar ones to the collaborative smell identification and leave the familiar ones to address the users’ needs.

8.2. Enhancing current tool support for code smell identification

A previous work summarizes the existing tools for code smell identification [25]. Unfortunately, most of those tools provide little or no support to developer collaboration during the inspection of code smell candidates. Thus, developers may still struggle with validating candidates towards a better identification effectiveness [22,23]. We hypothesize that incorporating collaboration in the current tools could considerably leverage the effectiveness of code smell identification. The effectiveness benefits could extrapolate to code review in general, which strongly depends on developer discussions [39,40,42,60]. Based on our aforementioned findings, we expect to provide sufficient evidence for organizations concerned about the effectiveness of smell identification, which should consider adopting the collaborative smell identification in practice. As future work, we plan to investigate how novice and professional developers can collaborate in a more comprehensive scenario, in which (1) novice developers perform the collaborative smell identification and, after that, (2) professional developers, especially those that are familiar with the inspected system, prioritize the identified smell instances according to their impact on system maintainability.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially funded by CNPq (grants 434969/2018-4, 312149/2016-6, and 409536/2017-2), CAPES/Procad (grant 175956), and FAPERJ (grant 22520-7/2016).

References

[1] K. Bennett, V. Rajlich, Software maintenance and evolution: a roadmap, in: Conference on the Future of Software Engineering, Co-located with the 22nd International Conference on Software Engineering (ICSE), 2000, pp. 73–87.
[2] I. Macia, R. Arcoverde, A. Garcia, C. Chavez, A. von Staa, On the relevance of code anomalies for identifying architecture degradation symptoms, in: 16th European Conference on Software Maintenance and Reengineering (CSMR), 2012, pp. 277–286.
[3] M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, D. Poshyvanyk, When and why your code starts to smell bad (and whether the smells go away), IEEE Trans. Softw. Eng. 43 (11) (2017) 1063–1088.
[4] F. Palomba, G. Bavota, M.D. Penta, R. Oliveto, A.D. Lucia, Do they really smell bad? A study on developers’ perception of bad code smells, in: 30th International Conference on Software Maintenance and Evolution (ICSME), 2014, pp. 101–110.
[5] A. Yamashita, L. Moonen, Do code smells reflect important maintainability aspects? in: 28th International Conference on Software Maintenance (ICSM), 2012, pp. 306–315.
[6] M. Fowler, Refactoring: Improving the Design of Existing Code, first ed., Addison-Wesley Professional, 1999.
[7] M. Lanza, R. Marinescu, Object-oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-oriented Systems, first ed., Springer Science & Business Media, 2006.
[8] W. Oizumi, A. Garcia, L. Sousa, B. Cafeo, Y. Zhao, Code anomalies flock together: exploring code anomaly agglomerations for locating design problems, in: 38th International Conference on Software Engineering (ICSE), 2016, pp. 440–451.
[9] E. Fernandes, G. Vale, L. Sousa, E. Figueiredo, A. Garcia, J. Lee, No code anomaly is an island: anomaly agglomeration as sign of product line instabilities, in: 16th International Conference on Software Reuse (ICSR), 2017, pp. 48–64.
[10] M. Silva, M.T. Valente, R. Terra, Does technical debt lead to the rejection of pull requests? in: 12th Brazilian Symposium on Information Systems (SBSI), 2016, pp. 248–254.
[11] A. Chávez, I. Ferreira, E. Fernandes, D. Cedrim, A. Garcia, How does refactoring affect internal quality attributes? A multi-project study, in: 31st Brazilian Symposium on Software Engineering (SBES), 2017, pp. 74–83.
[12] D. Cedrim, A. Garcia, M. Mongiovi, R. Gheyi, L. Sousa, R. de Mello, B. Fonseca, M. Ribeiro, A. Chávez, Understanding the impact of refactoring on smells: a longitudinal study of 23 software projects, in: 11th Symposium on the Foundations of Software Engineering (FSE), 2017, pp. 465–475.
[13] M. Kim, T. Zimmermann, N. Nagappan, An empirical study of refactoring: challenges and benefits at Microsoft, IEEE Trans. Softw. Eng. 40 (7) (2014) 633–649.
[14] E. Murphy-Hill, C. Parnin, A. Black, How we refactor, and how we know it, IEEE Trans. Softw. Eng. 38 (1) (2012) 5–18.
[15] D. Silva, N. Tsantalis, M.T. Valente, Why we refactor? Confessions of GitHub contributors, in: 24th International Symposium on Foundations of Software Engineering (FSE), 2016, pp. 858–870.
[16] R. Arcoverde, I. Macia, A. Garcia, A. Von Staa, Automatically detecting architecturally-relevant code anomalies, in: 3rd International Workshop on Recommendation Systems for Software Engineering (RSSE), co-located with 34th International Conference on Software Engineering (ICSE), 2012, pp. 90–91.
[17] I. Macia, J. Garcia, D. Popescu, A. Garcia, N. Medvidovic, A. von Staa, Are automatically-detected code anomalies relevant to architectural modularity? an exploratory
analysis of evolving systems, in: 11th International Conference on Aspect-oriented Software Development (AOSD), 2012, pp. 167–178.
[18] J. Oliveira, R. Gheyi, M. Mongiovi, G. Soares, M. Ribeiro, A. Garcia, Revisiting the refactoring mechanics, Inf. Softw. Technol. 110 (2019) 136–138.
[19] A.C. Bibiano, E. Fernandes, D. Oliveira, A. Garcia, M. Kalinowski, B. Fonseca, R. Oliveira, A. Oliveira, D. Cedrim, A quantitative study on characteristics and effect of batch refactoring on code smells, in: 13th International Symposium on Empirical Software Engineering and Measurement (ESEM), 2019, pp. 1–11.
[20] T. Paiva, A. Damasceno, E. Figueiredo, C. Sant’Anna, On the evaluation of code smells and detection tools, J. Softw. Eng. Res. Dev. 5 (1) (2017) 7:1–7:28.
[21] C. Conceicao, G. Carneiro, F.B. e Abreu, Streamlining code smells: using collective intelligence and visualization, in: 9th International Conference on the Quality of Information and Communications Technology (QUATIC), 2014, pp. 306–311.
[22] R. Oliveira, L. Sousa, R. de Mello, N. Valentim, A. Lopes, T. Conte, A. Garcia, E. Oliveira, C. Lucena, Collaborative identification of code smells: a multi-case study, in: 39th International Conference on Software Engineering (ICSE): Software Engineering in Practice Track (SEIP), 2017, pp. 33–42.
[23] R. Oliveira, B. Estácio, A. Garcia, S. Marczak, R. Prikladnicki, M. Kalinowski, C. Lucena, Identifying code smells with collaborative practices: a controlled experiment, in: 10th Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS), 2016, pp. 61–70.
[24] L. Sousa, A. Oliveira, W. Oizumi, S. Barbosa, A. Garcia, J. Lee, M. Kalinowski, R. de Mello, B. Fonseca, R. Oliveira, et al., Identifying design problems in the source code: a grounded theory, in: 40th International Conference on Software Engineering (ICSE), 2018, pp. 921–931.
[25] E. Fernandes, J. Oliveira, G. Vale, T. Paiva, E. Figueiredo, A review-based comparative study of bad smell detection tools, in: 20th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2016, pp. 18:1–18:12.
[26] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (8) (2006) 861–874.
[27] R. de Mello, R. Oliveira, A. Garcia, On the influence of human factors for identifying code smells: a multi-trial empirical study, in: 11th International Symposium on Empirical Software Engineering and Measurement (ESEM), 2017, pp. 68–77.
[28] C. Bird, N. Nagappan, B. Murphy, H. Gall, P. Devanbu, Don’t touch my code! Examining the effects of ownership on software quality, in: 19th Symposium on the Foundations of Software Engineering (FSE), 2011, pp. 4–14.
[29] B. Boehm, Software risk management: principles and practices, IEEE Softw. 8 (1) (1991) 32–41.
[30] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, A. Wesslén, Experimentation in Software Engineering, first ed., Springer Science & Business Media, 2012.
[31] S. Siegel, N. Castellan Jr, Nonparametric Statistics for the Behavioral Sciences, second ed., McGraw-Hill, 1988.
[32] S. Vidal, E. Guimaraes, W. Oizumi, A. Garcia, A.D. Pace, C. Marcos, Identifying architectural problems through prioritization of code smells, in: 10th Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS), 2016, pp. 41–50.
[33] J. Garcia, D. Popescu, G. Edwards, N. Medvidovic, Toward a catalogue of architectural bad smells, in: 5th International Conference on the Quality of Software Architectures (QoSA), 2009, pp. 146–162.
[34] R. Marticorena, C. López, Y. Crespo, Extending a taxonomy of bad code smells with metrics, 7th International Workshop on Object-Oriented Reengineering (WOOR), 2006.
[35] S. Chidamber, C. Kemerer, A metrics suite for object oriented design, IEEE Trans. Softw. Eng. 20 (6) (1994) 476–493.
[36] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, first ed., Addison-Wesley Professional, 1994.
[37] N. Tsantalis, T. Chaikalis, A. Chatzigeorgiou, Ten years of JDeodorant: lessons learned from the hunt for smells, in: 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018, pp. 4–14.
[38] E. Murphy-Hill, A. Black, Seven habits of a highly effective smell detector, in: 1st International Workshop on Recommendation Systems for Software Engineering (RSSE), 2008, pp. 36–40.
[39] S. McIntosh, Y. Kamei, B. Adams, A. Hassan, The impact of code review coverage and code review participation on software quality: a case study of the Qt, VTK, and
ITK projects, in: 11th Working Conference on Mining Software Repositories (MSR), 2014, pp. 192–201.
[40] A. Bacchelli, C. Bird, Expectations, outcomes, and challenges of modern code review, in: 35th International Conference on Software Engineering (ICSE), 2013, pp. 712–721.
[41] E. Fernandes, P. Souza, K. Ferreira, M. Bigonha, E. Figueiredo, Detection strategies for modularity anomalies: an evaluation with software product lines, in: 14th International Conference on Information Technology: New Generations (ITNG), 2018, pp. 565–570.
[42] A. Begel, N. Nagappan, Pair programming: what’s in it for me? in: 2nd International Symposium on Empirical Software Engineering and Measurement (ESEM), 2008, pp. 120–128.
[43] G. Braught, J. MacCormick, T. Wahls, The benefits of pairing by ability, in: 41st Technical Symposium on Computer Science Education (SIGCSE), 2010, pp. 249–253.
[44] M. Lorenz, J. Kidd, Object-Oriented Software Metrics: A Practical Guide, first ed., Prentice Hall, 1994.
[45] J. Rosenberg, Some misconceptions about lines of code, in: 4th International Software Metrics Symposium (METRICS), 1997, pp. 137–142.
[46] H. Levene, Robust tests for equality of variances, in: I. Olkin (Ed.), Contributions to Probability and Statistics. Essays in Honor of Harold Hotelling, Stanford University Press, 1960, pp. 278–292.
[47] D. Lapková, M. Adámek, Statistical and mathematical classification of direct punch, in: 38th International Conference on Telecommunications and Signal Processing (TSP), 2015, pp. 486–489.
[48] C. Seaman, Qualitative methods in empirical studies of software engineering, IEEE Trans. Softw. Eng. 25 (4) (1999) 557–572.
[49] D. Cruzes, T. Dyba, Recommended steps for thematic synthesis in software engineering, in: 5th International Symposium on Empirical Software Engineering and Measurement (ESEM), 2011, pp. 275–284.
[50] T. Hayata, J. Han, A hybrid model for IT project with Scrum, in: 7th International Conference on Service Operations, Logistics, and Informatics (SOLI), 2011, pp. 285–290.
[51] I. Perera, S. Fernando, Enhanced agile software development–hybrid paradigm with LEAN practice, in: 6th International Conference on Industrial and Information Systems (ICIIS), 2007, pp. 239–244.
[52] B. Kitchenham, S. Pfleeger, L. Pickard, P. Jones, D. Hoaglin, K. El Emam, J. Rosenberg, Preliminary guidelines for empirical research in software engineering, IEEE Trans. Softw. Eng. 28 (8) (2002) 721–734.
[53] L. Sousa, R. Oliveira, A. Garcia, J. Lee, T. Conte, W. Oizumi, R. de Mello, A. Lopes, N. Valentim, E. Oliveira, C. Lucena, How do software developers identify design problems? A qualitative analysis, 31st Brazilian Symposium on Software Engineering (SBES), 2017.
[54] On the prioritization of design-relevant smelly elements: a mixed-method, multi-project study, in: 13th Brazilian Symposium on Software Components, Architectures, and Reuse (SBCARS), 2019, pp. 83–92.
[55] S. Vidal, E. Guimaraes, W. Oizumi, A. Garcia, A.D. Pace, C. Marcos, Identifying architectural problems through prioritization of code smells, in: 10th Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS), 2016, pp. 41–50.
[56] S. Vidal, W. Oizumi, A. Garcia, A.D. Pace, C. Marcos, Ranking architecturally critical agglomerations of code smells, Sci. Comput. Program. 182 (2019) 64–85.
[57] W. Oizumi, L. Sousa, A. Garcia, R. Oliveira, A. Oliveira, O.I.A.B. Agbachi, C. Lucena, Revealing design problems in stinky code: a mixed-method study, in: 11th Brazilian Symposium on Software Components, Architectures, and Reuse (SBCARS), 2017, pp. 5:1–5:10.
[58] W. Oizumi, A. Garcia, T. Colanzi, M. Ferreira, A. Staa, On the relationship of code-anomaly agglomerations and architectural problems, J. Softw. Eng. Res. Dev. 3 (11) (2015) 1–22.
[59] R. de Mello, P. Da Silva, G. Travassos, Investigating probabilistic sampling approaches for large-scale surveys in software engineering, J. Softw. Eng. Res. Dev. 3 (1) (2015) 8.
[60] E. Fernandes, A. Uchôa, A.C. Bibiano, A. Garcia, On the alternatives for composing batch refactoring, in: Proceedings of the 3rd International Workshop on Refactoring (IWoR), co-located with the 41st International Conference on Software Engineering (ICSE), 2019, pp. 9–12.
