Collaborative or individual identification of code smells? On the effectiveness of novice and professional developers


Information and Software Technology 120 (2019) 106242

Contents lists available at ScienceDirect

Information and Software Technology journal homepage: www.elsevier.com/locate/infsof

Roberto Oliveira a,b,∗, Rafael de Mello a, Eduardo Fernandes a, Alessandro Garcia a, Carlos Lucena a

a Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil
b State University of Goiás (UEG), Posse-GO, Brazil

Article info

Keywords: Code smell identification, Collaboration, Empirical study

Abstract

Context: Code smell identification aims to reveal code structures that harm software maintainability. Such identification usually requires a deep understanding of multiple parts of a system. Unfortunately, developers in charge of identifying code smells individually can struggle to identify, confirm, and refute code smell suspects. Developers may reduce their struggle by identifying code smells in pairs through collaborative smell identification.

Objective: The current knowledge on the effectiveness of collaborative smell identification remains limited. Some scenarios were not explored by previous work on the effectiveness of collaborative versus individual smell identification. In this paper, we address a particular scenario that reflects various organizations worldwide. We also compare our study results with recent studies.

Method: We have carefully designed and conducted a controlled experiment with 34 developers. We exploited a particular scenario that reflects various organizations: novices and professionals inspecting systems they are unfamiliar with. We expect to minimize some critical threats to the validity of previous work. Additionally, we interviewed 5 project leaders to understand the potential adoption of collaborative smell identification in practice.

Results: Statistical testing suggests 27% more precision and 36% more recall through collaborative smell identification for both novices and professionals. These results partially confirm previous work in a scenario not previously exploited. Additionally, the interviews showed that leaders would strongly adopt collaborative smell identification. However, some organization and tool constraints may limit such adoption. We derived recommendations for organizations concerned about adopting collaborative smell identification in practice.

Conclusion: We recommend that organizations allocate novice developers to identify code smells in collaboration. Thus, these organizations can promote knowledge sharing and correct smell identification. We also recommend allocating developers who are unfamiliar with the system to identify smells. Thus, organizations can allocate more experienced developers to more critical tasks.

1. Introduction

Software maintenance implies constantly changing the code elements that constitute a system, such as classes and methods, to address the users’ needs [1]. Unfortunately, maintaining those code elements often degrades the internal code structure of systems [2,3]. A practical implication of such degradation is that poor code structures are difficult for developers to maintain [4,5]. In other words, these structures usually make it difficult to understand and change a system. To minimize the negative impact of code structure degradation, developers should identify and eliminate poor code structures affecting a system whenever possible [2].

Code smells are basic symptoms of code structure degradation [6]. Each smell type represents a pattern of poor code structure that, if not minimized or eliminated, can hamper system maintenance. We exemplify this scenario with the Large Class smell type [6,7]. A Large Class instance is a class that is too large and complex, and that should be split into several classes to separate its features. Developers usually associate a Large Class instance with a high complexity to understand and change the affected class [4,5]. Thus, identifying and eliminating Large Class instances potentially increases the system maintainability. In summary, software developers should identify and eliminate this smell type, as well as many other smell types, from their systems [8,9].



∗ Corresponding author.
E-mail addresses: [email protected] (R. Oliveira), [email protected] (R. de Mello), [email protected] (E. Fernandes), [email protected] (A. Garcia), [email protected] (C. Lucena).
https://doi.org/10.1016/j.infsof.2019.106242
Received 18 December 2018; Received in revised form 2 November 2019; Accepted 10 December 2019; Available online 13 December 2019
0950-5849/© 2019 Elsevier B.V. All rights reserved.

R. Oliveira, R. de Mello and E. Fernandes et al.

Software contributions affected by instances of code smells are often rejected for integration in open source systems [10]. This suggests that code smells affect system development in practice. As a means to eliminate code smells, software developers often apply refactoring to their systems [11–15]. By changing specific code elements, the developers usually expect to either minimize or eliminate the existing poor code structures. However, this requires effective support for developers in identifying the poor code structures that require refactoring [16–19]. In this context, various studies have investigated the identification of code smells in different scenarios [20–23].

1.1. Current knowledge and limitations

In this study, we refer to developers who identify alone the code smells affecting a system as single developers. A recent study [21] investigates the ineffectiveness of single developers in identifying code smells in the so-called individual smell identification. Such ineffectiveness is mostly due to the subjective nature of the identification of code smells [4,5,21,24] combined with the limited tooling support for identifying code smells [25], especially regarding precision and recall [26]. Thus, the collaborative smell identification emerges as a promising solution to such ineffectiveness [22,23]. It consists of two developers working as collaborators to identify, confirm, and refute code smell suspects, i.e., code smell instances not yet confirmed by developers. The common wisdom could state that, in contrast with the individual smell identification, the collaborative smell identification is more effective, especially for supporting a high rate of identified code smells. We recently investigated the common wisdom through empirical studies [22,23]. Nevertheless, the current knowledge on the effectiveness of collaborative smell identification does not suffice to convince development companies to adopt it in practice.
The first study [23] investigates the effectiveness of novice developers arranged individually (one developer), in pairs (two developers), and in groups (three or more developers). All developers are unfamiliar with the inspected systems. In other words, the developers know little or nothing about the code elements and possible poor code structures affecting the systems. This scenario emulates real-world companies in which developers have to maintain a system they did not originally develop. The second study [22] investigates the effectiveness of professional developers in the same three arrangements. All developers are familiar with the inspected systems and, therefore, they might be more likely to identify code smells than other developers.

The two aforementioned studies [22,23] conclude that the collaborative smell identification (performed by pairs and groups) is more effective than the individual smell identification. They then suggest, to some extent, that companies adopt the collaborative smell identification. Nevertheless, their study findings are not exactly comparable because they regard different smell types, inspected systems, and working experience. Although we have crossed the data of both studies in a recent work [27], the comparison has several threats to validity. This has two major practical implications: (1) it remains difficult to assure that the familiarity of developers with a system has affected effectiveness, and (2) it might be hard to convince companies to allocate either novices or professionals to identify code smells collaboratively. In particular, professional developers tend to hold most knowledge about their systems [28]. Thus, companies would prefer to allocate novices rather than professionals to identify code smells.

1.2. Expanding the current knowledge: an unexploited scenario

In this paper, we aim to extend the scope of our analysis while avoiding the threats to validity inherent in combining observations from different studies [27]. We present a controlled experiment on the effectiveness of collaborative smell identification in a previously unexploited scenario: novice and professional developers inspecting systems they are unfamiliar with. This scenario is quite interesting in


practice due to the following reasons. First, the effectiveness of novice and professional developers may vary significantly, which would justify allocating developers with a specific working experience to identify code smells. Second, the lack of familiarity with a system is recurring in practice due to (1) the high developer turnover [29] and (2) the need for allocating professional developers to address the clients’ needs rather than to eliminate poor code structures affecting a system [15].

A total of 34 novice and professional developers participated in our controlled experiment. By following empirical study guidelines [30], we asked developers to identify code smells either as single developers or as collaborators (two developers only). We have selected a software system affected by a sufficient and varied number of code smell suspects. The respective smell types are those that developers usually perceive as harmful to system maintainability [4,5]. We have submitted our data to quantitative analysis via precision and recall computation [26] and the application of statistical methods [31] for hypothesis testing. As a complement to our quantitative analysis, we performed interviews with five software project leaders. Our major goal was to understand whether leaders are likely to adopt the collaborative smell identification in their teams.

1.3. Study results and practical implications

Our study results suggest that both novice and professional developers reach 27% more precision and 36% more recall by identifying code smells collaboratively. By applying statistical tests, we confirmed a significant difference between the effectiveness of developers working as single developers and as collaborators in the identification of code smells.
These results are interesting by themselves, since they provide empirical evidence that allocating developers to work collaboratively can help identify more code smells with only a low misidentification rate when compared to the individual smell identification. However, our study reveals even more insightful conclusions when combined with our previous findings about the collaborative smell identification.

In our first study [23], we investigated the effectiveness of novices identifying code smells individually by simply computing the average number of identified code smells. Unfortunately, this computation represents a threat to the study validity because the average number does not reveal important aspects of the identification of code smells, such as how frequently novice developers misidentify code smells when allocated to work collaboratively. By addressing this threat, we have now confirmed that novice developers working collaboratively obtain a higher precision and recall than those working individually, thereby reducing the number of misidentified code smells. Thus, organizations can allocate novice rather than professional developers to identify code smells collaboratively, so that professional developers can focus on addressing the clients’ needs.

In our second study [22], we investigated the effectiveness of collaborative smell identification from the perspective of professional developers who are familiar with the inspected system, but by computing precision and recall. The familiarity of developers with the system might have facilitated the identification of code smells, which implies a threat to the study validity. We addressed this threat by allocating developers with varied working experiences to inspect a system they are unfamiliar with. Surprisingly, we observed that developers working collaboratively are more effective than the ones working individually at identifying code smells, regardless of their familiarity with the inspected system. Thus, organizations can simply allocate developers who are unfamiliar with their systems to identify code smells. This enables the other developers to address the clients’ needs based on their extensive knowledge about the system.

Through the interviews with software project leaders, we have derived various interesting findings. First, most leaders agreed that collaboration is essential to support the code smell identification. In particular, leaders expect that collaboration may support the identification of code smells depending on granularity and complexity. Second, leaders are ultimately open to employing the collaborative smell identification in their


development teams. Leaders mentioned many motivations behind such adoption, such as the code quality improvement and the enhancement of identification accuracy. However, as expected, limited budget and time may hinder this adoption in practice.

2. Background and related work

This section provides the background of the paper. Section 2.1 discusses the characteristics of code smells and their impact on system maintainability. Section 2.2 discusses the limitations of both automated and manual identification of code smells. We justify our investigation of the collaborative smell identification as a manual and human-centered activity. Section 2.3 discusses the limited knowledge about the effectiveness of collaborative smell identification. Section 2.4 motivates our study with a practical example.

2.1. Code smells: characteristics and impact on maintainability

The internal code structure of a system tends to degrade along successive changes [2,8,32]. Code smells are symptoms of poor code structures that realize such degradation [5,6]. They can reveal, at least partially, a problem of system maintainability [8,9], such as the difficulty to understand and change certain code elements. To identify code smells, developers need to understand two basic characteristics that help reveal the impact of a code smell instance on the system maintainability. These characteristics are: (1) the smell type and (2) the smell granularity. We explain each characteristic as follows.

The smell type characterizes how the poor code structure manifests in the source code of a system. Let us return to the Large Class smell type mentioned in Section 1. Large Class consists of a too large and complex class, which developers might find challenging to understand and change [6,7]. This smell type is usually realized by several methods, which sum up many lines of code and implement complex system features. Previous work [4,5] points out Large Class as one of the most critical smell types from the viewpoint of developers. Besides Large Class, many other smell types have been cataloged [6,7,33].

Due to the different manifestations of code smells, each code smell instance affects a particular set of code elements in a system. This set characterizes the smell granularity. In this study, we relied on previous work [6,34] and divided smell granularity into two categories. The first category is called intra-class smells and consists of poor code structures that usually require inspecting a single class to be identified. Examples are the Large Class [6] instances, which locally affect a class. The second category is called inter-class smells and consists of poor code structures whose identification usually requires reasoning about multiple classes together. Examples are Message Chain [6] instances, each requiring the analysis of various classes whose method calls form a call chain.

We have selected four smell types for inspection.

• Data Clumps (Inter-class): A group of data items often seen together in the system, either as class members or as a parameter list in the signature of methods. A class could encapsulate them [6].
• Large Class (Intra-class): A too large and complex class. It usually implements various unrelated features, which could be distributed to other classes [6,7].
• Long Method (Intra-class): A too large and complex method. It usually either has many lines of source code with several conditional branches, or implements complex features [6,7].
• Message Chain (Inter-class): A long sequence of method calls composed across various system classes [6].

We selected these smell types for three reasons. First, each selected smell type is frequent in the inspected systems and typically associated with maintainability problems by developers [2,4,5]. Second, these smells are conceptually interrelated, which can impose different difficulty levels in their identification. For instance, by definition, a Large Class instance might be a composition of various Long Method instances. Third, identifying instances of these smell types might require the analysis of various code elements and reasoning about their multiple characteristics [6,7], such as cohesion and coupling [35].


2.2. Automated versus manual smell identification

The identification of code smells consists of searching for poor code structures that potentially realize maintainability problems in a system [6,7]. Usually, such identification relies on inspecting the code elements that constitute a system. Each code element represents a basic decomposition unit of the system [36]. In this study, we focus on inspecting code smells that affect two types of code elements: methods and classes. Whenever we identify a code smell affecting a specific code element, we refer to this code element as a code smell suspect. After identifying code smell suspects, each suspect has to be either confirmed or refuted by developers as actually harmful to the system maintainability.

Developers might perform the identification of code smell suspects affecting the source code of a system in two ways: (1) by relying on the support of automated tools [25] and (2) by manually inspecting the source code to identify the code smell suspects. Examples of automated tools are JDeodorant [37] and Stench Blossom [38]. However, in spite of the various existing tools, a previous study [25] observes that these tools tend to have a low effectiveness in terms of precision and recall. More critically, the tools vary significantly in effectiveness, especially because each tool adopts a different identification strategy (e.g., based on software metrics). Mainly due to the high number of misidentified code smell suspects [25], which causes the usually low precision rates of existing tools, the developers still need to reason about the validity of each code smell suspect after using an automated tool.

As a response to the lack of effectiveness of automated supporting tools, companies can use the manual identification of code smells as the main practice, and the automated tools as a complement. Thus, developers can automate the identification of several code smell suspects and, then, manually confirm or refute them. There are two scenarios for the manual identification of code smells [22]. The first scenario allocates single developers to identify code smells alone in the so-called individual smell identification. The second scenario allocates two or more developers as collaborators to identify code smells together in the so-called collaborative smell identification. Our previous studies [22,23] suggest that adding more than two collaborators does not significantly increase the effectiveness of collaborative smell identification. Therefore, unlike our previous studies, this current study assesses the collaborative smell identification in pairs only.
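The workflow described in this section, in which a tool flags suspects from software metrics and developers then confirm or refute each one, can be sketched as follows. This is only an illustrative sketch: the thresholds, the method names, and the recorded decisions are hypothetical, and the rule does not reproduce the strategy of JDeodorant, Stench Blossom, or any other existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodMetrics:
    """Metrics of one method; all values below are made up for illustration."""
    name: str
    lines_of_code: int
    conditional_branches: int


# Hypothetical thresholds (an assumption, not taken from the paper or any tool).
MAX_LINES = 50
MAX_BRANCHES = 10


def long_method_suspects(methods, max_lines=MAX_LINES, max_branches=MAX_BRANCHES):
    """Flag methods exceeding either threshold as Long Method suspects."""
    return [
        m.name
        for m in methods
        if m.lines_of_code > max_lines or m.conditional_branches > max_branches
    ]


methods = [
    MethodMetrics("Invoice.total", 12, 2),
    MethodMetrics("Invoice.export", 140, 6),
    MethodMetrics("Parser.parse", 45, 17),
]

# Step 1: automated identification of suspects.
suspects = long_method_suspects(methods)

# Step 2: manual confirmation or refutation by developers (hypothetical decisions).
decisions = {"Invoice.export": "confirmed", "Parser.parse": "refuted"}
confirmed = [name for name in suspects if decisions.get(name) == "confirmed"]
```

The second step is deliberately a plain lookup of human decisions: in the setting discussed here, the metric rule only produces suspects, and the judgment about each suspect stays with the developers.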

2.3. The current knowledge about collaborative smell identification

The collaboration of developers has been applied to and assessed in various software engineering tasks [39,40]. For instance, a previous study [40] shows that developers working collaboratively can benefit during code review by revealing more easily and frequently the defects and maintenance problems affecting their software systems. The benefits come mostly from the knowledge exchange promoted by collaboration [22,40]. Another study [39] shows that collaboration reduces the difficulty faced by developers in identifying defects in software systems, which requires a careful inspection of the source code. Consequently, one could expect that collaboration potentially improves the developers’ effectiveness in identifying code smells, which also requires a source code inspection.

This paper presents an empirical study aimed at assessing the effectiveness of collaborative versus individual smell identification. However, this is not the first study proposed with a similar purpose. Table 1 compares the study design of this paper with the study design of two previous studies of ours [22,23] with a similar purpose. The first column lists the study design characteristics that we consider relevant for comparison across studies. The second column characterizes our first study in the context of collaborative smell identification [23], in terms of those characteristics. The third column characterizes our second study [22]. The fourth column characterizes our current study design. This table


Table 1 Comparison between This Study and Previous Studies.

| Characteristic | Study 1 [23] | Study 2 [22] | Study 1+2 [27] | This study |
|---|---|---|---|---|
| Nature of the study performed | Controlled experiment | Controlled experiment | Controlled experiment | Controlled experiment plus interviews |
| Number of subjects | 28 developers | 13 developers | 41 developers | 34 developers + 5 team leaders |
| Subject working experience | Novices | Professionals | Novices, professionals | Novices, professionals |
| Arrangement of the subjects | Individuals, pairs, groups | Individuals, pairs, groups | Individuals, pairs, groups | Individuals, pairs |
| Number of inspected systems | 3 | 5 | 8 | 2 |
| Familiarity with the system | Unfamiliar | Familiar | Unfamiliar, Familiar | Unfamiliar |
| Effectiveness metrics | Precision, recall | Average number of code smells | Precision | Precision, recall |
| Smell types | Long Method and 6 others | Large Class, Long Method, Message Chain, and 9 others | Large Class, Long Method, Message Chain, and 9 others | Data Clumps, Large Class, Long Method, Message Chain |

lists only the smell types inspected in our current study. We discuss the differences among studies, and justify our current study, as follows.

The data of Table 1 indicates the different design characteristics across studies. Notably, the current study is the first to complement the quantitative data obtained from a controlled experiment with interviews with project leaders. We aimed to know whether leaders are likely to adopt the collaborative smell identification in practice. Additionally, our study differs from the others in the working experience of the developers that participated in each study. In this study, we have counted on the participation of both novice and professional developers rather than developers with a single level of working experience. Our major goal was to enable a fair effectiveness comparison of novices and professionals through a shared context: the same systems affected by the same smell types. Thus, we aim at minimizing threats from our previous studies [22,23] that hinder concluding whether novices are as effective as professional developers in identifying code smells collaboratively.

2.4. Motivating example

Based on the existing evidence [22] that the developers’ consent is essential to the identification of code smells, especially regarding the confirmation or refutation of code smell suspects, we motivate our study by showing how difficult this task might be for single developers. We then illustrate how developers working collaboratively may benefit from knowledge exchange during the identification of code smells. The developer’s knowledge may include information about the purpose of the software system, details on the code structure, or definitions of smell types. In turn, knowledge exchange means sharing knowledge between developers, especially when identifying code smells together.

Suppose that two hypothetical developers, Bill and Suzi, must carry out the identification of code smells (e.g., Data Clumps) in a software system not implemented by them. Bill and Suzi are responsible for inspecting the Util and System components, respectively. Through the identification of code smells, developers can obtain indications of which code elements need to be modified to improve the longevity of the software system.

2.4.1. Scenario 1: Individual smell identification

Each developer (i.e., Bill and Suzi) performs the analysis of a particular component of the software system. Thereby, during the identification of code smells, Bill only inspects the Util component while Suzi only inspects the System component. Bill initially suspects that there is a Data Clump smell affecting Class A. He has this suspicion because the foo and bar parameters appear together in the parameter list of two different methods (see code fragments in blue color). In order to confirm the existence of a Data Clump, Bill inspects the parameter list of the Class B methods. However, Bill concludes there is no Data Clump smell because the foo and bar parameters appear together only in the two methods from Class A.

2.4.2. Scenario 2: Collaborative smell identification

During collaborative smell identification, Bill suspects there is a Data Clump smell affecting Class A. He has this suspicion because the foo and bar parameters appear together in the parameter list of two different methods. To either refute or confirm the existence of a Data Clump smell, Bill and Suzi verified whether the foo and bar parameters also appear together in the parameter list of methods in the System component. During the discussion, Suzi pointed out problems relating to Class D, which confirmed that the parameters foo and bar are repeated in the parameter list of three different methods in this class. Thus, both Bill and Suzi broadened their knowledge about the code elements of the whole system. Accordingly, they agreed there is a Data Clump smell affecting the Util and System components.

Therefore, the exemplified Data Clump smell would not have been found if Bill and Suzi had each inspected only a particular component. On the other hand, the Data Clump was confirmed after the developers had exchanged knowledge regarding the inspected components. This knowledge could only be gathered by understanding how the parameter lists of the methods are used in the whole system, which in this scenario means inspecting both the Util and System components.

The presented scenario exemplifies how the information exchange between developers about the code elements is very important for spreading knowledge between developers. Based on this, the developers can more reliably: (i) confirm the code smell, (ii) define the necessary code changes in order to improve the code quality, and (iii) infer the actual impact on the design of the component hosting the smell. Especially for (i) and (iii), the existing tools are still insufficient to replace developers in the identification of code smells [8,25]. Moreover, for (ii) the existing tools do not dispense with the involvement of developers [22].
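The two scenarios above can be mimicked with a small co-occurrence check over method parameter lists. The sketch below is only an illustration under stated assumptions: the method names, the extra parameters, and the rule "a pair of parameters seen together in at least three methods is a Data Clumps suspect" are all hypothetical. Only the classes A, B, and D, the parameters foo and bar, and the counts of two and three methods come from the example.

```python
from collections import Counter
from itertools import combinations


def parameter_pairs(signatures):
    """Count in how many methods each pair of parameter names appears together."""
    counts = Counter()
    for params in signatures.values():
        for pair in combinations(sorted(set(params)), 2):
            counts[pair] += 1
    return counts


def data_clump_suspects(signatures, min_methods=3):
    """Return parameter pairs shared by at least `min_methods` methods.

    The threshold of 3 is an assumption made for this sketch.
    """
    counts = parameter_pairs(signatures)
    return {pair for pair, n in counts.items() if n >= min_methods}


# Util component, inspected by Bill (method names are hypothetical).
util = {
    "A.load": ["foo", "bar", "path"],
    "A.save": ["foo", "bar"],
    "B.reset": ["flag"],
}

# System component, inspected by Suzi (method names are hypothetical).
system = {
    "D.open": ["foo", "bar"],
    "D.close": ["foo", "bar", "force"],
    "D.sync": ["foo", "bar"],
}

# Scenario 1: Bill sees only Util, where foo and bar co-occur in two methods.
bill_alone = data_clump_suspects(util)

# Scenario 2: Bill and Suzi combine their views of Util and System.
together = data_clump_suspects({**util, **system})
```

Under these assumptions, the Util-only view stays below the threshold, whereas the combined view counts five methods sharing foo and bar and reports the pair as a suspect, which the developers would still need to confirm or refute.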
Therefore, the collaboration or isolation of developers along the smell identification may affect the task effectiveness. In particular, the use of collaborators may help developers perform specific actions that contribute to improving effectiveness on the identification of code smells.

3. Controlled experiment settings

This section describes the settings of our study aimed at understanding whether collaborators are more effective than single developers when identifying code smells. Section 3.1 presents the study goal, the research question, and associated hypotheses. Section 3.2 presents the characterization of subjects. Section 3.3 describes the target software systems and data sources used in the experiment. Section 3.4 presents the data analysis procedure. Section 3.5 describes the experiment procedure steps.

3.1. Research goal

This study aims at comparing both collaborative and individual smell identification. Our goal was to understand whether collaborators are more effective than single developers when identifying code smells. By

relying on a well-known guideline [30], we refined and structured the study goal as follows:

• Analyze the collaborative smell identification when compared to individual smell identification,
• For the purpose of assessing the developers’ effectiveness,
• With respect to precision and recall of the identification of code smells,
• From the perspective of novice developers and professional developers,
• In the context of Java software systems that novice and professional developers are unfamiliar with, i.e., systems about which developers have no previous knowledge.

From our study goal, we designed the following research question (RQ1).

RQ1: Is collaborative smell identification more effective than individual smell identification?

Our empirical study addresses RQ1 as follows. First, we assess the collaborative smell identification from the viewpoint of developers who are unfamiliar with the inspected systems, in order to avoid biases in the identification of code smells. In other words, the developers have no previous knowledge about the software system in which they identify code smells. Second, we compute two metrics to compare the developers’ effectiveness in code smell identification: precision and recall [26]. Basically, precision measures the correctness of the identified code smells, and recall measures the completeness of the code smell identification with respect to all existing code smells, based on the identification performed by a specialist in software development or code smells. To compute these metrics, we built a reference list of code smells, i.e., an itemization of code smells thoughtfully identified in the software systems [41]. Details on how we built the reference list can be found in Section 3.4. Third, we derived the following null and alternative hypotheses from RQ1.

• HP0. There is no difference in the precision between collaborators and single developers in smell identification.
• HP1. There is a difference in the precision between collaborators and single developers in smell identification.
• HR0. There is no difference in the recall between collaborators and single developers in smell identification.
• HR1. There is a difference in the recall between collaborators and single developers in smell identification.
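As an illustration of the two metrics behind these hypotheses, the sketch below computes precision and recall of one set of identified smells against a reference list, using the standard definitions (correctly identified smells divided by all identified smells, and by all smells in the reference list, respectively). The smell instances and element names are hypothetical and do not come from the inspected systems.

```python
def precision_recall(identified, reference):
    """Return (precision, recall) of identified smells against a reference list.

    Both arguments are collections of (smell type, code element) pairs.
    Empty inputs yield 0.0 instead of a division by zero.
    """
    identified, reference = set(identified), set(reference)
    true_positives = identified & reference
    precision = len(true_positives) / len(identified) if identified else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference list built by a specialist.
reference = {
    ("Large Class", "OrderManager"),
    ("Long Method", "OrderManager.process"),
    ("Data Clumps", "Util.format"),
    ("Message Chain", "Report.render"),
}

# Hypothetical answers of one subject (or one pair of collaborators).
identified = {
    ("Large Class", "OrderManager"),
    ("Long Method", "OrderManager.process"),
    ("Long Method", "Report.render"),  # not in the reference list: a misidentification
}

precision, recall = precision_recall(identified, reference)
```

In this made-up case, two of the three identified smells are in the reference list (precision of 2/3), and two of the four reference smells were found (recall of 1/2).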

We discuss each hypothesis as follows. With the null hypotheses, we assume that the number of developers working on the identification of code smells does not make it more or less effective, with respect to precision (HP0) or recall (HR0). On the other hand, the alternative hypotheses indicate that there is a difference between the effectiveness of code smell identification by collaborators and single developers, with respect to precision (HP1) or recall (HR1).

3.2. Characterization of the participants

This study involved 34 developers as subjects. They were classified according to two levels of working experience, namely novice developers and professional developers. This classification aimed at helping to understand whether each level benefits differently (or not) from collaborative smell identification. It also aimed at supporting the generalization of our results, since we consider developers with different working experiences. We introduce each level as follows.

• Novice developers are subjects with little or no experience in industrial software development. We selected these subjects from a software engineering course of a Brazilian undergraduate program in Computer Science.

• Professional developers are subjects currently working in industrial software development who hold at least one year of experience. We adopted this threshold mostly because the high developer turnover in the selected organizations made it unfeasible to select developers with many more years of experience. We selected subjects from different organizations, some of which are multinational; these companies typically perform maintenance tasks, and their developers are concerned about identifying and eliminating code smells.

Our experiment consists of two sessions: one with novice developers in an academic laboratory, and another with professional developers in their working environment. In each session, subjects first performed smell identification in isolation and, after that, performed the same task in collaboration. To participate in the study, all subjects signed an informed consent form. The subjects also filled out a characterization questionnaire with closed questions about their expertise in four topics related to the study: programming, Java, Pair Programming (PP), and code smells. We chose PP as a reference for measuring the experience with collaborative work because it is well known in both the literature and the industry [42,43]. Table 2 presents the data collected from the subject characterization questionnaire with respect to all subjects. The first column lists the two levels of working experience, i.e., novice developers (18 subjects in total, s1 to s18) and professional developers (16 subjects in total, s19 to s34). The second column provides a label for each subject, aimed at keeping their identity anonymous. The remaining columns present the experience reported by each subject regarding each aforementioned topic related to our study. We ranked the knowledge of subjects per topic (Table 2) in four categories: none means I never had contact with it; low means I had contact with it in classes or instructional material; medium means I had contact with it in the context of academic systems; and high means I had contact with it for at least one year in industrial systems. 
In the case of Java, the knowledge of the subject is: none when the subject never had contact with the Java programming language; low when the subject had contact with Java only through classes or by reading instructional material; medium when the subject had contact with Java only in the context of academic systems developed in academic courses and laboratories; and high when the subject had contact with Java for at least one year in industrial software systems. Note that professionals might not be experienced with Java in the industry but with another language. A similar rationale was applied to the other topics, such as code smells. We analyze Table 2 to identify subjects with a high degree of knowledge about our topics of interest, collected via the characterization questionnaire, when compared to other subjects. The table highlights these subjects in boldface. We say that a subject has sufficient knowledge about the study topics when the subject has medium knowledge in at least two topics, since this represents at least half of the topics related to the study. We observe that: 28 out of 34 subjects have medium to high knowledge in programming and Java; 27 out of 34 subjects have medium to high knowledge in Pair Programming; 25 out of 34 subjects have medium to high knowledge in code smells; and no subject reported no knowledge across all four topics. We then conclude that our subjects met the minimum requirements to take part in our experiment.

3.3. Target software systems and data sources

To allow us to investigate the collaborative smell identification, we have selected a set of target software systems and data sources for analysis. We present both the target software systems and data sources as follows.

3.3.1. Target software system

This study focuses on the identification of code smells by the subjects. Thus, we selected a set of target systems for usage by the subjects during such identification. For this purpose, we selected two industry systems,


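Table 2. Characterization of subjects for the controlled experiment.

| Working experience | Subject | Programming | Java | Pair Programming | Code Smells |
|---|---|---|---|---|---|
| Novice developers | s1 | Medium | Medium | Medium | Low |
| Novice developers | s2 | Medium | Medium | Medium | Medium |
| Novice developers | s3 | Low | Low | Medium | Low |
| Novice developers | s4 | Low | Low | Medium | Low |
| Novice developers | s5 | Low | Low | Medium | Low |
| Novice developers | s6 | Low | Low | Medium | Low |
| Novice developers | s7 | Low | Low | Medium | Low |
| Novice developers | s8 | Medium | Medium | Medium | Medium |
| Novice developers | s9 | Medium | Medium | Medium | Medium |
| Novice developers | s10 | Low | Low | Medium | Low |
| Novice developers | s11 | Medium | Medium | Medium | Medium |
| Novice developers | s12 | Medium | Medium | Medium | Low |
| Novice developers | s13 | Medium | Medium | Medium | Medium |
| Novice developers | s14 | Medium | Medium | Medium | Medium |
| Novice developers | s15 | Medium | Medium | Medium | Low |
| Novice developers | s16 | Medium | Medium | Medium | Medium |
| Novice developers | s17 | High | High | High | High |
| Novice developers | s18 | Medium | Medium | Low | Medium |
| Professional developers | s19 | High | High | High | Medium |
| Professional developers | s20 | High | High | None | High |
| Professional developers | s21 | High | High | High | High |
| Professional developers | s22 | High | High | High | High |
| Professional developers | s23 | High | High | High | High |
| Professional developers | s24 | High | High | Low | Medium |
| Professional developers | s25 | High | High | Medium | High |
| Professional developers | s26 | High | High | Low | Medium |
| Professional developers | s27 | High | High | Low | Medium |
| Professional developers | s28 | High | High | Low | High |
| Professional developers | s29 | High | High | Medium | High |
| Professional developers | s30 | High | High | Low | High |
| Professional developers | s31 | High | High | Medium | Medium |
| Professional developers | s32 | High | High | High | High |
| Professional developers | s33 | High | High | High | High |
| Professional developers | s34 | High | High | Medium | High |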

namely Java IO and Java Print, which belong to the Java Core project (http://openjdk.java.net/groups/core-libs). This selection relies on the following criteria. First, each system has to be open source, which allows the study replication. Second, each software system has to enable the identification of code smells using the Stench Blossom tool in its default settings [38], which relies on well-known detection strategies for code smells [7] and provides a visualization of the code smell suspects. Third, each software system has to be affected by multiple code smells. To support the generality of our findings, we selected systems in which we identified different smell types, with varying granularity. We focus on four smell types [6]: Data Clumps, Large Class, Long Method, and Message Chain. Both Large Class and Long Method are intra-class smells, which locally affect a single class of the software system. In turn, Data Clumps and Message Chain are inter-class smells, which affect multiple classes. The selected smell types affect different code structures in a software system, such as methods and classes [6]. All selected smell types are reportedly very frequent in software systems [4,5].

Both source code files selected for inspection were developed within the context of the same project. As a consequence, we could assume that the implementation decisions that might lead to the occurrence of code smells are similar among files. This assumption is confirmed by the smell types identified in each code file. We observe that all but one code smell type occur in equal number in both files: 1 Large Class instance per file, 2 Data Clumps instances per file, and 1 Message Chain instance per file. The exception is Long Method, with 7 instances in Java IO against 6 instances in Java Print.

We have some observations about the complexity of both source code files. Even though the files differ in terms of Number of Lines of Code [44], this metric has been shown to be inappropriate for measuring file complexity [45], which led us to use another complexity metric. For instance, by computing the complexity of both files via Weighted Methods per Class (WMC) [35], we obtained similar results: WMC equals 6 for the Java Print file, against 4 for the Java IO file. These results suggest again that both files have a quite similar complexity for the analysis and identification of code smells. We then conclude that comparing both files in our study (Section 4) is acceptable.

3.3.2. Data sources

We collect experimental data of the subjects from different data sources, namely: the subject characterization questionnaire, the list of smells identified by developers, and the post-experiment questionnaire. We combined the data obtained via these data sources to balance their strengths and limitations. We describe each data source as follows.





• Subject characterization questionnaire: it is composed of questions aimed at characterizing each subject in terms of their knowledge of the topics of interest.
• List of smells identified by developers: it is a form aimed at collecting the list of code smells identified by each subject and pair during the experiment.
• Post-experiment questionnaire: it is composed of questions aimed at collecting the perception of subjects regarding the code smell identification conducted in the experiment.

3.4. Data analysis procedures

As aforementioned, we conducted an empirical study that uses multiple data sources for data analysis. Thus, we carefully designed our data analysis procedures. We present each procedure as follows.


3.4.1. Creation of a code smell reference list

We built a code smell reference list to support the data analysis. For this purpose, we recruited two researchers, who are PhD students with experience in software maintenance research and knowledge of software development and the identification of code smells. The researchers identified code smells in the selected projects in a complementary way. That is, one researcher conducted the manual smell identification, without tool support, and the other used the Stench Blossom tool [38]. Each researcher then obtained a list of code smell suspects; the two lists were not exactly the same due to the subjectiveness of the identification of code smells. To reach a consensus, we computed the agreement between the lists of code smell suspects reported by both researchers. For each smell suspect, we have an agreement whenever the researchers have confirmed or refuted the suspect together. Conversely, we have a disagreement whenever the researchers diverged in opinion without a consensus. Afterwards, the researchers conducted an open discussion to resolve the disagreements. Finally, we built the final code smell reference list.

3.4.2. Quantitative data analysis

Our study assesses the effectiveness of both collaborators and single developers on the identification of code smells. For this purpose, we compute the developers’ effectiveness in terms of two well-known metrics, namely precision and recall [26]. Precision measures the correctness of the identified code smells. Recall measures the completeness of the identified code smells with respect to all code smells that occur in a system [26]. To compute these metrics, we used the aforementioned code smell reference list, which is an itemization of code smells identified in a system [41]. Precision and recall were calculated based on the number of code smells marked as true positive (TP), false positive (FP) and false negative (FN). 
TP means that the developer identifies a code smell that appears in the code smell reference list. FP means that the developer identifies a code smell that does not appear in the code smell reference list. FN means that a code smell appears in the code smell reference list but the developer was unable to identify it. Both precision and recall are normalized in a range from 0 to 1. High precision values (close to 1) mean that the developer reported, proportionally, only a few occurrences of FP in the software system. High recall values (close to 1) mean that the developer was able to identify a representative number of occurrences of TP in the software system. Eqs. (1) and (2) present the formulas.

\[ \mathit{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \mathit{Recall} = \frac{TP}{TP + FN} \tag{2} \]
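To make Eqs. (1) and (2) concrete, the following Python sketch shows how precision and recall could be computed for one developer (or pair) against a reference list. It is an illustration only, not the tooling used in the study: the function name `precision_recall`, the tuple encoding of smell instances, and the example report are hypothetical. The reference set merely mirrors the instance counts reported for Java IO in Section 3.3.1, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision_recall(reported, reference):
    """Return (precision, recall) for one developer or pair.

    reported  -- set of smell instances reported by the developer (or pair)
    reference -- set of smell instances in the code smell reference list
    """
    tp = len(reported & reference)  # reported and present in the reference list
    fp = len(reported - reference)  # reported but absent from the reference list
    fn = len(reference - reported)  # present in the reference list but missed
    # Assumption of this sketch: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical reference list with the instance counts reported for Java IO:
# 1 Large Class, 2 Data Clumps, 1 Message Chain, and 7 Long Method instances.
reference = (
    {("LargeClass", 1), ("MessageChain", 1)}
    | {("DataClumps", i) for i in (1, 2)}
    | {("LongMethod", i) for i in range(1, 8)}
)

# Hypothetical report: 4 instances from the list plus 1 suspect not in it.
reported = {
    ("LargeClass", 1),
    ("LongMethod", 1),
    ("LongMethod", 2),
    ("DataClumps", 1),
    ("LongMethod", 99),
}

precision, recall = precision_recall(reported, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the report contains 4 TP, 1 FP, and 7 FN, so precision is 4/5 and recall is 4/11.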

We applied the two-tailed Mann-Whitney test, which is a non-parametric statistical test, aimed at rejecting our null hypotheses. The reason for selecting a non-parametric test is as follows. Based on the normality test, we observed that both distributions of precision and recall are normal. We consider a confidence level of 95%, which corresponds to a significance level (alpha) of 5% (p-value < 0.05), to compare the data distributions. However, after applying Levene’s test [46], we observed that the distribution of recall is not homoscedastic, which requires the application of a non-parametric test. To avoid applying different statistical tests for precision and recall, and due to the limited sample of our study, we decided to apply the non-parametric test to both metrics. We used the Minitab tool [47] to apply the statistical test.

3.4.3. Complementary discussion

As aforementioned, we conducted a quantitative analysis on the effectiveness of both the collaborative and the individual smell identification. In addition, we conducted a complementary analysis based on the follow-up questionnaire, which was applied after the experiment execution with the developers. This complementary analysis aimed at understanding the feedback of subjects regarding the experiment, mainly focused on the difficulties faced by the subjects to identify code smells.

The analysis aimed at understanding the subject viewpoint on the identification of code smells, especially collaborative smell identification.

3.5. Experiment design

We have designed four steps to guide our controlled experiment. We applied these steps in two sessions of the experiment, each with a group of participants classified according to their level of working experience. The first session was conducted with the novice developers, while the second session was conducted with the professional developers. We asked all subjects to first fill out and sign a consent questionnaire. After that, the subjects engaged in the experiment.

3.5.1. Step 1. Apply the subject characterization questionnaire

The subject characterization questionnaire aims to characterize the experiment subjects. The responses obtained through this questionnaire allowed us to identify some key characteristics of each subject, as presented in Section 3.2.

3.5.2. Step 2. Training of subjects

After characterizing the subjects, we provided a training session to the subjects. This training aimed at supporting subjects to properly understand and execute the experiment. The training was organized in two parts. First, during 25 minutes, we explained the technical concepts and terminology related to this study. Second, we took 10 minutes to conduct a discussion about the concepts. Regarding code smells, we provided explicit definitions and practical examples. This training was provided to both novice and professional developers.

3.5.3. Step 3. Smell identification task

The experiment tasks were conducted in four rounds. In the first round, 26 subjects (s1 to s16 and s25 to s34) individually inspected Java IO. In the second round, the same subjects collaboratively inspected Java Print. In the third round, 8 subjects (s17 to s24) individually inspected Java Print. In the fourth round, the same subjects collaboratively inspected Java IO. The teams of subjects were allocated randomly. 
In each round, the subjects were asked to annotate the identified code smells in the code smell report questionnaire. This procedure allowed us to compute the number of true positives and false positives. All subjects performed the experiment simultaneously under the supervision of the researchers. Each round lasted 60 minutes, dedicated only to the identification of code smells. We did not swap the order of individual and collaborative tasks because previous studies [22,27] show that such order does not significantly affect the effectiveness of code smell identification.

3.5.4. Step 4. Answer the follow-up questionnaire

After participating in the experiment, the participants filled out a follow-up questionnaire. This questionnaire aimed at collecting the perception of each subject regarding the experiment. We aimed at understanding their opinion about the identification of code smells and the experience of working collaboratively to identify code smells.

4. Results of the controlled experiment

This section answers RQ1: Is the collaborative smell identification more effective than the individual smell identification? Section 4.1 analyzes the distribution of precision and recall to confirm or refute the hypotheses of Section 3.1. Section 4.2 complements the findings based on the analysis per subject.

4.1. Analysis of distribution for precision and recall

At first, we analyzed the distribution of precision for collaborators and single developers. We aimed at understanding whether the collaborators tend to obtain a higher precision in the code smell identification when compared to single developers. We also computed the average


Fig. 1. Distribution of precision for developers.

precision for collaborators and single developers, by summing their precision regardless of their working experience and dividing this value by the total number of participants. Thus, we investigate the alternative hypothesis HP1 as follows. Fig. 1 presents the distribution of precision for collaborators and single developers, respectively. The figure also indicates the average precision for both collaborators and single developers. Overall, we observe an average precision equal to 0.76 (76%) for collaborators against 0.49 (49%) for single developers. Our results suggest that the average precision of collaborators was 27 percentage points higher than that of single developers. By applying the Mann-Whitney test, we observed a significant difference between precision values (p-value = 0.001). In summary, our results lead us to reject the null hypothesis HP0 and accept HP1.

Next, we analyze the distribution of recall for collaborators and single developers. Similarly to precision, we compute the average recall as follows. First, we sum the recall of developers regardless of the working experience. Second, we divide this value by the total number of developers, which results in the average recall for both collaborators and single developers. Thus, we investigate the alternative hypothesis HR1 as follows. Fig. 2 presents the distribution of recall. We observed an average recall equal to 0.63 (63%) for collaborators against 0.27 (27%) for single developers. Our results suggest that collaborators had a 36-percentage-point

higher average recall than single developers. By applying the Mann-Whitney test, we observed a significant difference between recall values (p-value = 0.001). Consequently, our results led us to reject the null hypothesis HR0 and accept HR1.

In summary, our results for precision and recall suggest that developers tend to identify more smells when working collaboratively. Particularly, we observed that collaborators obtained higher precision and recall than single developers. These results have two main implications. First, collaborators tend to make fewer mistakes when identifying code smells, i.e., they obtain higher precision. Second, collaborators are able to identify a more representative number of code smells in the software systems than single developers, i.e., they obtain higher recall. In summary, our results lead us to Finding 1.

Finding 1: Collaborators tend to be more effective than single developers when identifying code smells in software systems.

Table 3 presents a complementary analysis per working experience (novices and professionals), aimed at assessing any biases on the results of precision and recall caused by the working experience of the developers. The first column lists the working experience. The second column lists the experiment groups (individual and collaborators). The third and fourth columns present the average and median precision. The fifth and sixth columns present the average and median recall.

Fig. 2. Distribution of recall for developers.
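As an illustration of the statistical comparison applied in this section, the sketch below implements a two-tailed Mann-Whitney test in plain Python, using the normal approximation with tie correction and no continuity correction. It rests on stated assumptions: the study used Minitab, whose exact computation may differ, the function names are hypothetical, and the two samples are made-up values rather than the experiment data.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_two_tailed(x, y):
    """Return (U, p-value) via the normal approximation with tie correction."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Tie correction: sum of (t^3 - t) over groups of t tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all observations identical: no evidence of difference
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p-value
    return u, p


# Made-up precision values for illustration only (not the study data).
individual = [0.40, 0.50, 0.33, 0.60, 0.45, 0.50, 0.25, 0.55]
collaborative = [0.75, 0.80, 0.66, 0.90, 0.70, 0.85, 0.60, 0.78]

u, p = mann_whitney_two_tailed(individual, collaborative)
alpha = 0.05  # significance level adopted in Section 3.4.2
print(f"U={u}, p={p:.4f}, reject null hypothesis: {p < alpha}")
```

For small samples an exact test is generally preferable to the normal approximation; the approximation is used here only to keep the sketch free of external dependencies.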


Table 3. Comparison of precision and recall per working experience.

| Working experience | Group | Precision (average) | Precision (median) | Recall (average) | Recall (median) |
|---|---|---|---|---|---|
| Novice | Individual | 0.61 | 0.50 | 0.25 | 0.27 |
| Novice | Collaborators | 0.74 | 0.76 | 0.72 | 0.76 |
| Professional | Individual | 0.38 | 0.39 | 0.29 | 0.28 |
| Professional | Collaborators∗ | 0.77 | 0.83 | 0.54 | 0.50 |

∗ Two pairs have one novice and one professional; we counted these pairs here.
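The gaps between novices and professionals in Table 3 can be checked with a few lines of Python. The sketch below only transcribes the table values; the names `table3` and `experience_gap` are illustrative and not part of the study's material.

```python
# Values transcribed from Table 3 as (average, median) per metric.
table3 = {
    ("Novice", "Individual"): {"precision": (0.61, 0.50), "recall": (0.25, 0.27)},
    ("Novice", "Collaborators"): {"precision": (0.74, 0.76), "recall": (0.72, 0.76)},
    ("Professional", "Individual"): {"precision": (0.38, 0.39), "recall": (0.29, 0.28)},
    ("Professional", "Collaborators"): {"precision": (0.77, 0.83), "recall": (0.54, 0.50)},
}


def experience_gap(group, metric):
    """Absolute novice-versus-professional gap in percentage points.

    Returns a tuple (gap of averages, gap of medians), rounded to integers.
    """
    novice = table3[("Novice", group)][metric]
    professional = table3[("Professional", group)][metric]
    return tuple(round(abs(n - p) * 100) for n, p in zip(novice, professional))


for group in ("Individual", "Collaborators"):
    for metric in ("precision", "recall"):
        avg_gap, median_gap = experience_gap(group, metric)
        print(f"{group:13s} {metric:9s} average gap={avg_gap} pp, median gap={median_gap} pp")
```

The gaps are absolute differences in percentage points, not relative changes.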

In general, we observe that the results of the complementary analysis for both working experiences confirm Finding 1, i.e., they show that collaborators tend to have higher precision and recall than single developers. However, by comparing both working experiences, we observed a non-negligible difference. In the case of single developers, novices and professionals differ in average and median precision (23 and 11 percentage points, respectively), and also in average and median recall (4 and 1 percentage points). In the case of collaborators, they differ in average and median precision (3 and 7 percentage points, respectively), and in average and median recall (18 and 26 percentage points). Thus, although working experience somehow affects Finding 1, it remains valid that collaboration improves precision and recall regardless of the working experience. This discussion leads us to Finding 2.

Finding 2: Regardless of the working experience of software developers, collaborators have performed more effectively than single developers in the identification of code smells.

4.2. Comparing precision and recall of collaborators and single developers

After analyzing the distribution of precision and recall presented in Section 4.1, we conducted a more detailed analysis. We aimed at deeply understanding the effectiveness of the collaborative smell identification per developer, who performed the identification of code smells both as a single developer and as a collaborator. First, we analyze precision as follows. Fig. 3 presents the precision of collaborators and single developers per subject. For each set of three consecutive bars (two gray bars and one black bar), we compare the precision of two developers as single developers with the precision of both developers as collaborators. Overall, 12 of the 17 sets (70.59%) obtained higher precision as collaborators. 
In addition, for 4 of the 5 remaining sets (38.47% of the total), at least one developer improved their precision when working as a collaborator (by comparing the gray bar of the single developers

with the black bar of the corresponding collaboration). It implies that collaboration improves the effectiveness of at least one developer involved in the collaboration (as observed for 4 out of the 5 cases), by reducing the number of incorrectly identified code smells.

By analyzing the follow-up questionnaire, we may draw additional conclusions on the benefits of the collaborative smell identification. All subjects stated that the collaboration minimized their frustrations and improved their confidence during the identification of code smells. For example, subject s16 said: “The collaboration has strengthened the communication between members and the possibility of a more precise analysis because four eyes see more than two... consequently we were more confident in our work.” In turn, subject s28 said: “The discussions with my partner were essential for understanding the long chaining of methods and for confirming the existence of Message Chain.”

Next, we analyzed recall as follows. Fig. 4 presents the recall values obtained by collaborators and single developers. Each set of three consecutive bars (two gray bars and one black bar) compares the recall of two subjects working as single developers with the recall of both together working as collaborators. Overall, 13 of the 17 sets (76.47%) obtained higher recall as collaborators. In addition, for 2 of the 4 remaining sets (50% of these remaining sets), at least one developer improved their recall when working as a collaborator; this result is observed by comparing the gray bars of single developers with the black bar of the corresponding collaboration. It reinforces our findings of Fig. 2 and suggests that collaborators tend to identify a more representative number of code smells in the software systems, when compared with single developers. By analyzing the follow-up questionnaire, we draw the following observations. 
We found that, during the collaborative smell identification, one collaborator was usually responsible for selecting a code smell suspect and, afterwards, both collaborators discussed the suspect. We have also found that collaborators have more confidence to confirm a code smell suspect when compared with single developers. Consequently, collaborators are able to identify a larger number of code smells in software systems. This observation is reinforced by the opinion of subjects such as s7, who said: “The greatest potential of working collaboratively was the possibility of adding different strategies to determine a code smell. This fact was only possible thanks to different experiences that each one of us has.” Finally, the follow-up form revealed the certainties and uncertainties of the subjects about each code smell suspect. For instance, when the collaborators were uncertain about a code smell suspect, they ended up not confirming it as an actual code smell. Thus, by relying on the comments of subjects like s7 and s28, we conclude that collaborators may exchange information and,

Fig. 3. Precision of collaborators and single developers per subject.


Fig. 4. Recall of collaborators and single developers per subject.



consequently, improve their effectiveness when compared with single developers, which leads us to Finding 3.

Finding 3: The exchange of information among collaborators has the potential to improve the effectiveness of the code smell identification.

5. Interviews with software project leaders

This section regards the interviews performed with software project leaders of real-world development teams. Section 5.1 introduces our interview goals and the procedures performed for collecting and analyzing data. Section 5.2 presents the interviewee characterization. Section 5.4 discusses the project leaders’ perceptions on the need for collaborating along the identification of code smells. Finally, Section 5.5 explores the leaders’ perceptions on the potential to adopt the collaborative smell identification in practical settings.

5.1. Interview goal

As a complement to the quantitative analysis presented in Sections 3 and 4, we decided to perform interviews with software project leaders. Our major goal was to understand whether, given our empirical evidence that collaborative smell identification can benefit development teams in different manners (e.g., by enhancing the identification precision), leaders: (1) acknowledge the need for collaborating and (2) would employ the collaborative smell identification in their respective teams. Based on a well-known guideline [30], we systematically designed our interview goal as follows.

• Analyze the perceptions of software project leaders,
• For the purpose of understanding the likeliness of leaders to adopt collaborative smell identification in practice,
• With respect to the need for collaborating and the potential adoption of collaborative smell identification in the leaders’ teams,
• From the perspective of project leaders of varied organizations,
• In the context of Java software systems that novice and professional developers are unfamiliar with, i.e., systems about which developers have no previous knowledge.

From the interview goal, we designed the following research question (RQ2).

RQ2. Are software project leaders likely to adopt the collaborative smell identification in their development teams?

Our study addresses RQ2 as follows. First, we have summarized the results of our quantitative study reported in Sections 3 and 4. Our goal was to familiarize participants with key evidence on the benefits of both collaborative and individual smell identification. Because the studies that provide insights on the collaborative versus individual smell identification are quite recent, our subjects were probably unaware of the effects that collaborating may have on the identification effectiveness. Asking participants about the use of either collaborative or individual smell identification without the proper background would provide us with insufficient results; participants would guess rather than reason about their responses. Second, we carefully designed a structured interview [48] in order to capture the perceptions of software project leaders on collaborative smell identification. Third, we performed a thematic synthesis [49] to aggregate the qualitative interview data by topic, aimed at properly answering RQ2.

5.2. Characterization of the interviewees

Table 4 summarizes the background of the five software project leaders that we recruited for interviewing. The first column identifies each leader, from L1 to L5. The second column informs the highest education level of each leader. All leaders have at least a Master’s degree in Computer Science or related areas; one leader holds a PhD and another is a postdoctoral researcher. The third column informs the leadership experience, in years, of each leader. Four out of the five leaders have more than one year of experience in leading development teams. The fourth column reports

Table 4. Characterization of software project leaders.

| ID | Education | Years as Leader | Avg. Size of Teams | Adopted Development Processes | Familiarity w/ Smells |
|---|---|---|---|---|---|
| L1 | PhD | 2 | 6 | Scrum | Totally agree |
| L2 | MSc | 2 | 6 | Agile, MPS.BR | Totally agree |
| L3 | Postdoc | 1 | 5 | Scrum | Totally agree |
| L4 | MSc | 5 | 6 | Agile, Kanban, Lean, Scrum∗ | Partially agree |
| L5 | MSc | 15 | 5 | CMMI, MPS.BR, Scrum | Totally agree |

∗ Plus one development process that is less popular and we were not able to cite.


the average team size led by each software project leader. All teams are composed of five or six developers on average. The fifth column lists the software development processes adopted by the development teams led by each project leader. All projects led by the interviewees adopt agile development practices, such as Scrum [50] and Lean [51]. The sixth column reports the familiarity of software project leaders with the concept of code smells. All leaders at least partially agreed that they are familiar with code smell definitions.

5.3. Interview design

We carefully designed and performed a four-step interview protocol. We describe each step in sequence.

5.3.1. Step 1: Recruit software project leaders.

Our first step consisted of recruiting software project leaders from different development organizations. As discussed in Section 5.2, five project leaders kindly volunteered to participate as interviewees. We first contacted these leaders via email or video-conference, and they agreed to have their interview data analyzed and reported anonymously.

5.3.2. Step 2: Instruct leaders about the interview questionnaire.

We decided to perform each interview separately in order to make the project leaders as comfortable as possible with answering the questions. Before applying the questionnaire, we instructed each leader with respect to (1) the general interview goal, (2) the privacy policies that would be adopted with respect to the interview data, and (3) the basics of code smells and code smell identification.

5.3.3. Step 3: Apply the questionnaire.

We applied the interview questionnaire with each software project leader. The questionnaire is composed of two questions, which we describe in sequence.



Question 1: Do you believe that discussion among development team members is necessary to either confirm or refute the occurrence of a code smells? Please justify yours answer. – This question captures the leaders’ opinions on the need for collaborating while validating smell suspects. Question 2: As a team leader, would you allocate two or more development team members for either confirming or refuting the occurrence of code smells? Please, justify your answer. – Complementarily, this question aims to capture the potential adoption of collaborative smell identification in real settings, also from a project leader’s perspective.


5.3.4. Step 4: Perform the thematic synthesis. This step consisted of applying thematic synthesis procedures [49] for understanding the interviewees’ answers. Since both questions posed to the project leaders are open-ended, the leaders were free to answer however they wanted. Thus, we decided: (1) to tabulate all answers by question; (2) for each question, to extract the main discussion topics that emerged from each answer; and (3) to derive the themes by grouping similar discussion topics. We performed all three procedures in a pair, aiming to avoid biases and missing data. After performing the thematic synthesis, we built visual models that represent the data (see Sections 5.4 and 5.5 for details).

5.4. On the need for collaborating along the identification of code smells

We present below the full answer of all five leaders to Question 1.

Yes! Code smells are subjective, thus a development team has to agree on the validity of code smell suspects. Discussions help in deciding which suspects to eliminate based on the code quality. – L1

I think so. Some code smells affect different modules of a software project. One or more developers may be responsible for managing each single module. Thus, discussions could make easier to identify code smells. – L2

There may be some gain in collaboration to identify code smells. Similarly to techniques like code inspection, collaborators may be more effective than single developers, especially when the task is reasonably complex. – L3

Yes! My personal leadership experience suggests that brainstorms with divergent opinions is better for reaching consensus. – L4

I believe that discussion may benefit development teams, once developers typically have different experiences. – L5

Fig. 5 visually represents the qualitative data extracted from the Question 1 answers via thematic synthesis procedures. The top box shortens the interview question. The boxes immediately below (Leader Attributes and Code Smell Properties) represent the major discussion themes. The boxes below (Response Confidence, Smell Subjectivity, and Smell Structure) represent discussion topics. Finally, the bottom boxes (Experience, Consensus, etc.) represent the discussion subtopics. The lines connecting two boxes represent either a theme-topic or a topic-subtopic relationship. We discuss the obtained results as follows.



• All leaders agreed that collaboration is important for validating, i.e., confirming or refuting, smell suspects (cf. Response Confidence in Fig. 5). However, three leaders (L2, L3, and L5) were not strongly confident in their answers. This observation relies on quotes like “I think so” (L2), “There may be some gain...” (L3), and “may benefit” (L5).
• Leaders have mentioned subjective aspects that may positively affect the collaborative smell identification (cf. Smell Subjectivity). These aspects are Experience (“developers typically have different experiences” by L5), Consensus (“reaching consensus” by L4), and Authorship (“be responsible for managing each single module” by L2).
• Leaders also mentioned structural aspects of code smells that may be better perceived by developers working in collaboration rather than individually. These aspects are Granularity (“Some code smells affect different modules” by L2) and Complexity (“the task is reasonably complex” by L3). The complexity of code smell detection is often associated with the complexity of the poor code structure underlying the code smell instance [4,5].

Finding 4: The interviewed project leaders agree that collaboration is important when validating code smell instances. Their responses typically associate a potential benefit of collaboration with the inherent subjectivity of code smell identification and the underlying code smell structure.

Fig. 5. Model of themes and topics of Question 1.

Fig. 6. Model of themes and topics of Question 2.
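The grouping step of the thematic synthesis (Section 5.3.4) can be illustrated with a small sketch. It is only an illustration, not the procedure or tooling used in the study: the (leader, topic, subtopic) triples are transcribed from the Question 1 discussion above, while the names `CODED_MENTIONS` and `group_by_topic` are hypothetical.

```python
from collections import defaultdict

# (leader, topic, subtopic) triples transcribed from the Question 1 discussion.
# The variable and function names are hypothetical, chosen for illustration.
CODED_MENTIONS = [
    ("L5", "Smell Subjectivity", "Experience"),
    ("L4", "Smell Subjectivity", "Consensus"),
    ("L2", "Smell Subjectivity", "Authorship"),
    ("L2", "Smell Structure", "Granularity"),
    ("L3", "Smell Structure", "Complexity"),
]


def group_by_topic(mentions):
    """Group coded mentions as topic -> subtopic -> list of leaders."""
    grouped = defaultdict(lambda: defaultdict(list))
    for leader, topic, subtopic in mentions:
        grouped[topic][subtopic].append(leader)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {topic: dict(subs) for topic, subs in grouped.items()}


model = group_by_topic(CODED_MENTIONS)
for topic, subtopics in model.items():
    for subtopic, leaders in subtopics.items():
        print(f"{topic} -> {subtopic}: {', '.join(leaders)}")
```

The resulting nested dictionary mirrors the topic-subtopic relationships drawn as lines between boxes in the visual model.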

5.5. On the adoption of the collaborative smell identification

We present below the full answer of all five leaders to Question 2.

I would love to. Discussions can help teams in enhancing the code quality, but first I would opt for asynchronous discussions via code review tools (so that developers use in-line code comments to document their viewpoints). – L1

Yes! Merging opinions can leverage the accuracy in the identification of certain code smell types, mainly those that affect different modules. – L2

I would first identify code smell suspects via static analysis. Depending on project criticality, desired code quality, etc., I would not hesitate to promote collaboration towards enhanced precision, mainly to validate suspects. – L3

Yes, because I suppose that collaborators are more likely to identify and validate smell suspects than individuals. Each individual has his own opinion and, therefore, discussions among collaborators can reveal varied viewpoints. – L4

I would but, unfortunately, both budget and time are usually scarce to allocate developers for inspecting source code. – L5

Fig. 6 visually represents the qualitative data extracted from the Question 2 answers via thematic synthesis procedures. The top box shortens the interview question. The boxes immediately below (Leader Attributes, Organization Attributes, Collaboration Goal, Developers Arrangement) represent the major discussion themes. The boxes below (Response Confidence, Availability of Resources, etc.) represent discussion topics. Finally, the bottom boxes (Consensus, etc.) represent the discussion subtopics whenever applicable. The lines connecting two boxes represent either a theme-topic or a topic-subtopic relationship. We discuss the obtained results as follows.

• Surprisingly, all leaders said they would adopt the collaborative smell identification in their development teams (cf. Response Confidence in Fig. 6). In this case, only two leaders (L3 and L5) were not strongly confident in their answers. This observation relies on quotes like “Depending on project criticality...” (L3) and “I would but...” (L5).
• Leaders have mentioned organization attributes that are decisive to the practical adoption of collaborative smell identification (cf. Organization Attributes). These aspects are Availability of Resources (“both budget and time are usually scarce” by L5) and Availability of Tools (“via code review tools” by L1).
• Leaders also mentioned various goals behind allocating developers to collaborate along the code smell identification. These goals are Code Quality Improvement (“enhancing the code quality” by L1), Validation Consensus (“can reveal varied viewpoints” by L4), and Accuracy Enhancement (“leverage the accuracy” by L2).
• Finally, leaders mentioned different ways to arrange collaborators: asynchronous identification (“I would opt for asynchronous discussions...” by L1) and synchronous identification. We considered the latter category for L2, L4, and L5 because they did not make explicit the need for asynchronous discussions among developers (we then suppose that they would arrange collaborators to work at the same time and, sometimes, at the same location).

Finding 5: All interviewed project leaders said they would adopt the collaborative smell identification for different reasons, including the accuracy enhancement observed through our quantitative study. However, budget, time, and tool constraints can limit such adoption in practice.

6. Study comparison-based discussions

With respect to our controlled experiment, we partially reused the study design of our previous studies [22,23] to better understand the effectiveness of collaborative smell identification. Thus, we conducted an empirical study with novice and professional developers, who had to identify code smells in software systems they are unfamiliar with. Due to the considerable number of changes applied to the existing study designs, we do not characterize our study as purely a replication. In particular, we have added a qualitative study based on interviews [49] with software project leaders. Our empirical results suggest that both novice and professional developers obtained 27% more precision and 36% more recall through the collaborative smell identification. Besides that, we have found a significant difference between the effectiveness of single developers and that of collaborators.

Information about our first previous study. In our previous efforts [23], we provided empirical evidence on the effectiveness of collaborative smell identification. In the first study [23], we conducted an empirical study with novice developers, who had to identify code smells in software systems they are unfamiliar with. As a result of this study, we observed that collaborators identify a higher average number of code smells because they almost always had to combine the knowledge of both developers to reveal scattered, complementary symptoms associated with a single smell type. We also derived empirical evidence that adding more than two developers to the task of collaborative smell identification does not necessarily improve their identification effectiveness.


Also, we observed that novice developers perform several collaborative activities during the identification of code smells.

Comparison between this study and our first study [23]. In our previous study, we assessed the collaborative smell identification from the perspective of novice developers who are unfamiliar with the software systems under analysis. In other words, the developers had no previous knowledge about the software system in which they identified code smells. To compare the effectiveness of collaborators and single developers, we computed the average number of identified code smells. However, considering only the average number of identified smells was a threat to validity in that study, since this average does not reveal important aspects of identification effectiveness that other metrics, such as precision and recall, can capture. Therefore, in this study, we evaluated precision and recall and, surprisingly, the results confirm that collaboration benefits smell identification even in terms of these metrics. There are two implications of our study results. The first implication is that novice developers working in collaboration have shown themselves competent enough to identify smells under different perspectives of effectiveness. The second implication is that organizations do not need to allocate professional developers, who should focus on implementing new software features, to identify code smells. Instead, organizations can allocate novice developers for identifying code smells in collaboration, thereby promoting knowledge sharing and the correct identification of code smells. Our interview data reinforce that leaders are concerned about knowledge exchange among team members (as we can see from the answers of L4 and L5, for instance). The following is the recommendation:

Recommendation 1: Organizations can save resources by allocating novice developers rather than professional developers to effectively identify code smells.

Information about our second previous study. In the second study [22], we conducted two sessions of an exploratory case study aimed at understanding the effectiveness of collaborative smell identification in industry. In other words, we aimed at understanding the effectiveness of smell identification from the viewpoint of professional developers who are familiar with the inspected software systems. As a result of this study, we observed that collaborators are also more effective than single developers in the identification of code smells. This observation confirms our previous findings in [23]. In addition, we observed that collaborators often benefited from knowledge exchange to identify certain smell types. Also, we observed that developers require several types of information in order to confirm or refute a code smell suspect.

Comparison between this study and our second study [22]. In our second previous study, we assessed the collaborative smell identification from the viewpoint of professional developers who are familiar with the inspected software systems. We used precision and recall metrics to compare developers’ effectiveness in both scenarios (individually and collaboratively). The difference between our previous study [22] and this study is directly related to two aspects: working experience and familiarity with the inspected systems. Prior to this study, there was a lack of knowledge on the effectiveness of collaborative identification from the viewpoint of professional developers who are unfamiliar with the software systems under analysis.
Surprisingly, we have found that, even when professional developers are unfamiliar with a system, they perform better in identifying code smells collaboratively rather than individually. One immediate implication of our results is that organizations do not need to allocate developers familiar with the system, who have multiple fundamental responsibilities in software development, to simply identify code smells. Thus, these developers can concentrate their efforts on addressing the clients’ needs. In this case, we recommend that organizations allocate developers who are unfamiliar with the system to identify code smells, since they will perform as effectively as developers with higher familiarity with the system. On the other hand, professional developers familiar with the system could be allocated only to determine how critical the identified smells are. This observation is reinforced by the leaders’ viewpoints on the criticality of code smells (L3) and the authorship of program modules by different developers (L2). The following is the recommendation:

Recommendation 2: Organizations can save resources by allocating developers that are unfamiliar with the system rather than the familiar ones to effectively identify code smells.

Information about our third previous study. In the third study [27], we aggregated the data points related to novice developers unfamiliar with the inspected software systems [23] with the data points related to professional developers that are familiar with the systems [22]. We then proposed a classification of subjects by professional background, i.e., the participant’s knowledge of code smell identification, software development experience, and other related topics: no, little, middle, and high background. As a result, we found evidence that collaboration significantly improves the precision of the identification tasks for both novice and professional developers regardless of the professional background. We also found evidence that, when working collaboratively, the precision reached by professional developers is slightly higher than the one reached by novice developers, especially when identifying smell types that are more complex and require inspecting multiple code elements.

Comparison between this study and our third study [27]. Similarly to our third study [27], we observed that the collaborative smell identification is more effective than the individual smell identification regardless of the working experience.
It is worth mentioning that the interviewed project leaders have this concern about enhancing the identification accuracy (L2) and promoting the smell validation consensus (L2, L4). The novelty of our current study is that our observations are based on the identification of code smells performed on the same software system, in order to identify the same set of smell types, rather than in different settings whose comparison has various threats to validity. As an implication, we now have more reliable results that suggest the adoption of collaborative smell identification by organizations, especially those that (1) lack enough professional developers to work on addressing the clients’ needs and also identify code smell suspects for validation and elimination if necessary but (2) have sufficient novice developers to allocate for working collaboratively.

Recommendation 3: Organizations can mostly allocate novice developers for identifying code smells collaboratively, except when there is a need for identifying too complex and scattered smell types, which might require the inspection by professional developers.

7. Threats to validity

7.1. Construct validity

We have restricted our study to the analysis of a limited set of code smell types, which may have affected our findings. However, we reduce this threat by selecting four diverse types of code smells. These types occur in different code elements and are reportedly common in software systems (Section 3.3). Regarding the creation of the reference list of code smells, we recruited two PhD students with knowledge in software development and the identification of code smells. Thus, we mitigate possible threats by engaging researchers that are sufficiently qualified for such creation (Section 3.4).
Regarding the absence of static analysis tools, we highlight that our goal was not to investigate the impact of a specific tool on smell identification tasks; instead, our goal was to investigate how developers collaboratively identify code smells. Moreover, if we had used a tool during the experiment, the results could be completely dependent on the intricacies of this particular tool. In other words, the use of a tool would introduce significant bias to the experiment. With respect to the different backgrounds of the subjects, we mitigated this threat by selecting subjects with at least a minimum knowledge of topics of interest, such as Java and code smells (Section 3.2). In addition, all subjects underwent the training sessions to normalize their background. We followed strict guidelines [30,48] in order to elaborate the interview protocol and artifacts. All procedures were double-checked by two paper authors, aiming to mitigate threats such as invalid and useless interview questions. As discussed in Section 5.1, we trained the interviewees about the key benefits of both collaborative and individual smell identification prior to the interview execution. We avoided biasing the interviewees’ opinions by comparing both approaches equally and providing evidence on the applicability of each approach.

7.2. Internal validity

Regarding the communication among subjects during the experiment execution, we mitigated this threat by limiting such communication so that it interfered little with their answers. We also explained the experimental tasks to all subjects, aiming at avoiding misunderstandings and reducing the communication among subjects. As far as the experiment execution is concerned, we wanted developers to have enough time available to finish the identification of code smells. We performed some experiment simulations, in which a set of participants (some of them working in pairs, others working individually) ran the experiment. Sixty minutes were enough to complete the experiment in both cases, since no participant complained about the time. We did not inform the participants of the total time of the experiment, but after 60 minutes they were asked to conclude their participation. This decision prevented participants from worrying about time during the identification of code smells. Due to the aforementioned observations, we decided to keep the predefined experiment time limit. Through these simulations, we also identified opportunities for improving the experiment.

We did not swap the order of individual and collaborative tasks during the controlled experiment (Section 3.5). Although we did not employ a cross-over design [52] in terms of individual and collaborative smell identification, this decision was taken because previous studies [22,27] observed that such a swap does not significantly affect code smell identification. We tried to minimize threats related to the difference between target systems by swapping the order of the target systems across experiment rounds. Rounds 1 and 3 counted on the inspection of Java IO first; Rounds 2 and 4 counted on the inspection of Java Print first. However, we did not swap the order of the target systems within each experiment round, e.g., splitting the 26 participants of Round 1 in two groups, one for inspecting Java IO first, and the other inspecting Java Print first. Not swapping systems may have favored code smell identification in the system analyzed in Rounds 2 and 4. This possible bias can eventually occur because participants may have learned how to identify smells while inspecting a different system in Rounds 1 and 3. Future research could address this threat by using a single target system in different rounds.

We performed each interview with the software project leaders in isolation in order to make leaders as comfortable as possible with answering the interview questions. We also instructed the leaders and answered their general questions before they filled in the interview questionnaire. Thus, we expected to ensure that all leaders understood each interview question.

7.3. Conclusion validity

To conduct the data analysis, we carefully selected the most appropriate statistical tests. We also paid special attention to avoid violating assumptions of the selected statistical tests. To answer our research question, we applied the Mann-Whitney test [31] as discussed in Section 3.4.
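As a rough illustration of the rank-based comparison mentioned above, the following sketch computes the Mann-Whitney U statistic directly from its pairwise definition. It is a minimal sketch under stated assumptions: the function name `mann_whitney_u` and the precision values are hypothetical, the data are not taken from the experiment, and no p-value is derived (a statistics package would normally be used for that).

```python
def mann_whitney_u(sample_x, sample_y):
    """Return the U statistic of sample_x against sample_y.

    Each (x, y) pair contributes 1 when x > y and 0.5 when x == y,
    which is the pairwise definition of the statistic.
    """
    u = 0.0
    for x in sample_x:
        for y in sample_y:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical precision values, for illustration only.
collaborators = [0.50, 0.60, 0.75, 0.80]
single_developers = [0.30, 0.40, 0.50, 0.55]

u_collab = mann_whitney_u(collaborators, single_developers)
u_single = mann_whitney_u(single_developers, collaborators)

# The two statistics always sum to the number of pairs (n_x * n_y).
print(u_collab, u_single)
```

A U value close to the total number of pairs for one group indicates that its observations tend to rank above those of the other group.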


Furthermore, we believe that our questionnaires fit our expectations with the empirical study and support answering our research question. For instance, they allowed us to characterize the experienced and inexperienced developers. Thus, we mitigate possible threats related to the data analysis through the exclusive analysis of data collected from the questionnaires.

As previously discussed in Internal Validity, the inspection in different rounds without properly swapping the systems may have influenced our study results. Aiming at capturing confounding factors, we scrutinized the precision and recall results for Java Print and Java IO regardless of the inspection approach, i.e., individual and collaborative smell identification. Mean precision was quite similar across systems: 0.60 for Java IO against 0.55 for Java Print. Thus, the target system had little influence on the precision results. Conversely, the results changed considerably across systems in terms of recall: mean recall equals 0.26 for Java IO against 0.57 for Java Print. To understand this difference, we highlight that, in spite of the two systems exhibiting similar complexity in several aspects, Java IO is two times larger in size than Java Print. Additionally, unlike precision, recall considers false negatives. Since Java IO is larger, the probability of missing a code smell instance in this system is indeed greater than in Java Print. Thus, one could consider that the target system may have somehow influenced our key study result (i.e., does collaboration indeed outperform individualism?). However, we compared the recall of collaborators and single developers (based on Fig. 4) when both are inspecting Java IO (i.e., from pairs 17&19 to pairs 23&24). This comparison reveals there is only a minor difference in recall for most pairs (against their individual counterparts from 17 to 24), except for the pair 21&22 against the developer 21. In any case, as one considers the balance between precision and recall, the collaborators still clearly outperform single developers in the Java IO case, including the pair 21&22. This can be confirmed by combining the results of Figs. 3 and 4. We noticed along the experiment that the reason for this behavior is that collaborators tend to focus much more on precision than recall when modules become larger in size (i.e., in the Java IO case). They tend to discuss more and be more reflective when deciding whether a code fragment has a smell. In fact, the precision of the aforementioned collaborators (Fig. 3) is much higher than that of the respective single developers for the Java IO case. Therefore, collaborative smell identification tends to be much superior in precision even for larger modules, possibly to the detriment of recall improvement. This reflects what developers would do very often while identifying opportunities for code refactoring [24,53]. In practice, developers tend to focus on a subset of smells in each code revision session, i.e., those smells for which they are absolutely sure the affected code should be refactored [22,54–58].

Finally, as far as the interviews are concerned, we carefully applied thematic synthesis procedures [49] in order to avoid biases in the qualitative data analysis. Two of the paper authors performed the procedures in a pair for identifying and fixing missing and incorrect data.
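The relationship between precision, recall, and false negatives discussed in this section can be sketched as follows. This is a minimal illustration, assuming that identified smells are compared as a set against a reference list; the function name `precision_recall` and all smell instances are hypothetical and do not come from the experiment data.

```python
def precision_recall(identified, reference):
    """Compute precision and recall of identified smells against a reference list."""
    identified, reference = set(identified), set(reference)
    true_positives = len(identified & reference)
    # Precision: share of reported smells that are in the reference list.
    precision = true_positives / len(identified) if identified else 0.0
    # Recall: share of reference smells that were reported; every missed
    # reference smell (false negative) lowers it.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference list and answers, for illustration only.
reference = {
    ("ClassA", "Long Method"),
    ("ClassB", "God Class"),
    ("ClassC", "Feature Envy"),
    ("ClassD", "Long Method"),
}
single_answer = {("ClassA", "Long Method"), ("ClassE", "God Class")}
pair_answer = {
    ("ClassA", "Long Method"),
    ("ClassB", "God Class"),
    ("ClassE", "God Class"),
}

print(precision_recall(single_answer, reference))
print(precision_recall(pair_answer, reference))
```

Because the reference list is the denominator of recall, a larger system with more reference instances leaves more room for false negatives, while precision depends only on what was actually reported.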

7.4. External validity

Our study has some possible threats related to the generalization of findings. First, we conducted the study with Brazilian developers, who may not represent all development scenarios. In addition, although we spent a period of one month engaging novice and professional developers in our study, the set of subjects is limited to 18 novice developers and 16 professional developers. We minimize possible threats regarding the set of subjects by involving developers with varied backgrounds and levels of working experience. We also focused on developers with minimum experience with topics of interest, such as code smells and pair programming.


Even though one could consider our population as limited, we did our best to involve novice and professional developers in the identification of code smells. Moreover, many experiments in software engineering have even fewer participants than our experiment due to the difficulty in finding participants. Such limitation is a consequence of a lack of adequate sampling frames available in the field, making the identification of representative samples difficult [59]. Regarding the interviews, we recruited as many software project leaders as possible. Unfortunately, it is quite challenging to recruit leaders from different organizations with availability for the interview. We acknowledge that our qualitative results are not representative of most leaders. However, the five leaders that participated in our interviews not only helped us to confirm previous assumptions, but also revealed interesting insights on the potential application of the collaborative smell identification in practical settings.

8. Final remarks

This paper expands the current knowledge about collaborative smell identification through an empirical study conducted with 34 developers in a previously unexplored scenario: novices and professionals inspecting systems they are unfamiliar with. We summarize our study findings as follows.

• The average precision of collaborators was 27% higher than the average of single developers on the identification of code smells. Thus, collaborators tend to identify more actual code smells than single developers.
• The average recall of collaborators was 36% higher than the average of single developers on the identification of code smells. That is, the identification of code smells performed by collaborators has a higher coverage than by single developers.
• The exchange of information allowed by collaboration is essential to improve the effectiveness of the code smell identification. We observed that collaborators share knowledge and complement each other. Consequently, it improves their confidence in confirming a code smell suspect.
• All interviewed project leaders agree that collaboration is important to validate smell suspects. From a leader’s perspective, collaboration can help in handling the subjective and complex nature of code smells.
• Complementarily, leaders are unanimous in stating that they would adopt the collaborative smell identification in practice, except in particular cases when resources are too limited.

8.1. A recommendation to development organizations

Organizations may allocate novice developers for identifying code smells in collaboration. The major benefit would be promoting knowledge sharing and the correct identification of code smells. An advantage of allocating novice rather than professional developers to such identification is that professional developers can focus on addressing the clients’ needs, by adding new features to the system, for instance. In addition, we observe that the effectiveness of identifying code smells is similar for developers either familiar or unfamiliar with the system under inspection. Thus, organizations can simply allocate the unfamiliar ones to the collaborative smell identification and leave the familiar ones to address the users’ needs.

8.2. Enhancing current tool support for code smell identification

A previous work summarizes the existing tools for code smell identification [25]. Unfortunately, most of those tools provide little or no support to developer collaboration during the inspection of code smell candidates. Thus, developers may still struggle with validating candidates towards a better identification effectiveness [22,23]. We hypothesize that incorporating collaboration in the current tools could considerably leverage the effectiveness of code smell identification. The effectiveness benefits could extrapolate to code review in general, which strongly depends on developer discussions [39,40,42,60]. Based on our aforementioned findings, we expect to provide sufficient evidence for organizations concerned about the effectiveness of smell identification, which should consider adopting the collaborative smell identification in practice. As future work, we plan to investigate how novice and professional developers can collaborate in a more comprehensive scenario, in which (1) novice developers perform the collaborative smell identification and, after that, (2) professional developers, especially those that are familiar with the inspected system, prioritize the identified smell instances according to their impact on system maintainability.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially funded by CNPq (grants 434969/2018-4, 312149/2016-6, and 409536/2017-2), CAPES/Procad (grant 175956), and FAPERJ (grant 22520-7/2016).

References

[1] K. Bennett, V. Rajlich, Software maintenance and evolution: a roadmap, in: Conference on the Future of Software Engineering, Co-located with the 22nd International Conference on Software Engineering (ICSE), 2000, pp. 73–87.
[2] I. Macia, R. Arcoverde, A. Garcia, C. Chavez, A. von Staa, On the relevance of code anomalies for identifying architecture degradation symptoms, in: 16th European Conference on Software Maintenance and Reengineering (CSMR), 2012, pp. 277–286.
[3] M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, D. Poshyvanyk, When and why your code starts to smell bad (and whether the smells go away), IEEE Trans. Softw. Eng. 43 (11) (2017) 1063–1088.
[4] F. Palomba, G. Bavota, M.D. Penta, R. Oliveto, A.D. Lucia, Do they really smell bad? A study on developers’ perception of bad code smells, in: 30th International Conference on Software Maintenance and Evolution (ICSME), 2014, pp. 101–110.
[5] A. Yamashita, L. Moonen, Do code smells reflect important maintainability aspects? in: 28th International Conference on Software Maintenance (ICSM), 2012, pp. 306–315.
[6] M. Fowler, Refactoring: Improving the Design of Existing Code, first ed., Addison-Wesley Professional, 1999.
[7] M. Lanza, R. Marinescu, Object-oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-oriented Systems, first ed., Springer Science & Business Media, 2006.
[8] W. Oizumi, A. Garcia, L. Sousa, B. Cafeo, Y. Zhao, Code anomalies flock together: exploring code anomaly agglomerations for locating design problems, in: 38th International Conference on Software Engineering (ICSE), 2016, pp. 440–451.
[9] E. Fernandes, G. Vale, L. Sousa, E. Figueiredo, A. Garcia, J. Lee, No code anomaly is an island: anomaly agglomeration as sign of product line instabilities, in: 16th International Conference on Software Reuse (ICSR), 2017, pp. 48–64.
[10] M. Silva, M.T. Valente, R. Terra, Does technical debt lead to the rejection of pull requests? in: 12th Brazilian Symposium on Information Systems (SBSI), 2016, pp. 248–254.
[11] A. Chávez, I. Ferreira, E. Fernandes, D. Cedrim, A. Garcia, How does refactoring affect internal quality attributes? A multi-project study, in: 31st Brazilian Symposium on Software Engineering (SBES), 2017, pp. 74–83.
[12] D. Cedrim, A. Garcia, M. Mongiovi, R. Gheyi, L. Sousa, R. de Mello, B. Fonseca, M. Ribeiro, A. Chávez, Understanding the impact of refactoring on smells: a longitudinal study of 23 software projects, in: 11th Symposium on the Foundations of Software Engineering (FSE), 2017, pp. 465–475.
[13] M. Kim, T. Zimmermann, N. Nagappan, An empirical study of refactoring: challenges and benefits at Microsoft, IEEE Trans. Softw. Eng. 40 (7) (2014) 633–649.
[14] E. Murphy-Hill, C. Parnin, A. Black, How we refactor, and how we know it, IEEE Trans. Softw. Eng. 38 (1) (2012) 5–18.
[15] D. Silva, N. Tsantalis, M.T. Valente, Why we refactor? Confessions of GitHub contributors, in: 24th International Symposium on Foundations of Software Engineering (FSE), 2016, pp. 858–870.
[16] R. Arcoverde, I. Macia, A. Garcia, A. Von Staa, Automatically detecting architecturally-relevant code anomalies, in: 3rd International Workshop on Recommendation Systems for Software Engineering (RSSE), co-located with 34th International Conference on Software Engineering (ICSE), 2012, pp. 90–91.
[17] I. Macia, J. Garcia, D. Popescu, A. Garcia, N. Medvidovic, A. von Staa, Are automatically-detected code anomalies relevant to architectural modularity? an exploratory analysis of evolving systems, in: 11th International Conference on Aspect-oriented Software Development (AOSD), 2012, pp. 167–178.
[18] J. Oliveira, R. Gheyi, M. Mongiovi, G. Soares, M. Ribeiro, A. Garcia, Revisiting the refactoring mechanics, Inf. Softw. Technol. 110 (2019) 136–138.
[19] A.C. Bibiano, E. Fernandes, D. Oliveira, A. Garcia, M. Kalinowski, B. Fonseca, R. Oliveira, A. Oliveira, D. Cedrim, A quantitative study on characteristics and effect of batch refactoring on code smells, in: 13th International Symposium on Empirical Software Engineering and Measurement (ESEM), 2019, pp. 1–11.
[20] T. Paiva, A. Damasceno, E. Figueiredo, C. Sant’Anna, On the evaluation of code smells and detection tools, J. Softw. Eng. Res. Dev. 5 (1) (2017) 7:1–7:28.
[21] C. Conceicao, G. Carneiro, F.B. e Abreu, Streamlining code smells: using collective intelligence and visualization, in: 9th International Conference on the Quality of Information and Communications Technology (QUATIC), 2014, pp. 306–311.
[22] R. Oliveira, L. Sousa, R. de Mello, N. Valentim, A. Lopes, T. Conte, A. Garcia, E. Oliveira, C. Lucena, Collaborative identification of code smells: a multi-case study, in: 39th International Conference on Software Engineering (ICSE): Software Engineering in Practice Track (SEIP), 2017, pp. 33–42.
[23] R. Oliveira, B. Estácio, A. Garcia, S. Marczak, R. Prikladnicki, M. Kalinowski, C. Lucena, Identifying code smells with collaborative practices: a controlled experiment, in: 10th Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS), 2016, pp. 61–70.
[24] L. Sousa, A. Oliveira, W. Oizumi, S. Barbosa, A. Garcia, J. Lee, M. Kalinowski, R. de Mello, B. Fonseca, R. Oliveira, et al., Identifying design problems in the source code: a grounded theory, in: 40th International Conference on Software Engineering (ICSE), 2018, pp. 921–931.
[25] E. Fernandes, J. Oliveira, G. Vale, T. Paiva, E. 
Figueiredo, A review-based comparative study of bad smell detection tools, in: 20th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2016, pp. 18:1–18:12. T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (8) (2006) 861–874. R. de Mello, R. Oliveira, A. Garcia, On the influence of human factors for identifying code smells: a multi-trial empirical study, in: 11th International Symposium on Empirical Software Engineering and Measurement (ESEM), 2017, pp. 68–77. C. Bird, N. Nagappan, B. Murphy, H. Gall, P. Devanbu, Don’t touch my code! Examining the effects of ownership on software quality, in: 19th Symposium on the Foundations of Software Engineering (FSE), 2011, pp. 4–14. B. Boehm, Software risk management: principles and practices, IEEE Softw. 8 (1) (1991) 32–41. C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, A. Wesslén, Experimentation in Software Engineering, first ed., Springer Science & Business Media, 2012. S. Siegel, N. Castellan Jr, Nonparametric Statistics for the Behavioral Sciences, second ed., McGraw-Hill, 1988. S. Vidal, E. Guimaraes, W. Oizumi, A. Garcia, A.D. Pace, C. Marcos, Identifying architectural problems through prioritization of code smells, in: 10th Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS), 2016, pp. 41–50. J. Garcia, D. Popescu, G. Edwards, N. Medvidovic, Toward a catalogue of architectural bad smells, in: 5th International Conference on the Quality of Software Architectures (QoSA), 2009, pp. 146–162. R. Marticorena, C. López, Y. Crespo, Extending a taxonomy of bad code smells with metrics, 7th International Workshop on Object-Oriented Reengineering (WOOR), 2006. S. Chidamber, C. Kemerer, A metrics suite for object oriented design, IEEE Trans. Softw. Eng. 20 (6) (1994) 476–493. E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, first ed., Addison-Wesley Professional, 1994. N. 
Tsantalis, T. Chaikalis, A. Chatzigeorgiou, Ten years of JDeodorant: lessons learned from the hunt for smells, in: 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018, pp. 4–14. E. Murphy-Hill, A. Black, Seven habits of a highly effective smell detector, in: 1st International Workshop on Recommendation Systems for Software Engineering (RSSE), 2008, pp. 36–40. S. McIntosh, Y. Kamei, B. Adams, A. Hassan, The impact of code review coverage and code review participation on software quality: a case study of the Qt, VTK, and

Information and Software Technology 120 (2019) 106242

[40]

[41]

[42]

[43] [44] [45] [46]

[47]

[48] [49]

[50]

[51]

[52]

[53]

[54]

[55]

[56] [57]

[58]

[59]

[60]

ITK projects, in: 11th Working Conference on Mining Software Repositories (MSR), 2014, pp. 192–201. A. Bacchelli, C. Bird, Expectations, outcomes, and challenges of modern code review, in: 35th International Conference on Software Engineering (ICSE), 2013, pp. 712–721. E. Fernandes, P. Souza, K. Ferreira, M. Bigonha, E. Figueiredo, Detection strategies for modularity anomalies: an evaluation with software product lines, in: 14th International Conference on Information Technology: New Generations (ITNG), 2018, pp. 565–570. A. Begel, N. Nagappan, Pair programming: what’s in it for me? in: 2nd International Symposium on Empirical Software Engineering and Measurement (ESEM), 2008, pp. 120–128. G. Braught, J. MacCormick, T. Wahls, The benefits of pairing by ability, in: 41st Technical Symposium on Computer Science Education (SIGCSE), 2010, pp. 249–253. M. Lorenz, J. Kidd, Object-Oriented Software Metrics: A Practical Guide, first ed., Prentice Hall, 1994. J. Rosenberg, Some misconceptions about lines of code, in: 4th International Software Metrics Symposium (METRICS), 1997, pp. 137–142. H. Levene, Robust tests for equality of variances, in: I. Olkin (Ed.), Contributions to Probability and Statistics. Essays in Honor of Harold Hotelling, Stanford University Press, 1960, pp. 278–292. D. Lapková, M. Adámek, Statistical and mathematical classification of direct punch, in: 38th International Conference on Telecommunications and Signal Processing (TSP), 2015, pp. 486–489. C. Seaman, Qualitative methods in empirical studies of software engineering, IEEE Trans. Softw. Eng. 25 (4) (1999) 557–572. D. Cruzes, T. Dyba, Recommended steps for thematic synthesis in software engineering, in: 5th International Symposium on Empirical Software Engineering and Measurement (ESEM), 2011, pp. 275–284. T. Hayata, J. Han, A hybrid model for IT project with Scrum, in: 7th International Conference on Service Operations, Logistics, and Informatics (SOLI), 2011, pp. 285–290. I. Perera, S. 
Fernando, Enhanced agile software development–hybrid paradigm with LEAN practice, in: 6th International Conference on Industrial and Information Systems (ICIIS), 2007, pp. 239–244. B. Kitchenham, S. Pfleeger, L. Pickard, P. Jones, D. Hoaglin, K. El Emam, J. Rosenberg, Preliminary guidelines for empirical research in software engineering, IEEE Trans. Softw. Eng. 28 (8) (2002) 721–734. L. Sousa, R. Oliveira, A. Garcia, J. Lee, T. Conte, W. Oizumi, R. de Mello, A. Lopes, N. Valentim, E. Oliveira, C. Lucena, How do software developers identify design problems? A qualitative analysis, 31st Brazilian Symposium on Software Engineering (SBES), 2017. , On the prioritization of design-relevant smelly elements: amixed-method, multi-project study, in: 13th Brazilian Symposium on Software Components, Architectures, and Reuse (SBCARS), 2019, pp. 83–92. S. Vidal, E. Guimaraes, W. Oizumi, A. Garcia, A.D. Pace, C. Marcos, Identifying architectural problems through prioritization of code smells, in: 10th Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS), 2016, pp. 41–50. S. Vidal, W. Oizumi, A. Garcia, A.D. Pace, C. Marcos, Ranking architecturally critical agglomerations of code smells, Sci. Comput. Program. 182 (2019) 64–85. W. Oizumi, L. Sousa, A. Garcia, R. Oliveira, A. Oliveira, O.I.A.B. Agbachi, C. Lucena, Revealing design problems in stinky code: amixed-method study, in: 11th Brazilian Symposium on Software Components, Architectures, and Reuse (SBCARS), 2017, pp. 5:1–5:10. W. Oizumi, A. Garcia, T. Colanzi, M. Ferreira, A. Staa, On the relationship of code-anomaly agglomerations and architectural problems, J. Softw. Eng. Res. Dev. 3 (11) (2015) 1–22. R. de Mello, P. Da Silva, G. Travassos, Investigating probabilistic sampling approaches for large-scale surveys in software engineering, J. Softw. Eng. Res. Dev. 3 (1) (2015) 8. E. Fernandes, A. Uchôa, A.C. Bibiano, A. 
Garcia, On the alternatives for composing batch refactoring, in: Proceedings of the 3rd International Workshop on Refactoring (IWoR), co-located with the 41st International Conference on Software Engineering (ICSE), 2019, pp. 9–12.