WAS: A weighted attribute-based strategy for cluster test selection


Yabin Wang (a), Ruizhi Gao (b), Zhenyu Chen (a,*), W. Eric Wong (b), Bin Luo (a)

(a) State Key Laboratory for Novel Software Technology, Nanjing University, China
(b) Department of Computer Science, University of Texas at Dallas, USA

Article history: Received 6 September 2013; received in revised form 4 August 2014; accepted 18 August 2014; available online 10 September 2014.

Keywords: Weighted execution profile; Cluster test selection; Software fault localization

Abstract

In past decades, many techniques have been proposed to generate and execute test cases automatically. However, when a test oracle does not exist, execution results have to be examined manually. With the increasing functionality and complexity of today's software, this process can be extremely time-consuming and mistake-prone. A CTS-based (cluster test selection) strategy provides a feasible solution to mitigate this deficiency by examining the execution results of only a small number of selected test cases. It groups test cases with similar execution profiles into the same cluster and selects test cases from each cluster. Some well-known CTS-based strategies are one per cluster, n (a predefined value greater than 1) per cluster, adaptive sampling, and execution-spectra-based sampling (ESBS). The ultimate goal is to reduce testing cost by quickly identifying the executions that are likely to fail. However, improperly grouping the test cases significantly diminishes the effectiveness of these strategies (more successful and fewer failed executions are examined). To overcome this problem, we propose a weighted attribute-based strategy (WAS). Instead of clustering test cases only once, based solely on the similarity of their execution profiles, as the aforementioned CTS-based strategies do, WAS conducts more than one iteration of clustering using weighted execution profiles that also take into account the suspiciousness of each program element (statement, basic block, decision, etc.), where suspiciousness, in terms of the likelihood of containing bugs, can be computed using various software fault localization techniques. Case studies using seven programs (make, ant, sed, flex, grep, gzip, and space) and four CTS-based strategies (one per cluster sampling, n per cluster sampling, adaptive sampling, and ESBS) were conducted to evaluate the effectiveness of WAS on 184 faulty versions containing either single or multiple bugs. Experimental results suggest that the proposed WAS strategy outperforms the other four CTS-based strategies with respect to both recall and precision, so that output verification is focused more strongly on failed executions.

1. Introduction

Software has become an inseparable component of everyday life, but programs continue to grow in complexity. As a result, it is inevitable that we spend more and more resources on software testing, which includes three major steps: test generation, execution, and output verification. Many techniques have been proposed to reduce the cost of test generation and execution (e.g., Samuel et al., 2008; Tsai et al., 1990). In general, however, test output verification still has to be done manually, especially when a test oracle does not exist. This can be very time-consuming and mistake-prone. To solve this problem, we propose a weighted attribute-based strategy (WAS) to help identify more


failure-causing test cases, as the remaining test executions1 are likely to be successful. This can significantly reduce software development and maintenance cost. Intuitively, failed executions caused by the same bug tend to share some similarities and are more likely to have similar execution profiles (e.g., statement coverage) (Anderberg, 1973; Dickinson et al., 2001a; Podgurski et al., 1999). Test cases for output verification are typically selected using a CTS-based strategy, which groups test cases with similar execution profiles into the same cluster and then selects only a small number of test cases from each cluster for output verification. A CTS-based strategy contains two major steps: clustering and sampling.

1 In this paper, we use the terms “failed/successful execution” and “failed/successful test case” interchangeably. We also use “bugs” and “faults” interchangeably.
* Corresponding author. Tel.: +86 25 83621360. E-mail addresses: [email protected], [email protected] (Z. Chen).


Most CTS-based strategies, such as one per cluster (Dickinson et al., 2001a), n (a predefined value greater than 1) per cluster (Dickinson et al., 2001a), adaptive sampling (Chen et al., 2010), and execution-spectra-based sampling (ESBS) (Yan et al., 2010), focus on how to optimize the sampling. Unfortunately, they overlook the other critical factor: how to do better clustering. Rather than performing one iteration of clustering, as the aforementioned CTS-based techniques do, the proposed WAS imposes multiple iterations of clustering. After the first iteration of clustering, WAS uses a sampling approach similar to that of ESBS to select a few test cases for output verification. The suspiciousness value of each statement is then computed using a software fault localization technique (see Section 2.2) based on the execution profiles and the success or failure of the selected test cases. These values are then used to calibrate the initial execution profile of each test case, creating a weighted execution profile for the next iteration of clustering. With failure-related statements weighted more heavily, failed executions can be clustered more accurately. Users can continue this process until the number of selected test cases reaches a predefined value. The example in Section 3.2 provides a more detailed explanation.

To demonstrate the benefit of the proposed WAS, case studies were conducted on seven open source programs (flex, grep, ant, make, space, sed, and gzip) and the four CTS-based strategies mentioned earlier (one per cluster, n per cluster, adaptive sampling, and ESBS). Our results suggest that WAS generally outperforms the other strategies with better precision and recall. This paper is an extension of our conference paper2 with the following significant enhancements and contributions:

• A novel strategy, WAS, is proposed by introducing weighted execution profiles. To the best of our knowledge, we are the first to apply a software fault localization technique to assign weights to execution profiles in CTS-based strategies.
• The effectiveness of WAS is compared to that of four existing CTS-based strategies on seven open-source programs. The results show that WAS generally outperforms the other CTS-based strategies in terms of both recall and precision.
• Statement coverage is used instead of function coverage as in our conference paper. A comparison of the effectiveness of WAS using these two different execution profiles is reported.
• Case studies include programs with a single bug as well as programs containing multiple bugs.
• The effectiveness of WAS using six different fault localization techniques is compared to provide more insight.
• Not only the effectiveness but also the efficiency of WAS is evaluated.
• The impact on the effectiveness of WAS of selecting a different percentage of all test cases per clustering iteration is examined and presented.

The rest of the paper is organized as follows. Section 2 provides the background of CTS-based strategies and a brief overview of software fault localization. Section 3 presents a detailed explanation and an example of WAS. Case studies, results, and analyses are given in Section 4. A discussion of research questions is provided in Section 5. Related work appears in Section 6. Section 7 discusses the threats to validity. Conclusions and future directions can be found in Section 8. Additional experimental results are presented in the Appendix after the References.


2 A preliminary version of this paper was presented at the Sixth International Conference on Software Security and Reliability (SERE’12) (Wang et al., 2012).


2. Background

We first describe how execution profiles of test cases are clustered, followed by an introduction of the six software fault localization techniques used in this paper.

2.1. Execution profile clustering

Clustering divides a collection of objects into clusters that contain similar instances (Witten and Frank, 1999). Objects in the same cluster are similar to each other and dissimilar to those in other clusters. CTS-based strategies use clustering techniques to group execution profiles based on their semantic behavior. Test cases are clustered by their execution profiles, each of which is an instance when using a clustering algorithm. Existing studies show that execution profiles can be used as a representation of the behavior of executions (DiGiuseppe and Jones, 2012). Dickinson et al. (2001a) later found that over half of the nearest neighbors of failed test cases had also failed, and that failed executions have unusual properties that may be reflected in execution profiles (Dickinson et al., 2001b). These results, extending Podgurski's earlier findings (Podgurski and Yang, 1993), suggest that profiles can be an appropriate indication of failures.

According to the studies described above, failed executions may have similar profiles. Hence, these profiles can be grouped together using clustering algorithms. Podgurski and Yang (1993) studied such problems in their early work and tried to predict an execution's success or failure through profile clustering.

Clustering techniques can be roughly divided into soft clustering techniques and hard clustering techniques. Under a soft clustering technique, an object can be shared by different clusters; under a hard clustering technique, an object belongs to at most one cluster. In this paper, we use K-means, one of the most popular hard clustering techniques. The first step of K-means is to estimate the number of clusters, k, according to the total number of test cases. Next, k test cases are randomly selected as cluster centers, and each test case is assigned to the nearest cluster center based on the Euclidean distance between test cases. The center of each cluster is then recalculated based on the current clustering results. In each iteration, new cluster centers are produced, and the process continues until the test cases in each cluster remain constant. K-means produces clusters so as to minimize Eq. (1), where Si indicates the ith cluster, xj indicates a test case, and μi indicates a cluster center:

arg min over S of Σ_{i=1..k} Σ_{xj∈Si} ‖xj − μi‖²    (1)

The approach of assigning one instance to its nearest cluster center is shown in Eq. (2), in which t indicates a test case and mi and mj indicate cluster centers. In WAS, the execution profile of a test case t is represented as a numeric vector t: ⟨e1, e2, e3, ..., en⟩, where ei ≠ 0 indicates that statement ωi is executed by t, and ei = 0 indicates that ωi is not executed by t. Cluster centers mi and mj are calculated by Eq. (3), in which |Si| indicates the number of test cases in cluster Si; hence mi is also a numeric vector of the same dimension as t.

Si = {t : ‖t − mi‖ ≤ ‖t − mj‖ for all 1 ≤ j ≤ k}    (2)

mi = (1/|Si|) Σ_{tj∈Si} tj    (3)

If mi is represented as ⟨c1, c2, c3, ..., cn⟩, the Euclidean distance between t and mi is √(Σ_{i=1..n} (ei − ci)²).
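To make this clustering step concrete, the following minimal sketch (our illustration, not the authors' tooling) clusters binary execution profiles with scikit-learn's K-means; the small profile matrix is hypothetical.

```python
# Minimal sketch of execution-profile clustering (illustrative only).
# Rows are test cases, columns are statements; 1 = statement covered.
import numpy as np
from sklearn.cluster import KMeans

profiles = np.array([
    [1, 1, 0, 0, 1, 1],   # t0
    [1, 1, 0, 0, 1, 1],   # t1
    [0, 0, 1, 0, 1, 1],   # t2 -- an unusual profile, e.g. a failing run
    [1, 1, 0, 1, 1, 1],   # t3
])

# As in WAS, the number of clusters is estimated from the number of tests:
# k = sqrt(Nt / 2) (Mardia et al., 1979; see Section 3.1, Step 2).
k = max(1, int(round(np.sqrt(len(profiles) / 2))))

# K-means assigns each profile to its nearest center by Euclidean distance
# (Eq. (2)) and recomputes each center as the cluster mean (Eq. (3)).
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(profiles)
print(labels)  # cluster index per test case; similar profiles co-cluster
```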


Table 1
Notations.

NCF(ω)  The number of failed test cases that cover ω
NCS(ω)  The number of successful test cases that cover ω
NUF(ω)  The number of failed test cases that do not cover ω
NUS(ω)  The number of successful test cases that do not cover ω
NS      The number of successful test cases
NF      The number of failed test cases

2.2. Software fault localization

The aim of software fault localization is to identify the location of bugs in a software system. The process uses different techniques to calculate a suspiciousness value for each component of the software. All components are ranked by suspiciousness value in descending order so that the most suspicious parts are prioritized. There are many fault localization techniques, such as static, dynamic, and execution slice-based techniques, program state-based techniques, machine learning-based techniques, and program spectrum-based techniques. WAS can be based on any of these, but in this paper it is implemented using program spectrum-based techniques.

In spectrum-based fault localization (SBFL), a program spectrum contains information about the elements that are covered during the execution of an instrumented program. Such information can be collected at the granularity of a function, block, branch, or statement. In this paper, spectrum information is collected at the statement level. SBFL techniques use the notations NUS(ω), NUF(ω), NCS(ω), and NCF(ω) to indicate four properties of a statement ω: NUS(ω) is the number of successful test cases that do not cover ω; NUF(ω) is the number of failed test cases that do not cover ω; NCS(ω) is the number of successful test cases that cover ω; and NCF(ω) is the number of failed test cases that cover ω. These four properties are used to calculate a suspiciousness value for the statement; a higher suspiciousness value suggests that a statement is more likely to contain bugs. Many SBFL techniques, such as Jaccard (Jaccard, 1901), Tarantula (Jones and Harrold, 2005), Ochiai (Abreu et al., 2009), Crosstab (Wong et al., 2012a), and H3b and H3c (Wong et al., 2010), have been shown to be effective in reducing the number of statements that must be examined in order to identify bugs. Table 1 summarizes the notation used in this section.

The Jaccard metric (Abreu et al., 2009), which compares the similarity and dissimilarity of two sets, is defined as the size of their intersection divided by the size of their union. In fault localization, for a statement ω, the two sets are the set of failed test cases and the set of all test cases that cover ω. The suspiciousness value Jaccard(ω) is defined in Eq. (4):

Jaccard(ω) = NCF(ω) / (NCF(ω) + NCS(ω) + NUF(ω))    (4)

The Tarantula fault localization technique (Jones and Harrold, 2005) assigns a suspiciousness value to each statement according to the formula X/(X + Y), where X = NCF/NF and Y = NCS/NS. Tarantula has been shown to be more effective on the Siemens suite than other fault localization techniques such as set union, set intersection, nearest neighbor, and cause transitions.

The Ochiai coefficient (Ochiai, 1957), evaluated by Abreu et al. (2009) in the context of software fault localization, assigns each statement the suspiciousness value NCF / √(NF × (NCF + NCS)).

Wong et al. (2012a) presented a Crosstab (cross tabulation) based fault localization technique that uses a well-defined statistical analysis, which has been used to study the relationship between two or more categorical variables. In their study, two variables are used: “test execution result (successful or failed)” and “the coverage of a given statement.” The null hypothesis is that these two variables are independent. However, rather than studying the “dependence”/“independence” relationship between the two variables, it is more relevant to study the degree of association between them. More precisely, a crosstab for each statement contains two column-wise categorical variables (covered and not covered) and two row-wise categorical variables (successful and failed) to determine whether a closer relationship exists between the two kinds of categorical variables. A statistic computed from each table gives the suspiciousness of the corresponding statement.

Wong et al. (2010) proposed the H3b and H3c techniques by examining how each additional failed (or successful) test case helps perform fault localization. They concluded that if all the test cases are executed in sequence, the contribution of the first failed test case is larger than or equal to that of the second, which is larger than or equal to that of the third, and so on; the same holds for successful test cases. The suspiciousness of each statement is calculated as [(1.0) × nF,1 + (0.1) × nF,2 + (0.01) × nF,3] − [(1.0) × nS,1 + (0.1) × nS,2 + α × (F/S) × nS,3], where nF,i and nS,i are the number of failed and successful test cases in the ith group, and F/S is the ratio of the total number of failed test cases to the total number of successful test cases with respect to a specific bug. The technique is labeled H3b when α = 0.001 and H3c when α = 0.0001.
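Since WAS only needs such formulas as plug-ins, a minimal sketch (ours, not the authors' implementation) of the three simplest of them is given below; the example call reproduces the value computed for ω3 in Table 5.

```python
# Minimal sketch (ours, not the authors' implementation) of three SBFL
# formulas from Section 2.2, in the notation of Table 1.
import math

def jaccard(ncf, ncs, nuf, nus):
    # Eq. (4): NCF / (NCF + NCS + NUF)
    d = ncf + ncs + nuf
    return ncf / d if d else 0.0

def tarantula(ncf, ncs, nuf, nus):
    nf, ns = ncf + nuf, ncs + nus      # total failed / successful test cases
    x = ncf / nf if nf else 0.0        # X = NCF / NF
    y = ncs / ns if ns else 0.0        # Y = NCS / NS
    return x / (x + y) if (x + y) else 0.0

def ochiai(ncf, ncs, nuf, nus):
    # NCF / sqrt(NF * (NCF + NCS))
    d = math.sqrt((ncf + nuf) * (ncf + ncs))
    return ncf / d if d else 0.0

# Statement w3 of the running example (Table 5): NCF=2, NCS=0, NUF=0, NUS=1.
print(jaccard(2, 0, 0, 1))   # 1.0, matching the value in Table 5
```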

3. Weighted attribute-based strategy

This section describes the detailed WAS procedure and demonstrates the workflow with a running example.

3.1. Detailed steps

This paper uses the following concepts and notations. For each statement ω, two values X(ω) and Y(ω) are defined in Eqs. (5) and (6). X(ω) is initialized to 0; accordingly, Y(ω) has an initial value of 0 based on Eq. (6).

X(ω) = NCS(ω) − NCF(ω)    (5)

Y(ω) = 0 if X(ω) < 1, and 1 otherwise    (6)

For each test case t, Z(t) represents the total number of statements with Y = 0 covered by t. ST(k) is a set that contains the test cases with Z(t) > 0, where k is the index of test selection starting from 0. NT(q) denotes the set of test cases selected in clustering iteration q, with q starting from 0, and NT contains the test cases selected across all clustering iterations. Fig. 1 demonstrates the overall process of WAS; a detailed description of each step follows.

Fig. 1. Detailed procedure of WAS.


Table 2
Initial execution profiles. Each row lists the values for test cases t0 through t29, from left to right; 1 = covered, 0 = not covered. In the last row, P = successful and F = failed.

ω1:   1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ω2:   1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
ω3:   0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1
ω4:   0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
ω5:   1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
ω6:   1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1
ω7:   1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1
ω8:   0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
ω9:   1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
ω10:  1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
P/F:  P P P P P P F F P P P P P P P P P P P P P P P P P P P F F F

Step 1: Test execution and execution profile collection. All test cases are executed and their execution profiles are collected. Each test case is classified as failed or successful by comparing its output against the expected output: if they match, the test case is successful; otherwise, it has failed.

Step 2: Cluster analysis. WAS uses K-means (Hartigan and Wong, 1979; James, 1967) to group the test cases into clusters based on the similarity of their execution profiles. Before K-means is applied, the number of clusters is set to √(Nt/2) for Nt test cases, based on the work of Mardia et al. (1979). One test case is then randomly selected for each cluster as the initial cluster center.

Step 3: Cluster selection. One cluster is randomly chosen from the results of Step 2. A chosen cluster cannot be used in further iterations.

Step 4: Test selection. For each statement ω, X(ω) and Y(ω) are computed using Eqs. (5) and (6), and Z(t) is then calculated for each test case t. Test cases with Z(t) > 0 are included in the set ST(k). If ST(k) is empty, the process returns to Step 3. Otherwise, a test case with the highest Z(t) is selected from ST(k) and denoted th(k); if multiple test cases share the highest Z(t), one of them is randomly selected. th(k) is examined to determine whether it is successful or failed, and is then added to NT(q). If the size of NT(q) is greater than or equal to 10% of the test cases, the test cases in NT(q) are merged into NT, NT(q) is reset to the empty set, and the process continues to Step 5. Otherwise, Step 4 is repeated.

Step 5: Execution profile weighting. Using the test cases in NT, a suspiciousness value susp(ω) is calculated for each statement using a fault localization technique (e.g., Jaccard, Tarantula, Ochiai, Crosstab, H3b, or H3c). For a test case t, the execution profile t: ⟨e1, e2, e3, ..., en⟩ is weighted by replacing ei with susp(ωi) whenever ei ≠ 0. With the weighted execution profiles, the process returns to Step 2. A sketch of Steps 4 and 5 follows.
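The following minimal sketch (our simplification; the helper names are hypothetical, not the authors' code) shows how the bookkeeping of Steps 4 and 5 can be realized, with any SBFL formula from Section 2.2 supplied as the suspiciousness function.

```python
# Minimal sketch (our simplification; helper names are hypothetical) of the
# bookkeeping in Steps 4 and 5. `profiles` maps a test id to its 0/1
# coverage vector; `susp` is a list of suspiciousness values per statement.

def select_test(profiles, X):
    # Eq. (6): Y(w) = 0 iff X(w) < 1; Z(t) = number of covered statements
    # with Y = 0. Return a test with the highest Z(t), or None if ST is empty.
    Y = [0 if x < 1 else 1 for x in X]
    Z = {t: sum(1 for e, y in zip(cov, Y) if e and y == 0)
         for t, cov in profiles.items()}
    st = [t for t, z in Z.items() if z > 0]          # the set ST(k)
    return max(st, key=Z.get) if st else None

def update_X(X, coverage, failed):
    # Eq. (5): X(w) = NCS(w) - NCF(w), updated incrementally as each selected
    # test is examined (+1 per covered statement if it passed, -1 if it failed).
    delta = -1 if failed else 1
    return [x + delta if e else x for x, e in zip(X, coverage)]

def weight_profiles(profiles, susp):
    # Step 5: replace every nonzero entry e_i with susp(w_i).
    return {t: [susp[i] if e else 0.0 for i, e in enumerate(cov)]
            for t, cov in profiles.items()}
```

Applied to the example of Section 3.2 with X initialized to all zeros, every statement has Y = 0, so Z(t) is simply the number of statements t covers; this is why t10, t12, and t15 tie at Z = 7 in the first row of Table 4.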

3.2. Example

The following hypothetical example demonstrates how WAS is applied. It uses a program containing ten statements, ω1–ω10, with a bug introduced in ω3. Thirty test cases, t0–t29, are executed, and the execution profile (statement coverage) of each test is collected (Step 1). Table 2 shows the initial execution profiles. For example, the row labeled ω1 shows how statement ω1 is covered by each test case: an entry of 1 means ω1 is covered by the corresponding test case, and an entry of 0 means it is not. In the last row, P indicates a successful test case and F indicates a failed one.

After test execution and execution profile collection, cluster analysis (Step 2) is performed. The 30 test cases are grouped into three clusters (cluster 1, cluster 2, and cluster 3) according to the similarities of their execution profiles. The clustering results in Table 3 show that twelve successful test cases and two failed test cases, t6 and t7, fall into cluster 1; seven successful test cases and no failed test cases fall into cluster 2; and six successful test cases and three failed test cases, t27, t28, and t29, are assigned to cluster 3. Cluster 1 is randomly selected (Step 3).

As shown in Table 4, test selection (Step 4) begins by calculating X(ω) and Y(ω) based on Eqs. (5) and (6). Z(t) is counted and ST(0) is generated as described in Section 3.1. Test cases t10, t12, and t15 share the highest Z(t) value of 7. t10 is randomly chosen for output examination and found to be successful. After t10 is added to NT(0), the size of NT(0) is one. As this is less than 10% of the test cases (30 × 10% = 3), test selection is repeated and ST(1) is produced based on the updated X(ω), Y(ω), and Z(t). In ST(1), three test cases, t6, t7, and t9, share the highest Z(t) value of 1. t6 is randomly selected and found to have failed. As the size of NT(0) is now 2, test selection is run once more. One test case, t7, has the highest Z(t) value of 4; it is selected and found to have failed. At this point, since the size of NT(0) equals 10% of the test cases, the process continues to execution profile weighting (Step 5) and the test cases in NT(0) are added to NT.

Using the test cases in NT (t6, t7, and t10), the suspiciousness value of each statement is calculated according to the Jaccard fault localization technique; the results are shown in Table 5. In execution profile weighting (Step 5), these suspiciousness values are used to weight the initial execution profiles, as shown in Table 6. The process then returns to cluster analysis (Step 2). New clusters are formed from the weighted execution profiles, as shown in Table 7. In the new cluster 3, t27, t28, and t29 are failed and t8 is successful. Compared to the original cluster 3 in Table 3, the percentage of failed test cases has increased from 33.3% (three out of nine) to 75% (three out of four). Hence, it is much easier for a user to identify more failed test cases in this new cluster while examining fewer test case results.

Table 3
Results from the first clustering iteration.

Cluster 1: t6, t7, t8, t9, t10, t11, t12, t14, t15, t16, t20, t22, t24, t26
Cluster 2: t13, t17, t18, t19, t21, t23, t25
Cluster 3: t0, t1, t2, t3, t4, t5, t27, t28, t29


Table 4
Test selection procedure.

NT(0) = ∅:
  X(ω1..ω10) = 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
  Statements with Y(ω) = 0: ω1–ω10
  Z(t) > 0: Z(t6)=4, Z(t7)=4, Z(t8)=6, Z(t9)=7, Z(t10)=7, Z(t11)=6, Z(t12)=7, Z(t14)=6, Z(t15)=7, Z(t16)=6, Z(t20)=6, Z(t22)=6, Z(t24)=6, Z(t26)=4
  ST(0) = {t6, t7, t8, t9, t10, t11, t12, t14, t15, t16, t20, t22, t24, t26}

NT(0) = {t10}:
  X(ω1..ω10) = 1, 1, 0, 1, 1, 1, 1, 1, 0, 0
  Statements with Y(ω) = 0: ω3, ω9, ω10
  Z(t) > 0: Z(t6)=1, Z(t7)=1, Z(t9)=1
  ST(1) = {t6, t7, t9}

NT(0) = {t6, t10}:
  X(ω1..ω10) = 1, 1, −1, 1, 0, 0, 1, 0, 0, 0
  Statements with Y(ω) = 0: ω3, ω5, ω6, ω8, ω9, ω10
  Z(t) > 0: Z(t7)=4, Z(t8)=3, Z(t9)=3, Z(t11)=2, Z(t12)=3, Z(t14)=3, Z(t15)=3, Z(t16)=2, Z(t20)=2, Z(t22)=2, Z(t24)=2, Z(t26)=1
  ST(2) = {t7, t8, t9, t11, t12, t14, t15, t16, t20, t22, t24, t26}

NT(0) = {t6, t7, t10}:
  X(ω1..ω10) = 1, 1, −2, 1, −1, −1, 1, −1, 0, 0
  Statements with Y(ω) = 0: ω3, ω5, ω6, ω8, ω9, ω10
  Z(t) > 0: Z(t8)=3, Z(t9)=3, Z(t11)=2, Z(t12)=3, Z(t14)=3, Z(t15)=3, Z(t16)=2, Z(t20)=2, Z(t22)=2, Z(t24)=2, Z(t26)=1
  ST(3) = {t8, t9, t11, t12, t14, t15, t16, t20, t22, t24, t26}

Table 5
Suspiciousness values.

Statement  NCF  NCS  NUF  NUS  Suspiciousness
ω1         0    1    2    0    0
ω2         0    1    2    0    0
ω3         2    0    0    1    1
ω4         0    1    2    0    0
ω5         2    1    0    0    0.6
ω6         2    1    0    0    0.6
ω7         0    1    2    0    0
ω8         2    1    0    0    0.6
ω9         0    0    2    1    0
ω10        0    0    2    1    0

Table 6
Weighted execution profiles. Each row lists the values for test cases t0 through t29, from left to right. In the last row, P = successful and F = failed.

ω1:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω2:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω3:   0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1
ω4:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω5:   0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0 0.6 0.6 0.6 0.6 0.6 0.6 0 0 0 0 0 0 0 0 0 0 0 0 0.6 0
ω6:   0.6 0.6 0 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0 0.6 0.6 0.6 0 0 0 0.6 0 0.6 0 0.6 0 0.6 0 0 0.6
ω7:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω8:   0 0.6 0.6 0.6 0 0.6 0.6 0.6 0.6 0.6 0.6 0 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0 0.6 0.6 0.6
ω9:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω10:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P/F:  P P P P P P F F P P P P P P P P P P P P P P P P P P P F F F

Table 7
New clustering results.

Cluster 1: t0, t1, t3, t4, t5, t6, t7, t9, t10, t11, t12, t13, t15, t16
Cluster 2: t2, t14, t18, t19, t20, t22, t24, t26
Cluster 3: t8, t27, t28, t29

4. Case studies

This section presents a detailed description of the case studies, including subject programs, experimental setup, evaluation model, and experimental results.

4.1. Subject programs

Case studies were conducted on seven open-source programs, flex, grep, gzip, make, ant, sed, and space, which were downloaded

from (The Software, 2008). Table 8 shows the number of statements, the number of test cases, the number of faulty versions, and the number of failed test cases for each program in our case studies. Faulty versions that did not result in any execution failure by the downloaded test cases were excluded. To enlarge the study sets, additional faulty versions were created by the application of mutation-based fault injection, which has been shown to be an effective approach for simulating realistic faults (Andrews et al., 2005; Do and Rothermel, 2006; Liu et al., 2006; Namin et al., 2006). In this paper, two classes of mutant operators are used:


Table 8
Subject programs.

Program  Number of statements  Number of test cases  Number of faulty versions  Number of failed test cases
make     20,014                793                   20                         452
ant      75,333                871                   21                         182
sed      12,062                360                   7                          108
flex     13,892                525                   19                         215
grep     12,653                470                   18                         140
gzip     6573                  211                   16                         21
space    9126                  13,585                23                         1357

• Replacement of an arithmetic, relational, logical, increment/decrement, or assignment operator by another operator from the same class.
• Decision negation in an if or while statement.
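As an illustration only, one hypothetical mutant of each class might look as follows (the function is invented for the example; the study's actual mutants were injected into the C and Java subject programs):

```python
# Hypothetical mutants of an invented function, illustrating the two classes.

def in_range(x, lo, hi):            # original code
    if lo <= x and x <= hi:
        return True
    return False

def in_range_rel_mut(x, lo, hi):    # relational operator replacement:
    if lo < x and x <= hi:          # `<=` replaced by `<`
        return True
    return False

def in_range_neg_mut(x, lo, hi):    # decision negation in the `if` statement
    if not (lo <= x and x <= hi):
        return True
    return False
```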

Studies such as Offutt et al. (1996), Van der Meulen et al. (2004), and Wong and Mathur (1995) have reported that test cases that kill mutants generated by relational operator replacement and logical operator replacement are also likely to kill other mutants, and that such test cases achieve high mutation and dataflow-based coverage. Although the focus of this paper differs from those studies, their conclusion is still applicable here. Additional mutation operators, such as replacement of assignment, arithmetic, or increment/decrement operators, are also included based on our analysis of common program bugs.

4.2. Experimental setup

The experiments start with the estimation of the number of clusters. As described in Section 3.1, the K-means clustering technique is used for all strategies evaluated in this paper. The number of clusters is set to √(Nt/2) for Nt test cases, and one test case is then randomly selected for each cluster as the initial cluster center. Table 9 shows the number of clusters and the number of test cases in each cluster after clustering the initial execution profiles. For example, for make, each cluster may contain between 25 and 77 test cases.

To compare WAS to other CTS-based strategies, WAS, ESBS, one per cluster, n per cluster, and adaptive sampling are implemented. After clustering of the execution profiles, one per cluster selects one test case randomly from each cluster; n per cluster selects n test cases from each cluster, with n set to 3 according to Yan et al. (2010). Adaptive sampling initially selects one test case randomly from each cluster and then selects the remaining test cases from the clusters in which the first selected test case failed. ESBS is an improvement over the adaptive sampling strategy, and its sampling strategy is very similar to that of WAS. However, in WAS, Y(ω) of a statement ω is reset to 0 each time a new cluster is chosen, whereas ESBS is not; furthermore, WAS performs multiple clustering iterations using weighted execution profiles, while ESBS clusters only once. WAS is implemented with the Crosstab fault localization technique to calculate the suspiciousness value of each statement, and this value is used to weight the execution profile. All experiments were conducted on a workstation with an i7-3770 CPU at 3.4 GHz and 16 GB of memory. All subject programs used in this paper were instrumented with a revised version of χSuds (χSuds User's Manual, 1998), which records the statements executed by each test case.

4.3. Evaluation model

To assess the effectiveness of WAS and the other strategies, two evaluation metrics are introduced: recall measures failure detection ability, and precision measures failure detection exactitude.

• Recall: if there are F failed test cases and a strategy identifies FS of them, recall is defined as

Recall = FS / F

• Precision: if TS test cases have been selected and examined by a strategy, and FS of them are failed test cases, precision is defined as

Precision = FS / TS
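A minimal sketch of the two metrics (ours, for illustration):

```python
# Minimal sketch (ours) of the evaluation metrics in Section 4.3.

def recall(fs, f):
    # Fs / F: fraction of all failed test cases that the strategy identified.
    return fs / f if f else 0.0

def precision(fs, ts):
    # Fs / Ts: fraction of selected (examined) test cases that actually failed.
    return fs / ts if ts else 0.0

# E.g., a strategy that selects 10 tests, 4 of which fail, out of 5 failures
# in the whole suite achieves recall 0.8 and precision 0.4:
print(recall(4, 5), precision(4, 10))
```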

4.4. Experimental results

This section presents a comparison of the five CTS-based strategies with respect to recall and precision. For each faulty version of a subject program, the WAS, ESBS, adaptive sampling, one per cluster, and n per cluster strategies were applied, and recall and precision values for each strategy were recorded. For a given subject program and strategy, the average recall and precision values are calculated over the set of recall and precision values for all faulty versions. Due to space limitations, only the results of four of the seven programs (gzip, grep, ant, and sed) are presented below in Fig. 2; the remaining programs' results are included in Fig. 5 in the Appendix.

As Figs. 2 and 5 show, WAS demonstrates improved recall and precision over ESBS, adaptive sampling, one per cluster sampling, and n per cluster sampling. For gzip, grep, ant, and sed, WAS achieves about 10–20% higher recall than ESBS when the test selection percentage ranges from 20% to 50%. The improvement in recall is significant, especially for gzip and ant. For gzip, when test selection reaches 20%, the recall of WAS reaches 76% while that of ESBS reaches only 55%; at 30%, WAS reaches 80% versus 64% for ESBS; and at 40%, WAS reaches 90% versus 71% for ESBS. As shown in Fig. 2, the adaptive sampling, one per cluster, and n per cluster strategies likewise perform poorly in comparison to WAS.

As Fig. 2(e)–(h) shows, WAS also achieves higher precision values than the other strategies.

Table 9
Number of clusters and number of test cases in each cluster.

Program  Number of clusters  Number of test cases in each cluster
make     20                  25–77
ant      21                  32–59
sed      13                  10–42
flex     16                  27–50
grep     15                  19–87
gzip     10                  15–32
space    82                  112–480


Fig. 2. Recall and precision for gzip, grep, ant, and sed using different strategies.

When the test selection percentage is the same, higher precision indicates that more failed test cases are among those selected. For example, according to Fig. 2(f), when 30% of all test cases are selected, WAS achieves a precision of 79% while ESBS reaches only 66%.

It can also be observed in Fig. 2 that the precision of WAS is relatively low for some programs. For example, when 30% of test cases are selected, the precision of WAS is 79% for grep, but only 29% for gzip at the same test selection percentage. A similar situation can be found for space (shown in Fig. 5(f)). Nevertheless, WAS consistently returns higher precision than the other strategies evaluated.

From Fig. 2 it can further be observed that WAS performs similarly to ESBS, and outperforms the other techniques, at a 10% test selection percentage. In WAS, each time a cluster is chosen, Y(ω) of each statement ω is reset to 0 based on Eq. (6); ESBS does not reinitialize Y(ω) when it moves to a new cluster, so the previous Y(ω) values interfere. After a fault localization technique is applied to the 10% of test cases selected to weight the initial execution profile, the precision and recall of WAS increase in comparison to ESBS and the other sampling strategies.

5. Discussion

In this section, six research questions are proposed regarding important factors that may affect the effectiveness of WAS.

5.1. Can statement coverage improve the effectiveness of WAS?

In our conference paper (Wang et al., 2012), the execution profile in terms of function coverage was used for clustering. In this paper, the execution profile in terms of statement coverage is used instead, as it provides more detailed information about each execution. We evaluated the recall and precision of WAS using both statement coverage and function coverage and found that statement coverage returned better results. For example, Fig. 3 shows that the recall using statement coverage (the line with rhombic dots) is higher than that using function coverage on flex.

Fig. 3. Recall using statement coverage versus that using function coverage.

5.2. Can WAS improve the distance space of execution profiles for clustering?

The distance space reflects the degree of similarity and dissimilarity of the test cases in a test suite. In our study, the Euclidean distance between two execution profiles is used as the distance between two test cases, and the distances between all pairs of test cases constitute the distance space of a test suite. To determine whether WAS can improve the distance space of execution profiles, the failure-close-degree metric (Brock et al., 2008) is used to evaluate the distance space of clustering. Failure-close-degree measures how close the failed test cases in a test suite T are to each other. For one faulty version of a program, NF indicates the set of failed test cases in T and NS the set of successful test cases in T. sum(NF) is the sum of Euclidean distances between any two failed test cases fi and fj in NF, i.e., Σ_{fi,fj∈NF, i≠j} dist(fi, fj). sum(NF, NS) is the sum of Euclidean distances between each failed test case fi in NF and each successful test case pj in NS, i.e., Σ_{fi∈NF, pj∈NS} dist(fi, pj). A small failure-close-degree means that the failed test cases in T are close to each other while the distance between failed and successful test cases is large; therefore, a smaller failure-close-degree indicates a better distance space for clustering:

Failure-close-degree = sum(NF) / sum(NF, NS)

Table 10
Failure-close-degree for the original and weighted execution profiles.

Program  Original  Weighted
grep     1.37      0.85
sed      6.85      1.14
gzip     8.05      2.56
make     23.37     10.72
flex     11.95     3.44
ant      7.52      3.73
space    0.82      0.84
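For illustration, the metric can be computed directly from the profile vectors; the sketch below is ours, using an unordered-pair convention for sum(NF):

```python
# Minimal sketch (ours, not the authors' code) of the failure-close-degree
# metric from Section 5.2. Profiles are numeric vectors; sum(NF) is taken
# over unordered pairs of failed tests, sum(NF, NS) over all failed/successful
# pairs.
import math
from itertools import combinations, product

def dist(a, b):
    # Euclidean distance between two execution-profile vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def failure_close_degree(failed_profiles, successful_profiles):
    num = sum(dist(f1, f2) for f1, f2 in combinations(failed_profiles, 2))
    den = sum(dist(f, p) for f, p in product(failed_profiles, successful_profiles))
    return num / den if den else 0.0
```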

Results shown in Table 10 for a subject program are calculated by averaging the failure-close-degrees across all faulty versions, for both the original and the weighted execution profiles, after 10% of the test cases are selected. The results show that the failure-close-degree decreases rapidly once the weighted execution profile is used.

5.3. Can WAS fit a multi-fault situation?

So far, the evaluation of sampling strategies has been discussed in terms of programs containing only one fault. To better simulate a realistic situation, we conducted experiments on multi-fault programs. Twenty four-fault versions of three large-sized programs, flex, gzip, and grep, were generated using combinations of single-fault versions. The effectiveness of WAS in terms of precision and recall in a multi-fault scenario is presented in Fig. 4. As shown in Fig. 4, the precision and recall of WAS decrease for multi-fault programs compared to those for single-fault programs in Fig. 3. For example, in Fig. 4(a), for flex, the recall of WAS reaches 42% when 30% of test cases are selected, which is lower than the 57% recall for flex with a single fault as shown in Fig. 5(b). However, WAS consistently returns higher recall and precision than the other strategies evaluated.

Fig. 4. Multi-fault experimental results.


Table 11
Recall for different percentages of test cases selected per clustering iteration. Each row lists recall at overall test selection ratios 0.1, 0.2, ..., 1.0 (left to right).

make
  5%:  0.087 0.207 0.307 0.601 0.716 0.846 0.959 1.000 1.000 1.000
  10%: 0.097 0.287 0.467 0.651 0.806 0.896 0.969 1.000 1.000 1.000
  15%: 0.098 0.288 0.470 0.661 0.816 0.899 0.971 1.000 1.000 1.000
  20%: 0.098 0.284 0.478 0.673 0.817 0.899 0.972 1.000 1.000 1.000

flex
  5%:  0.173 0.421 0.632 0.723 0.765 0.906 0.934 1.000 1.000 1.000
  10%: 0.183 0.421 0.742 0.803 0.855 0.946 1.000 1.000 1.000 1.000
  15%: 0.184 0.451 0.752 0.803 0.845 0.956 1.000 1.000 1.000 1.000
  20%: 0.184 0.451 0.762 0.803 0.855 0.966 1.000 1.000 1.000 1.000

ant
  5%:  0.165 0.495 0.616 0.818 0.820 0.895 0.919 0.958 0.985 1.000
  10%: 0.185 0.485 0.796 0.828 0.850 0.885 0.939 0.988 0.995 1.000
  15%: 0.195 0.495 0.816 0.829 0.840 0.895 0.933 0.978 0.991 1.000
  20%: 0.195 0.496 0.826 0.839 0.843 0.895 0.933 0.978 0.991 1.000

sed
  5%:  0.135 0.453 0.685 0.742 0.810 0.928 0.948 1.000 1.000 1.000
  10%: 0.135 0.453 0.735 0.792 0.850 0.948 0.988 1.000 1.000 1.000
  15%: 0.136 0.456 0.745 0.796 0.859 0.958 0.989 1.000 1.000 1.000
  20%: 0.136 0.466 0.747 0.792 0.859 0.978 0.989 1.000 1.000 1.000

grep
  5%:  0.087 0.277 0.587 0.631 0.746 0.806 0.829 0.850 0.910 1.000
  10%: 0.100 0.322 0.570 0.770 0.800 0.860 0.912 0.976 1.000 1.000
  15%: 0.099 0.321 0.574 0.761 0.806 0.876 0.919 1.000 1.000 1.000
  20%: 0.099 0.321 0.575 0.761 0.807 0.886 0.929 1.000 1.000 1.000

gzip
  5%:  0.355 0.736 0.707 0.875 0.954 1.000 1.000 1.000 1.000 1.000
  10%: 0.325 0.756 0.807 0.905 0.964 1.000 1.000 1.000 1.000 1.000
  15%: 0.325 0.756 0.807 0.915 0.944 1.000 1.000 1.000 1.000 1.000
  20%: 0.325 0.766 0.807 0.915 0.954 1.000 1.000 1.000 1.000 1.000

space
  5%:  0.265 0.706 0.814 0.840 0.891 0.950 0.987 1.000 1.000 1.000
  10%: 0.265 0.716 0.924 0.961 0.961 0.970 0.987 1.000 1.000 1.000
  15%: 0.265 0.715 0.934 0.971 0.963 0.973 0.989 1.000 1.000 1.000
  20%: 0.265 0.725 0.934 0.982 0.966 0.973 0.989 1.000 1.000 1.000

5.4. Will different fault localization techniques affect the effectiveness of WAS?

WAS uses a fault localization technique to calculate the suspiciousness of each statement. As the performance of fault localization techniques differs, the technique used may affect the effectiveness of WAS. To investigate this point further, we used different fault localization techniques in WAS and calculated the corresponding precision and recall. Fig. 6 in the Appendix shows the experimental results for the six fault localization techniques used: Crosstab, Ochiai, Tarantula, H3c, H3b, and Jaccard. The results show that different fault localization techniques lead to different effectiveness of WAS, but the difference is small; Crosstab generally outperforms the others when used in WAS. Therefore, in our case studies, we use Crosstab in WAS to compute the suspiciousness of each statement.

5.5. Is there any impact if we select a different percentage of all test cases in each clustering iteration?

In our case studies, we selected 10% of all test cases in each clustering iteration. To find the most appropriate value for this percentage, we set it to 5%, 10%, 15%, and 20% of all test cases and calculated the corresponding recall

for all the programs. The results are shown in Table 11. When the test selection percentage increases from 5% to 10%, the recall value increases significantly; when it increases from 10% to 20%, the increase in recall is very limited. Using make as an example, when the test selection percentage increases from 5% to 10%, the recall increases by 0.16, from 0.307 to 0.467; when it increases from 10% to 15%, the recall increases by only 0.003, from 0.467 to 0.470. To achieve a high recall while selecting as few test cases as possible, 10% of all test cases are selected in each clustering iteration.

5.6. Is the additional time cost incurred by WAS acceptable?

There are several clustering iterations in WAS but only one in the other CTS-based strategies. Although WAS achieves higher recall than the other strategies, its time cost is relatively higher. To evaluate the efficiency of WAS, we recorded the number of clustering iterations for WAS and compared the time cost between ESBS and WAS at 70% recall.

Results in Table 12 show that, across the seven subject programs, when recall reaches 70%, WAS requires one additional clustering iteration over ESBS on two programs, two additional clustering iterations on three programs, and three and four additional clustering iterations on one program each. Since the time cost for each clustering


Table 12
Comparing the number of clustering iterations for WAS and ESBS at 70% recall.

Program  Approach  Number of clustering iterations  Average time cost per clustering iteration (s)
flex     WAS       4                                73
flex     ESBS      1                                72
grep     WAS       3                                74
grep     ESBS      1                                78
gzip     WAS       2                                61
gzip     ESBS      1                                64
make     WAS       5                                81
make     ESBS      1                                89
ant      WAS       3                                116
ant      ESBS      1                                113
sed      WAS       3                                79
sed      ESBS      1                                73
space    WAS       2                                121
space    ESBS      1                                119

iteration is no more than 121 s, the total time cost is generally acceptable in practice.

6. Related work

6.1. Test selection

Test selection is normally used in regression testing. The goal of regression testing is to ensure that no new errors are introduced when a program is modified (Ball, 1998). Regression testing is known to be expensive: previously completed test cases need to be rerun to check whether program execution behavior has changed and whether fixed bugs have reappeared. This process consumes a significant amount of time and execution effort, so reducing the cost of regression testing is important in practice. Regression test selection selects a subset of all test cases to rerun on the modified program (Biswas et al., 2011). A number of regression test selection strategies have been proposed and studied (Ball, 1998; Bates and Horwitz, 1993; Binkley, 1997). Rothermel and Harrold (1994) defined the regression test selection problem formally. Leung and White (1991) compared certain regression test selection strategies with the retest-all approach and observed that regression test selection can reduce the cost. CTS-based strategies can also be applied to regression testing (Chen et al., 2011b; Yu et al., 2008). We focus, however, on test selection for output inspection, so-called observation-based testing (Leon et al., 2000), not on regression testing.

6.2. Cluster test selection

The CTS-based strategy, a well-known strategy for observation-based testing, aims to reduce cost by selecting a subset of the universal set of executions that is more likely to consist of failures. The selection process itself, especially the analysis of execution profiles, should be as automated as possible (Dickinson et al., 2001a). Dickinson et al. (2001a,b) studied cluster filtering and proposed an adaptive sampling strategy that selects all test cases in a cluster once a failed test case is found in it. They compared the n per cluster, one per cluster, and adaptive sampling strategies, and their studies showed that adaptive sampling can reveal more failures while an equal or smaller number of test cases is examined (Dickinson et al., 2001a). Existing CTS-based strategies use different execution profiles and clustering algorithms to achieve better results (Dickinson et al., 2001a; Yu et al., 2008). Chen et al. (2011b) proposed a slice filtering technique that reduces the dimensionality of test cases so that the technique can be used for larger software. All these CTS-based strategies

use the original execution profile for a single clustering iteration. WAS, on the other hand, uses a weighted execution profile and conducts multiple clustering iterations. Chen et al. (2011a) introduced semi-supervised learning to reduce the dimensionality of test cases and improve both the distance space for clustering and the effectiveness of CTS-based strategies. However, semi-supervised learning techniques require human participation, whereas WAS can be applied fully automatically; the effectiveness of these two strategies is therefore not directly comparable.

6.3. Fault localization and test selection

While there are many studies on fault localization based on execution profiles (Wong et al., 2012a), we examine only a few related to test selection. Wong and Qi (2009) proposed a fault localization technique based on a back-propagation (BP) neural network, one of the most popular neural network models in practice (Fausett, 1994). However, an RBF (radial basis function) neural network has several advantages over a BP network, including a faster learning rate and stronger resistance to problems such as paralysis and local minima (Lee et al., 1999; Wasserman, 1993). They therefore proposed a fault localization technique using a modified RBF neural network (Wong et al., 2012b), which consists of a three-layer feed-forward structure: one input layer, one output layer, and one hidden layer in which each neuron uses a Gaussian basis function (Hassoun, 1995) as the activation function, with distances computed using a weighted bit-comparison-based dissimilarity. The network is trained to learn the relationship between the statement coverage of a test case and its corresponding success or failure. After the training phase, a set of virtual test cases (each covering a single statement) is provided as input, and the output for each virtual test case is taken as the suspiciousness of the covered statement. Intuitively, the suspiciousness assigned to a statement should be directly proportional to the number of failed test cases that cover it: the more frequently a statement is executed by failed tests, the more suspicious it should be. Wong et al. further proposed the DStar technique (Wong et al., 2014) to achieve better effectiveness, in the sense that fewer statements need to be examined to locate program bugs. DStar has been shown to be more effective than other fault localization techniques such as Tarantula (Jones and Harrold, 2005) and Ochiai (Abreu et al., 2009).

In past decades, many studies have also been published on predicate-based fault localization. Liblit et al. propose a statistical debugging technique, Liblit05, that can isolate bugs in programs with predicates instrumented at particular points (Liblit et al.,


2005). As an extension of (and improvement on) Liblit05, Liu et al. (2006) propose the SOBER technique to rank suspicious predicates; all instrumented predicates can then be ranked and examined in order of their fault-relevance. Zhang et al. (2011) propose a framework that handles statistical predicate-based fault localization by applying standard hypothesis testing techniques. Controlled experiments based on single-fault scenarios using the Siemens suite and the space program were conducted to evaluate the effectiveness of different hypothesis testing methods in the proposed framework, and its effectiveness and efficiency were compared against 33 statement-level fault localization techniques to gauge whether the best predicate-based technique outperforms statement-level techniques. The results show that the proposed framework generally achieves better effectiveness and efficiency than the statement-level techniques compared in that paper.

Test prioritization to improve the performance of SBFL techniques has also been studied recently. Gonzalez-Sanchez et al. (2011) use Bayesian theory to prioritize test cases for fault localization by minimizing fault-locality entropy. Similarly, Yoo et al. (2013) propose FLINT (fault localization using information theory) to improve the effectiveness of fault localization by reducing the Shannon entropy of the locality of the fault; empirical evaluation shows that FLINT improves the effectiveness of Tarantula on different subject programs.

There are also two main issues with traditional fault localization techniques. First, for most fault localization techniques, good performance may require a large number of test cases, and for a large and complex program, even a technique that narrows the potential faulty statements down to 5% of the entire program still leaves hundreds of statements to be examined. Second, there is no explanation or context for why a given statement is ranked as suspicious, since most techniques are based on the assumption of perfect debugging (i.e., examining a faulty statement is enough for a developer to detect, understand, and fix the corresponding bug). Such an assumption is not realistic in practice, since additional information such as functional requirements, test specifications, etc. has to be provided for correct debugging (Parnin and Orso, 2011).

Yu et al. (2008) compared the effect of using different test reduction techniques on fault localization, and Jiang et al. (2012) conducted an experiment to study the effect of test prioritization on fault localization; their research showed that test prioritization and test reduction can make fault localization more effective. Yan et al. (2010) proposed a CTS-based strategy, ESBS, using execution profiles, which can outperform the other CTS-based strategies. However, all these strategies, including ESBS, depend on well-established clustering results. WAS uses a fault localization technique to weight the execution profile, which can significantly improve the effectiveness of CTS-based strategies.

7. Threats to validity

There are several threats to the validity of the proposed strategy and its accompanying results, which include but are not limited to the following. One major concern that may limit the generalization of our results is the choice of subject programs. To alleviate this threat, we conducted our experiments on seven programs (written in C and Java) containing 184 faulty versions. All the programs we use are large, particularly ant (over 75,000 lines of code). Although we cannot claim that WAS can be applied to every program under every circumstance, such extensive empirical studies give us high confidence in the validity of our results.

A threat to validity may also be caused by the fault localization techniques used. In Section 5.4, we compared the recall and precision of WAS using six fault localization techniques (Crosstab, Ochiai, Tarantula, H3c, H3b, and Jaccard) on different subject programs. We observed that different fault localization techniques yield different recall and precision, but the difference is insignificant; Crosstab generally outperforms the others when used in WAS. In the future, more advanced fault localization techniques will be included in the cross-comparison study.

Another threat to validity is the use of K-means as the only clustering technique in WAS. We will evaluate the recall and precision of WAS using other clustering techniques in a future study, as most clustering techniques can easily be applied in WAS.

8. Conclusion

In this paper, we have proposed a novel weighted attribute-based strategy, WAS, to improve the performance of traditional cluster-based test selection so that output verification is focused on the executions that are more likely to fail. Test cases are clustered based on the similarity of their execution profiles, weighted by the suspiciousness (likelihood of containing program bugs) of statements, computed using different spectrum-based software fault localization techniques (Crosstab, Ochiai, Tarantula, H3b, H3c, and Jaccard). Case studies were conducted on seven open source programs (flex, grep, ant, make, space, sed, and gzip) applying four existing CTS-based strategies (one per cluster, n per cluster, adaptive sampling, and ESBS), and the results were compared. Our data indicate that WAS generally attains better precision and recall than the other strategies. Thus, not only are more failed executions selected while only a small number of test cases is examined, but also the majority of failed executions will be selected. Furthermore, the additional clustering iterations required by WAS introduce very little overhead, which makes WAS as efficient as other CTS-based strategies.

These findings suggest that WAS provides significant advantages in situations where no test oracle exists for automated output verification. Rather than spending excessive resources manually examining the result of every test execution, we only need to verify the executions of a small number of test cases selected using WAS. Since those that are not selected are unlikely to fail, we can skip their verification with good confidence.

In future studies, we would like to further explore how using additional software fault localization techniques (Wong et al., 2012b, 2014; Wong and Qi, 2009) and other clustering techniques may affect the effectiveness of WAS. We will also extend our case studies to larger real-life programs.

Acknowledgements

This research is supported, in part, by the National Basic Research Program of China (973 Program 2014CB340702), the National Natural Science Foundation of China (Grant No. 61170067, 61373013), and the US National Science Foundation (DUE-1023071).

Appendix

Precision and recall using WAS, ESBS, adaptive sampling, one per cluster, and n per cluster on flex, make, and space are presented in Fig. 5. Precision and recall using WAS with different fault localization techniques (Crosstab, Tarantula, Ochiai, H3b, H3c, and Jaccard) on gzip, grep, ant, sed, flex, make, and space are presented in Fig. 6.

8. Conclusion

In this paper, we have proposed a novel weighted attribute-based strategy, WAS, to improve the performance of traditional cluster test selection so that output verification focuses on the executions that are most likely to fail. Test cases are clustered based on the similarity of their execution profiles, weighted by the suspiciousness (the likelihood of containing program bugs) of statements as computed by different spectrum-based software fault localization techniques (Crosstab, Ochiai, Tarantula, H3b, H3c, and Jaccard). Case studies were conducted on seven open source programs (flex, grep, ant, make, space, sed, and gzip), comparing WAS against four existing CTS-based strategies (one per cluster, n per cluster, adaptive sampling, and ESBS). Our data indicate that WAS generally attains better precision and recall than the other strategies: by examining only a small number of test cases, not only are more of the selected executions failed ones, but the majority of all failed executions are selected. Furthermore, the additional clustering iterations required by WAS introduce very little overhead, making WAS as efficient as the other CTS-based strategies.

These findings suggest that WAS provides significant advantages when no test oracle exists for automated output verification. Rather than spending excessive resources manually examining the result of every test execution, we only need to verify the executions of a small number of test cases selected by WAS. Since the test cases that are not selected are unlikely to fail, we can skip their verification with good confidence.

In future studies, we would like to further explore how additional software fault localization techniques (Wong et al., 2012b, 2014; Wong and Qi, 2009) and other clustering techniques affect the effectiveness of WAS. We will also extend our case studies to larger real-life programs.

Acknowledgements

This research is supported, in part, by the National Basic Research Program of China (973 Program 2014CB340702), the National Natural Science Foundation of China (Grant Nos. 61170067 and 61373013), and the US National Science Foundation (DUE-1023071).

Appendix

Precision and recall using WAS, ESBS, adaptive sampling, one per cluster, and n per cluster on flex, make, and space are presented in Fig. 5. Precision and recall using WAS with different fault localization techniques (Crosstab, Tarantula, Ochiai, H3b, H3c, and Jaccard) on gzip, grep, ant, sed, flex, make, and space are presented in Fig. 6.

Fig. 5. Precision and recall using different CTS-based strategies on flex, make, and space.

Fig. 6. Precision and recall using WAS with different fault localization techniques.

References

Abreu, R., Zoeteweij, P., Golsteijn, R., van Gemund, A.J.C., 2009. A practical evaluation of spectrum-based fault localization. J. Syst. Softw. 82 (11), 1780–1792.
Anderberg, M.R., 1973. Cluster Analysis for Applications. Academic Press, New York.
Andrews, J.H., Briand, L.C., Labiche, Y., 2005. Is mutation an appropriate tool for testing experiments? In: Proceedings of the 27th International Conference on Software Engineering, St. Louis, Missouri, USA, May, pp. 402–411.
Ball, T., 1998. On the limit of control flow analysis for regression test selection. In: Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, Clearwater Beach, USA, March, pp. 134–142.
Bates, S., Horwitz, S., 1993. Incremental program testing using program dependence graphs. In: Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 384–396.
Binkley, D., 1997. Semantics guided regression test cost reduction. IEEE Trans. Softw. Eng. 23 (August (8)), 498–516.
Biswas, S., Mall, R., Satpathy, M., Sukumaran, S., 2011. Regression test selection techniques: a survey. Inform.: Int. J. Comput. Inform. 35 (September), 289–321.
Brock, G., Pihur, V., Datta, S., 2008. clValid, an R package for cluster validation. J. Stat. Softw. 25 (4), 1–22.
Chen, S., Chen, Z., Zhao, Z., Xu, B., Feng, Y., 2011a. Using semi-supervised clustering to improve regression test selection techniques. In: Proceedings of the Fourth IEEE International Conference on Software Testing, Verification and Validation, Berlin, Germany, March, pp. 1–10.
Chen, T.Y., Kuo, F., Merkel, R.G., Tse, T.H., 2010. Adaptive random testing: the ART of test case diversity. J. Syst. Softw. 83 (January (1)), 60–66.
Chen, Z., Duan, Y., Zhao, Z., Xu, B., Qian, J., 2011b. Using program slicing to improve the efficiency and effectiveness of cluster test selection. Int. J. Softw. Eng. Knowl. Eng. 21 (September (6)), 759–777.
Dickinson, W., Leon, D., Podgurski, A., 2001a. Finding failures by cluster analysis of execution profiles. In: Proceedings of the 23rd International Conference on Software Engineering, Toronto, Canada, May, pp. 339–348.
Dickinson, W., Leon, D., Podgurski, A., 2001b. Pursuing failure: the distribution of program failures in a profile space. In: Proceedings of the 8th European Software Engineering Conference held jointly with the 9th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Yokohama, Japan, April, pp. 246–255.
DiGiuseppe, N., Jones, J.A., 2012. Software behavior and failure clustering: an empirical study of fault causality. In: Proceedings of the Fifth International Conference on Software Testing, Verification and Validation, Montreal, Canada, April, pp. 191–200.
Do, H., Rothermel, G., 2006. On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Trans. Softw. Eng. 32 (September (9)), 733–752.
Fausett, L., 1994. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications.
Gonzalez-Sanchez, A., Piel, E., Abreu, R., Gross, H., van Gemund, A.J.C., 2011. Prioritizing tests for software fault diagnosis. Softw. Pract. Exp. 41 (September (10)).
Hartigan, J.A., Wong, M.A., 1979. Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. 28 (January (1)), 100–108.
Hassoun, M.H., 1995. Fundamentals of Artificial Neural Networks. MIT Press, Cambridge, Massachusetts.
Jaccard, P., 1901. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull. Soc. Vaudoise Sci. Nat. 37, 547–579.
Jiang, B., Zhang, Z., Chan, W.K., Tse, T.H., Chen, T.Y., 2012. How well do test prioritization techniques support statistical fault localization? Inf. Softw. Technol. 54 (7), 739–758.
Jones, J.A., Harrold, M.J., 2005. Empirical evaluation of the Tarantula automatic fault-localization technique. In: Proceedings of the 20th IEEE International Conference on Automated Software Engineering, Long Beach, USA, November, pp. 273–282.
Lee, C.C., Chung, P.C., Tsai, J.R., Chang, C.I., 1999. Robust radial basis function neural network. IEEE Trans. Syst. Man Cybern. B: Cybern. 29 (December (6)), 674–685.
Leon, D., Podgurski, A., White, L.J., 2000. Multivariate visualization in observation-based testing. In: Proceedings of the International Conference on Software Engineering, Castletroy, Ireland, June, pp. 116–125.
Leung, H.K., White, L., 1991. A cost model to compare regression test strategies. In: Proceedings of the Conference on Software Maintenance, pp. 201–208.
Liblit, B., Naik, M., Zheng, A.X., Aiken, A., Jordan, M.I., 2005. Scalable statistical bug isolation. In: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, IL, USA, June, pp. 15–26.
Liu, C., Fei, L., Yan, X., Han, J., Midkiff, S.P., 2006. Statistical debugging: a hypothesis testing-based approach. IEEE Trans. Softw. Eng. 32 (October (10)), 831–848.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297.
Mardia, K.V., Kent, J.T., Bibby, J.M., 1979. Multivariate Analysis. Academic Press, Waltham, Massachusetts, USA.
Namin, A.S., Andrews, J.H., Labiche, Y., 2006. Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Trans. Softw. Eng. 32 (August (8)), 608–624.
Ochiai, A., 1957. Zoogeographic studies on the soleoid fishes found in Japan and its neighboring regions. Bull. Jpn. Soc. Sci. Fish. 22, 526–530.
Offutt, A.J., Lee, A., Rothermel, G., Untch, R.H., Zapf, C., 1996. An experimental determination of sufficient mutant operators. ACM Trans. Softw. Eng. Methodol. 5 (April (2)), 99–118.
Parnin, C., Orso, A., 2011. Are automated debugging techniques actually helping programmers? In: Proceedings of the 2011 International Symposium on Software Testing and Analysis, Toronto, Canada, July, pp. 199–209.
Podgurski, A., Masri, W., McCleese, Y., Wolff, F.G., Yang, C., 1999. Estimation of software reliability by stratified sampling. ACM Trans. Softw. Eng. Methodol. 8, 263–283.
Podgurski, A., Yang, C., 1993. Partition testing, stratified sampling, and cluster analysis. In: Proceedings of the ACM SIGSOFT Symposium on Foundations of Software Engineering, pp. 169–181.
Rothermel, G., Harrold, M.J., 1994. Selecting tests and identifying test coverage requirements for modified software. In: Proceedings of the 1994 ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 169–184.
Samuel, P., Mall, R., Bothra, A.K., 2008. Automatic test case generation using unified modeling language (UML) state diagrams. IET Softw. 2 (April (2)).
The Software-artifact Infrastructure Repository, http://sir.unl.edu/portal/index.html (retrieved October 2008).
Tsai, W.T., Volovik, D., Keefe, T.F., 1990. Automated test case generation for programs specified by relational algebra queries. IEEE Trans. Softw. Eng. 16 (March (3)).
Van der Meulen, M.J.P., Bishop, P., Revilla, M., 2004. An exploration of software faults and failure behaviour in a large population of programs. In: Proceedings of the 15th International Symposium on Software Reliability Engineering, Saint Malo, Bretagne, France, November, pp. 101–112.
Wang, Y., Chen, Z., Feng, Y., Luo, B., Yang, Y., 2012. Using weighted attributes to improve cluster test selection. In: Proceedings of the 5th International Conference on Software Security and Reliability, Washington, DC, June, pp. 138–146.
Wasserman, P.D., 1993. Advanced Methods in Neural Computing. Van Nostrand Reinhold, New York.
Witten, I.H., Frank, E., 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, Burlington, Massachusetts, USA.
Wong, W.E., Mathur, A.P., 1995. Reducing the cost of mutation testing: an empirical study. J. Syst. Softw. 31 (December (3)), 185–196.
Wong, W.E., Debroy, V., Choi, B., 2010. A family of code coverage-based heuristics for effective fault localization. J. Syst. Softw. 83 (February (2)), 188–208.
Wong, W.E., Debroy, V., Xu, D., 2012a. Towards better fault localization: a crosstab-based statistical approach. IEEE Trans. Syst. Man Cybern. C 42 (May (3)), 378–396.
Wong, W.E., Debroy, V., Golden, R., Xu, X., Thuraisingham, B., 2012b. Effective software fault localization using an RBF neural network. IEEE Trans. Reliab. 61 (March (1)), 149–169.
Wong, W.E., Debroy, V., Gao, R., Li, Y., 2014. The DStar method for effective software fault localization. IEEE Trans. Reliab. 63 (March (1)), 290–308.
Wong, W.E., Qi, Y., 2009. BP neural network-based effective fault localization. Int. J. Softw. Eng. Knowl. Eng. 19 (4), 573–597.
χSuds User's Manual, 1998. Telcordia Technologies.
Yan, S., Chen, Z., Zhao, Z., Zhang, C., Zhou, Y., 2010. A dynamic test cluster sampling strategy by leveraging execution spectra information. In: Proceedings of the Third International Conference on Software Testing, Verification, and Validation, Paris, France, April, pp. 147–154.
Yoo, S., Harman, M., Clark, D., 2013. Fault localization prioritization: comparing information-theoretic and coverage-based approaches. ACM Trans. Softw. Eng. Methodol. 22 (July (3)).
Yu, Y., Jones, J.A., Harrold, M.J., 2008. An empirical study of the effects of test-suite reduction on fault localization. In: Proceedings of the 30th International Conference on Software Engineering, Leipzig, Germany, May, pp. 201–210.
Zhang, Z., Chan, W.K., Tse, T.H., Yu, Y., Hu, P., 2011. Non-parametric statistical fault localization. J. Syst. Softw. 84 (June (6)), 885–905.

Yabin Wang is a Ph.D. student at the Software Institute, Nanjing University, under the supervision of Prof. Zhenyu Chen and Prof. Bin Luo. His research interest is software testing. In 2012, he visited the University of Texas at Dallas as a visiting student under the supervision of Prof. Eric Wong.

Ruizhi Gao received his B.S. in Software Engineering from Nanjing University. He is a Ph.D. candidate in the Department of Computer Science at the University of Texas at Dallas. His current research interests include software testing, fault localization, and program debugging.

Zhenyu Chen is an Associate Professor at the Software Institute, Nanjing University. He received his B.S. and Ph.D. in Mathematics from Nanjing University. He worked as a Postdoctoral Researcher at the School of Computer Science and Engineering, Southeast University, China. His research interests focus on software analysis and testing. He has about 70 publications at major venues including TOSEM, JSS, SQJ, IJSEKE, ISSTA, ICST, and QSIC. He has served as PC co-chair of QSIC 2013, AST 2013, and IWPD 2012, and as a program committee member of many international conferences. He has won research funding from several competitive sources such as NSFC.

W. Eric Wong received his Ph.D. in Computer Science from Purdue University. He is currently a Professor and the Director of International Outreach in Computer Science at the University of Texas at Dallas. He is also the Founding Director of the newly established Advanced Research Center on Software Testing and Quality Assurance. In addition, Professor Wong has an appointment as a Guest Researcher at NIST (National Institute of Standards and Technology), an agency of the U.S. Department of Commerce. Prior to joining UTD, he was with Telcordia (formerly Bellcore) as a Project Manager for Dependable Telecom Software Development. Dr. Wong received the Quality Assurance Special Achievement Award from Johnson Space Center, NASA, in 1997. His research focus is on technology to help practitioners develop high-quality software at low cost. In particular, he does research in software testing, debugging, metrics, safety, and reliability. Dr. Wong is also the Secretary of the IEEE Reliability Society and the Founding Steering Committee Chair of the IEEE International Conference on Software Security and Reliability (SERE). He is on the editorial boards of IEEE Transactions on Reliability and the Journal of Systems and Software.

Bin Luo is a Professor at the Software Institute, Nanjing University. His main research interests are applied software engineering and software engineering education. He leads the institute of applied software engineering at Nanjing University.