WAS: A weighted attribute-based strategy for cluster test selection


Yabin Wang (a), Ruizhi Gao (b), Zhenyu Chen (a,*), W. Eric Wong (b), Bin Luo (a)

(a) State Key Laboratory for Novel Software Technology, Nanjing University, China
(b) Department of Computer Science, University of Texas at Dallas, USA

Article history: Received 6 September 2013; received in revised form 4 August 2014; accepted 18 August 2014; available online 10 September 2014.

Keywords: Weighted execution profile; Cluster test selection; Software fault localization

Abstract

In past decades, many techniques have been proposed to generate and execute test cases automatically. However, when a test oracle does not exist, execution results have to be examined manually. With the increasing functionality and complexity of today's software, this process can be extremely time-consuming and mistake-prone. A CTS-based (cluster test selection) strategy provides a feasible solution to mitigate this deficiency by examining the execution results of only a small number of selected test cases. It groups test cases with similar execution profiles into the same cluster and selects test cases from each cluster. Some well-known CTS-based strategies are one per cluster, n (a predefined value greater than 1) per cluster, adaptive sampling, and execution-spectra-based sampling (ESBS). The ultimate goal is to reduce testing cost by quickly identifying the executions that are likely to fail. However, improperly grouping the test cases significantly diminishes the effectiveness of these strategies (more successful and fewer failed executions are examined). To overcome this problem, we propose a weighted attribute-based strategy (WAS). Instead of clustering test cases only once, based solely on the similarity of their execution profiles, as the aforementioned CTS-based strategies do, WAS conducts more than one iteration of clustering using weighted execution profiles that also take into account the suspiciousness of each program element (statement, basic block, decision, etc.), where suspiciousness, in terms of the likelihood of containing bugs, can be computed using various software fault localization techniques. Case studies using seven programs (make, ant, sed, flex, grep, gzip, and space) and four CTS-based strategies (one per cluster sampling, n per cluster sampling, adaptive sampling, and ESBS) were conducted to evaluate the effectiveness of WAS on 184 faulty versions containing either single or multiple bugs. Experimental results suggest that the proposed WAS strategy outperforms the other four CTS-based strategies with respect to both recall and precision, so that output verification is focused more strongly on failed executions.

1. Introduction

Software has become an inseparable component of everyday life, but programs continue to grow in complexity. As a result, it is inevitable that we spend more and more resources on software testing, which includes three major steps: test generation, execution, and output verification. Many techniques have been proposed to reduce the cost of test generation and execution (e.g., Samuel et al., 2008; Tsai et al., 1990). In general, however, test output verification still has to be done manually, especially when a test oracle does not exist. This can be very time-consuming and mistake-prone. To solve this problem, we propose a weighted attribute-based strategy (WAS) to help identify more


failure-causing test cases, as the remaining test executions1 are likely to be successful. This can significantly reduce software development and maintenance cost. Intuitively, failed executions caused by the same bug tend to share some similarities and are more likely to have similar execution profiles (e.g., statement coverage) (Anderberg, 1973; Dickinson et al., 2001a; Podgurski et al., 1999). Test cases for output verification are typically selected using a CTS-based strategy, which groups test cases with similar execution profiles into the same cluster and then selects only a small number of test cases from each cluster for output verification. A CTS-based strategy contains two major steps: clustering and sampling.

1 In this paper, we use the terms “failed/successful execution” and “failed/successful test case” interchangeably. We also use “bugs” and “faults” interchangeably.
* Corresponding author. Tel.: +86 25 83621360. E-mail addresses: [email protected], [email protected] (Z. Chen).


Most CTS-based strategies, such as one per cluster (Dickinson et al., 2001a), n (a predefined value greater than 1) per cluster (Dickinson et al., 2001a), adaptive sampling (Chen et al., 2010), and execution-spectra-based sampling (ESBS) (Yan et al., 2010), focus on how to optimize the sampling. Unfortunately, they overlook the other critical factor: how to do better clustering. Rather than performing one iteration of clustering, as the aforementioned CTS-based techniques do, the proposed WAS imposes multiple iterations of clustering. After the first iteration of clustering, WAS uses a sampling approach similar to that of ESBS to select a few test cases for output verification. The suspiciousness value of each statement is then computed using a software fault localization technique (see Section 2.2) based on the execution profiles and the success or failure of the selected test cases. These values are then used to calibrate the initial execution profile of each test case, creating a weighted execution profile for the next iteration of clustering. With failure-related statements weighted more heavily, failed executions can be clustered more accurately. Users can continue this process until the number of selected test cases reaches a predefined value. The example in Section 3.2 provides a more detailed explanation.

To demonstrate the benefit of the proposed WAS, case studies were conducted on seven open source programs (flex, grep, ant, make, space, sed, and gzip) and the four CTS-based strategies mentioned earlier (one per cluster, n per cluster, adaptive sampling, and ESBS). Our results suggest that WAS generally outperforms the other strategies with better precision and recall. This paper is an extension of our conference paper2 with the following significant enhancements and contributions:

• A novel strategy, WAS, is proposed by introducing weighted execution profiles. To the best of our knowledge, we are the first to apply a software fault localization technique to assign weights to execution profiles in CTS-based strategies.
• The effectiveness of WAS is compared to that of four existing CTS-based strategies on seven open-source programs. The results show that WAS generally outperforms the other CTS-based strategies in terms of both recall and precision.
• Statement coverage is used instead of function coverage as in our conference paper. A comparison of the effectiveness of WAS using these two different execution profiles is reported.
• Case studies include programs with a single bug as well as programs containing multiple bugs.
• The effectiveness of WAS using six different fault localization techniques is compared to provide more insight.
• Not only the effectiveness but also the efficiency of WAS is evaluated.
• The impact on the effectiveness of WAS of selecting a different percentage of all test cases per clustering iteration is examined and presented.

The rest of the paper is organized as follows. Section 2 provides the background of CTS-based strategies and a brief overview of software fault localization. Section 3 presents a detailed explanation and an example of WAS. Case studies, results, and analyses are given in Section 4. A discussion of research questions is provided in Section 5. Related work appears in Section 6. Section 7 discusses the threats to validity. Conclusions and future directions can be found in Section 8. Additional experimental results are presented in the Appendix after the References.


2 A preliminary version of this paper was presented at the Sixth International Conference on Software Security and Reliability (SERE’12) (Wang et al., 2012).


2. Background

We first describe how execution profiles of test cases are clustered, followed by an introduction of the six software fault localization techniques used in this paper.

2.1. Execution profile clustering

Clustering divides a collection of objects into clusters that contain similar instances (Witten and Frank, 1999). Objects in the same cluster are similar to each other and dissimilar to those in other clusters. CTS-based strategies use clustering techniques to group execution profiles based on their semantic behavior. Test cases are clustered by their execution profiles, each of which is an instance when using a clustering algorithm. Existing studies show that execution profiles can be used as a representation of the behavior of executions (DiGiuseppe and Jones, 2012). Dickinson et al. (2001a) later found that over half of the nearest neighbors of failed test cases had also failed, and that failed executions have unusual properties that may be reflected in execution profiles (Dickinson et al., 2001b). These results, extending Podgurski's earlier findings (Podgurski and Yang, 1993), suggest that profiles can be an appropriate indication of failures.

According to the studies described above, failed executions may have similar profiles. Hence, these profiles can be grouped together using clustering algorithms. Podgurski and Yang (1993) studied such problems in their early work and tried to predict an execution's success or failure through profile clustering.

Clustering techniques can be roughly divided into soft clustering techniques and hard clustering techniques. Under a soft clustering technique, an object can be shared by different clusters; under a hard clustering technique, an object belongs to at most one cluster. In this paper, we use K-means, one of the most popular hard clustering techniques. The first step of K-means is to estimate the number of clusters, k, according to the total number of test cases. Next, k test cases are randomly selected as cluster centers, and each test case is assigned to the nearest cluster center based on the Euclidean distance between test cases. The center of each cluster is then recalculated based on the current clustering results. In each iteration, new cluster centers are produced, and the process continues until the test cases in each cluster remain constant. K-means produces clusters so as to minimize Eq. (1), where Si indicates the ith cluster, xj indicates a test case, and μi indicates a cluster center:

arg min over S of Σ_{i=1..k} Σ_{xj∈Si} ‖xj − μi‖²    (1)

The approach of assigning one instance to its nearest cluster center is shown in Eq. (2), in which t indicates a test case and mi and mj indicate cluster centers. In WAS, the execution profile of a test case t is represented as a numeric vector t: ⟨e1, e2, e3, ..., en⟩, where ei ≠ 0 indicates that statement ωi is executed by t, and ei = 0 indicates that ωi is not executed by t. Cluster centers mi and mj are calculated by Eq. (3), in which |Si| indicates the number of test cases in cluster Si; hence mi is also a numeric vector of the same dimension as t.

Si = {t : ‖t − mi‖ ≤ ‖t − mj‖ for all 1 ≤ j ≤ k}    (2)

mi = (1/|Si|) Σ_{tj∈Si} tj    (3)

If mi is represented as ⟨c1, c2, c3, ..., cn⟩, the Euclidean distance between t and mi is √(Σ_{i=1..n} (ei − ci)²).
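To make this clustering step concrete, the following minimal sketch (our illustration, not the authors' tooling) clusters binary execution profiles with scikit-learn's K-means; the small profile matrix is hypothetical.

```python
# Minimal sketch of execution-profile clustering (illustrative only).
# Rows are test cases, columns are statements; 1 = statement covered.
import numpy as np
from sklearn.cluster import KMeans

profiles = np.array([
    [1, 1, 0, 0, 1, 1],   # t0
    [1, 1, 0, 0, 1, 1],   # t1
    [0, 0, 1, 0, 1, 1],   # t2 -- an unusual profile, e.g. a failing run
    [1, 1, 0, 1, 1, 1],   # t3
])

# As in WAS, the number of clusters is estimated from the number of tests:
# k = sqrt(Nt / 2) (Mardia et al., 1979; see Section 3.1, Step 2).
k = max(1, int(round(np.sqrt(len(profiles) / 2))))

# K-means assigns each profile to its nearest center by Euclidean distance
# (Eq. (2)) and recomputes each center as the cluster mean (Eq. (3)).
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(profiles)
print(labels)  # cluster index per test case; similar profiles co-cluster
```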


Table 1
Notations.

NCF(ω)  The number of failed test cases that cover ω
NCS(ω)  The number of successful test cases that cover ω
NUF(ω)  The number of failed test cases that do not cover ω
NUS(ω)  The number of successful test cases that do not cover ω
NS      The number of successful test cases
NF      The number of failed test cases

2.2. Software fault localization

The aim of software fault localization is to identify the location of bugs in a software system. The process uses different techniques to calculate a suspiciousness value for each component of the software. All components are ranked by suspiciousness value in descending order so that the most suspicious parts are prioritized. There are many fault localization techniques, such as static, dynamic, and execution slice-based techniques, program state-based techniques, machine learning-based techniques, and program spectrum-based techniques. WAS can be based on any of these, but in this paper it is implemented using program spectrum-based techniques.

In spectrum-based fault localization (SBFL), a program spectrum contains information about the elements that are covered during the execution of an instrumented program. Such information can be collected at the granularity of a function, block, branch, or statement. In this paper, spectrum information is collected at the statement level. SBFL techniques use the notations NUS(ω), NUF(ω), NCS(ω), and NCF(ω) to indicate four properties of a statement ω: NUS(ω) is the number of successful test cases that do not cover ω; NUF(ω) is the number of failed test cases that do not cover ω; NCS(ω) is the number of successful test cases that cover ω; and NCF(ω) is the number of failed test cases that cover ω. These four properties are used to calculate a suspiciousness value for the statement; a higher suspiciousness value suggests that a statement is more likely to contain bugs. Many SBFL techniques, such as Jaccard (Jaccard, 1901), Tarantula (Jones and Harrold, 2005), Ochiai (Abreu et al., 2009), Crosstab (Wong et al., 2012a), and H3b and H3c (Wong et al., 2010), have been shown to be effective in reducing the number of statements that must be examined in order to identify bugs. Table 1 summarizes the notation used in this section.

The Jaccard metric (Abreu et al., 2009), which compares the similarity and dissimilarity of two sets, is defined as the size of their intersection divided by the size of their union. In fault localization, for a statement ω, the two sets are the set of failed test cases and the set of all test cases that cover ω. The suspiciousness value Jaccard(ω) is defined in Eq. (4):

Jaccard(ω) = NCF(ω) / (NCF(ω) + NCS(ω) + NUF(ω))    (4)

The Tarantula fault localization technique (Jones and Harrold, 2005) assigns a suspiciousness value to each statement according to the formula X/(X + Y), where X = NCF/NF and Y = NCS/NS. Tarantula has been shown to be more effective on the Siemens suite than other fault localization techniques such as set union, set intersection, nearest neighbor, and cause transitions.

The Ochiai coefficient (Ochiai, 1957), evaluated by Abreu et al. (2009) in the context of software fault localization, assigns each statement the suspiciousness value NCF / √(NF × (NCF + NCS)).

Wong et al. (2012a) presented a Crosstab (cross tabulation) based fault localization technique that uses a well-defined statistical analysis, which has been used to study the relationship between two or more categorical variables. In their study, two variables are used: “test execution result (successful or failed)” and “the coverage of a given statement.” The null hypothesis is that these two variables are independent. However, rather than studying the “dependence”/“independence” relationship between the two variables, it is more relevant to study the degree of association between them. More precisely, a crosstab for each statement contains two column-wise categorical variables (covered and not covered) and two row-wise categorical variables (successful and failed) to determine whether a closer relationship exists between the two kinds of categorical variables. A statistic computed from each table gives the suspiciousness of the corresponding statement.

Wong et al. (2010) proposed the H3b and H3c techniques by examining how each additional failed (or successful) test case helps perform fault localization. They concluded that if all the test cases are executed in sequence, the contribution of the first failed test case is larger than or equal to that of the second, which is larger than or equal to that of the third, and so on; the same holds for successful test cases. The suspiciousness of each statement is calculated as [(1.0) × nF,1 + (0.1) × nF,2 + (0.01) × nF,3] − [(1.0) × nS,1 + (0.1) × nS,2 + α × (F/S) × nS,3], where nF,i and nS,i are the number of failed and successful test cases in the ith group, and F/S is the ratio of the total number of failed test cases to the total number of successful test cases with respect to a specific bug. The technique is labeled H3b when α = 0.001 and H3c when α = 0.0001.
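Since WAS only needs such formulas as plug-ins, a minimal sketch (ours, not the authors' implementation) of the three simplest of them is given below; the example call reproduces the value computed for ω3 in Table 5.

```python
# Minimal sketch (ours, not the authors' implementation) of three SBFL
# formulas from Section 2.2, in the notation of Table 1.
import math

def jaccard(ncf, ncs, nuf, nus):
    # Eq. (4): NCF / (NCF + NCS + NUF)
    d = ncf + ncs + nuf
    return ncf / d if d else 0.0

def tarantula(ncf, ncs, nuf, nus):
    nf, ns = ncf + nuf, ncs + nus      # total failed / successful test cases
    x = ncf / nf if nf else 0.0        # X = NCF / NF
    y = ncs / ns if ns else 0.0        # Y = NCS / NS
    return x / (x + y) if (x + y) else 0.0

def ochiai(ncf, ncs, nuf, nus):
    # NCF / sqrt(NF * (NCF + NCS))
    d = math.sqrt((ncf + nuf) * (ncf + ncs))
    return ncf / d if d else 0.0

# Statement w3 of the running example (Table 5): NCF=2, NCS=0, NUF=0, NUS=1.
print(jaccard(2, 0, 0, 1))   # 1.0, matching the value in Table 5
```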

3. Weighted attribute-based strategy

This section describes the detailed WAS procedure and demonstrates the workflow with a running example.

3.1. Detailed steps

This paper uses the following concepts and notations. For each statement ω, two values X(ω) and Y(ω) are defined in Eqs. (5) and (6). X(ω) is initialized to 0; accordingly, Y(ω) has an initial value of 0 based on Eq. (6).

X(ω) = NCS(ω) − NCF(ω)    (5)

Y(ω) = 0 if X(ω) < 1, and 1 otherwise    (6)

For each test case t, Z(t) represents the total number of statements with Y = 0 covered by t. ST(k) is a set that contains the test cases with Z(t) > 0, where k is the index of test selection starting from 0. NT(q) denotes the set of test cases selected in clustering iteration q, with q starting from 0, and NT contains the test cases selected across all clustering iterations. Fig. 1 demonstrates the overall process of WAS; a detailed description of each step follows.

Fig. 1. Detailed procedure of WAS.


Table 2
Initial execution profiles. Each row lists the values for test cases t0 through t29, from left to right; 1 = covered, 0 = not covered. In the last row, P = successful and F = failed.

ω1:   1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ω2:   1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
ω3:   0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1
ω4:   0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
ω5:   1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
ω6:   1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1
ω7:   1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1
ω8:   0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
ω9:   1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
ω10:  1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
P/F:  P P P P P P F F P P P P P P P P P P P P P P P P P P P F F F

Step 1: Test execution and execution profile collection. All test cases are executed and their execution profiles are collected. Each test case is classified as failed or successful by comparing its output against the expected output: if they match, the test case is successful; otherwise, it has failed.

Step 2: Cluster analysis. WAS uses K-means (Hartigan and Wong, 1979; James, 1967) to group the test cases into clusters based on the similarity of their execution profiles. Before K-means is applied, the number of clusters is set to √(Nt/2) for Nt test cases, based on the work of Mardia et al. (1979). One test case is then randomly selected for each cluster as the initial cluster center.

Step 3: Cluster selection. One cluster is randomly chosen from the results of Step 2. A chosen cluster cannot be used in further iterations.

Step 4: Test selection. For each statement ω, X(ω) and Y(ω) are computed using Eqs. (5) and (6), and Z(t) is then calculated for each test case t. Test cases with Z(t) > 0 are included in the set ST(k). If ST(k) is empty, the process returns to Step 3. Otherwise, a test case with the highest Z(t) is selected from ST(k) and denoted th(k); if multiple test cases share the highest Z(t), one of them is randomly selected. th(k) is examined to determine whether it is successful or failed, and is then added to NT(q). If the size of NT(q) is greater than or equal to 10% of the test cases, the test cases in NT(q) are merged into NT, NT(q) is reset to the empty set, and the process continues to Step 5. Otherwise, Step 4 is repeated.

Step 5: Execution profile weighting. Using the test cases in NT, a suspiciousness value susp(ω) is calculated for each statement using a fault localization technique (e.g., Jaccard, Tarantula, Ochiai, Crosstab, H3b, or H3c). For a test case t, the execution profile t: ⟨e1, e2, e3, ..., en⟩ is weighted by replacing ei with susp(ωi) whenever ei ≠ 0. With the weighted execution profiles, the process returns to Step 2. A sketch of Steps 4 and 5 follows.
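The following minimal sketch (our simplification; the helper names are hypothetical, not the authors' code) shows how the bookkeeping of Steps 4 and 5 can be realized, with any SBFL formula from Section 2.2 supplied as the suspiciousness function.

```python
# Minimal sketch (our simplification; helper names are hypothetical) of the
# bookkeeping in Steps 4 and 5. `profiles` maps a test id to its 0/1
# coverage vector; `susp` is a list of suspiciousness values per statement.

def select_test(profiles, X):
    # Eq. (6): Y(w) = 0 iff X(w) < 1; Z(t) = number of covered statements
    # with Y = 0. Return a test with the highest Z(t), or None if ST is empty.
    Y = [0 if x < 1 else 1 for x in X]
    Z = {t: sum(1 for e, y in zip(cov, Y) if e and y == 0)
         for t, cov in profiles.items()}
    st = [t for t, z in Z.items() if z > 0]          # the set ST(k)
    return max(st, key=Z.get) if st else None

def update_X(X, coverage, failed):
    # Eq. (5): X(w) = NCS(w) - NCF(w), updated incrementally as each selected
    # test is examined (+1 per covered statement if it passed, -1 if it failed).
    delta = -1 if failed else 1
    return [x + delta if e else x for x, e in zip(X, coverage)]

def weight_profiles(profiles, susp):
    # Step 5: replace every nonzero entry e_i with susp(w_i).
    return {t: [susp[i] if e else 0.0 for i, e in enumerate(cov)]
            for t, cov in profiles.items()}
```

Applied to the example of Section 3.2 with X initialized to all zeros, every statement has Y = 0, so Z(t) is simply the number of statements t covers; this is why t10, t12, and t15 tie at Z = 7 in the first row of Table 4.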

3.2. Example

The following hypothetical example demonstrates how WAS is applied. It uses a program containing ten statements, ω1–ω10, with a bug introduced in ω3. Thirty test cases, t0–t29, are executed, and the execution profile (statement coverage) of each test is collected (Step 1). Table 2 shows the initial execution profiles. For example, the row labeled ω1 shows how statement ω1 is covered by each test case: an entry of 1 means ω1 is covered by the corresponding test case, and an entry of 0 means it is not. In the last row, P indicates a successful test case and F indicates a failed one.

After test execution and execution profile collection, cluster analysis (Step 2) is performed. The 30 test cases are grouped into three clusters (cluster 1, cluster 2, and cluster 3) according to the similarities of their execution profiles. The clustering results in Table 3 show that twelve successful test cases and two failed test cases, t6 and t7, fall into cluster 1; seven successful test cases and no failed test cases fall into cluster 2; and six successful test cases and three failed test cases, t27, t28, and t29, are assigned to cluster 3. Cluster 1 is randomly selected (Step 3).

As shown in Table 4, test selection (Step 4) begins by calculating X(ω) and Y(ω) based on Eqs. (5) and (6). Z(t) is counted and ST(0) is generated as described in Section 3.1. Test cases t10, t12, and t15 share the highest Z(t) value of 7. t10 is randomly chosen for output examination and found to be successful. After t10 is added to NT(0), the size of NT(0) is one. As this is less than 10% of the test cases (30 × 10% = 3), test selection is repeated and ST(1) is produced based on the updated X(ω), Y(ω), and Z(t). In ST(1), three test cases, t6, t7, and t9, share the highest Z(t) value of 1. t6 is randomly selected and found to have failed. As the size of NT(0) is now 2, test selection is run once more. One test case, t7, has the highest Z(t) value of 4; it is selected and found to have failed. At this point, since the size of NT(0) equals 10% of the test cases, the process continues to execution profile weighting (Step 5) and the test cases in NT(0) are added to NT.

Using the test cases in NT (t6, t7, and t10), the suspiciousness value of each statement is calculated according to the Jaccard fault localization technique; the results are shown in Table 5. In execution profile weighting (Step 5), these suspiciousness values are used to weight the initial execution profiles, as shown in Table 6. The process then returns to cluster analysis (Step 2). New clusters are formed from the weighted execution profiles, as shown in Table 7. In the new cluster 3, t27, t28, and t29 are failed and t8 is successful. Compared to the original cluster 3 in Table 3, the percentage of failed test cases has increased from 33.3% (three out of nine) to 75% (three out of four). Hence, it is much easier for a user to identify more failed test cases in this new cluster while examining fewer test case results.

Table 3
Results from the first clustering iteration.

Cluster 1: t6, t7, t8, t9, t10, t11, t12, t14, t15, t16, t20, t22, t24, t26
Cluster 2: t13, t17, t18, t19, t21, t23, t25
Cluster 3: t0, t1, t2, t3, t4, t5, t27, t28, t29


Table 4
Test selection procedure.

NT(0) = ∅:
  X(ω1..ω10) = 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
  Statements with Y(ω) = 0: ω1–ω10
  Z(t) > 0: Z(t6)=4, Z(t7)=4, Z(t8)=6, Z(t9)=7, Z(t10)=7, Z(t11)=6, Z(t12)=7, Z(t14)=6, Z(t15)=7, Z(t16)=6, Z(t20)=6, Z(t22)=6, Z(t24)=6, Z(t26)=4
  ST(0) = {t6, t7, t8, t9, t10, t11, t12, t14, t15, t16, t20, t22, t24, t26}

NT(0) = {t10}:
  X(ω1..ω10) = 1, 1, 0, 1, 1, 1, 1, 1, 0, 0
  Statements with Y(ω) = 0: ω3, ω9, ω10
  Z(t) > 0: Z(t6)=1, Z(t7)=1, Z(t9)=1
  ST(1) = {t6, t7, t9}

NT(0) = {t6, t10}:
  X(ω1..ω10) = 1, 1, −1, 1, 0, 0, 1, 0, 0, 0
  Statements with Y(ω) = 0: ω3, ω5, ω6, ω8, ω9, ω10
  Z(t) > 0: Z(t7)=4, Z(t8)=3, Z(t9)=3, Z(t11)=2, Z(t12)=3, Z(t14)=3, Z(t15)=3, Z(t16)=2, Z(t20)=2, Z(t22)=2, Z(t24)=2, Z(t26)=1
  ST(2) = {t7, t8, t9, t11, t12, t14, t15, t16, t20, t22, t24, t26}

NT(0) = {t6, t7, t10}:
  X(ω1..ω10) = 1, 1, −2, 1, −1, −1, 1, −1, 0, 0
  Statements with Y(ω) = 0: ω3, ω5, ω6, ω8, ω9, ω10
  Z(t) > 0: Z(t8)=3, Z(t9)=3, Z(t11)=2, Z(t12)=3, Z(t14)=3, Z(t15)=3, Z(t16)=2, Z(t20)=2, Z(t22)=2, Z(t24)=2, Z(t26)=1
  ST(3) = {t8, t9, t11, t12, t14, t15, t16, t20, t22, t24, t26}

Table 5
Suspiciousness values.

Statement  NCF  NCS  NUF  NUS  Suspiciousness
ω1         0    1    2    0    0
ω2         0    1    2    0    0
ω3         2    0    0    1    1
ω4         0    1    2    0    0
ω5         2    1    0    0    0.6
ω6         2    1    0    0    0.6
ω7         0    1    2    0    0
ω8         2    1    0    0    0.6
ω9         0    0    2    1    0
ω10        0    0    2    1    0

Table 6
Weighted execution profiles. Each row lists the values for test cases t0 through t29, from left to right. In the last row, P = successful and F = failed.

ω1:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω2:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω3:   0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1
ω4:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω5:   0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0 0.6 0.6 0.6 0.6 0.6 0.6 0 0 0 0 0 0 0 0 0 0 0 0 0.6 0
ω6:   0.6 0.6 0 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0 0.6 0.6 0.6 0 0 0 0.6 0 0.6 0 0.6 0 0.6 0 0 0.6
ω7:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω8:   0 0.6 0.6 0.6 0 0.6 0.6 0.6 0.6 0.6 0.6 0 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0 0.6 0.6 0.6
ω9:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ω10:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P/F:  P P P P P P F F P P P P P P P P P P P P P P P P P P P F F F

Table 7
New clustering results.

Cluster 1: t0, t1, t3, t4, t5, t6, t7, t9, t10, t11, t12, t13, t15, t16
Cluster 2: t2, t14, t18, t19, t20, t22, t24, t26
Cluster 3: t8, t27, t28, t29

4. Case studies

This section presents a detailed description of the case studies, including subject programs, experimental setup, evaluation model, and experimental results.

4.1. Subject programs

Case studies were conducted on seven open-source programs, flex, grep, gzip, make, ant, sed, and space, which were downloaded

from (The Software, 2008). Table 8 shows the number of statements, the number of test cases, the number of faulty versions, and the number of failed test cases for each program in our case studies. Faulty versions that did not result in any execution failure by the downloaded test cases were excluded. To enlarge the study sets, additional faulty versions were created by the application of mutation-based fault injection, which has been shown to be an effective approach for simulating realistic faults (Andrews et al., 2005; Do and Rothermel, 2006; Liu et al., 2006; Namin et al., 2006). In this paper, two classes of mutant operators are used:


Table 8
Subject programs.

Program  Number of statements  Number of test cases  Number of faulty versions  Number of failed test cases
make     20,014                793                   20                         452
ant      75,333                871                   21                         182
sed      12,062                360                   7                          108
flex     13,892                525                   19                         215
grep     12,653                470                   18                         140
gzip     6573                  211                   16                         21
space    9126                  13,585                23                         1357

• Replacement of an arithmetic, relational, logical, increment/decrement, or assignment operator by another operator from the same class.
• Decision negation in an if or while statement.
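As an illustration only, one hypothetical mutant of each class might look as follows (the function is invented for the example; the study's actual mutants were injected into the C and Java subject programs):

```python
# Hypothetical mutants of an invented function, illustrating the two classes.

def in_range(x, lo, hi):            # original code
    if lo <= x and x <= hi:
        return True
    return False

def in_range_rel_mut(x, lo, hi):    # relational operator replacement:
    if lo < x and x <= hi:          # `<=` replaced by `<`
        return True
    return False

def in_range_neg_mut(x, lo, hi):    # decision negation in the `if` statement
    if not (lo <= x and x <= hi):
        return True
    return False
```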

Studies such as Offutt et al. (1996), Van der Meulen et al. (2004), and Wong and Mathur (1995) have reported that test cases that kill mutants generated by relational operator replacement and logical operator replacement are also likely to kill other mutants, and that such test cases achieve high mutation and dataflow-based coverage. Although the focus of this paper differs from those studies, their conclusion is still applicable here. Additional mutation operators, such as replacement of assignment, arithmetic, or increment/decrement operators, are also included based on our analysis of common program bugs.

4.2. Experimental setup

The experiments start with the estimation of the number of clusters. As described in Section 3.1, the K-means clustering technique is used for all strategies evaluated in this paper. The number of clusters is set to √(Nt/2) for Nt test cases, and one test case is then randomly selected for each cluster as the initial cluster center. Table 9 shows the number of clusters and the number of test cases in each cluster after clustering the initial execution profiles. For example, for make, each cluster may contain between 25 and 77 test cases.

To compare WAS to other CTS-based strategies, WAS, ESBS, one per cluster, n per cluster, and adaptive sampling are implemented. After clustering of the execution profiles, one per cluster selects one test case randomly from each cluster; n per cluster selects n test cases from each cluster, with n set to 3 according to Yan et al. (2010). Adaptive sampling initially selects one test case randomly from each cluster and then selects the remaining test cases from the clusters in which the first selected test case failed. ESBS is an improvement over the adaptive sampling strategy, and its sampling strategy is very similar to that of WAS. However, in WAS, Y(ω) of a statement ω is reset to 0 each time a new cluster is chosen, whereas ESBS is not; furthermore, WAS performs multiple clustering iterations using weighted execution profiles, while ESBS clusters only once. WAS is implemented with the Crosstab fault localization technique to calculate the suspiciousness value of each statement, and this value is used to weight the execution profile. All experiments were conducted on a workstation with an i7-3770 CPU at 3.4 GHz and 16 GB of memory. All subject programs used in this paper were instrumented with a revised version of χSuds (χSuds User's Manual, 1998), which records the statements executed by each test case.

4.3. Evaluation model

To assess the effectiveness of WAS and the other strategies, two evaluation metrics are introduced: recall measures failure detection ability, and precision measures failure detection exactitude.

• Recall: if there are F failed test cases and a strategy identifies FS of them, recall is defined as

Recall = FS / F

• Precision: if TS test cases have been selected and examined by a strategy, and FS of them are failed test cases, precision is defined as

Precision = FS / TS
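A minimal sketch of the two metrics (ours, for illustration):

```python
# Minimal sketch (ours) of the evaluation metrics in Section 4.3.

def recall(fs, f):
    # Fs / F: fraction of all failed test cases that the strategy identified.
    return fs / f if f else 0.0

def precision(fs, ts):
    # Fs / Ts: fraction of selected (examined) test cases that actually failed.
    return fs / ts if ts else 0.0

# E.g., a strategy that selects 10 tests, 4 of which fail, out of 5 failures
# in the whole suite achieves recall 0.8 and precision 0.4:
print(recall(4, 5), precision(4, 10))
```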

4.4. Experimental results

This section presents a comparison of the five CTS-based strategies with respect to recall and precision. For each faulty version of a subject program, the WAS, ESBS, adaptive sampling, one per cluster, and n per cluster strategies were applied, and recall and precision values for each strategy were recorded. For a given subject program and strategy, the average recall and precision values are calculated over the set of recall and precision values for all faulty versions. Due to space limitations, only the results of four of the seven programs (gzip, grep, ant, and sed) are presented below in Fig. 2; the remaining programs' results are included in Fig. 5 in the Appendix.

As Figs. 2 and 5 show, WAS demonstrates improved recall and precision over ESBS, adaptive sampling, one per cluster sampling, and n per cluster sampling. For gzip, grep, ant, and sed, WAS achieves about 10–20% higher recall than ESBS when the test selection percentage ranges from 20% to 50%. The improvement in recall is significant, especially for gzip and ant. For gzip, when test selection reaches 20%, the recall of WAS reaches 76% while that of ESBS reaches only 55%; at 30%, WAS reaches 80% versus 64% for ESBS; and at 40%, WAS reaches 90% versus 71% for ESBS. As shown in Fig. 2, the adaptive sampling, one per cluster, and n per cluster strategies likewise perform poorly in comparison to WAS.

As Fig. 2(e)–(h) shows, WAS also achieves higher precision values than the other strategies.

Table 9
Number of clusters and number of test cases in each cluster.

Program  Number of clusters  Number of test cases in each cluster
make     20                  25–77
ant      21                  32–59
sed      13                  10–42
flex     16                  27–50
grep     15                  19–87
gzip     10                  15–32
space    82                  112–480


Fig. 2. Recall and precision for gzip, grep, ant, and sed using different strategies.

When the test selection percentage is the same, higher precision indicates that more failed test cases are among those selected. For example, according to Fig. 2(f), when 30% of all test cases are selected, WAS achieves a precision of 79% while ESBS reaches only 66%.

It can also be observed in Fig. 2 that the precision of WAS is relatively low for some programs. For example, when 30% of test cases are selected, the precision of WAS is 79% for grep, but only 29% for gzip at the same test selection percentage. A similar situation can be found for space (shown in Fig. 5(f)). Nevertheless, WAS consistently returns higher precision than the other strategies evaluated.

From Fig. 2 it can further be observed that WAS performs similarly to ESBS, and outperforms the other techniques, at a 10% test selection percentage. In WAS, each time a cluster is chosen, Y(ω) of each statement ω is reset to 0 based on Eq. (6); ESBS does not reinitialize Y(ω) when it moves to a new cluster, so the previous Y(ω) values interfere. After a fault localization technique is applied to the 10% of test cases selected to weight the initial execution profile, the precision and recall of WAS increase in comparison to ESBS and the other sampling strategies.

5. Discussion

In this section, six research questions are proposed regarding important factors that may affect the effectiveness of WAS.

5.1. Can statement coverage improve the effectiveness of WAS?

In our conference paper (Wang et al., 2012), the execution profile in terms of function coverage was used for clustering. In this paper, the execution profile in terms of statement coverage is used instead, as it provides more detailed information about each execution. We evaluated the recall and precision of WAS using both statement coverage and function coverage and found that statement coverage returned better results. For example, Fig. 3 shows that the recall using statement coverage (the line with rhombic dots) is higher than that using function coverage on flex.

Fig. 3. Recall using statement coverage versus that using function coverage.

5.2. Can WAS improve the distance space of execution profiles for clustering?

The distance space reflects the degree of similarity and dissimilarity of the test cases in a test suite. In our study, the Euclidean distance between two execution profiles is used as the distance between two test cases, and the distances between all pairs of test cases constitute the distance space of a test suite. To determine whether WAS can improve the distance space of execution profiles, the failure-close-degree metric (Brock et al., 2008) is used to evaluate the distance space of clustering. Failure-close-degree measures how close the failed test cases in a test suite T are to each other. For one faulty version of a program, NF indicates the set of failed test cases in T and NS the set of successful test cases in T. sum(NF) is the sum of Euclidean distances between any two failed test cases fi and fj in NF, i.e., Σ_{fi,fj∈NF, i≠j} dist(fi, fj). sum(NF, NS) is the sum of Euclidean distances between each failed test case fi in NF and each successful test case pj in NS, i.e., Σ_{fi∈NF, pj∈NS} dist(fi, pj). A small failure-close-degree means that the failed test cases in T are close to each other while the distance between failed and successful test cases is large; therefore, a smaller failure-close-degree indicates a better distance space for clustering:

Failure-close-degree = sum(NF) / sum(NF, NS)

Table 10
Failure-close-degree for the original and weighted execution profiles.

Program  Original  Weighted
grep     1.37      0.85
sed      6.85      1.14
gzip     8.05      2.56
make     23.37     10.72
flex     11.95     3.44
ant      7.52      3.73
space    0.82      0.84
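For illustration, the metric can be computed directly from the profile vectors; the sketch below is ours, using an unordered-pair convention for sum(NF):

```python
# Minimal sketch (ours, not the authors' code) of the failure-close-degree
# metric from Section 5.2. Profiles are numeric vectors; sum(NF) is taken
# over unordered pairs of failed tests, sum(NF, NS) over all failed/successful
# pairs.
import math
from itertools import combinations, product

def dist(a, b):
    # Euclidean distance between two execution-profile vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def failure_close_degree(failed_profiles, successful_profiles):
    num = sum(dist(f1, f2) for f1, f2 in combinations(failed_profiles, 2))
    den = sum(dist(f, p) for f, p in product(failed_profiles, successful_profiles))
    return num / den if den else 0.0
```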

Results shown in Table 10 for a subject program are calculated by averaging the failure-close-degrees across all faulty versions, for both the original and the weighted execution profiles, after 10% of the test cases are selected. The results show that the failure-close-degree decreases rapidly once the weighted execution profile is used.

5.3. Can WAS fit a multi-fault situation?

So far, the evaluation of sampling strategies has been discussed in terms of programs containing only one fault. To better simulate a realistic situation, we conducted experiments on multi-fault programs. Twenty four-fault versions of three large-sized programs, flex, gzip, and grep, were generated using combinations of single-fault versions. The effectiveness of WAS in terms of precision and recall in a multi-fault scenario is presented in Fig. 4. As shown in Fig. 4, the precision and recall of WAS decrease for multi-fault programs compared to those for single-fault programs in Fig. 3. For example, in Fig. 4(a), for flex, the recall of WAS reaches 42% when 30% of test cases are selected, which is lower than the 57% recall for flex with a single fault as shown in Fig. 5(b). However, WAS consistently returns higher recall and precision than the other strategies evaluated.

Fig. 4. Multi-fault experimental results.


Table 11
Recall for different percentages of test cases selected per clustering iteration. Each row lists recall at overall test selection ratios 0.1, 0.2, ..., 1.0 (left to right).

make
  5%:  0.087 0.207 0.307 0.601 0.716 0.846 0.959 1.000 1.000 1.000
  10%: 0.097 0.287 0.467 0.651 0.806 0.896 0.969 1.000 1.000 1.000
  15%: 0.098 0.288 0.470 0.661 0.816 0.899 0.971 1.000 1.000 1.000
  20%: 0.098 0.284 0.478 0.673 0.817 0.899 0.972 1.000 1.000 1.000

flex
  5%:  0.173 0.421 0.632 0.723 0.765 0.906 0.934 1.000 1.000 1.000
  10%: 0.183 0.421 0.742 0.803 0.855 0.946 1.000 1.000 1.000 1.000
  15%: 0.184 0.451 0.752 0.803 0.845 0.956 1.000 1.000 1.000 1.000
  20%: 0.184 0.451 0.762 0.803 0.855 0.966 1.000 1.000 1.000 1.000

ant
  5%:  0.165 0.495 0.616 0.818 0.820 0.895 0.919 0.958 0.985 1.000
  10%: 0.185 0.485 0.796 0.828 0.850 0.885 0.939 0.988 0.995 1.000
  15%: 0.195 0.495 0.816 0.829 0.840 0.895 0.933 0.978 0.991 1.000
  20%: 0.195 0.496 0.826 0.839 0.843 0.895 0.933 0.978 0.991 1.000

sed
  5%:  0.135 0.453 0.685 0.742 0.810 0.928 0.948 1.000 1.000 1.000
  10%: 0.135 0.453 0.735 0.792 0.850 0.948 0.988 1.000 1.000 1.000
  15%: 0.136 0.456 0.745 0.796 0.859 0.958 0.989 1.000 1.000 1.000
  20%: 0.136 0.466 0.747 0.792 0.859 0.978 0.989 1.000 1.000 1.000

grep
  5%:  0.087 0.277 0.587 0.631 0.746 0.806 0.829 0.850 0.910 1.000
  10%: 0.100 0.322 0.570 0.770 0.800 0.860 0.912 0.976 1.000 1.000
  15%: 0.099 0.321 0.574 0.761 0.806 0.876 0.919 1.000 1.000 1.000
  20%: 0.099 0.321 0.575 0.761 0.807 0.886 0.929 1.000 1.000 1.000

gzip
  5%:  0.355 0.736 0.707 0.875 0.954 1.000 1.000 1.000 1.000 1.000
  10%: 0.325 0.756 0.807 0.905 0.964 1.000 1.000 1.000 1.000 1.000
  15%: 0.325 0.756 0.807 0.915 0.944 1.000 1.000 1.000 1.000 1.000
  20%: 0.325 0.766 0.807 0.915 0.954 1.000 1.000 1.000 1.000 1.000

space
  5%:  0.265 0.706 0.814 0.840 0.891 0.950 0.987 1.000 1.000 1.000
  10%: 0.265 0.716 0.924 0.961 0.961 0.970 0.987 1.000 1.000 1.000
  15%: 0.265 0.715 0.934 0.971 0.963 0.973 0.989 1.000 1.000 1.000
  20%: 0.265 0.725 0.934 0.982 0.966 0.973 0.989 1.000 1.000 1.000

5.4. Will different fault localization techniques affect the effectiveness of WAS?

WAS uses a fault localization technique to calculate the suspiciousness of each statement. As the performance of fault localization techniques differs, the technique used may affect the effectiveness of WAS. To investigate this point further, we used different fault localization techniques in WAS and calculated the corresponding precision and recall. Fig. 6 in the Appendix shows the experimental results for the six fault localization techniques used: Crosstab, Ochiai, Tarantula, H3c, H3b, and Jaccard. The results show that different fault localization techniques lead to different effectiveness of WAS, but the difference is small; Crosstab generally outperforms the others when used in WAS. Therefore, in our case studies, we use Crosstab in WAS to compute the suspiciousness of each statement.

5.5. Is there any impact if we select a different percentage of all test cases in each clustering iteration?

In our case studies, we selected 10% of all test cases in each clustering iteration. To find the most appropriate value for this percentage, we set it to 5%, 10%, 15%, and 20% of all test cases and calculated the corresponding recall

for all the programs. The results are shown in Table 11. When the test selection percentage increases from 5% to 10%, the recall value increases significantly; when it increases from 10% to 20%, the increase in recall is very limited. Using make as an example, when the test selection percentage increases from 5% to 10%, the recall increases by 0.16, from 0.307 to 0.467; when it increases from 10% to 15%, the recall increases by only 0.003, from 0.467 to 0.470. To achieve a high recall while selecting as few test cases as possible, 10% of all test cases are selected in each clustering iteration.

5.6. Is the additional time cost incurred by WAS acceptable?

There are several clustering iterations in WAS but only one in the other CTS-based strategies. Although WAS achieves higher recall than the other strategies, its time cost is relatively higher. To evaluate the efficiency of WAS, we recorded the number of clustering iterations for WAS and compared the time cost between ESBS and WAS at 70% recall.

Results in Table 12 show that, across the seven subject programs, when recall reaches 70%, WAS requires one additional clustering iteration over ESBS on two programs, two additional clustering iterations on three programs, and three and four additional clustering iterations on one program each. Since the time cost for each clustering


Table 12
Comparing the number of clustering iterations for WAS and ESBS at 70% recall.

Program  Approach  Number of clustering iterations  Average time cost per clustering iteration (s)
flex     WAS       4                                73
flex     ESBS      1                                72
grep     WAS       3                                74
grep     ESBS      1                                78
gzip     WAS       2                                61
gzip     ESBS      1                                64
make     WAS       5                                81
make     ESBS      1                                89
ant      WAS       3                                116
ant      ESBS      1                                113
sed      WAS       3                                79
sed      ESBS      1                                73
space    WAS       2                                121
space    ESBS      1                                119

iteration is no more than 121 s, the total time cost is generally acceptable in practice.

6. Related work

6.1. Test selection

Test selection is normally used in regression testing. The goal of regression testing is to ensure that no new errors are introduced when a program is modified (Ball, 1998). Regression testing is known to be expensive: previously completed test cases need to be rerun to check whether program execution behavior has changed and whether fixed bugs have reappeared. This process consumes a significant amount of time and execution effort, so reducing the cost of regression testing is important in practice. Regression test selection selects a subset of all test cases to rerun on the modified program (Biswas et al., 2011). A number of regression test selection strategies have been proposed and studied (Ball, 1998; Bates and Horwitz, 1993; Binkley, 1997). Rothermel and Harrold (1994) defined the regression test selection problem formally. Leung and White (1991) compared certain regression test selection strategies with the retest-all approach and observed that regression test selection can reduce the cost. CTS-based strategies can also be applied to regression testing (Chen et al., 2011b; Yu et al., 2008). We focus, however, on test selection for output inspection, so-called observation-based testing (Leon et al., 2000), not on regression testing.

6.2. Cluster test selection

The CTS-based strategy, a well-known strategy for observation-based testing, aims to reduce cost by selecting a subset of the universal set of executions that is more likely to consist of failures. The selection process itself, especially the analysis of execution profiles, should be as automated as possible (Dickinson et al., 2001a). Dickinson et al. (2001a,b) studied cluster filtering and proposed an adaptive sampling strategy that selects all test cases in a cluster once a failed test case is found in it. They compared the n per cluster, one per cluster, and adaptive sampling strategies, and their studies showed that adaptive sampling can reveal more failures while an equal or smaller number of test cases is examined (Dickinson et al., 2001a). Existing CTS-based strategies use different execution profiles and clustering algorithms to achieve better results (Dickinson et al., 2001a; Yu et al., 2008). Chen et al. (2011b) proposed a slice filtering technique that reduces the dimensionality of test cases so that the technique can be used for larger software. All these CTS-based strategies

use the original execution profile for a single clustering iteration. WAS, on the other hand, uses a weighted execution profile and conducts multiple clustering iterations. Chen et al. (2011a) introduced semi-supervised learning to reduce the dimensionality of test cases and improve both the distance space for clustering and the effectiveness of CTS-based strategies. However, semi-supervised learning techniques require human participation, whereas WAS can be applied fully automatically; the effectiveness of these two strategies is therefore not directly comparable.

6.3. Fault localization and test selection

While there are many studies on fault localization based on execution profiles (Wong et al., 2012a), we examine only a few related to test selection. Wong and Qi (2009) proposed a fault localization technique based on a back-propagation (BP) neural network, one of the most popular neural network models in practice (Fausett, 1994). However, an RBF (radial basis function) neural network has several advantages over a BP network, including a faster learning rate and stronger resistance to problems such as paralysis and local minima (Lee et al., 1999; Wasserman, 1993). They therefore proposed a fault localization technique using a modified RBF neural network (Wong et al., 2012b), which consists of a three-layer feed-forward structure: one input layer, one output layer, and one hidden layer in which each neuron uses a Gaussian basis function (Hassoun, 1995) as the activation function, with distances computed using a weighted bit-comparison-based dissimilarity. The network is trained to learn the relationship between the statement coverage of a test case and its corresponding success or failure. After the training phase, a set of virtual test cases (each covering a single statement) is provided as input, and the output for each virtual test case is taken as the suspiciousness of the covered statement. Intuitively, the suspiciousness assigned to a statement should be directly proportional to the number of failed test cases that cover it: the more frequently a statement is executed by failed tests, the more suspicious it should be. Wong et al. further proposed the DStar technique (Wong et al., 2014) to achieve better effectiveness, in the sense that fewer statements need to be examined to locate program bugs. DStar has been shown to be more effective than other fault localization techniques such as Tarantula (Jones and Harrold, 2005) and Ochiai (Abreu et al., 2009).

In past decades, many studies have also been published on predicate-based fault localization. Liblit et al. propose a statistical debugging technique, Liblit05, that can isolate bugs in programs with predicates instrumented at particular points (Liblit et al.,


2005). As an extension of (and improvement on) Liblit05, Liu et al. (2006) propose the SOBER technique to rank suspicious predicates; all instrumented predicates can then be ranked and examined in order of their fault-relevance. Zhang et al. (2011) propose a framework that handles statistical predicate-based fault localization by applying standard hypothesis testing techniques. Controlled experiments based on single-fault scenarios using the Siemens suite and the space program were conducted to evaluate the effectiveness of different hypothesis testing methods in the proposed framework, and its effectiveness and efficiency were compared against 33 statement-level fault localization techniques to gauge whether the best predicate-based technique outperforms statement-level techniques. The results show that the proposed framework generally achieves better effectiveness and efficiency than the statement-level techniques compared in that paper.

Test prioritization to improve the performance of SBFL techniques has also been studied recently. Gonzalez-Sanchez et al. (2011) use Bayesian theory to prioritize test cases for fault localization by minimizing fault-locality entropy. Similarly, Yoo et al. (2013) propose FLINT (fault localization using information theory) to improve the effectiveness of fault localization by reducing the Shannon entropy of the locality of the fault; empirical evaluation shows that FLINT improves the effectiveness of Tarantula on different subject programs.

There are also two main issues with traditional fault localization techniques. First, for most fault localization techniques, good performance may require a large number of test cases, and for a large and complex program, even a technique that narrows the potential faulty statements down to 5% of the entire program still leaves hundreds of statements to be examined. Second, there is no explanation or context for why a given statement is ranked as suspicious, since most techniques are based on the assumption of perfect debugging (i.e., examining a faulty statement is enough for a developer to detect, understand, and fix the corresponding bug). Such an assumption is not realistic in practice, since additional information such as functional requirements, test specifications, etc. has to be provided for correct debugging (Parnin and Orso, 2011).

Yu et al. (2008) compared the effect of using different test reduction techniques on fault localization, and Jiang et al. (2012) conducted an experiment to study the effect of test prioritization on fault localization; their research showed that test prioritization and test reduction can make fault localization more effective. Yan et al. (2010) proposed a CTS-based strategy, ESBS, using execution profiles, which can outperform the other CTS-based strategies. However, all these strategies, including ESBS, depend on well-established clustering results. WAS uses a fault localization technique to weight the execution profile, which can significantly improve the effectiveness of CTS-based strategies.

7. Threats to validity

There are several threats to the validity of the proposed strategy and its accompanying results, which include but are not limited to the following. One major concern that may limit the generalization of our results is the choice of subject programs. To alleviate this threat, we conducted our experiments on seven programs (written in C and Java) containing 184 faulty versions. All the programs we use are large, particularly ant (over 75,000 lines of code). Although we cannot claim that WAS can be applied to every program under every circumstance, such extensive empirical studies give us high confidence in the validity of our results.

A threat to validity may also be caused by the fault localization techniques used. In Section 5.4, we compared the recall and precision of WAS using six fault localization techniques (Crosstab, Ochiai, Tarantula, H3c, H3b, and Jaccard) on different subject programs. We observed that different fault localization techniques yield different recall and precision, but the difference is insignificant; Crosstab generally outperforms the others when used in WAS. In the future, more advanced fault localization techniques will be included in the cross-comparison study.

Another threat to validity is the use of K-means as the only clustering technique in WAS. We will evaluate the recall and precision of WAS using other clustering techniques in a future study, as most clustering techniques can easily be applied in WAS.

8. Conclusion

In this paper, we have proposed a novel weighted attribute-based strategy, WAS, to improve the performance of traditional cluster-based test selection so that output verification is focused on the executions that are more likely to fail. Test cases are clustered based on the similarity of their execution profiles, weighted by the suspiciousness (likelihood of containing program bugs) of statements, computed using different spectrum-based software fault localization techniques (Crosstab, Ochiai, Tarantula, H3b, H3c, and Jaccard). Case studies were conducted on seven open source programs (flex, grep, ant, make, space, sed, and gzip) applying four existing CTS-based strategies (one per cluster, n per cluster, adaptive sampling, and ESBS), and the results were compared. Our data indicate that WAS generally attains better precision and recall than the other strategies. Thus, not only are more failed executions selected while only a small number of test cases is examined, but also the majority of failed executions will be selected. Furthermore, the additional clustering iterations required by WAS introduce very little overhead, which makes WAS as efficient as other CTS-based strategies.

These findings suggest that WAS provides significant advantages in situations where no test oracle exists for automated output verification. Rather than spending excessive resources manually examining the result of every test execution, we only need to verify the executions of a small number of test cases selected using WAS. Since those that are not selected are unlikely to fail, we can skip their verification with good confidence.

In future studies, we would like to further explore how using additional software fault localization techniques (Wong et al., 2012b, 2014; Wong and Qi, 2009) and other clustering techniques may affect the effectiveness of WAS. We will also extend our case studies to larger real-life programs.

Acknowledgements

This research is supported, in part, by the National Basic Research Program of China (973 Program 2014CB340702), the National Natural Science Foundation of China (Grant No. 61170067, 61373013), and the US National Science Foundation (DUE-1023071).

Appendix

Precision and recall using WAS, ESBS, adaptive sampling, one per cluster, and n per cluster on flex, make, and space are presented in Fig. 5. Precision and recall using WAS with different fault localization techniques (Crosstab, Tarantula, Ochiai, H3b, H3c, and Jaccard) on gzip, grep, ant, sed, flex, make, and space are presented in Fig. 6.

8. Conclusion

In this paper, we have proposed a novel weighted attribute-based strategy, WAS, to improve the performance of traditional cluster test selection so that output verification focuses on the executions that are most likely to fail. Test cases are clustered based on the similarity of their execution profiles, weighted by the suspiciousness (the likelihood of containing program bugs) of statements as computed by different spectrum-based software fault localization techniques (Crosstab, Ochiai, Tarantula, H3b, H3c, and Jaccard). Case studies were conducted on seven open source programs (flex, grep, ant, make, space, sed, and gzip), comparing WAS against four existing CTS-based strategies (one per cluster, n per cluster, adaptive sampling, and ESBS). Our data indicate that WAS generally attains better precision and recall than the other strategies: by examining only a small number of test cases, not only are more of the selected executions failed ones, but the majority of all failed executions are selected. Furthermore, the additional clustering iterations required by WAS introduce very little overhead, making WAS as efficient as the other CTS-based strategies.

These findings suggest that WAS provides significant advantages when no test oracle exists for automated output verification. Rather than spending excessive resources manually examining the result of every test execution, we only need to verify the executions of a small number of test cases selected by WAS. Since the test cases that are not selected are unlikely to fail, we can skip their verification with good confidence.

In future studies, we would like to further explore how additional software fault localization techniques (Wong et al., 2012b, 2014; Wong and Qi, 2009) and other clustering techniques affect the effectiveness of WAS. We will also extend our case studies to larger real-life programs.

Acknowledgements

This research is supported, in part, by the National Basic Research Program of China (973 Program 2014CB340702), the National Natural Science Foundation of China (Grant Nos. 61170067 and 61373013), and the US National Science Foundation (DUE-1023071).

Appendix

Precision and recall using WAS, ESBS, adaptive sampling, one per cluster, and n per cluster on flex, make, and space are presented in Fig. 5. Precision and recall using WAS with different fault localization techniques (Crosstab, Tarantula, Ochiai, H3b, H3c, and Jaccard) on gzip, grep, ant, sed, flex, make, and space are presented in Fig. 6.

Fig. 5. Precision and recall using different CTS-based strategies on flex, make, and space.

Fig. 6. Precision and recall using WAS with different fault localization techniques.

References

Abreu, R., Zoeteweij, P., Golsteijn, R., van Gemund, A.J.C., 2009. A practical evaluation of spectrum-based fault localization. J. Syst. Softw. 82 (11), 1780–1792.
Anderberg, M.R., 1973. Cluster Analysis for Applications. Academic Press, New York.
Andrews, J.H., Briand, L.C., Labiche, Y., 2005. Is mutation an appropriate tool for testing experiments? In: Proceedings of the 27th International Conference on Software Engineering, St. Louis, Missouri, USA, May, pp. 402–411.
Ball, T., 1998. On the limit of control flow analysis for regression test selection. In: Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, Clearwater Beach, USA, March, pp. 134–142.
Bates, S., Horwitz, S., 1993. Incremental program testing using program dependence graphs. In: Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 384–396.
Binkley, D., 1997. Semantics guided regression test cost reduction. IEEE Trans. Softw. Eng. 23 (August (8)), 498–516.
Biswas, S., Mall, R., Satpathy, M., Sukumaran, S., 2011. Regression test selection techniques: a survey. Inform.: Int. J. Comput. Inform. 35 (September), 289–321.
Brock, G., Pihur, V., Datta, S., 2008. clValid, an R package for cluster validation. J. Stat. Softw. 25 (4), 1–22.
Chen, S., Chen, Z., Zhao, Z., Xu, B., Feng, Y., 2011a. Using semi-supervised clustering to improve regression test selection techniques. In: Proceedings of the Fourth IEEE International Conference on Software Testing, Verification and Validation, Berlin, Germany, March, pp. 1–10.
Chen, T.Y., Kuo, F., Merkel, R.G., Tse, T.H., 2010. Adaptive random testing: the ART of test case diversity. J. Syst. Softw. 83 (January (1)), 60–66.
Chen, Z., Duan, Y., Zhao, Z., Xu, B., Qian, J., 2011b. Using program slicing to improve the efficiency and effectiveness of cluster test selection. Int. J. Softw. Eng. Knowl. Eng. 21 (September (6)), 759–777.
Dickinson, W., Leon, D., Podgurski, A., 2001a. Finding failures by cluster analysis of execution profiles. In: Proceedings of the 23rd International Conference on Software Engineering, Toronto, Canada, May, pp. 339–348.
Dickinson, W., Leon, D., Podgurski, A., 2001b. Pursuing failure: the distribution of program failures in a profile space. In: Proceedings of the 8th European Software Engineering Conference held jointly with the 9th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Yokohama, Japan, April, pp. 246–255.
DiGiuseppe, N., Jones, J.A., 2012. Software behavior and failure clustering: an empirical study of fault causality. In: Proceedings of the Fifth International Conference on Software Testing, Verification and Validation, Montreal, Canada, April, pp. 191–200.
Do, H., Rothermel, G., 2006. On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Trans. Softw. Eng. 32 (September (9)), 733–752.
Fausett, L., 1994. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications.
Gonzalez-Sanchez, A., Piel, E., Abreu, R., Gross, H., van Gemund, A.J.C., 2011. Prioritizing tests for software fault diagnosis. Softw. Pract. Exp. 41 (September (10)).
Hartigan, J.A., Wong, M.A., 1979. Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. 28 (January (1)), 100–108.
Hassoun, M.H., 1995. Fundamentals of Artificial Neural Networks. MIT Press, Cambridge, Massachusetts.
Jaccard, P., 1901. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull. Soc. Vaudoise Sci. Nat. 37, 547–579.
Jiang, B., Zhang, Z., Chan, W.K., Tse, T.H., Chen, T.Y., 2012. How well do test prioritization techniques support statistical fault localization? Inf. Softw. Technol. 54 (7), 739–758.
Jones, J.A., Harrold, M.J., 2005. Empirical evaluation of the Tarantula automatic fault-localization technique. In: Proceedings of the 20th IEEE International Conference on Automated Software Engineering, Long Beach, USA, November, pp. 273–282.
Lee, C.C., Chung, P.C., Tsai, J.R., Chang, C.I., 1999. Robust radial basis function neural network. IEEE Trans. Syst. Man Cybern. B: Cybern. 29 (December (6)), 674–685.
Leon, D., Podgurski, A., White, L.J., 2000. Multivariate visualization in observation-based testing. In: Proceedings of the International Conference on Software Engineering, Castletroy, Ireland, June, pp. 116–125.
Leung, H.K., White, L., 1991. A cost model to compare regression test strategies. In: Proceedings of the Conference on Software Maintenance, pp. 201–208.
Liblit, B., Naik, M., Zheng, A.X., Aiken, A., Jordan, M.I., 2005. Scalable statistical bug isolation. In: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, IL, USA, June, pp. 15–26.
Liu, C., Fei, L., Yan, X., Han, J., Midkiff, S.P., 2006. Statistical debugging: a hypothesis testing-based approach. IEEE Trans. Softw. Eng. 32 (October (10)), 831–848.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297.
Mardia, K.V., Kent, J.T., Bibby, J.M., 1979. Multivariate Analysis. Academic Press, Waltham, Massachusetts, USA.
Namin, A.S., Andrews, J.H., Labiche, Y., 2006. Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Trans. Softw. Eng. 32 (August (8)), 608–624.
Ochiai, A., 1957. Zoogeographic studies on the soleoid fishes found in Japan and its neighboring regions. Bull. Jpn. Soc. Sci. Fish. 22, 526–530.
Offutt, A.J., Lee, A., Rothermel, G., Untch, R.H., Zapf, C., 1996. An experimental determination of sufficient mutant operators. ACM Trans. Softw. Eng. Methodol. 5 (April (2)), 99–118.
Parnin, C., Orso, A., 2011. Are automated debugging techniques actually helping programmers? In: Proceedings of the 2011 International Symposium on Software Testing and Analysis, Toronto, Canada, July, pp. 199–209.
Podgurski, A., Masri, W., McCleese, Y., Wolff, F.G., Yang, C., 1999. Estimation of software reliability by stratified sampling. ACM Trans. Softw. Eng. Methodol. 8, 263–283.
Podgurski, A., Yang, C., 1993. Partition testing, stratified sampling, and cluster analysis. In: Proceedings of the ACM SIGSOFT Symposium on Foundations of Software Engineering, pp. 169–181.
Rothermel, G., Harrold, M.J., 1994. Selecting tests and identifying test coverage requirements for modified software. In: Proceedings of the 1994 ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 169–184.
Samuel, P., Mall, R., Bothra, A.K., 2008. Automatic test case generation using unified modeling language (UML) state diagrams. IET Softw. 2 (April (2)).
The Software-artifact Infrastructure Repository, http://sir.unl.edu/portal/index.html (retrieved October 2008).
Tsai, W.T., Volovik, D., Keefe, T.F., 1990. Automated test case generation for programs specified by relational algebra queries. IEEE Trans. Softw. Eng. 16 (March (3)).
Van der Meulen, M.J.P., Bishop, P., Revilla, M., 2004. An exploration of software faults and failure behaviour in a large population of programs. In: Proceedings of the 15th International Symposium on Software Reliability Engineering, Saint Malo, Bretagne, France, November, pp. 101–112.
Wang, Y., Chen, Z., Feng, Y., Luo, B., Yang, Y., 2012. Using weighted attributes to improve cluster test selection. In: Proceedings of the 5th International Conference on Software Security and Reliability, Washington, DC, June, pp. 138–146.
Wasserman, P.D., 1993. Advanced Methods in Neural Computing. Van Nostrand Reinhold, New York.
Witten, I.H., Frank, E., 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, Burlington, Massachusetts, USA.
Wong, W.E., Mathur, A.P., 1995. Reducing the cost of mutation testing: an empirical study. J. Syst. Softw. 31 (December (3)), 185–196.
Wong, W.E., Debroy, V., Choi, B., 2010. A family of code coverage-based heuristics for effective fault localization. J. Syst. Softw. 83 (February (2)), 188–208.
Wong, W.E., Debroy, V., Xu, D., 2012a. Towards better fault localization: a crosstab-based statistical approach. IEEE Trans. Syst. Man Cybern. C 42 (May (3)), 378–396.
Wong, W.E., Debroy, V., Golden, R., Xu, X., Thuraisingham, B., 2012b. Effective software fault localization using an RBF neural network. IEEE Trans. Reliab. 61 (March (1)), 149–169.
Wong, W.E., Debroy, V., Gao, R., Li, Y., 2014. The DStar method for effective software fault localization. IEEE Trans. Reliab. 63 (March (1)), 290–308.
Wong, W.E., Qi, Y., 2009. BP neural network-based effective fault localization. Int. J. Softw. Eng. Knowl. Eng. 19 (4), 573–597.
χSuds User's Manual, 1998. Telcordia Technologies.
Yan, S., Chen, Z., Zhao, Z., Zhang, C., Zhou, Y., 2010. A dynamic test cluster sampling strategy by leveraging execution spectra information. In: Proceedings of the Third International Conference on Software Testing, Verification, and Validation, Paris, France, April, pp. 147–154.
Yoo, S., Harman, M., Clark, D., 2013. Fault localization prioritization: comparing information-theoretic and coverage-based approaches. ACM Trans. Softw. Eng. Methodol. 22 (July (3)).
Yu, Y., Jones, J.A., Harrold, M.J., 2008. An empirical study of the effects of test-suite reduction on fault localization. In: Proceedings of the 30th International Conference on Software Engineering, Leipzig, Germany, May, pp. 201–210.
Zhang, Z., Chan, W.K., Tse, T.H., Yu, Y., Hu, P., 2011. Non-parametric statistical fault localization. J. Syst. Softw. 84 (June (6)), 885–905.

Yabin Wang is a Ph.D. student at the Software Institute, Nanjing University, under the supervision of Prof. Zhenyu Chen and Prof. Bin Luo. His research interest is software testing. In 2012, he visited the University of Texas at Dallas as a visiting student under the supervision of Prof. Eric Wong.

Ruizhi Gao received his B.S. in Software Engineering from Nanjing University. He is a Ph.D. candidate in the Department of Computer Science at the University of Texas at Dallas. His current research interests include software testing, fault localization, and program debugging.

Zhenyu Chen is an Associate Professor at the Software Institute, Nanjing University. He received his B.S. and Ph.D. in Mathematics from Nanjing University. He worked as a Postdoctoral Researcher at the School of Computer Science and Engineering, Southeast University, China. His research interests focus on software analysis and testing. He has about 70 publications at major venues including TOSEM, JSS, SQJ, IJSEKE, ISSTA, ICST, and QSIC. He has served as PC co-chair of QSIC 2013, AST 2013, and IWPD 2012, and as a program committee member of many international conferences. He has won research funding from several competitive sources such as NSFC.

W. Eric Wong received his Ph.D. in Computer Science from Purdue University. He is currently a Professor and the Director of International Outreach in Computer Science at the University of Texas at Dallas. He is also the Founding Director of the newly established Advanced Research Center on Software Testing and Quality Assurance. In addition, Professor Wong has an appointment as a Guest Researcher at NIST (National Institute of Standards and Technology), an agency of the U.S. Department of Commerce. Prior to joining UTD, he was with Telcordia (formerly Bellcore) as a Project Manager for Dependable Telecom Software Development. Dr. Wong received the Quality Assurance Special Achievement Award from Johnson Space Center, NASA, in 1997. His research focus is on technology to help practitioners develop high-quality software at low cost. In particular, he does research in software testing, debugging, metrics, safety, and reliability. Dr. Wong is also the Secretary of the IEEE Reliability Society and the Founding Steering Committee Chair of the IEEE International Conference on Software Security and Reliability (SERE). He is on the editorial boards of IEEE Transactions on Reliability and the Journal of Systems and Software.

Bin Luo is a Professor at the Software Institute, Nanjing University. His main research interests are applied software engineering and software engineering education. He leads the institute of applied software engineering at Nanjing University.